{"id":933,"date":"2010-08-03T11:48:17","date_gmt":"2010-08-03T15:48:17","guid":{"rendered":"http:\/\/scientopia.org\/blogs\/goodmath\/?p=933"},"modified":"2010-08-03T11:48:17","modified_gmt":"2010-08-03T15:48:17","slug":"revenge-of-the-return-of-the-compression-idiot","status":"publish","type":"post","link":"http:\/\/www.goodmath.org\/blog\/2010\/08\/03\/revenge-of-the-return-of-the-compression-idiot\/","title":{"rendered":"Revenge of the Return of the Compression Idiot"},"content":{"rendered":"<p> Do I <em>really<\/em> need to go through this again? Argh!<\/p>\n<p> As long-time readers will remember, a while back, I had an encounter with<br \/>\na crackpot named Jules Gilbert who claimed to be able to cyclically compress<br \/>\nfiles. That is, he could take a particular input file, compress it, do<br \/>\nsomething to it, and then compress it again, making it smaller. He claimed to<br \/>\nbe able to do this for <em>any<\/em> input. <\/p>\n<p> And to make it more annoying, his claim to be able to do this brilliant<br \/>\nmagical cyclical compression was attached to an obnoxious christian &#8220;I want to<br \/>\nsave you soul&#8221; rant, where he effectively offered to share the profits from<br \/>\nthis brilliant invention with me, if I&#8217;d only give him the chance to tell me<br \/>\nabout Jesus.<\/p>\n<p> I tried to get rid of the schmuck. I asked him to leave me alone politely.<br \/>\nThen I asked him to leave me alone not-so politely. Then I asked him to<br \/>\nleave me alone very, very rudely. Then I publicly mocked him on this blog.<br \/>\nAfter the last, he finally seemed to take the hint and go away.<\/p>\n<p> Unfortunately, good things often don&#8217;t last. And yesterday, I received<br \/>\nanother new message from him, bragging about the results of his uber-magical<br \/>\ncompression system. 
He has now perfected it to the point where he can take<br \/>\n<em>any<\/em>  input file, and compress it down to about 50K:<\/p>\n<blockquote>\n<p>I am now compressing single bit as:<\/p>\n<p> I use C code to pack up one bit to a byte and call bzip2&#8217;s compress<br \/>\nbuffer.<\/p>\n<p>I print the result and compute the size.<\/p>\n<p> (I did this for both gzip and bzip2.)  Gzip didn&#8217;t compress;  it gave<br \/>\nme back about 6.25, but I needed at least 8.0.<\/p>\n<p> And bzip2 gave it to me.  About 8.2 and higher.  (This is with no<br \/>\nactual I\/O, just buffer to buffer.)<\/p>\n<p> This was with random data.  At least that&#8217;s what people call it.<br \/>\nBzip2 data.  (It looks pretty random to me.)<\/p>\n<p> I expect to show up at a friend&#8217;s house and show him the system on<br \/>\nMonday.  THIS IS NOT A COMPLETE SYSTEM, but it&#8217;s not just research<br \/>\neither.  The input was randomly chosen (not by me, by a C function,)<br \/>\nand the program has been arranged to simply open another file when a<br \/>\ncurrent file is exhausted.<\/p>\n<p> Every input file has been compressed once with bzip2.  Each input file<br \/>\nis at least 5k bytes and was chosen with the intent to represent all<br \/>\npopular OS&#8217;s somewhat equally.<\/p>\n<p> This program is easily able to process MN&#8217;s 415k file.  And since it<br \/>\nuses 256 byte blocks, I can bring it down to about 50k (number<br \/>\ndemonstrated in testing.)  Another way to say this is that (with this<br \/>\nprogram,) all files greater than about 50k can be made that small.<\/p>\n<p> This is the simplist system I have ever made that uses my &#8220;modern&#8221;<\/p>\n<p>techniques (I mean, not my LSQ ideas of a decade ago.)  i started<br \/>\ncoding it this morning, to see if a certain idea worked&#8230;<\/p>\n<p> I&#8217;m trying to drop bzip but it&#8217;s pretty good!  
I&#8217;m also trying to<br \/>\nmodulate my input to it, in order to obtain a gain from the pattern<br \/>\nrecognition (the BWT may not be an actual FFT but it&#8217;s a similar<br \/>\nidea.)  If I can do this I will keep it.<\/p>\n<p> If all this is greek to you, here&#8217;s the thing:  Just about every other<br \/>\ncomputer scientist in the whole wide world believes this is<br \/>\nimpossible.  Why is this valuable?  Because it means that all files<br \/>\nand messages can be compressed in while in transit to, say, 50k &#8212;<br \/>\nprobably much less.  (Or while sitting on a CD-ROM or DVD.)<\/p>\n<\/blockquote>\n<p> I love that line in there: &#8220;Just about every other computer scientist in<br \/>\nthe whole wide world believes this is impossible&#8221;.<\/p>\n<p> Jules, baby, belief has <em>nothing<\/em> to do with it.<\/p>\n<p><!--more--><\/p>\n<p> Let&#8217;s walk through this one more time. What is a compression system?<br \/>\nFundamentally, it&#8217;s a function with three properties:<\/p>\n<ol>\n<li> It&#8217;s computably invertible. That is, if <img src='http:\/\/l.wordpress.com\/latex.php?latex=x&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='x' style='vertical-align:1%' class='tex' alt='x' \/> is an input, and <img src='http:\/\/l.wordpress.com\/latex.php?latex=f%28x%29&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='f(x)' style='vertical-align:1%' class='tex' alt='f(x)' \/> is the compressed output, there must be a computable function <img src='http:\/\/l.wordpress.com\/latex.php?latex=f%5E%7B-1%7D%20%3A%20%28forall%20x%29%20f%5E%7B-1%7D%28f%28x%29%29%20%3D%20x&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='f^{-1} : (forall x) f^{-1}(f(x)) = x' style='vertical-align:1%' class='tex' alt='f^{-1} : (forall x) f^{-1}(f(x)) = x' \/>.<\/li>\n<li> It&#8217;s one-to-one. It <em>must<\/em> be one-to-one &#8211; because if it weren&#8217;t it wouldn&#8217;t be invertible. 
If there were two values, <img src='http:\/\/l.wordpress.com\/latex.php?latex=x&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='x' style='vertical-align:1%' class='tex' alt='x' \/> and <img src='http:\/\/l.wordpress.com\/latex.php?latex=y&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='y' style='vertical-align:1%' class='tex' alt='y' \/> where <img src='http:\/\/l.wordpress.com\/latex.php?latex=x%20neq%20y%20land%20f%28x%29%20%3D%20f%28y%29&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='x neq y land f(x) = f(y)' style='vertical-align:1%' class='tex' alt='x neq y land f(x) = f(y)' \/>, then <img src='http:\/\/l.wordpress.com\/latex.php?latex=f%5E%7B-1%7D&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='f^{-1}' style='vertical-align:1%' class='tex' alt='f^{-1}' \/> wouldn&#8217;t be able to decompress &#8211; because it wouldn&#8217;t know which value to return for <img src='http:\/\/l.wordpress.com\/latex.php?latex=f%5E%7B-1%7D%28f%28x%29%29&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='f^{-1}(f(x))' style='vertical-align:1%' class='tex' alt='f^{-1}(f(x))' \/>.<\/li>\n<li> For some set of interesting <img src='http:\/\/l.wordpress.com\/latex.php?latex=x&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='x' style='vertical-align:1%' class='tex' alt='x' \/>-values, the size of <img src='http:\/\/l.wordpress.com\/latex.php?latex=f%28x%29&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='f(x)' style='vertical-align:1%' class='tex' alt='f(x)' \/> is smaller than the size of <img src='http:\/\/l.wordpress.com\/latex.php?latex=x&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='x' style='vertical-align:1%' class='tex' alt='x' \/>.<\/li>\n<\/ol>\n<p> In an <em>ideal<\/em> compression function, you&#8217;d have the property that f is total, and <img src='http:\/\/l.wordpress.com\/latex.php?latex=%28forall%20x%29%20len%28f%28x%29%29%20%26lt%3B%20len%28x%29&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='(forall x) len(f(x)) &lt; len(x)' style='vertical-align:1%' class='tex' alt='(forall x) 
len(f(x)) &lt; len(x)' \/>.<\/p>\n<p> Ideal compression functions are <em>impossible<\/em>. And it&#8217;s very easy to prove. Suppose you had a compression function which could remove one bit &#8211; just one bit &#8211; from any input file. Consider the set of files with N bits. You&#8217;d like to be able to compress those to N-1 bits. You <em>can&#8217;t<\/em> do it. Why? Because there are <img src='http:\/\/l.wordpress.com\/latex.php?latex=2%5E%7BN%7D&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='2^{N}' style='vertical-align:1%' class='tex' alt='2^{N}' \/> possible values with N bits; and there are <img src='http:\/\/l.wordpress.com\/latex.php?latex=2%5E%7BN-1%7D&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='2^{N-1}' style='vertical-align:1%' class='tex' alt='2^{N-1}' \/> possible values with N-1 bits. You can&#8217;t have a total one-to-one function from a set with <img src='http:\/\/l.wordpress.com\/latex.php?latex=2%5EN&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='2^N' style='vertical-align:1%' class='tex' alt='2^N' \/> values to a set with <img src='http:\/\/l.wordpress.com\/latex.php?latex=2%5E%7BN-1%7D&#038;bg=FFFFFF&#038;fg=000000&#038;s=0' title='2^{N-1}' style='vertical-align:1%' class='tex' alt='2^{N-1}' \/> values. Can&#8217;t do it. Impossible. Absolutely, totally, profoundly impossible. End of discussion. <\/p>\n<p> You might respond: but I&#8217;ve got gzip and bzip2 on my computer, and they clearly work! Well, that&#8217;s true. They work. For <em>some<\/em> inputs. But not for all. gzip and bzip2 are based on compression algorithms that exploit the structure of their typical inputs. Text files typically have a <em>huge<\/em> amount of repetition in them &#8211; and the Lempel-Ziv family of algorithms that common compressors are built on do a very good job of exploiting that. But they don&#8217;t work on everything. Just for one example, I tried compressing an MP3 on my disk. 
Before compression, its length was 7,937,898 bytes. After compression by bzip2, its length was 7,972,829 bytes. &#8220;Compressing&#8221; the file actually <em>added<\/em> nearly 35,000 bytes. Our compression functions work pretty well on the types of data for which they&#8217;re designed, which contain lots of redundancy. But a true, general, ideal compression system? It&#8217;s impossible. <\/p>\n<p> The usual crackpot response to this is something along the lines of &#8220;You just <em>believe<\/em> it&#8217;s impossible. Scientists have been wrong before.&#8221;<\/p>\n<p> The thing is, there are proofs, and then there are proofs. Every proof is dependent on its basic premises. If those premises turn out to be false, then the proof is also false. This has happened a lot of times before. For example, back when I was in high school, there was an absolute proof, built using Shannon&#8217;s information theory, that you couldn&#8217;t transmit data with more than 600 modulations per second over a telephone line. That meant that a modem couldn&#8217;t transmit data any faster than 600 bits per second. Not just the kind of modem they built back then, but <em>any<\/em> modem &#8211; because it was a fundamental property of the bandwidth of the telephone line.<\/p>\n<p> It turned out to be wrong. The proof itself didn&#8217;t have any problems &#8211; but the model it used for the bandwidth capacity of a telephone line was incorrect. So by the time I started college, 1200bps modems were common; by the time I got to grad school, 28,800bps modems were common; before broadband started hitting the scene, 56,000bps modems were common. That original proof of the bandwidth capacity of a phone line was off by nearly a factor of 100. <\/p>\n<p> But computing the bandwidth capacity of a telephone line is remarkably tricky. 
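The formula behind that kind of bandwidth claim is the Shannon-Hartley capacity, C = B log2(1 + S/N), and every number you feed into it is a modeling assumption about the line. Here is a quick Python sketch using purely illustrative values (the 3100Hz bandwidth and 30dB SNR below are textbook-style assumptions, not measurements of any real line):

```python
import math

# Illustrative, assumed numbers: not measurements of a real phone line.
bandwidth_hz = 3100          # assumed usable analog voice bandwidth
snr_db = 30                  # assumed signal-to-noise ratio, in decibels

snr = 10 ** (snr_db / 10)    # convert dB to a plain power ratio
# Shannon-Hartley channel capacity in bits per second.
capacity = bandwidth_hz * math.log2(1 + snr)
print(f"{capacity:.0f} bits/second")   # roughly 30,900 bps for these inputs
```

Nudge the assumed bandwidth or SNR and the computed "limit" moves with it, which is exactly how a perfectly valid proof can end up saying something false about real modems.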
There are a ton of places where incorrect assumptions about the line can enter the proof &#8211; and so, while the proof itself is fine, it&#8217;s not a proof about real telephone lines.<\/p>\n<p> That&#8217;s the kind of problem that you encounter in mathematical proofs about the real world. If the proof is actually valid, its validity is inescapable. But truth and validity are two different things. Truth is, roughly speaking, whether the validity of the proof implies anything about the real-world phenomenon that the proof is applied to. Validity is just the internal consistency of the proof in its logical system. So a proof can be valid, without a particular application of that proof being true. To determine whether or not a proof is true, you need to check whether or not its axioms actually map correctly onto the real-world application.<\/p>\n<p> In the case of this kind of compression nonsense, the mapping between the proof and the real world is incredibly direct. The compressor, in order to work, <em>must<\/em> be a one-to-one function. There is no way that it can decompress if it isn&#8217;t. <\/p>\n<p> Mr. Gilbert claims that he can take <em>any<\/em> input file, and compress it down to 50K. That is, quite simply, impossible.<\/p>\n<p> There are 2<sup>400,000<\/sup> possible 50K long files. That seems like an astonishingly large number. It&#8217;s larger than the number of atoms in the observable universe. It&#8217;s a number that&#8217;s really beyond the ability of our imagination to comprehend. But&#8230; I&#8217;ve got a rather large collection of music on my hard-drive. A bit over 30GB worth. The average file there is around 5 megabytes. That&#8217;s around 5,000,000 bytes, or 40,000,000 bits. There are 2<sup>40,000,000<\/sup> possible files that long. And Jules is claiming that he can take <em>any<\/em> of those files, and compress them to 50K. 
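Before scaling up to those numbers, the counting argument is easy to verify exhaustively at toy sizes. A minimal Python sketch (not anyone's actual system, since no such system can exist): any scheme that maps every 8-bit input to some strictly shorter output has more inputs than possible outputs, so two inputs must collide, and decompression becomes impossible.

```python
from itertools import product

# Every distinct 8-bit input string.
inputs = [''.join(bits) for bits in product('01', repeat=8)]   # 2^8 = 256 inputs

# Every possible output of length 0 through 7 bits.
short_outputs = sum(2 ** k for k in range(8))                  # 2^0 + ... + 2^7 = 255

# Pigeonhole: 256 inputs cannot map one-to-one into 255 outputs,
# so some "compressed" output corresponds to two different inputs.
assert len(inputs) > short_outputs
print(len(inputs), short_outputs)
```

At the scale claimed here the gap is just astronomically wider: 2^40,000,000 possible inputs against fewer than 2^400,001 outputs of 50K or less.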
It&#8217;s an unavoidable fact that even if we restrict all files to be <em>exactly<\/em> 5,000,000 bytes long, then his system <em>must<\/em> be mapping an average of 2<sup>39,600,000<\/sup> files onto each possible 50K output file. Therefore, his system isn&#8217;t one-to-one &#8211; therefore, it isn&#8217;t a compression system &#8211; i.e., it <em>doesn&#8217;t work<\/em>. <\/p>\n<p> The numbers don&#8217;t lie. The proof has no holes. There&#8217;s no way to do what he claims to be doing. And I&#8217;ve explained this before &#8211; both in email to him, and in posts on this blog. But some people are so convinced of their own perfect brilliance that they think reality doesn&#8217;t apply to them. Sorry Jules, but even with the help of Jesus himself, your stuff <em>can&#8217;t<\/em> work. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Do I really need to go through this again? Argh! As long-time readers will remember, a while back, I had an encounter with a crackpot named Jules Gilbert who claimed to be able to cyclically compress files. 
That is, he could take a particular input file, compress it, do something to it, and then compress [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[7],"tags":[],"class_list":["post-933","post","type-post","status-publish","format-standard","hentry","category-bad-software"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p4lzZS-f3","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/933","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/comments?post=933"}],"version-history":[{"count":0,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/posts\/933\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/media?parent=933"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.goodmath.org\/blog\/wp-json\/wp\/v2\/categories?post=933"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/w
ww.goodmath.org\/blog\/wp-json\/wp\/v2\/tags?post=933"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}