Do I really need to go through this again? Argh!
As long-time readers will remember, a while back, I had an encounter with
a crackpot named Jules Gilbert who claimed to be able to cyclically compress
files. That is, he could take a particular input file, compress it, do
something to it, and then compress it again, making it smaller. He claimed to
be able to do this for any input.
And to make it more annoying, his claim to be able to do this brilliant
magical cyclical compression was attached to an obnoxious christian “I want to
save you soul” rant, where he effectively offered to share the profits from
this brilliant invention with me, if I’d only give him the chance to tell me
I tried to get rid of the schmuck. I asked him to leave me alone politely.
Then I asked him to leave me alone not-so politely. Then I asked him to
leave me alone very, very rudely. Then I publicly mocked him on this blog.
After the last, he finally seemed to take the hint and go away.
Unfortunately, good things often don’t last. And yesterday, I received
another new message from him, bragging about the results of his uber-magical
compression system. He has now perfected it to the point where he can take
any input file, and compress it down to about 50K:
I am now compressing single bit as:
I use C code to pack up one bit to a byte and call bzip2’s compress
I print the result and compute the size.
(I did this for both gzip and bzip2.) Gzip didn’t compress; it gave
me back about 6.25, but I needed at least 8.0.
And bzip2 gave it to me. About 8.2 and higher. (This is with no
actual I/O, just buffer to buffer.)
This was with random data. At least that’s what people call it.
Bzip2 data. (It looks pretty random to me.)
I expect to show up at a friend’s house and show him the system on
Monday. THIS IS NOT A COMPLETE SYSTEM, but it’s not just research
either. The input was randomly chosen (not by me, by a C function,)
and the program has been arranged to simply open another file when a
current file is exhausted.
Every input file has been compressed once with bzip2. Each input file
is at least 5k bytes and was chosen with the intent to represent all
popular OS’s somewhat equally.
This program is easily able to process MN’s 415k file. And since it
uses 256 byte blocks, I can bring it down to about 50k (number
demonstrated in testing.) Another way to say this is that (with this
program,) all files greater than about 50k can be made that small.
This is the simplist system I have ever made that uses my “modern”
techniques (I mean, not my LSQ ideas of a decade ago.) i started
coding it this morning, to see if a certain idea worked…
I’m trying to drop bzip but it’s pretty good! I’m also trying to
modulate my input to it, in order to obtain a gain from the pattern
recognition (the BWT may not be an actual FFT but it’s a similar
idea.) If I can do this I will keep it.
If all this is greek to you, here’s the thing: Just about every other
computer scientist in the whole wide world believes this is
impossible. Why is this valuable? Because it means that all files
and messages can be compressed in while in transit to, say, 50k —
probably much less. (Or while sitting on a CD-ROM or DVD.)
I love that line in there: “Just about every other computer scientist in
the whole wide world believes this is impossible”.
Jules, baby, belief has nothing to do with it.
Let’s walk through this one more time. What is a compression system?
Fundamentally, it’s a function with three properties:
- It’s computably invertible. That is, if is an input, and is the compressed output, there must be a computable function .
- It’s one-to-one. It must be one-to-one – because if it weren’t it wouldn’t be invertible. If there were two values, and where , then wouldn’t be able to decompress – because it wouldn’t know which value to return for .
- For some set of interesting -values, the size of is smaller than the size of .
In an ideal compression function, you’d have the property that f is total, and .
Ideal compression functions are impossible. And it’s very easy to prove. Suppose you had a compression function which could remove one bit – just one bit – from any input file. Consider the set of file with N bits. You’d like to be able to compress those to N-1 bits. You can’t do it. Why? Because there are possible values with N bits; and there are possible values with N-1 bits. You can’t have a total one-to-one function from a set with values to a set with values. Can’t do it. Impossible. Absolutely, totally, profoundly impossible. End of discussion.
You might respond: but I’ve got gzip and bzip2 on my computer, and they clearly work! Well, that’s true. They work. For some inputs. But not for all. gzip and bzip are based on compression algorithms that exploit the structure of their typical inputs. Text files typically have a huge amount of repetition in them – and the LZW family of algorithms that common compressors are built on do a very good job of exploiting that. But they don’t work on everything. Just for one example, I tried compressing an MP3 on my disk. Before compression, its length was 7937898 bytes, After compression by bzip, it’s length was 7972829. “Compressing” the file actually added nearly 35,000 bytes. Our compression functions work pretty well on the types of data for which they’re designed, which contain lots of redundancy. But a true, general, ideal compression system? It’s impossible.
The usual crackpot response to this is something along the lines of “You just believe it’s impossible. Scientists have been wrong before.”
The thing is, there are proofs, and then there are proofs. Every proof is dependent on its basic premises. If those premises turn out to be false, then the proof is also false. This has happened lot of times before. For example, back when I was in high school, there was an absolute proof, built using Shannon’s information theory, that you couldn’t transmit data with more than 600 modulations per second over a telephone line. That meant that a modem couldn’t transmit data any faster than 600 bits per second. Not just the kind of modem they built back then, but any modem – because it was a fundamental property of the bandwidth of the telephone line.
It turned out to be wrong. The proof itself didn’t have any problems – but the model it used for the bandwidth capacity of a telephone line was incorrect. So by the time I started college, 1200bps modems were common; by the time I got to grad school, 28,800bps modems were common; before broadband started hitting the scene, 56,600bps modems were common. That original proof of the bandwidth capacity of a phone line was off by a factor of 100.
But computing the bandwidth capacity of a telephone line is remarkably tricky. There are a ton of places where incorrect assumptions about the line can enter the proof – and so, while the proof itself is fine, it’s not a proof about real telephone lines.
That’s the kind of problem that you encounter in mathematical proofs about the real world. If the proof is actually valid, its validity in inescapable. But truth and validity are two different things. Truth is, in a vague sense, whether or not the validity of the proof implies anything about some real-world phenomenon that the proof is applied to. Validity is just internal consistency of the proof in its logical system. So a proof can be valid, without a particular application of that proof being true. To determine whether or not a proof is true, you need to check whether or not its axioms actually map correctly onto a real-world application.
In the case of this kind of compression nonsense, the mapping between the proof and the real world is incredibly direct. The compressor, in order to work, must be a one-to-one function. There is no way that it can decompress if it isn’t.
Mr. Gilbert claims that he can take any input file, and compress it down to 50K. That is, quite simply impossible.
There are 2400,000 possible 50K long files. That seems like an astonishingly large number. It’s larger than the number of atoms in the observable universe. It’s a number that’s really beyond the ability of our imagination to comprehend. But… I’ve got a rather large collection of music on my hard-drive. A bit over 30GB worth. The average file there is around 5 megabytes. That’s around 5,000,000 bytes, or 40,000,000 bits. There are 240,000,000 possible files that long. And Jules is claiming that he can take any of those files, and compress them to 50K. It’s an unavoidable fact that even if we restrict all files to be exactly 5,000,000 bytes long, then his system must be mapping an average of files onto each possible 50K output file. Therefore, his system isn’t one to one – therefore, it isn’t a compression system – i.e., it doesn’t work.
The number don’t lie. The proof has no holes. There’s no way to do what he claims to be doing. And I’ve explained this before – both in email to him, and in posts on this blog. But some people are so convinced of their own perfect brilliance that they think reality doesn’t apply to them. Sorry Jules, but even with the help of Jesus himself, your stuff can’t work.