The Return of the Compression Idiot

Remember a while back, I wrote about a crackpot who pestered me both about
converting to Christianity, and his wonderful, miraculous compression system? He
claimed to be able to repeatedly compress any file, making it smaller each time.

Well, he’s back pestering me again. Repeatedly asking him to leave me alone,
shouting at him, etc., hasn’t worked. (His current claim is that he doesn’t know how
to delete me from his gmail contacts list.) So I’m resorting to another round of
public humiliation, which I hope will be informative and entertaining as well.

To remind you of the relevant background about compression: what compression does
is reduce the size of a string by eliminating redundancy in that string. A trivial example
of that is that most text strings use a very limited range of characters – a string
made up of pure ASCII text will typically need less that 7 bits per character – but the
common representation uses a minimum of 8 bits per character. The high bit is always zero. So
in a string like that, without considering anything else, 1/8th of the size of the document
is completely redundant.

Looked at in terms of information theory, for a given system, any string contains a particular
amount of information. That quantity of information is typically much smaller than any
common representation – the representation is full of padding and redundancy. So what compression
does is reduce the size of an input string to something closer to the minimum length required by
the information it contains.

But most strings are not compressible at all – which is really
easy to prove. A compression function must be invertible – that is, a given
function f can only be a compression function if there is a function f-1, such
that f-1(f(n)) = n.

Think of the strings of N bits. Suppose we want to compress them to one-half
of their original length. How many of those strings can be compressed that small?

There are 2n/2 strings of length n/2; and there are 2n uncompressed
strings. So, at best, 1/2n/2 strings are compressible to
a string of length n/2 – any more, and the compression function isn’t invertible – you can’t
uncompress!

Even a trivial compressor – one that removes exactly one bit from the input string – can’t compress half of its inputs. If it could, it wouldn’t be invertible. And
even that analysis has a problem: it assumes that all inputs have length N. If you were looking
at strings whose length is less than or equal to n, then it gets even worse.

The problem is very simple to see – the number of compressed strings is smaller than the number of input strings – but to uncompress, you need to know exactly which input string produced a compressed string.

My persistently dimwitted correspondent is apparently incapable of understanding this. In
his latest missive, he sent me a list of documents, which are the result of repeatedly
running bzip2 on an input document. In between runs of bzip2, he runs his own magic program,
which takes the uncompressible output from bzip2, and renders it compressible. According to him,
using his system, he can compress any input document down to about 12K. As a
“proof” of this, he provides a script that runs his program, and shows the file sizes. This
is pretty long, but it’s amusing. What he’s doing is iteratively running bzip2, then running his
magic program, and then running bzip2 again.

7160799 -rw-r--r--  1 user  wheel  415241 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  415241 Apr 14 10:24 in
7160797 -rw-r--r--  1 user  wheel  375688 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  340370 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  307278 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  277451 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  251925 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  228574 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  208202 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  189791 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  172300 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  155808 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  140951 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  128431 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  116992 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  107075 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  97124 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  89283 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  82238 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  74710 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  68219 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  63097 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  58134 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  53304 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  48433 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  45429 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  41524 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  38161 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  35385 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  33316 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  30773 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  29004 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  26714 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  25371 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  23605 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  22728 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  21723 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  20478 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  20092 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  19682 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  19247 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  18686 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  18102 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  17418 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  16543 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  15252 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  15123 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  14958 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  14779 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  14558 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  14313 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  14044 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  13705 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  13283 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12838 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12242 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12469 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11729 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11882 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12086 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12295 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11394 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11543 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11729 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11910 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12105 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12334 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11441 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11565 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11772 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12041 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12235 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12442 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11580 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11754 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11918 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12120 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12369 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11564 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11718 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11883 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12047 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12250 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12481 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11672 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11879 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12093 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12300 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11361 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11510 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11710 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11880 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12109 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12371 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11584 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11790 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12017 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12219 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12431 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11619 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11777 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11977 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12175 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12400 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11576 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11781 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12014 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12203 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12461 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11675 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11854 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12082 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12341 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11489 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11641 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11850 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12047 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12277 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12495 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11692 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11904 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12114 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12296 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11348 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11447 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11564 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11736 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11950 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12142 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12379 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11523 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11642 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11849 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12026 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12249 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12472 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11710 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11891 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12126 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12341 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11403 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11563 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11765 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12004 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12225 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12435 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11563 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11774 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11926 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12152 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12395 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11541 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11744 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11920 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12135 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12320 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11493 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11624 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11788 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11988 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12216 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12461 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11620 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11838 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12017 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12232 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12468 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11759 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11946 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12180 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12385 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11498 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11661 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11869 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12077 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12273 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12515 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11752 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11972 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12146 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12343 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11474 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11593 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11792 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12024 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12223 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12486 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11745 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11932 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12121 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12318 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11396 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11573 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11716 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11949 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12175 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12384 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11565 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11738 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11914 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12123 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12307 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11370 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11521 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11706 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11923 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12112 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12369 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11539 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11672 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11883 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12111 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12383 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11552 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11740 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11893 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12124 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12349 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11547 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11752 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11989 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12233 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12452 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11637 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11830 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12010 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12225 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12469 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11731 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  11912 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  12120 Apr 14 10:24 in
7160799 -rw-r--r--  1 user  wheel  12355 Apr 14 10:24 in
7160798 -rw-r--r--  1 user  wheel  11500 Apr 14 10:25 in
7160799 -rw-r--r--  1 user  wheel  11664 Apr 14 10:25 in
7160798 -rw-r--r--  1 user  wheel  11876 Apr 14 10:25 in
7160799 -rw-r--r--  1 user  wheel  12066 Apr 14 10:25 in
7160798 -rw-r--r--  1 user  wheel  12297 Apr 14 10:25 in
7160799 -rw-r--r--  1 user  wheel  11397 Apr 14 10:25 in
7160798 -rw-r--r--  1 user  wheel  11558 Apr 14 10:25 in
7160799 -rw-r--r--  1 user  wheel  11750 Apr 14 10:25 in
7160798 -rw-r--r--  1 user  wheel  11935 Apr 14 10:25 in
7160799 -rw-r--r--  1 user  wheel  12151 Apr 14 10:25 in
7160798 -rw-r--r--  1 user  wheel  12367 Apr 14 10:25 in
7160799 -rw-r--r--  1 user  wheel  11495 Apr 14 10:25 in
7160798 -rw-r--r--  1 user  wheel  11676 Apr 14 10:25 in
7160799 -rw-r--r--  1 user  wheel  11879 Apr 14 10:25 in
7160798 -rw-r--r--  1 user  wheel  12113 Apr 14 10:25 in
7160799 -rw-r--r--  1 user  wheel  12341 Apr 14 10:25 in
7160798 -rw-r--r--  1 user  wheel  11458 Apr 14 10:25 in
7160799 -rw-r--r--  1 user  wheel  11687 Apr 14 10:25 in

For any arbitrary input file, he can do the same process – so that for any
possible input, his process will, eventually, converge on a file size around 12K. Sounds
absolutely amazing, doesn’t it?

But you know what? I can do better. I have here a script which you can
run on any input, and which will render it recompressible – and it works until the file
reaches around 40 bytes. And it usually does it in very few steps. For example,
I took my local copy of plan9 from userspace – source code and binaries – and
tarred it up. The resulting file contains 149,852,160 bytes. Running gzip on it
shrinks it to 52,824,151. Pretty good compression – around 60%. Interestingly,
rerunning gzip once does actually compress it a tad more – reducing it to 52,563,667 bytes. Doing it again increases the size by a small amount – repeating just cycles aronud 52,600,000 bytes.

Now, I’ll run my magic program on it. The output of my magic program is exactly
the same as the length of its input – only now it’s compressible! So I’ll run gzip on
it again. The result is 29,918,882. Wow!

Now I’ll repeat that! 17013250! Again: 9674608. Again: 5502157. And again, and again, and again, until I get down to 43 bytes.

What’s the catch? It should be obvious – it’s exactly what I said when I talked about
the proof. I can’t uncompress. I can repeatedly remove information from the
file, making it possible to compress it more – but I can’t put that information back,
because it’s not in the compressed file.

And that’s exactly what the jerk did: he can compress, and compress, and compress to his hearts content. But he can’t uncompress.

For folks who are interested, here’s my magic recompressibleizer program:

#include
int main(int argc, char** argv) {
for (int c = getchar(); c != -1; c = getchar()) {}
c = c & 170;
putchar(c);
}
}

What does it do? It changes every other bit in the input file to 0. That removes a heck
of a lot of information – now every byte of the file contains 4 meaningful bits instead
of 8. So gzip can then compress it. And then I can run my program on it again. And again. And
again. Until it reduces the file to a basic minimum space overhead needed by
gzip.

Of course, that “recompressibleizer” script is silly. But the basic idea of it – reducing
the amount of information contained in a compressed string string in an irreversible way –
is what any program that produces iterative compressibility must do. You can’t
ever get around the basic facts of compressibility: most strings aren’t compressible. They
can’t be.

The simple fact, from the proof up towards the top of this post, is that the set of
strings of length N is larger than the set of strings of length less than N. You can’t
get around that.

Any idiot can produce a script which renders uncompressible documents compressible – if they don’t have to be able to uncompress them afterwards. If you want to be able to
uncompress them, then you’ve got a problem: you can’t reinsert the information you
extracted without knowing what it is. The only way of making this scheme work
at all is by adding information to the decompression program – making it larger. As I
explained last time, you can do that, and it can even have a very nice payoff
in the size of the compressed document. For example, using a pre-populated dictionary
of common n-grams in english text can have a great effect on the compressed size of english
documents. But your compressor and decompressor need to be considerably larger to store that dictionary, and they’ll perform worse than a standard LZW compressor on non-english text inputs. And it only works once – once you’ve used that trick to strip out the english-language pattern
data, using that trick again won’t work – the patterns that it takes advantage of
are gone.

0 thoughts on “The Return of the Compression Idiot

  1. Russell

    So, Mark. Since you’re on the inside, give us the scoop on Google’s plans to speed up Python.
    😉

    Reply
  2. Deen

    Seems to me, then, is that this “Mr Wheel” needs to demonstrate both the compression and decompression of a document. If he can’t, he should apologize to you and leave you alone.

    Reply
  3. Mark C. Chu-Carroll

    Re #1:
    If I actually knew anything about it, I wouldn’t be able to tell it to you. But since I actually don’t know anything about it, I can safely say everything I know about it – which is absolutely nothing 🙂
    Seriously, one of the things that makes Google such a fun place to work is that things are very bottom up, and structured around small, fairly autonomous teams. Since I’m currently working in data analysis for logs, I’m very far removed from any team that might be working on Python.

    Reply
  4. Mark C. Chu-Carroll

    Re #2:
    Wheel isn’t his name, it’s the name of one of the standard administrative groups under unix. My guess is that he’s running MacOS, where the name of the default group for
    adminstrative users is “wheel”. (Most Unix variants have a “wheel” group, but it’s rarely a default, which I why I suppose MacOS.

    Reply
  5. Aaron

    Interesting.
    So he’s smart enough to invent some world-shattering compression algorithm, but not smart enough to remove you from his google contact list?
    Idiot savant?

    Reply
  6. Deen

    By the way, I expect that he won’t give a demonstration. To prevent cheating, he has to allow you to perform either the compression or the decompression part yourself (or both), and for you to choose the file to compress. I predict that he won’t allow that, because he’ll claim that the risk is too great that you’ll steal his code (never mind the existence of NDAs).
    Still, he would be wise to perform the experiment for himself, to prove to himself if his idea works or not.

    Reply
  7. Antonio

    Great fun! When I was eight I came up with a very simple scheme to divide any two numbers. I thought I was gonna be famous for that. The only problem, though, is that it never ever produced a single right result… I’m afraid this poor chap is somehow in the same position I was 30 years ago 🙂

    Reply
  8. Brian

    Reminds me of my favorite compression algorithm:
    Look at an arbitrary bitstring. The zeros are empty, and of course, contain no information, so you can remove them. So, now you have a string of just ones. (dividing the length in about half)
    However, the only information in this string is the length, so calculate the length and represent it in binary. (now the string is ln the previous length)
    But now we have zeros again, so we can iterate.
    After just a few iterations, we’ll end up with just one bit: 1. Can’t beat that!

    Reply
  9. Andrew

    My favorite compression scheme was from a spoof post to one of the compression usenet groups about a decade ago (I can’t find it now or else I would link to it – it was brilliant).
    It used a long sequence of operations to reduce the size of any sequence of bits. For example: discard the lowest 3 bits – this will make it impossible to decompress 7 out of 8 files but this is OK; 7 out of 8 people will use this program to store porn of some kind and will not complain if it doesn’t work.
    Once you got down to a single bit, the operation was simple. If the bit is 1 then store a zero byte file, otherwise don’t store a zero byte file.
    Compressing data down to zero bytes has many advantages – you could download HD movies over the internet without even being connected internet. The author pointed out that it didn’t correctly compress random data – but nobody wants to compress random data anyway.

    Reply
  10. JJ

    It would be easier to just explain that compression algorithms never really compress anything. The word compression is the problem.
    To be really simple just say it’s just about pointing from a big number (big file) to a smaller number (small file). The algorithm is just the decision over which numbers we like and which we don’t like, so that while the numbers we like can most often be given a smaller number to point to (like people in a crowd) the numbers we don’t like must (give room and move to) point even bigger numbers.
    0 Don’t like -> 2
    1 Don’t like -> 3
    2 Like -> 0
    3 Like -> 1
    Uuuu! Compression! 0 is smaller than 2! Uuuu!
    One might peruse from any infinite compression artist what is the number of bits after which The Miracle of Compression Emergence actually happens.

    Reply
  11. Uncle Al

    Universal Shrinkulator: Make a B&W graphic of the datafield, then png format. Compression! Draw randomly oriented lines (need only pi rotation – already down by 50%!) of random widths using unremarkable BASIC RND seeds. Any pixels under the line reverse color. Proceed until the entire field is an alternating checkerboard of black and white pixels. *Any* length file can be reversibly compressed to: Two seeds, number of cycles, 0 or 1 (first pixel color), field length and width.
    Those sum to waaaay smaller than 12K. Optimizing the pixel recoloring algoritm is left as an exercise for the alert reader.

    Reply
  12. idlemind

    *Any* length file can be reversibly compressed to: Two seeds, number of cycles, 0 or 1 (first pixel color), field length and width.
    Those sum to waaaay smaller than 12K.

    No they don’t. This is just an obfuscated version of the old “run a pseudorandom number generator until the sequence is replicated and output the position in the PRNG’s output.” It’s not hard to show that, even given a PRNG with a long enough period, the resulting count likely as not contains as many or more bits than the input does.

    Reply
  13. Mark C. Chu-Carroll

    Re #15:
    I’ll answer you the same way that I always answer the Cantor critics who claim to have an enumeration of the real numbers:
    How you get around the proof?
    If you have 2N strings of length less than N,
    and you want to compress it by just one bit, you’re mapping
    2N uncompressed values onto 2N-1
    compressed values. So every compressed value has on average two uncompressed strings mapped to it. How do you distinguish them?

    Reply
  14. Sundara Raman

    I can imagine ‘Mr. Wheel’ (let’s call him so for lack of other information) – believing himself to be a ‘misunderstood genius’, struggling to get the attention of people like you so that his excellent idea can come out and he can get the fame (and perhaps money) he
    ‘deserves’! Somewhat pathetic in a way…
    Moral of the story: More people need to learn and keep in mind the Pigeonhole principle.
    Two notes to the author:
    1. The (or whatever library) has been stripped out in the program
    2. In the second half of the article using the word ‘uncompress’ in two meanings is slightly confusing: “Any idiot can produce a script which renders uncompressible documents compressible – if they don’t have to be able to uncompress them afterwards”. The second ‘uncompress’ can better be called ‘decompress’ as you’ve done elsewhere in the article.
    3. The Preview function fails silently if Javascript is not enabled – it gives some crypting error like ‘The requested document was not found’. I guessed Javascript was the problem only by looking at the URL which had something like ‘javascript-is-required’. I don’t know if this is a problem with ScienceBlogs and if so if you can do anything about it, just informing.

    Reply
  15. Egaeus

    Mark, in case you’re unaware, Uncle Al is also a crank. A very imaginative crank, but a crank nonetheless. If you were to take his comments at face value, he has solved just about all of the world’s problems. His ability to engineer ad ignorantiam is unparalleled.

    Reply
  16. jinnah

    Let me see. If you can’t get the data back out of the file, then it’s pretty much worthless. So you can simply throw it away.
    Tada. Compression down to 0 bytes.

    Reply
  17. EM

    Well,I dont know if this is the same person I met, say 15 years back (back than he was around 20 years old) where I first heard this ridicules solution.
    This individual was speaking about this staff (hologram compression), obviously he was clueless.

    Reply
  18. Alexey Feldgendler

    Someone once wrote a program that would compress and decompress any file. It indeed managed to make every file smaller and to decompress it correctly afterwards. The trick was that it saved the extra data in the alternative data streams using a relatively unknown feature of the Windows file system, NTFS. Of course, what it did is that it used a design shortcoming of the file system to hide the data where it’s not readily visible (for example, the apparent size of the file with a megabyte-large alternative data stream can be just a few bytes). So the super-compressor “worked” in a typical test you would do (compress a file and then decompress it), but, of course, it’s impossible to decompress the file after it’s moved to another file system or transferred over the network. Also, while such ”compression” reduces the apparent size of the file, it doesn’t increase the free space on your disk. But it’s a clever hack, isn’t it? Took some time to figure it out.

    Reply

Leave a Reply to Aaron Cancel reply