{"id":751,"date":"2009-03-08T12:38:38","date_gmt":"2009-03-08T12:38:38","guid":{"rendered":"http:\/\/scientopia.org\/blogs\/goodmath\/2009\/03\/08\/i-get-mail-iterative-compression\/"},"modified":"2009-03-08T12:38:38","modified_gmt":"2009-03-08T12:38:38","slug":"i-get-mail-iterative-compression","status":"publish","type":"post","link":"http:\/\/www.goodmath.org\/blog\/2009\/03\/08\/i-get-mail-iterative-compression\/","title":{"rendered":"I Get Mail: Iterative Compression"},"content":{"rendered":"<p> Like a lot of other bloggers, I often get annoying email from people. This week, I&#8217;ve been dealing with a particularly annoying jerk, who&#8217;s been bothering me for multiple reasons. First, he wants me to &#8220;lay off&#8221; the Christians (because if I don&#8217;t, God&#8217;s gonna get me). Second, he wants to convince me to become a Christian. And third, he wants to sell me on his brilliant new compression scheme.<\/p>\n<p> See, aside from the religious stuff, he&#8217;s a technical visionary. He&#8217;s invented a method where he can take a source document, and <em>repeatedly<\/em> compress it, making it smaller each time.<\/p>\n<p> This is a stupid idea that I&#8217;ve seen entirely too many times. But instead of just making fun of it, I thought it would be interesting to explain in detail why it doesn&#8217;t work. It touches on a bunch of basic facts about how data compression works, and provides a nice excuse for me to write a bit about compression.<\/p>\n<div><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/scientopia.org\/img-archive\/goodmath\/img_366.png?resize=319%2C272\" width=\"319\" height=\"272\" alt=\"basic-compression.png\" \/><\/div>\n<p> The basic idea of data compression is that you&#8217;re eliminating redundancies in the original text. You can&#8217;t discard information. Mathematically, a compression function is an <em>invertible<\/em> function C from an array of characters to an array of characters (or you could use bits if you prefer), such that if y=C(x), then <em>on the average input<\/em>, the length of y is smaller than the length of x.<\/p>\n<p> An <em>ideal<\/em> compression system is one where for all possible values of x, C(x) is shorter than x. C is your compressor; and since C is reversible, it has a unique inverse function C<sup>-1<\/sup> such that C<sup>-1<\/sup>(C(x)) = x. An illustration of this basic compression system is in the diagram to the side.<\/p>\n<p><!--more--><\/p>\n<p> An ideal compression system can&#8217;t possibly exist, for a very simple reason. In an ideal compression system, the compression function in mapping from a <em>larger<\/em> set to a <em>smaller<\/em> set. That means that for <em>multiple<\/em> inputs, it <em>must<\/em> produce the same output. The better the ideal compressor &#8211; i.e, the smaller it manages to to make its compressed output &#8211; the more inputs must be compressed to the same output. But if multiple inputs compress to the same output, then you don&#8217;t have an invertible compression function: you can&#8217;t decompress your text.<\/p>\n<p> Repeated compression is really trivial mathematically. The basic idea of it is that you&#8217;ve got your original compression function, C(x). For some set of inputs, C(x) is smaller than x. 
You can&#8217;t just re-apply C to its own output; if C is worth anything as a compression function, it&#8217;s eliminated all of the redundancy that it&#8217;s algorithm is capable of finding &#8211; so its output is uncompressible by its own algorithm. So what you do is introduce an intermediate function, I. You evaluate I on C(x), and then you evaluate C on the result of that: C(I(C(x))). The reason that this is mathematically trivial is because this just corresponds to a better compression function, D. If there exists an intermediate function I, which renders the output of C compressible by C, that means that there&#8217;s an improved version of C which incorporates I.<\/p>\n<p> This far, there&#8217;s nothing controversial. In fact, there are variations on some of the common compression algorithms that work in two-pass modes. For example, for appropriate inputs, you can do a two-pass version of LZW that does a better job of finding the longest repeated strings than the single pass version. It does better than standard LZW compression, at the cost of a significant increase in execution time (more than double), and a significant increase in memory usage (you basically need to be able to hold the entire string to be compressed in memory). We don&#8217;t typically use those algorithms, because the improvement in compression rates is modest at best &#8211; and for many inputs, there&#8217;s no improvement at all. The two-pass LZW could be written as LZW-compress, transform, LZW-compress &#8211; the two pass algorithm is just a more efficient version of that. The basic idea of this is illustrated in the diagram here; you can&#8217;t decompress the document without having some kind of extra information available. In the common case of delta compression, that extra information is the original version of the document.<\/p>\n<p> What many clueless folks claim is something much stronger that the two-pass approach. That is, they claim that they can <em>repeatedly<\/em> apply the intermediate function, and each time, it will transform the original string into something which can be compressed more by the original compression function.<\/p>\n<p> That is, length(x) &gt; length(C(x)) &gt; length(C(I(C(x)))) &gt; length(C(I(C(I(C(x)))))), and so on.<\/p>\n<p> The usual explanation of why this works (and the explanation provided by my correspondent) is that the intermediate function is <em>removing<\/em> information; but that the reverse transform on the decompression side re-inserts the information. So the amount of information in the compressed string is being reduced by I; and the resulting string can be shorter, since it needs to encode less information.<\/p>\n<div><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/scientopia.org\/img-archive\/goodmath\/img_367.png?resize=524%2C278\" width=\"524\" height=\"278\" alt=\"extra-info-compress.png\" \/><\/div>\n<p> That&#8217;s fine. In fact that can, in theory, work. In fact, that does <em>in practice<\/em> work &#8211; if you look at what&#8217;s called <em>delta compression<\/em>, that&#8217;s almost exactly what it&#8217;s doing: since you know an original text, you can create a compressed copy of an altered version of that text by taking advantage of the fact that you know the original. It works incredibly well.<\/p>\n<p> Of course, doing it with an intermediate function is really kind of silly; it&#8217;s far more efficient to encode it into the original compression function. 
It&#8217;s pretty easy to encode things like that into the function. For example, I can create an improved english-language LZW compression system. The way that I&#8217;d do that is to pre-load the LZW dictionary: take common english words and phrases, and assign them dictionary entries. Then when I go to compress, I&#8217;ll get a better compression rate than the standard LZW &#8211; because normally, LZW would take up space in its compression dictionary building up those entries. By requiring my decompressor to have the information about the pre-populated dictionaries, I can do a better job of compressing, because I don&#8217;t need to put all of the information into the compressed file.<\/p>\n<p> You can, theoretically, do this as part of a multi-pass system. That is, you can look at a large number of compressed documents, and find common structures in the compressed file. Then you can populate a dictionary of those structures, and use that to transform the file, rendering it compressible. It&#8217;s <em>harder<\/em> than just incorporating it into the original algorithm. But it&#8217;s possible. The catch, of course, is that you have to do that <em>in advance<\/em>. The intermediate function can&#8217;t just look for structures and set up a dictionary &#8211; the specific structures that it&#8217;s going to look for have to be incorporated into the design of the intermediate &#8211; if they&#8217;re not, then you need to include that dictionary into the compressed text &#8211; in which case you&#8217;re <em>not<\/em> removing any information. <\/p>\n<p> Can that work iteratively? No. You can only remove a given chunk of information once. Then it&#8217;s gone. To use the english example, you can write a compression system for text that knows how to encode common words as short bit strings. (In LZW terms, you can pre-populate the dictionary with common words.) But once you&#8217;ve run the compressor with that once, re-using it isn&#8217;t going to work.<\/p>\n<p> No matter how you try to weasel around it, you&#8217;ll come back to the same basic problem. There&#8217;s a certain amount of information inherent in a text. Most texts contain a huge amount of redundancy, which is why they&#8217;re compressible. But once you&#8217;ve eliminated a particular kind of redundancy, <em>it&#8217;s gone<\/em>. You can reduce the length of a text by removing information from it &#8211; but to be able to decompress it, the decompressor needs to know exactly what information was removed &#8211; so you can only remove fixed, pre-determined chunks of information. And you can only do that once &#8211; then the information is gone; trying to remove it again doesn&#8217;t do anything.<\/p>\n<p> The takeaways from this saga?<\/p>\n<ol>\n<li> <em>Most<\/em> things aren&#8217;t compressible. 
You can, theoretically, do this as part of a multi-pass system. That is, you can look at a large number of compressed documents, and find common structures in the compressed files. Then you can populate a dictionary of those structures, and use that to transform the file, rendering it compressible. It's *harder* than just incorporating it into the original algorithm, but it's possible. The catch, of course, is that you have to do that *in advance*. The intermediate function can't just look for structures and set up a dictionary on the fly: the specific structures that it's going to look for have to be incorporated into the design of the intermediate function. If they're not, then you need to include that dictionary in the compressed text, in which case you're *not* removing any information.

Can that work iteratively? No. You can only remove a given chunk of information once; then it's gone. To use the English example: you can write a compression system for text that knows how to encode common words as short bit strings. (In LZW terms, you can pre-populate the dictionary with common words.) But once you've run the compressor with that dictionary once, re-using it isn't going to work.

No matter how you try to weasel around it, you'll come back to the same basic problem. There's a certain amount of information inherent in a text. Most texts contain a huge amount of redundancy, which is why they're compressible. But once you've eliminated a particular kind of redundancy, *it's gone*. You can reduce the length of a text by removing information from it, but to be able to decompress it, the decompressor needs to know exactly what information was removed; so you can only remove fixed, pre-determined chunks of information. And you can only do that once. After that, the information is gone; trying to remove it again doesn't do anything.

The takeaways from this saga?

1. *Most* things aren't compressible.
2. If you're using a halfway decent compression system, the output from it is pretty much guaranteed to fall into the category of non-compressible strings (see the sketch after this list).
3. There are ways to "tighten" a compressed text by removing information from it, but they're limited to fixed, predetermined sets of information.
4. You can use an intermediate function to remove information from a compressed text, but it's easier to just incorporate the removal process into the original compression algorithm.
5. Iterative information removal can't work.
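Takeaway 2 is easy to check empirically. Here's a minimal sketch, using Python's zlib as a stand-in for any halfway decent compressor; the input text is an arbitrary choice.

```python
import zlib

# Feed a compressor its own output and watch iterative compression
# fail: after the first pass there's no redundancy left for zlib's
# algorithm to find, so later passes only add framing overhead.
data = (b"No matter how you try to weasel around it, you'll come back "
        b"to the same basic problem. " * 200)

sizes = [len(data)]
for _ in range(5):
    data = zlib.compress(data, 9)
    sizes.append(len(data))

print(sizes)  # one big drop, then the sizes creep *up*, pass after pass
```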