Category Archives: information theory

Giving IDists too much credit: the Panda’s Thumb and CSI

Being a Nice Jewish Boy™, I find Christmas to be one of the most boring days of the
entire year. So yesterday, I was sitting with my laptop, looking for something interesting to read. I try to regularly read the [Panda’s Thumb][pt], but sometimes when I don’t have time, I just drop a bookmark in my “to read” folder; so on a boring Christmas afternoon, my PT backlog seemed like exactly what I needed.
[One of the articles in my backlog caught my interest.][pt-sc] (It turned out to be short enough that I should have just read it instead of dropping it into the backlog, but hey, that’s how things go sometimes!) The article was criticizing that genius of intelligent design, Sal Cordova, and [his article about zebrafish and the genetics of regeneration
in some zebrafish species.][sc] I actually already addressed Sal’s argument [here][bm-sc].
[pt]: http://www.pandasthumb.org
[pt-sc]: http://www.pandasthumb.org/archives/2006/11/when_ignorance.html
[sc]: http://www.uncommondescent.com/archives/1781
[bm-sc]: http://scienceblogs.com/goodmath/2006/11/bad_news_for_uncommon_descent_1.php


Ω: my favorite strange number

Ω is my own personal favorite transcendental number. Ω isn’t really a specific number, but rather a family of related numbers with bizarre properties. It’s the one real transcendental number that I know of that comes from the theory of computation, that is important, and that expresses meaningful fundamental mathematical properties. It’s also deeply non-computable, meaning that not only is it non-computable, but even computing meta-information about it is non-computable. And yet, it’s *almost* computable. It’s just all around awfully cool.
So. What is Ω?
It’s sometimes called the *halting probability*. The idea of it is that it encodes the *probability* that a given infinitely long random bit string contains a prefix that represents a halting program.
It’s a strange notion, and I’m going to take a few paragraphs to try to elaborate on what that means, before I go into detail about how the number is generated, and what sorts of bizarre properties it has.
Remember that in the theory of computation, one of the most fundamental results is the non-computability of *the halting problem*. The halting problem is the question “Given a program P and input I, if I run P on I, will it ever stop?” You cannot write a program that reads P and I and gives you the answer to the halting problem. It’s impossible. And what’s more, the statement that the halting problem is not computable is actually *equivalent* to the fundamental statement of Gödel’s incompleteness theorem.
Ω is something related to the halting problem, but stranger. The fundamental question of Ω is: if you hand me a string of 0s and 1s, and I’m allowed to look at it one bit at a time, what’s the probability that *eventually* the part that I’ve seen will be a program that eventually stops?
When you look at this definition, your reaction *should* be “Huh? Who cares?”
The catch is that this number – this probability – is a number which is easy to define; it’s not computable; it’s completely *uncompressible*; it’s *normal*.
Let’s take a moment and look at those properties:
1. Non-computable: no program can compute Ω. In fact, beyond a certain value N (which is non-computable!), you cannot compute the value of *any bit* of Ω.
2. Uncompressible: there is no way to represent Ω in a non-infinite way; in fact, there is no way to represent *any substring* of Ω in fewer bits than there are in that substring.
3. Normal: the digits of Ω are completely random and unpatterned; the value of any digit in Ω is equally likely to be a zero or a one; any selected *pair* of digits is equally likely to be any of the 4 possibilities 00, 01, 10, 11; and so on.
So, now we know a little bit about why Ω is so strange, but we still haven’t really defined it precisely. What is Ω? How does it get these bizarre properties?
Well, as I said at the beginning, Ω isn’t a single number; it’s a family of numbers. The value of *an* Ω is based on two things: an effective (that is, Turing equivalent) computing device; and a prefix-free encoding of programs for that computing device as strings of bits.
(The prefix-free bit is important, and it’s also probably not familiar to most people, so let’s take a moment to define it. A prefix-free encoding is an encoding where for any given string which is valid in the encoding, *no prefix* of that string is a valid string in the encoding. If you’re familiar with data compression, Huffman codes are a common example of a prefix-free encoding.)
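To make the prefix-free idea concrete, here’s a quick toy sketch in Python; the codewords are just made-up examples:

    # Toy illustration of a prefix-free code: no valid codeword is a prefix
    # of another, so a stream of bits can be decoded unambiguously from
    # left to right.

    def is_prefix_free(codewords):
        """Return True if no codeword is a proper prefix of another."""
        for a in codewords:
            for b in codewords:
                if a != b and b.startswith(a):
                    return False
        return True

    # A tiny Huffman-style code: prefix-free, since no codeword starts another.
    print(is_prefix_free(["0", "10", "110", "111"]))   # True
    # Not prefix-free: "0" is a prefix of "01", so decoding is ambiguous.
    print(is_prefix_free(["0", "01", "11"]))           # False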
So let’s assume we have a computing device, which we’ll call φ. We’ll write the result of running φ on a program encoded as the binary string p as φ(p). And let’s assume that we’ve set up φ so that it only accepts programs in a prefix-free encoding, which we’ll call ε; and the set of programs coded in ε, we’ll call Ε; and we’ll write φ(p)↓ to mean φ(p) halts. Then we can define Ω as:

Ω_{φ,ε} = Σ_{p ∈ Ε : φ(p)↓} 2^{-len(p)}

So: for each program in our prefix-free encoding, if it halts, we *add* 2^{-len(p)} to Ω.
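Here’s a rough sketch of what that sum looks like in code. It’s emphatically *not* a real computation of Ω – deciding halting is impossible – so the `halts_within` function below is just a stand-in oracle that watches behavior up to a step bound, and the toy “programs” are made-up bit strings:

    # Hypothetical sketch only: a real Omega cannot be computed, since no
    # procedure can decide halting. halts_within() is a stand-in that just
    # observes behavior up to a step bound, so this yields a lower bound.

    def lower_bound_omega(programs, halts_within, steps):
        """Sum 2^(-len(p)) over prefix-free-encoded programs p that are
        observed to halt within `steps` steps."""
        total = 0.0
        for p in programs:
            if halts_within(p, steps):
                total += 2.0 ** (-len(p))
        return total

    # Made-up toy encoding: pretend programs starting with '1' halt at once,
    # and the rest never halt.
    toy_programs = ["10", "110", "111", "010", "011"]
    toy_halts = lambda p, steps: p.startswith("1")
    print(lower_bound_omega(toy_programs, toy_halts, steps=100))  # 0.5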
Ω is an incredibly odd number. As I said before, it’s random, uncompressible, and has a fully normal distribution of digits.
But where it gets fascinatingly bizarre is when you start considering its computability properties.
Ω is *definable*. We can (and have) provided a specific, precise definition of it. We’ve even described a *procedure* by which you can conceptually generate it. But despite that, it’s *deeply* uncomputable. There are procedures where we can compute a finite prefix of it. But there’s a limit: there’s a point beyond which we cannot compute *any* digits of it. *And* there is no way to compute that point. So, for example, there’s a very interesting paper where the authors [computed the value of Ω for a Minsky machine][omega]; they were able to compute 84 bits of it; but the last 20 are *unreliable*, because it’s uncertain whether or not they’re actually beyond the limit, and so they may be wrong. But we can’t tell!
What does Ω mean? It’s actually something quite meaningful. It’s a number that encodes some of the very deepest information about *what* it’s possible to compute. It gives a way to measure the probability of computability. In a very real sense, it represents the overall nature and structure of computability in terms of a discrete probability. It’s an amazingly dense container of information – as an infinitely long, thoroughly non-compressible number, it contains an unmeasurable quantity of information. And we can’t get near most of it!
[omega]: http://www.expmath.org/expmath/volumes/11/11.3/Calude361_370.pdf

Q&A: What is information?

I received an email from someone with some questions about information theory; they relate to some sufficiently common questions/misunderstandings of information theory that I thought it was worth turning the answer into a post.
There are two parts here: my correspondent started with a question; and then after I answered it, he asked a followup.
The original question:
————————
>Recently in a discussion group, a member posted a series of symbols, numbers,
>and letters:
>
>`+
>The question was what is its meaning and whether this has information or not.
>I am saying it does have information and an unknown meaning (even though I am
>sure they were just characters randomly typed by the sender), because of the
>fact that in order to recognize a symbol, that is information. Others are
>claiming there is no information because there is no meaning. Specifically that
>while the letters themselves have meaning, together the message or “statement”
>does not, and therefore does not contain information. Or that the information
>content is zero.
>
>Perhaps there are different ways to define information?
>
>I think I am correct that it does contain information, but just has no meaning.
This question illustrates one of the most common errors made when talking about information. Information is *not* language: it does *not* have *any* intrinsic meaning. You can describe information in a lot of different ways; in this case, the one that seems most intuitive is: information is something that reduces a possibility space. When you sit down to create a random string of characters, there is a huge possibility space of the strings that could be generated by that. A specific string narrows the space to one possibility. Anything that reduces possibilities is something that *generates* information.
For a very different example, suppose we have a lump of radium. Radium is a radioactive metal which breaks down and emits alpha particles (among other things). Suppose we take our lump of radium, and put it into an alpha particle detector, and record the time intervals between the emission of alpha particles.
The radium is *generating information*. Before we started watching it, there were a huge range of possibilities for exactly when the decays would occur. Each emission – each decay event – narrows the possibility space of other emissions. So the radium is generating information.
That information doesn’t have any particular *meaning*, other than being the essentially random time-stamps at which we observed alpha particles. But it’s information.
A string of random characters may not have any *meaning*; but that doesn’t mean it doesn’t contain information. It *does* contain information; in fact, it *must* contain information: it is a *distinct* string, a *unique* string – one possibility out of many for the outcome of the random process of generation; and as such, it contains information.
The Followup
—————
>The explanation I have gotten from the person I have been debating, as to what
>he says is information is:
>
>I = -log₂ P(E)
>
>where:
>I: information in bits
>P: probability
>E: event
>
>So for the example:
>`+
>He says:
>”I find that there is no meaning, and therefore I infer no information. I
>calculate that the probability of that string occuring was ONE (given no
>independent specification), and therefore the amount of information was ZERO. I
>therefore conclude it has no meaning.”
>
>For me, even though that string was randomly typed, I was able to look at the
>characters, and find there placement on a QWERTY keyboard, compare the
>characters to the digits on the hand, and found that over all, the left hand
>index finger keys were used almost twice as much. I could infer that a person
>was left handed tried to type the “random” sequence. So to me, even though I
>don’t know the math, and can’t measure the amount of information, the fact I
>was able to make that inference of a left handed typer tells me that there is
>information, and not “noise”.
The person quoted by my correspondent is an idiot; and clearly one who’s been reading Dembski or one of his pals. In my experience, they’re the only ones who continually stress that log-based equation for information.
But even if we ignore the fact that he’s a Dembski-oid, he’s *still* an idiot. You’ll notice that nowhere in the equation does *meaning* enter into the definition of information content. What *does* matter is the *probability* of the “event”; in this case, the probability of the random string of characters being the result of the process that generated it.
I don’t know how he generated that string. For the sake of working things through, let’s suppose that it was generated by pounding keys on a minimal keyboard; and let’s assume that the odds of hitting any key on that keyboard are equal (probably an invalid assumption, but it will only change the quantity of information, not its presence, which is the point of this example). A basic minimal keyboard has about sixty keys (I’m going by counting the keys on my “Happy Hacking Keyboard”). Of those, seven are shifts of various sorts (2 shift, 2 control, one alt, one command, one mode), and one is “return”. So we’re left with 52 keys. To make things simple, let’s ignore the possibility of shifts affecting the result (this will result in us getting a *lower* information content, but that’s fine). So, it’s 80 characters; each *specific* character generated is an event with probability 1/52. So each *character* of the string has, by the Shannon-based definition quoted above, about 5.7 bits of information. The string as a whole has about 456 bits of information.
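If you want to check that arithmetic, here’s a quick sketch using the same assumptions as above (52 equally likely keys, 80 characters):

    import math

    # 52 equally likely keys, an 80-character string, information per the
    # quoted formula I = -log2(P(E)).
    bits_per_char = -math.log2(1.0 / 52)   # about 5.7 bits per character
    total_bits = 80 * bits_per_char        # about 456 bits for the string

    print(round(bits_per_char, 2))  # 5.7
    print(round(total_bits))        # 456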
The fact that the process that generated it is random doesn’t make the odds of a particular string be one. For a trivial example, I’m going to close my eyes and pound on my keyboard with both hands twice, and then select the first 20 characters from each:
First: “`;kldsl;ksd.z.mdΩ.l.x`”
Second: “`lkficeewrflk;erwm,.r`”
Quite different results overall, no? It seems like I’m a bit heavy on the right hand in the k/l area. But note how different the two outcomes are. *That* is information. Information without *semantics*; without any intrinsic meaning. But that *doesn’t matter*. Information *is not language*; the desire of [certain creationist morons][gitt] to demand that information have the properties of language is nonsense, and has *nothing* to do with the mathematical meaning of information.
The particular importance of this fact is that it’s a common creationist canard that information *must* have meaning; that therefore information can’t be created by a natural process, because natural processes are random, and randomness has no meaning; that DNA contains information; and therefore DNA can’t be the result of a natural process.
The sense in which DNA contains information is *exactly* the same as that of the random strings – both the ones in the original question, and the ones that I created above. DNA’s information is *no different*.
*Meaning* is something that you get from language; information *does not* have to be *language*. Information *does not* have to have *any* meaning at all. That’s why we have distinct concepts of *messages*, *languages*, *semantics*, and *information*: because they’re all different.
[gitt]: http://scienceblogs.com/goodmath/2006/07/bad_bad_bad_math_aig_and_infor.php

Bad, bad, bad math! AiG and Information Theory

While taking a break from some puzzling debugging, I decided to hit one of my favorite comedy sites, Answers in Genesis. I can pretty much always find something sufficiently stupid to amuse me on their site. Today, I came across a gem called [“Information, science and biology”][gitt], by the all too appropriately named “Werner Gitt”. It’s yet another attempt by a creationist twit to find some way to use information theory to prove that life must have been created by god.
It looks like the Gitt hasn’t actually *read* any real information theory, but has rather just read Dembski’s wretched mischaracterizations, and then regurgitated and expanded upon them. Dembski was bad enough; building on an incomplete understanding of Dembski’s misrepresentations and errors is just astonishing.
Anyway, after butchering an introduction to Shannon theory, he moves onward.
>The highest information density known to us is that of the DNA
>(deoxyribonucleic acid) molecules of living cells. This chemical storage medium
>is 2 nm in diameter and has a 3.4 nm helix pitch (see Figure 1). This results
>in a volume of 10.68 × 10⁻²¹ cm³ per spiral. Each spiral contains ten chemical
>letters (nucleotides), resulting in a volumetric information density of 0.94 ×
>10²¹ letters/cm³. In the genetic alphabet, the DNA molecules contain only the
>four nucleotide bases, that is, adenine, thymine, guanine and cytosine. The
>information content of such a letter is 2 bits/nucleotide. Thus, the
>statistical information density is 1.88 × 10²¹ bits/cm³.
This is, of course, utter gibberish. DNA is *not* the “highest information density known”. In fact, the concept of *information density* is not well-defined *at all*. How do you compare the “information density” of a DNA molecule with the information density of an electromagnetic wave emitted by a pulsar? It’s meaningless to compare. But even if we restrict ourselves to physically encoded information, consider the information density of a crystal, like a diamond. A diamond is an incredibly compact crystal of carbon atoms. There are no perfect diamonds: all crystals contain irregularities and impurities. Consider how dense the information of that crystal is: the position of every flaw, every impurity, the positions of the subset of carbon atoms in the crystal that are carbon-14 as opposed to carbon-12. Considerably denser than DNA, huh?
After this is where it *really* starts to get silly. Our Gitt claims that Shannon theory is incomplete, because after all, it’s got a strictly *quantitative* measure of information: it doesn’t care about what the message *means*. So he sets out to “fix” that problem. He proposes five levels of information: statistics, syntax, semantics, pragmatics, and apobetics. He claims that Shannon theory (and in fact information theory *as a whole*) only concerns itself with the first; because it doesn’t differentiate between syntactically valid and invalid information.
Let’s take a quick run through the five, before I start mocking them.
1. Statistics. This is what information theory refers to as information content, expressed in terms of an event sequence (as I said, he’s following Dembski); so we’re looking at a series of events, each of which is receiving a character of a message, and the information added by each event is how surprising that event was. That’s why he calls it statistical.
2. Syntax. The structure of the language encoded by the message. At this level, it is assumed that every message is written in a *code*; you can distinguish between “valid” and “invalid” messages by checking whether they are valid strings of characters for the given code.
3. Semantics. What the message *means*.
4. Pragmatics. The *primitive intention* of the transmitter of the message; the specific events/actions that the transmitter wanted to occur as a result of sending the message.
5. Apobetics: The *purpose* of the message.
According to him, level 5 is the most important one.
Throughout the article, he constantly writes “theorems”. He clearly doesn’t understand what the word “theorem” means, because these things are just statements that he would *like* to be true, but which are unproven, and often unprovable. A few examples?
For example, if we look at the section about “syntax”, we find the following as theorems:
>Theorem 4: A code is an absolutely necessary condition for the representation
>of information.
>
>Theorem 5: The assignment of the symbol set is based on convention and
>constitutes a mental process.
>
>Theorem 6: Once the code has been freely defined by convention, this definition
>must be strictly observed thereafter.
>
>Theorem 7: The code used must be known both to the transmitter and receiver if
>the information is to be understood.
>
>Theorem 8: Only those structures that are based on a code can represent
>information (because of Theorem 4). This is a necessary, but still inadequate,
>condition for the existence of information.
>
>These theorems already allow fundamental statements to be made at the level of
>the code. If, for example, a basic code is found in any system, it can be
>concluded that the system originates from a mental concept.
How do we conclude that a code is a necessary condition for the representation of information? We just assert it. Worse, how do we conclude that *only* things that are based on a code represent information? Again, just an assertion – but an *incredibly* strong one. He is asserting that *nothing* without a
structured encoding is information. And this is also the absolute crux of his argument: information only exists as a part of a code *designed by an intelligent process*.
Despite the fact that he claims to be completing Shannon theory, there is *nothing* to do with math in the rest of this article. It’s all words. Theorems like the ones quoted above, but becoming progressively more outrageous and unjustified.
For example, his theorem 11:
>The apobetic aspect of information is the most important, because it embraces
>the objective of the transmitter. The entire effort involved in the four lower
>levels is necessary only as a means to an end in order to achieve this
>objective.
After this, we get to his conclusion, which is quite a prize.
>On the basis of Shannon’s information theory, which can now be regarded as
>being mathematically complete, we have extended the concept of information as
>far as the fifth level. The most important empirical principles relating to the
>concept of information have been defined in the form of theorems.
See, to him, a theorem is nothing but a “form”: a syntactic structure. And this whole article, to him, is mathematically complete.
>The Bible has long made it clear that the creation of the original groups of
>fully operational living creatures, programmed to transmit their information to
>their descendants, was the deliberate act of the mind and the will of the
>Creator, the great Logos Jesus Christ.
>
>We have already shown that life is overwhelmingly loaded with information; it
>should be clear that a rigorous application of the science of information is
>devastating to materialistic philosophy in the guise of evolution, and strongly
>supportive of Genesis creation.
That’s where he wanted to go all through this train-wreck. DNA is the highest-possible density information source. It’s a message originated by god, and transmitted by each generation to its children.
And as usual for the twits (or Gitts) that write this stuff, they’re pretending to put together logical/scientific/mathematical arguments for god; but they can only do it by specifically including the necessity of god as a premise. In this case, he asserts that DNA is a message, and a message must have an intelligent agent creating it. Since living things cannot be the original creators of the message (the DNA had to be created before us), there must be a god.
Same old shit.
[gitt]: http://www.answersingenesis.org/tj/v10/i2/information.asp

Dembski on Non-Evolutionary Math

I’ve been taking a look at William Dembski’s paper, “[Information as a Measure of Variation][imv]”. It was recommended to me as a paper demonstrating Dembski’s skill as a mathematician that isn’t aimed at evolution-bashing. I’m not going to go into too much detail about it; it’s just not that good. If this is the best work he’s done as a mathematician, well, that’s pretty sad.
The main problems with the paper are:
1. He either doesn’t understand or misrepresents some of the fundamentals of the field he’s allegedly discussing;
2. He presents many of the central ideas of the paper (that is, describing information content of an event in an event sequence in terms of how it affects the probabilities of events that occur after it) as if they were novel when they really are not (in fact, this idea dates back to Shannon’s [self-information][self-info]); and
3. Much of the paper is very unclear, even obfuscatory.
The last two are typical of Dembski’s writing, and not particularly interesting. I’m going to focus on three egregious cases of misrepresentation.
### Misrepresentation One: IT = Shannon IT
The misrepresentations start from quite literally the first line of the paper. The first two paragraphs of the body of the paper are:
>Ordinarily, information refers to the meaning or semantic content of a message.
>Getting a handle on the meaning of a message, however, has proven difficult
>mathematically. Thus, when mathematicians speak of information, they are
>concerned not so much with the meaning of a message as with the vehicle by
>which the message is transmitted from sender to receiver.
>
>The most common vehicle for transmitting messages is the character string.
>The mathematical theory of information is largely about quantifying the
>complexity of such strings, characterizing their statistical properties when they
>are sent across a noisy communication channel (noise being represented as a
>stochastic process that disrupts the strings in statistically well-defined
>ways), preserving the strings despite the presence of noise (i.e., the theory
>of error-correcting codes), compressing the strings to improve efficiency, and
>transforming the strings into other strings to maintain their security (i.e.,
>cryptography).
This is wrong, and it’s deliberately wrong.
The reason it’s wrong? As I described in my [introduction to information theory][intro-it] (IT), there are two main branches of IT: Shannon and Kolmogorov-Chaitin. Dembski is definitely aware of both; he references the work of Chaitin in papers written *before* this one. But in his description of information theory here, he focuses exclusively on Shannon theory, and presents it as if it were the entirety of mathematical IT.
Why would he do that? Well, because it makes it easier to make his argument about why it makes sense to view information in terms of a series of events. Later in the paper, he’s going to argue that information entropy is the *wrong* measure for information; but that argument can *only* make sense in Shannon theory. And even for the Shannon IT that he uses, the way that he characterizes it is sloppy.
### Misrepresentation Two: Information Content in Poker Hands
Moving on, here’s another really dreadful passage:
>Consider, for instance, the following individuation of poker hands: RF (a
>royal flush) and ¬RF (all other poker hands). To learn that something other
>than a royal flush was dealt (i.e., possibility ¬RF ) is clearly to acquire
>less information than to learn that a royal flush was dealt (i.e., possibility
>RF ). A royal flush is highly specific. We have acquired a lot of information
>when we learn that a royal flush was dealt. On the other hand, we have acquired
>hardly any information when we learn that something other than a royal flush
>was dealt. Most poker hands are not royal flushes, and we expect to be dealt
>them only rarely. Nevertheless, if our measure of information is simply an
>enumeration of eliminated possibilities, the same numerical value must be
>assigned in both instances since, in each instance, a single possibility is
>eliminated.
Looking at this, it’s hard to tell if he just really *doesn’t get it*, or if he’s trying to slip something past the readers. It’s sufficiently messed up that it’s hard to determine exactly what he means. I can see two very different readings.
1. The two possibilities are, as he says, RF and ¬RF. That is, we’re going to be told whether or not a hand is a royal flush. We are *not* going to be told what the cards in the hand are: we are simply going to get a yes/no answer to the question “Is the hand a royal flush?” If that’s the case, then this is completely wrong: being told “Yes, it’s an RF” gives you enough information to determine that the hand is one of four sets of cards (the RF in each of the four suits); being told “No, it’s not an RF” is only enough information to determine that the hand is one of C(52, 5) – 4 possible sets of cards (see the sketch after this list). And *in either case*, the information content is determined *not solely by the statement that the cards are, or are not, a royal flush*. The information content of those statements (per Kolmogorov-Chaitin, which is the kind of IT applicable to analyzing statements of this sort) is based on the statement *plus* the rules of poker that define the meaning of a royal flush.
2. We are told exactly what cards are in the hand. In that case, we know whether or not it’s an RF because we know what cards are there. In that case, whether you’ve got an RF or not, *you have a precise description of exactly one poker hand*. No matter what variant of IT you’re using, no matter whether you’ve got a flush or not, the amount of information that you have *is exactly the same*: it’s the identity of the 5 cards in your hand.
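To put rough numbers on the first reading, here’s a little sketch using the plain -log₂(P) measure that Dembski himself favors:

    import math

    # Self-information of the yes/no answer to "is this hand a royal flush?"
    hands = math.comb(52, 5)       # 2,598,960 possible five-card hands
    p_rf = 4 / hands               # four royal flushes

    info_yes = -math.log2(p_rf)    # ~19.3 bits: "yes" rules out almost everything
    info_no = -math.log2(1 - p_rf) # ~0.0000022 bits: "no" rules out almost nothing

    print(round(info_yes, 1), info_no)

Of course, as the list above says, neither number is the whole story: the statements only carry that information relative to the rules of poker that define a royal flush in the first place.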
### Misrepresentation Three: Explaining Entropy as a Measure
One more example of the misleadingness: the beginning of section two tries to explain why information theorists use entropy. He presents an explanation of information associated with an event; and then he explains entropy *in terms of* that presentation of information as events; and then presents the IT notion of entropy as something *mysterious*:
>Nonetheless, this notion is almost entirely passed over in favor of a
>different notion, called entropy. Entropy, rather than being associated with a
>particular event, is associated with a partition of events for a given
>reference class of possibilities Ω….
Followed by a rather obfuscatory presentation of the equation for the Shannon IT entropy of an event sequence. So far, this part *could* be taken as just typical of Dembski’s dreadfully obscure writing style. But the next step he takes is where he’s deliberately being misleading. He asks why entropy, rather than event probability, is the preferred measure for information content.
But he shifts the goalposts. Up until now, he’s been talking about mathematicians and IT theorists. Now suddenly, the question isn’t about why entropy is the preferred measure among *information theorists*; it’s why it’s the preferred measure among *communication theorists* (who are a different group of people); and the *answer* that he quotes is about *communication engineers*.
So: again, he’s deliberately being misleading. He’s trying to convince you that you should think of information content in terms of probability. But instead of making that argument, he presents the definitions in a very strange way, and uses subtle unjustified changes to allow him to present the *weakest possible explanation* for why his preferred measure is not widely accepted.
### Conclusion
Dembski’s a damned lousy mathematician. Even when he isn’t trying to mislead people about evolution, he’s sloppy, obfuscatory, and prone to arguments based on strawmen. If this is an example of his best work as a mathematician, then someone really screwed up in giving him a PhD.
[imv]: http://www.iscid.org/pcid/2005/4/2/dembski_information_variation.php
[intro-it]: http://scienceblogs.com/goodmath/2006/06/an_introduction_to_information.php
[self-info]: http://en.wikipedia.org/wiki/Self-information

An Introduction to Information Theory (updated from Blogspot)

Back when I first started this blog on blogspot, one of the first things I did was write an introduction to information theory. It’s not super deep, but it was a decent overview – enough, I thought, to give people a good idea of what it’s about, and to help understand why the common citations of it are almost always misusing its terms to create bizarre conclusions, like some [“law of conservation of information”][conservation].
[conservation]: http://en.wikipedia.org/wiki/Law_of_conservation_of_information “wikipedia on Dembski’s law of CoI”
This is a slight revision of that introduction. For the most part, I’m just going to clean up the formatting, but once I’m going through it, I’ll probably end up doing some other polishing.
****
So what is information theory?
================================
Information theory is a relatively new field of mathematics that tries to characterize what information is in a quantifiable way. It’s an area of math that was almost totally obscure until not too long ago, and one which is frequently misunderstood, even by people who mean well.
A bit of history
—————–
Modern information theory comes from two very different sources: Shannon’s information theory, and Kolmogorov/Chaitin algorithmic information theory. Fortunately, they both converge, mostly, on the same place.
### Shannon’s Information Theory
The more famous branch of IT is Claude Shannon’s information theory, as presented in his infamous paper, [Communication in the Presence of Noise][shannon]. Shannon’s interest came from his work at Bell Labs for AT&T, which wanted a way of being able to figure out how much wire they needed to lay for the telephone network.
AT&T was interested because, for a telephone company, laying wire is an incredibly expensive operation. The wire itself (especially back in the days when it was bundles of copper) is quite expensive; the process of running and burying the cables is even more expensive. So ideally, as a phone company, you want to run the cables once; you want to lay enough that you’ll never have to come back and do it again; and you want to lay the smallest amount that will do the job, since the materials are so expensive.
The problem that they wanted Shannon to help solve was how to determine how much wire they actually needed to lay. It’s actually an astonishingly tricky question. The less they laid, the less it cost them in the short term. But they *knew* that the number of phone lines was going to increase dramatically – so if they were cheap, and laid too little wire, when the number of phones exceeded the capacity of what they had already laid, they’d have to go back and lay more. So they wanted to be able to make a good prediction of the minimum amount of wire they could lay that would meet their needs both at the time it was laid, and in the future.
But there’s a big trick to that. First: how much information can a single wire carry? And when you bundle a bunch of wires together, how much can you pump over a single wire without it interfering with signals on its neighbors in the bundle?
That’s what Shannon was trying to understand. He was trying to find a mathematical model that he could use to describe information transmission, including the effects of imperfections in transmission, including the introduction of noise and interference. He wanted to be able to quantify information in a meaningful way that would allow him to ask questions like: “At what point does noise in the line start to eliminate the information I want to transmit?”.
The answer is just fascinating: a given communication channel has a capacity for carrying information; and a given message has a quantity of information in it. Adding noise to the signal *adds* information to the message *until* the capacity of the channel is reached, at which point information from the noise will start to *replace* information from the message; at that point, you can say that information from the message will start to be lost.
Shannon called the information content of a signal its *entropy*, because he saw a similarity between his information entropy and thermodynamic entropy: in a communicating system, entropy *never* decreases: it increases until the capacity of the channel is reached, and then it stays constant. You can’t reduce the amount of information in a message. (There are various rumours that in fact this choice of terminology was actually suggested by von Neumann himself.)
That naming led directly to the most common misuse of information theory. But that’s a post for another day.
### Algorithmic Information Theory
The other main branch of information theory came from a very different direction. The two pioneers were Andrey Kolmogorov, and Greg Chaitin. Kolmogorov and Chaitin didn’t know each other at the time they independently invented the same mathematical formalization (in fact, Chaitin was a teenager at the time!), a fact which led to some friction.
In the interests of honesty, I’ll say that Greg Chaitin works in the same building that I do, and I’ve met him a number of times, and been greatly impressed by him, so my description is heavily influenced by Greg.
Anyway – algorithmic information theory grew out of some of the fundamental mathematics being done at the beginning of the 20th century. There was an effort to create a single, fundamental set of mathematical axioms and inference rules, which was both complete (every true statement was provably true), and consistent (it could never prove both a statement and its negation). This effort failed, miserably; the nail in the coffin was Gödel’s incompleteness theorem. Gödel presented an extremely complicated proof that showed, essentially, that no formal system powerful enough to express arithmetic could be both complete and consistent. (See my recent article about [Extreme Math][extreme] for more about this.)
[extreme]: http://scienceblogs.com/goodmath/2006/06/extreme_math_1_1_2.php “1+1=2”
Most people saw Gödel’s proof, did the equivalent of saying “Oh, shit!”, and then pretended that they hadn’t seen it. It does mean that there are fundamental limits to what you can do using mathematical formalization, but for the most part, they don’t affect you in normal day to day math. (Sort of the way that car designers didn’t change the way they build cars because of relativity. Yes, it showed that the fundamental physical model that they were using was wrong – but at the speeds that cars move, that wrongness is so small that it just doesn’t matter.)
But for people in the early stages of what would become computer science, this was a big deal. One of the early hopes was that the mathematical system would produce a “decision procedure”, a mechanical process by which a computing device could check the truth or falsehood of any mathematical statement. Gödel showed that it couldn’t be done.
But the early computer scientists – in particular, Alan Turing – embraced it. It led directly to two of the fundamental rules of computer science:
1. The Church-Turing Thesis: all mechanical computing systems are basically the same: there is a fundamental limit to what is computable, and any “acceptable” system can compute anything up to that limit – so if you can find a way of doing it on any computing device, then you can, ultimately, do it on every acceptable computing device.
2. The Halting Problem: there are some things that you cannot program a device to do. In particular, you cannot write a program for any computing device that examines another program, and tells you if that other program will eventually stop.
The halting problem turns out to say *exactly* the same thing as the Gödel incompleteness theorem. Only you can write a proof of it that a freshman college student can understand in ten minutes, instead of a proof that the average math grad student will need a semester to plow through.
Chaitin and Kolmogorov saw this as a big deal: using an algorithmic approach to how to process information, you could prove something very simply, which would require a much more complicated proof using other methods.
K-C information theory grew out of this. According to Kolmogorov and Chaitin, the fundamental amount of information contained in a string (called the string’s information entropy, after Shannon) is the length of the shortest combination of program and data, for a computing device, that can generate that string.
(If you’re interested in Gödel’s incompleteness theorem, then Hofstadter’s
[Godel, Escher, Bach: an Eternal Golden Braid][geb] is a really fun introduction. For K-C theory, Greg Chaitin has written a bunch of really [great][chaitin1] [books][chaitin2] for mathematical laymen.)
[geb]: http://www.amazon.com/gp/search/ref=br_ss_hs/002-3998406-1014458?search-alias=aps&keywords=godel%20escher%20bach “GEB amazon link”
[chaitin1]: http://www.amazon.com/gp/product/9814021725/sr=8-4/qid=1141849782/ref=pd_bbs_4/002-3998406-1014458?%5Fencoding=UTF8
[chaitin2]: http://www.amazon.com/gp/product/1852336684/sr=8-3/qid=1141849782/ref=pd_bbs_3/002-3998406-1014458?%5Fencoding=UTF8
Information and Entropy
————————-
Now that I’ve got a bit of background and origins done, it’s time to get to some of the fun stuff.
As I said yesterday, both of the big branches of information theory have a concept of *entropy*. While the exact presentations differ, because they’re built on somewhat different theoretical foundations, the basic meaning of entropy in both branches is pretty much the same. I’m going to stick with the terminology of the Kolmogorov-Chaitin branch, but the basic idea is the same in either.
In K-C theory, we talk about the information content of *strings*. A string is an ordered sequence of characters from a fixed alphabet. It doesn’t really matter what alphabet you choose; the point of this kind of theory isn’t to come up with a specific number describing the complexity of a string, but to be able to talk about what information means in a formal algorithmic sense.
The K-C definition of the entropy of a string is the total quantity of information encoded in that string.
Here’s where it gets fun.
Another way of saying exactly the same thing is that the entropy of a string is a measure of the *randomness* of the string.
Structured strings are structured *precisely* because they contain *redundancy* – and redundancy does not add information.
Which string has more information: “XXXXYYYYY” or “4X5Y”? They have the same amount. One is a lot smaller than the other. But they contain the same amount of information.
Here’s a third way of saying exactly the same thing: the entropy of a string is the length of the smallest compressed string that can be decompressed into the original.
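You can see this concretely with an off-the-shelf compressor. zlib is only a crude stand-in for true Kolmogorov complexity, but the idea comes through: a highly structured string compresses to almost nothing, while random bytes barely compress at all.

    import os
    import zlib

    # Entropy ~ shortest compressed length: structure (redundancy) compresses
    # away, randomness does not.
    structured = b"XY" * 256          # 512 bytes of obvious repetition
    random_bytes = os.urandom(512)    # 512 bytes of noise

    print(len(zlib.compress(structured)))    # tiny: the redundancy is squeezed out
    print(len(zlib.compress(random_bytes)))  # roughly 512: nearly incompressible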
Here’s a better example. Go back and look at the first two paragraphs of yesterday’s post. They take up 514 characters.
Here’s the same information, compressed (using gzip) and then made
readable using a unix utility called uuencode:
M’XL(“)E8$$0“VIU;FL`19$Q=JPP#$7[6≤7KIN’,`M+]≤HHLPH“)8BMOPY
M[#Y/GCDG#
MLY2EI8$9H5GLX=*R(_+ZP/,-5-1#HRJNT`77@LL,MZD,”H7LSUKDW3@$#V2
MH(KTO$Z^$%CN1Z>3L*J^?6ZW?^Y2+10;+SOO’OC”/;7T5QA%987SY02I3I$
MLKW”W,VZ-(J$E”[$;’2KYI^-_L./3BW.+WF3XE8)≤@D8X^U59DQ7IA*F+X/
MM?I!RJ*%FE%])Z+EXE+LSN*,P$YNX5/P,OCVG;IK=5_K&CK6J7%’+5M,R&J]
M95*W6O5EI%G^K)8B/XV#L=:5_`=5ELP#Y#UJ??>[[DI=J”*2D];K_F230″
$`@(““`
`
That’s only 465 characters; and if I didn’t have to encode it to stop
it from crashing your web-browser, it would only have been 319
characters. Those characters are a lot more random than the original –
so they have a higher information density. The closer we get
to the shortest possible compressed length, the higher the information
density gets.
Actually, I’m lying slightly; there’s an additional factor that needs to be considered in K-C theory.
Remember that K-C comes from the foundational mathematics of computer science, something called recursive function theory. We’re talking about strings being processed algorithmically by a program. So really, we get to talk about programs that generate strings. So it’s not just strings: you have a program and a data string. Really, you measure information content of a string as the total size of the smallest pairing of program and data that generates that string. But for most purposes, that doesn’t make that much difference: the combination of the program and the data can be encoded into a string.
In the two examples above, the program to follow is implied by the contents of the string; if we wanted to really model it precisely, we’d need to include the programs (rough sketches below, with runnable versions after the list):
* For the string “XXXXYYYYY”, the program is roughly:

        while there are more characters in the data:
            read a character
            output the character that you read
        end

* For the string “4X5Y”, the program is roughly:

        while there are more characters in the data:
            read a number n from the data
            read a character c from the data
            output c n times
        end
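And here are runnable Python versions of those two rough sketches, just to make the idea concrete: the first program simply copies its data, the second is a tiny run-length decoder for strings like “4X5Y”.

    # Runnable versions of the two rough sketches above.

    def copy_program(data):
        """Output the data string unchanged, one character at a time."""
        return "".join(ch for ch in data)

    def run_length_decode(data):
        """Read (count, character) pairs from the data and expand them."""
        out = []
        i = 0
        while i < len(data):
            count = ""
            while data[i].isdigit():       # read a number n from the data
                count += data[i]
                i += 1
            char = data[i]                 # read a character c from the data
            i += 1
            out.append(char * int(count))  # output c, n times
        return "".join(out)

    print(copy_program("XXXXYYYYY"))       # XXXXYYYYY
    print(run_length_decode("4X5Y"))       # XXXXYYYYY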
Now, specifics aside: the definitions I gave earlier are pretty much correct in both K-C and Shannon information theory: information content is proportional to randomness is proportional to minimum compressed length.
So – what happens if I take a string in very non-compressed format (like, say, an uncompressed digitized voice) and send it over a noisy phone line? Am I gaining information, or losing information?
The answer is: *gaining* information. Introducing randomness into the string is *adding* information.
“AAAABBBB” contains less information than “AAAABxBB”.
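Here’s a quick sketch that checks this claim, again with zlib standing in for a real complexity measure: randomly corrupting a fifth of the characters in a highly redundant string makes it noticeably harder to compress, which is exactly what “adding information” means here.

    import random
    import zlib

    random.seed(1)  # fixed seed so the demonstration is repeatable

    clean = "AB" * 500   # very redundant: 1000 characters of pure pattern
    noisy = "".join(
        random.choice("abcdefghijklmnopqrstuvwxyz") if random.random() < 0.2 else ch
        for ch in clean
    )

    print(len(zlib.compress(clean.encode())))  # small
    print(len(zlib.compress(noisy.encode())))  # considerably larger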
The way this is used to refute bozos who claim things like “You can’t create information” should be obvious.

Dembski's Profound Lack of Comprehension of Information Theory

I was recently sent a link to yet another of Dembski’s wretched writings about specified complexity, titled Specification: The Pattern That Signifies Intelligence.
While reading this, I came across a statement that actually changes my opinion of Dembski. Before reading this, I thought that Dembski was just a liar. I thought that he was a reasonably competent mathematician who was willing to misuse his knowledge in order to prop up his religious beliefs with pseudo-intellectual rigor. I no longer think that. I’ve now become convinced that he’s just an idiot who’s able to throw around mathematical jargon without understanding it.
In this paper, as usual, he’s spending rather a lot of time avoiding defining specification. Purportedly, he’s doing a survey of the mathematical techniques that can be used to define specification. Of course, while rambling on and on, he manages to never actually say just what the hell specification is – just goes on and on with various discussions of what it could be.
Most of which are wrong.
“But wait”, I can hear objectors saying. “It’s his theory! How can his own definitions of his own theory be wrong? Sure, his theory can be wrong, but how can his own definition of his theory be wrong?” Allow me to head off that objection before I continue.
Dembski’s theory of specified complexity as a discriminator for identifying intelligent design relies on the idea that there are two distinct quantifiable properties: specification, and complexity. He argues that if you can find systems that possess sufficient quantities of both specification and complexity, then those systems cannot have arisen except by intelligent intervention.
But what if Dembski defines specification and complexity as the same thing? Then his definitions are wrong: because he requires them to be distinct concepts, but he defines them as being the same thing.
Throughout this paper, he pretty much ignores complexity to focus on specification. He’s pretty careful never to say “specification is this”, but rather “specification can be this”. If you actually read what he does say about specification, and you go back and compare it to some of his other writings about complexity, you’ll find a positively amazing resemblance.
But onwards. Here’s the part that really blew my mind.
One of the methods that he purports to use to discuss specification is based on Kolmogorov-Chaitin algorithmic information theory. And in his explanation, he demonstrates a profound lack of comprehension of anything about KC theory.
First – he purports to discuss K-C within the framework of probability theory. K-C theory has nothing to do with probability theory. K-C theory is about the meaning of quantifying information; the central question of K-C theory is: How much information is in a given string? It defines the answer to that question in terms of computation and the size of programs that can generate that string.
Now, the quotes that blew my mind:

Consider a concrete case. If we flip a fair coin and note the occurrences of heads and tails in
order, denoting heads by 1 and tails by 0, then a sequence of 100 coin flips looks as follows:

(R) 11000011010110001101111111010001100011011001110111
00011001000010111101110110011111010010100101011110.

This is in fact a sequence I obtained by flipping a coin 100 times. The problem algorithmic
information theory seeks to resolve is this: Given probability theory and its usual way of
calculating probabilities for coin tosses, how is it possible to distinguish these sequences in terms
of their degree of randomness? Probability theory alone is not enough. For instance, instead of
flipping (R) I might just as well have flipped the following sequence:

(N) 11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111.

Sequences (R) and (N) have been labeled suggestively, R for “random,” N for “nonrandom.”
Chaitin, Kolmogorov, and Solomonoff wanted to say that (R) was “more random” than (N). But
given the usual way of computing probabilities, all one could say was that each of these
sequences had the same small probability of occurring, namely, 1 in 2^100, or approximately 1 in
10^30. Indeed, every sequence of 100 coin tosses has exactly this same small probability of
occurring.
To get around this difficulty Chaitin, Kolmogorov, and Solomonoff supplemented conventional
probability theory with some ideas from recursion theory, a subfield of mathematical logic that
provides the theoretical underpinnings for computer science and generally is considered quite far
removed from probability theory.

It would be difficult to find a more misrepresentative description of K-C theory than this. This has nothing to do with the original motivation of K-C theory; it has nothing to do with the practice of K-C theory; and it has pretty much nothing to do with the actual value of K-C theory. This is, to put it mildly, a pile of nonsense spewed from the keyboard of an idiot who thinks that he knows something that he doesn’t.
But it gets worse.

Since one can always describe a sequence in terms of itself, (R) has the description

copy '11000011010110001101111111010001100011011001110111
00011001000010111101110110011111010010100101011110'.

Because (R) was constructed by flipping a coin, it is very likely that this is the shortest
description of (R). It is a combinatorial fact that the vast majority of sequences of 0s and 1s have
as their shortest description just the sequence itself. In other words, most sequences are random
in the sense of being algorithmically incompressible. It follows that the collection of nonrandom
sequences has small probability among the totality of sequences so that observing a nonrandom
sequence is reason to look for explanations other than chance.

This is so very wrong that it demonstrates a total lack of comprehension of what K-C theory is about, how it measures information, or what it says about anything. No one who actually understands K-C theory would ever make a statement like Dembski’s quote above. No one.
But to make matters worse – this statement explicitly invalidates the entire concept of specified complexity. What this statement means – what it explicitly says if you understand the math – is that specification is the opposite of complexity. Anything which possesses the property of specification by definition does not possess the property of complexity.
In information-theory terms, complexity is non-compressibility. But according to Dembski, in IT terms, specification is compressibility. Something that possesses “specified complexity” is therefore something which is simultaneously compressible and non-compressible.
The only thing that saves Dembski is that he hedges everything that he says. He’s not saying that this is what specification means. He’s saying that this could be what specification means. But he also offers a half-dozen other alternative definitions – with similar problems. Anytime you point out what’s wrong with any of them, he can always say “No, that’s not specification. It’s one of the others.” Even if you go through the whole list of possible definitions, and show why every single one is no good – he can still say “But I didn’t say any of those were the definition”.
But the fact that he would even say this – that he would present this as even a possibility for the definition of specification – shows that Dembski quite simply does not get it. He believes that he gets it – he believes that he gets it well enough to use it in his arguments. But there is absolutely no way that he understands it. He is an ignorant jackass pretending to know things so that he can trick people into accepting his religious beliefs.

The Problem with Irreducible Complexity (revised post from blogger)

As I mentioned yesterday, I’m going to repost a few of my critiques of the bad math of the IDists, so that they’ll be here at ScienceBlogs. Here’s the first: Behe and irreducible complexity. This isn’t quite the original blogger post; I’ve made a few clarifications and formatting fixes; but the content remains essentially the same. You can find the original post in my blogger information theory index. The original publication date was March 13, 2006.
Today, I thought I’d take on another of the intelligent design sacred cows: irreducible complexity. This is the cornerstone of some of the really bad arguments used by people like Michael Behe.
To quote Behe himself:

By irreducibly complex I mean a single system composed of several well-matched, interacting parts that contribute to the basic function, wherein the removal of any one of the parts causes the system to effectively cease functioning. An irreducibly complex system cannot be produced directly (that is, by continuously improving the initial function, which continues to work by the same mechanism) by slight, successive modifications of a precursor system, because any precursor to an irreducibly complex system that is missing a part is by definition nonfunctional. An irreducibly complex biological system, if there is such a thing, would be a powerful challenge to Darwinian evolution. Since natural selection can only choose systems that are already working, then if a biological system cannot be produced gradually it would have to arise as an integrated unit, in one fell swoop, for natural selection to have any thing to act on.

Now, to be clear and honest upfront: Behe does not claim that this is a mathematical argument. But that doesn’t mean that I don’t get to use math to shred it.
There are a ton of problems with the whole IC argument, but I’m going to take a different tack, and say that even if those other flaws weren’t there, it’s still a meaningless argument. Because from a mathematical point of view, there’s a critical, fundamental problem with the entire idea of irreducible complexity: you can’t prove that something is irreducibly complex.
This is a result of some work done by Greg Chaitin in Algorithmic Complexity Theory. A fairly nifty version of this can be found on Greg’s page.
The fundamental result is: given a system S, you cannot in general show that there is no smaller/simpler system that performs the same task as S.
As usual for algorithmic information theory, the proof is in terms of computer programs, but it works beyond that; you can think of the programs as the instructions to build and/or operate an arbitrary device.
First, suppose that we have a computing system φ, which we’ll treat as a function. So φ(x) = the result of running program x on φ. x is both a program and its input data coded into a single string, so x=(c,d), where c is code, and d is data.
Now, suppose we have a formal axiomatic system, which describes the basic rules that φ operates under. We can call this FAS.
If it’s possible to tell whether you have a minimal program using the axiomatic system, then you can write a program that examines other programs, and determines if they’re minimal. Even better: you can write a program that will generate a list of every possible minimal program, sorted by size.


Let’s jump aside for just a second to show how you can generate a list of every possible minimum program. Here’s a sketch of the program:
[figure: minimal.jpg – diagram of the program sketched below]

  1. First, write a program which generates every possible string of one character, then every possible string of two characters, etc., and outputs them in sequence.
  2. Connect the output of that program to another program, which checks each string that it receives as input to see if it’s a syntactically valid program for φ. If it is, it outputs it. If it isn’t, it just discards it.
  3. At this point, we’ve got a program which is generating every possible program for φ. Now, remember that we said that using FAS, we can write a program that tests an input program to determine if it’s minimal. So, we use that program to test our inputs, to see if they’re minimal. If they are, we output them; if they aren’t, we discard them.

Now, let’s take a second, and write out the program in mathematical terms:
Remember that φ is a function modeling our computing system, FAS is the formal axiomatic system. We can describe φ as a function from a combination of program and data to an output: φ(c,d)=result.
In this case, c is the program above; d is FAS. So φ(c,FAS)=a list of minimal programs.


Now, back to the main track.
Using the program that we sketched above, given any particular length, we can easily generate programs larger than that length.
Take our program, c, and our formal axiomatic system, FAS, and compute their
length. Call that l(c,FAS). If we know l(c,FAS), we can run φ(c,FAS) until it generates a program longer than l(c,FAS).
OK. Now, write a program c’ for φ that runs φ(c,FAS) until it finds a program K whose length is larger than l(c,FAS) + length(c’). c’ then runs K, and outputs the same thing as φ(K).
This is the tricky part. What does this program do? It runs a program which generates a sequence of provably minimal programs, and scans that sequence until it finds one larger than itself plus all of its data. Then it runs that program, and emits its output.
So – c’ outputs the same result as a supposedly minimal program K, where K is larger than c’ and its data. But since c’ is a program which emits the same result as K, but is smaller, then K cannot be minimal.
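To make the shape of that argument explicit, here’s a heavily hedged sketch in Python-flavored pseudocode. The checker `provably_minimal` is the hypothetical procedure that the argument assumes we have; the whole point of the proof is that no such checker can actually be written, which is why the stubs below are left deliberately unimplemented.

    # Sketch of the contradiction. provably_minimal() is the hypothetical
    # checker that a formal axiomatic system FAS would give us; the argument
    # shows it cannot exist, so these stubs are deliberately unimplementable.

    def all_programs():
        """Enumerate every syntactically valid program, shortest first.
        (This part genuinely is doable.)"""
        raise NotImplementedError

    def provably_minimal(p, FAS):
        """Hypothetical: use FAS to prove that no program shorter than p
        produces the same output as p."""
        raise NotImplementedError

    def run(p):
        """Hypothetical: run program p and return its output."""
        raise NotImplementedError

    def c_prime(FAS, own_total_length):
        # own_total_length = length of this program plus its data (c, FAS)
        for K in all_programs():
            if provably_minimal(K, FAS) and len(K) > own_total_length:
                # K is "provably minimal", yet this shorter program
                # reproduces its output -- so K was not minimal after all.
                return run(K)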
No matter what you do – no matter what kind of formal system you’ve developed for showing that something is minimal – you’re screwed. Gödel just came in and threw a wrench into the works. There is no general way to show that a system is minimal; the idea of a procedure for doing it is intrinsically contradictory.
Evil, huh?
But the point of it is quite deep. It’s not just a mathematical game. We can’t tell when something complicated is minimal. Even if we knew every relevant fact about all of the chemistry and physics that affects things, even if the world were perfectly deterministic, we can’t tell when something is as simple as it can possibly be.
So irreducible complexity is useless in an argument, because we can’t know when something is irreducibly complex.