Monthly Archives: June 2006

Friday Random Ten, June 30

It’s that time of the week again, when I bore you with my bizzare taste in music. Quite an eclectic mix this week.

  1. Spock’s Beard, “Thoughts”. A track from an oldish Spock’s Beard album. SB is an American neoprog band, which sounds something like a blend of old Genesis, Kansas, and Rush. Very good band. This isn’t my favorite of their albums (that would be “V”).
  2. Gentle Giant, “Way of Life”. A classic song off of a classic album.
  3. Whirligig, “Mister Fox”. An interesting little ballad by a wonderful NYC based Irish band.
  4. Peter Gabriel, “San Jacinto”. Peter Gabriel at his absolute best. He’s never done anything to match the “Security” album, and this is one of my favorite tracks off of there. Starts off mellow and kind of mysterious sounding, and gradually builds, and then fades.
  5. The Clogs, “Lady Go”. A track with vocals from one of those “post-rock ensembles” that I love so much. Very strange sounding; partly a capella falsetto; lots of dark rythmic stiff in other parts.
  6. Broadside Electric, “Tam Lin”. The old traditional ballad performed by a really cool local electric folk band. (And one of the members of the band is actually a math professor at Suny Stonybrook! But she hadn’t joined yet on this album.)
  7. Mel Brooks & broadway cast of “The Producers”, “Springtime for Hitler”. The original producers is one of my all-time favorite comedy movies. I still haven’t managed to get in to see the show. But the soundtrack is absolutely brilliant.
  8. Psychograss, “Looks like a Duck”. Psychograss is a thoroughly amazing band: Tony Trischka, David Grier, Mike Marshall, Darol Anger, and Todd Phillips. They’re mostly bluegrass, but with various strange influences mixed in. This track has some of the most subtly amazing banjo playing you’ll ever hear, not to mention a knockout fiddle bit at the end.
  9. John Corigliano (performed by Stanley Drucker), “Clarinet Concerto, movement ii: Antiphonal Toccata”. I’m actually a classically trained clarinetist. I used to think that I didn’t like Stan Drucker’s playing. Then I heard this. I’ve since learned that while his performances of some of the old classical standards for Clarinet (Mozart’s Clarinet Concerto, the Weber concertos, etc.) are rather uninspired, he is utterly magnificent when it comes to modern music. He clearly loves playing the newer stuff, and it shows. This is also the most technically challenging piece for Clarinet that I’ve ever heard.
  10. Vasen, “Sluken”. Vasen is a Swedish folk band. The lead player plays a peculiar instrument called the Nyckelharpa – it’s a violin with a keyboard. They’re a great band, especially if you get to see them live.

One more plug for DonorsChoose

This is the last time I’m going to bug folks to remind them to donate to the SB challenges.
The DonorsChoose fundraiser here at ScienceBlogs is just about over. Three more days for you to help some kids get a good education in math and science. The GoodMath/BadMath challenge is here; and Janet has a rundown on the challenges that are close to their goals. (If the challenge is met, DonorsChoose will add in an extra 5% bonus.)
As an extra incentive, for the next 10 people who donate to the GM/BM challenge, if you send me a copy of your DonorsChoose receipt, I’ll let you pick one topic for me to write a post about. The only restriction is that the topic be related to some kind of math – good or bad – and that it’s legit. (So don’t send me some good math and ask me to write a bad math post about it.)
Please, do go over to DC, and donate whatever you can afford, As I said at the very beginning of our drive, I’ve taught kids from the kinds of schools that we’re trying to help. There’s something deeply wrong with a school where you have kids who want to learn, but they don’t really get the chance, because they don’t have the things they need – like textbooks.

Dishonest Dembski:the Universal Probability Bound

Dishonest Dembski:the Universal Probability Bound
One of the dishonest things that Dembski frequently does that really bugs me is take bogus arguments, and dress them up using mathematical terminology and verbosity to make them look more credible.
An example of this is Dembski’s *universal probability bound*. Dembski’s definition of the UPB from the [ICSID online encyclopedia][upb-icsid] is:
>A degree of improbability below which a specified event of that probability
>cannot reasonably be attributed to chance regardless of whatever
>probabilitistic resources from the known universe are factored in. Universal
>probability bounds have been estimated anywhere between 10-50 (Emile Borel)
>and 10-150 (William Dembski).
He’s quantified it in several different ways. I’ve found three different versions of the calculation of the UPB: two of them from wikipedia; one is from a message thread at ICSID which the author claims is a quote from one of Dembski’s books.
Let’s look at Dembski’s own words first:
>Specifically, within the known physical universe there are estimated to be no
>more than 1080 elementary particles. Moreover, the properties of matter are
>such that transitions from one state to another cannot occur at a rate faster
>that 1045 times per second. Finally, the universe itself is about a billion
>times younger than 1025 seconds (assuming the universe is around 10 to 20
>billion years old). ….these cosmological constraints imply that the total
>number of specified events throughout cosmic history cannot exceed
>1080 * 1045 x 1025 = 10150.
He goes on to assert that this is the “maximum number of trials” that could have occurred since the beginning of the universe, and that for anything less likely than that which is observed to occur, it is not reasonable to say it is caused by chance.
Wikipedia presents this definition, and a more recent one which lowers the UPB, but as they don’t provide all of the details of the equation, I’ll skip it for now. Wikipedia’s explanation of this original form of the UPB is:
>Dembski’s original value for the universal probability bound is 1 in 10150,
>derived as the inverse of the product of the following approximate
> * 1080, the number of elementary particles in the observable
> universe.
> * 1045, the maximum rate per second at which transitions in
> physical states can occur (i.e., the inverse of the Planck time).
> * 1025, a billion times longer than the typical estimated age of
> the universe in seconds.
>Thus, 10150 = 1080 × 1045 × 1025.
>Hence, this value corresponds to an upper limit on the number of physical
>events that could possibly have occurred since the big bang.
Here’s the fundamental dishonesty: None of those numbers have *anything* to do with what he’s supposedly trying to prove. He’s trying to create a formal-sounding version of the big-number problem by throwing together a bunch of fancy-sounding numbers, multiplying them together, and claiming that they somehow suddenly have meaning.
But they don’t.
It’s actually remarkably easy to show what utter nonsense this is. I’ll do a fancy one first, and a trivial one second.
Let’s create an incredibly simplified model of a region of space. Let’s say we have a cube of space, 1 kilometer on a side. Further, let’s suppose that this space contains 1000 particles, and they are all electrons. And further, let’s suppose that each 1mm cube in this cubic kilometer can only have one electron in it.
This is a model which is so much simpler than reality that it’s downright silly. But everything about the real world would make it more complex, and it’s sufficient for our purposes.
Now: consider the probability of any *configuration* of the electrons in the region of space. A configuration is a selection of the set of 1mm cubes that contain electrons. The number of different configurations of this region of space is (109!)/((1000!)*(109-1000)!). That works out to (109*(109-1)*(109-2)*…*(109-1000))/(1000!).
1000! is roughly 4×102568 according to my scheme interpreter. We’ll be generous, and use 1×102569, to make things easier. To estimate the numerator, we can treat it as (109)*((108)999), which will be much smaller. That’s 107801. So the probability of any particular configuration within that cube is 1 in 105232.
So any state of particles within that cube is an event with probability considerably smaller than 1 in 105232. So what Dembski is saying is that *every* possible configuration of matter in space in the entire universe is impossible without intelligent intervention.
And the trivial one? Grab two decks of distinguishable cards. Shuffle them together, and lay them out for a game of spider solitaire. What’s the probability of that particular lay of cards? 104! , or, very roughly, something larger than 1×10166. Is god personally arranging ,my cards every time I play spider?
Anyone who’s ever taken any class on probability *knows* this stuff. One college level intro, and you know that routine daily events can have incredibly small probabilities – far smaller than his alleged UPB. But Dembski calls himself a mathematician, and he writes about probability quite frequently. As much as I’ve come to believe that he’s an idiot, things like this just don’t fit: he *must* know that this is wrong, but he continues to publish it anyway.

Skewing Statistics for Politics

As I’ve frequently said, statistics is an area which is poorly understood by most people, and as a result, it’s an area which is commonly used to mislead people. The thing is, when you’re working with statistics, it’s easy to find a way of presenting some value computed from the data that will appear to support a predetermined conclusion – regardless of whether the data as a whole supports that conclusion. Politicians and political writers are some of the worst offenders at this.
Case in point: over at [powerline][powerline], they’re reporting:
>UPI reports that Al Gore’s movie, An Inconvenient Truth, hasn’t done so well
>after a promising start:
>> Former U.S. vice-President Al Gore’s documentary “An Inconvenient Truth”
>>has seen its ticket sales plummet after a promising start.
>>After Gore’s global warming documentary garnered the highest average per play
>>ever for a film documentary during its limited Memorial Day weekend opening,
>>recent theater takes for the film have been less than stellar, Daily Variety
>> The film dropped from its record $70,333 per play to $12,334 during its
>>third week and its numbers have continued to fall as the film opens in smaller
>>cities and suburbs across the country.
>It’s no shock, I suppose, that most people aren’t interested in seeing
>propaganda films about the weather. But the topic is an interesting and
>important one which we wrote about quite a few years ago, and will try to
>return to as time permits.
So: they’re quoting a particular figure: *dollars per screen-showing*, as a measure of how the movie is doing. The thing is, that’s a pretty darn weird statistic. Why would they use dollars/screen-showing, instead of total revenue?
Because it’s the one statistic that lets them support the conclusion that they wanted to draw. What are the real facts? Official box office statistics for gross per weekend (rounded to the nearest thousand):
* May 26: $281,000 (in 4 theaters)
* June 2: $1,356,000 (in 77 theaters)
* June 9: $1,505,000 (in 122 theaters)
* June 16: $1,912,000 (in 404 theaters)
* June 23: $2,016,000 (in 514 theaters)
Each weekend, it has made more money than the previous weekend. (Those are per weekend numbers, not cumulative. The cumulative gross for the movie is $9,630,000.)
But the per showing gross has gone down. Why? Because when it was first released, it was being shown in a small number of showings in a small number of theaters. When it was premiered in 4 theaters, they sold out standing room only – so the gross per showing was very high. Now, four weeks later, it’s showing in over 500 theaters, and the individual showings aren’t selling out anymore. But *more people* are seeing it – every weekend, the number of people seeing it has increased!
The Powerline article (and the UPI article which it cites) are playing games with numbers to skew the results. They *want* to say that Al Gore’s movie is tanking in the theaters, so they pick a bizzare statistic to support that, even though it’s highly misleading. In fact, it’s one of the best performing documentaries *ever*. It’s currently the number seven grossing documentary of all time, and it’s about $600,000 off from becoming number 5.
What was the per-theater (note, not per showing, but per theater) gross for the last Star Wars movie four weeks into its showing? $4,500/theater (at 3,322 theaters), according to [Box Office Mojo][bom]. So, if we want to use reasoning a lot like powerline, we can argue that Al Gore’s movie is doing *as well as Star Wars* an a dollars/theater-weekend basis.
But that would be stupid, wouldn’t it.

Good Math, Bad Math might be in trouble later this week

I just received some email that would seriously worry me if it weren’t for the fact that I’m not an idiot.

There is NO TIME to waste! Go read!!!!!!

Going there, I find:

We are now 98% confident that the UN Plaza will be hit by a terrorist nuclear bomb between Thursday evening June 29th and Tuesday evening July 4th, 2006
It is certainly true that: No nukes is good nukes! But just because we got the date wrong (3 times) does not mean that the scriptural threat has evaporated. It is still there in black and white in bible symbolism. So we still have the almost impossible task of persuading a typical New Yorker with faith in God, that the Bible predicts the very day and place of the first terrorist nuke. There is obviously a massive credibility gap between: “Here endeth the lesson” and “Here endeth NYC”. But every journey, however long, begins with one small step. So here is our attempt to fill that gap.
Firstly we again strongly advise anyone in New York City with any faith in God, whatever his religion or whatever his distrust of organised religion, to take the last Thursday in June off and to get out of NYC for that weekend and not come back until the evening of July 4 if nothing happens.
You can then study this fascinating article outside NYC at your leisure, during that weekend or more to the point, after that weekend! It is going to be hard to find the necessary time during the next few days, given the busy schedule of every New Yorker, to sit down and fully analyze the fruits of 14 years of bible decoding and reach a rational decision about such a momentous prediction. So the sensible course of action might be to judge for yourself whether we are sincere in our efforts to decode the bible. And if you see that we are sincere, then rather than taking an intellectual walk from basic faith to accurate bible prophecy, just rely on all of the work that we have so far done and on the basis of your faith and our sincerity, take the weekend out of the city.

You gotta love this. Not only is it a splendid example of worse kind of pseudo-numerological gibberish, but they admit to having been wrong three times already. But this time, this time!, they’ve really got it right! We should trust them and get the hell of of NYC this weekend, no matter what it costs!

Categories: Products, Exponentials, and the Cartesian Closed Categories

Before I dive into the depths of todays post, I want to clarify something. Last time, I defined categorical products. Alas, I neglected to mention one important point, which led to a bit of confusion in the comments, so I’ll restate the important omission here.
The definition of categorical product defines what the product looks like *if it’s in the category*. There is no requirement that a category include the products for all, or indeed for any, of its members. Categories are *not closed* with respect to categorical product.
That point leads up to the main topic of this post. There’s a special group of categories called the **Cartesian Closed Categories** that *are* closed with respect to product; and they’re a very important group of categories indeed.
However, before we can talk about the CCCs, we need to build up a bit more.
Cartesian Categories
A *cartesian category* C (note **not** cartesian *closed* category) is a category:
1. With a terminal object t, and
2. ∀ a, b ∈ Obj(C), the objects and arrows of the categorical product a×b are in C.
So, a cartesian category is a category closed with respect to product. Many of the common categories are cartesian: the category of sets, and the category of enumerable sets, And of course, the meaning of the categorical product in set? Cartesian product of sets.
Categorical Exponentials
Like categorical product, the value of a categorical exponential is not *required* to included in a category. Given two objects x and y from a category C, their *categorical exponential* xy, if it exists in the category, is defined by a set of values:
* An object xy,
* An arrow evaly,x : xy×y → x, called an *evaluation map*.
* ∀ z ∈ Obj(C), an operation ΛC : (z✗y → x) → (z → xy). (That is, an operation mapping from arrows to arrows.)
These values must have the following properties:
1. ∀ f : z×y → x, g : z → xy:
* valy,x º (ΛC(f)×1y)
2. ∀ f : z×y → x, g : z → xy:
* ΛC(evaly,x *ordm; (z×1y)) = z
To make that a bit easier to understand, let’s turn it into a diagram.
You can also think of it as a generalization of a function space. xy is the set of all functions from y to x. The evaluation map is simple description in categorical terms of an operation that applies a function from a to b (an arrow) to a value from a, resulting in an a value from b.
(I added the following section after this was initially posted; a commenter asked a question, and I realized that I hadn’t explained enough here, so I’ve added the explanation.
So what does the categorical exponential mean? I think it’s easiest to explain in terms of sets and functions first, and then just step it back to the more general case of objects and arrows.
If X and Y are sets, then XY is the set of functions from Y to X.
Now, look at the diagram:
* The top part says, basically, that g is a function from Z to to XY: so g takes a member of Z, and uses it to select a function from Y to X.
* The vertical arrow says:
* given the pair (z,y), f(z,y) maps (z,y) to a value in X.
* given a pair (z,y), we’re going through a function. It’s almost like currying:
* The vertical arrow going down is basically taking g(z,y), and currying it to g(z)(y).
* Per the top part of the diagram, g(z) selections a function from y to x. (That is, a member of XY.)
* So, at the end of the vertical arrow, we have a pair (g(z), y).
* The “eval” arrow maps from the pair of a function and a value to the result of applying the function to the value.
Now – the abstraction step is actually kind of easy: all we’re doing is saying that there is a structure of mappings from object to object here. This particular structure has the essential properties of what it means to apply a function to a value. The internal values and precise meanings of the arrows connecting the values can end up being different things, but no matter what, it will come down to something very much like function application.
Cartesian Closed Categories
With exponentials and products, we’re ready for the cartesian closed categories (CCCs):
A Cartesian closed category is a category that is closed with respect to both products and exponentials.
Why do we care? Well, the CCCs are in a pretty deep sense equivalent to the simply typed lambda calculus. That means that the CCCs are deeply tied to the fundamental nature of computation. The *structure* of the CCCs – with its closure WRT product and exponential – is an expression of the basic capability of an effective computing system.
We’re getting close to being able to get to some really interesting things. Probably one more set of definitions; and then we’ll be able to do things like show a really cool version of the classic halting problem proof.

A Great Quote about Methods

On my way home from picking up my kids from school, I heard a story on NPR that included a line from one of the editors of the New England Journal of Medicine, which I thought was worth repeating here.
They were discussing an article in this month’s NEJM about [vioxx/rofecoxib][nejm]. The article is a correction to an earlier NEJM article that concluded that the cardiac risks of Vioxx were not really significant until around 18 months of continued use. With more data available, it appears that the 18 month was just an artifact, not a real phenomenon, and the data appears to show that the cardiac risks of Vioxx start very quickly.
As usual for something like this, the authors and the corporate sponsors of the work are all consulted in the publication of a correction or retraction. But in this case, the drug maker objected to the correction, because they wanted the correction to include an additional analysis, which appears to show an 18-month threshold for cardiac risk.
The editors response, paraphrased (since I heard it on the radio and don’t have the exact quote):
“That’s not how you do science. You don’t get all of the data, and then
look for an analysis that produces the results you want. You agree on the
analysis you’re going to do *before* you start the study, and then you use
the analysis that you said you were going to use.”
Hey [Geiers][geier], you listening?

Validity of Mathematical Modeling

In the comments to one of my earlier [Demsbki][demsbki-info-theory] posts, I was responding to a comment by a poster, and realized that what we were discussing was probably interesting enough to promote up to a front-page post.
A poster named [Garote][garote] said:
>I think PaulC said exactly what needs to be said, that grants us all
>hearty permission to ignore Dembski’s work. I think it bears repeating:
>”If I am betting on coin flips and notice any kind of algorithmic pattern then
>there are legitimate reasons to conclude that the person flipping coins is not
>playing fair. However, this method does not allow one to discriminate between
>the product of evolution and the product of “intelligent design” since neither
>of these processes in any way resemble random coin flips.”
>Dembski’s feverishly comparing apples to oranges, to prove that one is an
Now, my contempt for Dembski is pretty clear. But this argument against him struck me as making a mistake – and it’s a mistake that strikes at something near-and-dear to my interests: mathematical modeling. Modeling can be sorely abused, [as I’ve discussed before][modeling]; but I don’t want to throw out the baby with the bathwater.
My whole viewpoint on mathematics is that much of its value comes from **abstraction**: the ability to look at something complex; to identify a set of relevant properties for understanding some feature of it; and to use that to create a simpler mathematical model of the complex system based on its fundamental principles. That kind of abstraction isn’t just useful for diddling around in the world of pure math, but for much of how math is used in science. There are a lot of interesting things that we can’t grasp in their entirety; but that we can understand by breaking them down through abstraction into a set of simpler facets that we can study.
I don’t want to throw away the validity of the concept of abstraction and mathematical modeling; but one way of reading that comment is as an argument against the concept of working with a simplified abstract model to understand something complex. (Note that I’m not saying that that’s what Garote meant; but I thought that focusing on this could produce an interesting discussion.)
In Dembski’s favor, I can see the “coin-flip” thing is an abstraction of a complex process to something simple that contains a relevant feature. The thing is, there are things in nature that produce highly random results; and there are other things that produce interesting patterns. Recognizing that there’s something different about processes that produce patterns could be an interesting thing; and recognizing the distinction between randomness and pattern is an interesting subject which probability theorists have spent time studying; it’s right at the intersection between probability theory and information theory, and it’s produced some interesting insights about what information is and what it means.
The problem WRT to Demsbki isn’t that he’s reducing a problem to a mathematical model through abstraction and then analyzing the model. The problem is that when you build a mathematical model, you *need to make sure that your abstraction captures all of the relevant features of what you’re modeling*. And when you draw conclusions based on the analysis of the mathematical model, *you need to make sure that your abstraction isn’t the source of the conclusion*. Both of those are different ways of stating the fundamental rule of mathematical modeling:

**Mathematical models must be validated against the thing that they model.**

Demsbki’s reduction of the process of analyzing certain features of the universe to the a metaphor involving recognition of patterns in coin flips is fine with me. The problem is that he creates a model that *omits important elements* of the real world, and then insists that the *conclusions drawn from his abstract model are applicable to the real world*. But in fact, his conclusions are the result of properties of his model, *not* of the real world. The validation of his model fails: the specific conclusions are the result of eliminating things from his model that might permit a different conclusion.
To continue with the example of the coin flip metaphor: if you’re studying patterns, you could abstract the problem of observing patterns to one of observing flipping coins in an idealized universe, where the coin is perfectly even, and nothing about the surface that it lands on can affect the outcome of a flip. That might be a useful model for trying to understand the difference between random sequences of flips, and patterns in sequences.
If you then try to use that model to determine the *cause* of a pattern, you’ll conclude that the only possible cause of a pattern *is the actions of the coin flipper*. If you’ve abstracted away everything that could influence the outcome of the coin-flip except the influence of the coin-flipper, then *in your model*, concluding that the flipper is the only possible source of a pattern could be a reasonable conclusion.
That doesn’t mean that you can then say that in the real world, the *only possible cause* of a pattern in a sequence of coin-flips is some action taken by the coin-flipper. You can’t apply that simplified abstraction of your model to the real world without showing that it’s a valid model with respect to the the properties that affected your conclusion.
The ultra-simple coin flip model isn’t valid for the real world, because it deliberately omits factors of the real world that could influence the conclusion. In the real world, there are unfair coins (both deliberately unfair coins, and coins that have become unfair either through minting flaws or through wear and tear). There are magnetic fields. There are irregular surfaces.
The problem with many of the conclusions that bad mathematicians draw from mathematical models is that *the models don’t represent what they claim to represent*. They abstract away relevant features of reality, and then demand that reality doesn’t possess those features, because they aren’t in the model.

Strange Connections

Here at SB, we use Google analytics for getting info about how many people are reading our blogs, and how they get here. I also have a SiteMeter monitor on GM/BM. One thing that I get a kick out of is taking a look at my hits, and seeing what kinds of interesting connections come up. Sometimes it’s funny; sometimes it’s informative, sometimes it’s just depressing.
So last night, I was unwinding after putting my kids to bed, and was taking a look. The interesting/amusing connections I found:
1. The number one search term leading people to Good Math, Bad Math? “pharyngula”. PZ, I hate you! (And yes, I’m deliberately not linking. 🙂 )
2. The number one persons name in searches that lead people to GM/BM? “Ken Ham”. Now *that* is depressing. Ken Ham, who I’ve never written about, directs more hits to my blog than my own name.
3. The strangest search term that led readers here: “butt propeller”. And not just once; three separate visits on three separate days got here via searches for “butt propeller”. I’m really not sure what to make of this one. Probably related to my post on Swinburne where I talked about monkeys flying out of my butt as a metaphor.
4. An interesting connection: I posted something yesterday about a [really bad probability argument][bad-prob] for Christianity. I got the link through email from a reader. It turns out that it’s on a fairly obscure site that [Orac][orac] had linked to in a friday post about Holocaust denial. The reader who sent me the link is definitely one of Orac’s readers as well; so the nasty probability argument was discovered through an unrelated subject on scienceblogs. I like the SB networking aspect of that; on the other hand, I wish I’d looked at the site hosting the wretched argument that I mocked; I’d rather not have given a bunch of asshole holocaust deniers the publicity.
5. Back on GM/BMs [old home][gmbm-blogger] on blogger, I did a couple of posts about a crackpot named Gary Osborne. Gary came to the blog to “defend” himself. (His idea of defending himself is call other people names, and then complain that their criticisms of him are ad hominems.) An interesting thing that I noticed was that every time he posted on the blog, I’d see referrals from a rather obscure british search engine with Gary’s name as the search keyword. After he stopped responding, the hits from that engine went away. Suddenly, I’m seeing a bunch of referrals from that engine, and I discovered that Gary has, once again, posted a [web-page][osborne-declares-victory] on a site without comments discussing his wonderful victory over me, and various GM/BM readers who joined the original discussion. Gotta love people who run away from debates and declare victory, eh?
6. Most disturbing search terms that found my blog?: “juniper lee cartoon porn”. I have no idea who Juniper Lee is. Frankly, I’m too scared to do the search and find out.
7. Unexpected linkages: 65 page views referred from [animalcules][animalcules]. Since I’ve never submitted any of my posts to that, and I didn’t know of anyone linking to me in articles in animalcules, I was surprised (in a good way) by that.
8. Another very surprising one: a search for “daily show” and “kurt vonnegut” led people to GM/BM twice.
9. A couple of frequent commenters here are frequent trollers on Panda’s Thumb. Since I frequently don’t pay a lot of attention to the names of commenters on other blogs, unless I get into a discussion with them, I hadn’t noticed the connection.
10. Three different people set up blogs specifically to respond to something I posted. Each has exactly one post.

An Introduction to Information Theory (updated from Blogspot)

Back when I first started this blog on blogspot, one of the first things I did was write an introduction to information theory. It’s not super deep, but it was a decent overview – enough, I thought, to give people a good idea of what it’s about, and to help understand why the common citations of it are almost always misusing its terms to create bizzare conclusions, like some [“law of conservation of information”,][conservation]
[conservation]: “wikipedia on Dembski’s law of CoI”
This is a slight revision of that introduction. For the most part, I’m just going to clean up the formatting, but once I’m going through it, I’ll probably end up doing some other polishing.
So what is information theory?
Information theory is a relatively new field of mathematics that tries to characterize what information is in a quantifiable way. It’s an area of math that was almost totally obscure until not too long ago, and one which is frequently misunderstood, even by people who mean well.
A bit of history
Modern information theory comes from two very different sources: Shannon’s information theory, and Kolmogorov/Chaitin algorithmic information theory. Fortunately, they both converge, mostly, on the same place.
### Shannon’s Information Theory
The more famous branch of IT is Claude Shannon’s information theory, as presented in his infamous paper, [Communication in the Presence of Noise][shannon]. Shannon’s interest came from his work at Bell Labs for AT&T, which wanted a way of being able to figure out how much wire they needed to lay for the telephone network.
AT&T was interested because for a telephone company, laying wire in an incredibly expensive operation. The wire itself (especially back in the ways when it was bundles of copper) is quite expensive; the process of running and burying the cables is even more expensive. So ideally, as a phone company, you want to run the cables once; you want to lay enough that you’ll never have to come back and do it again; and you want to lay the smallest amount that will do the job, since the materials are so expensive.
The problem that they wanted Shannon to help solve was, how could they determine how much wire did they actually need to lay? It’s actually an astonishingly tricky question. The less they laid, the less it cost them in the short term. But they *knew* that the number of phone lines was going to increase dramatically – so if they were cheap, and laid too little wire, when the number of phones exceeded the capacity of what they had already laid, they’d have to go back and lay more. So they wanted to be able to make a good prediction of the minimum amount of wire they could lay that would meet their needs both at the time it was laid, and in the future.
But there’s a big trick to that. First: how much information can a single wire carry? And when you bundle a bunch of wires together, how much can you pump over a single wire without it interfering with signals on it’s neighbors in the bundle?
That’s what Shannon was trying to understand. He was trying to find a mathematical model that he could use to describe information transmission, including the effects of imperfections in transmission, including the introduction of noise and interference. He wanted to be able to quantify information in a meaningful way that would allow him to ask questions like: “At what point does noise in the line start to eliminate the information I want to transmit?”.
The answer is just fascinating: a given communication channel has a capacity for carrying information; and a given message has a quantity of information in it. Adding noise to the signal *adds* information to message *until* the capacity of the channel is reached, at which point information from the noise will start to *replace* information from the message; at that point, you can say that information from the message will start to be lost.
Shannon called the information content of a signal it’s *entropy*, because he saw a similarity between his information entropy and thermodynamic entropy: in a communicating system, entropy *never* decreases: it increases until the capacity of the channel is reached, and then it stays content. You can’t reduce the amount of information in a message. (There are various rumours that in fact this choice of terminology was actually suggested by von Neumann himself.)
That naming led directly to the most common misuse of information theory. But that’s a post for another day.
### Algorithmic Information Theory
The other main branch of information theory came from a very different direction. The two pioneers were Andrey Kolmogorov, and Greg Chaitin. Kolmogorov and Chaitin didn’t know each other at the time they independently invented the same mathematical formalization (in fact, Chaitin was a teenager at the time!), a fact which led to some friction.
In the interests of honesty, I’ll say that Greg Chaitin works in the same building that I do, and I’ve met him a number of times, and been greatly impressed by him, so my description is heavily influenced by Greg.
Anyway – algorithmic information theory grew out of some of the fundamental mathematics being done at the beginning of the 20th century. There was an effort to create a single, fundamental, set of mathematical axioms and inference rules, which was both complete (every true statement was provably true), and consistent (every well-formed statement was either true or false). This effort failed, miserably; the nail in the coffin was Gödel’s incompleteness theory. Gödel presented an extremely complicated proof that showed, essentially, that no formal system could be both complete and consistent. (See my recent article about [Extreme Math][extreme] for more about this.)
[extreme]: “1+1=2”
Most people saw Gödel’s proof, did the equivalent of saying “Oh, shit!”, and then pretended that they hadn’t seen it. It does mean that there are fundamental limits to what you can do using mathematical formalization, but for the most part, they don’t affect you in normal day to day math. (Sort of the way that car designers didn’t change the way they build cars because of relativity. Yes, it showed that the fundamental physical model that they were using was wrong – but at the speeds that cars move, that wrongness is so small that it just doesn’t matter.)
But for people in the early stages of what would become computer science, this was a big deal. One of the early hopes was that the mathematical system would produce a “decision procedure”, a mechanical process by which a computing device could check the truth or falsehood of any mathematical statement. Gödel showed that it couldn’t be done.
But the early computer scientists – in particular, Alan Turing – embraced it. It led directly to two of the fundamental rules of computer science:
1. The Church-Turing Thesis: all mechanical computing systems are basically the same: there is a fundamental limit to what is computable, and any “acceptable” system can compute anything up to that limit – so if you can find a way of doing it on any computing device, then you can, ultimately, do it on every acceptable computing device.
2. The Halting Problem: there are some things that you cannot program a device to do. In particular, you cannot write a program for any computing device that examines another program, and tells you if that other program will eventually stop.
The halting problem turns out to say *exactly* the same thing as the Gödel incompleteness theorem. Only you can write a proof of it that a freshman college student can understand in ten minutes, instead of a proof that the average math grad student will need a semester to plow through.
Chaitin and Kolmogorov saw this as a big deal: using an algorithmic approach to how to process information, you could prove something very simply, which would require a much more complicated proof using other methods.
K-C information theory grew out of this. According to Kolmogorov and Chaitin, the fundamental amount of information contained in a string (called the string’s information entropy after Shannon), is the shortest string of program + data for a computing device that can generate that string.
(If you’re interested in Gödel’s incompleteness theorem, then Hofstadter’s
[Godel, Escher, Bach: an Eternal Golden Braid][geb] is a really fun introduction. For K-C theory, Greg Chaitin has written a bunch of really [great][chaitin1] [books][chaitin2] for mathematical laymen.
[geb]: “GEB amazon link”
Information and Entropy
Now that I’ve got a bit of background and origins done, it’s time to get to some of the fun stuff.
As I said yesterday, both of the big branches of information theory have a concept of *entropy*. While the exact presentations differ, because they’re built on somewhat different theoretical foundations, the basic meaning of entropy in both branches is pretty much the same. I’m going to stick with the terminology of the Kolmogorov-Chaitin branch, but the basic idea is the same in either.
In K-C theory, we talk about the information content of *strings*. A string is an ordered sequence of characters from a fixed alphabet. It doesn’t really matter what alphabet you choose; the point of this kind of theory isn’t to come up with a specific number describing the complexity of a string, but to be able to talk about what information means in a formal algorithmic sense.
The K-C definition of the entropy of a string is the total quantity of information encoded in that string.
Here’s where it gets fun.
Another way of saying exactly the same thing is that the entropy of a string is a measure of the *randomness* of the string.
Structured strings are structured *precisely* because they contain *redundancy* – and redundancy does not add information.
Which string has more information: “XXXXYYYYY” or “4X5Y”? They have the same amount. One is a lot smaller than the other. But they contain the same amount of information.
Here’s a third way of saying exactly the same thing: the entropy of a string is the length of the smallest compressed string that can be decompressed into the original.
Here’s a better example. Go back and look at the first two paragraphs of yesterday’s post. It took 514 characters.
Here’s the same information, compressed (using gzip) and then made
readable using a unix utility called uuencode:
That’s only 465 characters; and if I didn’t have to encode it to stop
it from crashing your web-browser, it would only have been 319
characters. Those characters are a lot more random than the original –
so they have a higher information density. The closer we get
to the shortest possible compressed length, the higher the information
density gets.
Actually, I’m lying slightly; there’s an additional factor that needs to be considered in K-C theory.
Remember that K-C comes from the foundational mathematics of computer science, something called recursive function theory. We’re talking about strings being processed algorithmically by a program. So really, we get to talk about programs that generate strings. So it’s not just strings: you have a program and a data string. Really, you measure information content of a string as the total size of the smallest pairing of program and data that generates that string. But for most purposes, that doesn’t make that much difference: the combination of the program and the data can be encoded into a string.
In the two examples above, the program to follow is implied by the contents of the string; if we wanted to really model it precisely, we’d need to include the programs:
* For the string “XXXXYYYYY”, the program is roughly:
while there are more characters in the data:
read a character
output the character that you read
* For the string “4X5Y”, the program is roughly:
while there are more characters in the data:
read a number n from the data
read a character c from the data
output c n times
Now, specifics aside: the definitions I gave earlier are pretty much correct in both K-C and Shannon information theory: information content is proportional to randomness is proportional to minimum compressed length.
So – what happens if I take a string in very non-compressed format (like, say, an uncompressed digitized voice) and send it over a noisy phone line? Am I gaining information, or losing information?
The answer is: *gaining* information. Introducing randomness into the string is *adding* information.
“AAAABBBB” contains less information than “AAAABxBB”.
The way this is used to refute bozos who claim things like “You can’t create information” should be obvious.