If you regularly follow comments on this blog, you’ll know that I’ve been
having a back-and-forth with a guy who doesn’t know much about information
theory, and who’s been using his ignorance to try to assemble arguments against the
Kolmogorov-Chaitin information-theoretic measure of information in a string.
In the course of doing that, I came up with what I think are some interesting ways
of explaining bits of it, so I thought I’d promote it to the top-level to share
with more readers.
To be absolutely clear up front: I’m far from an expert on K-C theory. It’s something that I find incredibly fascinating, and I’ve read a lot of Chaitin’s work. I’ve been fortunate enough to meet Greg Chaitin a bunch of times when we both worked at
IBM, and I’ve attended a bunch of his lectures. But this isn’t my main area of expertise,
not by a long-shot.
If you don’t know any K-C information theory, then you can look at my
introduction here. The rest is beneath the fold.
Information vs. Meaning
Kolmogorov-Chaitin information theory is focused on
quantification – that is, on being able to measure the quantity of
information in a string of characters. It doesn’t care what the string
means. Information theory cares about how much information
is in a string; it doesn’t care what the string means. In fact, you can say
something much stronger about information theory: if you consider meaning
at all, then you’re not doing information theory. A truly
abstract string could mean many different things: information theory’s
measurement of its information content must include everything that
could be used relative to any possible system of meaning to determine
the information content of a string.
I’ll borrow a tiny bit of formality from denotational semantics. In
denotational semantics, we take a domain of objects and functions that range
over those objects, and we map statements – sentences that have some
syntactic structure – onto the objects and functions in the domain.
To assign meaning, what you’re basically doing is creating a system where
you can map from syntax – that is, from strings of symbols – to
objects and functions. The string of characters “Mark Chu-Carroll” has
absolutely no intrinsic meaning. But by interpreting in a domain
of meaning, we can say that the domain maps that string of symbols
in a way that makes it a representation of a reference to me. But there’s
nothing about that string that makes it necessarily be a reference
to me. It could mean many different things, depending on domain of meaning
which we use to interpret it.
To be a bit more specific, consider the string
“0000000000000000000000000000000000000000”. That’s a string of 40 “0”s. If you
interpret that as a number – that is, you understand the string of digits as
having a meaning as a number, it’s the number 0. But a string of 40
zeros is not equivalent to the number 0 in terms of information
theory. That string could also mean the number 40 understood as a unary
number. Or it could mean a sequence of 5 bytes containing nothing but zeros in
a computer memory. Or it could mean sequence of 41 bytes, the first forty
containing the ASCII code for the character “0”, and one containing a
terminator-mark. The string contains enough information to allow you to
interpret it as any of those things. But if you choose any of those
as the meaning of it, and work out a representation of that meaning,
you’ll wind up with a different answer to the quantity of information
contained in that string than you would for the string considered as an
abstract. Choosing a framework in which to assign it meaning can make a string
appear to have either more or less information than the
You can make it appear to have more information by choosing a system of
meaning in which the system itself contains extra information that it assigns
to the string; for example, when I say “the complete table of printable
unicode code-points in order”, if I interpret that string in the right
framework of meaning, it means a sequence of over 1100 characters.
But those 1100 characters aren’t containing in the string; they’re part of the
framework of meaning that’s used to interpret the string.
You can also choose a framework in which to assign meaning which makes a
string appear to have less information content than the
string-as-abstract. For example, the string of 0s above. If we interpret a
string of 40 “0”s as meaning zero, then in terms of meaning, that
string has one character of information.
Intuitively, it’s hard for us to separation information from meaning. Informally,
we consider them to be the same thing. But in information theory, they’re very
different. As I said above, meaning is what happens when you interpret
a piece of information in some context. But absent any context, how can
you quantify its information content?
To be strict, you can’t. Without any context of any kind, you’ve got no
way to quantify anything. But you can create a sort of minimal context, which
gives you a really simple way to quantify the amount of information in the
The basic, fundamental idea is description. The amount of
information contained in a string is the length of the shortest
description that uniquely identifies that string.
And that’s the essence of information theory: description. The
amount of information in something is the shortest description that
can uniquely reproduce that thing. It doesn’t matter what the thing means. You
don’t need to understand it. You don’t need to interpret it. All you need is a
description that allows you to reproduce it. To produce the string
10101101, you don’t need to know that it means the number ten million, one hundred and
one thousand, one hundred and one interpreted as a number written
in decimal notation. You don’t need to know that it means the number
173, interpreted as a binary number. All you need to know is that
it’s a sequence of symbols.
That definition might seem like a trick – it just replaces “quantity of
information” with “length of the shortest description”. How can we
describe something in a way that doesn’t involve meaning?
The answer is: by computation. We can define a simple,
minimal universal computing device. A description is a program that
generates a string. The shortest description of a string is the shortest
input-free program that can generate that string.
Of course, there’s another copout there. What’s minimal? And
won’t the information content of a string be different using different computing
Minimality is a tricky topic to tackle formally. You could easily write
a PhD dissertation (and someone probably already has) on a precise
definition of a minimal computing device. But informally, the idea that
we’re getting at is very simple: no magic instructions. That is,
each instruction performs a single action – so you can’t output
more than one symbol with a single instruction.
For the second question: yes, that’s true – to a limited extent. But
given any pair of computing machines, the difference between the
length of the shortest programs to generate a specific string will vary
by no more than a single fixed constant.
That last point is easy (and fun) to prove. First, we need to
formalize the statement. Take two computing
machines, P and Q. Let the information content of a string S relative
to P be written as I(S, P), and relative to Q be I(S,Q).
For all strings S, the absolute value of I(S,P) – I(S,Q) will
be less than or equal to a constant C.
Here’s the proof:
- if P and Q are both universal computing devices, then
there is a program for P which allows it to emulate Q;
and there is a program for Q which allows it to emulate P.
- Let the program for P to emulate Q have size Pq,
and let the program for Q to emulate P be Qp.
- Consider a string S. Let the program for P to generate S
be PS, and the program for Q to generate S be
QS. If PS is shorter than QS,
then the shortest program for Q to generate S cannot possibly
be longer than QS + QP.
- Reverse the above statement for P and Q.
- Now, take the longer of PQ and QP. For
any string S, it should be obvious that information content of
S relative to P or Q cannot possibly differ by more that
No magic there. It’s very straightforward.
Mark C. Chu-Carroll: You could easily write a PhD dissertation (and someone probably already has) on a precise definition of a minimal computing device.
Turing’s “On Computable Numbers, with an Application to the Entscheidungsproblem” wasn’t his dissertation, but definitively covers the material. There’s a few papers on the simplest possible universal Turing Machines; (doi:10.1016/j.tcs.2006.06.002) isn’t bad.
Thanks for that short and clear introduction. I was familiar with the basics of K-C theory, so most of it wasn’t new to me, but I wasn’t aware of the proof that minimal descriptions all have to be within a fixed constant of each other. As soon as I read point 1 of the proof, though, I knew exactly where you were headed. I love computing. 🙂
MCC:We can define a simple, minimal universal computing device
There is no actual requirement that the device be minimal, other than aesthetics. Any two devices will still have their answers differ by at most a constant (albeit a very large constant).
A device in with the command “print_complete_works_of_Shakespeare” is just unsatisfying, but technically legal.
Quarto or Folio? 🙂
Thanks for the intro/summary – I’d gathered as much from bits and pieces of your other posts and comments on information theory.
Re information vs. meaning, of course this is bound up in Dembski’s fruitless and rather desperate pursuit of some definition of information that is on one hand scientifically valid, and on the other would require external (preferably divine) input in the origin and evolution of the nucleic acid instructions that regulate basic microbiological functions.
One could also cast the comparison as information vs. purpose, where the K-C information content doesn’t depend on what the string is going to be used for, while for Dembski et al., it’s of course vital that the string (i.e., the DNA or RNA sequence) be fit for use as a microcellular instruction set.
Just a quick question — how do you write the number 0 in unary numbers?
In unary, 0 is an empty string.
Mark, thanks for this very clear summary. This is one of my favorite blogs on SB.
Thanks, Mark. I’m not sure how useful that would be in practice, but then that’s why the Roman system disappeared — it simply wasn’t very useful.
No, it’s not all that useful in practice. The main use of unary is in highly formalized explorations in automata theory. If you want to do things like play with a Turing machine, and you’re considered with capability not speed, then unary is frequently the easiest notation to work with. But on a Turing tape, the fact that unary zero is invisible isn’t a problem, because you use delimiters, and the side-by-side delimiters clearly indicate a zero.
Mark, do you agree then with the statement that this page (for example) does not contain “knowledge”? Lots of people talk about documents as if they contain knowledge. (“knowledge management”) However, it can only become knowledge by attributing meaning to the information, which is done by an interpreter (usually a human).
I don’t know. I’m honestly not sure of how to define “knowledge”. Speaking intuitively, I don’t think that “knowledge” can really be said to exist independently of someone who knows it – so a passive document can’t contain “knowledge”, since it can’t “know” anything.
I would argue that this page contains both information and intended meaning, which I think might be what you’re getting at. I wouldn’t be comfortable saying that it contains meaning, because meaning only exists in the mind of a reader. But I wouldn’t be comfortable saying that it’s meaningless either, because it was written with the intention of producing a specific meaning in the mind of the reader.
Well, knowledge is “a belief justified by evidence” according a friend of mine who’s doing AI research. Of course, then you have to define “belief”, “evidence” and “justified”…
But given any pair of computing machines, the difference between the length of the shortest programs to generate a specific string will vary by no more than a single fixed constant.
One interesting thing is that this constant is computable and represents the boundary between the programs which can be proved minimal and those which cannot be proved to be minimal.
Chaitin explicitly computes the constant for a LISP interpreter in one of his books.
Seems like Dembski would be ahead to try something more along the lines of Crutchfield/Shalizi computational mechanics, which provides a natural measure of complexity.
As it is, he’s really just making a huge and basic category error.
“Well, knowledge is “a belief justified by evidence” according a friend of mine who’s doing AI research. Of course, then you have to define “belief”, “evidence” and “justified”…”
This isn’t right, since it doesn’t rule out “knowing” false claims. I can’t know that that p if p isn’t true–even if I have some (misleading) evidence for p. So the better approximation is: knowledge is justified *true* belief.
But even this won’t work: there appear to be justified true beliefs that don’t amount to knowledge. Suppose, for example, that I look at my watch; it says ‘6:00’, and I come to believe on that basis that it is six o’clock. Suppose further that it *is* 6:00. (So I have a justified true belief that it is 6:00.) Now the twist: unbeknownst to me, my watch stopped working exactly 24 hours ago. Do I know that it is 6:00? Most people have the intuition that I don’t–after all, had I looked at my watch fifteen minutes earlier, I *still* would have believed it was 6:00. Thus we have a case of justified true belief that isn’t knowledge.
There’s a very large philosophical literature trying to work out exactly what the necessary and sufficient conditions for knowledge are. (As well as literatures on belief, justification, and evidence.)
If P_S is smaller than Q_S and Q has a program with length Q_P to emulate P, shouldn’t step 3 say that smallest Q_S
aargh I should have previewed. I meant to say in the first paragraph: that Q_S is smaller or equal to P_S + Q_P.
“Just a quick question — how do you write the number 0 in unary numbers?
Posted by: psweet | September 23, 2009 9:10 AM
In unary, 0 is an empty string.”
In other words you can’t actually write 0 in unary. You could say “I did not write a numeral in this place here to represent 0.”
I agree that a unary numeral system doesn’t work out as very useful. Still, such a notation makes readily apparent the order of the natural numbers, since eariler numbers take up less space to write, while Hindu-Arabic notation itself does not suggest that 1 preceds 2. A unary numeral system also readily suggests a natural numbers as magnitudes perspective, while other notations don’t. Actually, come to think of it, I use a unary system everday at my job, and it’s far more useful to tally up events than any other system. The wikipedia also indicates that counting on one’s fingers (which I would guess that virtually all of us did at one point as children, or we did something similar) effectively qualifies as a unary numeral system. I’ve changed my mind… unary numeral systems come as significantly useful. Maybe it comes as best to say that a unary numeral system comes as maximally suggestive for the natural numbers, but minimally compact, since it requires far too many symbols to effectively talk about how long civilization has existed.
“Knowledge” is axiomatized by Epistemic Logic. Although the Modal Logic from which it is derived is fairly well-established and uncontroversial, Epistemic Logic is filled with dissent and disagreement over foundational and implementation approaches, as with (search the Stanford Encyclopedia of Philosophy) and in it proposed resolution of philosophical puzzles. It overlaps Computer Science and Artificial Intelligence, and it overlaps Model Theory. But it only weakly connects to Information Theory.
Things get sticky indeed when we try to combine Epistemic axioms with a model of Time, to get Temporal Epistemic Logic, and when we try to model communication between agents, some of whom might be liars of various kinds.
“The realization of knowledge (or truth) by new information can be seen as a specific form of what is called ‘knowability’ in philosophy.”
What do I mean, informally, by rational belief and knowledge in this context?
An agent can know (believe) these things:
(1) Axioms accepted by the agent;
(2) Theorems that the agent can prove from the axioms, under rules of deduction;
(3) Things announced by agents whom the given agent believes to always tell the truth;
(4) The negation of things announced by agents whom the given agent believes to always tell exactly the opposite of the truth (in a two-valued logic);
(5) Deductions made from announcements believed, or from those inverted by total skepticism of a pure liar.
(6) Things which the agent believed in the past, but now (in belief revision) retracted from belief and inverted, i.e. formerly believe to be true, now believe to be false;
(7) Things which the agent believed in the past, but now (in belief revision) retracted from belief and demoted, i.e. formerly believe to be true, now believe to be only possibly true;
(8) Deductions from revised beliefs.
An agent can NOT know these things:
(1) Since this is not a Paraconsistent logic, one cannot believe anything which one knows to be logically inconsistent or invalid, i.e.
P and NOT P.
Note that I haven’t even mentioned knowledge derived from sensory impressions, or scientific experiments, as those are yet another cans of worms.
“In other words you can’t actually write 0 in unary. You could say ‘I did not write a numeral in this place here to represent 0.'”
There’s also a straightforward alternative: let zero be represented by ‘1’, one be represented by ’11’, etc.
There’s something goofy about KC info when you compare it to Shannon-style info.
In Shannon’s information theory, probability sits between meaning and coding. An ideal coding system would use n bits for something whose probability is 2**(-n).
So, think of the string of forty zero bits (instead of ascii zeros, but this is riffing off the example above). If it is part of an efficient coding system, then it represents something whose probability is one in a trillion.
But KC information theory would look at that and (with the sorts of machines we tend to imagine) say “Whoa, that’s just forty zeroes in a row, I’ve got a shorter way to encode that.”
However, the compression method represented by such a machine is going to– I mean it must– encode some strings originally less than 40 bits to more bits than they had. (The proof is that one of those strings is used up to encode “40 zeroes,” so at least one shorter string is used up and can’t be used to represent a string as short as itself.)
Going back to Shannon-land, that means that using a UTM as a compressor introduces inefficiency if you already had an efficient encoding for your meaning domain. That inefficiency is going to be limited by a constant– a different constant from the ones mentioned above, it’s how hard it is to represent an arbitrary constant in the machine’s language– but it’s still an inefficiency. (BTW, compression programs sometimes have a special case for this, a short code that means “I couldn’t improve this so here it is literally.”)
So, finally my point: you must assume a specific UTM to do KC stuff, and that’s a different assumption than that the information is already efficiently represented for the domain. Using a UTM to re-encode a string to measure its “information content” assumes the string is redundant, or comes from a source of strings that are sometimes redundant. Otherwise you would just look at the length of the string and call that its information content.
So, what I’m saying is that either KC information theory is injecting that meaning, or if you insist that KC information theory doesn’t mean anything, then if you interpret KC as being about measuring information content then that’s one piece of meaning you’re adding yourself, and the other is that the original information comes from a redundant source.
I say this not just for verbal gymnastics but because the idea of a redundant source starts to look weird from this point of view. All of the kinds of phenomena we know happened because the world started in a lower-entropy state. And KC points at the fact that such a world exists.
Maybe, but if you do KC, then Shannon rats on your motives.
KC is about minimum length to re-encode an already-encoded message using a program in a given UTM’s language.
Shannon is about the length of a message in a code constructed to minimize its expected length, given a probability function for the set of unencoded items it might encode.
Viewed purely mathematically, these are just two different functions of different lists of arguments, and it’s no surprise they’re not identical.
But if you think of either as the “quantity of information in” a string, then you’re going outside of math and adding meaning. The existence of a similarly compelling, but conflicting definition of “quantity of information in” is just a hint of that. (This is more than the constant-limited difference between different UTMs.)
Information theory is not about meaning as long as you consider “quantity of information in” to be a meaningless string, like “frabjous galumphability of” or “f( … )”.
The kind of meaning revealed by the difference between Shannon and KC seems kind of general (efficiency, redundancy, cost as something to minimize…?), but maybe it reveals more than we think it does.
Doh. Either Shannon and KC measures are simply different beasts, or the KC re-encoding of a Shannon-encoded string is at worst a constant number of bits longer, as I said in previous-previous message, doh, it’s not “worse than” the difference between different UTMs.
I always thought of information in terms of entropy. Max information = pure random string = max entropy. When coding a message, you don’t want pure random strings because you DO care about meaning. But neither do you want minimum entropy (1,2,3,4,5..) because the message is so predictable that it is meaningless. In message transmission, you want to hit the middle of the bell curve.
In terms of measuring information content and quantifying it, completely divorced from meaning, would you not just count the bits? It seems this is a trivial matter. It is when you bring meaning and entropy into the picture that things get interesting
Mark, have you seen http://golem.ph.utexas.edu/category/2010/02/algorithmic_thermodynamics.html ? Here John Baez and Mike Stay investigate the relationships between ‘entropy’ in heat engines, Shannon information and Chaitlin information.