When a friend asks me to write about something, I try to do it. Yesterday, a friend of mine from my Google days, Daniel Martin, sent me a link and asked me to write about it. Daniel isn’t just a former coworker of mine; he’s a math geek with the same sort of warped sense of humor as me. He knew my blog before we worked at Google, and on my first Halloween at Google, he came to introduce himself to me. He was wearing a purple shirt with his train ticket on a cord around his neck. For those who know any abstract algebra, get ready to groan: he was purple, and he commuted. He was dressed as an Abelian grape.

The real subject of the article involves a recent twitter-storm around a professor at Boston University. This professor tweeted about racism and history in very blunt, not-entirely-professional terms. The details of what she did aren’t something I want to discuss here. (Briefly, I think it wasn’t a smart thing to tweet like that, but plenty of white people get away with worse every day; the only reason that she’s getting as much grief as she is is because she dared to be a black woman saying bad things about white people, and the assholes at Breitbart used that to fuel the insatiable anger and hatred of their followers.)

But I don’t want to go into the details of that here. Lots of people have written interesting things about it, from all sides. Just by posting about this, I’m probably opening myself up to yet another wave of abuse, but I’d prefer to avoid as much of that as I can. Instead, I’m just going to rip out the introduction to this article, because it makes a kind of incredibly stupid mathematical argument that requires correction. Here are the first and second paragraphs:

There aren’t too many African Americans in higher education.

In fact, black folks only make up about 4 percent of all full time tenured college faculty in America. To put that in context, only 14 out of the 321—that’s about 4 percent—of U.S. astronauts have been African American. So in America, if you’re black, you’ve got about as good a chance of being shot into space as you do getting a job as a college professor.

Statistics and probability can be a difficult field of study. But… a lot of its everyday uses are really quite easy. If you’re going to open your mouth and make public statements involving probabilities, you probably should make sure that you at least understand the first chapter of “probability for dummies”.

This author doesn’t appear to have done that.

The most basic fact of understanding how to compare pretty much anything numeric in the real world is that you can only compare quantities that have the same units. You can’t compare 4 kilograms to 5 pounds, and conclude that 5 pounds is bigger than 4 kilograms because 5 is bigger than 4.

That principle applies to probabilities and statistics: you need to make sure that you’re comparing apples to apples. If you compare an apple to a grapefruit, you’re not going to get a meaningful result.

The proportion of astronauts who are black is 14/321, or a bit over 4%. That means that out of every 100 astronauts, you’d expect to find four black ones.

The proportion of college professors who are black is also a bit over 4%. That means that out of every 100 randomly selected college professors, you’d expect 4 to be black.

So far, so good.

But from there, our intrepid author takes a leap, and says “if you’re black, you’ve got about as good a chance of being shot into space as you do getting a job as a college professor”.

Nothing in the quoted statistic in any way tells us anything about anyone’s chances to become an astronaut. Nothing at all.

This is a classic statistical error which is very easy to avoid. It’s a unit error: he’s comparing two quantities with different units. The short version of the problem is: he’s comparing black/astronaut (the fraction of astronauts who are black) with astronaut/black (a given black person’s chance of becoming an astronaut).

You can’t derive anything about the probability of a black person becoming an astronaut from the ratio of black astronauts to astronauts.

Let’s pull out some numbers to demonstrate the problem. These are completely made up, to make the calculations easy – I’m not using real data here.

Suppose that:

• the US population is 300,000,000;
• black people are 40% of the population, which means that there are 120,000,000 black people.
• there are 1000 universities in America, and there are 50 faculty per university, so there are 50,000 university professors.
• there are 50 astronauts in the US.
• If 4% of astronauts and 4% of college professors are black, that means that there are 2,000 black college professors, and 2 black astronauts.

In this scenario, as in reality, the percentage of black college professors and the percentage of black astronauts are equal. What about the probability of a given black person being a professor or an astronaut?

The probability of a black person being a professor is 2,000/120,000,000 – or 1 in 60,000. The probability of a black person becoming an astronaut is just 2/120,000,000 – or 1 in 60 million. Even though the probability of a random astronaut being black is the same as the probability of a random college professor being black, the probability of a given black person becoming a college professor is 1,000 times higher than the probability of a given black person becoming an astronaut.
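Since the numbers are deliberately simple, the comparison is easy to check in code. A quick sketch, using the same made-up figures as above:

```python
# Made-up numbers from the text above -- not real data.
black_population = 120_000_000   # 40% of 300M
black_professors = 2_000         # 4% of 50,000 professors
black_astronauts = 2             # 4% of 50 astronauts

# The two "4%" figures are proportions *of professors* and *of astronauts*.
# The chances for a given black person have a different denominator entirely:
p_professor = black_professors / black_population   # 1 in 60,000
p_astronaut = black_astronauts / black_population   # 1 in 60,000,000

# Since the denominators match, the ratio reduces to 2,000/2:
print(black_professors / black_astronauts)   # 1000.0
```

Same 4%, but a factor of a thousand between the two probabilities, because they're probabilities of different things.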

This kind of thing isn’t rocket science. My 11-year-old son has done enough statistics in school to understand this problem! It’s simple: you need to compare like to like. If you can’t understand that, if you can’t understand your statistics enough to understand their units, you should probably try to avoid making public statements about statistics. Otherwise, you’ll wind up doing something stupid, and make yourself look like an idiot.

(In the interests of disclosure: an earlier version of this post used the comparison of apples to watermelons. But given the racial issues discussed in the post, that had unfortunate unintended connotations. When someone pointed that out to me, I changed it. To anyone who was offended: I am sorry. I did not intend to say anything associated with the racist slurs; I simply never thought of it. I should have, and I shouldn’t have needed someone to point it out to me. I’ll try to be more careful in the future.)

# Expressions and Arity (Part 2): Equivalence

Continuing where I left off: we were talking about arity in Martin-Löf’s theory of expressions. There are two basic problems that arity solves: it makes certain kinds of impossible-to-evaluate expressions be invalid in the theory; and it helps enable some way of comparing expressions for equality. Arity solves both of those problems by imposing a simple type system over expressions.

At the end of the last post, I started giving a sketch of what arities look like. Now we’re going to dive in, and take a look at how to determine the arity of an expression. It’s a fairly simple system of rules.

Before diving in, I want to stress that the most important thing about the way these rules work is that they are totally syntactic and static. This is something that confused me the first time I tried to read about expression theory. When you see an expression, you think about how it’s evaluated. But expression theory is a purely syntactic theory: it’s about analyzing expressions as syntactic entities. There are, deliberately, no evaluation rules. It’s about understanding what the notations mean, and how to determine when two expressions are equivalent.

If, under the rules of Martin-Löf’s expression theory, two expressions are equivalent, then if you were to choose a valid set of evaluation rules, the two expressions would evaluate to the same value. But expression equivalence is stronger: expressions are equivalent only if you can prove their equivalence from their syntax.

That clarified, let’s start by looking at the rules of arity in expressions.

Variables and Constants
Every variable and every primitive constant has a pre-defined arity; if $x$ is a variable or primitive constant with arity $\alpha$, then the expression $x$ has arity $\alpha$.
Definitions
In a definition $D := e$, the arity of the defined name $D$ is the same as the arity of the expression $e$.
Applications
If $a$ is an expression of arity $\alpha \twoheadrightarrow \beta$, and $b$ is an expression of arity $\alpha$, then $a(b)$ is an expression of arity $\beta$.
Abstractions
If $e$ is an expression of arity $\beta$ and $x$ is a variable of arity $\alpha$, then $(x)e$ is an expression of arity $\alpha \twoheadrightarrow \beta$.
Combinations
If $e_1$ is an expression of arity $\alpha_1$, $e_2$ is an expression of arity $\alpha_2$, …, and $e_n$ is an expression of arity $\alpha_n$, then a combination expression $e_1, e_2, ..., e_n$ is an expression of arity $\alpha_1 \otimes \alpha_2 \otimes \ldots \otimes \alpha_n$.
Selections
If $e$ is an expression of arity $\alpha_1 \otimes \alpha_2 \otimes \ldots \otimes \alpha_n$ where $n \ge 2$, then $(e).i$ is an expression of arity $\alpha_i$.

Let’s try working through an example: $x^2 + 3x + 7$.

1. As we saw in this post, this is equivalent to the simple AST-form: $(x)+(+(*(x,x), *(3, x)),7)$.
2. “$x$” is a variable of arity 0; “3” and “7” are constants of arity 0; “$+$” and “$*$” are constants of arity $(0 \otimes 0) \twoheadrightarrow 0$.
3. From the combination rule, since $x$ and $3$ both have arity 0, $(x, x)$ and $(3, x)$ each have arity $0 \otimes 0$.
4. Since $(x, x)$ has arity $0 \otimes 0$, and $*$ has arity $(0 \otimes 0) \twoheadrightarrow 0$, $*(x, x)$ has arity 0. The same thing works for $*(3, x)$.
5. Since the arities of the $*(x, x)$ and $*(3, x)$ are both 0, the combination of the pair (the arguments to the inner “+”) are $0 \otimes 0$, and the arity of the inner sum expression is thus 0.
6. Since 7 has arity 0, the combination of it with the inner sum is $0 \otimes 0$, and the arity of the outer sum is 0.
7. Since $x$ is a variable of arity 0, and the outer sum expression has arity 0, the abstraction has arity $0 \twoheadrightarrow 0$.

If you’re familiar with type inference in the simply typed lambda calculus, this should look pretty familiar; the only real difference is that the only thing that arity really tracks is applicability and parameter counting.

Just from this much, we can see how this prevents problems. If you try to compute the arity of $3.1$ (that is, the selection of the first element from 3), you find that you can’t: there is no arity rule that would allow you to do that. The selection rule only works on a product-arity, and 3 has arity 0.

The other reason we wanted arity was to allow us to compare expressions. Intuitively, it should be obvious that the expression $e$ and the expression $(x)e(x)$ are in some sense equal. But we need some way of being able to actually precisely define that equality.

The kind of equality that we’re trying to get at here is called definitional equality. We’re not trying to define equality where expressions $a$ and $b$ evaluate to equal values – that would be easy. Instead, we’re trying to get at something more subtle: we want to capture the idea that the expressions are different ways of writing the same thing.

We need arity for this, for a simple reason. Let’s go back to that first example expression: $(x)+(+(*(x,x), *(3, x)),7)$. Is that equivalent to $(y)+(+(*(y,y), *(3, y)),7)$? Or to $8x+1$? If we apply them to the value 3, and then evaluate them using standard arithmetic, then all three expressions evaluate to 25. So are they all equivalent? We want to be able to say that the first two are equivalent expressions, but the last one isn’t. And we’d really like to be able to say that structurally – that is, instead of saying something evaluation-based like “forall values of x, eval(f(x)) == eval(g(x)), therefore f == g”, we want to be able to do something that says $f \equiv g$ because they have the same structure.

Using arity, we can work out a structural definition of equivalence for expressions.

In everything below, we’ll write $a: \alpha$ to mean that $a$ has arity $\alpha$, and $a \equiv b : \alpha$ to mean that $a$ and $b$ are equivalent expressions of arity $\alpha$. We’ll define equivalence in a classic inductive form by structure:

Variables and Constants
If $x$ is a variable or constant of arity $\alpha$, then $x \equiv x : \alpha$. This is the simplest identity rule: variables and constants are equivalent to themselves.
Definitions
If $a := b$ is a definition, and $b: \alpha$, then $a \equiv b : \alpha$. This is a slightly more complex form of an identity rule: if $a$ is defined to be the expression $b$, then $a$ and $b$ are equivalent.
Application Rules
1. If $a \equiv a': \alpha \twoheadrightarrow \beta$ and $b \equiv b': \alpha$, then $a(b) \equiv a'(b'): \beta$. If an applyable expression $a$ is equivalent to another applyable expression $a'$, then applying $a$ to an argument $b$ is equivalent to applying $a'$ to an argument $b'$, provided that $b$ is equivalent to $b'$. That’s a mouthful, but it’s simple: if you have two function application expressions, they’re equivalent if both the function expressions and the argument expressions are equivalent.
2. If $x$ is a variable of arity $\alpha$, and $a$ is an expression of arity $\alpha$ and $b$ is an expression of arity $\beta$, then $((x)b)(a) \equiv b[x := a]: \beta$. This is arity’s version of the classic beta rule of lambda calculus: applying an abstraction to an argument means substituting the argument for all references to the abstracted parameter in the body of the abstraction.
Abstraction Rules
1. If $x$ is a variable of arity $\alpha$, and $b \equiv b': \beta$, then $(x)b \equiv (x)b': \alpha \twoheadrightarrow \beta$. If two expressions are equivalent, then abstracting over the same variable in each of them produces equivalent abstractions.
2. If $x$ and $y$ are both variables of arity $\alpha$, and $b$ is an expression of arity $\beta$, then $(x)b \equiv (y)(b[x := y]): \alpha \twoheadrightarrow \beta$, provided $y$ is not free in $b$.
Basically, renaming the variable in an abstraction doesn’t matter, as long as the new variable isn’t already used in the body of the abstraction. So $(x)(3+4y)$ is equivalent to $(z)(3+4y)$, but it’s not equivalent to $(y)(3+4y)$, because $y$ is a free variable in $3+4y$, and the abstraction would create a binding for $y$.

3. This is arity’s version of the eta-rule from lambda calculus: If $x$ is a variable of arity $\alpha$, and $b$ is an expression of arity $\alpha \twoheadrightarrow \beta$, then $(x)(b(x)) \equiv b: \alpha \twoheadrightarrow \beta$. This is a fancy version of an identity rule: abstraction and application cancel.

Combination Rules
1. If $a_1 \equiv a_1': \alpha_1$, $a_2 \equiv a_2': \alpha_2$, …, $a_n \equiv a_n': \alpha_n$, then $a_1, a_2, ..., a_n \equiv a_1', a_2', ..., a_n': \alpha_1 \otimes \alpha_2 \otimes \ldots \otimes \alpha_n$. This one is simple: if you have two combination expressions with the same arity, then they’re equivalent if their elements are equivalent.
2. If $e: \alpha_1 \otimes \alpha_2 \otimes ... \otimes \alpha_n$, then $e.1, e.2, ..., e.n \equiv e : \alpha_1 \otimes \alpha_2 \otimes ... \otimes \alpha_n$. Another easy one: if you take a combination expression, and you decompose it using selections, and then recombine those selection expressions into a combination, it’s equivalent to the original expression.
Selection Rules
1. If $a \equiv a': \alpha_1 \otimes \alpha_2 \otimes \ldots \otimes \alpha_n$, then $a.i \equiv a'.i: \alpha_i$. This is the reverse of combination rule one: if you have two equivalent tuples, then their elements are equivalent.
2. If $a_1: \alpha_1, a_2: \alpha_2, ..., a_n: \alpha_n$, then $(a_1, a_2, ..., a_n).i \equiv a_i: \alpha_i$. An element of a combination is equivalent to itself outside of the combination.
Reflexivity
If $a: \alpha$, then $a \equiv a: \alpha$.
Symmetry
If $a \equiv b: \alpha$, then $b \equiv a: \alpha$.
Transitivity
If $a \equiv b: \alpha$, and $b \equiv c: \alpha$, then $a \equiv c: \alpha$.

Jumping back to our example: Is $x^2 + 3x + 7$ equivalent to $y^2 + 3y + 7$? If we convert them both into their canonical AST forms, then yes. They’re identical, except for one thing: the variable name in their abstraction. By abstraction rule 2, then, they’re equivalent.
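To make abstraction rule 2 concrete, here’s a rough sketch of the structural check, using an AST encoding of my own (the theory doesn’t prescribe one): expressions are `("var", name)`, `("apply", f, arg)`, `("combine", [exprs])`, or `("abstract", boundvar, body)`:

```python
def equiv(a, b, env=()):
    # env pairs each bound variable of `a` with the corresponding one of `b`;
    # innermost bindings come first.
    if a[0] != b[0]:
        return False
    if a[0] == "var":
        for xa, xb in env:
            if a[1] == xa or b[1] == xb:
                return a[1] == xa and b[1] == xb   # bound: must correspond
        return a[1] == b[1]                        # free variable or constant
    if a[0] == "abstract":
        return equiv(a[2], b[2], ((a[1], b[1]),) + env)
    if a[0] == "apply":
        return equiv(a[1], b[1], env) and equiv(a[2], b[2], env)
    if a[0] == "combine":
        return (len(a[1]) == len(b[1]) and
                all(equiv(p, q, env) for p, q in zip(a[1], b[1])))
    return False

# The renaming example from the abstraction rules: the body 3 + 4y.
var = lambda n: ("var", n)
call = lambda f, a, b: ("apply", var(f), ("combine", [a, b]))
body = call("+", var("3"), call("*", var("4"), var("y")))

print(equiv(("abstract", "x", body), ("abstract", "z", body)))  # True
print(equiv(("abstract", "x", body), ("abstract", "y", body)))  # False
```

The second check fails for exactly the reason abstraction rule 2 requires: abstracting over $y$ would capture the free $y$ in the body.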

# Free-riding Insurance

Pardon me, while I go off on a bit of a rant. There is a bit of math content to this, but there’s more politics.

There’s a news story that’s been going around this week about a guy who’s bitterly angry. His name is Luis Lang. Mr. Lang is going blind because of complications of diabetes. The only way to save his eyesight is to get very expensive eye surgery, but Mr. Lang can’t afford it. He doesn’t have insurance, and now that he needs it, he can’t buy it. According to him, this is all the fault of President Obama and the unjust Obamacare insurance system which is denying him access to insurance when he really needs it.

In the US, we’re in the early days of some big changes to our health-care system. Up until a couple of years ago, most people got insurance via their employers. If they didn’t get it through work, then they needed to buy policies on their own. Privately purchased policies were typically extremely expensive, and they came with a “pre-existing condition” exclusion (PECE). The pre-existing exclusion meant that if you had a medical condition that required care before you purchased the policy, the policy wouldn’t pay for the care.

In the new system, most people still get insurance through work. But the people who don’t can go to government-run insurance exchanges to buy policies. The policies in the exchanges are much cheaper than the old private coverage used to be, and rules like the pre-existing condition exclusion are prohibited in policies on the exchange. In addition, if you make less than a certain income, the government will subsidize your coverage to make it affordable. Under this system, you’re required to buy insurance if it’s possible for you to buy it with the government subsidies; if you choose to go without, you have to pay a penalty. If you’re too poor to buy even with the subsidies, then you’re supposed to be able to get coverage under an expanded Medicaid program. But the Medicaid expansions needed to be done by the states, and many states refused to do it, even though the federal government would cover nearly all of the costs (100% for the first few years; 95% indefinitely thereafter.)

I’m not a fan of the new system. Personally, I believe that for-profit insurance is fundamentally immoral. But that’s not the point of this post. We’ve got a system now that should make it possible for people to get coverage. So why is this poor guy going blind, and unable to get insurance?

The answer is simple: because he very deliberately put himself into a terrible situation, and now he’s paying the price for that. And there are very good reasons why people who put themselves into his situation can’t get covered when they really need it.

First, I’ll run through how he got into this mess. Then, I’ll explain why, as sad as it is for this dumbass, he’s stuck.

Our alleged victim of unjust government policies is around 50 years old. He owns a nice home, and runs his own business. He had the opportunity to buy insurance, but he chose not to, because he has a deeply held philosophical/political belief that he should pay his own bills, and so he always paid for his medical care out of his own pocket. When Obamacare came down the pike, he was strongly opposed to it because of that philosophy, and so he paid the penalty rather than buy insurance, and stayed uninsured. It’s important to note here that he made a deliberate choice to remain uninsured.

Mr. Lang isn’t a paragon of good health. He’s a smoker, and he’s had diabetes for a couple of years. He admits that he hasn’t been very good about managing his diabetes. (I’m very sympathetic to his trouble managing diabetes: there’s a lot of diabetes in my family – my mother, her brother, my grandfather, every one of my grandfather’s siblings, and his father all had diabetes. I’ve seen members of my family struggle with it. Diabetes is awful. It’s hard to manage. Most people struggle with it, and many don’t ultimately do it well enough before they wind up with complications.) That’s what happened to Mr. Lang: he developed complications.

Specifically, he had a series of small strokes, ruptured blood vessels in his cornea, and a detached retina. Combined, these conditions will cause him to go blind without surgery. (This is exactly what happened to my uncle – he lost his vision due to diabetes.) Mr. Lang’s condition has gotten bad enough that he’s unable to work because of these problems, so he can’t afford to pay for the surgery. So now, he wants to buy insurance. And he can’t.

Why not?

To really see why, we need to take a step back, and look at just what insurance really is.

Reduced to its basics, the idea of insurance is that there’s a chance of an unlikely event happening that you can’t afford to pay for. For example, say that there’s a 1 in 1000 chance of you needing major surgery that will cost $100,000. You can’t afford to pay $100,000 if it happens. So, you get together with 999 other people. Each of you puts $100 into the pot. Then if you end up being unlucky, and you need the surgery, you can draw on the $100,000 in the pot to pay for your surgery.

The overwhelming majority of people who put money into the pot are getting nothing concrete for their money. But the people who needed medical care that they couldn’t afford on their own were able to get it. You and the other people who all bought in to the insurance pot were buying insurance against a risk, so that in case something happened, you’d be covered. You know that you’re probably going to lose money on the deal, but you do it to cover the unlikely case that you’ll need it. You’re sharing your risks with a pool of other people. In exchange for taking on a share of the risk (putting your money into the pool without knowing whether you’ll get it back), you take a share of the resource (the right to draw money out of the pool if you need it).

In the modern insurance system, it’s gotten a lot more complicated. But the basic idea is still the same. You’ve got a huge number of people all putting money into the pot, in the form of insurance premiums. When you go to the doctor, the insurance company pays for your care out of the money in that pot. The way that the insurance company sets premiums is complicated, but it comes down to collecting more from each buyer than it expects to need to pay for their medical care. It does that by mathematically analyzing risks.

This system is very easy to game if you can buy insurance whenever you want. You simply don’t buy insurance until something happens, and you need insurance to pay for it. Then you buy coverage. So you weren’t part of the shared risk pool until you knew that you needed more than you were going to pay in to the pool. You’re basically taking a share of the community resources in the insurance pool, without taking a share of the community risk. In philosophical circles, that’s called the free-rider problem.

Insurance can’t work without doing something to prevent free-riders from exploiting the system.

Before Obamacare, the way that the US private insurance system worked was that you could buy insurance any time you want, but when you did, you were only covered for things that developed after you bought it. Any medical condition that required care that developed before you bought insurance wasn’t covered. PECEs prevented the free-rider problem by blocking people from joining the benefits pool without also joining the risk pool: any conditions that developed while you were outside the risk pool weren’t covered. So before Obamacare, Mr. Lang could have gone out and bought insurance when he discovered his medical problems – but that insurance wouldn’t cover the surgery that he needs, because it developed while he was uninsured.

Without PECEs, it’s very easy to exploit the insurance system by free-riding. If you allowed some people to stay out of the insurance system until they needed the coverage, then you’d need to get more money from everyone else who bought insurance. Each year, you’d still need to have a pool of money big enough to cover all of the expected medical care costs for that year. But that pool wouldn’t just need to be big enough to cover the people who bought in at the beginning of the year – it would need to be large enough to cover everyone who bought insurance at the beginning of the year, and everyone who jumped in only when they needed it.

Let’s go back to our example. There’s only one problem that can happen, and it happens to 1 person in 1000 per year, and it costs $100,000 to treat. We’ve got a population of 2000 people. 1000 of them bought into the insurance system. In an average year, 2 people will become ill: one with insurance, and one without. The one with insurance coverage becomes ill, and they get to take the $100,000 they need to cover their care. The person without insurance is stuck, and they need to pay $100,000 for their own care, or go without. In order to cover the expenses, each of the insured people would need to have paid $100.

If people can buy in to the insurance system at any time, without PECEs, then the un-insured person can wait until he gets sick, and buy insurance then. Now the insurance pool needs to cover $200,000 worth of expenses; but they’ve only got one additional member. In order to cover, they need to double the cost per insured person per year to $200. Everyone in the pool needs to pay double premiums in order to accommodate the free-riders!
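The arithmetic in this toy example is easy to verify. A quick sketch, using only the made-up figures from the text:

```python
# Toy numbers from the example above -- completely made up.
pool_size = 1000         # people insured at the start of the year
claim_cost = 100_000     # cost of treating the one illness
risk = 1 / 1000          # chance of needing treatment, per person per year

# Everyone joins up front: expected claims, spread across the pool.
premium = (pool_size * risk * claim_cost) / pool_size
print(premium)               # 100.0

# One uninsured person gets sick and is allowed to buy in late:
# two claims to cover, but only one extra premium-payer.
late_premium = (2 * claim_cost) / (pool_size + 1)
print(round(late_premium))   # 200 -- premiums roughly double
```

One free-rider in a pool of a thousand is enough to double everyone’s premium, because a single claim costs a thousand times the fair premium.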

This leads to a situation that some economists call a death spiral: You need to raise insurance premiums on healthy people in order to have enough money to cover the people who only sign up when they’re unhealthy. But raising your premiums means that more people can’t afford to buy coverage, and so you have more people not buying insurance until they need it. And that causes you to need to raise your premiums even more, and so it goes, circling around and around.

The only alternative to PECEs that really works to prevent free-riders is to, essentially, forbid people from being free-riders. You can do that by requiring everyone to be covered, or by limiting when they can join the pool.

In the age of PECEs, there was one way of getting insurance without a PECE, and it’s exactly what I suggested in the previous paragraph. Large companies provided their employees with insurance coverage without PECEs. The reason that they could do it was because they were coming to an insurance company once a year with a large pool of people. The costs of the employer-provided insurance were determined by the average expected cost of coverage for that pool of people, divided by the size of the pool. But in the employer-based non-PECE coverage, you still couldn’t wait to get coverage until you needed it: each year, at the beginning of the year, you needed to either opt in or out of coverage; if, in January, you decided to decline coverage, and then in July, you discovered that you needed surgery, you couldn’t change your mind and opt in to insurance in order to get that covered. You had to wait until the following year. So again, you were avoiding free-riders by a combination of two mechanisms. First, you made it so that you had to go out of your way to refuse coverage – so nearly everyone was part of the company’s insurance plan. And second, you prevent free-riding by making it much harder to delay getting insurance until you needed it.

The Obamacare system bans PECEs. In order to avoid the free-rider problem, it does two things. It requires everyone to either buy insurance, or pay a fine; and it requires that you buy insurance for the whole year starting at the beginning of the year. You might think that’s great, or you might think it’s terrible, but either way, it’s one way of making insurance affordable without PECEs.

Mr. Lang wants to be a free-rider. He’s refused to be part of the insurance system, even though he knew that he had a serious medical condition that was likely to require care. Even though he was a regular smoker, and knew of the likelihood of developing serious medical problems as a result. He didn’t want to join the risk pool, and he deliberately opted out, refusing to get coverage when he had the chance.

That was his choice, and under US law, he had every right to make it for himself.

What Mr. Lang does not have the right to do is to be a free-rider.

He made a choice. Now he’s stuck with the results of that choice. As people like Mr. Lang like to say when they’re talking about other people, it’s a matter of personal responsibility. You can’t wait until you need coverage to join the insurance system.

Mr. Lang can buy insurance next year. And he’ll be able to get an affordable policy with government subsidies. And when he gets it, it will cover all of his medical problems. Before the Obamacare system that he loathes and blames, that wouldn’t have been true.

It’s not the fault of President Obama that he can’t buy insurance now. It’s not the fault of congress, or the democratic party, or the republican party. There’s only one person who’s responsible for the fact that he can’t get the coverage that he needs in order to get the surgery that would save his eyesight. And that’s the same person who he can’t see in the mirror anymore.

# Expressions and Arity (part 1)

In the last post, we started looking at expressions. In this post, we’re going to continue doing that, and start looking at a simple form of expression typing called arity.

Before we can do that, we need to introduce a couple of new formalisms to complete our understanding of the elements that form expressions. The reason we need another formalism is that, so far, we’ve defined the meaning of expressions in terms of function calls. But we’ve still got some ambiguity about our function calls – specifically, how do parameters work?

Suppose I’ve got a function, $f$. Can I call it as $f(x)$? Or $f(x,y)$? Both? Neither? It depends on exactly what function calls mean, and what it means to be a valid parameter to a function.

There are several approaches that you can take:

1. You can say that a function application expression takes an arbitrary number of arguments. This is, roughly, what we do in dynamically typed programming languages like Python. In Python, you can write f(x) + f(x,y) + f(x, y, z, a, b, c), and the language parser will accept it. (It’ll complain when you try to execute it, but at the level of expressions, it’s a perfectly valid expression.)
2. You can say that every function takes exactly one argument, and that a multi-argument function like $f(x, y, z)$ is really a shorthand for a curried call sequence $f(x)(y)(z)$ – and thus, the “application” operation really takes exactly 2 parameters – a function, and a single value to apply that function to. This is the approach that we usually take in lambda calculus.
3. You can say that application takes two arguments, but the second one can be a combination of multiple objects – effectively a tuple. That’s what we do in a lot of versions of recursive function theory, and in programming languages like SML.
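The second approach is easy to illustrate concretely. A small Python sketch of currying, with a made-up function of my own for illustration:

```python
# A curried three-argument function: each call takes exactly one
# argument and returns a function awaiting the next one.
def f(x):
    return lambda y: lambda z: x * 100 + y * 10 + z

# f(x)(y)(z) is the curried spelling of the three-argument call f(x, y, z);
# "application" itself only ever sees a function and a single value.
print(f(1)(2)(3))   # 123
```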

In the theory of expressions, Martin-Löf chose the third approach. A function takes a single parameter, which is a combination of a collection of other objects. If $a$, $b$, $c$, and $d$ are expressions, then $a, b, c, d$ is an expression called the combination of its four elements. This is very closely related to the idea of cartesian products in set theory, but it’s important to realize that it’s not the same thing. We’re not defining elements of a set here: we’re defining a syntactic construct of a pseudo-programming language, where one possible interpretation of it is cartesian products.

In addition to just multi-parameter functions, we’ll use combinations for other things. In type theory, we want to be able to talk about certain mathematical constructs as first-class objects. For example, we’d like to be able to talk about orderings, where an ordering is a collection of objects $A$ combined with an operation $\le$. Using combinations, we can write that very naturally as $A,\le$.

For combinations to be useful, we need to be able to extract the component values from them. In set theory, we’d do that by having projection functions associated with the cross-product. In our theory of expressions, we have selection expressions. If $x=(a, b, c, d)$ is a combination with four elements, then $x.1$ is a selection expression which extracts the first element from $x$.

In programming language terms, combinations give us a way of writing tuple values. Guess what’s next? Record types! Or rather, combinations with named elements. We can write a combination with names: $x = (a: 1, b:2)$, and then write selection expressions using the names, like $x.a$.
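In Python terms, a combination is roughly a tuple, and a named combination is roughly a namedtuple. A quick sketch of both kinds of selection:

```python
from collections import namedtuple

# Positional combination (a, b, c, d); Martin-Löf's selection x.1 picks the
# first element (his selectors are 1-indexed, Python's are 0-indexed).
x = ("a", "b", "c", "d")
assert x[0] == "a"

# Named combination (a: 1, b: 2), with selection by name, like x.a.
Pair = namedtuple("Pair", ["a", "b"])
y = Pair(a=1, b=2)
assert y.a == 1
```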

Now we can start getting to the meat of things.

In combinatory logic, we’d just start with a collection of primitive constant values, and then build whatever we wanted with them using abstractions, applications, combinations, and selections. Combinatory logic is the parent of computation theory: why can’t we just stick with that foundation?

There are two answers to that. The first is familiar to programming language people: if you don’t put any restrictions on things, then you lose type safety. You’ll be able to write “valid” expressions that don’t mean anything – things like $1.x$, even though “1” isn’t a combination, and so calling a selector on it makes no sense. Or you’d be able to call a function like factorial(27, 13), even though the function only takes one parameter.

The other answer is equality. One thing that we’d really like to be able to do in our theory of expressions is determine when two expressions are the same. For example, we’d really like to be able to say that $((x)e)(x) == e$. But without some form of control, we can’t really do that: the problem of determining whether or not two expressions are equal can become undecidable. (The canonical example of this problem is the self-application combinator $\Omega$: $((x)x(x))((x)x(x))$. If we wanted to try to work out whether an expression involving this was equivalent to another expression, we could wind up in an infinite loop of applications.)

The way that Martin-Löf worked around this is to associate an arity with each expression. Expressions of different arities are distinct, and there are rules about what operations can be applied to an expression depending on its arity. You can’t call .2 on “1+3”, because “1+3” is a single expression, and selectors can only be applied to combined expressions.

To most of us, arity sounds like it should be a numeric value. When we talk about the arity of a function in a program, what we mean is how many parameters it takes. In Martin-Löf’s expressions, arity is more expressive than that: it’s almost the same thing as types in the simply typed lambda calculus.

There are two dimensions to arity: single/combined and saturated/unsaturated.

Single expressions are atomic values, from which you can’t extract other values by selection; combined expressions are combinations of multiple other expressions.

Saturated expressions are expressions that have no holes in them that can be filled by other expressions – that is, expressions with no free variables. Unsaturated expressions have open holes – which means that they can be applied to other expressions.

Saturated single expressions have arity 0. An expression of arity 0 can’t be applied, and it can’t be the target of a selection expression.

An unsaturated expression has an arity $(\alpha \twoheadrightarrow \beta)$, where both $\alpha$ and $\beta$ are arities. For example, the integer addition function has arity $(0 \otimes 0 \twoheadrightarrow 0)$. (Erik Eidt pointed out that I made an error here. I originally wrote addition as $0 \twoheadrightarrow 0$, where it should have been $0 \otimes 0 \twoheadrightarrow 0$.)

A combined expression $(e_1, e_2, ..., e_n)$ has arity $(\alpha_1 \otimes \alpha_2 \otimes ... \otimes \alpha_n)$, where each $\alpha_i$ is the arity of the expression $e_i$.
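To make the arity rules concrete, here’s a toy arity checker. The encoding – 0 for a saturated single arity, tagged tuples for unsaturated and combined arities – is my own invention for this sketch, not Martin-Löf’s notation:

```python
ZERO = 0  # the arity of a saturated single expression

def arrow(alpha, beta):
    """The arity of an unsaturated expression: alpha ->> beta."""
    return ("->", alpha, beta)

def combined(*alphas):
    """The arity of a combined expression: alpha_1 (x) ... (x) alpha_n."""
    return ("x",) + alphas

# Integer addition has arity (0 (x) 0) ->> 0.
PLUS = arrow(combined(ZERO, ZERO), ZERO)

def apply_arity(fn, arg):
    """Arity of an application: fn must be unsaturated, and its
    domain must exactly match the arity of the argument."""
    if not (isinstance(fn, tuple) and fn[0] == "->"):
        raise TypeError("only unsaturated expressions can be applied")
    if fn[1] != arg:
        raise TypeError("argument arity mismatch")
    return fn[2]

# Applying + to a pair of saturated singles yields a saturated single.
assert apply_arity(PLUS, combined(ZERO, ZERO)) == ZERO
```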

Sadly, I’m out of time to work on this post, so we’ll have to stop here. Next time, we’ll look at the formal rules for arities, and how to use them to define equality of expressions.

# Understanding Expressions

I’m going to be trying something a bit different with this blog.

What I’ve tried to do here on GM/BM is make each post as self-contained as possible. Obviously, many things take more than one post to explain, but I’ve done my best to take things, and divide them into pieces where there’s a basic concept or process that’s the focus of each post.

I’m finding that for this type theory stuff, I’m having a hard time doing that. Or rather, given my work schedule right now when I’m trying to write about type theory, I’m finding it hard to find enough time to do that, and still be posting regularly. (That sounds like a complaint, but it’s not meant to be. I started a new job at Dropbox about three months ago, and I absolutely love it. I’m spending a lot of time working because I’m having so much fun, not because some big mean boss guy is leaning over me and threatening me.)

Anyway, the point of this whole diversion is that I really want to get this blog back onto a regular posting schedule. But to do that, I’m going to have to change my posting style a bit, and make the individual posts shorter, and less self-contained. I’m definitely interested in what you, my readers, think of this – so please, as I roll into this, let me know if you think it’s working or not. Thanks!

In this post, we’re going to start looking at expressions. This might seem like it’s a diversion from the stuff I’ve been writing about type theory, but it really isn’t! Per Martin-Löf developed a theory of expressions which is used by type theorists and many others, and we’re going to be looking at that.

We’ve all seen arithmetic expressions written out since we were in first grade. We think we understand what they mean. But actually, most of us have never really stopped and thought precisely about what an expression actually means. Most of the time, that’s OK: we’ve got an intuitive sense of it that’s good enough. But for type theory, it’s not sufficient. In fact, even if we did have a really good, formal notion of expressions, it wouldn’t be right for type theory: in type theory, we’re rebuilding mathematics from a foundation of computability, and that’s not the basis of any theory of expressions that’s used in other mathematical fields.

Why is this a problem?

Let’s start by looking at a nice, simple expression:

$x^2 + 3x + 7$

What does that mean? Roughly speaking, it’s a function with one parameter: $f(x) = x^2 + 3x + 7$. But that doesn’t really tell us much: all we’ve really done is add a bit of notation. We still don’t know what it means.

Let’s take it a step further. It’s actually describing a computation that adds three elements: $+(x^2, 3x, 7)$. But that’s not quite right either, because we know addition is binary. That means that we need to decide how to divide that addition into two parts. From the associative property, we know that it doesn’t matter which way we divide it – but from a computational perspective, it might: doing it one way versus the other might take much longer!

We’ll pick left-associative, and say that it’s really $+(+(x^2, 3x), 7)$. We also need to expand other parts of this into this functional idea. If we follow it all out, we wind up with: $+(+(*(x,x), *(3, x)),7)$.

We’ve converted the expression into a collection of applications of primitive functions. Or in terms that are more familiar to geeks like me, we’ve taken the expression, and turned it into an abstract syntax tree (AST) that describes it as a computation.
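That nested-application form can be written directly as code. Here’s a sketch using Python’s operator module as the set of primitive functions:

```python
from operator import add, mul

# +(+(*(x, x), *(3, x)), 7), with x abstracted as a parameter.
def poly(x):
    return add(add(mul(x, x), mul(3, x)), 7)

assert poly(2) == 17  # 4 + 6 + 7
```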

We’re still being pretty vague here. We haven’t really defined our notion of function or computation. But even with that vagueness, we’ve already started making the notion of what an expression is much more concrete. We’ve taken the abstract notion of expressions, and translated it to a much less abstract notion of a computation expressed as a sequence of computable functions.

This is a really useful thing to do. It’s useful enough that we don’t want to limit it to just “pure” expressions. In the type theoretic view of computation, everything is an expression. That’s important for multiple reasons – but to make it concrete, we’re going to eventually get around to showing how types work in expressions, what it means for an expression to be well-typed, how to infer types for expressions, and so on. We want all of that to work for any program – not just for something that looks like a formula.

Fortunately, this works. We can also take an “expression” like for i in 1 .. 10 do f(i), and analyze it as a function: for(i, 1, 10, f(i)).
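Here’s that desugaring as a Python sketch (the for_loop name is mine, just to make the shape explicit):

```python
# for i in 1 .. 10 do f(i), rewritten as a higher-order function
# for(i, 1, 10, f(i)): the loop variable becomes a parameter of the body.
def for_loop(start, stop, body):
    for i in range(start, stop + 1):
        body(i)

collected = []
for_loop(1, 10, collected.append)
assert collected == list(range(1, 11))
```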

So, we’ve got a way of understanding expressions as functions. But even if we want to keep the computational side of things abstract and hand-wavy, that’s still not really enough. We’re closer to understanding expressions, but we’ve still got some huge, gaping holes.

Let’s jump back to that example expression: $x^2 + 3x + 7$. What does it mean? What we’ve seen so far is that we can understand it as a series of function calls: $+(+(*(x, x), *(3, x)), 7)$. But we’d like to be able to evaluate it – to execute the computation that it describes. We can’t do that yet: it’s got a gaping hole named $x$. What do we do with that?

We’re missing a really important notion: functional abstraction. Our expression isn’t just an expression: what it really is is a function. We alluded to that before, but now we’re going to deal with it head-on. That expression doesn’t really define a computation: it defines a computational object that computes the function. When an expression has free variables – that is, variables that aren’t assigned a meaning within the expression – our expression represents something that’s been abstracted a level: instead of being a computation that produces a value, it’s an object that takes a parameter, and performs a computation on its parameter(s) in order to produce a value.

In our expression $x^2 + 3x + 7$, we’re looking at an expression in one free variable – which makes it a single-parameter function. In the notation of type theory, we’d write it as $(x)(+(+(*(x, x), *(3, x)), 7))$ – that is, the parameter variable in parens ahead of the expression that it parameterizes. (Yes, this notation stinks; but I’m sticking with the notation from the texts I’m using, and this is it.)
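In Python, that abstraction step is exactly what a lambda does: the expression with a free x becomes a function object that takes x as a parameter:

```python
# (x)(+(+(*(x, x), *(3, x)), 7)): abstracting over the free variable x
# turns the open expression into a callable value.
f = lambda x: ((x * x) + (3 * x)) + 7

assert f(1) == 11   # 1 + 3 + 7
assert f(3) == 25   # 9 + 9 + 7
```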

This notion of parameters and function abstraction turns out to be more complex than you might expect. I’m going to stop this post here – but around Wednesday, I’ll post the next part, which will look at how to understand the arity of an expression.

# A failed attempt to prove P == NP

In computer science, we have one really gigantic open question about complexity. In the lingo, we ask “Does P == NP?”. (I’ll explain what that means below.) On March 9th, a guy named Michael LaPlante posted a paper to ArXiv that purports to prove, once and for all, that P == NP. If this were the case, if Mr. LaPlante (I’m assuming Mr.; if someone knows differently, i.e., that it should be Doctor, or Miss, please let me know!) had in fact proved that P==NP, it would be one of the most amazing events in computer science history. And it wouldn’t only be a theoretical triumph – it would have real, significant practical results! I can’t think of any mathematical proof that would be more exciting to me: I really, really wish that this would happen. But Mr. LaPlante’s proof is, sadly, wrong. Trivially wrong, in fact.

In order to understand what all of this means, why it matters, and where he went wrong, we need to take a step back, and briefly look at computational complexity, what P and NP mean, and what are the implications of P == NP? (Some parts of the discussion that follows are re-edited versions of sections of a very old post from 2007.)

Before we can get to the meat of this, which is talking about P versus NP, we need to talk about computational complexity. P and NP are complexity classes of problems – that is, groups of problems that have similar bounds on their performance.

When we look at a computation, one of the things we want to know is: “How long will this take?”. A specific concrete answer to that depends on all sorts of factors – the speed of your computer, the particular programming language you use to run the program, etc. But independent of those, there’s a basic factor that describes something important about how long a computation will take – the algorithm itself fundamentally requires some minimum number of operations. Computational complexity is an abstract method of describing how many operations a computation will take, expressed in terms of the size or magnitude of the input.

For example: let's take a look at insertion sort. Here it is in Python:

def insertion_sort(lst):
    result = []
    for i in lst:
        inserted = False
        for pos, j in enumerate(result):
            if i < j:
                result.insert(pos, i)
                inserted = True
                break
        if not inserted:
            result.append(i)
    return result


This is, perhaps, the simplest sorting algorithm to understand - most of us figured it out on our own in school, when we had an assignment to alphabetize a list of words. You take the elements of the list to be sorted one at a time; then you figure out where in the list they belong, and insert them.

In the worst possible case, how long does this take?

1. Inserting the first element requires 0 comparisons: just stick it into the list.
2. Inserting the second element takes exactly one comparison: it needs to be compared to the one element in the result list, to determine whether it goes before or after it.
3. Inserting the third element could take either one or two comparisons. (If it's smaller than the first element of the result list, then it can be inserted in front without any more comparisons; otherwise, it needs to be compared against the second element of the result list. So in the worst case, it takes 2 comparisons.)
4. In general, inserting the nth element of the list will take at most n-1 comparisons.

So, in the worst case, it's going to take 0 + 1 + 2 + ... + (n-1) comparisons to produce a sorted list of n elements. There's a nice closed form for that series: $\frac{n(n-1)}{2} = \frac{n^2 - n}{2}$, which is $O(n^2)$.

So while we can't say "computing a list of 100 elements will take 2.3 seconds" (because that depends on a ton of factors - the specific implementation of the code, the programming language, the machine it's running on, etc.), we can say that the time it takes to run increases roughly proportionally to the square of the size of the input - which is what it means when we say that insertion sort is $O(n^2)$.
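We can check that count empirically. Here's a version of the sort instrumented to count comparisons; for this formulation, already-sorted input is the worst case, since each new element gets compared against everything already in the result:

```python
def insertion_sort_counting(lst):
    result, comparisons = [], 0
    for i in lst:
        inserted = False
        for pos, j in enumerate(result):
            comparisons += 1
            if i < j:
                result.insert(pos, i)
                inserted = True
                break
        if not inserted:
            result.append(i)
    return result, comparisons

# n elements in the worst case: 0 + 1 + ... + (n-1) = n(n-1)/2 comparisons.
for n in (10, 100):
    _, count = insertion_sort_counting(list(range(n)))
    assert count == n * (n - 1) // 2
```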

That's the complexity of the insertion sort algorithm. When we talk about complexity, we can talk about two different kinds: the complexity of an algorithm, and the complexity of a problem. The complexity of an algorithm is a measure of how many steps the algorithm takes to execute on an input of a particular size. It's specific to the algorithm - that is, to the specific method used to solve the problem. The complexity of the problem is a lower bound on the complexity of any possible algorithm that can solve that problem.

For example, when you look at sorting, you can say that there's a minimum number of steps that's needed to compute the correct sorted order of a list. In fact, you can prove that to sort a list of n elements, you absolutely require about $\log_2(n!) \approx n \lg n$ bits of information: there's no possible way to be sure you have the list in sorted order with less information than that. If you're using an algorithm that puts things into sorted order by comparing values, that means that you absolutely must do $O(n \lg n)$ comparisons, because each comparison gives you at most one bit of information. That means that sorting is an $O(n \lg n)$ problem. We don't need to know which algorithm you're thinking about - it doesn't matter. There is no possible comparison-based sorting algorithm that takes fewer than on the order of $n \lg n$ steps. (It's worth noting that there are some weasel-words in there: there are sorting algorithms that can beat the $n \lg n$ bound, but they do it by using operations that aren't binary comparisons yielding one bit of information.)
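The information-theoretic bound is easy to check numerically: there are n! possible orderings of n elements, so distinguishing between them takes $\log_2(n!)$ bits, which grows as $\Theta(n \lg n)$:

```python
import math

# log2(n!) bits are needed to pin down one of the n! orderings;
# each binary comparison supplies at most one bit.
n = 100
bits = math.log2(math.factorial(n))

assert bits <= n * math.log2(n)            # log2(n!) <= n log2(n)
assert bits >= (n / 2) * math.log2(n / 2)  # and log2(n!) >= (n/2) log2(n/2)
```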

We like to describe problems by their complexity in that way when we can. But it's very difficult. We're very good at finding upper bounds: that is, we can in general come up with ways of saying "the execution time will be less than O(something)", but we are very bad at finding ways to prove that "the minimum amount of time needed to solve this problem is O(something)". That distinction, between the upper bound (maximum time needed to solve a problem), and lower bound (minimum time needed to solve a problem) is the basic root of the P == NP question.

When we're talking about the complexity of problems, we can categorize them into complexity classes. There are problems that are O(1), which means that they're constant time, independent of the size of the input. There are linear time problems, which can be solved in time proportional to the size of the input. More broadly, there are two basic categories that we care about: P and NP.

P is the collection of problems that can be solved in polynomial time. That means that in the big-O notation for the complexity, the expression inside the parens is a polynomial: the exponents are all fixed values. Speaking very roughly, the problems in P are the problems that we can at least hope to solve with a program running on a real computer.

NP is the collection of problems that can be solved in non-deterministic polynomial time. We'll just gloss over the "non-deterministic" part, and say that for a problem in NP, we don't know of a polynomial time algorithm for producing a solution, but given a solution, we can check if it's correct in polynomial time. For problems in NP, the best solutions we know of have worst-case bounds that are exponential - that is, the expression inside of the parens of the O(...) has an exponent containing the size of the problem.

NP problems are things that we can't solve perfectly with a real computer. The best known exact solutions take an amount of time that's exponential in the size of their inputs. For a running time like $2^n$, tripling the size of the input cubes the execution time; quadrupling the input size raises the execution time to the fourth power; increasing the input size by a factor of 10 raises the execution time to the tenth power. For NP problems, we're currently stuck using heuristics - shortcuts that will quickly produce a good guess at the real solution, but which will sometimes be wrong.
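For concreteness: if the running time is $2^n$, then multiplying the input size by k raises the running time to the kth power, since $2^{kn} = (2^n)^k$:

```python
# For a 2^n-step algorithm, multiplying the input size by k raises
# the running time to the k-th power: 2^(k*n) == (2^n)^k.
def steps(n):
    return 2 ** n

n = 10
assert steps(3 * n) == steps(n) ** 3    # tripling the input cubes the time
assert steps(4 * n) == steps(n) ** 4
assert steps(10 * n) == steps(n) ** 10
```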

NP problems are, sadly, very common in the real world. For one example, there's a classic problem called the travelling salesman. Suppose you've got a door-to-door vacuum cleaner salesman. His territory has 15 cities. You want to find the best route from his house to those 15 cities, and back to his house. Finding that solution isn't just important from a theoretical point of view: the time that the salesman spends driving has a real-world cost! We don't know how to quickly produce the ideal path.
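Here's a brute-force sketch of the travelling salesman problem (the distance matrix is made up for illustration). With a home base plus 15 cities, there are 15!/2 - roughly 650 billion - distinct routes to try, which is exactly why brute force is hopeless:

```python
from itertools import permutations

def shortest_tour(dist):
    """Brute-force TSP: dist[i][j] = distance between locations i and j.
    The tour starts and ends at location 0, and we try every ordering
    of the remaining locations - (n-1)! candidates."""
    n = len(dist)
    best_len, best = float("inf"), None
    for order in permutations(range(1, n)):
        route = (0,) + order + (0,)
        length = sum(dist[a][b] for a, b in zip(route, route[1:]))
        if length < best_len:
            best_len, best = length, route
    return best_len, best

# A tiny 4-location example.
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
length, route = shortest_tour(dist)
assert length == 2 + 4 + 3 + 9   # best tour: 0 -> 1 -> 3 -> 2 -> 0
```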

The big problem with NP is that we don't know lower bounds for anything in it. That means that while we know of slow algorithms for finding the solution to problems in NP, we don't know if those algorithms are actually the best. It's possible that there's a fast solution - a solution in polynomial time which will give the correct answer. Many people who study computational complexity believe that if you can check a solution in polynomial time, then computing a solution should also be polynomial time with a higher-order polynomial. (That is, they believe that there should be some sort of bound like "the time to find a solution is no more than the cube of the time to check a solution".) But so far, no one has been able to actually prove a relationship like that.

When you look at NP problems, some of them have a special, amazing property called NP completeness. If you could come up with a polynomial time solution for any single NP-complete problem, then you'd also discover exactly how to come up with a polynomial time solution for every other problem in NP.

In Mr. LaPlante's paper, he claims to have implemented a polynomial time solution to a problem called the maximum clique problem. Maximum clique is NP complete - so if you could find a P-time solution to it, you'd have proven that P == NP, and that there are polynomial time solutions to all NP problems.

The problem that Mr. LaPlante looked at is the maximum clique problem:

• Given:
1. a set $V$ of atomic objects called vertices;
2. a set $E$ of objects called edges, where each edge is an unordered pair $(x, y)$ of vertices.
• Find:
• the largest set of vertices $C = \{v_1, ..., v_n\}$ such that for every $v_i \in C$, there is an edge between $v_i$ and every other vertex in $C$.

Less formally: given a bunch of dots, where some of the dots are connected by lines, find the largest set of dots where every dot in the set is connected to every other dot in the set.
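For reference, here's the obvious brute-force solution, which checks exponentially many candidate sets - this is the cost that a real P-time algorithm would have to avoid:

```python
from itertools import combinations

def is_clique(vertices, edges):
    """True if every pair of vertices is joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(vertices, 2))

def max_clique(V, E):
    """Brute force: try candidate sets from largest to smallest.
    There are exponentially many subsets of V - which is exactly
    why this problem is hard."""
    edges = {frozenset(e) for e in E}
    for size in range(len(V), 0, -1):
        for cand in combinations(sorted(V), size):
            if is_clique(cand, edges):
                return set(cand)
    return set()

V = {1, 2, 3, 4}
E = [(1, 2), (1, 3), (2, 3), (3, 4)]
assert max_clique(V, E) == {1, 2, 3}
```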

The author claims to have come up with a simple P-time solution to that.

The catch? He's wrong. His solution isn't P-time. It's sloppy work.

His algorithm is pretty easy to understand. Each vertex has a finite set of edges connecting it to its neighbors. You have each node in the graph send its list of neighbors to its neighbors. With that information, each node knows what 3-cliques it's a part of. Every clique of size larger than 3 is made up of overlapping 3-cliques - so you can have the cliques merge themselves into ever larger cliques.

If you look at this, it's still basically considering every possible clique. But his "analysis" of the complexity of his algorithm is so shallow and vague that it's easy to get things wrong. It's a pretty typical example of a sloppy analysis. Complexity analysis is hard, and it's very easy to get wrong. I don't want to be too hard on Mr. LaPlante, because it's an extremely easy mistake to make. Analyzing algorithmic complexity needs to be done in a careful, exacting, meticulous way - and while Mr. LaPlante didn't do that, most people who are professional programmers could easily make a similar mistake! But the ultimate sloppiness of it is that he never bothers to finish computing the complexity. He makes vague hand-wavy motions at showing the complexity of certain phases of his algorithm, but he never even bothers to combine them and come up with an estimate of the full upper-bound of his algorithm!

I'm not going to go into great detail about this. Instead, I'll refer you to a really excellent paper by Patrick Prosser, which looks at a series of algorithms that compute exact solutions to the maximum clique problem, and how they're analyzed. Compare their analysis to Mr. LaPlante's, and you'll see quite clearly how sloppy LaPlante was. I'll give you a hint about one thing LaPlante got wrong: he's taking some steps that take significant work, and treating them as if they were constant time.

But we don't even really need to look at the analysis. Mr. LaPlante provides an implementation of his supposedly P-time algorithm. He should be able to show us execution times for various randomly generated graphs, and show how that time grows as the size of the graph grows, right? I mean, if you're making claims about something like this, and you've got real code, you'll show your experimental verification as well as your theoretical analysis, right?

Nope. He doesn't. And I consider that to be a really, really serious problem. He's claiming to have reduced an NP-complete problem to a small-polynomial complexity: where are the numbers?

I'll give you a good guess about the answer: the algorithm doesn't complete in a reasonable amount of time for moderately large graphs. You could argue that even if it's polynomial time, you're looking at exponents that are no smaller than 3 (exactly what bound he's claiming is hard to determine, since he never bothers to finish the analysis!) - a cubic algorithm on a large graph takes a very long time. But... not bothering to show any runtime data? Nothing at all? That's ridiculous. If you look at the Prosser paper above, he manages to give actual concrete measurements of the exponential time algorithms. LaPlante didn't bother to do that. And I can only conclude that he couldn't gather actual numbers to support his idea.

# Big Bang Bogosity

One of my long-time mantras on this blog has been “The worst math is no math”. Today, I’m going to show you yet another example of that: a recent post on Boing-Boing called “The Big Bang is Going Down”, by a self-proclaimed genius named Rick Rosner.

First postulated in 1931, the Big Bang has been the standard theory of the origin and structure of the universe for 50 years. In my opinion, (the opinion of a TV comedy writer, stripper and bar bouncer who does physics on the side) the Big Bang is about to collapse catastrophically, and that’s a good thing.

According to Big Bang theory, the universe exploded into existence from basically nothing 13.7-something billion years ago. But we’re at the beginning of a wave of discoveries of stuff that’s older than 13.7 billion years.

We’re constantly learning more about our universe, how it works, and how it started. New information isn’t necessarily a catastrophe for our existing theories; it’s just more data. There’s constantly new data coming in – and as yet, none of it comes close to causing the big bang theory to catastrophically collapse.

The two specific examples cited in the article are:

1. one quasar that appears far earlier than we might expect – it existed just 900 million years after the current estimate of when the big bang occurred. That’s very surprising, and very exciting. But even in existing models of the big bang, it’s surprising, not impossible. (No link, because the link in the original article doesn’t work.)
2. an ancient galaxy – a galaxy that existed only 700 million years after the big bang occurred – contains dust. Cosmic dust is made of atoms much larger than hydrogen – like carbon, silicon, and iron, which are (per current theories) the product of supernovas. Supernovas generally don’t happen to stars younger than a couple of billion years – so finding dust in a galaxy less than a billion years after the universe began is quite surprising. But again: impossible under the big bang? No.

The problem with both of these arguments against the big bang is that they’re vague. They’re both handwavy arguments based on crude statements about what “should” be possible or impossible according to the big bang theory. But neither comes close to the kind of precision that an actual scientific argument requires.

Scientists don’t use math because they like to be obscure, or because they think all of the pretty symbols look cool. Math is a tool used by scientists, because it’s useful. Real theories in physics need to be precise. They need to make predictions, and those predictions need to match reality to the limits of our ability to measure them. Without that kind of precision, we can’t test theories – we can’t check how well they model reality. And precise modelling of reality is the whole point.

The big bang is an extremely successful theory. It makes a lot of predictions, which do a good job of matching observations. It’s evolved in significant ways over time – but it remains by far the best theory we have – and by “best”, I mean “most accurate and successfully predictive”. The catch to all of this is that when we talk about the big bang theory, we don’t mean “the universe started out as a dot, and blew up like a huge bomb, and everything we see is the remnants of that giant explosion”. That’s an informal description, but it’s not the theory. That informal description is so vague that a motivated person can interpret it in ways that are consistent, or inconsistent with almost any given piece of evidence. The real big bang theory isn’t a single english statement – it’s many different mathematical statements which, taken together, produce a description of an expansionary universe that looks like the one we live in. For a really, really small sample, you can take a look at a nice old post by Ethan Siegel over here.

If you really want to make an argument that it’s impossible according to the big bang theory, you need to show how it’s impossible. The argument by Mr. Rosner is that the atoms in the dust in that galaxy couldn’t exist according to the big bang, because there wasn’t time for supernovas to create it. To make that argument, he needs to show that that’s true: he needs to look at the math that describes how stars form and how they behave, and then using that math, show that the supernovas couldn’t have happened in that timeframe. He doesn’t do anything like that: he just asserts that it’s true.

In contrast, if you read the papers by the guys who discovered the dust-filled galaxy, you’ll notice that they don’t come anywhere close to saying that this is impossible, or inconsistent with the big bang. All they say is that it’s surprising, and that we may need to revise our understanding of the behavior of matter in the early stages of the universe. The reason that they say that is because there’s nothing there that fundamentally conflicts with our current understanding of the big bang.

But Mr. Rosner can get away with the argument, because he’s being vague where the scientists are being precise. A scientist isn’t going to say “Yes, we know that it’s possible according to the big bang theory”, because the scientist doesn’t have the math to show it’s possible. At the moment, we don’t have sufficient precise math either way to come to a conclusion; we don’t know. But what we do know is that millions of other observations in different contexts, different locations, observed by different methods by different people, are all consistent with the predictions of the big bang. Given that we don’t have any evidence to support the idea that this couldn’t happen under the big bang, we continue to say that the big bang is the theory most consistent with our observations, that it makes better predictions than anything else, and so we assume (until we have evidence to the contrary) that this isn’t inconsistent. We don’t have any reason to discard the big bang theory on the basis of this!

Mr. Rosner, though, goes even further, proposing what he believes will be the replacement for the big bang.

The theory which replaces the Big Bang will treat the universe as an information processor. The universe is made of information and uses that information to define itself. Quantum mechanics and relativity pertain to the interactions of information, and the theory which finally unifies them will be information-based.

The Big Bang doesn’t describe an information-processing universe. Information processors don’t blow up after one calculation. You don’t toss your smart phone after just one text. The real universe – a non-Big Bang universe – recycles itself in a series of little bangs, lighting up old, burned-out galaxies which function as memory as needed.

In rolling cycles of universal computation, old, collapsed, neutron-rich galaxies are lit up again, being hosed down by neutrinos (which have probably been channeled along cosmic filaments), turning some of their neutrons to protons, which provides fuel for stellar fusion. Each calculation takes a few tens of billions of years as newly lit-up galaxies burn their proton fuel in stars, sharing information and forming new associations in the active center of the universe before burning out again. This is ultra-deep time, with what looks like a Big Bang universe being only a long moment in a vast string of such moments across trillions or quadrillions of giga-years.

This is not a novel idea. There are a ton of variations of the “universe as computation” that have been proposed over the years. Just off the top of my head, I can rattle off variations that I’ve read (in decreasing order of interest) by Minsky (can’t find the paper at the moment; I read it back when I was in grad school), by Fredkin, by Wolfram, and by Langan.

All of these theories assert in one form or another that our universe is either a massive computer or a massive computation, and that everything we can observe is part of a computational process. It’s a fascinating idea, and there are aspects of it that are really compelling.

For example, the Minsky model has an interesting explanation for the speed of light as an absolute limit, and for time dilation. Minsky’s model says that the universe is a giant cellular automaton. Each minimum quantum of space is a cell in the automaton. When a particle is located in a particular cell, that cell is “running” the computation that describes that particle. For a particle to move, the data describing it needs to get moved from its current location to its new location at the next time quantum. That takes some amount of computation, and a cell can only perform a finite amount of computation per quantum. The faster the particle moves, the more of each time quantum is dedicated to motion, and the less it has for anything else. The speed of light, in this theory, is the speed at which the full quantum for computing a particle’s behavior is dedicated to nothing but moving it to its next location.
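To make the intuition concrete, here’s a toy sketch of that computation-budget idea. This is purely my own illustration, not Minsky’s actual model: the cell, the budget size, and the “phase” variable are all made up for the example. Each cell gets a fixed number of operations per time quantum; motion consumes operations in proportion to speed, and whatever is left over drives the particle’s internal evolution.

```python
BUDGET = 10  # operations a cell can perform per time quantum (arbitrary)

def tick(position, speed, phase):
    """Advance one particle by one time quantum on a 1D lattice.

    Moving costs `speed` operations out of the cell's fixed budget;
    the remainder advances the particle's internal state (here, a
    toy "phase"). At speed == BUDGET -- the lattice's analogue of
    the speed of light -- nothing is left over, so the internal
    state freezes: the toy version of time dilation.
    """
    move_cost = speed
    assert move_cost <= BUDGET, "nothing can move faster than the budget allows"
    position += speed
    internal_ops = BUDGET - move_cost
    phase = (phase + internal_ops) % 360
    return position, speed, phase
```

A slow particle (`tick(0, 4, 0)`) moves four cells and still advances its phase by six; a budget-speed particle (`tick(0, 10, 0)`) moves ten cells but its phase never changes, just as the model says a particle at light speed has no computation left for anything but motion.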

It’s very pretty. Intuitively, it works. That makes it an interesting idea. But the problem is, no one has come up with an actual working model. We’ve got real observations of the behavior of the physical universe that no one has been able to describe using the cellular automaton model.

That’s the problem with all of the computational hypotheses so far. They look really good in the abstract, but none of them come close to actually working in practice.

A lot of people nowadays like to mock string theory, because it’s a theory that looks really good, but has no testable predictions. String theory can describe the behavior of the universe that we see. The problem with it isn’t that there are things we observe in the universe that it can’t predict, but that it can predict just about anything. There are a ton of parameters in the theory that can be shifted, and depending on their values, almost anything we could observe can be fit by string theory. The problem with it is twofold: we don’t have any way (yet) of figuring out what values those parameters need to have to fit our universe, and we don’t have any way (yet) of performing an experiment that tests a prediction of string theory that’s different from the predictions of other theories.

As much as we enjoy mocking string theory for its lack of predictive value, the computational hypotheses are far worse! So far, no one has been able to come up with one that can come close to explaining all of the things that we’ve already observed, much less to making predictions that are better than our current theories.

But just like he did with his “criticism” of the big bang, Mr. Rosner makes predictions, but doesn’t bother to make them precise. There’s no math to his prediction, because there’s no content to his prediction. It doesn’t mean anything. It’s empty prose, proclaiming victory for an ill-defined idea on the basis of hand-waving and hype.

Boing-Boing should be ashamed for giving this bozo a platform.