A failed attempt to prove P == NP

I wasn’t originally going to write about this, but people keep sending it to me asking for comments.

In computer science, we have one really gigantic open question about complexity. In the lingo, we ask “Does P == NP?”. (I’ll explain what that means below.) On March 9th, a guy named Michael LaPlante posted a paper to ArXiv that purports to prove, once and for all, that P == NP. If this were the case, if Mr. LaPlante (I’m assuming Mr.; if someone knows differently, i.e., that it should be Doctor, or Miss, please let me know!) had in fact proved that P==NP, it would be one of the most amazing events in computer science history. And it wouldn’t only be a theoretical triumph – it would have real, significant practical results! I can’t think of any mathematical proof that would be more exciting to me: I really, really wish that this would happen. But Mr. LaPlante’s proof is, sadly, wrong. Trivially wrong, in fact.

In order to understand what all of this means, why it matters, and where he went wrong, we need to take a step back, and briefly look at computational complexity, what P and NP mean, and what the implications of P == NP would be. (Some parts of the discussion that follows are re-edited versions of sections of a very old post from 2007.)

Before we can get to the meat of this, which is talking about P versus NP, we need to talk about computational complexity. P and NP are complexity classes of problems – that is, groups of problems that have similar bounds on their performance.

When we look at a computation, one of the things we want to know is: “How long will this take?”. A specific concrete answer to that depends on all sorts of factors – the speed of your computer, the particular programming language you use to run the program, etc. But independent of those, there’s a basic factor that describes something important about how long a computation will take – the algorithm itself fundamentally requires some minimum number of operations. Computational complexity is an abstract method of describing how many operations a computation will take, expressed in terms of the size or magnitude of the input.

For example: let’s take a look at insertion sort. Here’s a simple implementation of insertion sort in Python:

def insertion_sort(lst):
  result = []
  for i in lst:
    # Walk through the already-sorted result list, looking for
    # the first element that's bigger than i.
    for pos, j in enumerate(result):
      if i < j:
        # Found it: i belongs right before j.
        result.insert(pos, i)
        break
    else:
      # i is bigger than everything in result, so it goes at the end.
      result.append(i)
  return result

This is, perhaps, the simplest sorting algorithm to understand - most of us figured it out on our own in school, when we had an assignment to alphabetize a list of words. You take the elements of the list to be sorted one at a time; then you figure out where in the list they belong, and insert them.

In the worst possible case, how long does this take?

  1. Inserting the first element requires 0 comparisons: just stick it into the list.
  2. Inserting the second element takes exactly one comparison: it needs to be compared to the one element in the result list, to determine whether it goes before or after it.
  3. Inserting the third element could take either one or two comparisons. (If it's smaller than the first element of the result list, then it can be inserted in front without any more comparisons; otherwise, it needs to be compared against the second element of the result list.) So in the worst case, it takes 2 comparisons.
  4. In general, for the nth element of the list, it will take at most n-1 comparisons.

So, in the worst case, it's going to take 0 + 1 + 2 + ... + n-1 comparisons to produce a sorted list of n elements. There's a nice shorthand for computing that series: \frac{n(n-1)}{2}, which simplifies to \frac{n^2 - n}{2}, which is O(n^2).

So while we can't say "sorting a list of 100 elements will take 2.3 seconds" (because that depends on a ton of factors - the specific implementation of the code, the programming language, the machine it's running on, etc.), we can say that the time it takes to run increases roughly proportionally to the square of the size of the input - which is what it means when we say that insertion sort is O(n^2).
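
If you want to check that growth empirically, here's a quick timing sketch (my own illustration, using the insertion_sort function above): the absolute times depend on your machine, but doubling the input size should roughly quadruple the running time.

import random
import time

def time_insertion_sort(n):
  # Build a random list of n integers, and measure how long
  # insertion_sort (defined above) takes to sort it.
  data = [random.randint(0, 1000000) for _ in range(n)]
  start = time.perf_counter()
  insertion_sort(data)
  return time.perf_counter() - start

# Each doubling of n should roughly quadruple the measured time.
for n in [1000, 2000, 4000, 8000]:
  print(n, time_insertion_sort(n))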

That's the complexity of the insertion sort algorithm. When we talk about complexity, we can talk about two different kinds of complexity: the complexity of an algorithm, and the complexity of a problem. The complexity of an algorithm is a measure of how many steps the algorithm takes to execute on an input of a particular size. It's specific to the algorithm, that is, the specific method used to solve the problem. The complexity of the problem is a bound on the best possible complexity of any algorithm that can solve that problem.

For example, when you look at sorting, you can say that there's a minimum number of steps that's needed to compute the correct sorted order of the list. In fact, you can prove that to sort a list of elements, you absolutely require n lg n bits of information: there's no possible way to be sure you have the list in sorted order with less information than that. If you're using an algorithm that puts things into sorted order by comparing values, that means that you absolutely must do O(n lg n) comparisons, because each comparison gives you one bit of information. That means that sorting is an O(n log n) problem. We don't need to know which algorithm you're thinking about - it doesn't matter. There is no possible comparison-based sorting algorithm that takes less than O(n \log n) steps. (It's worth noting that there's some weasel-words in there: there are some theoretical algorithms that can sort in less than O(n lg n), but they do it by using algorithms that aren't based on binary comparisons that yield one bit of information.)
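
To see where that n lg n figure comes from: a list of n distinct elements has n! possible orderings, and a correct sorting algorithm has to be able to distinguish between all of them. Each binary comparison yields at most one bit of information, so any comparison-based sort needs at least \log_2(n!) comparisons in the worst case - and by Stirling's approximation, \log_2(n!) = n\log_2 n - n\log_2 e + O(\log n), which is \Theta(n \log n).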

We like to describe problems by their complexity in that way when we can. But it's very difficult. We're very good at finding upper bounds: that is, we can in general come up with ways of saying "the execution time will be less than O(something)", but we are very bad at finding ways to prove that "the minimum amount of time needed to solve this problem is O(something)". That distinction, between the upper bound (maximum time needed to solve a problem), and lower bound (minimum time needed to solve a problem) is the basic root of the P == NP question.

When we're talking about the complexity of problems, we can categorize them into complexity classes. There are problems that are O(1), which means that they're constant time, independent of the size of the input. There are linear time problems, which can be solved in time proportional to the size of the input. More broadly, there are two basic categories that we care about: P and NP.

P is the collection of problems that can be solved in polynomial time. That means that in the big-O notation for the complexity, the expression inside the parens is a polynomial: the exponents are all fixed values. Speaking very roughly, the problems in P are the problems that we can at least hope to solve with a program running on a real computer.

NP is the collection of problems that can be solved in non-deterministic polynomial time. We'll just gloss over the "non-deterministic" part, and say that for a problem in NP, we don't know of a polynomial time algorithm for producing a solution, but given a solution, we can check if it's correct in polynomial time. For problems in NP, the best solutions we know of have worst-case bounds that are exponential - that is, the expression inside of the parens of the O(...) has an exponent containing the size of the problem.

NP problems are things that we can't solve perfectly with a real computer. The best exact solutions we know of take an amount of time that's exponential in the size of their inputs. With exponential growth, even small increases in the input size blow up the running time: for a 2^n algorithm, adding just ten elements to the input multiplies the running time by about a thousand, and adding thirty multiplies it by about a billion. For NP problems, we're currently stuck using heuristics - shortcuts that will quickly produce a good guess at the real solution, but which will sometimes be wrong.

NP problems are, sadly, very common in the real world. For one example, there's a classic problem called the travelling salesman. Suppose you've got a door-to-door vacuum cleaner salesman. His territory has 15 cities. You want to find the best route from his house to those 15 cities, and back to his house. Finding that solution isn't just important from a theoretical point of view: the time that the salesman spends driving has a real-world cost! We don't know how to quickly produce the ideal path.

The big problem with NP is that we don't know lower bounds for anything in it. That means that while we know of slow algorithms for finding the solution to problems in NP, we don't know if those algorithms are actually the best. It's possible that there's a fast solution - a solution in polynomial time which will give the correct answer. Many people who study computational complexity believe that if you can check a solution in polynomial time, then computing a solution should also be possible in polynomial time, just with a higher-order polynomial. (That is, they believe that there should be some sort of bound like "the time to find a solution is no more than the cube of the time to check a solution".) But so far, no one has been able to actually prove a relationship like that.

When you look at NP problems, some of them have a special, amazing property called NP completeness. If you could come up with a polynomial time solution for any single NP-complete problem, then you'd also discover exactly how to come up with a polynomial time solution for every other problem in NP.

In Mr. LaPlante's paper, he claims to have implemented a polynomial time solution to a problem called the maximum clique problem. Maximum clique is NP complete - so if you could find a P-time solution to it, you'd have proven that P == NP, and that there are polynomial time solutions to all NP problems.

The problem that Mr. LaPlante looked at is the maximum clique problem:

  • Given:
    1. a set V of atomic objects called vertices;
    2. a set E of objects called edges, where each edge is an unordered pair (x, y) of vertices.
  • Find:
    • The largest set of vertices C=\{v_1, ..., v_n\} where for any v_i, there is an edge between v_i and every other vertex in C.

Less formally: given a bunch of dots, where some of the dots are connected by lines, find the largest set of dots where every dot in the set is connected to every other dot in the set.
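
To make the problem concrete, here's a tiny brute-force solver (my own sketch, not Mr. LaPlante's algorithm). It checks subsets of vertices from largest to smallest, which is exactly the kind of exponential search that a genuine P-time algorithm would have to avoid:

from itertools import combinations

def max_clique(vertices, edges):
  # edges is a set of frozensets {x, y}; a set of vertices is a clique
  # if every pair of its members is connected by an edge.
  def is_clique(subset):
    return all(frozenset(pair) in edges for pair in combinations(subset, 2))

  # Try subsets from largest to smallest; the first clique found is a maximum clique.
  for size in range(len(vertices), 0, -1):
    for subset in combinations(vertices, size):
      if is_clique(subset):
        return set(subset)
  return set()

# A tiny example: vertices 1, 2, 3 form a triangle, and 4 hangs off of 3.
edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (3, 4)]}
print(max_clique([1, 2, 3, 4], edges))  # prints {1, 2, 3}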

The author claims to have come up with a simple P-time solution to that.

The catch? He's wrong. His solution isn't P-time. It's sloppy work.

His algorithm is pretty easy to understand. Each vertex has a finite set of edges connecting it to its neighbors. You have each node in the graph send the list of its neighbors to its neighbors. With that information, each node knows what 3-cliques it's a part of. Every clique of size larger than 3 is made up of overlapping 3-cliques - so you can have the cliques merge themselves into ever larger cliques.

If you look at this, it's still basically considering every possible clique. But his "analysis" of the complexity of his algorithm is so shallow and vague that it's easy to miss that. It's a pretty typical example of a sloppy analysis. Complexity analysis is hard, and it's very easy to get wrong. I don't want to be too hard on Mr. LaPlante, because it's an extremely easy mistake to make. Analyzing algorithmic complexity needs to be done in a careful, exacting, meticulous way - and while Mr. LaPlante didn't do that, most people who are professional programmers could easily make a similar mistake! But the ultimate sloppiness of it is that he never bothers to finish computing the complexity. He makes vague hand-wavy motions at showing the complexity of certain phases of his algorithm, but he never even bothers to combine them and come up with an estimate of the full upper-bound of his algorithm!

I'm not going to go into great detail about this. Instead, I'll refer you to a really excellent paper by Patrick Prosser, which looks at a series of algorithms that compute exact solutions to the maximum clique problem, and how they're analyzed. Compare their analysis to Mr. LaPlante's, and you'll see quite clearly how sloppy LaPlante was. I'll give you a hint about one thing LaPlante got wrong: he's taking some steps that take significant work, and treating them as if they were constant time.

But we don't even really need to look at the analysis. Mr. LaPlante provides an implementation of his supposedly P-time algorithm. He should be able to show us execution times for various randomly generated graphs, and show how that time grows as the size of the graph grows, right? I mean, if you're making claims about something like this, and you've got real code, you'll show your experimental verification as well as your theoretical analysis, right?

Nope. He doesn't. And I consider that to be a really, really serious problem. He's claiming to have reduced an NP-complete problem to a small-polynomial complexity: where are the numbers?

I'll give you a good guess about the answer: the algorithm doesn't complete in a reasonable amount of time for moderately large graphs. You could argue that even if it's polynomial time, you're looking at exponents that are no smaller than 3 (exactly what he claims the bound to be is hard to determine, since he never bothers to finish the analysis!) - a cubic algorithm on a large graph takes a very long time. But... not bothering to show any runtime data? Nothing at all? That's ridiculous. If you look at the Prosser paper above, he manages to give actual concrete measurements of the exponential time algorithms. LaPlante didn't bother to do that. And I can only conclude that he couldn't gather actual numbers to support his idea.
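
For what it's worth, here's the kind of experiment I'd expect to see: generate random graphs of increasing size, time the solver on each, and report how the running time grows. (This sketch uses my brute-force solver from above purely to illustrate the methodology; Mr. LaPlante would need to do the same thing with his own implementation, and show the polynomial growth he claims.)

import random
import time

def random_graph(n, p=0.5):
  # An Erdos-Renyi style random graph: each possible edge is
  # included independently with probability p.
  vertices = list(range(n))
  edges = {frozenset((i, j))
           for i in range(n) for j in range(i + 1, n)
           if random.random() < p}
  return vertices, edges

# With the brute-force solver, the running time blows up exponentially -
# which is exactly what these measurements will show.
for n in [10, 14, 18, 22]:
  vertices, edges = random_graph(n)
  start = time.perf_counter()
  max_clique(vertices, edges)
  print(n, time.perf_counter() - start)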

Big Bang Bogosity

One of my long-time mantras on this blog has been “The worst math is no math”. Today, I’m going to show you yet another example of that: a recent post on Boing-Boing called “The Big Bang is Going Down”, by a self-proclaimed genius named Rick Rosner.

First postulated in 1931, the Big Bang has been the standard theory of the origin and structure of the universe for 50 years. In my opinion, (the opinion of a TV comedy writer, stripper and bar bouncer who does physics on the side) the Big Bang is about to collapse catastrophically, and that’s a good thing.

According to Big Bang theory, the universe exploded into existence from basically nothing 13.7-something billion years ago. But we’re at the beginning of a wave of discoveries of stuff that’s older than 13.7 billion years.

We’re constantly learning more about our universe, how it works, and how it started. New information isn’t necessarily a catastrophe for our existing theories; it’s just more data. There’s constantly new data coming in – and as yet, none of it comes close to causing the big bang theory to catastrophically collapse.

The two specific examples cited in the article are:

  1. one quasar that appears to be more massive and mature than we’d expect for its age – it existed just 900 million years after the current estimate of when the big bang occurred. That’s very surprising, and very exciting. But even in existing models of the big bang, it’s surprising, but not impossible. (No link, because the link in the original article doesn’t work.)
  2. an ancient galaxy – a galaxy that existed only 700 million years after the big bang occurred – contains dust. Cosmic dust is made of atoms much heavier than hydrogen – like carbon, silicon, and iron, which are (per current theories) largely the product of supernovas and evolved stars. Building up enough of those heavy elements, and condensing them into dust, was thought to take longer than that – so finding dust in a galaxy less than a billion years after the universe began is quite surprising. But again: impossible under the big bang? No.

The problem with both of these arguments against the big bang is: they’re vague. They’re both handwavy arguments based on crude statements about what “should” be possible or impossible according to the big bang theory. But neither comes close to the kind of precision that an actual scientific argument requires.

Scientists don’t use math because they like to be obscure, or because they think all of the pretty symbols look cool. Math is a tool used by scientists, because it’s useful. Real theories in physics need to be precise. They need to make predictions, and those predictions need to match reality to the limits of our ability to measure them. Without that kind of precision, we can’t test theories – we can’t check how well they model reality. And precise modelling of reality is the whole point.

The big bang is an extremely successful theory. It makes a lot of predictions, which do a good job of matching observations. It’s evolved in significant ways over time – but it remains by far the best theory we have – and by “best”, I mean “most accurate and successfully predictive”. The catch to all of this is that when we talk about the big bang theory, we don’t mean “the universe started out as a dot, and blew up like a huge bomb, and everything we see is the remnants of that giant explosion”. That’s an informal description, but it’s not the theory. That informal description is so vague that a motivated person can interpret it in ways that are consistent, or inconsistent with almost any given piece of evidence. The real big bang theory isn’t a single english statement – it’s many different mathematical statements which, taken together, produce a description of an expansionary universe that looks like the one we live in. For a really, really small sample, you can take a look at a nice old post by Ethan Siegel over here.

If you really want to make an argument that it’s impossible according to the big bang theory, you need to show how it’s impossible. The argument by Mr. Rosner is that the atoms in the dust in that galaxy couldn’t exist according to the big bang, because there wasn’t time for supernovas to create it. To make that argument, he needs to show that that’s true: he needs to look at the math that describes how stars form and how they behave, and then using that math, show that the supernovas couldn’t have happened in that timeframe. He doesn’t do anything like that: he just asserts that it’s true.

In contrast, if you read the papers by the guys who discovered the dust-filled galaxy, you’ll notice that they don’t come anywhere close to saying that this is impossible, or inconsistent with the big bang. All they say is that it’s surprising, and that we may need to revise our understanding of the behavior of matter in the early stages of the universe. The reason that they say that is because there’s nothing there that fundamentally conflicts with our current understanding of the big bang.

But Mr. Rosner can get away with the argument, because he’s being vague where the scientists are being precise. A scientist isn’t going to say “Yes, we know that it’s possible according to the big bang theory”, because the scientist doesn’t have the math to show it’s possible. At the moment, we don’t have sufficient precise math either way to come to a conclusion; we don’t know. But what we do know is that millions of other observations in different contexts, different locations, observed by different methods by different people, are all consistent with the predictions of the big bang. Given that we don’t have any evidence to support the idea that this couldn’t happen under the big bang, we continue to say that the big bang is the theory most consistent with our observations, that it makes better predictions than anything else, and so we assume (until we have evidence to the contrary) that this isn’t inconsistent. We don’t have any reason to discard the big bang theory on the basis of this!

Mr. Rosner, though, goes even further, proposing what he believes will be the replacement for the big bang.

The theory which replaces the Big Bang will treat the universe as an information processor. The universe is made of information and uses that information to define itself. Quantum mechanics and relativity pertain to the interactions of information, and the theory which finally unifies them will be information-based.

The Big Bang doesn’t describe an information-processing universe. Information processors don’t blow up after one calculation. You don’t toss your smart phone after just one text. The real universe – a non-Big Bang universe – recycles itself in a series of little bangs, lighting up old, burned-out galaxies which function as memory as needed.

In rolling cycles of universal computation, old, collapsed, neutron-rich galaxies are lit up again, being hosed down by neutrinos (which have probably been channeled along cosmic filaments), turning some of their neutrons to protons, which provides fuel for stellar fusion. Each calculation takes a few tens of billions of years as newly lit-up galaxies burn their proton fuel in stars, sharing information and forming new associations in the active center of the universe before burning out again. This is ultra-deep time, with what looks like a Big Bang universe being only a long moment in a vast string of such moments across trillions or quadrillions of giga-years.

This is not a novel idea. There are a ton of variations of the “universe as computation” that have been proposed over the years. Just off the top of my head, I can rattle off variations that I’ve read (in decreasing order of interest) by Minsky (can’t find the paper at the moment; I read it back when I was in grad school), by Fredkin, by Wolfram, and by Langan.

All of these theories assert in one form or another that our universe is either a massive computer or a massive computation, and that everything we can observe is part of a computational process. It’s a fascinating idea, and there are aspects of it that are really compelling.

For example, the Minsky model has an interesting explanation for the speed of light as an absolute limit, and for time dilation. Minsky’s model says that the universe is a giant cellular automaton. Each minimum quanta of space is a cell in the automaton. When a particle is located in a particular cell, that cell is “running” the computation that describes that particle. For a particle to move, the data describing it needs to get moved from its current location to its new location at the next time quanta. That takes some amount of computation, and the cell can only perform a finite amount of computation per quanta. The faster the particle moves, the more of its time quantum is dedicated to motion, and the less it has for anything else. The speed of light, in this theory, is the speed where the full quanta for computing a particle’s behavior is dedicated to nothing but moving it to its next location.

It’s very pretty. Intuitively, it works. That makes it an interesting idea. But the problem is, no one has come up with an actual working model. We’ve got real observations of the behavior of the physical universe that no one has been able to describe using the cellular automaton model.

That’s the problem with all of the computational hypotheses so far. They look really good in the abstract, but none of them come close to actually working in practice.

A lot of people nowadays like to mock string theory, because it’s a theory that looks really good, but has no testable predictions. String theory can describe the behavior of the universe that we see. The problem with it isn’t that there are things we observe in the universe that it can’t predict; it’s that it can predict just about anything. There are a ton of parameters in the theory that can be shifted, and depending on their values, almost anything that we could observe can be fit by string theory. The problem with it is twofold: we don’t have any way (yet) of figuring out what values those parameters need to have to fit our universe, and we don’t have any way (yet) of performing an experiment that tests a prediction of string theory that’s different from the predictions of other theories.

As much as we enjoy mocking string theory for its lack of predictive value, the computational hypotheses are far worse! So far, no one has been able to come up with one that can come close to explaining all of the things that we’ve already observed, much less to making predictions that are better than our current theories.

But just like he did with his “criticism” of the big bang, Mr. Rosner makes predictions, but doesn’t bother to make them precise. There’s no math to his prediction, because there’s no content to his prediction. It doesn’t mean anything. It’s empty prose, proclaiming victory for an ill-defined idea on the basis of hand-waving and hype.

Boing-Boing should be ashamed for giving this bozo a platform.

The Fedex Problem

Over the weekend on twitter, someone tweeted a link to a really fun-looking paper on arXiv, called The Fedex Problem. It’s a really interesting paper, because it takes a real-world problem, analyzes it mathematically, and shows some of the fascinating complexity that lies under the surface of something that seems simple. (I can’t find the original tweet, so I don’t know who first shared it – sorry!)

The problem that they looked at involves the logistics of rapid delivery by services like Federal Express.

When Fedex was founded, they built their overnight delivery business on a new model of how to handle deliveries, called a hub system.

What most delivery businesses traditionally did was take packages in at a local warehouse, and sort them into groups depending on the warehouse that serviced their destination. Each group was then shipped separately, from the source warehouse to the destination warehouse. To do overnight delivery, each warehouse, then, would need to have a full package-sorting system, and they would need to send a shipment to every other warehouse every single day. For N warehouses, that means that they needed N full sorting systems, and N^2 shipments every day. The costs of the redundant shipping systems and the massive number of shipments are huge, and completely non-scalable. Imagine a system with just 100 local warehouses – that’s only 2 for each state! – you’d need to have 10,000 shipments per day. In practice, the number of fedex warehouses just in the US is considerably larger than just 100: it simply can’t work.

What Fedex chose to do instead was build a hub system. Every package that arrives at a local warehouse gets shipped to the same place, called the hub. The only sorting system is at the hub – individual local warehouses don’t sort, they just ship to the hub. The hub sorts according to destination warehouse, and then ships the packages to the appropriate warehouse for delivery.

The Fedex hub system solved both of the problems of the traditional approach. The individual local shipping points don’t need to do sorting: they just ship to the hub, which takes care of all of it. And since every package goes to the hub, instead of needing N^2 shipments, they only needed 2N shipments per day: N shipments of unsorted packages from local warehouses to the hub, and N shipments of sorted packages from the hub to the local warehouses for local deliveries.

How well this system works is very dependent on the proper location of the hub. To demonstrate why, consider the worst possible hub location for American shipments: Hawaii. Now, every package from anywhere in the mainland to anywhere else in the mainland needs to travel at least an extra 2500 miles to the hub, and an extra 2500 miles back! The cost of shipping is dominated by fuel costs – adding any extra distance to the average shipment costs fuel, and thus money. So the location of the hub has a dramatic effect on the shipment costs.

FedEx located their hub in Memphis, Tennessee, because according to its founders, Memphis was close to the “center” of the shipping region for its original target market cities, and thus minimized the total shipping distances. (That wasn’t their sole concern; another factor was that Memphis is in an area that is rarely incapacitated by weather or other natural disasters – but the travel distance was the biggest factor.)

What this paper does is take the Fedex problem, and generalize it. What, exactly, does it mean for a given point to be an ideal hub? Given a network of points in a Euclidean space, how can you compute the ideal hub? And finally, if we look at the actual Federal Express distribution system, using great-circle routing between points, how close is Memphis to being the ideal hub?

If you look at the problem naively, it seems like it should be simple. The ideal hub should just be the “average” location. That’s basically what the FedEx founders said – they picked the geographical center of their original markets. Unfortunately, reality is rarely kind enough to make things fit our intuitions, and that’s not quite right.

To see why, let’s first look at the definition of the ideal hub. What we want to do is to minimize the total distance from local shipping points to the hub. To keep the notation simple, we’ll write the problem down as if we were looking at the one-dimensional version of the problem. Each location is a single value, and we’ll write the distance between two points a and b as the absolute value of a-b: \left|a-b\right|.

If the n points in the network are called x_1 \dots x_n, then the total distance value for a hub at x is given by the function:

 f(x) = \sum_{i=1}^{n} 2\left| x - x_i \right|

Since the 2 is a constant factor that affects all x-values equally, for the purposes of simplicity, we can drop it out. So the problem of selecting an ideal hub ultimately reduces to an optimization problem: find the value x where f(x) is a minimum.

Again, naively, it seems as though the average is best. But we can see that it’s not with a simple example. Let’s just look at a one-dimensional case – a list of locations scattered along a line: {1, 2, 3, 5, 13, 14, 60}. The point that’s at the average is 14, with a total one-way travel distance of 13 + 12 + 11 + 9 + 1 + 0 + 46 = 92. But with a little brute force, we can see that the total distance for x=5 is just 4 + 3 + 2 + 0 + 8 + 9 + 55 = 81. So the mean isn’t the optimal hub location. In fact, you can prove that in the one-dimensional case, the optimal hub is the median location, not the mean.
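
Here’s a quick sanity check of that arithmetic in Python; total_distance is just the f(x) from above, with the constant factor of 2 dropped:

locations = [1, 2, 3, 5, 13, 14, 60]

def total_distance(hub, points):
  # The one-way travel cost of a hub: the sum of the distances
  # from the hub to every point in the network.
  return sum(abs(hub - p) for p in points)

print(total_distance(14, locations))  # 92: the hub at the mean of the locations
print(total_distance(5, locations))   # 81: the hub at the median

# Brute force over the candidate locations confirms that the median (5) wins.
print(min(locations, key=lambda x: total_distance(x, locations)))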

In this example, the average isn’t far off – but even in a quick off-the-cuff example, you can see that average isn’t the answer: the mean can be quite different from the median, particularly when you’ve got a lot of clustering, or a couple of outliers. When you extend to more than one dimension, then things can get bad pretty quickly.

The general problem has been looked at in many forms. It’s known, variously, as the Fermat-Torricelli problem, the Weber problem, or the single facility location problem.

Unfortunately, getting an answer to the problem is a lot harder than just naming it. According to the authors of this paper, even the fact of the existence of a unique ideal hub isn’t trivial. Despite all of the people who’ve worked on the problem, they couldn’t find a single complete proof of it! They go ahead and provide one. They prove that:

For any finite set of non-collinear points in any Euclidean space R^N,
there is an ideal hub, located within the convex hull of the set of points.

The convex hull qualifier there deserves a bit of explanation. It’s an interesting idea that ends up coming up in a bunch of different mathematical contexts. The easiest way to explain it is by the easiest method of finding the convex hull in two dimensions: suppose you’ve got a bunch of points on a plane. Draw the plane on graph paper, and put a nail at the location of each point. Take a rubber band, and stretch it so that every nail on the graph is inside of the rubber band. Let go. The rubber band will end up touching some of the nails; some will be inside, but untouched. The rubber band is the convex hull of the points. It’s basically a general notion of the smallest blob that contains all of the points in the set. The authors present a short, clean proof that the ideal hub location will always be within the “blob” of the points.

The paper goes into some depth in discussing some simple cases of the ideal hub, and how they can be calculated geometrically. But for the general problem, the geometric approaches don’t work. To get the ideal hub for a large collection of points, they’re left with brute force. In the brute force method, you just take every point x in the network, and compute its f(x) value. Since it’s a finite network, you’ll eventually find the ideal hub. (You can, of course, apply
some nice heuristics to simplify the problem: the set of candidates that might be optimal hubs is going to be a lot smaller than the set of all points in the network!) But still, computing the optimal hub is quite complicated – O(N^2) in the number of points in the network.
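
Here’s roughly what that brute-force search looks like in two dimensions, under the simplifying assumptions that only the network’s own points are considered as candidate hubs and that we use straight-line distances instead of great circles:

from math import hypot

def total_distance_2d(hub, points):
  # Sum of the straight-line (Euclidean) distances from the hub to every point.
  return sum(hypot(hub[0] - x, hub[1] - y) for (x, y) in points)

def brute_force_hub(points):
  # O(N^2): evaluate every candidate point against every other point.
  return min(points, key=lambda p: total_distance_2d(p, points))

# A made-up toy network of five locations.
cities = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
print(brute_force_hub(cities))  # prints (1, 0)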

After all of that work, where is the ideal hub? How different is it from the hub that Federal Express actually uses?

As it turns out, Fedex isn’t too far off, but it’s definitely non-optimal. Using US census data, combined with great-circle distances between the census tracts, the ideal hub is in central Indiana. FedEx’s hub is 315 miles off, to the south-south-west of the ideal location.

Interestingly, when UPS (probably FedEx’s biggest rival) set up a hub system, their hub is in Indianapolis – just 85 miles away from the ideal hub.

To me, the most interesting thing about the paper comes after they showed the ideal hub by brute force. If you’ve got more than four points – even just five! – in a two-dimensional space, there’s no simple geometric solution to find the ideal hub. It seems like it should be simple, but it isn’t. The proof of that is really fascinatingly simple: it comes down to group theory and symmetry. Read the paper for the details. A geometric solution is limited by the difference in dimensionality between a particular one-dimensional rotational symmetry group, and the number of objects in the network. As a result, for 4 or fewer non-collinear points, there’s a way of finding a single unique geometric configuration that can be exploited to characterize the ideal hub. For more than 4 points, there isn’t – there’s no single configuration anymore, which blows away the usefulness of geometric solutions.

If you find this at all interesting, then it’s worth downloading and reading the paper. There’s a whole lot there that I haven’t covered – more than what I have. But I hope this little taste is enough to pique your curiosity.

Logical Statements as Tasks

In the next step of our exploration of type theory, we’re going to take a step away from the set-based stuff. There are set-based interpretations of every statement in type theory. But what we really want to focus on is the interpretation of statements in computational terms.

What that means is that we’re going to take logical statements, and view them as computation tasks – that is, as formal logical specifications of a computation that we’d like to do. Under that interpretation, a proof of a statement S is an implementation of the task specified by S.

This interpretation is much, much easier for computer science types like me than the set-based interpretations. We can walk through the interpretations of all of the statements of our intuitionistic logic in just a few minutes.

We’ll start with the simple statements.

Conjunction
A \land B is a specification for a program that produces a pair (a, b) where a is a solution for A, and b is a solution for B.
Disjunction
A \lor B is a specification for a program that produces either a solution to A or a solution to B, along with a way of identifying which of A and B it solved. We do that using a version of the classical injection functions: A \lor B produces either \text{inl}(a) (that is, a left-tagged solution of A), or \text{inr}(b) (a right-tagged solution of B).
Implication
A \supset B is a specification for a program that produces a solution to B given a solution to A; in lambda calculus terms, it’s a form like \lambda x.b(x).
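
Here’s one way to see those three interpretations in programming terms, using Python type annotations as a rough stand-in (a loose illustration of the idea, not real Martin-Löf type theory): a conjunction is a pair type, a disjunction is a tagged union, and an implication is a function type.

from typing import Callable, Tuple, Union

# Stand-ins for the types of solutions to the propositions A and B.
class SolutionA: ...
class SolutionB: ...

# Conjunction: a solution to (A and B) is a pair of solutions, one for each.
Conj = Tuple[SolutionA, SolutionB]

# Disjunction: a solution to (A or B) is a tagged value - either a
# left-tagged solution to A, or a right-tagged solution to B.
class Inl:
  def __init__(self, a: SolutionA): self.value = a

class Inr:
  def __init__(self, b: SolutionB): self.value = b

Disj = Union[Inl, Inr]

# Implication: a solution to (A implies B) is a function that turns
# any solution to A into a solution to B.
Impl = Callable[[SolutionA], SolutionB]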

Now, we can move on to quantified statements. They get a bit more complicated, but if you read the quantifier right, it’s not bad.

Universal
(\forall x \in A) B(x) is a program which, when executed, yields a program of the form \lambda x.b(x), where b(x) is an implementation of B, and x is an implementation of A. In other words, a universal statement is a program factory, which produces a program that turns one program into another program.

To me, the easiest way to understand this is to expand the quantifier. A quantified statement \forall x \in A: B(x) can be read as \forall x: x \in A \Rightarrow B(x). If you read it that way, and just follow the computational interpretation of implication, you get precisely the definition above.

Existential
Existential quantification is easy. An existential statement \exists x \in A: B(x) is a two part problem: it needs a witness a (that is, a specific value of x for which a proof exists that x \in A), and a proof that B(a) holds for that specific value. A solution, then, has two parts: it’s a pair (a, b), where a is a value in A, and b is a program that computes the problem B(a).

This is the perspective from which most of Martin-Löf’s type theory pursues things. There’s a reason why ML’s type theory is so well-loved by computer scientists: because what we’re really doing here is taking a foundational theory of mathematics, and turning it into a specification language for describing computing systems.

That’s the fundamental beauty of what Martin-Löf did: he found a way of formulating all of constructive mathematics so that it’s one and the same thing as the theory of computation.

And that’s why this kind of type theory is so useful as a component of programming languages: because it’s allowing you to describe the semantics of your program in terms that are completely natural to the program. The type system is a description of the problem; and the program is the proof.

With a full-blown Martin-Löf type system, the types really are a full specification of the computation described by the program. We don’t actually use the full expressiveness of type theory in real languages – among other things, it’s not checkable! But we do use a constrained, limited logic with Martin-Löf’s semantics. That’s what types really are in programming languages: they’re logical statements! As we get deeper into type theory, we’ll see exactly how that works.

Propositions as Proofsets: Unwinding the confusion

My type theory post about the different interpretations of a proposition caused a furor in the comments. Understanding what caused all of the confusion is going to be important as we continue to move forward into type theory.

The root problem is really interesting, once you see what’s going on. We’re taking a statement that, on the face of it, isn’t about sets. Then we’re applying a set-based interpretation of it, and looking at the subset relation. That’s all good. The problem is that when we start looking at a set-based interpretation, we’re doing what we would do in classical set theory – but that’s a different thing from what we’re doing here. In effect, we’re changing the statement.

For almost all of us, math is something that we learned from the perspective of axiomatic set theory and first order predicate logic. So that’s the default interpretation that we put on anything mathematical. When you talk about a proposition as a set, we’re programmed to think of it in that classical way: for any set S, there’s a logical predicate P_s such that by definition, \forall x: x \in S \Leftrightarrow P_s(x). When you see P \Rightarrow Q in a set-theory context, what you think is something like \forall x: x \in P \Rightarrow x \in Q. Under that interpretation, the idea that P \supset Q is equivalent to P \rightarrow Q is absolutely ridiculous. If you follow the logic, implication has to correspond to the subset relation, not the superset relation!

The catch, though, is that we’re not talking about set theory, and the statement P \Rightarrow Q that we’re looking at is emphatically not \forall x : P(x) \Rightarrow Q(x). And that, right there, is the root of the problem.

P \rightarrow Q always means P \rightarrow Q – it doesn’t matter whether we’re doing set theory or type theory or whatever else. But when we talk about the interpretation of P as a set, the set we mean here in the world of type theory is a different set from the one we’d mean in classical set theory.

Superset doesn’t suddenly mean subset. Implication doesn’t start working backwards! And yet, I’m still trying to tell you that I really meant it when I said that superset meant implication! How can that possibly make sense?

In type theory, we’re trying to take a very different look at math. In particular, we’re building everything up on a constructive/computational framework. So we’re necessarily going to look at some different interpretations of things – we’re going to look at things in ways that just don’t make sense in the world of classical set theory/FOPL. We’re not going to contradict set theory – but we’re going to look at things very differently.

For example, the kind of statement we’re talking about here is a complete, closed, logical proposition, not a predicate, nor a set. The proposition P is a statement like “‘hello’ has five letters”.

When we look at a logical proposition P, one of the type theoretic interpretations of it is as a set of facts: P can be viewed as the set of all facts that can be proven true using P. In type theory land, this makes perfect sense: if I’ve got a proof of P, then I’ve got a proof of everything that P can prove. P isn’t a statement about the items in P’s proof-set. P is a logical statement about something, and the elements of the proof-set of P are the things that the statement P can prove.

With that in mind, what does P \Rightarrow Q mean in type theory? It means that everything provable using Q is provable using nothing but P.

(It’s really important to note here that there are no quantifiers in that statement. Again, we are not saying \forall x: P(x) \Rightarrow Q(x). P and Q are atomic propositions – not open quantified statements.)

If you are following the interpretation that says that P is the set of facts that are provable using the proposition P, then if P \Rightarrow Q, that means that everything that’s in Q must also be in P. In fact, it means pretty much exactly the same thing as classical superset. Q is a set of facts provable by the statement Q. The statement Q is provable using the statement P – which means that everything in the provable set of Q must, by definition, be in the provable set of P.

The converse doesn’t hold. There can be things provable by P (and thus in the proof-set of P) which are not provable using Q. So taken as sets of facts provable by logical propositions, P \supset Q!

Again, that seems like it’s the opposite of what we’d expect. But the trick is to recognize the meaning of the statements we’re working with, and that despite a surface resemblance, they’re not the same thing that we’re used to. Type theory isn’t saying that the set theoretic statements are wrong; nor is set theory saying that type theory is wrong.

The catch is simple: we’re trying to inject a kind of quantification into the statement P \Rightarrow Q which isn’t there; and then we’re using our interpretation of that quantified statement to say something different.

But there’s an interpretation of statements in type theory which is entirely valid, but which trips over our intuition: our training has taught us to take it, and expand it into an entirely different statement. We create blanks that aren’t there, fill them in, and by doing so, convert it into something that it isn’t, and confuse ourselves.

Lenovo and the Superfish Scam

I’m a bit late to the scene here, but I think it’s still worth making a stab at this. Also, this is being written on an airplane, where I am travelling home from San Francisco to NY with a case of bronchitis. So I am not exactly at my best.

Lenovo, one of the largest makers of windows-based laptops, sold out its customers as part of one of the worst deliberate violations of computer security I’ve ever seen, by shipping a piece of software called Superfish pre-installed on its computers. Superfish is, with absolutely no exaggeration, one of the most serious, unethical, despicable things I’ve seen in quite a long time. It’s appalling.

So what is it, and what’s the big deal?

We need to start with some background, and talk a bit about how secure connections work on the internet.

Every time that you visit a website with a secure connection (a URL that starts with https), you’re using a protocol called TLS (formerly SSL). TLS is designed to do two things:

  1. Ensure that you are talking to who you think you’re talking to.
  2. Ensure that no one but you and the person you wanted to talk to can actually see what you’re saying.

The way that it does both of those is based on encryption. Every time you create a secure connection to a website, you’re exchanging credentials with the site to ensure that they’re who they say they are, and then based on those credentials, you establish an encryption key for the rest of your communication.

That connection-establishment process is the critical bit. You need to get some information that allows you to trust them to be who they claim to be. The security and integrity of everything that happens over the connection depends on the truth and integrity of that initial piece of identity verification.

The identity verification piece of TLS is built using public key cryptography, as part of a standard infrastructure for key maintenance and verification called X.509.

I’ve written about public key crypto before – see here. The basic idea behind public key crypto is that you’ve got two keys, called the public and private keys. Anything which is encrypted with the public key can only be decrypted with the private key; anything which is encrypted with the private key can only be decrypted using the public key. Your public key is available to anyone who wants it; no one but you has access to your private key.

If you receive a message from me, and you can decrypt it with my public key, then you know, without a doubt, that I’m the one who encrypted it. Only my private key could have encrypted a message that could then be decrypted with my public key.

For that to work, though, there’s one thing that you need to be sure of: you need to be absolutely sure that the public key that you’ve got is really my public key. If some clever person managed to somehow give you a different key, and convince you that it’s mine, then they can send you messages, and they’ll look exactly as if they came from me. If I handed you my public key on a USB thumbdrive, then you’re sure that the key came from me – but if you received it online, how can you be sure that it was really me that gave it to you? Maybe someone switched it along the way?

In X.509, we use an idea of delegated trust. That is, we have some small collection of fundamental trusted authorities. Those authorities can issue public/private key pairs, so when someone needs a public key, they can go to them and ask for it, and they’ll create one. The authority gives them a certificate, which is a copy of the new public key encrypted by the authority using their private key.

Now, when someone connects to a website, the target site can state who they are by sending a copy of the certificate. The client site receives the certificate, decrypts it using the authority’s public key, and then starts using that public key to encrypt their communications.

If the two sides can keep talking, then the client knows who it’s talking to. It got a public key, and it’s using that public key to talk to the server; so the server couldn’t decrypt the communication unless it had the matching private key; and the client trusts that it got the right public key, because it was encrypted with the private key of the certificate authority.

This is great as far as it goes, but it leaves us with a single certificate authority (or, at best, a small group). With billions of human users, and possibly trillions of networked devices, having a single authority isn’t manageable. They simply can’t produce enough keys for everyone. So we need to use our trust in the certificate authority to expand the pool of trust. We can do that by saying that if the certificate authority can declare that a particular entity is trustworthy, then we can use that entity itself as a verifier. So now we’ve taken a single trusted authority, and expanded that trust to a collection of places. Each of those newly trusted entities can now also issue new keys, and can certify their validity, by showing their certificate, plus the new encrypted public key. In general, anyone can issue a public key – and we can check its validity by looking at the chain of authorities that verified it, up to the root authority.
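
As a rough sketch of what “checking the chain of authorities” looks like, here’s a toy chain-walker in Python. Everything here is hypothetical and heavily abstracted – the certificate fields are made up, and signature_is_valid is left as a stub – because a real client hands all of this to an X.509/TLS library rather than doing it by hand. The point is just the shape of the logic: each certificate must be vouched for by the next one up, and the chain must end at a root you already trust.

from dataclasses import dataclass

@dataclass
class Certificate:
  subject: str        # who this certificate identifies
  public_key: str     # the subject's public key
  issuer: str         # who vouched for (signed) this certificate
  signature: bytes    # the issuer's signature over the subject and public key

def signature_is_valid(cert, issuer_public_key):
  # Stub: a real implementation would verify cert.signature against
  # cert.subject and cert.public_key using the issuer's public key.
  raise NotImplementedError

def chain_is_trusted(chain, trusted_roots):
  # chain[0] is the server's certificate; each later certificate is the
  # issuer of the one before it; the last one must be a trusted root.
  for cert, issuer in zip(chain, chain[1:]):
    if cert.issuer != issuer.subject:
      return False
    if not signature_is_valid(cert, issuer.public_key):
      return False
  root = chain[-1]
  return root.subject in trusted_roots and trusted_roots[root.subject] == root.public_key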

There’s a catch to this though: the base certificate providers. If you can trust them, then everything works: if you’ve got a certificate chain, you can use it to verify the identity of the party you’re talking to. But if there’s any question about the validity of the root certificate provider, if there’s any question whether or not you have the correct, valid public key for that provider, then you’re completely hosed. Ultimately, there’s some piece of seed information which you have to start off with. You need to accept the validity of an initial certificate authority based on some other mechanism. The people who sold you your computer, or the people who built your web browser, generally install a root certificate – basically the public key for a trusted certificate authority.

If that root certificate isn’t trustworthy, then nothing that results from it can be trusted. The untrustworthy root certificate can be used by an unscrupulous person to create new certificates allowing them to masquerade as anything that they want.

In particular, an untrustworthy root certificate makes it easy to perform a man-in-the-middle attack. Suppose you want to talk to your bank, and somehow Joe Schmoe has planted a bad root certificate on your computer:

  1. You try to connect to Bank.
  2. Joe intercepts the request. He accepts your connection request, pretending to be the bank. Using his fake root certificate, he sends you a certificate claiming to be the bank.
  3. Now, you’re connected to Joe, believing that you’re connected to the bank. You try to log in to your account, sending your username and password.
  4. Joe receives your request, connects to the bank, passing on your request.
  5. Joe logs in to the bank on your behalf. The bank returns its successful login screen, showing your account numbers and balances.
  6. Joe copies the successful login screen from his connection to the bank to you.
  7. Every time that you send information to “the bank”, Joe decrypts it using his private key, and then sends it on to the bank masquerading as you. Similarly, when the bank sends something to you, he decrypts it using the bank’s public key, and then encrypts it with his private key to send it to you.
  8. You now believe that you are connected to the bank over a secure connection. Everything that you do looks exactly as if you’re connected to the bank over a securely encrypted link.
  9. Joe now has your login information, and control of your bank accounts.

In the Lenovo fiasco, Lenovo installed a system called Superfish, which deliberately installs a bad root certificate, and then uses that root certificate to create man-in-the-middle attacks on every secure connection ever made with the computer.

Why does it do that? Purportedly for ad retargeting: it uses its man-in-the-middle to decrypt supposedly secure connections so that it can replace ads in the pages that you view. That way, Lenovo and Superfish get advertising money for the pages you view, instead of the page-owners.

It’s spectacularly despicable. It’s fundamentally compromising everything you do with your computer. It’s doing it in a way that you can’t detect. And it’s doing it all in a sleazy attempt to steal advertising money.

Based on this, I’ve got two pieces of advice:

  1. Don’t ever buy a Lenovo computer. Never, ever, ever. A company that would even consider installing something like this isn’t a company that you can ever trust.
  2. If you have a Lenovo computer already, get rid of it. You could reformat your disks and install a fresh untainted copy of your operating system – but if your hardware manufacturer is as sleazy as Lenovo has demonstrated themselves to be, then there’s no reason to believe that that’s going to be sufficient.

This is, by far, the worst thing that I’ve ever seen a computer manufacturer do. They deserve to be run out of business for this.

Simpler Consensus with Raft

A few weeks ago, I wrote about Paxos, which is (at least in my experience), the most widely used algorithm for consensus in distributed systems. I’m a huge fan of Paxos – I think that it’s a remarkably elegant system.

But Paxos does have its problems.

  1. Paxos has a lot of roles: client, proposer, learner, acceptor, leader, follower. When you want to implement Paxos, you need to figure out all of those roles, and how you’re going to implement them. In general, you end up merging roles – but there are lots of ways of doing that merge. Each particular way of setting up the roles has its own properties, and thus its own tradeoffs that you need to understand.
  2. Paxos, as we normally talk about it, is really a single-consensus protocol – that is, the basic protocol is designed to get a group of agents to come to consensus just once. If you want to be able to repeatedly seek new consensus values, you’re actually going to be using an extension to the basic paxos protocol. There are a ton of Paxos extensions that work to add repeated consensus. Paxos itself is simple and elegant, with well-defined formal properties that we care about – the moment we start modifying it, we can no longer count on those properties unless we can also prove them in our extension!
  3. Paxos was originally described in a truly awful paper. Leslie Lamport was trying to write a paper that would be less dull than the typical bone-dry technical snoozer – but the way that he wrote it actually makes it much harder to understand.

In short: Paxos has more complexity than it needs, and despite that, it needs to be tweaked to be really useful, and getting those tweaks right is hard. There are, sadly, a lot of incorrect Paxos implementations – and their incorrectness has all-too-often come as a surprise to the people who rely on them.

To avoid those problems, there are other consensus algorithms out there. In this post, we’re going to look at one of the Paxos competitors – a consensus algorithm/protocol called raft.

Raft does away with the role complexity of Paxos. In Raft, you have a collection of cooperating agents. There are no distinct proposers, acceptors, or learners: there are just servers. Communication between the servers in raft is done entirely with synchronous remote procedure calls.

In Raft, the target of consensus is a log containing a sequence of events. The log is the history of the distributed system. The goal of raft is that the log be maintained in a consistent state throughout the raft network. Just like in Paxos, if we have 2n+1 servers, up to n can fail without the network losing its consistency.

Raft is designed in terms of remote procedure calls between the elements of the network. In Raft, we never talk about single messages – every communication between servers is a pair of messages: a request from caller to callee; and a response from callee to caller. When a message gets lost, we’ll just talk about it as a failed remote procedure call.

Within a Raft network, at any given time, each server has a state. It can be a follower, a leader, or a candidate. Within the network, there is at most one leader. When there is a leader, all of the other servers are in the follower state. The followers are almost entirely passive. Followers don’t talk to clients at all – they just wait for RPCs from the leader. The leader is the only participant that’s allowed to talk to clients: any client request must go through the current leader. The leader is also the only server that’s allowed to add new entries to the consensus log.

Raft divides time into a sequence of terms. In each term, the servers in the raft network need to select a leader using a process called an election. Raft is a strong leader protocol – no interactions with a client can take place except through a leader. If there’s no leader, then we can’t process client requests at all.

So, to understand Raft, there’s three processes that we need to
understand:

  1. Leader election
  2. Transitions between terms
  3. Appending an entry to the log

In those processes, the servers maintain a collection of variables that they use for the Raft protocol (sketched in code after the list):

  • currentTerm: the current term for the server.
  • votedFor: the server ID that this server voted for in the current term, or “none”.
  • log: the list of entries in the log.
  • commitIndex: the index of the highest log entry that this server knows to be committed.
  • lastApplied: the index of the highest log entry that’s been added to this server’s log – but not necessarily committed. (An entry doesn’t become committed until a majority of servers accept it.)
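
Here's a minimal sketch of that per-server state (in Python, with hypothetical names; a real implementation's bookkeeping would be more involved):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LogEntry:
  term: int      # the term in which a leader created this entry
  command: str   # the client command that the entry records

@dataclass
class ServerState:
  current_term: int = 0             # currentTerm: the latest term this server has seen
  voted_for: Optional[int] = None   # votedFor: candidate id voted for this term, or None
  log: List[LogEntry] = field(default_factory=list)  # the consensus log
  commit_index: int = 0             # commitIndex: highest index known to be committed
  last_applied: int = 0             # lastApplied: highest index added to the log so far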

Leader Election

In each term, the Raft cluster needs to have a leader. The way that a leader gets selected is called an election.

Elections are triggered by a term transition. When a server in the cluster decides that it needs to start a new term, it increments its term number, puts itself into the candidate state, and sends a RequestVote(term, candidateId) RPC to each of the other servers in the cluster. This request asks the other servers to select it as the leader. If it receives enough “yes” votes, it becomes the leader.

When a server receives a RequestVote RPC, it checks the term. If it’s smaller than the server’s current term, then it replies “No” – meaning that it cannot support the requestor as leader.

If the term in the request is greater than the receiver’s term, then the receiver can’t have voted yet in that newer term. So it updates its own term to the one from the request, and then replies “Yes”.

If the term in the request equals the receiver’s term, then the receiver has already updated its term. If it’s already voted for someone else as leader, then it can’t support the requestor, so it replies “No”. If it hasn’t voted for a leader in the term, then it votes for the requestor, and replies “Yes”.

If the requester receives “Yes” votes from more than 1/2 of the cluster (counting itself), then it becomes the leader, and starts both processing requests from clients, and sending heartbeats to the other servers in the cluster.

If it doesn’t receive enough votes, then it waits to see if anyone else becomes the leader and starts sending heartbeats. If it doesn’t get a heartbeat in time, then it starts over: it would increment its term again, and try to start a fresh election.

Term Transitions

For a given server, term transitions happen in three ways:

  1. Timeout: the leader needs to periodically communicate with each of the followers. This process is called a heartbeat: even if the leader has no updates for its followers, it sends RPC calls to the followers just to say “I’m still here”. If a follower goes too long without receiving a heartbeat, it decides that the leader was lost, increments its term number, and triggers a new election (see the sketch after this list).
  2. Leader resignation: the current leader can, at any time, decide to stop being the leader. (This is typically done by an implementation as part of a system that says that there’s a maximum period between leader elections. For example, in the Aurora scheduler, we had leader elections at least once per day. In a raft consensus, the leader would trigger this by deciding it was time for it to stop being a leader, and triggering an election by starting a new term.)
  3. External term change: every RPC received by a server includes a term number. If any RPC to a server ever includes a term number greater than the current term for that server, the server will update its term to the new number. As a special case of this, when a leader server decides to resign, it does that by sending an RPC to the other servers with an incremented term number.
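
As a rough illustration of the timeout and external-term-change rules above, here's a Python sketch with hypothetical names; the randomized timeout values are my assumption, not something specified here:

import random
import time

# Each server picks its own election timeout; randomizing it is an
# implementation choice that helps avoid every follower timing out at once.
ELECTION_TIMEOUT = random.uniform(0.15, 0.30)  # seconds (assumed values)

def observe_rpc_term(state, rpc_term):
  # External term change: any RPC carrying a higher term bumps our term.
  if rpc_term > state.current_term:
    state.current_term = rpc_term
    state.voted_for = None

def maybe_start_election(state, last_heartbeat_time):
  # Timeout: if the leader has been silent for too long, assume it's gone,
  # bump the term, and become a candidate; the caller would then send
  # RequestVote RPCs to the rest of the cluster.
  if time.monotonic() - last_heartbeat_time > ELECTION_TIMEOUT:
    state.current_term += 1
    state.voted_for = None
    return True
  return False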

Appending to the log

We just spent a fair bit of time talking about leaders and elections. That’s almost beside the point. What we really want to do is just maintain a consistent log across the cluster of servers. Everything except creating log entries is just the book-keeping that’s necessary to make the consistent log work. The log itself is maintained using the AppendEntries RPC call.

When a client request does something that alters the state of the cluster, the leader describes that change by adding an entry to the log. It builds a proposed log entry, and sends it to the other members of the cluster using an RPC. If it gets enough “Yes” votes from other cluster members, then the log entry becomes committed, and the leader updates its commitIndex to the index of the new log entry to reflect that.
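
Here's a loose sketch of that leader-side flow (Python, reusing LogEntry and ServerState from the earlier sketch; follower_rpcs is a hypothetical stand-in for whatever mechanism actually delivers the AppendEntries calls):

def leader_append(state, command, follower_rpcs):
  # Build the proposed entry and add it to the leader's own log.
  entry = LogEntry(term=state.current_term, command=command)
  state.log.append(entry)
  # Ask each follower to accept the entry; each callable in follower_rpcs
  # delivers one AppendEntries RPC and returns True on a "Yes" reply.
  acks = 1  # the leader counts its own copy
  for call in follower_rpcs:
    if call(entry):
      acks += 1
  # Once a majority of the cluster holds the entry, it's committed,
  # and the leader advances its commitIndex.
  if acks > (len(follower_rpcs) + 1) // 2:
    state.commit_index = len(state.log)
  return state.commit_index >= len(state.log)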

The RPC request takes a bunch of parameters:

  1. term: the leader’s term.
  2. leaderId: the id of the leader.
  3. prevLogIndex: the index of the last entry in the consensus log preceding this new batch of entries.
  4. prevLogTerm: the term of that preceding log entry.
  5. entries: a set of new log entries to be appended to the log.
  6. leaderCommit: the leader’s commitIndex – the index of the highest entry that the leader knows to be committed.

When an AppendEntries call is received by a follower, here’s what it does (a code sketch follows the list):

  • If the receiver’s term is greater than the request term, then the receiver rejects the request by replying “No”.
  • If the receiver’s commit index is larger than the commit index in the request, then it rejects the request by replying “No”.
  • If the receiver’s log doesn’t contain an entry at prevLogIndex, or that entry’s term doesn’t match the request term, then it rejects the request by replying “No”.
  • If there’s an entry in the log with the same index as the new log entries, and the term in the request matches the receiver’s term, then the receiver removes all entries after prevLogIndex from its log.
  • The receiver then appends the new entries from the request to its log.
  • If the leaderCommit is greater than the commitIndex on the receiver, then the receiver updates its commitIndex.
  • Finally, the receiver replies “Yes”.
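
Pulling the parameters and the steps above together, a follower's handling of AppendEntries might look something like this (a Python sketch, again reusing the ServerState fields from earlier; log indexes are 1-based, as in the prose, and mapped onto Python's 0-based lists):

def handle_append_entries(state, term, leader_id, prev_log_index,
                          prev_log_term, entries, leader_commit):
  # leader_id tells the follower who the current leader is; this sketch doesn't use it.
  # Reject calls from a stale leader.
  if term < state.current_term:
    return (state.current_term, False)
  # Reject calls whose commit index is behind ours (the rule described above).
  if leader_commit < state.commit_index:
    return (state.current_term, False)
  # Reject if our log has no entry at prevLogIndex, or its term disagrees
  # with prevLogTerm.
  if prev_log_index > 0:
    if prev_log_index > len(state.log) or \
       state.log[prev_log_index - 1].term != prev_log_term:
      return (state.current_term, False)
  # Walk the new entries: drop anything of ours that conflicts, and append
  # whatever we don't already have.
  index = prev_log_index
  for entry in entries:
    index += 1
    if index <= len(state.log) and state.log[index - 1].term != entry.term:
      del state.log[index - 1:]     # discard the conflicting tail
    if index > len(state.log):
      state.log.append(entry)
  # Catch our commit index up toward the leader's.
  if leader_commit > state.commit_index:
    state.commit_index = min(leader_commit, len(state.log))
  return (state.current_term, True)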

When a majority of the cluster members have accepted an AppendEntries call, then the log entry gets committed.

The one part of this that’s confusing is how the logs get managed. The leader creates a new log entry, and sends it to the other servers. The complexity comes from dealing with cases where something doesn’t reach consensus.

For example, the leader sends entries 5, 6, and 7 to server S. S adds the entries to its copy of the log – it now contains [1, 2, 3, 4, 5, 6, 7]. Meanwhile, the leader also sends those entries to server T, but the RPC to T fails due to a network fault. Another client request happens, and now the leader sends [5, 6, 7, 8] to S, with prevLogIndex 4. S sees that it already has entries 5, 6, and 7 with matching terms, so it keeps them and just appends 8. But if S had instead picked up conflicting entries at those positions from an earlier leader, it would have discarded them – and everything after them – before appending the leader’s versions.

So the trailing segment of the log can change! How do we handle consensus?

The next time that the leader sends an AppendEntries to a follower, it contains the leader’s commitIndex. The follower updates its commit index to that value. Once it’s done that, any request from a leader that tries to modify anything that comes before that commit index will be rejected.

From a follower’s point of view, the commit thus doesn’t really take effect until the next AppendEntries (or heartbeat) call after the log update.

Raft versus Paxos

That’s the basics of Raft.

In comparison to Paxos, there’s a couple of things to notice:

  1. There’s a lot less confusion around roles. Paxos had a ton of different roles, and rules for interactions between the different roles. Raft doesn’t have any of that: it’s just servers, with one of the servers designated as the leader.
  2. Raft explicitly manages a log, and it adds complexity around log management. In Paxos, you’re just managing a single consensus value; in Raft, you’ve got a sequence of log entries.
  3. Paxos is defined in terms of messages; Raft is designed in terms of remote procedure calls.

So is Raft really simpler than Paxos? I think that’s up for discussion. Personally, I prefer Paxos. There’s a lot of complexity hidden under the covers of the RPC system. It looks simple on the surface, but all of the complexity of message passing, lost messages, message duplication – it’s still there. It’s just been swept under the carpet, as if that really makes it easier.

The way that the logs get maintained is confusing. That’s inevitable: getting distributed knowledge right is never easy. Raft at least makes that part of things explicit, whereas in Paxos, log management is usually part of the implementation, but isn’t really specified by the protocol.
