The Fedex Problem

Over the weekend on twitter, someone tweeted a link to a really fun-looking paper on arXiv, called The Fedex Problem. It’s a really interesting paper, because it takes a real-world problem, analyzes it mathematically, and shows some of the fascinating complexity that lies under the surface of something that seems simple. (I can’t find the original tweet, so I don’t know who first shared it – sorry!)

The problem that they looked at involves the logistics of rapid delivery by services like Federal Express.

When Fedex was founded, they built their overnight delivery business on a new model of how to handle deliveries, called a hub system.

What most delivery businesses traditionally did was take packages in at a local warehouse, and sort them into groups depending on the warehouse that serviced their destination. Each group was then shipped separately, from the source warehouse to the destination warehouse. To do overnight delivery, each warehouse would need to have a full package-sorting system, and would need to send a shipment to every other warehouse every single day. For N warehouses, that means that they needed N full sorting systems, and roughly N^2 shipments every day. The costs of the redundant sorting systems and the massive number of shipments are huge, and completely non-scalable. Imagine a system with just 100 local warehouses – that’s only 2 for each state! – you’d need nearly 10,000 shipments per day. In practice, the number of FedEx warehouses just in the US is considerably larger than 100: it simply can’t work.

What Fedex chose to do instead was build a hub system. Every package that arrives at a local warehouse gets shipped to the same place, called the hub. The only sorting system is at the hub – individual local warehouses don’t sort, they just ship to the hub. The hub sorts according to destination warehouse, and then ships the packages to the appropriate warehouse for delivery.

The Fedex hub system solved both of the problems of the traditional approach. The individual local shipping points don’t need to do sorting: they just ship to the hub, which takes care of all of it. And since every package goes to the hub, instead of needing N^2 shipments, they only needed 2N shipments per day: N shipments of unsorted packages from local warehouses to the hub, and N shipments of sorted packages from the hub to the local warehouses for local deliveries.

How well this system works is very dependent on the proper location of the hub. To demonstrate why, consider the worst possible hub location for American shipments: Hawaii. Now, every package from anywhere in the mainland to anywhere else in the mainland needs to travel at least an extra 2500 miles to the hub, and an extra 2500 miles back! The cost of shipping is dominated by fuel costs – adding any extra distance to the average shipment costs fuel, and thus money. So the location of the hub has a dramatic effect on shipment costs.

FedEx located their hub in Memphis, Tennessee, because according to its founders, Memphis was close to the “center” of the shipping region for its original target market cities, and thus minimized the total shipping distances. (That wasn’t their sole concern; another factor was that Memphis is in an area that is rarely incapacitated by weather or other natural disasters – but the travel distance was the biggest factor.)

What this paper does is take the Fedex problem, and generalize it. What, exactly, does it mean for a given point to be an ideal hub? Given a network of points in a Euclidean space, how can you compute the ideal hub? And finally, if we look at the actual Federal Express distribution system, using great-circle routing between points, how close is Memphis to being the ideal hub?

If you look at the problem naively, it seems like it should be simple. The ideal hub should just be the “average” location. That’s basically what the FedEx founders said – they picked the geographical center of their original markets. Unfortunately, reality is rarely kind enough to make things fit our intuitions, and that’s not quite right.

To see why, let’s first look at the definition of the ideal hub. What we want to do is to minimize the total distance from local shipping points to the hub. To keep the notation simple, we’ll write the problem down as if we were looking at the one-dimensional version of the problem. Each location is a single value, and we’ll write the distance between two points a and b as the absolute value of a-b: \left|a-b\right|.

If the n points in the network are called x_1 \dots x_n, then the total distance value for a hub at x is given by the function:

 f(x) = \sum_{i=1}^{n} 2\left| x - x_i \right|

Since the 2 is a constant factor that affects all x-values equally, we can drop it for simplicity. So the problem of selecting an ideal hub ultimately reduces to an optimization problem: find the value x where f(x) is a minimum.

Again, naively, it seems as though the average is best. But we can see that it’s not with a simple example. Let’s just look at a one-dimensional case – a list of locations scattered along a line: {1, 2, 3, 5, 13, 14, 60}. The point that’s at the average is 14, with a total one-way travel distance of 13 + 12 + 11 + 9 + 1 + 0 + 46 = 92. But by doing some brute force, we can see that the total distance for x=5 is just 4 + 3 + 2 + 0 + 8 + 9 + 55 = 81. So the mean isn’t the optimal hub location. In fact, you can prove that in the one-dimensional case, the optimal hub is the median location, not the mean.
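
Here’s that check done by brute force in Python, using just the numbers from the example above:

    # Total one-way travel distance from a candidate hub x to every point.
    points = [1, 2, 3, 5, 13, 14, 60]

    def total_distance(x):
        return sum(abs(x - p) for p in points)

    mean = sum(points) / len(points)
    print(mean, total_distance(mean))  # 14.0 92.0
    best = min(points, key=total_distance)
    print(best, total_distance(best))  # 5 81 -- the median, not the mean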

In this example, the average isn’t far off – but even in a quick off-the-cuff example, you can see that average isn’t the answer: the mean can be quite different from the median, particularly when you’ve got a lot of clustering, or a couple of outliers. When you extend to more than one dimension, then things can get bad pretty quickly.

The general problem has been looked at in many forms. It’s known, variously, as the Fermat-Torricelli problem, the Weber problem, or the single facility location problem.

Unfortunately, getting an answer to the problem is a lot harder than just naming it. According to the authors of this paper, even the fact of the existence of a unique ideal hub isn’t trivial. Despite all of the people who’ve worked on the problem, they couldn’t find a single complete proof of it! They go ahead and provide one. They prove that:

For any finite set of non-collinear points in any Euclidean space R^N,
there is an ideal hub, located within the convex hull of the set of points.

The convex hull qualifier there deserves a bit of explanation. It’s an interesting idea that ends up coming up in a bunch of different mathematical contexts. The easiest way to explain it is by the easiest method of finding the convex hull in two dimensions: suppose you’ve got a bunch of points on a plane. Draw the plane on graph paper, and put a nail at the location of each point. Take a rubber band, and stretch it so that every nail on the graph is inside of the rubber band. Let go. The rubber band will end up touching some of the nails; some will be inside, but untouched. The rubber band is the convex hull of the points. It’s basically a general notion of the smallest blob that contains all of the points in the set. The authors present a short, clean proof that the ideal hub location will always be within the “blob” of the points.

The paper goes into some depth in discussing some simple cases of the ideal hub, and how they can be calculated geometrically. But for the general problem, the geometric approaches don’t work. To get the ideal hub for a large collection of points, they’re left with brute force. In the brute force method, you just take every point x in the network, and compute its f(x) value. Since it’s a finite network, you’ll eventually find the ideal hub. (You can, of course, apply some nice heuristics to simplify the problem: the set of candidates that might be optimal hubs is going to be a lot smaller than the set of all points in the network!) But still, computing the optimal hub is quite expensive – O(N^2) in the number of points in the network.
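
As a concrete sketch of the brute-force method (with made-up coordinates – the paper itself works with census data and great-circle distances):

    from math import hypot

    # Each point is both a candidate hub and a destination, so scoring
    # every candidate against every point is O(N^2).
    network = [(0.0, 0.0), (1.0, 5.0), (4.0, 1.0), (6.0, 6.0), (9.0, 2.0)]

    def total_distance(hub):
        hx, hy = hub
        return sum(hypot(hx - x, hy - y) for (x, y) in network)

    best = min(network, key=total_distance)
    print(best, total_distance(best))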

After all of that work, where is the ideal hub? How different is it from the hub that Federal Express actually uses?

As it turns out, FedEx isn’t too far off, but it’s definitely non-optimal. Using US census data, combined with great-circle distances between the census tracts, the ideal hub is in central Indiana. FedEx’s hub is 315 miles away, to the south-south-west of the ideal location.

Interestingly, when UPS (probably FedEx’s biggest rival) set up a hub system, their hub is in Indianapolis – just 85 miles away from the ideal hub.

To me, the most interesting thing about the paper comes after they’ve found the ideal hub by brute force. If you’ve got more than four points – even just five! – in a two-dimensional space, there’s no simple geometric solution to find the ideal hub. It seems like it should be simple, but it isn’t. The proof of that is fascinatingly simple: it comes down to group theory and symmetry. Read the paper for the details. A geometric solution is limited by the difference in dimensionality between a particular one-dimensional rotational symmetry group and the number of objects in the network. As a result, for four or fewer non-collinear points, there’s a way of finding a single unique geometric configuration that can be exploited to characterize the ideal hub. For more than four points, there isn’t – there’s no single configuration anymore, which blows away the usefulness of geometric solutions.

If you find this at all interesting, then it’s worth downloading and reading the paper. There’s a whole lot there that I haven’t covered – more than what I have. But I hope this little taste is enough to pique your curiosity.

Logical Statements as Tasks

In the next step of our exploration of type theory, we’re going to take a step away from the set-based stuff. There are set-based interpretations of every statement in type theory. But what we really want to focus on is the interpretation of statements in computational terms.

What that means is that we’re going to take logical statements, and view them as computation tasks – that is, as formal logical specifications of a computation that we’d like to do. Under that interpretation, a proof of a statement S is an implementation of the task specified by S.

This interpretation is much, much easier for computer science types like me than the set-based interpretations. We can walk through the interpretations of all of the statements of our intuitionistic logic in just a few minutes.

We’ll start with the simple statements.

Conjunction
A \land B is a specification for a program that produces a pair (a, b) where a is a solution for A, and b is a solution for B.
Disjunction
A \lor B is a specification for a program that produces either a solution to A or a solution to B, along with a way of identifying which of A and B it solved. We do that using a version of the classical injection functions: a solution to A \lor B is either \text{inl}(a) (that is, the left injection of a solution a to A), or \text{inr}(b) (the right injection of a solution b to B).
Implication
A \supset B is a specification for a program that produces a solution to B given a solution to A; in lambda calculus terms, it’s a form like \lambda x . b(x).
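
Here’s a sketch of those three interpretations in code, using Python’s type annotations as the notation. None of this comes from the text – A and B are stand-in type variables, and the names are just the classical ones:

    from dataclasses import dataclass
    from typing import Callable, Generic, TypeVar, Union

    A = TypeVar("A")
    B = TypeVar("B")

    # Conjunction A ∧ B: a program that produces a pair (a, b).
    Conj = tuple[A, B]

    # Disjunction A ∨ B: a solution to one side, tagged so we know which.
    @dataclass
    class Inl(Generic[A]):
        value: A  # inl(a): the left injection, wrapping a solution to A

    @dataclass
    class Inr(Generic[B]):
        value: B  # inr(b): the right injection, wrapping a solution to B

    Disj = Union[Inl[A], Inr[B]]

    # Implication A ⊃ B: a program turning any solution to A into one for B.
    Impl = Callable[[A], B]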

Now, we can move on to quantified statements. They get a bit more complicated, but if you read the quantifier right, it’s not bad.

Universal
(\forall x \in A) B(x) is a specification for a program which, when executed, yields a program of the form \lambda x.b(x), where b(x) is a solution to B(x) for any solution x to A. In other words, a universal statement is a program factory, which produces a program that turns one program into another program.

To me, the easiest way to understand this is to expand the quantifier. A quantified statement \forall x \in A: B(x) can be read as \forall x: x \in A \Rightarrow B(x). If you read it that way, and just follow the computational interpretation of implication, you get precisely the definition above.

Existential
Existential quantification is easy. An existential statement \exists x \in A: B(x) is a two-part problem: it needs a witness a (that is, a value of x for which a proof exists that x \in A), and a proof that for that specific value, B(a) holds. A solution, then, has two parts: it’s a pair (a, b), where a is a value in A, and b is a program that solves the problem B(a).
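
The quantifiers can be sketched the same way. Python’s type system can’t express that the type B(x) depends on the value x, so the dependent part is an untyped placeholder – again, purely illustrative:

    from typing import Any, Callable, TypeVar

    A = TypeVar("A")

    # Universal (∀x ∈ A) B(x): a function λx.b(x) that turns each
    # solution x of A into a solution of B(x).
    Forall = Callable[[A], Any]

    # Existential (∃x ∈ A) B(x): a pair (a, b) of a witness a and a
    # solution b to the problem B(a).
    Exists = tuple[A, Any]

    # Hypothetical example: "some natural number is even", witnessed by 4,
    # paired with a stand-in for the evidence that even(4) succeeds.
    witness: Exists[int] = (4, "computation showing even(4) succeeds")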

This is the perspective from which most of Martin-Löf’s type theory pursues things. There’s a reason why ML’s type theory is so well-loved by computer scientists: what we’re really doing here is taking a foundational theory of mathematics, and turning it into a specification language for describing computing systems.

That’s the fundamental beauty of what Martin-Löf did: he found a way of formulating all of constructive mathematics so that it’s one and the same thing as the theory of computation.

And that’s why this kind of type theory is so useful as a component of programming languages: because it’s allowing you to describe the semantics of your program in terms that are completely natural to the program. The type system is a description of the problem; and the program is the proof.

With a full-blown Martin-Löf type system, the types really are a full specification of the computation described by the program. We don’t actually use the full expressiveness of type theory in real languages – among other things, it’s not checkable! But we do use a constrained, limited logic with Martin-Löf’s semantics. That’s what types really are in programming languages: they’re logical statements! As we get deeper into type theory, we’ll see exactly how that works.

Propositions as Proofsets: Unwinding the confusion

My type theory post about the different interpretations of a proposition caused a furor in the comments. Understanding what caused all of the confusion is going to be important as we continue to move forward into type theory.

The root problem is really interesting, once you see what’s going on. We’re taking a statement that, on the face of it, isn’t about sets. Then we’re applying a set-based interpretation of it, and looking at the subset relation. That’s all good. The problem is that when we start looking at a set-based interpretation, we’re doing what we would do in classical set theory – but that’s a different thing from what we’re doing here. In effect, we’re changing the statement.

For almost all of us, math is something that we learned from the perspective of axiomatic set theory and first order predicate logic. So that’s the default interpretation that we put on anything mathematical. When you talk about a proposition as a set, we’re programmed to think of it in that classical way: for any set S, there’s a logical predicate P_S such that by definition, \forall x: x \in S \Leftrightarrow P_S(x). When you see P \Rightarrow Q in a set-theory context, what you think is something like \forall x: x \in P \Rightarrow x \in Q. Under that interpretation, the idea that P \supset Q is equivalent to P \rightarrow Q seems absolutely ridiculous: if you follow the logic, implication must correspond to the subset relation, not the superset relation!

The catch, though, is that we’re not talking about set theory, and the statement P \Rightarrow Q that we’re looking at is emphatically not \forall x : P(x) \Rightarrow Q(x). And that, right there, is the root of the problem.

P \rightarrow Q always means P \rightarrow Q – it doesn’t matter whether we’re doing set theory or type theory or anything else. But when we talk about the interpretation of P as a set, the set we mean in the world of type theory is a different set from the one we mean in classical set theory.

Superset doesn’t suddenly mean subset. Implication doesn’t start working backwards! And yet, I’m still trying to tell you that I really meant it when I said that superset meant implication! How can that possibly make sense?

In type theory, we’re trying to take a very different look at math. In particular, we’re building everything up on a constructive/computational framework. So we’re necessarily going to look at some different interpretations of things – we’re going to look at things in ways that just don’t make sense in the world of classical set theory/FOPL. We’re not going to contradict set theory – but we’re going to look at things very differently.

For example, the kind of statement we’re talking about here is a complete, closed, logical proposition, not a predicate, nor a set. The proposition P is a statement like “‘hello’ has five letters”.

When we look at a logical proposition P, one of the type theoretic interpretations of it is as a set of facts: P can be viewed as the set of all facts that can be proven true using P. In type theory land, this makes perfect sense: if I’ve got a proof of P, then I’ve got a proof of everything that P can prove. P isn’t a statement about the items in P’s proof-set. P is a logical statement about something, and the elements of the proof-set of P are the things that the statement P can prove.

With that in mind, what does P \Rightarrow Q mean in type theory? It means that everything provable using Q is provable using nothing but P.

(It’s really important to note here that there are no quantifiers in that statement. Again, we are not saying \forall x: P(x) \Rightarrow Q(x). P and Q are atomic propositions – not open quantified statements.)

If you are following the interpretation that says that P is the set of facts that are provable using the proposition P, then if P \Rightarrow Q, that means that everything that’s in Q must also be in P. In fact, it means pretty much exactly the same thing as classical superset. Q is the set of facts provable by the statement Q. The statement Q is provable using the statement P – which means that everything in the provable set of Q must, by definition, be in the provable set of P.

The converse doesn’t hold. There can be things provable by P (and thus in the proof-set of P) which are not provable using Q. So taken as sets of facts provable by logical propositions, P \supset Q!

Again, that seems like it’s the opposite of what we’d expect. But the trick is to recognize the meaning of the statements we’re working with, and that despite a surface resemblance, they’re not the same thing that we’re used to. Type theory isn’t saying that the set theoretic statements are wrong; nor is set theory saying that type theory is wrong.

The catch is simple: we’re trying to inject a kind of quantification into the statement P \Rightarrow Q which isn’t there; and then we’re using our interpretation of that quantified statement to say something different.

There’s an interpretation of statements in type theory which is entirely valid, but which trips over our intuition: our training has taught us to take it, and expand it into an entirely different statement. We create blanks that aren’t there, fill them in, and by doing so, convert it into something that it isn’t, and confuse ourselves.

Lenovo and the Superfish Scam

I’m a bit late to the scene here, but I think it’s still worth making a stab at this. Also, this is being written on an airplane, where I am travelling home from San Francisco to NY with a case of bronchitis. So I am not exactly at my best.

Lenovo, one of the largest makers of Windows-based laptops, sold out its customers as part of one of the worst deliberate violations of computer security I’ve ever seen, by shipping a piece of software called Superfish pre-installed on its computers. Superfish is, with absolutely no exaggeration, one of the most serious, unethical, despicable things I’ve seen in quite a long time. It’s appalling.

So what is it, and what’s the big deal?

We need to start with some background, and talk a bit about how secure connections work on the internet.

Every time that you visit a website with a secure connection (a URL that starts with https), you’re using a protocol called TLS (formerly SSL). TLS is designed to do two things:

  1. Ensure that you are talking to who you think you’re talking to.
  2. Ensure that no one but you and the person you wanted to talk to can actually see what you’re saying.

The way that it does both of those is based on encryption. Every time you create a secure connection to a website, you’re exchanging credentials with the site to ensure that they’re who they say they are, and then based on those credentials, you establish an encryption key for the rest of your communication.

That connection-establishment process is the critical bit. You need to get some information that allows you to trust them to be who they claim to be. The security and integrity of everything that happens over the connection depends on the truth and integrity of that initial piece of identity verification.

The identity verification piece of TLS is built using public key cryptography, as part of a standard infrastructure for key maintenance and verification called X.509.

I’ve written about public key crypto before – see here. The basic idea behind public key crypto is that you’ve got two keys, called the public and private keys. Anything which is encrypted with the public key can only be decrypted with the private key; anything which is encrypted with the private key can only be decrypted using the public key. Your public key is available to anyone who wants it; no one but you has access to your private key.

If you receive a message from me, and you can decrypt it with my public key, then you know, without a doubt, that I’m the one who encrypted it. Only my private key could have encrypted a message that can be decrypted with my public key.

For that to work, though, there’s one thing that you need to be sure of: you need to be absolutely sure that the public key that you’ve got is really my public key. If some clever person managed to somehow give you a different key, and convince you that it’s mine, then they can send you messages, and they’ll look exactly as if they came from me. If I handed you my public key on a USB thumbdrive, then you can be sure that the key came from me – but if you received it online, how can you be sure that it was really me that gave it to you? Maybe someone switched it along the way?

In X.509, we use an idea of delegated trust. That is, we have some small collection of fundamental trusted authorities. Those authorities can issue public/private key pairs, so when someone needs a public key, they can go to them and ask for it, and they’ll create one. The authority gives them a certificate, which is a copy of the new public key encrypted by the authority using their private key.

Now, when someone connects to a website, the target site can state who they are by sending a copy of the certificate. The client receives the certificate, decrypts it using the authority’s public key, and then starts using the enclosed public key to encrypt their communications.

If the two sides can keep talking, then the client knows who it’s talking to. It got a public key, and it’s using that public key to talk to the server; so the server couldn’t decrypt the communication unless it had the matching private key; and the client trusts that it got the right public key, because the certificate was encrypted with the private key of the certificate authority.

This is great as far as it goes, but it leaves us with a single certificate authority (or, at best, a small group). With billions of human users, and possibly trillions of networked devices, having a single authority isn’t manageable. They simply can’t produce enough keys for everyone. So we need to use our trust in the certificate authority to expand the pool of trust. We can do that by saying that if the certificate authority declares that a particular entity is trustworthy, then we can use that entity itself as a verifier. So now we’ve taken a single trusted authority, and expanded that trust to a collection of places. Each of those newly trusted entities can now also issue new keys, and can certify their validity, by showing their certificate, plus the new encrypted public key. In general, anyone can issue a public key – and we can check its validity by looking at the chain of authorities that verified it, up to the root authority.

There’s a catch to this though: the base certificate providers. If you can trust them, then everything works: if you’ve got a certificate chain, you can use it to verify the identity of the party you’re talking to. But if there’s any question about the validity of the root certificate provider, if there’s any question whether or not you have the correct, valid public key for that provider, then you’re completely hosed. Ultimately, there’s some piece of seed information which you have to start off with. You need to accept the validity of an initial certificate authority based on some other mechanism. The people who sold you your computer, or the people who built your web browser, generally install a root certificate – basically the public key for a trusted certificate authority.
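
You can see that seed trust directly from code. Here’s a minimal sketch using Python’s standard ssl module (the hostname is just a placeholder): create_default_context() loads the root certificates installed on your machine, and the TLS handshake succeeds only if the server’s certificate chains back to one of them.

    import socket
    import ssl

    hostname = "example.com"  # placeholder target site

    # Load the locally installed root certificates -- the "seed" trust.
    context = ssl.create_default_context()

    with socket.create_connection((hostname, 443)) as sock:
        # wrap_socket performs the TLS handshake, checking the server's
        # certificate chain against the trusted roots; it raises
        # ssl.SSLCertVerificationError if the chain can't be verified.
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            print(tls.getpeercert()["subject"])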

If that root certificate isn’t trustworthy, then nothing that results from it can be trusted. The untrustworthy root certificate can be used by an unscrupulous person to create new certificates allowing them to masquerade as anything that they want.

In particular, an untrustworthy root certificate makes it easy to perform a man-in-the-middle attack. Suppose you want to talk to your bank, and somehow Joe Schmoe has planted a bad root certificate on your computer:

  1. You try to connect to Bank.
  2. Joe intercepts the request. He accepts your connection request, pretending to be the bank. Using his fake root certificate, he sends you a certificate claiming to be the bank.
  3. Now, you’re connected to Joe, believing that you’re connected to the bank. You try to log in to your account, sending your username and password.
  4. Joe receives your request, connects to the bank, passing on your request.
  5. Joe logs in to the bank on your behalf. The bank returns its successful login screen, showing your account numbers and balances.
  6. Joe copies the successful login screen from his connection to the bank to you.
  7. Every time that you send information to “the bank”, Joe decrypts it using his private key, and then sends it on to the bank masquerading as you. Similarly, when the bank sends something to you, he decrypts it using the bank’s public key, and then encrypts it with his private key to send it to you.
  8. You now believe that you are connected to the bank over a secure connection. Everything that you do looks exactly as if you’re connected to the bank over a securely encrypted link.
  9. Joe now has your login information, and control of your bank accounts.

In the Lenovo fiasco, Lenovo installed a system called Superfish, which deliberately installs a bad root certificate, and then uses that root certificate to create man-in-the-middle attacks on every secure connection ever made with the computer.

Why does it do that? Purportedly for ad retargeting: it uses its man-in-the-middle to decrypt supposedly secure connections so that it can replace ads in the pages that you view. That way, Lenovo and Superfish get advertising money for the pages you view, instead of the page-owners.

It’s spectacularly despicable. It’s fundamentally compromising everything you do with your computer. It’s doing it in a way that you can’t detect. And it’s doing it all in a sleazy attempt to steal advertising money.

Based on this, I’ve got two pieces of advice:

  1. Don’t ever buy a Lenovo computer. Never, ever, ever. A company that would even consider installing something like this isn’t a company that you can ever trust.
  2. If you have a Lenovo computer already, get rid of it. You could reformat your disks and install a fresh untainted copy of your operating system – but if your hardware manufacturer is as sleazy as Lenovo has demonstrated themselves to be, then there’s no reason to believe that that’s going to be sufficient.

This is, by far, the worst thing that I’ve ever seen a computer manufacturer do. They deserve to be run out of business for this.

Simpler Consensus with Raft

A few weeks ago, I wrote about Paxos, which is (at least in my experience) the most widely used algorithm for consensus in distributed systems. I’m a huge fan of Paxos – I think that it’s a remarkably elegant system.

But Paxos does have its problems.

  1. Paxos has a lot of roles: client, proposer, learner, acceptor, leader, follower. When you want to implement Paxos, you need to figure out all of those roles, and how you’re going to implement them. In general, you end up merging roles – but there are lots of ways of doing that merge. Each particular way of setting up the roles has its own properties, and thus its own tradeoffs that you need to understand.
  2. Paxos, as we normally talk about it, is really a single-consensus protocol – that is, the basic protocol is designed to get a group of agents to come to consensus just once. If you want to be able to repeatedly seek new consensus values, you’re actually going to be using an extension to the basic Paxos protocol. There are a ton of Paxos extensions that work to add repeated consensus. Paxos itself is simple and elegant, with well-defined formal properties that we care about – the moment we start modifying it, we can no longer count on those properties unless we can also prove them in our extension!
  3. Paxos was originally described in a truly awful paper. Leslie Lamport was trying to write a paper that would be less dull than the typical bone-dry technical snoozer – but the way that he wrote it actually makes it much harder to understand.

In short: Paxos has more complexity than it needs, and despite that, it needs to be tweaked to be really useful, and getting those tweaks right is hard. There are, sadly, a lot of incorrect Paxos implementations – and their incorrectness has all-too-often come as a surprise to the people who rely on them.

To avoid those problems, there are other consensus algorithms out there. In this post, we’re going to look at one of the Paxos competitors – a consensus algorithm/protocol called Raft.

Raft does away with the role complexity of Paxos. In Raft, you have a collection of cooperating agents. There are no distinct proposers, acceptors, or learners: there are just servers. Communication between the servers in Raft is done entirely with synchronous remote procedure calls.

In Raft, the target of consensus is a log containing a sequence of events. The log is the history of the distributed system. The goal of Raft is that the log be maintained in a consistent state throughout the Raft network. Just like in Paxos, if we have 2n+1 servers, up to n can fail without the network losing its consistency.

Raft is designed in terms of remote procedure calls between the elements of the network. In Raft, we never talk about single messages – every communication between servers is a pair of messages: a request from caller to callee; and a response from callee to caller. When a message gets lost, we’ll just talk about it as a failed remote procedure call.

Within a Raft network, at any given time, each server has a state. It can be a follower, a leader, or a candidate. Within the network, there is at most one leader. When there is a leader, all of the other servers are in the follower state. The followers are almost entirely passive. Followers don’t talk to clients at all – they just wait for RPCs from the leader. The leader is the only participant that’s allowed to talk to clients: any client request must go through the current leader. The leader is also the only server that’s allowed to add new entries to the consensus log.

Raft divides time into a sequence of terms. In each term, the servers in the Raft network need to select a leader using a process called an election. Raft is a strong-leader protocol – no interaction with a client can take place except through the leader. If there’s no leader, then client requests simply can’t be processed.

So, to understand Raft, there are three processes that we need to understand:

  1. Leader election
  2. Transitions between terms
  3. Appending an entry to a log.

In those processes, the servers have a collection of variables that they use for the Raft protocol:

currentTerm
the current term for the server.
votedFor
the serverID that this server voted for in the current term, or “none”.
log
the list of entries in the log.
commitIndex
The index of the highest log entry known to be committed by the server.
lastApplied
The index of the highest log entry that’s been added to the log – but not necessarily committed. (It doesn’t become committed until a majority of servers accept it.)
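
As a concrete sketch, here’s one way that per-server state might look in code (a Python rendering; the field names follow the list above, while the Role enum and LogEntry shape are assumptions for illustration):

    from dataclasses import dataclass, field
    from enum import Enum

    class Role(Enum):
        FOLLOWER = "follower"
        CANDIDATE = "candidate"
        LEADER = "leader"

    @dataclass
    class LogEntry:
        term: int        # the term in which the entry was created
        command: object  # the actual state change; opaque to the protocol

    @dataclass
    class ServerState:
        current_term: int = 0          # currentTerm
        voted_for: int | None = None   # votedFor; None means "none"
        log: list[LogEntry] = field(default_factory=list)
        commit_index: int = 0          # commitIndex
        last_applied: int = 0          # lastApplied
        role: Role = Role.FOLLOWER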

Leader Election

In each term, the Raft cluster needs to have a leader. The way that a leader is selected is called an election.

Elections are triggered by a term transition. When a server in the cluster decides that it needs to start a new term, it increments its term number, puts itself into the candidate state, and sends a RequestVote(term, candidateId) RPC to each of the other servers in the cluster. This request asks the other servers in the cluster to select it as the leader. If it receives enough “yes” votes, it will become the leader.

When a server receives a RequestVote RPC, it checks the term. If it’s smaller than the server’s current term, then it replies “No” – meaning that it cannot support the requestor as leader.

If the term in the request is greater than the receiver’s term, then the receiver cannot have voted in the new term. So it updates to the term from the request, and then it replies “Yes”.

If the term in the request equals the receiver’s term, then the receiver has already updated its term. If it’s already voted for someone else as leader, then it can’t support the requestor, so it replies “No”. If it hasn’t voted for a leader in the term, then it votes for the requestor, and replies “Yes”.

If the requester receives “Yes” votes from more than 1/2 of the cluster (counting itself), then it becomes the leader, and starts both processing requests from clients, and sending heartbeats to the other servers in the cluster.

If it doesn’t receive enough votes, then it waits to see if anyone else becomes the leader and starts sending heartbeats. If it doesn’t get a heartbeat in time, then it starts over: it increments its term again, and tries to start a fresh election.
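
Putting those rules together, the vote handler is small. Here’s a sketch, reusing the hypothetical ServerState from the earlier sketch (True stands for a “Yes” reply):

    def handle_request_vote(state: ServerState, term: int,
                            candidate_id: int) -> bool:
        if term < state.current_term:
            return False  # stale term: can't support this candidate
        if term > state.current_term:
            # A new term: we can't have voted in it yet, so adopt the
            # term and vote for the requestor.
            state.current_term = term
            state.voted_for = candidate_id
            state.role = Role.FOLLOWER
            return True
        # Same term: vote "Yes" only if we haven't voted for anyone else.
        if state.voted_for in (None, candidate_id):
            state.voted_for = candidate_id
            return True
        return False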

Term Transitions

For a given server, term transitions happen in three ways:

  1. Timeout: the leading server needs to periodically communicate with each of the followers. This process is called heartbeat: even if the leader has no updates for its followers, it sends RPC calls to the followers just to say “I’m still here”. If a follower goes too long without receiving a heartbeat, it decides that the leader was lost, and it will increment the term number, and trigger a new election.
  2. Leader resignation: the current leader can, at any time, decide to stop being the leader. (This is typically done by an implementation as part of a system that says that there’s a maximum period between leader elections. For example, in the Aurora scheduler, we had leader elections at least once per day. In a Raft consensus, the leader would trigger this by deciding it was time for it to stop being the leader, and triggering an election by starting a new term.)
  3. External term change: every RPC received by a server includes a term number. If any RPC to a server ever includes a term number greater than the current term for that server, the server will update its term to the new number. As a special case of this, when a leader server decides to resign, it does that by sending an RPC to the other servers with an incremented term number.

Appending to the log

We just spent a fair bit of time talking about leaders and elections. That’s almost beside the point. What we really want to do is just maintain a consistent log across the cluster of servers. Everything except creating log entries is just the book-keeping that’s necessary to make the consistent log work. The log itself is maintained using the AppendEntries RPC call.

When a client request does something that alters the state of the cluster, the leader describes that change by adding an entry to the log. It builds a proposed log entry, and sends it to the other members of the cluster using an RPC. If it gets enough “Yes” votes from other cluster members, then the log entry becomes committed, and the leader updates its commitIndex to the index of the new log entry to reflect that.

The RPC request takes a bunch of parameters:

  1. term: the leader’s term.
  2. leaderId: the id of the leader.
  3. prevLogIndex: the index number of the last log entry in the consensus log preceding this new entry.
  4. prevLogTerm: the term of that preceding log entry.
  5. entries: a set of new log entries to be appended to the log.
  6. leaderCommit: the leader’s commitIndex after this set of entries has been committed.

When an AppendEntries call is received by a follower, what it does is:

  • If the receiver’s term is greater than the request term, then the receiver rejects the request by replying “No”.
  • If the receiver’s commit index is larger than the commit index of the request, then it rejects the request by replying “No”.
  • If the receiver’s log doesn’t contain an entry at prevLogIndex, or that entry’s term doesn’t match the request term, then it rejects the request by replying “No”.
  • If an existing entry in the receiver’s log conflicts with one of the new entries – the same index, but a different term – then the receiver removes that entry and all of the entries after it from its log.
  • The receiver then appends the new entries from the request to its log.
  • If the leaderCommit is greater than the commitIndex on the receiver, then the receiver updates its commitIndex.
  • Finally, the receiver replies “Yes”.
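
Here’s that receiver logic as a sketch, again using the hypothetical ServerState from before. Log indices are 1-based, matching the description above, with prevLogIndex 0 meaning “before the first entry”:

    def handle_append_entries(state: ServerState, term: int, leader_id: int,
                              prev_log_index: int, prev_log_term: int,
                              entries: list, leader_commit: int) -> bool:
        # Reject requests from stale leaders.
        if term < state.current_term:
            return False
        # Reject if the leader's commit index is behind ours.
        if leader_commit < state.commit_index:
            return False
        # Reject unless our log matches at prev_log_index.
        if prev_log_index > 0:
            if (len(state.log) < prev_log_index or
                    state.log[prev_log_index - 1].term != prev_log_term):
                return False
        # Drop any conflicting suffix, then append the new entries.
        state.log = state.log[:prev_log_index] + entries
        # Advance our commit index toward the leader's.
        if leader_commit > state.commit_index:
            state.commit_index = min(leader_commit, len(state.log))
        return True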

When a majority of the cluster members have accepted an AppendEntries call, then the log entry gets committed.

The one part of this that’s confusing is how the logs get managed. The leader creates a new log entry, and sends it to the other servers. The complexity comes from dealing with cases where something doesn’t reach consensus.

For example, the leader sends entries 5, 6, and 7 to server S. S adds the entries to its copy of the log – it now contains [1, 2, 3, 4, 5, 6, 7]. Meanwhile, the leader also sends those entries to server T, but the RPC to T fails due to a network fault. Another client request happens, and now the leader sends [5, 6, 7, 8] to S. S sees that it’s got entry 5 already: so it discards everything after the match point, and then re-appends the new entries.

So the trailing segment of the log can change! How do we handle consensus?

The next time that the leader sends an AppendEntries to a follower, it contains the leader’s commitIndex. The follower updates its commit index to that value. Once it’s done that, any request from a leader that tries to modify anything that comes before that commit index will be rejected.

The consensus commit thus doesn’t really occur until the next heartbeat call after a log update.

Raft versus Paxos

That’s the basics of Raft.

In comparison to Paxos, there’s a couple of things to notice:

  1. There’s a lot less confusion around roles. Paxos had a ton of different roles, and rules for interactions between the different roles. Raft doesn’t have any of that: it’s just servers, with one of the servers designated as the leader.
  2. Raft explicitly manages a log, and it adds complexity around log management. In Paxos, you’re just managing a single consensus value; in Raft, you’ve got a sequence of log entries.
  3. Paxos is defined in terms of messages; Raft is designed in terms of remote procedure calls.

So is Raft really simpler than Paxos? I think that’s up for discussion. Personally, I prefer Paxos. There’s a lot of complexity hidden under the covers of the RPC system. It looks simple on the surface, but all of the complexity of message passing, lost messages, message duplication – it’s still there. It’s just been swept under the carpet, as if that really makes it easier.

The way that the logs get maintained is confusing. That’s inevitable: maintaining distributed knowledge is never easy. Raft at least makes that part of things explicit; in Paxos, log management is a common part of implementations, but it’s not really specified in the protocol.

Quick Vaccine Math

A friend of mine asked me to write about the math of vaccines. A lot of people have been talking about it lately, so I’m not really sure if I’ve got anything new to add, but I can at least give my usual mathy spin to it.

Vaccines have been getting a lot of attention lately, for good reason. There are a lot of people in America who’ve bought in to a bogus line about the supposed danger of vaccines, and the supposed benign-ness of the diseases that they can prevent. That’s led to many children not getting vaccinated as they should, which has culminated in a recent outbreak of measles caused by a contagious but not yet symptomatic child at Disneyland.

When it comes to vaccines, there’s two things that a lot of people don’t understand. One is herd immunity; the other is probability of infection.

Herd immunity is a really important concept. In an ideal world, if you got vaccinated against measles, there’d be no chance that you’d ever catch it. But it doesn’t work that way. What a vaccine does is reduce the probability that you’ll catch the disease. Due to lots of random factors involving the way that a given individual’s immune system works, the vaccine can’t be perfect. Beyond that, there are also many people who either can’t be immunized, or whose immune system is not functioning correctly. For example, people who are getting chemotherapy for cancer have severely depressed immune systems, and even if they’ve been immunized, their immune systems aren’t capable of preventing the disease.

So just relying on the fact that you’ve been immunized isn’t really enough. To prevent outbreaks of the disease, we rely on an emergent property of a vaccinated population. If enough people are immune to the disease, then even if one person somehow gets infected with it, they won’t be able to cause it to spread.

Let’s walk through a simple example. Suppose we’ve got a disease where the vaccine is 95% effective – that is, 95 out of every hundred people who received it are completely immune to infection by it. Let’s also suppose that this is a highly infectious disease: out of every 100 non-immune people who are exposed to it, 95 will become ill. If everyone is immunized, how many people need to be exposed to a sick person in order for the disease to spread?

Infections turn into outbreaks when the number of infected people grows – if each sick person infects more than one other person, then the infection will start to grow exponentially. The severity of the outbreak will depend on how many people get infected by each sick person.

Suppose that the first sick person has contact with 20 people while they’re contagious. 95% of them are immune – which means that only one out of that twenty is susceptible. There’s a 95% chance that that person will get infected. This isn’t good, but if it’s kept to that rate, we won’t have an outbreak: each sick person will infect about one other person on average – and not always even that. So the infection will die out without exploding into a significant outbreak.

What happens if 5% of the population doesn’t get vaccinated? Then the pool of susceptible people grows to about 10%. And in our contrived example, each sick person now exposes two susceptible people, each of whom has a 95% chance of becoming ill. That’s more than enough to cause a major outbreak! On average, each sick person will cause 1.9 other people to become sick!

When the population’s immunity rate (either through vaccine, or through prior infection) gets to be high enough that an infection can no longer spread, the population is said to have herd immunity: even individuals who can’t be immunized no longer need to worry about catching it, because the population doesn’t have the capacity to spread it around in a major outbreak.

(In reality, the effectiveness of the measles vaccine really is in the 95 percent range – actually slightly higher than that; various sources estimate it somewhere between 95 and 97 percent effective! And the infectivity of most diseases is lower than the example above. Measles (which is a highly, highly contagious disease, far more contagious than most!) is estimated to infect between 80 and 90 percent of exposed non-immune people. So if enough people are immunized, herd immunity will take hold even with more than 20 people being exposed by every sick person.)
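
The arithmetic in all of these examples reduces to a single expression: contacts, times the fraction of contacts who aren’t immune, times the chance of infection per exposure. A quick sketch:

    def expected_infections(contacts, immune_fraction, infection_prob):
        # Expected number of new cases caused by one sick person.
        return contacts * (1 - immune_fraction) * infection_prob

    # Everyone vaccinated with a 95%-effective vaccine: ~95% immune.
    print(expected_infections(20, 0.95, 0.95))  # 0.95 -- infection dies out
    # With 5% unvaccinated, immunity drops to ~90%.
    print(expected_infections(20, 0.90, 0.95))  # 1.9  -- outbreak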

Moving on: there’s a paradox that some antivaccine people use in their arguments. If you look at an outbreak of an illness that we vaccinate for, you’ll frequently find that more vaccinated people become ill than unvaccinated. Therefore, they say, it’s not the fault of the unvaccinated. We’ll look at the math to see the problem with that.

Let’s use the same numbers as above: 95% vaccine effectiveness, 95% contagion. In addition, let’s say that 2% of people choose to go unvaccinated.

That means that 98% of the population has been vaccinated, and 95% of them are immune. So about 93% of the population has immunity.

If each infected person has contact with 20 other people, then we can expect about 7% of those 20 to be infectable – or 1.4; and of those, 95% will become ill – or about 1.33. So on average, each sick person will infect more than 1 1/3 other people. That’s enough to cause a significant outbreak. Without the non-immunized people, the infection rate is less than 1 – not enough to cause an outbreak. The non-immunized population reduced the herd immunity enough to cause an outbreak.

Now, within the population, how many immunized versus non-immunized people will get sick?

Out of every 100 people, there are about 5 who got vaccinated, but aren’t immune. Out of that same 100 people, there are 2 who didn’t get vaccinated. So we’d expect that in 100 cases of the disease, about 70 of them would be vaccinated, and 30 unvaccinated.

The vaccinated population is much, much larger – 50 times larger! – than the unvaccinated, so we’d expect more vaccinated people to become ill, even though it’s the smaller unvaccinated group that broke the herd immunity! The easiest way to see that is to take those numbers, and normalize them.

In the vaccinated community, those 70 sick people are – in the worst possible case, where every single non-immune vaccinated person became ill! – the 5% non-immune from a population of 1400 people. So the worst possible infection rate in the vaccinated population is just 5% – and in reality, it’s more like 4.75%. But those 30 sick people from the unvaccinated pool are 30 out of about 32 non-immunized people who were exposed. The unvaccinated people were roughly 20 times more likely to be infected.
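
The whole paradox fits in a few lines of arithmetic. Here it is as a sketch, using the numbers from above:

    vaccinated = 0.98  # fraction of the population that got the vaccine
    effective = 0.95   # fraction of vaccinated people who become immune

    susceptible_vax = vaccinated * (1 - effective)  # ~0.049 of everyone
    susceptible_unvax = 1 - vaccinated              # 0.020 of everyone
    susceptible = susceptible_vax + susceptible_unvax

    # Cases are proportional to susceptibility, so the share of cases:
    print(susceptible_vax / susceptible)    # ~0.71: mostly vaccinated people
    print(susceptible_unvax / susceptible)  # ~0.29

    # But the per-capita risk in each group tells the real story:
    risk_vax = susceptible_vax / vaccinated            # ~0.05
    risk_unvax = susceptible_unvax / (1 - vaccinated)  # 1.0
    print(risk_unvax / risk_vax)  # ~20: unvaccinated are ~20x more at risk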

The reality of vaccines is pretty simple.

  1. Vaccines are highly effective.
  2. The diseases that vaccines prevent are not benign.
  3. Vaccines are really, really safe. None of the horror stories told by anti-vaccine people have any basis in fact. Vaccines don’t damage your immune system, they don’t cause autism, and they don’t cause cancer.
  4. Not vaccinating your children (or yourself!) doesn’t just put you at risk for illness; it dramatically increases the chances of other people becoming ill. Even when more vaccinated people than unvaccinated become ill, that’s largely caused by the unvaccinated population.

In short: everyone who is healthy enough to be vaccinated should get vaccinated. If you don’t, you’re a despicable free-riding asshole who’s deliberately choosing to put not just yourself but other people at risk.

The Program is the Proof: Propositions in Type Theory

As usual, I’m going in several different directions. I’m going to continue doing data structure posts, but at the same time I also want to get back to the type theory stuff that I was writing about before I put the blog on hiatus.

So let’s get back to a bit of Martin-Löf type theory! (The stuff I’m writing about today corresponds, roughly, to chapter 2 of the Nordstrom/Petersson/Smith text.)

One of the key ideas of Martin-Löf’s type theory is that a logical statement is exactly the same thing as a specification of a computation. When you define a predicate like “Even”, the definition specifies both the set of numbers that satisfy the predicate, and the computation that tests a number for membership in the set of even numbers. If you haven’t provided enough information to fully specify the computation, then in Martin-Löf type theory, you haven’t defined a predicate or a set.

When you say “2 is even”, what you’re really saying in terms of the type theory is that “The computation for ‘even(2)’ will succeed”. The computation and the logical statement are the same thing.

In functional programming, we like to say that the program is the proof. Martin-Löf type theory is where that came from – and today we’re going to take a first look in detail at exactly what it means. In the world of type theory, the program is the proof, and the proof doesn’t exist without the program.

This creates an interesting set of equivalent interpretations. When you see a statement like “x : T” (or x \in T), that could be interpreted in the following ways, all of which are really equivalent in type theory.

  1. Set theoretic: x is a member of the set T.
  2. Intuitionistic: x is a proof object for the proposition T.
  3. Computational: x is a program that satisfies the specification T.
  4. Abstract: x is a solution for the problem T.

In the rest of this post, I’m going to focus on those four interpretations, and explain how each of them makes sense in this version of type theory.

The set theoretic interpretation is obvious – as the name suggests, it’s nothing but what we all learned from basic set theory. An object is a member of a set – which means, from set theory, that the object satisfies some predicate in first order predicate logic – because that’s what it means to be a member of a set.

The intuitionistic interpretation is almost the same as the set theoretic, but rewritten for intuitionistic logic. In intuitionistic logic, the predicate over the set is written as a proposition T, and if we know that x is a member of the set T, then that means that we have a proof that x demonstrates that T is true.

The computational interpretation takes the intuitionistic one, and rephrases it in computational terms. A logical proposition, rendered into a computational setting, is just a specification of a program; and a proof of the proposition is a program that satisfies the specification.

Finally, the abstract interpretation just rephrases the computational one into terms that aren’t tied to a computing device. A predicate is a problem that needs to be solved; anything that provides a solution to the problem is demonstrating a member of the set.

The key takeaway though is the basic idea here of what a type is. What we’re talking about as a type here is something that goes far beyond any programming language’s idea of what a type is. In intuitionistic type theory, a type is a specification of a computation. If we had this in a real language, that would mean that any program that typechecked would be guaranteed to work: asserting that x has type T means, precisely, that x is a computation that matches the specification!

(Of course, that’s not the panacea that you might think the first time you hear that. The catch is simple: the type is a specification of the computation. That means that just writing a type is a form of programming! And that means that your type descriptions are going to have bugs. But we’ll come back to that in a much later post.)

What type theory is doing is taking something like set theory, and re-rendering it entirely in a computational world. It still has a mapping from the computations to the abstract concepts that we use when we’re thinking, but if we can talk about those abstract concepts in type theory, we’ll always do it by mapping them into some kind of computation.

In type theory, we’re not dealing in a world of pure mathematical objects that exist if we can describe them; instead, we’re building a world where everything is at least theoretically computable. That might seem constraining, but every proof already corresponds to a computation of some sort; the only additional constraint here is that we can’t play tricks like the axiom of choice, where we can “prove” the existence of some unattainable, intangible, nonsensical object.

To make that work, we’re going to take all of the constructs that we’re used to seeing in intuitionistic logic, and give them a meaning in terms of computations.

For example, in set theory, we can have a statement A \supset B – meaning that A is a superset of B, that every element of B is also necessarily an element of A. In type theory, since A and B are specifications of computations, a member (or proof) of A \supset B is a computation that, given a proof of A, generates a proof of B – in short, that A implies B.

Now, suppose that we want to prove A \supset A. How could we do that? We need a program that given a proof of A generates a proof of A. That is, we need an implementation of the identity function: \lambda a . a.

In fact, using the computational interpretation of things, we can interpret A \supset B as being the type of a function that takes an instance of A, and generates an instance of B – that is, if f : (A \supset B), then f is a function from an instance of A to an instance of B!

The only trick to that is understanding that in type theory, saying that a is an element of A means that a is a proof of A. Using the same interpretation, that means that f: A \supset B means that f is a proof of A \supset B – which means the same thing as saying that given an example of an element of A (a proof of A), f will produce an element of B (a proof of B). The statement A \supset B is exactly the same thing as the logical implication A \rightarrow B, which is exactly the same thing as the type of a function from A to B.
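
In programming-language terms, this is nothing more exotic than a generic function type. A tiny sketch in Python’s type notation (purely illustrative):

    from typing import TypeVar

    A = TypeVar("A")

    # A ⊃ A, read computationally, is the type of functions from proofs
    # of A to proofs of A. The identity function λa.a inhabits that type
    # for every A -- the program is the proof.
    def identity(a: A) -> A:
        return a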


Notes on sources: I’m working from two main references in this series of posts, both of which are available in PDF form online.

  1. “Programming in Martin-Löf’s Type Theory”, by Nordstrom, Petersson, and Smith, which you can download here.
  2. “Type Theory and Functional Programming” by Simon Thompson (available here).

In addition, I first learned a lot of this from reading some papers by Phil Wadler and Simon Peyton Jones. The exact references to those are long-lost in the shadows of my memory. But any of their papers are well worth reading, so just read them all!

Mathematical Data Structures Part 1: Binary Heaps

Being a PhD in computer science, it’s only natural that I love data structures. In particular, I’m fascinated by the way that the math factors in to the way we structure data. Data structures fit into a beautiful intersection between information theory and algorithms: the way that a good data structure is built is a reflection of what information it really needs to maintain. The best data structure encodes exactly the information it needs in order to do its job – no more, and no less. The mathematical impacts of that are beautiful, and sometimes surprising. To try to demonstrate that, I’m going to take a couple of posts, and work my way through one of my favorite examples of a surprising outcome in a structure called a Fibonacci heap.

A heap is a structure designed to solve a common problem. You’ve got a collection of objects, each of which has an associated numeric value. You want, at any time, to be able to find and remove the largest value in the collection, and to be able to add new elements to it. Those two operations are the core of the heap. Some variations also allow you to increase the value of objects inside the heap, or to remove values other than the maximum.

There are a lot of different ways to implement a heap. One obvious one is to just maintain a sorted sequence of objects. The problem with that is performance: some of the common operations are painfully slow!

Using the sorted sequence approach, removing the largest value is easy: you just remove the last element of the sequence. That’s very fast: it’s constant time. But you also need to be able to add values to the heap, and that’s not so good.

There are two basic ways of implementing a sequence: an array, or a linked list. In both cases, the performance isn’t acceptable. If we used an array, then in order to add a new object to the collection, we’d need to:

  1. Find the correct position for it in the array. We can do that by doing a binary search, which takes time O(lg n) where n is the length of the array. This step isn’t bad – in general, we’re pretty happy with O(lg n) operations.
  2. Insert the value into the array – which means shifting all of the elements that come after it one place to the right. That’s O(n) time, which is pretty crappy.

In the linked list approach, inserting the value isn’t a problem – it’s a constant time operation. But finding the position where it should be inserted is linear time. So either way – array or linked list – insertion into a sorted sequence ends up taking linear time, which is unacceptable.
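
To make those costs concrete, here’s a quick sketch of the array version in Xtend (the language I’ll use for the heap implementation below); the class and method names here are mine:

import java.util.ArrayList
import java.util.Collections

class SortedInsertSketch {
  // Insert a value into an already-sorted list, keeping it sorted.
  def static void sortedInsert(ArrayList<Integer> sorted, int value) {
    // The search is the fast part: O(lg n). When the value isn't in
    // the list, binarySearch returns (-(insertionPoint) - 1).
    var idx = Collections.binarySearch(sorted, value)
    if (idx < 0) {
      idx = -(idx + 1)
    }
    // The insert is the slow part: everything after idx gets shifted
    // one slot to the right, which is O(n).
    sorted.add(idx, value)
  }
}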

The problem with the sorted sequence approach isn’t really related to the kind of structure we use to maintain the sorted list; the problem is that we’re maintaining more information than we need. At any time, we want to be able to find the largest element of the heap quickly – we don’t care about the relative order of any pair of values that doesn’t include the largest element of the collection! But if we keep a sorted list, every time we insert an element, we’re spending a lot of time comparing values whose relative order we don’t care about!

To be able to make it faster, we need to build a data structure that doesn’t waste time and effort computing and maintaining information that we don’t want.

So our goal is to find ways of building structures that always let us both find the largest element quickly, and add new elements quickly, without maintaining more information than is really necessary. We’ll start off with a simple but good version, and then work our way through to better ones.

The most basic implementation of a heap is called a binary heap. A binary heap is a binary tree with two key properties:

  1. Every node in the tree is larger than its children.
  2. The tree is left-full: every level of the tree is full except for the last; and the last level is filled in from left to right.

The left-full property might seem a bit strange, but it turns out to be pretty straightforward. A binary heap can be implemented using an array. The root node is stored in the first position of the array; its children are in positions 2 and 3; the children of node 2 are stored in positions 4 and 5; the children of node 3 are stored in positions 6 and 7. Using one-based indices, for any node N, its children are stored in positions 2N and 2N+1. Adding a new leaf to the tree can always be done by just appending one value to the array. The left-full property just means that the tree always grows by appending to the end of the array, which fills in each level from left to right.

Implementing a heap this way is simple:

  1. To get the maximum value, you just look at the first element of the array – O(1).
  2. To remove the largest element, you save the value from the first element of the array. Then you remove the last element from the array, put it into the first position, and bubble it down – swapping it with the larger of its children if they’re bigger than it. We’ll look at this in more detail, but the bubble-down process is O(lg n) in the worst case.
  3. Inserting a new element is done by adding it to the end of the array, and then bubbling up, by comparing it to its parent, and swapping if it’s bigger than its parent. Again, it’s O(lg n).

I’m going to show code for this. For fun, I wrote the code in a language called xtend. Xtend is a Java extension that cleans up the syntax, gets rid of semicolons, improves the type system, adds lambdas, and does a few other really neat things.

The whole beast is just a wrapper around an array:

import java.util.ArrayList

class BinHeap<T extends Comparable<T>> {
  val ArrayList<T> _contents

  new() {
    _contents = new ArrayList<T>()
  }

  ...
}

If you know Java, this is mostly clear. In xtend, you write constructors using the name “new” instead of the name of the class being constructed.

Then we’ll set up some utilities to make other stuff easier to write.

  def leftChildPosition(int pos) {
    2 * (pos + 1) - 1
  }

  def rightChildPosition(int pos) {
    2 * (pos + 1)
  }

  def int parentPosition(int pos) {
    if (pos == 0) {
      throw new MaxHeapException()
    } else {
      (pos + 1)/ 2 - 1
    }
  }

  def void swap(int one, int two) {
    val T first = _contents.get(one)
    _contents.set(one, _contents.get(two))
    _contents.set(two, first)
  }

Again, these should be straightforward. The only tricky thing is that the JVM uses zero-based arrays – so the left child of the node in position N is (2*(N+1) - 1): we need to add one to the node number to shift to one-based position; and then subtract one from the result to switch back to zero-based position. We do a similar thing for each of the other position computations.

Now we can get to the interesting bits. How do we get values into the heap?

def insert(T v) {
  val idx = _contents.size()
  _contents.add(v)
  bubbleUp(idx)
}

Insert is exactly what I described in prose above: append the new value onto the end of the array, and then bubble it up. Bubbling is the interesting part:

  private def void bubbleUp(int pos) {
    if (pos > 0) {
      val parentPos = parentPosition(pos)
      if (_contents.get(pos) > _contents.get(parentPos)) {
        swap(pos, parentPos)
        bubbleUp(parentPos)
      }
    }
  }

Bubbling up from a position P compares P to its parent. If it’s bigger than its parent, it swaps positions with the parent, and then tries to continue bubbling up from its new position.

For example, imagine we had a tree like:

  9
    8
      5
        4
        0
      6
        3
    7
      2
      1

Now, suppose we wanted to add the value “10” to this. We’d add 10 to the end of the array, which would make it a child of 6. That would give us:

  9
    8
      5
        4
        0
      6
        3
        10
    7
      2
      1

So, we’d compare 10 to its parent – it’s bigger, so we’d swap:

  9
    8
      5
        4
        0
      10
        3
        6
    7
      2
      1

Then we’d compare 10 to its new parent, 8. It’s bigger, so we swap:

  9
    10
      5
        4
        0
      8
        3
        6
    7
      2
      1

And finally, we’d compare 10 to its new parent, 9. It’s bigger so we swap, and then we’re done.

  10
    9
      5
        4
        0
      8
        3
        6
    7
      2
      1

Appending to the end of the array is constant time, so the dominant time cost is the bubbling. The maximum possible number of swaps in the bubbling process is the depth of the tree minus 1 – and the depth of a full binary tree with N members is \lceil \lg N \rceil. So it’s O(lg n) swaps, and the overall cost of an insert is O(lg n).

Getting the largest value is trivial:

def getMax() {
  _contents.get(0)
}

Removing the largest value is a lot like adding a value: we take the last element of the array, move it to the root, and then do a bubbling process – only this time we bubble in the opposite direction:

  def removeMax() {
    if (_contents.size == 0) {
      throw new MaxHeapException()
    } else {
      val result = getMax()
      val last = _contents.remove(_contents.size() - 1)
      if (_contents.size() > 0) {
        _contents.set(0, last)
        bubbleDown(0)
      }
      result
    }
  }

Bubbling down is similar to bubbling up, but it’s a bit more complicated, because we need to look at both children.

  private def void bubbleDown(int pos) {
    val rightChildPos = rightChildPosition(pos)
    val leftChildPos = leftChildPosition(pos)
    if (leftChildPos >= _contents.size) {
      return
    }
    // Bubble left if there is no right child, or if the left child is
    // bigger than the right.
    if (rightChildPos >= _contents.size ||
        _contents.get(leftChildPos) > _contents.get(rightChildPos)) {
      if (_contents.get(pos) < _contents.get(leftChildPos)) {
        swap(pos, leftChildPos)
        bubbleDown(leftChildPos)
      }
    } else {
      // Otherwise, try to bubble right.
      if (_contents.get(pos) < _contents.get(rightChildPos)) {
        swap(pos, rightChildPos)
        bubbleDown(rightChildPos)
      }
    }
  }

The process is almost the same as bubbling up, but moving in the opposite direction. We're starting with a parent node, and comparing it to its children. If the larger of its children is bigger than the parent, then we swap the parent with that child, and continue bubbling down from the child's old position.

For example, let's look at the same heap we looked at for insert:

  9
    8
      5
        4
        0
      6
        3
    7
      2
      1
  

If we want to remove 9, we set the value 9 aside, and then remove 3 from the end of the array, and put it at the root of the tree:

  3
    8
      5
        4
        0
      6
    7
      2
      1

Then we'd compare 3 against its two children, 8 and 7. Since 8 is the larger child, we swap 8 for 3:

  8
    3
      5
        4
        0
      6
    7
      2
      1

Now we compare 3 with its new children, 5 and 6. 6 is bigger, so we swap 6 with 3:

  8
    6
      5
        4
        0
      3
    7
      2
      1

3 has no children, so we're done: it's bubbled down as far as it can go.

Note: I messed up this example in the original version of the post. Thanks to John Armstrong for pointing it out.

The cost here is the same as insert, for the same reason. The dominant cost is the bubbling, and the bubbling is bounded by the depth of the tree. So removing the maximum is also O(lg n).

It's worth noting that heaps can be used to build a very reasonable sorting algorithm. To sort a collection, just insert all of the elements of the collection, and then remove them one by one. It's O(n lg n), and it's conceptually quite simple. It's not widely used, because the old classic quicksort is faster - not in big-O terms, but in its constant factor. (In big-O notation, something that takes 3(n lg n) steps and something that takes 6(n lg n) steps are both O(n lg n), but the one whose constant is 3 is still twice as fast as the one whose constant is 6.)
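
Here's a minimal sketch of that sorting algorithm, using the BinHeap class from above (HeapSort and heapSort are my names for it). Since we remove the maximum each time, the result comes out largest-first:

import java.util.ArrayList
import java.util.List

class HeapSort {
  // Push everything into a heap, then repeatedly remove the max.
  // Each of the 2n heap operations is O(lg n), so the whole sort
  // is O(n lg n).
  def static <T extends Comparable<T>> List<T> heapSort(List<T> input) {
    val heap = new BinHeap<T>()
    for (v : input) {
      heap.insert(v)
    }
    val result = new ArrayList<T>()
    for (i : 0 ..< input.size) {
      result.add(heap.removeMax())
    }
    result
  }
}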

Paxos, a really beautiful protocol for distributed consensus

The work that I do in real life is all focused on tools for other developers. In today’s environment, that means that I’ve spent a lot of time working on tools that, in one way or another, help other developers deal with distributed systems. In that work, I’ve noticed that there are some really key things that straddle the line between pure math and pure engineering. That’s really interesting to someone like me!

A good example of that is something called paxos. My first exposure to paxos was very interesting. I’d just been hired by Google, and was working on their build tool. At the time, engineers in Google had a problem. Google’s codebase was contained in one massive version control repository. Doing things that way has a ton of really neat advantages – most importantly, the fact that it makes it really convenient to reuse code written by anyone else at the company. The problem was that code reuse can become very confusing. Project A reuses a bit of code written by people from project B. B’s code reused some stuff from C, and C from D, and D from E. So now project A is using code from project E, and they don’t know why!

In this case, I had someone from a storage project coming to me trying to figure out just why his system had a dependency on a plan9 database system called paxos. I had to build a tool that would allow people to ask questions like “Why does A depend on E?”.

As it turned out, paxos was a really important thing, and it was widely reused through the Google codebase. Once I learned about it, I started seeing it everywhere. Since leaving Google nearly four years ago, I haven’t stopped seeing it. It’s ubiquitous in distributed systems. Outside of Google, we weren’t using that friendly old plan9 paxos implementation – but the paxos model has been reimplemented dozens of times, because it’s so darned useful!

paxos is a system for managing consensus.

In distributed systems, there’s a collection of hard problems that you constantly need to deal with.

  1. Things fail. You can never count on anything being reliable. Even if you have perfectly bug-free software, and hardware that never breaks, you’ve still got to deal with the fact that network connections can break, or messages within a network can get lost, or that some bozo might sever your network connection with a bulldozer. (That really happened while I was at Google!)
  2. Given (1), you can never rely on one copy of anything, because that copy might become unavailable due to a failure. So you need to keep multiple copies, and those copies need to be consistent – meaning that at any time, all of the copies agree about their contents.
  3. There’s no way to maintain a single completely consistent view of time between multiple computers. Due to inconsistencies in individual machine performance, and variable network delays, variable storage latency, and several other factors, there’s no canonical way of saying that for two events X and Y, “X happened before Y”. What that means is that when you try to maintain a consistent set of data, you can’t just say “Run all of the events in order”, because while one server maintaining one copy might “know” that X happened before Y, another server maintaining another copy might be just as certain that Y happened before X.

In a world where you can’t count on different agents seeing events in the same order, and where you expect things to be constantly failing, how can you make sure that any distributed system you build ends up with a consistent view of reality?

The answer is a consensus protocol. You need to create a mechanism, based on communication between the copies of your system, that allows them to maintain a consistent consensus about what the current state of the world is, even in the presence of failures of machines, storage systems, and communications.

paxos is a very elegant, reasonably simple consensus protocol.

Let’s get a bit more precise. Paxos is built on a model of storage. The original application of it was a consistent database, so it’s built around the idea of keeping data consistent. In paxos, the state of the storage is modelled as a sequence of transactions. Each transaction is a pair (t, v), where t is a numeric transaction identifier, and v is a transaction value.

The state of the system being modelled is a sequence of transaction pairs, [(t_1, v_1), (t_2, v_2), \ldots], where the t values increase as you progress through the sequence. As time passes, new transaction pairs can be added to the state.

The focus of the paxos protocol is ensuring that in a collection of 2n+1 participants, all surviving participants will agree on the current value of the state, even if up to n participants fail, and even if messages can be delivered in arbitrary order.
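
As a sketch – still in Xtend, and with names that are mine rather than anything from the paxos literature – the state being modelled looks something like this:

import java.util.ArrayList
import org.eclipse.xtend.lib.annotations.Data

// One transaction pair (t, v): a numeric transaction identifier,
// and a transaction value.
@Data class Transaction {
  long id
  String value
}

class ReplicaState {
  // The replicated state: a sequence of transactions whose
  // identifiers strictly increase along the sequence.
  val log = new ArrayList<Transaction>()

  def void append(Transaction txn) {
    if (!log.empty && txn.id <= log.last.id) {
      throw new IllegalArgumentException("transaction ids must increase")
    }
    log.add(txn)
  }
}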

Before I go further into the description of paxos, we need to look at the basic assumptions that underlie it. Like any formal model, it’s not defined in terms of real computers. It’s defined in terms of an abstraction that approximates reality. In this case, the approximation is quite good, but we still need to go through the basic assumptions that make up its model of the universe.

  1. Processors (aka participants, servers, computers):
    1. operate at any speed. No two processors necessarily operate at the same speed.
    2. may fail without warning.
    3. may rejoin after recovering from a failure.
    4. are cooperative (in the sense that they do not attempt to cause failures).
  2. Network:
    1. Delivers messages between any pair of processors.
    2. Transmits messages asynchronously.
    3. Delivers messages at arbitrary speeds.
    4. Does not guarantee that messages will be delivered in the order in which they were transmitted.
    5. Does guarantee that a message, if delivered, will be delivered correctly, without any changes.
    6. May fail to deliver a message.
    7. May deliver multiple copies of the same message.

In short, everything can fail at any time; after failure, participants can recover and rejoin the system; and no part of the system acts in an actively adversarial way.

The protocol describes the behavior of the system in terms of a collection of roles. A participant can have more than one role in the system – in fact, in most implementations of paxos, all participants do have multiple roles. The roles are:

Client
The client is not part of the paxos cluster. It’s an external entity whose actions trigger state changes by making requests to the paxos system. Each state update in paxos is initiated by a client request, and completed by a reply to the client.
Acceptor
An acceptor (also called a voter) is a participant in the maintenance of the distributed storage. A state change in a paxos cluster does not occur until a majority (quorum) of acceptors agree upon it.
Proposer
A proposer receives a request from the client, and attempts to get a quorum of acceptors to agree on it.
Leader
One of the proposers is special. It is the single proposer who most recently had a proposal accepted. In many paxos implementations, there is only one active proposer serving client requests: the only time the other proposers send proposals is when the current leader fails, and a new one needs to be selected.
Learner
The learner is the real service provided by the paxos cluster. Once a proposal is accepted, a learner processes the request from the client, and sends it the result.

In a typical paxos cluster, the client sends requests to a proposer. The proposer sends a proposal to update the state with the new client request, and attempts to convince a majority of the acceptors to accept it. Once a majority accepts it, the client request is processed by the learner, and a result is returned to the client.

The meat of paxos is the protocol by which a proposer gets a majority of acceptors to agree on a proposal, and the way that that process ensures that the collection of acceptors maintains a consistent state.

The protocol itself is pretty simple. Each round is effectively independent, and consists of a process of attempting to reach consensus. Within each round, finding consensus is a two-phase process, where each phase consists of a message sent from a proposer to a group of acceptors, and a reply from the acceptors to the proposer.

  1. Phase One: Prepare/Promise
    • Proposer: A proposer attempts to start setting a new consensus by sending a Prepare(N) message to a quorum of acceptors. It can send to any group of acceptors, so long as that group forms a majority of the acceptors. The prepare message specifies a numeric identifier N for its proposal, which is larger than the number of any proposal previously sent by this proposer.
    • Acceptors:
      Each acceptor, upon receiving the proposal, checks whether the N-value from the prepare message is greater than that of any proposal it has already accepted in the current round. If so, it sends a reply called a Promise to the proposer, promising that it will never accept any proposal with a number less than N. If the acceptor has accepted a proposal with number less than N in the current round, then it includes the pair (v, n_v), consisting of the proposed consensus value v and the number n_v of the accepted proposal that proposed v.
      The acceptor thus sends a message Promise(N, (v, n_v)) (if it has accepted a proposal this round) or Promise(N, null) (if it has not yet accepted a proposal with number less than N).

      Once it’s sent a promise message, it must not accept any request for a proposal with number less than N. Note though that this does not mean that the acceptor promises to accept the proposal: all it’s doing is promising not to accept any proposal with number less than N! If it receives a message Prepare(N+1), it’s free to promise that – but if it does, it will no longer be able to accept the proposal for N.

      (If N is smaller than the number of any proposal promised or accepted by the acceptor, then in the original version of paxos, the acceptor does nothing; in some optimizations of the protocol, it replies Reject(n_v).)

    What this phase does is allow a proposer to determine whether or not a new proposal is even worth considering. If a quorum (majority) of acceptors send promises, then it can move on to phase 2.

  2. Phase Two: Accept!/Accepted

    When a proposer receives promises from a quorum of acceptors, it moves forward to try to actually commit the proposal. In order to do this, it needs to choose a value for the proposal. If any of the Promise messages contained a value, then the value of this proposal must be set to the value attached to the highest proposal number among those promises. If all of the promises were empty, then the proposer can choose any value that it wants for the proposal.

    Once the proposer has chosen a value, then it sends a message Accept!(N, V) to a quorum of acceptors. This is typically written with the exclamation point, because it’s really a command to the acceptors: they’re being told to accept the proposal, if they can.

    When an acceptor receives an Accept!(N, V) message, if it has not issued a promise for a proposal with number greater than N, then it must accept it. It accepts the proposal by sending a message Accepted(N, V) to both the original proposer, and all of the learners.

    When Accepted messages have been received from a quorum of acceptors, the new value V becomes the consensus value for the paxos cluster, and the new proposal number N is fully committed.
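
To make the two phases concrete, here’s a sketch of the state a single acceptor keeps, and how it responds to the two kinds of messages. This is my own simplified illustration – the names are mine, and a real implementation would also have to persist this state and deal with retransmission:

import org.eclipse.xtend.lib.annotations.Data

// The reply to Prepare(N): the number being promised, plus the
// previously accepted proposal, if there is one.
@Data class Promise {
  int promisedNumber
  int acceptedNumber    // -1 if nothing has been accepted yet
  String acceptedValue  // null if nothing has been accepted yet
}

class Acceptor {
  var int promisedNumber = -1
  var int acceptedNumber = -1
  var String acceptedValue = null

  // Phase one: Prepare(N). Promise only if N is bigger than any
  // number we've already promised; otherwise stay silent (here,
  // return null).
  def Promise prepare(int n) {
    if (n > promisedNumber) {
      promisedNumber = n
      new Promise(n, acceptedNumber, acceptedValue)
    } else {
      null
    }
  }

  // Phase two: Accept!(N, V). Accept unless we've promised some
  // number greater than N in the meantime.
  def boolean accept(int n, String value) {
    if (n >= promisedNumber) {
      promisedNumber = n
      acceptedNumber = n
      acceptedValue = value
      true
    } else {
      false
    }
  }
}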

As with so many things, this is easier to understand when you think about an example. One use of paxos that I’ve worked with is in a cluster scheduling service. In that system:

  • a client is a user attempting to run a new job on the cluster. It sends a request to the scheduler detailing the set of resources that it wants to request.
  • Each replica of the scheduler is a proposer, an acceptor, and a learner. There’s one active instance of the scheduler, which is the leader. When a client wants to schedule a job, its request gets sent to the leading scheduler.
  • In the normal non-error case, this works as follows:
    1. When a scheduling request is received, the leader proposes scheduling the job, by sending a message to all of the other schedulers saying that it wants to schedule job N.
    2. The other schedulers, if they haven’t seen a proposal with a number greater than N, make promises to accept that proposal.
    3. The leading scheduler chooses resources for the job, and then sends an Accept! message to the other schedulers.
    4. The other schedulers reply, accepting the scheduling. The non-leader schedulers, acting as learners, record the scheduling information, and the leader actually starts the job.
  • Errors occur when there’s been some kind of failure. In that case, we don’t necessarily know who the leader is – so we get multiple schedulers trying to act as if they’re the leader. So they each send proposals. Whichever proposal has the largest proposal number will eventually get accepted, and its proposer becomes the new leader.

It’s a pretty simple thing – the core concept is simply that no consensus proposal is considered “committed” until it’s been accepted by a majority of the participants. And if it’s been accepted by a majority of the participants, that means that no conflicting proposal can ever reach consensus – because that would require at least one participant to accept two conflicting proposals.

But there’s still a bit of formality that’s worth looking at. Exactly what guarantees does paxos give? What properties does paxos-style consensus have?

Even the formal properties of paxos are easy to understand. Paxos provides two key properties: validity, and agreement.

Validity
No value ever reaches consensus without first being proposed, and having its proposal accepted.
Agreement
No two distinct values ever reach consensus at the same time.

You can easily prove those two properties. In fact, the proof is completely obvious once you recognize that the paxos protocol has two invariants (and those invariants are themselves clear from the definition of the protocol!):

  1. An acceptor can accept a proposal numbered N if and only if it has not yet made a promise for a proposal with a number greater than N.
  2. The value of any new proposal must be the consensus value of the highest-numbered proposal that was accepted before it, if there is one.

Getting back to the beginning: the point of all of this is to have a system in which we can be sure that things work correctly even in the presence of failures. In paxos, as long as at some point there was a quorum of machines that came to agreement, then any failure that leaves a surviving quorum of machines must have overlapped with the previous quorum – with 2n+1 machines, any two majorities of size n+1 must share at least one member, since (n+1) + (n+1) > 2n+1. That means that the previous consensus still remains in effect, and will be propagated to the remaining participants. If you’ve got 5 machines, then two can fail, and you won’t lose consistency among the remaining ones.

The ABC conjecture – aka the soap opera of the math world.

Sorry for the silence of this blog for the last few months. This spring, my mother died, and I was very depressed about it. Depression is a difficult thing, and it left me without the energy or drive to do the difficult work of writing this kind of material. I’m trying to get back into the cycle of writing. I’m trying to make some progress in writing about type theory, but I’m starting with a couple of easier posts.

In the time when I was silent, I had a couple of people write to me to ask me to explain something called the ABC conjecture.

The ABC conjecture is a mathematical question about number theory that was proposed in the 1980s – so it’s relatively new as number theory problems go. It’s gotten a lot of attention recently, due to an almost soap-operatic series of events.

It’s a very hard problem, and no one had made any significant progress on it until about five years ago, when a well-respected Japanese mathematician named Shinichi Mochizuki published a series of papers containing a proof of the conjecture.

Normally, when a proof of a hard problem gets published, mathematicians go nuts! Everyone starts poring over it, trying to figure it out, and to see if it’s valid. That’s what happened the previous time someone thought they’d proved it. But this time, no one has been able to make sense of the proof!

The problem is that in order to build his proof, Professor Mochizuki created a whole new mathematical theory, called inter-universal Teichmüller theory. The entire ABC conjecture proof is built in this new theory, and no one other than Professor Mochizuki himself understands it. Before anyone else can actually follow the proof, they need to understand the theory. Professor Mochizuki is a bit of a recluse – he has declined to travel anywhere to teach his new mathematical system. So in the five years since he first published it, no one has been able to understand it well enough to determine whether or not the proof is correct. One error was found, and corrected, but the proof as a whole remains in question.

Exactly why the proof remains unchecked after five years is a point of contention. Lots of mathematicians are angry at Professor Mochizuki for not being willing to explain or teach his theory. A common statement among critics is that if you create a new mathematical theory, you need to be willing to actually explain it to people: work with a group of mathematicians to teach it to them, so that they’ll be able to use it to verify the proof. But Professor Mochizuki’s response has been that he has explained it: he’s published a series of papers describing the theory. He doesn’t want to travel and take time away from his work for people who haven’t been willing to take the time to read what he’s written. He’s angry that after five years, no one has bothered to actually work through his proof.

I’m obviously not going to attempt to weigh in on whether Professor Mochizuki’s proof is correct. That’s so far beyond the ability of my puny little brain that I’d need to be a hundred times smarter before it would even be laughable! Nor am I going to take sides about whether or not the Professor should be travelling to teach other mathematicians his theory. But what I can do is explain a little bit about what the ABC conjecture is, and why people care so much about it.

It’s a conjecture in number theory. Number theorists tend to be obsessed with prime numbers, because the structure of the prime numbers is a huge and fundamental part of the structure and behavior of numbers as a whole. The ABC conjecture tries to describe one property of the structure of the set of prime numbers within the system of the natural numbers. Mathematicians would love to have a proof for it, because of what it would tell them about the prime numbers.

Before I can explain the problem, there’s a bit of background that we need to go through.

  1. Any non-prime number N is the product of some set of prime numbers. Those numbers are called the prime factors of N. For example, 8 is 2×2×2 – so the set of prime factors of 8 is {2}. 28 is 2×2×7, so the prime factors of 28 are {2, 7}. 360 = 8 × 45 = 2×2×2×(9×5) = 2×2×2×3×3×5, so the prime factors of 360 are {2, 3, 5}.
  2. For any number N, the radical of N is the product of its set of prime factors. So the radical of 8 (written rad(8)) is 2; rad(14)=14; rad(28)=14; rad(36)=6; rad(360)=30; etc.
  3. Given two positive integers N and M, N and M are coprime if they have no common prime factors. A tiny bit more formally: if pf(N) is the set of prime factors of N, then M and N are coprime if and only if pf(N) ∩ pf(M) = ∅. (Also, if M and N are coprime, then rad(M×N) = rad(M)×rad(N).)
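
These definitions are easy to turn into code, which makes them easy to play with. Here’s a little Xtend sketch (the class and method names are mine):

import java.util.Collections
import java.util.HashSet
import java.util.Set

class Radical {
  // The set of prime factors of a positive n, found by trial division.
  def static Set<Long> primeFactors(long n) {
    val factors = new HashSet<Long>()
    var remaining = n
    var f = 2L
    while (f * f <= remaining) {
      while (remaining % f == 0) {
        factors.add(f)
        remaining = remaining / f
      }
      f = f + 1
    }
    if (remaining > 1) {
      factors.add(remaining)
    }
    factors
  }

  // rad(n): the product of the distinct prime factors of n.
  def static long rad(long n) {
    var result = 1L
    for (factor : primeFactors(n)) {
      result = result * factor
    }
    result
  }

  // m and n are coprime iff their prime factor sets are disjoint.
  def static boolean coprime(long m, long n) {
    Collections.disjoint(primeFactors(m), primeFactors(n))
  }
}

With this, rad(8) comes out to 2 and rad(360) to 30, matching the examples above.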

The simplest way of saying the ABC conjecture is that for the vast majority of triples of positive integers A, B, and C, where A + B = C and A and B are coprime, C is smaller than rad(A×B×C).

Of course, that’s hopelessly imprecise for mathematicians! What does “the vast majority” mean?

The usual method at times like these is to find some way of characterizing the relative sizes of the set where the statement is true and the set where the statement is false. For most mathematicians, the sizes of sets that are interesting are basically 0, 1, finite, countably infinite, and uncountably infinite. For the ABC conjecture, the claim is that the set of triples for which the statement holds is infinite, but the set of triples for which it fails is finite. Specifically, they want to be able to show that for any ε > 0, the set of triples for which C > rad(A×B×C)^{1+ε} is finite.

To do that, they pull out a standard trick. Sadly, I don’t recall the proper formal term, but I’ll call it epsilon bounding. The idea is that you’ve got a statement S about a number (or region of numbers) N. You can’t prove your statement about N specifically – so you prove it about regions around N.

As usual, it’s clearest with an example. We want to say that C < rad(A×B×C) for most values of A and B. The way we can show that is by saying that for any value ε > 0, the set of triples (A, B, C), where A and B are coprime, A + B = C, and C > rad(A×B×C)^{1+ε}, is finite.
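
If you want to see how rare the exceptional triples are, you can scan for them directly, using the rad and coprime sketches from earlier (AbcScan is my name for this little driver). Even with no ε at all – testing the plain condition C > rad(A×B×C), which is weaker than what the conjecture bounds – a scan of small numbers only turns up a handful of triples, like 1 + 8 = 9, where rad(1×8×9) = 6 < 9:

class AbcScan {
  def static void main(String[] args) {
    var exceptional = 0
    var total = 0
    for (a : 1 .. 200) {
      for (b : a .. 200) {
        if (Radical.coprime(a, b)) {
          val c = a + b
          total = total + 1
          // An "exceptional" triple: C beats the radical of A*B*C.
          if (c > Radical.rad(a * b * c)) {
            exceptional = exceptional + 1
            println('''«a» + «b» = «c», rad = «Radical.rad(a * b * c)»''')
          }
        }
      }
    }
    println('''«exceptional» exceptional triples out of «total»''')
  }
}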

What this formulation does is give us a formal idea of how rare the exceptions are. It’s possible that there are some values of A and B where C is vastly larger than rad(A×B×C). But for any fixed ε, the number of places where C > rad(A×B×C)^{1+ε} is finite. Since the full system of numbers is infinite, that means that in the overwhelming majority of cases, C < rad(A×B×C)^{1+ε}. The size of the set of exceptions is so small that it might as well be 0 in comparison to the size of the set where the inequality holds.

Ultimately, it seems almost trivial once you understand what the conjecture is. It’s nothing more than the hypothesis that if A + B = C, then most of the time, C is smaller than the radical of A×B×C. Once you’ve got that down, the question is: what’s the big deal? Professor Mochizuki developed five hundred pages of theory for this? People have spent more than five years trying to work through his proof, just to see if it’s correct, for a statement like this? Why does anybody care so much?

One answer is: mathematicians are crazy people!

The better answer is that simple statements like this end up telling us very profound things about the deep structure of numbers. The statements reduce to something remarkably simple, but the meaning underneath it is far more complex than it appears.

Just to give you one example of what this means: If the conjecture is true, then there’s a three-line proof of Fermat’s last theorem. (The current proof of Fermat’s last theorem, by Andrew Wiles, is over 150 pages of dense mathematics.) There’s quite a number of things that number theoreticians care about that would fall out of a successful proof.