Category Archives: Good Math

Understanding Expressions

Administrivia

I’m going to be trying something a bit different with this blog.

What I’ve tried to do here on GM/BM is make each post as self-contained as possible. Obviously, many things take more than one post to explain, but I’ve done my best to take things, and divide them into pieces where there’s a basic concept or process that’s the focus of each post.

I’m finding that for this type theory stuff, I’m having a hard time doing that. Or rather, given my work schedule right now when I’m trying to write about type theory, I’m finding it hard to find enough time to do that, and still be posting regularly. (That sounds like a complaint, but it’s not meant to be. I started a new job at Dropbox about three months ago, and I absolutely love it. I’m spending a lot of time working because I’m having so much fun, not because some big mean boss guy is leaning over me and threatening.)

Anyway, the point of this whole diversion is that I really want to get this blog back onto a regular posting schedule. But to do that, I’m going to have to change my posting style a bit, and make the individual posts shorter, and less self-contained. I’m definitely interested in what you, my readers, think of this – so please, as I roll into this, let me know if you think it’s working or not. Thanks!


In this post, we’re going to start looking at expressions. This might seem like it’s a diversion from the stuff I’ve been writing about type theory, but it really isn’t! Per Martin-Löf developed a theory of expressions which is used by type theorists and many others, and we’re going to be looking at that.

We’ve all seen arithmetic expressions written out since we were in first grade. We think we understand what they mean. But actually, most of us have never really stopped and thought precisely about what an expression actually means. Most of the time, that’s OK: we’ve got an intuitive sense of it that’s good enough. But for type theory, it’s not sufficient. In fact, even if we did have a really good, formal notion of expressions, it wouldn’t be right for type theory: in type theory, we’re rebuilding mathematics from a foundation of computability, and that’s not the basis of any theory of expressions that’s used in other mathematical fields.

Why is this a problem?

Let’s start by looking at a nice, simple expression:

x^2 + 3x + 7

What does that mean? Roughly speaking, it’s a function with one parameter: f(x) = x^2 + 3x + 7. But that doesn’t really tell us much: all we’ve really done is add a bit of notation. We still don’t know what it means.

Let’s take it a step further. It’s actually describing a computation that adds three elements: +(x^2, 3x, 7). But that’s not quite right either, because we know addition is binary. That means that we need to decide how to divide that addition into two parts. From the associative property, we know that it doesn’t matter which way we divide it – but from a computational perspective, it might: doing it one way versus the other might take much longer!

We’ll pick left-associative, and say that it’s really +(+(x^2, 3x), 7). We also need to expand other parts of this into this functional idea. If we follow it all out, we wind up with: +(+(*(x,x), *(3, x)),7).

We’ve converted the expression into a collection of applications of primitive functions. Or in terms that are more familiar to geeks like me, we’ve taken the expression, and turned it into an abstract syntax tree (AST) that describes it as a computation.
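
For the geeks in the audience, here’s that AST as a minimal Python sketch. (The Var, Const, and Apply node types are invented for illustration – they’re not from any particular library.)

from dataclasses import dataclass

@dataclass
class Var:
    name: str

@dataclass
class Const:
    value: int

@dataclass
class Apply:
    fn: str        # the primitive function being applied: "+" or "*"
    args: tuple    # its argument subtrees

# +(+(*(x, x), *(3, x)), 7)
expr = Apply("+", (Apply("+", (Apply("*", (Var("x"), Var("x"))),
                               Apply("*", (Const(3), Var("x"))))),
                   Const(7)))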

We’re still being pretty vague here. We haven’t really defined our notion of function or computation. But even with that vagueness, we’ve already started making the notion of what an expression is much more concrete. We’ve taken the abstract notion of expressions, and translated it to a much less abstract notion of a computation expressed as a sequence of computable functions.

This is a really useful thing to do. It’s useful enough that we don’t want to limit it to just “pure” expressions. In the type theoretic view of computation, everything is an expression. That’s important for multiple reasons – but to make it concrete, we’re going to eventually get around to showing how types work in expressions, what it means for an expression to be well-typed, how to infer types for expressions, and so on. We want all of that to work for any program – not just for something that looks like a formula.

Fortunately, this works. We can also take an “expression” like for i in 1 .. 10 do f(i), and analyze it as a function: for(i, 1, 10, f(i)).

So, we’ve got a way of understanding expressions as functions. But even if we want to keep the computational side of things abstract and hand-wavy, that’s still not really enough. We’re closer to understanding expressions, but we’ve still got some huge, gaping holes.

Let’s jump back to that example expression: x^2 + 3x + 7. What does it mean? What we’ve seen so far is that we can understand it as a series of function calls: +(+(*(x, x), *(3, x)), 7). But we’d like to be able to evaluate it – to execute the computation that it describes. We can’t do that yet: it’s got a gaping hole named x. What do we do with that?

We’re missing a really important notion: functional abstraction. Our expression isn’t just an expression: what it really is is a function. We alluded to that before, but now we’re going to deal with it head-on. That expression doesn’t really define a computation: it defines a computational object that computes the function. When an expression has free variables – that is, variables that aren’t assigned a meaning within the expression – our expression represents something that’s been abstracted a level: instead of being a computation that produces a value, it’s an object that takes a parameter, and performs a computation on its parameter(s) in order to produce a value.

In our expression x^2 + 3x + 7, we’re looking at an expression in one free variable – which makes it a single-parameter function. In the notation of type theory, we’d write it as (x)(+(+(*(x, x), *(3, x)), 7)) – that is, the parameter variable in parens ahead of the expression that it parameterizes. (Yes, this notation stinks; but I’m sticking with the notations from the texts I’m using, and this is it.)
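
If it helps, that abstraction is exactly what a lambda is in a programming language. A quick sketch of the same idea in Python:

f = lambda x: (x * x + 3 * x) + 7   # the abstraction (x)(+(+(*(x,x),*(3,x)),7))
f(2)                                # applying it fills the hole: 4 + 6 + 7 = 17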

This notion of parameters and function abstraction turns out to be more complex than you might expect. I’m going to stop this post here – but around Wednesday, I’ll post the next part, which will look at how to understand the arity of an expression.

The Fedex Problem

Over the weekend on twitter, someone tweeted a link to a really fun-looking paper on arXiv, called The Fedex Problem. It’s a really interesting paper, because it takes a real-world problem, analyzes it mathematically, and shows some of the fascinating complexity that lies under the surface of something that seems simple. (I can’t find the original tweet, so I don’t know who first shared it – sorry!)

The problem that they looked at involves the logistics of rapid delivery by services like Federal Express.

When Fedex was founded, they built their overnight delivery business on a new model of how to handle deliveries, called a hub system.

What most delivery businesses traditionally did was take packages in at a local warehouse, and sort them into groups depending on the warehouse that serviced their destination. Each group was then shipped separately, from the source warehouse to the destination warehouse. To do overnight delivery, each warehouse, then, would need to have a full package-sorting system, and they would need to send a shipment to every other warehouse every single day. For N warehouses, that means that they needed N full sorting systems, and N^2 shipments every day. The costs of the redundant shipping systems and the massive number of shipments are huge, and completely non-scalable. Imagine a system with just 100 local warehouses – that’s only 2 for each state! – you’d need to have 10,000 shipments per day. In practice, the number of fedex warehouses just in the US is considerably larger than just 100: it simply can’t work.

What Fedex chose to do instead was build a hub system. Every package that arrives at a local warehouse gets shipped to the same place, called the hub. The only sorting system is at the hub – individual local warehouses don’t sort, they just ship to the hub. The hub sorts according to destination warehouse, and then ships the packages to the appropriate warehouse for delivery.

The Fedex hub system solved both of the problems of the traditional approach. The individual local shipping points don’t need to do sorting: they just ship to the hub, which takes care of all of it. And since every package goes to the hub, instead of needing N^2 shipments, they only needed 2N shipments per day: N shipments of unsorted packages from local warehouses to the hub, and N shipments of sorted packages from the hub to the local warehouses for local deliveries.

How well this system works is very dependent on the proper location of the hub. To demonstrate why, consider the worst possible hub location for American shipments: Hawaii. Now, every package from anywhere in the mainland to anywhere else in the mainland needs to travel at least an extra 2500 miles to the hub, and an extra 2500 miles back! The cost of shipping is dominated by fuel costs – adding any extra distance to the average shipment costs fuel, and thus money. So the location of the hub has a dramatic effect on the shipment costs.

FedEx located their hub in Memphis, Tennessee, because according to its founders, Memphis was close to the “center” of the shipping region for its original target market cities, and thus minimized the total shipping distances. (That wasn’t their sole concern; another factor was that Memphis is in an area that is rarely incapacitated by weather or other natural disasters – but the travel distance was the biggest factor.)

What this paper does is take the Fedex problem, and generalize it. What, exactly, does it mean for a given point to be an ideal hub? Given a network of points in a Euclidean space, how can you compute the ideal hub? And finally, if we look at the actual Federal Express distribution system, using great-circle routing between points, how close is Memphis to being the ideal hub?

If you look at the problem naively, it seems like it should be simple. The ideal hub should just be the “average” location. That’s basically what the FedEx founders said – they picked the geographical center of their original markets. Unfortunately, reality is rarely kind enough to make things fit our intuitions, and that’s not quite right.

To see why, let’s first look at the definition of the ideal hub. What we want to do is to minimize the total distance from local shipping points to the hub. To keep the notation simple, we’ll write the problem down as if we were looking at the one-dimensional version of the problem. Each location is a single value, and we’ll write the distance between two points a and b as the absolute value of a-b: \left|a-b\right|.

If the n points in the network are called x_1 \dots x_n, then the total distance value for a hub at x is given by the function:

 f(x) = \sum_{i=1}^{n} 2\left| x - x_i \right|

Since the 2 is a constant factor that affects all x-values equally, for the purposes of simplicity, we can drop it out. So the problem of selecting an ideal hub ultimately reduces to an optimization problem: find the value x where f(x) is a minimum.

Again, naively, it seems as though the average is best. But we can see that it’s not with a simple example. Let’s just look at a one-dimensional case – a list of locations scattered along a line: {1, 2, 3, 5, 13, 14, 60}. The point that’s at the average is 14, with a total one-way travel distance of 13 + 12 + 11 + 9 + 1 + 0 + 46 = 92. But by doing some brute-force, we can see that the total distance for x=5 is just 4 + 3 + 2 + 0 + 8 + 9 + 55 = 81. So the mean isn’t the optimal hub location. In fact, you can prove that in the one-dimensional case, the optimal hub is the median location, not the mean.
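
If you want to check that yourself, the brute force fits in a few lines of Python:

points = [1, 2, 3, 5, 13, 14, 60]

def total_distance(hub, pts):
    return sum(abs(hub - p) for p in pts)

print(total_distance(14, points))   # 92: the mean
print(total_distance(5, points))    # 81: the median
print(min(points, key=lambda x: total_distance(x, points)))   # 5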

In this example, the average isn’t far off – but even in a quick off-the-cuff example, you can see that average isn’t the answer: the mean can be quite different from the median, particularly when you’ve got a lot of clustering, or a couple of outliers. When you extend to more than one dimension, then things can get bad pretty quickly.

The general problem has been looked at in many forms. It’s known, variously, as the Fermat-Torricelli problem, the Weber problem, or the single facility location problem.

Unfortunately, getting an answer to the problem is a lot harder than just naming it. According to the authors of this paper, even the fact of the existence of a unique ideal hub isn’t trivial. Despite all of the people who’ve worked on the problem, they couldn’t find a single complete proof of it! They go ahead and provide one. They prove that:

For any finite set of non-collinear points in any Euclidean space R^N,
there is an ideal hub, located within the convex hull of the set of points.

The convex hull qualifier there deserves a bit of explanation. It’s an interesting idea that ends up coming up in a bunch of different mathematical contexts. The easiest way to explain it is by the easiest method of finding the convex hull in two dimensions: suppose you’ve got a bunch of points on a plane. Draw the plane on graph paper, and put a nail at the location of each point. Take a rubber band, and stretch it so that every nail on the graph is inside of the rubber band. Let go. The rubber band will end up touching some of the nails; some will be inside, but untouched. The rubber band is the convex hull of the points. It’s basically a general notion of the smallest blob that contains all of the points in the set. The authors present a short, clean proof that the ideal hub location will always be within the “blob” of the points.

The paper goes into some depth in discussing some simple cases of the ideal hub, and how they can be calculated geometrically. But for the general problem, the geometric approaches don’t work. To get the ideal hub for a large collection of points, they’re left with brute force. In the brute force method, you just take every point x in the network, and compute its f(x) value. Since it’s a finite network, you’ll eventually find the ideal hub. (You can, of course, apply some nice heuristics to simplify the problem: the set of candidates that might be optimal hubs is going to be a lot smaller than the set of all points in the network!) But still, computing the optimal hub is quite complicated – O(N^2) in the number of points in the network.

After all of that work, where is the ideal hub? How different is it from the hub that Federal Express actually uses?

As it turns out, Fedex isn’t too far off, but it’s definitely non-optimal. Using US census data, combined with great-circle distances between the census tracts, the ideal hub is in central Indiana. FedEx’s hub is 315 miles off, to the south-south-west of the ideal location.

Interestingly, when UPS (probably FedEx’s biggest rival) set up a hub system, they put their hub in Louisville – just 85 miles away from the ideal hub.

To me, the most interesting thing about the paper comes after they showed the ideal hub by brute force. If you’ve got more than four points – even just five! – in a two-dimensional space, there’s no simple geometric solution to find the ideal hub. It seems like it should be simple, but it isn’t. The proof of that is really fascinatingly simple: it comes down to group theory and symmetry. Read the paper for the details. A geometric solution is limited by the difference in dimensionality between a particular one-dimensional rotational symmetry group, and the number of objects in the network. As a result, for 4 or fewer non-collinear points, there’s a way of finding a single unique geometric configuration that can be exploited to characterize the ideal hub. For more than 4 points, there isn’t – there’s no single configuration anymore, which blows away the usefulness of geometric solutions.

If you find this at all interesting, then it’s worth downloading and reading the paper. There’s a whole lot there that I haven’t covered – more than what I have. But I hope this little taste is enough to pique your curiosity.

Logical Statements as Tasks

In the next step of our exploration of type theory, we’re going to take a step away from the set-based stuff. There are set-based interpretations of every statement in type theory. But what we really want to focus on is the interpretation of statements in computational terms.

What that means is that we’re going to take logical statements, and view them as computation tasks – that is, as formal logical specifications of a computation that we’d like to do. Under that interpretation, a proof of a statement S is an implementation of the task specified by S.

This interpretation is much, much easier for computer science types like me than the set-based interpretations. We can walk through the interpretations of all of the statements of our intuitionistic logic in just a few minutes.

We’ll start with the simple statements.

Conjunction
A \land B is a specification for a program that produces a pair (a, b) where a is a solution for A, and b is a solution for B.
Disjunction
A \lor B is a specification for a program that produces either a solution to A or a solution to B, along with a way of identifying which of A and B it solved. We do that using a version of the classical injection functions: A \lor B produces either \text{inl}(A) (that is, the left injection), or \text{inr}(B) (the right injection).
Implication
A \supset B is a specification for a program that produces a solution to B given a solution to A; in lambda calculus terms, it’s a form like \lambda x: b(x).
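
To make those three readings concrete, here’s a sketch of them in Python type hints. (The stand-in types A and B are purely illustrative: in a real type-theoretic setting, the “proofs” would be richer than ints and strings.)

from typing import Callable, Tuple

A = int    # stand-in proposition: a "solution" to A is just an int
B = str    # stand-in proposition: a "solution" to B is just a string

Conj = Tuple[A, B]          # A /\ B: a pair (a, b)
Impl = Callable[[A], B]     # A -> B: a program taking solutions of A to solutions of B

def inl(a: A): return ("inl", a)   # left-tagged solution to A \/ B
def inr(b: B): return ("inr", b)   # right-tagged solution to A \/ B

conj_proof: Conj = (3, "three")        # solves A /\ B
disj_proof = inl(3)                    # solves A \/ B, via the left side
impl_proof: Impl = lambda a: str(a)    # solves A -> B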

Now, we can move on to quantified statements. They get a bit more complicated, but if you read the quantifier right, it’s not bad.

Universal
(\forall x \in A) B(x) is a program which, when executed, yields a program of the form \lambda x.b(x), where b(x) is an implementation of B, and x is an implementation of A. In other words, a universal statement is a program factory, which produces a program that turns one program into another program.

To me, the easiest way to understand this is to expand the quantifier. A quantified statement \forall x \in A: B(x) can be read as \forall x: x \in A \Rightarrow B(x). If you read it that way, and just follow the computational interpretation of implication, you get precisely the definition above.

Existential
Existential quantification is easy. An existential statement \exists x \in A: B(x) is a two-part problem: it needs a value for x (that is, a value of x for which a proof exists that x \in A), and a proof that B(x) holds for that specific value of x. A solution, then, has two parts: it’s a pair (a, b), where a is a value in A, and b is a program that solves the problem B(a).
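
Here’s the same kind of sketch for the quantifiers. Python’s type system can’t express the dependency of B on x, so the comments have to carry that part – again, this is just an illustration, not real dependent types.

from typing import Callable, Tuple

# (forall x in A) B(x): a function from an element x of A to a proof of B(x).
# Example: "every natural number has a larger one" - the proof is the
# program that computes the witness.
forall_proof: Callable[[int], int] = lambda n: n + 1

# (exists x in A) B(x): a pair (a, b) - a witness a, plus a proof b of B(a).
# Example: "some natural number is even" - the witness 4, plus the k
# with 2 * k == 4.
exists_proof: Tuple[int, int] = (4, 2)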

This is the perspective from which most of Martin-Löf’s type theory pursues things. There’s a reason why ML’s type theory is so well-loved by computer scientists: because what we’re really doing here is taking a foundational theory of mathematics, and turning it into a specification language for describing computing systems.

That’s the fundamental beauty of what Martin-Löf did: he found a way of formulating all of constructive mathematics so that it’s one and the same thing as the theory of computation.

And that’s why this kind of type theory is so useful as a component of programming languages: because it’s allowing you to describe the semantics of your program in terms that are completely natural to the program. The type system is a description of the problem; and the program is the proof.

With a full-blown Martin-Löf type system, the types really are a full specification of the computation described by the program. We don’t actually use the full expressiveness of type theory in real languages – among other things, it’s not checkable! But we do use a constrained, limited logic with Martin-Löf’s semantics. That’s what types really are in programming languages: they’re logical statements! As we get deeper into type theory, we’ll see exactly how that works.

Propositions as Proofsets: Unwinding the confusion

My type theory post about the different interpretations of a proposition caused a furor in the comments. Understanding what’s going on that caused all of the confusion is going to be important as we continue to move forward into type theory.

The root problem is really interesting, once you see what’s going on. We’re taking a statement that, on the face of it, isn’t about sets. Then we’re applying a set-based interpretation of it, and looking at the subset relation. That’s all good. The problem is that when we start looking at a set-based interpretation, we’re doing what we would do in classical set theory – but that’s a different thing from what we’re doing here. In effect, we’re changing the statement.

For almost all of us, math is something that we learned from the perspective of axiomatic set theory and first order predicate logic. So that’s the default interpretation that we put on anything mathematical. When you talk about a proposition as a set, we’re programmed to think of it in that classical way: for any set S, there’s a logical predicate P_s such that, by definition, \forall x: x \in S \Leftrightarrow P_s(x). When you see P \Rightarrow Q in a set-theory context, what you think is something like \forall x: x \in P \Rightarrow x \in Q. Under that interpretation, the idea that P \supset Q is equivalent to P \rightarrow Q is absolutely ridiculous. If you follow the logic, implication must be the reverse of the subset relation!

The catch, though, is that we’re not talking about set theory, and the statement P \Rightarrow Q that we’re looking at is emphatically not \forall x : P(x) \Rightarrow Q(x). And that, right there, is the root of the problem.

P \rightarrow Q always means P \rightarrow Q – it doesn’t matter whether we’re doing set theory or type theory or whatever else. But when we talk about the interpretation of P as a set, the set we’re talking about here in the world of type theory is a different set from the one we’d be talking about in classical set theory.

Superset doesn’t suddenly mean subset. Implication doesn’t start working backwards! And yet, I’m still trying to tell you that I really meant it when I said that superset meant implication! How can that possibly make sense?

In type theory, we’re trying to take a very different look at math. In particular, we’re building everything up on a constructive/computational framework. So we’re necessarily going to look at some different interpretations of things – we’re going to look at things in ways that just don’t make sense in the world of classical set theory/FOPL. We’re not going to contradict set theory – but we’re going to look at things very differently.

For example, the kind of statement we’re talking here about is a complete, closed, logical proposition, not a predicate, nor a set. The proposition P is a statement like “‘hello’ has five letters”.

When we look at a logical proposition P, one of the type theoretic interpretations of it is as a set of facts: P can be viewed as the set of all facts that can be proven true using P. In type theory land, this makes perfect sense: if I’ve got a proof of P, then I’ve got a proof of everything that P can prove. P isn’t a statement about the items in P’s proof-set. P is a logical statement about something, and the elements of the proof-set of P are the things that the statement P can prove.

With that in mind, what does P \Rightarrow Q mean in type theory? It means that everything provable using Q is provable using nothing but P.

(It’s really important to note here that there are no quantifiers in that statement. Again, we are not saying \forall x: P(x) \Rightarrow Q(x). P and Q are atomic propositions – not open quantified statements.)

If you are following the interpretation that says that P is the set of facts that are provable using the proposition P, then if P \Rightarrow Q, that means that everything that’s in Q must also be in P. In fact, it means pretty much exactly the same thing as classical superset. Q is a set of facts provable by the statement Q. The statement Q is provable using the statement P – which means that everything in the provable set of Q must, by definition, be in the provable set of P.

The converse doesn’t hold. There can be things provable by P (and thus in the proof-set of P) which are not provable using Q. So taken as sets of facts provable by logical propositions, P \supset Q!
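
To make that concrete: let P be the proposition A \land B, and let Q be the proposition A. Clearly P \Rightarrow Q: from a proof of A \land B, you can always extract a proof of A. And anything that you can prove using A alone, you can certainly prove using A \land B. So the proof-set of Q is contained in the proof-set of P – as proof-sets, P \supset Q, even though as propositions, P implies Q.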

Again, that seems like it’s the opposite of what we’d expect. But the trick is to recognize the meaning of the statements we’re working with, and that despite a surface resemblance, they’re not the same thing that we’re used to. Type theory isn’t saying that the set theoretic statements are wrong; nor is set theory saying that type theory is wrong.

The catch is simple: we’re trying to inject a kind of quantification into the statement P \Rightarrow Q which isn’t there; and then we’re using our interpretation of that quantified statement to say something different.

But there’s an interpretation of statements in type theory which is entirely valid, but which trips over our intuition: our training has taught us to take it, and expand it into an entirely different statement. We create blanks that aren’t there, fill them in, and by doing so, convert it into something that it isn’t, and confuse ourselves.

Simpler Consensus with Raft

A few weeks ago, I wrote about Paxos, which is (at least in my experience), the most widely used algorithm for consensus in distributed systems. I’m a huge fan of Paxos – I think that it’s a remarkably elegant system.

But Paxos does have its problems.

  1. Paxos has a lot of roles: client, proposer, learner, acceptor, leader, follower. When you want to implement Paxos, you need to figure out all of those roles, and how you’re going to implement them. In general, you end up merging roles – but there are lots of ways of doing that merge. Each particular way of setting up the roles has its own properties, and thus its own tradeoffs that you need to understand.
  2. Paxos, as we normally talk about it, is really a single-consensus protocol – that is, the basic protocol is designed to get a group of agents to come to consensus just once. If you want to be able to repeatedly seek new consensus values, you’re actually going to be using an extension to the basic paxos protocol. There are a ton of Paxos extensions that work to add repeated consensus. Paxos itself is simple and elegant, with well-defined formal properties that we care about – the moment we start modifying it, we can no longer count on those properties unless we can also prove them in our extension!
  3. Paxos was originally described in a truly awful paper. Leslie Lamport was trying to write a paper that would be less dull than the typical bone-dry technical snoozer – but the way that he wrote it actually makes it much harder to understand.

In short: Paxos has more complexity than it needs, and despite that, it needs to be tweaked to be really useful, and getting those tweaks right is hard. There are, sadly, a lot of incorrect Paxos implementations – and their incorrectness has all-too-often come as a surprise to the people who rely on them.

To avoid those problems, there are other consensus algorithms out there. In this post, we’re going to look at one of the Paxos competitors – a consensus algorithm/protocol called raft.

Raft does away with the role complexity of Paxos. In Raft, you have a collection of cooperating agents. There are no distinct proposers, acceptors, or learners: there are just servers. Communication between the servers in raft is done entirely with synchronous remote procedure calls.

In Raft, the target of consensus is a log containing a sequence of events. The log is the history of the distributed system. The goal of raft is that the log be maintained in a consistent state throughout the raft network. Just like in Paxos, if we have 2n+1 servers, up to n can fail without the network losing its consistency.

Raft is designed in terms of remote procedure calls between the elements of the network. In Raft, we never talk about single messages – every communication between servers is a pair of messages: a request from caller to callee; and a response from callee to caller. When a message gets lost, we’ll just talk about it as a failed remote procedure call.

Within a Raft network, at any given time, each server has a state. It can be a follower, a leader, or a candidate. Within the network, there is at most one leader. When there is a leader, all of the other servers are in the follower state. The followers are almost entirely passive. Followers don’t talk to clients at all – they just wait for RPCs from the leader. The leader is the only participant that’s allowed to talk to clients: any client request must go through the current leader. The leader is also the only server that’s allowed to add new entries to the consensus log.

Raft divides time into a sequence of terms. In each term, the servers in the raft network need to select a leader using a process called an election. Raft is a strong leader protocol – no interactions with a client can take place except through a leader. If there’s no leader, then no client requests can be processed at all.

So, to understand Raft, there are three processes that we need to understand:

  1. Leader election
  2. Transitions between terms
  3. Appending an entry to a log.

In those processes, the servers have a collection of variables that they use for the Raft protocol:

currentTerm
the current term for the server.
votedFor
the serverID that this server voted for in the current term, or “none”.
log
the list of entries in the log.
commitIndex
The index of the highest log entry known to be committed by the server.
lastApplied
The index of the highest log entry that the server has actually applied to its local state. (An entry doesn’t become committed until a majority of servers accept it, and only committed entries get applied.)
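
As a sketch, that per-server state might look something like this in Python. (The names follow the list above; this is illustrative, not a full implementation.)

from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class LogEntry:
    term: int
    command: Any

@dataclass
class ServerState:
    currentTerm: int = 0
    votedFor: Optional[str] = None       # a serverID, or None
    log: List[LogEntry] = field(default_factory=list)
    commitIndex: int = 0
    lastApplied: int = 0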

Leader Election

In each term, the Raft cluster needs to have a leader. The way that a leader is selected is called an election.

Elections are triggered by a term transition. When a server in the cluster decides that it needs to start a new term, it increments its term number, puts itself into the candidate state, and sends a RequestVote(term, candidateId) RPC to each of the other servers in the cluster. This request asks the other servers in the cluster to select it as the leader. If it receives enough “yes” votes, it will become the leader.

When a server receives a RequestVote RPC, it checks the term. If it’s smaller than the server’s current term, then it replies “No” – meaning that it cannot support the requestor as leader.

If the term in the request is greater than the receiver’s term, then the receiver cannot have voted in the new term. So it updates to the term from the request, and then it replies “Yes”.

If the term in the request equals the receiver’s term, then the receiver has already updated its term. If it’s already voted for someone else as leader, then it can’t support the requestor, so it replies “No”. If it hasn’t voted for a leader in the term, then it votes for the requestor, and replies “Yes”.

If the requester receives “Yes” votes from more than 1/2 of the cluster (counting itself), then it becomes the leader, and starts both processing requests from clients, and sending heartbeats to the other servers in the cluster.

If it doesn’t receive enough votes, then it waits to see if anyone else becomes the leader and starts sending heartbeats. If it doesn’t get a heartbeat in time, then it starts over: it increments its term again, and tries to start a fresh election.
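
Putting those voting rules together, the receiver’s side of RequestVote looks roughly like this. (A sketch against the hypothetical ServerState above; real Raft also checks that the candidate’s log is at least as up-to-date as the receiver’s, which I’m glossing over here.)

def receive_request_vote(state: ServerState, term: int, candidateId: str) -> bool:
    if term < state.currentTerm:
        return False                     # stale term: "No"
    if term > state.currentTerm:
        state.currentTerm = term         # newer term: adopt it; we can't have
        state.votedFor = candidateId     # voted in it yet, so vote "Yes"
        return True
    if state.votedFor in (None, candidateId):
        state.votedFor = candidateId     # same term, no conflicting vote: "Yes"
        return True
    return False                         # already voted for someone else: "No"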

Term Transitions

For a given server, term transitions happen in three ways:

  1. Timeout: the leading server needs to periodically communicate with each of the followers. This process is called heartbeat: even if the leader has no updates for its followers, it sends RPC calls to the followers just to say “I’m still here”. If a follower goes too long without receiving a heartbeat, it decides that the leader was lost, and it will increment the term number, and trigger a new election.
  2. Leader resignation: the current leader can, at any time, decide to stop being the leader. (This is typically done by an implementation as part of a system that says that there’s a maximum period between leader elections. For example, in the Aurora scheduler, we had leader elections at least once per day. In a raft consensus, the leader would trigger this by deciding it was time for it to stop being a leader, and triggering an election by starting a new term.)
  3. External term change: every RPC received by a server includes a term number. If any RPC to a server ever includes a term number greater than the current term for that server, the server will update its term to the new number. As a special case of this, when a leader server decides to resign, it does that by sending an RPC to the other servers with an incremented term number.

Appending to the log

We just spent a fair bit of time talking about leaders and elections. That’s almost beside the point. What we really want to do is just maintain a consistent log across the cluster of servers. Everything except creating log entries is just the book-keeping that’s necessary to make the consistent log work. The log itself is maintained using the AppendEntries RPC call.

When a client request does something that alters the state of the cluster, the leader describes that change by adding an entry to the log. It builds a proposed log entry, and sends it to the other members of the cluster using an RPC. If it gets enough “Yes” votes from other cluster members, then the log entry becomes committed, and the leader updates its commitIndex to the index of the new log entry to reflect that.

The RPC request takes a bunch of parameters:

  1. term: the leader’s term.
  2. leaderId: the id of the leader.
  3. prevLogIndex: the index number of the last log entry in the consensus log preceding this new entry.
  4. prevLogTerm: the term of that last preceding log entry.
  5. entries: a set of new log entries to be appended to the log.
  6. leaderCommit: the leader’s commitIndex – the index of the most recent log entry that the leader knows to be committed.

When an AppendEntries call is received by a follower, what it does is:

  • If the receiver’s term is greater than the request term, then the receiver rejects the request by replying “No”.
  • If the receiver’s commit index is larger than the commit index of the request, then it rejects the request by replying “No”.
  • If the receiver’s log doesn’t contain an entry at prevLogIndex, or that entry’s term doesn’t match the request term, then it rejects the request by replying “No”.
  • If an entry in the receiver’s log conflicts with one of the new entries – same index, but a different term – then the receiver removes the conflicting entry and all of the entries after it from its log.
  • The receiver then appends the new entries from the request to its log.
  • If the leaderCommit is greater than the commitIndex on the receiver, then the receiver updates its commitIndex.
  • Finally, the receiver replies “Yes”.
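
Here’s a rough sketch of that receiver logic, continuing the hypothetical ServerState sketch from above. (I’m using 1-based log indices, with prevLogIndex = 0 meaning “the log was empty”.)

def receive_append_entries(state: ServerState, term: int, leaderId: str,
                           prevLogIndex: int, prevLogTerm: int,
                           entries: List[LogEntry], leaderCommit: int) -> bool:
    if term < state.currentTerm:
        return False                                # stale leader: "No"
    if prevLogIndex > 0:
        if len(state.log) < prevLogIndex:
            return False                            # missing the previous entry: "No"
        if state.log[prevLogIndex - 1].term != prevLogTerm:
            return False                            # previous entry doesn't match: "No"
    # Discard any conflicting suffix, then append the new entries.
    state.log = state.log[:prevLogIndex] + list(entries)
    if leaderCommit > state.commitIndex:
        state.commitIndex = min(leaderCommit, len(state.log))
    return True                                     # "Yes"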

When a majority of the cluster members have accepted an AppendEntries call, then the log entry gets committed.

The one part of this that’s confusing is how the logs get managed. The leader creates a new log entry, and sends it to the other servers. The complexity comes from dealing with cases where something doesn’t reach consensus.

For example, the leader sends entries 5, 6, and 7 to server S. S adds the entries to its copy of the log – it now contains [1, 2, 3, 4, 5, 6, 7]. Meanwhile, the leader also sends those entries to server T, but the RPC to T fails due to a network fault. Another client request happens, and now the leader sends [5, 6, 7, 8] to S. S sees that it’s got entry 5 already: so it discards everything after 5, and then re-appends.

So the trailing segment of the log can change! How do we handle consensus?

The next time that the leader sends an AppendEntries to a follower, it contains the leader’s commitIndex. The follower updates its commit index to that value. Once it’s done that, any request from a leader that tries to modify anything that comes before that commit index will be rejected.

The consensus commit thus doesn’t really occur until the next heartbeat call after a log update.

Raft versus Paxos

That’s the basics of Raft.

In comparison to Paxos, there’s a couple of things to notice:

  1. There’s a lot less confusion around roles. Paxos had a ton of different roles, and rules for interactions between the different roles. Raft doesn’t have any of that: it’s just servers, with one of the servers designated as the leader.
  2. Raft explicitly manages a log, and it adds complexity around log management. In Paxos, you’re just managing a single consensus value; in Raft, you’ve got a sequence of log entries.
  3. Paxos is defined in terms of messages; Raft is designed in terms of remote procedure calls.

So is Raft really simpler than Paxos? I think that’s up for discussion. Personally, I prefer Paxos. There’s a lot of complexity hidden under the covers of the RPC system. It looks simple on the surface, but all of the complexity of message passing, lost messages, message duplication – it’s still there. It’s just been swept under the carpet, as if that really makes it easier.

The way that the logs get maintained is confusing. That’s inevitable: maintaining distributed knowledge is never easy. Raft at least makes that part of things explicit, whereas in Paxos, it’s a common part of implementations, but it’s not really specified in the protocol.

The Program is the Proof: Propositions in Type Theory

As usual, I’m going in several different directions. I’m going to continue doing data structure posts, but at the same time I also want to get back to the type theory stuff that I was writing about before I put the blog on hiatus.

So let’s get back to a bit of Martin-Löf type theory! (The stuff I’m writing about today corresponds, roughly, to chapter 2 of the Nordstrom/Petersson/Smith text.)

One of the key ideas of Martin-Löf’s type theory is that a logical statement is exactly the same thing as a specification of a computation. When you define a predicate like “Even”, the definition specifies both the set of numbers that satisfy the predicate, and the computation that tests a number for membership in the set of even numbers. If you haven’t provided enough information to fully specify the computation, then in Martin-Löf type theory, you haven’t defined a predicate or a set.

When you say “2 is even”, what you’re really saying in terms of the type theory is that “The computation for ‘even(2)’ will succeed”. The computation and the logical statement are the same thing.

In functional programming, we like to say that the program is the proof. Martin-Löf type theory is where that came from – and today we’re going to take a first look in detail at exactly what it means. In the world of type theory, the program is the proof, and the proof doesn’t exist without the program.

This creates an interesting set of equivalent interpretations. When you see a statement like “x : T” (or x \in T), that could be interpreted in the following ways, all of which are really equivalent in type theory.

  1. Set theoretic: x is a member of the set T.
  2. Intuitionistic: x is a proof object for the proposition T.
  3. Computational: x is a program that satisfies the specification T.
  4. Abstract: x is a solution for the problem T.

In the rest of this post, I’m going to focus on those four interpretations, and explain how each of them makes sense in this version of type theory.

The set theoretic interpretation is obvious – as the name suggests, it’s nothing but what we all learned from basic set theory. An object is a member of a set – which means, from set theory, that the object satisfies some predicate in first order predicate logic – because that’s what it means to be a member of a set.

The intuitionistic interpretation is almost the same as the set theoretic, but rewritten for intuitionistic logic. In intuitionistic logic, the predicate over the set is written as a proposition T, and if we know that x is a member of the set T, then that means that we have a proof that x demonstrates that T is true.

The computational interpretation takes the intuitionistic one, and rephrases it in computational terms. A logical proposition, rendered into a computational setting, is just a specification of a program; and a proof of the proposition is a program that satisfies the specification.

Finally, the abstract interpretation just rephrases the computational one into terms that aren’t tied to a computing device. A predicate is a problem that needs to be solved; anything that provides a solution to the problem is demonstrating a member of the set.

The key takeaway though is the basic idea here of what a type is. What we’re talking about as a type here is something that goes far beyond any programming language’s idea of what a type is. In intuitionistic type theory, a type is a specification of a computation. If we had this in a real language, that would mean that any program that typechecked would be guaranteed to work: asserting that x has type T means, precisely, that x is a computation that matches the specification!

(Of course, that’s not the panacea that you might think the first time you hear that. The catch is simple: the type is a specification of the computation. That means that just writing a type is a form of programming! And that means that your type descriptions are going to have bugs. But we’ll come back to that in a much later post.)

What type theory is doing is taking something like set theory, and re-rendering it entirely in a computational world. It still has a mapping from the computations to the abstract concepts that we use when we’re thinking, but if we can talk about those abstract concepts in type theory, we’ll always do it by mapping them into some kind of computation.

In type theory, we’re not dealing in a world of pure mathematical objects that exist if we can describe them; instead, we’re building a world where everything is at least theoretically computable. That might seem constraining, but every proof already corresponds to a computation of some sort; the only additional constraint here is that we can’t play tricks like the axiom of choice, where we can “prove” the existence of some unattainable, intangible, nonsensical object.

To make that work, we’re going to take all of the constructs that we’re used to seeing in intuitionistic logic, and give them a meaning in terms of computations.

For example, in set theory, we can have a statement A \supset B – meaning that A is a superset of B, that every element of B is also necessarily an element of A. In type theory, since A and B are specifications of computations, that means that a member (or proof) of A \supset B is a computation that given a proof of A, generates a proof of B – in short, that A implies B.

Now, suppose that we want to prove A \supset A. How could we do that? We need a program that given a proof of A generates a proof of A. That is, we need an implementation of the identity function: \lambda a . a.

In fact, using the computation interpretation of things, we can interpret A \supset B as being the type of a function that takes an instance of A, and generates an instance of B – that is, that if f : (A \supset B), then f is a function from an instance of A to an instance of B!

The only trick to that is understanding that in type theory, saying that a is an element of A means that a is a proof of A. Using the same interpretation, that means that f: A \supset B means that f is a proof of A \supset B – which means the same thing as saying that given an example of an element of A (a proof of A), f will produce an element of B (a proof of B). The statement A \supset B is exactly the same thing as the logical implication A \rightarrow B, which is exactly the same thing as the type of a function from A to B.
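
You can see a shadow of this in ordinary typed programming. Here’s a sketch in Python type hints: the identity function inhabits the type Callable[[A], A] for any A – which is exactly what makes it a proof of A \supset A.

from typing import Callable, TypeVar

A = TypeVar("A")

def identity(a: A) -> A:    # the lambda a . a from above
    return a

# A program of type A -> A is a proof of A -> A.
proof_of_a_implies_a: Callable[[A], A] = identity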


Notes on sources: I’m working from two main references in this series of posts, both of which are available in PDF form online.

  1. “Programming in Martin-Lof’s Type Theory”, by Nordstrom, Petersson, and Smith, which you can download here.
  2. “Type Theory and Functional Programming” by Simon Thompson (available here).

In addition, I first learned a lot of this from reading some papers by Phil Wadler and Simon Peyton Jones. The exact references to those are long-lost in the shadows of my memory. But any of their papers are well worth reading, so just read them all!

Mathematical Data Structures Part 1: Binary Heaps

Being a PhD in computer science, it’s only natural that I love data structures. In particular, I’m fascinated by the way that the math factors in to the way we structure data. Data structures fit into a beautiful intersection between information theory and algorithms: the way that a good data structure is built is a reflection of what information it really needs to maintain. The best data structure encodes exactly the information it needs in order to do its job – no more, and no less. The mathematical impacts of that are beautiful, and sometimes surprising. To try to demonstrate that, I’m going to take a couple of posts, and work my way through one of my favorite examples of a surprising outcome in a structure called a Fibonacci heap.

A heap is a structure designed to solve a common problem. You’ve got a collection of objects, each of which has an associated numeric value. You want, at any time, to be able to find and remove the largest value in the collection, and to be able to add new elements to it. Those two operations are the core of the heap. Some variations also allow you to increase the value of objects inside the heap, or to remove values other than the maximum.

There are a lot of different ways to implement a heap. One obvious one is to just maintain a sorted sequence of objects. The problem with that is performance: some of the common operations are painfully slow!

Using the sorted sequence approach, removing the largest value is easy: you just remove the last element of the sequence. That’s very fast: it’s constant time. But you also need to be able to add values to the heap, and that’s not so good.

There are two basic ways of doing a sequence: an array, or a linked list. In both cases, the performance isn’t acceptable. If we used an array, then in order to add a new object to the collection, we’d need to:

  1. Find the correct position for it in the array. We can do that by doing a binary search, which takes time O(lg n) where n is the length of the array. This step isn’t bad – in general, we’re pretty happy with O(lg n) operations.
  2. Insert the value into the array – which means shifting all of the elements that come after it one place to the right. That’s O(n) time, which is pretty crappy.

In the linked list approach, inserting the value isn’t a problem – it’s a constant time operation. But finding the position where it should be inserted is linear time. So we’re still talking about linear time.

The problem with the sorted sequence approach isn’t really related to the kind of structure we use to maintain the sorted list; the problem is that we’re maintaining more information than we need. At any time, we want to be able to find the largest element of the heap quickly – we don’t care about the relative positions of any pair of values that don’t include the largest element of the collection! But if we keep a sorted list, every time we insert an element, we’re spending a lot of time comparing things whose comparison we don’t really care about!

To be able to make it faster, we need to build a data structure that doesn’t waste time and effort computing and maintaining information that we don’t want.

So our goal is to find ways of building structures that always let us both find the largest element quickly, and add new elements quickly, without maintaining more information than is really necessary. We’ll start off with a simple but good version, and then work our way through to better ones.

The most basic implementation of a heap is called a binary heap. A binary heap is a binary tree with two key properties:

  1. Every node in the tree is larger than its children.
  2. The tree is left-full: every level of the tree is full except for the last; and the last level is filled in from left to right.

The left-full property might seem a bit strange, but it turns out to be pretty straightforward. A binary heap can be implemented using an array. The root node is stored in the first position of the array; its children are in positions 2 and 3; the children of node 2 are stored in positions 4 and 5; the children of node 3 are stored in positions 6 and 7. Using one-based indices, for any node N, its children are stored in positions 2N and 2N+1. Adding a new leaf to the tree can always be done by just appending one value to the array. The left-full property just means that you always extend the tree by appending an element onto the end of the array.

Implementing a heap this way is simple:

  1. To get the maximum value, you just look at the first element of the array – O(1).
  2. To remove the largest element from the array, you get the value from the first element of the array, and save it. Then you remove the last element from the array, put it into the first position, and bubble it down – swapping it with one of its children if they’re bigger than it. We’ll look at this in more detail, but the bubble down process is O(lg n) in the worst case.
  3. Inserting a new element is done by adding it to the end of the array, and then bubbling up, by comparing it to its parent, and swapping if it’s bigger than its parent. Again, it’s O(lg n).

I’m going to show code for this. For fun, I wrote the code in a language called xtend. Xtend is a Java extension that cleans up the syntax, gets rid of semicolons, improves the type system, adds lambdas, and does a few other really neat things.

The whole beast is just a wrapper around an array:

class BinHeap<T extends Comparable<T>> {
  val ArrayList<T> _contents

  new() {
    _contents = new ArrayList<T>()
  }

  ...
}

If you know Java, this is mostly clear. In xtend, you write constructors using the name “new” instead of the name of the class being constructed.

Then we’ll set up some utilities to make other stuff easier to write.

  def leftChildPosition(int pos) {
    2 * (pos + 1) - 1
  }

  def rightChildPosition(int pos) {
    2 * (pos + 1)
  }

  def int parentPosition(int pos) {
    if (pos == 0) {
      throw new MaxHeapException()
    } else {
      (pos + 1)/ 2 - 1
    }
  }

  def void swap(int one, int two) {
    val T first = _contents.get(one)
    _contents.set(one, _contents.get(two))
    _contents.set(two, first)
  }

Again, these should be straightforward. The only tricky thing is that the JVM uses zero-based arrays – so the left child of the node in position N is (2*(N+1) - 1): we need to add one to the node number to shift to one-based position; and then subtract one from the result to switch back to zero-based position. We do a similar thing for each of the other position computations.

Now we can get to the interesting bits. How do we get values into the heap?

def insert(T v) {
  val idx = _contents.size()
  _contents.add(v)
  bubbleUp(idx)
}

Insert is exactly what I described in prose above: append the new value onto the end of the array, and then bubble it up. Bubbling is the interesting part:

  private def void bubbleUp(int pos) {
    if (pos > 0) {
      val parentPos = parentPosition(pos)
      if (_contents.get(pos) > _contents.get(parentPos)) {
        swap(pos, parentPos)
        bubbleUp(parentPos)
      }
    }
  }

Bubbling up from a position P compares P to its parent. If it’s bigger than its parent, it swaps positions with the parent, and then tries to continue bubbling up from its new position.

For example, imagine we had a tree like:

9
  8
    5
      4
      0
    6
      3
  7
    2
    1

Now, suppose we wanted to add the value “10” to this. We’d add 10 to the end of the array, which would make it a child of 6. That would give us:

  9
    8
      5
        4
        0
      6
        3
        10
    7
      2
      1

So, we’d compare 10 to its parent – it’s bigger, so we’d swap:

  9
    8
      5
        4
        0
      10
        3
        6
    7
      2
      1

Then we’d compare 10 to its new parent, 8. It’s bigger, so we swap:

  9
    10
      5
        4
        0
      8
        3
        6
    7
      2
      1

And finally, we’d compare 10 to its new parent, 9. It’s bigger so we swap, and then we’re done.

  10
    9
      5
        4
        0
      8
        3
        6
    7
      2
      1

Appending to the end of the array is constant time, so the dominant time cost is the bubbling. The maximum possible number of swaps in the bubbling process is the depth of the tree minus 1 – and the depth of a full binary tree with N members is \lceil \lg N \rceil. So it’s O(lg n) swaps, and the overall cost of inserts is O(lg n).

Getting the largest value is trivial:

def getMax() {
  _contents.get(0)
}

Removing the largest value is a lot like adding a value: we really play with the last element of the array, and then do a bubbling process – only this time we’ll bubble in the opposite direction:

  def removeMax() {
    if (_contents.size == 0) {
      throw new MaxHeapException()
    } else {
      val result = getMax()
      val last = _contents.remove(_contents.size() - 1)
      if (_contents.size() > 0) {
        _contents.set(0, last)
        bubbleDown(0)
      }
      result
    }
  }

Bubbling down is similar to bubbling up, but it’s a bit more complicated, because we need to look at both children.

  private def void bubbleDown(int pos) {
    val rightChildPos = rightChildPosition(pos)
    val leftChildPos = leftChildPosition(pos)
    if (leftChildPos >= _contents.size) {
      return
    }
    // Bubble left if there is no right child, or if the left child is
    // bigger than the right.
    if (rightChildPos >= _contents.size || _contents.get(leftChildPos) > _contents.get(rightChildPos)) {
      if (_contents.get(pos) < _contents.get(leftChildPos)) {
        swap(pos, leftChildPos)
        bubbleDown(leftChildPos)
      }
    } else {
      // Otherwise, try to bubble right.
      if (_contents.get(pos) < _contents.get(rightChildPos)) {
        swap(pos, rightChildPos)
        bubbleDown(rightChildPos)
      }
    }
  }

The process is almost the same as bubbling up, but moving in the opposite direction. We're starting with a parent node, and comparing it to its children. If it's smaller than either of its children, then we swap it with the larger of its children, and then continue bubbling down.

For example, let's look at the same heap we looked at for insert:

  9
    8
      5
        4
        0
      6
        3
    7
      2
      1

If we want to remove 9, we set the value 9 aside, and then remove 3 from the end of the array, and put it at the root of the tree:

  3
    8
      5
        4
        0
      6
    7
      2
      1

Then we'd compare 3 against its two children, 8 and 7. Since 8 is the larger child, we swap 8 for 3:

  8
    3
      5
        4
        0
      6
    7
      2
      1

Now we compare 3 with its new children, 5 and 6. 6 is bigger, so we swap 6 with 3:

  8
    6
      5
        4
        0
      3
    7
      2
      1

3 has no children, so we're done: it's bubbled down as far as it can go.

Note: I messed up this example in the original version of the post. Thanks to John Armstrong for pointing it out.

The cost here is the same as insert, for the same reason. The dominant cost is the bubbling, and the bubbling is bounded by the depth of the tree. So removing the maximum is also O(lg n).

It's worth noting that heaps can be used to build a very reasonable sorting algorithm. To sort a collection, just insert all of the elements of the collection, and then remove them one by one. It's O(n lg n), and it's conceptually quite simple. It's not widely used, because the old classic quicksort is faster – not in big-O notation, but it ends up with a smaller constant. (In big-O notation, something that takes 3(lg n) steps and something that takes 6(lg n) steps are both O(lg n), but the one whose constant is 3 is still twice as fast as the one whose constant is 6.)
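
For comparison, here's that heap-sort idea as a quick Python sketch, using the standard library's heapq. (heapq is a min-heap, so this version produces ascending order; the max-heap from this post would produce descending order.)

import heapq

def heap_sort(values):
    heap = []
    for v in values:
        heapq.heappush(heap, v)       # n inserts, O(lg n) each
    return [heapq.heappop(heap) for _ in range(len(values))]   # n removals

print(heap_sort([9, 8, 7, 5, 6, 2, 1, 4, 0, 3]))   # [0, 1, 2, ..., 9]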

Paxos, a really beautiful protocol for distributed consensus

The work that I do in real life is all focused on tools for other developers. In today’s environment, that means that I’ve spent a lot of time working on tools that, in one way or another, help other developers deal with distributed systems. In that work, I’ve noticed that there are some really key things that straddle the line between pure math and pure engineering. That’s really interesting to someone like me!

A good example of that is something called paxos. My first exposure to paxos was very interesting. I’d just been hired by Google, and was working on their build tool. At the time, engineers in Google had a problem. Google’s codebase was contained in one massive version control repository. Doing things that way has a ton of really neat advantages – most importantly, the fact that it makes it really convenient to reuse code written by anyone else at the company. The problem was that code reuse can become very confusing. Project A reuses a bit of code written by people from project B. B’s code reused some stuff from C, and C from D, and D from E. So now project A is using code from project E, and they don’t know why!

In this case, I had someone from a storage project coming to me, trying to figure out just why his system had a dependency on a plan9 database system called paxos. I had to build a tool that would allow people to ask questions like “Why does A depend on E?”.

As it turned out, paxos was a really important thing, and it was widely reused through the Google codebase. Once I learned about it, I started seeing it everywhere. Since I left Google nearly four years ago, I haven’t stopped seeing it. It’s ubiquitous in distributed systems. Outside of Google, we weren’t using that friendly old plan9 paxos implementation – but the paxos model has been reimplemented dozens of times, because it’s so darned useful!

paxos is a system for managing consensus.

In distributed systems, there’s a collection of hard problems that you constantly need to deal with.

  1. Things fail. You can never count on anything being reliable. Even if you have perfectly bug-free software, and hardware that never breaks, you’ve still got to deal with the fact that network connections can break, or messages within a network can get lost, or that some bozo might sever your network connection with a bulldozer. (That really happened while I was at Google!)
  2. Given (1), you can never rely on one copy of anything, because that copy might become unavailable due to a failure. So you need to keep multiple copies, and those copies need to be consistent – meaning that at any time, all of the copies agree about their contents.
  3. There’s no way to maintain a single completely consistent view of time between multiple computers. Due to inconsistencies in individual machine performance, variable network delays, variable storage latency, and several other factors, there’s no canonical way of saying that for two events X and Y, “X happened before Y”. What that means is that when you try to maintain a consistent set of data, you can’t just say “Run all of the events in order”, because while one server maintaining one copy might “know” that X happened before Y, another server maintaining another copy might be just as certain that Y happened before X.

In a world where you can’t count on different agents seeing events in the same order, and where you expect things to be constantly failing, how can you make sure that any distributed system you build ends up with a consistent view of reality?

The answer is a consensus protocol. You need to create a mechanism, based on communication between the copies of your system, that allows them to maintain a consistent consensus about the current state of the world, even in the presence of failures of machines, storage systems, and communications.

paxos is a very elegant, reasonably simple consensus protocol.

Let’s get a bit more precise. Paxos is built on a model of storage. The original application of it was a consistent database, so it’s built around the idea of keeping data consistent. In paxos, the state of the storage is modelled as a sequence of transactions. Each transaction is a pair (t, v), where t is a numeric transaction identifier, and v is a transaction value.

The state of the system being modelled is a sequence of transaction pairs, [(t_1, v_1), (t_2, v_2), \ldots], where the transaction identifiers increase as you progress through the sequence. As time passes, new transaction pairs can be added to the state.

The focus of the paxos protocol is ensuring that in a collection of 2n+1 participants, all surviving participants will agree on the current value of the state, even if up to n participants fail, and even if messages can be delivered in arbitrary order.

Before I go further into the description of paxos, we need to look at the basic assumptions that underlie it. Like any formal model, it’s not defined in terms of real computers. It’s defined in terms of an abstraction that approximates reality. In this case, the approximation is quite good, but we still need to go through the basic assumptions that make up its model of the universe.

  1. Processors (aka participants, servers, computers):
    1. operate at any speed. No two processors necessarily operate at the same speed.
    2. may fail without warning.
    3. may rejoin after recovering from a failure.
    4. are cooperative (in the sense that they do not attempt to cause failures).
  2. Network:
    1. Delivers messages between any pair of processors.
    2. Transmits messages asynchronously.
    3. Delivers messages at arbitrary speeds.
    4. Does not guarantee that messages will be delivered in the order in which they were transmitted.
    5. Does guarantee that a message, if delivered, will be delivered correctly, without any changes.
    6. May fail to deliver a message.
    7. May deliver multiple copies of the same message.

In short, everything can fail at any time; after failure, participants can recover and rejoin the system; and no part of the system acts in an actively adversarial way.

The protocol describes the behavior of the system in terms of a collection of roles. A participant can have more than one role in the system – in fact, in most implementations of paxos, all participants do have multiple roles. The roles are:

Client
The client is not part of the paxos cluster. It’s an external entity whose actions trigger state changes by making requests to the paxos system. Each state update in paxos is initiated by a client request, and completed by a reply to the client.
Acceptor
An acceptor (also called a voter) is a participant in the maintenance of distributed storage. A state change in a paxos cluster does not occur until a majority (quorum) of acceptors agree upon it.
Proposer
A proposer receives a request from the client, and attempts to get a quorum of acceptors to agree on it.
Leader
One of the proposers is special. It is the single proposer who most recently had a proposal accepted. In many paxos implementations, there is only one active proposer serving client requests: the only time the other proposers send proposals is when the current leader fails, and a new one needs to be selected.
Learner
The learner is the real service provided by the paxos cluster. Once a proposal is accepted, a learner processes the request from the client, and sends it the result.

In a typical paxos cluster, the client sends requests to a proposer. The proposer sends a proposal to update the state with the new client request, and attempts to convince a majority of the acceptors to accept it. Once a majority accepts it, the client request is processed by the learner, and a result is returned to the client.

The meat of paxos is the protocol by which a proposer gets a majority of acceptors to agree on a proposal, and the way that that process ensures that the collection of acceptors maintains a consistent state.

The protocol itself is pretty simple. Each round is effectively independent, and consists of a process of attempting to reach consensus. Within each round, finding consensus is a two-phase process, where each phase consists of a message sent from a proposer to a group of acceptors, and a reply from the acceptors to the proposer.

  1. Phase One: Prepare/Promise
    • Proposer: A proposer attempts to start setting a new consensus by sending a Prepare(N) message to a quorum of acceptors. It can send to any group of acceptors, so long as that group forms a majority of the acceptors. The prepare message specifies a numeric identifier N for its proposal, which is larger than the number of any proposal this proposer has sent before.
    • Acceptors:
      Each acceptor, upon receiving the proposal, checks whether the N-value from the prepare message is greater than that of any proposal it has already promised or accepted in the current round. If so, it sends a reply called a Promise to the proposer, promising that it will never accept any proposal with a number less than N. If the acceptor has already accepted a proposal in the current round, then it includes the pair (v, n_v), consisting of the accepted consensus value v and the number n_v of the proposal that proposed v.
      The acceptor thus sends a message Promise(N, (v, n_v)) (if it has accepted a proposal this round) or Promise(N, null) (if it has not yet accepted any proposal this round).

      Once it’s sent a promise message, it must not accept any proposal with a number less than N. Note though that this does not mean that the acceptor promises to accept the proposal: all it’s doing is promising not to accept any proposal with a number less than N! If it receives a message Prepare(N+1), it’s free to promise that – but if it does, it will no longer be able to accept the proposal numbered N.

      (If N is smaller than the number of any proposal promised or accepted by the acceptor, then in the original version of paxos, the acceptor does nothing; in some optimizations of the protocol, it replies Reject(n_v).)

    What this phase does is allow a proposer to determine whether or not a new proposal is even worth considering. If a quorum (majority) of acceptors send promises, then it can move on to phase 2.

  2. Phase Two: Accept!/Accepted

    When a proposer receives promises from a quorum of acceptors, it moves forward to try to actually commit the proposal. In order to do this, it needs to choose a value for the proposal. If any of the Promise messages contained a value, then the value of this proposal must be the value carried by the highest-numbered proposal among those promises. If all of the promises were empty, then the proposer can choose any value that it wants for the proposal.

    Once the proposer has chosen a value, then it sends a message Accept!(N, V) to a quorum of acceptors. This is typically written with the exclamation point, because it’s really a command to the acceptors: they’re being told to accept the proposal, if they can.

    When an acceptor receives an Accept!(N, V) message, if it has not issued a promise for a proposal with a number greater than N, then it must accept it. It accepts the proposal by sending a message Accepted(N, V) to both the original proposer and all of the learners.

    When Accepted messages have been received from a quorum of acceptors, the new value V becomes the consensus value for the paxos cluster, and the new proposal number N is fully committed. (The sketch just below shows the acceptor’s side of both phases.)
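
To make the two phases concrete, here's a minimal sketch of an acceptor in Python. The class and message names here are mine, purely for illustration – and a real acceptor would also write this state to stable storage, so that it could rejoin correctly after a failure:

  # A sketch of a paxos acceptor, for a single consensus round.
  class Acceptor:
      def __init__(self):
          self.promised_n = None   # the highest proposal number we've promised
          self.accepted_n = None   # the number of the proposal we've accepted, if any
          self.accepted_v = None   # the value of the proposal we've accepted, if any

      def prepare(self, n):
          # Phase one: handle Prepare(n).
          if self.promised_n is not None and n <= self.promised_n:
              # We've already promised a proposal numbered >= n: do nothing,
              # or send a Reject in the optimized variants.
              return ("Reject", self.promised_n)
          self.promised_n = n
          # The promise carries the previously accepted (value, number) pair, if any.
          if self.accepted_n is not None:
              return ("Promise", n, (self.accepted_v, self.accepted_n))
          return ("Promise", n, None)

      def accept(self, n, v):
          # Phase two: handle Accept!(n, v).
          if self.promised_n is not None and n < self.promised_n:
              # We promised not to accept anything numbered below promised_n.
              return ("Reject", self.promised_n)
          self.promised_n = n
          self.accepted_n = n
          self.accepted_v = v
          # In the real protocol, Accepted goes to the proposer and all learners.
          return ("Accepted", n, v)

  a = Acceptor()
  print(a.prepare(1))        # ('Promise', 1, None)
  print(a.accept(1, "v1"))   # ('Accepted', 1, 'v1')
  print(a.prepare(2))        # ('Promise', 2, ('v1', 1)) - the accepted pair rides along
  print(a.accept(1, "v2"))   # ('Reject', 2) - 1 is below the promised number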

As with so many things, this is easier to understand when you think about an example. One use of paxos that I’ve worked with is in a cluster scheduling service. In that system:

  • a client is a user attempting to run a new job on the cluster. It sends a request to the scheduler detailing the set of resources that it wants to request.
  • Each replica of the scheduler is a proposer, an acceptor, and a learner. There’s one active instance of the scheduler, which is the leader. When a client wants to schedule a job, its request gets sent to the leading scheduler.
  • In the normal non-error case, this works as follows:
    1. When a scheduling request is received, the leader proposes scheduling the job, by sending a message to all of the other schedulers saying that it wants to schedule job N.
    2. The other schedulers, if they haven’t seen a proposal for a job with a number greater than N, make promises to accept that proposal.
    3. The leading scheduler chooses resources for the job, and then sends an Accept! message to the other schedulers.
    4. The other schedulers reply accepting the scheduling. The non-leader schedulers, acting as learners, record the scheduling information, and the leader actually starts the job.
  • Errors occur when there’s been some kind of failure. In that case, we don’t necessarily know who the leader is – so we get multiple schedulers trying to act as if they’re the leader. So they each send proposals. Whichever proposal has the largest proposal number will eventually get accepted, and its proposer becomes the new leader.

It’s a pretty simple thing – the core concept is simply that no consensus proposal is considered “committed” until it’s been accepted by a majority of the participants. And if it’s been accepted by a majority of the participants, then no conflicting proposal can ever reach consensus – because any two majorities must overlap, so that would require at least one participant to accept two conflicting proposals.

But there’s still a bit of formality that’s worth looking at. Exactly what guarantees does paxos give? What properties does paxos-style consensus have?

Even the formal properties of paxos are easy to understand. Paxos provides two key properties: validity, and agreement.

Validity
No value ever reaches consensus without first being proposed, and having its proposal accepted.
Agreement
No two distinct values ever reach consensus at the same time.

You can easily prove those two properties. In fact, the proof is completely obvious once you recognize that the paxos protocol has two invariants (and those invariants are themselves clear from the definition of the protocol!):

  1. An acceptor can accept a proposal numbered n if and only if it has not yet made a promise
    for a proposal with a number greater than n.
  2. In any Accept!(n, v) message sent by a proposer, the value v is the consensus value of the
    highest-numbered proposal that was accepted before this proposal (or a fresh value, if no
    earlier proposal was accepted).

Getting back to the beginning: the point of all of this is to have a system in which we can be sure that things work correctly even in the presence of failures. In paxos, as long as at some point there was a quorum of machines that came to agreement, then any failure that leaves a surviving quorum of machines must have overlapped with the previous quorum – which means that the previous consensus still remains in effect, and will be propagated to the remaining participants. If you’ve got 5 machines, then two can fail, and you won’t lose consistency among the remaining ones.

The ABC conjecture – aka the soap opera of the math world.

Sorry for the silence of this blog for the last few months. This spring, my mother died, and I was very depressed about it. Depression is a difficult thing, and it left me without the energy or drive to do the difficult work of writing this kind of material. I’m trying to get back into the cycle of writing. I’m trying to make some progress in writing about type theory, but I’m starting with a couple of easier posts.

In the time when I was silent, I had a couple of people write to me to ask me to explain something called the ABC conjecture.

The ABC conjecture is a mathematical question about number theory that was proposed in the 1980s – so it’s relatively new as number theory problems go. It’s gotten a lot of attention recently, due to an almost soap-operatic series of events.

It’s a very hard problem, and no one had made any significant progress on it until about five years ago, when a well-respected Japanese mathematician named Shinichi Mochizuki published a series of papers containing a proof of the conjecture.

Normally, when a proof of a hard problem gets published, mathematicians go nuts! Everyone starts poring over it, trying to figure it out, and see if it’s valid. That’s what happened the previous time someone thought they’d proved it. But this time, no one has been able to make sense out of the proof!

The problem is that in order to build his proof, Professor Mochizuki created a whole new mathematical theory, called inter-universal Teichmüller theory. The entire ABC conjecture proof is built in this new theory, and no one other than Professor Mochizuki himself understands inter-universal Teichmüller theory. Before anyone else can actually follow the proof, they need to understand the theory. Professor Mochizuki is a bit of a recluse – he has declined to travel anywhere to teach his new mathematical system. So in the five years since he first published it, no one has been able to understand it well enough to determine whether or not the proof is correct. One error in it was found, but corrected, and the whole proof remains in question.

Exactly why the proof remains unchecked after five years is a point of contention. Lots of mathematicians are angry at Professor Mochizuki for not being willing to explain or teach his theory. A common statement among critics is that if you create a new mathematical theory, you need to be willing to actually explain it to people: work with a group of mathematicians to teach it to them, so that they’ll be able to use it to verify the proof. But Professor Mochizuki’s response has been that he has explained it: he’s published a series of papers describing the theory. He doesn’t want to travel and take time away from his work for people who haven’t been willing to take the time to read what he’s written. He’s angry that after five years, no one has bothered to actually figure out his proof.

I’m obviously not going to attempt to weigh in on whether or not Professor Mochizuki’s proof is correct. That’s so far beyond the ability of my puny little brain that I’d need to be a hundred times smarter before it would even be laughable! Nor am I going to take sides about whether or not the Professor should be travelling to teach other mathematicians his theory. But what I can do is explain a little bit about what the ABC conjecture is, and why people care so much about it.

It’s a conjecture in number theory. Number theorists tend to be obsessed with prime numbers, because the structure of the prime numbers is a huge and fundamental part of the structure and behavior of numbers as a whole. The ABC conjecture tries to describe one property of the structure of the set of prime numbers within the system of the natural numbers. Mathematicians would love to have a proof for it, because of what it would tell them about the prime numbers.

Before I can explain the problem, there’s a bit of background that we need to go through.

  1. Any natural number N greater than 1 is the product of some collection of prime numbers. Those numbers are called the prime factors of N. For example, 8 is 2×2×2 – so the set of prime factors of 8 is {2}. 28 is 2×2×7, so the prime factors of 28 are {2, 7}. 360 = 8 × 45 = 2×2×2×(9×5) = 2×2×2×3×3×5, so the prime factors of 360 are {2, 3, 5}.
  2. For any number N, the radical of N is the product of its set of prime factors. So the radical of 8 (written rad(8)) is 2; rad(14)=14; rad(28)=14; rad(36)=6; rad(360)=30; and so on.
  3. Given two positive integers N and M, N and M are coprime if they have no common prime factors. A tiny bit more formally, if pf(N) is the set of prime factors of N, then M and N are coprime if and only if pf(N) ∩ pf(M) = ∅. (Also, if M and N are coprime, then rad(M×N) = rad(M)×rad(N).) There’s a quick code sketch of these definitions below.
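
To make those three definitions concrete, here's a quick sketch of them in Python (the function names are mine, not anything standard):

  from math import gcd

  def prime_factors(n):
      # The set of prime factors of n: prime_factors(360) == {2, 3, 5}.
      factors = set()
      d = 2
      while d * d <= n:
          while n % d == 0:
              factors.add(d)
              n //= d
          d += 1
      if n > 1:
          factors.add(n)
      return factors

  def rad(n):
      # The radical of n: the product of its distinct prime factors.
      result = 1
      for p in prime_factors(n):
          result *= p
      return result

  def coprime(m, n):
      # m and n are coprime if they share no prime factors.
      return gcd(m, n) == 1

  assert rad(8) == 2 and rad(360) == 30 and coprime(9, 10)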

The simplest way of saying the ABC conjecture is that for the vast majority of triples of positive integers A, B, and C, where A + B = C and A and B are coprime, C is smaller than rad(A×B×C).

Of course, that’s hopelessly imprecise for mathematicians! What does “the vast majority” mean?

The usual method at times like these is to find some way of characterizing the relative sizes of the set where the statement is true and the set where the statement is false. For most mathematicians, the sizes of sets that are interesting are basically 0, 1, finite, countably infinite, and uncountably infinite. For the ABC conjecture, the claim is that the set of triples for which the statement is true is infinite, while the set of triples for which it fails is finite. Specifically, what they want to be able to show is that the set of triples for which C is bigger than rad(A×B×C) – after a small correction that we’ll get to in a moment – is finite.

To do that, they pull out a standard trick. Sadly, I don’t recall the proper formal term, but I’ll call it epsilon bounding. The idea is that you’ve got a statement S about a number (or region of numbers) N. You can’t prove your statement about N specifically – so you prove it about regions around N.

As usual, it’s clearest with an example. We want to say that C < rad(A×B×C) for most values of A and B. The way we can show that is by saying that for any value ε > 0, the set of triples (A, B, C) where A and B are coprime, and A + B = C, and C > rad(A×B×C)^{1+ε}, is finite.
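
That formulation also makes the conjecture easy to poke at numerically. Here's a small, purely illustrative Python sketch (reusing the rad function from the sketch above) that searches for the exceptional triples where C beats rad(A×B×C); the conjecture says that once you raise the radical to the power 1 + ε, only finitely many of these exceptions survive:

  from math import gcd

  def hits(limit):
      # Yield coprime triples a + b = c, with a <= b, where c exceeds the radical.
      for c in range(3, limit):
          for a in range(1, c // 2 + 1):
              b = c - a
              if gcd(a, b) == 1 and c > rad(a * b * c):
                  yield (a, b, c)

  print(list(hits(100)))   # includes (1, 8, 9): rad(1*8*9) = rad(72) = 6 < 9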

What this formulation does is give us a formal idea of how rare the exceptions are. It’s possible that there are some triples where C is bigger than rad(A×B×C)^{1+ε}, even for a tiny value of ε. But the number of places where that’s true is finite. Since the full system of numbers is infinite, that means that in the overwhelming majority of cases, C < rad(A×B×C). The size of the set of triples where that’s not true is so small that it might as well be 0 in comparison to the size of the set where it is true.

Ultimately, it seems almost trivial once you understand what the conjecture is. It’s nothing more than the hypothesis that if A + B = C, then most of the time, rad(A×B×C) > C. Once you’ve got that down, the question is: what’s the big deal? Professor Mochizuki developed five hundred pages of theory for this? People have spent more than five years trying to work through his proof, just to see if it’s correct, for a statement like this? Why does anybody care so much?

One answer is: mathematicians are crazy people!

The better answer is that simple statements like this end up telling us very profound things about the deep structure of numbers. The statements reduce to something remarkably simple, but the meaning underneath it is far more complex than it appears.

Just to give you one example of what this means: if the conjecture is true, then there’s a three-line proof of Fermat’s last theorem. (The current proof of Fermat’s last theorem, by Andrew Wiles, is over 150 pages of dense mathematics.) There are quite a number of things that number theorists care about that would fall out of a successful proof.

Intuitionistic Logic for Type Theory

I’ve written about intuitionistic logic before. In fact, there’s a whole section about it in my book. But now that I’m reading a lot about type theory, I’m starting to look at it differently.

When you study classical axiomatic set theory, you’re necessarily also studying classical first order predicate logic. You have to be doing that, because classical axiomatic set theory is deeply and intimately intertwined with FOPL. Similarly, the semantics of FOPL as it’s used in modern math are inextricably tangled with set theory. Sets are specified by predicates; predicates get their meaning from the sets of objects that satisfy them.

You can view type theory – or at least Martin-Löf’s intuitionistic type theory – as having nearly the same relationship to intuitionistic logic. We’ll see that in detail in later posts, but for now: intuitionistic type theory is a fundamental mathematical framework which is built on intuitionistic logic. So you can’t talk about this kind of type theory unless you understand the basics of the logic.

In this post, I’m going to try to explain what intuitionistic logic is, and how it differs from FOPL. (We’ll see all of this in more detail later.)

Intuitionistic logic is a modal predicate logic, which is built around a constructivist idea of truth. The intuitionistic idea of truth ends up being much stronger than what most of us are used to from standard FOPL: it means that nothing exists unless there is a concrete way of constructing it.

For a concrete example of what that means: in standard FOPL with the ZFC axioms, you can prove the Banach-Tarski paradox. Banach-Tarski (which I’ve written about here before) says that it’s possible to take a sphere the size of an orange, cut it into pieces, and then re-assemble those pieces into two spheres the same size as the original orange. Or, alternatively, that you can take those pieces that you sliced an orange-sized sphere into, and re-assemble them into a sphere the size of the sun.

Many people would say that this is, clearly, ridiculous. Others would point out a variety of rationalizations: that a sphere the size of an orange and a sphere the size of the sun contain the same number of points; or that the slicing process transitioned from a metric topology to a collection of non-metric topologies, or several other possible explanations.

But what no one can dispute is that there is one very important property of this proof: those slices are unconstructable. That is, they exist based on a proof using the axiom of choice, but the sets of points in those slices can’t be constructed by any process. They exist as a necessary implication of the axiom of choice, but we can’t construct them – and even given a pair of sets, one of which is one of those slices and one of which isn’t, we can’t tell which is which.

According to intuitionism, this is ridiculous. Saying that something exists, but that it is forever beyond our reach is foolishness. If we can’t construct it, if we can’t describe how to identify it, what does it mean to say that it must exist?

When you’re working in intuitionistic logic, every proof that a type of thing exists consists of either a concrete example of the thing, or a process for constructing an example of the thing. A proof of a negative is a concrete counterexample, or a process for creating one. In computer-sciency terms, the process doesn’t need to terminate: you don’t have to be able to construct something in finite time. But you need to have a process that describes how to construct it. So you can, for example, still do Cantor’s diagonalization in intuitionistic logic: if someone gives you an alleged complete 1:1 mapping between the real numbers and the integers, the proof tells you how to create a counterexample. But you can’t do the proof of Banach-Tarski, because it relies on an axiom-of-choice existence proof of something non-constructable.
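
To see what "the proof tells you how to create a counterexample" means computationally, here's a tiny Python sketch of the constructive core of diagonalization. The representation – a real number as a function from digit positions to digits – and all of the names here are mine, purely for illustration:

  def diagonal_counterexample(enumeration):
      # enumeration(i) is the i-th real, as a function from digit position to digit.
      # The result is a digit-stream that differs from every enumerated real.
      def missing_real(n):
          # Differ from the n-th real at its n-th digit.
          digit = enumeration(n)(n)
          return 5 if digit != 5 else 6
      return missing_real

  # A (bogus) alleged enumeration: real i has digit (i + n) % 10 at position n.
  enum = lambda i: (lambda n: (i + n) % 10)
  witness = diagonal_counterexample(enum)
  # The witness differs from enum(i) at digit i, for every i we check.
  assert all(witness(i) != enum(i)(i) for i in range(100))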

The way that intuitionistic logic creates that constructivist requirement is not what you might expect. When I first heard about it, I assumed that it was based on a statement of principle: a proof has to create a concrete example. But that approach has an obvious problem: how do you mathematically define it? Logic is supposed to be purely symbolic. How can you take an abstract statement about what a proof should be, and make it work in logic?

Logic is built on inference rules. You have a collection of statements, and a collection of rules about how to use those statements to produce proofs. It turns out that by making a couple of simple changes to the rules of inference, you can get exactly the constructivist requirements that we’d want. It’s based on two real changes compared to standard FOPL.

Intuitionistic logic is modal. In FOPL, any given statement is either true or false. If it’s not true, then it’s false. If it’s true, it’s always true, and always was true. There’s no other choice. In intuitionistic logic, that’s not really the case: intuitionistic logic has three states: true, false, and unknown. If you know nothing about a statement, then it’s formally unknown, and it will stay unknown until there’s a proof about it; once you find a proof, its truth value changes from unknown to either true or false. All of the inference rules of intuitionistic logic only allow inference from proven statements. You can’t reason about an unknown – you need to have a proof that moves it from unknown to either true or false first.

The semantics of this are quite simple: it’s a tiny change in the definition of truth. In FOPL, a statement is true if there exists a proof of that statement, and it’s false if there’s a proof of the negation of that statement. In intuitionistic logic, a statement is true if you have a proof of that statement, and it’s false if you can prove that there is no proof of the statement. If you haven’t proven A, then A is unknown. If A is unknown, then \lnot A is also unknown. A \lor \lnot A is, similarly, not true until you have a proof of either A or \lnot A: it means “there is a proof of A, or there is a proof of \lnot A”. But if we don’t know whether there’s a proof of either one, then it’s unknown! You could argue that this is true in FOPL as well – but in FOPL, you can rely on the fact that A \lor \lnot A, and you can use that in a proof, and explore both options. In intuitionistic logic, you can’t: you can’t do anything with A \lor \lnot A until you’ve got a proof.

It’s amazing how small the change to FOPL is to produce something that is so strongly constructivist. The easiest way to appreciate it is to just look at the rules, and how they change. To do that, I’m going to quickly walk through the inference rules of intuitionistic logic, and then show you what you’d need to change to get classical FOPL. Most of the time, when I’ve written about logics, I used sequents to write the inference rules; for ease of typesetting (and for the fun of doing something just a bit different), this time I’m going to use Hilbert calculus (the same method that Gödel used in his incompleteness proof). In HC, you define axioms and inference rules. For intuitionistic logic, we need to define three inference rules:

  1. Modus Ponens: Given \alpha and \alpha \rightarrow \beta, you can infer \beta.
  2. Universal Generation: Given \alpha \rightarrow \beta, you can infer \alpha \rightarrow (\forall x: \beta), provided x is not free in \alpha.
  3. Existential Generation: Given \alpha \rightarrow \beta, you can infer (\exists x: \alpha) \rightarrow \beta, if x is not free in \beta.

With the inference rules out of the way, there’s a collection of axioms. Each axiom is actually a schema: you can substitute any valid statement for any of the variables in the axioms.

  1. Then-1: \alpha \rightarrow (\beta \rightarrow \alpha).
  2. Then-2: (\alpha \rightarrow (\beta \rightarrow \gamma)) \rightarrow ((\alpha \rightarrow \beta) \rightarrow (\alpha \rightarrow \gamma))
  3. And-1: \alpha \land \beta \rightarrow \alpha
  4. And-2: \alpha \land \beta \rightarrow \beta
  5. And-3: \alpha \rightarrow (\beta \rightarrow (\alpha \land \beta))
  6. Or-1: \alpha \rightarrow \alpha \lor \beta
  7. Or-2: \beta \rightarrow \alpha \lor \beta
  8. Or-3: (\alpha \rightarrow \beta) \rightarrow ((\gamma \rightarrow \beta) \rightarrow (\alpha \lor \gamma) \rightarrow \beta)
  9. False: \text{False} \rightarrow \alpha. (For a bit of explanation: this rule means that we don’t need separate \lnot rules – \lnot \alpha can be treated as \alpha \rightarrow \text{False}.)
  10. Universal: (\forall x: \alpha(x)) \rightarrow \alpha(t), provided that substituting t for x in \alpha doesn’t capture any variables.
  11. Existential: \alpha(t) \rightarrow (\exists x: \alpha(x)), under the same condition on substituting t for x.
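
To see how proofs work in this calculus, here's a tiny worked example: a derivation of \alpha \rightarrow \alpha, using nothing but Then-1, Then-2, and modus ponens.

  1. \alpha \rightarrow ((\alpha \rightarrow \alpha) \rightarrow \alpha) – Then-1, taking \beta to be \alpha \rightarrow \alpha.
  2. (\alpha \rightarrow ((\alpha \rightarrow \alpha) \rightarrow \alpha)) \rightarrow ((\alpha \rightarrow (\alpha \rightarrow \alpha)) \rightarrow (\alpha \rightarrow \alpha)) – Then-2, taking \beta to be \alpha \rightarrow \alpha and \gamma to be \alpha.
  3. (\alpha \rightarrow (\alpha \rightarrow \alpha)) \rightarrow (\alpha \rightarrow \alpha) – modus ponens on 1 and 2.
  4. \alpha \rightarrow (\alpha \rightarrow \alpha) – Then-1, taking \beta to be \alpha.
  5. \alpha \rightarrow \alpha – modus ponens on 4 and 3.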

That’s intuitionistic logic. What’s the difference between that and FOPL? What kinds of powerful reasoning features did you need to give up from FOPL to get this strongly constructivist logic?

Just one simple axiom: the law of the excluded middle, \alpha \lor \lnot \alpha.

That’s it. Get rid of the excluded middle, and you’ve got the beautiful constructivist intuitionistic logic. All we had to give up is one of the most intuitively obvious rules in all of logic.