Category Archives: Programming

A Gentle Explanation of the Intel Speculative Execution CPU Bug

There’s a lot of talk today about the recent discovery of a significant bug in the Intel CPUs that so many of us use every day. It’s an interesting problem, and understanding it requires knowing a little bit about how CPUs work, so I thought I’d take a shot at writing an explainer.

Let me start with a huge caveat: this involves a lot of details about how CPUs work, and in order to explain it, I’m going to simplify things to an almost ridiculous degree in order to try to come up with an explanation that’s comprehensible to a lay person. I’m never deliberately lying about how things work, but at times, I’m simplifying enough that it will be infuriating to an expert. I’m doing my best to explain my understanding of this problem in a way that most people will be able to understand, but I’m bound to oversimplify in some places, and get details wrong in others. I apologize in advance.

It’s also early days for this problem. Intel is still trying to keep the exact details of the bug quiet, to make it harder for dishonest people to exploit it. So I’m working from the information I’ve been able to gather about the issue so far. I’ll do my best to correct this post as new information comes out, but I can only do that when I’m not at work!

That said: what we know so far is that the Intel bug involves non-kernel code being able to access cached kernel memory through the use of something called speculative execution.

To an average person, that means about as much as a problem in the flux thruster atom pulsar electrical ventury space-time implosion field generator coil.

Let’s start with a quick overview of a modern CPU.

The CPU, in simple terms, the brain of a computer. It’s the component that actually does computations. It reads a sequence of instructions from memory, and then follows those instructions to perform some computation on some values, which are also stored in memory. This is a massive simplification, but basically, you can think of the CPU as a pile of hardware than runs a fixed program:

def simplified_cpu_main_loop():
  IP = 0
  while true:
     (op, in1, in2, out) = fetch(IP)
     val1 = fetch(in1)
     val2 = fetch(in2)
     result, IP = perform(op, in1, in2)
     store(result, out)

There’s a variable called the instruction pointer (abbreviated IP) built-in to the CPU that tells it where to fetch the next instruction from. Each time the clock ticks, the CPU fetches an instruction from the memory address stored in the instruction pointer, fetches the arguments to that instruction from cells in memory, performs the operation on those arguments, and then stores the result into another cell in the computer memory. Each individual operation produces both a result, and a new value for the instruction pointer. Most of the time, you just increment the instruction pointer to look at the next instruction, but for comparisons or branches, you can change it to something else.

What I described above is how older computers really worked. But as CPUs got faster, chipmaker ran into a huge problem: the CPU can perform its operations faster and faster every year. But retrieving a value from memory hasn’t gotten faster at the same rate as executing instructions. The exact numbers don’t matter, but to give you an idea, a modern CPU can execute an instruction in less than one nanosecond, but fetching a single value from memory takes more than 100 nanoseconds. In the scheme we described above, you need to fetch the instruction from memory (one fetch), and then fetch two parameters from memory (another two fetches), execute the instruction (1 nanosecond), and then store the result back into memory (one store). Assuming a store is no slower than a fetch, that means that for one nanosecond of computation time, the CPU needs to do 3 fetches and one store for each instruction. That means that the CPU is waiting, idle, for at least 400ns, during which it could have executed another 400 instructions, if it didn’t need to wait for memory.

That’s no good, obviously. There’s no point in making a fast CPU if all it’s going to do is sit around and wait for slow memory. But designers found ways to work around that, by creating ways to do a lot of computation without needing to pause to wait things to be retrieved from/stored to memory.

One of those tricks was to add registers to the CPUs. A register is a cell of memory inside of the CPU itself. Early processors (like the 6502 that was used by the Apple II) had one main register called an accumulator. Every arithmetic instruction would work by retrieving a value from memory, and then performing some arithmetic operation on the value in the accumulator and the value retrieved from memory, and leave the result in the accumulator. (So, for example, if 0x1234 were the address variable X, you could add the value of X to the accumulator with the instruction “ADD (1234)”. More modern CPUs added many registers, so that you can keep all of the values that you need for some computation in different registers. Reading values from or writing values to registers is lightning fast – in fact, it’s effectively free. So you structure your computations so that they load up the registers with the values they need, then do the computation in registers, and then dump the results out to memory. Your CPU can run a fairly long sequence of instructions without ever pausing for a memory fetch.

Expanding on the idea of putting memory into the CPU, they added ways of reducing the cost of working with memory by creating copies of the active memory regions on the CPU. These are called caches. When you try to retrieve something from memory, if it’s in the cache, then you can access it much faster. When you access something from a memory location that isn’t currently in the cache, the CPU will copy a chunk of memory including that location into the cache.

You might ask why, if you can make the cache fast, why not just make all of memory like the cache? The answer is that the time it takes in hardware to retrieve something from memory increases with amount of memory that you can potentially access. Pointing at a cache with 1K of memory is lightning fast. Pointing at a cache with 1 megabyte of memory is much slower that the 1K cache, but much faster that a 100MB cache; pointing at a cache with 100MB is even slower, and so on.

So what we actually do in practice is have multiple tiers of memory. We have the registers (a very small set – a dozen or so memory cells, which can be accessed instantly); a level-0 cache (on the order of 8k in Intel’s chips), which is pretty fast; a level-1 cache (100s of kilobytes), an L2 cache (megabytes), L3 (tens of megabytes), and now even L4 (100s of megabytes). If something isn’t in L0 cache, then we look for it in L1; if we can’t find it in L1, we look in L2, and so on, until if we can’t find it in any cache, we actually go out to the main memory.

There’s more we can do to make things faster. In the CPU, you can’t actually execute an entire instruction all at once – it’s got multiple steps. For a (vastly simplified) example, in the pseudocode above, you can think of each instruction as four phases: (1) decode the instruction (figuring out what operation it performs, and what its parameters are), (2) fetch the parameters, (3) perform the operations internal computation, and (4) write out the result. By taking advantage of that, you can set up your CPU to actually do a lot of work in parallel. If there are three phases to executing an instruction, then you can execute phase one of instruction one in one cycle; phase one of instruction two and phase two of instruction one in the next cycle; phase one of instruction three, phase two of instruction two, and phase three of instruction one in the third cycle. This process is called pipelining.

To really take advantage of pipelining, you need to keep the pipeline full. If your CPU has a four-stage pipeline, then ideally, you always know what the next four instructions you’re going to execute are. If you’ve got the machine code version of an if-then-else branch, when you start the comparison, you don’t know what’s going to come next until you finish it, because there are two possibilities. That means that when you get to phase 2 of your branch instruction, you can’t start phase one of the next instruction. instruction until the current one is finished – which means that you’ve lost the advantage of your pipeline.

That leads to another neat trick that people play in hardware, called branch prediction. You can make a guess about which way a branch is going to go. An easy way to understand this is to think of some numerical code:

def run_branch_prediction_demo():
  for i in 1 to 1000:
     for j in 1 to 1000:
          q = a[i][j] * sqrt(b[i][j])

After each iteration of the inner loop, you check to see if j == 1000. If it isn’t, you branch back to the beginning of that loop. 999 times, you branch back to the beginning of the loop, and one time, you won’t. So you can predict that you take the backward branch, and you can start executing the early phases of the first instructions of the next iteration. That may, most of the time you’re running the loop, your pipeline is full, and you’re executing your computation quickly!

The catch is that you can’t execute anything that stores a result. You need to be able to say “Oops, everything that I started after that branch was wrong, so throw it away!”. Alongside with branch prediction, most CPUs also provide speculative execution, which is a way of continuing to execute instructions in the pipeline, but being able to discard the results if they’re the result of an incorrect branch prediction.

Ok, we’re close. We’ve got to talk about just another couple of basic ideas before we can get to just what the problem is with these Intel chips.

We’re going to jump up the stack a bit, and instead of talking directly about the CPU hardware, we’re going to talk about the operating system, and how it’s implemented on the CPU.

An operating system is just a program that runs on your computer. The operating system can load and run other programs (your end-user applications), and it manages all of the resources that those other programs can work with. When you use an application that allocates memory, it sent a request called a syscall to the operating system asking it to give it some memory. If your application wants to read data from a disk drive, it makes a syscall to open a file and read data. The operating system is responsible for really controlling all of those resources, and making sure that each program that’s running only accesses the things that it should be allowed to access. Program A can only use memory allocated by program A; if it tries to access memory allocated by program B, it should cause an error.

The operating system is, therefore, a special program. It’s allowed to touch any piece of memory, any resource owned by anything on the computer. How does that work?

There are two pieces. First, there’s something called memory protection. The hardware provides a mechanism that the CPU can use to say something like “This piece of memory is owned by program A”. When the CPU is running program A, the memory protection system will arrange the way that memory looks to the program so that it can access that piece of memory; anything else just doesn’t exist to A. That’s called memory mapping: the system memory of the computer is mapped for A so that it can see certain pieces of memory, and not see others. In addition to memory mapping, the memory protection system can mark certain pieces of memory as only being accessible by privileged processes.

Privileged processes get us to the next point. In the CPU, there’s something called an execution mode: programs can run in a privileged mode (sometimes called kernel space execution), or it can run in a non-privileged mode (sometimes called user-space execution). Only code that’s running in kernel-space can do things like send commands to the memory manager, or change memory protection settings.

When your program makes a syscall, what really happens is that your program puts the syscall parameters into a special place, and then sends a signal called an interrupt. The interrupt switches the CPU into system space, and gives control to the operating system, which reads the interrupt parameters, and does whatever it needs to. Then it puts the result where the user space program expects it, switches back to user-space, and then allows the user space program to continue.

That process of switching from the user space program to the kernel space, doing something, and then switching back is called a context switch. Context switches are very expensive. Implemented naively, you need to redo the memory mapping every time you switch. So the interrupt consists of “stop what you’re doing, switch to privileged mode, switch to the kernel memory map, run the syscall, switch to the user program memory map, switch to user mode”.

Ok. So. We’re finally at the point where we can actually talk about the Intel bug.

Intel chips contain a trick to make syscalls less expensive. Instead of having to switch memory maps on a syscall, they allow the kernel memory to be mapped into the memory map of every process running in the system. But since kernel memory can contain all sorts of secret stuff (passwords, data belonging to other processes, among other things), you can’t let user space programs look at it – so the kernel memory is mapped, but it’s marked as privileged. With things set up that way, a syscall can drop the two “switch memory map” steps in the syscall scenario. Now all a syscall needs to do is switch to kernel mode, run the syscall, and switch back to user mode. It’s dramatically faster!

Here’s the problem, as best I understand from the information that’s currently available:

Code that’s running under speculative execution doesn’t do the check whether or not memory accesses from cache are accessing privileged memory. It starts running the instructions without the privilege check, and when it’s time to commit to whether or not the speculative execution should be continued, the check will occur. But during that window, you’ve got the opportunity to run a batch of instructions against the cache without privilege checks. So you can write code with the right sequence of branch instructions to get branch prediction to work the way you want it to; and then you can use that to read memory that you shouldn’t be able to read.

With that as a starting point, you can build up interesting exploits that can ultimately allow you to do almost anything you want. It’s not a trivial exploit, but with a bit of work, you can use a user space program to make a sequence of syscalls to get information you want into memory, and then write that information wherever you want to – and that means that you can acquire root-level access, and do anything you want.

The only fix for this is to stop doing that trick where you map the kernel memory into every user space memory map, because there’s no way to enforce the privileged memory property in speculative execution. In other words, drop the whole syscall performance optimization. That’ll avoid the security issue, but it’s a pretty expensive fix: requiring a full context switch for every syscall will slow down the execution of user space programs by something between 5 and 30 percent.

Garbage Collection with Semispaces

The roots of most garbage collection ideas come from the Lisp community. Lisp was really the first major garbage collection language that was used to write complicated things. So it’s natural that the first big innovation in the world of GC that we’re going to look at comes from the Lisp community.

In early Lisp systems with garbage collection, the pause that occured when the GC did a mark/sweep to reclaim memory was very long, so it was important to find ways to make the cycle faster. Lisp code had the properly that it tended to allocate a lot of small, short-lived objects: Lisp, particularly early lisp, tended to represent everything using tiny structures called cons cells, and Lisp programs generate bazillions of them. Lots of short-lived cons cells needed to get released in every GC cycle, and the bulk of the GC pause was caused by the amount of time that the GC spend going through all of the dead cons cells, and releasing them.

Beyond just that speed issue, there’s another problem with naive mark-sweep collection when you’re dealing with large numbers of short lived objects, called heap fragmentation. The GC does a pass marking all of the memory in use, and then goes through each unused block of memory, and releases it. What can happen is that you can end up with lots of memory free, but scattered around in lots of small chunks. For an extreme example, imagine that you’re building two lists made up of 8-byte cells. You allocate a cell for list A, and then you do something using A, and generate a new value which you add as a new cell in list B. So you’re alternating: allocate a cell for A, then a cell for B. When you get done, you discard A, and just keep B. After the GC runs, what does your memory look like? If A and B each have 10,000 cells, then what you have is 8 bytes of free memory that used to be part of A, and then 8 bytes of used memory for a cell of B, then 8 bytes of free, then 8 used, etc. You’ve ended up with 80,000 bytes of free memory, none of which can be used to store anything larger than 8 bytes. Eventually, you can wind up with your entire available heap broken into small enough pieces that you can’t actually use it for anything.

What the lisp folks came up with is a way of getting rid of fragmentation, and dramatically reducing the cost of the sweep phase, by using something called semispaces. With semispaces, you do some cleverness that can be summed up as moving from mark-sweep to copy-swap.

The idea is that instead of keeping all of your available heap in one chunk, you divide it into two equal regions, called semispaces. You call one of the semispaces the primary, and the other the secondary. When you allocate memory, you only allocate from the primary space. When the primary space gets to be almost full, you start a collection cycle.

When you’re doing your mark phase, instead of just marking each live value, you copy it to the secondary space. When all of the live values have been copied to the secondary space, you update all of the pointers within the live values to their new addresses in the secondary space.

Then, instead of releasing each of the unused values, you just swap the primary and secondary space. Everything in the old primary space gets released, all at once. The copy phase also compacts everything as it moves it into the secondary space, consolidating all of the free memory in one contiguous chunk. If you implement it well, you can also have beneficial side effect of moving things close in ways that improve the cache performance of your program.

For Lisp programs, semispaces are a huge win: they reduce the cost of the sweep phase to a constant time, at the expense of a nearly linear increase in mark time, which works out really well. And it eliminates the problem of fragmentation. All in all, it’s a great tradeoff!

Of course, it’s got some major downsides as well, which can make it work very poorly in some cases:

  1. The copy phase is significantly more expensive than a traditional mark-phase. The time it takes to copy is linear in the total amount of live data, versus linear in the number of live objects in a conventional mark. Whether semispaces will work well for a given application depend on the properties of the data that you’re working with. If you’ve got lots of large objects, then the increase in time caused by the copy instead of mark can significantly outweigh the savings of the almost free swap, making your GC pauses much longer; but if you’ve got lots of short-lived, small objects, then the increase in time for the copy can be much smaller than time savings from the swap, resulting in dramatically shortened GC pauses.
  2. Your application needs to have access to twice as much memory as you actually expect it to use, because you need two full semispaces. There’s really no good way around this: you really need to have a chunk of unused memory large enough to store all of your live objects – and it’s always possible that nearly everything is alive for a while, meaning that you really do need two equally sized semispaces.
  3. You don’t individually release values, which means that you can’t have any code that runs when an value gets collected. In a conventional mark-sweep, you can have objects provide functions called finalizers to help them clean up when they’re released – so objects like files can close themselves. In a semispace, you can’t do that.

The basic idea of semispaces is beautiful, and it’s adaptable to some other environments where a pure semispace doesn’t make sense, but some form of copying and bulk release can work out well.

For example, years ago, at a previous job, one of my coworkers was working on a custom Java runtime system for a large highly scalable transaction processing system. The idea of this was that you get a request from a client system to perform some task. You perform some computation using the data from the client request, update some data structures on the server, and then return a result to the client. Then you go on to the next request.

The requests are mostly standalone: they do a bunch of computation using the data that they recieved in the request. Once they’re done with a given request, almost nothing that they used processing it will ever be looked at again.

So what they did was integrate a copying GC into the transaction system. Each time they started a new transaction, they’d give it a new memory space to live it. When the transaction finished, they’d do a quick copy cycle to copy out anything that was referenced in the master server data outside the transaction, and then they’d just take that chunk of memory, and make it available for use by the next transaction.

The result? Garbage collection became close to free. The number of pointers into the transaction space from the master server data was usually zero, one, or two. That meant that the copy phase was super-short. The release phase was constant time, just dropping the pointer to the transaction space back into the available queue.

So they were able to go from an older system which had issues with GC pauses to a new system with no pauses at all. It wouldn’t work outside of that specific environment, but for that kind of application, it screamed.

A Beginner’s Guide to Garbage Collection

I was just reading an interesting paper about garbage collection (GC), and realized that I’d never written about it here, so I thought maybe I’d write a couple of articles about it. I’m going to start by talking about the two most basic techniques: mark and sweep collection, and reference counting. In future posts, I’ll move on to talk about various neat things in the world of GC.

So let’s start at the beginning. What is garbage collection?

When you’re writing a program, you need to store values in memory. Most of the time, if you want to do something interesting, you need to be able to work with lots of different values. You read data from your user, and you need to be able to create a place to store it. So (simplifying a bit) you ask your operating system to give you some memory to work with. That’s called memory allocation.

The thing about memory allocation is that the amount of memory that a computer has is finite. If you keep on grabbing more of your computer’s memory, you’re eventually going to run out. So you need to be able to both grab new memory when you need it, and then give it back when you’re done.

In many languages (for example, C or C++), that’s all done manually. You need to write code that says when you want to grab a chunk of memory, and you also need to say when you’re done with it. Your program needs to keep track of how long it needs to use a chunk of memory, and give it back when it’s done. Doing it that way is called manual memory management.

For some programs, manual memory management is the right way to go. On the other hand, it can be very difficult to get right. When you’re building a complicated system with a lot of interacting pieces, keeping track of who’s using a given piece of memory can become extremely complicated. It’s hard to get right – and when you don’t get it right, your program allocates memory and never gives it back – which means that over time, it will be using more and more memory, until there’s none left. That’s called a memory leak. It’s very hard to write a complicated system using manual memory management without memory leaks.

You might reasonably ask, what makes it so hard? You’re taking resources from the system, and using them. Why can’t you just give them back when you’re done with them?

It’s easiest to explain using an example. I’m going to walk through a real-life example from one of my past jobs.

I was working on a piece of software that managed the configuration of services for a cluster management platform. In the system, there were many subsystems that needed to be configured, but we wanted to have one configuration. So we had a piece of configuration that was used to figure out what resources were needed to run a service. There was another piece that was used to figure out where log messages would get stored. There was another piece that specified what was an error that was serious enough to page an engineer. There was another piece that told the system how to figure out which engineer to page. And so on.

We’d process the configuration, and then send pieces of it to the different subsystems that needed them. Often, one subsystem would then need to grab information from the piece of configuration that was the primary responsibility of a different subsystem. For example, when there’s an major error, and you need to page an engineer, we wanted to include a link to the appropriate log in the page. So the pager needed to be able to get access to the logging configuration.

The set of components that worked as part of this configuration system wasn’t fixed. People could add new components as new things got added to the system. Each component would register which section of the configuration it was interested in – but then, when it received its configuration fragment, it could also ask for other pieces of the configuration that it needed.

So, here’s the problem. Any given piece of the configuration could be used by 1, or 2, or 3, or 4, or 20 different components. Which piece of the system should be responsible for knowing when all of the other components are done using it? How can it keep track of that?

That’s the basic problem with manual memory management. It’s easy in easy cases, but in complex systems with overlapping realms of responsibility, where multiple systems are sharing data structures in memory, it’s difficult to build a system where there’s one responsible agent that knows when everyone is done with a piece of memory.

It’s not impossible, but it’s difficult. In a system like the one above, the way that we made it work was by doing a lot of copying of data. Instead of having one copy of a chunk of evaluated configuration which was shared among multiple readers, we’d have many copies of the same thing – one for each component. That worked, but it wasn’t free. We ended up needing to use well over ten times as much memory as we could have using shared data structures. When you’ve got a system that could work with a gigabyte of data, multiplying it by ten is a pretty big deal! Even if you’ve got a massive amount of memory available, making copies of gigabytes of data takes a lot of time!

The most important point here is to understand just how hard it is to get this stuff right. I’ve been a software engineer for a long time, and I’ve worked on a lot of systems. Until the advent of the Rust programming language, I’d never seen a single long-running system built with manual memory management that didn’t have a memory leak somewhere. (I’ll talk more about Rust and how it manages to accomplish this in a later post.)

So manual memory management is very hard to get right, and it can potentially be pretty expensive. On the other hand, it’s predictable: you know, when you write your code, what the costs of managing memory will be, because you wrote the code that does it. And, if you get it right, you can control exacly how much memory your program is using at any time.

The alternative to manual memory management is to somehow make the program automatically keep track of memory, and get rid of it when it’s no longer used. But how do you know when something is no longer used?

It’s pretty easy. You call a chunk of memory live if it can be reached by any variable in the program. If it can’t, it’s garbage, and you can get rid of it. Garbage collection is any mechanism in a programming language or execution environment that automatically figures out when something is garbage, and releases it.

There are two basic methods that we can use to figure out which chunks of memory contain live values, and which are garbage. They’re called reference counting and mark-sweep. (There’s a pool of people who are going to get angry at this definition because, they argue, reference counting isn’t garbage collection. They insist that reference counting is something fundmentally different, and that only mark-sweep is really garbage collection. Obviously I disagree. The definition that I’m using is that anything which automatically releases unused memory is garbage collection.)

In reference counting, each block of memory that gets allocated includes a counter called its reference count. Every time you create a new way of referring to something – by assigning it to a variable, or passing it as a parameter, assigning it to a field of another data structure – you add one to the reference count of the block of memory that contains it. Every time you remove a way of referencing something – by changing a variable, or returning from a function call, or garbage collecting a data structure that referenced it, you decrement its reference count by one. When the reference count for a block of memory hits zero, you can release it.

It’s simple, and it’s predictable. You know that the moment you stop using something, it’s going to get released. That’s great! But there are some problems with reference counting. First, you need to make sure that every single time you change anything, you correctly update the reference counts. Miss any updates, and either things will get released before you’re done with them, or things won’t get released and you’ll leak memory. The other, potentially bigger problem, is that there are a bunch of data structures where simple reference counting doesn’t work. For example, think of a doubly-linked list. That’s a list of values, stored so that each value in the list contains pointers to both the element ahead of it in the list, and to the element behind it in the list. Every element in that list always has at least one thing pointing to it. So none of their reference counts will ever be zero, and no element of the list will ever get collected! (There are ways around that, which we’ll talk about in a later post.)

The other main garbage collection technique is called mark-sweep. In mark-sweep, you have a two-phase process: in the mark phase, you walk over all of the data structures figuring out what’s still being used, and in the sweep phase, you free up anything that isn’t getting used.

In the marking phase, you start with a set of pointers called the root set. The root set consists of the things that you know are being used: the values of all of the variables in the parts of your program that are running, and anything that’s being referenced by the execution environment.

You create a marking queue, consisting initially of the root-set. Then you start to process the queue. For each element in the queue, if it hasn’t been marked yet, you mark it as alive, and then you add everything that it contains a reference to to the queue. If it was already marked as live, you just skip over it: it’s done.

Once the mark phase is done, everything that can possibly be referenced by your running program has been marked as live. So now you can sweep: go through the memory space of your program, and release anything that wasn’t marked as live. Boom: you’ve just gotten rid of everything that’s no longer needed.

Naive mark-sweep has one really big problem: your program can’t change anything during the mark phase! That means that any time you want to release some unusued memory, you need to stop the execution of your program while the garbage collection is going through memory, figuring out what’s still alive.

Personally, I really love working in garbage collected languages. In modern GC systems, the pauses are relatively non-intrusive, and the execution time cost of them is often significantly less than the additional copy-costs of manual collection. But it’s far from a panacaea: it doesn’t even completely prevent memory leaks! (One of the things that surprised me quite a bit earlier in my career was discovering a huge memory leak in a Java program.)

Anyway, that’s the intro to the general ideas. In subsequent posts, I’ll talk about a lot of different things in the area of memory management and garbage collection. I’m mostly going to focus on mark-sweep: reference counting is a very simple idea, and most of the applications of it focus on maintaining that simplicity. But in the world of mark-sweep, there’s a ton of interesting stuff: semispaces (which make the sweep phase of GC faster and more effective), generational garbage collection (which makes the GC system faster, and reduces the number of pauses), incremental collection (which allows the mark phase to be done without stopping the whole program), and various techniques used to implement all of this, like read-barriers, write barriers, and colored pointers.

Time in Distributed Systems: Lamport Timestamps

What I do in my day job is working on infrastructure systems at dropbox. That means that I’m neck-deep in distributed systems programming – that’s my bread and butter. One of the problems that comes up over and over again in distributed systems is time.

In distributed systems, time is a problem. Each computer has a clock built in, but those clocks are independent. The clocks on different machines can vary quite a bit. If a human being is setting them, then they’re probably at best accurate to one second. Even using a protocol like NTP, which synchronizes clocks between different computers, you can only get the clocks accurate to within about a millisecond of each other.

That sounds pretty good. In human timescales, a millisecond is a nearly imperceptible time interval. But to a modern computer, which can execute billions of instructions per second, it’s a long time: long enough to execute a million instructions! To get a sense of how much time that is to a computer, I just measured the time it took to take all of the source code for my main project, compile it from scratch, and execute all of its tests: it took 26 milliseconds.

That’s a lot of work. On the scale of a machine running billions of instructions per second, a millisecond is a long time.

Why does that matter?

For a lot of things that we want to do with a collection of computers, we need to know what event happened first. This comes up in lots of different contexts. The simplest one to explain is a shared resource locked by a mutex.

A mutex is a mutual exclusion lock. It’s basically a control that only allows one process to access some shared resource at a time. For example, you could think of a database that a bunch of processes all talk to. To make an update, a process P needs to send a request asking for access. If no one is using it when the server receives the request, it will give a lock to P, and and then block anyone else from accessing it until P is done. Anyone else who asks for access to the the database will have to wait. When P is done, it releases the lock on the mutex, and then if there’s any processes waiting, the database will choose one, and give it the lock.

Here’s where time comes into things. How do you decide who to give the lock to? You could give it to whoever you received the request from first, using the time on the database host. But that doesn’t always work well. It could easily end up with hosts with a lower-bandwidth connection to the server getting far worse service than a a closer host.

You get better fairness by using “send time” – that is, the time that the request was sent to the server by the client. But that’s where the clock issue comes up. Different machines don’t agree perfectly on the current time. If you use their clocktime to determine gets the lock first, then a machine with a slow clock will always get access before one with a fast clock. What you need is some fair way of determining ordering by some kind of timestamp that’s fair.

There are a couple of algorithms for creating some notion of a clock or timestamp that’s fair and consistent. The simplest one, which we’ll look at in this post, is called Lamport timestamps. It’s impressively simple, but it works really well. I’ve seen it used in real-world implementations of Paxos at places like Google. So it’s simple, but it’s serious.

The idea of Lamport timestamps is to come up with a mechanism that defines a partial order over events in a distributed system. What it defines is a causal ordering: that is, for any two events, A and B, if there’s any way that A could have influenced B, then the timestamp of A will be less than the timestamp of B. It’s also possible to have two events where we can’t say which came first; when that happens, it means that they couldn’t possible have affected each other. If A and B can’t have any affect on each other, then it doesn’t matter which one “comes first”.

The way that you make this work is remarkably simple and elegant. It’s based on the simplest model of distributed system, where a distributed system is a collection of processes. The processes only communicate by explicitly sending messages to each other.

  1. Every individual process p in the distributed system maintains an integer timestamp counter, \tau_p.
  2. Every time a process p performs an action, it increments \tau_p. Actions that trigger increments of \tau_p include message sends.
  3. Every time a process p sends a message to another process, it includes the current value of \tau_p in the message.
  4. When a process p receives a message from a process q, that message includes the value of \tau_q when the message was sent. So it updates its \tau_q to the \max(\tau_p, \tau_q)+1 (one more than the maximum of its current timestamp and the incoming message timestamp).

For any two events A and B in the system, if A \rightarrow B (that is, if A causally occurred before B – meaning that A could have done something that affected B), then we know that the timestamp of A will be smaller than the timestamp of B.

The order of that statement is important. It’s possible for timestamp(A) to be smaller than timestamp(B), but for B to have occurred before A by some wallclock. Lamport timestamps provide a causal ordering: A cannot have influenced or caused B unless A \rightarrow B; but A and B can be independent.

Let’s run through an example of how that happens. I’ll write it out by describing the clock-time sequence of events, and following it by a list of the timestamp counter settings for each host. We start with all timestamps at 0: [A(0), B(0), C(0), D(0).

  • [Event 1] A sends to C; sending trigger a timestamp increment. [A(1), B(0), C(0), D(0)].
  • [Event 2] C receives a message from A, and sets its counter to 2. [A(1), B(0), C(2), D(0).
  • [Event 3] C sends a message to A (C increments to 3, and sends.) [A(1), B(0), C(3), D(0).
  • [Event 4] A recieves the message from C, and sets its clock to 4. [A(4), B(0), C(3), D(0)]
  • [Event 5] B sends a message to D. [A(4), B(1), C(3), D(0)]
  • [Event 6] D receives the message. [A(4), B(1), C(3), D(2)].
  • [Event 7] D sends a message to C. [A(4), B(1), C(3), D(3)].
  • [Event 8] C receives the message, and sets its clock to 4.

According to the Lamport timestamps, in event 5, B sent its message to D at time 1. But by wallclock time, it sent its message after C’s timestamp was already 3, and A’s timestamp was already 4. We know that in our scenario, event 5 happened before event 3 by wallclock time. But in a causal ordering, it didn’t. In causal order, event 8 happened after event 4, and event 7 happened before event 8. In causal comparison, we can’t say whether 7 happened before or after 3 – but it doesn’t matter, because which order they happened in can’t affect anything.

The Lamport timestamp is a partial ordering. It tells us something about the order that things happened in, but far from everything. In effect, if the timestamp of event A is less than the timestamp of event B, it means that either A happened before B or that there’s no causal relation between A and B.

The Lamport timestamp comparisons only become meaningful when there’s an actual causal link between events. In our example, at the time that event 5 occurs, there’s no causal connection at all between the events on host A, and the events on host B. You can choose any arbitrary ordering between causally unrelated events, and as long as you use it consistently, everything will work correctly. But when event 6 happens, now there’s a causal connection. Event 5 could have changed some state on host D, and that could have changed the message that D sent in event 7. Now there’s a causal relationship, timestamp comparisons between messages after 7 has to reflect that. Lamport timestamps are the simplest possible mechanism that captures that essential fact.

When we talk about network time algorithms, we say that what Lamport timestamps do is provide weak clock consistency: If A causally happened before B, then the timestamp of A will be less than the timestamp of B.

For the mutex problem, we’d really prefer to have strong clock consistency, which says that the timestamp of A is smaller than the timestamp of B if and only if A causally occurred before B. But Lamport timestamps don’t give us enough information to do that. (Which is why there’s a more complex mechanism called vector clocks, which I’ll talk about in another post.

Getting back to the issues that this kind of timestamp is meant to solve, we’ve got a partial order of events. But that isn’t quite enough. Sometimes we really need to have a total order – we need to have a single, correct ordering of events by time, with no ties. That total order doesn’t need to be real – by which I mean that it doesn’t need to be the actual ordering in which events occured according to a wallclock. But it needs to be consistent, and no matter which host you ask, they need to always agree on which order things happened in. Pure lamport timestamps don’t do that: they’ll frequently have causally unrelated events with identical timestamps.

The solution to that is to be arbitrary but consistent. Take some extra piece of information that uniquely identifies each host in the distributed system, and use comparisons of those IDs to break ties.

For example, in real systems, every host has a network interface controller (NIC) which has a universally unique identifier called a MAC address. The MAC address is a 48 bit number. No two NICs in the history of the universe will ever have the same MAC address. (There are 281 trillion possible MAC codes, so we really don’t worry about running out.) You could also use hostnames, IP addresses, or just random arbitrarily assigned identifiers. It doesn’t really matter – as long as it’s consistent.

This doesn’t solve all of the problems of clocks in distributed systems. For example, it doesn’t guarantee fairness in Mutex assignment – which is the problem that I used as an example at the beginning of this post. But it’s a necessary first step: algorithms that do guarantee fairness rely on some kind of consistent event ordering.

It’s also just a beautiful example of what good distributed solutions look like. It’s simple: easy to understand, easy to implement correctly. It’s the simplest solution to the problem that works: there is, provably, no simpler mechanism that provides weak clock consistency.

Technical Interviews: More than a brain teaser?

A request I’ve gotten from a couple of readers (unrelated to the recent charity thing) is to talk about engineering interviews at tech companies. I’ve been at Google, Foursquare, Twitter, and now Dropbox, so I’ve spent some time inside multiple companies that put a lot of effort into recruiting.

Tech interviews get a lot of bad press. Only some of it is deserved.

What you frequently hear in criticism is stuff about “gotcha” questions, or “brain teasers”. Those do happen, and when they do, they deserve condemnation. For example, I have seriously had an interviewer for a software job ask me “Why are manholes round?” That’s stupid. People who like it claim that it’s a test of lateral thinking; I think it’s garbage. But these days, that kind of rubbish seems pretty rare.

Instead, what’s become very common is for interviewers to present a programming problem, and ask the candidate to solve it by writing some code. A lot of people really hate these kinds of interviews, and describe them as just brain-teasers. I disagree, and I’m going to try to explain why.

The underlying problem for tech job interviews is that hiring the right people is really, really hard.

When someone applies for a job as a software engineer with your company, you start off with just their resume. Resumes are not particularly informative. All they give you is a brief, possibly partial history of the candidates work experience. From a resume, can’t tell how much they really contributed to the projects they worked on. You can’t tell how much their work added (or subtracted) from the team they were part of. You can’t tell if they get work done in a reasonable amount of time, or if they’re slower than a snail. You can’t even tell if they can write a simple program at all.

So you start by screening resumes, doing your best to infer as much as you can from them. Next, you often get recommendations. Recommendations can be useful, but let’s be honest: All recommendation letters are positive. You’re not going to ask for a recommendation from someone who isn’t going to say great things about you. So no matter how terrible a candidate is, the recommendations are going to say nice things about them. At best, you can sometimes infer a problem by reading between the lines – but that’s a very subjective process.

So you end up interviewing someone who’s resume looks good on paper, and who got a couple of people to write letters for them. How do you determine whether or not they’re going to be a valuable addition to your team?

You need to do something to decide whether or not to hire a particular person. What can you do?

That’s what the interview is for. It’s a way to try to get more information. Sure, this person has a college degree. Sure, they’ve got N years of experience. But can they program? Can they communicate well with their coworkers? Do they actually know what they’re doing?

A tech interview is generally an attempt to get information about a candidate by watching them work on a problem. The interview isn’t about knowing the right answer. It’s not even about getting the correct solution to the problem. It’s about watching a candidate work.

When I ask a job candidate a technical question, there’s three main things I’m looking for.

  1. What’s their process for solving the problem? On this level, I’m trying to figure out: Do they think about it, or do they jump in and start programming? Do they make sure they understand the problem? Do they clearly state their assumptions?
  2. Can they write a simple program? Here I’m trying to see if they’ve got any ability to write
    code. No one writes great code in an interview setting. But I want to know if they’re
    able to sit down with an unfamiliar problem, and work out a solution in code. I want to see if they start coding immediately, or take time to think through their solution before they start writing.
  3. How well can they communicate ideas about programming? Can they grasp the problem from my description? If not, can they figure out what questions they need to ask to understand it? Once they start solving the problem, how well can they explain what they’re doing? Can they describe the algorithm that they’ve chosen? Can they explain why it works?

To try to clarify this, I’m going to walk through a problem that I used to use in interviews. I haven’t used this question in about 3 years, and as far as I know, no one is using the question anymore. The problem involves something called Gray code. Gray code is an alternative representation of numbers in binary form that’s useful for a range of applications involving things like switching systems.

Here’s a quick run through one of the reasons to use gray code. Imagine a system that uses physical switches. You’ve got an array of 8 switches representing a number. It’s currently presenting the number 7 in standard binary – so the first 5 switches are off, and last 3 are on. You want to increment the number. To do that, you need to change the position of four switches at exactly the same time. The odds of your being able to do that without even a transient state that appeared to be a number other than 7 or 8 are vanishingly small.

Gray code solves that by changing the representation. In Gray code, the representation of every number N+1 is only different from the representation of N by exacly one bit. That’s a nice property which makes it useful, even nowadays when we’re not using physical switches for much of anything anymore.

The easiest way that you get the gray code of numbers is by writing a table. You start off by writing 0 and 1, which are the same in both gray code and standard binary:

Decimal Standard Binary Gray
0 0 0
1 1 1

There’s the one-bit gray codes. To get the two bit, make two copies of the rows in that table.
To the first copy, prepend a 0. To the second copy, reverse the order of the rows, prepend a 1:

Decimal Standard Binary Gray
0 00 00
1 01 01
2 10 11
3 11 10

To get to the three bit gray codes, you repeat the process. Copy the rows, prepend 0s to
the first copy; reverse the order of the second, and prepend 1s.

Decimal Standard Binary Gray
0 000 000
1 001 001
2 010 011
3 011 010
4 100 110
5 101 111
6 110 101
7 111 100

So, the gray code of 6 is 101, and the gray code of 7 is 100.

What I would ask an interview candidate to do is: implement a recursive function that given an integer N, returns a string with the gray code of N.

I can understand how some people look at this question, and say, “Yeah, that’s just a stupid puzzle.” On one level, yeah. It’s obvious an artifical question. In fact, in practice, no one ever uses a recursive algorithm for something like this. Even if you have a problem where gray code is part of a practical solution, there’s a better way of converting numbers to gray code than this silly recursive nonsense.

So I agree that it’s artificial. But interview questions have to be artificial. In a typical interview, you’ve got an hour with a candidate. You’re not going to be able to explain a real problem to them in that amount of time, much less have them solve it!

But it’s artificial in a useful way that allowed me, as an interviewer, to learn about the candidate. I wasn’t trying to see if the candidate was number-puzzle wizard who could instantly see the recursion pattern in a problem like this. Most people have never heard of gray code, and to most people (including me, the first time I saw this problem!), the recursion pattern isn’t obvious. But that’s not the point: there’s a lot more to the interview that just the initial problem statement.

I don’t present the problem, and then sit back and watch silently as they try to solve it. If I did, all I’d be learning is whether or not they’re a number-puzzle wizard. I don’t care about that. So I didn’t just leave them floundering trying to somehow come up with a solution. In the beginning, after describing the problem, I set an initial direction. I usually have them start by extending the table themselves, to make sure they understand the process. Then I take their extended table, and add a new column:

Decimal Standard Binary Gray Rec
0 0000 0000
1 0001 0001
2 0010 0011 “1” + gray(1)
3 0011 0010 “1” + gray(0)
4 0100 0110
5 0101 0111
6 0110 0101
7 0111 0100
8 1000 1100 “1” + gray(7)
9 1001 1101 “1” + gray(6)
10 1010 1111 “1” + gray(5)
11 1011 1110
12 1100 1010
13 1101 1011
14 1110 1001
15 1111 1000 “1” + gray(0)

With that on the board/screen, I’d ask them to try to take what I just gave them, and rewrite it a bit. For example, in row 8, instead of “1” + gray(7), come up with an expression using the numeric value “8” of the row, which will produce 7. They should be able to come up with “15 – 8” – and to see that in every row n , where n \ge 8 and n < 16, the gray code of n is “1” + gray(15 – n).

For most people, that’s enough of a clue to be able to start writing the function. If they can’t get there, it shows me that they’ve got some trouble wrapping their head around this abstraction. I’ve got a few more hints up my sleeve to help, but if without all of the help I can give, they still can’t come up with that, that’s one mark against them.

But even if they can’t come up with the relation at all, it’s not the end of the interview. I have, sometimes, ended up recommending hiring someone who had trouble this far! If they can’t come up with that basic relation, it’s just one piece of information about the candidate. I’d file that away, and move on, by giving them the recurrence relation, and then I would ask them to code it.

There’s one problem that comes up in coding, which is interesting and important. The most naive code for gray is something like:

def gray(n):
  print("gray(%s)" % n)
  if n == 0:
    return "0"
  if n == 1:
    return "1"
  num_digits = math.floor(math.log(n, 2)) + 1
  return "1" + gray(int(2**num_digits - 1 - n))

That’s very close, but not right. If you call gray(10), you get the right answer.
If you call gray(4), you get “110”, which is correct. But if you call gray(8), you’d get “110”, when you should have gotten 1010.

Most candidates make this mistake. And then I ask them to trace it through on 8 as an example. They usually see what the problem is. If they don’t, then that’s another red flag.

If they’re really struggling to put together a recursive function, then I’d ask them to just write a function to convert an integer into standard binary. If they can do that, then I start making suggestions about how to convert that to do gray code.

The big indicator here is whether or not they can write a simple recursive function at all. The systems I was working on at the time made heavy use of recursion – if a candidate couldn’t write recursive code, there was simply no way that they’d be able to do the job. (It was depressing to see just how many qualified-looking candidates came in, but who couldn’t write a simple recursive function.)

Through this whole process, how well they were able to talk about what they were doing was as important as the solution they came up with. If they heard the question, and immediately wrote down perfect, beautiful code, but they couldn’t explain how they got it, or how it worked? They’d get a mediocre rating, which wouldn’t get them a job offer. If they made a lot of mistakes in their code, but they were crystal clear about explaining how they worked out a solution, and how it worked? They’d probably get a better rating than the perfect code candidate.

I hope this shines a bit of light on this kind of interview question. While it’s necessarily artificial, it’s artifical in a way that we hope can help us learn more about the candidate. It’s not a trick question that’s irrelevant to the job, like “Why are manholes round?”: this is an attempt to probe at an area of knowledge that any candidate for a software engineering job should have. It’s not an all or nothing problem: even if you start off with no clue of how to approach it, I’m guiding you through. If you can’t solve it without help, this problem gives me some insight into what it is that you don’t understand, and hopefully whether or not that’s going to be a problem if we hire you.

Is it a great way of doing interviews? Honestly, no. But it’s the best way we know of doing it.

As an interview system, it doesn’t do a good job of identifying the very best people to hire. There’s no correlation between outstanding interview performance and outstanding on-the-job performance. But there’s a strong correlation between poor performance on this kind of interview question and poor performance on the job. Great performers on the job show the same distribution of interview performance as average ones; but poor performers on interviews show a significantly negative-shifted job performance distribution.

We haven’t found a way of interviewing people that does a better job than this. It’s the best we have. Statistically, it works far better at selecting people than “open-ended” interviews that don’t involve any kind of practical programming exercise. So for all of its faults, it’s better that the alternatives.

I’m sure there are pedants out there who are asking “So what’s the correct implementation of gray code?” It’s totally not the point of talking about it, but here’s one sloppy but correct implementation. This isn’t the quality of code I would ever use for anything serious at work, but it’s perfectly adequate for an interview.

import math

def gray(n):
    def required_digits(n):
        """Compute the number of digits required to 
        represent a number in binary
        return int(math.floor(math.log(n, 2))) + 1

    def pad_digits(gray, num_digits):
        if len(gray) < num_digits:
            return "0"*(num_digits - len(gray)) + gray
        return gray

    if n == 0:
        return "0"
    if n == 1:
        return "1"
    num_digits = int(math.floor(math.log(n, 2)) + 1)
    return "1" + pad_digits(gray(int(2**num_digits - 1 - n)), num_digits - 1)

Simpler Consensus with Raft

A few weeks ago, I wrote about Paxos, which is (at least in my experience), the most widely used algorithm for consensus in distributed systems. I’m a huge fan of Paxos – I think that it’s a remarkably elegant system.

But Paxos does have its problem.

  1. Paxos has a lot of roles: client, proposer, learner, acceptor, leader, follower. When you want to implement Paxos, you need to figure out all of those roles, and how you’re going to implement them. In general, you end up merging roles – but there are lots of ways of doing that merge. Each particular way of setting up the roles has its own properties, and thus its own tradeoffs that you need to understand.
  2. Paxos, as we normally talk about it, is really a single-consensus protocol – that is, the basic protocol is designed to get a group of agents to come to consensus just once. If you want to be able to repeatedly seek new consensus values, you’re actually going to be using an extension to the basic paxos protocol. There are a ton of Paxos extensions that work to add repeated consensus. Paxos itself is simple and elegant, with well-defined formal properties that we care about – the moment we start modifying it, we can no longer count on those properties unless we can also prove them in our extension!
  3. Paxos was originally described in a truly awful paper. Leslie Lamport was trying to write a paper that would be less dull than the typical bone-dry technical snoozer – but the way that he wrote it actually makes it much harder to understand.

In short: Paxos has more complexity than it needs, and despite that, it needs to be tweaked to be really useful, and getting those tweaks right is hard. There are, sadly, a lot of incorrect Paxos implementations – and their incorrectness has all-too-often come as a surprise to the people who rely on them.

To avoid those problems, there are other consensus algorithms out there. In this post, we’re going to look at one of the Paxos competitors – a consensus algorithm/protocol called raft.

Raft does away with the role complexity of Paxos. In Raft, you have a collection of cooperating agents. There are no distinct proposers, acceptors, or learners: there are just servers. Communication between the servers in raft is done entirely with synchronous remote procedure calls.

In Raft, the target of consensus is a log containing a sequence of events. The log is the history of the distributed system. The goal of raft is that the log be maintained in a consistent state throughout the raft network. Just like in Paxos, if we have 2n+1 servers, up to n can fail without the network losing its consistency.

Raft is designed in terms of remote procedure calls between the elements of the network. In Raft, we never talk about single messages – every communication between servers is a pair of messages: a request from caller to callee; and a response from callee to caller. When a message gets lost, we’ll just talk about it as a failed remote procedure call.

Within a Raft network, at any given time, each server has a state. It can be a follower, a leader, or a candidate. Within the network, there is at most one leader. When there is a leader, all of the other servers are in the follower state. The followers are almost entirely passive. Followers don’t talk to clients at all – they just wait for RPCs from the leader. The leader is the only participant that’s allowed to talk to clients: any client request must go through the current leader. The leader is also the only server that’s allowed to add new entries to the consensus log.

Raft divides time into a sequence of terms. In each term, the servers in the raft network need to select a leader using a process called an election. Raft is a strong leader protocol – no interactions with a client can take place except through a leader. If there’s no leader, then we can’t process client requests without a leader.

So, to understand Raft, there’s three processes that we need to

  1. Leader election
  2. Transitions between terms
  3. Appending an entry to a log.

In those processes, the servers have a collection of variables
that they use for the Raft protocol:

the current term for the server.
the serverID that this server voted for in the current term, or “none”.
the list of entries in the log.
The index of the highest log entry known to be committed by the server.
The index of the highest log entry that’s been added to the log – but not necessarily committed. (It doesn’t become committed until a majority of servers accept it.)

Leader Election

In each term, the Raft cluster needs to have a leader. The way that a leader is selected called election.

Elections are triggered by a term transition. When a server in the cluster decided that it needs to start a new term, it increments its term number, puts itself into the candidate state, and sends a RequestVote(term, candidateId) RPC to each of the other servers in the cluster. This request asks the other servers in the cluster to select it as the leader. If it receives enough “yes” votes, it will become the leader.

When a server receives a RequestVote RPC, it checks the term. If it’s smaller than the server’s current term, then it replies “No” – meaning that it cannot support the requestor as leader.

If the term in the request is greater that the receiver’s term, then the receiver cannot have voted in the new term. So it updates to the term from the request, and then it replies “Yes”.

If the term in the request equals the receiver’s term, then the receiver has already updated its term. If it’s already voted for someone else as leader, then it can’t support the requestor, so it replies “No”. If it hasn’t voted for a leader in the term, then it votes for the requestor, and replies “Yes”.

If the requester receives “Yes” votes from more than 1/2 of the cluster (counting itself), then it becomes the leader, and starts both processing requests from clients, and sending heartbeats to the other servers in the cluster.

If it doesn’t receive enough votes, then it waits to see if anyone else becomes the leader and starts sending heartbeats. If it doesn’t get a heartbeat in time, then it starts over: it would increment its term again, and try to start a fresh election.

Term Transitions

For a given server, term transitions happen in three ways:

  1. Timeout: the leading server needs to periodically communicate with each of the followers. This process is called heartbeat: even if the leader has no updates for its followers, it sends RPC calls to the followers just to say “I’m still here”. If a client goes too long without receiving a heartbeat, it decides that the leader was lost, and it will increment the term number, and trigger a new election.
  2. Leader resignation: the current leader can, at any time, decide to stop being the leader. (This is typically done by an implementation as part of a system that says that there’s a maximum period between leader elections. For example, in the Aurora scheduler, we had leader elections at least once per day. In a raft consensus, the leader would trigger this by deciding it was time for it to stop being a leader, and triggering an election by starting a new term.)
  3. External term change: every RPC received by a server includes a term number. If any RPC to a server ever includes a term number greater than the current term for that server, the server will update its term to the new number. As a special case of this, when a leader server decides to resign, it does that by sending an RPC to the other servers with an incremented term number.

Appending to the log

We just spent a fair bit of time talking about leaders and elections. That’s almost beside the point. What we really want to do is just maintain a consistent log across the cluster of servers. Everything except creating log entries is just the book-keeping that’s necessary to make the consistent log work. The log itself is maintained using the AppendEntries RPC call.

When a client request does something that alters the state of the cluster, the leader describes that change by adding an entry to the log. It builds a proposed log entry, and sends it to the other members of the cluster using an RPC. If it gets enough “Yes” votes from other cluster members, then the log entry becomes committed, and the leader updates its commitIndex to the index of the new log entry to reflect that.

The RPC request takes a bunch of parameters:

  1. term: the leader’s term.
  2. leaderId: the id of the leader.
  3. prevLogIndex: the index number of the last log entry in the consensus log preceeding this new entry.
  4. prevLogTerm: the number of the term where the last log entry was committed.
  5. entries: a set of new log entries to be appended to the log.
  6. leaderCommit: the index of the commitlog on the leader after this set of entries has been committed.

When an AppendEntries call is received by a follower, what it does is:

  • If the receiver’s term is greater than the request term, then the receiver rejects the request by replying “No”.
  • If the the receivers commit index is larger than the commit index of the request, then it rejects the request by replying “No”.
  • If the receiver’s log doesn’t contain an entry at prevLogIndex, or that entry’s term doesn’t match the request term, then it rejects the request by replying “No”.
  • If there’s an entry in the log with the same index as the new log entries, and the term in the request matches the receiver’s term, then the receiver removes all entries after prevLogIndex from its log.
  • The receiver then appends the new entries from the request to its log.
  • If the leaderCommit is greater than the commitIndex on the receiver, then the receiver updates its commitIndex.
  • Finally, the receiver replies “Yes”.

When a majority of the cluster members have accepted an AppendEntries call, then the log entry gets committed.

The one part of this that’s confusing is how the logs get managed. The leader creates a new log entry, and sends it to the other servers. The complexity comes from dealing with cases where something doesn’t reach consensus.

For example, the leader sends entries 5, 6, and 7 to server S. S adds the entries to its copy of log – it now contains [1, 2, 3, 4, 5, 6, 7]. Meanwhile, the leader also sends those entries to server T, but the RPC to T fails due to a network fault. Another client request happens, and now the leader sends [5, 6, 7, 8] to S. S sees that it’s got entry 5 already: so it discards everything after 5, and then re-appends.

So the trailing segment of the log can change! How do we handle consensus?

The next time that the leader sends an AppendEntries to a follower, it contains the leader’s commitIndex. The follower updates its commit index to that value. Once it’s done that, any request from a leader that tries to modify anything that comes before that commit index will be rejected.

The consensus commit thus doesn’t really occur until the next heartbeat call after a log update.

Raft versus Paxos

That’s the basics of Raft.

In comparison to Paxos, there’s a couple of things to notice:

  1. There’s a lot less confusion around roles. Paxos had a ton of different roles, and rules for interactions between the different roles. Raft doesn’t have any of that: it’s just servers, with one of the servers designated as the leader.
  2. Raft explicitly manages a log, and it adds complexity around log management. In Paxos, you’re just managing a single consensus value; in Raft, you’ve got a sequence of log entries.
  3. Paxos is defined in terms of messages; Raft is designed in terms of remote procedure calls.

So is Raft really simpler than Paxos? I think that’s up for discussion. Personally, I prefer Paxos. There’s a lot of complexity hidden under the covers of the RPC system. It looks simple on the surface, but all of the complexity of message passing, lost messages, message duplication – it’s still there. It’s just been swept under the carpet, as if that really makes it easier.

The way that the logs get maintained is confusing. That’s inevitable: getting distributed knowledge is never easy. Raft at least makes that part of things explicit, whereas it’s a common part of Paxos implementations, but it’s not really specified in the protocol.

Mathematical Data Structures Part 1: Binary Heaps

Being a PhD in computer science, it’s only natural that I love data structures. In particular, I’m fascinated by the way that the math factors in to the way we structure data. Data structures fit into a beautiful intersection between information theory and algorithms: the way that a good data structure is built is a reflection of what information it really needs to maintain. The best data structure encodes exactly the information it needs in order to do it’s job – no more, and no less. The mathematical impacts of that are beautiful, and sometimes surprising. To try to demonstrate that, I’m going to take a couple of posts, and work my way through one of my favorite examples of a surprising outcome in a structure called a fibonacci heap.

A heap is a structure designed to solve a common problem. You’ve got a collection of objects, each of which has an associated numeric value. You want, at any time, to be able to find and remove the largest value in the collection, and to be able to add new elements to it. Those two operations are the core of the heap. Some variations also allow you to increase the value of objects inside the heap, or to remove values other than the maximum.

There are a lot of different ways to implement a heap. One obvious one is to just maintain a sorted sequence of objects. The problem with that is performance: some of the common operations are painfully slow!

Using the sorted sequence approach, removing the largest value is easy: you just remove the last element of the sequence. That’s very fast: it’s constant time. But you also need to be able to add values to the heap, and that’s not so good.

There’s two basic ways of doing a sequence: an array, or a linked list. In both cases, the performance isn’t acceptable. If we used an array,then in order to add a new object to the collection, we’d need to:

  1. Find the correct position for it in the array. We can do that by doing a binary search, which takes time O(lg n) where n is the length of the array. This step isn’t bad – in general, we’re pretty happy with O(lg n) operations.
  2. Insert the value into the array – which means shifting all of the elements that come after it one place to the right. That’s O(n) time, which is pretty crappy.

In the linked list approach, inserting the value isn’t a problem – it’s a constant time operation. But finding the position where it should be inserted is linear time. So we’re still talking about linear time.

Similarly, we could use a linked list, where inserting the element is constant time, but then finding its position is O(n) – again, unacceptable.

The problem with the sorted sequence approach isn’t really related to the kind of structure we use to maintain the sorted list; the problem is that we’re maintaining more information that we need. At any time, we want to be able to find the largest element of the heap quickly – we don’t care about the relative positions of any pair of values that don’t include the largest element of the collection! But if we keep a sorted list, every time we insert an element, we’re spending a lot of time comparing things whose comparison we don’t really care about!

To be able to make it faster, we need to build a data structure that doesn’t waste time and effort computing and maintaining information that we don’t want.

So our goal is to find ways of building structures that always let us both find the largest element quickly, and add new elements quickly, without maintaining more information that is really necessary. We’ll start off with a simple but good version, and then work our way through to better ones.

The most basic implementation of a heap is called a binary heap. A binary heap is a binary tree with two key properties:

  1. Every node in the tree is larger than its children.
  2. The tree is left-full: every level of the tree is full except for the last; and the last level is filled in from left to right.

The left-full property might seem a bit strange, but it turns out to be pretty straightforward. A binary heap can be implemented using an array. The root node is stored in the first position of the array; its children are in positions 2 and 3; the children of node 2 are stored in positions 4 and 5; the childen of position 3 are stored in positions 6 and 7. Using one-based indices, for any node N, it’s children are stored in positions 2N and 2N+1. Adding a new leaf to the tree can always be done by just appending one value to the array. The left-full property just means that you always extend the array by adding an element onto the left.

Implementing a heap this way is simple:

  1. To get the maximum value, you just look at the first element of the array – O(1).
  2. To remove the largest element from the array, you get the value from the first element of the array, and save it. Then you remove the last element from the array, and bubble it down – swapping it with one of its children if they’re bigger than it. We’ll look at this in more detail, but the bubble down process is O(lg n) in the worst case.
  3. Inserting a new element is done by adding it to the end of the array, and then bubbling up, by comparing it to its parent, and swapping if it’s bigger than its parent. Again, it’s O(lg n).

I’m going to show code for this. For fun, I wrote the code in a language called xtend. Xtend is a Java extension that cleans up the syntax, gets rid of semicolons, improves the type system, adds lambdas, and does a few other really neat things.

The whole beast is just a wrapper around an array:

class BinHeap> {
  val ArrayList _contents

  new() {
    _contents = new ArrayList()


If you know Java, this is mostly clear. In xtend, you write constructors using the name “new” instead of the name of the class being constructed.

Then we’ll set up some utilities to make other stuff easier to write.

  def leftChildPosition(int pos) {
    2 * (pos + 1) - 1

  def rightChildPosition(int pos) {
    2 * (pos + 1)

  def int parentPosition(int pos) {
    if (pos == 0) {
      throw new MaxHeapException()
    } else {
      (pos + 1)/ 2 - 1

  def void swap(int one, int two) {
    val T first = _contents.get(one)
    _contents.set(one, _contents.get(two))
    _contents.set(two, first)

Again, these should be straightforward. The only tricky thing is that the JVM uses zero-based arrays – so the left child of the node in position N is (2*(N+1) - 1): we need to add one to the node number to shift to one-based position; and then subtract one from the result to switch back to zero-based position. We do a similar thing for each of the other position computations.

Now we can get to the interesting bits. How do we get values into the heap?

def insert(T v) {
  val idx = _contents.size()

Insert is exactly what I described in prose above: append the new value onto the end of the array, and then bubble it up. Bubbling is the interesting part:

  private def void bubbleUp(int pos) {
    if (pos > 0) {
      val parentPos = parentPosition(pos)
      if (_contents.get(pos) > _contents.get(parentPos)) {
        swap(pos, parentPos)

Bubbling up from a position P compares P to its parent. If it’s bigger than its parent, it swaps positions with the parent, and then tries to continue bubbling up from its new position.

For example, imagine we had a tree like:


Now, suppose we wanted to add the value “10” to this. We’d add 10 to the end of the array, which would make it a child of 6. That would give us:


So, we’d compare 10 to its parent – it’s bigger, so we’d swap:


Then we’d compare 10 to its new parent, 8. It’s bigger, so we swap:


And finally, we’d compare 10 to its new parent, 9. It’s bigger so we swap, and then we’re done.


Appending to the end of the array is constant time, so the dominant time cost is the bubbling. The maximum possible number of swaps in the doubling process is depth of the tree minus 1 – and the depth of a full binary tree with N members is \lceil ln N \rceil. So it’s O(lg n) swaps, and the overall cost of inserts is O(lg n).

Getting the largest value is trivial:

def getMax() {

Removing the largest value is a lot like adding a value: we really play with the last element of the array, and then do a bubbling process – only this time we’ll bubble in the opposite direction:

  def removeMax() {
    if (_contents.size == 0) {
      throw new MaxHeapException()
    } else {
      val result = getMax()
      val last = _contents.remove(_contents.size() - 1)
      if (_contents.size() > 0) {
        _contents.set(0, last)

Bubbling down is similar to bubbling up, but it’s a bit more complicated, because we need to look at both children.

  private def void bubbleDown(int pos) {
    val rightChildPos = rightChildPosition(pos)
      val leftChildPos = leftChildPosition(pos)
      if (leftChildPos >= _contents.size) {
      // Try to bubble left if there is no right child, or if the lift child is
      // bigger than the right.
      if (rightChildPos >= _contents.size || _contents.get(leftChildPos) > _contents.get(rightChildPos)) {
        if (_contents.get(pos) < _contents.get(leftChildPos)) {
          swap(pos, leftChildPos)

      } else {
        // Try to bubble right
        if (_contents.get(pos) < _contents.get(rightChildPos)) {
          swap(pos, rightChildPos)

The process is almost the same as bubbling up, but moving in the opposite direction. We're starting with a parent node, and comparing it to its children. If it's bigger than either of its children, then we swap it with the largest child, and then continue bubbling down.

For example, let's look at the same heap we looked at for insert:


If we want to remove 9, we set the value 9 aside, and then remove 3 from the end of the array, and put it at the root of the tree:


Then we'd compare 3 against its two children, 8 and 7. Since 8 is the larger child, we swap 8 for 3:


Now we compare 3 with its new children, 5 and 6. 6 is bigger, so we swap 6 with 3:


3 has no children, so we're done: it's bubbled down as far as it can go.

Note: I messed up this example in the original version of the post. Thanks to John Armstrong for pointing it out.

The cost here is the same as insert, for the same reason. The dominant cost is the bubbling, and the bubbling is bounded by the depth of the tree. So removing the maximum is also O(lg n).

It's worth noting that heaps can be used to build a very reasonable sorting algorithm. To sort a collection, just insert all of the elements of the collection, and then remove them one by one. It's O(n lg n), and it's conceptually quite simple. It's not widely used, because the old classic quicksort is faster - not in big(0) notation, but it ends up with a smaller constant. (In big-O notation, something that takes 3(lg n) steps and something that takes 6(lg n) steps are both O(lg n), but the one whose constant is 3 is still twice as fast as the one whose constant is 6.)

Paxos, a really beautiful protocol for distributed consensus

The work that I do in real life is all focused on tools for other developers. In todays environment, that means that I’ve spent a lot of time working on tools that, in one way or another, help other developers deal with distributed systems. In that work, I’ve noticed that there are some really key things that straddle the line between pure math and pure engineering. That’s really interesting to someone like me!

A good example of that is something called paxos. My first exposure to paxos was very interesting. I’d just been hired by Google, and was working on their build tool. At the time, engineers in Google had a problem. Google’s codebase was contained in one massive version control repository. Doing things that way has a ton of really neat advantages – most importantly, the fact that it makes it really convenient to reuse code written by anyone else at the company. The problem was that code reuse can become very confusing. Project A reuses a bit of code written by people from project B. B’s code reused some stuff from C, and C from D, and D from E. So now project A is using code from project E, and they don’t know why!

In this case, I had someone from a storage project coming to me trying to figure out just why his system had a dependency on a plan9 database system called paxos. I had to built a tool that would allow people to ask questions like “Why does A depend on E?”.

As it turned out, paxos was a really important thing, and it was widely reused through the Google codebase. Once I learned about it, I started seeing it everywhere. Since I left Google nearly four years ago, I didn’t stop seeing it. It’s ubiquitous in distributed systems. Outside of Google, we weren’t using that friendly old plan9 paxos implementation – but the paxos model has been reimplemented dozens of times, because it’s so darned useful!

paxos is a system for managing consensus.

In distributed systems, there a collection of hard problems that you constantly need to deal with.

  1. Things fail. You can never count on anything being reliable. Even if you have perfectly bug-free software, and hardware that never breaks, you’ve still got to deal with the fact that network connections can break, or messages within a network can get lost, or that some bozo might sever your network connection with a bulldozer. (That really happened while I was at Google!)
  2. Given (1), you can never rely on one copy of anything, because that copy might become unavailable due to a failure. So you need to keep multiple copies, and those copies need to be consistent – meaning that at any time, all of the copies agree about their contents.
  3. There’s no way to maintain a single completely consistent view of time between multiple computers. Due to inconsistencies in individual machine performance, and variable network delays, variable storage latency, and several other factors, there’s no canonical way of saying that for two events X and Y, “X happened before Y”. What that means is that when you try to maintain a consistent set of data, you can’t just say “Run all of the events in order”, because while one server maintaining one copy might “know” that X happened before Y, another server maintaining another copy might be just as certain that Y happened before X.

In a world where you can’t count on different agents seeing events in the same order, and where you expect things to be constantly failing, how can you make sure that any distributed system you build ends up with a consistent view of reality?

The answer is a consensus protocol. You need to create a mechanism based on communication between the copies of your system that allows them to mantain a consistent consensus of what the current state of the world is, even in the presence of failures of machines, storage systems, and communications.

paxos is a very elegant, reasonably simple consensus protocol.

Let’s get a bit more precise. Paxos is built on a model of storage. The original application of it was a consistent database, so it’s built around the idea of keeping data consistent. In paxos, the state of the storage is modelled as a sequence of transactions. Each transaction is a pair (t, v), where t is a numeric transaction identifier, and a v is a transaction value.

The state of the system being modelled is a sequence of transaction pairs, [(ti, vi), (tj, vj), ..], where the t values are increasing as you progress through the sequence. As time passes, new transaction pairs can be added to the state.

The focus of the paxos protocol is ensuring that in a collection of 2n+1 participants, all surviving participants will agree on the current value of the state, even if up to n participants fail, and even if messages can be delivered in arbitrary order.

Before I go further into the description of paxos, we need to look at the basic assumptions that underlie it. Like any formal model, it’s not defined in terms of real computers. It’s defined in terms of an abstraction that approximates reality. In this case, the approximation is quite good, but we still need to go through the basic assumptions that make up its model of the universe.

  1. Processors (aka participants, servers, computers):
    1. operate at any speed. No two processors necessarily operate at the same speed.
    2. may fail without warning.
    3. may rejoin after recovering from a failure.
    4. are cooperative (in the sense that they do not attempt to cause failures).
  2. Network:
    1. Delivers messages between any pair of processors.
    2. Transmits messages asynchronously.
    3. Delivers messages at arbitrary speeds.
    4. Does not guarantee that messages will be delivered in the order in which they were transmitted.
    5. Does guarantee that a message, if delivered, will be delivered correctly, without any changes.
    6. May fail to deliver a message.
    7. May deliver multiple copies of the same message.

In short, everything can fail at any time; after failure, participants can recover and rejoin the system; any no part of the system acts in an actively adversarial way.

The protocol describes the behavior of the system in terms of a collection of roles. A participant can have more than one role in the system – in fact, in most implementations of paxos, all partipants do have multiple roles. The roles are:

The client is not part of the paxos cluster. It’s an external entity whose actions trigger state changes by making requests to the paxos system. Each state update in paxos is initiated by a client request, and completed by a reply to the client.
An acceptor (also called a voter) is a participant in the maintanence of distributed storage. A state change in a paxos cluster does not occur until a majority (quorum) of acceptors agree upon it.
A proposer recieves a request from the client, and attempts to get a quorum of acceptors to agree on it.
One of the proposers is special. It is the single proposer who most recently had a proposal accepted. In many paxos implementations, there is only one active proposer serving client requests: the only time the other proposers send proposals is when the current leader fails, and a new one needs to be selected.
The learner is the real service provided by the paxos cluster. Once a proposal is accepted, a learner processes the request from the client, and sends it the result.

In a typical paxos cluster, the client sends requests to a proposer. The proposer sends a proposal to update the state with the new client request, and attempts to convince a majority of the acceptors to accept it. Once a majority accepts it, the client request is processed by the learner, and a result is returned to the client.

The meat of paxos the protocol that the proposer gets a majority of acceptors to agree on a proposal, and how that protocol process ensures that the collection of acceptors maintains a consistent state.

The protocol itself is pretty simple. Each round is effectively independent, and consists of a process of attempting to reach consensus. Within each round, finding consensus is a two-phase process, where each phase consists of a message sent from a proposer to a group of acceptors, and a reply from the acceptors to the proposer.

  1. Phase One: Prepare/Promise
    • Proposer: A proposer attempts to start setting a new consensus by sending a Prepare(N) message to a quorum of acceptors. It can send to any group of acceptors, so long as that group forms a majority of the acceptors. The prepare message specifies a numeric identifier N for its proposal, which is larger than any proposal that’s been sent by this proposer.
    • Acceptors:
      Each acceptor, upon receiving the proposal, checks if the N-value from the prepare message is greater than any proposal from the current round that it has accepted. If so, it sends a reply called a Promise to the proposer, promising that it will never accept any proposal with a number less that N. If the acceptor has accepted a proposal with number less than N in the current round, then it includes the pair (v, n_v) consisting of the proposed consensus value v and the number n_v of the accepted proposal that proposed v.
      The acceptor thus sends a message Promise(N, (v, nv)) (if it has accepted a proposal this round) or Promise(N, null) (if it has not yet accepted a proposal with number less than N).

      Once it’s sent a promise message, it must not accept any request for a proposal with number less that N. Note though that this does not mean that the acceptor promises to accept the proposal: all it’s doing is promising not to accept any proposal with number less than N! If in receives a message Prepare(N+1), it’s free to promise that – but if it does, it will no longer be able to accept the proposal for N.

      (If N is smaller that the number of any proposal promised or accepted by the acceptor, then in the original version of paxos, the acceptor does nothing; in some optimizations of the protocol, it replies Reject(nv).)

    What this phase does is allow a proposer to determine whether or not a new proposal is even worth considering. If a quorum (majority) of acceptors send promises, then it can move on to phase 2.

  2. Phase Two: Accept!/Accepted

    When a proposer recieves promises from a quorum of acceptors, then it moves forward to try to actually commit the proposal. In order to do this, it needs to choose a value for the proposal. If any of the Promise messages contained a value, then the value of this proposal must be set to the value of the highest proposal number in any of the promises. If all of the promises were empty, then the proposer can choose any value that it wants for the proposal.

    Once the proposer has chosen a value, then it sends a message Accept!(N, V) to a quorum of acceptors. This is typically written with the exclamation point, because it’s really a command to the acceptors: they’re being told to accept the proposal, if they can.

    When an acceptor receives an Accept!(N, v) message, if it has not issued a promise for a proposal with number greater than N, then it must accept the message. It accepts the proposal by sending a message Accepted(N, V) to both the original proposer, and all of the learners.

    When Accepted messages have been received from a quorum of acceptors, the new value V becomes the consensus value for the paxos cluster, and the new proposal number N is fully committed.

As with so many things, this is easier to understand when you think about an example. One use of paxos that I’ve worked with is in a cluster scheduling service. In that system:

  • a client is a user attempting to run a new job on the cluster. It sends a request to the scheduler detailing the set of resources that it wants to request.
  • Each duplicate of the scheduler is a proposer, an acceptor, and a learner. There’s one active instance of the scheduler, which is the leader. When a client wants to schedule a job, its request gets sent to the leading scheduler.
  • In the normal non-error case, this works as follows:
    1. When a scheduling request is received, the leader proposes scheduling the job, by sending a message to all of the other schedulers saying that it wants to schedule job N.
    2. The other schedulers, if they haven’t seen a proposal for a job with number greater than i, make promises to accept that proposal.
    3. The leading scheduler chooses resources for the job, and then sends an Accept! message to the other schedulers.
    4. The other schedulers reply accepting the scheduling. The non-leader schedulers, acting as learnings, record the scheduling information, and the leader actually starts the job.
  • Errors occur when there was some kind of failure. In that case, we don’t necessarily know who the leader is – so we get multiple schedulers trying to act as if they’re the leader. So they each send proposals. Whichever proposal had the largest proposal number will eventually get accepted, and its proposer becomes the new leader.

It’s a pretty simple thing – the core concept is simply that no consensus proposal is considered “committed” until it’s been accepted by a majority of the participants. And if it’s been accepted by a majority of the participants, that means that no conflicting proposal can ever reach consensus – because that would require at least one participant to accept 2 conflicting proposals.

But there’s still a bit of formality that’s working looking at. Exactly what guarantees does paxos give? What properties does paxos-style consensus have?

Even the formal properties of paxos are easy to understand. Paxos provides two key properties: validity, and agreement.

No value ever reaches consensus without first being proposed, and having its proposal accepted.
No two distinct values ever reach consensus at the same time.

You an easily prove those two properties. In fact, the proof is completely obvious once you recognize that the paxos protocol has two invariants (and those invariants are themselves clear from the definition of the protocol!):

  1. An acceptor can only accept a proposal p if and only if it has not yet made a promise
    for a proposal n where is the consensus value of the
    highest numbered proposal that has been accepted before this proposal.

Getting back to the beginning: the point of all of this is to have a system in which we can be sure that things work correctly even in the presence of failures. In paxos, as long as at some point there was a quorum of machines that come to agreement, then any failure that leaves a surviving quorum of machines must have overlapped with the previous quorum – which means that the previous consensus still remains in effect, and will be propagated to the remaining participants. If you’ve got 5 machines, then two can fail, and you won’t lose consistency among the remaining ones.

Functional Programming Comes to the Macintosh! Introducing Swift!

So, Apple just recently announced a new programming language! For people like me, that’s big news. (In case you’re a relatively new reader of this blog, by background I’m not a mathematician. I’m a computer scientist, and I did my PhD in programming language design and implementation. I’m utterly obsessed with programming languages. Last time I set out to count how many language I’d learned, it was over 120, and I’ve at least read the specifications of another 30 or so since then. I don’t claim to be able to program in all of those today – but they’re all things that I have written something in at one time or another. Like I said, this is an obsession for me!)

Before I go into describing swift, I’m going to go through a bit of history. Swift is a hybrid language, in a way that reflects where it came from. To understand that, we need to see where it came from.

Where it came from: a step back into Swift’s history

To understand Swift, you need to know a bit of where it came from.

Since the dawn of OSX, the bulk of software development on the Macintosh has been done in a language caled Objective-C. Objective-C is a lovely programming language (for its time), which never really caught on outside of OSX and it’s predecessor NeXTStep. Swift is a hybrid language, combining object-oriented programming and functional programming; Objective-C, it’s ancestor, was a hybrid between old-fashioned procedural imperative programming and object-oriented programming.

Objective-C dates back to around 1984 or so. Back then, the main programming languages that were used for developing applications were predominantly procedural imperative languages, like Pascal and C. What the “imperative” part of that means is that they’re languages where you specify what you want the computer to do using a sequence of very precise steps. The procedural part means that the main structural principle in those languages is dividing code up into functional elements called procedures or subroutines.

For example, suppose you wanted to implement a binary search tree. In a procedural language, you’d write something like:

typedef struct BinarySearchTreeNode {
    char *value;
    struct BinarySearchTreeNode *left;
    struct BinarySearchTreeNode *right;
} BinarySearchTree;

BinarySearchTree *create_tree(char *str) { ... }

void add_value(BinarySearchTree *tree, char *str) { ... }

void _rebalance(BinarySearchTree *tree) { ... }

Even though you’re implementing a data structure, the fundamental code structure doesn’t reflect that. You’ve got a declaration of a datatype (the struct), and you’ve got a collection of functions. The functions aren’t associated with the declaration, or with each other. The code is organized around the functions, and any connection between the functions and the data structures – or between different functions – isn’t reflected in the structure of the code.

If you think in terms of abstractions: the only abstraction in the procedural version is that one procedure can’t access local variables defined in other procedures. All of the details of the BinarySearchTree data structure are open and accessible to anyone who feels like messing with them. Even operations – like the _rebalance function, which is intended to be only used in the implementation of other BinarySearchTree methods, is callable by anyone who wants to call it.

In the late 1980s, that started to change. The idea of object orientation came onto the scene. In object-oriented programming, you’re still writing imperative code, but the overall organization is based on data structures. You implement a data structure, and associated with it are the operations on that structure. In the object-oriented system, the operations on a data structure effectively became part of the structure itself. Re-writing our example above, in an object-oriented way, the code is similar, but the principle organizational structure changes. Instead of data type declaration and a set of functions that are only related to one another by our understanding, the functions (now called methods) are declared and implemented as part of the data structure.

@interface BinarySearchTree
- init: (NSString)str;
- addKey: (NSString)str withValue: val;

@implementation BinarySearchTree {
  NSString key;
  id value;
- init: (NSString)str { ... }
- addNode: (NSString)str { ... }
- rebalance;

Only the implementor of the BinarySearchTree can actually see the internals of the implementation. Rebalance isn’t part of the public interface of the class, and users of the BinarySearchTree can’t access it.

There’s more to object-orientation than this – in particular, there’s an idea called inheritance. Inheritance lets you declare a new data structure as being based on another:

@interface BinaryHeap: BinarySearchTree { }
- (NSString) pop;
- push: (NSString)s;

Inheritance makes it easy to reuse code. To implement a heap, we don’t need to re-implement the binary search tree maintenance methods. We just implement the new methods of BinaryHeap using the BinarySearchTree operations that is inherited by the heap implementation.

Object-orientation seems obvious in retrospect, but back in the late 80s, it was absolutely revolutionary! All of us who’d been programming in old languages were entranced by this new idea, and we wanted to try it!

But there was a problem. When object-oriented programming started becoing popular, most of us were programming in C. We couldn’t just abandon C for some new language, because we relied on tons of existing code, all of which was written in C. For example, back then, I was doing most of my programming on an Amiga. The Amiga UI libraries were all written in C, and if I wanted to write an Amiga program, I needed to use those libraries. I couldn’t just grab Smalltalk and start coding, because Smalltalk couldn’t call the libraries I needed to use.

The way that we started to write object-oriented code was with hybrid languages based on C – languages that contained the familiar C that could interface with our existing systems, but which also had extensions for writing object-oriented programs. There were two main hybrids for C programmers, based on the two main schools of object-oriented programming. There was C++, based on the Simula model of object-orientation, and Objective-C, based on the Smalltalk model. With C++ or Objective-C, you could use object-oriented programming for all of your new code, while still be able to use non object-oriented libraries.

C++ is much more familiar to most people. In the mainstream community, Objective-C never caught on, but C++ became the dominant systems programming language. But Apple, because of OSX’s roots in NeXTStep, embraced Objective-C. If it wasn’t for them, Objective-C would likely have completely disappeared – outside of the Apple world, no one really used it.

Objective-C was an absolutely lovely language for its time. But it was invented around 1984 – it’s an old programming language. Lots of new ideas have come along since the days of Objective-C, and with its roots in C, it’s really hard to make any big changes to Objective-C without breaking existing code.

Now, we’re at the cusp of another big change in programming. Functional programming seems to be becoming the newest rage. The way that functional programming is catching on is very similar to the way that object-orientation did it before: by hybrids. If you look at many of the new, trendy programming languages – such as Scala, Clojure, and Rust – what you find is basically a subset which is the old familiar object-oriented language, and a bunch of functional stuff integrated into it.

With objective-C really starting to look pretty crufty, Apple did the obvious thing: they made a break, and introduced a new hybrid functional/object-oriented language, which they named Swift.

Introducing Swift

In the announcement, they described Swift as “Objective-C without the C”, but I’d describe it more as “Objective-C meets SML”. Swift is a hybrid language in the tradition of Objective-C, but instead of being a hybrid of sprocedural and object-oriented languages, it’s a hybrid of object-oriented and functional languages.

Let’s start with a simple example. In most places, it’s traditional to start with “hello world”, but I think that that’s both boring, and not particularly enlightening. I like to start with the factorial function.

func fact(n: Int) -> Int {
  if n == 0 {
    return 1
  } else {
    return n * fact(n - 1)

let x = fact(10)
println("The factorial of 10 is \(fact(10))")

It’s pretty standard for a modern-ish language. The only thing that’s at all unusual is that if you look inside the println, you can see that Swift does string interpolation: if you put \( expression) into a string, then the result of that expression is converted into a string, and inserted into the enclosing string. That’s been common in scripting languages for a long time, but it hasn’t been a standard thing in systems programming languages like Swift. It’s a small thing, but it’s a good one.

The first big surprise in Swift is how you can test that function. In Objective-C, you would have had to create a full program, compile it, and run the executable. That’s not the case in Swift. In Swift, you can open a playground, which is a fully interactive scripting environment. It’s not that Swift comes with an interpreter – a playground is a workspace with fully access to the compiler, and it can compile and evaluate expressions on the fly! The playground isn’t limited to just text – you can create UI elements, interface builder based UIs, interactive games – just put the code into a workspace, and edit it. As the code changes, the moment it becomes syntactically valid, the playground will dynamically compile it using the LCC backend, and execute the generated code for you!

All I can say is: It’s about time! I’ve been seeing toy versions of this kind of interactivity since I was an undergrad in the 1980s. It’s about time that we got that in full blown systems language. Just this one little feature makes me really want to like Swift. And the rest of the language, while not being exceptional, is good enough to make me want to start coding in it.

In terms of syntax, Swift is mostly pretty straightforward. It’s got a clear C heritage, but with a whole lot of cleanups.

The Type System

People like me really care about type systems. Type systems can be controversial – some people love to program with strong types, and others really hate it. I’m very much in the camp that loves it. In my experience, when I’m programming in a language like SML or OCaml or Scala, which have really powerful, expressive type systems, my code very frequently ends up running correctly the first time I execute it. That’s not because I’m some kind of superhuman genius programmer, or because the languages I like somehow prevent me from making mistakes. It’s because in a strong type system, the particular kinds of mistakes that I’m most likely to make are exactly the kinds of things that will show up as a type error. I make just as many mistakes in a strongly statically typed language as I would in a dynamically typed one, but the type system catches them up front, before I can ever run my program. For my way of programming, I like that – I get all of those stupid errors out of the way quickly, so that I can spend my time tracking down the hard stuff.

Swift has a strong type system, in roughly the vein of the SML family of languages. You have the usual array of simple types (String, Int, Float, Character, etc.) In addition, you can create your own data types, and those types can be generic, with type parameters: in any declaration, you can include type parameters in angle brackets. For example, if you wanted to implement a generic list type, you could declare it as:

class List<T> {
  func insert(v: T) { ... }

T is a type parameter. When you declare a variable that can store a list, you need to specify the value of that type parameter:

  var l: List<Int> = ...;

You can also declare constraints on type parameters. A constraint is a way of saying “You can only use a type parameter that meets these requirements”. For example, if you wanted to implement a generic binary search tree, you’d need to be able to compare elements. If you just wrote BinarySearchTree<T>, you wouldn’t be able to use the less-than operator, because not all types can be compared that way. So you need to declare a constraint:

class BinarySearchTree<T: Comparable> { ... }

Now, if you were to try to create BinarySearchTree<Dictionary<String, String>>, you’d get a compile error. You can’t use the less-than operator on dictionaries, so they don’t satisfy the constraints.

The constraints are built using the object oriented features of Swift. Swift allows you to define something called a protocol, which is equivalent to an interface in Java, or a trait in Scala. A protocol just defines a set of methods; when you declare a class, you can declare that it implements a protocol, and the compiler will check that it implements the needed methods. When you write a type T: P, it just means that the type T implements the protocol P.

The main innovation in the type system (at least for a systems programming language) is optional types. In Swift, when you declare a variable using var foo, the variable foo will always contain a value. It can’t be nil. If you want a variable to potentially be empty, you need to explicitly say so, by declaring it with an optional type, which is written as a type-name followed by a question mark, like “var foo: String?“.

When a variable or parameter has an optional type, you can’t just access the value in that variable directly. You need to do something to check if it’s got a value in it or not. The easiest way to do that is using a conditional:

let opt_value: String? = some_function_call()
if let value = opt_value {

In many cases, you can make the code even simpler, by using option chaining:

let opt_value: String? = some_function_call()

The question-mark in an expression does an automatic test-and-dereference. If the value is present, then the following operation is executed. If it’s absent, then the expression returns nil. The thing after the question mark doesn’t have to be a method call – it can also be a subscript, or a field reference. And you can (as the name suggests), chain these things together:


Finally, you can force a dereference of a value directly using !. If you know that an option value is present, and you don’t want to write out a conditional to test it, you can append an explanation point to the end of the expression. It will generate a run-time nil-dereference error if the optional value is empty.

Handling optionality this way works a lot like what I said about static typing before. It doesn’t make errors go away – the errors that would have turned up an null pointer dereferences can still happen in your code. But most of them will be caught in advance by the compiler; and the places where null pointer reference can happen that won’t be caught by the compiler will be easy to find, because they’ll be explicit dereferences of an option, using “!”. (I’ve written about this idea before; see here for details

Object-Orientation in Swift

Let’s move on a bit. I spent a lot of time talking about data structures and object-orientation. Swift’s version of object-orientation is pretty standard for a modern programming language. For example, here’s a sketch of an implementation of a simple tree-based dictionary. (This doesn’t compile yet; it should, but right now, it crashes the compiler.)

class BinarySearchTreeDictionary<T: Comparable, S> {
  let key: T
  let value: S
  var left: BinarySearchTreeDictionary<T, S>?
  var right: BinarySearchTreeDictionary<T, S>?

  init(key: T, value: S) {
    self.key = key
    self.value = value
    self.left = nil
    self.right = nil

  func insert(key: T, value: S) {
    if key > self.key {
      if let l = self.left {
        l.insert(key, value)
      } else {
        self.left = new BinarySearchTreeDictionary<T, S>(key, value)
    } else {
      if let r = self.right {
        r.insert(key, value)
      } else {
        self.right = new BinarySearchTreeDictionary<T, S>(key, value)

If you’ve been doing much programming in any modern language, this should mostly look pretty familiar. There are a couple of things that might need explanation.

  • One of the fields of our class is declared with “let”, and two with “var”. As I said in the introduction, Swift is a hybrid between functional and object-oriented programming. The two declaration types reflect that: identifiers declared with “let” are constants, and identifiers declared with “var” are variables. In this binary search tree, we’re not going to try to do in-place rebalancing, so we don’t need the key to be mutable.
  • The type of the left and right fields are option types, as we saw in the previous section.
  • Inside of the implementation of insert, you can see one of the effects of the optional type declaration. You can’t just use an optional value, because it might be nil. There are several ways of using optionals. In this case, we use conditional unwrapping, where you put a constant declaration into the condition of the if statement. If the optional value is present, then the constant is bound, and the body of the conditional is executed.

In Swift, classes are values that are always passed by reference. If you want to implement a class that’s passed by value, you can use a struct. Structs are declared the same way as classes, except that they (obviously) use the struct keyword instead of class. In a struct, by default, all methods are constant: they’re not allowed to modify the structure in any way. If you want a method to be able to modify the structure, you need to explicitly declare that using the mutating keyword. If you declare a method as mutating, you can change anything about it that you want – up to and including entirely replacing it by assigning a new structure value to self!

struct SearchTreeNode<Key: Comparable, Value> {
  mutating func pivot_right() {
    var tmp = self
    self = self.left
    tmp.right = self.right
    self.right = tmp

Swift even goes as far as incorporating a bit of aspect-oriented programming in its object system. Aspect-oriented programming is a crazy thing that mostly grew out of object-orientation. The idea of it is that sometimes, you’d like to be able to inject behaviors into your code to add functionality to things like assignments. The way that you do that is by specifying that there’s some action you want to augment, and providing a function containing the code that you want to inject into that action.

For a pretty typical example, you might want to provide a public field in your class, but you want to make sure that no one can assign an invalid value to it, so when someone assigns a value, you provide code to check it for validity.

In Swift, for any field or global variable, there’s a pair of implicit functions that you can define, called willSet and didSet. willSet is called before the assignment happens, and didSet is called after.

For example, if you were implementing a volume control, and you wanted to make sure that no one set the volume higher than a user defined maximum, you could write:

struct VolumeControl {
  var max: Int
  var current: Int {
    didSet(newVolume) {
      if newVolume > max {
        current = max

There’s yet another type of object or data structure that you can create with Swift. They’re called enums. From the name, you might think that enums are enumerations. They can certainly be used to implement enumerations, but they’re useful for much more than that. They’re what programming languages geeks like me call algebraic types, or what Scala calls case classes. An enum is any data type whose value can take multiple forms.

Enum types are a bit more limited than I might have liked, but they’re a really nice thing to have. They’re intended to be used for simple, lightweight values that could take several different forms. For example, a few weeks ago, I posted my parser combinator library. In the parser combinators, a parse result could be one of three things: a successful parse (which includes a product value and a parser input containing the unconsumed inputs), a failed parse (which didn’t include anything, but which indicated that the parse failed, but there wasn’t an error), or an error. To implement that in Swift, I’d use an enum:

enum ParseResult<In, Out> {
  case Success(Out, ParserInput<In>)
  case Failure
  case Error(String)

The way that you use enums usually comes down to pattern matching:

  val e: ParseResult<In, Out> = parser.parse(original_in)
  switch (e) {
    case .Success(let result, var rest): return other.parse(rest)
    case .Failure: return Failure
    case .Error(let msg): return Error(msg)

Finally, there’s one more kind of data structure: tuples. Tuples are light-weight structures which don’t require a separate declaration of the type. You can just write tuple-values directly, and the type is inferred:

  let tuple = (1, 3, "hi there")

Tuples are a wonderful addition. The main use for tuples is allowing functions to return multiple values. You no longer need to muck about with bunching stuff into untyped lists, or stuffing some of your results into out-parameters. You can just return multiple values.

Functional Programming

The really new stuff in Swift is all functional programming stuff. Functional programming is the new hot thing in programming. For good reason. There’s two main facets to Swift’s version of functional programming: managing mutability, and first-class functions and closures.

Managing Mutability

In programming, you can’t avoid mutable state. It’s a fact. Most of the time, it’s the reason that we’re using a computer. For example, I’m using a program called Atom to write this blog post. There wouldn’t be any point is using Atom if I couldn’t modify the file I’m writing.

But mutable state makes things harder. In a large, complex software system, code which avoids mutating things is usually easier to understand, less prone to errors, less prone to unexpected side effects, and generally easier to interact with.

That’s the driving force behind the new hybrid languages: we want to be able to write functional code when it’s possible. But programming in a purely functional language is frequently really awkward, because you need state – so, in a functional language, you need to find a way to fudge the state in, using Monads, streams, or something else.

The key to using functional programming in a mutable-state world is controlling mutability. That is, you want to be able to glance at some code, and say “This is not going to change anything”, or “This might change stuff”. It’s less about making it impossible to change stuff than it is about making it clear just where stuff could be changed, and making sure that it can’t change unless you specifically declared its mutability. I’m a big fan of keeping things functional whenever it’s reasonable, as I wrote about here”.

Swift does a bunch of things to create that guarantee:

  1. Identifiers are declared with either “let” or “var”. If they’re declared with “let”, then the identifier names a constant which cannot be altered by assignment. If the value of a constant identifier is a struct or enum, its fields cannot be altered by assignment either.
  2. Methods on structs cannot alter the underlying struct unless the method is specifically annotated as “mutating”. Without that annotation in the declaration, the object cannot be altered by any of its methods.
  3. Function parameters are, by default, immutable. To allow a parameter to be changed, you need to specifically annotate it with var or inout in its declaration. If it’s a var, then changes to the parameter will not affect the original value in the caller; they will be made on a local copy.
  4. Structs and enums are passed by value. That means that the structure is (conceptually) copied when it’s passed as a parameter to a function. Unless you annotate the function parameter as an output parameter, the parameter cannot be modified; even if you call a mutating method, the mutation will be on the local copy of the value, not on the caller. (The reason for the “conceptually” up there is that the object is copied lazily; if you try to alter it, the copy happens at the point of the first alteration, so passing complex structures by-value doesn’t incur a copy cost unless you modify it.)

Functions and Closures

The other side of the functional programming support isn’t about restricting things, but about enabling new things. And here Swift really shines. Swift has support for first class functions (function parameter and return values), anonymous functions, curried functions, and full closures.

Swift’s support for first class functions means that functions are just values, like any other value type. A swift function can be assigned to a variable, passed to a function, or returned as the result type of a function.

For example, suppose you wanted to write a generic sort function – that is, a sort function that didn’t just compare values using the standard less-than operator, but which could take any function that a user wanted to do comparisons. In Swift, you could write:

  func sort_generic<T>(list: Array<T>, comparator: (T, T) -> Bool) -> Array<T> {
    if comparator(v, w) { ... }

This is something that those of us with experience with Lisp have been absolutely dying to have in a mainstream language for decades.

Closures are a closely related concept. A closure is a function value with one really interesting property. It closes over the environment in which it was declared. To understand what I mean, let’s look at a really simple example, copied from the Swift specification.

func makeIncrementor(amount: Int) -> () -> Int {
  var runningTotal = 0
  func incrementor() -> Int {
      runningTotal += amount
      return runningTotal
  return incrementor

This function returns a value which is, itself a function. The interesting thing is that the function can use any variable defined in any of the scopes enclosing its declaration. So the function incrementor can access the amount parameter and the runningTotal variable, even after the makeIncrementor function has returned. Since those are local to the invocation, each time you call makeIncrementor, it creates new variables, which aren’t shared with other invocations.

So let’s look at what happens when you run it:

let f = makeIncrementor(2)
let g = makeIncrementor(5)
f() 2
f() 4
g() 5
f() 6
g() 10

Anonymous functions make it really easy to work with first-class functions. You don’t need to write a function declaration and then return it the way the example above did. Any time you need a function value, you can create it in-place as an expression.

func makeAnonIncrementor(amount: Int) -> () -> Int {
  var runningTotal = 0
  return {
    runningTotal += amount
    return runningTotal

If the function anonymous function takes parameters, you declare them before the function body with “in”:

  sort_generic(mylist, { (x: Int, y: Int) -> Bool in return x > y})

Currying, finally, is a shorthand way of writing function values. The idea is that if you have a two-parameter function, you can rewrite it as a one parameter function that returns another one parameter function. That sounds confusing until you see it:

  func add_numbers(x: Int, y: Int) -> Int {
      return x + y

  func curried_add_numbers(x: Int) -> Int -> Int {
    return { (y: Int) -> Int in return x + y }

If I want to add 3+2, I can either call add_numbers(3, 2), or curried_add_numbers(3)(2): they do the same thing.

Swift provides a special way of declaring curried functions:

  func better_curried_add_numbers(x: Int)(y: Int) -> Int {
    return x + y

Pattern Matching

Pattern matching isn’t, strictly speaking, a functional language thing; but it’s an idea that was introduced to most programmers by its inclusion in functional programming languages. The idea is simple: you can write assignments or conditionals that automatically decompose a data structure by matching the structural pattern of that data structure.

As usual, that becomes a lot clearer with an example.

    let (x, y, z) = (1, 2.7, "three")

The right hand side of that assignment structure is a tuple with three values. The first one is an integer, the second is a floating point number, and the third is a string. The left-hand side of the assignment has exactly the same structure – so the Swift compiler will match the pieces. That’s equivalent to:

    let x = 1
    let y = 2.7
    let z = "three"

That’s particularly useful for multiple return values. Strictly speaking, Swift doesn’t really have multiple return values: every function returns exactly one value. But that one value may be a tuple containing multiple values. So you can write
things like:

  let result, error_core = my_function(parameters)

Pattern matching also happens in switch statements, where each branch of the switch can be a different pattern, as we saw earlier in the ParseResult example.

All of this stuff means that I can write really beautiful functional code in Swift. And that’s a really, really good thing.

The Warts

For the most part, I’ve been saying nice things about Swift. But it’s not all goodness. There are some problems, and some of them are pretty big.

The biggest thing is: no concurrency, no threading, no locks, no message passing. There’s absolutely no mention of concurrency at all. These days, that’s downright shocking. My phone has 4 CPU cores. Every machine that this language is intended to program, from the iPhone to the iPad to the Macintosh, has multiple CPUs, and programs for them need to deal with concurrency. But there’s not a shred of support in the language. That is, to put it mildly, absolutely insane. You can, of course, hack concurrency in via libraries, but most recent languages – Rust and Go come to mind – make concurrency a fundamental goal for a really good reason: concurrency is hard to get right, and concurrency hacks come at a significant cost in efficiency, correctness, and optimizibility. (Even older languages like Java have concurrency deeply embedded in the structure of the language for exactly that reason; leaving it out of swift is just shocking.)

Another big one is exception handling. Again, there’s absolutely nothing in Swift. This is particularly surprising because Objective-C had an exception handling system retrofitted into it. The libraries that Swift needs to be able to interact with use an ugly macro-based exception handling system – why is there nothing for it in Swift?

There are also a fair number of smaller issues:

  • The syntax is decidedly odd at times. Not a big deal. But there are some particular syntactic warts. There’s a range expression, which is fantastic. In fact, there are two, and they’re really hard to tell apart. 1..4 is a shorthand for (1, 2, 3); 1...4 is a shorthand for (1, 2, 3, 4). I predict much pain from off-by-one errors caused by using the wrong one.
  • The different ways of declaring objects have seemingly arbitrary rules. A struct is allowed to have class variables; a class can’t. Why? I have no idea; the limitation is simple stated in the docs as a fact.
  • Optionals aren’t as powerful as they could be. If you look at a language like Scala, optionals are implemented using the same type semantics as anything else: they’re a parametric type, Option[T]. That makes it possible to do a bunch of really nice stuff without needing to have any special cases wired into the language. In Swift, they’re a special case, and limits in the type system make it much harder to do many things. As a result, there’s a bunch of special case syntax for things like “option chaining” and forced deferencing. This isn’t a huge deal, but it’s frustrating.

And the big problem at the moment: the implementation is incredibly buggy. Not just a little bit buggy: virtually every code example in this article crashed the compiler. Obviously, it’s very early days for Swift, so it shouldn’t surprise anyone that the implementation is immature and buggy, but the degree of bugginess is astonishing. Examples from Apple’s own book on Swift crash the compiler! They’re claiming that with the release of OSX Yosemite this fall that you should be able to start writing real applications using Swift. They’ve got a long way to go, and not a lot of time if that’s going to be true.


This article, despite being ridiculously long, still only really scratches the surface of Swift. I haven’t talked about how Swift does arrays and dictionaries, which is really interesting. I haven’t talked much about syntax. I barely touched on pattern matching and options. I didn’t mention the module system at all. I didn’t talk about named parameters, which are a bit weird, but nice. There’s so much more to Swift than I could possible talk about here.

The thing to take away, then, is that overall, Swift is very exciting. Functional/Object-Oriented hybrid programming, with a strong parametric type system, and first-class interactivity in a systems programming language! I’m really looking forward to using it. I’m desperately hoping that Apple will open-source the compiler, because I want to be able to use Swift for lots of things, not just Macintosh programming. I don’t have super-high hopes for it, but it’s definitely possible, and I expect that open-source people are probably already working on a GCC front-end.

As an experiment, I’ve been trying to port my parser combinators to Swift. But at the moment, the Swift compiler chokes on the code. When I manage to debug my code/Apple manages to debug their compiler enough to make it work, I’ll post an update with that code, and more info about what it’s really like to write Swift code.

Parser Combinators Part 2: This time with types!

If you didn’t read my first post on parser combinators, I highly recommend going back and reading it. This post assumes both that you understand parser combinators, and that you’re already familiar with the way I implemented parser combinators in Python.

All of the code discussed in this post is available in a Github repos. It’s still not anything that I’d seriously use for anything important, but it’s good enough to explain stuff.

Continue reading