First postulated in 1931, the Big Bang has been the standard theory of the origin and structure of the universe for 50 years. In my opinion, (the opinion of a TV comedy writer, stripper and bar bouncer who does physics on the side) the Big Bang is about to collapse catastrophically, and that’s a good thing.
According to Big Bang theory, the universe exploded into existence from basically nothing 13.7-something billion years ago. But we’re at the beginning of a wave of discoveries of stuff that’s older than 13.7 billion years.
We’re constantly learning more about our universe, how it works, and how it started. New information isn’t necessarily a catastrophe for our existing theories; it’s just more data. There’s constantly new data coming in – and as yet, none of it comes close to causing the big bang theory to catastrophically collapse.
The two specific examples cited in the article are:
The problem with both of these arguments against the big bang is: they’re vague. They’re both handwavy arguments made about crude statements about what “should” be possible or impossible according to the bing bang theory. But neither comes close to the kind of precision that an actual scientific argument requires.
Scientists don’t use math because they like to be obscure, or because they think all of the pretty symbols look cool. Math is a tool used by scientists, because it’s useful. Real theories in physics need to be precise. They need to make predictions, and those predictions need to match reality to the limits of our ability to measure them. Without that kind of precision, we can’t test theories – we can’t check how well they model reality. And precise modelling of reality is the whole point.
The big bang is an extremely successful theory. It makes a lot of predictions, which do a good job of matching observations. It’s evolved in significant ways over time – but it remains by far the best theory we have – and by “best”, I mean “most accurate and successfully predictive”. The catch to all of this is that when we talk about the big bang theory, we don’t mean “the universe started out as a dot, and blew up like a huge bomb, and everything we see is the remnants of that giant explosion”. That’s an informal description, but it’s not the theory. That informal description is so vague that a motivated person can interpret it in ways that are consistent, or inconsistent with almost any given piece of evidence. The real big bang theory isn’t a single english statement – it’s many different mathematical statements which, taken together, produce a description of an expansionary universe that looks like the one we live in. For a really, really small sample, you can take a look at a nice old post by Ethan Siegel over here.
If you really want to make an argument that it’s impossible according to the big bang theory, you need to show how it’s impossible. The argument by Mr. Rosner is that the atoms in the dust in that galaxy couldn’t exist according to the big bang, because there wasn’t time for supernovas to create it. To make that argument, he needs to show that that’s true: he needs to look at the math that describes how stars form and how they behave, and then using that math, show that the supernovas couldn’t have happened in that timeframe. He doesn’t do anything like that: he just asserts that it’s true.
In contrast, if you read the papers by the guys who discovered the dust-filled galaxy, you’ll notice that they don’t come anywhere close to saying that this is impossible, or inconsistent with the big bang. All they say is that it’s surprising, and that we made need to revise our understanding of the behavior of matter in the early stages of the universe. The reason that they say that is because there’s nothing there that fundamentally conflicts with our current understanding of the big bang.
But Mr. Rosner can get away with the argument, because he’s being vague where the scientists are being precise. A scientist isn’t going to say “Yes, we know that it’s possible according to the big bang theory”, because the scientist doesn’t have the math to show it’s possible. At the moment, we don’t have sufficient precise math either way to come to a conclusion; we don’t know. But what we do know is that millions of other observations in different contexts, different locations, observed by different methods by different people, are all consistent with the predictions of the big bang. Given that we don’t have any evidence to support the idea that this couldn’t happen under the big bang, we continue to say that the big bang is the theory most consistent with our observations, that it makes better predictions than anything else, and so we assume (until we have evidence to the contrary) that this isn’t inconsistent. We don’t have any reason to discard the big bang theory on the basis of this!
Mr. Rosner, though, goes even further, proposing what he believes will be the replacement for the big bang.
The theory which replaces the Big Bang will treat the universe as an information processor. The universe is made of information and uses that information to define itself. Quantum mechanics and relativity pertain to the interactions of information, and the theory which finally unifies them will be information-based.
The Big Bang doesn’t describe an information-processing universe. Information processors don’t blow up after one calculation. You don’t toss your smart phone after just one text. The real universe – a non-Big Bang universe – recycles itself in a series of little bangs, lighting up old, burned-out galaxies which function as memory as needed.
In rolling cycles of universal computation, old, collapsed, neutron-rich galaxies are lit up again, being hosed down by neutrinos (which have probably been channeled along cosmic filaments), turning some of their neutrons to protons, which provides fuel for stellar fusion. Each calculation takes a few tens of billions of years as newly lit-up galaxies burn their proton fuel in stars, sharing information and forming new associations in the active center of the universe before burning out again. This is ultra-deep time, with what looks like a Big Bang universe being only a long moment in a vast string of such moments across trillions or quadrillions of giga-years.
This is not a novel idea. There are a ton of variations of the “universe as computation” that have been proposed over the years. Just off the top of my head, I can rattle off variations that I’ve read (in decreasing order of interest) by Minsky (can’t find the paper at the moment; I read it back when I was in grad school), by Fredkin, by Wolfram, and by Langan.
All of these theories assert in one form or another that our universe is either a massive computer or a massive computation, and that everything we can observe is part of a computational process. It’s a fascinating idea, and there are aspects of it that are really compelling.
For example, the Minsky model has an interesting explanation for the speed of light as an absolute limit, and for time dilation. Minksy’s model says that the universe is a giant cellular automaton. Each minimum quanta of space is a cell in the automaton. When a particle is located in a particular cell, that cell is “running” the computation that describes that particle. For a particle to move, the data describing it needs to get moved from its current location to its new location at the next time quanta. That takes some amount of computation, and the cell can only perform a finite amount of computation per quanta. The faster the particle moves, the more of its time quantum are dedicated to motion, and the less it has for anything else. The speed of light, in this theory, is the speed where the full quanta for computing a particle’s behavior is dedicated to nothing but moving it to its next location.
It’s very pretty. Intuitively, it works. That makes it an interesting idea. But the problem is, no one has come up with an actual working model. We’ve got real observations of the behavior of the physical universe that no one has been able to describe using the cellular automaton model.
That’s the problem with all of the computational hypotheses so far. They look really good in the abstract, but none of them come close to actually working in practice.
A lot of people nowadays like to mock string theory, because it’s a theory that looks really ogood, but has no testable predictions. String theory can describe the behavior of the universe that we see. The problem with it isn’t that there’s things we observe in the universe that it can’t predict, but because it can predict just about anything. There are a ton of parameters in the theory that can be shifted, and depending on their values, almost anything that we could observe can be fit by string theory. The problem with it is twofold: we don’t have any way (yet) of figuring out what values those parameters need to have to fit our universe, and we don’t have any way (yet) of performing an experiment that tests a prediction of string theory that’s different from the predictions of other theories.
As much as we enjoy mocking string theory for its lack of predictive value, the computational hypotheses are far worse! So far, no one has been able to come up with one that can come close to explaining all of the things that we’ve already observed, much less to making predictions that are better than our current theories.
But just like he did with his “criticism” of the big bang, Mr. Rosner makes predictions, but doesn’t bother to make them precise. There’s no math to his prediction, because there’s no content to his prediction. It doesn’t mean anything. It’s empty prose, proclaiming victory for an ill-defined idea on the basis of hand-waving and hype.
Boing-Boing should be ashamed for giving this bozo a platform.
]]>The problem that they looked at involves the logistics of rapid delivery by services like Federal Expression.
When Fedex was founded, they built their overnight delivery business on a new model of how to handle deliveries, called a hub system.
What most delivery businesses traditionally did was take packages in at a local warehouse, and sort them into groups depending on the warehouse that serviced their destination. Each group was then shipped separately, from the source warehouse to the destination warehouse. To do overnight delivery, each warehouse, then, would need to have a full package-sorting system, and they would need to to send a shipment to every other warehouse every single day. For warehouses, that means that they needed full sorting systems, and shipments every day. The costs of the redundant shipping systems and the massive number of shipments are huge, and completely non-scalable. Imagine a system with just 100 local warehouses – that’s only 2 for each state! – you’d need to have 10,000 shipments per day. In practice, the number of fedex warehouses just in the US is considerably larger than just 100: it simply can’t work.
What Fedex chose to do instead was build a hub system. Every package that arrives at a local warehouse gets shipped to the same place, called the hub. The only sorting system is at the hub – individual local warehouses don’t sort, they just ship to the hub. The hub sorts according to destination warehouse, and then ships the packages to the appropriate warehouse for delivery.
The Fedex hub system solved both of the problems of the traditional approach. The individual local shipping points don’t need to do sorting: they just ship to the hub, which takes care of all of it. And since every package goes to the hub, instead of needing shipments, they only needed shipments per day: shipments of unsorted packages from local warehouses to the hub, and shipments of sorted packages from the hub to the local warehouses for local deliveries.
How well this system works is very dependent on the proper location of the hub. To demonstrate why, consider the worst possible hub location for American shipments: Hawaii. Now, every package from anywhere in the mainland to anywhere else in the mainland needs to travel at least an extra 2500 miles to the hub, and an extra 2500 miles back! The cost of shipping it dominated by fuel costs – adding any extra distance to the average shipment costs fuel, and thus money. So the location of hub has a dramatic effect on the shipment costs.
FedEx located their hub in Memphis, Tennessee, because according to its founders, Memphis was close to the “center” of the shipping region for its original target market cities, and thus minimized the total shipping distances. (That wasn’t their sole concern; another factor was that Memphis is in an area that is rarely incapacitated by weather or other natural disasters – but the travel distance was the biggest factor.)
What this paper does is take the Fedex problem, and generalize it. What, exactly, does it mean for a given point to be an ideal hub? Given a network of points in a Euclidean space, how can you compute the ideal hub? And finally, if we look at the actual Federal Express distribution system, using great-circle routing between points, how close is Memphis to being the ideal hub?
If you look at the problem naively, it seems like it should be simple. The ideal hub should just be the “average” location. That’s basically what the FedEx founders said – they picked the geographical center of their original markets. Unfortunately, reality is rarely kind enough to make things fit our intuitions, and that’s not quite right.
To see why, let’s first look at the definition of the ideal hub. What we want to do is to minimize the total distance from local shipping points to the hub. To keep the notation simple, we’ll write the problem down as if we were looking at the one-dimensional version of the problem. Each location is a single value, and we’ll write the distance between two points and as the absolute value of : .
If the set of points in the network are called , then the total distance value for a hub at is given by the function:
Since the 2 is a constant factor that affects all x-values easily, for the purposes of simplicity, we can drop it out. So the problem of selecting an ideal hub ultimately reduces to an optimization problem: find the value where is a minimum.
Again, naively, it seems as though the average is best. But we can see that it’s not with a simple example. Let’s just look at a one-dimensional case – a list of locations scattered along a line: {1, 2, 3, 5, 13, 14, 60}. The point that’s at the average is 14, with a total one-way travel distance of 14 + 12 + 11 + 9 + 0 + 1 + 46 = 93. But by doing some brute-force, we can see that the total distance for is just 4 + 3 + 2 + 0 + 8 + 9 + 55 = 81. So the mean isn’t the optimal hub location. In fact, you can prove that in the one-dimensional case, the optimal hub is the median location, not the mean.
In this example, the average isn’t far off – but even in a quick off-the-cuff example, you can see that average isn’t the answer: the mean can be quite different from the median, particularly when you’ve got a lot of clustering, or a couple of outliers. When you extend to more than one dimension, then things can get bad pretty quickly.
The general problem has been looked at in many forms. It’s known, variously, as the Fermat-Torricelli problem, the Weber problem, or the single facility location problem.
Unfortunately, getting an answer to the problem is a lot harder than just naming it. According to the authors of this paper, even the fact of the existence of a unique ideal hub isn’t trivial. Despite all of the people who’ve worked on the problem, they couldn’t find a single complete proof of it! They go ahead and provide one. They prove that:
For any finite set of non-collinear points in any euclidean space ,
there is an idea hub, located within the convex hull of the set of points.
The convex hull qualifier there deserves a bit of explanation. It’s an interesting idea that ends up coming up in a bunch of different mathematical contexts. The easiest way to explain it is by the easiest method of finding the convex hull in two dimensions: suppose you’ve got a bunch of points on a plane. Draw the plane on graph paper, and put a nail at the location of each point. Take a rubber band, and stretch it so that every nail on the graph is inside of the rubber band. Let go. The rubber band will end up touching some of the nails; some will be inside, but untouched. The rubber band is the convex hull of the points. It’s basically a general notion of the smallest blob that contains all of the points in the set. The authors present a short, clean proof that the ideal hub location will always be within the “blob” of the points.
The paper goes into some depth in discussing some simple cases of the ideal hub, and how they can be calculated geometrically. But for the general problem, the geometric approaches don’t work. To get the ideal hub for a large collection of points, they’re left with brute force. In the brute force method, you just take every point in the network, and compute its value. Since it’s a finite network, you’ll eventually find the ideal hub. (You can, of course, apply
some nice heuristics to simplify the problem: the set of candidates that might be optimal hubs is going to be a lot smaller than the set of all ponits in the network!) But still, computing the optimal hub is quite complicated – in the number of points in the network.
After all of that work, where is the ideal hub? How different is it from the hub that Federal Express actually uses?
As it turns out, Fedex isn’t too far off, but it’s definitely non-optimal. Using US census data, combined with great-circle distances between the census tracts, the ideal hub is in central Indiana. FedEx’s hub is 315 miles off, to the south-south-west of the ideal location.
Interestingly, when UPS (probably FedEx’s biggest rival) set up a hub system, their hub is in Indianapolis – just 85 miles away from the ideal hub.
To me, the most interesting thing about the paper comes after they showed the ideal hub by brute force. If you’ve got more than four points – even just five! – in a two-dimensional space, there’s no simple geometric solution to find the ideal hub. It seems like it should be simple, but it isn’t. The proof of that is really fascinatingly simple: it comes down to group theory and symmetry. Read the paper for the details. A geometric solution is limited by the difference in dimensionality between a particular one-dimensional rotational symmetry group, and the number of objects in the network. As a result, for 4 or less non-collinear points, there’s a way of finding a single unique geometric configuration that can be exploited to characterize the ideal hub. For more than 4 points, there isn’t – there’s no single configuration anymore, which blows away the usefulness of geometric solutions.
If you find this at all interesting, then it’s worth downloading and reading the paper. There’s a whole lot their that I haven’t covered – more than what I have. But I hope this little taste is enough to pique your curiosity.
]]>What that means is that we’re going to take logical statements, and view them as computation tasks – that is, as formal logical specifications of a computation that we’d like to do. Under that interpretation, a proof of a statement is an implementation of the task specified by .
This interpretation is much, much easier for computer science types like me than the set-based interpretations. We can walk through the interpretations of all of the statements of our intuitionistic logic in just a few minutes.
We’ll start with the simple statements.
Now, we can move on to quantified statements. They get a bit more complicated, but if you read the quantifier right, it’s not bad.
To me, the easiest way to understand this is to expand the quantifier. A quantified statement can be read as . If you read it that way, and just follow the computational interpretation of implication, you get precisely the definition above.
This is the perspective from which most of Martin-Loff’s type theory pursues things. There’s a reason why ML’s type theory is so well-loved by computer scientists: because what we’re really doing here is taking a foundational theory of mathematics, and turning it into a specification language for describing computing systems.
That’s the fundamental beauty of what Martin-Loff did: he found a way of formulating all of constructive mathematics so that it’s one and the same thing as the theory of computation.
And that’s why this kind of type theory is so useful as a component of programming languages: because it’s allowing you to describe the semantics of your program in terms that are completely natural to the program. The type system is a description of the problem; and the program is the proof.
With full-blown Martin-Loff type system, the types really are a full specification of the computation described by the program. We don’t actually use the full expressiveness of type theory in real languages – among other things, it’s not checkable! But we do use a constrained, limited logic with Martin-Loff’s semantics. That’s what types really are in programming languages: they’re logical statements! As we get deeper into type theory, well see exactly how that works.
]]>The root problem is really interesting, once you see what’s going on. We’re taking a statement that, on the face of it, isn’t about sets. Then we’re appyling a set-based interpretation of it, and looking at the subset relation. That’s all good. The problem is that when we start looking at a set-based interpretation, we’re doing what we would do in classical set theory – but that’s a different thing from what we’re doing here. In effect, we’re changing the statement.
For almost all of us, math is something that we learned from the perspective of axiomatic set theory and first order predicate logic. So that’s the default interpretation that we put on anything mathematical. When you talk about a a proposition as a set, we’re programmed to think of it in that classical way: for any set , there’s a logical predicate such that by definition, . When you see in a set-theory context, what you think is something like . Under that intepretation, the idea that is equivalent to is absolutely ridiculous. If you follow the logic, implication must be the reverse of the subset relation!
The catch, though, is that we’re not talking about set theory, and the statement that we’re looking at is emphatically not . And that, right there, is the root of the problem.
always means – it doesn’t matter whether we’re doing set theory or type theory or whatever else. But in set theory, when we talk about the intepretation of as a set, right now, in the world of type theory, we’re talking about a different set.
Super set doesn’t suddenly mean subset. Implication doesn’t start working backwards! and yet, I’m still trying to tell you that i really meant it when i said that superset meant implication! how can that possibly make sense?
In type theory, we´re trying to take a very different look at math. In particular, we’re building everything up on a constructive/computational framework. So we’re necessarily going to look at some different interpretations of things – we’re going to look at things in ways that just don’t make sense in the world of classical set theory/FOPL. We’re not going to contradict set theory – but we’re going to look at things very differently.
For example, the kind of statement we’re talking here about is a complete, closed, logical proposition, not a predicate, nor a set. The proposition is a statement like “‘hello’ has five letters”.
When we look at a logical proposition , one of the type theoretic interpretations of it is as a set of facts: can be viewed as the set of all facts that can be proven true using . In type theory land, this makes perfect sense: if I’ve got a proof of , then I’ve got a proof of everything that can prove. isn’t a statement about the items in s proof-set. is a logical statement about something, and the elements of the proof-set of are the things that the statement can prove.
With that in mind, what does mean in type theory? It means that everything provable using is provable using nothing but .
(It’s really important to note here that there are no quantifiers in that statement. Again, we are not saying . and are atomic propositions – not open quantified statements.)
If you are following the interpretation that says that is the set of facts that are provable using the proposition , then if , that means that everything that’s in must also be in . In fact, it means pretty much exactly the same thing as classical superset. is a set of facts provable by the statement . The statement is provable using the statement – which means that everything in the provable set of must, by definition! be in the provable set of .
The converse doesn’t hold. There can be things provable by (and thus in the proof-set of ) which are not provable using . So taken as sets of facts provable by logical propositions, !
Again, that seems like it’s the opposite of what we’d expect. But the trick is to recognize the meaning of the statements we’re working with, and that despite a surface resemblance, they’re not the same thing that we’re used to. Type theory isn’t saying that the set theoretic statements are wrong; nor is set theory saying that type theory is wrong.
The catch is simple: we’re trying to inject a kind of quantification into the statement which isn’t there; and then we’re using our interpretation of that quantified statement to say something different.
But there’s an interpretation of statements in type theory which is entirely valid, but which trips over our intuition: our training has taught us to take it, and expand it into an entirely different statement. We create blanks that aren’t there, fill them in, and by doing so, convert it into something that it isn’t, and confuse ourselves.
]]>Lenovo, one of the largest makers of windows-based laptops, sold out its customers as part of one of the worst deliberate violations of computer security I’ve ever seen, by shipping a piece of software called Superfish pre-installed on its computers. Superfish is, with absolutely no exaggeration, one of the most serious, unethical, despicable things I’ve seen in quite a lot time. It’s appalling.
So what is it, and what’s the big deal?
We need to start with some background, and talk a bit about how secure connections work on the internet.
Every time that you visit a website with a secure connection (a URL that starts with https), you’re using a protocol called TLS (formerly SSL). TLS is designed to do two things:
The way that it does both of those is based on encryption. Every time you create a secure connect to a website, you’re exchanging credentials with the site to ensure that they’re who they say they are, and then based on those credentials, you establish an encryption key for the rest of your communication.
That connection-establishment process is the critical bit. You need to get some information that allows you to trust them to be who they claim to be. The security and integrity of everything that happens over the connection depends on the truth and integrity of that initial piece of identity verification.
The identity verification piece of TLS is built using public key cryptography, as part of a standard infrastructure for key maintenance and verification called X.509.
I’ve written about public key crypto before – see here. The basic idea behind public key crypto is that you’ve got two keys, called the public and private keys. Anything which is encrypted with the public key can only be decrypted with the private key; anything which is encrypted with the private key can only be decrypted using the public key. Your public key is available to anyone who wants it; no one but you has access to your private key.
If you receive a message from me, and you can decrypt it with my public key, then you know, without a doubt, that you can be sure that I’m the one who encrypted it. Only my private key could have encrypted a message that could then be decrypted with my public key.
For that to work, though, there’s one thing that you need to be sure of: you need to be absolutely sure that the public key that you’ve got is really my public key. If some clever person managed to somehow give you a different key, and convince you that it’s mine, then they can send you messages, and they’ll look exactly as if they came from me. If I handed you my public key on a USB thumbdrive, then you’re sure that the key came from me – but if you received it online, haw can you be sure that it was really me that gave it to you? Maybe someone switched it along the way?
In X.509, we use an idea of delegated trust. That is, we have some small collection of fundamental trusted authorities. Those authorities can issue public/private key pairs, so when someone need a public key, they can go to them and ask for it, and they’ll create one. The authority gives them a certificate, which is a copy of the new public key encrypted by the authority using their private key.
Now, when someone connects to a website, the target site can state who they are by sending a copy of the certificate. The client site recieves the certificate, decrypts it using the authorities public key, and then starts using that public key to encrypt their communications.
If the two sides can keep talking, then the client knows who it’s talking to. It got a public key, and it’s using that public key to talk to the server; so the server couldn’t decrypt the communication unless it had the public key; and it trusts that that it got the right public key, because it was encrypted with the private key of the certificate authority.
This is great as far as it goes, but it leaves us with a single certificate authority (or, at best, a small group). With billions of human users, and possibly trillions of networked devices, having a single authority isn’t manageable. They simple can’t produce enough keys for everyone. So we need to use our trust in the certificate authority to expand the pool of trust. We can do that by saying that if the certificate authority can declare that a particular entity is trustworthy, then we can use that entity itself as a verifier. So now we’ve taken a single trusted authority, and expanded that trust to a collection of places. Each of those newly trusted entities can now also issue new keys, and can certify their validity, by showing their certificate, plus the new encrypted public key. In general, anyone can issue a public key – and we can check its validity by looking at the chain of authorities that verified it, up to the root authority.
There’s a catch to this though: the base certificate providers. If you can trust them, then everything works: if you’ve got a certificate chain, you can use it to verify the identity of the party you’re talking to. But if there’s any question about the validity of the root certificate provider, if there’s any question whether or not you have the correct, valid public key for that provider, then you’re completely hosed. Ultimately, there’s some piece of seed information which you have to start off with. You need to accept the validity of an initial certificate authority based on some other mechanism. The people who sold you your computer, or the people who built your web browser, generally install a root certificate – basically the public key for a trusted certificate authority.
If that root certificate isn’t trustworthy, then nothing that results from it can be trusted. The untrustworthy root certificate can be used by an unscrupulous person to create new certificates allowing them to masquerade as anything that they want.
In particular, an untrustworthy root certificate it makes it easy to perform a man-in-the-middle attack. Suppose you want to talk to your bank, and somehow Joe Schmoe has planted a bad root certificate on your computer:
In the Lenovo fiasco, Lenovo installed a system called Superfish, which deliberately installs a bad root certificate, and then uses that root certificate to create man-in-the-middle attacks on every secure connection ever made with the computer.
Why does it do that? Purportedly for ad retargeting: it uses its man-in-the-middle to decrypt supposedly secure connections so that it can replace ads in the pages that you view. That way, Lenovo and Superfish get advertising money for the pages you view, instead of the page-owners.
It’s spectacularly despicable. It’s fundamentally compromising everything you do with your computer. It’s doing it in a way that you can’t detect. And it’s doing it all in a sleazy attempt to steal advertising money.
Based on this, I’ve got two pieces of advice:
This is, by far, the worst thing that I’ve ever seen a computer manufacturer do. They deserve to be run out of business for this.
]]>But Paxos does have its problem.
In short: Paxos has more complexity than it needs, and despite that, it needs to be tweaked to be really useful, and getting those tweaks right is hard. There are, sadly, a lot of incorrect Paxos implementations – and their incorrectness has all-too-often come as a surprise to the people who rely on them.
To avoid those problems, there are other consensus algorithms out there. In this post, we’re going to look at one of the Paxos competitors – a consensus algorithm/protocol called raft.
Raft does away with the role complexity of Paxos. In Raft, you have a collection of cooperating agents. There are no distinct proposers, acceptors, or learners: there are just servers. Communication between the servers in raft is done entirely with synchronous remote procedure calls.
In Raft, the target of consensus is a log containing a sequence of events. The log is the history of the distributed system. The goal of raft is that the log be maintained in a consistent state throughout the raft network. Just like in Paxos, if we have servers, up to can fail without the network losing its consistency.
Raft is designed in terms of remote procedure calls between the elements of the network. In Raft, we never talk about single messages – every communication between servers is a pair of messages: a request from caller to callee; and a response from callee to caller. When a message gets lost, we’ll just talk about it as a failed remote procedure call.
Within a Raft network, at any given time, each server has a state. It can be a follower, a leader, or a candidate. Within the network, there is at most one leader. When there is a leader, all of the other servers are in the follower state. The followers are almost entirely passive. Followers don’t talk to clients at all – they just wait for RPCs from the leader. The leader is the only participant that’s allowed to talk to clients: any client request must go through the current leader. The leader is also the only server that’s allowed to add new entries to the consensus log.
Raft divides time into a sequence of terms. In each term, the servers in the raft network need to select a leader using a process called an election. Raft is a strong leader protocol – no interactions with a client can take place except through a leader. If there’s no leader, then we can’t process client requests without a leader.
So, to understand Raft, there’s three processes that we need to
understand:
In those processes, the servers have a collection of variables
that they use for the Raft protocol:
In each term, the Raft cluster needs to have a leader. The way that a leader is selected called election.
Elections are triggered by a term transition. When a server in the cluster decided that it needs to start a new term, it increments its term number, puts itself into the candidate state, and sends a RequestVote(term, candidateId)
RPC to each of the other servers in the cluster. This request asks the other servers in the cluster to select it as the leader. If it receives enough “yes” votes, it will become the leader.
When a server receives a RequestVote
RPC, it checks the term. If it’s smaller than the server’s current term, then it replies “No” – meaning that it cannot support the requestor as leader.
If the term in the request is greater that the receiver’s term, then the receiver cannot have voted in the new term. So it updates to the term from the request, and then it replies “Yes”.
If the term in the request equals the receiver’s term, then the receiver has already updated its term. If it’s already voted for someone else as leader, then it can’t support the requestor, so it replies “No”. If it hasn’t voted for a leader in the term, then it votes for the requestor, and replies “Yes”.
If the requester receives “Yes” votes from more than 1/2 of the cluster (counting itself), then it becomes the leader, and starts both processing requests from clients, and sending heartbeats to the other servers in the cluster.
If it doesn’t receive enough votes, then it waits to see if anyone else becomes the leader and starts sending heartbeats. If it doesn’t get a heartbeat in time, then it starts over: it would increment its term again, and try to start a fresh election.
For a given server, term transitions happen in three ways:
We just spent a fair bit of time talking about leaders and elections. That’s almost beside the point. What we really want to do is just maintain a consistent log across the cluster of servers. Everything except creating log entries is just the book-keeping that’s necessary to make the consistent log work. The log itself is maintained using the AppendEntries
RPC call.
When a client request does something that alters the state of the cluster, the leader describes that change by adding an entry to the log. It builds a proposed log entry, and sends it to the other members of the cluster using an RPC. If it gets enough “Yes” votes from other cluster members, then the log entry becomes committed, and the leader updates its commitIndex
to the index of the new log entry to reflect that.
The RPC request takes a bunch of parameters:
When an AppendEntries
call is received by a follower, what it does is:
prevLogIndex
, or that entry’s term doesn’t match the request term
, then it rejects the request by replying “No”.prevLogIndex
from its log.leaderCommit
is greater than the commitIndex
on the receiver, then the receiver updates its commitIndex
. When a majority of the cluster members have accepted an AppendEntries
call, then the log entry gets committed.
The one part of this that’s confusing is how the logs get managed. The leader creates a new log entry, and sends it to the other servers. The complexity comes from dealing with cases where something doesn’t reach consensus.
For example, the leader sends entries 5, 6, and 7 to server S. S adds the entries to its copy of log – it now contains [1, 2, 3, 4, 5, 6, 7]. Meanwhile, the leader also sends those entries to server T, but the RPC to T fails due to a network fault. Another client request happens, and now the leader sends [5, 6, 7, 8] to S. S sees that it’s got entry 5 already: so it discards everything after 5, and then re-appends.
So the trailing segment of the log can change! How do we handle consensus?
The next time that the leader sends an AppendEntries
to a follower, it contains the leader’s commitIndex
. The follower updates its commit index to that value. Once it’s done that, any request from a leader that tries to modify anything that comes before that commit index will be rejected.
The consensus commit thus doesn’t really occur until the next heartbeat call after a log update.
That’s the basics of Raft.
In comparison to Paxos, there’s a couple of things to notice:
So is Raft really simpler than Paxos? I think that’s up for discussion. Personally, I prefer Paxos. There’s a lot of complexity hidden under the covers of the RPC system. It looks simple on the surface, but all of the complexity of message passing, lost messages, message duplication – it’s still there. It’s just been swept under the carpet, as if that really makes it easier.
The way that the logs get maintained is confusing. That’s inevitable: getting distributed knowledge is never easy. Raft at least makes that part of things explicit, whereas it’s a common part of Paxos implementations, but it’s not really specified in the protocol.
]]>Vaccines have been getting a lot of attention lately, for good reason. There are a lot of people in America who’ve bought in to a bogus line about the supposed danger of vaccines, and the supposed benign-ness of the diseases that they can prevent. That’s led to many children not getting vaccinated as they should, which has culminated in a recent outbreak of measles caused by a contagious but not yet symptomatic child at DisneyLand.
When it comes to vaccines, there’s two things that a lot of people don’t understand. One is herd immunity; the other is probability of infection.
Herd immunity is a really important concept. In an ideal world, if you got vaccinated against measles, there’d be no chance that you’d ever catch it. But it doesn’t work that way. What a vaccine does is reduce the probability that you’ll catch the disease. Due to lots of random factors involving the way that a given individual’s immune system works, the vaccine can’t be perfect. Beyond that, there are also many people who either can’t be immunized, or whose immune system is not functioning correctly. For example, people who are getting chemotherapy for cancer have severely depressed immune systems, and even if they’ve been immunized, their immune systems aren’t capable of preventing the disease.
So just relying on the fact that you’ve been immunized isn’t really enough. To prevent outbreaks of the disease, we rely on an emergent property of a vaccinated population. If enough people are immune to the disease, then even if one person somehow gets infected with it, they won’t be able to cause it to spread.
Let’s walk through a simple example. Suppose we’ve got a disease where the vaccine is 95% effective – that is, 95 out of every hundred people who received it are completely immune to infection by it. Let’s also suppose that this is a highly infectious disease: out of every 100 non-immune people who are exposed to it, 95 will become ill. If everyone is immunized, how many people need to be exposed to a sick person in order for the disease to spread?
Infections turn into outbreaks when the number of infected people grows – if each sick person infects more than one other person, then the infection will start to grow exponentially. The severity of the outbreak will depend on how many people get infected by each sick person.
Suppose that the first sick person has contact with 20 people while they’re contagious. 95% of them are immune – which means that only one out of that twenty is succeptible. There’s a 95% chance that that person will get infected. This isn’t good, but if it’s kept to that rate, we won’t have an outbreak: each sick person will probably infect one other person on average – and not always even that. So the infection will die out without exploding into a significant outbreak.
What happens in 5% of the population doesn’t get vaccinated? Then the pool of infected people grows to 10%. And in our contrived example, we now have a 90% probability of the sick person making two other people ill. That’s more than enough to cause a major outbreak! On average, each sick person will cause 1.8 other people to become sick!
When the population’s immunity rate (either through vaccine, or through prior infection) gets to be high enough that an infection can no longer spread, the population is said to have herd immunity: even individuals who can’t be immunized no longer need to worry about catching it, because the population doesn’t have the capacity to spread it around in a major outbreak.
(In reality, the effectiveness of the measless vaccine really is in the 95 percent range – actually slightly higher than that; various sources estimate it somewhere between 95 and 97 percent effective! And the infectivity of most diseases is lower than the example above. Measles (which is a highly, highly contagious disease, far more contagious than most!) is estimated to infect between 80 and 90 percent of exposed non-immune people. So if enough people are immunized, herd immunity will take hold even with more that 20 people be exposed by every sick person.)
Moving on: there’s a paradox that some antivaccine people use in their arguments. If you look at an outbreak of an illness that we vaccinate for, you’ll frequently find that more vaccinated people become ill than unvaccinated. Therefore, they say, it’s not the fault of the unvaccinated. We’ll look at the math to see the problem with that.
Let’s use the same numbers as above: 95% vaccine effectiveness, 95% contagion. In addition, let’s say that 2% of people choose to go unvaccinated.
That means thats that 98% of the population has been immunized, and 95% of them are immune. So now 92% of the population has immunity.
If each infected person has contact with 20 other people, then we can expect expect 8% of those 20 to be infectable – or 1.6; and of those, 95% will become ill – or 1.52. So on average, each sick person will infect 1 1/2 other people. That’s enough to cause a significant outbreak. Without the non-immunized people, the infection rate is less than 1 – not enough to cause an outbreak. The non-immunized population reduced the herd immunity enough to cause an outbreak.
Now, within the population, how many immunized versus non-immunized people will get sick?
Out of every 100 people, there are 5 who got vaccinated, but aren’t immune. Out of that same 100 people, there are 2% that didn’t get vaccinated, or 2. So we’d expect that in 100 cases of the disease, about 70 of them to be vaccinated, and 30 unvaccinated.
The vaccinated population is much, much larger – 50 times larger! – than the unvaccinated, so we’d expect more vaccinated people to become ill, even though it’s the smaller unvaccinated group that broke the herd immunity! The easiest way to see that is to take those numbers, and normalize them.
In the vaccinated community, those 70 sick people are – in the worst possible case, where every single non-immune vaccinated person became ill! – the 5% non-immune from a population of 1400 people. So the worst possible infection rate in the vaccinated population is just 5% – and in reality, it’s more like 4.75%. But those 30 sick people from the unvaccinated pool are 30 out of about 32 non-immunized people who were exposed. The unvaccinated people were more than 20 times more likely to be infected.
The reality of vaccines is pretty simple.
In short: everyone who is healthy enough to be vaccinated should get vaccinated. If you don’t, you’re a despicable free-riding asshole who’s deliberately choosing to put not just yourself but other people at risk.
]]>So let’s get back to a bit of Martin-Loff type theory! (The stuff I’m writing about today corresponds, roughly, to chapter 2 of the Nordstrom/Petersson/Smith text.)
One of the key ideas of Martin-Loff’s type theory is that a logical statement is exactly the same thing as a specification of a computation. When you define a predicate like “Even”, the definition specifies both the set of numbers that satisfy the predicate, and the computation that tests a number for membership in the set of even numbers. If you haven’t provided enough information to fully specify the computation, then in Martin-Loff type theory, you haven’t defined a predicate or a set.
When you say “2 is even”, what you’re really saying in terms of the type theory is that “The computation for ‘even(2)’ will succeed”. The computation and the logical statement are the same thing.
In functional programming, we like to say that the program is the proof. Martin-Loff type theory is where that came from – and today we’re going to take a first look in detail at exactly what it means. In the world of type theory, the program is the proof, and the proof doesn’t exist without the program.
This creates an interesting set of equivalent interpretations. When you see a statement like “x : T” (or ), that could be interpreted in the following ways, all of which are really equivalent in type theory.
In the rest of this post, I’m going to focus on those four interpretations, and explain how each of them makes sense in this version of type theory.
The set theoretic interpretation is obvious – as the name suggests, it’s nothing but what we all learned from basic set theory. An object is a member of a set – which means, from set theory, that the object satisfies some predicate in first order predicate logic – because that’s what it means to be a member of a set.
The intuitionistic interpretation is almost the same as the set theoretic, but rewritten for intuitionistic logic. In intuitionistic logic, the predicate over the set is written as a proposition , and if we know that is a member of the set , then that means that we have a proof that demonstrates that is true.
The computational interpretation takes the intuitionistic one, and rephrases it in computational terms. A logical proposition, rendered into a computational setting, is just a specification of a program; and a proof of the proposition is a program that satisfies the specification.
Finally, the abstract interpretation just rephrases the computational one into terms that aren’t tied to a computing device. A predicate is a problem that needs to be solved; anything that provides a solution to the problem is demonstrating a member of the set.
The key takeaway though is the basic idea here of what a type is. What we’re talking about as a type here is something that goes far beyond any programming language’s idea of what a type is. In intuitionistic type theory, a type is a specification of a computation. If we had this in a real language, that would mean that any program that typechecked would be guaranteed to work: asserting that has type means, precisely, that is a computation that matches the specification!
(Of course, that’s not the panacea that you might think the first time you hear that. The catch is simple: the type is a specification of the computation. That means that just writing a type is a form of programming! And that means that your type descriptions are going to have bugs. But we’ll come back to that in a much later post.)
What type theory is doing is taking something like set theory, and re-rendering it entirely in a computational world. It still has a mapping from the computations to the abstract concepts that we use when we’re thinking, but if we can talk about those abstract concepts in type theory, we’ll always do it by mapping them into some kind of computation.
In type theory, we’re not dealing in a world of pure mathematical objects that exist if we can describe them; instead, we’re building a world where everything is at least theoretically computable. That might seem constraining, but every proof already corresponds to a computation of some sort; the only additional constraint here is that we can’t play tricks like the axiom of choice, where we can “prove” the existence of some unattainable, intangible, nonsensical object.
To make that work, we’re going to take all of the construct that we’re used to seeing in intuitionistic logic, and give them a meaning in terms of computations.
For example, in set theory, we can have a statement – meaning that is a superset of , that every element of is also necessarity an element of . In type theory, since and are specifications of computations, that means that a member (or proof) of is a computation that given a proof of , generates a proof of – in short, that implies .
Now, suppose that we want to prove . How could we do that? We need a program that given a proof of generates a proof of . That is, we need an implementation of the identity function: .
In fact, using the computation interpretation of things, we can interpret as being the type of a function that takes an instance of , and generates an instance of – that is, that if , then is a function from an instance of to an instance of !
The only trick to that is understanding that in type theory, saying that is an element of means that is a proof of . Using the same interpretation, that means that means that is a proof of – which means the same thing as saying that given an example of an element of (a proof of ), will produce an element of (a proof of ). The statement is exactly the same thing as the logical implication , which is exactly the same thing as the type of a function from to .
Notes on sources: I’m working from two main references in this series of posts, both of which are available in PDF form online.
In addition, I first learned a lot of this from reading some papers by Phil Wadler and Simon Peyton Jones. The exactly references but the exact references to those are long-lost in the shadows of my memory. But any of their papers are well-worth reading, so just read them all!)
]]>
A heap is a structure designed to solve a common problem. You’ve got a collection of objects, each of which has an associated numeric value. You want, at any time, to be able to find and remove the largest value in the collection, and to be able to add new elements to it. Those two operations are the core of the heap. Some variations also allow you to increase the value of objects inside the heap, or to remove values other than the maximum.
There are a lot of different ways to implement a heap. One obvious one is to just maintain a sorted sequence of objects. The problem with that is performance: some of the common operations are painfully slow!
Using the sorted sequence approach, removing the largest value is easy: you just remove the last element of the sequence. That’s very fast: it’s constant time. But you also need to be able to add values to the heap, and that’s not so good.
There’s two basic ways of doing a sequence: an array, or a linked list. In both cases, the performance isn’t acceptable. If we used an array,then in order to add a new object to the collection, we’d need to:
In the linked list approach, inserting the value isn’t a problem – it’s a constant time operation. But finding the position where it should be inserted is linear time. So we’re still talking about linear time.
Similarly, we could use a linked list, where inserting the element is constant time, but then finding its position is – again, unacceptable.
The problem with the sorted sequence approach isn’t really related to the kind of structure we use to maintain the sorted list; the problem is that we’re maintaining more information that we need. At any time, we want to be able to find the largest element of the heap quickly – we don’t care about the relative positions of any pair of values that don’t include the largest element of the collection! But if we keep a sorted list, every time we insert an element, we’re spending a lot of time comparing things whose comparison we don’t really care about!
To be able to make it faster, we need to build a data structure that doesn’t waste time and effort computing and maintaining information that we don’t want.
So our goal is to find ways of building structures that always let us both find the largest element quickly, and add new elements quickly, without maintaining more information that is really necessary. We’ll start off with a simple but good version, and then work our way through to better ones.
The most basic implementation of a heap is called a binary heap. A binary heap is a binary tree with two key properties:
The left-full property might seem a bit strange, but it turns out to be pretty straightforward. A binary heap can be implemented using an array. The root node is stored in the first position of the array; its children are in positions 2 and 3; the children of node 2 are stored in positions 4 and 5; the childen of position 3 are stored in positions 6 and 7. Using one-based indices, for any node N, it’s children are stored in positions 2N and 2N+1. Adding a new leaf to the tree can always be done by just appending one value to the array. The left-full property just means that you always extend the array by adding an element onto the left.
Implementing a heap this way is simple:
I’m going to show code for this. For fun, I wrote the code in a language called xtend. Xtend is a Java extension that cleans up the syntax, gets rid of semicolons, improves the type system, adds lambdas, and does a few other really neat things.
The whole beast is just a wrapper around an array:
class BinHeap> { val ArrayList _contents new() { _contents = new ArrayList () } ... }
If you know Java, this is mostly clear. In xtend, you write constructors using the name “new” instead of the name of the class being constructed.
Then we’ll set up some utilities to make other stuff easier to write.
def leftChildPosition(int pos) { 2 * (pos + 1) - 1 } def rightChildPosition(int pos) { 2 * (pos + 1) } def int parentPosition(int pos) { if (pos == 0) { throw new MaxHeapException() } else { (pos + 1)/ 2 - 1 } } def void swap(int one, int two) { val T first = _contents.get(one) _contents.set(one, _contents.get(two)) _contents.set(two, first) }
Again, these should be straightforward. The only tricky thing is that the JVM uses zero-based arrays – so the left child of the node in position is : we need to add one to the node number to shift to one-based position; and then subtract one from the result to switch back to zero-based position. We do a similar thing for each of the other position computations.
Now we can get to the interesting bits. How do we get values into the heap?
def insert(T v) { val idx = _contents.size() _contents.add(v) bubbleUp(idx) }
Insert is exactly what I described in prose above: append the new value onto the end of the array, and then bubble it up. Bubbling is the interesting part:
private def void bubbleUp(int pos) { if (pos > 0) { val parentPos = parentPosition(pos) if (_contents.get(pos) > _contents.get(parentPos)) { swap(pos, parentPos) bubbleUp(parentPos) } } }
Bubbling up from a position compares to its parent. If it’s bigger than its parent, it swaps positions with the parent, and then tries to continue bubbling up from its new position.
For example, imagine we had a tree like:
9 8 5 4 0 6 3 7 2 1
Now, suppose we wanted to add the value “10” to this. We’d add 10 to the end of the array, which would make it a child of 6. That would give us:
9 8 5 4 0 6 3 10 7 2 1
So, we’d compare 10 to its parent – it’s bigger, so we’d swap:
9 8 5 4 0 10 3 6 7 2 1
Then we’d compare 10 to its new parent, 8. It’s bigger, so we swap:
9 10 5 4 0 8 3 6 7 2 1
And finally, we’d compare 10 to its new parent, 9. It’s bigger so we swap, and then we’re done.
10 9 5 4 0 8 3 6 7 2 1
Appending to the end of the array is constant time, so the dominant time cost is the bubbling. The maximum possible number of swaps in the doubling process is depth of the tree minus 1 – and the depth of a full binary tree with N members is . So it’s swaps, and the overall cost of inserts is .
Getting the largest value is trivial:
def getMax() { _contents.get(0) }
Removing the largest value is a lot like adding a value: we really play with the last element of the array, and then do a bubbling process – only this time we’ll bubble in the opposite direction:
def removeMax() { if (_contents.size == 0) { throw new MaxHeapException() } else { val result = getMax() val last = _contents.remove(_contents.size() - 1) if (_contents.size() > 0) { _contents.set(0, last) bubbleDown(0) } result } }
Bubbling down is similar to bubbling up, but it’s a bit more complicated, because we need to look at both children.
private def void bubbleDown(int pos) { val rightChildPos = rightChildPosition(pos) val leftChildPos = leftChildPosition(pos) if (leftChildPos >= _contents.size) { return } // Try to bubble left if there is no right child, or if the lift child is // bigger than the right. if (rightChildPos >= _contents.size || _contents.get(leftChildPos) > _contents.get(rightChildPos)) { if (_contents.get(pos) < _contents.get(leftChildPos)) { swap(pos, leftChildPos) bubbleDown(leftChildPos) } } else { // Try to bubble right if (_contents.get(pos) < _contents.get(rightChildPos)) { swap(pos, rightChildPos) bubbleDown(rightChildPos) } } }
The process is almost the same as bubbling up, but moving in the opposite direction. We're starting with a parent node, and comparing it to its children. If it's bigger than either of its children, then we swap it with the largest child, and then continue bubbling down.
For example, let's look at the same heap we looked at for insert:
9 8 5 4 0 6 3 7 2 1
If we want to remove 9, we set the value 9 aside, and then remove 3 from the end of the array, and put it at the root of the tree:
3 8 5 4 0 6 7 2 1
Then we'd compare 3 against its two children, 8 and 7. Since 8 is the larger child, we swap 8 for 3:
8 3 5 4 0 6 7 2 1
Now we compare 3 with its new children, 5 and 6. 6 is bigger, so we swap 6 with 3:
8 6 5 4 0 3 7 2 1
3 has no children, so we're done: it's bubbled down as far as it can go.
Note: I messed up this example in the original version of the post. Thanks to John Armstrong for pointing it out.
The cost here is the same as insert, for the same reason. The dominant cost is the bubbling, and the bubbling is bounded by the depth of the tree. So removing the maximum is also .
It's worth noting that heaps can be used to build a very reasonable sorting algorithm. To sort a collection, just insert all of the elements of the collection, and then remove them one by one. It's , and it's conceptually quite simple. It's not widely used, because the old classic quicksort is faster - not in big(0) notation, but it ends up with a smaller constant. (In big-O notation, something that takes 3(lg n) steps and something that takes 6(lg n) steps are both , but the one whose constant is 3 is still twice as fast as the one whose constant is 6.)
]]>A good example of that is something called paxos. My first exposure to paxos was very interesting. I’d just been hired by Google, and was working on their build tool. At the time, engineers in Google had a problem. Google’s codebase was contained in one massive version control repository. Doing things that way has a ton of really neat advantages – most importantly, the fact that it makes it really convenient to reuse code written by anyone else at the company. The problem was that code reuse can become very confusing. Project A reuses a bit of code written by people from project B. B’s code reused some stuff from C, and C from D, and D from E. So now project A is using code from project E, and they don’t know why!
In this case, I had someone from a storage project coming to me trying to figure out just why his system had a dependency on a plan9 database system called paxos. I had to built a tool that would allow people to ask questions like “Why does A depend on E?”.
As it turned out, paxos was a really important thing, and it was widely reused through the Google codebase. Once I learned about it, I started seeing it everywhere. Since I left Google nearly four years ago, I didn’t stop seeing it. It’s ubiquitous in distributed systems. Outside of Google, we weren’t using that friendly old plan9 paxos implementation – but the paxos model has been reimplemented dozens of times, because it’s so darned useful!
paxos is a system for managing consensus.
In distributed systems, there a collection of hard problems that you constantly need to deal with.
In a world where you can’t count on different agents seeing events in the same order, and where you expect things to be constantly failing, how can you make sure that any distributed system you build ends up with a consistent view of reality?
The answer is a consensus protocol. You need to create a mechanism based on communication between the copies of your system that allows them to mantain a consistent consensus of what the current state of the world is, even in the presence of failures of machines, storage systems, and communications.
paxos is a very elegant, reasonably simple consensus protocol.
Let’s get a bit more precise. Paxos is built on a model of storage. The original application of it was a consistent database, so it’s built around the idea of keeping data consistent. In paxos, the state of the storage is modelled as a sequence of transactions. Each transaction is a pair (t, v), where t is a numeric transaction identifier, and a v is a transaction value.
The state of the system being modelled is a sequence of transaction pairs, [(t_{i}, v_{i}), (t_{j}, v_{j}), ..], where the t values are increasing as you progress through the sequence. As time passes, new transaction pairs can be added to the state.
The focus of the paxos protocol is ensuring that in a collection of 2n+1 participants, all surviving participants will agree on the current value of the state, even if up to n participants fail, and even if messages can be delivered in arbitrary order.
Before I go further into the description of paxos, we need to look at the basic assumptions that underlie it. Like any formal model, it’s not defined in terms of real computers. It’s defined in terms of an abstraction that approximates reality. In this case, the approximation is quite good, but we still need to go through the basic assumptions that make up its model of the universe.
In short, everything can fail at any time; after failure, participants can recover and rejoin the system; any no part of the system acts in an actively adversarial way.
The protocol describes the behavior of the system in terms of a collection of roles. A participant can have more than one role in the system – in fact, in most implementations of paxos, all partipants do have multiple roles. The roles are:
In a typical paxos cluster, the client sends requests to a proposer. The proposer sends a proposal to update the state with the new client request, and attempts to convince a majority of the acceptors to accept it. Once a majority accepts it, the client request is processed by the learner, and a result is returned to the client.
The meat of paxos the protocol that the proposer gets a majority of acceptors to agree on a proposal, and how that protocol process ensures that the collection of acceptors maintains a consistent state.
The protocol itself is pretty simple. Each round is effectively independent, and consists of a process of attempting to reach consensus. Within each round, finding consensus is a two-phase process, where each phase consists of a message sent from a proposer to a group of acceptors, and a reply from the acceptors to the proposer.
Once it’s sent a promise message, it must not accept any request for a proposal with number less that N. Note though that this does not mean that the acceptor promises to accept the proposal: all it’s doing is promising not to accept any proposal with number less than N! If in receives a message Prepare(N+1), it’s free to promise that – but if it does, it will no longer be able to accept the proposal for N.
(If N is smaller that the number of any proposal promised or accepted by the acceptor, then in the original version of paxos, the acceptor does nothing; in some optimizations of the protocol, it replies Reject(n_{v}).)
What this phase does is allow a proposer to determine whether or not a new proposal is even worth considering. If a quorum (majority) of acceptors send promises, then it can move on to phase 2.
When a proposer recieves promises from a quorum of acceptors, then it moves forward to try to actually commit the proposal. In order to do this, it needs to choose a value for the proposal. If any of the Promise messages contained a value, then the value of this proposal must be set to the value of the highest proposal number in any of the promises. If all of the promises were empty, then the proposer can choose any value that it wants for the proposal.
Once the proposer has chosen a value, then it sends a message Accept!(N, V) to a quorum of acceptors. This is typically written with the exclamation point, because it’s really a command to the acceptors: they’re being told to accept the proposal, if they can.
When an acceptor receives an Accept!(N, v) message, if it has not issued a promise for a proposal with number greater than N, then it must accept the message. It accepts the proposal by sending a message Accepted(N, V) to both the original proposer, and all of the learners.
When Accepted messages have been received from a quorum of acceptors, the new value V becomes the consensus value for the paxos cluster, and the new proposal number N is fully committed.
As with so many things, this is easier to understand when you think about an example. One use of paxos that I’ve worked with is in a cluster scheduling service. In that system:
It’s a pretty simple thing – the core concept is simply that no consensus proposal is considered “committed” until it’s been accepted by a majority of the participants. And if it’s been accepted by a majority of the participants, that means that no conflicting proposal can ever reach consensus – because that would require at least one participant to accept 2 conflicting proposals.
But there’s still a bit of formality that’s working looking at. Exactly what guarantees does paxos give? What properties does paxos-style consensus have?
Even the formal properties of paxos are easy to understand. Paxos provides two key properties: validity, and agreement.
You an easily prove those two properties. In fact, the proof is completely obvious once you recognize that the paxos protocol has two invariants (and those invariants are themselves clear from the definition of the protocol!):
Getting back to the beginning: the point of all of this is to have a system in which we can be sure that things work correctly even in the presence of failures. In paxos, as long as at some point there was a quorum of machines that come to agreement, then any failure that leaves a surviving quorum of machines must have overlapped with the previous quorum – which means that the previous consensus still remains in effect, and will be propagated to the remaining participants. If you’ve got 5 machines, then two can fail, and you won’t lose consistency among the remaining ones.
]]>