The root problem is really interesting, once you see what’s going on. We’re taking a statement that, on the face of it, isn’t about sets. Then we’re applying a set-based interpretation of it, and looking at the subset relation. That’s all good. The problem is that when we start looking at a set-based interpretation, we’re doing what we would do in classical set theory – but that’s a different thing from what we’re doing here. In effect, we’re changing the statement.
For almost all of us, math is something that we learned from the perspective of axiomatic set theory and first order predicate logic. So that’s the default interpretation that we put on anything mathematical. When you talk about a proposition as a set, we’re programmed to think of it in that classical way: for any set , there’s a logical predicate such that by definition, . When you see in a set-theory context, what you think is something like . Under that interpretation, the idea that is equivalent to is absolutely ridiculous. If you follow the logic, implication must be the reverse of the subset relation!
The catch, though, is that we’re not talking about set theory, and the statement that we’re looking at is emphatically not . And that, right there, is the root of the problem.
always means – it doesn’t matter whether we’re doing set theory or type theory or whatever else. But in set theory, when we talk about the interpretation of as a set, right now, in the world of type theory, we’re talking about a different set.
Superset doesn’t suddenly mean subset. Implication doesn’t start working backwards! And yet, I’m still trying to tell you that I really meant it when I said that superset meant implication! How can that possibly make sense?
In type theory, we’re trying to take a very different look at math. In particular, we’re building everything up on a constructive/computational framework. So we’re necessarily going to look at some different interpretations of things – we’re going to look at things in ways that just don’t make sense in the world of classical set theory/FOPL. We’re not going to contradict set theory – but we’re going to look at things very differently.
For example, the kind of statement we’re talking about here is a complete, closed, logical proposition, not a predicate, nor a set. The proposition is a statement like “‘hello’ has five letters”.
When we look at a logical proposition , one of the type theoretic interpretations of it is as a set of facts: can be viewed as the set of all facts that can be proven true using . In type theory land, this makes perfect sense: if I’ve got a proof of , then I’ve got a proof of everything that can prove. isn’t a statement about the items in s proof-set. is a logical statement about something, and the elements of the proof-set of are the things that the statement can prove.
With that in mind, what does mean in type theory? It means that everything provable using is provable using nothing but .
(It’s really important to note here that there are no quantifiers in that statement. Again, we are not saying . and are atomic propositions – not open quantified statements.)
If you are following the interpretation that says that is the set of facts that are provable using the proposition , then if , that means that everything that’s in must also be in . In fact, it means pretty much exactly the same thing as classical superset. is a set of facts provable by the statement . The statement is provable using the statement – which means that everything in the provable set of must, by definition! be in the provable set of .
The converse doesn’t hold. There can be things provable by (and thus in the proof-set of ) which are not provable using . So taken as sets of facts provable by logical propositions, !
Again, that seems like it’s the opposite of what we’d expect. But the trick is to recognize the meaning of the statements we’re working with, and that despite a surface resemblance, they’re not the same thing that we’re used to. Type theory isn’t saying that the set theoretic statements are wrong; nor is set theory saying that type theory is wrong.
The catch is simple: we’re trying to inject a kind of quantification into the statement which isn’t there; and then we’re using our interpretation of that quantified statement to say something different.
But there’s an interpretation of statements in type theory which is entirely valid, but which trips over our intuition: our training has taught us to take it, and expand it into an entirely different statement. We create blanks that aren’t there, fill them in, and by doing so, convert it into something that it isn’t, and confuse ourselves.
Lenovo, one of the largest makers of Windows-based laptops, sold out its customers as part of one of the worst deliberate violations of computer security I’ve ever seen, by shipping a piece of software called Superfish pre-installed on its computers. Superfish is, with absolutely no exaggeration, one of the most serious, unethical, despicable things I’ve seen in quite a long time. It’s appalling.
So what is it, and what’s the big deal?
We need to start with some background, and talk a bit about how secure connections work on the internet.
Every time that you visit a website with a secure connection (a URL that starts with https), you’re using a protocol called TLS (formerly SSL). TLS is designed to do two things:
The way that it does both of those is based on encryption. Every time you create a secure connection to a website, you’re exchanging credentials with the site to ensure that they’re who they say they are, and then based on those credentials, you establish an encryption key for the rest of your communication.
That connection-establishment process is the critical bit. You need to get some information that allows you to trust them to be who they claim to be. The security and integrity of everything that happens over the connection depends on the truth and integrity of that initial piece of identity verification.
The identity verification piece of TLS is built using public key cryptography, as part of a standard infrastructure for key maintenance and verification called X.509.
I’ve written about public key crypto before – see here. The basic idea behind public key crypto is that you’ve got two keys, called the public and private keys. Anything which is encrypted with the public key can only be decrypted with the private key; anything which is encrypted with the private key can only be decrypted using the public key. Your public key is available to anyone who wants it; no one but you has access to your private key.
If you receive a message from me, and you can decrypt it with my public key, then you can be sure that I’m the one who encrypted it. Only my private key could have encrypted a message that could then be decrypted with my public key.
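To make that concrete, here’s a toy sketch of the idea using textbook RSA with tiny numbers. Everything in it (the primes, the exponents, the function names) is an assumption picked for illustration, not anything taken from a real TLS stack. Real systems use keys that are thousands of bits long, add padding, and rely on vetted crypto libraries.

```python
# Toy textbook RSA with tiny primes. Illustrative only: never use this for real security.
P, Q = 61, 53                  # hypothetical small primes
N = P * Q                      # modulus shared by both keys
PHI = (P - 1) * (Q - 1)
PUBLIC_EXP = 17                # public exponent, chosen coprime to PHI
PRIVATE_EXP = pow(PUBLIC_EXP, -1, PHI)  # modular inverse (needs Python 3.8+)


def encrypt_with_private(message: int) -> int:
    """Only the holder of the private exponent can produce this value."""
    return pow(message, PRIVATE_EXP, N)


def decrypt_with_public(ciphertext: int) -> int:
    """Anyone who has the public key can undo the private-key operation."""
    return pow(ciphertext, PUBLIC_EXP, N)


message = 1234                 # must be smaller than N
sealed = encrypt_with_private(message)
recovered = decrypt_with_public(sealed)
```

If `recovered` equals `message`, the receiver knows that the sender held the private key that matches the public key. A value that was altered in transit decrypts to something else.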
For that to work, though, there’s one thing that you need to be sure of: you need to be absolutely sure that the public key that you’ve got is really my public key. If some clever person managed to somehow give you a different key, and convince you that it’s mine, then they can send you messages, and they’ll look exactly as if they came from me. If I handed you my public key on a USB thumbdrive, then you’re sure that the key came from me – but if you received it online, how can you be sure that it was really me that gave it to you? Maybe someone switched it along the way?
In X.509, we use an idea of delegated trust. That is, we have some small collection of fundamental trusted authorities. Those authorities can issue public/private key pairs, so when someone needs a public key, they can go to them and ask for it, and they’ll create one. The authority gives them a certificate, which is a copy of the new public key encrypted by the authority using their private key.
Now, when someone connects to a website, the target site can state who they are by sending a copy of the certificate. The client receives the certificate, decrypts it using the authority’s public key, and then starts using that public key to encrypt their communications.
If the two sides can keep talking, then the client knows who it’s talking to. It got a public key, and it’s using that public key to talk to the server; so the server couldn’t decrypt the communication unless it had the matching private key; and it trusts that it got the right public key, because it was encrypted with the private key of the certificate authority.
This is great as far as it goes, but it leaves us with a single certificate authority (or, at best, a small group). With billions of human users, and possibly trillions of networked devices, having a single authority isn’t manageable. They simply can’t produce enough keys for everyone. So we need to use our trust in the certificate authority to expand the pool of trust. We can do that by saying that if the certificate authority can declare that a particular entity is trustworthy, then we can use that entity itself as a verifier. So now we’ve taken a single trusted authority, and expanded that trust to a collection of places. Each of those newly trusted entities can now also issue new keys, and can certify their validity, by showing their certificate, plus the new encrypted public key. In general, anyone can issue a public key – and we can check its validity by looking at the chain of authorities that verified it, up to the root authority.
There’s a catch to this though: the base certificate providers. If you can trust them, then everything works: if you’ve got a certificate chain, you can use it to verify the identity of the party you’re talking to. But if there’s any question about the validity of the root certificate provider, if there’s any question whether or not you have the correct, valid public key for that provider, then you’re completely hosed. Ultimately, there’s some piece of seed information which you have to start off with. You need to accept the validity of an initial certificate authority based on some other mechanism. The people who sold you your computer, or the people who built your web browser, generally install a root certificate – basically the public key for a trusted certificate authority.
If that root certificate isn’t trustworthy, then nothing that results from it can be trusted. The untrustworthy root certificate can be used by an unscrupulous person to create new certificates allowing them to masquerade as anything that they want.
In particular, an untrustworthy root certificate makes it easy to perform a man-in-the-middle attack. Suppose you want to talk to your bank, and somehow Joe Schmoe has planted a bad root certificate on your computer:
In the Lenovo fiasco, Lenovo installed a system called Superfish, which deliberately installs a bad root certificate, and then uses that root certificate to create man-in-the-middle attacks on every secure connection ever made with the computer.
Why does it do that? Purportedly for ad retargeting: it uses its man-in-the-middle to decrypt supposedly secure connections so that it can replace ads in the pages that you view. That way, Lenovo and Superfish get advertising money for the pages you view, instead of the page-owners.
It’s spectacularly despicable. It’s fundamentally compromising everything you do with your computer. It’s doing it in a way that you can’t detect. And it’s doing it all in a sleazy attempt to steal advertising money.
Based on this, I’ve got two pieces of advice:
This is, by far, the worst thing that I’ve ever seen a computer manufacturer do. They deserve to be run out of business for this.
But Paxos does have its problems.
In short: Paxos has more complexity than it needs, and despite that, it needs to be tweaked to be really useful, and getting those tweaks right is hard. There are, sadly, a lot of incorrect Paxos implementations – and their incorrectness has all-too-often come as a surprise to the people who rely on them.
To avoid those problems, there are other consensus algorithms out there. In this post, we’re going to look at one of the Paxos competitors – a consensus algorithm/protocol called Raft.
Raft does away with the role complexity of Paxos. In Raft, you have a collection of cooperating agents. There are no distinct proposers, acceptors, or learners: there are just servers. Communication between the servers in Raft is done entirely with synchronous remote procedure calls.
In Raft, the target of consensus is a log containing a sequence of events. The log is the history of the distributed system. The goal of Raft is that the log be maintained in a consistent state throughout the Raft network. Just like in Paxos, if we have 2n + 1 servers, up to n can fail without the network losing its consistency.
Raft is designed in terms of remote procedure calls between the elements of the network. In Raft, we never talk about single messages – every communication between servers is a pair of messages: a request from caller to callee; and a response from callee to caller. When a message gets lost, we’ll just talk about it as a failed remote procedure call.
Within a Raft network, at any given time, each server has a state. It can be a follower, a leader, or a candidate. Within the network, there is at most one leader. When there is a leader, all of the other servers are in the follower state. The followers are almost entirely passive. Followers don’t talk to clients at all – they just wait for RPCs from the leader. The leader is the only participant that’s allowed to talk to clients: any client request must go through the current leader. The leader is also the only server that’s allowed to add new entries to the consensus log.
Raft divides time into a sequence of terms. In each term, the servers in the Raft network need to select a leader using a process called an election. Raft is a strong leader protocol – no interactions with a client can take place except through a leader. If there’s no leader, then client requests can’t be processed.
So, to understand Raft, there’s three processes that we need to understand:
In those processes, the servers have a collection of variables that they use for the Raft protocol:
In each term, the Raft cluster needs to have a leader. The way that a leader is selected is called an election.
Elections are triggered by a term transition. When a server in the cluster decides that it needs to start a new term, it increments its term number, puts itself into the candidate state, and sends a RequestVote(term, candidateId) RPC to each of the other servers in the cluster. This request asks the other servers in the cluster to select it as the leader. If it receives enough “yes” votes, it will become the leader.
When a server receives a RequestVote RPC, it checks the term. If it’s smaller than the server’s current term, then it replies “No” – meaning that it cannot support the requestor as leader.
If the term in the request is greater than the receiver’s term, then the receiver cannot have voted in the new term. So it updates to the term from the request, and then it replies “Yes”.
If the term in the request equals the receiver’s term, then the receiver has already updated its term. If it’s already voted for someone else as leader, then it can’t support the requestor, so it replies “No”. If it hasn’t voted for a leader in the term, then it votes for the requestor, and replies “Yes”.
If the requester receives “Yes” votes from more than 1/2 of the cluster (counting itself), then it becomes the leader, and starts both processing requests from clients, and sending heartbeats to the other servers in the cluster.
If it doesn’t receive enough votes, then it waits to see if anyone else becomes the leader and starts sending heartbeats. If it doesn’t get a heartbeat in time, then it starts over: it increments its term again, and tries to start a fresh election.
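The voting rules above can be sketched in a few lines of Python. This is a minimal sketch that follows only the rules as described in this post; the class, field, and function names are made up for illustration. Repeating a “Yes” to a candidate that the server has already voted for in the same term is an assumption of mine, and the published Raft protocol also compares logs before granting a vote, which is left out here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    """Hypothetical per-server voting state."""
    server_id: str
    current_term: int = 0
    voted_for: Optional[str] = None

    def handle_request_vote(self, term: int, candidate_id: str) -> bool:
        """Return True for a "Yes" vote, False for a "No" vote."""
        if term < self.current_term:
            # Stale request: the candidate is behind this server's term.
            return False
        if term > self.current_term:
            # New term: this server cannot have voted in it yet.
            self.current_term = term
            self.voted_for = candidate_id
            return True
        # Same term: vote only if no vote has gone to a different candidate.
        if self.voted_for is None or self.voted_for == candidate_id:
            self.voted_for = candidate_id
            return True
        return False


def wins_election(yes_votes: int, cluster_size: int) -> bool:
    """A candidate needs more than half of the cluster, counting itself."""
    return yes_votes > cluster_size / 2
```

For example, a server at term 3 says “No” to a request for term 2, says “Yes” to the first request it sees for term 4, and then says “No” to a second candidate asking for term 4.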
For a given server, term transitions happen in three ways:
We just spent a fair bit of time talking about leaders and elections. That’s almost beside the point. What we really want to do is just maintain a consistent log across the cluster of servers. Everything except creating log entries is just the book-keeping that’s necessary to make the consistent log work. The log itself is maintained using the AppendEntries RPC call.
When a client request does something that alters the state of the cluster, the leader describes that change by adding an entry to the log. It builds a proposed log entry, and sends it to the other members of the cluster using an RPC. If it gets enough “Yes” votes from other cluster members, then the log entry becomes committed, and the leader updates its commitIndex to the index of the new log entry to reflect that.
The RPC request takes a bunch of parameters:
When an AppendEntries call is received by a follower, what it does is:
If its log doesn’t contain an entry at prevLogIndex, or that entry’s term doesn’t match the request term, then it rejects the request by replying “No”.
Otherwise, it discards everything after prevLogIndex from its log, and appends the new entries.
If leaderCommit is greater than the commitIndex on the receiver, then the receiver updates its commitIndex.
When a majority of the cluster members have accepted an AppendEntries call, then the log entry gets committed.
The one part of this that’s confusing is how the logs get managed. The leader creates a new log entry, and sends it to the other servers. The complexity comes from dealing with cases where something doesn’t reach consensus.
For example, the leader sends entries 5, 6, and 7 to server S. S adds the entries to its copy of the log – it now contains [1, 2, 3, 4, 5, 6, 7]. Meanwhile, the leader also sends those entries to server T, but the RPC to T fails due to a network fault. Another client request happens, and now the leader sends [5, 6, 7, 8] to S. S sees that it’s got entry 5 already: so it discards everything after 5, and then re-appends.
So the trailing segment of the log can change! How do we handle consensus?
The next time that the leader sends an AppendEntries to a follower, it contains the leader’s commitIndex. The follower updates its commit index to that value. Once it’s done that, any request from a leader that tries to modify anything that comes before that commit index will be rejected.
The consensus commit thus doesn’t really occur until the next heartbeat call after a log update.
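Here’s a rough sketch of the follower side of AppendEntries, following the simplified description in this post rather than the full published protocol. The names, the 1-based log indexes, and the rule that the commit index never moves past the end of the local log are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command): a hypothetical log entry shape


@dataclass
class Follower:
    log: List[Entry] = field(default_factory=list)
    commit_index: int = 0  # 1-based index of the last committed entry, 0 if none

    def handle_append_entries(self, prev_log_index: int, prev_log_term: int,
                              entries: List[Entry], leader_commit: int) -> bool:
        """Return True to accept the request ("Yes"), False to reject it ("No")."""
        # Reject if the entry that should precede the new ones is missing...
        if prev_log_index > len(self.log):
            return False
        # ...or if it is there but was written in a different term.
        if prev_log_index > 0 and self.log[prev_log_index - 1][0] != prev_log_term:
            return False
        # Anything at or before the commit index is settled and can't be rewritten.
        if prev_log_index < self.commit_index:
            return False
        # Discard the uncommitted tail after prev_log_index, then append.
        del self.log[prev_log_index:]
        self.log.extend(entries)
        # Follow the leader's commit index, but never past the end of our log.
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, len(self.log))
        return True
```

Replaying the example: a follower that holds entries 1 through 4 accepts [5, 6, 7], then accepts [5, 6, 7, 8] by dropping its old tail and re-appending. Once a later call carries a commit index of 8, a request that tries to rewrite anything after entry 4 is rejected.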
That’s the basics of Raft.
In comparison to Paxos, there’s a couple of things to notice:
So is Raft really simpler than Paxos? I think that’s up for discussion. Personally, I prefer Paxos. There’s a lot of complexity hidden under the covers of the RPC system. It looks simple on the surface, but all of the complexity of message passing, lost messages, message duplication – it’s still there. It’s just been swept under the carpet, as if that really makes it easier.
The way that the logs get maintained is confusing. That’s inevitable: getting distributed knowledge is never easy. Raft at least makes that part of things explicit, whereas it’s a common part of Paxos implementations, but it’s not really specified in the protocol.
Vaccines have been getting a lot of attention lately, for good reason. There are a lot of people in America who’ve bought in to a bogus line about the supposed danger of vaccines, and the supposed benign-ness of the diseases that they can prevent. That’s led to many children not getting vaccinated as they should, which has culminated in a recent outbreak of measles caused by a contagious but not yet symptomatic child at Disneyland.
When it comes to vaccines, there’s two things that a lot of people don’t understand. One is herd immunity; the other is probability of infection.
Herd immunity is a really important concept. In an ideal world, if you got vaccinated against measles, there’d be no chance that you’d ever catch it. But it doesn’t work that way. What a vaccine does is reduce the probability that you’ll catch the disease. Due to lots of random factors involving the way that a given individual’s immune system works, the vaccine can’t be perfect. Beyond that, there are also many people who either can’t be immunized, or whose immune system is not functioning correctly. For example, people who are getting chemotherapy for cancer have severely depressed immune systems, and even if they’ve been immunized, their immune systems aren’t capable of preventing the disease.
So just relying on the fact that you’ve been immunized isn’t really enough. To prevent outbreaks of the disease, we rely on an emergent property of a vaccinated population. If enough people are immune to the disease, then even if one person somehow gets infected with it, they won’t be able to cause it to spread.
Let’s walk through a simple example. Suppose we’ve got a disease where the vaccine is 95% effective – that is, 95 out of every hundred people who received it are completely immune to infection by it. Let’s also suppose that this is a highly infectious disease: out of every 100 non-immune people who are exposed to it, 95 will become ill. If everyone is immunized, how many people need to be exposed to a sick person in order for the disease to spread?
Infections turn into outbreaks when the number of infected people grows – if each sick person infects more than one other person, then the infection will start to grow exponentially. The severity of the outbreak will depend on how many people get infected by each sick person.
Suppose that the first sick person has contact with 20 people while they’re contagious. 95% of them are immune – which means that only one out of that twenty is susceptible. There’s a 95% chance that that person will get infected. This isn’t good, but if it’s kept to that rate, we won’t have an outbreak: each sick person will probably infect one other person on average – and not always even that. So the infection will die out without exploding into a significant outbreak.
What happens if 5% of the population doesn’t get vaccinated? Then the pool of non-immune people grows to about 10%. And in our contrived example, we now have a 90% probability of the sick person making two other people ill. That’s more than enough to cause a major outbreak! On average, each sick person will cause 1.8 other people to become sick!
When the population’s immunity rate (either through vaccine, or through prior infection) gets to be high enough that an infection can no longer spread, the population is said to have herd immunity: even individuals who can’t be immunized no longer need to worry about catching it, because the population doesn’t have the capacity to spread it around in a major outbreak.
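The arithmetic behind that example is easy to play with. The function below is a back-of-the-envelope sketch, not an epidemiological model: it assumes that every contact is equally likely to be immune, and the function and parameter names are my own. It works with exact fractions instead of the rounded figures used in the prose.

```python
def expected_secondary_infections(contacts: int,
                                  vaccinated_fraction: float,
                                  vaccine_effectiveness: float,
                                  infectivity: float) -> float:
    """Average number of people that one sick person is expected to infect."""
    immune_fraction = vaccinated_fraction * vaccine_effectiveness
    susceptible_fraction = 1.0 - immune_fraction
    return contacts * susceptible_fraction * infectivity


# Everyone vaccinated, versus 5% of the population skipping the vaccine.
everyone = expected_secondary_infections(20, 1.00, 0.95, 0.95)
five_percent_skip = expected_secondary_infections(20, 0.95, 0.95, 0.95)
```

With everyone vaccinated, the result stays just under 1, so the infection fizzles out. With 5% of people unvaccinated it comes out a little above 1.8, so the infection grows.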
(In reality, the effectiveness of the measles vaccine really is in the 95 percent range – actually slightly higher than that; various sources estimate it somewhere between 95 and 97 percent effective! And the infectivity of most diseases is lower than the example above. Measles (which is a highly, highly contagious disease, far more contagious than most!) is estimated to infect between 80 and 90 percent of exposed non-immune people. So if enough people are immunized, herd immunity will take hold even with more than 20 people being exposed by every sick person.)
Moving on: there’s a paradox that some antivaccine people use in their arguments. If you look at an outbreak of an illness that we vaccinate for, you’ll frequently find that more vaccinated people become ill than unvaccinated. Therefore, they say, it’s not the fault of the unvaccinated. We’ll look at the math to see the problem with that.
Let’s use the same numbers as above: 95% vaccine effectiveness, 95% contagion. In addition, let’s say that 2% of people choose to go unvaccinated.
That means that 98% of the population has been immunized, and 95% of them are immune. So now about 93% of the population has immunity (0.98 × 0.95 ≈ 0.93).
If each infected person has contact with 20 other people, then we can expect about 7% of those 20 to be infectable – or 1.4; and of those, 95% will become ill – or about 1.33. So on average, each sick person will infect about 1 1/3 other people. That’s enough to cause a significant outbreak. Without the non-immunized people, the infection rate is less than 1 – not enough to cause an outbreak. The non-immunized population reduced the herd immunity enough to cause an outbreak.
Now, within the population, how many immunized versus non-immunized people will get sick?
Out of every 100 people, there are 5 who got vaccinated, but aren’t immune. Out of that same 100 people, there are 2% that didn’t get vaccinated, or 2. So in 100 cases of the disease, we’d expect about 70 to be vaccinated, and 30 unvaccinated.
The vaccinated population is much, much larger – 50 times larger! – than the unvaccinated, so we’d expect more vaccinated people to become ill, even though it’s the smaller unvaccinated group that broke the herd immunity! The easiest way to see that is to take those numbers, and normalize them.
In the vaccinated community, those 70 sick people are – in the worst possible case, where every single non-immune vaccinated person became ill! – the 5% non-immune from a population of 1400 people. So the worst possible infection rate in the vaccinated population is just 5% – and in reality, it’s more like 4.75%. But those 30 sick people from the unvaccinated pool are 30 out of about 32 non-immunized people who were exposed. The unvaccinated people were more than 20 times more likely to be infected.
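The normalization can be checked with a short calculation. This is a sketch under a worst-case assumption, namely that every non-immune person is exposed; the function and the dictionary keys are names I made up for the example.

```python
def outbreak_breakdown(population: float,
                       vaccinated_fraction: float,
                       vaccine_effectiveness: float,
                       infectivity: float) -> dict:
    """Compare each group's share of cases with its per-person risk."""
    vaccinated = population * vaccinated_fraction
    unvaccinated = population - vaccinated
    # Vaccinated people whose vaccine didn't take are still susceptible.
    vaccinated_susceptible = vaccinated * (1.0 - vaccine_effectiveness)
    # Worst case: every susceptible person is exposed to the disease.
    sick_vaccinated = vaccinated_susceptible * infectivity
    sick_unvaccinated = unvaccinated * infectivity
    total_sick = sick_vaccinated + sick_unvaccinated
    return {
        "vaccinated_share_of_cases": sick_vaccinated / total_sick,
        "risk_if_vaccinated": sick_vaccinated / vaccinated,
        "risk_if_unvaccinated": sick_unvaccinated / unvaccinated,
    }


result = outbreak_breakdown(100, 0.98, 0.95, 0.95)
```

With the numbers from the example, roughly 7 out of 10 cases are vaccinated people. But the per-person risk is 4.75% for the vaccinated and 95% for the unvaccinated, which is a factor of 20.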
The reality of vaccines is pretty simple.
In short: everyone who is healthy enough to be vaccinated should get vaccinated. If you don’t, you’re a despicable free-riding asshole who’s deliberately choosing to put not just yourself but other people at risk.
So let’s get back to a bit of Martin-Löf type theory! (The stuff I’m writing about today corresponds, roughly, to chapter 2 of the Nordstrom/Petersson/Smith text.)
One of the key ideas of Martin-Löf’s type theory is that a logical statement is exactly the same thing as a specification of a computation. When you define a predicate like “Even”, the definition specifies both the set of numbers that satisfy the predicate, and the computation that tests a number for membership in the set of even numbers. If you haven’t provided enough information to fully specify the computation, then in Martin-Löf type theory, you haven’t defined a predicate or a set.
When you say “2 is even”, what you’re really saying in terms of the type theory is that “The computation for ‘even(2)’ will succeed”. The computation and the logical statement are the same thing.
In functional programming, we like to say that the program is the proof. Martin-Löf type theory is where that came from – and today we’re going to take a first look in detail at exactly what it means. In the world of type theory, the program is the proof, and the proof doesn’t exist without the program.
This creates an interesting set of equivalent interpretations. When you see a statement like “x : T” (or ), that could be interpreted in the following ways, all of which are really equivalent in type theory.
In the rest of this post, I’m going to focus on those four interpretations, and explain how each of them makes sense in this version of type theory.
The set theoretic interpretation is obvious – as the name suggests, it’s nothing but what we all learned from basic set theory. An object is a member of a set – which means, from set theory, that the object satisfies some predicate in first order predicate logic – because that’s what it means to be a member of a set.
The intuitionistic interpretation is almost the same as the set theoretic, but rewritten for intuitionistic logic. In intuitionistic logic, the predicate over the set is written as a proposition , and if we know that is a member of the set , then that means that we have a proof that demonstrates that is true.
The computational interpretation takes the intuitionistic one, and rephrases it in computational terms. A logical proposition, rendered into a computational setting, is just a specification of a program; and a proof of the proposition is a program that satisfies the specification.
Finally, the abstract interpretation just rephrases the computational one into terms that aren’t tied to a computing device. A predicate is a problem that needs to be solved; anything that provides a solution to the problem is demonstrating a member of the set.
The key takeaway though is the basic idea here of what a type is. What we’re talking about as a type here is something that goes far beyond any programming language’s idea of what a type is. In intuitionistic type theory, a type is a specification of a computation. If we had this in a real language, that would mean that any program that typechecked would be guaranteed to work: asserting that x has type T means, precisely, that x is a computation that matches the specification!
(Of course, that’s not the panacea that you might think the first time you hear that. The catch is simple: the type is a specification of the computation. That means that just writing a type is a form of programming! And that means that your type descriptions are going to have bugs. But we’ll come back to that in a much later post.)
What type theory is doing is taking something like set theory, and re-rendering it entirely in a computational world. It still has a mapping from the computations to the abstract concepts that we use when we’re thinking, but if we can talk about those abstract concepts in type theory, we’ll always do it by mapping them into some kind of computation.
In type theory, we’re not dealing in a world of pure mathematical objects that exist if we can describe them; instead, we’re building a world where everything is at least theoretically computable. That might seem constraining, but every proof already corresponds to a computation of some sort; the only additional constraint here is that we can’t play tricks like the axiom of choice, where we can “prove” the existence of some unattainable, intangible, nonsensical object.
To make that work, we’re going to take all of the constructs that we’re used to seeing in intuitionistic logic, and give them a meaning in terms of computations.
For example, in set theory, we can have a statement A ⊃ B – meaning that A is a superset of B, that every element of B is also necessarily an element of A. In type theory, since A and B are specifications of computations, that means that a member (or proof) of A ⊃ B is a computation that given a proof of A, generates a proof of B – in short, that A implies B.
Now, suppose that we want to prove A ⊃ A. How could we do that? We need a program that given a proof of A generates a proof of A. That is, we need an implementation of the identity function: λa . a.
In fact, using the computation interpretation of things, we can interpret A ⊃ B as being the type of a function that takes an instance of A, and generates an instance of B – that is, that if f : (A ⊃ B), then f is a function from an instance of A to an instance of B!
The only trick to that is understanding that in type theory, saying that a is an element of A means that a is a proof of A. Using the same interpretation, that means that f : A ⊃ B means that f is a proof of A ⊃ B – which means the same thing as saying that given an example of an element of A (a proof of A), f will produce an element of B (a proof of B). The statement A ⊃ B is exactly the same thing as the logical implication A → B, which is exactly the same thing as the type of a function from A to B.
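A proof assistant makes this correspondence very literal. The snippet below is a small sketch in Lean 4, which is a different type theory from the Martin-Löf system discussed here, though it shares the idea that an implication is a function type. The names `self_implies`, `identity`, and `modus_ponens` are just labels chosen for this example.

```lean
-- A proof that A implies A is the identity function on proofs of A.
theorem self_implies (A : Prop) : A → A :=
  fun a => a

-- The same term, read as an ordinary program: the polymorphic identity function.
def identity (α : Type) : α → α :=
  fun x => x

-- Using an implication is function application:
-- given a proof f of A → B and a proof a of A, f a is a proof of B.
theorem modus_ponens (A B : Prop) (f : A → B) (a : A) : B :=
  f a
```

The first two definitions have the same body. Whether it reads as a proof or as a program depends only on whether its type is a proposition or a data type.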
Notes on sources: I’m working from two main references in this series of posts, both of which are available in PDF form online.
In addition, I first learned a lot of this from reading some papers by Phil Wadler and Simon Peyton Jones. The exact references to those are long-lost in the shadows of my memory. But any of their papers are well-worth reading, so just read them all!
A heap is a structure designed to solve a common problem. You’ve got a collection of objects, each of which has an associated numeric value. You want, at any time, to be able to find and remove the largest value in the collection, and to be able to add new elements to it. Those two operations are the core of the heap. Some variations also allow you to increase the value of objects inside the heap, or to remove values other than the maximum.
There are a lot of different ways to implement a heap. One obvious one is to just maintain a sorted sequence of objects. The problem with that is performance: some of the common operations are painfully slow!
Using the sorted sequence approach, removing the largest value is easy: you just remove the last element of the sequence. That’s very fast: it’s constant time. But you also need to be able to add values to the heap, and that’s not so good.
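There are two basic ways of doing a sequence: an array, or a linked list. In both cases, the performance isn’t acceptable. If we used an array, then in order to add a new object to the collection, we’d need to find the position where it belongs, and then shift every element after that position over by one to make room, which takes linear time.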
In the linked list approach, inserting the value isn’t a problem – it’s a constant time operation. But finding the position where it should be inserted is linear time. So we’re still talking about linear time.
Similarly, we could use a linked list, where inserting the element is constant time, but then finding its position is $O(n)$ – again, unacceptable.
The problem with the sorted sequence approach isn’t really related to the kind of structure we use to maintain the sorted list; the problem is that we’re maintaining more information than we need. At any time, we want to be able to find the largest element of the heap quickly – we don’t care about the relative positions of any pair of values that don’t include the largest element of the collection! But if we keep a sorted list, every time we insert an element, we’re spending a lot of time comparing things whose comparison we don’t really care about!
To be able to make it faster, we need to build a data structure that doesn’t waste time and effort computing and maintaining information that we don’t want.
So our goal is to find ways of building structures that always let us both find the largest element quickly, and add new elements quickly, without maintaining more information than is really necessary. We’ll start off with a simple but good version, and then work our way through to better ones.
The most basic implementation of a heap is called a binary heap. A binary heap is a binary tree with two key properties:
The left-full property might seem a bit strange, but it turns out to be pretty straightforward. A binary heap can be implemented using an array. The root node is stored in the first position of the array; its children are in positions 2 and 3; the children of node 2 are stored in positions 4 and 5; the children of node 3 are stored in positions 6 and 7. Using one-based indices, for any node N, its children are stored in positions 2N and 2N+1. Adding a new leaf to the tree can always be done by just appending one value to the array. The left-full property just means that the bottom level of the tree fills in from left to right, which is exactly what extending the array by one element at the end does.
Implementing a heap this way is simple:
I’m going to show code for this. For fun, I wrote the code in a language called xtend. Xtend is a Java extension that cleans up the syntax, gets rid of semicolons, improves the type system, adds lambdas, and does a few other really neat things.
The whole beast is just a wrapper around an array:
    class BinHeap<T extends Comparable<T>> {
      val ArrayList<T> _contents

      new() {
        _contents = new ArrayList<T>()
      }
      ...
    }
If you know Java, this is mostly clear. In xtend, you write constructors using the name “new” instead of the name of the class being constructed.
Then we’ll set up some utilities to make other stuff easier to write.
    def leftChildPosition(int pos) {
      2 * (pos + 1) - 1
    }

    def rightChildPosition(int pos) {
      2 * (pos + 1)
    }

    def int parentPosition(int pos) {
      if (pos == 0) {
        throw new MaxHeapException()
      } else {
        (pos + 1) / 2 - 1
      }
    }

    def void swap(int one, int two) {
      val T first = _contents.get(one)
      _contents.set(one, _contents.get(two))
      _contents.set(two, first)
    }
Again, these should be straightforward. The only tricky thing is that the JVM uses zero-based arrays – so the left child of the node in position $n$ is $2(n+1) - 1$: we need to add one to the node number to shift to one-based position; and then subtract one from the result to switch back to zero-based position. We do a similar thing for each of the other position computations.
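If the index shuffling is easier to follow in a language you can poke at interactively, here is the same zero-based arithmetic as a small Python sketch. The function names are my own Python-style renderings of the Xtend helpers, and raising `ValueError` for the root is an assumption standing in for `MaxHeapException`.

```python
def left_child_position(pos: int) -> int:
    """Shift to one-based (pos + 1), double it, then shift back to zero-based."""
    return 2 * (pos + 1) - 1


def right_child_position(pos: int) -> int:
    """One-based right child is 2N + 1; shifting back to zero-based gives 2N."""
    return 2 * (pos + 1)


def parent_position(pos: int) -> int:
    """Integer-halve the one-based position, then shift back to zero-based."""
    if pos == 0:
        raise ValueError("the root has no parent")
    return (pos + 1) // 2 - 1
```

A quick sanity check is that the parent of either child of a position is always that position again.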
Now we can get to the interesting bits. How do we get values into the heap?
    def insert(T v) {
      val idx = _contents.size()
      _contents.add(v)
      bubbleUp(idx)
    }
Insert is exactly what I described in prose above: append the new value onto the end of the array, and then bubble it up. Bubbling is the interesting part:
    private def void bubbleUp(int pos) {
      if (pos > 0) {
        val parentPos = parentPosition(pos)
        if (_contents.get(pos) > _contents.get(parentPos)) {
          swap(pos, parentPos)
          bubbleUp(parentPos)
        }
      }
    }
Bubbling up from a position compares to its parent. If it’s bigger than its parent, it swaps positions with the parent, and then tries to continue bubbling up from its new position.
For example, imagine we had a tree like:
                9
             /     \
          8           7
        /   \       /   \
       5     6     2     1
      / \   /
     4   0 3
Now, suppose we wanted to add the value “10” to this. We’d add 10 to the end of the array, which would make it a child of 6. That would give us:
                9
             /     \
          8           7
        /   \       /   \
       5     6     2     1
      / \   / \
     4   0 3   10
So, we’d compare 10 to its parent – it’s bigger, so we’d swap:
                9
             /     \
          8           7
        /   \       /   \
       5     10    2     1
      / \   / \
     4   0 3   6
Then we’d compare 10 to its new parent, 8. It’s bigger, so we swap:
                9
             /     \
          10          7
        /   \       /   \
       5     8     2     1
      / \   / \
     4   0 3   6
And finally, we’d compare 10 to its new parent, 9. It’s bigger so we swap, and then we’re done.
                10
             /     \
          9           7
        /   \       /   \
       5     8     2     1
      / \   / \
     4   0 3   6
Appending to the end of the array is constant time, so the dominant time cost is the bubbling. The maximum possible number of swaps in the bubbling process is the depth of the tree minus 1 – and the depth of a full binary tree with N members is roughly $\lg N$. So it’s $O(\lg N)$ swaps, and the overall cost of inserts is $O(\lg N)$.
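Here is a minimal Python sketch of the same insert-and-bubble-up process, so you can replay the example. It is a translation under assumptions, not the Xtend code itself: it works on a plain list of integers instead of a generic class, it loops instead of recursing, and the starting list is the array layout of the example tree (root 9, whose children are 8 and 7).

```python
from typing import List


def parent_position(pos: int) -> int:
    """Zero-based parent index, same arithmetic as the Xtend helper."""
    return (pos + 1) // 2 - 1


def bubble_up(contents: List[int], pos: int) -> None:
    """Swap the value at pos upward until its parent is at least as big."""
    while pos > 0:
        parent = parent_position(pos)
        if contents[pos] <= contents[parent]:
            break
        contents[pos], contents[parent] = contents[parent], contents[pos]
        pos = parent


def insert(contents: List[int], value: int) -> None:
    """Append the new value at the end, then bubble it up."""
    contents.append(value)
    bubble_up(contents, len(contents) - 1)


heap = [9, 8, 7, 5, 6, 2, 1, 4, 0, 3]
insert(heap, 10)
print(heap)  # [10, 9, 7, 5, 8, 2, 1, 4, 0, 3, 6]
```

The 10 makes three swaps on the way up (past 6, 8, and 9), which is the worst case for a tree of this depth.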
Getting the largest value is trivial:
    def getMax() {
      _contents.get(0)
    }
Removing the largest value is a lot like adding a value: we really play with the last element of the array, and then do a bubbling process – only this time we’ll bubble in the opposite direction:
    def removeMax() {
      if (_contents.size == 0) {
        throw new MaxHeapException()
      } else {
        val result = getMax()
        val last = _contents.remove(_contents.size() - 1)
        if (_contents.size() > 0) {
          _contents.set(0, last)
          bubbleDown(0)
        }
        result
      }
    }
Bubbling down is similar to bubbling up, but it’s a bit more complicated, because we need to look at both children.
    private def void bubbleDown(int pos) {
      val rightChildPos = rightChildPosition(pos)
      val leftChildPos = leftChildPosition(pos)
      if (leftChildPos >= _contents.size) {
        return
      }
      // Try to bubble left if there is no right child, or if the left child is
      // bigger than the right.
      if (rightChildPos >= _contents.size ||
          _contents.get(leftChildPos) > _contents.get(rightChildPos)) {
        if (_contents.get(pos) < _contents.get(leftChildPos)) {
          swap(pos, leftChildPos)
          bubbleDown(leftChildPos)
        }
      } else {
        // Try to bubble right
        if (_contents.get(pos) < _contents.get(rightChildPos)) {
          swap(pos, rightChildPos)
          bubbleDown(rightChildPos)
        }
      }
    }
The process is almost the same as bubbling up, but moving in the opposite direction. We're starting with a parent node, and comparing it to its children. If it's smaller than its largest child, then we swap it with that child, and then continue bubbling down.
For example, let's look at the same heap we looked at for insert:
                9
             /     \
          8           7
        /   \       /   \
       5     6     2     1
      / \   /
     4   0 3
If we want to remove 9, we set the value 9 aside, and then remove 3 from the end of the array, and put it at the root of the tree:
                3
             /     \
          8           7
        /   \       /   \
       5     6     2     1
      / \
     4   0
Then we'd compare 3 against its two children, 8 and 7. Since 8 is the larger child, we swap 8 for 3:
                8
             /     \
          3           7
        /   \       /   \
       5     6     2     1
      / \
     4   0
Now we compare 3 with its new children, 5 and 6. 6 is bigger, so we swap 6 with 3:
                8
             /     \
          6           7
        /   \       /   \
       5     3     2     1
      / \
     4   0
3 has no children, so we're done: it's bubbled down as far as it can go.
Note: I messed up this example in the original version of the post. Thanks to John Armstrong for pointing it out.
The cost here is the same as insert, for the same reason. The dominant cost is the bubbling, and the bubbling is bounded by the depth of the tree. So removing the maximum is also $O(\lg N)$.
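For comparison, here is a minimal Python sketch of remove-max with bubbling down. As before, this is a translation under assumptions rather than the Xtend code: it uses a plain list of integers, a loop instead of recursion, `IndexError` in place of `MaxHeapException`, and it prefers the left child on ties.

```python
from typing import List


def bubble_down(contents: List[int], pos: int) -> None:
    """Swap the value at pos downward while its larger child is bigger."""
    size = len(contents)
    while True:
        left = 2 * (pos + 1) - 1
        right = 2 * (pos + 1)
        if left >= size:
            return  # no children: nothing left to compare against
        # Pick the larger child (the left one if there is no right child).
        child = left
        if right < size and contents[right] > contents[left]:
            child = right
        if contents[pos] >= contents[child]:
            return  # heap property holds here, so we're done
        contents[pos], contents[child] = contents[child], contents[pos]
        pos = child


def remove_max(contents: List[int]) -> int:
    """Remove and return the root, refilling it with the last element."""
    if not contents:
        raise IndexError("remove_max from an empty heap")
    result = contents[0]
    last = contents.pop()
    if contents:
        contents[0] = last
        bubble_down(contents, 0)
    return result


heap = [9, 8, 7, 5, 6, 2, 1, 4, 0, 3]
print(remove_max(heap))  # 9
print(heap)              # [8, 6, 7, 5, 3, 2, 1, 4, 0]
```

The 3 that gets moved to the root makes two swaps on the way down (with 8, then with 6), matching the walkthrough.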
It's worth noting that heaps can be used to build a very reasonable sorting algorithm. To sort a collection, just insert all of the elements of the collection, and then remove them one by one. It's $O(n \lg n)$, and it's conceptually quite simple. It's not widely used, because the old classic quicksort is faster - not in big-O notation, but it ends up with a smaller constant. (In big-O notation, something that takes 3(lg n) steps and something that takes 6(lg n) steps are both $O(\lg n)$, but the one whose constant is 3 is still twice as fast as the one whose constant is 6.)
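As a quick illustration of heap-based sorting, here is a sketch that leans on Python's standard `heapq` module instead of the heap built in this post. One difference to keep in mind: `heapq` is a min-heap, so values come out smallest first, where the max-heap above would hand them back largest first. The function name `heap_sort` is just a label for this sketch.

```python
import heapq
from typing import Iterable, List


def heap_sort(values: Iterable[int]) -> List[int]:
    """Insert everything into a heap, then remove the elements one by one."""
    heap: List[int] = []
    for v in values:
        heapq.heappush(heap, v)      # O(lg n) per insert
    result = []
    while heap:
        result.append(heapq.heappop(heap))  # O(lg n) per removal
    return result
```

That's n inserts and n removals at logarithmic cost each, which is where the overall bound comes from.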
]]>A good example of that is something called paxos. My first exposure to paxos was very interesting. I’d just been hired by Google, and was working on their build tool. At the time, engineers in Google had a problem. Google’s codebase was contained in one massive version control repository. Doing things that way has a ton of really neat advantages – most importantly, the fact that it makes it really convenient to reuse code written by anyone else at the company. The problem was that code reuse can become very confusing. Project A reuses a bit of code written by people from project B. B’s code reused some stuff from C, and C from D, and D from E. So now project A is using code from project E, and they don’t know why!
In this case, I had someone from a storage project coming to me trying to figure out just why his system had a dependency on a plan9 database system called paxos. I had to build a tool that would allow people to ask questions like “Why does A depend on E?”.
As it turned out, paxos was a really important thing, and it was widely reused through the Google codebase. Once I learned about it, I started seeing it everywhere. Since I left Google nearly four years ago, I haven’t stopped seeing it. It’s ubiquitous in distributed systems. Outside of Google, we weren’t using that friendly old plan9 paxos implementation – but the paxos model has been reimplemented dozens of times, because it’s so darned useful!
paxos is a system for managing consensus.
In distributed systems, there’s a collection of hard problems that you constantly need to deal with.
In a world where you can’t count on different agents seeing events in the same order, and where you expect things to be constantly failing, how can you make sure that any distributed system you build ends up with a consistent view of reality?
The answer is a consensus protocol. You need to create a mechanism based on communication between the copies of your system that allows them to maintain a consistent consensus of what the current state of the world is, even in the presence of failures of machines, storage systems, and communications.
paxos is a very elegant, reasonably simple consensus protocol.
Let’s get a bit more precise. Paxos is built on a model of storage. The original application of it was a consistent database, so it’s built around the idea of keeping data consistent. In paxos, the state of the storage is modelled as a sequence of transactions. Each transaction is a pair $(t, v)$, where $t$ is a numeric transaction identifier, and $v$ is a transaction value.
The state of the system being modelled is a sequence of transaction pairs, $[(t_i, v_i), (t_j, v_j), \ldots]$, where the $t$ values are increasing as you progress through the sequence. As time passes, new transaction pairs can be added to the state.
The focus of the paxos protocol is ensuring that in a collection of 2n+1 participants, all surviving participants will agree on the current value of the state, even if up to n participants fail, and even if messages can be delivered in arbitrary order.
Before I go further into the description of paxos, we need to look at the basic assumptions that underlie it. Like any formal model, it’s not defined in terms of real computers. It’s defined in terms of an abstraction that approximates reality. In this case, the approximation is quite good, but we still need to go through the basic assumptions that make up its model of the universe.
In short, everything can fail at any time; after failure, participants can recover and rejoin the system; and no part of the system acts in an actively adversarial way.
The protocol describes the behavior of the system in terms of a collection of roles. A participant can have more than one role in the system – in fact, in most implementations of paxos, all participants do have multiple roles. The roles are:
In a typical paxos cluster, the client sends requests to a proposer. The proposer sends a proposal to update the state with the new client request, and attempts to convince a majority of the acceptors to accept it. Once a majority accepts it, the client request is processed by the learner, and a result is returned to the client.
The meat of paxos is the protocol by which the proposer gets a majority of acceptors to agree on a proposal, and how that protocol ensures that the collection of acceptors maintains a consistent state.
The protocol itself is pretty simple. Each round is effectively independent, and consists of a process of attempting to reach consensus. Within each round, finding consensus is a two-phase process, where each phase consists of a message sent from a proposer to a group of acceptors, and a reply from the acceptors to the proposer.
Once an acceptor has sent a promise message, it must not accept any request for a proposal with number less than N. Note though that this does not mean that the acceptor promises to accept the proposal: all it’s doing is promising not to accept any proposal with number less than N! If it receives a message Prepare(N+1), it’s free to promise that – but if it does, it will no longer be able to accept the proposal for N.
(If N is smaller than the number of any proposal promised or accepted by the acceptor, then in the original version of paxos, the acceptor does nothing; in some optimizations of the protocol, it replies Reject(n_{v}).)
What this phase does is allow a proposer to determine whether or not a new proposal is even worth considering. If a quorum (majority) of acceptors send promises, then it can move on to phase 2.
When a proposer receives promises from a quorum of acceptors, then it moves forward to try to actually commit the proposal. In order to do this, it needs to choose a value for the proposal. If any of the Promise messages contained a value, then the value of this proposal must be set to the value attached to the highest-numbered proposal in any of the promises. If all of the promises were empty, then the proposer can choose any value that it wants for the proposal.
Once the proposer has chosen a value, then it sends a message Accept!(N, V) to a quorum of acceptors. This is typically written with the exclamation point, because it’s really a command to the acceptors: they’re being told to accept the proposal, if they can.
When an acceptor receives an Accept!(N, V) message, if it has not issued a promise for a proposal with number greater than N, then it must accept the message. It accepts the proposal by sending a message Accepted(N, V) to both the original proposer, and all of the learners.
When Accepted messages have been received from a quorum of acceptors, the new value V becomes the consensus value for the paxos cluster, and the new proposal number N is fully committed.
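The two phases can be sketched as a toy, single-process simulation in Python. Everything here is an assumption made for illustration: the `Acceptor` and `Promise` classes, the `propose` function, the use of strings as values, and above all the fact that "messages" are plain method calls that never get lost or reordered. It is a sketch of one round of the message pattern described above, not a usable paxos implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Promise:
    # The (number, value) this acceptor already accepted, if any.
    accepted: Optional[Tuple[int, str]]


@dataclass
class Acceptor:
    promised: int = -1                          # highest number promised so far
    accepted: Optional[Tuple[int, str]] = None  # last accepted (number, value)

    def prepare(self, n: int) -> Optional[Promise]:
        """Phase 1: promise not to accept anything numbered below n."""
        if n < self.promised:
            return None  # original paxos: just don't answer
        self.promised = n
        return Promise(self.accepted)

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if self.promised > n:
            return False
        self.promised = n
        self.accepted = (n, value)
        return True


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run one round; return the committed value, or None if no quorum."""
    quorum = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < quorum:
        return None
    # If any promise carried a value, we must reuse the highest-numbered one.
    earlier = [p.accepted for p in promises if p.accepted is not None]
    chosen = max(earlier, key=lambda nv: nv[0])[1] if earlier else value
    acks = sum(1 for a in acceptors if a.accept(n, chosen))
    return chosen if acks >= quorum else None
```

With three acceptors, a first proposal numbered 1 with value "a" commits "a"; a later proposal numbered 2 that wanted "b" is forced to re-commit "a", because the promises it collects already carry an accepted value.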
As with so many things, this is easier to understand when you think about an example. One use of paxos that I’ve worked with is in a cluster scheduling service. In that system:
It’s a pretty simple thing – the core concept is simply that no consensus proposal is considered “committed” until it’s been accepted by a majority of the participants. And if it’s been accepted by a majority of the participants, that means that no conflicting proposal can ever reach consensus – because that would require at least one participant to accept 2 conflicting proposals.
But there’s still a bit of formality that’s worth looking at. Exactly what guarantees does paxos give? What properties does paxos-style consensus have?
Even the formal properties of paxos are easy to understand. Paxos provides two key properties: validity, and agreement.
You can easily prove those two properties. In fact, the proof is completely obvious once you recognize that the paxos protocol has two invariants (and those invariants are themselves clear from the definition of the protocol!):
Getting back to the beginning: the point of all of this is to have a system in which we can be sure that things work correctly even in the presence of failures. In paxos, as long as at some point there was a quorum of machines that come to agreement, then any failure that leaves a surviving quorum of machines must have overlapped with the previous quorum – which means that the previous consensus still remains in effect, and will be propagated to the remaining participants. If you’ve got 5 machines, then two can fail, and you won’t lose consistency among the remaining ones.
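The overlap argument is easy to check by brute force for a small cluster. This Python sketch (the function names are made up for illustration) enumerates every majority of a set of participants and confirms that any two of them share at least one member.

```python
from itertools import combinations
from typing import List, Sequence, Set


def majorities(participants: Sequence[int]) -> List[Set[int]]:
    """Every subset that contains more than half of the participants."""
    quorum = len(participants) // 2 + 1
    return [
        set(group)
        for size in range(quorum, len(participants) + 1)
        for group in combinations(participants, size)
    ]


def every_pair_overlaps(participants: Sequence[int]) -> bool:
    """True if any two majorities have at least one member in common."""
    quorums = majorities(participants)
    return all(a & b for a in quorums for b in quorums)
```

For five machines there are 16 possible majorities, and every pair of them intersects, so whichever quorum survives a failure always contains someone who saw the previous consensus.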
]]>In the time when I was silent, I had a couple of people write to me to ask me to explain something called the ABC conjecture.
The ABC conjecture is a mathematical question about number theory that was proposed in the 1980s – so it’s relatively new as number theory problems go. It’s gotten a lot of attention recently, due to an almost soap-operatic series of events.
It’s a very hard problem, and no one had made any significant progress on it until about five years ago, when a well-respected Japanese mathematician named Shinichi Mochizuki published a series of papers containing a proof of the conjecture.
Normally, when a proof of a hard problem gets published, mathematicians go nuts! Everyone starts poring over it, trying to figure it out, and see if it’s valid. That’s what happened the previous time someone thought they’d proved it. But this time, no one has been able to make sense out of the proof!
The problem is that in order to build his proof, Professor Mochizuki created a whole new mathematical theory, called inter-universal Teichmüller theory. The entire ABC conjecture proof is built in this new theory, and no one other than Professor Mochizuki himself understands inter-universal Teichmüller theory. Before anyone else can actually follow the proof, they need to understand the theory. Professor Mochizuki is a bit of a recluse – he has declined to travel anywhere to teach his new mathematical system. So in the five years since he first published it, no one has been able to understand it well enough to determine whether or not the proof is correct. One error in it was found, but corrected, and the whole proof remains in question.
Exactly why the proof remains unchecked after five years is a point of contention. Lots of mathematicians are angry at Professor Mochizuki for not being willing to explain or teach his theory. A common statement among critics is that if you create a new mathematical theory, you need to be willing to actually explain it to people: work with a group of mathematicians to teach it to them, so that they’ll be able to use it to verify the proof. But Professor Mochizuki’s response has been that he has explained it: he’s published a series of papers describing the theory. He doesn’t want to travel and take time away from his work for people who haven’t been willing to take the time to read what he’s written. He’s angry that after five years, no one has bothered to actually figure out his proof.
I’m obviously not going to attempt to weigh in on whether or not Professor Mochizuki’s proof is correct. That’s so far beyond the ability of my puny little brain that I’d need to be a hundred times smarter before it would even be laughable! Nor am I going to take sides about whether or not the Professor should be travelling to teach other mathematicians his theory. But what I can do is explain a little bit about what the ABC conjecture is, and why people care so much about it.
It’s a conjecture in number theory. Number theorists tend to be obsessed with prime numbers, because the structure of the prime numbers is a huge and fundamental part of the structure and behavior of numbers as a whole. The ABC conjecture tries to describe one property of the structure of the set of prime numbers within the system of the natural numbers. Mathematicians would love to have a proof for it, because of what it would tell them about the prime numbers.
Before I can explain the problem, there’s a bit of background that we need to go through.
The simplest way of saying the ABC conjecture is that for the vast majority of integers A, B, and C, where A + B = C and A and B are coprime, C can’t be much bigger than rad(A*B*C), the product of the distinct primes that divide A, B, or C.
Of course, that’s hopelessly imprecise for mathematicians! What does “the vast majority” mean?
The usual method at times like these is to find some way of characterizing the relative sizes of the set where the statement is true and the set where the statement is false. For most mathematicians, the sizes of sets that are interesting are basically 0, 1, finite, countably infinite, and uncountably infinite. For the statement of the ABC conjecture, they claim that the set of values for which the statement is true is infinite, but that the set of values for which it is false is finite. Specifically, they want to be able to show that the set of triples for which C is much bigger than rad(A*B*C) is finite.
To do that, they pull out a standard trick. Sadly, I don’t recall the proper formal term, but I’ll call it epsilon bounding. The idea is that you’ve got a statement S about a number (or region of numbers) N. You can’t prove your statement about N specifically – so you prove it about regions around N.
As usual, it’s clearest with an example. We want to say that C isn’t much bigger than rad(A*B*C) for most values of A and B. The way we can show that is by saying that for any value ε > 0, the set of values (A, B, C) where A and B are coprime, and A + B = C, and C > rad(A*B*C)^(1+ε) is finite.
What this formulation does is give us a formal idea of how rare this is. It’s possible that there are some values for A and B where C is bigger than rad(A*B*C)^(1+ε), even for a fairly large ε. But the number of places where that’s true is finite. Since the full system of numbers is infinite, that means that in the overwhelming majority of cases, C < rad(A*B*C)^(1+ε). The size of the set of numbers where that's not true is so small that it might as well be 0 in comparison to the size of the set of numbers where it is true.
Ultimately, it seems almost trivial once you understand what the conjecture is. It's nothing more than the hypothesis that if A + B = C, with A and B coprime, then most of the time C is not much bigger than the product of the distinct prime factors of A, B, and C. Once you've got that down, the question is, what's the big deal? Professor Mochizuki developed five hundred pages of theory for this? People have spent more than five years trying to work through his proof just to see if it’s correct for a statement like this? Why does anybody care so much?
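If you want to play with actual numbers, here is a small Python sketch. It assumes the usual textbook form of the conjecture, where the radical is taken over the whole product A*B*C and rad(n) means the product of the distinct primes dividing n. The function names and the "quality" ratio log(C) / log(rad(A*B*C)) are labels chosen for this sketch: a quality above 1 means C is bigger than the radical, which is the rare case.

```python
from math import gcd, log


def rad(n: int) -> int:
    """Product of the distinct prime factors of n (trial division)."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            result *= p
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:          # whatever is left over is itself a prime factor
        result *= n
    return result


def quality(a: int, b: int) -> float:
    """log(C) / log(rad(A*B*C)) for the coprime triple A + B = C."""
    if gcd(a, b) != 1:
        raise ValueError("A and B must be coprime")
    c = a + b
    return log(c) / log(rad(a * b * c))
```

For a typical triple like 5 + 7 = 12, the radical is 2*3*5*7 = 210, far bigger than C. For the unusual triple 1 + 8 = 9, the radical is just 2*3 = 6, so C wins.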
One answer is: mathematicians are crazy people!
The better answer is that simple statements like this end up telling us very profound things about the deep structure of numbers. The statements reduce to something remarkably simple, but the meaning underneath it is far more complex than it appears.
Just to give you one example of what this means: If the conjecture is true, then there’s a three-line proof of Fermat’s last theorem. (The current proof of Fermat’s last theorem, by Andrew Wiles, is over 150 pages of dense mathematics.) There’s quite a number of things that number theoreticians care about that would fall out of a successful proof.
]]>When you study classical axiomatic set theory, you’re necessarily also studying classical first order predicate logic. You have to be doing that, because classical axiomatic set theory is deeply and intimately intertwined with FOPL. Similarly, the semantics of FOPL as it’s used in modern math are inextricably tangled with set theory. Sets are specified by predicates; predicates get their meaning from the sets of objects that they satisfy.
You can view type theory – or at least Martin-Löf’s intuitionistic type theory – as having nearly the same relationship to intuitionistic logic. We’ll see that in detail in later posts, but for now, intuitionistic type theory is a fundamental mathematical framework which is built on intuitionistic logic. So you can’t talk about this kind of type theory unless you understand the basics of the logic.
In this post, I’m going to try to explain what intuitionistic logic is, and how it differs from FOPL. (We’ll see all of this in more detail later.)
Intuitionistic logic is a modal predicate logic, which is built around a constructivist idea of truth. The intuitionistic idea of truth ends up being much stronger than what most of us are used to from standard FOPL: it means that nothing exists unless there is a concrete way of constructing it.
For a concrete example of what that means: in standard FOPL with the ZFC axioms, you can prove the Banach-Tarski paradox. Banach-Tarski (which I wrote about HERE) says that it’s possible to take a sphere the size of an orange, cut it into pieces, and then re-assemble those pieces into two spheres the same size as the original orange. Or, alternatively, that you can take those pieces that you sliced an orange-sized sphere into, and re-assemble them into a sphere the size of the sun.
Many people would say that this is, clearly, ridiculous. Others would point out a variety of rationalizations: that a sphere the size of an orange and a sphere the size of the sun contain the same number of points; or that the slicing process transitioned from a metric topology to a collection of non-metric topologies, or several other possible explanations.
But what no one can dispute is that there is one very important property of this proof. Those slices are unconstructable. That is, they exist based on a proof using the axiom of choice, but the sets of points in those topologies can’t be constructed by any process. They exist as a necessary implication of the axiom of choice, but we can’t construct them, and even given a pair of sets, one of which is one of those slices, and one of which isn’t, we can’t identify which one is.
According to intuitionism, this is ridiculous. Saying that something exists, but that it is forever beyond our reach is foolishness. If we can’t construct it, if we can’t describe how to identify it, what does it mean to say that it must exist?
When you’re working in intuitionistic logic, every proof that a type of thing exists consists of either a concrete example of the thing, or a process for constructing an example of the thing. A proof of a negative is a concrete counterexample, or a process for creating one. In computer-sciency terms, the process doesn’t need to terminate. You don’t have to be able to construct something in finite time. But you need to have a process that describes how to construct it. So you can, for example, still do Cantor’s diagonalization in intuitionistic logic: if someone gives you an alleged complete 1:1 mapping between the real numbers and the integers, the proof tells you how to create a counterexample. But you can’t do the proof of Banach-Tarski, because it relies on an axiom-of-choice existence proof of something non-constructable.
The way that intuitionistic logic creates that constructivist requirement is not what you might expect. When I first heard about it, I assumed that it was based on a statement of principle: a proof has to create a concrete example. But that approach has an obvious problem: how do you mathematically define it? Logic is supposed to be purely symbolic. How can you take an abstract statement about what a proof should be, and make it work in logic?
Logic is built on inference rules. You have a collection of statements, and a collection of rules about how to use those statements to produce proofs. It turns out that by making a couple of simple changes to the rules of inference that you can get exactly the constructivist requirements that we’d want. It’s based on two real changes compared to standard FOPL.
Intuitionistic logic is modal. In FOPL, any given statement is either true or false. If it’s not true, then it’s false. If it’s true, it’s always true, and always was true. There’s no other choice. In intuitionistic logic, that’s not really the case: intuitionistic logic has three states: true, false, and unknown. If you know nothing about it, then it’s formally unknown, and it will stay unknown until there’s a proof about it; once you find a proof, its truth value changes from unknown to either true or false. All of the inference rules of intuitionistic logic only allow inference from proven statements. You can’t reason about an unknown – you need to have a proof that moves it from unknown to either true or false first.
The semantics of this are quite simple: it’s a tiny change in the definition of truth. In FOPL, a statement is true if there exists a proof of that statement, and it’s false if there’s a proof of the negation of that statement. In intuitionistic logic, a statement is true if you have a proof of that statement; and it’s false if you can prove that there is no proof of the statement. If you haven’t proven $A$, then $A$ is unknown. If $A$ is unknown, then $\lnot A$ is also unknown. $A \lor \lnot A$ is, similarly, not true until you have a proof of either $A$ or $\lnot A$: it means that either “There is a proof of $A$” or “There is a proof of $\lnot A$”. But if we don’t know if there’s a proof of either one, then it’s unknown! You could argue that this is true in FOPL as well – but in FOPL, you can rely on the fact that $A \lor \lnot A$, and you can use that in a proof, and explore both options. In intuitionistic logic, you can’t: you can’t do anything with $A \lor \lnot A$ until you’ve got a proof.
It’s amazing how small the change to FOPL is to produce something that is so strongly constructionist. The easiest way to appreciate it is to just look at the rules, and how they change. To do that, I’m going to quickly walk through the inference rules of intuitionistic logic, and then show you what you’d need to change to get classical FOPL. Most of the time, when I’ve written about logics, I used sequents to write the inference rules; for ease of typesetting (and for the fun of doing something just a bit different), this time, I’m going to use Hilbert calculus (the same method that Gödel used in his incompleteness proof.) In HC, you define axioms and inference rules. For intuitionistic logic, we need to define three inference rules:
With the inference rules out of the way, there’s a collection of axioms. Each axiom is actually a schema: you can substitute any valid statement for any of the variables in the axioms.
That’s intuitionistic logic. What’s the difference between that and FOPL? What kinds of powerful reasoning features did you need to give up from FOPL to get this strongly constructivist logic?
Just one simple axiom: the law of the excluded middle, $A \lor \lnot A$.
That’s it. Get rid of the excluded middle, and you’ve got the beautiful constructivist intuitionistic logic. All we had to give up is one of the most intuitively obvious rules in all of logic.
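You can see the dividing line in a modern proof assistant. The following is a small sketch in Lean 4, whose core logic is constructive; this is an illustration I'm adding, not part of the Hilbert-style system described above. The first example only goes through by appealing to `Classical.em`, Lean's classical excluded-middle theorem. The second needs no classical reasoning at all: it shows constructively that excluded middle can never be refuted, which is a weaker statement than excluded middle itself.

```lean
-- Classical: excluded middle has to be pulled in explicitly.
example (p : Prop) : p ∨ ¬p := Classical.em p

-- Constructive: assuming excluded middle fails leads to a contradiction.
-- From h : ¬(p ∨ ¬p) we build a proof of ¬p, and feed it back into h.
example (p : Prop) : ¬¬(p ∨ ¬p) :=
  fun h => h (Or.inr (fun hp => h (Or.inl hp)))
```

The gap between those two statements is exactly the gap that dropping the excluded middle opens up.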
]]>
What came before:
So now we can move on to step three, where we actually see why mathematical logic is necessarily incomplete.
What I did in the last post was walk through a very laborious process that showed how we could express meta-mathematical properties of logical statements as primitive recursive functions and relations. Using that, we were able to express a non-primitive-recursive predicate provable, which is true for a particular number if and only if that number is the Gödel number representation of a statement which is provable.
    pred provable(x) = some y {
      proofFor(y, x)
    }
The reason for going through all of that was that we really needed to show how we could capture all of the necessary properties of logical statements in terms of arithmetic properties of their Gödel numbers.
Now we can get to the target of Gödel’s effort. What Gödel was trying to do was show how to defeat the careful stratification of the Principia’s logic. In the Principia, Russell and Whitehead had tried to avoid problems with self-reference by creating a very strict type-theoretic stratification, where each variable or predicate had a numeric level, and could only reason about objects from lower levels. So if natural numbers were the primitive objects in the domain being reasoned about, then level-1 objects would be things like specific natural numbers, and level-1 predicates could reason about specific natural numbers, but not about sets of natural numbers or predicates over the natural numbers. Level-2 objects would be sets of natural numbers, and level-2 predicates could reason about natural numbers and sets of natural numbers, but not about predicates over sets of natural numbers, or sets of sets of natural numbers. Level-3 objects would be sets of sets of natural numbers… and so on.
The point of this stratification was to make self-reference impossible. You couldn’t make a statement of the form “This predicate is true”: the predicate would be a level-N predicate, and only a level N+1 predicate could reason about a level-N predicate.
What Gödel did in the arithmetic process we went through in the last post is embed a model of logical statements in the natural numbers. That’s the real trick: the logic of the Principia is designed to work with a collection of objects that are a model of the natural numbers. By embedding a model of logical statements in the natural numbers, he made it possible for a level-1 predicate (a predicate about a specific natural number) to reason about any logical statement or object. A level-1 predicate can now reason about a level-7 object! A level-1 predicate can reason about the set defined by a level-1 predicate: a level-1 predicate can reason about itself! A level-1 predicate can, now, reason about any logical statement at all – itself, a level-2 predicate, or a level-27 predicate. Gödel found a way to break the stratification.
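To make the idea of embedding statements in numbers concrete, here is a minimal Python sketch of a prime-power encoding. It assumes each symbol of a statement has already been assigned a positive integer code; the function names and the sample codes are illustrative assumptions, not the numbering used in the earlier posts.

```python
def primes():
    """Yield 2, 3, 5, ... by trial division (fine for short sequences)."""
    found = []
    n = 2
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1

def godel_number(codes):
    """Encode a sequence of positive integers as 2^c1 * 3^c2 * 5^c3 * ..."""
    result = 1
    for p, c in zip(primes(), codes):
        result *= p ** c
    return result

def decode(number):
    """Recover the sequence of codes by reading off prime exponents."""
    codes = []
    for p in primes():
        if number == 1:
            break
        exponent = 0
        while number % p == 0:
            number //= p
            exponent += 1
        codes.append(exponent)
    return codes

# Hypothetical symbol codes 3, 1, 2 become 2^3 * 3^1 * 5^2.
n = godel_number([3, 1, 2])
print(n, decode(n))
```

Because the encoding can be reversed with nothing but division, a predicate about the number n is, in effect, a predicate about the statement it encodes.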
Now, we can finally start getting to the point of all of this: incompleteness! We’re going to use our newfound ability to nest logical statements into numbers to construct an unprovable true statement.
In the last post, one of the meta-mathematical properties that we defined for the Gödel-numbered logic was immConseq, which defines when some statement x is an immediate consequence of a set of statements S. As a reminder, that means that x can be inferred from statements in S in one inference step.
We can use that property to define what it means to be a consequence of a set of statements: it’s the closure of immediate consequence. We can define it in pseudo-code as:
def conseq(κ) = {
  K = κ + axioms
  added_to_k = false
  do {
    added_to_k = false
    for all c in immConseq(K) {
      if c not in K {
        add c to K
        added_to_k = true
      }
    }
  } while added_to_k
  return K
}
In other words, Conseq(κ) is the complete set of everything that can possibly be inferred from the statements in κ and the axioms of the system. We can say that there’s a proof for a statement x in κ if and only if x ∈ Conseq(κ).
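The closure loop can be sketched in runnable Python. This is only a toy under stated assumptions: statements are strings or `("->", a, b)` tuples, and the single inference rule `modus_ponens` is a hypothetical stand-in for immConseq. In a real logic the set of consequences is infinite, so a loop like this would never finish; it only terminates here because the toy rule produces finitely many statements.

```python
def conseq(kappa, axioms, imm_conseq):
    """Close kappa + axioms under the one-step inference function imm_conseq."""
    known = set(kappa) | set(axioms)
    while True:
        new = imm_conseq(known) - known
        if not new:          # nothing was added: we have reached the closure
            return known
        known |= new

def modus_ponens(statements):
    """Toy one-step inference: from a and ("->", a, b), infer b."""
    return {s[2] for s in statements
            if isinstance(s, tuple) and s[0] == "->" and s[1] in statements}

axioms = {("->", "p", "q"), ("->", "q", "r")}
closure = conseq({"p"}, axioms, modus_ponens)
print("r" in closure, "s" in closure)
```

Here "r" takes two inference steps to reach, which is exactly what the repeat-until-nothing-changes loop is for.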
We can take the idea of Conseq and use it to define a strong version of what it means for a logical system with a set of facts to be consistent. A system is ω-consistent if and only if there is not a statement a such that: a ∈ Conseq(κ) ∧ not(forall(v, a)) ∈ Conseq(κ).
In other words, the system is ω-consistent as long as it’s never true that both a statement and the negation of its universal form can be inferred. Strictly speaking, ω-consistency is a stronger condition than plain consistency. But for our purposes, we can treat it as being pretty much the same thing. (Yes, that’s a bit hand-wavy, but I’m not trying to write an entire book about Gödel here!)
(Gödel’s version of the definition of ω-consistency is harder to read than this, because he’s very explicit about the fact that Conseq is a property of the numbers. I’m willing to fuzz that, because we’ve shown that the statements and the numbers are interchangeable.)
Using the definition of ω-consistency, we can finally get to the actual statement of the incompleteness theorem!
Gödel’s First Incompleteness Theorem: For every ω-consistent primitive recursive set κ of formulae, there is a primitive-recursive predicate r(x) such that neither forall(v, r) nor not(forall(v, r)) is provable.
To prove that, we’ll construct the predicate r.
First, we need to define a version of our earlier isProofFigure that’s specific to the set of statements κ:
pred isProofFigureWithKappa(x, kappa) = {
  all n in 1 to length(x) {
    isAxiom(item(n, x)) or
    item(n, x) in kappa or
    some p in 0 to n {
      some q in 0 to n {
        immedConseq(item(n, x), item(p, x), item(q, x))
      }
    }
  } and length(x) > 0
}
This is the same as the earlier definition – just specialized so that it ensures that every statement in the proof figure is either an axiom, a member of κ, or an immediate consequence of two earlier statements in the figure.
We can do the same thing to specialize the predicates proofFor and provable:
pred proofForStatementWithKappa(x, y, kappa) = {
  isProofFigureWithKappa(x, kappa) and item(length(x), x) = y
}
pred provableWithKappa(x, kappa) = {
  some y { proofForStatementWithKappa(y, x, kappa) }
}
If κ is the set of basic truths that we can work with, then provable in κ is equivalent to provable.
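Checking a proof figure is a purely mechanical job, which a short Python sketch makes easy to see. It works on lists of statements instead of Gödel numbers, and `follows_by_modus_ponens` is a hypothetical toy stand-in for the immediate-consequence relation; none of these names come from the earlier posts.

```python
def is_proof_figure_with_kappa(proof, kappa, is_axiom, imm_conseq):
    """True if every line is an axiom, in kappa, or follows from two earlier lines."""
    if not proof:
        return False
    for n, line in enumerate(proof):
        earlier = proof[:n]
        justified = (
            is_axiom(line)
            or line in kappa
            or any(imm_conseq(line, p, q) for p in earlier for q in earlier)
        )
        if not justified:
            return False
    return True

def proof_for_statement_with_kappa(proof, statement, kappa, is_axiom, imm_conseq):
    """True if proof is a valid proof figure whose last line is statement."""
    return (is_proof_figure_with_kappa(proof, kappa, is_axiom, imm_conseq)
            and proof[-1] == statement)

def follows_by_modus_ponens(line, p, q):
    """Toy immediate-consequence relation: q is the implication p -> line."""
    return q == ("->", p, line)

def no_axioms(statement):
    """Toy system with no logical axioms; everything comes from kappa."""
    return False

kappa = {"p", ("->", "p", "q")}
good = ["p", ("->", "p", "q"), "q"]
print(proof_for_statement_with_kappa(good, "q", kappa, no_axioms, follows_by_modus_ponens))
print(is_proof_figure_with_kappa(["q"], kappa, no_axioms, follows_by_modus_ponens))
```

Note that checking one given proof always terminates. It is the "some y" in provableWithKappa, a search over every possible proof, that is unbounded.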
Now, we can define a predicate NotAProofWithKappa:
pred NotAProofWithKappa(x, y, kappa) = { not (proofForStatementWithKappa(x, subst(y, 19, number(y)), kappa)) }
Based on everything that we’ve done so far, NotAProofWithKappa is primitive recursive.
This is tricky, but it’s really important. We’re getting very close to the goal, and it’s subtle, so let’s take the time to understand this.
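What NotAProofWithKappa does is give us a way to check whether a specific sequence of statements x is not a proof of y. More precisely, it checks that x is not a proof of the statement that we get by substituting y’s own number for variable 19 in y.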
We want to expand NotAProofWithKappa to something universal. Instead of just saying that a specific sequence of statements x isn’t a proof for y, we want to be able to say that no possible sequence of statements is a proof for y. That’s easy to do in logic: you just wrap the statement in a “∀ x ( )”. In Gödel numbering, we defined a function that does exactly that. So the universal form of unprovability is: ∀ a (NotAProofWithKappa(a, y, kappa)).
In terms of the Gödel numbering, if we assume that the Gödel number for the variable a is 17, and the variable y is numbered as 19, we’re talking about the statement p = forall(17, NotAProofWithKappa(17, 19, kappa)).
p is the statement that for some logical statement (the value of variable 19, or y in our definition), there is no possible value for variable 17 (a) where a proves y in κ.
All we need to do now is show that we can make p become self-referential. No problem: we can just put number(p) in as the value of y in NotAProofWithKappa. If we let q be the numeric value of the statement NotAProofWithKappa(a, y, kappa), then:
r = subst(q, 19, number(p))
i = forall(17, r) = subst(p, 19, number(p))
i says that there is no possible value x that proves p(p). In other words, p(p) is unprovable: there exists no possible proof that there is no possible proof of p!
This is what we’ve been trying to get at all this time: self-reference! We’ve got a predicate y which is able to express a property of itself. Worse, it’s able to express a negative property of itself!
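The substitution trick is easier to believe once you’ve watched it happen. Here is a small Python sketch that uses strings in place of Gödel numbers: `number` is a stand-in that quotes a statement, `subst` is plain text replacement, and the placeholder `Y` plays the role of variable 19. All of these are illustrative assumptions; only the shape subst(y, 19, number(y)) comes from the definition above.

```python
VAR = "Y"  # placeholder standing in for variable 19

def number(statement):
    """Stand-in for number(): a quoted name for a statement."""
    return repr(statement)

def subst(statement, var, value):
    """Replace the free variable var in statement with value."""
    return statement.replace(var, value)

def diagonal(statement):
    """Substitute a statement's own quotation into itself."""
    return subst(statement, VAR, number(statement))

# p talks about "the diagonal of Y" for whatever Y is plugged in.
p = "there is no proof of diagonal(Y)"

# Plugging p's own quotation into p yields a closed statement i.
i = diagonal(p)
print(i)

# The statement that i talks about is diagonal(p), which is i itself.
print(diagonal(p) == i)
```

The resulting string never contains a copy of itself. It contains a recipe for building itself, and that is enough to make the statement be about itself without any infinite regress.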
Now we’re faced with two possible choices. Either i is provable – in which case, κ is inconsistent! Or else i is unprovable – in which case κ is incomplete, because we’ve identified a true statement that can’t be proven!
That’s it: we’ve shown that in the Principia’s logic, using nothing but arithmetic, we can create a true statement that cannot be proven. If, somehow, it were to be proven, the entire logic would be inconsistent. So the Principia’s logic is incomplete: there are true statements that cannot be proven true.
We can go a bit further: the process that we used to produce this result about the Principia’s logic is actually applicable to other logics. There’s no magic here: if your logic is powerful enough to do Peano arithmetic, you can use the same trick that we demonstrated here, and show that the logic must be either incomplete or inconsistent. (Gödel proved this formally, but we’ll just handwave it.)
Looking at this with modern eyes, it doesn’t seem quite as profound as it did back in Gödel’s day.
When we look at it through the lens of today, what we see is that in the Principia’s logic, proof is a mechanical process: a computation. If every true statement was provable, then you could take any statement S, and write a program to search for a proof of either S or ¬ S, and eventually, that program would find one or the other, and stop.
In short, you’d be able to solve the halting problem. The proof that the halting problem can’t be solved is really an amazingly profound thing: on a very deep level, it’s the same thing as incompleteness, only it’s easier to understand.
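The proof-search program described above can be sketched in a few lines of Python. Everything concrete here is a hypothetical toy: statements are `("lt", a, b)` or `("ge", a, b)` tuples, a proof is a string of tally marks, and `is_proof_of` is a made-up checker. The toy system is complete, so the search always stops; the point of incompleteness is that for a logic as strong as the Principia’s, no such guarantee is possible.

```python
from itertools import count, product

def decide(statement, negate, is_proof_of, alphabet):
    """Try every finite string over alphabet, shortest first, until one is a
    proof of statement (return True) or of its negation (return False).
    Only guaranteed to stop if the system is complete."""
    for length in count(1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if is_proof_of(candidate, statement):
                return True
            if is_proof_of(candidate, negate(statement)):
                return False

def negate(statement):
    """Toy negation: 'a < b' and 'a >= b' are each other's negations."""
    kind, a, b = statement
    return ("ge" if kind == "lt" else "lt", a, b)

def is_proof_of(candidate, statement):
    """Toy checker: a proof of a < b is b - a marks of '|';
    a proof of a >= b is a - b + 1 marks of '-'."""
    kind, a, b = statement
    if kind == "lt":
        return a < b and candidate == "|" * (b - a)
    return a >= b and candidate == "-" * (a - b + 1)

print(decide(("lt", 2, 5), negate, is_proof_of, "|-"))
print(decide(("lt", 5, 2), negate, is_proof_of, "|-"))
```

The checker does a bounded amount of work per candidate; all of the difficulty lives in the unbounded outer loop, which is exactly where the halting problem shows up.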
But at the time that Gödel was working, Turing hadn’t written his paper about the halting problem. Incompleteness was published in 1931; Turing’s halting paper was published in 1936. This was a totally unprecedented idea when it was published. Gödel produced one of the most profound and surprising results in the entire history of mathematics, showing that the efforts of the best mathematicians in the world to produce the perfection of mathematics were completely futile.
]]>