Category Archives: Bad Software

A failed attempt to prove P == NP

I wasn’t originally going to write about this, but people keep sending it to me asking for comments.

In computer science, we have one really gigantic open question about complexity. In the lingo, we ask “Does P == NP?”. (I’ll explain what that means below.) On March 9th, a guy named Michael LaPlante posted a paper to ArXiv that purports to prove, once and for all, that P == NP. If this were the case, if Mr. LaPlante (I’m assuming Mr.; if someone knows differently, ie. that it should be Doctor, or Miss, please let me know!) had in fact proved that P==NP, it would be one of the most amazing events in computer science history. And it wouldn’t only be a theoretical triumph – it would have real, significant practical results! I can’t think of any mathematical proof that would be more exciting to me: I really, really wish that this would happen. But Mr. LaPlante’s proof is, sadly, wrong. Trivially wrong, in fact.

In order to understand what all of this means, why it matters, and where he went wrong, we need to take a step back, and briefly look at computational complexity, what P and NP mean, and what are the implications of P == NP? (Some parts of the discussion that follows are re-edited versions of sections of a very old post from 2007.)

Before we can get to the meat of this, which is talking about P versus NP, we need to talk about computational complexity. P and NP are complexity classes of problems – that is, groups of problems that have similar bounds on their performance.

When we look at a computation, one of the things we want to know is: “How long will this take?”. A specific concrete answer to that depends on all sorts of factors – the speed of your computer, the particular programming language you use to run the program, etc. But independent of those, there’s a basic factor that describes something important about how long a computation will take – the algorithm itself fundamental requires some minimum number of operations. Computational complexity is an abstract method of describing how many operations a computation will take, expressed in terms of the size or magnitude of the input.

For example: let’s take a look at insertion sort. Here’s some pseudocode for insertion sort.

def insertion_sort(lst):
  result = []
  for i in lst:
    for j in result:
      if i < j:
        insert i into result before j
      if i wasn't inserted, add it to the end of result
   return result

This is, perhaps, the simplest sorting algorithm to understand - most of us figured it out on our own in school, when we had an assignment to alphebetize a list of words. You take the elements of the list to be sorted one at a time; then you figure out where in the list they belong, and insert them.

In the worst possible case, how long does this take?

  1. Inserting the first element requires 0 comparisons: just stick it into the list.
  2. Inserting the second element takes exactly one comparison: it needs to be compared to the one element in the result list, to determine whether it goes before or after it.
  3. Inserting the third element could take either one or two comparisons. (If it's smaller than the first element of the result list, then it can be inserted in front without any more comparisons; otherwise, it needs to be compared against the second element of the result list. So in the worst case, it takes 2 comparisons.
  4. In general, for the Nth element of the list, it will take at most n-1 comparisons.

So, in the worst case, it's going to take 0 + 1 + 2 + ... + n-1 comparisons to produce a sorted list of N elements. There's a nice shorthand for computing that series: \frac{(n-1)(n-2)}{2}, which simplifies to \frac{n^2 -3n + 2}{2}, which is O(n2).

So while we can't say "computing a list of 100 elements will take 2.3 seconds" (because that depends on a ton of factors - the specific implementation of the code, the programming language, the machine it's running on, etc.), we can say that the time it takes to run increase roughly proportionally to the square of the size of the input - which is what it means when we say that insertion sort is O(n2).

That's the complexity of the insert sort algorithm. When we talk about complexity, we can talk about two different kinds of complexity: the complexity of an algorithm, and the complexity of a problem. The complexity of an algorithm is a measure of how many steps the algorithm takes to execute on an input of a particular size. It's specific to the algorithm, that is, the specific method used to solve the the problem. The complexity of the problem is a bound that bounds the best case of the complexity of any possible algorithm that can solve that problem.

For example, when you look at sort, you can say that there's a minimum number of steps that's needed to compute the correct sorted order of the list. In fact, you can prove that to sort a list of elements, you absolutely require n lg n bits of information: there's no possible way to be sure you have the list in sorted order with less information that that. If you're using an algorithm that puts things into sorted order by comparing values, that means that you absolutely must do O(n lg n) comparisons, because each comparison gives you one bit of information. That means that sorting is an O(n log n) problem. We don't need to know which algorithm you're thinking about - it doesn't matter. There is no possible comparison-based sorting algorithm that takes less than O(n \log n) steps. (It's worth noting that there's some weasel-words in there: there are some theoretical algorithms that can sort in less than O(n lg n), but they do it by using algorithms that aren't based on binary comparisons that yield one bit of information.)

We like to describe problems by their complexity in that way when we can. But it's very difficult. We're very good at finding upper bounds: that is, we can in general come up with ways of saying "the execution time will be less than O(something)", but we are very bad at finding ways to prove that "the minimum amount of time needed to solve this problem is O(something)". That distinction, between the upper bound (maximum time needed to solve a problem), and lower bound (minimum time needed to solve a problem) is the basic root of the P == NP question.

When we're talking about the complexity of problems, we can categorize them into complexity classes. There are problems that are O(1), which means that they're constant time, independent of the size of the input. There are linear time problems, which can be solved in time proportional to the size of the input. More broadly, there are two basic categories that we care about: P and NP.

P is the collection of problems that can be solved in polynomial time. That means that in the big-O notation for the complexity, the expression inside the parens is a polynomial: the exponents are all fixed values. Speaking very roughly, the problems in P are the problems that we can at least hope to solve with a program running on a real computer.

NP is the collection of problems that can be solved in non-deterministic polynomial time. We'll just gloss over the "non-deterministic" part, and say that for a problem in NP, we don't know of a polynomial time algorithm for producing a solution, but given a solution, we can check if it's correct in polynomial time. For problems in NP, the best solutions we know of have worst-case bounds that are exponential - that is, the expression inside of the parens of the O(...) has an exponent containing the size of the problem.

NP problems are things that we can't solve perfectly with a real computer. The real solutions take an amount of time that's exponential in the size of their inputs. Tripling the size of the problem increases its execution time by a factor of 27; quadrupling the input size increases execution time by at least a factor of 256; increasing the input by a factor of 10 increases execution time by at least a factor of 10,000,000,000. For NP problems, we're currently stuck using heuristics - shortcuts that will quickly produce a good guess at the real solution, but which will sometimes be wrong.

NP problems are, sadly, very common in the real world. For one example, there's a classic problem called the travelling salesman. Suppose you've got a door-to-door vacuum cleaner salesman. His territory has 15 cities. You want to find the best route from his house to those 15 cities, and back to his house. Finding that solution isn't just important from a theoretical point of view: the time that the salesman spends driving has a real-world cost! We don't know how to quickly produce the ideal path.

The big problem with NP is that we don't know lower bounds for anything in it. That means that while we know of slow algorithms for finding the solution to problems in NP, we don't know if those algorithms are actually the best. It's possible that there's a fast solution - a solution in polynomial time which will give the correct answer. Many people who study computational complexity believe that if you can check a solution in polynomial time, then computing a solution should also be polynomial time with a higher-order polynomial. (That is, they believe that there should be some sort of bound like "the time to find a solution is no more than the cube of the time to check a solution".) But so far, no one has been able to actually prove a relationship like that.

When you look at NP problems, some of them have a special, amazing property called NP completeness. If you could come up with a polynomial time solution for any single NP-complete problem, then you'd also discover exactly how to come up with a polynomial time solution for every other problem in NP..

In Mr. LaPlante's paper, he claims to have implemented a polynomial time solution to a problem called the maximum clique problem. Maximum clique is NP complete - so if you could find a P-time solution to it, you'd have proven that P == NP, and that there are polynomial time solutions to all NP problems.

The problem that Mr. LaPlante looked at is the maximal clique problem:

  • Given:
    1. a set of V atomic objects called vertices;
    2. a set of E of objects called edges, where each edge is an unordered pair (x, y), where x and y are vertices.
  • Find:
    • The largest set of vertices C=\{v_1, ..., v_n\} where for any v_i, there is an edge between v_i to every other vertex in C.

Less formally: given a bunch of dots, where some of the dots are connected by lines, find the largest set of dots where every dot in the set is connected to every other dot in the set.

The author claims to have come up with a simple P-time solution to that.

The catch? He's wrong. His solution isn't P-time. It's sloppy work.

His algorithm is pretty easy to understand. Each vertex has a finite set of edges connecting it to its neighbors. You have each node in the graph send its list of its neighbors to its neighbors. With that information, each node knows what 3-cliques its a part of. Every clique of size larger than 3 is made up of overlapping 3-cliques - so you can have the cliques merge themselves into ever larger cliques.

If you look at this, it's still basically considering every possible clique. But His "analysis" of the complexity of his algorithm is so shallow and vague that it's easy to get things wrong. It's a pretty typical example of a sloppy analysis. Complexity analysis is hard, and it's very easy to get wrong. I don't want to be too hard on Mr. LaPlante, because it's an extremely easy mistake to make. Analyzing algorithmic complexity needs to be done in a careful, exacting, meticulous way - and while Mr. LaPlante didn't do that, most people who are professional programmers could easily make a similar mistake! But the ultimate sloppiness of it is that he never bothers to finish computing the complexity. He makes vague hand-wavy motions at showing the complexity of certain phases of his algorithm, but he never even bothers to combine them and come up with an estimate of the full upper-bound of his algorithm!

I'm not going to go into great detail about this. Instead, I'll refer you to a really excellent paper by Patrick Prosser, which looks at a series of algorithms that compute exact solutions to the maximum clique problem, and how they're analyzed. Compare their analysis to Mr. LaPlante's, and you'll see quite clearly how sloppy LaPlante was. I'll give you a hint about one thing LaPlante got wrong: he's taking some steps that take significant work, and treating them as if they were constant time.

But we don't even really need to look at the analysis. Mr. LaPlante provides an implementation of his supposedly P-time algorithm. He should be able to show us execution times for various randomly generated graphs, and show how that time grows as the size of the graph grows, right? I mean, if you're making claims about something like this, and you've got real code, you'll show your experimental verification as well as your theoretical analysis, right?

Nope. He doesn't. And I consider that to be a really, really serious problem. He's claiming to have reduced an NP-complete problem to a small-polynomial complexity: where are the numbers?

I'll give you a good guess about the answer: the algorithm doesn't complete in a reasonable amount of time for moderately large graphs. You could argue that even if it's polynomial time, you're looking at exponents that are no smaller than 3 (exactly what he claims the bound to be is hard to determine, since he never bothers to finish the analysis!) - a cubic algorithm on a large graph takes a very long time. But... not bothering to show any runtime data? Nothing at all? That's ridiculous. If you look at the Prosser paper above, he manages to give actual concrete measurements of the exponential time algorithms. LaPlante didn't bother to do that. And I can only conclude that he couldn't gather actual numbers to support his idea.

Lenovo and the Superfish Scam

I’m a bit late to the scene here, but I think it’s still worth making a stab at this. Also, this is being written on an airplane, where I am travelling home from San Francisco to NY with a case of bronchitis. So I am not exactly at my best.

Lenovo, one of the largest makers of windows-based laptops, sold out its customers as part of one of the worst deliberate violations of computer security I’ve ever seen, by shipping a piece of software called Superfish pre-installed on its computers. Superfish is, with absolutely no exaggeration, one of the most serious, unethical, despicable things I’ve seen in quite a lot time. It’s appalling.

So what is it, and what’s the big deal?

We need to start with some background, and talk a bit about how secure connections work on the internet.

Every time that you visit a website with a secure connection (a URL that starts with https), you’re using a protocol called TLS (formerly SSL). TLS is designed to do two things:

  1. Ensure that you are talking to who you think you’re talking to.
  2. Ensure that no one but you and the person you wanted to talk to can actually see what you’re saying.

The way that it does both of those is based on encryption. Every time you create a secure connect to a website, you’re exchanging credentials with the site to ensure that they’re who they say they are, and then based on those credentials, you establish an encryption key for the rest of your communication.

That connection-establishment process is the critical bit. You need to get some information that allows you to trust them to be who they claim to be. The security and integrity of everything that happens over the connection depends on the truth and integrity of that initial piece of identity verification.

The identity verification piece of TLS is built using public key cryptography, as part of a standard infrastructure for key maintenance and verification called X.509.

I’ve written about public key crypto before – see here. The basic idea behind public key crypto is that you’ve got two keys, called the public and private keys. Anything which is encrypted with the public key can only be decrypted with the private key; anything which is encrypted with the private key can only be decrypted using the public key. Your public key is available to anyone who wants it; no one but you has access to your private key.

If you receive a message from me, and you can decrypt it with my public key, then you know, without a doubt, that you can be sure that I’m the one who encrypted it. Only my private key could have encrypted a message that could then be decrypted with my public key.

For that to work, though, there’s one thing that you need to be sure of: you need to be absolutely sure that the public key that you’ve got is really my public key. If some clever person managed to somehow give you a different key, and convince you that it’s mine, then they can send you messages, and they’ll look exactly as if they came from me. If I handed you my public key on a USB thumbdrive, then you’re sure that the key came from me – but if you received it online, haw can you be sure that it was really me that gave it to you? Maybe someone switched it along the way?

In X.509, we use an idea of delegated trust. That is, we have some small collection of fundamental trusted authorities. Those authorities can issue public/private key pairs, so when someone need a public key, they can go to them and ask for it, and they’ll create one. The authority gives them a certificate, which is a copy of the new public key encrypted by the authority using their private key.

Now, when someone connects to a website, the target site can state who they are by sending a copy of the certificate. The client site recieves the certificate, decrypts it using the authorities public key, and then starts using that public key to encrypt their communications.

If the two sides can keep talking, then the client knows who it’s talking to. It got a public key, and it’s using that public key to talk to the server; so the server couldn’t decrypt the communication unless it had the public key; and it trusts that that it got the right public key, because it was encrypted with the private key of the certificate authority.

This is great as far as it goes, but it leaves us with a single certificate authority (or, at best, a small group). With billions of human users, and possibly trillions of networked devices, having a single authority isn’t manageable. They simple can’t produce enough keys for everyone. So we need to use our trust in the certificate authority to expand the pool of trust. We can do that by saying that if the certificate authority can declare that a particular entity is trustworthy, then we can use that entity itself as a verifier. So now we’ve taken a single trusted authority, and expanded that trust to a collection of places. Each of those newly trusted entities can now also issue new keys, and can certify their validity, by showing their certificate, plus the new encrypted public key. In general, anyone can issue a public key – and we can check its validity by looking at the chain of authorities that verified it, up to the root authority.

There’s a catch to this though: the base certificate providers. If you can trust them, then everything works: if you’ve got a certificate chain, you can use it to verify the identity of the party you’re talking to. But if there’s any question about the validity of the root certificate provider, if there’s any question whether or not you have the correct, valid public key for that provider, then you’re completely hosed. Ultimately, there’s some piece of seed information which you have to start off with. You need to accept the validity of an initial certificate authority based on some other mechanism. The people who sold you your computer, or the people who built your web browser, generally install a root certificate – basically the public key for a trusted certificate authority.

If that root certificate isn’t trustworthy, then nothing that results from it can be trusted. The untrustworthy root certificate can be used by an unscrupulous person to create new certificates allowing them to masquerade as anything that they want.

In particular, an untrustworthy root certificate it makes it easy to perform a man-in-the-middle attack. Suppose you want to talk to your bank, and somehow Joe Schmoe has planted a bad root certificate on your computer:

  1. You try to connect to Bank.
  2. Joe intercepts the request. He accepts your connection request, pretending to be the bank. Using his fake root certificate, he sends you a certificate claiming to be the bank.
  3. Now, you’re connected to Joe, believing that you’re connected to the bank. You try to log in to your account, sending your username and password.
  4. Joe receives your request, connects to the bank, passing on your request.
  5. Joe logs in to the bank on your behalf. The bank returns its successful login screen, showing your account numbers and balances.
  6. Joe copies the successful login screen from his connection to the bank to you.
  7. Every time that you send information to “the bank”, Joe decrypts it using his private key, and then sends it on to the bank masquerading as you. Similarly, when the bank sends something to you, he decrypts it using the banks public key, and then encrypts it with his his private key to send it to you.
  8. You now believe that you are connected to the bank over a secure connection. Everything that you do looks exactly as if you’re connected to the bank over a security encrypted link.
  9. Joe now has your login information, and control of your bank accounts.

In the Lenovo fiasco, Lenovo installed a system called Superfish, which deliberately installs a bad root certificate, and then uses that root certificate to create man-in-the-middle attacks on every secure connection ever made with the computer.

Why does it do that? Purportedly for ad retargeting: it uses its man-in-the-middle to decrypt supposedly secure connections so that it can replace ads in the pages that you view. That way, Lenovo and Superfish get advertising money for the pages you view, instead of the page-owners.

It’s spectacularly despicable. It’s fundamentally compromising everything you do with your computer. It’s doing it in a way that you can’t detect. And it’s doing it all in a sleazy attempt to steal advertising money.

Based on this, I’ve got two pieces of advice:

  1. Don’t ever buy a Lenovo computer. Never, ever, ever. A company that would even consider installing something like this isn’t a company that you can ever trust.
  2. If you have a Lenovo computer already, get rid of it. You could reformat your disks and install a fresh untainted copy of your operating system – but if your hardware manufacturer is a sleezy as Lenovo has demonstrated themselves to be, then there’s no reason to believe that that’s going to be sufficient.

This is, by far, the worst thing that I’ve ever seen a computer manufacturer do. They deserve to be run out of business for this.

Innovation isn’t just hardware!

I’m a bit late to the party on this, but hey, such is life! I work on infrastructure at Twitter, and we’ve been going crazy trying to get stuff deployed in time to be ready for the world cup. So I haven’t had time to write before now!

Anyway, late last week, a professor known for stupid self-promoting stunts announced that the Turing test had been passed! This was, according to said professor, a huge thing, a really big deal, a historic event!

(Perhaps almost as big a deal as the time he had an RFID chip implanted in his arm, and announced that now he was the world’s first cyborg!)

Lots of people have written about the stupidity of the claim. The alleged “winner” was a program that pretended to be a teenaged kid with ADD who wasn’t a native english speaker. It didn’t even attempt to simulate intelligence, just to mislead the judges by providing excuses for its incoherence.

But that’s not what I wanted to comment on. Like I said, I’m late to the game, and that means that the problems with the alleged winner of the competition to pass the Turing test have been covered many times already. What I wanted to comment on was a theme I saw in several of the explanations. Here’s a typical example, taken from an article in the New Yorker:

Here’s what Eugene Goostman isn’t: a supercomputer. It is not a groundbreaking, super-fast piece of innovative hardware but simply a cleverly-coded piece of software, heir to a program called ELIZA that was first developed—as a joke—in the nineteen-sixties.

This is an example of what I call “IBM Thinking”. I used to work for IBM, and one of the (many) frustrations there was a deep-seated belief that “innovation” and “technology” meant hardware. Software is just the silly unimportant stuff that runs on top of hardware; hardware is what matters.

According to this attitude, any advance in technology, any new achievement, must happen because someone created better hardware that made it possible. A system that beats the turing test? If it’s real, it must be a new supercomputer! If it’s not, then it’s not interesting!

This is, in a word, nonsense.

Hardware advances. But so does software. And these days, the big advances are more likely to be in software than in hardware.

There’s a mathematical law, called Church’s thesis, which we’ve known about for a long time. Put simply, it says that there is a maximum limit in computation. All computing devices, no matter how they’re designed or built, are ultimately, at best, equivalent to other computing computing devices. It doesn’t matter whether you’re a Turing machine, or a PC, or a supercomputing cluster – the set of problems that can be solved by computation is fixed. Different devices may be able to solve a given problem faster by some amount, but it can’t solve problems that are truly unsolvable.

We’ve gotten to the point where we can build incredibly fast computers. But those computers are all built on the same basic model of computing that we’ve been using for decades. When a problem that seemed unsolvable before becomes solvable now, most of the time, that’s not because there was some problem with old hardware that made the problem unsolvable – it’s because people have figured out how to write software that solves the problem. In fact, a lot of recent innovations in hardware became possible not because of some fundamental change in the hardware, but because people worked out clever software to help design and lay out circuits on silicon.

To take one example that’s very familiar to me (because my wife is one of the technical leads of the project), consider IBM’s Watson – the computer that beat the two best human Jepoardy players. IBM has stressed, over and over again, that Watson was a cluster of machines built on IBM’s Power architecture. But the only reason that they used Power was marketing. What made Watson special was that a team of brilliant researchers solved a very hard problem with clever software. Watson wasn’t a supercomputer. It was a cluster of off-the-shelf hardware. It could easily have been built from a collection of ultra-cheap PC motherboards stacked together with a high-speed network. The only thing that made Watson special – the thing that made Watson possible had nothing to do with hardware. It’s just a bunch of cleverly-coded software. That’s it.

To get even closer to home: I work on software that lets Twitter run its system on a cluster of thousands and thousands of cheap machines. The hardware is really unimpressive, except for its volume. It’s a ton of cheap PC motherboards, mounted in rack after rack, with network cables connecting the racks. Google has the same thing, but with even more machines. Amazon has the same thing, except that they might even have more machines that Google! If you handed me a ton of money, I – an idiot when it comes to hardware – could build a datacenter like Twitter’s. It would be a huge amount of work – but there’d be nothing inventive about building a new cluster. In fact, that’s the whole point of cluster-based computing: it’s both cheaper and easier to just buy a couple of thousand cheap machines, and distribute your work among them than it is to buy and program one huge machine capable of doing it all by itself.

What makes those clusters special isn’t the hardware. It’s all software. How do you take a thousand, or ten thousand, or a hundred thousand, or a million computers, and make them useful? With software. It’s just clever software – two systems, called Mesos and Aurora, which take that collection of thousands upon thousands of machines, and turn it into something that we can easily program, and easily share between thousands of different programs all running on it.

That doesn’t take away from the accomplishment. It just puts the focus where it belongs. “Clever software” isn’t some kind of trick. It’s the product of hard work, innovation, and creativity. It’s where many – or even most – of the big technological advances that we’re watching today are coming from. Innovation and advance aren’t just something that happens in hardware.

The Heartbleed Bug

There’s a lot of panic going around on the internet today, about something called the Heartbleed bug. I’ve gotten questions, so I’m giving answers.

I’ve heard lots of hype. Is this really a big deal?

Damn right it is!

It’s pretty hard to wrap your head around just how bad this actually is. It’s probably even more of a big deal that the hype has made it out to be!

This bug affects around 90% of all sites on the internet that use secure connections. Seriously: if you’re using the internet, you’re affected by this. It doesn’t matter how reputable or secure the sites you connect to have been in the past: the majority of them are probably vulnerable to this, and some number of them have, in all likelihood, been compromised! Pretty much any website running on linux or netbst, using Apache or NGINX as its webserver is vulnerable. That means just about every major site on the net.

The problem is a bug in a commonly used version of SSL/TLS. So, before I explain what the bug is, I’ll run through a quick background.

What is SSL/TLS?

When you’re using the internet in the simplest mode, you’re using a simple set of communication protocols called TCP/IP. In basic TCP/IP protocols, when you connect to another computer on the network, the data that gets sent back and forth is not encrypted or obscured – it’s just sent in the clear. That means that it’s easy for anyone who’s on the same network cable as you to look at your connection, and see the data.

For lots of things, that’s fine. For example, if you want to read this blog, there’s nothing confidential about it. Everyone who reads the blog sees the same content. No one is going to see anything private.

But for a lot of other things, that’s not true. You probably don’t want someone to be able to see your email. You definitely don’t want anyone else to be able to see the credit card number you use to order things from Amazon!

To protect communications, there’s another protocol called SSL, the Secure Sockets Layer. When you connect to another site that’s got a URL starting with https:, the two computers establish an encrypted connection. Once an SSL connection is established between two computers, all communication between them is encrypted.

Actually, on most modern systems, you’re not really using SSL. You’re using a successor to the original SSL protocol called TLS, which stands for transport layer security. Pretty much everyone is now using TLS, but many people still just say SSL, and in fact the most commonly used implementation of it is in package called OpenSSL.

So SSL/TLS is the basic protocol that we use on the internet for secure communications. If you use SSL/TLS correctly, then the information that you send and receive can only be accessed by you and the computer that you’re talking to.

Note the qualifier: if you use SSL correctly!

SSL is built on public key cryptography. What that means is that a website identifies itself using a pair of keys. There’s one key, called a public key, that it gives away to everyone; and there’s a second key, called a private key, that it keeps secret. Anything that you encrypt with the public key can only be decrypted using the private key; anything encrypted with the private key can only be decrypted using the public key. That means that if you get a message that can be decrypted using the sites public key, you know that no one except the site could have encrypted it! And if you use the public key to encrypt something, you know that no one except that site will be able to decrypt it.

Public key cryptography is an absolutely brilliant idea. But it relies on the fact that the private key is absolutely private! If anyone else can get a copy of the private key, then all bets are off: you can no longer rely on anything about that key. You couldn’t be sure that messages came from the right source; and you couldn’t be sure that your messages could only be read by an authorized person.

So what’s the bug?

The SSL protocol includes something called a heartbeat. It’s a periodic exchange between the two sides of a connection, to let them know that the other side is still alive and listening on the connection.

One of the options in the heartbeat is an echo request, which is illustrated below. Computer A wants to know if B is listening. So A sends a message to B saying “Here’s X bytes of data. Send them back to me.” Then A waits. If it gets a message back from B containing the same X bytes of data, it knows B was listening. That’s all there is to it: the heartbeat is just a simple way to check that the other side is actually listening to what you say.


The bug is really, really simple. The attacker sends a heartbeat message saying “I’m gonna send you a chunk of data containing 64000 bytes”, but then the data only contains one byte.

If the code worked correctly, it would say “Oops, invalid request: you said you were going to send me 64000 bytes of data, but you only sent me one!” But what the buggy version of SSL does is send you that 1 byte, plus 63,999 bytes of whatever happens to be in memory next to wherever it saved the byte..


You can’t choose what data you’re going to get in response. It’ll just be a bunch of whatever happened to be in memory. But you can do it repeatedly, and get lots of different memory chunks. If you know how the SSL implementation works, you can scan those memory chunks, and look for particular data structures – like private keys, or cookies. Given a lot of time, and the ability to connect multiple times and send multiple heartbeat requests each time you connect, you can gather a lot of data. Most of it will be crap, but some of it will be the valuable stuff.

To make matters worse, the heartbeat is treated as something very low-level which happens very frequently, and which doesn’t transfer meaningful data. So the implementation doesn’t log heartbeats at all. So there’s no way of even identifying which connections to a server have been exploiting this. So a site that’s running one of the buggy versions of OpenSSL has no way of knowing whether or not they’ve been the target of this attack!

See what I mean about it being a big deal?

Why is it so widespread?

When I’ve written about security in the past, one of the things that I’ve said repeatedly is: if you’re thinking about writing your own implementation of a security protocol, STOP. Don’t do it! There are a thousand ways that you can make a tiny, trivial mistake which completely compromises the security of your code. It’s not a matter of whether you’re smart or not; it’s just a simple statement of fact. If you try to do it, you will screw something up. There are just so many subtle ways to get things wrong: it takes a whole team of experts to even have a chance to get it right.

Most engineers who build stuff for the internet understand that, and don’t write their own cryptosystems or cryptographic protocols. What they do is use a common, well-known, public system. They know the system was implemented by experts; they know that there are a lot of responsible, smart people who are doing their best to find and fix any problems that crop up.

Imagine that you’re an engineer picking an implementation of SSL. You know that you want as many people trying to find problems as possible. So which one will you choose? The one that most people are using! Because that’s the one that has the most people working to make sure it doesn’t have any problems, or to fix any problems that get found as quickly as possible.

The most widely used version of SSL is an open-source software package called OpenSSL. And that’s exactly where the most bug is: in OpenSSL.

How can it be fixed?

Normally, something like this would be bad. But you’d be able to just update the implementation to a new version without the bug, and that would be the end of the problem. But this case is pretty much the worst possible case: fixing the implementation doesn’t entirely fix the problem. Because even after you’ve fixed the SSL implementation, if someone got hold of your private key, they still have it. And there’s no way to know if anyone stole your key!

To fix this, then, you need to do much more than just update the SSL implementation. You need to cancel all of your keys, and replace them new ones, and you need to get everyone who has a copy of your public key to throw it away and stop using it.

So basically, at the moment, every keypair for nearly every major website in the world that was valid yesterday can no longer be trusted.

Everyone stop implementing programming languages, right now! It's been solved!

Back when I was a student working on my PhD, I specialized in programming languages. Lucky for me I did it a long time ago! According to Wired, if I was working on it now, I’d be out of luck – the problem is already solved!

See, these guys built a new programming language which solves all the problems! I mean, just look how daft all of us programming language implementors are!

Today’s languages were each designed with different goals in mind. Matlab was built for matrix calculations, and it’s great at linear algebra. The R language is meant for statistics. Ruby and Python are good general purpose languages, beloved by web developers because they make coding faster and easier. But they don’t run as quickly as languages like C and Java. What we need, Karpinski realized after struggling to build his network simulation tool, is a single language that does everything well.

See, we’ve been wasting our time, working on languages that are only good for one thing, when if only we’d had a clue, we would have just been smart, and built one perfect language which was good for everything!

How did they accomplish this miraculous task?

Together they fashioned a general purpose programming language that was also suited to advanced mathematics and statistics and could run at speeds rivaling C, the granddaddy of the programming world.

Programmers often use tools that translate slower languages like Ruby and Python into faster languages like Java or C. But that faster code must also be translated — or compiled, in programmer lingo — into code that the machine can understand. That adds more complexity and room for error.

Julia is different in that it doesn’t need an intermediary step. Using LLVM, a compiler developed by University of Illinois at Urbana-Champaign and enhanced by the likes of Apple and Google, Karpinski and company built the language so that it compiles straight to machine code on the fly, as it runs.

Ye bloody gods, but it’s hard to know just where to start ripping that apart.

Let’s start with that last paragraph. Apparently, the guys who designed Julia are geniuses, because they used the LLVM backend for their compiler, eliminating the need for an intermediate language.

That’s clearly a revolutionary idea. I mean, no one has ever tried to do that before – no programming languages except C and C++ (the original targets of LLVM). Except for Ada. And D. And fortran. And Pure. And Objective-C. And Haskell. And Java. And plenty of others.

And those are just the languages that specifically use the LLVM backend. There are others that use different code generators to generate true binary code.

But hey, let’s ignore that bit, and step back.

Let’s look at what they say about how other people implement programming languages, shall we? The problem with other languages, they allege, is that their implementations don’t actually generate machine code. They translate from a slower language into a faster language. Let’s leave aside the fact that speed is an attribute of an implementation, not a language. (I can show you a CommonLisp interpreter that’s slow as a dog, and I can show you a CommonLisp interpreter that’ll knock your socks off.)

What do the Julia guys actually do? They write a front-end that generates LLVM intermediate code. That is, they don’t generate machine code directly. They translate code written in their programming languages into code written in an abstract virtual machine code. And then they take the virtual machine code, and pass it to the LLVM backend, which translates from virtual code to actual true machine code.

In other words, they’re not doing anything different from pretty much any other compiled language. It’s incredibly rare to see a compiler that actually doesn’t do the intermediate code generation. The only example I can think of at the moment is one of the compilers for Go – and even it uses some intermediates internally.

Even if Julia never displaces the more popular languages — or if something better comes along — the team believes it’s changing the way people think about language design. It’s showing the world that one language can give you everything.

That said, it isn’t for everyone. Bezanson says it’s not exactly ideal for building desktop applications or operating systems, and though you can use it for web programming, it’s better suited to technical computing. But it’s still evolving, and according to Jonah Bloch-Johnson, a climate scientist at the University of Chicago who has been experimenting with Julia, it’s more robust than he expected. He says most of what he needs is already available in the language, and some of the code libraries, he adds, are better than what he can get from a seasoned language like Python.

So, our intrepid reporter tells us, the glorious thing about Julia is that it’s one language that can give you everything! This should completely change the whole world of programming language design – because us idiots who’ve worked on languages weren’t smart enough to realize that there should be one language that does everything!

And then, in the very next paragraph, he points out that Julia, the great glorious language that’s going to change the world of programming language design by being good at everything, isn’t good at everything!

Jeebus. Just shoot me now.

I’ll finish with a quote that pretty much sums up the idiocy of these guys.

“People have assumed that we need both fast and slow languages,” Bezanson says. “I happen to believe that we don’t need slow languages.”

This sums up just about everything that I hate about what happens when idiots who don’t understand programming languages pontificate about how languages should be designed/implemented.

At the moment, in my day job, I’m doing almost all of my programming in Python. Now, I’m not exactly a huge fan of Python. There’s an awful lot of slapdash and magic about it that drive me crazy. But I can’t really dispute the decision to use it for my project, because it’s a very good choice.

What makes it a good choice? A certain kind of flexibility and dynamicism. It’s a great language for splicing together different pieces that come from different places. It’s not the fastest language in the world. But for my purposess, that’s completely irrelevant. If you took a super-duper brilliant, uber-fast language with a compiler that could generate perfectly optimal code every time, it wouldn’t be any faster than my Python program. How can that be?

Because my Python program spends most of its time idle, waiting for something to happen. It’s talking to a server out on a datacenter cluster, sending it requests, and then waiting for them to complete. When they’re done, it looks at the results, and then generates output on a local console. If I had a fast compiler, the only effect it would have is that my program would spend more time idle. If I were pushing my CPU anywhere close to its limits, using less CPU before going idle might be helpful. But it’s not.

The speed of the language doesn’t matter. But by making my job easier – making it easier to write the code – it saves something much more valuable than CPU time. It saves human time. And a human programmer is vastly more expensive than another 100 CPUs.

We don’t specifically need slow languages. But no one sets out to implement a slow language. People implement useful languages. And they make intelligent decisions about where to spend their time. You could implement a machine code generator for Python. It would be an extremely complicated thing to do – but you could do it. (In fact, someone is working on an LLVM front-end for Python! It’s not for Python code like my system, but there’s a whole community of people who use Python for implementing numeric processing code with NumPy.) But what’s the benefit? For most applications, absolutely nothing.

According the the Julia guys, the perfectly rational decision to not dedicate effort to optimization when optimization won’t actually pay off is a bad, stupid idea. And that should tell you all that you need to know about their opinions.

Types Gone Wild! SKI at Compile-Time

Over the weekend, a couple of my Foursquare coworkers and I were chatting on twitter, and one of my smartest coworkers, a great guy named Jorge Ortiz, pointed out that type inference in Scala (the language we use at Foursquare, and also pretty much my favorite language) is Turing complete.

Somehow, I hadn’t seen this before, and it absolutely blew my mind. So I asked Jorge for a link to the proof. The link he sent me is a really beautiful blog post. It doesn’t just prove that Scala type inference is Turing complete, but it does it in a remarkably beautiful way.

Before I get to the proof, what does this mean?

A system is Turing complete when it can perform any possible computation that could be performed on any other computing device. The Turing machine is, obviously, Turing complete. So is lambda calculus, the Minsky machine, the Brainfuck computing model, and the Scala programming language itself.

If type inference is Turing complete, then that means that you can write a Scala program where, in order to type-check the program, the compiler has to run an arbitrary program to completion. It means that there are, at least theoretically, Scala programs where the compiler will take forever – literally forever – to determine whether or not a given program contains a type error. Needless to say, I consider this to be a bad thing. Personally, I’d really prefer to see the type system be less flexible. In fact, I’d go so far as to say that this is a fundamental error in the design of Scala, and a big strike against it as a language. Having a type-checking system which isn’t guaranteed to terminate is bad.

But let’s put that aside: Scala is pretty entrenched in the community that uses it, and they’ve accepted this as a tradeoff. How did the blog author, Michael Dürig, prove that Scala type checking is Turing complete? By showing how to implement a variant of lambda calculus called SKI combinator calculus entirely with types.

SKI calculus is seriously cool. We know that lambda calculus is Turing complete. It turns out that for any lambda calculus expression, there’s a way rewriting it without any variables, and without any lambdas at all, using three canonical master functions. If you’ve got those three, then you can write anything, anything at all. The three are called S, K, and I.

  • The S combinator is: S x y z = x z (y z).
  • The K combinator is: K x y = x.
  • The I combinator is: I x = x.

They come from intuitionistic logic, where they’re fundamental axioms that describe how intuitionistic implication works. K is the rule A Rightarrow (B Rightarrow A); S is the rule (A Rightarrow (B Rightarrow C)) Rightarrow ((A Rightarrow B) Rightarrow C); and I is (A Rightarrow A).

Given any lambda calculus expression, you can rewrite it as a chain of SKIs. (If you’re interested in seeing how, please just ask in the comments; if enough people are interested, I’ll write it up.) What the author of the post id is show how to implement the S, K, and I combinators in Scala types.

trait Term {
  type ap[x <: Term] <: Term
  type eval <: Term

He’s created a type Term, which is the supertype of any computable fragment written in this type-SKI. Since everything is a function, all terms have to have two methods: one of them is a one-parameter “function” which applies the term to a parameter, and the second is a “function” which simplifies the term into canonical form.

He implements the S, K, and I combinators as traits that extend Term. We’ll start with the simplest one, the I combinator.

trait I extends Term {
  type ap[x <: Term] = I1[x]
  type eval = I

trait I1[x <: Term] extends Term {
  type ap[y <: Term] = eval#ap[y]
  type eval = x#eval

I needs to take a parameter, so its apply type-function takes a parameter x, and returns a new type I1[x] which has the parameter encoded into it. Evaluating I1[x] does exactly what you’d want from the I combinator with its parameter – it returns it.

The apply “method” of I1 looks strange. What you have to remember is that in lambda calculus (and in the SKI combinator calculus), everything is a function – so even after evaluating I.ap[x] to some other type, it’s still a type function. So it still needs to be applicable. Applying it is exactly the same thing as applying its parameter.

So if have any type A, if you write something like var a : I.ap[A].eval, the type of a will evaluate to A. If you apply I.ap[A].ap[Z], it’s equivalent to taking the result of evaluating I.ap[A], giving you A, and then applying that to Z.

The K combinator is much more interesting:

// The K combinator
trait K extends Term {
  type ap[x <: Term] = K1[x]
  type eval = K

trait K1[x <: Term] extends Term {
  type ap[y <: Term] = K2[x, y]
  type eval = K1[x]

trait K2[x <: Term, y <: Term] extends Term {
  type ap[z <: Term] = eval#ap[z]
  type eval = x#eval

It's written in curried form, so it's a type trait K, which returns a type trait K1, which takes a parameter and returns a type trait K2.

The implementation is a whole lot trickier, but it's the same basic mechanics. Applying K.ap[X] gives you K1[X]. Applying that to Y with K1[X].ap[Y] gives you K2[K, Y]. Evaluating that gives you X.

The S combinator is more of the same.

// The S combinator
trait S extends Term {
  type ap[x <: Term] = S1[x]
  type eval = S

trait S1[x <: Term] extends Term {
  type ap[y <: Term] = S2[x, y]
  type eval = S1[x]

trait S2[x <: Term, y <: Term] extends Term {
  type ap[z <: Term] = S3[x, y, z]
  type eval = S2[x, y]

trait S3[x <: Term, y <: Term, z <: Term] extends Term {
  type ap[v <: Term] = eval#ap[v]
  type eval = x#ap[z]#ap[y#ap[z]]#eval

Michid then goes on to show examples of how to use these beasts. He implements equality testing, and then shows how to test if different type-expressions evaluate to the same thing. And all of this happens at compile time. If the equality test fails, then it's a type error at compile time!

It's a brilliant little proof. Even if you can't read Scala syntax, and you don't really understand Scala type inference, as long as you know SKI, you can look at the equality comparisons, and see how it works in SKI. It's really beautiful.

Static Typing: Give me a break!

I’m a software engineer. I write code for a living. I’m also a programming language junkie. I love programming languages. I’m obsessed with programming languages. I’ve taught myself more programming languages than any sane person has any reason to know.

Learning that many languages, I’ve developed some pretty strong opinions about what makes a good language, and what kind of things I really want to see in the languages that I use.

My number one preference: strong static typing. That’s part of a more general preference, for preserving information. When I’m programming, I know what kind of thing I expect in a parameter, and I know what I expect to return. When I’m programming in a weakly typed language, I find that I’m constantly throwing information away, because I can’t actually say what I know about the program. I can’t put my knowledge about the expected behavior into the code. I don’t think that that’s a good thing.

But… (you knew there was a but coming, didn’t you?)

This is my preference. I believe that it’s right, but I also believe that reasonable people can disagree. Just because you don’t think the same way that I do doesn’t mean that you’re an idiot. It’s entirely possible for someone to know as much as I do about programming languages and have a different opinion. We’re talking about preferences.

Sadly, that kind of attitude is something that is entirely too uncommon. I seriously wonder somethings if I’m crazy, because it seems like everywhere I look, on every topic, no matter how trivial, most people absolutely reject the idea that it’s possible for an intelligent, knowledgeable person to disagree with them. It doesn’t matter what the subject is: politics, religion, music, or programming languages.

What brought this little rant on is that someone sent me a link to a comic, called “Cartesian Closed Comic”. It’s a programming language geek comic. But what bugs me is this comic. Which seems to be utterly typical of the kind of attitude that I’m griping about.

See, this is a pseudo-humorous way of saying “Everyone who disagrees with me is an idiot”. It’s not that reasonable people can disagree. It’s that people who disagree with me only disagree because they’re ignorant. If you like static typing, you probably know type theory. If you don’t like static typing, that there’s almost no chance that you know anything about type theory. So the reason that those stupid dynamic typing people don’t agree with people like me is because they just don’t know as much as I do. And the reason that arguments with them don’t go anywhere isn’t because we have different subjective preferences: it’s because they’re just too ignorant to understand why I’m right and they’re wrong.

Everything about this argument is crap, and it pisses me off.

Most programmers – whether they prefer static typing or not – don’t know type theory. Most of the arguments about whether to use static or dynamic typing aren’t based on type theory. It’s just the same old snobbery, the “you can’t disagree with me unless you’re an idiot”.

Among intelligent skilled engineers, the static versus dynamic typing thing really comes down to a simple, subjective argument:

Static typing proponents believe that expressing intentions in a static checkable form is worth the additional effort of making all of the code type-correct.

Dynamic typing proponents believe that it’s not: that strong typing creates an additional hoop that the programmer needs to jump through in order to get a working system.

Who’s right? In fact, I don’t think that either side is universally right. Building a real working system is a complex thing. There’s a complex interplay of design, implementation, and testing. What static typing really does is take some amount of stuff that could be checked with testing, and allows the compiler the check it in an abstract way, instead of with specific tests.

Is it easier to write code with type declarations, or with additional tests? Depends on the engineers and the system that they’re building.

Turing Crackpottery!

One of the long-time cranks who’s commented on this blog is a bozo who goes by the name “Vorlath”. Vorlath is a hard-core Cantor crank, and so he usually shows up to rant whenever the subject of Cantor comes up. But last week, while I was busy dealing with the Scientopia site hosting trouble, a reader sent me a link to a piece Vorlath wrote about the Halting problem. Apparently, he doesn’t like that proof either.

Personally, the proof that the halting problem is unsolvable is one of my all-time favorite mathematical proofs. It’s incredibly simple – just a couple of steps. It’s very concrete – the key to the proof is a program that you can actually write, easily, in a couple of lines of code in a scripting language. And best of all, it’s incredibly profound – it proves something very similar to Gödel’s incompleteness theorem. It’s wonderful.

To show you how simple it is, I’m going to walk you through it – in all of its technical details.

Continue reading

Return of the Revenge of the Return of the Compression Schmuck

Well, folks, this one is going to be boring for most of you. But the
compression jackass, Jules “Julie-baby-dumbass” Gilbert has been repeatedly
bugging me. And today, he replied to my latest “fuck off and stop sending me
mail” with a long-winded proselytizing response in which he requested that I
be courteous and not publicize the discussion of his “work”.

Fuck that

The entire correspondence is below.

Before I get to it: the short summary of what Jules is doing
is munging a file in a way that makes it amenable to recompression. In
the classic sense, this is removing a trivial bit of information from the
file, which thus allows is to be re-compressed just a tiny bit

The catch is, of course, you can’t un-compress it without re-introducing
the information that you removed. In his case, you need to repeatedly
un-compress, un-munge, un-compress, un-munge, and so on –
exactly the right number of times. From some of his emails, he
does this hundreds to thousands of times.

You can’t un-compress his stuff without knowing how many times the compressor
ran. And how many times is that? Well, gosh, to know that, you need to know
some extra information. Anyone want to take a bet on what the
relationship is between the amount of additional compression he can get, and
the number of repetitions of his system? Anyone?

Ok. On to the transcript.

Continue reading

Revenge of the Return of the Compression Idiot

Do I really need to go through this again? Argh!

As long-time readers will remember, a while back, I had an encounter with
a crackpot named Jules Gilbert who claimed to be able to cyclically compress
files. That is, he could take a particular input file, compress it, do
something to it, and then compress it again, making it smaller. He claimed to
be able to do this for any input.

And to make it more annoying, his claim to be able to do this brilliant
magical cyclical compression was attached to an obnoxious christian “I want to
save you soul” rant, where he effectively offered to share the profits from
this brilliant invention with me, if I’d only give him the chance to tell me
about Jesus.

I tried to get rid of the schmuck. I asked him to leave me alone politely.
Then I asked him to leave me alone not-so politely. Then I asked him to
leave me alone very, very rudely. Then I publicly mocked him on this blog.
After the last, he finally seemed to take the hint and go away.

Unfortunately, good things often don’t last. And yesterday, I received
another new message from him, bragging about the results of his uber-magical
compression system. He has now perfected it to the point where he can take
any input file, and compress it down to about 50K:

I am now compressing single bit as:

I use C code to pack up one bit to a byte and call bzip2’s compress

I print the result and compute the size.

(I did this for both gzip and bzip2.) Gzip didn’t compress; it gave
me back about 6.25, but I needed at least 8.0.

And bzip2 gave it to me. About 8.2 and higher. (This is with no
actual I/O, just buffer to buffer.)

This was with random data. At least that’s what people call it.
Bzip2 data. (It looks pretty random to me.)

I expect to show up at a friend’s house and show him the system on
Monday. THIS IS NOT A COMPLETE SYSTEM, but it’s not just research
either. The input was randomly chosen (not by me, by a C function,)
and the program has been arranged to simply open another file when a
current file is exhausted.

Every input file has been compressed once with bzip2. Each input file
is at least 5k bytes and was chosen with the intent to represent all
popular OS’s somewhat equally.

This program is easily able to process MN’s 415k file. And since it
uses 256 byte blocks, I can bring it down to about 50k (number
demonstrated in testing.) Another way to say this is that (with this
program,) all files greater than about 50k can be made that small.

This is the simplist system I have ever made that uses my “modern”

techniques (I mean, not my LSQ ideas of a decade ago.) i started
coding it this morning, to see if a certain idea worked…

I’m trying to drop bzip but it’s pretty good! I’m also trying to
modulate my input to it, in order to obtain a gain from the pattern
recognition (the BWT may not be an actual FFT but it’s a similar
idea.) If I can do this I will keep it.

If all this is greek to you, here’s the thing: Just about every other
computer scientist in the whole wide world believes this is
impossible. Why is this valuable? Because it means that all files
and messages can be compressed in while in transit to, say, 50k —
probably much less. (Or while sitting on a CD-ROM or DVD.)

I love that line in there: “Just about every other computer scientist in
the whole wide world believes this is impossible”.

Jules, baby, belief has nothing to do with it.

Continue reading