# The Heartbleed Bug

There’s a lot of panic going around on the internet today, about something called the Heartbleed bug. I’ve gotten questions, so I’m giving answers.

### I’ve heard lots of hype. Is this really a big deal?

Damn right it is!

It’s pretty hard to wrap your head around just how bad this actually is. It’s probably even more of a big deal that the hype has made it out to be!

This bug affects around 90% of all sites on the internet that use secure connections. Seriously: if you’re using the internet, you’re affected by this. It doesn’t matter how reputable or secure the sites you connect to have been in the past: the majority of them are probably vulnerable to this, and some number of them have, in all likelihood, been compromised! Pretty much any website running on linux or netbst, using Apache or NGINX as its webserver is vulnerable. That means just about every major site on the net.

The problem is a bug in a commonly used version of SSL/TLS. So, before I explain what the bug is, I’ll run through a quick background.

### What is SSL/TLS?

When you’re using the internet in the simplest mode, you’re using a simple set of communication protocols called TCP/IP. In basic TCP/IP protocols, when you connect to another computer on the network, the data that gets sent back and forth is not encrypted or obscured – it’s just sent in the clear. That means that it’s easy for anyone who’s on the same network cable as you to look at your connection, and see the data.

For lots of things, that’s fine. For example, if you want to read this blog, there’s nothing confidential about it. Everyone who reads the blog sees the same content. No one is going to see anything private.

But for a lot of other things, that’s not true. You probably don’t want someone to be able to see your email. You definitely don’t want anyone else to be able to see the credit card number you use to order things from Amazon!

To protect communications, there’s another protocol called SSL, the Secure Sockets Layer. When you connect to another site that’s got a URL starting with `https:`, the two computers establish an encrypted connection. Once an SSL connection is established between two computers, all communication between them is encrypted.

Actually, on most modern systems, you’re not really using SSL. You’re using a successor to the original SSL protocol called TLS, which stands for transport layer security. Pretty much everyone is now using TLS, but many people still just say SSL, and in fact the most commonly used implementation of it is in package called OpenSSL.

So SSL/TLS is the basic protocol that we use on the internet for secure communications. If you use SSL/TLS correctly, then the information that you send and receive can only be accessed by you and the computer that you’re talking to.

Note the qualifier: if you use SSL correctly!

SSL is built on public key cryptography. What that means is that a website identifies itself using a pair of keys. There’s one key, called a public key, that it gives away to everyone; and there’s a second key, called a private key, that it keeps secret. Anything that you encrypt with the public key can only be decrypted using the private key; anything encrypted with the private key can only be decrypted using the public key. That means that if you get a message that can be decrypted using the sites public key, you know that no one except the site could have encrypted it! And if you use the public key to encrypt something, you know that no one except that site will be able to decrypt it.

Public key cryptography is an absolutely brilliant idea. But it relies on the fact that the private key is absolutely private! If anyone else can get a copy of the private key, then all bets are off: you can no longer rely on anything about that key. You couldn’t be sure that messages came from the right source; and you couldn’t be sure that your messages could only be read by an authorized person.

### So what’s the bug?

The SSL protocol includes something called a heartbeat. It’s a periodic exchange between the two sides of a connection, to let them know that the other side is still alive and listening on the connection.

One of the options in the heartbeat is an echo request, which is illustrated below. Computer A wants to know if B is listening. So A sends a message to B saying “Here’s X bytes of data. Send them back to me.” Then A waits. If it gets a message back from B containing the same X bytes of data, it knows B was listening. That’s all there is to it: the heartbeat is just a simple way to check that the other side is actually listening to what you say.

The bug is really, really simple. The attacker sends a heartbeat message saying “I’m gonna send you a chunk of data containing 64000 bytes”, but then the data only contains one byte.

If the code worked correctly, it would say “Oops, invalid request: you said you were going to send me 64000 bytes of data, but you only sent me one!” But what the buggy version of SSL does is send you that 1 byte, plus 63,999 bytes of whatever happens to be in memory next to wherever it saved the byte..

You can’t choose what data you’re going to get in response. It’ll just be a bunch of whatever happened to be in memory. But you can do it repeatedly, and get lots of different memory chunks. If you know how the SSL implementation works, you can scan those memory chunks, and look for particular data structures – like private keys, or cookies. Given a lot of time, and the ability to connect multiple times and send multiple heartbeat requests each time you connect, you can gather a lot of data. Most of it will be crap, but some of it will be the valuable stuff.

To make matters worse, the heartbeat is treated as something very low-level which happens very frequently, and which doesn’t transfer meaningful data. So the implementation doesn’t log heartbeats at all. So there’s no way of even identifying which connections to a server have been exploiting this. So a site that’s running one of the buggy versions of OpenSSL has no way of knowing whether or not they’ve been the target of this attack!

See what I mean about it being a big deal?

### Why is it so widespread?

When I’ve written about security in the past, one of the things that I’ve said repeatedly is: if you’re thinking about writing your own implementation of a security protocol, STOP. Don’t do it! There are a thousand ways that you can make a tiny, trivial mistake which completely compromises the security of your code. It’s not a matter of whether you’re smart or not; it’s just a simple statement of fact. If you try to do it, you will screw something up. There are just so many subtle ways to get things wrong: it takes a whole team of experts to even have a chance to get it right.

Most engineers who build stuff for the internet understand that, and don’t write their own cryptosystems or cryptographic protocols. What they do is use a common, well-known, public system. They know the system was implemented by experts; they know that there are a lot of responsible, smart people who are doing their best to find and fix any problems that crop up.

Imagine that you’re an engineer picking an implementation of SSL. You know that you want as many people trying to find problems as possible. So which one will you choose? The one that most people are using! Because that’s the one that has the most people working to make sure it doesn’t have any problems, or to fix any problems that get found as quickly as possible.

The most widely used version of SSL is an open-source software package called OpenSSL. And that’s exactly where the most bug is: in OpenSSL.

### How can it be fixed?

Normally, something like this would be bad. But you’d be able to just update the implementation to a new version without the bug, and that would be the end of the problem. But this case is pretty much the worst possible case: fixing the implementation doesn’t entirely fix the problem. Because even after you’ve fixed the SSL implementation, if someone got hold of your private key, they still have it. And there’s no way to know if anyone stole your key!

To fix this, then, you need to do much more than just update the SSL implementation. You need to cancel all of your keys, and replace them new ones, and you need to get everyone who has a copy of your public key to throw it away and stop using it.

So basically, at the moment, every keypair for nearly every major website in the world that was valid yesterday can no longer be trusted.

Over on twitter, some folks were chatting about the latest big security botch. A major service, called Evernote, had a security breach where a password file was stolen. Evernote has handled the situation quite well, being open about what happened, and explaining the risks.

In their description of the breach, they said that the stolen passwords were “both hashed and salted”. Apparently this sounds funny to people outside of software. (Amazing how jargon becomes so ingrained that I didn’t even notice the fact that it could be interpreted in a funny way until it was pointed out to me!)

Anyway, since discussion of this is going around, I thought I’d explain just what password hashing and salting means.

Let’s start at the beginning. You’re some kind of system that wants to have password security. Obviously, you need to save the passwords somewhere, right?

As we’ll see, that’s only partially correct – but let’s go with it for now. You need to store something that lets you check if a user supplied the right password, so you need to store something.

The most naive approach is create a file (or database, or whatever) that contains the usernames and passwords of all of your users. Something like:

```alice:abc
mark:pass
joe:123
jen:hello
```

Suppose you were a thief, and you wanted to crack this password file, what would you do? You’d try to steal that file! If you can get hold of that password file, then you’d have all of the passwords for all of the users of the system.

That means that this is a terrible way of storing the passwords. One step, and a thief has completely breached your system. We don’t want that. So what should we do?

First, we could encrypt the file.

This might seem like a good idea at first. If a thief were the steal the file, the wouldn’t be able to find your user’s passwords without figuring out the encryption key! It’s going to take a lot of work to figure out that key!

The first problem with this is that any password can be cracked given enough time and power. If there’s only one encryption key for the entire file, then it’s worth investing a lot of time and power into breaking it – and once it’s broken, then everything is revealed.

The second problem is: how does your system check a user’s password? It needs to decrypt the file! That means that the encryption key must be available to your system! So all that a thief needs to do is figure out where your system is getting the key from. You’ve got your entire security system for all of your users set up with a single point of protection, and somewhere in your system, everything that you need to break that protection is available!

What can we do to improve this? The answer is something called crypto graphic hashing.

Cryptographic hashing is a combination of two concepts: hashing, and one-way functions.

A really simple example of a not-very-good hash function of a string would be something like: convert all of the characters in the string to their numeric values, and exclusive-or the binary representation of those bits. With that hash function, you could take a string like “ABCD”, and convert it to the numeric values of the characters ([65, 66, 67, 68]), and then x-or them together (1000001 xor 1000010 xor 1000011 xor 1000100 = 0000100) for a result of 4. Real practical hash functions are more complicated.

For example, at least some versions of Java use the following as the default hash for a string of characters:

$text{hash}(s) = left(sum_{i in text{length}(s)} 31^{length(s) - i - 1}*s[i] right) mod 2^{32}$

There’s a class of mathematical functions called one-way functions. A one way function is a function $f$, where given $x$, it’s easy to compute $f(x)$, but given $f(x)$ it’s extremely difficult (or even impossible) to compute $x$.

If we combine those two, we have what’s called a crpytogrphic hash function: a function that takes an arbitrary input string, and converts it to a number, in a way where it’s very difficult to figure out what the input string that produced the number was. That’s great for storing passwords! We don’t store the password at all. We just store the hash-value produced from the password. Then when a user comes and logs in, we take their password, hash it, and compute the result to the stored hash.

```alice:7a2d28fc
mark:dfd4e1c6
joe:ed849ee1
jen:bb76e739
```

This is much better than storing the encrypted password. There is no encryption key that a thief can use to decrypt the password. Even if a thief knows the hash values of your user’s passwords, they can’t get in to the system! And your system actually never stores the actual values of the user’s passwords – just their hashcodes!

So again, let’s look at this from the perspective of a thief. How can a thief break into a system with hashed passwords?

If they don’t know what hash function you’re using, then they’re completely stuck. Sadly, they can probably figure it out. Designing new crpytographic hash functions is hard. Implementing cryptographic hash functions correctly is hard. As a result, most people just use a hash function from a library. That means that for a thief, it’s usually pretty easy to figure out what hash function is being used by a system.

Once they know what hash function you used, their only choice to break your system is to try to guess the passwords. That is, they can guess passwords, compute their hash codes, and search through your password file to see if any of the users password hashes matches. If they find one, they’re gold!

In fact, there’s a common strategy based on this idea called a rainbow table. A rainbox table is a list of common passwords, and the numeric value that they hash to with a common crptographic hash value. Something like:

pass 1b93eb12
abc 7a2d28fc

If you can somehow steal the passwords file, then with a rainbow table, you can find users with common passwords. For example, in the table above, you can see that the hashcode “7a2d28fc” occurs in the passwords file for the username “alice”, and it’s also in the rainbow table for the password “abc”. So a thief could determing that alice’s password was “abc”. Even with the best crpytographic hash, all it takes is one idiot user who uses “password” as their password, and your system’s security is breached.

Salting passwords addresses that problem. In a salting strategy, you don’t hash a user’s password by itself: you combine it with some additional data, and then hash that combination. The additional information is called the salt..

You can use lots of different things for the salt. There’s a complex set of tradeoffs in the exact salting strategy, which are beyond the scope of this post, but a few examples include:

1. Always use a fixed salt string. This is weak, but better than nothing. It’s got a similar weakness to the encrypted password system: you only need one salt to give you a handle on breaking all of the passwords, and that one salt needs to be in the system.
2. Add a random piece of data for each password. The catch here is that you need to store the salt data for each password. This is what unix passwords used to use. They added 12 random bits to each password. In the passwords file, they stored the salt and the hashed password. The weakness of this is that the salt is right there with the password entry. But because each user has a different salt, that means that any attempt to breach the system needs to look at each user separately.
3. Salt on metadata: that is, take information about the user that isn’t part of their username, and use that as the salt. For example, you could use a person’s birthday as the salt for their account.

If each user has a different salt, then even if you’ve got terrible passwords, a thief needs to do a lot of work to try to break your system. Even with a rainbow-table like strategy, they can’t compute the hashcode for a given common password once, and then search the password hash list for that code – they need to recompute it for each possible salt value!

What salting does is, effectively, increase the amount of effort needed to break the passwords. If you add 12 bits of salt, then a rainbow table needs 4096 times more entries to find common passwords! If your salt is long enough, then it can make it effectively impossible to create a rainbox table at all. If they try to attack you without a rainbow table, a 12 bit salt means that your attacker needs to attack the passwords of each of your users seperately! Even if they know the value of the salt, you’ve made it much harder for them to breach your security.

# Turing Machines – what they are, what they aren't.

It’s the anniversary of the birth of Alan Turing. So there’s a ton of people writing commemorative stories. And naturally, a ton of those stories get it wrong. And they get it wrong in a very sad way.

Of course, when you talk about Turing, you talk about Turing machines. Despite the fact that Turing did lots of stuff besides just that machine, it’s always the machine that people focus on. Partly that’s laziness – the machine has his name after all, so it’s one of the first things you find if you Google Turing. It’s also the easiest thing to write about.

What do they say about the Turing machine? It’ “the simplest computing device”. It’s the “basis for modern computers”. It’s “the theoretical model for the microchips in your laptop”. It’s the “mathematical description of your computer”. None of those things are true. And they all both over and under-state what Turing really did. In terms of modern computers, the Turing machine’s contribution to the design of real computers is negligible if not non-existent. But his contrubition to our understanding of what computers can do, and our understanding of how mathematics really works – they’re far, far more significant than the architecture of any single machine.

The Turing machine isn’t a model of real computers. The computer that you’re using to read this has absolutely nothing to do with the Turing machine. As a real device, the turing machine is absolutely terrible.

The turing machine is a mathematical model not of computers, but of computation. That’s a really important distinction. The Turing machine is an easy to understand model of a computing device. It’s definitely not the simplest model. There are simpler computing devices (for example, I think that the rule 111 machine is simpler) – but their simplicitly makes them harder to understand.

The reason that the Turing machine is so important comes down to two important facts. First, which machine you use to talk about computation doesn’t matter. There’s a limit to what a mechanical device can do. There are lots of machines out there – but ultimately, no machine can go past the limit. Any machine that can reach that limit is, for the purposes of the theory of computation, pretty much the same. When we talk about studying computation, what we’re talking about is the set of things that can be done by a machine – not by a specific machine, but by any machine. The specific choice of machine isn’t important. And that’s the point: computation is computation. That’s what Turing figured out.

The Turing machine is a brilliant creation. It’s a simple machine. It’s really easy to understand. And it’s easy to tweak – that is, it’s easy to do experiments where you can modify the machine, and see what effect it has.

So let’s take a step back, and see: what is a Turing machine?

The Turing machine is a very simple kind of theoretical computing device. In fact, it’s almost downright trivial. But according to everything that we know and understand about computation, this trivial device is capable of any computation that can be performed by any other computing device.

The basic idea of the Turing machine is very simple. It’s a machine that runs on top of a tape, which is made up of a long series of little cells, each of which has a single character written on it. The machine is a read/write head that moves over the tape, and which can store a little bit of information. Each step, the machine looks at the symbol on the cell under the tape head, and based on what it sees there, and whatever little bit of information it has stored, it decides what to do. The things that it can do are change the information it has store, write a new symbol onto the current tape cell, and move one cell left or right.

That’s really it. People who like to make computing sound impressive often have very complicated explanations of it – but really, that’s all there is to it. The point of it was to be simple – and simple it certainly is. And yet, it can do anything that’s computable.

To really understand how that trivial machine can do computations, it helps to be a bit formal. In formal terms, we talk about a turing machine as a tuple: (S, s0, A, T), where:

• S is a finite list of possible states that the machine can be in. The state is the information that the machine can store in the head to make decisions. It’s a very limited amount of information; you can think of it as either a specific list of states, or a small set of small numbers. For most Turing machines, we’ll use it as a specific list of states. (You’ll see what I mean in a minute.)
• s0S is the initial state – the state that the machine will be in when it starts a computation.
• A is the machine’s alphabet, which is the set of symbols which can appear on the machine’s tape.
• T is the machines transition function. This is the real heart of the machine, where the computation is defined. It’s a function from the machine state and the alphabet character on the current tape cell to the action that the machine should take. The action is a triple consisting of a new value for the machine’s state, a character to write onto the current tape cell, and a direction to move the tape head – either left or right.

So, for example, let’s look at a simple machine. This is one of the classic examples: a Turing machine that does subtraction using unary numbers. A unary number “N” is written as a series of N 1s. For the program, we’ll give the machine an input in the format “N-M=” written onto the tape; after running the machine, the tape will contain the value of M subtracted from N. So, for example, we could use “1111-11=” as an input; the output would be “11”.

The alphabet is the characters “1”, ” ” (blank space), “-” (minus sign), and “=” (equal sign). The machine has four states: {“scanright”, “eraseone”, “subone”, “skip”}. It starts in the state “scanright”. It’s transition function is given by the following table:

FromState Symbol ToState WriteChar Dir
scanright space scanright space right
scanright 1 scanright 1 right
scanright minus scanright minus right
scanright equal eraseone space left
eraseone 1 subone equal left
eraseone minus HALT space n/a
subone 1 subone 1 left
subone minus skip minus left
skip space skip space left
skip 1 scanright space right

What this machine does is move to the right until it sees the equal sign; it erases the equal sign and moves to the left, erases one digit off of the second number, and replaces it with the equal sign (so the second number is reduced by one, and the equal sign is moved over one position). Then it scans back to the minus sign (which separates the two numbers), and erases one digit of the first number, and then switches back to scanning to the right for the equal. So one at a time, it erases one digit from each of the two numbers. When it reaches the equal sign, and starts going back to erase a digit from the second number, and hits the “-” before it finds a digit, that means its done, so it stops. Let’s look at a simple execution trace. In the trace, I’ll write the machine state, followed by a colon, followed by the tape contents surrounded by “[]”, with the current tape cell surrounded by “{}”.

```	scanright:  [ {1}1111111-111= ]"
scanright:  [ 1{1}111111-111= ]"
scanright:  [ 11{1}11111-111= ]"
scanright:  [ 111{1}1111-111= ]"
scanright:  [ 1111{1}111-111= ]"
scanright:  [ 11111{1}11-111= ]"
scanright:  [ 111111{1}1-111= ]"
scanright:  [ 1111111{1}-111= ]"
scanright:  [ 11111111{-}111= ]"
scanright:  [ 11111111-{1}11= ]"
scanright:  [ 11111111-1{1}1= ]"
scanright:  [ 11111111-11{1}= ]"
scanright:  [ 11111111-111{=} ]"
eraseone :  [ 11111111-11{1}  ]"
subone   :  [ 11111111-1{1}=  ]"
subone   :  [ 11111111-{1}1=  ]"
subone   :  [ 11111111{-}11=  ]"
skip     :  [ 1111111{1}-11=  ]"
scanright:  [ 1111111 {-}11=  ]"
scanright:  [ 1111111 -{1}1=  ]"
scanright:  [ 1111111 -1{1}=  ]"
scanright:  [ 1111111 -11{=}  ]"
eraseone :  [ 1111111 -1{1}   ]"
subone   :  [ 1111111 -{1}=   ]"
subone   :  [ 1111111 {-}1=   ]"
skip     :  [ 1111111{ }-1=   ]"
skip     :  [ 111111{1} -1=   ]"
scanright:  [ 111111 { }-1=   ]"
scanright:  [ 111111  {-}1=   ]"
scanright:  [ 111111  -{1}=   ]"
scanright:  [ 111111  -1{=}   ]"
eraseone :  [ 111111  -{1}    ]"
subone   :  [ 111111  {-}=    ]"
skip     :  [ 111111 { }-=    ]"
skip     :  [ 111111{ } -=    ]"
skip     :  [ 11111{1}  -=    ]"
scanright:  [ 11111 { } -=    ]"
scanright:  [ 11111  { }-=    ]"
scanright:  [ 11111   {-}=    ]"
scanright:  [ 11111   -{=}    ]"
eraseone :  [ 11111   {-}     ]"
Halt:       [ 11111  { }-     ]"
```

See, it works!

One really important thing to understand here is that we do not have a program. What we just did was define a Turing machine that does subtraction. The machine does not take any instructions: the states and the transition function are an intrinsic part of the machine. So the only thing this machine can do is to subtract.

The real genius of Turing wasn’t just defining this simple computing machine. It was realizing the concept of the program: you could design a Turing machine whose input tape contained a description of a Turing machine – that is a program – followed by an input to the program. This single machine – the Universal Turing machine – could simulate any other Turing machine; one machine which could perform any computation.

Turing machines are a lot of fun to play with. Figuring out how to do things can be fascinating. Figuring out how to define a Universal turing machine’s transition function is an amazing thing to do; it’s astonishing how simple the universal machine can be!

As I said earlier – you can’t make a mechanical computing device that does anything that a Turing machine can’t do. One of the beauties of the Turing machine is that it lets you explore that. You can try adding and removing features to the basic machine, and see what happens.

For example: if you can do lots of great stuff with a Turing machine with one tape, what if you had a two-tape turing machine? That is, take the basic turing machine, and say that it’s got two tapes, each with a read/write head. Each state transition rule on this machine depends on the pair of values found on the two tapes. For now, we’ll say that the tapes move together – that is, the transition rule says “move the heads right” or “move the heads left”.

Seems like this should represent a real increase in power, right? No. Take a single-tape turing machine. Take the alphabet for tape one, and call it A1; take the alphabet for tape 2, and call it A2. We can create a single-tape turing machine whose alphabet is the cross-product of A1 and A2. Now each symbol on the tape is equivalent of a symbol on tape 1 and a symbol on tape 2. So we’ve got a single-tape machine which is equivalent to the two-tape machine. Bingo.

We can lift the restriction on the heads moving together, but it’s a lot more work. A two-tape machine can do things a lot faster than a one-tape, and the simulation will necessarily adapt to that. But it’s entirely doable. How about a two-dimensional tape? We can simulate that pretty easily with a two-tape machine, which means we can do it with a one-tape machine. For a two tape machine, what we do is map the two-D tape onto the one-D-tape, as seen in the diagram below – so that cell 0 on the one-D tape corresponds to cell (0,0) on the two tape; cell (0,1) on the two-D corresponds to cell 1 on the one-D; cell (1,1) on the 2-D is cell 2 on the 1-D; etc. Then we use the second tape for the book-keeping necessary to do the equivalent of T-D tape moves. And we’ve got a two-D turing machine simulated with a two-tape one-D; and we know that we can simulate a two-tape one-D with a one-tape one-D.

This is, to me, the most beautiful thing about the Turing machine. It’s not just a fundamental theoretical construction of a computing device; it’s a simple construction of a computing device that’s really easy to experiment with. Consider, for a moment, lambda calculus. It’s more useful that a Turing machine for lots of purposes – we write real programs in lambda calculus, when no one would build a real application using a Turing machine program. But imagine how you’d try things like the alternate constructions of the Turing machine? It’s a whole lot harder to build experiments like those in lambda calculus. Likewise for Minsky machines, Markov machines, etc.

For your enjoyment, I’ve implemented a Turing machine programming language. You feed it a Turing machine description, and an input string, and it will give you a trace of the machines execution like the one above. Here’s the specification of the subtraction machine written in my little Turing language:

```states { "scanright" "eraseone" "subtractOneFromResult" "skipblanks" } initial "scanright"
alphabet { '1' ' ' '=' '-' } blank ' '
trans from "scanright" to "scanright" on (' ') write ' ' move right
trans from "scanright" to "scanright" on ('1') write '1' move right
trans from "scanright" to "scanright" on ('-') write '-' move right
trans from "scanright" to "eraseone" on ('=') write ' ' move left
trans from "eraseone" to "subtractOneFromResult" on ('1') write '=' move left
trans from "eraseone" to "Halt" on ('-') write ' ' move left
trans from "subtractOneFromResult" to "subtractOneFromResult" on ('1') write '1' move left
trans from "subtractOneFromResult" to "skipblanks" on ('-') write '-' move left
trans from "skipblanks" to "skipblanks" on (' ') write ' ' move left
trans from "skipblanks" to "scanright" on ('1') write ' ' move right
```

I think it’s pretty clear as a syntax, but it still needs explanation.

• The first line declares the possible states of the machine, and what state it starts in. This machine has four possible states – “scanright”, “eraseone”, “subtractOneFromResult”, and “skipblanks”. When the machine starts, it will be in the state “skipright”.
• The second line declares the set of symbols that can appear on the tape, including what symbol will initially appear on any tape cell whose value wasn’t specified by the input. For this machine the symbols are the digit 1, a blank space, the equal sign, and the subtraction symbol; the blank symbol is on any tape cell whose initial value wasn’t specified.
• After that is a series of transition declarations. Each declaration specifies what the machine will do for a given pair of initial state and tape symbol. So, for example, if the machine is in state “scanright”, and the current tape cell contains the equal-to sign, then the machine will change to state “eraseone”, write a blank onto the tape cell (erasing the “=” sign), and move the tape cell one position to the left.

Finally, the code. You’ll need GHCI to run it; at the moment, it won’t work in Hugs (I don’t have the parsing library in my version of Hugs), and I haven’t worked out the linkage stuff to make it work under the GHC compiled mode.

```--
-- A Simple Turing machine interpreter
-- Copyright 2007 by Mark C. Chu-Carroll
--    markcc@gmail.com
--   http://scienceblogs.com/goodmath
--
-- You can do anything you want with this code so long as you
-- leave this copyright notice intact.
--
module Turing where
import Text.ParserCombinators.Parsec
import qualified Text.ParserCombinators.Parsec.Token as P
import Text.ParserCombinators.Parsec.Language
import System.Environment

data Motion = MoveLeft  | MoveRight deriving (Show)

-- Transition from to on move write
data Transition = Transition String String [Char] Motion  Char deriving (Show)

-- TuringMachine states initial alphabet blank transitions
data TuringMachine = Machine [String] String   [Char]    Char    [Transition] deriving (Show)

data MachineState = TMState [Char] Char [Char]  String  deriving (Show)
--                           tape-left curcell  curstate

getBlankSym :: TuringMachine -> Char
getBlankSym (Machine _ _ _ bl _) = bl

findMatchingTransition :: [Transition] -> String -> Char -> Maybe Transition
findMatchingTransition [] _ _ =  Nothing
findMatchingTransition translist state c =
let matches = filter (transitionMatches state c) translist
in case matches of
[] -> Nothing
t:[] -> Just t
_ -> error "Ambiguous transition"
where transitionMatches state c (Transition from to on move write) =
(from == state) && (elem c on)

runTransition :: TuringMachine -> [Char] -> Char -> [Char] -> String -> Transition -> MachineState
runTransition m (l:left) c right state (Transition from tostate on MoveLeft write) =
TMState left l (write:right) tostate
runTransition m left c [] state (Transition from tostate on MoveRight write) =
TMState (write:left) (getBlankSym m) [] tostate
runTransition m left c (r:right) state (Transition from tostate on MoveRight write) =
TMState (write:left) r right tostate

stepMachine :: TuringMachine -> MachineState -> MachineState
stepMachine machine@(Machine _ _ _ _ translist) st@(TMState tapeleft c taperight state) =
let trans = findMatchingTransition translist state c
in case trans of
(Just t) -> runTransition machine tapeleft c taperight state t
Nothing -> error "No applicable transition from state"

getMachineStateString (TMState left c right state) =
(state ++ ":[" ++ (reverse left) ++ "{" ++ [c] ++ "}" ++ right ++ "]")

runTracedMachine :: TuringMachine -> [Char] -> [String]
runTracedMachine tm@(Machine states initial alphabet blank transitions) (t:tape) =
runTracedMachineSteps tm (TMState [] t tape initial) where
runTracedMachineSteps machine state =
let newstate@(TMState left c right st) = stepMachine machine state
in if st == "Halt"
then [getMachineStateString newstate]
else let trace=runTracedMachineSteps machine newstate
in ((getMachineStateString newstate):trace)

runMachine :: TuringMachine -> [Char] -> [Char]
runMachine tm@(Machine states initial alphabet blank transitions) (t:tape) =
runMachineSteps tm (TMState [] t tape initial) where
runMachineSteps machine state =
let newstate@(TMState left c right st) = stepMachine machine state
in if st == "Halt"
then (concat [(reverse left), [c], right])
else runMachineSteps machine newstate

---------------------------------------------------------------------------
-- Parsing code - implemented using the Parsec monadic parser library.
---------------------------------------------------------------------------

-- Basic setup stuff - use a standard haskell style lexer; set up the reserved words
-- and symbols in the lexer.
lexer :: P.TokenParser ()
{ P.reservedNames = ["states","alphabet","trans","from","to","on","write","move","left","right","initial","blank"] })

reserved = P.reserved lexer
symbol = P.symbol lexer
braces = P.braces lexer
parens = P.parens lexer
charlit = P.charLiteral lexer
strlit = P.stringLiteral lexer
ident = P.identifier lexer
whitespace = P.whiteSpace lexer

states = reserved "states"
alphabet = reserved "alphabet"
trans = reserved "trans"
from = reserved "from"
to = reserved "to"
on = reserved "on"
write = reserved "write"
move = reserved "move"
initial = reserved "initial"
blank = reserved "blank"

left = do { reserved "left"
; return MoveLeft
}

right = do { reserved "right"
; return MoveRight
}

-- statesDecl ::= "states" "{"  strlit+  "}" "initial" strlit
statesDecl = do { states
; strlst <- braces (many1 strlit)
; initial
; i <- strlit
; return (i,strlst)
}

-- alphaDecl ::= "alphabet" "{" charlit+  "}" "blank" charlit
; charlst <- braces (many1 charlit)
; blank
; bl <- charlit
; return (bl, charlst)
}

-- transDecl ::= "trans" "from" strlit "to" strlit "on" "(" charlit+ ")" "write" charlit "move" ("left" | "right")
transDecl = do { trans
; from
; fromState <- strlit
; to
; targetState <- strlit
; on
; chrs <- parens (many1 charlit)
; write
; wrchar <- charlit
; move
; dir <- ( left <|> right)
; return (Transition fromState targetState chrs dir wrchar)
}

-- machine ::= statesDecl alphaDecl transDecl+
machine = do { (i,sts) <- statesDecl
; trans <- many1 transDecl
; return (Machine sts i alpha bl trans)
}

run :: (Show a) => Parser a -> String -> IO a
run p input
= case (parse p "" input) of
Left err -> do{ putStr "parse error at "
; print err
; error "Parse error"
}
Right x -> return x

runTParser ::  String -> IO TuringMachine
runTParser input =
run (do { whitespace
; x <- machine
; eof
; return x })  input

--------------------------------------------------------------------------
-- A sample Turing program - first in the comment, and then coded into
-- a string literal, to make it easy to run tests. This sample program
-- is a basic Turing subtract - it takes a unary equation "111111-111=",
-- and leaves the result of subtracting the second from the first.
--
-- states { "one" "two" "three" "four" } initial "one"
-- alphabet { '1' ' ' '=' '-' } blank ' '
-- trans from "one" to "one" on (' ') write ' ' move right
-- trans from "one" to "one" on ('1') write '1' move right
-- trans from "one" to "one" on ('-') write '-' move right
-- trans from "one" to "two" on ('=') write ' ' move left
-- trans from "two" to "three" on ('1') write '=' move left
- trans from "two" to "Halt" on ('-') write '-' move left
-- trans from "three" to "three" on ('1') write '1' move left
-- trans from "three" to "four" on ('-') write '-' move left
-- trans from "four" to "four" on (' ') write ' ' move left
-- trans from "four" to "one" on ('1') write ' ' move right

sampleMachine = concat ["states { "one" "two" "three" "four" } initial "one"n ",
" alphabet { '1' ' ' '=' '-' } blank ' 'n ",
"trans from "one" to "one" on (' ') write ' ' move rightn ",
"trans from "one" to "one" on ('1') write '1' move rightn ",
"trans from "one" to "one" on ('-') write '-' move rightn ",
"trans from "one" to "two" on ('=') write ' ' move leftn ",
"trans from "two" to "three" on ('1') write '=' move leftn ",
"trans from "two" to "Halt" on ('-') write '-' move leftn ",
"trans from "three" to "three" on ('1') write '1' move leftn ",
"trans from "three" to "four" on ('-') write '-' move leftn ",
"trans from "four" to "four" on (' ') write ' ' move leftn ",
"trans from "four" to "one" on ('1') write ' ' move right"  ]

runTracedMachineOnString :: String -> String -> IO ([String])
runTracedMachineOnString m str =
do
tm <- runTParser m
return (runTracedMachine tm str)

runMachineOnString :: String -> String -> IO String
runMachineOnString m str =
do
tm <- runTParser m
return (runMachine tm str)

sampleInput = " 11111111-111= "

------------------------------------------------------------------------
-- Main program execution scaffolding
-- main still needs a bit of work so that ghci will link correctly;
-- runs fine in GHCI, but linkage errors in GHC. For now, just load
-- this file, and then execute "runFromFile" from the command line.
------------------------------------------------------------------------
main =
do
[file] <- getArgs
m <- parseFromFile (do { whitespace
; x <- machine
; eof
; return x }) file
case m of
Right machine -> do
print "Enter input for parser:"
s <- getLine
result <- return (runMachine machine s)
print (concat ["Result:[", result, "]"])
Left x -> do
print (concat ["Parse error"])

runFromFile :: String -> IO ()
runFromFile filename =
do
m <- parseFromFile (do { whitespace
; x <- machine
; eof
; return x }) filename
case m of
Right machine -> do
print "Enter input for parser:"
s <- getLine
result <- return (runMachine machine s)
print (concat ["Result:[", result, "]"])
Left x -> do
print (concat ["Parse error"])

```

# What if it's not Regular? Pump it!

At this point, we’ve seen a fair bit about regular languages, and we’ve gone through the introduction to context free languages. We know one way of showing that a language is regular or context free: if you can write a (regular/context free) grammar for a language, then that language is necessarily (regular/context free). But… what if we have a language that we suspect is not regular?

# The Next Step in Computation: Level-2 Languages

Time to move on to Chomsky level 2! Level two languages are also known as context free languages, or CFLs. Level 2 languages are wonderful things. They’re simple enough to be parsed easily, but expressive enough to support a very wide range of useful languages. Pretty much every programming language that’s widely used can have its syntax specified with a level-2 language.

### Grammars for Context Free Languages

In terms of the grammar, a CFL is easy to describe: it’s a language where the left-hand side of every grammar rule consists of exactly one non-terminal symbol. That’s it: the right hand side of a rule in a CFL grammar can be anything at all. Unlike the regular grammars, there are no restrictions on the position or the number of NTSs on the right hand side.

This change makes a huge difference in what you can do. In a CFL, you can count. You can have distant relationships – things like a sub-string that can occurs at the end of a string only if a match for it occurs at the beginning. The canonical example of this is paren matching: you can write a language that makes sure that you have matching parens: the same number of open and close parens.

• `A ::= '(' ')'`
• ` A ::= A A`

This language includes `()((())()()(()))`, but not `()((())()()(())` (the same thing, but with one trailing paren omitted – 8 opens, 7 closes), or `()((())()()(())))` (the same thing, but with an extra close paren at the end – 8 opens, 9 closes).

As a quick aside: this also illustrates what we mean when we say that a language supports counting. When we say that a language requires counting, what we mean is that is that some feature of a string in the language has to have a number of repetitions dictated by another feature in the same string. In a paren-matching grammar, we require that the number of close parens must be the same as the number of open parens. We don’t just make sure that that’s true for 1, or 2, or 3 open parens, we have matching close parens. For any number of parens, the number of closes must be the same as the number of opens.

We can look at a much less trivial example of a simple grammar. As I’ve written about at other times, in computer science, there’s a formal language that we use for a ton of valuable things called lambda calculus. Lambda calculus is the formal mathematical basis of the Haskell language, and the basic tool used for specifying formal semantics of computational systems. The complete grammar of the simply typed lambda calculus is:

• `ident ::= "A" | "B" | "C" | ... | "Z"`
• `expr ::= "$lambda$$lambda$" ident "." expr`
• `expr ::= ident`
• `expr ::= expr expr`
• `expr ::= "(" expr ")"`

You can see a practical example of counting in this grammar. It guarantees that expressions in the lambda calculus are well-formed. We couldn’t do that in a regular language. That’s a huge boost in capability.

So why do we call this a context free language? In terms of the grammar, when we’re constructing a string, we’re always doing it by replacing non-terminal symbols with sequences of other symbols. When we do one of those replacements, if there’s an NTS “S” in the current string, then we can replace it. We can’t look at anything but the individual non-terminals in the string. We can’t consider the context in which a non-terminal occurs before we decide whether or not we’re allowed to replace it.

### Capability of Context Free Languages

As we’ve seen, CFLs give us some pretty significant additional capabilities. That abilities to count and to describe non-consecutive relationships between different parts of a string in a language are a huge upgrade in computational capability. But CFLs are still pretty limited in many ways. You can’t write a grammar for a spoken language using CFGs – natural languages aren’t context free. We use CFLs and CFGs all the time for compilers and such in real programs, but there’s always an extra step involved, because there are aspects of real computer languages that can’t be captured in context-free form.

So what can’t you do in a CFL? I’m not going to try to formally characterize the limits of CFLs, but here are two examples:

Complex counting/Arithmetic
if you want a language of strings with the same number of Xs and Ys, that language is a CFL. If you want a string with the same number of Xs, Ys, and Zs – that isn’t.
Constrained relationships
We can have some distant relationships in a context grammar – but it’s a very limited capability. You can’t specify that a particular symbol can only occur in one place in the string if it already occured in a prior part of the string. For example, in the lamba calculus example, it really should say that you can only use the “expr ::= ident” rule if the ident occured in an enclosing LAMBA ident”. CFLs can’t do that.

### Computing CFLs: the PushDown Automaton

So – now that we know what a CFL is, and what it can express: what’s
the basic computational model that underlies them? CFLs are computable using a kind of machine called a pushdown automaton (PDA). Pushdown automata are relatively simple: take a finite state machine, and add a stack.

For the non-CS folks out there, a stack is a last in first out (LIFO) storage system. What that means is that you can store something on the top of the stack whenever you want to (called pushing), look at what’s on top of the stack (peeking), and removing the element on top (popping). For a PDA, the transitions look like:

$(mbox{state},mbox{top-of-stack},mbox{input}) rightarrow (mbox{state}, mbox{stack-action})$

• The top-of-stack in the transition can be either a symbol from the machine’s alphabet, or it can be “*”. If it’s a symbol, then the transition can only be taken if both the machine state and the top-of-stack match. If it’s “*”, then the transition can be taken regardless of the value on top of the stack.
• The stack-action can be “push(symbol)”; “pop”, or “none”.
• The machine accepts the input if it reaches a final state with the stack empty. (There are alternate formulations that don’t require an empty stack, or that only require an empty stack but don’t use final states. They’re exactly equivalent to empty stack + final state.)

As usual, it’s easiest to understand this with an example. So let’s look at a simple language consisting of parens and identifiers, where the parens have to match. So, for example, “((I)(((I)I)(II)))” would be accepted, but “(I)(((I)I)(II)))” (the same string, but with the first open removed) would be rejected.

Our machine has an alphabet of “(“, “)”, and “I”. It has two states: 0, and 1. 0 is both the initial state, and the only final state. The available transitions are:

• $(0, *, "(") rightarrow (0, push("("))$: in state 0, if you see an open paren, push it onto the stack, and stay in state 0 It doesn’t matter what’s on the stack – if there’s an open-paren in state 0, you can do this.
• $(0, "(", ")") rightarrow (0, pop)$: in state 0, if you see a close paren and there’s an open-paren on top of the stack, then you’ve matched a pair, so you can pop the top open off of the stack.
• $(0, epsilon, ")") rightarrow (1, _)$: if you’re in state 0, and you see a close paren, but the stack in empty, then go to state one. State one is an error state: it means that you saw a close paren without a corresponding open paren. That’s not allowed in this grammar. Once you’re in state 1, you’re stuck.
• $(0, *, "I") rightarrow (0, _)$: in state zero, if you see an “I”, then you consume it and continue in state zero. It doesn’t matter what’s on the stack, and you don’t change the stack.

Or graphically:

So – if it sees a “(“, it pushes a “(” on the stack. If it sees an identifier, it just keeps going past it. If it sees a “)”, and there’s an “(” on top of the stack, it pops the stack; if it sees a “)” and there’s no “(” on the stack, then it switches into state 1. Since state 1 is not a final state, and there is no transitions that can be taken from state 1, the machine rejects the string if it sees an extra “)”. If there’s a “(” without a matching close, then when the machine finishes, it will have a non-empty stack, and so it will reject the string.

Finally, one nifty little note. The pushdown automaton is a very limited kind of machine. It can’t do complex arithmetic, or process complex grammatical constructions. There’s a lot that it can’t do. So what happens if we take this very limited machine, and give it a second stack?

It becomes maximally powerful – Turing complete. In fact, it becomes a Turing machine. We’ll see more about Turing machines later, but a Turing machine is equivalent to a two-stack PDA, and a Turing
machine can perform any computation computable by any mechanized computing process. So one stack is extremely constrained; two stacks is as un-constrained as any computing device can ever be.

# Regular Expressions and Derivatives

When you’re working with regular languages specified in regular expression form, there’s a really cool idea that you can use for building regular expression matchers, and for describing how to convert from a regular expression to a NFA. It’s called the Brzozozwksi derivative of a regular expression – or just simply the derivative of a regexp.

The basic idea of the derivative is that given a regular expression, $r$, you can derive a new regular expression called the derivative with respect to symbol $c$, $D_c(r)$. $D_c(r)$ is a regular expression describing the string matched by $r$ after it’s matched an $r$.

# Nondeterminism in Finite State Automata

In the last post, I mentioned the fact that regular expressions specify the same set of languages as regular grammars, and that finite state machines are the computational device that can recognize those languages.

It’s even pretty easy to describe how to convert from regular expressions to FSMs. But before we do that, to make it a bit easier, we’ll extend our finite state machines. Doing that is interesting in itself: What we’re going to do is create non-deterministic finite state machines – NFA (for nondeterministic finite automata) for short.

# Regular Languages

The simplest family of languages – and correspondingly, the simplest kind of computing machine – deal with regular languages. A regular language is very limited: there’s no counting, no deep patterns. They really don’t have much in the way of computational power. But they’re still very useful – the ubiquitous regular expression libraries in most programming languages are just easy ways of using regular languages to decompose simple patterned inputs.

The regular languages can be described in three ways: as regular grammars; as regular expressions; or as a kind of computing machine called a finite state machine or finite state automaton.

# Computability

I just recently realized that I only wrote about computability back in the earliest days of this blog. Those posts have never been re-run, and they only exist back on the original blogger site. When I wrote them, I was very new to blogging – looking back, I think I can do a much better job now. So I’m going to re-do that topic. This isn’t just going to be a re-post of those early articles, but a complete rewrite.

The way that I’m going to cover this is loosely based on the way that it was first taught to me by a wonderful professor, Dr. Eric Allender at Rutgers, where I went to college. Dr. Allender was a really tremendous professor: he managed to take an area of computer science that could seem hopelessly abstract and abstruse, and turned it into something fun and even exciting to learn about.

Computability is the most basic and fundamental sub-field of theoretical computer science. It’s the study of what a mechanical computing device can do. Not just what a specific mechanical computing device can do, but what can any mechanical computing device do? What are the limits of what you can do mechanically? And once we know the limits, what can we discover about the nature of computation?

# The Halting Problem

Some people expressed interest in seeing a full-out, formal presentation of the Halting problem proof. So, I’m going to walk through it. There are actually a lot of different versions of this proof; the one that I’m going to use is based on the one used by my grad-school theory professor, Dr. John Case. To be clear, I’m doing it from memory, so any errors are completely my own fault, not John’s!

To start off, we need to define what a computing device is. In formal mathematical terms, we don’t care how it works – all we care about is what it can do in abstract tems. So what we do is define something called an effective computing device. An effective computing device is any Turing equivalent computing device. An ECS is modelled as a two parameter function $phi : {cal N} times {cal N} rightarrow {cal N}$. The first parameter is an encoding of a program as a natural number; the second parameter is the input to the program. It’s also a natural number, which might seem limiting – but we can encode any finite data structure as an integer, so it’s really not a problem. The return value is the result of the program, if the program halts. If it doesn’t halt, then we say that the pair of program and input aren’t in the domain of $phi$. So if you wanted to describe running the program “$f$” on the input 7, you’d write that as $phi(f, 7)$. And, finally, the way that we would write that a program $p$ doesn’t halt for input $i$ as $phi(p, i) = bot$.

So now we’ve got our basic effective computing device. There’s one thing we still need before we can formulate the halting problem. We need to be able to deal with more parameters. After all – a halting oracle is a program that takes two inputs – another program, snd the input to that program. the easiest way to do that is to use something called a pairing function. A pairing functions is a one-to-one function from an ordered pair to an integer. There are lots of possible pairing functions – for example, you can convert both numbers to binary, left-pad the smaller until they’re equal length, and then interleave the bits. Given (9,3), you convert 9 to 1001, and 3 to 11; then you pad 3 to 0011, and interleave them to give you 10001011 – 139. We’ll write our pairing function as angle brackets around the two values: $langle x,yrangle$.

With the help of the pairing function, we can now express the halting problem. The question is, does there exist a program $O$, called a halting oracle, such that:

$forall p, forall i: (varphi(O,langle p,irangle) = left{ begin{array}{l} 0 mbox{ if } varphi(p, i) = bot \ 1 mbox{ if } varphi(p,i) neq bot end{array}right.$

In english, does there exist a program $O$ such that for all pairs of programs p and inputs i, the oracle returns $1$ if $varphi(p, i)$ halts, and 0 if it doesn’t?

Time, finally, for the proof. Suppose that we do have a halting oracle, O. That means that for any program $p$ and input $i$, $varphi(O, langle p, i rangle) = 0 iff varphi(p,i)=bot$.

So, can we devise a program \$p_d\$ and input $i$ where $varphi(p_d, i) != bot$,
but $varphi(O, langle p, i rangle) = 0$?

Of course we can. We need a $p_d$ which takes two parameters: an oracle, and an input. So it should be really simple right? Well, not quite as easy as it might seem. You see, the problem is, $p_d$ needs to be able to pass itself to the oracle. But how can it do that? A program can’t pass itself into the oracle. Why not? Because we’re working with the program as a Gödel number – and a program can’t contain its own Gödel number. If it contained it, it would be larger than itself. And that’s rather a problem.

But there’s a nice workaround. What we care about is: is there any combination of program and input such that $O$ will incorrectly predict the halting status? So what we’ll do is just turn $p_d$ into a parameter to itself. That is, we’ll look at a program $p_d$ like the following:

```def deceiver(input):
(oracle, program) =  unpair(input)
if oracle(program, input):
while(True): continue
else:
halt
```

Then we’ll be interested in the case where the value of the `program` parameter is a Gödel number of the deceiver itself.

(As an aside, there are a variety of tricks to work around this. One of the more classical ones is based on the fact that for any given program, $p$, there are an infinite number of versions of the same program with different Gödel numbers. Using that property, you can embed a program $p_{d_2}$ into another program $p_{d}$. But there are a few other tricks involved in getting it right. It’s not simple – Alan Turing screwed it up in the first published version of the proof!)

Now, when input = $langle O, p_d rangle$, then $O$ will make the wrong prediction about what $p_d$ will do. So – once again, we’re back where we were in the simpler version of the proof. A halting oracle is a program which, given any pair of program and input, will correctly determine whether that program will halt on that input. We’re able to construct a pair of program and input where the oracle doesn’t make the right prediction, and therefore, it isn’t a halting oracle.

This version is more abstract, but it’s still got that wonderfully concrete property. Even in the most abstract way of talking about a computing device, if you’ve got something that you believe is a halting oracle, this shows you how to create a program that will prove that the halting oracle is wrong. So you can’t possibly create a halting oracle.

And to be extra clear: this doesn’t rely on any ambiguities about definitions, or any distinction between values and meanings. It shows a way of producing a real, concrete counterexample for any purported halting oracle. No trickery, no fudging – if you think you have a halting oracle, you’re wrong, and this proof shows you exactly how to create a program that will demonstrate that.