Category Archives: Programming

Combinator Parsing, part 1

My professional work is focused on making the lives of engineers building software better. I got into that area because I was fascinated – or even obsessed – with programming languages. That led me to the more general area of engineering tools of all types. But programming languages and how they’re implemented remain one of my great fascinations.

If you’re trying to implement a new programming language (or re-implement an old one), the first thing you need to do is called parsing. Parsing consists of reading in a file containing a bunch of text, making sure that it’s correctly structured, and then using its structure to translate it into some kind of data structure.

For example, I implemented a silly programming language called Cruise. In Cruise, a snippet from a program looks like:

actor !Adder {
  behavior :Add() {
    on Plus(Z(),$x, $target) { send $x to $target }
    on Plus(I($x), $y, $target) { send Plus($x,I($y), $target) to $self }
  }
  initial :Add
}

In the Cruise implementation, the parser reads that input, and translates it into a structure called an abstract syntax tree. For the code in the example above, a part of the syntax tree for that code looks something like the following:

Actor[
  name="actor"
  behaviors=[
    Behavior[
      pattern=Tuple[
        name="Plus",
        Tuple[name="Z"],
        Var[name="x"],
        Var[name="target"]],
      action=[Send[Var["x"], Var["Target"]],
    Behavior[
      pattern=Tuple[
        name="Plus",
        Tuple[name="I", Var["x"]],
        Var["y"],
        Var["target"]],
      action=[
        Send[
          Tuple[name="Plus",
            Var["x"],
            Tuple[name="I", Var["y"]]],
          Var["target"]]]]]]

When you want to write a parser, you need to start by describing what you want to parse. Most of the time, we do that using a context free grammar (CFG). A CFG is a set of rules that describe how to produce texts with a particular structure. For example, suppose you wanted to implement a simple calculator (one of the canonical examples used when teaching parsing!). You’d need to describe what the expressions you want to parse look like. A pretty typical version of this would be:

expr := add_expr
add_expr := mult_expr ( ( '+' | '-' ) mult_expr)*
mult_expr := unary_expr ( ('*' | '/') unary_expr)*
unary_expr := '-' simple | simple
simple := number | '(' expr ')'
number := digit+

This grammar is a way of expressing a system for producing valid expressions. The easiest way to see how it works is to look at an example. We know that something like “3 + 4 * (-5 + 6)” is an expression. So how would we produce that using the grammar?

  1. Start with expr. That always expands to add_expr.
  2. Expand add_expr to mult_expr ‘+’ mult_expr.
  3. Expand the first instance of mult_expr to unary_expr, giving us “unary_expr+mult_expr“.
  4. Expand the first unary expr to simple: “simple+mult_expr“.
  5. Expand simple to number: “number+mult_expr“.
  6. Expand number to the number 3: “3+mult_expr“.
  7. Expand the mult_expr to “unary_expr*unary_expr“: “3+unary_expr*unary_expr“.
  8. Expand the first unary_expr to simple, and simple to number, and number to “4”: “3+4*unary_expr“.
  9. Expand the remaining unary_expr to simple, and simple to a parenthesized expression: “3+4*(expr)”.
  10. And so on, until we end up with “3+4*(-5+6)”.

When I started learning to write compilers, there were two ways that people wrote parsers. They either implemented them manually, using a technique called recursive descent, or they used a table-driven parser generator like yacc, where you wrote the grammar using a special syntax, and then a tool translated it into the implementation of a parser.

Writing a parser using recursive descent was a huge pain. It’s very tedious and repetitive, and like any meticulous process, it’s very error-prone. In short, it’s no damn fun at all.

Writing a parser using a generator was usually much easier – but if something didn’t work, it would be a huge pain. The problem is that the code you write doesn’t end up looking very much like the code you run. For example, to write our expression language in a typical parser generator, you’d write it something like the following. (The grammar gets re-arranged a bit, to fit the particular underlying parser algorithm that the generator uses.)

%start expr
expr: add_expr;
add_expr:
    add_expr '+' mult_expr
  | add_expr '-' mult_expr
  | mult_expr
  ;

mult_expr:
    mult_expr '*' unary_expr
  | mult_expr '/' unary_expr
  | unary_expr
  ;

unary_expr:
    '-' simple
  | simple
  ;

simple:
    number
  | '(' expr ')'
  ;

The parser generator reads the grammar that we wrote, and translates it into a set of lookup tables, which get interpreted by semi-generic parser code provided by the generator toolkit. What that means is that when your parser actually executes, there’s nothing called unary_expr. There’s some set of states encoded into magic variables with names like yytable and yytrans. If you got something wrong in your parser, debugging it is painful at best, because the grammar is gone: it’s been translated into something completely incomprehensible to human eyes.

Not all parser generators generate code that’s this extreme when it comes to readability. But that’s the output from the most widely used one, called YACC, which has been ported to and used with C/C++, Caml, Go, Fortran, Pascal, D, SML, and more.

But even if you’re not using YACC, it’s not a good scene. Even with ANTLR, my favorite parser generator, which makes a heroic effort to generate efficient but human-readable parsers, the fact remains that the code that you’re running and trying to debug isn’t familiar to you: it’s not what you wrote. You’re still stuck trying to work through an unfamiliar translation.

Honestly, both approaches – the manual parser, and the parser generator – suck an awful lot of the time.

Then, a few years ago, a gang of functional programmers came up with a truly brilliant scheme. You can describe a parser as a special kind of function from inputs to outputs. If you do that, you can also combine parsers using higher-order functions. Using those functions, all you need to do is write tiny parsers for the simplest elements of your language, and then you can build the parser for your language by just combining your simple parsers in different ways. When you implement a parser this way, what you get is very close to the original grammar of the language, but it’s an actual executable program: the code that you need to read and debug is exactly the code that you wrote.

For example, suppose I wanted to parse that expression language. Using a Python version of parser combinators, I can write:

digit = Parser.match(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])
number = digit.many(1)
parens = Parser.match(['(']) & Reference('expr') & Parser.match([')'])
simple = number | parens
unary_expr = Parser.match(['-']).opt() & simple
mult_expr = unary_expr &  (Parser.match(['*', '/']) & unary_expr).many()
add_expr = mult_expr & (Parser.match(['-', '+']) & mult_expr).many()
expr = add_expr
Reference.register_named_parser('expr', add_expr)

That’s not a contrived example. That’s actually a real, honest-to-god parser for our expression language! We’ll look at it in more detail later. But first, we need to understand what parsers and parser combinators really are, and how they work.

At this point, it’s time to start getting more concrete. Just what is a parser, precisely?

In keeping with the fact that this approach came from the functional programming community, it makes sense to describe a parser as a function. If you think of it this way, a parser is a special kind of function:

  • It takes a sequence of inputs;
  • It can either succeed or fail;
  • If it succeeds, it consumes part of its input, and produces an output.

That means that we can come up with a sort-of type signature pretty easily. It’s a function from an input to a result value. The result value can be one of two types: either a success, or a failure. If it’s a success, it contains both the output value produced by the parser and the unconsumed part of the input. If it fails, the failure contains nothing.

In Python, that’s easy to represent in several different ways. We’ll do it in the one that’s closest to a typed approach: an abstract class ParseResult, with subclasses for success or failure.

class ParseResult(object):
  def succeeded(self):
    return False


class Success(ParseResult):
  """A successful parse result, containing the value produced by a parser,
  and the unconsumed part of its input.
  """
  def __init__(self, output, rest):
    self.output = output
    self.rest = rest

  def succeeded(self):
    return True


class Failure(ParseResult):
  def __init__(self):
    pass

The inputs are just a sequence of values. The parsers can look at one element of the input at a time. In Python, we’ll represent them using a class.

from abc import abstractmethod

class ParserInput(object):
  @abstractmethod
  def at_end(self):
    """Return True if there's no remaining input."""

  @abstractmethod
  def first(self):
    """Return the first element of this input."""

  @abstractmethod
  def rest(self):
    """Return a ParserInput consisting of the remainder after the first
    element has been consumed.
    """

With those definitions, a parser becomes an object with one method, which takes a ParserInput, and produces a ParseResult.

class Parser(object):
  def parse(self, inp):
    """inp should be an instance of ParserInput; returns an instance of ParseResult"""

Before we get to the combinators, let’s look at an example of a simple parser. Suppose we want to accept a digit. We can implement that as:

class DigitParser(Parser):
  def parse(self, inp):
    if not inp.at_end() and inp.first().isdigit():
      return Success(inp.first(), inp.rest())
    else:
      return Failure()

This looks at the first character of the input stream. This is mostly functional, so it does this without altering the input stream. If the first character is a digit, then it succeeds, returning the digit character and the remaining input; otherwise, it fails.
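
For example, using the string-backed input sketched above, a quick interactive check (hypothetical session) looks like:

>>> result = DigitParser().parse(StringParserInput("42"))
>>> print(result.succeeded())
True
>>> print(result.output)
4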

Simple code like that little snippet is all you need to write yourself. It’s so incredibly simple that looking at it, it seems like it’s almost too simple. How on earth can anything this trivially simple actually be the basis of something as powerful as the infamous YACC?

The answer is combinators: functions that combine parsers together. This simple, basic, functional mechanism makes it possible to combine simple parsers using simple functions – producing a limitless variety of parsers, which can parse pretty much anything you choose to throw at them. So what kinds of combinators do we need? There are four basics that I used in my implementations, and one that’s an artifact of the implementation.

  1. Sequence: Parse an “A” followed by a “B”. In my Python implementation, this is written A & B. It succeeds only if both “A” and “B” succeed in sequence. The output of this is a list containing the outputs from “A”, “B”, and whatever else is in the sequence with them.
  2. Choice: Parse either an “A” or if that fails, parse a “B”. That’s written A | B. The output is whichever one succeeds, or else fails.
  3. Option: Parse either an “A”, or nothing, written a.opt(). The output is either the output from parsing an “A”, or None.
  4. Repetition: Parse a sequence of at least min “A”s, written a.many(min). The result is a list of the outputs of the invocations of “A”.
  5. Reference: I’m going to tell you what kind of thing to parse here later. For now, it’s just named “A”. The output is, of course, whatever the referenced parser outputs.

The last one definitely needs some extra explanation. Most interesting grammars are, ultimately, recursive. In our calculator’s expression grammar, the rule expr contains an add_expr, which contains a mult_expr, which contains a unary_expr, which contains a simple, which could expand to something that contains an expr. That leads to a problem constructing our parsers: we can’t actually build an instance of an expr, because it needs to contain itself. The Reference combinator is the solution to this: it lets you say Call the parser named “A”; I’ll tell you what parser that is later. If you look at the parser implementation I showed earlier, you can see this:

parens = Parser.match(['(']) & Reference('expr') & Parser.match([')'])
...
Reference.register_named_parser('expr', add_expr)

In the definition of the parens rule, we used Reference('expr'), which says “call whatever parser is registered with the name ‘expr'”. Then later on, we registered the add_expr parser with the name ‘expr’. Voila! Recursion.

There’s one more thing that I used in the calculator example: Parser.match(). That’s just a helpful little piece of code that I wrote to create parsers that look for one character from a fixed set. It’s implemented with a class named SetParser. Parser.match(['(']) will only match a “(” in the input; Parser.match(['a', 'b', 'c']) will match either an “a”, a “b”, or a “c”. Implementing it is as easy as you’d guess at this point:

class SetParser(Parser):
  def __init__(self, chars):
    self.chars = chars

  def pr(self):
    return "SetParser%s" % self.chars

  def parse(self, inp):
    if not inp.at_end() and inp.first() in self.chars:
      return Success(inp.first(), inp.rest())
    else:
      return Failure()
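
To give you a feel for how little machinery is hiding behind the & and | operators, here’s a rough sketch of what the sequence and choice combinators could look like. This is just an illustration built on the classes above, not necessarily the exact code in the repo – in particular, the real implementation also flattens chains of & into a single sequence, which this sketch doesn’t bother with:

class AndParser(Parser):
  """Sequence: run each parser in order, collecting the outputs into a list."""
  def __init__(self, parsers):
    self.parsers = parsers

  def parse(self, inp):
    outputs = []
    rest = inp
    for p in self.parsers:
      result = p.parse(rest)
      if not result.succeeded():
        return Failure()
      outputs.append(result.output)
      rest = result.rest
    return Success(outputs, rest)


class OrParser(Parser):
  """Choice: try each parser in turn, returning the first success."""
  def __init__(self, parsers):
    self.parsers = parsers

  def parse(self, inp):
    for p in self.parsers:
      result = p.parse(inp)
      if result.succeeded():
        return result
    return Failure()

Hooking these up to the operator syntax is just a matter of defining __and__ and __or__ on Parser to construct them.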

With these, you should now be able to completely read and understand the implementation of our expression parser. Let’s walk through it.

digit = Parser.match(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']) 
number = digit.many(1)
parens = Parser.match(['(']) & Reference('expr') & Parser.match([')'])
simple = number | parens
unary_expr = Parser.match(['-']).opt() & simple
mult_expr = unary_expr &  (Parser.match(['*', '/']) & unary_expr).many()
add_expr = mult_expr & (Parser.match(['-', '+']) & mult_expr).many()
expr = add_expr
Reference.register_named_parser('expr', add_expr)
  • digit: this is a SetParser that reads a single digit.
  • number: a number is just a sequence of at least one digit.
  • parens: this is a parenthesized expression: an open paren followed by an expression, followed by a close paren.
  • simple: a simple expression is either a number, or a parenthesized expression.
  • unary_expr: a unary expression is an optional “-”, followed by a simple.
  • mult_expr: a multiplication expression is a unary, followed by a collection of multiplication operators and operands.
  • add_expr: the same thing as a mult_expr, except that it uses add operations. By defining them separately, with multiplication embedded inside addition, we gave multiplication a higher precedence.
  • expr is just an alias for add_expr.
  • Reference: finally, we bind the reference from the parenthesized expression.

Now that we have a parser, we can run it!

>>> inp = StringParserInput("1+2*(3+5*4)*(6+7)")
>>> print(expr.parse(inp).output)
[None, ['1'], [], [['+', [None, ['2'], [['*', [None, ['(', [None, ['3'], [], [['+', [None, ['5'],
 [['*', [None, ['4']]]]]]]], ')']]], ['*', [None, ['(', [None, ['6'], [], [['+', [None, ['7'],
  []]]]], ')']]]]]]]]

It works, but wow, that output is horrible! It’s basically a tree, formed from lists, that describes the parse process. It’s got all of the information we’d need to evaluate the expressions parsed by our grammar, only it’s in a terrible form.

That brings us to one last thing to add: actions. In terms of parser combinators, an action is a parser that wraps another parser and a function. The output of the action is the result of applying the function to the output of the embedded parser:

class Action(Parser):
  def __init__(self, parser, act):
    self.parser = parser
    self.action = act

  def pr(self):
    return "Action[%s]" % self.parser.pr()

  def parse(self, inp):
    result = self.parser.parse(inp)
    if result.succeeded():
      return Success(self.action(result.output), result.rest)
    else:
      return Failure()

Using that, we can turn our expression parser into a calculator!

def digits_to_number(digits, running=0):
  """Convert a list of digits to an integer"""
  if len(digits) == 0:
    return running
  else:
    r = (running * 10) + int(digits[0])
    return digits_to_number(digits[1:], r)

digit = Parser.match(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])
number = Action(digit.many(1), digits_to_number)

In this version, number now actually outputs an integer, which is the number represented by the list of digits produced by the parser.
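
For instance, using the string-backed input sketched earlier, a quick check of the new number parser (hypothetical session) would be:

>>> print(number.parse(StringParserInput("123")).output)
123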

We can attach similar actions to all of the other rules. For example, the new version of our addition expression is:

def eval_add(lst):
  """Evaluate an addition expression. For addition rules, the parser will return
  [number, [[op, number], [op, number], ...]]
  To evaluate that, we start with the first element of the list as result value,
  and then we iterate over the pairs that make up the rest of the list, adding
  or subtracting depending on the operator.
  """
  first = lst[0]
  result = first
  for n in lst[1]:
    if n[0] == '+':
      result += n[1]
    else:
      result -= n[1]
  return result

add_expr = Action(mult_expr & (Parser.match(['-', '+']) & mult_expr).many(), eval_add)
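
The multiplication action follows exactly the same pattern. The real version is in the repository; a sketch of it might look like this, assuming that unary_expr has its own action attached so that its output is already a plain number:

def eval_mult(lst):
  """Evaluate a multiplication expression of the form
  [number, [[op, number], [op, number], ...]]
  """
  result = lst[0]
  for n in lst[1]:
    if n[0] == '*':
      result *= n[1]
    else:
      result /= n[1]   # plain division here; a real calculator would pick its own semantics
  return result

mult_expr = Action(unary_expr & (Parser.match(['*', '/']) & unary_expr).many(), eval_mult)

In the real code, of course, this would be defined before add_expr refers to it.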

With similar actions attached to the others, it now works. (The code for the other actions is in the github repository, linked below.)

>>> inp = StringParserInput("1+2*(3+5*4)*(6+7)")
>>> print(expr.parse(inp).output)
599

That’s about it for doing this in Python. The code is all here on Github, in the “python” subdirectory. The rest of the github project is the Java version. It’s more complicated to do all of this in Java, because of the type system. We’ll take a look at that in the next post. The syntax is more verbose, and the types make things a bit more complicated, but it’s still amazingly easy to build a useful parser this way.

This implementation isn’t quite what I’d call industrial strength:

  1. It’s not super-efficient; particularly when it’s parsing a multiple-precedence expression, it can do a whole lot of backtracking, which makes things very slow. To fix this, the best thing would be to add something called operator precedence parsing for handling infix expressions. Parsec, the Haskell library that introduced me to combinator parsing, does this, and it works beautifully. It would be nice to add that, but for my purposes (i.e., writing this blog post, plus a bit of language parsing for my Apex project), it’s not necessary.
  2. It has terrible error handling. In fact, it doesn’t have error handling at all. If there’s a syntax error in your input, all that will happen is that the parser will fail. To fix this, I’ll need to add a third kind of return value to the basic parser function: a parser should be able to succeed, fail without error, and fail with an error. This is something I really need to add if I ever want to use this code for anything even semi-real.
  3. It doesn’t track positions. In other words, when I get it to the point where it can actually produce a syntax error, it can’t say where the error is. I’d really like to be able to say “There’s an error at line 3 column 7”, but right now, there’s no way to do that. This one is relatively easy to add: the parser input just needs to be able to report a line and column number. I’m definitely planning on adding that.

How Computers Really Work: Math via Boolean Logic

As I try to write more posts about ARM programming, I’m finding that I keep getting sidetracked by background stuff. Hopefully it’s interesting; if not, let me know in the comments!

Today’s sidetrack: how the heck does a computer actually do math?

As I said in my post about binary the other day, at a deep level, computers don’t actually work with numbers, not even zeros and ones. Computers are electronic devices, and what they really do is work with electrical signals. In computer hardware, there are really just a few fundamental operations that it can perform with those signals. By taking those basic operations, and combining them in the right way, we can do math. How that works is very mysterious to most people. Obviously, I can’t describe all of how a computer works in a blog post. But what I’m going to do is explain the basic parts (the gates that are used to build the hardware), and how to combine them to implement a piece of hardware that does integer addition.

In computer hardware, there are four fundamental components that we can use to build operations, and they’re the basic operations of boolean logic: And, Or, Exclusive Or, and Not.

  • AND: Boolean AND takes two inputs, and outputs a 1 if both inputs are 1.
  • OR: Boolean OR takes two inputs, and outputs a 1 if at least one of its inputs is 1.
  • XOR: Boolean Exclusive-OR takes two inputs, and outputs a 1 if one, but not both, of its inputs is 1.
  • NOT: Boolean NOT (also called negation) takes one input, and outputs a 1 if its input was 0.

In talking about these kinds of operations, we frequently use truth tables to explain them. A truth table just lists all of the possible input combinations, and tells you what the output for them is. For example, here’s a truth table showing you AND, OR, and XOR:

A B A∧B A∨B A⊕B
0 0 0 0 0
0 1 0 1 1
1 0 0 1 1
1 1 1 1 0

In terms of hardware, you can actually implement all of these using one primitive gate type, the NAND (negated AND) gate. For example, in the diagram below, you can see how to build a NOT gate using a NAND gate, and then use that NAND-based NOT gate to build an AND gate using two NANDs.

[Diagram: building a NOT gate and an AND gate out of NAND gates]

In the hardware, those gates are all that we have to work with. To build arithmetic, we need to figure out how to combine these primitive pieces to build up something that produces the right results. First, we need to be able to describe what the correct results are. We’ll start by figuring out what it means to add two 1-bit numbers. We’ll have two one-bit inputs. We need two outputs – because two one-bit numbers can add up to a two-bit sum. We’ll call the outputs S and C (sum and carry – S is the low-order bit, which would be the sum if we could only have a one-bit result, and C is the value of the second bit.)

We can describe addition with a truth table:

A B S C
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1

We can read those rows as “If A is 0 and B is 0, then A+B is 0 with a carry of 0”.

If we compare that table to the table up above showing the basic operations, then we can see that S is exactly the same as XOR, and C is exactly the same as AND. So we can build a one-bit adder out of gates:

[Diagram: a half-adder, built from an XOR gate and an AND gate]

This little one-bit adder is commonly called a half-adder, because to really implement a useful add operation, we need two of them for each bit.

If you think about what you’d need to do to add together two two-bit numbers, you couldn’t just use a half-adder for each bit. The carry from the first bit needs to get added to the second bit. You need to plumb the carry result from the first bit into a third input for the second bit. To get that second bit to work properly, you need to add together three one-bit values: the two inputs for the high-order bit, and the carry from the low-order bit. Generalizing, if you want to add together N bits, when you’re computing the sum of the Mth bit, you need to include the carry from the (M-1)th bit.

To include the carry, we need to combine half-adders into a full adder, which looks like the following:

[Diagram: a full adder, built from two half-adders]

Using that, we can create an N-bit adder, by chaining the carry output from bit M-1 to the carry input of bit M. This creates the simplest adder circuit, which is called a ripple-carry adder.

[Diagram: a ripple-carry adder, chaining the carry output of each full adder into the next]
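
The diagrams are really all you need, but if you’d like to play with the idea in software, here’s a tiny Python sketch (mine, not from the hardware world) of a full adder chained into a ripple-carry adder, using nothing but the boolean operations above:

def full_adder(a, b, carry_in):
  """A one-bit full adder built from two half-adders plus an OR."""
  s1 = a ^ b               # half-adder 1: sum
  c1 = a & b               # half-adder 1: carry
  s = s1 ^ carry_in        # half-adder 2: final sum bit
  c2 = s1 & carry_in       # half-adder 2: carry
  return s, c1 | c2        # (sum bit, carry out)

def ripple_carry_add(a_bits, b_bits):
  """Add two equal-length lists of bits, least significant bit first."""
  result, carry = [], 0
  for a, b in zip(a_bits, b_bits):
    s, carry = full_adder(a, b, carry)
    result.append(s)
  return result + [carry]

# 7 + 13, least significant bit first: 00111 + 01101
print(ripple_carry_add([1, 1, 1, 0, 0], [1, 0, 1, 1, 0]))  # [0, 0, 1, 0, 1, 0], which is 20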

Ripple-carry adders are the simplest way of building an integer addition operation in hardware. They’ve got one very big downside: to add the Nth bit, you need to have the result of adding the (N-1)th bit. That means that the more bits you add, the slower the ripple-carry adder gets. You have to wait for the signals to propagate all the way through the circuit from the low bits to the high bits.

So ripple-carry isn’t really used in hardware anymore. There are a bunch of more complicated ways of building the adder that get rid of that propagation delay. It comes down to a very common tradeoff for engineers: performance versus complexity. But at the end of the day, the concept is still basically the same: it’s still a chain of multiple simple adders, that work the way I described here.

Representing Numbers in Binary

Before we can really start writing interesting programs with our ARM, even simple ones, we need to understand a bit about how a computer actually works with numbers.

When people talk about computers, they frequently say something like “It’s all zeros and ones”. That’s wrong, in a couple of different ways.

First, when you look at the actual computer hardware, there are no zeros and ones. There are two kinds of signals, represented as two different voltage levels. You can call them “high” and “low”, or “A” and “B”, or “+” and “-”. It doesn’t matter: it’s just two signals. (In fact, one fascinating thing to realize is that there’s a fundamental symmetry in those signals: you can swap the choice of which signal is 0 and which is 1, and as long as you also swap the AND gates and the OR gates, you won’t be able to tell the difference! So one ARM processor could use one signal for 1, and a different ARM could use that signal as 0, and you wouldn’t actually be able to tell. In fact, everyone does it the same way, because chip design and manufacture are standardized. But mathematically, it’s possible. We’ll talk about that duality/symmetry in another post.)

Second, the computer really doesn’t work with numbers at all. Computer hardware is all about binary logic. Even if you abstract from different voltages to the basic pairs of values, the computer still doesn’t understand numbers. You’ve got bits, but to get from bits to numbers, you need to decide on a meaning for the bits’ two possible values, and you need to decide how to put them together.

That’s what we’re really going to focus on in this post. Once you’ve decided to call one of the two primitive values 1, and the other one 0, you need to decide how to combine multiple zeros and ones to make a number.

It might seem silly, because isn’t it obvious that it should use binary? But the choice isn’t so clear. There have been a lot of different numeric representations. For one example, many of IBM’s mainframe computers used (and continue to use) something called binary coded decimal (BCD). It’s not just different for the sake of being different: in fact, for financial applications, BCD really does have some major advantages! Even if you have decided to use simple binary, it’s not that simple. Positive integers are easy. But how do you handle negative numbers? How do you handle things that aren’t integers?

We’re going to start with the simplest case: unsigned integers. (We’ll worry about fractions, decimals, and floating point in another post.) Like pretty much all modern computers, the ARM uses the basic, mathematical binary exponential representation. This works basically the same as our usual base-10 numbers. We look at the digits from right to left. The rightmost digit (also called the least significant digit) counts ones; the next digit counts 10s; the next counts 100s, and so on. So in base 10, the number 3256 means 6*10^0 plus 5*10^1 plus 2*10^2 plus 3*10^3.

[Diagram: the place values of the bits in a binary number]

In binary, we do exactly the same thing, only we do it with powers of 2 instead of powers of 10. So in binary, the number 1001101 is 1*2^0 + 0*2^1 + 1*2^2 + 1*2^3 + 0*2^4 + 0*2^5 + 1*2^6 = 1 + 4 + 8 + 64 = 77.
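
If it helps to see that rule spelled out mechanically, here’s a small Python sketch (mine, not from the post) that applies the positional rule digit by digit:

def from_binary(s):
  """Interpret a string of binary digits using the positional rule above."""
  total = 0
  for position, digit in enumerate(reversed(s)):
    total += int(digit) * (2 ** position)
  return total

print(from_binary("1001101"))  # 77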

Arithmetic on binary is easy enough – you do the same thing you would with decimal, but in subtraction, borrows give you 2, not 10. As a quick example, let’s look at 7 + 13, which is 111 + 1101.

  1. We start at the right edge. We have 1 + 1 = 10 – so the first digit of the sum is 0, and we carry 1.
  2. Next we have 1 + 0 + 1(carry) = 10 – so the second digit is again 0, and we carry 1. Our sum so far is 00.
  3. Now we’re on to the third digit. 1 + 1 + 1(carry) = 11. So the third digit is 1, and we carry one. Our sum so far is 100.
  4. Now the fourth digit, 1 + 0 + 1(carry), so we get 10: the fourth digit of the sum is 0, and the final carry becomes a fifth digit of 1. So the sum is 10100, or 20.

We’ll ignore subtraction for a moment, because as we’ll see in a little while, in computer hardware, we don’t actually need to have subtraction as a primitive. Addition is the core of computer arithmetic, and we’ll use addition to implement subtraction by taking the negative of the number being subtracted. (That is, to compute A-B, we’ll do A+(-B).)

Positive integers with addition aren’t enough to do most stuff we want to do on a computer. Even if we’re never going to write a program that manipulates numbers, we absolutely need to be able to subtract. To a computer, the only way to compare values is to subtract one from another! So we need to be able to do negatives and subtractions. How can we represent negative numbers in binary?

There are three basic choices, called sign-bit/sign-magnitude, one’s-complement and two’s complement. I’ve put an example of the three representing the number 75 in the figure below.

[Diagram: the same number in sign-magnitude, one’s complement, and two’s complement form]

In sign-bit representation, what you do is take the leftmost bit (also called the high-order bit), and use it to indicate sign (for obvious reasons, you call it the sign bit.). If the sign bit is 0, then the number is positive; if it’s 1, then the number is negative. For example, 01010 would be +10; 11010 would be -10.

For a human being, sign-bit looks great. But in practice, sign-bit was never used much, because while it looks simple to a human, it’s quite complicated to build in computer hardware. IBM did use it in some early machines, but even they gave up on it.

Next is one’s complement. In 1’s complement, the high-order bit is still a sign bit. But to convert a number from positive to negative, you don’t just change the sign bit – you invert every single bit in the number. You can still tell whether a number is positive or negative by its sign bit, but the rest of the bits are also different. +10 in one’s complement binary is 01010; -10 is 10101.

Arithmetic in 1s complement is a bit weird. You can almost just add a negative number to a positive number as if they were both positive. Almost, but not quite.

For example, let’s try 6 + -6. 6 is 0110, and -6 is 1001. Add them up: 1111. In one’s complement, that’s -0. And there’s one of the weird issues about one’s complement: it’s got two distinct values for 0 – +0 and -0. Since they’re both just 0, we treat them as equal, and it’s not really a problem.

How about 6 + -8? 6 is 00110 (we need 5 bits to handle 8), and -8 is 10111. Add them up, and you get 11101 – which is -2, the correct answer.

Now, what about 8 + -6? 8 is 01000, and -6 is 11001. Add them up, and you get 00001, with a carry of 1. So 8 + -6 = 1? That’s wrong! In one’s complement, there are a bunch of places where simple binary addition will be off by one. So you need to work out the algorithm for where it’s off by one and where it’s not, and you need to build more complicated hardware to incorporate it. That’s not attractive.
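
Just to make that concrete: the classic fix is the “end-around carry” – whenever the addition produces a carry out of the top bit, you add that carry back into the result. Here’s a little Python sketch (mine, just to illustrate the idea) of one’s complement addition with the end-around carry applied:

def ones_complement_add(a, b, bits):
  """Add two N-bit one's complement values, applying the end-around carry."""
  mask = (1 << bits) - 1
  total = a + b
  if total > mask:                  # a carry out of the top bit...
    total = (total & mask) + 1      # ...gets added back in: the end-around carry
  return total & mask

# 8 + (-6) in 5-bit one's complement: 01000 + 11001
print(bin(ones_complement_add(0b01000, 0b11001, 5)))  # 0b10, i.e. 2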

Still, one’s complement has been used a lot. In particular, one of the first computers I got to use was an old, beaten-up PDP-1, which used 1’s complement numbers.

Finally, we get to the representation that’s used in pretty much all modern computers: 2’s complement!

Once again, 2’s complement uses a sign bit. But instead of flipping all of the bits, you do something different. In 2’s complement, you need to know how many bits you’re using. If you’re doing an N-bit 2’s complement binary number, then the number -x is represented by 2^N - x.

So if we’re doing 6 bits, and we wanted to represent -5, then we’d take 2^6 - 5, or 64-5=59. In binary, that’s 111011.

The really beautiful thing about 2s complement is that it’s pretty much the same thing as a truncated 2-adic integer – which means that arithmetic just works. If you’re adding two numbers, it doesn’t matter whether they’re signed numbers or not – it works.

It’s also really easy to implement negation. You don’t have to do that whole “subtract from 2^N” thing. In 2s complement, -N is 1+(ones_complement(N)). That’s super-easy to implement in hardware, and it’s also easy to understand and do for a human: flip the bits and add one!
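
To make the “flip the bits and add one” trick concrete, here’s a quick Python sketch (mine, not from the post):

def twos_complement(x, bits):
  """Return the N-bit two's complement bit pattern of x, as an unsigned value."""
  return x % (1 << bits)

def negate(x, bits):
  """Negate by flipping the bits and adding one, masked to the given width."""
  return (~x + 1) & ((1 << bits) - 1)

print(bin(twos_complement(-5, 6)))  # 0b111011, matching the example above
print(bin(negate(5, 6)))            # 0b111011 as well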

Two’s complement is, in my opinion, a clear winner in integer representation, and the world of computer hardware makers agrees – everyone now uses 2’s complement for integers.

Controlling Thousands of Machines. Aka, My Day Job.

I promised my friend Toby that I’d link to one of his blog posts. Toby is one of the SREs that I work with at twitter, and let me tell you, you always want to stay on the good side of the SREs!

And to introduce it, I actually get a chance to talk about what I really do!

Since last July, I’ve been working for twitter. I haven’t yet gotten to talk about what it is that I do at twitter. For obvious reasons, I think it’s absolutely fascinating. And since we just recently released it as open-source, there’s no reason to keep it secret anymore. I work on something called Aurora, which is a framework for Mesos.

Let’s start with Mesos.

When you run a web application like Twitter, you can’t run your application on a single computer. There’s a bunch of reasons for that, but the biggest one is that there’s not a computer in the world that’s big enough to do it; and if there was, depending on one epically massive computer would be incredibly unreliable. Instead, what we do is set up a data center, and fill it up with thousands upon thousands of cheap machines. Then we just use all of those thousands of little computers. These days, everyone does things that way. Twitter, Google, Facebook, Amazon, Microsoft, Apple – we all run on gigantic clusters of cheap machines in a datacenter.

But there’s a problem there. How do you use a thousand computers? Much less 10,000 or a million, or more?

What you do is build a substrate, called a cluster management system. You write a program, and you say that you want to run 1,000 copies of it, and hand it to the cluster manager substrate. The substrate takes care of figuring out where to put those processes, and telling the specific individual machines that it selects what to do to run your program.

One of the hardest problems in a cluster management system is figuring out how to assign programs to machines. Depending on what a particular program needs to do, its requirements can vary enormously. Running a quick map-reduce has one set of requirements. Running a reliable service – a service that can survive if part of a datacenter loses electricity – has a different set of requirements.

One approach – the one used by Google when I worked there – was to define a constraint language that ultimately had hundreds of possible parameters, and then treat it as a classical optimization problem, trying to optimize over all of it. The good side of that is that it meant that every program at Google could express its requirements exactly. The bad side of it was that the constraint system was incredibly complicated, and almost no one could predict what results a given set of constraints would produce. Configuration turned into a sort of game, where you’d make a guess, look at what the borg gave you, and then modify your constraint – repeat until you got something satisfactory.

At Twitter, Mesos is our cluster management system. It takes a completely different approach. Basically, it takes the borg-like approach and turns it upside down. Mesos itself doesn’t try to do constraint satisfaction at all. What it does is talk to components, called frameworks, and it offers them resources. It basically says “Here’s all of the possible ways I can give you a fair share of resources in the clusters. Tell me which ones you want.”.

Each framework, in turn, figures out how to allocate the resources offered to it by Mesos to individual jobs. In some ways, that might seem like it’s just deferring the problem: don’t the frameworks end up needing to do the same allocation nightmare? But in fact, what you do is write a framework for a particular kind of job. A framework needs to work out some constraint satisfaction – but it’s a much simpler problem, because instead of needing to satisfy every possible customer with one way of expressing constraints, each framework can decide, on its own, how to allocate resources to one particular kind of job. Map-reduce jobs get one framework that knows how to do a good job scheduling map-reduces. Services get a framework that knows how to do a good job allocating resources for services. Databases get a framework that knows how to do a good job allocating resources for storage and bandwidth intensive processes. As a result, the basic Mesos cluster manager is dramatically simpler than a one-size-fits-all scheduler, and each of the component frameworks is also much simpler.

You can see a really cute sort-of demonstration of how Mesos works by looking at @MesosMaster and @MesosSlave on twitter.

I don’t work on Mesos.


I work on Aurora. Aurora is a framework that runs under Mesos. It’s specialized for running services. Services are basically little server processes that run a lot of identical copies for a long time. Things like web-servers – where you need a ton of them to support millions of users. Or things like common processes that live behind the web-server answering requests for particular kinds of information.

At Twitter, all of the services that make up our system run in our datacenters on top of Mesos. They’re scheduled by Aurora.

With Aurora, running a service is incredibly easy. You write a configuration:

my_server_task = SequentialTask(
  processes = [
    Process(name='stage_binary',
        cmdline='hadoopfs-copy /glorp/foo/hello_service .'),
    Process(name='run_service', cmdline='./myservice')
  ],
  resources = Resources(cpu=1.0, ram=128*MB, disk=128*MB))

jobs = [
    Service(task = my_server_task, 
        cluster = 'mycluster',
        role='markcc_service',
        environment = 'prod',
        name = 'hello',
        instances=300)]

The configuration says how to install a program and then run it on a machine, once Aurora assigns a machine to it. Aurora will find 300 machines, each of which can dedicate one full CPU, 128MB of memory, and 128MB of disk to the process, and it will run the service on those 300 machines.

Both Aurora and Mesos are open-source, under the Apache license. We’ve got a test environment in the distribution, so that you could be running tests in a virtual Mesos/Aurora cluster one hour from now! And my good friend Toby wrote an excellent blog post on how to set up a working Mesos/Aurora cluster.

I don’t want to toot my own horn too much, but I’m incredibly proud that Twitter open-sourced this stuff. Most companies consider their cluster management systems to be super-proprietary secrets. But at Twitter, we decided that this is a problem that no one should have to solve from scratch. It’s something we all know how to do, and it’s time that people stop being forced to waste time reinventing the same old wheel. So we’re giving it away, to anyone who wants to use it. That’s pretty damned awesome, and I’m proud to be a part of it!

Hello World in ARM Assembly Language

Since K&R’s book on C, it’s become traditional to start any tutorial on a new language by presenting a program that prints “Hello world”. For ARM assembly running on Raspbian Linux, that traditional program looks like:

.global _start
_start:
  MOV R7, #4            @ 4 is the OS code for the "write" function
  MOV R0, #1            @ where to write: 1 means standard output
  MOV R2, #12           @ the length of the string, in characters
  LDR R1, =string       @ the address of the string to write
  SWI 0                 @ call the OS
  MOV R7, #1            @ 1 is the OS code for "exit"
  SWI 0                 @ call the OS again
  .data
string:
  .ascii "Hello World\n"

It’s definitely a bit more cryptic than most languages, but it doesn’t look all that bad, now does it? Before I can explain how that works, we’ll need to talk a bit about what we’re programming, and how we can program it. We’re going to go through a bunch of introductory material here; everything that we touch on, I’ll come back to in more detail in a later post.


[Diagram: the main parts of an ARM CPU, from our initial perspective]
In the diagram above, you can see my attempt to illustrate the important parts of an ARM CPU from our initial perspective. As we learn more about it, we’ll gradually refine this picture, adding more details – but for now, this is what things look like.

For now, we’ll say that the CPU has 5 main parts:

  1. A collection of 16 registers. A register is a memory cell that’s built in to the CPU. On an ARM processor, any time that you want to do any kind of operation – arithmetic, logic, comparison, you name it! – you’ll need to have the values in a register. The first thirteen registers are available to you, to use for whatever you want. The last three are special; R13 is called the stack pointer (SP), R14 is called the link register (LR), and R15 is called the program counter (PC). We’ll talk about what those three mean as we learn to program.
  2. An arithmetic/logic unit (ALU). This is where the CPU does integer arithmetic and logic. Most of our programs will work exclusively with the ALU. (Floating point is important, but it’s possible to do an awful lot of programming without it.)
  3. A floating point unit (FPU). This is where the CPU does floating point arithmetic.
  4. A status register. This is, like the other registers, a chunk of internal storage. But you can’t manipulate it or access it directly. It’s automatically updated by the ALU/FPU. Individual bits of the status register get updated to reflect various conditions about the current status of the CPU, and the results of the previous instruction. For example, the way that you can compare two values in the ARM is to subtract one from the other. If the two values were equal, then the ZERO flag in the status register will be set to 1; otherwise it will be set to 0. There’s a branch instruction that only actually branches if the ZERO flag is set.
  5. A data channel, called the bus. The bus connects the CPU to the rest of the computer. Memory, other storage devices, and input and output devices are all connected to the CPU via the bus. Doing anything that involves communicating through the bus is slow compared to doing anything that doesn’t. For now, we’ll say that memory is the only thing on the bus.

Now that we have a bit of a clue about the basic pieces of this thing we’re going to program, we can start looking at our hello world program. We still need to talk about one other bit of background before we can get started.

For a computer, on the lowest level, a “program” is just a chunk of numbers. It’s not even a chunk of instructions – it’s just numbers. The numbers can be instructions, data, or both at the same time! That last bit might sound strange, but you’ll see instructions like MOV R0, #4. That’s saying load the literal value 4 into register R0. The 4 is a value encoded as a part of an instruction. So that 4 is both literal data sitting in the middle of a collection of instructions, and it’s also a part of an instruction. The actual instruction doesn’t really say “load the value 4”; it says “load the data value that’s at this position in the instruction sequence”.

We’re not going to program the ARM using the numeric instructions directly. We’re going to program the ARM using assembly language. Assembly language is a way of writing that chunk of numbers that is your program, but doing it with a syntax that’s easy for a human being to read. Then a program called an assembler will translate from that readable format into the raw numeric format used by the computer. Conceptually, the assembler sounds a lot like the compiler that you’d use with a higher level language. But it’s quite different: compilers take your code, and change it. Frequently, if you look at code that your compiler generates, you’d have a hard time recognizing code that was generated for a program that you wrote! But an assembler doesn’t change anything. There’s no restructuring, no optimization, no changes at all. In an assembly language program, you’re describing how to lay out a bunch of instructions and data in memory, and the assembler does nothing but generate that exact memory layout.


Ok. That said, finally, we can get to the program!

Programming in assembly is quite different from programming in any reasonable programming language. There are no abstractions to make your life easier. You need to be painfully explicit about everything. It really brings home just how many abstractions you generally use in your code.

For example, in assembly language, you don’t really have variables. You can store values anywhere you want in the computer’s memory, but you have to decide where to put them, and how to lay them out, by yourself. But as I said before – all of the arithmetic and logic that makes up a program has to be done on values in registers. So a value in memory is only good if you can move it from memory into a register. It’s almost like programming in a language with a total of 16 variables – only you’re only really allowed to use 13 of them!

Not only do you not have variables, but you don’t really have parameters. In a high level programming language, you can just pass things to subroutines. You don’t need to worry about how. Maybe they’re going onto a stack; maybe they’re doing some kind of fancy lambda calculus renaming thing; maybe there are some magic variables. You don’t need to know or care. But in assembly, there is no built-in notion of parameter-passing. You need to use the computer’s registers and memory to build a parameter passing system. In the simplest form of that, which is what we’re using here, you designate certain registers as carrying certain parameters. There’s nothing in assembly to enforce that: if your program puts something into register R3, and a function was expecting it to be in R4, you won’t get any kind of error.

In our “Hello world” program above, the first four instructions load specific values into the registers expected by the operating system “print” function. For example, MOV R7, #4 means move the specific number 4 into register R7.

Loading literal values into registers is done using the MOV instruction. It’s got two operands: the register to move the data into, and the source of the data. The source of the data can be either a literal value, or another register. If you want to load data from memory, you need to use a different instruction – LDR.

With the LDR instruction, we can see one of the conveniences of using assembly language. We want to print the string “Hello world”. So we need to have that string in memory somewhere. The assembler lets us do that using a .ascii directive. The directive isn’t an ARM instruction; it’s an instruction to the assembler telling it “I want to put this string data into a block in memory”. The .ascii directive is prefaced with a label, which allows us to refer to the beginning of the memory block populated by the directive. Now we can use “string” to refer to the memory block. So the instruction LDR R1, =string is exactly the same as saying LDR R1, address, where address is the memory location where the first byte of the string is stored.

These four instructions are preparation for calling a function provided by the operating system. R7 is used by the operating system to figure out which function we want to call, and R0, R1, and R2 pass parameters to that function. The print function expects R0 to say where to write (1 means standard output), R1 to contain the memory location of the first byte in the string we want to print, and R2 to contain the number of characters in the string.

We call the function using SWI 0. SWI is the software interrupt function. We can’t call the operating system directly! One of the purposes of the operating system is to provide a safe environment, where different programs can’t accidentally interfere with one another. If you could just branch into an OS function directly, any program would be able to do anything it wanted! But we don’t allow that, so the program can’t directly call anything in the OS. Instead, what it does is send a special kind of signal called an interrupt. Before it runs our program, the operating system has already told the CPU that any time it gets an interrupt, control should be handed to the OS. So the operating system gets called by the interrupt. It sees the values in R0 and R7, and recognizes that the interrupt is a request to run the “print” function, so it does that. Then it returns from the interrupt – and execution continues at the first instruction after the SWI call.

Now it’s returned from the print, and we don’t want to do anything else. If we didn’t put something here to tell the operating system that we’re done, the CPU would just proceed to the next memory address after our SWI, and interpret that as an instruction! We need to specifically say “We’re done”, so that the operating system takes control away from our program. The way we do that is with another SWI call. This SWI is the operating system “exit” call. To exit a program and kill the process, you put 1 (the code for “exit”) into R7, and call SWI again.

And that’s it. That’s hello-world in assembly.

Leading in to Machine Code: Why?

I’m going to write a few posts about programming in machine language. It seems that many more people are interested in learning about the ARM processor, so that’s what I’ll be writing about. In particular, I’m going to be working with the Raspberry Pi running Raspbian linux. For those who aren’t familiar with it, the Pi is a super-inexpensive computer that’s very easy to program, and very easy to interface with the outside world. It’s a delightful little machine, and you can get one for around $50!

Anyway, before getting started, I wanted to talk about a few things. First of all, why learn machine language? And then, just what the heck is the ARM thing anyway?

Why learn machine code?

My answer might surprise you. Or, if you’ve been reading this blog for a while, it might not.

Let’s start with the wrong reason. Most of the time, people say that you should learn machine language for speed: programming at the machine code level gets you right down to the hardware, eliminating any layers of junk that would slow you down. For example, one of the books that I bought to learn ARM assembly (Raspberry Pi Assembly Language RASPBIAN Beginners: Hands On Guide) said:

even the most efficient languages can be over 30 times slower than their machine code equivalent, and that’s on a good day!

This is pure, utter rubbish. I have no idea where he came up with that 30x figure, but it’s got no relationship to reality. (It’s a decent book, if a bit elementary in approach; this silly statement isn’t representative of the book as a whole!)

In modern CPUs – and the ARM definitely does count as modern! – the fact is, for real world programs, writing code by hand in machine language will probably result in slower code!

If you’re talking about writing a single small routine, humans can be very good at that, and they often do beat compilers. But once you get beyond that, and start looking at whole programs, any human advantage in machine language goes out the window. The constraints that actually affect performance have become incredibly complex – too complex for us to juggle effectively. We’ll look at some of these in more detail, but I’ll explain one example.

The CPU needs to fetch instructions from memory. But memory is dead slow compared to the CPU! In the best case, your CPU can execute a couple of instructions in the time it takes to fetch a single value from memory. This leads to an obvious problem: it can execute (or at least start executing) one instruction for each clock tick, but it takes several ticks to fetch an instruction!

To get around this, CPUs play a couple of tricks. Basically, they don’t fetch single instructions, but instead grab entire blocks of instructions; and they start retrieving instructions before they’re needed, so that by the time the CPU is ready to execute an instruction, it’s already been fetched.

So the instruction-fetching hardware is constantly looking ahead, and fetching instructions so that they’ll be ready when the CPU needs them. What happens when your code contains a conditional branch instruction?

The fetch hardware doesn’t know whether the branch will be taken or not. It can make an educated guess by a process called branch prediction. But if it guesses wrong, then the CPU is stalled until the correct instructions can be fetched! So you want to make sure that your code is written so that the CPU’s branch prediction hardware is more likely to guess correctly. Many of the tricks that humans use to hand-optimize code actually have the effect of confusing branch prediction! They shave off a couple of instructions, but by doing so, they also force the CPU to sit idle while it waits for instructions to be fetched. That branch prediction failure penalty frequently outweighs the cycles that they saved!

That’s one simple example. There are many more, and they’re much more complicated. And to write efficient code, you need to keep all of those in mind, and fully understand every tradeoff. That’s incredibly hard, and no matter how smart you are, you’ll probably blow it for large programs.

If not for efficiency, then why learn machine code? Because it’s how your computer really works! You might never actually use it, but it’s interesting and valuable to know what’s happening under the covers. Think of it like your car: most of us will never actually modify the engine, but it’s still good to understand how the engine and transmission work.

Your computer is an amazingly complex machine. It’s literally got billions of tiny little parts, all working together in an intricate dance to do what you tell it to. Learning machine code gives you an idea of just how it does that. When you’re programming in another language, understanding machine code lets you understand what your program is really doing under the covers. That’s a useful and fascinating thing to know!

What is this ARM thing?

As I said, we’re going to look at machine language coding on the ARM processor. What is this ARM beast anyway?

It’s probably not the CPU in your laptop. Most desktop and laptop computers today are based on a direct descendant of the first microprocessor: the Intel 4004.

Yes, seriously: the Intel CPUs that drive most PCs are, really, direct descendants of the first CPU designed for desktop calculators! That’s not an insult to the Intel CPUs, but rather a testament to the value of a good design: they’ve just kept on growing and enhancing. It’s hard to see the resemblance unless you follow the design path, where each step follows directly on its predecessors.

The Intel 4004, released in 1971, was a 4-bit processor designed for use in calculators. Nifty chip, state of the art in 1971, but not exactly what we’d call flexible by modern standards. Even by the standards of the day, they recognized its limits. So following on its success, they created an 8-bit version, which they called the 8008. And then they extended the instruction set, and called the result the 8080. The 8080, in turn, yielded successors in the 8088 and 8086 (and the Z80, from a rival chipmaker).

The 8086 was the processor chosen by IBM for its newfangled personal computers. Chip designers kept making it better, producing the 80286, 386, Pentium, and so on – up to today’s CPUs, like the Core i7 that drives my MacBook.

The ARM comes from a different design path. At the time that Intel was producing the 8008 and 8080, other companies were getting into the same game. From the PC perspective, the most important was the 6502, which was used by the original Apple, Commodore, and BBC microcomputers. The 6502 was, incidentally, the first CPU that I learned to program!

The ARM isn’t a descendant of the 6502, but it is a product of the 6502-based family of computers. In the early 1980s, the BBC decided to create an educational computer to promote computer literacy. They hired a company called Acorn to develop a computer for their program. Acorn developed a beautiful little system that they called the BBC Micro.

The BBC micro was a huge success. Acorn wanted to capitalize on its success, and try to move it from the educational market to the business market. But the 6502 was underpowered for what they wanted to do. So they decided to add a companion processor: they’d have a computer which could still run all of the BBC Micro programs, but which could do fancy graphics and fast computation with this other processor.

In a typical tech-industry NIH (Not Invented Here) moment, they decided that none of the other commercially available CPUs were good enough, so they set out to design their own. They were impressed by the work done by the Berkeley RISC (Reduced Instruction Set Computer) project, and so they adopted the RISC principles, and designed their own CPU, which they called the Acorn RISC Microprocessor, or ARM.

The ARM design was absolutely gorgeous. It was simple but flexible and powerful, able to run on very little power while generating very little heat. It had lots of registers and an extremely simple instruction set, which made it a pleasure to program. Acorn built a lovely computer with a great operating system called RiscOS around the ARM, but it never really caught on. (If you'd like to try RiscOS, you can run it on your Raspberry Pi!)

But the ARM didn’t disappear. Tt didn’t catch on in the desktop computing world, but it rapidly took over the world of embedded devices. Everything from your cellphone to your dishwasher to your iPad are all running on ARM CPUs.

Just like the Intel family, the ARM has continued to evolve: the ARM family has gone through 8 major design changes, and dozens of smaller variations. The chips are no longer produced by Acorn – the ARM design is now maintained and licensed by a dedicated company, and ARM chips are produced by dozens of different manufacturers – Motorola, Apple, Samsung, and many others.

Recently, they've even started to expand beyond embedded platforms: some Chromebook laptops are ARM-based, and several companies are starting to market ARM-based server boxes for datacenters! I'm looking forward to the day when I can buy a nice high-powered ARM laptop.

More Basics: Compilers, Programs, and Languages

After my “what is an OS?” post, a couple of readers asked me to write a similar post about compilers.

Before I can answer what a compiler is, it’s helpful to first answer a different question: what is a program?

And here we get to one of my pet peeves. The most common answer to that question is “a detailed step-by-step sequence of instructions”. For example, here’s what wikipedia says:

A computer program, or just a program, is a sequence of instructions, written to perform a specified task with a computer.

This is wrong.

Back when people first started to study the idea of computing devices, they talked about computing machines as devices that performed a single, specific task. If you think about a basic Turing machine, you normally define a machine that performs a single computation: it's got a built-in set of states and a built-in transition table, so it can only perform that one computation. It takes one kind of input, performs its computation on that input, and produces its output.

Building up from these specific machines, they came up with the idea of a universal computing device. A universal computer was a computing machine whose input was a description of a different computing machine. By giving the universal machine different inputs, it could perform different computations.

The point of this diversion is that looking at this history tells us what a program really is: it’s a description of a computing machine. Our computers are universal computing machines; they take programs as input to describe the computing machines we want them to emulate. What we’re doing when we program is describing a computing machine that we’d like to create. Then we feed it into our universal computing machine, and it behaves as if we’d built a custom piece of hardware to do our computation!
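
To make that idea concrete, here's a tiny sketch in Python (purely illustrative – the machine and its transition table are made up for this example). The "machine description" is just a chunk of data, and a little universal interpreter will happily run any description you hand it:

# A toy "machine description": a Turing machine that flips every bit on its
# tape and halts when it runs off the end. The whole machine is just data.
bit_flipper = {
  ('scan', '0'): ('scan', '1', 1),   # (state, symbol) -> (new state, symbol to write, move)
  ('scan', '1'): ('scan', '0', 1),
  ('scan', ' '): ('halt', ' ', 0),
}

def run(machine, tape, state='scan', pos=0):
  # A tiny "universal" interpreter: it will run any machine description.
  tape = list(tape) + [' ']
  while state != 'halt':
    state, symbol, move = machine[(state, tape[pos])]
    tape[pos] = symbol
    pos = pos + move
  return ''.join(tape).strip()

print(run(bit_flipper, '10110'))   # prints 01001

The interesting part isn't the bit-flipper itself: it's that run() doesn't care which machine description it's given. Feed it a different table, and it behaves like a different machine.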

The problem is, our computers are simultaneously very primitive and overwhelmingly complex. They can only work with data expressed in fixed-length sequences of on/off values; to do anything else, we need to find a way of expressing it in terms of extremely simple operations on those on/off values. To make them operate efficiently, they've got a complex structure: many different kinds of storage (registers, L1 and L2 caches, addressable memory), complicated instruction sets, and a whole lot of tricky performance tweaks. It's really hard to program a computer in terms of its native instructions!

In fact, it’s so hard to program in terms of native instructions that we just don’t do it. What we do is write programs in terms of different machines. That’s the point of a programming language.

Looked at this way, a programming language is a way of describing computing machines. The difference between programming languages is in how they describe those machines. A language like C describes von Neumann machines. Haskell describes machines that work via lambda calculus computations, using something like a spineless G-machine. Prolog describes machines that perform computations in terms of logical inference, like a Warren Abstract Machine.

So finally, we can get to the point: what is a compiler? A compiler is a program that takes a description of a computing device defined in one way, and translates it into the kind of machine description that can be used by our hardware. A programming language lets us ignore all of the complexities of how our actual hardware is built, and describe our computations in terms of a simple abstraction. A compiler takes that description, and turns it into the form that the computer hardware can actually use.

For anyone who’s read this far: I’ve gotten a few requests to talk about assembly language. I haven’t programmed in assembly since the days of the Motorola 68000. This means that to do it, I’ll need to learn something more up-to-date. Would you be more interested in seeing Intel, or ARM?

Boot all the computers!

Moving on from last week's operating system post, today we'll look at how a computer boots up and loads an operating system.

Let's start with why booting is a question at all. When a computer turns on, what happens? What we're used to seeing is that the disk drive turns on and starts spinning, and the computer loads something from the disk.

The question is: how does the computer know how to turn on the disk? As I said in the OS post, the CPU only really knows how to work with memory. To talk to a disk drive, it needs to do some very specific things – write to certain memory locations, wait for things to happen. Basically, in order to turn on that disk drive and load the operating system, it needs to run a program. But how does it know what program to run?

I'm going to focus on how modern PCs work. Other computers use (or have used) a similar process: the details vary, but the basic idea is the same.

A quick overview of the process:

  1. CPU startup.
  2. Run BIOS initialization.
  3. Load the bootloader.
  4. Run the bootloader.
  5. Load and run the OS.

As that list suggests, it’s not a particularly simple process. We think of it as one step: turn on the computer, and it runs the OS. In fact, it’s a complicated dance of many steps.

On the lowest level, it’s all hardware. When you turn on a computer, some current gets sent to a clock. The clock is basically a quartz crystal; when you apply current to the crystal, it vibrates and produces a regular electrical pulse. That pulse is what drives the CPU. (When you talk about your computer’s speed, you generally describe it in terms of the frequency of the clock pulse. For example, in the laptop that I’m using to write this post, I’ve got a 2.4 GHz processor: that means that the clock chip pulses 2.4 billion times per second.)

When the CPU gets a clock pulse, it executes an instruction from memory. It knows what instruction to execute because it’s got a register (a special piece of memory built-in to the CPU) that tells it what instruction to execute. When the computer is turned on, that register is set to point at a specific location. Depending on the CPU, that might be 0, or it might be some other magic location; it doesn’t matter: what matters is that the CPU is built so that when it’s first turned on and it receives a clock pulse that starts it running, that register will always point at the same place.

The software part of the boot process starts there: the computer has a chunk of read-only memory mapped at that location – so when the computer turns on, there's a program sitting there, ready to run. On PCs, that program is called the BIOS (Basic Input/Output System).

The BIOS knows how to tell the hardware that operates your display to show text on the screen, and it knows how to read stuff on your disk drives. It doesn’t know much beyond that. What it knows is extremely primitive. It doesn’t understand things like filesystems – the filesystem is set up and controlled by the operating system, and different operating systems will set up filesystems in different ways. The BIOS can’t do anything with a filesystem: it doesn’t include any programming to tell it how to read a filesystem, and it can’t ask the operating system to do it, because the OS hasn’t loaded yet!

What the BIOS does is something similar to what the CPU did when it started up. The CPU knew to look in a special location in memory to find a program to run. The BIOS knows to look at a special section on a disk drive to find a program to run. Every disk has a special chunk of data on it called the master boot record (MBR). The MBR contains another program, called a boot loader. So the BIOS loads the boot loader, and then uses it to actually load the operating system.
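
To make that slightly more concrete, here's a minimal sketch in Python (the file name disk.img is just a hypothetical raw disk image, like one you'd use with an emulator) that reads a disk's first 512-byte sector and checks for the MBR's standard boot signature, the bytes 0x55, 0xAA at the very end of the sector. That signature is how the firmware recognizes that there's something bootable there:

def has_boot_signature(image_path):
  # The master boot record is the disk's first 512-byte sector; a valid MBR
  # ends with the two signature bytes 0x55, 0xAA.
  with open(image_path, 'rb') as f:
    mbr = f.read(512)
  return len(mbr) == 512 and mbr[510:512] == b'\x55\xaa'

print(has_boot_signature('disk.img'))   # 'disk.img' is a hypothetical raw disk image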

This probably seems a bit weird. The computer starts up by looking in a specific location for a program to run (the BIOS), which loads something (the bootloader). The thing it loads (the bootloader) also just looks in a specific location for a program to run (the OS). Why the two layers?

Different operating systems are built differently, and the specific steps to actually load and run the OS are different. For example, on my laptop, I can run two operating systems: MacOS and Linux. On MacOS (aka Darwin), there's something called a microkernel that gets loaded. The microkernel is stored in a file named "mach_kernel" in the root directory of a type of filesystem called HFS. But in my installation of Linux, the OS is stored in a file named "vmlinuz" in the root directory of a type of filesystem called EXT4. The BIOS doesn't know what operating system it's loading, and it doesn't know what filesystem the OS uses – and that means that it knows neither the name of the file to load, nor how to find that file.

The bootloader was set up by the operating system. It’s specific to the operating system – you can think of it as part of the OS. So it knows what kind of filesystem it’s going to look at, and how to find the OS in that filesystem.

So once the bootloader gets started, it knows how to load and run the operating system, and once it does that, your computer is up and running, and ready for you to use!

Of course, all of this is a simplified version of how it works. But for understanding the process, it’s a reasonable approximation.

(To reply to commenters: I’ll try to do a post like this about compilers when I have some time to write it up.)

Basics: What is an OS?

A reader of this blog apparently likes the way I explain things, and wrote to me to ask a question: what is an operating system? And how does a computer know how to load it?

I'm going to answer that, but in a roundabout way. The usual answer is something like: "An operating system or OS is a software program that enables the computer hardware to communicate and operate with the computer software." In my opinion, that's a cop-out: it doesn't really answer anything. So I'll take the longer path, and hopefully give you an answer that actually explains things in more detail, which should help you understand them better.

When someone like me sets out to write a program, how can we do it? That sounds like an odd question, until you actually think about it. The core of the computer, the CPU, is a device which really can’t do very much. It’s a self-contained unit which can do lots of interesting mathematical and logical operations, but they all happen completely inside the CPU (how they happen inside the CPU is way beyond this post!). To get stuff in and out of the CPU, the only thing that the computer can do is read and write values from the computer’s memory. That’s really it.

So how do I get a program in to the computer? The computer can only read the program if it’s in the computer’s memory. And every way that I can get it into the memory involves the CPU!

Computers are built so that there are certain memory locations and operations that are used to interact with the outside world. They also have signal wires called interrupt pins where other devices (like disk drives) can apply a current to say "Hey, I've got something for you". The exact mechanics are, of course, complicated, and vary from CPU to CPU. But to give you an idea of what it's like: to read some data from disk, you'd do something like the following (there's a toy sketch of these steps in code right after the list).

  1. Set aside a chunk of memory where the data should be stored after it’s read. This is called a buffer.
  2. Figure out where the data you want to read is stored on the disk. You can identify disk locations as a number. (It's usually a bit more complicated than that, but we're trying to keep this simple.)
  3. Write that number into a special memory location that's monitored by the disk drive controller.
  4. Wait until the disk controller signals you via an interrupt that the data is ready. The data will be stored in a special memory location that can be altered by the disk. (Simplifying again, but this is sometimes called a DMA buffer.)
  5. Copy the data from the controller’s DMA buffer into the application’s memory buffer that you allocated.
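
None of this maps onto any real hardware (the actual registers, addresses, and protocols vary from device to device), but here's a toy simulation in Python of that five-step dance, just to show its shape. The "controller" is faked with an ordinary object standing in for memory-mapped locations:

class FakeDiskController(object):
  def __init__(self, disk_contents):
    self._disk = disk_contents                          # pretend platters: sector number -> bytes
    self.registers = {'sector': None, 'ready': False}   # stands in for memory-mapped locations
    self.dma_buffer = b''

  def tick(self):
    # Pretend the hardware finished its read and raised its interrupt.
    self.dma_buffer = self._disk[self.registers['sector']]
    self.registers['ready'] = True

def read_sector(controller, sector):
  buffer = bytearray()                       # step 1: set aside a buffer for the result
  controller.registers['sector'] = sector    # steps 2-3: tell the controller which sector to read
  while not controller.registers['ready']:   # step 4: "wait" for the controller's interrupt
    controller.tick()
  buffer[:] = controller.dma_buffer          # step 5: copy out of the controller's DMA buffer
  return bytes(buffer)

disk = FakeDiskController({7: b'hello from sector 7'})
print(read_sector(disk, 7))                  # prints b'hello from sector 7'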

When you get down to that level, programming is an intricate dance! No one
wants to do that – it’s too complicated, too error prone, and just generally
too painful. But there’s a deeper problem: at this level, it’s every program
for itself. How do you decide where on the disk to put your data? How can you
make sure that no one else is going to use that part of the disk? How can you
tell another program where to find the data that you stored?

You want to have something that creates the illusion of a much simpler computational world. Of course, under the covers, it's all going to be that incredibly messy stuff, but you want to cover it up. That's the job of an operating system: it's a layer between the hardware and the programs that you run that creates a new computational world that's much easier to work in.

Instead of having to do the dance of mucking with the hard disk drive controller yourself, the operating system gives you a way of saying "Open a file named 'foo'", and then it takes that request, figures out where 'foo' is on the disk, talks to the disk drive, gets the data, and then hands you a buffer containing it. You don't need to know what kind of disk drive the data is coming from, or how the name 'foo' maps to sectors of the disk. You don't need to know where the control memory locations for the drive are. You just let the operating system do that for you.
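
To see how big a difference that makes, here's what the same job looks like from a program's point of view: a couple of lines of Python, with the operating system doing all of the controller-level work described earlier (the file name 'foo' is just the example name from above).

# All of the controller-level work is hidden behind the OS's file abstraction.
with open('foo', 'rb') as f:   # ask the OS to find and open the file named 'foo'
  data = f.read(1024)          # the OS talks to the disk and hands us a buffer of data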

So, ultimately, this is the answer: The operating system is a program that runs on the computer, and creates the environment in which other programs can run. It does a lot of things to create a pleasant environment in which to write and run other programs. Among the multitude of services provided by most modern operating systems are:

  1. Device input and output. This is what we talked about above: direct interaction with input and output devices is complicated and error prone; the operating system implements the input and output processes once, (hopefully) without errors, and then makes it easy for every other program to just use its correct implementation.
  2. Multitasking: your computer has enough power to do many things at once. Most modern computers have more than one CPU. (My current laptop has 4!) And most programs end up spending a lot of their time doing nothing: waiting for you to press a key, or waiting for the disk drive to send it data. The operating system creates sandboxes, called processes, and allows one program to run in each sandbox. It takes care of ensuring that each process gets to run on a CPU for a fair share of the time.
  3. Memory management. With more than one program running at the same time on your computer, you need to make sure that you’re using memory that isn’t also being used by some other program, and to make sure that no other program can alter the memory that you’re using without your permission. The operating system decides what parts of memory can be used by which program.
  4. Filesystems. Your disk drive is really just a huge collection of small sections, each of which can store a fixed number of bits, encoded in some strange format dictated by the mechanics of the drive. The OS provides an abstraction that’s a lot easier to deal with.

I think that’s enough for one day. Tomorrow: how the computer knows how to run the OS when it gets switched on!

Basic Data Structures: Hash Tables

I’m in the mood for a couple of basics posts. As long-time readers might know, I love writing about data structures.

One of the most important and fundamental structures is the hashtable. In fact, a lot of modern programming languages have left hashtables behind, for reasons I'll discuss later. But if you want to understand data structures and algorithmic complexity, hashtables are one of the essentials.

A hashtable is a structure for keeping a list of (key, value) pairs, where you can look up a value using the key that's associated with it. This kind of structure is frequently called a map, an associative array, or a dictionary.

For an example, think of a phonebook. You’ve got a collection of pairs (name, phone-number) that make up the phonebook. When you use the phonebook, what you do is look for a person’s name, and then use it to get their phone number.

A hashtable is one specific kind of structure that does this. I like to describe data structures in terms of some sort of schema: what are the basic operations that the structure supports, and what performance characteristics does it have for those operations.

In those schematic terms, a hashtable is very simple. It’s a structure that maintains a mapping from keys to values. A hashtable really only needs two operations: put and get:

  1. put(key, value): add a mapping from key to value to the table. If there’s already a mapping for the key, then replace it.
  2. get(key): get the value associated with the key.

In a hashtable, both of those operations are extremely fast.

Let's think for a moment about the basic idea of a key-value map, and what kind of performance we could get out of a couple of simple, naive ways of implementing it.

We’ve got a list of names and phone numbers. We want to know how long it’ll take to find a particular name. How quickly can we do it?

Naively, that depends on how many keys there are, and on what properties of the keys we can take advantage of.

In the worst case, there's nothing to help us: the only thing we can do is take the key we're looking for, and compare it to every single key. If we have 10 keys, then on average we'll need about 5 steps before we find the key we're looking for. If there are 100 keys, it'll take, on average, about 50 steps. If there are one million keys, it'll take an average of half a million steps before we can find the value.

If the keys are ordered – that is, if we can compare keys not just for equality, but can also ask which comes first using a "less than or equal to" operator – then we can use binary search. With binary search, we can find an entry in a list of 10 elements in about 4 steps, an entry in a list of 1000 keys in about 10 steps, and one in a list of one million keys in about 20 steps.
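
Here's a small sketch of both of those approaches in Python, using the standard bisect module for the binary search (the list of names is just a made-up example):

import bisect

def linear_search(keys, target):
  # Compare the target against every key until we find it:
  # on average, about half of the keys.
  for i, key in enumerate(keys):
    if key == target:
      return i
  return -1

def binary_search(sorted_keys, target):
  # Requires ordered keys; takes about log2(n) steps.
  i = bisect.bisect_left(sorted_keys, target)
  if i < len(sorted_keys) and sorted_keys[i] == target:
    return i
  return -1

names = sorted(['alice', 'bob', 'carol', 'dave', 'eve'])
print(linear_search(names, 'dave'), binary_search(names, 'dave'))   # 3 3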

With a hashtable, if things work right, in a table of 10 keys, it takes one step to find the key. 100 keys? 1 step. 1000 keys? 1 step. 1,000,000,000 keys? Still one step. That's the point of a hashtable. It might be really hard to do something like generate a list of all of the keys – but if all you want to do is look things up, it's lightning fast.

How can it do that? It’s a fairly simple trick: the hashtable trades space for time. A hashtable, under normal circumstances, uses a lot more space than most other ways of building a dictionary. It makes itself fast by using extra space in a clever way.

A hashtable creates a bunch of containers for (key, value) pairs called buckets. It creates many more buckets than the number of (key, value) pairs that it expects to store. When you want to insert a value into the table, it uses a special kind of function called a hash function on the key to decide which bucket to put the (key, value) pair into. When you want to look for the value associated with a key, it again uses the hash function on the key to find out which bucket to look in.

It’s easiest to understand by looking at some actual code. Here’s a simple, not at all realistic implementation of a hashtable in Python:

  class Hashtable(object):
    def __init__(self, hashfun, size):
      self._size = size
      self._hashfun = hashfun
      # One bucket per slot; each bucket is a list of (key, value) pairs.
      self._table = [[] for i in range(size)]

    def hash(self, key):
      # Map the key's hash code onto one of the table's buckets.
      return self._hashfun(key) % self._size

    def put(self, key, value):
      # If the key is already in its bucket, replace its value;
      # otherwise add a new (key, value) pair to the bucket.
      bucket = self._table[self.hash(key)]
      for i, (k, v) in enumerate(bucket):
        if k == key:
          bucket[i] = (key, value)
          return
      bucket.append((key, value))

    def get(self, key):
      for (k, v) in self._table[self.hash(key)]:
        if k == key:
          return v
      return None

If you've got a good hash function, and your hashtable is big enough, then each bucket will end up with no more than one value in it. So if you need to insert a value, you find an (empty) bucket using the key's hashcode, and dump the value in: one step. If you need to find a value given its key, you find the bucket using the key's hashcode, and return the value you find there.

There are two big problems with hashtables.

First, everything is dependent on the quality of your hash function. If your hash function maps a lot of values to the same bucket, then your performance is going to suck. In fact, in the worst case, it's no better than just searching a randomly ordered list. Most of the time, you can come up with a hash function that does a pretty good job – but it's a surprisingly tricky thing to get right.

Second, the table really needs to be big relative to the number of elements that you expect to store in it. If you set up a hashtable with 40 buckets, and you end up with 80 values stored in it, your performance isn't going to be very good. (In fact, it'll be slightly worse than just using a binary search tree.)

So what makes a good hash function? There are a bunch of things to consider:

  1. The hash function must be deterministic: calling the hash on the same key value must always produce the same result. If you're writing a Python program like the one I used as an example above, and you use the values of the key object's fields to compute the hash, then changing the key object's fields will change the hashcode!
  2. The hash function needs to focus on the parts of the key that distinguish between different keys, not on their similarities. To give a simple example, in some versions of Java, the default hash function for objects is based on the address of the object in memory. All objects are stored in locations whose address is divisible by 4 – so the last two bits are always zero. If you did something simple like just taking the address modulo the table size (with, say, a power-of-two table size), then all of the buckets whose numbers weren't divisible by four would always be empty. That would be bad.
  3. The hash function needs to be uniform. That means that it needs to map roughly the same number of input values to each possible output value. To give you a sense of how important this is: I ran a test using 3125 randomly generated strings, using one really stupid hash function (xoring together the characters – sketched below), and one really good one (djb2). I set up a small table, with 31 buckets, and inserted all of the values into it. With the xor hash function, there were several empty buckets, and the worst bucket had 625 values in it. With djb2, there were no empty buckets; the smallest bucket had 98 values, and the biggest one had 106.
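
For the curious, the "really stupid" xor hash from that test looks something like this (a sketch based on the description above, not the exact code I ran):

def xor_hash(key):
  # XOR all of the characters together. Fast, but terrible: for ASCII strings
  # the result never exceeds 127, and any two strings containing the same
  # characters in a different order collide.
  hash = 0
  for c in key:
    hash = hash ^ ord(c)
  return hash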

So what does a good hash function look like? Djb2, which I used in my test above, is based on integer arithmetic using the string values. It's an interesting case, because no one is really entirely sure of exactly why it works better than similar functions, but be that as it may, we know that in practice, it works really well. It was invented by a guy named Dan Bernstein, who used to be a genius poster in comp.lang.c, back when that was a big deal. Here's djb2 in Python:

def djb2(key):
  hash = 5381
  for c in key:
    hash = (hash * 33) + ord(c)
  return hash

What the heck is it doing? Why 5381? Because it’s prime, and it works pretty well. Why 33? No clue.
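
Putting the two pieces together, here's how the toy Hashtable class from above might be used with djb2 as its hash function (the names and numbers are made-up examples):

phonebook = Hashtable(djb2, 31)
phonebook.put('alice', '555-1234')
phonebook.put('bob', '555-9876')
print(phonebook.get('alice'))   # prints 555-1234
print(phonebook.get('carol'))   # prints None - no such key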

Towards the beginning of this post, I alluded to the fact that hashtables have, at least to some degree, fallen out of vogue. (For example, in the C++ standard library, the std::map type is typically implemented using a red-black tree.) Why?

In practice, it’s rarely any faster to really use a hashtable than to use a balanced binary tree like a red-black tree. Balanced trees have better worst-case bounds, and they’re not as sensitive to the properties of the hash function. And they make it really easy to iterate over all of the keys in a collection in a predictable order, which makes them great for debugging purposes.

Of course, hash tables still get used, constantly. The most commonly used data structures in Java code include, without a doubt, HashMap and HashSet, which are both built on hashtables. You rarely have to implement them yourself, and system libraries usually provide a good default hash function for strings, so you're generally safe.

There’s also a lot of really fascinating research into designing ideal hash functions for various applications. If you know what your data will look like in advance, you can even build something called a perfect hash function, which guarantees no collisions. But that’s a subject for another time.