Category Archives: Good Math

Understanding Global Warming Scale Issues

Aside from the endless stream of Cantor cranks, the next biggest category of emails I get is from climate “skeptics”. They all ask pretty much the same question. For example, here’s one I received today:

My personal analysis, and natural sceptisism tells me, that there are something fundamentally wrong with the entire warming theory when it comes to the CO2.

If a gas in the atmosphere increase from 0.03 to 0.04… that just cant be a significant parameter, can it?

I generally ignore it, because… let’s face it, the majority of people who ask this question aren’t looking for a real answer. But this one was much more polite and reasonable than most, so I decided to answer it. And once I went to the trouble of writing a response, I figured that I might as well turn it into a post as well.

The current figures – you can find them in a variety of places from wikipedia to the US NOAA – are that the atmosphere CO2 has changed from around 280 parts per million in 1850 to 400 parts per million today.

Why can’t that be a significant parameter?

There’s a couple of things to understand to grasp global warming: how much energy carbon dioxide can trap in the atmosphere, and hom much carbon dioxide there actually is in the atmosphere. Put those two facts together, and you realize that we’re talking about a massive quantity of carbon dioxide trapping a massive amount of energy.

The problem is scale. Humans notoriously have a really hard time wrapping our heads around scale. When numbers get big enough, we aren’t able to really grasp them intuitively and understand what they mean. The difference between two numbers like 300 and 400ppm is tiny, we can’t really grasp how in could be significant, because we aren’t good at taking that small difference, and realizing just how ridiculously large it actually is.

If you actually look at the math behind the greenhouse effect, you find that some gasses are very effective at trapping heat. The earth is only habitable because of the carbon dioxide in the atmosphere – without it, earth would be too cold for life. Small amounts of it provide enough heat-trapping effect to move us from a frozen rock to the world we have. Increasing the quantity of it increases the amount of heat it can trap.

Let’s think about what the difference between 280 and 400 parts per million actually means at the scale of earth’s atmosphere. You hear a number like 400ppm – that’s 4 one-hundreds of one percent – that seems like nothing, right? How could that have such a massive effect?!

But like so many other mathematical things, you need to put that number into the appropriate scale. The earths atmosphere masses roughly 5 times 10^21 grams. 400ppm of that scales to 2 times 10^18 grams of carbon dioxide. That’s 2 billion trillion kilograms of CO2. Compared to 100 years ago, that’s about 800 million trillion kilograms of carbon dioxide added to the atmosphere over the last hundred years. That’s a really, really massive quantity of carbon dioxide! scaled to the number of particles, that’s something around 10^40th (plus or minus a couple of powers of ten – at this scale, who cares?) additional molecules of carbon dioxide in the atmosphere. It’s a very small percentage, but it’s a huge quantity.

When you talk about trapping heat, you also have to remember that there’s scaling issues there, too. We’re not talking about adding 100 degrees to the earths temperature. It’s a massive increase in the quantity of energy in the atmosphere, but because the atmosphere is so large, it doesn’t look like much: just a couple of degrees. That can be very deceptive – 5 degrees celsius isn’t a huge temperature difference. But if you think of the quantity of extra energy that’s being absorbed by the atmosphere to produce that difference, it’s pretty damned huge. It doesn’t necessarily look like all that much when you see it stated at 2 degrees celsius – but if you think of it terms of the quantity of additional energy being trapped by the atmosphere, it’s very significant.

Calculating just how much energy a molecule of CO2 can absorb is a lot trickier than calculating the mass-change of the quantity of CO2 in the atmosphere. It’s a complicated phenomenon which involves a lot of different factors – how much infrared is absorbed by an atom, how quickly that energy gets distributed into the other molecules that it interacts with… I’m not going to go into detail on that. There’s a ton of places, like here, where you can look up a detailed explanation. But when you consider the scale issues, it should be clear that there’s a pretty damned massive increase in the capacity to absorb energy in a small percentage-wise increase in the quantity of CO2.

Time in Distributed Systems: Lamport Timestamps

What I do in my day job is working on infrastructure systems at dropbox. That means that I’m neck-deep in distributed systems programming – that’s my bread and butter. One of the problems that comes up over and over again in distributed systems is time.

In distributed systems, time is a problem. Each computer has a clock built in, but those clocks are independent. The clocks on different machines can vary quite a bit. If a human being is setting them, then they’re probably at best accurate to one second. Even using a protocol like NTP, which synchronizes clocks between different computers, you can only get the clocks accurate to within about a millisecond of each other.

That sounds pretty good. In human timescales, a millisecond is a nearly imperceptible time interval. But to a modern computer, which can execute billions of instructions per second, it’s a long time: long enough to execute a million instructions! To get a sense of how much time that is to a computer, I just measured the time it took to take all of the source code for my main project, compile it from scratch, and execute all of its tests: it took 26 milliseconds.

That’s a lot of work. On the scale of a machine running billions of instructions per second, a millisecond is a long time.

Why does that matter?

For a lot of things that we want to do with a collection of computers, we need to know what event happened first. This comes up in lots of different contexts. The simplest one to explain is a shared resource locked by a mutex.

A mutex is a mutual exclusion lock. It’s basically a control that only allows one process to access some shared resource at a time. For example, you could think of a database that a bunch of processes all talk to. To make an update, a process P needs to send a request asking for access. If no one is using it when the server receives the request, it will give a lock to P, and and then block anyone else from accessing it until P is done. Anyone else who asks for access to the the database will have to wait. When P is done, it releases the lock on the mutex, and then if there’s any processes waiting, the database will choose one, and give it the lock.

Here’s where time comes into things. How do you decide who to give the lock to? You could give it to whoever you received the request from first, using the time on the database host. But that doesn’t always work well. It could easily end up with hosts with a lower-bandwidth connection to the server getting far worse service than a a closer host.

You get better fairness by using “send time” – that is, the time that the request was sent to the server by the client. But that’s where the clock issue comes up. Different machines don’t agree perfectly on the current time. If you use their clocktime to determine gets the lock first, then a machine with a slow clock will always get access before one with a fast clock. What you need is some fair way of determining ordering by some kind of timestamp that’s fair.

There are a couple of algorithms for creating some notion of a clock or timestamp that’s fair and consistent. The simplest one, which we’ll look at in this post, is called Lamport timestamps. It’s impressively simple, but it works really well. I’ve seen it used in real-world implementations of Paxos at places like Google. So it’s simple, but it’s serious.

The idea of Lamport timestamps is to come up with a mechanism that defines a partial order over events in a distributed system. What it defines is a causal ordering: that is, for any two events, A and B, if there’s any way that A could have influenced B, then the timestamp of A will be less than the timestamp of B. It’s also possible to have two events where we can’t say which came first; when that happens, it means that they couldn’t possible have affected each other. If A and B can’t have any affect on each other, then it doesn’t matter which one “comes first”.

The way that you make this work is remarkably simple and elegant. It’s based on the simplest model of distributed system, where a distributed system is a collection of processes. The processes only communicate by explicitly sending messages to each other.

  1. Every individual process p in the distributed system maintains an integer timestamp counter, \tau_p.
  2. Every time a process p performs an action, it increments \tau_p. Actions that trigger increments of \tau_p include message sends.
  3. Every time a process p sends a message to another process, it includes the current value of \tau_p in the message.
  4. When a process p receives a message from a process q, that message includes the value of \tau_q when the message was sent. So it updates its \tau_q to the \max(\tau_p, \tau_q)+1 (one more than the maximum of its current timestamp and the incoming message timestamp).

For any two events A and B in the system, if A \rightarrow B (that is, if A causally occurred before B – meaning that A could have done something that affected B), then we know that the timestamp of A will be smaller than the timestamp of B.

The order of that statement is important. It’s possible for timestamp(A) to be smaller than timestamp(B), but for B to have occurred before A by some wallclock. Lamport timestamps provide a causal ordering: A cannot have influenced or caused B unless A \rightarrow B; but A and B can be independent.

Let’s run through an example of how that happens. I’ll write it out by describing the clock-time sequence of events, and following it by a list of the timestamp counter settings for each host. We start with all timestamps at 0: [A(0), B(0), C(0), D(0).

  • [Event 1] A sends to C; sending trigger a timestamp increment. [A(1), B(0), C(0), D(0)].
  • [Event 2] C receives a message from A, and sets its counter to 2. [A(1), B(0), C(2), D(0).
  • [Event 3] C sends a message to A (C increments to 3, and sends.) [A(1), B(0), C(3), D(0).
  • [Event 4] A recieves the message from C, and sets its clock to 4. [A(4), B(0), C(3), D(0)]
  • [Event 5] B sends a message to D. [A(4), B(1), C(3), D(0)]
  • [Event 6] D receives the message. [A(4), B(1), C(3), D(2)].
  • [Event 7] D sends a message to C. [A(4), B(1), C(3), D(3)].
  • [Event 8] C receives the message, and sets its clock to 4.

According to the Lamport timestamps, in event 5, B sent its message to D at time 1. But by wallclock time, it sent its message after C’s timestamp was already 3, and A’s timestamp was already 4. We know that in our scenario, event 5 happened before event 3 by wallclock time. But in a causal ordering, it didn’t. In causal order, event 8 happened after event 4, and event 7 happened before event 8. In causal comparison, we can’t say whether 7 happened before or after 3 – but it doesn’t matter, because which order they happened in can’t affect anything.

The Lamport timestamp is a partial ordering. It tells us something about the order that things happened in, but far from everything. In effect, if the timestamp of event A is less than the timestamp of event B, it means that either A happened before B or that there’s no causal relation between A and B.

The Lamport timestamp comparisons only become meaningful when there’s an actual causal link between events. In our example, at the time that event 5 occurs, there’s no causal connection at all between the events on host A, and the events on host B. You can choose any arbitrary ordering between causally unrelated events, and as long as you use it consistently, everything will work correctly. But when event 6 happens, now there’s a causal connection. Event 5 could have changed some state on host D, and that could have changed the message that D sent in event 7. Now there’s a causal relationship, timestamp comparisons between messages after 7 has to reflect that. Lamport timestamps are the simplest possible mechanism that captures that essential fact.

When we talk about network time algorithms, we say that what Lamport timestamps do is provide weak clock consistency: If A causally happened before B, then the timestamp of A will be less than the timestamp of B.

For the mutex problem, we’d really prefer to have strong clock consistency, which says that the timestamp of A is smaller than the timestamp of B if and only if A causally occurred before B. But Lamport timestamps don’t give us enough information to do that. (Which is why there’s a more complex mechanism called vector clocks, which I’ll talk about in another post.

Getting back to the issues that this kind of timestamp is meant to solve, we’ve got a partial order of events. But that isn’t quite enough. Sometimes we really need to have a total order – we need to have a single, correct ordering of events by time, with no ties. That total order doesn’t need to be real – by which I mean that it doesn’t need to be the actual ordering in which events occured according to a wallclock. But it needs to be consistent, and no matter which host you ask, they need to always agree on which order things happened in. Pure lamport timestamps don’t do that: they’ll frequently have causally unrelated events with identical timestamps.

The solution to that is to be arbitrary but consistent. Take some extra piece of information that uniquely identifies each host in the distributed system, and use comparisons of those IDs to break ties.

For example, in real systems, every host has a network interface controller (NIC) which has a universally unique identifier called a MAC address. The MAC address is a 48 bit number. No two NICs in the history of the universe will ever have the same MAC address. (There are 281 trillion possible MAC codes, so we really don’t worry about running out.) You could also use hostnames, IP addresses, or just random arbitrarily assigned identifiers. It doesn’t really matter – as long as it’s consistent.

This doesn’t solve all of the problems of clocks in distributed systems. For example, it doesn’t guarantee fairness in Mutex assignment – which is the problem that I used as an example at the beginning of this post. But it’s a necessary first step: algorithms that do guarantee fairness rely on some kind of consistent event ordering.

It’s also just a beautiful example of what good distributed solutions look like. It’s simple: easy to understand, easy to implement correctly. It’s the simplest solution to the problem that works: there is, provably, no simpler mechanism that provides weak clock consistency.

WTF is up with bitcoin?

Since I wrote about Bitcoin a couple of years ago, I’ve had a few people ask me what’s going on with Bitcoin this week. There’s been some pretty hysterical-sounding pieces in the press about bitcoin’s nightmare scenario, and folks want to know what’s going on, and whether it’s real, or just a hyped up press thing.

It looks real to me. The technical problem is definitely solvable, but there’s a bunch of social/political stuff piled on top that’s making it hard for the technical solution to actually get implemented.

To understand what’s going on, we need a quick refresher on how bitcoin works.

The basic idea of bitcoin is pretty simple. There’s a thing called a ledger, which consists of a list of transactions. Each transaction is just a triple, (X, Y, Z), which means “X gave Y bitcoins to Z”. When you use a bitcoin to buy something, what’s really happening is that you’re adding a new entry to the ledger.

To make it all work, there’s a bunch of distributed computing going on to maintain the ledger. Every 10 minutes or so, a batch of transactions is added to the ledger by performing a very expensive computation. The set of transactions is called a block. The entire ledger is just a list of blocks – called the blockchain. In the current bitcoin protocol, a ledger block can only hold 1 MB of information.

That block size of 1MB is the problem. There are enough bitcoin transactions going on right now that at peak times, the amount of data needed to represent all of the transactions in a ten minute period is larger than 1MB.

That means that transactions start to back up. Imagine that there’s 1.5M of transactions occuring every 10 minutes. In the first period, you get 1M of them wrapped in a block, and the remaining 0.5MB gets delayed to the next period. The next period, you process the remaining half meg from the previous period, plus just 1/2MB from the current – leaving 1M to roll over to the next. That next period, you’re going to spend the entire block on transactions left from the previous time period – and the full 1.5MB gets deferred to later. Things have backed up to the point where on average, a new transaction doesn’t get added to a block for 45 minutes. There are confirmed reports of transactions taking 7 or 8 hours before they get added to the blockchain.

This is a problem on many levels. If you’re a store trying to sell things, and people want to pay with Bitcoin, this is a massive problem. Up until a transactions is confirmed by being part of a block accepted into the blockchain, the transaction can be rescinded. So you can’t give your customers their merchandise until you’re sure the transaction is in the blockchain. That was awkward when you had to wait 10 minutes. That’s completely unacceptable when you have no idea how long it might take.

Looking at this, you might think that the solution is just to say that you should create blocks more frequently. If there’s 1.5M of transactions every 10 minutes, why not just create a block every five minutes? The answer is: because it takes an average of around 10 minutes to perform the computation needed to add one block to the chain. So you can’t reduce the amount of time per block.

Alternatively, you could just increase the size of the block. In theory, that’s a great answer. Jump a block to 2M, and you’ve got enough space to handle the current volume. Jump it to 10M, and you’ve got enough buffer space to cover a couple of years.

But that’s where the social/political thing comes in. The work of performing the computation needed to add blocks to the chain (called mining) has become concentrated in the hands of a small group of people. And they don’t want to change the mining software that they’re running.

I don’t follow bitcoin closely, so I don’t know the details of the fights over the software. But as an outsider, it looks like a pretty typical thing: people prefer to stick with known profits today even if it kills the business tomorrow, rather than take a risk of losing todays profits. Changing the protocol might undermine their dominant mining position – so they’d rather see Bitcoin fall apart than risk losing todays profits.

To quickly address one stupid “answer” to this problem: I’ve seen lots of people say that you can make your transaction get added to the chain faster. There’s an option in the protocol to allow a transaction to say “I’ll pay X bitcoins to whoever adds this transaction to the chain”. Miners will grab those transactions and process them first, so all you need to do is be willing to pay.

That’s a partial solution, but it’s probably not a long term answer.

Think of it this way. There’s a range of different transactions performed with bitcoin. You can put them into buckets based on how time critical they are. At one end, you’ve got people walking into a store and buying something. The store needs to have that transaction processed while the customer waits – so it needs to be fast. You’ve got other transactions – like, say, paying your mortgage. If it takes 12 hours to go through, big deal! For simplicity, let’s just consider those two cases: there’s time critical transactions (fast), and non-time-critical ones (slow).

For slow transactions, you don’t need to increase the transaction fees. Just let the transaction get added to the blockchain whenever there’s room. For the fast ones, you need to pay.

The problem is, 1MB really isn’t that much space. Even if just 1/3 of the transactions are fast, you’re going to wind up with times when you can’t do all of the pending fast transactions in one block. So fast transactions need to increase their fees. But that can only go on for so long before the cost of using bitcoin starts to become a significant issue in the cost of doing business.

The ultimate problem is that bitcoin is being to successful as a medium of exchange for the current protocol. The blockchain can’t keep up with transactions. What adding transaction fees does is increase the cost of using bitcoin for fast transactions until it reaches the point where enough fast-transactors drop out of using bitcoin that all of the remaining fast-transactors no longer exceed the blocksize. In other words, transaction fees as a “solution” to the block-size problem only work by driving businesses away from accepting bitcoin. Which isn’t exactly in the best interest of people who want to use bitcoins. This is why I think that online wallets like paypal to western union are going to continue to be the mainstay.

Realistically, if you want to use bitcoin as a currency, you can’t solve its capacity problems without increasing its capacity. If there are more than 1MB of transactions happening every 10 minutes, then you need to do something to increase the number of transactions that can be part of a block. If not, then you can’t support the number of transactions that people want to make. If that’s the case, then you can’t rely on being able to use bitcoin to make a purchase – and that means that you don’t have a usable currency.

Significant Figures and the Age of the Universe

(Note: This post originally contained a remarkably stupid error in an example. For some idiotic reason, I calculated as if a liter was a cubic meter. Which, duh, it isn’t. so I was off by a factor of 1000. Pathetic, I know. Thanks to the multiple readers who pointed it out!)

The other day, I got a question via email that involves significant figures. Sigfigs are really important in things that apply math to real-world measurements. But they’re poorly understood at best by most people. I’ve written about them before, but not in a while, and this question does have a somewhat different spin on it.

Here’s the email that I got:

Do you have strong credentials in math and/or science? I am looking for someone to give an expert opinion on what seems like a simple question that requires only a short answer.

Could the matter of significant figures be relevant to an estimate changing from 20 to less than 15? What if it were 20 billion and 13.7 billion?

If the context matters, in the 80s the age of the universe was given as probably 20 billion years, maybe more. After a number of changes it is now considered to be 13.7 billion years. I believe the change was due to distinct new discoveries, but I’ve been told it was simply a matter of increasing accuracy and I need to learn about significant figures. From what I know (or think I know?) of significant figures, they don’t really come into play in this case.

The subject of significant digits is near and dear to my heart. My father was a physicist who worked as an electrical engineer producing power circuitry for military and satellite applications. I’ve talked about him before: most of the math and science that I learned before college, I learned from him. One of his pet peeves was people screwing around with numbers in ways that made no sense. One of the most common ones of that involves significant digits. He used to get really angry at people who did things with calculators, and just read off all of the digits.

He used to get really upset when people did things like, say, measure a plate with a 6 inch diameter, and say that it had an are] of 28.27433375 square inches. That’s ridiculous! If you measured a plate’s diameter to within 1/16th of an inch, you can’t use that measurement to compute its area down to less than one billionth of a square inch!

Before we really look at how to answer the question that set this off, let’s start with a quick review of what significant figures are and why they matter.

When we’re doing science, a lot of what we’re doing involves working with measurements. Whether it’s cosmologists trying to measure the age of the universe, chemists trying to measure the energy produced by a reaction, or engineers trying to measure the strength of a metal rod, science involves measurements.

Measurements are limited by the accuracy of the way we take the measurement. In the real world, there’s no such thing as a perfect measurement: all measurements are approximations. Whatever method we chose for taking a measurement of something, the measurement is accurate only to within some margin.

If I measure a plate with a ruler, I’m limited by factors like how well I can align the ruler with the edge of the plate, by what units are marked on the ruler, and by how precisely the units are marked on the ruler.

Once I’ve taken a measurement and I want to use it for a calculation, the accuracy of anything I calculate is limited by the accuracy of the measurements: the accuracy of our measurements necessarily limits the accuracy of anything we can compute from those measurements.

For a trivial example: if I want to know the total mass of the water in a tank, I can start by saying that the mass of a liter of water is one kilogram. To figure out the mass of the total volume of water in the tank, I need to know its volume. Assuming that the tank edges are all perfect right angles, and that it’s uniform depth, I can measure the depth of the water, and the length and breadth of the tank, and use those to compute the volume.

Let’s say that the tank is 512 centimeters long, and 203 centimeters wide. I measure the depth – but that’s difficult, because the water moves. I come up with it being roughly 1 meter deep – so 100 centimeters.

The volume of the tank can be computed from those figures: 5.12 times 2.03 times 1.00, or 10,393.6 liters.

Can I really conclude that the volume of the tank is 10,393.6 liters? No. Because my measurement of the depth wasn’t accurate enough. It could easily have been anything from, say, 95 centimeters to 105 centimeters, so the actual volume could range between around 9900 liters and 11000 liters. From the accuracy of my measurements, claiming that I know the volume down to a milliliter is ridiculous, when my measurement of the depth was only accurate within a range of +/- 5 centimeters!

Ideally, I might want to know a strong estimate on the bounds of the accuracy of a computation based on measurements. I can compute that if I know the measurement error bounds on each error measurement, and I can track them through the computation and come up with a good estimate of the bounds – that’s basically what I did up above, to conclude that the volume of the tank was between 9,900 and 11,000 liters. The problem with that is that we often don’t really know the precise error bounds – so even our estimate of error is an imprecise figure! And even if we did know precise error bounds, the computation becomes much more difficult when you want to track error bounds through it. (And that’s not even considering the fact that our error bounds are only another measured estimate with its own error bounds!)

Significant figures are a simple statistical tool that we can use to determine a reasonable way of estimating how much accuracy we have in our measurements, and how much accuracy we can have at the end of a computation. It’s not perfect, but most of the time, it’s good enough, and it’s really easy.

The basic concept of significant figures is simple. You count how many digits of accuracy each measurement has. The result of the computation over the measurements is accurate to the smallest number of digits of any of the measurements used in the computation.

In the water tank example, we had three significant figures of accuracy on the length and width of the tank. But we only had one significant figure on the accuracy of the depth. So we can only have one significant figure in the accuracy of the volume. So we conclude that we can say it was around 10 liters, and we can’t really say anything more precise than that. The exact value likely falls somewhere within a bell curve centered around 10 liters.

Returning to the original question: can significant figures change an estimate of the age of the universe from 20 to 13.7?

Intuitively, it might seem like it shouldn’t: sigfigs are really an extension of the idea of rounding, and 13.7 rounded to one sigfig should round down to 10, not up to 20.

I can’t say anything about the specifics of the computations that produced the estimates of 20 and 13.7 billion years. I don’t know the specific measurements or computations that were involved in that estimate.

What I can do is just work through a simple exercise in computations with significant figures to see whether it’s possible that changing the number of significant digits in a measurement could produce a change from 20 to 13.7.

So, we’re looking at two different computations that are estimating the same quantity. The first, 20, has just one significant figure. The second, 13.7 has three significant digits. What that means is that for the original computation, one of the quantities was known only to one significant figure. We can’t say whether all of the elements of the computation were limited to one sigfig, but we know at least one of them was.

So if the change from 20 to 13.7 was caused by significant digits, it means that by increasing the precision of just one element of the computation, we could produce a large change in the computed value. Let’s make it simpler, and see if we can see what’s going on by just adding one significant digit to one measurement.

Again, to keep things simple, let’s imagine that we’re doing a really simple calculation. We’ll use just two measurements x and y, and the value that we want to compute is just their product, x \times y.

Initially, we’ll say that we measured the value of x to be 8.2 – that’s a measurement with two significant figures. We measure y to be 2 – just one significant figure. The product x\times y = 8.2 \times 2 = 16.4. Then we need to reduce that product to just one significant figure, which gives us 20.

After a few years pass, and our ability to measure y gets much better: now we can measure it to two significant figures, with a new value of 1.7. Our new measurement is completely compatible with the old one – 1.7 reduced to 1 significant figure is 2.

Now we’ve got equal precision on both of the measurements – they’re now both 2 significant figures. So we can compute a new, better estimate by multiplying them together, and reducing the solution to 2 significant figures.

We multiply 8.2 by 1.7, giving us around 13.94. Reduced to 2 significant figures, that’s 14.

Adding one significant digit to just one of our measurements changed our estimate of the figure from 20 to 14.

Returning to the intuition: It seems like 14 vs 20 is a very big difference: it’s a 30 percent change from 20 to 14! Our intuition is that it’s too big a difference to be explained just by a tiny one-digit change in the precision of our measurements!

There’s two phenomena going on here that make it look so strange.

The first is that significant figures are an absolute error measurement. If I’m measuring something in inches, the difference between 15 and 20 inches is the same size error as the difference between 90 and 95 inches. If a measurement error changed a value from 90 to 84, we wouldn’t give it a second thought; but because it reduced 20 to 14, that seems worse, even though the absolute magnitude of the difference considered in the units that we’re measuring is exactly the same.

The second (and far more important one) is that a measurement of just one significant digit is a very imprecise measurement, and so any estimate that you produce from it is a very imprecise estimate. It seems like a big difference, and it is – but that’s to be expected when you try to compute a value from a very rough measurement. Off by one digit in the least significant position is usually not a big deal. But if there’s only one significant digit, then you’ve got very little precision: it’s saying that you can barely measure it. So of course adding precision is going to have a significant impact: you’re adding a lot of extra information in your increase in precision!

Technical Interviews: More than a brain teaser?

A request I’ve gotten from a couple of readers (unrelated to the recent charity thing) is to talk about engineering interviews at tech companies. I’ve been at Google, Foursquare, Twitter, and now Dropbox, so I’ve spent some time inside multiple companies that put a lot of effort into recruiting.

Tech interviews get a lot of bad press. Only some of it is deserved.

What you frequently hear in criticism is stuff about “gotcha” questions, or “brain teasers”. Those do happen, and when they do, they deserve condemnation. For example, I have seriously had an interviewer for a software job ask me “Why are manholes round?” That’s stupid. People who like it claim that it’s a test of lateral thinking; I think it’s garbage. But these days, that kind of rubbish seems pretty rare.

Instead, what’s become very common is for interviewers to present a programming problem, and ask the candidate to solve it by writing some code. A lot of people really hate these kinds of interviews, and describe them as just brain-teasers. I disagree, and I’m going to try to explain why.

The underlying problem for tech job interviews is that hiring the right people is really, really hard.

When someone applies for a job as a software engineer with your company, you start off with just their resume. Resumes are not particularly informative. All they give you is a brief, possibly partial history of the candidates work experience. From a resume, can’t tell how much they really contributed to the projects they worked on. You can’t tell how much their work added (or subtracted) from the team they were part of. You can’t tell if they get work done in a reasonable amount of time, or if they’re slower than a snail. You can’t even tell if they can write a simple program at all.

So you start by screening resumes, doing your best to infer as much as you can from them. Next, you often get recommendations. Recommendations can be useful, but let’s be honest: All recommendation letters are positive. You’re not going to ask for a recommendation from someone who isn’t going to say great things about you. So no matter how terrible a candidate is, the recommendations are going to say nice things about them. At best, you can sometimes infer a problem by reading between the lines – but that’s a very subjective process.

So you end up interviewing someone who’s resume looks good on paper, and who got a couple of people to write letters for them. How do you determine whether or not they’re going to be a valuable addition to your team?

You need to do something to decide whether or not to hire a particular person. What can you do?

That’s what the interview is for. It’s a way to try to get more information. Sure, this person has a college degree. Sure, they’ve got N years of experience. But can they program? Can they communicate well with their coworkers? Do they actually know what they’re doing?

A tech interview is generally an attempt to get information about a candidate by watching them work on a problem. The interview isn’t about knowing the right answer. It’s not even about getting the correct solution to the problem. It’s about watching a candidate work.

When I ask a job candidate a technical question, there’s three main things I’m looking for.

  1. What’s their process for solving the problem? On this level, I’m trying to figure out: Do they think about it, or do they jump in and start programming? Do they make sure they understand the problem? Do they clearly state their assumptions?
  2. Can they write a simple program? Here I’m trying to see if they’ve got any ability to write
    code. No one writes great code in an interview setting. But I want to know if they’re
    able to sit down with an unfamiliar problem, and work out a solution in code. I want to see if they start coding immediately, or take time to think through their solution before they start writing.
  3. How well can they communicate ideas about programming? Can they grasp the problem from my description? If not, can they figure out what questions they need to ask to understand it? Once they start solving the problem, how well can they explain what they’re doing? Can they describe the algorithm that they’ve chosen? Can they explain why it works?

To try to clarify this, I’m going to walk through a problem that I used to use in interviews. I haven’t used this question in about 3 years, and as far as I know, no one is using the question anymore. The problem involves something called Gray code. Gray code is an alternative representation of numbers in binary form that’s useful for a range of applications involving things like switching systems.

Here’s a quick run through one of the reasons to use gray code. Imagine a system that uses physical switches. You’ve got an array of 8 switches representing a number. It’s currently presenting the number 7 in standard binary – so the first 5 switches are off, and last 3 are on. You want to increment the number. To do that, you need to change the position of four switches at exactly the same time. The odds of your being able to do that without even a transient state that appeared to be a number other than 7 or 8 are vanishingly small.

Gray code solves that by changing the representation. In Gray code, the representation of every number N+1 is only different from the representation of N by exacly one bit. That’s a nice property which makes it useful, even nowadays when we’re not using physical switches for much of anything anymore.

The easiest way that you get the gray code of numbers is by writing a table. You start off by writing 0 and 1, which are the same in both gray code and standard binary:

Decimal Standard Binary Gray
0 0 0
1 1 1

There’s the one-bit gray codes. To get the two bit, make two copies of the rows in that table.
To the first copy, prepend a 0. To the second copy, reverse the order of the rows, prepend a 1:

Decimal Standard Binary Gray
0 00 00
1 01 01
2 10 11
3 11 10

To get to the three bit gray codes, you repeat the process. Copy the rows, prepend 0s to
the first copy; reverse the order of the second, and prepend 1s.

Decimal Standard Binary Gray
0 000 000
1 001 001
2 010 011
3 011 010
4 100 110
5 101 111
6 110 101
7 111 100

So, the gray code of 6 is 101, and the gray code of 7 is 100.

What I would ask an interview candidate to do is: implement a recursive function that given an integer N, returns a string with the gray code of N.

I can understand how some people look at this question, and say, “Yeah, that’s just a stupid puzzle.” On one level, yeah. It’s obvious an artifical question. In fact, in practice, no one ever uses a recursive algorithm for something like this. Even if you have a problem where gray code is part of a practical solution, there’s a better way of converting numbers to gray code than this silly recursive nonsense.

So I agree that it’s artificial. But interview questions have to be artificial. In a typical interview, you’ve got an hour with a candidate. You’re not going to be able to explain a real problem to them in that amount of time, much less have them solve it!

But it’s artificial in a useful way that allowed me, as an interviewer, to learn about the candidate. I wasn’t trying to see if the candidate was number-puzzle wizard who could instantly see the recursion pattern in a problem like this. Most people have never heard of gray code, and to most people (including me, the first time I saw this problem!), the recursion pattern isn’t obvious. But that’s not the point: there’s a lot more to the interview that just the initial problem statement.

I don’t present the problem, and then sit back and watch silently as they try to solve it. If I did, all I’d be learning is whether or not they’re a number-puzzle wizard. I don’t care about that. So I didn’t just leave them floundering trying to somehow come up with a solution. In the beginning, after describing the problem, I set an initial direction. I usually have them start by extending the table themselves, to make sure they understand the process. Then I take their extended table, and add a new column:

Decimal Standard Binary Gray Rec
0 0000 0000
1 0001 0001
2 0010 0011 “1” + gray(1)
3 0011 0010 “1” + gray(0)
4 0100 0110
5 0101 0111
6 0110 0101
7 0111 0100
8 1000 1100 “1” + gray(7)
9 1001 1101 “1” + gray(6)
10 1010 1111 “1” + gray(5)
11 1011 1110
12 1100 1010
13 1101 1011
14 1110 1001
15 1111 1000 “1” + gray(0)

With that on the board/screen, I’d ask them to try to take what I just gave them, and rewrite it a bit. For example, in row 8, instead of “1” + gray(7), come up with an expression using the numeric value “8” of the row, which will produce 7. They should be able to come up with “15 – 8” – and to see that in every row n , where n \ge 8 and n < 16, the gray code of n is “1” + gray(15 – n).

For most people, that’s enough of a clue to be able to start writing the function. If they can’t get there, it shows me that they’ve got some trouble wrapping their head around this abstraction. I’ve got a few more hints up my sleeve to help, but if without all of the help I can give, they still can’t come up with that, that’s one mark against them.

But even if they can’t come up with the relation at all, it’s not the end of the interview. I have, sometimes, ended up recommending hiring someone who had trouble this far! If they can’t come up with that basic relation, it’s just one piece of information about the candidate. I’d file that away, and move on, by giving them the recurrence relation, and then I would ask them to code it.

There’s one problem that comes up in coding, which is interesting and important. The most naive code for gray is something like:

def gray(n):
  print("gray(%s)" % n)
  if n == 0:
    return "0"
  if n == 1:
    return "1"
  num_digits = math.floor(math.log(n, 2)) + 1
  return "1" + gray(int(2**num_digits - 1 - n))

That’s very close, but not right. If you call gray(10), you get the right answer.
If you call gray(4), you get “110”, which is correct. But if you call gray(8), you’d get “110”, when you should have gotten 1010.

Most candidates make this mistake. And then I ask them to trace it through on 8 as an example. They usually see what the problem is. If they don’t, then that’s another red flag.

If they’re really struggling to put together a recursive function, then I’d ask them to just write a function to convert an integer into standard binary. If they can do that, then I start making suggestions about how to convert that to do gray code.

The big indicator here is whether or not they can write a simple recursive function at all. The systems I was working on at the time made heavy use of recursion – if a candidate couldn’t write recursive code, there was simply no way that they’d be able to do the job. (It was depressing to see just how many qualified-looking candidates came in, but who couldn’t write a simple recursive function.)

Through this whole process, how well they were able to talk about what they were doing was as important as the solution they came up with. If they heard the question, and immediately wrote down perfect, beautiful code, but they couldn’t explain how they got it, or how it worked? They’d get a mediocre rating, which wouldn’t get them a job offer. If they made a lot of mistakes in their code, but they were crystal clear about explaining how they worked out a solution, and how it worked? They’d probably get a better rating than the perfect code candidate.

I hope this shines a bit of light on this kind of interview question. While it’s necessarily artificial, it’s artifical in a way that we hope can help us learn more about the candidate. It’s not a trick question that’s irrelevant to the job, like “Why are manholes round?”: this is an attempt to probe at an area of knowledge that any candidate for a software engineering job should have. It’s not an all or nothing problem: even if you start off with no clue of how to approach it, I’m guiding you through. If you can’t solve it without help, this problem gives me some insight into what it is that you don’t understand, and hopefully whether or not that’s going to be a problem if we hire you.

Is it a great way of doing interviews? Honestly, no. But it’s the best way we know of doing it.

As an interview system, it doesn’t do a good job of identifying the very best people to hire. There’s no correlation between outstanding interview performance and outstanding on-the-job performance. But there’s a strong correlation between poor performance on this kind of interview question and poor performance on the job. Great performers on the job show the same distribution of interview performance as average ones; but poor performers on interviews show a significantly negative-shifted job performance distribution.

We haven’t found a way of interviewing people that does a better job than this. It’s the best we have. Statistically, it works far better at selecting people than “open-ended” interviews that don’t involve any kind of practical programming exercise. So for all of its faults, it’s better that the alternatives.

I’m sure there are pedants out there who are asking “So what’s the correct implementation of gray code?” It’s totally not the point of talking about it, but here’s one sloppy but correct implementation. This isn’t the quality of code I would ever use for anything serious at work, but it’s perfectly adequate for an interview.

import math

def gray(n):
    def required_digits(n):
        """Compute the number of digits required to 
        represent a number in binary
        """
        return int(math.floor(math.log(n, 2))) + 1

    def pad_digits(gray, num_digits):
        if len(gray) < num_digits:
            return "0"*(num_digits - len(gray)) + gray
        return gray

    if n == 0:
        return "0"
    if n == 1:
        return "1"
    num_digits = int(math.floor(math.log(n, 2)) + 1)
    return "1" + pad_digits(gray(int(2**num_digits - 1 - n)), num_digits - 1)

Lychrel Numbers and the 196 Algorithm

The first donor to the Chesapeake Math program asked me to write about the 196 algorithm, a problem also known as the mystery of the Lychrel numbers. To be completely honest, that’s something that people have asked for in the past, but I’ve never written about it, because I didn’t see what I could add. But I made a promise, so… here goes!

Take any number with at least two-digits, N. Reverse the digits of N, giving you M, and then add N+M. That gives you a new number. If you keep repeating this process, most of the time, you’ll get a palindromic number really fast.

For example, let’s start with 16:

  1. 16 reversed is 61.
  2. 16+61=77. 77 is a palindromic number.
  3. So one reverse+add, and we have a palindromic number.

Or 317:

  1. 317 reversed is 713.
  2. 317+713=1030.
  3. 1030 reversed is 301.
  4. 1030 + 301 = 1331, so we have a palindromic number after two steps.

You can play the same game in different number bases. For example, in base 8, we can start with 013 (11 in base-10): in one reverse+add, it becomes 44 (36 in base-10).

For most numbers, you get a palindrome in just a few steps. But for some numbers, in some number bases, you don’t. If you can’t ever get to a palindrome by doing reverse+add starting from a number, then that number is called a Lychrel number.

The process of exploring Lychrel numbers has picked up a lot of devotees, who’ve developed a whole language for talking about it:

chain
A chain is the sequence of numbers produced by reverse+add starting with some number, and possibly converging on a palindromic number.
seed
The seed of a chain is the smallest number that can start that chain. For example, in the example above, we looked at the chain [217, 1030, 1331]. 217 is the seed – even though you could start a chain at 1030, 1030 would never be considered a seed, because it can be the product of a reverse+add on a smaller number.
kin
Two numbers are kin numbers if they’re part of a chain that started at the same seed. If a number is a lychrel number, then (obviously) so are all of its kin.

The question that haunts Lychrel enthusiasts is, will you always eventually get a palindrome? That is, do Lychrel numbers actually exist?

In base-2, we know the answer to that: not all numbers will produce a palindrome; there are base-2 Lychrel numbers. The smallest base-22 Lychrel number is 22 – or 10110 in binary. We can look at its reverse add sequence, and see intuitively why it will never produce a palindrome:

  1. 10110
  2. 100011
  3. 1010100
  4. 1101001
  5. 10110100 (10, 11, 01, 00)
  6. 11100001
  7. 101101000
  8. 110010101
  9. 1011101000 (10, 111, 01, 000)
  10. 110010101
  11. 1101000101
  12. 10111010000
  13. 0b11000101101
  14. 0b101111010000 (10, 1111, 01, 0000)

Starting at step 5, we start seeing a pattern in the sequence, where we have recurring values where that have a pattern of 10, followed by m-1s, followed by 01, followed by m 0s. We’ve got a sequence that’s building larger and larger numbers, in a way that will never converge into a palindrome.

We can find similar sequences in any power-of-two base – base 4, 8, 16, etc. So in power-of-two bases, there are Lychrel numbers. But: are there Lychrel numbers in our familiar base-10?

We think so, but we’re not sure. No one has been able to prove it either way. But we’ve got some numbers, which we call Lychrel candidates, that we think are probably Lychcrel numbers. The smallest one is 196 – which is why this whole discussion is sometimes called the 196 problem, or the 196 algorithm.

People have written programs that follow the Lychrel thread from 196, trying to see if it reaches a palindrome. So far, the record for exploring the 196 Lychrel thread carries it through more than a billion iterations, producing a non-palindromic number with more than 6 million digits.

That’s pretty impressive, given that the longest Lychrel thread for any number smaller than 196 is the thread of 24 steps, starting with 89 (which produces the palindromic number 8,813,200,023,188).

From my perspective, one thing that interests me about this is its nature as a computational problem. As a problem, it’s really easy to implement. For example, here’s a complete implementation in Ratchet, a Scheme-based programming language. (I used ratchet because it features infinite-precision integers, which makes it easier to write.)

(define (reverse-number n)
  (string->number (list->string (reverse (string->list (number->string n))))))

(define (palindromic? n)
  (equal? n (reverse-number n)))

(define (reverse-add n)
  (+ n (reverse-number n)))

(define (find-palindrome seed)
  (define (noisy-find-palindrome n count)
    (if (palindromic? n)
       n
      (begin
        (printf "At iteration ~v, candidate=~v~n" count n)
                (noisy-find-palindrome (reverse-add n) (+ count 1)))))
  (noisy-find-palindrome seed 0))

I literally threw that together in less than five minutes. In that sense, this is a really, really easy problem to solve. But in another sense, it’s a very hard problem: there’s no way to really speed it up.

In modern computing, when we look at a problem that takes a lot of computation to solve, the way that we generally try to approach it is to throw more CPUs at it, and do it in parallel. For most problems that we come across, we can find some reasonable way to divide it into parts that can be run at the same time; then by having a bunch of computers work on those different parts, we can get a solution pretty quickly.

For example, back in the early days of this blog, I did some writing about the Mandelbrot set, and one variant of it that I find particularly interesting, called the Buddhabrot. The Buddhabrot is interesting because it’s a fractal visualization which isn’t naively zoomable. In a typical Mandelbrot set visualization, you can pick a small region of it that you want to see in more detail, and focus your computation on just that part, to get a zoomed in view on that. In the Buddhabrot, due to the chaotic nature of the computation, you can’t. So you just compute the Buddhabrot at a massive size, and then you compress it. When you want to see a region in more detail, you un-compress. To make that work, buddhabrot’s are frequently computed at resolutions like 1 million by 1 million pixels. That translates to enough complex floating point computations to compute several trillion values. That’s a big task. But in modern environments, that’s routine enough that a friend of mine at Google wrote a program, just for kicks, to compute a big buddhabrot image, and ran it on an experimental cluster.

If that kind of computational capability can be exploited just for kicks, why is it that the best effort at exploring the Lychrel thread for 196 only covers 6 million digits?

The answer is that there’s a way of computing the Buddhabrot in parallel. You can throw 10,000 CPUs at it for a couple of days, and get an amazing Buddhabrot image. But for the Lychrel thread, there’s no good way to do it in parallel.

For each additional number in the thread, you need to rearrange and add a couple of million digits. That’s a lot of work, but it’s not crazy. On any halfway decent computer, it won’t take long. To get a sense, I just whipped up a Python program that generated 1,000,000 random pairs of digits, and added them up. It took under 8 seconds – and that’s half-assed code written using a not-particularly fast interpreted language on a not-particularly-speedy laptop. A single iteration of the Lychrel thread computation even for a million-digit candidate doesn’t take that long – it’s on the order of seconds.

The catch is that the process of searching a Lychrel thread is intrinsically serial: you can’t have different CPUs computing the next thread element for different values: you don’t know the next value until you’ve finished the previous one. So even if it took just 1 second to do the reverse+add for million digit numbers, it would takes a long time to actually explore the space. If you want to explore the next million candidates, at 2 seconds per iteration, that will take you around 3 weeks!

Even if you don’t waste time by using a relatively slow interpreter like Python – even if you use carefully hand-optimized code using an efficient algorithm, it would take months at the very least to explore a space of billions of elements of the 196 thread! And to get beyond what anyone has done, you’ll probably need to end up doing years of computation, even with a very fast modern computer – because you can’t parallelize it.

If you’re interested in this kind of thing, I recommend looking at Wade Van Landingham’s p196 site. Wade is the guy who named them Lychrel numbers (based on a rough reversal of his girlfriend (now wife)’s name, Cheryl). He’s got all sorts of data, including tracking some of the longest running efforts to continue the 196 thread.

A couple of edits were made to this, based on error pointed out by commenters.

Truth in Type Theory

Now, we’re getting to the heart of type theory: judgements. Judgements are the basic logical statements that we want to be able to prove about computations in type theory. There’s a basic set of judgements that we want to be able to make.

I keep harping on this, but it’s the heart of type theory: type theory is all about understanding logical statements as specifications of computations. Or, more precisely, in computer science terms, they’re about understanding true logical statements as halting computations.

In this post, we’ll see the ultimate definition of truth in type theory: every logical proposition is a set, and the proposition is true if the set has any members. A non-halting computation is a false statement – because you can never get it to resolve an expression to a canonical value.

So remember as we go through this: judgements are based on the idea of logical statements as specifications of computations. So when we talk about a predicate P, we’re using its interpretation as a specification of a computation. When we look at an expression 3+5, we understand it not as a piece of notation that describes the number 8, but as a description of a computation that adds 3 to 5. “3+5” not the same computation as “2*4” or “2+2+2+2”, but as we’ll see, they’re equal because they evaluate to the same thing – that is, they each perform a computation that results in the same canonical value – the number 8.

In this post, we’re going to focus on a collection of canonical atomic judgement types:

A \text{set}
This judgement says that A is a set.
A = B
A and Bare equal sets.
a \in A
a is an element of the set A
a_1 == a_2 \in A
a_1 and a_2 are equal members of the set A.
A is a proposition.
The proposition A is true.

The definition of the meanings of judgements is, necessarily, mutually recursive, so we’ll have to go through all of them before any of them is complete.

An object A is a Set
When I say that A is a set in type theory, that means that:

  • I know the rules for how to form the canonical expressions for the set;
  • I’ve got an equivalence relation that says when two canonical members of the set are equal.
Two Sets are Equal

When I say that A and B are equal sets, that means:

  • A and B are both sets.
  • If a is a canonical member of A, then a is also a canonical member of B, and vice versa.
  • If a and b are canonical members of A, and they’re also equal in A, then a and b are also equal canonical members of B (and vice versa).

The only tricky thing about the definition of this judgement is the fact that it defines equality is a property of a set, not of the elements of the set. By this definition, tt’s possible for two expressions to be members of two different sets, and to be equal in one, but not equal in the other. (We’ll see in a bit that this possibility gets eliminated, but this stage of the definition leaves it open!)

An object is a member of a set
When I say a \in A, that means that if I evaluate a, the result will be a canonical member of A.
Two members of a set are equal
If a \in A and b \in A and a = b, that means when if you evaluate a, you’ll get a canonical expression ; and when you evaluate b, you’ll get a canonical expression . For a == b to be true, then a, that is, the canonical expressions resulting from their evaluation must also be equal.

This nails down the problem back in set equivalence: since membership equivalence is defined in terms of evaluation as canonical values, and every expression evaluates to exactly one canonical expression (that’s the definition of canonical!), then if two objects are equal in a set, they’re equal in all sets that they’re members of.

An object A is a proposition
Here’s where type theory really starts to depart from the kind of math that we’re used to. In type theory, a proposition is a set. That’s it: to be a proposition, A has to be a set.
The proposition A is true
And the real meat of everything so far: if we have a proposition A, and A is true, what that means is that a has at least one element. If a proposition is a non-empty set, then it’s true. If it’s empty, it’s false.

Truth in type theory really comes down to membership in a set. This is, subtly, different from the predicate logic that we’re familiar with. In predicate logic, a quantified proposition can, ultimately, be reduced to a set of values, but it’s entirely reasonable for that set to be empty. I can, for example, write a logical statement that “All dogs with IQs greater that 180 will turn themselves into cats.” A set-based intepretation of that is the collection of objects for which it’s true. There aren’t any, because there aren’t any dogs with IQs greater than 180. It’s empty. But logically, in terms of predicate logic, I can still say that it’s “true”. In type theory, I can’t: to be true, the set has to have at least one value, and it doesn’t.

In the next post, we’ll take these atomic judgements, and start to expand them into the idea of hypothetical judgements. In the type-theory sense, that means statements require some other set of prior judgements before they can be judged. In perhaps more familiar terms, we’ll be looking at type theory’s equivalent of the contexts in classical sequent calculus – those funny little \Gammas that show up in all of the sequent rules.

Canonical Expressions in Type Theory

Sorry for the gap in posts. I’ve been trying to post more regularly, and was just hitting a rhythm, when my son brought home a particularly vicious bug, and I got sick. I’ve spent the last couple of weeks being really, really sick, and then trying to get caught up at work. I’m mostly recovered, except for some lingering asthma, so I’m trying to get back to that twice-per-week posting schedule.

In the last couple of posts, we looked at Martin-Löf’s theory of expressions. The theory of expressions is purely syntactic: it’s a way of understanding the notation of expressions. Everything that we do in type theory will be written with expressions that follow the syntax, the arity, and the definitional equivalency rules of expression theory.

The next step is to start to understand the semantics of expressions. In type theory, when it comes to semantics, we’re interested in two things: evaluation and judgements. Evaluation is the process by which an expression is reduced to its simplest form. It’s something that we care about, but it’s not really a focus of type theory: type theory largely waves its hands in the air and says “we know how to do this”, and opts for normal-order evaluation. Judgements are where things get interesting.

Judgements are provable statements about expressions and the values that they represent. As software people, when we think about types and type theory, we’re usually thinking about type declarations: type declarations are judgements about the expressions that they apply to. When you write a type declaration in a programming language, what you’re doing is asserting the type theory judgement. When the compiler “type-checks” your program, what it’s doing in type theory terms is checking that your judgements are proven by your program.

For example, we’d like to be able to make the judgement A set – that is, that A is a set. In order to make the judgement that A is a set in type theory, we need to know two things:

  1. How are canonical instances of the set A formed?
  2. Given two canonical instances of A, how can we determine if they’re equal?

To understand those two properties, we need to take a step back. What is a canonical instance of a set?

If we think about how we use predicate logic, we’re always given some basic set of facts as a starting point. In type theory, the corresponding concept is a primitive constant. The primitive constants include base values and primitive functions. For example, if we’re working with lispish expressions, then cons(1, cons(2, nil)) is an expression, and cons, nil, 1 and 2 are primitive constants; cons is the head of the expression, and 1 and cons(2, nil) are the arguments.

A canonical expression is a saturated, closed expression whose head is a primitive constant.

The implications of this can be pretty surprising, because it means that a canonical expression can contain unevaluated arguments! The expression has to be saturated and closed – so its arguments can’t have unbound variables, or be missing parameters. But it can contain unevaluated subexpressions. For example, if we were working with Peano arithmetic in type theory, succ(2+3) is canonical, even though “2+3” hasn’t be evaluated.

In general, in type theory, the way that we evaluate an expression is called normal order evaluation – what programming language people call lazy evaluation: that’s evaluating from the outside in. Given a non-canonical expression, we evaluate from the outside in until we get to a canonical expression, and then we stop. A canonical expression is considered the result of a computation – so we can see succ(2+3) as a result!

A canonical expression is the evaluated form of an expression, but not the fully evaluated form. The fully evaluated form is when the expression and all of its saturated parts are fully evaluated. So in our previous example, the saturated part 2+3 wasn’t evaluated, so it’s not fully evaluated. To get it to be fully evaluated, we’d need to evaluate 2+3, giving us succ(5); then, since succ(5) is saturated, it’s evaluated to 6, which is the fully evaluated form of the expression.

Next post (coming monday!), we’ll use this new understanding of canonical expressions, and start looking at judgements, and what they mean. That’s when type theory starts getting really fun and interesting.

Expressions and Arity (Part 2): Equivalence

Continuing where I left off: we were talking about arity in Martin-Löf’s theory of expressions. There are two basic problems that arity solves: it makes certain kinds of impossible-to-evaluate expressions be invalid in the theory; and it helps enable some way of comparing expressions for equality. Arity solves both of those problems by imposing a simple type system over expressions.

At the end of the last post, I started giving a sketch of what arities look like. Now we’re going to dive in, and take a look at how to determine the arity of an expression. It’s a fairly simple system of rules.

Before diving in, I want to stress the most important thing about the way that these rules work is that the expressions are totally syntactic and static. This is something that confused me the first time I tried to read about expression theory. When you see an expression, you think about how it’s evaluated. But expression theory is a purely syntactic theory: it’s about analyzing expressions as syntactic entities. There are, deliberately, no evaluation rules. It’s about understanding what the notations mean, and how to determine when two expressions are equivalent.

If, under the rules of Martin-Löf’s expression theory, two expressions are equivalent, then if you were to chose a valid set of evaluation rules, the two expressions will evaluate to the same value. But expression equivalence is stronger: expressions are equivalent only if you can prove their equivalence from their syntax.

That clarified, let’s start by looking at the rules of arity in expressions.

Variables and Constants
Every variable and every primitive constant has a pre-defined arity; if x is a variable or primitive constant with arity \alpha, then the expression x has arity \alpha.
Definitions
In a definition D := e, the arity of the defined name D is the same as the arity of the expression e.
Applications
If a is an expression of arity \alpha \twoheadrightarrow \beta, and b is a expression of arity \alpha, then a(b) is an expression of arity \beta.
Abstractions
If e is an expression of arity \beta and x is a variable of arity \alpha, then (x)e is an expression of arity \alpha \twoheadrightarrow \beta.
Combinations
If e_1 is an expression af arity \alpha_1, e_2 is an expression of arity \alpha_2, …, and e_n is an expression of arity \alpha_n, then a combination expression e_1, e_2, ..., e_n is an expression of arity \alpha_1 \otimes \alpha_2 \otimes \ldots \otimes \alpha_n.
Selections
If e is an expression of arity \alpha_1 \otimes \alpha_2 \otimes \ldots \otimes \alpha_n where n \ge 2, then (e).i is an expression af arity e_i.

Let’s try working through an example: x^2 + 3x + 7.

  1. As we saw in this post, this is equivalent to the simple AST-form: (x)+(+(*(x,x), *(3, x)),7).
  2. x” is a variable of arity 0; “3” and “7” are constants of arity 0; “+” and “*” are constants of arity (0 \otimes 0) \twoheadrightarrow 0.
  3. From the combination rule, since x and 3 both have arity 0, (x, x) and (3, x) each have arity 0 \otimes 0.
  4. Since (x, x) has arity 0 \otimes 0, and * has arity (0 \otimes 0) \twoheadrightarrow 0, *(x, x) has arity 0. The same thing works for *(3, x).
  5. Since the arities of the *(x, x) and *(3, x) are both 0, the combination of the pair (the arguments to the inner “+”) are 0 \otimes 0, and the arity of the inner sum expression is thus 0.
  6. Since 7 has arity 0, the combination of it with the inner sum is 0 \otimes 0, and the arity of the outer sum is 0.
  7. Since x is a variable of arity 0, and outer sum expression has arity 0, the abstract has arity 0 \twoheadrightarrow 0.

If you’re familiar with type inference in the simply typed lambda calculus, this should look pretty familiar; the only real difference is that the only thing that arity really tracks is applicability and parameter counting.

Just from this much, we can see how this prevents problems. If you try to compute the arity of 3.1 (that is, the selection of the first element from 3), you find that you can’t: there is no arity rule that would allow you to do that. The selection rule only works on a product-arity, and 3 has arity 0.

The other reason we wanted arity was to allow us to compare expressions. Intuitively, it should be obvious that the expression e and the expression (x)e(x) are in some sense equal. But we need some way of being able to actually precisely define that equality.

The kind of equality that we’re trying to get at here is called definitional equality. We’re not trying to define equality where expressions a and b evaluate to equal values – that would be easy. Instead, we’re trying to get at something more subtle: we want to capture the idea that the expressions are different ways of writing the same thing.

We need arity for this, for a simple reason. Let’s go back to that first example expression: (x)+(+(*(x,x), *(3, x)),7). Is that equivalent to (y)+(+(*(y,y), *(3, y)),7)? Or to 8x+1? If we apply them to the value 3, and then evaluate them using standard arithmetic, then all three expressions evaluate to 25. So are they all equivalent? We want to be able to say that the first two are equivalent expressions, but the last one isn’t. And we’d really like to be able to say that structurally – that is, instead of saying something evaluation-based like “forall values of x, eval(f(x)) == eval(g(x)), therefore f == g”, we want to be able to do something that says f \equiv g because they have the same structure.

Using arity, we can work out a structural definition of equivalence for expressions.

In everything below, we’l write a: \alpha to mean that a has arity \alpha, and a \equiv b : \alpha to mean that a and b are equivalent expressions of arity \alpha. We’ll define equivalence in a classic inductive form by structure:

Variables and Constants
If x is a variable or constant of arity \alpha, then x \equiv x : alpha. This is the simplest identity rule: variables and constants are equivalent to themselves.
Definitions
If a := b is a definition, and b: \alpha, then a \equiv b : \alpha. This is a slightly more complex form of an identity rule: if there’s a definition of b as the value of a, then a and b are equivalent.
Application Rules
  1. If and , then . If an applyable expression a is equivalent to another applyable expression , then applying a to an expression b is equivalent to applying to an expression if the argument b is equivalent to the argument . That’s a mouthful, but it’s simple: if you have two function application expressions, they’re equivalent if both the function expressions and the argument expressions are equivalent.
  2. If x is a variable of arity \alpha, and a is an expression of arity \alpha and b is an expression of arity \beta, then ((x)b)(a)  b[x := a]: \beta. This is arity’s version of the classic beta rule of lambda calculus: applying an abstraction to an argument means substituting the argument for all references to the abstracted parameter in the body of the abstraction.
Abstraction Rules
  1. If x is a variable of arity \alpha, and b \equiv b: \beta, then (x)b \equiv (x)b: \alpha \twoheadrightarrow \beta. If two expressions are equivalent, then two abstractions using the same variable over the same expression is equivalent.
  2. If x and y are both variables of arity \alpha, and b is an expression of arity \beta, then (x)b \equiv (y)(b[x := y]): \alpha \twoheadrightarrow \beta, provided y is not free in b.
    Basically, the renaming variables in abstractions don’t matter, as long as you aren’t using the variable in the body of the abstraction. So (x)(3+4y) is equivalent to (z)(3+4y), but it’s not equivalent to (y)(3+4y), because y is a free variable in 3+4y, and the abstraction would create a binding for y.

  3. This is arities version of the eta-rule from lambda calculus: If x is a variable of arity \alpha, and b is an expression of arity \alpha \twoheadrightarrow \beta, then (x)(b(x)) \equiv b: \alpha \twoheadrightarrow \beta. This is a fancy version of an identity rule: abstraction and application cancel.

Combination Rules
  1. If , , …, , then . This one is simple: if you have two combination expressions with the same arity, then they’re equivalent if their elements are equivalent.
  2. If e: \alpha_1 \otimes \alpha_2 \otimes ... \otimes \alpha_n, then e.1, e.2, ..., e.n \equiv: e : \alpha_1 \otimes \alpha_2 \otimes ... \otimes \alpha_n. Another easy one: if you take a combination expression, and you decompose it using selections, and then recombine those selection expressions into a combination, it’s equivalent to the original expression.
Selection Rules
  1. If , then . This is the reverse of combinations rule one: if you have two equivalent tuples, then their elements are equivalent.
  2. If a_1: \alpha_1, a_2: \alpha_2, ..., a_n: \alpha_n, then (a_1, a_2, ..., a_n).i \equiv a_i. An element of a combination is equivalent to itself outside of the combination.
Symmetry
If a: \alpha, then a \equiv a: \alpha.
Symmetry
If a \equiv b: \alpha, then b \equiv a: \alpha.
Transitivity
If a \equiv b: \alpha, and b \equiv c: \alpha, then a \equiv c: \alpha.

Jumping back to our example: Is x^2 + 3x + 7 equivalent to x^2 + 3x + 7? If we convert them both into their canonical AST forms, then yes. They’re identical, except for one thing: the variable name in their abstraction. By abstraction rule 2, then, they’re equivalent.

Expressions and Arity (part 1)

In the last post, we started looking at expressions. In this post, we’re going to continue doing that, and start looking at a simple form of expression typing called arity.

Before we can do that, we need to introduce a couple of new formalisms to complete our understanding of the elements that form expressions. The reason we need another formalism is because so far, we’ve defined the meaning of expressions in terms of function calls. But we’ve still got some ambiguity about our function calls – specifically, how do parameters work?

Suppose I’ve got a function, f. Can I call it as f(x)? Or f(x,y)? Both? Neither? It depends on exactly what function calls mean, and what it means to be a valid parameter to a function.

There are several approaches that you can take:

  1. You can say that a function application expression takes an arbitrary number of arguments. This is, roughly, what we do in dynamically typed programming languages like Python. In Python, you can write f(x) + f(x,y) + f(x, y, z, a, b, c), and the language parser will accept it. (It’ll complain when you try to execute it, but at the level of expressions, it’s a perfectly valid expression.)
  2. You can say that every function takes exactly one argument, and that a multi-argument function like f(x, y, z) is really a shorthand for a curried call sequence f(x)(y)(z) – and thus, the “application” operation really takes exactly 2 parameters – a function, and a single value to apply that function to. This is the approach that we usually take in lambda calculus.
  3. You can say that application takes two arguments, but the second one can be a combination of multiple objects – effectively a tuple. That’s what we do in a lot of versions of recursive function theory, and in programming languages like SML.

In the theory of expressions, Martin-Löf chose the third approach. A function takes a single parameter, which is a combination of a collection of other objects. If a, b, c, and d are expressions, then a, b, c, d is an expression called the combination of its four elements. This is very closely related to the idea of cartesian products in set theory, but it’s important to realize that it’s not the same thing. We’re not defining elements of a set here: we’re defining a syntactic construct of a pseudo-programming language, where one possible interpretation of it is cartesian products.

In addition to just multi-parameter functions, we’ll use combinations for other things. In type theory, we want to be able to talk about certain mathematical constructs as first-class objects. For example, we’d like to be able to talk about orderings, where an ordering is a collection of objects A combined with an operation \le. Using combinations, we can write that very naturally as A,\le.

For combinations to be useful, we need to be able to extract the component values from them. In set theory, we’d do that by having projection functions associated with the cross-product. In our theory of expressions, we have selection expressions. If x=(a, b, c, d) is a combination with four elements, then x.1 is a selection expression which extracts the first element from x.

In programming language terms, combinations give us a way of writing tuple values. Guess what’s next? Record types! Or rather, combinations with named elements. We can write a combination with names: x = (a: 1, b:2), and then write selection expressions using the names, like x.a.

Now we can start getting to the meat of things.

In combinatory logic, we’d just start with a collection of primitive constant values, and then build whatever we wanted with them using abstractions, applications, combinations, and selections. Combinatory logic is the parent of computation theory: why can’t we just stick with that foundation?

There are two answers to that. The first is familiar to programming language people: if you don’t put any restrictions on things, then you lose type safety. You’ll be able to write “valid” expressions that don’t mean anything – things like 1.x, even though “1” isn’t a combination, and so calling a selector on it makes no sense. Or you’d be able to call a function like factorial(27, 13), even though the function only takes one parameter.

The other answer is equality. One thing that we’d really like to be able to do in our theory of expressions is determine when two expressions are the same. For example, we’d really like to be able to do things like say that ((x)e)(x) == e. But without some form of control, we can’t really do that: the problem of determining whether or not two expressions are equal can become non-decidable. (The canonical example of this problem involves y combinator: ((x)x(x))((x)x(x)). If we wanted to try to work out whether an expression involving this was equivilant to another expression, we could wind up in an infinite loop of applications.)

The way that Martin-Löf worked around this is to associate an arity with an expression. Each arity of expressions is distinct, and there are rules about what operations can be applied to an expression depending on its arity. You can’t call .2 on “1+3”, because “1+3” is a single expression, and selectors can only be applied to combined expressions.

To most of us, arity sounds like it should be a numeric value. When we we talk about the arity of a function in a program, what we mean is how many parameters it takes. In Martin-Löf expressions, arity is more expressive than that: it’s almost the same thing as types in the simply typed lambda calculus.

There are two dimensions to arity: single/combined and saturated/unsaturated.

Single expressions are atomic values, where you can’t extract other values from them by selection; multiple expressions are combinations of multiple other expressions.

Saturated expressions are expressions that have no holes in them that can be filled by other expressions – that is, expressions with no free variables. Unsaturated expressions have open holes – which means that they can be applied to other expressions.

Saturated single expressions have arity 0. An expression of arity 0 can’t be applied, and you can’t be the target of selection expressions.

An unsaturated expression has an arity (\alpha \twoheadrightarrow \beta), where both \alpha and \beta are arities. For example, the integer addition function has arity (0 \otimes 0 \twoheadrightarrow 0). (Erik Eidt pointed out that I made an error here. I originally wrote addition as 0 \twoheadrightarrow 0, where it should have been 0 \otimes 0 \twoheadrightarrow 0.)

A combined expression (e_0, e_1, ..., e_n) has arity (\alpha_1 \otimes \alpha_2 \otimes ... \otimes \alpha_n), where each of the \alpha_is are the arities of the expression e_i.

Sadly, I’m out of time to work on this post, so we’ll have to stop here. Next time, we’ll look at the formal rules for arities, and how to use them to define equality of expressions.