# Bad Arithmetic and Blatant Political Lies

I’ve been trying to say away from the whole political thing lately. Any time that I open my mouth to say anything about politicians, I get a bunch of assholes trying to jump down my throat for being “biased”. But sometimes, things just get too damned ridiculous, and I can’t possibly let it go without comment.

In the interests of disclosure: I despise Mitt Romney. Despite that, I think he’s gotten a very unfairly hard time about a lot of things. Let’s face it, the guys a rich investor. But that’s been taken by the media, and turned in to the story through which everything is viewed, whether it makes sense or not.

For example, there’s the whole \$10,000 bet nonsense. I don’t think that that made a damned bit of sense. It was portrayed as “here’s a guy so rich that he can afford to lose \$10,000”. But… well, let’s look at it from a mathematical perspective.

You can assess the cost of a bet by looking at it from probability. Take the cost of losing, and multiply it by the probability of losing. That’s the expected cost of the bet. So, in the case of that debate moment, what was the expected cost of the bet? \$0. If you know that you’re betting about a fact, and you know the fact, then you know the outcome of the bet. It’s a standard rhetorical trick. How many of us have said “Bet you a million dollars”? It doesn’t matter what dollar figure you attach to it – because you know the fact, and you know that the cost of the bet, to you, is 0.

But… Well, Mitt is a rich asshole.

As you must have heard, Mitt released his income tax return for last year, and an estimate for this year. Because his money is pretty much all investment income, he paid a bit under 15% in taxes. This is, quite naturally, really annoying to many people. Those of us who actually have jobs and get paid salaries don’t get away with a tax rate that low. (And people who are paid salary rather than investment profits have to pay the alternative minimum tax, which means that they’re not able to deduct charity the way that Mitt is.)

So, in an interview, Mitt was asked about the fairness of a guy who made over twenty million dollars a year paying such a low rate. And Mitt, asshole that he is, tried to cover up the insanity of the current system, by saying:

Well, actually, I released two years of taxes and I think the average is almost 15 percent. And then also, on top of that, I gave another more 15 percent to charity. When you add it together with all of the taxes and the charity, particularly in the last year, I think it reaches almost 40 percent that I gave back to the community.

I don’t care about whether the reasoning there is good or not. Personally, I think it’s ridiculous to say “yeah, I didn’t pay taxes, but I gave a lot of money to my church, so it’s OK.” But forget that part. Just look at the freaking arithmetic!

He pays less than 15% in taxes.

He pays 15% in charity (mostly donations to his church).

What’s less than 15 + 15?

It sure as hell isn’t “almost 40 percent”. It’s not quite 30 percent. This isn’t something debatable. It’s simple, elementary school arithmetic. It’s just fucking insane that he thinks he can just get away with saying that. But he did – they let him say that, and didn’t challenge it at all. He says “less than 15 + 15 = almost 40”, and the interviewer never even batted an eye.

And then, he moved on to something which is a bit more debatable:

One of the reasons why we have a lower tax rate on capital gains is because capital gains are also being taxed at the corporate level. So as businesses earn profits, that’s taxed at 35 percent, then as they distribute those profits as dividends, that’s taxed at 15 percent more. So, all total, the tax rate is really closer to 45 or 50 percent.

Now, like I said, you can argue about that. Personally, I don’t think it’s a particularly good argument. The way that I see it, corporations are a tradeoff. A business doesn’t need to be a corporation. You become a corporation, because transforming the business into a quasi-independent legal entity gives you some big advantages. A corporation owns its own assets. You, as an individual who owns part of a corporation, aren’t responsible for the debts of the corporation. You, as an individual who owns part of a corporation, aren’t legally liable for the actions (such as libel) of the corporation. The corporation is an independent entity, which owns its own assets, which is responsible for its debts and actions. In exchange for taking on the legal status on an independent entity, that legal entity becomes responsible for paying taxes on its income. You give it that independent legal status in order to protect yourself; and in exchange, that independent legal status entails an obligation for that independent entity to pay its own taxes.

But hey, let’s leave that argument aside for the moment. Who pays the cost of the corporate taxes? Is it the owners of the business? Is it the people who work for the business? Is it someone else?

When they talk about their own ridiculously low tax rates, people like Mitt argue that they’re paying those taxes, and they want to add those taxes to the total effective tax that they pay.

But when they want to argue about why we should lower corporate tax rates, they pull out a totally different argument, which they call the “flypaper theory“. The flypaper theory argues that the burden of corporate taxes falls on the employees of the company – because if the company didn’t have to pay those taxes, that money would be going to the employees as salary – that is, the taxes are part of the overall expenses paid by the company. A company’s effective profits are (revenue – expenses). Expenses, in turn, are taxes+labor+materials+…. The company makes a profit of \$P to satisfy its shareholders. So if you took away corporate taxes, the company could continue to make \$P while paying its employees more. Therefore, the cost of the corporate taxes comes out of the salaries of the corporations employees.

You can make several different arguments – that the full burden of taxes fall on to the owners, or that the full burden of taxes falls on the employees, or that the full burden of taxes falls on the customers (because prices are raised to cover them). Each of those is something that you could reasonably argue. But what the conservative movement in America likes to do is to claim all of those: that the full burden of corporate taxes falls on the employees, and the full burden of corporate taxes falls on the customers, and the full burden of corporate taxes falls on the shareholders.

That’s just dishonest. If the full burden falls on one, then none of the burden falls on anyone else. The reality is, the burden of taxes is shared between all three. If there were no corporate taxes, companies probably would be able to pay their employees more – but there’s really no way that they’d take all of the money they pay in taxes, and push that into salary. And they’d probably be able to lower prices – but they probably wouldn’t lower prices enough to make up the entire difference. And they’d probably pay more in dividends/stock buybacks to pay the shareholders.

But you don’t get to count the same tax money three times.

# The Basics of Software Transactional Memory

As promised, it’s time for software transactional memory!

A bit of background, first. For most of the history of computers, the way that we’ve built software is very strongly based on the fact that a computer has a processor – a single CPU, which can do one thing at a time, in order. Until recently, that was true. But now it really isn’t anymore. There’s the internet – which effectively means that no computer is ever really operating in isolation – it’s part of a huge computational space shared with billions of other computers. And even ignoring the internet, we’re rapidly approaching the point where tiny devices, like cellphones, will have more than one CPU.

The single processor assumption makes things easy. We humans tend to think very sequentially – that is, the way that we describe how to do things is: do this, then do that. We’re not so good at thinking about how to do lots of things at the same time. Just think about human language. If I want to bake a cake, what I’ll do is: measure out the butter and sugar, put them in the mixer, mix it until they’re creamed. Then add milk. Then in a separate bowl, measure out and sift the flour and the baking powder. Then slowly pour the dry stuff into the wet, and mix it together. Etc.

I don’t need to fully specify that order. If I had multiple bakers, they could do many of steps at the same time. But how, in english, can I clearly say what needs to be done? I can say it, but it’s awkward. It’s harder to say, and harder to understand than the sequential instructions.

What I need to do are identifying families of things that can be done at the same time, and then the points at which those things will merge.

All of the measurements can be done at the same time. In any mixings step, the mixing can’t be done until all of the ingredients are ready. Ingredients being ready could mean two things. It could mean that the ingredients were measured out; or it could mean that one of the ingredients for the mix is the product of one of the previous mixing steps, and that that previous step is complete. In terms of programming, we’d say that the measurement steps are independent; and the points at which we need to wait for several things to get done (like “we can’t mix the dry ingredients in to the wet until the wet ingredients have all been mixed and the dry ingredients have been measured”), we’d call synchroniation points.

It gets really complicated, really quickly. In fact, it gets even more complicated than you might think. You see, if you were to write out the parallel instructions for this, you’d probably leave a bunch of places where you didn’t quite fully specify things – because you’d be relying on stuff like common sense on the part of your bakers. For example, you’d probably say to turn on the over to preheat, but you wouldn’t specifically say to wait until it reached the correct temperature to put stuff into it; you wouldn’t mention things like “open the over door, then put the cake into it, then close it”.

When we’re dealing with multiple processors, we get exactly those kinds of problems. We need to figure out what can be done at the same time; and we need to figure out what the synchronization points are. And we also need to figure out how to do the synchronization. When we’re talking about human bakers “don’t mix until the other stuff is ready” is fine. But in software, we need to consider things like “How do we know when the previous steps are done?”.

And it gets worse than that. When you have a set of processors doing things at the same time, you can get something called a race condition which can totally screw things up!

For example, imagine that we’re counting that all of the ingredients are measured. We could imagine the mixer process as looking at a counter, waiting until all five ingredients have been measured. Each measuring process would do its measurement, and then increment the counter.

```  val measureCount = 0
process Mixer() {
wait until measureCount == 5
}

process Measurer() {
do_measure
measureCount = measureCount + 1
}
```

What happens if two measurer finish at almost the same time? The last statement in `Measurer` actually consists of three steps: retrieve the value of `measureCount`; add one; store the incremented value. So we could wind up with:

Time Measurer1 Measurer2 measureCount
0 1
1 Fetch measureCount(=1) 1
2 Increment(=2) Fetch measurecount(=1) 1
3 Store updated(=2) Increment(=2) 2
4 Store updated(=2) 2

Now, `Mixer` will never get run! Because of the way that the two `Measurer`s overlapped, one of the increments effectively got lost, and so the count will never reach 5. And the way it’s written, there’s absolutely no way to tell whether or not that happened. Most of the time, it will probably work – because the two processes have to hit the code that increments the counter at exactly the same time in order for there to be a problem. But every once in a while, for no obvious reason, the program will fail – the mixing will never get done. It’s the worst kind of error to try to debug – one which is completely unpredictable. And if you try to run it in a debugger, because the debugger slows things down, you probably won’t be able to reproduce it!

This kind of issue always comes down to coordination or synchronization of some kind – that is, the main issue is how do the different processes interact without stomping on each other?

The simplest approach is to use something called a lock. A lock is an object which signals ownership of some resource, and which has one really important property: in concept, it points at at most one process, and updating it is atomic meaning that when you want to look at it and update it, nothing can intervene between the read and write. So if you want to use the resource managed by the lock, you can look at the lock, and see if anyone is using it; and if not, set it to point to you. That process is called acquiring the lock.

In general, we wrap locks up in libraries to make them easier to use. If “l” was a lock, you’d take a lock by using a function something like “lock(l)”, which really expanded to something like:

```def take(L: Lock) {
while (L != me)
atomically do if L==no one then L=me
}
```

So the earlier code becomes:

```val measureCount = 0
val l = new Lock()

process Mixer() {
wait until measureCount == 5
}

process Measurer() {
do_measure
lock(l)
measureCount = measureCount + 1
unlock(l)
}
```

In a simple example like that, locks are really easy. Unfortunately, real examples get messy. For instance, there’s a situation called deadlock. A classic demonstration is something called the dining philosophers. You’ve got four philosophers sitting at a table dinner table. Between each pair, there’s a chopstick. In order to eat, a philosopher needs two chopsticks. When they get two chopsticks, they’ll use them to take a single bite of food, and then they’ll put down the chopsticks. If Each philosopher starts by grabbing the chopstick to their right, then no one gets to each. Each has one chopstick, and there’s no way for them to get a second one.

That’s exactly what happens in a real system. You lock each resource that you want to share. For some operations, you’re going to use more than one shared resource, and so you need two locks. If you have multiple tasks that need multiple resources, it’s easy to wind up with a situation where each task has some subset of the locks that they need.

Things like deadlock mean that simple locks get hairy really quickly. Not that any of the more complex coordination strategies make deadlocks impossible; you can always find a way of creating a deadlock in any system – but it’s a lot easier to create accidental deadlocks using simple locks than, say, actors.

So there’s a ton of methods that try to make it easier to do coordination between multiple tasks. Under the covers, these ultimately rely on primitives like locks (or semaphores, another similar primitive coordination tool). But they provide a more structured way of using them. Just like structured control flow makes code cleaner, clearer, and more maintanable, structured coordination mechanism makes concurrency cleaner, clearer, and more maintainable.

Software transactional memory is one approach to this problem, which is currently very trendy. It’s still not entirely clear to me whether or not STM is really quite up to the real world – current implementations remain promising, but inefficient. But before getting in to any of that, we need to talk about just what it is.

As I see it, STM is based on two fundamental concepts:

1. Optimism. In software terms, by optimism, we mean that we’re going to plow ahead and assume that there aren’t any errors; when we’re done, we’ll check if there was a problem, and clean up if necessary. A good example of this from my own background is source code control systems. In the older systems like RCS, you’d lock a source file before you edited it; then you’d make your changes, and check them in, and release the lock. That way, you know that you’re never going to have two people making changes to the same file at the same time. But the downside is that you end up with lots of people sitting around doing nothing, waiting to get the lock on the file they want to change. Odds are, they weren’t going to change the same part of the file as the guy who has the lock. But in order to make sure that they can’t, the locks also block a lot of real work. Eventually, the optimistic systems came along, and what they did was say: “go ahead and edit the file all you want. But before I let you check in (save) the edited file to the shared repository, I’m going to make sure that no one changed it in a way that will mess things up. If I find out that someone did, then you’re going to have to fix it.”
2. Transactions. A transaction is a concept from (among other places) databases. In a database, you often make a collection of changes that are, conceptually, all part of the same update. By making them a transaction, they become one atomic block – and either the entire collection all succeedd, or the entire collection all fail. Transactions guarantee that you’ll never end up in a situation where half of the changes in an update got written to the database, and the other half didn’t.

What happens in STM is that you have some collection of special memory locations or variables. You’re only allowed to edit those variables in a transaction block. Whenever a program enters a transaction block, it gets a copy of the transaction variables, and just plows ahead, and does whatever it wants with its copy of the transaction variables. When it gets to the end of the block, it tries to commit the transaction – that is, it tries to update the master variables with the values of its copies. But it checks them first, to make sure that the current value of the master copies haven’t changed since the time that it made its copy. If they did, it goes back and starts over, re-running the transaction block. So if anyone else updated any of the transaction variables, the transaction would fail, and then get re-executed.

In terms of our baking example, both of the measurers would enter the transaction block at the same time; and then whichever finished first would commit its transaction, which would update the master count variable. Then when the second transaction finished, it would check the count variable, see that it changed, and go back and start over – fetching the new value of the master count variable, incrementing it, and then committing the result. In terms of code, you’d just do something like:

```transactional val measureCount = 0

process Mixer() {
wait until measureCount == 5
}

process Measurer() {
do_measure
atomically {
measureCount = measureCount + 1
}
}
```

It’s really that simple. You just mark all the shared resources as transactional, and then wrap the code that modifies them in a transaction block. And it just works. It’s a very elegant solution.

Of course there’s a lot more to it, but that’s the basic idea. In the code, you identify the transactional variables, and only allow them to be updated inside of a transaction block. At runtime, when you encounter a transaction block, charge ahead, and do whatever you want. Then when you finish, make sure that there isn’t an error. If there was, try again.

So what’s it look like in a non-contrived programming language? These days, I’m doing most of my coding in Scala. There’s a decent STM implementation for Scala as a part of a package called Akka.

In Akka, the way that you define a transactional variable is by using a `Ref` type. A `Ref` is a basically a cell that wraps a value. (It’s almost like a pointer value in C.) So, for example, in our Baker example:

```var count :Ref[Int] = Ref(0)
```

Then in code, to use it, you literally just wrap the code that modifies the `Ref`s in “atomic”. Alas, you don’t quite get to treat the refs like normal variables – to access the value of a ref, you need to call Ref.get; to change the value, you need to use a method `alter`, which takes a function that computes the new value in terms of the old.

```class Measurer {
def doMeasure() {
// do the measuring stuff
atomic {
ref.alter(_ + 1)
}
}
}
```

The “(_ + 1)” probably needs a quick explanation. In Scala, you can define a single expression function using “_” to mark the slot where the parameter should go. So “`(_ + 1)`” is equivalent to the lambda expression `{ x => x + 1}`.

You can see, just from this tiny example, why STM is such an attractive approach. It’s so simple! It’s very easy to read and write. It’s a very simple natural model. It’s brilliant in its simplicity. Of course, there’s more to it that what I’ve written about here – error handling, voluntary transaction abort, retry management, transaction waits – but this is the basics, and it really is this simple.

What are the problems with this approach?

1. Impurity. If not all variables are transactional, and you can modify a non-transactional variable inside of a transaction block, then you’ve got a mess onp your hands. Values from transactionals can “leak” out of transaction blocks.
2. Inefficiency. You’ve got to either copy all of the transactional variables at the entry to the transaction block, or you’ve got to use some kind of copy-on-write strategy. However you do it, you’ve got grief – aggressive copying, copy-on-write, memory protect – they’ve all got non-trivial costs. And re-doing the entire transaction block every time it fails can eat a lot of CPU.
3. Fairness. This is a fancy term for “what if one guy keeps getting screwed?” You’ve got lots of processes all working with the same shared resources, which are protected behind the transactional variables. It’s possible for timing to work out so that one process keeps getting stuck doing the re-tries. This is something that comes up a lot in coordination strategies for concurrency, and the implementations can get pretty hairy trying to make sure that they don’t dump all of the burden of retries on one process.

# LaTeX Issues Fixed, Hopefully

A quick administrative note. For people reading this blog via RSS readers, there’ve been problems with LaTeX equations for a couple of months. I’ve set up a local tex server and tweak the configurations. Hopefully, now everyone will be able to see the math stuff.

As a test, $pi = 4sum_{k=0}^{infty} frac{(-1)^k}{2k+1}$. If you still get an error box instead of an equation, please drop me an email.

# The Wrong Way To Write Concurrent Programs: Actors in Cruise

I’ve been planning to write a few posts about some programming stuff that interests me. I’ve spent a good part of my career working on systems that need to support concurrent computation. I even did my PhD working on a system to allow a particular style of parallel programming. It’s a really hard problem – concurrency creates a host of complex issues in how systems behave, and the way that you express concurrency in a programming language has a huge impact on how hard it is to read, write, debug, and reason about systems.

So, like I’ve said, I’ve spent a lot of time thinking about these issues, and looking at various different proposed solutions, as well as proposing a couple of my own. But I really don’t know of any good writeup describing the basics of the most successful approaches for beginners. So I thought I could write one.

But that’s not really today’s post. Todays post is my version of a train-wreck. Long-time readers of the blog know that I’m fascinated with bizarre programming languages. So today, I’m going to show you a twisted, annoying, and thoroughly pointless language that I created. It’s based on one of the most successful models of concurrent programming, called Actors, which was originally proposed by Professor Gul Agha of UIUC. There’ve been some really nice languages built using ideas from Actors, but this is not one of them.

The language is called “Cruise”. Why? Because it’s a really bad Actor language. And what name comes to mind when I think of really, really bad actors with delusions of adequacy? Tom Cruise.

You can grab my implementation from github. Just so you know, the code sucks. It’s something I threw together in my spare time a couple of years ago, and haven’t really looked at since. So it’s sloppy, overcomplicated, probably buggy, and slow as a snail on tranquilizers.

### Quick overview of the actor model

Actors are a theoretical model of computation, which is designed to describe completely asynchronous parallel computation. Doing things totally asynchronously is very strange, and very counter-intuitive. But the fact of the matter is, in real distributed systems, everything *is* fundamentally asynchronous, so being able to describe distributed systems in terms of a simple, analyzable model is a good thing.

According to the actor model, a computation is described by a collection of things called, what else, actors. An actor has a mailbox, and a behavior. The mailbox is a uniquely named place where messages sent to an actor can be queued; the behavior is a definition of how the actor is going to process a message from its mailbox. The behavior gets to look at the message, and based on its contents, it can do three kinds of things:

1. Create other actors.
2. Send messages to other actors whose mailbox it knows.
3. Specify a new behavior for the actor to use to process its next message.

You can do pretty much anything you need to do in computations with that basic mechanism. The catch is, as I said, it’s all asynchronous. So, for example, if you want to write an actor that adds two numbers, you can’t do it by what you’d normally think of as a function call. In a lot of ways, it looks like a method call in something like Smalltalk: one actor (object) sends a message to another actor, and in response, the receiver takes some action specified by them message.

But subroutines and methods are synchronous, and nothing in actors is synchronous. In an object-oriented language, when you send a message, you stop and wait until the receiver of the message is done with it. In Actors, it doesn’t work that way: you send a message, and it’s sent; that’s it, it’s over and done with. You don’t wait for anything; you’re done. If you want a reply, you need to send the the other actor a reference to your mailbox, and make sure that your behavior knows what to do when the reply comes in.

It ends up looking something like the continuation passing form of a functional programming language: to do a subroutine-like operation, you need to pass an extra parameter to the subroutine invocation; that extra parameter is the *intended receiver* of the result.

You’ll see some examples of this when we get to some code.

### Tuples – A Really Ugly Way of Handling Data

This subtitle is a bit over-the-top. I actually think that my tuple notion is pretty cool. It’s loosely based on how you do data-types in Prolog. But the way that it’s implemented in Cruise is absolutely awful.

Cruise has a strange data model. The idea behind it is to make it easy to build actor behaviors around the idea of pattern matching. The easiest/stupidest way of doing this is to make all data consist of tagged tuples. A tagged tuple consists of a tag name (an identifier starting with an uppercase letter), and a list of values enclosed in the tuple. The values inside of a tuple can be either other tuples, or actor names (identifiers starting with lower-case letters).

So, for example, `Foo(Bar(), adder)` is a tuple. The tag is “`Foo`“. It’s contents are another tuple, “`Bar()`“, and an actor name, “`adder`“.

Since tuples and actors are the only things that exist, we need to construct all other types of values from some combination of tuples and actors. To do math, we can use tuples to build up Peano numbers. The tuple “`Z()`” is zero; “`I(n)`” is the number `n+1`. So, for example, 3 is “`I(I(I(Z())))`“.

The only way to decompose tuples is through pattern matching in messages. In an actor behavior. message handlers specify a *tuple pattern*, which is a tuple where some positions may be filled by{em unbound} variables. When a tuple is matched against a pattern, the variables in the pattern are bound to the values of the corresponding elements of the tuple.

A few examples:

• matching `I(I(I(Z())))` with `I(\$x)` will succeed with `\$x` bound to `I(I(Z))`.
• matching `Cons(X(),Cons(Y(),Cons(Z,Nil())))` with `Cons(\$x,\$y)` will succeed with \$x bound to `X()`, and `\$y` bound to `Cons(Y(),Cons(Z(),Nil()))`.
• matching `Cons(X(),Cons(Y(),Cons(Z(),Nil())))` with `Cons(\$x, Cons(Y(), Cons(\$y, Nil())))` will succeed with `\$x` bound to `X()`, and `\$y` bound to `Z()`.
• ### Code Examples!

Instead of my rambling on even more, let’s take a look at some Cruise programs. We’ll start off with Hello World, sort of.

```actor !Hello {
behavior :Main() {
on Go() { send Hello(World()) to out }
}
initial :Main
}

instantiate !Hello() as hello
send Go() to hello

```

This declares an actor type “!Hello”; it’s got one behavior with no parameters. It only knows how to handle one message, “Go()”. When it receives go, it sends a hello world tuple to the actor named “out”, which is a built-in that just prints whatever is sent to it.

Let’s be a bit more interesting, and try something using integers. Here’s some code to do a greater than comparison:

```actor !GreaterThan {
behavior :Compare() {
on GT(Z(),Z(), \$action, \$iftrue, \$iffalse) {
send \$action to \$iffalse
}
on GT(Z(), I(\$x), \$action, \$iftrue, \$iffalse) {
send \$action to \$iffalse
}
on GT(I(\$x), Z(), \$action, \$iftrue, \$iffalse) {
send \$action to \$iftrue
}
on GT(I(\$x), I(\$y), \$action, \$iftrue, \$iffalse) {
send GT(\$x,\$y,\$action,\$iftrue,\$iffalse) to \$self
}
}
initial :Compare
}

actor !True {
behavior :True() {
on Result() { send True() to out}
}
initial :True
}

actor !False {
behavior :False() {
on Result() { send False() to out}
}
initial :False
}

instantiate !True() as true
instantiate !False() as false
instantiate !GreaterThan() as greater
send GT(I(I(Z())), I(Z()), Result(), true, false) to greater
send GT(I(I(Z())), I(I(I(Z()))), Result(), true, false) to greater
send GT(I(I(Z())), I(I(Z())), Result(), true, false) to greater

```

This is typical of how you do “control flow” in Cruise: you set up different actors for each branch, and pass those actors names to the test; one of them will receive a message to continue the execution.

What about multiple behaviors? Here’s a trivial example of a flip-flop:

```actor !FlipFlop {
behavior :Flip() {
on Ping(\$x) {
send Flip(\$x) to out
}
on Pong(\$x) {
send Flip(\$x) to out
}
}
behavior :Flop() {
on Ping(\$x) {
send Flop(\$x) to out
}
on Pong(\$x) {
send Flop(\$x) to out
}
}
initial :Flip
}

instantiate !FlipFlop() as ff
send Ping(I(I(Z()))) to ff
send Ping(I(I(Z()))) to ff
send Ping(I(I(Z()))) to ff
send Ping(I(I(Z()))) to ff
send Pong(I(I(Z()))) to ff
send Pong(I(I(Z()))) to ff
send Pong(I(I(Z()))) to ff
send Pong(I(I(Z()))) to ff

```

If the actor is in the “:Flip” behavior, then when it gets a “Ping”, it sends “Flip” to out, and switches behavior to flop. If it gets point, it just sents “Flip” to out, and stays in “:Flip”.

The “:Flop” behavior is pretty much the same idea, accept that it switches behaviors on “Pong”.

An example of how behavior changing can actually be useful is implementing settable variables:

```actor !Var {
behavior :Undefined() {
on Set(\$v) { adopt :Val(\$v) }
on Get(\$target) { send Undefined() to \$target }
on Unset() { }
}
behavior :Val(\$val) {
on Set(\$v) { adopt :Val(\$v) }
on Get(\$target) { send \$val to \$target }
on Unset() { adopt :Undefined() }
}
initial :Undefined
}
instantiate !Var() as v
send Get(out) to v
send Set(I(I(I(Z())))) to v
send Get(out) to v

```

Two more programs, and I’ll stop torturing you. First, a simple adder:

```actor !Adder {
on Plus(Z(),\$x, \$target) {
send \$x to \$target
}
on Plus(I(\$x), \$y, \$target) {
send Plus(\$x,I(\$y), \$target) to \$self
}
}
}

actor !Done {
behavior :Finished() {
on Result(\$x) { send Result(\$x) to out }
}
initial :Finished
}

instantiate !Done() as done

```

Pretty straightforward – the only interesting thing about it is the way that it sends the result of invoking add to a continuation actor.

Now, let’s use an addition actor to implement a multiplier actor. This shows off some interesting techniques, like carrying auxiliary values that will be needed by the continuation. It also shows you that I cheated, and added integers to the parser; they’re translated into the peano-tuples by the parser.

```actor !Adder {
on Plus(Z(),\$x, \$misc, \$target) {
send Sum(\$x, \$misc) to \$target
}
on Plus(I(\$x), \$y, \$misc, \$target) {
send Plus(\$x,I(\$y), \$misc, \$target) to \$self
}
}
}

actor !Multiplier {
behavior :Mult() {
on Mult(I(\$x), \$y, \$sum, \$misc, \$target) {
send Plus(\$y, \$sum, MultMisc(\$x, \$y, \$misc, \$target), \$self) to adder
}
on Sum(\$sum, MultMisc(Z(), \$y, \$misc, \$target)) {
send Product(\$sum, \$misc) to \$target
}
on Sum(\$sum, MultMisc(\$x, \$y, \$misc, \$target)) {
send Mult(\$x, \$y, \$sum, \$misc, \$target) to \$self
}
}
initial :Mult
}

instantiate !Multiplier() as multiplier
send Mult(32, 191, 0, Nil(), out) to multiplier

```

So, is this Turing complete? You bet: it’s got peano numbers, conditionals, and recursion. If you can do those three, you can do anything.

# Sidetrack from the CCCs: Lambda Calculus

So, last post, I finally defined closed cartesian categories. And I alluded to the fact that the CCCs are, essentially, equivalent to the simply typed λ calculus. But I didn’t really talk about what that meant.

Before I can get to that, you need to know what λ calculus is. Many readers are probably familiar, but others aren’t. And as it happens, I absolutely love λ calculus.

In computer science, especially in the field of programming languages, we tend to use λ calculus a whole lot. It’s also extensively used by logicians studying the nature of computation and the structure of discrete mathematics. λ calculus is great for a lot of reasons, among them:

1. It’s very simple.
2. It’s Turing complete: if a function can be computed by any possible computing device, then it can be written in λ-calculus.
3. It’s easy to read and write.
4. Its semantics are strong enough that we can do reasoning from it.
5. It’s got a good solid model.
6. It’s easy to create variants to explore the properties of various alternative ways of structuring computations or semantics.

The ease of reading and writing λ calculus is a big deal. It’s led to the development of a lot of extremely good programming languages based, to one degree or another, on the λ calculus: Lisp, ML, Haskell, and my current favorite, Scala, are very strongly λ calculus based.

The λ calculus is based on the concept of functions. In the pure λ calculus, everything is a function; there are no values except for functions. In fact, we can pretty much build up all of mathematics using λ-calculus.

With the lead-in out of the way, let’s dive in a look at λ-calculus. To define a calculus, you need to define two things: the syntax, which describes how valid expressions can be written in the calculus; and a set of rules that allow you to symbolically manipulate the expressions.

### Lambda Calculus Syntax

The λ calculus has exactly three kinds of expressions:

1. Function definition: a function in λ calculus is an expression, written: λ param . body, which defines a function with one parameter.
2. Identifier reference: an identifier reference is a name which matches the name of a parameter defined in a function expression enclosing the reference.
3. Function application: applying a function is written by putting the function value in front of its parameter, as in x y to apply the function x to the value y.

There’s a trick that we play in λ calculus: if you look at the definition above, you’ll notice that a function (lambda expression) only takes one parameter. That seems like a very big constraint – how can you even implement addition with only one parameter?

It turns out to be no problem, because of the fact that functions are, themselves, values. Instead of writing a two parameter function, you can write a one parameter function that returns a one parameter function, which can then operate on the second parameter. In the end, it’s effectively the same thing as a two parameter function. Taking a two-parameter function, and representing it by two one-parameter functions is called currying, after the great logician Haskell Curry.

For example, suppose we wanted to write a function to add x and y. We’d like to write something like: λ x y . x + y. The way we do that with one-parameter functions is: we first write a function with one parameter, which returns another function with one parameter.

Adding x plus y becomes writing a one-parameter function with parameter x, which returns another one parameter function which adds x to its parameter: λ x . (λ y . x + y).

Now that we know that adding multiple parameter functions doesn’t really add anything but a bit of simplified syntax, we’ll go ahead and use them when it’s convenient.

One important syntactic issue that I haven’t mentioned yet is closure or complete binding. For a λ calculus expression to be evaluated, it cannot reference any identifiers that are not bound. An identifier is bound if it a parameter in an enclosing λ expression; if an identifier is not bound in any enclosing context, then it is called a free variable. Let’s look quickly at a few examples:

• λ x . p x y: in this expression, y and p are free, because they’re not the parameter of any enclosing λ expression; x is bound because it’s a parameter of the function definition enclosing the expression p x y where it’s referenced.
• λ x y.y x: in this expression both x and y are bound, because they are parameters of the function definition, and there are no free variables.
• λ y . (λ x . p x y). This one is a tad more complicated, because we’ve got the inner λ. So let’s start there. In the inner λ, λ x . p x y, y and p are free and x is bound. In the full expression, both x and y are bound: x is bound by the inner λ, and y is bound by the other λ. “p” is still free.

We’ll often use “free(x)” to mean the set of identifiers that are free in the expression “x”.

A λ calculus expression is valid (and thus evaluatable) only when all of its variables are bound. But when we look at smaller subexpressions of a complex expression, taken out of context, they can have free variables – and making sure that the variables that are free in subexpressions are treated right is very important.

### Lambda Calculus Evaluation Rules

There are only two real rules for evaluating expressions in λ calculus; they’re called α and β. α is also called “conversion”, and β is also called “reduction”.

α is a renaming operation; basically it says that the names of variables are unimportant: given any expression in λ calculus, we can change the name of the parameter to a function as long as we change all free references to it inside the body.

So – for instance, if we had an expression like:

λ x . if (= x 0) then 1 else x^2

We can do an α to replace X with Y (written “α[x/y]” and get):

λ y . if (= y 0) then 1 else y^2

Doing α does not change the meaning of the expression in any way. But as we’ll see later, it’s important because without it, we’d often wind up with situations where a single variable symbol is bound by two different enclosing λs. This will be particularly important when we get to recursion.

β reduction is where things get interesting: this single rule is all that’s needed to make the λ calculus capable of performing any computation that can be done by a machine.

β basically says that if you have a function application, you can replace it with a copy of the body of the function with references to the parameter identifiers replaced by references to the parameter value in the application. That sounds confusing, but it’s actually pretty easy when you see it in action.

Suppose we have the application expression: (λ x . x + 1) 3. By performing a beta reduction, we can replace the application by taking the body x + 1 of the function, and substituting (or αing) the value of the parameter (3) for the parameter variable symbol (x). So we replace all references to x with 3. So the result of doing a beta reduction xs 3 + 1.

A slightly more complicated example is the expression:

```λ y . (λ x . x + y)) q
```

It’s an interesting expression, because it’s a λ expression that when applied, results in another λ expression: that is, it’s a function that creates functions. When we do beta reduction in this, we’re replacing all references to the parameter y with the identifier q; so, the result is λ x . x + q.

One more example, just for the sake of being annoying. Suppose we have: (λ x y. x y) (λ z . z * z) 3

That’s a function that takes two parameters, and applies the first one to the second one. When we evaluate that, we replace the parameter x in the body of the first function with λ z . z * z; and we replace the parameter y with 3, getting: (λ z . z * z) 3. And we can perform beta on that, getting 3 * 3.

Written formally, beta says: λ x . B e = B[x := e] if free(e) ⊂ free(B[x := e])

That condition on the end, “if free(e) ⊂ free(B[x := e]” is why we need α: we can only do beta reduction if doing it doesn’t create any collisions between bound identifiers and free identifiers: if the identifier “z” is free in “e”, then we need to be sure that the beta-reduction doesn’t make “z” become bound. If there is a name collision between a variable that is bound in “B” and a variable that is free in “e”, then we need to use α to change the identifier names so that they’re different.

As usual, an example will make that clearer: Suppose we have a expression defining a function, λ z . (λ x . x+z). Now, suppose we want to apply it: (λ z . (λ x . x + z)) (x + 2). In the parameter (x + 2), x is free. Now, suppose we break the rule and go ahead and do beta. We’d get “λ x . x + x + 2“. The variable that was free in x + 2 is now bound! We’ve changed the meaning of the function, which we shouldn’t be able to do. If we were to apply that function after the incorrect β, we’d get (λ x . x + x + 2) 3. By beta, we’d get 3 + 3 + 2, or 8.

What if we did α the way we were supposed to?

First, we’d do an α to prevent the name overlap. By α[x/y], we would get λ z . (λ y . y + z) (x+2).

Then by β, we’d get “λ y . y + x + 2“. If we apply this function the way we did above, then by β, we’d get 3+x+2.
3+x+2 and 3+3+2 are very different results!

And that’s pretty much it. There’s another optional rule you can add called η-conversion. η is a rule that adds extensionality, which provides a way of expressing equality between functions.

η says that in any λ expression, I can replace the value f with the value g if/f for all possible parameter values x, f x = g x.

What I’ve described here is Turing complete – a full effective computation system. To make it useful, and see how this can be used to do real stuff, we need to define a bunch of basic functions that allow us to do math, condition tests, recursion, etc. I’ll talk about those in my next post.

It’l also important to point out that while I’ve gone through a basic definition of λ calculus, and described its mechanics, I haven’t yet defined a model for λ-calculus. That’s quite an important omission! λ-calculus was played with by logicians for several years before they were able to come up with a complete model for it, and it was a matter of great concern that although it looked correct, the early attempts to define a model for it were failures! And without a valid model, the results of the system are meaningless. An invalid model in a logical system like calculus is like a contradiction in axioms: it means that nothing that it produces is valid.

Just a quick administrative note. For the last few days, there’s been a huge increase in spam comments. As a result, I’m tightening up the spam filters to prevent them from getting posted even for a brief period. As a result, legitimate comments may take longer to appear. If you have a comment that should have appeared, and it still hasn’t shown up after a few hours, feel free to send me an email, and I’ll bounce over and release it as soon as possible.

For some reason, lately I’ve been seeing a bunch of mentions of Banach Tarski. B-T is a fascinating case of both how counter-intuitive math can be, and also how profoundly people can misunderstand things.

For those who aren’t familiar, Banach-Tarski refers to a topological/measure theory paradox. There are several variations on it, all of which are equivalent.

The simplest one is this: Suppose you have a sphere. You can take that sphere, and slice it into a finite number of pieces. Then you can take those pieces, and re-assemble them so that, without any gaps, you now have two spheres of the exact same size as the original.

Alternatively, it can be formulated so that you can take a sphere, slice it into a finite number of pieces, and then re-assemble those pieces into a bigger sphere.

This sure as heck seems wrong. It’s been cited as a reason to reject the axiom of choice, because the proof that you can do this relies on choice. It’s been cited by crackpots like EE Escultura as a reason for rejecting the theory of real numbers. And there are lots of attempts to explain why it works. For example, there’s one here that tries to explain it in terms of density. There’s a very cool visualization of it here, which tries to make sense of it by showing it in the hyperbolic plane. Personally, most of the attempts to explain it intuitively drive me crazy. One one level, intuitively, it doesn’t, and can’t make sense. But on another level, it’s actually pretty simple. No matter how hard you try, you’re never going to make the idea of turning a finite-sized object into a larger finite-sized object make sense. But on another level, once you think about infinite sets – well, it’s no problem.

The thing is, when you think about it carefully, it’s not really all that odd. It’s counterintuitive, but it’s not nearly as crazy as it sounds. What you need to remember is that we’re talking about a mathematical sphere – that is, an infinite collection of points in a space with a particular set of topological and measure relations.

Here’s an equivalent thing, which is a bit simpler to think about:

Take a line segment. How many points are in it? It’s infinite. So, from that infinite set, remove an infinite set of points. How many points are left? It’s still infinite. Now you’ve got two infinite sets of the same size. So, now you can use one of the sets to create the original line segment, and you can use the second one to create a second, identical line segment.

Still counterintuitive, but slightly easier.

How about this? Take the set of all natural numbers. Divide it into two sets: the set of even naturals, and the set of odd naturals. Now you have two infinite sets,
the set {0, 2, 4, 6, 8, …}, and the set {1, 3, 5, 7, 9, …}. The size of both of those sets is the ω – which is also the size of the original set you started with.

Now take the set of even numbers, and map it so that for any given value i, f(i) = i/2. Now you’ve got a copy of the set of natural numbers. Take the set of odd naturals, and map them with g(i) = (i-1)/2. Now you’ve got a second copy of the set of natural numbers. So you’ve created two identical copies of the set of natural numbers out of the original set of natural numbers.

The problem with Banach-Tarski is that we tend to think of it less in mathematical terms, and more in concrete terms. It’s often described as something like “You can slice up an orange, and then re-assemble it into two identical oranges“. Or “you can cut a baseball into pieces, and re-assemble it into a basketball.” Those are both obviously ridiculous. But they’re ridiculous because they violate one of our instinct that derives from the conservation of mass. You can’t turn one apple into two apples: there’s only a specific, finite amount of stuff in an apple, and you can’t turn it into two apples that are identical to the original.

But math doesn’t have to follow conservation of mass in that way. A sphere doesn’t have a mass. It’s just an uncountably infinite set of points with a particular collection of topological relationship and geometric relationships.

Going further down that path: Banach-Tarski relies deeply of the axiom of choice. The “pieces” that you cut have non-measurable volume. You’re “cutting” from the collection of points in the sphere in a way that requires you to make an uncountably infinite number of distinct “cuts” to produce each piece. It’s effectively a geometric version of “take every other real number, and put them into separate sets”. On that level, because you can’t actually do anything like that, it’s impossible and ridiculous. But you need to remember: we aren’t talking about apples or baseballs. We’re talking about sets. The “slices” in B-T aren’t something you can cut with a knife – they’re infinitely subdivided, not-contiguous pieces. Nothing in the real world has that property, and no real-world process has the ability to cut like that. But we’re not talking about the real world; we’re talking about abstractions. And on the level of abstractions, it’s no stranger than creating two copies of the set of real numbers.

# Categorical Computation Characterized By Closed Cartesian Categories

One of my favorite categorical structures is a thing called a closed cartesian category, or CCC for short. Since I’m a computer scientist/software engineer, it’s a natural: CCCs are, basically, the categorical structure of lambda calculus – and thus, effectively, a categorical model of computation. However, before we can talk about the CCCs, we need – what else? – more definitions.

### Cartesian Categories

A cartesian category $C$ (note not cartesian closed category) is a category:

1. With a terminal object $t$, and
2. $forall a, b in Obj(C)$, the objects and arrows of the categorical product $a times b in C$.

So, a cartesian category is a category closed with respect to product. Many of the common categories are cartesian: the category of sets, and the category of enumerable sets, And of course, the meaning of the categorical product in set? Cartesian product of sets.

### Categorical Exponentials

To get from cartesian categories to cartesian closed categories, we also need to define categorical exponentials. Like categorical product, the value of a categorical exponential is not required to included in a category. The exponential is a complicated definition, and it’s a bit hard to really get your head around, but it’s well worth the effort. If categorical products are the categorical generalization of set products, then the categorical exponential is the categorical version of a function space. It gives us the ability to talk about structures that are the generalized version of “all functions from A to B”.

Given two objects x and y from a category C, their categorical exponential xy, if it exists in the category, is defined by a set of values:

• An object $x^y$,
• An arrow $mbox{eval}_{y,x}: x^y times y rightarrow x$, called an evaluation map.
• $forall z in Obj(C)$, an operation $Lambda_C: (z times y rightarrow x) rightarrow (z rightarrow x^y)$. (That is, an operation mapping from arrows to arrows.)

These values must have the following properties:

1. $forall f : z times y rightarrow x, g : z rightarrow x^y$:
• $mbox{val}_{y,x} circ (Lambda_C(f)times 1_y)$
• $forall f : z times y rightarrow x, g : z rightarrow x^y: Lambda_C(mbox{eval}_{y,x} circ (z times 1_y) = z$

To make that a bit easier to understand, let’s turn it into a diagram.

As I alluded to earlier, you can also think of it as a generalization of a function space.$x^y$ is the set of all functions from y to x. The evaluation map is simple description in categorical terms of an operation that applies a function from a to b (an arrow) to a value from a, resulting in an a value from b.

So what does the categorical exponential mean? I think it’s easiest to explain in terms of sets and functions first, and then just step it back to the more general case of objects and arrows.

If X and Y are sets, then $X^Y$ is the set of functions from Y to X.

Now, look at the diagram:

1. The top part says, basically, that $g$ is a function from $Z$ to to $X^Y$: so $g$ takes a member of $Z$, and uses it to select a function from $Y$ to $X$.
2. The vertical arrow says:
1. given the pair $(z,y)$, $f(z,y)$ maps $(z,y)$ to a value in $X$.
2. given a pair $(z,y)$, we’re going through a function. It’s almost like currying:
1. The vertical arrow going down is basically taking $g(z,y)$, and currying it to $g(z)(y)$.
2. Per the top part of the diagram, $g(z)$ selects a function from $y$ to $x$. (That is, a member of $X^Y$.)
3. So, at the end of the vertical arrow, we have a pair $(g(z), y)$.
3. The “eval” arrow maps from the pair of a function and a value to the result of applying the function to the value.

### Cartesian Closed Categories

Now – the abstraction step is actually kind of easy: all we’re doing is saying that there is a structure of mappings from object to object here. This particular structure has the essential properties of what it means to apply a function to a value. The internal values and precise meanings of the arrows connecting the values can end up being different things, but no matter what, it will come down to something very much like function application.

With exponentials and products, we can finally say what the cartesian closed categories (CCCs). A Cartesian closed category is a category that is closed with respect to both products and exponentials.

Why do we care? Well, the CCCs are in a pretty deep sense equivalent to the simply typed lambda calculus. That means that the CCCs are deeply tied to the fundamental nature of computation. The structure of the CCCs – with its closure WRT product and exponential – is an expression of the basic capability of an effective computing system. So next, we’ll take a look at a couple of examples of what we can do with the CCCs as a categorical model of computation.