All posts by markcc

Inference with Sets in Type Theory

It’s been quite a while since I did any meaningful writing on type theory. I spent a lot of time introducing the basic concepts, and trying to explain the intuition behind them. I also spent some time describing what I think of as the platform: stuff like arity theory, which we need for talking about type construction. But once we move beyond the basic concepts, I ran into a bit of a barrier – no so much in understanding type theory, but in finding ways of presenting it that will be approacha ble.

I’ve been struggling to figure out how to move forward in my exploration of type theory. The logical next step is working through the basics of intuitionistic logic with type theory semantics. The problem is, that’s pretty dry material. I’ve tried to put together a couple of approaches that skip over this, but it’s really necessary.

For someone like me, coming from a programming language background, type theory really hits its stride when we look at type systems and particularly type inference. But you can’t understand type inference without understanding the basic logic. In fact, you can’t describe the algorithm for type inference without referencing the basic inference rules of the underlying logic. Type inference is nothing but building type theoretic proofs based on a program.

So here we are: we need to spend some time looking at the basic logic of type theory. We’ve looked at the basic concepts that underlie the syntax and semantics, so what we need to do next is learn the basic rules that we use to build logical statements in the realm of type theory. (If you’re interested in more detail, this is material from chapter 5 of “Programming in Martin-Löof’s Type Theory”, which is the text I’m using to learn this material.)

Martin Löoff’s type theory is a standard intuitionistic predicate logic, so we’ll go through the rules using standard sequent notation. Each rule is a sequence which looks sort-of like a long fraction. The “numerator” section is a collection of things which we already know are true; the “denominator” is something that we can infer given those truths. Each statement has the form A[\Gamma], where A is a statement, and B is a set of assumptions. For example, F(a) [a \in A, \Delta] means that F(a) is true, provided we’re in a context that includes a \in A.

Personally, I find that this stuff feels very abstract until you take the logical statements, and interpret them in terms of programming. So throughout this post, I’ll do that for each of the rules.

With that introduction out of the way, let’s dive in to the first set of rules.

Simple Introductions

We’ll start off with a couple of really easy rules, which allow us to introduce a variable given a type, or a type given a variable.

Introducing Elements

\frac{A \,\text{type}}{a \in A \,[a \in A]}

This is an easy one. It says that if we know that A is a type, then we can introduce the statement that a \in A, and add that as an assumption in our context. What this means is also simple: since our definition of type says that it’s only a type if it has an element, then if we know that A is a type, we know that there must be an element of A, and so we can write statements using it.

If you think of this in programming terms, the statement A \text{type} is saying that A is a type. To be a valid type, there must be at least one value that belongs to the type. So you’re allowed to introduce a variable that can be assigned a value of the type.

Introducing Propositions as Types

 \frac{a \in A \, []}{A \, \text{true}}

This is almost the mirror image of the previous. A type and a true proposition are the same thing in our type theory: a proposition is just a type, which is a set with at least one member. So if we know that there’s a member of the set A, then A is both a type and a true proposition.

Equality Rules

We start with the three basic rules of equality: equality is reflexive, symmetric, and transitive.


 \frac{a \in A}{a = a \in A}

 \frac{A \, \text{type}}{A = A}

If a is an element of a type A, then a is equal to itself in type A; and if A is a type, then A is equal to itself.

The only confusing thing about this is just that when we talk about an object in a type, we make reference to the type that it’s a part of. This makes sense if you think in terms of programming: you need to declare the type of your variables. “3: Int” doesn’t necessarily mean the same object as “3: Real”; you need the type to disambiguate the statement. So within type theory, we always talk about values with reference to the type that they’re a part of.


\frac{a = b \in A}{b = a \in A}

\frac{A = B}{B = A}

No surprises here – standard symmetry.


\frac{a = b \in A \quad b = c \in A}{a = c \in A}

\frac{A = B \quad B = C}{A = C}

Type Equality

\frac{a \in A \quad A = B}{a \in B}

\frac{a = b \in A \quad A = B}{a = b \in B}

These are pretty simple, and follow from the basic equality rules. If we know that a is a member of the type A, and we know that the type A equals the type B, then obviously a is also a member of B. Along the same lines, if we know that a=b in type A, and A equals B, then a=b in the type B.

Substitution Rules

We’ve got some basic rules about how to formulate some simple meaningful statements in the logic of our type theory. We still can’t do any interesting reasoning; we haven’t built up enough inference rules. In particular, we’ve only been looking at simple, atomic statements using parameterless predicates.

We can use those basic rules to start building upwards, to get to parametric statements, by using substitution rules that allow us to take a parametric statement and reason with it using the non-parametric rules we talked about above.

For example, a parametric statement can be something like C(x) \, \text{type} [x \in A], which says that applying C to a value x which is a member of type A produces a value which is a type. We can use that to produce new inference rules like the ones below.

 \frac{C(x) \, \text{type} [x \in A] \quad a \in A}{C(a) \, \text{type}}

This says that if we know that given a of type A, C will produce a type; and we know that the value a is of type A, then C(a) will be a type. In logical terms, it’s pretty straightforward; in programming terms it’s even clearer: if C is a function on type A, and we pass it a value of type A, it will produce a result. In other words, C(a) is defined for all values of type A.

 \frac{C(x) \, \text{type}[x \in A] \quad a = b \in A}{C(a) = C(b)}

This is even simpler: if C is a function on type A, then given two values that are equal in type A, C will produce the same result for those values.

Of course, I’m lying a bit. In this stuff, C isn’t really a function. It’s a logical statement; C(x) isn’t quite a function. It’s a logical stamement which includes the symbol x; when we say C(a), what we mean is the logical statement C, with the object a substituted for the symbol x. But I think the programming metaphors help clarify what it means.

Using those two, we can generate more:

 \frac{c(x) \in C(x) [x \in A] \quad a \in A}{c(a) \in C(A)}

This one becomes interesting. C(x) is a proposition which is parametric in x. Then c(x) is a proof-element: it’s an instance of C(x) which proves that C(X) is a type, and we can see c as a computation which, given an element of a produces a instance of C. Then what this judgement says is that given an instance a of type A, we know that c(a) is an instance of type C(a). This will become very important later on, when we really get in to type inference and quantification and parametric types.

\frac{c(x) \in C(x) [x \in A] \quad a = b}{c(a)=c(b)\in C(a)}

This is just a straightforward application of equality to proof objects.

There’s more of these kinds of rules, but I’m going to stop here. My goal isn’t to get you to know every single judgement in the intuitionistic logic of type theory, but to give you a sense of what they mean.

That brings us to the end of the basic inference rules. The next things we’ll need to cover are ways of constructing new types or types from existing ones. The two main tools for that are enumeration types (basically, types consisting of a group of ordered values), and cartesian products of multiple types. With those, we’ll be able to find ways of constructing most of the types we’ll want to use in programming languages.

A Gentle Explanation of the Intel Speculative Execution CPU Bug

There’s a lot of talk today about the recent discovery of a significant bug in the Intel CPUs that so many of us use every day. It’s an interesting problem, and understanding it requires knowing a little bit about how CPUs work, so I thought I’d take a shot at writing an explainer.

Let me start with a huge caveat: this involves a lot of details about how CPUs work, and in order to explain it, I’m going to simplify things to an almost ridiculous degree in order to try to come up with an explanation that’s comprehensible to a lay person. I’m never deliberately lying about how things work, but at times, I’m simplifying enough that it will be infuriating to an expert. I’m doing my best to explain my understanding of this problem in a way that most people will be able to understand, but I’m bound to oversimplify in some places, and get details wrong in others. I apologize in advance.

It’s also early days for this problem. Intel is still trying to keep the exact details of the bug quiet, to make it harder for dishonest people to exploit it. So I’m working from the information I’ve been able to gather about the issue so far. I’ll do my best to correct this post as new information comes out, but I can only do that when I’m not at work!

That said: what we know so far is that the Intel bug involves non-kernel code being able to access cached kernel memory through the use of something called speculative execution.

To an average person, that means about as much as a problem in the flux thruster atom pulsar electrical ventury space-time implosion field generator coil.

Let’s start with a quick overview of a modern CPU.

The CPU, in simple terms, the brain of a computer. It’s the component that actually does computations. It reads a sequence of instructions from memory, and then follows those instructions to perform some computation on some values, which are also stored in memory. This is a massive simplification, but basically, you can think of the CPU as a pile of hardware than runs a fixed program:

def simplified_cpu_main_loop():
  IP = 0
  while true:
     (op, in1, in2, out) = fetch(IP)
     val1 = fetch(in1)
     val2 = fetch(in2)
     result, IP = perform(op, in1, in2)
     store(result, out)

There’s a variable called the instruction pointer (abbreviated IP) built-in to the CPU that tells it where to fetch the next instruction from. Each time the clock ticks, the CPU fetches an instruction from the memory address stored in the instruction pointer, fetches the arguments to that instruction from cells in memory, performs the operation on those arguments, and then stores the result into another cell in the computer memory. Each individual operation produces both a result, and a new value for the instruction pointer. Most of the time, you just increment the instruction pointer to look at the next instruction, but for comparisons or branches, you can change it to something else.

What I described above is how older computers really worked. But as CPUs got faster, chipmaker ran into a huge problem: the CPU can perform its operations faster and faster every year. But retrieving a value from memory hasn’t gotten faster at the same rate as executing instructions. The exact numbers don’t matter, but to give you an idea, a modern CPU can execute an instruction in less than one nanosecond, but fetching a single value from memory takes more than 100 nanoseconds. In the scheme we described above, you need to fetch the instruction from memory (one fetch), and then fetch two parameters from memory (another two fetches), execute the instruction (1 nanosecond), and then store the result back into memory (one store). Assuming a store is no slower than a fetch, that means that for one nanosecond of computation time, the CPU needs to do 3 fetches and one store for each instruction. That means that the CPU is waiting, idle, for at least 400ns, during which it could have executed another 400 instructions, if it didn’t need to wait for memory.

That’s no good, obviously. There’s no point in making a fast CPU if all it’s going to do is sit around and wait for slow memory. But designers found ways to work around that, by creating ways to do a lot of computation without needing to pause to wait things to be retrieved from/stored to memory.

One of those tricks was to add registers to the CPUs. A register is a cell of memory inside of the CPU itself. Early processors (like the 6502 that was used by the Apple II) had one main register called an accumulator. Every arithmetic instruction would work by retrieving a value from memory, and then performing some arithmetic operation on the value in the accumulator and the value retrieved from memory, and leave the result in the accumulator. (So, for example, if 0x1234 were the address variable X, you could add the value of X to the accumulator with the instruction “ADD (1234)”. More modern CPUs added many registers, so that you can keep all of the values that you need for some computation in different registers. Reading values from or writing values to registers is lightning fast – in fact, it’s effectively free. So you structure your computations so that they load up the registers with the values they need, then do the computation in registers, and then dump the results out to memory. Your CPU can run a fairly long sequence of instructions without ever pausing for a memory fetch.

Expanding on the idea of putting memory into the CPU, they added ways of reducing the cost of working with memory by creating copies of the active memory regions on the CPU. These are called caches. When you try to retrieve something from memory, if it’s in the cache, then you can access it much faster. When you access something from a memory location that isn’t currently in the cache, the CPU will copy a chunk of memory including that location into the cache.

You might ask why, if you can make the cache fast, why not just make all of memory like the cache? The answer is that the time it takes in hardware to retrieve something from memory increases with amount of memory that you can potentially access. Pointing at a cache with 1K of memory is lightning fast. Pointing at a cache with 1 megabyte of memory is much slower that the 1K cache, but much faster that a 100MB cache; pointing at a cache with 100MB is even slower, and so on.

So what we actually do in practice is have multiple tiers of memory. We have the registers (a very small set – a dozen or so memory cells, which can be accessed instantly); a level-0 cache (on the order of 8k in Intel’s chips), which is pretty fast; a level-1 cache (100s of kilobytes), an L2 cache (megabytes), L3 (tens of megabytes), and now even L4 (100s of megabytes). If something isn’t in L0 cache, then we look for it in L1; if we can’t find it in L1, we look in L2, and so on, until if we can’t find it in any cache, we actually go out to the main memory.

There’s more we can do to make things faster. In the CPU, you can’t actually execute an entire instruction all at once – it’s got multiple steps. For a (vastly simplified) example, in the pseudocode above, you can think of each instruction as four phases: (1) decode the instruction (figuring out what operation it performs, and what its parameters are), (2) fetch the parameters, (3) perform the operations internal computation, and (4) write out the result. By taking advantage of that, you can set up your CPU to actually do a lot of work in parallel. If there are three phases to executing an instruction, then you can execute phase one of instruction one in one cycle; phase one of instruction two and phase two of instruction one in the next cycle; phase one of instruction three, phase two of instruction two, and phase three of instruction one in the third cycle. This process is called pipelining.

To really take advantage of pipelining, you need to keep the pipeline full. If your CPU has a four-stage pipeline, then ideally, you always know what the next four instructions you’re going to execute are. If you’ve got the machine code version of an if-then-else branch, when you start the comparison, you don’t know what’s going to come next until you finish it, because there are two possibilities. That means that when you get to phase 2 of your branch instruction, you can’t start phase one of the next instruction. instruction until the current one is finished – which means that you’ve lost the advantage of your pipeline.

That leads to another neat trick that people play in hardware, called branch prediction. You can make a guess about which way a branch is going to go. An easy way to understand this is to think of some numerical code:

def run_branch_prediction_demo():
  for i in 1 to 1000:
     for j in 1 to 1000:
          q = a[i][j] * sqrt(b[i][j])

After each iteration of the inner loop, you check to see if j == 1000. If it isn’t, you branch back to the beginning of that loop. 999 times, you branch back to the beginning of the loop, and one time, you won’t. So you can predict that you take the backward branch, and you can start executing the early phases of the first instructions of the next iteration. That may, most of the time you’re running the loop, your pipeline is full, and you’re executing your computation quickly!

The catch is that you can’t execute anything that stores a result. You need to be able to say “Oops, everything that I started after that branch was wrong, so throw it away!”. Alongside with branch prediction, most CPUs also provide speculative execution, which is a way of continuing to execute instructions in the pipeline, but being able to discard the results if they’re the result of an incorrect branch prediction.

Ok, we’re close. We’ve got to talk about just another couple of basic ideas before we can get to just what the problem is with these Intel chips.

We’re going to jump up the stack a bit, and instead of talking directly about the CPU hardware, we’re going to talk about the operating system, and how it’s implemented on the CPU.

An operating system is just a program that runs on your computer. The operating system can load and run other programs (your end-user applications), and it manages all of the resources that those other programs can work with. When you use an application that allocates memory, it sent a request called a syscall to the operating system asking it to give it some memory. If your application wants to read data from a disk drive, it makes a syscall to open a file and read data. The operating system is responsible for really controlling all of those resources, and making sure that each program that’s running only accesses the things that it should be allowed to access. Program A can only use memory allocated by program A; if it tries to access memory allocated by program B, it should cause an error.

The operating system is, therefore, a special program. It’s allowed to touch any piece of memory, any resource owned by anything on the computer. How does that work?

There are two pieces. First, there’s something called memory protection. The hardware provides a mechanism that the CPU can use to say something like “This piece of memory is owned by program A”. When the CPU is running program A, the memory protection system will arrange the way that memory looks to the program so that it can access that piece of memory; anything else just doesn’t exist to A. That’s called memory mapping: the system memory of the computer is mapped for A so that it can see certain pieces of memory, and not see others. In addition to memory mapping, the memory protection system can mark certain pieces of memory as only being accessible by privileged processes.

Privileged processes get us to the next point. In the CPU, there’s something called an execution mode: programs can run in a privileged mode (sometimes called kernel space execution), or it can run in a non-privileged mode (sometimes called user-space execution). Only code that’s running in kernel-space can do things like send commands to the memory manager, or change memory protection settings.

When your program makes a syscall, what really happens is that your program puts the syscall parameters into a special place, and then sends a signal called an interrupt. The interrupt switches the CPU into system space, and gives control to the operating system, which reads the interrupt parameters, and does whatever it needs to. Then it puts the result where the user space program expects it, switches back to user-space, and then allows the user space program to continue.

That process of switching from the user space program to the kernel space, doing something, and then switching back is called a context switch. Context switches are very expensive. Implemented naively, you need to redo the memory mapping every time you switch. So the interrupt consists of “stop what you’re doing, switch to privileged mode, switch to the kernel memory map, run the syscall, switch to the user program memory map, switch to user mode”.

Ok. So. We’re finally at the point where we can actually talk about the Intel bug.

Intel chips contain a trick to make syscalls less expensive. Instead of having to switch memory maps on a syscall, they allow the kernel memory to be mapped into the memory map of every process running in the system. But since kernel memory can contain all sorts of secret stuff (passwords, data belonging to other processes, among other things), you can’t let user space programs look at it – so the kernel memory is mapped, but it’s marked as privileged. With things set up that way, a syscall can drop the two “switch memory map” steps in the syscall scenario. Now all a syscall needs to do is switch to kernel mode, run the syscall, and switch back to user mode. It’s dramatically faster!

Here’s the problem, as best I understand from the information that’s currently available:

Code that’s running under speculative execution doesn’t do the check whether or not memory accesses from cache are accessing privileged memory. It starts running the instructions without the privilege check, and when it’s time to commit to whether or not the speculative execution should be continued, the check will occur. But during that window, you’ve got the opportunity to run a batch of instructions against the cache without privilege checks. So you can write code with the right sequence of branch instructions to get branch prediction to work the way you want it to; and then you can use that to read memory that you shouldn’t be able to read.

With that as a starting point, you can build up interesting exploits that can ultimately allow you to do almost anything you want. It’s not a trivial exploit, but with a bit of work, you can use a user space program to make a sequence of syscalls to get information you want into memory, and then write that information wherever you want to – and that means that you can acquire root-level access, and do anything you want.

The only fix for this is to stop doing that trick where you map the kernel memory into every user space memory map, because there’s no way to enforce the privileged memory property in speculative execution. In other words, drop the whole syscall performance optimization. That’ll avoid the security issue, but it’s a pretty expensive fix: requiring a full context switch for every syscall will slow down the execution of user space programs by something between 5 and 30 percent.

Zombie Math in the Vortex

Paul Krugman has taken to calling certain kinds of economic ideas zombie economics, because no matter how many times they’re shown to be false, they just keep coming back from the dead. I certainly don’t have stature that compares in any way to Krugmant, but I’m still going to use his terminology for some bad math. There are some crackpot ideas that you just can’t kill.

For example, vortex math. I wrote about vortex math for the first time in 2012, again in early 2013, and again in late 2013. But like a zombie in a bad movie, it’s fans won’t let it stay dead. There must have been a discussion on some vortex-math fan forum recently, because over the last month, I’ve been getting comments on the old posts, and emails taking me to task for supposedly being unfair, closed-minded, ignorant, and generally a very nasty person.

Before I look at any of their criticisms, let’s start with a quick refresher. What is vortex math?

We’re going to create a pattern of single-digit numbers using multiples of 2. Take the number 1. Multiply it by 2, and you get 2. Multiple it by 2, and you get 4. Again, you get 8. Again, and you get 16. 16 is two digits, but we only want one-digit numbers, so we add them together, getting 7. Double, you get 14, so add the digits, and you get 5. Double, you get 10, add the digits, and you get 1. So you’ve got a repeating sequence: 1, 2, 4, 8, 7, 5, …

Take the numbers 1 through 9, and put them at equal distances around the perimeter of a circle. Draw an arrow from a number to its single-digit double. You end up with something that looks kinda-sorta like the infinity symbol. You can also fit those numbers onto the surface of a torus.

That’s really all there is to vortex math. This guy named Marco Rodin discovered that there’s a repeating pattern, and if you draw it on a circle, it looks kinda-like the infinity symbol, and that there must be something incredibly profound and important about it. Launching from there, he came up with numerous claims about what that means. According to vortex math, there’s something deeply significant about that pattern:

  1. If you make metallic windings on a toroidal surface according to that pattern and use it as a generator, it will generate free energy.
  2. Take that same coil, and run a current through it, and you have a perfect, reactionless space drive (called “the flux thruster atom pulsar electrical ventury space time implosion field generator coil”).
  3. If you use those numbers as a pattern in a medical device, it will cure cancer, as well as every other disease.
  4. If you use that numerical pattern, you can devise better compression algorithms that can compress any string of bits.
  5. and so on…

Essentially, according to vortex math, that repeated pattern of numbers defines a “vortex”, which is the deepest structure in the universe, and it’s the key to understanding all of math, all of physics, all of metaphysics, all of medicine. It’s the fundamental pattern of everything, and by understanding it, you can do absolutely anything.

As a math geek, the problem with stuff like vortex math is that it’s difficult to refute mathematically, because even though Rodin calls it math, there’s really no math to it. There’s a pattern, and therefore magic! Beyond the observation that there’s a pattern, there’s nothing but claims of things that must be true because there’s a pattern, without any actual mathematical argument.

Let me show you an example, from one of Rodin’s followers, named Randy Powell.

I call my discovery the ABHA Torus. It is now the full completion of how to engineer Marko Rodin’s Vortex Based Mathematics. The ABHA Torus as I have discovered it is the true and perfect Torus and it has the ability to reveal in 3-D space any and all mathematical/geometric relationships possible allowing it to essentially accomplish any desired functional application in the world of technology. This is because the ABHA Torus provides us a mathematical framework where the true secrets of numbers (qualitative relationships based on angle and ratio) are revealed in fullness.

This is why I believe that the ABHA Torus as I have calculated is the most powerful mathematical tool in existence because it presents proof that numbers are not just flat imaginary things. To the contrary, numbers are stationary vector interstices that are real and exhibiting at all times spatial, temporal, and volumetric qualities. Being stationary means that they are fixed constants. In the ABHA Torus the numbers never move but the functions move through the numbers modeling vibration and the underlying fractal circuitry that natures uses to harness living energy.

The ABHA Torus as revealed by the Rodin/Powell solution displays a perfectly symmetrical spin array of numbers (revealing even prime number symmetry), a feat that has baffled countless scientists and mathematicians throughout the ages. It even uncovers the secret of bilateral symmetry as actually being the result of a diagonal motion along the surface and through the internal volume of the torus in an expanding and contracting polarized logarithmic spiral diamond grain reticulation pattern produced by the interplay of a previously unobserved Positive Polarity Energetic Emanation (so-called ‘dark’ or ‘zero-point’ energy) and a resulting Negative Polarity Back Draft Counter Space (gravity).

If experimentally proven correct such a model would for example replace the standard approach to toroidal coils used in energy production today by precisely defining all the proportional and angular relationships existent in a moving system and revealing not only the true pathway that all accelerated motion seeks (be it an electron around the nucleus of an atom or water flowing down a drain) but in addition revealing this heretofore unobserved, undefined point energetic source underlying all space-time, motion, and vibration.

Lots of impressive sounding words, strung together in profound sounding ways, but what does it mean? Sure, gravity is a “back draft” of an unobserved “positive polarity energetic emanatation”, and therefore we’ve unified dark energy and gravity, and unified all of the forces of our universe. That sounds terrific, except that it doesn’t mean anything! How can you test that? What evidence would be consistent with it? What evidence would be inconsistent with it? No one can answer those questions, because none of it means anything.

As I’ve said lots of times before: there’s a reason for the formal framework of mathematics. There’s a reason for the painful process of mathematical proof. There’s a reason why mathematicians and scientists have devised an elaborate language and notation for expressing mathematical ideas. And that reason is because it’s easy to string together words in profound sounding ways. It’s easy to string together reasoning in ways that look like they might be compelling if you took the time to understand them. But to do actual mathematics or actual science, you need to do more that string together something that sounds good. You need to put together something that is precise. The point of mathematical notation and mathematical reasoning is to take complex ideas and turn them into precisely defined, unambiguous structures that have the same meaning to everyone who looks at them.

“positive polarity energetic emanation” is a bunch of gobbledegook wordage that doesn’t mean anything to anyone. I can’t refute the claim that gravity is a back-draft negative polarity energetic reaction to dark energy. I can’t support that claim, either. I can’t do much of anything with it, because Randy Powell hasn’t said anything meaningful. It’s vague and undefined in ways that make it impossible to reason about in any way.

And that’s the way that things go throughout all of vortex math. There’s this cute pattern, and it must mean something! Therefore… endless streams of words, without any actual mathematical, physical, or scientific argument.

There’s so much wrong with vortex math, but it all comes down to the fact that it takes some arbitrary artifacts of human culture, and assigns them deep, profound meaning for no reason.

There’s this pattern in the doubling of numbers and reducing them to one digit. Why multiple by two? Because we like it, and it produces a pretty pattern. Why not use 3? Well, because in base-10, it won’t produce a good pattern: [1, 3, 9, 9, 9, 9, ….] But we can pick another number like 7: [1, 7, 5, 8, 2, 5, 8, 2, 5, ….], and get a perfectly good series: why is that series less compelling than [1, 4, 8, 7, 2, 5]?

There’s nothing magical about base-10. We can do the same thing in base-8: [1, 2, 4, 1, 2, 4…] How about base-12, which was used for a lot of stuff in Egypt? [1, 2, 4, 8, 5, 10, 9, 7, 3, 6, 1] – that gives us a longer pattern! What makes base-10 special? Why does the base-10 pattern mean something that other bases, or other numerical representations, don’t? The vortex math folks can’t answer that. (Note: I made an arithmetic error in the initial version of the base-12 sequence above. It was pointed out in comments by David Wallace. Thanks!)

If we plot the numbers on a circle, we get something that looks kind-of like an infinity symbol! What does that mean? Why should the infinity symbal (which was invented in the 17th century, and chosen because it looked sort of like a number, and sort-of like the last letter of the greek alphabet) have any intrinsic meaning to the universe?

It’s giving profound meaning to arbitrary things, for unsupported reasons.

So what’s in the recent flood of criticism from the vortex math guys?

Well, there’s a lot of “You’re mean, so you’re wrong.” And there’s a lot of “Why don’t you prove that they’re wrong instead of making fun of them?”. And last but not least, there’s a lot of “Yeah, well, the fibonacci series is just a pattern of numbers too, but it’s really important”.

On the first: Yeah, fine, I’m mean. But I get pretty pissed at seeing people get screwed over by charlatans. The vortex math guys use this stuff to take money from “investors” based on their claims about producing limitless free energy, UFO space drives, and cancer cures. This isn’t abstract: this kind of nonsense hurts people. They people who are pushing these scams deserve to be mocked, without mercy. They don’t deserve kindness or respect, and they’re not going to get it from me.

I’d love to be proved wrong on this. One of my daughter’s friends is currently dying of cancer. I’d give up nearly anything to be able to stop her, and other children like her, from dying an awful death. If the vortex math folks could do anything for this poor kid, I would gladly grovel and humiliate myself at their feet. I would dedicate the rest of my life to nothing but helping them in their work.

But the fact is, when they talk about the miraculous things vortex math can do? At best, they’re delusional; more likely, they’re just lying. There is no cure for cancer in [1, 2, 4, 8, 7, 5, 1].

As for the Fibonacci series: well. It’s an interesting pattern. It does appear to show up in some interesting places in nature. But there are two really important differences.

  1. The Fibonacci series shows up in every numeric notation, in every number base, no matter how you do numbers.
  2. It does show up in nature. This is key: there’s more to it than just words and vague assertions. You can really find fragments of the Fibonacci series in nature. By doing a careful mathematical analysis, you can find the Fibonacci series in numerous places in mathematics, such as the solutions to a range of interesting dynamic optimization problems. When you find a way of observing the vortex math pattern in nature, or a way of producing actual numeric solutions for real problems, in a way that anyone can reproduce, I’ll happily give it another look.
  3. The Fibonacci series does appear in nature – but it’s also been used by numerous crackpots to make ridiculous assertions about how the world must work!

Garbage Collection with Semispaces

The roots of most garbage collection ideas come from the Lisp community. Lisp was really the first major garbage collection language that was used to write complicated things. So it’s natural that the first big innovation in the world of GC that we’re going to look at comes from the Lisp community.

In early Lisp systems with garbage collection, the pause that occured when the GC did a mark/sweep to reclaim memory was very long, so it was important to find ways to make the cycle faster. Lisp code had the properly that it tended to allocate a lot of small, short-lived objects: Lisp, particularly early lisp, tended to represent everything using tiny structures called cons cells, and Lisp programs generate bazillions of them. Lots of short-lived cons cells needed to get released in every GC cycle, and the bulk of the GC pause was caused by the amount of time that the GC spend going through all of the dead cons cells, and releasing them.

Beyond just that speed issue, there’s another problem with naive mark-sweep collection when you’re dealing with large numbers of short lived objects, called heap fragmentation. The GC does a pass marking all of the memory in use, and then goes through each unused block of memory, and releases it. What can happen is that you can end up with lots of memory free, but scattered around in lots of small chunks. For an extreme example, imagine that you’re building two lists made up of 8-byte cells. You allocate a cell for list A, and then you do something using A, and generate a new value which you add as a new cell in list B. So you’re alternating: allocate a cell for A, then a cell for B. When you get done, you discard A, and just keep B. After the GC runs, what does your memory look like? If A and B each have 10,000 cells, then what you have is 8 bytes of free memory that used to be part of A, and then 8 bytes of used memory for a cell of B, then 8 bytes of free, then 8 used, etc. You’ve ended up with 80,000 bytes of free memory, none of which can be used to store anything larger than 8 bytes. Eventually, you can wind up with your entire available heap broken into small enough pieces that you can’t actually use it for anything.

What the lisp folks came up with is a way of getting rid of fragmentation, and dramatically reducing the cost of the sweep phase, by using something called semispaces. With semispaces, you do some cleverness that can be summed up as moving from mark-sweep to copy-swap.

The idea is that instead of keeping all of your available heap in one chunk, you divide it into two equal regions, called semispaces. You call one of the semispaces the primary, and the other the secondary. When you allocate memory, you only allocate from the primary space. When the primary space gets to be almost full, you start a collection cycle.

When you’re doing your mark phase, instead of just marking each live value, you copy it to the secondary space. When all of the live values have been copied to the secondary space, you update all of the pointers within the live values to their new addresses in the secondary space.

Then, instead of releasing each of the unused values, you just swap the primary and secondary space. Everything in the old primary space gets released, all at once. The copy phase also compacts everything as it moves it into the secondary space, consolidating all of the free memory in one contiguous chunk. If you implement it well, you can also have beneficial side effect of moving things close in ways that improve the cache performance of your program.

For Lisp programs, semispaces are a huge win: they reduce the cost of the sweep phase to a constant time, at the expense of a nearly linear increase in mark time, which works out really well. And it eliminates the problem of fragmentation. All in all, it’s a great tradeoff!

Of course, it’s got some major downsides as well, which can make it work very poorly in some cases:

  1. The copy phase is significantly more expensive than a traditional mark-phase. The time it takes to copy is linear in the total amount of live data, versus linear in the number of live objects in a conventional mark. Whether semispaces will work well for a given application depend on the properties of the data that you’re working with. If you’ve got lots of large objects, then the increase in time caused by the copy instead of mark can significantly outweigh the savings of the almost free swap, making your GC pauses much longer; but if you’ve got lots of short-lived, small objects, then the increase in time for the copy can be much smaller than time savings from the swap, resulting in dramatically shortened GC pauses.
  2. Your application needs to have access to twice as much memory as you actually expect it to use, because you need two full semispaces. There’s really no good way around this: you really need to have a chunk of unused memory large enough to store all of your live objects – and it’s always possible that nearly everything is alive for a while, meaning that you really do need two equally sized semispaces.
  3. You don’t individually release values, which means that you can’t have any code that runs when an value gets collected. In a conventional mark-sweep, you can have objects provide functions called finalizers to help them clean up when they’re released – so objects like files can close themselves. In a semispace, you can’t do that.

The basic idea of semispaces is beautiful, and it’s adaptable to some other environments where a pure semispace doesn’t make sense, but some form of copying and bulk release can work out well.

For example, years ago, at a previous job, one of my coworkers was working on a custom Java runtime system for a large highly scalable transaction processing system. The idea of this was that you get a request from a client system to perform some task. You perform some computation using the data from the client request, update some data structures on the server, and then return a result to the client. Then you go on to the next request.

The requests are mostly standalone: they do a bunch of computation using the data that they recieved in the request. Once they’re done with a given request, almost nothing that they used processing it will ever be looked at again.

So what they did was integrate a copying GC into the transaction system. Each time they started a new transaction, they’d give it a new memory space to live it. When the transaction finished, they’d do a quick copy cycle to copy out anything that was referenced in the master server data outside the transaction, and then they’d just take that chunk of memory, and make it available for use by the next transaction.

The result? Garbage collection became close to free. The number of pointers into the transaction space from the master server data was usually zero, one, or two. That meant that the copy phase was super-short. The release phase was constant time, just dropping the pointer to the transaction space back into the available queue.

So they were able to go from an older system which had issues with GC pauses to a new system with no pauses at all. It wouldn’t work outside of that specific environment, but for that kind of application, it screamed.

A Beginner’s Guide to Garbage Collection

I was just reading an interesting paper about garbage collection (GC), and realized that I’d never written about it here, so I thought maybe I’d write a couple of articles about it. I’m going to start by talking about the two most basic techniques: mark and sweep collection, and reference counting. In future posts, I’ll move on to talk about various neat things in the world of GC.

So let’s start at the beginning. What is garbage collection?

When you’re writing a program, you need to store values in memory. Most of the time, if you want to do something interesting, you need to be able to work with lots of different values. You read data from your user, and you need to be able to create a place to store it. So (simplifying a bit) you ask your operating system to give you some memory to work with. That’s called memory allocation.

The thing about memory allocation is that the amount of memory that a computer has is finite. If you keep on grabbing more of your computer’s memory, you’re eventually going to run out. So you need to be able to both grab new memory when you need it, and then give it back when you’re done.

In many languages (for example, C or C++), that’s all done manually. You need to write code that says when you want to grab a chunk of memory, and you also need to say when you’re done with it. Your program needs to keep track of how long it needs to use a chunk of memory, and give it back when it’s done. Doing it that way is called manual memory management.

For some programs, manual memory management is the right way to go. On the other hand, it can be very difficult to get right. When you’re building a complicated system with a lot of interacting pieces, keeping track of who’s using a given piece of memory can become extremely complicated. It’s hard to get right – and when you don’t get it right, your program allocates memory and never gives it back – which means that over time, it will be using more and more memory, until there’s none left. That’s called a memory leak. It’s very hard to write a complicated system using manual memory management without memory leaks.

You might reasonably ask, what makes it so hard? You’re taking resources from the system, and using them. Why can’t you just give them back when you’re done with them?

It’s easiest to explain using an example. I’m going to walk through a real-life example from one of my past jobs.

I was working on a piece of software that managed the configuration of services for a cluster management platform. In the system, there were many subsystems that needed to be configured, but we wanted to have one configuration. So we had a piece of configuration that was used to figure out what resources were needed to run a service. There was another piece that was used to figure out where log messages would get stored. There was another piece that specified what was an error that was serious enough to page an engineer. There was another piece that told the system how to figure out which engineer to page. And so on.

We’d process the configuration, and then send pieces of it to the different subsystems that needed them. Often, one subsystem would then need to grab information from the piece of configuration that was the primary responsibility of a different subsystem. For example, when there’s an major error, and you need to page an engineer, we wanted to include a link to the appropriate log in the page. So the pager needed to be able to get access to the logging configuration.

The set of components that worked as part of this configuration system wasn’t fixed. People could add new components as new things got added to the system. Each component would register which section of the configuration it was interested in – but then, when it received its configuration fragment, it could also ask for other pieces of the configuration that it needed.

So, here’s the problem. Any given piece of the configuration could be used by 1, or 2, or 3, or 4, or 20 different components. Which piece of the system should be responsible for knowing when all of the other components are done using it? How can it keep track of that?

That’s the basic problem with manual memory management. It’s easy in easy cases, but in complex systems with overlapping realms of responsibility, where multiple systems are sharing data structures in memory, it’s difficult to build a system where there’s one responsible agent that knows when everyone is done with a piece of memory.

It’s not impossible, but it’s difficult. In a system like the one above, the way that we made it work was by doing a lot of copying of data. Instead of having one copy of a chunk of evaluated configuration which was shared among multiple readers, we’d have many copies of the same thing – one for each component. That worked, but it wasn’t free. We ended up needing to use well over ten times as much memory as we could have using shared data structures. When you’ve got a system that could work with a gigabyte of data, multiplying it by ten is a pretty big deal! Even if you’ve got a massive amount of memory available, making copies of gigabytes of data takes a lot of time!

The most important point here is to understand just how hard it is to get this stuff right. I’ve been a software engineer for a long time, and I’ve worked on a lot of systems. Until the advent of the Rust programming language, I’d never seen a single long-running system built with manual memory management that didn’t have a memory leak somewhere. (I’ll talk more about Rust and how it manages to accomplish this in a later post.)

So manual memory management is very hard to get right, and it can potentially be pretty expensive. On the other hand, it’s predictable: you know, when you write your code, what the costs of managing memory will be, because you wrote the code that does it. And, if you get it right, you can control exacly how much memory your program is using at any time.

The alternative to manual memory management is to somehow make the program automatically keep track of memory, and get rid of it when it’s no longer used. But how do you know when something is no longer used?

It’s pretty easy. You call a chunk of memory live if it can be reached by any variable in the program. If it can’t, it’s garbage, and you can get rid of it. Garbage collection is any mechanism in a programming language or execution environment that automatically figures out when something is garbage, and releases it.

There are two basic methods that we can use to figure out which chunks of memory contain live values, and which are garbage. They’re called reference counting and mark-sweep. (There’s a pool of people who are going to get angry at this definition because, they argue, reference counting isn’t garbage collection. They insist that reference counting is something fundmentally different, and that only mark-sweep is really garbage collection. Obviously I disagree. The definition that I’m using is that anything which automatically releases unused memory is garbage collection.)

In reference counting, each block of memory that gets allocated includes a counter called its reference count. Every time you create a new way of referring to something – by assigning it to a variable, or passing it as a parameter, assigning it to a field of another data structure – you add one to the reference count of the block of memory that contains it. Every time you remove a way of referencing something – by changing a variable, or returning from a function call, or garbage collecting a data structure that referenced it, you decrement its reference count by one. When the reference count for a block of memory hits zero, you can release it.

It’s simple, and it’s predictable. You know that the moment you stop using something, it’s going to get released. That’s great! But there are some problems with reference counting. First, you need to make sure that every single time you change anything, you correctly update the reference counts. Miss any updates, and either things will get released before you’re done with them, or things won’t get released and you’ll leak memory. The other, potentially bigger problem, is that there are a bunch of data structures where simple reference counting doesn’t work. For example, think of a doubly-linked list. That’s a list of values, stored so that each value in the list contains pointers to both the element ahead of it in the list, and to the element behind it in the list. Every element in that list always has at least one thing pointing to it. So none of their reference counts will ever be zero, and no element of the list will ever get collected! (There are ways around that, which we’ll talk about in a later post.)

The other main garbage collection technique is called mark-sweep. In mark-sweep, you have a two-phase process: in the mark phase, you walk over all of the data structures figuring out what’s still being used, and in the sweep phase, you free up anything that isn’t getting used.

In the marking phase, you start with a set of pointers called the root set. The root set consists of the things that you know are being used: the values of all of the variables in the parts of your program that are running, and anything that’s being referenced by the execution environment.

You create a marking queue, consisting initially of the root-set. Then you start to process the queue. For each element in the queue, if it hasn’t been marked yet, you mark it as alive, and then you add everything that it contains a reference to to the queue. If it was already marked as live, you just skip over it: it’s done.

Once the mark phase is done, everything that can possibly be referenced by your running program has been marked as live. So now you can sweep: go through the memory space of your program, and release anything that wasn’t marked as live. Boom: you’ve just gotten rid of everything that’s no longer needed.

Naive mark-sweep has one really big problem: your program can’t change anything during the mark phase! That means that any time you want to release some unusued memory, you need to stop the execution of your program while the garbage collection is going through memory, figuring out what’s still alive.

Personally, I really love working in garbage collected languages. In modern GC systems, the pauses are relatively non-intrusive, and the execution time cost of them is often significantly less than the additional copy-costs of manual collection. But it’s far from a panacaea: it doesn’t even completely prevent memory leaks! (One of the things that surprised me quite a bit earlier in my career was discovering a huge memory leak in a Java program.)

Anyway, that’s the intro to the general ideas. In subsequent posts, I’ll talk about a lot of different things in the area of memory management and garbage collection. I’m mostly going to focus on mark-sweep: reference counting is a very simple idea, and most of the applications of it focus on maintaining that simplicity. But in the world of mark-sweep, there’s a ton of interesting stuff: semispaces (which make the sweep phase of GC faster and more effective), generational garbage collection (which makes the GC system faster, and reduces the number of pauses), incremental collection (which allows the mark phase to be done without stopping the whole program), and various techniques used to implement all of this, like read-barriers, write barriers, and colored pointers.

Introduction to Neural Networks

In preparation for starting a new job next week, I’ve been doing some reading about neural networks and deep learning. The math behind neural networks is pretty interesting, so I thought I’d take my notes, and turn them into some posts.

As the name suggests, the basic idea of a neural network is to construct a computational system based on a simple model of a neuron. If you look at a neuron under a microscope, what you see is something vaguely similar to:

It’s a cell with three main parts:

  • A central body;
  • A collection of branched fibers called dendrites that receive signals and carry them to the body; and
  • A branched fiber called an axon that sends signals produced by the body.

You can think of a neuron as a sort of analog computing element. Its dendrites receive inputs from some collection of sources. The body has some criteria for deciding, based on its inputs, whether to “fire”. If it fires, it sends an output using its axon.

What makes a neuron fire? It’s a combination of inputs. Different terminals on the dendrites have different signaling strength. When the combined inputs reach a threshold, the neuron fires. Those different signal strengths are key: a system of neurons can learn how to interpret a complex signal by varying the strength of the signal from different dendrites.

We can think of this simple model of a neuron in computational terms as a computing element that takes a set of weighted input values, combines them into a single value, and then generates an output of “1” if that value exceeds a threshold, and 0 if it does not.

In slightly more formal terms, (n, \theta, b, t) where:

  1. n is the number of inputs to the machine. We’ll represent a given input as a vector v=[v_1, ..., v_n].
  2. \theta = [\theta_1, \theta_2, ..., \theta_n] is a vector of weights, where \theta_i is the weight for input i.
  3. b is a bias value.
  4. t is the threshold for firing.

Given an input vector v, the machine computes the combined, weighted input value I by taking the dot product v \cdot w = [\theta_1v_1 + \theta_2v_2 + ... + \theta_nv_n]. If I + b \ge t, the neuron “fires” by producing a 1; otherwise, it produces a zero.

This version of a neuron is called a perceptron. It’s good at a particular kind of task called classification: given a set of inputs, it can answer whether or not the input is a member of a particular subset of values. A simple perceptron is limited to linear classification, which I’ll explain next.

To understand what a perceptron does, the easiest way to think of it is graphical. Imagine you’ve got an input vector with two values, so that your inputs are points in a two dimensional cartesian plane. The weights on the perceptron inputs define a line in that plane. The perceptron fires for all points above that line – so the perceptron classifies a point according to which side of the line it’s located on. We can generalize that notion to higher dimensional spaces: for a perceptron taking n input values, we can visualize its inputs as an n-dimensional space, and the perceptron weight’s define a hyperplane that slices the n-dimensional input space into two sub-spaces.

Taken by itself, a single perceptron isn’t very interesting. It’s just a fancy name for a something that implements a linear partition. What starts to unlock its potential is training. You can take a perceptron and initialize all of its weights to 1, and then start testing it on some input data. Based on the results of the tests, you alter the weights. After enough cycles of repeating this, the perceptron can learn the correct weights for any linear classification.

The traditional representation of the perceptron is as a function h:

\displaystyle    h(x, \theta, b) = \left\{    \begin{array}{cl}     0, & x \cdot \theta + b < 0 \\   +1, & x \cdot \theta + b \ge 0 \end{array}     \right.

Using this model, learning is just an optimization process, where we’re trying to find a set of values for {\theta} that minimize the errors in assigning points to subspaces.

A linear perceptron is a implementation of this model based on a very simple notion of a neuron. A perceptron takes a set of weighted inputs, adds them together, and then if the result exceeds some threshold, it “fires”.

A perceptron whose weighted inputs don’t exceed its threshold produces an output of 0; a perceptron which “fires” based on its inputs produces a value of +1.

Linear classification is very limited – we’d like to be able to do things that are more interesting that just linear. We can do that by adding one thing to our definition of a neuron: an activation function. Instead of just checking if the value exceeds a threshold, we can take the dot-product of the inputs, and then apply a function to them before comparing them to the threshold.

With an activation function f, we can define the operation of our more powerful in two phases. First, the perceptron computes the logit, which is the same old dot-product of the weights and the inputs. Then it applies the activation function to the logit, and based on the output, it decides whether or not to fire.

The logit is defined as:

 z = (\Sigma_{i=0}^{n} w_i x_i) + b

And the perceptron as a whole is a classifier:

\displaystyle h(x, \theta) =       \left\{       \begin{array}{cl}       0, & f(z) < 0 \\       +1, & f(z) >= 0       \end{array}       \right.

Like I said before, this gets interesting when you get to the point of training. The idea is that before you start training, you have a neuron that doesn’t know anything about the things it’s trying to classify. You take a collection of values where you know their classification, and you put them through the network. Each time you put a value through, ydou look at the result – and if it’s wrong, you adjust the weights of the inputs. Once you’ve repeated that process enough times, the edge-weights will, effectively, encode a curve (a line in the case of a linear perceptron) that divides between the categories. The real beauty of it is that you don’t need to know where the line really is: as long as you have a large, representative sample of the data, the perceptron will discover a good separation.

The concept is simple, but there’s one big gap: how do you adjust the weights? The answer is: calculus! We’ll define an error function, and then use the slope of the error curve to push us towards the minimum error.

Let’s say we have a set of training data. For each value i in the training data, we’ll say that t^{(i)} is the “true” value (that is, the correct classification) for value i, and y^{(i)} is the value produced by the current set of weights of our perceptron. Then the
cumulative error for the training data is:

E = \frac{1}{2}\sum_{i}(t^{(i)} - y^{(i)})^2

i^{(i)} is given to us with our training data. y^{(i)} is something we know how to compute. Using those, we can view the errors as a curve on y.

Let’s think in terms of a two-input example again. We can create a three dimensional space around the ideal set of weights: the x and y axes are the input weights; the z axis is the size of the cumulative error for those weights. For a given error value z, there’s a countour of a curve for all of the bindings that produce that level of error. All we need to do is follow the curve towards the minimum.

In the simple cases, we could just use Newton’s method directly to rapidly converge on the solution, but we want a general training algorithm, and in practice, most real learning is done using a non-linear activation function. That produces a problem: on a complex error surface, it’s easy to overshoot and miss the minimum. So we’ll scale the process using a meta-parameter \epsilon called the learning rate.

For each weight, we’ll compute a change based on the partial derivative of the error with respect to the weight:

 \Delta w_k = - \epsilon \frac{\partial E}{\partial w_k}

For our linear perceptron, using the definition of the cumulative error E above, we can expand that out to:

\Delta w_k = \Sigma_i \epsilon x_k^{(i)}(t^{(i)} - y^{(i)})

So to train a single perceptron, all we need to do is start with everything equally weighted, and then run it on our training data. After each pass over the data, we compute the updates for the weights, and then re-run until the values stabilize.

This far, it’s all pretty easy. But it can’t do very much: even with a complex activation function, a single neuron can’t do much. But when we start combining collections of neurons together, so that the output of some neurons become inputs to other neurons, and we have multiple neurons providing outputs – that is, when we assemble neurons into networks – it becomes amazingly powerful. So that will be our next step: to look at how to put neurons together into networks, and then train those networks.

As an interesting sidenote: most of us, when we look at this, think about the whole thing as a programming problem. But in fact, in the original implementation of perceptron, a perceptron was an analog electrical circuit. The weights were assigned using circular potentiometers, and the weights were updated during training using electric motors rotating the knob on the potentiometers!

I’m obviously not going to build a network of potentiometers and motors. But in the next post, I’ll start showing some code using a neural network library. At the moment, I’m still exploring the possible ways of implementing it. The two top contenders are TensorFlow, which is a library built on top of Python; and R, which is a stastical math system which has a collection of neural network libraries. If you have any preference between the two, or for something else altogether, let me know!

A Review of Type Theory (so far)

I’m trying to get back to writing about type theory. Since it’s been quite a while since the last type theory post, we’ll start with a bit of review.

What is this type theory stuff about?

The basic motivation behind type theory is that set theory isn’t the best foundation for mathematics. It seems great at first, but when you dig in deep, you start to see cracks.

If you start with naive set theory, the initial view is amazing: it’s so simple! But it falls apart: it’s not consistent. When you patch it, creating axiomatic set theory, you get something that isn’t logically inconsistent – but it’s a whole lot more complicated. And while it does fix the inconsistency, it still gives you some results which seem wrong.

Type theory covers a range of approaches that try to construct a foundational theory of mathematics that has the intuitive appeal of axiomatic set theory, but without some of its problems.

The particular form of type theory that we’ve been looking at is called Martin-Löf type theory. M-L type theory is a constructive theory of mathematics in which computation plays a central role. The theory rebuilds mathematics in a very concrete form: every proof must explicitly construct the objects it talks about. Every existence proof doesn’t just prove that something exists in the abstract – it provides a set of instructions (a program!) to construct an example of the thing that exists. Every proof that something is false provides a set of instructions (also a program!) for how to construct a counterexample that demonstrates its falsehood.

This is, necessarily, a weaker foundation for math than traditional axiomatic set theory. There are useful things that are provable in axiomatic set theory, but which aren’t provable in a mathematics based on M-L type theory. That’s the price you pay for the constructive foundations. But in exchange, you get something that is, in many ways, clearer and more reasonable than axiomatic set theory. Like so many things, it’s a tradeoff.

The constructivist nature of M-L type theory is particularly interesting to wierdos like me, because it means that programming becomes the foundation of mathematics. It creates a beautiful cyclic relationship: mathematics is the foundation of programming, and programming is the foundation of mathematics. The two are, in essence, one and the same thing.

The traditional set theoretic basis of mathematics uses set theory with first order predicate logic. FOPL and set theory are so tightly entangled in the structure of mathematics that they’re almost inseparable. The basic definitions of type theory require logical predicates that look pretty much like FOPL; and FOPL requires a model that looks pretty much like set theory.

For our type theory, we can’t use FOPL – it’s part of the problem. Instead, Martin-Lof used intuitionistic logic. Intuitionistic logic plays the same role in type theory that FOPL plays in set theory: it’s deeply entwined into the entire system of types.

The most basic thing to understand in type theory is what a logical proposition means. A proposition is a complete logical statement no unbound variables and no quantifiers. For example, “Mark has blue eyes” is a proposition. A simple proposition is a statement of fact about a specific object. In type theory, a proof of a proposition is a program that demonstrates that the statement is true. A proof that “Mark has blue eyes” is a program that does something like “Look at a picture of Mark, screen out everything but the eyes, measure the color C of his eyes, and then check that C is within the range of frequencies that we call “blue”. We can only say that a proposition is true if we can write that program.

Simple propositions are important as a starting point, but you can’t do anything terribly interesting with them. Reasoning with simple propositions is like writing a program where you can only use literal values, but no variables. To be able to do interesting things, you really need variables.

In Martin-Lof type theory, variables come along with predicates. A predicate is a statement describing a property or fact about an object (or about a collection of objects) – but instead of defining it in terms of a single fixed value like a proposition, it takes a parameter. “Mark has blue eyes” is a proposition; “Has blue eyes” is a predicate. In M-L type theory, a predicate is only meaningful if you can write a program that, given an object (or group of objects) as a parameter, can determine whether or no the predicate is true for that object.

That’s roughly where we got to in type theory before the blog went on hiatus.

Does well-ordering contradict Cantor?

The other day, I received an email that actually excited me! It’s a question related to Cantor’s diagonalization, but there’s absolutely nothing cranky about it! It’s something interesting and subtle. So without further ado:

Cantor’s diagonalization says that you can’t put the reals into 1 to 1 correspondence with the integers. The well-ordering theorem seems to suggest that you can pick a least number from every set including the reals, so why can’t you just keep picking least elements to put them into 1 to 1 correspondence with the reals. I understand why Cantor says you can’t. I just don’t see what is wrong with the other arguments (other than it must be wrong somehow). Apologies for not being able to state the argument in formal maths, I’m around 20 years out of practice for formal maths.

As we’ve seen in too many discussions of Cantor’s diagonalization, it’s a proof that shows that it is impossible to create a one-to-one correspondence between the natural numbers and the real numbers.

The Well-ordering says something that seems innoccuous at first, but which, looked at in depth, really does appear to contradict Cantor’s diagonalization.

A set S is well-ordered if there exists a total ordering <= on the set, with the additional property that for any subset T \subseteq S, T has a smallest element.

The well-ordering theorem says that every non-empty set can be well-ordered. Since the set of real numbers is a set, that means that there exists a well-ordering relation over the real numbers.

The problem with that is that it appears that that tells you a way of producing an enumeration of the reals! It says that the set of all real numbers has a least element: Bingo, there’s the first element of the enumeration! Now you take the set of real numbers excluding that one, and it has a least element under the well-ordering relation: there’s the second element. And so on. Under the well-ordering theorem, then, every set has a least element; and every element has a unique successor! Isn’t that defining an enumeration of the reals?

The solution to this isn’t particularly satisfying on an intuitive level.

The well-ordering theorem is, mathematically, equivalent to the axiom of choice. And like the axiom of choice, it produces some very ugly results. It can be used to create “existence” proofs of things that, in a practical sense, don’t exist in a usable form. It proves that something exists, but it doesn’t prove that you can ever produce it or even identify it if it’s handed to you.

So there is an enumeration of the real numbers under the well ordering theorem. Only the less-than relation used to define the well-ordering is not the standard real-number less than operation. (It obviously can’t be, because under well-ordering, every set has a least element, and standard real-number less-than doesn’t have a least element.) In fact, for any ordering relation \le_x that you can define, describe, or compute, \le_x is not the well-ordering relation for the reals.

Under the well-ordering theorem, the real numbers have a well-ordering relation, only you can’t ever know what it is. You can’t define any element of it; even if someone handed it to you, you couldn’t tell that you had it.

It’s very much like the Banach-Tarski paradox: we can say that there’s a way of doing it, only we can’t actually do it in practice. In the B-T paradox, we can say that there is a way of cutting a sphere into these strange pieces – but we can’t describe anything about the cut, other than saying that it exists. The well-ordering of the reals is the same kind of construct.

How does this get around Cantor? It weasels its way out of Cantor by the fact that while the well-ordering exists, it doesn’t exist in a form that can be used to produce an enumeration. You can’t get any kind of handle on the well-ordering relation. You can’t produce an enumeration from something that you can’t create or identify – just like you can’t ever produce any of the pieces of the Banach-Tarski cut of a sphere. It exists, but you can’t use it to actually produce an enumeration. So the set of real numbers remains non-enumerable even though it’s well-ordered.

If that feels like a cheat, well… That’s why a lot of people don’t like the axiom of choice. It produces cheatish existence proofs. Connecting back to something I’ve been trying to write about, that’s a big part of the reason why intuitionistic type theory exists: it’s a way of constructing math without stuff like this. In an intuitionistic type theory (like the Martin-Lof theory that I’ve been writing about), it doesn’t exist if you can’t construct it.

Understanding Global Warming Scale Issues

Aside from the endless stream of Cantor cranks, the next biggest category of emails I get is from climate “skeptics”. They all ask pretty much the same question. For example, here’s one I received today:

My personal analysis, and natural sceptisism tells me, that there are something fundamentally wrong with the entire warming theory when it comes to the CO2.

If a gas in the atmosphere increase from 0.03 to 0.04… that just cant be a significant parameter, can it?

I generally ignore it, because… let’s face it, the majority of people who ask this question aren’t looking for a real answer. But this one was much more polite and reasonable than most, so I decided to answer it. And once I went to the trouble of writing a response, I figured that I might as well turn it into a post as well.

The current figures – you can find them in a variety of places from wikipedia to the US NOAA – are that the atmosphere CO2 has changed from around 280 parts per million in 1850 to 400 parts per million today.

Why can’t that be a significant parameter?

There’s a couple of things to understand to grasp global warming: how much energy carbon dioxide can trap in the atmosphere, and hom much carbon dioxide there actually is in the atmosphere. Put those two facts together, and you realize that we’re talking about a massive quantity of carbon dioxide trapping a massive amount of energy.

The problem is scale. Humans notoriously have a really hard time wrapping our heads around scale. When numbers get big enough, we aren’t able to really grasp them intuitively and understand what they mean. The difference between two numbers like 300 and 400ppm is tiny, we can’t really grasp how in could be significant, because we aren’t good at taking that small difference, and realizing just how ridiculously large it actually is.

If you actually look at the math behind the greenhouse effect, you find that some gasses are very effective at trapping heat. The earth is only habitable because of the carbon dioxide in the atmosphere – without it, earth would be too cold for life. Small amounts of it provide enough heat-trapping effect to move us from a frozen rock to the world we have. Increasing the quantity of it increases the amount of heat it can trap.

Let’s think about what the difference between 280 and 400 parts per million actually means at the scale of earth’s atmosphere. You hear a number like 400ppm – that’s 4 one-hundreds of one percent – that seems like nothing, right? How could that have such a massive effect?!

But like so many other mathematical things, you need to put that number into the appropriate scale. The earths atmosphere masses roughly 5 times 10^21 grams. 400ppm of that scales to 2 times 10^18 grams of carbon dioxide. That’s 2 billion trillion kilograms of CO2. Compared to 100 years ago, that’s about 800 million trillion kilograms of carbon dioxide added to the atmosphere over the last hundred years. That’s a really, really massive quantity of carbon dioxide! scaled to the number of particles, that’s something around 10^40th (plus or minus a couple of powers of ten – at this scale, who cares?) additional molecules of carbon dioxide in the atmosphere. It’s a very small percentage, but it’s a huge quantity.

When you talk about trapping heat, you also have to remember that there’s scaling issues there, too. We’re not talking about adding 100 degrees to the earths temperature. It’s a massive increase in the quantity of energy in the atmosphere, but because the atmosphere is so large, it doesn’t look like much: just a couple of degrees. That can be very deceptive – 5 degrees celsius isn’t a huge temperature difference. But if you think of the quantity of extra energy that’s being absorbed by the atmosphere to produce that difference, it’s pretty damned huge. It doesn’t necessarily look like all that much when you see it stated at 2 degrees celsius – but if you think of it terms of the quantity of additional energy being trapped by the atmosphere, it’s very significant.

Calculating just how much energy a molecule of CO2 can absorb is a lot trickier than calculating the mass-change of the quantity of CO2 in the atmosphere. It’s a complicated phenomenon which involves a lot of different factors – how much infrared is absorbed by an atom, how quickly that energy gets distributed into the other molecules that it interacts with… I’m not going to go into detail on that. There’s a ton of places, like here, where you can look up a detailed explanation. But when you consider the scale issues, it should be clear that there’s a pretty damned massive increase in the capacity to absorb energy in a small percentage-wise increase in the quantity of CO2.


I’m working on some type theory posts, but it’s been slow going.

In the meantime, it’s Chanukah time. Every year, my family makes me cook potato latkes for Chanukah. The problem with that is, I don’t particularly like potato latkes. This year, I came up with the idea of trying to tweak them into something that I’d actually enjoy eating. What I came up with is combining a latke with another kind of fried savory pancake that I absolutely love: the japanese Okonomiyaki. The result? Okonomilatkes.


  • 1/2 head green cabbage, finely shredded.
  • 1 1/2 pounds potatoes
  • 1/2 cup flour
  • 1/2 cup water
  • 1 beaten egg
  • 1/2 pound crabstick cut into small pieces
  • Tonkatsu sauce (buy it at an asian grocery store in the japanese section. The traditional brand has a bulldog logo on the bottle.)
  • Katsubuoshi (shredded bonito)
  • Japanese mayonaise (sometimes called kewpie mayonaise. You can find it in squeeze bottles in any asian grocery. Don’t substitute American mayo – Japanese mayo is thinner, less oily, a bit tart, sweeter, and creamier. It’s really pretty different.)
  • 1 teaspoon salt
  • 1/2 teaspoon baking powder.


  1. In a very hot pan, add about a tablespoon of oil, and when it’s nearly smoking, add the cabbage. Saute until the cabbage wilts and starts to brown. Remove from the heat, and set aside to cool.
  2. Using either the grater attachment of a food processor, or the coarse side of a box grater, shred the potatoes. (I leave the skins on, but if that bugs you, peel them first).
  3. Squeeze as much water as you can out of the shredded potatoes.
  4. Mix together the water, flour, baking powder, egg, and salt into a thin batter.
  5. Add the potatoes, cabbage, and crabstick to the batter, and stir together.
  6. Split this mixture into four portions.
  7. Heat a nonstick pan on medium high heat, add a generous amount of oil, and add one quarter of the batter. Let it cook until nicely browned, then flip, and cook the other side. On my stove, it takes 3-5 minutes per side. Add oil as needed while it’s cooking.
  8. Repeat with the other 3 portions
  9. To serve, put a pancake on a plate. Squeeze a bunch of stripes of mayonaise, then add a bunch of the tonkatsu sauce, and sprinkle with the katsubuoshi.