Category Archives: Basics

A Gentle Explanation of the Intel Speculative Execution CPU Bug

There’s a lot of talk today about the recent discovery of a significant bug in the Intel CPUs that so many of us use every day. It’s an interesting problem, and understanding it requires knowing a little bit about how CPUs work, so I thought I’d take a shot at writing an explainer.

Let me start with a huge caveat: this involves a lot of details about how CPUs work, and in order to explain it, I’m going to simplify things to an almost ridiculous degree in order to try to come up with an explanation that’s comprehensible to a lay person. I’m never deliberately lying about how things work, but at times, I’m simplifying enough that it will be infuriating to an expert. I’m doing my best to explain my understanding of this problem in a way that most people will be able to understand, but I’m bound to oversimplify in some places, and get details wrong in others. I apologize in advance.

It’s also early days for this problem. Intel is still trying to keep the exact details of the bug quiet, to make it harder for dishonest people to exploit it. So I’m working from the information I’ve been able to gather about the issue so far. I’ll do my best to correct this post as new information comes out, but I can only do that when I’m not at work!

That said: what we know so far is that the Intel bug involves non-kernel code being able to access cached kernel memory through the use of something called speculative execution.

To an average person, that means about as much as a problem in the flux thruster atom pulsar electrical ventury space-time implosion field generator coil.

Let’s start with a quick overview of a modern CPU.

The CPU, in simple terms, the brain of a computer. It’s the component that actually does computations. It reads a sequence of instructions from memory, and then follows those instructions to perform some computation on some values, which are also stored in memory. This is a massive simplification, but basically, you can think of the CPU as a pile of hardware than runs a fixed program:

def simplified_cpu_main_loop():
  IP = 0
  while true:
     (op, in1, in2, out) = fetch(IP)
     val1 = fetch(in1)
     val2 = fetch(in2)
     result, IP = perform(op, in1, in2)
     store(result, out)

There’s a variable called the instruction pointer (abbreviated IP) built-in to the CPU that tells it where to fetch the next instruction from. Each time the clock ticks, the CPU fetches an instruction from the memory address stored in the instruction pointer, fetches the arguments to that instruction from cells in memory, performs the operation on those arguments, and then stores the result into another cell in the computer memory. Each individual operation produces both a result, and a new value for the instruction pointer. Most of the time, you just increment the instruction pointer to look at the next instruction, but for comparisons or branches, you can change it to something else.

What I described above is how older computers really worked. But as CPUs got faster, chipmaker ran into a huge problem: the CPU can perform its operations faster and faster every year. But retrieving a value from memory hasn’t gotten faster at the same rate as executing instructions. The exact numbers don’t matter, but to give you an idea, a modern CPU can execute an instruction in less than one nanosecond, but fetching a single value from memory takes more than 100 nanoseconds. In the scheme we described above, you need to fetch the instruction from memory (one fetch), and then fetch two parameters from memory (another two fetches), execute the instruction (1 nanosecond), and then store the result back into memory (one store). Assuming a store is no slower than a fetch, that means that for one nanosecond of computation time, the CPU needs to do 3 fetches and one store for each instruction. That means that the CPU is waiting, idle, for at least 400ns, during which it could have executed another 400 instructions, if it didn’t need to wait for memory.

That’s no good, obviously. There’s no point in making a fast CPU if all it’s going to do is sit around and wait for slow memory. But designers found ways to work around that, by creating ways to do a lot of computation without needing to pause to wait things to be retrieved from/stored to memory.

One of those tricks was to add registers to the CPUs. A register is a cell of memory inside of the CPU itself. Early processors (like the 6502 that was used by the Apple II) had one main register called an accumulator. Every arithmetic instruction would work by retrieving a value from memory, and then performing some arithmetic operation on the value in the accumulator and the value retrieved from memory, and leave the result in the accumulator. (So, for example, if 0x1234 were the address variable X, you could add the value of X to the accumulator with the instruction “ADD (1234)”. More modern CPUs added many registers, so that you can keep all of the values that you need for some computation in different registers. Reading values from or writing values to registers is lightning fast – in fact, it’s effectively free. So you structure your computations so that they load up the registers with the values they need, then do the computation in registers, and then dump the results out to memory. Your CPU can run a fairly long sequence of instructions without ever pausing for a memory fetch.

Expanding on the idea of putting memory into the CPU, they added ways of reducing the cost of working with memory by creating copies of the active memory regions on the CPU. These are called caches. When you try to retrieve something from memory, if it’s in the cache, then you can access it much faster. When you access something from a memory location that isn’t currently in the cache, the CPU will copy a chunk of memory including that location into the cache.

You might ask why, if you can make the cache fast, why not just make all of memory like the cache? The answer is that the time it takes in hardware to retrieve something from memory increases with amount of memory that you can potentially access. Pointing at a cache with 1K of memory is lightning fast. Pointing at a cache with 1 megabyte of memory is much slower that the 1K cache, but much faster that a 100MB cache; pointing at a cache with 100MB is even slower, and so on.

So what we actually do in practice is have multiple tiers of memory. We have the registers (a very small set – a dozen or so memory cells, which can be accessed instantly); a level-0 cache (on the order of 8k in Intel’s chips), which is pretty fast; a level-1 cache (100s of kilobytes), an L2 cache (megabytes), L3 (tens of megabytes), and now even L4 (100s of megabytes). If something isn’t in L0 cache, then we look for it in L1; if we can’t find it in L1, we look in L2, and so on, until if we can’t find it in any cache, we actually go out to the main memory.

There’s more we can do to make things faster. In the CPU, you can’t actually execute an entire instruction all at once – it’s got multiple steps. For a (vastly simplified) example, in the pseudocode above, you can think of each instruction as four phases: (1) decode the instruction (figuring out what operation it performs, and what its parameters are), (2) fetch the parameters, (3) perform the operations internal computation, and (4) write out the result. By taking advantage of that, you can set up your CPU to actually do a lot of work in parallel. If there are three phases to executing an instruction, then you can execute phase one of instruction one in one cycle; phase one of instruction two and phase two of instruction one in the next cycle; phase one of instruction three, phase two of instruction two, and phase three of instruction one in the third cycle. This process is called pipelining.

To really take advantage of pipelining, you need to keep the pipeline full. If your CPU has a four-stage pipeline, then ideally, you always know what the next four instructions you’re going to execute are. If you’ve got the machine code version of an if-then-else branch, when you start the comparison, you don’t know what’s going to come next until you finish it, because there are two possibilities. That means that when you get to phase 2 of your branch instruction, you can’t start phase one of the next instruction. instruction until the current one is finished – which means that you’ve lost the advantage of your pipeline.

That leads to another neat trick that people play in hardware, called branch prediction. You can make a guess about which way a branch is going to go. An easy way to understand this is to think of some numerical code:

def run_branch_prediction_demo():
  for i in 1 to 1000:
     for j in 1 to 1000:
          q = a[i][j] * sqrt(b[i][j])

After each iteration of the inner loop, you check to see if j == 1000. If it isn’t, you branch back to the beginning of that loop. 999 times, you branch back to the beginning of the loop, and one time, you won’t. So you can predict that you take the backward branch, and you can start executing the early phases of the first instructions of the next iteration. That may, most of the time you’re running the loop, your pipeline is full, and you’re executing your computation quickly!

The catch is that you can’t execute anything that stores a result. You need to be able to say “Oops, everything that I started after that branch was wrong, so throw it away!”. Alongside with branch prediction, most CPUs also provide speculative execution, which is a way of continuing to execute instructions in the pipeline, but being able to discard the results if they’re the result of an incorrect branch prediction.

Ok, we’re close. We’ve got to talk about just another couple of basic ideas before we can get to just what the problem is with these Intel chips.

We’re going to jump up the stack a bit, and instead of talking directly about the CPU hardware, we’re going to talk about the operating system, and how it’s implemented on the CPU.

An operating system is just a program that runs on your computer. The operating system can load and run other programs (your end-user applications), and it manages all of the resources that those other programs can work with. When you use an application that allocates memory, it sent a request called a syscall to the operating system asking it to give it some memory. If your application wants to read data from a disk drive, it makes a syscall to open a file and read data. The operating system is responsible for really controlling all of those resources, and making sure that each program that’s running only accesses the things that it should be allowed to access. Program A can only use memory allocated by program A; if it tries to access memory allocated by program B, it should cause an error.

The operating system is, therefore, a special program. It’s allowed to touch any piece of memory, any resource owned by anything on the computer. How does that work?

There are two pieces. First, there’s something called memory protection. The hardware provides a mechanism that the CPU can use to say something like “This piece of memory is owned by program A”. When the CPU is running program A, the memory protection system will arrange the way that memory looks to the program so that it can access that piece of memory; anything else just doesn’t exist to A. That’s called memory mapping: the system memory of the computer is mapped for A so that it can see certain pieces of memory, and not see others. In addition to memory mapping, the memory protection system can mark certain pieces of memory as only being accessible by privileged processes.

Privileged processes get us to the next point. In the CPU, there’s something called an execution mode: programs can run in a privileged mode (sometimes called kernel space execution), or it can run in a non-privileged mode (sometimes called user-space execution). Only code that’s running in kernel-space can do things like send commands to the memory manager, or change memory protection settings.

When your program makes a syscall, what really happens is that your program puts the syscall parameters into a special place, and then sends a signal called an interrupt. The interrupt switches the CPU into system space, and gives control to the operating system, which reads the interrupt parameters, and does whatever it needs to. Then it puts the result where the user space program expects it, switches back to user-space, and then allows the user space program to continue.

That process of switching from the user space program to the kernel space, doing something, and then switching back is called a context switch. Context switches are very expensive. Implemented naively, you need to redo the memory mapping every time you switch. So the interrupt consists of “stop what you’re doing, switch to privileged mode, switch to the kernel memory map, run the syscall, switch to the user program memory map, switch to user mode”.

Ok. So. We’re finally at the point where we can actually talk about the Intel bug.

Intel chips contain a trick to make syscalls less expensive. Instead of having to switch memory maps on a syscall, they allow the kernel memory to be mapped into the memory map of every process running in the system. But since kernel memory can contain all sorts of secret stuff (passwords, data belonging to other processes, among other things), you can’t let user space programs look at it – so the kernel memory is mapped, but it’s marked as privileged. With things set up that way, a syscall can drop the two “switch memory map” steps in the syscall scenario. Now all a syscall needs to do is switch to kernel mode, run the syscall, and switch back to user mode. It’s dramatically faster!

Here’s the problem, as best I understand from the information that’s currently available:

Code that’s running under speculative execution doesn’t do the check whether or not memory accesses from cache are accessing privileged memory. It starts running the instructions without the privilege check, and when it’s time to commit to whether or not the speculative execution should be continued, the check will occur. But during that window, you’ve got the opportunity to run a batch of instructions against the cache without privilege checks. So you can write code with the right sequence of branch instructions to get branch prediction to work the way you want it to; and then you can use that to read memory that you shouldn’t be able to read.

With that as a starting point, you can build up interesting exploits that can ultimately allow you to do almost anything you want. It’s not a trivial exploit, but with a bit of work, you can use a user space program to make a sequence of syscalls to get information you want into memory, and then write that information wherever you want to – and that means that you can acquire root-level access, and do anything you want.

The only fix for this is to stop doing that trick where you map the kernel memory into every user space memory map, because there’s no way to enforce the privileged memory property in speculative execution. In other words, drop the whole syscall performance optimization. That’ll avoid the security issue, but it’s a pretty expensive fix: requiring a full context switch for every syscall will slow down the execution of user space programs by something between 5 and 30 percent.

How Computers Really Work: Math via Boolean Logic

As I try to write more posts about ARM programming, I’m finding that I keep getting sidetracked by background stuff. Hopefully it’s interesting; if not, let me know in the comments!

Today’s sidetrack: how the heck does a computer actually do math?

As I said in my post about binary the other day, at a deep level, computers don’t actually work with numbers, not even zeros and ones. Computers are electronic devices, and what they really do is worth with electrical signals. In computer hardware, there are really just a few fundamental operations that it can perform with those signals. By taking those basic operations, and combining them in the right way, we can do math. How that works is very mysterious to most people. Obviously, I can’t describe all of how a computer works in a blog post. But what I’m going to do is explain the basic parts (the gates that are used to build the hardware), and how to combine them to implement a piece of hardware that does integer addition.

In computer hardware, there are four fundamental components that we can use to build operations, and they’re the basic operations of boolean logic: And, Or, Exclusive Or, and Not.

  • AND: Boolean AND takes two inputs, and outputs a 1 if both inputs are one.andgate
  • OR: Boolean OR takes two inputs, and outputs a 1 if at least one of its inputs are 1. orgate
  • XOR: Boolean Exclusive-OR takes two inputs, and outputs a 1 if one, but not
    both, of its inputs are one.xor
  • NOT: Boolean NOT (also called negation) takes one input, and outputs a 1 if its input was 0. notgate

In talking about these kinds of operations, we frequently use truth tables to explain them. A truth table just lists all of the possible input combinations, and tells you what the output for them is. For example, here’s a truth table showing you AND, OR, and XOR:

0 0 0 0 0
0 1 0 1 1
1 0 0 1 1
1 1 1 1 0

In terms of hardware, you can actually implement all of these using one primitive gate type, the NAND (negated AND) gate. For example, in the diagram below, you can see how to build a NOT gate using a NAND gate, and then using using that NAND-based NOT gate, to build an AND gate using two NANDs.


In the hardware, those gates are all that we have to work with. To build arithmetic, we need to figure out how to combine these primitive pieces to build up something that produces the right results. First, we need to be able to describe what the correct results are. We’ll start by figuring out what it means to add two 1-bit numbers. We’ll have two one-bit inputs. We need two outputs – because two one-bit numbers can add up to a two-bit sum. We’ll call the outputs S and C (sum and carry – S is the low-order bit, which would be the sum if we could only have a one-bit result, and C is the value of the second bit.)

We can describe addition with a truth table:

0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1

We can read those rows as “If A is 0 and B is 0, then A+B is 0 with a carry of 0”.

If we compare that table to the table up above showing the basic operations, then we can see that S is exactly the same as XOR, and C is exactly the same as AND. So we can build a one-bit adder out of gates:


This little one-bit adder is commonly called a half-adder, because to really implement a useful add operation, we need two of them for each bit.

If you think about what you’d need to do to add together two two-bit numbers, you couldn’t just use a half-adder for each bit. The carry from the first bit needs to get added to the second bit. You need to plumb the carry result from the first bit into a third input for the second bit. To get that second bit to work properly, you need to add together three one-bit values: the two inputs for the high-order bit, and the carry from the low-order bit. Generalizing, if you want to add together N bits,
when you’re computing the sum of the Mth bit, you need to include the carry from the M-1th bit.

To include the carry, we need to combine half-adders into a full adder, which looks like the following:


Using that, we can create an N-bit adder, by chaining the carry output from bit M-1 to the carry input of bit M. This creates the simplest adder circuit, which is called a ripple-carry adder.


Ripple-carry adders are the simplest way of building an integer addition operation in hardware. They’ve got one very big downside: to add the Nth bit, you need to have the result of adding the N-1th bit. That means that the more bits you add, the slower the ripple-carry adder gets. You have to way for the signals to propagate all the way through the circuit from the low bits to the high bits.

So ripple-carry isn’t really used in hardware anymore. There are a bunch of more complicated ways of building the adder that get rid of that propagation delay. It comes down to a very common tradeoff for engineers: performance versus complexity. But at the end of the day, the concept is still basically the same: it’s still a chain of multiple simple adders, that work the way I described here.

More Basics: Compilers, Programs, and Languages

After my “what is an OS?” post, a couple of readers asked me to write a similar post about compilers.

Before I can answer what a compiler is, it’s helpful to first answer a different question: what is a program?

And here we get to one of my pet peeves. The most common answer to that question is “a detailed step-by-step sequence of instructions”. For example, here’s what wikipedia says:

A computer program, or just a program, is a sequence of instructions, written to perform a specified task with a computer.

This is wrong.

Back when people first started to study the idea of computing devices, they talked about computing machines as devices that performed a single, specific task. If you think about a basic Turing machine, you normally define Turing machines that perform a single computation. They’ve got a built-in sequence of states, and a built in transition table – the machine can only perform one computation. It took one kind of input, and performed its computation on that input, producing its output.

Building up from these specific machines, they came up with the idea of a universal computing device. A universal computer was a computing machine whose input was a description of a different computing machine. By giving the universal machine different inputs, it could perform different computations.

The point of this diversion is that looking at this history tells us what a program really is: it’s a description of a computing machine. Our computers are universal computing machines; they take programs as input to describe the computing machines we want them to emulate. What we’re doing when we program is describing a computing machine that we’d like to create. Then we feed it into our universal computing machine, and it behaves as if we’d built a custom piece of hardware to do our computation!

The problem is, our computers are simultaneously very primitive and overwhelming complex. They can only work with data expressed in fixed-length sequences of on/off values; to do anything else, we need to find a way of expressing in terms of extremely simple operations on those on/off values. To make them operate efficiently, they’ve got a complex structure: many different kinds of storage (registers, l1 and l2 caches, addressable memory), complicated instruction sets, and a whole lot of tricky perfomance tweaks. It’s really hard to program a computer in terms of its native instructions!

In fact, it’s so hard to program in terms of native instructions that we just don’t do it. What we do is write programs in terms of different machines. That’s the point of a programming language.

Looked at this way, a program language is a way of describing computing machines. The difference between different programming languages is how they describe computing machines. A language like C describes von Neumann machines. Haskell describes machines that work via lambda calculus computations using something like a spineless G-machine. . Prolog describes machines that perform computations in terms of intuitionistic logical inference like a Warren Abstract Machine.

So finally, we can get to the point: what is a compiler? A compiler is a program that takes a description of a computing device defined in one way, and translates into the kind of machine description that can be used by our hardware. A programming language ets us ignore all of the complexities of how our actual hardware is built, and describe our computations in terms of a simple abstraction. A compiler takes that description, and turns it into the form that computer hardware can actually use.

For anyone who’s read this far: I’ve gotten a few requests to talk about assembly language. I haven’t programmed in assembly since the days of the Motorola 68000. This means that to do it, I’ll need to learn something more up-to-date. Would you be more interested in seeing Intel, or ARM?

Boot all the computers!

Moving on from last weeks operating system post, today we’ll look at how a computer boots up and loads an operating system.

Let’s start with why booting is a question at all. When a computer turns on, what happens? What we’re using to seeing is that the disk drive turns on and starts spinning, and the computer loads something from the disk.

The question is how does the computer know how to turn on the disk? As I said in the OS post, the CPU only really knows how work with memory. To talk to a disk drive, it needs to do some very specific things – write to certain memory locations, wait for things to happen. Basically, in order to turn on that disk drive and load the operating system, it needs to run a program. But how does it know what program to run?

I’m going to focus on how modern PCs work. Other computers have used/do use a similar process. The details vary, but the basic idea is the same.

A quick overview of the process:

  1. CPU startup.
  2. Run BIOS initialization
  3. Load bootloader
  4. Run bootloader
  5. Load and run OS.

As that list suggests, it’s not a particularly simple process. We think of it as one step: turn on the computer, and it runs the OS. In fact, it’s a complicated dance of many steps.

On the lowest level, it’s all hardware. When you turn on a computer, some current gets sent to a clock. The clock is basically a quartz crystal; when you apply current to the crystal, it vibrates and produces a regular electrical pulse. That pulse is what drives the CPU. (When you talk about your computer’s speed, you generally describe it in terms of the frequency of the clock pulse. For example, in the laptop that I’m using to write this post, I’ve got a 2.4 GHz processor: that means that the clock chip pulses 2.4 billion times per second.)

When the CPU gets a clock pulse, it executes an instruction from memory. It knows what instruction to execute because it’s got a register (a special piece of memory built-in to the CPU) that tells it what instruction to execute. When the computer is turned on, that register is set to point at a specific location. Depending on the CPU, that might be 0, or it might be some other magic location; it doesn’t matter: what matters is that the CPU is built so that when it’s first turned on and it receives a clock pulse that starts it running, that register will always point at the same place.

The software part of the boot process starts there: the computer puts a chunk of read-only memory there – so when the computer turns on, there’s a program sitting at that location, which the computer can run. On PCs, that program is called the BIOS (Basic Input/Output System).

The BIOS knows how to tell the hardware that operates your display to show text on the screen, and it knows how to read stuff on your disk drives. It doesn’t know much beyond that. What it knows is extremely primitive. It doesn’t understand things like filesystems – the filesystem is set up and controlled by the operating system, and different operating systems will set up filesystems in different ways. The BIOS can’t do anything with a filesystem: it doesn’t include any programming to tell it how to read a filesystem, and it can’t ask the operating system to do it, because the OS hasn’t loaded yet!

What the BIOS does is something similar to what the CPU did when it started up. The CPU knew to look in a special location in memory to find a program to run. The BIOS knows to look at a special section on a disk drive to find a program to run. Every disk has a special chunk of data on it called the master boot record (MBR). The MBR contains another program, called a boot loader. So the BIOS loads the boot loader, and then uses it to actually load the operating system.

This probably seems a bit weird. The computer starts up by looking in a specific location for a program to run (the BIOS), which loads something (the bootloader). The thing it loads (the bootloader) also just looks in a specific location for a program to run (the OS). Why the two layers?

Different operating systems are build differently, and the specific steps to actually load and run the OS are different. For example, on my laptop, I’ve can run two operating systems: MacOS, and Linux. On MacOS (aka Darwin), there’s something called a microkernel that gets loaded. The microkernel is stored in a file named “mach_kernel” in the root directory of a type of filesystem called HFS. But in my installation of linux, the OS is stored in a file named “vmlinuz” in the root directory of a type of filesystem called EXT4. The BIOS doesn’t know what operating system it’s loading, and it doesn’t know what filesystem the OS uses – and that means that it knows neither the name of the file to load, nor how to find that file.

The bootloader was set up by the operating system. It’s specific to the operating system – you can think of it as part of the OS. So it knows what kind of filesystem it’s going to look at, and how to find the OS in that filesystem.

So once the bootloader gets started, it knows how to load and run the operating system, and once it does that, your computer is up and running, and ready for you to use!

Of course, all of this is a simplified version of how it works. But for understanding the process, it’s a reasonable approximation.

(To reply to commenters: I’ll try to do a post like this about compilers when I have some time to write it up.)

What the heck is a DNS amplification DoS attack?

A couple of weeks ago, there was a bunch of news about a major DOS attack on Spamhaus. Spamhaus is an online service that maintains a blacklist of mail servers that are known for propagating spam. I’ve been getting questions about what a DoS attack is, and more specifically what a “DNS amplification attack” (the specific attack at the heart of last week’s news) is. This all became a bit more relevant to me last week, because some asshole who was offended by my post about the Adria Richards affair launched a smallish DoS attack against scientopia. (This is why we were interrmitently very slow last week, between tuesday and thursday. Also, to be clear, the DNS amplification attack was used on Spamhaus. Scientopia was hit by a good old fashioned DDoS attack.)

So what is a DoS attack? And what specifically is a DNS amplification attack?

Suppose that you’re a nastly person who wants to take down a website like scientopia. How could you do it? You could hack in to the server, and delete everything. That would kill it pretty effectively, right?

It certainly would. But from the viewpoint of an attacker, that’s not a particularly easy thing to do. You’d need to get access to a privileged account on our server. Even if we’re completely up to date on patches and security fixes, it’s probably possible to do that, but it’s still probably going to be a lot of work. Even for a dinky site like scientopia, getting that kind of access probably isn’t trivial. For a big security-focused site like spamhaus, that’s likely to be close to impossible: there are layers of security that you’d need to get through, and there are people constantly watching for attacks. Even if you got through, if the site has reliable backups, it won’t be down for long, and once they get back up, they’ll patch whatever hole you used to get in, so you’d be back to square one. It’s a lot of work, and there are much easier ways to take down a site.

What you, as an attacker, want is a way to take the site down without having any kind of access to the system. You want a way that keeps the site down for as long as you want it down. And you want a way that doesn’t leave easily traced connections to you.

That’s where the DoS attack comes in. DoS stands for “denial of service”. The idea of a DoS attack is to take a site down without really taking it down. You don’t actually kill the server; you just make it impossible for legitimate users to access it. If the sites users can’t access the site even though the server is technically still up and running, you’ve still effectively killed it.

How do you do that? You overwhelm the server. You target some finite resource on the server, and force it to use up that resource just dealing with requests or traffic that you sent to the server, leaving it with nothing for its legitimate users.

In terms of the internet, the two resources that people typically target are CPU and network bandwidth.

Every time that you send a request to a webserver, the server has to do some computation to process that request. The server has a finite amount of computational capability. If you can hit it with enough requests that it spends all of its time processing your requests, then the site becomes unusable, and it effectively goes down. This is the simplest kind of DoS attack. It’s generally done in a form called a DDoS – distributed denial of server attack, where the attacker users thousands or millions of virus-infected computers to send requests. The server gets hit by a vast storm of requests, and it can’t distinguish the legitimate requests from the ones generated by the attacker. This is the kind of attack that hit Scientopia last week. We were getting streams of a couple of thousands malformed requests per second.

This kind of attack can be very effective. It’s hard – not impossible, but hard – to fight. You need to identify the common traits of the attackers, and set up some kind of filter to discard those requests. From the attacker’s point of view, it’s got one problem: price. Most people don’t have a personal collection of virus-infected machines that they can use to mount an attack. What they actually do is rent machines! Virus authors run services where they’ll use the machines that they’ve to run an attack for you, for a fee. They typically charge per machine-hour. So to keep a good attack going for a long time is expensive! Another problem with this kind of attack is that the amount of traffic that you can inflict on the server per attacker is also used by the client. The client needs to establish a connection to the server. That consumes CPU, network connections, and bandwidth on the client.

The other main DoS vector is network bandwidth. Every server running a website is connected to the network by a connection with a fixed capacity, called it’s bandwidth. A network connection can only carry a certain quantity of information. People love to make fun of the congressman who said that the internet is like a series of tubes, but that’s not really a bad analogy. Any given connection is a lot like a pipe. You can only cram so much information through that pipe in a given period of time. If you can send enough traffic to completely fill that pipe, then the computer on the other end is, effectively, off the network. It can’t receive any requests.

For a big site like spamhaus, it’s very hard to get enough machines attacking to effectively kill the site. The amount of bandwidth, and the number of different network paths connecting spamhaus to the internet is huge! The number of infected machines available for an attack is limited, and the cost of using all of them is prohibitive.

What an attacker would like for killing something like Spamhaus is an attack where the amount of work/cpu/traffic used to generate the attack is much smaller than the amount of work/cpu/traffic used by the server to combat the attack. That’s where amplification comes in. You want to find some way of using a small amount of work/traffic on your attacker machines to cause your target to lost a large amount of work/traffic.

In this recent attack on Spamhaus, they used an amplification attack, that was based on a basic internet infrastructure service called the Domain Name Service (DNS). DNS is the service which is used to convert between the name of a server (like, and its numeric internet address ( DNS has some technical properties that make it idea for this kind of attack:

  1. It’s not a connection-based service. In most internet services, you establish a connection to a server, and send a request on that connection. The server responds on the same connection. In a connection-based service, that means two things. First, you need to use just as much bandwidth as the target, because if you drop the connection, the server sees the disconnect and stops processing your request. Second, the server knows who it’s connected to, and it always sends the results of a request to the client that requested it. But DNS doesn’t work that way. In DNS, you send a request without a connection, and in the request, you provide an address that the response should be sent to. So you can fake a DNS request, by putting someone else’s address as the “respond-to” address in the request.
  2. It’s possible to set up DNS to create very large responses to very small requests. There are lots of ways to do this. The important thing is that it’s really easy to use DNS in a way that allows you to amplify the amount of data being sent to a server by a factor of 100. In one common form of DNS amplification, you send 60 byte requests, which generate responses larger than 6,000 bytes.

Put these two properties together, and you get a great attack vector: you can send tiny, cheap requests to a server, which don’t cause any incoming traffic on your attacker machine, and which send large quantities of data to your target. Doing this is called a DNS amplification attack: it’s an amplification attack which uses properties of DNS to generate large quantities of data send to your server, using small quantities of data sent by your attackers.

That’s exactly what happened to Spamhaus last week. The attackers used a very common DNS extension, which allowed them to amplify 60 byte requests into 4,000 byte responses, and to send the responses to the spamhaus servers.

There are, of course, more details. (For example, when direct attacks didn’t work, they tried an indirect attack that didn’t target the spamhaus servers, but instead tried to attack other servers that spamhaus relied on.) But this is the gist.

What is math?


I’ve got a bunch of stuff queued up to be posted over the next couple of days. It’s
been the sort of week where I’ve gotten lots of interesting links from
readers, but I haven’t had time to finish anything!

I thought I’d start off with something short but positive. A reader sent
me a link to a post on Reddit, with the following question:

Throughout elementary and high school, I got awful marks in math. I always
assumed I was just stupid in that way, which is perfectly possible. I also
hated my teacher, so that didn’t help. A friend of mine got his PhD in math
from Harvard before he was 25 (he is in his 40’s now) I was surprised the
other week when I learned he isn’t particularly good at basic arithmetic etc.
He said that’s not really what math is about. So my question is really for
math fans/pros. What is math, really? I hear people throwing around phrases
like “elegant” and “artistic” regarding math. I don’t understand how this can
be. To me, math is add, subtract, etc. It is purely functional. Is there
something you can compare it to so that I can understand?

This hits on one of my personal pet peeves. Math really is a beautiful
thing, but the way that math is taught turns it into something
mechanistic, difficult, and boring. The person who posted this question
is a typical example of a victim of lousy math education.

So what is math? It’s really a great question, and not particularly
an easy one to answer.

Continue reading What is math?

Basics: Significant Figures

After my post the other day about rounding errors, I got a ton of
requests to explain the idea of significant figures. That’s
actually a very interesting topic.

The idea of significant figures is that when you’re doing
experimental work, you’re taking measurements – and measurements
always have a limited precision. The fact that your measurements – the
inputs to any calculation or analysis that you do – have limited
precision, means that the results of your calculations likewise have
limited precision. Significant figures (or significant digits, or just “sigfigs” for short) are a method of tracking measurement
precision, in a way that allows you to propagate your precision limits
throughout your calculation.

Before getting to the rules for sigfigs, it’s helpful to show why
they matter. Suppose that you’re measuring the radius of a circle, in
order to compute its area. You take a ruler, and eyeball it, and end
up with the circle’s radius as about 6.2 centimeters. Now you go to
compute the area: π=3.141592653589793… So what’s the area of the
circle? If you do it the straightforward way, you’ll end up with a
result of 120.76282160399165 cm2.

The problem is, your original measurement of the radius was
far too crude to produce a result of that precision. The real
area of the circle could easily be as high as 128, or as low as
113, assuming typical measurement errors. So claiming that your
measurements produced an area calculated to 17 digits of precision is
just ridiculous.

Continue reading Basics: Significant Figures