# Introduction to Neural Networks

In preparation for starting a new job next week, I’ve been doing some reading about neural networks and deep learning. The math behind neural networks is pretty interesting, so I thought I’d take my notes, and turn them into some posts.

As the name suggests, the basic idea of a neural network is to construct a computational system based on a simple model of a neuron. If you look at a neuron under a microscope, what you see is something vaguely similar to:

It’s a cell with three main parts:

• A central body;
• A collection of branched fibers called dendrites that receive signals and carry them to the body; and
• A branched fiber called an axon that sends signals produced by the body.

You can think of a neuron as a sort of analog computing element. Its dendrites receive inputs from some collection of sources. The body has some criteria for deciding, based on its inputs, whether to “fire”. If it fires, it sends an output using its axon.

What makes a neuron fire? It’s a combination of inputs. Different terminals on the dendrites have different signaling strength. When the combined inputs reach a threshold, the neuron fires. Those different signal strengths are key: a system of neurons can learn how to interpret a complex signal by varying the strength of the signal from different dendrites.

We can think of this simple model of a neuron in computational terms as a computing element that takes a set of weighted input values, combines them into a single value, and then generates an output of “1” if that value exceeds a threshold, and 0 if it does not.

In slightly more formal terms, a neuron is a tuple $(n, \theta, b, t)$ where:

1. $n$ is the number of inputs to the machine. We’ll represent a given input as a vector $v=[v_1, ..., v_n]$.
2. $\theta = [\theta_1, \theta_2, ..., \theta_n]$ is a vector of weights, where $\theta_i$ is the weight for input $i$.
3. $b$ is a bias value.
4. $t$ is the threshold for firing.

Given an input vector $v$, the machine computes the combined, weighted input value $I$ by taking the dot product $I = v \cdot \theta = \theta_1v_1 + \theta_2v_2 + ... + \theta_nv_n$. If $I + b \ge t$, the neuron “fires” by producing a 1; otherwise, it produces a 0.
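As a minimal sketch of that definition in Python: the function name `perceptron_output` and the example weights, bias, and threshold below are my own illustrative choices, not part of any library.

```python
def perceptron_output(v, theta, b, t):
    """Return 1 if the weighted input plus bias reaches the threshold, else 0."""
    if len(v) != len(theta):
        raise ValueError("input and weight vectors must have the same length")
    # I is the dot product of the inputs and the weights.
    combined = sum(theta_i * v_i for theta_i, v_i in zip(theta, v))
    return 1 if combined + b >= t else 0


# Hypothetical two-input neuron: weights [1.0, 1.0], bias 0, threshold 1.5.
# With 0/1 inputs it fires only when both inputs are 1 (a logical AND).
for v in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(v, perceptron_output(v, [1.0, 1.0], 0.0, 1.5))
```

Changing the weights or the threshold moves the boundary between the inputs that fire and the inputs that don't, which is all that "learning" will turn out to mean for a single neuron.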

This version of a neuron is called a perceptron. It’s good at a particular kind of task called classification: given a set of inputs, it can answer whether or not the input is a member of a particular subset of values. A simple perceptron is limited to linear classification, which I’ll explain next.

To understand what a perceptron does, the easiest way to think of it is graphically. Imagine you’ve got an input vector with two values, so that your inputs are points in a two-dimensional Cartesian plane. The weights on the perceptron inputs, together with the bias and threshold, define a line in that plane. The perceptron fires for all points on one side of that line – so the perceptron classifies a point according to which side of the line it’s located on. We can generalize that notion to higher dimensional spaces: for a perceptron taking $n$ input values, we can visualize its inputs as an $n$-dimensional space, and the perceptron’s weights define a hyperplane that slices the $n$-dimensional input space into two half-spaces.

Taken by itself, a single perceptron isn’t very interesting. It’s just a fancy name for something that implements a linear partition. What starts to unlock its potential is training. You can take a perceptron and initialize all of its weights to 1, and then start testing it on some input data. Based on the results of the tests, you alter the weights. After enough cycles of repeating this, the perceptron can learn the correct weights for any linear classification.

The traditional representation of the perceptron is as a function $h$:

$\displaystyle h(x, \theta, b) = \left\{ \begin{array}{cl} 0, & x \cdot \theta + b < 0 \\ +1, & x \cdot \theta + b \ge 0 \end{array} \right.$

Using this model, learning is just an optimization process, where we’re trying to find a set of values for ${\theta}$ that minimize the errors in assigning points to subspaces.

A linear perceptron is an implementation of this model based on a very simple notion of a neuron. A perceptron takes a set of weighted inputs, adds them together, and then if the result exceeds some threshold, it “fires”.

A perceptron whose weighted inputs don’t exceed its threshold produces an output of 0; a perceptron which “fires” based on its inputs produces a value of +1.

Linear classification is very limited – we’d like to be able to do things that are more interesting than just linear. We can do that by adding one thing to our definition of a neuron: an activation function. Instead of just checking if the value exceeds a threshold, we can take the dot-product of the inputs, and then apply a function to it before comparing the result to the threshold.

With an activation function $f$, we can define the operation of our more powerful perceptron in two phases. First, the perceptron computes the logit, which is the same old dot-product of the weights and the inputs. Then it applies the activation function to the logit, and based on the output, it decides whether or not to fire.

The logit is defined as follows, writing $w_i$ for the weight $\theta_i$:

$z = (\sum_{i=1}^{n} w_i x_i) + b$

And the perceptron as a whole is a classifier:

$\displaystyle h(x, \theta) = \left\{ \begin{array}{cl} 0, & f(z) < 0 \\ +1, & f(z) \ge 0 \end{array} \right.$

Like I said before, this gets interesting when you get to the point of training. The idea is that before you start training, you have a neuron that doesn’t know anything about the things it’s trying to classify. You take a collection of values where you know their classification, and you put them through the network. Each time you put a value through, you look at the result – and if it’s wrong, you adjust the weights of the inputs. Once you’ve repeated that process enough times, the edge-weights will, effectively, encode a curve (a line in the case of a linear perceptron) that divides between the categories. The real beauty of it is that you don’t need to know where the line really is: as long as you have a large, representative sample of the data, the perceptron will discover a good separation.

The concept is simple, but there’s one big gap: how do you adjust the weights? The answer is: calculus! We’ll define an error function, and then use the slope of the error curve to push us towards the minimum error.

Let’s say we have a set of training data. For each value $i$ in the training data, we’ll say that $t^{(i)}$ is the “true” value (that is, the correct classification) for value $i$, and $y^{(i)}$ is the value produced by the current set of weights of our perceptron. Then the cumulative error for the training data is:

$E = \frac{1}{2}\sum_{i}(t^{(i)} - y^{(i)})^2$

$t^{(i)}$ is given to us with our training data. $y^{(i)}$ is something we know how to compute. Using those, we can view the errors as a curve on $y$.

Let’s think in terms of a two-input example again. We can create a three dimensional space around the ideal set of weights: the x and y axes are the input weights; the z axis is the size of the cumulative error for those weights. For a given error value $z$, there’s a contour of a curve for all of the weight settings that produce that level of error. All we need to do is follow the curve towards the minimum.

In the simple cases, we could just use Newton’s method directly to rapidly converge on the solution, but we want a general training algorithm, and in practice, most real learning is done using a non-linear activation function. That produces a problem: on a complex error surface, it’s easy to overshoot and miss the minimum. So we’ll scale the process using a meta-parameter $\epsilon$ called the learning rate.

For each weight, we’ll compute a change based on the partial derivative of the error with respect to the weight:

$\Delta w_k = - \epsilon \frac{\partial E}{\partial w_k}$

For our linear perceptron, using the definition of the cumulative error $E$ above, we can expand that out to:

$\Delta w_k = \sum_i \epsilon x_k^{(i)}(t^{(i)} - y^{(i)})$
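To see where that expansion comes from, here is the derivation spelled out as a sketch. It assumes that the output being differentiated is the raw weighted sum, $y^{(i)} = \sum_k w_k x_k^{(i)} + b$, before any thresholding (the step function itself has zero slope almost everywhere, so it gives the derivative nothing to work with):

$\displaystyle \frac{\partial E}{\partial w_k} = \frac{\partial}{\partial w_k} \frac{1}{2}\sum_{i}(t^{(i)} - y^{(i)})^2 = -\sum_{i}(t^{(i)} - y^{(i)}) \frac{\partial y^{(i)}}{\partial w_k} = -\sum_{i} x_k^{(i)}(t^{(i)} - y^{(i)})$

Multiplying by $-\epsilon$ gives the update rule above.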

So to train a single perceptron, all we need to do is start with everything equally weighted, and then run it on our training data. After each pass over the data, we compute the updates for the weights, and then re-run until the values stabilize.
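Here is a rough sketch of that training loop in plain Python. Several things in it are my own assumptions rather than anything fixed by the description above: the names `train_linear_unit` and `classify`, the use of the raw weighted sum as $y$, a fixed number of passes instead of a convergence check, a bias updated the same way as the weights, and a firing threshold of 0.5 (a natural midpoint for 0/1 targets).

```python
def train_linear_unit(inputs, targets, epsilon=0.1, passes=500):
    """Batch gradient descent on E = 1/2 * sum_i (t_i - y_i)^2."""
    n = len(inputs[0])
    w = [1.0] * n   # start with everything equally weighted
    b = 0.0
    for _ in range(passes):
        # Outputs and errors for the whole training set under the current weights.
        ys = [sum(wk * xk for wk, xk in zip(w, x)) + b for x in inputs]
        errors = [t - y for t, y in zip(targets, ys)]
        # delta w_k = epsilon * sum_i x_k^(i) * (t^(i) - y^(i))
        for k in range(n):
            w[k] += epsilon * sum(x[k] * e for x, e in zip(inputs, errors))
        # The bias behaves like a weight on a constant input of 1.
        b += epsilon * sum(errors)
    return w, b


def classify(x, w, b, threshold=0.5):
    """Fire (1) when the weighted sum plus bias reaches the threshold."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= threshold else 0


# Toy training set: learn logical AND from four labelled points.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]
w, b = train_linear_unit(inputs, targets)
print("weights:", w, "bias:", b)
print("classified:", [classify(x, w, b) for x in inputs])
```

If the learning rate is set too high, the updates overshoot and the weights diverge instead of settling, which is exactly the problem $\epsilon$ is there to control.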

Thus far, it’s all pretty easy. But a single neuron can’t do very much, even with a complex activation function. When we start combining collections of neurons together, so that the outputs of some neurons become inputs to other neurons, and we have multiple neurons providing outputs – that is, when we assemble neurons into networks – it becomes amazingly powerful. So that will be our next step: to look at how to put neurons together into networks, and then train those networks.

As an interesting sidenote: most of us, when we look at this, think about the whole thing as a programming problem. But in fact, in the original implementation of the perceptron, a perceptron was an analog electrical circuit. The weights were assigned using circular potentiometers, and the weights were updated during training using electric motors rotating the knobs on the potentiometers!

I’m obviously not going to build a network of potentiometers and motors. But in the next post, I’ll start showing some code using a neural network library. At the moment, I’m still exploring the possible ways of implementing it. The two top contenders are TensorFlow, which is a library built on top of Python; and R, which is a statistical math system which has a collection of neural network libraries. If you have any preference between the two, or for something else altogether, let me know!

# A Review of Type Theory (so far)

I’m trying to get back to writing about type theory. Since it’s been quite a while since the last type theory post, we’ll start with a bit of review.

What is this type theory stuff about?

The basic motivation behind type theory is that set theory isn’t the best foundation for mathematics. It seems great at first, but when you dig in deep, you start to see cracks.

If you start with naive set theory, the initial view is amazing: it’s so simple! But it falls apart: it’s not consistent. When you patch it, creating axiomatic set theory, you get something that isn’t logically inconsistent – but it’s a whole lot more complicated. And while it does fix the inconsistency, it still gives you some results which seem wrong.

Type theory covers a range of approaches that try to construct a foundational theory of mathematics that has the intuitive appeal of axiomatic set theory, but without some of its problems.

The particular form of type theory that we’ve been looking at is called Martin-Löf type theory. M-L type theory is a constructive theory of mathematics in which computation plays a central role. The theory rebuilds mathematics in a very concrete form: every proof must explicitly construct the objects it talks about. Every existence proof doesn’t just prove that something exists in the abstract – it provides a set of instructions (a program!) to construct an example of the thing that exists. Every proof that something is false provides a set of instructions (also a program!) for how to construct a counterexample that demonstrates its falsehood.

This is, necessarily, a weaker foundation for math than traditional axiomatic set theory. There are useful things that are provable in axiomatic set theory, but which aren’t provable in a mathematics based on M-L type theory. That’s the price you pay for the constructive foundations. But in exchange, you get something that is, in many ways, clearer and more reasonable than axiomatic set theory. Like so many things, it’s a tradeoff.

The constructivist nature of M-L type theory is particularly interesting to weirdos like me, because it means that programming becomes the foundation of mathematics. It creates a beautiful cyclic relationship: mathematics is the foundation of programming, and programming is the foundation of mathematics. The two are, in essence, one and the same thing.

The traditional set theoretic basis of mathematics uses set theory with first order predicate logic. FOPL and set theory are so tightly entangled in the structure of mathematics that they’re almost inseparable. The basic definitions of set theory require logical predicates that look pretty much like FOPL; and FOPL requires a model that looks pretty much like set theory.

For our type theory, we can’t use FOPL – it’s part of the problem. Instead, Martin-Löf used intuitionistic logic. Intuitionistic logic plays the same role in type theory that FOPL plays in set theory: it’s deeply entwined into the entire system of types.

The most basic thing to understand in type theory is what a logical proposition means. A proposition is a complete logical statement with no unbound variables and no quantifiers. For example, “Mark has blue eyes” is a proposition. A simple proposition is a statement of fact about a specific object. In type theory, a proof of a proposition is a program that demonstrates that the statement is true. A proof that “Mark has blue eyes” is a program that does something like “Look at a picture of Mark, screen out everything but the eyes, measure the color C of his eyes, and then check that C is within the range of frequencies that we call ‘blue’.” We can only say that a proposition is true if we can write that program.

Simple propositions are important as a starting point, but you can’t do anything terribly interesting with them. Reasoning with simple propositions is like writing a program where you can only use literal values, but no variables. To be able to do interesting things, you really need variables.

In Martin-Löf type theory, variables come along with predicates. A predicate is a statement describing a property or fact about an object (or about a collection of objects) – but instead of defining it in terms of a single fixed value like a proposition, it takes a parameter. “Mark has blue eyes” is a proposition; “Has blue eyes” is a predicate. In M-L type theory, a predicate is only meaningful if you can write a program that, given an object (or group of objects) as a parameter, can determine whether or not the predicate is true for that object.
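To make the “proofs are programs” idea a little more concrete, here is a small sketch in Lean 4. Treat it as an illustration only: Lean’s type theory is a relative of Martin-Löf’s rather than the same system, and the predicate `IsEven` is a hypothetical example of my own. What it shares with the description above is that each proof is a term that explicitly constructs the evidence.

```lean
-- A simple proposition about specific objects; its proof is a computation.
example : 2 + 3 = 5 := rfl

-- An existence proof must hand over a witness (here, 3).
example : ∃ n : Nat, n + 2 = 5 := ⟨3, rfl⟩

-- A predicate: a proposition that takes a parameter.
def IsEven (n : Nat) : Prop := ∃ k : Nat, n = 2 * k

-- Proving the predicate for a particular object means building the witness.
example : IsEven 4 := ⟨2, rfl⟩

-- Proving a claim false by exhibiting a counterexample (n = 1).
example : ¬ (∀ n : Nat, n = 0) := fun h => absurd (h 1) (by decide)
```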

That’s roughly where we got to in type theory before the blog went on hiatus.

The other day, I received an email that actually excited me! It’s a question related to Cantor’s diagonalization, but there’s absolutely nothing cranky about it! It’s something interesting and subtle. So without further ado:

Cantor’s diagonalization says that you can’t put the reals into 1 to 1 correspondence with the integers. The well-ordering theorem seems to suggest that you can pick a least number from every set including the reals, so why can’t you just keep picking least elements to put them into 1 to 1 correspondence with the reals. I understand why Cantor says you can’t. I just don’t see what is wrong with the other arguments (other than it must be wrong somehow). Apologies for not being able to state the argument in formal maths, I’m around 20 years out of practice for formal maths.

As we’ve seen in too many discussions of Cantor’s diagonalization, it’s a proof that shows that it is impossible to create a one-to-one correspondence between the natural numbers and the real numbers.

The well-ordering theorem says something that seems innocuous at first, but which, looked at in depth, really does appear to contradict Cantor’s diagonalization.

A set $S$ is well-ordered if there exists a total ordering $\le$ on the set, with the additional property that every non-empty subset $T \subseteq S$ has a smallest element.

The well-ordering theorem says that every non-empty set can be well-ordered. Since the set of real numbers is a set, that means that there exists a well-ordering relation over the real numbers.

The problem with that is that it appears that that tells you a way of producing an enumeration of the reals! It says that the set of all real numbers has a least element: Bingo, there’s the first element of the enumeration! Now you take the set of real numbers excluding that one, and it has a least element under the well-ordering relation: there’s the second element. And so on. Under the well-ordering theorem, then, every set has a least element; and every element has a unique successor! Isn’t that defining an enumeration of the reals?

The solution to this isn’t particularly satisfying on an intuitive level.

The well-ordering theorem is, mathematically, equivalent to the axiom of choice. And like the axiom of choice, it produces some very ugly results. It can be used to create “existence” proofs of things that, in a practical sense, don’t exist in a usable form. It proves that something exists, but it doesn’t prove that you can ever produce it or even identify it if it’s handed to you.

So there is an enumeration of the real numbers under the well-ordering theorem. Only the less-than relation used to define the well-ordering is not the standard real-number less-than operation. (It obviously can’t be, because under well-ordering, every non-empty set has a least element, and under standard real-number less-than, the set of reals doesn’t have a least element.) In fact, for any ordering relation $\le_x$ that you can define, describe, or compute, $\le_x$ is not the well-ordering relation for the reals.

Under the well-ordering theorem, the real numbers have a well-ordering relation, only you can’t ever know what it is. You can’t define any element of it; even if someone handed it to you, you couldn’t tell that you had it.

It’s very much like the Banach-Tarski paradox: we can say that there’s a way of doing it, only we can’t actually do it in practice. In the B-T paradox, we can say that there is a way of cutting a sphere into these strange pieces – but we can’t describe anything about the cut, other than saying that it exists. The well-ordering of the reals is the same kind of construct.

How does this get around Cantor? It weasels its way out of Cantor by the fact that while the well-ordering exists, it doesn’t exist in a form that can be used to produce an enumeration. You can’t get any kind of handle on the well-ordering relation. You can’t produce an enumeration from something that you can’t create or identify – just like you can’t ever produce any of the pieces of the Banach-Tarski cut of a sphere. It exists, but you can’t use it to actually produce an enumeration. So the set of real numbers remains non-enumerable even though it’s well-ordered.

If that feels like a cheat, well… That’s why a lot of people don’t like the axiom of choice. It produces cheatish existence proofs. Connecting back to something I’ve been trying to write about, that’s a big part of the reason why intuitionistic type theory exists: it’s a way of constructing math without stuff like this. In an intuitionistic type theory (like the Martin-Löf theory that I’ve been writing about), it doesn’t exist if you can’t construct it.

# Understanding Global Warming Scale Issues

Aside from the endless stream of Cantor cranks, the next biggest category of emails I get is from climate “skeptics”. They all ask pretty much the same question. For example, here’s one I received today:

My personal analysis, and natural sceptisism tells me, that there are something fundamentally wrong with the entire warming theory when it comes to the CO2.

If a gas in the atmosphere increase from 0.03 to 0.04… that just cant be a significant parameter, can it?

I generally ignore it, because… let’s face it, the majority of people who ask this question aren’t looking for a real answer. But this one was much more polite and reasonable than most, so I decided to answer it. And once I went to the trouble of writing a response, I figured that I might as well turn it into a post as well.

The current figures – you can find them in a variety of places from Wikipedia to the US NOAA – are that atmospheric CO2 has changed from around 280 parts per million in 1850 to 400 parts per million today.

Why can’t that be a significant parameter?

There’s a couple of things to understand to grasp global warming: how much energy carbon dioxide can trap in the atmosphere, and how much carbon dioxide there actually is in the atmosphere. Put those two facts together, and you realize that we’re talking about a massive quantity of carbon dioxide trapping a massive amount of energy.

The problem is scale. Humans notoriously have a really hard time wrapping our heads around scale. When numbers get big enough, we aren’t able to really grasp them intuitively and understand what they mean. The difference between two numbers like 300 and 400ppm seems tiny; we can’t really grasp how it could be significant, because we aren’t good at taking that small difference, and realizing just how ridiculously large it actually is.

If you actually look at the math behind the greenhouse effect, you find that some gasses are very effective at trapping heat. The earth is only habitable because of the carbon dioxide in the atmosphere – without it, earth would be too cold for life. Small amounts of it provide enough heat-trapping effect to move us from a frozen rock to the world we have. Increasing the quantity of it increases the amount of heat it can trap.

Let’s think about what the difference between 280 and 400 parts per million actually means at the scale of earth’s atmosphere. You hear a number like 400ppm – that’s 4 one-hundredths of one percent – that seems like nothing, right? How could that have such a massive effect?!

But like so many other mathematical things, you need to put that number into the appropriate scale. The earth’s atmosphere masses roughly 5 times 10^21 grams. 400ppm of that (treating ppm as a fraction of mass, which is close enough for an order-of-magnitude estimate) scales to 2 times 10^18 grams of carbon dioxide. That’s 2 quadrillion (2 times 10^15) kilograms of CO2. Compared to the 280ppm of 1850, that’s about 600 trillion kilograms of carbon dioxide added to the atmosphere. That’s a really, really massive quantity of carbon dioxide! Scaled to the number of particles, that’s something around 10^40 (plus or minus a couple of powers of ten – at this scale, who cares?) additional molecules of carbon dioxide in the atmosphere. It’s a very small percentage, but it’s a huge quantity.
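For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch in Python. The atmosphere mass and the 280ppm and 400ppm figures come from this post; the molar mass of CO2 and Avogadro’s number are standard constants that I’ve added, and treating ppm as a fraction of mass is a simplifying assumption (CO2 concentrations are normally quoted by volume), so read the results as orders of magnitude only.

```python
ATMOSPHERE_MASS_G = 5e21   # from the text: roughly 5 x 10^21 grams
CO2_MOLAR_MASS_G = 44.01   # grams per mole of CO2 (standard value)
AVOGADRO = 6.022e23        # molecules per mole (standard value)


def co2_mass_grams(ppm):
    """Mass of CO2 at a given concentration, treating ppm as a mass fraction."""
    return ATMOSPHERE_MASS_G * ppm / 1e6


total_now = co2_mass_grams(400)
added = co2_mass_grams(400) - co2_mass_grams(280)
added_molecules = added / CO2_MOLAR_MASS_G * AVOGADRO

print(f"CO2 at 400ppm:      {total_now:.1e} g = {total_now / 1000:.1e} kg")
print(f"Added since 280ppm: {added:.1e} g = {added / 1000:.1e} kg")
print(f"Added molecules:    {added_molecules:.1e}")
```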

When you talk about trapping heat, you also have to remember that there’s scaling issues there, too. We’re not talking about adding 100 degrees to the earth’s temperature. It’s a massive increase in the quantity of energy in the atmosphere, but because the atmosphere is so large, it doesn’t look like much: just a couple of degrees. That can be very deceptive – 5 degrees celsius isn’t a huge temperature difference. But if you think of the quantity of extra energy that’s being absorbed by the atmosphere to produce that difference, it’s pretty damned huge. It doesn’t necessarily look like all that much when you see it stated as 2 degrees celsius – but if you think of it in terms of the quantity of additional energy being trapped by the atmosphere, it’s very significant.

Calculating just how much energy a molecule of CO2 can absorb is a lot trickier than calculating the mass-change of the quantity of CO2 in the atmosphere. It’s a complicated phenomenon which involves a lot of different factors – how much infrared is absorbed by a molecule, how quickly that energy gets distributed into the other molecules that it interacts with… I’m not going to go into detail on that. There are plenty of places where you can look up a detailed explanation. But when you consider the scale issues, it should be clear that there’s a pretty damned massive increase in the capacity to absorb energy in a small percentage-wise increase in the quantity of CO2.

# Okonomilatkes!

I’m working on some type theory posts, but it’s been slow going.

In the meantime, it’s Chanukah time. Every year, my family makes me cook potato latkes for Chanukah. The problem with that is, I don’t particularly like potato latkes. This year, I came up with the idea of trying to tweak them into something that I’d actually enjoy eating. What I came up with is combining a latke with another kind of fried savory pancake that I absolutely love: the Japanese okonomiyaki. The result? Okonomilatkes.

Ingredients:

• 1/2 head green cabbage, finely shredded.
• 1 1/2 pounds potatoes
• 1/2 cup flour
• 1/2 cup water
• 1 beaten egg
• 1/2 pound crabstick cut into small pieces
• Tonkatsu sauce (buy it at an Asian grocery store in the Japanese section. The traditional brand has a bulldog logo on the bottle.)
• Katsuobushi (shredded bonito)
• Japanese mayonnaise (sometimes called Kewpie mayonnaise. You can find it in squeeze bottles in any Asian grocery. Don’t substitute American mayo – Japanese mayo is thinner, less oily, a bit tart, sweeter, and creamier. It’s really pretty different.)
• 1 teaspoon salt
• 1/2 teaspoon baking powder.

Instructions

1. In a very hot pan, add about a tablespoon of oil, and when it’s nearly smoking, add the cabbage. Saute until the cabbage wilts and starts to brown. Remove from the heat, and set aside to cool.
2. Using either the grater attachment of a food processor, or the coarse side of a box grater, shred the potatoes. (I leave the skins on, but if that bugs you, peel them first).
3. Squeeze as much water as you can out of the shredded potatoes.
4. Mix together the water, flour, baking powder, egg, and salt into a thin batter.
5. Add the potatoes, cabbage, and crabstick to the batter, and stir together.
6. Split this mixture into four portions.
7. Heat a nonstick pan on medium high heat, add a generous amount of oil, and add one quarter of the batter. Let it cook until nicely browned, then flip, and cook the other side. On my stove, it takes 3-5 minutes per side. Add oil as needed while it’s cooking.
8. Repeat with the other 3 portions.
9. To serve, put a pancake on a plate. Squeeze a bunch of stripes of mayonnaise over it, then add a bunch of the tonkatsu sauce, and sprinkle with the katsuobushi.

# Polls and Sampling Errors in the Presidential Debate Results

My biggest pet peeve is press coverage of statistics. As someone who is mathematically literate, I’m constantly infuriated by it. Basic statistics isn’t that hard, but people can’t be bothered to actually learn a tiny bit in order to understand the meaning of the things they’re covering.

My Twitter feed has been exploding with a particularly egregious example of this. After Monday night’s presidential debate, there’s been a ton of polling about who “won” the debate. One conservative radio host named Bill Mitchell has been on a rampage about those polls. Here’s a sample of his tweets:

Statistical analysis has a very simple point. We’re interested in understanding the properties of a large population of things. For whatever reason, we can’t measure the properties of every object in that population.

The exact reason can vary. In political polling, we can’t ask every single person in the country who they’re going to vote for. (Even if we could, we simply don’t know who’s actually going to show up and vote!) For a very different example, my first exposure to statistics was through my father, who worked in semiconductor manufacturing. They’d produce a run of 10,000 chips for use in satellites. They needed to know when, on average, a chip would fail from exposure to radiation. If they measured that in every chip, they’d end up with nothing to sell.

Anyway: you can’t measure every element of the population, but you still want to take measurements. So what you do is randomly select a collection of representative elements from the population, and you measure those. Then you can say that with a certain probability, the result of analyzing that representative subset will match the result that you’d get if you measured the entire population.

How close can you get? If you’ve really selected a random sample of the population, then the answer depends on the size of the sample. We measure that using something called the “margin of error”. “Margin of error” is actually a terrible name for it, and that’s the root cause of one of the most common problems in reporting about statistics. The margin of error is a probability measurement that says “there is an $N$% probability that the value for the full population lies within the margin of error of the measured value of the sample.”

Right away, there’s a huge problem with that. What is that variable doing in there? The margin of error measures the probability that the full population value is within a confidence interval around the measured sample value. If you don’t say what the confidence interval is, the margin of error is worthless. Most of the time – but not all of the time – we’re talking about a 95% confidence interval.

But there are several subtler issues with the margin of error, largely due to the name.

1. The “true” value for the full population is not guaranteed to be within the margin of error of the sampled value. It’s just a probability. There is no hard bound on the size of the error: just a high probability of it being within the margin.
2. The margin of error only includes errors due to sample size. It does not incorporate any other factor – and there are many! – that may have affected the result.
3. The margin of error is deeply dependent on the way that the underlying sample was taken. It’s only meaningful for a random sample. That randomness is critically important: all of sampled statistics is built around the idea that you’ve got a randomly selected subset of your target population.

Let’s get back to our friend the radio host, and his first tweet, because he’s doing a great job of illustrating some of these errors.

The quality of a sampled statistic is entirely dependent on how well the sample matches the population. The sample is critical. It doesn’t matter how big the sample size is if it’s not random. A non-random sample cannot be treated as a representative sample.

So: an internet poll, where a group of people has to deliberately choose to exert the effort to participate, cannot be a valid sample for statistical purposes. It’s not random.

It’s true that the set of people who show up to vote isn’t a random sample. But that’s fine: the purpose of an election isn’t to try to divine what the full population thinks. It’s to count what the people who chose to vote think. It’s deliberately measuring a full population: the population of people who chose to vote.

But if you’re trying to statistically measure something about the population of people who will go and vote, you need to take a randomly selected sample of people who will go to vote. The set of voters is the full population; you need to select a representative sample of that population.

Internet polls do not do that. At best, they measure a different population of people. (At worst, with ballot stuffing, they measure absolutely nothing, but we’ll give them this much benefit of the doubt.) So you can’t take much of anything about the sample population and use it to reason about the full population.

And you can’t say anything about the margin of error, either. Because the margin of error is only meaningful for a representative sample. You cannot compute a meaningful margin of error for a non-representative sample, because there is no way of knowing how that sampled population compares to the true full target population.

And that brings us to the second tweet. A properly sampled random population of 500 people can produce a high quality result with a margin of error of roughly 4 to 5% and a 95% confidence interval. (I’m doing a back-of-the-envelope calculation here, so that’s not precise.) That means that if the population were randomly sampled, we could say that in 19 out of 20 polls of that size, the full population value would be within roughly +/- 4 to 5% of the value measured by the poll. For a non-randomly selected sample of 10 million people, the margin of error cannot be measured, because it’s meaningless. The random sample of 500 people tells us a reasonable estimate based on data; the non-random sample of 10 million people tells us nothing.
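For the curious, the back-of-the-envelope calculation can be sketched in a few lines of Python. This uses the standard normal-approximation formula for a simple random sample, which is my own addition rather than something spelled out above; the function name `margin_of_error` is just illustrative, `z = 1.96` corresponds to a 95% confidence level, and `proportion = 0.5` is the worst case. None of it means anything for a non-random sample.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a simple random sample.

    Uses the normal approximation z * sqrt(p * (1 - p) / n).
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


for n in (100, 500, 1000, 2000):
    print(f"n = {n:5d}: +/- {margin_of_error(n) * 100:.1f}%")
```

Notice that the margin shrinks with the square root of the sample size: quadrupling the number of people polled only halves the margin of error.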

And with that, on to the third tweet!

In a poll like this, the margin of error only tells us one thing: what’s the probability that the sampled population will respond to the poll in the same way that the full population would?

There are many, many things that can affect a poll beyond the sample size. Even with a truly random and representative sample, there are many things that can affect the outcome. For a couple of examples:

How, exactly, is the question phrased? For example, if you ask people “Should police shoot first and ask questions later?”, you’ll get a very different answer from “Should police shoot dangerous criminal suspects if they feel threatened?” – but both of those questions are trying to measure very similar things. But the phrasing of the questions dramatically affects the outcome.

What context is the question asked in? Is this the only question asked? Or is it asked after some other set of questions? The preceding questions can bias the answers. If you ask a bunch of questions about how each candidate did with respect to particular issues before you ask who won, those preceding questions will bias the answers.

When you’re looking at a collection of polls that asked different questions in different ways, you expect a significant variation between them. That doesn’t mean that there’s anything wrong with any of them. They can all be correct even though their results vary by much more than their margins of error, because the margin of error has nothing to do with how you compare their results: they used different samples, and measured different things.

The problem with the reporting is the same set of things I mentioned up above. The press treats the margin of error as an absolute bound on the error in the computed sample statistics (which it isn’t); and the press pretends that all of the polls are measuring exactly the same thing, when they’re actually measuring different (but similar) things. They don’t tell us what the polls are really measuring; they don’t tell us what the sampling methodology was; and they don’t tell us the confidence interval.

Which leads to exactly the kind of errors that Mr. Mitchell made.

And one bonus. Mr. Mitchell repeatedly rants about how many polls show a “bias” by “over-sampling” Democratic party supporters. This is a classic mistake by people who don’t understand statistics. As I keep repeating, for a sample to be meaningful, it must be random. You can report on all sorts of measurements of the sample, but you cannot change it.

If you’re randomly selecting phone numbers and polling the respondents, you cannot screen the responders based on their self-reported party affiliation. If you do, you are biasing your sample. Mr. Mitchell may not like the results, but that doesn’t make them invalid. People report what they report.

In the last presidential election, we saw exactly this notion in the idea of “unskewing” polls, where a group of conservative folks decided that the polls were all biased in favor of the Democrats for exactly the reasons cited by Mr. Mitchell. They recomputed the poll results based on shifting the samples to represent what they believed to be the “correct” breakdown of party affiliation in the voting population. The results? The actual election results closely tracked the supposedly “skewed” polls, and the unskewers came off looking like idiots.

We also saw exactly this phenomenon going on in the Republican primaries this year. Randomly sampled polls consistently showed Donald Trump crushing his opponents. But the political press could not believe that Donald Trump would actually win – and so they kept finding ways to claim that the poll samples were off: for example, that the polls relied on land-lines, which oversampled older people, and that if you corrected for that sampling error, Trump wasn’t actually winning. Nope: the randomly sampled polls were correct, and Donald Trump is the Republican nominee.

If you want to use statistics, you must work with random samples. If you don’t, you’re going to screw up the results, and make yourself look stupid.

# Why we need formality in mathematics

The comment thread from my last Cantor crankery post has continued in a way that demonstrates a common issue when dealing with bad math, so I thought it was worth taking the discussion and promoting it to a proper top-level post.

The defender of the Cantor crankery tried to show what he alleged to be the problem with Cantor, by presenting a simple proof:

If we have a unit line, then this line will have an infinite number of points in it. Some of these points will be an irrational distance away from the origin and some will be a rational distance away from the origin.

Premise 1.

To have more irrational points on this line than rational points (plus 1), it is necessary to have at least two irrational points on the line so that there exists no rational point between them.

Premise 2.

It is not possible to have two irrational points on a line so that no rational point exists between them.

Conclusion.

It is not possible to have more irrational points on a line than rational points (plus 1).

This contradicts Cantor’s conclusion, so Cantor must have made a mistake in his reasoning.

(I’ve done a bit of formatting of this to make it look cleaner, but I have not changed any of the content.)

This is not a valid proof. It looks nice on the surface – it intuitively feels right. But it’s not. Why?

Because math isn’t intuition. Math is a formal system. When we’re talking about Cantor’s diagonalization, we’re working in the formal system of set theory. In most modern math, we’re specifically working in the formal system of Zermelo-Fraenkel (ZF) set theory. And that “proof” relies on premises that are not both correct in ZF set theory. I pointed this out in verbose detail, to which the commenter responded:

I can understand your desire for a proof to be in ZFC, Peano arithmetic and FOPL, it is a good methodology but not the only one, and I am certain that it is not a perfect one. You are not doing yourself any favors if you let any methodology trump understanding. For me it is far more important to understand a proof, than to just know it “works” under some methodology that simply manipulates symbols.

This is the point I really wanted to get to here. It’s a form of complaint that I’ve seen over and over again – not just in the Cantor crankery, but in nearly all of the math posts.

There’s a common belief among crackpots of various sorts that scientists and mathematicians use symbols and formalisms just because we like them, or because we want to obscure things and make simple things seem complicated, so that we’ll look smart.

That’s just not the case. We use formalisms and notation because they are absolutely essential. We can’t do math without the formalisms; we could do it without the notation, but the notation makes things clearer than natural language prose.

The reason for all of that is that we want to be correct.

If we’re working with a definition that contains any vagueness – even the most subtle unintentional kind (or, actually, especially the most subtle unintentional kind!) – then we can easily produce nonsense. There’s a simple “proof” that we’ve discussed before that shows that 0 is equal to 1. It looks correct when you read it. But it contains a subtle error. If we aren’t being careful and formal, that kind of mistake can easily creep in – and once you allow one single, innocuous-looking error into a proof, the entire proof falls apart. The reason for all the formalism and all the notation is to give us a way to unambiguously, precisely state exactly what we mean. The reason that we insist on detailed, logical, step-by-step proofs is that it’s the only way to make sure that we aren’t introducing errors.

We can’t rely on intuition, because our intuition is frequently not correct. That’s why we use logic. We can’t rely on informal statements, because informal statements lack precision: they can mean many different things, some of which are true, and some of which are not.

In the case of Cantor’s diagonalization, when we’re being carefully precise, we’re not talking about the size of things: we’re talking about the cardinality of sets. That’s an important distinction, because “size” can mean many different things. Cardinality means one, very precise thing.
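To spell out what that one precise thing is, here is the usual textbook formulation (my notation, not something taken from the comment thread):

$$|A| = |B| \iff \text{there exists a bijection } f : A \to B$$

$$|A| \le |B| \iff \text{there exists an injection } f : A \to B$$

$$|A| < |B| \iff |A| \le |B| \text{ and } |A| \ne |B|$$

Nothing in these definitions mentions length, density, or how the elements are arranged on a line. The only question is whether a function of the right kind exists between the two sets.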

Similarly, we’re talking about the cardinality of the set of real numbers compared to the cardinality of the set of natural numbers. When I say that, I’m not just hand-waving the real numbers: the real numbers means something very specific: it’s the unique complete totally ordered field $(R, +, *, <)$ up to isomorphism. To understand that, we’re implicitly referencing the formal definition of a field (with all of its sub-definitions) and the formal definitions of the addition, multiplication, and ordering operations.

I’m not just saying that to be pedantic. I’m saying that because we need to know exactly what we’re talking about. It’s very easy to put together an informal definition of the real numbers that’s different from the proper mathematical set of real numbers. For example, you can define a number system consisting of the set of all numbers that can be generated by a finite, non-terminating computer program. Intuitively, it might seem like that’s just another way of describing the real numbers – but it describes a very different set.

Beyond just definitions, we insist on using formal symbolic logic for a similar reason. If we can reduce the argument to symbolic reasoning, then we’ve abstracted away anything that could bias or deceive us. The symbolic logic makes every statement absolutely precise, and every reasoning step pure, precise, and unbiased.

So what’s wrong with the “proof” above? It’s got two premises. Let’s just look at the first one: “To have more irrational points on this line than rational points (plus 1), it is necessary to have at least two irrational points on the line so that there exists no rational point between them.”

If this statement is true, then Cantor’s proof must be wrong. But is this statement true? The commenter’s argument is that it’s obviously intuitively true.

If we weren’t doing math, that might be OK. But this is math. We can’t just rely on our intuition, because we know that our intuition is often wrong. So we need to ask: can you prove that that’s true?

And how do you prove something like that? Well, you start with the basic rules of your proof system. In a discussion of a set theory proof, that means ZF set theory and first order predicate logic. Then you add in the definitions you need to talk about the objects you’re interested in: so Peano arithmetic, rational numbers, real number theory, and the definition of irrational numbers in real number theory. That gives you a formal system that you can use to talk about the sets of real numbers, rational numbers, and natural numbers.

The problem for our commenter is that you can’t prove that premise using ZF set theory, FOPL, and real number theory. It’s not true. It’s based on a faulty understanding of the behavior of infinite sets. It’s taking an assumption that comes from our intuition, which seems reasonable, but which isn’t actually true within the formal system of mathematics.
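A small illustration of how little the arrangement of points on the line tells you about cardinality: it’s a standard theorem of real analysis that between any two distinct real numbers there is a rational number, and that fact sits perfectly comfortably alongside the reals being uncountable. The sketch below shows the usual construction. It uses floating-point values as stand-ins for irrational points, which is an assumption of convenience (a float is itself a rational number), and the function name `rational_between` is hypothetical.

```python
from fractions import Fraction
import math

def rational_between(a, b):
    """Return a rational number strictly between two distinct values."""
    lo, hi = sorted((Fraction(a), Fraction(b)))
    if lo == hi:
        raise ValueError("need two distinct points")
    # Pick a denominator q large enough that q * (hi - lo) > 1 ...
    q = math.floor(1 / (hi - lo)) + 1
    # ... so some integer p satisfies lo * q < p < hi * q.
    p = math.floor(lo * q) + 1
    return Fraction(p, q)

a, b = math.sqrt(2), math.sqrt(3)
r = rational_between(a, b)
print(a, "<", r, "<", b)
```

You can always find a rational in the gap, no matter how small the gap is. But being able to find one says nothing about whether a bijection between the rationals and the irrationals exists, and that’s the only thing that cardinality is about.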

In particular, it’s trying to say that in set theory, the cardinality of the set of real numbers is equal to the cardinality of the set of natural numbers – but doing so by saying “Ah, why are you worrying about that set theory nonsense? Sure, it would be nice to prove this statement about set theory using set theory, but you’re just being picky in insisting on that.”

Once you really see it in these terms, it’s an absurd statement. It’s equivalent to something as ridiculous as saying that you don’t need to conjugate verbs when you speak English, because in Chinese, the spoken words don’t change for conjugation.