# Monte Carlo!

I was browsing around my crackpottery achives, looking for something fun to write about. I noticed a link from one of them to an old subject of one of my posts, the inimitable Miles Mathis. And following that, I noticed an interesting comment from Mr. Mathis about Monte Carlo methods: “Every mathematician knows that ‘tools’ like Monte Carlo are used only when you’ve got nothing else to go on and you are flying by the seat of your pants.” I find Monte Carlo computations absolutely fascinating. So instead of wasting time making fun of more of Mathis rubbish, I decided to talk about Monte Carlo methods.

It’s a little hard to talk about Monte Carlo methods, because there’s a lot of disagreement about exactly what they are. I’m going to use the broadest definition: a Monte Carlo method is a way of generating a computational result using repeated computations and random sampling.

In other words, Monte Carlo methods are a way of using random sampling to solve problems.

I’ll start with a really simple example. Suppose you want to know the value of $\pi$. (Pretend that you don’t know the analytical solution.) One thing that you could do is try to measure the circumference of a rod, and then divide it by its diameter. That would work, but it would be really hard to get much accuracy. You could, instead, get a great big square sheet of paper, and cover the whole thing in a single layer of grains of sand. Then, very carefully, you could remove the grains of sand that weren’t in the circle, compare it to the number of grains of sand that weren’t in the circle. By doing that, you could get a very, very accurate measurement of the area of the circle, and using that, you could get a much more accurate estimate of $\pi$.

The problem with that is: it’s really hard to get a perfect single-grain layer of sand all over the paper. And it would be a lot of very, very tedious work to get all of the grains that weren’t in the circle. And it would be very tedious to count them. It’s too much trouble.

Instead, you could just take 1,000 grains of sand, and drop them randomly all over the circle and the square. Then you could count how many landed in the circle. Or ever easier, you could just go to a place where lots of drunk people play darts! Draw a square around the dartboard, and count how many holes there are in the square wall around it, versus how many in the dartboard!

You’re not going to get a super-precise value for $\pi$ – but you might be surprised just how good you can get!

That’s the basic idea of monte carlo simulation: you’ve got a problem that’s hard to compute, or one where you don’t know a closed-form solution to make it easy to compute. Getting the answer some other way is intractable, because it requires more work than you can reasonably do. But you’ve got an easy way to do a test – like the “is it in the circle or not” test. So you generate a ton of random numbers, and use those, together with the test, to do a sequence of trials. Then using the information from the trials, you can get a reasonable estimate of the value you wanted. The more trials you do, the better your estimate will be.

The more you understand the probability distribution of the space you’re sampling, the better your estimate will be. For example, in the $\pi$ example above, we assumed that the grains of sand/darts would be distributed equally all over the space. If you were using the dartboard in a bar, the odds are that the distribution wouldn’t be uniform – there’d be many more darts hitting the dartboard than hitting the wall (unless I was playing). If you assumed a uniform distribution, your estimate would be off!

That’s obviously a trivial example. But in reality, the Monte Carlo method is incredibly useful for a wide range of purposes. It was used during World War II by the Manhattan project to help design the first atom bomb! They needed to figure out how to create a critical mass that would sustain a nuclear chain reaction; to do that, they needed to be able to compute neutron diffusion through a mass of radioactive uranium. But that’s a very hard problem: there are so many degrees of freedom – so many places where things could proceed in several seemingly (or actually!) random directions. With the computers they had available to them at the time, there was absolutely no way that they could write a precise numerical simulation!

But, luckily for them, they had some amazing mathematicians working on the problem! One of them, Stanislav Ulam, had been sick, and while he was recovering, fooled around with some mathematical problems. One of them involved a solitaire game, called Canfield. Ulam wanted to figure out how often the game of Canfield was winnable. He couldn’t figure out how to do it analytically, but he realized that since the deals of cards are a uniform distribution, then if you were to take a computer, and make it play through 1000 games, the number of times that it won would be a pretty good estimate of how many times the game was winnable in general.

In that case, it’s obvious that a complete solution isn’t feasible: there are 52! possible deals – roughly $3\times 10^{66}$! But with just a couple of hundred trials, you can get a really good estimate.

Ulam figured that out for the card game. He explained it to Jon von Neumann, and von Neumann realized that the same basic method could be used for the Neutron diffraction process!

Since then, it’s been used as the basis for a widely applicable approach to numeric integration. It’s used for numerous physics simulations, where there is no tractable exact solution – for example, weather prediction. (We’ve been able to get reasonably accurate weather predictions up to six days in advance, using very sparse input data, by using Monte Carlo methods!) It’s an incredibly useful, valuable technique – and anyone who suggests that using Monte Carlo is in any way a half-assed solution is an utter jackass.

I’ll finish up with a beautiful example – my favorite example of combining analytical methods with Monte Carlo. It’s another way of computing an estimate of $\pi$, but it gets a more accurate result with fewer trials than the sand/darts.

It’s based on a problem Buffon’s needle. Buffon’s needle is a problem first proposed by the Count of Buffon during the 1700s. He asked: suppose I drop a needle onto a panelled wood floor. What’s the probability that the needle will fall so that it crosses a one of the joints between different boards?

Using some very nice analytical work, you can show that if the panels have uniform width $t$, and the needle has length $l$, then the probability of a needle crossing a line is: $\frac{2l}{\pi t}$. That gives us the nice property that if we let $l = \frac{t}{2}$, then the probability of crossing a line is $\frac{1}{\pi}$.

Using that, you can do a Monte Carlo computation: take a sheet of paper, and a couple of hundred matchsticks. Draw lines on the paper, separated by twice the length of a matchstick. Then scatter the matchsticks all over the paper. Divide the total number of matchsticks by the number that crossed a line. The result will be roughly $\pi$.

For example – with 200 trials, I got 63 crossing a line. That gives me roughly 3.17 as an estimate of $\pi$. That’s not half-bad for a five minute experimental estimate!

# Bayes Theorem

I’ve been meaning to get back to some of the probability stuff. We’re currently recovering from a major snow/ice storm, and I’m snowed/iced in, so this is a good time!

Today, we’ll talk about what is, according to many people, the most important rule in all of probability: Bayes theorem. It’s also, in my experience, the single most abused rule in all of mathematics. Nothing else has been used so poorly, by so many people, to support sloppy, dumb arguments. After we talk about what the rule is, and what it means, we’ll move on to talk about how it gets abused.

In a pure mathematical sense, Bayes theorem is simple. The interpretation of it, and what it means gets pretty hairy. Suppose that you’ve got two related events, A and B. You know the probability of A occurring is P(A). You know the probability of B occurring is P(B). And you know that if A has already occurred, what the probability of B occurring is. (We write that P(B | A), which you can ready as “the probability of B given A”.) What you’d like to know is, suppose that I know that B occurred. What’s the probability that A also occurred? (What is P(A | B)?)

Bayes theorem says:

$P(A|B) = \frac{P(B|A) P(A)}{P(B)}$

Let’s be concrete. I go to work, and walk into my office in the morning, and get into the elevator with one other person that I work with.What is the probability that it’s a man?

Without knowing anything about the people that I work with, a reasonable guess would be 50% – the population is pretty close to evenly divided between the genders.

But I’m an engineer, and one of the very unfortunate facts about my job is that the gender pool of engineers is very skewed. Let’s say that it’s 80% men. (In reality, that’s probably actually pretty low.)

Let’s say that about 1/3 of the office is engineering. So the odds that someone I bump into will be an engineer is about 50%.

I can do a couple of things with that information. I could ask, suppose that I walked into the elevator with a woman. What’s the probability that she’s an engineer?

To answer that, I’ll use Bayes law. We’ll say that P(A) is the probability that a random person is a woman- 1/2. P(B) is the probability that a random person is an engineer – 1/3. If I know that a given person is an engineer, the probability of that person being a woman is P(B | A), or 1/5. So what’s the probability of my random female coworker being an engineer (P(A | B))?

• $P(\text{woman}) = 1/2$
• $P(\text{eng}) = 1/3$
• $P(\text{woman} | \text{eng}) = 1/5$
• $P(\text{eng} | \text{woman}) = \frac{P(\text{woman}|\text{eng})P(\text{eng})}{P(\text{woman})} = \frac{(1/5)(1/3)}{1/2} = \frac{2}{15}$

See? That was easy, wasn’t it?

Now, what’s it actually mean? If you look at it this way, it doesn’t seem to be such a big deal. Sure, it’s a way of combining probabilities in another situation, but so what? Why’s it any more important than any other?

Because it’s the mathematical method for how to incorporate new knowledge into your expectations. What we did above was start with one understanding of the thing we were trying to predict. Knowing nothing but the typical distribution of genders in the general population, we made a guess about a 50% probability of encountering a woman. But then we added in new information. We knew the population of engineers, and the fact that the gender ration was skewed in engineering – and we incorporated that new information into our prediction.

That answer comes from interpretations. One of the classic interpretations of probability theory is the Bayesian interpretation – named Bayesian specifically because of how it interprets this rule! The Bayesian interpretation says that a statement about probability is really a statement about the state of our knowledge. If I say that the probability of flipping heads on a coin is 1/2, what I’m saying under the Bayesian interpretation is that my certainty that I’ll flip heads is just 1/2.

In that kind of knowledge-based interpretation, there is no intrinsic probability of any event. There is just our degree of certainty about whether the event will occur. Given new information, our degree of certainty can change. Bayes theorem tells us, given new information, exactly how we should change our interpretation.

To explain the bayesian interpretation, we’ll add a couple of terms.

Hypothesis
The hypothesis is the thing whose degree of certainty we’re trying to measure. In the formulation of Bayes law up above, we call it A; here, we’ll call it $H$.
Prior
The prior, P(H), is the degree of certainty about the hypothesis given no other information.
Evidence
The evidence is the new piece of information that we’re trying to add to our measurement of certainty. Above, we called it B, but here, we’ll call it $E$.
Likelihood
The likelihood $P(E | H)$ of a piece of evidence is our degree of certainty that a specific piece of evidence would be found if the hypothesis is true.
Model Evidence
The model evidence is $P(E)$, and it’s a bit confusing. It’s the analytic likelihood of any piece of evidence occurring. If you’re considering a set of possible hypotheses using Bayes rule, $P(E)$ will be the same for all of them, but $P(E | H)$ will be the specific likelihood of finding that particular piece of evidence under the hypothesis.
Posterior
The posterior, P(H|E), is the degree of certainty that we will have about A if we add new knowledge, B.
Support
Support is the change in our certainty created by the addition of our new evidence. The support is $\frac{P(E|H)}{P(E)}$.

So Bayes theorem is a formal statement of how, given evidence, we can modify our certainty about the truth of a particular statement. The classical textbook statement of it is the following. (I took this specific formulation from wikipedia, but any textbook will have nearly the same sentence.)

The posterior probability of a hypothesis is determined by a combination of the inherent likeliness of a hypothesis (the prior) and the compatibility of the observed evidence with the hypothesis (the likelihood).

Or, in mathematical terms, $P(H | E) = \frac{P(E | H)}{P(E)} \times P(H)$ – or exactly what we wrote for Bayes theorem up above.

Why is this abused so badly? Because under a naive, stupid
understanding of Bayes rule, you can essentially randomly estimate the probability of anything. After all, Bayes says that probability is just the combination of our certainties about some collection of facts. So if I can line up some set of facts, along with an estimate of the individual probabilities of those facts, then I can combine those probabilities, and come up with an estimate of the probability of anything! And if I don’t know the probability of an event occurring at al, then the state of my initial knowledge is really simple: it’s always 1/2 – 1/2 is always the starting point given absolutely no other knowledge.

That leads to rubbish like this proof that there are no extra-terrestial intelligences, or this or this purported proof of the existence of God.

All of these arguments fail in the same way. They don’t really use Bayes theorem. The quality of the priors – all of the priors, including the priors used to come up with measures of the likelihoods of the evidences – are crucial. They don’t bother with that. They just make up priors, and combine them without good likelihoods.

To me, the thing that makes probability fun is that the results are frequently surprising. We’ve got very strong instincts about how we expect numbers to work. But when you do anything that involves a lot of computations with big numbers, our intuition goes out the window – nothing works the way we expect it to. A great example of that is something called the birthday paradox.

Suppose you’ve got a classroom full of people. What’s the probability that there are two people with the same birthday? Intuitively, most people expect that it’s pretty unlikely. It seems like it shouldn’t be likely – 365 possible birthdays, and 20 or 30 people in a classroom? Very little chance, right?

Let’s look at it, and figure out how to compute the probability.

Interesting probability problems are all about finding out how to put things together. You’re looking at things where there are huge numbers of possible outcomes, and you want to determine the odds of a specific class of outcomes. Finding the solutions is all about figuring out how to structure the problem.

A great example of this is something called the birthday paradox. This is a problem with a somewhat surprising outcome. It’s also a problem where finding the right way to structure the problem is has a dramatic result.

Here’s the problem: you’ve got a group of 30 people. What’s the probability that two people out of that group of thirty have the same birthday?

We’ll look at it with some simplifying assumptions. We’ll ignore leap year – so we’ve got 365 possible birthdays. We’ll assume that all birthdays are equally likely – no variation for weekdays/weekends, no variation for seasons, and no holidays, etc. Just 365 equally probable days.

How big is the space? That is, how many different ways are there to assign birthdays to 30 people? It’s 36530 or something in the vicinity of 7.4*1076.

To start off, we’ll reverse the problem. It’s easier to structure the problem if we try to ask “What’s the probability that no two people share a birthday”. If P(B) is the probability that no two people share a birthday, then 1-P(B) is the probability that at least two people share a birthday.

So let’s look at a couple of easy cases. Suppose we’ve got two people? What’s the odds that they’ve got the same birthday? 1 in 365: there are 3652 possible pairs of birthdays; there are 365 possible pairs. So there’s a probability of 365/3652 that the two people have the same birthday. For just two people, it’s pretty easy. In the reverse form, there’s a 364/365 chance that the two people have different birthdays.

What about 3 people? It’s the probability of the first two having different birthdays, and the probability of the third person having a different birthday that either of those first two. There are 365 possible birthdays for the third person, and 363 possible days that don’t overlay with the first two. So for N people, the probability of having distinct birthdays is $1 \times (1 - 1/365) \times (1 - 2/365) \times \dots (1 - (n/365))$.

At this point, we’ve got a nice recursive definition. Let’s say that $f(N)$ is the probability of $N$ people having distinct birthdays. Then:

1. For 2 people, the probability of distinct birthdays is 364/365. ($f(2) = \frac{364}{365}$)
2. For N>2 people, the probability of distinct birthdays is
$\frac{365-(N-1)}{365} times f(n-1)$.

Convert that to a closed form, and you get: $f(n) = \frac{365!}{(365-(n-1))!365^n}$. For 30 people, that’s
$\frac{365!}{(365-29)!*365^{30}}$. Work it out, and that’s
0.29 – so the probability of everyone having distinct
birthdays is 29% – which means that the probability of at least
two people in a group of 30 having the same birthday is 71%!

You can see why our intuitions are so bad? We’re talking about something where one factor in the computation is the factorial of 365!

Let’s look a bit further: how many people do you need to have, before there’s a 50% chance of 2 people sharing a birthday? Use the formulae we wrote up above, and it turns out to be 23. Here’s the numbers – remember that this is the reverse probability, the probability of all birthdays being distinct.

 1 1 2 0.99726 3 0.991796 4 0.983644 5 0.972864 6 0.959538 7 0.943764 8 0.925665 9 0.905376 10 0.883052 11 0.858859 12 0.832975 13 0.80559 14 0.776897 15 0.747099 16 0.716396 17 0.684992 18 0.653089 19 0.620881 20 0.588562 21 0.556312 22 0.524305 23 0.492703 24 0.461656 25 0.4313 26 0.401759 27 0.373141 28 0.345539 29 0.319031 30 0.293684 31 0.269545 32 0.246652 33 0.225028 34 0.204683 35 0.185617 36 0.167818 37 0.151266 38 0.135932 39 0.12178 40 0.108768 41 0.0968484 42 0.0859695 43 0.0760771 44 0.0671146 45 0.0590241 46 0.0517472 47 0.0452256 48 0.039402 49 0.0342204 50 0.0296264

With just 23 people, there’s a greater than 50% chance that two people will have the same birthday. By the time you get to just 50 people, there’s a greater than 97% chance that two people have the same birthday!

As an amusing aside, the first time I saw this problem worked through was in an undergraduate discrete probability theory class, with 37 people in the class, and no duplicate birthdays!

Now – remember at the beginning, I said that the trick to working probability problems is all about how you formulate the problem. There’s a much, much better way to formulate this.

Think of the assignment of birthdays as a function from people to birthdays: $f: P \rightarrow B$. The number of ways of assigning birthdays to people is the size of the set of functions from people to birthdays. How many possible functions are there? $| B | ^{| P |}$. $| B |$ is the number of days in the year – 365, and $| P |$ is the number of people in the group.

The set of assignments to unique birthdays is the number of injective functions. (An injective function is a function where $f(x) = f(y) \Leftrightarrow x = y$.) How many injective functions are there? $\frac{| B |!}{(| B | - | P |)!}$.

The probability of all birthdays being unique is the size of the set of injective functions divided by the size of the set of all assignments: $\frac{\frac{| B |!}{(| B | - | P |)!}}{ | B | ^{| P |} } = \frac{365!}{365^P\times (365 - P)!}$.

So we’ve got the exact same result – but it’s a whole lot easier in term of the discrete functions!

# Combining Non-Disjoint Probabilities

In my previous post on probability, I talked about how you need to be careful about covering cases. To understand what I mean by that, it’s good to see some examples.

And we can do that while also introducing an important concept which I haven’t discussed yet. I’ve frequently talked about independence, but equally important is the idea of disjointness.

Two events are independent when they have no ability to influence one another. So two coin flips are independent. Two events are disjoint when they can’t possibly occur together. Flipping a coin, the event “rolled a head” and the event “rolled a tail” are disjoint: if you rolled a head, you can’t roll a tail, and vice versa.

So let’s think about something abstract for a moment. Let’s suppose that we’ve got two events, A and B. We know that the probability of A is 1/3 and the probability of B is also 1/3. What’s the probability of A or B?

Naively, we could say that it’s P(A) + P(B). But that’s not necessarily true. It depends on whether or not the two events are disjoint.

Suppose that it turns out that the probability space we’re working in is rolling a six sided die. There are three basic scenarios that we could have:

1. Scenario 1: A is the event “rolled 1 or 2”, and B is “rolled 3 or 4”. That is, A and B are disjoint.
2. Scenario 2: A is the event “rolled 1 or 2”, and B is “rolled 2 or 3”. A and B are different, but they overlap.
3. Scenario 3: A is the event “rolled 1 or 2”, and B is the event “rolled 1 or 2”. A and B are really just different names for the same event.

In scenario one, we’ve got disjoint events. So P(A or B) is P(A) + P(B). One way of checking that that makes sense is to look at how the probability of events work out. P(A) is 1/3. P(B) is 1/3. The probability of neither A nor B – that is, the probability of rolling either 5 or 6 – is 1/3. The sum is 1, as it should be.

But suppose that we looked at scenario 2. If we made a mistake and added them as if they were disjoint, how would things add up? P(A) is 1/3. P(B) is 1/3. P(neither A nor B) = P(4 or 5 or 6) = 1/2. The total of these three probabilities is 1/3 + 1/3 + 1/2 = 7/6. So just from that addition, we can see that there’s a problem, and we did something wrong.

If we know that A and B overlap, then we need to do something a bit more complicated to combine probabilities. The general equation is:

$P(A cup B) = P(A) + P(B) - P(A cap B)$

Using that equation, we’d get the right result. P(A) = 1/3; P(B) =
1/3; P(A and B) = 1/6. So the probability of A or B is 1/3 + 1/3 – 1/6 = 1/2. And P(neither A nor B) = P(4 or 5 or 6) = 1/2. The total is 1, as it should be.

From here, we’ll finally start moving in to some more interesting stuff. Next post, I’ll look at how to use our probability axioms to analyze the probability of winning a game of craps. That will take us through a bunch of applications of the basic rules, as well as an interesting example of working through a limit case.

And then it’s on to combinatorics, which is the main tool that we’ll use for figuring out how many cases there are, and what they are, which as we’ve seen is an essential skill for probability.

# Correction, Sigma Algebras, and Mass functions

So, I messed up a bit in the previous post. Let me get that out of the way before we move forward!

In measure theory, you aren’t just working with sets. You’re working with something called σ-algebras. It’s a very important distinction.

The problem is, our intuition of sets doesn’t always work. Sets, as defined formally, are really pretty subtle. We expect certain things to be true, because they make sense. But in fact, they are not implied by the definition of sets. A σ-algebra is, essentially, a well-behaved set – a set whose behavior matches our usual expectations.

To be formal, a sigma algebra over a set S is a collection Σ of subsets of S such that:

1. Σ is closed over set complement.
2. Σ is closed over countable union.

The reason why you need to make this restriction is, ultimately, because of the axiom of choice. Using the axiom of choice, you can create sets which are unmeasurable. They’re clearly subsets of a measurable set, and supersets of other measurable sets – and yet, they are, themselves, not measurable. This leads to things like the Banach-Tarski paradox: you can take a measurable set, divide it into non-measurable subsets, and then combine those non-measurable subsets back into measurable sets whose size seem to make no sense. You can take a sphere the size of a baseball, slice it into pieces, and then re-assemble those pieces into a sphere the size of the earth, without stretching them!

These non-measurable sets blow away our expectations about how things should behave. The restriction to σ algebras is just a way of saying that we need to be working in a space where all sets are measurable. When we’re looking at measure theory (or probability theory, where we’re building on measures), we need to exclude non-measurable sets. If we don’t, we’re seriously up a creek without a paddle. If we allowed non-measurable sets, then the probability theory we’re building would be inconsistent, and that’s the kiss of death in mathematics.

Ok. So, with that out of the way, how do we actually use Kolmogorov’s axioms? It all comes down to the idea of a sample space. You need to start with an experiment that you’re going to observe. For that experiment, there are a set of possible outcomes. The set of all possible outcomes is the sample space.

Here’s where, sadly, even axiomatized probability theory gets a bit handwavy. Given the sample space, you can define the structure of the sample space with a function, called the probability mass function, f, which maps each possible event in the sample space to a probability. To be a valid mass function for a sample space S, it’s got to have the following properties:

1. For each event e in S, f(e) ≥ 0 and f(e) <= 1..
2. The sum of the probabilities in the sample space must be 1: $Sigma_{e in S} f(e) = 1$

So we wind up with a sort of circularity: in order to describe the probability of events, we need to start by knowing the probability of events. In fact, this isn’t really a problem: we’re talking about taking something than we observe in the real world, and mapping it into the abstract space of math. Whenever we do that, we need to take our observations of the real world and create an approximation as a mathematical model.

The point of probability theory isn’t to do that primitive mapping. In general, we already understand how rolling a single die works. We know how it should behave, and we know how and why its actual behavior can vary from our expectation. What we want to know is really how many events combine.

We don’t need any special theory to figure out what the probability of rolling a 3 on a six-sided die is: that’s easy, and it’s obvious: it’s 1 in 6. But what’s the probability of winning a game of craps?

If all days of the year 2001 are equally likely, then we don’t need anything fancy to ask what the probability of someone born in 2001’s birthday being July 21st. It’s easy: 1 in 365. But if I’ve got a group of 35 people, what’s the probability of two of them sharing the same birthday?

Both of those questions start with the assignment of a probability mass function, which is trivial. But they involve combining the probabilities given by those mass functions, and use them with Kolmogorov’s axioms to figure out the probabilities of the complicated events.

# Kolmogorov's Axioms of Probability

The way that I’ve talked about probability so far is mostly informal. That’s the way that probability theory was treated for a long time. You defined probability spaces over collections of equal probability sets. You combined probability spaces by combining their events into other kinds of equally probable events.

The problem with that should be obvious: it’s circular. You want to define the probability of events; to do that, you need to start with equally probable events, which means that on some level, you already know the probabilities. If you don’t know the probabilities, you can’t talk about them. The reality is somewhat worse than that, because this way of looking at things completely falls apart when you start trying to think about infinite probability spaces!

So what can you do?

The answer is to reformulate probability. Mathematicians knew about this kind of problem for a very long time, but what they mostly just ignored it: probability wasn’t considered a terribly interesting field.

Then, along came Kolmogorov – the same brilliant guy who’s theory of computational complexity is so fascinating to me! Kolmogorov created a new formulation of probability theory. Instead of starting with a space of equally probable discrete events, you start with a measure space.

Before we can look at how Kolmogorov reformulated probability (the Kolmogorov axioms), we need to look at just what a measure space is.

A measure space is just a set with a measure function. So let X be a set. A measure μ on X is a function from a subset of X to a real number: $mu: 2^X rightarrow R$ with the following properties:

• Measures are non-negative: $forall x subseteq X: mu(x) ge 0$
• The measure of the empty set is always 0: $mu(emptyset) = 0$
• The measure of a finite sequence of unions is the sum of the individual measures$mu(x + y) = mu(x) + mu(y)$

So the idea is pretty simple: a measure space is just a way of defining the size of a subset in a consistent way.

To work with probability, you need a measure space where the measure of the entire set is 1. With that idea in mind, we can put together a proper, formal definition of a probability space that will really allow us to work with, and to combine probabilities in a rigorous way.

Like our original version, a probability space has a set of events, called its event space. We’ll use F to represent the set of all possible events, and e to represent an event in that set.

There are three fundamental axioms of probability, which are going to look really similar to the three axioms of a measure space:

1. Basic measure: the probability of any event is a positive real number: ($Omega$ is called the unit event, and is the union of all possible events.) Alternatively, the probability of no event occurring is 0: $P(emptyset)=0$.
2. Combination: For any two distinct events or sets of events $e$ and $f$, the probability of $e$ or $f$ is $P(e) + P(f)$: $forall e, f subseteq P: e cap f = emptyset Rightarrow P(e cup f) = P(e) + P(f)$. This can be extended to any countable sequence of unions.

This is very similar to the informal version we used earlier. But as we’ll see later, this simple formulation from measure theory will give us a lot of additional power.

It’s worth taking a moment to point out two implications of these axioms. (In fact, I’ve seen some presentations that treat some of these as additional axioms, but they’re provable from the first three.

• Monotonicity: if $e subeq f$, then $P(e) le P(f)$.
• Upper Bound: for any event or set of events $e$, $P(e) ge 0 land P(e) le 1$.

The brilliance of Kolmogorov was realizing that these rules were everything you need to work out any probability you want – in both finite and infinite spaces. We’ll see that there’s a lot of complexity in the combinatorics of probability, but it will all always ultimately come back to these three rules.

# Independence and Combining probabilities.

As I alluded to in my previous post, simple probability is really a matter of observation, to produce a model. If you roll a six sided die over and over again, you’ll see that each face comes up pretty much equally often, and so you model it as 1/6 probability for each face. There’s not really a huge amount of principle behind the basic models: it’s really just whatever works. This is the root of the distinction between interpretations: a frequentist starts with an experiment, and builds a descriptive model based on it, and says that the underlying phenomena being tested has the model as g, a property; a Bayesian does almost the same thing, but says that the model describes the state of their knowledge.

Where probability starts to become interesting in when you combine things. I know the probability of outcomes for rolling one die: how can I use that to see what happens when I roll five dice together? I know the probability of drawing a specific card from a deck: what are the odds of being dealt a particular poker hand?

We’ll start with the easiest part: combining independent probabilities. The probability of two events are independent when there’s no way for the outcome of one to influence the outcome of the other. For example, if you’re flipping a coin several times, the result of one coin flip has no effect on the result of a subsequent flip. On the other hand, dealing 10 cards from a deck is a sequence of dependent events: once you’ve dealt one card, the next deal must come from the remaining cards: you can’t deal the same card twice.

If you know the probability space of your trials, then recognizing an independent situation is easy: if the outcome of one trial doesn’t alter the probability space of other trials, then they’re independent.

Look back at the coin flip example: we know what the probability space of a coin flip looks like: it’s got two, equally probable outcomes. If you’ve flipped a coin once, and you’re going to flip another coin, the result of the first flip can’t do anything that alters the probability space of a subsequent flip.

But if you think about dealing cards, that’s not true. With a standard deck of cards, the initial probability space has 52 outcomes, each of which is equally likely. So the odds of being dealt the 5 of spades is exactly 1/52.

Now, suppose that you got lucky, and you did get dealt the 5 of spades on the first card. What’s the probability of being dealt the 5 of spades’s on the second? If they were independent events, it would still be 1/52. But once you’ve dealt one card, you can’t deal it again. The probability of being dealt the 5 of spades as the second card is 0: it’s impossible. The probability space only has 51 possible outcomes, and the 5 of spades is not one of them. The space has changed. That’s the definition of a dependent event.

When you’re faced with dependent probabilities, you need to figure out how the probability space will be changed, and incorporate that into your computation. Once you’ve incorporated the change in the probability space of the second test, then you’ve got a new independent probability, and you can combine them. Figuring out how to alter the probability space can be extremely difficult, but that’s what makes it interesting.

When you’re dealing with independent events, it’s easy to combine them. There are two basic ways of combining event probabilities,
and they should be familiar from logic: event1 AND event2, and event1 OR event2.

Suppose you’re looking at two test with independent outcomes. I know that the probability of event e is P(e), and the probability of event f is P(f) Then the outcome of e & f – that is, of having e as the outcome of the first trial, and f as the outcome of the second, is P(e)×P(f). The odds of rolling HTTH on a coin is (1/2)*(1/2)*(1/2)*(1/2)=(1/16).

If you’re looking at independent alternatives – that is, the probability of e OR F, you combine the probabilities of the event with addition: P(e) + P(f). So, the odds of drawing any heart from a deck: for each card, it’s 1/52. There are thirteen different hearts. So the odds of drawing a red are 1/52 + 1/52 + … = 13/52 = 1/4.

That still doesn’t get us to the really interesting stuff. We still can’t quite work out something like the odds of being dealt a flush. To get there, we need to learn some combinatorics, which will allow us to formulate the probability spaces that we need for an interesting probability.

# Probability Spaces

Sorry for the slowness of the blog lately. I finally got myself back onto a semi-regular schedule when I posted about the Adria Richards affair, and that really blew up. The amount of vicious, hateful bile that showed up, both in comments (which I moderated) and in my email was truly astonishing. I’ve written things which pissed people off before, and I’ve gotten at least my fair share of hatemail. But nothing I’ve written before came close to preparing me for the kind of unbounded hatred that came in response to that post.

I really needed some time away from the blog after that.

Anyway, I’m back, and it’s time to get on with some discrete probability theory!

I’ve already written a bit about interpretations of probability. But I haven’t said anything about what probability means formally. When I say that the probability of rolling a 3 with a pair of fair six-sided dice is 1/18, how do I know that? Where did that 1/6th figure come from?

The answer lies in something called a probability space. I’m going to explain the probability space in frequentist terms, because I think that that’s easiest, but there is (of course) an equivalent Bayesian description.) Suppose I’m looking at a particular experiment. In classic mathematical form, a probability space consists of three components (Ω, E, P), where:

1. Ω, called the sample space, is a set containing all possible outcomes of the experiment. For a pair of dice, Ω would be the set of all possible rolls: {(1,1), (1,2), (1,3), (1,4), (1,5), (1, 6), (2,1), …, (6, 5), (6,6)}.
2. E is an equivalence relation over Ω, which partitions Ω into a set of events. Each event is a set of outcomes that are equivalent. For rolling a pair of dice, an event is a total – each event is the set of outcomes that have the same total. For the event “3” (meaning a roll that totalled three), the set would be {(1, 2), (2, 1)}.
3. P is a probability assignment. For each event e in E, P(e) is a value between 0 and 1, where:

$Sigma_{ein E} P(e) = 1$

(That is, the sum of the probabilities of all of the possible events in the space is exactly 1.)

The probability of an event e being the outcome of a trial is P(e).

So the probability of any particular event as the result of a trial is a number between 0 and 1. What’s it mean? If the probability of event e is p, then if we repeat the trial N times, we expect N*p of those trials to have e as their result. If the probability of e is 1/4, and we repeat the trial 100 times, we’d expect e to be the result 25 times.

But in an important sense, that’s a cop-out. We’ve defined probability in terms of this abstract model, where the third component is the probability. Isn’t that circular?

Not really. For a given trial, we create the probability assignment by observation and/or analysis. The important point is that this is really just a bare minimum starting point. What we really care about in probability isn’t the change associated with a single, simple, atomic event. What we want to do is take the probability associated with a group of single events, and use our understanding of that to allow us to explore a complex event.

If I give you a well-shuffled deck of cards, it’s easy to show that the odds of drawing the 3 of diamonds is 1/52. What we want to do with probability is things like ask: What are the odds of being dealt a flush in a poker hand?

The construction of a probability space gives us a well-defined platform to use for building probabilistic models of more interesting things. Give a probability space of two single dice, we can combine them together to create the probability space of the two dice rolled together. Given the probability space of a pair of dice, we can construct the probability space of a game of craps. And so on.

# Probability and Interpretations

I’m going to do some writing about discrete probability theory. Probability is an extremely important area of math. We encounter aspects of it every day. It’s also a very poorly understood area – it’s one that we see abused or just fouled up every day.

I’m going to focus on discrete probability theory. What that means is that we’re going to look at things where the space containing the things that we’re going to look at contains a countable number of elements. The probability of getting a certain sequence of coin flips, or of getting a certain hand of cards are described by discrete probability theory. On the other hand, the odds of a radioactive isotope decaying at a particular time requires continuous probability theory.

Before getting into the details, there’s one important thing to mention. When you’re talking about probability, there are two fundamental schools of interpretetation. There are frequentist interpretations, and there are Bayesian interpretations.

In a frequentist interpretation, when you say the probability of an event is 0.6, what you mean is that if you were to perform a series of experiments precisely reproducing the event, then on average, if you did 100 experiments, the event would occur 60 times. In the frequentist interpretation, the probability is an intrinsic property of the event. For a frequentist, it makes sense to say that there is a “real” probability associated with an event.

In a Bayesian interpretation, when you say that the probability of an event is 0.6, what you mean is that based on your current state of knowledge about the event, you have a 60% certainty that the event will occur. In a strict Bayesian interpretation, the event doesn’t have any kind of intrinsic probability associated with it. The specific event that you’re interested in either will occur, or it won’t. There’s no real probability involved. What probability measures is how certain you are about whether or not it will occur.

For example, think about flipping a fair coin.

A frequentist would say that you can flip a coin many times, and half of the time, it will land on heads. So the probability of a coin flip landing on the head of the coin is 0.5. A Bayesian would say that the coin will land either on heads or on tails. Since you don’t know which, and you have no other information to use to be able to make a better prediction, you can have a certainty of 0.5 that it will land on the head of the coin.

In the real world, I think that most people are really somewhere in between.

I think that all but the most fervent Bayesians do rely on an intuitive notion of the “intrinsic” probability of an event. They may describe it in different terms, but when it comes down to it, they’re using the basic frequentist notion. And I don’t think that you can find a sane frequentist anywhere who won’t use Bayes theorem to update their priors in the face of new information – which is the most fundamental notion in the Bayesian interpretation.

One note before I finish this, and get started on the real meaty posts. In the past, when I’ve talked about probability, people have started stupid flamewars in the comments. People get downright religious about interpretations of probability. There are religious Bayesians, who think that all frequentists are stupid idiots who should be banished from the field of math; likewise, there are religious frequentists who think that Bayesians are all a crop of arrogant know-it-alls who should be sent to Siberia. I am not going to tolerate any of that nonsense. If you feel that you cannot read posts on probability without going into a diatribe about those stupid frequentists/Bayesians and their deliberately stupid ideas, please go away and don’t even read these posts. If you do go into such a diatribe, I will delete your comments without any hesitation.