Tag Archives: probability

Monte Carlo!

I was browsing around my crackpottery achives, looking for something fun to write about. I noticed a link from one of them to an old subject of one of my posts, the inimitable Miles Mathis. And following that, I noticed an interesting comment from Mr. Mathis about Monte Carlo methods: “Every mathematician knows that ‘tools’ like Monte Carlo are used only when you’ve got nothing else to go on and you are flying by the seat of your pants.” I find Monte Carlo computations absolutely fascinating. So instead of wasting time making fun of more of Mathis rubbish, I decided to talk about Monte Carlo methods.

It’s a little hard to talk about Monte Carlo methods, because there’s a lot of disagreement about exactly what they are. I’m going to use the broadest definition: a Monte Carlo method is a way of generating a computational result using repeated computations and random sampling.

In other words, Monte Carlo methods are a way of using random sampling to solve problems.

I’ll start with a really simple example. Suppose you want to know the value of \pi. (Pretend that you don’t know the analytical solution.) One thing that you could do is try to measure the circumference of a rod, and then divide it by its diameter. That would work, but it would be really hard to get much accuracy. You could, instead, get a great big square sheet of paper, and cover the whole thing in a single layer of grains of sand. Then, very carefully, you could remove the grains of sand that weren’t in the circle, compare it to the number of grains of sand that weren’t in the circle. By doing that, you could get a very, very accurate measurement of the area of the circle, and using that, you could get a much more accurate estimate of \pi.

The problem with that is: it’s really hard to get a perfect single-grain layer of sand all over the paper. And it would be a lot of very, very tedious work to get all of the grains that weren’t in the circle. And it would be very tedious to count them. It’s too much trouble.

Instead, you could just take 1,000 grains of sand, and drop them randomly all over the circle and the square. Then you could count how many landed in the circle. Or ever easier, you could just go to a place where lots of drunk people play darts! Draw a square around the dartboard, and count how many holes there are in the square wall around it, versus how many in the dartboard!

You’re not going to get a super-precise value for \pi – but you might be surprised just how good you can get!

That’s the basic idea of monte carlo simulation: you’ve got a problem that’s hard to compute, or one where you don’t know a closed-form solution to make it easy to compute. Getting the answer some other way is intractable, because it requires more work than you can reasonably do. But you’ve got an easy way to do a test – like the “is it in the circle or not” test. So you generate a ton of random numbers, and use those, together with the test, to do a sequence of trials. Then using the information from the trials, you can get a reasonable estimate of the value you wanted. The more trials you do, the better your estimate will be.

The more you understand the probability distribution of the space you’re sampling, the better your estimate will be. For example, in the \pi example above, we assumed that the grains of sand/darts would be distributed equally all over the space. If you were using the dartboard in a bar, the odds are that the distribution wouldn’t be uniform – there’d be many more darts hitting the dartboard than hitting the wall (unless I was playing). If you assumed a uniform distribution, your estimate would be off!

That’s obviously a trivial example. But in reality, the Monte Carlo method is incredibly useful for a wide range of purposes. It was used during World War II by the Manhattan project to help design the first atom bomb! They needed to figure out how to create a critical mass that would sustain a nuclear chain reaction; to do that, they needed to be able to compute neutron diffusion through a mass of radioactive uranium. But that’s a very hard problem: there are so many degrees of freedom – so many places where things could proceed in several seemingly (or actually!) random directions. With the computers they had available to them at the time, there was absolutely no way that they could write a precise numerical simulation!

But, luckily for them, they had some amazing mathematicians working on the problem! One of them, Stanislav Ulam, had been sick, and while he was recovering, fooled around with some mathematical problems. One of them involved a solitaire game, called Canfield. Ulam wanted to figure out how often the game of Canfield was winnable. He couldn’t figure out how to do it analytically, but he realized that since the deals of cards are a uniform distribution, then if you were to take a computer, and make it play through 1000 games, the number of times that it won would be a pretty good estimate of how many times the game was winnable in general.

In that case, it’s obvious that a complete solution isn’t feasible: there are 52! possible deals – roughly 3\times 10^{66}! But with just a couple of hundred trials, you can get a really good estimate.

Ulam figured that out for the card game. He explained it to Jon von Neumann, and von Neumann realized that the same basic method could be used for the Neutron diffraction process!

Since then, it’s been used as the basis for a widely applicable approach to numeric integration. It’s used for numerous physics simulations, where there is no tractable exact solution – for example, weather prediction. (We’ve been able to get reasonably accurate weather predictions up to six days in advance, using very sparse input data, by using Monte Carlo methods!) It’s an incredibly useful, valuable technique – and anyone who suggests that using Monte Carlo is in any way a half-assed solution is an utter jackass.

I’ll finish up with a beautiful example – my favorite example of combining analytical methods with Monte Carlo. It’s another way of computing an estimate of \pi, but it gets a more accurate result with fewer trials than the sand/darts.

It’s based on a problem Buffon’s needle. Buffon’s needle is a problem first proposed by the Count of Buffon during the 1700s. He asked: suppose I drop a needle onto a panelled wood floor. What’s the probability that the needle will fall so that it crosses a one of the joints between different boards?

Using some very nice analytical work, you can show that if the panels have uniform width t, and the needle has length l, then the probability of a needle crossing a line is: \frac{2l}{\pi t}. That gives us the nice property that if we let l = \frac{t}{2}, then the probability of crossing a line is \frac{1}{\pi}.

Using that, you can do a Monte Carlo computation: take a sheet of paper, and a couple of hundred matchsticks. Draw lines on the paper, separated by twice the length of a matchstick. Then scatter the matchsticks all over the paper. Divide the total number of matchsticks by the number that crossed a line. The result will be roughly \pi.

For example – with 200 trials, I got 63 crossing a line. That gives me roughly 3.17 as an estimate of \pi. That’s not half-bad for a five minute experimental estimate!

Bayes Theorem

I’ve been meaning to get back to some of the probability stuff. We’re currently recovering from a major snow/ice storm, and I’m snowed/iced in, so this is a good time!

Today, we’ll talk about what is, according to many people, the most important rule in all of probability: Bayes theorem. It’s also, in my experience, the single most abused rule in all of mathematics. Nothing else has been used so poorly, by so many people, to support sloppy, dumb arguments. After we talk about what the rule is, and what it means, we’ll move on to talk about how it gets abused.

In a pure mathematical sense, Bayes theorem is simple. The interpretation of it, and what it means gets pretty hairy. Suppose that you’ve got two related events, A and B. You know the probability of A occurring is P(A). You know the probability of B occurring is P(B). And you know that if A has already occurred, what the probability of B occurring is. (We write that P(B | A), which you can ready as “the probability of B given A”.) What you’d like to know is, suppose that I know that B occurred. What’s the probability that A also occurred? (What is P(A | B)?)

Bayes theorem says:

P(A|B) = \frac{P(B|A) P(A)}{P(B)}

Let’s be concrete. I go to work, and walk into my office in the morning, and get into the elevator with one other person that I work with.What is the probability that it’s a man?

Without knowing anything about the people that I work with, a reasonable guess would be 50% – the population is pretty close to evenly divided between the genders.

But I’m an engineer, and one of the very unfortunate facts about my job is that the gender pool of engineers is very skewed. Let’s say that it’s 80% men. (In reality, that’s probably actually pretty low.)

Let’s say that about 1/3 of the office is engineering. So the odds that someone I bump into will be an engineer is about 50%.

I can do a couple of things with that information. I could ask, suppose that I walked into the elevator with a woman. What’s the probability that she’s an engineer?

To answer that, I’ll use Bayes law. We’ll say that P(A) is the probability that a random person is a woman- 1/2. P(B) is the probability that a random person is an engineer – 1/3. If I know that a given person is an engineer, the probability of that person being a woman is P(B | A), or 1/5. So what’s the probability of my random female coworker being an engineer (P(A | B))?

  • P(\text{woman}) = 1/2
  • P(\text{eng}) = 1/3
  • P(\text{woman} | \text{eng}) = 1/5
  • P(\text{eng} | \text{woman}) = \frac{P(\text{woman}|\text{eng})P(\text{eng})}{P(\text{woman})} = \frac{(1/5)(1/3)}{1/2} = \frac{2}{15}

See? That was easy, wasn’t it?

Now, what’s it actually mean? If you look at it this way, it doesn’t seem to be such a big deal. Sure, it’s a way of combining probabilities in another situation, but so what? Why’s it any more important than any other?

Because it’s the mathematical method for how to incorporate new knowledge into your expectations. What we did above was start with one understanding of the thing we were trying to predict. Knowing nothing but the typical distribution of genders in the general population, we made a guess about a 50% probability of encountering a woman. But then we added in new information. We knew the population of engineers, and the fact that the gender ration was skewed in engineering – and we incorporated that new information into our prediction.

That answer comes from interpretations. One of the classic interpretations of probability theory is the Bayesian interpretation – named Bayesian specifically because of how it interprets this rule! The Bayesian interpretation says that a statement about probability is really a statement about the state of our knowledge. If I say that the probability of flipping heads on a coin is 1/2, what I’m saying under the Bayesian interpretation is that my certainty that I’ll flip heads is just 1/2.

In that kind of knowledge-based interpretation, there is no intrinsic probability of any event. There is just our degree of certainty about whether the event will occur. Given new information, our degree of certainty can change. Bayes theorem tells us, given new information, exactly how we should change our interpretation.

To explain the bayesian interpretation, we’ll add a couple of terms.

Hypothesis
The hypothesis is the thing whose degree of certainty we’re trying to measure. In the formulation of Bayes law up above, we call it A; here, we’ll call it H.
Prior
The prior, P(H), is the degree of certainty about the hypothesis given no other information.
Evidence
The evidence is the new piece of information that we’re trying to add to our measurement of certainty. Above, we called it B, but here, we’ll call it E.
Likelihood
The likelihood P(E | H) of a piece of evidence is our degree of certainty that a specific piece of evidence would be found if the hypothesis is true.
Model Evidence
The model evidence is P(E), and it’s a bit confusing. It’s the analytic likelihood of any piece of evidence occurring. If you’re considering a set of possible hypotheses using Bayes rule, P(E) will be the same for all of them, but P(E | H) will be the specific likelihood of finding that particular piece of evidence under the hypothesis.
Posterior
The posterior, P(H|E), is the degree of certainty that we will have about A if we add new knowledge, B.
Support
Support is the change in our certainty created by the addition of our new evidence. The support is \frac{P(E|H)}{P(E)}.

So Bayes theorem is a formal statement of how, given evidence, we can modify our certainty about the truth of a particular statement. The classical textbook statement of it is the following. (I took this specific formulation from wikipedia, but any textbook will have nearly the same sentence.)

The posterior probability of a hypothesis is determined by a combination of the inherent likeliness of a hypothesis (the prior) and the compatibility of the observed evidence with the hypothesis (the likelihood).

Or, in mathematical terms, P(H | E) = \frac{P(E | H)}{P(E)} \times P(H) – or exactly what we wrote for Bayes theorem up above.

Why is this abused so badly? Because under a naive, stupid
understanding of Bayes rule, you can essentially randomly estimate the probability of anything. After all, Bayes says that probability is just the combination of our certainties about some collection of facts. So if I can line up some set of facts, along with an estimate of the individual probabilities of those facts, then I can combine those probabilities, and come up with an estimate of the probability of anything! And if I don’t know the probability of an event occurring at al, then the state of my initial knowledge is really simple: it’s always 1/2 – 1/2 is always the starting point given absolutely no other knowledge.

That leads to rubbish like this proof that there are no extra-terrestial intelligences, or this or this purported proof of the existence of God.

All of these arguments fail in the same way. They don’t really use Bayes theorem. The quality of the priors – all of the priors, including the priors used to come up with measures of the likelihoods of the evidences – are crucial. They don’t bother with that. They just make up priors, and combine them without good likelihoods.

Debunking Two Nate Silver Myths

I followed our election pretty closely. My favorite source of information was Nate Silver. He’s a smart guy, and I love the analysis that he does. He’s using solid math in a good way to produce excellent results. But in the aftermath of the election, I’ve seen a lot of bad information going around about him, his methods, and his result.

First: I keep seeing proclamations that “Nate Silver proves that big data works”.

Rubbish.

There is nothing big data about Nate’s methods. He’s using straightforward Bayesian methods to combine data, and the number of data points is remarkably small.

Big data is one of the popular jargon keywords that people use to appear smart. But it does actually mean something. Big data is using massive quantities of information to find patterns: using a million data points isn’t really big data. Big data means terabytes of information, and billions of datapoints.

When I was at Google, I did log analysis. We ran thousands of machines every day on billions of log records (I can’t say the exact number, but it was in excess of 10 billion records per day) to extract information. It took a data center with 10,000 CPUs running full-blast for 12 hours a day to process a single days data. Using that data, we could extract some obvious things – like how many queries per day for each of the languages that Google supports. We could also extract some very non-obvious things that weren’t explicitly in the data, but that were inferrable from the data – like probable network topologies of the global internet, based on communication latencies. That’s big data.

For another example, look at this image produced by some of my coworkers. At foursquare, we about five million points of checkin data every day, and we’ve got a total of more than 2 1/2 billion data points. By looking at average checkin densities, and then comparing that to checkin densities after the hurricane, we can map out precisely where in the city there was electricity, and where there wasn’t. We couldn’t do that by watching one person, or a hundred people. But by looking at the patterns in millions and millions of records, we can. That is big data.

This doesn’t take away from Nate’s accomplishment in any way. He used data in an impressive and elegant way. The fact is, he didn’t need big data to do this. Elections are determined by aggregate behavior, and you just don’t need big data to predict them. The data that Nate used was small enough that a person could do the analysis of it with paper and pencil. It would be a huge amount of work to do by hand, but it’s just nowhere close to the scale of what we call big data. And trying to do big data would have made it vastly more complicated without improving the result.

Second: there are a bunch of things like this.

The point that many people seem to be missing is that Silver was not simply predicting who would win in each state. He was publishing the odds that one or the other candidate would win in each statewide race. That’s an important difference. It’s precisely this data, which Silver presented so clearly and blogged about so eloquently, that makes it easy to check on how well he actually did. Unfortunately, these very numbers also suggest that his model most likely blew it by paradoxically underestimating the odds of President Obama’s reelection while at the same time correctly predicting the outcomes of 82 of 83 contests (50 state presidential tallies and 32 of 33 Senate races).

Look at it this way, if a meteorologist says there a 90% chance of rain where you live and it doesn’t rain, the forecast wasn’t necessarily wrong, because 10% of the time it shouldn’t rain – otherwise the odds would be something other than a 90% chance of rain. One way a meteorologist could be wrong, however, is by using a predictive model that consistently gives the incorrect probabilities of rain. Only by looking a the odds the meteorologist gave and comparing them to actual data could you tell in hindsight if there was something fishy with the prediction.

Bzzt. Sorry, wrong.

There are two main ways of interpreting probability data: frequentist, and Bayesian.

In a frequentist interpretation, saying that an outcome of an event has a probability X% of occuring, you’re saying that if you were to run an infinite series of repetitions of the event, then on average,
the outcome would occur in X out of every 100 events.

The Bayesian interpretation doesn’t talk about repetition or observation. What it says is: for any specific event, it will have one outcome. There is no repetition. But given the current state of information available to me, I can have a certain amount of certainty about whether or not the event will occur. Saying that I assign probability P% to an event doesn’t mean that I expect my prediction to fail (100-P)% of the time. It just means that given the current state of my knowledge, I expect a particular outcome, and the information I know gives me that degree of certainty.

Bayesian statistics and probability is all about state of knowledge. The fundamental, defining theorem of Bayesian statistics is Bayes theorem, which tells you, given your current state of knowledge and a new piece of information, how to update your knowledge based on what the new information tells you. Getting more information doesn’t change anything about whether or not the event will occur: it will occur, and it will have either one outcome or the other. But new information can allow you to improve your prediction and your certainty of that prediction’s correctness.

The author that I quoted above is being a frequentist. In another section of his articple, he’s more specific:

…The result is P= 0.199, which means there’s a 19.9% chance that it rained every day that week. In other words, there’s an 80.1% chance it didn’t rain on at least one day of the week. If it did in fact rain everyday, you could say it was the result of a little bit of luck. After all, 19.9% isn’t that small a chance of something happening.

That’s frequentist intepretation of the probability – which makes sense, since as a physicist, the author is mainly working with repeated experiments – which is a great place for frequentist interpretation. But looking at the same data, a Bayesian would say: “I have an 19.9% certainty that it will rain today”. Then they’d go look outside, see the clouds, and say “Ok, so it looks like rain – that means that I need to update my prediction. Now I’m 32% certain that it will rain”. Note that nothing about the weather has changed: it’s not true that before looking at the clouds, 80.1 percent of the time it wouldn’t rain, and after looking, that changed. The actual fact of whether or not it will rain on that specific day didn’t
change.

Another way of looking at this is to say that a frequentist believes that a given outcome has an intrinstic probability of occurring, and that our attempts to analyze it just bring us closer to the true probability; whereas a Bayesian says that there is no such thing as an intrinsic probability, because every event is different. All that changes is our ability to make predictions with confidence.

One last metaphor, and I’ll stop. Think about playing craps, where you’re rolling two six sided dice.
For a particular die, a frequentist would say “A fair die has a 1 in 6 chance of coming up with a 1”. A
Bayesian would say “If I don’t know anything else, then my best guess is that I can be 16% certain that a 1
will result from a roll.” The result is the same – but the reasoning is different. And because of the difference in reasoning, you can produce different predictions.

Nate Silver’s predictions of the election are a beautiful example of Bayesian reasoning. He watched daily polls, and each time a new poll came out, he took the information from that poll, weighted it according to the historical reliability of that poll in that situation, and then used that to update his certainty. So based on his data, Nate was 90% certain that his prediction was correct.

Fuzzy Logic vs Probability

In the comments on my last post, a few people asked me to explain the difference between fuzzy logic and probability theory. It’s a very good question.

The two are very closely related. As we’ll see when we start looking at fuzzy logic, the basic connectives in fuzzy logic are defined in almost the same way as the corresponding operations in probability theory.

The key difference is meaning.

There are two major schools of thought in probability theory, and they each assign a very different meaning to probability. I’m going to vastly oversimplify, but the two schools are the frequentists and the Bayesians

First, there are the frequentists. To the frequentists, probability is defined by experiment. If you say that an event E has a probability of, say, 60%, what that means to the frequentists is that if you could repeat an experiment observing the occurrence or non-occurrence of E an infinite number of times, then 60% of the time, E would have occurred. That, in turn, is taken to mean that the event E has an intrinsic probability of 60%.

The other alternative are the Bayesians. To a Bayesian, the idea of an event having an intrinsic probability is ridiculous. You’re interested in a specific occurrence of the event – and it will either occur, or it will not. So there’s a flu going around; either I’ll catch it, or I won’t. Ultimately, there’s no probability about it: it’s either yes or no – I’ll catch it or I won’t. Bayesians say that probability is an assessment of our state of knowledge. To say that I have a 60% chance of catching the flu is just a way of saying that given the current state of our knowledge, I can say with 60% certainty that I will catch it.

In either case, we’re ultimately talking about events, not facts. And those events will either occur, or not occur. There is nothing fuzzy about it. We can talk about the probability of my catching the flu, and depending on whether we pick a frequentist or Bayesian interpretation, that means something different – but in either case, the ultimate truth is not fuzzy.

In fuzzy logic, we’re trying to capture the essential property of vagueness. If I say that a person whose height is 2.5 meters is tall, that’s a true statement. If I say that another person whose height is only 2 meters is tall, that’s still true – but it’s not as true as it was for the person 2.5 meters tall. I’m not saying that in a repeatable experiment, the first person would be tall more often than the second. And I’m not saying that given the current state of my knowledge, it’s more likely than the first person is tall than the second. I’m saying that both people possess the property tall – but in different degrees.

Fuzzy logic is using pretty much the same tools as probability theory. But it’s using them to trying to capture a very different idea. Fuzzy logic is all about degrees of truth – about fuzziness and partial or relative truths. Probability theory is interested in trying to make predictions about events from a state of partial knowledge. (In frequentist terms, it’s about saying that I know that if I repeated this 100 times, E would happen in 60; in Bayesian, it’s precisely a statement of partial knowledge: I’m 60% certain that E will happen.) But probability theory says nothing about how to reason about things that aren’t entirely true or false.

And, in the other direction: fuzzy logic isn’t particularly useful for talking about partial knowledge. If you allowed second-order logic, you could have fuzzy meta-predicates that described your certainty about crisp first-order predicates. But with first order logic (which is really where we want to focus our attention), fuzzy logic isn’t useful for the tasks where we use probability theory.

So probability theory doesn’t capture the essential property of meaning (partial truth) which is the goal of fuzzy logic – and fuzzy logic doesn’t capture the essential property of meaning (partial knowledge) which is the goal of probability theory.