I’d like to start with a quick apology. Sorry that both the abstract algebra and the new game theory posts have been moving so slowly. I’ve been a bit overwhelmed lately with things that need doing right away, and
by the time I’m done with all that, I’m too tired to write anything that
requires a lot of care. I’ve known the probability theory stuff for so long,
and the parts I’m writing about so far are so simple that it really doesn’t take nearly as much effort.
With that out of the way, today, I’m going to write about probability distributions.
Take a probabilistic event – an event where we don’t know how it’s going to turn out, but where we know enough to describe the set of possible outcomes. We can formalize the event as a random variable, or we can consider it in isolation and use our knowledge of it to describe the possible outcomes. A probability distribution describes the set of outcomes of some probabilistic event – and how likely each one is. A probability distribution always has the property that if S is the set of possible outcomes, and for each s∈S, P(s) is the probability of outcome s, then Σs∈S P(s) = 1 – which is really just another way of saying that the probability distribution covers all possible outcomes.
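That normalization condition is easy to check mechanically. Here’s a minimal sketch, representing a discrete distribution as a Python dict from outcome to probability (the fair-die example and the helper name are just illustrations, not anything standard):

```python
import math

def is_distribution(dist):
    """True if all probabilities are non-negative and they sum to 1."""
    return all(p >= 0 for p in dist.values()) and math.isclose(sum(dist.values()), 1.0)

# A fair six-sided die: six outcomes, each with probability 1/6.
fair_die = {face: 1 / 6 for face in range(1, 7)}
print(is_distribution(fair_die))                      # True
print(is_distribution({"heads": 0.6, "tails": 0.3}))  # False: only covers 0.9
```

The second check fails because the probabilities sum to 0.9 – the “distribution” doesn’t cover all of the possible outcomes.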
In the last post, about random variables, I described a formal abstract
version of what we mean by “how likely” a given outcome is. From here on,
we’ll mainly use the informal version, where an outcome O with a probability
P(O)=1/N means that if we watched the event N times, we’d expect to see O occur once, on average.
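That informal frequency reading is something you can check by simulation. This is just a sanity-check sketch, using a fair die as a stand-in for any event with a 1/6-probability outcome:

```python
import random

# Simulate many rolls of a fair die and count how often one
# particular outcome (a 3) shows up. With P(3) = 1/6, the observed
# frequency should settle near 1/6 as the number of trials grows.
rng = random.Random(42)
trials = 60_000
hits = sum(1 for _ in range(trials) if rng.randint(1, 6) == 3)
print(hits / trials)  # close to 1/6 ≈ 0.1667
```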
Probability distributions are incredibly important for understanding statistics. If we know the type of distribution, we can make much stronger
statements about what a given statistic means. We can also do all
sorts of interesting things if we understand what the distributions look like.
For example, when people do work in computer networks, some of the key
concerns are called bandwidth and utilization. One of the really
interesting things about networks is that they’re not generally as fast
as we think they are. The bandwidth of a channel is the maximum amount of
information that can be transmitted through that channel in a given time. The utilization is the percentage of bandwidth actually being used at a particular moment. The utilization virtually never reaches 100%. In fact, on most networks, it’s much lower. The higher utilization gets, the harder it is for anyone else to use what’s left.
For different network protocols, the peak utilization – the amount of the bandwidth that can be used before performance starts to drop off – varies. There’s often a complex tradeoff. To give two extreme cases, you can build
a protocol in which there is a “token” associated with the right to transmit on the wire, and the token is passed around the machines in the network. This
guarantees that everyone will get a turn to transmit, and prevents more than one machine from trying to send at the same time. On the other hand, there’s a family of protocols (including ethernet) called “Aloha” protocols, where you
just transmit whenever you want to – but if someone else happens to be
transmitting at the same time, both of your messages will be corrupted, and will need to be resent.
In the token-ring case, even at low utilization you’re increasing the amount of time it takes to get a message onto the wire, and you’re eating up a slice of the bandwidth with all of those token-maintenance messages. But as load increases, in theory, you can scale smoothly up to the point where all of the bandwidth is being used by either the token maintenance or by real
traffic. In the case of Aloha protocols, there’s no delay, and no token maintenance overhead – but when the utilization goes up, so does the chance
of two machines transmitting simultaneously. So in an Aloha network, you
really can’t effectively use all of the bandwidth, because as
utilization goes up, the amount of bandwidth dedicated to actual messages
decreases, because so much is being used by messages that had to be discarded due to collisions.
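That collision effect is easy to see in a toy simulation. This sketch models a *slotted* variant of Aloha – a simplifying assumption, not the exact “transmit whenever you want” protocol described above: in each time slot, every station independently transmits with probability p, and a slot carries useful traffic only if exactly one station transmits.

```python
import random

def slotted_aloha_goodput(n_stations, p, n_slots=100_000, seed=0):
    """Fraction of slots carrying exactly one transmission (useful
    traffic); every other slot is either idle or a collision."""
    rng = random.Random(seed)
    good = 0
    for _ in range(n_slots):
        transmitters = sum(1 for _ in range(n_stations) if rng.random() < p)
        if transmitters == 1:
            good += 1
    return good / n_slots

# Pushing stations to transmit more often eventually *lowers* useful
# throughput: past a point, the extra traffic is eaten by collisions.
for p in (0.01, 0.05, 0.10, 0.30):
    print(f"p={p:.2f}  goodput={slotted_aloha_goodput(20, p):.3f}")
```

With 20 stations, the goodput peaks when the offered load is around one transmission attempt per slot, and then falls off sharply – exactly the “utilization goes up, useful bandwidth goes down” behavior described above.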
From that brief overview, you can see why it’s important to be able to
accurately assess what the effective bandwidth of a network is. To
do that, you get into something called queueing theory, which
uses probability distributions to determine how much of your bandwidth will
likely be taken up by collisions/token maintenance under various utilization scenarios. Without the probability distribution, you can’t do it; and if the probability distribution doesn’t match the real pattern of usage, it will
give you results that won’t match reality.
There’s a collection of basic distributions that we like to work with,
and it’s worth taking a few moments to run through and describe them:
- Uniform: the simplest distribution. In a uniform distribution,
there are N outcomes, o1,…,oN, and
the probability of any one of them is P(oi)=1/N. Rolling
a fair die, picking a card from a well-shuffled deck, or picking a ticket in
a raffle are all events with uniform probability distributions.
- Bernoulli: the Bernoulli distribution covers a binary
event – that is, an event that has exactly two possible
outcomes, “1” and “0”, with a fixed probability “p”
of an outcome of “1” – so P(1)=p, P(0)=1-p.
- Binomial: the binomial distribution describes the probability of
a particular number of “1” outcomes in a limited series of
independent trials of an event with a Bernoulli distribution.
So, for example, a binomial probability distribution will say
something like “Given 100 trials, there is a 1% probability of
one success, a 3% probability of 2 successes, …”, etc.
The binomial distribution is one of the sources of the
perfect bell curve: as the number of trials increases,
a histogram of the binomial distribution gets closer and closer
to the perfect bell.
- Zipf: as a guy married to a computational linguist,
I’m familiar with this one. It’s a power-law distribution:
given a set of outcomes ranked by frequency, o1, …, on,
the probabilities work out so that P(ok) is proportional to 1/k – the second-ranked outcome is half as likely as the first, the third is a third as likely, and so on. The most famous
example of this is word frequency: if you take a huge volume of English text, you’ll
find that the most common word is “the”; randomly picking
a word from English text, the probability of “the” is about 7%. The
number 2 word is “of”, which has a probability of about 3.5%. And so
on. If you plot a Zipf distribution on a log-log scale – log-probability
on the vertical axis, and the ranked ois along the
horizontal – you’ll get a straight descending line.
- Poisson: the Poisson distribution is the one used in computer
networks, as I described above. It describes
the probability of a given number of events happening in a particular
length of time, given that (a) the events have a fixed average
rate of occurrence, and (b) the time to the next event is independent of
the time since the previous event. It turns out that this is a pretty
good model of network traffic: on a typical network, taken over
a long period of time, each computer will generate traffic at a
particular average rate. There’s an enormous amount of variation –
sometimes it’ll go hours without sending anything, and sometimes it’ll
send a thousand messages in a second. But overall, it averages out
to something regular. And there’s no way of knowing what pattern it will
follow: just because you’ve seen 1,000 messages per second for the last 3
seconds, you can’t predict that the next message will arrive 1/1000th of
a second from now.
- Zeta: the zeta distribution is basically the Zipf distribution over
an infinite number of possible discrete outcomes.
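To make the list a bit more concrete, here’s a small sketch of the probability mass functions for several of the distributions above. The function names are just my own illustrative choices:

```python
import math

def uniform_pmf(n, i):
    """P(o_i) when there are n equally likely outcomes."""
    return 1 / n

def binomial_pmf(n, p, k):
    """P(exactly k "1" outcomes in n independent Bernoulli(p) trials)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    """P(k events in an interval whose average event count is lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def zipf_pmf(n, k):
    """P(the k-th ranked of n outcomes), with P proportional to 1/k."""
    normalizer = sum(1 / i for i in range(1, n + 1))
    return (1 / k) / normalizer

# Each really is a distribution: the probabilities sum to 1.
print(sum(binomial_pmf(100, 0.01, k) for k in range(101)))  # 1.0 (up to rounding)
print(zipf_pmf(50, 1) / zipf_pmf(50, 2))  # 2.0: rank 1 is twice as likely as rank 2
```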