I’d like to start with a quick apology. Sorry that both the abstract algebra and the new game theory posts have been moving so slowly. I’ve been a bit overwhelmed lately with things that need doing *right away*, and

by the time I’m done with all that, I’m too tired to write anything that

requires a lot of care. I’ve known the probability theory stuff for so long,

and the parts I’m writing about so far are so simple that it really doesn’t take nearly as much effort.

With that out of the way, today, I’m going to write about probability distributions.

Take a probabilistic event. That’s an event where we don’t know how

it’s going to turn out. But we know enough to describe the set of possible outcomes. We can formulate a random variable for the basic idea of the event, or we can consider it in isolation, and use our knowledge of it to describe

the possible outcomes. A probability distribution describes the set of outcomes of some probabilistic event – and how likely each one is. A

probability distribution always has the property that if S in the set

of possible outcomes, and ∀s∈S:P(s) is the probability of outcome s, then Σ_{s∈S}P(s)=1 – which is really just another way

of saying that the probability distribution covers all possible outcomes.

In the last post, about random variables, I described a formal abstract

version of what we mean by “how likely” a given outcome is. From here on,

we’ll mainly use the informal version, where an outcome O with a probability

P(O)=1/N means that if watched the event N times, we’d expect to see O occur

once.

Probability distributions are incredibly important to understand statistics. If we know the type of distribution, we can make much stronger

statements about what a given statistic *means*. We can also do all

sorts of interesting things if we understand what the distributions look like.

For example, when people do work in computer networks, some of the key

concerns are called bandwidth and utilization. One of the really

interesting things about networks is that they’re not generally as fast

as we think they are. The bandwidth of a channel is the maximum amount of

information that can be transmitted through that channel in a given time. The utilization is the percentage of bandwidth actually being used at a particular moment. The utilization virtually never reaches 100%. In fact, on most networks, it’s *much* lower. The higher utilization gets, the harder it is for anyone else to use what’s left.

For different network protocols, the peak utilization – the amount of the bandwidth that can be used before performance starts to drop off – varies. There’s often a complex tradeoff. To give two extreme cases, you can build

a protocol in which there is a “token” associated with the right to transmit on the wire, and the token is passed around the machines in the network. This

guarantees that everyone will get a turn to transmit, and prevents more than one machine from trying to send at the same time. On the other hand, there’s a family of protocols (including ethernet) called “Aloha” protocols, where you

just transmit whenever you want to – but if someone else happens to be

transmitting at the same time, both of your messages will be corrupted, and will need to be resent.

In the token-ring case, you’re increasing the amount of time it takes

to get a message onto the wire during low utilization, and you’re eating up a

slice of the bandwidth with all of those token-maintenance messages. But as capacity increases, in theory, you can scale smoothly up to the point where all of the bandwidth is being used by either the token maintenance or by real

traffic. In the case of Aloha protocols, there’s no delay, and no token maintenance overhead – but when the utilization goes up, so does the chance

of two machines transmitting simultaneously. So in an Aloha network, you

really *can’t* effectively use all of the bandwidth, because as

utilization goes up, the amount of bandwidth dedicated to actual messages

decreases, because so much is being used by messages that had to be discarded due to collisions.

From that brief overview, you can see why it’s important to be able to

accurately assess what the *effective* bandwidth of a network is. To

do that, you get into something called *queueing theory*, which

uses probability distributions to determine how much of your bandwidth will

likely be taken up by collisions/token maintenance under various utilization scenarios. Without the probability distribution, you can’t do it; and if the probability distribution doesn’t match the real pattern of usage, it will

give you results that won’t match reality.

There are a collection of basic distributions that we like to work with,

and it’s worth taking a few moments to run through and describe

them.

- Uniform: the simplest distribution. In a uniform distribution,

there are N outcomes, o_{1},…,o_{n}, and

the probability of any one of them, P(o_{i})=1/N. Rolling

a fair die, picking a card from a well-shuffled deck, or picking a ticket in

a raffle are all events with uniform probability distributions. - Bernoulli: the Bernoulli distribution covers a binary

event – that is, and event that has exactly two possible

outcomes, “1” and “0”, and there’s a fixed probability “p”

of an outcome of “1” – so P(1)=p, P(0)=1-p. - Binomial: the binomial distribution describes the probability of

a particular number of “1” outcomes in a limited series of

independent trials of an event with a Bernoulli distribution.

So, for example, a binomial probability distribution will say

something like “Given 100 trials, there is a 1% probability of

one success, a 3% probability of 2 successes, …”, etc.

The binomial distribution is one of the sources of the

perfect bell curve. As the number of data points increases,

a histogram of the Bernoulli distribution becomes closer and closer

to the perfect bell. - Zipf distribution. As a guy married to a computational linguist,

I’m familiar with this one. It’s a*power law*distribution:

given an ordered set of outcomes, o_{i}, …, o_{n},

the probabilities work out so that P(o_{1})=2×P(o_{2}); P(o_{2})=2×P(o_{3}), and so on. The most

example of this is if you take a huge volume of english text, you’ll

find that the most common word is “the”; randomly picking

a work from english text, the probability of “the” is about 7%. The

number 2 word is “of”, which has a probability of about 3.5%. And so

on. If you plot a Zipf distribution on a log-log scale, with the

vertical axis descending powers of 10, and the horizontal axis

is ordered o_{i}s, you’ll get a descending line. - Poisson. The poisson distribution is the one used in computer

networks, as I described above. The poisson distribution describes

the probability of a given number of events happening in a particular

length of time, given that (a) the events have a fixed average

rate of occurence, and (b) the time to the next event is independent of

the time since the previous event. It turns out that this is a pretty

good model of network traffic: on a typical network, taken over

a long period of time, each computer will generate traffic at a

particular average rate. There’s an enormous amount of variation –

sometimes it’ll go hours without sending anything, and sometimes it’ll

send a thousand messages in a second. But overall, it averages out

to something regular. And there’s no way of knowing what pattern it will

follow: just because you’ve just seen 1,000 messages per second for 3

seconds, you can’t predict that it will be 1/1000th of a second before

the next message. - Zeta. The zeta distribution is basically the Zipf distribution over

an infinite number of possible discrete outcomes.

Lassi HippeläinenA good start!

My only complaint is about an engineering detail. While you are writing about computer networking, you only cover one subtype, local area networks, and limited to link layer protocols.

In wide area networks the Poisson distribution begins to fall apart. One possible culprit is in the higher protocol layers, especially Transmission Control Protocol (TCP). Since it resends missing packets after timeout, there will be duplicate packets in flight, and as load approaches capacity, the distributions will have longer tails than Poisson predicts.

http://en.wikipedia.org/wiki/Long-tail_traffic

Real life measurements show that the Internet has a self-similar behaviour over wide time ranges, i.e. it is chaotic in nature. Therefore it can’t be modelled accurately using any classical distribution.

Mark C. Chu-CarrollLassi:

Thanks! I wasn’t actually aware of that. I haven’t studied queueing theory since grad school, and on the networks of the time, I think that poisson still worked pretty well. But the class where I studied that came before the web or the mainstreaming of the internet.

Do you have any references I could look at about the self-similar nature of the internet? I’d like to read more about it.

Brian UtterbackAloha based networks are not actually “transmit when you want to”, they are “transmit when you want to, if nobody else is talking”. This makes a big difference, since a collision is virtually guaranteed at high utilizations in the first case, but allows a much higher utilization in the second. In fact, the limit of utilization becomes dependent on the overall physical size of the network in the second case.

mattI second Lassi’s commendation of this is as a good start. My only quibble would be that it seems a bit odd to list basic distributions — even referring to a ‘bell’ curve as the limit of the binomial — and yet decline to at least

mentionthe normal, even in a “then there’s this other thing that turns out to be rather important in ways we’ll go into next time” sort of way. If it’s because you’re being picky about distributions asdiscrete, then you should say so — especially given your less pedantic approach in the previous post…Wry Mouthgeometric (exponential decay, roughly, describing the distribution of how many games one needs to play to achieve the first “win”)?

(rough) example: my team wins about .40 of its games, based on past performance. How many games will we play this year before we win our first league game?

Anonymous“Do you have any references I could look at about the self-similar nature of the internet? I’d like to read more about it.”

I tried to Google for something that is on-line (using “self-similar distribution internet”), but all relevant articles seem to be behind a paywall. Of course many readers of your blog have access to IEEE or ACM libraries, so they can read them.

Sorry for a terse answer, but I’ll be in Dubai for the rest of the week, and the plane leaves pretty soon…

Benoit EssiambreThey just wrote an interesting post about visualizing higher dimensional distributions over at Overcoming Bias:

http://www.overcomingbias.com/2008/04/conf-space.html

AnonymousCiteseer seems to have a number of papers. The one below seems pretty good though it doesn’t discuss the “hurst” parameter which is an interesting way of gauging the level of self-similarity in traffic.

http://citeseer.ist.psu.edu/uhlig01understanding.html

avishalomwhat you quoted as zipf is actually exponential

p(x)~(1/C)^x

with C const

a power law is p(d)~d^(-C)

where C is a constant

for words i think it is C=3/2 or 5/2 but too lazy to google

PeterNothing about the normal distribution?

Or the t-distribution?

Or the F-distribution?

or Chi-square?

I’d say those are the four I use most as a statistician.

And the Poisson, at least for regression problems, usually needs to be abandoned for a negative binomial distribution, since the conditional variance almost always increases with the conditional mean.

secondclassMinor nitpick: Ethernet actually uses CSMA/CD, which means that nodes wait until the line is free before transmitting. Collisions occur only when two nodes

starttransmitting at the same time.Love your blog as always.

Mark C. Chu-Carrollsecondclass:

Yeah, I know that my description was a bit oversimplified, but I was already feeling like I was giving too much detail; try sending and look for collision” is close enough to give people a sense of how it works, and I was debating whether or not to even include that much detail.

(As an interesting aside, when I worked as a sysadmin, we actually had some trouble with an ethernet that got too long. We’d run an ether as a coil through four floors of the EE building. Turned out that it got long enough that at Ethernet propagation speed, the machines at the far ends of the network could collide without detecting, because ether only looks for a collision when it starts sending. Took us a while to figure out that that was the problem, because we couldn’t believe that the signal propagation could take long enough on a cable in one building to allow missed collisions! Once we finally figured it out, we split the building into two subnets with a repeater between them, and that took care of it.)

Paddyhttp://www.ams.org/notices/199808/paxson.pdf

is a very good introduction to the failure of poisson modeling due to the self-similar structure of network traffic