Author Archives: markcc

Herd Immunity

With COVID running rampant throughout the US, I’ve seen a bunch of discussions about herd immunity, and questions about what it means. There’s a simple mathematical concept behind it, so I decided to spend a bit of time explaining.

The basic concept is pretty simple. Let’s put together a simple model of an infectious disease. This will be an extremely simple model – we won’t consider things like variable infectivity, population age distributions, population density – we’re just building a simple model to illustrate the point.

To start, we need to model the infectivity of the disease. This is typically done using a number called R_0. R_0 is the average number of susceptible people that will be infected by each person with the disease.

R_0 is the purest measure of infectivity – it’s the infectivity of the disease in ideal circumstances. In practice, we look for a value R, which is the actual infectivity. R includes the effects of social behaviors, population density, etc.

The state of an infectious disease is based on the expected number of new infections that will be produced by each infected individual. We compute that by using a number S, which is the proportion of the population that is susceptible to the disease.

  • If R S < 1, then the disease dies out without spreading throughout the population. More people can get sick, but each wave of infection will be smaller than the last.
  • If R S = 1, then the disease is said to be endemic. It continues as a steady state in the population. It never spreads dramatically, but it never dies out, either.
  • If R S > 1, then the disease is pandemic. Each wave of infection spreads the disease to a larger subsequent wave. The higher the value of R in a pandemic, the faster the disease will spread, and the more people will end up sick.

There are two keys to managing the spread of an infectious disease:

  1. Reduce the effective value of R. The value of R can be affected by various attributes of the population, including behavioral ones. In the case of COVID-19, an infected person wearing a mask will spread the disease to fewer others; and if other people are also wearing masks, then it will spread even less.
  2. Reduce the value of S. If there are fewer susceptible people in the population, then even with a high value of R, the disease can’t spread as quickly.

The latter is the key concept behind herd immunity. If you can get the value of S to be small enough, then you can get R * S to the sub-endemic level – you can prevent the disease from spreading. You’re effectively denying the disease access to enough susceptible people to be able to spread.

Let’s look at a somewhat concrete example. The R_0 for measles is somewhere around 15, which is insanely infectious. If 50% of the population is susceptible, and no one is doing anything to avoid the infection, then each person infected with measles will infect 7 or 8 other people – and they’ll each infect 7 or 8 others – and so on, which means you’ll have epidemic spread.

Now, let’s say that we get 95% of the population vaccinated, and they’re immune to measles. Now R * S = 15 * 0.05 = 0.75. The disease isn’t able to spread. If you had an initial outbreak of 5 infected, then they’ll infect around 3 people, who’ll infect around 2 people, who’ll infect one person, and soon, there’s no more infections.
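The arithmetic in those two measles scenarios can be sketched in a few lines of Python. This is only an illustration of the simple model above: the function names are made up for the example, and it assumes that S stays fixed from one wave to the next, which a real outbreak wouldn't do.

```python
def expected_new_infections(r, s):
    """Average number of people each infected person goes on to infect: R * S."""
    return r * s


def wave_sizes(initial_cases, r, s, waves):
    """Expected size of each successive wave, assuming S stays constant (a simplification)."""
    sizes = [float(initial_cases)]
    for _ in range(waves):
        sizes.append(sizes[-1] * expected_new_infections(r, s))
    return sizes


# Measles with half the population susceptible: every wave is bigger than the last.
print(wave_sizes(5, r=15, s=0.50, waves=3))
# Measles with 95% of the population immune: every wave is smaller than the last.
print(wave_sizes(5, r=15, s=0.05, waves=3))
```

The second list is where the rough "5, then 3, then 2" progression in the text comes from.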

In this case, we say that the population has herd immunity to the measles. There aren’t enough susceptible people in the population to sustain the spread of the disease – so if the disease is introduced to the population, it will rapidly die out. Even if there are individuals who are still susceptible, they probably won’t get infected, because there aren’t enough other susceptible people to carry it to them.

There are very few diseases that are as infectious as measles. But even with a disease that is that infectious, you can get to herd immunity relatively easily with vaccination.

Without vaccination, it’s still possible to develop herd immunity. It’s just extremely painful. If you’re dealing with a disease that can kill, getting to herd immunity means letting the disease spread until enough people have gotten sick and recovered that the disease can’t spread any more. What that means is letting a huge number of people get sick and suffer – and let some portion of those people die.

Getting back to COVID-19: it’s got an R_0 that’s much lower. It’s somewhere between 1.4 and 2.5. Of those who get sick, even with good medical care, somewhere between 1 and 2% of the infected end up dying. Based on that R_0, herd immunity for COVID-19 (the value of S required to make R*S<1) is somewhere around 50% of the population. Without a vaccine, that means that we’d need to have 150 million people in the US get sick, and of those, around 2 million would die.

(UPDATE: Ok, so I blew it here. The papers that I found in a quick search appear to have a really bad estimate. The current CDC estimate of R_0 is around 5.7 – so the share of the population that has to be immune for herd immunity is significantly higher – upward of 80% – and so would be the number of deaths.)
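Since the disease stops spreading once R * S < 1, the tipping point is S = 1/R, which means the immune share of the population has to exceed 1 - 1/R. Here is a minimal sketch of that calculation. It uses R_0 in place of R (that is, it assumes nobody changes their behavior), and the function name is just for illustration.

```python
def herd_immunity_threshold(r0):
    """Share of the population that must be immune so that R_0 * S drops below 1."""
    if r0 <= 1:
        return 0.0  # the disease dies out even if everyone is susceptible
    return 1.0 - 1.0 / r0


for r0 in (1.4, 2.5, 5.7, 15):
    share = herd_immunity_threshold(r0)
    print(f"R_0 = {r0}: immune share needed is about {share:.0%}")
```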

A strategy for dealing with an infectious disease that accepts the needless death of 2 million people is not exactly a good strategy.

Election Fraud? Nope, just bad math

My old ScienceBlogs friend Mike Dunford has been tweeting his way through the latest lawsuit that’s attempting to overturn the results of our presidential election. The lawsuit is an amazingly shoddy piece of work. But one bit of it stuck out to me, because it falls into my area. Part of their argument tries to make the case that, based on "mathematical analysis", the reported vote counts couldn’t possibly make any sense.

The attached affidavit of Eric Quinell, Ph.D. ("Dr. Quinell Report) analyzez the extraordinary increase in turnout from 2016 to 2020 in a relatively small subset of townships and precincts outside of Detroit in Wayne County and Oakland county, and more importantly how nearly 100% or more of all "new" voters from 2016 to 2020 voted for Biden. See Exh. 102. Using publicly available information from Wayne County andOakland County, Dr. Quinell found that for the votes received up to the 2016 turnout levels, the 2020 vote Democrat vs Republican two-ways distributions (i.e. excluding third parties) tracked the 2016 Democrat vs. Republican distribution very closely…

This is very bad statistical analysis – it’s doing something which is absolutely never correct, which is guaranteed to produce a result that looks odd, and then pretending that the fact that you deliberately did something that will produce a certain result means that there’s something weird going on.

Let’s just make up a scenario with some numbers to demonstrate. Let’s imagine a voting district in Cosine city. Cosine city has 1 million residents that are registered to vote.

In the 2016 election, let’s say that the election was dominated by two parties: the Radians, and the Degrees. The Radians won 52% of the vote, and the Degrees won 48%. The voter turnout was low – just 45%.

Now, 2020 comes, and it’s a rematch of the Radians and the Degrees. But this time, the turnout was 50% of registered voters. The Degrees won, with 51% of the vote.

So let’s break that down into numbers for the two elections:

  • In 2016:
    • A total of 450,000 voters actually cast ballots.
    • The Radians got 234,000 votes.
    • The Degrees got 216,000 votes.
  • In 2020:
    • A total of 500,000 voters actually cast ballots.
    • The Radians got 245,000 votes.
    • The Degrees got 255,000 votes.

Let’s do what Dr. Quinell did. Let’s look at the 2020 election numbers, and take out 450,000 votes which match the distribution from 2016. What we’re left with is:

  • 11,000 new votes for the Radians, and
  • 39,000 new votes for the Degrees.

There was a 3 percent shift in the vote, combined with an increase in voter turnout. Neither of those is unusual or radically surprising. But when you extract things in a statistically invalid way, you end up with the result that, in a voting district in which the vote for the two parties usually varies by no more than 4%, the "new votes" in this election went nearly 4:1 for one party.
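Here is that subtraction written out as a short Python sketch, using the made-up Cosine city totals. The function name is hypothetical, and this is only my reading of the method described in the quote, not the affidavit's actual procedure.

```python
def excess_votes(votes_before, votes_after):
    """Subtract the earlier per-party totals from the later ones and call the rest "new" votes."""
    return {party: votes_after[party] - votes_before[party] for party in votes_before}


votes_2016 = {"Radians": 234_000, "Degrees": 216_000}
votes_2020 = {"Radians": 245_000, "Degrees": 255_000}

new_votes = excess_votes(votes_2016, votes_2020)
ratio = new_votes["Degrees"] / new_votes["Radians"]
overall_share = votes_2020["Degrees"] / sum(votes_2020.values())

print(new_votes)
print(f"Degrees-to-Radians ratio among the 'new' votes: {ratio:.2f} to 1")
print(f"Degrees share of all 2020 votes: {overall_share:.0%}")
```

A 51% winner ends up looking like a landslide among the "new" votes, purely as a side effect of the subtraction.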

If we reduced the increase in voter turnout, that ratio becomes significantly worse. If the election turnout was 46%, then the numbers would be 460,000 total votes; 225,400 for the Radians and 234,600 for the Degrees. With Dr. Quinell’s analysis, that would give us: -8,600 votes for the Radians, and +18,600 votes for the Degrees. Or since negative votes don’t make sense, we can just stop at 225,400, and say that all of the remaining votes, every single new vote beyond what the Radians won last time, was taken by the Degrees. Clearly impossible, it must be fraud!

So what’s the problem here? What caused this reasonable result to suddenly look incredibly unlikely?

The votes are one big pool of numbers. You don’t know which data points came from which voters. You don’t know which voters are new versus old. What happened here is that the bozo doing the analysis baked in an invalid assumption. He assumed that all of the voters who voted in 2016 voted the same way in 2020.
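To see why that assumption can't be checked against the totals, here is a small sketch with two very different stories about who voted how. Scenario B is invented purely for illustration. Both stories produce exactly the same published numbers.

```python
def totals(returning_voters, new_voters):
    """Add per-party votes from returning and new voters to get the published totals."""
    return tuple(r + n for r, n in zip(returning_voters, new_voters))


# Each pair is (Radians, Degrees).
# Scenario A: every 2016 voter votes the same way again, and new voters break hard for the Degrees.
scenario_a = totals(returning_voters=(234_000, 216_000), new_voters=(11_000, 39_000))
# Scenario B (hypothetical): 14,000 former Radians voters switch sides, and new voters split evenly.
scenario_b = totals(returning_voters=(220_000, 230_000), new_voters=(25_000, 25_000))

print(scenario_a, scenario_b, scenario_a == scenario_b)
```

Nothing in the vote totals can tell these two scenarios apart, or distinguish them from countless others.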

"For the votes received up to the turnout level" isn’t something that’s actually measurable in the data. It’s an assertion of something without evidence. You can’t break out subgroups within a population, unless the subgroups were actually deliberately and carefully measured when the data was gathered. And in the case of an election, the data that he’s purportedly analyzing doesn’t actually contain the information needed to separate out that group.

You can’t do that. Or rather you can, but the results are, at best, meaningless.

The Futility of Pi Denialism

To me, the strangest crackpots I’ve encountered through this blog are the π denialists.

When people have trouble with Cantor and differently sized infinities, I get it. It defies our intuitions. It doesn’t seem to make sense.

When you look at Gödel’s incompleteness theorem – it’s really hard to wrap your head around. It doesn’t seem to make sense. I get it.

When you talk about things like indescribable numbers, it’s crazy. How could it possibly be true? I get it.

But π?

It’s a pretty simple number: the ratio of a circle’s circumference to its diameter. There’s nothing all that difficult about what it means. And there are so many different ways of calculating it! We can use the nature of a circle, and derive series that compute it. We can write simple programs, do tactile demos, measure actual physical phenomena. And yet, there are people who fervently believe that it’s all a sham: that the value of π isn’t what we say it is. It’s exactly 4. Or it’s exactly 22/7. Or it’s exactly \frac{4}{\phi^{\frac{1}{2}}}. Or it’s not a number, it’s an acceleration.

It’s amazing. I constantly get mail – mostly from fans of either Jain (the author of the φ-based π mentioned above), or from followers of Miles Mathis (he of “π isn’t a ratio, it’s an acceleration” fame), insisting that I’m part of the great mathematical conspiracy to deny the true factual value of π.

And yet… It’s so simple to demonstrate how wrong that is.

My favorite version is a simple program.

Here’s the idea, followed by the code.

  • Take the unit square – the region of the graph from (0, 0) to (1, 1), and inside of it, an arc of the circle of radius 1 around (0,0).
  • Pick a random point, (x, y), anywhere inside of that square.
  • If the squared distance from the origin (x^2 + y^2) is less than one, then the point is inside the circle. If it isn’t, then it’s outside of the circle.
  • The probability, p, of any given random point being inside that circle is equal to the ratio of the area of the quarter circle to the area of the square. The area of that region of the circle is: \pi*1^2/4, and the area of the square is 1^2. So the probability is (1/4)\pi/1, or \pi/4.
  • So take a ton of random points, and count how many are inside the circle.
  • The ratio of points inside the circle to total random points approximates \pi/4, so multiplying it by 4 gives an estimate of π. The more random points you do this with, the closer you get to π.

We can turn that into a simple Python program:

from random import random

def computePi(points):
    inside = 0
    for i in range(points):
        x = random()
        y = random()
        if (x*x + y*y) < 1.0:
            inside = inside + 1
    return (inside*1.0)/points * 4.0

for i in range(30):
    pi = computePi(2**i)
    print(f"Pi at 2**{i} iterations = {pi}")

The exact value that you’ll get when you run this depends on the random number generator, and the initial seed value. If you don’t specify a seed, most random number libraries will pick one from something that changes between runs, like the low-order bits of the current system time in nanoseconds, so you’ll get slightly different results each time you run it. I just ran it, and got:

Pi at 2**0 iterations = 4.0
Pi at 2**1 iterations = 4.0
Pi at 2**2 iterations = 3.0
Pi at 2**3 iterations = 2.0
Pi at 2**4 iterations = 3.5
Pi at 2**5 iterations = 2.75
Pi at 2**6 iterations = 3.0625
Pi at 2**7 iterations = 3.125
Pi at 2**8 iterations = 3.109375
Pi at 2**9 iterations = 3.1875
Pi at 2**10 iterations = 3.171875
Pi at 2**11 iterations = 3.126953125
Pi at 2**12 iterations = 3.12109375
Pi at 2**13 iterations = 3.14013671875
Pi at 2**14 iterations = 3.169677734375
Pi at 2**15 iterations = 3.1324462890625
Pi at 2**16 iterations = 3.14453125
Pi at 2**17 iterations = 3.147247314453125
Pi at 2**18 iterations = 3.138519287109375
Pi at 2**19 iterations = 3.1364669799804688
Pi at 2**20 iterations = 3.1443214416503906
Pi at 2**21 iterations = 3.141223907470703
Pi at 2**22 iterations = 3.141301155090332
Pi at 2**23 iterations = 3.1419320106506348
Pi at 2**24 iterations = 3.1415367126464844
Pi at 2**25 iterations = 3.1421539783477783
Pi at 2**26 iterations = 3.1420511603355408
Pi at 2**27 iterations = 3.1415300369262695
Pi at 2**28 iterations = 3.141532242298126
Pi at 2**29 iterations = 3.1415965482592583

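I suspect that I could do a bit better with a better random number generator, but I don’t really think it’s worth the time: the error in a sampling method like this comes mostly from the randomness of the sample rather than from floating point roundoff, and it shrinks only about as fast as one over the square root of the number of points. Just this much, using a really simple, obvious, intuitive method, produces a better result than any of the numbers pushed by the crackpots.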

To support that previous statement: the best crackpot value for π is the one based on the golden ratio. That version insists that the true value of π is 3.14460551103. But you can see – by using the simple metric of counting points inside and outside the circle – that the actual value is quite different from that.
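To put a number on "quite different", here is a rough sketch that compares each claimed value against the 2**29 estimate from the run above, measured in units of the estimate's expected sampling noise. The noise formula is the standard error of a proportion, scaled by 4. The helper name is made up, and treating the points as independent is an assumption.

```python
import math

estimate = 3.1415965482592583  # the 2**29 result from the run above
points = 2**29


def sampling_noise(estimate, points):
    """Approximate standard error of the point-counting estimate of pi."""
    p = estimate / 4  # observed fraction of points inside the quarter circle
    return 4 * math.sqrt(p * (1 - p) / points)


golden_ratio = (1 + math.sqrt(5)) / 2
claimed_values = {
    "4": 4.0,
    "22/7": 22 / 7,
    "4/sqrt(phi)": 4 / math.sqrt(golden_ratio),
}

noise = sampling_noise(estimate, points)
for name, value in claimed_values.items():
    gap = abs(value - estimate)
    print(f"{name}: off by {gap:.6f}, roughly {gap / noise:.0f} times the sampling noise")
```

Every one of the claimed values sits many multiples of the sampling noise away from the estimate, so the disagreement can't be blamed on an unlucky run.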

That’s what makes this breed of denialism so stupid. π isn’t complicated: it’s a simple ratio. And it’s easy to test using simple concepts. Pi relates the diameter (or radius) of a circle to the circumference or area of that circle. So any test that works with circles can easily show you what π is. There’s nothing mysterious or counterintuitive or debatable about it. It is what it is, and you can test it yourself.
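As one more example of a test that works with circles, here is a sketch of the classic inscribed-polygon approach: start with a hexagon inside a circle of radius 1, keep doubling the number of sides, and divide the polygon's perimeter by the circle's diameter. The side-length update comes from the Pythagorean theorem; the function name is just for illustration, and nothing in the calculation uses a known value of π.

```python
import math


def pi_from_inscribed_polygons(doublings):
    """Estimate pi as (polygon perimeter) / (circle diameter) for a polygon inside a unit circle."""
    sides = 6
    side_length = 1.0  # a regular hexagon inscribed in a unit circle has sides of length 1
    for _ in range(doublings):
        # Side length of the polygon with twice as many sides.
        side_length = side_length / math.sqrt(2 + math.sqrt(4 - side_length**2))
        sides *= 2
    return sides * side_length / 2  # perimeter divided by the diameter, which is 2


for doublings in (0, 2, 4, 8):
    print(6 * 2**doublings, "sides:", pi_from_inscribed_polygons(doublings))
```

Because the polygon sits inside the circle, every estimate comes in slightly below π and creeps upward as the sides multiply.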

Abusing Linear Regression to Make a Point

A bunch of people have been sending me links to a particularly sloppy article that (mis)uses linear regression to draw an incorrect conclusion from some data. So I guess I’ve got to go back to good-old linear regression, and talk about it a bit.

Let’s start with the basics. What is linear regression?

If you have a collection of data – typically data with one independent variable, and one dependent variable (that is, the first variable can vary any way it wants; changing it will change the second variable), then you’re probably interested in how the dependent variable relates to the independent. If you have reason to believe that they should have a linear relationship, then you’d like to know just what that linear relationship is.

If your data were perfect, then you’d just need to plot all of the data points on a graph, with the independent variable on the X axis, and the dependent on the Y, and then your graph would be a line, and you could get its slope and Y intercept, and thus completely capture the relationship.

But data is never perfect. There’s a lot of reasons for that, but no real set of collected data is ever perfect. No matter how perfect the real underlying linear relationship is, real measured data will always show some scatter. And that means that you can draw a lot of possible lines through the collected data. Which one of them represents the best fit?

Since that’s pretty abstract, I’m going to talk a bit about an example – the very example that was used to ignite my interest in math!

Back in 1974 or so, when I was a little kid in second grade, my father was working for RCA, as a physicist involved in manufacturing electronics for satellite systems. One of the important requirements for the products they were manufacturing was that they be radiation hard – meaning that they could be exposed to quite a bit of radiation before they would be damaged enough to stop working.

Their customers – NASA, JPL, and various groups from the U.S. military – had very strong requirements. They had to show, for a manufacturing setup of a particular component, what the failure profile was.

The primary failure mode of these chips they were making was circuit trace failure. If a sufficiently energetic gamma ray hit one of the circuit traces, it was possible that the trace would burn out – breaking the circuit, and causing the chip to fail.

The test setup that they used had a gamma ray emitter. So they’d make a manufacturing run to produce a batch of chips from the setup. Then they’d take those, and they’d expose them to increasing doses of radiation from the gamma emitter, and detect when they failed.

For trace failure, the probability of failure is linear in the size of the radiation dose that the chip is exposed to. So to satisfy the customer, they had to show them what the slope of the failure curve was. “Radiation hard” was defined as being able to sustain exposure to some dose of radiation with a specified probability of failure.

So, my dad had done a batch of tests, and he had a ton of little paper slips that described the test results, and he needed to compute the slope of that line – which would give the probability of failure as a multiple of the radiation dose.

I walked into the dining room, where he was set up doing this, and asked what he was doing. So he explained it to me. A lot like I just explained above – except that my dad was a much better teacher than me. I couldn’t explain this to a second or third grader the way that he did!

Anyway… The method that we use to compute the best line is called least squares. The intuition behind it is that you’re trying to find the line where the average distance of all of the datapoints from that line is the smallest. But a simple average doesn’t work well – because some of the data points are above the line, and some are below. Just because one point is, say, above a possible fit by 100, and another is below by 100 doesn’t mean that the two should cancel. So you take the distance between the data points and the line, and you square them – making them all positive. Then you find the line where that total is the smallest – and that’s the best fit.
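The cancellation problem is easy to see with two made-up points and a deliberately bad candidate line. This is a minimal sketch; the function names and data are invented for the example.

```python
def signed_total(points, slope, intercept):
    """Sum of the signed vertical distances from each point to the line."""
    return sum(y - (slope * x + intercept) for x, y in points)


def squared_total(points, slope, intercept):
    """Sum of the squared vertical distances from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


# Candidate line y = 0: one point is 100 above it, the other is 100 below.
points = [(1, 100), (2, -100)]
print(signed_total(points, slope=0, intercept=0))   # the two misses cancel out
print(squared_total(points, slope=0, intercept=0))  # the two misses add up
```

The signed total makes a terrible fit look perfect, while the squared total exposes how far off the line really is.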

So let’s look at a real-ish example.

For example, here’s a graph that I generated semi-randomly of data points. The distribution of the points isn’t really what you’d get from real observations, but it’s good enough for demonstration.

[Figure: scatter plot of randomly skewed data]

The way that we do that is: first we compute the means of x and y, which we’ll call \overline{x} and \overline{y}. Then using those, we compute the slope as:

 m = \frac{\sum_{i=1}^{n} (x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^{n} (x_i-\overline{x})^2}

Then for the y intercept: b = \overline{y} - m\overline{x}.
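Those two formulas translate almost directly into code. Here is a minimal pure-Python sketch; the function name and the sample data are made up, and it isn't the script that produced the figure.

```python
def fit_line(xs, ys):
    """Least-squares slope and y intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up measurements that roughly follow y = 3x + 17.
xs = [1, 2, 3, 4, 5]
ys = [20.1, 23.2, 25.9, 29.1, 31.8]
print(fit_line(xs, ys))
```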

In the case of this data: I set up the script so that the slope would be about 2.2 +/- 0.5. The slope in the figure is 2.54, and the y-intercept is 18.4.

Now, we want to check how good the linear relationship is. There’s several different ways of doing that. The simplest is called the correlation coefficient, or r.

 r = \frac{\sum_{i=1}^{n} (x_i-\overline{x})(y_i - \overline{y})}{\sqrt{ \left(\sum_{i=1}^{n} (x_i-\overline{x})^2\right) \left(\sum_{i=1}^{n} (y_i - \overline{y})^2\right) }}

If you look at this, it’s really a check of how well the variation between the measured values and the expected values (according to the regression) match. On the top, you’ve got a set of products; on the bottom, you’ve got the square root of the same thing squared. The bottom is, essentially, just stripping the signs away. The end result is that if the correlation is perfect – that is, if the dependent variable increases linearly with the independent, then the correlation will be 1. If the dependent variable decreases linearly as the independent increases, then the correlation will be -1. If there’s no linear relationship, then the correlation will be 0.

For this particular set of data, I generated it with a linear equation with a little bit of random noise. The correlation coefficient is slightly greater than 0.95, which is exactly what you’d expect.
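Here is a small sketch of the correlation coefficient in plain Python, checked against the three cases described above: a perfect increasing line, a perfect decreasing line, and data with no linear trend. The function name and test data are invented for the example.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


print(correlation([0, 1, 2, 3], [1, 3, 5, 7]))  # y rises in lockstep with x
print(correlation([0, 1, 2, 3], [7, 5, 3, 1]))  # y falls in lockstep with x
print(correlation([-1, 0, 1], [1, 0, 1]))       # a symmetric V shape: no linear trend
```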

Ok, so that’s the basics of linear regression. Let’s get back to the bozo-brained article that started this.

They featured this graph:

You can see the scatter-plot of the points, and you can see the line that was fit to the points by linear regression. How does that fit look to you? I don’t have access to the original dataset, so I can’t check it, but I’m guessing that the correlation there is somewhere around 0.1 or 0.2 – also known as “no correlation”.

You see, the author fell into one of the classic traps of linear regression. Look back at the top of this article, where I started explaining it. I said that if you had reason to believe in a linear relationship, then you could try to find it. That’s the huge catch to linear regression: no matter what data you put in, you’ll always get a “best match” line out. If the dependent and independent variables don’t have a linear relation – or don’t have any actual relation at all – then the “best match” fit that you get back as a result is garbage.
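You can watch this trap in action by feeding regression two columns of numbers that have nothing to do with each other. The sketch below is an illustration with arbitrary random data and an arbitrary seed, and the function name is made up. It still gets a slope and an intercept back, and only the correlation coefficient gives away that the line means nothing.

```python
import math
import random


def fit_and_correlate(xs, ys):
    """Least-squares slope and intercept, plus the correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    slope = covariation / spread_x
    intercept = mean_y - slope * mean_x
    r = covariation / math.sqrt(spread_x * spread_y)
    return slope, intercept, r


rng = random.Random(2020)  # arbitrary seed, so the run is repeatable
xs = [rng.uniform(0, 100) for _ in range(2000)]
ys = [rng.uniform(0, 100) for _ in range(2000)]  # generated independently of xs

slope, intercept, r = fit_and_correlate(xs, ys)
print(f"best-fit line: y = {slope:.3f}x + {intercept:.1f}, r = {r:.3f}")
```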

That’s what the graph above shows: you’ve got a collection of data points that to all appearances has no linear relationship – and probably no direct relationship at all. The author is interpreting the fact that linear regression gave him an answer with a positive slope as if that positive slope is meaningful. But it’s only meaningful if there’s actually a relationship present.

But when you look at the data, you don’t see a linear relationship. You see what looks like a pretty random scatterplot. Without knowing the correlation coefficient, we don’t know for sure, but that line doesn’t look to me like a particularly good fit. And since the author doesn’t give us any evidence beyond the existence of that line to believe in the relationship that they’re arguing for, we really have no reason to believe them. All they’ve done is demonstrate that they don’t understand the math that they’re using.

Dashi-Braised Brisket with Onions

I’m a nice Jewish boy, so I grew up eating a lot of brisket. Brisket’s an interesting piece of meat. By almost any reasonable standard, it’s an absolutely godawful cut of beef. It’s ridiculously tough. We’re not talking just a little bit chewy here: you can cook a hunk of brisket for four hours, and still have something that’s inedible, because your teeth can’t break it down. It’s got a huge layer of fat on top – but the meat itself is completely lean – so if you cook it long enough to be chewable, it can be dry as a bone.

But my ancestors were peasants. They couldn’t afford to eat beef normally, and when special occasions rolled around, the only beef they could afford was the stuff that no one else wanted. So they got briskets.

If you get interested in foods, though, you learn that many of the best foods in the world started off with some poor peasant who wanted to make something delicious, but couldn’t afford expensive ingredients! Brisket is a perfect example. Cook it for a good long time, or in a pressure cooker, with lots of liquid, and lots of seasoning, and it’s one of the most flavorful pieces of the entire animal. Brisket is really delicious, once you manage to break down the structure that makes it so tough. These days, it’s become super trendy, and everyone loves brisket!

Anyway, like I said, I grew up eating Jewish brisket. But then I married a Chinese woman, and in our family, we always try to blend traditions as much as we can. In particular, because we’re both food people, I’m constantly trying to take things from my tradition, and blend some of her tradition into it. So I wanted to find a way of blending some Chinese flavors into my brisket. What I wound up with is more Japanese than Chinese, but it works. The smoky flavor of the dashi is perfect for the sweet meatiness of the brisket, and the onions slowly cook and sweeten, and you end up with something that is distinctly similar to the traditional Jewish onion-braised-brisket, but also very distinctly different.


  1. 1 brisket.
  2. 4 large onions.
  3. 4 packets of shredded bonito flakes from an Asian market.
  4. 4 large squares of konbu (Japanese dried kelp).
  5. 1 cup soy sauce.
  6. 1 cup apple cider.
  7. Random root vegetables that you like. I tend to go with carrots and daikon radish, cut into 1 inch chunks.


  1. First, make some dashi:
    1. Put about 2 quarts of water into a pot on your stove, and bring to a boil.
    2. Lower to a simmer, and then add the konbu, and simmer for 30 minutes.
    3. Turn off the heat, add the bonito, and then let it sit for 10 minutes.
    4. Strain out all of the kelp and bonito, and you’ve got dashi!
  2. Slice all of the onions into strips.
  3. Cut the brisket into sections that will fit into an instant pot or other pressure cooker.
  4. Fill the instant pot by laying a layer of onions, followed by a piece of brisket, followed by a layer of onions until all of the meat is covered in onions.
  5. Take your dashi, add the apple cider, and add soy sauce until it tastes too salty. That’s just right (Remember, your brisket is completely unsalted!) Pour it over the brisket and onions.
  6. Fill in any gaps around the brisket and onions with your root vegetables.
  7. Cook in the instant pot for one hour, and then let it slowly depressurize.
  8. Meanwhile, preheat your oven to 275°F.
  9. Transfer the brisket from the instant pot to a large casserole or dutch oven. Cover with the onions. Taste the sauce – it should be quite a bit less salty. If it isn’t salty enough, add a bit more soy sauce; if it tastes sour, add a bit more apple cider.
  10. Cook in the oven for about 1 hour, until the top has browned; then turn the brisket over, and let it cook for another hour until the other side is brown.
  11. Slice into thick slices. (It should be falling apart, so that you can’t cut it thin!)
  12. Strain the fat off of the broth, and cook with a bit of cornstarch to thicken into a gravy.
  13. Eat.

Category Theory Lesson 3: From Arrows to Lambda

Quick personal aside: I haven’t been posting a lot on here lately. I keep wanting to get back to it; but each time I post anything, I’m met by a flurry of crap: general threats, lawsuit threats, attempts to steal various of my accounts, spam to my contacts on linkedin, subscriptions to ashley madison or gay porn sites, etc. It’s pretty demotivating. I shouldn’t let the jerks drive me away from my hobby of writing for this blog!

I started this series of posts by saying that Category Theory was an extremely abstract field of mathematics which was really useful in programming languages and in particular in programming language type systems. We’re finally at one of the first places where you can really see how that’s going to work.

If you program in Scala, you might have encountered curried functions. A curried function is something that’s in-between a one-parameter function and a two parameter function. For a trivial example, we could write a function that adds two integers in its usual form:

  def addInt(x: Int, y: Int): Int = x + y

That’s just a normal two parameter function. Its curried form is slightly different. It’s written:

  def curriedAddInt(x: Int)(y: Int): Int = x + y

The curried version isn’t actually a two parameter function. It’s a shorthand for:

  def realCurriedAddInt(x: Int): (Int => Int) = (y: Int) => x + y

That is: realCurriedAddInt is a function which takes an integer, x, and returns a function which takes one parameter, and adds x to that parameter.

Currying is the operation of taking a two parameter function, and turning it into a one-parameter function that returns another one-parameter function – that is, the general form of converting addInt to realCurriedAddInt. It might be easier to read its type: realCurriedAddInt: Int => (Int => Int). It’s a function that takes an int, and returns a new function from int to int.
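The general operation, as opposed to one hand-curried function, can be sketched as a pair of higher-order functions. This version is in Python rather than Scala, and curry and uncurry here are illustrative helpers written for this example, not library functions.

```python
def add_int(x, y):
    return x + y


def curry(f):
    """Turn a two-parameter function into a one-parameter function that returns a function."""
    def take_first(x):
        def take_second(y):
            return f(x, y)
        return take_second
    return take_first


def uncurry(g):
    """Go the other way: turn a curried function back into a two-parameter function."""
    def two_parameter(x, y):
        return g(x)(y)
    return two_parameter


curried_add_int = curry(add_int)
add_five = curried_add_int(5)  # a function from int to int
print(add_five(3), uncurry(curried_add_int)(5, 3))
```

The fact that you can convert back and forth without losing anything is the hint that the two forms describe the same thing.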

So what does that have to do with category theory?

One of the ways that category theory applies to programming languages is that types and type theory turn out to be natural categories. Almost any programming language type system is a category. For example, the figure below shows a simple view of a programming language with the types Int, Bool, and Unit. Unit is the terminal object, and all of the primitive constants are defined with arrows from Unit.

For the most part, that seems pretty simple: a type T is an object in the programming language category; a function implemented in the language that takes a parameter of type A and returns a value of type B is an arrow from A to B. A multi-parameter function just uses the cartesian product: a function that takes (A, B) and returns a C is an arrow from A \times B \rightarrow C.

But how could we write the type of a function like our curried adder? It’s a function from a value to a function. The types in our language are objects in the category. So where’s the object that represents functions from A to B?

As we do often, we’ll start by thinking about some basic concepts from set theory, and then generalize them into categories and arrows. In set theory, we can define the set of functions from A to B as: B^A=\{f: A \rightarrow B\} – that is, as exponentiation of the range of the produced functions. Generalizing that to categories, an exponential object B^A needs two things:

  • There’s a product object B^A \times A.
  • There’s an arrow from B^A \times A \rightarrow B, which we’ll call eval.

In terms of the category of sets, what that means is:

  • You can create a pair of a function from A \rightarrow B and an element of A.
  • There is a function named eval which takes that pair, and returns an instance of B.
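For small finite sets you can build B^A explicitly and see where the exponent notation comes from: there are |B|^|A| functions from A to B. The sketch below represents each function as a Python dict; the sets, the helper names, and that representation are all just choices made for the example.

```python
from itertools import product


def all_functions(domain, codomain):
    """Every function from domain to codomain, each represented as a dict of input -> output."""
    domain = list(domain)
    return [dict(zip(domain, outputs)) for outputs in product(codomain, repeat=len(domain))]


def eval_pair(pair):
    """Take a (function, argument) pair from B^A x A and return an element of B."""
    f, a = pair
    return f[a]


A = ["a1", "a2", "a3"]
B = [True, False]

functions = all_functions(A, B)
print(len(functions), len(B) ** len(A))  # both count the functions from A to B
print(eval_pair((functions[0], "a2")))
```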

Like we saw with products, there’s a lot of potential exponential objects C which have the necessary product with A, and arrow from that product to B. But which one is the ideal exponential? Again, we’re trying to get to the object with the universal property – the terminal object in the category of pseudo-exponentials. So we use the same pattern as before. For any potential exponential, there’s an arrow from the potential exponential to the actual exponential, and the potential exponential with arrows from every other potential exponential is the exponential.

Let’s start putting that together. A potential exponential C for B^A is an object where the following product and arrow exist:

There’s an instance of that pattern for the real exponential:

We can create a category of these potential exponentials. In that category, there will be an arrow from every potential exponential to the real exponential. Each of the potential exponentials has the necessary property of an exponential – that product and eval arrow above – but they also have other properties.

In that category of potential exponentials of B^A, there’s an arrow from an object X to an object Y if the following conditions hold in the base category:

  • There is an arrow \text{curry}(x,y): X \rightarrow Y in the base category.
  • There is an arrow \text{curry}(x,y)\times id_A: X\times A \rightarrow Y\times A
  • \text{eval}_y\circ(\text{curry}(x,y)\times id_A)=\text{eval}_x

It’s easiest to understand that by looking at what it means in Set:

  • We’ve got sets X and Y, which we believe are potential exponentials.
  • X has a function \text{eval}_x: X \times A \rightarrow B.
  • Y has a function \text{eval}_y: Y \times A \rightarrow B.
  • There’s a function \text{curry}: X \rightarrow Y which converts a value of X to a value of Y, and a corresponding function \text{curry}(\cdot)\times\text{id}_A: X\times A \rightarrow Y\times A, which given a pair (x, a) \in X\times A transforms it into a pair (y, a) \in Y\times A, where \text{eval}_x(x, a)=\text{eval}_y(\text{curry}(x), a). In other words, if we restrict the inputs to Y to be effectively the same as the inputs to X, then the two eval functions do the same thing. (Why do I say restrict? Because \text{eval}_y might have a larger domain than the range of X, but these rules won’t capture that.)
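The Set reading above can be sketched in Scala. This is an illustration under assumptions, not a definition: the candidate X is taken to be Int (where the number n stands for "add n"), Y is the ordinary function type Int => Int, and the names evalX, evalY, and curry are hypothetical.

```scala
object PotentialExponential {
  // Hypothetical candidate X = Int: the number n stands for the function "add n".
  def evalX(x: Int, a: Int): Int = x + a

  // The real exponential Y = Int => Int, with ordinary application as its eval.
  def evalY(f: Int => Int, a: Int): Int = f(a)

  // The arrow from the candidate X to the real exponential Y.
  def curry(x: Int): Int => Int = a => evalX(x, a)

  // The commuting condition evalX(x, a) == evalY(curry(x), a),
  // checked on a small finite sample (a spot check, not a proof).
  val commutes: Boolean =
    (-3 to 3).forall(x => (-3 to 3).forall(a => evalX(x, a) == evalY(curry(x), a)))
}
```

Notice that Y contains functions such as squaring that no value of X maps to, which is the sense in which the inputs to Y have to be restricted.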

An arrow in the category of potential exponentials is a pair of arrows in the base category: one from C \rightarrow B^A, and one from C\times A \rightarrow B^A \times A. Since the two arrows are deeply related (they’re one arrow in the category of potential exponentials), we’ll call them \text{curry}(g) and \text{curry}(g)\times id_A. (Note that we’re not really taking the product of an arrow here: we haven’t talked about anything like taking products of arrows! All we’re doing is giving the arrow a name that helps us understand it. The name makes it clear that we’re not touching the right-hand component of the product.)

Since the exponential is the terminal object in that category, that pair of curry arrows must exist from every potential exponential to the true exponential. So the exponential object is the unique (up to isomorphism) object for which the following is true:

  • There’s an arrow \text{eval}: B^A \times A \rightarrow B. Since B^A is the type of functions from A to B, \text{eval} represents the application of one of those functions to a value of type A to produce a result of type B.
  • For each two-parameter function g:C\times A\rightarrow B, there is a unique function (arrow) \text{curry}(g): C \rightarrow B^A that makes the following diagram commute; that is, \text{eval}\circ(\text{curry}(g)\times id_A)=g.
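In programming terms, the two conditions above describe ordinary currying. Here is a minimal Scala sketch, assuming Scala function types play the role of B^A. The names eval, curry, add, and curriedAdd are hypothetical and local to this example.

```scala
object CurrySketch {
  // eval: B^A x A -> B is just function application.
  def eval[A, B](f: A => B, a: A): B = f(a)

  // curry(g): C -> B^A for a two-parameter function g: C x A -> B.
  def curry[C, A, B](g: (C, A) => B): C => (A => B) = c => a => g(c, a)

  // A hypothetical two-parameter function, and its curried form.
  val add: (Int, Int) => Int = (c, a) => c + a
  val curriedAdd: Int => (Int => Int) = curry(add)

  // The diagram: currying the first component and then evaluating equals g itself.
  // Checked on a finite sample only.
  val commutes: Boolean =
    (0 to 5).forall(c => (0 to 5).forall(a => eval(curriedAdd(c), a) == add(c, a)))
}
```

The uniqueness half of the condition is not something a test can show. It follows because any other function from C to Int => Int that agreed with g after evaluation would have to return the same function for every c.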

Now, how does all this relate to what we understand as currying?

It shows us that in category theory we can have an object that effectively represents a function type in the same category as the object that represents the type of values it operates on, and you can capture the notion of applying values of that function type onto values of their parameter type using an arrow.

As I said before: not every category has a structure that can support exponentiation. The examples of this aren’t particularly easy to follow. The easiest one I’ve found is Top, the category of topological spaces. In Top, the exponent doesn’t exist for many objects. Objects in Top are topological spaces, and arrows are continuous functions between them. For any two objects in Top, you can create the necessary objects for the exponential. But for many topological spaces, the required arrows don’t exist. The functions that they correspond to exist in Set, but they’re not continuous – and so they aren’t arrows in Top. (The standard sufficient condition is that an exponential X^Y exists when Y is a locally compact Hausdorff space. What that means is well beyond the scope of this!)

Cartesian Closed Categories

If you have a category C, and for every pair of objects A and B in the category C, there exists an exponential object B^A \in C, then we’ll say that C has exponentiation. Similarly, if for every pair of objects A, B \in Ob(C), there exists a product object A\times B, we say that the category has products.

There’s a special kind of category, called a cartesian closed category, which is a category where:

  1. Every pair of objects has both product and exponent objects; and
  2. There is at least one terminal object. (Remember that terminals are something like singletons, and so they work as a way of capturing the notion of being a single element of an object; so this requirement basically says that the category has at least one value that “functions” can be applied to.)

That may seem like a very arbitrary set of rules: what’s so important about having all products, exponents, and a terminal object?

It means that we have a category which can model types, functions, and function application. Lambda calculus shows that that’s all you need to model computation. Cartesian closed categories are, basically, a formal model of a computing system! Any cartesian closed category is a model for a simply typed \lambda-calculus; and \lambda-calculus is something known as the internal language of a cartesian closed category.

What “internal language” means formally is complicated, but in simple terms: you can take any computation in lambda calculus, and perform the computation by chasing arrows in a category diagram of a cartesian closed category that includes the values of that calculus. Alternatively, every computation step that you perform evaluating a \lambda-calculus expression corresponds to an arrow in a CCC.


For this post, I’ve made heavy use of:

Life with Social Anxiety: Masking

I’ve been thinking about how to talk about social anxiety more. This recently came up at work, and I thought it would be worth writing down. As usual, I’m talking about my own experiences as a person with severe social anxiety. I think there are others who feel the same way as I do – but equally, there are plenty of people with social anxiety disorders who feel very different. I can only talk about what I feel, and what I experience – so don’t assume that I’m talking for anyone but myself.

One of the interesting facets of social anxiety is that people with SA don’t necessarily act the way that you expect us to. People generally expect us to be like one of the characters from The Big Bang Theory.

In reality, most of us have an adaptive behavior that we learn, which I call masking. For many people with social anxiety, if you encounter them at work or on the street, you’d never guess that we had any kind of anxiety problem. It’s the nature of social anxiety that we want to hide the anxiety that we feel, and so we find ways to do it.

The heart of social anxiety for me is the feeling that there’s something wrong with me – that I’m weird, freakish, abnormal, that I’m broken – and that when people realize that, they’re going to reject me. It doesn’t make sense, but it doesn’t have to. I can know, intellectually, that it’s a pile of crap, but that doesn’t stop me from feeling it; it doesn’t stop my body from reacting to it. I spent years of school being regularly abused – mocked, beaten, tormented – and that got wired into my brain. That’s the way that I expect to be treated by people I don’t know well – and even, sometimes, by people that I do know.

A way to cope with that is to act like I’m a normal, functional person. I don’t believe that I’m normal. I don’t really understand what it’s like to be normal. But I’ve learned how, in many situations, to fake it well enough to get by. The way that I do that is masking. Think of what you do when you’re painting something. You want to expose certain areas to the paint, and you don’t want to expose others. So you cover up parts of the object with masking tape – and then you’ll only get paint on the parts that aren’t masked. That’s what I’ve learned to do. I take a piece of myself that I think is close to normal for a situation, and build a persona around it. I mask off everything that doesn’t fit – so people can’t see the parts of me that I don’t want them to. Masking makes it much easier to interact, because I’ve constructed the mask to only show the parts of myself that I think people won’t react badly to. I’ve created a version of myself that I hope won’t draw any attention for being weird.

A mask lets me appear to be a normal, confident person. It lets me go to work each day, and interact with people on the train, on the street, in the office, without turning into a basket case from the stress.

I try to be open with the people I really work closely with about who I am, and what I feel. I don’t hide the fact that I have social anxiety, and I do my best to minimize the mask. But I do wear a mask at work, because without it, I wouldn’t be able to function.

The people I work with think I’m kind-of loud. They think I’m really confident – probably a bit over-confident. I try to talk about my social anxiety disorder, but I’m not sure if they actually believe me – because what I’m saying about how I feel isn’t consistent with how they see me behave. My masks have gotten good enough that as long as I’m in a situation that I’ve prepared for, most of the time, you can’t see past them to actually see what I’m feeling.

The big weakness of a mask is that it’s an act. It’s not the real me – it’s a face that I present to the world so that they don’t really see me. It’s something that I need to consciously construct and prepare. If I’m put into a situation that I couldn’t prepare for, then I don’t necessarily have a mask ready. And that means that I’m just me – the broken person who’s paralyzed with fear.

I can get up in front of a classroom full of people, and give a lecture. I can get up in front of the congregation at my synagogue, and give a drash that I wrote – I can do both of those things without feeling overly stressed. People expect that a person with social anxiety won’t be able to do that, but that’s easy. It’s a situation where I know what’s expected of me, where I know what to do and how to behave. So I can mask myself in a way that lets me show the parts of myself that I need for that performance, and hide the rest.

But ask me to sit down and eat lunch with a random selection of people after I’m done teaching my class? That is hard. I don’t know who I’m dealing with. I don’t know how to talk to them, what they expect from me, how they’re going to react to me. That’s the kind of situation that triggers my anxiety, and that can, easily, wreck me. I don’t have a mask ready for that.

Category Theory Lesson 2: Basics of Categorical Abstraction

In my last post about category theory, I introduced the basic idea of typeclasses, and showed how to implement an algebraic monoid as a typeclass. Then we used that algebraic monoid as an example of how to think with arrows, and built it up into a sketch of category theory’s definition of a monoid. We ended with an ominous looking diagram that illustrated what a categorical monoid looks like.

In this post, we’re going to take a look at some of the formal definitions of the basic ideas of category theory. By the end of this lesson, you should be able to look at the categorical monoid, and understand what it means. But the focus of this post will be on understanding initial and terminal objects, and the role they play in defining abstractions in category theory. And in the next post, we’ll see how abstractions compose, which is where the value of category theory to programmers will really become apparent.

Before I really get started: there’s a lot of terminology in category theory, and this post has a lot of definitions. Don’t worry: you don’t need to remember it all. You do need to understand the concepts behind it, but specifically remembering the difference between, say, an endomorphism, an epimorphism, and a monomorphism isn’t important: you can always look it up. (And I’ll put together a reference glossary to help make that easy.)

Defining Categories

A category is basically a directed graph – a bunch of dots connected by arrows, where the arrows have to satisfy a bunch of rules.

Formally, we can say that a category consists of three parts: (O, M, \circ), where:

  • O is a collection of objects. We don’t actually care what the objects are – the only thing we can do in category theory is look at how the objects are related through the arrows that connect them. For a category C, we’ll often call this collection Obj(C).
  • M is a collection of arrows, often called morphisms. Each element of M starts at one object called its domain (often abbreviated dom), and ends at another object called its codomain (abbreviated cod). For an arrow f that goes from a to b, we’ll often write it as f:a \rightarrow b. For a category C, we’ll often call this set Mor(C) (for morphisms of C).
  • \circ is a composition operator. For every pair of arrows f:a \rightarrow b, and g:b \rightarrow c, there must be an arrow g\circ f: a \rightarrow c called the composition of f and g.
  • To be a category, these must satisfy the following rules:
    1. Identity: For every object o \in Obj(C), there must be an arrow from o to o, called the identity of o. We’ll often write it as id_o:o \rightarrow o. For any arrow f: x \rightarrow o, id_o \circ f=f; and for any arrow g:o \rightarrow y, g \circ id_o=g. That’s just a formal way of saying that composing an identity arrow with any other arrow results in the other arrow.
    2. Associativity: For any set of arrows f:w \rightarrow x, g:x \rightarrow y, h:y \rightarrow z: h \circ (g\circ f) = (h\circ g)\circ f.
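The two rules are easy to spot-check in code. Here is a minimal Scala sketch, assuming we treat Scala types as objects and total functions as arrows. The helper names id and compose, and the sample arrows f, g, and h, are hypothetical. Functions can only be compared pointwise, so the checks run over a finite sample and are evidence rather than proof.

```scala
object FunctionCategory {
  // The identity arrow for any object (type) A.
  def id[A]: A => A = a => a

  // Composition: compose(g, f) is "g after f".
  def compose[A, B, C](g: B => C, f: A => B): A => C = a => g(f(a))

  // Three hypothetical arrows: Int -> Int -> String -> Int.
  val f: Int => Int = _ + 1
  val g: Int => String = _.toString
  val h: String => Int = _.length

  val sample = -50 to 50

  // Identity: id after f is f, and g after id is g.
  val identityLaw: Boolean =
    sample.forall(x => compose(id[Int], f)(x) == f(x) && compose(g, id[Int])(x) == g(x))

  // Associativity: h after (g after f) equals (h after g) after f.
  val associativityLaw: Boolean =
    sample.forall(x => compose(h, compose(g, f))(x) == compose(compose(h, g), f)(x))
}
```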

When talking about category theory, people often say that an arrow is a structure preserving mapping between objects. We’ll see what that means in slightly more detail with some examples.

A thing that I keep getting confused by involves ordering. Let’s look at a quick little diagram for a moment. The path from X to Z is g \circ f – because g comes after f, which (at least to me) looks backwards. When you write it in terms of function application, it’s g(f(x)). You can read g \circ f as g after f, because the arrow g comes after the arrow f in the diagram; and if you think of arrows as functions, then it’s the order of function application.
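Scala's standard Function1 happens to offer both spellings, which makes the ordering easy to see. In this small sketch, compose reads as "after" and andThen reads in diagram order. The arrows f and g here are hypothetical examples.

```scala
object CompositionOrder {
  val f: Int => Int = _ + 1    // f: X -> Y
  val g: Int => Int = _ * 10   // g: Y -> Z

  // "g after f": apply f first, then g, i.e. g(f(x)).
  val gAfterF: Int => Int = g compose f

  // The same arrow, written in the order you follow it in the diagram.
  val fThenG: Int => Int = f andThen g

  // Swapping the operands gives a different arrow: f(g(x)).
  val fAfterG: Int => Int = f compose g

  val atTwo: (Int, Int, Int) = (gAfterF(2), fThenG(2), fAfterG(2))
}
```

At the input 2, the first two components of atTwo agree (both are g(f(2)) = 30), while the third is f(g(2)) = 21.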

Example: The category Set

The most familiar example of a category (and one which is pretty canonical in category theory texts) is the category Set, where the objects are sets, the arrows between them are total functions, and the composition operator is function composition.

That might seem pretty simple, but there’s an interesting wrinkle to Set. 

Suppose, for example, that we look at the function f(x)=x^2. That’s obviously a function from Int to Int. Since Int is a set, it’s also an object in the category Set, and so f(x)=x^2 is obviously an arrow from Int \rightarrow Int. But there’s also the set Int+, which represents the set of non-negative integers. f(x)=x^2 is also a function from Int+ to Int+. So which arrow represents the function?

The answer is both – and many more. (It’s also a function from the reals to complex numbers, because every real number is also a complex number.) And so on. A function isn’t quite an arrow: an arrow is a categorical concept of some kind of mapping between two objects. In many ways, you can think of an arrow as something almost like a function with an associated type declaration: you can write many type declarations for a given function; any valid function together with a type declaration is an arrow in Set.

We’ll be looking at Set a lot. It’s a category where we have a lot of intuition, so using it as an example to demonstrate category concepts will be useful.

Example: The category Poset

Poset is the category of all partially ordered sets. The arrows between objects in Poset are order-preserving functions between partially ordered sets. This category is an example of what we mean by structure-preserving mappings: the composition operator must preserve the ordering property.

For that to make sense, we need to remember what a partially ordered set is, and what it means to be an order-preserving function.

  • A set S is partially ordered if it has a partial less-than-or-equal relation, \le. This relation doesn’t need to be total – some values are less than or equal to other values; and some values can’t be compared.
  • A function between two partially ordered sets f:A \rightarrow B is order-preserving if and only if for all values x, y \in A, if x \le y in A, then f(x)\le f(y) in B.

The key feature of an object in Poset is that it possesses a partial ordering. So arrows in the category must preserve that ordering: if x is less than or equal to y, then f(x) must be less than or equal to f(y).

That’s a typical example of what we mean by arrows as structure preserving: the objects of a category have some underlying structural property – and to be an arrow in the category, that structure must be preserved across arrows and arrow composition.
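As a concrete instance, here is a small Scala sketch of an order-preserving function. It assumes the poset of subsets of {1, 2, 3} ordered by inclusion, mapped into the integers ordered by <=. The names subsets, size, and sizeIsMonotone are hypothetical, picked for this example.

```scala
object MonotoneSketch {
  // A finite poset: all subsets of {1, 2, 3}, ordered by inclusion.
  val subsets: List[Set[Int]] = Set(1, 2, 3).subsets().toList

  // A candidate arrow: map each subset to its size (integers ordered by <=).
  val size: Set[Int] => Int = _.size

  // Order-preserving: whenever x <= y (x is a subset of y), size(x) <= size(y).
  val sizeIsMonotone: Boolean =
    subsets.forall(x => subsets.forall(y => !x.subsetOf(y) || size(x) <= size(y)))

  // The order is only partial: {1} and {2} cannot be compared in either direction.
  val incomparable: Boolean =
    !Set(1).subsetOf(Set(2)) && !Set(2).subsetOf(Set(1))
}
```

The function size qualifies as an arrow in Poset between these two objects. A function that reversed the order for some comparable pair would still be a function in Set, but not an arrow in Poset.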

Commuting Diagrams

One of the main terms that you’ll hear about category diagrams is about whether or not the diagram commutes. This, in turn, is based on arrow chasing.

An arrow chase is a path through the diagram formed by chaining arrows together by composing them – an arrow chase is basically discovering an arrow from one object to another by looking at the composition of other arrows in the category.

We say that a diagram \textbf{D} commutes if, for any two objects x and y in the diagram, every pair of paths between x and y compose to the same arrow. Another way of saying that is that if P(x, y) is the set of all paths in the diagram between x and y, then \forall p_i, p_j \in P(x, y): \circ(p_i) = \circ(p_j).

For example: In this diagram, we can see two paths: h\circ f and h\circ g. If this diagram commutes, it means that following f from A to B and then h from B to C must be the same thing as following g from A to B and then h from B to C. It doesn’t say that f and g are the same thing – an arrow chase doesn’t tell us anything about single arrows; it just tells us about how they compose. So what we know if this diagram commutes is that h\circ f=h \circ g.
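Here is a small Scala sketch of that last point: two different arrows f and g whose compositions with h agree. The particular functions are hypothetical, chosen so that h erases the difference between them, and the check runs over a finite sample.

```scala
object CommuteSketch {
  val f: Int => Int = x => x       // f: A -> B
  val g: Int => Int = x => -x      // g: A -> B, a different arrow
  val h: Int => Int = x => x * x   // h: B -> C

  val sample = -10 to 10

  // Both paths from A to C compose to the same arrow on the sample.
  val pathsAgree: Boolean =
    sample.forall(x => (h compose f)(x) == (h compose g)(x))

  // Yet f and g themselves are not the same arrow.
  val arrowsDiffer: Boolean = sample.exists(x => f(x) != g(x))
}
```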

Diagrams and Meta-level reasoning: an example

Let’s look at a pretty tricky example. We’ll take our time, because this is subtle, but it’s also pretty typical of how we do things in category theory. One of the key concepts of category theory is building a category, and then using the arrows in that category to create a new category that allows us to do meta-level reasoning.

We’ve seen that there’s a category of sets, called Set.

We can construct a category based on the arrows of Set, called \textbf{Set}^{\rightarrow}. In this category, each of the arrows in Set is an object. So, more formally, if f: A \rightarrow B \in \text{Mor}(\textbf{Set}) then f:A \rightarrow B \in \text{Obj}(\textbf{Set}^{\rightarrow}).

The arrows of this new category are where it gets tricky. Suppose we have two arrows in Set, f: A \rightarrow B and f': A' \rightarrow B'. These arrows are objects in \textbf{Set}^{\rightarrow}. There is an arrow from f to f' in \text{Mor}(\textbf{Set}^{\rightarrow}) if there is a pair of arrows a: A \rightarrow A' and b: B \rightarrow B' in \text{Mor}(\textbf{Set}) such that the following diagram commutes:

commuting diagram for arrows in set-arrow

The diagram is relatively easy to read and understand; explaining it in words is more complicated:

  • an arrow in our category of Set-arrows is a mapping from one Set-arrow f to another Set-arrow f'.
  • That mapping exists when there are two arrows a and b in \text{Mor}(\textbf{Set}) where:
    • a is an arrow from the domain of f to the domain of f';
    • b is an arrow from the codomain of f to the codomain of f'; and
    • b \circ f = f' \circ a.

Another way of saying that: the existence of such an arrow means that there’s a structure-preserving way of transforming an arrow from A\rightarrow B into an arrow from A' \rightarrow B'.

Why should we care about that? Well, for now, it’s just a way of demonstrating that a diagram can be a lot easier to read than a wall of text. But this kind of categorical mapping will become important later.
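For readers who prefer code to diagrams, the commuting-square condition for the arrow category can be sketched in Scala. Everything here is a hypothetical example: f doubles an Int, fPrime doubles a Double, and a and b both convert Int to Double. The condition is the usual one for an arrow category: going across with f and then down with b must equal going down with a and then across with fPrime.

```scala
object ArrowCategorySketch {
  val f: Int => Int = _ * 2              // an object of the arrow category: f: A -> B
  val fPrime: Double => Double = _ * 2   // another object: f': A' -> B'

  val a: Int => Double = _.toDouble      // from the domain of f to the domain of f'
  val b: Int => Double = _.toDouble      // from the codomain of f to the codomain of f'

  // The square commutes when (b after f) equals (f' after a).
  // Checked on a finite sample only.
  val squareCommutes: Boolean =
    (-100 to 100).forall(x => (b compose f)(x) == (fPrime compose a)(x))
}
```

The pair (a, b) is then a single arrow from f to fPrime in the arrow category.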

Categorizing Things

As I said earlier, category theory tends to have a lot of jargon. Everything we do in category theory involves reasoning about arrows, so there are many terms that describe arrows with particular properties. We’ll look at the most basic categories now, and we’ll encounter more in later lessons.

Monics, Epics, and Isos

The easiest way to think about all of these categories is by analogy with functions in traditional set-based mathematics. Functions and their properties are really important, so we define special kinds of functions with interesting categories. We have injections (functions from A to B where every element of A is mapped onto a unique element of B), surjections (functions from A to B where each element of B is mapped onto by an element of A), and isomorphisms.

In categories, we define similar categories: monomorphisms (monics), epimorphisms (epics), and isomorphisms (isos).

  • An arrow f:Y \rightarrow Z in category C is monic if for any pair of arrows g:X \rightarrow Y and h:X \rightarrow Y in C, f\circ g = f\circ h implies that g = h. (So a monic arrow discriminates arrows to its domain – every arrow to its domain from a given source will be mapped to a different codomain when left-composed with the monic.)
  • An epic is almost the same, except that it discriminates with right-composition: An arrow f:X \rightarrow Y in category C is epic if for any pair of arrows g:Y \rightarrow Z and h:Y \rightarrow Z in C, g\circ f = h\circ f implies that g = h. (So in the same way that a monic arrow discriminates arrows to its domain, an epic arrow discriminates arrows from its codomain.)

These definitions sound really confusing. But if you think back to sets, you can twist them into making sense. A monic arrow f:Y \rightarrow Z describes an injection in set theory: that is, a function that maps every element of Y onto a unique element of Z. So if you have some functions g and h that map from some set X to Y, then the only way that f\circ g can map onto Z in the same way as f\circ h is if g and h map onto Y in exactly the same way.

The same basic argument (reversed a bit) can show that an epic arrow is a surjective function in Set.

  • An isomorphism is a pair of arrows f:Y \rightarrow Z and f^{-1}: Z \rightarrow Y where f is monic and f^{-1} is epic, and where f\circ f^{-1}= id_Z, and f^{-1}\circ f = id_Y.

We say that the objects Y and Z are isomorphic if there’s an isomorphism between them.
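The cancellation property behind monic arrows can be seen in a small Scala sketch. The functions below are hypothetical: parity is not injective, twice is, and g and h are two distinct arrows into their shared domain. Equality of functions is checked pointwise on a finite sample, so this illustrates the definition rather than proving anything.

```scala
object MonicSketch {
  val sample = 0 to 20

  // Pointwise equality of two arrows on the sample.
  def sameOnSample(p: Int => Int, q: Int => Int): Boolean =
    sample.forall(x => p(x) == q(x))

  // Two distinct arrows g, h: X -> Y.
  val g: Int => Int = x => x
  val h: Int => Int = x => x + 2

  val parity: Int => Int = y => y % 2   // not injective, so not monic
  val twice: Int => Int = y => y * 2    // injective

  // parity cannot tell g and h apart: parity after g equals parity after h, but g != h.
  val parityCollapses: Boolean =
    sameOnSample(parity compose g, parity compose h) && !sameOnSample(g, h)

  // twice keeps them apart: the two compositions differ.
  val twiceSeparates: Boolean =
    !sameOnSample(twice compose g, twice compose h)
}
```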

Initial and Terminal Objects

Another kind of categorization that we look at is talking about special objects in the category. Categorical thinking is all about arrows – so even when we’re looking at special objects, what make them special are the arrows that they’re related to.

An initial object 0 in a category C is an object where for every object c \in \text{Obj}(\textbf{C}), there’s exactly one arrow 0_c: 0 \rightarrow c in \text{Mor}(\textbf{C}). Similarly, a terminal object 1 in a category \textbf{C} is an object where for every object c \in \text{Obj}(\textbf{C}), there is exactly one arrow 1_c: c \rightarrow 1 in \text{Mor}(\textbf{C}).

For example, in the category Set, the empty set is an initial object, and singleton sets are terminal objects.
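In Scala's type system there are rough analogues of both: Nothing behaves like the empty set and Unit like a singleton. The sketch below assumes we ignore null, exceptions, and non-termination, so it is an analogy rather than a faithful model of Set. The names toTerminal and fromInitial are hypothetical.

```scala
object InitialTerminalSketch {
  // Terminal: for any type A, the only total function A => Unit discards its input.
  def toTerminal[A]: A => Unit = _ => ()

  // Initial: Nothing has no values, so a function out of it can never actually be called.
  def fromInitial[A]: Nothing => A = n => n

  // Whatever we feed in, the result is the single value of Unit.
  val unitResults: List[Unit] = List(toTerminal[Int](42), toTerminal[String]("x"))
}
```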

A brief interlude: What’s the point?

In this lesson, we’ve spent a lot of time on formalisms and definitions of abstract concepts: isos, monos, epics, terminals. And after this pause, we’re going to spend a bunch of time on building some complicated constructions using arrows. What’s the point of all of this? What does any of this mean?

Underlying all of these abstractions, category theory is really about thinking in arrows. It’s about building structures with arrows. Those arrows can represent important properties of the objects that they connect, but they do it in a way that allows us to understand them solely in terms of the ways that they connect, without knowing what the objects connected by the arrows actually do.

In practice, the objects that we connect by arrows are usually some kind of aggregate: sets, types, spaces, topologies; and the arrows represent some kind of mapping – a function, or a transformation of some kind. We’re reasoning about these aggregates by reasoning about how mappings between the aggregates behave.

But if the objects represent some abstract concept of collections or aggregates, and we’re trying to reason about them, sometimes we need to be able to reason about what’s inside of them. Thinking in arrows, the only way to really be able to reason about a concept like membership, the only way we can look inside the structure of an object, is by finding special arrows.

The point of the definitions we just looked at is to give us an arrow-based way of peering inside of the objects in a category. These tools give us the ability to create constructions that let us take the concept of something like membership in a set, and abstract it into an arrow-based structure.

Reasoning in arrows, a terminal object is an object in a category that captures a concept of a single object. It’s easiest to see this by thinking about sets as an example. What does it mean if an object, T, is terminal in the category of sets?

It means that for every set S, there’s exactly one function from S to T. How can that be? If T is a set containing exactly one value t, then from any other set S, the only function from S \rightarrow T is the constant function f(x) = t. If T had more than one value in it, then it would be possible to have more than one arrow from S to T – because it would be possible to define different functions from S to T.

By showing that there’s only one arrow from any object in the category of sets to T, we’re showing that T can’t possibly have more than one object inside of it.

Knowing that, we can use the concept of a terminal object to create a category-theoretic generalization of the concept of set membership. If s is an element of a set S, then that set membership can be represented by the fact that there is an arrow from the terminal object \{s\} to S. In general, for any object S in a category, if there is an arrow from a terminal object \{t\} to S, then in some sense, t \in S.
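In Scala terms, the idea of an element as an arrow from a terminal object can be sketched with Unit standing in for the singleton set. This is an analogy under the usual assumption that types are treated as sets. The names element, three, and square are hypothetical.

```scala
object ElementSketch {
  // An element s of S corresponds to an arrow from the terminal object into S.
  def element[S](s: S): Unit => S = _ => s

  // The element 3 of Int, represented as an arrow Unit => Int.
  val three: Unit => Int = element(3)

  val square: Int => Int = x => x * x

  // Applying a function to an element is just composition with that arrow.
  val nine: Unit => Int = square compose three

  // Unit has a single value, so there is only one way to ask the arrow for its result.
  val value: Int = nine(())
}
```

Membership never has to look inside S here. Everything is phrased in terms of arrows and composition.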


We’re finally getting close to the real point of category theory. Category theory is built on a highly abstracted notion of functions – arrows – and then using those arrows for reasoning. But reasoning about individual arrows only gets you so far: things start becoming interesting when you start constructing things using arrows. In lesson one, we saw a glimpse of how you could construct a very generalized notion of monoid in categories – this is the first big step towards understanding that.


Constructions are ways of building things in categories. In general, the way that we work with constructions is by defining some idea using a categorical structure – and then abstracting that into something called a universal construction. A universal construction defines a new category whose objects are instances of the categorical structure; and we can understand the universal construction best by looking at the terminal objects in the universal construction – which we can understand as being the atomic objects in its category.

When we’re working with sets, we know that there’s a set-product called the cartesian product. Given two sets, A and B, the product A \times B=\{(a, b) : a \in A, b \in B\}.

The basic concept of a product is really useful. We’ll eventually build up to something called a cartesian closed category that uses the categorical product, and which allows us to define the basis of lambda calculus in category theory.

As usual, we want to take the basic concept of a cartesian product, and capture it in terms of arrows. So let’s look back at what a cartesian product is, and see how we can turn that into arrow-based thinking.

The simple version is what we wrote above: given two sets A and B, the cartesian product maps them into a new set which consists of pairs of values in the old set. What does that mean in terms of arrows? We can start by just slightly restating the definition we gave above: For each unique value a \in A, and each unique value b \in B, there’s a unique value (a, b) \in A \times B.

But what do we actually mean by (a, b)? Mathematicians have come up with a lot of different ways of constructing ordered pairs. But we want to create a general model of an ordered pair, so we don’t want to limit ourselves to any specific construction: we want to capture the key property of what the ordered pair means.

It doesn’t matter which one we use: what matters is that there’s a key underlying property of the product: there are two functions, \lambda and \rho, called projection functions, which map elements of the product back to the elements of A and B. If p=(a,b) \in A\times B, then \lambda(p) = a (where \lambda is the name of the left projection), and \rho(p) = b (where \rho is the name of the right projection).

That’s going to be the key to the categorical product: it’s going to be defined primarily by the projection functions. We know that the only way we can talk about things in category theory is to use arrows. The thing that matters about a product is that it’s an object with projections to its two parts. We can describe that, in category theory, as something that we’ll call a wedge:

The wedge: a candidate for a product.

A wedge is basically an object, like the one in the diagram to the right, which we’ll call A \land B. This object has two special arrows, l and r, that represent projections from A \land B to its components in A and B.

Now we get to the tricky part. The concept of a wedge captures the structure of what we mean by a product. But given two objects A and B, there isn’t just one wedge! In a category like Set, there are many different ways of creating objects with projections. Which object is the correct one to use for the product?

For example, I can have the set of triples (A, B, C). I can easily define a left projection from (A, B, C) to A, and a right projection from (A, B, C) to B. But clearly (A, B, C) is not what we mean by the product A \times B. It’s close, but it’s got extra noise attached, in the form of that third element C.

If, for two objects A and B, there are many wedges with left and right projections, which one is the real product?

Just a little while ago, we talked about initial and terminal objects. A terminal object can be understood as being a rough analog to a membership relation. We’re going to use that.
We can create a category of wedges A \land B, where there is an arrow m from X to Y when the diagram below commutes in our original category:

The ideal product: the terminal in the category of wedges.

In the category of wedges, what that means is that Y is at least as strict a wedge as X; X has some amount of noise in it (noise in the sense of the C element of the triple from the example above), and Y cannot have any more noise than that. The true categorical product will be the wedge with no excess noise: a wedge which has an arrow from every other wedge in the category of wedges.
What’s an object with an arrow from every other object? It’s the terminal object. The categorical product is the terminal wedge: the unique (up to isomorphism) object which is stricter than any other wedge.
Another way of saying that, using categorical terminology, is that there is a universal property of products: products have left and right projections. The categorical product is the exemplar of that property: it is the unique object which has exactly the property that we’re looking at, without any extraneous noise. Any property that this universal object has will be shared by every other product-like object.

This diagram should look familiar: it’s the same thing as the diagram for defining arrows in the category of wedges. It’s the universal diagram: you can substitute any wedge in for C, along with its projection arrows (f, g).

The categorical product.

We can pull that definition back to our original category, and define the product without the category of wedges. So given two objects, A and B, in a category, the categorical product is defined as an object which we’ll call A \times B along with two projection arrows, one from A \times B to A and one from A \times B to B, which have the property that for any object C which has arrows f: C \rightarrow A and g:C \rightarrow B, there is a unique arrow (f,g):C \rightarrow (A\times B) for which the diagram to the right commutes.
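Here is how that definition looks as a Scala sketch, assuming Scala tuples play the role of A \times B. The names left, right, and pair are hypothetical, as is the example wedge built on String. The check confirms that projecting after the mediating arrow gives back the wedge's own projections, on a small sample.

```scala
object ProductSketch {
  // The two projections out of the product (A, B).
  def left[A, B](p: (A, B)): A = p._1
  def right[A, B](p: (A, B)): B = p._2

  // The mediating arrow (f, g): C => (A, B) built from a wedge's projections.
  def pair[C, A, B](f: C => A, g: C => B): C => (A, B) = c => (f(c), g(c))

  // A hypothetical wedge: C = String, with its own arrows to Int and Boolean.
  val f: String => Int = _.length
  val g: String => Boolean = _.isEmpty
  val m: String => (Int, Boolean) = pair(f, g)

  val words = List("", "a", "wedge")

  // The diagram commutes: left after m is f, and right after m is g.
  val commutes: Boolean =
    words.forall(w => left(m(w)) == f(w) && right(m(w)) == g(w))
}
```

Uniqueness holds because any arrow that commutes with both projections is forced to send c to the pair (f(c), g(c)), which is exactly what pair does.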

On its own, if we’re looking specifically at sets, this is just a complicated way of defining the cartesian product of two sets. It doesn’t really tell us much of anything new. What makes this interesting is that it isn’t limited to the cartesian product of two sets: it’s a general concept that takes what we understand about simple sets, and expands it to define a product of any two things in categories. The set-theoretic version only works for sets: this one works for numbers, posets, topologies, or anything else.

In terms of programming, products are pretty familiar to people who write functional programs: a product is a tuple. And the definition of a tuple in a functional language is pretty much exactly what we just described as the categorical product, tweaked to make it slightly easier to use.

For example, let’s look at the product type in Scala.

trait Product extends Any with Equals {
  def productElement(n: Int): Any
  def productArity: Int
  // ... remaining members omitted
}

The product object intrinsically wraps projections into a single function which takes a parameter and returns the result of applying the projection. It could have been implemented more categorically as:

trait CatProduct extends Any with Equals {
  def projection(n: Int): () => Any
}

Implemented the latter way, to extract an element from a product, you’d have to write prod.projection(i)() which is more cumbersome, but does the same thing.
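Here’s a small sketch of the two styles side by side. Scala tuples implement the standard Product trait, so productArity and productElement are real library calls; the projection helper is hypothetical, and only mimics the CatProduct idea on top of an ordinary Product.

```scala
object ProductDemo {
  // Every Scala tuple (and case class) implements the standard Product trait.
  val pair: Product = (42, "answer")

  // Hypothetical helper in the CatProduct style: the projection comes back as
  // a zero-argument function, which the caller then has to apply.
  def projection(p: Product, n: Int): () => Any =
    () => p.productElement(n)
}
```

With this, ProductDemo.pair.productElement(1) and ProductDemo.projection(ProductDemo.pair, 1)() produce the same value; the second just makes the projection arrow explicit.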

More, if you look at this, and think of how you’d use the product trait, you can see how it relates to the idea of terminal objects. There are many different concrete types that you could use to implement this trait. All of them define more information about the type. But every implementation that includes the concept of product can implement the Product trait. This is exactly the relationship we discussed when we used terminal objects to derive the ideal product: there are many abstractions that include the concept of the product; the categorical product is the one that abstracts over all of them.

The categorical product, as an abstraction, may not seem terribly profound. But as we’ll see in the next post, in category theory, we can compose abstractions – and by using composition, we’ll be able to define an abstraction of exponentiation, which generalizes the programming language concept of currying.

The Math of Vaccinations, Infection Rates, and Herd Immunity

Here in the US, we are, horribly, in the middle of a measles outbreak. And, as usual, anti-vaccine people are arguing that:

  • Measles isn’t really that serious;
  • Unvaccinated children have nothing to do with the outbreak; and
  • More vaccinated people are being infected than unvaccinated, which shows that vaccines don’t help.

A few years back, I wrote a post about the math of vaccines; it seems like this is a good time to update it.

When it comes to vaccines, there are two things that a lot of people don’t understand. One is herd immunity; the other is probability of infection.

Herd immunity is the fundamental concept behind vaccines.

In an ideal world, a person who’s been vaccinated against a disease would have no chance of catching it. But the real world isn’t ideal, and vaccines aren’t perfect. What a vaccine does is prime the recipient’s immune system in a way that reduces the probability that they’ll be infected.

But even if a vaccine for an illness were perfect, and everyone was vaccinated, that wouldn’t mean that it was impossible for anyone to catch the illness. There are many people whose immune systems are compromised – people with diseases like AIDS, or people with cancer receiving chemotherapy. (Or people who’ve had the measles within the previous two years!) And that’s not considering the fact that there are people who, for legitimate medical reasons, cannot be vaccinated!

So individual immunity, provided by vaccines, isn’t enough to completely eliminate the spread of a contagious illness. To prevent outbreaks, we rely on an emergent property of a vaccinated population. If enough people are immune to the disease, then even if one person gets infected with it, the disease won’t be able to spread enough to produce a significant outbreak.

We can demonstrate this with some relatively simple math.

Let’s imagine a case of an infectious disease. For illustration purposes, we’ll simplify things in a way that makes the outbreak more likely to spread than in reality. (So this makes herd immunity harder to attain than in reality.)

  • There’s a vaccine that’s 95% effective: out of every 100 people vaccinated against the disease, 95 are perfectly immune; the remaining 5 have no immunity at all.
  • The disease is highly contagious: out of every 100 non-immune people who are exposed to the disease, 95 will be infected.

If everyone is immunized, but one person becomes ill with the disease, how many people do they need to expose to the disease for the disease to spread?

Keeping things simple: an outbreak, by definition, is a situation where the number of infected people is steadily increasing. That can only happen if every sick person, on average, infects more than 1 other person with the illness. If that happens, then the rate of infection can grow exponentially, turning into an outbreak.

In our scheme here, only one out of 20 people is infectable – so, on average, if our infected person has enough contact with 20 people to pass an infection, then there’s a 95% chance that they’d pass the infection on to one other person. (19 of 20 are immune; the one remaining person has a 95% chance of getting infected). To get to an outbreak level – that is, a level where they’re probably going to infect more than one other person – they’d need to expose something around 25 people (which would mean that each infected person, on average, could infect roughly 1.2 people). If they’re exposed to 20 other people on average, then on average, each infected person will infect roughly 0.9 other people – so the number of infected will decrease without turning into a significant outbreak.

But what will happen if just 5% of the population doesn’t get vaccinated? Then we’ve got 95% of the population getting vaccinated, with a 95% immunity rate – so roughly 90% of the population has vaccine immunity. Our pool of non-immune people has doubled. In our example scenario, if each person is exposed to 20 other people during their illness, then they will, on average, cause 1.8 people to get sick. And so we have a major outbreak on our hands!
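The arithmetic in these scenarios is easy to check with a few lines of code. This is just a sketch of the simplified model above; the function and parameter names are my own, and the model assumes that contacts are drawn at random from the whole population.

```scala
object HerdImmunity {
  // Expected number of new infections caused by one sick person.
  //   contacts    - number of people the sick person exposes
  //   vaccinated  - fraction of the population that is vaccinated
  //   efficacy    - fraction of vaccinated people who end up fully immune
  //   infectivity - chance that an exposed, non-immune person gets sick
  def expectedInfections(contacts: Double, vaccinated: Double,
                         efficacy: Double, infectivity: Double): Double = {
    val susceptible = 1.0 - vaccinated * efficacy
    contacts * susceptible * infectivity
  }

  // Everyone vaccinated, 20 contacts: about 0.95, so the infection fizzles.
  val everyoneVaccinated = expectedInfections(20, 1.00, 0.95, 0.95)
  // Everyone vaccinated, 25 contacts: about 1.19, enough to grow.
  val widerContact = expectedInfections(25, 1.00, 0.95, 0.95)
  // 5% unvaccinated, 20 contacts: about 1.85, a real outbreak.
  val fivePercentOptOut = expectedInfections(20, 0.95, 0.95, 0.95)
}
```

The only thing that changed between the first and the last scenario is the vaccination rate, and that alone roughly doubles the number of people each sick person infects.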

This illustrates the basic idea behind herd immunity. If you can successfully make a large enough portion of the population non-infectable by a disease, then the disease can’t spread through the population, even though the population contains a large number of infectable people. When the population’s immunity rate (either through vaccine, or through prior infection) gets to be high enough that an infection can no longer spread, the population is said to have herd immunity: even individuals who can’t be immunized no longer need to worry about catching it, because the population doesn’t have the capacity to spread it around in a major outbreak.
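We can turn that into a rough formula for where herd immunity kicks in. The symbols here are my own shorthand for the simplified model, not standard epidemiological notation: c is the number of people each sick person exposes, v is the fraction of the population that’s vaccinated, e is the vaccine effectiveness, and p is the probability that an exposed non-immune person gets infected.

```latex
c \,(1 - v e)\, p < 1
\quad\Longleftrightarrow\quad
v > \frac{1}{e}\left(1 - \frac{1}{c\,p}\right)
```

With the numbers from our example (c = 20, e = 0.95, p = 0.95), the right hand side comes out to about 0.997. In this deliberately pessimistic model, just about everyone needs to be vaccinated to keep the disease from spreading.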

(In reality, the effectiveness of the measles vaccine really is in the 95 percent range – actually slightly higher than that; various sources estimate it somewhere between 95 and 97 percent effective! And the success rate of the vaccine isn’t binary: 95% of people will be fully immune; the remaining 5% will have a varying degree of immunity. And the infectivity of most diseases is lower than the example above. Measles (which is a highly, highly contagious disease, far more contagious than most!) is estimated to infect between 80 and 90 percent of exposed non-immune people. So if enough people are immunized, herd immunity will take hold even if more than 20 people are exposed by every sick person.)

Moving past herd immunity to my second point: there’s a paradox that some antivaccine people (including, recently, Sharyl Attkisson) use in their arguments. If you look at an outbreak of an illness that we vaccinate for, you’ll frequently find that more vaccinated people become ill than unvaccinated. And that, the antivaccine people say, shows that the vaccines don’t work, and the outbreak can’t be the fault of the unvaccinated folks.

Let’s look at the math to see the problem with that.

Let’s use the same numbers as above: 95% vaccine effectiveness, 95% contagion. In addition, let’s say that 2% of people choose to go unvaccinated.

That means that 98% of the population has been immunized, and 95% of them are immune. So now about 93% of the population has immunity.

If each infected person has contact with 20 other people, then we can expect about 7% of those 20 to be infectable – or 1.4; and of those, 95% will become ill – or about 1.3. So on average, each sick person will infect about 1 1/3 other people. That’s enough to cause a significant outbreak. Without the non-immunized people, the infection rate is less than 1 – not enough to cause an outbreak.

The non-immunized population reduced the herd immunity enough to cause an outbreak.

Within the population, how many immunized versus non-immunized people will get sick?

Out of every 100 people, there are about 5 who got vaccinated, but aren’t immune. Out of that same 100 people, there are 2 (2% of 100) that didn’t get vaccinated. If every non-immune person is equally likely to become ill, then we’d expect that out of every 100 cases of the disease, about 70 would be vaccinated people, and about 30 unvaccinated.

The vaccinated population is much, much larger – nearly 50 times larger! – than the unvaccinated.
Since that population is so much larger, we’d expect more vaccinated people to become ill, even though it’s the smaller unvaccinated group that broke the herd immunity!

The easiest way to see that is to take those numbers, and normalize them into probabilities – that is, figure out, within the pool of all vaccinated people, what their likelihood of getting ill after exposure is, and compare that to the likelihood of a non-vaccinated person becoming ill after exposure.

So, let’s start with the vaccinated people. Let’s say that we’re looking at a population of 10,000 people total. 98% were vaccinated; 2% were not.

  • The total pool of vaccinated people is 9800, and the total pool of unvaccinated is 200.
  • Of the 9800 who were vaccinated, 95% of them are immune, leaving 5% who are not – so 490 infectable people.
  • Of the 200 people who weren’t vaccinated, all of them are infectable.
  • If everyone is exposed to the illness, then we would expect about 466 of the vaccinated, and 190 of the unvaccinated to become ill.

So more than twice the number of vaccinated people became ill. But:

  • The odds of a vaccinated person becoming ill are 466/9800, or about 1 out of every 21
  • The odds of an unvaccinated person becoming ill are 190/200 or 19 out of every 20 people! (Note: there was originally a typo in this line, which was corrected after it was pointed out in the comments.)
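If you’d like to check those numbers yourself, here’s a short sketch that reproduces them. The object and value names are mine; the inputs are exactly the ones from the example (10,000 people, 98% vaccinated, 95% effectiveness, 95% contagion).

```scala
object OutbreakCounts {
  val population   = 10000
  val vaccinated   = 9800                      // 98% of the population
  val unvaccinated = population - vaccinated   // 200 people
  val efficacy     = 0.95
  val infectivity  = 0.95

  // Vaccinated people who are still infectable (the 5% vaccine failures).
  val infectableVaccinated = vaccinated * (1 - efficacy)      // about 490

  // Expected number of sick people in each group if everyone is exposed.
  val sickVaccinated   = infectableVaccinated * infectivity   // about 465.5
  val sickUnvaccinated = unvaccinated * infectivity           // about 190

  // Per-person risk within each group.
  val riskVaccinated   = sickVaccinated / vaccinated          // about 0.0475
  val riskUnvaccinated = sickUnvaccinated / unvaccinated      // about 0.95
}
```

The raw counts favor the unvaccinated group, but the per-person risk tells the real story: in this model an unvaccinated person is about 20 times more likely to get sick than a vaccinated one.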

The numbers can, if you look at them without considering the context, be deceiving. The population of vaccinated people is so much larger than the population of unvaccinated that the total number of infected can give the wrong impression. But the facts are very clear: vaccination drastically reduces an individual’s chance of getting ill; and vaccinating the entire population dramatically reduces the chances of an outbreak.

The reality of vaccines is pretty simple.

  • Vaccines are highly effective.
  • The diseases that vaccines prevent are not benign.
  • Vaccines are really, really safe. None of the horror stories told by anti-vaccine people have any basis in fact. Vaccines don’t damage your immune system, they don’t cause autism, and they don’t cause cancer.
  • Not vaccinating your children (or yourself!) doesn’t just put you at risk for illness; it dramatically increases the chances of other people becoming ill. Even when more vaccinated people than unvaccinated become ill, that’s largely caused by the unvaccinated population.

In short: everyone who is healthy enough to be vaccinated should get vaccinated. If you don’t, you’re a despicable free-riding asshole who’s deliberately choosing to put not just yourself but other people at risk.

Another Stab at Category Theory: Lesson one: Starting with Monoids

Introduction and Motivation

One thing that I keep bumping up against as an engineer who loves functional programming is category theory. It often seems like there are two kinds of functional programmers: people who came into functional programming via engineering, and people who came into functional programming via math. The problem is that a lot of the really interesting work in languages and libraries for functional programming is being done from the mathematical side, but for people on the engineering side, it’s impenetrable: it’s like it’s written in a whole different language, and even basic discussions about programming go off the rails, because the basic abstractions don’t make any sense if you don’t know category theory.

But how do you learn category theory? It seems impenetrable to mere humans. For example, one of the textbooks on category theory that several people told me was the most approachable starts chapter one with the line:

A group extension of an abelian group H by an abelian group G consists of a group E together with an inclusion of G \hookrightarrow E as a normal subgroup and a surjective homomorphism E \twoheadrightarrow H that displays H as the quotient group E/G.

If you’re not a professional mathematician, then that is pure gobbledygook. But that seems to be typical of how initiates of category theory talk about it. But the basic concepts, while abstract, really aren’t all that tricky. In many ways, it feels a lot like set theory: there’s a simple conceptual framework, on which you can build extremely complicated formalisms. The difference is that while many people have spent years figuring out how to make the basics of set theory accessible to lay-people, that effort hasn’t been applied to category theory.

What’s the point?

Ok, so why should you care about category theory?

Category theory is a different way of thinking, and it’s a language for talking about abstractions. The heart of engineering is abstraction. We take problems, and turn them into abstract structures. We look at the structures we create, and recognize commonalities between those structures, and then we create new abstractions based on the commonalities. The hardest part of designing a good library is identifying the right abstractions.

Category theory is a tool for talking about structures, which is particularly well suited to thinking about software. In category theory, we think in terms of arrows, where arrows are mappings between objects. We’ll see what that means in detail later, but the gist of it is that one example of arrows mapping between objects is functions mapping between data types in a computer program.

Category theory is built on thinking with arrows, and building structures using arrows. It’s about looking at mathematical constructions built with arrows, and in those structures, figuring out what the fundamental parts are. When we abstract enough, we can start to see that things that look very different are really just different realizations of the same underlying structure. Category theory gives us a language and a set of tools for doing that kind of abstraction – and then we can take the abstract structures that we identify, and turn them into code – into very generic libraries that express deep, fundamental structure.

Start with an Example: Monoids

Monoids in Code

We’ll get started by looking at a simple mathematical structure called a monoid, and how we can implement it in code; and then, we’ll move on to take an informal look at how it works in terms of categories.

Most of the categorical abstractions in Scala are implemented using something called a typeclass, so we’ll start by looking at typeclasses. Typeclasses aren’t a category theoretical notion, but they make it much, much easier to build categorical structures. And they do give us a bit of categorical flavor: a typeclass defines a kind of metatype – that is, a type of type – and, as we’ll see, that kind of self-reflective abstraction is a key part of category theory.

The easiest way to think about typeclasses is that they’re a kind of metatype – literally, as the name suggests, they define classes where the elements of those classes are types. So a typeclass provides an interface that a type must provide in order to be an instance of the metatype. Just like you can implement an interface in a type by providing implementations of its methods, you can implement a typeclass by providing implementations of its operations.

In Scala, you implement the operations of a typeclass using a language construct called an implicit parameter. The implicit parameter attaches the typeclass operations to a meta-object that can be passed around the program invisibly, providing the typeclass’s operations.

Let’s take a look at an example. An operation that comes up very frequently in any kind of data processing code is reduction: taking a collection of values of some type, and combining them into a single value. Taking the sum of a list of integers, the product of an array of floats, and the concatenation of a list of strings are all examples of reduction. Under the covers, these are all similar: they’re taking an ordered group of values, and performing an operation on them. Let’s look at a couple of examples of this:

def reduceFloats(floats: List[Float]): Float =
    floats.foldRight(0.0f)((x, y) => x + y)

def reduceStrings(strings: Seq[String]): String =
    strings.foldRight("")((x, y) => x.concat(y))

When you look at the code, they look very similar. They’re both just instantiations of the same structural pattern:

def reduceX(xes: List[X]): X =
    xes.foldRight(xIdentity)((a, b) => Xcombiner(a, b))

The types are different; the actual operation used to combine the values is different; the base value in the code is different. But they’re both built on the same pattern:

  • There’s a type of values we want to combine: Float or String. Everything we care about in reduction is connected with this type.
  • There’s a collection of values that we want to combine, from left to right. In one case, that’s a List[Float], and in the other, it’s a Seq[String]. The type doesn’t matter, as long as we can iterate over it.
  • There’s an identity value that we can use as a starting point for building the result; 0 for the floats, and "" (the empty string) for the strings.
  • There’s an operation to combine two values: + for the floats, and concat for the strings.

We can capture that concept by writing an interface (a trait, in Scala terms) that captures it; that interface is called a typeclass. It happens that this concept of reducible values is called a monoid in abstract algebra, so that’s the name we’ll use.

trait Monoid[A]  {
    def empty: A
    def combine(x: A, y: A): A
}

We can read that as saying “A is a monoid if there are implementations of empty and combine that meet these constraints”. Given the declaration of the typeclass, we can implement it as an object which provides those operations for a particular type:

object FloatAdditionMonoid extends Monoid[Float] {
    def empty: Float = 0.0f
    def combine(x: Float, y: Float): Float = x + y
}

object StringConcatMonoid extends Monoid[String] {
    def empty: String = ""
    def combine(x: String, y: String): String = x.concat(y)
}

FloatAdditionMonoid implements the typeclass Monoid for the type Float. And since we can write an implementation of Monoid for Float or String, we can say that the types Float and String are instances of the typeclass Monoid.

Using our implementation of Monoid, we can write a single, generic reduction operator now:

def reduce[A](values: Seq[A], monoid: Monoid[A]): A =
    values.foldRight(monoid.empty)(monoid.combine)

We can use that to reduce a list of floats:

reduce(List(1.0f, 3.14f, 2.718f, 1.414f, 1.732f), FloatAdditionMonoid)

And we can do a bit better than that! We can set up an implicit, so that we don’t need to pass the monoid implementation around. In Scala, an implicit is a value that the compiler can pass for you. If a function takes an implicit parameter of some type and the caller doesn’t supply it explicitly, the compiler looks for an implicit value of that type that is in scope at the call site (or in the companion objects of the types involved), and inserts it automatically. The lookup happens at compile time, and it has to be unambiguous: if two equally eligible implicit values of the same type are in scope, the call won’t compile.

def reduce[A](values: Seq[A])(implicit A: Monoid[A]): A =
    values.foldRight(A.empty)(A.combine)

And as long as there’s a definition of the Monoid for a type A in scope, we can use that now by just writing:

implicit object FloatAdditionMonoid extends Monoid[Float] {
    def empty: Float = 0.0f
    def combine(x: Float, y: Float): Float = x + y
}

val floats: List[Float] = ...
val result = reduce(floats)

Now, anywhere that the FloatAdditionMonoid declaration is imported, you can call reduce on any sequence of floats, and the implicit value will automatically be inserted.
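Putting the pieces together, here’s a self-contained sketch that you can paste into a file and run. Everything is wrapped in an object named MonoidDemo, which is my own choice; the body of reduce is also an assumption, since all the definition really requires is some fold that starts from the identity value.

```scala
object MonoidDemo {
  trait Monoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  implicit object FloatAdditionMonoid extends Monoid[Float] {
    def empty: Float = 0.0f
    def combine(x: Float, y: Float): Float = x + y
  }

  implicit object StringConcatMonoid extends Monoid[String] {
    def empty: String = ""
    def combine(x: String, y: String): String = x.concat(y)
  }

  // Assumed implementation: a right fold that starts from the identity value.
  def reduce[A](values: Seq[A])(implicit m: Monoid[A]): A =
    values.foldRight(m.empty)(m.combine)

  // The compiler picks the monoid instance from the element type.
  val total: Float = reduce(List(1.0f, 2.5f, 4.0f))
  val joined: String = reduce(Seq("cat", "e", "gory"))
}
```

The same reduce function handles both calls: the element type of the sequence is what selects FloatAdditionMonoid in one case and StringConcatMonoid in the other.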

Using this idea of a monoid, we’ve captured the concept of reduction in a common abstraction. Our notion of reduction doesn’t care about whether we’re reducing strings by concatenation, integers by addition, floats by multiplication, sets by union. Those are all valid uses of the concept of a monoid, and they’re all easy to implement using the monoid typeclass. The concept of monoid isn’t a difficult one, but at the same time, it’s not necessarily something that most of us would have thought about as an abstraction.

We’ve got a typeclass for a monoid; now, we’ll try to connect it into category theory. It’s a bit tricky, so we won’t cover it all at once. We’ll look at it a little bit now, and we’ll come back to it in a later lesson, after we’ve absorbed a bit more.

From Sets to Arrows

For most of us, if we’ve heard of monoids, we’ve heard of them in terms of set theory and abstract algebra. So in that domain, what’s a monoid?

A monoid is a triple (V, 1, *), where:

  • V is a set of values;
  • 1 is a value in V;
  • * is a total binary operator where:
    • 1 is an identity of *: For any value v \in V: v*1 = 1*v = v.
    • * is associative: for any values v, w, x \in V: (v * w) * x = v * (w * x)

That’s all just a formal way of saying that a monoid is a set with a binary associative operator and an identity value. The set of integers can form a monoid with addition as the operator, and 0 as identity. Real numbers can be a monoid with multiplication and 1. Strings can be a monoid with concatenation as the operator, and empty string as identity.
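The typeclass can’t enforce the identity and associativity conditions, but we can at least spot-check them. Here’s a rough sketch of a checker that tests both laws on a handful of sample values. The names holdsOn, intAddition, and intSubtraction are made up for this example, and passing the check is only evidence that the laws hold, not a proof.

```scala
object MonoidLaws {
  trait Monoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  // Checks the identity and associativity laws on a finite set of samples.
  def holdsOn[A](m: Monoid[A], samples: Seq[A]): Boolean = {
    val identityOk = samples.forall { v =>
      m.combine(v, m.empty) == v && m.combine(m.empty, v) == v
    }
    val associativeOk = samples.forall { v =>
      samples.forall { w =>
        samples.forall { x =>
          m.combine(m.combine(v, w), x) == m.combine(v, m.combine(w, x))
        }
      }
    }
    identityOk && associativeOk
  }

  // Integers with addition and 0: a genuine monoid.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    def empty: Int = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  // Integers with subtraction: fits the interface, but breaks the laws.
  val intSubtraction: Monoid[Int] = new Monoid[Int] {
    def empty: Int = 0
    def combine(x: Int, y: Int): Int = x - y
  }
}
```

Subtraction is a nice counterexample: it typechecks as a Monoid[Int], but 0 - v isn’t v, and (v - w) - x isn’t v - (w - x), so it fails the check.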

But we can look at it in a different way, too, by thinking entirely in terms of functions.
Let’s forget about the numbers as individual values, and instead, let’s think about them in functional terms. Every number is a function which adds itself to its parameter. So “2” isn’t a number, it’s a function which adds two to anything.

How can we tell that 2 is a function which adds two to things?

If we compose it with 3 (the function that adds three to things), we get 5 (the function that adds five to things). And how do we know that? Because it’s the same thing that we get if we compose 3 with 1, and then compose the result of that with 1 again. 3+1+1=5, and 3+2=5. We can also tell that it’s 2, because if we just take 1, and compose it with itself, what we’ll get back is the object that we call 2.

In this scheme, all of the numbers are related not by arithmetic, not by an underlying concept of quantity or cardinality or ordinality, but only by how they compose with each other. We can’t see anything else – all we have are these functions. But we can recognize that they are the natural numbers that we’re familiar with.

Looking at it this way, we can think of the world of natural numbers as a single point, which represents the set of all natural numbers. And around that point, we’ve got lots and lots of arrows, each of which goes from that point back to itself. Each of those arrows represents one number. The way we tell them apart is by understanding which arrow we get back when we compose them. Take any arrow from that point back to that point, and compose it with the arrow 0, and what do you get? The arrow you started with. Take any arrow that you want, and compose it with 2. What do you get? You get the same thing that you’d get if you composed it with 1, and then composed it with one again.

That dot, with those arrows, is a category.
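If that feels too abstract, here’s a small sketch of the same idea in code. Each number becomes a function from Int to Int, and composition of those functions plays the role of addition. The helper names arrow and sameOn are made up for this illustration; and since we can’t compare functions directly, sameOn only checks that two arrows agree on some sample inputs.

```scala
object NumbersAsArrows {
  // An arrow on the single object: the function that adds n to its argument.
  def arrow(n: Int): Int => Int = x => x + n

  val zero  = arrow(0)
  val one   = arrow(1)
  val two   = arrow(2)
  val three = arrow(3)

  // Composing arrows plays the role of addition.
  val five: Int => Int = two compose three
  val alsoFive: Int => Int = three andThen one andThen one

  // We can only tell arrows apart by observing how they behave.
  def sameOn(f: Int => Int, g: Int => Int, inputs: Seq[Int]): Boolean =
    inputs.forall(x => f(x) == g(x))
}
```

Composing any arrow with zero gives you back the arrow you started with, and composing one with itself gives you something indistinguishable from two, which is exactly the behavior described above.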

What kind of advantage do we get in going from the algebraic notion of a set with a binary operation, to the categorical notion of an object with a bunch of composable arrows? It allows us to understand a monoid purely as a structure, without having to think about what the objects are, or what the operator means.

Now, let’s jump back to our monoid typeclass for a moment.

trait Monoid[A]  {
    def empty: A
    def combine(x: A, y: A): A
}

We can understand this as being a programmable interface for the categorical object that we just described. All we need to do is read “:” as “is an arrow in”: It says that A is a monoid if:

  • It has an element called empty which is an arrow in A.
  • It has an operation called combine which, given any two arrows in A, composes them into a new arrow in A.

There are, of course, other conditions – combine needs to be associative, and empty needs to behave as the identity value. But just like when we write an interface for, say, a binary search tree, the interface only defines the structure not the ordering condition, the typeclass defines the functional structure of the categorical object, not the logical conditions.

This is what categories are really all about: tearing things down to a simple core, where everything is expressed in terms of arrows. It’s almost reasoning in functions, except that it’s even more abstract than that: the arrows don’t need to be functions – they just need to be composable mappings from things to things.
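One way to see the connection in code is to write a monoid instance whose values really are arrows. In this sketch (the names intArrows, addTwo, and triple are mine), the values are functions from Int to Int, the identity function is empty, and function composition is combine.

```scala
object ArrowMonoid {
  trait Monoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  // Arrows from Int back to Int form a monoid: the identity arrow is the
  // empty element, and composition is the combine operation.
  val intArrows: Monoid[Int => Int] = new Monoid[Int => Int] {
    def empty: Int => Int = x => x
    def combine(f: Int => Int, g: Int => Int): Int => Int = f andThen g
  }

  val addTwo: Int => Int = _ + 2
  val triple: Int => Int = _ * 3

  // First add two, then triple the result.
  val addThenTriple: Int => Int = intArrows.combine(addTwo, triple)
}
```

Notice that these arrows aren’t all of the form “add n”: triple is just as good an arrow as addTwo. The monoid structure only cares that the arrows compose, and that there’s an identity.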

Deeper Into Arrows

We can abstract a bit more, and look at the entire construction, including the identity and associativity constraints entirely in terms of arrows. To really understand this, we’ll need to spend some time diving deeper into the actual theory of categories, but as a preview, we can describe a monoid with the following pair of diagrams (copied from wikipedia):

In these diagrams, any two paths between the same start and end-nodes are equivalent (up to isomorphism). When you understand how to read these diagrams, these really do define everything that we care about for monoids.

For now, we’ll just run through and name the parts – and then later, in another lesson, we’ll come back, and we’ll look at this in more detail.

  • \mu is an arrow from M\times M \rightarrow M, which we’ll call a multiplication operator.
  • \eta is an arrow from I \rightarrow M, called unit.
  • \alpha is an arrow from (M\times M)\times M \rightarrow M \times (M\times M) which represents the associativity property of the monoid.
  • \lambda is a morphism which represents the left identity property of the monoid (that is, 1*x=x), and \rho is a morphism representing the right identity property (x*1=x).

This diagram, using these arrows, is a way of representing all of the key properties of a monoid via nothing but arrows and composition. It says, among other things, that:

  • (M \times M) \times M composes with multiplication to be M \times M.
    That is, applying multiplication to (M \times M) \times M evaluates to (M \times M).
  • (M \times M) \times M composed with associativity can become M \times (M \times M).
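Written out as equations instead of pictures, the two diagrams say roughly the following. I’m assuming the usual typing of the identity arrows here, \lambda: I \times M \rightarrow M and \rho: M \times I \rightarrow M, and writing 1_M for the identity arrow on M; neither of those is spelled out in the list of parts above.

```latex
\begin{aligned}
\mu \circ (\mu \times 1_M) &= \mu \circ (1_M \times \mu) \circ \alpha
  && \text{(associativity)} \\
\mu \circ (\eta \times 1_M) &= \lambda
  && \text{(left identity)} \\
\mu \circ (1_M \times \eta) &= \rho
  && \text{(right identity)}
\end{aligned}
```

The first equation is the arrow version of (v * w) * x = v * (w * x); the other two are the arrow versions of 1*x = x and x*1 = x.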

So it’s a monoid – but it’s a higher level monoid. In this, M isn’t just an object in a category: it’s an entire category. These arrows are arrows between categories in a category of categories.

What we’ll see when we get deeper into category theory is how powerful this kind of abstraction can get. We’ll often see a sequence of abstractions, where we start with a simple concept (like monoid), and find a way to express it in terms of arrows between objects in a category. But then, we’ll lift it up, and look at how we can see it not just as a relation between objects in a category, but as a different kind of relation between categories, by constructing the same thing using a category of categories. And then we’ll abstract even further, and construct the same thing using mappings between categories of categories.
