Monthly Archives: October 2013

Basic Data Structures: Hash Tables

I’m in the mood for a couple of basics posts. As long-time readers might know, I love writing about data structures.

One of the most important and fundamental structures is a hashtable. In fact, in a lot of modern programming languages have left hashtables behind, for reasons I’ll discuss later. But if you want to understand data structures and algorithmic complexity, hashtables are one of the essentials.

A hashtable a structure for keeping a list of (key, value) pairs, where you can look up a value using the key that’s associated with it. This kind of structure is frequently called either a map, an associative array, or a dictionary.

For an example, think of a phonebook. You’ve got a collection of pairs (name, phone-number) that make up the phonebook. When you use the phonebook, what you do is look for a person’s name, and then use it to get their phone number.

A hashtable is one specific kind of structure that does this. I like to describe data structures in terms of some sort of schema: what are the basic operations that the structure supports, and what performance characteristics does it have for those operations.

In those schematic terms, a hashtable is very simple. It’s a structure that maintains a mapping from keys to values. A hashtable really only needs two operations: put and get:

  1. put(key, value): add a mapping from key to value to the table. If there’s already a mapping for the key, then replace it.
  2. get(key): get the value associated with the key.

In a hashtable, both of those operations are extremely fast.

Let’s think for a moment about the basic idea of a key-value map, and what kind of performance we could get out of a cople of simple naive ways of implementing it.

We’ve got a list of names and phone numbers. We want to know how long it’ll take to find a particular name. How quickly can we do it?

How long does that take, naively? It depends on how many keys and values there are, and what properties the keys have that we can take advantage of.

In the worst case, there’s nothing to help us: the only thing we can do is take the key we’re looking for, and compare it to every single key. If we have 10 keys, then on average, we’ll need to do an average of about 5 steps before we find the key we’re looking for. If there are 100 keys, then it’ll take, on average, about 50 steps. If there are one million keys, then it’ll take an average of half a million steps before we can find the value.

If the keys are ordered – that is, if we can compare one key to another not just for equality, but we can ask which came first using a “less than or equal to” operator, then we can use binary search. With binary search, we can find an entry in a list of 10 elements in 4 steps. We can find an entry in a list of 1000 keys in 10 steps, or one in a list of one million keys in 20 steps.

With a hashtable, if things work right, in a table of 10 keys, it takes one step to find the key. 100 keys? 1 step. 1000 keys? 1 step. 1,000,000,000 keys? Still one step. That’s the point of a hashtable. It might be really hard to do something like general a list of all of the keys – but if all you want to do is look things up, it’s lightning.

How can it do that? It’s a fairly simple trick: the hashtable trades space for time. A hashtable, under normal circumstances, uses a lot more space than most other ways of building a dictionary. It makes itself fast by using extra space in a clever way.

A hashtable creates a bunch of containers for (key, value) pairs called buckets. It creates many more buckets than the number of (key, value) pairs than it expects to store. When you want to insert a value into the table, it uses a special kind of function called a hash function on the key to decide which bucket to put the (key, value) into. When you want to look for the value associated with a key, it again uses the hash function on the key to find out which bucket to look in.

It’s easiest to understand by looking at some actual code. Here’s a simple, not at all realistic implementation of a hashtable in Python:

  class Hashtable(object):
    def __init__(self, hashfun, size):
      self._size = size
      self._hashfun = hashfun
      self._table = [[] for i in range(size)]

    def hash(self, key):
      return self._hashfun(key) % self._size

    def get(self, key, value):
      self._table[self.hash(key)].append((key, value))

    def get(self, key):
      for k,v in self._table[self.hash(key)]:
        if k == key:
          return v
      return None

If you’ve got a good hash function, and your hashtable is big enough, then each bucket will end up with no more than one value in it. So if you need to insert a value, you find an (empty) bucket using its hashcode, and dump it in: one step. If you need to find a value given its key, find the bucket using its hashcode, and return the value.

There are two big problems with hashtables.

First, everything is dependent on the quality of your hash function. If you hash function maps a lot of values to the same bucket, then your performance is going to suck. In fact, in the worst case, it’s no better than just searching a randomly ordered list. Most of the time, you can come up with a hash function that does pretty good – but it’s a surprisingly tricky thing to get right.

Second, the table really needs to be big relative to the number of elements that you expect to have in the list. If you set up a hashtable with 40 buckets, and you end up with 80 values stored in it, your performance isn’t going to be very good. (In fact, it’ll be slightly worse that just using a binary search tree.)

So what makes a good hash function? There are a bunch of things to consider:

  1. The hash function must be deterministic: calling the hash on the same key value must always produce the same result. If you’re writing a python program like the one I used as an example above, and you use the value of the key objects fields to compute the hash, then changing the key objects fields will change the hashcode!
  2. The hash function needs to focus on the parts of the key that distinguish between different keys, not on their similarities. To give a simple example, in some versions of Java, the default hash function for objects is based on the address of the object in memory. All objects are stored in locations whose address is divisible by 4 – so the last two bits are always zero. If you did something simple like just take the address modulo the table size, then all of the buckets whose numbers weren’t divisible by four would always be empty. That would be bad.
  3. The hash function needs to be uniform. That means that it needs to map roughly the same number of input values to each possible output value. To give you a sense of how important this is: I ran a test using 3125 randomly generated strings, using one really stupid hash function (xoring together the characters), and one really good one (djb2). I set up a small table, with 31 buckets, and inserted all of the value into it. With the xor hash function, there were several empty buckets, and the worst bucket had 625 values in it. With djb2, there were no empty buckets; the smallest bucket had 98 values, and the biggest one had 106.

So what’s a good hash function look like? Djb2, which I used in my test above, is based on integer arithmetic using the string values. It’s an interesting case, because no one is really entirely sure of exactly why it works better than similar functions, but be that as it may, we know that in practice, it works really well. It was invented by a guy named Dan Bernstein, who used to be a genius poster in comp.lang.c, when that was a big deal. Here’s djb2 in Python:

def djb2(key):
  hash = 5381
  for c in key:
    hash = (hash * 33) + ord(c)
  return hash

What the heck is it doing? Why 5381? Because it’s prime, and it works pretty well. Why 33? No clue.

Towards the beginning of this post, I alluded to the fact that hashtables have, at least to some degree, fallen out of vogue. (For example, in the Go language standard library, the map type is implemented using a red-black tree.) Why?

In practice, it’s rarely any faster to really use a hashtable than to use a balanced binary tree like a red-black tree. Balanced trees have better worst-case bounds, and they’re not as sensitive to the properties of the hash function. And they make it really easy to iterate over all of the keys in a collection in a predictable order, which makes them great for debugging purposes.

Of course, hash tables still get used, constantly. The most commonly used data structures in Java code include, without a doubt, the HashMap and HashSet, which are both built on hashtables. They’re used constantly. You usually don’t have to implement them yourself; and usually system libraries provide a good default hash function for strings, so you’re usually safe.

There’s also a lot of really fascinating research into designing ideal hash functions for various applications. If you know what your data will look like in advance, you can even build something called a perfect hash function, which guarantees no collisions. But that’s a subject for another time.

A Note to the Trolls Re: Comment Policies

Since yesterday’s post, I’ve been deluged with trolls who want to post comments about their views of sexual harassment. I’ve been deleting them as they come in, and that has, in turn, led to lots of complaints about how horrible unfair and mean I am.

I’ve been doing this blogging thing for a long time, and I’ve watched as a number of sites that I used to really enjoy have wound up becoming worthless, due to comment trolls. There are tons of trolls out there, and they’re more than happy to devote a lot of time and energy to trolling and derailing. When I started my blog, I had a very open commenting policy: I rarely if ever deleted comments, and only did so when they were explicitly abusive towards other commenters. Since then, I’ve learned that in the current internet culture, that doesn’t work. The only way to maintain a halfway decent comment forum is to moderate aggressively. So I’ve become much more aggressive about removing the stuff that I believe to be trolling.

Here’s the problem: Trolls aren’t interested in real discussions. They’re interested in derailing discussions that they don’t like. I’m not interested in hosting flame wars, misogynistic rants, or other forms of trolling. In case you haven’t noticed, this is my blog. I’ll do what I feel is appropriate to maintain a non-abusive, non-troll-infested comment section. I am under no obligation to post your rants, and I am under no obligation to provide you with a list of bullet points of what my exact standards are. If I judge a comment to be inappropriate, I’ll delete it. If don’t like that, you’re welcome to find another forum, or create your own. It’s a big internet out there: there’s bound to be a place where your arguments are welcome. But that’s not this place. If I’m over-aggressive in my moderation, the only one who’ll be hurt by that will be me, because I will have wrecked the comment forum on my blog. That’s a risk I’m prepared to take.

Let me add one additional comment about the particular trolls who’ve been coming to visit lately: I’ve learned, over time, a thing or two about the demographics of the people who visit this blog. As much as I’d prefer it to be otherwise, the frequent commenters on this blog are overwhelmingly male – over the history of the blog, of commenters where gender can be identified, the comments are over 90% male. Similarly, in my career as an engineer, the population of my coworkers has been very, very skewed: the engineering population at my workplaces has varied, but I’ve never worked anywhere where the population of engineers and engineering managers was less than 80% male.

But according to my recent trollish commenters, I’m supposed to believe that suddenly that population has changed, dramatically. Suddenly, every single comment is being posted by a woman who has never seen any male-on-female sexual harassment, but who was a personal witness of multiple female engineering managers who sexually harassed their male employees without any repercussions. It’s particularly amusing, because those rants about the evil sexually-harassing female managers are frequently followed by rants about how the problem is the difference in sexual drive between men and women. Funny how women just aren’t as sexually motivated as man, and that’s the cause of the problem, but there are all of these evil female managers sexually harassing their employees despite their inferior female sexual drive, isn’t it?

Um, guys?! You’re not fooling me. You’re not fooling anyone. I’m not obligated to provide you with a forum for your lies. So go away, find someplace else. Or feel free to keep submitting your comments, but know that they’re going to wind up in the bit-bucket.

It's easy to not harass women

For many of us in the science blogging scene, yesterday was a pretty lousy day. We learned that a guy who many of us had known for a long time, who we’d trusted, who we considered a friend, had been using his job to sexually harass women with sleezy propositions.

This led to a lot of discussion and debate in twitter. I spoke up to say that what bothered me about the whole thing was that it’s easy to not harass people.

This has led to rather a lot of hate mail. But it’s also led to some genuine questions and discussions. Since it can be hard to have detailed discussions on twitter, I thought that I’d take a moment here, expand on what I meant, and answer some of the questions.

To start: it really is extremely easy to not be a harasser. Really. The key thing to consider is: when is it appropriate to discuss sex? In general, it’s downright trivial: if you’re not in a not in private with a person with whom you’re in a sexual relationship, then don’t. But in particular, here are a couple of specific examples of this principle:

  • Is there any way in which you are part of a supervisor/supervisee or mentor/mentee relationship? Then do not discuss or engage in sexual behaviors of any kind.
  • In a social situation, are you explicitly on a date or other romantic encounter? Do both people agree that it’s a romantic thing? If not, then do not discuss or engage in sexual behaviors.
  • In a mutually understood romantic situation, has your partner expressed any discomfort? If so, then immediately stop discussing or engaging in sexual behaviors.
  • In any social situation, if a participant expresses discomfort, stop engaging in what is causing the discomfort.

Like I said: this is not hard.

To touch on specifics of various recent incidents:

  • You do not meet with someone to discuss work, and tell them about your sex drive.
  • You do not touch a students ass.
  • You do not talk to coworkers about your dick.
  • You don’t proposition your coworkers.
  • You don’t try to sneak a glance down your coworkers shirt.
  • You don’t comment on how hot your officemate looks in that sweater.
  • You do not tell your students that you thought about them while you were masturbating.

Seriously! Is any of this difficult? Should this require any explanation to anyone with two brain cells to rub together?

But, many of my correspondants asked, what about grey areas?

I don’t believe that there are significant grey areas here. If you’re not in an explicit sexual relationship with someone, then don’t talk to them about sex. In fact, if you’re in any work related situation at all, no matter who you’re with, it’s not appropriate to discuss sex.

But what about cases where you didn’t mean anything sexual, like when you complimented your coworker on her outfit, and she accused you of harassing her?

This scenario is, largely, a fraud.

Lots of people legitimately worry about it, because they’ve heard so much about this in the media, in politics, in news. The thing is, the reason that you hear all of this is because of people who are deliberately promoting it as part of a socio-political agenda. People who want to excuse or normalize this kind of behavior want to create the illusion of blurred lines.

In reality, harassers know that they’re harassing. They know that they’re making inappropriate sexual gestures. But they don’t want to pay the consequences. So they pretend that they didn’t know that what they were doing wrong. And they try to convince other folks that you’re at risk too! You don’t actually have to be doing anything wrong, and you could have your life wrecked by some crazy bitch!.

Consider for a moment, a few examples of how a scenario could play out.

Scenario one: woman officemate comes to work, dressed much fancier than usual. Male coworker says “Nice outfit, why are you all dressed up today?”. Anyone really think that this is going to get the male coworker into trouble?

Scenario two: woman worker wears a nice outfit to work. Male coworker says “Nice outfit”. Woman looks uncomfortable. Man sees this, and either apologizes, or makes note not to do this again, because it made her uncomfortable. Does anyone really honestly believe that this, occurring once, will lead to a formal accusation of harassment with consequences?

Scenario three: woman officemate comes to work dressed fancier than usual. Male coworker says nice outfit. Woman acts uncomfortable. Man keeps commenting on her clothes. Woman asks him to stop. Next day, woman comes to work, man comments that she’s not dressed so hot today. Anyone think that it’s not clear that the guy is behaving inappropriately?

Scenario four woman worker wears a nice outfit to work. Male coworker says “Nice outfit, wrowr”, makes motions like he’s pawing at her. Anyone really think that there’s anything ambiguous here, or is it clear that the guy is harassing her? And does anyone really, honestly believe that if the woman complains, this harasser will not say “But I just complimented her outfit, she’s being oversensitive!”?

Here’s the hard truths about the reality of sexual harassment:

  • Do you know a professional woman? If so, she’s been sexually harassed at one time or another. Probably way more than once.
  • The guy(s) who harassed her knew that he was harassing her.
  • The guy(s) who harassed her doesn’t think that he really did anything wrong.
  • There are a lot of people out there who believe that men are entitled to behave this way.
  • In order to avoid consequences for their behavior, many men will go to amazing lengths to deny responsibility.

The reality is: this isn’t hard. There’s nothing difficult about not harassing people. Men who harass women know that they’re harassing women. The only hard part of any of this is that the rest of us – especially the men who don’t harass women – need to acknowledge this, stop ignoring it, stop making excuses for the harassers, and stand up and speak up when we see it happening. That’s the only way that things will ever change.

We can’t make exceptions for our friends. I’m really upset about the trouble that my friend is in. I feel bad for him. I feel bad for his family. I’m sad that he’s probably going to lose his job over this. But the fact is, he did something reprehensible, and he needs to face the consequences for that. The fact that I’ve known him for a long time, liked him, considered him a friend? That just makes it more important that I be willing to stand up, and say: This was wrong. This was inexcusable. This cannot stand without consequences..