Passwords, Hashing, and Salt

Over on twitter, some folks were chatting about the latest big security botch. A major service, called Evernote, had a security breach where a password file was stolen. Evernote has handled the situation quite well, being open about what happened, and explaining the risks.

In their description of the breach, they said that the stolen passwords were “both hashed and salted”. Apparently this sounds funny to people outside of software. (Amazing how jargon becomes so ingrained that I didn’t even notice the fact that it could be interpreted in a funny way until it was pointed out to me!)

Anyway, since discussion of this is going around, I thought I’d explain just what password hashing and salting means.

Let’s start at the beginning. You’re some kind of system that wants to have password security. Obviously, you need to save the passwords somewhere, right?

As we’ll see, that’s only partially correct – but let’s go with it for now. You need to store something that lets you check if a user supplied the right password, so you need to store something.

The most naive approach is create a file (or database, or whatever) that contains the usernames and passwords of all of your users. Something like:

alice:abc
mark:pass
joe:123
jen:hello

Suppose you were a thief, and you wanted to crack this password file, what would you do? You’d try to steal that file! If you can get hold of that password file, then you’d have all of the passwords for all of the users of the system.

That means that this is a terrible way of storing the passwords. One step, and a thief has completely breached your system. We don’t want that. So what should we do?

First, we could encrypt the file.

This might seem like a good idea at first. If a thief were the steal the file, the wouldn’t be able to find your user’s passwords without figuring out the encryption key! It’s going to take a lot of work to figure out that key!

The first problem with this is that any password can be cracked given enough time and power. If there’s only one encryption key for the entire file, then it’s worth investing a lot of time and power into breaking it – and once it’s broken, then everything is revealed.

The second problem is: how does your system check a user’s password? It needs to decrypt the file! That means that the encryption key must be available to your system! So all that a thief needs to do is figure out where your system is getting the key from. You’ve got your entire security system for all of your users set up with a single point of protection, and somewhere in your system, everything that you need to break that protection is available!

What can we do to improve this? The answer is something called crypto graphic hashing.

Cryptographic hashing is a combination of two concepts: hashing, and one-way functions.

A really simple example of a not-very-good hash function of a string would be something like: convert all of the characters in the string to their numeric values, and exclusive-or the binary representation of those bits. With that hash function, you could take a string like “ABCD”, and convert it to the numeric values of the characters ([65, 66, 67, 68]), and then x-or them together (1000001 xor 1000010 xor 1000011 xor 1000100 = 0000100) for a result of 4. Real practical hash functions are more complicated.

For example, at least some versions of Java use the following as the default hash for a string of characters:

text{hash}(s) = left(sum_{i in text{length}(s)} 31^{length(s) - i - 1}*s[i] right) mod 2^{32}

There’s a class of mathematical functions called one-way functions. A one way function is a function f, where given x, it’s easy to compute f(x), but given f(x) it’s extremely difficult (or even impossible) to compute x.

If we combine those two, we have what’s called a crpytogrphic hash function: a function that takes an arbitrary input string, and converts it to a number, in a way where it’s very difficult to figure out what the input string that produced the number was. That’s great for storing passwords! We don’t store the password at all. We just store the hash-value produced from the password. Then when a user comes and logs in, we take their password, hash it, and compute the result to the stored hash.

Instead of the file with explicit passwords, we get something like:

alice:7a2d28fc
mark:dfd4e1c6
joe:ed849ee1
jen:bb76e739

This is much better than storing the encrypted password. There is no encryption key that a thief can use to decrypt the password. Even if a thief knows the hash values of your user’s passwords, they can’t get in to the system! And your system actually never stores the actual values of the user’s passwords – just their hashcodes!

So again, let’s look at this from the perspective of a thief. How can a thief break into a system with hashed passwords?

If they don’t know what hash function you’re using, then they’re completely stuck. Sadly, they can probably figure it out. Designing new crpytographic hash functions is hard. Implementing cryptographic hash functions correctly is hard. As a result, most people just use a hash function from a library. That means that for a thief, it’s usually pretty easy to figure out what hash function is being used by a system.

Once they know what hash function you used, their only choice to break your system is to try to guess the passwords. That is, they can guess passwords, compute their hash codes, and search through your password file to see if any of the users password hashes matches. If they find one, they’re gold!

In fact, there’s a common strategy based on this idea called a rainbow table. A rainbox table is a list of common passwords, and the numeric value that they hash to with a common crptographic hash value. Something like:

Password String Hash value
pass 1b93eb12
password a4532c47
abc 7a2d28fc

If you can somehow steal the passwords file, then with a rainbow table, you can find users with common passwords. For example, in the table above, you can see that the hashcode “7a2d28fc” occurs in the passwords file for the username “alice”, and it’s also in the rainbow table for the password “abc”. So a thief could determing that alice’s password was “abc”. Even with the best crpytographic hash, all it takes is one idiot user who uses “password” as their password, and your system’s security is breached.

Salting passwords addresses that problem. In a salting strategy, you don’t hash a user’s password by itself: you combine it with some additional data, and then hash that combination. The additional information is called the salt..

You can use lots of different things for the salt. There’s a complex set of tradeoffs in the exact salting strategy, which are beyond the scope of this post, but a few examples include:

  1. Always use a fixed salt string. This is weak, but better than nothing. It’s got a similar weakness to the encrypted password system: you only need one salt to give you a handle on breaking all of the passwords, and that one salt needs to be in the system.
  2. Add a random piece of data for each password. The catch here is that you need to store the salt data for each password. This is what unix passwords used to use. They added 12 random bits to each password. In the passwords file, they stored the salt and the hashed password. The weakness of this is that the salt is right there with the password entry. But because each user has a different salt, that means that any attempt to breach the system needs to look at each user separately.
  3. Salt on metadata: that is, take information about the user that isn’t part of their username, and use that as the salt. For example, you could use a person’s birthday as the salt for their account.

If each user has a different salt, then even if you’ve got terrible passwords, a thief needs to do a lot of work to try to break your system. Even with a rainbow-table like strategy, they can’t compute the hashcode for a given common password once, and then search the password hash list for that code – they need to recompute it for each possible salt value!

What salting does is, effectively, increase the amount of effort needed to break the passwords. If you add 12 bits of salt, then a rainbow table needs 4096 times more entries to find common passwords! If your salt is long enough, then it can make it effectively impossible to create a rainbox table at all. If they try to attack you without a rainbow table, a 12 bit salt means that your attacker needs to attack the passwords of each of your users seperately! Even if they know the value of the salt, you’ve made it much harder for them to breach your security.

23 thoughts on “Passwords, Hashing, and Salt

  1. PhilM

    This is an excellent review of how to build an authentication system and considering the frequent high-profile hacking, a much needed one.

    But, the need to store credentials securely does arise in some systems which authenticate on behalf of someone and the authentication system. Any thoughts on how this can be done securely without requiring one or more humans providing the key to the secure store or a hardware security module?

    Reply
  2. Oskar Sigvardsson

    A small nitpicky point: that’s not how rainbow tables work. Storing every possible password with its hash would take up insane amounts of space, but what makes rainbow tables different from regular tables (which is what you described) is that it makes a rather brilliant space/time tradeoff.

    Rainbow tables are pretty cool, so I’ll just give a brief description of the basic idea of how they work: lets say that the hash function we’re trying to crack is called H, which maps passwords to hash values. Make up a new function, R, which works the other way around, i.e. it maps hash values to passwords (not the same password that produces the hash obviously, otherwise the hash function wouldn’t be one-way). Now, starting with some password make a chain of alternating passwords and hashvalues by alternating the application of H and R.

    So, for instance, lets say you start with “abc”. Applying H to it, gives you H(“abc”) = 7a2d28fc. Now, apply R to 7a2d28fc, giving you (something like) “basldkg”, i.e. R(7a2d28fc) = “basldkg”. Now apply H to “basldkg”, giving you 6a7b4411. Now apply R to that hash value, giving you “aaubel”. And on and on and on. That is to say, starting with “abc”, you get this chain:

    “abc” -> 7a2d28fc -> “basldkg” -> 6a7b4411 -> “aaubel” -> …

    And on and on and on. Lets say you make this chain 100 items long, and it finally ends up with the password “afhgabl”. In a rainbow table, you then store this value, along with the value that started the chain, i.e. “abc”.

    Now, lets say you want to crack the hash-value 6a7b4411. You start generating the chain from that value, and then you check the table every time you get a password to see whether or not your value is stored. Sooner or later, you’ll end up with “afhgabl”, which is stored in the table, along with the starting value, “abc”. So now you know that the password for 6a7b4411 is somewhere along the chain that “abc” started. So now you just generate the chain again, and pretty soon you’ll realize that the password “basldkg” gives you that hash value.

    Pretty cool. Instead of doing 1 table lookup, you’ll have to do potentially up to 100, but you only have to store 1/100 as much stuff. Since table lookups are very fast and storage is very expensive, this is a very smart trade-off. The clever thing is that you can adjust that value to get a very precise combination of good speed and acceptable storage. Don’t have enough storage, but lookups are really fast? Generate the tables with longer chains! Have oceans of storage, but your lookups take for ever? Generate the tables with a smaller number!

    Reply
    1. Oskar Sigvardsson

      Oh, just one more thing: this is much simplified, and not how actual rainbow tables work, real rainbow tables have to do some extra work to ensure that different chains don’t collide. But this is the basic idea.

      Reply
  3. Jeff C.

    Nothing in this article is factually inaccurate, but take note that no one should roll their own password storage. To store passwords, use one of scrypt, bcrypt, or PBKDF2 (in order from most preferred to least preferred). If you aren’t doing this now, you need to change it.

    Reply
  4. Daniel Martin

    As another comment or said, you should be using bcrypt, scrypt, or PBKDF2 instead of rolling your own password hashing scheme. Also, I’ve seen some arguments that these days rainbow tables are obsolete since md5 and sha1 are so easily computed massively in parallel on GPUs – the economics just don’t support the cost of the hard drive space needed. GPU cycles are just too cheap in comparison.

    Reply
  5. Stormy Dragon

    Even salting and hashing are behind the curve. If you think about it, the server doesn’t even need to know what the password is, it just has to be able to verify that the client knows what it is. There are algorithms to do this (most prominently the Secure Remote Password protocol) that allow the server to do this without having to ever be told what the password is.

    Reply
  6. ithinkso

    “Designing new crpytographic hash functions is hard. Implementing cryptographic hash functions correctly is hard. As a result, most people just use a hash function from a library.”
    And thanks god for that. Creating and implementing new cryptographic hash or whatever is almost always very bad idea.

    Reply
  7. Tobias

    How is it that those cracking always gives me the correct password and not another one with the same hash. There have to be some.
    Is it only because the correct password is usually the shortest and we start with the shortest while cracking?

    Is it possible that my 127 Characters of line noise password has the same hash as ‘password’ and than everyone makes fun of my short password?

    Reply
    1. Stormy Dragon

      Since comparing the hashes is how the server determines if the password is right, all of the other strings with the same hash are “correct”.

      And the chances of your 127 Characters of line noise password having the same hash as “password” is less than 1 in 4 billion.

      Reply
  8. MIke

    >Even with the best crpytographic hash, all it takes is one idiot user who uses “password” as their password, and your system’s security is breached.

    But only for Alice and anyone else who uses entries in the common password rainbow table.

    My password of Scientopia1234 is unaffected.

    woops….

    Reply
  9. Rahul Waghmare

    where does salting-hashing take place? If we are using server client application then whether salting is done at server side or client side(browser)?

    Reply
    1. Marcos

      Usually the client hashes the password and sends the string to the server. The server then adds the salt to the string and re-hashes, and then stores the final hashed string, and the salt. Then when the user submits a password, the server hashes it, adds the salt, rehashes and compares. I believe it’s the most secure way, but I’m still a noob so I’m not quite sure.

      Reply
  10. Nick

    One random note with regards to theoretical correctness is that there is currently no proof that one-way functions as a class actually exist. Mind-you, almost no one studying the subject actually believes they DON’T exist, but it has yet to be proven that they do (this is both similar and related to P=/=NP?). Mostly an irrelevant point in the context of this article, but I figured I’d throw it out there for anyone who cares.

    Reply
  11. Pingback: LoL Chest » Blog Archive » Important Security Update and Password Reset

  12. Pingback: RunicPortal » League of Legends compromised; North American accounts and transactions accessed

  13. Pingback: cpugame.com | Important Security Update and Password Reset

  14. Pingback: Important Security Update and Password Reset | League Chicago

  15. sean white

    so your telling me they accessed my fake name junk email and unreadable passwords?i guess i can live with that its not as if my password is password ,abc123 or123abc if you need a sign that your a redneck, here’s your sign.

    Reply

Leave a Reply to ithinkso Cancel reply