An Introduction to Cryptographic Hash Functions

Basics

Cryptographic hash functions are algorithms that take data input (often called the 'message') and generate a fixed-size result (often called the 'hash' or 'digest'). For example:

1
"administrator" => "b3aca92c793ee0e9b1a9b0a5f5fc044e05140df3"

Ideal hash functions make it very difficult to get the original message back from the digest, reasonably easy to compute a hash for a given message, infeasible to generate a message with a given hash, infeasible to modify a message without changing the resultant hash, and infeasible to find two messages with the same hash.

While no completely ideal function exists, functions which aim for these properties can prove very useful. A classic example of cryptographic hash use is in storage of passwords. When you sign up for a website, your data is usually stored in a database on their servers. The issue is that if your password is stored on the server as regular text (often called 'plaintext') and somebody hacks into the server, your password is completely compromised. If your password is hashed on the server, however, an attacker shouldn't be able to formulate your password from the stored value.

This concept may leave some wondering how a password entered at a later date could then be compared to the stored value to check if the login information is correct, but in fact this is extremely simple. The inputted password is simply hashed using the same function, and this digest is simply compared to that in the database -- if they match, the inputs were the same, and thus the user has inputted the correct password. Situations like this are also where it being "infeasible to find two messages with the same hash" becomes very important. If two values generate the same hash (these situations are called collisions, and are something that pretty much all hash functions are vulnerable to), somebody could input an incorrect password yet it could validate as correct. Take the following ruby script as an example, where 'password' would presumably be stored in some kind of database on a server after hashing:

1
2
3
4
5
6
7
8
9
10
11
puts "Sign up. Password: "
password = gets.chomp!
password = hash(password) #=> b3aca92c793ee0e9b1a9b0a5f5fc044e05140df3 for example

puts "Log in. Password: "
login_input = gets.chomp!
login_input = hash(login_input)

if login_input == password then #If the hashes are equal
	puts "Login successful."
end

Following this, I should probably address the huge question some may have about how it's even possible to have an algorithm that can give a result which cannot be worked back from. This really is the cornerstone of hash functions - especially as most secure algorithms are open source and available to the public.

The irreversibility isn't actually as impossible as it might first sound -- the tough bit comes in compromising this with all the other ideal properties. The trick is the split the message into a number of blocks, and then have these messed up and interact with eachother to get some final value pop out. The interactions are often in the form of bitwise AND, OR, and XOR operations which mean a loss of information to the original input, and this combined with the mixing of blocks means that there simply isn't enough information to get back to the original input via working the algorithm back. A very primitive example of this, might be something like the very basic C++ hash algorithm I made, XHA-64. It's a really really basic algorithm which shouldn't really be used for anything, but it works as a simple example for demonstration purposes -- the interactions should mean that getting the original value back isn't as easy as you might think.

One of the many problems with my basic algorithm, however, is the lack of "avalanche" effect. A good hash function should produce totally different results if even a single character is changed -- the difference between the SHA-1 algorithm and my little algorithm, for example, for two input messages can be seen below:

1
2
3
4
SHA1:  "hello" => aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
SHA1:  "hellp" => d4baaad7a68a379b1e133e0fea0603051b0124ca
XHA64: "hello" => E6F82C344A43FE0C
XHA64: "hellp" => E6E71F38401FF14D

Note that my algorithm produces very similar hashes for the similar values, which could (and likely would) help an attacker to find something about the nature of the original input, whereas in SHA-1 the small changes avalanche through the block interactions to produce totally different results.

As a final note to make sure you understand hash functions and their ideal properties before moving on to more complex hash function topics, I'd suggest looking at a few different hash functions in depth to see more detail about how their specific interactions and rules work, how they perform, and their problems. A few examples which might help you get started with this are BCrypt and MD5.

Rainbow Tables and Salting

Hash functions are generally pretty strong, however, as always, brute force is a method around the security. You may have heard of these things called rainbow tables - these are essentially massive tables of message-digest pairs for a certain algorithm.

1
2
3
4
5
6
7
8
9
10
a => 86f7e437faa5a7fce15d1ddcb9eaeaea377667b8
b => e9d71f5ee7c92d6dc9e92ffdad17b8bd49418f98
... etc ...
aa => e0c9035898dd52fc65c41454cec9c4d2611bfb37
ab => da23614e02469a0d7c7bd1bdab5c9c474b1904dc
... etc ...
pass => 9d4e1e23bd5b727046a9e3b4b7db57bd8d6ee684
past => dc52f279269d00ddcc22b2e6bb47a812c140dfdc
pasu => 59fc5e43c6101bd944f23d19ff5906eaf297ad3c
... etc ...

Let's say that a bunch of powerful computers have been working at generating message-digest pairs for an algorithm for a fair amount of time (there are a lot of combinations), and they've managed to list all the character combinations and associated hashes for up to 5 characters in message length. Uh oh - if you're running a website which simply hashes users passwords in a database with this popular (and presumably secure) algorithm, this means that if your users' passwords aren't very complex, they may already be listed in the rainbow table. So if an attacker breaks in, they could simply run all the hashes against the rainbow table, hence getting the original passwords for users with weaker passwords. For example your database shows the association user: bob; password: 9d4e1e23bd5b727046a9e3b4b7db57bd8d6ee684, however an entry for "9d4e1e23bd5b727046a9e3b4b7db57bd8d6ee684" is found in the rainbow table through a quick search-through, and the hash is translated back into the original message: "pass"! Oh no!

This is a pretty big problem - not only can users not really be trusted to come up with secure passwords a lot of the time, but popular algorithms are usually the more secure ones, however this in turn means that more people will be interested in putting their horsepower towards bettering the rainbow tables to break more hashes in that algorithm.

Luckily, this problem isn't too difficult to solve. Generating rainbow tables is only really viable for a certain number of characters as the possible combinations of characters simply gets too high for a rainbow table to be generated for all the combinations in good time. So to combat rainbow tables, all we really have to do is add a bunch of characters to the end of the password before we hash it, and we have a totally different hash which won't be in the rainbow table. These extra characters are called salts, and an example of this can be seen below:

1
2
"pass" => 9d4e1e23bd5b727046a9e3b4b7db57bd8d6ee684
"pass" + "s0mRIdlKvI!x@l=)nc&£NEF&£" => 704d6d4bd9c1ce11bfbe3d0c5287b02a2005e61f

As long as we also stick the salt (which would ideally be a bit longer than the one in the example, simply for even better security) on the end of attempts to log in so the hash generated is the same as the one stored, we should have functionality that works while having a hash which shouldn't be stored in a rainbow table!

There's one more important precaution that should be taken, however. Let's expand our example of being a site owner a bit, and say that we own a site with millions and millions of users. If we use the same salt for each user password, it's probably worth attackers going out of their way to actually generate a custom rainbow table for values with our salt attached to them! The solution to this is also very simple -- use random salts. If salts are different for each user, there's no way that an attacker can try to get all the passwords in the database, even with a LOT of time.

If attackers wanted to target a specific user, of course, they could waste a whole bunch of time trying to generate a rainbow table specifically for that salt, however if the user has a strong password it will practically simply take too long as their are simply too many character combinations that the password could contain. Horrah for good password security!