An Introduction to Cryptographic Hash Functions
Looking for in-depth C videos? Click here.
Cryptographic hash functions are algorithms that take data input (often called the 'message') and generate a fixed-size result (often called the 'hash' or 'digest'). For example:
Ideal hash functions make it very difficult to get the original message back from the digest, reasonably easy to compute a hash for a given message, infeasible to generate a message with a given hash, infeasible to modify a message without changing the resultant hash, and infeasible to find two messages with the same hash.
While no completely ideal function exists, functions which aim for these properties can prove very useful. A classic example of cryptographic hash use is in storage of passwords. When you sign up for a website, your data is usually stored in a database on their servers. The issue is that if your password is stored on the server as regular text (often called 'plaintext') and somebody hacks into the server, your password is completely compromised. If your password is hashed on the server, however, an attacker shouldn't be able to formulate your password from the stored value.
This concept may leave some wondering how a password entered at a later date could then be compared to the stored value to check if the login information is correct, but in fact this is extremely simple. The inputted password is simply hashed using the same function, and this digest is simply compared to that in the database -- if they match, the inputs were the same, and thus the user has inputted the correct password. Situations like this are also where it being "infeasible to find two messages with the same hash" becomes very important. If two values generate the same hash (these situations are called collisions, and are something that pretty much all hash functions are vulnerable to), somebody could input an incorrect password yet it could validate as correct. Take the following ruby script as an example, where 'password' would presumably be stored in some kind of database on a server after hashing:
1 2 3 4 5 6 7 8 9 10 11
Following this, I should probably address the huge question some may have about how it's even possible to have an algorithm that can give a result which cannot be worked back from. This really is the cornerstone of hash functions - especially as most secure algorithms are open source and available to the public.
The irreversibility isn't actually as impossible as it might first sound -- the tough bit comes in compromising this with all the other ideal properties. The trick is the split the message into a number of blocks, and then have these messed up and interact with eachother to get some final value pop out. The interactions are often in the form of bitwise AND, OR, and XOR operations which mean a loss of information to the original input, and this combined with the mixing of blocks means that there simply isn't enough information to get back to the original input via working the algorithm back. A very primitive example of this, might be something like the very basic C++ hash algorithm I made, XHA-64. It's a really really basic algorithm which shouldn't really be used for anything, but it works as a simple example for demonstration purposes -- the interactions should mean that getting the original value back isn't as easy as you might think.
One of the many problems with my basic algorithm, however, is the lack of "avalanche" effect. A good hash function should produce totally different results if even a single character is changed -- the difference between the SHA-1 algorithm and my little algorithm, for example, for two input messages can be seen below:
1 2 3 4
Note that my algorithm produces very similar hashes for the similar values, which could (and likely would) help an attacker to find something about the nature of the original input, whereas in SHA-1 the small changes avalanche through the block interactions to produce totally different results.
As a final note to make sure you understand hash functions and their ideal properties before moving on to more complex hash function topics, I'd suggest looking at a few different hash functions in depth to see more detail about how their specific interactions and rules work, how they perform, and their problems. A few examples which might help you get started with this are BCrypt and MD5.
Rainbow Tables and Salting
Hash functions are generally pretty strong, however, as always, brute force is a method around the security. You may have heard of these things called rainbow tables - these are essentially massive tables of message-digest pairs for a certain algorithm.
1 2 3 4 5 6 7 8 9 10
Let's say that a bunch of powerful computers have been working at generating message-digest pairs for an algorithm for a fair amount of time (there are a lot of combinations), and they've managed to list all the character combinations and associated hashes for up to 5 characters in message length. Uh oh - if you're running a website which simply hashes users passwords in a database with this popular (and presumably secure) algorithm, this means that if your users' passwords aren't very complex, they may already be listed in the rainbow table. So if an attacker breaks in, they could simply run all the hashes against the rainbow table, hence getting the original passwords for users with weaker passwords. For example your database shows the association
user: bob; password: 9d4e1e23bd5b727046a9e3b4b7db57bd8d6ee684, however an entry for "9d4e1e23bd5b727046a9e3b4b7db57bd8d6ee684" is found in the rainbow table through a quick search-through, and the hash is translated back into the original message: "pass"! Oh no!
This is a pretty big problem - not only can users not really be trusted to come up with secure passwords a lot of the time, but popular algorithms are usually the more secure ones, however this in turn means that more people will be interested in putting their horsepower towards bettering the rainbow tables to break more hashes in that algorithm.
Luckily, this problem isn't too difficult to solve. Generating rainbow tables is only really viable for a certain number of characters as the possible combinations of characters simply gets too high for a rainbow table to be generated for all the combinations in good time. So to combat rainbow tables, all we really have to do is add a bunch of characters to the end of the password before we hash it, and we have a totally different hash which won't be in the rainbow table. These extra characters are called salts, and an example of this can be seen below:
As long as we also stick the salt (which would ideally be a bit longer than the one in the example, simply for even better security) on the end of attempts to log in so the hash generated is the same as the one stored, we should have functionality that works while having a hash which shouldn't be stored in a rainbow table!
There's one more important precaution that should be taken, however. Let's expand our example of being a site owner a bit, and say that we own a site with millions and millions of users. If we use the same salt for each user password, it's probably worth attackers going out of their way to actually generate a custom rainbow table for values with our salt attached to them! The solution to this is also very simple -- use random salts. If salts are different for each user, there's no way that an attacker can try to get all the passwords in the database, even with a LOT of time.
If attackers wanted to target a specific user, of course, they could waste a whole bunch of time trying to generate a rainbow table specifically for that salt, however if the user has a strong password it will practically simply take too long as their are simply too many character combinations that the password could contain. Horrah for good password security!