In our first blog post on passwords we discussed the challenges of maintaining passwords in a way that allows us to verify passwords without making them easily stealable. In this post, we begin to explore how this problem is solved in computer systems, though we'll continue to look at it from the perspective of the doorperson from the last post.
The concept of a one-way function may be a little hard to grasp at first, and especially hard to understand how it could be used for something like passwords. As an illustrative example, consider multiplication. Most people, if asked multiply a set of numbers together, would be able to do so relatively painlessly. However, if someone gave you a large number and asked you to tell them which numbers might have been multiplied together in order to get that large number, many would be unsure of where to start and likely no one would be able to do it anywhere near as quickly. Going back to the doorperson example, we could imagine a system where the list of guests has a very large number associated with each guest. The doorperson would then have the person provide their carefully chosen set of numbers, multiply them together, and verify that the product equals what is in the password sheet. This is a start, but ultimately still has some flaws.
Most importantly, we want the order to matter. Since multiplication is commutative, there is nothing forcing someone who stole the password sheet to provide the same numbers in the same order, they just need a set of numbers that have the same product (e.g. 5 x 8 x 12 = 480 = 16 x 3 x 10). But assuming we can find a method for combining the numbers in a way that is dependent on the order that they are provided, we are in much better shape. We also need to be able to use letters (and other characters) in addition to numbers. This isn't really as hard as it sounds (hint: binary can always be interpreted as a number). But if we can find a way to fulfill these requirements we will ideally have a process that allows us to store data that can be securely used to verify users passwords without actually storing them.
The solution commonly applied to this problem is a technique referred to as "hashing." Hashes are operations that can be done on any set of data to turn them into a seemingly random value. When applied to a given input, a hash should always provide the same output ("hash value") but when applied to different inputs (even ones that are very similar) the hash may produce significantly different outputs. Hashes as a general category are useful in a lot of settings but there are cryptographic hash functions which have many relevant features for our purposes here. They are hard to reverse (i.e. figure out what the input was if you have the output), resistant to collisions (two different inputs should rarely produce the same output), and are quick for a computer to execute.
Going back to the doorperson analogy, if the doorperson tracked hash values on their sheet they would be able to hash the password provided by prospective guests and check that hash value against their sheet. If the sheet were stolen, the thief shouldn't be able to invert the hash function, so the only way they would be able to determine a user's password is to guess it and confirm they are right by hashing their guess and seeing that it comes out to the same hash value.
So assuming our hash function does everything we designed it to do, we have successfully put together a system that we can use to verify users passwords but don't have to worry about them getting stolen because no one can reverse-engineer the passwords. Seems perfect, right?
As is often the case in life, we mustn't underestimate the ingenuity of people determined to get what they want. It turns out that the last criteria for hash functions we mentioned above: that they be fast, is itself a weakness. Because making a good cryptographic hash function is hard and requires a lot of analysis and testing, the average programmer isn't well-equipped to create their own. Instead we rely heavily on the experts that specialize in information security and cryptography to vet these functions for us and make recommendations about what to use. This is good. It means that people that know what they're doing are the ones making the real decisions. However, this discourse is public. So hackers are also seeing these recommendations and can immediately start trying to figure out how to undermine them.
With this knowledge, someone that has just stolen our sheet of password hashes might start by having their computer generate every possible password (a, b, c, …, aa, ab, ac, ...), run those passwords through the hash function, and check to see if it matches any of the ones on the list. This is known as a brute force attack. Alternatively, they might get a list of all the words in the dictionary and check those (a dictionary attack). This might take a long time but eventually if a guest has a really simple password they might figure it out. On top of that, if any two users have the same password, they'll be able to notice right away. Depending on how much time they have and how often the users change their passwords, they would eventually figure out all of the user's passwords. And because the hash functions are designed to be fast, this might be faster than you'd think. In some cases they may be able to test thousands of possible passwords every second (!). But if people's passwords are reasonably complex, we would still have time to realize that the sheet was stolen and tell all of the users to change their passwords. Or would we?
In reality, there's nothing preventing the password hash thief from starting the process of running all possible passwords (up to a certain length) through a given hash function ahead of time and storing all of those results. Then when they steal the list they can quickly look up the hash value and find the password that was used to generate it. In this case they might be able to start using the guest's passwords immediately, while we're still realizing they stole a copy of the list. This large set of pre-calculated hashes is referred to as a rainbow table and, provided the thief is able to store and easily look up the hash values, presents a serious problem for our system.
All is not lost though. The next post in this series will cover techniques for combating brute-force attacks and the current recommendations for how to securely manage user's passwords. See you then!