Consistent Hashing

Introduction

Traditional hash tables map keys to an array index using the following process:

hash = hashFunc(key)
index = hash % arraySize

When arraySize changes, all keys need to be remapped, because the index is calculated with a modulo operation over the array size.

The same technique can be used to partition an application's data across a number of databases¹: calculate the hash of each item's key modulo the number of databases. You could end up with a situation like the following:

If a new database is added to the cluster, or an existing one is removed (or fails), all keys would need to be remapped, just like in the hash table example. If you are dealing with a lot of data, you can imagine that remapping all those keys would take quite some time, which is not very attractive.
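To make the cost concrete, here is a small sketch of that modulo scheme. The module, the key names, and the use of erlang:phash2/1 as the hash function are just for illustration, not taken from the original example.

%% Sketch: partitioning keys with hash(Key) rem NumDatabases.
%% erlang:phash2/1 is used only as a convenient stand-in hash function.
-module(mod_partition).
-export([assignments/2, moved/3]).

%% Which database index (0..NumDatabases-1) each key maps to.
assignments(Keys, NumDatabases) ->
    [{Key, erlang:phash2(Key) rem NumDatabases} || Key <- Keys].

%% Keys whose database changes when the cluster grows from N to M databases.
moved(Keys, N, M) ->
    [Key || Key <- Keys,
            erlang:phash2(Key) rem N =/= erlang:phash2(Key) rem M].

Calling mod_partition:moved(["Mercury", "Venus", "Earth", "Mars", "Jupiter"], 4, 5) will typically return most of the keys, which is exactly the remapping cost described above.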

Consistent Hashing

That’s where consistent hashing comes in as an alternative. First, let’s consider the output range of a function f:

If we connect both ends we end up with a ring:

Using the same function f, we can map each node to a point in the ring:

The interval between two nodes in the ring forms a partition. If we apply the same function f² to a key, we end up with a projection of that key onto the ring:

With this notion we can define the server responsible for our key as the first node found in a clockwise direction after the key’s projection³.

So in this case, "Mars" would be stored on the server 10.9.5.1. The same process applies to any other key:

Note that f("Venus") mapped to a point after the last node and before the maximum value of the function f. Since both ends are connected, there is no problem in walking clockwise past that point to find the responsible node, which in this case is 10.1.2.3.
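That clockwise rule can be sketched in a few lines of Erlang. This is only an illustration under the assumptions used so far (SHA-256 as f, and the ring kept as a list of {Position, Node} pairs sorted by position); it is not the implementation of any particular system.

%% Sketch of the ring: nodes and keys are mapped to the output range of
%% SHA-256, and a key is owned by the first node found moving clockwise
%% from the key's position, wrapping around the ring when needed.
-module(hash_ring).
-export([position/1, build/1, lookup/2]).

%% A point on the ring: an integer in [0, 2^256 - 1].
position(Term) ->
    crypto:bytes_to_integer(crypto:hash(sha256, term_to_binary(Term))).

%% The ring is a list of {Position, Node} pairs sorted by position.
build(Nodes) ->
    lists:keysort(1, [{position(Node), Node} || Node <- Nodes]).

%% First node at or after the key's position; wrap around to the first
%% node in the ring when the key falls past the last node (as "Venus"
%% does above).
lookup(Key, [{_, First} | _] = Ring) ->
    KeyPos = position(Key),
    case [Node || {Pos, Node} <- Ring, Pos >= KeyPos] of
        [Owner | _] -> Owner;
        []          -> First
    end.

For example, hash_ring:lookup("Mars", hash_ring:build(["10.9.5.1", "10.1.2.3", "10.2.3.4"])) returns whichever of those nodes sits first clockwise from the position of "Mars".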

Adding a new node to the ring does not mean that all keys will need to be remapped. Only a fraction⁴ of keys will need to move to a different node:

A fraction of the keys are also remapped when a node leaves the ring:
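With the hash_ring sketch above, this claim is easy to check. The new node's address below is hypothetical, and the helper is not part of the original post.

%% Sketch: find which keys change owner when a node joins the ring.
%% Reuses the hash_ring module from the previous sketch.
-module(ring_churn).
-export([moved/2]).

moved(Keys, Nodes) ->
    RingBefore = hash_ring:build(Nodes),
    RingAfter  = hash_ring:build(["10.7.7.7" | Nodes]),  % hypothetical new node
    [Key || Key <- Keys,
            hash_ring:lookup(Key, RingBefore) =/= hash_ring:lookup(Key, RingAfter)].

Typically only a small fraction of the keys shows up in the result (roughly K/n, as note 4 says), instead of nearly all of them as in the modulo scheme.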

And that’s the essence of Consistent Hashing. The idea was presented in a paper by Karger et al. in 1997 [1]. Nodes are mapped to a ring, forming partitions; to find the node responsible for a key, the key is mapped to the same ring and the first node in a clockwise direction after the key’s position is chosen.

Some examples of systems that use consistent hashing are: Amazon’s Dynamo [2], Riak [3], Akka [4], and Chord [5].

With consistent hashing it is easier to avoid hotspots by using a function f that mixes well: even if keys are very similar, they end up projected onto different and distant points of the ring, and are therefore stored on different nodes. Another benefit is the smoothness of moving keys when nodes join or leave the ring: only the immediate neighbors of the affected node are impacted, and the other nodes remain unaffected.

A system using consistent hashing can apply other techniques to further reduce the impact of changes in the ring structure. If the nodes are data stores, as in the initial example in this post, the system could replicate the data to the next N nodes after the node originally responsible for it, N1. The advantage is that if N1 leaves the ring, its immediate neighbors already have the data that was stored at N1, preventing a spike in network traffic after the departure. The same technique also helps to further avoid hotspots, since requests for the data can be handled by N1 or by any of its next N neighbors in the ring.
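As a sketch of that replication scheme, building on the hash_ring module above and assuming the ring contains more than N distinct nodes, the set of nodes holding copies of a key (sometimes called a preference list) is the responsible node plus its next N clockwise successors:

%% Sketch: the nodes holding replicas of a key are the responsible node
%% plus its next N clockwise successors on the ring.
-module(replication).
-export([preference_list/3]).

preference_list(Key, Ring, N) ->
    KeyPos = hash_ring:position(Key),
    %% Rotate the ring so it starts at the node responsible for the key,
    %% then take that node and its next N successors.
    {Wrapped, FromKey} = lists:partition(fun({Pos, _}) -> Pos < KeyPos end, Ring),
    [Node || {_, Node} <- lists:sublist(FromKey ++ Wrapped, N + 1)].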

Usually, systems using consistent hashing build their rings over the output range of a hash function such as SHA-1 or SHA-2. The range of SHA-1 goes from 0 to 2^160 - 1, and the SHA-2 family has different output ranges: SHA-256 covers 0 to 2^256 - 1, SHA-512 covers 0 to 2^512 - 1, and so on. Using any of these functions to map the nodes onto the ring effectively places the nodes at random positions. The partitions will very likely have different sizes, which means nodes will be responsible for different amounts of data. This characteristic may not be desirable, and since the range is well defined, the ring can instead be split into partitions of equal size, with each node claiming ownership of a number of these partitions, so that each node handles roughly the same amount of data⁵.
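Here is a sketch of that fixed-partition variant; the number of partitions Q and the simple round-robin assignment below are illustrative choices, not taken from any particular system.

%% Sketch: split the SHA-256 range into Q equal-sized partitions and
%% assign them to nodes round-robin, so each node owns roughly the same
%% share of the ring.
-module(fixed_partitions).
-export([owners/2, owner_of/3]).

-define(RING_SIZE, (1 bsl 256)).   % size of the SHA-256 output range

%% Partition index -> node, assigning partitions round-robin.
owners(Nodes, Q) ->
    NumNodes = length(Nodes),
    [{P, lists:nth((P rem NumNodes) + 1, Nodes)} || P <- lists:seq(0, Q - 1)].

%% The node that owns a key: find the partition the key's hash falls into.
owner_of(Key, Owners, Q) ->
    PartitionSize = ?RING_SIZE div Q,
    KeyPos = crypto:bytes_to_integer(crypto:hash(sha256, term_to_binary(Key))),
    Partition = min(KeyPos div PartitionSize, Q - 1),
    proplists:get_value(Partition, Owners).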

Another important technique is the usage of virtual nodes, which we will see next.

Virtual Nodes

The ring shown above had a one-to-one mapping between physical nodes and nodes in the ring. This approach presents a challenge: randomly placing the nodes on the ring can lead to a non-uniform distribution of data between nodes.

This problem becomes more evident when a node leaves the ring, since all the data handled by that node must be moved entirely to a single other node, its clockwise neighbor.

To avoid overloading a single node when another one leaves the ring, and to distribute keys more evenly, the system can use a different mapping between physical nodes and nodes in the ring. Instead of a one-to-one mapping, the system creates virtual nodes, establishing an M-to-N mapping between physical nodes and virtual nodes in the ring.

With virtual nodes, each physical node becomes responsible for multiple partitions in the ring. Then if a node leaves the ring, the load handled by this node is evenly dispersed across the remaining nodes in the ring.

And similarly, when a node joins the ring, it receives a roughly equivalent amount of data from the other nodes in the ring. The virtual node scheme also helps when the system is composed of heterogeneous nodes in terms of resources such as CPU cores, RAM, disk space, etc. With a heterogeneous cluster, the number of virtual nodes for each physical node can be chosen according to the characteristics of that node.
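A sketch of the virtual node mapping, building on the hash_ring module above; the per-node virtual node counts are illustrative.

%% Sketch: map each physical node to several points on the ring by
%% hashing {Node, Index} pairs. The virtual node count can differ per
%% node to account for heterogeneous hardware.
-module(virtual_nodes).
-export([build/1, lookup/2]).

%% Nodes is a list of {PhysicalNode, VnodeCount} pairs.
build(Nodes) ->
    lists:keysort(1, [{hash_ring:position({Node, I}), Node}
                      || {Node, VnodeCount} <- Nodes,
                         I <- lists:seq(1, VnodeCount)]).

%% Lookup is unchanged: the ring is still a sorted list of
%% {Position, PhysicalNode} pairs, each physical node appearing many times.
lookup(Key, Ring) ->
    hash_ring:lookup(Key, Ring).

For example, virtual_nodes:build([{"10.9.5.1", 64}, {"10.1.2.3", 128}]) gives the second machine twice as many ring positions, one way to account for a node with more resources.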

concha: A consistent hashing library in Erlang

concha is a consistent hashing library in Erlang that I built. The ring represents the output range of SHA-256, and the same function is used for mapping both nodes and keys onto the ring. It provides lookup of nodes for given keys, construction of the ring with virtual nodes, adding and removing nodes, and so on. For more information and usage examples, visit its repository on GitHub.

References
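
[1] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin. Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. STOC 1997.

[2] G. DeCandia et al. Dynamo: Amazon’s Highly Available Key-value Store. SOSP 2007.

[3] Riak, a distributed key-value database.

[4] Akka, a toolkit for concurrent and distributed applications on the JVM.

[5] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. SIGCOMM 2001.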

Notes

  1. And also to a bunch of other situations, for example, to spread requests to different servers. 

  2. Or any other function with the same output range. 

  3. If the projection is equal to a node’s projection, then that node is responsible for the key. 

  4. Usually K/n, where K is the number of keys, and n is the number of nodes in the ring. 

  5. Depending on the key distribution in the ring.