Consistent Hashing 2016-09-02
Introduction
Traditional hash tables map keys to an array index using the following process:

```
hash = hashFunc(key)
index = hash % arraySize
```

When `arraySize` changes, all keys need to be remapped because the index is calculated with a modulo operation.
The same technique can be used to partition the data from an application across a number of databases¹ by calculating the hash of a key for the data modulo the number of databases, and you could have a situation like the following:
If a new database is added to the cluster, or one existing db is removed (or fails), all keys would need to be remapped just like in the hash table example. Now, if you are dealing with a lot of data you can imagine that remapping all keys would take quite some time which is not very attractive.
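The remapping cost of modulo placement is easy to demonstrate. The sketch below (hypothetical helper names; SHA-256 is just one choice of hash function) shards 1,000 keys across 4 databases and then across 5, and counts how many keys change their assignment:

```python
import hashlib

def shard(key, num_databases):
    """Map a key to a database index via hash modulo."""
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return h % num_databases

keys = [f"key-{i}" for i in range(1000)]

# Compare assignments with 4 databases vs. 5 databases.
before = {k: shard(k, 4) for k in keys}
after = {k: shard(k, 5) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])

# With modulo placement, adding one database remaps most keys:
# only keys whose hash agrees mod 4 and mod 5 stay put.
print(f"{moved} of {len(keys)} keys moved")
```

Roughly 80% of the keys move when going from 4 to 5 databases, which is exactly the behavior consistent hashing is designed to avoid.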
Consistent Hashing
That’s when consistent hashing comes in as an alternative. First, let’s consider the output range of a function `f`:
If we connect both ends we end up with a ring:
Using the same function `f`, we can map each node to a point on the ring:
The interval between two nodes on the ring forms a partition. If we use the same function `f`² over the key, we end up with a projection of that key onto the ring:
With this notion we can define the server responsible for our key as the first node in a clockwise direction after the key’s projection³.
So in this case, `"Mars"` would be stored on the server `10.9.5.1`. The same process is applied for any other key:
Note that `f("Venus")` mapped to a point after the last node and before the maximum value of the function `f`. Since both ends are connected, there is no problem in walking clockwise to find the responsible node, which in this case is `10.1.2.3`.
Adding a new node to the ring does not mean that all keys will need to be remapped. Only a fraction⁴ of the keys will need to move to a different node:
A fraction of the keys are also remapped when a node leaves the ring:
And that’s the essence of Consistent Hashing. The idea was presented in a paper by Karger et al. in 1997 [1]. Nodes are mapped to a ring, forming partitions, which are then used to find the node responsible for a key by mapping the key to the same ring, and finding the first node in a clockwise direction after the key position.
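The lookup rule described above can be sketched in a few lines of Python (a minimal sketch, not a production implementation; SHA-256 as the ring function is an assumption, and the node addresses are taken from the examples here, so the concrete owner of a key will not match the figures):

```python
import bisect
import hashlib

def position(value):
    """Project a value onto the ring; any well-mixing hash with a
    fixed output range would do (SHA-256 is an assumption here)."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # The sorted list of (position, node) pairs forms the ring.
        self._points = sorted((position(n), n) for n in nodes)

    def lookup(self, key):
        """Find the first node clockwise from the key's projection."""
        positions = [p for p, _ in self._points]
        i = bisect.bisect_right(positions, position(key))
        # Wrap around: a key past the last node belongs to the first.
        return self._points[i % len(self._points)][1]

ring = Ring(["10.1.2.3", "10.9.5.1", "10.4.4.8"])
owner = ring.lookup("Mars")  # one of the three nodes, deterministically
```

The modulo in `lookup` is what implements the "both ends are connected" property: a key projected past the last node wraps around to the first node on the ring.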
Some examples of systems that use consistent hashing are: Amazon’s Dynamo [2], Riak [3], Akka [4], and Chord [5].
With consistent hashing it is easier to avoid hotspots by using a function `f` that mixes well: even if the keys are very similar, they end up projected onto different and distant points on the ring, causing them to be stored at different nodes. Another benefit is the smoothness of moving keys when nodes join or leave the ring: only the immediate neighbors of a node are impacted, and other nodes remain unaffected.
A system using consistent hashing can apply other techniques to further reduce the impact of changes in the ring structure. If the nodes are data stores, as in the initial example in this post, the system could replicate the data to the next `N` nodes after the original node, `N1`, that is responsible for that data. This gives the advantage that if `N1` leaves the ring, its immediate neighbors will already have the data that was stored at `N1`, preventing a spike in network traffic after a node (in this case, `N1`) departs. The same technique helps avoid hotspots even more, since requests for the data can be handled by `N1` or any of its next `N` neighbors on the ring.
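The replication scheme amounts to computing a preference list: the owner plus its next successors on the ring. A sketch, assuming SHA-256 as the ring function and hypothetical node addresses (with one node per physical machine, consecutive ring positions are automatically distinct nodes):

```python
import bisect
import hashlib

def position(value):
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)

def preference_list(key, nodes, n_replicas):
    """Return the node responsible for `key` (N1) followed by its
    next `n_replicas` clockwise successors on the ring."""
    points = sorted((position(node), node) for node in nodes)
    i = bisect.bisect_right([p for p, _ in points], position(key))
    # Walk clockwise from the owner, wrapping around the ring.
    return [points[(i + j) % len(points)][1] for j in range(n_replicas + 1)]

nodes = ["10.1.2.3", "10.9.5.1", "10.4.4.8", "10.7.1.2"]
replicas = preference_list("Mars", nodes, n_replicas=2)
# replicas[0] is N1; replicas[1:] already hold copies if N1 departs.
```

With virtual nodes (next section), a real implementation would also skip successors that map back to the same physical node, so that replicas land on distinct machines.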
Usually, systems using consistent hashing construct their rings as the output range of a hash function like SHA-1 or SHA-2, for example. The range of SHA-1 goes from 0 to 2^160, and SHA-2 has different output ranges: SHA-256 is 0 to 2^256, SHA-512 is 0 to 2^512, etc. Using any of these functions to map the nodes onto the ring will have the effect of placing the nodes at random positions. The partitions will very likely have different sizes, which means nodes will be responsible for different amounts of data. This characteristic may not be attractive, and since the range is well defined, the ring could instead be split into partitions of equal size, with each node claiming ownership of a number of these partitions, guaranteeing that each node handles more or less the same amount of data⁵.
Another important technique is the usage of virtual nodes, which we will see next.
Virtual Nodes
The ring shown above had a one-to-one mapping between physical nodes and nodes in the ring. This approach presents a challenge: randomly placing the nodes on the ring may lead to a non-uniform distribution of data between nodes.
The problem becomes more evident when a node leaves the ring, since all the data handled by that node must be moved entirely to a single other node.
To avoid overloading a single node when another one leaves the ring, and to distribute the keys more evenly, the system can create a different mapping between physical nodes and nodes in the ring. Instead of a one-to-one mapping, the system creates virtual nodes, resulting in an M-to-N mapping between physical nodes and virtual nodes in the ring.
With virtual nodes, each physical node becomes responsible for multiple partitions in the ring. Then if a node leaves the ring, the load handled by this node is evenly dispersed across the remaining nodes in the ring.
And similarly, when a node joins the ring, it receives a roughly equivalent amount of data from the other nodes in the ring. The virtual node scheme also helps when the system is composed of heterogeneous nodes in terms of resources such as CPU cores, RAM, disk space, etc. With a heterogeneous cluster, the number of virtual nodes for each physical node can be chosen according to the characteristics of that node.
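A common way to create virtual nodes is to hash the node's identifier together with an index, so each physical node claims many points on the ring. A sketch (the `"<node>-<i>"` naming scheme and SHA-256 are assumptions, not any particular system's convention):

```python
import bisect
import hashlib

def position(value):
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes_per_node):
    """Place `vnodes_per_node` virtual nodes per physical node by
    hashing "<node>-<i>" for each index i."""
    points = []
    for node in nodes:
        for i in range(vnodes_per_node):
            points.append((position(f"{node}-{i}"), node))
    return sorted(points)

def lookup(points, key):
    i = bisect.bisect_right([p for p, _ in points], position(key))
    return points[i % len(points)][1]

nodes = ["10.1.2.3", "10.9.5.1", "10.4.4.8"]
ring = build_ring(nodes, vnodes_per_node=64)

# Count how many of 3000 keys land on each physical node; with 64
# virtual nodes each, the split is far closer to even than with
# only 3 points on the ring.
counts = {n: 0 for n in nodes}
for i in range(3000):
    counts[lookup(ring, f"key-{i}")] += 1
```

Giving a beefier machine more virtual nodes than a smaller one is how the heterogeneous-cluster case mentioned above is handled: the share of the ring a node owns grows with its virtual node count.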
concha: A consistent hashing library in Erlang
`concha` is a consistent hashing library in Erlang that I built.
The ring represents the output range of SHA-256, and the same function is used for mapping nodes and keys onto the ring. It provides lookups of nodes for given keys, creation of rings with virtual nodes, adding and removing nodes, etc. For more information and examples of usage you can visit its repository on GitHub.
References
- [1] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin, Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. New York, New York, USA: ACM, 1997, pp. 654–663.
- [2] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, Dynamo: Amazon’s Highly Available Key-value Store, 2007, pp. 1–16.
- [3] Riak Vnodes: https://docs.basho.com/riak/kv/2.1.4/learn/concepts/vnodes/
- [4] Akka Routing: http://doc.akka.io/docs/akka/2.4/scala/routing.html#ConsistentHashingPool_and_ConsistentHashingGroup
- [5] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, ACM SIGCOMM …, pp. 149–160, 2001.
- [6] Consistent hashing, Wikipedia. https://en.wikipedia.org/wiki/Consistent_hashing
Notes
1. And also in a bunch of other situations, for example, to spread requests across different servers.
2. Or any other function with the same output range.
3. If the projection is equal to a node’s projection, then that node is responsible for the key.
4. Usually K/n, where K is the number of keys and n is the number of nodes in the ring.
5. Depending on the key distribution in the ring.