As discussed in Replication , Partitioning is needed for scalable system
We will look at the various partitioning strategies with an example of looking up by a single key. Assume Keys are English words for simplication and there are 5 partitions.
Simplest strategy is to add keys to random partitions. Write performance may be better but reads will suffer. We have to connect multiple partitions(may be nodes) to find if the partition has that particular key or not.
Strategy is to assign from Keys starting with A- E in partition1 and F-K with next and so on. We can maintain this key range mapping separately. Clients can use this to connect to specified partitions to obtain data.
What if our data has more data in F-K ? Based on volume of data, partitions to keys have to be rebalanced. For example, A-D & F are in partition 1, E,G,L-M are in partition 2 etc.
What if write happens on certain keys only on certain days and that partition gets frequently accessed ?
Hash Key Partitioning
To avoid skews and hotspots, it is recommended to use hash function to determine partition. A good hash function takes skewed data and makes it uniformly distributed. Even if the input strings are very similar, their hashes are evenly distributed across a range of numbers. Hash function becomes the index mapper in this case.
But this loses the benefit of range queries. If we select A-C we know we need to go only to partition1.
What if there is huge load for a single key ? We cannot avoid the skew or hotspot problem in this case. Today databases do not handle this efficiently.
We have seen lookup by primary key. With secondary indexes, the problem compounds. Lets take a car database. Car registration number is primary key and color is secondary index.
Document Based Partitioning
Store the secondary index within each partition aka do local mapping. Each partition maintains id of red cars present inside that partition. The downside is if we search for red cars, we have to hit all partitions even if the partition does not contain red car. This approach is called scatter/gather.
Term Based Partitioning
The alternate is to maintain a global index of the partitions based on secondary index as well. This helps to narrow down only needed partitions based on the term used. Downside is during writes, we have to build up the global secondary index as well.
Our data is always evolving - volume increases, data updates happen. This means we have to rebalance our partitions as well across the nodes. We will look at the strategies to handle that.
A simple strategy is to create many more partitions than there are nodes, and assign several partitions to each node. With 10 nodes and 1000 partitions scenario, create 100 partitions for each node. To increase performance or reduce node overload, add additional nodes and new nodes steal few partitions from the existing nodes. But data always has only 1000 partitions. The main problem is fixing the partition count. It is very difficult to get it right.
In this case, partitions are configured with a size in MB or GB. Whenever the size exceeds, partitions are split. Node contains multiple partitions similar to fixed partitioning and when new nodes are added, partitions can be rebalanced. This ensures data skew does not happen and partitions are proportional to data volume.
Partitioning based on nodes
In the above approaches, number of partitions are independent of number of nodes. The approach is to have a fixed number of partitions per node. The size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged. When additional nodes are added, the partitions become smaller again. It is best used with hash based partitioning.
Since our cluster is always being rebalanced, how does a client know which node it needs to connect to. There are two ways this is generally handled
Using a central service like zookeeper -> Each node sends the partition it holds to zookeeper and client can use that information to identify the node.
gossip protocol -> Client sends request to any node. Each node constantly gets information from its neighbouring nodes similar to human gossip. Using this information, the node routes request to appropriate node. More work on each data nodes but avoids central coordinator.