maksadbek/Geo Based Sharding.md

## Geo Based Sharding.md

      
    Raw
  

              Geo Based Sharding.md
            
          
    Geo sharding

Shard Space

Like redis cluster, we'll use a shard space of 2^14 = 16384. Essentially what this means is that we'll take our entire geo area, and split it into 16384 segments of approximately equal entries.
Each instance will be responsible for any number of segments, typically a cluster of segments that are geographically close.
Searches

When performing searches and lookups in a range, a client makes a geosearch request to an intermediary merger server. This merger determines which segments are within that geosearch range, and which worker instances are responsible for those ranges, splits the segment ranges up and forwards them to the worker instances. It then merges the results and returns the merged results to the caller.
Each worker can then execute a series of highly efficient linear searches on each segment within the range to return the results.
Performance

The main performance advantages of geosharding over ID sharding are:
The geoindex (KD/Quad tree) exists in the merger, rather than the worker, leaving only an efficient linear scan in the worker
The queries don't need to look at every data segment, only the relevant ones, unlike ID sharding
The work is concentrated on the instances that contain the data segments, so the entire shardspace doesn't need to be queried for every request.
The merger needs to do significantly less work. In ID sharding, the merger wants to return the top N results, so it:

asks every worker for K <= N results
If there are w workers, the merger has wK items to merge

Using geosharding, the merger:

asks only the workers containing the relevant segments for K <= N results
If there are w workers, and g < w workers containing the relevant geographical segments, the merger has gK items to merge

This means less data is sent over the network, reducing bandwidth, CPU usage, and latency
ID sharding approach
If the workers were sharded on ID and contained their own geoindexing, evey search needs to hit every worker on a replication, meaning that in cases where you are searching for users there is a quadratic relation between number of user and the number of nodes because:

If number of users doubles, the amount of data being stored doubles, and the number of queries doubles
There is now twice as much data on each shard meaning that with an equal number of queries the CPU usage doubles as there is twice as much work done on each query (twice as much data to look through)
There are now twice as many queries because there are twice as many users, and each query must go to every shard meaning the CPU usage doubles again
This means a 2x user increase corresponds to a 4x CPU increase, and 4x user increase corresponds to a 16x CPU increase, so this approach scales quadratically on hardware

With the geosharding approach:

If the number of users doubles, the amount of data being stored doubles, and the number of queries doubles
There is now twice as much data on each shard meaning that with an equal number of queries the CPU usage doubles as there is twice as much work done on each query (twice as much data to look through)
There are now twice as many queries because there are twice as many users, but queries only need to go to the instances that have the relevant segments. Since we've doubled our machines and halved the number of segments each is responsible for, they are dealing with roughly the same number of queries
This means a 2x user increase corresponds to a 2x CPU increase, and a 4x user increase corresponds to a 4x CPU increase, so this approach scales linearly on hardware
The exception to this is once you have scaled to the point that you have one physical machine per segment, because now you need to split the shardspace in order to reduce the data present on each machine. However, the queries will still want to query the same physical geolocation range, meaning each query will simply have twice segments in it after resharding, and twice as many machines will be queried after resharding the space.

Cons:

If the global distribution of traffic changes, then the CPU usage of each machine can become imbalanced. In these situations a new geoindex version needs to be built, and a new replica needs to go up using this new distribution in order to rebalance.
Care needs to be taken here with regard to daily traffic as well. If each node contains segments spread out longitudinally across the globe, this should balance out