@markrmiller
Last active October 4, 2021 06:37
How could the current design scale?

How can the current system design be so poor at scale and performance and yet also scale and perform like I've discussed? Well, a tough part of what I say it can do on little hardware and few instances involves an absolute ton of work and changes, which would take me lifetimes to push through committee, that let you create and load tons of SolrCores in parallel, each in tens of milliseconds. But that's mostly relevant to my desire to be able to scale in every direction: collections, shards, replicas, all of that, even per instance.

I'll gloss over a thousand things and skip over a thousand more. And I don't claim anything here is the best way to go, the way anyone would want to go, the way anyone should go, or really anything other than: it's an impl that fits the current design, can be ridiculously fast, and can scale ridiculously well. That I know. The other things I don't even really care about much. I'm up for a change of scenery at this point actually, but I do have a disagreement with some perceptions about current design limitations that I could not be talked out of.

Of course, first you must do something about the current issues around leader election, leader sync, the recovery process, and anything else in the way, though much of it can start out reasonable and iterate toward better over time.

Other than that, you can turn the Overseer into what I originally thought it was going to end up being.

  • Gut the Overseer and how it loops calls to ZK to see if it's still the leader and processes things. Make it completely Watcher driven. When it's elected leader, it starts watching a collection API queue and a state update queue. State updates could be per-collection queues, either for scale if needed (it would take a surprising amount of scale for that to be needed) or to allow for multiple Overseers split between collections (again, that would take very high scale to even be needed). When it's no longer leader, it removes those watchers.
  • State updates from replicas are just dropped off to a state publisher queue. They become almost zero cost for any replica code, just a queue drop-off. The high-performance queue is consumed by a single thread that will only then bulk publish these updates to ZK at a fairly fast, reasonable rate - but just one publisher, so never faster than it can handle. If a boatload of state updates hit at once, that's one ZK call, or one per collection depending on the setup chosen. The ZK calls are async. If multiple updates come in quickly, useless updates are dropped - only the latest go out. They go out in a format that fits their tiny size. There are 4 or 5 states. You can't scale in every direction if all these costless things have cost. (A minimal sketch of this kind of publisher follows this list.)
  • The Overseer becomes a simple bit of container code with a ZkWriter. When it's enabled, that just means its watchers are working. No crazy stop times. No crazy ZK issues. No care about or attempts to cling to the Overseer role. It's cheap and fast and fine to just bail and let another Overseer get elected and take over.
  • When the Overseer reads collection API cmds or updates from the queues, it does so efficiently from the efficient, batched, small data formats. It doesn't read one item at a time; it reads all the items it finds in the queue via async calls. Most ZK interaction is via the fast and efficient async API. All of the batches are combined and coalesced again before being given to the ZkWriter. The ZkWriter keeps an in-memory state of the cluster and an in-memory data structure for the state updates of each collection. Everything it does is completely independent per collection. It's also driven by an efficient work-queue-based approach - you can't make it try to go faster than it can go or wants to go. It updates its in-memory cluster view fast and efficiently, and this does not dictate when it tells clients there are updates or when it writes to ZK. It writes structure updates to state.json, but clients read those updates when it notifies them to via another node intended for that. It keeps structure updates and state updates completely separate.
  • When collection API queue events come in, they don't trigger slow, independent build-ups and waits as the cluster state is assembled via dropping requests off to the Overseer queue. It is the Overseer; there is no use talking to itself via a ZK queue, building up a collection state one operation at a time. The request comes in, the new state is computed, it's passed to the ZkWriter, and it's written out via its efficient, coalescing work queue process. Creating a collection is a single write, not many, no queue calls, no waits. It assembles the state, it gets written out. Updates to the state are the same. Collections are handled in parallel, completely independently. You could easily create many Overseers that grab a slice of collections to temporarily manage, based on various factors or randomly. The Overseer is nimble and low cost and can easily jump around.
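Here is a minimal sketch of the kind of coalescing, single-consumer publisher described above. It is not actual Solr code; the class and method names (StatePublisher, StateUpdate, the bulkPublish callback) are illustrative, and the "small integer state code" follows the idea of 4 or 5 states fitting a tiny format.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Consumer;

    // Illustrative only: a single-consumer queue that coalesces per-replica state
    // updates and hands the latest values to an async bulk publisher at a bounded rate.
    public class StatePublisher implements Runnable {

      // Hypothetical tiny state record: replica name plus a small integer state code.
      public record StateUpdate(String replica, int state) {}

      private final BlockingQueue<StateUpdate> queue = new LinkedBlockingQueue<>();
      private final Consumer<Map<String, Integer>> bulkPublish; // e.g. one async ZK write
      private final long maxPublishIntervalMs;
      private volatile boolean closed;

      public StatePublisher(Consumer<Map<String, Integer>> bulkPublish, long maxPublishIntervalMs) {
        this.bulkPublish = bulkPublish;
        this.maxPublishIntervalMs = maxPublishIntervalMs;
      }

      // Called by replica code: near-zero cost, just a queue drop-off.
      public void submit(String replica, int state) {
        queue.offer(new StateUpdate(replica, state));
      }

      public void close() { closed = true; }

      @Override
      public void run() {
        while (!closed) {
          try {
            StateUpdate first = queue.poll(maxPublishIntervalMs, TimeUnit.MILLISECONDS);
            if (first == null) continue; // nothing to publish this interval

            // Coalesce: later updates for the same replica overwrite earlier ones,
            // so only the latest state per replica goes out.
            Map<String, Integer> latest = new HashMap<>();
            latest.put(first.replica(), first.state());
            StateUpdate next;
            while ((next = queue.poll()) != null) {
              latest.put(next.replica(), next.state());
            }

            // One bulk publish for the whole batch; the callback is expected to use
            // the async ZK API so this thread never blocks on ZooKeeper.
            bulkPublish.accept(latest);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
          }
        }
      }
    }

Because there is only one consumer, publishing can never outrun what ZK can absorb: a flood of updates collapses into one batched write per interval, and stale states for the same replica simply disappear before they cost anything.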

ZkStateReader also needs to be able to scale.

  • ZkStateReader does not manage a ClusterState object with all the collections in it that gets replaced on each update. That the current stuff is much too slow and wasteful is known, but even basic removal of the old single-cluster-state remnants is not nearly enough. It does not create a new DocCollection object per update. Collections are managed independently. State updates are managed via a thread-safe replica-to-integer map. If a single state update comes down, a single integer value is updated (it could of course be an even smaller data type). No new DocCollection, none of the surprising stuff that comes along with that; a tiny state change equals a tiny resulting flow of work, objects, and time. (See the sketch after this list.)
  • When a ZkStateReader is triggered to fetch new cluster state structure changes or state updates, this also triggers an efficient work queue system that can coalesce requests and controls request frequency, similar to the other work queues. It's set up to run and do its work using the fork-join executor, completable futures, and the ZK async API. Getting notified to update, performing the structure update request and the state update request, combining that info, updating the ZkStateReader in-memory state, and dealing with calling the current #waitFor* state predicates is all reactive, chained async calls.
  • Lots of improvements and many fixes to lazy collections, so that an instance only does work for what matters to it and quickly and reliably adjusts as necessary.
  • Creates a single recursive watcher, replacing most watchers, pretty much all of them in ZkStateReader itself. A single process method deals with notifications and dispatches them correctly. If you wanted even more scalability, for some flexibility tradeoff, collections could be listed under multiple top-level cluster names, and the recursive watcher(s) would just watch the root cluster nodes of the collections it is interested in.
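ZooKeeper 3.6+ persistent recursive watches are the kind of primitive the single-watcher approach above can sit on. A rough sketch follows; the path layout, dispatch logic, and per-collection replica-to-integer state map are illustrative assumptions, not Solr's actual ZkStateReader internals.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.zookeeper.AddWatchMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Illustrative only: one persistent recursive watcher over a collections root,
    // dispatching notifications and holding per-collection replica state as plain integers.
    public class ClusterWatcher implements Watcher {

      private final String collectionsRoot; // e.g. "/collections" - illustrative layout
      // collection -> (replica name -> small integer state code)
      private final Map<String, ConcurrentHashMap<String, Integer>> replicaStates =
          new ConcurrentHashMap<>();

      public ClusterWatcher(String collectionsRoot) {
        this.collectionsRoot = collectionsRoot;
      }

      public void start(ZooKeeper zk) throws KeeperException, InterruptedException {
        // One watcher covers the whole subtree - no per-node watch churn.
        zk.addWatch(collectionsRoot, this, AddWatchMode.PERSISTENT_RECURSIVE);
      }

      @Override
      public void process(WatchedEvent event) {
        String path = event.getPath();
        if (path == null || !path.startsWith(collectionsRoot)) return;
        // A single dispatch point: decide per path/type what (if anything) to refresh,
        // then hand the work to the coalescing fetch queue rather than doing it here.
        switch (event.getType()) {
          case NodeCreated, NodeDataChanged, NodeDeleted -> scheduleRefresh(path);
          default -> { /* connection events handled elsewhere */ }
        }
      }

      // A tiny state change updates a tiny amount of data: one integer per replica.
      public void applyStateUpdate(String collection, String replica, int state) {
        replicaStates.computeIfAbsent(collection, c -> new ConcurrentHashMap<>())
            .put(replica, state);
      }

      private void scheduleRefresh(String path) {
        // Placeholder: enqueue into the coalescing work-queue / async-fetch machinery
        // described above instead of reading from ZK inside the watcher callback.
      }
    }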

Recovery gets heavily improved and more efficient, with various steps corrected or refined.

  • Also driven by async completable futures and async ZK. Part of a broad push towards not sitting on resources unless they are being used.
  • Various improvements and speed-ups: everything uses HTTP2, async calls where appropriate, and things like segment replication happen asynchronously and outside the heap with async mmap IO, versus piping bytes through the heap, over the network, through the heap again, and onto the file system via byte streams.
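As a hedged illustration of moving segment bytes without pulling them through the JVM heap (not Solr's actual replication code), here is a sketch using FileChannel.transferTo, which lets the OS move file bytes to a network channel directly; the mmap route mentioned above would use FileChannel.map instead.

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.WritableByteChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Illustrative only: stream a segment file to a network channel without copying
    // the bytes through the JVM heap (the OS can use sendfile/mmap under the hood).
    public final class SegmentTransfer {

      public static void send(Path segmentFile, WritableByteChannel out) throws IOException {
        try (FileChannel in = FileChannel.open(segmentFile, StandardOpenOption.READ)) {
          long position = 0;
          long remaining = in.size();
          while (remaining > 0) {
            // transferTo may move fewer bytes than requested; loop until done.
            long moved = in.transferTo(position, remaining, out);
            if (moved <= 0) break;
            position += moved;
            remaining -= moved;
          }
        }
      }
    }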

HTTP2 and async eat the world.

  • Besides removing unnecessary object duplication and sharing objects that should be shared system wide, everything is fast, efficient, hardened HTTP2. Update and read requests are async.
  • Jetty servlet requests are serviced via the async servlet spec. They use async IO. Data does not have to be read or written via a byte-at-a-time stream, but can move a primitive type at a time from/to Jetty ByteBuffers.
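A minimal sketch of the async servlet IO pattern (Servlet 3.1 ReadListener) referenced above; the servlet name and body handling are assumptions for illustration, and real code would also register a WriteListener for the response side.

    import java.io.IOException;
    import javax.servlet.AsyncContext;
    import javax.servlet.ReadListener;
    import javax.servlet.ServletInputStream;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Illustrative only: Servlet 3.1 async IO - no container thread is parked while
    // waiting for request bytes; the container calls back when data is actually readable.
    public class AsyncUpdateServlet extends HttpServlet {

      @Override
      protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        AsyncContext ctx = req.startAsync();
        ServletInputStream in = req.getInputStream();
        in.setReadListener(new ReadListener() {
          private final byte[] buf = new byte[8192];

          @Override
          public void onDataAvailable() throws IOException {
            // Read only while data is ready; return (without blocking) when it is not.
            while (in.isReady() && in.read(buf) != -1) {
              // hand the bytes to request processing (details omitted)
            }
          }

          @Override
          public void onAllDataRead() {
            ctx.complete(); // request fully consumed; release the async context
          }

          @Override
          public void onError(Throwable t) {
            ctx.complete();
          }
        });
      }
    }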

Threads don't eat the world.

  • Thread management is heavily changed. The amount of work done in parallel is heavily increased. Thread pools are heavily reduced. Pool behavior is greatly improved. Container thread management and overload issues are heavily improved.
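One hedged illustration of pool behavior that favors throughput over overload: a bounded pool with a bounded queue that pushes back on submitters instead of accepting unlimited work. The sizes and rejection policy here are arbitrary examples, not the actual configuration.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    // Illustrative only: a bounded executor that applies backpressure instead of
    // piling up unbounded queues or threads when the system is saturated.
    public final class BoundedExecutors {

      public static ThreadPoolExecutor newBoundedPool(int threads, int queueCapacity) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads,
            30, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(queueCapacity),
            // When full, run the task on the submitting thread - submitters slow down
            // to the rate the pool can actually sustain rather than overloading it.
            new ThreadPoolExecutor.CallerRunsPolicy());
        pool.allowCoreThreadTimeOut(true);
        return pool;
      }
    }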

The amount of data and work gets reduced to much closer to what is actually necessary. The system favors moving towards max throughput, not max overload. The system in steady state is actually steady. When ZK communication is required, whether prompted by tons of activity and objects or few, it's ripped through extremely quickly and efficiently.

The system can do whatever the current system can do in terms of supporting features. Few radical departures, even where they could make things better.

The system can scale extremely well. It can also do so with very high performance.

As I believe other systems could. The level and scale that can be achieved in this design is absurdly dependent on implementation. As most any design will be.
