@timmc · Last active March 19, 2019 19:40

Success-prioritized, concurrency-limited stochastic fallback cascade

Algorithm:

  • State (sketched in Java after this list)
    • For each node, store:
      • Rolling window of 6 historical stat buckets
      • One additional "sticky" bucket
    • Write stats to newest bucket
    • Age out the oldest bucket every 5 seconds, and add a new one
      • If the oldest bucket had data, copy it over the "sticky" bucket
    • Recorded stats: Number of finished requests, number that finished successfully
  • Prioritization by health
    • Assign each node a success rate:
      • If there are finished requests in the historical buckets: Divide the weighted sum of successes by the weighted sum of finished requests, with bucket weights decaying exponentially from newest to oldest (most recent bucket weighted highest)
      • No finished requests in historical buckets:
        • If there is data in the sticky bucket: Compute success rate from that data, with a floor of 0.0001 divided by node count
        • No data in sticky bucket either: Use value of 1
    • Derive a weight for each node: Success rate cubed
    • Perform a weighted shuffle of nodes. Alternatively, approximate it:
      • Shuffle connections
      • Order by descending success rate
      • Pick one connection by weighted random and promote to first
  • Fallback cascade (sketched in Java after this list)
    • Iterate over prioritized connections
    • Try to get a lease from each one (using the concurrency-limits library) and use the first connection whose lease is available
    • If none is available, return an error to the caller
    • Otherwise, make the call
    • Tell concurrency-limits whether it was a success, a timeout, or an unrelated error
    • Atomically increment the connection's finished-request count and (if applicable) success count
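
The per-node state and health scoring might look roughly like the following Java sketch. The names (`NodeStats`, `Bucket`, `DECAY`) are illustrative assumptions rather than a reference implementation; the decay factor of 3 comes from the fast-detection note below, and rotation is assumed to run on a single timer thread.

```java
import java.util.concurrent.atomic.LongAdder;

/**
 * Per-node rolling stats: 6 historical buckets plus one sticky bucket.
 * Illustrative sketch only; a request recorded during rotation may land in a
 * just-aged bucket, which is acceptable for this purpose.
 */
class NodeStats {
    static final int WINDOW = 6;            // 6 historical buckets, rotated every 5 seconds
    static final double DECAY = 3.0;        // each bucket weighted 3x more than the next older one
    static final double FLOOR = 0.0001;     // divided by node count when only sticky data exists

    static final class Bucket {
        final LongAdder finished = new LongAdder();
        final LongAdder succeeded = new LongAdder();
        boolean hasData() { return finished.sum() > 0; }
    }

    private final Bucket[] window = new Bucket[WINDOW]; // index 0 = newest
    private volatile Bucket sticky = null;
    private final int nodeCount;

    NodeStats(int nodeCount) {
        this.nodeCount = nodeCount;
        for (int i = 0; i < WINDOW; i++) window[i] = new Bucket();
    }

    /** Record a finished request in the newest bucket (called at the tail of each request). */
    void record(boolean success) {
        Bucket newest = window[0];
        newest.finished.increment();
        if (success) newest.succeeded.increment();
    }

    /** Every 5 seconds: age out the oldest bucket; if it had data, it becomes the sticky bucket. */
    synchronized void rotate() {
        Bucket oldest = window[WINDOW - 1];
        if (oldest.hasData()) sticky = oldest;
        System.arraycopy(window, 0, window, 1, WINDOW - 1);
        window[0] = new Bucket();
    }

    /** Exponentially weighted success rate over the window, falling back to the sticky bucket. */
    double successRate() {
        double finished = 0, succeeded = 0, weight = 1.0;
        for (int i = WINDOW - 1; i >= 0; i--) {  // oldest first, so the newest bucket gets the largest weight
            finished += weight * window[i].finished.sum();
            succeeded += weight * window[i].succeeded.sum();
            weight *= DECAY;
        }
        if (finished > 0) return succeeded / finished;
        Bucket last = sticky;
        if (last != null && last.hasData()) {
            double rate = last.succeeded.sum() / (double) last.finished.sum();
            return Math.max(rate, FLOOR / nodeCount); // floor keeps a dead node barely selectable
        }
        return 1.0; // clean slate: treated as completely healthy
    }

    /** Selection weight: success rate cubed, so unhealthy nodes fall off quickly. */
    double selectionWeight() {
        double r = successRate();
        return r * r * r;
    }
}
```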

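A companion sketch (same caveats) of the prioritization and fallback steps. `Node`, `Lease`, `Request`, and `Response` are hypothetical stand-ins; in particular, `Lease` only mimics the shape of the concurrency-limits acquire/success/dropped/ignore listener pattern, not the library's actual API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeoutException;

interface Request {}    // placeholder types for whatever the RPC layer uses
interface Response {}

/** A backend connection plus its stats and its concurrency limiter (hypothetical interface). */
interface Node {
    double selectionWeight();        // success rate cubed, e.g. from NodeStats above
    Optional<Lease> tryAcquire();    // stand-in for acquiring a slot from the concurrency limiter
    Response call(Request request) throws Exception;
    void record(boolean success);    // atomically bump finished/success counts
}

/** Stand-in for the limiter's listener: report how the leased request turned out. */
interface Lease {
    void onSuccess();   // request succeeded
    void onDropped();   // timeout / load-related failure: feeds the limiter's load signal
    void onIgnore();    // unrelated error: should not move the concurrency limit
}

class Cascade {
    /** Approximate weighted shuffle: shuffle, sort by weight, then promote one weighted-random pick. */
    static List<Node> prioritize(List<Node> nodes) {
        List<Node> ordered = new ArrayList<>(nodes);
        Collections.shuffle(ordered);
        ordered.sort((a, b) -> Double.compare(b.selectionWeight(), a.selectionWeight()));
        double total = 0;
        for (Node n : ordered) total += n.selectionWeight();
        double pick = ThreadLocalRandom.current().nextDouble() * total;
        for (int i = 0; i < ordered.size(); i++) {
            pick -= ordered.get(i).selectionWeight();
            if (pick <= 0) {
                ordered.add(0, ordered.remove(i));  // promote the weighted-random pick to first
                break;
            }
        }
        return ordered;
    }

    /** Walk the prioritized nodes, use the first one with a free lease, and report the outcome. */
    static Response execute(List<Node> nodes, Request request) throws Exception {
        for (Node node : prioritize(nodes)) {
            Optional<Lease> lease = node.tryAcquire();
            if (!lease.isPresent()) continue;       // concurrency limit reached; fall back to next node
            try {
                Response response = node.call(request);
                lease.get().onSuccess();
                node.record(true);
                return response;
            } catch (TimeoutException e) {
                lease.get().onDropped();
                node.record(false);
                throw e;
            } catch (Exception e) {
                lease.get().onIgnore();
                node.record(false);
                throw e;
            }
        }
        throw new IllegalStateException("no backend has lease capacity"); // all limits hit: fail fast
    }
}
```

The weighted-random promotion keeps the healthiest node from absorbing every request, while the health-ordered remainder preserves a sensible fallback order when leases run out.
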
Notes:

  • Startup: Initially, every node is a tabula rasa and has an equal chance of being selected. However, we don't want a positive feedback loop that leads to one node being considered healthy and the others being untrusted by default, which is why a clean-slate node is considered completely healthy: it needs to be "competitive" with nodes that already have data.
    • However: Incrementing request stats on the tail of the request means there's a delay before a bad node is uncovered at startup or after a dynamic config change. If we're willing to wait 6 seconds before timing out and considering it an error, that's 6 seconds of requests that might fail before we have stats. We'll rely on concurrency-limits to limit the damage from this slow-failure scenario.
  • Recovery: Accomplished by the aging out of buckets combined with the no-recent-data floor. A connection with a 0% success rate is never otherwise picked as long as other connections are healthy. The sticky bucket helps us distinguish between a newly configured node and one that has simply not received any requests recently, whether due to bad health or a low request rate. Even if a node has no recent requests and the last recorded data was 0% success, we give it a slight chance to be picked again. A floor of 0.0001 divided by the number of nodes keeps the damage low if the node is still unhealthy, but can bring it back into service quickly at a high request rate.
  • Fast detection: Each stat bucket is weighted more heavily than the next older one, perhaps by a factor of 3 (see the worked example after this list). Narrower buckets would enhance this effect by reducing dilution.
    • There is likely a more precise algorithm that takes fully into account the number of outstanding requests and more quickly detects a sudden timeout condition. However, the concurrency-limits library should help with this.
  • Decorrelation: This algorithm uses weighting to prefer healthier nodes, but weighted-random selection to keep from sending all requests to the current-healthiest one.
    • When a node is completely down, it takes no requests unless the concurrency limit is reached on the healthy nodes.
    • When all are down, fail fast.
    • When all nodes are suffering, distribute requests proportionally.
  • Fast-fail: Under non-timeout error conditions, we always try to make a request. This assumes that if the backends are truly struggling under load, they will show a latency increase, at which point the concurrency-limits will kick in.
    • If this assumption is challenged, consider implementing a circuit breaker (or perhaps use Hystrix) and include it in the cascade alongside concurrency-limits.
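
As a rough worked example of the factor-of-3 weighting (assuming roughly equal traffic in each 5-second bucket): the six bucket weights, newest to oldest, would be 3^5 through 3^0, i.e. 243, 81, 27, 9, 3, 1, which sum to 364. The newest bucket alone carries about two thirds of the weighted score and the newest two carry about 89%, so a node that suddenly starts failing is penalized heavily within one or two bucket rotations.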