samproxy CacheCapacity tuning notes

@dcarley · created July 28, 2020

I was originally hoping to calculate the appropriate CacheCapacity from metrics at 10% of traffic and then scale up from there; however, it proved harder than I expected.

We have the following variables that we can tweak (a rough sketch of how they interact follows the list):

  • cache per pod: needs to be large enough that we can absorb spikes in trace volume or duration, but not so large that we waste memory

  • memory per pod: should stay below 8G so that pods pack efficiently onto Kubernetes nodes

  • number of pods: more pods make it easier to scale horizontally and recover from failure, but result in more peer-to-peer traffic and topology changes
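
A minimal sketch of how these knobs relate, assuming memory per pod grows roughly linearly with CacheCapacity. Both constants are back-of-envelope assumptions fitted to the measurements later in these notes (roughly 1.1G at 25k and 1.7G at 50k CacheCapacity), not anything published by samproxy:

```go
package main

import "fmt"

// Back-of-envelope model of per-pod memory as a function of CacheCapacity.
// Both constants are assumptions fitted to the measurements in these notes,
// not measured values, and they ignore spans-per-trace variation (which
// also influences memory usage).
const (
	baselineBytes       = 500 << 20 // ~0.5G of process overhead (assumption)
	bytesPerCachedTrace = 24 << 10  // ~24KB per cached trace (assumption)
)

// estimateMemoryGB returns the rough per-pod memory in GiB for a given
// CacheCapacity.
func estimateMemoryGB(cacheCapacity int) float64 {
	return float64(baselineBytes+cacheCapacity*bytesPerCachedTrace) / (1 << 30)
}

func main() {
	for _, c := range []int{25_000, 50_000, 150_000} {
		fmt.Printf("CacheCapacity=%d -> ~%.1fG per pod\n", c, estimateMemoryGB(c))
	}
}
```

Under these assumptions a CacheCapacity of 150k predicts around 4G per pod, comfortably inside the 8G target above.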

I managed to determine the limits at which we wouldn't evict traces at 10% of peak and non-peak throughput with three nodes. However, the values were well over an order of magnitude greater than the number of in-flight traces calculated from the formula described in our config. I've listed the results below.
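
For reference, a sketch of that in-flight estimate. I'm assuming the config formula is the usual Little's law calculation (in-flight traces ≈ trace throughput × average trace duration); treat the exact form as an assumption rather than samproxy's documented wording:

```go
package main

import "fmt"

// inFlightTraces estimates concurrent traces via Little's law:
// L = λ × W, i.e. arrival rate times average time in the system.
func inFlightTraces(tracesPerSec, avgTraceDurationSec float64) float64 {
	return tracesPerSec * avgTraceDurationSec
}

func main() {
	// 1150 traces/sec with an assumed ~4.5s average trace duration roughly
	// reproduces the ~5000 in-flight traces observed at 10% of traffic
	// (see the figures below).
	fmt.Printf("~%.0f in-flight traces\n", inFlightTraces(1150, 4.5))
}
```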

It also wasn't possible to verify this in a local deployment: a laptop can't sustain that many traces, and there are too many variables in trace duration (which varies a lot depending on which min/max/granularity you look at) and spans per trace (which doesn't affect the cache size but does influence memory usage).

So I settled on this configuration by taking (see the worked example after this list):

  • cache capacity required for no evictions at 10% of peak traffic: 50k
  • multiplied by the number of nodes: 50k*3 = 150k
  • multiplied by 10 to scale from 10% up to 100% of traffic: 150k*10 = 1500k
  • divided across 10 pods at full traffic so that each stays under the 8G limit: 1500k/10 = 150k per pod
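
The same arithmetic as a runnable sketch. The numbers come straight from the list above; the 10-pod count at full traffic is taken from the final division step:

```go
package main

import "fmt"

func main() {
	const (
		perPodAt10pct = 50_000 // no evictions at 10% of peak, per pod
		podsAt10pct   = 3      // test cluster size
		scaleToFull   = 10     // 10% -> 100% of traffic
		podsAtFull    = 10     // enough pods to stay under 8G each
	)

	totalAt10pct := perPodAt10pct * podsAt10pct // 150k across the test cluster
	totalAtFull := totalAt10pct * scaleToFull   // 1500k needed cluster-wide
	perPodAtFull := totalAtFull / podsAtFull    // 150k CacheCapacity per pod

	fmt.Println(totalAt10pct, totalAtFull, perPodAtFull) // 150000 1500000 150000
}
```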

This has resulted in no evictions during our non-peak traffic. We'll see what happens with peak traffic overnight (UTC). It's very hard to tell how much or how little headroom this gives us, though. I'm going to give feedback to Honeycomb about this and keep an eye on it over the next couple of days before we switch the datasets.

Results from the testing mentioned above:

  • 2020-07-20T13:30 BST (no evictions until traffic increased here)
    • 25k CacheCapacity, 75k across 3 nodes
    • 1.09-1.13G memory
    • 1150 traces/sec at samproxy
    • 5000 peak in-flight traces at Honeycomb, scaled down to 10%
  • 2020-07-20T23:30 BST (evictions stopped as traffic decreased here)
    • 30k CacheCapacity, 90k across 3 nodes
    • 1.38-1.44G memory
    • 1350 traces/sec at samproxy
    • 5333 peak in-flight traces at Honeycomb, scaled down to 10%
  • 2020-07-20T20:25 BST (peak traffic)
    • 2000 traces/sec at samproxy
    • 10000 peak in-flight traces at Honeycomb, scaled down to 10%
  • 2020-07-21T16:55 BST (no evictions after change)
    • 60k CacheCapacity, 180k across 3 nodes
    • 2G memory, hitting container limits
  • 2020-07-21T16:55 BST (no evictions after decreasing cache)
    • 50k CacheCapacity
    • 1.66-1.7G memory
  • 2020-07-21T17:21 BST (evictions after decreasing cache)
    • 35k CacheCapacity, 105k across 3 nodes
  • 2020-07-21T17:25 BST (no evictions after decreasing cache)
    • 40k CacheCapacity, 120k across 3 nodes
    • 1.35-1.45G memory
    • 1700 traces/sec at samproxy
    • 6333 peak in-flight traces at Honeycomb, scaled down to 10%
  • 2020-07-22T01:18 BST (no evictions during peak traffic)
    • 50k CacheCapacity, 150k across 3 nodes
    • 2G memory, hitting container limits occasionally
    • 2400 traces/sec at samproxy
    • 6666 peak in-flight traces at Honeycomb, scaled down to 10%