samproxy CacheCapacity tuning notes

@dcarley · created July 28, 2020

I was originally hoping to calculate the appropriate CacheCapacity from metrics at 10% of traffic and then scale up from there; however, it proved harder than I expected.

We have the following variables that we can tweak (a rough sketch of how they interact follows the list):

  • cache per pod: needs to be large enough that we can absorb spikes in trace volume or duration, but not so large that we waste memory

  • memory per pod: should stay below 8G so that pods pack efficiently onto Kubernetes nodes

  • number of pods: more pods make it easier to scale horizontally and recover from failure, but result in more peer-to-peer traffic and topology changes
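
A minimal sketch of how these knobs relate, assuming memory per pod grows roughly linearly with CacheCapacity. Both constants are back-of-envelope assumptions fitted to the measurements later in these notes (roughly 1.1G at 25k and 1.7G at 50k CacheCapacity), not anything published by samproxy:

```go
package main

import "fmt"

// Back-of-envelope model of per-pod memory as a function of CacheCapacity.
// Both constants are assumptions fitted to the measurements in these notes,
// not measured values, and they ignore spans-per-trace variation (which
// also influences memory usage).
const (
	baselineBytes       = 500 << 20 // ~0.5G of process overhead (assumption)
	bytesPerCachedTrace = 24 << 10  // ~24KB per cached trace (assumption)
)

// estimateMemoryGB returns the rough per-pod memory in GiB for a given
// CacheCapacity.
func estimateMemoryGB(cacheCapacity int) float64 {
	return float64(baselineBytes+cacheCapacity*bytesPerCachedTrace) / (1 << 30)
}

func main() {
	for _, c := range []int{25_000, 50_000, 150_000} {
		fmt.Printf("CacheCapacity=%d -> ~%.1fG per pod\n", c, estimateMemoryGB(c))
	}
}
```

Under these assumptions a CacheCapacity of 150k predicts around 4G per pod, comfortably inside the 8G target above.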

I managed to determine the limits at which we wouldn't evict traces at 10% of peak and non-peak throughput with three nodes. However, the values were well over an order of magnitude greater than the number of in-flight traces calculated from the formula described in our config. I've listed the results below.
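
For reference, a sketch of that in-flight estimate. I'm assuming the config formula is the usual Little's law calculation (in-flight traces ≈ trace throughput × average trace duration); treat the exact form as an assumption rather than samproxy's documented wording:

```go
package main

import "fmt"

// inFlightTraces estimates concurrent traces via Little's law:
// L = λ × W, i.e. arrival rate times average time in the system.
func inFlightTraces(tracesPerSec, avgTraceDurationSec float64) float64 {
	return tracesPerSec * avgTraceDurationSec
}

func main() {
	// 1150 traces/sec with an assumed ~4.5s average trace duration roughly
	// reproduces the ~5000 in-flight traces observed at 10% of traffic
	// (see the figures below).
	fmt.Printf("~%.0f in-flight traces\n", inFlightTraces(1150, 4.5))
}
```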

It also wasn't possible to verify this in a local deployment: a laptop can't sustain that many traces, and there are too many variables in trace duration (which varies a lot depending on which min/max/granularity you look at) and spans per trace (which doesn't affect the cache size but does influence memory usage).

So I settled on this configuration by taking (see the worked example after this list):

  • cache capacity required for no evictions at 10% of peak traffic: 50k
  • multiplied by the number of nodes: 50k*3 = 150k
  • multiplied by 10 to scale from 10% up to 100% of traffic: 150k*10 = 1500k
  • divided across 10 pods at full traffic so that each stays under the 8G limit: 1500k/10 = 150k per pod
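
The same arithmetic as a runnable sketch. The numbers come straight from the list above; the 10-pod count at full traffic is taken from the final division step:

```go
package main

import "fmt"

func main() {
	const (
		perPodAt10pct = 50_000 // no evictions at 10% of peak, per pod
		podsAt10pct   = 3      // test cluster size
		scaleToFull   = 10     // 10% -> 100% of traffic
		podsAtFull    = 10     // enough pods to stay under 8G each
	)

	totalAt10pct := perPodAt10pct * podsAt10pct // 150k across the test cluster
	totalAtFull := totalAt10pct * scaleToFull   // 1500k needed cluster-wide
	perPodAtFull := totalAtFull / podsAtFull    // 150k CacheCapacity per pod

	fmt.Println(totalAt10pct, totalAtFull, perPodAtFull) // 150000 1500000 150000
}
```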

This has resulted in no evictions during our non-peak traffic. We'll see what happens with peak traffic overnight (UTC). It's very hard to tell how much or how little headroom this gives us, though. I'm going to give feedback to Honeycomb about this and keep an eye on it over the next couple of days before we switch the datasets.

Results from the testing mentioned above:

  • 2020-07-20T13:30 BST (no evictions until traffic increased here)
    • 25k CacheCapacity, 75k across 3 nodes
    • 1.09-1.13G memory
    • 1150 traces/sec at samproxy
    • 5000 peak in-flight traces at Honeycomb, scaled down to 10%
  • 2020-07-20T23:30 BST (evictions stopped as traffic decreased here)
    • 30k CacheCapacity, 90k across 3 nodes
    • 1.38-1.44G memory
    • 1350 traces/sec at samproxy
    • 5333 peak in-flight traces at Honeycomb, scaled down to 10%
  • 2020-07-20T20:25 BST (peak traffic)
    • 2000 traces/sec at samproxy
    • 10000 peak in-flight traces at Honeycomb, scaled down to 10%
  • 2020-07-21T16:55 BST (no evictions after change)
    • 60k CacheCapacity, 180k across 3 nodes
    • 2G memory, hitting container limits
  • 2020-07-21T16:55 BST (no evictions after decreasing cache)
    • 50k CacheCapacity
    • 1.66-1.7G memory
  • 2020-07-21T17:21 BST (evictions after decreasing cache)
    • 35k CacheCapacity, 105k across 3 nodes
  • 2020-07-21T17:25 BST (no evictions after decreasing cache)
    • 40k CacheCapacity, 120k across 3 nodes
    • 1.35-1.45G memory
    • 1700 traces/sec at samproxy
    • 6333 peak in-flight traces at Honeycomb, scaled down to 10%
  • 2020-07-22T01:18 BST (no evictions during peak traffic)
    • 50k CacheCapacity, 150k across 3 nodes
    • 2G memory, hitting container limits occasionally
    • 2400 traces/sec at samproxy
    • 6666 peak in-flight traces at Honeycomb, scaled down to 10%