@Integralist
Created April 26, 2019 09:52
[CDN Logs Sampling Rates] #cdn #logs #sampling #maths #requests #traffic #load

CDN Uncached Request Log Sampling

Why?

When we began passing Fastly logs through to Datadog, we were concerned about the volume of requests we'd be logging. To cut down on this volume we decided to only pass through requests that were not satisfied by the cache. While this has reduced the log volume considerably, at times of high traffic we've still seen millions of log events passed through. To remain cost-conscious while using Datadog for visibility into these logs, a sampling strategy should be implemented.

Proposed Sampling Rates

| Fastly Status | Meaning of Status | Sample Rate | Justification |
|---|---|---|---|
| PASS | Serving uncached content with no intention of caching the response | 100% | PASS requests are frequently just administrative, but having them easily available is useful for debugging. As there were only 66k events in this category for the sample time period chosen, sending 100% of them to Datadog seems like a reasonable approach. |
| MISS | Serving uncached content | 10% | MISS is the vanilla "uncached content" response and these logs reflect expected behavior, so we can sample this category quite conservatively. Rather than relying on logs to expose major changes here, we should ensure appropriate metrics are captured, and use full logs pulled from S3 when deeper insight is needed. |
| MISS-WAIT | Serving uncached content and had to wait an unusual amount of time for a highly contested object | 100% | In the sample time period chosen, over 3m events were shipped; only 2 of them had the cache response status MISS-WAIT, and both returned a 400. These are useful logs to have in their entirety for debugging issues with the CDN. |
| MISS-CLUSTER | Serving uncached content; the object will be served from another node in the PoP | 10% | This is the largest category of logs: for the sample time period chosen, 3.37m log events carried this cache status. All were successful request fills, with 3.2m returning 200. This is a "healthy" request with expected behavior and can therefore be sampled quite conservatively. Rather than relying on logs to expose major changes here, we should ensure appropriate metrics are captured, and use full logs pulled from S3 when deeper insight is needed. |
| MISS-WAIT-CLUSTER | Serving uncached content; the object will be served from another node in the PoP, and we waited an unusual amount of time for a highly contested object | 100% | In the sample time period chosen, over 3m events were shipped; only 24 of them had the cache response status MISS-WAIT-CLUSTER. These logs reveal objects that are in high demand yet uncached for whatever reason, which can be a clue when debugging CDN issues, so they would be good to have easily available. |
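To make the proposal concrete, here is a minimal sketch of how these per-status rates could be applied as a per-request sampling decision. The `SAMPLE_RATES` mapping and `should_ship_log` helper are illustrative names only; in practice the decision would most likely live in the Fastly service configuration (e.g. its logging condition) rather than in application code.

```python
import random

# Proposed sample rates per Fastly cache response status (mirrors the table above).
SAMPLE_RATES = {
    "PASS": 1.0,               # 100%
    "MISS": 0.10,              # 10%
    "MISS-WAIT": 1.0,          # 100%
    "MISS-CLUSTER": 0.10,      # 10%
    "MISS-WAIT-CLUSTER": 1.0,  # 100%
}

def should_ship_log(cache_status: str) -> bool:
    """Decide whether a single uncached-request log event is forwarded to Datadog."""
    rate = SAMPLE_RATES.get(cache_status, 1.0)  # ship unknown statuses in full
    return random.random() < rate
```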

Pricing

Quick napkin math says that adopting this sampling strategy will generate ~375k log events/hr and cost roughly $6.21/day (at $0.69 per 1m events), for a total of $187/mo (rounded up).
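For reference, the napkin math spelled out (a 30-day month is assumed):

```python
events_per_hour = 375_000      # ~375k sampled log events/hr
price_per_million = 0.69       # $0.69 per 1m events

events_per_day = events_per_hour * 24                          # 9,000,000 events/day
cost_per_day = events_per_day / 1_000_000 * price_per_million  # ~$6.21/day
cost_per_month = cost_per_day * 30                             # ~$186.30, i.e. ~$187/mo rounded up

print(f"${cost_per_day:.2f}/day, ${cost_per_month:.2f}/mo")
```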
