@Integralist
Created April 26, 2019 09:52
[CDN Logs Sampling Rates] #cdn #logs #sampling #maths #requests #traffic #load

CDN Uncached Request Log Sampling

Why?

When we began passing Fastly logs through to Datadog, we were concerned about the volume of requests we'd be logging. To cut down on this volume we decided to only pass through requests that were not satisfied by the cache. While this has reduced the log volume considerably, at times of high traffic we've still seen millions of log events passed through. To remain cost-conscious while using Datadog for visibility into these logs, a sampling strategy should be implemented.

Proposed Sampling Rates

| Fastly Status | Meaning of Status | Sample Rate | Justification |
|---|---|---|---|
| PASS | Serving uncached content with no intention of caching the response | 100% | PASS requests are frequently just administrative, but having them easily available is useful for debugging. As there were only 66k events in this category for the sample time period chosen, sending 100% of them to Datadog seems like a reasonable approach. |
| MISS | Serving uncached content | 10% | MISS is the vanilla "uncached content" response and these logs reflect expected behavior, so we can sample this category quite conservatively. Rather than relying on logs to expose major changes here, we should ensure appropriate metrics are captured, and use full logs pulled from S3 when deeper insight is needed. |
| MISS-WAIT | Serving uncached content and had to wait an unusual amount of time for a highly contested object | 100% | In the sample time period chosen, over 3m events were shipped; only 2 of them had the cache response status MISS-WAIT, and both returned a 400. These are useful logs to have in their entirety for debugging issues with the CDN. |
| MISS-CLUSTER | Serving uncached content; the object will be served from another node in the PoP | 10% | This is the largest category of logs: for the sample time period chosen, 3.37m log events carried this cache status. All were successful request fills, with 3.2m returning 200. This is a "healthy" request with expected behavior and can therefore be sampled quite conservatively. Rather than relying on logs to expose major changes here, we should ensure appropriate metrics are captured, and use full logs pulled from S3 when deeper insight is needed. |
| MISS-WAIT-CLUSTER | Serving uncached content; the object will be served from another node in the PoP, and we waited an unusual amount of time for a highly contested object | 100% | In the sample time period chosen, over 3m events were shipped; only 24 of them had the cache response status MISS-WAIT-CLUSTER. These logs reveal objects that are in high demand yet uncached for whatever reason, which can be a clue when debugging CDN issues, so they would be good to have easily available. |
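To make the proposal concrete, here is a minimal sketch of how these per-status rates could be applied as a per-request sampling decision. The `SAMPLE_RATES` mapping and `should_ship_log` helper are illustrative names only; in practice the decision would most likely live in the Fastly service configuration (e.g. its logging condition) rather than in application code.

```python
import random

# Proposed sample rates per Fastly cache response status (mirrors the table above).
SAMPLE_RATES = {
    "PASS": 1.0,               # 100%
    "MISS": 0.10,              # 10%
    "MISS-WAIT": 1.0,          # 100%
    "MISS-CLUSTER": 0.10,      # 10%
    "MISS-WAIT-CLUSTER": 1.0,  # 100%
}

def should_ship_log(cache_status: str) -> bool:
    """Decide whether a single uncached-request log event is forwarded to Datadog."""
    rate = SAMPLE_RATES.get(cache_status, 1.0)  # ship unknown statuses in full
    return random.random() < rate
```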

Pricing

Quick napkin math says that adopting this sampling strategy will generate ~375k log events/hr and cost roughly $6.21/day (at $0.69 per 1m events), for a total of $187/mo (rounded up).
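For reference, the napkin math spelled out (a 30-day month is assumed):

```python
events_per_hour = 375_000      # ~375k sampled log events/hr
price_per_million = 0.69       # $0.69 per 1m events

events_per_day = events_per_hour * 24                          # 9,000,000 events/day
cost_per_day = events_per_day / 1_000_000 * price_per_million  # ~$6.21/day
cost_per_month = cost_per_day * 30                             # ~$186.30, i.e. ~$187/mo rounded up

print(f"${cost_per_day:.2f}/day, ${cost_per_month:.2f}/mo")
```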
