@robey
Created March 18, 2013 18:12
idempotent service metrics

idempotent operations to store service metrics: we weren't able to come up with a good solution to this.

our rule of thumb for network failures was that packets will get lost, so you can choose between two behaviors for server operations (there's a rough sketch of both after this list):

  • at LEAST once: if a response gets lost, retry. you may end up doing the same operation twice.

  • at MOST once: if a response gets lost, give up. the operation might have been lost before the server saw it.
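
(a rough sketch, not from the original note: what the two policies look like from the client side, with a made-up `send` callable that raises `socket.timeout` when the response is lost, and an invented retry limit.)

```python
import socket

def at_least_once(send, request, max_retries=5):
    """retry until a response arrives; the server may apply the request twice."""
    for _ in range(max_retries):
        try:
            return send(request)   # the response can be lost after the server acted
        except socket.timeout:
            continue               # retry: risk of a duplicate operation
    raise RuntimeError("gave up after %d attempts" % max_retries)

def at_most_once(send, request):
    """send once and give up on timeout; the operation may never be applied."""
    try:
        return send(request)
    except socket.timeout:
        return None                # dropped: risk of losing the operation
```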

if your operations are idempotent, you can use "at least once" mode for everything. we did that for almost every server i can think of. it was an explicit design goal of flock (the social graph database) and finagle (the server-building toolkit).

for the storage backend, we used cassandra's counter mode, which gives you sharded, replicated counters that you can just send "add" to. but unlike normal cassandra operations, "add" isn't idempotent.
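
(to make that concrete, here's a toy in-memory model, my own sketch rather than anything from cassandra, of why a retried "add" double-counts while a retried "set" doesn't.)

```python
# toy stand-in for a cassandra row, keyed by (timestamp, service, machine, metric)
counters = {}

def add(key, delta):
    # counter-style write: not idempotent, a duplicate delivery double-counts
    counters[key] = counters.get(key, 0) + delta

def set_value(key, value):
    # plain write: idempotent, a duplicate delivery lands on the same value
    counters[key] = value

key = ("2013-03-18T18:12", "web", "web001", "errors")

add(key, 7)
add(key, 7)                    # retry after a lost response
assert counters[key] == 14     # oops: the retry doubled the count

set_value(key, 7)
set_value(key, 7)              # retrying a "set" is harmless
assert counters[key] == 7
```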

this would be fine if we were only storing one item per (timestamp, service, machine, metric). we could just use "set" instead of "add" and retries would be harmless. but we were trying to coalesce important metrics at write-time too, like having one graph that combined all the machines in a service. if you had a thousand machines in a service, the read requests to build a graph took too long. but if you coalesced the writes, you'd get a thousand "add" requests for a single counter, all without any distinguishing ID.
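
(here's my own illustration of why the coalesced counter is the hard part: a per-machine sample has a key that uniquely names it, so duplicates can be recognized or simply overwritten, while the per-service "add" requests all look identical.)

```python
stored = {}
seen = set()

def write_per_machine(timestamp, service, machine, metric, value):
    # the key itself names the sample, so a retry can be detected and dropped,
    # or the write can just be a "set" and repeated safely
    key = (timestamp, service, machine, metric)
    if key in seen:
        return
    seen.add(key)
    stored[key] = value

def write_per_service(timestamp, service, metric, delta):
    # the coalesced counter only ever sees (timestamp, service, metric) plus a
    # delta; a thousand machines, and any retries, all look the same, so there
    # is nothing to dedupe on and duplicates silently inflate the total
    key = (timestamp, service, metric)
    stored[key] = stored.get(key, 0) + delta
```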

we couldn't think of a way to make these operations idempotent, so we had to decide whether we wanted to sometimes count a service metric twice, or sometimes lose it. we tried "at least once", and when the servers were overloaded, some graphs would show alarming spikes in things like "errors per second" or "murders". so we had to back down and use "at most once". engineers would rather see gaps/drops in the graph than incorrect data.

this bugged us a lot. i ended up convinced that the only real way to fix it is to make sure every incoming stat name is unique, using "set" instead of "add", so the cassandra operations can be idempotent. and that means having some other way to coalesce the per-service metrics.

if i were doing it over

i'd set up a memcache (or similar) to hold the last few minutes of incoming stat data. give everything an expiration so memcache forgets old data without needing handholding. then run incoming writes through memcache and into cassandra. if you let reads hit both cassandra and memcache (and merge the results), you'd also give cassandra a lot of breathing room to fall up to several minutes behind on writes during an emergency.
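
(a sketch of that write/read path, with one plain dict standing in for memcache and another for cassandra; the TTL, the helper names, and the synchronous cassandra write are all my own assumptions.)

```python
import time

CACHE_TTL = 600       # keep roughly the last ten minutes of stat data in the cache

cache = {}            # stand-in for memcache: key -> (value, expiry)
cassandra = {}        # stand-in for the durable store

def cache_write(key, value):
    cache[key] = (value, time.time() + CACHE_TTL)

def cache_read(key):
    entry = cache.get(key)
    if entry is None:
        return None
    value, expires = entry
    if time.time() > expires:
        del cache[key]          # real memcache would expire this on its own
        return None
    return value

def write_stat(key, value):
    # incoming writes land in the cache first, then flow into cassandra;
    # in the real design the cassandra write could lag several minutes behind
    cache_write(key, value)
    cassandra[key] = value

def read_stat(key):
    # prefer the fresh cached value, fall back to cassandra, so readers don't
    # notice if cassandra is running behind on writes
    value = cache_read(key)
    return value if value is not None else cassandra.get(key)
```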

do the per-service-metric coalescing on demand at read time for these last few minutes, straight from the cache. after the data has been in there for a few minutes, and you're pretty sure you've received most of the metrics for a specific timestamp, coalesce and write into cassandra as a "set".
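
(a standalone sketch of that delayed coalescing step. it assumes per-machine samples keyed by (timestamp, service, machine, metric) with numeric timestamps; the five-minute delay and the function name are invented.)

```python
from collections import defaultdict
import time

COALESCE_DELAY = 300   # wait ~5 minutes before trusting that a timestamp is complete

def coalesce_and_flush(recent, cassandra, now=None):
    # sum per-machine samples into per-service rows and write them as a "set";
    # because the write is a "set", re-running this job is harmless
    now = time.time() if now is None else now
    totals = defaultdict(int)
    for (timestamp, service, machine, metric), value in recent.items():
        if now - timestamp < COALESCE_DELAY:
            continue    # too fresh; more machines may still report for this timestamp
        totals[(timestamp, service, metric)] += value
    for key, total in totals.items():
        cassandra[key] = total

# for example:
recent = {
    (1000.0, "web", "web001", "errors"): 3,
    (1000.0, "web", "web002", "errors"): 5,
}
cassandra = {}
coalesce_and_flush(recent, cassandra, now=2000.0)
assert cassandra[(1000.0, "web", "errors")] == 8
```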

if your cassandra cluster is a lot less redlined than ours was, you could even let cassandra be the memcache. write per-machine metrics directly into cassandra, and have a process coalesce the per-service metrics on a time delay, reading/writing to the database without intermediation.
