bobvawter/logging.md

## logging.md

      
    Raw
  

              logging.md
            
          
    #On Logging and Smoothies
At Perka, we have a logging system called "Firehose" that integrates every HTTP envelope, API payload, event dispatch, database query, thrown exception, and console message into a unified dashboard.  The dashboard has views for triaging production issues, allowing all records associated with a particular HTTP request to be displayed, and it also supports data-mining by writing JavaScript reduction scripts.  Firehose has been a good experiment in determining what "useful logging" really means.
Useful logs should be:

Comprehensive
Correlated
Clearly actionable

Will it blend?

I'll use a smoothie shop's kitchen blender as an analogy for your code.  A blender is a fairly deterministic device, hopefully your code is too.  Ingredients go in the top, the whirling blades of fruity dismemberment spin for a while, and you have a tasty beverage.  Unfortunately, the current smoothie-making process winds up with fruit in strange places, the blender sometimes catches fire, and your guests are unhappy.  What kinds of information would we want to record in blender.log to fix this?
We could just record the outputs of the process, by taking photographs of completed drinks.  Most of these snapshots would be of nicely decorated glasses full of tasty drinks, the glass sweating on the counter in the summer heat.  Pretty pictures, but ultimately uninteresting because things turned out correctly.  The more interesting photos are the ones where the glass has been knocked over or the fruit somehow wound up sprayed across the ceiling.  You know, the really weird stuff where you ask yourself "How did a banana get jammed into the lighting fixture?"  Recording just your outputs merely tells you where you are, but now how you got there.  You could record a video of each smoothie being made (this is certainly useful when developing new recipes), but that adds up to quite a lot of uninteresting data to sift through later.  If we were to instead photograph just the ingredients before they went into the blender, we could deduce that the blender exploded because someone substituted river rocks for ice cubes.
Now that we've tracked down why there are strawberries behind the refrigerator, we can move onto the problem where the blender keeps catching on fire.  Photos of the inputs (fruit, ice, dairy) and outputs (flames shooting from the motor) aren't particularly helpful; obviously something is missing.  Once the ingredients are loaded into the blender, you have to turn it on and wait.  While you're waiting for the blender to run, you start preparing the next batch of ingredients, only to find that you need to go to the store to buy more bananas.  Going to the store introduces non-determinism into your process.  By taking note of nondeterministic actions, you would be able to see that shopping trips are at the very least correlated with blender fires.
In other words, "comprehensive enough" is usually your inputs, outputs, and points of non-determinism as they arise.
With the kitchen under control, we can turn to solving the problems with your guests.  We know that five smoothies have come out of the kitchen and we know that all five guests are unhappy.  Indeed, we have a comprehensive guest.log which tells us exactly what each guest ordered, when they got their drink, that each of them sent their drinks back to the kitchen.  The trouble is that even with comprehensive logging, there are two different domains in the shop.  The "front of house" and the kitchen.  Without correlation between guest and blender logs, you can't see the bigger picture at play.  Suppose you have a fencepost error in assembling the ingredients for a smoothie, and by that I mean actual chunks of wood.  If we know from the guest log that Alice ordered a kale smoothie and the photo of the ingredients, from the blender log, looks suspiciously brown instead of green, we would have what is known as a "clue".  By keeping track of a common value to correlate entries in the guest and blender logs, a bigger picture emerges.
Finally, there are some things that could be logged, but that really aren't useful.  We could record the color of each guest's shoes, or wether or not Alice is left- or right-handed.  Sometimes it is worthwhile to add temporary logs when trying to determine whether or not correlations exist between things, but there are practical limits on how much information can be stored or even how many records you can read through before your eyes glaze over during a late-night fix-it session.  Logging tends to follow a pattern of under-logging (there's a problem and I don't know what's going on), over-logging (there's too much noise in the logs), and then determining what is truly actionable.  When you are deciding whether or not to record some piece of information, determine if the log message would help you satisfy a particular goal by asking yourself the question "what decision can I make using this data?"  A guest's shoe color could be actionable if we suspect that gravity is broken and guests are scuffing the ceiling, but that would be obvious from ingredients not falling to the bottom of the blender.
Implementation

In terms of an actual implementation, Firehose records various types of "Loggable" objects and streams them to disk, where a separate process reads the log streams and loads the records into a Cassandra database.  Completed log streams are encrypted and stored in S3 for long-term archival.  Each log record is a structured object, not simply a line of text, and has both an identity and a correlation UUID in addition to whatever type-specific properties it has.  Every incoming HTTP request passes through a filter which assigns a new correlation UUID and records the HTTP envelope data for requests and responses.  The HTTP request is decoded and passed to the event layer, where the correlation UUID is propagated to the main and subsequent events.  The serialized forms of the event objects are used as the log records.  Since the entire backend is event-based, the current event's correlation UUID is available to a custom SLF4J implementation, JDBC statement interceptors, and various other recording devices.  Once the various records have been loaded into Cassandra, which excels at write-intensive workloads, engineers can get a global view of a request's behavior, since it may be processed by more than one machine.  Here is some raw Firehose data which demonstrates various types of data that is collected.
--
Bob Vawter is Chief Trouble-Shooter at Perka Inc., a smart loyalty program that helps small merchants and their regular customers support one another.  He was formerly a Research Actualization Engineer at Google Inc.