@falsetto
Last active January 1, 2016 00:19
WAR-3 Research Document

Objectives

An HBase schema that is well-suited to our dashboard's access patterns and can easily answer the questions we'll ask.

Common questions from our initial dashboard iteration

  • What are the daily sum/average metrics for the last 30 days for this location?
  • What are the daily sum/average metrics for the last 30 days for this list of locations?

Common questions from our eventual dashboard

  • What are the daily/hourly(/minutely?) metrics for the last n days for this location?
  • What are the daily/hourly(/minutely?) metrics for the last n days for this list of locations?

Considerations when designing our schema to answer these questions

  • How can we design our schema in such a way that we can pull all metrics for a given location or list of locations without full table scans/random seeks?
  • As row keys are duplicated for every column, how can we minimize the size of our row keys to minimize our storage requirements?
  • As column families are duplicated for every column, how can we minimize the use of column families or eschew them altogether to minimize our storage requirements? I'll just give this one away: we don't need column families at all.
  • How can we optimize our access patterns by creating neither a terribly tall schema, nor a terribly wide schema? Specifically, how can we end up with rows around 3-5KB in size?
  • How can we avoid re-inventing the wheel if our problem is a common one that has already been solved?

Findings

In my research (watching the HBase Con 2012 schema design presentation, watching other presentations, and doing research on popular projects built on HBase), I came across OpenTSDB. Its creators describe it thusly:

"OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable."

Seems like a good fit for our project as we need to store time series metrics at a large scale and make the data easily accessible and graphable.

OpenTSDB was written and is maintained at StumbleUpon and has been proven in production over a handful of years by a who's-who list of startups and established tech companies.

Note: Since OpenTSDB was written with server monitoring in mind, it captures metrics from individual "hosts". Metrics are tagged with the host name of the server they came from. OpenTSDB literature and tutorials always refer to "hosts" as the source of metrics. However, we're not interested in monitoring servers with OpenTSDB, but rather KickBack locations. To allow us to avoid tedious or confusing translations from their jargon to ours, just assume for the rest of this essay that "host" = "KickBack location".

What OpenTSDB provides

Server-side

OpenTSDB provides a "time series daemon" (hereafter "TSD") which exposes an HTTP REST API for writing and reading data to/from HBase. TSD is shared-nothing, so we can fire up many instances and load balance across them. Conveniently, TSD buffers writes for ~1 second and batch-writes them to avoid unnecessarily taxing the HBase cluster.

For data retrieval, OpenTSDB provides aggregate and filtering functions. So though we'll capture data at far higher granularity than we'll initially expose in the dashboard, OpenTSDB provides functionality to roll it up into daily averages or sums.
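To make that rollup concrete, here's a toy sketch of what "aggregate fine-grained points into daily sums or averages" means. This is my own illustration, not OpenTSDB's implementation; the function name and shape of the input are hypothetical:

```python
# Toy daily rollup: aggregate (unix_ts, value) pairs captured at fine
# granularity into per-day sums or averages, the way we'd ask OpenTSDB to.
from collections import defaultdict

DAY = 86400  # seconds in a day

def daily_rollup(points, agg="sum"):
    """Bucket points by calendar day (UTC) and reduce each bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts // DAY * DAY].append(value)
    if agg == "sum":
        return {day: sum(vs) for day, vs in sorted(buckets.items())}
    return {day: sum(vs) / len(vs) for day, vs in sorted(buckets.items())}

# Two points on day 0, one point on day 1.
points = [(0, 1.0), (3600, 2.0), (DAY + 60, 4.0)]
```

In production we'd let OpenTSDB's own aggregators do this server-side; the sketch is just to show the shape of the question we're asking.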

Client-side

OpenTSDB provides a client-side library written in Python called tcollector. Tcollector is intended to run directly on the individual nodes where metrics will be captured. It provides an API for writing "collectors" to capture various types of data. The built-in collectors primarily cover daemon monitoring (HAProxy, Mongo, MySQL), but the tcollector documentation states that it is straightforward to write custom collectors. So we could presumably put together some Python code that would monitor our transaction processing system (logs? Recent DB writes?) and pass those metrics along to TSDs. I'm not sure where we would run tcollector daemons, as the Traffic Cops can't run Python. Perhaps on the transaction processing servers? On dedicated servers?

How the OpenTSDB schema fits our access patterns

Row keys in OpenTSDB are (conventionally, more on that shortly) composed of:

metric key + timestamp (timestamp being in seconds, rounded down to the most recent 10 minute increment)

Column qualifiers are an integer offset, in seconds, from the row's base timestamp to the moment the data point was recorded. Column values are ints or floats capturing the actual metric.
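That split between row timestamp and qualifier offset can be sketched in a couple of lines (a toy model of the layout described above, not OpenTSDB's actual code; names are my own):

```python
# Sketch of OpenTSDB-style row timestamps and column qualifiers.
# ROW_SPAN is the row width in seconds (10 minutes, per the convention above).
ROW_SPAN = 600

def split_timestamp(ts):
    """Return (base, offset): the row's base timestamp, rounded down to the
    most recent 10-minute boundary, and the offset stored in the qualifier."""
    base = ts - (ts % ROW_SPAN)
    return base, ts - base

# A data point at 1387594387 lands in the row based at 1387594200,
# stored under a column qualifier offset of 187 seconds.
base, offset = split_timestamp(1387594387)
```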

Each row also has n "tags" associated with it (which are just arbitrary key-value pairs). The most common tag in OpenTSDB examples is a 'host' tag which identifies which host the metrics in this row came from.

The "metric key" portion of the row key is a clever mechanism that warrants some explanation. Rather than storing metric names as strings of unpredictable length (e.g. 'mysql.slowqueries', 'load_average', etc.) at the beginning of row keys, OpenTSDB stores those strings in a separate lookup table and uses a 3-byte id as the foreign key. This enables row keys in the actual metric table to remain compact and also constant in length (constant-length row keys are highly recommended in HBase). So rather than a row key being 'mysql.slowqueries+1387594320', it's '0x1+1387594320'. Then we have a row in the lookup table with a row key of '0x1' and column value of 'mysql.slowqueries'. The byte size of the key (e.g. '0x1') is such that we can have ~16 million uniques, which should suit our needs for quite a while.
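A toy sketch of that UID lookup idea, with in-memory dicts standing in for the two HBase lookup tables (the 3-byte width matches the ~16 million uniques mentioned above; everything else here is hypothetical):

```python
# Toy model of OpenTSDB's UID lookup: metric names map to fixed-width ids
# so row keys stay compact and constant-length. Dicts stand in for the
# two HBase lookup tables (name -> id and id -> name).
UID_WIDTH = 3  # 3 bytes => ~16.7 million unique metric names

name_to_uid, uid_to_name = {}, {}

def get_or_create_uid(metric):
    """Return the fixed-width id for a metric name, assigning one if new."""
    if metric not in name_to_uid:
        uid = len(name_to_uid).to_bytes(UID_WIDTH, "big")
        name_to_uid[metric] = uid
        uid_to_name[uid] = metric
    return name_to_uid[metric]

# Row keys now start with a constant-length id instead of a variable-length
# name: 3 bytes of metric uid + 4 bytes of base timestamp = 7 bytes total.
key = get_or_create_uid("mysql.slowqueries") + (1387594200).to_bytes(4, "big")
```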

One other perk of OpenTSDB's thoughtful design is that since row keys start with the 3-byte lookup key, rows are evenly distributed across region servers and hotspotting isn't an issue (as it would be if the timestamp came first in the row key).

Explanation for reasoning behind 10 minute increments in row keys and offset in column qualifiers

  1. Atomicity: We'll want to run many TSDs concurrently. Since HBase guarantees atomic row writes, it behooves us to avoid race conditions resulting in duplicated data by writing the fewest rows necessary. Rather than writing a new row for every metric data point that comes in, we only write a new row every 10 minutes. In his HBase Conf 2012 presentation, the author of OpenTSDB gave an example of a race condition resulting in duplicate inserts that was remedied by the "only write new rows every n seconds/minutes" strategy. Unfortunately, I'm on a plane right now, just finished a Vodka tonic, and am failing to elucidate. I think it had something to do with a tcollector not receiving a timely write confirmation from a TSD and retrying. Whatever it was, I assure you it was compelling.
  2. Contiguousness: HBase writes columns for a given row in a given column family contiguously. Moderately wide rows (~5KB - ~100MB) are easy for HBase to handle resource-wise and, being contiguous, give us the guarantee of avoiding random seeks that are more difficult to avoid with tall tables.
  3. Compactness: Once a row is over 10 minutes old, it will not receive any new writes, as a new row will be created. Since row keys are duplicated for every column, we can reclaim space by going back and compacting rows with multiple columns into a single column. OpenTSDB handily does this automatically. And it of course understands both the non-compacted and compacted schema and can read either. If we only stored one value per row, we would miss this opportunity for compaction.
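The compaction idea in point 3 can be illustrated with a toy model, one dict standing in for a row's columns with qualifier offsets as keys (my own illustration, not OpenTSDB's actual on-disk format):

```python
# Toy compaction: once a row is closed (older than 10 minutes), its many
# (offset -> value) columns can be rewritten as a single column, so the
# row key overhead is paid once instead of once per column.
def compact_row(columns):
    """Merge a row's columns into one (qualifier, value) pair, ordered by
    offset, mimicking the effect of OpenTSDB's background compaction."""
    offsets = sorted(columns)
    qualifier = tuple(offsets)                   # concatenated qualifiers
    value = tuple(columns[o] for o in offsets)   # concatenated values
    return {qualifier: value}

# Three columns (offsets 13, 187, 371 seconds into the row) become one.
row = {187: 42.0, 13: 40.5, 371: 43.1}
compacted = compact_row(row)
```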

How the OpenTSDB schema doesn't fit our access patterns

The OpenTSDB schema was written with the goal of answering the question "What are the measurements across all hosts for this specific metric during this timespan". So the main bits of data filtered on are a) the metric name and b) the timespan. Thus these are conventionally the two things that compose row keys. (Though, again, the metric name is actually represented by a lookup key (e.g. '0x1') that links to a lookup table for compactness.)

However, our use case differs a bit. We instead want to ask "What are the measurements for this specific list of hosts for all metrics during this specific timespan".

Since nearly all of our queries will involve filtering on location id, we should promote it from a tag stored in a column to being part of the row key.

I discovered on the OpenTSDB mailing list that there is precedent for this use case. The solution is to swap the metric name and host name (location id in our case). Thus, the location id becomes the first part of the row key and the metric name is moved to a tag.

This would allow us to ask "What are all the metrics for this location id within this time range?" which could be answered with a single contiguous scan. Of course, if a user has selected 20 locations in the dashboard, this question would need to be asked 20 times with different location IDs, but at least each one of those questions could be answered with a contiguous scan.
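A sketch of what those per-location contiguous scans look like under the swapped key layout (location uid first, then base timestamp; helper names and key widths here are my own assumptions, not OpenTSDB internals):

```python
# Sketch of the swapped row-key layout: location uid first, then timestamp.
# With this ordering, all metrics for one location over a time range sit in
# one contiguous key range, so each selected dashboard location costs one scan.
ROW_SPAN = 600  # 10-minute rows, as in the stock schema

def scan_range(location_uid, start_ts, end_ts):
    """Return the (start_key, stop_key) pair for a contiguous HBase scan
    covering one location over [start_ts, end_ts]."""
    base = start_ts - (start_ts % ROW_SPAN)  # round down to the row boundary
    start = location_uid + base.to_bytes(4, "big")
    stop = location_uid + (end_ts + 1).to_bytes(4, "big")
    return start, stop

# A dashboard with two selected locations => two contiguous scans.
ranges = [scan_range(uid.to_bytes(3, "big"), 1387500000, 1387586400)
          for uid in (1337, 1338)]
```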

The big outstanding question is whether OpenTSDB can make a single read from HBase, get back a data set with mixed tags, and aggregate them separately. For example, our proposed schema would interleave rows for kickback location 1337 containing metrics for enrollments, card issuances, and point issuances. Ideally, we could just swoop them up all at once and contiguously to avoid random seeks, then have OpenTSDB de-interleave them, then roll up to "day" resolution for each metric, and give us the result. I'm not yet deep enough into OpenTSDB to know if this is doable.

How OpenTSDB may need to be tweaked to suit our use case

When monitoring server health, StumbleUpon recommends checking in every 10-15 seconds. Given this interval, StumbleUpon decided to make OpenTSDB create new rows every 10 minutes, resulting in about 60 columns per row. Apparently, rows of about 60 columns (in OpenTSDB's schema) yield what StumbleUpon considers ideal "medium-width" rows. However, we're not monitoring server health. We're monitoring transactions at retail locations. So it may or may not make sense for us to create new rows every 10 minutes. It all depends on how often we'll have our tcollectors submit metrics to OpenTSDB. Fortunately, the author of OpenTSDB stated in his presentation that the 10-minute increment is (kind of) configurable. It's a compile-time constant, so depending on how difficult it is to compile OpenTSDB, it may be a pain in the ass to tweak. Though certainly possible. Also, one may argue that it shouldn't be easy to change this, because if we ended up storing some data at a certain increment and some at another increment, we'd probably have a bad time.

Other considerations

OpenTSDB wasn't built for multi-tenancy. We need to tack on:

  • Authentication
  • Data separation

For auth, the OpenTSDB author recommends putting another HTTP REST API in front of OpenTSDB. Easy enough. This would also give us a layer to translate the OpenTSDB responses into the schema expected by the dashboard front-end.

For data separation, we could either prefix all of our location ids in the lookup table with the program name, or we could create separate tables in HBase for each program and fire up a separate TSD (or a cluster of TSDs) for each program. Using separate tables seems preferable to me as it affords us the opportunity to create physical isolation between program data sets.

Storing Alerts

I agree that alerts should be stored outside of HBase. It seems like MongoDB or MySQL would both work fine.
