sacreman/tsdb_blog.md

## tsdb_blog.md

      
Databases are a crazy topic and it seems everyone has an opinion. The trouble is that opinions are like belly buttons. Just because everyone has one it doesn’t mean they are useful for anything.
Time series databases (TSDBs) in particular always provoke the usual “have you tried X”, where X is some obscure project with 50 commits back in 2009. Invariably, if X is something a bit more mainstream then yes, it has probably been played with. Like all software, it’s probably good at certain things and bad at others.
With all of the above in mind I decided to pen a magnum opus of my own opinions. Something I can point the HaveYouTriedX’ers at next time they make an appearance. So here it is.
My Top 10 TSDBs:

1. DalmatinerDB (no surprise here)
2. InfluxDB
3. Prometheus
4. Riak TS
5. OpenTSDB
6. KairosDB
7. Elasticsearch
8. Druid
9. Blueflood
10. Graphite (Whisper)
11. Atlas

I realise there are now 11 so just ignore anything below number 10 as I’ll probably end up adding more in future. Read on to find out the reasons behind the ordering.
Overriding Context

This blog looks at time series databases used by developers and operations teams to monitor the health and performance of a service. Every entry is judged on its suitability for that task. If you wish to debate a different context then please take the underlying data, rearrange it, add your opinion and release a different blog.
Data: https://docs.google.com/a/dataloop.io/spreadsheets/d/1sMQe9oOKhMhIVw9WmuCEWdPtAoccJ4a-IuZv4fXDHxM/edit?usp=sharing
I believe the number of metrics being collected from infrastructure and code is going to rise dramatically over the next few years. Therefore this blog has a heavy bias towards databases with a dimensional data model and those that can either scale beyond a single node, or are so highly efficient it makes up for having to manually shard them.
Scope

I set some rules to limit the scope, otherwise this blog would never end.


Only free and open source time series databases and their features have been compared. Therefore if someone asks “have you tried Kdb+ and Informix?” the answer will be no. They are probably awesome though.


The list will only include databases that either classify themselves in their marketing material as time series, or have been written about in a blog by a cool company as something they are using for time series data.


I’m only going to accept additional suggestions to expand the top 10 if they are actually useful within the context of this blog. This means they can handle hundreds of thousands of writes per second with reasonable query performance and have a data model that includes labels.


Caveats

I’ve tried a handful of the top 10 databases and have a good amount of experience with them. This blog was reviewed by Heinz Geis who has tried others, particularly the Cassandra backed ones, in the past. Testing, validating claims and checking sources in a professional way like a true journalist hasn’t been done. What has been done is reading the official docs, reading Stack Overflow, looking through GitHub issues and code and generally hacking the information together. With this in mind some facts may be incorrect.
If anyone spots anything factually wrong please let me know and I’ll update the blog.
Benchmarking has been based on marketing claims and estimation. Why? Because benchmarking is a sizeable chunk of work and prone to error. You always get “you should have tuned this special undocumented setting”. The numbers listed are highly favourable to most databases. They are either the numbers blogged about or claimed on Twitter at some time in the past. If you feel any numbers are wrong let me know and I’ll update them.
The way I’d like to see benchmarking work is that you get to choose the highest powered server you can possibly lay your hands on. Then set up a reasonable real world test, like firing in 6000 metrics per agent over 500 agents at 1 second granularity (3 million metrics per second) and seeing if queries still work. Benchmarking reads is even more subjective so I wussied out and just went for fast, moderate and slow.
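To make the size of that test concrete, here is a minimal sketch of the arithmetic behind it. The function name and its parameters are made up for illustration; only the 6000 metrics, 500 agents and 1 second interval come from the text above.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def points_per_second(agents: int, metrics_per_agent: int, interval_seconds: float) -> float:
    """Write rate the database has to sustain for a fleet of agents."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return agents * metrics_per_agent / interval_seconds


# The scenario described above: 500 agents, 6000 metrics each, every second.
rate = points_per_second(agents=500, metrics_per_agent=6000, interval_seconds=1)
print(f"{rate:,.0f} points per second")
print(f"{rate * SECONDS_PER_YEAR:,.0f} points per year")
```

Relaxing the interval to 10 seconds divides the write rate by 10, which is why granularity matters so much when comparing claimed numbers.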
Another subjective row is the query functionality score. I took 5 as the max score and gave Prometheus 5/5 for being awesome, then took 1 point away for being too complex and a bit ugly to look at. Everything else was compared against that score and based on power, expressiveness and wealth of cool functions.
Observations

All of the TSDBs are eventually consistent. This means that if your use case for time series data needs guaranteed storage of every point then you will probably need to write your own time series layer on top of something like MySQL or Postgres.
Databases built from the ground up for time series data are significantly faster than those that sit on top of non-purpose-built databases like Riak KV, Cassandra and Hadoop.
Performing queries across billions of metrics looking for labels that only match a few of them (a common scenario with time series data at scale) is really slow in Cassandra. This is because of the way it stores data in columns. The same applies to any columnar database, including Google's BigQuery, all of which have a natural disadvantage in the time series arena.
Storage efficiency is extremely important for time series data. The purpose built TSDBs only consume a few bytes of storage per data point. Why this is important becomes clear when you start to add up the costs.
Let's say you have 1 server, and you want to record a single metric for its overall CPU % utilisation over a year at a 1 second interval. That's 31,536,000 points that need to be stored. With DalmatinerDB that takes up 31 MB on disk and at the other end of the scale it would use 693 MB on disk with Elasticsearch. No big deal so far.
Now let's say you are crazy enough to actually be storing 3 million metrics per second as per the benchmark. With DalmatinerDB that's 93 TB of disk space needed for a year's worth of data. With Elasticsearch it's just over 2 PB. On S3 that would be the difference of $3k per month vs $63k per month and unfortunately SSDs (which is what most time series databases run on) are magnitudes pricier than cloud storage. Now factor in replication and you're in a world of pain.
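The sums above can be reproduced with a few lines of Python. This is only a sketch: the per-metric sizes are the ones quoted in the text, decimal units are used (1 TB = 1,000,000 MB), and the price of $0.03 per GB per month is an assumption picked because it lands close to the $3k and $63k figures. It is not a quoted or current S3 price.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 points per metric at 1s

# Disk used by one metric over one year, in MB (figures quoted in the text).
MB_PER_METRIC_YEAR = {"DalmatinerDB": 31, "Elasticsearch": 693}

METRICS = 3_000_000  # metrics written once per second, as in the benchmark

# ASSUMPTION: illustrative object storage price, not taken from the text.
ASSUMED_USD_PER_GB_MONTH = 0.03


def bytes_per_point(mb_per_metric_year: float) -> float:
    """Average on-disk bytes per data point for a 1 second interval."""
    return mb_per_metric_year * 1_000_000 / SECONDS_PER_YEAR


def fleet_storage_tb(mb_per_metric_year: float, metrics: int) -> float:
    """Disk needed to hold one year of data for all metrics, in TB."""
    return mb_per_metric_year * metrics / 1_000_000


def monthly_cost_usd(storage_tb: float, usd_per_gb_month: float) -> float:
    """Monthly bill for keeping that much data at a flat per-GB price."""
    return storage_tb * 1000 * usd_per_gb_month


for name, mb in MB_PER_METRIC_YEAR.items():
    tb = fleet_storage_tb(mb, METRICS)
    cost = monthly_cost_usd(tb, ASSUMED_USD_PER_GB_MONTH)
    print(f"{name}: {bytes_per_point(mb):.1f} bytes/point, "
          f"{tb:,.0f} TB per year, ${cost:,.0f} per month")
```

The cost is for holding a full year of data, before any replication. Multiply by the replication factor to see how quickly it gets painful.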
Results

There are some pros and cons listed in the spreadsheet for each database. I'll elaborate a bit on those below.
DalmatinerDB - First Place

When I was searching for the best time series database I wanted something fast, that would be easy to scale and operate, and that wouldn't lose all of my data. It also needed to support a dimensional data model and expressive query language with a variety of functions. No such thing existed at the time so we took DalmatinerDB into the back of an aircraft hangar, played some A-Team music and about a year later we have a rocket powered tank.
DalmatinerDB is at least 2-3 times faster for writes than any other TSDB in the list. It can write millions of points per second on a single node compared to some databases that can only manage a few tens of thousands.
Query performance is also good and can be split out and scaled independently with multiple query engines. Metadata lookups (queries across labels) are handled by Postgres which is blazingly fast.
The DalmatinerDB storage engine, which was designed around the properties of the ZFS filesystem, has the best storage efficiency out of any in the list.
Because DalmatinerDB is based on Riak Core you get all of the benefits of the riak command line utilities. You can cluster join, plan and commit and watch the usual rebalancing magic happen as data automatically shifts around the ring.
The query language is very similar to SQL and is comparable to the other top TSDBs. Additional work is under way to increase the number of available functions to match InfluxDB and Prometheus.
There are some downsides. Although the code is clean and small, as it relies on mature technologies for the heavy lifting, it is written in Erlang, which puts some people off contributing. On maturity, it has been in production for a couple of years at a few large companies monitoring Project Fifo cloud orchestration platforms, and it has been stable and in production at Dataloop for over a year. The past few weeks have been spent bringing the community open source code 100% in sync and packaged to work on Linux. So while the code is stable and well tested, the docs and the user experience need a bit of work. We'll release a blog on exactly how to set up DalmatinerDB on Ubuntu 16.04 very soon.
The other problem some people had was with the highly efficient binary protocol for getting the data in. We have a few more client libraries now, as well as a metrics proxy which will convert various metric formats like OpenTSDB line protocol into DalmatinerDB. There's also a good quality Grafana 3 data source plugin.
InfluxDB - Second Place

InfluxDB got a lot of bashing recently on Hacker News for the decision to close source the clustering. However, I believe it's still the next best option after DalmatinerDB. In another couple of years, had the clustering remained open and been proven reliable and efficient through lots of peer review and testing, like what happened with Riak Core and Cassandra, it could have been the de facto best choice. It's pretty fast on a single node and has a lot of very cool features and efficient storage. If I knew my data would always fit within a single node it would probably be number 1 on this list.
Prometheus - Third Place

It's a bit unfair to rank Prometheus on this list. Prometheus is a great monitoring system with a very cool time series database built in for local storage. Prometheus is incredibly fast and has been highly optimised for querying time series data. It uses varbit encoding to get down to 1.3 bytes per data point, which is almost as good as DalmatinerDB. I'm not entirely sure how you would use Prometheus as solely a time series database as it was never designed for that use case. I guess you'd have to scrape an API that was connected to a queue and manually shard. That would make for a fairly quirky architecture but could be fun. Prometheus is 3rd place because quite frankly, even though it wasn't designed to be a time series database, it's still better than most other options.
Riak TS - Fourth Place

This is a very new database and the docs don't provide good benchmark numbers. I had to pluck some numbers out of someone's presentation on Slideshare. The storage efficiency is also a complete unknown. It does however look like quite a good all-rounder. It's moderately fast, has a lot of flexibility baked into the schema design and the query language looks good. It's based on Riak KV (which itself is built on Riak Core). This is the first of the databases that wasn't built from the ground up for time series and is a layer on top of something else. I read there has been a lot of work done optimising the write paths and as it evolves over time I'm sure it will start moving up the list.
OpenTSDB - Fifth Place

Old reliable. You can find lots of information online and it's generally agreed that running a Hadoop stack is not enjoyable. However, it works and scales beyond a single node, and when compared to the other options starts to look almost worth doing. If you have Hadoop in-house already then the decision becomes slightly easier.
KairosDB - Sixth Place

The first of the TSDBs built on Cassandra. This is probably the best of the bunch, although as mentioned above it's not very good for querying large quantities of dimensional data. Beyond a certain number of labels, queries are going to start eating up all of the memory and timing out. If used carefully and at lower volumes it could be a good choice.
Elasticsearch - Seventh Place

This isn't really a TSDB. However, when some people have a hammer they use it for everything. It's one of those things that shouldn't be done, but to be honest, in some circumstances, it works. If you already have Elasticsearch and a bunch of spare space, and your metrics per second are reasonably low, then why not?
Druid - Eighth Place

Running Zookeeper and HDFS wouldn't be my first choice. Druid is a powerful analytics database and seems best suited for providing fast queries over a large quantity of data. It's pretty slow to push metrics into Druid, and for the types of queries that operations and developers run across millions of labels I'm not convinced it would be nice to use at scale.
Blueflood - Ninth Place

The Blueflood docs seem to mostly consist of a half-updated GitHub wiki page. The reason Blueflood is so far down the list is that I couldn't work out whether it supported labels. My suspicion is that it doesn't, and therefore while being a good replacement for Graphite it doesn't really fit in with the other TSDBs with more powerful data models. Blueflood is also based on Cassandra but uses Elasticsearch as an index. It seems there was some talk about adding labels in the past, so if anyone knows if that happened, let me know and I'll re-evaluate the score.
Graphite (Whisper) - Tenth Place

This is the poster child for the past generation of time series databases. More powerful than RRD was when it first came out but now quite outdated. Add to that the fact that it doesn't scale. I've read the blog that shows how to scale it and still came away thinking it didn't really scale. It's on the list because it was amazing 5 years ago and did a lot to get people graphing things.
Atlas - Eleventh Place

There are a couple of databases in use at very large companies. Netflix has Atlas and Facebook has Gorilla. Both took the approach that they had vast quantities of metrics coming out of their systems and to scale their monitoring system the only practical choice was to build an in-memory clustered time series database. I'm not entirely sure how practical these TSDBs are outside of the mega companies. From reading the Atlas GitHub issues list it seems some people would quite like to implement long term storage on disk.