@sacreman
Last active September 10, 2016 21:39

Problem

tl;dr: We don't want to remain a Datadog-level solution forever, keeping the price low(ish) by only supporting lightweight events. Nor do we want standard SaaS log pricing with one of those slider bars that increases the $$$ as the GB count goes up until you give up and set up open source.

We want a middle ground: cheap enough that you don't constantly worry about log volume, fast enough to be pleasant to use in the app, and covering most DevOps troubleshooting use cases. People should be able to get alerted by time series data and then drill down into the context of what happened with log data.

Long Version

Now is the opportune moment to look at event data while the time series analytics front-end is being finished. Event data is similar to time series data but differs just enough that it requires a different data store and a bunch of additional services.

Events are often infrequent and contain variable data structures. They carry a timestamp, are usually line-separated, and are parsed and indexed to extract key-value pairs of dimensional data used in searches and queries. Like time series data, events are immutable, ordered and append-only.
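As a rough illustration of that shape, a line-separated event can be parsed into a timestamp plus dimensional key-value pairs. The line format and field names below are assumptions for illustration, not an actual wire format:

```python
from datetime import datetime

def parse_event(line):
    """Parse a line-separated event of the (hypothetical) form
    '<ISO8601 timestamp> key1=value1 key2=value2 ...' into a
    timestamp plus a dict of dimensional key-value pairs."""
    ts_str, _, rest = line.partition(" ")
    ts = datetime.fromisoformat(ts_str.replace("Z", "+00:00"))
    dims = dict(kv.split("=", 1) for kv in rest.split() if "=" in kv)
    return ts, dims

ts, dims = parse_event("2016-09-10T21:39:00Z level=error service=api user=alice")
# dims -> {'level': 'error', 'service': 'api', 'user': 'alice'}
```

The extracted dimensions are what an index would be built over; the raw line itself stays immutable and append-only.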

The simplest use of an events store would be for application audit data e.g. keeping a log of who did what in the application and using it for event feeds. We do this already but moving it out of our application database would be good housekeeping. We also need a place to store plugin error output to help troubleshoot random plugin failures. These would be displayed in the UI as a timeline in addition to being included in alert messages. We would also shift our custom annotations over to being stored here.
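For example, an application audit event could be modelled as a small immutable record that gets serialised and appended, never updated in place. The field names here are hypothetical, not our actual schema:

```python
import json
import time

def audit_event(actor, action, target):
    """Build an audit event recording who did what in the application.
    Field names are illustrative assumptions, not Dataloop's schema."""
    return {
        "timestamp": time.time(),  # event time, seconds since epoch
        "type": "audit",
        "actor": actor,
        "action": action,
        "target": target,
    }

# Append-only: the event is serialised once and written, never mutated.
line = json.dumps(audit_event("alice", "deleted", "dashboard-42"))
```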

Over time, as we expand our list of integrations, we will start to pull in events from places like AWS and GCE, display them on event-specific dashboard widgets, and make it possible to alert on event data.

To do all of the above we could easily use any number of databases that exist today. The volume isn't that high for those features and the current options available would probably be reasonable to operate even at a high number of customers. However, we don't want to just build another Datadog which sprinkles events on top of time series.

Longer term we want to follow the traditional unified monitoring trajectory and start to provide logs, APM and distributed tracing features. All of those features use event data alongside time series.

What we really want, to provide an awesome amount of context when troubleshooting a problem, is to ingest extremely large quantities of event data from a variety of sources including log files.

Unfortunately, there is a very simple reason why SaaS log monitoring tools are currently losing to open source options like ELK and Graylog: price. Dataloop could go the same route and end up with the same outcome. We could set up large Elasticsearch clusters and then expand our engineering team to manage it all. The cost quickly adds up, and I'm sure we'd end up at the same pricing model as everyone else.

Short Term Requirements

Low volume, very simple queries.

  • Lightweight events (e.g. audit events, annotations)
  • Plugin error output
  • External integration events (e.g. AWS)

This would improve our events capability in the Dataloop UI.

Medium Term Requirements

High volume, highly complex queries.

  • Log data (e.g. web, application)
  • Full text search across all event data

Another data ingress point and a Kibana plugin to start visualising the log data. Eventually parsers, indexers and some way to extract key fields, plus a bunch of query engine work.
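To sketch what "parsers and some way to extract key fields" might look like for web log data, here is a minimal regex-based parser for an Nginx combined-style access log line. The regex and field names are illustrative assumptions, not a finished indexer:

```python
import re

# Matches the common parts of an Nginx "combined"-style access log line.
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def extract_fields(line):
    """Extract key fields from an access log line, or None if it
    doesn't match the expected format."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

fields = extract_fields(
    '10.0.0.1 - - [10/Sep/2016:21:39:00 +0000] "GET /api/v1/events HTTP/1.1" 200 512'
)
# fields["status"] == "200", fields["path"] == "/api/v1/events"
```

A real indexer would run something like this at ingest time and store the extracted key-value pairs as the searchable dimensions of the event.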

Long Term Requirements

High volume, highly complex queries and well defined data models.

  • APM
  • Distributed Tracing

This needs libraries and an end to end design to map data to a model which could then be implemented in the app.

Proposal

Work is under way to add events to DalmatinerDB. From Heinz's initial results it looks like we can solve the 80/20 problem around log data. He needs to build an events store anyway for Project FiFo which needs an audit log. We have a bunch of talent now that could help out and Erlang looks like a good fit for most of the problem.

By using 95% of the code already written in DalmatinerDB we could leapfrog to a solution covering the short term objectives in a few weeks. Storage would be highly efficient and the operational overhead would be extremely minimal, since it would initially just be another 5-node DalmatinerDB cluster set up to handle only the event data type.

The medium term objectives are more of an unknown. The proposal is to defer the search index design until the short term objectives are complete and the ingress and flat-file storage are finished. Internally I would suggest we start pushing Nginx and application (JSON) log data into the event store, then iterate on a Kibana plugin until queries cover the most useful tasks people perform when troubleshooting issues. By the time that all works we will have been using it for a while, and building these features into Dataloop will be low risk.

Trade Offs

What you lose in return for low support cost and high performance.

Data Safety

Same deal as with our time series stuff: mostly perfect, but not possible to guarantee that every single event is stored. I believe the Dalmatiner trade-offs fit events too. We're not trying to detect the Higgs boson.

Features

It's going to be pretty bare-bones to start with. Not basing it on Lucene could be a massive win or a massive fail. It's going to be far simpler and faster in flat files; Splunk actually does it this way. Whether we'll end up with a full feature set in Kibana is a massive unknown.
