@stackedsax
Last active August 29, 2015 14:04
Blogpost July 2014

Cloud Metrics Snapshot, August 2014

An important part of Rackspace's Monitoring pipeline is the metrics we gather along the way. A small team called Cloud Metrics is dedicated to these metrics. We are also known as the Blueflood team, since we authored the open-source blueflood.io project, the technology at the heart of our Cloud Metrics service. We've been hard at work improving this part of our business, and some changes are underway that I think are worth sharing.

What We're Up To

In short, these are the two primary things we're working on:

  • We're upgrading our hardware to have even more capacity
  • We're making this product an HTTP-based service and opening it to the public

What This Means

Capacity

The upgrade to newer machines has a couple of implications, but the obvious one is increased capacity and performance. We ingest close to 2M metrics/minute right now, and to support various Rackspace-wide projects we expect that rate to grow to 40-50M metrics/minute over the next year, so we are preparing ourselves for the onslaught.
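To put those numbers in perspective, here is a quick back-of-the-envelope sketch. The only inputs are the figures quoted above (2M metrics/minute today, 40-50M metrics/minute as the target); the function and constant names are made up for illustration.

```python
# Figures quoted in the post, in metrics per minute.
CURRENT_PER_MINUTE = 2_000_000
TARGET_LOW_PER_MINUTE = 40_000_000
TARGET_HIGH_PER_MINUTE = 50_000_000


def per_second(per_minute: int) -> float:
    """Convert a per-minute ingestion rate to a per-second rate."""
    return per_minute / 60


def growth_factor(target_per_minute: int, current_per_minute: int) -> float:
    """How many times larger the target rate is than the current rate."""
    return target_per_minute / current_per_minute


if __name__ == "__main__":
    print(f"current: ~{per_second(CURRENT_PER_MINUTE):,.0f} metrics/second")
    for target in (TARGET_LOW_PER_MINUTE, TARGET_HIGH_PER_MINUTE):
        print(
            f"target:  ~{per_second(target):,.0f} metrics/second "
            f"({growth_factor(target, CURRENT_PER_MINUTE):.0f}x current)"
        )
```

In other words, the target is a 20x to 25x increase over what we ingest today.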

Public

The more visible change is that this service becomes publicly accessible. Cloud Metrics used to be merely a step-child of the Cloud Monitoring product. As such, it had a Thrift API that the Cloud Monitoring team had developed for its own internal purposes.

We have removed the Thrift API and made Cloud Metrics available as an HTTP API. We have set up all the necessary wiring to make this just another standard-issue Rackspace service. While the only 'customer' right now is Cloud Monitoring, these changes pave the way for any customer, big or small, to send metrics our way and retrieve them through a standard-issue HTTP API.
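As a rough idea of what sending a metric over HTTP could look like, here is a minimal sketch that builds (but does not send) a POST request carrying one data point as JSON. Everything specific in it is an assumption made for illustration and is not taken from this post: the host name, the URL path, the tenant ID, and the JSON field names are all hypothetical, so check the actual API documentation before relying on any of them.

```python
import json
import time
import urllib.request


def build_ingest_request(base_url, tenant_id, metric_name, value, ttl_seconds=172800):
    """Build a POST request carrying a single metric data point as JSON.

    The URL layout and the field names below are assumptions for
    illustration only, not a confirmed description of the service.
    """
    payload = [
        {
            "metricName": metric_name,                  # hypothetical field name
            "metricValue": value,                       # hypothetical field name
            "collectionTime": int(time.time() * 1000),  # epoch milliseconds
            "ttlInSeconds": ttl_seconds,                # hypothetical field name
        }
    ]
    url = f"{base_url}/v2.0/{tenant_id}/ingest"         # hypothetical path
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder host and tenant; nothing is sent over the network here.
    request = build_ingest_request(
        "https://metrics.example.invalid", "tenant-123", "web01.cpu.load", 0.42
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would actually send it; a real call would also need whatever authentication the service requires, which this sketch leaves out.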

Where We're At

We're about halfway through the changeover. Here's a breakdown of the things we've done and what we're working on right now:

Progress So Far

  • Our new production hardware is set up and ingesting production data as we speak
  • We've migrated all the old rollups to the new production hardware
  • We have all the wiring for the public HTTP API set up

Still In Progress

  • Point all queries to the new production cluster
  • Work out a new method of metric indexing
  • Deprecate the old production hardware

So, some big, big milestones accomplished; some big milestones yet to reach.

What Happens Next

Finishing all the work in progress would be a huge relief for the team and allow us to work on a world of problems and questions we have been eager to address. Things like:

  • Using Blueflood as a backend to Graphite and Grafana
  • Annotation support in Blueflood
  • Integrations with other teams in Rackspace
  • A better data persistence layer with Kafka
  • Using Cloud Metrics for more than monitoring data
  • Courting the open-source community with the Blueflood project

We have a lot that we want to accomplish, and the work that we're doing right now will set us up to achieve all of it. I'll post another update in a couple of months to let you know how far along we are. In addition, the team is planning to write a few articles that go into more technical depth on how we have done what we've done.

@gdusbabek

"What happens next" refers to using Blueflood as a backend to UI systems, but a newcomer might not know that BF underpins the entire system, or that it is an open source project.

@stackedsax

Correct, and I just wrote something in the intro to reflect that.

@chinmay-gupte

I would like to see something in "What happens next" saying that we are planning to make our system more reliable by adding a persistence layer like Kafka. IMO this is closely intertwined with our SLA: being able to measure the performance of the system through better monitoring and tracing capabilities, how we react in case of failures, and the guarantees we provide for data persistence, rollup, and query.
Also, we do have string metrics support in blueflood fwiw, but I am not sure what exactly you want to imply.
