bfulton/swipely-sumo-logic-blog-post-draft-2014-09-18.md Secret

Cloud Log Management for Control Freaks

The following is a guest post from Bright Fulton, Director of Engineering Operations at Swipely.
Like other teams that value their time and focus, Swipely Engineering strongly prefers partnering with third-party infrastructure, platform, and monitoring services. We don't, however, like to be externally blocked while debugging an issue or asking a new question of our data. Is giving up control the price of convenience? It shouldn't be. The best services do the heavy lifting for you while preserving flexibility. The key lies in how you interface with the service: stay in control of data ingest and code extensibility.
A great example of this principle is Swipely's log management architecture. We've been happily using Sumo Logic for years. They have an awesome product and are responsive to their customers. That's a strong foundation, but because logging is such a vital function, we retain essential controls while taking advantage of all the power that Sumo Logic provides.
Get the benefits

Infrastructure services have flipped our notion of stability: instead of being comforted by long uptime, we now see it as a liability. Instances start, do work for an hour, terminate. But where do the logs go? One key benefit of a well-integrated log management solution is centralization: stream log data off transient systems and into a centralized service.
Once stored and indexed, we want to be able to ask questions of our logs, to react to them. Quick answers come from ad-hoc searches:


How many times did we see this exception yesterday?


Show me everything related to this request ID.


Next, we define scheduled reports to catch issues earlier and shift toward a strategic view of our event data.


Alert me if we didn't process a heartbeat job last hour.


Send me a weekly report of which instance types have the worst clock skew.


Good cloud log management solutions make this centralization, searching, and reporting easy.
Control the data

It's possible to get these benefits without sacrificing control of the data by keeping the ingest path simple: push data through a single transport agent and keep your own copy. Swipely’s logging architecture collects with rsyslog and processes with Logstash before forwarding everything to both S3 and Sumo Logic.

Put all your events in one agent and watch that agent.

You likely have several services that you want to push time series data to: logs, metrics, alerts. Solving each concern independently could leave you with multiple long-running agent processes that you need to install, configure, and keep running on every system. Each of those agents will solve similar problems of encryption, authorization, batching, local buffering, back-off, updates. Each comes with its own idiosyncrasies and dependencies. That’s a lot of complexity to manage in every instance.
The lowest common denominator of these time series event domains is the log. Simplify by standardizing on one log forwarding agent in your base image. Use something reliable, widely deployed, open source. Swipely uses rsyslog, but more important than which one is that there is just one.
Tee time

It seems an obvious point, but control freaks shouldn't need to export their data from third parties. Instead of forwarding straight to the external service, send logs to an aggregation server first. Swipely uses Logstash to receive the many rsyslog streams. In addition to addressing vendor integrations in one place, this point of centralization allows you to:


Tee your event stream. Different downstream services have different strengths. Swipely sends all logs to both Sumo Logic for search and reporting and to S3 for retention and batch jobs.


Apply real-time policies. Since Logstash sees every log almost immediately, it’s a great place to enforce invariants, augment events, and make routing decisions. For example, logs that come in without required fields are flagged (or dropped). We add classification tags based on source and content patterns. Metrics are sent to a metric service. Critical events are pushed to an SNS topic.
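
The policies above could be modeled in plain Ruby along these lines. This is only a sketch of the decision logic, not Swipely's Logstash configuration: the required field names, the tag patterns, and the destination symbols are all assumptions made for illustration.

```ruby
# Sketch of per-event policy and routing decisions, in plain Ruby.
# REQUIRED_FIELDS, the tag patterns, and the destination names are hypothetical.
REQUIRED_FIELDS = %w[timestamp host message].freeze

# Returns a copy of the event with classification tags added.
# Events missing a required field are flagged as 'invalid' rather than dropped.
def classify(event)
  tags = []
  tags << 'invalid' unless REQUIRED_FIELDS.all? { |field| event.key?(field) }
  message = event['message'].to_s
  tags << 'metric'   if message.start_with?('metric ')
  tags << 'critical' if message.match?(/\b(FATAL|CRITICAL)\b/)
  event.merge('tags' => tags)
end

# Decides where a classified event should be sent.
def route(event)
  tags = event.fetch('tags', [])
  destinations = [:s3, :sumo_logic]            # tee: every event goes to both
  destinations << :metrics if tags.include?('metric')
  destinations << :sns     if tags.include?('critical')
  destinations
end
```

Keeping classification separate from routing means a new downstream service only touches `route`, while new invariants only touch `classify`.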


Control the code

The output is as important as the input. Now that you’re pushing all your logs to a log management service and interacting happily through search and reports, extend the service by making use of indexes and aggregation operators from your own code.
Wrap the API

Good log management services have good APIs and Sumo Logic has several. The Search Job API is particularly powerful, giving access to streaming results in the same way we’re used to in their search UI.
Swipely created the sumo-search gem in order to take advantage of the Search Job API. We use it to permit arbitrary action on the results of a search.
# search for 5 most displayed icons
query='
  gif or jpeg or png |
  parse regex "icon[^\w]+(?<image_url>http[^\\]+)\\" |
  count by image_url | order by _count | limit 5
'
sumo -q "$query" \
  --from '2014-09-25T18:00:00' --to '2014-09-25T18:59:59' \
  --time-zone 'UTC' \
  --records | \
cut -d '"' -f 4 | xargs open  # and open images in browser
Custom alerts and dashboards

Bringing searches into the comfort of the Unix shell is part of the appeal of a tool like this, but even more compelling is bringing them into code. For example, Swipely uses sumo-search from a periodic job to send alerts that are more actionable than just the search query results. We can select the most pertinent parts of the message and link in information from other sources.
require 'aws-sdk'
require 'sumo'
require 'time'

# get all instances that have sent logs in last hour
sumo_insts = Sumo.search(
  query: 'instance_id | parse "\"instance_id\":\"*\"" as instance_id | count by instance_id',
  from: (Time.now - (60 * 60)).iso8601,
  to: Time.now.iso8601,
  time_zone: 'UTC'
).records.map { |record| record['instance_id'] }

# get all EC2 instances
ec2_insts = AWS::EC2.new.instances.map { |inst| inst.id }

# find the non-reporting instances
non_reporting_insts = ec2_insts - sumo_insts
Engineers at Swipely start weekly tactical meetings by reporting trailing seven day metrics. For example: features shipped, slowest requests, error rates, analytics pipeline durations. These indicators help guide and prioritize discussion. Although many of these metrics are from different sources, we like to see them together in one dashboard. With sumo-search and the Search Job API, we can turn any number from a log query into a dashboard widget in a couple lines of Ruby.
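
As a rough sketch of that last step, the records returned by an aggregate query can be reduced to a single number and wrapped in whatever payload a dashboard expects. The `widget_payload` helper and its JSON shape are hypothetical. Records are assumed to be hashes with string keys, as in the example above, with `_count` as the aggregate field.

```ruby
require 'json'

# Turn the first record of an aggregate query into a dashboard widget payload.
# The payload shape (title/value) is a made-up example, not a real dashboard API.
def widget_payload(title, records, field = '_count')
  # Integer() accepts both numeric strings and integers; no records means zero.
  value = records.empty? ? 0 : Integer(records.first.fetch(field))
  JSON.generate({ 'title' => title, 'value' => value })
end

puts widget_payload('Errors, trailing 7 days', [{ '_count' => '42' }])
```

In practice the `records` argument would come from something like `Sumo.search(...).records`, and the resulting JSON would be posted to the dashboard of your choice.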

Giving up control is not the price of SaaS convenience. Sumo Logic does the heavy lifting of log management for Swipely and provides an interface that allows us to stay flexible. We control data on the way in by preferring open source tools in the early stages of our log pipeline and saving everything we send to S3. We preserve our ability to extend functionality by making their powerful search API easy to use from both shell and Ruby.
We’d appreciate feedback (@swipelyeng) on our logging architecture. Also, we’re not really control freaks and would love pull requests and suggestions on sumo-search!