Synthetic Monitoring with Telegraf

There are two main modes for collecting data about your systems and software: the first is collecting data from within the application itself, often called white-box monitoring, and the second is querying the system from the outside and collecting data about the response, often called black-box monitoring.

Google's SRE book defines white-box monitoring as "Monitoring based on metrics exposed by the internals of the system," and black-box monitoring as "testing externally visible behavior as a user would see it." Synthetic monitoring is one implementation of black-box monitoring: it creates requests that simulate user activity.

With synthetic monitoring, the aim is to expose any active issues that the user might be experiencing with a system, such as a website being inaccessible. Since it represents real user pain, this data is especially useful as an alerting signal for paging.

This complements the white-box approach, which gives developers and operators insight into the internal functioning of the system, exposes issues that may be hidden from the user, such as failures that result in a successful retry, and provides invaluable information for debugging purposes.

Telegraf can gather many white-box metrics using application-specific plugins like the ones for NGINX or MySQL, and you can instrument your applications using the InfluxDB client libraries, but we can also use Telegraf as a synthetic monitoring tool to monitor the status of our systems from the outside.

HTTP Response Input Plugin

Telegraf's http_response input plugin checks the status of HTTP and HTTPS connections by polling an endpoint with a custom request, and then recording information about the result. The configuration for the plugin allows you to specify a list of URLs to query, define the request method, and send a custom request body or headers to simulate actions that might be taken by external users and systems. It also allows you to verify the behavior of those endpoints by checking that the responses to these requests match certain predefined strings, using regular expressions. These options give us a lot of flexibility in terms of how we monitor our applications.

For each target server that is being polled, the plugin will send a measurement to InfluxDB with tags for the server (the target URL), request method, status code, and result, and fields with data about response times, whether the response string matched, the HTTP response code, and a numerical representation of the result called the result code.

We can create a new block in our Telegraf configuration for each endpoint we want to monitor. Telegraf will collect data for each config block once per collection interval.
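As a rough sketch of the one-block-per-endpoint layout, a configuration with two checks might look like the following. The second URL is a placeholder invented for illustration, not an endpoint from this article. The complete configuration file at the end of this page uses the single-valued address key rather than urls; use whichever form your Telegraf version accepts.

```toml
# Sketch only: one [[inputs.http_response]] block per endpoint.
# Each block is polled independently, once per collection interval.

# Check 1: the home page used throughout this article.
[[inputs.http_response]]
  urls = ["https://www.influxdata.com"]
  response_string_match = "InfluxDB is the open source time series database"

# Check 2: a hypothetical second endpoint (placeholder URL).
[[inputs.http_response]]
  urls = ["https://example.com/health"]
  method = "GET"
  response_timeout = "5s"
```

Keeping each endpoint in its own block means each check can carry its own method, timeout, and match string.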

Monitoring influxdata.com

Let's look at a quick example: we'll create a simple synthetic monitoring check that will tell us whether influxdata.com is up or not. Because we want these monitoring checks to come from outside of the system, we'll need to set up some kind of independent infrastructure, separate from the rest of our systems, for running Telegraf. This could mean running in a different Availability Zone on AWS, or using a different cloud provider altogether. Since I don't actually need long-lived infrastructure for this example, I'll configure Telegraf to run on my Mac, which is external to the influxdata.com infrastructure.

I already have Telegraf installed using Homebrew, so the next step will be to create a new config file with our http_response settings. Here's a snippet of what the inputs.http_response block would look like:

# HTTP/HTTPS request given an address a method and a timeout
[[inputs.http_response]]
  ## List of urls to query.
  urls = ["https://www.influxdata.com"]

[...]

  ## Optional substring or regex match in body of the response (case sensitive)
  response_string_match = "InfluxDB is the open source time series database"

This queries the InfluxData home page and looks to match the phrase "InfluxDB is the open source [...]".

One thing to note is that Telegraf's collection interval is especially important for this plugin because it determines how often to make requests to the endpoint in question. Individual plugins can define their own collection interval by including an interval parameter in the appropriate config block. For the sake of example we'll use the Telegraf defaults, but you'll need to decide what an appropriate interval is for your own systems. You can find a complete configuration file in this gist.
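For illustration, a per-plugin interval override might look like the sketch below. The "60s" value is an arbitrary choice for the example, not a recommendation; the agent-level default in this article's configuration is "10s".

```toml
# Sketch only: override the agent-level collection interval for one check.
[[inputs.http_response]]
  ## Poll this endpoint once a minute instead of using the agent default.
  ## "60s" is an illustrative value.
  interval = "60s"

  urls = ["https://www.influxdata.com"]
  response_string_match = "InfluxDB is the open source time series database"
```

Blocks that do not set interval keep using the value from the [agent] section.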

We can then launch a copy of Telegraf using the new config, and should see some output, as follows:

$ telegraf --config synthetic-telegraf.conf --debug
2019-07-01T11:51:52Z I! Starting Telegraf 1.10.4
2019-07-01T11:51:52Z I! Loaded inputs: http_response
2019-07-01T11:51:52Z I! Loaded aggregators: 
2019-07-01T11:51:52Z I! Loaded processors: 
2019-07-01T11:51:52Z I! Loaded outputs: influxdb
2019-07-01T11:51:52Z I! Tags enabled: host=noah-mbp.local
2019-07-01T11:51:52Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"noah-mbp.local", Flush Interval:10s
2019-07-01T11:51:52Z D! [agent] Connecting outputs
2019-07-01T11:51:52Z D! [agent] Attempting connection to output: influxdb
2019-07-01T11:51:52Z D! [agent] Successfully connected to output: influxdb
2019-07-01T11:51:52Z D! [agent] Starting service inputs
2019-07-01T11:52:10Z D! [outputs.influxdb] wrote batch of 1 metrics in 9.118061ms
2019-07-01T11:52:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-07-01T11:52:20Z D! [outputs.influxdb] wrote batch of 1 metrics in 7.672117ms
2019-07-01T11:52:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 

Next Steps

The http_response plugin provides a lot of flexibility in terms of creating monitoring requests which you can use to more accurately model how users and applications might interact with your site. For example, on influxdata.com you might want to verify that your search page is working by submitting a POST request and verifying that the response includes text from the search result page. Because synthetic monitoring is intended to model the user experience, the specific number, frequency, and implementation of your checks will depend heavily on the design and functioning of your product, but in general you're looking for things like slow response times or high rates of errors.
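A search check along those lines could be sketched as follows. The URL, form field, and expected text are all placeholders made up for the example, since the real search endpoint and its parameters are not described here; substitute the values your own site uses.

```toml
# Sketch only: a POST-based check with a custom body and header.
# The URL, body, and match string below are hypothetical placeholders.
[[inputs.http_response]]
  urls = ["https://example.com/search"]
  method = "POST"

  ## Form-encoded request body simulating a user's search query.
  body = "q=telegraf"

  ## Text expected somewhere in the body of the results page.
  response_string_match = "Search results"

  ## The headers sub-table goes last: in TOML, any keys that follow it
  ## would be read as headers rather than as plugin options.
  [inputs.http_response.headers]
    Content-Type = "application/x-www-form-urlencoded"
```

If the endpoint answers with a redirect, the follow_redirects option shown in the complete configuration decides whether the check follows it before matching.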

You'll also want to create a sane alerting strategy based on this data. Because black-box monitoring often exposes existing issues that are already impacting users, that usually means paging someone as soon as issues arise.

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply prepend
# them with $. For strings the variable must be within quotes (ie, "$STR_VAR"),
# for numbers and booleans they should be plain (ie, $INT_VAR, $BOOL_VAR)
# Global tags can be specified here in key="value" format.
[global_tags]
# dc = "us-east-1" # will tag all metrics with dc=us-east-1
# rack = "1a"
## Environment variables can be used as tags, and throughout the config file
# user = "$USER"
# Configuration for telegraf agent
[agent]
## Default data collection interval for all inputs
interval = "10s"
## Rounds collection interval to 'interval'
## ie, if interval="10s" then always collect on :00, :10, :20, etc.
round_interval = true
## Telegraf will send metrics to outputs in batches of at most
## metric_batch_size metrics.
## This controls the size of writes that Telegraf sends to output plugins.
metric_batch_size = 1000
## For failed writes, telegraf will cache metric_buffer_limit metrics for each
## output, and will flush this buffer on a successful write. Oldest metrics
## are dropped first when this buffer fills.
## This buffer only fills when writes fail to output plugin(s).
metric_buffer_limit = 10000
## Collection jitter is used to jitter the collection by a random amount.
## Each plugin will sleep for a random time within jitter before collecting.
## This can be used to avoid many plugins querying things like sysfs at the
## same time, which can have a measurable effect on the system.
collection_jitter = "0s"
## Default flushing interval for all outputs. Maximum flush_interval will be
## flush_interval + flush_jitter
flush_interval = "10s"
## Jitter the flush interval by a random amount. This is primarily to avoid
## large write spikes for users running a large number of telegraf instances.
## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
flush_jitter = "0s"
## By default or when set to "0s", precision will be set to the same
## timestamp order as the collection interval, with the maximum being 1s.
## ie, when interval = "10s", precision will be "1s"
## when interval = "250ms", precision will be "1ms"
## Precision will NOT be used for service inputs. It is up to each individual
## service input to set the timestamp at the appropriate precision.
## Valid time units are "ns", "us" (or "µs"), "ms", "s".
precision = ""
## Logging configuration:
## Run telegraf with debug log messages.
debug = false
## Run telegraf in quiet mode (error log messages only).
quiet = false
## Specify the log file name. The empty string means to log to stderr.
logfile = ""
## Override default hostname, if empty use os.Hostname()
hostname = ""
## If set to true, do not set the "host" tag in the telegraf agent.
omit_hostname = false
###############################################################################
# OUTPUT PLUGINS #
###############################################################################
# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
## The full HTTP or UDP URL for your InfluxDB instance.
##
## Multiple URLs can be specified for a single cluster, only ONE of the
## urls will be written to each interval.
# urls = ["unix:///var/run/influxdb.sock"]
# urls = ["udp://127.0.0.1:8089"]
# urls = ["http://127.0.0.1:8086"]
## The target database for metrics; will be created as needed.
## For UDP url endpoint database needs to be configured on server side.
# database = "telegraf"
## The value of this tag will be used to determine the database. If this
## tag is not set the 'database' option is used as the default.
# database_tag = ""
## If true, no CREATE DATABASE queries will be sent. Set to true when using
## Telegraf with a user without permissions to create databases or when the
## database already exists.
# skip_database_creation = false
## Name of existing retention policy to write to. Empty string writes to
## the default retention policy. Only takes effect when using HTTP.
# retention_policy = ""
## Write consistency (clusters only), can be: "any", "one", "quorum", "all".
## Only takes effect when using HTTP.
# write_consistency = "any"
## Timeout for HTTP messages.
# timeout = "5s"
## HTTP Basic Auth
# username = "telegraf"
# password = "metricsmetricsmetricsmetrics"
## HTTP User-Agent
# user_agent = "telegraf"
## UDP payload size is the maximum packet size to send.
# udp_payload = "512B"
## Optional TLS Config for use on HTTP connections.
# tls_ca = "/etc/telegraf/ca.pem"
# tls_cert = "/etc/telegraf/cert.pem"
# tls_key = "/etc/telegraf/key.pem"
## Use TLS but skip chain & host verification
# insecure_skip_verify = false
## HTTP Proxy override, if unset values the standard proxy environment
## variables are consulted to determine which proxy, if any, should be used.
# http_proxy = "http://corporate.proxy:3128"
## Additional HTTP headers
# http_headers = {"X-Special-Header" = "Special-Value"}
## HTTP Content-Encoding for write request body, can be set to "gzip" to
## compress body or "identity" to apply no encoding.
# content_encoding = "identity"
## When true, Telegraf will output unsigned integers as unsigned values,
## i.e.: "42u". You will need a version of InfluxDB supporting unsigned
## integer values. Enabling this option will result in field type errors if
## existing data has been written.
# influx_uint_support = false
###############################################################################
# PROCESSOR PLUGINS #
###############################################################################
###############################################################################
# AGGREGATOR PLUGINS #
###############################################################################
###############################################################################
# INPUT PLUGINS #
###############################################################################
# HTTP/HTTPS request given an address a method and a timeout
[[inputs.http_response]]
## Server address (default http://localhost)
address = "https://www.influxdata.com"
## Set http_proxy (telegraf uses the system wide proxy settings if it is not set)
# http_proxy = "http://localhost:8888"
## Set response_timeout (default 5 seconds)
# response_timeout = "5s"
## HTTP Request Method
# method = "GET"
## Whether to follow redirects from the server (defaults to false)
# follow_redirects = false
## Optional HTTP Request Body
# body = '''
# {'fake':'data'}
# '''
## Optional substring or regex match in body of the response
response_string_match = "InfluxDB is the open source time series database"
## Optional TLS Config
# tls_ca = "/etc/telegraf/ca.pem"
# tls_cert = "/etc/telegraf/cert.pem"
# tls_key = "/etc/telegraf/key.pem"
## Use TLS but skip chain & host verification
# insecure_skip_verify = false
## HTTP Request Headers (all values must be strings)
# [inputs.http_response.headers]
# Host = "github.com"