There are two main modes for collecting data about your systems and software: the first is by collecting data from within the application itself, often called white-box monitoring, and the second is by querying the system from the outside and collecting data about the response.
Google's SRE book defines white-box monitoring as "Monitoring based on metrics exposed by the internals of the system," and black-box monitoring as "testing externally visible behavior as a user would see it." Synthetic monitoring is an implementation of a black-box monitoring system which involves creating requests which simulate user activity.
With synthetic monitoring, the aim is to expose any active issues that the user might be experiencing with a system, such as a website being inaccessible. Since it represents real user pain, this data is especially useful as an alerting signal for paging.
This complements the white-box approach, which allows developers and operators to get insight into the internal functioning of the system, exposing issues that may be obscured from the user, such as failures that result in a successful retry, and providing invaluable information for debugging purposes.
Telegraf can gather many white-box metrics using application-specific plugins like the ones for NGINX or MySQL, and you can instrument your applications using the InfluxDB client libraries, but we can also use Telegraf as a synthetic monitoring tool to monitor the status of our systems from the outside.
Telegraf's http_response input plugin checks the status of HTTP and HTTPS connections by polling an endpoint with a custom request, then recording information about the result. The plugin's configuration allows you to specify a list of URLs to query, define the request method, and send a custom request body or headers to simulate actions that might be taken by external users and systems. It also allows you to verify the behavior of those endpoints by checking that the responses to these requests match predefined strings using regular expressions. These options give us a lot of flexibility in how we monitor our applications.
For each target server that is being polled, the plugin will send a measurement to InfluxDB with tags for the server (the target URL), request method, status code, and result, and fields with data about response times, whether the response string matched, the HTTP response code, and a numerical representation of the result called the result code.
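As an illustration, a single point from this plugin might look like the following in InfluxDB line protocol. The tag and field values here are invented for the example, and the exact field set depends on your Telegraf version and configuration:

```
http_response,server=https://www.influxdata.com,method=GET,status_code=200,result=success response_time=0.115,http_response_code=200i,response_string_match=1i,result_code=0i 1561981930000000000
```

A result_code of 0 indicates success, while non-zero values correspond to failure modes such as timeouts or connection errors, which makes it a convenient field to alert on.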
We can create a new block in our Telegraf configuration for each endpoint we want to monitor. Telegraf will collect data for each config block once per collection interval.
Let's look at a quick example: we'll create a simple synthetic monitoring check that will tell us whether influxdata.com is up or not. Because we want these monitoring checks to come from outside of the system, we'll need to set up some kind of independent infrastructure, separate from the rest of our systems, for running Telegraf. This could mean running in a different Availability Zone on AWS, or using a different cloud provider altogether. Since I don't actually need long-lived infrastructure for this example, I'll configure Telegraf to run on my Mac, which is external to the influxdata.com infrastructure.
I already have Telegraf installed using Homebrew, so the next step is to create a new config file with our http_response settings. Here's a snippet of what the inputs.http_response block would look like:
# HTTP/HTTPS request given an address a method and a timeout
[[inputs.http_response]]
## List of urls to query.
urls = ["https://www.influxdata.com"]
[...]
## Optional substring or regex match in body of the response (case sensitive)
response_string_match = "InfluxDB is the open source time series database"
This queries the InfluxData home page and looks to match the phrase "InfluxDB is the open source [...]".
One thing to note is that Telegraf's collection interval is especially important for this plugin, because it determines how often requests are made to the endpoint in question. Individual plugins can define their own collection interval by including an interval parameter in the appropriate config block. For the sake of example we'll use the Telegraf defaults, but you'll need to decide what an appropriate interval is for your own systems. You can find a complete configuration file in this gist.
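For instance, to poll an endpoint once a minute instead of at the agent-wide default, you could add an interval parameter to the plugin block. The 60s value below is just an example, not a recommendation:

```toml
[[inputs.http_response]]
  ## Override the agent-level collection interval for this plugin only;
  ## 60s is an arbitrary example value.
  interval = "60s"

  ## List of urls to query.
  urls = ["https://www.influxdata.com"]
```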
We can then launch a copy of Telegraf using the new config, and should see some output, as follows:
$ telegraf --config synthetic-telegraf.conf --debug
2019-07-01T11:51:52Z I! Starting Telegraf 1.10.4
2019-07-01T11:51:52Z I! Loaded inputs: http_response
2019-07-01T11:51:52Z I! Loaded aggregators:
2019-07-01T11:51:52Z I! Loaded processors:
2019-07-01T11:51:52Z I! Loaded outputs: influxdb
2019-07-01T11:51:52Z I! Tags enabled: host=noah-mbp.local
2019-07-01T11:51:52Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"noah-mbp.local", Flush Interval:10s
2019-07-01T11:51:52Z D! [agent] Connecting outputs
2019-07-01T11:51:52Z D! [agent] Attempting connection to output: influxdb
2019-07-01T11:51:52Z D! [agent] Successfully connected to output: influxdb
2019-07-01T11:51:52Z D! [agent] Starting service inputs
2019-07-01T11:52:10Z D! [outputs.influxdb] wrote batch of 1 metrics in 9.118061ms
2019-07-01T11:52:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics.
2019-07-01T11:52:20Z D! [outputs.influxdb] wrote batch of 1 metrics in 7.672117ms
2019-07-01T11:52:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics.
The http_response plugin provides a lot of flexibility in creating monitoring requests, which you can use to more accurately model how users and applications interact with your site. For example, on influxdata.com you might want to verify that your search page is working by submitting a POST request and checking that the response includes text from the search results page. Because synthetic monitoring is intended to model the user experience, the specific number, frequency, and implementation of your checks will depend heavily on the design and functioning of your product, but in general you're looking for things like slow response times or high rates of errors.
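A check along those lines might look something like the sketch below. Note that the URL path, request body, headers, and match string are all hypothetical; you'd need to substitute the actual details of your own search endpoint:

```toml
[[inputs.http_response]]
  ## Hypothetical search check: the URL path, body, and match string
  ## here are illustrative only.
  urls = ["https://www.influxdata.com/search/"]
  method = "POST"
  body = "q=telegraf"

  ## Optional substring or regex match in body of the response (case sensitive)
  response_string_match = "results for"

  ## Optional HTTP headers to send with the request
  [inputs.http_response.headers]
    Content-Type = "application/x-www-form-urlencoded"
```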
You'll also want to create a sane alerting strategy based on this data. Because black-box monitoring often exposes existing issues that are already impacting users, that usually means paging someone as soon as issues arise.