@monsieurpigeon
Forked from travisjeffery/influxdb-setup.md
Last active May 27, 2018 23:18
Guide to setting up InfluxData's TICK stack

InfluxData's TICK stack is made up of the following components:

Component | Role
--- | ---
Telegraf | Data collector
InfluxDB | Stores data
Chronograf | Visualizer
Kapacitor | Alerter

Since I needed each of these roles, I set up every component, but you can pick and choose: you'll at least need InfluxDB, and if you don't need alerts, for example, you can skip Kapacitor.

Visit InfluxDB's hardware sizing guidelines if you're setting this up on fresh hardware.

Get things up and running

Here are the steps to get the whole stack running with Docker:

  1. Create a Docker network for the components to communicate in:

     $ docker network create influxdb
    
  2. Start InfluxDB:

    InfluxDB is the component that the other components write and read data from.

     $ mkdir influxdb
     $ docker run -d --name=influxdb -p 8083:8083 -p 8086:8086 --net=influxdb -v $PWD/influxdb:/var/lib/influxdb influxdb:1.0
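
Before starting the other components, it's worth checking that InfluxDB is actually up. Its /ping endpoint returns HTTP 204 when the server is ready (8086 is the HTTP API port; 8083 is the admin UI):

```shell
# Check InfluxDB's health endpoint; a ready server answers 204 No Content.
curl -sS -o /dev/null -w "%{http_code}\n" http://localhost:8086/ping
```

If you get a connection refused error, give the container a few seconds to start, or check its logs for problems.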
    
  3. Start Telegraf:

    Telegraf is the component that collects metrics, either sent to it directly or gathered from other services, and writes them to InfluxDB or other outputs.

    See Telegraf's plugin documentation for the full lists of services it can collect metrics from and write out to.

    1. Create a sample config for Telegraf:

       $ docker run --rm telegraf:1.0 -sample-config > telegraf.conf
      
       The sample config is worth a quick look-over to get an idea of the many ways that Telegraf can get and send data.
      
    2. Edit the config file to enable the StatsD plugin and add 50 to the percentiles calculated for timing & histogram stats:

       # Telegraf Configuration
       #
       # Telegraf is entirely plugin driven. All metrics are gathered from the
       # declared inputs, and sent to the declared outputs.
       #
       # Plugins must be declared in here to be active.
       # To deactivate a plugin, comment out the name and any variables.
       #
       # Use 'telegraf -config telegraf.conf -test' to see what metrics a config
       # file would generate.
       #
       # Environment variables can be used anywhere in this config file, simply prepend
       # them with $. For strings the variable must be within quotes (ie, "$STR_VAR"),
       # for numbers and booleans they should be plain (ie, $INT_VAR, $BOOL_VAR)
      
       # Global tags can be specified here in key="value" format.
       [global_tags]
       # dc = "us-east-1" # will tag all metrics with dc=us-east-1
       # rack = "1a"
       ## Environment variables can be used as tags, and throughout the config file
       # user = "$USER"
      
       # Configuration for telegraf agent
       [agent]
       ## Default data collection interval for all inputs
       interval = "10s"
       ## Rounds collection interval to 'interval'
       ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
       round_interval = true
      
       ## Telegraf will send metrics to outputs in batches of at
       ## most metric_batch_size metrics.
       metric_batch_size = 1000
       ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
       ## output, and will flush this buffer on a successful write. Oldest metrics
       ## are dropped first when this buffer fills.
       metric_buffer_limit = 10000
      
       ## Collection jitter is used to jitter the collection by a random amount.
       ## Each plugin will sleep for a random time within jitter before collecting.
       ## This can be used to avoid many plugins querying things like sysfs at the
       ## same time, which can have a measurable effect on the system.
       collection_jitter = "0s"
      
       ## Default flushing interval for all outputs. You shouldn't set this below
       ## interval. Maximum flush_interval will be flush_interval + flush_jitter
       flush_interval = "10s"
       ## Jitter the flush interval by a random amount. This is primarily to avoid
       ## large write spikes for users running a large number of telegraf instances.
       ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
       flush_jitter = "0s"
      
       ## Run telegraf in debug mode
       debug = false
       ## Run telegraf in quiet mode
       quiet = false
       ## Override default hostname, if empty use os.Hostname()
       hostname = ""
       ## If set to true, do not set the "host" tag in the telegraf agent.
       omit_hostname = false
      
       ###############################################################################
       #                            OUTPUT PLUGINS                                   #
       ###############################################################################
      
       # Configuration for influxdb server to send metrics to
       [[outputs.influxdb]]
       ## The full HTTP or UDP endpoint URL for your InfluxDB instance.
       ## Multiple urls can be specified as part of the same cluster,
       ## this means that only ONE of the urls will be written to each interval.
       # urls = ["udp://localhost:8089"] # UDP endpoint example
       urls = ["http://influxdb:8086"] # required
       ## The target database for metrics (telegraf will create it if not exists).
       database = "telegraf" # required
       ## Precision of writes, valid values are "ns", "us" (or "µs"), "ms", "s", "m", "h".
       ## note: using "s" precision greatly improves InfluxDB compression.
       precision = "s"
      
       ## Retention policy to write to.
       retention_policy = "default"
       ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
       write_consistency = "any"
      
       ## Write timeout (for the InfluxDB client), formatted as a string.
       ## If not provided, will default to 5s. 0s means no timeout (not recommended).
       timeout = "5s"
       # username = "telegraf"
       # password = "metricsmetricsmetricsmetrics"
       ## Set the user agent for HTTP POSTs (can be useful for log differentiation)
       # user_agent = "telegraf"
       ## Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
       # udp_payload = 512
      
       ## Optional SSL Config
       # ssl_ca = "/etc/telegraf/ca.pem"
       # ssl_cert = "/etc/telegraf/cert.pem"
       # ssl_key = "/etc/telegraf/key.pem"
       ## Use SSL but skip chain & host verification
       # insecure_skip_verify = false
      
       ###############################################################################
       #                            INPUT PLUGINS                                    #
       ###############################################################################
      
       # Read metrics about cpu usage
       [[inputs.cpu]]
       ## Whether to report per-cpu stats or not
       percpu = true
       ## Whether to report total system cpu stats or not
       totalcpu = true
       ## Comment this line if you want the raw CPU time metrics
       fielddrop = ["time_*"]
      
       # Read metrics about disk usage by mount point
       [[inputs.disk]]
       ## By default, telegraf gathers stats for all mountpoints.
       ## Setting mountpoints will restrict the stats to the specified mountpoints.
       # mount_points = ["/"]
      
       ## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually
       ## present on /run, /var/run, /dev/shm or /dev).
       ignore_fs = ["tmpfs", "devtmpfs"]
      
       # Read metrics about disk IO by device
       [[inputs.diskio]]
       ## By default, telegraf will gather stats for all devices including
       ## disk partitions.
       ## Setting devices will restrict the stats to the specified devices.
       # devices = ["sda", "sdb"]
       ## Uncomment the following line if you do not need disk serial numbers.
       # skip_serial_number = true
      
       # Get kernel statistics from /proc/stat
       [[inputs.kernel]]
       # no configuration
      
       # Read metrics about memory usage
       [[inputs.mem]]
       # no configuration
      
       # Get the number of processes and group them by status
       [[inputs.processes]]
       # no configuration
      
       # Read metrics about swap memory usage
       [[inputs.swap]]
       # no configuration
      
       # Read metrics about system load & uptime
       [[inputs.system]]
       # no configuration
      
       ###############################################################################
       #                            SERVICE INPUT PLUGINS                            #
       ###############################################################################
      
       # Statsd Server
       [[inputs.statsd]]
       ## Address and port to host UDP listener on
       service_address = ":8125"
       ## Delete gauges every interval (default=false)
       delete_gauges = true
       ## Delete counters every interval (default=false)
       delete_counters = true
       ## Delete sets every interval (default=false)
       delete_sets = true
       ## Delete timings & histograms every interval (default=true)
       delete_timings = true
       ## Percentiles to calculate for timing & histogram stats
       percentiles = [50, 90]
      
       ## separator to use between elements of a statsd metric
       metric_separator = "_"
      
       ## Parses tags in the datadog statsd format
       ## http://docs.datadoghq.com/guides/dogstatsd/
       parse_data_dog_tags = false
      
       ## Statsd data translation templates, more info can be read here:
       ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md#graphite
       # templates = [
       #     "cpu.* measurement*"
       # ]
      
       ## Number of UDP messages allowed to queue up, once filled,
       ## the statsd server will start dropping packets
       allowed_pending_messages = 10000
      
       ## Number of timing/histogram values to track per-measurement in the
       ## calculation of percentiles. Raising this limit increases the accuracy
       ## of percentiles but also increases the memory usage and cpu time.
       percentile_limit = 1000        
      
    3. Run Telegraf:

       $ docker run -d --name=telegraf --net=influxdb -p 8125:8125/udp -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro telegraf:1.0
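
Before moving on, confirm that Telegraf parsed the config and loaded the plugins you enabled. Its startup log lists the active inputs and outputs, and the -test flag (mentioned in the config's own header comments) does a one-shot metric gather without starting the agent:

```shell
# Check the agent's startup log for the loaded inputs and outputs.
docker logs telegraf

# Or do a one-shot dry run of the config without starting the agent.
docker run --rm -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro telegraf:1.0 -config /etc/telegraf/telegraf.conf -test
```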
      
  4. Start Chronograf:

    Chronograf is the component that visualizes query results as charts.

       $ mkdir chronograf
       $ docker run -d --name=chronograf --net=influxdb -p 10000:10000 -v $PWD/chronograf:/var/lib/chronograf chronograf
    
  5. Start Kapacitor:

    Kapacitor is the component for setting up alerts, or whatever other actions you want to run based on the result of a query.

    Generate a default config first, then edit the urls setting in its [[influxdb]] section to point at http://influxdb:8086 so Kapacitor can reach InfluxDB over the Docker network:

     $ mkdir kapacitor
     $ docker run --rm kapacitor kapacitord config > kapacitor.conf
     $ docker run -d --net=influxdb --name=kapacitor -p 9092:9092 -v $PWD/kapacitor:/var/lib/kapacitor -v ~/tasks:/tasks -v $PWD/kapacitor.conf:/etc/kapacitor/kapacitor.conf:ro kapacitor -config /etc/kapacitor/kapacitor.conf
    

Use it

  1. Add data to InfluxDB via Telegraf:

    Every component is now set up and running, and we can start sending data. If you're using StatsD, point your StatsD client at your Telegraf host and you're good to go.

    For testing, you can also quickly send data into Telegraf with netcat. For example:

     $ echo "api.msgs.ok:10|c" | nc -C -w 1 -u localhost 8125
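
You can then query the counter back out of InfluxDB over its HTTP API to confirm the whole pipeline works. Note that with metric_separator = "_" in the Telegraf config, the statsd name api.msgs.ok is stored as the measurement api_msgs_ok, and with flush_interval = "10s" you may need to wait a few seconds for the write:

```shell
# With metric_separator = "_", the statsd name "api.msgs.ok" is stored
# as the measurement "api_msgs_ok".
measurement=$(echo "api.msgs.ok" | tr . _)

# Query it back out of InfluxDB over the HTTP API (allow ~10s for a flush).
curl -G "http://localhost:8086/query" \
  --data-urlencode "db=telegraf" \
  --data-urlencode "q=SELECT * FROM \"$measurement\""
```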
    
  2. Add some visualizations in Chronograf:

    (Screenshot: an example Chronograf chart.)

    The tmpltime() function lets you control the query's time range from Chronograf's UI; the time range is set to the past 15 minutes in these examples.

  3. Alert with Kapacitor:

    Kapacitor has its own DSL called TICKscript.

    Here's an example:

     stream
     |from()
         .measurement('events_msgs_processed')
     |deadman(100.0, 10s)
         .id('Events worker processed/{{ index .Tags "host" }}')
         .message('{{ .ID }} is {{ .Level }} value: {{ index .Fields "value" }}')
         .pagerduty()
         .slack()
    

    This example alerts Slack and PagerDuty when throughput drops below 100 points per 10 seconds, checked every 10 seconds. Thanks to the host tag, the alert tells you which worker/host triggered it.
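
To actually run a script like this, save it under the ~/tasks directory mounted into the Kapacitor container (the file name deadman.tick and task name deadman_alert below are just examples) and register it with the kapacitor CLI:

```shell
# Register the TICKscript as a stream task against telegraf's default
# database and retention policy (telegraf.autogen in InfluxDB 1.0).
docker exec kapacitor kapacitor define deadman_alert \
    -type stream \
    -tick /tasks/deadman.tick \
    -dbrp telegraf.autogen

# Enable the task so it starts processing live data.
docker exec kapacitor kapacitor enable deadman_alert

# Verify that it is executing and inspect its stats.
docker exec kapacitor kapacitor show deadman_alert
```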
