# Logging structure
Combine Riemann.io & Splunk with Heroku
- Logs are streams, not files
- Event streams should be treated properly
- Standard fields cover most situations
- There is one subject of a log, and one event
- The description is separate from the core items
- There is a standard way to log the common information model
- Go simple or go home
metric
ttl
time
host
service -- lowest level of failure detection
state
description
tags
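
These are the standard Riemann event fields. A rough sketch of one such event as a plain Python dict (field names follow the Riemann event model; the values are invented):

```python
# A hypothetical Riemann-style event expressed as a plain dict.
# Field names come from the standard Riemann event model; values are made up.
event = {
    "host": "web-1",
    "service": "app.req.time_ms:max",  # lowest level of failure detection
    "state": "ok",                     # e.g. ok / warning / critical
    "metric": 412.0,                   # the measured value
    "time": 1364828400,                # unix timestamp
    "ttl": 60,                         # seconds this event is considered valid
    "description": "Slowest request in the last reporting period",
    "tags": ["heroku", "production"],
}
```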
os.<resource>.<subcat>_<metric>:<stat>
app.<resource>.<subcat>_<metric>:<stat>
client.<ext>.<resource>.<subcat>_<metric>:<stat>
mb
count
rate
ms
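
A hypothetical helper that just encodes the naming pattern above (layer prefix, then `<resource>.<subcat>_<metric>:<stat>`); nothing here is a real library:

```python
def metric_name(layer, resource, subcat, metric, stat=None, ext=None):
    """Build a name like os.<resource>.<subcat>_<metric>:<stat>.

    layer is one of: os, app, client; ext is only used for client metrics.
    Purely a sketch of the naming convention above.
    """
    parts = [layer]
    if layer == "client" and ext:
        parts.append(ext)
    parts.append(resource)
    parts.append(f"{subcat}_{metric}")
    name = ".".join(parts)
    return f"{name}:{stat}" if stat else name

# e.g. metric_name("app", "req", "time", "ms", "min") -> "app.req.time_ms:min"
```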
A population is composed of observations of a process at various times.
- Time Series data
* ratio
* interval (arbitrary zero)
- Categorical
* %
- Describe a whole set of data
* Location (mean, median, mode, inter-quartile mean)
* Variation (std dev, variance, range, inter-quartile range, absolute deviation, distance std dev)
* Shape (skew)
* Dependence
- Is there any more information in that set of data?
* Sufficient
Statistic: a quantity calculated from a set of data.
min
pct_25
pct_50
mean
pct_75
pct_90
pct_99
max
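
A sketch of computing these statistics for one sample in Python; the percentile rule here is a simple index-rounding choice, one of several common conventions, so real tooling may differ slightly:

```python
import statistics

def describe(sample):
    """Compute the summary statistics named above for one sample of observations."""
    xs = sorted(sample)

    def pct(p):
        # index-rounding percentile (a sketch; backends use varying conventions)
        k = max(0, min(len(xs) - 1, int(round(p / 100.0 * (len(xs) - 1)))))
        return xs[k]

    return {
        "min": xs[0],
        "pct_25": pct(25),
        "pct_50": pct(50),
        "mean": statistics.mean(xs),
        "pct_75": pct(75),
        "pct_90": pct(90),
        "pct_99": pct(99),
        "max": xs[-1],
    }

# describe([12, 15, 9, 40, 22, 18, 31]) -> dict of the eight statistics above
```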
Statistical Process Control uses a series of samples
- Each sample is given its descriptive statistics
- Each statistic is placed in a time-series with those from other samples
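
Reading those two bullets as code, reusing the hypothetical `describe()` sketch above: each flushed sample contributes one point per statistic to its own time series.

```python
from collections import defaultdict

# One time series per statistic: each flush interval contributes one point,
# so "app.req.time_ms:max" becomes a sequence of (interval_start, value) pairs.
series = defaultdict(list)

def flush(interval_start, sample):
    stats = describe(sample)          # from the sketch above
    for name, value in stats.items():
        series[f"app.req.time_ms:{name}"].append((interval_start, value))
```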
os.memory
app.memory
app.req.open_count
app.req.incoming_rate
app.req.error_rate
app.req.completion_rate
app.req.time_ms:min
app.req.time_ms:max
client.amgr.open.count
client.amgr.outgoing_rate
client.amgr.error_rate
client.amgr.returning_rate
client.amgr.time_ms:min
client.amgr.time_ms:max
### Data Types
counter, timer, gauge
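
A sketch using the Python `statsd` client, assuming a StatsD daemon listening on localhost:8125; the metric names reuse the ones above.

```python
import statsd

c = statsd.StatsClient("localhost", 8125)  # assumes a local StatsD daemon

c.incr("app.req.error_rate")        # counter: how many times something happened
c.timing("app.req.time_ms", 320)    # timer: how long something took, in ms
c.gauge("os.memory", 1843)          # gauge: the current level of something
```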
### On Timers
When flushing timers to the Graphite backend, Statsd computes and sends the following stats:
count
lower
mean
mean_90
std
sum
sum_90
upper
upper_90
http://fabrizioregini.info/blog/2012/09/23/statsd-graphite-and-you/
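
Roughly what that flush computes, as a Python sketch over one interval's timer values (the exact StatsD percentile-threshold rounding may differ; assumes a non-empty interval):

```python
import statistics

def timer_stats(values, pct=90):
    """Approximate the stats StatsD derives from one flush interval of timer values.
    The *_90 figures are computed over the fastest 90% of values (a sketch, not
    the exact StatsD rounding rules)."""
    xs = sorted(values)
    cutoff = max(1, int(round(len(xs) * pct / 100.0)))
    top = xs[:cutoff]                      # the fastest pct% of observations
    return {
        "count": len(xs),
        "lower": xs[0],
        "mean": statistics.mean(xs),
        "mean_90": statistics.mean(top),
        "std": statistics.pstdev(xs),
        "sum": sum(xs),
        "sum_90": sum(top),
        "upper": xs[-1],
        "upper_90": top[-1],
    }
```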
### Period
Standard reporting period for the metric (ttl in Riemann).
http://dev.librato.com/v1/metrics#metrics
### Time Series
A Series is a sequence of timestamp/value pairs, where the whole sequence is the set of measurements for a single source of data.
http://tempo-db.com/docs/api/
### Tags
And since these three Series originate from the same thermostat, it would be useful to relate these Series to each other so you can find them easily when querying. This is done by adding metadata to each Series in the form of tags and attributes.
http://tempo-db.com/docs/modeling-time-series-data/
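
A sketch of that modeling idea with invented keys and tag names: three Series from one thermostat, each a list of (timestamp, value) pairs, related to each other through shared tags and attributes.

```python
# Invented series keys, tags, and attributes, purely for illustration.
series = {
    "thermostat.123.temperature_actual": [(1364828400, 20.5), (1364828460, 20.7)],
    "thermostat.123.temperature_target": [(1364828400, 21.0), (1364828460, 21.0)],
    "thermostat.123.humidity":           [(1364828400, 0.43), (1364828460, 0.44)],
}
tags = {key: ["thermostat", "building:hq"] for key in series}          # flat labels
attributes = {key: {"unit": "celsius" if "temp" in key else "ratio"}   # key/value pairs
              for key in series}
```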
### Use Queueing
The details of how you implement asynchronous communication are specific to your language and infrastructure, but a common practice is to write all your TempoDB data to a queue (eg, Kestrel or Celery), and have one or more workers pulling off the queue and writing to us (using our client libraries). A strategy along these lines allows you to handle errors, retries, and general latency out-of-band.
http://tempo-db.com/docs/best-practices/
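
A minimal sketch of that out-of-band write path with a stdlib queue and one worker thread; `write_point` is a hypothetical stand-in for whatever client library you use.

```python
import queue
import threading

# The app only enqueues; a worker drains the queue and talks to the metrics
# store (TempoDB, Riemann, ...), keeping errors, retries, and latency off the
# request path.
points = queue.Queue()

def write_point(series_key, timestamp, value):
    ...  # hypothetical: call your client library here, handle errors/retries

def worker():
    while True:
        item = points.get()
        try:
            write_point(*item)
        except Exception:
            points.put(item)  # naive retry: push it back for a later attempt
        finally:
            points.task_done()

threading.Thread(target=worker, daemon=True).start()

# In request-handling code, enqueueing is cheap and never blocks on the network:
points.put(("app.req.time_ms", 1364828400, 320))
```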
### Sample or Census
If we measure something in the app over a period of time, and then send summary data
to the server, then the _server_ has only a statistic from a sample (count, avg, etc.).
If we measure only some percentage of values, we should note the sampling rate.
"What’s important here is that the packet sent to StatsD includes the sample rate, and so StatsD then multiplies the numbers to give an estimate of a 100% sample rate before it sends the data on to graphite."
(is that actually valid??)
http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
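
On the "is that actually valid??" question: scaling a sampled count by 1/sample_rate is an unbiased estimate of the true count as long as the sampling is uniformly random, so the multiplication is reasonable for counters; the cost is higher variance as the rate drops. A sketch of that scaling step, with an invented SAMPLE_RATE:

```python
import random

SAMPLE_RATE = 0.1  # measure roughly 1 in 10 events

def maybe_record(value, observed):
    """Record only a fraction of events, as the sampled StatsD packets do."""
    if random.random() < SAMPLE_RATE:
        observed.append(value)

def estimated_count(observed):
    # The aggregator scales the sampled count back up, which is what the
    # quoted StatsD behavior describes: count / sample_rate estimates the
    # count you would have seen at a 100% sample rate.
    return len(observed) / SAMPLE_RATE
```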
### Pipeline
All the Things of Type 'T' ==> Population(Thing)
You Want to Know About Part 'P' of All the Things of Type 'T' ==> Aspect(Thing, Part)
The Information About Aspect 'A' is always ValueType 'VT' ==> Metric(Aspect, ValueType)
You Might Not See Every Instance of All the Things of Type 'T', but You See a Sample at Time 't' ==> Sample(Thing, t)
When You See Instance 'i' of Thing of Type 'T', You Observe Metric 'M' has Value 'v' at Time 't' ==> SampleObservation(Metric, v, t)
If you See the Same Thing of Type 'T' Over Time, You Give it Identity 'i' ==> IdentifiedObservation(SampleObservation, i)
If you Make Observations from the Same Area Over Time, You Tag Observations 'g' ==> TaggedObservation(SampleObservation, g)
A Part of Thing That You Observe ==> Metric(Thing, Part, Style)
The Ones You Think You'll Observe at Time 't' ==> SamplePopulation(T, t)
When You Observe a Similar Group of Things 'r', You May Summarize Your SampleObservation Set with a Statistic 'c' ==> Statistic(Set, r)
You May Use Several Statistics about Group 'r' to Estimate the True Value of Aspect 'A' of All the Things of Type 'T', at Time 't' ==> Estimate(Aspect, [stat])
When You Observe the Same Aspect 'A' Over Time, you Create a View of its Estimate over Time ==> TimeSeries(Aspect)
Set of Observations About Thing 'a' Across Time ==> TimeSeries(a)
You Break a Time Series into Groups based on Intervals, and then Use Statistics to Give Each Interval an Estimate of the Aspect ==> IntervalEstimate(Aspect, intvl)
You Analyze the Series of Interval Estimates of Aspect 'A' to Find Knowledge ==> Did something happen? (Is Value special cause or common cause?)
You Take Your Knowledge from Analysis and Decide between Action and Inaction ==> How do we repair 'T'?
If Your Decision Result is Action, You Take Action and Expect Your Metric Estimate to Respond Appropriately ==> Act(Thing, Expect(Estimate(Aspect)))
If Your Action does not Create Your Estimated Aspect within an Allotted Time 'timeout', You Perform Manual Investigation ==> Investigate(Aspect)
A Subsystem or Supersystem must be Observing and Analyzing the System under Observation ==> 'Now we care about delay between observation & analysis...'
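
One loose translation of this vocabulary into Python types, purely to make the relationships concrete; none of these names correspond to a real library:

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

@dataclass
class SampleObservation:
    metric: str         # Metric(Aspect, ValueType), e.g. "app.req.time_ms"
    value: Any          # 'v', the observed value
    time: float         # 't', when it was observed
    identity: str = ""  # 'i' for an IdentifiedObservation
    tags: List[str] = field(default_factory=list)  # 'g' for a TaggedObservation

@dataclass
class Statistic:
    name: str    # e.g. "pct_90"
    value: float
    group: str   # 'r', the similar group this summarizes

@dataclass
class Estimate:
    aspect: str              # Aspect 'A' of all the Things of Type 'T'
    stats: List[Statistic]   # several statistics used to estimate its true value

@dataclass
class IntervalEstimate:
    aspect: str
    interval: Tuple[float, float]  # start, end of the interval
    estimate: Estimate
```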
### Summarizers (Reducers)
sum
min
max
median
distinct
https://github.com/square/cube/wiki/Queries
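
The same reducers as plain Python functions over a list of values (`distinct` counts unique values):

```python
import statistics

REDUCERS = {
    "sum": sum,
    "min": min,
    "max": max,
    "median": statistics.median,
    "distinct": lambda values: len(set(values)),
}

# e.g. REDUCERS["distinct"]([1, 2, 2, 3]) -> 3
```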
### Selectors (Filters)
eq - equal.
lt - less than.
le - less than or equal to.
gt - greater than.
ge - greater than or equal to.
ne - not equal to.
re - regular expression.
in - one of an array of values (e.g., in(foo, [1, 2, 3])).
Multiple filters can be chained together, such as sum(request.ge(duration, 250).lt(duration, 500)).
https://github.com/square/cube/wiki/Queries
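
Rough Python equivalents of those selectors, written as predicates over an event dict; chaining is just applying every predicate:

```python
import re

# Each selector takes (field, value) and returns a predicate over an event dict.
SELECTORS = {
    "eq": lambda field, v: lambda e: e[field] == v,
    "lt": lambda field, v: lambda e: e[field] < v,
    "le": lambda field, v: lambda e: e[field] <= v,
    "gt": lambda field, v: lambda e: e[field] > v,
    "ge": lambda field, v: lambda e: e[field] >= v,
    "ne": lambda field, v: lambda e: e[field] != v,
    "re": lambda field, v: lambda e: re.search(v, e[field]) is not None,
    "in": lambda field, v: lambda e: e[field] in v,
}

def matches(event, *predicates):
    """Chained filters, e.g. ge(duration, 250).lt(duration, 500), mean all must hold."""
    return all(p(event) for p in predicates)

# matches({"duration": 320},
#         SELECTORS["ge"]("duration", 250),
#         SELECTORS["lt"]("duration", 500)) -> True
```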
## Event Type Tags
## Host Tags
## Fields
## Message Types
Document (data exchange)
Command (request that may fail or be denied)
Event (something happened, and can't be changed)
<timestamp> name="<name>" event_id=<event_id> <key>=<value>
2008-11-06 22:29:04 name="Failed Login" event_id=sshd:failure src_ip=10.2.3.4 src_port=12355 dest_ip=192.168.1.35 dest_port=22
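
A small helper that emits lines in that `<timestamp> name="..." event_id=... key=value` shape (hypothetical, just to show the format):

```python
from datetime import datetime, timezone

def log_event(name, event_id, **fields):
    """Format one event line: <timestamp> name="<name>" event_id=<event_id> <key>=<value> ..."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    pairs = " ".join(f"{k}={v}" for k, v in fields.items())
    return f'{ts} name="{name}" event_id={event_id} {pairs}'

# log_event("Failed Login", "sshd:failure", src_ip="10.2.3.4", src_port=12355,
#           dest_ip="192.168.1.35", dest_port=22)
```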
Resources Used in Numbers over Time
Actions Recorded in Events for all Time