Last active: July 2, 2018 20:32
# Logging structure

Combine Riemann.io & Splunk with Heroku
- Logs are streams, not files
- Event streams should be treated properly
- Standard fields cover most situations
- There is one subject of a log, and one event
- The description is separate from the core items
- There is a standard way to log the common information model
- Go simple or go home
metric
ttl
time
host
service -- lowest level of failure detection
state
description
tags
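These are Riemann's standard event fields. A minimal sketch of such an event as a Python dict (all the values are hypothetical):

```python
import time

# A Riemann-style event built from the standard fields above.
event = {
    "host": "web-1",                     # machine emitting the event
    "service": "app.req.error_rate",     # lowest level of failure detection
    "state": "warning",                  # e.g. ok / warning / critical
    "metric": 0.07,                      # the measured value
    "time": int(time.time()),            # unix timestamp of the observation
    "ttl": 60,                           # seconds before the event goes stale
    "description": "error rate above 5% over the last minute",
    "tags": ["http", "production"],
}
```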
os.<resource>.<subcat>_<metric>:<stat>
app.<resource>.<subcat>_<metric>:<stat>
client.<ext>.<resource>.<subcat>_<metric>:<stat>
mb
count
rate
ms
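The naming templates above can be mechanized. A sketch of a helper that assembles names in this scheme (the function name and argument layout are my own, not from any library):

```python
def metric_name(scope, resource, metric, stat=None, subcat=None, ext=None):
    """Build a name like os.<resource>.<subcat>_<metric>:<stat>.

    scope is 'os', 'app', or 'client'; 'client' names also carry the
    external service name <ext>.  <stat> and <subcat> are optional.
    """
    parts = [scope]
    if scope == "client" and ext:
        parts.append(ext)
    parts.append(resource)
    parts.append(f"{subcat}_{metric}" if subcat else metric)
    name = ".".join(parts)
    return f"{name}:{stat}" if stat else name
```

The `_metric` leaf carries its unit (mb, count, rate, ms), so `metric_name("app", "req", "time_ms", stat="min")` yields `app.req.time_ms:min`.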
A population is composed of observations of a process at various times.
- Time Series data
  * ratio
  * interval (arbitrary zero)
- Categorical
  * %
- Describe a whole set of data
  * Location (mean, median, mode, inter-quartile mean)
  * Variation (std dev, variance, range, inter-quartile range, absolute deviation, distance std dev)
  * Shape (skew)
  * Dependence
- Is there any more information in that set of data?
  * Sufficient
Statistic: a quantity calculated from a set of data.
min
pct_25
pct_50
mean
pct_75
pct_90
pct_99
max
Statistical Process Control uses a series of samples.
- Each sample is given its descriptive statistics
- Each statistic is placed in a time-series with those from other samples
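A sketch of that loop in Python: each sample is reduced to the descriptive statistics listed above, and each statistic becomes a point in its own time series. The nearest-rank percentile used here is one simple choice among several.

```python
import statistics

def describe(sample):
    """Reduce one sample to the descriptive statistics listed above."""
    s = sorted(sample)
    def pct(p):
        # nearest-rank percentile; assumes a non-empty sample
        return s[min(len(s) - 1, round(p / 100 * (len(s) - 1)))]
    return {
        "min": s[0],
        "pct_25": pct(25),
        "pct_50": pct(50),
        "mean": statistics.fmean(s),
        "pct_75": pct(75),
        "pct_90": pct(90),
        "pct_99": pct(99),
        "max": s[-1],
    }

# SPC: each periodic sample yields one point per statistic, and those
# points line up into time series keyed by the sample timestamp.
samples = {1000: [12, 15, 11, 40], 1060: [13, 14, 90, 12]}
series = {t: describe(obs) for t, obs in samples.items()}
```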
os.memory
app.memory
app.req.open_count
app.req.incoming_rate
app.req.error_rate
app.req.completion_rate
app.req.time_ms:min
app.req.time_ms:max
client.amgr.open.count
client.amgr.outgoing_rate
client.amgr.error_rate
client.amgr.returning_rate
client.amgr.time_ms:min
client.amgr.time_ms:max
### Data Types
counter, timer, gauge
### On Timers
When flushing timers to the Graphite backend, StatsD computes and sends the following stats:
count
lower
mean
mean_90
std
sum
sum_90
upper
upper_90
http://fabrizioregini.info/blog/2012/09/23/statsd-graphite-and-you/
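A rough Python rendering of that flush computation. The `*_90` variants are taken over the lowest 90% of values; StatsD's exact threshold rounding may differ slightly from this sketch.

```python
import math

def timer_stats(values, pct=90):
    """Sketch of the stats StatsD derives from one timer at flush time."""
    s = sorted(values)
    n = len(s)
    cutoff = int(round(n * pct / 100)) or n   # size of the lowest-90% slice
    top = s[:cutoff]
    mean = sum(s) / n
    return {
        "count": n,
        "lower": s[0],
        "mean": mean,
        "mean_90": sum(top) / len(top),
        "std": math.sqrt(sum((v - mean) ** 2 for v in s) / n),
        "sum": sum(s),
        "sum_90": sum(top),
        "upper": s[-1],
        "upper_90": top[-1],
    }
```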
### Period
Standard reporting period for the metric (ttl in Riemann).
http://dev.librato.com/v1/metrics#metrics
### Time Series
A Series is a sequence of timestamp/value pairs, each pair a measurement from a single source of data.
http://tempo-db.com/docs/api/
### Tags
And since these three Series originate from the same thermostat, it would be useful to relate these Series to each other so you can find them easily when querying. This is done by adding metadata to each Series in the form of tags and attributes.
http://tempo-db.com/docs/modeling-time-series-data/
### Use Queueing
The details of how you implement asynchronous communication are specific to your language and infrastructure, but a common practice is to write all your TempoDB data to a queue (e.g., Kestrel or Celery), and have one or more workers pulling off the queue and writing to us (using our client libraries). A strategy along these lines allows you to handle errors, retries, and general latency out-of-band.
http://tempo-db.com/docs/best-practices/
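A minimal in-process sketch of that pattern using Python's stdlib queue and a worker thread. `write_batch` is a stand-in for a real client-library call, which would also be where retries on error belong.

```python
import queue
import threading

write_queue = queue.Queue()
written = []

def write_batch(point):
    # placeholder for e.g. a TempoDB client write, with error handling
    written.append(point)

def worker():
    # drain the queue until a None sentinel arrives
    while True:
        point = write_queue.get()
        if point is None:
            break
        try:
            write_batch(point)
        finally:
            write_queue.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()
write_queue.put(("app.req.time_ms", 1351721362, 42.0))
write_queue.put(None)   # signal shutdown
t.join()
```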
### Sample or Census
If we measure something in the app over a period of time, and then send summary data
to the server, then the _server_ has only a statistic from a sample (count, avg, etc).
If we measure only some percentage of values, we should note the sampling rate.
"What’s important here is that the packet sent to StatsD includes the sample rate, and so StatsD then multiplies the numbers to give an estimate of a 100% sample rate before it sends the data on to graphite."
(is that actually valid??)
http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
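The multiplication is the standard correction for random sampling of counters: scaling the observed count by 1/rate gives an unbiased estimate of the true total, though the variance of the estimate grows as the rate shrinks. A one-liner to make it concrete:

```python
def estimated_count(observed, sample_rate):
    """Estimate the true event count from a sampled counter.

    observed: events actually seen; sample_rate: fraction measured (0 < r <= 1).
    """
    return observed / sample_rate

# Seeing 10 increments at a 10% sample rate suggests ~100 real events.
print(estimated_count(10, 0.1))  # → 100.0
```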
### Pipeline
All the Things of Type 'T' ==> Population(Thing)
You Want to Know About Part 'P' of All the Things of Type 'T' ==> Aspect(Thing, Part)
The Information About Aspect 'A' is always ValueType 'VT' ==> Metric(Aspect, ValueType)
You Might Not See Every Instance of All the Things of Type 'T', but You See a Sample at Time 't' ==> Sample(Thing, t)
When You See Instance 'i' of Thing of Type 'T', You Observe Metric 'M' has Value 'v' at Time 't' ==> SampleObservation(Metric, v, t)
If You See the Same Thing of Type 'T' Over Time, You Give it Identity 'i' ==> IdentifiedObservation(SampleObservation, i)
If You Make Observations from the Same Area Over Time, You Tag Observations 'g' ==> TaggedObservation(SampleObservation, g)
A Part of Thing That You Observe ==> Metric(Thing, Part, Style)
The Ones You Think You'll Observe at Time 't' ==> SamplePopulation(T, t)
When You Observe a Similar Group of Things 'r', You May Summarize Your SampleObservation Set with a Statistic 'c' ==> Statistic(Set, r)
You May Use Several Statistics about Group 'r' to Estimate the True Value of Aspect 'A' of All the Things of Type 'T', at Time 't' ==> Estimate(Aspect, [stat])
When You Observe the Same Aspect 'A' Over Time, You Create a View of its Estimate over Time ==> TimeSeries(Aspect)
Set of Observations About Thing 'a' Across Time ==> TimeSeries(a)
You Break a Time Series into Groups based on Intervals, and then Use Statistics to Give Each Interval an Estimate of the Aspect ==> IntervalEstimate(Aspect, intvl)
You Analyze the Series of Interval Estimates of Aspect 'A' to Find Knowledge ==> Did something happen? (Is Value special cause or common cause?)
You Take Your Knowledge from Analysis and Decide between Action and Inaction ==> How do we repair 'T'?
If Your Decision Result is Action, You Take Action and Expect Your Metric Estimate to Respond Appropriately ==> Act(Thing, Expect(Estimate(Aspect)))
If Your Action does not Create Your Estimated Aspect within an Allotted Time 'timeout' You Perform Manual Investigation ==> Investigate(Aspect)
A Subsystem or Supersystem must be Observing and Analyzing the System under Observation ==> 'Now we care about delay between observation & analysis...'
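One way to read the pipeline above is as a set of types. A rough Python rendering with dataclasses; the field names are my interpretation of the notes, not a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    thing_type: str      # 'T': the kind of thing observed
    part: str            # 'P'/'A': the aspect being measured
    value_type: str      # 'VT': e.g. 'ms', 'count', 'rate'

@dataclass
class SampleObservation:
    metric: Metric
    value: float         # 'v'
    time: float          # 't'
    identity: str = ""   # 'i', for IdentifiedObservation
    tags: list = field(default_factory=list)  # 'g', for TaggedObservation

@dataclass
class Estimate:
    aspect: str
    stats: dict          # e.g. {'mean': ..., 'pct_90': ...}
    time: float

@dataclass
class TimeSeries:
    aspect: str
    estimates: list = field(default_factory=list)  # interval Estimates over time
```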
### Summarizers (Reducers)
sum
min
max
median
distinct
https://github.com/square/cube/wiki/Queries
### Selectors (Filters)
eq - equal.
lt - less than.
le - less than or equal to.
gt - greater than.
ge - greater than or equal to.
ne - not equal to.
re - regular expression.
in - one of an array of values (e.g., in(foo, [1, 2, 3])).
Multiple filters can be chained together, such as sum(request.ge(duration, 250).lt(duration, 500)).
https://github.com/square/cube/wiki/Queries
## Event Type Tags
## Host Tags
## Fields
## Message Types
Document (data exchange)
Command (request that may fail or be denied)
Event (something happened, and can't be changed)
<timestamp> name="<name>" event_id=<event_id> <key>=<value>
2008-11-06 22:29:04 name="Failed Login" event_id=sshd:failure src_ip=10.2.3.4 src_port=12355 dest_ip=192.168.1.35 dest_port=22
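A sketch parser for that line format, assuming a fixed-width `YYYY-MM-DD HH:MM:SS` timestamp followed by space-separated `key=value` pairs with optional double-quoting:

```python
import re

LINE = ('2008-11-06 22:29:04 name="Failed Login" event_id=sshd:failure '
        'src_ip=10.2.3.4 src_port=12355 dest_ip=192.168.1.35 dest_port=22')

# key=value pairs; values are either double-quoted or a bare token
PAIR = re.compile(r'(\w+)=("([^"]*)"|\S+)')

def parse(line):
    """Split the fixed-width timestamp off, then collect key=value pairs."""
    fields = {"timestamp": line[:19]}
    for key, raw, quoted in PAIR.findall(line[20:]):
        fields[key] = quoted if raw.startswith('"') else raw
    return fields

event = parse(LINE)
```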
Resources Used in Numbers over Time
Actions Recorded in Events for all Time