# Logging structure
Combine Riemann.io & Splunk with Heroku
- Logs are streams, not files
- Event streams should be treated properly
- Standard fields cover most situations
- There is one subject of a log, and one event
- The description is separate from the core items
- There is a standard way to log the common information model
- Go simple or go home
metric
ttl
time
host
service -- lowest level of failure detection
state
description
tags
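
These are the standard Riemann event fields. A rough sketch of one such event as a plain Python dict (field names follow the Riemann event model; the values are invented):

```python
# A hypothetical Riemann-style event expressed as a plain dict.
# Field names come from the standard Riemann event model; values are made up.
event = {
    "host": "web-1",
    "service": "app.req.time_ms:max",  # lowest level of failure detection
    "state": "ok",                     # e.g. ok / warning / critical
    "metric": 412.0,                   # the measured value
    "time": 1364828400,                # unix timestamp
    "ttl": 60,                         # seconds this event is considered valid
    "description": "Slowest request in the last reporting period",
    "tags": ["heroku", "production"],
}
```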
os.<resource>.<subcat>_<metric>:<stat>
app.<resource>.<subcat>_<metric>:<stat>
client.<ext>.<resource>.<subcat>_<metric>:<stat>
mb
count
rate
ms
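
A hypothetical helper that just encodes the naming pattern above (layer prefix, then `<resource>.<subcat>_<metric>:<stat>`); nothing here is a real library:

```python
def metric_name(layer, resource, subcat, metric, stat=None, ext=None):
    """Build a name like os.<resource>.<subcat>_<metric>:<stat>.

    layer is one of: os, app, client; ext is only used for client metrics.
    Purely a sketch of the naming convention above.
    """
    parts = [layer]
    if layer == "client" and ext:
        parts.append(ext)
    parts.append(resource)
    parts.append(f"{subcat}_{metric}")
    name = ".".join(parts)
    return f"{name}:{stat}" if stat else name

# e.g. metric_name("app", "req", "time", "ms", "min") -> "app.req.time_ms:min"
```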
A population is composed of observations of a process at various times.
- Time Series data
* ratio
* interval (arbitrary zero)
- Categorical
* %
- Describe a whole set of data
* Location (mean, median, mode, inter-quartile mean)
* Variation (std dev, variance, range, inter-quartile range, absolute deviation, distance std dev)
* Shape (skew)
* Dependence
- Is there any more information in that set of data?
* Sufficient
Statistic: a quantity calculated from a set of data.
min
pct_25
pct_50
mean
pct_75
pct_90
pct_99
max
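
A sketch of computing these statistics for one sample in Python; the percentile rule here is a simple index-rounding choice, one of several common conventions, so real tooling may differ slightly:

```python
import statistics

def describe(sample):
    """Compute the summary statistics named above for one sample of observations."""
    xs = sorted(sample)

    def pct(p):
        # index-rounding percentile (a sketch; backends use varying conventions)
        k = max(0, min(len(xs) - 1, int(round(p / 100.0 * (len(xs) - 1)))))
        return xs[k]

    return {
        "min": xs[0],
        "pct_25": pct(25),
        "pct_50": pct(50),
        "mean": statistics.mean(xs),
        "pct_75": pct(75),
        "pct_90": pct(90),
        "pct_99": pct(99),
        "max": xs[-1],
    }

# describe([12, 15, 9, 40, 22, 18, 31]) -> dict of the eight statistics above
```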
Statistical Process Control uses a series of samples
- Each sample is given its descriptive statistics
- Each statistic is placed in a time-series with those from other samples
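
Reading those two bullets as code, reusing the hypothetical `describe()` sketch above: each flushed sample contributes one point per statistic to its own time series.

```python
from collections import defaultdict

# One time series per statistic: each flush interval contributes one point,
# so "app.req.time_ms:max" becomes a sequence of (interval_start, value) pairs.
series = defaultdict(list)

def flush(interval_start, sample):
    stats = describe(sample)          # from the sketch above
    for name, value in stats.items():
        series[f"app.req.time_ms:{name}"].append((interval_start, value))
```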
os.memory
app.memory
app.req.open_count
app.req.incoming_rate
app.req.error_rate
app.req.completion_rate
app.req.time_ms:min
app.req.time_ms:max
client.amgr.open.count
client.amgr.outgoing_rate
client.amgr.error_rate
client.amgr.returning_rate
client.amgr.time_ms:min
client.amgr.time_ms:max
### Data Types
counter, timer, gauge
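
A sketch using the Python `statsd` client, assuming a StatsD daemon listening on localhost:8125; the metric names reuse the ones above.

```python
import statsd

c = statsd.StatsClient("localhost", 8125)  # assumes a local StatsD daemon

c.incr("app.req.error_rate")        # counter: how many times something happened
c.timing("app.req.time_ms", 320)    # timer: how long something took, in ms
c.gauge("os.memory", 1843)          # gauge: the current level of something
```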
### On Timers
When flushing timers to the Graphite backend, Statsd computes and sends the following stats:
count
lower
mean
mean_90
std
sum
sum_90
upper
upper_90
http://fabrizioregini.info/blog/2012/09/23/statsd-graphite-and-you/
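
Roughly what that flush computes, as a Python sketch over one interval's timer values (the exact StatsD percentile-threshold rounding may differ; assumes a non-empty interval):

```python
import statistics

def timer_stats(values, pct=90):
    """Approximate the stats StatsD derives from one flush interval of timer values.
    The *_90 figures are computed over the fastest 90% of values (a sketch, not
    the exact StatsD rounding rules)."""
    xs = sorted(values)
    cutoff = max(1, int(round(len(xs) * pct / 100.0)))
    top = xs[:cutoff]                      # the fastest pct% of observations
    return {
        "count": len(xs),
        "lower": xs[0],
        "mean": statistics.mean(xs),
        "mean_90": statistics.mean(top),
        "std": statistics.pstdev(xs),
        "sum": sum(xs),
        "sum_90": sum(top),
        "upper": xs[-1],
        "upper_90": top[-1],
    }
```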
### Period
Standard reporting period for the metric (ttl in Riemann).
http://dev.librato.com/v1/metrics#metrics
### Time Series
A Series is a sequence of timestamp/value pairs, where the whole sequence is the set of measurements for a single source of data.
http://tempo-db.com/docs/api/
### Tags
And since these three Series originate from the same thermostat, it would be useful to relate these Series to each other so you can find them easily when querying. This is done by adding metadata to each Series in the form of tags and attributes.
http://tempo-db.com/docs/modeling-time-series-data/
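
A sketch of that modeling idea with invented keys and tag names: three Series from one thermostat, each a list of (timestamp, value) pairs, related to each other through shared tags and attributes.

```python
# Invented series keys, tags, and attributes, purely for illustration.
series = {
    "thermostat.123.temperature_actual": [(1364828400, 20.5), (1364828460, 20.7)],
    "thermostat.123.temperature_target": [(1364828400, 21.0), (1364828460, 21.0)],
    "thermostat.123.humidity":           [(1364828400, 0.43), (1364828460, 0.44)],
}
tags = {key: ["thermostat", "building:hq"] for key in series}          # flat labels
attributes = {key: {"unit": "celsius" if "temp" in key else "ratio"}   # key/value pairs
              for key in series}
```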
### Use Queueing
The details of how you implement asynchronous communication are specific to your language and infrastructure, but a common practice is to write all your TempoDB data to a queue (eg, Kestrel or Celery), and have one or more workers pulling off the queue and writing to us (using our client libraries). A strategy along these lines allows you to handle errors, retries, and general latency out-of-band.
http://tempo-db.com/docs/best-practices/
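
A minimal sketch of that out-of-band write path with a stdlib queue and one worker thread; `write_point` is a hypothetical stand-in for whatever client library you use.

```python
import queue
import threading

# The app only enqueues; a worker drains the queue and talks to the metrics
# store (TempoDB, Riemann, ...), keeping errors, retries, and latency off the
# request path.
points = queue.Queue()

def write_point(series_key, timestamp, value):
    ...  # hypothetical: call your client library here, handle errors/retries

def worker():
    while True:
        item = points.get()
        try:
            write_point(*item)
        except Exception:
            points.put(item)  # naive retry: push it back for a later attempt
        finally:
            points.task_done()

threading.Thread(target=worker, daemon=True).start()

# In request-handling code, enqueueing is cheap and never blocks on the network:
points.put(("app.req.time_ms", 1364828400, 320))
```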
### Sample or Census
If we measure something in the app over a period of time, and then send summary data
to the server, then the _server_ has only a statistic from a sample (count, avg, etc.).
If we measure only some percentage of values, we should note the sampling rate.
"What’s important here is that the packet sent to StatsD includes the sample rate, and so StatsD then multiplies the numbers to give an estimate of a 100% sample rate before it sends the data on to graphite."
(is that actually valid??)
http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
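
On the "is that actually valid??" question: scaling a sampled count by 1/sample_rate is an unbiased estimate of the true count as long as the sampling is uniformly random, so the multiplication is reasonable for counters; the cost is higher variance as the rate drops. A sketch of that scaling step, with an invented SAMPLE_RATE:

```python
import random

SAMPLE_RATE = 0.1  # measure roughly 1 in 10 events

def maybe_record(value, observed):
    """Record only a fraction of events, as the sampled StatsD packets do."""
    if random.random() < SAMPLE_RATE:
        observed.append(value)

def estimated_count(observed):
    # The aggregator scales the sampled count back up, which is what the
    # quoted StatsD behavior describes: count / sample_rate estimates the
    # count you would have seen at a 100% sample rate.
    return len(observed) / SAMPLE_RATE
```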
### Pipeline
All the Things of Type 'T' ==> Population(Thing)
You Want to Know About Part 'P' of All the Things of Type 'T' ==> Aspect(Thing, Part)
The Information About Aspect 'A' is always ValueType 'VT' ==> Metric(Aspect, ValueType)
You Might Not See Every Instance of All the Things of Type 'T', but You See a Sample at Time 't' ==> Sample(Thing, t)
When You See Instance 'i' of Thing of Type 'T', You Observe Metric 'M' has Value 'v' at Time 't' ==> SampleObservation(Metric, v, t)
If you See the Same Thing of Type 'T' Over Time, You Give it Identity 'i' ==> IdentifiedObservation(SampleObservation, i)
If you Make Observations from the Same Area Over Time, You Tag Observations 'g' ==> TaggedObservation(SampleObservation, g)
A Part of Thing That You Observe ==> Metric(Thing, Part, Style)
The Ones You Think You'll Observe at Time 't' ==> SamplePopulation(T, t)
When You Observe a Similar Group of Things 'r', You May Summarize Your SampleObservation Set with a Statistic 'c' ==> Statistic(Set, r)
You May Use Several Statistics about Group 'r' to Estimate the True Value of Aspect 'A' of All the Things of Type 'T', at Time 't' ==> Estimate(Aspect, [stat])
When You Observe the Same Aspect 'A' Over Time, you Create a View of its Estimate over Time ==> TimeSeries(Aspect)
Set of Observations About Thing 'a' Across Time ==> TimeSeries(a)
You Break a Time Series into Groups based on Intervals, and then Use Statistics to Give Each Interval an Estimate of the Aspect ==> IntervalEstimate(Aspect, intvl)
You Analyze the Series of Interval Estimates of Aspect 'A' to Find Knowledge ==> Did something happen? (Is Value special cause or common cause?)
You Take Your Knowledge from Analysis and Decide between Action and Inaction ==> How do we repair 'T'?
If Your Decision Result is Action, You Take Action and Expect Your Metric Estimate to Respond Appropriately ==> Act(Thing, Expect(Estimate(Aspect)))
If Your Action does not Create Your Estimated Aspect within an Allotted Time 'timeout', You Perform Manual Investigation ==> Investigate(Aspect)
A Subsystem or Supersystem must be Observing and Analyzing the System under Observation ==> 'Now we care about delay between observation & analysis...'
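
One loose translation of this vocabulary into Python types, purely to make the relationships concrete; none of these names correspond to a real library:

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

@dataclass
class SampleObservation:
    metric: str         # Metric(Aspect, ValueType), e.g. "app.req.time_ms"
    value: Any          # 'v', the observed value
    time: float         # 't', when it was observed
    identity: str = ""  # 'i' for an IdentifiedObservation
    tags: List[str] = field(default_factory=list)  # 'g' for a TaggedObservation

@dataclass
class Statistic:
    name: str    # e.g. "pct_90"
    value: float
    group: str   # 'r', the similar group this summarizes

@dataclass
class Estimate:
    aspect: str              # Aspect 'A' of all the Things of Type 'T'
    stats: List[Statistic]   # several statistics used to estimate its true value

@dataclass
class IntervalEstimate:
    aspect: str
    interval: Tuple[float, float]  # start, end of the interval
    estimate: Estimate
```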
### Summarizers (Reducers)
sum
min
max
median
distinct
https://github.com/square/cube/wiki/Queries
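
The same reducers as plain Python functions over a list of values (`distinct` counts unique values):

```python
import statistics

REDUCERS = {
    "sum": sum,
    "min": min,
    "max": max,
    "median": statistics.median,
    "distinct": lambda values: len(set(values)),
}

# e.g. REDUCERS["distinct"]([1, 2, 2, 3]) -> 3
```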
### Selectors (Filters)
eq - equal.
lt - less than.
le - less than or equal to.
gt - greater than.
ge - greater than or equal to.
ne - not equal to.
re - regular expression.
in - one of an array of values (e.g., in(foo, [1, 2, 3])).
Multiple filters can be chained together, such as sum(request.ge(duration, 250).lt(duration, 500)).
https://github.com/square/cube/wiki/Queries
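
Rough Python equivalents of those selectors, written as predicates over an event dict; chaining is just applying every predicate:

```python
import re

# Each selector takes (field, value) and returns a predicate over an event dict.
SELECTORS = {
    "eq": lambda field, v: lambda e: e[field] == v,
    "lt": lambda field, v: lambda e: e[field] < v,
    "le": lambda field, v: lambda e: e[field] <= v,
    "gt": lambda field, v: lambda e: e[field] > v,
    "ge": lambda field, v: lambda e: e[field] >= v,
    "ne": lambda field, v: lambda e: e[field] != v,
    "re": lambda field, v: lambda e: re.search(v, e[field]) is not None,
    "in": lambda field, v: lambda e: e[field] in v,
}

def matches(event, *predicates):
    """Chained filters, e.g. ge(duration, 250).lt(duration, 500), mean all must hold."""
    return all(p(event) for p in predicates)

# matches({"duration": 320},
#         SELECTORS["ge"]("duration", 250),
#         SELECTORS["lt"]("duration", 500)) -> True
```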
## Event Type Tags
## Host Tags
## Fields
## Message Types
Document (data exchange)
Command (request that may fail or be denied)
Event (something happened, and can't be changed)
<timestamp> name="<name>" event_id=<event_id> <key>=<value>
2008-11-06 22:29:04 name="Failed Login" event_id=sshd:failure src_ip=10.2.3.4 src_port=12355 dest_ip=192.168.1.35 dest_port=22
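
A small helper that emits lines in that `<timestamp> name="..." event_id=... key=value` shape (hypothetical, just to show the format):

```python
from datetime import datetime, timezone

def log_event(name, event_id, **fields):
    """Format one event line: <timestamp> name="<name>" event_id=<event_id> <key>=<value> ..."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    pairs = " ".join(f"{k}={v}" for k, v in fields.items())
    return f'{ts} name="{name}" event_id={event_id} {pairs}'

# log_event("Failed Login", "sshd:failure", src_ip="10.2.3.4", src_port=12355,
#           dest_ip="192.168.1.35", dest_port=22)
```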
Resources Used in Numbers over Time
Actions Recorded in Events for all Time