@isethi
Created April 14, 2015 05:20
Operational Visibility - Percona Live
Operational Visibility Workshop:
kaizen - recognizes improvement can be small or large
the objectives of observability - business velocity, availability, efficiency and scalability
problems with traditional monitoring systems - too many dashboards, data collected multiple times, resolution too coarse, logs not centralized
a little better -
monitoring - Sensu, which uses RabbitMQ and Redis for scaling
logstash agent can count telemetry data and send it to statsd, which forwards it to Graphite
extracting performance_schema data into Graphite (see the sketch below)
Elasticsearch
Whisper
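As a sketch of that telemetry path: assuming a Carbon cache listening on Graphite's default plaintext port 2003 on localhost, a collector can ship a data point like this (the metric name and value are made-up examples):

```python
import socket
import time

def send_to_graphite(metric, value, host="localhost", port=2003):
    """Push one data point to Carbon using Graphite's plaintext protocol:
    '<metric.path> <value> <unix timestamp>\n'."""
    line = "%s %f %d\n" % (metric, value, int(time.time()))
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()

# e.g. a value pulled from performance_schema could be shipped like this:
send_to_graphite("db.myhost.threads_running", 12)
```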
What's in a metric - telemetry and events
architectural components
sensing
collecting, e.g. with an agent such as https://vividcortex.com/
agent or agentless
push and pull
filtering and tokenizing
scaling
performance impact
analysis before storage
in stream
feeding into automation
anomaly detection
aggregation and calculations
storage
telemetry
events
resolution and aggregation
backends: Graphite uses Whisper, which is flat-file; Zabbix uses MySQL; Cacti uses round-robin (RRD) files; Elasticsearch runs on Lucene
alerts
rule based processing
notification routing
event aggregation and management
under-page rather than over-page
actionable alerts (see the check sketch below)
Visualization
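To make the rule-based processing and actionable-alert bullets concrete, here is a minimal threshold check using the Nagios exit-code convention (0 OK, 1 WARNING, 2 CRITICAL) that Sensu, used later in the workshop, can also execute; the metric, thresholds, and the get_replication_lag_seconds() helper are placeholder assumptions:

```python
import sys

# Placeholder: in a real check this value would come from the system
# being measured (e.g. current replication lag in seconds).
def get_replication_lag_seconds():
    return 42.0

WARN_THRESHOLD = 30.0
CRIT_THRESHOLD = 120.0

lag = get_replication_lag_seconds()
if lag >= CRIT_THRESHOLD:
    print("CRITICAL: replication lag %.0fs" % lag)
    sys.exit(2)
elif lag >= WARN_THRESHOLD:
    print("WARNING: replication lag %.0fs" % lag)
    sys.exit(1)
else:
    print("OK: replication lag %.0fs" % lag)
    sys.exit(0)
```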
What to measure?
we measure to support our KPIs
we measure to preempt incidents
we measure to diagnose problems
we alert when customers feel pain
supporting our KPIs
velocity
how fast can the org push new features? how fast can the org scale up or down?
efficiency
how cost efficient is the environment? how elastic is our environment?
security
performance
Apdex(T), where T is the target latency threshold (see the sketch after this list)
availability
How available is each component to the application
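A small sketch of the Apdex(T) score mentioned under performance: samples at or below the target T are satisfied, samples up to 4T count as tolerating at half weight, and the rest are frustrated; the latencies below are made up:

```python
def apdex(latencies, target):
    """Apdex(T) = (satisfied + tolerating / 2) / total samples."""
    satisfied = sum(1 for l in latencies if l <= target)
    tolerating = sum(1 for l in latencies if target < l <= 4 * target)
    return (satisfied + tolerating / 2.0) / len(latencies)

# e.g. with a 0.5s target, latencies of 0.2s, 0.4s, 1.0s and 3.0s score 0.625
print(apdex([0.2, 0.4, 1.0, 3.0], target=0.5))
```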
“Measure as much as possible, alert on as little as possible”
automate remediation if possible
At the server level - the basics: resource utilization, process behavior and the network. Look at syslog, the MySQL log, cron, authentication and mail logs. Aggregate these up in distributed systems.
At the database level: exposed database metrics, SQL analytics and metrics
Database metrics: how fast are we hitting our resource and concurrency limits?
how do my queries behave - sorts, joins, index scans, commits, rollbacks
At the connection layer: max_connections, open TCP ports, open files, etc. (see the sketch below)
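A hedged sketch of that connection-layer check, assuming a MySQL instance reachable via the pymysql client with placeholder credentials; any MySQL driver would work the same way:

```python
import pymysql

# Placeholder credentials for illustration only.
conn = pymysql.connect(host="localhost", user="monitor", password="secret")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
        max_connections = int(cur.fetchone()[1])
        cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
        threads_connected = int(cur.fetchone()[1])
finally:
    conn.close()

# How fast are we approaching the concurrency limit?
print("connections: %d / %d (%.0f%% used)"
      % (threads_connected, max_connections,
         100.0 * threads_connected / max_connections))
```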
What's next?
better time series storage
leverage parallelism and data aggregation
machine learning
Workshop hands on: https://github.com/dtest/plsc15-opvis
Pythian opsviz stack - https://github.com/pythian/opsviz
Current stack
Telemetry data: Sensu
agent pushes to RabbitMQ
Sensu agent on each host polls every 1 to 60 seconds
Sensu stores state in Redis
why Sensu -
excellent API, backwards compatible with Nagios checks, can be parallelized
rabbitmq issues
network partition
node failures
mirrored queues
tcp load balancers
az failures
scaling concerns
redis failures
use elasticache, multi az
monitor your monitor - make sure Sensu has n+1 hosts
Telemetry data: logstash
Event data: logstash - tokenizes events and gets them into Elasticsearch
Telemetry storage: Graphite - works with many different pollers to graph everything. Limitation - flat files, which makes complex queries difficult
carbon-cache, carbon-relay, Whisper - scale with multiple caches, replicate
Event storage - Elasticsearch
Handles distribution well
cluster scales reads
distribute across az
sharding indices
to avoid split-brain during network partitions:
running masters on dedicated nodes
running data nodes on dedicated nodes
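A sketch of the index side of this, assuming the official elasticsearch Python client and a daily logstash-style index; the shard and replica counts are placeholder assumptions that let reads be served by multiple data nodes across AZs:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder cluster address

# Spread each day's events across several primary shards and keep one
# replica so reads can be served from more than one data node.
es.indices.create(
    index="logstash-2015.04.14",
    body={
        "settings": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
        }
    },
)
```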
Visualization
uchiwa https://github.com/sensu/uchiwa
kibana
grafana
Future work:
anomaly detection via Heka or Skyline (see the sketch below)
influxdb for storage
merge kibana and grafana
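Not Heka or Skyline themselves, just a toy three-sigma rule to show the kind of judgement such tools apply to a metric stream; the window size, threshold and sample values are arbitrary assumptions:

```python
from collections import deque
from statistics import mean, stdev

def make_detector(window=60, sigmas=3.0):
    """Flag a point as anomalous if it falls more than `sigmas` standard
    deviations away from the mean of the trailing window."""
    history = deque(maxlen=window)

    def check(value):
        anomalous = False
        if len(history) >= 5:  # need a little history before judging
            mu, sd = mean(history), stdev(history)
            anomalous = sd > 0 and abs(value - mu) > sigmas * sd
        history.append(value)
        return anomalous

    return check

detect = make_detector()
for v in [10, 11, 10, 12, 11, 10, 11, 95]:
    print(v, detect(v))  # only the 95 should be flagged
```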
Workshop:
/etc/sensu/conf.d/rabbitmq.json
/etc/sensu/conf.d/client.json
/etc/logstash/conf.d/agent.conf
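For reference, a sketch of generating the client registration file listed above (/etc/sensu/conf.d/client.json); the client name, address and subscriptions are placeholder assumptions:

```python
import json

# Placeholder values; in practice these describe the host the agent runs on
# and the check subscriptions it should pick up from RabbitMQ.
client_config = {
    "client": {
        "name": "db01.example.com",
        "address": "10.0.0.21",
        "subscriptions": ["mysql", "linux-base"],
    }
}

with open("/etc/sensu/conf.d/client.json", "w") as f:
    json.dump(client_config, f, indent=2)
```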