@isethi
Created April 14, 2015 05:20
Operational Visibility - Percona Live
Operational Visibility Workshop:
kaizen - recognizes improvement can be small or large
the objectives of observability - business velocity, availability, efficiency and scalability
problems with traditional monitoring systems - too many dashboards, data collected multiple times, resolution too coarse, logs not centralized
a little better -
monitoring - Sensu, which uses RabbitMQ and Redis for scaling
logstash agent can count telemetry data and send it to statsd, which forwards it to Graphite
extracting performance_schema data into Graphite (see the sketch below)
Elasticsearch
Whisper
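As a sketch of that telemetry path: assuming a Carbon cache listening on Graphite's default plaintext port 2003 on localhost, a collector can ship a data point like this (the metric name and value are made-up examples):

```python
import socket
import time

def send_to_graphite(metric, value, host="localhost", port=2003):
    """Push one data point to Carbon using Graphite's plaintext protocol:
    '<metric.path> <value> <unix timestamp>\n'."""
    line = "%s %f %d\n" % (metric, value, int(time.time()))
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()

# e.g. a value pulled from performance_schema could be shipped like this:
send_to_graphite("db.myhost.threads_running", 12)
```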
What's in a metric - telemetry and events
architectural components
sensing
collecting, e.g. with an agent such as https://vividcortex.com/
agent or agentless
push and pull
filtering and tokenizing
scaling
performance impact
analysis before storage
in stream
feeding into automation
anomaly detection
aggregation and calculations
storage
telemetry
events
resolution and aggregation
backends: Graphite uses Whisper, which is flat-file; Zabbix uses MySQL; Cacti uses round-robin (RRD) files; Elasticsearch runs on Lucene
alerts
rule based processing
notification routing
event aggregation and management
under-page rather than over-page
actionable alerts (see the check sketch below)
Visualization
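To make the rule-based processing and actionable-alert bullets concrete, here is a minimal threshold check using the Nagios exit-code convention (0 OK, 1 WARNING, 2 CRITICAL) that Sensu, used later in the workshop, can also execute; the metric, thresholds, and the get_replication_lag_seconds() helper are placeholder assumptions:

```python
import sys

# Placeholder: in a real check this value would come from the system
# being measured (e.g. current replication lag in seconds).
def get_replication_lag_seconds():
    return 42.0

WARN_THRESHOLD = 30.0
CRIT_THRESHOLD = 120.0

lag = get_replication_lag_seconds()
if lag >= CRIT_THRESHOLD:
    print("CRITICAL: replication lag %.0fs" % lag)
    sys.exit(2)
elif lag >= WARN_THRESHOLD:
    print("WARNING: replication lag %.0fs" % lag)
    sys.exit(1)
else:
    print("OK: replication lag %.0fs" % lag)
    sys.exit(0)
```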
What to measure?
we measure to support our KPIs
we measure to preempt incidents
we measure to diagnose problems
we alert when customers feel pain
supporting our KPIs
velocity
how fast can the org push new features? how fast can the org scale up or down?
efficiency
how cost efficient is the environment? how elastic is our environment?
security
performance
Apdex(T), where T is the target latency threshold (see the sketch after this list)
availability
How available is each component to the application
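A small sketch of the Apdex(T) score mentioned under performance: samples at or below the target T are satisfied, samples up to 4T count as tolerating at half weight, and the rest are frustrated; the latencies below are made up:

```python
def apdex(latencies, target):
    """Apdex(T) = (satisfied + tolerating / 2) / total samples."""
    satisfied = sum(1 for l in latencies if l <= target)
    tolerating = sum(1 for l in latencies if target < l <= 4 * target)
    return (satisfied + tolerating / 2.0) / len(latencies)

# e.g. with a 0.5s target, latencies of 0.2s, 0.4s, 1.0s and 3.0s score 0.625
print(apdex([0.2, 0.4, 1.0, 3.0], target=0.5))
```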
“Measure as much as possible, alert on as little as possible”
automate remediation if possible
At the server level - the basics: resource utilization, process behavior and the network. Look at syslog, the MySQL log, cron, authentication and mail logs. Aggregate these up in distributed systems.
At the database level: exposed database metrics, SQL analytics and metrics
Database metrics: how fast are we hitting our resource and concurrency limits?
how do my queries behave - sorts, joins, index scans, commits, rollbacks
At the connection layer: max_connections, open TCP ports, open files, etc. (see the sketch below)
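A hedged sketch of that connection-layer check, assuming a MySQL instance reachable via the pymysql client with placeholder credentials; any MySQL driver would work the same way:

```python
import pymysql

# Placeholder credentials for illustration only.
conn = pymysql.connect(host="localhost", user="monitor", password="secret")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
        max_connections = int(cur.fetchone()[1])
        cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
        threads_connected = int(cur.fetchone()[1])
finally:
    conn.close()

# How fast are we approaching the concurrency limit?
print("connections: %d / %d (%.0f%% used)"
      % (threads_connected, max_connections,
         100.0 * threads_connected / max_connections))
```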
What's next?
better time series storage
leverage parallelism and data aggregation
machine learning
Workshop hands on: https://github.com/dtest/plsc15-opvis
Pythian opsviz stack - https://github.com/pythian/opsviz
Current stack
Telemetry data: Sensu
agent pushes to RabbitMQ
Sensu agent on each host polls every 1 to 60 seconds
Sensu stores state in Redis
why Sensu -
excellent API, backwards compatible with Nagios checks, can be parallelized
rabbitmq issues
network partition
node failures
mirrored queues
tcp load balancers
az failures
scaling concerns
redis failures
use elasticache, multi az
monitor your monitor - make sure Sensu has n+1 hosts
Telemetry data: logstash
Event data: logstash - tokenizes events and gets them into Elasticsearch
Telemetry storage: Graphite - works with many different pollers to graph everything. Limitation - flat files, which makes complex queries difficult
carbon-cache, carbon-relay, Whisper - scale with multiple caches, replicate
Event storage - Elasticsearch
Handles distribution well
cluster scales reads
distribute across az
sharding indices
to avoid split-brain during network partitions:
running masters on dedicated nodes
running data nodes on dedicated nodes
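A sketch of the index side of this, assuming the official elasticsearch Python client and a daily logstash-style index; the shard and replica counts are placeholder assumptions that let reads be served by multiple data nodes across AZs:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder cluster address

# Spread each day's events across several primary shards and keep one
# replica so reads can be served from more than one data node.
es.indices.create(
    index="logstash-2015.04.14",
    body={
        "settings": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
        }
    },
)
```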
Visualization
uchiwa https://github.com/sensu/uchiwa
kibana
grafana
Future work:
anomaly detection via Heka or Skyline (see the sketch below)
influxdb for storage
merge kibana and grafana
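Not Heka or Skyline themselves, just a toy three-sigma rule to show the kind of judgement such tools apply to a metric stream; the window size, threshold and sample values are arbitrary assumptions:

```python
from collections import deque
from statistics import mean, stdev

def make_detector(window=60, sigmas=3.0):
    """Flag a point as anomalous if it falls more than `sigmas` standard
    deviations away from the mean of the trailing window."""
    history = deque(maxlen=window)

    def check(value):
        anomalous = False
        if len(history) >= 5:  # need a little history before judging
            mu, sd = mean(history), stdev(history)
            anomalous = sd > 0 and abs(value - mu) > sigmas * sd
        history.append(value)
        return anomalous

    return check

detect = make_detector()
for v in [10, 11, 10, 12, 11, 10, 11, 95]:
    print(v, detect(v))  # only the 95 should be flagged
```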
Workshop:
/etc/sensu/conf.d/rabbitmq.json
/etc/sensu/conf.d/client.json
/etc/logstash/conf.d/agent.conf
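For reference, a sketch of generating the client registration file listed above (/etc/sensu/conf.d/client.json); the client name, address and subscriptions are placeholder assumptions:

```python
import json

# Placeholder values; in practice these describe the host the agent runs on
# and the check subscriptions it should pick up from RabbitMQ.
client_config = {
    "client": {
        "name": "db01.example.com",
        "address": "10.0.0.21",
        "subscriptions": ["mysql", "linux-base"],
    }
}

with open("/etc/sensu/conf.d/client.json", "w") as f:
    json.dump(client_config, f, indent=2)
```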