Created
April 14, 2015 05:20
Operational Visibility - Percona Live
Operational Visibility Workshop:
kaizen - recognizes that improvement can be small or large
the objectives of observability - business velocity, availability, efficiency and scalability
problems with traditional monitoring systems - too many dashboards, data collected multiple times, resolution too high, logs not centralized
a little better -
monitoring - sensu, which uses rabbitmq and redis for scaling
logstash agent - can count telemetry data and send it to statsd, which sends it to graphite
extracting performance schema data to graphite
elasticsearch
whisper
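A minimal sketch of the "count it and hand it to statsd" step, assuming a statsd daemon on its default UDP port 8125. The metric name `mysql.slow_queries` and the helper names are made up for illustration; only the `<name>:<value>|c` counter line format comes from statsd itself.

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Build a statsd counter datagram: <name>:<value>|c"""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget one counter increment over UDP.

    host and port are assumptions; 8125 is the usual statsd default.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_counter(name, value), (host, port))
    finally:
        sock.close()
```

Usage: `send_counter("mysql.slow_queries")` for each matching log line. UDP keeps the sender cheap, at the cost of silently dropped packets if statsd is down.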
What's in a metric - telemetry and events
architectural components
sensing
collecting, e.g. agent https://vividcortex.com/
agent or agentless
push and pull
filtering and tokenizing
scaling
performance impact
analysis before storage
in stream
feeding into automation
anomaly detection
aggregation and calculations
storage
telemetry
events
resolution and aggregation
backends: graphite uses whisper, which is flat files; zabbix uses mysql; cacti uses RRDtool (round-robin database); elasticsearch runs on lucene
alerts
rule based processing
notification routing
event aggregation and management
under-paging, not over-paging
actionable alerts
Visualization
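One way to read "rule based processing" together with "under-paging, not over-paging" is a rule that pages only after several consecutive breaches and stays quiet while it is already firing. The sketch below is an illustration only: the class name, the threshold and the three-sample window are assumptions, not something shown in the workshop.

```python
from collections import deque


class ConsecutiveBreachRule:
    """Page once after N consecutive samples above a threshold."""

    def __init__(self, threshold: float, required_breaches: int = 3):
        self.threshold = threshold
        # Sliding window of the most recent breach flags.
        self.recent = deque(maxlen=required_breaches)
        self.firing = False

    def observe(self, value: float):
        """Return 'page', 'resolve', or None for a new sample."""
        self.recent.append(value > self.threshold)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and all(self.recent) and not self.firing:
            self.firing = True
            return "page"
        # Resolve only after a full window of healthy samples.
        if self.firing and not any(self.recent):
            self.firing = False
            return "resolve"
        return None
```

A single spike never pages, and a sustained breach pages exactly once until it clears.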
What to measure?
we measure to support our KPIs
we measure to preempt incidents
we measure to diagnose problems
we alert when customers feel pain
supporting our KPIs
velocity
how fast can the org push new features? how fast can the org grow or shrink?
efficiency
how cost efficient is the environment? how elastic is our environment?
security
performance
Apdex(n), where n is the latency threshold
availability
how available is each component to the application?
"Measure as much as possible, alert on as little as possible"
automate remediation if possible
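For the Apdex score under "performance", a small sketch using the standard Apdex definition: requests at or below the threshold T count as satisfied, those up to 4T count as tolerating (half weight), and the rest as frustrated. The sample latencies are invented.

```python
def apdex(latencies, t):
    """Apdex_T = (satisfied + tolerating / 2) / total.

    latencies: response times in the same unit as t.
    t: target latency threshold (the "n" in Apdex(n)).
    """
    if not latencies:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for x in latencies if x <= t)
    tolerating = sum(1 for x in latencies if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies)


# Hypothetical request latencies in seconds, threshold 0.5 s.
score = apdex([0.1, 0.2, 0.6, 1.0, 3.0], 0.5)
```

Here two requests are satisfied, two are tolerating and one is frustrated, so the score is (2 + 1) / 5 = 0.6.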
At the server level: the basics - resource utilization, process behavior and the network. look at syslog, mysql log, cron, authentication and mail logs. aggregate up in distributed systems.
At the database level: exposed database metrics, sql analytics and metrics
Database metrics: how fast are we hitting our resource and concurrency limits?
how do my queries behave - sorts, joins, index scans, commits, rollbacks
At the connection layer: max_connections, open tcp ports, open files, etc.
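A sketch of turning raw database counters into the two questions above: how fast things are happening (per-second rates from two samples) and how close we are to a limit (connections in use versus `max_connections`). It assumes the samples were already collected, for example from MySQL's `SHOW GLOBAL STATUS`; the numbers and helper names are made up.

```python
def per_second(prev: dict, curr: dict, interval_secs: float) -> dict:
    """Convert two samples of monotonically increasing counters to rates."""
    if interval_secs <= 0:
        raise ValueError("interval must be positive")
    return {
        name: (curr[name] - prev[name]) / interval_secs
        for name in curr
        if name in prev
    }


def connection_utilization(threads_connected: int, max_connections: int) -> float:
    """Fraction of the connection limit currently in use."""
    return threads_connected / max_connections


# Hypothetical samples taken 60 seconds apart.
rates = per_second(
    {"Com_commit": 1000, "Com_rollback": 10},
    {"Com_commit": 1600, "Com_rollback": 13},
    60,
)
```

The rates and the utilization ratio are what would be shipped to the telemetry store, not the raw counters. A counter reset (server restart) would produce a negative rate and needs handling in real use.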
What's next?
better time series storage
leverage parallelism and data aggregation
machine learning
Workshop hands on: https://github.com/dtest/plsc15-opvis
Pythian opsviz stack - https://github.com/pythian/opsviz
Current stack
Telemetry data: sensu
agent pushes to rabbitmq
sensu agent on host polls every 1 to 60 secs
sensu stores in redis
why sensu -
excellent api, backwards compatible with nagios checks, can be parallelized
rabbitmq issues
network partitions
node failures
mirrored queues
tcp load balancers
az failures
scaling concerns
redis failures
use elasticache, multi az
monitor your monitor - make sure sensu has n+1 hosts
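"Backwards compatible with nagios checks" means a check is just a program that prints one status line and exits with the Nagios plugin codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch of such a check on the 1-minute load average; the thresholds are arbitrary assumptions and this is not one of the workshop's checks.

```python
import os

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(value: float, warn: float, crit: float):
    """Map a measured value to (exit_code, status_line)."""
    if value >= crit:
        return CRITICAL, f"CRITICAL: load {value:.2f} >= {crit}"
    if value >= warn:
        return WARNING, f"WARNING: load {value:.2f} >= {warn}"
    return OK, f"OK: load {value:.2f}"


def main(warn: float = 4.0, crit: float = 8.0) -> int:
    """Print the status line and return the exit code."""
    try:
        load1 = os.getloadavg()[0]
    except (OSError, AttributeError):
        # Load average is unavailable on this platform.
        print("UNKNOWN: cannot read load average")
        return UNKNOWN
    status, message = evaluate(load1, warn, crit)
    print(message)
    return status
```

To use it as a check script, end the file with `sys.exit(main())` (after `import sys`), so the scheduler sees the exit code.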
Telemetry data: logstash
Event data: logstash - tokenizes and gets stuff into elasticsearch
Telemetry storage: graphite. works with many different pollers to graph everything. limitation - flat files, which means complex queries are difficult
carbon cache, carbon relay, whisper - scale with multiple caches, replicate
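Part of why many pollers work with graphite is that carbon accepts a very simple plaintext line: `<metric.path> <value> <unix_timestamp>`, newline terminated, normally on TCP port 2003. A sketch of a sender, assuming a carbon cache or relay on localhost; the metric path is invented.

```python
import socket
import time


def format_metric(path: str, value, timestamp=None) -> bytes:
    """Build one carbon plaintext line: '<path> <value> <timestamp>\\n'."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n".encode("ascii")


def send_metrics(lines, host: str = "127.0.0.1", port: int = 2003) -> None:
    """Send pre-formatted lines to carbon over TCP.

    host and port are assumptions; 2003 is the usual plaintext listener.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"".join(lines))
```

Usage: `send_metrics([format_metric("db1.mysql.threads_connected", 42)])`. Pointing the sender at a carbon relay instead of a single cache is how the "multiple caches, replicate" scaling stays invisible to the pollers.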
Event storage - elasticsearch
handles distribution well
cluster scales reads
distribute across az
sharding indices
to guard against network partition problems:
running masters on dedicated nodes
running data nodes on dedicated nodes
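Dedicated masters are usually paired with a quorum rule so that only the majority side of a partition can elect a master. The notes do not mention this, so treat it as an assumption: in elasticsearch versions of that era the value went into the `discovery.zen.minimum_master_nodes` setting. A sketch of the arithmetic:

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Majority quorum: floor(n / 2) + 1 master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(visible_masters: int, master_eligible: int) -> bool:
    """True if this side of a partition still sees a quorum."""
    return visible_masters >= minimum_master_nodes(master_eligible)
```

With three dedicated masters spread over three AZs the quorum is 2, so losing one AZ leaves a working cluster and the isolated node cannot form its own.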
Visualization
uchiwa https://github.com/sensu/uchiwa
kibana
grafana
Future work:
anomaly detection via heka or skyline
influxdb for storage
merge kibana and grafana
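To make "anomaly detection" concrete, here is a deliberately simple rolling z-score detector. It is a stand-in for illustration, not how heka or skyline work internally; the window size and the 3-sigma threshold are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScore:
    """Flag a sample that sits far from the recent window's mean."""

    def __init__(self, window: int = 30, threshold: float = 3.0):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def is_anomaly(self, x: float) -> bool:
        """Score x against the window so far, then add it to the window."""
        anomalous = False
        if len(self.values) >= 2:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            # A perfectly flat window has no spread to compare against.
            if sigma > 0:
                anomalous = abs(x - mu) / sigma > self.threshold
        self.values.append(x)
        return anomalous


detector = RollingZScore(window=10, threshold=3.0)
flags = [detector.is_anomaly(v) for v in [10, 11, 10, 9, 10, 11, 10, 9, 10, 50]]
```

Only the final sample (50) is flagged. Real telemetry with seasonality or trends needs something smarter, which is the point of the tools listed above.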
Workshop:
/etc/sensu/conf.d/rabbitmq.json
/etc/sensu/conf.d/client.json
/etc/logstash/conf.d/agent.conf
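A rough idea of what the two sensu files hold, generated from Python so the shape is easy to see. The key layout follows the sensu configuration format of that era as best recalled; every value (hosts, vhost, credentials, client name, subscriptions) is a placeholder assumption. The workshop repo has the real files.

```python
import json


def rabbitmq_config(host: str, user: str, password: str) -> dict:
    """Contents for /etc/sensu/conf.d/rabbitmq.json (placeholder values)."""
    return {
        "rabbitmq": {
            "host": host,
            "port": 5672,
            "vhost": "/sensu",
            "user": user,
            "password": password,
        }
    }


def client_config(name: str, address: str, subscriptions: list) -> dict:
    """Contents for /etc/sensu/conf.d/client.json (placeholder values)."""
    return {
        "client": {
            "name": name,
            "address": address,
            "subscriptions": subscriptions,
        }
    }


# Hypothetical host and credentials, printed rather than written to disk.
print(json.dumps(rabbitmq_config("10.0.0.10", "sensu", "changeme"), indent=2))
print(json.dumps(client_config("db1", "10.0.0.21", ["mysql"]), indent=2))
```

The client's subscriptions decide which checks the server schedules for it.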