josegonzalez/gist:2855592

Now that SeatGeek is crushing it, it has become apparent that we need server and service monitoring beyond simple health checks with Pingdom.
This should be split into multiple parts, but the overall idea is as follows:
Metric Collection

Collect metrics from various servers, including, but not limited to:

disk space
load average
memory usage
processes in D/Z state
network latency

This is generic data that is useful for every instance: it gives us a baseline to compare against future states and against new instances of the same type. All current graphs in Cacti should be supported. See Appendix [1] for a list.
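As a rough illustration of the baseline collection described above, here is a minimal sketch that gathers two of the listed metrics (disk space and load average) using only the Python standard library. The function name and the metric keys are hypothetical, not an agreed naming scheme, and `os.getloadavg` is only available on Unix-like systems. Memory usage, D/Z process counts, and network latency are left out because they need platform-specific sources.

```python
import os
import shutil
import time


def collect_system_metrics(path="/"):
    """Return a dict of baseline metrics for the host this runs on.

    The key names are placeholders chosen for this sketch.
    """
    usage = shutil.disk_usage(path)          # total, used, free (bytes)
    load1, load5, load15 = os.getloadavg()   # Unix only
    used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "timestamp": int(time.time()),
        "disk.used_pct": round(used_pct, 2),
        "load.1min": load1,
        "load.5min": load5,
        "load.15min": load15,
    }


if __name__ == "__main__":
    for name, value in sorted(collect_system_metrics().items()):
        print(name, value)
```

Running this from cron or a small agent on each instance would be one way to build the baseline; how the values are shipped to a central store is a separate decision.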
Services should generally have the following available:

recently updated
is running the correct number of processes
available to the outside world
response time
response ok?
queries per second

Some of this information can be counters, some just simple boolean 0/1 states. A good example would be the realtime API, which tracks QPS in real time from the API's varnishlog. Others would be events, such as a user loading the Spotify app or a deploy.
Metrics should be named sanely, and we should strive to document what each metric measures, either in a wiki or elsewhere.
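To make "named sanely" a little more concrete, the sketch below builds dotted, lowercase metric names and formats samples as `<path> <value> <timestamp>` lines, which is the shape of Graphite's plaintext protocol. Graphite as the backing store is an assumption here (the dashboards linked in this document are Graphite front ends, but the store itself is not decided), and the helper names `metric_name` and `format_line` are hypothetical. Boolean states are sent as 0/1.

```python
import re
import time


def metric_name(*parts):
    """Join name parts into a dotted, lowercase metric path."""
    cleaned = []
    for part in parts:
        # Collapse anything outside a-z, 0-9, '_' and '-' into underscores.
        part = re.sub(r"[^a-z0-9_-]+", "_", str(part).lower()).strip("_")
        if part:
            cleaned.append(part)
    return ".".join(cleaned)


def format_line(name, value, timestamp=None):
    """Format one sample as '<path> <value> <timestamp>'."""
    if isinstance(value, bool):
        value = int(value)           # boolean states become 0/1
    if timestamp is None:
        timestamp = time.time()
    return "%s %s %d" % (name, value, int(timestamp))


if __name__ == "__main__":
    print(format_line(metric_name("API", "realtime", "QPS"), 42.5))
    print(format_line(metric_name("api", "realtime", "up"), True))
```

Keeping name construction in one helper makes it easier to document the resulting paths in a single place.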
Metric Monitoring

A single dashboard, not unlike Geckoboard, with key SG metrics would be useful. There are plenty of existing tools for this, links forthcoming. Ops would like to view certain metrics, while devs for a particular service might want a dashboard for just that service. Thus, any dashboard tool should make it easy to compose new graphs and easy to compose graphs into dashboards.
Note that lots of graphs will be naturally spiky, so it should be easy for a developer to create smooth, visually pleasing graphs where exact accuracy isn't necessary. This can be done with rolling averages.
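For reference, a rolling average over a spiky series can be sketched in a few lines. This is a plain trailing-window mean, not the smoothing function of any particular dashboard tool; the function name is hypothetical. The first few points are averaged over however many samples are available so far.

```python
from collections import deque


def rolling_average(values, window):
    """Return the trailing mean of `values` over the last `window` samples."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)
    total = 0.0
    smoothed = []
    for value in values:
        if len(recent) == window:
            total -= recent[0]       # oldest sample is about to be dropped
        recent.append(value)
        total += value
        smoothed.append(total / len(recent))
    return smoothed


if __name__ == "__main__":
    print(rolling_average([1, 10, 1, 10], 2))  # [1.0, 5.5, 5.5, 5.5]
```

A larger window gives a smoother line at the cost of hiding short spikes, which is why it suits dashboards better than alerting.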
GDash [1] is an example of what a useful dashboard might look like. Being able to annotate what a graph is about is EXTREMELY useful. There should also be a way to cache the generated graphs in several places, such as on disk, S3, or Dropbox. The purpose is to be able to refer back to past graphs and to send graphs easily via email.
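One possible shape for the on-disk part of that cache is sketched below: each rendered graph is written under a directory per graph name and per UTC day, so past graphs can be found by date. The directory layout, the `.png` extension, and the function names are assumptions made for this sketch. Uploading to S3 or Dropbox is not shown.

```python
import os
import time


def cache_path(cache_dir, graph_name, timestamp):
    """Build '<cache_dir>/<graph>/<YYYY>/<MM>/<DD>/<HHMMSS>.png' (UTC)."""
    moment = time.gmtime(timestamp)
    day = time.strftime("%Y/%m/%d", moment)
    stamp = time.strftime("%H%M%S", moment)
    return os.path.join(cache_dir, graph_name, day, stamp + ".png")


def cache_graph(cache_dir, graph_name, image_bytes, timestamp=None):
    """Write already-rendered image bytes to the cache and return the path."""
    if timestamp is None:
        timestamp = time.time()
    path = cache_path(cache_dir, graph_name, timestamp)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(image_bytes)
    return path
```

The returned path can then be attached to an email or handed to whatever uploads copies elsewhere.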
Another use case would be creating small widgets for various services, like this [2]. Graphene [3] provides realtime charting. This might be interesting for the API and other critical services, but we would not want to enable it for all charts/widgets by default. Refreshing charts once every few minutes seems like the way to go here.
Dashboards can be stored either in Redis as hashes, or in MongoDB as documents. The data structure is currently undecided. Graphiti [4] does a lot of the above, but I think it's a bit heavy for our needs. We also don't want to depend on development by an outside company that isn't adding the features we want or need.
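Since the data structure is still open, the following is only a strawman to react to: a dashboard is a name, a refresh interval, and a list of graphs, where each graph carries a title, its metric targets, and a free-text description (the annotation mentioned earlier). All field names and function names are hypothetical. The nested dict could be stored as-is as a MongoDB document; for a Redis hash, whose fields hold flat string values, the graph list is serialized to JSON.

```python
import json


def make_graph(title, targets, description=""):
    """One graph: a title, the metric paths it plots, and an annotation."""
    return {"title": title, "targets": list(targets), "description": description}


def make_dashboard(name, graphs, refresh_seconds=300):
    """A dashboard is a named list of graphs refreshed every few minutes."""
    return {"name": name, "refresh_seconds": refresh_seconds, "graphs": list(graphs)}


def to_redis_hash(dashboard):
    """Flatten a dashboard into string fields suitable for a Redis hash."""
    return {
        "name": dashboard["name"],
        "refresh_seconds": str(dashboard["refresh_seconds"]),
        "graphs": json.dumps(dashboard["graphs"], sort_keys=True),
    }


def from_redis_hash(fields):
    """Rebuild the nested dashboard dict from its flattened hash fields."""
    return {
        "name": fields["name"],
        "refresh_seconds": int(fields["refresh_seconds"]),
        "graphs": json.loads(fields["graphs"]),
    }
```

The default of 300 seconds reflects the "once every few minutes" refresh suggested above; a realtime widget would override it per dashboard.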
Alerting

Can we move alerting
Links


[1] GDash: http://www.devco.net/archives/2011/10/08/gdash-graphite-dashboard.php
[2] Graphite Widget: http://hoborglabs.com/en/blog/graphite-widget
[3] Graphene Realtime Dashboard: http://jondot.github.com/graphene/
[4] Graphiti: https://github.com/paperlesspost/graphiti

Appendix


[1] List of metrics collected in Cacti


OS


Context Switches
CPU Usage (User/Nice/System/Idle/Iowait/Irq/Softirq/Steal/Guest)
Disk Elapsed IO Time (Io Time/Io Time Weighted)
Disk Operations (Reads/Reads Merged/Writes/Writes Merged)
Disk Read/Write Time (Time Spent Reading/Time Spent Writing)
Disk Sectors Read/Written (Sectors Read/Sectors Written)
Forks
Interrupts
Load Average
Memory (Memused/Memcached/Membuffer/Memshared/Memfree)
Number of Users


Network


Traffic in bytes/sec, Total Bandwidth (Inbound/Outbound)
Unicast Packets (In/Out)


Nginx


Accepts/Handled
Requests
Scoreboard (Reading/Writing/Waiting/Active Connections)


MongoDB


MongoDB Background Flushes GT (Back Flushes/Back Total Ms/Back Average Ms/Back Last Ms)
MongoDB Commands GT (Inserts/Queries/Updates/Deletes/Getmores/Commands)
MongoDB Connections (Connected Clients)
MongoDB Index Ops GT (Accesses/Hits/Misses/Resets)
MongoDB Memory GT (Used Virtual Memory/Used Mapped Memory/Used Resident Memory)
MongoDB Slave Lag GT (Slave Lag)


MySQL


InnoDB Transactions (Active/Locked)
InnoDB Adaptive Hash Index (Cells Total/Cells Used)
InnoDB Buffer Pool (Pool Size/Database Pages/Free Pages/Modified Pages)
InnoDB Checkpoint Age GT (Uncheckpointed Bytes)
InnoDB Current Lock Waits GT (Lock Wait Secs)
InnoDB I/O Pending GT (Aio Log Ios/Aio Sync Ios/Buf Pool Flushes/Chkp Writes/Ibuf Aio Reads/Log Flushes/Log Writes/Normal Aio Reads/Normal Aio Writes)
InnoDB Insert Buffer GT (Ibuf Inserts/Ibuf Merged/Ibuf Merges)
InnoDB Insert Buffer Usage GT (Ibuf Cell Count/Ibuf Used Cells/Ibuf Free Cells)
InnoDB Internal Hash Memory Usage GT (Adaptive Hash Memory/Page Hash Memory/Dictionary Cache Memory/File System Memory/Lock System Memory/Recovery System Memory/Thread Hash Memory)
InnoDB Lock Structures GT
InnoDB Log GT (InnoDB Log Buffer Size/Log Bytes Written/Log Bytes Flushed/Unflushed Log)
InnoDB Memory Allocation GT (Total Mem Alloc/Additional Pool Alloc)
InnoDB Row Lock Time GT
InnoDB Row Lock Waits GT
InnoDB Row Operations GT (Read/Deleted/Updated/Inserted)
InnoDB Sem Wait Time
InnoDB Semaphore Waits
InnoDB Semaphores GT (Spin Rounds/Spin Waits/OS Waits)
InnoDB Tables In Use GT (InnoDB Tables In Use/InnoDB Locked Tables)
InnoDB Transactions GT (InnoDB Transactions/Current Transactions/History List/Read Views)
MyISAM Indexes GT (Key Read Requests/Key Reads/Key Write Requests/Key Writes)
MyISAM Key Cache GT (Key Buffer Size/Key Buf Bytes Used/Key Buf Bytes Unflushed)
MySQL Binary/Relay Logs GT (Binlog Cache Use/Binlog Cache Disk Use/Binary Log Space/Relay Log Space)
MySQL Command Counters GT (Questions/Com Select/Com Delete/Com Insert/Com Update/Com Replace/Com Load/Com Delete Multi/Com Insert Select/Com Update Multi/Com Replace Select)
MySQL Connections GT (Max Connections/Max Used Connections/Aborted Clients/Aborted Connects/Threads Connected/Connections)
MySQL Files and Tables GT (Table Cache/Open Tables/Open Files/Opened Tables)
MySQL Handlers GT (Write/Update/Delete/Read First/Read Key/Read Next/Read Prev/Read Rnd/Read Rnd Next)
MySQL Network Traffic GT (Bytes Sent/Bytes Received)
MySQL Processlist GT (Closing Tables/Copying To Tmp Table/End/Freeing Items/Init/Locked/Login/Preparing/Reading From Net/Sending Data/Sorting Result/Statistics/Updating/Writing To Net/None/Other)
MySQL Query Cache GT (Queries In Cache/Hits/Inserts/Not Cached/Lowmem Prunes)
MySQL Query Cache Memory GT (Cache Size/Free Memory/Total Blocks/Free Blocks)
MySQL Query Response Time in Microseconds GT (Time Total 00-13)
MySQL Query Time Histogram Count GT (Time Count 00-13)
MySQL Replication GT (Slave Running/Slave Stopped/Slave Lag/Slave Open Temp Tables/Slave Retried Transactions)
MySQL Select Types GT (Full Join/Full Range Join/Range/Range Check/Scan)
MySQL Sorts GT (Sort Rows/Sort Range/Sort Merge Passes/Sort Scan)
MySQL Table Locks GT (Table Locks Immediate/Table Locks Waited/Slow Queries)
MySQL Temporary Objects GT (Created Tmp Tables/Created Tmp Disk Tables/Created Tmp Files)
MySQL Threads GT (Thread Cache Size/Threads Created)
MySQL Transaction Handler GT (Commit/Rollback/Savepoint/Savepoint Rollbacks)


Redis


Redis Commands GT (Total Commands Processed)
Redis Connections GT (Connected Clients/Connected Slaves/Total Connections Received)
Redis Memory GT (Used Memory)
Redis Unsaved Changes GT (Changes Since Last Save)