In an IDA world we need to think about heartbeats differently. ELB heartbeats should not result in the whole program failing just because one of the services the app needs is unreachable.
-
Health check should fail only if the full IDA is unusable.
-
Otherwise it should indicate problems on the health page but continue to do what it can.
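A minimal sketch of that health-check behaviour, not an existing implementation. It assumes the load balancer treats HTTP 200 as healthy and anything else as unhealthy. The names `evaluate_health` and `run_check`, the split into "critical" and "optional" dependencies, and the "degraded" status are hypothetical and only there for illustration.

```python
from typing import Callable, Dict, Tuple

# A check is any zero-argument callable that returns True when the dependency is reachable.
Check = Callable[[], bool]


def run_check(check: Check) -> bool:
    """Run one dependency check; treat any exception as a failure."""
    try:
        return bool(check())
    except Exception:
        return False


def evaluate_health(
    critical: Dict[str, Check], optional: Dict[str, Check]
) -> Tuple[int, dict]:
    """Return (HTTP status, report) for the health page.

    Only a failing critical dependency produces a non-200 status.
    A failing optional dependency is reported but the status stays 200.
    """
    critical_results = {name: run_check(c) for name, c in critical.items()}
    optional_results = {name: run_check(c) for name, c in optional.items()}

    if not all(critical_results.values()):
        overall, status_code = "unavailable", 503  # the IDA itself is unusable
    elif not all(optional_results.values()):
        overall, status_code = "degraded", 200  # sick, not dead
    else:
        overall, status_code = "ok", 200

    report = {
        "overall": overall,
        "services": {**critical_results, **optional_results},
    }
    return status_code, report
```

With this shape the ELB only pulls an instance when a critical check fails, while the report body still shows which optional services are down so monitoring can alert on "degraded".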
-
Will we need to monitor differently? We want to know when things are sick, not just when they're dead.
-
TEST/PLAT: How are we testing short-circuiting right now? Any integration tests?
-
Metrics and Logging
- Errors
- Sentry tool for errors - New Relic does the same sort of thing.
-
Grafana + Graphite or InfluxDB <- New tools