Last active
July 15, 2018 05:49
Namshi-on-call 08/07/2018 ~ 15/07/2018
08/07 21:05 ~ 22:10 Payments crons were deployed with the wrong version.
Alert cleared after deploying the correct image version.
09/07 02:05 ~ 02:06 Nginx 4XX alert: traffic loop caused by an old app version.
The alert is flapping; no fix is available for now.
09/07 11:50 ~ 12:20 Catalog replies were slow; found a pod killed by OOM on the node.
RES memory size was more than 1.5G.
09/07 22:30 ~ 22:50 Pricing app alert on metric `pricing-auto-approver`; issue auto-resolved.
13/07 02:50 ~ 04:45 IMR warning (stale nodes killed manually).
13/07 04:22 ~ 04:30 Prometheus restart coincided with node replacement
in the *net* namespace, so the alert got stuck.
13/07 21:10 ~ 21:20 order-export-sa cronjob got stuck receiving Google credentials.
13/07 23:05 Catalog pod crashed, causing many 502s. Pod was auto-restarted.
14/07 12:07 ~ 13:05 order-export-sa cronjob hung; the job's pod was killed.
15/07 05:20 ~ 05:25 Many 502s because a website pod was killed by OOM.
The 2GB limit is too low.
15/07 05:30 ~ 05:35 order-export-sa cronjob hung; the job was auto-restarted,
still need to tune the alert trigger.