@koma77
Last active July 15, 2018 05:49
Namshi-on-call 08/07/2018 15/07/2018
08/07 21:05 ~ 22:10 Payments crons were deployed with the wrong version.
Alert cleared after deploying the correct image version.
09/07 02:05 ~ 02:06 Nginx 4XX alert: traffic loop produced by an old app version.
It's a flapping alert; no solution for now.
09/07 11:50 ~ 12:20 Catalog replies were slow; found a pod killed by OOM on the node.
RES memory size was more than 1.5G.
09/07 22:30 ~ 22:50 Pricing app metric `pricing-auto-approver`; issue auto-resolved.
13/07 02:50 ~ 04:45 IMR warning (stale nodes killed manually)
13/07 04:22 ~ 04:30 Prometheus restart coincided with node replacement
in the *net* namespace, so the alert hung.
13/07 21:10 ~ 21:20 order-export-sa cronjob stuck on receiving Google credentials.
13/07 23:05 Catalog pod crashed, causing a lot of 502s. Pod was auto-restarted.
14/07 12:07 ~ 13:05 order-export-sa cronjob hung; the job's pod was killed.
15/07 05:20 ~ 05:25 A lot of 502s because the website pod was killed by OOM.
The 2GB limit is too low.
15/07 05:30 ~ 05:35 order-export-sa cronjob hung; the job was auto-restarted.
Still need to tune the alert trigger.