GitLab.com Database Incident - 2017/01/31
This incident affected the database (including issues and merge requests) but not the Git repositories or wikis.
Timeline (all times UTC):
2017/01/31 16:00/17:00 - 21:00
YP is working on setting up pgpool and replication in staging, and creates an LVM snapshot to get up-to-date production data into staging. This was done roughly 6 hours before the data loss.
Getting replication to work is proving problematic and time-consuming (estimated at ±20 hours just for the initial pg_basebackup sync). Work is interrupted because of this (YP needs the help of another colleague who is not working this day) and because of spam/high load on GitLab.com.
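For context, a minimal sketch of the kind of re-seed pg_basebackup does when rebuilding a standby. The hosts and data directory come from this document; the replication role name, the omnibus paths/commands and the credential handling (e.g. .pgpass) are assumptions:

```bash
# Run on the standby (db2.cluster). pg_basebackup insists on an empty --pgdata
# directory, which is why /var/opt/gitlab/postgresql/data gets wiped before a
# re-seed. "gitlab_replicator" is an assumed replication role; -X stream pulls
# WAL alongside the copy and -R writes a minimal recovery.conf for streaming.
sudo gitlab-ctl stop postgresql

sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_basebackup \
  --host=db1.cluster.gitlab.com \
  --username=gitlab_replicator \
  --pgdata=/var/opt/gitlab/postgresql/data \
  -X stream --progress -R

sudo gitlab-ctl start postgresql
```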
2017/01/31 21:00 - Spike in database load due to spam users - Twitter | Slack
Blocked users based on IP address
Removed a user for using a repository as some form of CDN, resulting in 47,000 IPs signing in using the same account (causing high DB load). This was communicated to the infrastructure and support teams.
Removed users for spamming (by creating snippets) - Slack
Database load goes back to normal; some manual PostgreSQL vacuuming is applied here and there to catch up with a large amount of dead tuples.
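The manual vacuuming mentioned above amounts to something like the following; snippets is just an example table (it is where the spam landed), not necessarily what was actually vacuumed:

```bash
# Connect through the omnibus wrapper and reclaim dead tuples left behind by
# the spam cleanup. VERBOSE reports how many dead row versions were removed.
sudo gitlab-psql -d gitlabhq_production \
  -c 'VACUUM (VERBOSE, ANALYZE) snippets;'
```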
2017/01/31 22:00 - Replication lag alert triggered in PagerDuty - Slack
Attempts to fix db2; it’s lagging behind by about 4 GB at this point
db2.cluster refuses to replicate; /var/opt/gitlab/postgresql/data is wiped to ensure a clean replication
db2.cluster refuses to connect to db1, complaining about max_wal_senders being too low. This setting is used to limit the number of WAL (= replication) clients
YP adjusts max_wal_senders to 32 on db1 and restarts PostgreSQL
PostgreSQL complains about too many semaphores being open, refusing to start
YP lowers max_connections from 8000 to 2000, and PostgreSQL starts again (despite 8000 having been in use for almost a year); see the configuration sketch after this block
db2.cluster still refuses to replicate, though it no longer complains about connections; instead it just hangs there not doing anything
At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t because the replication problems popped up all of a sudden.
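For reference, the two settings being adjusted above. The edit-and-restart workflow below is only a sketch for an omnibus-managed node; a later gitlab-ctl reconfigure would overwrite hand edits like these.

```bash
# On db1: raise max_wal_senders so db2 is allowed a replication connection,
# and lower max_connections so PostgreSQL can allocate its semaphores again.
sudoedit /var/opt/gitlab/postgresql/data/postgresql.conf
#   max_wal_senders = 32      # caps concurrent WAL (replication) clients
#   max_connections = 2000    # down from 8000; drives semaphore usage at startup
sudo gitlab-ctl restart postgresql
```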
2017/01/31 23:00-ish
YP thinks that perhaps pg_basebackup is being super pedantic about the data directory needing to be empty, and decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com instead of db2.cluster.gitlab.com
2017/01/31 23:27 - YP terminates the removal, but it’s too late. Of around 310 GB only about 4.5 GB is left - Slack
Recovery - 2017/01/31 23:00
Suggested recovery solutions:
Migrate db1.staging.gitlab.com data to GitLab.com (±6 hours old)
Problem with webhooks: these are removed as part of the staging sync.
Restore LVM snapshot (6 hours old)
Sid: try to undelete files?
CW: Not possible! `rm -Rvf` Sid: OK
JEJ: Probably too late, but isn’t it sometimes possible if you make the disk read-only quickly enough? Also, a file descriptor might still be open if the file was in use by a running process, according to http://unix.stackexchange.com/a/101247/213510 (see the sketch after this list)
YP: PostgreSQL doesn’t keep all files open at all times, so that wouldn’t work. Also, Azure is apparently really good at removing data quickly, but not at sending it over to replicas. In other words, the data can’t be recovered from the disk itself.
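For reference, the approach JEJ alludes to only recovers files that a running process still holds open, which is exactly the condition YP points out does not hold for most PostgreSQL files. A minimal sketch, with the PID, fd number and destination path as placeholders:

```bash
# List files that are deleted on disk but still held open by some process.
sudo lsof +L1 | grep /var/opt/gitlab/postgresql/data

# If, say, PID 12345 still had fd 7 open on a deleted relation file, its
# contents could be copied back out of procfs while that process is alive.
sudo cp /proc/12345/fd/7 /var/opt/recovered/relation-file
```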
Action taken:
2017/02/01 00:36 - Backup db1.staging.gitlab.com data
2017/02/01 00:55 - Mount db1.staging.gitlab.com on db1.cluster.gitlab.com
Copy data from staging /var/opt/gitlab/postgresql/data/ to production /var/opt/gitlab/postgresql/data/
2017/02/01 01:05 - nfs-share01 server commandeered as a temporary storage place in /var/opt/gitlab/db-meltdown
2017/02/01 01:18 - Copy of remaining production data, including pg_xlog, tarred up as ‘20170131-db-meltodwn-backup.tar.gz’
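A rough sketch of the copy and archive steps above; the mount points, flags and ordering are assumptions, only the data directories, the db-meltdown location and the archive name come from the timeline:

```bash
# Archive what is left of the production data directory (including pg_xlog)
# into the db-meltdown area on the commandeered NFS share (mount path assumed).
sudo tar -czf /mnt/nfs-share01/db-meltdown/20170131-db-meltodwn-backup.tar.gz \
  -C /var/opt/gitlab/postgresql data

# Copy the ~6-hour-old staging data directory into place on db1.cluster,
# preserving ownership, permissions and timestamps.
sudo rsync -a /mnt/db1-staging/var/opt/gitlab/postgresql/data/ \
  /var/opt/gitlab/postgresql/data/
```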
TODO after data is restored:
Sid: shared a public link to this document from @gitlabstatus, https://twitter.com/gitlabstatus/status/826591961444384768
Update the Sentry DSN to the production value, as it was changed for staging to point to a different project
Try to restore webhooks
Remove again the users we removed earlier today for spam/abuse (restoring the older data brings them back).
Create outage issue
Create issue to change terminal PS1 format/colours to make it clear whether you’re using production or staging (red for production, yellow for staging); see the prompt sketch after this list
Show the full hostname in the bash prompt for all users by default (e.g. “db1.staging.gitlab.com” instead of just “db1”)
Somehow disallow rm -rf for the PostgreSQL data directory? Unsure if this is feasible, or necessary once we have proper backups
Add alerting for backups: check S3 storage etc.
Consider adding the last successful backup time to the DB so admins can see this easily (suggested by a customer in https://gitlab.zendesk.com/agent/tickets/58274)
Figure out why PostgreSQL suddenly had problems with max_connections being set to 8000, despite it having been set to that since 2016-05-13. A large portion of the frustration arose because this suddenly became a problem.
Upgrade dbX.cluster to PostgreSQL 9.6.1, as it’s still running the pinned 9.6.0 package (used for the Slony upgrade from 9.2 to 9.6.0)
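A sketch of the prompt changes proposed in the two PS1 items above; the hostname patterns used to tell production from staging are assumptions:

```bash
# /etc/profile.d/gitlab-prompt.sh -- sketch only; the hostname patterns below
# are assumptions. Colours the whole prompt red on production and yellow on
# staging, and uses \H so the full hostname (e.g. db1.staging.gitlab.com) is
# shown instead of just "db1".
case "$(hostname -f)" in
  *.staging.gitlab.com) colour='\[\e[1;33m\]' ;;  # yellow: staging
  *.gitlab.com)         colour='\[\e[1;31m\]' ;;  # red: production
  *)                    colour='' ;;
esac
PS1="${colour}"'\u@\H:\w \$ \[\e[0m\]'
```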
Problems Encountered
LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
Regular backups also seem to only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of the 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist, so it defaults to 9.2 and fails silently. No SQL dumps were made as a result. The Fog gem may have cleaned out older backups. (See the sanity-check sketch at the end of this list.)
Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours, they will be lost
The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
Our backups to S3 apparently don’t work either: the bucket is empty
So in other words, out of the five backup/replication techniques deployed, none are working reliably or were even set up in the first place.
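Two of the failures above, the silent 9.2-versus-9.6 pg_dump mismatch and the empty S3 bucket, could be caught with a simple check like the sketch below; the binary path and the bucket name are assumptions:

```bash
#!/usr/bin/env bash
set -euo pipefail

# The backup task must run a pg_dump that matches the cluster's major version;
# omnibus falls back to 9.2 when data/PG_VERSION is missing, and does so silently.
cluster=$(cat /var/opt/gitlab/postgresql/data/PG_VERSION)
dump=$(/opt/gitlab/embedded/bin/pg_dump --version | grep -oE '[0-9]+\.[0-9]+' | head -1)
if [ "$dump" != "$cluster" ]; then
  echo "pg_dump $dump does not match cluster $cluster" >&2
  exit 1
fi

# The S3 bucket that is supposed to receive the backups must not be empty
# (bucket name is a placeholder).
if ! aws s3 ls s3://gitlab-backups/ | grep -q '\.tar'; then
  echo "no backup archives found in s3://gitlab-backups/" >&2
  exit 1
fi
echo "backup sanity checks passed"
```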