## GitLab.com Database Incident - 2017/01/31

This incident affected the database (including issues and merge requests) but not the Git data (repositories and wikis).
Timeline (all times UTC):
2017/01/31 16:00/17:00 - 21:00
YP is working on setting up pgpool and replication in staging, and creates an LVM snapshot to get up-to-date production data to staging. This was done roughly 6 hours before the data loss.
Getting replication to work is proving to be problematic and time consuming (estimated at ±20 hours just for the initial pg_basebackup sync). Work is interrupted due to this (as YP needs the help of another colleague who’s not working this day), and due to spam/high load on GitLab.com.
2017/01/31 21:00 - Spike in database load due to spam users - Twitter | Slack
Blocked users based on IP address
Removed a user for using a repository as some form of CDN, resulting in 47 000 IPs signing in using the same account (causing high DB load). This was communicated with the infrastructure and support team.
Removed users for spamming (by creating snippets) - Slack
Database load goes back to normal; some manual PostgreSQL vacuuming is applied here and there to catch up with a large amount of dead tuples.
2017/01/31 22:00 - Replication lag alert triggered in PagerDuty - Slack
Attempts to fix db2; it’s lagging behind by about 4 GB at this point
db2.cluster refuses to replicate, so /var/opt/gitlab/postgresql/data is wiped to ensure a clean replication
db2.cluster refuses to connect to db1, complaining about max_wal_senders being too low. This setting is used to limit the number of WAL (= replication) clients
YP adjusts max_wal_senders to 32 on db1, restarts PostgreSQL
PostgreSQL complains about too many semaphores being open, refusing to start
YP lowers max_connections from 8000 to 2000 and PostgreSQL starts again (even though 8000 had been in use for almost a year)
db2.cluster still refuses to replicate, though it no longer complains about connections; instead it just hangs there not doing anything
At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden.
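For context on the semaphore complaint above: the PostgreSQL documentation on managing kernel resources gives a rough formula for the number of System V semaphore sets a server asks for at startup. The sketch below evaluates it for both max_connections values. The formula's exact terms differ between PostgreSQL versions, and the worker counts and the kernel limit of 128 sets are assumed defaults, not values read from db1. It shows how the requirement scales; it is not a diagnosis of why 8000 stopped working.

```python
import math


def semaphore_sets(max_connections, autovacuum_max_workers=3, max_worker_processes=8):
    """Approximate number of SysV semaphore sets PostgreSQL requests.

    Formula as given in the PostgreSQL kernel-resources documentation:
    ceil((max_connections + autovacuum_max_workers + max_worker_processes + 5) / 16).
    The worker defaults here are assumptions, not db1's real settings.
    """
    slots = max_connections + autovacuum_max_workers + max_worker_processes + 5
    return math.ceil(slots / 16)


# Hypothetical kernel limit on semaphore sets (SEMMNI). 128 was a common
# Linux default for a long time; check `kernel.sem` on the real host.
ASSUMED_SEMMNI = 128

for value in (8000, 2000):
    needed = semaphore_sets(value)
    verdict = "fits within" if needed <= ASSUMED_SEMMNI else "exceeds"
    print(f"max_connections={value}: {needed} sets, {verdict} assumed limit {ASSUMED_SEMMNI}")
```

Under these assumptions 8000 connections need 501 sets and 2000 need 126, so only the lower value fits below a limit of 128.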
2017/01/31 23:00-ish
YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, and decides to remove the directory. After a second or two he notices he ran the removal on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com
2017/01/31 23:27 - YP terminates the removal, but it’s too late. Of around 310 GB only about 4.5 GB is left - Slack
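One way to make this kind of mix-up harder is to have any destructive maintenance script state which host it expects to run on and refuse to continue anywhere else. A minimal sketch, assuming the script is started by hand on the target machine; `assert_on_host` is a hypothetical helper, not part of any GitLab tooling.

```python
import socket


def assert_on_host(expected_fqdn, actual_fqdn=None):
    """Raise unless the current machine is the one the operator intended.

    actual_fqdn can be passed in for testing; by default the local
    fully qualified hostname is looked up.
    """
    actual = actual_fqdn if actual_fqdn is not None else socket.getfqdn()
    if actual != expected_fqdn:
        raise RuntimeError(
            f"refusing to continue: running on {actual!r}, expected {expected_fqdn!r}"
        )
    return actual
```

A script that wipes the replica's data directory would call `assert_on_host("db2.cluster.gitlab.com")` as its first step, so running it on db1 fails before anything is deleted.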
Recovery - 2017/01/31 23:00
Suggested recovery solutions:
Migrate db1.staging.gitlab.com data to GitLab.com (±6 hours old)
Problem with webhooks: these are removed as part of the staging sync.
Restore LVM snapshot (6 hours old)
Sid: try to undelete files?
CW: Not possible! `rm -Rvf`
Sid: OK
JEJ: Probably too late, but isn't it sometimes possible if you make the disk read-only quickly enough? Also might still have file descriptor if the file was in use by a running process according to http://unix.stackexchange.com/a/101247/213510
YP: PostgreSQL doesn't keep all files open at all times, so that wouldn't work. Also, Azure is apparently really good at removing data quickly, but not at sending it over to replicas. In other words, the data can't be recovered from the disk itself.
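The file descriptor idea JEJ refers to can be shown in isolation. On Linux, a deleted file's contents stay readable through /proc/<pid>/fd/<n> for as long as some process still has it open. The sketch below is a self-contained demonstration on a temporary file, not a PostgreSQL recovery procedure; `recover_from_fd` and the file names are hypothetical.

```python
import os
import shutil
import tempfile


def recover_from_fd(pid, fd, dest_path):
    """Copy an open (possibly deleted) file out of Linux /proc."""
    source = f"/proc/{pid}/fd/{fd}"
    with open(source, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dest_path


def demo():
    workdir = tempfile.mkdtemp()
    victim = os.path.join(workdir, "relation.dat")
    with open(victim, "wb") as fh:
        fh.write(b"still here")
    holder = open(victim, "rb")  # stands in for a process holding the file open
    os.remove(victim)            # the directory entry is gone, the inode is not
    rescued = recover_from_fd(
        os.getpid(), holder.fileno(), os.path.join(workdir, "rescued.dat")
    )
    holder.close()
    with open(rescued, "rb") as fh:
        return fh.read()
```

As YP notes, this only helps for files a process happens to have open at the moment of deletion, which is why it was not an option here.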
Action taken:
2017/02/01 00:36 - Backup db1.staging.gitlab.com data
2017/02/01 00:55 - Mount db1.staging.gitlab.com on db1.cluster.gitlab.com
Copy data from staging /var/opt/gitlab/postgresql/data/ to production /var/opt/gitlab/postgresql/data/
2017/02/01 01:05 - nfs-share01 server commandeered as temp storage place in /var/opt/gitlab/db-meltdown
2017/02/01 01:18 - Copy of remaining production data, including pg_xlog tar’ed up as ‘20170131-db-meltodwn-backup.tar.gz’
TODO after data restored:
Sid: shared a public link to this document from @gitlabstatus, https://twitter.com/gitlabstatus/status/826591961444384768
Update the Sentry DSN back to production, as it was changed for staging to point to a different project
Try to restore webhooks
Remove again the users we removed earlier today due to spam/abuse.
Create outage issue
Create issue to change terminal PS1 format/colours to make it clear whether you’re using production or staging (red production, yellow staging)
Show the full hostname in the bash prompt for all users by default (e.g., “db1.staging.gitlab.com” instead of just “db1”)
Somehow disallow rm -rf for the PostgreSQL data directory? Unsure if this is feasible, or necessary once we have proper backups
Add alerting for backups: check S3 storage etc.
Consider adding a last successful backup time in DB so admins can see this easily (suggested by customer in https://gitlab.zendesk.com/agent/tickets/58274)
Figure out why PostgreSQL suddenly had problems with max_connections being set to 8000, despite it having been set to that since 2016-05-13. A large portion of frustration arose because of this suddenly becoming a problem.
Upgrade dbX.cluster to PostgreSQL 9.6.1 as it’s still running the pinned 9.6.0 package (used for the Slony upgrade from 9.2 to 9.6.0)
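The backup alerting item above could start as something very small: a check that the newest backup file exists, is recent, and is not suspiciously tiny. A minimal sketch for a backup stored as a local file; the 24 hour and 1 MiB thresholds are assumptions, and checking an S3 bucket would need a client library that is not shown here.

```python
import os
import time


def backup_problems(path, max_age_hours=24, min_bytes=1024 * 1024, now=None):
    """Return a list of reasons this backup file should raise an alert."""
    if not os.path.exists(path):
        return [f"{path}: backup file is missing"]
    now = time.time() if now is None else now
    info = os.stat(path)
    problems = []
    age_hours = (now - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{path}: last written {age_hours:.1f} hours ago")
    if info.st_size < min_bytes:
        problems.append(f"{path}: only {info.st_size} bytes")
    return problems
```

An empty list means the file passed both checks; anything else would be sent to whatever alerting channel is in use. The file's modification time doubles as a "last successful backup" timestamp that could be shown to admins.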
Problems Encountered
LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
Our backups to S3 apparently don’t work either: the bucket is empty
So in other words, out of 5 backup/replication techniques deployed, none are working reliably or were set up in the first place.
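The silent fallback SH describes (no PG_VERSION file, so 9.2 binaries are picked) is the sort of failure that a version lookup without a default would surface immediately. A minimal sketch of that idea; this is not the omnibus code, and `detect_pg_version` and `MissingVersionFile` are hypothetical names.

```python
import os


class MissingVersionFile(RuntimeError):
    """Raised when the data directory has no usable PG_VERSION."""


def detect_pg_version(data_dir):
    """Return the version string recorded in data_dir/PG_VERSION.

    Deliberately has no default: guessing a version is what allows
    a dump job to run the wrong binaries and fail quietly.
    """
    version_file = os.path.join(data_dir, "PG_VERSION")
    try:
        with open(version_file, encoding="ascii") as fh:
            version = fh.read().strip()
    except FileNotFoundError:
        raise MissingVersionFile(
            f"{version_file} not found; refusing to guess a pg_dump version"
        ) from None
    if not version:
        raise MissingVersionFile(f"{version_file} is empty")
    return version
```

A backup job built on this would stop with an error on a worker that lacks the file, rather than producing dumps only a few bytes in size.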