My take on the GitLab issue
As it happens, I had read the database troubleshooting part of GitLab's ops manual (https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/postgresql_replication.md) just a month ago, looking for monitoring info. I found many sound instructions on how to deal with database replication issues. I was very happy when I read they're also using Check_MK, but ...
It also made me decide pg_basebackup is too dangerous w/o dedicated DBA staff for two reasons:
- reason 1: lack of stability / fault tolerance
- reason 2: destructive resume on issue
In the following article I'll show what I think are the biggest issues I'd change at GitLab to harden things. First, I'll rant about the tools in place, but then I'll claim that the operational issues weigh a lot heavier. Fixing them also has more net effect, which is the cool thing. (OFC the technical stuff also needs to be fixed. But what I triggered on isn't a few wrong tunables or a missing connection pooler.)
A personal note, before I go into details:
About 11 years back I was also in a situation where a colleague ended up deleting the wrong thing. The two of us who normally shielded our resident SAN guru from walk-ins and less critical work had gone on holiday.
Soon enough someone distracted him and 'things happened'. I was called back from holiday, and everyone was basically given the choice of signing some pretty evil liability extensions to our contracts or...
For the next two weeks I could hardly type - my hands were shaking that badly under the situation's pressure. My other colleague and I agreed to never again go on holiday at the same time and we stuck to that for years.
So, please rest assured none of this article is meant to offend. It's just I've had a lot of time to reflect about some of these things.
It surely is more blunt than what I'd write to one of my consulting clients if they asked me to review their processes.
But Gitlab has been super-cool in releasing their runbooks, asking for advice with HW purchases and so on. So I felt this article needs to be just like that, and out in the open.
Tools are not your friends
- The fragility made them feel they had to act now and not put this off any longer.
- A staggering replication delay put their availability at risk.
- The issue was felt to be critical enough to skip sleeping.
Put this in relation to how things should be and you see this isn't cool.
If they had had a pgbouncer setup, it would have enabled a failover, but it would have cost them the redundancy: with pgbouncer you need to do DARK, EVIL THINGS to fail back before you can fail over a second time. In this case, though, it would not have helped at all.
Slony (would|might|could) have been helpful, as the replication pile-up would not have happened: it allows you to exempt non-critical tables from replication, and you can more easily attach and detach replicas for maintenance. All of this best applies in some ideal world; hiring a PGSQL consultancy to throw in Slony can be useful.
well, of course we have nothing like flashback ;-)
we all know a 340GB PGSQL restore isn't gonna finish in an hour. so, yeah, you need to solve this on a different level.
1) pg_basebackup is a destructive command.
if running any command like this, you need to ADD redundancy before you do it. often this doesn't work due to time / space requirements. fight those to be able to do it. keep your FS at 2.1 times the size of the database volume you expect (plus WAL piling space on a different disk).
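To make the 2.1x rule operational, you want a pre-flight check before any basebackup run, not a judgment call at 23:00. A minimal sketch; the sizes are hardcoded stand-ins for what you'd read from `du` and `df`, and the 2.1 factor is the one from the text, nothing GitLab actually uses:

```shell
#!/bin/sh
# pre-flight: refuse to start a base backup without 2.1x the DB size free.
db_size_gb=340        # current database size, e.g. from `du -s`
free_gb=800           # free space on the backup target, e.g. from `df -BG`

# integer math: required = db_size * 21 / 10  (the 2.1x factor)
required_gb=$(( db_size_gb * 21 / 10 ))

if [ "$free_gb" -ge "$required_gb" ]; then
    echo "ok: ${free_gb}G free >= ${required_gb}G required, safe to start pg_basebackup"
else
    echo "abort: need ${required_gb}G, only ${free_gb}G free, fix capacity first"
fi
```

Wire the same check into your monitoring so the "abort" branch never surprises you during an incident.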
a comparison: you can online-migrate storage using pvmove, or you could use an LVM mirror.
only one of those methods doubles safety and performance for its duration. mind you, it is NOT the one that is commonly used / recommended.
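To make the comparison concrete, here's a sketch of the two routes. Device and VG/LV names are invented, and the commands are stored and echoed rather than executed, so you can read it without root on an LVM box:

```shell
#!/bin/sh
# route 1: pvmove. only one copy of the data exists while it crawls
# extents from old to new; a failure mid-move is ugly.
route1="pvmove /dev/old_pv /dev/new_pv"

# route 2: temporary mirror. you hold TWO synced copies until you drop
# the old leg, so you GAIN redundancy (and read performance) for the
# duration of the migration.
route2a="lvconvert -m1 vg0/pgdata /dev/new_pv"   # attach mirror leg, wait for sync
route2b="lvconvert -m0 vg0/pgdata /dev/old_pv"   # drop the old leg once in sync

echo "$route1"
echo "$route2a"
echo "$route2b"
```

Route 2 is the one that doubles safety while it runs; route 1 is the one everyone reaches for.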
2) lack of handover for a time-bound issue
there was ongoing replication lag; this means it needs time to fix, so it should be handed over once the initial analysis is done. we often try to wrap things up ourselves, but the basebackup route is time-consuming: don't start those procedures unless you're all set to go on a 3-4 hour journey (or there's already a deterministic "IT IS BROKEN" situation).
3) snapshots: one was run (good), but not prior to the rm.
make sure you have the space / capacity to throw in another one. if you think one snapshot is good, be able to run one more. being able to dump a snapshot quickly ensures you can keep more of them without worrying about space.
- recommendation: monitor your VGs to have like 30% extra capacity for snapshots
- recommendation: look into split mirror backups (if you got the time)
- not thin pools, they seem not that reliable
- zfs would seem a choice, but is said to be unhappy with postgresql, and ZFSOnLinux has no place in a real prod setup.
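The 30% headroom recommendation can be a trivial check. A sketch; the VG numbers are hardcoded stand-ins for what you'd read from `vgs --noheadings --units g -o vg_size,vg_free`, and the snapshot command is only echoed, with made-up names:

```shell
#!/bin/sh
# check that the volume group keeps ~30% free for snapshots.
vg_size_gb=1000
vg_free_gb=350

free_pct=$(( vg_free_gb * 100 / vg_size_gb ))
if [ "$free_pct" -ge 30 ]; then
    echo "ok: ${free_pct}% free in the VG, room for another snapshot"
    # what you'd run right before any destructive step:
    echo "lvcreate -s -L 100G -n pgdata-pre-rm vg0/pgdata"
else
    echo "CRIT: only ${free_pct}% free in the VG, fix this BEFORE you need a snapshot"
fi
```

Hook the CRIT branch into Check_MK so the capacity problem pages you on a calm afternoon, not mid-incident.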
anyway: rm of prod data:
- snapshot or split mirror
- do not differentiate between primary or secondary data for this.
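The snapshot-or-split-mirror rule has a poor man's variant that costs nothing: never rm, always mv to a graveyard on the same filesystem, where it's an instant, reversible rename. A sketch with invented demo paths under /tmp; point them at your real data directory:

```shell
#!/bin/sh
# "mv, not rm": quarantine data instead of deleting it.
base="${TMPDIR:-/tmp}/mv-not-rm-demo"
datadir="$base/data"
graveyard="$base/graveyard"   # SAME filesystem, so mv is a rename, not a copy

mkdir -p "$datadir" "$graveyard"
echo "precious" > "$datadir/rows.db"

stamp=$(date +%Y%m%d-%H%M%S)
mv "$datadir" "$graveyard/data.$stamp"   # instead of: rm -rf "$datadir"
# reclaim the space later, after the incident review, not during it.
```

If the "wrong server" mistake happens with this habit in place, recovery is one mv back instead of a 6-hour restore.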
4) lack of DBA
this type of incident (repl. lag, cleanups) should be handled by dedicated staff who have their minds wrapped around the databases.
if you can't get the headcount, get a 24x7 postgresql support contract with someone you can escalate such shitty unstable replication to. don't make it your team's woe: you didn't cause the lag and can't do anything about it. you didn't write the replication code, and you, by definition, can't thoroughly fix it.
you just can't run databases "on the side" if your data matters.
if it works out, you're lucky. luck doesn't matter in our job.
5) lack of clear SLA and RTO/RPO objectives
The available point in time was 6 hours ago. Trying to restore had already exceeded 6 hours, meaning there was no gain unless your RPO is very low.
I don't know if you had clear definitions for this, but I sense the decision to fix the original issue came out of worries, not out of a hard requirement. You need to decide up to which replication lag (% of normal change volume, whatever) it's OK to go to bed. This needs to become a standard procedure, and it can't until the goals are decided.
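Once the team has agreed on a number, the "is it OK to go to bed?" question becomes a script instead of a 23:00 debate. A sketch; the threshold and the lag value are invented, and in real life the lag would come from comparing the master's WAL position against the standby's replay position:

```shell
#!/bin/sh
# the "may I go to bed?" gate: compare current lag to the agreed limit.
lag_bytes=900000000          # stand-in: current replication lag in bytes
ok_to_sleep_bytes=2000000000 # agreed BEFORE any incident, not during one

if [ "$lag_bytes" -lt "$ok_to_sleep_bytes" ]; then
    verdict="within bounds, handle it in the morning"
else
    verdict="beyond bounds, escalate to the DBA on call"
fi
echo "replication lag: $verdict"
```

The exact number matters less than the fact that it was decided in daylight by the team, not at night by whoever happens to be awake.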
- You should go for a team event and then, after a somewhat sober night, discuss why you decided like this.
- You decided to get the best possible RPO for secondary data (PRs, not git).
- If you start fighting over that, stop, and instead DECIDE on numbers, criteria and a process.
I think that's something you were missing.
I think I also just understood that you were trying to fix your staging environment. So, a team / mgmt decision should happen here: staging issues are not eligible for fixing off-hours, UNLESS you expect them to creep/seep over to prod. Also, your naming convention needs fixing (PM me on Twitter for that, I recently made a non-enterprise, trivial one that does the right things and saves my ass).
6) don't forget what caused this
a) application loopholes allowing some spam to pile up?
b) networking or performance issues that suffocated the replication?
You need to solve both.
I know this was your #1, and I intentionally put it last, because I think the biggest issue is that you had to run !!!a command that is supposed to restore redundancy but required you to have LESS of it!!! establish a workaround for that, meaning: mv, not rm. and overprovision your DB box by 100%+.
- logwatch alerts on the S3 uploads
(PM for this, too)
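A freshness check on the S3 uploads is one way to turn "the backups silently stopped" into an alert. A sketch; the timestamps are hardcoded stand-ins (in real life you'd take the newest object's mtime, e.g. from an `aws s3 ls` listing), and the 24h limit is a made-up example:

```shell
#!/bin/sh
# alert when the newest S3 backup object is older than the agreed maximum.
now=$(date +%s)
last_upload=$(( now - 3600 ))   # stand-in: epoch of the newest backup object
max_age=$(( 24 * 3600 ))        # agreed maximum backup age: 24 hours

age=$(( now - last_upload ))
if [ "$age" -le "$max_age" ]; then
    echo "ok: last backup ${age}s old"
else
    echo "CRIT: last backup ${age}s old, uploads are silently failing"
fi
```

The point is that an empty bucket must page somebody; a backup job that only logs its failures does not exist.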
my honest advice:
I would start by hiring a DBA and working with them through the rest of the backup/monitoring cleanup.
all the other fun points, on the technical angle, are sugar on the top.
23:00+ is a time when the on-call should fix things, not someone who was awake and working on complex stuff all day.
If you had had a(n additional) DBA-only person in place, handover, advice and estimation of risk would have been more readily available.
And, again, your team and your management need to decide on certain aspects of this.
Especially the RTO! You're not in high finance. You're not in a situation where a few billion $ would have passed through your systems in that time period, where every single transaction is more money than startups dream of.
But 6 hours of PRs? that's not the end of the world. IMO, 6 hours of partial, low-crit data loss would've been totally OK.
If people cry about that, just double-check that they actually pay high-finance rates.
I feel GitHub has sometimes been down that long over the course of a single month.
This was shared by f3ew on #LOPSA and is just about perfect: https://www.pythian.com/wp-content/uploads/2015/11/Pythian-FITACER-Human-Reliability-Checklist-2015.pdf
Also, idk? fuck cloud hosting!