
@starcraft66
Last active December 19, 2023 22:48
nerdsin.space Matrix Server Incident Post-Mortem

17-05-2018 - LOSS OF DATA AND DOWNTIME OF THE NERDSIN.SPACE MATRIX HOME SERVER

Description

On May 17th 2018, the PostgreSQL instance housing all of the data pertaining to the nerdsin.space Matrix home server suffered a critical failure causing the complete loss of all user data.

Timeline

On May 17th 2018, around 11:45 EST, I (Tristan) started performing routine maintenance on the home server. The plan was to update the Synapse instance to the latest revision and to install the matrix-appservice-webhooks integration.

Around 12:00 EST, I ran the docker-compose down and docker-compose up -d commands to recreate the Synapse, PostgreSQL and matrix-appservice-webhooks containers and essentially reboot the entire home server.

Around 12:05 EST, the command to bring up all of the containers had still not completed successfully; Docker was generally unresponsive and had started behaving erratically.

Around 12:15 EST, after a bit of investigation, the root cause of the Docker slowness became clear: I had added a configuration stanza to the docker-compose file instructing Docker to publish a range of roughly 11,000 ports required for Synapse's TURN server to reach the outside world. Docker then began creating an individual iptables rule for every single one of those ports, causing iptables to peg a CPU core at 100% usage. Judging by the rate of the iptables operations, creating over ten thousand rules was going to take many hours, so I killed everything, deleted the container, and removed the large port-forwarding stanza.
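For context, a published port range in a docker-compose file looks something like the sketch below. The actual compose file is not reproduced in this post-mortem, so the service name, image, and exact range here are illustrative assumptions:

```yaml
# Hypothetical sketch -- service name, image, and port range are illustrative.
services:
  turn:
    image: instrumentisto/coturn
    ports:
      # Publishing a range like this makes Docker set up an individual
      # iptables rule (plus, by default, a userland proxy) for every
      # single port in the range, which is what pegged a CPU core.
      - "49152-60999:49152-60999/udp"
```

For large TURN relay ranges, running the container with `network_mode: "host"` sidesteps per-port iptables rules entirely, at the cost of network isolation.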

Around 12:30 EST, I managed to get all of the containers back up and running; however, the Riot clients on my phone and desktop had logged me out and would not let me log back in. At this point, I had a family matter to take care of, so I put my investigation on hold for an hour.

At 13:30 EST, I was back on the case and found, to my dismay, that the bind-mount volume storing all of the PostgreSQL data on disk was completely empty. After doing some research, it turns out that bind-mounting the /var/lib/postgresql folder does not work. Someone recommended bind-mounting the folder /var/lib/postgresql/data instead (which worked when I tried it later), but by then it was too late. Because the original bind-mount was incorrect, all of the PostgreSQL data (essentially all of the Matrix server's data) was irrecoverably lost when the container was shut down during the first phase of the upgrade.
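The difference between the broken and working mounts can be sketched as follows (the host-side path and image tag are illustrative). The likely mechanism: the official postgres image declares a VOLUME at /var/lib/postgresql/data, so when only the parent directory is bind-mounted, the data actually lands in an anonymous Docker volume nested inside it, and that anonymous volume is silently left behind when the container is recreated:

```yaml
services:
  postgres:
    image: postgres:10          # illustrative tag
    volumes:
      # Broken: data ends up in an anonymous volume at
      # /var/lib/postgresql/data, not in the bind mount.
      # - ./pgdata:/var/lib/postgresql
      #
      # Working: bind-mount the data directory itself.
      - ./pgdata:/var/lib/postgresql/data
```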

Contributing Factor(s)

Lack of testing: If I had done dry-run tests involving the destruction and re-creation of containers when I first set up the home server, I would have caught this configuration issue immediately, before it could cause any real-world damage.

Lack of backups: Because this server was quickly set up in a single day, I never ended up configuring any backups for it.

Stabilization Steps

I have brought the Matrix server back up, minus all of the user data of course. I am in the process of contacting users to inform them of this post-mortem.

Impact

All of the user accounts, messages and uploaded media files created on nerdsin.space since its inception a few weeks ago have been permanently lost.

Corrective Actions

I have thoroughly audited all of the Dockerfiles, docker-compose files and Synapse configuration files to ensure a catastrophic failure like this does not happen again.

I have changed the bind-mount path for the PostgreSQL container and confirmed that the data now properly persists across container deletion and re-creation.

I have configured hourly backups of all of the Matrix user data, including the PostgreSQL database and all media uploads, to minimize the impact of any future incident. I plan to test the restore procedures soon.
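As a sketch of what such an hourly backup could look like (the container name, database name, and paths below are assumptions for illustration, not taken from my actual setup):

```shell
# Illustrative crontab entries: hourly logical dump of the Synapse database.
# Note the escaped % signs, which cron would otherwise treat as line breaks.
0 * * * * docker exec matrix-postgres pg_dump -U synapse synapse | gzip > /backups/synapse-$(date +\%F-\%H).sql.gz

# Media uploads can be synced separately, e.g.:
0 * * * * rsync -a /opt/matrix/media_store/ /backups/media_store/
```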

Final Thoughts

Running the nerdsin.space home server was always a fun little side project for me, and I hadn't considered the impact that losing user data would have on the other users of my server.

This has been a learning experience for me, and possibly for others reading this who are thinking of running public services. I will always remember to keep proper backups in the future.

@jefferyw08

oy vey

@HeroesLament

Hi Tristan, is nerdsin.space having issues today? My clients are not connecting to it successfully.

@HeroesLament

> Hi Tristan, is nerdsin.space having issues today? My clients are not connecting to it successfully.

The server is back up.

@HeroesLament

Server seems down again today
