Last active: August 22, 2019 15:48
Our nightly build server (jenkins.hardenedbsd.org, which is the same as installer.hardenedbsd.org) is currently down for emergency maintenance.
This is a post-mortem, as seen through the eyes of Shawn Webb. On 19 Aug 2019 at 20:59 EDT, Oliver Pinter emailed core@hardenedbsd.org to notify the team that Jenkins was down.
When I woke up and checked my email, I saw the message from Oliver. On 20 Aug 2019 at 07:15 EDT, I opened a ticket with New York Internet (NYI), our hosting provider, to gain IPMI access to the server. I gained access that same day at 10:53 and immediately started debugging.
The bootloader claimed it could not find the kernel and reported ZFS issues. Since the bootloader would not even show the loader screen (we use a root-on-ZFS setup), I needed to boot a recent build of 12-STABLE/amd64.
Oliver was likely asleep given timezone differences, so I decided to do a build of 12-STABLE on my laptop. I needed to make sure that my build was at least as new as the build on the Jenkins machine, primarily because I needed to match minimum ZFS code parity[1].
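For reference, producing a bootable memstick image from a FreeBSD/HardenedBSD source tree looks roughly like the following. This is an illustrative sketch, not the exact commands used; paths and job counts are assumptions.

```shell
# Build world and kernel from the 12-STABLE source tree
# (assumes sources are checked out at /usr/src).
cd /usr/src
make -j8 buildworld buildkernel

# Build a bootable memstick image using the release tooling.
cd /usr/src/release
make memstick

# The resulting image can then be written to a USB stick, e.g.:
#   dd if=memstick.img of=/dev/da0 bs=1m conv=sync
```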
I uploaded the newly-built memstick image to hardenedbsd.org/~shawn/ for an NYI tech to flash and boot, which is what occurred next. I imported the pool and performed a scrub. The scrub completed a few hours later. No errors reported.
I more closely inspected the output of `zpool status` in the hopes of finding some clue as to why the bootloader would complain that the pool was broken. Sure enough, the clue was there: how the pool is configured. As reported by `zpool status`:
zroot:
  mirror:
    ada0p2
    ada1p2
  mirror:
    ada2
    ada3
I realized that the server had recently been migrated from 11-STABLE to 12-STABLE. FreeBSD introduced a regression in the bootloader in 12-STABLE: it cannot boot from ZFS pools that contain a whole-disk device (ada2 and ada3 above). The bootloader in 12-STABLE expects all nodes backing a vdev to be contained within a partition. Such a constraint does not exist in 11-STABLE.
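A quick way to spot this condition is to compare the vdev members reported by `zpool status` against the disks' partition tables; whole-disk members appear with no `pN` suffix. The commands below are an illustrative sketch:

```shell
# List the pool's backing devices. Whole-disk members (e.g. "ada2")
# lack a partition suffix such as "p1" or "p2".
zpool status zroot

# Confirm which disks actually carry a partition table. A disk
# handed to ZFS whole will have no GPT scheme to show here.
gpart show ada2 || echo "ada2 has no partition table (whole-disk vdev)"
```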
So it was a perfect recipe for failure. However, when I attempted to revert to the last known good ZFS boot environment, the system hung when attempting to launch /sbin/init. Due to time constraints, I decided not to investigate further. The pool and data are intact; however, the bootloader in 12-STABLE does not support our current pool configuration.
We have not performed a full backup of the system in... I'm not sure how long. No matter what, I want to take this downtime to back up the system while it's not under load and while it's not bootable.
I reboot the system into the memstick image again to start afresh. I bootstrap a chroot in which I can perform my work. (I love how easy it is to do this with hbsd-update.) I set up sshd within the chroot. I take a snapshot of the ZFS pool. I then run zfs send to a storage server some 250 miles away with limited bandwidth. The `zfs send` started at 00:18 EDT on 20 Aug 2019. ZFS reported an estimated total of 1.51TB to be sent. As of 22 Aug 2019 at 11:24 EDT, around 390GB has been sent.
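The snapshot-and-send step looks roughly like this. The snapshot label, remote hostname, and receiving dataset are made up for illustration; the actual names differ.

```shell
# Take a recursive snapshot of the pool as the backup baseline.
zfs snapshot -r zroot@backup-20190820

# Stream a full replication to the remote storage server.
# -R replicates the entire dataset tree, including properties;
# -u on the receive side leaves the datasets unmounted.
zfs send -R zroot@backup-20190820 | \
    ssh backup.example.org zfs receive -u -d backup/jenkins
```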
We're now in a waiting state, waiting for the backup to complete. We will then reinstall 12-STABLE, reconfiguring the ZFS pool such that all nodes backing vdevs are contained within partitions (so ada2 becomes ada2p1 and ada3 becomes ada3p1).
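Partitioning the whole disks and rebuilding the pool in the required layout would look roughly like this; partition alignment and pool options are illustrative assumptions, not the exact commands planned.

```shell
# Give ada2 and ada3 GPT partition tables, each with a single
# freebsd-zfs partition, so the 12-STABLE bootloader sees every
# vdev member inside a partition.
for disk in ada2 ada3; do
    gpart create -s gpt ${disk}
    gpart add -t freebsd-zfs -a 1m ${disk}
done

# Recreate the pool with two mirror vdevs, all members partitioned.
zpool create zroot mirror ada0p2 ada1p2 mirror ada2p1 ada3p1
```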
We will then restore the backup via `zfs send`, meaning we will need to transfer the 1.51TB over again. Thus, the time frame for bringing this server back online will be... a long while.
The really great thing coming out of this is that, going forward, we will be able to perform incremental remote backups. Since we have the 1.51TB base, we can create a cronnable script to perform backups over our VPN.
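An incremental backup along those lines could be sketched as follows. Hostnames and dataset names are hypothetical, and a real script would also prune old snapshots and handle failures.

```shell
#!/bin/sh
# Minimal incremental zfs-send sketch: send only the delta between
# the previous backup snapshot and a freshly-taken one over the VPN.

# Assumes the newest existing snapshot of zroot is the previous
# pool-level backup baseline.
PREV=$(zfs list -H -t snapshot -o name -s creation zroot | tail -1)
NOW="zroot@backup-$(date +%Y%m%d-%H%M)"

zfs snapshot -r "${NOW}"
zfs send -R -i "${PREV}" "${NOW}" | \
    ssh backup.example.org zfs receive -u -d backup/jenkins
```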
We will test the restoration procedures on a server here in Maryland of the same base model as the build server being backed up in New York City (jenkins.hardenedbsd.org). Once those restoration procedures succeed in testing, we will perform them on the server in NYC.
While the 1.51TB is restoring from our storage server in Annapolis Junction, Maryland to our build server in New York City, I can also perform the same restoration steps here. Through recent donations by my employer, I have the same base model server at the datacenter here in Maryland as is deployed in NYC. So, I'll do just that: I'll perform the same steps on a server here and make sure services come up fine.
Wishful thinking: build a VPN for build clusters. Set up build jails connected to the VPN, integrated with our primary Jenkins instance (jenkins.hardenedbsd.org). This would be a long-term goal, now made possible through recent network enhancements and server donations.
To reiterate, here is where we stand right now: we're performing a full backup of the system.
Next steps:
1. Perform a test restoration on a server here in Maryland.
2. After a successful restoration, reinstall HardenedBSD 12-STABLE on the production Jenkins instance.
3. Configure the ZFS pool to have two mirror vdevs, each with two partitioned disks.
4. Restore the full backup.
5. The full backup restores to 11-STABLE. This is fine.
6. Perform a source-based upgrade to 12-STABLE.
7. Ensure services are running fine. Allow one or more builds to complete successfully.
8. Perform the first incremental backup.
9. The server is now considered stable.
10. Determine priorities for writing backup scripts and setting up the backup VPN and the build cluster VPN.
The time frame for these steps depends on many factors, including the potential for the backup and restoration to bump up against vBSDcon.
During this time, we will continue to provide binary updates for existing 12-STABLE installations. The server that builds the binary updates is a separate one, hosted in Maryland. I recently had to perform similar backup and restoration steps for that server, too. However, the binary update build server is not publicly accessible, so no one noticed that downtime, as no updates were needed during that migration. :)
[1]: I wanted to make sure that when I performed the next task, a ZFS scrub, similar/same code paths were taken. If there indeed was an issue, I needed to ensure that the issue stemmed from the disks and not from changes in code. Since I didn't know when Oliver last updated the system, or to which revision, I felt the need to build the latest 12-STABLE from our hardenedBSD.git repo. Additionally, there were some networking-related security fixes, so I wanted to make sure that during the backup, I was covered from a security fixes perspective.