Last active: August 22, 2019 15:48
Our nightly build server (jenkins.hardenedbsd.org, which is the same as installer.hardenedbsd.org) is currently down for emergency maintenance.
This is a post-mortem, as seen through the eyes of Shawn Webb. On 19 Aug 2019 at 20:59 EDT, Oliver Pinter emailed core@hardenedbsd.org to notify the team that Jenkins was down.
When I woke up and checked my email, I saw the message from Oliver. On 20 Aug 2019 at 07:15 EDT, I opened a ticket with New York Internet (NYI), our hosting provider, to gain IPMI access to the server. I gained access that same day at 10:53 and immediately started debugging.
The bootloader claimed it could not find the kernel and reported ZFS issues. Since the bootloader would not even show the loader screen (we use a root-on-ZFS setup), I needed to boot a recent build of 12-STABLE/amd64.
Oliver was likely asleep given timezone differences, so I decided to do a build of 12-STABLE on my laptop. I needed to make sure that my build was at least as new as the build on the Jenkins machine, primarily because I needed to match minimum ZFS code parity[1].
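For reference, producing a bootable memstick image from a FreeBSD/HardenedBSD source tree looks roughly like the following. This is an illustrative sketch, not the exact commands used; paths and job counts are assumptions.

```shell
# Build world and kernel from the 12-STABLE source tree
# (assumes sources are checked out at /usr/src).
cd /usr/src
make -j8 buildworld buildkernel

# Build a bootable memstick image using the release tooling.
cd /usr/src/release
make memstick

# The resulting image can then be written to a USB stick, e.g.:
#   dd if=memstick.img of=/dev/da0 bs=1m conv=sync
```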
I uploaded the newly-built memstick image to hardenedbsd.org/~shawn/ for an NYI tech to flash and boot, which is what occurred next. I imported the pool and performed a scrub. The scrub completed a few hours later. No errors reported.
I more closely inspected the output of `zpool status` in the hopes of finding some clue as to why the bootloader would complain that the pool was broken. Sure enough, the clue was there: how the pool is configured. As reported by `zpool status`:
zroot:
  mirror:
    ada0p2
    ada1p2
  mirror:
    ada2
    ada3
I realized that the server had recently been migrated from 11-STABLE to 12-STABLE. FreeBSD introduced a regression in the bootloader in 12-STABLE: it cannot boot from ZFS pools that contain a whole-disk device (ada2 and ada3 above). The bootloader in 12-STABLE expects all nodes backing a vdev to be contained within a partition. Such a constraint does not exist in 11-STABLE.
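A quick way to spot this condition is to compare the vdev members reported by `zpool status` against the disks' partition tables; whole-disk members appear with no `pN` suffix. The commands below are an illustrative sketch:

```shell
# List the pool's backing devices. Whole-disk members (e.g. "ada2")
# lack a partition suffix such as "p1" or "p2".
zpool status zroot

# Confirm which disks actually carry a partition table. A disk
# handed to ZFS whole will have no GPT scheme to show here.
gpart show ada2 || echo "ada2 has no partition table (whole-disk vdev)"
```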
So it was a perfect recipe for failure. However, when I attempted to revert to the last known good ZFS boot environment, the system hung when attempting to launch /sbin/init. Due to time constraints, I decided not to investigate further. The pool and data are intact; however, the bootloader in 12-STABLE does not support our current pool configuration.
We have not performed a full backup of the system in... I'm not sure how long. No matter what, I want to take this downtime to back up the system while it's not under load and while it's not bootable.
I reboot the system into the memstick image again to start afresh. I bootstrap a chroot in which I can perform my work. (I love how easy it is to do this with hbsd-update.) I set up sshd within the chroot. I take a snapshot of the ZFS pool. I then run zfs send to a storage server some 250 miles away with limited bandwidth. The `zfs send` started at 00:18 EDT on 20 Aug 2019. ZFS reported an estimated total of 1.51TB to be sent. As of 22 Aug 2019 at 11:24 EDT, around 390GB has been sent.
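The snapshot-and-send step looks roughly like this. The snapshot label, remote hostname, and receiving dataset are made up for illustration; the actual names differ.

```shell
# Take a recursive snapshot of the pool as the backup baseline.
zfs snapshot -r zroot@backup-20190820

# Stream a full replication to the remote storage server.
# -R replicates the entire dataset tree, including properties;
# -u on the receive side leaves the datasets unmounted.
zfs send -R zroot@backup-20190820 | \
    ssh backup.example.org zfs receive -u -d backup/jenkins
```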
We're now in a waiting state, waiting for the backup to complete. We will then reinstall 12-STABLE, reconfiguring the ZFS pool such that all nodes backing vdevs are contained within partitions (so ada2 becomes ada2p1 and ada3 becomes ada3p1).
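Partitioning the whole disks and rebuilding the pool in the required layout would look roughly like this; partition alignment and pool options are illustrative assumptions, not the exact commands planned.

```shell
# Give ada2 and ada3 GPT partition tables, each with a single
# freebsd-zfs partition, so the 12-STABLE bootloader sees every
# vdev member inside a partition.
for disk in ada2 ada3; do
    gpart create -s gpt ${disk}
    gpart add -t freebsd-zfs -a 1m ${disk}
done

# Recreate the pool with two mirror vdevs, all members partitioned.
zpool create zroot mirror ada0p2 ada1p2 mirror ada2p1 ada3p1
```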
We will then restore the backup via `zfs send`, meaning we will need to transfer the 1.51TB over again. Thus, the time frame for bringing this server back online will be... a long while.
The really great thing coming out of this is that, going forward, we will be able to perform incremental remote backups. Since we have the 1.51TB base, we can create a cronnable script to perform backups over our VPN.
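An incremental backup along those lines could be sketched as follows. Hostnames and dataset names are hypothetical, and a real script would also prune old snapshots and handle failures.

```shell
#!/bin/sh
# Minimal incremental zfs-send sketch: send only the delta between
# the previous backup snapshot and a freshly-taken one over the VPN.

# Assumes the newest existing snapshot of zroot is the previous
# pool-level backup baseline.
PREV=$(zfs list -H -t snapshot -o name -s creation zroot | tail -1)
NOW="zroot@backup-$(date +%Y%m%d-%H%M)"

zfs snapshot -r "${NOW}"
zfs send -R -i "${PREV}" "${NOW}" | \
    ssh backup.example.org zfs receive -u -d backup/jenkins
```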
We will test the restoration procedures on a server here in Maryland of the same base model as the build server being backed up in New York City (jenkins.hardenedbsd.org). Once those restoration procedures succeed in testing, we will perform them on the server in NYC.
While the 1.51TB is restoring from our storage server in Annapolis Junction, Maryland to our build server in New York City, I can also perform the same restoration steps here. Through recent donations by my employer, I have the same base model server at the datacenter here in Maryland as is deployed in NYC. So, I'll do just that: I'll perform the same steps on a server here and make sure services come up fine.
Wishful thinking: build a VPN for build clusters. Set up build jails connected to the VPN, integrated with our primary Jenkins instance (jenkins.hardenedbsd.org). This would be a long-term goal, now made possible through recent network enhancements and server donations.
To reiterate, here is where we stand right now: we're performing a full backup of the system.
Next steps:
1. Perform a test restoration on a server here in Maryland.
2. After a successful restoration, reinstall HardenedBSD 12-STABLE on the production Jenkins instance.
3. Configure the ZFS pool to have two mirror vdevs, each with two partitioned disks.
4. Restore the full backup.
5. The full backup restores to 11-STABLE. This is fine.
6. Perform a source-based upgrade to 12-STABLE.
7. Ensure services are running fine. Allow one or more builds to complete successfully.
8. Perform the first incremental backup.
9. The server is now considered stable.
10. Determine priorities for writing backup scripts and setting up the backup VPN and the build cluster VPN.
The time frame for these steps depends on many factors, including the potential for the backup and restoration to bump up against vBSDcon.
During this time, we will continue to provide binary updates for existing 12-STABLE installations. The server that builds the binary updates is a separate one, hosted in Maryland. I recently had to perform similar backup and restoration steps for that server, too. However, the binary update build server is not publicly accessible, so no one noticed that downtime, as no updates were needed during that migration. :)
[1]: I wanted to make sure that when I performed the next task, a ZFS scrub, similar/same code paths were taken. If there indeed was an issue, I needed to ensure that the issue stemmed from the disks and not from changes in code. Since I didn't know when Oliver last updated the system, or to which revision, I felt the need to build the latest 12-STABLE from our hardenedBSD.git repo. Additionally, there were some networking-related security fixes, so I wanted to make sure that during the backup, I was covered from a security fixes perspective.