Manually fixing bit flips in BTRFS

Somehow my BTRFS file system became corrupted by what appears to be a single bit flip in a metadata field. Rather than copying all the data and reformatting the file system, which would have required another disk at least as large as the original, I decided to try to fix it manually, which appears to have worked. I've documented the procedure I used here, in case I need it again or someone else runs into a similar issue and finds it useful.

The first thing you should do is run btrfs check. For me this produced the following output:

Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p1
UUID: ec7afe1c-8478-450a-82fc-d17b32d8ca3d
[1/7] checking root items
[2/7] checking extents
data extent[46615457792, 8192] referencer count mismatch (root 5 owner 6618219 offset 40960) wanted 0 have 1
data extent[46615457792, 8192] bytenr mimsmatch, extent item bytenr 46615457792 file item bytenr 0
data extent[46615457792, 8192] referencer count mismatch (root 2305843009213693957 owner 6618219 offset 40960) wanted 1 have 0
backpointer mismatch on [46615457792 8192]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 340295188480 bytes used, error(s) found
total csum bytes: 327893144
total tree bytes: 3356934144
total fs tree bytes: 2704637952
total extent tree bytes: 231227392
btree space waste bytes: 652723263
file data blocks allocated: 1675529441280
 referenced 347143503872

In my case the damage seemed to be limited to the extents tree. This could in theory be fixed by running btrfs check with --init-extent-tree to rebuild the tree from scratch, but that seemed somewhat risky since I didn't know whether anything else was corrupted, and it would have been difficult to revert if it failed.

Note that the value 2305843009213693957 is 0x2000000000000005 in hexadecimal. The correct value is almost certainly 5, since that is the default root id, and the reference count errors seemed to confirm this. So I wanted to fix this value manually by writing directly to the block device. Note that btrfs check did not report any checksum errors: this means the bit flip must have happened in RAM, before the checksum was calculated, rather than on the disk itself. It also meant that after fixing the bit flip, I would need to update the checksum so it would be correct again.
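
As a quick sanity check, we can confirm in a Python shell that the two values differ in exactly one bit (bit 61):

>>> hex(2305843009213693957)
'0x2000000000000005'
>>> 2305843009213693957 ^ 5 == 1 << 61
True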

Address calculation

In order to make the required changes to the block device, we first have to figure out at which address the data is stored. This can be done with the btrfs tool and some simple math.

First, run btrfs inspect-internal dump-tree --extents /dev/nvme0n1p1 to dump the entire extents tree in text format. Note that this produces a huge amount of output (~400MB in my case), so either save it to a file or pipe it directly to something like grep -B 200 -A 20 46615457792. This should allow you to find the affected item within the extents tree:

leaf 359743488 items 102 free space 7619 generation 188105 owner EXTENT_TREE
leaf 359743488 flags 0x1(WRITTEN) backref revision 1
fs uuid ec7afe1c-8478-450a-82fc-d17b32d8ca3d
chunk uuid 2c688fe1-c3fd-4cc1-91a8-6e5e1ef372c2
	item 0 key (46614720512 EXTENT_ITEM 16384) itemoff 16230 itemsize 53
		refs 1 gen 182145 flags DATA
		(178 0xdfb591f09c4705d) extent data backref root FS_TREE objectid 7279200 offset 0 count 1
	item 1 key (46614740992 EXTENT_ITEM 20480) itemoff 16177 itemsize 53
		refs 1 gen 171820 flags DATA
		(178 0xdfb591faeeafe9c) extent data backref root FS_TREE objectid 7100202 offset 0 count 1
[...]
	item 27 key (46615457792 EXTENT_ITEM 8192) itemoff 14636 itemsize 144
		refs 8 gen 167787 flags DATA
		(178 0x1da59e701e7789d2) extent data backref root 2305843009213693957 objectid 6618219 offset 40960 count 1
		(184 0xcbcdfc0000) shared data backref parent 875334205440 count 1
		(184 0xca33c2c000) shared data backref parent 868451794944 count 1
		(184 0x7fc67cc000) shared data backref parent 548790910976 count 1
		(184 0x7fc336c000) shared data backref parent 548735991808 count 1
		(184 0x22084000) shared data backref parent 570966016 count 1
		(184 0x218f4000) shared data backref parent 563036160 count 1
		(184 0x31c4000) shared data backref parent 52183040 count 1

In this case the key 46615457792 is found as item 27 of leaf 359743488. The number 359743488 is the logical address of the leaf node. The item itself is located at <leaf_addr> + 101 + <item_offset> (101 is the size of the leaf header). In order to modify the raw data through the block device, we need to translate these logical addresses into physical addresses. To do this, we first have to find the chunk in which this logical address is located. We can get a list of all chunks by running:

btrfs inspect-internal dump-tree --device /dev/nvme0n1p1 | grep -A 10 CHUNK_ITEM

The specific entry we need is this one:

	item 3 key (FIRST_CHUNK_TREE CHUNK_ITEM 30408704) itemoff 15881 itemsize 112
		length 1073741824 owner 2 stripe_len 65536 type METADATA|DUP
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 2 sub_stripes 1
			stripe 0 devid 1 offset 38797312
			dev_uuid 11118de3-27cf-4640-be79-1af30b59edbc
			stripe 1 devid 1 offset 1112539136
			dev_uuid 11118de3-27cf-4640-be79-1af30b59edbc

The number 30408704 is the logical address of the chunk, and the chunk length is 1073741824, which means this chunk contains all addresses between 30408704 and 30408704 + 1073741824 = 1104150528. The address we need is 359743488 (the logical address of the damaged leaf node), which falls within this range. Note that this chunk is a metadata chunk (since the extents tree is metadata) and is duplicated to two physical addresses (called 'stripes') for redundancy. The physical addresses of these stripes are 38797312 and 1112539136.

We will need to check and repair both copies of the damaged data. We can calculate the two physical addresses of the leaf node like this:

359743488 - 30408704 + 38797312 = 368132096
359743488 - 30408704 + 1112539136 = 1441873920
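
The same translation can also be scripted to avoid arithmetic mistakes. Here is a minimal sketch (the helper function is my own, not part of any btrfs tooling), which assumes a single-device DUP chunk where every stripe holds a complete copy; striped RAID profiles would need a different calculation:

def logical_to_physical(logical, chunk_logical, chunk_length, stripe_offsets):
	# The logical address must fall inside the chosen chunk.
	assert chunk_logical <= logical < chunk_logical + chunk_length
	# With DUP, each stripe holds a full copy at the same relative offset.
	return [logical - chunk_logical + offset for offset in stripe_offsets]

# Values from the dumps above:
print(logical_to_physical(359743488, 30408704, 1073741824, [38797312, 1112539136]))
# prints [368132096, 1441873920]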

We can now verify that these addresses are correct by doing a hex dump at these addresses:

root@desktop-maarten maarten # xxd -s 368132096 -l 240 /dev/nvme0n1p1
15f14000: 2f0c c809 0000 0000 0000 0000 0000 0000  /...............
15f14010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
15f14020: ec7a fe1c 8478 450a 82fc d17b 32d8 ca3d  .z...xE....{2..=
15f14030: 0040 7115 0000 0000 0100 0000 0000 0001  .@q.............
15f14040: 2c68 8fe1 c3fd 4cc1 91a8 6e5e 1ef3 72c2  ,h....L...n^..r.
15f14050: c9de 0200 0000 0000 0200 0000 0000 0000  ................
15f14060: 6600 0000 0000 3074 da0a 0000 00a8 0040  f.....0t.......@
15f14070: 0000 0000 0000 663f 0000 3500 0000 0080  ......f?..5.....
15f14080: 74da 0a00 0000 a800 5000 0000 0000 0031  t.......P......1
15f14090: 3f00 0035 0000 0000 d074 da0a 0000 00a8  ?..5.....t......
15f140a0: 0040 0000 0000 0000 fc3e 0000 3500 0000  .@.......>..5...
15f140b0: 0020 75da 0a00 0000 a800 3000 0000 0000  . u.......0.....
15f140c0: 00ba 3e00 0042 0000 0000 5075 da0a 0000  ..>..B....Pu....
15f140d0: 00a8 0020 0000 0000 0000 853e 0000 3500  ... .......>..5.
15f140e0: 0000 0080 75da 0a00 0000 a800 7000 0000  ....u.......p...

root@desktop-maarten maarten # xxd -s 1441873920 -l 240 /dev/nvme0n1p1
55f14000: 2f0c c809 0000 0000 0000 0000 0000 0000  /...............
55f14010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
55f14020: ec7a fe1c 8478 450a 82fc d17b 32d8 ca3d  .z...xE....{2..=
55f14030: 0040 7115 0000 0000 0100 0000 0000 0001  .@q.............
55f14040: 2c68 8fe1 c3fd 4cc1 91a8 6e5e 1ef3 72c2  ,h....L...n^..r.
55f14050: c9de 0200 0000 0000 0200 0000 0000 0000  ................
55f14060: 6600 0000 0000 3074 da0a 0000 00a8 0040  f.....0t.......@
55f14070: 0000 0000 0000 663f 0000 3500 0000 0080  ......f?..5.....
55f14080: 74da 0a00 0000 a800 5000 0000 0000 0031  t.......P......1
55f14090: 3f00 0035 0000 0000 d074 da0a 0000 00a8  ?..5.....t......
55f140a0: 0040 0000 0000 0000 fc3e 0000 3500 0000  .@.......>..5...
55f140b0: 0020 75da 0a00 0000 a800 3000 0000 0000  . u.......0.....
55f140c0: 00ba 3e00 0042 0000 0000 5075 da0a 0000  ..>..B....Pu....
55f140d0: 00a8 0020 0000 0000 0000 853e 0000 3500  ... .......>..5.
55f140e0: 0000 0080 75da 0a00 0000 a800 7000 0000  ....u.......p...

We read the same data at both addresses, which is a good sign. The first 32 bytes shown are the checksum (only the first 4 bytes are actually used by crc32c), and the next 16 bytes contain the filesystem UUID, which was also shown in the extents tree dump as ec7afe1c-8478-450a-82fc-d17b32d8ca3d. We can see this matches the hex dump, so the leaf address is probably correct. In order to find the item itself, we need to increment the address by 101 + <item_offset> bytes. This gives us:

368132096 + 101 + 15881 = 368148078
1441873920 + 101 + 15881 = 1441889902
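
Before dumping the items, we can also verify the leaf programmatically: the node header stores the logical address (bytenr) of the node as a little-endian u64 at byte offset 0x30, right after the 32-byte checksum and the 16-byte filesystem UUID. A small sketch, assuming the standard btrfs header layout:

import struct

with open('/dev/nvme0n1p1', 'rb') as dev:
	for leaf_phys in (368132096, 1441873920):
		dev.seek(leaf_phys + 0x30)  # bytenr field of the node header
		bytenr, = struct.unpack('<Q', dev.read(8))
		print(bytenr)  # should print 359743488 for both copies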

A hex dump at these item addresses should show something like this (sadly I lost the original hex dump, so I have reconstructed the first line):

root@desktop-maarten maarten # xxd -s 368148078 -l 800 /dev/nvme0n1p1
15f17e6e: 0800 0000 0000 0000 6b8f 0200 0000 0000  ........k.......
[...]

root@desktop-maarten maarten # xxd -s 1441889902 -l 800 /dev/nvme0n1p1
55f17e6e: 0800 0000 0000 0000 6b8f 0200 0000 0000  ........k.......
[...]

Here the first 8 bytes show the refcount (8) and the next 8 bytes show the generation number (0x028f6b = 167787); both should match the information from the extents tree dump. At this point I just started searching the hex dump for the specific binary value I was after, which was easier than figuring out the exact binary format of the item data. Note that this value is stored in little-endian format, so 0x2000000000000005 appears as 05 00 00 00 00 00 00 20. I found this value at address 368148103 and needed to replace it with 05 00 00 00 00 00 00 00.
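
The search can also be done programmatically; a short sketch, using the first item address computed above:

needle = (0x2000000000000005).to_bytes(8, 'little')  # the corrupted root id
with open('/dev/nvme0n1p1', 'rb') as dev:
	dev.seek(368148078)  # start of the damaged item (first copy)
	data = dev.read(800)
print(368148078 + data.index(needle))  # prints 368148103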

Fixing the bit flip

In order to modify the values, I wrote a simple Python script using this helper function:

f = open('/dev/nvme0n1p1', 'r+b', buffering=0)  # unbuffered read/write access to the raw block device

def replace(f, addr, oldval, newval):
	print(f'At address {addr}, replacing {oldval} with {newval}')
	f.seek(addr)
	# Safety check: refuse to write if the data on disk is not what we expect.
	assert f.read(len(oldval)) == oldval, 'Data on disk does not match old value'
	f.seek(addr)
	f.write(newval)

This function first reads the old value to make sure it matches what we expect, then replaces it with the new value. This should reduce the potential for data corruption in case we somehow got the address wrong. I saved all changes I made to a text file so I would be able to undo them later if I made a mistake.
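
For example, the actual fix for the flipped bit would look like this. Note that the address of the second copy is not recorded above, so I reconstruct it here by adding the offset difference between the two stripes (1441873920 - 368132096) to the first address:

old = (0x2000000000000005).to_bytes(8, 'little')  # 05 00 00 00 00 00 00 20
new = (5).to_bytes(8, 'little')                   # 05 00 00 00 00 00 00 00
for addr in (368148103, 368148103 + (1441873920 - 368132096)):
	replace(f, addr, old, new)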

After fixing the flipped bit, we still need to fix the checksums. The easiest way to find the new checksum value is simply to run btrfs check again, which will now print the following error:

checksum verify failed on 359743488 wanted 0x2f0cc809 found 0x8085c383

After this, btrfs check refuses to read the affected leaf and consequently prints a large number of new errors; we can ignore those, since they will go away once the checksum is fixed. The number 0x2f0cc809 is the old (now incorrect) checksum stored on disk, while 0x8085c383 is the new value that BTRFS expects. So all we need to do now is another pair of replacements to fix the checksums, which are conveniently located at the very start of the leaf nodes (physical addresses 368132096 and 1441873920). Note that this value (unlike almost everything else) is stored in big-endian format, i.e. the bytes appear on disk in the same order as in the printed value.
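
With the values from the error message, the checksum fix looks like this (a sketch along the same lines; note that the bytes are written in the printed order):

old_csum = bytes.fromhex('2f0cc809')  # matches the start of the hex dumps above
new_csum = bytes.fromhex('8085c383')  # the value btrfs check wants
for leaf_phys in (368132096, 1441873920):
	replace(f, leaf_phys, old_csum, new_csum)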

After these final changes, btrfs check passes! I did another run with --check-data-csum to also check the data itself, and this passed as well. Apparently it really was just a single bit flip. After rebooting, my file system is working fine again :).
