@chapmanjacobd
Last active November 26, 2023 18:26
BTRFS single mode evaluation

The experiment

Preparation

truncate -s20G d1.img
truncate -s20G d2.img
truncate -s20G d3.img
truncate -s20G d4.img
set ld1 (sudo losetup --show --find d1.img)
set ld2 (sudo losetup --show --find d2.img)
set ld3 (sudo losetup --show --find d3.img)
set ld4 (sudo losetup --show --find d4.img)
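
The set ld1 (...) lines above are fish shell syntax. For anyone following along in bash, a rough equivalent (an untested sketch) would be:

ld1=$(sudo losetup --show --find d1.img)
ld2=$(sudo losetup --show --find d2.img)
ld3=$(sudo losetup --show --find d3.img)
ld4=$(sudo losetup --show --find d4.img)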

sudo mkfs.btrfs -d single -m raid1c3 "$ld1" "$ld2" "$ld3" "$ld4" 

sudo mkdir -p /mnt/loop
sudo mount "$ld1" /mnt/loop

sudo dd if=/dev/zero of=/mnt/loop/file bs=1M count=500

I also copied a bunch of videos

sudo cp -r ~/d/70_Now_Watching/ /mnt/loop/

First, I check the distribution of data

sudo btrfs device usage /mnt/loop
/dev/loop0, ID: 1
   Device size:            20.00GiB
   Device slack:              0.00B
   Data,single:            19.00GiB
   Unallocated:             1.00GiB

/dev/loop1, ID: 2
   Device size:            20.00GiB
   Device slack:              0.00B
   Data,single:            18.00GiB
   Metadata,RAID1C3:        1.00GiB
   System,RAID1C3:          8.00MiB
   Unallocated:          1016.00MiB

/dev/loop2, ID: 3
   Device size:            20.00GiB
   Device slack:              0.00B
   Data,single:            18.00GiB
   Metadata,RAID1C3:        1.00GiB
   System,RAID1C3:          8.00MiB
   Unallocated:          1016.00MiB

/dev/loop3, ID: 4
   Device size:            20.00GiB
   Device slack:              0.00B
   Data,single:            18.00GiB
   Metadata,RAID1C3:        1.00GiB
   System,RAID1C3:          8.00MiB
   Unallocated:          1016.00MiB
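
As a side note, the same allocation information can be summarized for the whole filesystem (rather than per device) with:

sudo btrfs filesystem usage /mnt/loop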

Seems to be distributed pretty evenly. Now let's fuck shit up!!

sudo dd if=/dev/random of="$ld3"
dd: writing to '/dev/loop2': No space left on device
41943041+0 records in
41943040+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 101.421 s, 212 MB/s
sudo btrfs scrub start /mnt/loop/
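
The summary below was presumably read back with btrfs scrub status (the same command used again further down); this is an assumption about how it was retrieved, not something shown in the original commands:

sudo btrfs scrub status /mnt/loop/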

ERROR: there are uncorrectable errors

UUID:             b4ade67a-8c7b-45c3-b747-8280d9504714
Scrub started:    Sat Jan 21 23:16:09 2023
Status:           finished
Duration:         0:00:25
Total to scrub:   72.80GiB
Rate:             2.91GiB/s
Error summary:    super=2 csum=4698471
Corrected:      5662
Uncorrectable:  4692809
Unverified:     0
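
Per-device I/O and checksum error counters can also be inspected at this point; a side check, not part of the original run:

sudo btrfs device stats /mnt/loop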

Now we have a little script to check our files:

from pathlib import Path

error_count = 0
success_count = 0

for file_path in Path('/mnt/loop/').rglob('*'):
    try:
        if file_path.is_file():
            with open(file_path, 'rb') as f:
                # read the entire contents of the file
                file_contents = f.read()
                success_count += 1
    except IOError:
        error_count += 1

print(f'Number of successful reads: {success_count}')
print(f'Number of IO errors: {error_count}')
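
A rough shell equivalent of this check (a sketch; it only counts the files that fail to read in full) would be:

sudo find /mnt/loop -type f -exec sh -c 'cat "$1" > /dev/null 2>&1 || echo "$1"' _ {} \; | wc -l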

Results

And the results are... drumroll please......... no? ok fine.

Number of successful reads: 119
Number of IO errors: 66

Interesting, but I wonder if there is any variation in the sizes of those files, or whether we could simulate heavy file extent fragmentation.

Successful read files size: min 0       average 241136404 max 2397645276 sum 28695232117
IO error files size:        min 1888612 average 745390515 max 4884066696 sum 49195774012

Interesting... maybe I need to run a bigger test, but being able to read a 2.4 GB file is probably somewhat surprising to the person who said only KBs would be accessible. It is true, though, that only about 37% of the data was still accessible, and this is a very small, contrived simulation with only one process writing data.

It does seem like smaller files are more likely to survive, but that is to be expected. The gods of bits only make guillotines so big.

This result does make me feel a little bit better--though I will still investigate MergerFS a little bit more. I really like btrfs and switching to MergerFS seems like a lot of work...

A script to simulate this test using the output of btrfs inspect-internal dump-tree --extents might be interesting. I wonder how much file extent fragmentation my drives actually have.
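
Short of parsing dump-tree, filefrag gives a quick per-file extent count, so something like this sketch would surface the most fragmented files (path shown for the test mount; point it at a real mount to answer that question):

# print files sorted by extent count, most fragmented first
sudo find /mnt/loop -type f -exec filefrag {} + | sort -t: -k2 -rn | head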

Cleanup

sudo umount /mnt/loop
sudo losetup -d "$ld1" "$ld2" "$ld3" "$ld4"
rm d1.img d2.img d3.img d4.img

uname -a
Linux pakon 6.1.6-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Jan 14 16:55:06 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
@chapmanjacobd

https://news.ycombinator.com/item?id=34477899

This is very good advice. I did the same preparation; here is the distribution of files before the degraded state:

Number of successful reads: 280
Number of IO errors: 0
Successful read files size: sum 82648303047 max 4884066696 average 295172511

Then I unmounted the filesystem, deleted disk 2, ran echo 3 > /proc/sys/vm/drop_caches, and remounted the filesystem.

sudo umount /mnt/loop
echo 3 | sudo tee /proc/sys/vm/drop_caches
echo 3 | sudo tee /proc/sys/vm/drop_caches
echo 3 | sudo tee /proc/sys/vm/drop_caches

dmesg --human --nopager --decode --level emerg,alert,crit,err,warn,notice,info
kern  :info  : [Jan22 13:18] tee (215899): drop_caches: 3
kern  :info  : [  +3.232287] tee (215931): drop_caches: 3
kern  :info  : [  +0.775697] tee (215953): drop_caches: 3

rm d2.img
sudo mount "$ld1" /mnt/loop

I am surprised that mounting worked without error, but I guess the device is still active via losetup. I'm assuming this would be similar to an actual disk failure though; if the device really weren't there, btrfs would presumably complain and require the -o degraded flag.

There was nothing exciting in dmesg

kern  :info  : [ +14.363762] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
kern  :info  : [  +0.000004] BTRFS info (device loop0): using free space tree

Oohh weird... everything is still readable:

Number of successful reads: 280
Number of IO errors: 0
Successful read files size: sum 82648303047 max 4884066696 average 295172511

sudo btrfs scrub status /mnt/loop/
UUID:             a57027e5-feb8-4f58-9022-f5dc0a5c67ac
Scrub started:    Sun Jan 22 13:33:49 2023
Status:           finished
Duration:         0:00:28
Total to scrub:   77.25GiB
Rate:             2.76GiB/s
Error summary:    no errors found

Okay, it turns out the deleted d2.img is still attached to the loop device, so btrfs never actually saw a missing disk.
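
This should be visible in losetup's listing: util-linux marks a loop device whose backing file has been unlinked with a "(deleted)" suffix (the example output below is from memory and the path is a placeholder):

losetup -l
# NAME        ...  BACK-FILE
# /dev/loop1  ...  /path/to/d2.img (deleted)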

sudo losetup -d $ld2
sudo umount /mnt/loop
echo 3 | sudo tee /proc/sys/vm/drop_caches

Now we get some interesting stuff in dmesg

sudo mount -o degraded "$ld1" /mnt/loop
mount: /mnt/loop: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
    dmesg(1) may have more information after failed mount system call.

kern  :info  : [Jan22 13:37] tee (222135): drop_caches: 3
kern  :info  : [ +16.362674] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
kern  :info  : [  +0.000004] BTRFS info (device loop0): using free space tree
kern  :err   : [  +0.000419] BTRFS error (device loop0): devid 2 uuid 1b352839-f719-499f-b9a7-25ed4d06e2be is missing
kern  :err   : [  +0.000003] BTRFS error (device loop0): failed to read chunk tree: -2
kern  :err   : [  +0.000183] BTRFS error (device loop0): open_ctree failed
kern  :info  : [ +11.713125] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
kern  :info  : [  +0.000004] BTRFS info (device loop0): allowing degraded mounts
kern  :info  : [  +0.000001] BTRFS info (device loop0): using free space tree
kern  :warn  : [  +0.000167] BTRFS warning (device loop0): devid 2 uuid 1b352839-f719-499f-b9a7-25ed4d06e2be is missing
kern  :warn  : [  +0.007647] BTRFS warning (device loop0): chunk 2177892352 missing 1 devices, max tolerance is 0 for writable mount
kern  :warn  : [  +0.000002] BTRFS warning (device loop0): writable mount is not allowed due to too many missing devices
kern  :err   : [  +0.000155] BTRFS error (device loop0): open_ctree failed

But we can still mount it read-only and degraded:

sudo mount -o ro,degraded "$ld1" /mnt/loop

And the results are

Number of successful reads: 219
Number of IO errors: 61
Successful read files size: sum 21798190683 max 2122064756 average 99535117
IO error files size:        sum 60850112364 max 4884066696 average 997542825

In this test about 26% of data is still fully readable (21798190683 / (21798190683+60850112364)).

I also tried another variant of the experiment where I did all of the above but ran these commands before removing the disk:

sudo rm /mnt/loop/file  # a 500 MB file included in the above tests; deleted to give btrfs defrag some room to work
sudo btrfs fi defrag -v -r -czstd /mnt/loop/
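
To see how much the -czstd pass actually saved, the compsize tool (if installed) reports compressed vs. uncompressed sizes for a btrfs path; a side note, not part of the original test:

sudo compsize /mnt/loop/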

sudo losetup -d $ld2
sudo umount /mnt/loop
echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo mount -o ro,degraded "$ld1" /mnt/loop

and the results are not much better... in fact they are worse: about 20% readable, lol

Number of successful reads: 199
Number of IO errors: 80
Successful read files size: sum 16695157031 max 2122064756 average 83895261
IO error files size:        sum 65428858016 max 4884066696 average 817860725
For reference, here is the updated script used to collect the size statistics:

from pathlib import Path

error_count = 0
success_count = 0
error_files_size = []
success_files_size = []

# walk the mounted filesystem and try to read every file in full,
# recording the size of each file that succeeds or fails
for file_path in Path('/mnt/loop/').rglob('*'):
    try:
        if file_path.is_file():
            with open(file_path, 'rb') as f:
                # read the entire contents of the file
                file_contents = f.read()
                success_count += 1
                success_files_size.append(file_path.stat().st_size)
    except IOError:
        error_count += 1
        error_files_size.append(file_path.stat().st_size)

print(f'Number of successful reads: {success_count}')
print(f'Number of IO errors: {error_count}')
if success_files_size:
    print(f'Successful read files size: sum {sum(success_files_size)} max {max(success_files_size)} average {sum(success_files_size)/len(success_files_size)}')
if error_files_size:  # skip this line when there were no failed reads (avoids errors on an empty list)
    print(f'IO error files size:        sum {sum(error_files_size)} max {max(error_files_size)} average {sum(error_files_size)/len(error_files_size)}')


chapmanjacobd commented Jan 22, 2023

With raid0 the amount of readable data is a lot less... under 1 MB in total. Those files are probably mostly inline extents (small files stored in the raid1 metadata), plus small files whose single extent happened to land on a surviving device (see the filefrag sketch below).

sudo mkfs.btrfs -d raid0 -m raid1 "$ld1" "$ld2" "$ld3" "$ld4" 

Before removing device 2

Number of successful reads: 280
Number of IO errors: 0
Successful read files size: sum 83730370013 max 4884066696 average 299037035

After removing device 2

Number of successful reads: 109
Number of IO errors: 171
Successful read files size: sum 694079      max 57344      average 6367
IO error files size:        sum 83729675934 max 4884066696 average 489647227
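
One way to check the inline-extent theory: FIEMAP flags inline extents, and filefrag -v displays that flag, so any of the tiny still-readable files should show a single extent marked "inline" (the path below is a placeholder, not a file from the test):

sudo filefrag -v /mnt/loop/70_Now_Watching/some_small_file.srt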
