We have a failing disk.
There is data on the disk that we would like to recover, if at all possible.
We can still access the disk using ddrescue, but it is really slow: when we simply try to dump the whole disk, the estimated runtime is more than a year.
We are not actually interested in all the blocks of the disk: there are only specific files that are of interest to us, and/or the filesystem has quite a bit of free space remaining.
So we need a smarter way to run ddrescue, one that reads only those parts of the disk that we need for recovering what we are interested in, and ignores the blocks that are not necessary for this.
To achieve this goal, we read a small area of the disk to identify the partition table.
Then we expose the disk via nbd to a client, and attempt to access the filesystem on the disk, while we trace all the read access done on the server side.
Naturally, as we only have a small set of data available initially, and any other blocks return zeroes when they are read, we will not see many files on the filesystem, or might even encounter mounting errors. In the worst case, we could even trigger a kernel oops, because some filesystems don't deal all that well with on-disk corruption.
However, even with all these caveats, we will still be able to tell which blocks of the disk image were accessed when we tried to access the filesystem.
We can use this information to generate the ranges that we want to read from the failing disk using ddrescue.
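As a rough illustration of turning traced reads into ranges, the sketch below coalesces individual reads into sorted, non-overlapping ranges. The function name `merge_reads` and the 4096-byte alignment are assumptions made for this example; they are not taken from the provided script.

```python
def merge_reads(reads, align=4096):
    """Coalesce (offset, length) reads into sorted, non-overlapping ranges.

    Each read is first widened to `align`-byte boundaries (an assumed
    block size), then overlapping or touching spans are merged.
    Returns a list of (offset, length) tuples in bytes.
    """
    spans = []
    for offset, length in reads:
        start = offset - offset % align
        end = -(-(offset + length) // align) * align  # round up to alignment
        spans.append((start, end))

    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(start, end - start) for start, end in merged]
```

Merging before reading keeps the number of ddrescue invocations small and avoids requesting the same blocks from the failing disk twice.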
Once we have read these ranges, we can again try to access the filesystem.
With any luck, we will get further now, because the blocks that we need have been read from the failing disk.
This means that we will access more blocks, because we likely read not only file data from disk, but also structures like directory trees, which point to further data.
Having accessed more blocks, we have even more ranges that we can plug into ddrescue.
This way, we can iteratively extend our search and recover the blocks that we need, without copying lots of uninteresting blocks and stressing the disk for no benefit.
All of this happens in a ZFS dataset that was created with
zfs create storagepool/recovery && cd /storagepool/recovery
Now, dump the first 6 MiB (6291456 bytes) of the disk, so we have basic information such as the partition table and, hopefully, the superblock:
ddrescue -i 0 -s 6291456 --cpass=1 /dev/failing-disk disk.img disk.map
Once this ddrescue invocation is complete, pad the image to the size of the original disk:
truncate -s "$(blockdev --getsize64 /dev/failing-disk)" disk.img
Create a snapshot (for safety, in case we screw something up later):
zfs snapshot "storagepool/recovery@$(date -u +%Y-%m-%d-%H-%M-%S)"
Configure nbd:
[generic]
# It's a good idea to only bind to a local address
listenaddr = 100.101.102.1
[recovery]
exportname = /storagepool/recovery/disk.img
readonly = true
transactionlog = /storagepool/recovery/nbd.trace
This exposes the current dump via nbd and creates a trace of every access made to it.
(Re)start nbd:
systemctl restart nbd-server.service
On another host, try to access all of the filesystem that we can find (or the files we are interested in, respectively):
nbd-client -N recovery 100.101.102.1 /dev/nbd0
mount.exfat /dev/nbd0p1 /mnt/recovery
find /mnt/recovery
tar c /mnt/recovery | wc -c
(It is recommended to use another host because accessing a broken filesystem might cause a kernel panic, which should really be avoided on the system where the nbd trace is being written.)
Now, unmount again:
umount /mnt/recovery
nbd-client -d /dev/nbd0
Now, on the storage host, stop nbd:
systemctl stop nbd-server.service
Then, use nbd-trdump and the provided script to get a list of ddrescue commands to attempt next:
nbd-trdump < nbd.trace | ./trdump-to-args.py | tee nbd.trace.$(date -u +%Y-%m-%d-%H-%M-%S).cmds
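To give an idea of what this conversion step involves, here is a minimal sketch of parsing trace lines and printing one ddrescue command per read. It is not the provided trdump-to-args.py. The line format matched by `READ_RE` (a request line containing `NBD_CMD_READ` followed by hexadecimal `O=` offset and `L=` length fields) is an assumption about nbd-trdump output and should be checked against your version. The helper names are hypothetical.

```python
import re
import shlex

# Assumed shape of a read request line in nbd-trdump output, e.g.:
# > H=0000000000000001 C=0x00000000 (NBD_CMD_READ+NONE) O=0000000000100000 L=00001000
READ_RE = re.compile(r"NBD_CMD_READ.*?\bO=([0-9a-fA-F]+)\s+L=([0-9a-fA-F]+)")


def parse_reads(lines):
    """Yield (offset, length) in bytes for every read request line."""
    for line in lines:
        match = READ_RE.search(line)
        if match:
            yield int(match.group(1), 16), int(match.group(2), 16)


def ddrescue_command(offset, length, device="/dev/failing-disk",
                     image="disk.img", mapfile="disk.map"):
    """Build a ddrescue command line that copies one range.

    Uses the same -i (input position) and -s (size) options as the
    initial dump; the default paths match the ones used in this document.
    """
    args = ["ddrescue", "-i", str(offset), "-s", str(length),
            device, image, mapfile]
    return " ".join(shlex.quote(arg) for arg in args)
```

Feeding the lines printed by nbd-trdump into `parse_reads` and passing each result to `ddrescue_command` yields one command per traced read. In practice, overlapping reads should be merged first.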
Run these commands.
If offsets fail, add them to the failing_offsets list in the provided script, so that these blocks are ignored for the time being and the broken areas are not hammered each time the commands are rerun.
(Once you have read all the sectors that are readable and no new sectors show up in the command list, you might remove the failing offsets from the list and have ddrescue try to read them using its various strategies.)
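The effect of such an exclusion list can be sketched as follows. This assumes that failing_offsets holds byte offsets and that a range is skipped whenever it contains one of them. The provided script may interpret the list differently, so treat `skip_failing` as a hypothetical helper.

```python
def skip_failing(ranges, failing_offsets):
    """Drop every (offset, length) range that contains a known-bad offset.

    `ranges` are byte ranges to read; `failing_offsets` are byte offsets
    that previously failed (assumed semantics, see the text above).
    """
    kept = []
    for offset, length in ranges:
        if any(offset <= bad < offset + length for bad in failing_offsets):
            continue  # leave the broken area alone for now
        kept.append((offset, length))
    return kept
```

Emptying the list later re-enables those ranges, which corresponds to the final pass described above.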
Once the commands have all been run, start again from the step that creates the snapshot, and repeat the process until no new ddrescue commands appear in the generated command list.
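One way to check whether a requested range is already covered is to consult the ddrescue mapfile (disk.map). The sketch below assumes the usual mapfile layout: comment lines starting with `#`, one status line, then `pos size status` lines where `+` marks a finished block. It also assumes that adjacent finished blocks are merged in the file. The function names are made up for this example.

```python
def rescued_ranges(mapfile_text):
    """Return (pos, size) in bytes for every finished ('+') block."""
    lines = [line for line in mapfile_text.splitlines()
             if line.strip() and not line.startswith("#")]
    blocks = []
    for line in lines[1:]:  # lines[0] is the current_pos status line
        pos, size, status = line.split()[:3]
        if status == "+":
            blocks.append((int(pos, 0), int(size, 0)))
    return blocks


def is_rescued(offset, length, rescued):
    """True if [offset, offset + length) lies inside one finished block."""
    end = offset + length
    return any(pos <= offset and end <= pos + size for pos, size in rescued)
```

If every range in the newly generated command list is already rescued (or is on the failing list), the iteration has converged.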