@cfra
Last active May 7, 2023 10:41
How to use ddrescue to recover allocated areas or specific files from a filesystem on a failing disk

Situation

We have a failing disk.

There is data on the disk that we would like to recover, if at all possible.

We can still access the disk using ddrescue, but it is really slow: when we just try to dump the whole disk, the estimated runtime is more than a year.

We are not actually interested in all the blocks of the disk: There are only specific files that are of interest to us and/or the filesystem has quite a bit of free space remaining.

So we need to come up with a smarter way to run ddrescue, one that reads the parts of the disk we need for recovering what we are interested in and ignores the blocks that are not necessary for this.

To achieve this goal, we read a small area of the disk to identify the partition table.

Then we expose the disk via nbd to a client, and attempt to access the filesystem on the disk, while we trace all the read access done on the server side.

Naturally, as we only have a small set of data available initially, and any other blocks return zeroes when they are read, we will not see many files on the filesystem, or might even encounter mounting errors. In the worst case, we could even trigger a kernel oops, because some filesystems don't deal all that well with on-disk corruption.

However, even with all these caveats, we will still be able to tell which blocks were accessed in the disk image when we tried to access the filesystem.

We can use this information to generate ranges we want to read from the failing disk using ddrescue.

Once we have read these ranges, we can again try to access the filesystem.

With any luck, we will get further now, because the blocks that we need have been read from the failing disk.

This means that we will access more blocks, because we likely did not only read file data from disk, but also structures like directory trees.

Now that we have accessed more blocks, we have even more ranges that we can plug into ddrescue.

This way, we can iteratively extend our search and recover the blocks that we need, without copying lots of uninteresting blocks and stressing the disk for no benefit.
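The loop described above is essentially a fixed-point search. As a toy model only, it can be sketched as follows. The `references` map and the name `iterate_recovery` are made up for illustration; on a real disk, the only way to learn which blocks are requested next is the nbd trace.

```python
def iterate_recovery(references, start_blocks):
    """Toy model of the recover-and-retrace loop.

    references:   hypothetical map from a block to the blocks the filesystem
                  asks for once that block is readable.
    start_blocks: blocks dumped by the very first ddrescue run.
    Returns (recovered_blocks, number_of_rounds).
    """
    recovered = set()
    wanted = set(start_blocks)
    rounds = 0
    while wanted - recovered:
        # "ddrescue" step: copy everything that has been requested so far.
        recovered |= wanted
        rounds += 1
        # "mount and trace" step: see which blocks get requested now.
        wanted = set()
        for block in recovered:
            wanted.update(references.get(block, ()))
    return recovered, rounds


if __name__ == "__main__":
    # Block 0 (think: partition table) leads to 1 and 2, block 1 leads to 3.
    # Block 7 is never referenced, so it is never copied.
    toy = {0: [1, 2], 1: [3], 7: [8]}
    print(iterate_recovery(toy, [0]))
```

The loop stops as soon as a round requests nothing that has not been copied already, which mirrors the stop condition used in the procedure below.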

Recovery

All of this happens in a ZFS dataset that was created with:

zfs create storagepool/recovery && cd /storagepool/recovery

Now, dump the first blocks of the disk (6 MiB here), so that we have basic information such as the partition table and, hopefully, the superblock:

ddrescue -i 0 -s 6291456 --cpass=1 /dev/failing-disk disk.img disk.map

Once this ddrescue invocation is complete, pad the image to the size of the original disk:

truncate -s $(blockdev --getsize64 /dev/failing-disk) disk.img
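To illustrate what the padding does, here is a minimal Python sketch of the same operation. The helper name `pad_image` and the sizes are stand-ins chosen for this example; the real target size is whatever `blockdev --getsize64` reports. The point is that the padded region reads back as zeroes, which is why unread blocks later look like zeroes to the nbd client.

```python
import os
import tempfile


def pad_image(path, target_size):
    """Extend the image to target_size bytes without touching existing data."""
    current = os.path.getsize(path)
    if target_size < current:
        raise ValueError("refusing to shrink %s from %d to %d bytes"
                         % (path, current, target_size))
    os.truncate(path, target_size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        image = os.path.join(tmp, "disk.img")
        with open(image, "wb") as f:
            f.write(b"\xaa" * 4096)        # stand-in for the rescued start
        pad_image(image, 1024 * 1024)      # stand-in for the real disk size
        with open(image, "rb") as f:
            f.seek(4096)
            print(f.read(16))              # bytes from the padded region
```

On filesystems that support sparse files, the padding should not consume real space until ddrescue writes data into it.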

Create a snapshot (for safety, in case we screw something up later):

zfs snapshot "storagepool/recovery@$(date -u +%Y-%m-%d-%H-%M-%S)"

Configure nbd:

[generic]
	listenaddr = 100.101.102.1 # It's a good idea to only bind to a local address

[recovery]
	exportname = /storagepool/recovery/disk.img
	readonly = true
	transactionlog = /storagepool/recovery/nbd.trace

This exposes the current dump via nbd and records a trace of every access to it.

(Re)start nbd:

systemctl restart nbd-server.service

On another host, try to access all of the filesystem that we can find (or the files we are interested in, respectively):

nbd-client -N recovery 100.101.102.1 /dev/nbd0
mount.exfat /dev/nbd0p1 /mnt/recovery
find /mnt/recovery
tar c /mnt/recovery | wc -c

(It is recommended to use another host because access to a broken filesystem might cause a kernel panic, which is something that should really be avoided on the system where the nbd trace is being written)
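The `find` and `tar` invocations exist only to make the client request every directory entry and every data block. If the half-recovered filesystem throws I/O errors, a walk that carries on past failures may get further than `tar`. The following is a sketch of that idea, not part of the original procedure; the function name `read_everything` is made up for this example.

```python
import os


def read_everything(root, chunk_size=1024 * 1024):
    """Walk root and read every file that os.walk reports, tolerating I/O errors.

    Returns (bytes_read, failed_paths).
    """
    bytes_read = 0
    failed = []

    def remember(error):
        # os.walk calls this when a directory cannot be listed.
        failed.append(error.filename)

    for dirpath, dirnames, filenames in os.walk(root, onerror=remember):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while True:
                        chunk = f.read(chunk_size)
                        if not chunk:
                            break
                        bytes_read += len(chunk)
            except OSError:
                # Keep going: the reads attempted so far are already in the trace.
                failed.append(path)
    return bytes_read, failed


if __name__ == "__main__":
    total, failed = read_everything("/mnt/recovery")
    print("read %d bytes, %d paths failed" % (total, len(failed)))
```

To recover only specific files, point the function at the relevant subdirectory instead of the mount point.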

Now, unmount again:

umount /mnt/recovery
nbd-client -d /dev/nbd0

Now, on the storage host, stop nbd:

systemctl stop nbd-server.service

Then, use nbd-trdump and the provided script to get a list of ddrescue commands to attempt next:

nbd-trdump < nbd.trace | ./trdump-to-args.py | tee nbd.trace.$(date -u +%Y-%m-%d-%H-%M-%S).cmds

Run these commands.

If offsets fail, add them to the failing_offsets list in the provided script so that these blocks are ignored for the time being and the broken areas are not hammered each time the commands are rerun.

(Once you have read all the sectors that are readable and no new sectors show up in the command list, you might remove the failing offsets from the list and have ddrescue try to read them using its various strategies.)
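One way to find the failing offsets is to look at disk.map instead of watching the ddrescue output. The sketch below assumes the mapfile layout documented in the GNU ddrescue manual: `#` comment lines, one status line, then `pos size status` lines with `?` for non-tried, `*` for non-trimmed, `/` for non-scraped, `-` for bad sector and `+` for finished. Verify this against your ddrescue version. The names `unrecovered_areas` and `offsets_to_skip` are made up for this example.

```python
def unrecovered_areas(mapfile_lines, statuses="*/-"):
    """Yield (position, size) in bytes for areas that were tried but not rescued.

    Non-tried areas ('?') are left out on purpose: most of the disk is
    non-tried simply because nobody has asked for it yet.
    """
    seen_status_line = False
    for line in mapfile_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if not seen_status_line:
            # First data line is "current_pos current_status [current_pass]".
            seen_status_line = True
            continue
        fields = line.split()
        if len(fields) == 3 and fields[2] in statuses:
            yield int(fields[0], 0), int(fields[1], 0)


def offsets_to_skip(areas, block_size=1024 * 1024):
    """Return the start offset of every block touched by one of the areas."""
    offsets = set()
    for position, size in areas:
        first = position // block_size
        last = (position + size - 1) // block_size
        offsets.update(block * block_size for block in range(first, last + 1))
    return sorted(offsets)


if __name__ == "__main__":
    # Hand-written sample in the assumed format, not output from a real run.
    sample = """\
# current_pos  current_status  current_pass
0x00600000     +               1
#      pos        size  status
0x00000000  0x00500000  +
0x00500000  0x00100000  *
0x00600000  0x7FA00000  ?
""".splitlines()
    for offset in offsets_to_skip(unrecovered_areas(sample)):
        print("    %d," % offset)   # paste into failing_offsets
```

Because the generated commands use -N and --cpass=1, failed reads are expected to show up as non-trimmed rather than as bad sectors, which is why all three statuses are treated as failing here.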

Once the commands have all been run, start again from the step creating the snapshot, and repeat the process, until no new ddrescue commands appear in the generated command list.
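To decide whether another round is needed, two generated command lists can be compared. This is a rough sketch; the function name `new_commands` is made up, and the comparison is purely textual.

```python
def new_commands(previous_cmds, current_cmds):
    """Return the commands in current_cmds that are not in previous_cmds, in order."""
    seen = {line.strip() for line in previous_cmds if line.strip()}
    fresh = []
    for line in current_cmds:
        line = line.strip()
        if line and line not in seen:
            seen.add(line)
            fresh.append(line)
    return fresh


if __name__ == "__main__":
    import sys
    # Usage: python3 new_commands.py OLD.cmds NEW.cmds
    with open(sys.argv[1]) as old, open(sys.argv[2]) as new:
        for command in new_commands(old, new):
            print(command)
```

A range that grew by one block shows up as an entirely new command here. That is harmless, because ddrescue consults disk.map and does not reread areas it has already finished.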

#!/usr/bin/env python3
import logging
import re
import sys

# logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# The commands we are interested in look like this:
# > H=2f00000009000000 C=0x00000000 ( NBD_CMD_READ+NONE) O=0000000829c7d000 L=00020000
pattern = re.compile(r'^> .*NBD_CMD_READ.*O=([0-9a-f]+) L=([0-9a-f]+)$')

# The block size in which we attempt to read areas with ddrescue.
# 1MiB is chosen somewhat arbitrarily, hoping to achieve a good balance between reading too much
# because the blocks are too large and creating a too convoluted command list because blocks are too small.
block_size = 1024*1024

# A list of failing offsets in bytes (can be taken from the -i of the generated ddrescue invocations)
failing_offsets = [
]
failing_blocks = set(failing_offset // block_size for failing_offset in failing_offsets)

# Collect every block that is touched by at least one read in the trace.
blocks = set()
for line in sys.stdin:
    match = pattern.match(line)
    if match is None:
        logger.debug("Ignoring line %r, it is not a read.", line)
        continue
    read_offset = int(match.group(1), 16)
    read_length = int(match.group(2), 16)
    logger.debug("Observed read at %x len %x", read_offset, read_length)
    first_block = read_offset // block_size
    last_block = (read_offset + read_length - 1) // block_size
    for block in range(first_block, last_block + 1):
        blocks.add(block)


def command_for_blocks(first_block, last_block):
    logger.debug("Generating command for blocks %d - %d", first_block, last_block)
    print("ddrescue -i %d -s %d -N -K %d --cpass=1 /dev/failing-disk disk.img disk.map" % (
        first_block * block_size,
        (last_block + 1 - first_block) * block_size,
        block_size
    ))


blocks -= failing_blocks
blocks = sorted(blocks)

# Merge consecutive blocks into ranges and print one command per range.
first_in_range = None
last_block = None
for block in blocks:
    if first_in_range is None:
        # Only in first step
        first_in_range = block
        last_block = block
        continue
    if last_block + 1 == block:
        # Range continues
        last_block = block
        continue
    # Range does not continue. Print it and start a new range at the current block
    command_for_blocks(first_in_range, last_block)
    first_in_range = block
    last_block = block

# Print the last range, as it is still ongoing (unless the trace contained no reads at all)
if first_in_range is not None:
    command_for_blocks(first_in_range, last_block)
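As a worked example of the block arithmetic, the following standalone snippet pushes the sample trace line from the script's comment through the same regular expression and the same rounding. It reuses the script's `pattern` and `block_size` values; everything else is just for illustration.

```python
import re

pattern = re.compile(r'^> .*NBD_CMD_READ.*O=([0-9a-f]+) L=([0-9a-f]+)$')
block_size = 1024 * 1024

line = "> H=2f00000009000000 C=0x00000000 ( NBD_CMD_READ+NONE) O=0000000829c7d000 L=00020000\n"

match = pattern.match(line)
read_offset = int(match.group(1), 16)   # 0x829c7d000
read_length = int(match.group(2), 16)   # 0x20000, i.e. 128 KiB

# Round the read outwards to whole blocks, exactly as the script does.
first_block = read_offset // block_size
last_block = (read_offset + read_length - 1) // block_size
start = first_block * block_size
size = (last_block + 1 - first_block) * block_size

print("read  at 0x%x, length 0x%x" % (read_offset, read_length))
print("range at 0x%x, length 0x%x" % (start, size))
print("ddrescue -i %d -s %d ..." % (start, size))
```

The 128 KiB read at 0x829c7d000 lies entirely inside one 1 MiB block, so the generated range starts at 0x829c00000 and is exactly one block long. A read that crosses a block boundary would produce a range of two blocks.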