cfra/targeted-ddrescue.md

## targeted-ddrescue.md

      
### Situation

We have a failing disk.
There is data on the disk that we would like to recover, if at all possible.
We can still access the disk using ddrescue, but it is really slow: when we simply
try to dump the whole disk, the estimated runtime is more than a year.
We are not actually interested in all the blocks of the disk: There are only specific
files that are of interest to us and/or the filesystem has quite a bit of free space
remaining.
So we need to come up with a smarter way to run ddrescue, one that reads only those parts of
the disk that we need for recovering what we are interested in, and ignores the blocks that
are not necessary for this.
To achieve this goal, we read a small area of the disk to identify the partition table.
Then we expose the disk via nbd to a client, and attempt to access the filesystem on the disk,
while we trace all the read access done on the server side.
Naturally, as we only have a small set of data available initially, and any other blocks return
zeroes when they are read, we will not see many files on the filesystem, or might even encounter mount
errors. In the worst case, we could even trigger a kernel oops, because some filesystems don't deal all that well
with on-disk corruption.
However, even with all these caveats, we will still be able to tell which blocks were accessed in the disk image
when we tried to access the filesystem.
We can use this information to generate ranges we want to read from the failing disk using ddrescue.
Once we have read these ranges, we can again try to access the filesystem.
With any luck, we will get further now, because the blocks that we need have been read from the failing disk.
This means that we will access more blocks, because we likely read not only file data from the disk, but also structures
like directory trees that point to further blocks.
Having accessed more blocks, we have even more ranges that we can plug into ddrescue.
This way, we can iteratively extend our search and recover the blocks that we need, without having to copy
lots of uninteresting blocks and stressing the disk without getting any benefit in return.
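The iteration described above can be pictured with a toy model. The sketch below is not part of the recovery tooling: the `references` table is a made-up stand-in for "reading this block reveals those blocks" (think superblock, then directory, then file data), and `iterative_recovery` is a hypothetical name. It only shows why each round uncovers more ranges, and why the process stops once a round asks for nothing new.

```python
def iterative_recovery(references, start_blocks):
    """Alternate between 'rescue what was asked for' and 'trace a new mount'."""
    recovered = set()
    wanted = set(start_blocks)
    rounds = 0
    while wanted - recovered:
        # ddrescue step: copy every block the last trace asked for.
        recovered |= wanted
        rounds += 1
        # Trace step: only blocks that are already in the image can reveal
        # which other blocks the filesystem wants to read next.
        wanted = set(start_blocks)
        for block in recovered:
            wanted.update(references.get(block, ()))
    return recovered, rounds


# Hypothetical layout: block 0 points to 1, block 1 to 2 and 3, and so on.
# Block 30 exists on the disk but nothing of interest points to it.
references = {0: [1], 1: [2, 3], 2: [10, 11], 3: [20]}
recovered, rounds = iterative_recovery(references, [0])
print(sorted(recovered), "after", rounds, "rounds")
```

Block 30 is never requested, so it is never read from the failing disk. That is the whole point of the approach.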
### Recovery

All of this happens in a ZFS dataset that was created with:
zfs create storagepool/recovery && cd /storagepool/recovery
Now dump the first blocks (6 MiB here), so that we have basic information about the disk, such as the partition table and hopefully the superblock:
ddrescue -i 0 -s 6291456 --cpass=1 /dev/failing-disk disk.img disk.map
Once this ddrescue invocation is complete, pad the image to the size of the original disk:
truncate -s $(blockdev --getsize64 /dev/failing-disk) disk.img
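As a small illustration of what the padding step does, the sketch below grows a file the same way in Python and checks that the existing data is untouched while the new tail reads back as zeroes. This is why an nbd client sees zeroes for every block that has not been rescued yet. The helper name `pad_image` and the temporary file are assumptions made for the example. It does not query a real block device.

```python
import os
import tempfile


def pad_image(image_path, target_size):
    """Grow image_path to target_size bytes without touching existing data."""
    current_size = os.path.getsize(image_path)
    if target_size < current_size:
        # Shrinking would throw away rescued data, so refuse to do it.
        raise ValueError("refusing to shrink the image")
    os.truncate(image_path, target_size)


with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "disk.img")
    # Stand-in for the data rescued so far.
    with open(image, "wb") as f:
        f.write(b"\x55" * 4096)

    pad_image(image, 1024 * 1024)

    padded_size = os.path.getsize(image)
    with open(image, "rb") as f:
        head = f.read(4096)   # the rescued part
        tail = f.read()       # the padded part

print(padded_size, head == b"\x55" * 4096, tail.count(0) == len(tail))
```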
Create a snapshot (for safety, in case we screw something up later):
zfs snapshot "storagepool/recovery@$(date -u +%Y-%m-%d-%H-%M-%S)"
Configure nbd:
[generic]
	# It's a good idea to only bind to a local address
	listenaddr = 100.101.102.1

[recovery]
	exportname = /storagepool/recovery/disk.img
	readonly = true
	transactionlog = /storagepool/recovery/nbd.trace
This exposes the current dump via nbd and writes a trace of every access made to it.
(Re)start nbd:
systemctl restart nbd-server.service
On another host, try to access all of the filesystem that we can find (or the files we are interested in, respectively):
nbd-client -N recovery 100.101.102.1 /dev/nbd0
mount.exfat /dev/nbd0p1 /mnt/recovery
find /mnt/recovery
tar c /mnt/recovery | wc -c
(It is recommended to use another host, because access to a broken filesystem might cause a kernel panic, which is something
that should really be avoided on the system where the nbd trace is being written.)
Now, unmount again:
umount /mnt/recovery
nbd-client -d /dev/nbd0

Now, on the storage host, stop nbd:
systemctl stop nbd-server.service
Then, use nbd-trdump and the provided script to get a list of ddrescue commands to attempt next:
nbd-trdump < nbd.trace | ./trdump-to-args.py | tee nbd.trace.$(date -u +%Y-%m-%d-%H-%M-%S).cmds
Run these commands.
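To make the translation from trace lines to ddrescue ranges concrete, here is a condensed sketch of the same idea as the provided script: match read requests, map each one to the 1 MiB blocks it touches, and merge neighbouring blocks into ranges. The first trace line is the example from the script's comment. The other two lines, and the helper names `blocks_for_line` and `coalesce`, are made up for illustration.

```python
import re

BLOCK_SIZE = 1024 * 1024
READ = re.compile(r'^> .*NBD_CMD_READ.*O=([0-9a-f]+) L=([0-9a-f]+)$')


def blocks_for_line(line):
    """Return the block numbers touched by one trace line (empty if not a read)."""
    match = READ.match(line)
    if match is None:
        return range(0)
    offset = int(match.group(1), 16)
    length = int(match.group(2), 16)
    return range(offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE + 1)


def coalesce(blocks):
    """Merge block numbers into inclusive (first, last) ranges."""
    ranges = []
    for block in sorted(set(blocks)):
        if ranges and ranges[-1][1] + 1 == block:
            ranges[-1][1] = block          # range continues
        else:
            ranges.append([block, block])  # start a new range
    return [(first, last) for first, last in ranges]


trace = [
    "> H=2f00000009000000 C=0x00000000 ( NBD_CMD_READ+NONE) O=0000000829c7d000 L=00020000",
    # Hypothetical read that straddles the boundary between blocks 0 and 1.
    "> H=2f0000000a000000 C=0x00000000 ( NBD_CMD_READ+NONE) O=00000000000ff000 L=00002000",
    # Hypothetical non-read request, which is ignored.
    "> H=2f0000000b000000 C=0x00000001 ( NBD_CMD_WRITE+NONE) O=0000000000300000 L=00001000",
]

touched = [block for line in trace for block in blocks_for_line(line)]
ranges = coalesce(touched)
for first, last in ranges:
    print("-i %d -s %d" % (first * BLOCK_SIZE, (last + 1 - first) * BLOCK_SIZE))
```

A read of only 8 KiB that crosses a block boundary pulls in two full blocks. That is the price of the coarse block size, and it keeps the command list short.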
If offsets fail, add them to the failing_offsets list in the provided script, so that these blocks are
ignored for the time being and the broken areas are not hammered each time the commands are rerun.
(Once you have read all the sectors that are readable and no new sectors show up in the command list, you might
remove the failing offsets from the list and have ddrescue try to read them using its various strategies.)
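Collecting the failing offsets by hand gets tedious. One possible shortcut, sketched below, is to read them out of the ddrescue mapfile. The sketch assumes the usual mapfile layout (comment lines starting with `#`, one status line, then `pos size status` lines) and treats the statuses `*`, `/` and `-` as "tried but not read successfully". Which statuses you want to skip is a judgment call, and the function name and sample mapfile are made up for the example.

```python
BLOCK_SIZE = 1024 * 1024
FAILED_STATUSES = {"*", "/", "-"}  # non-trimmed, non-scraped, bad sector


def failing_offsets_from_mapfile(lines, block_size=BLOCK_SIZE):
    """Return block-aligned byte offsets of all areas that failed to read."""
    offsets = set()
    seen_status_line = False
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # blank line or comment
        if not seen_status_line:
            seen_status_line = True  # first data line is the current position
            continue
        pos, size, status = int(fields[0], 0), int(fields[1], 0), fields[2]
        if status not in FAILED_STATUSES or size == 0:
            continue
        first = pos // block_size
        last = (pos + size - 1) // block_size
        offsets.update(block * block_size for block in range(first, last + 1))
    return sorted(offsets)


# Made-up mapfile: 6 MiB rescued, one failed spot in block 6, blocks 9-10 bad.
sample_mapfile = """\
# Mapfile. Created by GNU ddrescue
# current_pos  current_status  current_pass
0x00600000     +               1
#      pos        size  status
0x00000000  0x00600000  +
0x00600000  0x00010000  *
0x00610000  0x002F0000  ?
0x00900000  0x00200000  -
0x00B00000  0x7F500000  ?
""".splitlines()

failing = failing_offsets_from_mapfile(sample_mapfile)
print(failing)
```

The resulting numbers can be pasted into `failing_offsets`. The script divides each offset by the block size, so any offset inside a block excludes the whole block.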
Once the commands have all been run, start again from the step creating the snapshot, and repeat the process,
until no new ddrescue commands appear in the generated command list.
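Whether "no new commands appear" is easier to judge by blocks than by comparing command text, because a range from an earlier round may simply have grown. The sketch below compares two saved `.cmds` lists on that basis. It assumes the command format printed by the provided script (`-i <offset> -s <size>` with both values being multiples of 1 MiB). The helper name `covered_blocks` and the sample commands are made up.

```python
import re

BLOCK_SIZE = 1024 * 1024
ARGS = re.compile(r'-i (\d+) -s (\d+)')


def covered_blocks(cmd_lines):
    """Return the set of block numbers covered by a list of ddrescue commands."""
    blocks = set()
    for line in cmd_lines:
        match = ARGS.search(line)
        if match is None:
            continue  # not a generated ddrescue command
        start, size = int(match.group(1)), int(match.group(2))
        blocks.update(range(start // BLOCK_SIZE, (start + size) // BLOCK_SIZE))
    return blocks


suffix = " -N -K 1048576 --cpass=1 /dev/failing-disk disk.img disk.map"
previous = ["ddrescue -i 0 -s 3145728" + suffix]
current = [
    "ddrescue -i 0 -s 4194304" + suffix,         # earlier range grew by one block
    "ddrescue -i 10485760 -s 1048576" + suffix,  # completely new range
]

new_blocks = sorted(covered_blocks(current) - covered_blocks(previous))
print("new blocks:", new_blocks)
```

An empty result means the latest round did not ask for anything that earlier rounds had not already covered.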

  
## trdump-to-args.py
#!/usr/bin/env python3

import logging
import re
import sys

# logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# The commands we are interested in look like this:
# > H=2f00000009000000 C=0x00000000 ( NBD_CMD_READ+NONE) O=0000000829c7d000 L=00020000
pattern = re.compile(r'^> .*NBD_CMD_READ.*O=([0-9a-f]+) L=([0-9a-f]+)$')

# The block size in which we attempt to read areas with ddrescue.
# 1 MiB is chosen somewhat arbitrarily, hoping to achieve a good balance between reading too much
# because the blocks are too large and creating an overly long command list because the blocks are too small.
block_size = 1024*1024

# A list of failing offsets in bytes (can be taken from the -i of the generated ddrescue invocations)
failing_offsets = [
]

failing_blocks = set([ failing_offset // block_size for failing_offset in failing_offsets ])

blocks = set()

for line in sys.stdin:
    match = pattern.match(line)
    if match is None:
        logger.debug("Ignoring line %r, it is not a read.", line)
        continue
    read_offset = int(match.group(1), 16)
    read_length = int(match.group(2), 16)

    logger.debug("Observed read at %x len %x", read_offset, read_length)

    first_block = read_offset // block_size
    last_block = (read_offset + read_length - 1) // block_size

    for block in range(first_block, last_block + 1):
        blocks.add(block)


def command_for_blocks(first_block, last_block):
    logger.debug("Generating command for blocks %d - %d", first_block, last_block)
    print("ddrescue -i %d -s %d -N -K %d --cpass=1 /dev/failing-disk disk.img disk.map" % (
        first_block * block_size,
        (last_block + 1 - first_block) * block_size,
        block_size
    ))

blocks -= failing_blocks

blocks = sorted(blocks)
first_in_range = None
last_block = None
for block in blocks:
    if first_in_range is None:
        # Only in first step
        first_in_range = block
        last_block = block
        continue
    if last_block + 1 == block:
        # Range continues
        last_block = block
        continue
    # Range does not continue. Print it and start a new range at the current block
    command_for_blocks(first_in_range, last_block)
    first_in_range = block
    last_block = block
# Print last range, as it is still ongoing (skipped if the trace contained no reads at all)
if first_in_range is not None:
    command_for_blocks(first_in_range, last_block)