My laptop was crashing repeatedly and I needed to back up the data. I had done an rsync from a couple of important directories on the laptop hard drive to an external drive:
rsync \
--archive \
--compress \
--progress \
--append-verify \
--partial \
--inplace \
/mnt/windows/dev \
/mnt/windows/Users/super \
/mnt/samsung/backup
I used --append-verify --partial --inplace to let me, in theory, safely resume the rsync across crashes for large files while checking the integrity of the partial copies.
I wanted to be sure of the integrity of the data that had been copied, but a simple rsync run with --checksum to compare source and destination files cannot be resumed across crashes.
I wrote a Python script to do this for me, using an SQLite database to store the remaining files to be processed. I didn't want to spend time making the database creation crash-safe, so I just made that part work once, built the database, and then commented it out in the script. That commented-out part takes the list of destination files produced by find /mnt/samsung/backup -type f > /mnt/samsung/files_list and writes them into a database table holding the source file path, the destination file path, and an initial status of TODO.
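A minimal sketch of that (now commented-out) database-creation step might look like the following. The database path, table and column names, and the mapping from a destination path back to its source path are assumptions based on the rsync command above, not the author's actual script:

```python
import sqlite3

# Hypothetical paths, matching the rsync command above.
DB_PATH = "/mnt/samsung/verify.db"
FILES_LIST = "/mnt/samsung/files_list"  # from: find /mnt/samsung/backup -type f

def dst_to_src(dst):
    """Map a backup (destination) path back to its assumed source path."""
    rel = dst.removeprefix("/mnt/samsung/backup/")
    if rel.startswith("super/"):
        return "/mnt/windows/Users/" + rel  # the Users/super tree
    return "/mnt/windows/" + rel           # the dev tree

def build_database(conn, dst_paths):
    """Create the files table and seed every destination path as TODO."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " src TEXT NOT NULL, dst TEXT NOT NULL, status TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO files (src, dst, status) VALUES (?, ?, 'TODO')",
        ((dst_to_src(d), d) for d in dst_paths),
    )
    conn.commit()

if __name__ == "__main__":
    with open(FILES_LIST) as f:
        conn = sqlite3.connect(DB_PATH)
        build_database(conn, (line.rstrip("\n") for line in f if line.strip()))
        conn.close()
```

Keeping the seeding step separate like this makes it easy to run once and then skip on later sessions, which matches the comment-it-out approach described above.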
With the files to process stored in the database, I can query for all files with a status of TODO. For each one, I calculate the SHA-256 hash of the source file and of the destination file and compare them. If they differ, I copy the source file over the destination, recalculate the destination hash, and compare again. If the hashes match, either initially or after the re-copy, I update that row's status to DONE. A file is only marked DONE once it has been properly verified; otherwise it stays TODO, so the script can be run across multiple sessions without issue.
Some of the paths are hard-coded, but I figured this was at least still a useful starting point for someone else in the future... potentially.