Compare hashes of files in two identical directory structures to check data integrity and copy from src to dst if hashes mismatch

My laptop was crashing repeatedly and I needed to back up the data. I had done an rsync from a couple of important directories on the laptop hard drive to an external drive:

rsync \
	--archive \
	--compress \
	--progress \
	--append-verify \
	--partial \
	--inplace \
	/mnt/windows/dev \
	/mnt/windows/Users/super \
	/mnt/samsung/backup

I used --append-verify --partial --inplace so that, in theory, I could safely resume the rsync across crashes for large files while the integrity of the partial copies was checked.

I wanted to be sure of the integrity of the data that had been copied, but simply running rsync with --checksum to compare checksums of the source and destination files cannot be resumed across crashes.

I wrote a Python script to do this for me, using a SQLite database to store the remaining files to be processed. I didn't want to spend time making the database creation safe, so I just made that part work, built the database, and then commented it out in the script. That commented part takes the list of destination files produced by find /mnt/samsung/backup -type f > /mnt/samsung/files_list and writes them into a database table holding the src file path, the dst file path, and an initial status of TODO.

With the files to process stored in the database, the script queries for all files with a status of TODO. For each one it calculates a SHA-256 hash of the src file and the dst file and compares them. If they differ, it copies the src file to the dst path, recalculates the dst hash, and compares again. Only when the hashes match, either initially or after the retry copy, is the database row for that src/dst pair set to DONE; otherwise it stays TODO, so the script can be run across multiple sessions without issue.
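Because the status of every file pair lives in the database, it is also easy to see how far the verification has got between runs. A minimal sketch of such a progress check, assuming the same file_status.db path and files table that the script below uses:

# Hypothetical progress check, not part of the original script: count how many
# file pairs are still TODO versus already verified as DONE.
import sqlite3

connection = sqlite3.connect('/mnt/samsung/file_status.db')
cursor = connection.cursor()
cursor.execute('SELECT status, COUNT(*) FROM files GROUP BY status')
for status, count in cursor.fetchall():
    print(status, count)
connection.close()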

Some of the paths are hard-coded but I figured this was at least still a useful starting point for someone else in the future... potentially.
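If someone does reuse it, the hard-coded base paths in the script below could be taken from the command line instead. A minimal sketch, assuming the script is saved as verify_backup.py (the argument handling is mine, not part of the original):

import sys

# Hypothetical argument handling: pass the source and destination base
# directories instead of hard-coding them.
if len(sys.argv) != 3:
    sys.exit('usage: verify_backup.py SRC_BASE DST_BASE')
src_base, dst_base = sys.argv[1], sys.argv[2]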

#!/usr/bin/env python
from hashlib import sha256
from shutil import copyfile
import sqlite3
from sqlite3 import Error
import sys


def db_connect(db_path):
    # Open (or create) the SQLite database used to track progress.
    try:
        connection = sqlite3.connect(db_path)
    except Error as e:
        print(e)
        sys.exit(1)
    return connection


def db_create_table(connection):
    cursor = connection.cursor()
    cursor.execute('CREATE TABLE IF NOT EXISTS files(src text, dst text, status text)')
    connection.commit()


def db_insert(connection, row):
    cursor = connection.cursor()
    cursor.execute('INSERT INTO files(src, dst, status) VALUES(?, ?, ?)', row)
    connection.commit()


def db_select_todo(connection):
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM files WHERE status = 'TODO'")
    return cursor.fetchall()


def db_update(connection, row):
    cursor = connection.cursor()
    cursor.execute('UPDATE files SET status = ? WHERE src = ? AND dst = ?', row)
    connection.commit()


def sha256_digest(file_path):
    with open(file_path, 'rb') as f:
        return sha256(f.read()).hexdigest()


src_base = '/mnt/windows'
dst_base = '/mnt/samsung'
dst_super = dst_base + '/backup/super'
src_super = src_base + '/Users/super'
dst_dev = dst_base + '/backup/dev'
src_dev = src_base + '/dev'

connection = db_connect(dst_base + '/file_status.db')

# One-off database creation from the find output, commented out once the
# database had been built because this part is not safe to re-run.
#db_create_table(connection)
#with open(dst_base + '/files_list') as f:
#    lines = [line.rstrip('\n') for line in f]
#    last_percent = 0
#    count = 0
#    total = len(lines)
#    for line in lines:
#        if int(100.0 * float(count) / float(total)) > last_percent:
#            last_percent += 1
#            print(last_percent, '% complete')
#        count += 1
#        if line.startswith(dst_super):
#            src = src_super + line[len(dst_super):]
#        elif line.startswith(dst_dev):
#            src = src_dev + line[len(dst_dev):]
#        else:
#            print("WARNING: Unrecognised path: ", line)
#            continue
#        dst = line
#        row = (src, dst, 'TODO')
#        db_insert(connection, row)

pending_tasks = db_select_todo(connection)
last_percent = -1
count = 0
total = len(pending_tasks)
for (src, dst, status) in pending_tasks:
    # Print a rough progress percentage as the work proceeds.
    if int(100.0 * float(count) / float(total)) > last_percent:
        last_percent += 1
        print(last_percent, '% complete - ', count, '/', total)
    count += 1
    # Compare source and destination hashes; on mismatch, re-copy and re-check.
    src_hash = sha256_digest(src)
    dst_hash = sha256_digest(dst)
    if src_hash == dst_hash:
        db_update(connection, ('DONE', src, dst))
    else:
        print("{} {}\n{} {}".format(src_hash, src, dst_hash, dst))
        copyfile(src, dst)
        dst_hash = sha256_digest(dst)
        if src_hash == dst_hash:
            db_update(connection, ('DONE', src, dst))
        else:
            print("WARNING: Failed retry\n{} {}\n{} {}".format(src_hash, src, dst_hash, dst))