Fast duplicate file finder written in Python
#!/usr/bin/env python
"""
Fast duplicate file finder.
Usage: duplicates.py <folder> [<folder>...]

Based on https://stackoverflow.com/a/36113168/300783
Modified for Python3 with some small code improvements.

The script on stackoverflow has a bug which could lead to false positives. This is
fixed here by using the tuple (file_size, hash) as key in the small-hash comparison
dictionary.
"""
import os
import sys
import hashlib
from collections import defaultdict


def chunk_reader(fobj, chunk_size=1024):
    """Generator that reads a file in chunks of bytes"""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk


def get_hash(filename, first_chunk_only=False, hash_algo=hashlib.sha1):
    hashobj = hash_algo()
    with open(filename, "rb") as f:
        if first_chunk_only:
            hashobj.update(f.read(1024))
        else:
            for chunk in chunk_reader(f):
                hashobj.update(chunk)
    return hashobj.digest()


def check_for_duplicates(paths):
    files_by_size = defaultdict(list)
    files_by_small_hash = defaultdict(list)
    files_by_full_hash = dict()

    for path in paths:
        for dirpath, _, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    # if the target is a symlink (soft one), dereference it -
                    # change the value to the actual target file
                    full_path = os.path.realpath(full_path)
                    file_size = os.path.getsize(full_path)
                except OSError:
                    # not accessible (permissions, etc.) - pass on
                    continue
                files_by_size[file_size].append(full_path)

    # For all files with the same file size, get their hash on the first 1024 bytes
    for file_size, files in files_by_size.items():
        if len(files) < 2:
            continue  # this file size is unique, no need to spend cpu cycles on it

        for filename in files:
            try:
                small_hash = get_hash(filename, first_chunk_only=True)
            except OSError:
                # the file access might have changed by the time we got here
                continue
            files_by_small_hash[(file_size, small_hash)].append(filename)

    # For all files with the same hash on the first 1024 bytes, get their hash on
    # the full file - collisions will be duplicates
    for files in files_by_small_hash.values():
        if len(files) < 2:
            # the hash of the first 1k bytes is unique -> skip this file
            continue

        for filename in files:
            try:
                full_hash = get_hash(filename, first_chunk_only=False)
            except OSError:
                # the file access might have changed by the time we got here
                continue

            if full_hash in files_by_full_hash:
                duplicate = files_by_full_hash[full_hash]
                print("Duplicate found:\n - %s\n - %s\n" % (filename, duplicate))
            else:
                files_by_full_hash[full_hash] = filename


if __name__ == "__main__":
    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print("Usage: %s <folder> [<folder>...]" % sys.argv[0])
GreatBahram commented Dec 28, 2019

You don't need to calculate a full hash (line 76), I think it's better to use filecmp.cmp here, as it does a byte-by-byte comparison, which is much more efficient!
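For illustration, a minimal sketch of how that suggestion might look, assuming files is one candidate bucket that already shares the same (file_size, small_hash) key; the helper name group_by_content is made up here. filecmp.cmp(..., shallow=False) forces a byte-by-byte comparison instead of trusting os.stat() signatures.

    import filecmp

    def group_by_content(files):
        """Split one candidate bucket into groups of byte-identical files."""
        groups = []  # each group is a list of files with identical content
        for filename in files:
            for group in groups:
                # shallow=False forces a byte-by-byte comparison instead of
                # relying on os.stat() signatures
                if filecmp.cmp(group[0], filename, shallow=False):
                    group.append(filename)
                    break
            else:
                groups.append([filename])
        return [group for group in groups if len(group) > 1]

Note that within a bucket this still compares each file against one representative of every group found so far, so it can degrade to quadratic work when many candidates differ.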

tfeldmann (Owner) commented Dec 30, 2019

Yes, you're right that filecmp.cmp is faster for comparing two files. The problem is that we are comparing lots and lots of files, and calculating and caching the hashes gives us O(n) complexity.
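A back-of-the-envelope illustration of that point (assumed numbers, not a benchmark): for a bucket of n candidate files, hashing reads each file once, while a naive pairwise comparison touches every pair in the worst case.

    n = 1000                                # files sharing size and small hash
    full_hash_reads = n                     # one full read per file, then dict lookups
    pairwise_worst_case = n * (n - 1) // 2  # 499500 byte-by-byte comparisons
    print(full_hash_reads, pairwise_worst_case)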

romainjouin commented Jun 14, 2020

All of this would be nicely enhanced with some multi-CPU parallelism where possible.

tfeldmann (Owner) commented Jun 15, 2020

I guess this is more IO-bound than CPU-bound, so I don't think it would benefit much from multiple CPUs... I would love to be proven wrong!

romainjouin commented Jun 15, 2020

I recently had a task I thought was IO-bound: I had to read 400K files, but when I multi-threaded it across 8 threads, performance definitely improved, by a factor I would guess of at least 2 or 3. As there may be millions of files here, I think it could be worthwhile to test.
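A minimal sketch of what threading the small-hash pass might look like, assuming the get_hash() function from the gist above and a list of (file_size, filename) candidates whose size is not unique; the helper name small_hashes_parallel is made up here. Whether it actually helps depends on how IO-bound the storage really is.

    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor

    def small_hashes_parallel(candidates, max_workers=8):
        """candidates: iterable of (file_size, filename) pairs whose size is not unique."""
        def safe_small_hash(item):
            file_size, filename = item
            try:
                # get_hash() is the function defined in the gist above
                return (file_size, get_hash(filename, first_chunk_only=True)), filename
            except OSError:
                return None, filename  # unreadable file, skipped below

        buckets = defaultdict(list)
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for key, filename in pool.map(safe_small_hash, candidates):
                if key is not None:
                    buckets[key].append(filename)
        return buckets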

tminakov commented Jun 15, 2020

The original SO poster here 👋.
I would argue with this comment:

The script on stackoverflow has a bug which could lead to false positives.

It's not a bug, but sub-optimal execution - the small hashes are only used as input for the 2nd pass, not for the final result; i.e. they don't pop up to the user, they just increase the runtime.


A couple of comments on the gist:

  • there are bugs hidden in the exception handling - if an exception does occur, the value that was supposed to be assigned inside the try block is still used outside of it; after all, this was a hacky script for an SO answer :) (see the sketch after this list)
  • though not deprecated (and probably never going to be), printf-style formatting with %s is not the now-usual Python 3 way (f-strings).
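
A hypothetical illustration of the stale-variable pattern the first bullet describes (not the actual SO code; files and get_hash() are assumed to be as in the gist above, and hashes_on_1k is a made-up index name):

    from collections import defaultdict

    hashes_on_1k = defaultdict(list)
    for filename in files:  # files and get_hash() as in the gist above
        try:
            small_hash = get_hash(filename, first_chunk_only=True)
        except OSError:
            pass  # without a continue here, small_hash keeps the value from the
                  # previous iteration (or is undefined on the first one) ...
        hashes_on_1k[small_hash].append(filename)  # ... and is still used here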