-
-
Save tfeldmann/fc875e6630d11f2256e746f67a09c1ae to your computer and use it in GitHub Desktop.
#!/usr/bin/env python | |
""" | |
Fast duplicate file finder. | |
Usage: duplicates.py <folder> [<folder>...] | |
Based on https://stackoverflow.com/a/36113168/300783 | |
Modified for Python3 with some small code improvements. | |
""" | |
import os | |
import sys | |
import hashlib | |
from collections import defaultdict | |
def chunk_reader(fobj, chunk_size=1024): | |
""" Generator that reads a file in chunks of bytes """ | |
while True: | |
chunk = fobj.read(chunk_size) | |
if not chunk: | |
return | |
yield chunk | |
def get_hash(filename, first_chunk_only=False, hash_algo=hashlib.sha1): | |
hashobj = hash_algo() | |
with open(filename, "rb") as f: | |
if first_chunk_only: | |
hashobj.update(f.read(1024)) | |
else: | |
for chunk in chunk_reader(f): | |
hashobj.update(chunk) | |
return hashobj.digest() | |
def check_for_duplicates(paths): | |
files_by_size = defaultdict(list) | |
files_by_small_hash = defaultdict(list) | |
files_by_full_hash = dict() | |
for path in paths: | |
for dirpath, _, filenames in os.walk(path): | |
for filename in filenames: | |
full_path = os.path.join(dirpath, filename) | |
try: | |
# if the target is a symlink (soft one), this will | |
# dereference it - change the value to the actual target file | |
full_path = os.path.realpath(full_path) | |
file_size = os.path.getsize(full_path) | |
except OSError: | |
# not accessible (permissions, etc) - pass on | |
continue | |
files_by_size[file_size].append(full_path) | |
# For all files with the same file size, get their hash on the first 1024 bytes | |
for file_size, files in files_by_size.items(): | |
if len(files) < 2: | |
continue # this file size is unique, no need to spend cpu cycles on it | |
for filename in files: | |
try: | |
small_hash = get_hash(filename, first_chunk_only=True) | |
except OSError: | |
# the file access might've changed till the exec point got here | |
continue | |
files_by_small_hash[(file_size, small_hash)].append(filename) | |
# For all files with the hash on the first 1024 bytes, get their hash on the full | |
# file - collisions will be duplicates | |
for files in files_by_small_hash.values(): | |
if len(files) < 2: | |
# the hash of the first 1k bytes is unique -> skip this file | |
continue | |
for filename in files: | |
try: | |
full_hash = get_hash(filename, first_chunk_only=False) | |
except OSError: | |
# the file access might've changed till the exec point got here | |
continue | |
if full_hash in files_by_full_hash: | |
duplicate = files_by_full_hash[full_hash] | |
print("Duplicate found:\n - %s\n - %s\n" % (filename, duplicate)) | |
else: | |
files_by_full_hash[full_hash] = filename | |
if __name__ == "__main__": | |
if sys.argv[1:]: | |
check_for_duplicates(sys.argv[1:]) | |
else: | |
print("Usage: %s <folder> [<folder>...]" % sys.argv[0]) |
can you explain more how to do it ?
Install python 3, save this gist as duplicates.py
and run the given command in a terminal of your choice.
thanks @tfeldmann!
In case anyone else stumbles here - in my fork I have enabled automated hardlinking of duplicates. This is not something you want to do lightly as it can mess up your system, but it is what I was looking for to help reduce the size of my backups using rsnapshot (which does not handle moving files very well).
For you Thomas maybe a small improvement to the 'base' gist is grouping all files together using the true_duplicates
dict, rather than just printing them out one by one.
Hello, I understand your code and it is very well and you did the great work, but I have little bit problem with it like,
path = askdirectory(title='Select a directory')
if I want to ask from user to select path folder as I mentioned the script so, for this kind of solution where and how can I use that script.
@abidkhan03 You can use something like Gooey to create a simple GUI for this script
thanks @tfeldmann!
In case anyone else stumbles here - in my fork I have enabled automated hardlinking of duplicates. This is not something you want to do lightly as it can mess up your system, but it is what I was looking for to help reduce the size of my backups using rsnapshot (which does not handle moving files very well).
@jcjveraa random question about the functionality of what you describe as "hardlinking" a duplicate.
I've read through the code and somewhere between line 120 to 146 of well written code I'm having trouble keeping it all in my head. You had me until you started mixing a hash
with an inode
or file_node
it goes fuzzy for me. (non CS major) =)
Can you tell me more?
- Are we making indexes that reference hard file paths?
- Are we making indexes of indexes for speeding up future searches
- Are we creating a 'never move or delete hash' that breaks if modified?
Hi there @datatalking - maybe the confusion starts at what hard links are?
To avoid wrong impressions by the way - the code I added wouldn't be acceptable by my standards if it wasn't a gist (ie a proof of concept / handy tool), not "well written" on my part ;-) my professional code is a lot more readable.
To your question:
A harddisk/ssd in and of itself doesn't contain files, but raw data (bytes). The "filesystem", like ext4 or ntfs or fat32, that we put on top of this disk, helps to assign meaning to sets of bytes saying "a certain file exists, and this file is stored at this and this 'block (of bytes)'". If you search online you'll find that these inode-things have role to play there. Think of this as the "path" of a file, or a hard "link" to the file, like www.github.com is a link to this site.
In case we find there are two files that exist, ie we have two separate hard links (which is the same - a file "exists" because there is at least one hard link to it, else it is simply unreachable data on a hard disk), but we find out by computing the hash of the file that these two files are effectively identical (e.g. you just copied a file using copy-paste in your windows system), my addition will "unlink" (delete) one of the files and replace it with a hard link.
Now the copied file and the original both refer to the same set of bytes on the disk, rather than a real copy of the bytes. This means there is less disk space used. It also means that changing the original also changes the 'copy', as it is no longer a true copy but a reference to the same bytes on disk as the original. That can be not at all what you want, hence this is a little dangerous.
In my usecase though this is exactly what I need as it helps a lot in keeping the size of "read only" backups down. It is easy to "undo" the hard linking when restoring the backup if that is required.
Note: a hard link differs from a "soft link" (or symlink / windows shortcut) in that as long as either of the "files" (= at least one hard link) exists, the data will remain on the disk.
The data is "deleted" when the last hard link to it is 'unlinked'. That fact is desired in my use case - soft links could mean I lose data in the backups if I would delete some file, as in that case deleting the original (unlinking the hard link) results in a broken shortcut and the data would be gone. With multiple hardlinks such as my version of the script creates, the data will be preserved until it is truly no longer required. Google can tell you more on this :-)
Note, when I say that hardlinking can be undone this is "... by an expert user and then still only to a certain extent". Noting comes for free, and what you will lose/need to restore via other means (if required) is file metadata including file access permissions. If like in my case this is not an issue, then my method works, but your mileage may vary :-)
You run this file with
python3 duplicates.py folder1 path/to/folder2 folder3
and so on.