
@vinovator
Last active May 17, 2024 09:13
Python script to find duplicate files from a folder
# checkDuplicates.py
# Python 2.7.6
"""
Given a folder, walk through all files within the folder and subfolders
and get a list of all files that are duplicates.
The md5 checksum of each file determines the duplicates.
"""
import os
import hashlib
from collections import defaultdict
import csv

src_folder = "../../"


def generate_md5(fname, chunk_size=1024):
    """
    Function which takes a file name and returns md5 checksum of the file
    """
    # Named md5_hash rather than hash to avoid shadowing the builtin
    md5_hash = hashlib.md5()
    with open(fname, "rb") as f:
        # Read the 1st block of the file
        chunk = f.read(chunk_size)
        # Keep reading the file until the end and update hash
        while chunk:
            md5_hash.update(chunk)
            chunk = f.read(chunk_size)
    # Return the hex checksum
    return md5_hash.hexdigest()


if __name__ == "__main__":
    """
    Starting block of script
    """
    # The dict will have a list as values
    md5_dict = defaultdict(list)
    file_types_inscope = ["ppt", "pptx", "pdf", "txt", "html",
                          "mp4", "jpg", "png", "xls", "xlsx", "xml",
                          "vsd", "py", "json"]
    # Walk through all files and folders within directory
    for path, dirs, files in os.walk(src_folder):
        print("Analyzing {}".format(path))
        for each_file in files:
            # Only files whose extension is in scope are hashed
            if each_file.split(".")[-1].lower() in file_types_inscope:
                # The path variable gets updated for each subfolder
                file_path = os.path.join(os.path.abspath(path), each_file)
                # If there are more files with same checksum append to list
                md5_dict[generate_md5(file_path)].append(file_path)
    # Identify keys (checksum) having more than one values (file names)
    duplicate_files = (
        val for key, val in md5_dict.items() if len(val) > 1)
    # Write the list of duplicate files to csv file
    with open("duplicates.csv", "w") as log:
        # Lineterminator added for windows as it inserts blank rows otherwise
        csv_writer = csv.writer(log, quoting=csv.QUOTE_MINIMAL, delimiter=",",
                                lineterminator="\n")
        header = ["File Names"]
        csv_writer.writerow(header)
        # Each row holds all the file paths that share one checksum
        for file_names in duplicate_files:
            csv_writer.writerow(file_names)
    print("Done")
@moeabdol

Thanks

@datatalking

Your code is beautifully written. Are you the original author?

@vinovator
Author

Your code is beautifully written, are you the original author?

Thanks. Of course I am the original author. But there is nothing novel about the libraries used or the logic.

@datatalking

datatalking commented Jul 28, 2021 via email

@ricky-andre

ricky-andre commented Dec 23, 2021

Quite elegant, but if you have videos this approach would take a lot of time, too much really. Some ways to improve things:

  • you should first loop over the files to record each file's length, and compute the hash only for those files that have the SAME length
  • the above is of course no guarantee that the files are identical, so you could make a couple of spot checks on parts of the files (for example, you could calculate the hash of only the first 16 KB and check whether it matches)
  • if the previous steps did not discriminate, you really need to compute the hash of the full file, and since this is a potentially time-consuming process, you could store the result in a text file so that next time you won't need to compute it again. Maybe a JSON file to store and retrieve the data
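
The three steps above could be sketched roughly as follows. This is only an illustration of the idea, not the code from either gist: the names `md5_of`, `group_by`, `find_duplicates` and the `head_bytes` parameter are made up for this example, and it hashes every file type rather than filtering by extension.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, limit=None, chunk_size=65536):
    """Return the md5 hex digest of a file, or of its first `limit` bytes."""
    digest = hashlib.md5()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            size = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = f.read(size)
            if not chunk:
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def group_by(paths, key):
    """Group paths by key(path); keep only groups with more than one member."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(root, head_bytes=16 * 1024):
    """Staged search: same size, then same partial hash, then same full hash."""
    all_files = [
        os.path.join(dirpath, name)
        for dirpath, _dirs, names in os.walk(root)
        for name in names
    ]
    duplicates = []
    # Stage 1: only files with the same length can be duplicates
    for same_size in group_by(all_files, os.path.getsize):
        # Stage 2: cheap check on the first head_bytes bytes
        for same_head in group_by(same_size, lambda p: md5_of(p, head_bytes)):
            # Stage 3: full-file hash, only for the remaining candidates
            duplicates.extend(group_by(same_head, md5_of))
    return duplicates
```

Calling `find_duplicates("some/folder")` returns a list of lists, each holding the paths that share one full-file checksum. Unreadable files or broken symlinks would raise an error here; a real script would probably want to catch `OSError` and skip them.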

I will try to do the above by myself, but check this out:

https://gist.github.com/tfeldmann/fc875e6630d11f2256e746f67a09c1ae

@Youssef-DS

brilliant!

@tmb55

tmb55 commented Aug 24, 2023

I'm running the script and purposely created duplicates. Nothing is being written to the duplicates.csv file. Any suggestions?
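
The thread does not say what went wrong here. One possible cause, going only by the script itself, is that the test files sit outside `src_folder` or have an extension that is not in `file_types_inscope`, in which case they are never hashed. A small diagnostic along these lines might help narrow it down; `scan_summary` is a hypothetical helper written for this example, reusing the same extension test as the script.

```python
import os
from collections import Counter


def scan_summary(src_folder, file_types_inscope):
    """Count files under src_folder by extension, split into in-scope and skipped."""
    in_scope = Counter()
    skipped = Counter()
    for path, dirs, files in os.walk(src_folder):
        for each_file in files:
            # Same extension test the original script applies
            ext = each_file.split(".")[-1].lower()
            if ext in file_types_inscope:
                in_scope[ext] += 1
            else:
                skipped[ext] += 1
    return in_scope, skipped
```

If the duplicates created for the test show up under `skipped`, or do not show up at all, the script never compared them.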

@datatalking

@tmb55 @Youssef-DS @ricky-andre @vinovator I'm somehow only now getting notifications for this thread, and I'm reading through ricky-andre's link to https://gist.github.com/tfeldmann/fc875e6630d11f2256e746f67a09c1ae from above.

Seeing as the code was written for Python 2.7.6, is it already in a repo somewhere where we can submit PRs?

If not, I'll start one, as I've added features to be able to connect a few tools to @vinovator's original.

@ricky-andre

ricky-andre commented Aug 26, 2023

Hi all, quite strange that this thread became 'live' again. Check out the following:

https://github.com/ricky-andre/Python-duplicate-files-finder/blob/main/find_duplicates.py

@datatalking

@ricky-andre can you give us a 'CliffsNotes' summary of the difference between that code and this one?

@ricky-andre

@ricky-andre can you give us a 'CliffsNotes' summary of the difference between that code and this one?

The script in my repository finds duplicates using the approach described above:

  • check the file's length
  • given that two files have the same length, check the md5 of the first 16 KB of data
  • if they still look the same, calculate the md5 of the whole files (a long task, since the whole file needs to be read)

The calculated md5 hashes are saved in a text file. Of course, other things could go wrong or be improved (e.g. the text file could be encrypted, checked for integrity ...), but I've tested it on my HDD and it's really efficient and fast. For someone's personal use, it's very good.
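
As a rough illustration of saving hashes between runs, here is a minimal sketch that keeps them in a JSON file. It is not taken from the linked repository: the function names are invented for this example, and treating a cached hash as still valid when the file's size and modification time are unchanged is an assumption of this sketch.

```python
import hashlib
import json
import os


def load_cache(cache_path):
    """Load the hash cache, or start with an empty one if the file is missing."""
    if os.path.exists(cache_path):
        with open(cache_path, "r") as f:
            return json.load(f)
    return {}


def save_cache(cache, cache_path):
    """Write the hash cache back to disk as JSON."""
    with open(cache_path, "w") as f:
        json.dump(cache, f, indent=2)


def cached_md5(path, cache, chunk_size=65536):
    """Return the md5 of path, reusing the cached value if size and mtime match."""
    stat = os.stat(path)
    key = os.path.abspath(path)
    entry = cache.get(key)
    if entry and entry["size"] == stat.st_size and entry["mtime"] == stat.st_mtime:
        return entry["md5"]
    # Cache miss or stale entry: read the whole file and hash it
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    cache[key] = {
        "size": stat.st_size,
        "mtime": stat.st_mtime,
        "md5": digest.hexdigest(),
    }
    return cache[key]["md5"]
```

A run would call `load_cache` once at the start, `cached_md5` for each candidate file, and `save_cache` once at the end, so that large files which have not changed are not read again.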

@tmb55

tmb55 commented Aug 27, 2023 via email

@datatalking

@ricky-andre thank you!
