@SphinxKnight
Created March 11, 2022 17:07
Check for duplicated files in mdn/translated-content
import os
import hashlib

ref_locale = 'en-us'
ref_path = 'content/files/' + ref_locale + '/'
locale = 'fr'
locale_path = 'translated-content/files/' + locale + '/'

# Map each reference asset (path relative to the locale root) to its SHA-256 hash.
dict_files = {}
for r, d, f in os.walk(ref_path):
    for file in f:
        # Skip the documents themselves; only images and other assets are compared.
        if not file.endswith(('.md', '.html')):
            full_path = os.path.join(r, file)
            with open(full_path, "rb") as file_b:
                content = file_b.read()
            # relpath instead of split(ref_locale)[1]: the locale string may
            # appear again further down the path and truncate the slug.
            file_slug = os.path.relpath(full_path, ref_path)
            dict_files[file_slug] = hashlib.sha256(content).hexdigest()
# print(dict_files)

# Print every localized asset that is byte-identical to its reference copy.
spared_size = 0
for r, d, f in os.walk(locale_path):
    for file in f:
        if not file.endswith(('.md', '.html')):
            full_path = os.path.join(r, file)
            file_slug = os.path.relpath(full_path, locale_path)
            if file_slug in dict_files:
                with open(full_path, "rb") as file_b:
                    content = file_b.read()
                locale_file_hash = hashlib.sha256(content).hexdigest()
                if locale_file_hash == dict_files[file_slug]:
                    spared_size = spared_size + os.path.getsize(full_path)
                    print(full_path)
# Total size, in bytes, of the duplicated files.
print(spared_size)

caugner commented Dec 6, 2022

@SphinxKnight Thank you for this script! 🙏

Would you be interested in extending your script so that it does more than compare the SHA-256 hashes? For images with matching file extensions, it could compare their height and width and, if those are equal, their visual similarity, e.g. using the Structural Similarity Index (SSIM) or the Mean Squared Error (MSE). This would account for the fact that we usually compress images before checking them in (using yarn filecheck), so a recompressed copy no longer has the same hash as the original.
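To make the suggestion concrete, here is a minimal sketch of the MSE variant, not something the script currently does. It assumes both images have already been decoded into flat sequences of grayscale values (0 to 255) by an image library, which is not shown here. The names `mean_squared_error` and `looks_duplicated` and the `max_mse` threshold are hypothetical and would need tuning on real files.

```python
from typing import Sequence, Tuple

Size = Tuple[int, int]  # (width, height)


def mean_squared_error(a: Sequence[int], b: Sequence[int]) -> float:
    """Average squared difference between two equally long pixel sequences."""
    if len(a) != len(b):
        raise ValueError("pixel sequences must have the same length")
    if not a:
        raise ValueError("pixel sequences must not be empty")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def looks_duplicated(
    size_a: Size,
    pixels_a: Sequence[int],
    size_b: Size,
    pixels_b: Sequence[int],
    max_mse: float = 10.0,  # hypothetical threshold, not taken from the gist
) -> bool:
    """Return True when two decoded images are visually close enough."""
    # Images with different dimensions are never treated as duplicates.
    if size_a != size_b:
        return False
    return mean_squared_error(pixels_a, pixels_b) <= max_mse
```

An MSE of 0 means the decoded pixels are identical; small values are what one would expect when one copy was merely recompressed.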

@SphinxKnight (Author)

I wrote this one as a "one time" thing, so I'm not really looking to add new features to it. If you need me to create a repo for it, so that it can be forked/extended, I can do that.
