Created
March 20, 2015 16:34
Python SequenceMatcher for finding a similarly-named file
"""
Here's a cute example of using difflib, from Python's standard library, to find the file with the closest-matching name
"""
from difflib import SequenceMatcher
# Suppose we have some files (databases here) with a certain naming scheme.
db_files = ['out_NDWFS_1425+3254_J_db.hdf5', 'out_NDWFS_1425+3254_H_db.hdf5']
# We also have several other files (model definitions here) that follow a similar naming scheme.
py_files = ['model_NDWFS_1425+3254_J.py', 'model_NDWFS_1425+3254_H.py']
for db_file in db_files:
    # Set up a function that creates a SequenceMatcher against db_file, then returns the similarity ratio.
    # Note that the a= and b= keywords are important: the first positional parameter of SequenceMatcher
    # is isjunk (a function that flags "junk" elements to ignore), not one of the sequences to compare.
    similar_score = lambda x: SequenceMatcher(a=db_file, b=x).ratio()
    # Now sort the py_files list in place, using similarity ratio against db_file as the sort key
    py_files.sort(key=similar_score)
    # The best-matching filename is now the last element of the sorted list
    model_file = py_files[-1]
    print('{} matches {}'.format(model_file, db_file))
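The loop above sorts `py_files` in place on every iteration, although only the single best match is needed. A minimal sketch of two alternatives follows. The helper names `closest_name` and `closest_name_with_cutoff` are hypothetical and not part of the original gist. The first picks the best candidate with `max()` and leaves the list untouched. The second uses `difflib.get_close_matches`, which also rejects candidates scoring below a similarity cutoff (0.6 by default).

```python
from difflib import SequenceMatcher, get_close_matches

db_files = ['out_NDWFS_1425+3254_J_db.hdf5', 'out_NDWFS_1425+3254_H_db.hdf5']
py_files = ['model_NDWFS_1425+3254_J.py', 'model_NDWFS_1425+3254_H.py']


def closest_name(target, candidates):
    """Hypothetical helper: return the candidate most similar to target.

    Uses max() with the similarity ratio as the key, so the candidates
    list is not reordered.
    """
    return max(candidates,
               key=lambda name: SequenceMatcher(a=target, b=name).ratio())


def closest_name_with_cutoff(target, candidates, cutoff=0.6):
    """Hypothetical helper: like closest_name, but return None when no
    candidate reaches the similarity cutoff."""
    matches = get_close_matches(target, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for db_file in db_files:
    model_file = closest_name(db_file, py_files)
    print('{} matches {}'.format(model_file, db_file))
```

The cutoff variant is worth considering when some files may have no counterpart at all. Plain `max()` always returns something, even if the best score is very low.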