@algal
Created July 28, 2020 21:01
Read file paths, names, hashes into a data frame
from fastai2.vision.all import *  # provides L (fastcore's list type) and Path
import pandas as pd

def readMD5file(md5path: Path) -> pd.DataFrame:
    """
    Read md5sum output into a DataFrame with columns hash, path, and fname.

    Generate the MD5 output file with a search like:
        find /home/jupyter/data/foldersToAdd/ -iname '*jpg' -print0 | xargs -0 -n 100 md5sum >> /home/jupyter/data/foldersToAdd.md5.out
    Then read it with this function to check for name uniqueness, path uniqueness, etc.
    """
    with open(str(md5path), 'r') as f:
        # md5sum writes "<hash>  <path>" with a two-character separator; split
        # only once so that paths containing spaces stay intact.
        lines = L(f.read().split('\n')).map(lambda line: tuple(line.split('  ', 1))).filter(lambda t: len(t) == 2)
    lines.sort()
    dff = pd.DataFrame(list(lines), columns=['hash', 'path'])
    dff['fname'] = dff['path'].map(lambda p: Path(p).name)
    return dff
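
The docstring mentions checking name and path uniqueness but does not show how. The following is a minimal sketch of such checks, assuming a frame with the same three columns (hash, path, fname) that readMD5file returns. The sample rows are hypothetical and stand in for real md5sum output so the block runs on its own.

```python
import pandas as pd

# Hypothetical sample rows in the shape readMD5file returns: hash, path, fname.
dff = pd.DataFrame(
    [
        ("aaa111", "/data/a/cat.jpg", "cat.jpg"),
        ("aaa111", "/data/b/cat_copy.jpg", "cat_copy.jpg"),
        ("bbb222", "/data/b/cat.jpg", "cat.jpg"),
        ("ccc333", "/data/c/dog.jpg", "dog.jpg"),
    ],
    columns=["hash", "path", "fname"],
)

# Rows whose content hash appears more than once (byte-identical files).
same_content = dff[dff.duplicated("hash", keep=False)]

# Rows whose file name appears more than once (content may still differ).
same_name = dff[dff.duplicated("fname", keep=False)]

# Paths should be unique if each file was hashed only once.
paths_unique = dff["path"].is_unique

print(same_content[["hash", "path"]])
print(same_name[["fname", "path"]])
print("paths unique:", paths_unique)
```

Passing keep=False marks every member of a duplicate group, not just the later ones, which makes it easier to see which files collide with which.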