@sergiolucero
Last active April 12, 2023 13:49
pdf legal conversion
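import glob

import fitz  # provided by the PyMuPDF package
import pandas as pd

# NOTE: remove_headandsentence and extract_fallo are not defined in this gist;
# they must be defined or imported before this script runs.
files = glob.glob('folder/*.pdf')
texts = []
for fn in files:
    with fitz.open(fn) as doc:  # close each PDF once its text has been read
        texts.append(' '.join(page.get_text() for page in doc))

df = pd.DataFrame(dict(file=files, text=texts))
df['cuerpo'] = df.text.apply(remove_headandsentence)  # body of each document
df['fallo'] = df.text.apply(extract_fallo)  # ruling of each document
df.to_csv('sentencias.csv', index=False)
print(sum(len(txt) for txt in texts))  # total number of characters extracted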
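
The script calls two helpers, `remove_headandsentence` and `extract_fallo`, that the gist does not define. The following is only a hypothetical sketch of what they might look like, not the author's implementation. It assumes the ruling is introduced by a marker phrase (`SE RESUELVE` here, which is a guess) and that the header is the first line of the document; both assumptions would need adjusting to the real PDFs.

```python
import re

# Hypothetical marker: the real documents may introduce the ruling differently.
FALLO_MARKER = re.compile(r'SE\s+RESUELVE\s*:?', re.IGNORECASE)


def extract_fallo(text: str) -> str:
    """Return the text after the last ruling marker, or '' if none is found."""
    matches = list(FALLO_MARKER.finditer(text))
    if not matches:
        return ''
    return text[matches[-1].end():].strip()


def remove_headandsentence(text: str) -> str:
    """Return the body: text before the last ruling marker, minus the first line."""
    matches = list(FALLO_MARKER.finditer(text))
    body = text[:matches[-1].start()] if matches else text
    # Assumption: the header occupies exactly the first line of the document.
    _, _, rest = body.partition('\n')
    # Fall back to the whole body when there is nothing after the first line.
    return (rest or body).strip()
```

Both functions take a single string and return a string, so they can be passed directly to `df.text.apply(...)`. Using the last marker match rather than the first is a design guess: it avoids cutting the body short if the phrase is quoted earlier in the text.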
@sergiolucero

pip install PyMuPDF
