@abehmiel
Created November 1, 2017 21:12
Convert tabular pdf data to a csv and also read it as a python dataframe
# It's really stupid when the gov't releases PDFs of tabular data. So I made a quick, hacky script to
# fix their mistakes for them. (I'm referring to https://t.co/oOyhHNVvjS )
# requirements:
# pandas
# tabula-py
import pandas as pd
from tabula import read_pdf
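# read the pdf -- the table comes out all messed up, as a single space-delimited column.
# also, read_pdf defaults to loading only the first page unless you specify pages='all',
# a different int, or a list.
tables = read_pdf("exhibit_b.pdf", pages='all')
# newer tabula-py releases return a list of DataFrames (one per table); older ones return a single DataFrame
df = pd.concat(tables, ignore_index=True) if isinstance(tables, list) else tables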
# fix the columns
# split once on whitespace: the first token is the user id, the rest is the handle
df[['user id', 'handle']] = df['user id handle'].str.split(n=1, expand=True)
df = df.drop('user id handle', axis=1)
# output to csv
df.to_csv('exhibit_b.csv', index=False)
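
The column fix can be tried without the PDF or tabula installed. The following is a minimal sketch, assuming the extracted frame has a single 'user id handle' column as in the script; the two sample rows and the helper name split_id_and_handle are made up for illustration and are not from the actual exhibit. It also round-trips the result through an in-memory CSV, reading everything back as strings so the ids are not converted to integers.

```python
import io

import pandas as pd

# Hypothetical rows standing in for what tabula extracts from the PDF.
raw = pd.DataFrame({"user id handle": ["1001 example_one", "1002 example_two"]})


def split_id_and_handle(frame, source="user id handle"):
    """Split the combined column into 'user id' and 'handle' (illustrative helper)."""
    # n=1 splits only on the first run of whitespace.
    parts = frame[source].str.split(n=1, expand=True)
    out = frame.drop(columns=source)
    out["user id"] = parts[0]
    out["handle"] = parts[1]
    return out


clean = split_id_and_handle(raw)

# Write to an in-memory CSV and read it back, keeping every column as text.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, dtype=str)
```

Passing dtype=str on the way back in is a choice made for this sketch: it keeps long numeric ids as text so they survive the round trip unchanged.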