@rjurney
Created April 7, 2019 21:30
Load Gzipped JSON Lines generated by Spark into Pandas
import pandas as pd
import numpy as np
import glob
pd.set_option('display.max_columns', 500)
all_files = glob.glob('../data/patent_applications/2019-04-07.jsonl.gz/part-*.json.gz')
li = []
for filename in all_files:
    df = pd.read_json(
        filename,
        lines=True,
        compression='gzip'
    )
    li.append(df)
patents = pd.concat(li, axis=0, ignore_index=True)
patents['patent_index'] = patents.index
print('Patent records: {:,}'.format(len(patents)))
patents = patents[['patent_index', 'application_id', 'app_date', 'title', 'abstract', 'description', 'granted']]
patents.head(5)
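
The loading loop above can also be wrapped in a small helper, shown here as a minimal sketch rather than the gist's own method. The function name `load_jsonl_parts`, the temporary directory, and the sample records are all hypothetical and exist only so the example runs without the patent data. Two small changes from the snippet above are assumptions about what is useful: the matched paths are sorted so row order is reproducible, and an empty match raises a clear error instead of letting `pd.concat` fail on an empty list.

```python
import glob
import gzip
import json
import os
import tempfile

import pandas as pd


def load_jsonl_parts(pattern):
    """Load every gzipped JSON Lines part file matching `pattern` into one DataFrame."""
    # glob.glob returns paths in arbitrary order; sort for reproducible row order
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError('No files match {!r}'.format(pattern))
    frames = [
        pd.read_json(path, lines=True, compression='gzip')
        for path in paths
    ]
    return pd.concat(frames, axis=0, ignore_index=True)


# Hypothetical sample data: two small part files written to a temp directory,
# imitating the part-*.json.gz layout that Spark produces
tmp_dir = tempfile.mkdtemp()
sample_parts = [
    [{'application_id': 1, 'title': 'Widget'},
     {'application_id': 2, 'title': 'Gadget'}],
    [{'application_id': 3, 'title': 'Gizmo'}],
]
for i, records in enumerate(sample_parts):
    part_path = os.path.join(tmp_dir, 'part-{:05d}.json.gz'.format(i))
    with gzip.open(part_path, 'wt', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')

sample = load_jsonl_parts(os.path.join(tmp_dir, 'part-*.json.gz'))
print('Sample records: {:,}'.format(len(sample)))
```

For the real data, the same helper would be called with the glob pattern used above in place of the temporary path.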