@ilbaroni
Forked from rjurney/pandas.py
Created November 12, 2021 12:12
Load Gzipped JSON Lines generated by Spark into Pandas
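import glob

import numpy as np
import pandas as pd

# Show up to 500 columns when displaying a DataFrame
pd.set_option('display.max_columns', 500)

# Spark writes its output as a directory of gzipped part files
all_files = glob.glob('../data/patent_applications/2019-04-07.jsonl.gz/part-*.json.gz')

# Read each part file into its own DataFrame
li = []
for filename in all_files:
    df = pd.read_json(
        filename,
        lines=True,
        compression='gzip'
    )
    li.append(df)

# Combine the parts into one DataFrame with a fresh 0..n-1 index
patents = pd.concat(li, axis=0, ignore_index=True)
patents['patent_index'] = patents.index
print('Patent records: {:,}'.format(len(patents)))

# Keep only the columns of interest, with patent_index first
patents = patents[['patent_index', 'application_id', 'app_date', 'title', 'abstract', 'description', 'granted']]
patents.head(5)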
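
The same load-and-concatenate pattern can be wrapped in a small helper. The sketch below is one possible variant and not part of the original gist. The function name `load_jsonl_parts`, the temporary directory, and the sample records are all made up for illustration. It sorts the matched paths so the resulting row order is reproducible, and it raises a clear error when the glob matches nothing, since `pd.concat` fails on an empty list.

```python
import glob
import gzip
import json
import os
import tempfile

import pandas as pd


def load_jsonl_parts(pattern):
    """Load every gzipped JSON Lines file matching `pattern` into one DataFrame.

    Hypothetical helper, not from the original gist.
    """
    # glob.glob returns paths in arbitrary order, so sort for a stable row order
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError('No files match {!r}'.format(pattern))
    frames = [pd.read_json(p, lines=True, compression='gzip') for p in paths]
    return pd.concat(frames, axis=0, ignore_index=True)


# Write two tiny part files with made-up records so the helper can be tried
# without the real patent data.
tmp_dir = tempfile.mkdtemp()
sample_parts = [
    [{'application_id': 'A1', 'granted': True},
     {'application_id': 'A2', 'granted': False}],
    [{'application_id': 'A3', 'granted': True}],
]
for i, records in enumerate(sample_parts):
    path = os.path.join(tmp_dir, 'part-{:05d}.json.gz'.format(i))
    with gzip.open(path, 'wt', encoding='utf-8') as fh:
        for record in records:
            fh.write(json.dumps(record) + '\n')

demo = load_jsonl_parts(os.path.join(tmp_dir, 'part-*.json.gz'))
print('Records: {:,}'.format(len(demo)))
```

Pointing the helper at the gist's own path would look like `load_jsonl_parts('../data/patent_applications/2019-04-07.jsonl.gz/part-*.json.gz')`, assuming that directory exists on your machine.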