@eliasdabbas
Created July 16, 2023 11:01
Convert a jsonlines file to a compressed parquet file. If a column holds JSON values of different types (e.g. a list in one row and a scalar in another), that column is converted to strings.
import re

import pandas as pd


def jl_to_parquet(jl_filepath, parquet_filepath):
    """Convert a jsonlines crawl file to the parquet format.

    Parameters
    ----------
    jl_filepath : str
        The path of an existing .jl file.
    parquet_filepath : str
        The path where you want the new file to be saved (ending with .parquet).
    """
    crawldf = pd.read_json(jl_filepath, lines=True)
    while True:
        try:
            crawldf.to_parquet(parquet_filepath, index=False, version='2.6')
            break
        except Exception as e:
            # The last error argument is expected to name the offending column
            error = str(e.args[-1]) if e.args else ''
            column = re.findall(r'column (\S+)', error)
            if not column:
                # Not a column conversion error: re-raise instead of looping forever
                raise
            print(f'converting to string: {column[0]}')
            # Cast the mixed-type column to strings, turning 'nan' back into missing values
            crawldf[column[0]] = crawldf[column[0]].astype(str).replace('nan', pd.NA)
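The function above finds problem columns by trial and error, one failed write at a time. As a rough complement, the sketch below scans jsonlines text with the standard library only and reports which keys hold more than one JSON type. The helper name `find_mixed_type_columns` and the sample rows are hypothetical and not part of the original gist. It is also stricter than pandas: a column mixing `int` and `float` is flagged here even though pandas would simply cast it to float.

```python
import json
from collections import defaultdict


def find_mixed_type_columns(lines):
    """Return {column: sorted type names} for columns holding more than one type.

    `lines` is any iterable of JSON strings, one object per line.
    null values are ignored, since missing data is not a type conflict.
    """
    types_seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for key, value in json.loads(line).items():
            if value is not None:
                types_seen[key].add(type(value).__name__)
    return {key: sorted(names) for key, names in types_seen.items() if len(names) > 1}


# Hypothetical crawl rows: "h1" is a string in one row and a list in another
sample = [
    '{"url": "https://example.com/a", "h1": "Title"}',
    '{"url": "https://example.com/b", "h1": ["Title one", "Title two"]}',
    '{"url": "https://example.com/c", "h1": null}',
]
print(find_mixed_type_columns(sample))  # {'h1': ['list', 'str']}
```

To check a real file, pass an open file handle (for example `open(jl_filepath)`) as `lines`, since iterating over a file yields one line at a time.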