@bits01
Created June 13, 2018 05:33
Convert Parquet file to gzipped JSON lines (JSONL) in 3 lines of code
# pip install pyarrow
# pip install pandas
import pyarrow.parquet as pq
# columns=['col1', 'col2'] to restrict loaded columns
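# use_threads=True enables multi-threaded column reads (older pyarrow releases took nthreads=N instead)
pds = pq.read_pandas('/path/to/file.parquet', columns=None, use_threads=True).to_pandas()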
# path_or_buf='output.jsonl.gz' to write to a file instead of stdout;
# compression='gzip' only takes effect when writing to a file path
print(pds.to_json(path_or_buf=None, orient='records', lines=True, date_format='iso', date_unit='us', compression='gzip'))
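
To sanity-check the gzipped output, the file can be read back one record at a time using only the standard library. This is a minimal sketch, not part of the original snippet: read_jsonl_gz and write_jsonl_gz are hypothetical helper names, and write_jsonl_gz is only a stand-in for the to_json call above so the example runs without pandas or pyarrow installed.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Yield one dict per line from a gzipped JSON Lines file (hypothetical helper)."""
    # 'rt' opens the gzip stream in text mode, so each line arrives as a str.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines such as a trailing newline
                yield json.loads(line)


def write_jsonl_gz(records, path):
    """Write an iterable of dicts as gzipped JSON Lines.

    Stand-in for pds.to_json(path_or_buf=path, lines=True, compression='gzip');
    an assumption for illustration, not what pandas does internally.
    """
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
```

For example, `for row in read_jsonl_gz('output.jsonl.gz'): print(row)` streams the records without loading the whole file into memory.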