bits01 / parquet_to_jsonl.py
Created June 13, 2018 05:33
Convert Parquet file to gzipped JSON lines (JSONL) in 3 lines of code
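# pip install pyarrow
# pip install pandas
import pyarrow.parquet as pq
# columns=['col1', 'col2'] to restrict loaded columns
# use_threads=True replaces the older nthreads argument, which newer pyarrow releases no longer accept
pds = pq.read_pandas('/path/to/file.parquet', columns=None, use_threads=True).to_pandas()
# path_or_buf='output.jsonl.gz' to write a gzipped file instead of printing to stdout;
# compression only takes effect when path_or_buf is a file path, so the printed string is plain JSONL
print(pds.to_json(path_or_buf=None, orient='records', lines=True, date_format='iso', date_unit='us', compression='gzip'))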
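To check a file written with path_or_buf='output.jsonl.gz', it can be read back using only the standard library. The sketch below is not part of the original gist. The helper names `write_jsonl_gz` and `read_jsonl_gz` are hypothetical. The writer is there only so the example runs without a Parquet file, and it assumes the layout is one JSON object per line inside a gzip stream.

```python
import gzip
import json


def write_jsonl_gz(records, path):
    """Write an iterable of dicts as gzipped JSON lines (hypothetical helper)."""
    with gzip.open(path, 'wt', encoding='utf-8') as fh:
        for record in records:
            # One JSON object per line, matching orient='records', lines=True
            fh.write(json.dumps(record) + '\n')


def read_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzipped JSONL file (hypothetical helper)."""
    with gzip.open(path, 'rt', encoding='utf-8') as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Reading line by line keeps memory use flat, so the same loop should work for outputs much larger than the DataFrame that produced them.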

Keybase proof

I hereby claim:

  • I am bits01 on github.
  • I am dragosh (https://keybase.io/dragosh) on keybase.
  • I have a public key ASBzMQ-WNrn_AEEGiZjRxVUI5tYNpjYm2PVMRJa3zXbJoAo

To claim this, I am signing this object: