@acarapetis
Last active September 14, 2022 06:49
csv_to_parquet.py
"""CLI tool to stream CSV (from stdin) to parquet"""
import sys
from pathlib import Path
from typing import TextIO

import pandas as pd

CHUNKSIZE = 2**16  # 65,536 rows per parquet file


def convert(instream: TextIO, outdir: Path, chunksize: int = CHUNKSIZE) -> None:
    # Raises FileExistsError if outdir already exists, so earlier output is never overwritten.
    outdir.mkdir()
    # Read the CSV lazily and write each chunk of rows to its own numbered file.
    for idx, chunk in enumerate(pd.read_csv(instream, chunksize=chunksize)):
        chunk.to_parquet(outdir / f"{idx:010}.parquet")


if __name__ == "__main__":
    convert(sys.stdin, Path(sys.argv[1]))
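
The script relies on `pd.read_csv(..., chunksize=...)` handing back one block of rows at a time, so the whole CSV never has to fit in memory. As a rough illustration of that chunking idea only, here is a standard-library sketch. The helpers `iter_chunks` and `chunk_names` are hypothetical names introduced for this example; they are not part of pandas or of the gist, and the sketch yields plain lists of dicts rather than DataFrames.

```python
import csv
import io
from itertools import islice
from typing import Iterator, List, TextIO


def iter_chunks(instream: TextIO, chunksize: int) -> Iterator[List[dict]]:
    """Hypothetical helper: yield lists of at most `chunksize` rows, read lazily."""
    reader = csv.DictReader(instream)
    while True:
        # islice pulls only the next `chunksize` rows from the stream.
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


def chunk_names(instream: TextIO, chunksize: int) -> List[str]:
    """Hypothetical helper: the file name each chunk would get under the gist's numbering."""
    return [
        f"{idx:010}.parquet"
        for idx, _ in enumerate(iter_chunks(instream, chunksize))
    ]


# Three data rows split into chunks of two rows each.
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
sizes = [len(chunk) for chunk in iter_chunks(sample, chunksize=2)]
sample.seek(0)
names = chunk_names(sample, chunksize=2)
```

The zero-padded index (`{idx:010}`) keeps the output files in chunk order when they are sorted by name, which is presumably why the script uses it.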