@CodeBear801
Created June 27, 2019 15:43
convert csv into parquet
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = 'id2ids.csv'
parquet_file = 'id2ids.parquet'
chunksize = 10_000_000  # rows per chunk; lower this if memory is tight

# Read the tab-separated file lazily, one chunk at a time
csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize, low_memory=False)

parquet_writer = None
for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

# Close the writer only if at least one chunk was read
if parquet_writer is not None:
    parquet_writer.close()
@micomahesh1982

I used the code above but often get “python exit(), 139 Error”. There is no error message, so I am not sure what is going wrong with this piece of code.
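On Linux, shells report a process killed by signal N as exit status 128 + N, so 139 points to signal 11, a segmentation fault, which is why no Python traceback appears. A small sketch that decodes such a status with the standard library; the helper name `signal_from_exit_status` is made up for illustration, and it assumes the 128 + N convention applies to the reported code.

```python
import signal


def signal_from_exit_status(status):
    """Return the signal name if `status` follows the 128 + N convention, else None."""
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            # Not a signal number known on this platform
            return None
    return None


print(signal_from_exit_status(139))
```

Running the conversion script with `python -X faulthandler` makes Python dump its traceback to stderr when such a crash happens, which may show the call that was running at the time. A smaller `chunksize` is also worth trying, since 10 million rows per chunk can need a lot of memory.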