Created June 27, 2019 15:43
Convert CSV into Parquet
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = 'id2ids.csv'
parquet_file = 'id2ids.parquet'
chunksize = 10_000_000

csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize, low_memory=False)

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
I used the code above but often get “python exit(), 139 Error”. There is no error message, so I'm not sure what is going wrong with this code.
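An exit status of 139 is what a shell conventionally reports when a process is killed by signal 11 (128 + 11), which is SIGSEGV, so the interpreter crashed in native code instead of raising a Python exception. That would explain the missing error message. A minimal sketch for getting more information, using only the standard library: `describe_exit_status` and `enable_crash_traceback` are hypothetical helper names, and whether the crash comes from memory pressure with `chunksize = 10_000_000` or from a library bug is an assumption that this alone does not settle.

```python
import faulthandler
import signal
import sys


def describe_exit_status(status: int) -> str:
    """Translate a shell exit status into a readable description."""
    if status > 128:
        # Shells report "128 + signal number" for processes killed by a signal.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with code {status}"


def enable_crash_traceback() -> None:
    """Ask Python to dump the current traceback to stderr on a fatal signal."""
    faulthandler.enable(file=sys.stderr, all_threads=True)


if __name__ == "__main__":
    # Call this once at the top of the conversion script, before reading the CSV.
    enable_crash_traceback()
    print(describe_exit_status(139))
```

With the handler enabled, a segmentation fault prints the Python stack at the moment of the crash, which shows whether it happened in `read_csv`, `Table.from_pandas`, or `write_table`. The same effect is available without code changes by running the script with `python -X faulthandler`.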