@jordansamuels
Created July 18, 2019 03:10
Python script to reproduce pyarrow 0.14.0 reading only part of a valid gzip-compressed CSV file
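import os

import pandas as pd
import pyarrow
import pyarrow.csv as pcsv

# Write two single-row CSVs as separate gzip files; only the first has a header.
pd.DataFrame({'x': [1]}).to_csv('/tmp/1.csv.gz', index=False, compression='gzip')
pd.DataFrame({'x': [2]}).to_csv('/tmp/2.csv.gz', header=False, index=False, compression='gzip')

# Concatenate them into one multi-member gzip file (requires a Unix shell with cat).
os.system("cat /tmp/1.csv.gz /tmp/2.csv.gz > /tmp/t.csv.gz")

# pyarrow stops after the first gzip member and returns a single row.
print("pyarrow.csv only reads one row:")
print(pcsv.read_csv('/tmp/t.csv.gz').to_pandas())

# pandas reads both members and returns both rows.
print("pandas reads two rows:")
print(pd.read_csv('/tmp/t.csv.gz'))

print("pyarrow version: " + pyarrow.__version__)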
@jordansamuels
Author

Output is:

$ python repro.py
pyarrow.csv only reads one row:
   x
0  1
pandas reads two rows:
   x
0  1
1  2
pyarrow version: 0.14.0
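
For anyone wondering why the concatenated file counts as valid: the gzip format allows several members back to back, and Python's standard `gzip` module reads all of them. The sketch below rebuilds the same two-member layout in memory and contrasts that with a single `zlib` decompressor, which stops at the end of the first member. Whether pyarrow 0.14.0 fails for this exact reason is an assumption on my part, not something this repro confirms. The helper names `make_member` and `decompress_all_members` are hypothetical and only for illustration.

```python
import gzip
import zlib


def make_member(text: str) -> bytes:
    """Compress text into one standalone gzip member."""
    return gzip.compress(text.encode("utf-8"))


# Same layout as /tmp/t.csv.gz: a header plus one row, then a second row,
# each compressed as its own gzip member and concatenated.
concatenated = make_member("x\n1\n") + make_member("2\n")

# gzip.decompress walks every member in the stream.
all_members = gzip.decompress(concatenated).decode("utf-8")

# A single zlib decompressor in gzip mode (wbits = 16 + MAX_WBITS) stops at
# the end of the first member; the remaining bytes land in unused_data.
single = zlib.decompressobj(16 + zlib.MAX_WBITS)
first_only = single.decompress(concatenated).decode("utf-8")
leftover = single.unused_data


def decompress_all_members(data: bytes) -> bytes:
    """Decompress each gzip member in turn until no input remains."""
    chunks = []
    while data:
        member = zlib.decompressobj(16 + zlib.MAX_WBITS)
        chunks.append(member.decompress(data))
        chunks.append(member.flush())
        data = member.unused_data  # bytes belonging to the next member, if any
    return b"".join(chunks)


print(repr(all_members))
print(repr(first_only))
print(len(leftover), "bytes left unread by the single decompressor")
print(repr(decompress_all_members(concatenated).decode("utf-8")))
```

A reader that decompresses only the first member would see the header and the row `1` and nothing else, which matches the one-row result above. Looping over `unused_data` until it is empty is one way to pick up the remaining members.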