Reads a large CSV file in parallel via pandas and joblib. Performance probably degrades due to the `pd.concat` usage.

Tests, better function parameter definitions, and documentation are pending.
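The actual implementation is not reproduced here; the following is only a minimal sketch of the approach, with parameter semantics inferred from the example call below (`row_chunksize` taken as rows per chunk, `n_jobs` passed through to joblib) and column chunking omitted for brevity:

```python
import pandas as pd
from joblib import Parallel, delayed


def read_csv_joblib_sketch(file_path, row_chunksize, n_rows=None, sep=",", n_jobs=1):
    """Illustrative sketch only; all parameter semantics are assumptions."""
    # Reading dimensions: count data rows without parsing any values.
    if n_rows is None:
        with open(file_path) as fh:
            n_rows = sum(1 for _ in fh) - 1  # subtract the header line

    # Deriving intervals: the start offset of each row chunk.
    starts = range(0, n_rows, row_chunksize)

    # Loading dataframes: each worker parses one slice of the file.
    def load(start):
        return pd.read_csv(
            file_path,
            sep=sep,
            skiprows=range(1, start + 1),  # skip earlier data rows, keep the header
            nrows=min(row_chunksize, n_rows - start),
        )

    frames = Parallel(n_jobs=n_jobs)(delayed(load)(s) for s in starts)

    # Concatenating dataframe rows: the dominant cost in the test below.
    return pd.concat(frames, ignore_index=True)
```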
A concrete test on a 5 GB CSV file ended with a "Kernel died" message when `pd.read_csv` was used directly (run in a Jupyter notebook and reproduced twice).
In contrast, using `read_csv_joblib` with the following settings returned in 3 h 4 min:

```python
df = read_csv_joblib(
    file_path,
    row_chunksize=17,
    column_chunksize=None,
    n_rows=None,
    n_columns=None,
    sep="\s+",
    n_jobs=70,
)
```

Concatenating the row chunks took by far the longest; see the note after the timing results.
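A note on the settings: `sep="\s+"` makes pandas treat any run of whitespace as the column delimiter (the file is whitespace-separated rather than comma-separated), and `n_jobs=70` asks joblib for 70 parallel workers.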
Here are the timing results:
* Reading dimensions
  * duration: 177.37 sec
* Deriving intervals
  * row steps: 97, column steps: 1
  * duration: 0.00 sec
* Loading dataframes
  * loaded dataframes: 97
  * duration: 1203.23 sec
* Concatenating dataframe columns
  * duration: 0.00 sec
* Concatenating dataframe rows
  * duration: 9691.57 sec
```
CPU times: user 2h 42min 46s, sys: 3min 26s, total: 2h 46min 12s
Wall time: 3h 4min 52s
```
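The 9691 sec spent concatenating rows is consistent with a well-known pandas pitfall: if chunks are merged incrementally, every `pd.concat` call copies all rows accumulated so far, so the total cost grows quadratically with the number of chunks. Whether the implementation concatenates incrementally is an assumption; if it does, collecting all chunks and concatenating once should help:

```python
import pandas as pd

# Stand-in for the 97 row chunks produced by the parallel loaders.
frames = [pd.DataFrame({"a": range(3), "b": range(3)}) for _ in range(97)]

# Anti-pattern: quadratic, every iteration re-copies all accumulated rows.
df = pd.DataFrame()
for chunk in frames:
    df = pd.concat([df, chunk], ignore_index=True)

# Better: a single concat over the full list copies each row only once.
df = pd.concat(frames, ignore_index=True)
```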