Reads a large CSV file in row chunks using pandas and joblib. Performance probably degrades because of the pd.concat call that joins the chunks.
Tests, clearer function parameter definitions, and documentation are still pending.
A very objective test on a 5GB CSV file (shape=()) resulted in a "Kernel died" message (it was run in a Jupyter notebook and repeated twice) when using pd.read_csv directly.
In contrast, using read_csv_joblib with the following settings returned in 3h 4m:
Concatenating the row chunks took the longest.
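
For orientation, here is a minimal sketch of the general approach described above: read row ranges in parallel with joblib, then join them with pd.concat. It is not the actual read_csv_joblib implementation. The function name read_csv_joblib_sketch, the helper _read_rows, and the parameters chunk_rows and n_jobs are hypothetical, chosen only for illustration. The sketch assumes the file has a single header line and no line breaks inside quoted fields, and it uses joblib's thread-based backend to avoid copying chunks between processes.

```python
import pandas as pd
from joblib import Parallel, delayed


def _read_rows(path, columns, start, nrows, **kwargs):
    # Hypothetical helper: read `nrows` data rows beginning at data row `start`.
    # File line 0 is the header, so start + 1 lines are skipped.
    return pd.read_csv(
        path, names=columns, header=None, skiprows=start + 1, nrows=nrows, **kwargs
    )


def read_csv_joblib_sketch(path, chunk_rows=100_000, n_jobs=2, **kwargs):
    # Read only the header to get the column names.
    columns = list(pd.read_csv(path, nrows=0).columns)

    # Count data rows (assumes one record per line, i.e. no embedded newlines).
    with open(path, "rb") as fh:
        n_rows = sum(1 for _ in fh) - 1
    if n_rows <= 0:
        return pd.read_csv(path, **kwargs)

    # Read each row range in a separate joblib worker (threads, by assumption).
    chunks = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(_read_rows)(path, columns, start, chunk_rows, **kwargs)
        for start in range(0, n_rows, chunk_rows)
    )

    # A single concat over the whole list; this is the step reported
    # above as taking the longest.
    return pd.concat(chunks, ignore_index=True)
```

In this sketch every worker still scans the file from the beginning to reach its starting row, and pd.concat copies all chunks into a new frame, so peak memory is roughly twice the size of the final DataFrame. Column dtypes are inferred per chunk, so passing an explicit dtype through kwargs may be needed for consistent results.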