I have a pile of CSVs from the NYC bike share program (Citi Bike):
ls -l /home/gil/databog/csv/citibike
total 3618000
-rw-r--r-- 1 gil gil 234843729 Feb  4  2020 202001-citibike-tripdata.csv
-rw-r--r-- 1 gil gil 217131425 Mar 24  2020 202002-citibike-tripdata.csv
-rw-r--r-- 1 gil gil 202642779 Apr 17  2020 202003-citibike-tripdata.csv
-rw-r--r-- 1 gil gil 129734561 May 22  2020 202004-citibike-tripdata.csv
-rw-r--r-- 1 gil gil 283682921 Jun  5  2020 202005-citibike-tripdata.csv
-rw-r--r-- 1 gil gil 357642798 Jul  5  2020 202006-citibike-tripdata.csv
-rw-r--r-- 1 gil gil 399626762 Aug 11  2020 202007-citibike-tripdata.csv
-rw-r--r-- 1 gil gil 442365582 Sep  4  2020 202008-citibike-tripdata.csv
-rw-r--r-- 1 gil gil 472984200 Oct 13  2020 202009-citibike-tripdata.csv
-rw-r--r-- 1 gil gil 427237224 Nov  4  2020 202010-citibike-tripdata.csv
-rw-r--r-- 1 gil gil 330107544 Dec  4  2020 202011-citibike-tripdata.csv
-rw-r--r-- 1 gil gil 206762924 Jan  5  2021 202012-citibike-tripdata.csv
du -sh /home/gil/databog/csv/citibike
3.5G    /home/gil/databog/csv/citibike
The script below reads the CSVs with the Ibis DuckDB backend, writes them out as a Delta Lake table in a temp directory, then removes the temp directory:
from __future__ import annotations
import tempfile
import ibis
con = ibis.duckdb.connect()
citibike = con.read_csv("/home/gil/databog/csv/citibike/*.csv", table_name="citibike")
with tempfile.TemporaryDirectory() as tmpdirname:
    citibike.to_delta(
        tmpdirname,
        mode="append",
        partition_by=["gender"],
        storage_options={"allow_unsafe_rename": "true"},
    )
🐍(nix) ~githibis-ibismain…1⚑6
🐚 # On current main
°º for i in range(5):
°º     # filtering out delta log messages
°º     time -f '%e' python delta_write_times.py | grep -v _delta_log
°º
12.21
12.65
12.69
12.76
12.70
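For anyone who wants to rerun these timings outside of xonsh, here is a rough stdlib-only sketch of the same loop. The helper name `time_command` is made up for illustration, and it measures wall-clock time with `time.perf_counter` rather than GNU `time`, so the numbers may differ slightly from the ones above.

```python
from __future__ import annotations

import subprocess
import sys
import time


def time_command(cmd: list[str], repeats: int = 5) -> list[float]:
    """Run `cmd` `repeats` times and return the wall-clock seconds of each run.

    Hypothetical helper, not part of ibis or duckdb.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard stdout/stderr so delta log messages don't clutter the output
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return timings
```

Assuming `delta_write_times.py` is in the current directory, `time_command([sys.executable, "delta_write_times.py"])` should give five timings comparable to the loop above.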
🐍(nix) ~githibis-ibismain…1⚑6 ⌛1m3s
🐚 gd
diff --git a/ibis/backends/__init__.py b/ibis/backends/__init__.py
index 2a434f19e..1d0824d1e 100644
--- a/ibis/backends/__init__.py
+++ b/ibis/backends/__init__.py
@@ -544,8 +544,7 @@ class _FileIOHandler:
                 "pip install 'ibis-framework[deltalake]'\n"
             )
 
-        with expr.to_pyarrow_batches(params=params) as batch_reader:
-            write_deltalake(path, batch_reader, **kwargs)
+        write_deltalake(path, expr.to_pyarrow(), **kwargs)
 
 
 class CanListCatalog(abc.ABC):
🐍(nix) ~githibis-ibismain+1…1⚑6 ⌛9s
🐚 for i in range(5):
°º     time -f '%e' python delta_write_times.py | grep -v _delta_log
°º
9.27
9.29
9.33
9.42
9.31
Same as above, but now using duckdb-python directly:
from __future__ import annotations
import tempfile
import duckdb
from deltalake.writer import write_deltalake
con = duckdb.connect()
citibike = con.read_csv("/home/gil/databog/csv/citibike/*.csv").arrow()
with tempfile.TemporaryDirectory() as tmpdirname:
    write_deltalake(
        tmpdirname,
        citibike,
        mode="append",
        partition_by=["gender"],
        storage_options={"allow_unsafe_rename": "true"},
    )
🐚 for i in range(5):
°º     time -f '%e' python delta_write_times_duckdb.py | grep -v _delta_log
°º
7.87
8.06
8.27
8.27
8.45
Ibis with to_pyarrow() patch (s) | Ibis main (s) | DuckDB directly (s) |
---|---|---|
9.27 | 12.21 | 7.87 |
9.29 | 12.65 | 8.06 |
9.33 | 12.69 | 8.27 |
9.42 | 12.76 | 8.27 |
9.31 | 12.70 | 8.45 |
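
To make the three columns easier to compare, here is a small sketch that averages each one and expresses it relative to current main. The numbers are copied from the table above; the dictionary keys are just labels I picked, and five runs per variant is too few to read much into small differences.

```python
from statistics import mean

# Wall-clock seconds copied from the table above (five runs per variant)
timings = {
    "ibis_pyarrow_table": [9.27, 9.29, 9.33, 9.42, 9.31],
    "ibis_main": [12.21, 12.65, 12.69, 12.76, 12.70],
    "duckdb": [7.87, 8.06, 8.27, 8.27, 8.45],
}

# Average each variant, then compare against current main as the baseline
means = {name: mean(runs) for name, runs in timings.items()}
baseline = means["ibis_main"]

for name, avg in means.items():
    print(f"{name}: mean {avg:.2f}s, {avg / baseline:.0%} of ibis main")
```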