Skip to content

Instantly share code, notes, and snippets.

@gforsyth
Last active May 23, 2024 13:38
Show Gist options
  • Save gforsyth/9d380566c07da2846729df8a410a3607 to your computer and use it in GitHub Desktop.
Save gforsyth/9d380566c07da2846729df8a410a3607 to your computer and use it in GitHub Desktop.

[2024-05-23 Thu 09:13] deltalake Duckdb vs ibis

Dataset

I have a pile of CSVs from the NYC bike share program

ls -l /home/gil/databog/csv/citibike
total3618000
-rw-r–r–1gilgil234843729Feb42020202001-citibike-tripdata.csv
-rw-r–r–1gilgil217131425Mar242020202002-citibike-tripdata.csv
-rw-r–r–1gilgil202642779Apr172020202003-citibike-tripdata.csv
-rw-r–r–1gilgil129734561May222020202004-citibike-tripdata.csv
-rw-r–r–1gilgil283682921Jun52020202005-citibike-tripdata.csv
-rw-r–r–1gilgil357642798Jul52020202006-citibike-tripdata.csv
-rw-r–r–1gilgil399626762Aug112020202007-citibike-tripdata.csv
-rw-r–r–1gilgil442365582Sep42020202008-citibike-tripdata.csv
-rw-r–r–1gilgil472984200Oct132020202009-citibike-tripdata.csv
-rw-r–r–1gilgil427237224Nov42020202010-citibike-tripdata.csv
-rw-r–r–1gilgil330107544Dec42020202011-citibike-tripdata.csv
-rw-r–r–1gilgil206762924Jan52021202012-citibike-tripdata.csv
du -sh /home/gil/databog/csv/citibike
3.5G/home/gil/databog/csv/citibike

Ibis Profiling script

Read in the CSVs using the DuckDB backend, write out to delta in a temp directory, then remove the temp directory.

from __future__ import annotations

import tempfile

import ibis

con = ibis.duckdb.connect()

citibike = con.read_csv("/home/gil/databog/csv/citibike/*.csv", table_name="citibike")

with tempfile.TemporaryDirectory() as tmpdirname:
    citibike.to_delta(
        tmpdirname,
        mode="append",
        partition_by=["gender"],
        storage_options={"allow_unsafe_rename": "true"},
    )

On current main

 🐍(nix) ~githibis-ibismain…1⚑6
🐚 # On current main
°º for i in range(5):
°º     # filtering out delta log messages
°º     time -f '%e' python delta_write_times.py | grep -v _delta_log
°º
12.21
12.65
12.69
12.76
12.70

Switch from using RecordBatchReader to in-memory pyarrow.Table

 🐍(nix) ~githibis-ibismain…1⚑6                                       ⌛1m3s
🐚 gd
diff --git a/ibis/backends/__init__.py b/ibis/backends/__init__.py
index 2a434f19e..1d0824d1e 100644
--- a/ibis/backends/__init__.py
+++ b/ibis/backends/__init__.py
@@ -544,8 +544,7 @@ class _FileIOHandler:
                 "pip install 'ibis-framework[deltalake]'\n"
             )

-        with expr.to_pyarrow_batches(params=params) as batch_reader:
-            write_deltalake(path, batch_reader, **kwargs)
+        write_deltalake(path, expr.to_pyarrow(), **kwargs)


 class CanListCatalog(abc.ABC):
 🐍(nix) ~githibis-ibismain+1…1⚑6                                      ⌛9s
🐚 for i in range(5):
°º     time -f '%e' python delta_write_times.py | grep -v _delta_log
°º
9.27
9.29
9.33
9.42
9.31

DuckDB profiling script

Same as above, but now using duckdb-python directly:

from __future__ import annotations

import tempfile

import duckdb
from deltalake.writer import write_deltalake

con = duckdb.connect()

citibike = con.read_csv("/home/gil/databog/csv/citibike/*.csv").arrow()

with tempfile.TemporaryDirectory() as tmpdirname:
    write_deltalake(
        tmpdirname,
        citibike,
        mode="append",
        partition_by=["gender"],
        storage_options={"allow_unsafe_rename": "true"},
    )
🐚 for i in range(5):
°º     time -f '%e' python delta_write_times_duckdb.py | grep -v _delta_log
°º
7.87
8.06
8.27
8.27
8.45

Collected results

Ibis PyArrow Table (s)Ibis main (s)DuckDB (s)
9.2712.217.87
9.2912.658.06
9.3312.698.27
9.4212.768.27
9.3112.708.45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment