@talegari
Created November 12, 2023 08:13
Exploring duckplyr
# dataset: https://zenodo.org/records/2594012
df = arrow::read_parquet("personal/Avazu/test.parquet") |>
  tibble::as_tibble()
dim(df) # 4,218,938 rows x 24 columns
res =
  bench::mark(
    # using `duckplyr`
    # caveat: duckplyr evaluates lazily, so unless the result is accessed
    # inside the braces this may mostly time building the query, not running it
    duckplyr = {
      df |>
        duckplyr::as_duckplyr_df() |>
        duckplyr::summarise(
          msp = median(click),
          .by = c(site_id, device_model, app_domain)
        ) |>
        duckplyr::arrange(site_id, device_model, app_domain)
    },
    # using `dplyr`
    dplyr = {
      df |>
        dplyr::summarise(
          msp = median(click),
          .by = c(site_id, device_model, app_domain)
        ) |>
        dplyr::arrange(site_id, device_model, app_domain)
    },
    # using `tidytable` (internally uses `data.table`)
    tidytable = {
      df |>
        tidytable::summarise(
          msp = median(click),
          .by = c(site_id, device_model, app_domain)
        ) |>
        tidytable::arrange(site_id, device_model, app_domain)
    },
    iterations = 10,
    check = FALSE # results are not compared across backends
  )
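
# `check = FALSE` means bench::mark() does not verify that the backends agree.
# A minimal sketch of such a check, run on a hypothetical toy data frame
# (`toy`, `g` and `x` are made up for illustration, they are not part of the
# Avazu data; duckplyr is left out to keep the sketch small):
toy = data.frame(g = c("a", "a", "b"), x = c(1, 3, 5))
toy_dplyr =
  toy |>
  dplyr::summarise(m = median(x), .by = g) |>
  dplyr::arrange(g)
toy_tidytable =
  toy |>
  tidytable::summarise(m = median(x), .by = g) |>
  tidytable::arrange(g)
# compare column by column so class differences (data.frame vs tidytable)
# do not matter
stopifnot(
  identical(toy_dplyr$g, toy_tidytable$g),
  isTRUE(all.equal(toy_dplyr$m, toy_tidytable$m))
)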
res |>
  dplyr::select(min, median, iter_per_sec = `itr/sec`) |>
  dplyr::mutate(backend = c("duckplyr", "dplyr", "tidytable")) |>
  dplyr::relocate(backend)
# on my laptop:
# # A tibble: 3 × 4
#   backend        min   median iter_per_sec
#   <chr>     <bch:tm> <bch:tm>        <dbl>
# 1 duckplyr    3.99ms   4.76ms      215.
# 2 dplyr        3.89s    4.52s        0.225
# 3 tidytable 380.64ms  395.7ms        2.40
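
# To read the table as ratios, here is a small sketch that divides each median
# by the fastest one. The numbers are copied by hand from the output above
# (converted to seconds); `median_sec` and `slowdown` are illustrative names.
# On a live `res` object, `summary(res, relative = TRUE)` should give a
# similar view.
median_sec = c(duckplyr = 4.76e-3, dplyr = 4.52, tidytable = 395.7e-3)
slowdown = round(median_sec / min(median_sec), 1)
slowdown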
sessionInfo()