Skip to content

Instantly share code, notes, and snippets.

@szilard
Last active May 2, 2016 22:39
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save szilard/7d810877fcc6b019a478 to your computer and use it in GitHub Desktop.
Save szilard/7d810877fcc6b019a478 to your computer and use it in GitHub Desktop.
SQLite vs R data.table
sqlite vs R's data.table
TLDR; sqlite (:memory:) 250 sec data.table 7 sec
data: 100 million rows, 1 million groups
generated by: https://github.com/szilard/benchm-databases/blob/master/0-gendata.txt
sqlite3 :memory:
## SQLite version 3.8.4.1
create table d(x int, y double);
.mode csv
.import d-noh.csv d
.timer on
select x, avg(y) as ym
from d
group by x
order by ym desc
limit 5;
## Run Time: real 249.674 user 227.471406 sys 13.416719
R
## R version 3.2.2
library(data.table)
## data.table 1.9.6
d <- fread("d-noh.csv")
setnames(d, c("x","y"))
system.time(
print(head(d[, list(ym=mean(y)), by=x][order(-ym)],5))
)
## user system elapsed
## 6.811 0.176 7.005
@leondutoit
Copy link

Interesting. Does DT do the automatic indexing here? If so, would be a more accurate comparison to add an index to the x column in sqlite. On my machine that reduces the query time to 150 seconds. Not quite 7 but at least 100 seconds faster.

@szilard
Copy link
Author

szilard commented Nov 6, 2015

No (auto) key for DT above. But you can set a key, in which case the DT query will run in 1.5 sec :)

@leondutoit
Copy link

Ah so :) Perhaps you can consider adding the obvious indexes to the db benchmarks - I think it represents a more realistic use of databases. If not, the benchmarks will benefit from an explanation about why db indexes are left out, imho. Anyways, interesting work!

@hannes
Copy link

hannes commented Feb 10, 2016

Hi Szilard,

just sent you a PR to also check MonetDBLite (R package, https://www.monetdb.org/blog/monetdblite-r), which does these queries in < 10 seconds on my box.

When running on the m3.2xlarge instance and following your test protocol, I get the following timings: Group query: 7.0s Join query: 1.5s.

screen shot 2016-02-10 at 11 53 33

@szilard
Copy link
Author

szilard commented Feb 11, 2016

Awesome, thanks :) I added it to the benchmark: https://github.com/szilard/benchm-databases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment