Checking for exact equality of FPs
require(dplyr)
DF = data.frame(a=seq(0, 1, by=0.2), b=1:2)
merge(data.frame(a=0.6), DF, all.x=TRUE)
# a b
# 1 0.6 NA
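The merge returns NA because the lookup value is not bitwise-identical to the value produced by seq(), even though both print as 0.6. A minimal check (not part of the original example) makes this visible, and rounding both join columns to a fixed precision is one possible workaround:
seq(0, 1, by=0.2)[4] == 0.6                # FALSE: the two doubles differ in the last bits
sprintf("%.20f", seq(0, 1, by=0.2)[4])     # prints something like 0.60000000000000008882
sprintf("%.20f", 0.6)                      # prints something like 0.59999999999999997780
# illustrative workaround: round both sides before joining
merge(data.frame(a=round(0.6, 10)), transform(DF, a=round(a, 10)), all.x=TRUE)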
Suppose I've two data.frames DF1 and DF2 as shown below:
require(dplyr)
set.seed(1L)
DF1 = data.frame(x=sample(3,10,TRUE), y1=1:10, y2=11:20)
# x y1 y2
# 1: 1 1 11
# 2: 1 5 15
# 3: 1 10 20
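DF2 itself isn't included in this fragment. As a sketch of the kind of filtering (semi) join being discussed, assume a hypothetical DF2 keyed on x; both approaches below keep only the rows of DF1 whose x appears in DF2 (here, x == 1):
require(data.table)
DF2 = data.frame(x=1L, z=42)          # hypothetical: DF2 isn't shown in this fragment
semi_join(DF1, DF2, by="x")           # dplyr: rows of DF1 whose x appears in DF2
as.data.table(DF1)[x %in% DF2$x]      # data.table: equivalent filtering join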
Generating some data first:
# R version 3.3.0
require(data.table) ## 1.9.7, commit 2433, github
require(dplyr) ## devel, commit 3189, github
set.seed(1L)
DT = data.table(ID = sample(1e3, 1e8, TRUE), GROUP = sample(letters, 1e8, TRUE))
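The expressions actually benchmarked aren't included in this fragment; a sketch of the sort of grouped count such a comparison would time on DT, in both packages:
system.time(DT[, .N, by = GROUP])                            # data.table: count rows per group
system.time(DT %>% group_by(GROUP) %>% summarise(n = n()))   # dplyr equivalent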
Here's another comparison between dplyr and data.table on relatively large data, with different numbers of unique groups.
require(dplyr)
require(data.table)
N = 20e6L # 20 million rows, UPDATE: also ran for 50 million rows (see table below)
K = 10L # other values tested are 25L, 50L, 100L
DT <- data.table(Insertion = sample(K, N, TRUE),
                 Unit = sample(paste("V", 1:K, sep=""), N, TRUE),
                 Channel = sample(K, N, TRUE))
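Only these three columns appear in this fragment; assuming just those, a grouped aggregation of the kind such a comparison typically times might look like:
system.time(DT[, .(n = .N), by = .(Insertion, Unit, Channel)])                  # data.table
system.time(DT %>% group_by(Insertion, Unit, Channel) %>% summarise(n = n()))   # dplyr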
Here are the new benchmarks for setkey, updated for v1.9.2. Let's generate some data.
require(data.table)
set.seed(1)
N <- 2e7 # size of DT
foo <- function() paste(sample(letters, sample(5:9, 1), TRUE), collapse="")
ch <- replicate(1e5, foo())
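The fragment stops after generating the pool of random strings; a hypothetical continuation that builds a table with a character column and times setkey on it could look like this (column names are my own):
DT <- data.table(x = sample(ch, N, TRUE), y = runif(N))   # hypothetical columns built from ch
system.time(setkey(DT, x))                                # time to sort the table and set the key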
Benchmarking for this gist: https://gist.github.com/PeteHaitch/75d6f7fd0566767e1e80
sim_data <- function(n, m, d, sim_strand = FALSE){
  if (d >= n){
    stop("Require d < n")
  }
  i <- sample(n - d, d)
A small note on this tweet from @KevinUshey and this tweet from @ChengHLee:
The number of rows, while important, is only one of the factors that influence the time taken to perform a join. From my benchmarking experience, the two features I found to influence join speed much more, especially with hash table based approaches (ex: dplyr), are:
That is, these features influence join speed even when the number of rows is the same.
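The features themselves aren't listed in this fragment, but the general point (same row count, different join cost) can be illustrated by varying, say, the number of unique key values; the names and sizes below are my own:
require(dplyr)
set.seed(1L)
N <- 1e7L
few      <- data.frame(id = sample(1e2L, N, TRUE))    # 1e7 rows, only 100 unique keys
many     <- data.frame(id = sample(5e6L, N, TRUE))    # 1e7 rows, millions of unique keys
lkp_few  <- data.frame(id = unique(few$id),  w = 1)
lkp_many <- data.frame(id = unique(many$id), w = 1)
system.time(left_join(few,  lkp_few,  by = "id"))     # same number of rows joined...
system.time(left_join(many, lkp_many, by = "id"))     # ...but typically very different timings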
Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.
Under the section Vectorise (and also briefly mentioned under the section Do as little as possible), one point I think would be nice to have is to be aware of the data structure the vectorised functions are implemented for. Using vectorised code without understanding that is a form of "premature optimisation" as well, IMHO.
For example, consider the case of rowSums on a data.frame. Some issues to consider here are: rowSums on a data.frame will coerce it into a matrix first. Imagine a huge (> 1Gb) data.frame; the conversion might turn out to be a bad idea if it drains memory and starts swapping. Note: I personally think any discussion about performance should weigh the trade-offs between "speed" and "memory".