Arun Srinivasan (arunsrinivasan)

@arunsrinivasan
arunsrinivasan / floating_points.md
Last active Jul 18, 2018
data.table, dplyr and R - floating point comparisons

Checking for exact equality of floating point numbers

require(dplyr)
DF = data.frame(a=seq(0, 1, by=0.2), b=1:2)

merge(data.frame(a=0.6), DF, all.x=TRUE)
#     a  b
# 1 0.6 NA
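The usual fix is to compare within a tolerance (or round before joining) rather than testing for exact equality. A minimal sketch in base R, not part of the original gist:

tol = sqrt(.Machine$double.eps)   # the tolerance all.equal() uses by default
DF[abs(DF$a - 0.6) < tol, ]       # finds the row where 'a' is approximately 0.6
merge(data.frame(a=0.6), transform(DF, a=round(a, 10)), all.x=TRUE)  # matches after rounding away the representation error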
arunsrinivasan / dplyr_complex_join.md

Suppose I have two data.frames, DF1 and DF2, as shown below:

require(dplyr)
set.seed(1L)
DF1 = data.frame(x=sample(3,10,TRUE), y1=1:10, y2=11:20)
#     x y1 y2
#  1: 1  1 11
#  2: 1  5 15
#  3: 1 10 20
@arunsrinivasan
arunsrinivasan / tweet_reply.md
Last active Jul 27, 2018
automatic indexing vs between() on integer ranges

Updated June 16 with the latest devel versions.

data.table's automatic indexing:

Generating some data first:

# R version 3.3.0
require(data.table) ## 1.9.7, commit 2433, github
require(dplyr)      ## devel, commit 3189, github
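The data generation itself is cut off in this preview. A rough sketch of the comparison the title refers to; the sizes and values below are assumptions, not the gist's actual benchmark:

set.seed(45L)
DT = data.table(x = sample(1e5L, 2e7L, TRUE), y = runif(2e7))
system.time(DT[x %in% 1000:2000])           # first run builds a secondary index on 'x' automatically
system.time(DT[x %in% 1000:2000])           # subsequent runs reuse that index
system.time(DT[between(x, 1000L, 2000L)])   # between() does a vector scan each time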
@arunsrinivasan
arunsrinivasan / SO_25436418.md
Last active Aug 29, 2015
Benchmarks for SO_25436418

Generate large enough data:

require(data.table) ## 1.9.3
set.seed(1L)
DT = data.table(ID = sample(1e3, 1e8, TRUE), GROUP = sample(letters, 1e8, TRUE))

Benchmarks:
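The benchmarked expressions themselves aren't visible in this preview. As a purely hypothetical placeholder, a timing on this data might look like:

system.time(ans <- DT[, .N, by = .(ID, GROUP)])   # hypothetical grouping; not the gist's actual code
head(ans)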

arunsrinivasan / SO_25066925.md

Here's another comparison between dplyr and data.table on relatively large data, with different numbers of unique groups.

require(dplyr)
require(data.table)
N = 20e6L # 20 million rows, UPDATE: also ran for 50 million rows (see table below)
K = 10L # other values tested are 25L, 50L, 100L
DT <- data.table(Insertion = sample(K, N, TRUE), 
                 Unit      = sample(paste("V", 1:K, sep=""), N, TRUE),
                 Channel   = sample(K, N, TRUE), 
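The remaining column definitions are cut off in this preview. A rough, hypothetical sketch of the kind of grouped summary being compared; the aggregation below is an assumption, not the gist's code:

system.time(DT[, .(n = .N), by = .(Insertion, Unit)])                   # data.table
system.time(DT %>% group_by(Insertion, Unit) %>% summarise(n = n()))    # dplyr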
@arunsrinivasan
arunsrinivasan / benchmarks_1.9.2.md
Last active Aug 29, 2015
Some old benchmarking repeated on 1.9.2

Setkey on v1.9.2:

Here are the new benchmarks for setkey, updated for v1.9.2. Let's generate some data.

require(data.table)
set.seed(1)
N <- 2e7 # size of DT
foo <- function() paste(sample(letters, sample(5:9, 1), TRUE), collapse="")
ch <- replicate(1e5, foo())
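The preview stops here. A minimal sketch of how the setkey timing itself might look; the table construction below is an assumption, not the gist's code:

DT <- data.table(x = sample(ch, N, TRUE), y = runif(N))   # character column drawn from 'ch'
system.time(setkey(DT, x))                                # time keying on the character column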
@arunsrinivasan
arunsrinivasan / duplicated_dt.md
Last active Aug 29, 2015
Benchmarking `duplicated.data.table`
@arunsrinivasan
arunsrinivasan / group_effect.md
Last active Sep 17, 2015
Illustrating the impact of number of groups on joins

Update: the timings are now updated with runs from R v3.2.2, using the new 'on=' syntax.

A small note on this tweet from @KevinUshey and this tweet from @ChengHLee:

The number of rows, while important, is only one of the factors that influence the time taken to perform a join. From my benchmarking experience, the two features that I found to influence join speed much more, especially in hash-table based approaches (e.g. dplyr), are:

  • The number of unique groups.
  • The number of columns the join is performed on - note that this is also related to the previous point, since in most cases more join columns means more unique groups.

That is, these features influence join speed even when the number of rows stays the same; a small sketch of this kind of setup follows below.
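A minimal sketch of the kind of setup meant above; the sizes, group counts and join expression are assumptions, not the gist's benchmark:

require(data.table)
N <- 1e7L
for (K in c(1e2L, 1e6L)) {          # same row count, very different numbers of unique groups
  DT1 <- data.table(id = sample(K, N, TRUE), v = runif(N))
  DT2 <- data.table(id = seq_len(K), w = runif(K))
  print(system.time(DT1[DT2, on = "id"]))
}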

@arunsrinivasan
arunsrinivasan / Knuth_quote.md
Created May 7, 2014
Knuth's quote of interest

Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.

@arunsrinivasan
arunsrinivasan / reply_tweet.md
Last active Aug 29, 2015
A suggestion on Hadley's point about "Performance", "Premature optimisation" and "vectorise"

Under the section Vectorise (and also briefly mentioned under the section Do as little as possible), one point I think would be nice to have is being aware of the data structure that the vectorised functions are implemented for. Using vectorised code without understanding that is a form of "premature optimisation" as well, IMHO.

For example, consider the case of rowSums on a data.frame. Some issues to consider here are:

  • Memory - using rowSums on a data.frame will coerce it into a matrix first. On a huge (> 1 GB) data.frame this can turn out to be a bad idea if the conversion exhausts memory and starts swapping.

Note: I personally think discussions about performance should weigh the trade-offs between "speed" and "memory".

  • Data structure - we can do much more in terms of speed (and memory) by taking advantage of the data structure here. Here's an example:
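The gist's own example is cut off in this preview; here is a minimal sketch of the kind of example meant, with made-up data:

df <- as.data.frame(matrix(runif(1e6), ncol = 10))
system.time(r1 <- rowSums(df))       # coerces df to a matrix first (an extra copy)
system.time(r2 <- Reduce(`+`, df))   # adds the columns directly; no coercion
all.equal(unname(r1), r2)            # same result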