arunsrinivasan / FR_5241.R
Last active January 1, 2016 10:39
FR #5241
require(data.table)
# let's create a huge data.table
set.seed(1)
N <- 2e7 # size of DT
# generate a character vector of length about 1e5
foo <- function() paste(sample(letters, sample(5:9, 1), TRUE), collapse="")
ch <- replicate(1e5, foo())
ch <- unique(ch)
# here's some sample data to test it out
require(data.table)
require(dplyr)
set.seed(45)
DF <- data.frame(x=sample(3, 25, TRUE), y=1:25, z=26:50)
DP <- tbl_df(DF) # dplyr's data.frame wrapper
DT <- data.table(DF)
# 1) row-wise subset (usually based on conditions):
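The preview cuts off at this first point; a minimal sketch of the condition-based row subset in each syntax, using the DT and DP objects above (the condition x == 2L is my choice for illustration, not the gist's):

DT[x == 2L]          # data.table: the condition goes in 'i'
filter(DP, x == 2L)  # dplyr equivalent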
arunsrinivasan / DT_comp_set.R
Created January 13, 2014 21:08
`:=` vs 'set' in data.table
require(data.table)
set.seed(1L)
DT1 <- data.table(x=sample(1e7), y=as.numeric(sample(1e7)), z=sample(letters, 1e7, TRUE))
DT2 <- copy(DT1)
val <- runif(1e7)
# 'set' seems faster when adding 1-column
# =======================================
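The preview ends at this comment; a minimal sketch of the timing it refers to, assuming a plain system.time() comparison of the two idioms (the gist's exact benchmark code isn't shown):

system.time(set(DT1, j = "w", value = val))  # add a column by reference via set()
system.time(DT2[, w := val])                 # add a column by reference via :=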
arunsrinivasan / SO_21308436.R
Last active January 4, 2016 12:49
min_rank vs min - Hadley's "premature optimisation" point
require(dplyr)
require(data.table)
foo <- function(N) {
    group_sizes = 10^(1:(log10(N)-1L))  # group sizes: 10, 100, ..., N/10
    uniqval <- unique(runif(2*N))       # pool of unique values to sample from
    fans <- vector("list", length(group_sizes))
    for (i in seq_along(group_sizes)) {
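The loop body is cut off in this preview. Going by the title, the comparison is presumably between dplyr's min_rank() and a plain x == min(x) filter per group; a minimal sketch of that contrast (the data and calls below are my assumptions, not the gist's code):

DF <- data.frame(g = rep(1:1000, each = 100), x = runif(1e5))
# rank-based: computes a full rank vector within each group
system.time(a1 <- filter(group_by(DF, g), min_rank(x) == 1L))
# direct comparison: a single scan for the minimum within each group
system.time(a2 <- filter(group_by(DF, g), x == min(x)))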
arunsrinivasan / Ramnath_twitter_question.R
Created March 18, 2014 20:32
Clarifying twitter question regarding usage of `.I`
require(data.table)
DT <- as.data.table(mtcars)
# directly using .SD
set.seed(45L)
system.time(ans1 <- DT[, .SD[sample(.N, 5L)], by=gear])
# user system elapsed
# 0.009 0.000 0.010
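The preview shows only the .SD version; the `.I` idiom the description refers to computes row indices per group first, then subsets once, along these lines (a sketch, not necessarily the gist's exact code):

set.seed(45L)
system.time(ans2 <- DT[DT[, .I[sample(.N, 5L)], by=gear]$V1])
# .I holds the row locations in DT; subsetting once avoids building .SD per group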
arunsrinivasan / SO_23388893.md
Last active August 29, 2015 14:00
data.table's sub-assignment by reference feature vs R v3.0.3 and R v3.1 benchmarks

Here's the code for comparing base R (v3.0.3 and v3.1.0) vs data.table's sub-assignment by reference feature:

require(data.table)
set.seed(20140430)
N <- as.integer(10^(3:7)*2L)
ans = vector("list", length(N))
for (i in seq_along(N)) {
    print(i)
    nreg = N[i]
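The loop body is truncated here; at its core, the comparison is between base R's copy-on-modify sub-assignment and data.table's := by reference, roughly along these lines (a sketch under assumed data, not the gist's exact code):

DF <- data.frame(x = runif(2e6), y = 0L)
DT <- as.data.table(DF)
system.time(DF$y[DF$x > 0.5] <- 1L)  # base R: the column (and, pre-v3.1.0, more) gets copied
system.time(DT[x > 0.5, y := 1L])    # data.table: column modified in place, no copy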
arunsrinivasan / reply_tweet.md
Last active August 29, 2015 14:01
A suggestion on Hadley's point about "Performance", "Premature optimisation" and "vectorise"

Under the section Vectorise (and also briefly mentioned under the section Do as little as possible), one point I think would be nice to have is to be aware of the data structure that the vectorised functions are implemented for. Using vectorised code without understanding that is a form of "premature optimisation" as well, IMHO.

For example, consider the case of rowSums on a data.frame. Some issues to consider here are:

  • Memory - using rowSums on a data.frame will coerce it into a matrix first. On a huge (> 1 GB) data.frame, this may turn out to be a bad idea if the conversion drains memory and starts swapping.

Note: I personally think any discussion about performance should weigh the trade-offs between "speed" and "memory".

  • Data structure - we can do much better in terms of speed (and memory) by taking advantage of the data structure here. Here's an example:
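The example itself is cut off in this preview; a minimal sketch of the kind of column-wise alternative this point suggests (the Reduce() formulation is my assumption, not necessarily the gist's code):

DF <- as.data.frame(matrix(runif(1e6), ncol = 100L))
system.time(rs1 <- rowSums(DF))      # coerces DF to a matrix first
system.time(rs2 <- Reduce(`+`, DF))  # adds the columns as vectors, no matrix copy
all.equal(unname(rs1), rs2)          # same result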
arunsrinivasan / Knuth_quote.md
Created May 7, 2014 03:08
Knuth's quote of interest

Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.

arunsrinivasan / group_effect.md
Last active September 17, 2015 14:03
Illustrating the impact of number of groups on joins

Update: the timings are now updated with runs from R v3.2.2, along with the new 'on=' syntax.

A small note on this tweet from @KevinUshey and this tweet from @ChengHLee:

The number of rows, while important, is only one of the factors that influence the time taken to perform a join. From my benchmarking experience, the two features that I found to influence join speed much more, especially on hash table based approaches (e.g. dplyr), are:

  • The number of unique groups.
  • The number of columns to perform the join on - note that this is also related to the previous point, since in most cases, the more the columns, the more the number of unique groups.

That is, these features influence join speed even when the number of rows is the same, as the sketch below illustrates.
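The benchmark tables are not part of this preview; a minimal sketch of how the effect can be illustrated (the sizes and the join call are my assumptions):

require(data.table)
N <- 1e7
for (ng in c(1e2, 1e6)) {  # few vs. many unique groups, same number of rows
    x <- data.table(id = sample(ng, N, TRUE), v = runif(N))
    y <- data.table(id = seq_len(ng), w = runif(ng))
    print(system.time(ans <- x[y, on = "id"]))  # join time grows with the group count
}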

arunsrinivasan / duplicated_dt.md
Last active August 29, 2015 14:02
Benchmarking `duplicated.data.table`