arunsrinivasan/reply_tweet.md

## reply_tweet.md

      
Under the section Vectorise (and, briefly, under Do as little as possible), one point I think would be nice to add is awareness of the data structure that a vectorised function is implemented for. Using vectorised code without understanding that is a form of "premature optimisation" as well, IMHO.
For example, consider the case of rowSums on a data.frame. Some issues to consider here are:

Memory - using rowSums on a data.frame will coerce it into a matrix first. Imagine a huge (> 1 GB) data.frame: this can turn out to be a bad idea if the conversion exhausts memory and the system starts swapping.


Note: I personally think any discussion of performance should weigh the trade-offs between "speed" and "memory".
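
To make the coercion point concrete, here is a minimal sketch on a deliberately small data.frame. The name `DF_small` and its sizes are made up for illustration; the sketch only assumes that calling `as.matrix()` by hand mirrors the conversion `rowSums` performs on a data.frame.

```r
## Small stand-in for a large data.frame (name and sizes are illustrative only)
set.seed(1L)
cols <- lapply(1:10, function(i) as.numeric(sample(10, 1e4, TRUE)))
names(cols) <- paste0("V", seq_along(cols))
DF_small <- as.data.frame(cols)

## rowSums() converts a data.frame to a matrix before summing;
## doing the conversion by hand makes the extra copy visible
M <- as.matrix(DF_small)

is.matrix(M)            # the coerced copy is a separate matrix object
object.size(DF_small)   # memory held by the data.frame
object.size(M)          # memory held by the matrix copy, on top of the above

## Both routes give the same sums; the data.frame route also pays for the copy
same <- isTRUE(all.equal(rowSums(DF_small), rowSums(M)))
```

On a data.frame of this size the copy is harmless, but its cost grows with the data, which is where the swapping concern comes from.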


Data structure - We can do much more in terms of speed (and memory) by taking advantage of the data structure here. Here's an example:

```r
set.seed(1L)
require(data.table)
DF <- as.data.frame(setDT(lapply(1:1e2, function(x) as.numeric(sample(10, 1e6, TRUE)))))

## using vectorised rowSums
system.time(ans1 <- rowSums(DF))
#   user  system elapsed 
#  2.029   1.154   3.660 

## using simple for-loop
foo <- function(x) {
    ## skipping checks here just for illustration
    ans = x[[1L]]
    for (i in seq_len(ncol(x))[-1L]) {
        ans = ans + x[[i]]
    }
    ans
}
system.time(ans2 <- foo(DF))
#   user  system elapsed 
#  0.565   0.570   1.172 

identical(ans1, ans2) ## [1] TRUE
```

The for-loop involves no coercion (so it avoids the extra copy that roughly doubles memory usage) and is ~3x faster. We get a performance improvement in terms of both "speed" and "memory" by choosing not to use rowSums on a data.frame.
Even better would be to write this for-loop in C. But that shouldn't matter a lot as long as you're not dealing with a lot of columns (which is rarely the case).
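
Staying in plain R, the same column-by-column addition can also be written with `Reduce`, which folds `+` over the columns. This is only a sketch: `row_sums_df` is a hypothetical helper name (not part of base R or data.table), the input checks are a minimal guess at what the skipped checks might look like, and no timing claim is made for it.

```r
## Hypothetical helper: adds the columns of a data.frame element-wise,
## without first converting the data.frame to a matrix
row_sums_df <- function(x) {
    ## minimal checks: a data.frame with at least one column, all numeric
    stopifnot(is.data.frame(x), ncol(x) > 0L,
              all(vapply(x, is.numeric, logical(1L))))
    ## Reduce() folds `+` over the list of columns: ((col1 + col2) + col3) + ...
    Reduce(`+`, x)
}

## Small illustrative input (name and sizes are made up)
set.seed(1L)
cols <- lapply(1:5, function(i) as.numeric(sample(10, 1e3, TRUE)))
names(cols) <- paste0("V", seq_along(cols))
DF_small <- as.data.frame(cols)

ans_reduce  <- row_sums_df(DF_small)
ans_rowsums <- unname(rowSums(DF_small))

## compare against rowSums as a sanity check
agree <- isTRUE(all.equal(ans_reduce, ans_rowsums))
```

Like the explicit loop, this does not handle NA removal the way `rowSums(x, na.rm = TRUE)` does, so it is a drop-in only when the columns are complete.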