noamross/vectorizationtalk.md

## vectorizationtalk.md

      
    Raw
  

              vectorizationtalk.md
            
          
    % Vectorization without Condescension
% Noam Ross
% 14-04-07 09:09:18
Here are my notes from a recent talk I gave on vectorization at a Davis R Users' Group meeting. Thanks to Vince Buffalo, John Myles White, and Hadley Wickham for their input as I was preparing this.
Beginning R users are often told to "vectorize" their code. Here, I try to explain why vectorization can be advantageous in R by showing how R works under the hood.
Now, remember, premature optimization is the root of all evil (Knuth). Don't start re-writing your code unless the time saved is going to be worth the time invested.  Other approaches, like finding a bigger machine or parallelization, could give you more bang for the buck in terms of programming time.  But if you understand the nuts and bolts of vectorization in R, it may help you write shorter, simpler, safer, and yes, faster code in the first place.
First, let's acknowledge that vectorization can seem like voodoo. Consider a two math problems, one vectorized, and one not:
$$\begin{bmatrix} 1 \ 2 \ 3 \end{bmatrix} +
\begin{bmatrix} 1 \ 2 \ 3 \end{bmatrix} =
\begin{bmatrix} 2 \ 4 \ 6 \end{bmatrix}$$
$$\begin{aligned}
1 + 1 = 2 \\
2 + 2 = 4 \\
3 + 3 = 6
\end{aligned}$$
Why on earth should these take a different amount of time to calculate? Linear algebra isn't magic.  In both cases there are three addition operations to perform. So what's up?
1. What on earth is R actually doing?

R is a high-level, interpreted computer language. This means that R takes care of a lot of basic computer tasks for you. For instance, when you type
    i <- 5.0

You don't have to tell your computer:

That "5.0" is a floating-point number
That "i" should store numeric-type data
To find a place in memory for to put "5"
To register "i" as a pointer to that place in memory

You also don't have to convert i <- 5.0 to binary code. That's done automatically when you hit 'Enter'.
When you then type
i <- "foo"

You don't have to tell the computer that i no longer stores an integer but a series of characters that form a string, to store "f", "o", and "o", consecutively, etc.
R figures these things on it's own, on the fly, as you type commands or source them from a file.  This means that running a command in R takes a relatively long time than it might in a lower-level language.  For instance, if I am writing in C, I might say
int i
i = 5

This tells the computer the i will store data of the type int (integers), and assign the value 5 to it.  If I try to assign 5.5 to it, something will go wrong. Depending on my set-up, it might throw an error, or just silently assign 5 to i.  But C doesn't have to figure out what type of data is is represented by i and this is part of what makes it faster.
Here's another example. If you type:
2L + 3.5

The computer asks:

"OK, what's the first thing?"
"An integer"
"The second thing?"
"A a floating-point number"
"Do we have a way to deal with adding an integer and a floating-point number"
"Yes! Convert the integer to a floating-point number, then add the two floating point numbers"
[converts integer]
[finds a place in memory for the answer]
etc.

If R were a compiled computer language, like C or FORTRAN, much of this "figuring out" would be accomplished the compilation step, not when the program was run.  Compiled programs are translated into binary computer language after they are written, and this occurs over the whole program, rather than line-by-line.  This allows the compiler to organize the binary machine code in an optimal way for the program.
What does this have to do with vectorization in R? Well, many R functions are actually written in a a compiled language, such as C, C++, and FORTRAN, and have a small R "wrapper".  For instance, when you inspect the code for fft, the fast fourier transform, you see
> fft
function (z, inverse = FALSE) 
.Call(C_fft, z, inverse)
<bytecode: 0x7fc261e1b910>
<environment: namespace:stats>

R is passing the data onto a C function called C_fft.  You'll see this in many R functions.  If you look at their source code, you'll see .C(), .Call(), or sometimes .Internal() or .Primitive().  These means R is calling a C or FORTRAN program to carry out operations. However, this passing occurs after R figures out the data type in z, and also whether to use the default value of inverse.  The compiled code is able to run faster than code written in pure R, because the "figuring out" stuff is done first, and it can zoom ahead without the "translation" steps that R needs.
So if you use a loop to call an R function, it has to do the "figuring out" stuff, as well as the translation, each time.  But if you call it once, with a vector, the "figuring out" part happens just once.  Once in C or FORTRAN, the data are processed using something like loops, but this occurs in the compiled code, much faster.In the end, there's no actual vectorization. Somehow the computer is going to need to operate on each element of your vector, often with something like a loop.  Some compilers do "loop unrolling", and the computer just gets a sequence of commands, with no control commands telling it to go back and forth.
Another important component of the speed vectorized operations is that vectors in R are typed.  Despite all of its flexibility, R does have some restrictions on what we can do.  All elements of a vector must be the same data type.  If I try to do this
a <- c(1, 2, FALSE, "hello")

I get
> a
[1] "1"     "2"     "FALSE" "hello"
> class(a)
[1] "character"

R converts all my data to characters.  It can't handle a vector with different data types.
So when R needs to perform an operation like
c(1, 2, 3) + c(1, 2, 3)

R only has to ask what types of data are in each vector (2) rather than each element (6).
One consequence of all this is that fast R code is short.  If you can express what you want to do in R in a line or two, with just a few function calls that are actually calling compiled code, it'll be more efficient than if you write long program.  This is not the case in all other languages.  Often, in compiled languages, you want to stick with lots of very simple statements.  The fast way is the obvious, verbose way way.  However, they don't have the flexibility or the interactivity in R, and that's a lot of typing.
2. In R, everything is a vector

In R everything is a vector. To quote Tim Smith in aRrgh: a newcomer's (angry) guide to R, /

All naked numbers are double-width floating-point atomic vectors of length one. You're welcome.

This means that, in R, typing "6" tells R something like
<start vector, type=numeric, length=1>6<end vector>

While in other languages, "6" might just be
<numeric>6

So, while in other languages, it might be more efficient to express something as a single number rather than a length-one vector, in R this is impossible.  There's no advantage to NOT organizing your data as vector.  In other languages, short vectors might be better expressed as scalars.
3. Linear algebra is a special case

Linear algebra is one of the core functions of a lot of computing, so there are highly optimized programs for linear algebra. Such a program is called a BLAS - basic linear algebra system. R, and a lot of other software, relies on these specialized programs and outsources linear algebra to them.  These programs have things like built-in parallel processing, and they may be specialized for the hardware on your computer. So if your calculations can be expressed in actual linear algebra terms, such as matrix multiplication, than it's almost certainly faster to vectorize them.
Now, there are faster and slower linear algebra libraries, and you can install new ones on your computer and tell R to use them instead of the basic ones.  This used to be like putting a new engine in your car, but it's gotten considerably easier.  For certain problems, doing this can considerably speed up code, but results vary depending on the specific linear algebra operations you are using.
4. Functionals: Pre-allocating memory, avoiding side effects.

There are a whole family of functions in R called functionals, or apply functions, which take vectors (or matrices, or lists) of values and apply arbitrary functions to each. Because these can use arbitrary functions, they are NOT compiled.  Functionals mostly are written in pure R, and they speed up code only in certain cases.
One operation that is slow in R, and somewhat slow in all languages, is memory allocation.  So one of the slower ways to write a for loop is to resize a vector repeatedly, so that R has to re-allocate memory repeatedly, like this:
j <- 1
for (i in 1:10) {
    j[i] = 10
}

Here, in each repetition of the for loop, R has to re-size the vector and re-allocate memory.  It has to find the vector in memory, create a new one that will fit more data, copy the old data over, insert the new data, and erase the old vector.  This can get very slow as vectors get big.
The following form is faster
j <- rep(NA, 10)
for (i in 1:10) {
    j[i] = 10
}

Here, the vector is pre-sized, so writing j[9] doesn't require R to re-allocate memory.
The apply or plyr::*ply functions all actually have for loops inside, but they automatically do things like pre-allocating vector size so you don't screw it up.
I've seen arguments that both "ply" functions and for loops make for more expressive, easier to read code, and I don't really have an opinion on it.
One thing that "ply" functions do is avoid what are known as side effects.  When you run a ply function, everything happens inside that function, and nothing changes in your working environment (this is known as "functional programming).  In a for loop, on the other hand, when you do something like for(i in 1:10), you get the leftover i in your environment. This is considered bad practice sometimes.  You can understand how having a bunch of temporary variables like i lying around could cause problems, especially if you use i for something else later.
So when might for loops make sense over vectorization?


The functions you are calling don't take vector arguments
You can't express your code as a linear algebra operation
The number of iterations is relatively small, and the computational time of the functions called inside your for loop is high, so that actually calling them is a small fraction of your computational time.
For loops make more sense and are easier to read for you.

Of course, if you understand and intuit the first few points, the last point is going to become less and less common.
Some resources on vectorization


Good discussion in a couple of blog posts by John Myles White.
Some relevant chapters of Hadley Wickham's Advanced R Book: Functionals and code profiling
Vectorization is covered in chapters 3 and 4 of the classic text on R's idiosyncrasies - The R Inferno
A bunch of assorted blog posts on vectorization for speed:

http://www.r-bloggers.com/how-to-use-vectorization-to-streamline-simulations/
http://nesterko.com/blog/2011/04/29/drastic-r-speed-ups-via-vectorization-and-bug-fixes/ (note named lookups much slower than numbered)
http://blog.revolutionanalytics.com/2014/04/a-look-a-r-vectorization-through-the-collatz-conjecture.html
http://rpubs.com/daspringate/vectorisation
http://quanttrader.info/public/FasterRCode.pdf