_Why_ is it so slow? A little research turned up a mailing list posting from a year ago this month in which @hadley, the package author, [states][2]
> This is a drawback of the way that ddply always works with data
> frames. It will be a bit faster if you use summarise instead of
> data.frame (because data.frame is very slow), but I'm still thinking
> about how to overcome this fundamental limitation of the ddply
> approach.
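To make that concrete, here is a minimal sketch (not run; it assumes a `dd` data frame with `price` and `volume` columns, as in the benchmarks below) contrasting the `summarise` form with the `data.frame` form it replaces:
```{r, eval=FALSE}
# summarise builds each group's result directly...
ddply( dd, .(price), summarise, ss=sum(volume) )
# ...while this form pays the (slow) data.frame() construction cost once per group
ddply( dd, .(price), function(x) data.frame(ss=sum(x$volume)) )
```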
---
As for whether this is *efficient* plyr code, I didn't know either. After a bunch of parameter testing and benchmarking it looks like we can do better.
The `summarize()` in your command is just a helper function, pure and simple. We can replace it with our own sum function, since it isn't helping with anything that isn't already simple, and the `.data` and `.(price)` arguments can be made more explicit. The result is
    ddply( dd[, 2:3], ~price, function(x) sum( x$volume ) )
The `summarize` may be nice, pretty, and correct, but it just isn't quicker than our function. That makes sense; just look at our little function and the simple `sum` versus the [code][3] for `summarize`. Running your benchmarks with the revised call suggests a noticeable gain. Don't take that to mean you've used plyr incorrectly; it just isn't efficient.
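As a quick sanity check (a sketch, not run as part of the benchmarks; it assumes the same `dd` as below), the two forms should agree on the totals; the anonymous-function version just names its result column `V1`:
```{r, eval=FALSE}
res_helper <- ddply( dd, .(price), summarise, ss=sum(volume) )
res_direct <- ddply( dd[, 2:3], ~price, function(x) sum( x$volume ) )
all.equal( res_helper$ss, res_direct$V1 )  # should be TRUE
```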
In my opinion the resulting call isn't as easily understood and must be mentally parsed, which partly negates the gain, _and_ even with a ~60% speedup we are still ridiculously slow compared with data.table.
---
In the same [thread][4] mentioned above regarding the slowness of plyr, a plyr2 project is mentioned. Since the original answer to the question was written, the plyr author has released [`dplyr`][1] as the successor to plyr. While both plyr and dplyr are billed as data manipulation tools and your primary stated interest is aggregation, you may still be interested in benchmark results for the new package, for comparison.
```{r, comment=''}
# Aggregation implementations under test: total volume per price level.
plyr_Original  <- function(dd) ddply( dd, .(price), summarise, ss=sum(volume))
plyr_Optimized <- function(dd) ddply( dd[, 2:3], ~price, function(x) sum( x$volume ) )
data_table     <- function(dd) dd[, sum(volume), keyby=price]
dplyr          <- function(dd) dd %.% group_by(price) %.% summarize( sum(volume) )
```
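A note on syntax: `%.%` is the chaining operator from the early dplyr releases (later superseded by `%>%`). The same aggregation can be written without the chain, roughly:
```{r, eval=FALSE}
# equivalent dplyr call without the %.% chain operator
summarize( group_by(dd, price), sum(volume) )
```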
The `dataframe` package has been removed from CRAN and has accordingly been dropped from the tests, along with the matrix-based function versions.
```{r, echo=FALSE, results='hide'}
capture.output( suppressPackageStartupMessages( {
require(plyr, quietly=T);
require(dplyr, quietly=T);
require(data.table, quietly=T);
require(xts, quietly=T);
require(rbenchmark, quietly=T);
require(microbenchmark, quietly=T)
} ) )
```
```{r, echo=FALSE, comment=''}
# Base-R aggregation variants for comparison
t.apply <- function(dd) unlist(tapply(dd$volume, dd$price, sum))
l.apply <- function(dd) unlist(lapply(split(dd$volume, dd$price), sum))
b.y <- function(dd) unlist(by(dd$volume, dd$price, sum))
agg <- function(dd) aggregate(dd$volume, list(dd$price), sum)
# Candidate observation counts and a year of per-second timestamps
obs <- c(5e1, 5e2, 5e3, 5e4, 5e5, 5e6, 5e6, 5e7, 5e8)
timS <- timeBasedSeq('20110101 083000/20120101 083000')
bmkRL <- list()
reps <- 5
i <- 5   # obs[5] = 5e5 observations for this run
j <- 8   # controls the number of unique prices
tt <- timS[1:obs[i]]
pxl <- seq(0.9, 1.1, by= (1.1 - 0.9)/floor(obs[i]/(11-j)))
px <- sample(pxl, length(tt), replace=TRUE)
set.seed(1)
vol <- rnorm(length(tt), 1000, 100)
d.df <- base::data.frame(time=tt, price=px, volume=vol)
d.dt <- data.table(d.df)
listLabel <- paste( 'obs=', formatC( obs[i], format='d', big.mark=','),
'unique prices=', formatC( length(unique(px)), format='d', big.mark=','),
'reps=', reps)
bmkRL[[listLabel]] <- benchmark( plyr_Original(d.df),
plyr_Optimized(d.df),
dplyr(d.df),
dplyr(d.dt),
t.apply(d.df),
l.apply(d.df),
b.y(d.df),
agg(d.df),
data_table(d.dt),
columns =c('test', 'elapsed', 'relative'),
replications = reps,
order = 'elapsed')
print( bmkRL[1] )
```
For a little perspective on the slowness of the data.frame structure, here are micro-benchmarks of the aggregation times of `data_table` and `dplyr` using your largest test dataset.
```{r, echo=FALSE, comment=''}
i <- j <- 8   # largest test dataset
tt <- timS[1:obs[i]]
pxl <- seq(0.9, 1.1, by= (1.1 - 0.9)/floor(obs[i]/(11-j)))
px <- sample(pxl, length(tt), replace=TRUE)
set.seed(1)
vol <- rnorm(length(tt), 1000, 100)
createTimedf <- system.time({
d.df <- base::data.frame(time=tt, price=px, volume=vol)
})
createTimedt <- system.time({
d.dt <- data.table(time=tt, price=px, volume=vol)
})
reps <- 10   # microbenchmark iterations for this run
listLabel <- paste( 'obs=', formatC( obs[i], format='d', big.mark=','),
'unique prices=', formatC( length(unique(px)), format='d', big.mark=','),
'reps=', reps)
bmkRL[[listLabel]] <- microbenchmark( data_table(d.dt), dplyr(d.dt), dplyr(d.df), times=reps )
print(bmkRL[2])
```
The elapsed `system.time` to create the `d.df` data.frame was `r createTimedf["elapsed"]` seconds, and for the `d.dt` data.table it was `r createTimedt["elapsed"]` seconds.
Notice that *both* creation and aggregation of the data.frame are slower than those of the data.table.
In the end... **plyr is slow because the underlying data.frame used in plyr is slow.**
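To isolate that cost, here is a minimal sketch (assuming the packages loaded above) comparing the per-call overhead of `data.frame()` against a bare `list()` holding the same contents; a comparable construction cost is what `ddply` pays once per group:
```{r, eval=FALSE}
microbenchmark(
  data.frame(ss = 1000),  # heavy: validation, row names, class setup
  list(ss = 1000),        # light: same data, no data.frame machinery
  times = 1000
)
```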
---
```{r, echo=FALSE}
print( sessionInfo(), locale=F )
```
[1]: https://github.com/hadley/dplyr
[2]: https://groups.google.com/forum/?fromgroups#!msg/manipulatr/Xo3-2FBI35k/9pClNUuxoPIJ%5B1-25%5D
[3]: https://github.com/hadley/plyr/blob/master/R/helper-summarise.r
[4]: https://groups.google.com/forum/?fromgroups#!msg/manipulatr/Xo3-2FBI35k/9pClNUuxoPIJ%5B1-25%5D