Skip to content

Instantly share code, notes, and snippets.

@mdsumner
Created June 20, 2017 00:25
Show Gist options
  • Save mdsumner/c086a5005c59373f4965fa6afd0d5a7c to your computer and use it in GitHub Desktop.
Save mdsumner/c086a5005c59373f4965fa6afd0d5a7c to your computer and use it in GitHub Desktop.
Split is the nest bottle-neck
data("wrld_simpl", package = "maptools")
w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl[1:50, ]))
library(tidyr)
system.time(nest(w, -hole, -piece, -group, -id))
#user system elapsed
#0.508 0.004 0.515
## down in tidyr:::nest_impl the group-idx is used like this, which
## takes all the time
system.time(split(w, w$group))
#user system elapsed
#0.504 0.004 0.505
## if there's a faster split(d, idx) for data.frames that would be speed up
## (remember split sorts by "name" hence the final [unique(idx)] in nest_impl)
@jsonbecker
Copy link

jsonbecker commented Jun 20, 2017

x <- dplyr::select(w, -hole, -piece, -group, -id)
x <- purrr::map_df(x, as.character)
x <- apply(t(as.matrix(x)), 2, as.list)
w$data <- x

This got me a 2x speed up with my data but the disadvantage is all the data needs to be of the same type (because of the matrix step). Doesn't hurt my use case, but probably isn't broadly applicable.

@jsonbecker
Copy link

> data("wrld_simpl", package = "maptools")
> 
> w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl[1:50, ]))
Loading required package: sp
> system.time({x <- dplyr::select(w, -hole, -piece, -group, -id)
+ x <- purrr::map_df(x, as.character)
+ x <- apply(t(as.matrix(x)), 2, as.list)
+ w$data <- x})
   user  system elapsed 
  0.066   0.001   0.068 

@mdsumner
Copy link
Author

mdsumner commented Jun 21, 2017

Here's a crazy idea, not well tested ...

EDIT: had left in a testing subset call that was not intended

data("wrld_simpl", package = "maptools")

w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl))

system.time(d1 <- split(w, w$group))
library(dplyr)
library(purrr)
split_each <- function(x, g) {
  lapply(x, function(a) split(a, g))
}
faster_tibble <- function(x) {
  structure(x, class = c("tbl_df", "tbl", "data.frame"), row.names = as.character(seq_along(x[[1]])))
}
system.time({
aa <- split_each(w, w$group)
d2 <- lapply(transpose(aa), faster_tibble)
})

@mdsumner
Copy link
Author

I think this is doomed, it's not generally performant.

@mdsumner
Copy link
Author

mdsumner commented Jun 25, 2017

Far better is to avoid split altogether and just lapply on the split index, as split.data.frame does.

I haven't check this any where near enough, but I need it and it's good enough in my context

data("wrld_simpl", package = "maptools")

w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl[1:50, ]))


split_tibble <- function (x, f, drop = FALSE, ...) 
  lapply(split(x = seq_len(nrow(x)), f = f,  ...), 
         function(ind) tibble::as_tibble(lapply(x, "[", ind)))

split_slice <- function (x, f, drop = FALSE, ...) 
  lapply(split(x = seq_len(nrow(x)), f = f,  ...), 
         function(ind) dplyr::slice(x, ind))
faster_as_tibble <- function(x) {
  structure(x, class = c("tbl_df", "tbl", "data.frame"), row.names = as.character(seq_along(x[[1]])))
}
split_fast_tibble <- function (x, f, drop = FALSE, ...) 
  lapply(split(x = seq_len(nrow(x)), f = f,  ...), 
         function(ind) faster_as_tibble(lapply(x, "[", ind)))

system.time(split(w, w$group))
system.time(split_tibble(w, w$group))
system.time(split_slice(w, w$group))
system.time(split_fast_tibble(w, w$group))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment