Skip to content

Instantly share code, notes, and snippets.

@mdsumner
Created June 20, 2017 00:25
Show Gist options
  • Save mdsumner/c086a5005c59373f4965fa6afd0d5a7c to your computer and use it in GitHub Desktop.
Save mdsumner/c086a5005c59373f4965fa6afd0d5a7c to your computer and use it in GitHub Desktop.
Split is the nest bottle-neck
data("wrld_simpl", package = "maptools")
w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl[1:50, ]))
library(tidyr)
system.time(nest(w, -hole, -piece, -group, -id))
#user system elapsed
#0.508 0.004 0.515
## down in tidyr:::nest_impl the group-idx is used like this, which
## takes all the time
system.time(split(w, w$group))
#user system elapsed
#0.504 0.004 0.505
## if there's a faster split(d, idx) for data.frames that would be speed up
## (remember split sorts by "name" hence the final [unique(idx)] in nest_impl)
@mdsumner
Copy link
Author

mdsumner commented Jun 21, 2017

Here's a crazy idea, not well tested ...

EDIT: had left in a testing subset call that was not intended

data("wrld_simpl", package = "maptools")

w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl))

system.time(d1 <- split(w, w$group))
library(dplyr)
library(purrr)
split_each <- function(x, g) {
  lapply(x, function(a) split(a, g))
}
faster_tibble <- function(x) {
  structure(x, class = c("tbl_df", "tbl", "data.frame"), row.names = as.character(seq_along(x[[1]])))
}
system.time({
aa <- split_each(w, w$group)
d2 <- lapply(transpose(aa), faster_tibble)
})

@mdsumner
Copy link
Author

I think this is doomed, it's not generally performant.

@mdsumner
Copy link
Author

mdsumner commented Jun 25, 2017

Far better is to avoid split altogether and just lapply on the split index, as split.data.frame does.

I haven't check this any where near enough, but I need it and it's good enough in my context

data("wrld_simpl", package = "maptools")

w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl[1:50, ]))


split_tibble <- function (x, f, drop = FALSE, ...) 
  lapply(split(x = seq_len(nrow(x)), f = f,  ...), 
         function(ind) tibble::as_tibble(lapply(x, "[", ind)))

split_slice <- function (x, f, drop = FALSE, ...) 
  lapply(split(x = seq_len(nrow(x)), f = f,  ...), 
         function(ind) dplyr::slice(x, ind))
faster_as_tibble <- function(x) {
  structure(x, class = c("tbl_df", "tbl", "data.frame"), row.names = as.character(seq_along(x[[1]])))
}
split_fast_tibble <- function (x, f, drop = FALSE, ...) 
  lapply(split(x = seq_len(nrow(x)), f = f,  ...), 
         function(ind) faster_as_tibble(lapply(x, "[", ind)))

system.time(split(w, w$group))
system.time(split_tibble(w, w$group))
system.time(split_slice(w, w$group))
system.time(split_fast_tibble(w, w$group))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment