Created
June 20, 2017 00:25
-
-
Save mdsumner/c086a5005c59373f4965fa6afd0d5a7c to your computer and use it in GitHub Desktop.
Split is the nest bottle-neck
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
data("wrld_simpl", package = "maptools") | |
w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl[1:50, ])) | |
library(tidyr) | |
system.time(nest(w, -hole, -piece, -group, -id)) | |
#user system elapsed | |
#0.508 0.004 0.515 | |
## down in tidyr:::nest_impl the group-idx is used like this, which | |
## takes all the time | |
system.time(split(w, w$group)) | |
#user system elapsed | |
#0.504 0.004 0.505 | |
## if there's a faster split(d, idx) for data.frames that would be speed up | |
## (remember split sorts by "name" hence the final [unique(idx)] in nest_impl) |
> data("wrld_simpl", package = "maptools")
>
> w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl[1:50, ]))
Loading required package: sp
> system.time({x <- dplyr::select(w, -hole, -piece, -group, -id)
+ x <- purrr::map_df(x, as.character)
+ x <- apply(t(as.matrix(x)), 2, as.list)
+ w$data <- x})
user system elapsed
0.066 0.001 0.068
Here's a crazy idea, not well tested ...
EDIT: had left in a testing subset call that was not intended
data("wrld_simpl", package = "maptools")
w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl))
system.time(d1 <- split(w, w$group))
library(dplyr)
library(purrr)
split_each <- function(x, g) {
lapply(x, function(a) split(a, g))
}
faster_tibble <- function(x) {
structure(x, class = c("tbl_df", "tbl", "data.frame"), row.names = as.character(seq_along(x[[1]])))
}
system.time({
aa <- split_each(w, w$group)
d2 <- lapply(transpose(aa), faster_tibble)
})
I think this is doomed, it's not generally performant.
Far better is to avoid split altogether and just lapply on the split index, as split.data.frame does.
I haven't check this any where near enough, but I need it and it's good enough in my context
data("wrld_simpl", package = "maptools")
w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl[1:50, ]))
split_tibble <- function (x, f, drop = FALSE, ...)
lapply(split(x = seq_len(nrow(x)), f = f, ...),
function(ind) tibble::as_tibble(lapply(x, "[", ind)))
split_slice <- function (x, f, drop = FALSE, ...)
lapply(split(x = seq_len(nrow(x)), f = f, ...),
function(ind) dplyr::slice(x, ind))
faster_as_tibble <- function(x) {
structure(x, class = c("tbl_df", "tbl", "data.frame"), row.names = as.character(seq_along(x[[1]])))
}
split_fast_tibble <- function (x, f, drop = FALSE, ...)
lapply(split(x = seq_len(nrow(x)), f = f, ...),
function(ind) faster_as_tibble(lapply(x, "[", ind)))
system.time(split(w, w$group))
system.time(split_tibble(w, w$group))
system.time(split_slice(w, w$group))
system.time(split_fast_tibble(w, w$group))
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
This got me a 2x speed up with my data but the disadvantage is all the data needs to be of the same type (because of the matrix step). Doesn't hurt my use case, but probably isn't broadly applicable.