Created
June 20, 2017 00:25
-
-
Save mdsumner/c086a5005c59373f4965fa6afd0d5a7c to your computer and use it in GitHub Desktop.
Split is the nest bottle-neck
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
data("wrld_simpl", package = "maptools") | |
w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl[1:50, ])) | |
library(tidyr) | |
system.time(nest(w, -hole, -piece, -group, -id)) | |
#user system elapsed | |
#0.508 0.004 0.515 | |
## down in tidyr:::nest_impl the group-idx is used like this, which | |
## takes all the time | |
system.time(split(w, w$group)) | |
#user system elapsed | |
#0.504 0.004 0.505 | |
## if there's a faster split(d, idx) for data.frames that would be speed up | |
## (remember split sorts by "name" hence the final [unique(idx)] in nest_impl) |
jsonbecker
commented
Jun 20, 2017
Here's a crazy idea, not well tested ...
EDIT: had left in a testing subset call that was not intended
data("wrld_simpl", package = "maptools")
w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl))
system.time(d1 <- split(w, w$group))
library(dplyr)
library(purrr)
split_each <- function(x, g) {
lapply(x, function(a) split(a, g))
}
faster_tibble <- function(x) {
structure(x, class = c("tbl_df", "tbl", "data.frame"), row.names = as.character(seq_along(x[[1]])))
}
system.time({
aa <- split_each(w, w$group)
d2 <- lapply(transpose(aa), faster_tibble)
})
I think this is doomed, it's not generally performant.
Far better is to avoid split altogether and just lapply on the split index, as split.data.frame does.
I haven't check this any where near enough, but I need it and it's good enough in my context
data("wrld_simpl", package = "maptools")
w <- tibble::as_tibble(ggplot2::fortify(wrld_simpl[1:50, ]))
split_tibble <- function (x, f, drop = FALSE, ...)
lapply(split(x = seq_len(nrow(x)), f = f, ...),
function(ind) tibble::as_tibble(lapply(x, "[", ind)))
split_slice <- function (x, f, drop = FALSE, ...)
lapply(split(x = seq_len(nrow(x)), f = f, ...),
function(ind) dplyr::slice(x, ind))
faster_as_tibble <- function(x) {
structure(x, class = c("tbl_df", "tbl", "data.frame"), row.names = as.character(seq_along(x[[1]])))
}
split_fast_tibble <- function (x, f, drop = FALSE, ...)
lapply(split(x = seq_len(nrow(x)), f = f, ...),
function(ind) faster_as_tibble(lapply(x, "[", ind)))
system.time(split(w, w$group))
system.time(split_tibble(w, w$group))
system.time(split_slice(w, w$group))
system.time(split_fast_tibble(w, w$group))
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment