Skip to content

Instantly share code, notes, and snippets.

@mikelove
Created July 14, 2023 13:47
Show Gist options
  • Save mikelove/d788831af3cf76de642ba03af7a0124b to your computer and use it in GitHub Desktop.
Save mikelove/d788831af3cf76de642ba03af7a0124b to your computer and use it in GitHub Desktop.
issues with summarizing over many groups in plyranges - related to Issue #30
library(plyranges)
library(microbenchmark)
library(dplyr)
library(tibble)
make_rand_gr <- function(N, grps) {
data.frame(seqnames = sample(c("seq1", "seq2", "seq3"), N, replace = TRUE),
strand = sample(c("+", "-", "*"), N, replace = TRUE), start = rpois(N,
N), width = rpois(N, N), grps = sample(grps, N, replace = TRUE),
score = runif(N)) %>% as_granges()
}
set.seed(1)
r <- make_rand_gr(10000L, 1:100)
microbenchmark(
plyra = r %>% group_by(grps) %>% mutate(n = plyranges::n(), mn = mean(score)),
dplyr = r %>% as_tibble() %>% group_by(grps) %>% mutate(n = n(), mn = mean(score)),
times = 5)
## Unit: milliseconds
## expr min lq mean median uq max neval
## plyra 607.878177 608.307529 613.92383 609.944290 621.163120 622.326044 5
## dplyr 3.563556 3.630468 4.09772 4.117753 4.179581 4.997244 5
@mikelove
Copy link
Author

Edge case, want to be able to preserve S4 columns, or at least push this choice to user:

make_rand_gr <- function(N, grps) {
  r <- data.frame(seqnames = sample(c("seq1", "seq2", "seq3"), N, replace = TRUE), 
    strand = sample(c("+", "-", "*"), N, replace = TRUE), start = rpois(N, 
      N), width = rpois(N, N), grps = sample(grps, N, replace = TRUE), 
    score = runif(N)) %>% as_granges()
    # try out S4 column:
    r$score <- NumericList(as.list(r$score))
    r
}

In current plyranges this works but is slow:

> r %>% group_by(grps) %>% mutate(mn = mean(do.call(c, score))) %>% arrange(grps)
GRanges object with 10000 ranges and 3 metadata columns:
Groups: grps [100]
          seqnames      ranges strand |      grps         score        mn
             <Rle>   <IRanges>  <Rle> | <integer> <NumericList> <numeric>
      [1]     seq3  9897-19958      + |         1      0.200175  0.510507
      [2]     seq1  9840-19796      + |         1      0.251189  0.510507
      [3]     seq2 10148-19922      * |         1      0.595466  0.510507
      [4]     seq1  9871-19961      + |         1      0.562359  0.510507
      [5]     seq3 10054-20124      - |         1     0.0157297  0.510507
      ...      ...         ...    ... .       ...           ...       ...
   [9996]     seq1 10073-20173      - |       100      0.483839  0.500596
   [9997]     seq2  9936-19887      * |       100      0.474332  0.500596
   [9998]     seq1 10025-19999      - |       100      0.233531  0.500596
   [9999]     seq1  9933-19780      - |       100       0.93272  0.500596
  [10000]     seq3 10003-19864      + |       100      0.363325  0.500596
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment