Skip to content

Instantly share code, notes, and snippets.

@lcolladotor
Created November 4, 2013 23:14
Show Gist options
  • Save lcolladotor/7310971 to your computer and use it in GitHub Desktop.
Save lcolladotor/7310971 to your computer and use it in GitHub Desktop.
Quantiles from Rle's with long lengths
======================================
```{r echo=FALSE}
library(knitr)
render_html()
```
The goal is to calculate quantiles on aggregated long Rle's. The information is split by chr, and combining it naively fails.
```{r}
suppressMessages(library("IRanges"))
## Create dummy data
lengths <- c(249250621, 243199373, 198022430, 191154276, 180915260, 171115067, 159138663, 146364022, 141213431, 135534747, 135006516, 133851895, 115169878, 107349540, 102531392, 90354753, 81195210, 78077248, 59128983, 63025520, 48129895, 51304566, 155270560, 59373566)
chrdata <- lapply(lengths, function(x) {
Rle(rep(0, x))
})
## Just to see visualize explore it (doesn't matter if the type is numeric-Rle or integer-Rle)
head(chrdata)
## Trying to combine
unlist(RleList(chrdata), use.names=FALSE)
traceback()
## Works manually up to the first 12
tmp <- c(chrdata[[1]], chrdata[[2]], chrdata[[3]], chrdata[[4]], chrdata[[5]], chrdata[[6]], chrdata[[7]], chrdata[[8]], chrdata[[9]], chrdata[[10]], chrdata[[11]], chrdata[[12]])
## For the first 13 it breaks
tmp <- c(chrdata[[1]], chrdata[[2]], chrdata[[3]], chrdata[[4]], chrdata[[5]], chrdata[[6]], chrdata[[7]], chrdata[[8]], chrdata[[9]], chrdata[[10]], chrdata[[11]], chrdata[[12]], chrdata[[13]])
traceback()
## Note the main issue is related to the 2^31 ceiling
plot(cumsum(lengths), main="Length of Rle exceeding 2^31")
abline(v=13, col="red")
abline(h=2^31, col="orange")
```
One idea: leave the Rle-world and use some kind of weighted quantile.
```{r}
## Manually combine in two parts
part1 <- unlist(RleList(chrdata[1:12]), use.names=FALSE)
part2 <- unlist(RleList(chrdata[13:length(chrdata)]), use.names=FALSE)
parts <- list(part1, part2)
## Reduce number of runs in the Rle's by sorting (kind of doing table())
sortedParts <- lapply(parts, function(x) sort(x))
## Leave Rle-world
info <- lapply(sortedParts, function(y) {
data.frame(value=runValue(y), length=runLength(y))
})
info <- do.call(rbind, info)
## Format
info$length <- as.numeric(info$length)
## Collapse
collapsed <- tapply(info$length, info$value, sum)
weights <- as.integer(names(collapsed))
## Calculate weighted quantile of interest
library("Hmisc")
wtd.quantile(collapsed, weights=weights, probs = 0.5)
```
Is there a way to calculate the quantiles of the aggregated Rle's without all this machinery? Are there other possibly more efficients ways to do so?
```{r}
## Reproducibility
sessionInfo()
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment