Created
November 4, 2013 23:14
-
-
Save lcolladotor/7310971 to your computer and use it in GitHub Desktop.
For making http://rpubs.com/lcollado/10279
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Quantiles from Rle's with long lengths | |
====================================== | |
```{r echo=FALSE} | |
library(knitr) | |
render_html() | |
``` | |
The goal is to calculate quantiles on aggregated long Rle's. The information is split by chr, and combining it naively fails. | |
```{r} | |
suppressMessages(library("IRanges")) | |
## Create dummy data | |
lengths <- c(249250621, 243199373, 198022430, 191154276, 180915260, 171115067, 159138663, 146364022, 141213431, 135534747, 135006516, 133851895, 115169878, 107349540, 102531392, 90354753, 81195210, 78077248, 59128983, 63025520, 48129895, 51304566, 155270560, 59373566) | |
chrdata <- lapply(lengths, function(x) { | |
Rle(rep(0, x)) | |
}) | |
## Just to see visualize explore it (doesn't matter if the type is numeric-Rle or integer-Rle) | |
head(chrdata) | |
## Trying to combine | |
unlist(RleList(chrdata), use.names=FALSE) | |
traceback() | |
## Works manually up to the first 12 | |
tmp <- c(chrdata[[1]], chrdata[[2]], chrdata[[3]], chrdata[[4]], chrdata[[5]], chrdata[[6]], chrdata[[7]], chrdata[[8]], chrdata[[9]], chrdata[[10]], chrdata[[11]], chrdata[[12]]) | |
## For the first 13 it breaks | |
tmp <- c(chrdata[[1]], chrdata[[2]], chrdata[[3]], chrdata[[4]], chrdata[[5]], chrdata[[6]], chrdata[[7]], chrdata[[8]], chrdata[[9]], chrdata[[10]], chrdata[[11]], chrdata[[12]], chrdata[[13]]) | |
traceback() | |
## Note the main issue is related to the 2^31 ceiling | |
plot(cumsum(lengths), main="Length of Rle exceeding 2^31") | |
abline(v=13, col="red") | |
abline(h=2^31, col="orange") | |
``` | |
One idea: leave the Rle-world and use some kind of weighted quantile. | |
```{r} | |
## Manually combine in two parts | |
part1 <- unlist(RleList(chrdata[1:12]), use.names=FALSE) | |
part2 <- unlist(RleList(chrdata[13:length(chrdata)]), use.names=FALSE) | |
parts <- list(part1, part2) | |
## Reduce number of runs in the Rle's by sorting (kind of doing table()) | |
sortedParts <- lapply(parts, function(x) sort(x)) | |
## Leave Rle-world | |
info <- lapply(sortedParts, function(y) { | |
data.frame(value=runValue(y), length=runLength(y)) | |
}) | |
info <- do.call(rbind, info) | |
## Format | |
info$length <- as.numeric(info$length) | |
## Collapse | |
collapsed <- tapply(info$length, info$value, sum) | |
weights <- as.integer(names(collapsed)) | |
## Calculate weighted quantile of interest | |
library("Hmisc") | |
wtd.quantile(collapsed, weights=weights, probs = 0.5) | |
``` | |
Is there a way to calculate the quantiles of the aggregated Rle's without all this machinery? Are there other possibly more efficients ways to do so? | |
```{r} | |
## Reproducibility | |
sessionInfo() | |
``` | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment