```{r custom}
library(VariantAnnotation)
library(ggplot2)
library(pheatmap)
library(scales)
library(gridExtra)
library(gtools)
library(RColorBrewer)
library(knitr)
library(tidyr)
library(reshape)
library(rmarkdown)
library(dplyr)
# library(ggbio)
number_ticks <- function(n) {function(limits) pretty(limits, n)}
options(bitmapType = 'cairo')
path_results = "."
```
```{r create-report, echo=FALSE, eval=FALSE}
knitr::opts_chunk$set(tidy=TRUE, highlight=TRUE, dev="png",
    cache=FALSE, autodep=TRUE, warning=FALSE, error=FALSE,
    eval=TRUE, fig.width=9, echo=FALSE,
    message=FALSE, prompt=TRUE, comment='', fig.cap='', bootstrap.show.code=FALSE)
render(file.path(path_results, "report-ready.Rmd"))
```
Various quality metrics generated as part of a bcbio workflow. While these are useful to get a general sense of the overall quality, we still recommend exploring at least a subset of the underlying data manually to check for outliers or details that might be obscured by summarizing across all samples.
## Sample similarity
A quick sanity check to detect samples that are considered 'identical'. This could be due to technical / biological replicates, tumor / normal pairing, or samples from members of the same family. By default we highlight samples that share more than 5% similarity based on identical variant calls (the [`qSignature` method](http://sourceforge.net/p/adamajava/wiki/qSignature/)), which should flag samples that are first-degree relatives or replicates.
> Can we have a simple heatmap or dendrogram of sample similarity? No self-similarity measurements but still useful at a glance. (A rough sketch follows the chunk below.)
```{r qsignature, results='asis'}
sim = read.table(file.path(path_results, "qsignature.ma"))
names(sim) = c("sample1", "sample2", "score")
kable(sim %>% filter(score<0.1) %>% arrange(score))
ggplot(data = sim, aes(x=sample1, y=sample2, fill=score)) + geom_tile()
```
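One way to answer the note above, sketched under the assumption that `qsignature.ma` lists each sample pair once: symmetrize the pairwise scores into a matrix and hand it to `hclust` and `pheatmap`. Untested against real qSignature output, hence `eval=FALSE`.
```{r qsignature-cluster, eval=FALSE}
# Rough sketch, assuming `sim` from the chunk above with one row per
# sample pair; lower scores mean more similar samples.
samples = union(as.character(sim$sample1), as.character(sim$sample2))
m = matrix(0, length(samples), length(samples),
           dimnames=list(samples, samples))
m[cbind(as.character(sim$sample1), as.character(sim$sample2))] = sim$score
m[cbind(as.character(sim$sample2), as.character(sim$sample1))] = sim$score
# dendrogram: treat the scores directly as dissimilarities
plot(hclust(as.dist(m)), main="qSignature similarity (lower = more similar)")
# clustered heatmap of the pairwise scores
pheatmap(m)
```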
## Basic sample metrics
A quick summary of sample metrics to identify outliers. Off-target read counts are set to 0 for non-targeted experiments:
> Can we add coverage metrics to this table? Mean coverage of genome or targets, and 90% of regions covered at X? (A sketch with hypothetical column names follows the table below.)
```{r load-metrics}
qc = read.table(file.path(path_results, "metrics", "metrics.tsv"),
                header=T, sep="\t", check.names=F,
                colClasses=list("sample"="character"))
rownames(qc) = qc$sample
qc$Mapped_reads_pct = as.numeric(gsub("%", "", qc$Mapped_reads_pct))
qc$Duplicates_pct = as.numeric(gsub("%", "", qc$Duplicates_pct))
```
```{r table, results='asis'}
metrics = c("sample", "Total_reads", "Mapped_reads_pct", "Duplicates_pct",
            "offtarget", "%GC", "Sequence_length", "Median_insert_size")
if (is.null(qc$offtarget)) {
    qc$offtarget <- rep(0, length(qc$sample))
} else {
    qc$offtarget = qc$offtarget/qc$Total_reads
}
# qc$secondary = qc$secondary/qc$Total_reads
# qc$singletons = qc$singletons/qc$Total_reads
print(kable(qc[, metrics], align="c", digits=2))
```
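For the coverage columns requested above, something along these lines would work once the metrics file carries them; `Avg_coverage` and `Pct_regions_90pct_covered` are hypothetical column names, not fields bcbio is known to emit.
```{r table-coverage, results='asis', eval=FALSE}
# Sketch only: the column names below are hypothetical placeholders for
# whatever coverage fields the bcbio metrics file ends up providing.
cov_metrics = c("sample", "Avg_coverage", "Pct_regions_90pct_covered")
if (all(cov_metrics %in% names(qc))) {
    print(kable(qc[, cov_metrics], align="c", digits=2))
}
```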
### Total and mapped reads
The next two plots compare the number of reads in each sample (should be uniform) and the percentage of reads mapping to the reference genome. Low mapping rates are indicative of sample contamination, poor sequencing quality or other artifacts.
> Unify plot labels
```{r total-reads}
ggplot(qc, aes(x=sample, y=Total_reads/1e6)) +
geom_bar(stat = 'identity') +
ylab("Million reads") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
```{r mapped-reads}
ggplot(qc, aes(x=sample, y=Mapped_reads/Total_reads)) +
geom_bar(stat = 'identity') +
ylab("Mapped / total reads") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
### Offtarget reads
In addition to total read count and mappability it is important to know whether the target enrichment (if any) worked, i.e., what percentage of reads fall on the amplified or captured regions and what percentage is considered 'off target'. On-target percentages should be uniform and ideally above 50% for most capture methods.
```{r off-reads}
ggplot(qc, aes(x=sample, y=offtarget/Mapped_reads)) +
geom_bar(stat = 'identity') +
ylab("Off-target / mapped reads") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
## Sequencing metrics
Various metrics related to the quality of the sequencing data itself, i.e., the FASTQ or BAM file as it came off the sequencer.
### Mean sequence quality along read
Median base quality for each sample, along with the upper and lower quartiles and the 10th (lower) percentile. A slight drop-off towards the end of the read is normal, but the medians and quartiles should track each other closely with relatively few outliers. In addition, all plots should look similar to each other.
> ToDo: Needs better series labels, a label for the median plot, and better axis labels
> Could do with outlier detection; a rough sketch follows the plot
```{r read-qual}
qual = read.table(file.path(path_results, "fastqc", "Per_base_sequence_quality.tsv"),
                  header=T, sep="\t", check.names=T,
                  colClasses=list("sample"="character"))
qual$sample = as.character(qual$sample)
ggplot(qual, aes(x=Base, y=Median, group=sample)) +
    geom_line(aes(col="median")) +
    geom_line(aes(x=Base, y=Lower_Quartile, group=sample, col="lower_quartile")) +
    geom_line(aes(x=Base, y=Upper_Quartile, group=sample, col="upper_quartile")) +
    geom_line(aes(x=Base, y=X10th_Percentile, group=sample, col="10th_percentile")) +
    facet_wrap(~sample) +
    ylim(0,45) +
    scale_color_brewer("metrics", palette = "Set1") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
    scale_x_discrete(breaks=number_ticks(10))
```
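A rough take on the outlier detection note: compare each sample's per-position median quality against the across-sample median at that position and list samples that fall well below it. The 5-point threshold is an arbitrary starting value.
```{r read-qual-outliers, results='asis', eval=FALSE}
# Sketch: flag samples whose median quality at any read position drops
# more than 5 points below the across-sample median at that position.
# The threshold is arbitrary; tune it on real data.
outliers = qual %>%
    group_by(Base) %>%
    mutate(all_median = median(Median)) %>%
    ungroup() %>%
    filter(Median < all_median - 5) %>%
    distinct(sample)
kable(outliers)
```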
### Read length distribution
This should ideally be uniform across all samples with one distinct read length dominating the distribution.
```{r read-size}
qual = read.table(file.path(path_results, "fastqc", "Sequence_Length_Distribution.tsv"),
                  header=T, sep="\t", check.names=F,
                  colClasses=list("sample"="character"))
qual = qual %>% group_by(sample) %>% mutate(total=sum(Count), pct=Count/total)
ggplot(qual , aes(x=Length, y=Count, group=sample)) +
geom_line(size=2, alpha=0.5) +
scale_color_brewer(palette = "Set1") +
theme_bw() +
facet_wrap(~sample) +
labs(y="# of reads")+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
scale_x_discrete(breaks=number_ticks(5))
```
The following table lists for each sample the read lengths contributing more than 10% of that sample's total number of reads:
```{r table-size, results='asis'}
kable(qual %>% filter(pct > 0.10) %>% dplyr::select(Length, sample, pct) %>% spread(Length, pct), align="c", digits=2)
```
### Read GC content
For re-sequencing projects the G/C nucleotide content of sequenced reads should follow the genome distribution. Here we check whether any samples have reads outside a 10-90% G/C range contributing more than 5% of the overall reads. These cutoffs can be tweaked as needed based on the G/C distribution graph:
> Added a plot that might be useful to detect outliers even if they fall within cutoffs.
```{r table-gc, results='asis'}
qual = read.table(file.path(path_results, "fastqc", "Per_sequence_GC_content.tsv"),
                  header=T, sep="\t", check.names=F,
                  colClasses=list("sample"="character"))
qual$percent <- qual$Count / qual$total
ggplot(qual, aes(x=GC_Content, y=percent, group=sample)) +
geom_line(size=2, alpha=0.5) +
scale_color_brewer(palette = "Set1") +
theme_bw() +
labs(y='Percent of reads', x='Percent G/C content')
qual = qual %>% mutate(pct=Count/total) %>% filter(GC_Content>10 & GC_Content<90) %>%
group_by(sample) %>% summarise(pct_in_10_90=sum(pct))
kable(qual %>% filter(pct_in_10_90 <= 0.95))
```
### Read nucleotide content
Expanding on the G/C read content analysis, this checks for biases in nucleotide content along the position in the read. Typically, biases are introduced due to preferential (non-random) primer binding (at the beginning of the read) or other artifacts; strong biases are indicative of technical problems:
> Remove the 'total' plot
```{r read-content}
qual = read.table(file.path(path_results, "fastqc", "Per_base_sequence_content.tsv"),
                  header=T, sep="\t", check.names=F,
                  colClasses=list("sample"="character"))
qual$sample = as.character(qual$sample)
qual$Base = as.numeric(qual$Base)
dd = melt(qual, id.vars = c("Base", "sample"), variable_name = c("nt"))
ggplot(dd, aes(x=Base, y=value, group=sample)) +
geom_line(size=2, alpha=0.5) +
theme_bw() +
ylab("% of nucleotides") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
ylim(10,50) +
facet_wrap(~nt)
```
### Read nucleotide biases
To help trace down possible adapter contamination, missed barcodes or other biases we list for each sample the read positions where any nucleotide is present in either less than 10% or more than 30% of the reads:
```{r table-content, results='asis', eval=TRUE}
kable( melt(qual, id.vars=c("Base", "sample"), variable_name = "nt") %>%
filter(value > 30 | value < 10) %>% filter(Base<20) %>%
dplyr::select(Base, sample, nt, value) %>%
spread(Base, value),
align="c", digits=2)
```
## Coverage analysis
Maybe more important than the total number of reads is the coverage information: at what read count is each base (genome-wide or within a target set) covered? For germline calls we need a minimum of ~13X to call heterozygous variants reliably; this number goes up for tumor samples due to purity and clonality confounders.
If no target region was given we calculate these metrics for a list of 'medically actionable genes' based on the [Personalis ACE Clinical Exome](http://www.personalis.com/clinical/).
```{r total-coverage-load}
get_quantile_cov = function(path){
cov_tab = data.frame()
cov_regions = data.frame()
for (fn in list.files(path, pattern = "_coverage.bed", full.names = TRUE)){
d = read.table(fn, header=T, sep="\t", comment.char = "%")
for (i in c("q10", "q20", "q50")){
# q = quantile(d[,i],c(.10,.25,.50,.75,.90))
q = quantile(d[,i],c(0.01,.10,.25,.50,.75,.90,.99))
labels=factor(rev(names(q)),levels=c("1%","10%","25%","50%","75%","90%","99%"))
s = gsub("-ready","",d[1,"sample"])
t = data.frame(quantiles=q*100, type=labels, sample=s, min_reads=i)
cov_tab = rbind(cov_tab, t)
}
ma = (d %>% dplyr::select(q10, q20, q50)) > 0.60
if (nrow(cov_regions) == 0){
cov_regions = ma
}else{
cov_regions = cov_regions + ma
}
}
row.names(cov_regions) = as.character(d$region)
list("sample" = cov_tab, "region" = as.data.frame(cov_regions))
}
get_total_cov = function(path){
cov_tab = data.frame()
for (fn in list.files(path, pattern = "_cov_total.tsv", full.names = TRUE)){
d = read.table(fn, header=F, sep="\t")
d = d[order(d[,1]),]
pct = cumsum(d[,2])/d[,3] * 100
s = gsub("-ready","",d[1,4])
t = data.frame(depth=d[,1], bases=pct, sample=s)
cov_tab = rbind(cov_tab, t)
}
cov_tab
}
make_total_cov_plots = function(cov_tab){
    p = ggplot(cov_tab, aes(y=bases, x=depth, group=sample)) +
        geom_line(size=2, alpha=.5) +
        theme_bw() +
        labs(y="% of bed file > depth", x="# of reads")
    print(p)
}
make_quantile_plots = function(cov_tab){
    p1 = ggplot(cov_tab %>% filter(min_reads=='q10'), aes(x=type, y=quantiles, group=sample)) +
        geom_line(size=2, alpha=.5) +
        theme_bw() +
        labs(x="% of target regions with\nmore than X bases covered",
             y="% of nt covered\ninside the target",
             title="considered covered when nt has >10 reads")
    p2 = ggplot(cov_tab %>% filter(min_reads=='q20'), aes(x=type, y=quantiles, group=sample)) +
        geom_line(size=2, alpha=.5) +
        theme_bw() +
        labs(x="% of target regions with\nmore than X bases covered",
             y="% of nt covered\ninside the target",
             title="considered covered when nt has >20 reads")
    p3 = ggplot(cov_tab %>% filter(min_reads=='q50'), aes(x=type, y=quantiles, group=sample)) +
        geom_line(size=2, alpha=.5) +
        theme_bw() +
        labs(x="% of target regions with\nmore than X bases covered",
             y="% of nt covered\ninside the target",
             title="considered covered when nt has >50 reads")
    grid.arrange(p1, p2, p3, ncol=1)
}
make_quantile_region_plots = function(dd, n_samples){
    # dd: per-region counts of samples where >60% of bases reach the cutoff
    p1 = ggplot(dd %>% group_by(q10) %>% summarise(n_regions = n()) %>% filter(q10<n_samples*0.8), aes(x=n_regions, y=q10)) +
        geom_point(size=2, alpha=.5) +
        theme_bw() +
        labs(x="number of regions", y="# of samples with region covered",
             title="considered covered when nt has >10 reads; only regions covered in <80% of samples")
    p2 = ggplot(dd %>% group_by(q20) %>% summarise(n_regions = n()) %>% filter(q20<n_samples*0.8), aes(x=n_regions, y=q20)) +
        geom_point(size=2, alpha=.5) +
        theme_bw() +
        labs(x="number of regions", y="# of samples with region covered",
             title="considered covered when nt has >20 reads; only regions covered in <80% of samples")
    p3 = ggplot(dd %>% group_by(q50) %>% summarise(n_regions = n()) %>% filter(q50<n_samples*0.8), aes(x=n_regions, y=q50)) +
        geom_point(size=2, alpha=.5) +
        theme_bw() +
        labs(x="number of regions", y="# of samples with region covered",
             title="considered covered when nt has >50 reads; only regions covered in <80% of samples")
    grid.arrange(p1, p2, p3, ncol=1)
}
```
### Total coverage by sample
Starting with a simple plot of how much of the target (BED file) is covered at increasing read depths, per sample.
> Y axis is off; graph should also be the other way (lower percentage of target regions)
```{r cov-total-fig, fig.height=6, fig.width=11, cache=TRUE}
n_samples = nrow(qc)
cov_tab_total = get_total_cov(file.path(path_results, "coverage"))
make_total_cov_plots(cov_tab_total)
```
### Completeness by sample
A more detailed look at the previous plot. For three different cutoffs (at least 10, 20 or 50 reads) these plots show, for a given share of target _regions_ (X-axis), what percentage of bases inside those regions is covered at that depth (Y-axis).
> Let's discuss. Still unclear on Y axis :)
```{r completeness-fig, fig.height=12, fig.width=12, cache=TRUE}
cov_tab = get_quantile_cov(file.path(path_results, "coverage"))
make_quantile_plots(cov_tab$sample)
```
Flagging problematic samples, i.e., those where more than 10% of target regions have less than 60% of their bases covered at a completeness cut-off of 10 reads.
> That's a bit lenient. I'd start to worry about a target region if it has more than 10% of bases below the cut-off of 10 reads, and about a sample if more than 10% of its target regions get flagged as problematic. (A sketch of this stricter rule follows the table.)
```{r table-completeness, results='asis'}
kable(cov_tab$sample %>% filter(type == "90%") %>%
spread(min_reads, quantiles) %>%
dplyr::select(targets_pct=type, sample, min_10=q10, min_20=q20, min_50=q50) %>%
dplyr::filter(min_10<60),
align="c", digits=2)
```
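A sketch of the stricter rule from the note above. The quantile summary discards per-region values, so this goes back to the `*_coverage.bed` files read by `get_quantile_cov`: a region is flagged when more than 10% of its bases fall below 10 reads (`q10 < 0.90`), and a sample is listed when more than 10% of its regions are flagged.
```{r table-completeness-strict, results='asis', eval=FALSE}
# Sketch of the stricter flagging rule; assumes the same *_coverage.bed
# files as get_quantile_cov, with per-region covered fractions in q10.
strict = data.frame()
for (fn in list.files(file.path(path_results, "coverage"),
                      pattern = "_coverage.bed", full.names = TRUE)){
    d = read.table(fn, header=T, sep="\t", comment.char = "%")
    s = gsub("-ready", "", d[1, "sample"])
    # fraction of regions with >10% of bases below the 10-read cutoff
    flagged = mean(d$q10 < 0.90)
    strict = rbind(strict, data.frame(sample=s, pct_regions_flagged=flagged))
}
kable(strict %>% filter(pct_regions_flagged > 0.10), align="c", digits=2)
```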
### Completeness by region
> As above, let's discuss
```{r completeness-region-fig, fig.height=12, fig.width=12, cache=TRUE}
make_quantile_region_plots(cov_tab$region, n_samples)
```
Regions covered in less than 60% of samples at a completeness cut-off of 10 reads.
> Also for discussion :)
```{r table-completeness-regions, results='asis'}
n_samples = nrow(qc)
kable(head(cov_tab$region %>% mutate(region=row.names(cov_tab$region)) %>%
dplyr::select(region, q10,q20,q50) %>% filter(q10 < n_samples*0.6),50),
align="c", digits=2)
write.table(cov_tab$region, file.path(path_results, "completeness_by_region_and_sample.tsv"))
```
### Coverage uniformity
> Not working for me right now; no bias data?
```{r cov-uniformity-load, echo=FALSE, eval=FALSE}
get_bias_cov = function(path){
    cov_tab = data.frame()
    for (fn in list.files(path, pattern = "_cov.tsv", full.names = TRUE)){
        d = read.table(fn, header=T, sep="\t")
        cv = d[,"std"]/d[,"mean"]
        bias = (d[,"ntdow"] + d[,"ntup"])/d[,"size"] * 100
        s = as.character(gsub("-ready","",d[1,"sample"]))
        t = data.frame(bias=bias, cv=cv, mean=d[,"mean"], sample=s)
        cov_tab = rbind(cov_tab, t)
    }
    cov_tab
}
make_bias_plot = function(cov_tab){
    p1 = ggplot(cov_tab, aes(x=log2(mean), y=cv)) +
        geom_point(alpha=0.5) +
        scale_color_brewer(palette = "Set1") +
        labs(y="coefficient of variation", x="log2(mean coverage)") +
        theme_bw() +
        ggtitle("coverage variation for each target region")
    p2 = ggplot(cov_tab, aes(x=sample, y=bias)) +
        geom_jitter(alpha=0.5) +
        scale_color_brewer(palette = "Set1") +
        theme_bw() +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
        labs(y="% of nt with coverage outside mean +/- 2SD") +
        ggtitle("% of nucleotides with extreme\ncoverage inside target regions")
    # grid.arrange(p1, p2, ncol=1)
    print(p2)
}
```
```{r cov-uniformity-fig, fig.height=12, fig.width=12, cache=TRUE, eval=FALSE, echo=FALSE}
bias_tab = get_bias_cov(file.path(path_results, "bias"))
make_bias_plot(bias_tab)
```
## Variant QC
Finally, looking at global metrics for variant calls such as the number of heterozygous calls, transition/transversion ratios, and the read coverage at sites that were called as variants.
> Unclear whether this is limited to the target / reporting regions or global. I think in general we need to clarify where metrics are a) global, b) limited to target regions, and c) limited to coverage regions.
```{r table-variants, results='asis'}
qc$ratio_het_hom = qc$Variations_heterozygous/qc$Variations_homozygous
metrics = c("sample", "Variations_total", "Variations_in_dbSNP_pct",
"Variations_heterozygous", "Variations_homozygous", "ratio_het_hom", "Transition/Transversion")
print(kable(qc[, metrics], align="c", digits=2))
```
### Variant coverage
Another coverage plot, this time limited to positions identified as variants. Read coverage on the X-axis (capped at 100X), percentage of variants exceeding that coverage on the Y-axis. The red line highlights the X=13 cutoff required for somewhat reliable heterozygous variant identification in germline samples.
```{r variants-coverage}
fns = list.files(file.path(path_results, "variants"), full.names = TRUE, pattern = "cg-depth-parse.tsv")
tab = data.frame()
for (fn in fns){
    dt = read.table(fn, header=T, sep="\t")
    # drop non-numeric depth entries (".") before converting
    dt = dt %>% filter(!grepl("[::.::]", depth))
    # as.character first: the column reads in as a factor when "." entries
    # are present, and as.numeric(factor) would return level codes
    dt[,2] = as.numeric(as.character(dt[,2]))
    q = quantile(dt[,2], c(0,.10,.25,.50,.75,.90,1))
    labels = factor(rev(names(q)), levels=c("0%","10%","25%","50%","75%","90%","100%"))
    dt = data.frame(variants_pct=labels, depth=q, sample=dt$sample[1])
    tab = rbind(tab, dt)
}
ggplot(tab, aes(x=depth, y=variants_pct, group=sample)) +
    geom_line(size=2, alpha=.5) +
    geom_vline(xintercept=13, color='red') +
    theme_bw() +
    xlim(0,100) +
    labs(x="# of reads", y="% variants with more than X reads", title="variants coverage")
```
### Variant G/C bias
Checking for G/C bias for called variants. This is again capped at 100X coverage; X-axis is G/C content of a read, Y-axis the number of reads supporting a variant, colour reflects the number of variants at that coverage and G/C window.
> Change x axis to 10% steps. Maybe filter out singletons. (A sketch follows the chunk below.)
```{r variants-coverage-gc, fig.width=15, fig.height=15}
rf <- colorRampPalette(rev(brewer.pal(11,'Spectral')))
r <- rf(32)
list_p = list()
fns = list.files(file.path(path_results, "variants"), full.names = TRUE, pattern = "cg-depth-parse.tsv")
for (fn in fns){
dt = read.table(fn, header=T,sep="\t")
dt = dt %>% filter(!grepl("[::.::]", depth))
# as.character first: the depth column reads in as a factor
dt[,2] = as.numeric(as.character(dt[,2]))
sample = dt$sample[1]
p = ggplot(dt, aes(CG, depth)) +
stat_bin2d() +
ylab("# of reads") +
scale_fill_gradientn(guide = FALSE,colours=r) +
theme_bw() +
ylim(0, 100) +
ggtitle(sample)
list_p[[as.character(sample)]]=p
}
do.call(grid.arrange, list_p)
```
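A sketch for the binning note above, shown for a single sample's `dt` as read in the loop: pre-bin G/C into 10% steps and depth into 5-read steps, then drop bins supported by a single variant. Bin widths are arbitrary starting points.
```{r variants-gc-binned, eval=FALSE}
# Sketch: 10% G/C bins with singleton bins removed; `dt` is one sample's
# table from the loop above, `r` the palette defined there.
binned = dt %>%
    mutate(CG_bin = cut(CG, breaks=seq(0, 100, by=10), include.lowest=TRUE),
           depth_bin = cut(depth, breaks=seq(0, 100, by=5), include.lowest=TRUE)) %>%
    filter(!is.na(depth_bin)) %>%    # depths >100 fall outside the bins
    count(CG_bin, depth_bin) %>%
    filter(n > 1)                    # drop singleton bins
ggplot(binned, aes(CG_bin, depth_bin, fill=n)) +
    geom_tile() +
    scale_fill_gradientn(colours=r) +
    theme_bw() +
    labs(x="G/C content (10% bins)", y="# of reads (5-read bins)")
```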
> As a global to do: are there RMarkdown table constructs that support color coding? I'd love to have a way to flag outliers in tables, and/or make the tables sortable by columns to find outliers. (One option sketched below.)
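One option for the note above, sketched with the `DT` package (not currently loaded by this report): `datatable` renders column-sortable HTML tables, and `formatStyle` with `styleInterval` can color-code cells; `formattable` is an alternative for inline color scales. The 90% cutoff is an arbitrary example.
```{r sortable-table, eval=FALSE}
# Sketch: sortable table with low mapping rates flagged in red.
# Requires the DT package; the 90 cutoff is an arbitrary example.
library(DT)
datatable(qc) %>%
    formatStyle("Mapped_reads_pct",
                color = styleInterval(90, c("red", "black")))
```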