Skip to content

Instantly share code, notes, and snippets.

@chapmanb
Created December 5, 2015 18:14
Show Gist options
  • Save chapmanb/46f3e3cabdd27e7fa341 to your computer and use it in GitHub Desktop.
Save chapmanb/46f3e3cabdd27e7fa341 to your computer and use it in GitHub Desktop.
Lorena small RNA post feedback

small RNA-seq with bcbio-nextgen

It would be good to have a few short sentences of introduction here to orient readers and give them an overview of the whole post.

  • Why is small RNA-seq analysis important
  • What types of analysis do you provide (this is in the pipeline section below), in a sentence.
  • What is bcbio
  • What will you show? Validation of the pipeline and demonstration of its capabilities.

reproducible code

R code


miRQC project

We used samples from mirRQC project paper. This project provides samples with known, relative amounts of small RNAs, enabling comparison of quantitation and detection of miRNAs. The main goal was to test different platforms for miRNA detection, but these are also great samples for benchmarking tools. I am using them to test how bcbio-nextgen works with small RNA data.

Quoted from the paper, samples are (see below figure, lower panel):

Universal Human miRNA reference RNA (Agilent Technologies, #750700), human brain total RNA (Life Technologies, #AM6050), human liver total RNA (Life Technologies, #AM7960) and MS2-phage RNA (Roche, #10165948001) were diluted to a platform-specific concentration. ... Synthetic miRNA templates for let-7a-5p, let-7b-5p, let-7c, let-7d-5p, miR-302a-3p, miR-302b-3p, miR-302c-3p, miR-302d-3p, miR-133a and miR-10a-5p were synthesized by Integrated DNA Technologies and 5′ phosphorylated. Synthetic let-7 and miR-302 miRNAs were spiked into MS2-phage RNA and total human liver RNA, respectively, at 5 × 106 copies/μg RNA.

samples

Pipeline

You can read more about bcbio here. There are 4 main steps in the small RNA-seq pipeline:

  • adapter removal
  • miRNA quantification
  • other small RNAs detection and quantification
  • quality control metrics

adapter removal

We integrated cutadapt allowing a minimum read size of 17 nts and removing the adapter if it is at least 8 nts long.

miRNA quantification

bcbio uses seqbuster for this step. It has been used by gEUvaids consortium for miRNA quantification and allows isomiRs analysis as well. Read more about why isomiRs are important.

other smallRNAs quantification

bcbio uses seqcluster to detect unique units of transcription over the genome, allowing resolutions of small RNAs found in multiple genomic locations. Normally these small RNAs are dropped because they map multiple times on the genome and require special analysis to avoid bias in the quantification. Read more about why other small RNAs are important.

quality control metrics

bcbio summarizes Fastqc metrics for each sample. Together with different metrics from the previous steps, the user has an idea of the quality of the samples and the overall project. It includes fastqc results, size distribution after adapter removal and amount of small RNAs mapped to miRNAs, tRNA, rRNA, repeats among others. Other metrics like amount of data used until the end of the analysis, or warning flags if the data is noisy are provided by seqcluster and included in the final Rmd template report. (TODO: Provide a description and pointer to what Rmarkdown is)

automatic report

bcbio generates a Rmd template report to make easy the visualization of all the results from each of the steps. It will be inside report folder in the working directory.

Results

The mirRQC samples allow us to measure quantitation and detection accuracy of specific miRs for the tools integrated in bcbio.

size distribution

The size distribution shows easily the quality of your data. In a normal small RNA sample we should see a peak at 22/23 nt and maybe another at 26 or 31 depending on the biology of the samples.

size-distribution

note: there may be specific cases where this assumption is not true, but it applies to the majority of projects.

miRNA abundance detection of miRQC samples

There are 4 samples that help to validate the quantification:

  • A: universal human RNA sample
  • B: human brain sample
  • C: 25% of A + 75% of B
  • D: 25% of B + 75% of A

So we can assume:

  • If A > B then A > D > C > B
  • If B > A then A < D < C < B

Note that C/D samples are swapped in the paper and in the GEO web. The text from the paper says:

These samples (termed miRQC A–D) consist of 100% Universal Human miRNA Reference RNA (UHmiRR; A), 100% human brain RNA (HBR; B) and two titrations thereof (C = 0.75A + 0.25B and D = 0.25A + 0.75B).

While the text in the GEO web says:

Source name miRQC C Organism Homo sapiens Characteristics biomaterial: Mixture of 25% UHRR and 75% HuBr Total RNA

Source name miRQC D Organism Homo sapiens Characteristics biomaterial: Mixture of 75% UHRR and 25% HuBr Total RNA

After the analysis with bcbio, we can calculate the amount of miRs that follows the relative abundance rule. To measure this, I took the average from the replicates and kept only the miRs with a median > 5 counts after normalization (upper quantile normalization) among samples.

miRNAs which A > B are 111, and all of them follows A > D > C

miRNAs which B > A are 181 and 174 follows B > C > D

That is more than 95% of accuracy for miRs with more than 5 counts.

specificity

To evaluate specificity we used samples that included specific miRNAs that are not normally expressed there. These samples were analyzed in a different run.

we spiked in 8 synthetic miRNAs from two miRNA families into human liver RNA (miR-302a/b/c/d) or MS2-phage RNA (let-7a/b/c/d)

We should only see those miRNAs (let-7 and miR-302 families) in those samples.

heatmap-mir302 heatmap-mirlet7

hsa-let-7a appears in all samples. If you check the processed data is not due to wrong alignment. It would be interesting to figure out whether this signal is because of contamintation or errors in sequencing or/and amplification.

clusters abundance detection of miRQC samples

Similarly to miRs, we can do the same for other small RNAs detected by seqcluster. The results where very similar:

  • clusters which A > B are 147, where 139 (75 are known miRNAs) follow D > C
  • clusters which B > A are 230, where 222 (129 are known miRNAs) follow D > C

Timing and Resources

The time for 8 samples with 6 millions reads each was 3 hour and 19 minutes.

Total 3:19 total cores total memory GB
organize samples 0 1 1
trimming & miRNA 0:21 8 20
prepare 0:01 1 8
alignment 0:07 6 42.1
cluster 2:49 1 8
quality control 0:01 8 20
report 0 1 1

Conclusion

It would be nice to have a conclusion here, similar to the intro -- re-emphasize the major points and suggest additional work.

Thanks

--- lorena-mirqc.md.orig 2015-12-05 11:29:26.456290998 -0500
+++ lorena-mirqc.md 2015-12-05 13:11:01.364354674 -0500
@@ -1,6 +1,15 @@
small RNA-seq with bcbio-nextgen
=============================
+It would be good to have a few short sentences of introduction here to orient
+readers and give them an overview of the whole post.
+
+- Why is small RNA-seq analysis important
+- What types of analysis do you provide (this is in the pipeline section below),
+ in a sentence.
+- What is bcbio
+- What will you show? Validation of the pipeline and demonstration of its capabilities.
+
[reproducible code](http://seqcluster.readthedocs.org/example_pipeline.html)
[R code](https://github.com/lpantano/mypubs/blob/master/srnaseq/mirqc/ready_report.rmd)
@@ -11,15 +20,13 @@
--------------------
We used samples from mirRQC project
-[paper](http://www.nature.com/nmeth/journal/v11/n8/full/nmeth.3014.html). The
-idea behind this project is to provide samples that we know how much relative
-small RNAs they contain one respect another, or/and even the presence or
-absence of specific molecules, such as miRNAs. The main goal was to test
-different platforms for miRNA detection, but I think these are great samples for
-benchmarking any tools as well. I am using them to test how [bcbio-nextgen]()
+[paper](http://www.nature.com/nmeth/journal/v11/n8/full/nmeth.3014.html).
+This project provides samples with known, relative amounts of small RNAs, enabling
+comparison of quantitation and detection of miRNAs. The main goal was to test
+different platforms for miRNA detection, but these are also great samples for
+benchmarking tools. I am using them to test how [bcbio-nextgen]()
works with small RNA data.
-
Quoted from the paper, samples are (see below figure, lower panel):
> Universal Human miRNA reference RNA (Agilent Technologies, #750700), human
@@ -40,7 +47,6 @@
You can read more about bcbio [here](http://github.com/chapmanb/bcbio-nextgen).
There are 4 main steps in the small RNA-seq pipeline:
-
* adapter removal
* miRNA quantification
* other small RNAs detection and quantification
@@ -64,10 +70,10 @@
### other smallRNAs quantification
`bcbio` uses [seqcluster](http://github.com/lpantano/seqcluster) to detect
-unique units of transcription over the genome with the main advantage being the
-possibility to detect small RNAs that may come from different places. Normally
+unique units of transcription over the genome, allowing resolutions of small
+RNAs found in multiple genomic locations. Normally
these small RNAs are dropped because they map multiple times on the genome and
-require an special analysis to avoid bias in the quantification. Read more about
+require special analysis to avoid bias in the quantification. Read more about
why
[other small RNAs are important](http://seqcluster.readthedocs.org/literature.html).
@@ -75,11 +81,12 @@
### quality control metrics
`bcbio` summarizes `Fastqc` metrics for each sample. Together with different
metrics from the previous steps, the user has an idea of the quality of the
-samples and the overall project. It includes the `fastqc` results, size
+samples and the overall project. It includes `fastqc` results, size
distribution after adapter removal and amount of small RNAs mapped to miRNAs,
tRNA, rRNA, repeats among others. Other metrics like amount of data used until
the end of the analysis, or warning flags if the data is noisy are provided by
-`seqcluster` and included in the final _Rmd_ template report.
+`seqcluster` and included in the final _Rmd_ template report. (TODO: Provide a
+description and pointer to what Rmarkdown is)
### automatic report
@@ -90,10 +97,8 @@
## Results
-The advantage to use these samples is that we can measure the accuracy in the
-quantification of the tools used inside `bcbio` and the detection of specific
-miRs.
-
+The mirRQC samples allow us to measure quantitation and detection
+accuracy of specific miRs for the tools integrated in `bcbio`.
### size distribution
The size distribution shows easily the quality of your data. In a normal small
@@ -144,11 +149,11 @@
miRNAs which B > A are 181 and 174 follows B > C > D
-That is more than 95% of accuracy for miRs with more than 5 counts.
+That is more than 95% of accuracy for miRs with more than 5 counts.
### specificity
-To detect specificity we used samples that included specific miRNAs that are not
+To evaluate specificity we used samples that included specific miRNAs that are not
normally expressed there. These samples were analyzed in a different
[run](https://github.com/lpantano/seqcluster/blob/master/data/pipeline_example/mirqc/non_mirqc_bcbio.csv).
@@ -184,6 +189,11 @@
quality control|0:01|8|20
report | 0| 1| 1
+## Conclusion
+
+It would be nice to have a conclusion here, similar to the intro -- re-emphasize
+the major points and suggest additional work.
+
# Thanks
* [Harvard T.H. Chan School of Public Health](http://bioinformatics.sph.harvard.edu)
for supporting the integration of small RNAseq pipeline in
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment