chapmanb/lorena-mirqc.md

## lorena-mirqc.md

      
    Raw
  

              lorena-mirqc.md
            
          
    small RNA-seq with bcbio-nextgen

It would be good to have a few short sentences of introduction here to orient
readers and give them an overview of the whole post.

Why is small RNA-seq analysis important
What types of analysis do you provide (this is in the pipeline section below),
in a sentence.
What is bcbio
What will you show? Validation of the pipeline and demonstration of its capabilities.

reproducible code
R code

miRQC project

We used samples from mirRQC project
paper.
This project provides samples with known, relative amounts of small RNAs, enabling
comparison of quantitation and detection of miRNAs. The main goal was to test
different platforms for miRNA detection, but these are also great samples for
benchmarking tools. I am using them to test how bcbio-nextgen
works with small RNA data.
Quoted from the paper, samples are (see below figure, lower panel):

Universal Human miRNA reference RNA (Agilent Technologies, #750700), human
brain total RNA (Life Technologies, #AM6050), human liver total RNA (Life
Technologies, #AM7960) and MS2-phage RNA (Roche, #10165948001) were diluted to
a platform-specific concentration. ... Synthetic miRNA templates for
let-7a-5p, let-7b-5p, let-7c, let-7d-5p, miR-302a-3p, miR-302b-3p,
miR-302c-3p, miR-302d-3p, miR-133a and miR-10a-5p were synthesized by
Integrated DNA Technologies and 5′ phosphorylated. Synthetic let-7 and miR-302
miRNAs were spiked into MS2-phage RNA and total human liver RNA, respectively,
at 5 × 106 copies/μg RNA.


Pipeline

You can read more about bcbio here.
There are 4 main steps in the small RNA-seq pipeline:

adapter removal
miRNA quantification
other small RNAs detection and quantification
quality control metrics

adapter removal

We integrated cutadapt
allowing a minimum read size of 17 nts and removing the adapter if it is at
least 8 nts long.
miRNA quantification

bcbio uses
seqbuster for this
step. It has been used by
gEUvaids consortium
for miRNA quantification and allows
isomiRs analysis as well. Read more
about why isomiRs are important.
other smallRNAs quantification

bcbio uses seqcluster to detect
unique units of transcription over the genome, allowing resolutions of small
RNAs found in multiple genomic locations. Normally
these small RNAs are dropped because they map multiple times on the genome and
require special analysis to avoid bias in the quantification. Read more about
why
other small RNAs are important.
quality control metrics

bcbio summarizes Fastqc metrics for each sample. Together with different
metrics from the previous steps, the user has an idea of the quality of the
samples and the overall project. It includes fastqc results, size
distribution after adapter removal and amount of small RNAs mapped to miRNAs,
tRNA, rRNA, repeats among others. Other metrics like amount of data used until
the end of the analysis, or warning flags if the data is noisy are provided by
seqcluster and included in the final Rmd template report. (TODO: Provide a
description and pointer to what Rmarkdown is)
automatic report

bcbio generates a
Rmd template report
to make easy the visualization of all the results from each of the steps. It
will be inside report folder in the working directory.
Results

The mirRQC samples allow us to measure quantitation and detection
accuracy of specific miRs for the tools integrated in bcbio.
size distribution

The size distribution shows easily the quality of your data. In a normal small
RNA sample we should see a peak at 22/23 nt and maybe another at 26 or 31
depending on the biology of the samples.


note: there may be specific cases where this assumption is not true, but
it applies to the majority of projects.

miRNA abundance detection of miRQC samples

There are 4 samples that help to validate the quantification:

A: universal human RNA sample
B: human brain sample
C: 25% of  A + 75% of B
D: 25% of B + 75% of A

So we can assume:

If A > B then A > D > C > B
If B > A then A < D < C < B

Note that C/D samples are swapped in the paper and in the GEO web. The text from the paper says:

These samples (termed miRQC A–D) consist of 100% Universal Human miRNA
Reference RNA (UHmiRR; A), 100% human brain RNA (HBR; B) and two titrations
thereof (C = 0.75A + 0.25B and D = 0.25A + 0.75B).

While the text in the GEO web says:

Source name 	miRQC C
Organism 	Homo sapiens
Characteristics 	biomaterial: Mixture of 25% UHRR and 75% HuBr Total RNA


Source name 	miRQC D
Organism 	Homo sapiens
Characteristics 	biomaterial: Mixture of 75% UHRR and 25% HuBr Total RNA

After the analysis with bcbio, we can calculate the amount of miRs that
follows the relative abundance rule. To measure this, I took the average from
the replicates and kept only the miRs with a median > 5 counts after
normalization (upper quantile normalization) among samples.
miRNAs which A > B are 111, and all of them follows A > D > C
miRNAs which B > A are 181 and 174 follows B > C > D
That is more than 95% of accuracy for miRs with more than 5 counts.
specificity

To evaluate specificity we used samples that included specific miRNAs that are not
normally expressed there. These samples were analyzed in a different
run.

we spiked in 8 synthetic miRNAs from two miRNA families into human liver RNA
(miR-302a/b/c/d) or MS2-phage RNA (let-7a/b/c/d)

We should only see those miRNAs (let-7 and miR-302 families) in those samples.


hsa-let-7a appears in all samples. If you check the processed data is not due to
wrong alignment. It would be interesting to figure out whether this signal is
because of contamintation or errors in sequencing or/and amplification.
clusters abundance detection of miRQC samples

Similarly to miRs, we can do the same for other small RNAs detected by
seqcluster. The results where very similar:

clusters which A > B are 147, where 139 (75 are known miRNAs) follow D > C
clusters which B > A are 230, where 222 (129 are known miRNAs) follow D > C

Timing and Resources

The time for 8 samples with 6 millions reads each was 3 hour and 19 minutes.


Total
3:19
total cores
total memory GB


organize samples
0
1
1


trimming & miRNA
0:21
8
20


prepare
0:01
1
8


alignment
0:07
6
42.1


cluster
2:49
1
8


quality control
0:01
8
20


report
0
1
1


Conclusion

It would be nice to have a conclusion here, similar to the intro -- re-emphasize
the major points and suggest additional work.
Thanks


Harvard T.H. Chan School of Public Health
for supporting the integration of small RNAseq pipeline in
bcbio. Special thanks to @roryk and
@chapmanb.
Research Computing at Harvard Medical School:
Chris Botka, Director of Research Computing and all the people in the team.
Special thanks to the author of that papers to make data available. I
encourage to use this data for any tool that analyzes small RNA data.


## lorena-mirqc.patch
--- lorena-mirqc.md.orig	2015-12-05 11:29:26.456290998 -0500
+++ lorena-mirqc.md	2015-12-05 13:11:01.364354674 -0500
@@ -1,6 +1,15 @@
 small RNA-seq with bcbio-nextgen
 =============================

+It would be good to have a few short sentences of introduction here to orient
+readers and give them an overview of the whole post.
+
+- Why is small RNA-seq analysis important
+- What types of analysis do you provide (this is in the pipeline section below),
+  in a sentence.
+- What is bcbio
+- What will you show? Validation of the pipeline and demonstration of its capabilities.
+
 [reproducible code](http://seqcluster.readthedocs.org/example_pipeline.html)

 [R code](https://github.com/lpantano/mypubs/blob/master/srnaseq/mirqc/ready_report.rmd)
@@ -11,15 +20,13 @@
 --------------------

 We used samples from mirRQC project
-[paper](http://www.nature.com/nmeth/journal/v11/n8/full/nmeth.3014.html). The
-idea behind this project is to provide samples that we know how much relative
-small RNAs they contain one respect another,  or/and even the presence or
-absence of specific molecules, such as miRNAs. The main goal was to test
-different platforms for miRNA detection, but I think these are great samples for
-benchmarking any tools as well. I am using them to test how [bcbio-nextgen]()
+[paper](http://www.nature.com/nmeth/journal/v11/n8/full/nmeth.3014.html).
+This project provides samples with known, relative amounts of small RNAs, enabling
+comparison of quantitation and detection of miRNAs. The main goal was to test
+different platforms for miRNA detection, but these are also great samples for
+benchmarking tools. I am using them to test how [bcbio-nextgen]()
 works with small RNA data.

-
 Quoted from the paper, samples are (see below figure, lower panel):

 > Universal Human miRNA reference RNA (Agilent Technologies, #750700), human
@@ -40,7 +47,6 @@
 You can read more about bcbio [here](http://github.com/chapmanb/bcbio-nextgen).
 There are 4 main steps in the small RNA-seq pipeline:

-
 * adapter removal
 * miRNA quantification
 * other small RNAs detection and quantification
@@ -64,10 +70,10 @@

 ### other smallRNAs quantification
 `bcbio` uses [seqcluster](http://github.com/lpantano/seqcluster) to detect
-unique units of transcription over the genome with the main advantage being the
-possibility to detect small RNAs that may come from different places. Normally
+unique units of transcription over the genome, allowing resolutions of small
+RNAs found in multiple genomic locations. Normally
 these small RNAs are dropped because they map multiple times on the genome and
-require an special analysis to avoid bias in the quantification. Read more about
+require special analysis to avoid bias in the quantification. Read more about
 why
 [other small RNAs are important](http://seqcluster.readthedocs.org/literature.html).

@@ -75,11 +81,12 @@
 ### quality control metrics
 `bcbio` summarizes `Fastqc` metrics for each sample. Together with different
 metrics from the previous steps, the user has an idea of the quality of the
-samples and the overall project. It includes the `fastqc` results, size
+samples and the overall project. It includes `fastqc` results, size
 distribution after adapter removal and amount of small RNAs mapped to miRNAs,
 tRNA, rRNA, repeats among others. Other metrics like amount of data used until
 the end of the analysis, or warning flags if the data is noisy are provided by
-`seqcluster` and included in the final _Rmd_ template report.
+`seqcluster` and included in the final _Rmd_ template report. (TODO: Provide a
+description and pointer to what Rmarkdown is)


 ### automatic report
@@ -90,10 +97,8 @@


 ## Results
-The advantage to use these samples is that we can measure the accuracy in the
-quantification of the tools used inside `bcbio` and the detection of specific
-miRs.
-
+The mirRQC samples allow us to measure quantitation and detection
+accuracy of specific miRs for the tools integrated in `bcbio`.

 ### size distribution
 The size distribution shows easily the quality of your data. In a normal small
@@ -144,11 +149,11 @@

 miRNAs which B > A are 181 and 174 follows B > C > D

-That is more than 95% of accuracy for miRs with more than 5 counts.
+That is more than 95% of accuracy for miRs with more than 5 counts.

 ### specificity

-To detect specificity we used samples that included specific miRNAs that are not
+To evaluate specificity we used samples that included specific miRNAs that are not
 normally expressed there. These samples were analyzed in a different
 [run](https://github.com/lpantano/seqcluster/blob/master/data/pipeline_example/mirqc/non_mirqc_bcbio.csv).

@@ -184,6 +189,11 @@
 quality control|0:01|8|20
 report | 0| 1| 1

+## Conclusion
+
+It would be nice to have a conclusion here, similar to the intro -- re-emphasize
+the major points and suggest additional work.
+
 # Thanks
 * [Harvard T.H. Chan School of Public Health](http://bioinformatics.sph.harvard.edu)
   for supporting the integration of small RNAseq pipeline in
Total	3:19	total cores	total memory GB
organize samples	0	1	1
trimming & miRNA	0:21	8	20
prepare	0:01	1	8
alignment	0:07	6	42.1
cluster	2:49	1	8
quality control	0:01	8	20
report	0	1	1
	--- lorena-mirqc.md.orig 2015-12-05 11:29:26.456290998 -0500
	+++ lorena-mirqc.md 2015-12-05 13:11:01.364354674 -0500
	@@ -1,6 +1,15 @@
	small RNA-seq with bcbio-nextgen
	=============================

	+It would be good to have a few short sentences of introduction here to orient
	+readers and give them an overview of the whole post.
	+
	+- Why is small RNA-seq analysis important
	+- What types of analysis do you provide (this is in the pipeline section below),
	+ in a sentence.
	+- What is bcbio
	+- What will you show? Validation of the pipeline and demonstration of its capabilities.
	+
	[reproducible code](http://seqcluster.readthedocs.org/example_pipeline.html)

	[R code](https://github.com/lpantano/mypubs/blob/master/srnaseq/mirqc/ready_report.rmd)
	@@ -11,15 +20,13 @@
	--------------------

	We used samples from mirRQC project
	-[paper](http://www.nature.com/nmeth/journal/v11/n8/full/nmeth.3014.html). The
	-idea behind this project is to provide samples that we know how much relative
	-small RNAs they contain one respect another, or/and even the presence or
	-absence of specific molecules, such as miRNAs. The main goal was to test
	-different platforms for miRNA detection, but I think these are great samples for
	-benchmarking any tools as well. I am using them to test how [bcbio-nextgen]()
	+[paper](http://www.nature.com/nmeth/journal/v11/n8/full/nmeth.3014.html).
	+This project provides samples with known, relative amounts of small RNAs, enabling
	+comparison of quantitation and detection of miRNAs. The main goal was to test
	+different platforms for miRNA detection, but these are also great samples for
	+benchmarking tools. I am using them to test how [bcbio-nextgen]()
	works with small RNA data.

	-
	Quoted from the paper, samples are (see below figure, lower panel):

	> Universal Human miRNA reference RNA (Agilent Technologies, #750700), human
	@@ -40,7 +47,6 @@
	You can read more about bcbio [here](http://github.com/chapmanb/bcbio-nextgen).
	There are 4 main steps in the small RNA-seq pipeline:

	-
	* adapter removal
	* miRNA quantification
	* other small RNAs detection and quantification
	@@ -64,10 +70,10 @@

	### other smallRNAs quantification
	`bcbio` uses [seqcluster](http://github.com/lpantano/seqcluster) to detect
	-unique units of transcription over the genome with the main advantage being the
	-possibility to detect small RNAs that may come from different places. Normally
	+unique units of transcription over the genome, allowing resolutions of small
	+RNAs found in multiple genomic locations. Normally
	these small RNAs are dropped because they map multiple times on the genome and
	-require an special analysis to avoid bias in the quantification. Read more about
	+require special analysis to avoid bias in the quantification. Read more about
	why
	[other small RNAs are important](http://seqcluster.readthedocs.org/literature.html).

	@@ -75,11 +81,12 @@
	### quality control metrics
	`bcbio` summarizes `Fastqc` metrics for each sample. Together with different
	metrics from the previous steps, the user has an idea of the quality of the
	-samples and the overall project. It includes the `fastqc` results, size
	+samples and the overall project. It includes `fastqc` results, size
	distribution after adapter removal and amount of small RNAs mapped to miRNAs,
	tRNA, rRNA, repeats among others. Other metrics like amount of data used until
	the end of the analysis, or warning flags if the data is noisy are provided by
	-`seqcluster` and included in the final _Rmd_ template report.
	+`seqcluster` and included in the final _Rmd_ template report. (TODO: Provide a
	+description and pointer to what Rmarkdown is)


	### automatic report
	@@ -90,10 +97,8 @@


	## Results
	-The advantage to use these samples is that we can measure the accuracy in the
	-quantification of the tools used inside `bcbio` and the detection of specific
	-miRs.
	-
	+The mirRQC samples allow us to measure quantitation and detection
	+accuracy of specific miRs for the tools integrated in `bcbio`.

	### size distribution
	The size distribution shows easily the quality of your data. In a normal small
	@@ -144,11 +149,11 @@

	miRNAs which B > A are 181 and 174 follows B > C > D

	-That is more than 95% of accuracy for miRs with more than 5 counts.
	+That is more than 95% of accuracy for miRs with more than 5 counts.

	### specificity

	-To detect specificity we used samples that included specific miRNAs that are not
	+To evaluate specificity we used samples that included specific miRNAs that are not
	normally expressed there. These samples were analyzed in a different
	[run](https://github.com/lpantano/seqcluster/blob/master/data/pipeline_example/mirqc/non_mirqc_bcbio.csv).

	@@ -184,6 +189,11 @@
	quality control\|0:01\|8\|20
	report \| 0\| 1\| 1

	+## Conclusion
	+
	+It would be nice to have a conclusion here, similar to the intro -- re-emphasize
	+the major points and suggest additional work.
	+
	# Thanks
	* [Harvard T.H. Chan School of Public Health](http://bioinformatics.sph.harvard.edu)
	for supporting the integration of small RNAseq pipeline in