afrendeiro/ngs_101.md

Introduction to next-generation sequencing (NGS)

General workflow

The currently dominant technology for next-generation sequencing is Illumina sequencing. Other technologies generally cannot compete with its speed, price and output, and have therefore specialized in niche applications (not discussed here).
Nevertheless, no sequencing technology can simply start at one end of a chromosome and sequence through to the other end.
The approach therefore is:

cut the genome into several small pieces that can be sequenced individually
sequence all those small pieces at the same time (each sequence obtained is a sequencing read)
map each read to its position in a reference genome
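
The three steps above can be illustrated with a toy sketch in Python. This is not how real aligners such as BWA or Bowtie2 work (they search an index of the reference and tolerate mismatches). The reference string, the read length and the helper names `fragment` and `map_read` are made up for illustration, and reads are assumed to be error-free so that exact matching is enough.

```python
import random

def fragment(genome, read_length, n_reads, seed=0):
    """Sample n_reads substrings ("reads") of fixed length from random positions."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        reads.append(genome[start:start + read_length])
    return reads

def map_read(read, reference):
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    pos = reference.find(read)
    while pos != -1:
        positions.append(pos)
        pos = reference.find(read, pos + 1)
    return positions

# Hypothetical toy reference; a real genome has millions to billions of bases.
reference = "ACGTTGCATGCCGATAGGCTTAACGGATCCGATTACA"
reads = fragment(reference, read_length=8, n_reads=5)
for read in reads:
    print(read, map_read(read, reference))
```

Note that a short read can match more than one position (here `GAT` occurs three times), which is one reason real reads need to be long enough to map unambiguously.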

This implies that in every NGS application, there are several common steps:

prepare a library of DNA fragments to be sequenced - this step is called library preparation
sequence the library - this step is called sequencing
assess the quality of the raw reads - this step is called quality control
determine the position of each sequence (read) in the genome - this step is called mapping
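
As a rough illustration of the quality control step: sequencers report a Phred quality score Q for each base, where the estimated probability that the base call is wrong is 10^(-Q/10). The sketch below is far simpler than what a tool like FastQC reports. It keeps only reads whose mean quality reaches a threshold. The example reads, the threshold of 20 and the function names are assumptions for illustration.

```python
def error_probability(q):
    """Convert a Phred quality score into the probability of a wrong base call."""
    return 10 ** (-q / 10)

def mean_quality(qualities):
    """Arithmetic mean of the per-base Phred scores of one read."""
    return sum(qualities) / len(qualities)

def filter_reads(reads, min_mean_quality=20):
    """Keep (sequence, qualities) pairs whose mean quality reaches the threshold."""
    return [(seq, quals) for seq, quals in reads
            if mean_quality(quals) >= min_mean_quality]

# Hypothetical reads: a sequence plus one Phred score per base.
reads = [
    ("ACGTAC", [38, 37, 36, 35, 30, 28]),   # good read
    ("TTGACA", [12, 10, 8, 15, 9, 11]),     # poor read, should be dropped
]
kept = filter_reads(reads)
print([seq for seq, _ in kept])
print(error_probability(20))   # Q20 corresponds to a 1 in 100 chance of error
```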

From here on, analysis varies from application to application.

Sequencing applications

De novo whole-genome sequencing - determine the genome sequence of a species that has never been sequenced before and build a reference genome - this process is called genome assembly
Whole-genome sequencing - determine the genome sequence of an individual from a species that already has a reference genome and annotate deviations from the reference - this process is called variant calling
RNA sequencing - convert RNA molecules in cells to cDNA, sequence the cDNA, determine its origin in the genome (mapping), and count how many cDNA molecules come from each gene - this process is called gene expression profiling
"Chromatin profiling" (ChIP-seq, ATAC-seq, DNase-seq) - select regions of the genome associated with certain proteins or with a certain conformation, make a library with those only, sequence the library and determine the abundance of reads along the genome (regions with more reads correspond to protein binding sites or, for ATAC-seq and DNase-seq, accessible chromatin)
... and many others, but the ones above account for >80% of use cases.
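
Both gene expression profiling and chromatin profiling end by counting mapped reads inside genomic regions (genes in one case, candidate binding sites in the other). Below is a minimal sketch of that counting step. It assumes reads are already mapped and reduced to (chromosome, start, end) tuples with 0-based, end-exclusive coordinates. The gene names and coordinates are invented, and real tools additionally deal with strand, splicing and ambiguously mapped reads.

```python
from collections import Counter

def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end

def count_reads(reads, regions):
    """Count, for each named region, how many reads overlap it."""
    counts = Counter({name: 0 for name in regions})
    for chrom, start, end in reads:
        for name, (r_chrom, r_start, r_end) in regions.items():
            if chrom == r_chrom and overlaps(start, end, r_start, r_end):
                counts[name] += 1
    return counts

# Hypothetical annotation and mapped reads.
regions = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 800, 1200)}
reads = [
    ("chr1", 120, 170),
    ("chr1", 480, 530),   # overlaps the end of geneA
    ("chr1", 600, 650),   # falls between the genes, counted nowhere
    ("chr1", 900, 950),
    ("chr2", 900, 950),   # same coordinates, different chromosome
]
print(dict(count_reads(reads, regions)))
```

Comparing every read against every region scales poorly. It is only meant to show the idea, and dedicated tools rely on sorted or indexed intervals instead.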

Essential vocabulary

(make sure you understand these!)

sequencing library (or simply library)
library fragment (or simply fragment)
sequencing read (or simply read)
mapping
alignment
variant
gene
transcript

Tools that everyone uses

(google at least one per category)

Aligners

BWA
Bowtie2


Variant calling

GATK
samtools (together with bcftools)


Differential expression

Cufflinks
DESeq


Genome browsers (to visualize reads and regions in the genome)

UCSC genome browser
IGV genome browser


General purpose (e.g. format conversion)

samtools
bedtools
FastQC (raw read quality control)


Data formats

(good to know, but don't worry too much right now)

FASTQ - format to store reads and measurements of their quality
SAM/BAM - format to store reads, alignments and measurements of their quality
VCF/BCF - format to store called variants
BED - format to store annotation of regions in the genome
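
To make two of these formats concrete, here is a small parsing sketch. It assumes the common four-line FASTQ layout (identifier, sequence, `+` separator, quality string) with Phred+33 quality encoding, and the three mandatory BED columns (chromosome, 0-based start, end-exclusive end). The example records and function names are invented for illustration, and real files are better read with an established library.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, qualities) from four-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        # Phred+33: each quality character encodes score = ord(char) - 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores

def parse_bed_line(line):
    """Return (chrom, start, end) from the first three tab-separated BED columns."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)

# Hypothetical records for illustration only.
fastq_text = """@read1
ACGT
+
II5!
@read2
GGCA
+
IIII
"""
for read_id, sequence, scores in parse_fastq(fastq_text):
    print(read_id, sequence, scores)

print(parse_bed_line("chr1\t100\t500\tgeneA"))
```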

Other resources

https://www.youtube.com/watch?v=womKfikWlxM
http://www.illumina.com/content/dam/illumina-marketing/documents/products/sequencing_introduction_microbiology.pdf
...