Last active September 26, 2022 13:19
NGS for dummies

Introduction to next-generation sequencing (NGS)

General workflow

The technology most widely used for next-generation sequencing today is Illumina sequencing. Other platforms generally cannot match its speed, price and output, and have therefore specialized in niche applications (not discussed here).

Nevertheless, no sequencing technology can simply start at one end of a chromosome and read through to the other end.

The approach therefore is:

  • cut the genome into many small pieces that can be sequenced individually
  • sequence all those small pieces at the same time (the sequence obtained from each piece is a sequencing read)
  • map each read to its position in a reference genome
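
The three steps above can be sketched with a toy example. This is only an illustration under strong simplifying assumptions: the "genome" is a short made-up string, reads are error-free, and mapping is done by exact string search. Real aligners (BWA, Bowtie2) use indexed data structures and tolerate mismatches. The names `fragment` and `map_read` are hypothetical helpers, not part of any tool.

```python
import random

def fragment(genome, read_len, n_reads, seed=0):
    """Sample n_reads substrings of length read_len at random positions."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads

def map_read(read, reference):
    """Return the 0-based position of the first exact match, or -1 if none."""
    return reference.find(read)

# Made-up reference sequence, for illustration only.
reference = "ACGTTGCAAGGCTTACCGATAGGCTAACGTTAGC"
reads = fragment(reference, read_len=8, n_reads=5)

for read in reads:
    print(read, map_read(read, reference))
```

Every read sampled from the reference maps back to some position in it, while a sequence that does not occur in the reference returns -1 (an unmapped read).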

This implies that in every NGS application, there are several common steps:

  • prepare a library of DNA fragments to be sequenced - this step is called library preparation
  • sequence the library - this step is called sequencing
  • assess the quality of the raw reads - this step is called quality control
  • determine the position of each sequence (read) in the genome - this step is called mapping

From here on, analysis varies from application to application.

Sequencing applications

  • de novo whole-genome sequencing - determine the sequence of a genome from a species never sequenced before and make a reference genome - called genome assembly
  • Whole-genome sequencing - determine the sequence of an individual from a species that already has a reference genome and annotate deviations from the reference - this process is called variant calling
  • RNA sequencing - convert RNA molecules in cells to cDNA, sequence the cDNA, determine its origin in the genome (mapping), and count how many cDNA molecules come from each gene - called gene expression profiling
  • "Chromatin profiling" (ChIP-seq, ATAC-seq, DNase-seq) - select regions of the genome associated with certain proteins or with a certain conformation, make a library with those only, sequence the library and determine the abundance of reads along the genome (regions with more reads correspond to protein binding sites or to accessible chromatin, depending on the assay)
  • ... many others, but the ones above account for >80% of use cases.
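
To make the "count how many cDNA molecules come from each gene" idea concrete, here is a minimal sketch. It assumes reads have already been mapped and are represented only by their start position on a single chromosome, and that genes are simple non-overlapping intervals. The gene names, coordinates and read positions are invented for illustration. Real counting has to deal with exons, strands and ambiguous reads.

```python
from collections import Counter

# Hypothetical gene annotation: name -> (start, end), 0-based, end exclusive.
genes = {"geneA": (0, 100), "geneB": (200, 350)}

# Hypothetical mapped read start positions on the same chromosome.
read_positions = [5, 20, 99, 150, 210, 220, 349, 350]

def count_reads_per_gene(positions, genes):
    """Count how many read start positions fall inside each gene interval."""
    counts = Counter({name: 0 for name in genes})
    for pos in positions:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts

counts = count_reads_per_gene(read_positions, genes)
for name, n in counts.items():
    print(name, n)
```

Reads at positions 150 and 350 fall outside both genes and are not counted. Comparing such counts between samples is the starting point of differential expression analysis.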

Essential vocabulary

(make sure you understand these!)

  • sequencing library (or simply library)
  • library fragment (or simply fragment)
  • sequencing read (or simply read)
  • mapping
  • alignment
  • variant
  • gene
  • transcript

Tools that everyone uses

(google at least one per category)

  • Aligners
    • BWA
    • Bowtie2
  • Variant calling
    • GATK
    • Samtools/BCFtools
  • Differential expression
    • cufflinks
    • DESeq (now DESeq2)
  • Genome browsers (to visualize reads and regions in the genome)
    • UCSC genome browser
    • IGV genome browser
  • General purpose (e.g. format conversion)
    • samtools
    • bedtools
    • FastQC (raw read quality control)

Data formats

(good to know, but don't worry too much right now)

  • FASTQ - format to store reads and measurements of their quality
  • SAM/BAM - format to store reads, alignments and measurements of their quality
  • VCF/BCF - format to store called variants
  • BED - format to store annotation of regions in the genome
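
To give a feel for two of these formats, the sketch below parses a tiny FASTQ text and one BED line. It assumes the simple case where every FASTQ record takes exactly four lines (header starting with `@`, sequence, a `+` separator, quality string) and that BED fields are tab-separated with a 0-based start and an exclusive end. The records and the region are made up, and `parse_fastq` / `parse_bed_line` are hypothetical helpers, not a library API.

```python
from typing import NamedTuple

class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str

def parse_fastq(text):
    """Parse FASTQ text, assuming exactly four lines per record."""
    lines = text.strip().splitlines()
    records = []
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        records.append(FastqRecord(header[1:], seq, qual))  # drop leading '@'
    return records

def parse_bed_line(line):
    """Return (chrom, start, end) from the first three BED columns."""
    chrom, start, end = line.split("\t")[:3]
    return chrom, int(start), int(end)

# Made-up example data.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII#I\n"
records = parse_fastq(fastq_text)
chrom, start, end = parse_bed_line("chr1\t100\t150\tpeak1")

print(records)
print(chrom, start, end, "length:", end - start)
```

Because BED intervals are 0-based with an exclusive end, the region length is simply `end - start`. SAM and VCF positions, by contrast, are 1-based, which is a common source of off-by-one errors when converting between formats.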

Other resources

