@colinbrislawn
Last active September 2, 2016 18:26
Work in progress for the vsearch wiki

Quality Control

High-throughput sequencing data is often presented in the .fastq format. This flat text file format contains both the nucleotide sequences and their Phred quality scores (Q scores). Quality scores estimate the accuracy of each nucleotide.

Phred Quality Score    Estimated Accuracy
10                     90 %
20                     99 %
30                     99.9 %
40                     99.99 %
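
The table follows from the standard Phred definition, where the error probability is P = 10^(-Q/10). As a minimal sketch (the function names are illustrative, and the default offset of 33 assumes the common Phred+33 encoding of the quality line, which older files may not use), the conversion could look like this:

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q):
    """Estimated accuracy of a base call with Phred score q."""
    return 1 - phred_to_error_prob(q)


def decode_quality(qual, offset=33):
    """Turn a FASTQ quality string into a list of Q scores.

    Assumption: the file uses Phred+33 encoding (offset=33).
    """
    return [ord(char) - offset for char in qual]


if __name__ == "__main__":
    # Print the accuracy for each Q score in the table above.
    for q in (10, 20, 30, 40):
        print(q, f"{phred_to_accuracy(q):.2%}")
```

Because the scale is logarithmic, every 10 points of Q corresponds to a tenfold drop in the error probability.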

Q scores are not perfect

Q scores are estimates; the real accuracy of a nucleotide could be lower.

Q scores differ between sequencing platforms: Illumina reports the probability of a substitution error, while Ion Torrent and Roche 454 report the probability of an insertion or deletion.

The relative quality of sequencing platforms is hotly debated (PDF, PDF). For this discussion, we will accept Q scores as reasonable estimates of accuracy.

Quality Filtering with Q scores

There are many ways that Q scores can be used to increase the quality of a dataset.

  1. Trimming.
  • Once a single nucleotide has a low Q score, remove all following nucleotides
  2. Filtering.
  • Remove reads with a low average Q score
  3. Some combination of trimming and filtering, like
  • Once a series of nucleotides has a low average Q score, remove all following nucleotides
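
The three strategies above can be sketched in a few lines of Python. This is only an illustration, not how vsearch or any other tool implements them: the function names, the thresholds, and the window size are all assumptions, and the sketch assumes that the low-quality base (or window) is removed along with everything after it.

```python
def trim_at_first_low_q(seq, quals, min_q=20):
    """Trimming: cut the read at the first base with Q below min_q."""
    for i, q in enumerate(quals):
        if q < min_q:
            return seq[:i], quals[:i]
    return seq, quals


def passes_mean_q(quals, min_mean_q=25):
    """Filtering: keep a read only if its average Q is at least min_mean_q."""
    return bool(quals) and sum(quals) / len(quals) >= min_mean_q


def trim_sliding_window(seq, quals, window=4, min_mean_q=20):
    """Combination: cut the read where a window's average Q drops below min_mean_q."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_mean_q:
            return seq[:i], quals[:i]
    return seq, quals


if __name__ == "__main__":
    seq = "ACGTACGT"
    quals = [30, 30, 30, 30, 10, 10, 10, 10]
    print(trim_at_first_low_q(seq, quals))
    print(passes_mean_q(quals))
    print(trim_sliding_window(seq, quals))
```

Here `quals` is a list of integer Q scores with the same length as `seq`.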

Because common sequencing technologies produce lower quality nucleotides near the end of the reads, trimming is common.

Illumina sequencing produces paired-end reads that can be joined. These joined reads are high quality on both ends, making filtering a better fit than trimming.

Average Q is a bad idea!

As discussed in Edgar & Flyvbjerg, 2015, the average Q score of a read is a very poor indicator of quality. Because Q is a logarithmic scale, a simple average is dominated by the many high scores and dramatically underestimates the number of errors predicted by the individual Q scores. Take this example from Edgar, shown below.

Q scores in read       Avg. Q    Expected number of errors
140 x Q35 + 10 x Q2    33        6.4 !
150 x Q25              25        0.5
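
The numbers in this table can be checked by summing the per-base error probabilities, P = 10^(-Q/10), over each read. The sketch below is an illustration with made-up function names, not code from the paper or from vsearch:

```python
def mean_q(quals):
    """Simple arithmetic mean of the Q scores in a read."""
    return sum(quals) / len(quals)


def expected_errors(quals):
    """Expected number of errors: the sum of per-base error probabilities."""
    return sum(10 ** (-q / 10) for q in quals)


# The two reads from the table above, as lists of Q scores.
read_a = [35] * 140 + [2] * 10
read_b = [25] * 150

if __name__ == "__main__":
    for name, quals in (("140 x Q35 + 10 x Q2", read_a), ("150 x Q25", read_b)):
        print(name, round(mean_q(quals)), round(expected_errors(quals), 1))
```

The ten Q2 bases barely move the average of the first read, yet each one contributes about 0.63 expected errors, which is why that read ends up with far more expected errors than the second one.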

Expected Error filtering

Coming soon!
