@nathanhaigh
Last active October 10, 2015 19:38
This script takes FASTQ-formatted sequences on STDIN and counts the number of occurrences of each quality character. Useful for determining what FASTQ encoding the FASTQ file might be using.
#!/bin/bash
# Usage: quality_value_summary.sh < in.fastq > quality_stats.txt
# Inspired by Torsten Seemann's blog post:
# http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html
# Inspired by:
# http://www.unix.com/shell-programming-scripting/37305-how-convert-hex-value-dec.html
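paste - - - - |                # join each 4-line FASTQ record onto one tab-separated line
cut -f 4 |                     # keep only the quality string
xxd -g 1 -c 1 -p |             # dump each byte as a 2-digit hex value, one per line
grep -v '0a' |                 # drop the newline bytes
sort -S 5G --parallel 10 |     # GNU sort: 5G buffer, up to 10 threads
uniq --count |                 # count occurrences of each hex value
awk --non-decimal-data '{
printf "%d\t%d\t%c\n", $1,("0x"$2)+0,("0x"$2)+0
}'                             # gawk: print count, decimal value, character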
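
The script above relies on xxd, GNU sort options and gawk's --non-decimal-data. Where those are not available, the same counting idea can be sketched with od, which emits decimal byte values directly. This is a hypothetical variant, not the author's script; the function name quality_value_summary_portable is made up for illustration.

```shell
# Hypothetical portable variant (an assumption, not part of the original gist).
# Reads FASTQ on STDIN and prints: count <TAB> decimal value <TAB> character.
quality_value_summary_portable() {
  awk 'NR % 4 == 0' |       # keep every 4th line (the quality string)
    tr -d '\n' |            # drop newlines so only quality bytes remain
    od -An -v -tu1 |        # dump bytes as unsigned decimal values
    tr -s ' ' '\n' |        # put one value on each line
    grep -v '^$' |          # remove the empty lines left by od's padding
    sort -n |               # group equal values together, in numeric order
    uniq -c |               # count each value
    awk '{ printf "%d\t%d\t%c\n", $1, $2 + 0, $2 + 0 }'
}
```

It would be called the same way as the original script, for example: quality_value_summary_portable < in.fastq > quality_stats.txt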

Example Usage

While you could summarise all the quality values present in an entire FASTQ file, it's probably not worth the time. Instead, just take the first x reads as a subsample. Here's how to get the stats for the first 10000 reads (assuming 4 lines per read) of an input file:

$ head -n 40000 illumina.fastq | ./quality_value_summary.sh
3506    35      #
9       38      &
300     39      '
777     40      (
3427    41      )
656     42      *
358     43      +
94      44      ,
330     45      -
589     46      .
294     47      /
3415    48      0
1855    49      1
367     50      2
601     51      3
383     52      4
485     53      5
1091    54      6
1397    55      7
3956    56      8
2961    57      9
9222    58      :
2268    59      ;
2839    60      <
3609    61      =
1163    62      >
21300   63      ?
14766   64      @
5862    65      A
17534   66      B
32309   67      C
30109   68      D
11787   69      E
67901   70      F
39690   71      G
78040   72      H
61041   73      I
83709   74      J
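
The "4 lines per read" arithmetic can be wrapped in a small helper so the read count is given directly. This is only a sketch: first_n_reads is a hypothetical name, and it assumes a strict 4-line-per-record FASTQ with no wrapped sequence lines.

```shell
# Hypothetical helper (not part of the original gist).
# Prints the first N reads of a 4-line-per-read FASTQ stream on STDIN.
first_n_reads() {
  local n="$1"
  head -n "$(( n * 4 ))"   # N reads = 4 * N lines
}
```

For example: first_n_reads 10000 < illumina.fastq | ./quality_value_summary.sh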

Output

A tab-delimited output in 3 columns:

  1. Frequency
  2. The decimal representation of the quality character
  3. The quality character found in the input

What Quality Encoding are my Sequences In?

  • If the 2nd column of output contains a value >=75 then your FASTQ quality values are Phred+64 encoded
  • Else if it contains values <=58 then your FASTQ quality values are Phred+33 encoded
  • Else (all values fall between 59 and 74) you either have poor-quality Phred+64 encoded sequences OR good-quality Phred+33 encoded sequences
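
The decision rules above can be applied automatically to the script's output. The sketch below is a hypothetical helper (guess_quality_encoding is not part of the gist). It reports the result using the usual offset names, Phred+33 and Phred+64, and uses the same thresholds of 75 and 58 on the second column.

```shell
# Hypothetical helper (an assumption, not part of the original gist).
# Reads the 3-column summary on STDIN and applies the thresholds listed above.
guess_quality_encoding() {
  awk '
    NR == 1 || $2 + 0 < min { min = $2 + 0 }   # smallest decimal value seen
    NR == 1 || $2 + 0 > max { max = $2 + 0 }   # largest decimal value seen
    END {
      if (NR == 0)        print "no input"
      else if (max >= 75) print "Phred+64"
      else if (min <= 58) print "Phred+33"
      else                print "ambiguous"
    }'
}
```

For example: head -n 40000 illumina.fastq | ./quality_value_summary.sh | guess_quality_encoding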