Last active
October 10, 2015 19:38
This script takes FASTQ-formatted sequences on STDIN and counts the number of occurrences of each quality character. Useful for determining which quality encoding the FASTQ file might be using.
#!/bin/bash
# Usage: quality_value_summary.sh < in.fastq > quality_stats.txt
# Inspired by Torsten Seemann's blog post:
#   http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html
# Inspired by:
#   http://www.unix.com/shell-programming-scripting/37305-how-convert-hex-value-dec.html
#
# Pipeline: join each 4-line record into one line, keep the quality string
# (field 4), dump it as one hex byte per line, drop newlines (0a), count each
# byte value, then print: count, decimal ASCII code, quality character.
# Requires GNU sort (-S, --parallel) and gawk (--non-decimal-data).
paste - - - - | \
  cut -f 4 | \
  xxd -g 1 -c 1 -p | \
  grep -v '0a' | \
  sort -S 5G --parallel 10 | \
  uniq --count | \
  awk --non-decimal-data '{
    printf "%d\t%d\t%c\n", $1,("0x"$2)+0,("0x"$2)+0
  }'
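The first two stages of the pipeline are the least obvious, so here is a minimal sketch of them in isolation. The function name quality_lines and the two sample reads are made up for illustration and are not part of the original script.

```shell
# Sketch: extract only the quality line of each 4-line FASTQ record.
# quality_lines is a hypothetical name, not part of the original script.
quality_lines() {
  # paste joins every 4 input lines into one tab-separated line;
  # cut then keeps field 4, which is the quality string.
  paste - - - - | cut -f 4
}

# Two made-up reads, purely for illustration
printf '@r1\nACGT\n+\nII#I\n@r2\nGGCA\n+\n#III\n' | quality_lines
```

This assumes every record spans exactly 4 lines, as the script itself does; FASTQ files with wrapped sequence or quality lines would not survive this step intact.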
Example Usage
While you could summarise all the quality values present in an entire FASTQ file, it's probably not worth the time. Instead, just take the first x reads as a subsample. To get the stats for the first 10000 reads (assuming 4 lines per read), pass only the first 40000 lines of the input file to the script, for example with head -n 40000.

Output
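Tab-delimited output in 3 columns: the number of occurrences, the decimal ASCII code of the quality character, and the quality character itself.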
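The script's printf emits the count, the decimal ASCII code and the character, and the lowest ASCII code observed is usually enough to narrow down the encoding. The following is a rough sketch of that check, not part of the original script. The function name guess_encoding is hypothetical, and the thresholds assume the conventional ranges: offset-33 encodings start at ASCII 33, Solexa+64 at 59 and Phred+64 at 64.

```shell
# Hypothetical helper (not part of the original script): guess the quality
# encoding from the 3-column stats (count, ASCII code, character) on STDIN.
guess_encoding() {
  awk -F '\t' '
    # Track the smallest ASCII code seen in column 2
    NR == 1 || $2 < min { min = $2 + 0 }
    END {
      if (NR == 0)       print "no data"
      else if (min < 59) print "Phred+33"               # below any +64 range
      else if (min < 64) print "Solexa+64 or Phred+33"  # ambiguous range
      else               print "probably Phred+64"      # assumed threshold
    }'
}

# Made-up stats for illustration: lowest code is 35 ("#")
printf '120\t35\t#\n900\t73\tI\n' | guess_encoding   # prints: Phred+33
```

Only the first answer is conclusive. A subsample containing nothing but high-quality bases can have a minimum of 59 or above even in an offset-33 file, so treat the other two results as hints and check a larger subsample if in doubt.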
What Base Encoding are my Sequences In?