Last active
July 1, 2016 04:09
Extract over-represented sequences from a FastQC output file [fastqc_data.txt]
awk 'BEGIN{i=0}; {if ($1 == "Filename") {file=$2} else if ($1 ~ /^[ACGTN]+$/ && length($1) > 40) {i+=1; printf ">%s_%d\n%s\n", file, i, $1}}' < fastqc_data.txt
FastQC reports over-represented sequences in its fastqc_data.txt output. Sometimes it is necessary to run a BLAST search on these sequences. To extract them quickly (in this example, only sequences longer than 40 bases), run this awk one-liner and redirect its output to create a FASTA file.
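The length filter is only a heuristic for spotting sequence lines. A possible alternative, sketched here under the assumption that fastqc_data.txt is tab-separated, opens the relevant module with a line starting ">>Overrepresented sequences" and closes each module with ">>END_MODULE", is to read sequences only inside that module. The function name extract_overrepresented and the "count=" header field are made up for illustration.

```shell
# Hypothetical helper: print the overrepresented sequences of one
# fastqc_data.txt file as FASTA records on standard output.
extract_overrepresented() {
  awk -F '\t' '
    # Remember the input file name reported in the Basic Statistics module.
    $1 == "Filename"               { file = $2 }
    # Start of the module we care about (assumed module title).
    /^>>Overrepresented sequences/ { in_mod = 1; next }
    # Any module end closes it again.
    /^>>END_MODULE/                { in_mod = 0 }
    # Inside the module, keep rows whose first column is a nucleotide string;
    # the "#Sequence ..." column header does not match and is skipped.
    in_mod && $1 ~ /^[ACGTN]+$/    { printf ">%s_%d count=%s\n%s\n", file, ++i, $2, $1 }
  ' "$1"
}
```

Usage would look like `extract_overrepresented fastqc_data.txt > overrepresented.fasta`, where the output file name is arbitrary. Unlike the one-liner, this version applies no minimum length, so short over-represented sequences are kept as well.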