Extract over-represented sequences from a FastQC output file [fastqc_data.txt]
awk '$1 == "Filename" {file = $2} $1 ~ /^[ACGTN]+$/ && length($1) > 40 {i++; print ">" file "_" i "\n" $1}' fastqc_data.txt
@skurscheid
Author

FastQC reports over-represented sequences in its fastqc_data.txt output. It is sometimes useful to run a BLAST search on these sequences. To extract them quickly (in this example, sequences longer than 40 characters), run this awk one-liner, which prints them in FASTA format to standard output; redirect the output to a file to create a FASTA file.
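
If the goal is to select sequences by how often they occur rather than by their length, one possible variant is sketched below. It assumes the usual fastqc_data.txt layout: tab-separated columns, with the over-represented sequences listed between a `>>Overrepresented sequences` line and the next `>>END_MODULE` line, and the occurrence count in the second column. The function name `overrep_to_fasta` and the `min_count` argument are made up for this illustration and are not part of FastQC.

```shell
# Hypothetical helper (not part of FastQC): print over-represented sequences as FASTA.
# Usage: overrep_to_fasta fastqc_data.txt [min_count]
# Assumes tab-separated columns: Sequence, Count, Percentage, Possible Source.
overrep_to_fasta() {
    awk -F'\t' -v min="${2:-0}" '
        $1 == "Filename"                 { file = $2 }         # remember the input file name
        /^>>Overrepresented sequences/   { in_mod = 1; next }  # start of the module
        /^>>END_MODULE/                  { in_mod = 0 }        # end of any module
        # Inside the module, skip the "#Sequence ..." header and keep rows with Count >= min
        in_mod && !/^#/ && $2 + 0 >= min { i++; print ">" file "_" i " count=" $2 "\n" $1 }
    ' "$1"
}
```

For example, `overrep_to_fasta fastqc_data.txt 40 > overrepresented.fasta` would keep only sequences reported at least 40 times. Restricting the match to the module also avoids picking up lines from other sections of the report.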
