Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 4 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save shenwei356/a94a23ce27e13056ac4a6f1758f4abb2 to your computer and use it in GitHub Desktop.
Save shenwei356/a94a23ce27e13056ac4a6f1758f4abb2 to your computer and use it in GitHub Desktop.
Filtering spades assembly result according to coverage using SeqKit and csvtk

Filtering Spades assembly result according to coverage information in sequence header

Sample sequence

$ cat contigs.fasta 
>NODE_1_length_869844_cov_1135.34
ACTGNacgtn 
>NODE_2_length_576386_cov_975.882
acgtn

Steps

  1. Converting FASTA to tabular format using SeqKit (http://bioinf.shenwei.me/seqkit/)

Note that seqkit fx2tab converts FASTA to 3-column tabular format, with sequence in the 2nd column and quality in 3rd column.

    $ seqkit fx2tab contigs.fasta 
    NODE_1_length_869844_cov_1135.34        ACTGNacgtn
    NODE_2_length_576386_cov_975.882        acgtn
  1. Retrieving coverage as new column using csvtk (http://bioinf.shenwei.me/csvtk/)

     $ seqkit fx2tab contigs.fasta | csvtk mutate -H -t -f 1 -p "cov_(.+)" 
     NODE_1_length_869844_cov_1135.34        ACTGNacgtn              1135.34
     NODE_2_length_576386_cov_975.882        acgtn           975.882
    
  2. Filtering by coverage (4th column) using csvtk or awk

     # seqkit fx2tab contigs.fasta | csvtk mutate -H -t -f 1 -p "cov_(.+)" | awk -F "\t" '$4>=1000'
     $ seqkit fx2tab contigs.fasta | csvtk mutate -H -t -f 1 -p "cov_(.+)" | csvtk filter2 -H -t -f "$4>=1000" 
     NODE_1_length_869844_cov_1135.34        ACTGNacgtn              1135.34
    
  3. Converting tabular format back to FASTA format

     $ seqkit fx2tab contigs.fasta | csvtk mutate -H -t -f 1 -p "cov_(.+)" | awk -F "\t" '$4>=1000' | seqkit tab2fx
     >NODE_1_length_869844_cov_1135.34
     ACTGNacgtn
    
@ramiroricardo
Copy link

Thanks for posting this. It was quite helpful. Based on your code, I reached this solution:

seqkit fx2tab contigs.fasta | csvtk mutate -H -t -f 1 -p "cov_(.+)" | csvtk mutate -H -t -f 1 -p "length_([0-9]+)" | awk -F "\t" '$4>=10 && $5>=500' | seqkit tab2fx > filtered_contigs.fasta

for filtering assembled contigs by both coverage and length.

@tpsduarte
Copy link

tpsduarte commented Dec 11, 2023

Just a little correction from the previous comment. It is missing an underscore in "cov_(.+)" :

seqkit fx2tab contigs.fasta | csvtk mutate -H -t -f 1 -p "cov_(.+)_" | csvtk mutate -H -t -f 1 -p "length_([0-9]+)" | awk -F "\t" '$4>=10 && $5>=500' | seqkit tab2fx > filtered_contigs.fasta

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment