Skip to content

Instantly share code, notes, and snippets.

@mfoll
Forked from danielecook/fasta_sequence_lengths.sh
Last active September 25, 2015 15:02
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save mfoll/04d751165416a4466001 to your computer and use it in GitHub Desktop.
Save mfoll/04d751165416a4466001 to your computer and use it in GitHub Desktop.
Generate Fasta sequence lengths
BEGIN {
OFS = "\t"; # tab-delimited output
}
# Use substr instead of regex to match a starting ">"
substr($0, 1, 1) == ">" {
if (seqlen) {
# Only print info for this sequence if no target was given
# or its id matches the target.
if (! target || id == target) {
print id, seqlen;
}
}
# Get sequence id:
# 1. Split header on whitespace (fields[1] is now ">id")
split($0, fields);
# 2. Get portion of first field after the starting ">"
id = substr(fields[1], 2);
seqlen = 0;
next;
}
{
seqlen = seqlen + length($0);
}
END {
if (! target || id == target) {
print id, seqlen;
}
}
@mfoll
Copy link
Author

mfoll commented Sep 25, 2015

Source: http://stackoverflow.com/a/28548049. This is really slow on human genome (~4min)

@mfoll
Copy link
Author

mfoll commented Sep 25, 2015

Using biopython this is doing the same but is also very slow:

#!/usr/bin/python
from Bio import SeqIO
import sys
cmdargs = str(sys.argv)
for seq_record in SeqIO.parse(str(sys.argv[1]), "fasta"):
 output_line = '%s\t%i' % (seq_record.id, len(seq_record))
 print(output_line)

@mfoll
Copy link
Author

mfoll commented Sep 25, 2015

One-liner:

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'

@mfoll
Copy link
Author

mfoll commented Sep 25, 2015

I use this script to write the contig field in a VCF file (see samtools/bcftools#326) that is in this format:

##contig=<ID=chr1,length=249250621>

To directly output it in this format, one can replace the two print id, seqlen; commands from the awkscript above with:

print "##contig=<ID="id",length="seqlen">";

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment