-
-
Save mfoll/04d751165416a4466001 to your computer and use it in GitHub Desktop.
Generate Fasta sequence lengths
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
BEGIN { | |
OFS = "\t"; # tab-delimited output | |
} | |
# Use substr instead of regex to match a starting ">" | |
substr($0, 1, 1) == ">" { | |
if (seqlen) { | |
# Only print info for this sequence if no target was given | |
# or its id matches the target. | |
if (! target || id == target) { | |
print id, seqlen; | |
} | |
} | |
# Get sequence id: | |
# 1. Split header on whitespace (fields[1] is now ">id") | |
split($0, fields); | |
# 2. Get portion of first field after the starting ">" | |
id = substr(fields[1], 2); | |
seqlen = 0; | |
next; | |
} | |
{ | |
seqlen = seqlen + length($0); | |
} | |
END { | |
if (! target || id == target) { | |
print id, seqlen; | |
} | |
} |
Using biopython this is doing the same but is also very slow:
#!/usr/bin/python
from Bio import SeqIO
import sys
cmdargs = str(sys.argv)
for seq_record in SeqIO.parse(str(sys.argv[1]), "fasta"):
output_line = '%s\t%i' % (seq_record.id, len(seq_record))
print(output_line)
One-liner:
cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'
I use this script to write the contig
field in a VCF file (see samtools/bcftools#326) that is in this format:
##contig=<ID=chr1,length=249250621>
To directly output it in this format, one can replace the two print id, seqlen;
commands from the awk
script above with:
print "##contig=<ID="id",length="seqlen">";
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Source: http://stackoverflow.com/a/28548049. This is really slow on human genome (~4min)