-
-
Save mfoll/04d751165416a4466001 to your computer and use it in GitHub Desktop.
Generate Fasta sequence lengths
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
BEGIN { | |
OFS = "\t"; # tab-delimited output | |
} | |
# Use substr instead of regex to match a starting ">" | |
substr($0, 1, 1) == ">" { | |
if (seqlen) { | |
# Only print info for this sequence if no target was given | |
# or its id matches the target. | |
if (! target || id == target) { | |
print id, seqlen; | |
} | |
} | |
# Get sequence id: | |
# 1. Split header on whitespace (fields[1] is now ">id") | |
split($0, fields); | |
# 2. Get portion of first field after the starting ">" | |
id = substr(fields[1], 2); | |
seqlen = 0; | |
next; | |
} | |
{ | |
seqlen = seqlen + length($0); | |
} | |
END { | |
if (! target || id == target) { | |
print id, seqlen; | |
} | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
I use this script to write the
contig
field in a VCF file (see samtools/bcftools#326) that is in this format:To directly output it in this format, one can replace the two
print id, seqlen;
commands from theawk
script above with: