@ivan-krukov
Created August 9, 2012 18:20
Another quick FASTA parser
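# Read a FASTA file and keep only the sequences whose header matches id_pattern.
import re
import sys

# One record: a ">" followed by everything up to the next ">" (or end of file).
# This assumes ">" never appears inside a header line.
seq_pattern = re.compile(r">[^>]+")
# The header must contain "protein_id:<id>"; the ID (word characters and dots) is captured as "id".
id_pattern = re.compile(r"protein_id:(?P<id>[.\w]+)")

with open(sys.argv[1]) as f:
    text = f.read()

for seq in seq_pattern.findall(text):
    lines = seq.splitlines()
    id_line, data = lines[0], lines[1:]
    match = id_pattern.search(id_line)
    if match:
        # Print the bare ID as the new header and the sequence joined onto one line.
        print(">{seq_id}\n{data}".format(seq_id=match.group("id"), data="".join(data)))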
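
The script above reads the whole file into memory and splits records on ">". For larger files, the same filter could be sketched as a line-by-line generator. This is only an illustration, not part of the original gist: the name `filter_fasta` and the sample records are made up for the example, and the ID regex is copied from the script.

```python
import re

# Same header pattern as the script: "protein_id:" followed by word characters or dots.
ID_PATTERN = re.compile(r"protein_id:(?P<id>[.\w]+)")


def filter_fasta(lines):
    """Yield (seq_id, sequence) for each record whose header matches ID_PATTERN.

    `lines` is any iterable of text lines, e.g. an open file (hypothetical helper).
    """
    seq_id, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if that record was kept.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            match = ID_PATTERN.search(line)
            seq_id = match.group("id") if match else None
            chunks = []
        elif seq_id is not None:
            # Sequence line belonging to a record we are keeping.
            chunks.append(line)
    # Flush the last record; no trailing newline is required.
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Made-up sample input, for illustration only.
example = [
    ">gi|1 protein_id:AAA.1 first\n",
    "MKV\n",
    "LLT\n",
    ">no id here\n",
    "GGG\n",
    ">x protein_id:BBB_2\n",
    "PPQ",
]
records = list(filter_fasta(example))
for rec_id, sequence in records:
    print(">{}\n{}".format(rec_id, sequence))
```

To run it on a real file, pass the open file object directly: `with open(path) as f: for rec_id, sequence in filter_fasta(f): ...`. Because it only treats ">" as a record start at the beginning of a line, this variant also tolerates ">" inside a header, which the regex-based split does not.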