@nhoffman
Created November 16, 2012 20:54
More patterns for matching unclassified sequences
#!/usr/bin/env python
import re
import sys
# A name matching any one of these alternatives is reported as unclassified.
rexp = re.compile(r'|'.join([
    r'\bactinomycete\b',
    r'\bcrenarchaeote\b',
    r'\bculture\b',
    r'\bchimeric\b',
    r'\bcyanobiont\b',
    'degrading',
    r'\beuryarchaeote\b',
    'disease',
    r'\b[cC]lone',
    r'\bmethanogen(ic)?\b',
    'planktonic',
    r'\bplanctomycete\b',
    r'\bsymbiote\b',
    r'\btransconjugant\b',
    r'^[a-z]',  # need to look for false positives
    r'^[a-zA-Z]+\s+[a-zA-Z]*\d',  # digit in second word
]))
for line in sys.stdin:
    # Test only the last field: everything after the first three columns.
    if rexp.search(line.split(None, 3)[-1]):
        sys.stdout.write(line)
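
The following is a minimal sketch of how the filter behaves on a few lines. It assumes each input line has three leading whitespace-separated columns followed by the organism name; the gist does not state the input format. The `is_unclassified` helper and the sample lines are hypothetical and only illustrate which alternatives fire.

```python
import re

# Same alternation as the script above, rebuilt so the sketch is self-contained.
rexp = re.compile(r'|'.join([
    r'\bactinomycete\b',
    r'\bcrenarchaeote\b',
    r'\bculture\b',
    r'\bchimeric\b',
    r'\bcyanobiont\b',
    'degrading',
    r'\beuryarchaeote\b',
    'disease',
    r'\b[cC]lone',
    r'\bmethanogen(ic)?\b',
    'planktonic',
    r'\bplanctomycete\b',
    r'\bsymbiote\b',
    r'\btransconjugant\b',
    r'^[a-z]',
    r'^[a-zA-Z]+\s+[a-zA-Z]*\d',
]))


def is_unclassified(line):
    """Hypothetical helper: apply the script's test to a single line."""
    # split(None, 3) splits at most three times, so the last element is
    # everything from the fourth column onward (assumed to be the name).
    return bool(rexp.search(line.split(None, 3)[-1]))


# Hypothetical sample lines: three placeholder columns, then a name.
samples = [
    "1 2 3 Escherichia coli",       # no alternative matches
    "1 2 3 uncultured bacterium",   # name starts with a lowercase letter
    "1 2 3 Bacillus sp. clone A1",  # contains the word "clone"
    "1 2 3 Pseudomonas sp12 xyz",   # digit in the second word
]

for line in samples:
    print(is_unclassified(line), line)
```

Note that the lowercase-start rule (`^[a-z]`) is broad, which is presumably why the original comment flags it for false positives.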