Skip to content

Instantly share code, notes, and snippets.

@martijnvermaat
Last active April 28, 2016 13:55
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save martijnvermaat/6465d41b0cc534c3fecf7ed946477886 to your computer and use it in GitHub Desktop.
Save martijnvermaat/6465d41b0cc534c3fecf7ed946477886 to your computer and use it in GitHub Desktop.
Genomic GenBank files and transcript identifiers

Genomic GenBank files and transcript identifiers

Using the attached script we check all genomic GenBank files in the server cache for the transcript_id field in their mRNA features.

Using the cache from mutalyzer.nl

Category Files checked Transcripts With ID Without ID %
NG_ 3076 5989 5978 11 0.2 %
NC_ 8 13649 13649 0 0.0 %
UD_ 52760 306619 306617 2 0.0 %
Other 280 2215 2116 99 4.5 %

Using the cache from test.mutalyzer.nl

Category Files checked Transcripts With ID Without ID %
NG_ 961 1876 1865 11 0.6 %
NC_ 10 13676 13649 27 0.2 %
UD_ 1803 5875 5869 6 0.1 %
Other 860 6385 5542 843 13.2 %

Notes

The Other category is defined as all GenBank files not starting with:

  • NG_
  • NC_
  • UD_
  • NM_
  • XM_
  • NR_
  • XR_
  • NP_
  • XP_
#!/usr/bin/env python
#
# Check annotated transcripts in genomic GenBank files for the 'transcript_id
# field.
#
# For every annotated transcript, this reports:
# 1. Gene symbol (or '<unknown gene>').
# 2. 'pseudo' if this is annotated as a pseudo gene, 'normal' otherwise.
# 3. 'good' if the 'transcript_id' field is present, 'missing' otherwise.
#
# Example usage:
#
# $ ./check.py NG_1245.gb
# MAST3 normal good
# MAST3 normal good
# MAST3 normal good
# BLAAT3 pseudo missing
# ABC5 normal missing
# PIK3R2 normal good
import sys
from Bio import SeqIO
record = SeqIO.read(sys.argv[1], 'genbank')
for feature in record.features:
if feature.type == 'source':
if feature.qualifiers.get('organelle') == 'mitochondrion':
break
if any(t in ('mRNA', 'transcribed RNA') for t in feature.qualifiers.get('mol_type', [])):
break
if feature.type == 'mRNA':
print ','.join(feature.qualifiers.get('gene', ['<unknown gene>'])),
if 'pseudo' in feature.qualifiers:
print 'pseudo',
else:
print 'normal',
if 'transcript_id' in feature.qualifiers:
print 'good'
else:
print 'missing'
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment