Skip to content

Instantly share code, notes, and snippets.

@zacharyvoase
Created May 13, 2009 21:25
Show Gist options
  • Save zacharyvoase/111312 to your computer and use it in GitHub Desktop.
Save zacharyvoase/111312 to your computer and use it in GitHub Desktop.
The following shows how you might go about using the processed data which is
output from parse_humsavar.py.
We’ll begin by reading it out from JSON; this involves importing the JSON
library, opening the file, and then just using ``json.load()``::
# If you’re using Python 2.5, you probably have the bottom one (so you can
# just run `import simplejson as json`); if you’re using Python 2.6 then you
# definitely have the top one, so you can just do `import json`.
>>> import json
>>> import simplejson
>>> json_fp = open('non_disease_variants.json')
>>> data = json.load(json_fp)
>>> json_fp.close() # it’s really important to close this.
We now have all the information in a big list called ``data``. Each item in this
list is a dictionary with several fields::
>>> print data[0].keys()
['main_gene_name', 'sprot_entry_name', 'accession', 'feature_id',
'sequence_position', 'aa_change', 'variant_type', 'disease_name']
Each of these dictionaries represents a variant; the ``'aa_change'`` field is
itself a dictionary with ``'wild'`` and ``'mutated'`` fields representing the
change in amino acid that occurred at that sequence position; so for example, if
we wanted to find out the number of variants with Alanine as the wild aa (code
``'A'``), we could do the following::
>>> print sum(1 for var in data if var['aa_change']['wild'].upper() == 'A')
2332
Interesting result - by the way, that is the *actual* result. These sort of
operations are quite easy to do in the interactive interpreter in Python. As you
can see, JSON is only important when we’re loading the data from disk; it
doesn’t make any difference to how we use the data, or even the data itself. One
of the coolest things is that there are JSON parsers written in Ruby, Perl,
Python, PHP, JavaScript, C, C#, C++, Haskell, Erlang, Java, Lua, Objective-C,
Visual Basic and Prolog (believe me, the list goes on). This means that if your
supervisor/professor programs in a language other than Python, he/she can still
see and work with all of your data very easily in his/her preferred language,
without the fuss of SQL and SQLite and MySQL and usernames and passwords and all
of that stuff (you can just forward a JSON file in an attachment). Also, because
JSON doesn’t need a schema, you don’t need to start worrying about designing
tables and deciding on field names, because you can always just change them
really easily without having to delete your database and do syncdb again and all
of that tedious stuff.
# -*- coding: utf-8 -*-
from cStringIO import StringIO
import re
import sys
# Tries to import SimpleJSON from a couple of sources.
try:
import json
except ImportError:
try:
import simplejson as json
except ImportError:
print "It seems you don't have SimpleJSON installed. Go talk to Zack :)"
sys.exit(1)
class HumanSingleAminoAcidVariant(object):
"""A class which represents a Human Single-Amino-Acid Variant."""
fields = ['main_gene_name', 'sprot_entry_name', 'accession', 'feature_id',
'sequence_position', 'aa_change', 'variant_type', 'disease_name']
def __init__(self, *args):
# Whereas we would usually go `for item in list:`, if we call
# `enumerate()` on that list, instead we can use
# `for i, item in enumerate(list):`, where `i` is the numerical index of
# the item in the list.
for i, value in enumerate(args):
# `setattr()` is a function which takes an object (in this case
# we’re using `self`), a name and a value, and sets the attribute
# with that name on that object to the value. For example:
#
# setattr(obj, 'attr', 'value')
#
# Is the same as doing:
#
# obj.attr = 'value'
#
# The reason we do it is because it allows us to choose what
# attribute to set dynamically (in this case we’re looking in
# `self.fields` to find the right attribute for that particular item
# in the list of arguments).
setattr(self, self.fields[i], value)
# This is a classmethod; it basically means that it isn’t called on an
# instance of the class, but the class itself. Therefore it takes the class
# as the first parameter, not `self`. Classmethods are used instead of just
# plain old functions because they can be inherited, which means subclasses
# will also have the same `parse_line()` method which will operate on the
# subclass.
@classmethod
def parse_line(cls, line):
# If we’ve been given an empty line, just return `None`. Whatever calls
# this function should recognize that `None` has been returned and just
# ignore the result.
if not line.strip():
return
# The StringIO object is a class which wraps around a string and acts
# like a file object, with read() and write() methods. Its main use is
# in wrapping strings so they can be passed to functions which only take
# file objects, but it’s also a way for us to cut the string up without
# having to use the fiddly string[index1:index2] syntax; we just call
# `io.read(n)` to pull out a certain number of characters at a time.
io = StringIO(line)
# This will store the arguments to our class’s __init__ method.
args = []
# Main Gene Name
main_gene_name = io.read(10).strip()
if main_gene_name == 'n.a.':
args.append(None)
else:
args.append(main_gene_name)
args.append(io.read(13).strip()) # SwissProt Entry Name
args.append(io.read(8).strip()) # Accession
args.append(io.read(10).strip()) # Feature Identifier
# We use `int()` here to parse the string into an integer
args.append(int(io.read(7).strip())) # Sequence Position
# Amino Acid/Residue Change
# An example of a classmethod in action; we call parse() on the
# AminoAcidChange class *itself*, not an instance thereof.
args.append(AminoAcidChange.parse(io.read(8).strip()))
# We’re converting it all to lower case for the sake of consistency :)
args.append(io.read(14).strip().lower()) # Variant Type
# Calling `io.read()` reads all the data left up to the end of the line.
# By stripping this, we can determine if there is or isn’t a disease
# name associated with this SAVariation. The `or` statement below is an
# example of what is known as ‘short-circuit evaluation’. What one would
# expect to happen is this (in sort of pseudo-code):
#
# def OR(left_statement, right_statement):
# left = evaluate(left_statement)
# right = evaluate(right_statement)
#
# if either(left, right):
# return True
# else:
# return False
#
# What Python actually does is this:
#
# def OR(left_statement, right_statement):
# left = evaluate(left_statement)
# if left:
# return left
# else:
# return evaluate(right_statement)
#
# So in actual fact, if the left-hand statement is true, then the right-
# hand statement will never be evaluated.
#
# In the following case, `io.read().strip()` will return the rest of the
# line without leading and trailing space characters. So if there *is* a
# disease name, this string will not be empty, will evaluate to true,
# and will be passed to `args.append()`. If it is empty, then Python
# will return the result of evaluating the right-hand statement (i.e.
# None), passing it into `args.append()`.
#
# All-in-all, if there is a disease name, it will be appended to the
# list of arguments; if not, `None` will be appended.
args.append(io.read().strip() or None)
# Finally, instantiate the class and return the instance. We need to
# turn `args` into a tuple because the `function(*arguments)` syntax
# only works when `arguments` is a tuple (for some reason).
return cls(*tuple(args))
# Another classmethod; this time you give it a file full of lines and it
# parses each line in the file, returning the list of parsed lines. It is
# convention to use `fp` instead of `file` as the variable to hold a file
# object, as `fp` stands for ‘file pointer’ and `file` is already the name
# of a built-in Python class (incidentally, the file pointers themselves are
# usually instances of this class). By the way, the `StringIO` class used
# above is completely compatible with the usual `file` methods and so can be
# passed in here instead of a file object. This is useful if you have a
# string held in memory and want to parse it as if it were the contents of
# an actual file.
@classmethod
def parse_file(cls, fp):
results = []
# File pointer objects can be used in for loops, where each iteration
# will yield the next line in the file.
for (i, line) in enumerate(fp):
variant = cls.parse_line(line)
# If the line is empty, `parse_line()` will return `None`.
if variant is not None:
results.append(variant)
return results
def to_dictionary(self):
# Python supports what’s known as ‘generator syntax’ in function calls;
# it basically means the line below is the equivalent of creating a list
# of pairs of `(attr, getattr(self, attr, None))` and then passing the
# list in to `dict()`.
#
# `getattr()` is a method which takes an object, an attribute name and
# a default value. It attempts to get that attribute from the object,
# and if it can’t find it it will return the default value. If no
# default value is given it raises an `AttributeError`. The following:
#
# object.attribute
#
# Is equivalent to:
#
# getattr(object, 'attribute')
#
# Again, as with `setattr()`, the benefit is that we can choose the
# attribute to fetch as we go along, rather than hardcoding it in our
# program.
di = dict((attr, getattr(self, attr, None)) for attr in self.fields)
# A little hack to deal with the internal AminoAcidChange object.
di['aa_change'] = self.aa_change.to_dictionary()
return di
# Another tip: usually, when parsing or writing a `from_something()` method,
# it will likely be a classmethod (because an instance hasn’t been created
# yet).
@classmethod
def from_dictionary(cls, dictionary):
arguments = []
# Don’t forget that because `fields` was defined at the class level,
# it’s also an attribute of the class object itself.
for field in cls.fields:
if field == 'aa_change':
arguments.append(
AminoAcidChange.from_dictionary(dictionary['aa_change']))
else:
# `dict.get()` is a method which takes a key and a default value; if
# it can’t find that key in the dictionary it returns the default.
arguments.append(dictionary.get(field, None))
# See earlier commends for the `*tuple(arguments)` explanation.
return cls(*tuple(arguments))
class AminoAcidChange(object):
def __init__(self, wild, mutated):
self.wild = wild
self.mutated = mutated
@classmethod
def parse(cls, string):
# This regex uses Python’s special syntax for creating named groups in
# Regular Expressions. The `re.match()` function takes a regex and a
# target string, applies the regex to the string and returns a Match
# object which, if the match was successful, will contain information on
# the match; if unsuccessful, it will be None (in which case calling
# match.group() below will raise an error anyway).
match = re.match(
r'(?P<wild>[A-Za-z])[\s]+\-\>[\s]+(?P<mutated>[A-Za-z])', string)
# Here we use match.group() with the name of the named group we made in
# the regex above to retrieve the string matched by that particular
# group.
return cls(match.group('wild'), match.group('mutated'))
# These two methods are really simple; we’re not using the `field` stuff
# because it’s not necessary for only two attributes (`wild` and `mutated`).
def to_dictionary(self):
return {'wild': self.wild, 'mutated': self.mutated}
@classmethod
def from_dictionary(cls, dictionary):
return cls(dictionary['wild'], dictionary['mutated'])
# This is the main body of the program; what will be run when the application is
# executed (via `python parse_humsavar.py` or some other way). As we go along we
# give some helpful logging output (it also helps us figure out where things
# went wrong in the case of a crash).
def main():
# Open the file containing Human Single-Amino-Acid Variant information.
humsavar_fp = open('humsavar.txt')
# The following essentially grabs each line in the nondisease.txt file and
# strips the newline characters. This results in a list of the non-disease
# accession numbers at `nondisease_accns`.
nondisease_fp = open('nondisease.txt')
print 'Reading non-disease accession numbers'
nondisease_accns = [line.strip() for line in nondisease_fp.readlines()]
print 'Finished reading non-disease accession numbers'
nondisease_fp.close() # Closing the file is very important.
# We made things *so* much simpler by defining that in a classmethod rather
# than writing it all down here! This should get us a list of *every* human
# SAA variant in the file.
print 'Reading all Human Single-Amino-Acid Variants'
all_variants = HumanSingleAminoAcidVariant.parse_file(humsavar_fp)
print 'Finished reading all Human Single-Amino-Acid Variants'
humsavar_fp.close() # Closing the file is very important. Again.
# We now go in and filter this list for only the ones whose accessions are
# in the list of non-disease accession numbers. It’s simple enough to
# understand from the code alone (that’s the beauty of Python).
non_disease_variants = []
print 'Filtering non-disease variants'
for variant in all_variants:
if variant.accession in nondisease_accns:
non_disease_variants.append(variant)
print 'Finished filtering non-disease variants'
# We now have to write this all out to disk. I’ve chosen JSON as the data
# format to write it to because it’s simple, there are libraries for parsing
# and creating JSON in almost *every* programming language, and also because
# I can’t really think of any other way to write the data out where you can
# access it and start playing with it so easily.
json_fp = open('non_disease_variants.json', 'w')
# Convert each variant to a plain dictionary so it can be saved as JSON.
print 'Serializing non-disease variants to JSON'
non_disease_dicts = [var.to_dictionary() for var in non_disease_variants]
# `json.dumps()` is a function in the SimpleJSON library which takes a
# simple Python object (i.e. one made up of only lists, dictionaries,
# numbers, strings, True, False and None) and returns a string with the
# encoded JSON. We pass that directly into `json_fp.write()` so it can write
# it to disk, close the file pointer and we’re done :)
json_fp.write(json.dumps(non_disease_dicts))
print 'Finished serializing non-disease variants to JSON'
json_fp.close()
print 'Done! :)'
# Only execute the main part of this program if it is being run from the
# command-line. If it’s being imported as a library, `main()` should not run.
# The following is an idiom which checks this and only runs `main()` if the
# program is being run from the command-line.
if __name__ == '__main__':
main()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment