Skip to content

Instantly share code, notes, and snippets.

@audy
Created October 29, 2010 00:24
Show Gist options
  • Save audy/652635 to your computer and use it in GitHub Desktop.
Save audy/652635 to your computer and use it in GitHub Desktop.
putting this here in case my house catches on fire.

Unweighted Pair Group Method w/ Arithmatic Mean

Austin G. Davis-Richardson

Project Development Log

Invocation

Tested on Python 2.7

python upgma.py dataset.fa

Description

Unweighted Pair Group Method with Arithmatic Mean is an algorithm for computing trees that convey the relatedness of strings. In our case, the strings are protein sequences taken from the SCOP protein database. The tree (given in Newick format), groups closely related proteins together, then groups closely related groups together. Evolutionary distance is measured by the edit distance which is calculated using the Needleman-Wunsch algorithm. When groups are joined, the distance from that group to others is computed as the average (hence, arithmatic mean).

For example, consider these strings:

	1. seoul
	2. saopaolo
	3. vancouver

We use Needleman Wunsch to compute edit distances and store them in a matrix:

  - 1 2 3 
  1 0 5 7
	2   0 8
	3     0

(minor note: algorithm actually scores pairs by 100*edit_distance/length(max{|a|, |b|}))

We then group the most closely related entries in the matrix, in this case 1 and 2:

	s((1,2), 3) = s(1,3) + s(2,3) = (5 + 8)/2 = 9
	
	  - 1,2 3
	1,2   0 9
	  3     0

At this point, the only possible grouping is between (1,2) and 3 so we group those together and return out the tree in Newigg format:

	(3, (2, 1))

This makes sense: 2 and 1 do seem more similar to eachother than each is to 3.

This shows that 2 and 1 are most closely related to each other. You can't say that (2, 1) is related to 3 because 3 is the only choice. Bioinformaticians use outgroups for this purpose. An outgroup is a sequence that is predicted to be unrelated. If 3 was closer to (2, 1) than say anobviouslyveryunrelatedstringsuchasthisone, then it would be safe to say that 3 is related to (2, 1)

Discussion:

The algorithm was tested on 4 representative proteins for 4 families in the SCOP database.

  1. Ribulose-phoshate binding barrel
  2. Cellulase-like
  3. OMPA-like
  4. Ubiquitin-like

This is the Newigg format for the tree computed:

( 11, ( 10, ( 8, ( 2, ( 3, ( 0, ( ( ( 9, ( ( ( ( 1, ( ( 6, 5) , 15) ) , 7) , 4) , 14) ) , 12) , 13) ) ) ) ) ) ) 

This is the tree drawn with DrawTree:

Phylogram

This is how the families should be grouped if you have a non-bifurcating tree:

((1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16))

Which doesn't match the computed tree very much. Some families group together on the tree but, for example, 5 grops with 15 which shouldn't happen as they are from different superfamilies.

The problem with the UPGMA algorithm is that it assumes constant evolutionary rate which isn't true. It would work better on closely related proteins.

>sp|P16250|HIS4_STRCO Phosphoribosyl isomerase A OS=Streptomyces coelicolor GN=priA PE=1 SV=1
MSKLELLPAVDVRDGQAVRLVHGESGTETSYGSPLEAALAWQRSGAEWLHLVDLDAAFGT
GDNRALIAEVAQAMDIKVELSGGIRDDDTLAAALATGCTRVNLGTAALETPEWVAKVIAE
HGDKIAVGLDVRGTTLRGRGWTRDGGDLYETLDRLNKEGCARYVVTDIAKDGTLQGPNLE
LLKNVCAATDRPVVASGGVSSLDDLRAIAGLVPAGVEGAIVGKALYAKAFTLEEALEATS
>sp|Q9WYG7|PYRF_THEMA Orotidine 5'-phosphate decarboxylase OS=Thermotoga maritima GN=pyrF PE=1 SV=1
MTPVLSLDMEDPIRFIDENGSFEVVKVGHNLAIHGKKIFDELAKRNLKIILDLKFCDIPS
TVERSIKSWDHPAIIGFTVHSCAGYESVERALSATDKHVFVVVKLTSMEGSLEDYMDRIE
KLNKLGCDFVLPGPWAKALREKIKGKILVPGIRMEVKADDQKDVVTLEEMKGIANFAVLG
REIYLSENPREKIKRIKEMRL
>sp|P65517|NANE_STAAN Putative N-acetylmannosamine-6-phosphate 2-epimerase OS=Staphylococcus aureus (strain N315) GN=nanE PE=1 SV=1
MLPHGLIVSCQALADEPLHSSFIMSKMALAAYEGGAVGIRANTKEDILAIKETVDLPVIG
IVKRDYDHSDVFITATSKEVDELIESQCEVIALDATLQQRPKETLDELVSYIRTHAPNVE
IMADIATVEEAKNAARLGFDYIGTTLHGYTSYTQGQLLYQNDFQFLKDVLQSVDAKVIAE
GNVITPDMYKRVMDLGVHCSVVGGAITRPKEITKRFVQVMED
>sp|Q5L3Y2|PDXS_GEOKA Pyridoxal biosynthesis lyase pdxS OS=Geobacillus kaustophilus GN=pdxS PE=1 SV=1
MALTGTDRVKRGMAEMQKGGVIMDVVNAEQAKIAEAAGAVAVMALERVPADIRAAGGVAR
MADPTVIEEVMNAVSIPVMAKVRIGHYVEARVLEALGVDYIDESEVLTPADEEFHIDKRQ
FTVPFVCGCRDLGEAARRIAEGASMLRTKGEPGTGNIVEAVRHMRKVNAQIRKVVNMSED
ELVAEAKQLGAPVEVLREIKRLGRLPVVNFAAGGVATPADAALMMHLGADGVFVGSGIFK
SENPEKYARAIVEATTHYEDYELIAHLSKGLGGAMRGIDIATLLPEHRMQERGW
>sp|P00691|AMY_BACSU Alpha-amylase OS=Bacillus subtilis GN=amyE PE=1 SV=2
MFAKRFKTSLLPLFAGFLLLFHLVLAGPAAASAETANKSNELTAPSIKSGTILHAWNWSF
NTLKHNMKDIHDAGYTAIQTSPINQVKEGNQGDKSMSNWYWLYQPTSYQIGNRYLGTEQE
FKEMCAAAEEYGIKVIVDAVINHTTSDYAAISNEVKSIPNWTHGNTQIKNWSDRWDVTQN
SLLGLYDWNTQNTQVQSYLKRFLDRALNDGADGFRFDAAKHIELPDDGSYGSQFWPNITN
TSAEFQYGEILQDSASRDAAYANYMDVTASNYGHSIRSALKNRNLGVSNISHYASDVSAD
KLVTWVESHDTYANDDEESTWMSDDDIRLGWAVIASRSGSTPLFFSRPEGGGNGVRFPGK
SQIGDRGSALFEDQAITAVNRFHNVMAGQPEELSNPNGNNQIFMNQRGSHGVVLANAGSS
SVSINTATKLPDGRYDNKAGAGSFQVNDGKLTGTINARSVAVLYPDDIAKAPHVFLENYK
TGVTHSFNDQLTITLRADANTTKAVYQINNGPETAFKDGDQFTIGKGDPFGKTYTIMLKG
TNSDGVTRTEKYSFVKRDPASAKTIGYQNPNHWSQVNAYIYKHDGSRVIELTGSWPGKPM
TKNADGIYTLTLPADTDTTNAKVIFNNGSAQVPGQNQPGFDYVLNGLYNDSGLSGSLPH
>sp|P40943|XYN1_BACST Endo-1,4-beta-xylanase OS=Bacillus stearothermophilus PE=1 SV=1
MRNVVRKPLTIGLALTLLLPMGMTATSAKNADSYAKKPHISALNAPQLDQRYKNEFTIGA
AVEPYQLQNEKDVQMLKRHFNSIVAENVMKPISIQPEEGKFNFEQADRIVKFAKANGMDI
RFHTLVWHSQVPQWFFLDKEGKPMVNETDPVKREQNKQLLLKRLETHIKTIVERYKDDIK
YWDVVNEVVGDDGKLRNSPWYQIAGIDYIKVAFQAARKYGGDNIKLYMNDYNTEVEPKRT
ALYNLVKQLKEEGVPIDGIGHQSHIQIGWPSEAEIEKTINMFAALGLDNQITELDVSMYG
WPPRAYPTYDAIPKQKFLDQAARYDRLFKLYEKLSDKISNVTFWGIADNHTWLDSRADVY
YDANGNVVVDPNAPYAKVEKGKGKDAPFVFGPDYKVKPAYWAIIDHK
>tr|Q9WXS5|Q9WXS5_THEMA Endo-1,4-beta-xylanase B OS=Thermotoga maritima GN=TM_0070 PE=1 SV=1
MKILPSVLILLLGCVPVFSSQNVSLRELAEKLNIYIGFAAINNFWSLSDAEKYMEVARRE
FNILTPENQMKWDTIHPERDRYNFTPAEKHVEFAEENDMIVHGHTLVWHNQLPGWITGRE
WTKEELLNVLEDHIKTVVSHFKGRVKIWDVVNEAVSDSGTYRESVWYKTIGPEYIEKAFR
WAKEADPDAILIYNDYSIEEINAKSNFVYNMIKELKEKGVPVDGIGFQMHIDYRGLNYDS
FRRNLERFAKLGLQIYITEMDVRIPLSGSEEYYLKKQAEVCAKIFDICLDNPAVKAIQFW
GFTDKYSWVPGFFKGYGKALLFDENYNPKPCYYAIKEVLEKKIEERK
>sp|P49235|BGLC_MAIZE Beta-glucosidase, chloroplastic OS=Zea mays GN=GLU1 PE=1 SV=1
MAPLLAAAMNHAAAHPGLRSHLVGPNNESFSRHHLPSSSPQSSKRRCNLSFTTRSARVGS
QNGVQMLSPSEIPQRDWFPSDFTFGAATSAYQIEGAWNEDGKGESNWDHFCHNHPERILD
GSNSDIGANSYHMYKTDVRLLKEMGMDAYRFSISWPRILPKGTKEGGINPDGIKYYRNLI
NLLLENGIEPYVTIFHWDVPQALEEKYGGFLDKSHKSIVEDYTYFAKVCFDNFGDKVKNW
LTFNEPQTFTSFSYGTGVFAPGRCSPGLDCAYPTGNSLVEPYTAGHNILLAHAEAVDLYN
KHYKRDDTRIGLAFDVMGRVPYGTSFLDKQAEERSWDINLGWFLEPVVRGDYPFSMRSLA
RERLPFFKDEQKEKLAGSYNMLGLNYYTSRFSKNIDISPNYSPVLNTDDAYASQEVNGPD
GKPIGPPMGNPWIYMYPEGLKDLLMIMKNKYGNPPIYITENGIGDVDTKETPLPMEAALN
DYKRLDYIQRHIATLKESIDLGSNVQGYFAWSLLDNFEWFAGFTERYGIVYVDRNNNCTR
YMKESAKWLKEFNTAKKPSKKILTPA
>sp|P62987|RL40_HUMAN Ubiquitin-60S ribosomal protein L40 OS=Homo sapiens GN=UBA52 PE=1 SV=2
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYN
IQKESTLHLVLRLRGGIIEPSLRQLAQKYNCDKMICRKCYARLHPRAVNCRKKKCGHTNN
LRPKKKVK
>sp|Q9QZ49|UBXN8_MOUSE UBX domain-containing protein 8 OS=Mus musculus GN=Ubxn8 PE=1 SV=1
MASRGVVGLFLLSALPLLCLELRRGIPSLGIKDLILLSGRIFLLLALLTLVISVTTSWFN
SLKPSQGHLKEGEKENEKRRRLVRERQQEAQGEKASRYIENVLKPQQEMKLKKLEERFYQ
MTGETWKLTAGHRLLEGDEDSEFENSSQASFETINGEAARRQNLPKFSTEISPAARPLLR
KEVPDLPEEPSETAEEVVTVALRCPNGRVLRRRFFKSWNSQVLLDWMMKVGYHKSLYRLS
TSFPRRALEVEGGSSLEDIGITVDTVLNVEEKEQSSQ
>sp|Q62625|MLP3B_RAT Microtubule-associated proteins 1A/1B light chain 3B OS=Rattus norvegicus GN=Map1lc3b PE=1 SV=3
MPSEKTFKQRRSFEQRVEDVRLIREQHPTKIPVIIERYKGEKQLPVLDKTKFLVPDHVNM
SELIKIIRRRLQLNANQAFFLLVNGHSMVSVSTPISEVYESERDEDGFLYMVYASQETFG
TALAVTYMSALKATATGREPCL
>sp|P61960|UFM1_HUMAN Ubiquitin-fold modifier 1 OS=Homo sapiens GN=UFM1 PE=1 SV=1
MSKVSFKITLTSDPRLPYKVLSVPESTPFTAVLKFAAEEFKVPAATSAIITNDGIGINPA
QTAGNVFLKHGSELRIIPRDRVGSC
>sp|P0A920|OMPX_SHIFL Outer membrane protein X OS=Shigella flexneri GN=ompX PE=3 SV=1
MKKIACLSALAAVLAFTAGTSVAATSTVTGGYAQSDAQGQMNKMGGFNLKYRYEEDNSPL
GVIGSFTYTEKSRTASSGDYNKNQYYGITAGPAYRINDWASIYGVVGVGYGKFQTTEYPT
YKHDTSDYGFSYGAGLQFNPMENVALDFSYEQSRIRSVDVGTWIAGVGYRF
>tr|Q83LW9|Q83LW9_SHIFL Putative uncharacterized protein crcA OS=Shigella flexneri GN=crcA PE=4 SV=1
MNVSKYVAIFSFVFIQLISVGKVFANADERMTTFRENIAQTWQQPEHYDLYIPAITWHAR
FAYDKEKTDRYNERPWGGGFGLSRWDEKGNWHGLYAMAFKDSWNKWEPIAGYGWESTWRP
LADENFHLGLGFTAGVTARDNWNYIPLPVLLPLASVGYGPVTFQMTYIPGTYNNGNVYFA
WMRFQF
>tr|Q6QCC2|Q6QCC2_NEIME Lipoprotein OS=Neisseria meningitidis GN=gna1870 PE=1 SV=1
MNRTAFCCLSLTTALILTACSSGGGGVAADIGAGLADALTAPLDHKDKGLQSLTLDQSVR
KNEKLKLAAQGAEKTYGNGDSLNTGKLKNDKVSRFDFIRQIEVDGQLITLESGEFQVYKQ
SHSALTAFQTEQIQDSEHSGKMVAKRQFRIGDIAGEHTSFDKLPEGGRATYRGTAFGSDD
AGGKLTYTIDFAAKQGNGKIEHLKSPELNVDLAAADIKPDGKRHAVISGSVLYNQAEKGS
YSLGIFGGKAQEVAGSAEVKTVNGIRHIGLAAKQ
>sp|P0A431|PSBO_THEEB Photosystem II manganese-stabilizing polypeptide OS=Thermosynechococcus elongatus (strain BP-1) GN=psbO PE=1 SV=1
MKYRILMATLLAVCLGIFSLSAPAFAAKQTLTYDDIVGTGLANKCPTLDDTARGAYPIDS
SQTYRIARLCLQPTTFLVKEEPKNKRQEAEFVPTKLVTRETTSLDQIQGELKVNSDGSLT
FVEEDGIDFQPVTVQMAGGERIPLLFTVKNLVASTQPNVTSITTSTDFKGEFNVPSYRTA
NFLDPKGRGLASGYDSAIALPQAKEEELARANVKRFSLTKGQISLNVAKVDGRTGEIAGT
FESEQLSDDDMGAHEPHEVKIQGVFYASIEPA
0 78 78 76 80 77 76 78 80 79 79 80 76 79 77 81
0 0 78 77 82 78 76 80 77 79 77 79 83 82 77 79
0 0 0 76 80 77 76 80 79 80 78 79 79 80 77 78
0 0 0 0 79 77 81 78 80 82 79 83 78 78 81 81
0 0 0 0 0 77 78 79 86 80 86 89 83 83 77 77
0 0 0 0 0 0 67 76 80 77 81 85 80 78 77 74
0 0 0 0 0 0 0 77 80 80 80 84 78 77 79 77
0 0 0 0 0 0 0 0 85 79 84 88 82 82 78 78
0 0 0 0 0 0 0 0 0 79 83 78 80 79 78 79
0 0 0 0 0 0 0 0 0 0 78 83 78 78 80 83
0 0 0 0 0 0 0 0 0 0 0 78 83 80 80 79
0 0 0 0 0 0 0 0 0 0 0 0 78 80 81 80
0 0 0 0 0 0 0 0 0 0 0 0 0 82 76 76
0 0 0 0 0 0 0 0 0 0 0 0 0 0 79 78
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 80
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
( 11, ( 10, ( 8, ( 2, ( 3, ( 0, ( ( ( 9, ( ( ( ( 1, ( ( 6, 5) , 15) ) , 7) , 4) , 14) ) , 12) , 13) ) ) ) ) ) )
>1
aaaaaaaa
>2
aaaaaaab
>3
aaaaaabb
>4
aaaaabbb
>5
aaaabbbb
>6
aaabbbbb
>7
aabbbbbb
>8
abbbbbbb
#!/usr/bin/env python
#encoding: utf-8
# Unweighted Pair Group Method w/ Arithmatic Mean
# Austin G. Davis-Richardson
import sys
from itertools import combinations
def main():
""" Application Logic """
try:
infile = sys.argv[1]
except IndexError:
print 'USAGE: %s input.fa' % (sys.argv[0])
quit(1)
with open(infile) as handle:
records = {}
fasta = Fasta(handle)
c = 0
for record in fasta:
records[c] = record.seq
c += 1
# Create edit-distance matrix
scores = PrettyDict()
for a, b in combinations(records, 2):
aln = Nw.align(records[a], records[b])
d = 100*aln[-1][-1]/float(len(max([records[a],records[b]], key=len)))
scores[(b, a)] = d
print scores
print ''
# UPGMA it, get tree in Newick format
tree = Upgma.calculate_tree(scores)
print '%s' % str(tree).replace('(','( ').replace(')',') ')
class Upgma(object):
""" Unweighted Pair Group Method w/ Arithmatic Mean """
@classmethod
def calculate_tree(self, m):
""" Returns tree in Newick format given hash of distances """
while len(m) > 1:
low = float('inf')
for i in m:
if m[i] <= low:
lk, low = i, m[i]
del m[lk]
for i in m.keys():
for j in i:
if j in lk:
try:
new_key = set(i)
new_key.remove(j)
new_key.add(lk)
m[tuple(new_key)] = (m[i] + low)/2
del m[i]
except KeyError:
pass
return m.keys()[0]
class Nw(object):
@classmethod
def align(self, a, b):
""" Returns Levenstein distance of two strings using NW """
a = ' ' + a
b = ' ' + b
m = Matrix(len(a), len(b))
for f, y in zip(b, range(len(m))):
for s, x in zip(a, range(len(m[y]))):
if y is 0 and x is 0:
continue
elif y == 0:
m[y][x] = m[y][x-1] + 1
elif x == 0:
m[y][x] = m[y-1][x] + 1
elif f == s:
m[y][x] = m[y-1][x-1]
elif f != s:
m[y][x] = min([m[y-1][x-1]+1, m[y-1][x]+1, m[y][x-1]+1])
return m
class Matrix(list):
""" A cute little matrix object """
def __init__(self, x, y):
""" initialize """
[ self.append([0]*x) for i in range(y) ]
def __repr__(self):
""" niceprint """
return '\n'.join(['\t'.join([str(j) for j in i]) for i in self ])
class PrettyDict(dict):
""" Makes your Dict look like a Matrix """
def __repr__(self):
j = []
[ j.extend(i) for i in self.keys() ]
l = len(set(j))
m = Matrix(l, l)
for i in self:
m[i[1]][i[0]] = '%.0f' % self[i]
return m.__repr__()
class Fasta:
''' A FASTA iterator/generates DNA objects. '''
def __init__(self, handle):
self.handle = handle
def __repr__(self):
return '<FASTA: for handle = %s>' % handle
def __iter__(self):
header, sequence = '', []
for line in self.handle:
if line[0] == '>':
if sequence: yield Record(header, sequence)
header = line[1:-1]
sequence = []
else:
sequence.append(line.strip())
yield Record(header, sequence)
class Record(object):
""" A FASTA record """
def __init__(self, header, sequence):
self.header = header
self.sequence = sequence
def __str__(self):
return '>%s\n%s' % (self.header, '\n'.join(self.sequence))
def __len__(self):
return len(self.seq)
@property
def seq(self):
return ''.join(self.sequence)
@property
def head(self):
return self.header
if __name__ == '__main__':
try:
main()
except KeyboardInterrupt:
print 'ancelled!'
quit(-1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment