Tested on Python 2.7
python upgma.py dataset.fa
Unweighted Pair Group Method with Arithmatic Mean is an algorithm for computing trees that convey the relatedness of strings. In our case, the strings are protein sequences taken from the SCOP protein database. The tree (given in Newick format), groups closely related proteins together, then groups closely related groups together. Evolutionary distance is measured by the edit distance which is calculated using the Needleman-Wunsch algorithm. When groups are joined, the distance from that group to others is computed as the average (hence, arithmatic mean).
For example, consider these strings:
1. seoul
2. saopaolo
3. vancouver
We use Needleman Wunsch to compute edit distances and store them in a matrix:
- 1 2 3
1 0 5 7
2 0 8
3 0
(minor note: algorithm actually scores pairs by 100*edit_distance/length(max{|a|, |b|})
)
We then group the most closely related entries in the matrix, in this case 1
and 2
:
s((1,2), 3) = s(1,3) + s(2,3) = (5 + 8)/2 = 9
- 1,2 3
1,2 0 9
3 0
At this point, the only possible grouping is between (1,2)
and 3
so we group those together and return out the tree in Newigg format:
(3, (2, 1))
This makes sense: 2
and 1
do seem more similar to eachother than each is to 3
.
This shows that 2
and 1
are most closely related to each other. You can't say that (2, 1)
is related to 3
because 3
is the only choice. Bioinformaticians use outgroups for this purpose. An outgroup is a sequence that is predicted to be unrelated. If 3
was closer to (2, 1)
than say anobviouslyveryunrelatedstringsuchasthisone
, then it would be safe to say that 3
is related to (2, 1)
The algorithm was tested on 4 representative proteins for 4 families in the SCOP database.
- Ribulose-phoshate binding barrel
- Cellulase-like
- OMPA-like
- Ubiquitin-like
This is the Newigg format for the tree computed:
( 11, ( 10, ( 8, ( 2, ( 3, ( 0, ( ( ( 9, ( ( ( ( 1, ( ( 6, 5) , 15) ) , 7) , 4) , 14) ) , 12) , 13) ) ) ) ) ) )
This is the tree drawn with DrawTree:
This is how the families should be grouped if you have a non-bifurcating tree:
((1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16))
Which doesn't match the computed tree very much. Some families group together on the tree but, for example, 5 grops with 15 which shouldn't happen as they are from different superfamilies.
The problem with the UPGMA algorithm is that it assumes constant evolutionary rate which isn't true. It would work better on closely related proteins.