Skip to content

Instantly share code, notes, and snippets.

@harshberia93
Created March 19, 2014 17:36
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save harshberia93/9647053 to your computer and use it in GitHub Desktop.
Save harshberia93/9647053 to your computer and use it in GitHub Desktop.
Title : Pairwise Sequence Alignment
Organization: Open Bioinformatics Foundation
Short description: The goal of this project is to implement pairwise sequence alignment for a protein sequence using dynamic programming in Biopython. Functions are provided for both local and global alignment, in which the user can specify match/mismatch score, gap opening and gap extension penalty or alternatively use Blosum62 or PAM matrix for match/mismatch. All possible alignments are displayed.
Full Name : Harsh Beria
Motivation :
Bioinformatics is relevant to my field of study and interest. I am currently in my third year at Indian Institute of Technology Kharagpur. I love programming and there is a lot in bioinformatics to implement. Previously, I have written PDB parsers, global sequence alignments and protein folding using Monte Carlo simulations. Currently, I am working in improvement of the performance of Protein Docking algorithm by refining Side Chains. Given my understanding of the subject, implementing pairwise sequence alignment won't be very challenging. Also, I want to start contributing to Open Bioinformatics Foundation so as to improve my programming skills and making my implementation more efficient. This is something which goes missing in my college projects.
Summary of Programming Experience and Skills :
I work in C and Python. I have done substantial amount of work in Web Programming, having implemented instant chat engine, forum support. I have written Python code for solving Graph Theory problems like Dijkstra and Bellman Ford Shortest path algorithm, DFS, BFS, etc. I have created Python Tutorial Section in Hacker Rank (https://www.hackerrank.com/categories/miscellaneous/python-tutorials). Overall, I am well equipped with good understanding of algorithms and data structures typically covered in second and third year of college. Also, I have completed advanced course in Graph Theory. Apart from this, I have some experience in building mobile applications.
I like working on problems which require a clear understanding of fundamentals of algorithms and data structures and wish to improve my programming skills.
Project Details :
In the Biopython module, there is already an implementation of Pairwise Sequence Alignment (Bio/pairwise2.py). As pointed out by Michiel de Hoon in Biopython-dev mailing list, it is not general and not being maintained. So, I want to rebuild it for both Global and Local pairwise alignment for any nucleotide or Protein Sequence. I will be using dynamic programming ( Needleman Wunch algorithm for global alignment and Smith Waterman algorithm for local alignment ). The input will be two valid nucleotide or protein sequences and the output will be all possible alignments between the two sequences. Also, the score will be displayed along with the alignments. The entire project is divided into five parts :
Input Parameters - For both local and global alignments, there are two sets of data, (a)match/mismatch score (b)gap opening and gap extension penalty.
1. Match/Mismatch Score - Four cases :
(I) Default [MATCH = 1, MISMATCH = 0]
(II) Match, mismatch score provided in function arguments
(III) Blosum62 or PAM matrix used
(IV) Callback function provided.
2. Gap Penalty - Three cases :
(I) Default [GAP OPENING PENALTY, GAP EXTENSION PENALTY = 0, 0]
(II) Gap opening and Gap extension penalty provided in function arguments
(III) Callback function provided.
Validations - Three major validations :
(I) Check whether input is valid protein sequence
(II) Check whether valid function names given
(III) Check whether correct parameters given [USE OF DECORATORS]
Matrix Computation - Two cases :
(I) For global alignment
(II) For local alignment.
In order to keep track of all possible alignments, the residual matrix, which contains information of the previous cell from which its value came, will contain all possible locations which gave same maximum score. All paths in the matrix which lead to the final score are found out recursively, thus giving all possible alignments.
Printing :
(I) Alignment score is simply displayed.
(II) Formatted alignment of the two sequences is done using dash to fill the blank spaces in alignment and '|' to show correspondence between two residues, as is implemented currently.
Test cases and detailed documentation.
Detailed Timeline :
Week 1-2 (May 19 - June 2):
Tasks:
1. To implement functions for different cases of match and mismatch.
2. To implement functions for different cases of gap opening and extension penalty.
3. Lay a prototype for global and local alignment functions.
4. Write unit tests.
Week 3 (June 2 - June 9):
Tasks:
1. Write validations for protein sequence.
2. Write validations for function names.
3. Write validations for correct parameter list in each function (Using Decorators).
4. Write unit tests.
Week 4 (June 9 - June 16):
Tasks:
1. Complete implementation of global alignment using Needleman Wunch.
2. Generate all possible global sequence alignment for two protein sequences.
3. Write unit tests.
Week 5 (June 16 - June 23):
Tasks:
1. I will be travelling from USA to India in this time, so can't get much work done.
2. To clear any bugs before submission for Mid-term evaluation.
Week 6 (June 23 - June 30):
Tasks:
1. Mid-term evaluation period.
2. To design a prototype for local alignment and start working on it.
Week 7-8 (June 30 - July 14):
Tasks:
1. Complete implementation of local alignment using Smith Waterman algorithm.
2. Generate all possible local sequence alignment for two protein sequences.
3. Write unit tests.
Week 9-10 (July 14 - July 28):
Tasks:
1. Formatting of the alignment.
2. Bug fixing.
3. Write unit tests.
Week 11-12 (July 28 - August 11):
Tasks:
1. Extensive documentation.
2. Example code
2. User Testing.
3. Code refactoring.
Week 13 (August 11 - August 18):
Tasks:
1. Bug fixing.
2. Clean up documentation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment