abstatic/gsoc_proposal.md

## gsoc_proposal.md

      
    Project Info

Project Title : Improve SegAnnDB interactive genomic segmentation web app

Project Short Title : Improve SegAnnDB web app

URL of project Idea Page: https://bitbucket.org/mugqic/gsoc2016#markdown-header-improve-seganndb-interactive-genomic-segmentation-web-app
Biographical Information

My name is Abhishek Shrivastava and I am a CSE major at University Institute Of Technology, Bhopal, India. My time zone is UTC+05:30.
I have been using computers for as long as I can remember, and I have been programming for the past four years in C/C++, Java and Python. I have a keen interest in algorithms and data structures.
I am also familiar with Python for scientific work and have used NumPy, SciPy, matplotlib and IPython notebooks in many assignments during my undergraduate studies. I have also used Python for web development with Flask and Django. My own blog is built with Pelican, which uses the Jinja templating engine. I am a quick learner and have already picked up the basic functioning of SegAnnDB, including the Pyramid framework, Berkeley DB, and the existing source code.
Moreover, to deepen my understanding of genome segmentation I have been reading extensively about chromosomes, copy number variations, aCGH, etc., which shows that I am willing to go the extra mile to complete this project.
Contact Information

Student Name: Abhishek Shrivastava

Melange Link Id: ??

Student Postal Address: 229 C, Sector C, Indrapuri, Raisen Road, Bhopal

Telephones: +91-8989822629, +91-8518097289

Email: x.abhishek.flyhigh@gmail.com , abhishek.shrivastava.ts@gmail.com

Other Communication Channel: Skype - abhishek.shrivastava.ts

Github - https://github.com/abstatic

Linked In - https://in.linkedin.com/in/abshrivastava1

Twitter - https://twitter.com/abstatic_

Website/Blog - http://abstatic.github.io
Student Affiliation

Institution: University Institute Of Technology, Bhopal

Program: Bachelor Of Engineering, Computer Science Engineering

Stage Of Completion: 3rd Year, I graduate in 2017

Contact To Verify:
Schedule Conflicts

I will have my semester exams between May 15th and June 1st, during which my productivity will be around 65%-70%. Other than this, I won't have any conflicts.
Mentors

Mentor Names: Toby Hocking

Mentor Emails: toby.hocking@mail.mcgill.ca

Mentor Link_ids: ??
Have you been in touch with the mentors? When and how?

Yes. I emailed the assigned mentor, Toby Hocking, about my interest in the project, and also completed the required selection test. After that we exchanged a few emails regarding how I can get started with this project and what my area of focus should be.
Synopsis

SegAnnDB is a web app for interactive segmentation and annotation of chromosomes. It visualizes a profile by plotting the log ratio against position along the chromosome. It is one of the most accurate systems for annotations.
The aim of this project is to further improve SegAnnDB by -

1. Adding appropriate unit tests and a regression testing suite, so that development of SegAnnDB becomes easier.
2. Rendering plots based on the chromosome region the user wants to see/annotate.
3. Adding sharing functionality. Currently it is not possible for one user to share annotations with other users. We want to add this functionality as it will greatly simplify collaboration.
4. Faster deletion of profiles - Currently the algorithm used for deleting profiles is O(ND); we want it to be O(1).
5. Viewing several profiles at a time - Currently the user can only see one profile at a time. We want to have a section in the app where the user can view multiple profiles at once.
6. Safe deletion of log files - SegAnnDB uses BerkeleyDB for fast database operations, but BerkeleyDB accumulates many log files that are no longer needed. We want to have a cron job which will periodically check for all the log files in the system and safely delete them.
7. Simplifying installation - It is tricky to install SegAnnDB on a local machine/server. We want to create a Docker image for SegAnnDB, thereby making it easier to install and distribute.

Benefits to Community (Max 250 words)

It was long thought that genes were almost always present in two copies in a genome. But recent discoveries reveal that genes can vary in copy number.

Although copy number variations (CNVs) are common in humans, many studies have found that copy number variations in genes are related to diseases such as tumors, cancers and Alzheimer's disease. Progress in the field of CNV will help greatly in demystifying the causes of, and cures for, these diseases.
SegAnnDB is one of the most accurate platforms for genome segmentation and annotation. It focuses on helping researchers analyse the copy number alterations in a chromosome.
By improving SegAnnDB we are essentially taking a step forward in our understanding of tumors, cancers and how our genetics are related to them. Upon completion of this project SegAnnDB will receive much-needed features and functionality. Collaboration between people using SegAnnDB will also become easier.
Coding Plan & Methods

1. Developing a test suite: Currently SegAnnDB has no test suite, which makes further development of the software cumbersome, as regressions can slip in while new features are introduced.
The first deliverable of the project will be a good test suite that covers all the existing functionality and catches regressions that might occur in further development. For creating the test suite I will use Selenium WebDriver.
I have past experience with Selenium (in testing a Java RESTful application), so the learning curve with the Python bindings will be minimal. Moreover, Selenium WebDriver has good documentation and supports most modern browsers as well as headless testing.
With headless testing we can run the tests as if a user were using the application. The probable candidates for test cases will be functionalities like upload, annotate, delete etc.
As part of the test suite I also plan to develop unit tests for all the new methods that I will add. All the unit tests will be kept separate from the regression tests.
During the community bonding period I plan to get more familiar with the source code and start planning which parts will need to be tested using Selenium.
Upon completion : We will have a Selenium based test suite, with which we will be able to test all the existing functionality of SegAnnDB.
2. Replace large PNG: This will be the second deliverable of the project. Currently we render very large PNGs that show the whole chromosome at once. As a result the width of the PNG becomes very large, and such images are unsupported in many web browsers.
I plan to modify the existing functionality by -

a) Setting the width of all rendered images to 1500px

b) Dividing the chromosome into subregions.
Right now, for an uploaded profile, we are able to navigate to its different chromosomes (chr1-chr**), e.g. http://bioviz.rocq.inria.fr/profile/dr1hg19/ , chr2 -> http://bioviz.rocq.inria.fr/profile/dr1hg19/2/
Here we see different zooming options available, but they generate very big PNGs which are not viewable in many browsers.
We will modify the PNGs to have a fixed width of 1500 pixels and then divide the chromosome on the basis of chromStart and chromEnd. We will set a fixed number of bp (base pairs) per view, on the basis of which the user will be able to view the subregions of the chromosome. So, the endpoints will look something like -
http://bioviz.rocq.inria.fr/profile/dr3hg19/2/1000000:2000000/ and on the page there will be options to go forward and backward in the same chromosome.
Implementation - We are already using bedgraph to generate the scatterplots, so to implement this functionality we will have to add code that plots the points into a 1500 x 200 px PNG image.
We can make use of the draw(arrays, fn, width, height, lr_min=None, lr_max=None, pos_min=None, pos_max=None) function in scatterplot.py to generate the PNG once we have the required chromStart and chromEnd positions. The probable pseudo-code can be something like -
1. Receive chromStart and chromEnd.  
2. Get required log ratio data from the bedgraph file for given profile and chromStart, chromEnd  
3. Calculate the required values for normalization and logratios - for 1500px width.  
4. Plot the points and generate scatterplot and the png.   
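
Steps 1 to 3 of the pseudo-code could be sketched roughly as below (Python 3). This is only an illustration under assumptions: the function names probes_in_region and to_pixels are hypothetical and are not part of the SegAnnDB source, and the input is assumed to be standard four-column bedGraph lines (chrom, chromStart, chromEnd, value). Step 4, the actual drawing, would be left to the existing plotting code.

```python
WIDTH_PX = 1500   # fixed image width proposed above
HEIGHT_PX = 200   # fixed image height proposed above


def probes_in_region(lines, chrom, region_start, region_end):
    """Step 2: keep the probes of one chromosome that start inside the region."""
    selected = []
    for line in lines:
        fields = line.split()
        # Skips blank lines, track headers and other chromosomes.
        if len(fields) != 4 or fields[0] != chrom:
            continue
        position = int(fields[1])
        if region_start <= position < region_end:
            selected.append((position, float(fields[3])))
    return selected


def to_pixels(probes, region_start, region_end, lr_min, lr_max):
    """Step 3: scale positions to x pixels and log ratios to y pixels."""
    span = region_end - region_start
    y_scale = (HEIGHT_PX - 1) / (lr_max - lr_min)
    pixels = []
    for position, logratio in probes:
        # Integer arithmetic keeps x exact for large genomic positions.
        x = (position - region_start) * (WIDTH_PX - 1) // span
        # Clip outliers, then flip: image rows grow downwards.
        clipped = min(max(logratio, lr_min), lr_max)
        y = int((lr_max - clipped) * y_scale)
        pixels.append((x, y))
    return pixels
```

Because every subregion is mapped onto the same 1500 x 200 grid, the image size stays constant no matter how many base pairs the requested window covers.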

After generating the scatterplot, we will also need to make sure that it is interactive, i.e. annotations can be made and stored on that profile.
Other than that, we will also need to modify the __init__.py file and add a new view so that the front end accommodates these changes as well.

I also found that we can make use of this in the links page. For example, for the page showing detected alterations for a profile, e.g. http://bioviz.rocq.inria.fr/links/A9L/ , we will be able to render the linked subregions in SegAnnDB as well.

Upon completion : We will be able to interact with the subregions of chromosomes in any web browser, in a PNG of size 1500 x 200 pixels, i.e. we will be able to navigate to pages like http://bioviz.rocq.inria.fr/profile/dr3hg19/2/1000000:2000000/ on any web browser.
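
To illustrate how the start:end part of such a URL and the forward/backward options might be handled, here is a small sketch (Python 3). The helper names parse_region, neighbour_regions and region_url are hypothetical, and the chromosome length is assumed to be known from the profile data.

```python
def parse_region(text):
    """Parse a 'start:end' URL segment such as '1000000:2000000'."""
    start_text, end_text = text.split(":")
    start, end = int(start_text), int(end_text)
    if start < 0 or end <= start:
        raise ValueError("invalid region: %r" % text)
    return start, end


def neighbour_regions(start, end, chrom_length):
    """Return the (previous, next) windows, or None at a chromosome edge."""
    size = end - start
    previous = (max(start - size, 0), start) if start > 0 else None
    following = (end, min(end + size, chrom_length)) if end < chrom_length else None
    return previous, following


def region_url(profile, chrom, region):
    """Build the path of one subregion, following the URL pattern above."""
    return "/profile/%s/%s/%d:%d/" % (profile, chrom, region[0], region[1])
```

The previous and next windows are clamped to the chromosome, so the last window of a chromosome may be shorter than the others.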
3. Sharing Annotations: This will be the third deliverable of the project. Currently there is no way for one user to share his/her annotations with other users.
Implementation - We are already storing user-specific annotation data in our database. We will allow the user to share the subregions of the chromosome where he/she has made modifications.
This will cover all the annotations, segmentations and breakpoints made by the user.
So, to implement sharing of these annotations, we can define one endpoint in our web app which will contain:

- User id of the person who wants to share his/her annotations (most probably the user's email, depending on Mozilla Persona)
- Profile name of the chromosome
- Chromosome number
- Subregion of the chromosome

On the basis of these details we will be able to regenerate the annotations specific to that particular user.
An example shareable link might look like -
http://bioviz.rocq.inria.fr/{user_id}/dr3hg19/2/1000000:2000000/
After this we will need to add a function in views so that the correct profile is generated from the given details.
We will add a new view to hold the plot, and on the backend we will add a function which will interact with BDB to retrieve the required information and then generate the required plot and PNG.
Upon Completion : The user will be able to share his/her annotations with other users irrespective of the share value in the bedgraph file. When a user wants to share an annotation of a sequence, he/she will receive a permalink which can be given to another user, who will then be able to regenerate the exact annotated sequence.
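
A possible way to build and read back such a permalink is sketched below (Python 3, standard library only). The names SharedView, build_share_path and parse_share_path are hypothetical. The sketch assumes the user id is an email address, so it is percent-encoded before being placed in the path.

```python
from collections import namedtuple
from urllib.parse import quote, unquote

# The four pieces of information the shareable endpoint carries.
SharedView = namedtuple("SharedView", "user_id profile chrom start end")


def build_share_path(user_id, profile, chrom, start, end):
    """Return the permalink path for one user's annotated subregion."""
    # safe="" also escapes "/" so the user id stays a single path segment.
    return "/%s/%s/%s/%d:%d/" % (quote(user_id, safe=""), profile, chrom, start, end)


def parse_share_path(path):
    """Recover the details needed to regenerate the shared annotations."""
    parts = path.strip("/").split("/")
    if len(parts) != 4:
        raise ValueError("expected /user_id/profile/chrom/start:end/")
    user_id, profile, chrom, region = parts
    start, end = (int(value) for value in region.split(":"))
    return SharedView(unquote(user_id), profile, chrom, start, end)
```

The new view would then look up the annotations stored for that user id, profile and chromosome, and render only the requested window.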
4. Faster deletion of profiles: We want the user to be able to delete the profiles that he/she has uploaded. I found that this functionality is not yet active.
We have some code written for the view plotter.views.delete_profile and some database operations as well in plotter.db.Profile.delete, but as pointed out in the TODO list the deletion takes O(ND) time to complete, and we want to improve this.
Current delete method - O(ND)

    def delete(self):
        deleted = []
        pro_info = self.get()
        pro_name = pro_info["name"]
        # Drop the profile name from every user's UserProfiles entry.
        for db_key in UserProfiles.db_keys():
            up = UserProfiles(db_key)
            up.remove(pro_name)
        # Delete related resources: every key of every related class is
        # scanned, which is what makes this method slow.
        for cls_name in self.RELATED:
            cls = eval(cls_name)
            keys = cls.db_keys()
            for key_txt in keys:
                values = key_txt.split(" ")
                key_dict = dict(zip(cls.keys, values))
                if key_dict["name"] == pro_name:
                    res = cls(*values)
                    deleted.append(str(res))
                    res.put(None)
        # TODO: make file deletion cross-platform.
        name = self.values[0]
        cmd = "rm -rf %s/%s" % (SECRET_DIR, name)
        os.system(cmd)
        deleted.append("files")
        return "deleted " + ", ".join(deleted)

Implementation - For the full implementation we will need to find a way to effectively improve the performance of the delete operation, so that it can be done in O(1) time.
One way is to record the relationship between the user and the profile he/she wants to delete, and to modify the delete operation on the basis of that relationship, so that it no longer has to scan every key.
Upon Completion: The user will be able to delete profiles, and the deletion will happen in O(1) time.
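
One possible shape for such a relationship is a secondary index from profile name to the keys of its related records. The sketch below (Python 3) uses plain dictionaries as a stand-in for the BerkeleyDB tables; the class ProfileIndex and its methods are hypothetical and do not exist in plotter.db. Note that this removes the scan over all keys, but the cost still grows with the number of records belonging to the deleted profile, so whether it meets the O(1) target depends on how those records are stored.

```python
from collections import defaultdict


class ProfileIndex:
    """Hypothetical secondary index: profile name -> keys of related records."""

    def __init__(self):
        self.records = {}                        # stand-in for the database
        self.keys_by_profile = defaultdict(set)  # maintained on every put

    def put(self, key, profile_name, value):
        """Store a record and remember which profile it belongs to."""
        self.records[key] = value
        self.keys_by_profile[profile_name].add(key)

    def delete_profile(self, profile_name):
        """Delete the records of one profile without scanning other keys."""
        keys = self.keys_by_profile.pop(profile_name, set())
        for key in keys:
            del self.records[key]
        return len(keys)
```

The trade-off is that every write has to update the index as well, so the index and the records must be changed within the same transaction to stay consistent.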
5. Permission System:
6. Safe deletion of log files: In this task, I will develop a bash script which will run as a periodic cron job and clean out the junk files that are generated by SegAnnDB. Most of these are log files generated by BerkeleyDB, as it writes all actions to log files before modifying the actual database files.
As per the BerkeleyDB documentation, a log file can be removed if -

- The log file is not involved in an active transaction.
- A checkpoint has been written subsequent to the log file's creation.
- The log file is not the only log file in the environment.

For deleting log files we can -

- Use the db_archive utility to clean out the log files.
- Programmatically configure BerkeleyDB to remove log files automatically by passing in the flag DB_LOG_AUTOREMOVE, but that will make catastrophic recovery impossible.

I would like to use the db_archive utility for handling log files, as it gives us better control over when log files are deleted and when they are not.
Upon Completion: We will have a shell script which will automatically clean out all the unnecessary log files.
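
The db_archive call that the cron job would make can be sketched as follows. The deliverable itself is planned as a shell script; this Python 3 version is only an illustration of the same call. It relies on two db_archive options, -h for the environment directory and -d for removing the log files that are no longer needed. The function names are hypothetical, and on some systems the binary is installed under a versioned name, which this sketch does not handle.

```python
import shutil
import subprocess


def archive_command(env_home, remove=False):
    """Build a db_archive call: list removable logs, or delete them with -d."""
    command = ["db_archive", "-h", env_home]
    if remove:
        command.append("-d")
    return command


def clean_logs(env_home, remove=False):
    """Run db_archive if it is installed and return its output lines."""
    if shutil.which("db_archive") is None:
        return None  # utility not found on this machine
    result = subprocess.run(
        archive_command(env_home, remove),
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Running it first without remove=True only lists the log files that db_archive considers no longer in use, which gives a safe way to check the result before anything is deleted.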
7. Simplify Installation: The aim of this deliverable is to create a Docker image for SegAnnDB, so that it is distributable as a Docker container.
I am already familiar with Docker, so there will be no learning curve involved.
I have planned this to be the last deliverable so that we can package all the existing functionality and the work done in the summer as a standalone Docker container and upload it to Docker Hub.
Implementation - We will have to create a Dockerfile which will include all the commands that are needed to install and configure SegAnnDB on a fresh machine (preferably Ubuntu).
Upon Completion : Installing a local copy of SegAnnDB will be as easy as docker run segann/seganndb run_segann, and with the help of port mapping the user can just point a browser to localhost:8080 to interact with SegAnnDB. With this, people using Windows will also be able to run SegAnnDB locally.
Timeline

Total duration: April 22, 2016 - August 16, 2016, approximately 16 weeks
Project Milestones and Deliverables


| Period | Work |
| --- | --- |
| Community Bonding Period (April 22 - May 22) | Become more familiar with the codebase and how it works. Work out what the test cases should cover, and think through the most efficient way of coding the other parts as well. Towards the last week, start coding the unit tests using the Selenium test framework. |
| May 23 - May 30 (1 week) | Using the Selenium WebDriver framework, work on the unit tests and acceptance tests to keep regressions to a minimum. |
| May 31 - June 13 (2 weeks) | Work on replacing large PNGs, with functionality to view subregions of a chromosome. |
| June 14 - June 20 (1 week) | Work on social annotations. |
| Midterm Evaluations | Submit midterm evaluation by June 22. Then continue coding. |
| June 22 - June 28 (1 week) | Wrap up the remaining work on social annotations. |
| June 29 - July 12 (2 weeks) | Faster deletion of profiles. |
| July 13 - July 26 (2 weeks) | View multiple profiles at a time. |
| July 27 - August 2 (1 week) | Cron job for safe deletion of BerkeleyDB log files. |
| August 3 - August 9 (1 week) | Create Docker image. Package SegAnnDB as a standalone Docker container and upload it to Docker Hub. |
| Remaining days | Reserved as a buffer period in case something takes longer than expected or unforeseen difficulties arise. If everything runs as per the timeline, this period will be used for more code cleanup, better testing and more documentation. |
	

Management of Coding Project

For submitting code I would like to use GitHub, and with each commit/milestone I plan to email my mentor about the progress that I have made and how it can be verified/tested on his machine. Since the test suite will be developed first, the chances of regressions will be minimal. Also, with each new method that I write, I will develop the corresponding unit test alongside.
I plan to commit at least 2 times a week, depending on the progress that I am making. I will also maintain a weekly blog (at abstatic.github.io) regarding my progress with the project.
If I fail to commit at least once a week, or don't write about my progress for no apparent reason, that would indicate a problem.
Test

I have completed the required selection test for the project, and here is my screencast: https://youtu.be/fyyzU7BxWU8
My project mentor indicated that this was a satisfactory solution to the problem.
I found that the installation process for SegAnnDB was tricky and very particular about directory structures, which is why I have also proposed improving the SegAnnDB installation process.
Anything Else:

I am mainly a self-taught programmer and use the internet to my full benefit. I am able to keep myself motivated. I have completed these MOOCs:
- CS50 - Introductory computer science course offered by Harvard University
- CS75 - Another course offered by Harvard University, focused on web development
- MIT OCW 6.00 - Introduction to Python; I learnt Python and the basics of scientific computing from here.