@anamika-yadav99
Last active January 19, 2023 11:30
Google Summer of Code 2022 - Final Report

By Anamika Yadav ☀️

Hi there! I’m Anamika, an undergraduate student from India. I was selected for Google Summer of Code 2022 and spent the summer working on the project Genestorian data refinement under the mentorship of Dr. Manuel Lera Ramirez. The project belongs to the organization Open Bioinformatics Foundation and is part of the larger project Genestorian.

Project Synopsis:

The Genestorian data refinement pipeline is part of the Genestorian project. Genestorian is an open-source web tool for collecting and organizing strain data. The purpose of the refinement pipeline is to extract the genotypes of strains from an input file, which is most likely a spreadsheet. From the genotypes extracted from the spreadsheets, it further extracts the alleles and identifies the patterns those alleles follow.

For example, the genotype: h- ace2Δ::kanMX6 sep1::ura4 leu1-32

The alleles extracted are: ace2Δ::kanMX6, sep1::ura4, leu1-32

Patterns followed by the alleles are:

  • ace2Δ::kanMX6: Gene Deletion
  • sep1::ura4: GENE-GENE
  • leu1-32: ALLELE

Implementation:

For in-depth implementation details, please check out the README.

The task of identifying the patterns followed by alleles was a little challenging to figure out and needed a lot of research and experimentation. Hence, we followed an agile methodology for the development of the pipeline. The first couple of weeks were spent exploring and understanding the data. From then on we worked on successive versions of the pipeline, starting with reading the spreadsheet correctly and making sure the correct data is read. This data is then transferred from the spreadsheet to a TSV file.

Input of the pipeline:

The input to the pipeline is a TSV file (example) that contains a strain_id and a genotype; alleles are extracted from the genotype. To extract the required data from a spreadsheet into a TSV file we have a script called format.py. For details, have a look at: Formatting input data
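Reading that two-column TSV can be sketched as follows. This is a minimal illustration, not the repo's format.py (which also handles the spreadsheet-to-TSV conversion); the function name and dict keys are assumptions for the sake of the example.

```python
import csv

def read_strains(tsv_path):
    """Read a two-column TSV of strain_id and genotype rows into a
    list of dicts (a sketch; the real pipeline's format.py also
    converts lab spreadsheets into this TSV form)."""
    strains = []
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for strain_id, genotype in reader:
            strains.append({"strain_id": strain_id, "genotype": genotype})
    return strains
```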

Tokenization and tagging:

Allele features are identified by matching them against the allele feature dataset available in the Genestorian database. Alleles are tokenized and then tagged based on the features they match; tokens that remain unidentified are given the tag "other". For more details please check out: Build_nltk_tags
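The tokenize-and-tag step can be illustrated with a toy sketch. The feature dictionary below is a stand-in for the Genestorian allele feature dataset (which the real pipeline loads from data files), and the splitting rule is simplified: the actual Build_nltk_tags also matches multi-character allele names such as ura4- as a whole, which this sketch omits.

```python
import re

# Toy feature dictionary standing in for the Genestorian allele
# feature dataset (assumed for illustration only).
FEATURES = {
    "GENE": {"his7", "mug28", "sep1", "ace2"},
    "TAG": {"gfp"},
    "MARKER": {"kanmx6"},
}

def tag_allele(allele):
    """Split an allele on separators and tag each token: known
    features get their tag, separators get '-', everything else
    gets 'other' (same tagging syntax as the pipeline's output)."""
    pattern = []
    for token in re.split(r"(::|-|\+)", allele.lower()):
        if not token:
            continue
        if token in ("::", "-"):
            tag = "-"
        else:
            tag = next(
                (t for t, names in FEATURES.items() if token in names),
                "other",
            )
        pattern.append([tag, [token]])
    return pattern
```

For example, `tag_allele("his7+::lacI-GFP")` reproduces the tagged pattern shown for that allele in the output below.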

An example from the tsv file

strain_id | genotype
--------- | --------
FY21859   | h90 mug28::kanMX6 ade6-M216 ura4- his7+::lacI-GFP lys1+::lacO
FY21860   | h90 mug29::kanMX6 ade6-M216 ura4- his7+::lacI-GFP lys1+::lacO

The output is:

[
    {
        "name": "his7+::laci-gfp",
        "pattern": [["GENE", ["his7"]], ["other", ["+"]], ["-", ["::"]], ["other", ["laci"]], ["-", ["-"]], ["TAG", ["gfp"]]]
    },
    {
        "name": "ura4-",
        "pattern": [["ALLELE", ["ura4-"]]]
    },
    {
        "name": "lys1+::laco",
        "pattern": [["ALLELE", ["lys1+"]], ["-", ["::"]], ["other", ["laco"]]]
    },
    {
        "name": "mug28::kanmx6",
        "pattern": [["GENE", ["mug28"]], ["-", ["::"]], ["MARKER", ["kanmx6"]]]
    },
    {
        "name": "ade6-m216",
        "pattern": [["ALLELE", ["ade6-m216"]]]
    },
    {
        "name": "mug29::kanmx6",
        "pattern": [["GENE", ["mug2"]], ["other", ["9"]], ["-", ["::"]], ["MARKER", ["kanmx6"]]]
    }
]

Identification of Patterns:

NLTK's RegexpParser is used to identify patterns from the tagged allele features. NLTK is not designed to work with complex biological data, so we had to define our own grammar, which contains the parsing rules as well as regexes to validate the rules. For details please check out: Grammar for nltk parser and Build NLTK Trees. The input to the parser is the tagged features of the allele; the output is an NLTK tree with the identified patterns as subtrees. For example, for the genotype h- ace2Δ::kanMX6 sep1::ura4 leu1-32, the NLTK tree is:

[Image: NLTK tree for the genotype h- ace2Δ::kanMX6 sep1::ura4 leu1-32]
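The chunking idea can be shown in miniature. The grammar rule and the SEP tag name below are simplified assumptions, not the project's actual grammar (which build_grammar.py generates into grammar.txt); the example only demonstrates how RegexpParser groups a tagged token sequence into a named subtree.

```python
import nltk

# Simplified chunk grammar: a GENE followed by a separator and a
# MARKER is chunked as a gene deletion (an assumed rule, not the
# project's real grammar).
grammar = r"""
  GENE_DELETION: {<GENE><SEP><MARKER>}
"""
parser = nltk.RegexpParser(grammar)

# Tagged tokens for the allele ace2::kanMX6, in (token, tag) form.
tagged = [("ace2", "GENE"), ("::", "SEP"), ("kanmx6", "MARKER")]

# The parse groups the three tokens under a GENE_DELETION subtree.
tree = parser.parse(tagged)
print(tree)
```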

The above tree is stored in a JSON file. Later, the patterns from this JSON file will be migrated to the Genestorian database.
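NLTK Tree objects are not directly JSON-serializable, so storing the trees requires converting them first. One possible encoding is nested lists, sketched below; this is an assumption about how such trees could be stored, not necessarily the serialization the repo uses.

```python
import json
from nltk.tree import Tree

def tree_to_lists(tree):
    """Recursively convert an NLTK Tree into [label, children] nested
    lists so it can be written to JSON (one possible encoding; the
    repo may serialize trees differently)."""
    if isinstance(tree, Tree):
        return [tree.label(), [tree_to_lists(child) for child in tree]]
    return tree  # leaves are (token, tag) tuples; JSON writes them as arrays
```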

Code merged during GSoC period:

Understanding and exploring the data:

I spent a couple of weeks trying to understand the data. I wrote a script for each lab's strain Excel sheet to read the alleles and look through their elements. Code written to explore the dataset: PR #1 | PR #2 | PR #10

Data Preprocessing

First Version of pipeline: #23

  • Wrote a script get_data/convert_genes2toml.py that writes the file data/genes.toml from data/gene_IDs_names.tsv
  • Wrote the first version of the pipeline based on data from the Dey lab, where the input is an xlsx file and the output is a text file in which the allele component of each genotype is replaced with the word ALLELE
  • Generalized the replacement of genotype components such as genes, markers and tags with the words GENE, MARKER and TAG respectively
Issue | Description
----- | -----------
#17   | Convert gene_IDs_names to toml
#16   | Implementation of first version of pipeline
#19   | Generalise substitution of allele features

Second Version of pipeline: #28

  • Wrote a script that produces two files, strains.json and alleles.json, from the input strains.tsv
  • strains.json is a list of dicts containing the genotype, strain_id, mating type and the alleles
  • alleles.json is also a list of dicts, containing the alleles extracted from the genotype, the allele pattern and a list of allele features
  • Added tests for the second version of the pipeline | #24 |
Issue | Description
----- | -----------
#22   | Implementation of second version of pipeline and test

Tokenization and Tagging of allele components

Third Version of Pipeline: #23

  • Addition to the second version: added the coordinates of the allele features to alleles.json
  • Created the genestorian module and moved the code into it
  • Added tests to check that the coordinates are right
Issue | Description
----- | -----------
#27   | Implementation of third version of pipeline and test

Fourth Version: #33

  • Wrote the scripts build_nltk_tags.py and summary_nltk_tags.py
  • build_nltk_tags.py takes strains.tsv as input, adds tags to the features in each allele and outputs a JSON file, allele_patterns_nltk.json. The JSON file is a list of dicts containing the allele name and the allele pattern in the tagging syntax used by NLTK.
  • Wrote tests for both scripts
  • Updated the README documenting the progress so far
Issue | Description
----- | -----------
#32   | Retrieving fluorescent protein data from a public API
#30   | Implementation of fourth version of pipeline

Pattern Identification

Fifth version: #35

  • Wrote two scripts, build_grammar.py and build_nltk_trees.py
  • Added a pseudo_grammar.json file which contains the rules to be used by the NLTK chunker and some regexes to validate the rules matched by NLTK
  • build_grammar.py takes the pseudo-grammar as input and produces a grammar.txt file which is used by NLTK for parsing
  • build_nltk_trees.py takes allele_patterns_nltk.json as input and identifies the patterns in the alleles, such as gene deletion, amino acid substitution etc. The output is a JSON file with the name of each allele as key and the NLTK tree built by the parser as value.
  • Wrote tests for both scripts
  • Updated the README up to the fifth version
Issue | Description
----- | -----------
#29   | Implementation of nltk chunker to identify pattern
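Generating a grammar file from JSON rules, as build_grammar.py does, can be sketched as below. The pseudo-grammar shape shown here (pattern name mapped to tag sequences) is a hypothetical schema for illustration; the real pseudo_grammar.json in the repo has its own structure, including the validation regexes.

```python
# Hypothetical pseudo-grammar: pattern name -> list of tag sequences.
# The repo's actual pseudo_grammar.json schema differs; this only
# illustrates turning JSON rules into an NLTK chunk grammar string.
pseudo = {"GENE_DELETION": ["<GENE><SEP><MARKER>"]}

def build_grammar(rules):
    """Render each rule as an NLTK RegexpParser chunk rule of the
    form 'NAME: {<TAG><TAG>...}' and join them into one grammar."""
    lines = []
    for name, alternatives in rules.items():
        for alt in alternatives:
            lines.append(f"{name}: {{{alt}}}")
    return "\n".join(lines)
```

The resulting string can be written to a grammar.txt file and passed straight to `nltk.RegexpParser`.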

CI using Github workflows and dockerisation #37

  • Wrote a ci.yaml file for GitHub Actions
  • Modified the existing tests to best fit the needs of the CI pipeline and the refinement pipeline
  • Built a Docker image and updated ci.yaml to rebuild and push the Docker image to the registry every time it is updated
  • Updated the README
Issue | Description
----- | -----------
#36   | Setting Github action workflow and docker

Conclusion

Google Summer of Code was one of the best summers for me, and I would recommend it to anyone starting out in software development. It was a great learning experience: I learnt good coding practices, reading documentation, GraphQL, setting up GitHub Actions, writing tests, and a ton of biology. Turns out, I like biology, and I look forward to working on more bioinformatics projects.

GSoC was certainly a little challenging, but it was fun at the same time. I would like to thank my mentors and the OBF team for the opportunity, and especially my mentor Manuel Lera Ramirez for guiding me so well throughout the GSoC period. I plan to keep contributing to the project whenever I can.

@ranbir7

ranbir7 commented Oct 6, 2022

hey there , can I contribute to this project?
I'm Just halfway there with python ,If you suggest me some skills required for this project
I would happily try my best acquiring them.

@anamika-yadav99

anamika-yadav99 commented Oct 6, 2022

> hey there , can I contribute to this project? I'm Just halfway there with python ,If you suggest me some skills required for this project I would happily try my best acquiring them.

Hi @ranbir7! This project is almost complete, and there's nothing left to do in the immediate future. Maybe in a few months, when Manuel is free, he might open an issue you can work on, but for now I don't think there's anything. If you want to understand this project better and work on something similar, I'm happy to help.
A good starting point for acquiring skills related to this project would be learning more about TSV, JSON and TOML files and learning the pandas and NLTK libraries. For the biological concepts, you can go through biological_concepts.md in the repo.

@ranbir7
Copy link

ranbir7 commented Oct 7, 2022

okay thankss💛
