@anamika-yadav99
Last active January 19, 2023 11:30
Google Summer of Code 2022 - Final Report

By Anamika Yadav ☀️

Hi there! I’m Anamika, an undergraduate student from India. I was selected for Google Summer of Code 2022 and spent the summer working on the project Genestorian data refinement under the mentorship of Dr. Manuel Lera Ramirez. The project belongs to the organization Open Bioinformatics Foundation and is part of the larger project Genestorian.

Project Synopsis:

The Genestorian data refinement pipeline is part of the Genestorian project. Genestorian is an open-source web tool for collecting and organizing strain data. The purpose of the refinement pipeline is to extract the genotypes of strains from an input file, which is most likely a spreadsheet. From the genotypes extracted from the spreadsheets, it further extracts the alleles and identifies the patterns those alleles follow.

For example, the genotype: h- ace2Δ::kanMX6 sep1::ura4 leu1-32

The alleles extracted are: ace2Δ::kanMX6, sep1::ura4, leu1-32

Patterns followed by the alleles are:

  • ace2Δ::kanMX6: Gene Deletion
  • sep1::ura4: GENE-GENE
  • leu1-32: ALLELE

Implementation:

For in-depth implementation details, please check out the README.

The task of identifying the patterns followed by alleles was a little challenging to figure out and needed a lot of research and experimentation. Hence, we followed an agile methodology for the development of the pipeline. The first couple of weeks were spent exploring and understanding the data. From then on we worked on successive versions of the pipeline, starting with reading the spreadsheet correctly and making sure the correct data is read. This data is then transferred from the spreadsheet to a TSV file.

Input of the pipeline:

The input to the pipeline is a TSV file (example) that contains a strain_id and a genotype; alleles are extracted from the genotype. To extract the required data from a spreadsheet into a TSV file we have a script called format.py. For details, have a look at: Formatting input data
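Reading that two-column TSV can be sketched as follows. This is a minimal illustration, not the repo's format.py (which also handles the spreadsheet-to-TSV conversion); the function name and dict keys are assumptions for the sake of the example.

```python
import csv

def read_strains(tsv_path):
    """Read a two-column TSV of strain_id and genotype rows into a
    list of dicts (a sketch; the real pipeline's format.py also
    converts lab spreadsheets into this TSV form)."""
    strains = []
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for strain_id, genotype in reader:
            strains.append({"strain_id": strain_id, "genotype": genotype})
    return strains
```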

Tokenization and tagging:

Allele features are identified by matching them against the allele feature dataset available in the Genestorian database. Alleles are tokenized and then tagged based on the features they match; tokens that remain unidentified are given the tag "other". For more details please check out: Build_nltk_tags
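The tokenize-and-tag step can be illustrated with a toy sketch. The feature dictionary below is a stand-in for the Genestorian allele feature dataset (which the real pipeline loads from data files), and the splitting rule is simplified: the actual Build_nltk_tags also matches multi-character allele names such as ura4- as a whole, which this sketch omits.

```python
import re

# Toy feature dictionary standing in for the Genestorian allele
# feature dataset (assumed for illustration only).
FEATURES = {
    "GENE": {"his7", "mug28", "sep1", "ace2"},
    "TAG": {"gfp"},
    "MARKER": {"kanmx6"},
}

def tag_allele(allele):
    """Split an allele on separators and tag each token: known
    features get their tag, separators get '-', everything else
    gets 'other' (same tagging syntax as the pipeline's output)."""
    pattern = []
    for token in re.split(r"(::|-|\+)", allele.lower()):
        if not token:
            continue
        if token in ("::", "-"):
            tag = "-"
        else:
            tag = next(
                (t for t, names in FEATURES.items() if token in names),
                "other",
            )
        pattern.append([tag, [token]])
    return pattern
```

For example, `tag_allele("his7+::lacI-GFP")` reproduces the tagged pattern shown for that allele in the output below.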

An example from the tsv file

strain_id | genotype
--------- | --------
FY21859   | h90 mug28::kanMX6 ade6-M216 ura4- his7+::lacI-GFP lys1+::lacO
FY21860   | h90 mug29::kanMX6 ade6-M216 ura4- his7+::lacI-GFP lys1+::lacO

The output is:

[
    {
        "name": "his7+::laci-gfp",
        "pattern": [["GENE", ["his7"]], ["other", ["+"]], ["-", ["::"]], ["other", ["laci"]], ["-", ["-"]], ["TAG", ["gfp"]]]
    },
    {
        "name": "ura4-",
        "pattern": [["ALLELE", ["ura4-"]]]
    },
    {
        "name": "lys1+::laco",
        "pattern": [["ALLELE", ["lys1+"]], ["-", ["::"]], ["other", ["laco"]]]
    },
    {
        "name": "mug28::kanmx6",
        "pattern": [["GENE", ["mug28"]], ["-", ["::"]], ["MARKER", ["kanmx6"]]]
    },
    {
        "name": "ade6-m216",
        "pattern": [["ALLELE", ["ade6-m216"]]]
    },
    {
        "name": "mug29::kanmx6",
        "pattern": [["GENE", ["mug2"]], ["other", ["9"]], ["-", ["::"]], ["MARKER", ["kanmx6"]]]
    }
]

Identification of Patterns:

NLTK's RegexpParser is used to identify patterns from the tagged allele features. NLTK is not designed to work with complex biological data, so we had to define our own grammar, which contains the parsing rules as well as regexes to validate the rules. For details please check out: Grammar for nltk parser and Build NLTK Trees. The input to the parser is the tagged features of the allele; the output is an NLTK tree with the identified patterns as subtrees. For example, for the genotype h- ace2Δ::kanMX6 sep1::ura4 leu1-32, the NLTK tree is:

[Image: NLTK tree for the genotype h- ace2Δ::kanMX6 sep1::ura4 leu1-32]
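The chunking idea can be shown in miniature. The grammar rule and the SEP tag name below are simplified assumptions, not the project's actual grammar (which build_grammar.py generates into grammar.txt); the example only demonstrates how RegexpParser groups a tagged token sequence into a named subtree.

```python
import nltk

# Simplified chunk grammar: a GENE followed by a separator and a
# MARKER is chunked as a gene deletion (an assumed rule, not the
# project's real grammar).
grammar = r"""
  GENE_DELETION: {<GENE><SEP><MARKER>}
"""
parser = nltk.RegexpParser(grammar)

# Tagged tokens for the allele ace2::kanMX6, in (token, tag) form.
tagged = [("ace2", "GENE"), ("::", "SEP"), ("kanmx6", "MARKER")]

# The parse groups the three tokens under a GENE_DELETION subtree.
tree = parser.parse(tagged)
print(tree)
```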

The above tree is stored in a JSON file. Later, the patterns from this JSON file will be migrated to the Genestorian database.
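NLTK Tree objects are not directly JSON-serializable, so storing the trees requires converting them first. One possible encoding is nested lists, sketched below; this is an assumption about how such trees could be stored, not necessarily the serialization the repo uses.

```python
import json
from nltk.tree import Tree

def tree_to_lists(tree):
    """Recursively convert an NLTK Tree into [label, children] nested
    lists so it can be written to JSON (one possible encoding; the
    repo may serialize trees differently)."""
    if isinstance(tree, Tree):
        return [tree.label(), [tree_to_lists(child) for child in tree]]
    return tree  # leaves are (token, tag) tuples; JSON writes them as arrays
```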

Code merged during GSoC period:

Understanding and exploring the data:

I spent a couple of weeks trying to understand the data. I wrote a script for each lab's strain Excel sheet to read the alleles and look through their elements. Code written to explore the dataset: PR #1 | PR #2 | PR #10

Data Preprocessing

First Version of pipeline: #23

  • Wrote a script get_data/convert_genes2toml.py that writes the file data/genes.toml from data/gene_IDs_names.tsv
  • Wrote the first version of the pipeline based on data from the Dey lab, where the input is an xlsx file and the output is a text file in which the allele component of each genotype is replaced with the word ALLELE
  • Generalized the replacement of genotype components such as genes, markers and tags with the words GENE, MARKER and TAG respectively
Issue | Description
----- | -----------
#17   | Convert gene_IDs_names to toml
#16   | Implementation of first version of pipeline
#19   | Generalise substitution of allele features

Second Version of pipeline: #28

  • Wrote a script that produces two files, strains.json and alleles.json, from the input strains.tsv
  • strains.json is a list of dicts containing the genotype, strain_id, mating type and the alleles
  • alleles.json is also a list of dicts, containing the alleles extracted from the genotype, the allele pattern and a list of allele features
  • Added tests for the second version of the pipeline | #24 |
Issue | Description
----- | -----------
#22   | Implementation of second version of pipeline and test

Tokenization and Tagging of allele components

Third Version of Pipeline: #23

  • Addition to the second version: added the coordinates of the allele features to alleles.json
  • Created the genestorian module and moved the code into it
  • Added tests to check that the coordinates are right
Issue | Description
----- | -----------
#27   | Implementation of third version of pipeline and test

Fourth Version: #33

  • Wrote the scripts build_nltk_tags.py and summary_nltk_tags.py
  • build_nltk_tags.py takes strains.tsv as input, adds tags to the features in each allele and outputs a JSON file, allele_patterns_nltk.json. The JSON file is a list of dicts containing the allele name and the allele pattern in the tagging syntax used by NLTK.
  • Wrote tests for both scripts
  • Updated the README documenting the progress so far
Issue | Description
----- | -----------
#32   | Retrieving fluorescent protein data from a public API
#30   | Implementation of fourth version of pipeline

Pattern Identification

Fifth version: #35

  • Wrote two scripts, build_grammar.py and build_nltk_trees.py
  • Added a pseudo_grammar.json file which contains the rules to be used by the NLTK chunker and some regexes to validate the rules matched by NLTK
  • build_grammar.py takes the pseudo-grammar as input and produces a grammar.txt file which is used by NLTK for parsing
  • build_nltk_trees.py takes allele_patterns_nltk.json as input and identifies the patterns in the alleles, such as gene deletion, amino acid substitution etc. The output is a JSON file with the name of each allele as key and the NLTK tree built by the parser as value.
  • Wrote tests for both scripts
  • Updated the README up to the fifth version
Issue | Description
----- | -----------
#29   | Implementation of nltk chunker to identify pattern
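Generating a grammar file from JSON rules, as build_grammar.py does, can be sketched as below. The pseudo-grammar shape shown here (pattern name mapped to tag sequences) is a hypothetical schema for illustration; the real pseudo_grammar.json in the repo has its own structure, including the validation regexes.

```python
# Hypothetical pseudo-grammar: pattern name -> list of tag sequences.
# The repo's actual pseudo_grammar.json schema differs; this only
# illustrates turning JSON rules into an NLTK chunk grammar string.
pseudo = {"GENE_DELETION": ["<GENE><SEP><MARKER>"]}

def build_grammar(rules):
    """Render each rule as an NLTK RegexpParser chunk rule of the
    form 'NAME: {<TAG><TAG>...}' and join them into one grammar."""
    lines = []
    for name, alternatives in rules.items():
        for alt in alternatives:
            lines.append(f"{name}: {{{alt}}}")
    return "\n".join(lines)
```

The resulting string can be written to a grammar.txt file and passed straight to `nltk.RegexpParser`.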

CI using Github workflows and dockerisation #37

  • Wrote a ci.yaml file for GitHub Actions
  • Modified the existing tests to best fit the needs of the CI pipeline and the refinement pipeline
  • Built a Docker image and updated ci.yaml to rebuild and push the Docker image to the registry every time it is updated
  • Updated the README
Issue | Description
----- | -----------
#36   | Setting Github action workflow and docker

Conclusion

Google Summer of Code was one of the best summers for me, and I would recommend it to anyone starting out in software development. It was a great learning experience: I learnt good coding practices, reading documentation, GraphQL, setting up GitHub Actions, writing tests, and a ton of biology. Turns out, I like biology, and I look forward to working on more bioinformatics projects.

GSoC was certainly a little challenging, but it was fun at the same time. I would like to thank my mentors and the OBF team for the opportunity, and especially my mentor Manuel Lera Ramirez for guiding me so well throughout the GSoC period. I plan to keep contributing to the project whenever I can.

@ranbir7

ranbir7 commented Oct 6, 2022

hey there , can I contribute to this project?
I'm Just halfway there with python ,If you suggest me some skills required for this project
I would happily try my best acquiring them.

@anamika-yadav99

anamika-yadav99 commented Oct 6, 2022

> hey there , can I contribute to this project? I'm Just halfway there with python ,If you suggest me some skills required for this project I would happily try my best acquiring them.

Hi @ranbir7! This project is almost complete, and there's nothing left to do in the immediate future. Maybe in a few months, when Manuel is free, he might open an issue you can work on, but for now I don't think there's anything. If you want to understand this project better and work on something similar, I'm happy to help.
A good starting point for acquiring skills related to this project would be learning more about TSV, JSON and TOML files and learning the pandas and NLTK libraries. For the biological concepts, you can go through biological_concepts.md in the repo.

@ranbir7
Copy link

ranbir7 commented Oct 7, 2022

okay thankss💛
