Phenotypic data is key to advancing health in biomedicine. However, most phenotypic data is not available in structured form; instead, it is encoded as free text on various websites and in journal publications. A phenopacket is a standardized, structured text file that describes the genes and the observable biological features (phenotypes) of an individual organism or a group of organisms. This project aims to create phenopackets from semi-structured phenotype data.
Phenopacket Scraper is a tool that extracts information from life sciences websites, analyzes it, and generates a phenopacket from the extracted information together with the matching external ontology references. It includes a multi-level command line interface, a REST API and a webapp. This project aims to extend the utility of a common phenotype exchange format so as to improve collaboration and analysis among biological researchers.
The command line interface and the API are written entirely in Python. I have implemented the CLI using the cliff framework so as to allow multi-level commands. It takes input in the form of a URL or a file, scrapes the required data from the website and generates a phenopacket. The scraping part has been implemented using the BeautifulSoup library, and the phenopacket generation has been implemented mostly using phenopacket-python and scigraph-services. The REST API has been built using Django REST framework, and its purpose and core implementation are similar to those of the command line interface. The webapp has been implemented using Django and uses the phenopacket-scraper-api to produce its results.
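The scrape-then-annotate flow described above can be illustrated with a minimal sketch. It is not the project's actual code: to stay dependency-free it swaps in the standard library's `html.parser` for BeautifulSoup, and it replaces the scigraph-services annotation call with a small hard-coded lookup table. The names `TextExtractor`, `TERM_LOOKUP`, `extract_text` and `build_packet`, as well as the shape of the output dictionary, are assumptions made for illustration and do not reflect the phenopacket-python schema.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from a page, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


# Hypothetical stand-in for an ontology annotation service: maps a keyword
# found in the page text to an (ontology ID, label) pair. Illustrative only.
TERM_LOOKUP = {
    "seizures": ("HP:0001250", "Seizure"),
    "microcephaly": ("HP:0000252", "Microcephaly"),
}


def extract_text(html):
    """Return the visible text of an HTML string as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_packet(html, source_id):
    """Build a simplified, phenopacket-like dict (assumed shape, not the real schema)."""
    text = extract_text(html).lower()
    phenotypes = [
        {"id": term_id, "label": label}
        for keyword, (term_id, label) in TERM_LOOKUP.items()
        if keyword in text
    ]
    return {"source": source_id, "phenotypes": phenotypes}


if __name__ == "__main__":
    sample = (
        "<html><body><p>The patient presented with seizures.</p>"
        "<script>var note = 'microcephaly';</script></body></html>"
    )
    print(json.dumps(build_packet(sample, "example-page"), indent=2))
```

In the real tool, the keyword table would be replaced by calls to an annotation service, and the output would be assembled with phenopacket-python rather than a plain dictionary.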
Setup and usage guidelines are explained in the README of each GitHub repository.