@srbcheema1
Last active August 7, 2018 15:00
Google Summer of Code

Simplify the Usage of VCF Validation Suite.

About the organization

The Global Alliance for Genomics and Health (GA4GH)

The Global Alliance for Genomics and Health (GA4GH) was formed to help accelerate the potential of genomic medicine to advance human health. It brings together over 400 leading genome institutes and centers with IT industry leaders to create global standards and tools for the secure, privacy-respecting and interoperable sharing of genomic data.

European Variation Archive

The European Variation Archive is an open-access database of all types of genetic variation data from all species. All users can download data from any study, or submit their own data to the archive. They can also query all variants in the EVA by study, gene, chromosomal location or dbSNP identifier using the Variant Browser.

About the Project

Vcf-validator is a suite of command-line tools that can validate and fix VCF files. The goal of my project was to overcome the limitations of the validation suite that restrict its suitability for users with a less technical, more biological profile. I performed the following tasks:

  • The suite was hard to compile on non-Linux operating systems. I worked to simplify the build process for Windows and macOS, so the suite can now be compiled on Linux, macOS and Windows.

  • The suite is completely terminal-based: it can only read from and write reports to local files, and it needs to be installed and executed on the user's machine. To deal with this, I designed a prototype of a network interface that runs the suite as a service, allowing users to validate their own remote files or a dynamically generated VCF stream.

  • Previously, if the input VCF was compressed, it was the user's responsibility to decompress it. My task was to remove this extra step by making the validator itself capable of decompressing such files.
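
The idea behind the last point can be sketched in a few lines. This is only an illustration in Python, not the validator's actual code (which is C++ and uses Boost filters). The function name `open_vcf` and the choice of gzip and bzip2 as the supported formats are assumptions made for the example. The sketch peeks at the first bytes of the stream, recognises the compression format from its magic number, and returns a stream of plain text lines either way.

```python
import bz2
import gzip
import io

# Leading "magic" bytes that identify each compressed format.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def open_vcf(raw):
    """Hypothetical helper: wrap a binary stream so callers read plain text.

    The formats handled here (gzip, bzip2) are an assumption for this sketch.
    """
    # peek() lets us look at the first bytes without consuming them.
    buffered = raw if hasattr(raw, "peek") else io.BufferedReader(raw)
    head = buffered.peek(3)[:3]
    if head.startswith(GZIP_MAGIC):
        stream = gzip.GzipFile(fileobj=buffered, mode="rb")
    elif head.startswith(BZIP2_MAGIC):
        stream = bz2.BZ2File(buffered, mode="rb")
    else:
        stream = buffered  # not compressed: read the bytes as they are
    return io.TextIOWrapper(stream, encoding="utf-8")
```

With a helper like this, the caller no longer needs to know whether the file was compressed: `open_vcf(open(path, "rb"))` yields the same lines in both cases.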

Team

  • Sarbjit Singh - Student
  • Cristina Yenyxe Gonzalez - Mentor
  • Jose Miguel Mut Lopez - Mentor

My Pull Requests

My contribution to the Project is as follows:

Technologies Used:

  • C++ - the code of vcf-validator is implemented in C++.
  • Boost libraries - Boost filters are used to decompress compressed input streams.
  • CMake - to generate build scripts.
  • MinGW - used to test the build on Windows (didn't work).
  • MSVS - used to build on Windows.
  • Python 3 - language of the remote-validator code.
  • gRPC - technology used to send and receive data streams over the network.
  • WebSockets - another technology used in prototyping the network interface.
  • asyncio - asynchronous library used to implement the websockets prototype.
  • Shell script - to implement install_dependencies.sh
  • Batch script - to implement install_dependencies.bat
  • Visual Studio - to compile the odb libraries for Windows

Completed Tasks:

Before and during the GSoC period, I completed the following tasks:

Reading Compressed files Directly

Issues :

Pull Requests :

Simplifying the Build Process

Mac OS

Working on the Mac build was an interesting and challenging task. At first it seemed that building on macOS would be quite similar to Linux, since macOS is Unix-based and the two share a common environment. Once the work started, however, new challenges appeared. The first was that macOS does not support fully static builds (reference links here and here): there is no way to get rid of the libSystem.B.dylib reference. Another challenge was linking the libraries, which works quite differently on macOS than on Linux and had to be explored and implemented. Once the build was successful, the next step was to make CMakeLists.txt compatible with both platforms, which meant restructuring it and making it smarter. The work for simplifying the build for OS X was done in this PR

Windows

The Windows build involved several challenges, such as installing the odb dependency libraries, sqlite3 and the Boost packages. In the end we were able to build static binaries with the odb dependencies dynamically linked. The work for simplifying the build for Windows was done in PR

Prototype for a network interface for validator.

I proposed several technologies for the remote validator, of which websockets and gRPC were shortlisted by the mentors. I prepared prototypes of the remote validator using both. Working with gRPC was particularly interesting.
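
To give a rough idea of what "running the suite as a service" means, here is a minimal sketch. It is not one of the prototypes: it uses plain TCP streams from Python's standard asyncio library as a stand-in for websockets or gRPC, so that it runs without extra packages. The function `check_line` is a hypothetical placeholder for the real validator, and every other name here is made up for the example. The client streams VCF lines to the service, signals the end of the stream, and receives a short report back.

```python
import asyncio


def check_line(line):
    """Hypothetical stand-in for the validator: data lines need 8+ columns."""
    return line.startswith("#") or len(line.split("\t")) >= 8


async def handle_client(reader, writer):
    """Service side: check the incoming stream line by line, then reply."""
    total = errors = 0
    while True:
        raw = await reader.readline()  # note: default limit is 64 KiB per line
        if not raw:  # empty bytes means the client finished sending
            break
        line = raw.decode().rstrip("\n")
        if line:
            total += 1
            errors += not check_line(line)
    writer.write(f"{total} lines, {errors} invalid\n".encode())
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def validate_remote(host, port, lines):
    """Client side: stream the lines to the service and return its report."""
    reader, writer = await asyncio.open_connection(host, port)
    for line in lines:
        writer.write(line.encode() + b"\n")
        await writer.drain()
    writer.write_eof()  # end of stream; the connection stays open for the reply
    report = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return report.decode().strip()


async def demo(lines):
    """Start the service on a free local port and validate `lines` against it."""
    server = await asyncio.start_server(handle_client, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        return await validate_remote("127.0.0.1", port, lines)
```

Calling `asyncio.run(demo(lines))` with a list of VCF lines returns the report string. Because the service processes lines as they arrive, the same design works for a dynamically generated stream as well as for a file.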

gRPC

web sockets

Ongoing tasks :

Assembly checker tool

Assembly Checker reads the CHROM, POS and REF columns from the VCF file and, for each line, looks into the FASTA file to see whether the REF sequence matches in that region. A PR is ready with the following functionality:

  • Summary report - it displays the percentage of matches in the VCF file.
  • Text report - I have split the text report into two parts, valid and invalid, which write the matching and non-matching lines from the VCF file to separate report files.
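
The check described above can be sketched as follows. This is a simplified illustration in Python, not the code from the PR. The function names are invented for the example, and the whole FASTA is loaded into memory, which a real tool working on full genomes would likely avoid. It splits the VCF data lines into valid and invalid lists (the two text reports) and computes the match percentage (the summary report).

```python
def load_fasta(lines):
    """Parse FASTA text lines into {sequence_name: sequence} (held in memory)."""
    sequences, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def check_assembly(vcf_lines, sequences):
    """Split VCF data lines into (valid, invalid) by comparing REF to the FASTA."""
    valid, invalid = [], []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header and blank lines
        chrom, pos, _id, ref = line.rstrip("\n").split("\t")[:4]
        start = int(pos) - 1  # VCF positions are 1-based
        expected = sequences.get(chrom, "")[start:start + len(ref)]
        matches = bool(expected) and expected.upper() == ref.upper()
        (valid if matches else invalid).append(line)
    return valid, invalid


def match_percentage(valid, invalid):
    """Summary figure: percentage of data lines whose REF matches the FASTA."""
    total = len(valid) + len(invalid)
    return 100.0 * len(valid) / total if total else 0.0
```

A line whose chromosome is missing from the FASTA, or whose position lies past the end of the sequence, is counted as invalid in this sketch.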

Things scheduled for next PR:

  • adding support for reading VCF files from stdin, so that zcat can be used to read compressed files
  • adding support for reading compressed files directly.

Conclusion :

My project involved both research work (exploring libraries to send a data stream over a connection) and coding. Working with the validation team was a really nice experience, and I learned a lot from my mentors. The tasks were interesting and challenging. I found the remote-validation part especially interesting, as it involved exploring new libraries for sending large data streams over a network. Simplifying the build was the most challenging task of this project: each platform has its own build procedures and rules, and exploring different ways of building on different platforms was really interesting.

The work was completed successfully. The vcf-validator can now be easily built on macOS and Windows. The prototypes of the remote validator, in both gRPC and websockets, are ready and working. Users can use the remote validator to validate their files remotely without installing the validation suite on their local machines.

I am now working on the Assembly Checker tool. It reads the CHROM, POS and REF columns from the VCF file and, for each line, looks into the FASTA file to see whether the REF sequence matches in that region.
