Improve Structural Variants Support in VCF Validator
Here's a brief description of the work done during GSoC
VCF validator is a tool that validates Variant Call Format (VCF) files and reports errors and warnings after performing lexical, syntactic and semantic analysis of the file. It also includes a tool called VCF debugulator, which automatically fixes errors in VCF files. Single nucleotide polymorphisms (SNPs) and short insertions/deletions (INDELs) are fully supported in vcf-validator, but support for structural variants (SVs) was limited. The goal of this project was to improve support for SV tags in the tool. To that end, the code was refactored to accommodate SV tags, new checks for SV tags were implemented, and errors associated with SV tags are now fixed automatically. In addition, duplicates in the ID, FILTER, INFO and FORMAT (and corresponding SAMPLE) columns are now detected and fixed, and a formal grammar for VCF files was proposed.
1. Refactor the code to support SV tags (main goal)
The older code structure for validating predefined tags made it difficult to add support for SV tags and contained some redundancy. Moreover, validation was done by matching the data lines against the reserved tags specification and only then checking whether the metadata also matched it. The structure was therefore changed: the metadata is first matched against the reserved tags specification, and the data lines are then matched against the metadata section.
- Added meta entry checks on Type and Number for INFO and FORMAT predefined tags, taking care of the three supported versions of VCF. This also solved the problem of checking the metadata of predefined tags for correctness.
- Extracted reusable components of the Ragel machines (Ragel is a finite state machine compiler and a parser generator).
- Removed the checks for general predefined tags from the Ragel files.
- Added checks for predefined tags in data lines for INFO and FORMAT. The tags are now checked by matching them with the metadata if present, else matched with the specification.
- Added strict validation checks for INFO and FORMAT predefined tags of SNPs/INDELs and FILTER. This was required for various tags which could not be validated completely with just the metadata matching.
- Added check for validating that the data line matches the meta definition (for any tag) in FORMAT.
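The two-stage structure described above can be illustrated with a minimal Python sketch. It is not the validator's actual code (the tool itself uses Ragel-generated parsers), and the names `RESERVED_INFO`, `check_meta_entry` and `check_info_value` are hypothetical. The table holds only a small illustrative subset of reserved INFO tags.

```python
import re

# Illustrative subset of reserved INFO tags as (Number, Type); not the full spec table.
RESERVED_INFO = {
    "AA": ("1", "String"),
    "AC": ("A", "Integer"),
    "AF": ("A", "Float"),
    "DP": ("1", "Integer"),
}


def check_meta_entry(tag_id, number, type_):
    """Stage 1: compare an INFO meta entry with the reserved-tag table."""
    expected = RESERVED_INFO.get(tag_id)
    if expected is None:
        return []  # not a reserved tag, nothing to compare against
    errors = []
    if number != expected[0]:
        errors.append(f"INFO {tag_id}: Number={number}, expected {expected[0]}")
    if type_ != expected[1]:
        errors.append(f"INFO {tag_id}: Type={type_}, expected {expected[1]}")
    return errors


def value_matches_type(value, type_):
    """Permissive type check; String and Character are not constrained here."""
    if type_ == "Integer":
        return re.fullmatch(r"-?\d+", value) is not None
    if type_ == "Float":
        try:
            float(value)
            return True
        except ValueError:
            return False
    return True


def check_info_value(tag_id, raw_value, meta, n_alt):
    """Stage 2: check a data-line value against the meta section,
    falling back to the reserved-tag table when no meta entry exists."""
    number, type_ = meta.get(tag_id) or RESERVED_INFO.get(tag_id, (".", "String"))
    values = raw_value.split(",")
    errors = []
    if number == "A":
        expected_count = n_alt  # one value per ALT allele
    elif number.isdigit():
        expected_count = int(number)
    else:
        expected_count = None  # '.', 'R', 'G' are not handled in this sketch
    if expected_count is not None and len(values) != expected_count:
        errors.append(f"INFO {tag_id}: {len(values)} values, expected {expected_count}")
    for v in values:
        if v != "." and not value_matches_type(v, type_):
            errors.append(f"INFO {tag_id}: value '{v}' is not of type {type_}")
    return errors
```

For example, `check_meta_entry("AC", "1", "String")` reports both a Number and a Type mismatch, while `check_info_value("AC", "3,5", {}, 2)` reports nothing because two ALT alleles carry two integer values.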
2. Support for SV tags in the validator (main goal)
With the new structure, it became convenient to add support for SV tags. This involved writing new checks for SV predefined tags (INFO and FORMAT ones) to match the meta definition, the specification and their strict validation.
- Added new meta header checks for SV tags, to check if the meta definition matches Type and Number in the spec.
- Added data line checks for SVs, to throw errors or warnings if any inconsistency is found.
- Added strict validation checks for SV-related tags, e.g. checking that INFO SVLEN is equal to len(ALT) - len(REF) for non-symbolic ALTs.
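As a rough illustration of the SVLEN check mentioned above, here is a minimal Python sketch. The function names are hypothetical and the symbolic-allele test is simplified (for instance, it does not recognise single breakends); the real check lives in the validator's own code.

```python
def is_symbolic(alt):
    """Simplified test: symbolic alleles (<DEL>), mated breakends,
    the missing value '.', and the overlapping-deletion allele '*'."""
    return alt.startswith("<") or "[" in alt or "]" in alt or alt in (".", "*")


def check_svlen(ref, alts, svlens):
    """Return one message per non-symbolic ALT whose SVLEN
    differs from len(ALT) - len(REF)."""
    errors = []
    for alt, svlen in zip(alts, svlens):
        if is_symbolic(alt):
            continue  # the length rule only applies to sequence alleles
        expected = len(alt) - len(ref)
        if svlen != expected:
            errors.append(f"SVLEN={svlen} for ALT '{alt}', expected {expected}")
    return errors
```

With REF `ACGT` and ALT `A`, the expected SVLEN is 1 - 4 = -3, so `check_svlen("ACGT", ["A"], [-3])` returns an empty list.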
3. Automatically fix SV errors in the debugulator (main goal)
Some SV tags have common errors that the debugulator can now fix automatically. This required changing the structure of the error-throwing mechanism so that it selects the course of action the debugulator should take: completely removing fields with irrecoverable errors, or fixing recoverable errors using the expected values. Incorrect metadata for predefined SNP/INDEL/SV tags is also fixed.
- Added new fixes for SV-related tags, e.g. an incorrect END tag value for precise variants is now corrected to the expected value.
- New fix for meta entry Type or Number inconsistencies with predefined tags in the debugulator, to match the desired values in the spec.
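The END fix can be sketched as follows. This is a hedged illustration, not the debugulator's implementation: it assumes the VCF specification's rule that END equals POS + len(REF) - 1 for precise variants, treats a record as precise when the IMPRECISE flag is absent, and leaves records with symbolic ALT alleles untouched. The name `fix_end` and the dict-based INFO representation are hypothetical.

```python
def is_symbolic(alt):
    """Simplified test for symbolic alleles, mated breakends, '.' and '*'."""
    return alt.startswith("<") or "[" in alt or "]" in alt or alt in (".", "*")


def fix_end(pos, ref, alts, info):
    """Return a copy of the INFO dict with END corrected for precise,
    non-symbolic variants; other records are returned unchanged."""
    fixed = dict(info)
    if "END" not in info or "IMPRECISE" in info:
        return fixed  # nothing to fix, or only a best estimate is available
    if any(is_symbolic(alt) for alt in alts):
        return fixed  # END cannot be derived from REF alone in this sketch
    expected = str(pos + len(ref) - 1)
    if info["END"] != expected:
        fixed["END"] = expected  # recoverable error: replace with the expected value
    return fixed
```

For a record at POS 100 with REF `ACGT`, an END of 105 would be rewritten to 103.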
4. Detect and fix duplicate field errors in columns (additional work)
Version 4.3 of the VCF spec does not permit duplicate IDs, FILTERs, FORMATs (and SAMPLEs) or INFOs. This goal involved adding validation and fixing of duplicates in these columns.
- Added checks in the validator for detecting the presence of duplicate ID fields, FILTER strings, INFO keys and FORMAT fields.
- Added new fixes in the debugulator for the following:
  - Remove duplicate ID fields
  - Remove duplicate or incorrect FILTER strings
  - Remove duplicate INFO keys, taking into consideration the values of those keys
  - Remove duplicate FORMAT fields, based on their corresponding values in each of the SAMPLE columns
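A minimal Python sketch of duplicate detection and removal for the ID and INFO columns is shown below. The function names are hypothetical, and the policy for INFO keys (drop a repeated key only when its value is identical, otherwise report a conflict and keep the first occurrence) is an assumption made for illustration, not necessarily what the debugulator does.

```python
def find_duplicates(items):
    """Return each item that occurs more than once, in order of first repeat."""
    seen, dups = set(), []
    for item in items:
        if item in seen and item not in dups:
            dups.append(item)
        seen.add(item)
    return dups


def dedupe_id(field):
    """Remove repeated IDs from a semicolon-separated ID column, keeping order."""
    return ";".join(dict.fromkeys(field.split(";")))


def dedupe_info(field):
    """Remove repeated INFO keys. Returns the cleaned column and the list of
    keys that were repeated with different values (first occurrence is kept)."""
    kept = {}
    conflicts = []
    for entry in field.split(";"):
        key, _, _value = entry.partition("=")
        if key not in kept:
            kept[key] = entry
        elif kept[key] != entry and key not in conflicts:
            conflicts.append(key)
    return ";".join(kept.values()), conflicts
```

Here `dedupe_id("rs1;rs2;rs1")` yields `rs1;rs2`, and `dedupe_info("DP=10;DP=12")` keeps `DP=10` while flagging `DP` as a conflicting key.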
General code cleanup was done on a regular basis alongside each of the above tasks, in either the same or separate PRs: extracting methods and string literals, writing appropriate tests, and updating the file structure, doc comments, etc.
5. VCF formal grammar (optional goal)
The VCF specification is written in natural language, and no formal definition of its grammar has been provided yet, so users often find it ambiguous and hard to interpret. A formal definition would be a great aid to people who use and develop tools based on VCF.
Work in Progress
6. Simplify build for OS-X and Windows (optional goal)
Building the tool on Linux is fairly straightforward using Docker, but it is complicated for Mac and Windows users. Work on this is still in progress.