./expand_and_flatten_vcf.py schema -i kaviar_100.vcf -o schema.json
./expand_and_flatten_vcf.py vcf -i kaviar_100.vcf -o expanded_vcf
Expand the INFO column and flatten multiple variants to turn a canonical VCF into a flat table. Also extract the schema. This is useful for storing variants in a database, for instance when uploading to GCP BigQuery.
The canonical format for a VCF file contains 8 "fixed fields":
#CHROM POS ID REF ALT QUAL FILTER INFO
The `INFO` column contains key-value pairs separated by the delimiter `;`.
Example from ClinVar:
ALLELEID=959428;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;CLNHGVS=NC_000001.11:g.943363G>C;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=SAMD11:148398;MC=SO:0001583|missense_variant;ORIGIN=1
Example from Kaviar:
AF=0.0000379,0.0000379;AC=1,1;AN=26378;END=10145
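The key-value structure above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by `expand_and_flatten_vcf.py`; the function name `parse_info` is hypothetical. Values are kept as raw strings on purpose, because a comma does not always separate per-allele values (see `CLNREVSTAT` in the ClinVar example).

```python
def parse_info(info, delimiter=";"):
    """Split a raw INFO string into a dict of key -> raw string value.

    Hypothetical helper for illustration only.
    """
    parsed = {}
    for field in info.split(delimiter):
        if not field or field == ".":
            continue  # skip empty entries and the VCF missing-value marker
        key, sep, value = field.partition("=")
        # A key without "=" is a flag: its presence means true.
        parsed[key] = value if sep else True
    return parsed


kaviar = parse_info("AF=0.0000379,0.0000379;AC=1,1;AN=26378;END=10145")
print(kaviar)
```

Splitting the comma-delimited values is left to a later step, once the header has told us which keys hold one value per alternate allele.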
Also, when multiple variants are called for a single genomic coordinate, they are included in a single row for that coordinate and comma-delimited in the ALT column. Associated data for these variants in the `INFO` column, such as allele frequency (`AF`), are then also comma-delimited. For example, the following row from Kaviar identifies three possible variants, with three associated values each for the allele frequency and the allele count (`AC`):
1 10108 . C CA,CCT,CT . . AF=0.0000379,0.0018197,0.0003033;AC=1,48,8;AN=26378
In this case, the values for additional data (`AF` and `AC`) are listed in the same order as the alternate alleles in the ALT column.
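As a rough sketch of the flattening step, the row above can be turned into one record per alternate allele. The function `flatten_row` and its `per_allele_keys` parameter are hypothetical names for illustration; the actual script may work differently. The sketch assumes the per-allele keys are known in advance (in practice they would come from header entries with `Number=A`) and splits on whitespace to match the example as printed here, whereas real VCF files are tab-delimited.

```python
def flatten_row(row, per_allele_keys=("AF", "AC")):
    """Yield one dict per alternate allele from a split VCF data row.

    Hypothetical helper: `per_allele_keys` lists the INFO keys that carry
    one comma-delimited value per ALT allele.
    """
    chrom, pos, vid, ref, alt, qual, flt, info = row[:8]
    info_fields = dict(f.split("=", 1) for f in info.split(";") if "=" in f)
    for i, allele in enumerate(alt.split(",")):
        flat = {"CHROM": chrom, "POS": int(pos), "ID": vid, "REF": ref,
                "ALT": allele, "QUAL": qual, "FILTER": flt}
        for key, value in info_fields.items():
            # Per-allele values are picked by position; others are repeated.
            flat[key] = value.split(",")[i] if key in per_allele_keys else value
        yield flat


line = "1 10108 . C CA,CCT,CT . . AF=0.0000379,0.0018197,0.0003033;AC=1,48,8;AN=26378"
rows = list(flatten_row(line.split()))
for flat in rows:
    print(flat["POS"], flat["ALT"], flat["AF"], flat["AC"], flat["AN"])
```

Each output record repeats the coordinate and the shared `AN` value, while `ALT`, `AF` and `AC` each hold a single value.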
The VCF header lines specify the schema for the data contained in the `INFO` column.
Full Kaviar header:
##fileformat=VCFv4.1
##fileDate=20160209
##source=bin/makeVCF.pl
##reference=file:///proj/famgen/resources/Kaviar-160204-Public/bin/../tabixedRef/hg19.gz
##version=Kaviar-160204 (hg19)
##kaviar_url=http://db.systemsbiology.org/kaviar
##publication=Glusman G, Caballero J, Mauldin DE, Hood L and Roach J (2011) KAVIAR: an accessible system for testing SNV novelty. Bioinformatics, doi: 10.1093/bioinformatics/btr540
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele Count">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in data sources">
##INFO=<ID=END,Number=.,Type=Integer,Description="End position">
##INFO=<ID=DS,Number=A,Type=String,Description="Data Sources containing allele">
Samples from ClinVar header:
##INFO=<ID=CLNDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDNINCL,Number=.,Type=String,Description="For included Variant : ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNDISDBINCL,Number=.,Type=String,Description="For included Variant: Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Top-level (primary assembly, alt, or patch) HGVS expression.">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar review status for the Variation ID">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Clinical significance for this single variant">
While there are many standard or customary INFO fields, such as those in the documentation, custom ones are fine, as in the ClinVar example. In order to generate a full schema specification we need to parse the header rows. We combine this parsed schema with the schema for the fixed fields (constructed by hand), which is shown below.
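The header parsing described above could look roughly like the following. This is a sketch under stated assumptions, not the package's implementation: the name `info_schema`, the regular expression, and the mapping from VCF types to BigQuery-style type names are all illustrative choices. It also assumes descriptions contain no embedded double quotes.

```python
import json
import re

# Matches lines such as:
# ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
INFO_LINE = re.compile(
    r'^##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<desc>.*?)"'
)

# Assumed mapping from VCF types to BigQuery-style type names.
TYPE_MAP = {"Integer": "INTEGER", "Float": "FLOAT", "String": "STRING",
            "Character": "STRING", "Flag": "BOOLEAN"}


def info_schema(header_lines):
    """Build one schema entry per ##INFO header line (hypothetical helper)."""
    fields = []
    for line in header_lines:
        match = INFO_LINE.match(line)
        if match is None:
            continue  # not an INFO definition, e.g. ##fileformat
        fields.append({
            "description": match.group("desc"),
            "mode": "NULLABLE",
            "name": match.group("id"),
            "type": TYPE_MAP.get(match.group("type"), "STRING"),
        })
    return fields


header = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">',
    '##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in data sources">',
]
schema = info_schema(header)
print(json.dumps(schema, indent=2))
```

Appending these entries to the list of fixed-field entries would give a full schema for the flattened table.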
usage: expand_and_flatten_vcf.py [-h] --input_vcf INPUT_VCF [--output_vcf OUTPUT_VCF] [--info_column_index INFO_COLUMN_INDEX]
[--info_delimiter INFO_DELIMITER] [--base_schema BASE_SCHEMA]
Expand INFO column in VCF Files and output or write.
VCF Files have a column called INFO with 'key=value'
pairs separated by ';'.
For example:
<example of INFO column>
Also, when multiple variants are called for a single
genomic position, these alternates are comma-separated
in the VCF file. In these situations, the genomic position
is repeated with the alternate variants in successive rows.
For example:
<example of multiple variants and expanded version>
optional arguments:
-h, --help show this help message and exit
--input_vcf INPUT_VCF, -i INPUT_VCF
Input VCF file with INFO column as string with key-value pairs.
--output_vcf OUTPUT_VCF, -o OUTPUT_VCF
Expanded VCF file
--info_column_index INFO_COLUMN_INDEX, -x INFO_COLUMN_INDEX
0-indexed index of the INFO column. Default value,
according to spec, is 7.
--info_delimiter INFO_DELIMITER, -d INFO_DELIMITER
Custom separator for INFO key-value pairs in case of some
weird file. Default value, according to standard, is ";"
--base_schema BASE_SCHEMA, -b BASE_SCHEMA
The standard VCF format has 7 columns of data and the INFO column.
The schema for these first 7 "base" columns is not in the header.
This should be a JSON string containing the base schema if different
than the default ones in this package.
[
{
"description": "Chromosome",
"mode": "NULLABLE",
"name": "CHROM",
"type": "STRING"
},
{
"description": "Start position (0-based). Corresponds to the first base of the string of reference bases.",
"mode": "NULLABLE",
"name": "POS",
"type": "INTEGER"
},
{
"description": "",
"mode": "NULLABLE",
"name": "ID",
"type": "STRING"
},
{
"description": "Reference bases.",
"mode": "NULLABLE",
"name": "REF",
"type": "STRING"
},
{
"description": "Alternate bases.",
"mode": "NULLABLE",
"name": "ALT",
"type": "STRING"
},
{
"description": "Phred-scaled quality score (-10log10 prob(call is wrong)). Higher values imply better quality.",
"mode": "NULLABLE",
"name": "QUAL",
"type": "FLOAT"
},
{
"description": "List of failed filters (if any) or \"PASS\" indicating the variant has passed all filters.",
"mode": "NULLABLE",
"name": "FILTER",
"type": "STRING"
}
]
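To illustrate how a schema like this might be applied, the sketch below casts the string values of one flattened row to the declared types and maps the VCF missing-value marker `.` to `None`, which is what the `NULLABLE` mode allows. The helper `cast_row` and the `CASTS` table are hypothetical and not part of the package; only a subset of the fields is shown.

```python
# Assumed mapping from schema type names to Python constructors.
CASTS = {"INTEGER": int, "FLOAT": float, "STRING": str}


def cast_row(flat_row, schema):
    """Cast raw string values to the types declared in the schema.

    Hypothetical helper: missing keys and "." become None.
    """
    typed = {}
    for field in schema:
        raw = flat_row.get(field["name"])
        if raw is None or raw == ".":
            typed[field["name"]] = None
        else:
            typed[field["name"]] = CASTS[field["type"]](raw)
    return typed


base_subset = [
    {"name": "CHROM", "type": "STRING", "mode": "NULLABLE"},
    {"name": "POS", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "ID", "type": "STRING", "mode": "NULLABLE"},
    {"name": "QUAL", "type": "FLOAT", "mode": "NULLABLE"},
]
typed = cast_row({"CHROM": "1", "POS": "10108", "ID": ".", "QUAL": "."}, base_subset)
print(typed)
```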