abalter/.gitignore

## .gitignore
.ipynb_checkpoints


## expand_and_flatten_vcf.md

      
    Expand and Flatten VCF

./expand_and_flatten_vcf.py schema -i kaviar_100.vcf -o schema.json
./expand_and_flatten_vcf.py vcf -i kaviar_100.vcf -o expanded_vcf
Expand the INFO column and flatten multi-allelic rows to turn a canonical VCF into a flat, tab-delimited table, and extract the table schema as JSON. This is useful for loading variants into a database, for instance GCP BigQuery.
VCF Format

The canonical format for a VCF file contains 8 "fixed fields":
#CHROM POS ID REF ALT QUAL FILTER INFO
The INFO column contains key-value pairs separated by semicolons (;).
Example from ClinVar:
ALLELEID=959428;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;CLNHGVS=NC_000001.11:g.943363G>C;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=SAMD11:148398;MC=SO:0001583|missense_variant;ORIGIN=1

Example from Kaviar:
AF=0.0000379,0.0000379;AC=1,1;AN=26378;END=10145
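
As a rough illustration of the INFO layout described above, the following sketch splits an INFO string into keys and comma-separated values. The function name `parse_info` is hypothetical and is not part of `expand_and_flatten_vcf.py`; treating entries without `=` as flags with an empty value list is an assumption made for this example.

```python
def parse_info(info, delimiter=";"):
    """Turn an INFO string into a dict mapping each key to a list of values."""
    parsed = {}
    for pair in info.split(delimiter):
        # partition() splits on the first "=" only, so values may contain "="
        key, sep, value = pair.partition("=")
        # Entries without "=" are flags; this sketch records them as an empty list
        parsed[key] = value.split(",") if sep else []
    return parsed


kaviar = parse_info("AF=0.0000379,0.0000379;AC=1,1;AN=26378;END=10145")
print(kaviar["AF"])  # two values, one per alternate allele
print(kaviar["AN"])  # a single value shared by all alleles
```

Note that splitting every value on commas also splits free-text values that happen to contain a comma, such as the ClinVar `CLNREVSTAT` value shown above.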

Also, when multiple variants are called for a single genomic coordinate, they are listed in a single row for that coordinate, comma-delimited in the ALT column. Associated per-variant data in the INFO column, such as allele frequency (AF), are then also comma-delimited. For example, the following row from Kaviar identifies three possible variants, with three associated values each for the allele frequency and the allele count (AC):
1	10108	.	C	CA,CCT,CT	.	.	AF=0.0000379,0.0018197,0.0003033;AC=1,48,8;AN=26378

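In this case, the values for additional data such as AF and AC are split so that each variant gets its own row, and single-valued fields such as AN are repeated on every row.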
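
A minimal sketch of that flattening step, using the Kaviar row above. The helper `flatten_row` is hypothetical and simplified (it handles only a few fixed fields), but it follows the same idea as the script: index per-allele values by allele position and repeat single values.

```python
def flatten_row(chrom, pos, ref, alts, info):
    """Yield one dict per ALT allele from a single multi-allelic row."""
    fields = {}
    for pair in info.split(";"):
        key, _, value = pair.partition("=")
        fields[key] = value.split(",")
    for i, alt in enumerate(alts.split(",")):
        row = {"CHROM": chrom, "POS": pos, "REF": ref, "ALT": alt}
        for key, values in fields.items():
            # Per-allele fields (AF, AC) are indexed by allele position;
            # fields with a single value (AN) are repeated on every row.
            row[key] = values[min(i, len(values) - 1)]
        yield row


rows = list(flatten_row(
    "1", "10108", "C", "CA,CCT,CT",
    "AF=0.0000379,0.0018197,0.0003033;AC=1,48,8;AN=26378",
))
for row in rows:
    print(row["ALT"], row["AF"], row["AC"], row["AN"])
```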
VCF Header

The VCF header lines specify the schema for the data contained in the INFO column.
Full Kaviar header:
##fileformat=VCFv4.1
##fileDate=20160209
##source=bin/makeVCF.pl
##reference=file:///proj/famgen/resources/Kaviar-160204-Public/bin/../tabixedRef/hg19.gz
##version=Kaviar-160204 (hg19)
##kaviar_url=http://db.systemsbiology.org/kaviar
##publication=Glusman G, Caballero J, Mauldin DE, Hood L and Roach J (2011) KAVIAR: an accessible system for testing SNV novelty. Bioinformatics, doi: 10.1093/bioinformatics/btr540
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele Count">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in data sources">
##INFO=<ID=END,Number=.,Type=Integer,Description="End position">
##INFO=<ID=DS,Number=A,Type=String,Description="Data Sources containing allele">

Samples from ClinVar header:
##INFO=<ID=CLNDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDNINCL,Number=.,Type=String,Description="For included Variant : ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNDISDBINCL,Number=.,Type=String,Description="For included Variant: Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Top-level (primary assembly, alt, or patch) HGVS expression.">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar review status for the Variation ID">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Clinical significance for this single variant">

Generate Full VCF Schema from Header

While there are many standard or customary INFO fields, such as those in the documentation, custom ones are fine, as in the ClinVar example. In order to generate a full schema specification we need to parse the header rows. We combine this parsed schema with the schema for the fixed fields (constructed by hand), which is shown below.
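
The header parsing can be sketched roughly as follows. The function name `parse_info_header` is hypothetical, and upper-casing the VCF type is an assumption that suits `String`, `Integer` and `Float`; a `Flag` field would need its own mapping.

```python
import re

# Split on commas that are outside double quotes
COMMA_OUTSIDE_QUOTES = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def parse_info_header(line):
    """Convert one ##INFO header line into a BigQuery-style field dict."""
    body = re.sub(r"^##INFO=<(.*)>\s*$", r"\1", line)
    # Split each attribute on the first "=" only; descriptions may contain "="
    attrs = dict(item.split("=", 1) for item in COMMA_OUTSIDE_QUOTES.split(body))
    return {
        "name": attrs["ID"],
        "type": attrs["Type"].upper(),
        "mode": "NULLABLE",
        "description": attrs["Description"].strip('"'),
    }


line = ('##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Tag-value pairs '
        'of disease database name and identifier, e.g. OMIM:NNNNNN">')
field = parse_info_header(line)
print(field["name"], field["type"])
print(field["description"])
```

The comma inside the quoted description is left alone because the regular expression only matches commas followed by an even number of double quotes.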
Usage

usage: expand_and_flatten_vcf.py [-h] --input_vcf INPUT_VCF [--output_vcf OUTPUT_VCF] [--info_column_index INFO_COLUMN_INDEX]
                                 [--info_delimiter INFO_DELIMITER] [--fixed_schema FIXED_SCHEMA]
                                 [{vcf,schema}]

Expand the INFO column in VCF files and print or write the result.

VCF Files have a column called INFO with 'key=value'
pairs separated by ';'.

For example:

<example of INFO column>

Also, when multiple variants are called for a single
genomic position, these alternates are comma-separated
in the VCF file. In these situations, the genomic position
is repeated with the alternate variants in successive rows.
For example:

<example of multiple variants and expanded version>

positional arguments:
  {vcf,schema}          Which operation to perform. To export an expanded and flattened
                        vcf file, use "vcf". To export the full schema use "schema". The
                        default value is "vcf" if not specified.

optional arguments:
  -h, --help            show this help message and exit
  --input_vcf INPUT_VCF, -i INPUT_VCF
                        Input VCF file with INFO column as string with key-value pairs.
  --output_vcf OUTPUT_VCF, -o OUTPUT_VCF
                        Expanded VCF file
  --info_column_index INFO_COLUMN_INDEX, -x INFO_COLUMN_INDEX
                         0-indexed index of the INFO column. Default value,
                         according to spec, is 7.

  --info_delimiter INFO_DELIMITER, -d INFO_DELIMITER
                         Custom separator for INFO key-value pairs in case of some
                         weird file. Default value, according to standard, is ";"

  --fixed_schema FIXED_SCHEMA, -b FIXED_SCHEMA
                        The standard VCF format has 7 fixed columns of data followed by the INFO column.
                        The schema for these first 7 "base" columns is not in the header.
                        This should be a JSON string containing the base schema if it differs
                        from the default one in this package.

Schema for Fixed Fields

[
  {
    "description": "Chromosome",
    "mode": "NULLABLE",
    "name": "CHROM",
    "type": "STRING"
  },
  {
    "description": "Start position (1-based). Corresponds to the first base of the string of reference bases.",
    "mode": "NULLABLE",
    "name": "POS",
    "type": "INTEGER"
  },
  {
    "description": "",
    "mode": "NULLABLE",
    "name": "ID",
    "type": "STRING"
  },
  {
    "description": "Reference bases.",
    "mode": "NULLABLE",
    "name": "REF",
    "type": "STRING"
  },
  {
    "description": "Alternate bases.",
    "mode": "NULLABLE",
    "name": "ALT",
    "type": "STRING"
  },
  {
    "description": "Phred-scaled quality score (-10log10 prob(call is wrong)). Higher values imply better quality.",
    "mode": "NULLABLE",
    "name": "QUAL",
    "type": "FLOAT"
  },
  {
    "description": "List of failed filters (if any) or \"PASS\" indicating the variant has passed all filters.",
    "mode": "NULLABLE",
    "name": "FILTER",
    "type": "STRING"
  }
]
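
Because the flat table's columns follow the order of the full schema (fixed fields first, then INFO fields), a quick consistency check between the two outputs is possible. The sketch below uses small in-memory stand-ins for `schema.json` and the expanded table; the function name `header_matches_schema` and the sample contents are illustrative assumptions.

```python
import csv
import io
import json


def header_matches_schema(schema_json, table_text):
    """Return True when the table's header row equals the schema's field names."""
    names = [field["name"] for field in json.loads(schema_json)]
    header = next(csv.reader(io.StringIO(table_text), delimiter="\t"))
    return header == names


# Stand-ins for the contents of schema.json and the expanded table
schema_json = json.dumps([
    {"name": "CHROM", "type": "STRING", "mode": "NULLABLE"},
    {"name": "POS", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "AF", "type": "Float", "mode": "NULLABLE"},
])
table_text = "CHROM\tPOS\tAF\n1\t10001\t0.0000379\n"

print(header_matches_schema(schema_json, table_text))
```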


## expand_and_flatten_vcf.py
#!/usr/bin/env python

import re
import sys
import json

fixed_schema = [
  {
    "description": "Chromosome",
    "mode": "NULLABLE",
    "name": "CHROM",
    "type": "STRING"
  },
  {
    "description": "Start position (1-based). Corresponds to the first base of the string of reference bases.",
    "mode": "NULLABLE",
    "name": "POS",
    "type": "INTEGER"
  },
  {
    "description": "dbSNP ID (rs###)",
    "mode": "NULLABLE",
    "name": "ID",
    "type": "STRING"
  },
  {
    "description": "Reference bases.",
    "mode": "NULLABLE",
    "name": "REF",
    "type": "STRING"
  },
  {
    "description": "Alternate bases.",
    "mode": "NULLABLE",
    "name": "ALT",
    "type": "STRING"
  },
  {
    "description": "Phred-scaled quality score (-10log10 prob(call is wrong)). Higher values imply better quality.",
    "mode": "NULLABLE",
    "name": "QUAL",
    "type": "STRING"
  },
  {
    "description": "List of failed filters (if any) or \"PASS\" indicating the variant has passed all filters.",
    "mode": "NULLABLE",
    "name": "FILTER",
    "type": "STRING"
  }
]


class VCF_INFO_EXPANDER():

    ### Default schema for the fixed fields. This is the module-level
    ### list defined above, exposed as a class attribute.
    fixed_schema = fixed_schema

    def __init__(
            self,
            input_vcf_file="",
            output_vcf_file=None,
            info_column_index=7,
            info_delimiter = ";",
            fixed_schema=fixed_schema
        ):

#         print("__init__")

        self.input_vcf_file = input_vcf_file
        self.output_vcf_file = output_vcf_file
        self.info_column_index = info_column_index
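        ### Accept the schema either as a JSON string (from the command
        ### line) or as an already-parsed list (the class default).
        if isinstance(fixed_schema, str):
            fixed_schema = json.loads(fixed_schema)
        self.fixed_schema = fixed_schema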
#         print(self.fixed_schema)
        self.info_delimiter = info_delimiter

        self.info_schema = self.parseVCF_Schema()
#         print(self.info_schema)

        self.full_schema = self.fixed_schema + self.info_schema

        self.base_fields = [var["name"] for var in self.fixed_schema]
        self.info_fields = [var["name"] for var in self.info_schema]
        self.all_fields = self.base_fields + self.info_fields

#         print(self.info_schema)

    def parseVCF_Schema(self):
#         print("parseVCF_Schema")

        vcf_file = self.input_vcf_file
        info_column_index = self.info_column_index

        info_schema = []

        with open(vcf_file) as vcf:

            for line in vcf:

                ### Capture lines that have field info
                ### They look like with "##INFO=<k=v,k=v, ...>"
                if line.startswith("##INFO"):
                    info_data = re.sub("^##INFO=<(.*)>", r"\1", line).strip()
                    ### regex to split by commas, but only outside of quotes
                    regex = r",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
                    kv_pairs = re.split(regex, info_data)
                    ### Split on the first "=" only; descriptions may contain "="
                    info_dict = dict(item.split("=", 1) for item in kv_pairs)

                    ### Rename necessary fields to match
                    ### BigQuery schema fields
                    ### Description --> description (surrounding quotes removed)
                    ### Type --> type
                    ### ID --> name
                    ### Number --> dropped; every field is NULLABLE because
                    ###            multi-valued fields are flattened to one
                    ###            value per output row
                    info_dict['description'] = info_dict.pop('Description').strip('"')
                    info_dict['type'] = info_dict.pop('Type')
                    info_dict['name'] = info_dict.pop('ID')
                    info_dict['mode'] = 'NULLABLE'
                    del info_dict['Number']

                    ### Add parsed schema to base schema
                    info_schema.append(info_dict)

                ### ignore lines that are header lines but not field info
                elif bool(re.search("^##", line)):
                    pass

                ### done with header, stop reading. Prevents reading through
                ### entire file.
                else:
                    break

        return info_schema


    def getNumHeaderLines(self):
#         print("getNumHeaderLines")

        filename = self.input_vcf_file
        num_header_lines = 0

        with open(filename) as vcf:
            for line in vcf:

                ### capture lines that have field info
                if bool(re.search("^##", line)):
        #             print(line)
                    num_header_lines += 1

                ### done with header, stop reading
                else:
                    break
        return num_header_lines


    def expandInfoData(self, info):
#         print("expandInfoData")

        fields = self.info_fields

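        ### Split on the configured delimiter and on the first "=" only.
        ### Entries without "=" are flags and are recorded as "true"
        ### (an assumption; adjust if flags should be stored differently).
        info_dict = {}
        for pair in info.split(self.info_delimiter):
            key, sep, value = pair.partition("=")
            info_dict[key] = value if sep else "true"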

        if not fields:
            fields = info_dict.keys()

        info_dict = {k:info_dict.get(k, ".").split(",") for k in fields}

        return info_dict


    def splitRowDict(
            self,
            row_dict,
            alt_sep=",",
            alt_field="ALT"
        ):
#         print("splitRowDict")

        num_alts = len(row_dict[alt_field])

        ### One output row per ALT allele. Fields with fewer values than
        ### alleles (for example AN) repeat their last value.
        row_dicts = [
            {k:v[ min(i, len(v)-1) ] for k,v in row_dict.items()}
            for i in range(num_alts)
        ]

        return row_dicts


    def convertRowStringToRowDict(
            self,
            row_string
        ):
#         print("convertRowStringToRowDict")

        info_column_index = self.info_column_index
        info_delimiter = self.info_delimiter
        info_fields = self.info_fields
        base_fields = self.base_fields

        values = row_string.strip().split("\t")
        info_data = values.pop(info_column_index)

        row_dict = self.expandInfoData(info_data)
        for i in range(len(base_fields)):
            row_dict[base_fields[i]] = values[i].split(",")

        return row_dict


    def writeRowDict(self,row_dict):

        ### Missing values (".") are written as empty strings. Joining with
        ### tabs avoids a trailing tab that the header row does not have.
        values = [row_dict[field] for field in self.all_fields]
        values = ["" if value == "." else value for value in values]
        row_string = "\t".join(values) + "\n"

        self.outfile.write(row_string)


    def expandAndFlatten(self):

#         print("expandAndFlatten")

        info_column_index = self.info_column_index
        info_schema = self.info_schema
        fixed_schema = self.fixed_schema
        infilename = self.input_vcf_file
        outfilename = self.output_vcf_file

        base_fields = self.base_fields
        info_fields = self.info_fields
        all_fields = self.all_fields

        if outfilename is None:
            self.outfile = sys.stdout
        else:
            self.outfile = open(outfilename, "w")

        ### Write header
        dummy = self.outfile.write("\t".join(all_fields) + "\n")

        num_header_lines = self.getNumHeaderLines()

        with open(infilename) as file:
            ### Skip header
            for _ in range(num_header_lines+1):
                dummy = next(file)

            ### Start reading
            for line in file:
#                 print(line)
                row_dict = self.convertRowStringToRowDict(row_string=line)
                row_dicts = self.splitRowDict(row_dict)

                for row_dict in row_dicts:
                    self.writeRowDict(row_dict)

        ### Close the output file, but leave sys.stdout open
        if self.outfile is not sys.stdout:
            self.outfile.close()


    def getSchema(self):
#         print("getSchema")

        if self.output_vcf_file is None:
            self.outfile = sys.stdout
        else:
            self.outfile = open(self.output_vcf_file, "w")

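        self.outfile.write(json.dumps(self.full_schema, indent=2) + "\n")

        ### Close the output file, but leave sys.stdout open
        if self.outfile is not sys.stdout:
            self.outfile.close()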


if __name__ == "__main__":

    import argparse

    parser = argparse.ArgumentParser(
        description = """\
Expand the INFO column in VCF files and print or write the result.

VCF Files have a column called INFO with 'key=value'
pairs separated by ';'.

For example:

<example of INFO column>

Also, when multiple variants are called for a single
genomic position, these alternates are comma-separated
in the VCF file. In these situations, the genomic position
is repeated with the alternate variants in successive rows.
For example:

<example of multiple variants and expanded version>
""",
        formatter_class=argparse.RawTextHelpFormatter
        )

    ### nargs="?" makes the positional optional so the default applies
    parser.add_argument('operation',
        type = str,
        nargs = "?",
        help = """\
 Which operation to perform. To export an expanded and flattened
 vcf file, use "vcf". To export the full schema use "schema". The
 default value is "vcf" if not specified.
 """,
        choices = ["vcf", "schema"],
        default = "vcf"
        )

    parser.add_argument('--input_vcf', '-i',
        required=True,
        type=str,
        help="Input VCF file with INFO column as string with key-value pairs.",
        default = None
        )
    parser.add_argument('--output_vcf', '-o',
        required=False,
        type=str,
        help="Expanded VCF file",
        default=None
        )
    parser.add_argument('--info_column_index', '-x',
        required=False,
        type=int,
        help="""\
 0-indexed index of the INFO column. Default value,
 according to spec, is 7.
 """,
        default = 7
        )
    parser.add_argument('--info_delimiter', '-d',
        required=False,
        type=str,
        help="""\
 Custom separator for INFO key-value pairs in case of some
 weird file. Default value, according to standard, is \";\"
 """,
        default=";"
        )
    parser.add_argument('--fixed_schema', '-b',
        required=False,
        type=str,
        help="""\
The standard VCF format has 7 fixed columns of data followed by the INFO column.
The schema for these first 7 \"base\" columns is not in the header.
This should be a JSON string containing the base schema if it differs
from the default one in this package.
Check `VCF_INFO_EXPANDER.fixed_schema`
""",
        default=json.dumps(fixed_schema)
        )

    args = parser.parse_args()

    operation = args.operation
    input_vcf_file = args.input_vcf
    output_vcf_file = args.output_vcf
    info_delimiter = args.info_delimiter
    info_column_index = args.info_column_index
    ### argparse already supplies the default schema as a JSON string
    fixed_schema = args.fixed_schema


    expander = VCF_INFO_EXPANDER(
        input_vcf_file=input_vcf_file,
        output_vcf_file=output_vcf_file,
        info_column_index=info_column_index,
        info_delimiter=info_delimiter,
        fixed_schema=fixed_schema
    )

    if operation == "vcf":
        expander.expandAndFlatten()
    else:
        expander.getSchema()


## kaviar_100.vcf
##fileformat=VCFv4.1
##fileDate=20160209
##source=bin/makeVCF.pl
##reference=file:///proj/famgen/resources/Kaviar-160204-Public/bin/../tabixedRef/hg19.gz
##version=Kaviar-160204 (hg19)
##kaviar_url=http://db.systemsbiology.org/kaviar
##publication=Glusman G, Caballero J, Mauldin DE, Hood L and Roach J (2011) KAVIAR: an accessible system for testing SNV novelty. Bioinformatics, doi: 10.1093/bioinformatics/btr540
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele Count">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in data sources">
##INFO=<ID=END,Number=.,Type=Integer,Description="End position">
##INFO=<ID=DS,Number=A,Type=String,Description="Data Sources containing allele">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	10001	.	T	C	.	.	AF=0.0000379;AC=1;AN=26378
1	10002	.	A	C,T	.	.	AF=0.0001137,0.0000379;AC=3,1;AN=26378
1	10002	.	A	AT	.	.	AF=0.0000379;AC=1;AN=26378
1	10003	.	A	C,T	.	.	AF=0.0000379,0.0000758;AC=1,2;AN=26378
1	10004	.	C	A	.	.	AF=0.0000379;AC=1;AN=26378
1	10018	.	C	T	.	.	AF=0.0000379;AC=1;AN=26378
1	10019	rs775809821	TA	T	.	.	AF=0.0000379;AC=1;AN=26378;END=10020
1	10055	rs768019142	T	TA	.	.	AF=0.0000379;AC=1;AN=26378
1	10108	rs62651026	C	T	.	.	AF=0.0000758;AC=2;AN=26378
1	10108	.	C	CA,CCT,CT	.	.	AF=0.0000379,0.0018197,0.0003033;AC=1,48,8;AN=26378
1	10109	rs376007522	A	T	.	.	AF=0.0006445;AC=17;AN=26378
1	10114	.	T	C	.	.	AF=0.0000379;AC=1;AN=26378
1	10114	.	T	TA	.	.	AF=0.0007203;AC=19;AN=26378
1	10122	.	A	C	.	.	AF=0.0000379;AC=1;AN=26378
1	10128	rs796688738	A	AC	.	.	AF=0.0000379;AC=1;AN=26378
1	10139	rs368469931	A	T	.	.	AF=0.0000379;AC=1;AN=26378
1	10140	.	A	AC	.	.	AF=0.0003412;AC=9;AN=26378
1	10144	rs144773400	TA	T,TT	.	.	AF=0.0000379,0.0000379;AC=1,1;AN=26378;END=10145
1	10146	rs779258992	AC	AA,A	.	.	AF=0.0002654,0.0020851;AC=7,55;AN=26378;END=10147
1	10150	rs371194064	C	T	.	.	AF=0.0003033;AC=8;AN=26378
1	10153	.	A	AC	.	.	AF=0.0004928;AC=13;AN=26378
1	10165	rs796884232	A	AC	.	.	AF=0.0000379;AC=1;AN=26378
1	10168	.	C	T	.	.	AF=0.0000379;AC=1;AN=26378
1	10174	.	C	T	.	.	AF=0.0000758;AC=2;AN=26378
1	10175	.	T	A	.	.	AF=0.0000758;AC=2;AN=26378
1	10175	.	T	TTA	.	.	AF=0.0001137;AC=3;AN=26378
1	10177	rs201752861	A	C	.	.	AF=0.0010236;AC=27;AN=26378
1	10177	rs367896724	A	AC,AT	.	.	AF=0.0835545,0.0000379;AC=2204,1;AN=26378
1	10179	.	C	CCT	.	.	AF=0.0001516;AC=4;AN=26378
1	10180	rs201694901	T	C	.	.	AF=0.0009098;AC=24;AN=26378
1	10200	.	A	AC	.	.	AF=0.0001516;AC=4;AN=26378
1	10201	.	CCCT	C	.	.	AF=0.0001137;AC=3;AN=26378;END=10204
1	10204	.	TA	T	.	.	AF=0.0001516;AC=4;AN=26378;END=10205
1	10228	rs143255646	TA	T	.	.	AF=0.0000379;AC=1;AN=26378;END=10229
1	10228	rs200462216	TAACCCCTAACCCTAACCCTAAACCCTA	T	.	.	AF=0.0000379;AC=1;AN=26378;END=10255
1	10230	rs200279319	AC	AA,A	.	.	AF=0.0002654,0.0048525;AC=7,128;AN=26378;END=10231
1	10234	rs145599635	C	T	.	.	AF=0.0009098;AC=24;AN=26378
1	10235	.	T	A	.	.	AF=0.0006445;AC=17;AN=26378
1	10235	rs540431307	T	TA	.	.	AF=0.0002275;AC=6;AN=26378
1	10237	.	A	C	.	.	AF=0.0000379;AC=1;AN=26378
1	10240	.	C	CT	.	.	AF=0.0000758;AC=2;AN=26378
1	10241	.	T	TA	.	.	AF=0.0007203;AC=19;AN=26378
1	10243	.	A	AC	.	.	AF=0.0000379;AC=1;AN=26378
1	10247	rs796996180	T	C	.	.	AF=0.0001137;AC=3;AN=26378
1	10247	rs148908337	TA	T,TT	.	.	AF=0.0001516,0.0007961;AC=4,21;AN=26378;END=10248
1	10249	rs774211241	AAC	A	.	.	AF=0.0015164;AC=40;AN=26378;END=10251
1	10250	rs199706086	A	C	.	.	AF=0.0007582;AC=20;AN=26378
1	10254	.	T	C	.	.	AF=0.0000758;AC=2;AN=26378
1	10254	rs140194106	TA	T,TT	.	.	AF=0.0001137,0.0006066;AC=3,16;AN=26378;END=10255
1	10257	rs111200574	A	C	.	.	AF=0.0008719;AC=23;AN=26378
1	10259	rs200940095	C	A	.	.	AF=0.0000379;AC=1;AN=26378
1	10261	.	TA	T	.	.	AF=0.0001137;AC=3;AN=26378;END=10262
1	10268	.	A	C	.	.	AF=0.0000379;AC=1;AN=26378
1	10274	.	A	C	.	.	AF=0.0000379;AC=1;AN=26378
1	10279	.	T	C	.	.	AF=0.0000379;AC=1;AN=26378
1	10280	.	A	C	.	.	AF=0.0000379;AC=1;AN=26378
1	10285	.	T	C	.	.	AF=0.0003791;AC=10;AN=26378
1	10286	.	A	C	.	.	AF=0.0000379;AC=1;AN=26378
1	10291	rs145427775	C	T	.	.	AF=0.0008719;AC=23;AN=26378
1	10297	.	C	T	.	.	AF=0.0003791;AC=10;AN=26378
1	10298	.	A	T	.	.	AF=0.0000379;AC=1;AN=26378
1	10309	.	C	G,T	.	.	AF=0.0000379,0.0000379;AC=1,1;AN=26378
1	10315	.	C	T	.	.	AF=0.0001896;AC=5;AN=26378
1	10321	.	C	T	.	.	AF=0.0004549;AC=12;AN=26378
1	10327	rs112750067	T	C	.	.	AF=0.0005307;AC=14;AN=26378
1	10327	.	TA	T,TT	.	.	AF=0.0000379,0.0000379;AC=1,1;AN=26378;END=10328
1	10328	rs201106462	AACCCCTAACCCTAACCCTAACCCT	A	.	.	AF=0.0000379;AC=1;AN=26378;END=10352
1	10329	rs150969722	AC	AA,A	.	.	AF=0.0002654,0.0007582;AC=7,20;AN=26378;END=10330
1	10333	.	C	T	.	.	AF=0.0000379;AC=1;AN=26378
1	10333	.	CT	C	.	.	AF=0.0000758;AC=2;AN=26378;END=10334
1	10348	.	A	C	.	.	AF=0.0000379;AC=1;AN=26378
1	10351	.	C	T	.	.	AF=0.0000758;AC=2;AN=26378
1	10352	.	T	A	.	.	AF=0.0009098;AC=24;AN=26378
1	10352	rs145072688	T	TA	.	.	AF=0.0871181;AC=2298;AN=26378
1	10353	.	A	AAC	.	.	AF=0.0001137;AC=3;AN=26378
1	10354	.	C	A	.	.	AF=0.0001137;AC=3;AN=26378
1	10357	.	T	C	.	.	AF=0.0000379;AC=1;AN=26378
1	10377	.	A	AC	.	.	AF=0.0000379;AC=1;AN=26378
1	10383	rs147093981	A	AC	.	.	AF=0.0002275;AC=6;AN=26378
1	10389	rs766767872	AC	AA,A	.	.	AF=0.0001516,0.0043218;AC=4,114;AN=26378;END=10390
1	10393	.	C	T	.	.	AF=0.0004170;AC=11;AN=26378
1	10394	.	TA	T	.	.	AF=0.0001516;AC=4;AN=26378;END=10395
1	10396	.	AC	AA,A	.	.	AF=0.0000758,0.0007582;AC=2,20;AN=26378;END=10397
1	10400	.	C	T	.	.	AF=0.0001516;AC=4;AN=26378
1	10401	.	TA	T	.	.	AF=0.0002275;AC=6;AN=26378;END=10402
1	10409	.	A	C	.	.	AF=0.0000379;AC=1;AN=26378
1	10421	.	A	AC	.	.	AF=0.0002275;AC=6;AN=26378
	#!/usr/bin/env python

	import os
	import re
	import sys
	import json
	import textwrap

	fixed_schema = [
	{
	"description": "Chromosome",
	"mode": "NULLABLE",
	"name": "CHROM",
	"type": "STRING"
	},
	{
	"description": "Start position (0-based). Corresponds to the first base of the string of reference bases.",
	"mode": "NULLABLE",
	"name": "POS",
	"type": "INTEGER"
	},
	{
	"description": "dbSNP ID (rs###)",
	"mode": "NULLABLE",
	"name": "ID",
	"type": "STRING"
	},
	{
	"description": "Reference bases.",
	"mode": "NULLABLE",
	"name": "REF",
	"type": "STRING"
	},
	{
	"description": "Alternate bases.",
	"mode": "NULLABLE",
	"name": "ALT",
	"type": "STRING"
	},
	{
	"description": "Phred-scaled quality score (-10log10 prob(call is wrong)). Higher values imply better quality.",
	"mode": "NULLABLE",
	"name": "QUAL",
	"type": "STRING"
	},
	{
	"description": "List of failed filters (if any) or \"PASS\" indicating the variant has passed all filters.",
	"mode": "NULLABLE",
	"name": "FILTER",
	"type": "STRING"
	}
	]


	class VCF_INFO_EXPANDER():

	fixed_schema = [
	{
	"description": "Chromosome",
	"mode": "NULLABLE",
	"name": "CHROM",
	"type": "STRING"
	},
	{
	"description": "Start position (0-based). Corresponds to the first base of the string of reference bases.",
	"mode": "NULLABLE",
	"name": "POS",
	"type": "INTEGER"
	},
	{
	"description": "dbSNP ID (rs###)",
	"mode": "NULLABLE",
	"name": "ID",
	"type": "STRING"
	},
	{
	"description": "Reference bases.",
	"mode": "NULLABLE",
	"name": "REF",
	"type": "STRING"
	},
	{
	"description": "Alternate bases.",
	"mode": "NULLABLE",
	"name": "ALT",
	"type": "STRING"
	},
	{
	"description": "Phred-scaled quality score (-10log10 prob(call is wrong)). Higher values imply better quality.",
	"mode": "NULLABLE",
	"name": "QUAL",
	"type": "STRING"
	},
	{
	"description": "List of failed filters (if any) or \"PASS\" indicating the variant has passed all filters.",
	"mode": "NULLABLE",
	"name": "FILTER",
	"type": "STRING"
	}
	]

	def __init__(
	self,
	input_vcf_file="",
	output_vcf_file=None,
	info_column_index=7,
	info_delimiter = ";",
	fixed_schema=fixed_schema
	):

	# print("__init__")

	self.input_vcf_file = input_vcf_file
	self.output_vcf_file = output_vcf_file
	self.info_column_index = info_column_index
	self.fixed_schema = json.loads(fixed_schema)
	# print(self.fixed_schema)
	self.info_delimiter = info_delimiter

	self.info_schema = self.parseVCF_Schema()
	# print(self.info_schema)

	self.full_schema = self.fixed_schema + self.info_schema

	self.base_fields = [var["name"] for var in self.fixed_schema]
	self.info_fields = [var["name"] for var in self.info_schema]
	self.all_fields = self.base_fields + self.info_fields

	# print(self.info_schema)

	def parseVCF_Schema(self):
	# print("parseVCF_Schema")

	vcf_file = self.input_vcf_file
	info_column_index = self.info_column_index

	info_schema = []

	with open(vcf_file) as vcf:

	for line in vcf:

	### Capture lines that have field info
	### They look like with "##INFO=<k=v,k=v, ...>"
	if bool(re.search("^##INFO", line)):
	info_data = re.sub("^##INFO=<(.*)>", r"\1", line).strip()
	### regex to split by commas, but only outside of quotes
	regex = r",(?=(?:[^\"]\"[^\"]\")[^\"]$)"
	kv_pairs = re.split(regex, info_data)
	info_dict = dict(item.split("=") for item in kv_pairs)

	### Rename necessary fields to match
	### BigQuery schema fields
	### Description --> description
	### Type --> type
	### ID --> name
	### Number --> .=Nullable, 1=Repeatable
	info_dict['description'] = info_dict['Description']
	del info_dict['Description']
	info_dict['type'] = info_dict['Type']
	del info_dict['Type']
	info_dict['name'] = info_dict['ID']
	del info_dict['ID']
	if info_dict['Number'] == '.':
	info_dict['mode'] = 'NULLABLE'
	else:
	info_dict['mode'] = 'NULLABLE'
	del info_dict['Number']

	### Add parsed schema to base schema
	info_schema.append(info_dict )

	### ignore lines that are header lines but not field info
	elif bool(re.search("^##", line)):
	pass

	### done with header, stop reading. Prevents reading through
	### entire file.
	else:
	break

	return info_schema


	def getNumHeaderLines(self):
	# print("getNumHeaderLines")

	filename = self.input_vcf_file
	num_header_lines = 0

	with open(filename) as vcf:
	for line in vcf:

	### capture lines that have field info
	if bool(re.search("^##", line)):
	# print(line)
	num_header_lines += 1

	### done with header, stop reading
	else:
	break
	return num_header_lines


	def expandInfoData(self, info):
	# print("expandInfoData")

	fields = self.info_fields

	kv_pairs = [pair.split("=") for pair in info.split(";")]
	info_dict = {kv[0]:kv[1] for kv in kv_pairs}

	if not fields:
	fields = info_dict.keys()

	info_dict = {k:info_dict.get(k, ".").split(",") for k in fields}

	return info_dict


	def splitRowDict(
	self,
	row_dict,
	alt_sep=",",
	alt_field="ALT"
	):
	# print("splitRowDict")

	num_alts = len(row_dict[alt_field])

	row_dicts = [{}]*num_alts
	for i in range(num_alts):
	row_dicts[i] = {k:v[ min(i, len(v)-1) ] for k,v in row_dict.items()}

	return row_dicts


	def convertRowStringToRowDict(
	self,
	row_string
	):
	# print("convertRowStringToRowDict")

	info_column_index = self.info_column_index
	info_delimiter = self.info_delimiter
	info_fields = self.info_fields
	base_fields = self.base_fields

	values = row_string.strip().split("\t")
	info_data = values.pop(info_column_index)

	row_dict = self.expandInfoData(info_data)
	for i in range(len(base_fields)):
	row_dict[base_fields[i]] = values[i].split(",")

	return row_dict


	def writeRowDict(self,row_dict):

	row_string = ""
	for field in self.all_fields:
	value = row_dict[field]
	row_string += "\t" if value is "." else (value + "\t")
	row_string += "\n"

	# fields = self.all_fields
	# values = [row_dict[field] for field in fields]
	# print(values)
	# values = ["" if value is "." else value for value in [row_dict[field] for field in self.all_fields]]
	# print(values)
	# row_string = "\t".join(values) + "\n"
	# print(row_string)

	self.outfile.write(row_string)


	def expandAndFlatten(self):

	# print("expandAndFlatten")

	info_column_index = self.info_column_index
	info_schema = self.info_schema
	fixed_schema = self.fixed_schema
	infilename = self.input_vcf_file
	outfilename = self.output_vcf_file

	base_fields = self.base_fields
	info_fields = self.info_fields
	all_fields = self.all_fields

	if outfilename is None:
	self.outfile = sys.stdout
	else:
	self.outfile = open(outfilename, "w")

	### Write header
	dummy = self.outfile.write("\t".join(all_fields) + "\n")

	num_header_lines = self.getNumHeaderLines()

	with open(infilename) as file:
	### Skip header
	for _ in range(num_header_lines+1):
	dummy = next(file)

	### Start reading
	for line in file:
	# print(line)
	row_dict = self.convertRowStringToRowDict(row_string=line)
	row_dicts = self.splitRowDict(row_dict)

	for row_dict in row_dicts:
	self.writeRowDict(row_dict)

	dummy = self.outfile.close()


	def getSchema(self):
	# print("getSchema")

	if self.output_vcf_file is None:
	self.outfile = sys.stdout
	else:
	self.outfile = open(self.output_vcf_file, "w")

	dummy = self.outfile.write(json.dumps(self.full_schema, indent=2))


if __name__ == "__main__":

    import argparse

    parser = argparse.ArgumentParser(
        description = """\
        Expand the INFO column in VCF files and print or write the result.

        VCF files have a column called INFO with 'key=value'
        pairs separated by ';'.

        For example:

        <example of INFO column>

        Also, when multiple variants are called for a single
        genomic position, these alternates are comma-separated
        in the VCF file. In the output, the genomic position
        is repeated, with each alternate variant on its own row.
        For example:

        <example of multiple variants and expanded version>
        """,
        formatter_class=argparse.RawTextHelpFormatter
    )

    # nargs="?" makes the positional optional so that the default applies.
    parser.add_argument('operation',
        type = str,
        nargs = "?",
        help = """\
        Which operation to perform. To export an expanded and flattened
        VCF file, use "vcf". To export the full schema, use "schema". The
        default value is "vcf" if not specified.
        """,
        choices = ["vcf", "schema"],
        default = "vcf"
    )

    parser.add_argument('--input_vcf', '-i',
        required=True,
        type=str,
        help="Input VCF file with INFO column as string with key-value pairs.",
        default = None
    )
    parser.add_argument('--output_vcf', '-o',
        required=False,
        type=str,
        help="Output file for the expanded VCF or the schema. Defaults to stdout.",
        default=None
    )
    parser.add_argument('--info_column_index', '-x',
        required=False,
        type=int,
        help="""\
        0-indexed index of the INFO column. Default value,
        according to spec, is 7.
        """,
        default = 7
    )
    parser.add_argument('--info_delimiter', '-d',
        required=False,
        type=str,
        help="""\
        Custom separator for INFO key-value pairs in case of a
        nonstandard file. Default value, according to the standard, is \";\"
        """,
        default=";"
    )
    parser.add_argument('--fixed_schema', '-b',
        required=False,
        type=str,
        help="""\
        The standard VCF format has 7 columns of data and the INFO column.
        The schema for these first 7 \"base\" columns is not in the header.
        This should be a JSON string containing the base schema if it differs
        from the default one in this package.
        Check `VCF_INFO_EXPANDER.fixed_schema`
        """,
        default=json.dumps(fixed_schema)
    )

    args = parser.parse_args()

    operation = args.operation
    # Log to stderr so the message does not end up in output written to stdout.
    print("operation", operation, file=sys.stderr)
    input_vcf_file = args.input_vcf
    # print("input_vcf_file", input_vcf_file)
    output_vcf_file = args.output_vcf
    # print("output_vcf_file", output_vcf_file)
    info_delimiter = args.info_delimiter
    # print("info_delimiter", info_delimiter)
    info_column_index = args.info_column_index
    # print("info_column_index", info_column_index)
    fixed_schema = args.fixed_schema
    if fixed_schema is None:
        fixed_schema = json.dumps(fixed_schema)
    # print("fixed_schema", fixed_schema)


    expander = VCF_INFO_EXPANDER(
        input_vcf_file=input_vcf_file,
        output_vcf_file=output_vcf_file,
        info_column_index=info_column_index,
        info_delimiter=info_delimiter,
        fixed_schema=fixed_schema
    )

    if operation == "vcf":
        expander.expandAndFlatten()
    else:
        expander.getSchema()
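
The sample records that follow include multi-allelic rows such as position 10108 with `ALT=CA,CCT,CT`. As a rough, standalone illustration of the flattening idea (not the script's actual `splitRowDict`, which is not reproduced here), the sketch below splits one record into one row per alternate allele. The function name `flatten_record` and its `per_allele_keys` parameter are hypothetical. The sketch assumes tab-delimited input and that every per-allele INFO key (`Number=A` in the header: `AF`, `AC`, `DS`) carries exactly one value per alternate.

```python
def flatten_record(line, per_allele_keys=("AF", "AC", "DS")):
    """Split one VCF data line into one dict per ALT allele (illustrative sketch)."""
    # The first eight tab-separated columns are the VCF fixed fields.
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]

    # Parse INFO into a dict; flag-style keys without "=" get an empty string.
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value

    rows = []
    for i, allele in enumerate(alt.split(",")):
        row = {"CHROM": chrom, "POS": pos, "ID": vid, "REF": ref,
               "ALT": allele, "QUAL": qual, "FILTER": flt}
        for key, value in info_dict.items():
            if key in per_allele_keys:
                # Assumption: one comma-separated value per ALT allele.
                row[key] = value.split(",")[i]
            else:
                # Site-level values (for example AN) are repeated on every row.
                row[key] = value
        rows.append(row)
    return rows


if __name__ == "__main__":
    example = ("1\t10108\t.\tC\tCA,CCT,CT\t.\t."
               "\tAF=0.0000379,0.0018197,0.0003033;AC=1,48,8;AN=26378")
    for flat_row in flatten_record(example):
        print(flat_row)
```

Each returned row keeps the shared coordinate and site-level fields and picks the matching per-allele value by position, which is the behavior a flat database table needs.
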
	##fileformat=VCFv4.1
	##fileDate=20160209
	##source=bin/makeVCF.pl
	##reference=file:///proj/famgen/resources/Kaviar-160204-Public/bin/../tabixedRef/hg19.gz
	##version=Kaviar-160204 (hg19)
	##kaviar_url=http://db.systemsbiology.org/kaviar
	##publication=Glusman G, Caballero J, Mauldin DE, Hood L and Roach J (2011) KAVIAR: an accessible system for testing SNV novelty. Bioinformatics, doi: 10.1093/bioinformatics/btr540
	##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
	##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele Count">
	##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in data sources">
	##INFO=<ID=END,Number=.,Type=Integer,Description="End position">
	##INFO=<ID=DS,Number=A,Type=String,Description="Data Sources containing allele">
	#CHROM POS ID REF ALT QUAL FILTER INFO
	1 10001 . T C . . AF=0.0000379;AC=1;AN=26378
	1 10002 . A C,T . . AF=0.0001137,0.0000379;AC=3,1;AN=26378
	1 10002 . A AT . . AF=0.0000379;AC=1;AN=26378
	1 10003 . A C,T . . AF=0.0000379,0.0000758;AC=1,2;AN=26378
	1 10004 . C A . . AF=0.0000379;AC=1;AN=26378
	1 10018 . C T . . AF=0.0000379;AC=1;AN=26378
	1 10019 rs775809821 TA T . . AF=0.0000379;AC=1;AN=26378;END=10020
	1 10055 rs768019142 T TA . . AF=0.0000379;AC=1;AN=26378
	1 10108 rs62651026 C T . . AF=0.0000758;AC=2;AN=26378
	1 10108 . C CA,CCT,CT . . AF=0.0000379,0.0018197,0.0003033;AC=1,48,8;AN=26378
	1 10109 rs376007522 A T . . AF=0.0006445;AC=17;AN=26378
	1 10114 . T C . . AF=0.0000379;AC=1;AN=26378
	1 10114 . T TA . . AF=0.0007203;AC=19;AN=26378
	1 10122 . A C . . AF=0.0000379;AC=1;AN=26378
	1 10128 rs796688738 A AC . . AF=0.0000379;AC=1;AN=26378
	1 10139 rs368469931 A T . . AF=0.0000379;AC=1;AN=26378
	1 10140 . A AC . . AF=0.0003412;AC=9;AN=26378
	1 10144 rs144773400 TA T,TT . . AF=0.0000379,0.0000379;AC=1,1;AN=26378;END=10145
	1 10146 rs779258992 AC AA,A . . AF=0.0002654,0.0020851;AC=7,55;AN=26378;END=10147
	1 10150 rs371194064 C T . . AF=0.0003033;AC=8;AN=26378
	1 10153 . A AC . . AF=0.0004928;AC=13;AN=26378
	1 10165 rs796884232 A AC . . AF=0.0000379;AC=1;AN=26378
	1 10168 . C T . . AF=0.0000379;AC=1;AN=26378
	1 10174 . C T . . AF=0.0000758;AC=2;AN=26378
	1 10175 . T A . . AF=0.0000758;AC=2;AN=26378
	1 10175 . T TTA . . AF=0.0001137;AC=3;AN=26378
	1 10177 rs201752861 A C . . AF=0.0010236;AC=27;AN=26378
	1 10177 rs367896724 A AC,AT . . AF=0.0835545,0.0000379;AC=2204,1;AN=26378
	1 10179 . C CCT . . AF=0.0001516;AC=4;AN=26378
	1 10180 rs201694901 T C . . AF=0.0009098;AC=24;AN=26378
	1 10200 . A AC . . AF=0.0001516;AC=4;AN=26378
	1 10201 . CCCT C . . AF=0.0001137;AC=3;AN=26378;END=10204
	1 10204 . TA T . . AF=0.0001516;AC=4;AN=26378;END=10205
	1 10228 rs143255646 TA T . . AF=0.0000379;AC=1;AN=26378;END=10229
	1 10228 rs200462216 TAACCCCTAACCCTAACCCTAAACCCTA T . . AF=0.0000379;AC=1;AN=26378;END=10255
	1 10230 rs200279319 AC AA,A . . AF=0.0002654,0.0048525;AC=7,128;AN=26378;END=10231
	1 10234 rs145599635 C T . . AF=0.0009098;AC=24;AN=26378
	1 10235 . T A . . AF=0.0006445;AC=17;AN=26378
	1 10235 rs540431307 T TA . . AF=0.0002275;AC=6;AN=26378
	1 10237 . A C . . AF=0.0000379;AC=1;AN=26378
	1 10240 . C CT . . AF=0.0000758;AC=2;AN=26378
	1 10241 . T TA . . AF=0.0007203;AC=19;AN=26378
	1 10243 . A AC . . AF=0.0000379;AC=1;AN=26378
	1 10247 rs796996180 T C . . AF=0.0001137;AC=3;AN=26378
	1 10247 rs148908337 TA T,TT . . AF=0.0001516,0.0007961;AC=4,21;AN=26378;END=10248
	1 10249 rs774211241 AAC A . . AF=0.0015164;AC=40;AN=26378;END=10251
	1 10250 rs199706086 A C . . AF=0.0007582;AC=20;AN=26378
	1 10254 . T C . . AF=0.0000758;AC=2;AN=26378
	1 10254 rs140194106 TA T,TT . . AF=0.0001137,0.0006066;AC=3,16;AN=26378;END=10255
	1 10257 rs111200574 A C . . AF=0.0008719;AC=23;AN=26378
	1 10259 rs200940095 C A . . AF=0.0000379;AC=1;AN=26378
	1 10261 . TA T . . AF=0.0001137;AC=3;AN=26378;END=10262
	1 10268 . A C . . AF=0.0000379;AC=1;AN=26378
	1 10274 . A C . . AF=0.0000379;AC=1;AN=26378
	1 10279 . T C . . AF=0.0000379;AC=1;AN=26378
	1 10280 . A C . . AF=0.0000379;AC=1;AN=26378
	1 10285 . T C . . AF=0.0003791;AC=10;AN=26378
	1 10286 . A C . . AF=0.0000379;AC=1;AN=26378
	1 10291 rs145427775 C T . . AF=0.0008719;AC=23;AN=26378
	1 10297 . C T . . AF=0.0003791;AC=10;AN=26378
	1 10298 . A T . . AF=0.0000379;AC=1;AN=26378
	1 10309 . C G,T . . AF=0.0000379,0.0000379;AC=1,1;AN=26378
	1 10315 . C T . . AF=0.0001896;AC=5;AN=26378
	1 10321 . C T . . AF=0.0004549;AC=12;AN=26378
	1 10327 rs112750067 T C . . AF=0.0005307;AC=14;AN=26378
	1 10327 . TA T,TT . . AF=0.0000379,0.0000379;AC=1,1;AN=26378;END=10328
	1 10328 rs201106462 AACCCCTAACCCTAACCCTAACCCT A . . AF=0.0000379;AC=1;AN=26378;END=10352
	1 10329 rs150969722 AC AA,A . . AF=0.0002654,0.0007582;AC=7,20;AN=26378;END=10330
	1 10333 . C T . . AF=0.0000379;AC=1;AN=26378
	1 10333 . CT C . . AF=0.0000758;AC=2;AN=26378;END=10334
	1 10348 . A C . . AF=0.0000379;AC=1;AN=26378
	1 10351 . C T . . AF=0.0000758;AC=2;AN=26378
	1 10352 . T A . . AF=0.0009098;AC=24;AN=26378
	1 10352 rs145072688 T TA . . AF=0.0871181;AC=2298;AN=26378
	1 10353 . A AAC . . AF=0.0001137;AC=3;AN=26378
	1 10354 . C A . . AF=0.0001137;AC=3;AN=26378
	1 10357 . T C . . AF=0.0000379;AC=1;AN=26378
	1 10377 . A AC . . AF=0.0000379;AC=1;AN=26378
	1 10383 rs147093981 A AC . . AF=0.0002275;AC=6;AN=26378
	1 10389 rs766767872 AC AA,A . . AF=0.0001516,0.0043218;AC=4,114;AN=26378;END=10390
	1 10393 . C T . . AF=0.0004170;AC=11;AN=26378
	1 10394 . TA T . . AF=0.0001516;AC=4;AN=26378;END=10395
	1 10396 . AC AA,A . . AF=0.0000758,0.0007582;AC=2,20;AN=26378;END=10397
	1 10400 . C T . . AF=0.0001516;AC=4;AN=26378
	1 10401 . TA T . . AF=0.0002275;AC=6;AN=26378;END=10402
	1 10409 . A C . . AF=0.0000379;AC=1;AN=26378
	1 10421 . A AC . . AF=0.0002275;AC=6;AN=26378