The VCF specification
- always required
- must be the first line in the file
- details VCF format version number
- e.g. ##fileformat=VCFv4.2
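As a minimal sketch of the rule above, a reader could check the first line before processing anything else. The function name `vcf_version` is hypothetical and not part of any library.

```python
def vcf_version(first_line: str) -> str:
    """Return the version string from a ##fileformat line, e.g. 'VCFv4.2'."""
    prefix = "##fileformat="
    if not first_line.startswith(prefix):
        # The fileformat line is always required and must come first.
        raise ValueError("first line must be a ##fileformat line")
    return first_line[len(prefix):].strip()
```

For example, `vcf_version("##fileformat=VCFv4.2\n")` yields the string `VCFv4.2`.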
- template: ##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="source",Version="version">
- e.g. ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
- Possible Types for INFO fields
- Integer
- Float
- Flag
- Character
- String
- Number entry is an Integer that describes the number of values that can be included with the INFO field
- 1 if the INFO field contains a single number, 2 if the field describes a pair of numbers, and so on
- special characters for special cases
- 'A' if the field has one value per alternate allele
- 'R' if the field has one value for each possible allele (including the reference)
- 'G' if the field has one value for each possible genotype (more relevant to the FORMAT tags)
- '.' if the number of possible values varies, is unknown, or is unbounded
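The Number codes above can be turned into an expected value count. The sketch below is illustrative only: the helper name `expected_value_count` and the default ploidy of 2 are assumptions, and the 'G' case counts unordered genotypes as multisets of alleles.

```python
from math import comb


def expected_value_count(number: str, n_alt: int, ploidy: int = 2):
    """Return how many values a field should carry, or None when unbounded."""
    if number == ".":
        return None  # varies, unknown, or unbounded
    if number == "A":
        return n_alt  # one value per alternate allele
    if number == "R":
        return n_alt + 1  # one value per allele, reference included
    if number == "G":
        # Unordered genotypes: multisets of size `ploidy` drawn from all alleles.
        n_alleles = n_alt + 1
        return comb(n_alleles + ploidy - 1, ploidy)
    return int(number)  # a plain Integer such as "1" or "2"
```

With one ALT allele and diploid samples, 'G' gives 3 values, matching the three genotypes of a biallelic site.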
- filters that have been applied to the data
- template: ##FILTER=<ID=ID,Description="description">
- e.g. ##FILTER=<ID=q10,Description="Quality below 10">
- template: ##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
- e.g. ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
- Possible Types for FORMAT fields are
- Integer
- Float
- Character
- String
- Symbolic alternate alleles for imprecise structural variants
- can be a colon-separated list of types and subtypes
- ID values are case sensitive strings and may not contain whitespace or angle brackets
- template: ##ALT=<ID=type,Description=description>
- The first level type must be one of the following
- DEL
- Deletion relative to the reference
- INS
- Insertion of novel sequence relative to the reference
- DUP
- Region of elevated copy number relative to the reference
- INV
- Inversion of reference sequence
- CNV
- Copy number variation region (may be both deletion and duplication)
- CNV category should not be used when a more specific category can be applied
- Reserved subtypes include
- DUP:TANDEM
- Tandem duplication
- DEL:ME
- Deletion of mobile element relative to the reference
- INS:ME
- Insertion of a mobile element relative to the reference
- For all of the ##INFO, ##FORMAT, ##FILTER, and ##ALT metainformation, extra fields can be included after the default fields
- e.g. ##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="description",Version="128">
- Optional fields should be stored as strings even for numeric values
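The structured meta-information lines described above share one shape, `##KEY=<k=v,...>`, so a single parser can cover ##INFO, ##FORMAT, ##FILTER, and ##ALT. This is a rough sketch, not a full implementation: `parse_meta_line` is a hypothetical name, and every value is kept as a string, in line with the note that optional fields are stored as strings.

```python
import re

# Matches "##KEY=<body>" and captures the key and the body between angle brackets.
_META = re.compile(r"^##(?P<key>[^=]+)=<(?P<body>.*)>$")
# Matches "name=value" pairs; a value is either a double-quoted string or a bare token.
_PAIR = re.compile(r'(?P<k>[^=,]+)=(?P<v>"(?:[^"\\]|\\.)*"|[^,]*)')


def parse_meta_line(line: str):
    """Split a structured meta line into (key, dict of string fields)."""
    match = _META.match(line.strip())
    if match is None:
        raise ValueError("not a structured meta-information line")
    fields = {}
    for pair in _PAIR.finditer(match.group("body")):
        value = pair.group("v")
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]  # drop the surrounding quotes
        fields[pair.group("k")] = value
    return match.group("key"), fields
```

Quoted values are matched first so that commas inside a Description do not split the field. Unstructured lines such as `##fileformat=VCFv4.2` are rejected by this sketch.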
- Breakpoint assemblies for structural variations may use an external file
- template: ##assembly=url
- The URL field specifies the location of a fasta file containing breakpoint assemblies referenced in the VCF records for structural variants via the BKPTID INFO key
- highly recommended but not required
- The contigs referred to in the VCF file
- Allowing these contigs to come from different files
- e.g. ##contig=<ID=ctg1,URL=ftp://somewhere.org/assembly.fa,...>
- To define sample to genome mappings
- e.g. ##SAMPLE=<ID=S_ID,Genomes=G1_ID;G2_ID; ...;GK_ID,Mixture=N1;N2; ...;NK,Description=S1;S2; ...;SK>
- Pedigree field format
- To record relationships between genomes
- e.g. ##PEDIGREE=<Name_0=G0-ID,Name_1=G1-ID,...,Name_N=GN-ID>
- Or a link to a database
- e.g. ##pedigreeDB=<url>
- 8 fixed, mandatory columns
- #CHROM
- POS
- ID
- REF
- ALT
- QUAL
- FILTER
- INFO
- tab-delimited
- If genotype data is present in the file, these are followed by a FORMAT column header, then an arbitrary number of sample IDs
- e.g. #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
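The header line layout above can be checked mechanically. The following sketch assumes a tab-delimited line and uses a hypothetical helper name, `parse_header_line`, that returns the sample IDs.

```python
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_header_line(line: str):
    """Validate the column header line and return the list of sample IDs."""
    columns = line.rstrip("\n").split("\t")
    if columns[:8] != FIXED_COLUMNS:
        raise ValueError("the 8 fixed columns are missing or out of order")
    if len(columns) == 8:
        return []  # no genotype data in this file
    if columns[8] != "FORMAT":
        raise ValueError("sample columns must be preceded by FORMAT")
    return columns[9:]
```

A file without genotype data yields an empty list; otherwise the result preserves the order of the sample columns.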
- CHROM
- chromosome
- String, no white-space permitted, required
- An identifier from the reference genome or an angle-bracketed ID String pointing to a contig in the assembly file
- cf. the ##assembly line in the header
- All entries for a specific CHROM should form a contiguous block within the VCF file
- The colon symbol (:) must be absent from all chromosome names
- POS
- position
- Integer, required
- Positions are sorted numerically, in increasing order, within each reference sequence CHROM
- Having multiple records with the same POS is permitted
- Telomeres are indicated by using positions 0 or N+1 where N is the length of the corresponding chromosome or contig
- ID
- identifier
- String, no white-space or semi-colons permitted
- Semi-colon separated list of unique identifiers where available
- encouraged to use the rs number(s) if this is a dbSNP variant
- No identifier should be present in more than one data record
- missing value should be used if there is no identifier available
- REF
- reference base(s)
- String, required
- Each base must be one of A,C,G,T,N (case insensitive)
- Multiple bases are permitted
- The value in the POS field refers to the position of the first base in the String
- For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty
- unless the event occurs at position 1 on the contigs
- The REF and ALT Strings must include the base 'before' the event
- must be reflected in the POS field
- else
- It must include the base 'after' the event
- This padding base is not required
- although permitted
- e.g. complex substitutions or other events where all alleles have at least one base represented in their String
- unless the event occurs at position 1 on the contigs
- If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String "<ID>")
- The padding base is required
- POS denotes the coordinate of the base preceding the polymorphism
- Tools processing VCF files are not required to preserve case in the allele String
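To make the padding-base rule for REF and ALT concrete, here is a small sketch that builds POS, REF, and ALT for a simple deletion. The function `deletion_record` and its arguments are hypothetical; it assumes the reference is available as a plain string, that positions are 1-based, and that a base exists after the deleted run when the event starts at position 1.

```python
def deletion_record(reference: str, start: int, length: int):
    """Return (POS, REF, ALT) for deleting `length` bases starting at 1-based `start`."""
    deleted = reference[start - 1:start - 1 + length]
    if start > 1:
        # Usual case: include the base before the event, and shift POS to it.
        pad = reference[start - 2]
        return start - 1, pad + deleted, pad
    # Event at position 1: there is no preceding base, so pad with the base after.
    pad = reference[length]
    return 1, deleted + pad, pad
```

Deleting `GT` at position 3 of `ACGTACGT` gives POS 2 with REF `CGT` and ALT `C`, while deleting `AC` at position 1 gives POS 1 with REF `ACG` and ALT `G`.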
- ALT
- alternate base(s)
- String; no whitespace, commas, or angle-brackets are permitted in the ID String itself
- Comma separated list of alternate non-reference alleles called on at least one of the samples
- A,C,G,T,N,* (case insensitive) or an angle-bracketed ID String ("<ID>")
- or a breakend replacement string as described in the section on breakends
- The '*' allele is reserved to indicate that the allele is missing due to an upstream deletion
- If there are no alternative alleles
- the missing value should be used
- Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive
- QUAL
- quality
- Numeric
- Phred-scaled quality score for the assertion made in ALT
- FILTER
- filter status
- String, no white-space or semi-colons permitted
- PASS if this position has passed all filters
- i.e. a call is made at this position
- If the site has not passed all filters
- a semicolon-separated list of codes for filters that fail
- e.g. "q10;s50" might indicate that at this site the quality is below 10 and the number of samples with data is below 50% of the total number of samples
- '0' is reserved and should not be used as a filter String
- If filters have not been applied
- This field should be set to the missing value
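The three FILTER states above (passed, failed with codes, not applied) can be distinguished as follows. This is only a sketch: `parse_filter` is a hypothetical name, and the choice of `None` versus an empty list as return values is an assumption made for illustration.

```python
def parse_filter(value: str):
    """Return None if filters were not applied, [] for PASS, else the failed codes."""
    if value == ".":
        return None  # missing value: filters have not been applied
    if value == "PASS":
        return []  # passed all filters
    return value.split(";")  # semicolon-separated codes of failed filters
```

For instance, `"q10;s50"` is split into the two filter codes `q10` and `s50`.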
- INFO
- additional information
- String, no white-space, semicolons, or equals-signs permitted
- commas are permitted only as delimiters for lists of values
- Encoded as a semicolon-separated series of short keys with optional values
- format: <key>=<data>[,data]
- Arbitrary keys are permitted
- reserved subfields
- AA
- ancestral allele
- AC
- allele count in genotype, for each ALT allele, in the same order as listed
- AF
- allele frequency for each ALT allele in the same order as listed
- use this when estimated from primary data, not called genotypes
- AN
- total number of alleles in called genotypes
- BQ
- RMS base quality at this position
- CIGAR
- cigar string describing how to align an alternate allele to the reference allele
- DB
- dbSNP membership
- DP
- combined depth across samples, e.g. DP=154
- END
- end position of the variant described in this record
- for use with symbolic alleles
- H2
- membership in hapmap2
- H3
- membership in hapmap3
- MQ
- RMS mapping quality, e.g. MQ=52
- MQ0
- Number of MAPQ == 0 reads covering this record
- NS
- Number of samples with data
- SB
- strand bias at this position
- SOMATIC
- indicates that the record is a somatic mutation
- for cancer genomics
- VALIDATED
- validated by follow-up experiment
- 1000G
- membership in 1000 Genomes
- The exact format of each INFO sub-field should be specified in the meta-information
- e.g. DP=154;MQ=52;H2 for an INFO field
- Keys without corresponding values are allowed in order to indicate group membership
- e.g. H2 indicates the SNP is found in HapMap 2
- Not necessary to list all the properties that a site does NOT have
- e.g. H2=0
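An INFO string such as the `DP=154;MQ=52;H2` example can be split into keys and values along these lines. The helper name `parse_info` is hypothetical; values are left as lists of strings because their real types come from the meta-information, and treating a lone `.` as an empty INFO field is an assumption of this sketch.

```python
def parse_info(value: str):
    """Split an INFO field into a dict; flag-style keys map to True."""
    if value == ".":
        return {}  # assumption: '.' means no INFO entries
    info = {}
    for item in value.split(";"):
        key, sep, data = item.partition("=")
        # A key without '=' marks membership (e.g. H2); otherwise split the value list.
        info[key] = data.split(",") if sep else True
    return info
```

Here `H2` maps to `True`, while `DP` and `MQ` each map to a one-element list of strings.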
- If genotype information is present
- The same type of data must be present for all samples
- FORMAT field is given specifying the data types and order
- colon-separated alphanumeric String
- FORMAT field is followed by one field per sample corresponding to the types specified in the format
- colon-separated
- The first sub-field must always be the genotype (GT) if it is present
- No required sub-fields
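Since the FORMAT column and each sample column are both colon-separated and share the same order, pairing them is a simple zip. The sketch below uses a hypothetical helper, `parse_sample`, and keeps every value as a raw string.

```python
def parse_sample(format_field: str, sample_field: str):
    """Pair FORMAT keys with the colon-separated values of one sample column."""
    keys = format_field.split(":")
    if "GT" in keys and keys[0] != "GT":
        # When present, the genotype must be the first sub-field.
        raise ValueError("GT must be the first FORMAT sub-field")
    values = sample_field.split(":")
    return dict(zip(keys, values))
```

Applied to FORMAT `GT:GL` and sample `0/1:-323.03,-99.29,-802.53`, this yields one entry for `GT` and one for `GL`.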
- GT
- genotype
- encoded as allele values separated by either of / or |
- The allele values are
- 0 for the reference allele (what is in the REF field)
- 1 for the first allele listed in ALT
- 2 for the second allele listed in ALT
- and so on
- For haploid calls
- e.g. on Y, male non-pseudoautosomal X, or mitochondrion
- only one allele value should be given
- For a triploid call
- might look like: 0/0/1
- If a call cannot be made for a sample at a given locus
- '.' should be specified for each missing allele in the GT field
- e.g. './.' for a diploid genotype and '.' for haploid genotype
- The meanings of separators
- /
- genotype unphased
- |
- genotype phased
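The GT encoding described above (allele indices, `/` or `|` separators, `.` for missing alleles) can be decoded roughly as follows. `parse_gt` is a hypothetical name, and representing a missing allele as `None` is an assumption of this sketch.

```python
import re


def parse_gt(gt: str):
    """Return (allele indices, phased flag); missing alleles become None."""
    phased = "|" in gt
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    return alleles, phased
```

The length of the returned allele list reflects the ploidy of the call, so a haploid `1` gives one entry and a triploid `0/0/1` gives three.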
- DP
- read depth at this position for this sample
- FT
- sample genotype filter indicating if this genotype was "called"
- similar in concept to the FILTER field
- use PASS to indicate that all filters have been passed
- a semi-colon separated list of codes for filters that fail
- '.' to indicate that filters have not been applied
- should be described in the meta-information in the same way as FILTERs
- GL
- genotype likelihoods
- comprised of comma separated floating point log10-scaled likelihoods
- for all possible genotypes given the set of alleles defined in the REF and ALT field
- In presence of the GT field
- the same ploidy is expected
- the canonical order is used
- Without GT field
- diploidy is assumed
- if A is the allele in REF and B,C, ... are the alleles as ordered in ALT
- the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j.
- For biallelic sites
- the ordering is: AA, AB, BB
- For triallelic sites
- the ordering is: AA,AB,BB,AC,BC,CC, etc.
- e.g. GT:GL 0/1:-323.03,-99.29,-802.53
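The diploid ordering formula F(j/k) = (k*(k+1)/2)+j can be checked with a few lines of code. The function names `gl_index` and `genotype_order` are hypothetical; alleles are numbered from 0 (REF) upward, as in the GT field.

```python
def gl_index(j: int, k: int) -> int:
    """Index of diploid genotype j/k in the GL list, using F(j/k) = k*(k+1)/2 + j."""
    if j > k:
        j, k = k, j  # the formula expects j <= k
    return k * (k + 1) // 2 + j


def genotype_order(alleles: str):
    """List diploid genotypes in GL order for single-letter allele labels."""
    n = len(alleles)
    return [alleles[j] + alleles[k] for k in range(n) for j in range(k + 1)]
```

Iterating k in the outer loop and j in the inner loop reproduces the orderings quoted above: AA, AB, BB for two alleles and AA, AB, BB, AC, BC, CC for three.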
- GLE
- genotype likelihoods of heterogeneous ploidy
- used in presence of uncertain copy number
- e.g. GLE=0:-75.22,1:-223.42,0/0:-323.03,1/0:-99.29,1/1:-802.53
- PL
- the phred-scaled genotype likelihoods rounded to the closest integer
- otherwise defined precisely as the GL field
- GP
- the phred-scaled genotype posterior probabilities
- otherwise defined precisely as the GL field
- intended to store imputed genotype probabilities
- GQ
- conditional genotype quality
- encoded as a phred quality
- -10log10 p(genotype call is wrong, conditioned on the site's being variant)
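The phred encoding used for GQ is the transform Q = -10 * log10(p), where p is the error probability. A short sketch of both directions follows; the function names are hypothetical.

```python
import math


def prob_to_phred(p: float) -> float:
    """Convert an error probability into a phred-scaled quality."""
    return -10 * math.log10(p)


def phred_to_prob(q: float) -> float:
    """Convert a phred-scaled quality back into an error probability."""
    return 10 ** (-q / 10)
```

Under this scale, a quality of 10 corresponds to an error probability of about 0.1 and a quality of 30 to about 0.001.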
- HQ
- haplotype qualities
- two comma separated phred qualities
- PS
- phase set
- A phase set is defined as a set of phased genotypes to which this genotype belongs
- Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set
- A phase set specifies multi-marker haplotypes for the phased genotypes in the set
- All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set
- If the genotype in the GT field is unphased
- the corresponding PS field is ignored
- The recommended convention is to use the position of the first variant in the set as the PS identifier
- not required
- PQ
- phasing quality
- the phred-scaled probability that alleles are ordered incorrectly in a heterozygote
- against all other members in the phase set
- the specific measure for precisely defining "phasing quality" is not included
- the PQ tag is just reserved for future use as a measure of phasing quality
- EC
- comma separated list of expected alternate allele counts for each alternate allele in the same order as listed in the ALT field
- typically used in association analyses
- MQ
- RMS mapping quality
- similar to the version in the INFO field
- GT
- encoded as allele values separated by either of / or |
- The allele values are
- 0 for the reference allele (what is in the REF field)
- 1 for the first allele listed in ALT
- 2 for the second allele listed in ALT
- The meanings of separators
- /
- genotype unphased
- |
- genotype phased
- DP
- Integer
- FT
- String, no white-space or semi-colons permitted
- GL
- Floats
- GLE
- String
- PL
- Integers
- GP
- Floats
- GQ
- Integer
- HQ
- Integers
- PS
- Non-negative 32-bit Integer
- PQ
- Integer
- EC
- Integer
- MQ
- Integer