inutano/vcf_v4.2_specification.md

## vcf_v4.2_specification.md

      
    The VCF specification
Meta-information lines

File format


always required
must be the first line in the file
details VCF format version number

e.g. ##fileformat=VCFv4.2


Information field format


template: ##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="source",Version="version">

e.g. ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">


Possible Types for INFO fields

Integer
Float
Flag
Character
String


Number entry is an Integer that describes the number of values that can be included with the INFO field

1 if the INFO field contains a single number, 2 if the field describes a pair of numbers, and so on
special characters for special cases

'A' if the field has one value per alternate allele
'R' if the field has one value for each possible allele (including the reference)
'G' if the field has one value for each possible genotype (more relevant to the FORMAT tags)
'.' if the number of possible values varies, is unknown, or is unbounded
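The Number codes above can be turned into an expected value count. The following is a minimal sketch, not part of the specification: the function name `expected_value_count` is hypothetical, and the general-ploidy formula for 'G' (combinations with repetition) is an assumption that reduces to the usual diploid count.

```python
from math import comb


def expected_value_count(number, n_alt, ploidy=2):
    """Return how many values a field should carry, or None if unbounded.

    number: the Number entry from the meta-information line, as a string.
    n_alt:  how many ALT alleles the record lists.
    ploidy: assumed ploidy, only used for 'G' (hypothetical parameter).
    """
    if number == ".":
        return None  # varies, unknown, or unbounded
    if number == "A":
        return n_alt  # one value per alternate allele
    if number == "R":
        return n_alt + 1  # one value per allele, reference included
    if number == "G":
        # Unordered genotypes: (n_alt + 1) alleles chosen `ploidy` times
        # with repetition allowed.
        return comb(n_alt + ploidy, ploidy)
    return int(number)  # a plain Integer such as "1" or "2"
```

For a diploid site with two ALT alleles, 'A' gives 2 values, 'R' gives 3, and 'G' gives 6.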


Filter field format


filters that have been applied to the data
template: ##FILTER=<ID=ID,Description="description">
e.g. ##FILTER=<ID=q10,Description="Quality below 10">

Individual format field format


template: ##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
e.g. ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
Possible Types for FORMAT fields are

Integer
Float
Character
String
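The INFO, FILTER, and FORMAT templates share the same `##KEY=<name=value,...>` shape, so one small parser can read all of them. This is a sketch under assumptions: `parse_meta_line` is a hypothetical helper, it treats double-quoted values as opaque (so commas inside a Description survive), and it does no validation of the Type or Number entries.

```python
import re

# ##KEY=<...> with everything between the angle brackets captured as the body.
META_RE = re.compile(r'^##(?P<key>[^=]+)=<(?P<body>.*)>$')
# name=value pairs; a value is either a double-quoted string or runs to the next comma.
PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_meta_line(line):
    """Return (key, fields) for a structured meta line, or None otherwise."""
    match = META_RE.match(line.strip())
    if match is None:
        return None  # e.g. ##fileformat=VCFv4.2 has no <...> body
    fields = {}
    for name, value in PAIR_RE.findall(match.group("body")):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # strip the surrounding quotes
        fields[name] = value
    return match.group("key"), fields
```

Applied to the GT example above, this returns the key `FORMAT` and a dictionary with the entries ID, Number, Type, and Description, all stored as strings.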


Alternative allele field format


Symbolic alternate alleles for imprecise structural variants
can be a colon-separated list of types and subtypes
ID values are case sensitive strings and may not contain whitespace or angle brackets
template: ##ALT=<ID=type,Description=description>
The first level type must be one of the following

DEL

Deletion relative to the reference


INS

Insertion of novel sequence relative to the reference


DUP

Region of elevated copy number relative to the reference


INV

Inversion of reference sequence


CNV

Copy number variation region (may be both deletion and duplication)
CNV category should not be used when a more specific category can be applied
Reserved subtypes include

DUP:TANDEM

Tandem duplication


DEL:ME

Deletion of mobile element relative to the reference


INS:ME

Insertion of a mobile element relative to the reference


For all of the ##INFO, ##FORMAT, ##FILTER, and ##ALT metainformation, extra fields can be included after the default fields

e.g. ##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="description",Version="128">
Optional fields should be stored as strings even for numeric values


Assembly field format


Breakpoint assemblies for structural variations may use an external file
template: ##assembly=url
The URL field specifies the location of a fasta file containing breakpoint assemblies referenced in the VCF records for structural variants via the BKPTID INFO key

Contig field format


highly recommended but not required
The contigs referred to in the VCF file
Allowing these contigs to come from different files
e.g. ##contig=<ID=ctg1,URL=ftp://somewhere.org/assembly.fa,...>

Sample field format


To define sample to genome mappings
e.g. ##SAMPLE=<ID=S_ID,Genomes=G1_ID;G2_ID; ...;GK_ID,Mixture=N1;N2; ...;NK,Description=S1;S2; ...;SK>

Pedigree field format


To record relationships between genomes

e.g. ##PEDIGREE=<Name_0=G0-ID,Name_1=G1-ID,...,Name_N=GN-ID>


Or a link to a database

e.g. ##pedigreeDB=<url>


Header line syntax


8 fixed, mandatory columns

#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO


tab-delimited
If genotype data is present in the file, these are followed by a FORMAT column header, then an arbitrary number of sample IDs

e.g. #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
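A short sketch of how the header line might be checked and the sample IDs extracted. The function name `parse_header_line` and its error messages are illustrative assumptions, not something the specification defines.

```python
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_header_line(line):
    """Return the list of sample IDs named in a tab-delimited header line."""
    columns = line.rstrip("\n").split("\t")
    if columns[:8] != FIXED_COLUMNS:
        raise ValueError("header must start with the 8 fixed columns")
    if len(columns) == 8:
        return []  # no genotype data in this file
    if columns[8] != "FORMAT":
        raise ValueError("ninth column must be FORMAT when genotypes are present")
    return columns[9:]
```

For the example header above, the result is the three sample IDs NA00001, NA00002, and NA00003.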


Data lines

Fixed fields


CHROM

chromosome
String, no white-space permitted, required
An identifier from the reference genome or an angle-bracketed ID String pointing to a contig in the assembly file

cf. the ##assembly line in the header


All entries for a specific CHROM should form a contiguous block within the VCF file
The colon symbol (:) must be absent from all chromosome names


POS

position
Integer, required
Positions are sorted numerically, in increasing order, within each reference sequence CHROM
Having multiple records with the same POS is permitted
Telomeres are indicated by using positions 0 or N+1 where N is the length of the corresponding chromosome or contig


ID

identifier
String, no white-space or semi-colons permitted
Semi-colon separated list of unique identifiers where available
encouraged to use the rs number(s) if this is a dbSNP variant
No identifier should be present in more than one data record
missing value should be used if there is no identifier available


REF

reference base(s)
String, required
Each base must be one of A,C,G,T,N (case insensitive)
Multiple bases are permitted
The value in the POS field refers to the position of the first base in the String
For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty

unless the event occurs at position 1 on the contig

The REF and ALT Strings must include the base before the event
this padding base must be reflected in the POS field


if the event occurs at position 1 on the contig

The REF and ALT Strings must include the base after the event


The padding base is not required for events where all alleles have at least one base represented in their Strings

although it is permitted
e.g. complex substitutions
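The padding rule for a simple deletion can be sketched as follows. This is an illustration only: `deletion_record` is a hypothetical helper, the reference is assumed to be a plain string, and deleting an entire contig is not handled.

```python
def deletion_record(reference, start, length):
    """Build (POS, REF, ALT) for a simple deletion.

    reference: the contig sequence as a string.
    start:     1-based position of the first deleted base.
    length:    number of deleted bases.
    """
    deleted = reference[start - 1:start - 1 + length]
    if start > 1:
        # Usual case: pad with the base before the event, and shift POS
        # back by one so that it points at the padding base.
        pad = reference[start - 2]
        return start - 1, pad + deleted, pad
    # Event at position 1: there is no preceding base, so pad with the
    # base after the event instead. POS stays at 1.
    pad = reference[length]
    return 1, deleted + pad, pad
```

With the reference `GTCAG`, deleting the `C` at position 3 gives POS 2, REF `TC`, ALT `T`; deleting the `G` at position 1 gives POS 1, REF `GT`, ALT `T`.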


If any of the ALT alleles is a symbolic allele (an angle bracketed ID String "<ID>")

The padding base is required
POS denotes the coordinate of the base preceding the polymorphism


Tools processing VCF files are not required to preserve case in the allele String


ALT

alternate base(s)
String; no whitespace, commas, or angle-brackets are permitted in the ID String itself
Comma separated list of alternate non-reference alleles called on at least one of the samples
A,C,G,T,N,* (case insensitive) or an angle-bracketed ID String ("<ID>")
or a breakend replacement string as described in the section on breakends
The '*' allele is reserved to indicate that the allele is missing due to an upstream deletion
If there are no alternative alleles

the missing value should be used


Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive


QUAL

quality
Numeric
Phred-scaled quality score for the assertion made in ALT


FILTER

filter status
String, no white-space or semi-colons permitted
PASS if this position has passed all filters

i.e. a call is made at this position


If the site has not passed all filters

a semicolon-separated list of codes for filters that fail

e.g. "q10;s50" might indicate that at this site the quality is below 10 and the number of samples with data is below 50% of the total number of samples


'0' is reserved and should not be used as a filter String
If filters have not been applied

This field should be set to the missing value
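The three FILTER states described above can be told apart with a small helper. This is a sketch: the name `parse_filter` and the choice of return values (None, empty list, list of codes) are assumptions made for illustration, and `.` is taken as the missing value.

```python
def parse_filter(value):
    """Interpret a FILTER field.

    Returns None if filters have not been applied, an empty list if the
    site passed all filters, or the list of failed filter codes.
    """
    if value == ".":
        return None  # missing value: filters not applied
    if value == "PASS":
        return []  # passed every filter
    return value.split(";")  # semicolon-separated codes of failed filters
```

For `q10;s50` this yields the two codes `q10` and `s50`, each of which should have a matching ##FILTER line in the meta-information.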


INFO

additional information
String, no white-space, semicolons, or equals-signs permitted
commas are permitted only as delimiters for lists of values
Encoded as a semicolon-separated series of short keys with optional values

format: <key>=<data>[,data]
Arbitrary keys are permitted


reserved subfields

AA

ancestral allele


AC

allele count in genotype, for each ALT allele, in the same order as listed


AF

allele frequency for each ALT allele in the same order as listed
use this when estimated from primary data, not called genotypes


AN

total number of alleles in called genotypes


BQ

RMS base quality at this position


CIGAR

cigar string describing how to align an alternate allele to the reference allele


DB

dbSNP membership


DP

combined depth across samples, e.g. DP=154


END

end position of the variant described in this record

for use with symbolic alleles


H2

membership in hapmap2


H3

membership in hapmap3


MQ

RMS mapping quality, e.g. MQ=52


MQ0

Number of MAPQ == 0 reads covering this record


NS

Number of samples with data


SB

strand bias at this position


SOMATIC

indicates that the record is a somatic mutation
for cancer genomics


VALIDATED

validated by follow-up experiment


1000G

membership in 1000 Genomes


The exact format of each INFO sub-field should be specified in the meta-information

e.g. DP=154;MQ=52;H2 for an INFO field


Keys without corresponding values are allowed in order to indicate group membership

e.g. H2 indicates the SNP is found in HapMap 2


Not necessary to list all the properties that a site does NOT have

e.g. H2=0
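A minimal sketch of splitting an INFO field into keys and values. The helper name `parse_info` is hypothetical; values are left as lists of strings because their real types come from the ##INFO meta-information, which this sketch does not consult. Keys without a value are stored as `True`.

```python
def parse_info(info):
    """Split an INFO field into a dict of key -> list of strings, or True for flags."""
    if info == ".":
        return {}  # missing value: no INFO entries
    result = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if not sep:
            result[key] = True  # flag key, e.g. H2
        else:
            result[key] = value.split(",")  # commas delimit lists of values
    return result
```

For `DP=154;MQ=52;H2` the result has DP and MQ as one-element lists and H2 as a flag.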


Genotype fields


If genotype information is present

The same type of data must be present for all samples


FORMAT field is given specifying the data types and order

colon-separated alphanumeric String


FORMAT field is followed by one field per sample corresponding to the types specified in the format

colon-separated


The first sub-field must always be the genotype (GT) if it is present
No required sub-fields
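Pairing the FORMAT keys with one sample column can be sketched like this. The helper `parse_sample` is hypothetical, and the sample values in the usage note are made up for illustration. Because `zip` stops at the shorter sequence, keys with no corresponding value in the sample column are simply absent from the result; that behaviour is a design choice of this sketch.

```python
def parse_sample(format_field, sample_field):
    """Map each FORMAT key to the matching colon-separated sample value."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    if "GT" in keys and keys[0] != "GT":
        raise ValueError("GT must be the first sub-field when present")
    return dict(zip(keys, values))
```

With FORMAT `GT:GQ:DP:HQ` and a sample column `0|0:48:1:51,51`, the result maps GT to `0|0`, GQ to `48`, DP to `1`, and HQ to `51,51`, all still as strings.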

reserved keywords (common and standard across the community)


GT

genotype
encoded as allele values separated by either of / or |
The allele values are

0 for the reference allele (what is in the REF field)
1 for the first allele listed in ALT
2 for the second allele listed in ALT
and so on


For haploid calls

e.g. on Y, male non-pseudoautosomal X, or mitochondrion
only one allele value should be given


For a triploid call

might look like: 0/0/1


If a call cannot be made for a sample at a given locus

'.' should be specified for each missing allele in the GT field

e.g. './.' for a diploid genotype and '.' for haploid genotype


The meanings of separators

/

genotype unphased


|

genotype phased
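The allele values and separators above can be decoded with a few lines. This is a sketch with assumptions: `parse_gt` is a hypothetical name, missing alleles are returned as `None`, and a genotype is reported as phased if any `|` appears, so mixed separators are not distinguished.

```python
import re


def parse_gt(gt):
    """Return (alleles, phased) for a GT value such as '0|1' or './.'."""
    # Split on either separator; '.' marks a missing allele.
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    phased = "|" in gt
    return alleles, phased
```

For `0|1` this gives the alleles 0 and 1 with the phased flag set; for the triploid `0/0/1` it gives three alleles, unphased.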


DP

read depth at this position for this sample


FT

sample genotype filter indicating if this genotype was "called"
similar in concept to the FILTER field
use PASS to indicate that all filters have been passed
a semi-colon separated list of codes for filters that fail
'.' to indicate that filters have not been applied
should be described in the meta-information in the same way as FILTERs


GL

genotype likelihoods
comprised of comma separated floating point log10-scaled likelihoods

for all possible genotypes given the set of alleles defined in the REF and ALT field


In presence of the GT field

the same ploidy is expected
the canonical order is used


Without GT field

diploidy is assumed


if A is the allele in REF and B,C, ... are the alleles as ordered in ALT

the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j.


For biallelic sites

the ordering is: AA, AB, BB


For triallelic sites

the ordering is: AA,AB,BB,AC,BC,CC, etc.


e.g. GT:GL 0/1:-323.03,-99.29,-802.53
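The diploid ordering formula F(j/k) = (k*(k+1)/2)+j can be checked with a short sketch. Both function names are hypothetical, and alleles are numbered as in GT (0 for REF, 1 for the first ALT, and so on).

```python
def genotype_index(j, k):
    """Position of the diploid genotype j/k in the GL list."""
    if j > k:
        j, k = k, j  # the formula assumes j <= k
    return k * (k + 1) // 2 + j


def genotype_order(n_alleles):
    """List all diploid genotypes (j, k) in GL order for n_alleles alleles."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]
```

With three alleles labelled A, B, C, `genotype_order(3)` walks through AA, AB, BB, AC, BC, CC, and `genotype_index` returns each genotype's position in that list.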


GLE

genotype likelihoods of heterogeneous ploidy
used in presence of uncertain copy number
e.g. GLE=0:-75.22,1:-223.42,0/0:-323.03,1/0:-99.29,1/1:-802.53


PL

the phred-scaled genotype likelihoods rounded to the closest integer

otherwise defined precisely as the GL field


GP

the phred-scaled genotype posterior probabilities

otherwise defined precisely as the GL field
intended to store imputed genotype probabilities


GQ

conditional genotype quality
encoded as a phred quality
-10log10 p(genotype call is wrong, conditioned on the site's being variant)
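The phred encoding used by GQ (and by QUAL, PL, and HQ) is the same -10log10 transform of an error probability. A minimal sketch of the conversion in both directions, with hypothetical function names:

```python
import math


def phred(p_error):
    """Convert an error probability (0 < p_error <= 1) to a phred score."""
    return -10 * math.log10(p_error)


def error_probability(q):
    """Convert a phred score back to an error probability."""
    return 10 ** (-q / 10)
```

An error probability of 0.001 corresponds to a phred score of about 30, and a score of 20 corresponds to a probability of about 0.01. Since GQ is stored as an Integer, a writer would round the score before output.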


HQ

haplotype qualities
two comma separated phred qualities


PS

phase set
A phase set is defined as a set of phased genotypes to which this genotype belongs
Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set
A phase set specifies multi-marker haplotypes for the phased genotypes in the set
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set
If the genotype in the GT field is unphased

the corresponding PS field is ignored


The recommended convention is to use the position of the first variant in the set as the PS identifier

not required


PQ

phasing quality
the phred-scaled probability that alleles are ordered incorrectly in a heterozygote

against all other members in the phase set


the specification does not include a specific measure for precisely defining "phasing quality"
the PQ tag is only reserved for future use as a measure of phasing quality


EC

comma separated list of expected alternate allele counts for each alternate allele in the same order as listed in the ALT field

typically used in association analyses


MQ

RMS mapping quality
similar to the version in the INFO field


Strict type of keywords


GT

encoded as allele values separated by either of / or |
The allele values are

0 for the reference allele (what is in the REF field)
1 for the first allele listed in ALT
2 for the second allele listed in ALT


The meanings of separators

/

genotype unphased


|

genotype phased


DP

Integer


FT

String, no white-space or semi-colons permitted


GL

Floats


GLE

String


PL

Integers


GP

Floats


GQ

Integer


HQ

Integers


PS

Non-negative 32-bit Integer


PQ

Integer


EC

Integers


MQ

Integer