@inutano
Last active March 2, 2023 04:26
VCF ver. 4.2 specification

The VCF specification

Meta-information lines

File format

  • always required
  • must be the first line in the file
  • details VCF format version number
    • e.g. ##fileformat=VCFv4.2

Information field format

  • template: ##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="source",Version="version">
    • e.g. ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
  • Possible Types for INFO fields
    • Integer
    • Float
    • Flag
    • Character
    • String
  • Number entry is an Integer that describes the number of values that can be included with the INFO field
    • 1 if the INFO field contains a single number, 2 if the field describes a pair of numbers, and so on
    • special characters for special cases
      • 'A' for the field has one value per alternate allele
      • 'R' for the field has one value for each possible allele (including the reference)
      • 'G' for the field has one value for each possible genotype (more relevant to the FORMAT tags)
      • '.' if the number of possible values varies, is unknown, or is unbounded
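
The structured meta-information lines above all share the shape ##KEY=<k=v,...>. As a minimal sketch (the helper name parse_meta_line and its regular expression are illustrative assumptions, not part of the specification), such a line could be split into its key/value pairs like this:

```python
import re

# Hypothetical helper, not defined by the VCF specification.
# Matches key=value pairs; a double-quoted value may contain commas.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')

def parse_meta_line(line):
    """Split a structured meta-information line into (kind, fields)."""
    match = re.fullmatch(r'##(\w+)=<(.*)>', line.strip())
    if match is None:
        raise ValueError("not a structured meta-information line")
    kind, body = match.groups()
    fields = {}
    for key, value in _PAIR.findall(body):
        # Drop surrounding double quotes; every value stays a string.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        fields[key] = value
    return kind, fields

kind, info = parse_meta_line(
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">'
)
print(kind, info["ID"], info["Number"], info["Type"])
```

Number is deliberately kept as a string here, because it may hold 'A', 'R', 'G', or '.' instead of an integer.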

Filter field format

  • filters that have been applied to the data
  • template: ##FILTER=<ID=ID,Description="description">
  • e.g. ##FILTER=<ID=q10,Description="Quality below 10">

Individual format field format

  • template: ##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
  • e.g. ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
  • Possible Types for FORMAT fields are
    • Integer
    • Float
    • Character
    • String

Alternative allele field format

  • Symbolic alternate alleles for imprecise structural variants
  • can be a colon-separated list of types and subtypes
  • ID values are case sensitive strings and may not contain whitespace or angle brackets
  • template: ##ALT=<ID=type,Description=description>
  • The first level type must be one of the following
    • DEL
      • Deletion relative to the reference
    • INS
      • Insertion of novel sequence relative to the reference
    • DUP
      • Region of elevated copy number relative to the reference
    • INV
      • Inversion of reference sequence
    • CNV
      • Copy number variation region (may be both deletion and duplication)
      • CNV category should not be used when a more specific category can be applied
  • Reserved subtypes include
    • DUP:TANDEM
      • Tandem duplication
    • DEL:ME
      • Deletion of mobile element relative to the reference
    • INS:ME
      • Insertion of a mobile element relative to the reference
  • For all of the ##INFO, ##FORMAT, ##FILTER, and ##ALT metainformation, extra fields can be included after the default fields
    • e.g. ##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="description",Version="128">
    • Optional fields should be stored as strings even for numeric values

Assembly field format

  • Breakpoint assemblies for structural variations may use an external file
  • template: ##assembly=url
  • The URL field specifies the location of a fasta file containing breakpoint assemblies referenced in the VCF records for structural variants via the BKPTID INFO key

Contig field format

  • highly recommended but not required
  • describes the contigs referred to in the VCF file
  • allows these contigs to come from different files
  • e.g. ##contig=<ID=ctg1,URL=ftp://somewhere.org/assembly.fa,...>

Sample field format

  • To define sample to genome mappings
  • e.g. ##SAMPLE=<ID=S_ID,Genomes=G1_ID;G2_ID; ...;GK_ID,Mixture=N1;N2; ...;NK,Description=S1;S2; ...;SK>

Pedigree field format

  • To record relationships between genomes
    • e.g. ##PEDIGREE=<Name_0=G0-ID,Name_1=G1-ID,...,Name_N=GN-ID>
  • Or a link to a database
    • e.g. ##pedigreeDB=<url>

Header line syntax

  • 8 fixed, mandatory columns
    1. #CHROM
    2. POS
    3. ID
    4. REF
    5. ALT
    6. QUAL
    7. FILTER
    8. INFO
  • tab-delimited
  • If genotype data is present in the file, these are followed by a FORMAT column header, then an arbitrary number of sample IDs
    • e.g. #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
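
A minimal sketch of reading the header line described above, assuming it is tab-delimited as stated. The function name parse_header_line is hypothetical; it checks the 8 fixed columns and returns the sample IDs, if any:

```python
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_header_line(line):
    """Return the list of sample IDs named in a VCF header line."""
    columns = line.rstrip("\n").split("\t")
    if columns[:8] != FIXED_COLUMNS:
        raise ValueError("header must start with the 8 fixed columns")
    if len(columns) == 8:
        # No genotype data in this file.
        return []
    if columns[8] != "FORMAT":
        raise ValueError("column 9 must be FORMAT when genotype data is present")
    return columns[9:]

header = "\t".join(FIXED_COLUMNS + ["FORMAT", "NA00001", "NA00002", "NA00003"])
print(parse_header_line(header))
```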

Data lines

Fixed fields

  1. CHROM
    • chromosome
    • String, no white-space permitted, required
    • An identifier from the reference genome or an angle-bracketed ID String pointing to a contig in the assembly file
      • cf. the ##assembly line in the header
    • All entries for a specific CHROM should form a contiguous block within the VCF file
    • The colon symbol (:) must be absent from all chromosome names
  2. POS
    • position
    • Integer, required
    • Positions are sorted numerically, in increasing order, within each reference sequence CHROM
    • Having multiple records with the same POS is permitted
    • Telomeres are indicated by using positions 0 or N+1 where N is the length of the corresponding chromosome or contig
  3. ID
    • identifier
    • String, no white-space or semi-colons permitted
    • Semi-colon separated list of unique identifiers where available
    • encouraged to use the rs number(s) if this is a dbSNP variant
    • No identifier should be present in more than one data record
    • missing value should be used if there is no identifier available
  4. REF
    • reference base(s)
    • String, required
    • Each base must be one of A,C,G,T,N (case insensitive)
    • Multiple bases are permitted
    • The value in the POS field refers to the position of the first base in the String
    • For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty
      • The REF and ALT Strings must include the base 'before' the event
        • this must be reflected in the POS field
      • unless the event occurs at position 1 on the contig
        • in which case they must include the base 'after' the event
      • This padding base is not required (although permitted) when all alleles have at least one base represented in their Strings
        • e.g. complex substitutions
    • If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String "<ID>")
      • The padding base is required
      • POS denotes the coordinate of the base preceding the polymorphism
    • Tools processing VCF files are not required to preserve case in the allele String
  5. ALT
    • alternate base(s)
    • String; no whitespace, commas, or angle-brackets are permitted in the ID String itself
    • Comma separated list of alternate non-reference alleles called on at least one of the samples
    • A,C,G,T,N,* (case insensitive) or an angle-bracketed ID String ("<ID>")
    • or a breakend replacement string as described in the section on breakends
    • The '*' allele is reserved to indicate that the allele is missing due to an upstream deletion
    • If there are no alternative alleles
      • the missing value should be used
    • Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive
  6. QUAL
    • quality
    • Numeric
    • Phred-scaled quality score for the assertion made in ALT
  7. FILTER
    • filter status
    • String, no white-space or semi-colons permitted
    • PASS if this position has passed all filters
      • i.e. a call is made at this position
    • If the site has not passed all filters
      • a semicolon-separated list of codes for filters that fail
        • e.g. "q10;s50" might indicate that at this site the quality is below 10 and the number of samples with data is below 50% of the total number of samples
    • '0' is reserved and should not be used as a filter String
    • If filters have not been applied
      • This field should be set to the missing value
  8. INFO
    • additional information
    • String, no white-space, semicolons, or equals-signs permitted
    • commas are permitted only as delimiters for lists of values
    • Encoded as a semicolon-separated series of short keys with optional values
      • format: <key>=<data>[,data]
      • Arbitrary keys are permitted
    • reserved subfields
      • AA
        • ancestral allele
      • AC
        • allele count in genotype, for each ALT allele, in the same order as listed
      • AF
        • allele frequency for each ALT allele in the same order as listed
        • use this when estimated from primary data, not called genotype
      • AN
        • total number of alleles in called genotypes
      • BQ
        • RMS base quality at this position
      • CIGAR
        • cigar string describing how to align an alternate allele to the reference allele
      • DB
        • dbSNP membership
      • DP
        • combined depth across samples, e.g. DP=154
      • END
        • end position of the variant described in this record
          • for use with symbolic alleles
      • H2
        • membership in hapmap2
      • H3
        • membership in hapmap3
      • MQ
        • RMS mapping quality, e.g. MQ=52
      • MQ0
        • Number of MAPQ == 0 reads covering this record
      • NS
        • Number of samples with data
      • SB
        • strand bias at this position
      • SOMATIC
        • indicates that the record is a somatic mutation
        • for cancer genomics
      • VALIDATED
        • validated by follow-up experiment
      • 1000G
        • membership in 1000 Genomes
    • The exact format of each INFO sub-field should be specified in the meta-information
      • e.g. DP=154;MQ=52;H2 for an INFO field
    • Keys without corresponding values are allowed in order to indicate group membership
      • e.g. H2 indicates the SNP is found in HapMap 2
    • Not necessary to list all the properties that a site does NOT have
      • e.g. H2=0
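
The INFO encoding described above (semicolon-separated keys, optional comma-separated values, bare keys as flags) can be sketched as follows. The function name parse_info is hypothetical, and treating a lone '.' as the missing value is an assumption about how an empty INFO column is written:

```python
def parse_info(info_column):
    """Parse an INFO column into a dict of key -> list of strings, or True for flags."""
    if info_column == ".":
        # Assumed missing value: no INFO entries at all.
        return {}
    result = {}
    for entry in info_column.split(";"):
        key, sep, value = entry.partition("=")
        # A key without '=' is a flag indicating group membership.
        result[key] = value.split(",") if sep else True
    return result

print(parse_info("DP=154;MQ=52;H2"))
```

Values are left as strings; converting them needs the Type and Number declared in the matching ##INFO meta-information line.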

Genotype fields

  • If genotype information is present
    • The same type of data must be present for all samples
  • FORMAT field is given specifying the data types and order
    • colon-separated alphanumeric String
  • FORMAT field is followed by one field per sample corresponding to the types specified in the format
    • colon-separated
  • The first sub-field must always be the genotype (GT) if it is present
  • No required sub-fields
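
As a rough illustration of how the FORMAT column and a sample column line up, the sketch below pairs the colon-separated keys with the colon-separated values. The name parse_sample is hypothetical, and rejecting a sample that has more sub-fields than FORMAT declares is a choice made for this example:

```python
def parse_sample(format_column, sample_column):
    """Pair FORMAT keys with one sample's colon-separated values (all kept as strings)."""
    keys = format_column.split(":")
    values = sample_column.split(":")
    # GT, when present, must be the first sub-field.
    if "GT" in keys and keys[0] != "GT":
        raise ValueError("GT must be the first FORMAT sub-field")
    if len(values) > len(keys):
        raise ValueError("sample has more sub-fields than FORMAT declares")
    return dict(zip(keys, values))

print(parse_sample("GT:GQ:DP:HQ", "0|0:48:1:51,51"))
```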

Reserved keywords (common and standard across the community)

  • GT
    • genotype
    • encoded as allele values separated by either of / or |
    • The allele values are
      • 0 for the reference allele (what is in the REF field)
      • 1 for the first allele listed in ALT
      • 2 for the second allele listed in ALT
      • and so on
    • For haploid calls
      • e.g. on Y, male non-pseudoautosomal X, or mitochondrion
      • only one allele value should be given
    • A triploid call
      • might look like: 0/0/1
    • If a call cannot be made for a sample at a given locus
      • '.' should be specified for each missing allele in the GT field
        • e.g. './.' for a diploid genotype and '.' for haploid genotype
    • The meanings of separators
      • /
        • genotype unphased
      • |
        • genotype phased
  • DP
    • read depth at this position for this sample
  • FT
    • sample genotype filter indicating if this genotype was "called"
    • similar in concept to the FILTER field
    • use PASS to indicate that all filters have been passed
    • a semi-colon separated list of codes for filters that fail
    • '.' to indicate that filters have not been applied
    • should be descrived in the meta-information in the same way as FILTERs
  • GL
    • genotype likelihoods
    • comprised of comma separated floating point log10-scaled likelihoods
      • for all possible genotypes given the set of alleles defined in the REF and ALT field
    • In presence of the GT field
      • the same ploidy is expected
      • the canonical order is used
    • Without GT field
      • diploidy is assumed
    • if A is the allele in REF and B,C, ... are the alleles as ordered in ALT
      • the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j.
    • For biallelic sites
      • the ordering is: AA, AB, BB
    • For triallelic sites
      • the ordering is: AA,AB,BB,AC,BC,CC, etc.
    • e.g. GT:GL 0/1:-323.03,-99.29,-802.53
  • GLE
    • genotype likelihoods of heterogeneous ploidy
    • used in presence of uncertain copy number
    • e.g. GLE=0:-75.22,1:-223.42,0/0:-323.03,1/0:-99.29,1/1:-802.53
  • PL
    • the phred-scaled genotype likelihoods rounded to the closest integer
      • otherwise defined precisely as the GL field
  • GP
    • the phred-scaled genotype posterior probabilities
      • otherwise defined precisely as the GL field
      • intended to store imputed genotype probabilities
  • GQ
    • conditional genotype quality
    • encoded as a phred quality
    • -10log10 p(genotype call is wrong, conditioned on the site's being variant)
  • HQ
    • haplotype qualities
    • two comma separated phred qualities
  • PS
    • phase set
    • A phase set is defined as a set of phased genotypes to which this genotype belongs
    • Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set
    • A phase set specifies multi-marker haplotypes for the phased genotypes in the set
    • All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set
    • If the genotype in the GT field is unphased
      • the corresponding PS field is ignored
    • The recommended convention is to use the position of the first variant in the set as the PS identifier
      • not required
  • PQ
    • phasing quality
    • the phred-scaled probability that alleles are ordered incorrectly in a heterozygote
      • against all other members in the phase set
    • the specification does not define a specific measure of "phasing quality"
    • the PQ tag is only reserved for future use as a measure of phasing quality
  • EC
    • comma separated list of expected alternate allele counts for each alternate allele in the same order as listed in the ALT field
      • typically used in association analyses
  • MQ
    • RMS mapping quality
    • similar to the version in the INFO field
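
The GL ordering formula F(j/k) = (k*(k+1)/2)+j given above for diploid genotypes can be checked with a short sketch. The function names are hypothetical; allele index 0 stands for REF and 1, 2, ... for the ALT alleles in order:

```python
def genotype_index(j, k):
    """Index of diploid genotype j/k in the GL list: F(j/k) = k*(k+1)/2 + j, with j <= k."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j

def genotype_order(n_alleles):
    """All diploid genotypes (j, k) with j <= k, listed in GL order."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]

def most_likely_genotype(gl_values, n_alleles):
    """Pick the genotype with the highest log10-scaled likelihood."""
    order = genotype_order(n_alleles)
    if len(gl_values) != len(order):
        raise ValueError("unexpected number of GL values for a diploid call")
    best = max(range(len(order)), key=lambda i: gl_values[i])
    return order[best]

labels = "ABC"
print(["".join(labels[a] for a in pair) for pair in genotype_order(3)])
# ['AA', 'AB', 'BB', 'AC', 'BC', 'CC']
print(most_likely_genotype([-323.03, -99.29, -802.53], 2))
# (0, 1)
```

The second call uses the GL values from the example above and selects 0/1, which agrees with the GT given there.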

Strict type of keywords

  • GT
    • encoded as allele values separated by either of / or |
    • The allele values are
      • 0 for the reference allele (what is in the REF field)
      • 1 for the first allele listed in ALT
      • 2 for the second allele listed in ALT
    • The meanings of separators
      • /
        • genotype unphased
      • |
        • genotype phased
  • DP
    • Integer
  • FT
    • String, no white-space or semi-colons permitted
  • GL
    • Floats
  • GLE
    • String
  • PL
    • Integers
  • GP
    • Floats
  • GQ
    • Integer
  • HQ
    • Integers
  • PS
    • Non-negative 32-bit Integer
  • PQ
    • Integer
  • EC
    • Integer
  • MQ
    • Integer
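
The types listed above could be applied to one sample's sub-fields roughly as follows. This is a sketch under assumptions: the names parse_gt, CASTS, and cast_sample are hypothetical, the input is a dict of raw strings keyed by FORMAT key, and '.' is treated as the missing value and mapped to None:

```python
import re

def parse_gt(gt):
    """Decode a GT string into (allele indices, phased flag); '.' becomes None."""
    phased = "|" in gt
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    return alleles, phased

def _ints(text):
    return [None if x == "." else int(x) for x in text.split(",")]

def _floats(text):
    return [None if x == "." else float(x) for x in text.split(",")]

# Converters for a subset of the reserved keys; other keys stay strings.
CASTS = {"DP": int, "GQ": int, "PS": int,
         "GL": _floats, "GP": _floats, "PL": _ints, "HQ": _ints}

def cast_sample(fields):
    """Convert raw sub-field strings to typed values."""
    typed = {}
    for key, raw in fields.items():
        if key == "GT":
            typed[key] = parse_gt(raw)
        elif raw == ".":
            typed[key] = None
        elif key in CASTS:
            typed[key] = CASTS[key](raw)
        else:
            typed[key] = raw
    return typed

print(cast_sample({"GT": "0|1", "GQ": "48", "DP": "1", "HQ": "51,51"}))
```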