andy-thomason/Genomics_A_Programmers_Guide.md

## Genomics_A_Programmers_Guide.md

      
    Raw
  

              Genomics_A_Programmers_Guide.md
            
          
    Genomics - A programmer's guide.

Andy Thomason is a Senior Programmer at Genomics PLC.
He has been witing graphics systems, games and compilers since
the '70s and specialises in code performance.
https://www.genomicsplc.com

Genes - a gentle introduction

The human genome consists of two copies of about 3 billion base pairs
of DNA using the alphabet A, C, G and T. This is about two bits per base
or
3,000,000,000 * 2 * 2 / 8 = 1,500,000,000 or about 1.5GB of data.
https://en.wikipedia.org/wiki/Human_genome
In reality the two copies are very similar and indeed the DNA of all humans
is nearly identical from Wall Street trader to Australian aboriginal.
There are a number of "reference genomes" such as the Ensembl Fasta files
available from here:
ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/
Reference genomes help us to build a map of where we might find particlar
features in human DNA, but do not represent any real individuals.
For example, we can use it to name a "location" for a protein coding gene
such as BRCA2, a DNA repair mechanism implicated in breast cancer:
https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000139618;r=13:32315474-32400266
This gene is on chromosome 13 from position 32315474 to 32400266.

Genetic Variations

Humans are so similar that all we usually have to store to represent them
is a small set of "variants".
Over time our DNA gets damaged by cosmic rays and copy errors and so
the DNA that parents hand down to children is very slightly different than their own.
A process called recombination further shuffles things so that the DNA a child inherits from their parent is actually a mix of the DNA from their two grandparents on that side.
So for each change to our DNA, we only need to store the differences from the reference genome
and this usually comes in the form of a VCF file (Variant Call Format).
As with almost all files in bioinformatics, this is a TSV (Tab separated value) file.
You can get your own VCF file from the likes of 23 and me and ancestry.com: you pay a relatively
small fee and send a sample which will be have a subset of its variants sequenced on an array
chip. The chip lights up where DNA matches expected sequences.
Example abbreviated from
http://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40/
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

Here we have three people named NA00001..3 (we take personal data security very seriously in the genetics world) who have some variants 0|0 1|0 and 1|1 changing from G to A in position 14370 of chromosome 20.
There are two numbers per person as we all have two copies of chromosome 20 (one from each parent; the only exceptions are the sex chromosomes. I am unlucky enough to only have one X chromosome and so I have inheirited colour blindness from my Grandfather via my mother.).
We can have either
0|0 both the same as the reference
1|0 and 0|1 only one chromosome is different from the reference
1|1 both chromosomes are different from the reference.

VCF files are "Phased" if we can work out which chromosome the variant is actually on, or at least where it is relative
to its neighbours. In practice, unless you have a very small pair of tweezers, it is hard to tell which chromosome
the DNA came from and so we guess!
So in reality we have a bit vector 001011 which is enough to classify these three humans in this variation.
These are the haplotypes or variation on individual chromosomes.
GWAS studies

We can now use this bit vector to try to work out what bits of the genome affect what diseases or other things like
hair colour or height. For each variant we plot each haplotype against the thing we want to measure (the phenotype)
This is a GWAS or Genome wide association study and is the core of genetic variant analysis.
It makes a map from variants to observations.
For example
Haplotype Height Person
0         1.5m   NA00001
0         1.5m
1         1.75m  NA00002
0         1.75m
1         1.95m  NA00003
1         1.95m

Note that everyone has two haplotypes because we have pairs of chromosomes.
Here we see that the amount of 1'ness gives more height and we do a linear regression to find two values:
beta            The change in height with a change in variant.
standard error  A measure of how noisy our data is.

In practice the data is really noisy and the error tends to be larger than the beta, but often we have a few
variants where the beta is much stronger than the error and this ratio - the Z-score and its associated value
the p-value show which variants are likely to be an influence on height.
The simplest way to do the regression is to use the Moore Penrose inverse.
https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse
We make a 2x2 matrix of covariance with the dot product of the two vectors and then solve
to find the least squares solution.
For a good sized data set we have literally trillions of data points, so doing this efficiently is
important.
The curse of linkage disequilibrium

Because we inheirit big chunks of DNA from our parents it means that when we look at a particular
area of the DNA we look very similar - far more similar that chance would dictate.
This is great for us as our genes carry on working the way they did in our ancestors, but bad
for genomics researchers. It means that we are not sufficiently different for us to be able to tell
which variant is the cause of a phenotype change.
Linkage Disequilibrium (LD) is a measure of how similar two vectors of variants are.
https://en.wikipedia.org/wiki/Linkage_disequilibrium
It gives a value between -1 and 1 where
-1     The two variants are completely opposite
 0     The variants are not alike
 1     The variants are exactly the same

We make big square matrices of LD around certain sites in the genome to explore
how similar the variants are. In practice many of the variants surrounding a site
are almost identical to the variant in the middle.
The matrix ends up a bit like this with big squares of similarity.
    v0  v2  v4  v6  v8  va  vc  ve  vg
      v1  v3  v5  v7  v9  vb  vd  vf
v0    1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 
v1    1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
v2    1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
v3    1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
v4    1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
v5    1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
v6    1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
v7    1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
v8    0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
v9    0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
va    0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
vb    0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
vc    0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
vd    0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
ve    0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
vf    0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
vg    0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

The actual values aren't exactly 0 or 1, but very similar.
The gap between v7 and v8 is where recombination has happened.
This has made v0..v7 different from v8..vg
The problem with the similarity is that we know that one
of the variants in the group caused something, but not which one.
This limits the resolution of out genomic microscope and we need to
use further methods such as functional genomics to resolve this issue.
Conclusion

At the end of the day, we can't be sure that our choice of an answer
to "what causes my thing" is the right one, but that is genetics through
and through. Biology is not a nice tidy machine with perfect clockwork parts,
it is a seething mass of randomness that somehow makes all this stuff we
call life. That is why statistics, or "Machine Learning" as it has been
rebranded is so important.