gtsambos/wiki.md

## wiki.md

      
    Raw
  

              wiki.md
            
          
    Aim

To see whether the detailed IBD information that we can extract from tree sequences may be useful in large-scale simulation studies.
Motivating paper

[1] Gladstein, A. L., & Hammer, M. F. (2019). Substructured Population Growth in the Ashkenazi Jews Inferred with Approximate Bayesian Computation. Molecular Biology and Evolution, 36(6), 1162–1171. https://doi.org/10.1093/molbev/msz047
In this paper, the authors conducted a large-scale ABC simulation study to investigate the history of Ashkenazi Jewish people.
Some relevant background info from [1]:

Between the 11th and 16th centuries, the AJ (Ashkenazi Jewish) population expanded eastward leading to two culturally distinct communities in Western/Central and Eastern Europe. Our aim was to determine whether the western and eastern groups are genetically distinct, and if so, what demographic processes contributed to population differentiation.


The eastward expansion of the AJ settlement started after the 11th century and continued into the 16th century due to religious and ethnic persecution in Western and Central Europe (Hundert 2004). This led to the establishment of two major culturally distinct communities in Western/ Central and Eastern Europe (Meyer et al. 1996). This is reflected in the division between the Western and Eastern Yiddish dialects (Katz 2004; supplementary fig. S1, Supplementary Material online).

Unless explicitly mentioned otherwise, all the quotes on this page are taken from this paper.
What I did


I simulated an AJ dataset under the demographic model and parameters inferred in [1] and treated it as the 'truth'.
I simulated 100 000 AJ datasets under some of the models hypothesised in [1].
Using ABC methods, I tried to recreate some of the important findings from [1].

What I DID NOT do

I'd be really uncomfortable with any suggestion/language to the effect that I've "replicated" Ariella's work here -- I coded up slightly simplified versions of her demographic models, and did not test the full range of scenarios that she did. Also, as I was working entirely with simulated data, I did not have to consider many of the tricky issues eg. ascertainment bias, that she did.
I do not want to suggest that my work here is contributing to our knowledge of the history of Ashkenazi Jewish people, or to be presented as having expertise about this. If anyone wants to know more about AJ population genetics we should refer them to one of the many other papers that look at this eg. [1,2,3,4,5].
I intend this to be more of a 'power study' -- a demonstration of the fact that IBD information can be extracted efficiently enough from tree sequences to be useful in high-powered studies investigating subtle demographic questions.
Demographic hypotheses


To test whether EAJ (East Ashkenazi Jews) and WAJ (West Ashkenazi Jews) diverged into two subpopula- tions, and if so, whether different histories of population growth or gene flow contributed to the differentiation, we chose the best out of three demographic models (fig. 1)with ABC.

More technical details about the populations, parameters and prior distributions in these scenarios are here. Here's a brief overview of the various demographic hypotheses tested in [1]. See [1], Figure 1 for pictures of the models.
Model 1 (Null hypothesis)


Our first model represents the null hypothesis of cultural differentiation without genetic division between EAJ and WAJ.

Model 2 (Substructure)


Next, the Substructure model includes a population split between EAJ andWAJ allowing for different population sizes, following one common gene flow event with Europeans after the initial founder event

Model 3 (Substructure + differential admixture)


Finally, the Substructure with Differential Admixture model is similar to the substructure model; however, it allows for gene flow with Europeans separately in EAJ and WAJ after the initial divergence of WAJ and EAJ.

I didn't test Model 3 in my simulations.
"True" model

Since I don't have access to the real summary statistics used in [1], I simulated a dataset using the inferred model and parameter values from [1] and treated it as the "true" dataset.
Why use IBD/haplotype-based information for this analysis?

Notice that Model1 and Model2 differ only subtly. In Model 1, all AJ samples are considered to be from a single, isolated panmictic population from the time of their divergence from the wider Jewish population 20-36 generations ago. In Model 2, there is an additional later split in the AJ samples, corresponding to a divergence between West and East AJ populations.
In evolutionary terms, this is a tiny difference. We would not expect allele frequencies, or statistics based on allele frequencies, to be that difference in these two cases. By contrast, large IBD segments are believed to be very informative
Many recent popgen studies of AJ populations have used IBD information as the primary basis for inference [1, 2, 3]. Indeed, the development of several IBD-based inference methods were motivated by the study of European Jewish and AJ populations [2, 3].
Allele-based inference on populations that haven't been studied extensively can also be affected by ascertainment issues, esp. those using SNP array datasets. Without correcting for this, the allele frequency spectrum tends to be biased towards a deficiency of rare alleles and an excess of common alleles [8].
Why use a tree sequence simulator for this?

I used msprime [9] for the simulations in this analysis, primarily for its scalability, speed and efficiency. Obviously, these are never bad things! But here are a few reasons why they were particularly necessary for this analysis:

As recent IBD sharing is still fairly rare for samples from large panmictic populations, you need to simulate

large genomic regions
large sample sizes
to have a reasonably sized set of shared haplotypes in a sample.


When I did some test simulations with small sample sizes and chromosome lengths, many of them had no recent IBD segments at all. There's some chance of this happening in any simulation, but it's best to choose sample sizes and genomic regions that are large enough to make this unlikely. It's not obviously clear how you should deal with simulations that have no IBD, and simply disregarding them isn't ideal since the chance of 'IBD missingness' is dependent on demographic parameters.

Since all of my summary statistics were based on IBD defined wrt a particular set of recent time points, I did not need to simulate any of the genomic history earlier than these times, or any mutations. This made the simulations much faster!

Method, software, some technical details

I used snakemake [10]. The Snakefile showing my pipeline is here.
Simulations

See rules truth_set, simulations1 and simulations2 in the Snakefile.

used msprime [9] in 'coalescent' mode

100 Mb chromosome
240 haplotypes (samples)
50 000 replicates from each of the 2 hypothesised models, 20 replicates from the 'true' model
rho =  $1.25*10^{-7}$ per base per gen


Placed census events at (10, 15, 20) generations in the past.
I calculated all segment pairs that were IBD with respect to these times, and calculated the first 5  moments of the distribution of segment lengths. These were the summary statistics I used.
Since all statistics were based on genealogical information from these recent times, I stopped the simulations once all of these time points were reached, and didn't bother simulating mutations.

Model selection

See rule abcrf in the snakefile

used random forests with the abcrf R package [11]

Each prediction was based on the statistics from a bootstrapped sample of the simulations (50 000 samples taken with replacement)
1000 decision trees in each forest


Reduced dimension to 10 variables (might not have been necessary)
20 bootstrapped replicates were taken. For each replicate, an abcrf prediction was computed. I recorded the 'votes' for each model (ie. the number of decision trees in the forest supporting each model), the estimated posterior probability and the prior out-of-bag error rate.

Parameter estimation

See rule param_est in the snakefile

I used just one of the 20 canonical 'truth' sets here.
Used a neural network regression model [12] with the abc R package [13]
tolerance of 0.5 (ie. the 50% closest simulations were used)
Estimated the following parameters:
N_waj: Ne for West Ashkenazi Jewish pop
N_eaj: Ne for East Ashkenazi Jewish pop
N_asj: Ne for ancestral Ashkenazi Jewish pop
t_waj_eaj_d: time of divergence between WAJ and EAJ

Results

Model selection


Votes for correct model
Posterior probability
Prior error rate


93.36% +/- 4.69%
0.9633 +/- 0.0268
0.0993 +/- 0.0008


Parameter estimation

The parameter estimated are pleasingly close to the true values, with the exception of N_eaj. When I investigated further, I realised that I'd made a typo in my simulation script that likely explains this. I haven't had time to rerun the simulations yet (and when I do there are a few other things I'd change as well), but thought I'd just note this here. I also don't think this error would have a large bearing on the other results, or on the overall 'story' demonstrated thus far -- in particular, I suspect that the model selection results shown in the previous section would be stronger once this correction is made.
Here are some preliminary plots of posterior distributions (red) against the prior (thin black dotted), with the true parameter value shown with a thick black dotted line.
N_waj: (Ne for WAJ)

N_eaj: (Ne for EAJ - this is the one that is affected by my mistake)

N_asj: (Ne for ancestral AJs)

t_waj_eaj_d: (divergence time between WAJ and EAJ)

With the exception of N_eaj, we do pretty well - and in fact, our high posterior density regions are basically the same as Ariella's. (We're getting a much tighter estimate of t_waj_eaj_d than the one reported in [1]).


Parameter
My mode
True value
HPDI 95
HPDI in [1]


log10(N_waj)
4.20
3.90
(3.64, 6.64)
(3.10, 6.48)


log10(N_eaj)
4.24
6.18
(3.93, 6.35)
(4.24, 6.70)


log10(N_asj)
3.28
3.04
(2.06, 4.44)
(2.00, 4.33)


t_waj_eaj_d
13.59
14.93
(11.22, 19.90)
(2.00, 29.00)


Additional comments and caveats

A recent preprint [14] shows that when you use the coalescent to model large genomic regions in large samples, there are biases in IBD patterns. It would have been better to use a Wright-Fisher model, I just wasn't sure that I'd have enough time/compute power for it. If there's time I will repeat.
In practice, you wouldn't typically have your true dataset in tree sequence format, so would need some external software to estimate IBD in this dataset. Typically, these softwares use a slightly different definition of IBD to what I am calculating in this analysis - something more like "IBD segments longer than a given length l". I am currently working on a notebook that shows how you can also compute this from tree sequences.
All the analysis ran overnight on my PhD computer, so it's defs not at the limits of what is computationally possible with this kind of procedure. I was very much was motivated by what I could do in time for this upcoming conference!
References

[1] Gladstein, A. L., & Hammer, M. F. (2019). Substructured Population Growth in the Ashkenazi Jews Inferred with Approximate Bayesian Computation. Molecular Biology and Evolution, 36(6), 1162–1171. https://doi.org/10.1093/molbev/msz047
[2] Palamara, P. F., Lencz, T., Darvasi, A., & Pe’er, I. (2012). Length distributions of identity by descent reveal fine-scale demographic history. American Journal of Human Genetics, 91(5), 809–822. https://doi.org/10.1016/j.ajhg.2012.08.030
[3] Gusev, A., Palamara, P. F., Aponte, G., Zhuang, Z., Darvasi, A., Gregersen, P., & Pe’er, I. (2012). The architecture of long-range haplotypes shared within and across populations. Molecular Biology and Evolution, 29(2), 473–486. https://doi.org/10.1093/molbev/msr133
[4] Carmi, S., Hui, K. Y., Kochav, E., Liu, X., Xue, J., Grady, F., … Pe’Er, I. (2014). Sequencing an Ashkenazi reference panel supports population-targeted personal genomics and illuminates Jewish and European origins. Nature Communications, 5, 1–9. https://doi.org/10.1038/ncomms5835
[5] Guha, S., Rosenfeld, J. A., Malhotra, A. K., Lee, A. T., Gregersen, P. K., Kane, J. M., … Lencz, T. (2012). Implications for health and disease in the genetic signature of the Ashkenazi Jewish population. Genome Biology, 13(1). https://doi.org/10.1186/gb-2012-13-1-r2
[6] Ralph, P., & Coop, G. (2013). The Geography of Recent Genetic Ancestry across Europe. PLoS Biology, 11(5). https://doi.org/10.1371/journal.pbio.1001555
[7] Martin, A. R., Karczewski, K. J., Kerminen, S., Kurki, M. I., Sarin, A. P., Artomov, M., … Daly, M. J. (2018). Haplotype Sharing Provides Insights into Fine-Scale Population History and Disease in Finland. American Journal of Human Genetics, 102(5), 760–775. https://doi.org/10.1016/j.ajhg.2018.03.003
[8] Quinto-Cortés, C. D., Woerner, A. E., Watkins, J. C., & Hammer, M. F. (2018). Modeling SNP array ascertainment with Approximate Bayesian Computation for demographic inference. Scientific Reports, 8(1), 1–10. https://doi.org/10.1038/s41598-018-28539-y
[9] Kelleher, J., Etheridge, A. M., & McVean, G. (2016). Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLOS Computational Biology, 12(5), e1004842. https://doi.org/10.1371/journal.pcbi.1004842
[10] Rahmann, S., & Ko, J. (2012). Snakemake — a scalable bioinformatics workflow engine, 28(19), 2520–2522. https://doi.org/10.1093/bioinformatics/bts480
[11] Raynal, L., Marin, J. M., Pudlo, P., Ribatet, M., Robert, C. P., & Estoup, A. (2019). ABC random forests for Bayesian parameter inference. Bioinformatics, 35(10), 1720–1728. https://doi.org/10.1093/bioinformatics/bty867
[12] Blum, M. G. B., & François, O. (2010). Non-linear regression models for Approximate Bayesian Computation. Statistics and Computing, 20(1), 63–73. https://doi.org/10.1007/s11222-009-9116-0
[13] Csilléry, K., François, O., & Blum, M. G. B. (2012). Abc: An R package for approximate Bayesian computation (ABC). Methods in Ecology and Evolution, 3(3), 475–479. https://doi.org/10.1111/j.2041-210X.2011.00179.x
[14] Nelson, D., Kelleher, J., Ragsdale, A. P., McVean, G., & Gravel, S. (2019). Coupling Wright-Fisher and coalescent dynamics for realistic simulation of population-scale datasets. BioRxiv, 674440. https://doi.org/10.1101/674440
Parameter	My mode	True value	HPDI 95	HPDI in [1]
log10(`N_waj`)	4.20	3.90	(3.64, 6.64)	(3.10, 6.48)
log10(`N_eaj`)	4.24	6.18	(3.93, 6.35)	(4.24, 6.70)
log10(`N_asj`)	3.28	3.04	(2.06, 4.44)	(2.00, 4.33)
`t_waj_eaj_d`	13.59	14.93	(11.22, 19.90)	(2.00, 29.00)