Multi-locus Analysis of Genomic Time Series Data from Experimental Evolution
A growing number of experimental biologists are generating “evolve-and-resequence” (E&R) data in which the genomes of an experimental population are repeatedly sequenced over time. The resulting time series data provide important new insights into the dynamics of evolution. This type of analysis has only recently been made possible by next-generation sequencing, and new statistical procedures are required to analyze this novel data source. We present such a procedure here, and apply it to both simulated and real E&R data.
Published in the journal:
PLoS Genet 11(4): e1005069. doi:10.1371/journal.pgen.1005069
Category:
Research Article
doi:
https://doi.org/10.1371/journal.pgen.1005069
Introduction
A common study design in population genetics consists of collecting genomic variation data from living organisms to make inferences about unobserved evolutionary and biological phenomena. The many areas where this design has been applied include demographic inference (see [1] for a recent review), recombination rate estimation [2–6], and detection of natural selection [7–13]. Recently, there has been much interest in utilizing time series genetic data—e.g., from ancient DNA [14–21], experimental evolution of a population under controlled laboratory environments [22–26], or direct measurements in fast evolving populations [27]—to enhance our ability to probe into evolution. In particular, understanding the genetic basis of adaptation to changes in the environment can be significantly facilitated by such temporal data. Specifically, the dynamics of allele frequencies in an evolving population potentially convey added information about how the genome functions [28], information which is inaccessible to methods which operate only on a static snapshot of that genome.
An experimental methodology which serially interrogates the genomes of a controlled population over time could potentially yield new insights. In fact, this methodology can now be realized thanks to the advent of next-generation sequencing. By sequencing successive generations of model organisms raised in a controlled environment, genetic time series data can be generated which describe evolution at nucleotide resolution [24, 25, 28, 29]. This so-called evolve-and-resequence (henceforth, E&R) methodology is fundamentally different from the observational approach described above, and new inference procedures are needed to analyze this type of data.
In this paper, we present such a procedure and study its ability to perform a number of testing and estimation tasks relevant to population genetics. Our method is based on an approximation to the multi-locus Wright-Fisher process, and is well-suited to the small population, discrete generation, and random mating setting in which many E&R experiments are conducted. Furthermore, because it is based on a canonical population genetic model of genome evolution, our method can directly estimate population genetic quantities such as fitness, dominance, recombination rate, and effective population size. It can also be used to design future experiments with sufficient power to reliably infer these quantities.
We first use simulated data to demonstrate the utility of our method. Then, we apply our method to analyze genome-wide data from a real E&R experiment of D. melanogaster, designed to study the adaptation to a novel laboratory environment over tens of generations.
Related work
There is a small but growing literature on the analysis of evolve-and-resequence data. Feder et al. [30] present a statistical test for detecting selection at a single biallelic locus in time series data. (Although it is not a major focus, their method can also be used to estimate the selection parameter.) Similar to our method, they model the sample paths of the Wright-Fisher process as Gaussian perturbations around a deterministic trajectory in order to obtain a computable test statistic. However, their aim is slightly different from ours in that they analyze yeast and bacteria data sets where the population size is both large and must be estimated from data. Here we focus on population sizes which are smaller and more typical of experiments performed on higher organisms, for example mice or Drosophila. We generally assume that the effective population size is known but also test our ability to estimate it from data. Also, because of the increased amount of drift present in the small population regime, we necessarily restrict our attention to selection coefficients which are somewhat larger than those considered by Feder et al. Finally, although Feder et al. do study the performance of their method when time series data are corrupted by noise due to finite sampling (as in e.g. a next-generation sequencing experiment), they do not model this effect. Here we properly account for the effect of sampling by integrating over the latent space of population-level frequencies when computing the likelihood.
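To make "integrating over the latent space of population-level frequencies" concrete, the following is a minimal sketch of a forward (hidden Markov) computation for a single biallelic locus. It is not the multi-locus Gaussian approximation developed in this paper: it uses the exact Wright-Fisher transition matrix, which is only feasible for very small populations. The function names, the genic selection scheme, the uniform prior on the initial allele count, and the binomial model for read sampling are all assumptions made for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def wf_transition(n_copies, s):
    """P[i][j] = Pr(j copies of A1 next generation | i copies now),
    for one locus with genic selection coefficient s (assumed model)."""
    P = []
    for i in range(n_copies + 1):
        x = i / n_copies
        x_sel = x * (1 + s) / (1 + s * x)  # frequency after selection
        P.append([binom_pmf(j, n_copies, x_sel) for j in range(n_copies + 1)])
    return P

def likelihood(reads, n_copies, s, gens_between):
    """Pr(read counts | s), summed over all latent population allele counts.

    reads: list of (a1_reads, depth) pairs, one per sampled time point.
    gens_between: generations elapsed between consecutive samples.
    """
    states = range(n_copies + 1)
    P = wf_transition(n_copies, s)

    def emit(alpha, k, depth):
        # Weight each latent count by the chance of seeing k A1 reads.
        return [a * binom_pmf(k, depth, i / n_copies)
                for i, a in zip(states, alpha)]

    # Uniform prior over the initial population count (an assumption).
    alpha = [1.0 / (n_copies + 1)] * (n_copies + 1)
    alpha = emit(alpha, *reads[0])
    for (k, depth), g in zip(reads[1:], gens_between):
        for _ in range(g):  # propagate the latent state g generations
            alpha = [sum(alpha[i] * P[i][j] for i in states) for j in states]
        alpha = emit(alpha, k, depth)
    return sum(alpha)

# Hypothetical data: A1 read counts rising over 10 generations, 2N = 40.
reads = [(10, 50), (25, 50), (40, 50)]
print(likelihood(reads, 40, 0.0, [5, 5]))
print(likelihood(reads, 40, 0.3, [5, 5]))
```

Comparing the two printed values gives a crude likelihood-ratio view of selection versus neutrality for this toy trajectory. The cost of the exact matrix grows quadratically in the number of gene copies, which is one reason approximations are attractive for realistic population sizes and multiple linked loci.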
Another related work is Baldwin-Brown et al. [31], which presents a thorough study of the effects of sequencing effort, replicate count, strength of selection, and other parameters on the power to detect and localize a single selected locus segregating in a 1 Mb region. Results are obtained by simulating data under different experimental conditions and comparing the resulting distributions of allele trajectories under selection and neutrality using a modified form of t-test. Because it is not model-based, this method is incapable of performing parameter estimation. As a result of their study, Baldwin-Brown et al. present a number of design recommendations to experimenters seeking to attain a given level of power to detect selection. In a related work, Kofler and Schlötterer [32] carried out forward simulations of whole genomes to provide guidelines for designing E&R experiments to maximize the power to detect selected variants.
Illingworth et al. [33] derive a probabilistic model for time series data generated from large, asexually reproducing populations. The population size is sufficiently large (on the order of ∼10⁸) that population allele frequencies evolve quasi-deterministically. The deterministic trajectories are governed by a system of differential equations describing the effect of a selected (“driver”) mutation on nearby linked neutral (“passenger”) mutations. Randomness arises due to the finite sampling of alleles by sequencing. The main difference between the setting of Illingworth et al. and our own concerns genetic drift. While drift may be ignored when studying a large population of microorganisms, we show that it confounds our ability to detect and estimate selection in populations of order ∼10³. Thus, for E&R studies on (smaller) populations of macroscopic organisms, methods which assume that allele frequencies evolve deterministically may not perform as well as those which explicitly take drift into account.
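A back-of-the-envelope comparison gives a sense of scale. It uses the standard single-locus Wright-Fisher approximations with assumed values p = 0.5 and s = 0.01 and, for simplicity, 2N gene copies in both cases; none of these numbers are taken from the studies discussed here.

```latex
\begin{align*}
\text{drift:}\quad
  & \operatorname{sd}(\Delta p) = \sqrt{\frac{p(1-p)}{2N}} \approx
    \begin{cases}
      1.1 \times 10^{-2}, & N = 10^{3} \\
      3.5 \times 10^{-5}, & N = 10^{8}
    \end{cases} \\
\text{selection:}\quad
  & \mathbb{E}[\Delta p] \approx s\,p(1-p) = 2.5 \times 10^{-3}
\end{align*}
```

Under these assumptions, the per-generation drift noise at N = 10³ is roughly four times the expected change due to selection, whereas at N = 10⁸ it is about seventy times smaller, which is consistent with treating trajectories as quasi-deterministic only in the large-population setting.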
Topa et al. [34] present a Bayesian model for single-locus time series data obtained by next-generation sequencing. In each time period, the allele count is modeled as a draw from a binomial distribution with number of trials equal to the depth of sequencer coverage, and success probability equal to the population-level allele frequency. The posterior allele frequency distribution is used to test for selection by comparing a neutral model to one in which the unobserved allele frequencies are allowed to depend on time. In the non-neutral case, a Gaussian process is used to allow for directional selection acting on the posterior allele frequency distributions.
Finally, Lynch et al. [35] derive a likelihood-based method for estimating population allele frequency at a single locus in pooled sequencing data. The method allows for the possibility of sequencing errors as well as subsampling the population prior to sequencing. Using theoretical results as well as simulations, the authors give guidelines on the (subsampled) population size and coverage depth needed to reliably detect a difference in allele frequency between two populations. Unlike the other methods surveyed here, the approach of Lynch et al. is not designed to analyze time series data. Hence the data requirements needed to reliably detect allele frequency changes using their method (for example, sequencing coverage depth of at least 100 reads) are potentially greater than for methods that are informed by a population-genetic model of genome evolution over time.
Novelty of our method
Our method differs from the above-mentioned approaches in several regards. To the best of our knowledge, ours is the first method capable of analyzing time series data from multiple linked sites jointly. We find that this is advantageous when studying selection in E&R data. Furthermore, it enables us to analyze features of these data which cannot be studied using single-locus models, such as local levels of linkage disequilibrium and the effect of a recombination hotspot. Additionally, because our model is based on a principled approximation to the Wright-Fisher process, it can numerically estimate the selection coefficient, dominance parameter, recombination rates, and other population genetic quantities of interest. In this way it is distinct from the aforementioned simulation-based methods [31, 32], methods which only focus on testing for selection [30, 31, 34], or methods based on general statistical procedures which are not specific to population genetics [34, 35].
Software and data availability
Source code implementing the method described in this paper is included in S1 Code. The experimental data analyzed in the section “Analysis of a real E&R experiment data” are from Franssen et al. [36] and are available on the Dryad digital repository http://dx.doi.org/10.5061/dryad.403b2.
Results
As described above, the primary methodological advance of this paper is to derive a tractable approximation to the discrete, multi-locus Wright-Fisher model with selection. This approximation enables us to perform statistical inference on time-series data generated in E&R experiments. Before studying how our approximation performs on both simulated and real data, we give a brief overview of its motivation and derivation.
A brief overview of the method
We consider the following model of an E&R experiment. A sexually reproducing population of N diploid individuals is evolved in discrete, non-overlapping generations. Pooled DNA sequencing [37, 38] is performed T times at generations t_1 < t_2 < ⋯ < t_T. At each segregating site in the resulting data set, we assume that there are two alleles, denoted A_0 and A_1. (As will be seen below, up to a change in the sign of the selection coefficient associated with each site, the model is agnostic to which allele is called A_0 or A_1.) Let L and R denote the number of loci and the number of experimental replicates, respectively. The array D ∊ [0, 1]^{T×L×R} records the relative frequency with which the A_1 allele was observed for each combination of generation, locus and replicate.
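The data layout just described can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not part of the model in this paper: loci are simulated independently (no linkage or recombination), selection is genic, every replicate starts at the same frequency p0, and pooled sequencing is treated as binomial sampling of reads at a fixed depth. The function name `simulate_er` and all of its parameters are hypothetical.

```python
import random

def simulate_er(N, s_per_locus, sample_gens, n_reps, depth, p0=0.5, seed=0):
    """Return D[t][l][r], the observed A1 read fraction at sampled
    generation t, locus l and replicate r.

    N: number of diploid individuals (2N gene copies per locus).
    s_per_locus: one genic selection coefficient per locus.
    sample_gens: increasing list of generations at which pools are sequenced.
    """
    rng = random.Random(seed)
    n_copies = 2 * N
    T, L = len(sample_gens), len(s_per_locus)
    D = [[[0.0] * n_reps for _ in range(L)] for _ in range(T)]
    for r in range(n_reps):
        for l, s in enumerate(s_per_locus):
            x, gen = p0, 0
            for t, g in enumerate(sample_gens):
                while gen < g:
                    # Selection, then binomial resampling of 2N gene copies.
                    x_sel = x * (1 + s) / (1 + s * x)
                    x = sum(rng.random() < x_sel
                            for _ in range(n_copies)) / n_copies
                    gen += 1
                # Pooled sequencing: sample `depth` reads at frequency x.
                a1_reads = sum(rng.random() < x for _ in range(depth))
                D[t][l][r] = a1_reads / depth
    return D

# One neutral and one selected locus, three replicates, three time points.
D = simulate_er(N=100, s_per_locus=[0.0, 0.1], sample_gens=[0, 10, 20],
                n_reps=3, depth=50)
print(len(D), len(D[0]), len(D[0][0]))  # T, L, R
```

Indexing as `D[t][l][r]` mirrors the T×L×R array in the text; a nested list is used only to keep the sketch free of external dependencies.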
Given D and a vector of underlying population-genetic parameters θ, let ℙ(D∣θ) denote the model likelihood. In an idealized E&R experiment, generations are discrete and non-overlapping, mating is random, and the population size is fixed, so the likelihood is well approximated by the classical Wright-Fisher model of genome evolution [39]:
where ℙ_θ(G_i ∣ G_{i−1}) is the transition function of the discrete, many-locus Wright-Fisher Markov chain from genomic configuration G_{i−1} to G_i given parameters θ,
References
1. Veeramah KR, Hammer MF (2014) The impact of whole-genome sequencing on the reconstruction of human population history. Nature Reviews Genetics 15: 149–162. doi: 10.1038/nrg3625 24492235
2. McVean GAT, Myers SR, Hunt S, Deloukas P, Bentley DR, et al. (2004) The fine-scale structure of recombination rate variation in the human genome. Science 304: 581–584. doi: 10.1126/science.1092500 15105499
3. Myers S, Bottolo L, Freeman C, McVean G, Donnelly P (2005) A fine-scale map of recombination rates and hotspots across the human genome. Science 310: 321–324. doi: 10.1126/science.1117196 16224025
4. Auton A, Fledel-Alon A, Pfeifer S, Venn O, Ségurel L, et al. (2012) A fine-scale chimpanzee genetic map from population sequencing. Science 336: 193–198. doi: 10.1126/science.1216872 22422862
5. Chan AH, Jenkins PA, Song YS (2012) Genome-wide fine-scale recombination rate variation in Drosophila melanogaster. PLoS Genetics 8: e1003090. doi: 10.1371/journal.pgen.1003090 23284288
6. Auton A, Li YR, Kidd J, Oliveira K, Nadel J, et al. (2013) Genetic recombination is targeted towards gene promoter regions in dogs. PLoS Genetics 9: e1003984. doi: 10.1371/journal.pgen.1003984 24348265
7. Nielsen R, Bustamante C, Clark AG, Glanowski S, Sackton TB, et al. (2005) A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biology 3: e170. doi: 10.1371/journal.pbio.0030170 15869325
8. Bustamante CD, Fledel-Alon A, Williamson S, Nielsen R, Hubisz MT, et al. (2005) Natural selection on protein-coding genes in the human genome. Nature 437: 1153–1157. doi: 10.1038/nature04240 16237444
9. Sabeti PC, Schaffner SF, Fry B, Lohmueller J, Varilly P, et al. (2006) Positive natural selection in the human lineage. Science 312: 1614–1620. doi: 10.1126/science.1124309 16778047
10. Nielsen R, Hellmann I, Hubisz M, Bustamante C, Clark AG (2007) Recent and ongoing selection in the human genome. Nature Reviews Genetics 8: 857–868. doi: 10.1038/nrg2187 17943193
11. Sella G, Petrov DA, Przeworski M, Andolfatto P (2009) Pervasive natural selection in the Drosophila genome? PLoS Genetics 5: e1000495. doi: 10.1371/journal.pgen.1000495 19503600
12. Hernandez RD, Kelley JL, Elyashiv E, Melton SC, Auton A, et al. (2011) Classic selective sweeps were rare in recent human evolution. Science 331: 920–924. doi: 10.1126/science.1198878 21330547
13. Langley CH, Stevens K, Cardeno C, Lee YCG, Schrider DR, et al. (2012) Genomic variation in natural populations of Drosophila melanogaster. Genetics 192: 533–598. doi: 10.1534/genetics.112.142018 22673804
14. Hummel S, Schmidt D, Kremeyer B, Herrmann B, Oppermann M (2005) Detection of the CCR5-Delta32 HIV resistance gene in bronze age skeletons. Genes and Immunity 6: 371–374. doi: 10.1038/sj.gene.6364172 15815693
15. Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, et al. (2010) A draft sequence of the Neandertal genome. Science 328: 710–722. doi: 10.1126/science.1188021 20448178
16. Reich D, Green RE, Kircher M, Krause J, Patterson N, et al. (2010) Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468: 1053–1060. doi: 10.1038/nature09710 21179161
17. Ludwig A, Pruvost M, Reissmann M, Benecke N, Brockmann GA, et al. (2009) Coat color variation at the beginning of horse domestication. Science 324: 485. doi: 10.1126/science.1172750 19390039
18. Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, et al. (2012) A high-coverage genome sequence from an archaic Denisovan individual. Science 338: 222–226. doi: 10.1126/science.1224344 22936568
19. Orlando L, Ginolhac A, Zhang G, Froese D, Albrechtsen A, et al. (2013) Recalibrating Equus evolution using the genome sequence of an early Middle Pleistocene horse. Nature 499: 74–78. doi: 10.1038/nature12323 23803765
20. Sankararaman S, Mallick S, Dannemann M, Prüfer K, Kelso J, et al. (2014) The genomic landscape of Neanderthal ancestry in present-day humans. Nature 507: 354–357. doi: 10.1038/nature12961 24476815
21. Steinrücken M, Bhaskar A, Song YS (2014) A novel spectral method for inferring general diploid selection from time series genetic data. Annals of Applied Statistics 8: 2203–2222. doi: 10.1214/14-AOAS764 25598858
22. Wiser MJ, Ribeck N, Lenski RE (2013) Long-term dynamics of adaptation in asexual populations. Science 342: 1364–1367. doi: 10.1126/science.1243357 24231808
23. Lang GI, Rice DP, Hickman MJ, Sodergren E, Weinstock GM, et al. (2013) Pervasive genetic hitchhiking and clonal interference in forty evolving yeast populations. Nature 500: 571–574. doi: 10.1038/nature12344 23873039
24. Burke MK, Dunham JP, Shahrestani P, Thornton KR, Rose MR, et al. (2010) Genome-wide analysis of a long-term evolution experiment with Drosophila. Nature 467: 587–590. doi: 10.1038/nature09352 20844486
25. Orozco ter Wengel P, Kapun M, Nolte V, Kofler R, Flatt T, et al. (2012) Adaptation of Drosophila to a novel laboratory environment reveals temporally heterogeneous trajectories of selected alleles. Molecular Ecology 21: 4931–4941. doi: 10.1111/j.1365-294X.2012.05673.x
26. Tenaillon O, Rodríguez-Verdugo A, Gaut RL, McDonald P, Bennett AF, et al. (2012) The molecular diversity of adaptive convergence. Science 335: 457–461. doi: 10.1126/science.1212986 22282810
27. Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, et al. (1999) Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. Journal of Virology 73: 10489–10502. 10559367
28. Burke MK (2012) How does adaptation sweep through the genome? Insights from long-term selection experiments. Proceedings of the Royal Society B: Biological Sciences 279: 5029–5038. doi: 10.1098/rspb.2012.0799 22833271
29. Parts L, Cubillos FA, Warringer J, Jain K, Salinas F, et al. (2011) Revealing the genetic structure of a trait by sequencing a population under selection. Genome Research 21: 1131–1138. doi: 10.1101/gr.116731.110 21422276
30. Feder AF, Kryazhimskiy S, Plotkin JB (2014) Identifying signatures of selection in genetic time series. Genetics 196: 509–522. doi: 10.1534/genetics.113.158220 24318534
31. Baldwin-Brown JG, Long AD, Thornton KR (2014) The power to detect quantitative trait loci using resequenced, experimentally evolved populations of diploid, sexual organisms. Molecular Biology and Evolution 31: 1040–1055. doi: 10.1093/molbev/msu048 24441104
32. Kofler R, Schlötterer C (2014) A guide for the design of evolve and resequencing studies. Molecular Biology and Evolution 31: 474–483. doi: 10.1093/molbev/mst221 24214537
33. Illingworth CJR, Parts L, Schiffels S, Liti G, Mustonen V (2012) Quantifying selection acting on a complex trait using allele frequency time series data. Molecular Biology and Evolution 29: 1187–1197. doi: 10.1093/molbev/msr289 22114362
34. Topa H, Jónás Á, Kofler R, Kosiol C, Honkela A (2014) Gaussian process test for high-throughput sequencing time series: application to experimental evolution. arXiv:1403.4086 [q-bio.PE].
35. Lynch M, Bost D, Wilson S, Maruki T, Harrison S (2014) Population-genetic inference from pooled-sequencing data. Genome Biology and Evolution 6: 1210–1218. doi: 10.1093/gbe/evu085 24787620
36. Franssen SU, Nolte V, Tobler R, Schlötterer C (2015) Patterns of linkage disequilibrium and long range hitchhiking in evolving experimental Drosophila melanogaster populations. Molecular Biology and Evolution, 32: 495–509. doi: 10.1093/molbev/msu320
37. Futschik A, Schlötterer C (2010) The next generation of molecular markers from massively parallel sequencing of pooled DNA samples. Genetics 186: 207–218. doi: 10.1534/genetics.110.114397 20457880
38. Schlötterer C, Tobler R, Kofler R, Nolte V (2014) Sequencing pools of individuals—mining genome-wide polymorphism data without big funding. Nature Reviews Genetics 15: 749–763. doi: 10.1038/nrg3803 25246196
39. Ewens WJ (1979) Mathematical Population Genetics. Springer Verlag.
40. Hazel JR (1995) Thermal adaptation in biological membranes: is homeoviscous adaptation the explanation? Annual Review of Physiology 57: 19–42. doi: 10.1146/annurev.ph.57.030195.000315 7778864
41. Comeron JM, Ratnappan R, Bailin S (2012) The many landscapes of recombination in Drosophila melanogaster. PLoS Genetics 8: e1002905. doi: 10.1371/journal.pgen.1002905 23071443
42. Singh ND, Stone EA, Aquadro CF, Clark AG (2013) Fine-scale heterogeneity in crossover rate in the garnet-scalloped region of the Drosophila melanogaster X chromosome. Genetics 194: 375–387. doi: 10.1534/genetics.112.146746 23410829
43. Cutler DJ, Jensen JD (2010) To pool, or not to pool? Genetics 186: 41–43. doi: 10.1534/genetics.110.121012 20855575
44. Gautier M, Foucaud J, Gharbi K, Cézard T, Galan M, et al. (2013) Estimation of population allele frequencies from next-generation sequencing data: pool-versus individual-based genotyping. Molecular Ecology 22: 3766–3779. doi: 10.1111/mec.12360 23730833
45. Lynch M, Bost D, Wilson S, Maruki T, Harrison S (2014) Population-genetic inference from pooled-sequencing data. Genome Biology and Evolution 6: 1210–1218. doi: 10.1093/gbe/evu085 24787620
46. Kirkpatrick M, Johnson T, Barton N (2002) General models of multilocus evolution. Genetics 161: 1727. 12196414
47. Barton NH, Otto SP (2005) Evolution of recombination due to random drift. Genetics 169: 2353–2370. doi: 10.1534/genetics.104.032821 15687279
48. Stephan W, Song YS, Langley CH (2006) The hitchhiking effect on linkage disequilibrium between linked neutral loci. Genetics 172: 2647–2663. doi: 10.1534/genetics.105.050179 16452153
49. Hudson RR (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18: 337–338. doi: 10.1093/bioinformatics/18.2.337 11847089
50. Li H, Stephan W (2006) Inferring the demographic history and rate of adaptive substitution in Drosophila. PLoS Genetics 2: e166. doi: 10.1371/journal.pgen.0020166 17040129
51. Peng B, Kimmel M (2005) simuPOP: a forward-time population genetics simulation environment. Bioinformatics 21: 3686–3687. doi: 10.1093/bioinformatics/bti584 16020469
Article published in: PLOS Genetics, 2015, Issue 4