UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts
Authors:
Alex Diaz-Papkovich aff001; Luke Anderson-Trocmé aff002; Chief Ben-Eghan aff002; Simon Gravel aff002
Author affiliations:
aff001: Quantitative Life Sciences, McGill University, Montreal, Québec, Canada
aff002: McGill University and Genome Quebec Innovation Centre, Montreal, Québec, Canada
aff003: Department of Human Genetics, McGill University, Montreal, Québec, Canada
Published in:
UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts. PLoS Genet 15(11): e1008432. doi:10.1371/journal.pgen.1008432
Category:
Research Article
doi:
https://doi.org/10.1371/journal.pgen.1008432
Summary
Human populations feature both discrete and continuous patterns of variation. Current analysis approaches struggle to jointly identify these patterns because of modelling assumptions, mathematical constraints, or numerical challenges. Here we apply uniform manifold approximation and projection (UMAP), a non-linear dimension reduction tool, to three well-studied genotype datasets and discover overlooked subpopulations within the American Hispanic population, fine-scale relationships between geography, genotypes, and phenotypes in the UK population, and cryptic structure in the Thousand Genomes Project data. This approach is well-suited to the influx of large and diverse data and opens new lines of inquiry in population-scale datasets.
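The workflow summarized above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' pipeline: the genotype matrix is simulated, the helper names (`simulate_genotypes`, `top_pcs`, `umap_embed`) are hypothetical, and reducing to leading principal components before UMAP is a common choice that the summary itself does not specify. The UMAP step assumes the third-party `umap-learn` package and is only defined here, not run.

```python
import numpy as np


def simulate_genotypes(n_per_pop=50, n_snps=500, n_pops=3, seed=0):
    """Toy genotype matrix (individuals x SNPs, values 0/1/2).

    Purely synthetic: each population draws its own allele frequencies.
    """
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for pop in range(n_pops):
        freqs = rng.uniform(0.05, 0.95, size=n_snps)
        # Each genotype is the number of alternate alleles out of two copies.
        blocks.append(rng.binomial(2, freqs, size=(n_per_pop, n_snps)))
        labels += [pop] * n_per_pop
    return np.vstack(blocks).astype(float), np.array(labels)


def top_pcs(genotypes, n_pcs=10):
    """Project individuals onto the leading principal components via SVD."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]


def umap_embed(pcs, n_neighbors=15, min_dist=0.5, seed=0):
    """2D UMAP embedding of PC coordinates (assumes umap-learn is installed)."""
    import umap  # imported lazily so the PCA step works without umap-learn

    reducer = umap.UMAP(n_components=2, n_neighbors=n_neighbors,
                        min_dist=min_dist, random_state=seed)
    return reducer.fit_transform(pcs)


genotypes, labels = simulate_genotypes()
pcs = top_pcs(genotypes)
```

With `umap-learn` available, `umap_embed(pcs)` would return one 2D coordinate per individual, which can be scatter-plotted and coloured by `labels`. The parameter values shown are arbitrary starting points, not values taken from the article.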
Keywords:
African people – Caribbean – Data visualization – Ethnicities – Europe – Hispanic people – Chinese people – principal component analysis
Tags
Genetics, Reproductive medicine
Article published in
PLOS Genetics
2019, Issue 11