Scalable probabilistic PCA for large-scale genetic variation data
Autoři:
Aman Agrawal aff001; Alec M. Chiu aff002; Minh Le aff003; Eran Halperin aff002; Sriram Sankararaman aff002; Eran Halperin aff003; Sriram Sankararaman aff003
Působiště autorů:
Department of Computer Science, Indian Institute of Technology, Delhi, India
aff001; Bioinformatics Interdepartmental Program, University of California, Los Angeles, California, United States of America
aff002; Bioinformatics Interdepartmental Program, University of California, Los Angeles, California United States of America
aff002; Department of Computer Science, University of California, Los Angeles, California, United States of America
aff003; Department of Computer Science, University of California, Los Angeles, California United States of America
aff003; Department of Human Genetics, University of California, Los Angeles, California, United States of America
aff004; Department of Human Genetics, University of California, Los Angeles, California United States of America
aff004; Department of Anesthesiology and Perioperative Medicine, University of California, Los Angeles, California, United States of America
aff005; Department of Anesthesiology and Perioperative Medicine, University of California, Los Angeles, California United States of America
aff005; Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, California, United States of America
aff006; Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, California United States of America
aff006; Institute of Precision Health, University of California, Los Angeles, California, United States of America
aff007
Vyšlo v časopise:
Scalable probabilistic PCA for large-scale genetic variation data. PLoS Genet 16(5): e32767. doi:10.1371/journal.pgen.1008773
Kategorie:
Research Article
doi:
https://doi.org/10.1371/journal.pgen.1008773
Souhrn
Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.
Klíčová slova:
Algorithms – Genome-wide association studies – Genomic signal processing – Genomics statistics – Molecular genetics – principal component analysis – Singular value decomposition – Variant genotypes
Zdroje
1. Novembre J, Ramachandran S. Perspectives on human population structure at the cusp of the sequencing era. Annual review of genomics and human genetics. 2011;12:245–274. 21801023
2. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko A, Auton A, et al. Genes mirror geography within Europe. Nature. 2008;456(7219):274.
3. Yang WY, Novembre J, Eskin E, Halperin E. A model-based approach for analysis of spatial structure in genetic data. Nature genetics. 2012;44(6):725–731. doi: 10.1038/ng.2285 22610118
4. Baran Y, Quintela I, Carracedo Á, Pasaniuc B, Halperin E. Enhanced localization of genetic samples through linkage-disequilibrium correction. The American Journal of Human Genetics. 2013;92(6):882–894. doi: 10.1016/j.ajhg.2013.04.023 23726367
5. Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nature reviews Genetics. 2010;11(7):459. doi: 10.1038/nrg2813 20548291
6. Patterson N, Price AL, Reich D. Population Structure and Eigenanalysis. PLoS Genetics. 2006;2(12):e190+. doi: 10.1371/journal.pgen.0020190 17194218
7. Hanis CL, Chakraborty R, Ferrell RE, Schull WJ. Individual admixture estimates: disease associations and individual risk of diabetes and gallbladder disease among Mexican-Americans in Starr County, Texas. Am J Phys Anthropol. 1986;70(4):433–441. 3766713
8. Pritchard J, Stephens M, Donnelly P. Inference of Population Structure Using Multilocus Genotype Data. Genetics. 2000;155:945–959. 10835412
9. Chen C, Durand E, Forbes F, François O. Bayesian clustering algorithms ascertaining spatial population structure: a new computer program and a comparison study. Molecular Ecology Resources. 2007;7(5):747–756.
10. Engelhardt BE, Stephens M. Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoS genetics. 2010;6(9):e1001117. doi: 10.1371/journal.pgen.1001117 20862358
11. Jolliffe IT. Principal Component Analysis and Factor Analysis. In: Principal component analysis. Springer; 1986. p. 115–128.
12. Galinsky KJ, Bhatia G, Loh PR, Georgiev S, Mukherjee S, Patterson NJ, et al. Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia. The American Journal of Human Genetics. 2016;98(3):456–472. doi: 10.1016/j.ajhg.2015.12.022 26924531
13. Abraham G, Qiu Y, Inouye M. FlashPCA2: principal component analysis of biobank-scale genotype datasets. Bioinformatics. 2017;. 28475694
14. Prive F, Aschard H, Ziyatdinov A, Blum M. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics. 2018;34(16):2781–2787. doi: 10.1093/bioinformatics/bty185 29617937
15. Bose A, Kalantzis V, Kontopoulou EM, Elkady M, Paschou P, Drineas P. TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes. Bioinformatics. 2019;35(19):3679–3683. 30957838
16. Chang C, Chow C, Tellier L, Vattikuti S, Purcell S, Lee J. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. doi: 10.1186/s13742-015-0047-8 25722852
17. Price A, Patterson N, Plenge R, Weinblatt M, Shadick N, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006;38:904–909. 16862161
18. Canela-Xandri O, Law A, Gray A, Woolliams JA, Tenesa A. A new tool called DISSECT for analysing large genomic data sets using a Big Data approach. Nature communications. 2015;6:10162. doi: 10.1038/ncomms10162 26657010
19. Roweis ST. EM algorithms for PCA and SPCA. In: Advances in neural information processing systems; 1998. p. 626–632.
20. Tipping ME, Bishop CM. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1999;61(3):611–622.
21. Liberty E, Zucker SW. The mailman algorithm: A note on matrix–vector multiplication. Information Processing Letters. 2009;109(3):179–182.
22. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z 30305743
23. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics. 2007;81(3):559–575. doi: 10.1086/519795 17701901
24. The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65.
25. Shriver MD, Kennedy GC, Parra EJ, Lawson HA, Sonpar V, Huang J, et al. The genomic distribution of population substructure in four populations using 8,525 autosomal SNPs. Human genomics. 2004;1(4):274. doi: 10.1186/1479-7364-1-4-274 15588487
26. Tian C, Plenge RM, Ransom M, Lee A, Villoslada P, Selmi C, et al. Analysis and application of European genetic substructure using 300 K SNP information. PLoS genetics. 2008;4(1):e4. doi: 10.1371/journal.pgen.0040004 18208329
27. Wiegering A, Ruther U, Gerhardt C. The ciliary protein Rpgrip1l in development and disease. Dev Biol. 2018;442(1):60–68. 30075108
28. Delous M, Baala L, Salomon R, Laclef C, Vierkotten J, Tory K, et al. The ciliary gene RPGRIP1L is mutated in cerebello-oculo-renal syndrome (Joubert syndrome type B) and Meckel syndrome. Nature Genetics. 2007;39:875–881. 17558409
29. Devuyst O, Arnould VJ. Mutations in RPGRIP1L: extending the clinical spectrum of ciliopathies. Nephrology Dialysis Transplantation. 2008;23(5):1500–1503.
30. Khanna H, Davis EE, Murga-Zamalloa CA, Estrada-Cuzcano A, Lopez I, den Hollander AI, et al. A common allele in RPGRIP1L is a modifier of retinal degeneration in ciliopathies. Nature Genetics. 2009;41(6):739–45. doi: 10.1038/ng.366 19430481
31. Aschard H, Vilhjálmsson B, Greliche N, Morange P, Trégouët D, Kraft P. Maximizing the power of principal-component analysis of correlated phenotypes in genome-wide association studies. AJHG. 2014;94(5):662–76.
32. Korneev K, Atretkhany K, Drutskaya M, Grivennikov S, Kuprash D, Nedospasov S. TLR-signaling and proinflammatory cytokines as drivers of tumorigenesis. Cytokine. 2017;89:127–135. 26854213
33. Mockenhaupt F, Cramer J, Hamann L, Stegemann M, Eckert J, Oh N, et al. Toll-like receptor (TLR) polymorphisms in African children: Common TLR-4 variants predispose to severe malaria. PNAS. 2006;103(1):177–182. doi: 10.1073/pnas.0506803102 16371473
34. Van der Graaf C, Netea M, Morré S, Den Heijer M, Verweij P, Van der Meer J, et al. Toll-like receptor 4 Asp299Gly/Thr399Ile polymorphisms are a risk factor for Candida bloodstream infection. European Cytokine Network. 2006;17(1):29–34. 16613760
35. Field Y, Boyle EA, Telis N, Gao Z, Gaulton KJ, Golan D, et al. Detection of human adaptation during the past 2000 years. Science. 2016;354(6313):760–764. doi: 10.1126/science.aag0776 27738015
36. Albers, McVean. Dating genomic variants and shared ancestry in population-scale sequencing data. bioRxiv. 2019.
37. Wu Y, Sankararaman S. A scalable estimator of SNP heritability for biobank-scale data. Bioinformatics. 2018;34(13):i187–i194. doi: 10.1093/bioinformatics/bty253 29950019
38. Halko N, Martinsson PG, Tropp JA. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review. 2011;53(2):217–288.
39. Mathieson I, McVean G. Differential confounding of rare and common variants in spatially structured populations. Nature genetics. 2012;44(3):243. doi: 10.1038/ng.1074 22306651
40. Hellenthal G, Auton A, Falush D. Inferring Human Colonization History Using a Copying Model. PLoS Genet. 2008;4(5):e1000078. doi: 10.1371/journal.pgen.1000078 18497854
41. Li N, Stephens M. Modeling Linkage Disequilibrium and Identifying Recombination Hotspots Using Single-Nucleotide Polymorphism Data. Genetics. 2003;165(4):2213–2233. 14704198
42. Wen X, Stephens M. Using linear predictors to impute allele frequencies from summary or pooled genotype data. The annals of applied statistics. 2010;4(3):1158. 21479081
43. Schein AI, Saul LK, Ungar LH. A generalized linear model for principal component analysis of binary data. In: AISTATS. vol. 3; 2003. p. 10.
44. Li W, Cerise J, Yang Y, Han H. Application of t-SNE to human genetic data. J Bioinform Comput Biol. 2017;15(4):1750017. 28718343
45. Becht E, McInnes L, Healy J, Dutertre C, Kwok I, Ng L, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 2019;37:38–44.
46. Anderson TW, Rubin H. Statistical inference in factor analysis. In: Proceedings of the third Berkeley symposium on mathematical statistics and probability. vol. 5; 1956. p. 111–150.
47. Szlam A, Tulloch A, Tygert M. Accurate Low-Rank Approximations Via a Few Iterations of Alternating Least Squares. SIAM Journal on Matrix Analysis and Applications. 2017;38(2):425–433.
48. Lehoucq RB, Sorensen DC. Deflation techniques for an implicitly restarted Arnoldi iteration. SIAM Journal on Matrix Analysis and Applications. 1996;17(4):789–821.
49. Manichaikul A, Mychaleckyj J, Rich S, Daly K, Sale M, Chen W. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26(22):2867–2873. doi: 10.1093/bioinformatics/btq559 20926424
Článek vyšel v časopise
PLOS Genetics
2020 Číslo 5
- Nový algoritmus zpřesní predikci rizika kardiovaskulárních onemocnění
- Není statin jako statin aneb praktický přehled rozdílů jednotlivých molekul
- Metamizol jako analgetikum první volby: kdy, pro koho, jak a proč?
- Jak se válečná Ukrajina stala semeništěm superrezistentních bakterií
- Mohou být časté noční můry předzvěstí demence?
Nejčtenější v tomto čísle
- A new neuropeptide insect parathyroid hormone iPTH in the red flour beetle Tribolium castaneum
- The domesticated transposase ALP2 mediates formation of a novel Polycomb protein complex by direct interaction with MSI1, a core subunit of Polycomb Repressive Complex 2 (PRC2)
- Polyploidy breaks speciation barriers in Australian burrowing frogs Neobatrachus
- The phosphorelay BarA/SirA activates the non-cognate regulator RcsB in Salmonella enterica