On the cross-population generalizability of gene expression prediction models
Authors:
Kevin L. Keys aff001; Angel C. Y. Mak aff001; Marquitta J. White aff001; Walter L. Eckalbar aff001; Andrew W. Dahl aff001; Joel Mefford aff001; Anna V. Mikhaylova aff003; María G. Contreras aff001; Jennifer R. Elhawary aff001; Celeste Eng aff001; Donglei Hu aff001; Scott Huntsman aff001; Sam S. Oh aff001; Sandra Salazar aff001; Michael A. Lenoir aff005; Jimmie C. Ye aff006; Timothy A. Thornton aff003; Noah Zaitlen aff008; Esteban G. Burchard aff001; Christopher R. Gignoux aff009
Author affiliations:
aff001: Department of Medicine, University of California, San Francisco, California, United States of America
aff002: Berkeley Institute for Data Science, University of California, Berkeley, California, United States of America
aff003: Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
aff004: San Francisco State University, San Francisco, California, United States of America
aff005: Bay Area Pediatrics, Oakland, California, United States of America
aff006: Department of Epidemiology and Biostatistics, University of California, San Francisco, California, United States of America
aff007: Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California, United States of America
aff008: Department of Neurology, University of California, Los Angeles, California, United States of America
aff009: Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, Colorado, United States of America
aff010: Department of Biostatistics and Informatics, School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, Colorado, United States of America
Published in:
On the cross-population generalizability of gene expression prediction models. PLoS Genet 16(8): e1008927. doi:10.1371/journal.pgen.1008927
Category:
Research Article
doi:
https://doi.org/10.1371/journal.pgen.1008927
Abstract
The genetic control of gene expression is a core component of human physiology. For the past several years, transcriptome-wide association studies have leveraged large datasets of linked genotype and RNA sequencing information to create a powerful gene-based test of association that has been used in dozens of studies. While numerous discoveries have been made, the populations in the training data are overwhelmingly of European descent, and little is known about the generalizability of these models to other populations. Here, we test for cross-population generalizability of gene expression prediction models using a dataset of African American individuals with RNA-Seq data in whole blood. We find that the default models trained in large datasets such as GTEx and DGN fare poorly in African Americans, with a notable reduction in prediction accuracy when compared to European Americans. We replicate these limitations in cross-population generalizability using the five populations in the GEUVADIS dataset. Via realistic simulations of both populations and gene expression, we show that accurate cross-population generalizability of transcriptome prediction only arises when eQTL architecture is substantially shared across populations. In contrast, models with non-identical eQTLs showed patterns similar to real-world data. Therefore, generating RNA-Seq data in diverse populations is a critical step towards multi-ethnic utility of gene expression prediction.
Keywords:
African American people – Europe – Forecasting – Gene expression – Gene prediction – Phenotypes – Population genetics – Serial analysis of gene expression
This article was published in the journal
PLOS Genetics
2020, Issue 8