A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices
Authors:
James A. Watson aff001; Aimee R. Taylor aff003; Elizabeth A. Ashley aff002; Arjen Dondorp aff001; Caroline O. Buckee aff003; Nicholas J. White aff001; Chris C. Holmes aff006
Author affiliations:
aff001: Mahidol-Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University, Bangkok, Thailand
aff002: Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
aff003: Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, USA
aff004: Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
aff005: Lao-Oxford-Mahosot Hospital Wellcome Trust Research Unit, Vientiane, Laos
aff006: Department of Statistics, University of Oxford, Oxford, United Kingdom
aff007: Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
Published in:
A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices. PLoS Genet 16(10): e1009037. doi:10.1371/journal.pgen.1009037
Category:
Research Article
doi:
https://doi.org/10.1371/journal.pgen.1009037
Abstract
Genetic surveillance of malaria parasites supports malaria control programmes, treatment guidelines and elimination strategies. Surveillance studies often pose questions about malaria parasite ancestry (e.g. how antimalarial resistance has spread) and employ statistical methods that characterise parasite population structure. Many of the methods used to characterise structure are unsupervised machine learning algorithms which depend on a genetic distance matrix, notably principal coordinates analysis (PCoA) and hierarchical agglomerative clustering (HAC). PCoA and HAC are sensitive to both the definition of genetic distance and algorithmic specification. Importantly, neither algorithm infers malaria parasite ancestry. As such, PCoA and HAC can inform (e.g. via exploratory data visualisation and hypothesis generation), but not answer comprehensively, key questions about malaria parasite ancestry. We illustrate the sensitivity of PCoA and HAC using 393 Plasmodium falciparum whole genome sequences collected from Cambodia and neighbouring regions (where antimalarial resistance has emerged and spread recently) and we provide tentative guidance for the use and interpretation of PCoA and HAC in malaria parasite genetic epidemiology. This guidance includes a call for fully transparent and reproducible analysis pipelines that feature (i) a clearly outlined scientific question; (ii) a clear justification of analytical methods used to answer the scientific question along with discussion of any inferential limitations; (iii) publicly available genetic distance matrices when downstream analyses depend on them; and (iv) sensitivity analyses. To bridge the inferential disconnect between the output of non-inferential unsupervised learning algorithms and the scientific questions of interest, tailor-made statistical models are needed to infer malaria parasite ancestry. In the absence of such models speculative reasoning should feature only as discussion but not as results.
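The sensitivity of HAC to algorithmic specification can be illustrated with a minimal sketch. It is not the authors' pipeline: the six samples, their pairwise distances, and the helper `hac` are hypothetical, chosen only so that the same distance matrix yields different two-cluster partitions under single and complete linkage.

```python
from itertools import combinations


def hac(dist, k, linkage):
    """Agglomerate until k clusters remain; return the partition of sample indices."""
    # Single linkage uses the closest pair between clusters, complete the farthest.
    agg = {"single": min, "complete": max}[linkage]
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        # Find the two clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: agg(
                dist[i][j] for i in clusters[p[0]] for j in clusters[p[1]]
            ),
        )
        # combinations() guarantees a < b, so deleting b leaves a in place.
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


# Hypothetical toy data: six samples placed along a line with widening gaps,
# converted to a symmetric distance matrix scaled to [0, 1].
positions = [0, 10, 21, 33, 46, 60]
dist = [[abs(p - q) / 60 for q in positions] for p in positions]

single = hac(dist, 2, "single")
complete = hac(dist, 2, "complete")

print("single linkage:  ", single)    # [[0, 1, 2, 3, 4], [5]]
print("complete linkage:", complete)  # [[0, 1, 2, 3], [4, 5]]
```

Neither partition is an estimate of ancestry. The two outputs differ only because the linkage rule differs, which is the kind of sensitivity analysis the guidance above asks to be reported alongside the distance matrix.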
Keywords:
DNA recombination – Genetic epidemiology – Genetics – Machine learning algorithms – Malaria – Malarial parasites – Plasmodium – Population genetics
Article published in the journal
PLOS Genetics
2020, Issue 10