A study on separation of the protein structural types in amino acid sequence feature spaces
Authors:
Xiaogeng Wan aff001; Xinying Tan aff002
Author affiliations:
College of Mathematics and Physics, Beijing University of Chemical Technology, Beijing, China aff001; The Fourth Center of PLA General Hospital, Beijing, China aff002
Published in:
PLoS ONE 14(12)
Category:
Research Article
doi:
https://doi.org/10.1371/journal.pone.0226768
Abstract
Proteins are diverse in their sequences, structures and functions, and it is important to study the relations among the three. In this paper, we survey the relations between protein sequences and their structures. We use the natural vector (NV) and the averaged property factor (APF) features to represent protein sequences as feature vectors, and use the multi-class MSE and convex hull methods to separate proteins of different structural classes into different regions. We found that proteins from different structural classes are separable by hyperplanes and convex hulls in the natural vector feature space, where the feature vectors of different structural classes fall into disjoint regions or convex hulls in the high-dimensional feature space. The natural vector outperforms the averaged property factor method in identifying the structures, and the convex hull method outperforms the multi-class MSE in separating the feature points. These outcomes indicate strong connections between protein sequences and their structures, and may imply that the amino acid composition and sequence arrangement represented by the natural vectors have a greater influence on the structures than the averaged physical property factors of the amino acids.
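As an illustration of how a sequence can be mapped to a feature vector, the following is a minimal sketch of a natural vector computation. It assumes the commonly used 60-dimensional form, in which each of the 20 standard amino acids contributes its count, its mean position, and its normalized second central moment of position; the exact definition used in the paper is not given in this abstract, so the function name `natural_vector`, the residue ordering, and the handling of non-standard residues (ignored here) are assumptions.

```python
from collections import defaultdict

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def natural_vector(sequence):
    """Return a 60-element list: 20 counts, 20 mean positions, 20 second moments.

    Assumed definitions for amino acid k in a sequence of length n:
      n_k  = number of occurrences of k
      mu_k = mean of the 1-based positions of k
      D2_k = sum((position - mu_k)^2) / (n_k * n)
    Residues outside AMINO_ACIDS are ignored; absent residues contribute zeros.
    """
    sequence = sequence.upper()
    length = len(sequence)

    # Collect the 1-based positions of every residue type.
    positions = defaultdict(list)
    for index, residue in enumerate(sequence, start=1):
        positions[residue].append(index)

    counts, means, moments = [], [], []
    for residue in AMINO_ACIDS:
        where = positions.get(residue, [])
        n_k = len(where)
        if n_k == 0:
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(where) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in where) / (n_k * length)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)

    return counts + means + moments
```

Each protein then becomes one point in a 60-dimensional space, and questions such as whether two structural classes occupy disjoint convex hulls can be posed on those points. The separation step itself is not sketched here.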
Keywords:
Machine learning – Protein sequencing – Protein structure – Protein structure databases – Sequence alignment – Sequence databases – Structural proteins – Vector spaces
Article published in the journal
PLOS One
2019, Issue 12