EGNN-CMutPred: Predicting Protein Mutational Effects by Integrating Primary and Tertiary Protein Structures
DOI:
https://doi.org/10.71222/1wqjsc49
Keywords:
protein mutational effect prediction, EGNN, ESM-2, primary and tertiary protein structures, semantic embedding, structural encoding
Abstract
This study proposes EGNN-CMutPred (Equivariant Graph Neural Network-based Comprehensive Protein Mutational Effect Predictor), a novel approach for predicting the effects of protein mutations by integrating primary and tertiary protein structures. The model combines Evolutionary Scale Modeling 2 (ESM-2) for semantic embedding of the sequence with an Equivariant Graph Neural Network (EGNN) for structural encoding, which improves its accuracy in predicting how mutations affect protein function and stability. By incorporating both semantic and topological embeddings, the study aims to address the limitations of traditional prediction methods that rely on sequence or structure alone, so that the model captures a more complete representation of each protein. EGNN-CMutPred was trained on non-redundant protein sequences from the CATH v4.3.0 database and evaluated on the ProteinGym deep mutational scanning (DMS) benchmark and the ProThermDB database, whose records report changes in melting temperature (ΔTm) and changes in Gibbs free energy (ΔΔG). The model simulates mutations and predicts the resulting changes in protein stability, and it performs strongly on metrics such as Spearman's correlation and true positive rate. These results suggest that EGNN-CMutPred is a valuable tool for precision medicine and protein engineering, with better predictive performance than existing methods. Future research will refine the model's computational techniques and extend it to larger, more diverse datasets, furthering its potential for understanding protein mutations and their implications for disease and therapeutic development.
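The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy scores, and the convention that a mutation counts as "positive" when its value exceeds a threshold of 0 are all assumptions made for illustration only.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(predicted, measured):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rp, rm = average_ranks(predicted), average_ranks(measured)
    n = len(rp)
    mean_p, mean_m = sum(rp) / n, sum(rm) / n
    cov = sum((a - mean_p) * (b - mean_m) for a, b in zip(rp, rm))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_m = sum((b - mean_m) ** 2 for b in rm)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_p * var_m)


def true_positive_rate(predicted, measured, threshold=0.0):
    """Share of truly positive mutations (measured > threshold) predicted positive.

    The threshold convention is an assumption of this sketch.
    """
    positives = [(p, m) for p, m in zip(predicted, measured) if m > threshold]
    if not positives:
        raise ValueError("no positive examples in the measured values")
    hits = sum(1 for p, _ in positives if p > threshold)
    return hits / len(positives)


# Made-up scores for five hypothetical mutations (not data from the study).
measured = [1.2, -0.5, 0.3, -2.0, 0.8]
predicted = [0.9, -0.1, -0.2, -1.5, 1.1]

rho = spearman(predicted, measured)
tpr = true_positive_rate(predicted, measured)
print(f"Spearman rho = {rho:.3f}, TPR = {tpr:.3f}")
```

Spearman's correlation compares only the rank order of predicted and measured effects, which suits benchmarks whose experimental scales differ from the model's output scale. The true positive rate depends on how "positive" is defined, so the threshold would need to match the benchmark's own labeling.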