Imagine studying with a textbook that’s missing every other page. Both humans and machines struggle with learning when crucial data is absent. This challenge is magnified when studying biological problems like the molecular dynamics of disease, where vast amounts of data are needed to map protein-protein interactions (PPIs)—the complex ways in which protein molecules affect each other within living systems.
Emerging deep learning (DL)-based computational models are useful for revealing new PPI insights, but often falter due to a scarcity of labelled training data, which is costly to acquire. This problem is exacerbated by domain shift: a phenomenon where models trained on data from one context (e.g. a well-studied set of proteins from a bacterial species) can fail to generalise what they’ve learned to another (e.g. the same set of proteins in a different species).
“The combined impact of label scarcity and domain shift can markedly reduce how generalisable and reliable computational models are in PPI research,” said Ziyuan Zhao, a Senior Research Engineer at A*STAR’s Institute for Infocomm Research (I2R). “This poses significant obstacles to their ability to accurately, consistently predict how complex biological systems work.”
To address this, Zhao proposed a more effective, efficient and generalisable DL model for PPI prediction together with Principal Scientist Xulei Yang and I2R colleagues, as well as researchers from A*STAR’s Genome Institute of Singapore (GIS); Nanyang Technological University, Singapore; and Shanghai University, China.
Their model, a self-ensembling multi-graph neural network for PPI prediction (SemiGNN-PPI), was designed to overcome issues of limited data or unfamiliar PPI contexts. It combines graph neural networks (GNNs), which help map complex relationships, with a Mean Teacher model, a technique that learns from both labelled and unlabelled data.
“The self-ensembling strategy uses the collective insights from a set of aggregated predictions—generated from multiple prior evaluations of the GNN—to guide and refine the model's learning trajectory, enhancing its performance in complex biological environments,” said Zhao.
The team also added multi-graph learning, which views PPIs from different angles to improve predictions even with imperfect data, and included consistency constraints to ensure the model’s accuracy and reliability.
The team found that SemiGNN-PPI outperformed existing benchmark DL-based methods in PPI prediction, especially in scenarios with limited labelled data or with previously unseen protein datasets. It also showed strong generalisation capabilities, performing well on datasets with different characteristics from those it was trained on.
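For readers curious about the mechanics, the Mean Teacher idea and the consistency constraint can be sketched roughly as below. This is an illustrative simplification and not the team's implementation: the function names, the decay value and the loss weight are all assumptions, and plain lists of numbers stand in for the weights and predictions of a real GNN.

```python
# Illustrative sketch only; all names and values here are hypothetical.

def ema_update(teacher, student, decay=0.99):
    """Nudge each teacher weight toward the matching student weight.

    The teacher is an exponential moving average of past student weights,
    so it aggregates the student's earlier states.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def consistency_loss(student_probs, teacher_probs):
    """Mean squared difference between student and teacher predictions."""
    n = len(student_probs)
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / n


def total_loss(supervised_loss, student_probs, teacher_probs, weight=1.0):
    """Supervised loss on labelled pairs plus a weighted consistency term.

    The consistency term needs no labels, so unlabelled protein pairs
    can contribute to training as well.
    """
    return supervised_loss + weight * consistency_loss(student_probs, teacher_probs)


# One hypothetical training step with toy numbers.
teacher_weights = [0.0, 0.0]
student_weights = [1.0, -1.0]
teacher_weights = ema_update(teacher_weights, student_weights, decay=0.9)

loss = total_loss(0.5, student_probs=[0.8, 0.2], teacher_probs=[0.6, 0.4])
```

In this sketch only the student would be trained directly. The teacher changes slowly, which is why its predictions can serve as a steadier target when labels are scarce.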
“Remarkably, the model achieved results on par with fully-supervised models, even when operating with substantially fewer labels, showcasing its efficiency in addressing label scarcity,” said Zhao.
Zhao noted that a similar approach can be applied to create more reliable computational models for tackling bioinformatics challenges beyond PPIs. The team plans to refine SemiGNN-PPI further by enhancing its performance on highly imbalanced datasets, and to explore its use in predicting other types of biological interactions.
The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research (I2R) and Genome Institute of Singapore (GIS).