Drifting through our cells in the millions, ribonucleic acids (RNA) may be best known as messengers, providing templates of our genetic code to protein-making cellular factories. Yet RNA does much more than simply carry information. Some RNAs ferry building blocks to those same factories; others form the factories themselves; still others regulate their pace or help protect them from invaders and injury.
“Each RNA molecule has a task that depends on the structural motifs it folds into,” said Mile Šikić, a Senior Principal Scientist at the A*STAR Genome Institute of Singapore (A*STAR GIS). “If we can predict how RNA folds, we gain direct insights into its likely functions and behaviour, and how those might be regulated or disrupted in disease.”
Šikić and colleagues from A*STAR GIS, the A*STAR Bioinformatics Institute (A*STAR BII), and the University of Zagreb, Croatia, recently developed the RiboNucleic Acid Language Model (RiNALMo), a large language model (LLM) pre-trained on 36 million non-coding RNA (ncRNA) sequences drawn from several public databases. With over 650 million parameters, RiNALMo is possibly the largest RNA language model to date, and provides researchers with a powerful tool for capturing the structural information encoded in RNA sequences.
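To give a sense of the kind of pre-training such models undergo, the short Python sketch below illustrates the masked-language-modelling objective commonly used for sequence models: nucleotides are hidden at random and the model learns to recover them from context. The token names, toy sequence and 15 percent masking rate are illustrative assumptions, not details reported for RiNALMo.

```python
import random

# Illustrative sketch of masked-language-model pre-training on RNA.
# The mask token, toy sequence and masking rate are assumptions for
# demonstration, not RiNALMo's actual configuration.

MASK = "<mask>"

def mask_sequence(seq, mask_rate=0.15, rng=random):
    """Replace a random subset of nucleotides with a mask token.

    Returns the masked token list plus the (position, original token)
    pairs that a language model would be trained to predict back.
    """
    tokens = list(seq)
    targets = []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets.append((i, tok))
            tokens[i] = MASK
    return tokens, targets

if __name__ == "__main__":
    random.seed(0)
    rna = "GGGAAACUUCGGUUUCCC"  # a toy hairpin-like sequence
    masked, targets = mask_sequence(rna)
    print("input :", "".join("?" if t == MASK else t for t in masked))
    print("labels:", targets)  # the model learns to fill these in from context
```

Trained this way over millions of sequences, a model must internalise statistical regularities, such as which bases tend to pair, in order to reconstruct the hidden positions.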
In benchmark tests, the team found that RiNALMo not only achieved state-of-the-art performance on secondary structure prediction tasks, but also generalised across RNA families, including families it had not encountered during training. When tasked with extracting functional information from previously unseen RNA types, RiNALMo also outperformed other general-purpose RNA models as well as language models trained specifically for those tasks, including SpliceBERT and UTR-LM.
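For readers unfamiliar with how secondary structure predictions are typically scored, the sketch below converts dot-bracket structures into sets of base pairs and computes an F1 score against a reference structure. It illustrates the general shape of such a benchmark; it is not the team's evaluation code.

```python
# Hedged sketch: scoring a secondary structure prediction by base-pair F1.
# Dot-bracket notation and F1 are standard in the field; the example
# structures are made up for illustration.

def dot_bracket_to_pairs(structure):
    """Parse '(((...)))'-style notation into a set of (i, j) base pairs."""
    stack, pairs = [], set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.add((stack.pop(), i))
    return pairs

def base_pair_f1(pred, ref):
    """F1 between predicted and reference base-pair sets."""
    pred_pairs = dot_bracket_to_pairs(pred)
    ref_pairs = dot_bracket_to_pairs(ref)
    tp = len(pred_pairs & ref_pairs)  # correctly predicted pairs
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(ref_pairs) if ref_pairs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    reference = "(((....)))"  # a toy hairpin: three stacked base pairs
    predicted = "((......))"  # a prediction missing the innermost pair
    print(f"base-pair F1: {base_pair_f1(predicted, reference):.2f}")  # 0.80
```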
“While RiNALMo is broadly similar to earlier RNA models, we believe what truly made the difference was our hands-on, adaptive training process,” said Šikić. “We were very careful in selecting training data and paid close attention to how the model was learning, adjusting our approach whenever progress slowed.”
The team hypothesises that RiNALMo’s predictive edge comes from its ability to capture structural information and other hidden patterns embedded within RNA sequences. However, exactly which sequence features the model attends to remains unclear.
“Understanding what deep learning models actually learn remains one of the biggest open challenges in artificial intelligence today,” noted Šikić. “With collaborators at the University of California, Berkeley (UC Berkeley), US, we’re actively working to understand what RiNALMo is learning and which RNA features it captures internally.”
The team is currently building a generative version of RiNALMo that can actively design and optimise RNA molecules for therapeutic applications. RiNALMo’s success has also led to collaborations with multiple groups locally and internationally, including the National University of Singapore; Nanyang Technological University, Singapore; the National Heart Centre Singapore; UC Berkeley; and biotech company Alltrna.
The A*STAR researchers contributing to this research are from the A*STAR Genome Institute of Singapore (A*STAR GIS) and the A*STAR Bioinformatics Institute (A*STAR BII).