In brief

RiNALMo, a 650 million-parameter language model trained on 36 million non-coding RNA sequences, achieves state-of-the-art performance on multiple downstream RNA structure and function prediction tasks, and proves able to generalise its learning to unseen RNA families.

Predicting purpose in unseen shapes

17 Feb 2026

The largest RNA language model to date helps researchers uncover the hidden roles played by non-coding RNA sequences in our cells.

Drifting through our cells in the millions, ribonucleic acids (RNA) may be known best as messengers, providing templates of our genetic code to protein-making cellular factories. Yet RNA does much more than simply carry information. Some RNAs ferry building blocks to those same factories; others form the factories themselves; while still others regulate their pace, or help protect them from invaders and injury.

“Each RNA molecule has a task that depends on the structural motifs it folds into,” said Mile Šikić, a Senior Principal Scientist at the A*STAR Genome Institute of Singapore (A*STAR GIS). “If we can predict how RNA folds, we gain direct insights into its likely functions and behaviour, and how those might be regulated or disrupted in disease.”

Šikić and colleagues from A*STAR GIS, the A*STAR Bioinformatics Institute (A*STAR BII), and the University of Zagreb, Croatia, recently developed the RiboNucleic Acid Language Model (RiNALMo), a large language model (LLM) pre-trained on 36 million non-coding RNA (ncRNA) sequences drawn from several public databases. With over 650 million parameters—possibly making it the largest RNA language model to date—RiNALMo provides researchers with a powerful tool for capturing the underlying structural information encoded in RNA sequences.

In benchmark tests, the team found that RiNALMo not only achieved state-of-the-art performance on secondary structure prediction tasks, but also generalised its learning across different RNA families—even those it had not encountered during training. In addition, when tasked with extracting important functional information from previously unseen RNA types, RiNALMo outperformed other general-purpose RNA models as well as language models trained specifically for those tasks, including SpliceBERT and UTR-LM.

“While RiNALMo is broadly similar to earlier RNA models, we believe what truly made the difference was our hands-on, adaptive training process,” said Šikić. “We were very careful in selecting training data and paid close attention to how the model was learning, adjusting our approach whenever progress slowed.”

The team hypothesises that RiNALMo’s predictive edge may be due to its ability to capture hidden knowledge and important structural information embedded within RNA sequences. However, exactly which features the model attends to remains unclear.

“Understanding what deep learning models actually learn remains one of the biggest open challenges in artificial intelligence today,” noted Šikić. “With collaborators at the University of California, Berkeley (UC Berkeley), US, we’re actively working to understand what RiNALMo is learning and which RNA features it captures internally.”

The team is currently building a generative version of RiNALMo that can actively design and optimise RNA molecules for therapeutic applications. RiNALMo’s success has also led to collaborations with multiple groups locally and internationally, including the National University of Singapore; Nanyang Technological University, Singapore; the National Heart Centre Singapore; UC Berkeley; and biotech company Alltrna.

The A*STAR researchers contributing to this research are from the A*STAR Genome Institute of Singapore (A*STAR GIS).

References

Penić, R.J., Vlašić, T., Huber, R.G., Wan, Y. and Šikić, M. RiNALMo: general-purpose RNA language models can generalize well on structure prediction tasks. Nature Communications 16, 5671 (2025).

About the Researcher

Mile Šikić is a researcher in computational genomics and artificial intelligence (AI), currently serving as a Senior Principal Scientist at the A*STAR Genome Institute of Singapore (A*STAR GIS) and a full Professor of Computer Science at the University of Zagreb, Croatia. He earned his PhD in computer science from the University of Zagreb in 2008 and has since been at the forefront of algorithms and AI-driven genomics, contributing to de novo genome assembly (Racon, Raven), nanopore sequencing error correction (HERRO) and RNA language models (RiNALMo). Šikić leads a research team of over 15 scientists across Singapore and Croatia, fostering interdisciplinary collaboration. He is an active member of several international consortia, including Telomere-to-Telomere (T2T), the Human Pangenome Reference Consortium (HPRC), and Cancer Genome in a Bottle. He has founded multiple companies and worked as a system integrator, consultant and project manager on over 70 industry projects spanning computer networks, mobile networks and cybersecurity. He has also contributed to complex and social network analysis, co-developing a novel election and market forecasting methodology that successfully predicted geopolitical events in 2016.

This article was made for A*STAR Research by Wildtype Media Group