To peer into the inner workings of cells, scientists study gene expression patterns, inferring which parts of the cellular blueprint are switched on from the amount of ribonucleic acid (RNA) present. These patterns may offer clues to how cells develop unique identities, or flag genes that malfunction in disease.
Simply tallying up RNA levels, however, only scratches the surface. “Looking only at total gene expression is like counting how many books there are in a library without knowing which titles are there,” said Jonathan Göke, a Principal Investigator at the A*STAR Genome Institute of Singapore (A*STAR GIS). “Many genes can produce various isoforms of RNA that each perform distinct roles in the body.”
Short-read RNA sequencing has long been the standard for decoding such transcript-level differences. The technique’s affordability has enabled the creation of vast repositories of short-read RNA data, fuelling biomedical discoveries and tool development.
“However, this approach cannot easily capture full-length transcripts or resolve complex splicing patterns that give rise to isoforms. Meanwhile, long-read RNA sequencing can cover entire transcripts and reveal more detailed RNA features,” said Göke.
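The ambiguity Göke describes can be illustrated with a toy model. The sketch below is not part of the SG-NEx pipelines: the isoform names, exon labels and helper functions are all hypothetical, and real read alignment is far more involved. It assumes each isoform is simply an ordered set of exons and asks which isoforms could have produced a given read.

```python
# Toy model: each isoform is the ordered tuple of exons it contains.
# Isoform names and exon layouts are invented for illustration only.
ISOFORMS = {
    "isoform_A": ("e1", "e2", "e3", "e4"),
    "isoform_B": ("e1", "e3", "e4"),  # skips e2
    "isoform_C": ("e1", "e2", "e4"),  # skips e3
}


def is_contiguous_run(read, isoform):
    """True if the read's exons appear as a consecutive run in the isoform."""
    n = len(read)
    return any(isoform[i:i + n] == read for i in range(len(isoform) - n + 1))


def compatible_isoforms(read):
    """Sorted names of all isoforms that could have produced this read."""
    read = tuple(read)
    return sorted(
        name for name, exons in ISOFORMS.items()
        if is_contiguous_run(read, exons)
    )


short_read = ("e1", "e2")              # covers a single exon junction
long_read = ("e1", "e2", "e3", "e4")   # covers the whole transcript

print(compatible_isoforms(short_read))  # ['isoform_A', 'isoform_C']
print(compatible_isoforms(long_read))   # ['isoform_A']
```

In this simplified picture, the short read fits two isoforms equally well, while the read spanning the full transcript points to exactly one.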
To overcome these challenges, Göke and A*STAR GIS Senior Scientist Ying Chen launched the Singapore Nanopore Expression (SG-NEx) project to generate large-scale, high-quality RNA sequencing datasets. The collaboration united expertise from multiple institutions, including A*STAR GIS; the National University of Singapore; the National Cancer Centre Singapore; the Walter and Eliza Hall Institute of Medical Research, the Garvan Institute of Medical Research and the Peter MacCallum Cancer Centre in Australia; the Francis Crick Institute, UK; Seqera Labs, Spain; and the University of North Carolina at Chapel Hill, US.
SG-NEx profiled several cell lines and patient samples to represent a broad range of human tissues. For systematic comparison, the team employed five sequencing methods: Nanopore long-read direct RNA, amplification-free direct cDNA, PCR-amplified cDNA, PacBio IsoSeq and short-read cDNA sequencing. They found that long-read approaches provided greater accuracy in identifying the most abundant isoforms across samples.
“The SG-NEx dataset allows precise measurement of transcript levels, which is essential for identifying biomarkers in neurodegenerative, cardiovascular and infectious diseases,” said Göke. “These insights can support earlier, more accurate diagnoses and inform next-generation treatments.”
To ensure broad scientific benefit, the team released both raw and processed data in an open-access format, complete with computational pipelines. Scientists worldwide can now explore SG-NEx data, assess each platform’s strengths and limitations, and build analytical tools to uncover complex cellular events at the isoform level.
Looking ahead, Göke and Chen aim to develop AI-driven computational pipelines capable of handling the complexity of long-read data and detecting disease-related RNA features. “We are also exploring ways to enhance data accessibility and standardisation across global research institutions, which will support the integration of long-read sequencing into routine clinical and translational research,” said Göke.
The A*STAR-affiliated researchers contributing to this research are from the A*STAR Genome Institute of Singapore (A*STAR GIS).