In brief

A combination of nanopore sequencing and a bootstrapped deep learning approach enables high-throughput sequencing of xeno-nucleic acids containing non-canonical bases with over 80 percent accuracy and 99 percent specificity, with potential applications in synthetic biology, genomics and genetic data storage.

Reading a richer genetic alphabet

6 Apr 2026

Nanopore sequencing and a self-enhancing AI model team up to enable a faster and more accurate method for reading synthetic genes.

Like a compact recipe book, DNA encodes the building instructions for a living organism using an alphabet of just four ‘letters’—the bases A, T, C and G. Today, genetic engineers are expanding that alphabet by introducing non-canonical bases (NCBs). Artificially produced genetic material containing these bases, called xeno-nucleic acids (XNAs), has promising biotechnological applications, including ultra-high-density genetic data storage.

However, for XNAs to be practical storage systems, they need to be readable quickly and at scale. While next-generation sequencing (NGS) methods for genetic materials have become increasingly fast and affordable, the few that work with NCBs are still limited in their speed, resolution and throughput.

A team of A*STAR Genome Institute of Singapore (A*STAR GIS) researchers led by Associate Director Niranjan Nagarajan and postdoctoral fellow Mauricio L. Perez aimed to address this challenge by adapting nanopore sequencing to NCBs. In this NGS method, strands of genetic material are fed through minute pores, with each individual base triggering a distinct electrical signal that can be analysed by a computer.

“Nanopore sequencing’s first step is very permissive to whatever bases are present,” said Nagarajan. “A nanopore doesn’t ‘look’ for specific letters; it simply measures tiny changes in electrical current as each base passes through. So the method’s main limitation isn’t its hardware or sequencing chemistry, but the basecaller: the computational model that converts signals into base sequences.”

Working with colleagues from the former A*STAR Institute of Bioengineering and Bioimaging, the team tested their approach by running 20 NCB-containing XNA templates through a commercially available Oxford Nanopore Technologies MinION device. The system generated more than 2.3 million reads per flow cell, a performance comparable to standard DNA sequencing runs.

Notably, electrical signals near NCBs diverged significantly from those produced by canonical bases, pushing basecalling error rates there above 60 percent. “Those high error rates confirmed that NCBs were producing signal patterns that were genuinely different from traditional bases,” Perez explained. “This meant the basecaller model simply needed to be trained to interpret these new signal patterns.”

The team created an XNA template library containing over 6 million sequencing reads, augmented with ‘spliced’ reads that contained both real DNA and simulated XNA signals. They used this database to train an artificial intelligence (AI) basecaller model through a ‘bootstrapping’ strategy, in which both the model and the training data improve iteratively over multiple training rounds.

“Through bootstrapping, each training round produces a more accurate model and a cleaner dataset,” said Perez. “Over successive iterations, the system then works its way toward high accuracy despite the imperfect starting point.”
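To make the bootstrapping idea concrete, here is a minimal toy sketch in Python. It is not the team's basecaller, which is a deep learning model working on raw nanopore signals; everything here (the one-dimensional "signal", the nearest-centroid model, the 20 percent label noise and all function names) is an assumption chosen only to show how a model and its training labels can improve each other over successive rounds.

```python
import random
from statistics import mean

def train_centroids(signals, labels):
    """Fit a nearest-centroid 'model': the mean signal level for each label."""
    return {
        lab: mean(s for s, l in zip(signals, labels) if l == lab)
        for lab in set(labels)
    }

def predict(centroids, signal):
    """Assign the label whose centroid is closest to the signal value."""
    return min(centroids, key=lambda lab: abs(centroids[lab] - signal))

def bootstrap(signals, noisy_labels, rounds=3):
    """Alternate between relabelling the data and retraining the model."""
    labels = list(noisy_labels)
    centroids = train_centroids(signals, labels)            # imperfect start
    for _ in range(rounds):
        labels = [predict(centroids, s) for s in signals]   # cleaner dataset
        centroids = train_centroids(signals, labels)        # more accurate model
    return centroids, labels

def make_toy_data(n_per_class=200, flip_rate=0.2, seed=0):
    """Hypothetical data: two signal levels, with some labels flipped."""
    rng = random.Random(seed)
    truth = ["canonical"] * n_per_class + ["ncb"] * n_per_class
    signals = [rng.gauss(0.0 if t == "canonical" else 3.0, 0.5) for t in truth]
    noisy = [
        ("ncb" if t == "canonical" else "canonical")
        if rng.random() < flip_rate else t
        for t in truth
    ]
    return signals, truth, noisy

def accuracy(labels, truth):
    """Fraction of labels that match the ground truth."""
    return sum(a == b for a, b in zip(labels, truth)) / len(truth)
```

Calling `bootstrap(*make_toy_data()[::2])` on this toy data returns relabelled data that agrees with the ground truth far more often than the noisy starting labels do, which mirrors, in a very simplified way, the "more accurate model and cleaner dataset" loop Perez describes.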

The team reported that their final model achieved NCB sequencing accuracy above 80 percent with 99 percent specificity, even with XNA sequences not seen during training.
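The article does not spell out how these two figures are computed. One common reading, assumed here, is that the first is the fraction of true NCB positions called correctly and that specificity is the fraction of canonical positions not mistaken for an NCB. The sketch below computes both from a pair of already-aligned sequences; the placeholder symbol `X` for an NCB and the function name `ncb_metrics` are hypothetical.

```python
def ncb_metrics(true_bases, called_bases, ncb_symbols=frozenset({"X"})):
    """Return (sensitivity, specificity) for NCB calls.

    Assumes the two sequences are already aligned position by position
    and that both NCB and canonical positions are present.
    """
    if len(true_bases) != len(called_bases):
        raise ValueError("sequences must be aligned to the same length")
    tp = fn = tn = fp = 0
    for t, c in zip(true_bases, called_bases):
        if t in ncb_symbols:
            if c == t:
                tp += 1      # NCB correctly called
            else:
                fn += 1      # NCB missed or miscalled
        elif c in ncb_symbols:
            fp += 1          # canonical base wrongly called as an NCB
        else:
            tn += 1          # canonical base not called as an NCB
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one NCB and one canonical position")
    return tp / (tp + fn), tn / (tn + fp)
```

Under this reading, high specificity means the model rarely "sees" an NCB where there is only ordinary DNA, which matters when NCBs are rare relative to canonical bases.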

“What surprised us most was how well the splicing method worked,” said Perez. “Even with inevitable imperfections from signal noise, the model trained on spliced DNA-XNA reads performed far better than our earlier approaches that relied purely on simulated signals.”
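The splicing idea can be pictured as cutting a window out of a DNA signal trace and dropping in a segment that stands in for the XNA signal. The article does not describe how the team actually does this, so the following is only a schematic sketch: the function name `splice_read`, the plain lists of current values and the way the inserted span is recorded are all assumptions.

```python
from typing import List, Tuple

def splice_read(
    dna_signal: List[float],
    xna_signal: List[float],
    start: int,
    end: int,
) -> Tuple[List[float], Tuple[int, int]]:
    """Replace dna_signal[start:end] with xna_signal.

    Returns the spliced trace and the (start, end) span the XNA segment
    occupies in it, so a training label can be attached to that span.
    """
    if not 0 <= start <= end <= len(dna_signal):
        raise ValueError("splice window must lie inside the DNA signal")
    spliced = dna_signal[:start] + xna_signal + dna_signal[end:]
    return spliced, (start, start + len(xna_signal))
```

For example, `splice_read([1.0] * 10, [5.0, 6.0], 4, 6)` yields a ten-sample trace whose samples 4 and 5 come from the inserted segment, with the rest left as the original DNA signal.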

Beyond XNA sequencing, the team believes their strategies could advance nanopore-based protein sequencing, particularly by improving how models learn from noisy or limited datasets. “These methods could make direct protein reading more accurate, potentially enabling earlier disease detection and deeper insights into how proteins change during illness,” said Nagarajan.

The A*STAR-affiliated researchers contributing to this research are from the A*STAR Genome Institute of Singapore (A*STAR GIS).

References

Perez, M., Kimoto, M., Rajakumar, P., Suphavilai, C., Peres da Silva, R., et al. Direct high-throughput deconvolution of non-canonical bases via nanopore sequencing and bootstrapped learning. Nature Communications 16, 6980 (2025).

About the Researchers

Niranjan Nagarajan

Associate Director and Senior Group Leader

A*STAR Genome Institute of Singapore (A*STAR GIS)

Niranjan Nagarajan is an Associate Director and Senior Group Leader at the A*STAR Genome Institute of Singapore (A*STAR GIS). He is also an Associate Professor in the Department of Medicine and Department of Computer Science at the National University of Singapore. Nagarajan received a BA in Computer Science and Mathematics from Ohio Wesleyan University in 2000, and a PhD in Computer Science from Cornell University in 2006. He did his postdoctoral work at the Center for Bioinformatics and Computational Biology at the University of Maryland, working on problems in genome assembly and metagenomics. Currently, his research focuses on developing cutting-edge genome analytic tools and using them to study the role of microbial communities in human health. His team conducts research at the interface of genetics, computer science and microbiology, focusing on using a systems biology approach to understand host-microbiome-pathogen interactions in various disease conditions.

Mauricio L. Perez

Postdoctoral Fellow

A*STAR Genome Institute of Singapore (A*STAR GIS)

Mauricio L. Perez is a postdoctoral fellow at the A*STAR Genome Institute of Singapore (A*STAR GIS), where he works at the Laboratory of Metagenomic Technologies and Microbial Systems, researching machine learning methods for nanopore signal analysis and genomics data. He obtained his PhD degree in 2021 from Nanyang Technological University (NTU), his MSc in 2016 from University of Campinas (UNICAMP) in Brazil, and his BSc in 2012 from Federal University of São Carlos (UFSCar) in Brazil.

This article was made for A*STAR Research by Wildtype Media Group