Like a compact recipe book, DNA encodes the building instructions for a living organism using an alphabet of just four ‘letters’—the bases A, T, C and G. Today, genetic engineers are expanding that alphabet by introducing non-canonical bases (NCBs). Artificially produced genetic material containing these bases, called xeno-nucleic acids (XNAs), has promising biotechnological applications, including ultra-high-density genetic data storage.
However, for XNAs to be practical storage systems, they need to be readable quickly and at scale. While next-generation sequencing (NGS) methods for genetic materials have become increasingly fast and affordable, the few that work with NCBs are still limited in their speed, resolution and throughput.
A team of A*STAR Genome Institute of Singapore (A*STAR GIS) researchers led by Associate Director Niranjan Nagarajan and postdoctoral fellow Mauricio L. Perez aimed to address this challenge by adapting nanopore sequencing to NCBs. In this NGS method, strands of genetic material are fed through minute pores, with each individual base triggering a distinct electrical signal that can be analysed by a computer.
“Nanopore sequencing’s first step is very permissive to whatever bases are present,” said Nagarajan. “A nanopore doesn’t ‘look’ for specific letters; it simply measures tiny changes in electrical current as each base passes through. So the method’s main limitation isn’t its hardware or sequencing chemistry, but the basecaller: the computational model that converts signals into base sequences.”
Working with colleagues from the former A*STAR Institute of Bioengineering and Bioimaging, the team tested their approach by running 20 NCB-containing XNA templates through a commercially available Oxford Nanopore Technologies MinION device. The system generated more than 2.3 million reads per flow cell, a performance comparable to standard DNA sequencing runs.
Notably, electrical signals near NCBs diverged significantly from those produced by canonical bases, with basecalling error rates in those regions exceeding 60 percent. “Those high error rates confirmed that NCBs were producing signal patterns that were genuinely different from traditional bases,” Perez explained. “This meant the basecaller model simply needed to be trained to interpret these new signal patterns.”
The team created an XNA template library containing over 6 million sequencing reads, augmented with ‘spliced’ reads that contained both real DNA and simulated XNA signals. They used this database to train an artificial intelligence (AI) basecaller model through a ‘bootstrapping’ strategy, in which both the model and the training data improve iteratively over multiple training rounds.
“Through bootstrapping, each training round produces a more accurate model and a cleaner dataset,” said Perez. “Over successive iterations, the system then works its way toward high accuracy despite the imperfect starting point.”
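The bootstrapping idea described above can be illustrated with a toy sketch. It is not the team's basecaller: a simple nearest-centroid classifier stands in for the AI model, the signal levels are invented, and the symbol X is a hypothetical placeholder for a non-canonical base. Each round fits the model on the current labels and then uses it to relabel the data, so both improve together.

```python
from statistics import mean

def make_toy_data(n_per_class=50):
    """Build deterministic toy signals (hypothetical current levels) with noisy labels."""
    signals, truth = [], []
    for label, level in (("A", 80.0), ("X", 100.0)):  # "X" = placeholder NCB
        for i in range(n_per_class):
            signals.append(level + (i % 5) - 2)  # small deterministic jitter
            truth.append(label)
    # Imperfect starting point: flip every fourth label.
    flip = {"A": "X", "X": "A"}
    noisy = [flip[t] if i % 4 == 0 else t for i, t in enumerate(truth)]
    return signals, truth, noisy

def fit_centroids(signals, labels):
    """'Train' the toy model: the mean signal level of each labelled class."""
    return {c: mean(s for s, l in zip(signals, labels) if l == c)
            for c in set(labels)}

def predict(centroids, signal):
    """Assign the class whose centroid is closest to the signal."""
    return min(centroids, key=lambda c: abs(signal - centroids[c]))

def bootstrap(signals, noisy_labels, rounds=3):
    """Alternate between fitting the model and relabelling the data."""
    labels = list(noisy_labels)
    model = None
    for _ in range(rounds):
        model = fit_centroids(signals, labels)          # better model
        labels = [predict(model, s) for s in signals]   # cleaner dataset
    return model, labels

def accuracy(labels, truth):
    return sum(l == t for l, t in zip(labels, truth)) / len(truth)

signals, truth, noisy = make_toy_data()
model, cleaned = bootstrap(signals, noisy)
print(accuracy(noisy, truth))    # 0.75 before bootstrapping
print(accuracy(cleaned, truth))  # 1.0 after bootstrapping
```

In this toy case the two classes are well separated, so a single round is enough to recover clean labels. Real nanopore signals overlap far more, which is presumably why several rounds are needed in practice.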
The team reported that their final model achieved NCB sequencing accuracy above 80 percent with 99 percent specificity, even with XNA sequences not seen during training.
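To make these two figures concrete, the sketch below shows one common way such metrics could be computed from aligned sequences. The definitions are an assumption, since the article does not spell them out: accuracy is read here as the fraction of true NCB positions called correctly, and specificity as the fraction of canonical positions that are not mistaken for an NCB. The symbols X and Y are hypothetical placeholders for non-canonical bases.

```python
def ncb_metrics(truth, called, ncb_symbols=frozenset("XY")):
    """Return (NCB recall, specificity) for two aligned, equal-length sequences.

    Assumed definitions, not taken from the study:
      recall      = correctly called NCBs / all true NCB positions
      specificity = canonical positions not called as an NCB / all canonical positions
    """
    if len(truth) != len(called):
        raise ValueError("sequences must be aligned to the same length")
    tp = fn = tn = fp = 0
    for t, c in zip(truth, called):
        if t in ncb_symbols:
            if c == t:
                tp += 1   # NCB read correctly
            else:
                fn += 1   # NCB missed or confused
        elif c in ncb_symbols:
            fp += 1       # canonical base wrongly called as an NCB
        else:
            tn += 1       # canonical base not called as an NCB
    recall = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return recall, specificity

# One of two NCBs is missed; no canonical base is mistaken for an NCB.
print(ncb_metrics("ACGXTTGYCA", "ACGXTTGACA"))  # (0.5, 1.0)
```

Under this reading, high specificity means the model rarely reports an NCB where only ordinary DNA is present.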
“What surprised us most was how well the splicing method worked,” said Perez. “Even with inevitable imperfections from signal noise, the model trained on spliced DNA-XNA reads performed far better than our earlier approaches that relied purely on simulated signals.”
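As a rough illustration of what a spliced read might look like, the sketch below swaps one base of a real DNA read for an NCB together with a simulated signal segment. The data layout, function names and numbers are all hypothetical, and treating the signal as one neat segment per base is a simplification: in real nanopore data, several neighbouring bases influence the current at once.

```python
def splice_read(events, index, ncb_symbol, simulated_samples):
    """Return a copy of a read with one base replaced by an NCB and simulated signal.

    `events` is a list of (base, samples) pairs, a hypothetical layout in which
    each base owns its own stretch of current measurements.
    """
    if not 0 <= index < len(events):
        raise IndexError("splice position is outside the read")
    spliced = list(events)  # leave the original read untouched
    spliced[index] = (ncb_symbol, list(simulated_samples))
    return spliced

def flatten(events):
    """Collapse a read into its base sequence and one continuous signal."""
    bases = "".join(base for base, _ in events)
    signal = [x for _, samples in events for x in samples]
    return bases, signal

# Real DNA signal (invented values) with a simulated "X" segment spliced in.
real_read = [("A", [80.1, 79.8]), ("C", [95.0, 94.6, 95.3]), ("G", [70.2, 70.0])]
spliced = splice_read(real_read, 1, "X", [110.4, 109.9])
print(flatten(spliced))  # ('AXG', [80.1, 79.8, 110.4, 109.9, 70.2, 70.0])
```

The result keeps the noise characteristics of the real flanking signal while supplying a known label at the spliced position, which is the property a training set of this kind would rely on.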
Beyond XNA sequencing, the team believes their strategies could advance nanopore-based protein sequencing, particularly by improving how models learn from noisy or limited datasets. “These methods could make direct protein reading more accurate, potentially enabling earlier disease detection and deeper insights into how proteins change during illness,” said Nagarajan.
The A*STAR-affiliated researchers contributing to this research are from the A*STAR Genome Institute of Singapore (A*STAR GIS).