Deep within the genomes of cancer cells lie subtle clues to their malignant origins. Somatic variants are genetic alterations that point to DNA replication errors or exposure to carcinogens and are known to contribute to the development of tumours.
However, using automated platforms to find these genetic fingerprints of cancer in DNA sequences from patient samples has, until now, been exceedingly difficult, because tumours are often highly heterogeneous and DNA sequencing is prone to errors.
Anders Skanderup, a Group Leader from A*STAR’s Genome Institute of Singapore (GIS), said that breakthroughs in machine learning combined with the availability of large, multidimensional training datasets can help realise the full potential of diagnostic technologies powered by artificial intelligence.
“The ability to generate and use large scale next-generation sequencing data of cancer genomes can enable the training of large deep learning models,” said Skanderup.
Using this approach, Skanderup worked with first author Kiran Krishnamachari and colleagues to develop VarNet, a deep learning system designed to detect somatic variants in tumours. The platform was trained using 4.6 million high-confidence somatic variants found in 356 tumour genomes spanning seven cancer types.
In place of ground-truth labels, the team trained VarNet on labels produced with an ensemble method, which enabled it to recognise genetic mutations in unlabelled genetic data. “While there are many cancer sequencing datasets available, they do not contain ground-truth mutation labels that can be used to train large models,” explained Skanderup, adding that they overcame the challenge using scale and weak supervision.
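As a rough illustration of weak supervision with ensemble labels, the sketch below keeps only the variants that several independent mutation callers agree on and treats them as training labels. The article does not say how the ensemble combines its inputs, so the voting rule, the `min_votes` threshold, the caller names and the variant tuples here are all assumptions made for the example, not a description of VarNet itself.

```python
from collections import Counter


def weak_labels(calls_by_caller, min_votes=2):
    """Return the variants reported by at least `min_votes` callers.

    `calls_by_caller` maps a (hypothetical) caller name to the variants it
    reported, each variant being a (chromosome, position, ref, alt) tuple.
    """
    votes = Counter()
    for calls in calls_by_caller.values():
        # Deduplicate so that each caller votes at most once per variant.
        votes.update(set(calls))
    return {variant for variant, n in votes.items() if n >= min_votes}


# Made-up calls from three hypothetical callers.
calls = {
    "caller_a": [("chr1", 100, "A", "T"), ("chr2", 50, "G", "C")],
    "caller_b": [("chr1", 100, "A", "T")],
    "caller_c": [("chr1", 100, "A", "T"), ("chr3", 7, "C", "G")],
}

labels = weak_labels(calls, min_votes=2)
print(labels)
```

Raising `min_votes` trades the number of labels for confidence in each one, which is the kind of balance that makes large but imperfectly labelled datasets usable for training.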
They also devised two distinct deep learning models: one to identify single-letter DNA changes (single nucleotide variants) and one to identify insertions or deletions in the DNA code (indels). Finally, the system was engineered to generate image-like representations of mutation sites, which allowed VarNet to better ‘see’ mutations and make mutation probability predictions at each site.
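To give a sense of what an image-like representation of a mutation site might look like, the sketch below turns a handful of aligned reads into a grid of numbers, one row per read and one column per position. The article does not describe VarNet's actual encoding, so the cell values, the function names and the toy reads are hypothetical, and the mismatch fraction at the end is only a simple stand-in for the probability that a trained model would predict.

```python
BASES = "ACGT"


def encode_reads(reads, reference):
    """Encode aligned reads as a grid: one row per read, one column per position.

    Cell values (an assumption for this sketch): 0.0 where the read matches
    the reference, 1.0 where it differs, 0.5 where the read has no base.
    """
    grid = []
    for read in reads:
        if len(read) != len(reference):
            raise ValueError("reads must be aligned to the reference window")
        row = []
        for base, ref in zip(read, reference):
            if base not in BASES:
                row.append(0.5)   # gap or unknown base
            elif base == ref:
                row.append(0.0)   # agrees with the reference
            else:
                row.append(1.0)   # candidate variant
        grid.append(row)
    return grid


def mismatch_fraction(grid, column):
    """Fraction of reads with a base at `column` that differ from the reference."""
    values = [row[column] for row in grid if row[column] != 0.5]
    return sum(values) / len(values) if values else 0.0


# Toy data: a five-base window with matched normal and tumour reads.
reference = "ACGTA"
normal_reads = ["ACGTA", "ACGTA", "AC-TA"]
tumour_reads = ["ACTTA", "ACGTA", "ACTTA"]

normal_grid = encode_reads(normal_reads, reference)
tumour_grid = encode_reads(tumour_reads, reference)

# Column 2 differs in the tumour reads but not in the normal reads.
print(mismatch_fraction(tumour_grid, 2), mismatch_fraction(normal_grid, 2))
```

Placing the normal and tumour grids next to each other mirrors the side-by-side comparison a human expert would make when inspecting sequencing data.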
Prior machine learning platforms tended to struggle with ‘low purity’ tumour samples, in which healthy tissue mixed in with the tumour makes somatic variants harder to distinguish. However, validation tests showed that VarNet’s performance often exceeded that of current state-of-the-art methods in these challenging scenarios.
“VarNet was shown to be more accurate than existing systems in benchmarks of low-tumour-purity settings, which improves its potential for practical use,” Skanderup remarked, adding that the platform was specifically designed to mimic human experts who would use visualisations of sequencing data to make side-by-side comparisons of normal and tumour samples.
VarNet’s accuracy could be a game-changer in both research and commercial settings, said Skanderup, who suggested that it could enhance the specialised mutation detection technologies often used by medical diagnostic companies.
The A*STAR-affiliated researchers contributing to this research are from the Genome Institute of Singapore (GIS).