Cancer genomics

Tracking down suspected cancer-inducing mutations using machine learning

Machine learning helps locate dozens of mutations statistically linked to gastric cancer in non-coding DNA regions

Published online Aug 8, 2018

By using machine learning, A*STAR researchers have identified 34 hotspots in non-coding DNA that are strongly statistically linked with gastric cancer.

© KTSDESIGN/Getty

By harnessing the power of machine learning, A*STAR researchers have identified more than 30 mutation hotspots for gastric cancer in regions of DNA that do not code for proteins [1]. This information could be used to diagnose gastric cancer and monitor the effectiveness of treatments.

The molecular machinery inside cells uses instructions encoded in DNA to make proteins for a wide range of functions. When the code becomes corrupted through mutations, the resulting proteins may not perform as expected, and in the worst-case scenario, a cell can become cancerous. Thus, many studies have investigated links between mutations in protein-coding DNA and various cancers. 

But DNA that codes for proteins makes up a meagre 2 per cent of the human genome; the remaining 98 per cent is known as non-coding DNA, and its relationship with cancer is largely unstudied.

Now, by using machine learning to analyze the whole genomes of tumors from 212 gastric cancer patients, Anders Skanderup at the A*STAR Genome Institute of Singapore and colleagues have identified 34 hotspots in non-coding DNA that are strongly statistically correlated with gastric cancer. 

The team chose to study gastric cancer because there was both a clinical need and a research opportunity. “Gastric cancer is one of the top cancer killers in the world, and there’s a strong gastric cancer research community in Singapore,” says Skanderup. “So it made a lot of sense for us to study it.” 

The team’s method had two steps. First, machine learning was used to identify the mutations in each patient’s tumor genome. The analysis then looked for patterns across all the patients’ tumor genomes. “We were looking for regions with an unexpectedly high rate of mutations across the patients, which would indicate that something suspicious was going on in those regions,” explains Skanderup.
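To make the second step concrete, here is a minimal Python sketch of what a search for regions with an unexpectedly high mutation rate could look like. It is an illustration only, not the team's published method: the fixed window size, the uniform background mutation probability, the simple binomial test, the significance threshold and the toy data are all assumptions made for this example, and the function names are hypothetical. A real analysis would model how background mutation rates vary along the genome and would correct for testing many regions at once.

```python
from collections import defaultdict
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def find_hotspots(mutations, n_patients, window=100, background=0.01, alpha=1e-3):
    """Flag windows mutated in more patients than chance would suggest.

    mutations  -- iterable of (patient_id, chromosome, position) tuples
    n_patients -- total number of patients in the cohort
    window     -- window size in base pairs (assumed value)
    background -- assumed chance that a given patient carries a mutation
                  in any one window (uniform here, which is a simplification)
    alpha      -- assumed p-value cut-off, with no multiple-testing correction
    """
    # Count distinct patients per window, so that several mutations from
    # one patient in the same window are only counted once.
    patients_per_window = defaultdict(set)
    for patient, chrom, pos in mutations:
        start = (pos // window) * window
        patients_per_window[(chrom, start)].add(patient)

    hotspots = []
    for (chrom, start), patients in sorted(patients_per_window.items()):
        k = len(patients)
        p_value = binomial_tail(k, n_patients, background)
        if p_value < alpha:
            hotspots.append((chrom, start, k, p_value))
    return hotspots


# Toy cohort of 20 hypothetical patients: eight share a mutated window on
# chr1, and three others carry isolated mutations elsewhere.
toy = [(p, "chr1", 1000 + p) for p in range(8)]
toy += [(8, "chr1", 5000), (9, "chr2", 1234), (10, "chr2", 77777)]

hotspots = find_hotspots(toy, n_patients=20)
for chrom, start, k, p_value in hotspots:
    print(chrom, start, k, p_value)
```

In this toy cohort only the shared chr1 window passes the threshold, because a window hit in a single patient is well within what the assumed background rate allows.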

Of the 34 hotspots, 11 were sites at which CTCF — a protein that controls the expression of genes by determining whether they are copied into RNA — binds. “We discovered that these hotspots had an unexpectedly high rate of mutations, which we couldn’t explain in terms of random chance,” says Skanderup. “When we probed further, what immediately jumped out at us was that 11 of them overlapped with CTCF-binding sites. That was so striking because we wouldn’t have expected any such sites among the 34 hotspots.” He notes that other studies had also found links between mutations and CTCF-binding sites, which suggests that these sites may play an important role in gastric cancer.

The team now intends to explore the relationship between non-coding mutations and how well patients respond to treatments.

The A*STAR-affiliated researchers contributing to this research are from the Genome Institute of Singapore.

Tags: gastric cancer, genetic analysis, machine learning, non-coding DNA, mutations, CTCF, Genome Institute of Singapore (GIS)

Reference

  1. Guo, Y. A., Chang, M. M., Huang, W., Ooi, W. F., Xing, M. et al. Mutation hotspots at CTCF binding sites coupled to chromosomal instability in gastrointestinal cancers. Nature Communications 9, 1520 (2018).