Nature serves as a bountiful source of compounds poised to revolutionise human health and wellbeing. The rich chemical diversity present in plants and fungi has given us molecules that can combat pathogens and enhance longevity.
Still, there remains a wealth of nature yet to be discovered. Dillon Tay and Shi Jun Ang, Research Scientists at A*STAR’s Institute of Sustainability for Chemicals, Energy and Environment (ISCE2) and Institute of High Performance Computing (IHPC), explained that there’s profound value in delving deeper into the universe of naturally derived chemicals. But traditional approaches are notoriously slow and offer no guarantee of uncovering anything of worth.
“The traditional approach to natural product discovery is through experimental screening of natural samples like plant extracts in the hopes of finding hits with the desired bioactivity,” explained the scientists.
Artificial intelligence (AI) can circumvent these limitations, said Tay and Ang. “We can now design ‘fit-for-purpose’ natural products, by pairing the generation of novel natural product-like structures with activity prediction models.”
In their study, the team employed a machine learning architecture based on long short-term memory (LSTM) to generate a natural product database. The LSTM architecture is adept at handling sequential data because it can retain information over extended sequences, a critical feature for accurately generating complex molecular structures. The generated structures also had to be novel, chemically sound and diverse, spanning a broad physicochemical spectrum. These stringent criteria ensured the quality and diversity of the database, Tay and Ang explained.
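As a rough illustration of how such criteria might be applied, the sketch below filters a batch of generated structures for uniqueness and novelty. It is not the team's pipeline: the function name `filter_generated` and the toy strings are hypothetical, the molecules are assumed to be already written as canonical SMILES strings so that identical structures compare equal, and the chemical validity check is left out because it would need a cheminformatics toolkit.

```python
def filter_generated(generated, training_set):
    """Keep structures that are new and not repeated, in first-seen order.

    Assumes every string is already a canonical SMILES string, so that
    two identical molecules are represented by identical text.
    """
    seen = set()
    kept = []
    for smiles in generated:
        if smiles in training_set:
            continue  # not novel: the model has seen this molecule
        if smiles in seen:
            continue  # duplicate of an earlier generated structure
        seen.add(smiles)
        kept.append(smiles)
    return kept


# Toy data, purely illustrative.
training = {"CCO", "c1ccccc1"}
generated = ["CCO", "CCN", "CCN", "CC(=O)O"]

unique_novel = filter_generated(generated, training)
print(unique_novel)
```

In this toy run, one string is dropped because it appears in the training set and another because it is a repeat, leaving two unique, novel structures.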
Using an LSTM model trained on the COCONUT database, an open-source library of known natural product molecules, the team built a new database comprising over 67 million natural product-like structures, far surpassing the roughly 400,000 known natural products registered in COCONUT. Their computationally generated database is also substantially more cost-effective than commercially available natural product libraries.
To validate their model, the researchers compared the generated library with a dataset of 81,384 COCONUT entries that had not been used for the model’s training. The library successfully reproduced 37 percent of the held-out natural products. The team also demonstrated that the generated molecules closely resembled known natural products in COCONUT in terms of structural likeness scores and biosynthetic pathway distributions, providing further validation.
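The held-out comparison comes down to a simple proportion: the share of unseen natural products that also turn up in the generated library. With 81,384 held-out entries, 37 percent corresponds to roughly 30,000 molecules. The sketch below shows that calculation on toy data; the function name `recovery_rate` and the example strings are hypothetical, and the molecules are again assumed to be canonical SMILES strings so that matching can be done by exact text comparison.

```python
def recovery_rate(generated, held_out):
    """Fraction of held-out molecules that also appear in the generated set."""
    held_out_set = set(held_out)
    if not held_out_set:
        raise ValueError("held_out must contain at least one molecule")
    recovered = held_out_set & set(generated)
    return len(recovered) / len(held_out_set)


# Toy data, purely illustrative.
held_out = ["CCO", "CCN", "CCC", "CCCl"]
generated = ["CCN", "CCBr", "CCO", "CCN"]

rate = recovery_rate(generated, held_out)
print(f"Recovered {rate:.0%} of held-out molecules")
```

Here two of the four held-out strings appear in the generated list, so the rate is 0.5.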
Having demonstrated the utility and potency of deep generative machine learning models in natural product discovery, Tay and Ang are already looking towards future prospects across a variety of industries.
For example, the generated library of molecules holds promise for uncovering novel sustainable bio-alternatives to existing fossil fuel-based chemicals. “We are excited to explore the potential of our generated natural products for various applications including insect repellents and therapeutics, as well as an aid for more precise analytics,” the team concluded.
The A*STAR-affiliated researchers contributing to this research are from the Institute of Sustainability for Chemicals, Energy and Environment (ISCE2) and the Institute of High Performance Computing (IHPC).