In brief

Using generative artificial intelligence based on a long short-term memory architecture, researchers generated a database of 67 million natural product-like molecules, expanding the pool of candidate molecules more than 150-fold over the roughly 400,000 known natural products to spur advancements in medicine, agriculture and numerous other industries.


Millions of nature’s secrets revealed

14 Feb 2024

A*STAR scientists use advanced computer models to create a massive database of natural products for drug discovery and beyond.

Nature serves as a bountiful source of compounds poised to revolutionise human health and wellbeing. The rich chemical diversity present in plants and fungi has given us molecules that can combat pathogens and enhance longevity.

Still, there remains a wealth of nature yet to be discovered. Dillon Tay and Shi Jun Ang, Research Scientists at A*STAR’s Institute of Sustainability for Chemicals, Energy and Environment (ISCE2) and Institute of High Performance Computing (IHPC), explained that there’s profound value in delving deeper into the universe of naturally derived chemicals. But traditional approaches are notoriously slow and offer no guarantee of uncovering anything of worth.

“The traditional approach to natural product discovery is through experimental screening of natural samples like plant extracts in the hopes of finding hits with the desired bioactivity,” explained the scientists.

Artificial intelligence (AI) can circumvent these limitations, said Tay and Ang. “We can now design ‘fit-for-purpose’ natural products, by pairing the generation of novel natural product-like structures with activity prediction models.”

In their study, the team employed a machine learning architecture based on long short-term memory (LSTM) to generate a natural product database. The LSTM architecture is adept at managing sequential data and can retain information over extended sequences, a critical feature for the accurate generation of complex molecular structures. Additionally, the generated structures had to be novel, chemically sound and diverse, spanning a broad physicochemical spectrum. These stringent criteria ensured the quality and diversity of the database, Tay and Ang explained.
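To make the idea of retaining information over a long sequence concrete, here is a minimal sketch of a single-unit LSTM cell in plain Python. It is an illustration only, not the team's model: the weights are hand-picked, hypothetical values, and the function names `lstm_step` and `run_cell` are invented for this example. The sketch stores a value in the cell's memory and then checks how much survives 50 further steps, much as a generator working through a molecule written as a text string (an assumption about the input format) might need to remember a ring or branch it opened many characters earlier.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """Advance a single-unit LSTM cell by one time step.

    weights maps each gate name to a tuple (w_x, w_h, bias).
    """
    def pre_activation(gate):
        w_x, w_h, bias = weights[gate]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))    # how much old memory to keep
    write = sigmoid(pre_activation("input"))      # how much new content to store
    read = sigmoid(pre_activation("output"))      # how much memory to expose
    candidate = math.tanh(pre_activation("candidate"))

    c = forget * c_prev + write * candidate       # updated cell (memory) state
    h = read * math.tanh(c)                       # updated hidden state
    return h, c


def run_cell(forget_bias, steps=50):
    """Start with 1.0 in memory, feed `steps` zero inputs, return what is left."""
    # Hypothetical hand-picked weights: input gate nearly closed, candidate is zero.
    weights = {
        "forget": (0.0, 0.0, forget_bias),
        "input": (0.0, 0.0, -6.0),
        "output": (0.0, 0.0, 0.0),
        "candidate": (0.0, 0.0, 0.0),
    }
    h, c = 0.0, 1.0
    for _ in range(steps):
        h, c = lstm_step(0.0, h, c, weights)
    return c


retained = run_cell(forget_bias=6.0)  # forget gate close to 1: memory persists
leaked = run_cell(forget_bias=0.0)    # forget gate at 0.5: memory decays quickly
print(f"retained={retained:.3f}, leaked={leaked:.3e}")
```

With the forget gate held close to 1, most of the stored value is still there after 50 steps; with the gate at 0.5, it has all but vanished. In a trained model these gate values are learned from data rather than set by hand.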

Using an LSTM machine learning model trained on the COCONUT database, an open-source library of known natural product molecules, the team successfully built a new database comprising over 67 million natural product-like structures, far surpassing the roughly 400,000 known natural products registered in COCONUT. Their computationally generated database also boasts the advantage of being substantially more cost-effective than commercially available natural product libraries.

To validate their model, the researchers compared the generated library with a dataset of 81,384 COCONUT entries that had not been used for the model’s training. The library successfully reproduced 37 percent of the held-out natural products. The team also demonstrated that the generated molecules closely resembled known natural products in COCONUT in terms of structural likeness scores and biosynthetic pathway distributions, providing further validation.
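The held-out check described above boils down to a set comparison: what fraction of the withheld structures also turn up in the generated library? The sketch below shows that calculation on a handful of toy molecule strings. It assumes each structure has already been reduced to a canonical identifier, so that identical strings mean identical molecules; the article does not say how the team matched structures, and the function name `recovery_rate` is hypothetical. For scale, 37 percent of 81,384 held-out entries corresponds to roughly 30,000 reproduced natural products.

```python
def recovery_rate(generated, held_out):
    """Return the fraction of held-out structures found in the generated set."""
    held_out_set = set(held_out)
    if not held_out_set:
        raise ValueError("held_out must contain at least one structure")
    recovered = held_out_set & set(generated)  # structures present in both
    return len(recovered) / len(held_out_set)


# Toy data for illustration only; not taken from COCONUT or the generated library.
held_out = ["CCO", "CC(=O)O", "c1ccccc1", "CCN"]
generated = ["CCO", "CCN", "CCCC", "OCCO", "CCO"]  # duplicates are ignored

rate = recovery_rate(generated, held_out)
print(f"{rate:.0%} of held-out structures were reproduced")  # prints "50% ..."
```

Dividing by the size of the held-out set, rather than the generated set, is what makes this a measure of how much known chemistry the model rediscovers, regardless of how many molecules it produces.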

Having demonstrated the utility and potency of deep generative machine learning models in natural product discovery, Tay and Ang are already looking towards future prospects across a variety of industries.

For example, the generated library of molecules holds promise for uncovering novel sustainable bio-alternatives to existing fossil fuel-based chemicals. “We are excited to explore the potential of our generated natural products for various applications including insect repellents and therapeutics, as well as an aid for more precise analytics,” the team concluded.

The A*STAR-affiliated researchers contributing to this research are from the Institute of Sustainability for Chemicals, Energy and Environment (ISCE2) and the Institute of High Performance Computing (IHPC).



Tay, D.W.P., Yeo, N.Z.X., Adaikkappan, K., Lim, Y.H. and Ang, S.J. 67 million natural product-like compound database generated via molecular language processing. Scientific Data 10, 296 (2023).

About the Researchers

Dillon Tay completed his BSc (Hons) 1st Class in Chemistry & Biological Chemistry and was the recipient of the Lee Kuan Yew Gold Medal from Nanyang Technological University. He completed his PhD degree at Imperial College London, where he studied homogeneous catalysis applications in carbonylation and CO2 utilisation. Upon returning to Singapore, he joined the Institute of Sustainability for Chemicals, Energy and Environment (ISCE2), where he is currently a Senior Scientist (Chemical Biotechnology & Biocatalysis). His research interests include sustainable bio-manufacturing, biocatalysis, cheminformatics, machine learning and artificial intelligence. He is a member of the Royal Society of Chemistry (MRSC), a Registered Scientist (RSci) and an associate member of the Higher Education Academy (AFHEA).

Shi Jun Ang is a Scientist at A*STAR's Institute of High Performance Computing (IHPC) and Institute of Sustainability for Chemicals, Energy and Environment (ISCE2). After completing his PhD studies in computational chemistry at the National University of Singapore (NUS), he pursued a postdoctoral stint at the Massachusetts Institute of Technology, US, with Prof Rafael Gomez-Bombarelli. There, he worked on using machine learning techniques to accelerate molecular dynamics for studying challenging chemical reactions. Back at A*STAR, he works at the intersection of AI, cheminformatics and high-throughput quantum chemistry for sustainable chemistry. Shi Jun also holds an adjunct lectureship position at NUS.

This article was made for A*STAR Research by Wildtype Media Group