
In brief

Combining data augmentation with a machine learning method known as a generative adversarial network helps algorithms classify audio more accurately using much less data.


Sharper sound classification with less data

11 May 2021

A generative adversarial network has been used to develop audio classification technologies that require much less training data.

“Hey Siri, what’s the weather forecast for today?” Ever wondered how devices such as smart speakers understand and respond to such requests? Apple’s Siri and Amazon’s Alexa are classic examples of audio classification technologies, devices powered by artificial intelligence (AI) that perform tasks according to voice commands.

As with most AI-driven systems, audio classification systems first need to go through a training regime in which their machine learning networks are fed large datasets of thousands of audio samples, conditioning the software to interpret commands and perform tasks accurately.

However, compiling these massive audio datasets—such as the Google Speech Command Dataset—is both expensive and time-consuming. Moreover, due to their highly variable nature, audio inputs pose a unique challenge for machine learning processes.

“Sound propagation is very sensitive to the environment, which is constantly changing,” explained speech technology expert Huy Dat Tran, a Senior Scientist from A*STAR’s Institute for Infocomm Research (I2R), adding that collecting audio data for all the possible sound variations of a specific command is virtually impossible.

Tran and the study’s first author, fellow I2R researcher Kah Kuan Teh, explored the potential of using data augmentation—a process of expanding datasets by adding slightly modified versions of existing data—to streamline and expedite the development of audio classification systems.
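To give a concrete sense of what data augmentation looks like in practice, here is a minimal Python sketch that turns one recording into several slightly modified copies. The perturbations shown (a small time shift, a gain change and low-level noise) are generic illustrations, not the physical modeling or wavelet scattering methods used in the study, and the function and variable names are placeholders.

```python
import numpy as np

def augment_waveform(wave: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a slightly modified copy of a 1-D audio waveform.

    Illustrative only: a random time shift, a random gain and additive
    noise are three common, label-preserving ways of creating extra
    training examples from existing audio.
    """
    # Random circular time shift of up to 10 percent of the clip length.
    shift = int(rng.integers(-len(wave) // 10, len(wave) // 10))
    augmented = np.roll(wave, shift)

    # Random gain between 0.8x and 1.2x.
    augmented = augmented * rng.uniform(0.8, 1.2)

    # Low-level Gaussian noise relative to the signal's own spread.
    noise = rng.normal(0.0, 0.03 * np.std(wave), size=wave.shape)
    return augmented + noise

# Example: expand a toy one-second "clip" into five augmented variants.
rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))  # stand-in for a real recording
variants = [augment_waveform(clip, rng) for _ in range(5)]
```

Each variant carries the same label as the original clip, so a small dataset can be stretched without any new recording sessions.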

The researchers explored the use of two audio data augmentation methods, physical modeling and the wavelet scattering transform, as well as a machine learning framework called the generative adversarial network, or GAN.

Data augmentation techniques were applied to condensed versions of the Google Speech Command Dataset, using between 10 and 25 percent of the original data. Tran and Teh found that combining the two augmentation approaches and embedding them into a GAN yielded a ground-breaking result: their new model interpreted voice commands with 92 percent accuracy after being trained with just 10 percent of the Google dataset.
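Building such a condensed training set is itself straightforward. The sketch below shows one plausible way to keep only a fixed fraction of each command class so that the reduced set preserves the original class balance; the (path, label) index format and helper names are hypothetical, not how the authors prepared their data.

```python
import random
from collections import defaultdict

def subsample_dataset(samples, fraction, seed=0):
    """Keep roughly `fraction` of the samples from each class.

    `samples` is assumed to be a list of (path, label) pairs -- a simplified
    stand-in for how a speech command dataset might be indexed on disk.
    """
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))

    rng = random.Random(seed)
    subset = []
    for label, items in by_label.items():
        rng.shuffle(items)
        keep = max(1, int(len(items) * fraction))  # keep at least one per class
        subset.extend(items[:keep])
    return subset

# Example: reduce a toy index to 10 percent of its original size.
toy_index = [(f"clip_{i}.wav", "yes" if i % 2 else "no") for i in range(100)]
small_set = subsample_dataset(toy_index, fraction=0.10)
```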

By dramatically lowering the amount of training data required, the new GAN-based approach could make powerful voice command technologies quicker and more cost-effective to develop than ever before, said Tran.

The researchers are currently leveraging their innovation to enhance a range of audio detection applications, from security surveillance systems to senior care devices that listen out for falls. “More recently, in response to COVID-19, we have developed an audio cough detection system to monitor people in public areas,” added Tran.

The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research (I2R).


References

Teh, K.K. & Tran, H.D. Embedding physical augmentation and wavelet scattering transform to generative adversarial networks for audio classification with limited training resources. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3262–3266 (2019).

About the Researcher

Huy Dat Tran is a Senior Scientist at A*STAR’s Institute for Infocomm Research (I2R). With more than 25 years of experience, Tran is an expert in acoustic and speech technologies. He received his PhD in acoustics from the National Academy of Sciences of Ukraine in 2000 and completed a postdoctoral fellowship at the F. Itakura Lab at Nagoya University, Japan, in 2005. He joined I2R in 2006 and now serves as Group Leader of the Audio Analytic and Speech Recognition Group as well as Deputy Department Head of the Aural and Language Intelligence Department.

This article was made for A*STAR Research by Wildtype Media Group