“Hey Siri, what’s the weather forecast for today?” Ever wondered how devices such as smart speakers understand and respond to such requests? Apple’s Siri and Amazon’s Alexa are classic examples of audio classification technologies: devices powered by artificial intelligence (AI) that perform tasks according to voice commands.
As with most AI-driven systems, audio classification systems first need to go through a training regime, in which their machine learning networks are fed large datasets of thousands of audio samples to condition the software to interpret commands and perform tasks accurately.
However, compiling these massive audio datasets—such as the Google Speech Command Dataset—is both expensive and time-consuming. Moreover, due to their highly variable nature, audio inputs pose a unique challenge for machine learning processes.
“Sound propagation is very sensitive to the environment, which is constantly changing,” explained speech technology expert Huy Dat Tran, a Senior Scientist from A*STAR’s Institute for Infocomm Research (I2R), adding that collecting audio data for all the possible sound variations of a specific command is virtually impossible.
Tran and the study’s first author, fellow I2R researcher Kah Kuan Teh, explored the potential of using data augmentation—a process of expanding datasets by adding slightly modified versions of existing data—to streamline and expedite the development of audio classification systems.
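For readers curious about what audio data augmentation looks like in practice, the sketch below illustrates the general idea using the librosa and NumPy libraries: a single labelled recording is perturbed in several ways, and each perturbed copy keeps the original label. The file path, noise level, pitch step and stretch rate are illustrative assumptions, not details from the study.

```python
import numpy as np
import librosa

# Load an existing labelled sample (path is illustrative only).
y, sr = librosa.load("speech_commands/yes/sample_0001.wav", sr=16000)

# 1. Additive noise: mimics different recording environments.
noisy = y + 0.005 * np.random.randn(len(y))

# 2. Pitch shift: mimics different speakers (here, up 2 semitones).
pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# 3. Time stretch: mimics different speaking rates (10% faster).
stretched = librosa.effects.time_stretch(y, rate=1.1)

# Each variant keeps the original label ("yes"), so one recording
# yields four training examples.
```

Physical modeling, one of the techniques the team investigated, pushes this idea further: instead of generic perturbations, it simulates how the environment itself alters a sound as it propagates.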
The researchers explored the use of two audio data augmentation methods, physical modeling and the wavelet scattering transform, as well as a machine learning framework called a generative adversarial network, or GAN.
Data augmentation techniques were applied to condensed versions of the Google Speech Command Dataset, using between 10 and 25 percent of the original data. Tran and Teh found that combining the two augmentation approaches and embedding them into a GAN yielded a groundbreaking result: their new model interpreted voice commands with 92 percent accuracy after being trained on just 10 percent of the Google dataset.
By dramatically lowering the amount of training data required, this new GAN could make powerful voice command technologies quicker and more cost-effective to develop than ever before, said Tran.
The researchers are currently leveraging their innovation to enhance a range of audio detection applications, from security surveillance systems to senior care devices that listen out for falls. “More recently, in response to COVID-19, we have developed an audio cough detection system to monitor people in public areas,” added Tran.
The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research (I2R).