There’s more to human speech than words alone. Tone, rhythm and subtle inflections carry layers of meaning that text often fails to convey.
“When we hear someone talk instead of read a message from them, we can gather a lot more information from acoustic signals embedded in their voice, including their emotions, gender, accent and energy levels,” said Ai Ti Aw, Head of the Aural and Language Intelligence Department at the A*STAR Institute for Infocomm Research (A*STAR I2R).
Today, the large language models (LLMs) behind generative artificial intelligence (AI) systems such as ChatGPT and Gemini are gaining a similar ability to process such signals by becoming multimodal: digesting not just text but also images, audio and video. Among these advances, however, audio-based LLMs (AudioLLMs) still struggle to consistently understand and respond to sound.
In a recent study, Aw and A*STAR I2R colleagues including Senior Principal Scientist Nancy F. Chen identified a gap in the field: the absence of a standardised method for assessing AudioLLM performance. Existing models are tested on different datasets—structured collections of audio samples—making direct comparison difficult. Moreover, current evaluation tasks fail to capture the full range of listening and reasoning skills needed to handle real-world audio inputs.
To tackle this, the team released AudioBench, a comprehensive, open-source evaluation benchmark for AudioLLMs. AudioBench spans 26 datasets—including six newly assembled by the team to fill existing gaps—covering eight tasks across three major skill areas: understanding speech, interpreting ambient sounds in a scene, and picking up non-verbal cues such as emotion, accent and gender.
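To make that structure concrete, here is a minimal sketch in Python of how a multi-task audio benchmark of this kind can be organised: datasets grouped by task and skill area, with one evaluation loop shared across all of them. All names below are illustrative placeholders, not AudioBench's actual datasets or API.

```python
# Hypothetical sketch of a multi-task audio benchmark; names are
# illustrative and do not reflect AudioBench's real datasets or code.
from collections import defaultdict

# skill area -> task -> dataset names (placeholders, not the real 26 datasets)
TASKS = {
    "speech_understanding": {"speech_recognition": ["asr_long", "asr_short"]},
    "audio_scene":          {"audio_captioning":   ["scene_caps"]},
    "paralinguistics":      {"emotion_recognition": ["emo_clips"]},
}

def dummy_model(audio, prompt):
    """Stand-in for an AudioLLM: returns a canned answer."""
    return "hello world"

def load_examples(dataset_name):
    """Stand-in loader: yields (audio, prompt, reference) triples."""
    yield (b"<audio bytes>", "Transcribe the clip.", "hello world")

def run_benchmark(model):
    """Run the model over every dataset and collect per-task scores."""
    scores = defaultdict(list)
    for skill, tasks in TASKS.items():
        for task, datasets in tasks.items():
            for name in datasets:
                correct = total = 0
                for audio, prompt, ref in load_examples(name):
                    total += 1
                    correct += model(audio, prompt).strip() == ref
                scores[(skill, task)].append((name, correct / total))
    return dict(scores)

print(run_benchmark(dummy_model))
```

Grouping datasets under shared tasks this way is what lets different models be scored on identical inputs, addressing the comparability gap the team identified.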
Testing five AudioLLMs against AudioBench, the researchers found that no single model outperformed the others across all tasks. Transcribing longer recordings proved particularly challenging, likely because models are typically trained on short audio clips. Results also varied with how prompts, the instructions given to a model, were phrased.
For open-ended queries, the team appointed another AI model—the open-source LLaMA-3-70B-Instruct—as a human-like ‘judge’ to evaluate answers, but consistent grading remained a challenge.
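A common way to implement such a judge is to prompt it with the question, a reference answer and the candidate answer, then parse a numeric score from its reply. The sketch below shows that pattern; the `query_judge` function is a placeholder for however one calls the judge model (such as LLaMA-3-70B-Instruct), and the prompt wording and 0-to-5 scale are assumptions, not AudioBench's exact protocol.

```python
# Hedged sketch of the 'LLM-as-judge' pattern; prompt and scale are
# assumptions, not AudioBench's exact grading setup.
import re

JUDGE_TEMPLATE = """Rate the candidate answer against the reference on a
scale of 0 (wrong) to 5 (fully correct). Reply with the number only.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Score:"""

def query_judge(prompt: str) -> str:
    """Placeholder for an actual call to the judge model."""
    return "4"  # canned reply so the sketch runs end to end

def judge_answer(question: str, reference: str, candidate: str) -> int:
    reply = query_judge(JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate))
    match = re.search(r"\d+", reply)
    # Parsing can fail when the judge ignores the instructions, which is
    # one source of the grading inconsistency the study reports.
    return int(match.group()) if match else 0

print(judge_answer("What is spoken in the clip?", "hello world", "hello, world"))
```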
AudioBench was developed with the support of the National Research Foundation, Singapore. By sharing AudioBench with the wider community, the researchers hope to encourage collaboration and accelerate advances in audio-based AI.
“AudioBench provides a suite of evaluation toolkits, data and a leaderboard so other teams can advance benchmarks, test their own models and compare results with others across the world,” said Chen.
Looking ahead, the team plans to develop a Southeast Asian edition of AudioBench to include regional languages and accents, and to build empathetic LLMs capable of understanding tone, emotion and speaker demographics. “We’re also expanding AudioBench under our efforts in Singapore’s National Multimodal LLM Programme, which include A*STAR’s MERaLiON models and related research in privacy-preserving machine learning and multicultural reasoning,” Chen added.
The A*STAR-affiliated researchers contributing to this research are from the A*STAR Institute for Infocomm Research (A*STAR I2R) and the A*STAR Centre for Frontier AI Research (A*STAR CFAR).