In brief

AudioBench, an audio-based LLM benchmark comprising 26 datasets and eight tasks, tests models on their ability to process real-world sounds and follow user instructions, revealing key areas for refinement.

A sound test for AI listeners

2 Jan 2026

A new comprehensive benchmark for audio-based artificial intelligence allows researchers to gauge and improve how models interpret speech, soundscapes and everyday acoustic cues.

There’s more to human speech than words alone. Tone, rhythm and subtle inflections carry layers of meaning that text often fails to convey.

“When we hear someone talk instead of read a message from them, we can gather a lot more information from acoustic signals embedded in their voice, including their emotions, gender, accent and energy levels,” said Ai Ti Aw, Head of the Aural and Language Intelligence Department at the A*STAR Institute for Infocomm Research (A*STAR I2R).

Today, the large language models (LLMs) behind generative artificial intelligence (AI) systems such as ChatGPT and Gemini are gaining a similar skill for information processing by becoming multimodal: digesting not just text but also images, audio and video. Among these advances, however, audio-based LLMs (AudioLLMs) still struggle to consistently understand and respond to sound.

In a recent study, Aw and A*STAR I2R colleagues including Senior Principal Scientist Nancy F. Chen identified a gap in the field: the absence of a standardised method for assessing AudioLLM performance. Existing models are tested on different datasets—structured collections of audio samples—making direct comparison difficult. Moreover, current evaluation tasks fail to capture the full range of listening and reasoning skills needed to handle real-world audio inputs.

To tackle this, the team announced the release of AudioBench, an open-source and comprehensive evaluation benchmark for AudioLLMs. AudioBench spans 26 datasets—including six newly assembled by the team to fill existing gaps—to cover eight different tasks across three major skill areas: understanding speech, interpreting ambient sounds in a scene, and picking up non-verbal cues such as emotion, accent and gender.
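
The benchmark's structure is easiest to picture as a small hierarchy. The snippet below is an illustrative Python sketch of how the eight tasks might be grouped under the three skill areas; the task names are paraphrased from this description rather than taken from the AudioBench codebase.

```python
# Illustrative sketch only: one way to organise AudioBench's eight tasks under
# its three skill areas. Task names are paraphrased from the article's
# description, not taken verbatim from the AudioBench codebase.

SKILL_AREAS = {
    "speech_understanding": [
        "speech_recognition",
        "speech_question_answering",
        "speech_instruction_following",
    ],
    "audio_scene_understanding": [
        "audio_captioning",
        "audio_scene_question_answering",
    ],
    "voice_understanding": [  # non-verbal cues such as emotion, accent and gender
        "emotion_recognition",
        "accent_recognition",
        "gender_recognition",
    ],
}

# Eight tasks in total, each backed by one or more of the 26 datasets.
assert sum(len(tasks) for tasks in SKILL_AREAS.values()) == 8
```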

Testing five AudioLLMs against AudioBench, the researchers found that no single model outperformed the others across all tasks. Transcribing longer recordings proved particularly challenging, likely because models are typically trained on short audio clips. Results also varied depending on how prompts and instructions were phrased.

For open-ended queries, the team appointed another AI model—the open-source LLaMA-3-70B-Instruct—as a human-like ‘judge’ to evaluate answers, but consistent grading remained a challenge.
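
This 'LLM-as-judge' setup can be sketched roughly as follows. The `query_llm` helper, the prompt wording and the 0-to-5 scale here are illustrative assumptions, not the actual prompt or scoring scheme used in the study.

```python
# Rough sketch of model-as-judge grading for open-ended answers.
# `query_llm` is a hypothetical helper that sends a prompt to the judge model
# (e.g. LLaMA-3-70B-Instruct) and returns its text reply.

JUDGE_TEMPLATE = """You are grading an AI assistant's answer to an audio-based question.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Rate the model answer from 0 (wrong) to 5 (fully correct). Reply with the number only."""

def judge_answer(question: str, reference: str, answer: str, query_llm) -> int:
    """Ask the judge model to score one answer on a 0-to-5 scale."""
    prompt = JUDGE_TEMPLATE.format(question=question, reference=reference, answer=answer)
    reply = query_llm(prompt)
    try:
        return max(0, min(5, int(reply.strip().split()[0])))
    except (ValueError, IndexError):
        # Replies that do not parse as a number are one source of the
        # inconsistent grading mentioned above.
        return 0
```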

AudioBench was developed with the support of the National Research Foundation, Singapore. By sharing AudioBench with the wider community, the researchers hope to encourage collaboration and accelerate advances in audio-based AI.

“AudioBench provides a suite of evaluation toolkits, data and a leaderboard so other teams can advance benchmarks, test their own models and compare results with others across the world,” said Chen.

Looking ahead, the team plans to develop a Southeast Asian edition of AudioBench to include regional languages and accents, and to build empathetic LLMs capable of understanding tone, emotion and speaker demographics. “We’re also expanding AudioBench under our efforts in Singapore’s National Multimodal LLM Programme, which include A*STAR’s MERaLiON models and related research in privacy-preserving machine learning and multicultural reasoning,” Chen added.

The A*STAR-affiliated researchers contributing to this research are from the A*STAR Institute for Infocomm Research (A*STAR I2R) and the A*STAR Centre for Frontier AI Research (A*STAR CFAR).

References

Wang, B., Zou, X., Lin, G., Sun, S., Liu, Z., et al. AudioBench: A universal benchmark for audio large language models. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies 1, 4297–4316 (2025).

About the Researchers

Nancy F. Chen

Senior Principal Scientist and Lead Principal Investigator

A*STAR Institute for Infocomm Research (A*STAR I2R)
Nancy F. Chen is a Senior Principal Scientist and Lead Principal Investigator at the A*STAR Institute for Infocomm Research (A*STAR I2R), where she heads the Multimodal Generative AI group and the AI for Education programme. A serial best-paper award winner and an honoree of Singapore’s 100 Women in Tech, Chen conducts AI research spanning culture, healthcare, neuroscience, social media, education and forensics. Her multilingual technology has led to commercial spinoffs and adoption by Singapore’s Ministry of Education. Chen holds multiple grants under Singapore’s National Multimodal LLM Programme, in addition to leading research efforts for MERaLiON (Multimodal Empathetic Reasoning and Learning in One Network). She is an active international research advisor and leader, having served as Program Chair for AI conferences such as NeurIPS and ICLR; she is also a member of the APSIPA Board of Governors and has served as an IEEE SPS Distinguished Lecturer and an ISCA Board Member. Previously, she worked at MIT Lincoln Laboratory during her PhD studies at MIT and Harvard, US.

Ai Ti Aw

Head, Aural and Language Intelligence Department

A*STAR Institute for Infocomm Research (A*STAR I2R)
Ai Ti Aw is the Head of the Aural and Language Intelligence Department at the A*STAR Institute for Infocomm Research (A*STAR I2R), where she spearheads the development of machine translation and multilingual technology capabilities for local and Southeast Asian languages. She is also co-Principal Investigator of the National Multimodal LLM Programme, which develops Singapore’s research and engineering capabilities in the field. Under this programme, Aw leads MERaLiON, overseeing the development of models designed for Singapore’s and the region’s diverse cultural and linguistic contexts. A pioneer in Southeast Asian natural language processing since the late 1990s, Aw has helped position Singapore at the forefront of machine translation and language processing, with her teams’ innovations earning accolades such as the Firefly Awards, the MCI IDEA! Award, the ASEAN Outstanding Engineering Achievement Award and the President’s Technology Award.

Bin Wang

Bin Wang was formerly a Scientist with the A*STAR Institute for Infocomm Research (A*STAR I2R). He obtained his PhD from the University of Southern California, Los Angeles in 2021 and his B.Eng. from the University of Electronic Science and Technology of China, and was a Research Fellow with the National University of Singapore from 2021 to 2023. His research focuses on multimodal LLMs and conversational AI systems. He served as publication chair at EMNLP 2023 and as an editorial member for APSIPA Transactions, and has published more than 40 academic papers in top journals and conferences including ACL, EMNLP, NAACL, ACM KDD, TNNLS and TASLP, winning multiple best-paper awards.

This article was made for A*STAR Research by Wildtype Media Group