At global events like the G20 summit, international representatives face a daunting challenge: effective communication. Even with the best interpreters, capturing the subtleties of different languages, cultural norms and reasoning styles can make discussions cumbersome.
Artificial intelligence (AI) researchers believe that advancements in cross-lingual consistency and cultural reasoning could transform these interactions. Bin Wang and Zhengyuan Liu from the A*STAR Institute for Infocomm Research (A*STAR I2R) emphasised the revolutionary potential of such models.
“AI systems that bridge language and cultural gaps can make it easier for people from different countries and cultures to communicate, share ideas and work together,” said Wang and Liu. “This helps create a more connected and inclusive world where everyone can contribute and feel understood.”
In education, for example, AI could deliver tailored learning experiences by adapting content to students’ cultural and linguistic needs. In content creation, it could support localisation, making messages resonate with target audiences, bridging divides and expanding global reach.
However, the current landscape of AI language models presents significant limitations. They are predominantly English-centric, reflecting the developers’ perspectives and resource availability. This bias leads to inconsistent performance when the same question is asked in different languages. “Multilingual models often fail to transfer knowledge seamlessly across languages,” noted Wang and Liu.
To bridge this gap, the team developed SeaEval, a comprehensive benchmark for assessing multilingual AI models. It incorporates 28 datasets, including seven new ones designed specifically to test cultural reasoning and cross-lingual consistency.
SeaEval evaluates AI models using a range of metrics, including accuracy, cross-lingual consistency and instruction sensitivity. By identifying gaps in handling linguistic and cultural nuances, it aims to guide improvements in multilingual AI.
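To make the cross-lingual consistency idea concrete, the sketch below compares a model’s answers to parallel versions of the same multiple-choice question in different languages and reports how often the choice stays the same. This is a minimal illustration only: the data layout and the `ask_model` callable are assumptions for the example, not SeaEval’s actual interface.

```python
from typing import Callable

def cross_lingual_consistency(
    parallel_items: list[dict],
    ask_model: Callable[[str, list[str]], int],
) -> float:
    """Fraction of questions answered with the same option in every language.

    Assumed layout (illustrative, not SeaEval's format):
        [{"en": {"question": ..., "options": [...]},
          "vi": {"question": ..., "options": [...]}}, ...]
    with options aligned so the same index means the same answer, and
    ask_model(question, options) returning the chosen option's index.
    """
    consistent = 0
    for item in parallel_items:
        # Collect the model's chosen option index for each language version.
        choices = {
            ask_model(content["question"], content["options"])
            for content in item.values()
        }
        if len(choices) == 1:  # same option picked in every language
            consistent += 1
    return consistent / len(parallel_items)
```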
The team’s findings brought several critical challenges to light. Even top-performing models like GPT-4 show significant performance drops of over 10 percent when switching from English to other languages such as Vietnamese. Models also remain sensitive to label arrangement and paraphrased instructions, revealing biases that compromise stability. Lastly, cultural reasoning capabilities remain underdeveloped even in advanced systems.
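One way to quantify the label-arrangement sensitivity described above is to re-ask the same multiple-choice question with its options shuffled and count how often the model’s chosen answer changes. The sketch below assumes a hypothetical `ask_model` callable and is not SeaEval’s implementation.

```python
import itertools
from typing import Callable

def option_order_sensitivity(
    question: str,
    options: list[str],
    ask_model: Callable[[str, list[str]], int],
) -> float:
    """Fraction of option orderings where the chosen answer text differs
    from the answer given for the original ordering.

    A perfectly stable model scores 0.0; higher values indicate
    label-arrangement bias. ask_model is a hypothetical callable that
    returns the index of the chosen option.
    """
    baseline = options[ask_model(question, options)]
    orderings = list(itertools.permutations(options))
    flips = sum(
        1
        for ordering in orderings
        if ordering[ask_model(question, list(ordering))] != baseline
    )
    return flips / len(orderings)
```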
Acknowledging the support of the National Supercomputing Centre (NSCC) and the National Research Foundation, Wang and Liu said that SeaEval is a crucial step towards building AI systems capable of connecting diverse global communities more effectively.
The A*STAR-affiliated researchers contributing to this research are from the A*STAR Institute for Infocomm Research (A*STAR I2R).