
In brief

Model training based on Consistent Prompt Tuning can help vision language models learn to classify unknown objects and discover new categories, which can be important for wildlife monitoring and medical imaging applications.

Photo by Eric Isselee | Shutterstock

Seeing the unknown

3 Nov 2025

A consistent rulebook of prompts could be the key to advancing artificial intelligence models, enhancing their image classification performance even when dealing with unknown objects.

How does one distinguish a cat from a dog? To us, the answer may seem like a no-brainer; we can recognise different kinds of dogs and group them all under the same category, even if we have never encountered that exact breed before. However, these tasks require extensive training for artificial intelligence (AI) systems such as vision language models (VLMs).

VLMs are the cousins of large language models like ChatGPT but work with both text and images instead of only text. As VLMs rely on patterns extracted from well-defined, labelled training data, the types of data they can handle and the categories they can identify are often limited.

Muli Yang, a Scientist at the A*STAR Institute for Infocomm Research (A*STAR I2R), worked with collaborators from Xidian University, China, and Nanyang Technological University, Singapore, to advance VLMs’ capabilities, hoping to enable more accurate classification of images even beyond previously defined categories. “This would be important for real-world problems like rare wildlife monitoring or medical imaging, where labelled data is hard to come by,” Yang explained.

The researchers turned to a training method called Consistent Prompt Tuning (CPT), where the model is repeatedly given a standard prompt to follow. For example, they would consistently feed the model a sentence such as “The foreground is an animal named [blank]”, with the blank changing depending on the species in the image. Since the prompt’s structure stays the same but the details change to match the image, the VLM can make sense of the similarities and differences across images, eventually learning to categorise the different species shown.
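To make the idea concrete, the sketch below shows the general prompt-template principle used by CLIP-style vision language models: the same sentence is filled with each candidate name, encoded as text, and compared with the image’s features. This is only an illustration of the principle, not the team’s CPT implementation; the template, class names and toy encoders are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

# Illustrative only: the encoders below are random stand-ins for the text
# and image encoders of a pretrained vision language model.
TEMPLATE = "The foreground is an animal named {}."
CLASS_NAMES = ["cat", "dog", "fox"]

def toy_text_encoder(prompts):
    return torch.randn(len(prompts), 512)   # one feature vector per prompt

def toy_image_encoder(image):
    return torch.randn(512)                 # one feature vector for the image

def classify(image, class_names=CLASS_NAMES):
    # The template stays fixed; only the blank changes for each candidate class.
    prompts = [TEMPLATE.format(name) for name in class_names]
    text_feats = F.normalize(toy_text_encoder(prompts), dim=-1)
    image_feat = F.normalize(toy_image_encoder(image), dim=-1)
    # Cosine similarity between the image and each filled-in prompt.
    scores = text_feats @ image_feat
    return class_names[int(scores.argmax())]

print(classify(image=None))  # with real encoders, this would name the animal
```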

The researchers hypothesised that providing the model with a stable ‘rulebook’ would help the VLM better adapt to open and uncertain environments. Critical to the CPT approach was ensuring consistency: the team verified that each description matched the corresponding image, and that two photos of the same animal were treated alike by the model.

“Consistency checking keeps the VLM from drifting off-track, helping it discover new categories more accurately and reliably than previous methods,” Yang added.
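As a rough illustration of that principle (not the paper’s actual objective), a consistency term can penalise the model whenever two views of the same animal yield different probability distributions over the candidate prompts. The function name and example scores below are hypothetical.

```python
import torch
import torch.nn.functional as F

def consistency_loss(scores_view1, scores_view2):
    """Penalise disagreement between two views of the same image.

    Each argument holds image-to-prompt similarity scores, one per candidate class.
    """
    p = F.softmax(scores_view1, dim=-1)          # predicted class probabilities, view 1
    log_q = F.log_softmax(scores_view2, dim=-1)  # log-probabilities, view 2
    # KL divergence shrinks as the two predictions come to agree.
    return F.kl_div(log_q, p, reduction="batchmean")

# Two hypothetical similarity vectors over three candidate categories.
view1 = torch.tensor([[2.0, 0.5, 0.1]])
view2 = torch.tensor([[1.8, 0.7, 0.2]])
print(consistency_loss(view1, view2))  # small value: the two views mostly agree
```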

The CPT-trained VLM outperformed other state-of-the-art methods on image classification tasks. It was also able to flexibly group images in different ways, such as by location rather than by object. This ability showed that the model was not just memorising training data but also identifying broader patterns that improved its classification performance.

“Category discovery gives the model the ability to walk into a brand-new environment and make sense of both the things it already knows and the things it has never seen before,” Yang said. Reducing reliance on labelled data makes such AI systems more scalable and accessible, broadening their potential use across multiple sectors.

The A*STAR-affiliated researchers contributing to this research are from the A*STAR Institute for Infocomm Research (A*STAR I2R).


References

Yang, M., Yin, J., Gu, Y., Deng, C., Zhang, H., et al. Consistent prompt tuning for generalized category discovery. International Journal of Computer Vision 133, 4014–4041 (2025).

About the Researcher

Muli Yang is a Scientist at the A*STAR Institute for Infocomm Research (A*STAR I2R). He received his PhD degree from Xidian University, Xi’an, China, in 2023, and was a visiting PhD student at Nanyang Technological University, Singapore, from 2022 to 2023. His research focuses on open world learning and vision-language modelling. Yang has published more than 25 papers in leading conferences and journals, including CVPR, ICCV, ICLR, NeurIPS, ACL, IJCV and TIP, and has received a Best Paper Award.

This article was made for A*STAR Research by Wildtype Media Group