How does one distinguish a cat from a dog? To us, the answer may seem like a no-brainer; we can recognise different kinds of dogs and group them all under the same category, even if we have never encountered that exact breed before. However, these tasks require extensive training for artificial intelligence (AI) systems such as vision language models (VLMs).
VLMs are the cousins of large language models like ChatGPT, but work with both text and images instead of text alone. As VLMs rely on extracting patterns from well-defined, labelled training data, the types of data they can handle and the categories they can identify are often limited.
Muli Yang, a Scientist at the A*STAR Institute for Infocomm Research (A*STAR I2R), worked with collaborators from Xidian University, China and Nanyang Technological University, Singapore, to advance VLMs’ capabilities, hoping to enable more accurate classification of images even beyond previously defined categories. “This would be important for real-world problems like rare wildlife monitoring or medical imaging, where labelled data is hard to come by,” Yang explained.
The researchers turned to a training method called Consistent Prompt Tuning (CPT), in which the model is repeatedly given a standard prompt to follow. For example, they would consistently feed the model a sentence such as “The foreground is an animal named [blank]”, with the blank changing depending on the species in the image. Because the prompt’s structure stays the same while the details change to match each image, the VLM can make sense of the similarities and differences across images, eventually learning to categorise the different species shown.
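To make the template idea concrete, the sketch below shows how a fixed prompt, with only the category name swapped in, can be scored against an image using an off-the-shelf CLIP-style VLM from the Hugging Face transformers library. This is an illustrative example of consistent prompt templating, not the authors’ CPT implementation; the template wording and class list are assumptions for demonstration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load an off-the-shelf vision language model (CLIP) and its processor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A consistent prompt template: only the category name changes between classes.
template = "The foreground is an animal named {}."
candidate_classes = ["cat", "dog", "rabbit"]
prompts = [template.format(name) for name in candidate_classes]

# A stand-in image; in practice this would be a photo to classify.
image = Image.new("RGB", (224, 224), color="gray")

# Score the image against every templated prompt and pick the best match.
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
predicted = candidate_classes[probs.argmax().item()]
print(f"Predicted category: {predicted}")
```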
The researchers hypothesised that giving the model this stable ‘rulebook’ would help the VLM adapt to open and uncertain environments. Critical to the CPT approach was ensuring consistency: the team verified that each description matched its corresponding image, and that two photos of the same animal were treated alike by the model.
“Consistency checking keeps the VLM from drifting off-track, helping it discover new categories more accurately and reliably than previous methods,” Yang added.
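One common way such a consistency check can be turned into a training signal is to penalise disagreement between the model’s predictions for two views of the same subject. The sketch below is a schematic, assumed formulation using a KL-divergence term in PyTorch, not the paper’s exact loss.

```python
import torch
import torch.nn.functional as F

def prediction_consistency_loss(logits_view1: torch.Tensor,
                                logits_view2: torch.Tensor) -> torch.Tensor:
    """Penalise disagreement between class predictions for two views
    (e.g. two photos) of the same animal. Schematic only."""
    log_p1 = F.log_softmax(logits_view1, dim=-1)
    p2 = F.softmax(logits_view2, dim=-1)
    # KL divergence pushes the two predicted distributions towards agreement.
    return F.kl_div(log_p1, p2, reduction="batchmean")

# Toy usage: logits over 5 candidate categories for a batch of 4 image pairs.
logits_a = torch.randn(4, 5)
logits_b = torch.randn(4, 5)
loss = prediction_consistency_loss(logits_a, logits_b)
print(f"Consistency loss: {loss.item():.4f}")
```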
The CPT-trained VLM outperformed other state-of-the-art methods on image classification tasks. It was also able to flexibly group images in different ways, such as by location rather than by object. This ability showed that the model was not just memorising training data but also identifying broader patterns that improved its classification performance.
“Category discovery gives the model the ability to walk into a brand-new environment and make sense of both the things it already knows and the things it has never seen before,” Yang said. Reducing reliance on labelled data makes such AI systems more scalable and accessible, broadening their potential use across multiple sectors.
The A*STAR-affiliated researchers contributing to this research are from the A*STAR Institute for Infocomm Research (A*STAR I2R).
