In the real world, artificial intelligence (AI) systems performing vision-based tasks will likely encounter objects they've never seen before. A service robot might find an unfamiliar tool in a warehouse; a medical scanner might detect a rare tumour. But since it’s often impractical to teach an AI to recognise every possible object, many open world object detection (OWOD) models instead estimate how ‘object-like’ an unfamiliar shape is to decide whether it’s more than just part of the background.
While this method can be effective, it has two limitations. “These models cannot explain how an object is detected, and often struggle when background features resemble unknown objects,” explained Muli Yang, a Scientist at the A*STAR Institute for Infocomm Research (A*STAR I²R).
In a recent collaborative work with researchers from the University of Hong Kong, Sichuan University and Xidian University, China, Yang and A*STAR I²R colleagues including Principal Investigator Hongyuan Zhu proposed an OWOD system with a different approach. Rather than ‘object-likeness’, their model asks: what attributes does this object have?
“By shifting to well-defined attributes, like 'umbrella-like' or 'transparent', our model can describe objects using rich, natural language,” said Yang. “This makes the model’s decisions more transparent and less prone to confusion, as it learns about the intrinsic properties of objects rather than statistical probabilities.”
The team’s innovation lies in how these attributes are selected. Existing models use multi-stage pipelines that first select attributes and then refine them, but this approach can be time-consuming and prone to accumulating errors. “We needed a unified, end-to-end approach that could optimise selection and detection simultaneously, not in disjointed steps,” said Yang.
For more efficient attribute selection, the team incorporated a mathematical framework known as Partial Optimal Transport (POT). In conventional optimal transport, a model assumes every attribute in its database must be matched to an object, even if some pairings make no sense. POT relaxes this constraint by selecting only an object’s most relevant attributes and passing those on to the next stage of computation.
"We realised this was about transporting only a targeted fraction of attributes which truly aligned with the visual objects," said Yang.
By combining POT with curriculum learning, which trains models on progressively harder visual problems, the team developed the Partial Attribute Assignment (PASS) system. When tested on five challenging real-world datasets spanning aquatic animals, aerial photos, video game avatars, medical X-rays and surgical footage, PASS significantly outperformed state-of-the-art OWOD methods across all benchmarks, and provided a clear view of the top attributes it used to detect both familiar and unfamiliar objects.
"The diversity of our testing benchmarks shows that this method has broad applications," said Yang. “PASS is a game changer for domains where ‘unknown’ anomalies are critical, but data on them is scarce. This could include robotics and mobile manipulation; medical imaging and diagnostics; and industrial inspection and automation.”
The A*STAR-affiliated researchers contributing to this research are from the A*STAR Institute for Infocomm Research (A*STAR I²R).