Finding the proverbial ‘needle in a haystack’ is notoriously challenging, not just because of the target’s small size, but because of the clutter around it. When creating computers that ‘see’ like we do, artificial intelligence (AI) researchers face a similar puzzle. To learn how to precisely pinpoint that needle, object detection algorithms often need to scrutinize high-resolution images in detail, driving up the processing power needed to train them.
This constraint has profound implications for computer vision’s real-world applications. To a self-driving car, for example, an object that looks small to its cameras might in fact be a distant yet rapidly approaching hazard.
“Small objects like debris, road signs, pedestrians and animals pose dangers for on-road autonomous vehicles,” said Fen Fang, a Research Scientist at A*STAR’s Institute for Infocomm Research (I2R). “These need to be detected early and accurately to ensure safe travel. However, detection algorithms can be slowed down by the difficulties of identifying small objects from limited visual cues, processing their small details, and picking them out from cluttered environments.”
To give computers a sharper, faster and more efficient eye for detail, Fang’s team developed a novel policy framework that combines reinforcement learning (RL) with a spatial transformation network (STN) and a transformer model paired with a convolutional neural network (CNN). The approach trains an RL agent to follow a two-step process: a coarse location query (CLQ), followed by context-sensitive object detection.
“CLQ predicts the regions in an image where small objects are likely to be located, which enables the object detector to use high-resolution image patches for those regions, and low-resolution patches for the rest,” said Fang.
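To make the coarse-to-fine idea concrete, the sketch below illustrates it in Python. This is a minimal, hypothetical illustration and not the team’s implementation: names such as CoarseLocationQuery, coarse_to_fine and the detector callable are assumptions, and the region scorer here is a plain CNN stand-in for the STN- and RL-based components of the actual framework.

```python
# Hypothetical sketch of a coarse-to-fine pipeline: score coarse regions on a
# low-resolution view, then run the detector on high-resolution crops only
# where small objects are likely. Names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseLocationQuery(nn.Module):
    """Scores a coarse grid of regions for the likelihood of containing small objects."""
    def __init__(self, grid=8):
        super().__init__()
        self.grid = grid
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.score = nn.Conv2d(32, 1, 1)  # one score per spatial cell

    def forward(self, low_res_image):
        feats = self.backbone(low_res_image)
        # Pool the score map down to a grid x grid map of region scores
        scores = F.adaptive_avg_pool2d(self.score(feats), self.grid)
        return scores.squeeze(1)  # shape: (batch, grid, grid)

def coarse_to_fine(image, clq, detector, top_k=4, low_res=256):
    """Query coarse regions, then detect on high-res patches only where needed."""
    B, C, H, W = image.shape
    low = F.interpolate(image, size=(low_res, low_res), mode="bilinear",
                        align_corners=False)
    scores = clq(low)                        # (batch, grid, grid) region scores
    g = scores.shape[-1]
    ph, pw = H // g, W // g                  # patch size in the full-res image

    detections = []
    for b in range(B):
        top_cells = scores[b].flatten().topk(top_k).indices
        for idx in top_cells:
            r, c = divmod(idx.item(), g)
            # High-resolution crop for a region likely to hide small objects
            patch = image[b:b+1, :, r*ph:(r+1)*ph, c*pw:(c+1)*pw]
            detections += detector(patch)
        # Cheap low-resolution pass over the rest of the scene
        detections += detector(low[b:b+1])
    return detections
```

In this sketch, only the top-scoring cells are processed at full resolution, which is how the CLQ keeps the number of high-resolution pixels the detector must examine low.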
Imagine a detective searching a satellite photo of a city for clues; the STN acts like their local guide, drawing a map of key districts and roads over the photo. The detective can then use the CNN-transformer like a magnifying glass, giving them more detailed, close-up views of suspicious districts. This synergistic framework not only homes in on areas likely to conceal small targets, but also minimises false positives, reducing the computing resources and human effort needed.
Tested across diverse image datasets—from bustling city streets to aerial vistas—the researchers’ framework improved detection accuracy by up to two percent while reducing the number of pixels processed. In some cases, it not only matched but outperformed current state-of-the-art methods.
“Our framework can aid autonomous vehicles in avoiding accidents by improving their environmental awareness and predicting potential hazards,” said Fang. “By identifying small objects like trees, power lines and infrastructure in aerial images, the framework could also help accurately map and monitor environments, which would support urban planning, disaster management and environmental conservation.”
The team is currently strengthening the robustness of their recently patented model by expanding its perceptual field with enriched datasets, including synthetic data produced by generative AI.
The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research (I2R).