We experience the world the way we do because the left and right eyes capture images from slightly different angles before the brain seamlessly merges them into a single, three-dimensional image. Computer vision experts have struggled to equip machines with a similar capability: how do you teach a computer to ‘see’ in 3D?
3D object retrieval (searching and retrieving 3D objects from large databases based on their similarity to a given query object) is critical for powering applications such as 3D printing, autonomous driving, augmented reality and industrial product design.
Unlike conventional 2D reverse image searches, 3D object retrieval requires an accurate interpretation of an object's shape and structure from different perspectives. Among current approaches, view-based methods, which represent a 3D object as a collection of 2D images captured from multiple viewpoints, are favoured for their flexibility and computational efficiency. However, they can miss details that are specific to individual views, such as fine-grained object parts or local variations.
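For readers who want the concrete picture, the sketch below shows one common way such a view-based pipeline can be wired up: a shared 2D network extracts features from each rendered view, and those features are pooled into a single descriptor used for retrieval. The ResNet backbone and max-pooling step are illustrative assumptions in the style of earlier multi-view networks, not the team's exact method.

```python
# Minimal sketch of a view-based 3D descriptor; assumes each 3D object has
# already been rendered into a fixed set of 2D views. Illustrative only.
import torch.nn as nn
from torchvision.models import resnet18

class MultiViewDescriptor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)   # shared 2D CNN applied to every view
        backbone.fc = nn.Identity()         # keep the 512-d feature, drop the classifier
        self.backbone = backbone

    def forward(self, views):               # views: (batch, num_views, 3, H, W)
        b, v, c, h, w = views.shape
        feats = self.backbone(views.reshape(b * v, c, h, w))  # per-view features
        feats = feats.reshape(b, v, -1)
        return feats.max(dim=1).values      # pool across views into one descriptor

# Retrieval then ranks database objects by, for example, cosine similarity
# between their descriptors and the query's descriptor.
```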
Research Scientist Dongyun Lin from A*STAR’s Institute for Infocomm Research (I2R) explained that adding self-attention modules can give view-based methods a much-needed boost. “Self-attention modules can be really beneficial for tasks where different parts/subregions of the input are more or less important for making accurate predictions,” Lin said.
Lin gave the example of a computer vision platform tasked with identifying 3D objects within a complicated image. “Self-attention modules can help the model ‘pay attention’ to certain regions of the image that are most relevant for identifying the object.”
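In code, the core of a self-attention module boils down to computing a relevance weight between every pair of input features and re-weighting the features accordingly. The generic scaled dot-product sketch below is an assumption for illustration; it is not the specific module the team built.

```python
# Minimal sketch of scaled dot-product self-attention over a set of region
# (or view) features; higher weights mean more "attention".
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (num_regions, dim); w_q/w_k/w_v: (dim, dim) learned projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (x.shape[-1] ** 0.5)  # pairwise relevance between regions
    weights = F.softmax(scores, dim=-1)      # normalise into attention weights
    return weights @ v                       # re-weighted region features

# Example: 6 region features of dimension 64
x = torch.randn(6, 64)
w = [torch.randn(64, 64) for _ in range(3)]
attended = self_attention(x, *w)             # shape (6, 64)
```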
Together with collaborators from Safran Landing Systems, Lin and colleagues built two custom self-attention modules: the View Attention Module (VAM) and the Instance Attention Module (IAM). The VAM identifies discriminative features within each individual view of an object, while the IAM identifies relevant features shared across all views. By running both modules in parallel and applying a novel combinatory loss function to the extracted features, the team proposed an improved workflow for accurate 3D object retrieval.
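The sketch below gives a hedged sense of how two such branches might be run in parallel and trained with a combined objective. The internals of each branch, the classification heads and the weighted-sum loss are all illustrative assumptions, not the published design of the VAM and IAM.

```python
# Hedged sketch of two parallel attention branches over multi-view features,
# loosely mirroring the described VAM (per-view) and IAM (cross-view) idea.
import torch
import torch.nn as nn

class AttentionBranch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # learns an importance score per view feature

    def forward(self, feats):                 # feats: (batch, num_views, dim)
        weights = torch.softmax(self.score(feats), dim=1)
        return (weights * feats).sum(dim=1)   # attention-weighted aggregation

class DualAttentionRetrieval(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.vam = AttentionBranch(dim)       # view-specific branch (assumption)
        self.iam = AttentionBranch(dim)       # cross-view branch (assumption)
        self.head_vam = nn.Linear(dim, num_classes)
        self.head_iam = nn.Linear(dim, num_classes)

    def forward(self, view_feats):
        f_vam, f_iam = self.vam(view_feats), self.iam(view_feats)
        fused = torch.cat([f_vam, f_iam], dim=-1)   # descriptor used for retrieval
        return fused, self.head_vam(f_vam), self.head_iam(f_iam)

def combined_loss(logits_vam, logits_iam, labels, alpha=0.5):
    # Illustrative "combinatory" objective: a weighted sum of the two branch losses.
    ce = nn.functional.cross_entropy
    return alpha * ce(logits_vam, labels) + (1 - alpha) * ce(logits_iam, labels)
```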
Lin’s team compared their proposed method against other view-based benchmarks using multi-view images of everyday objects from public datasets. They found that the VAM and IAM duo was not only more efficient, but also more consistent than existing methods. This development has the potential to springboard applications that rely on 3D object retrieval, such as computer-aided design (CAD).
Speaking on plans to further develop their platform, Lin said: “We plan to incorporate the VAM and IAM into sequence models like LSTM or Vision Transformer to improve the aggregation performance of multi-view data for better retrieval performance.”
The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research (I2R).