Imagine a group of friends sitting around a table to piece together a jigsaw puzzle, except that the image on the puzzle changes every time someone makes a move. This is reminiscent of what happens in multirobot systems, where groups of robots try to learn and adapt to their environment simultaneously.
This phenomenon, known as 'nonstationarity', poses a challenge for each robot as it learns from its surroundings to make better decisions: as the robots learn and modify their actions, their collective behaviour changes the environment around them in unpredictable ways.
Hongliang Guo, a Scientist at A*STAR’s Institute for Infocomm Research (I2R), painted a picture of the complications nonstationarity causes: “In the worst-case scenario, although the robots have visited every part of an environment, a moving target in that environment may still not be detected.”
Robots often rely on traditional learning methods such as deep Q-networks and policy gradient methods, which excel in static and predictable environments. However, Guo explained that these methods struggle in dynamic settings because they assume stable conditions while the robots are still learning to navigate and complete tasks.
To counter this, Guo and researchers from the University of Electronic Science and Technology of China and the Massachusetts Institute of Technology, US, proposed a solution: a learning rule called the cross-entropy regularisation policy gradient (CE-PG). This strategy helps robots in a multirobot system spread out and learn more effectively in variable environments, encouraging them not to cluster in one place but to explore different areas.
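The paper's exact objective is not reproduced in this article, but the gist of CE-PG can be sketched in a few lines of Python. In the toy PyTorch example below, each robot keeps a standard policy gradient loss, and a pairwise cross-entropy term between the robots' action distributions is maximised so that their behaviours, and hence their search paths, drift apart. The network sizes, the weighting coefficient `beta` and the stand-in data are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ROBOTS, OBS_DIM, N_ACTIONS = 3, 8, 5  # toy sizes, chosen for illustration

# One small policy network per robot (architecture is an assumption).
policies = nn.ModuleList([
    nn.Sequential(nn.Linear(OBS_DIM, 32), nn.Tanh(), nn.Linear(32, N_ACTIONS))
    for _ in range(N_ROBOTS)
])
optimiser = torch.optim.Adam(policies.parameters(), lr=1e-3)

def ce_pg_loss(obs, actions, returns, beta=0.1):
    """obs: (N_ROBOTS, batch, OBS_DIM); actions, returns: (N_ROBOTS, batch).
    beta weighs the dispersion term and is an assumed value."""
    # 1) Standard policy gradient (REINFORCE) term for each robot.
    pg = 0.0
    for i in range(N_ROBOTS):
        log_pi = F.log_softmax(policies[i](obs[i]), dim=-1)
        chosen = log_pi.gather(1, actions[i].unsqueeze(1)).squeeze(1)
        pg = pg - (chosen * returns[i]).mean()

    # 2) Cross-entropy regulariser H(pi_i, pi_j) = -sum_a pi_i(a|s) log pi_j(a|s),
    #    evaluated on a shared batch of states. Subtracting it from the loss
    #    *maximises* pairwise cross-entropy, pushing the robots' action
    #    distributions apart so they spread out rather than cluster.
    states = obs.reshape(-1, OBS_DIM)
    log_dists = [F.log_softmax(p(states), dim=-1) for p in policies]
    ce = 0.0
    for i in range(N_ROBOTS):
        for j in range(N_ROBOTS):
            if i != j:
                ce = ce - (log_dists[i].exp() * log_dists[j]).sum(-1).mean()
    return pg - beta * ce / (N_ROBOTS * (N_ROBOTS - 1))

# One illustrative update step on random stand-in data.
obs = torch.randn(N_ROBOTS, 16, OBS_DIM)
actions = torch.randint(N_ACTIONS, (N_ROBOTS, 16))
returns = torch.randn(N_ROBOTS, 16)
optimiser.zero_grad()
ce_pg_loss(obs, actions, returns).backward()
optimiser.step()
```

The sign of the regulariser is the key design choice here: ordinary entropy regularisation makes a single policy more random, whereas maximising the cross-entropy between different robots' policies makes them disagree with one another, which is what drives the dispersal behaviour.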
The robots were first trained centrally with shared information, but executed their tasks independently using the learned policies; this setup avoids the real-time policy adjustments that can destabilise learning. The CE-PG regulariser then aided in dispersing the robots, ensuring that different areas were covered during tasks.
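What 'decentralised execution' looks like after training can be illustrated with a short, hypothetical sketch: each robot's policy is frozen and queried using only that robot's own observations, with no communication and no online learning, which is why the robots stick to their learned strategies during a task. The interface below reuses the toy architecture from the previous sketch and is an assumption, not the authors' code.

```python
import torch
import torch.nn as nn

# Per-robot policies, frozen after training (same toy architecture as above).
N_ROBOTS, OBS_DIM, N_ACTIONS = 3, 8, 5
policies = [
    nn.Sequential(nn.Linear(OBS_DIM, 32), nn.Tanh(), nn.Linear(32, N_ACTIONS)).eval()
    for _ in range(N_ROBOTS)
]

with torch.no_grad():  # execution only: no gradients, no mid-task policy updates
    local_obs = torch.randn(N_ROBOTS, OBS_DIM)  # stand-in for each robot's sensors
    for i, policy in enumerate(policies):
        action = policy(local_obs[i]).argmax().item()  # decide from local info alone
        print(f"robot {i} takes action {action}")
```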
Through a series of simulations and real-world experiments, the researchers showed that the CE-PG approach successfully sidesteps the problem of unpredictable changes by ensuring that the robots stick to their learned strategies during tasks. In all cases, the CE-PG scheme found the moving target, matching or outperforming standard policy gradient and deep Q-network techniques, particularly in its robustness against individual robot failures.
This method can significantly enhance the efficiency and reliability of multirobot systems in real-world applications such as search and rescue, surveillance and exploration. Guo suggested some practical applications: “Multirobot search teams could look for a missing child in a mall environment, or for lost luggage at the airport.”
The decentralised execution aspect of the team's method also means it scales well with the number of robots involved, potentially enabling larger and more complex multirobot operations. “Our next step is to devise CE-PG+, which is applicable to ‘unknown’ environments, without prior topological information,” said Guo.
The A*STAR-affiliated researchers contributing to this research are from the Institute for Infocomm Research (I2R).
