How do you make an omelette? Follow a tutorial video, and you would first crack an egg into a bowl, then whisk it until smooth. After pouring the mixture into a pan, you’d cook it over medium heat, then serve. If an artificial intelligence (AI) model were shown only the start (uncracked egg) and the goal (omelette), could it predict the steps in between?
This question lies at the heart of procedure planning, which entails predicting action sequences from instructional videos. By reducing reliance on predefined programming, such models could enable robots to become more autonomous in fields like manufacturing, as well as make human-AI collaborations more adaptive to various situations.
However, procedure planning models are typically given only the start and goal states. “The lack of intermediate visual information complicates the process, since small variations in similar tasks might require different actions in specific contexts,” said Fen Fang, a Research Scientist at the A*STAR Institute for Infocomm Research (A*STAR I2R).
In addition, these AI models face a vast pool of action choices. For instance, if tasked with making an omelette, the model must identify cooking actions rather than ones related to painting, and ensure that the planned steps follow one another logically and coherently.
Motivated by these challenges, Fang, Principal Scientist Xulei Yang, Research Scientist Muli Yang and other A*STAR I2R colleagues developed the Visual State Generation helps Task-Selective Diffusion (VISTA-D) framework to improve prediction accuracy.
VISTA-D uses a generative AI model called Stable Diffusion to fill in the visual gap between the start and goal states. Stable Diffusion creates images meant to immediately follow the start state, while an automatic selection mechanism chooses the most suitable scenes for training. “As the generated images provide concrete visual context, this approach reduces ambiguity in predicting the next steps and anchors the action plan to a more realistic trajectory,” said Fang.
Moreover, VISTA-D employs a task-selective mask that categorises the task and filters out irrelevant action choices. To reduce classification errors, the researchers also used existing vision-language models to describe human actions in the scene, much like adding closed captions for audio, and to extract informative features that strengthen classification.
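To make the idea of a task-selective mask concrete, here is a minimal sketch in Python. It is not the researchers' implementation: it assumes the model produces one raw score per action and that each task comes with a known list of allowed actions. The action names, the `TASK_ACTIONS` table and the `masked_softmax` function are all hypothetical, chosen only for illustration.

```python
import math

# Hypothetical action vocabulary, grouped by task (illustrative only).
TASK_ACTIONS = {
    "make_omelette": {"crack egg", "whisk egg", "pour into pan", "serve"},
    "paint_wall": {"open paint can", "dip roller", "roll paint"},
}
ALL_ACTIONS = sorted(a for actions in TASK_ACTIONS.values() for a in actions)


def masked_softmax(scores, task):
    """Convert raw action scores into probabilities, giving zero
    probability to actions that do not belong to the given task."""
    allowed = TASK_ACTIONS[task]
    # Off-task actions are set to -inf so that exp() maps them to 0.
    masked = [s if a in allowed else -math.inf
              for a, s in zip(ALL_ACTIONS, scores)]
    peak = max(masked)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return {a: e / total for a, e in zip(ALL_ACTIONS, exps)}


# Made-up scores in which an off-task action scores highest before masking.
scores = [0.0] * len(ALL_ACTIONS)
scores[ALL_ACTIONS.index("dip roller")] = 5.0
scores[ALL_ACTIONS.index("whisk egg")] = 2.0

probs = masked_softmax(scores, "make_omelette")
best = max(probs, key=probs.get)
print(best, probs["dip roller"])
```

In this toy case, the painting action has the highest raw score, but once the task is identified as omelette-making it can no longer be selected, and the choice falls to the best-scoring cooking action.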
When the team tested VISTA-D on different datasets, they found notable improvements. On a dataset with a large pool of action choices, the resulting planning model predicted action sequences up to 11 per cent more accurately than the baseline approach.
“Moving forward, we intend to translate our framework into other domains like manufacturing,” said Fang. The method could help identify defective stages and devise an alternative action plan to salvage production. “Such observation-driven re-planning has the potential to enhance manufacturing yield and resilience.”
The A*STAR-affiliated researchers contributing to this research are from the A*STAR Institute for Infocomm Research (A*STAR I2R).