
In brief

A training framework improves procedure planning models by generating intermediate visual steps between the start and goal states of a task while using a task-selective mask to identify the most relevant actions to pursue.


Training AI to plan step by step

26 Jun 2026

Just by seeing the starting state and end goal, AI models could predict how to complete a task by filling in the unknown steps and focusing only on the actions that matter.

How do you make an omelette? Follow a tutorial video, and you would first crack an egg into a bowl, then whisk it until smooth. After pouring the mixture into a pan, you’d cook it over medium heat, then serve. If an artificial intelligence (AI) model were shown only the start (uncracked egg) and the goal (omelette), could it predict the steps in between?

This question lies at the heart of procedure planning, which entails predicting action sequences from instructional videos. By reducing reliance on predefined programming, such models could enable robots to become more autonomous in fields like manufacturing, as well as make human-AI collaborations more adaptive to various situations.

However, procedure planning models are typically given only the start and goal states. “The lack of intermediate visual information complicates the process, since small variations in similar tasks might require different actions in specific contexts,” said Fen Fang, a Research Scientist at the A*STAR Institute for Infocomm Research (A*STAR I2R).

In addition, these AI models face a vast pool of action choices. For instance, if tasked with making an omelette, the model must identify cooking actions rather than ones related to painting, and ensure that the planned steps follow one another logically and coherently.

Motivated by these challenges, Fang, Principal Scientist Xulei Yang, Research Scientist Muli Yang and other A*STAR I2R colleagues developed the Visual State Generation helps Task-Selective Diffusion (VISTA-D) framework to improve prediction accuracy.

VISTA-D uses a generative AI model called Stable Diffusion to fill in the visual gap between the start and goal states. Stable Diffusion creates images meant to immediately follow the start state, while an automatic selection mechanism chooses the most suitable scenes for training. “As the generated images provide concrete visual context, this approach reduces ambiguity in predicting the next steps and anchors the action plan to a more realistic trajectory,” said Fang.

Moreover, VISTA-D employs a task-selective mask that categorises the task and filters out irrelevant action choices. To avoid errors, the researchers also used existing vision-language models to describe human actions in the scene, much like adding closed captions for audio, and extract informative features for strengthening classification.
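To make the filtering idea concrete, here is a minimal sketch of how a task-selective mask might narrow a pool of candidate actions. It is an illustration only, not the researchers' implementation: the action list, the task-to-action mapping, the scores and the function names are all hypothetical, and the masking is shown on a plain list of scores rather than inside a diffusion model.

```python
import math

# Hypothetical action pool and task-to-action mapping (illustrative only).
ACTIONS = ["crack egg", "whisk egg", "pour into pan", "dip brush", "paint wall"]
TASK_ACTIONS = {
    "make omelette": {"crack egg", "whisk egg", "pour into pan"},
    "paint room": {"dip brush", "paint wall"},
}


def task_selective_mask(task):
    """Return a 0/1 mask over ACTIONS marking the actions relevant to the task."""
    allowed = TASK_ACTIONS[task]
    return [1 if action in allowed else 0 for action in ACTIONS]


def masked_softmax(scores, mask):
    """Turn scores into probabilities, forcing masked-out actions to zero."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("mask removes every action")
    peak = max(kept)  # subtract the largest kept score for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def pick_action(scores, task):
    """Pick the most probable action among those the task mask allows."""
    probs = masked_softmax(scores, task_selective_mask(task))
    best = max(range(len(ACTIONS)), key=lambda i: probs[i])
    return ACTIONS[best], probs


# Made-up scores in which a painting action scores highest before masking.
raw_scores = [0.2, 1.0, 0.5, 3.0, 2.5]
action, probs = pick_action(raw_scores, "make omelette")
print(action)
```

In this toy example, the painting actions receive the highest raw scores, yet the omelette mask sets their probability to zero, so only cooking actions remain in contention.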

Upon testing VISTA-D on different datasets, the team found notable improvements. The resulting planning model predicted action sequences up to 11 per cent more accurately than the baseline approach when dealing with a dataset with a large pool of action choices.

“Moving forward, we intend to translate our framework into other domains like manufacturing,” said Fang. The method could help identify defective stages and devise an alternative action plan to salvage production. “Such observation-driven re-planning has the potential to enhance manufacturing yield and resilience.”

The A*STAR-affiliated researchers contributing to this research are from the A*STAR Institute for Infocomm Research (A*STAR I2R).


References

Fang, F., Yang, M., Wu, M., Yang, Y., Xu, Q., et al. Toward accurate procedure planning in instructional videos: Visual state generation helps task-selective diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 48(4), 4033–4050 (2026).

About the Researchers

Fen Fang received her PhD degree in computer science and engineering, majoring in computer graphics, from Nanyang Technological University, Singapore, in 2014. She is currently a Research Scientist with the Institute for Infocomm Research, A*STAR, Singapore. Her research interests include defect detection, object detection and segmentation, scene recognition and reinforcement learning.

Muli Yang is a Scientist at the A*STAR Institute for Infocomm Research (A*STAR I2R). He received his PhD degree from Xidian University, China, in 2023, and was a visiting PhD student at Nanyang Technological University, Singapore, from 2022 to 2023. His research focuses on open-world learning and vision-language modelling. Yang has published more than 30 papers in leading conferences and journals, including CVPR, ICCV, ICLR, NeurIPS, ACL, IJCV, TPAMI and TIP, and has received a Best Paper Award and a Best Demonstration Award.

Xulei Yang received his PhD degree from Nanyang Technological University (NTU) in 2007. He is currently a Principal Scientist and group leader at the Institute for Infocomm Research (I2R), A*STAR, and was previously the research head at YITU Technology Singapore. He has more than 16 years of R&D experience in deep learning and machine learning for computer vision and healthcare, and has published more than 100 scientific papers and international patents in the fields of deep learning, 3D vision and medical imaging. He is an IEEE Senior Member and Kaggle Competition Master.

This article was made for A*STAR Research by Wildtype Media Group