
In brief

A novel training framework based on process-supervised direct preference optimisation uses offline simulations to evaluate and reward a large language model’s step-by-step reasoning, boosting its logical consistency and accuracy enough to outperform much larger models.

Photo by Google DeepMind | Unsplash

Points for the work, not just the answer

3 Dec 2025

By adding rewards for logical reasoning steps, a new training framework aims to make large language models more robust and reliable.

When asked to solve a tricky problem, some large language models (LLMs) today offer step-by-step explanations of their reasoning. However, these ‘thought processes’ are often riddled with errors or outright fabrications. Cleaning up this messy logic usually requires human annotators, or the model itself, to review each step, which takes additional time and resources. So, how might a model be taught to reason more reliably on its own?

“One idea we had was to teach LLMs to evaluate their reasoning with heuristic, or ‘good enough’, real-time search methods,” said Fangkai Jiao, a fourth-year PhD student at Nanyang Technological University (NTU) and the A*STAR Institute for Infocomm Research (A*STAR I2R). “At the same time, we wondered if—like a chess player studying past games—models could learn from their own offline reasoning histories to identify promising steps, without relying on human annotation or real-time search.”

With this twofold approach in mind, Jiao and A*STAR I2R colleagues, including Principal Scientist Nancy Chen and Lead Research Engineer Zhengyuan Liu, collaborated with NTU and Salesforce Research Singapore to develop a novel offline LLM training framework based on process-supervised Direct Preference Optimisation (pDPO). Their latest work builds on a previous study of self-supervised, logic-enhanced training for LLMs, which focused on activating LLM reasoning through in-context learning.

Where most trial-and-error frameworks for LLM training reward correct answers, pDPO rewards sound reasoning. “Our framework uses offline simulations to let models try multiple ways to complete partial solutions, then see which ones reach correct answers, allowing them to estimate and rank the most promising intermediate steps,” Jiao explained. “It then aims to learn from complete reasoning trajectories with higher accumulated grades.”
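
How this plays out in practice is easiest to see in code. Below is a minimal, hypothetical Python sketch of the idea described above, not the team’s released implementation: `sample_completions` and `is_correct` are placeholder names standing in for a real LLM sampler and answer checker, and the best-versus-worst trajectory pair stands in for the preference data a DPO-style objective would then consume.

```python
# Illustrative sketch only -- not the authors' code. It mimics the loop described
# above: roll out completions from partial solutions offline, score each step by
# how often its rollouts reach the correct answer, then build a preference pair
# of trajectories ranked by their accumulated step scores.
import random
from dataclasses import dataclass


@dataclass
class StepEstimate:
    prefix: list[str]  # reasoning steps taken so far
    value: float       # fraction of offline rollouts from this prefix that reached the right answer


def sample_completions(prefix: list[str], n: int) -> list[list[str]]:
    """Hypothetical stand-in for an LLM sampler: continue a partial solution n times."""
    return [prefix + [f"extra step {random.randint(0, 9)}", "final answer"] for _ in range(n)]


def is_correct(trajectory: list[str], gold_answer: str) -> bool:
    """Hypothetical stand-in for an answer checker on a completed trajectory."""
    return random.random() < 0.5


def estimate_step_values(steps: list[str], gold_answer: str, n_rollouts: int = 8) -> list[StepEstimate]:
    """Score every partial solution by how often rollouts from it reach the correct answer."""
    estimates = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        rollouts = sample_completions(prefix, n_rollouts)
        wins = sum(is_correct(t, gold_answer) for t in rollouts)
        estimates.append(StepEstimate(prefix=prefix, value=wins / n_rollouts))
    return estimates


def build_preference_pair(trajectories: list[list[str]], gold_answer: str):
    """Rank whole trajectories by accumulated step values; pair the best against the worst.

    The resulting (chosen, rejected) pair is the kind of preference data a
    DPO-style objective could be trained on.
    """
    scored = sorted(
        (sum(e.value for e in estimate_step_values(t, gold_answer)), t) for t in trajectories
    )
    return scored[-1][1], scored[0][1]  # (chosen, rejected)


if __name__ == "__main__":
    candidates = [
        ["read the premises", "derive the implication", "conclude option A"],
        ["read the premises", "guess", "conclude option C"],
    ]
    chosen, rejected = build_preference_pair(candidates, gold_answer="A")
    print("chosen:", chosen)
    print("rejected:", rejected)
```

In this toy setup, the rollout success rate plays the role of the “grade” for each intermediate step, and only the relative ranking of trajectories matters; a real pipeline would replace the random placeholders with actual model sampling and answer verification.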

The team showed that when tested against LogiQA-v2, a challenging dataset to evaluate a model’s logical reasoning, a seven-billion-parameter LLM trained with pDPO (Llama2-7B-pDPO) outperformed a much larger model (GPT-3.5-Turbo), scoring 55.5 versus 45.4 on the benchmark.

“While our results show that small reasoning models can outperform large general models, reasoning-focused training can enhance models of any size,” said Jiao. “We found that process rewards help models find a sweet spot between speed and sound logic, cutting out redundant steps while keeping what’s essential. Ideally, a model’s reasoning should be complete enough to verify, but concise enough to understand; like how a good teacher explains concepts clearly without unnecessary tangents.”

The team’s work also offers lessons for the design of training datasets: the LogiQA-v2 dataset, which demands multi-step reasoning, proved more effective at teaching robust reasoning than ReClor, another widely used benchmark.

“While ReClor often allows one-step solutions, LogiQA-v2 contains logic problems that call for categorical reasoning, complex logical changes, and sufficient or necessary conditions,” noted Jiao. “Therefore, future model training datasets should be designed to resist ‘shortcut learning’ by requiring genuine reasoning across varied problem types. This would ensure models develop generalisable thinking skills, rather than task-specific tricks.”

The A*STAR-affiliated researchers contributing to this research are from the A*STAR Institute for Infocomm Research (A*STAR I2R).


References

Jiao, F., Qin, C., Liu, Z., Chen, N.F. and Joty, S. Learning planning-based reasoning via trajectories collection and process reward synthesizing. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 334–350 (2024).

About the Researchers

Fangkai Jiao

PhD Student

Nanyang Technological University (NTU) and A*STAR Institute for Infocomm Research (A*STAR I2R)
Fangkai Jiao is a fourth-year PhD student at Nanyang Technological University (NTU) and the A*STAR Institute for Infocomm Research (A*STAR I2R). Prior to his PhD, he received his MEng and BEng degrees from Shandong University, China, in 2022 and 2019, respectively. Jiao's research focuses on weakly-supervised training and data synthesis for machine reasoning and large language models, spanning the mid-training and post-training stages. He has published over 20 papers in top-tier conferences and journals, including ACL, EMNLP, NAACL, ICLR, TMLR and TPAMI. He has also held research internships at DAMO Academy, Alibaba Group, and Microsoft Research Asia.

Zhengyuan Liu

Tech Lead, Multimodal Generative AI group

A*STAR Institute for Infocomm Research (A*STAR I2R)
Zhengyuan Liu is currently a Tech Lead in the Multimodal Generative AI group and the Assistant Head of the AI for Education programme at the A*STAR Institute for Infocomm Research (A*STAR I2R). He has published over 30 research papers in top-tier AI and natural language processing conferences, including ACL, NAACL, EMNLP, COLING, ICASSP and INTERSPEECH. He serves as a reviewer for conferences including NeurIPS, ICLR and ACL, and for journals including IEEE TASLP, ACM CSUR and Neurocomputing. He has been elected an IEEE Senior Member for his significant professional achievements, and has won Best Paper Awards at SIGDIAL 2021, C3NLP at ACL 2024 and SUMEval at COLING 2025, as well as Outstanding Paper Awards at EMNLP 2023 and EMNLP 2024.

Nancy F. Chen

Senior Principal Scientist and Lead Principal Investigator

A*STAR Institute for Infocomm Research (A*STAR I2R)
Nancy F. Chen is a Senior Principal Scientist and Lead Principal Investigator at the A*STAR Institute for Infocomm Research (A*STAR I2R), where she heads the Multimodal Generative AI group and the AI for Education programme. A serial best paper award winner and an honoree of Singapore’s 100 Women in Tech, Chen conducts AI research spanning culture, healthcare, neuroscience, social media, education and forensics. Her multilingual technology has led to commercial spinoffs and adoption by Singapore’s Ministry of Education. Chen holds multiple grants under Singapore’s National Multimodal LLM Programme, in addition to leading research efforts for MERaLiON (Multimodal Empathetic Reasoning and Learning in One Network). She is an active international research advisor and leader, having served as Program Chair for AI conferences such as NeurIPS and ICLR. She is also a member of the APSIPA Board of Governors and has served as an IEEE SPS Distinguished Lecturer and an ISCA Board Member. Previously, she worked at MIT Lincoln Laboratory during her PhD studies at MIT and Harvard, US.

This article was made for A*STAR Research by Wildtype Media Group