When asked to solve a tricky problem, some large language models (LLMs) today offer step-by-step explanations of their reasoning. However, these ‘thought processes’ are often riddled with errors or outright fabrications. Cleaning up this messy logic usually requires human reviewers or the model itself to check each step, which costs additional time and resources. So, how might a model be taught to reason more reliably on its own?
“One idea we had was to teach LLMs to evaluate their reasoning with heuristic, or ‘good enough’, real-time search methods,” said Fangkai Jiao, a fourth-year PhD student at Nanyang Technological University (NTU) and the A*STAR Institute for Infocomm Research (A*STAR I2R). “At the same time, we wondered if—like a chess player studying past games—models could learn from their own offline reasoning histories to identify promising steps, without relying on human annotation or real-time search.”
With this twofold approach in mind, Jiao and A*STAR I2R colleagues, including Principal Scientist Nancy Chen and Lead Research Engineer Zhengyuan Liu, collaborated with NTU and Salesforce Research Singapore to develop a novel offline LLM training framework based on process-supervised Direct Preference Optimisation (pDPO). Their latest work builds on a previous study on self-supervised, logic-enhanced training for LLMs, which focused on activating LLM reasoning through in-context learning.
Where most trial-and-error frameworks for LLM training reward correct answers, pDPO rewards sound reasoning. “Our framework uses offline simulations to let models try multiple ways to complete partial solutions, then see which ones reach correct answers, allowing them to estimate and rank the most promising intermediate steps,” Jiao explained. “It then aims to learn from complete reasoning trajectories with higher accumulated grades.”
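For readers curious about the mechanics, the sketch below illustrates that offline simulate-score-rank loop in Python. It is a minimal, hypothetical illustration rather than the team’s actual implementation: `sample_completions` and `is_correct` are assumed stand-ins for a real LLM sampler and answer checker, and the resulting pairs would feed a DPO-style preference loss.

```python
def score_step(question, partial_solution, gold_answer,
               sample_completions, is_correct, n_rollouts=8):
    """Estimate how promising a reasoning prefix is: sample several
    completions offline and return the fraction that reach the correct
    final answer."""
    completions = sample_completions(question, partial_solution, n=n_rollouts)
    hits = sum(is_correct(c, gold_answer) for c in completions)
    return hits / n_rollouts


def rank_trajectories(trajectories, step_scores):
    """Rank complete reasoning trajectories by accumulated step scores.
    `trajectories` maps a trajectory id to the list of step ids it uses;
    `step_scores` maps each step id to its estimated value."""
    totals = {tid: sum(step_scores[s] for s in steps)
              for tid, steps in trajectories.items()}
    return sorted(totals, key=totals.get, reverse=True)


def build_preference_pairs(ranked_ids):
    """Pair higher-ranked trajectories (chosen) with lower-ranked ones
    (rejected); these pairs would supply the preference data for
    DPO-style training."""
    return [(ranked_ids[i], ranked_ids[-(i + 1)])
            for i in range(len(ranked_ids) // 2)]
```

In this framing, a step’s value is simply its empirical success rate across offline rollouts, so no human annotation or real-time search is needed during training.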
The team showed that when tested on LogiQA-v2, a challenging dataset for evaluating a model’s logical reasoning, a seven-billion-parameter LLM trained with pDPO (Llama2-7B-pDPO) outperformed the much larger GPT-3.5-Turbo, scoring 55.5 versus 45.4 on the benchmark.
“While our results show that small reasoning models can outperform large general models, reasoning-focused training can enhance models of any size,” said Jiao. “We found that process rewards help models find a sweet spot between speed and sound logic, cutting out redundant steps while keeping what’s essential. Ideally, a model’s reasoning should be complete enough to verify but concise enough to understand, much like a good teacher explaining concepts clearly without unnecessary tangents.”
The team’s work also offers lessons for the design of training datasets: the LogiQA-v2 dataset, which demands multi-step reasoning, proved more effective at teaching robust reasoning than ReClor, another widely used benchmark.
“While ReClor often allows one-step solutions, LogiQA-v2 contains logic problems that call for categorical reasoning, complex logical chains, and reasoning over sufficient or necessary conditions,” noted Jiao. “Therefore, future model training datasets should be designed to resist ‘shortcut learning’ by requiring genuine reasoning across varied problem types. This would ensure models develop generalisable thinking skills, rather than task-specific tricks.”
The A*STAR-affiliated researchers contributing to this research are from the A*STAR Institute for Infocomm Research (A*STAR I2R).
