Apple and Duke researchers present a reinforcement learning method that lets LLMs provide intermediate answers, improving speed and accuracy

Long chain-of-thought (CoT) reasoning improves the performance of large language models on complex tasks, but it comes with drawbacks. The typical “think-then-answer” approach slows down response times, which hampers real-time interactions such as chatbots. It also risks inaccuracies, since errors in earlier reasoning steps can lead to a misleading final answer. Unlike people, who often share partial thoughts or conclusions during a conversation, LLMs delay any response until all reasoning is complete. While RL is often used to train reasoning models, it mainly rewards final answers, overlooking useful intermediate steps. There is growing interest in teaching models to alternate between thinking and answering, but this remains a challenge.

RL has become a popular method for improving reasoning in LLMs, building on its success in aligning models with human preferences. Two common reward types guide RL: outcome-based rewards (ORM), which focus on the final answer, and process-based rewards (PRM), which provide feedback on intermediate reasoning steps. While PRMs offer more detailed supervision, they often rely on human annotation and additional models, making them complex and prone to problems such as reward hacking. Separately, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and methods for reducing latency and improving efficiency.

Researchers from Apple and Duke University introduce interleaved reasoning, a new RL approach that allows language models to alternate between thinking and answering when solving complex, multi-step questions. Instead of waiting until the end to respond, models provide informative intermediate answers, which improves feedback for users and guides their reasoning. Using a straightforward rule-based reward, the model is trained to produce useful reasoning steps, leading to over 80% faster responses and up to 19.3% better accuracy. Trained only on QA and logic datasets, the method shows strong generalization to more challenging benchmarks such as MATH, GPQA, and MMLU.

The study proposes a reinforcement learning framework to train LLMs for interleaved reasoning, where models alternate between internal thinking and user-facing intermediate answers. Each intermediate answer, or “sub-answer,” is shared once the model reaches a meaningful milestone in its reasoning. A specialized training template with <think> and <answer> tags is used. The approach uses rule-based rewards, specifically format, final accuracy, and conditional intermediate accuracy, to guide learning. Notably, intermediate rewards are applied only when specific criteria are met, ensuring that the model prioritizes overall correctness. The authors also test various reward schemes, such as all-or-none, partial credit, and time-discounted rewards, to optimize the quality of the reasoning.
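To make the reward design more concrete, below is a minimal Python sketch of how format, final-accuracy, and conditional intermediate rewards could be combined. The tag names, matching logic, and weights are illustrative assumptions, not the authors' exact implementation.

```python
import re

# Illustrative think/answer template tags; weights and matching rules below are assumptions.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_reward(output: str) -> float:
    """1.0 if the output contains well-formed think and answer blocks."""
    return 1.0 if THINK_RE.search(output) and ANSWER_RE.search(output) else 0.0

def final_accuracy_reward(output: str, gold_final: str) -> float:
    """1.0 if the last <answer> block matches the reference final answer."""
    answers = ANSWER_RE.findall(output)
    return 1.0 if answers and answers[-1].strip() == gold_final.strip() else 0.0

def intermediate_reward(output: str, gold_steps: list[str]) -> float:
    """Fraction of reference sub-answers found among the intermediate <answer> blocks."""
    if not gold_steps:
        return 0.0
    intermediates = [a.strip() for a in ANSWER_RE.findall(output)[:-1]]
    hits = sum(any(g.strip() == a for a in intermediates) for g in gold_steps)
    return hits / len(gold_steps)

def total_reward(output: str, gold_final: str, gold_steps: list[str]) -> float:
    """Combine the rewards; intermediate credit is granted only when the format
    is respected and the final answer is correct (the conditional reward)."""
    r_fmt = format_reward(output)
    r_final = final_accuracy_reward(output, gold_final)
    r_mid = intermediate_reward(output, gold_steps) if (r_fmt and r_final) else 0.0
    return r_fmt + r_final + 0.5 * r_mid  # 0.5 weight is an illustrative choice
```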

Interleaved reasoning was evaluated on both familiar and unfamiliar datasets using Qwen2.5 models (1.5B and 7B). Unlike traditional methods that separate thinking from answering, the interleaved approach answers step by step, improving both speed and usefulness. When combined with intermediate rewards, it significantly improves model performance while reducing response delays by over 80%. Even without exposure to new domains during training, the model adapts well and shows strong generalization. These results highlight the value of interleaved reasoning in making AI systems more responsive and effective for real-world, multi-step reasoning tasks.
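The latency benefit comes from the fact that each intermediate answer can be surfaced to the user as soon as its block closes, rather than only after the full reasoning trace. A hypothetical client-side sketch (reusing the <think>/<answer> tags assumed above; not part of the paper's code) might look like this:

```python
from typing import Iterator

def stream_intermediate_answers(token_stream: Iterator[str]) -> Iterator[str]:
    """Yield each completed <answer>...</answer> block as soon as it closes,
    so users see partial conclusions before the model finishes reasoning."""
    buffer = ""
    for chunk in token_stream:
        buffer += chunk
        while "<answer>" in buffer and "</answer>" in buffer:
            start = buffer.index("<answer>") + len("<answer>")
            end = buffer.index("</answer>")
            yield buffer[start:end].strip()
            buffer = buffer[end + len("</answer>"):]

# Toy usage with a fake chunk stream (a real setup would read from a generation API):
chunks = ["<think>2 + 3 = 5</think><answer>The subtotal is 5.</answer>",
          "<think>double it</think><answer>The final result is 10.</answer>"]
for partial in stream_intermediate_answers(iter(chunks)):
    print("intermediate answer:", partial)
```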

Finally, the study examines how interleaved reasoning, where models alternate between reasoning and generating intermediate answers, can significantly improve performance and responsiveness. Using the Qwen2.5-1.5B model, the authors show that providing timely intermediate feedback during training increases accuracy and speeds up response generation. Among the RL strategies tested, PPO shows stable results, and conditional, time-discounted rewards prove the most effective. The method scales well to complex tasks and surpasses traditional think-then-answer baselines. Unlike token-level reward models, this approach uses simple rule-based rewards applied after complete reasoning steps, thereby avoiding reward hacking. Ultimately, interleaved reasoning improves reasoning quality and efficiency without relying on external tools.
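As a rough illustration of the reward schemes compared in the study, the snippet below sketches all-or-none, partial-credit, and time-discounted scoring over a list of per-sub-answer correctness flags; the exact normalization and discount factor are assumptions.

```python
def all_or_none(sub_correct: list[bool]) -> float:
    """Credit only when every intermediate sub-answer is correct."""
    return 1.0 if sub_correct and all(sub_correct) else 0.0

def partial_credit(sub_correct: list[bool]) -> float:
    """Credit proportional to the fraction of correct sub-answers."""
    return sum(sub_correct) / len(sub_correct) if sub_correct else 0.0

def time_discounted(sub_correct: list[bool], gamma: float = 0.9) -> float:
    """Earlier correct sub-answers earn more credit, encouraging the model to
    surface useful information sooner (gamma is an illustrative choice)."""
    if not sub_correct:
        return 0.0
    return sum(gamma ** i for i, ok in enumerate(sub_correct) if ok) / len(sub_correct)
```

Under this kind of scoring, a trajectory whose early sub-answers are correct receives more credit than one that only gets later steps right, which is consistent with the finding that conditional, time-discounted rewards work best.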


Check out the paper. All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
