Qwen Researchers Propose QwenLong-L1: A Reinforcement Learning Framework for Long-Context Reasoning in Large Language Models

While large reasoning models (LRMs) have shown impressive capabilities in short-context reasoning through reinforcement learning (RL), these gains do not generalize well to long-context scenarios. Applications such as multi-document QA, research synthesis, and legal or financial analysis require models to process and reason over sequences exceeding 100K tokens. However, RL optimization in such regimes is plagued by slower reward convergence, unstable policy updates due to KL divergence fluctuations, and reduced exploration as a result of entropy collapse. These bottlenecks reveal a fundamental gap in transitioning LRMs from short-context proficiency to long-context generalization.

QwenLong-L1: A Structured RL Framework for Long-Context Adaptation

To address these limitations, the Qwen research team introduces QwenLong-L1, a new RL framework designed to adapt LRMs to long-context reasoning tasks. The framework is structured in three key stages:

  • Warm-up supervised fine-tuning (SFT): Provides stable initialization for the policy model by training on curated question-context-answer triplets, ensuring basic competence in contextual understanding and answer extraction.
  • Curriculum-guided phased reinforcement learning: Introduces a staged training process with gradually increasing context lengths. This progression allows the model to incrementally acquire long-context reasoning behavior without destabilizing policy updates.
  • Difficulty-aware retrospective sampling: Improves exploration by retaining and reusing hard examples from earlier stages, weighted by their difficulty, to encourage deeper reasoning and robustness across diverse inputs (a minimal sketch of this sampling step follows the list).
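
The retrospective sampling stage can be illustrated with a small sketch. The snippet below is a hypothetical, minimal implementation, not code from the QwenLong-L1 release: it assumes each example carries a running "difficulty" score (for instance, one minus the policy's recent pass rate on that example) and mixes difficulty-weighted carryovers from the previous stage into the next stage's batch. All function and field names are illustrative.

```python
import random

def retrospective_sample(prev_stage_pool, new_stage_pool, batch_size, carryover_frac=0.3):
    """Mix hard examples from earlier curriculum stages into the current batch.

    Each example is a dict with a 'difficulty' score in [0, 1]; harder
    previous-stage examples are sampled more often. Names are illustrative.
    """
    n_carry = int(batch_size * carryover_frac)
    weights = [ex["difficulty"] for ex in prev_stage_pool]
    carried = (
        random.choices(prev_stage_pool, weights=weights, k=n_carry)
        if prev_stage_pool and sum(weights) > 0
        else []
    )
    fresh = random.sample(new_stage_pool, k=batch_size - len(carried))
    batch = carried + fresh
    random.shuffle(batch)
    return batch
```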

These stages are complemented by hybrid reward mechanisms, combining rule-based exact-match verification with semantic evaluation by a lightweight LLM, to balance both precision and recall during policy training.

Technical design and methodological benefits

QwenLong-L1 integrates recent advances in group-relative RL optimization, specifically GRPO and DAPO, to mitigate the computational cost associated with long-context value estimation:

  • GRPO estimates advantages by normalizing rewards within sampled groups, eliminating the need for a separate value network and encouraging diverse generation patterns (see the sketch after this list).
  • DAPO incorporates mechanisms such as dynamic sampling, overlength penalties, and asymmetric clipping thresholds to prevent entropy collapse and mitigate length biases during training.
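
As a concrete illustration of the group-relative idea mentioned above, the following sketch computes advantages by standardizing rewards within a group of responses sampled for the same prompt, which is what removes the need for a learned value network. This is a generic GRPO-style computation written for illustration, not code from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one group of responses to the same prompt.

    Each response's advantage is its reward standardized against the group's
    mean and standard deviation, so no separate value network is required.
    """
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled answers to one long-context question,
# two judged correct (reward 1.0) and two incorrect (reward 0.0).
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # -> [ 1. -1.  1. -1.]
```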

The reward function is defined as the maximum of two signals: a deterministic rule-based match and a semantic assessment from a compact evaluator model (e.g., Qwen2.5-1.5B). This hybrid approach avoids overfitting to rigid formats while preserving answer correctness across varied notations and phrasings.
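
A minimal sketch of this max-of-two-signals reward is shown below. Only the overall structure follows the description above; `exact_match` and `semantic_judge` are hypothetical helpers, and the judge is stubbed with a crude token-overlap score purely so the sketch runs, standing in for a call to the compact evaluator model.

```python
def exact_match(pred: str, gold: str) -> float:
    """Deterministic rule-based check: 1.0 only if the normalized strings match."""
    return float(pred.strip().lower() == gold.strip().lower())

def semantic_judge(question: str, pred: str, gold: str) -> float:
    """Stand-in for the compact LLM evaluator (e.g., a Qwen2.5-1.5B judge).
    A crude token-overlap score is used here only so the sketch runs end to end;
    in the real setup this would be a model call returning a score in [0, 1]."""
    p, g = set(pred.lower().split()), set(gold.lower().split())
    return len(p & g) / max(len(g), 1)

def hybrid_reward(question: str, pred: str, gold: str) -> float:
    """Final reward is the maximum of the rule-based and semantic signals,
    so correct answers in unexpected formats are still rewarded."""
    rule = exact_match(pred, gold)
    if rule == 1.0:  # skip the judge when the strict check already passes
        return 1.0
    return max(rule, semantic_judge(question, pred, gold))
```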

In addition, the framework is optimized via progressive context scaling, where the RL process transitions from 20K-token to 60K-token input lengths in controlled phases, stabilizing training dynamics and facilitating policy generalization.
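
A curriculum of this shape can be expressed as a simple stage schedule. The configuration below is illustrative: only the 20K-to-60K progression is taken from the description above, while the two-stage layout and the length-based filtering are assumptions.

```python
# Illustrative progressive context-scaling schedule (stage layout is an assumption;
# only the 20K -> 60K progression comes from the article).
CONTEXT_CURRICULUM = [
    {"stage": 1, "max_context_tokens": 20_000},
    {"stage": 2, "max_context_tokens": 60_000},
]

def examples_for_stage(dataset, stage_cfg):
    """Keep only examples whose context fits the current stage's length budget."""
    limit = stage_cfg["max_context_tokens"]
    return [ex for ex in dataset if ex["context_length"] <= limit]
```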

Experimental results and benchmark performance

QwenLong-L1 was evaluated on seven long-context document QA benchmarks, including DocMath, Frames, 2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, and Qasper. The 32B variant, QwenLong-L1-32B, demonstrated strong empirical performance:

  • It surpassed baseline models such as R1-Distill-Qwen-32B by 5.1 points and exceeded leading proprietary systems such as OpenAI-o3-mini and Qwen3-235B-A22B.
  • Its performance was comparable to Claude-3.7-Sonnet-Thinking, indicating competitive reasoning capabilities under extreme context lengths.
  • Pass@k analysis revealed consistent improvements with increased sampling, achieving a Pass@2 average of 73.7, surpassing DeepSeek-R1 and OpenAI-o1-preview even at low sampling rates.

Ablation studies further validated the individual contributions of SFT, phased RL, and retrospective sampling. Notably, RL played a crucial role in the emergence of reasoning behaviors such as grounding, subgoal setting, verification, and backtracking, traits not effectively induced by supervised fine-tuning alone.

Conclusion

QwenLong-L1 represents a systematic approach to equipping LRMs with robust long-context reasoning capabilities through reinforcement learning. Its design bridges the gap between short-context expertise and the demands of information-dense environments by combining supervised initialization, curriculum-driven context scaling, and hybrid evaluation strategies. The framework not only achieves state-of-the-art results across long-context benchmarks but also demonstrates the emergence of interpretable reasoning patterns during training.


Check out the paper, the model on Hugging Face, and the GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 95k+ ML SubReddit and subscribe to our newsletter.


Asif Razzaq is CEO of Marktechpost Media Inc. His latest endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
