
Researchers at Nvidia have developed a new technique that flips the script on how large language models (LLMs) learn to reason.
The method, called reinforcement learning pre-training (RLP), integrates RL into the initial training phase instead of saving it for the end.
This approach encourages the model to “think for itself before predicting what will happen next, thus teaching an independent thinking behavior in early training,” the researchers explain in their paper.
By learning to reason on plain text without the need for external validators, models trained with RLP show significant improvements on complex downstream reasoning tasks, pointing to a future of more capable and adaptable AI for real-world applications.
The typical LLM training cycle
Typically, large language models are first pre-trained on vast amounts of text with a "next-token prediction" objective, in which they are given a string of text and asked to continuously guess what the next word (or token) will be. In this stage, they learn grammar, facts and basic relationships.
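To make that objective concrete, here is a minimal, illustrative sketch of a next-token prediction loss in Python; the function and variable names are hypothetical and not taken from any particular training framework.

```python
# Illustrative sketch of the standard next-token prediction objective:
# the model assigns a probability to every candidate next token, and the
# loss is the negative log-probability of the token that actually follows.

import math
from typing import Dict, List

def next_token_loss(predicted_probs: List[Dict[str, float]],
                    target_tokens: List[str]) -> float:
    """Average cross-entropy over a sequence (lower is better)."""
    losses = [-math.log(probs[target])
              for probs, target in zip(predicted_probs, target_tokens)]
    return sum(losses) / len(losses)

# Example: after "the cat", the model puts 70% of its probability on "sat".
probs = [{"sat": 0.7, "ran": 0.2, "blue": 0.1}]
print(next_token_loss(probs, ["sat"]))  # ~0.357
```

Minimizing this kind of loss over enormous corpora is what instills the grammar, facts and relationships described above.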
In the post-training phase, models typically learn complex reasoning capabilities such as chain-of-thought (CoT), where a model lays out its logic step by step. This stage often involves supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF), which require special, curated datasets.
The paper’s authors argue that this sequential process does not match human understanding, which “is not a linear token-by-token process, but rather a parallel integration of input with prior knowledge.” Existing pre-training methods lack this mechanism, which hinders a model’s ability to develop deep reasoning from the start.
How does reinforcement learning pre-training work?
RLP reframes this process by treating CoT generation as an action the model takes before predicting the next token. At each step, the model first generates an internal "thought," or reasoning chain. It then predicts the next word in the text, using the original context augmented with that thought.
The model is rewarded based on how much its thought improved its prediction accuracy compared to a baseline that generates no thoughts (pure next-token prediction). This reward signal is computed automatically from the change in probability, eliminating the need for external validators or human-labeled data.
The reward is positive only when the generated thought helps the model better predict the next token. By rewarding thoughts based on their predictive benefit, RLP effectively teaches the model how to think usefully on the same huge, unstructured datasets used for standard pre-training.
This continuous feedback loop allows the model to learn when a simple predictive guess is sufficient and when it needs to engage in deeper reasoning. As the researchers put it, “RLP is designed to shape thinking in base models by rewarding only those thoughts that measurably help next-token prediction.”
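To illustrate that reward, the sketch below shows how a thought-conditioned reward could be computed. The helpers `sample_thought` and `logprob_next_token` are hypothetical stand-ins for calls to the language model; this is an interpretation of the mechanism as described, not Nvidia's implementation.

```python
# Sketch of an RLP-style reward: the model "thinks" first, then is scored on
# how much that thought improves its prediction of the true next token,
# relative to a no-thought baseline. Helper functions are hypothetical.

from typing import Callable, List

def rlp_reward(
    context: List[str],                                     # tokens seen so far
    next_token: str,                                        # ground-truth next token from the corpus
    sample_thought: Callable[[List[str]], str],             # model writes an internal "thought"
    logprob_next_token: Callable[[List[str], str], float],  # log p(next_token | given context)
) -> float:
    # 1. Think before predicting.
    thought = sample_thought(context)

    # 2. Score the true next token with and without the thought in context.
    logp_with_thought = logprob_next_token(context + [thought], next_token)
    logp_baseline = logprob_next_token(context, next_token)

    # 3. The reward is positive only if the thought actually helped.
    return logp_with_thought - logp_baseline
```

Because the corpus itself supplies the ground-truth next token, this scalar can be computed on ordinary pre-training text with no external verifier or human labels.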
However, this basic approach does not make subsequent fine-tuning steps obsolete. According to Brian Catanzaro, VP of Applied Deep Learning Research at Nvidia and co-author of the paper, RLP is designed to complement, not replace, these critical steps. "RLP is not intended for later stages of training such as supervised fine-tuning or reinforcement learning from human feedback," Catanzaro told VentureBeat. "Those steps remain important for refining model behavior…it’s really designed to increase the effectiveness of those later steps by giving the model a head start."
RLP in action
In experiments with Qwen3-1.7B and Nemotron-Nano-12B, the Nvidia team tested RLP on a suite of math and science reasoning benchmarks. The results show that models trained with RLP consistently outperformed their conventionally trained counterparts, with particularly strong gains on reasoning-heavy tasks.
For enterprises, this improved reasoning could translate into more reliable outputs in multi-step workflows such as financial analysis or legal document summarization.
"RLP encourages the model to think before making predictions during pre-training, helping the model internalize a more consistent reasoning style," Catanzaro said. "This can help reduce subtle logical errors, especially in long workflows.
Emphasizing that RLP-trained models will still need the usual guardrails like validation layers, human oversight, and consistency checks, Catanzaro said that “RLP gives you a stronger baseline."
Importantly, the benefits of RLP compound rather than disappear during subsequent fine-tuning steps (catastrophic forgetting, where later training stages cause a model to lose previously learned skills and knowledge, is a common problem in LLM training). The RLP-trained models achieved 7–8% higher overall scores than baselines that followed the same post-training regimen. The researchers concluded that RLP “establishes a strong reasoning foundation that is not destroyed by downstream alignment but rather becomes integrated with post-training.”
The technique's efficiency is another important finding. On the Qwen3-1.7B model, RLP improved performance by 17% over standard continuous pretraining and also beat a similar technique called Reinforcement Pretraining via Prefix-Matching Rewards (RPT). This advantage held even when the baseline was trained on 35 times more data to match the computational cost, confirming that the gains come from the method itself, not just more compute.
Furthermore, RLP demonstrates impressive scalability and versatility, successfully extracting a reasoning signal from general-purpose web data, not just curated datasets. When applied to Nemotron-Nano-12B, a hybrid Mamba-Transformer model, RLP achieved a 35% relative improvement over a heavily trained baseline while using only a small fraction of the data.
While these results point toward a more efficient path to building powerful models, Catanzaro presents the innovation as a fundamental shift in the learning process rather than an immediate solution to high training costs.
"This research is exciting because it changes how models absorb information during pretraining which improves the learning process," he explained. "This will not replace large-scale pre-training, but will provide a more creative approach in building the best possible models."
A new foundation for AI training
Ultimately, RLP points to a future where pre-training is no longer a monolithic process of next-token prediction. Instead, the next generation of models could be built on mixed objectives, creating AI that learns to think more robustly from day one. Catanzaro offers a powerful analogy to outline this shift:
"Next-token prediction teaches a model what the world might look like; Reinforcement-style objectives like RLP can teach him how to think about what he is seeing," He said. "Combining these two objectives can help models develop deeper, more structured thinking much earlier in training…tools like RLP can build on top of that foundation, making learning more active, curious, and even more efficient."
There is still a lot to learn about the dynamics of reinforcement learning in the pre-training phase, but what seems clear is that “starting exploration earlier in training opens up a new axis for scaling – not just in size, but in how models learn to reason,” Catanzaro said.

