
Researchers at Mila have proposed a new technique that makes large language models (LLMs) significantly more efficient at complex reasoning. Called Markovian Thinking, the approach allows LLMs to engage in much longer reasoning without incurring the prohibitive computational costs that currently limit such tasks.
The team’s implementation, an RL environment called Delethink, structures the reasoning chain into fixed-size chunks, breaking the scaling problem that plagues very long LLM responses. Initial estimates suggest that for a 1.5B-parameter model, the method can cut training costs by more than two-thirds compared to the standard approach.
The quadratic curse of long-chain reasoning
For an LLM to solve a complex problem, it often needs to generate a long chain of intermediate “thinking” tokens, commonly referred to as a chain of thought (CoT). In recent years, researchers have found that reinforcement learning (RL) can significantly improve models’ reasoning capabilities by training them to produce longer CoTs (an approach sometimes referred to as LongCoT).
However, the standard method has a serious flaw: the AI’s “state” (the prompt plus all the reasoning tokens generated so far) grows with each new reasoning token. For modern transformer-based models, this means that as the reasoning chain gets longer, the computational cost grows quadratically, making it extremely expensive to train models for very complex tasks.
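To make the scaling concrete, here is a back-of-the-envelope sketch (not from the paper) that counts token-pair interactions in attention, assuming attention cost dominates and ignoring constant factors:

```python
# Back-of-the-envelope comparison of attention cost, counted as
# token-pair interactions over one long reasoning trace.
# Illustrative only: real training cost depends on many other factors.

def longcot_cost(total_tokens: int) -> int:
    # Each new token attends to the whole growing context, so total
    # work is roughly 1 + 2 + ... + T = T * (T + 1) / 2 (quadratic).
    return total_tokens * (total_tokens + 1) // 2

def chunked_cost(total_tokens: int, chunk_size: int) -> int:
    # With a fixed-size context, each token attends to at most
    # chunk_size tokens, so total work grows linearly with T.
    full_chunks, remainder = divmod(total_tokens, chunk_size)
    return full_chunks * longcot_cost(chunk_size) + longcot_cost(remainder)

T, C = 96_000, 8_000  # a 96K-token trace, reasoned in 8K-token chunks
print(f"Growing-context cost: {longcot_cost(T):.2e}")    # ~4.6e9
print(f"Fixed-chunk cost:     {chunked_cost(T, C):.2e}")  # ~3.8e8, ~12x less
```

At this trace length the quadratic term dominates, which is why capping the context, rather than capping how long the model thinks, changes the economics.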
Most current efforts to manage this cost focus on limiting how much the model thinks, implicitly favoring shorter solutions or terminating the process early. Although these methods provide some relief, the Mila researchers note that they still operate within the LongCoT framework and are thus fundamentally bound by its quadratic nature.
Instead of trying to control computational growth, the Mila team created an RL environment that avoids the quadratic problem altogether. As co-author Amirhossein Kazemnejad explained, the goal is to enable capabilities like multi-week reasoning and scientific discovery. "That arrangement (and the RL required to enable such capabilities) is not supported by the current LongCoT paradigm due to the quadratic computation cost," he said.
Thinking in chunks with Delethink
The researchers’ solution is a paradigm they call the "Markovian Thinker," in which the model reasons while the size of its context window stays constant. The core idea is to restructure the RL setup so that "how long the model thinks" is decoupled from "how much context it must process." Done right, a Markovian thinker turns the quadratic growth problem into linear compute and a fixed memory footprint for LLM reasoning.
The researchers put this paradigm into practice through Delethink, which forces the model to reason in a sequence of fixed-size chunks, such as 8,000 tokens at a time. Within each chunk, the model reasons as usual with the classic attention mechanism. But when it reaches the chunk limit, the environment resets the context, creating a new prompt that contains the original query plus a short "carryover" from the previous chunk. The carryover might be, for example, the last few tokens of the previous chunk of the CoT, or a summary of its most important results.
This restructuring of the problem forces the model to learn how to embed a summary of its progress, a "textual Markovian state," into the carryover so it can continue reasoning in the next chunk. This addresses the common concern of whether the model can remember important details from earlier steps.
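As an illustration of the idea, here is a minimal sketch of a Delethink-style reasoning loop. It is not the authors' implementation: `generate` is a hypothetical stand-in for any LLM completion call, the "FINAL ANSWER" stop marker is an assumption, and the carryover is approximated as the tail of the previous chunk measured in characters rather than tokens:

```python
# Minimal sketch of a Delethink-style chunked reasoning loop.
# Hypothetical and simplified; not the authors' implementation.

CHUNK_SIZE = 8_000   # max tokens the model may produce per chunk
CARRYOVER = 2_000    # tail of the previous chunk carried forward
                     # (characters, a crude proxy for "last few tokens")
MAX_CHUNKS = 16      # overall reasoning budget

def delethink_trace(query: str, generate) -> str:
    carry = ""  # the "textual Markovian state" passed between chunks
    for _ in range(MAX_CHUNKS):
        # Each chunk sees only the original query plus the carryover,
        # so the attention context never grows beyond a fixed size.
        prompt = f"{query}\n\n[Progress so far]\n{carry}"
        chunk = generate(prompt, max_tokens=CHUNK_SIZE)
        if "FINAL ANSWER" in chunk:  # the model signals it is done
            return chunk
        # Reset the context; keep only a short tail as the carryover.
        carry = chunk[-CARRYOVER:]
    return carry  # budget exhausted without a final answer
```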
According to Kazemnejad, the model learns what to remember. "With training… the model is forced to learn to advance the task-critical state," he explained. He added an important clarification for practical use: the original input prompt, including any documents or data attached to it, is not modified. "Our approach is targeted at the reasoning stage and does not modify the prompt," he said.
Delethink in action
To test their approach, the researchers trained R1-Distill-1.5B with Delethink on a dataset of competition-level math problems, then evaluated it on several benchmarks. The model was trained to reason for up to 24,000 tokens, but in fixed 8,000-token chunks.
The researchers compared this with models trained using the standard LongCoT-RL method. Their findings show that the model trained with Delethink could reason up to 24,000 tokens, and matched or surpassed the LongCoT model trained with the same 24,000-token budget on math benchmarks. On other tasks, such as coding and PhD-level questions, Delethink also matched or slightly surpassed its LongCoT counterpart. "Overall, these results show that Delethink uses its thinking tokens as effectively as LongCoT-RL with less compute," the researchers wrote.
The benefits become even more apparent when moving beyond the training budget. While models trained with LongCoT quickly plateaued at their training limit, Delethink-trained models continued to improve. For example, some math problems were solved only after the model reasoned for up to 140,000 tokens, far beyond its 24,000-token training budget. This linear-compute advantage is significant for enterprise applications: the researchers estimate that training a model to an average thinking length of 96,000 tokens would require 27 H100-GPU-months with LongCoT, versus just 7 with Delethink.
This efficiency carries over directly to inference, the primary operating cost for most enterprises. "Models trained with Markovian Thinking use the same inference style (Delethink-tracing) at test time, which provides the same benefits of linear compute and constant memory after training," Kazemnejad said. He offered a practical example: an AI agent could "debug a large codebase and think for a long time… which significantly reduces costs compared to the traditional LongCoT approach."
Interestingly, the researchers found that many off-the-shelf models, even without any specific training, already display some ability to think in a Markovian way. This discovery has immediate practical implications for developers. "In practice, this means that, without Delethink-RL, these models can already run in a Delethink-tracing wrapper and perform competitively with LongCoT on our benchmark tasks," Kazemnejad said.
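In other words, the loop sketched above can be wrapped around an off-the-shelf model with no RL training at all. The adapter below is hypothetical; `call_my_model` stands in for whatever inference API you actually use, and here it returns a canned reply so the example runs end to end:

```python
# Running the delethink_trace sketch with an off-the-shelf model.
# `call_my_model` is a hypothetical adapter; in practice, wire it
# to your own model server or SDK.

def call_my_model(prompt: str, max_tokens: int) -> str:
    return "The sum telescopes to n^2. FINAL ANSWER: n^2"

answer = delethink_trace(
    "Prove that the sum of the first n odd numbers is n^2. "
    "End with 'FINAL ANSWER: ...'",
    generate=call_my_model,
)
print(answer)
```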
Experiments with larger models, such as GPT-OSS 120B, showed strong performance with Delethink across many complex tasks. This latent ability provides a strong starting point for RL training, helping to explain why the method is so effective. "Together, these results show that Delethink is compatible with, and scales well with, state-of-the-art models," the researchers concluded.
The success of Markovian Thinking suggests it may be possible to build "next-generation reasoning models that think for millions of tokens," the researchers noted. This opens the door to fundamentally new AI capabilities that transcend current constraints.
"Markovian thinking…opens the way for models that can ‘think’ over much longer horizons, which we see as a necessary step toward ultimate scientific discovery," Kazmanejad said. "Our approach overcomes a major hurdle and can allow training for tasks with much longer horizons, enabling the next generation of capabilities."

