
Enterprises scaling their AI deployments are hitting an invisible performance wall. The culprit? Static speculators that can't keep pace with shifting workloads.
Speculators are small AI models that work alongside large language models during inference. They draft multiple tokens ahead, which the main model then validates in parallel. This technique (called speculative decoding) has become essential for enterprises trying to reduce inference costs and latency. Instead of generating tokens one at a time, the system can accept multiple tokens at once, dramatically improving throughput.
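The draft-then-verify loop can be sketched in a few lines. This is a toy greedy version, not any vendor's implementation: the models are stand-in callables, and on a real GPU step 2 would be a single batched forward pass rather than a Python loop.

```python
def speculative_decode_step(target, draft, prefix, k=5):
    """One step of (greedy) speculative decoding.

    `target` and `draft` are callables mapping a token sequence to the
    next token -- toy stand-ins for the large and small models.
    """
    # 1. The draft model cheaply proposes k tokens, one at a time.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks all k positions -- on real hardware this
    #    is a single parallel forward pass, not k sequential ones.
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # keep the target's correction
            break
        accepted.append(tok)           # draft token confirmed
        ctx.append(tok)
    else:
        accepted.append(target(ctx))   # bonus token when every draft matched

    return accepted  # between 1 and k+1 tokens per target pass
```

The key property: every step emits at least one valid token (the target's own prediction), so output quality is unchanged; the speculator only determines how many tokens each expensive target pass yields.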
Together AI today announced research and a new system called ATLAS (Adaptive-Learning Speculator System) that aims to help enterprises overcome the challenge of static speculators. The technology provides a self-learning inference optimization capability that can deliver up to 400% faster inference than the baseline performance of existing inference engines such as vLLM. The system tackles an important problem: as AI workloads shift, inference speeds slow down, even with specialized speculators in place.
The company, founded in 2023, has focused on optimizing inference on its enterprise AI platform. Earlier this year it raised $305 million as customer adoption and demand grew.
"The companies we work with typically see workloads shift as they scale, and then they don't see the same speedup from speculative execution as before," Tri Dao, chief scientist at Together AI, told VentureBeat in an exclusive interview. "These speculators generally don't work well when their workload starts shifting."
The workload drift problem nobody talks about
Most speculators in production today are "static" models. They are trained once on a fixed dataset representing expected workloads, then deployed without any ability to adapt. Companies like Meta and Mistral ship pre-trained speculators alongside their main models. Inference platforms like vLLM use these static speculators to boost throughput without changing output quality.
But there's a catch: a static speculator's accuracy declines as an enterprise's AI usage evolves.
"If you're a company producing coding agents, and most of your developers are writing in Python, then suddenly some of them switch to writing Rust or C, you see the speed starts degrading," Dao explained. "There's a mismatch between what the speculator was trained on versus what the actual workload is."
This workload drift represents a hidden tax on AI scaling. Enterprises either accept degraded performance or invest in retraining custom speculators. That retraining captures only a snapshot in time and quickly becomes outdated.
How adaptive speculators work: a dual-model approach
ATLAS uses a dual-speculator architecture that combines stability with adaptability:
Static speculator – A heavyweight model trained on broad data that delivers consistent baseline performance. It acts as a "speed floor."
Adaptive speculator – A lightweight model that continuously learns from live traffic. It specializes on the fly in emerging domains and usage patterns.
Confidence-aware controller – An orchestration layer that dynamically chooses which speculator to use and tunes the speculation "lookahead" based on confidence scores.
"We still have the static speculator to provide the speedup early on, before the adaptive speculator has learned anything," Ben Athiwaratkun, staff AI scientist at Together AI, explained to VentureBeat. "As the adaptive speculator becomes more confident, the speed increases over time."
The technical innovation lies in balancing the acceptance rate (how often the target model agrees with drafted tokens) against draft latency. As the adaptive model learns from traffic patterns, the controller comes to trust the lighter speculator more and extends the lookahead, boosting performance further.
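Together AI has not published the controller's internals, but the described behavior suggests something like the following minimal sketch. The class name, threshold, and smoothing heuristic are all illustrative assumptions, not the actual implementation:

```python
class ConfidenceController:
    """Toy confidence-aware controller: routes between a static and an
    adaptive speculator based on the adaptive model's acceptance rate.
    All heuristics here are assumptions for illustration."""

    def __init__(self, threshold=0.7, max_lookahead=8):
        self.threshold = threshold
        self.max_lookahead = max_lookahead
        self.accept_rate = 0.0  # rolling acceptance rate of the adaptive model
        self.alpha = 0.05       # smoothing factor for the rolling estimate

    def choose(self):
        """Pick a speculator and a lookahead depth for the next step."""
        if self.accept_rate >= self.threshold:
            # Trust the lighter adaptive model, and speculate further
            # ahead as its acceptance rate climbs.
            depth = max(1, round(self.accept_rate * self.max_lookahead))
            return "adaptive", depth
        return "static", 3  # conservative baseline "speed floor"

    def update(self, accepted, drafted):
        """Fold the latest step's acceptance ratio into the rolling rate."""
        step_rate = accepted / max(drafted, 1)
        self.accept_rate += self.alpha * (step_rate - self.accept_rate)
```

A fresh controller starts on the static speculator with a short lookahead; as live traffic confirms the adaptive model's drafts, routing flips and the lookahead deepens, which matches the "speed increases over time" behavior described above.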
Users do not need to tune any parameters. "On the user side, users don't have to turn any knobs," Dao said. "On our side, we've turned those knobs for them, settling on a configuration that delivers a good speedup."
Performance that rivals custom silicon
Together AI's testing shows ATLAS reaching 500 tokens per second on DeepSeek-V3.1 when fully adapted. More impressively, those numbers on Nvidia B200 GPUs match or exceed specialized inference chips such as Groq's custom hardware.
"Software and algorithmic improvements are able to really close the gap with specialized hardware," Dao said. "We were seeing 500 tokens per second on these huge models, which is even faster than some of the customized chips."
The 400% inference speedup the company claims represents the cumulative effect of Together's Turbo optimization suite. FP4 quantization provides an 80% speedup over the FP8 baseline. The static Turbo speculator adds another 80-100% gain. The adaptive system layers on top of both. Each optimization compounds the others.
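Taking the article's figures at face value, the gains compound multiplicatively. A quick back-of-envelope check (the 1.9x midpoint and the reading of "400%" as roughly 4x overall are assumptions):

```python
# Back-of-envelope: how the claimed optimizations could compound.
# Illustrative arithmetic only; midpoints below are assumptions.
fp4_quantization = 1.80          # "80% speedup over the FP8 baseline"
static_turbo_speculator = 1.90   # midpoint of the "80-100%" gain

combined = fp4_quantization * static_turbo_speculator
print(f"before the adaptive layer: {combined:.2f}x")  # 3.42x

claimed_total = 4.0              # reading "400% faster" as ~4x overall
adaptive_contribution = claimed_total / combined
print(f"implied adaptive gain: {adaptive_contribution:.2f}x")  # 1.17x
```

In other words, quantization and the static speculator alone account for roughly 3.4x, and the adaptive layer only needs to contribute a modest additional factor for the headline number to hold.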
Compared with standard inference engines such as vLLM or Nvidia's TensorRT-LLM, the improvement is substantial. Together AI benchmarks against the stronger of those two baselines for each workload before applying speculative optimizations.
Memory-Compute Tradeoff Explained
The performance gains arise from exploiting a fundamental inefficiency in modern inference: wasted computation capacity.
Dao pointed out that typically during inference, most of the computation power is not fully utilized.
"During inference, which is really the major workload these days, you’re mostly using the memory subsystem," He said.
Speculative decoding trades idle compute for reduced memory access. When a model generates one token at a time, it is memory-bound: the GPU sits idle waiting on memory. But when the speculator proposes five tokens and the target model verifies them simultaneously, compute utilization rises while memory access stays nearly constant.
"The total amount of computation to generate five tokens is the same, but you only have to access the memory once instead of five times," Dao said.
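A toy roofline-style cost model makes this concrete. The constants below are illustrative assumptions (not measured numbers for any particular GPU or model); the point is only that weight reads dominate and are paid once per pass, regardless of how many tokens that pass verifies:

```python
# Toy cost model: why speculative decoding helps a memory-bound decoder.
# All constants are illustrative assumptions, not measured figures.
WEIGHT_BYTES = 70e9       # bytes of weights streamed per forward pass
MEM_BW = 8e12             # bytes/s of HBM bandwidth
FLOPS_PER_TOKEN = 140e9   # compute per token through the model
PEAK_FLOPS = 2e15         # accelerator peak throughput

def step_time(tokens_per_pass):
    """Time for one forward pass verifying `tokens_per_pass` tokens.

    Weights are read once regardless of how many tokens are checked,
    so memory time is flat while compute time scales with the batch;
    the pass is bound by whichever resource is slower.
    """
    mem_time = WEIGHT_BYTES / MEM_BW
    compute_time = tokens_per_pass * FLOPS_PER_TOKEN / PEAK_FLOPS
    return max(mem_time, compute_time)

one_at_a_time = step_time(1)  # memory-bound
five_at_once = step_time(5)   # still memory-bound: same wall-clock time
print(five_at_once / one_at_a_time)  # → 1.0
```

Under these assumptions, verifying five tokens costs the same wall-clock time as generating one, which is exactly the free capacity Dao describes; the benefit only tapers off once the batch is large enough to become compute-bound.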
Think of it as intelligent caching for AI
For infrastructure teams familiar with traditional database optimization, adaptive speculators act like an intelligent caching layer, but with one important difference.
Traditional caching systems like Redis or memcached require exact matches: you store a specific query result and retrieve it when that same query runs again. Adaptive speculators work differently.
"You can look at it as an intelligent way of caching, not storing exact, but detecting some patterns that you see," Dao explained. "Broadly speaking, we’re looking at whether you’re working with similar code, or working with similar, you know, controlling computation in similar ways. Then we can predict what the bigger model is going to say. We become better at predicting it."
Instead of storing exact responses, the system learns patterns in how the model generates tokens. It recognizes that if you are editing Python files in a specific codebase, certain token sequences become more likely. The speculator adapts to those patterns, improving its predictions over time without requiring identical inputs.
Use cases: RL training and evolving workloads
Two enterprise scenarios particularly benefit from adaptive speculators:
Reinforcement learning training: As the policy evolves during training, static speculators quickly fall out of alignment. ATLAS adapts continuously to the shifting policy distribution.
Workload evolution: As enterprises discover new AI use cases, the workload composition changes. "Maybe they started out using AI for chatbots, but then they realized, hey, it can write code, so they started shifting to code," Dao said. "Or they realize that these AIs can actually call tools and control computers and do accounting and things like that."
In a vibe-coding session, the adaptive system can specialize in the specific codebase being edited, even on files never seen during training, further increasing the acceptance rate and decoding speed.
What this means for enterprises and the inference ecosystem
ATLAS is now available on Together AI's dedicated endpoints as part of the platform at no additional cost. The company's more than 800,000 developers (up from 450,000 in February) have access to the optimization.
But the broader implications extend beyond one vendor's product. The shift from static to adaptive optimization represents a fundamental rethinking of how inference platforms should work. As enterprises deploy AI across multiple domains, the industry will need to move beyond models trained once toward systems that continuously learn and improve.
Together AI has historically released some of its research techniques as open source and collaborated with projects such as vLLM. While the fully integrated ATLAS system is proprietary, some of the underlying techniques may eventually influence the broader inference ecosystem.
For enterprises looking to lead in AI, the message is clear: adaptive algorithms on commodity hardware can match custom silicon at a fraction of the cost. As this approach matures across the industry, software optimization increasingly wins out over specialized hardware.

