
While much of the AI world is racing to build big language models like OpenAI’s GPT-5 and Anthropic’s Claude Sonnet 4.5, Israeli AI startup AI21 is taking a different path.
AI21 has just unveiled Jamba Reasoning 3B, a 3-billion-parameter model. This compact, open-source model can handle a huge context window of 250,000 tokens (meaning it can “remember” and reason about much more text than typical language models) and runs at high speed even on consumer devices. The launch highlights a growing shift: smaller, more efficient models could shape the future of AI just as much as raw scale.
“We believe in a more decentralized future for AI, where not everything runs in massive data centers,” says Ori Goshen, co-CEO of AI21, in an interview with IEEE Spectrum. “Large models will still play a role, but smaller, powerful models running on devices will have a significant impact on both the future and the economics of AI,” he says. Jamba is built for developers who want to create edge-AI applications and specialized systems that run efficiently on devices.
AI21’s Jamba Reasoning 3B is designed to handle long sequences of text and challenging tasks like math, coding, and logical reasoning, all while running with impressive speed on everyday devices like laptops and mobile phones. It can also work in hybrid setups: simple tasks are handled locally on the device, while heavier problems are sent to powerful cloud servers. According to AI21, this smart routing could dramatically cut the cost of AI infrastructure for some workloads, possibly by orders of magnitude.
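AI21 has not published how such routing decides what stays local, but the basic idea is easy to sketch. The minimal Python example below is hypothetical: the helper functions and the complexity heuristic are illustrative assumptions, not AI21’s implementation.

```python
# Hypothetical sketch of local-first routing between an on-device model
# and a cloud endpoint. Helper names and the complexity heuristic are
# illustrative assumptions, not AI21's published routing logic.

def run_on_device(prompt: str) -> str:
    # Placeholder for a call into a local runtime (e.g., a small model
    # loaded in LM Studio or another on-device inference engine).
    return f"[local] answer to: {prompt[:40]}..."

def call_cloud_api(prompt: str) -> str:
    # Placeholder for a request to a larger hosted frontier model.
    return f"[cloud] answer to: {prompt[:40]}..."

def estimate_complexity(prompt: str) -> float:
    """Crude proxy: long inputs and reasoning keywords suggest a harder task."""
    keywords = ("prove", "derive", "refactor", "multi-step", "optimize")
    score = len(prompt) / 50_000                      # scale with input length
    score += 0.3 * sum(k in prompt.lower() for k in keywords)
    return score

def answer(prompt: str, threshold: float = 0.5) -> str:
    """Route cheap queries to the device; escalate hard ones to the cloud."""
    if estimate_complexity(prompt) < threshold:
        return run_on_device(prompt)
    return call_cloud_api(prompt)

print(answer("Summarize this paragraph in one sentence."))
```

In a scheme like this, the bulk of everyday queries never leave the device, which is where the claimed infrastructure savings would come from.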
A small but mighty LLM
With 3 billion parameters, Jamba Reasoning 3B is small by today’s AI standards. Goshen notes that models like GPT-5 or Claude exceed 100 billion parameters, and that even small models like Llama 3 (8B) or Mistral (7B) are more than twice the size of AI21’s model.
That compact size makes it all the more remarkable that AI21’s model can handle a context window of 250,000 tokens on consumer devices. Some proprietary models, such as GPT-5, offer even longer context windows, but Jamba sets a new high-water mark among open-source models. The previous open-model record of 128,000 tokens was shared by Meta’s Llama 3.2 (3B), Microsoft’s Phi-4 Mini, and DeepSeek R1, the last of which is a far larger model. Jamba Reasoning 3B can process more than 17 tokens per second even when operating at full capacity, that is, with extremely long inputs that fill its entire 250,000-token context window. Many other models slow down or struggle when their input length exceeds 100,000 tokens.
Goshen explains that the model is built on an architecture called Jamba, which combines two types of neural network designs: Transformer layers, familiar from other large language models, and Mamba layers, which are designed to be more memory-efficient. This hybrid design lets the model handle long documents, large codebases, and other extensive inputs directly on a laptop or phone using about one-tenth the memory of a traditional Transformer. Goshen says the model also runs much faster than traditional Transformers because it relies less on a memory component called the KV cache, which can slow down processing as inputs grow longer.
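To see why the Mamba layers matter, consider a back-of-the-envelope calculation. In a pure Transformer, the KV cache grows linearly with input length, while a state-space (Mamba-style) layer keeps a fixed-size state no matter how long the input gets. The sketch below uses assumed layer counts and hidden sizes for a ~3B model, not Jamba’s actual configuration.

```python
# Back-of-the-envelope memory comparison: a pure-Transformer KV cache
# versus a fixed-size state-space (Mamba-style) state. The layer count,
# hidden size, and state dimension are illustrative assumptions, not
# Jamba's actual configuration.

def kv_cache_bytes(seq_len: int, layers: int, hidden: int, bytes_per_val: int = 2) -> int:
    """A KV cache stores one key and one value vector per token, per layer."""
    return 2 * seq_len * layers * hidden * bytes_per_val

def mamba_state_bytes(layers: int, hidden: int, state_dim: int = 16, bytes_per_val: int = 2) -> int:
    """A state-space layer keeps a fixed-size state regardless of input length."""
    return layers * hidden * state_dim * bytes_per_val

LAYERS, HIDDEN = 32, 2560  # assumed dimensions for a ~3B-parameter model
for n in (10_000, 100_000, 250_000):
    kv_gb = kv_cache_bytes(n, LAYERS, HIDDEN) / 1e9
    state_mb = mamba_state_bytes(LAYERS, HIDDEN) / 1e6
    print(f"{n:>7} tokens: KV cache ~{kv_gb:.1f} GB; Mamba state fixed at ~{state_mb:.1f} MB")
```

Under these assumed dimensions, a full-attention KV cache at 250,000 tokens would run to tens of gigabytes, far beyond a laptop’s memory, which is why replacing most attention layers with constant-memory Mamba layers makes on-device long-context inference plausible.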
Why is there a need for a small LLM?
The model’s hybrid architecture gives it an advantage in both speed and memory efficiency, even with very long inputs, confirms a software engineer working in the LLM industry, who requested anonymity because he is not authorized to comment on other companies’ models. As more users run generative AI locally on laptops, models need to handle long context lengths quickly without consuming too much memory. At 3 billion parameters, the engineer says, Jamba meets these requirements, making it a model optimized for on-device use.
Jamba Reasoning 3B is open source under the permissive Apache 2.0 license and is available on popular platforms like Hugging Face and LM Studio. The release also comes with instructions for fine-tuning the model with an open-source reinforcement-learning library called VeRL, making it easier and more economical for developers to adapt the model to their own tasks.
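For developers who want to try it, loading the model should look much like any other Hugging Face checkpoint. The sketch below is a minimal example; the repo id is an assumption based on the launch announcement, so check the model card on Hugging Face for the exact identifier and recommended settings.

```python
# Minimal sketch of running the model via Hugging Face transformers.
# The repo id below is assumed from the launch announcement; consult the
# model card for the exact identifier and generation settings.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "ai21labs/AI21-Jamba-Reasoning-3B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

messages = [{"role": "user", "content": "Explain KV caches in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```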
“Jamba Reasoning 3B marks the beginning of a family of small, efficient reasoning models,” Goshen said. “Scaling down enables decentralization, personalization, and cost efficiency. Instead of relying on expensive GPUs in data centers, individuals and enterprises can run their own models on devices. This opens up new economies of scale and broader reach.”

