
For more than a decade, conversational AI has promised human-like assistants that can do more than chat. Yet as large language models (LLMs) like ChatGPT, Gemini, and Claude learn to reason, explain, and code, one important category of interaction remains largely unsolved: getting them to reliably complete tasks outside of chat.
The best AI models still score only around 30% on Terminal-Bench Hard, a third-party benchmark designed to evaluate how well AI agents complete various terminal-based tasks, with reliability far below what most enterprises and users demand. Even on task-specific benchmarks like TAU-bench Airline, which measures how reliably AI agents can find and book flights on a user's behalf, pass rates are not high: only 56% for the top-performing model (Claude 3.7 Sonnet), meaning the agent fails nearly half the time.
New York City-based Augmented Intelligence (AUI) Inc., co-founded by Ohad Elhelo and Ori Cohen, believes it has finally come up with a solution to raise AI agent reliability to a level where most enterprises can trust that agents will work as instructed, every time.
The company's new foundation model, called Apollo-1, currently in preview with early testers but nearing general release, is built on a principle it calls stateful neuro-symbolic logic.
This hybrid architecture, an approach also endorsed by prominent LLM skeptics like Gary Marcus, is designed to guarantee consistent, policy-compliant outcomes in every customer interaction.
“Conversational AI is essentially two parts,” Elhelo said in a recent interview with VentureBeat. “The first part – open-ended dialogue – is handled beautifully by LLMs. They are designed for creative or exploratory use cases. The second part is task-oriented dialogue, where there is always a specific goal behind the conversation. That half remains unsolved because it requires certainty.”
AUI defines certainty as the difference between an agent that “probably” performs an action and one that almost always does.
For example, on TAU-bench Airline, Apollo-1 performs at an astonishing 92.5% pass rate, leaving all existing competitors far behind, according to benchmarks shared with VentureBeat and posted on AUI's website.
Elhelo offered simple examples: a bank that must enforce ID verification for refunds over $200, or an airline that must always offer a business-class upgrade before economy.
“Those are not preferences,” he said. “Those are requirements. And no purely generative approach can provide that kind of practical certainty.”
AUI's work on improving agent reliability was previously covered by the subscription news outlet The Information, but until now it has not received wide coverage in publicly accessible media.
From pattern matching to predictive action
The team argues that Transformer models, by design, cannot meet that standard. Large language models produce plausible text, not guaranteed behavior. “When you ask an LLM to always offer insurance before payment, it usually will,” Elhelo said. “Configure Apollo-1 with that rule, and it will happen every time.”
This difference, he said, stems from the architecture itself. Transformers predict the next token in a sequence. Apollo-1, in contrast, predicts the next step in a conversation, operating over what AUI calls typed symbolic state.
Cohen explained the idea in more technical terms. “Neuro-symbolic means we are merging two major paradigms,” he said. “The symbolic layer gives you structure – it knows what an intent, an entity, and a parameter are – while the neural layer gives you language fluency. The neuro-symbolic reasoner sits between them. It's a different kind of brain for conversation.”
Where Transformers treat each output as text generation, Apollo-1 runs a closed logic loop: an encoder translates natural language into a symbolic state, a state machine maintains that state, a decision engine determines the next action, a planner executes it, and a decoder transforms the result back into language. “The process is iterative,” Cohen said. “It loops until the task is completed. This way you get determinism instead of probability.”
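The loop Cohen describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not AUI's implementation; every name here (`SymbolicState`, `encode`, `decide`, `run_turn`) is hypothetical, and the encoder is stubbed with a keyword check where a neural model would sit.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of the closed logic loop described above:
# encoder -> typed symbolic state -> decision engine -> decoder.
# None of these names come from AUI; they are illustrative only.

@dataclass
class SymbolicState:
    intent: Optional[str] = None                 # e.g. "book_flight"
    slots: dict = field(default_factory=dict)    # typed parameters gathered so far

def encode(utterance: str, state: SymbolicState) -> SymbolicState:
    """Neural layer (stubbed): map natural language onto the symbolic state."""
    if "flight" in utterance.lower():
        state.intent = "book_flight"
    return state

def decide(state: SymbolicState) -> str:
    """Symbolic decision engine: a deterministic rule over state, not token sampling."""
    if state.intent == "book_flight" and "date" not in state.slots:
        return "ask_date"
    return "confirm"

def run_turn(utterance: str, state: SymbolicState) -> str:
    """One pass of the loop; the full system iterates until the task completes."""
    state = encode(utterance, state)
    action = decide(state)  # a planner would execute tool calls here
    return {                # decoder: symbolic action back into language
        "ask_date": "What date would you like to fly?",
        "confirm": "Confirming your request.",
    }[action]

print(run_turn("I need a flight to Boston", SymbolicState()))
# -> What date would you like to fly?
```

The key property is that `decide` is an ordinary function over state: given the same state, it returns the same action every time, which is the determinism-over-probability point Cohen makes.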
A Foundation Model for Performance
Unlike traditional chatbots or bespoke automation systems, Apollo-1 is meant to work as a foundation model for task-oriented dialogue: a single, domain-agnostic system that can be configured for banking, travel, retail, or insurance through AUI's system prompt.
“The system prompt is not a configuration file,” Elhelo said. “It's a behavioral contract. You define exactly how your agent should behave in the situations you care about, and Apollo-1 guarantees that those behaviors will be executed.”
Organizations can use the prompt to encode symbolic slots (intents, parameters, and policies) as well as tool constraints and state-dependent rules.
For example, a food delivery app might enforce “If allergies are noted, always notify the restaurant,” while a telecommunications provider might define “After three unsuccessful payment attempts, suspend service.” In both cases, the behavior is executed deterministically, not statistically.
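To make the idea concrete, state-dependent rules like those two examples could be modeled as plain conditional checks over the conversation state. This is an illustrative sketch, not AUI's configuration format; `apply_policies` and the state keys are hypothetical names.

```python
# Illustrative sketch of deterministic, state-dependent policy rules like the
# two examples above. The function name and state keys are hypothetical.

def apply_policies(state: dict, actions: list) -> list:
    """Enforce policies on the pending action list before anything is executed."""
    out = list(actions)
    # "If allergies are noted, always notify the restaurant."
    if state.get("allergies"):
        out.insert(0, "notify_restaurant")
    # "After three unsuccessful payment attempts, suspend service."
    if state.get("failed_payments", 0) >= 3:
        out = ["suspend_service"]
    return out

print(apply_policies({"allergies": ["peanuts"]}, ["place_order"]))
# -> ['notify_restaurant', 'place_order']
print(apply_policies({"failed_payments": 3}, ["retry_payment"]))
# -> ['suspend_service']
```

Because the checks run on every turn, each rule fires whenever its condition holds, regardless of how the user phrased the request.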
Eight years in the making
AUI’s path to Apollo-1 began in 2017, when the team began encoding millions of real task-oriented conversations conducted by a 60,000-person human agent workforce.
That work led to the creation of a symbolic language capable of distinguishing procedural knowledge (steps, constraints, and flow) from descriptive knowledge (entities and attributes).
“The insight was that there are universal procedural patterns in task-oriented dialogue,” Elhelo said. “Food delivery, claims processing, and order management all share similar structures. Once you model that structure explicitly, you can compute over it deterministically.”
From there, the company built the Neuro-Symbolic Reasoner – a system that uses symbolic states to decide what will happen next, rather than guessing through token prediction.
Benchmarks suggest that the architecture makes a measurable difference.
In AUI's evaluation, Apollo-1 achieved 90% task completion on the τ-Bench Airline benchmark, compared to 60% for Claude 4.
It scored 83% on live booking chats, versus 22% for Gemini 2.5 Flash on Google Flights, and 91% on Amazon retail scenarios, versus 17% for Rufus.
“These are not incremental improvements,” Cohen said. “These are order-of-magnitude differences in reliability.”
A complement, not a competitor
AUI is presenting Apollo-1 not as a replacement for large language models, but as their essential counterpart. In Elhelo's words: “Transformers optimize for creative possibility. Apollo-1 optimizes for behavioral certainty. Together, they create the full spectrum of conversational AI.”
The model is already running in limited pilots with undisclosed Fortune 500 companies in sectors including finance, travel and retail.
AUI has also confirmed a strategic partnership with Google and plans for general availability in November 2025, when it will open up the API, release full documentation, and add voice and image capabilities. Interested customers and partners can sign up via a form on AUI's website to receive more information as it becomes available.
Until then, the company is keeping the details under wraps. When asked what comes next, Elhelo smiled. “Let's just say we are preparing an announcement,” he said. “Soon.”
From conversation to action
For all its technological sophistication, Apollo-1's pitch is simple: build AI that businesses can trust to do work, not just talk. “We're on a mission to democratize access to AI that works,” Cohen said at the end of the interview.
Whether Apollo-1 becomes the new standard for action-oriented dialogue remains to be seen. But if AUI's architecture performs as promised, the long-standing gap between chatbots that sound human and agents that reliably complete tasks may finally begin to close.

