For more than a decade, conversational AI has promised human-like assistants that can do more than chat. Yet even as large language models (LLMs) like ChatGPT, Gemini, and Claude learn to reason, explain, and code, one critical category of interaction remains largely unsolved — reliably completing tasks for people outside of chat.
Even the best AI models complete only around 30% of tasks on Terminal-Bench Hard, a third-party benchmark designed to evaluate how well AI agents handle a variety of terminal-based, command-line tasks, far below the reliability most enterprises and users demand. Task-specific benchmarks fare little better: on TAU-Bench Airline, which measures how reliably AI agents find and book flights on a user's behalf, the top-performing model (Claude 3.7 Sonnet) passes only 56% of the time, meaning the agent fails nearly half the time.
New York City-based Augmented Intelligence (AUI) Inc., co-founded by Ohad Elhelo and Ori Cohen, believes it has finally arrived at a solution: boosting AI agent reliability to a level where enterprises can trust agents to do as instructed, every time.
The company’s new foundation model, called Apollo-1, currently in preview with early testers and nearing general release, is built on a principle it calls stateful neuro-symbolic reasoning.
It's a hybrid architecture championed even by LLM skeptics like Gary Marcus, designed to guarantee consistent, policy-compliant outcomes in every customer interaction.
“Conversational AI is essentially two halves,” said Elhelo in a recent interview with VentureBeat. “The first half — open-ended dialogue — is handled beautifully by LLMs. They’re designed for creative or exploratory use cases. The other half is task-oriented dialogue, where there’s always a specific goal behind the conversation. That half has remained unsolved because it requires certainty.”
AUI defines certainty as the difference between an agent that “probably” performs a task and one that almost always does.
On TAU-Bench Airline, for example, Apollo-1 posts a 92.5% pass rate, leaving current competitors far behind, according to benchmarks shared with VentureBeat and posted on AUI's website.
Elhelo offered simple examples: a bank that must enforce ID verification for refunds over $200, or an airline that must always offer a business-class upgrade before economy.
“Those aren’t preferences,” he said. “They’re requirements. And no purely generative approach can deliver that kind of behavioral certainty.”
AUI's work on improving reliability was previously covered by subscription news outlet The Information, but it has not received widespread coverage in publicly accessible media until now.
From Pattern Matching to Predictable Action
The team argues that transformer models, by design, can’t meet that bar. Large language models generate plausible text, not guaranteed behavior. “When you tell an LLM to always offer insurance before payment, it might — usually,” Elhelo said. “Configure Apollo-1 with that rule, and it will — every time.”
That distinction, he said, stems from the architecture itself. Transformers predict the next token in a sequence. Apollo-1, by contrast, predicts the next action in a conversation, operating on what AUI calls a typed symbolic state.
Cohen explained the idea in more technical terms. “Neuro-symbolic means we’re merging the two dominant paradigms,” he said. “The symbolic layer gives you structure — it knows what an intent, an entity, and a parameter are — while the neural layer gives you language fluency. The neuro-symbolic reasoner sits between them. It’s a different kind of brain for dialogue.”
Where transformers treat every output as text generation, Apollo-1 runs a closed reasoning loop: an encoder translates natural language into a symbolic state, a state machine maintains that state, a decision engine determines the next action, a planner executes it, and a decoder turns the result back into language. “The process is iterative,” Cohen said. “It loops until the task is done. That’s how you get determinism instead of probability.”
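AUI has not published Apollo-1's internals, but the loop Cohen describes can be sketched in miniature. The following is purely illustrative; every class, function, and policy rule in it is an assumption made for explanation, not AUI's actual code:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the closed loop described above: encode language
# into a typed symbolic state, decide the next action from that state,
# execute it, then decode the result back into language. All names and
# rules here are hypothetical, not AUI's published design.

@dataclass
class DialogueState:
    """Typed symbolic state: intents, slots, and flags rather than raw text."""
    intent: str | None = None
    slots: dict[str, str] = field(default_factory=dict)
    done: bool = False

def encode(utterance: str, state: DialogueState) -> DialogueState:
    """Neural layer (stand-in): map natural language onto the symbolic state."""
    if "refund" in utterance.lower():
        state.intent = "refund"
    if "$250" in utterance:
        state.slots["amount"] = "250"
    return state

def decide(state: DialogueState) -> str:
    """Symbolic decision engine: the next action follows from state alone."""
    if state.intent == "refund" and "id_verified" not in state.slots:
        return "request_id_verification"  # policy: verify ID for refunds over $200
    if state.intent == "refund":
        return "issue_refund"
    return "clarify_intent"

def execute(action: str, state: DialogueState) -> DialogueState:
    """Planner (stand-in): run the action against tools and update the state."""
    if action == "request_id_verification":
        state.slots["id_verified"] = "pending"
    elif action == "issue_refund":
        state.done = True
    return state

def decode(action: str) -> str:
    """Decoder (stand-in): turn the chosen action back into language."""
    return {
        "request_id_verification": "Happy to help with that refund. First I need to verify your ID.",
        "issue_refund": "Your refund has been issued.",
        "clarify_intent": "Could you tell me a bit more about what you need?",
    }[action]

# One turn of the loop; in practice it iterates until state.done is True.
state = encode("I want a refund for my $250 charge", DialogueState())
action = decide(state)
state = execute(action, state)
print(decode(action))  # asks for ID verification, every time
```

The point of the shape is that language enters and exits only at the edges; in between, each action follows deterministically from the typed state, which is where the claimed certainty comes from.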
A Foundation Model for Task Execution
Unlike traditional chatbots or bespoke automation systems, Apollo-1 is meant to serve as a foundation model for task-oriented dialogue — a single, domain-agnostic system that can be configured for banking, travel, retail, or insurance through what AUI calls a System Prompt.
“The System Prompt isn’t a configuration file,” Elhelo said. “It’s a behavioral contract. You define exactly how your agent must behave in situations of interest, and Apollo-1 guarantees those behaviors will execute.”
Organizations can use the prompt to encode symbolic slots — intents, parameters, and policies — as well as tool boundaries and state-dependent rules.
A food delivery app, for example, might enforce “if allergy mentioned, always inform the restaurant,” while a telecom provider might define “after three failed payment attempts, suspend service.” In both cases, the behavior executes deterministically, not statistically.
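AUI has not published the System Prompt schema, but a rough sketch of what such a behavioral contract might contain could look like the structure below, with every field name and rule format an assumption rather than AUI's actual format:

```python
# Hypothetical sketch only: AUI has not published the System Prompt schema,
# so the field names and rule format below are illustrative assumptions.
system_prompt = {
    "domain": "food_delivery",
    "intents": ["place_order", "track_order", "cancel_order"],
    "slots": {"restaurant": "string", "items": "list[string]", "allergy": "string (optional)"},
    "tools": ["restaurant_api.notify", "payments.charge", "orders.create"],
    # State-dependent rules: each fires whenever its condition holds in the
    # symbolic state, regardless of how the conversation arrived there.
    "policies": [
        {"when": "allergy is set", "then": "call restaurant_api.notify"},
        {"when": "failed_payments >= 3", "then": "suspend_service"},
    ],
}
```

The contrast with prose prompting is that rules like these bind to symbolic state rather than to wording, so a decision engine can check them on every turn instead of hoping the model remembers them.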
Eight Years in the Making
AUI’s path to Apollo-1 began in 2017, when the team started encoding millions of real task-oriented conversations handled by a 60,000-person human agent workforce.
That work led to a symbolic language capable of separating procedural knowledge — steps, constraints, and flows — from descriptive knowledge like entities and attributes.
“The insight was that task-oriented dialogue has universal procedural patterns,” said Elhelo. “Food delivery, claims processing, and order management all share similar structures. Once you model that explicitly, you can compute over it deterministically.”
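To make that idea concrete, here is one hypothetical way a shared procedural skeleton might be instantiated with per-domain descriptive knowledge; the structures and names are illustrative assumptions, not AUI's symbolic language:

```python
# Hypothetical illustration of "universal procedural patterns": the same
# fixed step order (procedural knowledge) is filled in with different
# entities and attributes (descriptive knowledge) per domain.

DOMAINS = {
    "food_delivery": {
        "required_fields": ["restaurant", "items", "address"],
        "tool": "delivery_api.place_order",
    },
    "claims_processing": {
        "required_fields": ["policy_number", "incident_date", "description"],
        "tool": "claims_api.open_claim",
    },
}

def plan(domain: str) -> list[str]:
    """Expand the shared procedural skeleton with the domain's descriptive knowledge."""
    cfg = DOMAINS[domain]
    return (
        ["identify_request"]
        + [f"collect:{f}" for f in cfg["required_fields"]]
        + ["apply_policies", f"call:{cfg['tool']}", "confirm_outcome"]
    )

print(plan("food_delivery"))
print(plan("claims_processing"))
```

Because the step order is explicit data rather than learned behavior, it can be computed over deterministically, which is the property the next paragraph's reasoner exploits.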
From there, the company built the neuro-symbolic reasoner — a system that uses the symbolic state to decide what happens next rather than guessing through token prediction.
Benchmarks suggest the architecture makes a measurable difference.
In AUI’s own evaluations, Apollo-1 achieved over 90 percent task completion on the TAU-Bench Airline benchmark, compared with 60 percent for Claude 4.
It completed 83 percent of live booking chats on Google Flights versus 22 percent for Gemini 2.5 Flash, and 91 percent of retail scenarios on Amazon versus 17 percent for Rufus, Amazon's own shopping assistant.
“These aren’t incremental improvements,” said Cohen. “They’re order-of-magnitude reliability differences.”
A Complement, Not a Competitor
AUI isn’t pitching Apollo-1 as a replacement for large language models, but as their necessary counterpart. In Elhelo’s words: “Transformers optimize for creative probability. Apollo-1 optimizes for behavioral certainty. Together, they form the complete spectrum of conversational AI.”
The model is already running in limited pilots with undisclosed Fortune 500 companies across sectors including finance, travel, and retail.
AUI has also confirmed a strategic partnership with Google and plans for general availability in November 2025, when it will open APIs, release full documentation, and add voice and image capabilities. Interested customers and partners can sign up via a form on AUI's website to receive more information as it becomes available.
Until then, the company is keeping details under wraps. When asked about what comes next, Elhelo smiled. “Let’s just say we’re preparing an announcement,” he said. “Soon.”
Toward Conversations That Act
For all its technical sophistication, Apollo-1’s pitch is simple: make AI that businesses can trust to act — not just talk. “We’re on a mission to democratize access to AI that works,” Cohen said near the end of the interview.
Whether Apollo-1 becomes the new standard for task-oriented dialogue remains to be seen. But if AUI’s architecture performs as promised, the long-standing divide between chatbots that sound human and agents that reliably do human work may finally start to close.