If chatbots answer your queries, AI agents are supposed to do things on your behalf, to actively take care of your life and business. They might shop for you, book travel, organize your calendar, summarize news for you in custom ways, keep track of your finances, maintain databases and even whole software systems, and much more. Ultimately agents ought to be able to do anything cognitive that you might ask a human to do for you.
By the end of 2024, Google, OpenAI, Anthropic, and many others had announced that their first agents were imminent. In January, Anthropic’s Dario Amodei told the Financial Times, “I think 2025 is going to be the year that AI can do what, like, a PhD student or an early degree professional in a field is able to do.” Around the same time, OpenAI’s CEO Sam Altman wrote, “We believe that, in 2025, we may see the first AI agents ‘join the workforce’ and materially change the output of companies.” Google somewhat more modestly announced Project Astra, “a project on the way to building a universal AI assistant.”
Tech columnist Kevin Roose of the New York Times, always quick to cheerlead for the industry, wrote this:

Well, maybe. But do they work?
My own darker prediction (January 1, 2025) was that “AI ‘Agents’ will be endlessly hyped throughout 2025 but far from reliable, except possibly in very narrow use cases.”
With 5 months left to go, where do things stand?
§
All the big companies have in fact introduced agents. That much is true.
But none, to my knowledge, are reliable, except maybe in very narrow use cases. Astra is in beta (or perhaps alpha), available only via a “wait list for trusted testers.” OpenAI actually released their ChatGPT agent [rather unconventionally, the word agent is not capitalized in the name].
ChatGPT agent sounds pretty much like what an agent ought to be – but it comes with a lot of caveats.
ChatGPT now thinks and acts, proactively choosing from a toolbox of agentic skills to complete tasks for you using its own computer….
For instance, it can gather information about your calendar through an API, efficiently reason over large amounts of text using the text-based browser, while also having the ability to interact visually with websites designed primarily for humans.
All this is done using its own virtual computer, which preserves the context necessary for the task, even when multiple tools are used—the model can choose to open a page using the text browser or visual browser, download a file from the web, manipulate it by running a command in the terminal, and then view the output back in the visual browser. The model adapts its approach to carry out tasks with speed, accuracy, and efficiency.
But also
ChatGPT agent is still in its early stages. It’s capable of taking on a range of complex tasks, but it can still make mistakes …. this introduces new risks, particularly because ChatGPT agent can work directly with your data, whether it’s information accessed through connectors or websites that you have logged it into via takeover mode.
In practice, those mistakes and errors have greatly limited the utility of agents like ChatGPT agent. The “pushing the boundaries of what’s possible” that Roose promised us? Largely absent.
§
Even by the end of March, people were starting to see that reality and hype weren’t lining up, as in this Fortune story about a system called Manus (March’s flavor of the month, now largely forgotten), reporting that “some AI agent customers say reality doesn’t match the hype”. That sentiment has only grown over the last several months.
In recent weeks, there have been multiple reports like this:

AI coding agents also seem to be incurring a large amount of technical debt, writing large amounts of copypasta code that is hard to debug. As MIT professor Armando Solar-Lezama told the WSJ, AI is like a “brand new credit card here that is going to allow us to accumulate technical debt in ways we were never able to do before.” Or, to paraphrase an old saying that goes back to 1969, “To err is human; to really screw up takes an AI agent.”
§
Penrose.com created a benchmark for basic account balance tracking, using a year’s worth of actual data from places like Stripe — and found a result that I suspect will be typical: AI errors tend to compound over time. (In fairness, ChatGPT agent wasn’t yet out when Penrose ran the test, but I would be surprised if their results were wildly different).

§
The unrelenting hallucination problem frequently rears its ugly head:

§
Someone working in the AI agent industry posted the following critique on Reddit; the “demo to reality” issue they mention is exactly what we should expect — probably for years to come:

§
In July Futurism ran an analysis reporting that “the percentage of tasks AI agents are currently failing at may spell trouble for the industry”, adding that the “failure rate is absolutely painful”, drawing on a CMU benchmark, TheAgentCompany, which showed failure rates of 70% on some tasks.
§
One of X’s most enthusiastic and generally optimistic influencers forlornly posted this a few days ago, “serious question: do any of you use the ChatGPT agent? And if so: for which use-case? I just can’t find a use-case that matches its (limited) capabilities”. And on Tuesday Jeremy Kahn, at Fortune, often enthusiastic about AI, wrote this:

§
As a consequence of their superficial understanding of what they’re being asked, current agents are grossly vulnerable to cyberattacks.
As CMU PhD student Andy Zou recently reported, as part of a large multi-team effort:

Even the most secure system was undermined 1.45% of the time, which meant that over fifteen hundred attacks were successful. (Even one successful attack can be devastating. An important, publicly visible system that can be beaten in 0.001% of the times it is attacked is a mess.)
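For a rough sense of scale, here is my own back-of-the-envelope arithmetic; only the 1.45% rate and the 1,500-plus successes come from the report, and everything else (the attack volumes in particular) is purely illustrative:

```python
# Back-of-the-envelope arithmetic on attack success rates.
# Only the 1.45% rate and the 1,500+ successes come from the report;
# the rest is an illustrative assumption.
successful_attacks = 1_500
success_rate = 0.0145  # 1.45% of attempts succeeded

implied_attempts = successful_attacks / success_rate
print(f"Implied attack attempts: ~{implied_attempts:,.0f}")  # roughly 103,000

# Even a "tiny" 0.001% success rate is a problem at scale:
tiny_rate = 0.00001
hypothetical_attempts = 10_000_000  # hypothetical attack volume, for illustration
print(f"Expected breaches at 0.001% over {hypothetical_attempts:,} attempts: "
      f"~{tiny_rate * hypothetical_attempts:,.0f}")
```

In other words, a rate that sounds negligible still translates into a steady stream of successful breaches once attack attempts number in the millions.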
§
None of these flaws should surprise anyone. What drives LLMs (and what drives the current batch of agents) is mimicry, and not what Ernest Davis and I called in 2019 deep understanding.
Current systems can mimic the kinds of words people use in completing tasks, often in contextually relevant ways, but that doesn’t really mean that they understand the things that they are doing; they have no concept of what it means to delete a database or make up a fictitious calendar entry. Whether they commit such blunders or not is mainly just a matter of what they happen to mimic. Sometimes mimicry works, and sometimes it doesn’t. As I have often emphasized, that’s when we get hallucinations and boneheaded reasoning errors.
Importantly, agentic tasks often involve multiple steps. In fundamentally unreliable systems like LLMs, that means multiple chances for error. Sooner or later, those errors catch up. That can and sometimes does lead to catastrophe.
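To make the compounding concrete (the same effect the Penrose benchmark observed), here is a minimal sketch; the 98% per-step success rate and the assumption that steps fail independently are purely illustrative, not measured figures:

```python
# Minimal sketch of how per-step reliability compounds over a multi-step task.
# The 98% per-step success rate and the independence assumption are illustrative only.
per_step_success = 0.98

for steps in (1, 5, 10, 20, 50):
    task_success = per_step_success ** steps  # probability of no errors across all steps
    print(f"{steps:2d} steps -> ~{task_success:.0%} chance of finishing without a single error")
```

Under those assumptions, an agent that is right 98% of the time per step completes a 50-step task cleanly only about a third of the time.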
§
I don’t expect agents to go away; eventually AI agents will be among the biggest time-savers humanity has ever known. There is reason to research them, and in the end trillions of dollars to be made.
But I seriously doubt that LLMs will ever yield the substrate we need.
One of the week’s most telling reports came from The Information, which, drawing on a scoop at OpenAI, basically affirmed what I have been saying all along (albeit without citing my earlier analyses): pure scaling is not getting us to AGI; returns are diminishing. GPT-5, they report, will not be the jump over GPT-4 that GPT-4 was over GPT-3.

Another new paper this week argued that LLMs are confronting a wall (echoing my own infamous phrase):

§
Without neurosymbolic AI, more deeply integrated into systems as a whole and with rich world models as a central component, generally following the approach I laid out five years ago, I just don’t see how agents can work out. Reliable agents may not require “artificial general intelligence”, but they surely require what I called in 2020 “robust”, trustworthy AI. LLMs aren’t getting us there.
Someday, I predict, the failure of current agents to produce much of lasting value will be seen as a poignant reflection of a massive mistake, intellectually and economically speaking: the mistake of investing almost solely in LLMs, as a kind of hopeful shortcut. The idea was that we could avoid facing the challenges that classical AI struggled with, simply by gathering lots of data and stirring in massive amounts of compute. This has made a small number of people (most notably Nvidia CEO Jensen Huang) massively wealthy, but after years and what must be close to a trillion dollars invested, this approach has failed to yield systems that can reliably handle your calendar or your bank accounts, let alone serve as the PhD-level agents we have been promised.
Yet more good money is being poured after bad. Staggeringly large amounts of money. According to Renaissance Macro Research, “AI CapEx,” which they define as “information processing equipment plus software,” has added more to GDP growth than consumers’ spending. In a rational world, generative AI would be penalized for failing to meet expectations around agents; in the actual world, investors and big tech companies are continuing to pour on the gas.
Meanwhile, alternative approaches like neurosymbolic AI continue to be wildly underfunded, receiving, if I had to guess, maybe 1% (probably less) of total investment. Investors, who want near-immediate returns, seem unwilling to go there, and seem resistant to exploring genuinely new avenues. Venture capitalists style themselves as brave funders of innovation, but in the current climate, they are anything but.
Perhaps after enough failures of AI agents, which shouldn’t be trusted with your calendar let alone your credit cards, reality may finally start to settle in.
Gary Marcus founded a machine learning company that was sold to Uber and is the author of six books on natural and artificial intelligence.