OpenAI just dropped some uncomfortable news about artificial intelligence: no matter how much we improve these systems, they’ll always hallucinate. That means ChatGPT, Claude, and other AI chatbots will keep making up plausible-sounding information that’s completely wrong.
This isn’t coming from AI critics or skeptics. OpenAI researchers themselves published the study on September 4, 2025, essentially admitting that the technology powering their wildly popular ChatGPT has built-in flaws that can’t be fixed with better engineering.
OpenAI Research Reveals a Fundamental Flaw in LLMs
The research team, led by OpenAI’s Adam Tauman Kalai, Edwin Zhang, and Ofir Nachum, along with Georgia Tech’s Santosh S. Vempala, created a mathematical framework that proves why AI systems must generate false information. They compared it to students guessing on difficult exam questions instead of admitting they don’t know the answer.
“Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty,” the researchers explained.
Here’s the kicker: even when trained on perfect data, these systems will still hallucinate. The study shows that AI’s “generative error rate is at least twice the misclassification rate,” meaning there are mathematical limits that no amount of technological advancement can overcome.
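One hedged way to unpack that quoted inequality: the paper relates generating a valid answer to the easier task of classifying whether a given answer is valid, so a simplified reading of the bound (dropping the paper’s lower-order correction terms) looks like this:

```latex
% Simplified reading of the quoted bound; the paper's exact statement
% carries additional lower-order terms that are dropped here.
%   err_gen : fraction of generated answers that are invalid
%   err_iiv : misclassification rate on "is this answer valid?"
\[
  \mathrm{err}_{\mathrm{gen}} \;\ge\; 2 \cdot \mathrm{err}_{\mathrm{iiv}}
\]
```

Intuitively, a model that cannot reliably recognize whether an answer is valid cannot reliably produce valid answers either, no matter how clean its training data is.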
The researchers tested this theory on current top-tier models. When they asked “How many Ds are in DEEPSEEK?”, the 600-billion-parameter DeepSeek-V3 model answered either 2 or 3 across ten trials. The correct answer is 1. Meta AI and Claude made similar mistakes, with some responses as wildly off as 6 or 7.
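For contrast, the ground truth for that prompt is trivially computable. The snippet below isn’t from the study; it just underlines how deterministic the task is for character-level code, whereas language models process text as multi-character tokens rather than individual letters, which is one common explanation for why such questions trip them up.

```python
# Illustrative only: the letter-counting question has a deterministic answer
# that a single line of code settles.
answer = "DEEPSEEK".count("D")
print(answer)  # 1
```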

Even more concerning, OpenAI’s own advanced reasoning models performed worse than simpler systems. Their o1 model hallucinated 16% of the time when summarizing public information. The newer o3 and o4-mini models were even less reliable, hallucinating 33% and 48% of the time respectively.
The Three Core Reasons for AI Hallucinations
The study identified three core reasons why hallucinations are unavoidable:
First, there’s “epistemic uncertainty” – when a piece of information appears only rarely in the training data, the model simply doesn’t have enough examples to learn it reliably. Second, current AI architectures have fundamental limitations in what they can represent. Third, some problems are computationally intractable, even for a hypothetically superintelligent system.
Neil Shah from Counterpoint Technologies put it bluntly: “Unlike human intelligence, it lacks the humility to acknowledge uncertainty. When unsure, it doesn’t defer to deeper research or human oversight; instead, it often presents estimates as facts.”
The research also revealed something troubling about how we evaluate AI systems. Nine out of ten major AI benchmarks effectively encourage hallucination by giving a model that says “I don’t know” zero credit – the same score as an outright wrong answer – so a confident guess can only help.
This creates a perverse incentive where AI systems learn to always give an answer, even when they’re uncertain. The researchers argue that “language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty.”
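A hypothetical back-of-the-envelope illustration of that incentive, assuming the binary grading the researchers criticize (one point for a correct answer, zero for anything else, including an abstention):

```python
def expected_score(p_correct: float, abstain: bool) -> float:
    """Expected score on one binary-graded question.

    Correct answers earn 1 point; wrong answers and abstentions both earn 0.
    """
    return 0.0 if abstain else p_correct

# Even a model that is only 20% confident scores better by guessing:
print(expected_score(0.2, abstain=False))  # 0.2
print(expected_score(0.2, abstain=True))   # 0.0
```

Under that rule, guessing never hurts and usually helps, so a score-maximizing model learns to answer everything.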
Why Flawless Models Are Impossible
For companies already using AI, this research demands a complete strategy overhaul. Charlie Dai from Forrester notes that enterprises are “increasingly struggling with model quality challenges in production, especially in regulated sectors like finance and healthcare.”
The solution isn’t trying to eliminate hallucinations – that’s mathematically impossible. Instead, businesses need to shift from prevention to risk management. This means implementing stronger human oversight, creating domain-specific safety measures, and continuously monitoring AI outputs.
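What that shift might look like in practice is necessarily speculative, but a minimal sketch is shown below. It assumes the deployment can attach some confidence estimate to each answer (for example, aggregated token log-probabilities or a separate verifier score) and that a human review queue exists; neither is prescribed by the research, and the names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ModelOutput:
    text: str
    confidence: float  # assumed to lie in [0, 1]; how it is produced is deployment-specific

def route(output: ModelOutput, threshold: float = 0.8) -> str:
    """Hypothetical gating policy: send low-confidence answers to human review."""
    return "auto_approve" if output.confidence >= threshold else "human_review"

# Usage: low-confidence outputs never go straight to the end user.
print(route(ModelOutput("The contract clause means X.", confidence=0.55)))  # human_review
print(route(ModelOutput("2 + 2 = 4", confidence=0.99)))                     # auto_approve
```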
Dai recommends that companies “prioritize calibrated confidence and transparency over raw benchmark scores” when choosing AI vendors. Look for systems that provide uncertainty estimates and have been tested in real-world scenarios, not just laboratory benchmarks.
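“Calibrated confidence” has a testable meaning: when a system claims to be 80% sure, it should be right roughly 80% of the time. A rough, hypothetical check on a labeled evaluation set (a simple binned expected calibration error) might look like this:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Rough binned ECE: average gap between stated confidence and observed accuracy."""
    total, ece = len(confidences), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [(c, ok) for c, ok in zip(confidences, correct) if lo < c <= hi or (b == 0 and c == 0)]
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Toy example: a system that claims 90% confidence but is right only half the time.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, True, False]))  # ~0.4
```

The exact metric matters less than the habit of checking a vendor’s stated confidence against observed accuracy on your own data.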
Shah suggests the industry needs evaluation standards similar to automotive safety ratings – dynamic grades that reflect each model’s reliability and risk profile. The current approach of treating all AI outputs as equally trustworthy clearly isn’t working.
The message for anyone using AI is clear: these systems will always make mistakes. The key is building processes that account for this reality rather than hoping the technology will eventually become perfect. As the OpenAI researchers concluded, some level of unreliability will persist regardless of technical improvements.