OpenAI recently launched GPT-5, its latest large language model and a major update to ChatGPT. The new model has a lot going for it on paper, but claims are one thing and reality is another.
GPT-5 is said to be faster, less prone to hallucination and sycophantic behavior, and able to choose between fast responses and deeper “thinking” on the fly. How many of those claims actually hold up when you use the chatbot? Let’s find out.
Claim #1: ChatGPT is now better at following instructions

My main problem with ChatGPT, and one of the reasons I recently unsubscribed, is that it’s often pretty bad at following basic instructions. Sure, you can prompt engineer it to oblivion and sometimes get what you want, but even semi-elaborate prompts often fail to produce the desired results.
OpenAI claims that it improved “instruction following” with the release of GPT-5. To that, I say: I don’t see it yet.
Luckily for me, on the very day I sat down to write this article, I had a fitting interaction with ChatGPT that proves my point. It’s not an isolated case, either; I’ve generally noticed that the longer a conversation goes on, the more ChatGPT forgets what was asked of it.
In today’s example, I tested ChatGPT’s ability to fetch simple information and present it in the required format. I asked it for the specs of the RTX 5060 Ti, which is a recent gaming graphics card. Chaos ensued.
To give my prompt the best chance of success, I showed ChatGPT the exact format I wanted by sharing the specs of a different GPU as a template. These included things like the exact process node and the generation of ray tracing cores and TOPS; in short, pretty specific stuff. Initially, the AI told me that the RTX 5060 Ti doesn’t exist yet, which I half expected given its knowledge cutoff, so I told it to check online.
What I got back was pretty barebones. ChatGPT omitted at least four things I asked for and got one of the specs wrong. Next, I asked it to fill in a few of the missing details. It gave me the exact same list in return while claiming to have fulfilled my request, and the same happened on the third attempt. You can see it in the screenshot above, where ChatGPT claims to have included the generation of TOPS and TFLOPS in the list when it clearly did not.
Finally, semi-frustrated, I pasted a screenshot from the official Nvidia website to show it what I was looking for. It still got a couple of things wrong.
My initial prompt was reasonably precise. I know better than to talk to an AI like it’s a person, so I gave it about 150 words’ worth of instructions. It still took several more messages to get something close to the result I expected.
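For anyone curious what this kind of structured prompt looks like outside the chat window, here’s a minimal sketch of the same one-shot format trick reproduced through the OpenAI Python SDK. The model identifier and the template fields are my own illustrative assumptions, not a transcript of what I actually typed:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One-shot format prompting: show the model a filled-in template for one GPU,
# then ask for the same template for another. The fields below are
# illustrative, not the exact list I used.
template_example = """RTX 4060 Ti
- Process node: TSMC 4N
- VRAM: 8 GB GDDR6
- Ray tracing cores: 3rd generation
- Tensor cores: 4th generation
"""

response = client.chat.completions.create(
    model="gpt-5",  # assumption: the model identifier at the time of writing
    messages=[
        {
            "role": "user",
            "content": (
                "Here is the exact format I want, filled in for one GPU:\n\n"
                f"{template_example}\n"
                "Now give me the RTX 5060 Ti in exactly this format. "
                "Do not omit any field; check current online sources."
            ),
        }
    ],
)
print(response.choices[0].message.content)
```

The idea is the same as in the chat UI: the worked example carries most of the instruction, and the explicit “do not omit any field” line is exactly the part GPT-5 kept ignoring in my tests.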
Verdict: It could still use some work.
Claim #2: ChatGPT is less sycophantic

ChatGPT was a major “yes man” in previous iterations. It often agreed with users even when it shouldn’t have, which drove conversations deeper and deeper into hallucination.
For users who aren’t familiar with the inner workings of AI, this could be borderline dangerous, and in some cases extremely so.
Researchers recently carried out a large-scale test of ChatGPT, posing as young teens. Within minutes of simple interactions, the AI gave those “teens” advice on self-harm, suicide planning, and drug abuse. This shows that sycophantic behavior is a major problem for ChatGPT, and OpenAI claims to have curbed some of it with the release of GPT-5.
I never tested ChatGPT to such extremes, but I’ve definitely found that it tended to agree with you no matter what you said. It took subtle cues from the conversation and treated them as established facts. It also cheered you on at times when it really shouldn’t have.
On that front, I have to say that ChatGPT has gone through a complete personality change, for better or worse. The responses are now overly dry, unengaging, and not especially encouraging.
Many users mourn the change, with some Reddit users claiming they “lost their only friend overnight.” It’s true that the previously ultra-friendly AI is now rather cut-and-dried, and its responses are often short compared to the emoji-infested mini-essays it regularly served up in its GPT-4o days.
Verdict: Definitely less sycophantic. On the other hand, it’s also painfully boring.
Claim #3: GPT-5 is better at factual accuracy

The shocking lack of factual accuracy was another big reason why I stopped paying for ChatGPT. On some days, it felt like half the prompts I used produced hallucinations. And it can’t all be down to poor prompting on my part; I’ve spent hundreds of hours learning how to prompt AI the right way, and I know how to ask the right questions.
Over time, I’ve learned to ask only about things I already have a rough idea about. For today’s experiment, I started with GPU specs. Four out of five queries produced some kind of wrong information, even though all of it is readily available online.
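If you want to reproduce this kind of spot check, a short script like the one below is enough: fire off a handful of spec questions and cross-check the answers by hand against Nvidia’s official product pages. As before, the model identifier and the questions themselves are assumptions for illustration:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Spec questions whose correct answers are easy to verify against
# the manufacturer's site. The list is illustrative.
questions = [
    "How much VRAM does the Nvidia RTX 5060 Ti have?",
    "What process node is the Nvidia RTX 5060 Ti built on?",
    "What generation of ray tracing cores does the RTX 5060 Ti use?",
]

for question in questions:
    response = client.chat.completions.create(
        model="gpt-5",  # assumption: model identifier at the time of writing
        messages=[{"role": "user", "content": question}],
    )
    # Print each answer for manual cross-checking against a trusted source.
    print(f"Q: {question}\nA: {response.choices[0].message.content}\n")
```

Nothing fancy, but it makes the hit rate easy to eyeball across a batch of questions.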
Then, I tried historical facts. I’d read a couple of interesting articles about the Hindenburg, a 1930s airship that could ferry passengers from Europe to the U.S. in what was then record time (around 60 hours). I asked ChatGPT about its exact route, the number of passengers it could carry, and what led to its ultimate demise, then cross-checked the responses against historical sources.
It got one thing wrong about the route, mentioning a stop in Canada when no such stop took place; the airship only flew over Canada. ChatGPT also gave me inaccurate information about the exact cause of the fire that led to the crash, although that was a minor slip.
For comparison’s sake, I also asked Gemini, which simply told me it couldn’t complete the task. Of the two, GPT-5 clearly did the better job, but honestly, it shouldn’t be getting anything wrong about events that are nearly a century old.
Verdict: Not perfect, but also not terrible.
Is GPT-5 better than GPT-4o?

If you asked me whether I like GPT-5 more than GPT-4o, I’d have a hard time answering. The closest I can get to an answer is that I wasn’t thrilled with either, though in all fairness, neither is strictly bad.
We’re still in the midst of the AI revolution. Each new model brings certain upgrades, but we’re unlikely to see massive leaps with every new iteration.
This time around, it feels like OpenAI chose to tackle some long-overdue problems rather than introducing any single feature that makes the crowds go wild. GPT-5 feels like more of a quality-of-life improvement than anything else, although I haven’t tested it for tasks like coding, where it’s said to be much better.
The three things I tested above were among the ones that annoyed me most in previous models. I’d like to say that GPT-5 is much better in those regards, but it isn’t, at least not yet. I will keep testing the chatbot, though, as a recently leaked system prompt suggests there might have been more personality changes than I initially thought.