
(Source: sdx15/Shutterstock)
As the AI race grows more competitive by the week, Elon Musk is once again trying to pull ahead. His latest model, Grok 4, comes packed with bold claims: faster reasoning, better test scores, and an edge over rivals like OpenAI and Google. It’s not the first time we’ve heard promises like this, and Musk isn’t exactly known for understatement. Whether the results live up to the hype is still an open question, but the buzz around Grok 4 suggests the industry is watching closely.
Grok 4 is the most advanced release yet in xAI’s growing family of AI assistants. It’s Musk’s answer to models like ChatGPT and Gemini, and builds on the earlier Grok 3 with a long list of upgrades. Like its predecessors, it can answer questions, solve math problems, write and explain code, and analyze images. xAI says this update brings a larger training set, better reasoning, and tighter integration with live web data.
The model is available in two versions. There’s a standard option aimed at everyday use, and a Heavy tier designed for more demanding tasks, which runs multiple AI agents in parallel to tackle complex problems. Grok 4 is also deeply embedded into X, where premium users can access it directly. That integration has given it a highly visible platform, one that showcases its strengths, but also puts every misstep on full display. Grok 4 is powered by xAI’s Colossus supercomputer, the infrastructure behind its latest generation of models.
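For developers who want to kick the tires outside of X, xAI also offers API access to its models. Below is a minimal sketch in Python of what a query might look like. It assumes xAI's OpenAI-compatible chat endpoint at api.x.ai and a "grok-4" model identifier; both are assumptions here, so check xAI's current documentation before relying on them.

# Minimal sketch: querying Grok 4 through xAI's OpenAI-compatible API.
# Assumptions: the api.x.ai base URL and the "grok-4" model ID may
# differ from what xAI actually exposes; verify against current docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_XAI_API_KEY",      # key issued from the xAI console
    base_url="https://api.x.ai/v1",  # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="grok-4",  # assumed model identifier
    messages=[
        {"role": "user", "content": "Explain ARC-AGI-2 in two sentences."},
    ],
)
print(response.choices[0].message.content)

Because the interface mirrors OpenAI's client, swapping the base URL and model name is often all it takes to point an existing evaluation script at Grok 4 instead of a rival model.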

(Source: Shutterstock)
Beyond feature upgrades, Grok 4’s early benchmark results are where xAI is focusing much of its attention. The model has been tested on Humanity’s Last Exam, a 2,500-question benchmark designed to evaluate reasoning across a wide range of disciplines, including mathematics, natural sciences, and the humanities.
According to xAI, Grok 4 scored 25.4% without tool assistance, outperforming Google’s Gemini 2.5 Pro at 21.6% and OpenAI’s o3 model at 21%. In its enhanced configuration, Grok 4 Heavy reached 44.4% using external tools, including search and code execution. By comparison, Gemini 2.5 Pro scored 26.9% under the same conditions.
xAI also reported gains on ARC-AGI-2, a benchmark that tests pattern recognition and abstraction through grid-based visual puzzles. Grok 4 scored 15.9%, a result that the ARC Prize Foundation independently verified using a hidden evaluation set. This score is nearly double that of the next best commercial model, Claude Opus 4.
While the ARC benchmarks are artificial tasks, performance on them is often seen as a signal of how well a model can apply reasoning to unfamiliar problems and generalize beyond training data.
“Grok 4 is at the point where it essentially never gets math/physics exam questions wrong, unless they are skillfully adversarial,” Musk wrote in a post on X. “It can identify errors or ambiguities in questions, then fix the error in the question or answer each variant of an ambiguous question.”
The timing of the launch was far from ideal, landing in the middle of a turbulent stretch for Musk’s AI efforts. xAI found itself in damage control after Grok’s automated account on X posted a string of antisemitic replies. The posts were swiftly deleted, and xAI placed temporary restrictions on the account. But the incident renewed concerns about how the model handles sensitive topics.

Humanity’s Last Exam Benchmark (Source: X.ai)
Meanwhile, just hours before Grok 4’s unveiling, Linda Yaccarino resigned as CEO of X. Though her departure wasn’t linked to the chatbot directly, the timing added to the sense of instability surrounding the launch.
Some observers see Grok 4 as a meaningful step forward, especially in technical domains, but also note clear limitations. Alex Olteanu, a senior data science editor at AI education platform DataCamp, has tested the model and says it performs well on advanced benchmarks and structured reasoning tasks, particularly in math and science. At the same time, he points out that it’s not built for everyone.
“It’s not your day-to-day general-purpose assistant. It’s slower than Grok 3, its image and video understanding are still early-stage, and it lacks some polish when it comes to everyday usability. You’ll need to prompt carefully and trim your inputs due to the relatively limited context window. And if you want the best performance, via Grok 4 Heavy, you’ll be paying a premium for it.”
“For developers and researchers, it’s worth exploring. For casual users, the speed and responsiveness of Grok 3 or other mainstream models are a better fit. The roadmap is ambitious, with a coding model, multimodal agent, and video generator all due by October. Whether xAI can deliver those on time is another question. But with Grok 4, they’ve at least made a compelling case that they’re in the race.”

ARC benchmark (Source: X.ai)
xAI has shared very little about how Grok 4 was built: no technical paper, no model specs, and no open testing data. That makes it hard to judge how it really compares with other top models. What is clear is that xAI is moving fast and going public early. Unlike OpenAI or Anthropic, which accompany their releases with technical reports and safety documentation, xAI is focused on getting attention and building inside X. It’s a different kind of strategy, one that is more about reach than research.
Grok 4’s ability to scale is still uncertain. Unlike OpenAI or Google, xAI appears to be working with a smaller, mixed infrastructure that may include Tesla hardware, which could explain the slower performance some users have noticed. Claims of benchmark records have won Grok 4 attention, but holding onto that attention will require real-world performance. Better timing for the launch wouldn’t have hurt, either.