China’s open-source AI scene is heating up again. After DeepSeek’s rapid rise earlier this year, a new challenger is making waves in the form of Kimi K2 from Moonshot AI.
Although it launched with less fanfare, Kimi K2 is now drawing serious attention from AI insiders and outperforming some of the biggest names in the game.
It is climbing the ranks fast, beating expectations on benchmarks and sparking comparisons to DeepSeek’s breakout moment. Some even believe it is strong enough to have made OpenAI rethink its release schedule.
“China’s Kimi K2 is having its mini DeepSeek moment: it is now #14 on OpenRouter today, ahead of Grok 4 and GPT-4.1,” Deedy Das of Menlo Ventures wrote in a post on X.
He added that this is a non-reasoning model, yet it scores highest on major EQ and creative writing benchmarks. “Best model smell since (Claude) 3.5 Sonnet,” he said.
Based on current API pricing, Kimi K2 is roughly 80-90% cheaper than Claude Sonnet 4 on a per-token basis.
The model is now available in preview on GroqCloud at 185 tokens per second.
Kimi K2 uses a sparse mixture-of-experts (MoE) design with one trillion total parameters, of which 32 billion are active per query. Of its 384 specialised expert subnetworks, only a small subset is activated dynamically for each token, which lowers compute needs while preserving capacity. It also supports a 128,000-token context window.
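The core idea behind sparse MoE routing can be illustrated with a toy sketch: a gating network scores every expert, only the top-k experts actually run, and their outputs are mixed. The dimensions and random "experts" below are invented for illustration and are nothing like Kimi K2's real scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- far smaller than Kimi K2's 384 experts.
N_EXPERTS = 8      # total expert subnetworks
TOP_K = 2          # experts actually activated per token
D_MODEL = 16       # hidden size

# Each "expert" here is just a random linear layer.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(N_EXPERTS)]
gate_w = rng.standard_normal((D_MODEL, N_EXPERTS))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector through only its top-k experts."""
    logits = x @ gate_w                    # one gating score per expert
    top = np.argsort(logits)[-TOP_K:]      # indices of the k best-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected k only
    # Only TOP_K of the N_EXPERTS matrices are touched -- the compute saving.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(D_MODEL)
out = moe_forward(token)
print(out.shape)  # (16,)
```

The capacity-versus-compute trade-off is visible directly: all eight expert matrices exist (capacity), but each token pays for only two matrix multiplies (compute).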
Shortly after the model dropped, OpenAI CEO Sam Altman announced a delay in the release of the company’s open-source model.
“Kimi mogged OpenAI, and I genuinely think the real reason they delayed the open-source model release is Kimi K2,” AI enthusiast Ashutosh Shrivastava wrote on X. He added that OpenAI “never saw this coming”. Kimi K2 outperforms DeepSeek V3 and goes head-to-head with Claude Opus 4 and GPT-4.1.
This comes against the backdrop of OpenAI naming another Chinese AI startup, Zhipu, as a potential threat to its dominance.
Kimi K2 delivered top-tier results in coding and math benchmarks. On SWE-bench Verified, it scored 65.8%, outperforming GPT-4.1 at 54.6% and coming close to Claude Sonnet 4. On LiveCodeBench, it achieved 53.7%, ahead of DeepSeek V3 (46.9%) and GPT-4.1 (44.7%).
On the MATH-500 benchmark, it scored 97.4%, compared to GPT-4.1’s 92.4%. Kimi K2 also performs strongly across AIME, GPQA, OJBench, and tool-use evaluations.
Artificial Analysis said that while Moonshot AI’s Kimi K2 is the leading open-weight non-reasoning model in its Intelligence Index, it outputs roughly three times more tokens than other non-reasoning models, blurring the line between reasoning and non-reasoning.
As a non-reasoning model, it excels in creative tasks. It is now the Short-Story Creative Writing champion, scoring 8.56 and surpassing the previous leader, o3-pro, which scored 8.44.
“Kimi-K2-Instruct now ranks #1 on EQ-Bench 3, a benchmark for emotional intelligence in LLMs. It leads GPT-4o, Claude, and Gemini across empathy, insight, and creative writing,” Jan (@jandotai) wrote on X on July 14, 2025.
Agentic Capabilities
Kimi K2 is also built for agentic work. According to the company, unlike traditional LLMs, Kimi K2 can plan and execute multi-step tasks autonomously. It can call external APIs, generate and debug code, and create plots, webpages and more, all without manual prompting at each step.
There are two versions of the model. While the Base variant is designed for research and fine-tuning, the Instruct variant is intended for use in chatbots and agents.
In a blog post, the company shared that Kimi K2’s agentic abilities are driven by two core components: large-scale tool-use training and general reinforcement learning (RL).
In order to teach the model how to use tools effectively, Moonshot AI built a large-scale synthetic data pipeline inspired by ACEBench. This system simulates real-world tool-use tasks across hundreds of domains and thousands of tools, combining both real and synthetic examples.
“Our approach systematically evolves hundreds of domains containing thousands of tools, including both real MCP (Model Context Protocol) tools and synthetic ones, then generates hundreds of agents with diverse tool sets,” the company said.
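At toy scale, a pipeline like this amounts to crossing domains, tools, and task templates into training records. The domain and tool names below are invented for illustration and have nothing to do with Moonshot AI's actual data.

```python
import itertools
import random

random.seed(7)

# Invented domain/tool inventory -- purely illustrative.
DOMAINS = {
    "travel": ["search_flights", "book_hotel"],
    "finance": ["get_stock_price", "convert_currency"],
    "weather": ["get_forecast"],
}
TASK_TEMPLATES = [
    "Use {tool} to help a user in the {domain} domain.",
    "Plan a multi-step job in {domain} that calls {tool}.",
]

def generate_examples(n: int):
    """Cross domains x tools x templates into synthetic tool-use records."""
    combos = [
        (domain, tool, tmpl)
        for domain, tools in DOMAINS.items()
        for tool, tmpl in itertools.product(tools, TASK_TEMPLATES)
    ]
    sampled = random.sample(combos, k=min(n, len(combos)))
    return [
        {"domain": d, "tool": t, "prompt": tmpl.format(tool=t, domain=d)}
        for d, t, tmpl in sampled
    ]

examples = generate_examples(4)
for ex in examples:
    print(ex["prompt"])
```

The real pipeline evolves hundreds of domains and thousands of tools, and mixes real MCP tools with synthetic ones; the combinatorial structure, though, is the same idea.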
It Comes with Flaws
Despite the good benchmark figures, Ethan Mollick, a professor at Wharton, described Kimi K2 as “a really weird model” that still needs much more testing. He recounted an experiment where he gave it a slightly altered version of the novel The Great Gatsby.
Like Claude, the model spotted the two intentional changes, but then “made up a ton of hallucinated nonsense that sounded plausible but was just plain wrong”.
He added that the DeepSeek moment was largely fueled by pent-up consumer demand for high-quality free AI, especially among students looking for help with homework.
According to him, Kimi K2, despite its strong performance, hasn’t seen the same immediate public impact. One possible reason he observed is that for most consumers and students, “DeepSeek is good enough”.
“Feels like unlike DeepSeek, the general public hasn’t felt the effect/impacts of Kimi K2 yet – most non-technical people have probably never even heard of it. Wonder why it is being overlooked when DeepSeek got so much attention,” wrote a user on X.
Meanwhile, DeepSeek’s upcoming model, R2, is still unreleased, and it may be delayed further. A recent report suggests that US export restrictions on NVIDIA’s H20 chips, which are essential for training and deploying the model, could pose serious challenges in China.
Kimi K2 may not have the same hype DeepSeek had, but its performance is hard to ignore. With strong benchmarks and growing visibility, it is clear that China’s open-source push is far from over.