At first glance, the following statements may seem like they come from the harshest critics of China’s technological innovations:
“We often say that the gap between China and the United States in AI is one or two years. But the real gap is the one between originality and imitation. If this doesn’t change, China will be a follower forever.”
“In the past 30 years, China has essentially not produced any innovation in the tide of IT development, merely following along as a free rider, without contributing to any real technological innovation.”
“Chinese companies are accustomed to taking other (foreign) companies’ innovation, developing applications based on those, and making a fortune from it. But this should not be taken for granted.”
“We have been used to waiting for Moore’s Law to come down from the sky, and then, boom, 18 months later, we have better hardware and software to use. Now, in China, the same is happening with scaling laws.”
But surprisingly, these words come from Liang Wenfeng, the founder of DeepSeek, a Chinese AI start-up that recently shocked the global AI community, particularly in Silicon Valley and on Wall Street.
DeepSeek’s success marks a significant boost for China’s AI innovation. It shows that even in the face of US chip restrictions, Chinese companies can adopt innovative solutions to drive cost-effective development. Their work challenges the notion that China will always be a follower in AI innovation.
Taken from an exclusive interview with Liang conducted by 36Kr, a Chinese media platform, in summer 2024 — when DeepSeek released its V2 model — these quotations strike at the heart of long-standing issues in China’s technological innovation system and the way that many Chinese companies approach business.
So, what makes Mr. Liang, a seemingly “nouveau riche” player in China’s AI industry, so undiplomatically frank in his criticism? What sets DeepSeek apart from other AI giants and start-ups in China? And what do DeepSeek’s innovations mean for the future of AI development? Before answering these questions, we need to explore what DeepSeek has actually done to achieve its breakthroughs in AI innovation.
Two Myths about DeepSeek’s Success
DeepSeek sent shockwaves through the global AI industry in January 2025 when it announced that its V3 model rivals OpenAI’s GPT-4o and other leading large language models (LLMs), despite having been trained at an extremely low cost: US$5.576 million using 2,048 Nvidia H800 chips. This figure pales in comparison to the pre-training costs of around US$40–60 million and the tens of thousands — sometimes even 100,000 — of advanced AI chips (such as the Nvidia H100) used by OpenAI, Meta and other US-based tech giants.
This news drew a sharp reaction from Wall Street, temporarily sinking Nvidia’s stock price as investors feared reduced demand for high-end AI chips (GPUs). Two months later, the reasons for DeepSeek’s success are clearer.
Myth #1: DeepSeek’s cost was just $5.576 million. That figure covered only the GPU time used for the final stage of training. SemiAnalysis, a research and analysis company, estimates that DeepSeek used up to 50,000 H-series chips in earlier training stages, putting its total AI investment at more than US$1.3 billion. Moreover, its parent company, High-Flyer, a hedge fund also owned by Liang, had stockpiled 10,000 Nvidia A100 chips before US export controls took effect in October 2022, making it, at the time, the only company outside China’s few top tech giants capable of training LLMs. The move was driven not by business foresight but by curiosity about AI and artificial general intelligence (AGI), according to an earlier interview Liang gave to 36Kr in May 2023. DeepSeek later purchased additional Nvidia chips, including H800s, H20s and even some H100s, through various channels.
Myth #2: DeepSeek has overturned the trajectory of AI development. Like OpenAI and Google, DeepSeek follows the “deep learning + foundation models” approach, relying on massive data sets, computational power and advanced algorithms (specifically, the transformer neural network architecture) to train models that it believes could eventually reach AGI.
All that being said, DeepSeek has made legitimate, important breakthroughs.
Its most notable achievement is a remarkable cost reduction: through impressive, innovative optimization and engineering in model architecture, training frameworks and algorithms, DeepSeek trained LLMs comparable to the most advanced models from US companies at a fraction of the usual cost. Although the upfront investment was substantial, the savings are real: training the V3 model cost roughly one-tenth as much as OpenAI’s GPT-4, whose training was estimated at US$63 million. Key optimizations that reduced reliance on expensive hardware include:
Mixture of experts (MoE) and multi-head latent attention (MLA): The optimization of the MoE and MLA architectures was critical to the DeepSeek-V3 model’s efficient inference and cost-effective training. Think of a large AI model as a team of specialists, each trained to handle different tasks. Instead of using the entire team for every problem, DeepSeek’s MoE architecture pushed MoE’s potential to a new level, activating only the specialists (or “experts”) needed for a specific task and thereby reducing unnecessary computation (see the MoE sketch after this list). MLA, regarded as a key innovation in DeepSeek-V3, significantly shrinks the key-value cache — the temporary memory a model keeps about the text it is processing — which otherwise consumes large amounts of memory and slows the model down; compressing it improves inference speed and computational efficiency.
Parallel thread execution (PTX) programming: PTX is an intermediate instruction set architecture that Nvidia designed for its GPUs, sitting between high-level CUDA code and the hardware. By reconfiguring its Nvidia H800 chips at this low software level to increase the communication efficiency between multiprocessors, DeepSeek unlocked new levels of AI compute efficiency.
Multi-token prediction: A novel approach to model training, multi-token prediction lets the system predict multiple upcoming tokens simultaneously rather than just the next one, increasing data throughput by two to three times compared with standard next-token prediction (see the sketch after this list). A token is the unit of text an AI model processes at a time; it can be a word, a part of a word, a single character or a phrase, depending on the language and context.
FP8 mixed-precision training: Using 8-bit floating-point precision (FP8) rather than the standard 16-bit (FP16) reduces training costs, allowing faster computations with minimal loss in model accuracy (see the sketch after this list). Bits are tiny units of computer memory; storing numbers in only 8 bits lets AI models do math faster and use less memory while keeping accuracy acceptably high, whereas 16 or 32 bits buy more accuracy at the cost of more memory and computing power.
Model/knowledge distillation: A compression technique that transfers knowledge from a large “teacher” model to a smaller “student” model without significantly degrading performance, distillation is used to shrink massive neural networks and improve efficiency (see the sketch after this list). A neural network is an AI method that teaches computers to process data in a way inspired by the human brain.
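To make the MoE idea concrete, here is a minimal top-k routing layer in PyTorch. The layer sizes, expert count and routing scheme are toy assumptions chosen for clarity; DeepSeek-V3’s actual MoE (with shared experts and load balancing) and its MLA attention are considerably more sophisticated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Toy mixture-of-experts layer: a router activates only top_k of
    num_experts feed-forward "specialists" per token (illustrative sizes,
    not DeepSeek-V3's actual configuration)."""

    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)  # scores each expert per token
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, dim)
        scores = self.router(x)                          # (tokens, experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only top_k experts
        weights = F.softmax(weights, dim=-1)             # normalize their weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * self.experts[e](x[mask])
        return out  # most experts never run for any given token

x = torch.randn(16, 64)
print(ToyMoE()(x).shape)  # torch.Size([16, 64])
```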
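Multi-token prediction can likewise be sketched in a few lines. The tiny embedding “trunk” and linear heads below are stand-ins assumed purely for illustration; DeepSeek-V3’s actual MTP modules are sequential transformer layers, but the principle of extracting several training signals per position is the same.

```python
import torch
import torch.nn as nn

vocab, dim, n_future = 1000, 64, 2  # illustrative sizes

trunk = nn.Embedding(vocab, dim)           # stand-in for a full transformer
heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_future))

tokens = torch.randint(0, vocab, (4, 32))  # (batch, sequence)
h = trunk(tokens)                          # hidden states, (batch, seq, dim)

# Head d predicts the token d+1 positions ahead, so one forward pass
# yields n_future training signals per position instead of just one.
loss = 0.0
for d, head in enumerate(heads):
    logits = head(h[:, : -(d + 1)])        # cannot predict past the sequence end
    targets = tokens[:, d + 1 :]           # tokens d+1 steps ahead
    loss = loss + nn.functional.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1)
    )
print(loss)
```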
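The FP8 trade-off can be seen directly by round-tripping values through PyTorch’s 8-bit float type (available from PyTorch 2.1 onward). This is only a precision demonstration, not DeepSeek’s training recipe, which mixes FP8 with higher-precision accumulation for sensitive operations.

```python
import torch  # requires PyTorch 2.1+ for the float8 dtypes

x = torch.randn(1000, dtype=torch.float32)

# Round-trip through 8-bit floating point (4 exponent bits, 3 mantissa bits).
x_fp8 = x.to(torch.float8_e4m3fn)
x_back = x_fp8.to(torch.float32)

# Each value now occupies 1 byte instead of 4, at the price of a small
# rounding error; mixed-precision training keeps sensitive steps
# (e.g., gradient accumulation) in higher precision to contain that error.
print("bytes per value:", x_fp8.element_size(), "vs", x.element_size())
print("mean absolute rounding error:", (x - x_back).abs().mean().item())
```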
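Distillation is commonly implemented as the classic soft-label loss below, in which the student matches the teacher’s temperature-softened output distribution. This generic recipe is an assumption for illustration, not DeepSeek’s exact procedure for distilling R1 into smaller models.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label distillation: KL divergence between the teacher's and
    student's temperature-softened output distributions."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * T**2

teacher_logits = torch.randn(8, 1000)                       # from a frozen "teacher"
student_logits = torch.randn(8, 1000, requires_grad=True)   # smaller "student"
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow into the student only
print(loss.item())
```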
Finally, group relative policy optimization (GRPO) is the main innovation in the DeepSeek-R1 model. GRPO is a reinforcement learning (RL) algorithm that enhances reasoning capabilities. Unlike traditional RL methods such as proximal policy optimization, which rely on an external “critic” (a separate evaluation model) to judge an AI’s responses, GRPO scores groups of responses relative to one another, improving response quality, as sketched below.
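Stripped of the surrounding policy-gradient machinery, the core of GRPO fits in a few lines: each sampled response’s reward is standardized against its own group, and that group-relative score serves as the advantage. The reward values below are invented for illustration.

```python
import torch

# Rewards for one prompt's group of sampled responses (illustrative scores,
# e.g., 1.0 if the final answer is correct, plus format bonuses).
rewards = torch.tensor([0.1, 0.9, 0.4, 1.0, 0.2])

# GRPO's group-relative advantage: standardize each reward against the
# group's own mean and spread, so no separate critic network is needed.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(advantages)  # above-average responses receive positive advantage

# These advantages then weight a clipped policy-gradient update (as in PPO)
# applied to the tokens of each response.
```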
Implications for the Future of AI
While DeepSeek’s innovations are optimizations of existing technologies rather than wholly original inventions, they do represent remarkable progress in AI development: DeepSeek’s ability to optimize the cost efficiency of LLM training makes it a game-changer. These optimizations significantly lower the threshold for AI model training, making advanced AI technology more accessible to businesses, start-ups and developers worldwide.
Further, because its technology is open-source, DeepSeek is making these innovations freely available, further democratizing AI development and encouraging innovation. This shift could lead to a more inclusive era of AI development featuring cost-effective and scalable machine learning, fewer monopolies by tech giants and greater participation from businesses globally.
The Impact on China’s AI Innovation
DeepSeek’s breakthrough was a shot in the arm for China’s AI innovation, encouraging developers, start-ups and investors to double down on creative and cost-effective solutions for AI development and applications in various sectors. DeepSeek’s homegrown young engineers have demonstrated exceptional skill in optimizing existing technologies in the face of US restrictions on advanced AI chips. This outcome reinforces the adage that “necessity is the mother of invention” and, ironically, calls into question the effectiveness of US sanctions in limiting China’s AI progress.
Before DeepSeek, there was considerable pessimism in China regarding its ability to lead on AI. While China still lags behind the United States in AI development, DeepSeek demonstrates that fostering an environment of curiosity-driven innovation — rather than simply chasing profit — can lead to original technological breakthroughs.
As Liang said in an interview with 36Kr in summer 2024, “We did not intend to become the catfish that caused the catfish effect in the first place; it happened by accident.”
That being said, two key points must be considered when assessing DeepSeek’s success. First, as noted above, its achievements are based on optimizing existing AI approaches rather than developing entirely new paradigms. Second, DeepSeek’s free-style, curiosity-driven innovation is somewhat unique in China. It stands out due to Liang’s passion and geek-like working style — his day job consists of writing code, reading papers and taking part in group discussions — reminiscent of the early days of Bill Gates and Steve Jobs.
Liang’s goal of developing AGI, in line with global leaders such as OpenAI’s Sam Altman, sets DeepSeek apart from other AI companies in China. He believes that the most important thing for Chinese companies is to participate in the global wave of innovation and technological progress rather than focusing solely on short-term financial gain. This belief reflects a hope for more genuine, original innovation in China.
Liang’s comments about China as a follower in innovation echo the words of economist Zhang Weiying, who argued in 2017 that the country’s rapid economic growth in recent decades was built on technologies and products developed by advanced Western countries over the past 500 years, a period in which China produced no real innovations of its own. He argued that the future of Chinese innovation will depend on market-driven entrepreneurship. By contrast, the nationalist scholar Zhang Weiwei claims that China is leading the Fourth Industrial Revolution and setting a global benchmark for innovation in many sectors. He argues that China’s model of development can compete with Western models, and sometimes even surpasses them.
The future of China’s innovation will depend on which of these perspectives Chinese policy makers ultimately choose to embrace.