Chinese AI lab DeepSeek is under renewed scrutiny following the release of its updated R1 model, with researchers suggesting it may have been trained on outputs from Google’s Gemini models.
Developer Sam Paech pointed to linguistic similarities between DeepSeek’s R1-0528 and Gemini 2.5 Pro, claiming in a post on X that the model’s phrasing patterns suggest a switch from OpenAI-based to Gemini-generated synthetic data. A second developer, the creator of the SpeechMap evaluation tool, said the model’s internal reasoning “traces” resemble Gemini’s.
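For readers curious what a “phrasing pattern” comparison can look like in practice, here is a minimal sketch of one common approach: building n-gram frequency profiles from samples of each model’s output and measuring how closely they overlap. This is purely illustrative and is not Paech’s actual methodology; the sample-loading step is hypothetical.

```python
# Illustrative only: one simple way to compare two models' word-choice
# "fingerprints" via n-gram frequency overlap. NOT Sam Paech's method.
from collections import Counter
import math

def ngram_profile(texts, n=2):
    """Count word n-grams across a corpus of model outputs."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram frequency vectors."""
    shared = set(a) & set(b)
    dot = sum(a[g] * b[g] for g in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical usage: a markedly higher similarity to Gemini samples than to
# GPT samples would be the kind of signal the claim describes. Suggestive
# evidence at best, since many models share common phrasing.
# r1, gem, gpt = load_samples(...)   # placeholder, not a real loader
# print(cosine_similarity(ngram_profile(r1), ngram_profile(gem)))
# print(cosine_similarity(ngram_profile(r1), ngram_profile(gpt)))
```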
This isn’t the first time DeepSeek has faced such allegations. In December, its V3 model appeared to misidentify itself as ChatGPT. OpenAI previously told the Financial Times that it had linked DeepSeek to data scraping via distillation, the practice of training a model on the outputs of a more advanced one. Microsoft reportedly detected suspicious data exfiltration from OpenAI-linked developer accounts in late 2024.
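To make the distillation allegation concrete, the sketch below shows the general shape of API-based distillation: querying a stronger “teacher” model and saving its responses as supervised fine-tuning data for a smaller “student” model. This is a generic illustration, not DeepSeek’s pipeline; `teacher_api.complete` and the output path are hypothetical.

```python
# Illustrative only: the general shape of API-based distillation.
# `teacher_api` and its `complete` method are hypothetical stand-ins.
import json

def build_distillation_set(prompts, teacher_api, out_path="synthetic_train.jsonl"):
    """Query a stronger 'teacher' model and save prompt/response pairs
    that a smaller 'student' model can later be fine-tuned on."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            response = teacher_api.complete(prompt)  # teacher model output
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
    return out_path

# The resulting JSONL becomes fine-tuning data for the student, which is why
# a teacher's phrasing quirks can later surface in the student's outputs.
```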
Model similarities alone don’t prove misuse, since many models converge on similar phrasing as the open web fills up with AI-generated text. Even so, experts say the risk of “AI slop” contaminating training data is growing. As a countermeasure, OpenAI and others have begun limiting API access and summarizing model reasoning traces to hinder unauthorized distillation.
“DeepSeek is short on GPUs and flush with cash,” said AI2 researcher Nathan Lambert. “Using synthetic data from top-tier models would be a logical shortcut.”