DeepSeek, the Chinese AI startup, has launched DeepSeek V3.1, a new hybrid reasoning model designed for agentic use cases and tool calling. It offers two modes, Think and Non-Think, and can automatically reason for longer when a query requires more time to solve. The thinking mode can be toggled with the "DeepThink" button.
The non-think mode is served as deepseek-chat, and the thinking mode as deepseek-reasoner. Both offer a 128K-token context length and activate 37B of the model's 671B total parameters. The underlying DeepSeek V3.1 Base was trained on an additional 840B tokens on top of V3. Notably, DeepSeek V3.1 performs very well at multi-step reasoning tasks.
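Since the two modes are exposed as separate model names, switching between them amounts to choosing the model in the request. The sketch below builds an OpenAI-style chat-completion payload; the `build_request` helper is a hypothetical illustration, not an official SDK function, though the model names "deepseek-chat" and "deepseek-reasoner" are the ones described above.

```python
# Minimal sketch: toggle DeepSeek V3.1's Think/Non-Think modes by picking
# the model name in an OpenAI-compatible chat-completion payload.
# build_request is a hypothetical helper for illustration only.

def build_request(prompt: str, think: bool = False) -> dict:
    """Build a chat payload; think=True routes to the reasoning model."""
    model = "deepseek-reasoner" if think else "deepseek-chat"
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# Non-think mode uses deepseek-chat; think mode uses deepseek-reasoner.
print(build_request("Hello")["model"])                  # deepseek-chat
print(build_request("Prove it", think=True)["model"])   # deepseek-reasoner
```

The payload would then be sent to DeepSeek's OpenAI-compatible endpoint; everything except the model name stays the same between the two modes.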
For instance, on SWE-bench Verified — a benchmark that tests coding performance on real-world software engineering tasks — DeepSeek V3.1 scored 66.0%, well above the 44.6% of DeepSeek R1-0528. For reference, OpenAI's GPT-5 Thinking scored 74.9% and Anthropic's Claude Opus 4.1 achieved 74.5%.
On Humanity's Last Exam (HLE), DeepSeek V3.1 achieved 29.8% with tool calling, and on GPQA Diamond it scored 81%. Overall, the new DeepSeek V3.1 model is a clear improvement over the earlier R1-0528 model, though it still trails GPT-5 and the Claude 4 models. As for API pricing, DeepSeek V3.1 costs $0.56 per million input tokens and $1.68 per million output tokens.
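To put those prices in perspective, the arithmetic is straightforward: multiply each token count by its per-token rate. The sketch below assumes the listed rates of $0.56/$1.68 per million input/output tokens and ignores any cache-hit discounts the provider may offer.

```python
# Estimate DeepSeek V3.1 API cost at the listed rates:
# $0.56 per 1M input tokens, $1.68 per 1M output tokens.
# Cache-hit discounts, if any, are not modeled here.

INPUT_PRICE = 0.56 / 1_000_000   # USD per input token
OUTPUT_PRICE = 1.68 / 1_000_000  # USD per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated request cost in USD."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# e.g. a 10K-token prompt with a 2K-token answer:
print(f"${estimate_cost(10_000, 2_000):.4f}")  # $0.0090
```

At these rates, even fairly long agentic sessions cost fractions of a cent per request, which is a large part of the model's appeal against GPT-5 and Claude.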