DeepSeek-R1’s Latest Update: 23K Tokens Per Question, 87.5% AIME Score

Quick Take: DeepSeek just dropped R1-0528, a major upgrade to its R1 reasoning model that now performs close to o3 and Gemini 2.5 Pro. The big wins? AIME 2025 accuracy jumped from 70% to 87.5%, coding performance is up significantly, and proper system prompt support is finally here. Plus, they distilled the new reasoning chains into an 8B model that punches way above its weight class.


🚀 The Crunch

⚡ Developer Tip: Perfect timing to test advanced reasoning for complex coding tasks, mathematical problem-solving, or multi-step workflows. The 8B distilled version makes high-quality reasoning accessible for local deployment scenarios.

Key Actionable Features:

  • System Prompt Support Finally: No more hacking with “\n” prefixes – proper system prompts work out of the box now
  • Massive Performance Jumps: AIME 2025 went from 70% to 87.5%, LiveCodeBench from 63.5% to 73.3%, Codeforces rating from 1530 to 1930
  • 8B Distilled Model Available: DeepSeek-R1-0528-Qwen3-8B delivers 86% AIME 2024 performance in a local-friendly package
  • OpenAI-Compatible API Ready: Drop-in replacement via platform.deepseek.com with the same interface you’re already using (see the sketch right after this list)
  • Enhanced Function Calling: Better tool integration and reduced hallucination rates for production workflows
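
Here’s what the drop-in usage looks like in practice: a minimal Python sketch using the standard OpenAI client, assuming the base URL (https://api.deepseek.com) and reasoning model name (deepseek-reasoner) from DeepSeek’s API docs. Verify both before wiring this into anything real.

```python
# Minimal sketch: calling R1-0528 through the OpenAI-compatible API.
# Base URL and model name are taken from DeepSeek's docs; verify before use.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # issued at platform.deepseek.com
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # the R1 reasoning endpoint
    messages=[
        # System prompts now work out of the box; no "\n" prefix workarounds needed.
        {"role": "system", "content": "You are a careful math tutor. Show your work."},
        {"role": "user", "content": "How many positive integers below 1000 are divisible by 7 but not by 11?"},
    ],
)

# Per DeepSeek's API docs, the reasoner also exposes the chain of thought
# separately as message.reasoning_content.
print(response.choices[0].message.content)
```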

🌐 Availability: Live now on chat.deepseek.com (enable “DeepThink” mode), API at platform.deepseek.com, local deployment via GitHub.

⚠️ Trade-offs: Higher token usage per task (roughly 23K tokens on average vs 12K previously) means higher per-query costs, but the reasoning quality jump is substantial. The MIT license permits full commercial use.

🎯 TLDR: DeepSeek’s R1 just got a major brain upgrade, now approaching top-tier models like o3 while staying open source and commercially usable. The 8B distilled version makes advanced reasoning accessible to everyone.


🔬 The Dive

So what’s the big deal about yet another model upgrade? DeepSeek-R1-0528 represents a significant shift in how reasoning models approach complex problems – and the benchmarks prove it.

🔬 Technical Deep Dive: The real breakthrough here is in “thinking depth.” DeepSeek nearly doubled the average token usage per reasoning task (from 12K to 23K tokens), and that extra computational overhead translates directly into dramatically better performance: a model that is now genuinely competitive with closed-source leaders.
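
To put that extra thinking depth in dollar terms, here’s a back-of-the-envelope sketch. The output price below is a placeholder, not DeepSeek’s published rate; substitute the current pricing from platform.deepseek.com.

```python
# Back-of-the-envelope cost impact of deeper reasoning.
# PRICE is a placeholder assumption, not DeepSeek's published rate.
PRICE_PER_MILLION_OUTPUT_TOKENS = 2.00  # USD, hypothetical

def cost_per_question(avg_tokens: int) -> float:
    """Approximate output-token cost for one reasoning-heavy question."""
    return avg_tokens / 1_000_000 * PRICE_PER_MILLION_OUTPUT_TOKENS

old = cost_per_question(12_000)  # previous R1: ~12K tokens per AIME question
new = cost_per_question(23_000)  # R1-0528: ~23K tokens per AIME question

print(f"old: ${old:.4f}  new: ${new:.4f}  ratio: {new / old:.2f}x")
# Whatever the actual rate, per-question output spend scales by about 1.9x.
```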

The benchmark improvements hold up across the board. GPQA-Diamond jumped from 71.5% to 81%, the Codeforces rating climbed roughly 400 points (great news for competitive programming), and SWE-bench Verified went from 49.2% to 57.6%. These aren’t marginal gains; they represent real capability improvements that matter for production use cases.

[Benchmark comparison chart. Source: Hugging Face]

But here’s where it gets really interesting: the 8B distilled model. DeepSeek took the reasoning chains from R1-0528 and used them to post-train Qwen3 8B Base. The result? A model that achieves 86% on AIME 2024 – matching the performance of much larger models like Qwen3-235B-thinking while being deployable on consumer hardware.
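
If you want to kick the tires locally, a rough sketch with Hugging Face Transformers might look like the following. The repo id follows the announced model name, and the generation settings are assumptions; check the model card for the recommended temperature and chat template before relying on it.

```python
# Rough sketch: running the distilled 8B model locally with Transformers.
# Repo id follows the release name; generation settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # quantize (e.g. 4-bit) for smaller GPUs
    device_map="auto",
)

messages = [{"role": "user", "content": "Prove that the sum of two odd integers is even."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

In bfloat16 the 8B weights alone need roughly 16 GB of VRAM, so a quantized build is the realistic route on smaller consumer cards.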

💡 This distillation approach is a game-changer for the industry. It shows that advanced reasoning capabilities can be transferred to smaller models without massive infrastructure requirements. That’s huge for democratizing access to sophisticated AI capabilities.

Implementation-wise, DeepSeek fixed some of the rough edges. The previous version required manually adding “\n” to trigger reasoning mode, which was clunky for production use. Now it supports proper system prompts and handles reasoning activation automatically. They’ve also provided detailed prompt templates for file uploading and web search integration.

The commercial implications are significant too. With MIT licensing, companies can use this for commercial applications without the restrictions that some other models impose. Combined with the OpenAI-compatible API, it’s positioned as a drop-in replacement for teams looking to reduce costs or maintain more control over their AI stack.

Ready to Test Advanced Reasoning?

Tom Furlanis
Researcher. Narrative designer. Wannabe Developer.
Twenty years ago, Tom was coding his 1st web applications in PHP. But then he left it all to pursue studies in humanities. Now, two decades later, empowered by his coding assistants, a degree in AI ethics and a plethora of unrealized dreams, Tom is determined to develop his apps. Developer heaven or bust? Stay tuned to discover!