
DeepSeek: A Game-Changer in Cost-Effective AI Training
The AI landscape is evolving rapidly, with DeepSeek emerging as a game-changer in cost-efficient large-scale model training. While many focus on how cheap DeepSeek’s training costs are, the real question is why and how they achieved such efficiency.
After exploring the DeepSeekMath, DeepSeek-V2, DeepSeek-V3, and DeepSeek R1 papers, I uncovered several key innovations driving their success.
1. DeepSeekMath: Smarter Data Extraction for Better Models
📌 DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models — Published: April 27, 2024
The DeepSeekMath model proves that high-quality data is out there; we just need better methods to extract and utilize it. To tackle this, the team developed a FastText-based classifier to filter mathematical text at scale.
How They Did It
- Building a Robust Seed Dataset: The team started by using OpenWebMath pages as positive examples and 500K pages from Common Crawl as negatives, ensuring a diverse and high-quality training set.
- Leveraging FastText for Efficient Filtering: FastText was chosen for its character-based embeddings and n-grams, making it well-suited for identifying mathematical content. Its lightweight design enabled the team to process massive datasets efficiently.
- Extracting 120B Tokens to Surpass Existing Corpora: Using this approach, they built the DeepSeekMath Corpus, 120B high-quality math-related tokens, a dataset that far surpasses existing corpora such as MathPile (8.9B), OpenWebMath (13.6B), and Proof-Pile-2 (51.9B).
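To make the filtering step concrete, here is a minimal sketch of what a FastText-based math classifier could look like, assuming the `fasttext` Python package and a hypothetical labeled file `seed_data.txt` (the file name, hyperparameters, and threshold are illustrative, not DeepSeek's actual pipeline):

```python
import fasttext

# Train a lightweight text classifier on labeled seed data.
# Each line of seed_data.txt looks like: "__label__math <page text>"
# (OpenWebMath pages as positives, Common Crawl pages as negatives).
model = fasttext.train_supervised(
    input="seed_data.txt",   # hypothetical seed file
    dim=256,                 # embedding dimension
    wordNgrams=3,            # word n-grams capture short math phrases
    minn=3, maxn=6,          # character n-grams handle LaTeX-like tokens
    epoch=5,
)

def is_math_page(text: str, threshold: float = 0.9) -> bool:
    """Keep a crawled page only if the classifier is confident it is math."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__math" and probs[0] >= threshold

# Example: filter a stream of crawled pages down to math-related content.
pages = ["We prove that every finite group of prime order is cyclic...",
         "Top 10 travel destinations for summer 2024..."]
math_pages = [p for p in pages if is_math_page(p)]
```

Because FastText is so cheap to run, a filter like this can be applied to billions of Common Crawl pages, which is what makes extracting 120B tokens feasible.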
Introduction of Group Relative Policy Optimization (GRPO)
They developed Group Relative Policy Optimization (GRPO) as a memory-efficient alternative to Proximal Policy Optimization (PPO). While both methods are based on reinforcement learning, GRPO introduces key advancements that make it better suited for cost-effectively training large AI models.

What do PPO and GRPO do?
PPO and GRPO are methods for improving a model while maintaining stability.
In Figure 1:
- q → Input
- Policy Model → The LLM
- o → Output
- Reference Model → A frozen copy of the original LLM; deviations from its outputs are penalized.
- Reward Model → A model trained on the environment, acting as if it were the environment itself, providing a reward based on the LLM’s output.
- Value Model → Predicts the expected average reward, providing a baseline.
- A (Advantage) → The computed advantage, which is used in gradient descent to improve the LLM.
PPO (Proximal Policy Optimization)
Given an input (q), the Policy Model (LLM) generates an output (o). This output is then processed in three parallel steps:
- Reference Model: A frozen version of the original LLM estimates the output and applies a penalty based on how different the new output is.
- Reward Model: A separate model, acting as a proxy for the environment, assigns a reward to the LLM’s output.
- Value Model: Predicts the expected average reward to establish a baseline.
The penalty, reward, and predicted value are then used in Generalized Advantage Estimation (GAE) to compute the advantage (A). This advantage is used to train both the LLM and the Value Model, refining their performance.
GRPO (Group Relative Policy Optimization)
GRPO is similar to PPO, but with a key difference: instead of predicting the average reward with a Value Model, the LLM generates a group of sample outputs for each input. The rewards for these outputs are computed, and their average is used as the baseline.
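Here is a minimal sketch of that group-relative baseline idea (illustrative only: the reward values and group size are made up, and the full GRPO objective also includes a clipped probability ratio and a KL penalty against the reference model):

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """Compute advantages for a group of sampled outputs to the same prompt.

    Instead of a learned Value Model (as in PPO), the baseline is simply the
    mean reward of the group; rewards are normalized by the group's std.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + 1e-8)

# Example: 4 sampled completions for one prompt, scored by the reward model.
rewards = torch.tensor([0.2, 0.9, 0.4, 0.7])
print(grpo_advantages(rewards))  # positive for above-average samples
```

Because the baseline comes from the group statistics rather than a separate network, no Value Model has to be stored or trained.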
Why GRPO Outperforms PPO
GRPO outperforms PPO by eliminating the Value Model, leading to significant improvements in efficiency, scalability, and cost-effectiveness.
- Lower Memory Overhead: Removing the Value Model significantly reduces memory consumption, making GRPO more scalable for large AI models.
- Faster Training: Without the computational burden of the Value Model, training time is reduced, leading to more efficient resource usage.
- Cost-Effective & Efficient: By optimizing resource allocation, GRPO lowers training costs while maintaining high performance.
GRPO was first introduced in April 2024 and quickly proved to be highly effective. It was later adopted in subsequent models, including DeepSeek-V3 and DeepSeek-R1.
For a deep dive, check out Yannic Kilcher’s explanation⁷.
Benchmark Results

Tool-Integrated Reasoning on English and Chinese Benchmarks. Scores in gray denote majority votes with 32 candidates; the others are Top1 scores. DeepSeekMath-RL 7B beats all open-source models from 7B to 70B, as well as the majority of closed-source models. Although DeepSeekMath-RL 7B is only further trained on chain-of-thought-format instruction tuning data of GSM8K and MATH, it improves over DeepSeekMath-Instruct 7B on all benchmarks.¹
A 7B model pre-trained on the DeepSeekMath Corpus outperformed some 70B models, including math-specific models like WizardMath-70B. Despite its smaller size, the model demonstrated superior performance, proving that high-quality data can be more valuable than model size.
This confirms that more high-quality data leads to better performance.
2. DeepSeek-V2: Architectural Innovations That Drive Efficiency
📌 DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model Published: June 19, 2024
In DeepSeek-V2, the key improvements include Multi-Head Latent Attention (MLA) for faster inference, DeepSeekMoE for cost-effective training, and Device-Limited Routing to reduce communication overhead, making the model more efficient and scalable.

MLA ensures efficient inference by significantly reducing the KV cache for generation, and DeepSeekMoE enables training strong models at an economical cost through the sparse architecture.²
Multi-Head Latent Attention (MLA): Faster and More Memory-Efficient Inference
Problem:
Large language models rely on Multi-Head Attention (MHA), but managing the Key-Value (KV) cache poses a significant obstacle to the inference efficiency of LLMs. The KV cache requires significant memory and slows down inference speed.
Existing Solutions:
Approaches like Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) attempt to reduce the KV cache size. However, these methods compromise model performance, as they sacrifice attention quality for efficiency.
DeepSeek Innovation:
DeepSeek introduces Multi-Head Latent Attention (MLA), which compresses the KV cache using low-rank key-value compression. This optimization allows DeepSeek-V2 to reduce memory usage while maintaining accuracy.
Key benefits of MLA:
- Reduces memory footprint during inference.
- Speeds up response times without lowering accuracy.
- Outperforms standard MHA while requiring fewer computational resources.
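A rough sketch of the low-rank compression idea behind MLA, in PyTorch (greatly simplified: real MLA also handles rotary position embeddings on a separate path and absorbs the up-projections into the attention computation at inference time; all dimensions below are illustrative):

```python
import torch
import torch.nn as nn

class LowRankKV(nn.Module):
    """Compress hidden states into a small latent vector and cache only that."""

    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # down-projection
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # key up-projection
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # value up-projection

    def forward(self, h: torch.Tensor):
        # h: (batch, seq, d_model)
        c_kv = self.down(h)   # (batch, seq, d_latent) -- this is all that gets cached
        k = self.up_k(c_kv)   # keys reconstructed from the latent
        v = self.up_v(c_kv)   # values reconstructed from the latent
        return c_kv, k, v

x = torch.randn(1, 8, 4096)
cache, k, v = LowRankKV()(x)
# The KV cache stores 512 floats per token instead of 2 * 32 * 128 = 8192.
```

The payoff is the last comment: per token, the cache holds one small latent vector instead of full per-head keys and values, which is where the memory and speed gains come from.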
DeepSeekMoE: Smarter Expert Allocation for Cost-Effective Training
Problem:
Traditional Mixture-of-Experts (MoE) architectures, such as GShard, suffer from inefficiencies due to redundant computations and unbalanced workloads. As a result, they require excessive computing power to maintain performance.
Existing Solutions:
Most MoE architectures activate a fixed number of experts per token, leading to repeated computations even when some experts are unnecessary. This results in higher training costs and inefficient use of resources.
DeepSeek Innovation:
DeepSeekMoE introduces a two-part optimization:
- Fine-grained expert segmentation: Experts are isolated into smaller, specialized units, allowing for more accurate knowledge acquisition with less computation.
- Shared vs. Isolated Experts:
  - Shared Experts: Always active, handling generalization tasks.
  - Isolated Experts: Activated only when needed, reducing redundant calculations.
Key benefits of DeepSeekMoE (see the toy sketch after this list):
- Enables higher expert specialization, improving model efficiency.
- Eliminates redundant computations, lowering computational costs.
- Trains stronger models using fewer resources, making large-scale AI more economical.
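A toy sketch of the shared-plus-routed expert layout (the dimensions, expert counts, and per-token loop are illustrative; the real implementation adds load-balancing terms, capacity handling, and batched expert dispatch):

```python
import torch
import torch.nn as nn

class ToyDeepSeekMoE(nn.Module):
    """Shared experts always run; a few fine-grained experts are routed per token."""

    def __init__(self, d=1024, n_shared=2, n_routed=64, top_k=6):
        super().__init__()
        self.shared = nn.ModuleList(nn.Linear(d, d) for _ in range(n_shared))
        self.routed = nn.ModuleList(nn.Linear(d, d) for _ in range(n_routed))
        self.gate = nn.Linear(d, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d)
        out = sum(e(x) for e in self.shared)    # shared experts: generalization
        scores = self.gate(x).softmax(dim=-1)   # affinity of each token to each expert
        topv, topi = scores.topk(self.top_k, dim=-1)
        for t in range(x.size(0)):              # routed experts: specialization
            for w, i in zip(topv[t], topi[t]):
                out[t] = out[t] + w * self.routed[int(i)](x[t])
        return out

y = ToyDeepSeekMoE()(torch.randn(4, 1024))
```

Splitting experts into many small routed units plus a few always-on shared units is what lets each token touch only the knowledge it actually needs, instead of re-running large general-purpose experts for every token.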
Device-Limited Routing: Reducing Communication Overhead
Problem:
When expert parallelism is used in MoE models, experts are distributed across multiple devices. However, as the number of devices increases, so does the communication overhead, making training slower and less efficient.
Existing Solutions:
Most MoE architectures allow each token to be routed to any available expert, even if that expert is on a different device. This increases cross-device communication costs, leading to latency issues and inefficient scaling.
DeepSeek Innovation:
DeepSeek-V2 introduces Device-Limited Routing, which restricts the number of devices a token can interact with. By limiting communication to a set number of devices, DeepSeek controls communication overhead and balances workloads more effectively.
Similarly, DeepSeek-V3 refines this approach with a restricted routing mechanism called Node-Limited Routing that ensures each token is sent to at most M nodes. These nodes are chosen based on the highest affinity scores of the experts on each node, ensuring only the most relevant experts are selected.
Key benefits of Device-Limited Routing:
- Prevents excessive communication overhead, making expert parallelism more scalable.
- Reduces latency, leading to faster inference and training times.
- Ensures efficient workload distribution, improving hardware utilization.
- Optimizes expert selection, allowing only high-affinity nodes to be utilized, further reducing inefficiencies.
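A simplified sketch of the device-limited (or node-limited) routing rule: first pick the M devices with the highest expert-affinity scores, then select the top-K experts only from those devices. The layout, device count, and use of the per-device maximum as the device score are illustrative assumptions, not the exact formulation in the papers:

```python
import torch

def device_limited_topk(scores: torch.Tensor, experts_per_device: int,
                        max_devices: int = 3, top_k: int = 6) -> torch.Tensor:
    """Pick top-k experts for one token, restricted to the best `max_devices` devices.

    `scores` has shape (num_experts,); experts are laid out contiguously per
    device, e.g. experts [0..7] on device 0, [8..15] on device 1, and so on.
    """
    per_device = scores.view(-1, experts_per_device)      # (devices, experts/device)
    device_affinity = per_device.max(dim=-1).values       # best expert on each device
    allowed = device_affinity.topk(max_devices).indices   # the M devices to talk to
    mask = torch.full_like(scores, float("-inf")).view(-1, experts_per_device)
    mask[allowed] = 0.0                                    # unmask allowed devices only
    masked_scores = scores + mask.view(-1)                 # forbid all other devices
    return masked_scores.topk(top_k).indices               # final expert choice

scores = torch.rand(64)                                    # 8 devices x 8 experts each
print(device_limited_topk(scores, experts_per_device=8))
```

The key property is visible in the mask: however many experts a token uses, its activations only ever cross to at most M devices, which caps the all-to-all communication volume.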
DeepSeek-V2 demonstrates that AI models can achieve high performance while significantly reducing training costs. By optimizing resource allocation and minimizing inefficiencies, it sets a new standard for cost-effective and scalable large-scale model training.
3. DeepSeek-V3: Infrastructure Breakthroughs for Cost-Efficient AI
📌 DeepSeek-V3 Technical Report Published: December 27, 2024
DeepSeek-V3 adopts the architecture of DeepSeek-V2 and serves as the foundation for DeepSeek R1, bringing groundbreaking efficiency improvements in the infrastructure.
They introduced a highly optimized training framework called HAI-LLM, designed from the ground up to tackle the inefficiencies of large-scale AI training. By minimizing communication overhead, optimizing memory usage, and leveraging mixed precision training, DeepSeek-V3 achieves exceptional cost-efficiency while maintaining top-tier performance.
HAI-LLM Training Framework (Built From Scratch):
The HAI-LLM framework, developed specifically for DeepSeek-V3, is engineered to address one of the biggest challenges in large-scale AI training: cross-node communication overhead in expert parallelism. Without optimization, this results in a roughly 1:1 computation-to-communication ratio, significantly slowing down training.
To solve this, DeepSeek-V3 introduces DualPipe, an innovative pipeline parallelism algorithm that improves how computations and communications are handled.
DualPipe Algorithm for Pipeline Parallelism: A Game Changer in Training Efficiency
DualPipe is one of the most crucial advancements in DeepSeek-V3, significantly improving training efficiency by minimizing communication overhead and maximizing computation-communication overlap. Compared to conventional pipeline scheduling methods, DualPipe introduces key optimizations that reduce idle time and enhance parallel processing.
Overlapping Computation and Communication
One of the key innovations in DualPipe is its ability to process forward and backward passes simultaneously, avoiding traditional bottlenecks.

Overlapping strategy for a pair of forward and backward chunks (the boundaries of the transformer blocks are not aligned). Orange denotes forward, green denotes “backward for input”, blue denotes “backward for weights”, purple denotes PP communication, and red denotes barriers. Both all-to-all and PP communication can be fully hidden.³
It achieves this through:
- Bidirectional pipeline scheduling, where micro-batches are fed from both ends of the pipeline to ensure all compute units remain active.
- Splitting backward computations into two stages (input gradients and weight gradients), similar to ZeroBubble, reducing latency and improving training efficiency.
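To illustrate the second point, here is a small PyTorch example of the ZeroBubble-style split that DualPipe builds on, decoupling “backward for input” from “backward for weights” so the two can be scheduled independently (a simplified single-layer example, not the actual DualPipe scheduler):

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 512)
x = torch.randn(8, 512, requires_grad=True)
loss = layer(x).pow(2).mean()

# Stage 1: "backward for input" -- compute only dL/dx so it can be sent
# upstream immediately, unblocking the previous pipeline stage.
(grad_x,) = torch.autograd.grad(loss, x, retain_graph=True)

# Stage 2: "backward for weights" -- compute dL/dW later, whenever the
# schedule has a free slot, since no other pipeline stage is waiting for it.
grad_w, grad_b = torch.autograd.grad(loss, (layer.weight, layer.bias))

print(grad_x.shape, grad_w.shape, grad_b.shape)
```

Because only the input gradients sit on the critical path between pipeline stages, deferring the weight-gradient work is what lets the scheduler fill bubbles that 1F1B leaves empty.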
Validation Through Simulation
According to Table 2, DualPipe significantly reduces the bubble rate while only increasing the peak activation memory by 1/PP times. Although it requires keeping two copies of the model parameters, the authors state this does not significantly increase memory consumption, since they use a large (64-way) Expert Parallelism during training.

Comparison of pipeline bubbles and memory usage across different pipeline-parallel methods. F denotes the execution time of a forward chunk, B denotes the execution time of a full backward chunk, W denotes the execution time of a “backward for weights” chunk, and F&B denotes the execution time of two mutually overlapped forward and backward chunks.³
To validate these claims, I executed simulations using the Hugging Face Playground provided by Sea AI Lab (access here). The results demonstrate how DualPipe optimizes computation-communication overlap, minimizes pipeline bubbles, and maintains efficient memory usage, confirming the claimed benefits.

Figure 4 illustrates 1F1B pipeline scheduling. The forward pass (F) in blue is followed by the backward pass (B) in light blue, with weight gradients (W) in green computed afterward. Each device (PP rank) executes tasks sequentially, leading to significant idle time at the beginning and end of execution.
Key observations:
- Pipeline bubbles are significant, caused by sequential forward and backward execution.
- Idle time is high, with staggered initialization and delayed completion.
- Computation-communication overlap is poor, leading to inefficiencies.
Limitations:
- Hardware utilization is inefficient.
- Training time increases due to sequential scheduling.

Figure 5 illustrates ZB1P scheduling. Unlike 1F1B, ZB1P overlaps backward computation (B) with forward passes (F) to reduce idle time. The weight gradient computations (W) appear earlier, meaning weight updates start sooner.
Key observations:
- Forward and backward passes overlap more efficiently, reducing idle time.
- Compute resource utilization improves compared to 1F1B.
- Some idle periods remain, indicating that inefficiencies still exist.
Limitations:
- While more efficient than 1F1B, ZB1P does not fully eliminate pipeline bubbles.
- Some idle time remains due to incomplete overlap of forward and backward computations.

Example DualPipe scheduling. The micro-batches in the reverse direction are symmetric to those in the forward direction, so their batch IDs are omitted for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication.³
Figure 6 illustrates DualPipe, the most optimized scheduling method. Forward (yellow) and backward (green) computations are fully overlapped, eliminating unnecessary waiting. Backward for input (light blue) and backward for weights (dark blue) are separated, reducing delays. Overlapped forward & backward sections (orange) indicate near-perfect utilization.
Key observations:
- Pipeline bubbles are minimal, allowing continuous computation.
- Bidirectional scheduling is implemented, keeping all devices active.
- Workload is evenly distributed across ranks, improving overall training efficiency.
Advantages over 1F1B and ZB1P:
- Eliminates nearly all pipeline bubbles and enables fine-grained expert parallelism across nodes, ensuring near-zero all-to-all communication overhead.
- Maximizes GPU utilization by reducing idle time.
- Backward computations (B & W) are split, further optimizing resource allocation.
- Achieves high memory efficiency, making it more scalable across multiple devices.
Libraries like vLLM already leverage similar techniques to enhance inference throughput, but DeepSeek-V3 takes this further by applying similar optimizations to training. By incorporating these efficiency-driven methods, DeepSeek-V3 enables far more cost-effective model scaling while maintaining high computational efficiency. This approach is particularly crucial in Mixture-of-Experts (MoE) architectures, where reducing communication overhead and optimizing resource allocation can significantly lower overall training expenses.
How does this compare to NVIDIA’s large-scale training framework, Megatron-LM?
Megatron still relies on 1F1B scheduling as its core pipeline strategy (reference⁸). However, 1F1B suffers from inefficiencies, leading to higher communication overhead and unnecessary computational costs, which ultimately slow down training performance.
More efficient scheduling alternatives exist. Research from Sea AI Lab highlights superior methods like ZeroBubble, which outperforms 1F1B by up to 23% in throughput under similar memory constraints. When memory constraints are relaxed, this performance gain can be further pushed to 31%, demonstrating that newer scheduling techniques offer significant improvements in both efficiency and scalability⁵.
By adopting DualPipe, DeepSeek-V3 effectively surpasses the limitations of 1F1B-based scheduling, ensuring better throughput, lower communication overhead, and optimized resource utilization — making it a far more cost-effective training framework compared to NVIDIA’s Megatron.
Optimized Cross-Node Communication
DeepSeek-V3 enhances efficiency with high-performance cross-node communication kernels, leveraging InfiniBand (IB) and NVLink. With NVLink’s 160 GB/s bandwidth (3.2× faster than IB’s 50 GB/s), DeepSeek-V3 limits each token to 4 nodes, reducing IB traffic while maintaining accuracy. Tokens are first transmitted via IB, then forwarded through NVLink to the GPUs hosting their expert computations, ensuring seamless communication overlap. This allows expert routing to scale from 8 to 13 experts per token without increasing communication costs, eliminating bottlenecks and making MoE training more scalable and cost-efficient.
FP8 Mixed Precision Training
DeepSeek-V3 implements FP8 mixed precision training to significantly enhance computational speed while reducing memory consumption. By strategically using FP8 for most compute-intensive operations and retaining higher precision where necessary, this method doubles computational speed compared to BF16 while maintaining numerical stability. The overall framework is illustrated in Figure 7.

The overall mixed-precision framework with the FP8 data format; for clarity, only the Linear operator is illustrated.³
A major challenge in FP8-based training is activation outliers, which can cause:
- Overflows (values too large), leading to instability in weight updates.
- Underflows (values too small), resulting in a loss of numerical accuracy.
These issues arise due to FP8’s reduced exponent range, making it less tolerant to extreme values. Without proper handling, this can degrade model performance and introduce instability during training.
To mitigate these challenges, DeepSeek-V3 adopts a mixed precision framework, where:
- Most compute-intensive operations are conducted in FP8, maximizing speed and efficiency.
- Key operations are retained in higher precision (BF16 or FP32) to ensure numerical stability.
This approach allows for aggressive low-precision optimization while preserving accuracy. Notably, compared to BF16-based training, this framework reduces memory consumption significantly while keeping the relative loss error consistently below 0.25%, well within the acceptable range of training randomness.
Despite the efficiency of FP8, certain operators require higher precision due to their sensitivity to low-precision computations. To maintain stability without unnecessary overhead, DeepSeek-V3:
- Uses FP32 for critical operations, such as embedding layers, MoE gating, normalization, and attention.
- Maintains BF16 for low-cost operators where minimal precision loss is acceptable.
By selectively applying higher precision where needed, DeepSeek-V3 ensures stable training while minimizing computational and memory overhead.
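To give a feel for the idea, here is a toy “fake FP8” sketch in PyTorch: values are scaled into the FP8-representable range, cast to 8 bits and back, and the resulting error on a matmul is measured while master weights stay in full precision. This only simulates the numerical effect and is not DeepSeek’s kernel-level implementation (which uses fine-grained tile/block-wise scaling and real FP8 GEMMs); the `float8_e4m3fn` dtype requires a recent PyTorch:

```python
import torch

FP8_MAX = 448.0  # largest normal value in the E4M3 format

def quantize_dequantize_fp8(t: torch.Tensor) -> torch.Tensor:
    """Per-tensor scaling into the FP8 range, cast to FP8 and back.

    Real FP8 training keeps the 8-bit values and feeds them to FP8 GEMM
    kernels; here we cast back to float32 just to measure the precision loss.
    """
    scale = FP8_MAX / t.abs().max().clamp(min=1e-12)
    q = (t * scale).to(torch.float8_e4m3fn)   # precision is lost in this cast
    return q.to(torch.float32) / scale        # dequantize for comparison

w = torch.randn(1024, 1024)                   # master weights stay in full precision
x = torch.randn(32, 1024)
y_ref = x @ w.t()                                              # high-precision GEMM
y_fp8 = quantize_dequantize_fp8(x) @ quantize_dequantize_fp8(w).t()
rel_err = (y_fp8 - y_ref).norm() / y_ref.norm()
print(f"relative error from FP8 quantization: {rel_err.item():.4f}")
```

The scaling step is exactly where activation outliers bite: a single extreme value inflates the scale and squeezes everything else toward underflow, which is why DeepSeek-V3 scales at a much finer granularity than the per-tensor scheme shown here.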
Multi-Token Prediction
One notable innovation in DeepSeek-V3 is Multi-Token Prediction (MTP). Instead of generating one token at a time, DeepSeek-V3 predicts two tokens in parallel, significantly improving inference speed. The second token prediction maintains 85–90% accuracy, proving that this approach preserves model quality while nearly doubling efficiency. This results in a 1.8× increase in tokens generated per second. Additionally, the MTP modules can be discarded, allowing the model to function independently and normally.
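A hand-wavy sketch of the multi-token prediction idea: an extra lightweight head predicts the token after next, giving a draft that can be accepted during decoding when the main model agrees. The module below is illustrative only; DeepSeek-V3’s actual MTP module chains a full additional transformer block rather than the small projection used here:

```python
import torch
import torch.nn as nn

class ToyMTPHead(nn.Module):
    """Predicts token t+2 from the hidden state at position t (one extra depth)."""

    def __init__(self, d_model=1024, vocab=32000):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, hidden):                 # hidden: (batch, seq, d_model)
        return self.head(torch.relu(self.proj(hidden)))

hidden = torch.randn(1, 16, 1024)              # hidden states from the main model
draft_logits = ToyMTPHead()(hidden)            # logits for the token after next

# Training: add a second cross-entropy loss on targets shifted by two positions.
# Inference: the drafted second token is kept only if the main model confirms it
# (~85-90% of the time per the paper), saving a decoding step when it does.
# The MTP head can also simply be dropped, leaving a normal next-token model.
```

This is why the 1.8× tokens-per-second figure is conditional on the draft acceptance rate: every accepted second token saves one full forward pass.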
4. DeepSeek R1: Optimizing Reinforcement Learning for Efficiency
📌 DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Published: January 22, 2025
DeepSeek-R1 builds on DeepSeek-V3-Base (the pre-trained version) but introduces a major cost-saving innovation: eliminating Supervised Fine-Tuning (SFT) as a preliminary step for Reinforcement Learning (RL). This approach reduces training costs while maintaining model efficiency. GRPO is used again to further optimize reinforcement learning costs, allowing the model to efficiently learn reasoning tasks without excessive fine-tuning.
Towards the end of Yannic Kilcher’s DeepSeekMath explanation, he points out that the model already possesses the underlying capability; Reinforcement Learning (RL) and even Supervised Fine-Tuning (SFT) effectively narrow down the available pathways, ultimately converging on the probability distribution we aim to achieve.
This also reinforces the idea that pre-training already provides the capabilities to the model, and fine-tuning serves to expose and optimize them rather than create them from scratch.
Training Cost Breakdown

Table 3 originally belongs to DeepSeek-V3, but since post-training is significantly cheaper than pre-training, it is reasonable to assume that DeepSeek-R1 also cost around $5.6M. Keep in mind that many experiments were conducted; this cost applies only to a single training run.
5. Stacking Up All These Optimizations Makes the Results Astounding
By implementing all these architectural and training optimizations, DeepSeek-V3 achieves an unparalleled level of cost-efficiency, significantly outperforming Meta’s Llama 3.1 405B model, as shown in Table 4:

*The H100 GPU is significantly more powerful than the H800 and could be more expensive to rent or purchase.
- Over 11x lower GPU hours compared to Llama 3.1 405B, despite having 66% more parameters (671B vs. 405B).
- Only $5.576M estimated training cost, a nearly 90% cost reduction compared to Llama 3.1’s estimated $61.68M+.
- Massive energy savings: DeepSeek-V3 requires an estimated 0.975 GWh of energy vs. Llama 3.1’s 21.6 GWh, cutting energy consumption by over 22x.
💡 Key Takeaways: How Did DeepSeek Achieve Cost-Effective Training?
1️⃣ Smarter Data Extraction & Filtering: DeepSeekMath proved that better data selection beats sheer volume.
2️⃣ Optimized Model Architecture: MLA, DeepSeekMoE, and Device-Limited/Node-Limited Routing significantly improve efficiency.
3️⃣ Advanced Training Techniques: FP8 Mixed precision, DualPipe parallelism, and GRPO enable cost-effective training at scale.
4️⃣ Elimination of Unnecessary Computation: Models inherently acquire their core capabilities during pre-training, while reinforcement learning (RL) and supervised fine-tuning (SFT) primarily serve to adjust and amplify the desired probability distribution.
DeepSeek’s approach isn’t just about making training cheaper; it’s about systematic architectural and computational improvements that set new standards for efficiency.
There’s still more to explore in their work, but this provides a high-level technical summary of what makes their methods so powerful.
🧠 What’s Next?
These breakthroughs raise important questions:
💭 Will NVIDIA adopt similar techniques in future Megatron frameworks?
💭 What’s the next big leap in cost-effective model scaling?
💭 Will startups and smaller companies finally compete with AI giants using DeepSeek’s breakthroughs?
Let’s discuss in the comments! 🚀
References:
- Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., & Guo, D. (2024, April 27). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv.org. https://arxiv.org/abs/2402.03300
- DeepSeek-AI, Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Ji, D., Li, E., Lin, F., Luo, F., Hao, G., Chen, G., … Xie, Z. (2024, June 19). Deepseek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv.org. https://arxiv.org/abs/2405.04434
- DeepSeek-AI, Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., … Pan, Z. (2024, December 27). Deepseek-V3 Technical Report. arXiv.org. https://arxiv.org/abs/2412.19437
- Sea AI Lab. (n.d.). Zero Bubble Pipeline Parallelism — a Hugging Face Space by sail. https://huggingface.co/spaces/sail/zero-bubble-pipeline-parallellism
- Qi, P., Wan, X., Huang, G., & Lin, M. (2023, November 30). Zero bubble pipeline parallelism. arXiv.org. https://arxiv.org/abs/2401.10241
- Meta. (n.d.-a). Meta / llama-3.1–405b-instruct. NIM. https://docs.api.nvidia.com/nim/reference/meta-llama-3_1-405b
- Kilcher, Y. (2025, January 26). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Paper Explained). YouTube. https://www.youtube.com/watch?v=bAWV_yrqx4w
- Nvidia. (n.d.). Megatron-LM/megatron/core/pipeline_parallel/schedules.py at d5069b8ebdc59445f0baeadb65dd8614706b789e · NVIDIA/megatron-LM. GitHub. https://github.com/NVIDIA/Megatron-LM/blob/d5069b8ebdc59445f0baeadb65dd8614706b789e/megatron/core/pipeline_parallel/schedules.py#L560