Model Architecture Behind DeepSeek R1

In our earlier introductory post, we discussed the critical innovations and emphasis on scaling that have propelled advancements in recent years, as shown by the development of OpenAI models.

Fig 1: Progression of OpenAI models driven by scaling and some key innovations

As predicted by the scaling laws, the performance of these models has shown steady growth. GPT-4 now outperforms the average test taker on the Multistate Bar Exam (MBE), a rigorous series of tests required for practicing law in the United States.


Fig 2: Performance of GPT models on The Multistate Bar Exam (MBE), which is a challenging battery of tests designed to evaluate an applicant’s legal knowledge and skills, and is a precondition to practice law in the US. Source: GPT-4 Passes the Bar Exam

Other model families like Google’s Gemini, Meta’s Llama, and Mistral AI’s Mistral follow a very similar trend. However, the DeepSeek team was compelled to shift their focus due to US trade restrictions. Instead of scaling, they concentrated on model architecture, training methodologies, and the training framework. Over 18 months, they developed numerous improvements in each area, which they published in a series of four papers.

Fig 3: DeepSeek’s series of papers: DeepSeek-LLM, DeepSeek-V2, DeepSeek-V3 and DeepSeek-R1

Let us now dive into the actual improvements, starting with model architecture.

DeepSeek-MoE

MoE Architecture And Challenges

The Mixture of Experts (MoE) architecture, which has been around for a while and is used by other teams as well, breaks the single large feedforward network in the transformer block into multiple smaller feedforward networks called experts. For any given token, a router component activates only a subset of these experts, reducing the number of FLOPs done per token.

However, this approach presents scaling and performance challenges:

  • Routing Collapse: The router may learn to activate only a small subset of experts for all tokens, creating an imbalance. This is typically addressed with auxiliary-loss based load balancing, which adds a penalty to the loss function for over- or under-utilization of an expert (a toy sketch of top-k routing with this auxiliary loss follows this list). However, because the penalty is part of the loss function, it affects the weights the model learns, degrading its performance.

  • Communication Overhead: In distributed training, where experts are spread across devices, the constant all-to-all exchange of tokens between devices leads to significant communication overhead.

  • Generalization Issues: The specialized nature of MoE experts can hinder the model’s ability to generalize across different tasks.
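
To make the routing mechanics and the auxiliary-loss fix concrete, here is a toy PyTorch sketch of a top-k MoE layer. It is illustrative only, not DeepSeek’s implementation; names such as `ToyMoELayer`, `num_experts`, `top_k`, and `aux_loss_weight` are my own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy top-k MoE layer with a Switch-style auxiliary load-balancing loss."""

    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2, aux_loss_weight=0.01):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k
        self.aux_loss_weight = aux_loss_weight

    def forward(self, x):                                    # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)            # token-to-expert affinities
        top_p, top_idx = probs.topk(self.top_k, dim=-1)      # activate only the top-k experts

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel():                            # tokens routed to expert e
                out[token_ids] += top_p[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])

        # Auxiliary load-balancing loss: penalizes mismatch between the fraction of tokens
        # each expert receives and the router's mean probability for that expert.
        frac = torch.zeros(len(self.experts), device=x.device)
        frac.scatter_add_(0, top_idx.flatten(), torch.ones(top_idx.numel(), device=x.device))
        frac = frac / top_idx.numel()
        aux_loss = self.aux_loss_weight * len(self.experts) * (frac * probs.mean(dim=0)).sum()
        return out, aux_loss
```

Because `aux_loss` is added to the training objective, the gradient that balances the experts also tugs on the router and expert weights, which is exactly the performance trade-off described above.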

DeepSeekMoE Innovations

Fig 4: DeepSeekMoE from DeepSeek-V3 paper, where the typical single feedforward network of the Transformer block (second yellow box on the left) has been split into multiple feedforward networks, so called experts (smaller blue and green boxes on the right). The yellow Router on the right is a gating network that learns to selectively activate a subset of experts for each token.

The DeepSeekMoE architecture, shown above, incorporates several enhancements to address the challenges posed by training MoE models.

  • Auxiliary-Loss-Free Load Balancing: To counter routing collapse, a dynamic per-expert bias term was introduced. The bias is added to the token-to-expert affinity scores only when selecting which experts to route a token to; it is adjusted based on each expert’s load rather than learned through the loss function, so it never affects the model’s weight updates. This strategy prevents routing collapse without compromising performance (a sketch of the bias-based idea follows this list).

  • Shared Experts: The architecture also includes shared experts that learn from all tokens. This reduces redundancy among specialized experts, promoting efficiency and enhancing the model’s generalization capabilities across diverse tasks.

  • Framework improvements: Additionally, substantial improvements were made to the model training framework to mitigate communication overhead and reduce compute cost. These include:

    • FP8 Mixed Precision: Matrix multiplications utilize the FP8 data format, halving memory usage and data transfer relative to BF16. Block-wise scaling and periodic “promotion” to FP32 after brief accumulation intervals prevent numeric overflow/underflow errors, maintaining numerical stability despite the reduced numeric range of FP8.

    • DualPipe Parallelism: This technique overlaps forward and backward computation with the MoE all-to-all dispatch, optimizing network communication, especially across InfiniBand.

    • PTX-Level & Warp Specialization: Warp-level instructions in PTX were fine-tuned, and the chunk size for all-to-all dispatch was auto-tuned to fully leverage InfiniBand and NVLink. Additionally, the allocation of streaming multiprocessors (SMs) and warps to communication versus compute tasks was adjusted. These optimizations ensure that communication does not impede computation.
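
To contrast with the auxiliary-loss approach sketched earlier, here is a minimal sketch of the bias-based routing idea, assuming a simple sign-based update rule. The function name `route_with_bias` and the constant `bias_update_speed` are illustrative, not taken from DeepSeek’s code.

```python
import torch

def route_with_bias(affinity, bias, top_k=2, bias_update_speed=0.001):
    """
    Auxiliary-loss-free routing sketch.
    affinity: (num_tokens, num_experts) token-to-expert affinity scores.
    bias:     (num_experts,) routing-only bias; it is never trained by gradient descent.
    Returns selected expert indices, gating weights, and the updated bias.
    """
    # The bias influences only expert *selection* ...
    _, top_idx = (affinity + bias).topk(top_k, dim=-1)
    # ... while the gating weights applied to expert outputs use the raw affinities.
    gate = torch.gather(affinity.softmax(dim=-1), 1, top_idx)
    gate = gate / gate.sum(dim=-1, keepdim=True)

    # Outside the gradient path, nudge the bias: overloaded experts are pushed down,
    # underloaded experts are pulled up, so load balances without an auxiliary loss term.
    load = torch.bincount(top_idx.flatten(), minlength=bias.numel()).float()
    bias = bias + bias_update_speed * torch.sign(load.mean() - load)
    return top_idx, gate, bias
```

Because the bias never enters the loss, balancing the experts does not distort the weights the model learns, which is the whole point of the technique.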

Multi-Head Latent Attention (MLA)

The transformer’s multi-head attention computes Query, Key, and Value vectors whose per-head size is typically the embedding dimension divided by the number of attention heads. MLA uses down-projection matrices to fold these vectors into much smaller “latent” vectors, and only the compact latents are cached. This reduces both the FLOPs done per token and the size of the KV cache.

In order to avoid any performance drop, the team also implemented:

  • Dynamic Low-Rank Projection: MLA adjusts the compression strength for Key/Value vectors based on sequence length.

  • Adaptive Query Compression: Adaptive scaling of the query at different layer depths. Early layers maintain expressiveness with higher-dimensional queries, while deeper layers compress more aggressively.

  • Joint KV Storage: Shared KV storage further reduces memory traffic during multi-node inference.

  • Layer-Wise Adaptive Cache: Instead of caching all past tokens for all layers, V3 prunes older KV entries at deeper layers to manage memory usage with 128K context windows.

The architecture diagram for MLA, showing the compression of queries, keys, and values using down projection matrices, is displayed below.

Fig 5: Multi-Head Latent Attention from DeepSeek-V3 paper, where down projection matrices are used to compress Query, Key and Value vectors
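
As a rough illustration of the low-rank idea (ignoring rotary position embeddings and the per-head details of full MLA), the sketch below caches only a small shared latent and reconstructs keys and values from it. The module name and dimensions such as `d_latent` are assumptions.

```python
import torch
import torch.nn as nn

class LowRankKVCompression(nn.Module):
    """Sketch of the latent-KV idea behind MLA: cache a small latent, not full K and V."""

    def __init__(self, d_model=4096, d_latent=512):
        super().__init__()
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)   # shared down-projection
        self.up_k = nn.Linear(d_latent, d_model, bias=False)      # up-projection for keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)      # up-projection for values

    def forward(self, hidden, kv_cache=None):
        # hidden: (batch, seq, d_model). Only d_latent values per token are cached,
        # instead of the 2 * d_model values needed for full K and V.
        latent = self.down_kv(hidden)                              # (batch, seq, d_latent)
        kv_cache = latent if kv_cache is None else torch.cat([kv_cache, latent], dim=1)
        k = self.up_k(kv_cache)                                    # reconstruct full-width keys
        v = self.up_v(kv_cache)                                    # reconstruct full-width values
        return k, v, kv_cache
```

With the illustrative sizes above, the cache per token shrinks from 2 × 4096 values to 512, a 16x reduction, at the cost of the two small up-projections at attention time.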

Multi-Token Prediction

The DeepSeek team enhanced the architecture with MTP modules that let the model predict multiple future tokens at each position instead of only the next one. Looking further ahead in the sequence gives the model a denser training signal and a deeper understanding of context, helping it generate more coherent sequences. Compared to traditional single-token prediction, this significantly improves both efficiency and performance. The diagram below illustrates an MTP module alongside the main model.

Fig 6: Multi-Token Prediction from DeepSeek-V3 paper, where more than one future token is predicted simultaneously. The figure shows tokens t2 and t3 being predicted along with t1.
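
To show the shape of the training objective, here is a toy sketch in which extra prediction heads look one more token ahead and their losses are added to the main next-token loss. DeepSeek’s actual MTP modules contain full transformer blocks with shared embedding and output layers, so treat `ToyMTPHead` and its dimensions purely as illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMTPHead(nn.Module):
    """Sketch: a next-token head plus extra heads that predict further ahead."""

    def __init__(self, d_model=256, vocab_size=32000, extra_depths=1):
        super().__init__()
        self.main_head = nn.Linear(d_model, vocab_size)
        self.mtp_heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(extra_depths))

    def loss(self, hidden, tokens):
        # hidden: (batch, seq, d_model) trunk outputs; tokens: (batch, seq) input ids.
        # Main objective: position i predicts token i+1.
        logits = self.main_head(hidden[:, :-1])
        total = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
        # Extra objectives: position i also predicts tokens i+2, i+3, ...,
        # densifying the training signal at each position.
        for depth, head in enumerate(self.mtp_heads, start=2):
            logits_d = head(hidden[:, :-depth])
            total = total + F.cross_entropy(
                logits_d.reshape(-1, logits_d.size(-1)), tokens[:, depth:].reshape(-1))
        return total
```

At inference time the extra predictions can be discarded, or reused to propose draft tokens for speculative decoding, which is how the technique also speeds up generation.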

Conclusion

This concludes this section. We explored three primary architectural patterns that the DeepSeek team adapted and enhanced to develop the DeepSeek-R1 model: DeepSeekMoE, Multi-Head Latent Attention, and Multi-Token Prediction. We also reviewed various improvements made to the training framework to accommodate the architectural changes. In the next and final post of this series, we will examine the changes made to the model training methodology.

Finally, I have a shameless plug: I will be leading a panel discussion on foundation models and, of course, DeepSeek-R1 at the engineering summit on March 28 in Palo Alto, California. A group of very enthusiastic Founders Creative volunteers is organizing the event. We have a very interesting lineup of speakers and panels. If you are in the San Francisco Bay Area around that time, you should take a look.


What Is Behind The DeepSeek Hype?

DeepSeek R1, a new open-source large language model released by a Chinese startup in January 2025, has garnered significant attention. Developed on a tight budget and timeline, it rivals the industry’s top LLMs. While concerns persist around security, safety, veracity, and accuracy, the technical innovations behind the model are undeniable. This three-part series will focus exclusively on exploring these technical advancements.

First Some Context

To better understand DeepSeek’s innovations and contributions, it is crucial to first examine the broader context surrounding the rapid rise of Large Language Models (LLMs) and current industry practices and trends. A long list of innovations in deep learning and NLP research over the past few decades paved the way for the GenAI revolution. However, for the sake of time, we will keep our highly opinionated review short and focus on just a few recent innovations, starting with the transformer architecture.


Fig 1: A few key innovations behind the rise of ChatGPT and Generative AI

Transformer Architecture

The 2017 paper “Attention Is All You Need” introduced the Attention mechanism, which allows the model to focus on different parts of a sequence to effectively complete the task. For example, if the model is given the sentence “he walked to her home” and asked to classify the gender of the main character, the Attention mechanism would give more weight to the word “he”. The diagram below shows the encoder part of the encoder-decoder transformer architecture in that paper.

Fig 2: Encoder block of the transformer, showing the multi-head attention and the single feed forward network. (https://arxiv.org/pdf/1706.03762)
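
For readers who want to see the mechanism in code, here is the scaled dot-product attention computation that multi-head attention applies per head. This is a minimal sketch; recent PyTorch versions ship a fused version as torch.nn.functional.scaled_dot_product_attention.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Each query token takes a weighted average of the value vectors."""
    # q, k, v: (batch, heads, seq_len, d_head)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # pairwise token affinities
    weights = scores.softmax(dim=-1)                         # how much each token attends to the others
    return weights @ v                                       # weighted mix of values per query token
```

In the “he walked to her home” example, the row of `weights` for the classification position would put most of its mass on the column corresponding to “he”.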

Mixture of Experts (MoE) Architecture

The Mixture of Experts (MoE) architecture then emerged, creating a sparser network with specialized segments called experts. Rather than updating the parameters of one large feedforward network, only a subset of many smaller feedforward networks is updated for each token. This decreased the computational requirements during both training and inference, allowing larger networks to be trained on more data. The figure below shows how the feed forward network in the transformer’s encoder (highlighted in blue) was broken down into multiple smaller FFNs in MoE.

Fig 3: The encoder block of the original transformer on the left and that of MoE on the right. The dense feed forward network (FFN) layer on the left has been replaced with a sparse Switch FFN layer (light blue) on the right. The layer operates independently on the tokens in the sequence. The figure also shows two tokens (x1 = “More” and x2 = “Parameters”) being routed (solid lines) across four FFN experts, where the router independently routes each token. The Switch FFN layer returns the output of the selected FFN multiplied by the router gate value, shown by dotted lines. (https://arxiv.org/abs/2101.03961)
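
The routing shown in the figure can be sketched in a few lines. This is a toy top-1 (“Switch”-style) forward pass, not the paper’s implementation; the helper name `switch_ffn` is mine, and `router` and `experts` stand for any gating layer and list of expert FFNs.

```python
import torch

def switch_ffn(x, router, experts):
    """Toy Switch-style top-1 routing: one expert per token, output scaled by the gate value."""
    probs = router(x).softmax(dim=-1)            # x: (num_tokens, d_model); router maps to num_experts
    gate, expert_idx = probs.max(dim=-1)         # pick the single best expert for each token
    out = torch.stack([experts[int(i)](x[t]) for t, i in enumerate(expert_idx)])
    return gate.unsqueeze(-1) * out              # scale each expert output by its router gate value
```

The per-token Python loop is only for clarity; real implementations batch the tokens routed to each expert, which is precisely where the communication costs discussed later come from.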

Two Stage Training Pipeline – GPT, SFT and RLHF

The next big step was in the training methodology used for LLMs. The 2018 GPT paper introduced a two-stage training pipeline, consisting of unsupervised Generative Pre-Training (GPT) on massive datasets followed by Supervised Fine Tuning (SFT). In some sense, this paper marked the true birth of the generative AI revolution. The InstructGPT paper from 2022 then added a third step: Reinforcement Learning from Human Feedback (RLHF). This step uses human annotators to score the model’s outputs and then fine-tunes the model further, aligning it with human expectations.

Fig 4: The two-stage model training pipeline introduced by the Generative Pre-Training (GPT) paper (https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf), with the RLHF stage later added by the InstructGPT paper (https://arxiv.org/abs/2203.02155).

Chain of Thought

The focus then shifted to post-training or test-time improvements with prompt engineering, specifically Chain of Thought reasoning. Chain of thought prompting decomposes complex tasks into a series of logical sub-tasks, guiding the model to reason in a human-like manner. Finally, ChatGPT was introduced as a user-friendly chat interface.
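
A tiny illustration of the difference, using a made-up arithmetic question; the exact prompt wording is just an example.

```python
question = "A store had 23 apples, sold 9, then received a delivery of 15. How many apples are left?"

# Direct prompting asks for the answer immediately.
direct_prompt = f"{question}\nAnswer:"

# Chain-of-thought prompting asks the model to decompose the problem into steps first.
cot_prompt = (
    f"{question}\n"
    "Let's think step by step: first compute the apples left after the sale, "
    "then add the delivery, then state the final answer."
)
```

The second prompt nudges the model to produce intermediate reasoning before the answer, which is what tends to improve accuracy on multi-step tasks.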

Scaling Laws

The last stop on our tour is scaling laws. This line of work provided the framework that guided the industry’s focus, first on pre-training scale, then on post-training, and finally on test-time compute.

The 2020 publication Scaling Laws for Neural Language Models introduced the concept that LLM performance improves with increases in model size, dataset size, and compute used for training. This spurred a focus on scaling:

  • Initially, pre-training scaling was the focus, with efforts to increase the size of models, pre-training datasets, and compute clusters.

  • When the limits of internet data were reached, the focus shifted to post-training scaling, using Reinforcement Learning from Human Feedback (RLHF) and Supervised Fine Tuning (SFT). High-quality, human-annotated, task-specific data for fine tuning became a key differentiator for many companies.

  • As post-training gains diminished, test-time scaling came into focus, using prompt engineering and Chain of Thought reasoning.
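
Going back to the 2020 paper for a moment, the parameter-count term of the scaling law has a simple power-law form. The sketch below uses the approximate constants reported by Kaplan et al. (alpha_N of roughly 0.076 and N_c of roughly 8.8e13 non-embedding parameters); treat them as indicative rather than exact.

```python
def kaplan_loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Approximate Kaplan et al. (2020) form: L(N) = (N_c / N) ** alpha_N,
    with data and compute treated as non-limiting. Constants are indicative."""
    return (n_c / n_params) ** alpha_n

# Doubling parameters lowers the predicted loss by a constant factor of 2 ** -0.076, roughly 5%.
print(kaplan_loss_vs_params(1e9), kaplan_loss_vs_params(2e9))
```

The flat, predictable exponent is exactly why simply making models bigger looked like the surest path to better performance for several years.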

Enter DeepSeek

The previous section highlighted that the industry was more focused on scaling these past few years, rather than model architecture or training methodologies. This is changing with DeepSeek’s announcement of its R1 model on January 20, 2025. US export restrictions on China seemingly limited scaling as an option to improve LLMs, forcing innovation across multiple aspects of model building, which can be categorized into three areas.

Model Architecture

  • DeepSeekMoE with Auxiliary-Loss-Free Load Balancing

    • Only 37 billion parameters activated out of 671 billion for each token

  • Multi-Head Latent Attention (MLA)

    • Compressing key/value vectors using down-projection and up-projection matrices for more optimal memory and compute use

  • Multi-Token Prediction Training Objective

    • Predicting more than 1 token at a time, again optimizing compute

    • Powers speculative decoding to speed up inference

Training Methodology

  • Direct Reinforcement Learning on the Base Model

    • No supervised fine tuning (SFT)

    • Surfaced emergent CoT behaviors such as self-verification and reflection

  • New Group Relative Policy Optimization (GRPO)

    • Estimates the advantage baseline from group scores instead of using a critic model

    • Rule-based rewards combining accuracy and format rewards

  • New Four Stage Training Pipeline for Final Model

    • Cold start data

    • Reasoning oriented RL

    • Rejection sampling and SFT

    • RL for all scenarios for alignment

  • Distillation of Reasoning Patterns

    • Data generated by DeepSeek-R1 used to fine-tune smaller dense models like Qwen and Llama

Training Framework

  • FP8 Mixed Precision Training Framework

    • Previous quantization approaches converted weights from FP32 to FP8 after model training; here FP8 is used during training itself

  • DualPipe Algorithm Pipeline Parallelism

    • Bidirectional pipeline scheduling and overlapping communication with computation

    • Reduces pipeline bubbles and communication overhead introduced by cross-node expert parallelism

    • Near-zero communication overhead while scaling the model and employing fine-grained experts across nodes

  • Better Utilization of InfiniBand and NVLink Bandwidths

    • Improved cross-node communication

    • 20 of the 132 streaming multiprocessors (SMs) on each H800 GPU dedicated to managing cross-node communication

  • Memory Optimizations

    • Selective compression and caching

Conclusion

This concludes the first post of this three-part series on DeepSeek. We explored the past few years, when scaling laws drove the industry to focus on scaling datasets, compute, and model size to enhance LLMs. We briefly discussed key innovations – the Transformer architecture, the Mixture of Experts architecture, the two-stage training pipeline, prompt engineering, and Chain of Thought – that facilitated this scaling. We also did a quick review of DeepSeek’s innovations, given their scaling limitations due to export restrictions. The technical details of these innovations will be explored in the upcoming posts. Stay tuned!


If 2024 was the AI Playground for Product Leaders, 2025 is Walking a Tightrope

The AI explosion of 2024 was a wild ride—full of experimentation, excitement, and a rush to integrate AI into products. But as we enter 2025, the landscape is shifting. The real challenge now isn’t just adopting AI; it’s about striking the right balance—between innovation and business value, risk and reward, speed and sustainability.

For product leaders, AI is no longer a playground; it’s a tightrope. Walk it well, and you unlock massive competitive advantage. Misstep, and you risk wasted investments, reputational damage, or falling behind.

Here are three critical balancing acts every product leader must master in 2025.

Balancing Act #1: Balancing Experimentation and Delivering Business Value Using AI

2024 was about rapid AI experimentation—prototyping, proof-of-concepts, and launching AI features just to “have AI.” But in 2025, the game has changed. Product leaders can’t afford to experiment endlessly without delivering tangible business impact.

The Challenge:

  • AI capabilities evolve rapidly, but not all experiments translate into measurable business value.

  • Leaders must differentiate between hype-driven innovation and customer-driven impact.

How to Strike the Balance:

  • Tie AI experiments to key business metrics (e.g., cost savings, revenue growth, retention).

  • Adopt an MVP mindset: Validate AI use cases quickly, discard low-impact ones, and double down on winners.

  • Measure impact early and often: Build AI features with clear KPIs, ensuring they solve real customer pain points.

In 2025, AI that doesn’t drive business outcomes will be left behind. The focus shifts from “What AI can do” to “What AI should do.”

Balancing Act #2: Balancing the Risks vs. Rewards of Generative AI

Generative AI was the star of 2024, with companies racing to integrate chatbots, content generation, and automation. But as businesses scale AI usage, the risks have become clearer—hallucinations, copyright issues, security vulnerabilities, and ethical concerns.

The Challenge:

  • AI-generated outputs can be unpredictable and require human oversight.

  • Businesses must balance the efficiency gains of GenAI with legal, ethical, and brand risks.

How to Strike the Balance:

  • Implement AI governance frameworks to monitor accuracy, security, and compliance.

  • Educate teams on AI risks—GenAI isn’t just a tech issue; it’s a business and reputation issue.

  • Blend AI with human oversight—use AI for acceleration but keep humans in the loop for decision-making.

The reward? Massive efficiency gains and smarter automation. The risk? A compliance nightmare or AI-generated disaster. The leaders who master this balance will reap the benefits without the backlash.

Balancing Act #3: Balancing Speed of Execution While Keeping Up with AI Advancements

AI moves at breakneck speed—what was cutting-edge last quarter may be outdated today. Product leaders face a paradox: Move too slow, and you lose to competitors. Move too fast, and you risk half-baked AI features.

The Challenge:

  • AI models, tools, and frameworks are evolving faster than traditional product development cycles.

  • Keeping pace with AI advancements without disrupting execution is increasingly difficult.

How to Strike the Balance:

  • Embed AI learning into your team’s DNA—establish dedicated AI research tracks.

  • Use modular AI architectures to integrate new AI advancements without overhauling entire systems.

  • Prioritize AI investments wisely—not every new model or tool needs immediate adoption.

The best AI-driven product teams in 2025 will be those that execute fast while staying informed, adapting to change without chasing every trend.

Final Thoughts: AI Success in 2025 Is About Balance, Not Just Speed

2024 was about rushing into AI—2025 is about walking the tightrope.

Product leaders who master the balancing act will build AI products that are scalable, valuable, and future-proof.
Those who don’t will risk wasted investments, ethical pitfalls, and falling behind.

So, as you lead AI-driven products into 2025, ask yourself: Are you just running toward AI, or are you strategically balancing the tightrope?

Which balancing act do you find the hardest?

That’s precisely why we founded the Product Council at Founders Creative: to tackle these pressing and timely questions. We invite you to join the conversation through the Leadership Track of the Product Council by signing up here.

Founders Creative is a community connecting over 10,000 AI founders, investors, engineers, and operators through exclusive events in Silicon Valley, fostering collaboration and innovation.

If Software Ate the World, AI’s Digesting It

They say lightning never strikes twice, but technology has proven this wrong time and again. The 90s saw enterprises transform through ERPs. The late 1990s ushered in the SaaS revolution, with Salesforce’s founding in 1999 fundamentally changing how businesses consume software. Cloud computing took off in the late 2000s with AWS’s launch in 2006, letting […]

Founders Creative: The Mission

A few years ago I found myself on the road 12 weeks of the year as a startup founder from Las Vegas to Amsterdam. I challenged myself to put myself on every stage in the cities I traveled in and encountered some very sharp white and male elbows. It was amusing just as much as […]