Chatbots, CoPilots, and Choreographers: Where will you start with AI Agents?

Agentic AI and the Shift Beyond Traditional Automation
Artificial Intelligence is no longer just a tool for automation—it’s becoming an active decision-maker, collaborator, and orchestrator of complex business processes. The rise of Agentic AI marks a fundamental shift from traditional automation, where AI simply follows predefined rules, to intelligent systems that act autonomously, adapt dynamically, and execute tasks with minimal human intervention.
For product leaders, this shift is impossible to ignore. AI is no longer limited to Chatbots handling customer inquiries—it’s evolving into CoPilots that assist humans in high-stakes decision-making and Choreographers that autonomously coordinate business processes across multiple systems and stakeholders. These AI-driven agents are redefining productivity, accelerating workflows, and unlocking new revenue streams across industries.
The real question isn’t whether your business will use Agentic AI—it’s whether you’re prepared to leverage its full potential before your competitors do.
In this post, we explore three critical areas where Agentic AI is making an impact: Chatbots that go beyond scripted conversations, CoPilots that augment human decision-making, and Choreographers that automate complex business processes.
Let’s dive in.
1. Chatbots: From Simple Responders to Intelligent Agents
Chatbots have been a staple of AI applications, particularly in customer service. However, Agentic AI expands their role beyond simple query resolution, enabling them to handle multi-step workflows, dynamically escalate issues, and integrate across business functions—acting more like autonomous agents than mere responders.
Use Cases:
Customer Service Agents: Unlike traditional chatbots that follow scripted responses, Agentic AI-driven bots can diagnose problems, autonomously retrieve relevant data, and execute actions such as processing refunds or modifying orders—reducing human intervention while improving customer satisfaction.
AI-Driven Marketing Assistants: Modern marketing teams leverage AI-powered assistants to analyze consumer behavior, generate personalized content, and optimize campaign performance in real time. These intelligent agents go beyond automation, dynamically adjusting messaging, budget allocation, and audience targeting based on evolving market trends.
Sales and Lead Qualification: AI agents autonomously assess potential leads, schedule follow-ups, and provide tailored product recommendations—functioning as autonomous sales assistants rather than passive responders.
2. CoPilots: AI-Powered Decision-Makers for Humans
Agentic AI is particularly powerful when deployed as a CoPilot—an AI assistant tailored to specific industries and human personas, enhancing efficiency in specialized roles. Rather than replacing humans, these AI companions augment expertise, reduce cognitive load, and improve decision-making by providing real-time insights, automating repetitive tasks, and adapting to unique workflows across sectors like IT, healthcare, HR, and finance.
Use Cases:
Health CoPilot for Clinicians: AI-powered clinical decision support tools assist doctors by summarizing patient histories, flagging anomalies in diagnostics, and suggesting personalized treatment plans based on real-time data.
HR CoPilot for Employee Experience: HR teams leverage AI-driven assistants to proactively address employee concerns, recommend career growth paths, and personalize benefits—enhancing engagement and reducing attrition.
IT & Storage CoPilot (Pure Storage’s IT Manager’s Companion): This AI assistant autonomously monitors infrastructure, predicts failures, and optimizes storage allocation, ensuring seamless IT operations with minimal human oversight.
3. Choreographers: Automating Business Processes with AI
One of the most transformative applications of Agentic AI is in business process automation, where AI systems dynamically identify bottlenecks, orchestrate workflows, and optimize complex operational tasks. These AI-driven Choreographers act as autonomous conductors of business functions—seamlessly coordinating multiple systems, departments, and tasks.
Use Cases:
Autonomous Supply Chain Management: AI agents predict demand fluctuations, automate vendor negotiations, and optimize logistics in real time, significantly reducing inefficiencies in global supply chains.
Finance and Risk Management: AI-driven financial agents autonomously analyze transaction patterns, detect fraud, and make real-time credit decisions—revolutionizing risk assessment processes.
HR Onboarding Automation: Instead of manual onboarding processes, AI agents coordinate across IT, HR, and operations to provision accounts, assign training materials, and personalize onboarding workflows for new employees.
The Imperative for Product Leaders
For product leaders, Agentic AI is not just another efficiency tool—it represents a fundamental shift in how AI-driven systems interact, execute, and learn. The key to unlocking its full potential lies in identifying complex, high-value workflows where AI can operate autonomously—reducing friction, enhancing decision-making, and driving real business impact.
Where will you start? How will you strategize?
The challenge ahead is not just about automation—it’s about redesigning product strategies to embrace autonomy and orchestration. To stay ahead, product leaders must move beyond automating repetitive tasks and intentionally integrate Chatbots, CoPilots, and Choreographers into their business ecosystems. The winners in this AI revolution will be those who leverage AI not just to support but to lead business processes, unlocking new efficiencies, innovations, and competitive advantages.
💡 Are you ready to embrace Agentic AI, or will your competitors beat you to it?
Automating Prompt Engineering with DSPy: An Overview

The power of Language Models (LMs, or oftentimes called Large Language Models or LLMs) is often harnessed by chaining them together into sophisticated Language Model Programs, capable of tackling increasingly complex Natural Language Processing (NLP) tasks. Think of these programs as multi-step recipes where each step involves prompting an LM to perform a specific sub-task. Traditionally, building these LM programs relies heavily on manual prompt engineering: the time-consuming process of crafting effective instructions and examples (prompts) for each step through trial and error.
Enter DSPy: a novel framework that introduces automated prompt engineering for LM programs. It is the framework for programming—not prompting—LM programs. Instead of painstakingly hand-tuning prompts, DSPy treats the instructions and demonstrations given to the LMs within a program as parameters that can be automatically optimized to maximize performance on a specific task.
DSPy: Programming and Optimizing in One Go
DSPy provides a declarative programming model that allows developers to define what they want their LM program to achieve, without needing to specify exactly how each LM call should be prompted initially. You define the program as a series of modules, each with prompt templates containing open slots or variables for instructions and demonstrations (examples).
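For illustration, here is a minimal sketch of what such a declarative program can look like. The signature and module classes follow DSPy’s documented API, but the task (question answering over retrieved passages) and field names are just an example, and it assumes an LM and retriever have already been configured via `dspy.settings`.

```python
import dspy

# A signature declares *what* a step does: its input and output fields.
class GenerateAnswer(dspy.Signature):
    """Answer the question using the provided context."""
    context = dspy.InputField(desc="passages that may contain the answer")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a short factoid answer")

# A module composes steps; the prompts behind each step are left open
# for the optimizer to fill in (instructions and demonstrations).
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)
```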
The core innovation of DSPy lies in its ability to automatically find the best values for these prompt variables. It does this by using various optimization strategies that don’t require access to the internal workings (e.g. gradients) of the LMs themselves or detailed labels for the intermediate steps within the program. DSPy only needs the LM program, a metric to measure its overall success (e.g., accuracy, precision, recall, F1), and a training dataset of inputs (and optionally, final outputs).
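Concretely, optimization only needs a metric and a training set. The sketch below assumes the `BootstrapFewShot` teleprompter that ships with DSPy and reuses the `RAG` program from above; exact class and parameter names vary across releases, and `trainset` is assumed to be a list of `dspy.Example` objects with question/answer fields.

```python
from dspy.teleprompt import BootstrapFewShot

# The metric scores a complete program run: prediction vs. gold answer.
def exact_match(example, prediction, trace=None):
    return example.answer.lower() == prediction.answer.lower()

# Compile the program: DSPy fills in the open prompt slots automatically.
optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
compiled_rag = optimizer.compile(RAG(), trainset=trainset)

# The compiled program is used like any other module.
answer = compiled_rag(question="Who wrote 'Attention Is All You Need'?")
```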
Tackling the Challenges of Prompt Optimization
Optimizing prompts for multi-stage LM programs presents two key challenges that DSPy is designed to address:
•The Proposal Challenge: The space of all possible instructions and combinations of demonstrations is incredibly vast. DSPy needs efficient techniques to propose a small set of high-quality prompt candidates for each module.
•The Credit Assignment Challenge: When an LM program doesn’t perform well, it’s difficult to determine which module’s prompt is the culprit. DSPy needs strategies to infer the impact of prompt choices in each module on the overall program performance to guide the optimization process.
How DSPy Optimizes: Key Strategies
DSPy employs several intelligent strategies to tackle these challenges:
•Bootstrapping Demonstrations: DSPy can automatically generate potential few-shot examples (input/output pairs) for each module by running the initial version of the program on training data. If the program produces a successful final output (according to the defined metric), the input/output traces of each module are treated as valuable demonstrations. The optimizer can then intelligently select combinations of these bootstrapped demonstrations to include in the prompts (see the conceptual sketch after this list).

•Grounded Instruction Proposal: To generate effective instructions, DSPy utilizes another LM, a “proposer” LM. This proposer LM is provided with relevant context to help it craft better instructions. This context can include summaries of the training data, the structure of the LM program itself, and even previously evaluated prompts and their scores. By “grounding” the proposer in this information, DSPy aims to generate instructions that are more tailored to the specific task and the role of each module within the program.
•Surrogate Models for Efficient Search: To navigate the vast space of possible prompt configurations efficiently, DSPy can use surrogate models, such as Bayesian optimization. These models learn to predict the performance of different prompt combinations based on past evaluations, allowing DSPy to focus its search on the most promising areas. This reduces the number of costly LM calls needed for evaluation.
•Meta-Optimization of Proposal Strategies: DSPy can even go a step further by learning the best way to propose prompts. By parameterizing the hyperparameters of the proposal process (e.g., the temperature of the proposer LM, which grounding information to use), DSPy can use techniques like Bayesian optimization to find the proposal strategies that yield the best performing prompts for a given task and LM setup.
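To make the bootstrapping idea concrete, it can be pictured as a filter over program traces: run the unoptimized program, and whenever the final output passes the metric, harvest each module’s input/output pair as a candidate demonstration. The sketch below is purely conceptual, not DSPy’s internals; `run_with_trace` is a hypothetical helper.

```python
def bootstrap_demos(program, trainset, metric, max_demos=16):
    """Illustrative only: collect per-module demos from successful runs."""
    demos = []
    for example in trainset:
        # run_with_trace is a hypothetical helper that returns the final
        # prediction plus the (input, output) pair recorded at every module.
        prediction, trace = run_with_trace(program, example)
        if metric(example, prediction):   # keep traces only when the end result is good
            demos.extend(trace)           # each module's (input, output) pair becomes a demo
        if len(demos) >= max_demos:
            break
    return demos
```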
MIPRO: A Powerful Optimizer in the DSPy Toolkit
MIPRO (Multi-prompt Instruction PRoposal Optimizer) is an algorithm built using the above insights that demonstrates strong performance. MIPRO jointly optimizes both the instructions and the few-shot examples for each module in an LM program. By separating the task of proposing prompts from the task of evaluating and selecting the best combinations (credit assignment using a surrogate model), MIPRO can effectively find high-performing prompt configurations.
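In the DSPy library this algorithm is exposed as the `MIPROv2` optimizer. The sketch below shows typical usage, reusing the `RAG` program, `exact_match` metric, and `trainset` from the earlier sketches; parameter names differ somewhat across DSPy versions, so treat this as a rough outline rather than a definitive recipe.

```python
from dspy.teleprompt import MIPROv2

# Jointly searches over proposed instructions and bootstrapped demonstrations,
# using a surrogate model to decide which combinations to evaluate next.
optimizer = MIPROv2(metric=exact_match, num_candidates=10)

optimized_rag = optimizer.compile(
    RAG(),
    trainset=trainset,
    max_bootstrapped_demos=4,   # demos harvested from successful program runs
    max_labeled_demos=4,        # demos taken directly from labeled training data
)
```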
Key Lessons Learned from DSPy Optimization
The evaluation of DSPy optimizers on a diverse set of tasks yielded several important lessons:
•Optimizing bootstrapped demonstrations is often crucial for achieving the best performance in LM programs. Providing relevant examples can be a very effective way to guide LMs.
•Jointly optimizing both instructions and few-shot examples, as done by MIPRO, generally leads to the best overall performance across a range of tasks.
•Instruction optimization becomes particularly important for tasks with complex, conditional rules that are difficult to convey through a limited number of examples. In such cases, refining the instructions can have a significant impact.
•Providing relevant context (“grounding”) to the instruction proposal process is generally helpful, but the most beneficial type of context can vary depending on the specific task.
•There is still much to learn about the optimal strategies for LM program optimization, and future research can explore the performance of different optimizers under varying resource constraints and with different base LMs.

Conclusion: Towards More Efficient and Effective LM Programs
DSPy represents a significant advancement in how we build and optimize Language Model Programs. By automating the often tedious and error-prone process of prompt engineering, DSPy offers the potential for increased efficiency, better performing NLP solutions, and a reduced reliance on manual trial and error. As LM programs become increasingly central to tackling complex tasks, frameworks like DSPy will play a vital role in making this technology more accessible and powerful.
Join Founders Creative at the AI Engineering Summit 3/28 in Palo Alto to connect with 50+ engineering leaders!
Sign up today using code RAYMOND for $50 off! Early bird ticket sales end 3/9 11:59p Pacific!
DeepSeek’s Model Training Methodology

In the previous post, we explored how the DeepSeek team utilized an HPC codesign approach. By enhancing both the model architecture and the training framework, they were able to train DeepSeek models effectively while using fewer resources. In this article, we will delve into the innovative techniques they employed in their training methodology.
Large Scale Reinforcement Learning
DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth.
One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously.
– DeepSeek-R1 paper
One of the primary innovations from the DeepSeek team is the application of direct, large-scale Reinforcement Learning (RL) to LLM training, without any supervised fine-tuning. Their research demonstrates that RL alone naturally enhances the model’s reasoning abilities.
Reinforcement learning (RL) has been used extensively in multiple domains, including robotics. It is a machine learning technique where the model learns to make decisions based on feedback. Desired behaviors are rewarded, and undesired behaviors are punished.
At a high level, the RL learning process involves an agent (the model being trained) and an interpreter model. The interpreter model reviews the agent’s results and takes input from the environment to determine a reward to apply to the model.
RL enables an agent to learn optimal strategies in complex environments by interacting directly with the environment and focusing on maximizing long-term rewards. This makes the model particularly adept at handling dynamic, uncertain situations where immediate feedback may not be available. In contrast, traditional machine learning methods often rely on pre-labeled datasets. RL essentially helps the model learn by experiencing the consequences of its actions and adapting its behavior to achieve the best possible outcome.
Existing Proximal Policy Optimization (PPO)
While there are many reinforcement learning algorithms, PPO, introduced by OpenAI in 2017, has been the default RL algorithm since 2018.
PPO’s inner workings
As shown in Fig 2 below, PPO uses four models:
- Policy Model: This is the LLM being tuned.
- Reference Model: This is identical to the policy model, but it is frozen and used to limit model divergence.
- Reward Model: This is a pre-trained model that evaluates the reward for generated text.
- Value Model: This is trained as part of the RL process to estimate the long-term value of the generated output.

The PPO Process
1. A query (q) is submitted to the policy model, which generates an output (o).
2. The reward model computes a reward (r) for the output.
3. The value model estimates a value (v) for the output.
4. The Generalized Advantage Estimation (GAE) function combines r, v, and a KL penalty against the reference model to estimate the advantage (A).
5. The advantage is then used to update the policy model weights.
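As an illustration of step 4, a minimal generalized advantage estimation over one generated sequence might look like the sketch below; the KL penalty against the reference model is assumed to have already been folded into the per-step rewards.

```python
import numpy as np

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one generated sequence.

    rewards: per-step rewards (reward model score, minus any KL penalty
             against the reference model), length T
    values:  value-model estimates for each step, length T+1 (last is a bootstrap)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD error
        running = delta + gamma * lam * running                  # discounted sum of deltas
        advantages[t] = running
    return advantages

# Toy example: a 3-token completion with a terminal reward of 1.0
print(gae(rewards=[0.0, 0.0, 1.0], values=[0.1, 0.2, 0.4, 0.0]))
```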
Key Takeaway
PPO’s use of four models makes it compute-intensive, presenting challenges for large-scale RL implementations.
DeepSeek’s Group Relative Policy Optimization (GRPO)
The DeepSeek team addressed these scaling challenges by simplifying two key parts of PPO:
- replacing the learned value model with advantages estimated from group scores, combined with a simpler rule-based reward computation
- simplifying the KL regularization
With these optimizations, the team was able to apply RL at scale directly to the base model.
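The core of the trick is easy to sketch: sample a group of outputs for the same prompt, score each one with the rule-based reward, and normalize each reward by the group’s mean and standard deviation to get the advantage, so no learned critic is needed. A minimal illustration:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample's reward against
    the mean and std of its own group (no value/critic model needed)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# One prompt, a group of 4 sampled answers scored by rule-based rewards
# (e.g., 1.0 if the final answer is correct and well formatted, else 0.0).
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))   # -> approximately [ 1., -1.,  1., -1.]
```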

Four Stage Training Pipeline
The use of large-scale unsupervised RL led to the development of a strong reasoning model. However, this model encountered challenges related to readability, language mixing, and generalization to non-reasoning tasks. To address these issues, the team devised a new four-stage training pipeline that incorporates two supervised fine-tuning (SFT) and two reinforcement learning (RL) stages.
The initial SFT stage utilizes high-quality cold start data to stabilize the subsequent RL step, which in turn enhances the model’s reasoning capabilities. This is followed by another SFT stage that employs rejection sampling to further strengthen the model and includes non-reasoning examples to improve its performance on non-reasoning tasks.
The final RL stage incorporates more general tasks to align the model with human expectations and includes a reward for readability and single language usage in its policy optimization step.
DeepSeek-R1 Results
The DeepSeek-R1 model showcased impressive results, with the team summarizing their findings as follows:
- DeepSeek-R1’s performance was on par with OpenAI-o1-1217 on multiple tasks.
- The model’s strong document analysis capabilities were evident in its performance on FRAMES, a long-context-dependent QA task.
- DeepSeek-R1 displayed strong instruction-following capabilities, based on impressive results on the IF-Eval benchmark.
- The model’s strengths in writing tasks and open-domain question answering were highlighted by good performance on AlpacaEval 2.0 and ArenaHard.
- DeepSeek-R1’s performance on the Chinese SimpleQA benchmark was worse than DeepSeek-V3 due to the addition of Safety RL to control output (censorship?).
- Large-scale reinforcement learning was highly effective for STEM-related questions with clear and specific answers.
- Reasoning models were generally better at handling fact-based queries.
- Reasoning tasks
  - 79.8% Pass@1 on AIME 2024
  - 97.3% on MATH-500
  - 2,029 Elo rating on Codeforces
- Knowledge
  - 90.8% on MMLU
  - 84.0% on MMLU-Pro
  - 71.5% on GPQA Diamond
- Others
  - 87.6% on AlpacaEval 2.0
  - 92.3% on ArenaHard
Model Distillation
Distillation Creates Smaller, More Efficient Models
The DeepSeek team found that smaller, denser models trained on data from the larger R1 model, through a process called distillation, performed very well on benchmarks. This finding can help create smaller, more efficient models in the industry.
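Under the hood, this kind of distillation is essentially supervised fine-tuning of the student on teacher-generated outputs. The sketch below is a rough illustration using Hugging Face transformers; the checkpoint name and data are placeholders, not DeepSeek’s exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-7B"          # placeholder student checkpoint
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# teacher_data: (prompt, reasoning + answer) pairs generated by the larger R1 model
teacher_data = [("Prove that ...", "<think>...</think> The answer is ...")]

for prompt, target in teacher_data:
    batch = tokenizer(prompt + target, return_tensors="pt")
    # Standard causal-LM loss on the concatenated text; a real pipeline would
    # usually mask the prompt tokens out of the loss.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```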

Six Distilled Models Created
Using the Llama and Qwen models, the team created six distilled models:
- Distill-Qwen-1.5B (1.5 billion parameters)
- Distill-Qwen-7B (7 billion parameters)
- Distill-Qwen-14B (14 billion parameters)
- Distill-Qwen-32B (32 billion parameters)
- Distill-Llama-8B (8 billion parameters)
- Distill-Llama-70B (70 billion parameters)
Distilled Models Perform Well
The distilled models also performed very well compared to existing similar models on multiple tasks.
Summary
The DeepSeek team has innovated on multiple facets of model building to create a best-of-breed reasoning model. By open-sourcing everything, they are also enabling innovation across the industry. It will be fascinating to see how these innovations power further improvements in LLMs.
Model Architecture Behind DeepSeek R1

In our earlier introductory post, we discussed the critical innovations and emphasis on scaling that have propelled advancements in recent years, as shown by the development of OpenAI models.
As predicted by the scaling laws, the performance of these models has shown steady growth. Now, GPT-4 outperforms the average student on the Multistate Bar Exam (MBE), which is a rigorous series of tests required for practicing law in the United States.

Other model families, such as Google’s Gemini, Meta’s Llama, and Mistral AI’s Mistral, follow a very similar trend. However, the DeepSeek team was compelled to shift their focus due to US trade restrictions. Instead of scaling, they concentrated on model architecture, training methodologies, and the training framework. Over 18 months, they developed numerous tweaks in each area, which they published in a series of four papers.

Let us now dive into the actual improvements, starting with model architecture.
DeepSeek-MoE
MoE Architecture And Challenges
The Mixture of Experts (MoE) model architecture, which has been around for a while and has been used by other teams as well, breaks the single large feedforward network in the transformer block into multiple smaller feedforward networks called experts. A router component decides which experts to activate for a given token, so only a subset of experts runs per token, reducing the number of FLOPs performed per token.
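For illustration, a minimal top-k routed MoE layer might look like the sketch below; real implementations batch the routing and run experts in parallel across devices, but the selection logic is the same.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # produces token-to-expert affinity scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.k = k

    def forward(self, x):                                # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for token in range(x.shape[0]):                  # only k experts run per token
            for score, idx in zip(topk_scores[token], topk_idx[token]):
                out[token] += score * self.experts[idx](x[token])
        return out

print(TopKMoE()(torch.randn(4, 512)).shape)              # -> torch.Size([4, 512])
```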
However, this approach presents scaling and performance challenges:
- Routing Collapse: The router may learn to activate only a small subset of experts for all tokens, creating an imbalance. This is typically addressed with auxiliary-loss-based load balancing, which adds a penalty to the loss function for over- or under-utilization of an expert. However, because this penalty is part of the loss function, it affects the weights the model learns and can degrade performance.
- Communication Overhead: In distributed training, where experts are spread across devices, the constant exchange of tokens between devices creates significant communication overhead.
- Generalization Issues: The specialized nature of MoE experts can hinder the model’s ability to generalize across different tasks.
DeepSeekMoE Innovations

The DeepSeekMoE architecture, shown above, incorporates several enhancements to address the challenges posed by training MoE models.
- Auxiliary-Loss-Free Load Balancing: To counter routing collapse, a novel dynamic bias term was introduced. This term is used exclusively to update the token-to-expert affinity scores for routing and is ignored during model weight updates, preventing collapse without compromising performance (see the sketch after this list).
- Shared Experts: The architecture also includes shared experts that learn from all tokens. This reduces redundancy among the specialized experts, promoting efficiency and enhancing the model’s ability to generalize across diverse tasks.
- Framework Improvements: Substantial improvements were also made to the model training framework to mitigate communication overhead and reduce compute cost. These include:
  - FP8 Mixed Precision: Matrix multiplications use the FP8 data format, halving memory usage and data transfer. Block-wise scaling and periodic “promotion” to FP32 after brief accumulation intervals prevent overflow/underflow errors, maintaining numerical stability despite FP8’s reduced numeric range.
  - DualPipe Parallelism: This technique overlaps forward and backward computation with the MoE all-to-all dispatch, optimizing network communication, especially across InfiniBand.
  - PTX-Level & Warp Specialization: Warp-level instructions in PTX were fine-tuned, and the chunk size for all-to-all dispatch was auto-tuned to fully leverage InfiniBand and NVLink. Additionally, the allocation of streaming multiprocessors (SMs) between communication and compute tasks was adjusted. These optimizations ensure that communication does not impede computation.
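The sketch below is a simplified illustration of the auxiliary-loss-free idea, not DeepSeek’s actual implementation: a per-expert bias is added to the affinity scores only when selecting experts, and after each step it is nudged down for overloaded experts and up for underloaded ones, so the loss and the learned weights are never touched.

```python
import torch

def route_with_bias(affinity, bias, k=2, gamma=0.001):
    """Routing-only bias for load balancing (simplified illustration).

    affinity: (tokens, num_experts) token-to-expert affinity scores
    bias:     (num_experts,) per-expert bias, used ONLY for expert selection
    """
    # Selection uses biased scores; gate weights applied to expert outputs
    # still come from the unbiased affinities, so gradients are unaffected.
    _, topk_idx = (affinity + bias).topk(k, dim=-1)

    # Measure per-expert load and nudge the bias toward balance.
    load = torch.zeros_like(bias)
    load.scatter_add_(0, topk_idx.flatten(), torch.ones(topk_idx.numel()))
    target = topk_idx.numel() / bias.numel()        # ideal tokens per expert
    bias -= gamma * torch.sign(load - target)       # overloaded -> down, underloaded -> up
    return topk_idx, bias

affinity = torch.rand(16, 8)        # 16 tokens, 8 experts
bias = torch.zeros(8)
topk_idx, bias = route_with_bias(affinity, bias)
```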
Multi-Head Latent Attention (MLA)
The transformer’s multi-head attention involves Query, Key, and Value vectors, each of which can be as large as the embedding dimension divided by the number of attention heads. MLA uses down-projection matrices to fold these vectors into smaller “latent” vectors, some of which are also cached, reducing both memory traffic and the FLOPs performed per token.
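The heart of the idea is a pair of low-rank projections: hidden states are projected down into a small latent vector (which is what gets cached), then projected back up to per-head keys and values when attention is computed. A simplified PyTorch sketch follows; the dimensions are illustrative, and details such as decoupled rotary-embedding keys are omitted.

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    def __init__(self, d_model=1024, d_latent=128, d_head=64, n_heads=8):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)            # compress to latent
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand to values

    def forward(self, hidden):                   # hidden: (seq, d_model)
        latent = self.down(hidden)               # (seq, d_latent) -- this is what gets cached
        keys = self.up_k(latent)                 # reconstructed per-head keys
        values = self.up_v(latent)               # reconstructed per-head values
        return latent, keys, values

latent, k, v = LatentKV()(torch.randn(10, 1024))
print(latent.shape, k.shape, v.shape)            # cache stores only the small latent vectors
```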
In order to avoid any performance drop, the team also implemented:
- Dynamic Low-Rank Projection: MLA adjusts the compression strength for Key/Value vectors based on sequence length.
- Adaptive Query Compression: The query is scaled adaptively at different layer depths. Early layers maintain expressiveness with higher-dimensional queries, while deeper layers compress more aggressively.
- Joint KV Storage: Shared KV storage further reduces memory traffic during multi-node inference.
- Layer-Wise Adaptive Cache: Instead of caching all past tokens for all layers, V3 prunes older KV entries at deeper layers to manage memory usage with 128K context windows.
The architecture diagram for MLA, showing the compression of queries, keys, and values using down projection matrices, is displayed below.

Multi-Token Prediction
The DeepSeek team enhanced the architecture with Multi-Token Prediction (MTP) modules that enable the model to predict multiple future tokens simultaneously. This allows for a deeper understanding of context and the generation of more coherent sequences, since the model can look ahead further into the sequence. Compared to traditional single-token prediction, this significantly improves both efficiency and performance. The diagram below illustrates the MTP module alongside the main module.

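As a very rough mental model, MTP adds extra prediction heads trained to predict tokens further ahead, with their losses added to the standard next-token objective. The sketch below is illustrative only; it omits the sequential MTP modules and shared embeddings described in the DeepSeek papers, and the vocabulary size is a placeholder.

```python
import torch
import torch.nn as nn

class MTPHeads(nn.Module):
    def __init__(self, d_model=1024, vocab_size=32000, depth=2):
        super().__init__()
        # One head per future offset: heads[0] predicts t+1, heads[1] predicts t+2, ...
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(depth)])

    def forward(self, hidden_states):                          # (seq, d_model)
        return [head(hidden_states) for head in self.heads]    # one logit tensor per offset

# During training, logits[k] are scored against the target tokens shifted by (k+1),
# and the extra losses are averaged into the main next-token objective.
logits = MTPHeads()(torch.randn(10, 1024))
print(len(logits), logits[0].shape)
```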
Conclusion
This concludes this section. We explored three primary architectural patterns that the DeepSeek team adapted and enhanced to develop the DeepSeek-R1 model: DeepSeekMoE, Multi-Head Latent Attention, and Multi-Token Prediction. We also reviewed various improvements made to the training framework to accommodate the architectural changes. In the next and final post of this series, we will examine the changes made to the model training methodology.
Finally, I have a shameless plug: I will be leading a panel discussion on foundation models and, of course, DeepSeek-R1 at the engineering summit on March 28 in Palo Alto, California. A group of very enthusiastic volunteers from Founders Creative is organizing the event. We have a very interesting lineup of speakers and panels. If you are in the San Francisco Bay Area around that time, you should take a look.
What Is Behind The DeepSeek Hype?

DeepSeek R1, a new open-sourced large language model released by a Chinese startup in January 2025, has garnered significant attention. Developed on a tight budget and timeline, it rivals the industry’s top LLMs. While concerns persist around security, safety, veracity, and accuracy, the technical innovations behind the model are undeniable. This three-part series will focus exclusively on exploring these technical advancements.
First Some Context
To gain a better understanding of DeepSeek’s innovations and contributions, it is crucial to first examine the broader context surrounding the rapid rise of Large Language Models (LLMs) and the current industry practices and trends. The way for the GenAI revolution has been paved by a long list of innovations in deep learning and NLP research over the past few decades. However, for the sake of time, we will keep our highly opinionated review short and focus on just the few recent innovations since the transformer architecture.
Transformer Architecture
The 2017 paper “Attention Is All You Need” introduced the transformer architecture, built entirely around the Attention mechanism, which allows the model to focus on different parts of a sequence to effectively complete the task. For example, if the model is given the sentence “he walked to her home” and asked to classify the gender of the main character, the Attention mechanism would give more weight to the word “he“. The diagram below shows the encoder part of the encoder-decoder transformer architecture in that paper.

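At its core, attention is a weighted average over the sequence, where the weights come from query-key similarity. A minimal single-head NumPy sketch of scaled dot-product attention:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the sequence
    return weights @ V                                     # weighted mix of value vectors

# Toy example: 4 tokens with 8-dimensional embeddings, using the same
# matrix as queries, keys, and values (self-attention).
x = np.random.randn(4, 8)
print(attention(x, x, x).shape)                            # -> (4, 8)
```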
Mixture of Experts (MoE) Architecture
The Mixture of Experts (MoE) architecture then emerged, creating a sparser network with specialized segments called experts. Rather than updating the parameters of one large feedforward network, only a subset of many smaller feedforward networks is updated. This decreased the computational requirements during both training and inference, allowing larger networks to be trained on more data. The figure below shows how the feedforward network in the transformer’s encoder (highlighted in blue) is broken down into multiple smaller FFNs in MoE.

Two Stage Training Pipeline – GPT, SFT and RLHF
The next big step was in the training methodology used to train LLMs. The 2018 GPT paper introduced a two-stage training pipeline, consisting of unsupervised Generative Pre-Training (GPT) on massive datasets followed by Supervised Fine Tuning (SFT). In some sense, this paper marked the true birth of the generative AI revolution. The InstructGPT paper from 2022 then added a third step: Reinforcement Learning From Human Feedback (RLHF). This step uses human annotators to score the model’s output and then fine-tunes the model further, ensuring alignment with human expectations.

Chain of Thought
The focus then shifted to post-training or test-time improvements with prompt engineering, specifically Chain of Thought reasoning. Chain of thought prompting decomposes complex tasks into a series of logical sub-tasks, guiding the model to reason in a human-like manner. Finally, ChatGPT was introduced as a user-friendly chat interface.
Scaling Laws
The last stop on our tour is scaling laws. This theoretical framework drove the industry’s shifting focus, first to pre-training scale, then to post-training, and finally to test-time compute.
The 2020 publication Scaling Laws for Neural Language Models introduced the concept that LLM performance improves with increases in model size, dataset size, and compute used for training. This spurred a focus on scaling:
- Initially, pre-training scaling was the focus, with efforts to increase the size of models, pre-training datasets, and compute clusters.
- When the limits of internet data were reached, the focus shifted to post-training scaling, using Reinforcement Learning with Human Feedback (RLHF) and Supervised Fine Tuning (SFT). High-quality, human-annotated, task-specific data for fine-tuning became a key differentiator for many companies.
- As post-training gains diminished, test-time scaling came into focus, using prompt engineering and Chain of Thought reasoning.
Enter DeepSeek
The previous section highlighted that the industry was more focused on scaling these past few years, rather than model architecture or training methodologies. This is changing with DeepSeek’s announcement of its R1 model on January 20, 2025. US export restrictions on China seemingly limited scaling as an option to improve LLMs, forcing innovation across multiple aspects of model building, which can be categorized into three areas.
Model Architecture
- DeepSeekMoE with Auxiliary-Loss-Free Load Balancing
  - Only 37 billion of 671 billion parameters activated for each token
- MLA – Multi-Headed Latent Attention
  - Compressing key/value vectors using down-projection and up-projection matrices for more optimal memory and compute use
- Multi-Token Prediction Training Objective
  - Predicting more than one token at a time, again optimizing compute
  - Powers speculative decoding during inference to speed it up
Training Methodology
- Direct Reinforcement Learning on the Base Model
  - No supervised fine-tuning (SFT)
  - Surfaced emergent CoT behaviors – self-verification, reflection, etc.
- New Group Relative Policy Optimization (GRPO)
  - Estimates advantages from group scores instead of using a critic model
  - Rule-based reward with accuracy and format rewards
- New Four-Stage Training Pipeline for the Final Model
  - Cold-start data
  - Reasoning-oriented RL
  - Rejection sampling and SFT
  - RL across all scenarios for alignment
- Distillation of Reasoning Patterns
  - Data generated by DeepSeek-R1 used to fine-tune smaller dense models like Qwen and Llama
Training Framework
- FP8 Mixed Precision Training Framework
  - Previous quantization approaches converted weights from FP32 to FP8 only after model training
- DualPipe Algorithm for Pipeline Parallelism
  - Bidirectional pipeline scheduling and overlapping communication with computation
  - Reduces pipeline bubbles and the communication overhead introduced by cross-node expert parallelism
  - Near-zero communication overhead while scaling the model and employing fine-grained experts across nodes
- Better Utilization of InfiniBand and NVLink Bandwidths
  - Improved cross-node communication
  - 20 of the 132 streaming multiprocessors on each H800 specifically programmed to manage cross-chip communications
- Memory Optimizations
  - Selective compression and caching
Conclusion
This concludes the first post in this three-part series on DeepSeek. We explored the past few years, when scaling laws drove the industry to focus on scaling datasets, compute, and model size to enhance LLMs. We briefly discussed key innovations – the Transformer architecture, the Mixture of Experts architecture, the two-stage training pipeline, prompt engineering, and Chain of Thought – that facilitated this scaling. Additionally, we did a quick review of DeepSeek’s innovations, given their scaling limitations due to export restrictions. The technical details of these innovations will be explored in the upcoming posts. Stay tuned!
If 2024 was the AI Playground for Product Leaders, 2025 is Walking a Tightrope

The AI explosion of 2024 was a wild ride—full of experimentation, excitement, and a rush to integrate AI into products. But as we enter 2025, the landscape is shifting. The real challenge now isn’t just adopting AI; it’s about striking the right balance—between innovation and business value, risk and reward, speed and sustainability.
For product leaders, AI is no longer a playground; it’s a tightrope. Walk it well, and you unlock massive competitive advantage. Misstep, and you risk wasted investments, reputational damage, or falling behind.
Here are three critical balancing acts every product leader must master in 2025.
Balancing Act #1: Balancing Experimentation and Delivering Business Value Using AI
2024 was about rapid AI experimentation—prototyping, proof-of-concepts, and launching AI features just to “have AI.” But in 2025, the game has changed. Product leaders can’t afford to experiment endlessly without delivering tangible business impact.
The Challenge:
- AI capabilities evolve rapidly, but not all experiments translate into measurable business value.
- Leaders must differentiate between hype-driven innovation and customer-driven impact.
How to Strike the Balance:
- Tie AI experiments to key business metrics (e.g., cost savings, revenue growth, retention).
- Adopt an MVP mindset: Validate AI use cases quickly, discard low-impact ones, and double down on winners.
- Measure impact early and often: Build AI features with clear KPIs, ensuring they solve real customer pain points.
In 2025, AI that doesn’t drive business outcomes will be left behind. The focus shifts from “What AI can do” to “What AI should do.”
Balancing Act #2: Balancing the Risks vs. Rewards of Generative AI
Generative AI was the star of 2024, with companies racing to integrate chatbots, content generation, and automation. But as businesses scale AI usage, the risks have become clearer—hallucinations, copyright issues, security vulnerabilities, and ethical concerns.
The Challenge:
- AI-generated outputs can be unpredictable and require human oversight.
- Businesses must balance the efficiency gains of GenAI with legal, ethical, and brand risks.
How to Strike the Balance:
- Implement AI governance frameworks to monitor accuracy, security, and compliance.
- Educate teams on AI risks—GenAI isn’t just a tech issue; it’s a business and reputation issue.
- Blend AI with human oversight—use AI for acceleration but keep humans in the loop for decision-making.
The reward? Massive efficiency gains and smarter automation. The risk? A compliance nightmare or AI-generated disaster. The leaders who master this balance will reap the benefits without the backlash.
Balancing Act #3: Balancing Speed of Execution While Keeping Up with AI Advancements
AI moves at breakneck speed—what was cutting-edge last quarter may be outdated today. Product leaders face a paradox: Move too slow, and you lose to competitors. Move too fast, and you risk half-baked AI features.
The Challenge:
- AI models, tools, and frameworks are evolving faster than traditional product development cycles.
- Keeping pace with AI advancements without disrupting execution is increasingly difficult.
How to Strike the Balance:
- Embed AI learning into your team’s DNA—establish dedicated AI research tracks.
- Use modular AI architectures to integrate new AI advancements without overhauling entire systems.
- Prioritize AI investments wisely—not every new model or tool needs immediate adoption.
The best AI-driven product teams in 2025 will be those that execute fast while staying informed, adapting to change without chasing every trend.
Final Thoughts: AI Success in 2025 Is About Balance, Not Just Speed
2024 was about rushing into AI—2025 is about walking the tightrope.
Product leaders who master the balancing act will build AI products that are scalable, valuable, and future-proof.
Those who don’t will risk wasted investments, ethical pitfalls, and falling behind.
So, as you lead AI-driven products into 2025, ask yourself: Are you just running toward AI, or are you strategically balancing the tightrope?
Which balancing act do you find the hardest?
That’s precisely why we founded the Product Council at Founders Creative—to tackle these pressing and timely questions. We invite you to join the conversation through the Leadership Track of the Product Council by signing up here.
Founders Creative is a community connecting over 10,000 AI founders, investors, engineers, and operators through exclusive events in Silicon Valley, fostering collaboration and innovation.
If Software Ate the World, AI’s Digesting It

They say lightning never strikes twice, but technology has proven this wrong time and again. The 90s saw enterprises transform through ERPs. The late 1990s ushered in the SaaS revolution, with Salesforce’s founding in 1999 fundamentally changing how businesses consume software. Cloud computing took off in the late 2000s with AWS’s launch in 2006, letting […]
Founders Creative: The Mission

A few years ago I found myself on the road 12 weeks of the year as a startup founder from Las Vegas to Amsterdam. I challenged myself to put myself on every stage in the cities I traveled in and encountered some very sharp white and male elbows. It was amusing just as much as […]