GRPO — Group Relative Policy Optimization

The training algorithm behind DeepSeek-R1

Algorithm Overview

Group Relative Policy Optimization (GRPO) is a policy optimization algorithm (one that directly optimizes the parameters of a policy, a probability distribution over actions, to maximize expected reward) that eliminates the need for a learned critic. Instead of training a separate value network, GRPO samples a group of G outputs from the current policy for each prompt or state, scores each one, and computes advantages by normalizing rewards within the group: A_i = (R_i − mean) / std. The group thus serves as its own baseline, replacing the critic network. GRPO also borrows clipping from PPO (the policy ratio r = π(a)/π_old(a) is clipped to [1−ε, 1+ε] to prevent overly large policy updates) and adds an optional KL penalty β·KL(π || π_ref) that penalizes the policy for drifting too far from a reference policy, helping maintain training stability.

Sample G outputs → Score each → Normalize in group → Update policy
GRPO Objective:
L(θ) = (1/G) Σ_i [ min(r_i·A_i, clip(r_i, 1−ε, 1+ε)·A_i) − β·KL(π || π_ref) ]
where r_i = π(a_i) / π_old(a_i),   A_i = (R_i − mean(R)) / std(R)
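The objective above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the demo's source: the function and argument names are ours, and the KL term uses a simple per-sample Monte Carlo estimate (log π − log π_ref), whereas the DeepSeekMath paper uses a different unbiased estimator.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, rewards, logp_ref, eps=0.2, beta=0.01):
    """One GRPO update term for a group of G sampled outputs.

    logp_new / logp_old / logp_ref: log-probabilities of each sampled
    action under the current, behavior, and reference policies.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Group-relative advantage: the group is its own baseline.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = np.exp(logp_new - logp_old)          # r_i = pi(a_i) / pi_old(a_i)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    # Simple per-sample Monte Carlo estimate of KL(pi || pi_ref).
    kl = logp_new - logp_ref
    # GRPO maximizes the objective; return its negative as a loss.
    return -(surrogate - beta * kl).mean()
```

When the current, old, and reference policies coincide, the ratios are all 1, the KL term vanishes, and the loss reduces to the (near-zero) mean of the normalized advantages.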

Demo Problem: To illustrate these mechanics, we use a bandit-style action-selection problem. The agent maintains a policy (categorical probability distribution) over 10 discrete actions (numbers 1–10) and must learn to concentrate probability mass on the highest-reward action(s). There are no states or transitions—just repeated action selection against a fixed reward landscape—so the focus stays squarely on how group sampling, relative normalization, clipping, and the KL penalty drive the policy toward the optimum.
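A single GRPO-style step on this bandit is short enough to write out. The sketch below uses the "Linear" reward (action/10) with a softmax policy over 10 actions; the group size and learning rate are illustrative choices, not values from the demo, and clipping and the KL penalty are omitted because with one on-policy update per group the ratio r_i stays at 1 anyway.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(10)                          # uniform categorical policy

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reward(a):                                 # a is an action index 0..9
    return (a + 1) / 10.0                      # "Linear" reward: action/10

G, lr = 8, 0.5                                 # illustrative hyperparameters
for step in range(200):
    pi = softmax(logits)
    acts = rng.choice(10, size=G, p=pi)        # sample a group of G actions
    R = reward(acts)
    adv = (R - R.mean()) / (R.std() + 1e-8)    # group-relative advantages
    grad = np.zeros(10)
    for a, A in zip(acts, adv):
        grad += A * (np.eye(10)[a] - pi)       # A_i * grad of log pi(a_i)
    logits += lr * grad / G
# After training, probability mass should concentrate near action 10.
```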

Three reward functions let you explore different learning challenges: Single Peak — a triangular reward centered at a configurable position, testing basic convergence; Double Peak — two peaks of different heights (0.7 at action 3, 1.0 at action 8), testing whether the policy finds the global maximum or gets stuck on the local one; and Linear — reward grows monotonically (reward = action/10), a simple baseline. Adjust the parameters below and watch the policy converge.
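The three reward landscapes can be sketched as follows. The peak heights and positions come from the description above; the peak widths are our assumptions, since the demo only fixes the heights.

```python
import numpy as np

ACTIONS = np.arange(1, 11)                     # discrete actions 1..10

def single_peak(a, center=5):
    """Triangular reward peaking at `center` (width is an assumed choice)."""
    return np.maximum(0.0, 1.0 - np.abs(a - center) / 5.0)

def double_peak(a):
    """Local peak 0.7 at action 3, global peak 1.0 at action 8
    (peak widths are assumptions)."""
    local = 0.7 * np.maximum(0.0, 1.0 - np.abs(a - 3) / 2.5)
    glob = 1.0 * np.maximum(0.0, 1.0 - np.abs(a - 8) / 2.5)
    return np.maximum(local, glob)

def linear(a):
    """Reward grows monotonically with the action: a / 10."""
    return a / 10.0
```

Double Peak is the interesting case: a policy that concentrates on action 3 early can stop sampling action 8 altogether and never discover the global maximum.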

Controls

[Interactive demo: adjustable training parameters with live readouts (Steps, KL from Ref, Max π(a), Entropy), plus Policy Distribution π(a), Training History, and Step Detail panels.]

GRPO vs REINFORCE Comparison

Run 200 steps with each algorithm using the same reward function and compare convergence: GRPO, REINFORCE, and REINFORCE with baseline.

Key Concepts

Why group normalization replaces the critic: In standard actor-critic methods (like PPO), a learned value function provides the baseline for variance reduction. GRPO instead samples a group of G outputs for each input and normalizes the rewards within the group: Ai = (Ri − mean) / std. This group mean acts as a natural baseline, eliminating the need for a separate critic network and its associated training overhead.
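Concretely, the normalization is a one-liner. The reward values below are made up for illustration:

```python
import numpy as np

# Rewards for a group of G = 6 sampled outputs (illustrative values).
rewards = np.array([0.2, 0.9, 0.4, 0.9, 0.1, 0.5])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
# Above-average samples (the 0.9s) get positive advantages, below-average
# ones negative; the group mean plays the role a learned V(s) plays in PPO.
```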

Clipping: Like PPO, GRPO clips the probability ratio r = π(a)/πold(a) to the range [1−ε, 1+ε]. This prevents any single update from changing the policy too drastically, improving training stability.
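A minimal sketch of the clipping step, with made-up ratio values:

```python
import numpy as np

eps = 0.2
ratio = np.array([0.5, 0.95, 1.0, 1.1, 2.0])   # pi(a) / pi_old(a)
clipped = np.clip(ratio, 1 - eps, 1 + eps)
# Only ratios outside [0.8, 1.2] are modified:
# [0.5, 0.95, 1.0, 1.1, 2.0] -> [0.8, 0.95, 1.0, 1.1, 1.2]
```

Because the objective takes the minimum of the clipped and unclipped surrogate terms, clipping caps how much credit (or blame) any one sample can contribute to the update.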

                 REINFORCE       PPO            GRPO
Critic           None            Learned V(s)   None
Baseline         Optional mean   V(s)           Group mean/std
Clipping         No              Yes            Yes
KL Penalty       No              Optional       Yes
Samples/Update   1+              Batch          Group of G
References:
• Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning (arXiv:2402.03300) — introduced GRPO
• DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (arXiv:2501.12948)