GRPO — Group Relative Policy Optimization

The training algorithm behind DeepSeek-R1

Algorithm Overview

Group Relative Policy Optimization (GRPO) is a policy optimization algorithm (one that directly optimizes the parameters of a policy, a probability distribution over actions, to maximize expected reward) that eliminates the need for a learned critic. Instead of training a separate value network, GRPO samples a group of G outputs from the current policy for each prompt or state, scores each one, and computes advantages by normalizing rewards within the group: A_i = (R_i − mean) / std. The group thus serves as its own baseline, replacing the critic network. GRPO also borrows clipping from PPO (the policy ratio r = π(a)/π_old(a) is clipped to [1−ε, 1+ε] to prevent overly large policy updates) and adds an optional KL penalty β·KL(π || π_ref) that penalizes the policy for drifting too far from a reference policy, helping maintain training stability.

Sample G outputs → Score each → Normalize in group → Update policy
GRPO Objective:
L(θ) = (1/G) Σ_i [ min(r_i·A_i, clip(r_i, 1−ε, 1+ε)·A_i) − β·KL(π || π_ref) ]
where r_i = π(a_i) / π_old(a_i),   A_i = (R_i − mean(R)) / std(R)
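The objective above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the demo's source: the function and argument names are ours, and the KL term uses a simple per-sample Monte Carlo estimate (log π − log π_ref), whereas the DeepSeekMath paper uses a different unbiased estimator.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, rewards, logp_ref, eps=0.2, beta=0.01):
    """One GRPO update term for a group of G sampled outputs.

    logp_new / logp_old / logp_ref: log-probabilities of each sampled
    action under the current, behavior, and reference policies.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Group-relative advantage: the group is its own baseline.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = np.exp(logp_new - logp_old)          # r_i = pi(a_i) / pi_old(a_i)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    # Simple per-sample Monte Carlo estimate of KL(pi || pi_ref).
    kl = logp_new - logp_ref
    # GRPO maximizes the objective; return its negative as a loss.
    return -(surrogate - beta * kl).mean()
```

When the current, old, and reference policies coincide, the ratios are all 1, the KL term vanishes, and the loss reduces to the (near-zero) mean of the normalized advantages.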

Demo Problem: To illustrate these mechanics, we use a bandit-style action-selection problem. The agent maintains a policy (categorical probability distribution) over 10 discrete actions (numbers 1–10) and must learn to concentrate probability mass on the highest-reward action(s). There are no states or transitions—just repeated action selection against a fixed reward landscape—so the focus stays squarely on how group sampling, relative normalization, clipping, and the KL penalty drive the policy toward the optimum.
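A single GRPO-style step on this bandit is short enough to write out. The sketch below uses the "Linear" reward (action/10) with a softmax policy over 10 actions; the group size and learning rate are illustrative choices, not values from the demo, and clipping and the KL penalty are omitted because with one on-policy update per group the ratio r_i stays at 1 anyway.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(10)                          # uniform categorical policy

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reward(a):                                 # a is an action index 0..9
    return (a + 1) / 10.0                      # "Linear" reward: action/10

G, lr = 8, 0.5                                 # illustrative hyperparameters
for step in range(200):
    pi = softmax(logits)
    acts = rng.choice(10, size=G, p=pi)        # sample a group of G actions
    R = reward(acts)
    adv = (R - R.mean()) / (R.std() + 1e-8)    # group-relative advantages
    grad = np.zeros(10)
    for a, A in zip(acts, adv):
        grad += A * (np.eye(10)[a] - pi)       # A_i * grad of log pi(a_i)
    logits += lr * grad / G
# After training, probability mass should concentrate near action 10.
```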

Three reward functions let you explore different learning challenges: Single Peak — a triangular reward centered at a configurable position, testing basic convergence; Double Peak — two peaks of different heights (0.7 at action 3, 1.0 at action 8), testing whether the policy finds the global maximum or gets stuck on the local one; and Linear — reward grows monotonically (reward = action/10), a simple baseline. Adjust the parameters below and watch the policy converge.
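The three reward landscapes can be sketched as follows. The peak heights and positions come from the description above; the peak widths are our assumptions, since the demo only fixes the heights.

```python
import numpy as np

ACTIONS = np.arange(1, 11)                     # discrete actions 1..10

def single_peak(a, center=5):
    """Triangular reward peaking at `center` (width is an assumed choice)."""
    return np.maximum(0.0, 1.0 - np.abs(a - center) / 5.0)

def double_peak(a):
    """Local peak 0.7 at action 3, global peak 1.0 at action 8
    (peak widths are assumptions)."""
    local = 0.7 * np.maximum(0.0, 1.0 - np.abs(a - 3) / 2.5)
    glob = 1.0 * np.maximum(0.0, 1.0 - np.abs(a - 8) / 2.5)
    return np.maximum(local, glob)

def linear(a):
    """Reward grows monotonically with the action: a / 10."""
    return a / 10.0
```

Double Peak is the interesting case: a policy that concentrates on action 3 early can stop sampling action 8 altogether and never discover the global maximum.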

Controls

[Interactive demo: adjustable training parameters with live readouts (Steps, KL from Ref, Max π(a), Entropy), plus Policy Distribution π(a), Training History, and Step Detail panels.]

GRPO vs REINFORCE Comparison

Run 200 steps with each algorithm using the same reward function and compare convergence: GRPO, REINFORCE, and REINFORCE with baseline.

Key Concepts

Why group normalization replaces the critic: In standard actor-critic methods (like PPO), a learned value function provides the baseline for variance reduction. GRPO instead samples a group of G outputs for each input and normalizes the rewards within the group: Ai = (Ri − mean) / std. This group mean acts as a natural baseline, eliminating the need for a separate critic network and its associated training overhead.
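Concretely, the normalization is a one-liner. The reward values below are made up for illustration:

```python
import numpy as np

# Rewards for a group of G = 6 sampled outputs (illustrative values).
rewards = np.array([0.2, 0.9, 0.4, 0.9, 0.1, 0.5])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
# Above-average samples (the 0.9s) get positive advantages, below-average
# ones negative; the group mean plays the role a learned V(s) plays in PPO.
```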

Clipping: Like PPO, GRPO clips the probability ratio r = π(a)/πold(a) to the range [1−ε, 1+ε]. This prevents any single update from changing the policy too drastically, improving training stability.
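A minimal sketch of the clipping step, with made-up ratio values:

```python
import numpy as np

eps = 0.2
ratio = np.array([0.5, 0.95, 1.0, 1.1, 2.0])   # pi(a) / pi_old(a)
clipped = np.clip(ratio, 1 - eps, 1 + eps)
# Only ratios outside [0.8, 1.2] are modified:
# [0.5, 0.95, 1.0, 1.1, 2.0] -> [0.8, 0.95, 1.0, 1.1, 1.2]
```

Because the objective takes the minimum of the clipped and unclipped surrogate terms, clipping caps how much credit (or blame) any one sample can contribute to the update.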

                 REINFORCE       PPO            GRPO
Critic           None            Learned V(s)   None
Baseline         Optional mean   V(s)           Group mean/std
Clipping         No              Yes            Yes
KL Penalty       No              Optional       Yes
Samples/Update   1+              Batch          Group of G
References:
• Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning (arXiv:2402.03300) — introduced GRPO
• DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (arXiv:2501.12948)