GRPO — In-Class Activity

Hands-on exploration of Group Relative Policy Optimization, the training algorithm behind DeepSeek-R1.

This activity uses the GRPO interactive simulation to build intuition for how group sampling and relative advantage normalization train a policy without a critic network. You will observe single steps, watch convergence, experiment with group sizes, and compare GRPO to REINFORCE.

GRPO — Pick the Best Number

A policy (probability distribution over numbers 1–10) must learn to find a hidden reward peak. Explore how GRPO uses group sampling and relative normalization to converge, and compare its behavior to vanilla REINFORCE.
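Before opening the simulation, it can help to see the single-step mechanics in code. The sketch below is a minimal, illustrative NumPy version of one GRPO update for a softmax policy over the numbers 1–10: sample a group, z-score the rewards within the group (no learned critic or value baseline), and nudge the logits toward the group's winners. The reward shape, learning rate, and group size here are assumptions for illustration; the simulation's exact parameters may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def group_advantages(rewards):
    """GRPO's group-relative advantage: z-score each sample's
    reward against its own group's mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_step(logits, reward_fn, group_size=8, lr=0.5):
    """One GRPO update: sample a group of actions, normalize
    rewards within the group, push probability toward winners."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    actions = rng.choice(len(probs), size=group_size, p=probs)
    adv = group_advantages([reward_fn(a) for a in actions])
    # REINFORCE-style gradient on softmax logits, with the
    # group-relative advantage standing in for a critic.
    grad = np.zeros_like(logits)
    for a, A in zip(actions, adv):
        one_hot = np.eye(len(probs))[a]
        grad += A * (one_hot - probs)  # A * grad of log pi(a)
    return logits + lr * grad / group_size

# Illustrative hidden peak at 7: reward falls off with distance.
reward = lambda a: -abs((a + 1) - 7)

logits = np.zeros(10)  # uniform policy over 1..10
for _ in range(200):
    logits = grpo_step(logits, reward)
```

Note that when the policy has fully collapsed, every sample in a group is identical, the group std is zero, and the advantages vanish: GRPO stops updating on its own, which is one of the behaviors the entropy-decay step of the activity asks you to watch for.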

  • Predict how a uniform policy changes after one GRPO step
  • Observe group sampling and advantage normalization in detail
  • Watch convergence and track entropy decay
  • Experiment with group size and its effect on stability
  • Compare GRPO vs REINFORCE vs REINFORCE+Baseline
  • Connect group normalization to LLM training (DeepSeek-R1)
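For the comparison step above, the three algorithms differ only in how they turn a group of rewards into advantages. A hedged sketch, using an illustrative group of four samples (the function names are mine, not the simulation's):

```python
import numpy as np

def reinforce_adv(rewards):
    """Vanilla REINFORCE: the raw reward is the advantage
    (unbiased, but high variance)."""
    return np.asarray(rewards, dtype=float)

def baseline_adv(rewards):
    """REINFORCE + baseline: subtract the mean reward
    (centered, so losers are pushed down, but unscaled)."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def grpo_adv(rewards):
    """GRPO: subtract the group mean AND divide by the group std,
    so every group contributes updates on a comparable scale."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

group = [0.0, 0.0, 1.0, 1.0]  # illustrative group of 4 rewards
```

The scale invariance of the last variant is what connects to LLM training: in DeepSeek-R1-style runs, each prompt's group is normalized independently, so easy prompts (uniformly high reward) and hard prompts (uniformly low reward) produce gradients of similar magnitude.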
Estimated time: ~25 minutes