GRPO — In-Class Activity

This activity uses the GRPO (Group Relative Policy Optimization) interactive simulation to build intuition for how group sampling and group-relative advantage normalization train a policy without a critic network. You will observe single steps, watch convergence, experiment with group sizes, and compare GRPO to REINFORCE.

Session

GRPO — Pick the Best Number

A policy (a probability distribution over the numbers 1–10) must learn to find a hidden reward peak. Explore how GRPO uses group sampling and relative normalization to converge, and compare its behavior to vanilla REINFORCE.
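
For reference, the relative normalization mentioned above is usually written as follows. This is the standard GRPO form, not necessarily the exact formula the simulation uses: the symbols G (group size), r_i (reward of the i-th sample) and the small constant \epsilon are assumptions introduced here, and the simulation may use a different standard deviation estimator.

```latex
\hat{A}_i = \frac{r_i - \mu}{\sigma + \epsilon},
\qquad
\mu = \frac{1}{G} \sum_{j=1}^{G} r_j,
\qquad
\sigma = \sqrt{\frac{1}{G} \sum_{j=1}^{G} (r_j - \mu)^2}
```

Because each reward is compared only with the other samples in its own group, the group mean plays the role that a learned critic or baseline would otherwise play.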

Predict how a uniform policy changes after one GRPO step
Observe group sampling and advantage normalization in detail
Watch convergence and track entropy decay
Experiment with group size and its effect on stability
Compare GRPO vs REINFORCE vs REINFORCE+Baseline
Connect group normalization to LLM training (DeepSeek-R1)
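
The steps listed above can be sketched in a few lines of NumPy. This is a minimal sketch and not the simulation's actual code: the softmax parameterization, the Gaussian-shaped reward with its `peak` and `width` values, the learning rate, and every function name are hypothetical choices made for illustration. It also leaves out the clipping and KL-penalty terms used when GRPO is applied to LLM training.

```python
import numpy as np

def softmax(logits):
    """Turn logits into a probability distribution over the numbers 1-10."""
    z = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reward(numbers, peak=7, width=1.5):
    """Hypothetical hidden reward: highest at `peak`, falling off smoothly."""
    return np.exp(-((numbers - peak) ** 2) / (2 * width ** 2))

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: subtract the mean, divide by the std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def entropy(probs):
    """Shannon entropy (in nats) of the policy, useful for tracking convergence."""
    return -np.sum(probs * np.log(probs + 1e-12))

def grpo_step(logits, rng, group_size=8, lr=0.5):
    """One simplified GRPO update: sample a group, normalize, ascend the gradient."""
    probs = softmax(logits)
    idx = rng.choice(len(logits), size=group_size, p=probs)  # indices 0..9
    adv = group_advantages(reward(idx + 1))                  # numbers 1..10
    # For a softmax policy, grad of log pi(a) w.r.t. logits is onehot(a) - probs.
    onehot = np.eye(len(logits))[idx]
    grad = (adv[:, None] * (onehot - probs)).mean(axis=0)
    return logits + lr * grad

def train(steps=200, group_size=8, seed=0):
    """Start from a uniform policy and apply repeated GRPO steps."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(10)  # uniform policy over 1-10
    for _ in range(steps):
        logits = grpo_step(logits, rng, group_size=group_size)
    return softmax(logits)

if __name__ == "__main__":
    for g in (2, 8, 32):
        p = train(group_size=g)
        print(f"G={g:2d}  most likely number={p.argmax() + 1}  entropy={entropy(p):.3f}")
```

Running the script with different `group_size` values gives a rough feel for the group-size experiment. If every sample in a group earns the same reward, all advantages are zero and the policy does not move on that step.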

~25 minutes