GRPO Activity — Pick the Best Number


Activity Overview

Step 1A — Predict Before You Run (3 min)

Open the GRPO simulation. Use default settings (Single Peak at 7, G=8). Do NOT click any buttons yet.

  • Describe the initial policy distribution. What is π(a) for each action?
  • Predict: after one GRPO step, which actions will gain probability and which will lose? Why?
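Before predicting, it can help to see the starting point in code. This is a minimal sketch, assuming the simulation initializes a softmax policy over 10 discrete actions with all-zero logits (an assumption; check the simulation's own description):

```python
import numpy as np

# Assumed initial state: softmax over actions a = 1..10 with zero logits.
logits = np.zeros(10)
pi = np.exp(logits) / np.exp(logits).sum()  # softmax of equal logits -> uniform
print(pi)  # every action starts at probability 0.10
```

If this assumption holds, the initial distribution is uniform, so your prediction is entirely about where the reward signal will push probability mass.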
Step 1B — Single-Step Observation (5 min)

Click "One Step" and expand the "Step Detail" panel below the charts.

  • Which samples got positive advantages? Which got negative? Do the advantages sum approximately to zero?
  • How did the policy distribution shift? Was your prediction from 1A correct?
  • Click "Reset" and run one step again. Do you get the exact same samples? Why or why not?
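The zero-sum property you are asked to check falls directly out of the group-normalization formula. Here is a minimal sketch, assuming a binary reward that pays 1 only at a = 7 (the simulation's "Single Peak" reward may be shaped differently):

```python
import numpy as np

rng = np.random.default_rng(0)
G = 8
actions = rng.integers(1, 11, size=G)          # G samples from a uniform policy over 1..10
rewards = (actions == 7).astype(float)         # assumed reward: 1 only when a = 7

# Group-relative advantage: center by the group mean, scale by the group std.
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(adv.sum())  # ~0: mean-centering forces the advantages to sum to zero
```

Because each advantage is a mean-centered reward divided by a shared scalar, their sum is zero by construction; any deviation you see in the simulation is floating-point noise.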
Step 1C — Watch Convergence (5 min)

Reset, then click "Run N Steps" with N=50. Observe the Training History chart.

  • How many steps until π(7) exceeds 0.5? (Reset and try a few times to get a range.)
  • Describe the entropy trend. What does decreasing entropy mean for the policy?
  • Record the final π(a) after 50 steps in this table:
| a=1 | a=2 | a=3 | a=4 | a=5 | a=6 | a=7 | a=8 | a=9 | a=10 |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|
|     |     |     |     |     |     |     |     |     |      |
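To interpret the entropy curve, it helps to compute entropy for two extreme policies. A short sketch (the "peaked" distribution below is a hypothetical example, not the simulation's actual final state):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; lower means a more deterministic policy."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

uniform = np.full(10, 0.1)              # initial policy
peaked = np.array([0.01] * 9 + [0.91])  # hypothetical near-converged policy

print(entropy(uniform))  # ln(10) ~ 2.30, the maximum for 10 actions
print(entropy(peaked))   # much lower: mass is concentrated on one action
```

A decreasing entropy curve therefore means the policy is committing: it moves from spreading probability evenly toward placing nearly all mass on the rewarded action.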
Step 1D — Group Size Effect (5 min)

For each group size below, reset and run 50 steps. Record π(7) after 50 steps. Try each size 2–3 times and note the range.

|              | G = 4 | G = 16 | G = 32 |
|--------------|-------|--------|--------|
| π(7) trial 1 |       |        |        |
| π(7) trial 2 |       |        |        |
| π(7) trial 3 |       |        |        |
  • Which group size converges most consistently? Why does larger G help?
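One way to see why larger G helps: the group mean reward is GRPO's baseline, and its sampling noise shrinks as G grows. A sketch, assuming a binary reward so that a uniform 10-action policy hits the rewarded action with probability 0.1 (an assumption about the reward shape):

```python
import numpy as np

rng = np.random.default_rng(42)
p_hit = 0.1  # assumed chance a uniform policy samples the rewarded action

for G in (4, 16, 32):
    # Empirical spread of the group-mean reward (GRPO's baseline) across many groups.
    means = rng.binomial(G, p_hit, size=10_000) / G
    print(G, means.std())  # std shrinks roughly like 1/sqrt(G)
```

A noisier baseline means noisier advantages, so small groups produce erratic updates and less consistent convergence; the theoretical std here is sqrt(p(1-p)/G).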
Step 1E — GRPO vs REINFORCE (5 min)

Click "Run Comparison (200 steps)" in the GRPO simulation. Examine the two comparison charts.

  • Which algorithm reaches the highest max probability fastest?
  • Which has the smoothest entropy curve? Why is lower variance important during training?
  • Why does GRPO have lower variance than vanilla REINFORCE? (Hint: think about what the group normalization does.)
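The hint can be made concrete. Vanilla REINFORCE weights each log-probability gradient by the raw reward; GRPO weights it by the group-normalized advantage. A sketch with a hypothetical group of rewards that share a large common offset:

```python
import numpy as np

rng = np.random.default_rng(1)
rewards = rng.normal(loc=5.0, scale=1.0, size=8)  # hypothetical group rewards, big shared offset

# REINFORCE weight: the raw reward itself.
reinforce_w = rewards
# GRPO weight: subtract the group mean, divide by the group std.
grpo_w = (rewards - rewards.mean()) / rewards.std()

print(reinforce_w.mean())  # ~5: every sampled action gets pushed up regardless of quality
print(grpo_w.mean())       # 0: only better-than-the-group actions gain probability
```

Group normalization strips the shared offset out of the update, so the gradient depends only on relative quality within the group; that is the variance reduction a learned critic would otherwise provide.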
Step 1F — Reflection (2 min)
  • Explain in your own words: why does group normalization replace the need for a critic network?
  • How does this connect to training large language models like DeepSeek-R1? Why is eliminating the critic especially valuable for LLM training?
Bonus — KL Penalty (Optional)

Set β (KL penalty) to 0.30. Reset and run 50 steps.

  • What happens to convergence compared to β=0? Why might a KL penalty be useful in LLM training?
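To build intuition for what β is scaling, here is a sketch of the KL term between a drifted policy and its reference. The "peaked" distribution is a hypothetical trained policy, not the simulation's actual output:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) between two discrete distributions, in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log((p + 1e-12) / (q + 1e-12)))

ref = np.full(10, 0.1)                  # reference (initial) policy
peaked = np.array([0.01] * 9 + [0.91])  # hypothetical policy after training
beta = 0.30

print(beta * kl(peaked, ref))  # the penalty grows as the policy drifts from the reference
```

Because the penalty rises the further the policy moves from the reference, a large β slows convergence toward a sharp peak; in LLM training the same mechanism keeps the fine-tuned model from drifting too far from the base model's behavior.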