GRPO Activity — Pick the Best Number
Activity Overview
Step 1A — Predict Before You Run (3 min)
Open the GRPO simulation. Use default settings (Single Peak at 7, G=8). Do NOT click any buttons yet.
- Describe the initial policy distribution. What is π(a) for each action?
- Predict: after one GRPO step, which actions will gain probability and which will lose? Why?
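Before running the simulation, it can help to see what a "policy over 10 actions" looks like in code. The simulation's internal parameterization isn't specified, so this is a sketch under the common assumption that the policy is a softmax over one logit per action; with equal logits, every action gets the same probability.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0] * 10          # equal logits → uniform policy over a = 1..10
policy = softmax(logits)
print(policy[0])             # each action starts at probability 0.1
```

A GRPO step nudges the logits of high-advantage actions up and low-advantage actions down, which is what you are asked to predict here.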
Step 1B — Single Step Observation (5 min)
Click "One Step" and expand the "Step Detail" panel below the charts.
- Which samples got positive advantages? Which got negative? Do the advantages sum approximately to zero?
- How did the policy distribution shift? Was your prediction from 1A correct?
- Click "Reset" and run one step again. Do you get the exact same samples? Why or why not?
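The advantage question above has a mechanical answer you can verify yourself. A minimal sketch of GRPO-style advantages, assuming the usual mean/std normalization within a group (the reward values below are hypothetical):

```python
import statistics

def group_advantages(rewards):
    """Group-relative advantages: subtract the group mean and divide
    by the group standard deviation (GRPO-style normalization)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a zero-spread group
    return [(r - mu) / sigma for r in rewards]

# One hypothetical group of G=8 sampled rewards:
rewards = [0.0, 1.0, 0.5, 0.0, 1.0, 0.0, 0.5, 1.0]
adv = group_advantages(rewards)
print(adv)  # above-mean rewards → positive, below-mean → negative, sum ≈ 0
```

Because the group mean is subtracted from every sample, the advantages always sum to (approximately) zero: some samples must be pushed up and others pushed down.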
Step 1C — Watch Convergence (5 min)
Reset, then click "Run N Steps" with N=50. Observe the Training History chart.
- How many steps until π(7) exceeds 0.5? (Reset and try a few times to get a range)
- Describe the entropy trend. What does decreasing entropy mean for the policy?
- Record the final π(a) after 50 steps in this table:
| a=1 | a=2 | a=3 | a=4 | a=5 | a=6 | a=7 | a=8 | a=9 | a=10 |
|---|---|---|---|---|---|---|---|---|---|
| | | | | | | | | | |
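To interpret the entropy trend, recall what entropy measures for a categorical policy: how spread out the probability mass is. A short sketch (the exact distributions are illustrative, not taken from the simulation):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.1] * 10           # the initial policy over actions 1..10
peaked = [0.91] + [0.01] * 9   # a policy concentrated on one action
print(entropy(uniform))        # log(10) ≈ 2.303, the maximum for 10 actions
print(entropy(peaked))         # much lower
```

Decreasing entropy during training means the policy is committing: probability mass is concentrating on fewer actions.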
Step 1D — Group Size Effect (5 min)
For each group size below, reset and run 50 steps. Record π(7) after 50 steps. Try each size 2–3 times and note the range.
| | G = 4 | G = 16 | G = 32 |
|---|---|---|---|
| π(7) trial 1 | | | |
| π(7) trial 2 | | | |
| π(7) trial 3 | | | |
- Which group size converges most consistently? Why does larger G help?
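One reason larger G helps can be demonstrated directly: the group mean that GRPO uses as a baseline is itself a noisy estimate, and its noise shrinks as the group grows. A Monte Carlo sketch with hypothetical binary rewards (success probability 0.3 is an assumption, not a value from the simulation):

```python
import random
import statistics

random.seed(0)

def mean_reward_estimate(G):
    """Average of G simulated binary rewards (success prob. 0.3):
    the group-mean baseline that GRPO would subtract."""
    return statistics.fmean(random.random() < 0.3 for _ in range(G))

# The spread of the baseline estimate shrinks as the group grows:
spread = {}
for G in (4, 16, 32):
    estimates = [mean_reward_estimate(G) for _ in range(2000)]
    spread[G] = statistics.pstdev(estimates)
    print(f"G={G:2d}  std of group mean = {spread[G]:.3f}")
```

A noisier baseline means noisier advantages and noisier updates, which is why small groups converge less consistently.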
Step 1E — GRPO vs REINFORCE (5 min)
Click "Run Comparison (200 steps)" in the GRPO simulation. Examine the two comparison charts.
- Which algorithm reaches the highest max probability fastest?
- Which has the smoothest entropy curve? Why is lower variance important during training?
- Why does GRPO have lower variance than vanilla REINFORCE? (Hint: think about what the group normalization does.)
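The variance question can be made concrete with a toy experiment. The sketch below (not the simulation's code) uses a two-action policy and gives every reward a large common offset; vanilla REINFORCE multiplies the score by the raw reward, while the GRPO-style estimator first subtracts the group mean, which cancels the shared offset.

```python
import random
import statistics

random.seed(1)
p = 0.5  # π(a=1) for a two-action policy; score function = a − p

def grad_std(offset, use_baseline, n=3000, G=8):
    """Std of the single-group policy-gradient estimate when every
    reward shares a large common offset."""
    estimates = []
    for _ in range(n):
        acts = [random.random() < p for _ in range(G)]
        rews = [offset + (1.0 if a else 0.0) for a in acts]
        base = statistics.fmean(rews) if use_baseline else 0.0
        estimates.append(statistics.fmean(
            (r - base) * (a - p) for r, a in zip(rews, acts)))
    return statistics.pstdev(estimates)

std_reinforce = grad_std(10.0, use_baseline=False)
std_grpo = grad_std(10.0, use_baseline=True)
print(f"REINFORCE std: {std_reinforce:.3f}   GRPO-style std: {std_grpo:.3f}")
```

Both estimators point in the same direction on average, but the baseline-free one carries the full reward offset into every sample, inflating its variance dramatically.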
Step 1F — Reflection (2 min)
- Explain in your own words: why does group normalization replace the need for a critic network?
- How does this connect to training large language models like DeepSeek-R1? Why is eliminating the critic especially valuable for LLM training?
Bonus — KL Penalty (Optional)
Set β (KL penalty) to 0.30. Reset and run 50 steps.
- What happens to convergence compared to β=0? Why might a KL penalty be useful in LLM training?
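If it helps to see what β multiplies, here is a sketch of a KL penalty term between the current policy and a reference policy. The exact form the simulation uses isn't specified, and the distributions below are illustrative; this assumes the common setup where the penalty is β · KL(current ‖ reference).

```python
import math

def kl(p, q):
    """KL(p ‖ q) between two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

beta = 0.30
reference = [0.1] * 10               # e.g. the initial uniform policy
collapsed = [0.01] * 9 + [0.91]      # a policy that has nearly collapsed
penalty = beta * kl(collapsed, reference)
print(f"KL penalty term: {penalty:.3f}")
```

The further the policy drifts from the reference, the larger the penalty, so a nonzero β slows (and can cap) convergence toward a single action while keeping the policy close to the reference model.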