GRPO Activity — Pick the Best Number


Activity Overview

Step 1A — Predict Before You Run (3 min)

Open the GRPO simulation. Use default settings (Single Peak at 7, G=8). Do NOT click any buttons yet.

  • Describe the initial policy distribution. What is π(a) for each action?
  • Predict: after one GRPO step, which actions will gain probability and which will lose? Why?
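Before predicting, it can help to see the starting point in code. This is a minimal sketch, assuming the simulation initializes a softmax policy over 10 discrete actions with all-zero logits (an assumption; check the simulation's own description):

```python
import numpy as np

# Assumed initial state: softmax over actions a = 1..10 with zero logits.
logits = np.zeros(10)
pi = np.exp(logits) / np.exp(logits).sum()  # softmax of equal logits -> uniform
print(pi)  # every action starts at probability 0.10
```

If this assumption holds, the initial distribution is uniform, so your prediction is entirely about where the reward signal will push probability mass.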
Step 1B — Single-Step Observation (5 min)

Click "One Step" and expand the "Step Detail" panel below the charts.

  • Which samples got positive advantages? Which got negative? Do the advantages sum approximately to zero?
  • How did the policy distribution shift? Was your prediction from 1A correct?
  • Click "Reset" and run one step again. Do you get the exact same samples? Why or why not?
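The zero-sum property you are asked to check falls directly out of the group-normalization formula. Here is a minimal sketch, assuming a binary reward that pays 1 only at a = 7 (the simulation's "Single Peak" reward may be shaped differently):

```python
import numpy as np

rng = np.random.default_rng(0)
G = 8
actions = rng.integers(1, 11, size=G)          # G samples from a uniform policy over 1..10
rewards = (actions == 7).astype(float)         # assumed reward: 1 only when a = 7

# Group-relative advantage: center by the group mean, scale by the group std.
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(adv.sum())  # ~0: mean-centering forces the advantages to sum to zero
```

Because each advantage is a mean-centered reward divided by a shared scalar, their sum is zero by construction; any deviation you see in the simulation is floating-point noise.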
Step 1C — Watch Convergence (5 min)

Reset, then click "Run N Steps" with N=50. Observe the Training History chart.

  • How many steps until π(7) exceeds 0.5? (Reset and try a few times to get a range.)
  • Describe the entropy trend. What does decreasing entropy mean for the policy?
  • Record the final π(a) after 50 steps in this table:
| a=1 | a=2 | a=3 | a=4 | a=5 | a=6 | a=7 | a=8 | a=9 | a=10 |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|
|     |     |     |     |     |     |     |     |     |      |
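To interpret the entropy curve, it helps to compute entropy for two extreme policies. A short sketch (the "peaked" distribution below is a hypothetical example, not the simulation's actual final state):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; lower means a more deterministic policy."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

uniform = np.full(10, 0.1)              # initial policy
peaked = np.array([0.01] * 9 + [0.91])  # hypothetical near-converged policy

print(entropy(uniform))  # ln(10) ~ 2.30, the maximum for 10 actions
print(entropy(peaked))   # much lower: mass is concentrated on one action
```

A decreasing entropy curve therefore means the policy is committing: it moves from spreading probability evenly toward placing nearly all mass on the rewarded action.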
Step 1D — Group Size Effect (5 min)

For each group size below, reset and run 50 steps. Record π(7) after 50 steps. Try each size 2–3 times and note the range.

|              | G = 4 | G = 16 | G = 32 |
|--------------|-------|--------|--------|
| π(7) trial 1 |       |        |        |
| π(7) trial 2 |       |        |        |
| π(7) trial 3 |       |        |        |
  • Which group size converges most consistently? Why does larger G help?
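One way to see why larger G helps: the group mean reward is GRPO's baseline, and its sampling noise shrinks as G grows. A sketch, assuming a binary reward so that a uniform 10-action policy hits the rewarded action with probability 0.1 (an assumption about the reward shape):

```python
import numpy as np

rng = np.random.default_rng(42)
p_hit = 0.1  # assumed chance a uniform policy samples the rewarded action

for G in (4, 16, 32):
    # Empirical spread of the group-mean reward (GRPO's baseline) across many groups.
    means = rng.binomial(G, p_hit, size=10_000) / G
    print(G, means.std())  # std shrinks roughly like 1/sqrt(G)
```

A noisier baseline means noisier advantages, so small groups produce erratic updates and less consistent convergence; the theoretical std here is sqrt(p(1-p)/G).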
Step 1E — GRPO vs REINFORCE (5 min)

Click "Run Comparison (200 steps)" in the GRPO simulation. Examine the two comparison charts.

  • Which algorithm reaches the highest max probability fastest?
  • Which has the smoothest entropy curve? Why is lower variance important during training?
  • Why does GRPO have lower variance than vanilla REINFORCE? (Hint: think about what the group normalization does.)
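The hint can be made concrete. Vanilla REINFORCE weights each log-probability gradient by the raw reward; GRPO weights it by the group-normalized advantage. A sketch with a hypothetical group of rewards that share a large common offset:

```python
import numpy as np

rng = np.random.default_rng(1)
rewards = rng.normal(loc=5.0, scale=1.0, size=8)  # hypothetical group rewards, big shared offset

# REINFORCE weight: the raw reward itself.
reinforce_w = rewards
# GRPO weight: subtract the group mean, divide by the group std.
grpo_w = (rewards - rewards.mean()) / rewards.std()

print(reinforce_w.mean())  # ~5: every sampled action gets pushed up regardless of quality
print(grpo_w.mean())       # 0: only better-than-the-group actions gain probability
```

Group normalization strips the shared offset out of the update, so the gradient depends only on relative quality within the group; that is the variance reduction a learned critic would otherwise provide.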
Step 1F — Reflection (2 min)
  • Explain in your own words: why does group normalization replace the need for a critic network?
  • How does this connect to training large language models like DeepSeek-R1? Why is eliminating the critic especially valuable for LLM training?
Bonus — KL Penalty (Optional)

Set β (KL penalty) to 0.30. Reset and run 50 steps.

  • What happens to convergence compared to β=0? Why might a KL penalty be useful in LLM training?
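To build intuition for what β is scaling, here is a sketch of the KL term between a drifted policy and its reference. The "peaked" distribution is a hypothetical trained policy, not the simulation's actual output:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) between two discrete distributions, in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log((p + 1e-12) / (q + 1e-12)))

ref = np.full(10, 0.1)                  # reference (initial) policy
peaked = np.array([0.01] * 9 + [0.91])  # hypothetical policy after training
beta = 0.30

print(beta * kl(peaked, ref))  # the penalty grows as the policy drifts from the reference
```

Because the penalty rises the further the policy moves from the reference, a large β slows convergence toward a sharp peak; in LLM training the same mechanism keeps the fine-tuned model from drifting too far from the base model's behavior.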