Hands-on exploration of Group Relative Policy Optimization (GRPO), the reinforcement learning algorithm used to train DeepSeek-R1.
A policy (a probability distribution over the numbers 1–10) must learn to find a hidden reward peak. Explore how GRPO uses group sampling and group-relative reward normalization to converge, and compare its behavior to vanilla REINFORCE.
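The setup above can be sketched in a few lines of plain Python. This is a minimal illustration, not the simulation's actual code: the peak location (`PEAK = 7`), the Gaussian reward shape, the softmax-over-logits policy, and all hyperparameters are assumptions chosen for the example. It also leaves out the clipped probability ratio and KL penalty of full GRPO, keeping only the two ideas named above: sampling a group of actions and normalizing rewards within that group. The REINFORCE variant here uses raw rewards with no baseline.

```python
import math
import random

N_ACTIONS = 10   # the policy chooses a number from 1 to 10
PEAK = 7         # hypothetical hidden reward peak (assumption for this sketch)

def reward(action):
    # Assumed reward shape: a Gaussian bump centered on PEAK.
    return math.exp(-((action - PEAK) ** 2) / 2.0)

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_group(probs, group_size, rng):
    # Draw group_size actions (1-based) from the current policy.
    return rng.choices(range(1, N_ACTIONS + 1), weights=probs, k=group_size)

def group_relative_advantages(rewards, eps=1e-8):
    # Group-relative normalization: subtract the group mean,
    # then divide by the group standard deviation.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def policy_gradient_step(logits, actions, weights, lr):
    # For a softmax policy: d log pi(a) / d logit_k = 1[k == a] - pi(k).
    probs = softmax(logits)
    grad = [0.0] * N_ACTIONS
    for a, w in zip(actions, weights):
        for k in range(N_ACTIONS):
            indicator = 1.0 if k == a - 1 else 0.0
            grad[k] += w * (indicator - probs[k]) / len(actions)
    return [z + lr * g for z, g in zip(logits, grad)]

def train(use_group_normalization, steps=300, group_size=16, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * N_ACTIONS  # start from a uniform policy
    for _ in range(steps):
        actions = sample_group(softmax(logits), group_size, rng)
        rewards = [reward(a) for a in actions]
        if use_group_normalization:
            weights = group_relative_advantages(rewards)
        else:
            weights = rewards  # vanilla REINFORCE: raw rewards, no baseline
        logits = policy_gradient_step(logits, actions, weights, lr)
    return softmax(logits)

if __name__ == "__main__":
    for name, flag in [("GRPO-style", True), ("REINFORCE", False)]:
        probs = train(flag)
        best = max(range(N_ACTIONS), key=lambda k: probs[k]) + 1
        print(f"{name}: most likely number = {best}, p = {probs[best - 1]:.2f}")
```

Because the advantages are centered on the group mean, below-average samples are actively pushed down rather than merely reinforced less. When every sample in a group earns the same reward, all advantages are zero and the update vanishes. Changing `group_size`, `lr`, or `seed` is a quick way to see how sensitive each variant is.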