This activity is split into two sessions designed to be completed in order.
Each session has its own worksheet with auto-saving answers, progress tracking, and export.
Students work in pairs or small groups, making predictions before running experiments on the
interactive simulation.
Session 1
Explore why TD(0) typically learns faster than Monte Carlo on the 5-state Random Walk,
and reproduce the classic RMS-error comparison from the textbook.
- Predict which states change after one TD(0) episode
- Compare TD(0) and MC convergence after 100 episodes
- Run the Figure 6.3 comparison experiment
- Reflect on why bootstrapping helps despite using estimates
~25 minutes
Start Session 1 →
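Before starting, it may help to see the update Session 1 revolves around. The sketch below is a minimal, illustrative TD(0) value-learner for the 5-state Random Walk; the state indexing, step size, and function names are assumptions for this sketch, not the simulation's actual code.

```python
import random

# 5 non-terminal states (A..E as indices 0..4), terminals on both ends.
# Reward is 1 for exiting right, 0 for exiting left; true values are i/6.
N_STATES = 5
TRUE_V = [i / 6 for i in range(1, 6)]

def td0_episode(V, alpha=0.1):
    """Run one episode from the center state, updating V in place
    with the TD(0) rule V[s] += alpha * (r + V[s'] - V[s])."""
    s = N_STATES // 2  # start in the center state (C)
    while True:
        s2 = s + random.choice([-1, 1])
        if s2 == N_STATES:            # right terminal: reward 1
            V[s] += alpha * (1 - V[s])
            return
        if s2 < 0:                    # left terminal: reward 0
            V[s] += alpha * (0 - V[s])
            return
        # Non-terminal step: bootstrap from the current estimate V[s2],
        # so V[s] can improve before the episode's outcome is known.
        V[s] += alpha * (V[s2] - V[s])
        s = s2

random.seed(0)
V = [0.5] * N_STATES                  # all values initialized to 0.5
for _ in range(100):
    td0_episode(V)
rms = (sum((v - t) ** 2 for v, t in zip(V, TRUE_V)) / N_STATES) ** 0.5
```

Note the bootstrapping line: unlike Monte Carlo, which waits for the episode to end and uses the actual return, TD(0) updates every visited state immediately from the next state's estimate. That is the behavior the first worksheet asks you to predict.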
Session 2
Discover why SARSA learns a safe path while Q-Learning finds the optimal-but-risky
cliff-edge route, and understand the on-policy vs off-policy distinction.
- Predict shortest vs cautious paths before training
- Train both algorithms and compare learned paths
- Investigate how ε affects SARSA's behavior
- Compare heatmaps and fill in the on-policy / off-policy summary
~30 minutes
Start Session 2 →
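The on-policy / off-policy distinction in Session 2 comes down to a one-line difference in the update target. The sketch below isolates that difference; the function names, Q-table layout, and ε-greedy helper are illustrative assumptions, not the activity's implementation.

```python
import random

def epsilon_greedy(Q, s, n_actions, eps):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

def sarsa_target(Q, s2, a2, r, gamma):
    # On-policy: bootstrap from the action the behavior policy
    # actually takes next (a2), so exploratory slips near the cliff
    # drag down the values of cliff-edge states.
    return r + gamma * Q[s2][a2]

def q_learning_target(Q, s2, r, gamma):
    # Off-policy: bootstrap from the greedy (max) action regardless of
    # what is taken next, so values reflect the optimal path even if
    # the ε-greedy policy sometimes falls off the cliff following it.
    return r + gamma * max(Q[s2])

# Tiny illustration: same transition, different targets.
Q = {0: [0.0, 1.0], 1: [2.0, 5.0]}
t_sarsa = sarsa_target(Q, s2=1, a2=0, r=-1, gamma=1.0)      # uses Q[1][0] = 2
t_qlearn = q_learning_target(Q, s2=1, r=-1, gamma=1.0)      # uses max(Q[1]) = 5
```

This asymmetry is why, in the worksheet, SARSA settles on a cautious path away from the cliff while Q-Learning hugs the edge: SARSA's values account for its own exploration, Q-Learning's do not.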