Session 2: Cliff Walking — SARSA vs Q-Learning
Activity Overview
Time: ~30 minutes
Format: Pairs or small groups
Materials: Laptop with TDLearning.html — Cliff Walking tab
Method: Predict → Experiment → Explain
Type your answers directly into the text boxes below. Your work is auto-saved to your browser. When finished, click "Copy All Answers" or "Download as Text" to submit. Hover over underlined terms in the simulation for built-in hints.
Cliff Walking — SARSA vs Q-Learning
2A — Predict Before You Run
3 min
Open the Cliff Walking tab in the simulation. Study the grid.
The cliff gives −100 and sends you back to start. Normal steps cost −1.
- Describe what you think the shortest path from S to G looks like.
- Describe what a "cautious" path might look like if you were worried about accidentally stepping onto the cliff.
- Predict: Which path will SARSA learn? Which will Q-Learning learn? Why?
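Before running anything, it can help to see the reward rule as code. Below is a minimal sketch of the cliff-walk rewards described above; the grid dimensions and coordinates are assumptions (a standard 4×12 layout), so the simulation's actual layout may differ.

```python
# Sketch of the cliff-walk reward rule (4x12 grid is an ASSUMPTION,
# not confirmed by the simulation).
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
CLIFF = {(3, c) for c in range(1, 11)}   # bottom-row cells between S and G

def step_reward(cell):
    if cell in CLIFF:
        return -100      # fall: large penalty, agent is sent back to START
    return -1            # every ordinary step costs 1
```

With this rule, a shorter path pays fewer −1 penalties but may pass closer to the −100 cells.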
2B — Watch One Episode
5 min
- Set algorithm to SARSA, ε = 0.1, α = 0.5, speed ~30%
- Click "Run 1 Episode (animated)". Watch the agent stumble around.
- Click a cell near the cliff in the Q-Value Inspector.
- The agent moves randomly at first. Why?
Hint: what are the Q-values before any learning?
- What do the Q-values look like for a cell near the cliff early in training?
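A hint for the "why random at first?" question: ε-greedy action selection only looks non-random once some Q-values stand out. The sketch below (a plausible implementation, not necessarily what the simulation uses) shows that when all Q-values are still zero, even the "greedy" choice is a uniform tie-break.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.zeros(4)                    # up, down, left, right: all zero before learning

def epsilon_greedy(Q, epsilon, rng):
    """Random action with prob. epsilon; otherwise break ties among maxima randomly."""
    if rng.random() < epsilon:
        return rng.integers(len(Q))
    best = np.flatnonzero(Q == Q.max())   # early on, ALL four actions tie at 0
    return rng.choice(best)

# Early in training the greedy step is itself a uniform random draw,
# so the agent wanders regardless of epsilon.
action = epsilon_greedy(Q, epsilon=0.1, rng=rng)
```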
2C — Train Both Algorithms
7 min
- Reset. Set algorithm to "Both", ε = 0.1, α = 0.5
- Run 500 episodes
- Click "Show Learned Paths".
- Which algorithm's path is shorter?
- Which algorithm's path is safer during ε-greedy execution? Why?
- Click a cell adjacent to the cliff in the Q-Value Inspector. Compare SARSA's and Q-Learning's Q-values for the "Down" action (toward the cliff). What do you notice?
| Q-value for "Down" | SARSA | Q-Learning |
|---|---|---|
| Cell (2, 5) — above cliff | | |
| Cell (2, 8) — above cliff | | |
2D — The ε Experiment
5 min
- Reset. Set algorithm to "Both", α = 0.5
- Run 500 episodes with ε = 0.1. Note the paths and Figure 6.4 chart.
- Reset. Run 500 episodes with ε = 0.01. Note the paths again.
- Did SARSA's path change with smaller ε? Why?
- As ε → 0, should SARSA and Q-Learning converge to the same path? Why?
- Look at Figure 6.4 for both runs. Which ε gave SARSA better online rewards?
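A rough calculation can guide your answer about online rewards. Assuming the standard 4×12 cliff grid (11 cells directly above the cliff on the shortest path) and that an exploratory ε-greedy action picks uniformly among the 4 moves, the chance of surviving one greedy traversal of the short path is roughly:

```python
# Back-of-the-envelope estimate; grid size and the eps/4 fall
# probability per cliff-adjacent cell are ASSUMPTIONS.
eps = 0.1
p_fall = eps / 4                      # prob. of an exploratory "down" in one cell
cells_at_risk = 11                    # cells directly above the cliff
p_safe_traversal = (1 - p_fall) ** cells_at_risk
print(f"P(short path avoids the cliff) ~ {p_safe_traversal:.2f}")   # roughly 0.76
```

So with ε = 0.1, an agent hugging the cliff falls on roughly a quarter of its traversals; with ε = 0.01 that risk shrinks by about an order of magnitude, which is why the two ε settings produce such different online rewards.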
2E — The Key Insight
3 min
- Turn on Heatmap mode.
- Switch between SARSA and Q-Learning using the algorithm selector to compare the heatmaps.
- Where do the two heatmaps disagree most? Why?
2F — Final Reflection
2 min
Fill in the blanks:
SARSA learns the value of the policy it (follows / wishes it could follow).
Q-Learning learns the value of the ______ policy, regardless of what it actually does.
If I deploy my agent with ε-greedy exploration, I should prefer ______ because ______.
If I turn off exploration after training, I should prefer ______ because ______.
Bonus Challenges (if time permits)
- Can you find α and ε settings where Q-Learning gets better online rewards than SARSA? What does that tell you about the on-policy vs off-policy distinction?
- Train SARSA with ε = 0.5 (very high exploration). What path does it learn? How does the Figure 6.4 chart look compared to ε = 0.1?
Your answers are auto-saved in your browser. Use the buttons above to export for submission.
Previous session: ← Random Walk — TD vs Monte Carlo