Session 2: Cliff Walking — SARSA vs Q-Learning


Activity Overview

Time: ~30 minutes
Format: Pairs or small groups
Materials: Laptop with TDLearning.html — Cliff Walking tab
Method: Predict → Experiment → Explain

Type your answers directly into the text boxes below. Your work is auto-saved to your browser. When finished, click "Copy All Answers" or "Download as Text" to submit. Hover over underlined terms in the simulation for built-in hints.

2A — Predict Before You Run (3 min)

Open the Cliff Walking tab in the simulation. Study the grid.

The cliff gives −100 and sends you back to start. Normal steps cost −1.
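The reward structure above can be sketched in a few lines. The grid size, cell coordinates, and helper names here are illustrative guesses, not taken from TDLearning.html:

```python
# Minimal sketch of the Cliff Walking rewards: −1 per step,
# −100 and a reset to start for stepping onto the cliff.
ROWS, COLS = 4, 12                       # assumed grid size
START, GOAL = (3, 0), (3, 11)            # bottom-left and bottom-right corners
CLIFF = {(3, c) for c in range(1, 11)}   # bottom row between S and G

def step(state, move):
    """Apply a move (d_row, d_col); return (next_state, reward)."""
    r = min(max(state[0] + move[0], 0), ROWS - 1)   # clamp at walls
    c = min(max(state[1] + move[1], 0), COLS - 1)
    if (r, c) in CLIFF:
        return START, -100   # fell off the cliff: big penalty, back to start
    return (r, c), -1        # ordinary step costs -1
```

Notice that the shortest path runs directly along the cliff edge, which is exactly what makes this grid interesting.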

  • Describe what you think the shortest path from S to G looks like.
  • Describe what a "cautious" path might look like if you were worried about accidentally stepping onto the cliff.
  • Predict: Which path will SARSA learn? Which will Q-Learning learn? Why?
2B — Watch One Episode (5 min)
  • Set algorithm to SARSA, ε = 0.1, α = 0.5, speed ~30%
  • Click "Run 1 Episode (animated)". Watch the agent stumble around.
  • Click a cell near the cliff in the Q-Value Inspector.
  • The agent moves randomly at first. Why?
    Hint: what are the Q-values before any learning?
  • What do the Q-values look like for a cell near the cliff early in training?
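One likely explanation, sketched in code: if all Q-values start at zero, every action ties for "best", so even the greedy choice is a coin flip. (Zero initialization is an assumption about the simulation, not something confirmed by it.)

```python
import random

ACTIONS = ["Up", "Down", "Left", "Right"]
Q = {a: 0.0 for a in ACTIONS}  # fresh Q-values for one cell, before any learning

def epsilon_greedy(Q, eps=0.1):
    """Pick a random action with probability eps; otherwise pick greedily,
    breaking ties among equal-valued actions at random."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    best = max(Q.values())
    return random.choice([a for a, v in Q.items() if v == best])
```

With the all-zero table above, the tie-breaking set contains all four actions, so early behavior is pure random walk even when the agent is "exploiting".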
2C — Train Both Algorithms (7 min)
  • Reset. Set algorithm to "Both", ε = 0.1, α = 0.5
  • Run 500 episodes
  • Click "Show Learned Paths".
  • Which algorithm's path is shorter?
  • Which algorithm's path is safer during ε-greedy execution? Why?
  • Click a cell adjacent to the cliff in the Q-Value Inspector. Compare SARSA's and Q-Learning's Q-values for the "Down" action (toward the cliff). What do you notice?
  Q-value for "Down"          | SARSA | Q-Learning
  Cell (2, 5) — above cliff   |       |
  Cell (2, 8) — above cliff   |       |
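For reference while comparing inspector values, the standard textbook update rules differ in a single term. Variable names here are mine; the simulation's internals may differ:

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=1.0):
    # On-policy: the target uses the action the agent ACTUALLY takes next (a2),
    # so exploratory slips into the cliff feed back into the value of s.
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha=0.5, gamma=1.0):
    # Off-policy: the target uses the BEST next action, regardless of what
    # the agent actually does, so exploration noise is ignored.
    Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
```

The only difference is `Q[s2][a2]` versus `max(Q[s2].values())`, and that one term is what the table above should make visible.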
2D — The ε Experiment (5 min)
  • Reset. Set algorithm to "Both", α = 0.5
  • Run 500 episodes with ε = 0.1. Note the paths and Figure 6.4 chart.
  • Reset. Run 500 episodes with ε = 0.01. Note the paths again.
  • Did SARSA's path change with smaller ε? Why?
  • As ε → 0, should SARSA and Q-Learning converge to the same path? Why?
  • Look at Figure 6.4 for both runs. Which ε gave SARSA better online rewards?
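A rough expected-cost calculation hints at why the cliff-edge path stops being "cheapest" under exploration. This is a simplified model that assumes four actions and ignores repeated falls:

```python
def expected_step_cost(eps, p_down=0.25):
    """Expected reward for one step walked along the cliff edge under
    epsilon-greedy: with probability eps * p_down the random exploratory
    action is "Down" into the cliff (-100); otherwise the step costs -1."""
    p_fall = eps * p_down
    return p_fall * (-100) + (1 - p_fall) * (-1)
```

At ε = 0.1 each cliff-edge step is worth about −3.5 on average instead of −1, so under this model a longer detour can beat the short path while exploring; as ε shrinks toward 0 the penalty vanishes and the short path wins again.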
2E — The Key Insight (3 min)
  • Turn on Heatmap mode.
  • Switch between SARSA and Q-Learning using the algorithm selector to compare the heatmaps.
  • Where do the two heatmaps disagree most? Why?
2F — Final Reflection (2 min)

Fill in the blanks:

SARSA learns the value of the policy it ______ (follows / wishes it could follow).
Q-Learning learns the value of the ______ (greedy / behavior) policy regardless of what it actually does.
If I deploy my agent with ε-greedy exploration, I should prefer ______ because ______.
If I turn off exploration after training, I should prefer ______ because ______.

Bonus Challenges (if time permits)

Your answers are auto-saved in your browser. Use the buttons above to export for submission.
Previous session: ← Random Walk — TD vs Monte Carlo