Session 2: Cliff Walking — SARSA vs Q-Learning
Activity Overview
Time: ~30 minutes
Format: Pairs or small groups
Materials: Laptop with TDLearning.html — Cliff Walking tab
Method: Predict → Experiment → Explain
Type your answers directly into the text boxes below. Your work is auto-saved to your browser. When finished, click "Copy All Answers" or "Download as Text" to submit. Hover over underlined terms in the simulation for built-in hints.
Cliff Walking — SARSA vs Q-Learning
2A — Predict Before You Run
3 min
Open the Cliff Walking tab in the simulation. Study the grid.
The cliff gives −100 and sends you back to start. Normal steps cost −1.
- Describe what you think the shortest path from S to G looks like.
- Describe what a "cautious" path might look like if you were worried about accidentally stepping onto the cliff.
- Predict: Which path will SARSA learn? Which will Q-Learning learn? Why?
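Before running anything, it can help to see the reward rule as code. Below is a minimal sketch of the cliff-walk rewards described above; the grid dimensions and coordinates are assumptions (a standard 4×12 layout), so the simulation's actual layout may differ.

```python
# Sketch of the cliff-walk reward rule (4x12 grid is an ASSUMPTION,
# not confirmed by the simulation).
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
CLIFF = {(3, c) for c in range(1, 11)}   # bottom-row cells between S and G

def step_reward(cell):
    if cell in CLIFF:
        return -100      # fall: large penalty, agent is sent back to START
    return -1            # every ordinary step costs 1
```

With this rule, a shorter path pays fewer −1 penalties but may pass closer to the −100 cells.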
2B — Watch One Episode
5 min
- Set algorithm to SARSA, ε = 0.1, α = 0.5, speed ~30%
- Click "Run 1 Episode (animated)". Watch the agent stumble around.
- Click a cell near the cliff in the Q-Value Inspector.
- The agent moves randomly at first. Why?
Hint: what are the Q-values before any learning?
- What do the Q-values look like for a cell near the cliff early in training?
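A hint for the "why random at first?" question: ε-greedy action selection only looks non-random once some Q-values stand out. The sketch below (a plausible implementation, not necessarily what the simulation uses) shows that when all Q-values are still zero, even the "greedy" choice is a uniform tie-break.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.zeros(4)                    # up, down, left, right: all zero before learning

def epsilon_greedy(Q, epsilon, rng):
    """Random action with prob. epsilon; otherwise break ties among maxima randomly."""
    if rng.random() < epsilon:
        return rng.integers(len(Q))
    best = np.flatnonzero(Q == Q.max())   # early on, ALL four actions tie at 0
    return rng.choice(best)

# Early in training the greedy step is itself a uniform random draw,
# so the agent wanders regardless of epsilon.
action = epsilon_greedy(Q, epsilon=0.1, rng=rng)
```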
2C — Train Both Algorithms
7 min
- Reset. Set algorithm to "Both", ε = 0.1, α = 0.5
- Run 500 episodes
- Click "Show Learned Paths".
- Which algorithm's path is shorter?
- Which algorithm's path is safer during ε-greedy execution? Why?
- Click a cell adjacent to the cliff in the Q-Value Inspector. Compare SARSA's and Q-Learning's Q-values for the "Down" action (toward the cliff). What do you notice?
| Q-value for "Down" | SARSA | Q-Learning |
|---|---|---|
| Cell (2, 5) — above cliff | | |
| Cell (2, 8) — above cliff | | |
2D — The ε Experiment
5 min
- Reset. Set algorithm to "Both", α = 0.5
- Run 500 episodes with ε = 0.1. Note the paths and Figure 6.4 chart.
- Reset. Run 500 episodes with ε = 0.01. Note the paths again.
- Did SARSA's path change with smaller ε? Why?
- As ε → 0, should SARSA and Q-Learning converge to the same path? Why?
- Look at Figure 6.4 for both runs. Which ε gave SARSA better online rewards?
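A rough calculation can guide your answer about online rewards. Assuming the standard 4×12 cliff grid (11 cells directly above the cliff on the shortest path) and that an exploratory ε-greedy action picks uniformly among the 4 moves, the chance of surviving one greedy traversal of the short path is roughly:

```python
# Back-of-the-envelope estimate; grid size and the eps/4 fall
# probability per cliff-adjacent cell are ASSUMPTIONS.
eps = 0.1
p_fall = eps / 4                      # prob. of an exploratory "down" in one cell
cells_at_risk = 11                    # cells directly above the cliff
p_safe_traversal = (1 - p_fall) ** cells_at_risk
print(f"P(short path avoids the cliff) ~ {p_safe_traversal:.2f}")   # roughly 0.76
```

So with ε = 0.1, an agent hugging the cliff falls on roughly a quarter of its traversals; with ε = 0.01 that risk shrinks by about an order of magnitude, which is why the two ε settings produce such different online rewards.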
2E — The Key Insight
3 min
- Turn on Heatmap mode.
- Switch between SARSA and Q-Learning using the algorithm selector to compare the heatmaps.
- Where do the two heatmaps disagree most? Why?
2F — Final Reflection
2 min
Fill in the blanks:
SARSA learns the value of the policy it (follows / wishes it could follow).
Q-Learning learns the value of the ______ policy, regardless of what it actually does.
If I deploy my agent with ε-greedy exploration, I should prefer ______ because ______.
If I turn off exploration after training, I should prefer ______ because ______.
Bonus Challenges (if time permits)
- Can you find α and ε settings where Q-Learning gets better online rewards than SARSA? What does that tell you about the on-policy vs off-policy distinction?
- Train SARSA with ε = 0.5 (very high exploration). What path does it learn? How does the Figure 6.4 chart look compared to ε = 0.1?
Your answers are auto-saved in your browser. Use the buttons above to export for submission.
Previous session: ← Random Walk — TD vs Monte Carlo