Session 1: Random Walk — TD vs Monte Carlo
Activity Overview
Time: ~25 minutes
Format: Pairs or small groups
Materials: Laptop with TDLearning.html — Random Walk tab
Method: Predict → Experiment → Explain
Type your answers directly into the text boxes below. Your work is auto-saved to your browser. When finished, click "Copy All Answers" or "Download as Text" to submit. Hover over underlined terms in the simulation for built-in hints.
Random Walk — TD(0) vs Monte Carlo
1A — Predict Before You Run
3 min
Before touching anything, write down your answers:
- All 5 states start at V = 0.5. The true values are 1/6, 2/6, 3/6, 4/6, 5/6. Which state's estimate is already closest to its true value? Why?
- After running 1 episode of TD(0), will all 5 value estimates change, or only some? Why?
1B — Single Episode Observation
5 min
In the simulation's Random Walk tab:
- Set algorithm to TD(0), α = 0.1, speed to slow (~25%)
- Click "Run 1 Episode (animated)" and watch carefully
- Record: Which states did the agent visit? Which value estimates changed? Which didn't?
- Discuss with your partner: Why didn't unvisited states change? Compare with your prediction from 1A.
- Reset. Run 1 episode again. Did you get the same trajectory? Why or why not?
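To see what a single TD(0) episode does under the hood, here is a minimal sketch. The simulation's internal code isn't shown, so the setup below is an assumption based on the standard 5-state random walk (Sutton & Barto, Example 6.2): start in the centre, step left or right with equal probability, reward +1 only on the right exit, no discounting.

```python
import random

# Assumed setup (not the simulation's actual code): 5 states A..E,
# terminal on both ends, reward +1 only on the right exit, no discounting.
ALPHA = 0.1
STATES = ["A", "B", "C", "D", "E"]
V = {s: 0.5 for s in STATES}          # every estimate starts at 0.5

def run_td_episode(V, alpha=ALPHA):
    s = "C"                            # episodes start in the middle state
    visited = []
    while True:
        visited.append(s)
        j = STATES.index(s) + random.choice([-1, 1])
        if j < 0:                      # left terminal: reward 0, V(terminal) = 0
            V[s] += alpha * (0 - V[s])
            return visited
        if j >= len(STATES):           # right terminal: reward +1
            V[s] += alpha * (1 - V[s])
            return visited
        nxt = STATES[j]
        # TD(0) update: nudge V(s) toward r + V(s'); mid-episode reward is 0
        V[s] += alpha * (V[nxt] - V[s])
        s = nxt

random.seed(0)
visited = run_td_episode(V)
changed = [s for s in STATES if abs(V[s] - 0.5) > 1e-12]
# On the very first episode every mid-episode TD error is 0.5 - 0.5 = 0,
# so only the state adjacent to the terminal actually moves.
```

Note the consequence for 1B: on the first episode, unvisited states cannot change, and even most visited states don't, because their neighbours' estimates are still 0.5.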
1C — Convergence Race
5 min
- Reset. Set algorithm to TD(0), α = 0.1. Click "Run N Episodes" with N = 100.
- Record the 5 value estimates in the table below.
- Reset. Set algorithm to MC, α = 0.01. Run 100 episodes.
- Record the 5 value estimates.
| State | A | B | C | D | E |
|---|---|---|---|---|---|
| True value | 0.167 | 0.333 | 0.500 | 0.667 | 0.833 |
| TD(0) after 100 ep. | | | | | |
| MC after 100 ep. | | | | | |
- Which method got closer to the true values in 100 episodes?
- Look at the Figure 6.2 chart: how have the value estimates moved from their initial 0.5 toward the true-value line?
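If you want to sanity-check your table offline, here is a hedged sketch of the same race. The environment details are assumptions based on the standard 5-state random walk, and the random seed here differs from the simulation's, so expect your recorded numbers to differ.

```python
import random

# Offline re-run of the 1C race: TD(0) at alpha = 0.1 vs constant-alpha MC
# at alpha = 0.01, 100 episodes each, on the assumed 5-state random walk.
TRUE_V = [(i + 1) / 6 for i in range(5)]      # 1/6 .. 5/6 for A..E

def episode():
    """One walk from the centre: returns (visited states, terminal reward)."""
    s, traj = 2, []
    while 0 <= s <= 4:
        traj.append(s)
        s += random.choice([-1, 1])
    return traj, (1 if s == 5 else 0)

def run(method, alpha, n_episodes=100, seed=0):
    random.seed(seed)
    V = [0.5] * 5
    for _ in range(n_episodes):
        traj, r = episode()
        if method == "TD":
            for t, s in enumerate(traj):       # online TD(0) along the walk
                target = V[traj[t + 1]] if t + 1 < len(traj) else r
                V[s] += alpha * (target - V[s])
        else:                                  # every-visit constant-alpha MC;
            for s in traj:                     # return G = r (undiscounted,
                V[s] += alpha * (r - V[s])     # only a terminal reward)
    return V

def rms(V):
    return (sum((v - t) ** 2 for v, t in zip(V, TRUE_V)) / 5) ** 0.5

td, mc = run("TD", 0.1), run("MC", 0.01)
print("TD estimates:", [round(v, 3) for v in td], "RMS:", round(rms(td), 3))
print("MC estimates:", [round(v, 3) for v in mc], "RMS:", round(rms(mc), 3))
```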
1D — The Comparison Experiment
5 min
- Reset. Click "Run Comparison (100 runs)" and wait for the Figure 6.3 chart.
Examine the RMS error chart. Write a 1-sentence answer for each:
- At any given episode count, which method (TD or MC) has lower RMS error?
- What happens to MC when you increase α? Does it always help?
- Why can TD tolerate a larger α than MC before its error degrades?
Hint: hover over "lower variance" in the simulation's bottom section.
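The comparison button averages many independent runs; a sketch of what that computation likely looks like is below. The averaging scheme and environment are assumptions modelled on the standard Figure 6.3-style experiment, not the simulation's actual code.

```python
import random

# Average RMS error after each episode, over independent runs, as in a
# Figure 6.3-style chart. Environment: assumed 5-state random walk.
TRUE_V = [(i + 1) / 6 for i in range(5)]

def episode():
    s, traj = 2, []
    while 0 <= s <= 4:
        traj.append(s)
        s += random.choice([-1, 1])
    return traj, (1 if s == 5 else 0)

def rms(V):
    return (sum((v - t) ** 2 for v, t in zip(V, TRUE_V)) / 5) ** 0.5

def avg_rms_curve(method, alpha, n_runs=100, n_episodes=100):
    curve = [0.0] * n_episodes
    for run in range(n_runs):
        random.seed(run)                       # fresh value table per run
        V = [0.5] * 5
        for ep in range(n_episodes):
            traj, r = episode()
            if method == "TD":
                for t, s in enumerate(traj):
                    target = V[traj[t + 1]] if t + 1 < len(traj) else r
                    V[s] += alpha * (target - V[s])
            else:                              # every-visit constant-alpha MC
                for s in traj:
                    V[s] += alpha * (r - V[s])
            curve[ep] += rms(V) / n_runs       # accumulate the run average
    return curve

td_curve = avg_rms_curve("TD", 0.1)
mc_curve = avg_rms_curve("MC", 0.01)
# Compare the two curves episode by episode; the final entries show which
# method ended with lower average error in this particular sketch.
```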
1E — Reflection
2 min
- In your own words: The TD update uses V(S') — an estimate that might be wrong. Why is using a "wrong" estimate still better than waiting for the true return like MC does?
Bonus Challenge (if time permits)
- In the Random Walk, set MC with α = 0.5 and run 100 episodes. What happens? Why?
Hint: hover over "zero bias" and "lower variance" in the simulation.
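A quick offline look at the bonus question, under the same assumed random-walk setup as above: with a large step size, constant-α MC keeps chasing individual returns (each return here is exactly 0 or 1), so the estimates oscillate instead of settling.

```python
import random

# Constant-alpha MC with alpha = 0.5 on the assumed 5-state random walk:
# each update jumps halfway toward a return of 0 or 1.
def episode():
    s, traj = 2, []
    while 0 <= s <= 4:
        traj.append(s)
        s += random.choice([-1, 1])
    return traj, (1 if s == 5 else 0)

random.seed(1)
V = [0.5] * 5
centre_history = []                   # track V(C), whose true value is 0.5
for _ in range(100):
    traj, r = episode()
    for s in traj:                    # every-visit MC, alpha = 0.5
        V[s] += 0.5 * (r - V[s])
    centre_history.append(V[2])

late_spread = max(centre_history[50:]) - min(centre_history[50:])
# late_spread stays large: V(C) keeps bouncing between values well above
# and well below 0.5 even after many episodes.
```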
Your answers are auto-saved in your browser. Use the buttons above to export for submission.
Next session: Cliff Walking — SARSA vs Q-Learning →