Session 1: Random Walk — TD vs Monte Carlo


Activity Overview

Time: ~25 minutes
Format: Pairs or small groups
Materials: Laptop with TDLearning.html — Random Walk tab
Method: Predict → Experiment → Explain

Type your answers directly into the text boxes below. Your work is auto-saved to your browser. When finished, click "Copy All Answers" or "Download as Text" to submit. Hover over underlined terms in the simulation for built-in hints.

Random Walk — TD(0) vs Monte Carlo
1A — Predict Before You Run (3 min)

Before touching anything, write down your answers:

  • All 5 states start at V = 0.5. The true values are 1/6, 2/6, 3/6, 4/6, 5/6. Which state's estimate is already closest to its true value? Why?
  • After running 1 episode of TD(0), will all 5 value estimates change, or only some? Why?
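If you want to check the true values yourself: each state's value is the probability of eventually exiting on the right, so every non-terminal state satisfies V(s) = 0.5·V(left) + 0.5·V(right), with the left terminal worth 0 and the right terminal worth 1. A minimal sketch that solves this linear system (indexing A–E as 0–4 is my own convention, not taken from the simulation):

```python
import numpy as np

# 5-state random walk: V(s) = 0.5*V(s-1) + 0.5*V(s+1),
# with V(left terminal) = 0 and V(right terminal) = 1.
n = 5
A = np.zeros((n, n))
b = np.zeros(n)
for s in range(n):
    A[s, s] = 1.0
    if s > 0:
        A[s, s - 1] = -0.5   # left neighbour is a state
    if s < n - 1:
        A[s, s + 1] = -0.5   # right neighbour is a state
b[n - 1] = 0.5               # right neighbour of E is the +1 terminal

true_values = np.linalg.solve(A, b)
print(true_values)  # [1/6, 2/6, 3/6, 4/6, 5/6]
```

Note that state C sits exactly at 3/6 = 0.5, which is also the initial estimate.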
1B — Single Episode Observation (5 min)

In the simulation's Random Walk tab:

  • Set algorithm to TD(0), α = 0.1, speed to slow (~25%)
  • Click "Run 1 Episode (animated)" and watch carefully
  • Record: Which states did the agent visit? Which value estimates changed? Which didn't?
  • Discuss with your partner: Why didn't unvisited states change? Compare with your prediction from 1A.
  • Reset. Run 1 episode again. Did you get the same trajectory? Why or why not?
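To replay 1B offline, here is a minimal sketch of one TD(0) episode, assuming the standard setup (start in C, reward +1 only on exiting right, γ = 1, α = 0.1). The 0–4 indexing of A–E is an assumption, not the simulation's API:

```python
import random

def run_td0_episode(V, alpha=0.1):
    """One TD(0) episode on the 5-state random walk (states 0..4 = A..E).
    Start in C (index 2); reward is 0 except +1 for exiting right; gamma = 1.
    Returns the set of states visited."""
    s = 2
    visited = set()
    while True:
        visited.add(s)
        s_next = s + random.choice([-1, 1])
        if s_next < 0:                      # left terminal: reward 0
            reward, v_next, done = 0.0, 0.0, True
        elif s_next > 4:                    # right terminal: reward +1
            reward, v_next, done = 1.0, 0.0, True
        else:
            reward, v_next, done = 0.0, V[s_next], False
        # TD(0) update: V(S) <- V(S) + alpha * (R + V(S') - V(S))
        V[s] += alpha * (reward + v_next - V[s])
        if done:
            return visited
        s = s_next

V = [0.5] * 5
visited = run_td0_episode(V)
# States never visited still sit exactly at 0.5 — TD(0) only
# updates states the agent actually passed through.
```

Running this twice gives different trajectories (and different visited sets) because each step is a coin flip, matching what you should see in the simulation.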
1C — Convergence Race (5 min)
  • Reset. Set algorithm to TD(0), α = 0.1. Click "Run N Episodes" with N = 100.
  • Record the 5 value estimates in the table below.
  • Reset. Set algorithm to MC, α = 0.01. Run 100 episodes.
  • Record the 5 value estimates.
State               | A     | B     | C     | D     | E
True value          | 0.167 | 0.333 | 0.500 | 0.667 | 0.833
TD(0) after 100 ep. |       |       |       |       |
MC after 100 ep.    |       |       |       |       |
  • Which method got closer to the true values in 100 episodes?
  • Look at the Figure 6.2 chart — what do you see?
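For reference, the MC side of this race can be sketched as constant-α every-visit Monte Carlo: each state visited in an episode moves toward the episode's final return G. The simulation may use a first-visit variant, and the 0–4 indexing of A–E is an assumption:

```python
import random

def random_walk_episode():
    """Run one episode from C; return (states visited, return G).
    G = 1.0 if the walk exits right, else 0.0 (gamma = 1, no other rewards)."""
    s, traj = 2, []
    while 0 <= s <= 4:
        traj.append(s)
        s += random.choice([-1, 1])
    return traj, (1.0 if s > 4 else 0.0)

def mc_update(V, alpha=0.01):
    # Constant-alpha every-visit MC: every visited state moves toward G.
    traj, G = random_walk_episode()
    for s in traj:
        V[s] += alpha * (G - V[s])

V = [0.5] * 5
for _ in range(100):
    mc_update(V)
```

Note the key difference from TD(0): no update happens until the episode ends, and every visited state gets the same target G.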
1D — The Comparison Experiment (5 min)
  • Reset. Click "Run Comparison (100 runs)" and wait for the Figure 6.3 chart.

Examine the RMS error chart. Write a 1-sentence answer for each:

  • At any given episode count, which method (TD or MC) has lower RMS error?
  • What happens to MC when you increase α? Does it always help?
  • Why does TD work better with larger α than MC can handle?
    Hint: hover over "lower variance" in the simulation's bottom section.
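The comparison chart averages RMS error (root-mean-square distance between the 5 estimates and the true values) over many independent runs. A rough offline sketch of that measurement for TD(0) — parameters, indexing, and run counts are assumptions, not the simulation's exact settings:

```python
import random, math

TRUE_V = [i / 6 for i in range(1, 6)]  # true values of A..E

def rms(V):
    return math.sqrt(sum((v - t) ** 2 for v, t in zip(V, TRUE_V)) / 5)

def td0_episode(V, alpha):
    """One TD(0) episode from C; reward +1 only on right exit, gamma = 1."""
    s = 2
    while True:
        s_next = s + random.choice([-1, 1])
        if s_next < 0:
            V[s] += alpha * (0.0 - V[s]); return
        if s_next > 4:
            V[s] += alpha * (1.0 - V[s]); return
        V[s] += alpha * (V[s_next] - V[s])
        s = s_next

def avg_rms(alpha, runs=100, episodes=100):
    """RMS error after `episodes`, averaged over `runs` independent runs."""
    total = 0.0
    for _ in range(runs):
        V = [0.5] * 5
        for _ in range(episodes):
            td0_episode(V, alpha)
        total += rms(V)
    return total / runs

print(avg_rms(0.1))
```

Averaging over runs is what smooths out the single-run noise you saw in 1C; a single run's error curve is far too jagged to compare methods fairly.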
1E — Reflection (2 min)
  • In your own words: The TD update uses V(S') — an estimate that might be wrong. Why is using a "wrong" estimate still better than waiting for the true return like MC does?

Bonus Challenge (if time permits)

Your answers are auto-saved in your browser. Use the buttons above to export for submission.
Next session: Cliff Walking — SARSA vs Q-Learning →