Session 1: Random Walk — TD vs Monte Carlo
Activity Overview
Time: ~25 minutes
Format: Pairs or small groups
Materials: Laptop with TDLearning.html — Random Walk tab
Method: Predict → Experiment → Explain
Type your answers directly into the text boxes below. Your work is auto-saved to your browser. When finished, click "Copy All Answers" or "Download as Text" to submit. Hover over underlined terms in the simulation for built-in hints.
Random Walk — TD(0) vs Monte Carlo
1A — Predict Before You Run
3 min
Before touching anything, write down your answers:
- All 5 states start at V = 0.5. The true values are 1/6, 2/6, 3/6, 4/6, 5/6. Which state's estimate is already closest to its true value? Why?
- After running 1 episode of TD(0), will all 5 value estimates change, or only some? Why?
1B — Single Episode Observation
5 min
In the simulation's Random Walk tab:
- Set algorithm to TD(0), α = 0.1, speed to slow (~25%)
- Click "Run 1 Episode (animated)" and watch carefully
- Record: Which states did the agent visit? Which value estimates changed? Which didn't?
- Discuss with your partner: Why didn't unvisited states change? Compare with your prediction from 1A.
- Reset. Run 1 episode again. Did you get the same trajectory? Why or why not?
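To see what a single TD(0) episode does under the hood, here is a minimal sketch. The simulation's internal code isn't shown, so the setup below is an assumption based on the standard 5-state random walk (Sutton & Barto, Example 6.2): start in the centre, step left or right with equal probability, reward +1 only on the right exit, no discounting.

```python
import random

# Assumed setup (not the simulation's actual code): 5 states A..E,
# terminal on both ends, reward +1 only on the right exit, no discounting.
ALPHA = 0.1
STATES = ["A", "B", "C", "D", "E"]
V = {s: 0.5 for s in STATES}          # every estimate starts at 0.5

def run_td_episode(V, alpha=ALPHA):
    s = "C"                            # episodes start in the middle state
    visited = []
    while True:
        visited.append(s)
        j = STATES.index(s) + random.choice([-1, 1])
        if j < 0:                      # left terminal: reward 0, V(terminal) = 0
            V[s] += alpha * (0 - V[s])
            return visited
        if j >= len(STATES):           # right terminal: reward +1
            V[s] += alpha * (1 - V[s])
            return visited
        nxt = STATES[j]
        # TD(0) update: nudge V(s) toward r + V(s'); mid-episode reward is 0
        V[s] += alpha * (V[nxt] - V[s])
        s = nxt

random.seed(0)
visited = run_td_episode(V)
changed = [s for s in STATES if abs(V[s] - 0.5) > 1e-12]
# On the very first episode every mid-episode TD error is 0.5 - 0.5 = 0,
# so only the state adjacent to the terminal actually moves.
```

Note the consequence for 1B: on the first episode, unvisited states cannot change, and even most visited states don't, because their neighbours' estimates are still 0.5.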
1C — Convergence Race
5 min
- Reset. Set algorithm to TD(0), α = 0.1. Click "Run N Episodes" with N = 100.
- Record the 5 value estimates in the table below.
- Reset. Set algorithm to MC, α = 0.01. Run 100 episodes.
- Record the 5 value estimates.
| State | A | B | C | D | E |
|---|---|---|---|---|---|
| True value | 0.167 | 0.333 | 0.500 | 0.667 | 0.833 |
| TD(0) after 100 ep. | | | | | |
| MC after 100 ep. | | | | | |
- Which method got closer to the true values in 100 episodes?
- Look at the Figure 6.2 chart: how have the value estimates moved from their initial 0.5 toward the true-value line?
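If you want to sanity-check your table offline, here is a hedged sketch of the same race. The environment details are assumptions based on the standard 5-state random walk, and the random seed here differs from the simulation's, so expect your recorded numbers to differ.

```python
import random

# Offline re-run of the 1C race: TD(0) at alpha = 0.1 vs constant-alpha MC
# at alpha = 0.01, 100 episodes each, on the assumed 5-state random walk.
TRUE_V = [(i + 1) / 6 for i in range(5)]      # 1/6 .. 5/6 for A..E

def episode():
    """One walk from the centre: returns (visited states, terminal reward)."""
    s, traj = 2, []
    while 0 <= s <= 4:
        traj.append(s)
        s += random.choice([-1, 1])
    return traj, (1 if s == 5 else 0)

def run(method, alpha, n_episodes=100, seed=0):
    random.seed(seed)
    V = [0.5] * 5
    for _ in range(n_episodes):
        traj, r = episode()
        if method == "TD":
            for t, s in enumerate(traj):       # online TD(0) along the walk
                target = V[traj[t + 1]] if t + 1 < len(traj) else r
                V[s] += alpha * (target - V[s])
        else:                                  # every-visit constant-alpha MC;
            for s in traj:                     # return G = r (undiscounted,
                V[s] += alpha * (r - V[s])     # only a terminal reward)
    return V

def rms(V):
    return (sum((v - t) ** 2 for v, t in zip(V, TRUE_V)) / 5) ** 0.5

td, mc = run("TD", 0.1), run("MC", 0.01)
print("TD estimates:", [round(v, 3) for v in td], "RMS:", round(rms(td), 3))
print("MC estimates:", [round(v, 3) for v in mc], "RMS:", round(rms(mc), 3))
```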
1D — The Comparison Experiment
5 min
- Reset. Click "Run Comparison (100 runs)" and wait for the Figure 6.3 chart.
Examine the RMS error chart. Write a 1-sentence answer for each:
- At any given episode count, which method (TD or MC) has lower RMS error?
- What happens to MC when you increase α? Does it always help?
- Why can TD tolerate a larger α than MC before its error degrades?
Hint: hover over "lower variance" in the simulation's bottom section.
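The comparison button averages many independent runs; a sketch of what that computation likely looks like is below. The averaging scheme and environment are assumptions modelled on the standard Figure 6.3-style experiment, not the simulation's actual code.

```python
import random

# Average RMS error after each episode, over independent runs, as in a
# Figure 6.3-style chart. Environment: assumed 5-state random walk.
TRUE_V = [(i + 1) / 6 for i in range(5)]

def episode():
    s, traj = 2, []
    while 0 <= s <= 4:
        traj.append(s)
        s += random.choice([-1, 1])
    return traj, (1 if s == 5 else 0)

def rms(V):
    return (sum((v - t) ** 2 for v, t in zip(V, TRUE_V)) / 5) ** 0.5

def avg_rms_curve(method, alpha, n_runs=100, n_episodes=100):
    curve = [0.0] * n_episodes
    for run in range(n_runs):
        random.seed(run)                       # fresh value table per run
        V = [0.5] * 5
        for ep in range(n_episodes):
            traj, r = episode()
            if method == "TD":
                for t, s in enumerate(traj):
                    target = V[traj[t + 1]] if t + 1 < len(traj) else r
                    V[s] += alpha * (target - V[s])
            else:                              # every-visit constant-alpha MC
                for s in traj:
                    V[s] += alpha * (r - V[s])
            curve[ep] += rms(V) / n_runs       # accumulate the run average
    return curve

td_curve = avg_rms_curve("TD", 0.1)
mc_curve = avg_rms_curve("MC", 0.01)
# Compare the two curves episode by episode; the final entries show which
# method ended with lower average error in this particular sketch.
```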
1E — Reflection
2 min
- In your own words: The TD update uses V(S') — an estimate that might be wrong. Why is using a "wrong" estimate still better than waiting for the true return like MC does?
Bonus Challenge (if time permits)
- In the Random Walk, set MC with α = 0.5 and run 100 episodes. What happens? Why?
Hint: hover over "zero bias" and "lower variance" in the simulation.
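A quick offline look at the bonus question, under the same assumed random-walk setup as above: with a large step size, constant-α MC keeps chasing individual returns (each return here is exactly 0 or 1), so the estimates oscillate instead of settling.

```python
import random

# Constant-alpha MC with alpha = 0.5 on the assumed 5-state random walk:
# each update jumps halfway toward a return of 0 or 1.
def episode():
    s, traj = 2, []
    while 0 <= s <= 4:
        traj.append(s)
        s += random.choice([-1, 1])
    return traj, (1 if s == 5 else 0)

random.seed(1)
V = [0.5] * 5
centre_history = []                   # track V(C), whose true value is 0.5
for _ in range(100):
    traj, r = episode()
    for s in traj:                    # every-visit MC, alpha = 0.5
        V[s] += 0.5 * (r - V[s])
    centre_history.append(V[2])

late_spread = max(centre_history[50:]) - min(centre_history[50:])
# late_spread stays large: V(C) keeps bouncing between values well above
# and well below 0.5 even after many episodes.
```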
Your answers are auto-saved in your browser. Use the buttons above to export for submission.
Next session: Cliff Walking — SARSA vs Q-Learning →