TD Learning — Random Walk & Cliff Walking
Interactive visualizations of Temporal Difference methods from Chapter 6 of Sutton & Barto. In-Class Activity.
Random Walk (Example 6.2)
A Markov reward process with 5 non-terminal states (A–E) and two terminal states. A Markov reward process is a Markov chain extended with a reward function: a tuple (S, P, R, γ) where S is a finite set of states, P is a state transition probability matrix, R is a reward function, and γ is a discount factor; there are no actions, and state transitions happen according to fixed probabilities. The agent starts at C and moves left or right with equal probability. Reaching the right terminal gives reward +1; all other rewards are 0.
True values: 1/6, 2/6, 3/6, 4/6, 5/6. The true value V(s) of a state is the expected return (total future reward) starting from that state under the given policy. Since the agent walks randomly with γ = 1, the true value of each state equals the probability of reaching the right terminal (+1) from that state.
These probabilities are the Gambler's Ruin result. Label the positions 0 (left terminal), 1 (A), 2 (B), …, 5 (E), 6 (right terminal), and let p(k) be the probability of reaching position 6 from position k, with boundary conditions p(0) = 0 and p(6) = 1. At any interior position the agent goes left or right with equal probability, giving the recurrence p(k) = ½p(k−1) + ½p(k+1). Rearranging: p(k+1) − p(k) = p(k) − p(k−1), so the differences are constant and p(k) is linear in k. Fitting the boundaries gives p(k) = k/6: state A (k=1) → 1/6, B → 2/6, …, E → 5/6.
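A quick numerical check of the Gambler's Ruin result p(k) = k/6 (a sketch, not the page's own code): sweep the recurrence to its fixed point.

```python
# Verify p(k) = k/6 for the 5-state random walk by iterating the
# recurrence p(k) = 0.5*p(k-1) + 0.5*p(k+1) until it converges.
def true_values(n_states=5, sweeps=10_000):
    p = [0.0] * (n_states + 2)       # positions 0..6; boundaries fixed
    p[-1] = 1.0                      # right terminal pays +1
    for _ in range(sweeps):
        for k in range(1, n_states + 1):
            p[k] = 0.5 * (p[k - 1] + p[k + 1])
    return p[1:-1]                   # values for states A..E

print(true_values())  # ≈ [1/6, 2/6, 3/6, 4/6, 5/6]
```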
Figure 6.2 — Estimated Value
What to look for: Each colored line shows the value estimates after a certain number of episodes. Watch how the lines gradually approach the dotted “True values” line as episodes increase. After 0 episodes all estimates start flat at 0.5. After just 1 episode the curve already tilts in the right direction. By 100 episodes it nearly matches the true values. When “Both” is selected, solid lines are TD and dashed lines are MC — compare how quickly each method’s estimates approach the truth.
True values: The dotted line shows 1/6, 2/6, 3/6, 4/6, 5/6 — rising linearly because states closer to the right terminal (+1 reward) have a higher probability of reaching it.
Figure 6.3 — RMS Error
What to look for: The y-axis is RMS error (lower = better). Watch how quickly each curve drops toward zero. TD lines (blue shades) typically fall faster than MC lines (red shades) at comparable α values — this is TD’s key advantage in this task. Also notice the effect of α: too large and the curve may oscillate or plateau; too small and learning is slow. Click “Run Comparison” to see averaged results across 100 runs for multiple α values.
RMS Error = √[(1/n) · Σ(V̂(s) − V(s))²], measuring how far learned estimates are from the true values. Lower is better.
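A direct transcription of this formula (function name is our own):

```python
import math

# RMS error between learned estimates and the known true values,
# as plotted in Figure 6.3.
def rms_error(estimates, true_values):
    n = len(true_values)
    return math.sqrt(sum((v_hat - v) ** 2
                         for v_hat, v in zip(estimates, true_values)) / n)

TRUE_V = [1/6, 2/6, 3/6, 4/6, 5/6]
# Error of the initial all-0.5 estimates used in the Random Walk demo:
initial_error = rms_error([0.5] * 5, TRUE_V)
```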
Cliff Walking (Example 6.6)
A 4×12 gridworld. The agent starts at the bottom-left (S) and must reach the bottom-right (G). The bottom row between S and G is a cliff: stepping on it gives reward −100 and sends the agent back to start. All other steps give reward −1. Compare SARSA (safe path) vs Q-Learning (optimal path along cliff edge).
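The dynamics just described can be sketched as a small step function (the row/column coordinates, names, and action encoding are our own choices, not the page's code):

```python
# 4x12 cliff-walking grid: row 3 is the bottom row.
START, GOAL = (3, 0), (3, 11)            # bottom-left S, bottom-right G
CLIFF = {(3, c) for c in range(1, 11)}   # bottom row between S and G
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, reward, done) for one move on the grid."""
    r, c = state
    dr, dc = MOVES[action]
    nr = min(max(r + dr, 0), 3)          # clamp to the 4 rows
    nc = min(max(c + dc, 0), 11)         # clamp to the 12 columns
    if (nr, nc) in CLIFF:
        return START, -100, False        # fell off: -100, back to start
    if (nr, nc) == GOAL:
        return (nr, nc), -1, True
    return (nr, nc), -1, False
```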
Q-Value Inspector
Q(s,a) is the expected total reward for taking action a in state s and then following the policy. Each cell has 4 Q-values (one per direction). The highlighted row is the greedy action — the direction the policy would choose. Click any grid cell to see how the agent evaluates each possible move from that position. Large negative values near the cliff show the agent has learned that direction is dangerous.
Figure 6.4 — Sum of Rewards per Episode
Each point is the total reward collected in one episode (closer to 0 is better, since every step costs −1 and cliff falls cost −100). SARSA (blue) typically gets higher rewards during training because its safe path avoids the cliff, even though the path is longer. Q-Learning (orange) shows more spikes to −100 early on because its cliff-edge policy suffers from ε-greedy exploration slipping off the cliff. Over time both improve, but SARSA maintains better online performance — the key insight of on-policy vs off-policy learning.
Learned Paths
After training, this shows the greedy path each algorithm would take (always picking the action with the highest Q-value, no exploration). SARSA’s path (blue) typically goes along the top row, avoiding the cliff — it learned that cliff-adjacent cells are risky under ε-greedy. Q-Learning’s path (orange) hugs the cliff edge — the shortest route, optimal if you never slip. Overlapping cells appear with a blended color. Train both algorithms (select “Both”) for 500 episodes, then click “Show Learned Paths” to compare.
Foundations — The Methods Behind the Simulations
The two simulations above illustrate the two faces of Temporal-Difference learning:
- Random Walk (Tab 1) shows prediction — TD(0) and Monte Carlo estimate state values V(s) without choosing actions.
- Cliff Walking (Tab 2) shows control — SARSA and Q-Learning extend TD to action values Q(s, a) and learn a policy that decides how to act.
SARSA is essentially TD(0) applied on-policy to state-action pairs; Q-Learning applies the same idea off-policy. The sections below unpack these foundations: how TD and MC compare for prediction, how they extend to control, and the key formulas that connect them.
TD(0) vs MC — Prediction Methods Compared
The “0” in TD(0) means the update looks ahead only 0 extra steps beyond the immediate next state — it bootstraps from V(S′) right away. More generally, TD(λ) uses a parameter λ ∈ [0, 1] to blend different look-ahead depths: TD(0) is pure one-step bootstrapping; TD(1) is equivalent to Monte Carlo (uses the full return, no bootstrapping); values in between (e.g. TD(0.5), TD(0.9)) mix short- and long-horizon returns via eligibility traces. Higher λ means less bias but more variance. TD(0) is the simplest and most common starting point.
Temporal Difference learning combines ideas from Monte Carlo and Dynamic Programming. The name “Temporal Difference” refers to updates driven by the difference between value estimates at two successive time steps: the TD error δ = R + γV(S′) − V(S) compares what you expected (V(S)) with a better estimate available one step later (R + γV(S′)); this temporal difference is the learning signal. Unlike MC, TD methods learn from incomplete episodes by bootstrapping, that is, updating estimates based on other estimates rather than waiting for the actual outcome: TD updates V(S) using V(S′), the current estimate of the next state, instead of waiting until the episode ends to compute the true return. This is like revising your guess of total travel time mid-trip based on how far you’ve come, rather than waiting until you arrive. Compare the two pseudocodes below to see the key structural difference: TD updates at every step using the next state’s estimate, while MC updates only after the episode ends using the actual return.
Note: The MC method here is tabular Monte Carlo prediction (constant-α first-visit MC): run a full episode, compute the actual discounted return G for each visited state, then update V(S) ← V(S) + α[G − V(S)]. It simply averages sampled episode returns to estimate state values, with no tree or planning component. This is not Monte Carlo Tree Search (MCTS), which is a planning algorithm that builds a search tree using selection, expansion, simulation, and backpropagation.
V(S) — estimated value of state S (how good it is to be in that state; the quantity we are learning)
α (alpha) — learning rate / step size; controls how much each update shifts the estimate (e.g. 0.1)
γ (gamma) — discount factor ∈ [0, 1]; how much future rewards are worth relative to immediate ones (γ=1 means no discounting)
R — reward received after taking an action (immediate payoff from one step)
S′ — the next state reached after taking an action from S
δ (delta) — TD error; the surprise signal R + γV(S′) − V(S) that drives TD updates
G — return; the actual total discounted reward from a state to the end of the episode (used by MC only)
A — the action selected by the policy in state S
T — the final time step of an episode (when a terminal state is reached)
V(S) ← V(S) + α · [R + γ · V(S') − V(S)]
TD error:
δ = R + γ · V(S') − V(S)
Pseudocode (Tabular TD(0)):
Initialize V(s) arbitrarily for all s
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        A ← action given by policy for S
        Take action A, observe R, S'
        V(S) ← V(S) + α[R + γV(S') − V(S)]
        S ← S'
    until S is terminal
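The TD(0) pseudocode above can be sketched in Python for the 5-state random walk (the state encoding 0..6 and defaults such as α = 0.1 are our own choices):

```python
import random

# Tabular TD(0) on the random walk: states 1..5 are A..E,
# 0 and 6 are the terminals, and every episode starts at C (= 3).
def td0_random_walk(episodes=100, alpha=0.1, gamma=1.0, seed=0):
    rng = random.Random(seed)
    V = [0.0] + [0.5] * 5 + [0.0]          # terminals stay at 0
    for _ in range(episodes):
        s = 3
        while s not in (0, 6):
            s2 = s + rng.choice((-1, 1))    # left/right with prob 1/2
            r = 1.0 if s2 == 6 else 0.0     # +1 only at the right terminal
            V[s] += alpha * (r + gamma * V[s2] - V[s])  # TD(0) update
            s = s2
    return V[1:6]                           # estimates for A..E
```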
V(S) ← V(S) + α · [G − V(S)]
Return:
G = R1 + γR2 + γ²R3 + … (actual total reward)
Pseudocode (Constant-α MC):
Initialize V(s) arbitrarily for all s
Repeat (for each episode):
    Generate episode: S0, A0, R1, …, ST
    (wait for the episode to finish)
    G ← 0
    For t = T−1, T−2, …, 0:
        G ← γG + Rt+1
        V(St) ← V(St) + α[G − V(St)]
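A matching sketch of constant-α first-visit MC on the 5-state random walk (again, the encoding and defaults are our own):

```python
import random

# Constant-alpha first-visit MC: states 1..5 are A..E,
# 0 and 6 are the terminals.
def mc_random_walk(episodes=100, alpha=0.1, gamma=1.0, seed=0):
    rng = random.Random(seed)
    V = [0.0] + [0.5] * 5 + [0.0]
    for _ in range(episodes):
        # Generate the full episode first: MC cannot update mid-episode.
        s, trajectory = 3, []
        while s not in (0, 6):
            s2 = s + rng.choice((-1, 1))
            r = 1.0 if s2 == 6 else 0.0
            trajectory.append((s, r))
            s = s2
        # Sweep backwards, accumulating the return G.
        G = 0.0
        for t in range(len(trajectory) - 1, -1, -1):
            st, r = trajectory[t]
            G = gamma * G + r
            if st not in (s_ for s_, _ in trajectory[:t]):  # first visit only
                V[st] += alpha * (G - V[st])
    return V[1:6]
```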
Concrete Example — One TD(0) Update with Actual Numbers
Abstract formulas make more sense once you plug in numbers. Here is a single TD(0) update from the Random Walk environment above, compared side-by-side with what Monte Carlo would do for the same transition.
Setup. States A–E, γ = 1, α = 0.1, all intermediate rewards = 0. Current value estimates (before the update): V(C) = 0.50 and V(D) = 0.60 (the other states are not needed for this single update).
The agent is in C, moves right to D, receives R = 0.
TD updates immediately:
TD error:
δ = R + γ·V(D) − V(C)
δ = 0 + 1·0.60 − 0.50 = +0.10
Apply update:
V(C) ← 0.50 + 0.1 × 0.10
V(C) ← 0.51
Result: V(C) nudged up from 0.50 → 0.51
because the next state D had a higher estimate (0.60 > 0.50),
so C’s value is pulled upward toward its neighbor.
Only V(C) changed. No other values touched.
The agent walked C → D → E → Right terminal (+1).
MC must wait until the episode ends:
Actual return from C:
G = 0 + 1·0 + 1·1 = 1.0
(all intermediate rewards are 0; final reward is +1)
Apply update:
V(C) ← 0.50 + 0.1 × (1.0 − 0.50)
V(C) ← 0.50 + 0.1 × 0.50
V(C) ← 0.55
Result: V(C) jumped from 0.50 → 0.55
— a bigger move because G = 1 is far from 0.50.
But this single sample is noisy: a different episode
ending left would give G = 0, pushing V(C) down to 0.45.
Key takeaway: TD made a small, stable adjustment (+0.01) using only the next state’s estimate — no waiting required. MC made a larger adjustment (+0.05) using the actual outcome, but had to wait for the episode to finish, and the update is noisier because one lucky or unlucky episode can swing it widely. Over many episodes, both converge to the true values, but TD’s lower-variance steps typically get there faster.
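The arithmetic above can be reproduced in a few lines of plain Python:

```python
# Reproducing the worked example: one TD(0) update vs one MC update.
alpha, gamma = 0.1, 1.0
V_C, V_D = 0.50, 0.60

# TD(0): updates immediately after the single transition C -> D (R = 0).
delta = 0.0 + gamma * V_D - V_C      # TD error, +0.10
V_C_td = V_C + alpha * delta         # 0.50 -> 0.51

# MC: waits for the whole episode C -> D -> E -> right terminal (+1).
G = 1.0                              # all intermediate rewards are 0
V_C_mc = V_C + alpha * (G - V_C)     # 0.50 -> 0.55
```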
SARSA — On-Policy TD Control
SARSA learns Q-values using the action actually taken under the current ε-greedy policy. (An ε-greedy policy picks the action with the highest Q-value most of the time, but with probability ε it picks a random action instead, ensuring the agent keeps exploring; with ε = 0.1, the agent acts randomly 10% of the time and greedily 90% of the time.) It evaluates and improves the policy it follows, making it on-policy: it learns about the same policy it uses to make decisions, updating Q(S,A) with the action A′ the agent actually takes next, so the learned values reflect the real behavior, including mistakes from exploration. In Cliff Walking, SARSA learns a safe path along the top row, far from the cliff, because ε-greedy exploration occasionally steps off the edge (−100 penalty); SARSA’s Q-values for cliff-adjacent cells reflect this risk, so the greedy policy steers away. Q-Learning, being off-policy, ignores exploration risk and finds the shorter cliff-edge path: optimal in theory, but dangerous under ε-greedy execution. Try training both in the Cliff Walking tab to see the difference.
Q(S,A) ← Q(S,A) + α · [R + γ · Q(S',A') − Q(S,A)]
where A' is the action chosen by the ε-greedy policy at S'
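A minimal sketch of this step, assuming Q is a plain dict keyed by (state, action) and `actions` lists the legal moves (both representations are our own, not the page's code):

```python
import random

# Epsilon-greedy selection plus one on-policy SARSA update.
def epsilon_greedy(Q, s, actions, eps=0.1, rng=random):
    if rng.random() < eps:
        return rng.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))   # exploit

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=1.0):
    # On-policy target: uses A', the action actually selected at S'.
    target = r + gamma * Q.get((s2, a2), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```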
Q-Learning — Off-Policy TD Control
An off-policy method learns about a target policy (usually the optimal greedy policy) while following a different behavior policy (usually ε-greedy for exploration). Q-Learning’s update uses maxₐ Q(S′,a) — the best possible next action — even though the agent may have taken a random exploratory action. This decoupling lets Q-Learning discover the optimal policy regardless of how the agent explores.
Q-Learning learns the optimal Q-values Q* regardless of the exploration policy. It uses the maximum Q-value for the next state, not the action actually taken. In Cliff Walking, Q-Learning finds the optimal path along the cliff edge because it evaluates the greedy policy while following an exploratory one.
Q(S,A) ← Q(S,A) + α · [R + γ · maxₐ Q(S',a) − Q(S,A)]
where maxₐ Q(S',a) uses the best possible next action
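A minimal sketch of this update, again assuming Q is a dict keyed by (state, action) and `actions` lists the legal moves at S′ (our own hypothetical representation):

```python
# One Q-Learning update. Compared with SARSA, only the target changes:
# the max over next actions, not the action actually taken.
def q_learning_update(Q, s, a, r, s2, actions, alpha=0.5, gamma=1.0):
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)  # off-policy target
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * best_next - Q.get((s, a), 0.0))
```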
TD vs Monte Carlo
TD bootstraps (uses estimates of successor states), while MC waits for the full return at the end of an episode. TD has lower variance but some bias; MC has zero bias but higher variance. Variance measures how much an estimate fluctuates across episodes: MC’s update target G (the actual return) depends on every random action and reward from the current step to the end of the episode, and all that randomness compounds, whereas TD’s target R + γV(S′) depends on only one random step, so it fluctuates much less. Bias means the estimate is systematically off from the true value: MC’s target G is the actual return, whose expected value is exactly V(S) by definition, so MC updates are unbiased, while TD’s target uses V(S′), itself an estimate that may be wrong; as V(S′) improves, though, TD’s bias shrinks toward zero. TD can also learn before an episode ends and from incomplete episodes: it needs only the current reward and the next state’s estimate, not the final outcome, so it works even in continuing (non-episodic) tasks and starts improving its estimates from the very first step. This makes it more practical in many settings. The Random Walk example demonstrates TD’s advantage: it converges faster than MC for small step sizes. Try it yourself: click “Run Comparison” in the Figure 6.3 panel above, and you’ll see TD’s RMS error drop faster than MC’s at comparable α values. Why? TD updates after every step, so each episode provides many learning updates (one per visited step), while MC updates only once per state per episode, and only after it ends. TD’s lower variance also means each update is more reliable, so smaller step sizes (α) work well; MC needs many full episodes to average out its high variance before estimates stabilize.
Reference
Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Chapter 6: Temporal-Difference Learning.
Examples reproduced: Example 6.2 (Random Walk), Example 6.6 (Cliff Walking), Figure 6.2, Figure 6.3, Figure 6.4.
Book website · Full text (PDF)