Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.), Section 1.5
Sutton & Barto introduce tic-tac-toe as a motivating example for reinforcement learning. The classical approach to this game is minimax, a game-tree search that picks the move maximizing the worst-case outcome on the assumption that both players play perfectly. Minimax guarantees optimal play against an optimal opponent, but because it does not learn from experience it cannot adapt to an imperfect opponent or exploit that opponent's mistakes: it plays the same way regardless of who it faces. An evolutionary method could search the space of policies directly, treating each policy as an individual in a population and selecting and mutating the policies that win more games. The drawback is that such methods use only the final win/loss outcome of each game, discarding the information available from the individual moves within a game.
The RL approach instead learns a value function V(s) that estimates the probability of winning from each board state s: states from which a win is imminent have V ≈ 1, states from which a loss is imminent have V ≈ 0. Specifically, the agent evaluates afterstates, the board positions that result immediately after the agent places its mark but before the opponent responds. This is efficient because different actions from different prior states can produce the same afterstate, so the agent shares learned values across those situations.
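As a concrete illustration, an afterstate value table can be an ordinary dictionary keyed by board position, with unknown positions defaulting to 0.5. This is a minimal sketch in Python; the board encoding, the function names, and the choice to score draws as 0 are assumptions for illustration, not taken from the text.

```python
# Sketch of an afterstate value table for tic-tac-toe.
# A board is a tuple of 9 cells, each 'X', 'O', or ' '.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
         (0, 4, 8), (2, 4, 6)]                 # diagonals

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def value(V, board, me='X'):
    """Estimated win probability of an afterstate for player `me`."""
    w = winner(board)
    if w == me:
        return 1.0                       # a won position is worth 1
    if w is not None or ' ' not in board:
        return 0.0                       # a lost or drawn position is worth 0
    return V.setdefault(board, 0.5)      # unseen positions start at 0.5
```

Keying the table by afterstate means that two different (state, action) pairs producing the same board share a single entry, which is exactly the value sharing described above.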
The agent uses an ε-greedy policy. With probability ε (e.g. 0.1, i.e. 10% of the time) it makes a random exploratory move, which ensures it visits board states it might otherwise overlook; without exploration, the agent may never discover better strategies. The rest of the time (1 − ε) it makes the greedy move, the one leading to the afterstate with the highest estimated value: its current best guess. After each greedy move, the value estimate is updated using temporal-difference (TD) learning, which compares the current estimate V(St) with the estimate at the next state V(St+1). Instead of waiting until the game ends (as Monte Carlo methods do), TD uses the next state's estimate as a target; this bootstrapping learns faster because every move provides a learning signal. In RL this update is called a backup: the value of the current state is revised ("backed up") to be more consistent with what the agent observed next. Exploratory moves are not backed up, because they were chosen randomly rather than by the value function; including them would inject noise from decisions that don't reflect the agent's learned strategy.
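The move-selection step can be sketched as follows (illustrative Python; the function and variable names are my own, and `V` is assumed to be a dict mapping afterstate boards to values, defaulting to 0.5). The returned flag tells the caller whether to perform a TD backup:

```python
import random

def select_move(board, V, mark='X', epsilon=0.1, rng=random):
    """Epsilon-greedy choice over afterstates.
    Returns (afterstate, is_greedy); exploratory moves are flagged
    so the caller can skip the TD backup for them."""
    empties = [i for i, cell in enumerate(board) if cell == ' ']
    afterstates = [board[:i] + (mark,) + board[i + 1:] for i in empties]
    if rng.random() < epsilon:
        return rng.choice(afterstates), False   # exploratory: no backup
    best = max(afterstates, key=lambda s: V.get(s, 0.5))
    return best, True                           # greedy: back this value up
```

With ε = 0 this always picks the highest-valued afterstate, which is how the Play tab's trained agent behaves.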
The update rule is:

V(St) ← V(St) + α [ V(St+1) − V(St) ]

Here St is the previous afterstate, St+1 is the current afterstate, and α is the step-size parameter (also called the learning rate), which controls how much each update shifts the value estimate: a large α (e.g. 0.5) gives fast but noisy learning, a small α (e.g. 0.01) gives slow but stable convergence, and 0.1 is a common starting point. The quantity V(St+1) − V(St) is the TD error, the "surprise" signal that drives learning: if the next state has a higher value than expected, the current state's value is nudged up; if lower, it is nudged down. Over many games these corrections propagate backward through the game, the values converge to the true winning probabilities under the current policy, and the agent learns to play optimally against its training opponent.
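In code, this one-step backup is essentially a single line. This is a sketch under the same assumptions as above: the dict `V` defaults unseen states to 0.5, and at the end of a game `next_value` would be the final result (1 for a win, 0 for a loss or draw) rather than another estimate.

```python
def td_update(V, prev_state, next_value, alpha=0.1):
    """V(St) <- V(St) + alpha * [V(St+1) - V(St)].
    Returns the TD error for inspection."""
    v = V.get(prev_state, 0.5)
    td_error = next_value - v
    V[prev_state] = v + alpha * td_error
    return td_error
```

Note that once an estimate matches its target, the TD error is zero and the value stops moving, which is what convergence looks like at the level of a single state.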
Part 1 — Play First, Lose Later (2 min)
Switch to the Play tab. Play as X against the untrained agent. You should win easily.
Question: Why is the AI so bad? Turn on “Show AI Value Estimates” — what do the numbers on empty cells mean, and why are they all 0.50?
Part 2 — Watch It Learn (3 min)
Switch to the Train tab. Keep defaults (α=0.10, ε=0.10, self-play on).
Click Step a few times. Watch the TD Update panel — notice whether each move is “Greedy” or “Exploratory” and how the TD error shifts value estimates.
Now click 1K to train 1,000 games. Watch the Training Progress chart — which player starts winning more? What does the draw rate do?
Click 10K. How have the curves changed?
Part 3 — Test Your Trained Agent (2 min)
Switch to the Play tab. Turn on “Show AI Value Estimates.” Play a few games.
Question: Can you still beat it? Look at the value estimates — does the AI correctly identify moves that lead to winning vs. losing positions?
Part 4 — Break It, Then Fix It (3 min)
Click Reset All. Change one parameter at a time (α, ε, or self-play) and train 10K games each. What breaks, and why?
Takeaway question: TD learning updates values at every move using the next state’s estimate — how is this different from waiting until the game ends to learn (Monte Carlo)?
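To make the contrast concrete, here is an illustrative side-by-side sketch (the names and episode representation are my own; both functions take the agent's sequence of greedy afterstates from one game plus the final outcome). Monte Carlo pushes every visited state toward the end-of-game result, while TD pushes each state toward the next state's current estimate and only the last state toward the outcome.

```python
def monte_carlo_update(V, states, outcome, alpha=0.1):
    """Wait until the game ends, then use the final outcome as the
    target for every state visited along the way."""
    for s in states:
        v = V.get(s, 0.5)
        V[s] = v + alpha * (outcome - v)

def td_episode_update(V, states, outcome, alpha=0.1):
    """Update after every move: each state's target is the *next*
    state's current estimate (bootstrapping); only the final state
    is updated toward the actual outcome."""
    for s, s_next in zip(states, states[1:]):
        v = V.get(s, 0.5)
        V[s] = v + alpha * (V.get(s_next, 0.5) - v)
    v = V.get(states[-1], 0.5)
    V[states[-1]] = v + alpha * (outcome - v)
```

On a fresh table the difference is visible after a single winning game: Monte Carlo has raised every state's value, while TD has raised only the last one; earlier states rise on later games as credit propagates backward one step at a time. TD's advantage is that every move supplies a learning signal during play, and its targets track the improving estimates rather than a single noisy end-of-game result.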
Train the agent via self-play using TD learning. Watch games step-by-step or batch-train thousands at once.
Play against the trained agent. The AI uses its learned value function with no exploration (ε=0).