Tic-Tac-Toe — Reinforcement Learning Agent

Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.), Section 1.5

Section 1.5: An Extended Example — Tic-Tac-Toe

Sutton & Barto introduce tic-tac-toe as a motivating example for reinforcement learning. The classical approach to this game is minimax, which computes optimal play by assuming both players act perfectly. But minimax cannot exploit an imperfect opponent's mistakes — it plays the same way regardless of who it faces. An evolutionary method could search the space of policies directly, but it only uses the final outcome of each game, discarding the information available during play.

The RL approach instead learns a value function V(s) that estimates the probability of winning from each board state s. Specifically, the agent evaluates afterstates — the board positions that result after the agent places its mark but before the opponent responds. This is efficient because different actions from different prior states can lead to the same afterstate, so they share a single value estimate.
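
To make this concrete, here is a minimal sketch, not this app's actual code: boards are 9-character strings read left to right, top to bottom, an afterstate is the string produced by writing a mark into one cell, and a single dictionary of value estimates is keyed by afterstate. The names afterstate and value_table are illustrative.

```python
def afterstate(board: str, cell: int, mark: str) -> str:
    """Board position that results after placing `mark` in `cell` (0-8)."""
    return board[:cell] + mark + board[cell + 1:]

# One shared table of value estimates, keyed by afterstate.
value_table: dict[str, float] = {}

# Two different prior states and moves that reach the same afterstate,
# and therefore look up and update the same value estimate:
a = afterstate("..O.X....", 0, "X")   # X plays the top-left corner
b = afterstate("X.O......", 4, "X")   # X plays the centre
assert a == b == "X.O.X...."
value_table.setdefault(a, 0.5)        # 0.5 = uninformed prior (see the value table below)
```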

The agent uses an ε-greedy policy: with probability ε it makes a random exploratory move; otherwise it makes the greedy move — the one leading to the afterstate with the highest estimated value. After each greedy move (not exploratory ones), the value estimate is updated using temporal-difference (TD) learning:

V(St) ← V(St) + α [ V(St+1) − V(St) ]

Here St is the previous afterstate, St+1 is the current afterstate, and α is the step-size parameter controlling learning speed. The quantity V(St+1) − V(St) is the TD error: the difference between successive value estimates that drives learning. Over many games, if α is reduced appropriately over time, these values converge to the true probabilities of winning, and the agent learns to play optimally against its training opponent.
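
As a worked illustration of this rule, here is a hedged sketch that stores afterstates as strings in a dictionary of estimates, as above; td_update and the example board strings are illustrative, not taken from this app.

```python
ALPHA = 0.1  # step-size parameter

def td_update(values: dict[str, float], s_t: str, s_next: str) -> None:
    """Apply V(St) <- V(St) + alpha * [ V(St+1) - V(St) ] after a greedy move."""
    v_t = values.get(s_t, 0.5)                 # unseen states default to the 0.5 prior
    td_error = values.get(s_next, 0.5) - v_t   # difference between successive estimates
    values[s_t] = v_t + ALPHA * td_error

# Worked example: the newer afterstate looks better (0.9) than the older
# estimate (0.5), so the older estimate moves a fraction ALPHA of the way toward it.
values = {"X.O.X....": 0.5, "X.O.XX.O.": 0.9}
td_update(values, "X.O.X....", "X.O.XX.O.")
print(values["X.O.X...."])   # ~0.54
```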

Parameters and Initial Values

α (step size): learning rate, typically 0.1–0.5. Controls how fast value estimates update.
ε (exploration): probability of a random move, typically 0.1. Ensures all states continue to be visited.
V(win) = 1.0: terminal winning states are initialized to probability 1.
V(other) = 0.5: all other non-terminal states start at 0.5, an uninformed prior. Terminal losses are valued 0 and draws 0.5.
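
Those conventions can be expressed directly in code; the following is a sketch under the values in the list above (win = 1, loss = 0, draw = 0.5, everything else 0.5), with LINES and initial_value as illustrative names rather than this app's code.

```python
# The eight winning lines of the 3x3 board (rows, columns, diagonals).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def initial_value(board: str, me: str = "X", opp: str = "O") -> float:
    """Initial estimate of the probability that `me` wins from this position."""
    for i, j, k in LINES:
        line = board[i] + board[j] + board[k]
        if line == me * 3:
            return 1.0        # V(win): we already have three in a row
        if line == opp * 3:
            return 0.0        # loss: the opponent has three in a row
    if "." not in board:
        return 0.5            # draw: full board, no winner
    return 0.5                # non-terminal: uninformed prior
```
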
Training Mode — Self-Play

Train mode: train the agent via self-play using TD learning. Watch games step-by-step or batch-train thousands at once; a minimal self-play loop is sketched below.
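
This sketch is one self-contained way to do self-play with a single shared value table, not this app's source: V estimates the probability that X wins from any position, X chooses greedily to maximize it, O to minimize it, and a TD backup is applied to the pre-move position after every greedy move by either side. The names (V, self_play_game, train) and that exact backup scheme are assumptions.

```python
import random

ALPHA, EPSILON = 0.10, 0.10          # typical values from the parameter list above
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]
V: dict[str, float] = {}             # shared value table: estimated P(X wins)

def winner(board: str) -> str | None:
    for i, j, k in LINES:
        if board[i] != "." and board[i] == board[j] == board[k]:
            return board[i]
    return None

def value(board: str) -> float:
    w = winner(board)
    if w == "X":
        return 1.0                   # X has won
    if w == "O":
        return 0.0                   # X has lost
    return V.get(board, 0.5)         # draw or unseen non-terminal position: 0.5

def self_play_game() -> str:
    board, mark = "." * 9, "X"
    while winner(board) is None and "." in board:
        moves = [i for i, c in enumerate(board) if c == "."]
        candidates = [board[:m] + mark + board[m + 1:] for m in moves]
        if random.random() < EPSILON:
            nxt = random.choice(candidates)      # exploratory move: no backup
        else:
            pick = max if mark == "X" else min   # X maximizes P(X wins), O minimizes it
            nxt = pick(candidates, key=value)
            # TD backup after a greedy move: nudge V(board) toward V(nxt).
            V[board] = value(board) + ALPHA * (value(nxt) - value(board))
        board, mark = nxt, ("O" if mark == "X" else "X")
    return winner(board) or "draw"

def train(n_games: int = 10_000) -> None:
    for _ in range(n_games):
        self_play_game()
```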

Play mode: play against the trained agent. The AI uses its learned value function with no exploration (ε = 0); a greedy-move sketch follows below.
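
In Play mode, move selection is the same greedy choice with exploration switched off. A short sketch, reusing the value helper assumed in the training sketch above; agent_move is an illustrative name.

```python
def agent_move(board: str, mark: str = "O") -> str:
    """Greedy move with epsilon = 0: return the afterstate the learned table rates best."""
    moves = [i for i, c in enumerate(board) if c == "."]
    candidates = [board[:m] + mark + board[m + 1:] for m in moves]
    pick = max if mark == "X" else min     # value() estimates P(X wins)
    return pick(candidates, key=value)     # `value` as defined in the training sketch
```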

Statistics

Training Progress: running counts of X wins, O wins, and draws across the training games.

TD Update: displays the TD backups applied while the agent trains.

Training Log: training results appear here as games are played.