Monte Carlo Tree Search
Sutton & Barto, Section 8.11 — Interactive MCTS on Tic-Tac-Toe
Monte Carlo Tree Search (MCTS) builds a search tree incrementally by running simulated playouts. Each iteration has four phases: Selection picks the most promising path using a tree policy; Expansion adds a new child node; Simulation plays a random rollout to a terminal state; Backpropagation updates statistics along the path back to the root.
What is UCB? UCB stands for Upper Confidence Bound — a strategy from the multi-armed bandit literature (the same exploration-vs-exploitation tradeoff from Chapter 2). When choosing which tree branch to explore next, we want to balance two goals: exploitation (prefer nodes with high win rates) and exploration (revisit nodes we haven't tried much — they might be better than they appear). UCB achieves this by adding an “optimism bonus” to each node's estimated value.
The UCB1 formula applies this idea to tree search. The “1” in UCB1 denotes the first and simplest variant in a family of UCB algorithms introduced by Auer, Cesa-Bianchi & Fischer (2002). Their paper also proposed UCB2 (a more complex epoch-based variant) and UCB1-Tuned (which incorporates reward variance). UCB1 became the most widely adopted because it is simple, has strong theoretical guarantees (logarithmic regret), and works well in practice — which is why MCTS adopted it. During the Selection phase, MCTS picks the child with the highest UCB1 score:
Q̅j = average win rate of child j (exploitation term — prefer nodes that have been winning)
C · √ln Nparent / Nj = exploration bonus (large when j has few visits relative to its parent — encourages trying under-explored moves)
C = exploration constant (default √2 ≈ 1.41; higher = more exploration, lower = more exploitation)
How to use this demo
- The tree starts with a single root node showing “0” — this represents the current board position (an empty board by default) with zero visits so far.
- Step Phase walks through one MCTS iteration phase by phase (Selection → Expansion → Simulation → Backpropagation). Watch the colored highlights on the tree and the phase bar above it.
- Step Iteration runs one full 4-phase cycle at once.
- Auto runs iterations continuously — watch the tree grow and the win-rate chart converge. Use the speed slider to control how fast.
- +10 / +100 / +1000 run many iterations instantly (no animation, but faster results).
- To analyze a specific position: click Setup Board, click cells to cycle through empty/X/O, then click Done Editing. The tree resets to search from your position.
Controls
Board Position
Statistics
Move Rankings
| Move | Visits | Win% | Q | UCB1 |
|---|
UCB1 Live
UCB1 (Upper Confidence Bound 1) scores each child node as UCB1 = Q̅ + C · √ln Np / Nj, where Q̅ is the node's average reward (wins / visits), Np is how often the parent was visited, Nj is how often this child was visited, and C is a tunable constant (typically √2). The score is an optimistic estimate of a node's true value: the first term (exploitation) reflects how promising a move has been so far, while the second term (exploration) acts as a confidence bonus that grows for less-visited nodes — ensuring no move is neglected. During selection, MCTS always picks the child with the highest UCB1, which automatically balances exploiting known-good moves against exploring uncertain ones. Click any tree node to inspect its breakdown below.