Connect 4 — RL, MCTS & LLM Evaluation

Train RL agents, explore MCTS search, and evaluate LLM board understanding

Game

Red's turn

Game Traces

0 traces saved

    How to Play Connect 4

    1. Objective: Be the first player to connect four of your pieces in a row — horizontally, vertically, or diagonally.
    2. Taking turns: Red always goes first. Players alternate dropping one piece per turn into any column that is not full.
    3. Dropping pieces: Click a column to drop your piece. It falls to the lowest available row in that column.
    4. Winning: The game ends immediately when a player forms an unbroken line of four pieces. Winning cells are highlighted with a glow.
    5. Draw: If all 42 cells are filled and neither player has four in a row, the game is a draw.
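
Rules 2–5 amount to a few lines of board logic. A minimal Python sketch, assuming a 6×7 grid stored top-down (an illustration, not the app's actual implementation):

```python
# Minimal Connect 4 board logic: pieces fall to the lowest empty row,
# and a win is four in a row horizontally, vertically, or diagonally.
ROWS, COLS = 6, 7

def new_board():
    return [[None] * COLS for _ in range(ROWS)]  # row 0 = top

def drop(board, col, player):
    """Drop player's piece into col; return the landing row, or None if full."""
    for row in range(ROWS - 1, -1, -1):  # scan from the bottom up
        if board[row][col] is None:
            board[row][col] = player
            return row
    return None

def is_win(board, row, col):
    """Check whether the piece just placed at (row, col) completes four in a row."""
    player = board[row][col]
    for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
        count = 1
        for sign in (1, -1):  # extend in both directions along the line
            r, c = row + sign * dr, col + sign * dc
            while 0 <= r < ROWS and 0 <= c < COLS and board[r][c] == player:
                count += 1
                r, c = r + sign * dr, c + sign * dc
        if count >= 4:
            return True
    return False
```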

    Game modes:

    • Player vs Player — Two humans take turns on the same screen.
    • Player vs Minimax AI — Play against a classic search-based AI. Higher depth = stronger play (depth 4+ recommended).
    • Player vs RL Agent — Play against a neural-network agent trained in the Train RL tab. Train a model first, then select it here.
    • Player vs MCTS Agent — Play against Monte Carlo Tree Search. More iterations = stronger play.
    • RL Agent vs AI — Watch a trained RL model (Red) play automatically against a configurable AI opponent (Yellow). Adjust speed, step through moves, or auto-play with a live scoreboard.

    Solved Game & Optimal Play

    Connect 4 is a solved game. With perfect play, the first player (Red) can always force a win, but only by opening in the center column. Opening in either column adjacent to the center leads to a draw under perfect play, and opening in any of the four outer columns allows the second player to force a win.

    The game was independently solved by James D. Allen (October 1, 1988) and Victor Allis (October 16, 1988). Allis used a knowledge-based approach combining nine strategic rules with alpha-beta search, while Allen developed a combinatorial analysis of "threats" — categorizing them as major, minor, and useless. Both proofs demonstrated that the first player wins within at most 41 moves.

    John Tromp later computed a complete 8-ply opening database and extended the solution to boards of various sizes up to width+height=15, requiring approximately 40,000 CPU hours at CWI Amsterdam. His Fhourstones program remains a widely used benchmark for integer performance.

    References:

    1. Allis, V. (1988). A Knowledge-Based Approach of Connect-Four: The Game is Solved: White Wins. M.Sc. Thesis, Report No. IR-163, Faculty of Mathematics and Computer Science, Vrije Universiteit, Amsterdam.
    2. Allen, J. D. (1990). Expert Play in Connect-Four.
    3. Tromp, J. (2008). Solving Connect-4 on Medium Board Sizes. ICGA Journal, 31(2), 110–112.
    4. Edelkamp, S. & Kissmann, P. (2008). Symbolic Classification of General Two-Player Games. Proc. KI 2008, LNAI 5243, pp. 185–192. Springer.

    RL Training

    Defaults: Learning Rate 3e-4 · Gamma 0.99 · ε Start 1.00 · ε End 0.05 · Replay Size 10000 · Batch Size 32

    Live stats: Episode 0 · Win Rate (100) 0% · Avg Reward 0 · Avg Loss 0

    Charts: Episode Reward · Win Rate (per 100) · Loss

    Algorithm Detail

    Training Log

    Models

    Algorithm Reference

    Deep Q-Network (DQN) approximates the Q-value table with a neural network, so it scales to state spaces far too large to enumerate. It uses experience replay and a target network to stabilize learning.

    Learning Rate: Step size for gradient descent (log scale: 10^x). Lower = slower but more stable.
    Gamma: Discount factor for future rewards. Higher values make the agent plan further ahead.
    ε Start: Initial exploration rate. At 1.0 the agent explores randomly at first.
    ε End: Final exploration rate. A small value ensures some exploration always remains.
    Replay Size: Max transitions stored in the replay buffer. Larger = more diverse training samples.
    Batch Size: Transitions sampled per training step. Larger = more stable gradients.
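
A minimal Python sketch of how these pieces fit together: the bounded replay buffer, a linear ε decay, and the Bellman target computed from the target network's Q-values. The constants mirror the panel defaults; `decay_episodes` is an assumed schedule, not a value taken from the app:

```python
import random
from collections import deque

# Hyperparameter defaults matching the training panel.
GAMMA = 0.99                  # discount factor for future rewards
EPS_START, EPS_END = 1.00, 0.05
REPLAY_SIZE = 10_000          # max transitions kept
BATCH_SIZE = 32

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity=REPLAY_SIZE):
        self.buffer = deque(maxlen=capacity)  # old transitions drop off the left

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=BATCH_SIZE):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def epsilon(episode, decay_episodes=1000):
    """Linear decay from EPS_START to EPS_END over decay_episodes episodes."""
    frac = min(episode / decay_episodes, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def td_target(reward, next_q_values, done):
    """Bellman target, using the *target network's* Q-values for the next state."""
    return reward if done else reward + GAMMA * max(next_q_values)
```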
    Defaults: Iterations 10 · Exploration constant (UCB1 C) 1.41

    1. Selection
    2. Expansion
    3. Simulation
    4. Backpropagation
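
The four phases can be sketched compactly over a generic game tree, using the UCB1 formula with C = 1.41 (≈ √2). The `expand` and `simulate` callbacks are placeholders for game-specific logic, not the app's implementation:

```python
import math

C = 1.41  # UCB1 exploration constant, approximately sqrt(2)

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.visits = 0
        self.wins = 0.0

def ucb1(node):
    """UCB1 score: exploitation (win rate) plus an exploration bonus."""
    if node.visits == 0:
        return float("inf")  # unvisited children are always tried first
    return node.wins / node.visits + C * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def iterate(root, expand, simulate):
    """One MCTS iteration: selection, expansion, simulation, backpropagation."""
    node = root
    while node.children:                       # 1. Selection: follow max UCB1
        node = max(node.children, key=ucb1)
    expand(node)                               # 2. Expansion: add child nodes
    leaf = node.children[0] if node.children else node
    result = simulate(leaf)                    # 3. Simulation: estimate outcome
    while leaf is not None:                    # 4. Backpropagation: update path
        leaf.visits += 1
        leaf.wins += result
        leaf = leaf.parent
```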

    Current Board

    Iterations: 0 · Tree Nodes: 1 · Best Move: - · Best Win%: -

    Move Rankings

    Column | Visits | Win% | UCB1

    Win Rate Over Iterations

    Iteration Log

    Waiting for first iteration...

    LLM Configuration

    Evaluation Tasks

    Results

    Accuracy: - · Questions: 0 · Correct: 0 · Avg Latency: -

    Response Log

    LLM-Guided MCTS

    MCTS where the simulation (rollout) step is replaced by LLM position evaluation. Instead of random playouts, the LLM estimates “who is more likely to win?” for each position.
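
A sketch of the evaluator that stands in for random rollouts, including the position cache behind the LLM Calls and Cache Hits stats. `llm_estimate_win_prob` is a hypothetical stand-in for the real API call:

```python
def llm_estimate_win_prob(board_key):
    """Placeholder for an LLM call returning P(current player wins) in [0, 1]."""
    return 0.5

class LLMEvaluator:
    """Replaces the MCTS rollout: one LLM estimate per position, cached."""
    def __init__(self, llm=llm_estimate_win_prob):
        self.llm = llm
        self.cache = {}       # board_key -> estimated win probability
        self.calls = 0        # "LLM Calls" stat
        self.cache_hits = 0   # "Cache Hits" stat

    def evaluate(self, board_key):
        if board_key in self.cache:
            self.cache_hits += 1
            return self.cache[board_key]
        self.calls += 1
        value = self.llm(board_key)
        self.cache[board_key] = value
        return value
```

Caching matters here because MCTS revisits transpositions constantly, and each cache hit avoids a slow, costly LLM round trip.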

    Configuration

    Uses LLM settings from the LLM Eval tab.

    Exploration constant (UCB1 C): 1.41

    Board Position

    Results

    LLM Calls: 0 · Cache Hits: 0 · Avg Latency: - · Cache Size: 0

    Move Rankings

    Column | Visits | Win% | Source

    Log