-
Tic-Tac-Toe RL Agent
Ch. 1
Activity
Tic-Tac-Toe
TD Learning
Play against a temporal-difference learning agent that improves its value estimates in real time, illustrating the core RL idea from Section 1.5.
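The core of this agent, the tabular TD(0) value update from Section 1.5, can be sketched in a few lines. The board-string keys, the 0.5 default for unseen states, and the step size below are illustrative choices, not necessarily the demo's actual encoding:

```python
def td_update(V, s, s_next, alpha=0.1):
    """TD(0): nudge V(s) toward the value of the successor state,
    V(s) <- V(s) + alpha * (V(s') - V(s))."""
    V.setdefault(s, 0.5)        # unseen states start at a neutral 0.5
    V.setdefault(s_next, 0.5)
    V[s] += alpha * (V[s_next] - V[s])
    return V

V = {"win": 1.0}                # terminal winning positions are worth 1
td_update(V, "X..|OO.|X..", "win")   # V of the earlier board rises to 0.55
```

Because the update runs after every move, the agent's estimates visibly shift mid-game, which is what the demo animates.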
-
10-Armed Bandit Testbed
Ch. 2
10-Armed Bandit
ε-greedy
UCB
Gradient Bandit
Explore the exploration-exploitation trade-off by comparing ε-greedy, UCB, and gradient bandit strategies on a 10-armed testbed.
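A minimal ε-greedy runner for a testbed like this, with Gaussian arms and incremental sample-average estimates as in Chapter 2. The arm means below are made up for illustration:

```python
import random

def epsilon_greedy_bandit(true_means, steps=1000, eps=0.1, seed=0):
    """Sample-average epsilon-greedy on a stationary Gaussian bandit."""
    rng = random.Random(seed)
    k = len(true_means)
    Q = [0.0] * k               # value estimates
    N = [0] * k                 # pull counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                    # explore
        else:
            a = max(range(k), key=lambda i: Q[i])   # exploit
        r = rng.gauss(true_means[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                   # incremental average
        total += r
    return Q, total / steps

Q, avg = epsilon_greedy_bandit(
    [0.2, 1.5, 0.1, 0.9, 0.3, 0.0, 0.4, 0.8, 0.6, 1.0])
```

Sweeping `eps` in this sketch reproduces the qualitative trade-off the testbed plots: `eps=0` can lock onto a bad arm, large `eps` wastes pulls.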
-
Gridworld Value Function
Ch. 3
GridWorld
Value Iteration
Visualize state-value functions and optimal policies on the 5×5 Gridworld with special jump states (Example 3.5 / Figure 3.2).
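The value function of Figure 3.2 can be reproduced by iterative policy evaluation for the equiprobable random policy; a self-contained sketch with the grid layout, jump states A and B, and γ = 0.9 of Example 3.5:

```python
def gridworld_values(gamma=0.9, sweeps=500):
    """Evaluate the equiprobable random policy on the 5x5 gridworld:
    off-grid moves pay -1 and stay put; A jumps to A' (+10), B to B' (+5)."""
    A, A2, B, B2 = (0, 1), (4, 1), (0, 3), (2, 3)
    V = [[0.0] * 5 for _ in range(5)]
    for _ in range(sweeps):
        new = [[0.0] * 5 for _ in range(5)]
        for r in range(5):
            for c in range(5):
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    if (r, c) == A:
                        nr, nc, rew = A2[0], A2[1], 10.0
                    elif (r, c) == B:
                        nr, nc, rew = B2[0], B2[1], 5.0
                    else:
                        nr, nc = r + dr, c + dc
                        if 0 <= nr < 5 and 0 <= nc < 5:
                            rew = 0.0
                        else:
                            nr, nc, rew = r, c, -1.0
                    new[r][c] += 0.25 * (rew + gamma * V[nr][nc])
        V = new
    return V

V = gridworld_values()
# V[0][1] (state A) converges to roughly 8.8, matching Figure 3.2
```

Note that A's value is below its +10 reward because every path from A' eventually risks hitting the edge, a point the visualization makes directly.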
-
Gambler's Problem (Dynamic Programming)
Ch. 4
Gambler's Problem
Value Iteration
Watch value iteration solve the Gambler's Problem step by step, revealing how the optimal policy emerges from successive sweeps.
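The sweeps the demo animates come from plain value iteration; a compact in-place version with p_h = 0.4 and the win-at-100 convention of Example 4.3:

```python
def gamblers_value_iteration(p_h=0.4, theta=1e-9):
    """Value iteration for the Gambler's Problem: states are capital
    1..99, stakes 1..min(s, 100 - s); reaching 100 is worth 1."""
    V = [0.0] * 101
    V[100] = 1.0
    while True:
        delta = 0.0
        for s in range(1, 100):
            best = max(p_h * V[s + a] + (1 - p_h) * V[s - a]
                       for a in range(1, min(s, 100 - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    return V

V = gamblers_value_iteration()
# V[50] = 0.4: staking everything at 50 wins with probability p_h
```

Recovering the spiky optimal policy from `V` (argmax over stakes, with ties) is exactly the step the demo lets you inspect sweep by sweep.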
-
Dynamic Programming — In-Class Activity
Ch. 4
Activity
Gambler's Problem
Value Iteration
A guided worksheet exploring the Gambler's Problem: predict value functions, watch convergence sweep by sweep, analyze the spiky optimal policy, and experiment with different coin probabilities.
-
TD Learning — Random Walk & Cliff Walking
Ch. 6
Random Walk
Cliff Walking
TD(0)
SARSA
Q-Learning
Explore TD(0), SARSA, and Q-Learning with animated Random Walk (Example 6.2) and Cliff Walking (Example 6.6) simulations, reproducing the book's key figures.
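The two control updates differ only in what they bootstrap from. One-step sketches with a dict-of-(state, action) table; the α = 0.5 default is a common choice for these examples, not necessarily the demo's setting:

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=1.0):
    """SARSA (on-policy): bootstrap from the action actually taken in s'."""
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * Q.get((s2, a2), 0.0) - Q.get((s, a), 0.0))

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.5, gamma=1.0):
    """Q-learning (off-policy): bootstrap from the greedy action in s'."""
    best = max(Q.get((s2, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * best - Q.get((s, a), 0.0))
```

On Cliff Walking this single difference is why Q-learning's greedy policy hugs the cliff while SARSA's ε-greedy-aware values prefer the safer path.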
-
Monte Carlo Tree Search (MCTS)
Ch. 8
Tic-Tac-Toe
MCTS
UCB1
Step through the four MCTS phases—Selection, Expansion, Simulation, Backpropagation—on a Tic-Tac-Toe board, with a live tree visualization and tunable UCB1 exploration.
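The Selection phase's UCB1 rule, with the exploration constant the demo makes tunable, can be sketched as follows (parent visits are approximated here by the sum of child visits):

```python
import math

def ucb1_select(children, c=math.sqrt(2)):
    """Return the index of the child maximizing UCB1.
    children: list of (total_value, visit_count) pairs."""
    N = sum(n for _, n in children)        # parent visit count
    def score(child):
        w, n = child
        if n == 0:
            return float("inf")            # always try unvisited children first
        return w / n + c * math.sqrt(math.log(N) / n)
    return max(range(len(children)), key=lambda i: score(children[i]))
```

Raising `c` weights the square-root exploration bonus more heavily, which in the tree visualization shows up as wider, shallower search.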
-
LLM-Guided MCTS — Tic-Tac-Toe
Ch. 8 + LLM
Tic-Tac-Toe
MCTS
LLM Evaluation
Compare LLM-guided MCTS against traditional random-rollout MCTS on Tic-Tac-Toe. Watch how replacing random simulations with LLM position evaluation affects move quality, convergence speed, and decision latency.
-
Policy Gradient — REINFORCE
Ch. 13
Left-Right Game
REINFORCE
See the REINFORCE algorithm learn a stochastic policy on a simple left-right game, with live plots of policy probabilities and reward curves.
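A complete REINFORCE loop on a guessed version of the left-right game (one step per episode, "right" pays 1 and "left" pays 0, softmax over two action preferences); the actual demo's rewards and parameterization may differ:

```python
import math, random

def reinforce_left_right(episodes=2000, alpha=0.1, seed=0):
    """REINFORCE: theta += alpha * G * grad(log pi(a)); returns final P(right)."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]                      # preferences for left, right
    for _ in range(episodes):
        z = [math.exp(t) for t in theta]
        p_right = z[1] / (z[0] + z[1])
        a = 1 if rng.random() < p_right else 0
        G = 1.0 if a == 1 else 0.0          # episode return
        probs = [1 - p_right, p_right]
        for i in range(2):
            # grad of log softmax: indicator(i == a) - pi(i)
            grad = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += alpha * G * grad
    z = [math.exp(t) for t in theta]
    return z[1] / (z[0] + z[1])

p_right = reinforce_left_right()   # climbs toward 1 as "right" is reinforced
```

The live probability plot in the demo is essentially `p_right` traced over episodes.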
-
PPO vs A2C — Actor-Critic Methods
Ch. 13
Left-Right Game
PPO
A2C
Compare Proximal Policy Optimization and Advantage Actor-Critic side by side on a simple navigation task, highlighting PPO's clipped objective and multi-epoch updates.
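Per sample, PPO's clipped objective reduces to one line; ε = 0.2 below is the common default, not necessarily the demo's setting:

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate (to be maximized):
    min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r = pi_new(a|s) / pi_old(a|s) and A is the advantage."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

The clip caps how far one batch of data can push the policy, which is what makes PPO's multi-epoch updates safe where vanilla A2C takes a single gradient step per batch.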
-
CartPole — PPO vs A2C
Ch. 13
CartPole
PPO
A2C
Watch PPO and A2C learn to balance a pole on a cart side by side, with neural-network policies, a live CartPole animation, reward curves, and a PPO clipping visualization.

-
GRPO — Group Relative Policy Optimization
Modern RL
Bandit (1–10)
GRPO
REINFORCE
Interactive visualization of GRPO, the algorithm behind DeepSeek-R1. Watch a policy learn through group sampling and relative advantage normalization on a bandit-style action-selection problem with configurable reward landscapes.
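GRPO's distinctive step, normalizing each sampled action's reward against its own group in place of a learned critic baseline, can be sketched as follows (population mean/std over the group, one common formulation):

```python
def grpo_advantages(rewards):
    """Group-relative advantages: A_i = (r_i - mean(r)) / std(r).
    rewards: the rewards of one group of sampled actions."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        return [0.0] * n          # a uniform group gives no learning signal
    return [(r - mean) / std for r in rewards]
```

Each advantage then scales a REINFORCE-style log-probability gradient, so above-group-average actions are reinforced and below-average ones suppressed, exactly the dynamic the visualization animates.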
-
GRPO — In-Class Activity
Modern RL
Activity
Bandit (1–10)
GRPO
REINFORCE
A 25-minute guided worksheet: predict how the policy will change, compare GRPO with REINFORCE, and connect the algorithm to LLM training.
-
Flappy Bird RL — PPO, DQN & A2C
Deep RL
Flappy Bird
PPO
DQN
A2C
Play Flappy Bird yourself, then train neural networks with PPO, DQN, and A2C to master it. Includes a pretrained DQN model ready to play immediately, a real-time training dashboard, live AI demos, and model export/import.
-
Connect 4 — RL, MCTS & LLM Evaluation
Deep RL + MCTS
Connect 4
DQN
PPO
SARSA
REINFORCE
TD(λ)
MCTS
Minimax
Play Connect 4 against minimax, RL, and MCTS agents. Train with DQN, PPO, SARSA, REINFORCE, and TD(λ) via a real-time dashboard, explore MCTS with interactive tree visualization and phase stepping, and evaluate LLM board understanding with configurable evaluation tasks.
-
Connect 4 — Getting Started Activity
Deep RL + MCTS
Activity
Connect 4
DQN
MCTS
Minimax
A quick 10-minute guided tour of the Connect 4 simulation: play against Minimax, train a DQN agent, explore MCTS search, and brainstorm how LLMs could enhance game-playing AI.