Reinforcement Learning: Interactive Simulations

Sixteen interactive visualizations designed to help students learn key concepts from Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.

  1. Tic-Tac-Toe RL Agent Ch. 1 Activity
    Tic-Tac-Toe TD Learning
    Play against a temporal-difference learning agent that improves its value estimates in real time, illustrating the core RL idea from Section 1.5.
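The Section 1.5 update nudges the value of the previous state toward the value of the state reached after a move. A minimal sketch of that rule (state names and the 0.5 default are illustrative, not taken from the simulation's code):

```python
# Section 1.5 temporal-difference update: V(s) <- V(s) + alpha * (V(s') - V(s)).
# Illustrative sketch only; keys and defaults are invented for the example.
def td_update(values, state, next_state, alpha=0.1):
    """Move the estimate for `state` a fraction alpha toward `next_state`'s value."""
    values[state] += alpha * (values[next_state] - values[state])

V = {"before_move": 0.5, "winning_position": 1.0}   # 0.5 default, 1.0 for a win
td_update(V, "before_move", "winning_position")     # V["before_move"] is now ~0.55
```

Repeated over many games, these small backups propagate win/loss information toward earlier positions, which is exactly what the live value display shows.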
  2. 10-Armed Bandit Testbed Ch. 2
    10-Armed Bandit ε-greedy UCB Gradient Bandit
    Explore the exploration-exploitation trade-off by comparing epsilon-greedy, UCB, and gradient bandit strategies on a 10-armed testbed.
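One of the three strategies, epsilon-greedy with sample-average estimates, can be sketched in a few lines (a simplified stand-in for the testbed, with assumed arm means; not the simulation's own code):

```python
import random

def epsilon_greedy_bandit(true_means, steps=10_000, epsilon=0.1, seed=0):
    """Sample-average epsilon-greedy on a k-armed bandit (Chapter 2 style)."""
    rng = random.Random(seed)
    k = len(true_means)
    Q = [0.0] * k                                     # value estimates
    N = [0] * k                                       # pull counts
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                      # explore uniformly
        else:
            a = max(range(k), key=lambda i: Q[i])     # exploit current estimate
        r = rng.gauss(true_means[a], 1.0)             # noisy reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                     # incremental sample mean
    return Q, N

Q, N = epsilon_greedy_bandit([0.2, 1.0, 0.5])         # arm 1 ends up pulled most
```

UCB and the gradient bandit replace only the action-selection line; the incremental-mean bookkeeping stays the same.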
  3. Gridworld Value Function Ch. 3
    GridWorld Value Iteration
    Visualize state-value functions and optimal policies on the 5×5 Gridworld with special jump states (Example 3.5 / Figure 3.2).
  4. Gambler's Problem (Dynamic Programming) Ch. 4
    Gambler's Problem Value Iteration
    Watch value iteration solve the Gambler's Problem step by step, revealing how the optimal policy emerges from successive sweeps.
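The sweeps the simulation animates amount to plain in-place value iteration over capital states. A compact sketch (parameters assumed from Example 4.3; the activity's settings may differ):

```python
def gambler_value_iteration(p_h=0.4, goal=100, theta=1e-6):
    """In-place value iteration for the Gambler's Problem (Example 4.3 setup)."""
    V = [0.0] * (goal + 1)
    V[goal] = 1.0                       # reaching the goal is worth reward 1
    while True:
        delta = 0.0
        for s in range(1, goal):        # one sweep over capital states 1..goal-1
            stakes = range(1, min(s, goal - s) + 1)
            best = max(p_h * V[s + a] + (1 - p_h) * V[s - a] for a in stakes)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

V = gambler_value_iteration()           # with p_h = 0.4, V[50] converges to 0.4
```

With a subfair coin (p_h < 0.5), betting everything at capital 50 is optimal, so V[50] equals p_h exactly, which is one of the predictions the worksheet asks students to check.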
  5. Dynamic Programming — In-Class Activity Ch. 4 Activity
    Gambler's Problem Value Iteration
    A guided worksheet exploring the Gambler's Problem: predict value functions, watch convergence sweep by sweep, analyze the spiky optimal policy, and experiment with different coin probabilities.
  6. TD Learning — Random Walk & Cliff Walking Ch. 6
    Random Walk Cliff Walking TD(0) SARSA Q-Learning
    Explore TD(0), SARSA, and Q-Learning with animated Random Walk (Example 6.2) and Cliff Walking (Example 6.6) simulations, reproducing the book's key figures.
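For the Random Walk, the update the animation traces is tabular TD(0): V(s) ← V(s) + α[r + V(s′) − V(s)]. A self-contained sketch under assumed parameters (step size and episode count are my choices, not the simulation's):

```python
import random

def td0_random_walk(episodes=10_000, alpha=0.02, seed=1):
    """TD(0) prediction on the 5-state random walk of Example 6.2."""
    rng = random.Random(seed)
    V = [0.0] + [0.5] * 5 + [0.0]        # states 0..6; 0 and 6 are terminal
    for _ in range(episodes):
        s = 3                             # every episode starts in the middle (C)
        while s not in (0, 6):
            s2 = s + rng.choice((-1, 1))  # step left or right with equal chance
            r = 1.0 if s2 == 6 else 0.0   # reward 1 only on the right exit
            V[s] += alpha * (r + V[s2] - V[s])   # TD(0); terminal values are 0
            s = s2
    return V[1:6]                         # true values are 1/6, 2/6, ..., 5/6

V = td0_random_walk()
```

SARSA and Q-Learning on Cliff Walking use the same backup shape, but on action values Q(s, a) with on-policy versus max bootstrapping, which is what produces the book's safe-path/cliff-edge contrast.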
  7. Monte Carlo Tree Search (MCTS) Ch. 8
    Tic-Tac-Toe MCTS UCB1
    Step through the four MCTS phases—Selection, Expansion, Simulation, Backpropagation—on a Tic-Tac-Toe board, with a live tree visualization and tunable UCB1 exploration.
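The tunable selection rule is UCB1, which scores each child by its average value plus an exploration bonus that shrinks with visits. A hedged sketch (the `move -> (total value, visit count)` layout is my own, not the simulation's):

```python
import math

def ucb1_select(children, c=math.sqrt(2)):
    """Pick the child maximizing UCB1 = mean value + c * sqrt(ln N_parent / n_child)."""
    total = sum(n for _, n in children.values())     # parent visit count
    def score(key):
        w, n = children[key]
        if n == 0:
            return float("inf")                      # unvisited children go first
        return w / n + c * math.sqrt(math.log(total) / n)
    return max(children, key=score)

# children: move -> (total value, visit count)
children = {"a": (5.0, 10), "b": (3.0, 4), "c": (0.0, 0)}
print(ucb1_select(children))                         # "c": unvisited, tried first
```

Raising `c` widens the bonus and spreads visits across the tree, which is the behavior the exploration slider exposes.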
  8. LLM-Guided MCTS — Tic-Tac-Toe Ch. 8 + LLM
    Tic-Tac-Toe MCTS LLM Evaluation
    Compare LLM-guided MCTS against traditional random-rollout MCTS on Tic-Tac-Toe. Watch how replacing random simulations with LLM position evaluation affects move quality, convergence speed, and decision latency.
  9. Policy Gradient — REINFORCE Ch. 13
    Left-Right Game REINFORCE
    See the REINFORCE algorithm learn a stochastic policy on a simple left-right game, with live plots of policy probabilities and reward curves.
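The left-right game is small enough that the whole REINFORCE loop fits in a few lines. A sketch under assumed settings (one-step episodes, reward 1 only for "right"; not the simulation's code):

```python
import math, random

def reinforce_left_right(episodes=2000, alpha=0.1, seed=0):
    """REINFORCE on a one-step game: reward 1 for 'right' (action 1), else 0."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]                    # action preferences: [left, right]
    for _ in range(episodes):
        z = [math.exp(t) for t in theta]
        pi = [x / sum(z) for x in z]      # softmax policy over the two actions
        a = 1 if rng.random() < pi[1] else 0
        G = 1.0 if a == 1 else 0.0        # episode return
        for b in (0, 1):                  # theta += alpha * G * grad log pi(a)
            grad = (1.0 if b == a else 0.0) - pi[b]
            theta[b] += alpha * G * grad
    z = [math.exp(t) for t in theta]
    return [x / sum(z) for x in z]

pi = reinforce_left_right()               # pi[1], the 'right' probability, -> 1
```

The live probability plot in the simulation is tracking exactly this drift of the softmax toward the rewarded action.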
  10. PPO vs A2C — Actor-Critic Methods Ch. 13
    Left-Right Game PPO A2C
    Compare Proximal Policy Optimization and Advantage Actor-Critic side by side on a simple navigation task, highlighting PPO's clipped objective and multi-epoch updates.
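The clipped objective the comparison highlights can be written directly for a single sample; a minimal sketch (ε = 0.2 assumed):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# The ratio stops earning credit once it leaves the clip range:
print(ppo_clip_objective(1.5, advantage=1.0))    # 1.2, not 1.5
print(ppo_clip_objective(0.5, advantage=-1.0))   # -0.8: no gain below 1 - eps
```

Because the surrogate is flat outside [1 − ε, 1 + ε], gradients vanish there, which is what lets PPO safely reuse each batch for multiple epochs while A2C takes a single unclipped step.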
  11. CartPole — PPO vs A2C Ch. 13
    CartPole PPO A2C
    Watch PPO and A2C learn to balance a pole on a cart side by side, with neural network policies, live CartPole animation, reward curves, and PPO clipping visualization.
  12. GRPO — Group Relative Policy Optimization Modern RL
    Bandit (1–10) GRPO REINFORCE
    Interactive visualization of GRPO, the algorithm behind DeepSeek-R1. Watch a policy learn through group sampling and relative advantage normalization on a bandit-style action-selection problem with configurable reward landscapes.
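The relative advantage normalization mentioned above is the heart of GRPO: each sampled action's reward is standardized against its own group, replacing a learned critic baseline. A sketch (the group rewards are made-up numbers):

```python
import statistics

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: (r - mean) / std over one sampled group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # guard against a zero-spread group
    return [(r - mu) / sigma for r in rewards]

adv = grpo_advantages([1.0, 3.0, 5.0, 7.0])
# Advantages are zero-mean: above-average samples in the group are reinforced,
# below-average ones suppressed, with no separate value network to train.
```

These advantages then weight ordinary policy-gradient updates, which is why the simulation pairs GRPO with REINFORCE as its baseline.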
  13. GRPO — In-Class Activity Modern RL Activity
    Bandit (1–10) GRPO REINFORCE
    A 25-minute guided worksheet: predict how the policy changes, compare GRPO with REINFORCE, and connect group-relative updates to LLM training.
  14. Flappy Bird RL — PPO, DQN & A2C Deep RL
    Flappy Bird PPO DQN A2C
    Play Flappy Bird yourself, then train neural networks with PPO, DQN, and A2C to master it. Includes a pretrained DQN model ready to play immediately, a real-time training dashboard, live AI demos, and model export/import.
  15. Connect 4 — RL, MCTS & LLM Evaluation Deep RL + MCTS
    Connect 4 DQN PPO SARSA REINFORCE TD(λ) MCTS Minimax
    Play Connect 4 against minimax, RL, and MCTS agents. Train with DQN, PPO, SARSA, REINFORCE, and TD(λ) via a real-time dashboard, explore MCTS with interactive tree visualization and phase stepping, and evaluate LLM board understanding with configurable evaluation tasks.
  16. Connect 4 — Getting Started Activity Deep RL + MCTS Activity
    Connect 4 DQN MCTS Minimax
    A quick 10-minute guided tour of the Connect 4 simulation: play against Minimax, train a DQN agent, explore MCTS search, and brainstorm how LLMs could enhance game-playing AI.