-
Tic-Tac-Toe RL Agent
Ch. 1
Activity
Tic-Tac-Toe
TD Learning
Play against a temporal-difference learning agent that improves its value estimates in real time, illustrating the core RL idea from Section 1.5.
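The core of this agent, the tabular TD(0) value update from Section 1.5, can be sketched in a few lines. The board-string keys, the 0.5 default for unseen states, and the step size below are illustrative choices, not necessarily the demo's actual encoding:

```python
def td_update(V, s, s_next, alpha=0.1):
    """TD(0): nudge V(s) toward the value of the successor state,
    V(s) <- V(s) + alpha * (V(s') - V(s))."""
    V.setdefault(s, 0.5)        # unseen states start at a neutral 0.5
    V.setdefault(s_next, 0.5)
    V[s] += alpha * (V[s_next] - V[s])
    return V

V = {"win": 1.0}                # terminal winning positions are worth 1
td_update(V, "X..|OO.|X..", "win")   # V of the earlier board rises to 0.55
```

Because the update runs after every move, the agent's estimates visibly shift mid-game, which is what the demo animates.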
-
10-Armed Bandit Testbed
Ch. 2
10-Armed Bandit
ε-greedy
UCB
Gradient Bandit
Explore the exploration-exploitation trade-off by comparing ε-greedy, UCB, and gradient bandit strategies on a 10-armed testbed.
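A minimal ε-greedy runner for a testbed like this, with Gaussian arms and incremental sample-average estimates as in Chapter 2. The arm means below are made up for illustration:

```python
import random

def epsilon_greedy_bandit(true_means, steps=1000, eps=0.1, seed=0):
    """Sample-average epsilon-greedy on a stationary Gaussian bandit."""
    rng = random.Random(seed)
    k = len(true_means)
    Q = [0.0] * k               # value estimates
    N = [0] * k                 # pull counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                    # explore
        else:
            a = max(range(k), key=lambda i: Q[i])   # exploit
        r = rng.gauss(true_means[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                   # incremental average
        total += r
    return Q, total / steps

Q, avg = epsilon_greedy_bandit(
    [0.2, 1.5, 0.1, 0.9, 0.3, 0.0, 0.4, 0.8, 0.6, 1.0])
```

Sweeping `eps` in this sketch reproduces the qualitative trade-off the testbed plots: `eps=0` can lock onto a bad arm, large `eps` wastes pulls.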
-
Gridworld Value Function
Ch. 3
GridWorld
Value Iteration
Visualize state-value functions and optimal policies on the 5×5 Gridworld with special jump states (Example 3.5 / Figure 3.2).
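The value function of Figure 3.2 can be reproduced by iterative policy evaluation for the equiprobable random policy; a self-contained sketch with the grid layout, jump states A and B, and γ = 0.9 of Example 3.5:

```python
def gridworld_values(gamma=0.9, sweeps=500):
    """Evaluate the equiprobable random policy on the 5x5 gridworld:
    off-grid moves pay -1 and stay put; A jumps to A' (+10), B to B' (+5)."""
    A, A2, B, B2 = (0, 1), (4, 1), (0, 3), (2, 3)
    V = [[0.0] * 5 for _ in range(5)]
    for _ in range(sweeps):
        new = [[0.0] * 5 for _ in range(5)]
        for r in range(5):
            for c in range(5):
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    if (r, c) == A:
                        nr, nc, rew = A2[0], A2[1], 10.0
                    elif (r, c) == B:
                        nr, nc, rew = B2[0], B2[1], 5.0
                    else:
                        nr, nc = r + dr, c + dc
                        if 0 <= nr < 5 and 0 <= nc < 5:
                            rew = 0.0
                        else:
                            nr, nc, rew = r, c, -1.0
                    new[r][c] += 0.25 * (rew + gamma * V[nr][nc])
        V = new
    return V

V = gridworld_values()
# V[0][1] (state A) converges to roughly 8.8, matching Figure 3.2
```

Note that A's value is below its +10 reward because every path from A' eventually risks hitting the edge, a point the visualization makes directly.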
-
Gambler's Problem (Dynamic Programming)
Ch. 4
Gambler's Problem
Value Iteration
Watch value iteration solve the Gambler's Problem step by step, revealing how the optimal policy emerges from successive sweeps.
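The sweeps the demo animates come from plain value iteration; a compact in-place version with p_h = 0.4 and the win-at-100 convention of Example 4.3:

```python
def gamblers_value_iteration(p_h=0.4, theta=1e-9):
    """Value iteration for the Gambler's Problem: states are capital
    1..99, stakes 1..min(s, 100 - s); reaching 100 is worth 1."""
    V = [0.0] * 101
    V[100] = 1.0
    while True:
        delta = 0.0
        for s in range(1, 100):
            best = max(p_h * V[s + a] + (1 - p_h) * V[s - a]
                       for a in range(1, min(s, 100 - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    return V

V = gamblers_value_iteration()
# V[50] = 0.4: staking everything at 50 wins with probability p_h
```

Recovering the spiky optimal policy from `V` (argmax over stakes, with ties) is exactly the step the demo lets you inspect sweep by sweep.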
-
Dynamic Programming — In-Class Activity
Ch. 4
Activity
Gambler's Problem
Value Iteration
A guided worksheet exploring the Gambler's Problem: predict value functions, watch convergence sweep by sweep, analyze the spiky optimal policy, and experiment with different coin probabilities.
-
TD Learning — Random Walk & Cliff Walking
Ch. 6
Random Walk
Cliff Walking
TD(0)
SARSA
Q-Learning
Explore TD(0), SARSA, and Q-Learning with animated Random Walk (Example 6.2) and Cliff Walking (Example 6.6) simulations, reproducing the book's key figures.
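The two control updates differ only in what they bootstrap from. One-step sketches with a dict-of-(state, action) table; the α = 0.5 default is a common choice for these examples, not necessarily the demo's setting:

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=1.0):
    """SARSA (on-policy): bootstrap from the action actually taken in s'."""
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * Q.get((s2, a2), 0.0) - Q.get((s, a), 0.0))

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.5, gamma=1.0):
    """Q-learning (off-policy): bootstrap from the greedy action in s'."""
    best = max(Q.get((s2, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * best - Q.get((s, a), 0.0))
```

On Cliff Walking this single difference is why Q-learning's greedy policy hugs the cliff while SARSA's ε-greedy-aware values prefer the safer path.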
-
Monte Carlo Tree Search (MCTS)
Ch. 8
Tic-Tac-Toe
MCTS
UCB1
Step through the four MCTS phases—Selection, Expansion, Simulation, Backpropagation—on a Tic-Tac-Toe board, with a live tree visualization and tunable UCB1 exploration.
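The Selection phase's UCB1 rule, with the exploration constant the demo makes tunable, can be sketched as follows (parent visits are approximated here by the sum of child visits):

```python
import math

def ucb1_select(children, c=math.sqrt(2)):
    """Return the index of the child maximizing UCB1.
    children: list of (total_value, visit_count) pairs."""
    N = sum(n for _, n in children)        # parent visit count
    def score(child):
        w, n = child
        if n == 0:
            return float("inf")            # always try unvisited children first
        return w / n + c * math.sqrt(math.log(N) / n)
    return max(range(len(children)), key=lambda i: score(children[i]))
```

Raising `c` weights the square-root exploration bonus more heavily, which in the tree visualization shows up as wider, shallower search.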
-
LLM-Guided MCTS — Tic-Tac-Toe
Ch. 8 + LLM
Tic-Tac-Toe
MCTS
LLM Evaluation
Compare LLM-guided MCTS against traditional random-rollout MCTS on Tic-Tac-Toe. Watch how replacing random simulations with LLM position evaluation affects move quality, convergence speed, and decision latency.
-
Policy Gradient — REINFORCE
Ch. 13
Left-Right Game
REINFORCE
See the REINFORCE algorithm learn a stochastic policy on a simple left-right game, with live plots of policy probabilities and reward curves.
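A complete REINFORCE loop on a guessed version of the left-right game (one step per episode, "right" pays 1 and "left" pays 0, softmax over two action preferences); the actual demo's rewards and parameterization may differ:

```python
import math, random

def reinforce_left_right(episodes=2000, alpha=0.1, seed=0):
    """REINFORCE: theta += alpha * G * grad(log pi(a)); returns final P(right)."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]                      # preferences for left, right
    for _ in range(episodes):
        z = [math.exp(t) for t in theta]
        p_right = z[1] / (z[0] + z[1])
        a = 1 if rng.random() < p_right else 0
        G = 1.0 if a == 1 else 0.0          # episode return
        probs = [1 - p_right, p_right]
        for i in range(2):
            # grad of log softmax: indicator(i == a) - pi(i)
            grad = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += alpha * G * grad
    z = [math.exp(t) for t in theta]
    return z[1] / (z[0] + z[1])

p_right = reinforce_left_right()   # climbs toward 1 as "right" is reinforced
```

The live probability plot in the demo is essentially `p_right` traced over episodes.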
-
PPO vs A2C — Actor-Critic Methods
Ch. 13
Left-Right Game
PPO
A2C
Compare Proximal Policy Optimization and Advantage Actor-Critic side by side on a simple navigation task, highlighting PPO's clipped objective and multi-epoch updates.
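Per sample, PPO's clipped objective reduces to one line; ε = 0.2 below is the common default, not necessarily the demo's setting:

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate (to be maximized):
    min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r = pi_new(a|s) / pi_old(a|s) and A is the advantage."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

The clip caps how far one batch of data can push the policy, which is what makes PPO's multi-epoch updates safe where vanilla A2C takes a single gradient step per batch.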
-
CartPole — PPO vs A2C
Ch. 13
CartPole
PPO
A2C
Watch PPO and A2C learn to balance a pole on a cart side by side, with neural-network policies, a live CartPole animation, reward curves, and a PPO clipping visualization.

-
GRPO — Group Relative Policy Optimization
Modern RL
Bandit (1–10)
GRPO
REINFORCE
Interactive visualization of GRPO, the algorithm behind DeepSeek-R1. Watch a policy learn through group sampling and relative advantage normalization on a bandit-style action-selection problem with configurable reward landscapes.
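GRPO's distinctive step, normalizing each sampled action's reward against its own group in place of a learned critic baseline, can be sketched as follows (population mean/std over the group, one common formulation):

```python
def grpo_advantages(rewards):
    """Group-relative advantages: A_i = (r_i - mean(r)) / std(r).
    rewards: the rewards of one group of sampled actions."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        return [0.0] * n          # a uniform group gives no learning signal
    return [(r - mean) / std for r in rewards]
```

Each advantage then scales a REINFORCE-style log-probability gradient, so above-group-average actions are reinforced and below-average ones suppressed, exactly the dynamic the visualization animates.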
-
GRPO — In-Class Activity
Modern RL
Activity
Bandit (1–10)
GRPO
REINFORCE
A 25-minute guided worksheet: predict how the policy will change, compare GRPO with REINFORCE, and connect the algorithm to LLM training.
-
Flappy Bird RL — PPO, DQN & A2C
Deep RL
Flappy Bird
PPO
DQN
A2C
Play Flappy Bird yourself, then train neural networks with PPO, DQN, and A2C to master it. Includes a pretrained DQN model ready to play immediately, a real-time training dashboard, live AI demos, and model export/import.
-
Connect 4 — RL, MCTS & LLM Evaluation
Deep RL + MCTS
Connect 4
DQN
PPO
SARSA
REINFORCE
TD(λ)
MCTS
Minimax
Play Connect 4 against minimax, RL, and MCTS agents. Train with DQN, PPO, SARSA, REINFORCE, and TD(λ) via a real-time dashboard, explore MCTS with interactive tree visualization and phase stepping, and evaluate LLM board understanding with configurable evaluation tasks.
-
Connect 4 — Getting Started Activity
Deep RL + MCTS
Activity
Connect 4
DQN
MCTS
Minimax
A quick 10-minute guided tour of the Connect 4 simulation: play against Minimax, train a DQN agent, explore MCTS search, and brainstorm how LLMs could enhance game-playing AI.