- Tic-Tac-Toe RL Agent (Ch. 1): Play against a temporal-difference learning agent that improves its value estimates in real time, illustrating the core RL idea from Section 1.5 (sketch below).
- 10-Armed Bandit Testbed (Ch. 2): Explore the exploration-exploitation trade-off by comparing epsilon-greedy, UCB, and gradient bandit strategies on a 10-armed testbed (sketch below).
- Gridworld Value Function (Ch. 3): Visualize state-value functions and optimal policies on the 5×5 Gridworld with special jump states (Example 3.5 / Figure 3.2; sketch below).
- Gambler's Problem (Dynamic Programming) (Ch. 4): Watch value iteration solve the Gambler's Problem step by step, revealing how the optimal policy emerges from successive sweeps (sketch below).
- Monte Carlo Tree Search (MCTS) (Ch. 8): Step through the four MCTS phases (Selection, Expansion, Simulation, Backpropagation) on a Tic-Tac-Toe board, with a live tree visualization and a tunable UCB1 exploration constant (sketch below).
- Policy Gradient — REINFORCE (Ch. 13): See the REINFORCE algorithm learn a stochastic policy on a simple left-right game, with live plots of policy probabilities and reward curves (sketch below).
- PPO vs A2C — Actor-Critic Methods (Ch. 13): Compare Proximal Policy Optimization and Advantage Actor-Critic side by side on a simple navigation task, highlighting PPO's clipped objective and multi-epoch updates (sketch below).
- Flappy Bird RL — PPO, DQN & A2C (Deep RL): Play Flappy Bird yourself, then train neural networks with PPO, DQN, and A2C to master it. Includes a pretrained DQN model ready to play immediately, a real-time training dashboard, live AI demos, and model export/import.
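
For the Tic-Tac-Toe demo, the core of Section 1.5 is a tabular temporal-difference backup. A minimal sketch, assuming board states encoded as strings and a fixed step size (illustrative only, not the demo's code):

```python
from collections import defaultdict

ALPHA = 0.1                            # step size (assumed)
V = defaultdict(lambda: 0.5)           # estimated probability of winning from each state

def td_update(prev_state, next_state):
    """Section 1.5 backup: V(s) <- V(s) + alpha * (V(s') - V(s))."""
    V[prev_state] += ALPHA * (V[next_state] - V[prev_state])

# Terminal positions are pinned before updating: 1.0 for a win, 0.0 for a loss or draw.
V["XXX" + "OO." + "..."] = 1.0         # illustrative winning board for X
td_update("XX." + "OO." + "...", "XXX" + "OO." + "...")
print(V["XX." + "OO." + "..."])        # nudged from 0.5 toward 1.0 (about 0.55)
```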
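
For the 10-armed testbed, an epsilon-greedy run under assumed settings (Gaussian arms, sample-average estimates, epsilon = 0.1); the demo also covers UCB and gradient bandits:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_epsilon_greedy(steps=1000, k=10, eps=0.1):
    q_true = rng.normal(0, 1, k)              # true action values q*(a)
    q_est = np.zeros(k)                        # sample-average estimates Q(a)
    counts = np.zeros(k)
    rewards = np.zeros(steps)
    for t in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))           # explore: random arm
        else:
            a = int(np.argmax(q_est))          # exploit: current best estimate
        r = rng.normal(q_true[a], 1.0)         # noisy reward
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a] # incremental sample average
        rewards[t] = r
    return rewards.mean()

print(run_epsilon_greedy())
```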
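
For the Gridworld, state values under the equiprobable random policy follow from repeated Bellman backups. A sketch assuming the Example 3.5 layout (A in the second column of the top row, B in the fourth, discount 0.9):

```python
import numpy as np

GAMMA = 0.9
A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    if state == A:
        return A_PRIME, 10.0                   # any action from A jumps to A'
    if state == B:
        return B_PRIME, 5.0                    # any action from B jumps to B'
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < 5 and 0 <= c < 5:
        return (r, c), 0.0
    return state, -1.0                         # bumping the wall: stay put, reward -1

V = np.zeros((5, 5))
for _ in range(1000):                          # sweep until approximately converged
    V_new = np.zeros_like(V)
    for r in range(5):
        for c in range(5):
            for a in ACTIONS:                  # equiprobable random policy
                (nr, nc), reward = step((r, c), a)
                V_new[r, c] += 0.25 * (reward + GAMMA * V[nr, nc])
    V = V_new

print(np.round(V, 1))  # top row converges to roughly 3.3, 8.8, 4.4, 5.3, 1.5 (Figure 3.2)
```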
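
For the Gambler's Problem, a compact value-iteration sweep using the book's Example 4.3 settings (coin-heads probability 0.4, goal of 100), assumed here rather than read from the demo:

```python
import numpy as np

P_H, GOAL, THETA = 0.4, 100, 1e-9
V = np.zeros(GOAL + 1)
V[GOAL] = 1.0                                   # reaching the goal is worth 1

while True:                                     # value-iteration sweeps
    delta = 0.0
    for s in range(1, GOAL):
        stakes = range(1, min(s, GOAL - s) + 1)
        best = max(P_H * V[s + a] + (1 - P_H) * V[s - a] for a in stakes)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:
        break

# Greedy stake for capitals 1..10 under the converged value function.
policy = [max(range(1, min(s, GOAL - s) + 1),
              key=lambda a: P_H * V[s + a] + (1 - P_H) * V[s - a])
          for s in range(1, GOAL)]
print(policy[:10])
```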
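
For MCTS, the Selection phase scores children with UCB1. A sketch with dict-based nodes and the common exploration constant c = 1.4, both of which are assumptions:

```python
import math

def ucb1_select(children, c=1.4):
    """Pick the child maximizing win rate + c * sqrt(ln(parent visits) / child visits)."""
    total_visits = sum(ch["visits"] for ch in children)  # parent visit count, approximated

    def score(ch):
        if ch["visits"] == 0:
            return float("inf")                 # always try unvisited children first
        exploit = ch["wins"] / ch["visits"]
        explore = c * math.sqrt(math.log(total_visits) / ch["visits"])
        return exploit + explore

    return max(children, key=score)

children = [{"wins": 6, "visits": 10}, {"wins": 3, "visits": 4}, {"wins": 0, "visits": 0}]
print(ucb1_select(children))                    # the unvisited child is selected first
```

Raising c makes the search favor rarely visited moves, while lowering it makes it exploit the current win-rate estimates, which is what the demo's tunable exploration setting controls.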
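
For REINFORCE, a stripped-down left/right game: a single state, a softmax policy over two action preferences, and reward +1 only for choosing right. These environment details are assumptions; the demo's exact setup may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                   # action preferences for [left, right]
ALPHA = 0.1                           # step size (assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for episode in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)        # 0 = left, 1 = right
    G = 1.0 if a == 1 else 0.0        # episode return
    # Gradient of log pi(a) for a softmax policy: one-hot(a) - probs.
    grad_log_pi = np.eye(2)[a] - probs
    theta += ALPHA * G * grad_log_pi  # REINFORCE update: alpha * G * grad log pi

print(softmax(theta))                 # probability mass shifts toward 'right'
```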
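
For the PPO vs A2C comparison, the main policy-side difference is PPO's clipped surrogate. A numpy sketch with an assumed clip range of 0.2:

```python
import numpy as np

def a2c_policy_loss(log_probs, advantages):
    # Plain policy-gradient loss: -E[log pi(a|s) * A].
    return -np.mean(log_probs * advantages)

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    ratio = np.exp(new_log_probs - old_log_probs)       # pi_new / pi_old
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    # Take the pessimistic (minimum) surrogate, negated for gradient descent.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

adv = np.array([1.0, -0.5, 2.0])
old_lp = np.log(np.array([0.5, 0.4, 0.3]))
new_lp = np.log(np.array([0.7, 0.3, 0.5]))
print(a2c_policy_loss(new_lp, adv), ppo_clip_loss(new_lp, old_lp, adv))
```

Clipping keeps each of PPO's multi-epoch updates from moving the policy too far from the one that collected the data, which is the behavior the demo highlights against A2C's single unclipped update.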