Flappy Bird RL — Deep Reinforcement Learning
PPO, DQN & A2C on a Flappy Bird Environment
Deep RL Algorithms: PPO, DQN & A2C
Train a neural network to play Flappy Bird using three classic deep RL algorithms. PPO (Proximal Policy Optimization) uses a clipped surrogate objective for stable policy updates. DQN (Deep Q-Network) learns action values with experience replay and a target network. A2C (Advantage Actor-Critic) combines policy gradient with a learned baseline for variance reduction.
PPO, clipped surrogate objective:
$$r(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}, \qquad L^{\text{CLIP}} = \min\!\big(r(\theta)\,A,\ \operatorname{clip}(r(\theta),\, 1-\epsilon,\, 1+\epsilon)\,A\big)$$

DQN, temporal-difference target and loss:
$$y = r + \gamma \max_{a'} Q_{\text{target}}(s', a'), \qquad L = \big(Q(s,a) - y\big)^2$$

A2C, one-step advantage and policy gradient:
$$A = r + \gamma V(s') - V(s), \qquad \nabla J = A \cdot \nabla \log \pi(a \mid s)$$
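To make these update rules concrete, here is a minimal PyTorch sketch of the three losses. Networks and rollout data are stubbed with random tensors so it runs standalone; it illustrates the math above, not this page's in-browser implementation.

```python
import torch
import torch.nn.functional as F

B, A = 64, 2                         # batch size; 2 actions (flap / no-flap)
gamma, eps = 0.99, 0.2

# Stand-in rollout data (in practice these come from the environment
# and the networks; random tensors keep the sketch self-contained).
rewards    = torch.randn(B)
advantages = torch.randn(B)
log_p_old  = torch.randn(B)                        # log pi_old(a|s), from rollout
log_p_new  = torch.randn(B, requires_grad=True)    # log pi_theta(a|s), current policy
actions    = torch.randint(0, A, (B,))

# --- PPO: clipped surrogate objective ---
ratio = torch.exp(log_p_new - log_p_old)           # r(theta)
ppo_loss = -torch.min(ratio * advantages,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()

# --- DQN: TD target from a frozen target network ---
q      = torch.randn(B, A, requires_grad=True)     # Q(s, .) from the online net
q_next = torch.randn(B, A)                         # Q_target(s', .), no gradient
y = rewards + gamma * q_next.max(dim=1).values     # y = r + gamma max_a' Q_target
dqn_loss = F.mse_loss(q.gather(1, actions.unsqueeze(1)).squeeze(1), y)

# --- A2C: one-step advantage, policy-gradient loss ---
v, v_next = torch.randn(B, requires_grad=True), torch.randn(B)
adv = (rewards + gamma * v_next - v).detach()      # A = r + gamma V(s') - V(s)
a2c_loss = -(adv * log_p_new).mean()               # gradient is A * grad log pi(a|s)
```

In a real training loop the stand-in tensors would come from the policy, value, and target networks and from environment rollouts.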
Play Flappy Bird
Press Space or click the canvas to flap, and navigate through the pipes to score points. The side panel tracks your score and the live game state: velocity, next pipe, and gap center.
How It Works
The bird experiences constant gravity of 0.5 px/frame² downward. Each flap applies an upward impulse of −8 px/frame. Pipes scroll left at 2.5 px/frame with a gap of 130 px. The game ends on collision with a pipe, the ceiling, or the ground.
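For reference, here is a small Python sketch of one physics frame using the constants above. The flap-sets-velocity convention, the canvas height, and the simplified collision test are assumptions, not the page's exact code.

```python
# Constants from the text; HEIGHT and the collision test are assumptions.
GRAVITY    = 0.5    # px/frame^2, constant downward acceleration
FLAP_VY    = -8.0   # px/frame, velocity after a flap (negative y is up)
PIPE_SPEED = 2.5    # px/frame, pipes scroll left
GAP        = 130.0  # px, vertical opening between pipe halves
HEIGHT     = 512.0  # px, assumed canvas height

def step(bird_y, bird_vy, pipe_x, gap_center_y, flap):
    """Advance the game one frame; return the new state and a done flag."""
    bird_vy = FLAP_VY if flap else bird_vy + GRAVITY
    bird_y += bird_vy
    pipe_x -= PIPE_SPEED
    out_of_bounds = bird_y <= 0.0 or bird_y >= HEIGHT    # ceiling / ground
    in_gap = abs(bird_y - gap_center_y) < GAP / 2.0
    hit_pipe = pipe_x <= 0.0 and not in_gap              # simplified overlap test
    return bird_y, bird_vy, pipe_x, (out_of_bounds or hit_pipe)
```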
In the Train tab, neural networks learn to play by observing 7 normalized state features and choosing to flap or not each frame.
Training controls let you pick an algorithm (PPO, DQN, or A2C), manage the model, and follow progress in the training log.
Training Dashboard
You can load pretrained DQN results (1,000 episodes) to populate the charts below without training from scratch.
Episode Reward (smoothed)
Total reward accumulated per episode, smoothed over a window. Rising trend means the agent is learning to survive longer and pass more pipes.
Pipe Score per Episode
Number of pipes successfully passed each episode. This is the raw game performance metric — higher is better.
Loss Curve
Policy loss from the clipped surrogate objective (shown for PPO). A decreasing, flattening curve suggests the policy updates are settling.
Algorithm Detail
Clip fraction: the share of probability ratios r(θ) that fall outside [1−ε, 1+ε] and are therefore clipped. High values mean the policy is changing rapidly; a decreasing trend indicates convergence.
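A clip-fraction diagnostic is typically computed as below (a PyTorch sketch; the page's exact definition may differ).

```python
import torch

def clip_fraction(ratio: torch.Tensor, eps: float = 0.2) -> float:
    """Share of probability ratios r(theta) outside [1 - eps, 1 + eps]."""
    return ((ratio - 1.0).abs() > eps).float().mean().item()
```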
Metrics Export
Export training metrics as JSON for offline analysis or for import into visualization tools such as Weights & Biases (wandb).
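For example, the exported file can be loaded for offline plotting. The filename and field names below ("episode", "reward") are assumptions; inspect your export to confirm the actual schema.

```python
import json
import matplotlib.pyplot as plt

with open("flappy_metrics.json") as f:   # hypothetical export filename
    runs = json.load(f)                  # assumed: a list of per-episode records

episodes = [row["episode"] for row in runs]
rewards  = [row["reward"] for row in runs]

plt.plot(episodes, rewards, label="episode reward")
plt.xlabel("Episode")
plt.ylabel("Reward")
plt.legend()
plt.show()
```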
AI Demo
About Demo Mode
Watch the trained AI agent play Flappy Bird in real time. The State Features panel shows the 7 normalized inputs the network sees each frame; the Action Probabilities panel shows the policy output directly (for PPO and A2C) or probabilities derived from Q-values (for DQN), as in the sketch below.
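One common way to derive display probabilities from Q-values is a softmax over actions. This is an assumption about the demo, since a greedy DQN policy is itself deterministic and any "probabilities" exist only for display.

```python
import torch
import torch.nn.functional as F

q_values = torch.tensor([1.3, -0.4])  # example Q(s, flap), Q(s, no-flap)
probs = F.softmax(q_values, dim=0)    # -> tensor([0.8455, 0.1545])
```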
References
- Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. "Proximal Policy Optimization Algorithms." arXiv preprint arXiv:1707.06347 (2017). arxiv.org/abs/1707.06347
- Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (2015): 529–33. doi.org/10.1038/nature14236
- Mnih, Volodymyr, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. "Asynchronous Methods for Deep Reinforcement Learning." In Proceedings of the 33rd International Conference on Machine Learning (ICML), 1928–37. 2016. arxiv.org/abs/1602.01783
- Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press, 2018. Ch. 13, "Policy Gradient Methods." incompleteideas.net/book
- Nguyen, Dong. Flappy Bird (2013). flappybird.io