Flappy Bird RL — Deep Reinforcement Learning

PPO, DQN & A2C on a Flappy Bird Environment

Deep RL Algorithms: PPO, DQN & A2C

Train a neural network to play Flappy Bird using three classic deep RL algorithms. PPO (Proximal Policy Optimization) uses a clipped surrogate objective for stable policy updates. DQN (Deep Q-Network) learns action values with experience replay and a target network. A2C (Advantage Actor-Critic) combines policy gradient with a learned baseline for variance reduction.

PPO Objective:
r(θ) = π_θ(a|s) / π_θ_old(a|s)
L = min(r·A, clip(r, 1−ε, 1+ε)·A)
DQN Loss:
y = r + γ · max_a' Q_target(s', a')
L = (Q(s,a) − y)²
A2C Update:
A = r + γV(s') − V(s)
∇J = A · ∇ log π(a|s)
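
For concreteness, here is a minimal NumPy sketch of the three update rules above. The array names (ratios, advantages, q_sa, dones) and the terminal-state masking are illustrative assumptions, not the demo's actual code.

    import numpy as np

    def ppo_clipped_loss(ratios, advantages, eps=0.2):
        """PPO clipped surrogate: maximize E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
        unclipped = ratios * advantages
        clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
        return -np.mean(np.minimum(unclipped, clipped))  # negated so it can be minimized

    def dqn_td_loss(q_sa, rewards, q_next_target, dones, gamma=0.99):
        """DQN target y = r + gamma * max_a' Q_target(s', a'); mean squared TD error."""
        targets = rewards + gamma * np.max(q_next_target, axis=1) * (1.0 - dones)
        return np.mean((q_sa - targets) ** 2)

    def a2c_policy_loss(log_probs, values, next_values, rewards, dones, gamma=0.99):
        """A2C advantage A = r + gamma*V(s') - V(s); policy loss = -E[A * log pi(a|s)]."""
        advantages = rewards + gamma * next_values * (1.0 - dones) - values
        return -np.mean(advantages * log_probs)

In practice each of these policy/value objectives is optimized alongside a value-function loss (and, for PPO/A2C, an entropy bonus).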

Play Flappy Bird

Press Space or click the canvas to flap. Navigate through pipes to score points.

Space / Click to flap  |  R to restart

Score

[Live scoreboard: Score, Last, Best, Games]

Game State

[Live telemetry: Bird Y, Velocity, Next Pipe distance, Gap Center]

How It Works

The bird experiences constant downward gravity of 0.5 px/frame². Each flap applies an upward impulse of −8 px/frame (negative is up in screen coordinates). Pipes scroll left at 2.5 px/frame with a gap of 130 px. The game ends on collision with a pipe, the ceiling, or the ground.
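
A minimal sketch of one physics frame under those constants; the data layout (dicts for the bird and pipes) and the collision geometry (screen height, bird radius, pipe width) are illustrative assumptions.

    GRAVITY = 0.5        # px/frame^2, pulls the bird down each frame
    FLAP_IMPULSE = -8.0  # px/frame, negative is up in screen coordinates
    PIPE_SPEED = 2.5     # px/frame, pipes scroll left
    PIPE_GAP = 130.0     # px, vertical opening between pipe halves
    SCREEN_H = 512.0     # px, illustrative screen height (assumption)
    BIRD_R = 12.0        # px, illustrative bird collision radius (assumption)
    PIPE_W = 52.0        # px, illustrative pipe width (assumption)

    def step_physics(bird, pipes, flap):
        """Advance one frame; return True if the bird crashed."""
        # Assumes a flap resets vertical velocity to the impulse value.
        bird["vy"] = FLAP_IMPULSE if flap else bird["vy"] + GRAVITY
        bird["y"] += bird["vy"]

        for pipe in pipes:
            pipe["x"] -= PIPE_SPEED

        # Ceiling or ground contact ends the episode.
        if bird["y"] - BIRD_R < 0 or bird["y"] + BIRD_R > SCREEN_H:
            return True

        # Pipe collision: bird horizontally inside a pipe but outside its gap.
        for pipe in pipes:
            if abs(pipe["x"] - bird["x"]) < (PIPE_W / 2 + BIRD_R):
                if abs(bird["y"] - pipe["gap_center"]) > PIPE_GAP / 2 - BIRD_R:
                    return True
        return False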

In the Train tab, neural networks learn to play by observing 7 normalized state features and choosing to flap or not each frame.
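
One plausible way to assemble those 7 features, using the demo's own labels (Y, Vel, Dist, Top, Bot, RelY, VDir); the normalization constants and the sign encoding of VDir are assumptions.

    import numpy as np

    def observe(bird, next_pipe, screen_h=512.0, max_dist=300.0, max_vy=10.0):
        """Build the 7-feature observation: Y, Vel, Dist, Top, Bot, RelY, VDir."""
        gap_top = next_pipe["gap_center"] - 65.0   # gap is 130 px tall
        gap_bot = next_pipe["gap_center"] + 65.0
        return np.array([
            bird["y"] / screen_h,                              # Y: bird height
            bird["vy"] / max_vy,                               # Vel: vertical velocity
            (next_pipe["x"] - bird["x"]) / max_dist,           # Dist: horizontal distance to pipe
            gap_top / screen_h,                                # Top: gap top edge
            gap_bot / screen_h,                                # Bot: gap bottom edge
            (bird["y"] - next_pipe["gap_center"]) / screen_h,  # RelY: offset from gap center
            1.0 if bird["vy"] > 0 else -1.0,                   # VDir: falling (+1) or rising (-1), assumed encoding
        ], dtype=np.float32)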

Algorithm

[Hyperparameter controls: 3e-4, 0.99, 0.20, 0.95, 2]

Training Controls

[Live training stats: Episode, Avg Reward, Best Score, Avg Steps]

Model

Training Log

Training Dashboard

Load pretrained DQN results (1000 episodes)

Episode Reward (smoothed)

Total reward accumulated per episode, smoothed over a window. Rising trend means the agent is learning to survive longer and pass more pipes.
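
A trailing moving average is one simple way to produce such a curve; the window size here is an arbitrary choice, not the demo's setting.

    import numpy as np

    def smooth(rewards, window=20):
        """Trailing moving average of per-episode rewards, same length as the input."""
        rewards = np.asarray(rewards, dtype=float)
        out = np.empty_like(rewards)
        for i in range(len(rewards)):
            out[i] = rewards[max(0, i - window + 1): i + 1].mean()
        return out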

Pipe Score per Episode

Number of pipes successfully passed each episode. This is the raw game performance metric — higher is better.

Loss Curve

Policy gradient loss from clipped surrogate objective. Lower values indicate the policy is stabilizing.

Algorithm Detail

Clip fraction: the share of probability ratios that fall outside [1−ε, 1+ε] and are therefore clipped. High values mean the policy is changing rapidly; decreasing values indicate convergence.
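
One way to compute that statistic from the PPO probability ratios (variable names are illustrative):

    import numpy as np

    def clip_fraction(ratios, eps=0.2):
        """Share of probability ratios pi_new/pi_old landing outside [1-eps, 1+eps]."""
        ratios = np.asarray(ratios, dtype=float)
        return float(np.mean(np.abs(ratios - 1.0) > eps))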

Metrics Export

Export training metrics as JSON for analysis or import into visualization tools like wandb.
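
A minimal export might look like the following; the field names are illustrative, not the demo's actual schema.

    import json

    def export_metrics(path, episode_rewards, pipe_scores, losses):
        """Dump per-episode training metrics to a JSON file for external analysis."""
        payload = {
            "episode_rewards": list(map(float, episode_rewards)),
            "pipe_scores": list(map(int, pipe_scores)),
            "losses": list(map(float, losses)),
        }
        with open(path, "w") as f:
            json.dump(payload, f, indent=2)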

AI Demo

AI Statistics

[Live stats: Games, Avg Score, Max Score, Current]

State Features

Y · Vel · Dist · Top · Bot · RelY · VDir

Action Probabilities

[Live bars: P(No Flap), P(Flap)]

About Demo Mode

Watch the trained AI agent play Flappy Bird in real time. The State Features panel shows the 7 normalized inputs the network sees each frame. Action Probabilities show the policy output (for PPO/A2C) or probabilities derived from Q-values (for DQN).
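
DQN's greedy policy has no native probabilities; applying a softmax to the Q-values is one common way to produce a comparable display, and is only an assumption about how the demo derives them.

    import numpy as np

    def q_to_display_probs(q_values, temperature=1.0):
        """Softmax over Q-values, purely for visualizing relative action preference."""
        q = np.asarray(q_values, dtype=float) / temperature
        q -= q.max()                 # subtract max for numerical stability
        p = np.exp(q)
        return p / p.sum()

    # e.g. q_to_display_probs([0.3, 1.1]) -> roughly [0.31, 0.69] for [No Flap, Flap]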

References:
  • Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. "Proximal Policy Optimization Algorithms." arXiv preprint arXiv:1707.06347 (2017). arxiv.org/abs/1707.06347
  • Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (2015): 529–33. doi.org/10.1038/nature14236
  • Mnih, Volodymyr, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. "Asynchronous Methods for Deep Reinforcement Learning." In Proceedings of the 33rd International Conference on Machine Learning (ICML), 1928–37. 2016. arxiv.org/abs/1602.01783
  • Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press, 2018. Ch. 13, "Policy Gradient Methods." incompleteideas.net/book
  • Nguyen, Dong. Flappy Bird (2013). flappybird.io