Flappy Bird RL — Deep Reinforcement Learning

PPO, DQN & A2C on a Flappy Bird Environment

Deep RL Algorithms: PPO, DQN & A2C

Train a neural network to play Flappy Bird using three classic deep RL algorithms. PPO (Proximal Policy Optimization) uses a clipped surrogate objective for stable policy updates. DQN (Deep Q-Network) learns action values with experience replay and a target network. A2C (Advantage Actor-Critic) combines policy gradient with a learned baseline for variance reduction.

PPO Objective:
r(θ) = π_θ(a|s) / π_θ_old(a|s)
L = min(r·A, clip(r, 1−ε, 1+ε)·A)
DQN Loss:
y = r + γ · max_a' Q_target(s', a')
L = (Q(s,a) − y)²
A2C Update:
A = r + γV(s') − V(s)
∇J = A · ∇ log π(a|s)
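
For concreteness, here is a minimal NumPy sketch of the three update rules above. The array names (ratios, advantages, q_sa, dones) and the terminal-state masking are illustrative assumptions, not the demo's actual code.

    import numpy as np

    def ppo_clipped_loss(ratios, advantages, eps=0.2):
        """PPO clipped surrogate: maximize E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
        unclipped = ratios * advantages
        clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
        return -np.mean(np.minimum(unclipped, clipped))  # negated so it can be minimized

    def dqn_td_loss(q_sa, rewards, q_next_target, dones, gamma=0.99):
        """DQN target y = r + gamma * max_a' Q_target(s', a'); mean squared TD error."""
        targets = rewards + gamma * np.max(q_next_target, axis=1) * (1.0 - dones)
        return np.mean((q_sa - targets) ** 2)

    def a2c_policy_loss(log_probs, values, next_values, rewards, dones, gamma=0.99):
        """A2C advantage A = r + gamma*V(s') - V(s); policy loss = -E[A * log pi(a|s)]."""
        advantages = rewards + gamma * next_values * (1.0 - dones) - values
        return -np.mean(advantages * log_probs)

In practice each of these policy/value objectives is optimized alongside a value-function loss (and, for PPO/A2C, an entropy bonus).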

Play Flappy Bird

Press Space or click the canvas to flap. Navigate through pipes to score points.

Space / Click to flap  |  R to restart

Score

[Live scoreboard: Score, Last, Best, Games]

Game State

[Live telemetry: Bird Y, Velocity, Next Pipe distance, Gap Center]

How It Works

The bird experiences constant downward gravity of 0.5 px/frame². Each flap applies an upward impulse of −8 px/frame (negative is up in screen coordinates). Pipes scroll left at 2.5 px/frame with a gap of 130 px. The game ends on collision with a pipe, the ceiling, or the ground.
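
A minimal sketch of one physics frame under those constants; the data layout (dicts for the bird and pipes) and the collision geometry (screen height, bird radius, pipe width) are illustrative assumptions.

    GRAVITY = 0.5        # px/frame^2, pulls the bird down each frame
    FLAP_IMPULSE = -8.0  # px/frame, negative is up in screen coordinates
    PIPE_SPEED = 2.5     # px/frame, pipes scroll left
    PIPE_GAP = 130.0     # px, vertical opening between pipe halves
    SCREEN_H = 512.0     # px, illustrative screen height (assumption)
    BIRD_R = 12.0        # px, illustrative bird collision radius (assumption)
    PIPE_W = 52.0        # px, illustrative pipe width (assumption)

    def step_physics(bird, pipes, flap):
        """Advance one frame; return True if the bird crashed."""
        # Assumes a flap resets vertical velocity to the impulse value.
        bird["vy"] = FLAP_IMPULSE if flap else bird["vy"] + GRAVITY
        bird["y"] += bird["vy"]

        for pipe in pipes:
            pipe["x"] -= PIPE_SPEED

        # Ceiling or ground contact ends the episode.
        if bird["y"] - BIRD_R < 0 or bird["y"] + BIRD_R > SCREEN_H:
            return True

        # Pipe collision: bird horizontally inside a pipe but outside its gap.
        for pipe in pipes:
            if abs(pipe["x"] - bird["x"]) < (PIPE_W / 2 + BIRD_R):
                if abs(bird["y"] - pipe["gap_center"]) > PIPE_GAP / 2 - BIRD_R:
                    return True
        return False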

In the Train tab, neural networks learn to play by observing 7 normalized state features and choosing to flap or not each frame.
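
One plausible way to assemble those 7 features, using the demo's own labels (Y, Vel, Dist, Top, Bot, RelY, VDir); the normalization constants and the sign encoding of VDir are assumptions.

    import numpy as np

    def observe(bird, next_pipe, screen_h=512.0, max_dist=300.0, max_vy=10.0):
        """Build the 7-feature observation: Y, Vel, Dist, Top, Bot, RelY, VDir."""
        gap_top = next_pipe["gap_center"] - 65.0   # gap is 130 px tall
        gap_bot = next_pipe["gap_center"] + 65.0
        return np.array([
            bird["y"] / screen_h,                              # Y: bird height
            bird["vy"] / max_vy,                               # Vel: vertical velocity
            (next_pipe["x"] - bird["x"]) / max_dist,           # Dist: horizontal distance to pipe
            gap_top / screen_h,                                # Top: gap top edge
            gap_bot / screen_h,                                # Bot: gap bottom edge
            (bird["y"] - next_pipe["gap_center"]) / screen_h,  # RelY: offset from gap center
            1.0 if bird["vy"] > 0 else -1.0,                   # VDir: falling (+1) or rising (-1), assumed encoding
        ], dtype=np.float32)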

Algorithm

[Hyperparameter controls: 3e-4, 0.99, 0.20, 0.95, 2]

Training Controls

[Live training stats: Episode, Avg Reward, Best Score, Avg Steps]

Model

Training Log

Training Dashboard

Load pretrained DQN results (1000 episodes)

Episode Reward (smoothed)

Total reward accumulated per episode, smoothed over a window. Rising trend means the agent is learning to survive longer and pass more pipes.
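
A trailing moving average is one simple way to produce such a curve; the window size here is an arbitrary choice, not the demo's setting.

    import numpy as np

    def smooth(rewards, window=20):
        """Trailing moving average of per-episode rewards, same length as the input."""
        rewards = np.asarray(rewards, dtype=float)
        out = np.empty_like(rewards)
        for i in range(len(rewards)):
            out[i] = rewards[max(0, i - window + 1): i + 1].mean()
        return out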

Pipe Score per Episode

Number of pipes successfully passed each episode. This is the raw game performance metric — higher is better.

Loss Curve

Policy gradient loss from clipped surrogate objective. Lower values indicate the policy is stabilizing.

Algorithm Detail

Clip fraction: the share of probability ratios that fall outside [1−ε, 1+ε] and are therefore clipped. High values mean the policy is changing rapidly; decreasing values indicate convergence.
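
One way to compute that statistic from the PPO probability ratios (variable names are illustrative):

    import numpy as np

    def clip_fraction(ratios, eps=0.2):
        """Share of probability ratios pi_new/pi_old landing outside [1-eps, 1+eps]."""
        ratios = np.asarray(ratios, dtype=float)
        return float(np.mean(np.abs(ratios - 1.0) > eps))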

Metrics Export

Export training metrics as JSON for analysis or import into visualization tools like wandb.
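
A minimal export might look like the following; the field names are illustrative, not the demo's actual schema.

    import json

    def export_metrics(path, episode_rewards, pipe_scores, losses):
        """Dump per-episode training metrics to a JSON file for external analysis."""
        payload = {
            "episode_rewards": list(map(float, episode_rewards)),
            "pipe_scores": list(map(int, pipe_scores)),
            "losses": list(map(float, losses)),
        }
        with open(path, "w") as f:
            json.dump(payload, f, indent=2)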

AI Demo

AI Statistics

[Live stats: Games, Avg Score, Max Score, Current]

State Features

Y · Vel · Dist · Top · Bot · RelY · VDir

Action Probabilities

[Live bars: P(No Flap), P(Flap)]

About Demo Mode

Watch the trained AI agent play Flappy Bird in real time. The State Features panel shows the 7 normalized inputs the network sees each frame. Action Probabilities show the policy output (for PPO/A2C) or probabilities derived from Q-values (for DQN).
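
DQN's greedy policy has no native probabilities; applying a softmax to the Q-values is one common way to produce a comparable display, and is only an assumption about how the demo derives them.

    import numpy as np

    def q_to_display_probs(q_values, temperature=1.0):
        """Softmax over Q-values, purely for visualizing relative action preference."""
        q = np.asarray(q_values, dtype=float) / temperature
        q -= q.max()                 # subtract max for numerical stability
        p = np.exp(q)
        return p / p.sum()

    # e.g. q_to_display_probs([0.3, 1.1]) -> roughly [0.31, 0.69] for [No Flap, Flap]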

References:
  • Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. "Proximal Policy Optimization Algorithms." arXiv preprint arXiv:1707.06347 (2017). arxiv.org/abs/1707.06347
  • Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (2015): 529–33. doi.org/10.1038/nature14236
  • Mnih, Volodymyr, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. "Asynchronous Methods for Deep Reinforcement Learning." In Proceedings of the 33rd International Conference on Machine Learning (ICML), 1928–37. 2016. arxiv.org/abs/1602.01783
  • Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press, 2018. Ch. 13, "Policy Gradient Methods." incompleteideas.net/book
  • Nguyen, Dong. Flappy Bird (2013). flappybird.io