CartPole — PPO vs A2C
Sutton & Barto §13 — Policy Gradient with Neural Network Policy
Actor-Critic Methods: A2C and PPO with a Neural Network
Both A2C (Advantage Actor-Critic) and PPO (Proximal Policy Optimization) are actor-critic policy gradient methods. The actor learns a parameterized policy π(a|s;θ), while the critic learns a value function V(s;w) used to compute the advantage A(s,a) = r + γV(s') − V(s), which reduces variance compared to raw returns.
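As a concrete illustration (a minimal Python sketch with illustrative names, not necessarily this page's implementation), the one-step advantage estimate can be computed from a single transition and the critic:

```python
# One-step advantage estimate A(s,a) = r + γV(s') − V(s).
# `value` stands in for the critic V(s;w); it is an assumed callable here.
def advantage(reward, state, next_state, done, value, gamma=0.99):
    # Bootstrap from V(s') unless the episode terminated at s'.
    target = reward + (0.0 if done else gamma * value(next_state))
    return target - value(state)
```

Both agents then weight the policy gradient by this quantity rather than by the raw return.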
A2C updates the policy directly with advantage-weighted gradient steps, which can be large and destabilizing. PPO constrains each update by clipping the probability ratio r(θ) = π_new(a|s) / π_old(a|s) to [1−ε, 1+ε], and reuses each trajectory for several update epochs, improving both sample efficiency and stability.
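The clipped surrogate objective can be sketched in a few lines (NumPy, hypothetical names; `new_logp` and `old_logp` are log-probabilities of the sampled actions under the current and snapshot policies):

```python
import numpy as np

def ppo_surrogate(new_logp, old_logp, adv, eps=0.2):
    # Probability ratio r(θ) = π_new(a|s) / π_old(a|s), computed in log space.
    ratio = np.exp(new_logp - old_logp)
    # Clip the ratio to [1−ε, 1+ε] and take the pessimistic (minimum) bound.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))
```

Maximizing this objective removes any incentive to move the ratio outside [1−ε, 1+ε], which is what keeps the repeated epochs over a single trajectory stable.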
Network architecture: a shared hidden layer (4 → 16, tanh) feeding an actor head (2 units, softmax) and a critic head (1 unit, linear), with Xavier initialization. State: [x, ẋ, θ, θ̇] (cart position, cart velocity, pole angle, pole angular velocity), each normalized to roughly [−1, 1].
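A minimal NumPy sketch consistent with that description (parameter names and the seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier(n_in, n_out):
    # Xavier/Glorot uniform initialization, scaled by fan-in + fan-out.
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

# Shared hidden layer 4 → 16 (tanh), then actor and critic heads.
W1, b1 = xavier(4, 16), np.zeros(16)
Wa, ba = xavier(16, 2), np.zeros(2)   # actor: 2 action logits → softmax
Wc, bc = xavier(16, 1), np.zeros(1)   # critic: scalar value, linear

def forward(state):
    h = np.tanh(state @ W1 + b1)      # shared hidden features
    logits = h @ Wa + ba
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()              # softmax policy π(a|s;θ)
    v = (h @ Wc + bc)[0]              # value estimate V(s;w)
    return probs, v
```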
A2C actor update: θ ← θ + α_actor · A · ∇_θ log π(a|s;θ)
A2C critic update: V(s) ← V(s) + α_critic · (r + γV(s') − V(s))
PPO probability ratio: r(θ) = π_new(a|s) / π_old(a|s)
PPO clipped objective: L = min(r(θ)·A, clip(r(θ), 1−ε, 1+ε)·A)
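To see what the clip does, a quick numeric check with ε = 0.2 and a positive advantage (values are hypothetical):

```python
# clip(r, 1−ε, 1+ε) written with builtins; A > 0 in this example.
eps, A = 0.2, 1.0
for r in (0.5, 1.0, 1.5):
    L = min(r * A, max(min(r, 1 + eps), 1 - eps) * A)
    print(f"r(θ)={r:.1f} → L={L:.1f}")  # prints 0.5, 1.0, 1.2
```

At r(θ) = 1.5 the objective is capped at 1.2·A, so the gradient gives no further push past 1+ε; symmetrically, with a negative advantage there is no incentive to push the ratio below 1−ε.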
Interactive controls and panels: side-by-side CartPole simulations for A2C and PPO with per-agent statistics, a Reward Comparison chart, a Policy Stability chart (max probability ratio), a PPO Clipping Detail view, and last-update summaries for each agent.
References
Textbook: Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Chapter 13: Policy Gradient Methods.
PPO Paper: Schulman, J. et al. (2017). “Proximal Policy Optimization Algorithms.” arXiv:1707.06347.
CartPole: Barto, A. G., Sutton, R. S. & Anderson, C. W. (1983). “Neuronlike adaptive elements that can solve difficult learning control problems.” IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 834–846.