CartPole — PPO vs A2C

Sutton & Barto §13 — Policy Gradient with Neural Network Policy | Activity Worksheet

Actor-Critic Methods: A2C and PPO with Neural Network

Both A2C (Advantage Actor-Critic) and PPO (Proximal Policy Optimization) are actor-critic policy gradient methods. The actor learns a parameterized policy π(a|s;θ), while the critic learns a value function V(s;w) used to compute the advantage A(s,a) = r + γV(s') − V(s), which reduces variance compared to raw returns.
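The one-step advantage estimate above can be sketched in a few lines of plain Python (the function name and example numbers are illustrative, not part of the simulation):

```python
def td_advantage(r, v_s, v_s_next, gamma=0.99, done=False):
    """One-step advantage A(s,a) = r + γ·V(s') − V(s).
    On terminal transitions the bootstrap term γ·V(s') is dropped."""
    target = r + (0.0 if done else gamma * v_s_next)
    return target - v_s

# Example: reward 1.0, V(s) = 10.0, V(s') = 10.5
adv = td_advantage(1.0, 10.0, 10.5)   # 1.0 + 0.99·10.5 − 10.0 = 1.395
```

Because the critic's estimate V(s) is subtracted as a baseline, the update pushes the policy toward actions that did *better than expected*, not just actions followed by high return.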

A2C updates the policy directly using the advantage — this can lead to large, destabilizing updates. PPO constrains updates by clipping the probability ratio r(θ) = πnew(a|s) / πold(a|s) to [1−ε, 1+ε], and reuses each trajectory for multiple update epochs, improving sample efficiency and stability.
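A minimal sketch of the per-step clipped term (function name and numbers are illustrative):

```python
def ppo_clipped_term(ratio, advantage, eps=0.2):
    """PPO surrogate for one step: min(r·A, clip(r, 1−ε, 1+ε)·A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# If the new policy overshoots (ratio 1.5) on a positive advantage,
# the clip caps the incentive at (1+ε)·A:
ppo_clipped_term(1.5, 2.0)   # → 2.4, not 3.0
```

The outer min makes the bound pessimistic in both directions: with a negative advantage and a shrunken ratio, the clipped branch dominates the same way, so no single trajectory can drag the policy far from π_old.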

Network architecture: A shared hidden layer 4 → 16 (tanh) → actor(2, softmax) + critic(1, linear) with Xavier initialization. State: [x, ẋ, θ, θ̇] normalized to roughly [−1, 1].
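The forward pass of this architecture can be sketched in NumPy (the seed and test state are arbitrary; the simulation's actual weights and framework may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier(fan_in, fan_out):
    # Xavier/Glorot uniform initialization
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W_h, b_h = xavier(4, 16), np.zeros(16)     # shared hidden layer, tanh
W_pi, b_pi = xavier(16, 2), np.zeros(2)    # actor head (softmax over 2 actions)
W_v, b_v = xavier(16, 1), np.zeros(1)      # critic head (linear)

def forward(state):
    h = np.tanh(state @ W_h + b_h)
    logits = h @ W_pi + b_pi
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    value = float(h @ W_v + b_v)
    return probs, value

probs, value = forward(np.array([0.1, 0.0, -0.05, 0.0]))  # [x, ẋ, θ, θ̇]
```

Both heads read from the same 16-unit hidden layer, so actor and critic share feature learning while keeping separate output weights.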

A2C Update:
θ ← θ + αactor · A · ∇θ log π(a|s;θ)
V(s;w) ← V(s;w) + αcritic · (r + γV(s';w) − V(s;w))
PPO Clipped Objective:
r(θ) = πnew(a|s) / πold(a|s)
L^CLIP = 𝔼[ min(r(θ)·A, clip(r(θ), 1−ε, 1+ε)·A) ]
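As a worked example of the A2C actor step (toy numbers; the gradient is taken with respect to the softmax logits z, for which ∇z log π(a) = onehot(a) − π):

```python
import numpy as np

# One A2C actor step for a softmax policy over two actions.
probs = np.array([0.6, 0.4])   # current π(·|s)
action = 1                     # action taken
advantage = 0.5                # A(s,a) from the critic
alpha_actor = 1e-3

# ∇z log π(a) = onehot(a) − π for softmax logits z
grad_logits = -probs.copy()
grad_logits[action] += 1.0     # → [-0.6, 0.6]

logits_update = alpha_actor * advantage * grad_logits
# A positive advantage shifts probability mass toward the taken action;
# the critic is updated separately with the TD error as in the rule above.
```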

Controls

Actor learning rate αactor = 0.0010
Critic learning rate αcritic = 0.0010
Discount γ = 0.99
Clip ε = 0.20
PPO epochs = 4
Space — Step · R — Reset

A2C — CartPole

PPO — CartPole

A2C Statistics

Episodes: 0 · Avg Reward (50): 0.0 · Best Reward: 0 · Success Rate (50): 0%

PPO Statistics

Episodes: 0 · Avg Reward (50): 0.0 · Best Reward: 0 · Success Rate (50): 0%

Reward Comparison

Episode reward (smoothed over 20 episodes) for both algorithms. Max possible = 200.

Policy Stability — Max Probability Ratio

Maximum πnew/πold ratio across the trajectory, per episode (lower = more stable)

PPO Clipping Detail

Probability ratio r(θ) for each step in the last PPO trajectory, with clip boundaries [1−ε, 1+ε]

A2C Last Update

Run an episode to see A2C update details

PPO Last Update

Run an episode to see PPO update details

References

Textbook: Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press. Chapter 13: Policy Gradient Methods.

PPO Paper: Schulman, J. et al. (2017). “Proximal Policy Optimization Algorithms.” arXiv:1707.06347.

CartPole: Barto, A. G., Sutton, R. S. & Anderson, C. W. (1983). “Neuronlike adaptive elements that can solve difficult learning control problems.” IEEE Trans. on Systems, Man, and Cybernetics, 13(5), 834–846.