CartPole — PPO vs A2C
Sutton & Barto §13 — Policy Gradient with Neural Network Policy
Actor-Critic Methods: A2C and PPO with a Neural Network
Both A2C (Advantage Actor-Critic) and PPO (Proximal Policy Optimization) are actor-critic policy gradient methods. The actor learns a parameterized policy π(a|s;θ), while the critic learns a value function V(s;w) used to compute the advantage A(s,a) = r + γV(s') − V(s), which reduces variance compared to raw returns.
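As a concrete illustration (a minimal Python sketch with illustrative names, not necessarily this page's implementation), the one-step advantage estimate can be computed from a single transition and the critic:

```python
# One-step advantage estimate A(s,a) = r + γV(s') − V(s).
# `value` stands in for the critic V(s;w); it is an assumed callable here.
def advantage(reward, state, next_state, done, value, gamma=0.99):
    # Bootstrap from V(s') unless the episode terminated at s'.
    target = reward + (0.0 if done else gamma * value(next_state))
    return target - value(state)
```

Both agents then weight the policy gradient by this quantity rather than by the raw return.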
A2C updates the policy directly with advantage-weighted gradient steps, which can be large and destabilizing. PPO constrains each update by clipping the probability ratio r(θ) = π_new(a|s) / π_old(a|s) to [1−ε, 1+ε], and reuses each trajectory for several update epochs, improving both sample efficiency and stability.
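The clipped surrogate objective can be sketched in a few lines (NumPy, hypothetical names; `new_logp` and `old_logp` are log-probabilities of the sampled actions under the current and snapshot policies):

```python
import numpy as np

def ppo_surrogate(new_logp, old_logp, adv, eps=0.2):
    # Probability ratio r(θ) = π_new(a|s) / π_old(a|s), computed in log space.
    ratio = np.exp(new_logp - old_logp)
    # Clip the ratio to [1−ε, 1+ε] and take the pessimistic (minimum) bound.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))
```

Maximizing this objective removes any incentive to move the ratio outside [1−ε, 1+ε], which is what keeps the repeated epochs over a single trajectory stable.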
Network architecture: a shared hidden layer (4 → 16, tanh) feeding an actor head (2 units, softmax) and a critic head (1 unit, linear), with Xavier initialization. State: [x, ẋ, θ, θ̇] (cart position, cart velocity, pole angle, pole angular velocity), each normalized to roughly [−1, 1].
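A minimal NumPy sketch consistent with that description (parameter names and the seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier(n_in, n_out):
    # Xavier/Glorot uniform initialization, scaled by fan-in + fan-out.
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

# Shared hidden layer 4 → 16 (tanh), then actor and critic heads.
W1, b1 = xavier(4, 16), np.zeros(16)
Wa, ba = xavier(16, 2), np.zeros(2)   # actor: 2 action logits → softmax
Wc, bc = xavier(16, 1), np.zeros(1)   # critic: scalar value, linear

def forward(state):
    h = np.tanh(state @ W1 + b1)      # shared hidden features
    logits = h @ Wa + ba
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()              # softmax policy π(a|s;θ)
    v = (h @ Wc + bc)[0]              # value estimate V(s;w)
    return probs, v
```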
A2C actor update: θ ← θ + α_actor · A · ∇_θ log π(a|s;θ)
A2C critic update: V(s) ← V(s) + α_critic · (r + γV(s') − V(s))
PPO probability ratio: r(θ) = π_new(a|s) / π_old(a|s)
PPO clipped objective: L = min(r(θ)·A, clip(r(θ), 1−ε, 1+ε)·A)
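To see what the clip does, a quick numeric check with ε = 0.2 and a positive advantage (values are hypothetical):

```python
# clip(r, 1−ε, 1+ε) written with builtins; A > 0 in this example.
eps, A = 0.2, 1.0
for r in (0.5, 1.0, 1.5):
    L = min(r * A, max(min(r, 1 + eps), 1 - eps) * A)
    print(f"r(θ)={r:.1f} → L={L:.1f}")  # prints 0.5, 1.0, 1.2
```

At r(θ) = 1.5 the objective is capped at 1.2·A, so the gradient gives no further push past 1+ε; symmetrically, with a negative advantage there is no incentive to push the ratio below 1−ε.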
Interactive controls and panels: side-by-side CartPole simulations for A2C and PPO with per-agent statistics, a Reward Comparison chart, a Policy Stability chart (max probability ratio), a PPO Clipping Detail view, and last-update summaries for each agent.
References
Textbook: Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Chapter 13: Policy Gradient Methods.
PPO Paper: Schulman, J. et al. (2017). “Proximal Policy Optimization Algorithms.” arXiv:1707.06347.
CartPole: Barto, A. G., Sutton, R. S. & Anderson, C. W. (1983). “Neuronlike adaptive elements that can solve difficult learning control problems.” IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 834–846.