PPO vs A2C — Actor-Critic Methods
Sutton & Barto §13 — Policy Gradient with Baseline
Actor-Critic Methods: A2C and PPO
Both A2C (Advantage Actor-Critic) and PPO (Proximal Policy Optimization) are actor-critic policy gradient methods. The actor learns a parameterized policy π(a|s;θ), while the critic learns a value function V(s;w) used to compute the advantage A(s,a) = r + γV(s') − V(s), which reduces variance compared to raw returns.
A2C updates the policy directly using the advantage, which can produce large, destabilizing steps. PPO constrains each update by clipping the probability ratio r(θ) = π_new(a|s) / π_old(a|s) to [1−ε, 1+ε], and it reuses each trajectory for multiple update epochs, improving both sample efficiency and stability.
θ ← θ + α_actor · A · ∇_θ log π(a|s;θ)
V(s) ← V(s) + α_critic · (r + γV(s') − V(s))
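As a concrete illustration, here is a minimal tabular sketch of these two updates in Python. It assumes a softmax policy over per-state logits theta[s, a]; the discount factor and learning rates are illustrative values, not parameters taken from the simulation.

```python
import numpy as np

N_STATES, N_ACTIONS = 10, 2              # 1D corridor; actions: 0 = Left, 1 = Right
GAMMA = 0.99                             # discount factor (assumed; not given above)
ALPHA_ACTOR, ALPHA_CRITIC = 0.1, 0.1     # learning rates (illustrative values)

theta = np.zeros((N_STATES, N_ACTIONS))  # tabular policy logits
V = np.zeros(N_STATES)                   # tabular value function

def policy(s):
    """pi(a|s; theta): softmax over the logits for state s."""
    z = theta[s] - theta[s].max()        # shift logits for numerical stability
    p = np.exp(z)
    return p / p.sum()

def a2c_update(s, a, r, s_next, done):
    """One-step A2C: the TD error doubles as the advantage estimate."""
    target = r + (0.0 if done else GAMMA * V[s_next])
    advantage = target - V[s]            # A = r + gamma*V(s') - V(s)
    V[s] += ALPHA_CRITIC * advantage     # critic: TD(0) update
    pi = policy(s)
    grad_log = -pi                       # grad of log pi(a|s) w.r.t. theta[s,:]
    grad_log[a] += 1.0                   # ... equals onehot(a) - pi for a softmax
    theta[s] += ALPHA_ACTOR * advantage * grad_log   # actor: gradient ascent step
```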
r(θ) = π_new(a|s) / π_old(a|s)
L = min(r(θ)·A, clip(r(θ), 1−ε, 1+ε)·A)
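A matching sketch of the clipped surrogate, reusing theta, policy, and ALPHA_ACTOR from the A2C sketch above. EPS_CLIP and PPO_EPOCHS are assumed values (0.2 is the PPO paper's default clip range), and the stored pi_old probabilities stand in for π_old.

```python
EPS_CLIP = 0.2     # clip range epsilon (assumed; PPO paper's default)
PPO_EPOCHS = 4     # update epochs per batch (assumed)

def ppo_update(batch, pi_old):
    """Clipped-surrogate updates on the tabular softmax policy.

    batch:  list of (s, a, advantage) transitions from one trajectory
    pi_old: pi_old[s] = action probabilities recorded at collection time
    """
    for _ in range(PPO_EPOCHS):            # reuse the same data several times
        for s, a, advantage in batch:
            pi = policy(s)
            ratio = pi[a] / pi_old[s][a]   # r(theta) = pi_new / pi_old
            # The clip makes the objective flat (zero gradient) once the
            # ratio leaves [1-eps, 1+eps] in the direction the advantage
            # is pushing it; otherwise the ordinary surrogate gradient flows.
            clipped = (advantage > 0 and ratio > 1 + EPS_CLIP) or \
                      (advantage < 0 and ratio < 1 - EPS_CLIP)
            if clipped:
                continue
            grad_log = -pi
            grad_log[a] += 1.0
            # d(ratio * A)/d(theta[s,:]) = A * ratio * grad(log pi)
            theta[s] += ALPHA_ACTOR * advantage * ratio * grad_log
```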
Environment: 1D Cliff Walk
A simple 1D corridor with 10 positions (0–9). The agent starts at position 0 and must reach the goal at position 9 (reward +10). Each step costs −0.1. Falling off either edge incurs a −1 penalty and leaves the agent in place. The episode ends at the goal or after 50 steps. Actions: Left (0) and Right (1). The policy and value function are tabular (one entry per state), so all 10 states are easy to visualize.
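A minimal Python sketch of this environment, plus a training loop that ties it to the A2C sketch above. Two reward details are assumptions where the text is ambiguous: the +10 goal reward replaces the −0.1 step cost on the final step, and the −1 edge penalty is the entire reward for a bumped step.

```python
class CliffWalk1D:
    """The corridor described above: positions 0-9, start at 0, goal at 9."""
    N, GOAL, MAX_STEPS = 10, 9, 50

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):
        """action: 0 = Left, 1 = Right. Returns (next_state, reward, done)."""
        self.t += 1
        nxt = self.pos + (1 if action == 1 else -1)
        if 0 <= nxt < self.N:
            self.pos = nxt
            # Assumption: the +10 goal reward replaces the -0.1 step cost.
            reward = 10.0 if self.pos == self.GOAL else -0.1
        else:
            reward = -1.0   # fell off an edge: penalty, agent stays in place
        done = self.pos == self.GOAL or self.t >= self.MAX_STEPS
        return self.pos, reward, done

# Illustrative training loop using the A2C sketch defined earlier:
env = CliffWalk1D()
for episode in range(500):
    s, done = env.reset(), False
    while not done:
        a = np.random.choice(N_ACTIONS, p=policy(s))
        s_next, r, done = env.step(a)
        a2c_update(s, a, r, s_next, done)
        s = s_next
```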
Controls
The interactive page shows, side by side for A2C and PPO: the environment, the policy π(Right|s), and the value function V(s); per-algorithm statistics; a reward comparison chart; a policy-stability chart tracking the max probability ratio; a PPO clipping detail view; and the last update applied by each algorithm.
References
Textbook: Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Chapter 13: Policy Gradient Methods.
PPO Paper: Schulman, J., et al. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347.
Application: Goodfriend, S. (2024). "Mastering microRTS with PPO and A2C." IEEE Conference on Games (CoG). Demonstrates PPO and A2C in competitive real-time strategy game AI.