PPO vs A2C — Actor-Critic Methods

Sutton & Barto §13 — Policy Gradient with Baseline

Actor-Critic Methods: A2C and PPO

Both A2C (Advantage Actor-Critic) and PPO (Proximal Policy Optimization) are actor-critic policy gradient methods. The actor learns a parameterized policy π(a|s;θ), while the critic learns a value function V(s;w) used to estimate the advantage via the one-step TD error, A(s,a) ≈ r + γV(s') − V(s), which reduces variance compared to using raw returns.

A2C updates the policy directly using the advantage — this can lead to large, destabilizing updates. PPO constrains updates by clipping the probability ratio r(θ) = π_new(a|s) / π_old(a|s) to [1−ε, 1+ε], and reuses each trajectory for multiple update epochs, improving sample efficiency and stability.

A2C Update:
θ ← θ + α_actor · A · ∇_θ log π(a|s;θ)
V(s) ← V(s) + α_critic · (r + γV(s') − V(s))
PPO Clipped Objective:
r(θ) = π_new(a|s) / π_old(a|s)
L = min(r(θ)·A, clip(r(θ), 1−ε, 1+ε)·A)
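
To make the two update rules concrete, here is a minimal NumPy sketch for the tabular case (one logit pair and one value entry per state). Function names, the gradient form, and the default hyperparameters are illustrative assumptions rather than the simulation's source; the PPO critic update, identical in form to the A2C one, is omitted for brevity:

import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def a2c_update(theta, V, s, a, r, s_next, done,
               alpha_actor=0.10, alpha_critic=0.10, gamma=0.99):
    """One A2C step on a single transition (tabular softmax actor, tabular critic)."""
    target = r + (0.0 if done else gamma * V[s_next])
    advantage = target - V[s]                       # A = r + γ·V(s') − V(s)
    V[s] += alpha_critic * advantage                # critic: TD(0) update
    grad_log = -softmax(theta[s])                   # ∇_θ log π(a|s) for softmax
    grad_log[a] += 1.0                              #   = one_hot(a) − π(·|s)
    theta[s] += alpha_actor * advantage * grad_log  # actor: policy-gradient step
    return advantage

def ppo_update(theta, transitions, advantages,
               alpha_actor=0.10, epsilon=0.20, epochs=4):
    """PPO actor update: reuse the stored trajectory for several clipped epochs."""
    old_probs = [softmax(theta[s])[a] for s, a in transitions]  # π_old(a|s), frozen
    for _ in range(epochs):                         # sample reuse
        for (s, a), p_old, A in zip(transitions, old_probs, advantages):
            pi = softmax(theta[s])
            ratio = pi[a] / p_old                   # r(θ) = π_new(a|s) / π_old(a|s)
            clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
            # min(r·A, clip(r)·A): the gradient is zero whenever the clipped
            # term is the smaller one, i.e. the ratio has left the trust band.
            if ratio * A <= clipped * A:
                grad_log = -pi
                grad_log[a] += 1.0
                theta[s] += alpha_actor * A * ratio * grad_log

With ε = 0.20 the ratio is effectively confined to [0.8, 1.2] during each batch of epochs; A2C has no such constraint, which is what the Policy Stability chart below visualizes.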

Environment: 1D Cliff Walk

A simple 1D corridor with 10 positions (0–9). The agent starts at position 0 and must reach the goal at position 9 (reward +10). Each step costs −0.1. Falling off either edge gives −1 penalty and the agent stays in place. The episode ends at the goal or after 50 steps. Actions: Left (0) and Right (1). The policy and value function are tabular (one entry per state), making all 10 states easy to visualize.
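
A minimal Python sketch of this environment follows directly from the description above. Class and method names are illustrative, and whether the −0.1 step cost stacks with the −1 edge penalty and the +10 goal reward is an assumption here:

class CliffWalk1D:
    """1D corridor with positions 0..9; start at 0, goal at 9."""
    LEFT, RIGHT = 0, 1
    N_STATES, GOAL, MAX_STEPS = 10, 9, 50

    def reset(self):
        self.pos, self.steps = 0, 0
        return self.pos

    def step(self, action):
        self.steps += 1
        reward = -0.1                                   # per-step cost
        nxt = self.pos + (1 if action == self.RIGHT else -1)
        if 0 <= nxt <= 9:
            self.pos = nxt
        else:
            reward += -1.0                              # fell off an edge; stay in place
        done = False
        if self.pos == self.GOAL:
            reward += 10.0                              # reached the goal
            done = True
        elif self.steps >= self.MAX_STEPS:
            done = True                                 # step limit reached
        return self.pos, reward, done

An episode then alternates sampling an action from the current policy with env.step(action) until done is True.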

Controls

Hyperparameter sliders (defaults: 0.10, 0.10, 0.99, 0.20, 4, 5). Keyboard: Space = Step, R = Reset.

Visualization panels: Environment, Policy π(Right|s), and Value Function V(s), shown side by side for A2C and PPO.

A2C and PPO statistics panels, each tracking Episode count, Avg Reward (last 50 episodes), Avg Steps (last 50), and Goal Rate (last 50).

Reward Comparison

Episode reward (smoothed over 20 episodes) for both algorithms

Policy Stability — Max Probability Ratio

Maximum π_new/π_old ratio across all states per episode (lower = more stable)
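
As a rough sketch of how such a metric can be computed (the simulation's exact definition may differ, and taking the maximum over both actions is an assumption), one can compare the policy before and after an episode's updates, state by state:

def max_prob_ratio(probs_before, probs_after):
    """probs_*: π(Right|s) for all 10 states, before and after an update.
    Returns the largest per-state probability ratio over both actions.
    (Softmax probabilities are never exactly 0 or 1, so the divisions are safe.)"""
    worst = 1.0
    for p_old, p_new in zip(probs_before, probs_after):
        worst = max(worst, p_new / p_old, (1 - p_new) / (1 - p_old))
    return worst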

PPO Clipping Detail

Shows the probability ratio r(θ), clipped range [1−ε, 1+ε], and effective objective for the last PPO update

A2C Last Update

Run an episode to see A2C update details

PPO Last Update

Run an episode to see PPO update details

References

Textbook: Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press. Chapter 13: Policy Gradient Methods.

PPO Paper: Schulman, J. et al. (2017). “Proximal Policy Optimization Algorithms.” arXiv:1707.06347.

Application: Goodfriend, S. (2024). “Mastering microRTS with PPO and A2C.” IEEE Conference on Games (CoG). Demonstrates PPO and A2C in competitive real-time strategy game AI.