Policy Gradient: Left-Right Game
REINFORCE (Monte Carlo Policy Gradient) — Sutton & Barto §13
REINFORCE Algorithm
REINFORCE is a Monte Carlo policy gradient method that learns a parameterized policy π(a|s;θ) directly, without estimating a value function. The agent samples actions from the policy, collects rewards, and updates the parameters in the direction that increases the probability of actions that led to higher returns.
In the Left-Right Game, there is a single state with two actions: Left (reward −1) and Right (reward +1). The policy uses a softmax over two learnable parameters θ = [θL, θR]. The optimal policy should learn to always go Right.
Key concepts:

- Actor: the policy network π(a|s;θ) that decides which action to take, mapping each state to a probability distribution over actions. In the Left-Right game the actor is the softmax over θL and θR; it outputs, say, "go Right with 73% probability." In CartPole, the actor is a neural network that outputs "push left" vs. "push right" probabilities given the pole angle and velocity. REINFORCE is a pure actor method: the policy itself chooses actions and learns from the returns.
- Critic: a learned value function V(s) that estimates how good a state is, i.e. the expected total future reward. The critic does not choose actions; it only evaluates situations. In CartPole, a critic network might output V = 185 for a well-balanced pole (many future steps expected) and V = 12 for a nearly-fallen pole (few steps left). Unlike Actor-Critic methods (A2C, PPO), REINFORCE has no critic: it judges actions by the raw episode return Gt, which is noisier but unbiased.
- Advantage: A(s,a) = Q(s,a) − V(s), which measures how much better action a is than the average action in state s. A positive advantage means "better than usual," a negative one "worse than usual." In CartPole, if the critic's baseline is V(s) = 150 and pushing right actually yields Gt = 190, then A ≈ +40; if pushing left yields Gt = 80, then A ≈ −70. Computing the advantage requires a critic, and using it in place of the raw return is the key upgrade in PPO and A2C: it reduces variance, so they learn faster than REINFORCE.
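The advantage arithmetic in the CartPole example can be checked directly. A minimal sketch (the function name is illustrative; the numeric values are the example's, not from a real run):

```python
def advantage(g_t: float, v_s: float) -> float:
    """Advantage estimate: observed return minus the critic's baseline V(s)."""
    return g_t - v_s

print(advantage(190.0, 150.0))  # pushing right: +40, better than average
print(advantage(80.0, 150.0))   # pushing left: -70, worse than average
```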
Key equations:

1. REINFORCE update: θ ← θ + α ∇θ log π(a|s;θ) · Gt
2. Softmax policy: π(a) = exp(θa) / Σj exp(θj)
3. Log-softmax gradient: ∇θj log π(a) = 1{j=a} − π(j)
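The log-softmax gradient formula above can be verified numerically with a finite-difference check. A self-contained sketch (parameter values chosen arbitrarily for the check):

```python
import math

def softmax(theta):
    # Subtract the max for numerical stability before exponentiating.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, a):
    # Analytic gradient of log pi(a): 1{j=a} - pi(j) for each parameter j.
    pi = softmax(theta)
    return [(1.0 if j == a else 0.0) - pi[j] for j in range(len(theta))]

# Compare the analytic formula against central finite differences.
theta = [0.2, -0.4]
a = 1  # gradient of log pi(Right)
eps = 1e-6
for j in range(len(theta)):
    tp = theta[:]; tp[j] += eps
    tm = theta[:]; tm[j] -= eps
    numeric = (math.log(softmax(tp)[a]) - math.log(softmax(tm)[a])) / (2 * eps)
    print(j, grad_log_pi(theta, a)[j], numeric)  # the two columns should agree
```

Note that the gradient components sum to zero: raising the chosen action's probability necessarily lowers the others'.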
The Left-Right Game
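The whole setup fits in a short script: a single state, two actions, a softmax over θ = [θL, θR], and the REINFORCE update applied once per one-step episode. A minimal sketch, assuming illustrative hyperparameters (α = 0.1, 500 episodes) rather than the demo's actual settings:

```python
import math
import random

def softmax(theta):
    m = max(theta)  # stability shift
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_left_right(episodes=500, alpha=0.1, seed=0):
    """Train a two-parameter softmax policy on the one-step Left-Right game.
    Action 0 = Left (reward -1), action 1 = Right (reward +1)."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]          # [theta_L, theta_R], start at uniform policy
    rewards = [-1.0, +1.0]
    for _ in range(episodes):
        pi = softmax(theta)
        a = 0 if rng.random() < pi[0] else 1  # sample an action from pi
        g = rewards[a]                         # one-step episode: return = reward
        # REINFORCE update: theta_j += alpha * (1{j=a} - pi(j)) * G
        for j in range(2):
            theta[j] += alpha * ((1.0 if j == a else 0.0) - pi[j]) * g
    return theta, softmax(theta)

theta, pi = reinforce_left_right()
print(theta, pi)  # pi(Right) should approach 1
```

Every update widens the gap θR − θL: a sampled Right is reinforced by its +1 return, and a sampled Left is suppressed by its −1 return, so the policy converges to always going Right.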