Policy Gradient: Left-Right Game
REINFORCE (Monte Carlo Policy Gradient) — Sutton & Barto §13
REINFORCE Algorithm
REINFORCE is a Monte Carlo policy gradient method that learns a parameterized policy π(a|s;θ) directly, without estimating a value function. The agent samples actions from the policy, collects rewards, and updates the parameters in the direction that increases the probability of actions that led to higher returns.
In the Left-Right Game there is a single state with two actions: Left (reward −1) and Right (reward +1). The policy is a softmax over two learnable parameters θ = [θL, θR]. With training, the policy should converge to always choosing Right. A minimal code sketch follows the formulas below.
1. Policy Gradient Theorem:
∇J(θ) = Eπ[∇θ log π(a|s;θ) · Gt]
2. REINFORCE Update: θ ← θ + α ∇θ log π(a|s;θ) · Gt
3. Softmax Policy: π(a) = exp(θa) / Σj exp(θj)
4. Log-Softmax Gradient: ∇θj log π(a) = 1{j=a} − π(j)
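Putting the four pieces together, here is a minimal, self-contained sketch of REINFORCE on the Left-Right Game in plain Python. The learning rate (ALPHA = 0.1) and episode count are illustrative assumptions, not necessarily the values the demo uses.

```python
# REINFORCE sketch for the Left-Right Game (illustrative; alpha and
# episode count are assumptions, not the demo's actual settings).
import math
import random

ALPHA = 0.1          # assumed learning rate
EPISODES = 200       # assumed number of episodes
REWARDS = {"L": -1.0, "R": +1.0}

def softmax(theta):
    """Softmax policy: pi(a) = exp(theta_a) / sum_j exp(theta_j)."""
    m = max(theta.values())                        # subtract max for numerical stability
    exps = {a: math.exp(t - m) for a, t in theta.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

theta = {"L": 0.0, "R": 0.0}
for episode in range(EPISODES):
    pi = softmax(theta)
    # Sample an action from the current policy.
    action = random.choices(list(pi), weights=pi.values())[0]
    G = REWARDS[action]                            # one-step episode, so Gt is just the reward
    # REINFORCE update with the log-softmax gradient: 1{j=a} - pi(j)
    for j in theta:
        grad = (1.0 if j == action else 0.0) - pi[j]
        theta[j] += ALPHA * grad * G

print(softmax(theta))   # pi(Right) should be close to 1 after training
```

Because Right always returns +1 and Left always returns −1, every update pushes probability mass toward Right: sampling Right increases θR, and sampling Left (with a negative return) decreases θL, so π(Right) rises toward 1 either way.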
The Left-Right Game
Agent chooses Left or Right. Arrow width reflects action probability.
Controls
Space: Step · R: Reset
Current Policy
Action probabilities π(Left) and π(Right)
Statistics
Episode 0 · π(Right) 0.500 · θL 0.000 · θR 0.000 · Avg Reward 0.000 · Total Reward 0
Probability History
π(Left) and π(Right) over episodes
Update Details
Step-by-step REINFORCE computation for the last episode
Run an episode to see the math
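As a concrete example of the computation this panel displays, here is one hand-checked update starting from the initial parameters θL = θR = 0, assuming the agent samples Right and a learning rate of α = 0.1 (the learning rate is an assumption, not necessarily the demo's value):

```python
import math

alpha = 0.1                      # assumed learning rate
theta = {"L": 0.0, "R": 0.0}     # initial parameters shown in the Statistics panel

# Softmax probabilities: exp(0) / (exp(0) + exp(0)) = 0.5 for each action.
z = math.exp(theta["L"]) + math.exp(theta["R"])
pi = {a: math.exp(t) / z for a, t in theta.items()}   # {'L': 0.5, 'R': 0.5}

action, G = "R", +1.0            # suppose Right was sampled; its return is +1

# Log-softmax gradient and update:
#   grad theta_R = 1 - pi(R) =  0.5  ->  theta_R += 0.1 *  0.5 * 1 = +0.05
#   grad theta_L = 0 - pi(L) = -0.5  ->  theta_L += 0.1 * -0.5 * 1 = -0.05
for j in theta:
    grad = (1.0 if j == action else 0.0) - pi[j]
    theta[j] += alpha * grad * G

print(theta)   # {'L': -0.05, 'R': 0.05}
```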
Parameter Trajectory
θL and θR over episodes
Episode Log
Last 50 episodes (newest first)
| Ep | Action | Reward | θL | θR | π(L) | π(R) |
|---|---|---|---|---|---|---|
| No episodes yet | | | | | | |