Policy Gradient: Left-Right Game
REINFORCE (Monte Carlo Policy Gradient) — Sutton & Barto §13
REINFORCE Algorithm
REINFORCE is a Monte Carlo policy gradient method that learns a parameterized policy π(a|s;θ) directly, without estimating a value function. The agent samples actions from the policy, collects rewards, and updates the parameters in the direction that increases the probability of actions that led to higher returns.
In the Left-Right Game, there is a single state with two actions: Left (reward −1) and Right (reward +1). The policy uses a softmax over two learnable parameters θ = [θL, θR]. The optimal policy should learn to always go Right.
Key concepts:

- Actor: the policy network π(a|s;θ) that decides which action to take, mapping each state to a probability distribution over actions. In the Left-Right game the actor is the softmax over θL and θR; it outputs, say, "go Right with 73% probability." In CartPole, the actor is a neural network that outputs "push left" vs. "push right" probabilities given the pole angle and velocity. REINFORCE is a pure actor method: the policy itself chooses actions and learns from the returns.
- Critic: a learned value function V(s) that estimates how good a state is, i.e. the expected total future reward. The critic does not choose actions; it only evaluates situations. In CartPole, a critic network might output V = 185 for a well-balanced pole (many future steps expected) and V = 12 for a nearly-fallen pole (few steps left). Unlike Actor-Critic methods (A2C, PPO), REINFORCE has no critic: it judges actions by the raw episode return Gt, which is noisier but unbiased.
- Advantage: A(s,a) = Q(s,a) − V(s), which measures how much better action a is than the average action in state s. A positive advantage means "better than usual," a negative one "worse than usual." In CartPole, if the critic's baseline is V(s) = 150 and pushing right actually yields Gt = 190, then A ≈ +40; if pushing left yields Gt = 80, then A ≈ −70. Computing the advantage requires a critic, and using it in place of the raw return is the key upgrade in PPO and A2C: it reduces variance, so they learn faster than REINFORCE.
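The advantage arithmetic in the CartPole example can be checked directly. A minimal sketch (the function name is illustrative; the numeric values are the example's, not from a real run):

```python
def advantage(g_t: float, v_s: float) -> float:
    """Advantage estimate: observed return minus the critic's baseline V(s)."""
    return g_t - v_s

print(advantage(190.0, 150.0))  # pushing right: +40, better than average
print(advantage(80.0, 150.0))   # pushing left: -70, worse than average
```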
Key equations:

1. REINFORCE update: θ ← θ + α ∇θ log π(a|s;θ) · Gt
2. Softmax policy: π(a) = exp(θa) / Σj exp(θj)
3. Log-softmax gradient: ∇θj log π(a) = 1{j=a} − π(j)
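The log-softmax gradient formula above can be verified numerically with a finite-difference check. A self-contained sketch (parameter values chosen arbitrarily for the check):

```python
import math

def softmax(theta):
    # Subtract the max for numerical stability before exponentiating.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, a):
    # Analytic gradient of log pi(a): 1{j=a} - pi(j) for each parameter j.
    pi = softmax(theta)
    return [(1.0 if j == a else 0.0) - pi[j] for j in range(len(theta))]

# Compare the analytic formula against central finite differences.
theta = [0.2, -0.4]
a = 1  # gradient of log pi(Right)
eps = 1e-6
for j in range(len(theta)):
    tp = theta[:]; tp[j] += eps
    tm = theta[:]; tm[j] -= eps
    numeric = (math.log(softmax(tp)[a]) - math.log(softmax(tm)[a])) / (2 * eps)
    print(j, grad_log_pi(theta, a)[j], numeric)  # the two columns should agree
```

Note that the gradient components sum to zero: raising the chosen action's probability necessarily lowers the others'.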
The Left-Right Game
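The whole setup fits in a short script: a single state, two actions, a softmax over θ = [θL, θR], and the REINFORCE update applied once per one-step episode. A minimal sketch, assuming illustrative hyperparameters (α = 0.1, 500 episodes) rather than the demo's actual settings:

```python
import math
import random

def softmax(theta):
    m = max(theta)  # stability shift
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_left_right(episodes=500, alpha=0.1, seed=0):
    """Train a two-parameter softmax policy on the one-step Left-Right game.
    Action 0 = Left (reward -1), action 1 = Right (reward +1)."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]          # [theta_L, theta_R], start at uniform policy
    rewards = [-1.0, +1.0]
    for _ in range(episodes):
        pi = softmax(theta)
        a = 0 if rng.random() < pi[0] else 1  # sample an action from pi
        g = rewards[a]                         # one-step episode: return = reward
        # REINFORCE update: theta_j += alpha * (1{j=a} - pi(j)) * G
        for j in range(2):
            theta[j] += alpha * ((1.0 if j == a else 0.0) - pi[j]) * g
    return theta, softmax(theta)

theta, pi = reinforce_left_right()
print(theta, pi)  # pi(Right) should approach 1
```

Every update widens the gap θR − θL: a sampled Right is reinforced by its +1 return, and a sampled Left is suppressed by its −1 return, so the policy converges to always going Right.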