Gridworld — Reinforcement Learning Example 3.5

Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.), Chapter 3

Example 3.5: Gridworld

This example from Sutton & Barto illustrates value functions and optimal policies in a simple finite MDP. A 5 × 5 grid represents the environment, where the agent can take four actions — north, south, east, west — each deterministically moving it one cell in the chosen direction.

Actions that would move the agent off the grid leave its position unchanged and yield a reward of −1. All other moves yield reward 0, except from two special states (see the code sketch after this list):

  • State A (row 0, col 1): any action yields reward +10 and teleports the agent to A′ (row 4, col 1).
  • State B (row 0, col 3): any action yields reward +5 and teleports the agent to B′ (row 2, col 3).
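
For concreteness, here is a minimal Python sketch of these dynamics. The names (step, ACTIONS, SIZE, and so on) are not from the book or from this page's own implementation; they are just one way to write the transition rule down.

  # 5 x 5 gridworld dynamics; rows count down from 0 at the top and
  # columns count right from 0 at the left, matching the (row, col)
  # coordinates used above.
  SIZE = 5
  A, A_PRIME = (0, 1), (4, 1)
  B, B_PRIME = (0, 3), (2, 3)
  ACTIONS = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

  def step(state, action):
      """Return (next_state, reward) for one deterministic transition."""
      if state == A:                 # any action from A: +10, teleport to A'
          return A_PRIME, 10.0
      if state == B:                 # any action from B: +5, teleport to B'
          return B_PRIME, 5.0
      dr, dc = ACTIONS[action]
      r, c = state[0] + dr, state[1] + dc
      if 0 <= r < SIZE and 0 <= c < SIZE:
          return (r, c), 0.0         # ordinary move: reward 0
      return state, -1.0             # off-grid attempt: stay put, reward -1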

Figure 3.2 (below left) shows the gridworld layout and the state-value function vπ under an equiprobable random policy (π(a|s) = 0.25 for all actions in all states), with discount factor γ = 0.9. Notice the negative values near the lower edge — these result from the high probability of hitting the grid boundary under the random policy. State A is the best state under this policy, but its value (~8.8) is less than its immediate reward of +10 because from A the agent is teleported to A′ near the bottom edge, where it frequently runs into the boundary. State B, on the other hand, is valued more than +5 because B′ is near the center of the grid, a region with positive expected return.
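
The values in Figure 3.2 can be reproduced by straightforward iterative policy evaluation. A minimal sketch, assuming the step(), ACTIONS, and SIZE definitions from the snippet above:

  import numpy as np

  GAMMA = 0.9
  STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]

  def evaluate_random_policy(theta=1e-4):
      """Iterative policy evaluation for pi(a|s) = 0.25 over all four actions."""
      V = np.zeros((SIZE, SIZE))
      while True:
          delta, V_new = 0.0, np.zeros_like(V)
          for s in STATES:
              # Bellman expectation backup (transitions are deterministic)
              V_new[s] = sum(0.25 * (r + GAMMA * V[s2])
                             for s2, r in (step(s, a) for a in ACTIONS))
              delta = max(delta, abs(V_new[s] - V[s]))
          V = V_new
          if delta < theta:
              return V

  # evaluate_random_policy()[A] comes out at about 8.8, matching Figure 3.2.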

Figure 3.5 (below right) shows the optimal value function v* and optimal policy π*, computed via value iteration using the Bellman optimality equation. The optimal values are much larger than those under the random policy because the optimal agent avoids hitting boundaries and efficiently reaches state A for the +10 reward. Where multiple arrows appear in a cell, all indicated actions are equally optimal.
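
Likewise, a value-iteration sketch under the same assumptions; it sweeps the Bellman optimality backup over all cells until the values stop changing:

  def value_iteration(theta=1e-4):
      """Compute v* by repeatedly applying the Bellman optimality backup."""
      V = np.zeros((SIZE, SIZE))
      while True:
          delta, V_new = 0.0, np.zeros_like(V)
          for s in STATES:
              # v*(s) = max_a [ r(s, a) + gamma * v*(s') ]
              V_new[s] = max(r + GAMMA * V[s2]
                             for s2, r in (step(s, a) for a in ACTIONS))
              delta = max(delta, abs(V_new[s] - V[s]))
          V = V_new
          if delta < theta:
              return V

  # value_iteration()[A] comes out at about 24.4 (= 10 / (1 - 0.9**5)),
  # the value of state A in Figure 3.5.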

Figure 3.2: (a) exceptional reward dynamics; (b) vπ for the equiprobable random policy. The gridworld and its state-value function under the equiprobable random policy (γ = 0.9).

Figure 3.5: (a) v*; (b) π*. Optimal value function and an optimal policy for the gridworld (γ = 0.9).

Interactive grid: Policy Evaluation (Random Policy). Legend: A → A′ (+10), B → B′ (+5), off-grid (−1), normal move (0); cells are shaded from low to high value.

Bellman Equation Details

Click any cell to see the Bellman equation breakdown of its value.
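
What the panel displays for a cell is, in effect, the four 0.25-weighted terms of the Bellman expectation backup. A small helper in that spirit, assuming the evaluate_random_policy(), step(), ACTIONS, and GAMMA definitions sketched earlier (the exact formatting here is mine, not the page's):

  def bellman_breakdown(V, s):
      """Print the four 0.25-weighted backup terms whose sum is v_pi(s)."""
      terms = {}
      for a in ACTIONS:
          s2, r = step(s, a)
          terms[a] = 0.25 * (r + GAMMA * V[s2])
          print(f"{a:>5}: 0.25 * ({r:+.0f} + {GAMMA} * {V[s2]:5.2f}) = {terms[a]:6.3f}")
      print(f"v_pi({s}) = {sum(terms.values()):.2f}")

  # e.g. bellman_breakdown(evaluate_random_policy(), (2, 2))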

Mode

  • Policy Evaluation: compute Vπ under the equiprobable random policy (π(a|s) = 0.25) using iterative policy evaluation.
  • Optimal Values: compute V* and π* using value iteration (Bellman optimality equation).
  • Play / Explore: navigate the grid manually or follow the optimal policy. Click cells or use the arrow keys.


Statistics panel: iterations, max Δ, total reward, and steps.

About This Gridworld

5 × 5 grid with 4 actions: north, south, east, west.

  • Actions move the agent one cell in the chosen direction.
  • Moving off the grid leaves state unchanged, reward = −1.
  • Otherwise reward = 0.
  • State A (row 0, col 1): any action → reward +10, teleport to A' (row 4, col 1).
  • State B (row 0, col 3): any action → reward +5, teleport to B' (row 2, col 3).

Policy Evaluation: computes Vπ for the equiprobable random policy.

Optimal Values: computes V* via value iteration, then extracts the greedy optimal policy π*.
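
A sketch of that extraction step, assuming the value_iteration() and step() snippets above; keeping every action that ties for the maximum is what produces the multi-arrow cells in Figure 3.5:

  def greedy_policy(V, tol=1e-6):
      """For each cell, keep every action whose one-step backup attains the max."""
      policy = {}
      for s in STATES:
          backups = {}
          for a in ACTIONS:
              s2, r = step(s, a)
              backups[a] = r + GAMMA * V[s2]
          best = max(backups.values())
          policy[s] = [a for a, q in backups.items() if q >= best - tol]
      return policy

  # pi_star = greedy_policy(value_iteration())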