Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.), Chapter 3
This example from Sutton & Barto illustrates value functions and optimal policies in a simple finite MDP. A 5 × 5 grid represents the environment, where the agent can take four actions — north, south, east, west — each deterministically moving it one cell in the chosen direction.
Actions that would move the agent off the grid leave its position unchanged and yield a reward of −1. All other moves yield reward 0, except from two special states: from state A, every action yields +10 and moves the agent to A′; from state B, every action yields +5 and moves the agent to B′.
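For reference, here is a minimal Python sketch of these dynamics. The grid size, action set, and the (row, column) coordinates used for A, A′, B, and B′ follow Figure 3.2 (row 0 at the top); the names GRID, ACTIONS, and step are illustrative, not taken from the demo's source.

```python
# Gridworld dynamics sketch. Coordinates are (row, col) with row 0 at the top;
# the placements below are assumed from Figure 3.2: A=(0,1) -> A'=(4,1) with
# reward +10, and B=(0,3) -> B'=(2,3) with reward +5.
GRID = 5
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0      # every action from A teleports to A'
    if state == B:
        return B_PRIME, 5.0       # every action from B teleports to B'
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < GRID and 0 <= c < GRID):
        return state, -1.0        # off-grid move: stay put, reward -1
    return (r, c), 0.0
```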
Figure 3.2 (below left) shows the gridworld layout and the state-value function vπ under an equiprobable random policy (π(a|s) = 0.25 for all actions in all states), with discount factor γ = 0.9. Notice the negative values near the lower edge — these result from the high probability of hitting the grid boundary under the random policy. State A is the best state under this policy, but its value (~8.8) is less than its immediate reward of +10 because from A the agent is teleported to A′ near the bottom edge, where it frequently runs into the boundary. State B, on the other hand, is valued more than +5 because B′ is near the center of the grid, a region with positive expected return.
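A sketch of the iterative policy evaluation that produces these values is shown below, reusing the hypothetical step, GRID, and ACTIONS from the dynamics sketch above; the convergence threshold theta is an arbitrary choice. Rounding the result should give values consistent with the figure, roughly 8.8 at A.

```python
import numpy as np

def evaluate_random_policy(gamma=0.9, theta=1e-4):
    """Iterative policy evaluation for the equiprobable policy pi(a|s) = 0.25.

    Sweeps the Bellman expectation backup
        V(s) <- sum_a 0.25 * (r + gamma * V(s'))
    until the largest change in a sweep falls below theta.
    """
    V = np.zeros((GRID, GRID))
    while True:
        delta = 0.0
        for r in range(GRID):
            for c in range(GRID):
                v_new = 0.0
                for action in ACTIONS:
                    (nr, nc), reward = step((r, c), action)
                    v_new += 0.25 * (reward + gamma * V[nr, nc])
                delta = max(delta, abs(v_new - V[r, c]))
                V[r, c] = v_new   # in-place (Gauss-Seidel style) update
        if delta < theta:
            return V
```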
Figure 3.5 (below right) shows the optimal value function v* and optimal policy π*, computed via value iteration using the Bellman optimality equation. The optimal values are much larger than those under the random policy because the optimal agent avoids hitting boundaries and efficiently reaches state A for the +10 reward. Where multiple arrows appear in a cell, all indicated actions are equally optimal.
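The sketch below shows one way the demo's value-iteration step could be implemented, again reusing the hypothetical step, GRID, and ACTIONS from the dynamics sketch; actions whose backed-up value ties with the maximum (within a small tolerance) are all kept, which is why some cells in Figure 3.5 show several arrows.

```python
import numpy as np

def value_iteration(gamma=0.9, theta=1e-6):
    """Value iteration via the Bellman optimality backup, then greedy pi*."""
    V = np.zeros((GRID, GRID))
    while True:
        delta = 0.0
        for r in range(GRID):
            for c in range(GRID):
                backups = []
                for action in ACTIONS:
                    (nr, nc), reward = step((r, c), action)
                    backups.append(reward + gamma * V[nr, nc])
                best = max(backups)
                delta = max(delta, abs(best - V[r, c]))
                V[r, c] = best
        if delta < theta:
            break
    # Greedy policy extraction: keep every action within a small tolerance of
    # the maximum, so equally optimal actions all appear (multiple arrows).
    policy = {}
    for r in range(GRID):
        for c in range(GRID):
            q = {}
            for action in ACTIONS:
                (nr, nc), reward = step((r, c), action)
                q[action] = reward + gamma * V[nr, nc]
            best = max(q.values())
            policy[(r, c)] = [a for a, v in q.items() if v >= best - 1e-9]
    return V, policy
```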
The gridworld and state-value function for the equiprobable random policy (γ = 0.9).
Optimal value function and an optimal policy for the gridworld (γ = 0.9).
Click any cell to see its value computation.
Compute Vπ under the equiprobable random policy (π(a|s) = 0.25) using iterative policy evaluation.
Compute V* and π* using value iteration (Bellman optimality equation).
Navigate the grid manually or follow the optimal policy. Click cells or use arrow keys.
5 × 5 grid with 4 actions: north, south, east, west.
Policy Evaluation: computes Vπ for the equiprobable random policy.
Optimal Values: computes V* via value iteration, then extracts the greedy optimal policy π*.