Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.), Chapter 3
This example from Sutton & Barto illustrates value functions and optimal policies in a simple finite MDP. A 5 × 5 grid represents the environment, where the agent can take four actions — north, south, east, west — each deterministically moving it one cell in the chosen direction.
Actions that would move the agent off the grid leave its position unchanged and yield a reward of −1. All other moves yield reward 0, except from two special states: from state A, every action yields +10 and moves the agent to A′; from state B, every action yields +5 and moves the agent to B′.
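For reference, here is a minimal Python sketch of these dynamics. The grid size, action set, and the (row, column) coordinates used for A, A′, B, and B′ follow Figure 3.2 (row 0 at the top); the names GRID, ACTIONS, and step are illustrative, not taken from the demo's source.

```python
# Gridworld dynamics sketch. Coordinates are (row, col) with row 0 at the top;
# the placements below are assumed from Figure 3.2: A=(0,1) -> A'=(4,1) with
# reward +10, and B=(0,3) -> B'=(2,3) with reward +5.
GRID = 5
A, A_PRIME = (0, 1), (4, 1)
B, B_PRIME = (0, 3), (2, 3)
ACTIONS = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0      # every action from A teleports to A'
    if state == B:
        return B_PRIME, 5.0       # every action from B teleports to B'
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < GRID and 0 <= c < GRID):
        return state, -1.0        # off-grid move: stay put, reward -1
    return (r, c), 0.0
```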
Figure 3.2 (below left) shows the gridworld layout and the state-value function vπ under an equiprobable random policy (π(a|s) = 0.25 for all actions in all states), with discount factor γ = 0.9. Notice the negative values near the lower edge — these result from the high probability of hitting the grid boundary under the random policy. State A is the best state under this policy, but its value (~8.8) is less than its immediate reward of +10 because from A the agent is teleported to A′ near the bottom edge, where it frequently runs into the boundary. State B, on the other hand, is valued more than +5 because B′ is near the center of the grid, a region with positive expected return.
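A sketch of the iterative policy evaluation that produces these values is shown below, reusing the hypothetical step, GRID, and ACTIONS from the dynamics sketch above; the convergence threshold theta is an arbitrary choice. Rounding the result should give values consistent with the figure, roughly 8.8 at A.

```python
import numpy as np

def evaluate_random_policy(gamma=0.9, theta=1e-4):
    """Iterative policy evaluation for the equiprobable policy pi(a|s) = 0.25.

    Sweeps the Bellman expectation backup
        V(s) <- sum_a 0.25 * (r + gamma * V(s'))
    until the largest change in a sweep falls below theta.
    """
    V = np.zeros((GRID, GRID))
    while True:
        delta = 0.0
        for r in range(GRID):
            for c in range(GRID):
                v_new = 0.0
                for action in ACTIONS:
                    (nr, nc), reward = step((r, c), action)
                    v_new += 0.25 * (reward + gamma * V[nr, nc])
                delta = max(delta, abs(v_new - V[r, c]))
                V[r, c] = v_new   # in-place (Gauss-Seidel style) update
        if delta < theta:
            return V
```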
Figure 3.5 (below right) shows the optimal value function v* and optimal policy π*, computed via value iteration using the Bellman optimality equation. The optimal values are much larger than those under the random policy because the optimal agent avoids hitting boundaries and efficiently reaches state A for the +10 reward. Where multiple arrows appear in a cell, all indicated actions are equally optimal.
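The sketch below shows one way the demo's value-iteration step could be implemented, again reusing the hypothetical step, GRID, and ACTIONS from the dynamics sketch; actions whose backed-up value ties with the maximum (within a small tolerance) are all kept, which is why some cells in Figure 3.5 show several arrows.

```python
import numpy as np

def value_iteration(gamma=0.9, theta=1e-6):
    """Value iteration via the Bellman optimality backup, then greedy pi*."""
    V = np.zeros((GRID, GRID))
    while True:
        delta = 0.0
        for r in range(GRID):
            for c in range(GRID):
                backups = []
                for action in ACTIONS:
                    (nr, nc), reward = step((r, c), action)
                    backups.append(reward + gamma * V[nr, nc])
                best = max(backups)
                delta = max(delta, abs(best - V[r, c]))
                V[r, c] = best
        if delta < theta:
            break
    # Greedy policy extraction: keep every action within a small tolerance of
    # the maximum, so equally optimal actions all appear (multiple arrows).
    policy = {}
    for r in range(GRID):
        for c in range(GRID):
            q = {}
            for action in ACTIONS:
                (nr, nc), reward = step((r, c), action)
                q[action] = reward + gamma * V[nr, nc]
            best = max(q.values())
            policy[(r, c)] = [a for a, v in q.items() if v >= best - 1e-9]
    return V, policy
```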
The gridworld and state-value function for the equiprobable random policy (γ = 0.9).
Optimal value function and an optimal policy for the gridworld (γ = 0.9).
Click any cell to see its value computation.
Compute Vπ under the equiprobable random policy (π(a|s) = 0.25) using iterative policy evaluation.
Compute V* and π* using value iteration (Bellman optimality equation).
Navigate the grid manually or follow the optimal policy. Click cells or use arrow keys.
5 × 5 grid with 4 actions: north, south, east, west.
Policy Evaluation: computes Vπ for the equiprobable random policy.
Optimal Values: computes V* via value iteration, then extracts the greedy optimal policy π*.