The Gambler's Problem
Dynamic Programming — Value Iteration (Sutton & Barto, Example 4.3)
Problem Description
A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, the gambler wins as many dollars as was staked; if tails, the gambler loses the stake. The game ends when the gambler reaches a goal of $100 or runs out of money ($0).
On each flip, the gambler must decide what portion of capital to stake, in integer amounts. The state is the gambler's current capital, s ∈ {1, 2, …, 99}; states 0 and 100 are terminal (the game ends). The actions are stakes a ∈ {1, 2, …, min(s, 100−s)}, so a wager never exceeds the current capital or the amount needed to reach $100. The reward is +1 on reaching $100 and 0 on all other transitions, including going broke. This is an undiscounted (γ = 1), episodic task: the game has natural endings at $0 and $100, and future rewards count equally, so winning later is just as good as winning sooner.
The state-value function V(s) gives the expected return starting from state s under a given policy; since the only non-zero reward is +1 for reaching $100, V(s) equals the probability of winning from capital s, and the optimal V*(s) is the winning probability under the best possible strategy. A policy is a mapping π(s) from each state to a stake; the optimal policy π*(s) maximizes the probability of reaching $100 from every state. Value iteration is a dynamic programming algorithm that repeatedly applies the Bellman optimality update to every state until the value function converges (each complete pass through all states is a "sweep"); it is guaranteed to converge to V* for any finite MDP, and the optimal policy is then read off from the optimal value function.
V(s) ← max_{a ∈ {1, …, min(s, 100−s)}} [ ph · V(s + a) + (1 − ph) · V(s − a) ], with V(0) = 0, V(100) = 1 (terminal conditions)
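The update above can be sketched directly in Python. This is a minimal illustration, not the page's implementation; ph = 0.4 matches the book's example, while the convergence threshold θ and the tie-breaking rule (smallest maximizing stake, after rounding) are assumptions:

```python
import numpy as np

def value_iteration(ph=0.4, goal=100, theta=1e-9):
    """Value iteration for the gambler's problem (Sutton & Barto, Example 4.3).

    ph is the probability of heads; theta is an assumed convergence threshold.
    Returns the optimal value function V and a greedy policy pi.
    """
    V = np.zeros(goal + 1)
    V[goal] = 1.0  # terminal conditions: V(0) = 0, V(goal) = 1
    while True:
        delta = 0.0
        for s in range(1, goal):
            # Bellman optimality backup over all legal stakes
            stakes = range(1, min(s, goal - s) + 1)
            best = max(ph * V[s + a] + (1 - ph) * V[s - a] for a in stakes)
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place (Gauss-Seidel style) update
        if delta < theta:
            break
    # Greedy policy: smallest stake achieving the maximum (one common tie-break)
    pi = np.zeros(goal + 1, dtype=int)
    for s in range(1, goal):
        stakes = list(range(1, min(s, goal - s) + 1))
        returns = [ph * V[s + a] + (1 - ph) * V[s - a] for a in stakes]
        pi[s] = stakes[int(np.argmax(np.round(returns, 6)))]
    return V, pi
```

For ph = 0.4, this recovers the known result that V*(50) = 0.4: betting everything at $50 wins with exactly the probability of a single favorable flip.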
Controls
Value Function V(s) — the probability of winning from capital s. Initially all zeros (except V(100) = 1); values increase as value iteration discovers paths to $100. At convergence this becomes V*(s), the optimal winning probability.
Optimal Policy π*(s) — the optimal stake at each capital level. For ph < 0.5, the policy shows a striking "spiky" pattern: the gambler bets aggressively at capitals like 25, 50, and 75 to reach $100 in as few favorable flips as possible.
Convergence History — tracks the maximum change in any state's value per sweep, Δ = max_s |Vnew(s) − Vold(s)|. When Δ falls below the threshold θ, the algorithm has converged. The log scale reveals the exponential decay of error typical of value iteration.
Simulate Optimal Policy — after convergence, watch the optimal policy play a single episode. The gambler stakes π*(s) dollars at each step, and a coin is flipped with probability ph of heads. The episode ends at $0 (lose) or $100 (win).
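An episode like the one this control plays can be sketched as follows; the function name, the `seed` argument, and the default start of $50 are illustrative assumptions, not part of the page:

```python
import random

def simulate_episode(pi, ph=0.4, start=50, goal=100, seed=0):
    """Play one episode of the gambler's problem following policy pi.

    pi maps each capital 1..goal-1 to a stake. Returns the trajectory
    of capitals visited, ending at 0 (lose) or goal (win).
    """
    rng = random.Random(seed)
    s, trajectory = start, [start]
    while 0 < s < goal:
        a = pi[s]                             # stake chosen by the policy
        s += a if rng.random() < ph else -a   # heads wins the stake, tails loses it
        trajectory.append(s)
    return trajectory
```

Passing the greedy policy from value iteration reproduces the page's simulation; any other policy (e.g. always bet $1) can be plugged in for comparison.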
RL Dynamic Programming vs Competitive Programming DP
Main DP Algorithms in RL
- Policy Evaluation — compute Vπ(s) for a fixed policy π
- Policy Iteration — alternate between evaluation and greedy improvement until stable
- Value Iteration — combine evaluation and improvement into a single Bellman optimality backup
All three share: iterative updates, full environment knowledge (model-based), and sweeping over every state each iteration.
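The first of these, policy evaluation, differs from value iteration only in dropping the max: the value of the policy's own stake is backed up instead of the best stake. A minimal sketch in the gambler's setting, with a hypothetical fixed policy supplied by the caller:

```python
def evaluate_policy(pi, ph=0.4, goal=100, theta=1e-9):
    """Iterative policy evaluation for a fixed policy in the gambler's problem.

    pi[s] is the stake the policy bets at capital s (a policy you supply).
    Sweeps until the largest value change in a sweep falls below theta.
    """
    V = [0.0] * (goal + 1)
    V[goal] = 1.0  # terminal conditions
    while True:
        delta = 0.0
        for s in range(1, goal):
            a = pi[s]
            v = ph * V[s + a] + (1 - ph) * V[s - a]  # expected return; no max
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    return V
```

Evaluating the timid policy that always bets $1 shows why it is poor for ph < 0.5: the winning probability from $50 is vanishingly small, in line with the classic gambler's-ruin result.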
Intuition
DP in RL is global optimization by improving local value estimates. The core operation is the Bellman backup: update one state’s value using its successors’ values. Repeated sweeps propagate information until all values are self-consistent.
RL DP operates on a cyclic graph with stochastic transitions — unlike competitive-programming DP where the graph is a DAG and a single topological-order pass suffices, cycles here require iterative convergence.
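The DAG point can be seen in any classic one-pass DP; minimum coin change is a toy illustration (the coin set here is an arbitrary choice). Each entry depends only on strictly smaller amounts, so one bottom-up sweep in increasing order suffices, with no convergence test:

```python
def min_coins(amount, coins=(1, 5, 10, 25)):
    """Fewest coins summing to `amount`: dp[i] depends only on dp[i - c]
    for c <= i, so the dependency graph is a DAG and a single pass in
    increasing order of i computes every value exactly once."""
    INF = float("inf")
    dp = [0] + [INF] * amount
    for i in range(1, amount + 1):
        dp[i] = min((dp[i - c] + 1 for c in coins if c <= i), default=INF)
    return dp[amount]
```

In the gambler's problem, by contrast, V(s) depends on V(s + a) and V(s − a), i.e. on both "later" and "earlier" states, so no single evaluation order can finish in one pass.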
Comparison
| Aspect | RL DP | Competitive Programming DP |
|---|---|---|
| Problem type | Sequential decision-making under uncertainty (MDPs) | Combinatorial optimization (knapsack, LCS, shortest path, …) |
| Transitions | Stochastic; cycles possible | Deterministic; DAG structure |
| Core equation | Bellman equation: V(s) = max_a ∑_{s′} p(s′|s,a) [r + γ V(s′)] | Recurrence relation: dp[i] = f(dp[j], …) |
| Solution method | Iterative convergence (repeated sweeps until Δ < θ) | Single pass in topological order (bottom-up or memoized top-down) |
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Book website · Full text (PDF) · Short Activity (~15 min) · Long Activity (~30 min)