The Gambler's Problem

Dynamic Programming — Value Iteration (Sutton & Barto, Example 4.3) | Short Activity | Long Activity

Problem Description

A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, the gambler wins as many dollars as was staked; if tails, the gambler loses the stake. The game ends when the gambler reaches a goal of $100 or runs out of money ($0).

On each flip, the gambler must decide what portion of capital to stake, in integer amounts. The state (in RL, everything needed to make a decision) is the gambler's current capital, s ∈ {1, 2, …, 99}; states 0 and 100 are terminal. The actions are the legal stakes a ∈ {1, 2, …, min(s, 100−s)}, so the wager never exceeds the current capital or the amount needed to reach $100. The reward is +1 on reaching $100 and 0 on all other transitions, including going broke. This is an undiscounted (γ = 1), episodic task: the game has natural endings at $0 and $100, and future rewards count equally, so winning later is as good as winning sooner. Consequently V(s) equals the probability of eventually reaching $100.

The state-value function V(s) gives the expected return starting from state s under a given policy; since the only reward is +1 for reaching $100, V(s) is exactly the probability of winning from capital s. A policy is a mapping π(s) from each state to a stake; the optimal policy π*(s) maximizes the probability of reaching $100 from every state. Value iteration, a dynamic programming algorithm, repeatedly applies the Bellman optimality update to every state (each complete pass through all states is a "sweep") until the value function converges. It is guaranteed to converge to the optimal value function V* for any finite MDP, and from V* the optimal policy can be read off greedily.

Figure: MDP structure of the Gambler's Problem. States s ∈ {0, 1, …, 100}; actions a ∈ {1, …, min(s, 100−s)}. From state s, staking a dollars leads to s + a with probability ph (heads) or s − a with probability 1 − ph (tails); reaching $100 wins (R = +1), reaching $0 loses (R = 0).
Bellman Optimality Update

For each state s, every possible stake a is evaluated by weighting the values of the two resulting states by the coin probabilities, and the maximum is kept; repeating this across all states constitutes one sweep:

V(s) ← max_a [ ph · V(s+a) + (1 − ph) · V(s−a) ]

with terminal conditions V(0) = 0 and V(100) = 1.
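The update can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual implementation; the defaults ph = 0.4 and θ = 10⁻¹⁰ match the controls described below, and the smallest-stake tie-breaking is one arbitrary choice among several optimal policies:

```python
def value_iteration(ph=0.4, theta=1e-10, goal=100):
    """Value iteration for the Gambler's Problem (Sutton & Barto, Ex. 4.3).

    Returns the value function V (win probability per capital level),
    a greedy policy (ties broken toward the smallest stake), and the
    number of sweeps performed.
    """
    V = [0.0] * (goal + 1)
    V[goal] = 1.0  # terminal reward: reaching $100 is worth +1
    sweeps = 0
    while True:
        delta = 0.0
        for s in range(1, goal):  # non-terminal states 1..99
            # Bellman optimality update: best expected value over all stakes
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place (Gauss-Seidel) update
        sweeps += 1
        if delta < theta:
            break
    # Extract a greedy policy from the converged values
    policy = [0] * (goal + 1)
    for s in range(1, goal):
        stakes = range(1, min(s, goal - s) + 1)
        policy[s] = max(stakes, key=lambda a: ph * V[s + a] + (1 - ph) * V[s - a])
    return V, policy, sweeps

V, policy, sweeps = value_iteration()
print(f"converged after {sweeps} sweeps; V(50) = {V[50]:.4f}")
```

For ph = 0.4 the converged V(50) comes out at 0.4, matching the bold-play intuition: from $50 the best the gambler can do is stake everything and win with the probability of a single head.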

Controls

  • ph: the probability of the coin landing heads. When ph < 0.5 the coin is biased against the gambler; at ph = 0.5 it is fair; above 0.5 it favors the gambler. The default, 0.40, matches Example 4.3 in the textbook.
  • θ: the convergence threshold (default 10⁻¹⁰). Value iteration stops when the maximum change Δ across all states in a single sweep falls below θ. Smaller values yield more precise results but require more sweeps; the "spiky" optimal policy becomes more pronounced with smaller θ.
  • Sweeps: a sweep is one complete pass through all 99 non-terminal states, updating each V(s) using the Bellman equation. Multiple sweeps are needed for values to propagate and converge.
  • Max Δ: the largest change in any state's value during the most recent sweep, Δ = max_s |Vnew(s) − Vold(s)|. While this number is large, values are still shifting significantly; the algorithm converges when Δ drops below θ.

Press "One Sweep" to run a single sweep, or "Run to Convergence" to iterate until Δ < θ. Each numbered tag is a snapshot of V(s) after that many sweeps; click any tag to see the value function and corresponding policy at that point, showing how they evolve toward convergence.

Value Function V(s)

V(s) is the probability of winning from each capital state s. Initially all values are zero (except V(100) = 1); they increase as value iteration discovers paths to $100. At convergence the chart shows V*(s), the optimal winning probability.

Optimal Policy π*(s)

The optimal stake at each capital level, shown after convergence. For ph < 0.5 the policy exhibits a striking "spiky" pattern: the gambler bets aggressively at capitals such as 25, 50, and 75, trying to reach $100 in as few favorable flips as possible.

Convergence History

Tracks the maximum change in any state's value, Δ = max_s |Vnew(s) − Vold(s)|, per sweep, with the threshold θ shown as a dashed line. When Δ falls below θ, the algorithm has converged. The log scale reveals the exponential convergence rate typical of value iteration.

Simulate Optimal Policy

After value iteration has converged, watch the optimal policy play a single episode: starting from an initial capital (default $50), the gambler stakes π*(s) dollars at each step and a coin is flipped with probability ph of heads. The episode ends at $0 (lose, bankrupt) or $100 (win). Play multiple episodes; the history is kept below.
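A rollout of this kind can be sketched as follows. This is a self-contained illustration, not the page's code: the compact `value_iteration` recomputes a greedy policy so the snippet runs on its own, and the Monte Carlo check at the end simply confirms that the empirical win rate from $50 matches V*(50):

```python
import random

def value_iteration(ph=0.4, theta=1e-10, goal=100):
    # Compact value iteration yielding values and a greedy policy.
    V = [0.0] * (goal + 1)
    V[goal] = 1.0
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    policy = [0] * (goal + 1)
    for s in range(1, goal):
        stakes = range(1, min(s, goal - s) + 1)
        policy[s] = max(stakes, key=lambda a: ph * V[s + a] + (1 - ph) * V[s - a])
    return V, policy

def play_episode(policy, ph=0.4, start=50, goal=100, rng=random):
    """Play one episode: stake policy[s] each step until $0 or $goal."""
    s = start
    while 0 < s < goal:
        a = policy[s]
        s += a if rng.random() < ph else -a  # heads wins the stake
    return s == goal  # True = win, False = bankrupt

V, policy = value_iteration()
rng = random.Random(0)  # seeded for reproducibility
n = 20000
wins = sum(play_episode(policy, rng=rng) for _ in range(n))
print(f"empirical win rate from $50: {wins / n:.3f}  (V*(50) = {V[50]:.3f})")
```

With ph = 0.4 the empirical win rate settles near 0.4, the theoretical winning probability from $50.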

RL Dynamic Programming vs Competitive Programming DP

Main DP Algorithms in RL

  • Policy Evaluation — compute Vπ(s) for a fixed policy π
  • Policy Iteration — alternate between evaluation and greedy improvement until stable
  • Value Iteration — combine evaluation and improvement into a single Bellman optimality backup

All three share: iterative updates, full environment knowledge (model-based), and sweeping over every state each iteration.
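As a concrete instance of the first algorithm, here is a sketch of iterative policy evaluation applied to a fixed, deliberately timid policy that always stakes $1. Both the policy and the closed-form check are illustrative choices, not taken from this page; the closed form is the classic gambler's-ruin formula for a unit-step biased walk:

```python
def evaluate_policy(stake_fn, ph=0.4, theta=1e-12, goal=100):
    """Iterative policy evaluation: compute V_pi for a fixed policy.

    Same sweep structure as value iteration, but with no max over
    actions -- the policy dictates the single stake taken in each state.
    """
    V = [0.0] * (goal + 1)
    V[goal] = 1.0  # terminal: reaching $100 is worth +1
    while True:
        delta = 0.0
        for s in range(1, goal):
            a = stake_fn(s)  # fixed policy, no improvement step
            v = ph * V[s + a] + (1 - ph) * V[s - a]
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    return V

# Always betting $1 with ph = 0.4 is a gambler's-ruin random walk whose
# win probability has a closed form:
#   V_pi(s) = ((q/p)^s - 1) / ((q/p)^100 - 1),  with q/p = 0.6/0.4 = 1.5
V = evaluate_policy(lambda s: 1)
print(f"V_pi(50) = {V[50]:.3e}")  # vanishingly small next to V*(50) = 0.4
```

The result makes the case for policy improvement vividly: under the timid policy the chance of winning from $50 is on the order of 10⁻⁹, versus 0.4 under the optimal policy.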

Intuition

DP in RL is global optimization by improving local value estimates. The core operation is the Bellman backup: update one state’s value using its successors’ values. Repeated sweeps propagate information until all values are self-consistent.

RL DP operates on a cyclic graph with stochastic transitions. In competitive-programming DP the dependency graph is a DAG, so a single topological-order pass suffices; here, cycles are unavoidable (a heads followed by a tails returns the gambler to the same capital), which is why iterative convergence is required.
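For contrast, a typical competitive-programming DP fills its table in one pass, because each entry depends only on entries already computed. The grid problem below is a generic illustration, not tied to this page:

```python
def max_path_sum(grid):
    """Classic DAG DP: maximum sum moving only right or down in a grid.

    Each dp[i][j] depends only on dp[i+1][j] and dp[i][j+1], so a single
    bottom-up pass in reverse topological order suffices -- no repeated
    sweeps and no convergence test, unlike value iteration.
    """
    n, m = len(grid), len(grid[0])
    NEG = float("-inf")
    dp = [[NEG] * m for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            down = dp[i + 1][j] if i + 1 < n else NEG
            right = dp[i][j + 1] if j + 1 < m else NEG
            best = max(down, right)
            dp[i][j] = grid[i][j] + (best if best != NEG else 0)
    return dp[0][0]

print(max_path_sum([[1, 3, 1],
                    [1, 5, 1],
                    [4, 2, 1]]))  # best right/down path: 1+3+5+2+1 = 12
```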

Comparison

Aspect by aspect:

  • Problem type: RL DP tackles sequential decision-making under uncertainty (MDPs); competitive-programming DP tackles combinatorial optimization (knapsack, LCS, shortest path, …).
  • Transitions: stochastic, with cycles possible, versus deterministic with DAG structure.
  • Core equation: the Bellman equation V(s) = max_a Σ_{s′} p(s′|s,a)[r + γV(s′)] versus a recurrence relation dp[i] = f(dp[j], …).
  • Solution method: iterative convergence (repeated sweeps until Δ < θ) versus a single pass in topological order (bottom-up or memoized top-down).
Based on Example 4.3 in
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Book website · Full text (PDF)