The Gambler's Problem

Dynamic Programming — Value Iteration (Sutton & Barto, Example 4.3) | Short Activity | Long Activity

Problem Description

A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, the gambler wins as many dollars as was staked; if tails, the gambler loses the stake. The game ends when the gambler reaches a goal of $100 or runs out of money ($0).

On each flip, the gambler must decide what portion of capital to stake, in integer amounts. The state (in RL, everything needed to make a decision) is the gambler's current capital, s ∈ {1, 2, …, 99}; states 0 and 100 are terminal. The actions are the legal stakes a ∈ {1, 2, …, min(s, 100−s)}, so the wager never exceeds the current capital or the amount needed to reach $100. The reward is +1 on reaching $100 and 0 on all other transitions, including going broke. This is an undiscounted (γ = 1), episodic task: the game has natural endings at $0 and $100, and future rewards count equally, so winning later is as good as winning sooner. Consequently V(s) equals the probability of eventually reaching $100.

The state-value function V(s) gives the expected return starting from state s under a given policy; since the only reward is +1 for reaching $100, V(s) is exactly the probability of winning from capital s. A policy is a mapping π(s) from each state to a stake; the optimal policy π*(s) maximizes the probability of reaching $100 from every state. Value iteration, a dynamic programming algorithm, repeatedly applies the Bellman optimality update to every state (each complete pass through all states is a "sweep") until the value function converges. It is guaranteed to converge to the optimal value function V* for any finite MDP, and from V* the optimal policy can be read off greedily.

Figure: MDP structure of the Gambler's Problem. States s ∈ {0, 1, …, 100}; actions a ∈ {1, …, min(s, 100−s)}. From state s, staking a dollars leads to s + a with probability ph (heads) or s − a with probability 1 − ph (tails); reaching $100 wins (R = +1), reaching $0 loses (R = 0).
Bellman Optimality Update

For each state s, every possible stake a is evaluated by weighting the values of the two resulting states by the coin probabilities, and the maximum is kept; repeating this across all states constitutes one sweep:

V(s) ← max_a [ ph · V(s+a) + (1 − ph) · V(s−a) ]

with terminal conditions V(0) = 0 and V(100) = 1.
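The update can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual implementation; the defaults ph = 0.4 and θ = 10⁻¹⁰ match the controls described below, and the smallest-stake tie-breaking is one arbitrary choice among several optimal policies:

```python
def value_iteration(ph=0.4, theta=1e-10, goal=100):
    """Value iteration for the Gambler's Problem (Sutton & Barto, Ex. 4.3).

    Returns the value function V (win probability per capital level),
    a greedy policy (ties broken toward the smallest stake), and the
    number of sweeps performed.
    """
    V = [0.0] * (goal + 1)
    V[goal] = 1.0  # terminal reward: reaching $100 is worth +1
    sweeps = 0
    while True:
        delta = 0.0
        for s in range(1, goal):  # non-terminal states 1..99
            # Bellman optimality update: best expected value over all stakes
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place (Gauss-Seidel) update
        sweeps += 1
        if delta < theta:
            break
    # Extract a greedy policy from the converged values
    policy = [0] * (goal + 1)
    for s in range(1, goal):
        stakes = range(1, min(s, goal - s) + 1)
        policy[s] = max(stakes, key=lambda a: ph * V[s + a] + (1 - ph) * V[s - a])
    return V, policy, sweeps

V, policy, sweeps = value_iteration()
print(f"converged after {sweeps} sweeps; V(50) = {V[50]:.4f}")
```

For ph = 0.4 the converged V(50) comes out at 0.4, matching the bold-play intuition: from $50 the best the gambler can do is stake everything and win with the probability of a single head.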

Controls

  • ph: the probability of the coin landing heads. When ph < 0.5 the coin is biased against the gambler; at ph = 0.5 it is fair; above 0.5 it favors the gambler. The default, 0.40, matches Example 4.3 in the textbook.
  • θ: the convergence threshold (default 10⁻¹⁰). Value iteration stops when the maximum change Δ across all states in a single sweep falls below θ. Smaller values yield more precise results but require more sweeps; the "spiky" optimal policy becomes more pronounced with smaller θ.
  • Sweeps: a sweep is one complete pass through all 99 non-terminal states, updating each V(s) using the Bellman equation. Multiple sweeps are needed for values to propagate and converge.
  • Max Δ: the largest change in any state's value during the most recent sweep, Δ = max_s |Vnew(s) − Vold(s)|. While this number is large, values are still shifting significantly; the algorithm converges when Δ drops below θ.

Press "One Sweep" to run a single sweep, or "Run to Convergence" to iterate until Δ < θ. Each numbered tag is a snapshot of V(s) after that many sweeps; click any tag to see the value function and corresponding policy at that point, showing how they evolve toward convergence.

Value Function V(s)

V(s) is the probability of winning from each capital state s. Initially all values are zero (except V(100) = 1); they increase as value iteration discovers paths to $100. At convergence the chart shows V*(s), the optimal winning probability.

Optimal Policy π*(s)

The optimal stake at each capital level, shown after convergence. For ph < 0.5 the policy exhibits a striking "spiky" pattern: the gambler bets aggressively at capitals such as 25, 50, and 75, trying to reach $100 in as few favorable flips as possible.

Convergence History

Tracks the maximum change in any state's value, Δ = max_s |Vnew(s) − Vold(s)|, per sweep, with the threshold θ shown as a dashed line. When Δ falls below θ, the algorithm has converged. The log scale reveals the exponential convergence rate typical of value iteration.

Simulate Optimal Policy

After value iteration has converged, watch the optimal policy play a single episode: starting from an initial capital (default $50), the gambler stakes π*(s) dollars at each step and a coin is flipped with probability ph of heads. The episode ends at $0 (lose, bankrupt) or $100 (win). Play multiple episodes; the history is kept below.
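A rollout of this kind can be sketched as follows. This is a self-contained illustration, not the page's code: the compact `value_iteration` recomputes a greedy policy so the snippet runs on its own, and the Monte Carlo check at the end simply confirms that the empirical win rate from $50 matches V*(50):

```python
import random

def value_iteration(ph=0.4, theta=1e-10, goal=100):
    # Compact value iteration yielding values and a greedy policy.
    V = [0.0] * (goal + 1)
    V[goal] = 1.0
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    policy = [0] * (goal + 1)
    for s in range(1, goal):
        stakes = range(1, min(s, goal - s) + 1)
        policy[s] = max(stakes, key=lambda a: ph * V[s + a] + (1 - ph) * V[s - a])
    return V, policy

def play_episode(policy, ph=0.4, start=50, goal=100, rng=random):
    """Play one episode: stake policy[s] each step until $0 or $goal."""
    s = start
    while 0 < s < goal:
        a = policy[s]
        s += a if rng.random() < ph else -a  # heads wins the stake
    return s == goal  # True = win, False = bankrupt

V, policy = value_iteration()
rng = random.Random(0)  # seeded for reproducibility
n = 20000
wins = sum(play_episode(policy, rng=rng) for _ in range(n))
print(f"empirical win rate from $50: {wins / n:.3f}  (V*(50) = {V[50]:.3f})")
```

With ph = 0.4 the empirical win rate settles near 0.4, the theoretical winning probability from $50.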

RL Dynamic Programming vs Competitive Programming DP

Main DP Algorithms in RL

  • Policy Evaluation — compute Vπ(s) for a fixed policy π
  • Policy Iteration — alternate between evaluation and greedy improvement until stable
  • Value Iteration — combine evaluation and improvement into a single Bellman optimality backup

All three share: iterative updates, full environment knowledge (model-based), and sweeping over every state each iteration.
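As a concrete instance of the first algorithm, here is a sketch of iterative policy evaluation applied to a fixed, deliberately timid policy that always stakes $1. Both the policy and the closed-form check are illustrative choices, not taken from this page; the closed form is the classic gambler's-ruin formula for a unit-step biased walk:

```python
def evaluate_policy(stake_fn, ph=0.4, theta=1e-12, goal=100):
    """Iterative policy evaluation: compute V_pi for a fixed policy.

    Same sweep structure as value iteration, but with no max over
    actions -- the policy dictates the single stake taken in each state.
    """
    V = [0.0] * (goal + 1)
    V[goal] = 1.0  # terminal: reaching $100 is worth +1
    while True:
        delta = 0.0
        for s in range(1, goal):
            a = stake_fn(s)  # fixed policy, no improvement step
            v = ph * V[s + a] + (1 - ph) * V[s - a]
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    return V

# Always betting $1 with ph = 0.4 is a gambler's-ruin random walk whose
# win probability has a closed form:
#   V_pi(s) = ((q/p)^s - 1) / ((q/p)^100 - 1),  with q/p = 0.6/0.4 = 1.5
V = evaluate_policy(lambda s: 1)
print(f"V_pi(50) = {V[50]:.3e}")  # vanishingly small next to V*(50) = 0.4
```

The result makes the case for policy improvement vividly: under the timid policy the chance of winning from $50 is on the order of 10⁻⁹, versus 0.4 under the optimal policy.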

Intuition

DP in RL is global optimization by improving local value estimates. The core operation is the Bellman backup: update one state’s value using its successors’ values. Repeated sweeps propagate information until all values are self-consistent.

RL DP operates on a cyclic graph with stochastic transitions. In competitive-programming DP the dependency graph is a DAG, so a single topological-order pass suffices; here, cycles are unavoidable (a heads followed by a tails returns the gambler to the same capital), which is why iterative convergence is required.
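For contrast, a typical competitive-programming DP fills its table in one pass, because each entry depends only on entries already computed. The grid problem below is a generic illustration, not tied to this page:

```python
def max_path_sum(grid):
    """Classic DAG DP: maximum sum moving only right or down in a grid.

    Each dp[i][j] depends only on dp[i+1][j] and dp[i][j+1], so a single
    bottom-up pass in reverse topological order suffices -- no repeated
    sweeps and no convergence test, unlike value iteration.
    """
    n, m = len(grid), len(grid[0])
    NEG = float("-inf")
    dp = [[NEG] * m for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            down = dp[i + 1][j] if i + 1 < n else NEG
            right = dp[i][j + 1] if j + 1 < m else NEG
            best = max(down, right)
            dp[i][j] = grid[i][j] + (best if best != NEG else 0)
    return dp[0][0]

print(max_path_sum([[1, 3, 1],
                    [1, 5, 1],
                    [4, 2, 1]]))  # best right/down path: 1+3+5+2+1 = 12
```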

Comparison

Aspect by aspect:

  • Problem type: RL DP tackles sequential decision-making under uncertainty (MDPs); competitive-programming DP tackles combinatorial optimization (knapsack, LCS, shortest path, …).
  • Transitions: stochastic, with cycles possible, versus deterministic with DAG structure.
  • Core equation: the Bellman equation V(s) = max_a Σ_{s′} p(s′|s,a)[r + γV(s′)] versus a recurrence relation dp[i] = f(dp[j], …).
  • Solution method: iterative convergence (repeated sweeps until Δ < θ) versus a single pass in topological order (bottom-up or memoized top-down).
Based on Example 4.3 in
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Book website · Full text (PDF)