The Gambler's Problem

Dynamic Programming — Value Iteration (Sutton & Barto, Example 4.3)

Problem Description

A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, the gambler wins as many dollars as was staked; if tails, the gambler loses the stake. The game ends when the gambler reaches a goal of $100 or runs out of money ($0).

On each flip, the gambler must decide how much of their capital to stake, in whole dollars. The state is the gambler's capital s ∈ {1, 2, …, 99}, and the actions are stakes a ∈ {1, 2, …, min(s, 100−s)}. The reward is +1 on reaching $100, and 0 on all other transitions. This is an undiscounted, episodic task.
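The following Python sketch spells out this state and action structure; the names GOAL and actions are illustrative helpers introduced here, not part of the original example.

```python
GOAL = 100  # the gambler wins on reaching $100

def actions(s):
    """Legal integer stakes from capital s: 1 up to min(s, GOAL - s)."""
    return range(1, min(s, GOAL - s) + 1)

states = range(1, GOAL)      # nonterminal capital levels 1..99
print(list(actions(75)))     # from $75 the legal stakes are 1..25
```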

The state-value function gives the probability of winning from each state. A policy is a mapping from each state to a stake. Value iteration finds the optimal value function and from it the optimal policy.

V(s) ← max_a [ p_h · V(s+a) + (1−p_h) · V(s−a) ]
with V(0) = 0, V(100) = 1 (terminal conditions)
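A minimal sketch of this sweep in Python, assuming in-place (Gauss–Seidel style) updates and the default settings from the controls below (p_h = 0.40, θ = 10⁻¹⁰); the function name value_iteration and its return of the per-sweep Δ history are illustrative choices, not mandated by the source.

```python
import numpy as np

def value_iteration(p_h=0.40, theta=1e-10, goal=100):
    """Apply V(s) <- max_a [ p_h*V(s+a) + (1-p_h)*V(s-a) ] over all states
    until the largest change in any state falls below theta."""
    V = np.zeros(goal + 1)
    V[goal] = 1.0                 # terminal conditions: V(0) = 0, V(100) = 1
    deltas = []                   # max change per sweep (convergence history)
    while True:
        delta = 0.0
        for s in range(1, goal):
            stakes = range(1, min(s, goal - s) + 1)
            best = max(p_h * V[s + a] + (1 - p_h) * V[s - a] for a in stakes)
            delta = max(delta, abs(best - V[s]))
            V[s] = best           # in-place update within the sweep
        deltas.append(delta)
        if delta < theta:
            return V, deltas
```

Each entry of the returned V approximates the probability of winning from that capital level, and deltas records the maximum change per sweep, corresponding to the convergence chart below.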

Controls

Set the probability of heads p_h (default 0.40) and the convergence threshold θ (default 10⁻¹⁰), then press "One Sweep" to perform a single sweep or "Run to Convergence" to iterate until the maximum value change Δ per sweep falls below θ. The display shows the number of sweeps completed and the current max Δ; click a sweep in the history to see the value function at that point.

Value Function V(s)

Probability of winning from each capital state

Optimal Policy π*(s)

Optimal stake at each capital level (after convergence)
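Once value iteration has converged, the optimal stake at each capital level can be read off greedily from V. A sketch, reusing the conventions of the value_iteration example above; the tie-breaking toward the smallest maximizing stake is an arbitrary choice, and other tie-breaks give other optimal policies.

```python
import numpy as np

def greedy_policy(V, p_h=0.40, goal=100):
    """For each capital level, pick a stake maximizing expected value under V.
    Ties break toward the smallest stake (max() keeps the first maximizer)."""
    policy = np.zeros(goal + 1, dtype=int)
    for s in range(1, goal):
        stakes = range(1, min(s, goal - s) + 1)
        policy[s] = max(stakes,
                        key=lambda a: p_h * V[s + a] + (1 - p_h) * V[s - a])
    return policy
```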

Convergence History

Maximum value change (Δ) per sweep (log scale)

Simulate Optimal Policy

Watch the optimal policy play out on a single episode from a chosen starting capital (default $50). Run value iteration to convergence first.
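A simple way to reproduce such an episode offline, assuming a policy array like the one produced by the greedy_policy sketch above; simulate_episode and its seed parameter are illustrative, not part of the page.

```python
import random

def simulate_episode(policy, p_h=0.40, start=50, goal=100, seed=None):
    """Play one episode from `start`, staking policy[s] at each capital s.
    Returns the sequence of capital levels visited, ending at 0 or goal."""
    rng = random.Random(seed)
    s, trajectory = start, [start]
    while 0 < s < goal:
        stake = policy[s]
        s = s + stake if rng.random() < p_h else s - stake
        trajectory.append(s)
    return trajectory

# Example (policy from the greedy_policy sketch above):
# print(simulate_episode(policy, start=50))
```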