Dynamic Programming In-Class Activity
The Gambler's Problem — Sutton & Barto, Example 4.3
Activity Overview
A gambler repeatedly bets on coin flips. Heads wins the stake, tails loses it. The goal is to reach $100 starting from some capital. Value iteration finds the optimal betting strategy. Type your answers below — they are auto-saved.
Open DynamicProgramming.html. The default coin probability is ph = 0.40 (the coin is biased against the gambler). The value function V(s) gives the probability of winning from capital $s.
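If you want to check the simulation's numbers offline, the sweep performed by the "One Sweep" button can be sketched in a few lines of Python. This is my own sketch of textbook value iteration, not the simulation's actual code; it assumes in-place (Gauss-Seidel-style) updates over states $1–$99, as in Example 4.3:

```python
def value_iteration(ph=0.40, goal=100, theta=1e-9):
    """Value iteration for the gambler's problem.

    V[s] estimates the probability of reaching `goal` from capital s
    under optimal play when heads comes up with probability `ph`.
    """
    V = [0.0] * (goal + 1)
    V[goal] = 1.0  # reaching $100 is the only reward
    sweeps = 0
    while True:
        delta = 0.0
        for s in range(1, goal):
            # Legal stakes run from $1 to min(s, goal - s).
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update within the sweep
        sweeps += 1
        if delta < theta:
            return V, sweeps

V, sweeps = value_iteration()
print(f"converged after {sweeps} sweeps; V(50) = {V[50]:.4f}")
```

A useful sanity check against your guess below: with ph = 0.40 the fixed point has V(50) = 0.40, the win probability of a single all-in bet.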
- Before running anything: if the coin lands heads 40% of the time, do you think the gambler has a good chance of reaching $100 from $50? Guess the probability V(50).
- The gambler can bet between $1 and min(s, 100−s) at each state. If you had $50 and a 40% coin, would you bet big or small? Why?
- Click "One Sweep" once. Watch the V(s) chart update.
- Hover over the chart at s=25, s=50, and s=75 to read the values.
| State | V(s) after Sweep 1 |
|---|---|
| V(25) | |
| V(50) | |
| V(75) | |
- After one sweep, V(s) is still very rough. Why does V(75) already look higher than V(25)?
Hint: from $75, how big a winning bet do you need?
Continue clicking "One Sweep" until you reach sweep 5. After each sweep, hover over s=50 to read V(50). You can also click the numbered sweep tags below the controls to revisit earlier snapshots.
| Sweep | V(50) | Max Δ |
|---|---|---|
| 1 | ||
| 2 | ||
| 3 | ||
| 5 |
- Click back and forth between sweep 1 and sweep 5 using the sweep tags. Describe how the V(s) curve changes shape over successive sweeps.
- Click "Run to Convergence". Wait for the status to read "Converged after … sweeps".
- Look at the Convergence History chart (the orange line).
| Metric | Value |
|---|---|
| Total sweeps to converge | |
| Final V(25) | |
| Final V(50) | |
| Final V(75) |
- How close was your Part 1A guess to the actual V(50)?
- The Convergence History chart shows Δ dropping on a log scale. Does Δ decrease smoothly or in steps? What does the shape tell you about how value iteration works?
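If you want to reproduce the shape of that curve offline, here is a sketch (my own code, assuming the same in-place textbook sweep as before) that records max Δ after every sweep; comparing successive values shows whether the decay is smooth or steppy:

```python
def sweep(V, ph=0.40, goal=100):
    """One in-place value-iteration sweep; returns the largest change."""
    delta = 0.0
    for s in range(1, goal):
        best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                   for a in range(1, min(s, goal - s) + 1))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    return delta

V = [0.0] * 101
V[100] = 1.0
history = []
while not history or history[-1] >= 1e-9:
    history.append(sweep(V))

# Ratios between successive Δ values reveal the decay pattern:
# a roughly constant ratio looks smooth on a log scale, jumps look steppy.
for i, d in enumerate(history[:10], start=1):
    print(f"sweep {i:2d}: max delta = {d:.3e}")
```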
Look at the Optimal Policy π*(s) chart on the right. The green bars show the optimal stake at each capital level.
- Hover over the policy chart to read specific stake values.
| Capital | Optimal Stake |
|---|---|
| s = 12 | |
| s = 25 | |
| s = 50 | |
| s = 75 |
- The policy chart has a distinctive spiky pattern — certain states have very high stakes while neighboring states bet small. Why do you think states like s=25, s=50, and s=75 have tall spikes?
Hint: what is special about these numbers relative to the goal of $100?
- Look at the value function V(s). You should see "steps" or "kinks" at s=25, 50, 75. Explain the connection: why does a step in V(s) correspond to a spike in the policy?
Hint: the Bellman equation picks the action a that maximizes ph·V(s+a) + (1−ph)·V(s−a). At a kink, betting to reach the next step is especially valuable.
- The policy says the gambler should sometimes bet everything (e.g., at s=50, the stake might equal 50). Why is "all-in" optimal when the coin is unfavorable?
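The questions above can be checked numerically: recompute V (reusing the same textbook value-iteration sketch; the function names are mine, not the simulation's) and compare the expected one-step return of several stakes at s = 50.

```python
def value_iteration(ph=0.40, goal=100, theta=1e-9):
    V = [0.0] * (goal + 1)
    V[goal] = 1.0  # terminal win state
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

ph = 0.40
V = value_iteration(ph)
for a in (1, 10, 25, 50):
    q = ph * V[50 + a] + (1 - ph) * V[50 - a]
    print(f"stake ${a:2d} at s=50: expected win probability {q:.4f}")
# All-in reaches $100 on a single winning flip, so its backup is
# ph * V(100) + (1 - ph) * V(0) = 0.40 * 1 + 0.60 * 0 = 0.40,
# which no smaller stake can beat when the coin is unfavorable.
```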
- Move the ph slider to 0.50 (this resets automatically). Click "Run to Convergence".
| Metric | ph = 0.40 | ph = 0.50 |
|---|---|---|
| V(50) | ||
| Sweeps to converge |
- With a fair coin, what does V(s) look like? Is it curved, straight, or something else? Why does this shape make sense?
Hint: with ph = 0.50, the gambler's chance of reaching $100 from $s is just s/100. Why?
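The hint can be verified directly: if V(s) = s/100, every legal stake gives the same fair-coin backup, so the straight line satisfies the Bellman optimality equation. A quick numerical check (my own sketch):

```python
goal = 100
V = [s / goal for s in range(goal + 1)]  # conjectured fair-coin value function

for s in range(1, goal):
    for a in range(1, min(s, goal - s) + 1):
        # 0.5*(s+a)/100 + 0.5*(s-a)/100 simplifies to s/100 for every stake a,
        # so every bet size is equally good with a fair coin.
        backup = 0.5 * V[s + a] + 0.5 * V[s - a]
        assert abs(backup - V[s]) < 1e-12
print("V(s) = s/100 is a fixed point of the Bellman equation at ph = 0.5")
```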
- Set ph = 0.55. Run to convergence. Compare V(s) and the policy to what you saw before.
- How does V(50) change compared to ph = 0.40 and 0.50? Record it below.
| Metric | Value |
|---|---|
| V(50) at ph = 0.55 | |
- Look at the policy chart with ph = 0.55. Is the policy still spiky, or has it changed? What strategy does the optimal gambler use when the coin is favorable?
Hint: with a favorable coin, is it better to take many small bets or a few large ones?
- Summarize: how does the optimal strategy change as ph goes from unfavorable (0.40) to fair (0.50) to favorable (0.55)? Why is the difference so dramatic?
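To see all three regimes side by side, you can rerun the same value-iteration sketch for each ph (my own code; a looser threshold of 1e-6 keeps the favorable-coin run fast, so sweep counts will differ from the simulation's):

```python
def value_iteration(ph, goal=100, theta=1e-6):
    V = [0.0] * (goal + 1)
    V[goal] = 1.0  # terminal win state
    sweeps = 0
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        sweeps += 1
        if delta < theta:
            return V, sweeps

for ph in (0.40, 0.50, 0.55):
    V, sweeps = value_iteration(ph)
    print(f"ph = {ph:.2f}: V(50) = {V[50]:.4f} after {sweeps} sweeps")
```

With an unfavorable coin V(50) sits at 0.40; a fair coin gives the straight line s/100, so V(50) = 0.50; a slightly favorable coin pushes V(50) close to 1, because a patient gambler grinding out small bets wins almost surely.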
Set ph back to 0.40 and run to convergence. Then scroll down to "Simulate Optimal Policy".
- Set starting capital to $50 and click "Play Episode". Watch the runner move along the bar. Note whether the gambler wins or loses.
- Run 4–5 more episodes. Tally the results below.
| Run | Result (Win/Loss) | Steps |
|---|---|---|
| 1 | ||
| 2 | ||
| 3 | ||
| 4 | ||
| 5 |
- How many of your 5 runs ended in a win? Is this consistent with V(50) (the probability of winning from $50)?
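Five episodes is a small sample. A sketch like the following (my own code: greedy with respect to a recomputed V, ties broken toward the smallest stake, fixed seed) runs a thousand episodes so the win rate can be compared to V(50):

```python
import random

def value_iteration(ph=0.40, goal=100, theta=1e-9):
    V = [0.0] * (goal + 1)
    V[goal] = 1.0
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

def play_episode(V, rng, start=50, ph=0.40, goal=100):
    """Follow the greedy policy until ruin ($0) or the goal ($100)."""
    s, steps = start, 0
    while 0 < s < goal:
        # Greedy stake under V; ties go to the smallest stake, which may
        # differ from the simulation's tie-breaking rule.
        a = max(range(1, min(s, goal - s) + 1),
                key=lambda x: ph * V[s + x] + (1 - ph) * V[s - x])
        s += a if rng.random() < ph else -a
        steps += 1
    return s == goal, steps

V = value_iteration()
rng = random.Random(0)
results = [play_episode(V, rng) for _ in range(1000)]
wins = sum(won for won, _ in results)
print(f"won {wins}/1000 episodes; V(50) predicts a rate near {V[50]:.2f}")
```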
The algorithm converges when the maximum change Δ falls below the threshold θ.
When the coin is unfavorable (ph < 0.5), the optimal strategy is to bet boldly, minimizing the number of flips over which the unfavorable odds can act.
When the coin is favorable (ph > 0.5), the optimal strategy is to bet small, letting the edge compound over many flips.
The spikes in the policy at s = 25, 50, 75 occur because from these capitals a single winning bet can reach $100 directly (from s = 50 or s = 75) or the $50 milestone (from s = 25).
- Chapter 4 introduces dynamic programming as a way to compute optimal policies when the environment model (transition probabilities and rewards) is fully known. In this problem, what is the "model" that value iteration uses?
Bonus Challenges (if time permits)
- Set ph = 0.40 and try different convergence thresholds (θ): 10⁻⁴ vs 10⁻¹⁴. How many sweeps does each need? Does the resulting policy differ?
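This experiment can also be scripted with the same value-iteration sketch (my own code; θ is the `theta` parameter, and the greedy tie-breaking may differ from the simulation's):

```python
def value_iteration(ph=0.40, goal=100, theta=1e-4):
    V = [0.0] * (goal + 1)
    V[goal] = 1.0
    sweeps = 0
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        sweeps += 1
        if delta < theta:
            return V, sweeps

def greedy_stakes(V, ph=0.40, goal=100):
    # Stake chosen at each s = 1..99; ties go to the smallest stake.
    return {s: max(range(1, min(s, goal - s) + 1),
                   key=lambda a: ph * V[s + a] + (1 - ph) * V[s - a])
            for s in range(1, goal)}

for theta in (1e-4, 1e-14):
    V, sweeps = value_iteration(theta=theta)
    pi = greedy_stakes(V)
    print(f"theta = {theta:g}: {sweeps} sweeps, "
          f"V(50) = {V[50]:.6f}, stake at s=50: {pi[50]}")
```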
- Click "Animate" to watch value iteration update sweep by sweep. At what point can you visually tell the value function has "settled"? Is this earlier or later than the mathematical convergence?