Dynamic Programming In-Class Activity

The Gambler's Problem — Sutton & Barto, Example 4.3

Activity Overview

Time: ~30 minutes
Format: Pairs or small groups
Materials: Laptop with DynamicProgramming.html
Method: Predict → Experiment → Explain

A gambler repeatedly bets on coin flips. Heads wins the stake, tails loses it. The goal is to reach $100 starting from some capital. Value iteration finds the optimal betting strategy. Type your answers below — they are auto-saved.
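For reference, the update the simulation performs can be sketched in a few lines of Python. This is an independent reimplementation, not the page's own code; the convergence threshold (1e-9) and the in-place, low-to-high sweep order are assumptions.

```python
# Value iteration for the Gambler's Problem (Sutton & Barto, Example 4.3).
# Assumptions: ph = 0.40, goal = $100, threshold 1e-9, in-place sweeps.
GOAL, PH, THETA = 100, 0.40, 1e-9

V = [0.0] * (GOAL + 1)
V[GOAL] = 1.0  # reaching $100 is a win, so its value (win probability) is 1

while True:
    delta = 0.0
    for s in range(1, GOAL):
        # Bellman backup: try every legal stake, keep the best expected win probability
        best = max(PH * V[s + a] + (1 - PH) * V[s - a]
                   for a in range(1, min(s, GOAL - s) + 1))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:
        break

print(round(V[25], 3), round(V[50], 3), round(V[75], 3))  # 0.16 0.4 0.64
```

The converged values at s = 25, 50, 75 are exactly 0.16, 0.4, and 0.64: these are the win probabilities of "bold play" (always betting everything needed), which is optimal for an unfavorable coin.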

Part 1 — Understanding the Gambler's Problem (5 min)
1A — Predict Before You Sweep (3 min)

Open DynamicProgramming.html. The default coin probability is ph = 0.40 (the coin is biased against the gambler). The value function V(s) gives the probability of winning from capital $s.

  • Before running anything: if the coin lands heads 40% of the time, do you think the gambler has a good chance of reaching $100 from $50? Guess the probability V(50).
  • The gambler can bet between $1 and min(s, 100−s) at each state. If you had $50 and a 40% coin, would you bet big or small? Why?
1B — The First Sweep (2 min)
  • Click "One Sweep" once. Watch the V(s) chart update.
  • Hover over the chart at s=25, s=50, and s=75 to read the values.
  State    V(s) after Sweep 1
  s = 25   ______
  s = 50   ______
  s = 75   ______
  • After one sweep, V(s) is still very rough. Why does V(75) already look higher than V(25)?
    Hint: from $75, how big a winning bet do you need?
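To see why one sweep already separates these states, here is a single in-place sweep in Python (the ascending sweep order is an assumption; the simulation may sweep differently, so your chart's numbers need not match exactly). At the start only V(100) = 1 carries information, so the first sweep can only reward states that reach $100 in one winning bet:

```python
GOAL, PH = 100, 0.40
V = [0.0] * (GOAL + 1)
V[GOAL] = 1.0

# One in-place sweep, low states first: a backup is nonzero only if it touches
# V(100) = 1 or a state already updated earlier in this same sweep.
for s in range(1, GOAL):
    V[s] = max(PH * V[s + a] + (1 - PH) * V[s - a]
               for a in range(1, min(s, GOAL - s) + 1))

print(V[25], V[50], V[75])  # V(25) = 0, V(50) = 0.4, V(75) ≈ 0.64
```

From $25 no single bet reaches $100, so V(25) stays 0; from $50 an all-in bet wins with probability 0.4; from $75 a $25 bet either wins outright or falls back to $50, which this sweep has already valued at 0.4.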
Part 2 — Watching Value Iteration Converge (7 min)
2A — Sweep by Sweep (4 min)

Continue clicking "One Sweep" until you reach sweep 5. After each sweep, hover over s=50 to read V(50). You can also click the numbered sweep tags below the controls to revisit earlier snapshots.

  Sweep   V(50)    Max Δ
  1       ______   ______
  2       ______   ______
  3       ______   ______
  5       ______   ______
  • Click back and forth between sweep 1 and sweep 5 using the sweep tags. Describe how the V(s) curve changes shape over successive sweeps.
2B — Run to Convergence (3 min)
  • Click "Run to Convergence". Wait for the status to read "Converged after … sweeps".
  • Look at the Convergence History chart (the orange line).
  Metric                     Value
  Total sweeps to converge   ______
  Final V(25)                ______
  Final V(50)                ______
  Final V(75)                ______
  • How close was your Part 1A guess to the actual V(50)?
  • The Convergence History chart shows Δ dropping on a log scale. Does Δ decrease smoothly or in steps? What does the shape tell you about how value iteration works?
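One way to reproduce the Convergence History data is to log the largest change per sweep while running value iteration. A sketch (assuming ph = 0.40, in-place sweeps, and a 1e-9 threshold; the simulation's exact sweep count may differ):

```python
GOAL, PH, THETA = 100, 0.40, 1e-9
V = [0.0] * (GOAL + 1)
V[GOAL] = 1.0

deltas = []  # max |change in V(s)| per sweep: the quantity on the log-scale curve
while not deltas or deltas[-1] >= THETA:
    delta = 0.0
    for s in range(1, GOAL):
        best = max(PH * V[s + a] + (1 - PH) * V[s - a]
                   for a in range(1, min(s, GOAL - s) + 1))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    deltas.append(delta)

print(len(deltas))  # sweeps to convergence
```

Plotting `deltas` on a log scale shows whether the decay is roughly geometric (a straight line) or happens in steps as new regions of the state space "learn" about the goal.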
Part 3 — The Optimal Policy (6 min)
3A — Reading the Policy Chart (3 min)

Look at the Optimal Policy π*(s) chart on the right. The green bars show the optimal stake at each capital level.

  • Hover over the policy chart to read specific stake values.
  Capital   Optimal Stake
  s = 12    ______
  s = 25    ______
  s = 50    ______
  s = 75    ______
  • The policy chart has a distinctive spiky pattern — certain states have very high stakes while neighboring states bet small. Why do you think states like s=25, s=50, and s=75 have tall spikes?
    Hint: what is special about these numbers relative to the goal of $100?
3B — Connecting Value and Policy (3 min)
  • Look at the value function V(s). You should see "steps" or "kinks" at s=25, 50, 75. Explain the connection: why does a step in V(s) correspond to a spike in the policy?
    Hint: the Bellman equation picks the action a that maximizes ph·V(s+a) + (1−ph)·V(s−a). At a kink, betting to reach the next step is especially valuable.
  • The policy says the gambler should sometimes bet everything (e.g., at s=50, the stake might equal 50). Why is "all-in" optimal when the coin is unfavorable?
Part 4 — Changing the Coin (7 min)
4A — Fair Coin (ph = 0.50) (3 min)
  • Move the ph slider to 0.50 (this resets automatically). Click "Run to Convergence".
  Metric               ph = 0.40   ph = 0.50
  V(50)                ______      ______
  Sweeps to converge   ______      ______
  • With a fair coin, what does V(s) look like? Is it curved, straight, or something else? Why does this shape make sense?
    Hint: with ph = 0.50, the gambler's chance of reaching $100 from $s is just s/100. Why?
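The hint's claim, V(s) = s/100, can be checked directly against the Bellman optimality equation without running any sweeps. With a fair coin every stake a gives 0.5·(s+a)/100 + 0.5·(s−a)/100 = s/100, so all legal bets are equally good and the optimal value is a straight line:

```python
# Verify that V(s) = s/100 is a fixed point of the Bellman operator when ph = 0.5.
GOAL, PH = 100, 0.50
V = [s / GOAL for s in range(GOAL + 1)]

max_err = 0.0
for s in range(1, GOAL):
    best = max(PH * V[s + a] + (1 - PH) * V[s - a]
               for a in range(1, min(s, GOAL - s) + 1))
    max_err = max(max_err, abs(best - V[s]))

print(max_err)  # ~0: the linear function already satisfies the equation
```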
4B — Favorable Coin (ph = 0.55) (2 min)
  • Set ph = 0.55. Run to convergence. Compare V(s) and the policy to what you saw before.
  • How does V(50) change compared to ph = 0.40 and 0.50? Record it below.
  V(50) at ph = 0.55: ______
  • Look at the policy chart with ph = 0.55. Is the policy still spiky, or has it changed? What strategy does the optimal gambler use when the coin is favorable?
    Hint: with a favorable coin, is it better to take many small bets or a few large ones?
4C — The Big Picture (2 min)
  • Summarize: how does the optimal strategy change as ph goes from unfavorable (0.40) to fair (0.50) to favorable (0.55)? Why is the difference so dramatic?
Part 5 — Simulating the Optimal Policy (3 min)
5A — Watch It Play (3 min)

Set ph back to 0.40 and run to convergence. Then scroll down to "Simulate Optimal Policy".

  • Set starting capital to $50 and click "Play Episode". Watch the runner move along the bar. Note whether the gambler wins or loses.
  • Run 4–5 more episodes. Tally the results below.
  Run   Result (Win/Loss)   Steps
  1     ______              ______
  2     ______              ______
  3     ______              ______
  4     ______              ______
  5     ______              ______
  • How many of your 5 runs ended in a win? Is this consistent with V(50) (the probability of winning from $50)?
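Five episodes is a small sample, so your tally may sit well away from V(50). As an independent sanity check (a sketch, not the simulation's code), we can simulate many episodes of bold play, one optimal policy when ph < 0.5, and compare the win rate to V(50) = 0.4. From $50 bold play stakes everything, so each episode is decided by a single flip:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
PH, GOAL = 0.40, 100

def episode(s):
    """Play one game with bold (all-in) stakes; return True on reaching $100."""
    while 0 < s < GOAL:
        a = min(s, GOAL - s)                      # bold stake
        s = s + a if random.random() < PH else s - a
    return s == GOAL

n = 10_000
wins = sum(episode(50) for _ in range(n))
print(wins / n)  # should be close to V(50) = 0.4
```

With 10,000 episodes the win rate lands within about ±0.015 of 0.4; with only 5 episodes, anywhere from 0 to 4 wins would be unsurprising.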
Part 6 — Reflection (2 min)
6A — Fill in the Blanks (1 min)
Value iteration repeatedly applies the ______________ equation to update V(s).
The algorithm converges when the maximum change Δ falls below the threshold ______________.
When the coin is unfavorable (ph < 0.5), the optimal strategy is to bet ______________ to minimize the number of flips.
When the coin is favorable (ph > 0.5), the optimal strategy is to bet ______________ to let the edge compound over many flips.
The spikes in the policy at s = 25, 50, 75 occur because these capitals can reach ______________ in a single winning bet.
6B — Connecting to the Textbook (1 min)
  • Chapter 4 introduces dynamic programming as a way to compute optimal policies when the environment model (transition probabilities and rewards) is fully known. In this problem, what is the "model" that value iteration uses?

Bonus Challenges (if time permits)

Your answers are auto-saved in your browser. Use the buttons above to export for submission.