Dynamic Programming In-Class Activity
The Gambler's Problem — Sutton & Barto, Example 4.3
Activity Overview
A gambler repeatedly bets on coin flips. Heads wins the stake, tails loses it. The goal is to reach $100 starting from some capital. Value iteration finds the optimal betting strategy. Type your answers below — they are auto-saved.
Open DynamicProgramming.html. The default coin probability is ph = 0.40 (the coin is biased against the gambler). The value function V(s) gives the probability of winning from capital $s.
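If you want to check the simulation's numbers offline, the sweep performed by the "One Sweep" button can be sketched in a few lines of Python. This is my own sketch of textbook value iteration, not the simulation's actual code; it assumes in-place (Gauss-Seidel-style) updates over states $1–$99, as in Example 4.3:

```python
def value_iteration(ph=0.40, goal=100, theta=1e-9):
    """Value iteration for the gambler's problem.

    V[s] estimates the probability of reaching `goal` from capital s
    under optimal play when heads comes up with probability `ph`.
    """
    V = [0.0] * (goal + 1)
    V[goal] = 1.0  # reaching $100 is the only reward
    sweeps = 0
    while True:
        delta = 0.0
        for s in range(1, goal):
            # Legal stakes run from $1 to min(s, goal - s).
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update within the sweep
        sweeps += 1
        if delta < theta:
            return V, sweeps

V, sweeps = value_iteration()
print(f"converged after {sweeps} sweeps; V(50) = {V[50]:.4f}")
```

A useful sanity check against your guess below: with ph = 0.40 the fixed point has V(50) = 0.40, the win probability of a single all-in bet.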
- Before running anything: if the coin lands heads 40% of the time, do you think the gambler has a good chance of reaching $100 from $50? Guess the probability V(50).
- The gambler can bet between $1 and min(s, 100−s) at each state. If you had $50 and a 40% coin, would you bet big or small? Why?
- Click "One Sweep" once. Watch the V(s) chart update.
- Hover over the chart at s=25, s=50, and s=75 to read the values.
| State | V(s) after Sweep 1 |
|---|---|
| V(25) | |
| V(50) | |
| V(75) | |
- After one sweep, V(s) is still very rough. Why does V(75) already look higher than V(25)?
Hint: from $75, how big a winning bet do you need?
Continue clicking "One Sweep" until you reach sweep 5. After each sweep, hover over s=50 to read V(50). You can also click the numbered sweep tags below the controls to revisit earlier snapshots.
| Sweep | V(50) | Max Δ |
|---|---|---|
| 1 | ||
| 2 | ||
| 3 | ||
| 5 |
- Click back and forth between sweep 1 and sweep 5 using the sweep tags. Describe how the V(s) curve changes shape over successive sweeps.
- Click "Run to Convergence". Wait for the status to read "Converged after … sweeps".
- Look at the Convergence History chart (the orange line).
| Metric | Value |
|---|---|
| Total sweeps to converge | |
| Final V(25) | |
| Final V(50) | |
| Final V(75) |
- How close was your Part 1A guess to the actual V(50)?
- The Convergence History chart shows Δ dropping on a log scale. Does Δ decrease smoothly or in steps? What does the shape tell you about how value iteration works?
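If you want to reproduce the shape of that curve offline, here is a sketch (my own code, assuming the same in-place textbook sweep as before) that records max Δ after every sweep; comparing successive values shows whether the decay is smooth or steppy:

```python
def sweep(V, ph=0.40, goal=100):
    """One in-place value-iteration sweep; returns the largest change."""
    delta = 0.0
    for s in range(1, goal):
        best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                   for a in range(1, min(s, goal - s) + 1))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    return delta

V = [0.0] * 101
V[100] = 1.0
history = []
while not history or history[-1] >= 1e-9:
    history.append(sweep(V))

# Ratios between successive Δ values reveal the decay pattern:
# a roughly constant ratio looks smooth on a log scale, jumps look steppy.
for i, d in enumerate(history[:10], start=1):
    print(f"sweep {i:2d}: max delta = {d:.3e}")
```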
Look at the Optimal Policy π*(s) chart on the right. The green bars show the optimal stake at each capital level.
- Hover over the policy chart to read specific stake values.
| Capital | Optimal Stake |
|---|---|
| s = 12 | |
| s = 25 | |
| s = 50 | |
| s = 75 |
- The policy chart has a distinctive spiky pattern — certain states have very high stakes while neighboring states bet small. Why do you think states like s=25, s=50, and s=75 have tall spikes?
Hint: what is special about these numbers relative to the goal of $100?
- Look at the value function V(s). You should see "steps" or "kinks" at s=25, 50, 75. Explain the connection: why does a step in V(s) correspond to a spike in the policy?
Hint: the Bellman equation picks the action a that maximizes ph·V(s+a) + (1−ph)·V(s−a). At a kink, betting to reach the next step is especially valuable.
- The policy says the gambler should sometimes bet everything (e.g., at s=50, the stake might equal 50). Why is "all-in" optimal when the coin is unfavorable?
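The questions above can be checked numerically: recompute V (reusing the same textbook value-iteration sketch; the function names are mine, not the simulation's) and compare the expected one-step return of several stakes at s = 50.

```python
def value_iteration(ph=0.40, goal=100, theta=1e-9):
    V = [0.0] * (goal + 1)
    V[goal] = 1.0  # terminal win state
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

ph = 0.40
V = value_iteration(ph)
for a in (1, 10, 25, 50):
    q = ph * V[50 + a] + (1 - ph) * V[50 - a]
    print(f"stake ${a:2d} at s=50: expected win probability {q:.4f}")
# All-in reaches $100 on a single winning flip, so its backup is
# ph * V(100) + (1 - ph) * V(0) = 0.40 * 1 + 0.60 * 0 = 0.40,
# which no smaller stake can beat when the coin is unfavorable.
```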
- Move the ph slider to 0.50 (this resets automatically). Click "Run to Convergence".
| Metric | ph = 0.40 | ph = 0.50 |
|---|---|---|
| V(50) | ||
| Sweeps to converge |
- With a fair coin, what does V(s) look like? Is it curved, straight, or something else? Why does this shape make sense?
Hint: with ph = 0.50, the gambler's chance of reaching $100 from $s is just s/100. Why?
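The hint can be verified directly: if V(s) = s/100, every legal stake gives the same fair-coin backup, so the straight line satisfies the Bellman optimality equation. A quick numerical check (my own sketch):

```python
goal = 100
V = [s / goal for s in range(goal + 1)]  # conjectured fair-coin value function

for s in range(1, goal):
    for a in range(1, min(s, goal - s) + 1):
        # 0.5*(s+a)/100 + 0.5*(s-a)/100 simplifies to s/100 for every stake a,
        # so every bet size is equally good with a fair coin.
        backup = 0.5 * V[s + a] + 0.5 * V[s - a]
        assert abs(backup - V[s]) < 1e-12
print("V(s) = s/100 is a fixed point of the Bellman equation at ph = 0.5")
```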
- Set ph = 0.55. Run to convergence. Compare V(s) and the policy to what you saw before.
- How does V(50) change compared to ph = 0.40 and 0.50? Record it below.
| Metric | Value |
|---|---|
| V(50) at ph = 0.55 | |
- Look at the policy chart with ph = 0.55. Is the policy still spiky, or has it changed? What strategy does the optimal gambler use when the coin is favorable?
Hint: with a favorable coin, is it better to take many small bets or a few large ones?
- Summarize: how does the optimal strategy change as ph goes from unfavorable (0.40) to fair (0.50) to favorable (0.55)? Why is the difference so dramatic?
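To see all three regimes side by side, you can rerun the same value-iteration sketch for each ph (my own code; a looser threshold of 1e-6 keeps the favorable-coin run fast, so sweep counts will differ from the simulation's):

```python
def value_iteration(ph, goal=100, theta=1e-6):
    V = [0.0] * (goal + 1)
    V[goal] = 1.0  # terminal win state
    sweeps = 0
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        sweeps += 1
        if delta < theta:
            return V, sweeps

for ph in (0.40, 0.50, 0.55):
    V, sweeps = value_iteration(ph)
    print(f"ph = {ph:.2f}: V(50) = {V[50]:.4f} after {sweeps} sweeps")
```

With an unfavorable coin V(50) sits at 0.40; a fair coin gives the straight line s/100, so V(50) = 0.50; a slightly favorable coin pushes V(50) close to 1, because a patient gambler grinding out small bets wins almost surely.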
Set ph back to 0.40 and run to convergence. Then scroll down to "Simulate Optimal Policy".
- Set starting capital to $50 and click "Play Episode". Watch the runner move along the bar. Note whether the gambler wins or loses.
- Run 4–5 more episodes. Tally the results below.
| Run | Result (Win/Loss) | Steps |
|---|---|---|
| 1 | ||
| 2 | ||
| 3 | ||
| 4 | ||
| 5 |
- How many of your 5 runs ended in a win? Is this consistent with V(50) (the probability of winning from $50)?
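Five episodes is a small sample. A sketch like the following (my own code: greedy with respect to a recomputed V, ties broken toward the smallest stake, fixed seed) runs a thousand episodes so the win rate can be compared to V(50):

```python
import random

def value_iteration(ph=0.40, goal=100, theta=1e-9):
    V = [0.0] * (goal + 1)
    V[goal] = 1.0
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

def play_episode(V, rng, start=50, ph=0.40, goal=100):
    """Follow the greedy policy until ruin ($0) or the goal ($100)."""
    s, steps = start, 0
    while 0 < s < goal:
        # Greedy stake under V; ties go to the smallest stake, which may
        # differ from the simulation's tie-breaking rule.
        a = max(range(1, min(s, goal - s) + 1),
                key=lambda x: ph * V[s + x] + (1 - ph) * V[s - x])
        s += a if rng.random() < ph else -a
        steps += 1
    return s == goal, steps

V = value_iteration()
rng = random.Random(0)
results = [play_episode(V, rng) for _ in range(1000)]
wins = sum(won for won, _ in results)
print(f"won {wins}/1000 episodes; V(50) predicts a rate near {V[50]:.2f}")
```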
The algorithm converges when the maximum change Δ falls below the threshold θ.
When the coin is unfavorable (ph < 0.5), the optimal strategy is to bet boldly, minimizing the number of flips over which the unfavorable odds can act.
When the coin is favorable (ph > 0.5), the optimal strategy is to bet small, letting the edge compound over many flips.
The spikes in the policy at s = 25, 50, 75 occur because from these capitals a single winning bet can reach $100 directly (from s = 50 or s = 75) or the $50 milestone (from s = 25).
- Chapter 4 introduces dynamic programming as a way to compute optimal policies when the environment model (transition probabilities and rewards) is fully known. In this problem, what is the "model" that value iteration uses?
Bonus Challenges (if time permits)
- Set ph = 0.40 and try different convergence thresholds (θ): 10⁻⁴ vs 10⁻¹⁴. How many sweeps does each need? Does the resulting policy differ?
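This experiment can also be scripted with the same value-iteration sketch (my own code; θ is the `theta` parameter, and the greedy tie-breaking may differ from the simulation's):

```python
def value_iteration(ph=0.40, goal=100, theta=1e-4):
    V = [0.0] * (goal + 1)
    V[goal] = 1.0
    sweeps = 0
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(ph * V[s + a] + (1 - ph) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        sweeps += 1
        if delta < theta:
            return V, sweeps

def greedy_stakes(V, ph=0.40, goal=100):
    # Stake chosen at each s = 1..99; ties go to the smallest stake.
    return {s: max(range(1, min(s, goal - s) + 1),
                   key=lambda a: ph * V[s + a] + (1 - ph) * V[s - a])
            for s in range(1, goal)}

for theta in (1e-4, 1e-14):
    V, sweeps = value_iteration(theta=theta)
    pi = greedy_stakes(V)
    print(f"theta = {theta:g}: {sweeps} sweeps, "
          f"V(50) = {V[50]:.6f}, stake at s=50: {pi[50]}")
```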
- Click "Animate" to watch value iteration update sweep by sweep. At what point can you visually tell the value function has "settled"? Is this earlier or later than the mathematical convergence?