The Gambler's Problem
Dynamic Programming — Value Iteration (Sutton & Barto, Example 4.3)
Problem Description
A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads (which happens with probability p_h), the gambler wins as many dollars as were staked; if tails, the gambler loses the stake. The game ends when the gambler reaches a goal of $100 or runs out of money ($0).
On each flip, the gambler must decide what portion of capital to stake, in integer amounts. The state is the gambler's capital s ∈ {1, 2, …, 99}, and the actions are stakes a ∈ {1, 2, …, min(s, 100−s)}. The reward is +1 on reaching $100, and 0 on all other transitions. This is an undiscounted, episodic task.
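The action set above can be written as a small helper; a minimal sketch (the function name `actions` and the `goal` parameter are illustrative, not from the book):

```python
def actions(s, goal=100):
    """Legal stakes from capital s: bet at least $1, never more than the
    current capital, and never more than is needed to reach the goal."""
    return range(1, min(s, goal - s) + 1)
```

For example, from s = 99 the only legal stake is $1, since a larger bet would overshoot the goal.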
The state-value function gives the probability of winning from each state. A policy is a mapping from each state to a stake. Value iteration finds the optimal value function, from which an optimal policy can be read off greedily.
The value iteration update for this problem is

    V(s) ← max over a ∈ {1, …, min(s, 100−s)} of [ p_h · V(s + a) + (1 − p_h) · V(s − a) ]

with V(0) = 0, V(100) = 1 (terminal conditions). Because the only reward is +1 on reaching $100, fixing V(100) = 1 folds that reward into the terminal value, so no explicit reward term is needed in the update.
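The update can be sketched as a short in-place sweep. This is a minimal sketch, not the book's reference code: the heads probability p_h = 0.4 matches the case plotted in Example 4.3, while the convergence threshold `theta` and the tie-breaking rule for the greedy policy are assumptions of this sketch.

```python
import numpy as np

def value_iteration(p_h=0.4, goal=100, theta=1e-9):
    """Value iteration for the gambler's problem.
    V[s] converges to the probability of reaching the goal from capital s."""
    V = np.zeros(goal + 1)
    V[goal] = 1.0  # terminal conditions: V(0) = 0, V(100) = 1
    while True:
        delta = 0.0
        for s in range(1, goal):
            # expected return of stake a: p_h*V(s+a) + (1-p_h)*V(s-a)
            returns = [p_h * V[s + a] + (1 - p_h) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1)]
            best = max(returns)
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place (Gauss-Seidel style) update
        if delta < theta:
            break
    # Greedy policy. Many stakes tie in value for p_h < 0.5; rounding before
    # argmax collapses floating-point noise, and ties resolve to the smallest
    # stake (an assumed tie-break -- the book's figure shows one of several
    # optimal policies).
    policy = np.zeros(goal + 1, dtype=int)
    for s in range(1, goal):
        returns = [p_h * V[s + a] + (1 - p_h) * V[s - a]
                   for a in range(1, min(s, goal - s) + 1)]
        policy[s] = int(np.argmax(np.round(returns, 6))) + 1
    return V, policy
```

For p_h = 0.4 the computed V(50) is very close to 0.4, consistent with the bold-play result that staking everything at $50 wins with exactly probability p_h.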