Interactive exploration of the k-armed bandit problem (Sutton & Barto, Chapter 2)
The k-Armed Bandit Problem
Consider a repeated choice among k = 10 different actions. After each choice you receive a numerical reward
drawn from a stationary probability distribution that depends on the action you selected.
Your objective is to maximize the expected total reward over some time period, say 2000 time steps.
Each action has an expected (mean) reward, called its value. We denote the action selected at time step t
as A_t and its reward as R_t. The value of an arbitrary action a is
q*(a) = E[R_t | A_t = a]. If we knew the value of each action, the problem would be trivial:
always pick the action with the highest value. The challenge is that we don't know these values with certainty; we can only estimate them.
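As a concrete illustration of what estimation means here, a minimal sketch in Python (assuming unit-variance Gaussian rewards, as in the testbed shown below): the sample average of an action's observed rewards converges to its true value q*(a).

```python
import numpy as np

rng = np.random.default_rng(0)

q_star = 1.5  # hypothetical true value q*(a) of one action
rewards = rng.normal(loc=q_star, scale=1.0, size=10_000)  # simulated rewards from that action

# The sample average approaches q*(a) as more rewards are observed.
print(rewards[:10].mean())  # noisy estimate after 10 pulls
print(rewards.mean())       # much closer to 1.5 after 10,000 pulls
```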
The exploration vs. exploitation dilemma is central: should we exploit what we already know (pick the current
best estimate) or explore other actions that might turn out to be better? The ε-greedy strategy addresses this by
choosing a random (exploratory) action with probability ε, and the greedy action otherwise.
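A minimal sketch of one ε-greedy decision with incremental sample-average estimates; the names Q, N, and bandit_pull are illustrative, not taken from this page's implementation.

```python
import numpy as np

def epsilon_greedy_step(Q, N, bandit_pull, epsilon, rng):
    """One decision: pick an arm, observe a reward, update that arm's estimate."""
    k = len(Q)
    if rng.random() < epsilon:
        a = int(rng.integers(k))   # explore: uniform random arm
    else:
        a = int(np.argmax(Q))      # exploit: arm with the highest current estimate
    r = bandit_pull(a)             # observe a reward for the chosen arm
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]      # incremental sample-average update
    return a, r
```

The incremental update keeps Q[a] equal to the average of all rewards observed so far for arm a, without storing the individual rewards.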
An example 10-armed bandit (cf. Sutton & Barto, Figure 2.1):
Each action yields rewards drawn from a Gaussian distribution centered at the true value q*(a). Dashed lines mark the means.
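A sketch of one such bandit instance, following the standard testbed convention (true values drawn from N(0, 1), rewards from N(q*(a), 1)); the unit variances are assumed from the book rather than read off this page.

```python
import numpy as np

class GaussianBandit:
    """k-armed testbed bandit: q*(a) ~ N(0, 1), rewards ~ N(q*(a), 1)."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values
        self.optimal = int(np.argmax(self.q_star))       # index of the best arm

    def pull(self, a):
        # Reward: the arm's true value plus unit-variance Gaussian noise.
        return self.rng.normal(self.q_star[a], 1.0)
```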
Experiment Configuration
Choose a suggested experiment or build your own:
Runs vs Steps: Each step is one decision the agent makes (pick an arm, get a reward, update estimates).
Each run creates a brand-new random bandit and lets the agent learn from scratch for the given number of steps.
The charts show results averaged across all runs for smoother, more reliable curves.
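Putting these pieces together, the runs-and-steps loop could look roughly like the sketch below, reusing the GaussianBandit and epsilon_greedy_step sketches from earlier; the run and step counts are placeholders, not this page's defaults.

```python
import numpy as np

def run_experiment(epsilon, runs=200, steps=1000, k=10, seed=0):
    rng = np.random.default_rng(seed)
    rewards = np.zeros((runs, steps))
    for run in range(runs):
        # A brand-new random bandit for every run; the agent starts from scratch.
        bandit = GaussianBandit(k=k, seed=int(rng.integers(2**31)))
        Q = np.zeros(k)  # action-value estimates, initialized to zero
        N = np.zeros(k)  # pull counts per arm
        for t in range(steps):
            a, r = epsilon_greedy_step(Q, N, bandit.pull, epsilon, rng)
            rewards[run, t] = r
    return rewards.mean(axis=0)  # average reward at each step, across runs
```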
Results
Average Reward is the mean reward received at each time step, averaged across all runs.
It shows how quickly and how well the agent learns to pick good actions.
A higher curve means the agent is earning more reward per step.
The dashed "optimal" line marks the expected reward from always choosing the best arm — no real agent can reach it immediately because it must first learn which arm is best.
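The level of that dashed line can be estimated directly: averaged over many sampled bandits, the expected reward of always playing the best arm is E[max_a q*(a)], which for the 10-armed N(0, 1) testbed is roughly 1.54. A quick check, assuming that testbed:

```python
import numpy as np

rng = np.random.default_rng(0)
# Value of the best arm in each of many sampled 10-armed bandits.
best_values = rng.normal(0.0, 1.0, size=(100_000, 10)).max(axis=1)
print(best_values.mean())  # ~1.54: expected reward of always playing the best arm
```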
% Optimal Action is the fraction of time steps on which the agent chose the true best arm, shown as a percentage.
It directly measures how well the agent has identified the optimal action.
A greedy agent (ε=0) often plateaus at a low percentage because it locks onto a suboptimal arm early and never discovers better ones.
An exploring agent climbs much higher as it gathers enough information to reliably pick the best arm, though a constant ε keeps the curve below 100%: on a fraction ε of steps it still picks an arm at random.
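One common way to compute this curve (the convention in Sutton & Barto's figures, assumed here) is, at each time step, the fraction of runs in which the chosen arm was that run's true best arm; a minimal sketch, assuming the chosen arms were recorded per run:

```python
import numpy as np

def percent_optimal(actions, optimal_arms):
    """actions: (runs, steps) chosen arm indices; optimal_arms: (runs,) best arm of each run's bandit."""
    hits = actions == optimal_arms[:, None]  # True wherever the run's best arm was chosen
    return 100.0 * hits.mean(axis=0)         # % of runs choosing the best arm, at each step
```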
Cumulative Regret is the total reward lost compared to always playing the optimal arm, summed from the start up to each time step.
At each step the regret is q*(best arm) − q*(chosen arm); the chart plots the running total.
A steeper slope means the agent is making worse choices at that point in time.
An ideal agent's regret curve flattens out as it learns; a stuck agent's regret keeps growing linearly.
Lower cumulative regret means the agent wasted less reward over the entire experiment.
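A sketch of that running total, assuming the true value of each chosen arm was recorded per run (the argument names are illustrative):

```python
import numpy as np

def cumulative_regret(q_star_best, q_star_chosen):
    """q_star_best: (runs,) value of each run's best arm.
    q_star_chosen: (runs, steps) true value of the arm chosen at each step."""
    per_step = q_star_best[:, None] - q_star_chosen  # instantaneous regret, always >= 0
    return per_step.cumsum(axis=1).mean(axis=0)      # running total, averaged across runs
```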
Sample Bandit Instance (last run)
True action values q*(a) for one sample bandit. The optimal arm is highlighted.