10-Armed Bandit Testbed

Interactive exploration of the k-armed bandit problem (Sutton & Barto, Chapter 2)

The k-Armed Bandit Problem

Consider a repeated choice among k = 10 different actions. After each choice you receive a numerical reward drawn from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, say 2000 time steps.

Each action has an expected (mean) reward — its value. We denote the action selected at time step t as A_t and its reward as R_t. The value of an arbitrary action a is q*(a) = E[R_t | A_t = a]. If we knew the value of each action, the problem would be trivial: always pick the action with the highest value. The challenge is that we don't know these values with certainty — we can only estimate them.
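
A common way to form those estimates is the sample average of rewards observed for each action, updated incrementally. The sketch below is illustrative only (the class name ArmEstimate is not part of the testbed's code):

```python
class ArmEstimate:
    """Incremental sample-average estimate of one action's value."""

    def __init__(self):
        self.q = 0.0   # current estimate of q*(a)
        self.n = 0     # number of times this action has been selected

    def update(self, reward):
        # Incremental sample average: Q_{n+1} = Q_n + (R_n - Q_n) / n
        self.n += 1
        self.q += (reward - self.q) / self.n
```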

The exploration vs. exploitation dilemma is central: should we exploit what we already know (pick the action with the highest current estimate) or explore other actions that might turn out to be better? The ε-greedy strategy addresses this by choosing a random (exploratory) action with probability ε, and the greedy action otherwise.
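
A minimal sketch of ε-greedy selection (the function name and random tie-breaking among greedy actions are assumptions, not necessarily how this demo implements it):

```python
import random

def epsilon_greedy(estimates, epsilon):
    """Pick a random arm with probability epsilon, else a greedy arm."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))   # explore
    best = max(estimates)
    # Break ties among equally good greedy actions at random.
    candidates = [a for a, q in enumerate(estimates) if q == best]
    return random.choice(candidates)
```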

An example 10-armed bandit (cf. Sutton & Barto, Figure 2.1) — Each action yields rewards drawn from a Gaussian distribution centered at the true value q*(a). Dashed lines mark the means.
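
Following the book's Figure 2.1 construction, the true values q*(a) are themselves drawn from a standard normal distribution and each reward is Gaussian with unit variance around q*(a). A testbed like the one pictured could be sampled as in this sketch; the exact distributions used by this demo are an assumption on my part:

```python
import random

def make_bandit(k=10):
    """Sample a k-armed testbed: true values q*(a) drawn from N(0, 1)."""
    return [random.gauss(0.0, 1.0) for _ in range(k)]

def pull(q_star, action):
    """Reward for an action: Gaussian with mean q*(a) and unit variance."""
    return random.gauss(q_star[action], 1.0)
```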

Experiment Configuration

Choose a suggested experiment or build your own.
Runs vs Steps: Each step is one decision the agent makes (pick an arm, get a reward, update estimates). Each run creates a brand-new random bandit and lets the agent learn from scratch for the given number of steps. The charts show results averaged across all runs for smoother, more reliable curves.
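
Putting the pieces together, a run-averaged experiment loop might look like this sketch. It reuses the hypothetical helpers defined above (make_bandit, pull, epsilon_greedy); the demo's actual implementation and defaults may differ:

```python
def run_experiment(runs=2000, steps=1000, epsilon=0.1, k=10):
    """Average reward at each step, averaged over independent runs."""
    avg_reward = [0.0] * steps
    for _ in range(runs):
        q_star = make_bandit(k)        # fresh random bandit for each run
        q_est = [0.0] * k              # value estimates start at zero
        counts = [0] * k
        for t in range(steps):
            a = epsilon_greedy(q_est, epsilon)
            r = pull(q_star, a)
            counts[a] += 1
            q_est[a] += (r - q_est[a]) / counts[a]   # sample-average update
            avg_reward[t] += r / runs
    return avg_reward
```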