Interactive exploration of the k-armed bandit problem (Sutton & Barto, Chapter 2)
The k-Armed Bandit Problem
Consider a repeated choice among k = 10 different actions. After each choice you receive a numerical reward
drawn from a stationary probability distribution that depends on the action you selected.
Your objective is to maximize the expected total reward over some time period, say 2000 time steps.
Each action has an expected (mean) reward, called its value. We denote the action selected at time step t
as A_t and its reward as R_t. The value of an arbitrary action a is
q*(a) = E[R_t | A_t = a]. If we knew the value of each action, the problem would be trivial:
always pick the action with the highest value. The challenge is that we don't know these values with certainty; we can only estimate them.
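As a concrete illustration of what estimation means here, a minimal sketch in Python (assuming unit-variance Gaussian rewards, as in the testbed shown below): the sample average of an action's observed rewards converges to its true value q*(a).

```python
import numpy as np

rng = np.random.default_rng(0)

q_star = 1.5  # hypothetical true value q*(a) of one action
rewards = rng.normal(loc=q_star, scale=1.0, size=10_000)  # simulated rewards from that action

# The sample average approaches q*(a) as more rewards are observed.
print(rewards[:10].mean())  # noisy estimate after 10 pulls
print(rewards.mean())       # much closer to 1.5 after 10,000 pulls
```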
The exploration vs. exploitation dilemma is central: should we exploit what we already know (pick the current
best estimate) or explore other actions that might turn out to be better? The ε-greedy strategy addresses this by
choosing a random (exploratory) action with probability ε, and the greedy action otherwise.
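A minimal sketch of one ε-greedy decision with incremental sample-average estimates; the names Q, N, and bandit_pull are illustrative, not taken from this page's implementation.

```python
import numpy as np

def epsilon_greedy_step(Q, N, bandit_pull, epsilon, rng):
    """One decision: pick an arm, observe a reward, update that arm's estimate."""
    k = len(Q)
    if rng.random() < epsilon:
        a = int(rng.integers(k))   # explore: uniform random arm
    else:
        a = int(np.argmax(Q))      # exploit: arm with the highest current estimate
    r = bandit_pull(a)             # observe a reward for the chosen arm
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]      # incremental sample-average update
    return a, r
```

The incremental update keeps Q[a] equal to the average of all rewards observed so far for arm a, without storing the individual rewards.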
An example 10-armed bandit (cf. Sutton & Barto, Figure 2.1):
Each action yields rewards drawn from a Gaussian distribution centered at the true value q*(a). Dashed lines mark the means.
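A sketch of one such bandit instance, following the standard testbed convention (true values drawn from N(0, 1), rewards from N(q*(a), 1)); the unit variances are assumed from the book rather than read off this page.

```python
import numpy as np

class GaussianBandit:
    """k-armed testbed bandit: q*(a) ~ N(0, 1), rewards ~ N(q*(a), 1)."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values
        self.optimal = int(np.argmax(self.q_star))       # index of the best arm

    def pull(self, a):
        # Reward: the arm's true value plus unit-variance Gaussian noise.
        return self.rng.normal(self.q_star[a], 1.0)
```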
Experiment Configuration
Choose a suggested experiment or build your own:
Runs vs Steps: Each step is one decision the agent makes (pick an arm, get a reward, update estimates).
Each run creates a brand-new random bandit and lets the agent learn from scratch for the given number of steps.
The charts show results averaged across all runs for smoother, more reliable curves.
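Putting these pieces together, the runs-and-steps loop could look roughly like the sketch below, reusing the GaussianBandit and epsilon_greedy_step sketches from earlier; the run and step counts are placeholders, not this page's defaults.

```python
import numpy as np

def run_experiment(epsilon, runs=200, steps=1000, k=10, seed=0):
    rng = np.random.default_rng(seed)
    rewards = np.zeros((runs, steps))
    for run in range(runs):
        # A brand-new random bandit for every run; the agent starts from scratch.
        bandit = GaussianBandit(k=k, seed=int(rng.integers(2**31)))
        Q = np.zeros(k)  # action-value estimates, initialized to zero
        N = np.zeros(k)  # pull counts per arm
        for t in range(steps):
            a, r = epsilon_greedy_step(Q, N, bandit.pull, epsilon, rng)
            rewards[run, t] = r
    return rewards.mean(axis=0)  # average reward at each step, across runs
```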
Results
Average Reward is the mean reward received at each time step, averaged across all runs.
It shows how quickly and how well the agent learns to pick good actions.
A higher curve means the agent is earning more reward per step.
The dashed "optimal" line marks the expected reward from always choosing the best arm — no real agent can reach it immediately because it must first learn which arm is best.
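The level of that dashed line can be estimated directly: averaged over many sampled bandits, the expected reward of always playing the best arm is E[max_a q*(a)], which for the 10-armed N(0, 1) testbed is roughly 1.54. A quick check, assuming that testbed:

```python
import numpy as np

rng = np.random.default_rng(0)
# Value of the best arm in each of many sampled 10-armed bandits.
best_values = rng.normal(0.0, 1.0, size=(100_000, 10)).max(axis=1)
print(best_values.mean())  # ~1.54: expected reward of always playing the best arm
```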
% Optimal Action is the fraction of time steps on which the agent chose the true best arm, shown as a percentage.
It directly measures how well the agent has identified the optimal action.
A greedy agent (ε=0) often plateaus at a low percentage because it locks onto a suboptimal arm early and never discovers better ones.
An exploring agent climbs much higher as it gathers enough information to reliably pick the best arm, though a constant ε keeps the curve below 100%: on a fraction ε of steps it still picks an arm at random.
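One common way to compute this curve (the convention in Sutton & Barto's figures, assumed here) is, at each time step, the fraction of runs in which the chosen arm was that run's true best arm; a minimal sketch, assuming the chosen arms were recorded per run:

```python
import numpy as np

def percent_optimal(actions, optimal_arms):
    """actions: (runs, steps) chosen arm indices; optimal_arms: (runs,) best arm of each run's bandit."""
    hits = actions == optimal_arms[:, None]  # True wherever the run's best arm was chosen
    return 100.0 * hits.mean(axis=0)         # % of runs choosing the best arm, at each step
```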
Cumulative Regret is the total reward lost compared to always playing the optimal arm, summed from the start up to each time step.
At each step the regret is q*(best arm) − q*(chosen arm); the chart plots the running total.
A steeper slope means the agent is making worse choices at that point in time.
An ideal agent's regret curve flattens out as it learns; a stuck agent's regret keeps growing linearly.
Lower cumulative regret means the agent wasted less reward over the entire experiment.
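A sketch of that running total, assuming the true value of each chosen arm was recorded per run (the argument names are illustrative):

```python
import numpy as np

def cumulative_regret(q_star_best, q_star_chosen):
    """q_star_best: (runs,) value of each run's best arm.
    q_star_chosen: (runs, steps) true value of the arm chosen at each step."""
    per_step = q_star_best[:, None] - q_star_chosen  # instantaneous regret, always >= 0
    return per_step.cumsum(axis=1).mean(axis=0)      # running total, averaged across runs
```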
Sample Bandit Instance (last run)
True action values q*(a) for one sample bandit. The optimal arm is highlighted.