Consider a repeated choice among k = 10 different actions. After each choice you receive a numerical reward
drawn from a stationary probability distribution that depends on the action you selected.
Your objective is to maximize the expected total reward over some time period, say 1000 time steps.
Each action has an expected (mean) reward — its value. We denote the action selected at time step t
as A_t and its reward as R_t. The value of an arbitrary action a is
q*(a) = E[R_t | A_t = a]. If we knew the value of each action, the problem would be trivial:
always pick the action with highest value. The challenge is that we don't know these values with certainty — we can only estimate them.
The exploration vs. exploitation dilemma is central: should we exploit what we already know (pick the current
best estimate) or explore other actions that might turn out to be better? The ε-greedy strategy addresses this by
choosing a random (exploratory) action with probability ε, and the greedy action otherwise.
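The selection rule can be written in a few lines. Below is a minimal Python sketch (the demo itself presumably runs in JavaScript; the function name `epsilon_greedy` is ours, not from the demo). Ties among greedy arms are broken at random, which matters early on when all estimates are equal:

```python
import random

def epsilon_greedy(q_estimates, epsilon, rng=random):
    """With probability epsilon pick a uniformly random arm (explore);
    otherwise pick an arm with the highest current estimate (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_estimates))
    best = max(q_estimates)
    # Break ties among equally good greedy arms at random.
    best_arms = [a for a, q in enumerate(q_estimates) if q == best]
    return rng.choice(best_arms)
```

With ε = 0 this reduces to the pure greedy agent; with ε = 0.1 roughly one step in ten is exploratory.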
An example 10-armed bandit (cf. Sutton & Barto, Figure 2.1) —
Each action yields rewards drawn from a Gaussian distribution centered at the true value q*(a). Dashed lines mark the means.
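A testbed bandit like this is simple to generate. In the sketch below (illustrative Python; `make_bandit` is our own name, and the standard-normal parameters follow the book's testbed), each true value q*(a) is drawn from N(0, 1), and each pull returns a reward from N(q*(a), 1):

```python
import random

def make_bandit(k=10, rng=random):
    """Build one random k-armed bandit: true values q*(a) ~ N(0, 1),
    rewards for pulling arm a ~ N(q*(a), 1)."""
    q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]
    def pull(a):
        return rng.gauss(q_star[a], 1.0)
    return q_star, pull
```

Averaging many pulls of one arm converges on that arm's true value, which is exactly what a sample-averaging agent exploits.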
Figure 2.2 — ε-Greedy on the 10-Armed Testbed (Interactive)
This reproduces the classic comparison from Sutton & Barto, Figure 2.2.
Three agents (ε=0, ε=0.01, ε=0.1) each play 1000 steps on the same sequence of random bandits and the results are averaged.
Adjust the number of runs for smoother curves (more runs = less noise). The chart auto-generates on page load.
Average Reward
% Optimal Action
How to read: The top chart shows average reward per step — higher is better.
The bottom chart shows how often the agent picks the truly best arm — closer to 100% is better.
The greedy agent (ε=0) gets stuck early,
while exploring agents steadily improve.
Click any legend label to show/hide that line.
Interactive Demo — Try It Yourself
Click any arm to pull it and see a reward. The agent tracks its estimated value Q(a) for each arm using sample averaging.
Can you figure out which arm is best? Try the strategy buttons to see how greedy and ε-greedy approaches compare.
The true best arm is highlighted green — but a real agent doesn't know this!
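The sample-average estimate Q(a) the demo maintains does not require storing past rewards; it can be updated incrementally. A minimal sketch (our naming, assuming the standard incremental form Q ← Q + (R − Q)/n):

```python
def update_estimate(q, n, reward):
    """Incremental sample average: Q_new = Q + (R - Q) / n,
    where n is the number of times this arm has now been pulled
    (including the pull that produced `reward`)."""
    return q + (reward - q) / n
```

Applied after each pull, this yields exactly the arithmetic mean of all rewards seen for that arm, in constant memory per arm.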
Experiment Configuration
Choose a suggested experiment or build your own:
Add custom experiment:
Active experiments:
No experiments configured. Choose a preset or add custom.
Runs vs Steps: Each step is one decision the agent makes (pick an arm, get a reward, update estimates).
Each run creates a brand-new random bandit and lets the agent learn from scratch for the given number of steps.
The charts show results averaged across all runs for smoother, more reliable curves.
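The run/step structure described above can be sketched end to end. This is an illustrative Python version, not the demo's actual code; function names and the choice of 500 steps below are ours:

```python
import random

def run_bandit(epsilon, steps=1000, k=10, rng=random):
    """One run: a fresh random bandit, one sample-average ε-greedy agent.
    Returns the per-step reward sequence."""
    q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]  # true values
    Q = [0.0] * k   # estimates
    N = [0] * k     # pull counts
    rewards = []
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)          # explore
        else:
            a = Q.index(max(Q))           # exploit
        r = rng.gauss(q_star[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]         # incremental sample average
        rewards.append(r)
    return rewards

def average_over_runs(epsilon, runs=200, steps=1000, seed=0):
    """Average the per-step reward across many independent runs,
    each on a brand-new random bandit."""
    rng = random.Random(seed)
    totals = [0.0] * steps
    for _ in range(runs):
        for t, r in enumerate(run_bandit(epsilon, steps, rng=rng)):
            totals[t] += r
    return [x / runs for x in totals]
```

Averaging across runs smooths out the noise from individual bandits, which is why more runs produce cleaner curves.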
Results
Average Reward is the mean reward received at each time step, averaged across all runs.
It shows how quickly and how well the agent learns to pick good actions.
A higher curve means the agent is earning more reward per step.
The dashed "optimal" line marks the expected reward from always choosing the best arm — no real agent can reach it immediately because it must first learn which arm is best.
% Optimal Action is the fraction of time steps on which the agent chose the true best arm, shown as a percentage.
It directly measures how well the agent has identified the optimal action.
A greedy agent (ε=0) often plateaus at a low percentage because it locks onto a suboptimal arm early and never discovers better ones.
An exploring agent rises toward 100% as it gathers enough information to reliably pick the best arm.
Cumulative Regret is the total reward lost compared to always playing the optimal arm, summed from the start up to each time step.
At each step the regret is q*(best arm) − q*(chosen arm); the chart plots the running total.
A steeper slope means the agent is making worse choices at that point in time.
An ideal agent's regret curve flattens out as it learns; a stuck agent's regret keeps growing linearly.
Lower cumulative regret means the agent wasted less reward over the entire experiment.
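The regret computation above is a simple running sum over the chosen-arm sequence. A minimal sketch (our naming; it uses the true values q*, which the plotting code knows but the agent does not):

```python
def cumulative_regret(q_star, chosen_arms):
    """Running total of q*(best arm) - q*(chosen arm) at each step."""
    best = max(q_star)
    total, curve = 0.0, []
    for a in chosen_arms:
        total += best - q_star[a]
        curve.append(total)
    return curve
```

Steps where the agent picks the optimal arm add zero regret, so a flat stretch in the curve means the agent was choosing optimally during those steps.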
Sample Bandit Instance (last run)
True action values q*(a) for one sample bandit. The optimal arm is highlighted.
Reference
This simulation reproduces the 10-armed testbed experiment from Chapter 2 of:
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2nd edition, MIT Press, 2018.
Book website | Full PDF (free)
See especially Section 2.3 (Figure 2.2) for the ε-greedy comparison, Section 2.6 for optimistic initial values, and Section 2.4 for the effect of the step-size parameter α.