In-Class Activity: 10-Armed Bandit


Activity Overview

Time: ~25 minutes
Format: Pairs or small groups
Materials: Laptop with KArmedBandits.html — all sections
Method: Predict → Experiment → Explain

Type your answers directly into the text boxes below. Your work is auto-saved to your browser. When finished, click "Copy All Answers" or "Download as Text" to submit.

10-Armed Bandit — Exploration vs Exploitation
1A — Predict Before You Play 3 min

Before touching the simulation, write down your predictions:

  • You face 10 slot machines with unknown payoffs. You have 2000 pulls total. Would you always pull the one that gave the highest reward so far, or sometimes try others? Why?
  • If you explore 10% of the time (ε = 0.1), roughly how many of your 2000 pulls will be random? What percentage of the time will you pick the best-known arm?
1B — Hands-On: The Interactive Demo 5 min

In the simulation's Interactive Demo section:

  • Click arms manually a few times. Note the rewards and how Q(a) updates.
  • Click "Pull Greedy" 20 times. Then click "New Bandit" and click "Pull ε-Greedy" 20 times.
  • When pulling greedy only, did the agent ever discover the true best arm (green star)? Why or why not?
  • When using ε-greedy, did it find the best arm more reliably? What's the trade-off?
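The rule behind the "Pull ε-Greedy" button can be sketched in a few lines. This is a minimal illustration, not the simulation's actual code; the names `Q` and `epsilon_greedy` are illustrative.

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a random arm (explore);
    otherwise pick the arm with the highest estimate (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(Q))            # explore
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit
```

With ε = 0 this reduces to pure greedy: it always returns the arm with the highest current estimate, whether or not that estimate is accurate.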
1C — The Exploration Experiment 5 min
  • Click preset "1. Explore or Not?" and click "Run Simulation". Wait for results.
  • Look at the "Average Reward" and "% Optimal Action" tabs.

Record the values at step 1000:

  Experiment         Avg Reward (step 1000)   % Optimal (step 1000)
  ε = 0 (greedy)     __________               __________
  ε = 0.01           __________               __________
  ε = 0.1            __________               __________
  • Which ε achieved the highest average reward by step 1000? Which achieved the highest % optimal action?
  • The greedy agent (ε = 0) gets stuck early. Explain why in terms of exploration vs. exploitation.
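The experiment you just ran can be approximated in code. The sketch below assumes the standard 10-armed testbed setup (true arm values drawn from N(0, 1), rewards with unit Gaussian noise, sample-average estimates); the function name and details are illustrative, not taken from the simulation.

```python
import random

def run_bandit(epsilon, steps=1000, k=10, seed=0):
    """Run one ε-greedy agent on a random k-armed bandit;
    return the fraction of pulls that chose the truly best arm."""
    rng = random.Random(seed)
    q_true = [rng.gauss(0, 1) for _ in range(k)]    # hidden true values
    best = max(range(k), key=lambda a: q_true[a])
    Q = [0.0] * k                                   # value estimates
    N = [0] * k                                     # pull counts
    optimal = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                    # explore
        else:
            a = max(range(k), key=lambda i: Q[i])   # exploit
        r = rng.gauss(q_true[a], 1)                 # noisy reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                   # sample-average update
        optimal += (a == best)
    return optimal / steps
```

Running it with ε = 0, 0.01, and 0.1 (averaged over many seeds, as the simulation does) reproduces the qualitative pattern you recorded in the table above.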
1D — Optimistic Initialization 5 min
  • Click preset "3. Optimistic Start" and run. Compare the curves.
  • The optimistic agent (Q1 = 5, ε = 0) starts greedy but still explores early on. Why does a high initial estimate cause exploration even without randomness?
  • Does the optimistic greedy agent eventually catch up with or beat the ε-greedy agent? At what step count roughly?
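Why optimism forces exploration can be seen in a short sketch. Assuming true values near 0 (so every real reward falls well below 5), each greedy pull drags that arm's estimate down below the untried arms, so pure greedy cycles through all of them. The variable names here are illustrative.

```python
import random

rng = random.Random(0)
k = 10
Q = [5.0] * k            # optimistic initial estimates (true values ~ N(0, 1))
N = [0] * k

tried = []
for _ in range(k):
    a = max(range(k), key=lambda i: Q[i])   # purely greedy, ε = 0
    r = rng.gauss(0, 1)                     # real reward is far below 5
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]               # estimate drops below 5
    tried.append(a)
# Each pull makes the pulled arm look worse than the untried ones,
# so after k greedy pulls every arm has been sampled.
```

The exploration is front-loaded: once all estimates settle near their true values, the agent is purely greedy again, which is why its curve starts poorly and then climbs.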
1E — Learning Rate 5 min
  • Click preset "4. Learning Rate" and run. Compare α = 0.1 vs α = 1/n (sample averaging).
  • Which learning rate (constant α = 0.1 or sample-average 1/n) performs better? Why might a constant step-size help when the true values could drift over time (non-stationary)?
  • Look at the "Cumulative Regret" tab. Which experiment accumulates regret fastest? Which is most efficient?
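The two update rules the preset compares differ only in the step size. A minimal sketch (function names are illustrative):

```python
def update_sample_average(Q, N, a, r):
    """1/n step size: every past reward weighs equally.
    Converges to the true mean if values never change (stationary)."""
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

def update_constant(Q, a, r, alpha=0.1):
    """Constant step size: recent rewards weigh exponentially more.
    Keeps tracking the mean if the true values drift (non-stationary)."""
    Q[a] += alpha * (r - Q[a])
```

With 1/n, the influence of each new reward shrinks over time; with constant α it never does, which is exactly what lets the estimate follow a moving target.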
1F — Reflection 2 min
  • In your own words, explain the exploration-exploitation dilemma. Give a real-world example outside of slot machines where this trade-off appears.

Bonus Challenge (if time permits)

Your answers are auto-saved in your browser. Use the buttons above to export for submission.