In-Class Activity: 10-Armed Bandit


Activity Overview

Time: ~25 minutes
Format: Pairs or small groups
Materials: Laptop with KArmedBandits.html — all sections
Method: Predict → Experiment → Explain

Type your answers directly into the text boxes below. Your work is auto-saved to your browser. When finished, click "Copy All Answers" or "Download as Text" to submit.

10-Armed Bandit — Exploration vs Exploitation
1A — Predict Before You Play 3 min

Before touching the simulation, write down your predictions:

  • You face 10 slot machines with unknown payoffs. You have 2000 pulls total. Would you always pull the one that gave the highest reward so far, or sometimes try others? Why?
  • If you explore 10% of the time (ε = 0.1), roughly how many of your 2000 pulls will be random? What percentage of the time will you pick the best-known arm?
1B — Hands-On: The Interactive Demo 5 min

In the simulation's Interactive Demo section:

  • Click arms manually a few times. Note the rewards and how Q(a) updates.
  • Click "Pull Greedy" 20 times. Then click "New Bandit" and click "Pull ε-Greedy" 20 times.
  • When pulling greedy only, did the agent ever discover the true best arm (green star)? Why or why not?
  • When using ε-greedy, did it find the best arm more reliably? What's the trade-off?
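The rule behind the "Pull ε-Greedy" button can be sketched in a few lines. This is a minimal illustration, not the simulation's actual code; the names `Q` and `epsilon_greedy` are illustrative.

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a random arm (explore);
    otherwise pick the arm with the highest estimate (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(Q))            # explore
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit
```

With ε = 0 this reduces to pure greedy: it always returns the arm with the highest current estimate, whether or not that estimate is accurate.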
1C — The Exploration Experiment 5 min
  • Click preset "1. Explore or Not?" and click "Run Simulation". Wait for results.
  • Look at the "Average Reward" and "% Optimal Action" tabs.

Record the values at step 1000:

  Experiment         Avg Reward (step 1000)   % Optimal (step 1000)
  ε = 0 (greedy)     __________               __________
  ε = 0.01           __________               __________
  ε = 0.1            __________               __________
  • Which ε achieved the highest average reward by step 1000? Which achieved the highest % optimal action?
  • The greedy agent (ε = 0) gets stuck early. Explain why in terms of exploration vs. exploitation.
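The experiment you just ran can be approximated in code. The sketch below assumes the standard 10-armed testbed setup (true arm values drawn from N(0, 1), rewards with unit Gaussian noise, sample-average estimates); the function name and details are illustrative, not taken from the simulation.

```python
import random

def run_bandit(epsilon, steps=1000, k=10, seed=0):
    """Run one ε-greedy agent on a random k-armed bandit;
    return the fraction of pulls that chose the truly best arm."""
    rng = random.Random(seed)
    q_true = [rng.gauss(0, 1) for _ in range(k)]    # hidden true values
    best = max(range(k), key=lambda a: q_true[a])
    Q = [0.0] * k                                   # value estimates
    N = [0] * k                                     # pull counts
    optimal = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                    # explore
        else:
            a = max(range(k), key=lambda i: Q[i])   # exploit
        r = rng.gauss(q_true[a], 1)                 # noisy reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                   # sample-average update
        optimal += (a == best)
    return optimal / steps
```

Running it with ε = 0, 0.01, and 0.1 (averaged over many seeds, as the simulation does) reproduces the qualitative pattern you recorded in the table above.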
1D — Optimistic Initialization 5 min
  • Click preset "3. Optimistic Start" and run. Compare the curves.
  • The optimistic agent (Q1 = 5, ε = 0) starts greedy but still explores early on. Why does a high initial estimate cause exploration even without randomness?
  • Does the optimistic greedy agent eventually catch up with or beat the ε-greedy agent? At what step count roughly?
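Why optimism forces exploration can be seen in a short sketch. Assuming true values near 0 (so every real reward falls well below 5), each greedy pull drags that arm's estimate down below the untried arms, so pure greedy cycles through all of them. The variable names here are illustrative.

```python
import random

rng = random.Random(0)
k = 10
Q = [5.0] * k            # optimistic initial estimates (true values ~ N(0, 1))
N = [0] * k

tried = []
for _ in range(k):
    a = max(range(k), key=lambda i: Q[i])   # purely greedy, ε = 0
    r = rng.gauss(0, 1)                     # real reward is far below 5
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]               # estimate drops below 5
    tried.append(a)
# Each pull makes the pulled arm look worse than the untried ones,
# so after k greedy pulls every arm has been sampled.
```

The exploration is front-loaded: once all estimates settle near their true values, the agent is purely greedy again, which is why its curve starts poorly and then climbs.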
1E — Learning Rate 5 min
  • Click preset "4. Learning Rate" and run. Compare α = 0.1 vs α = 1/n (sample averaging).
  • Which learning rate (constant α = 0.1 or sample-average 1/n) performs better? Why might a constant step-size help when the true values could drift over time (non-stationary)?
  • Look at the "Cumulative Regret" tab. Which experiment accumulates regret fastest? Which is most efficient?
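The two update rules the preset compares differ only in the step size. A minimal sketch (function names are illustrative):

```python
def update_sample_average(Q, N, a, r):
    """1/n step size: every past reward weighs equally.
    Converges to the true mean if values never change (stationary)."""
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

def update_constant(Q, a, r, alpha=0.1):
    """Constant step size: recent rewards weigh exponentially more.
    Keeps tracking the mean if the true values drift (non-stationary)."""
    Q[a] += alpha * (r - Q[a])
```

With 1/n, the influence of each new reward shrinks over time; with constant α it never does, which is exactly what lets the estimate follow a moving target.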
1F — Reflection 2 min
  • In your own words, explain the exploration-exploitation dilemma. Give a real-world example outside of slot machines where this trade-off appears.

Bonus Challenge (if time permits)

Your answers are auto-saved in your browser. Use the buttons above to export for submission.