CartPole PPO vs A2C — In-Class Activity
Activity Overview
Time: ~37 minutes
Format: Pairs or small groups
Materials: Laptop with CartPole_PPO_A2C.html
Method: Predict → Experiment → Explain
Type your answers directly into the text boxes below. Your work is auto-saved to your browser. When finished, click "Copy All Answers" or "Download as Text" to submit.
Part 1 — The CartPole Environment (5 min)
1A — First Impressions
3 min
Open CartPole_PPO_A2C.html. You see two CartPole environments side by side — one for A2C (red) and one for PPO (green). The cart can move left or right, and the pole must stay upright.
The state has four components: [x, ẋ, θ, θ̇] (position, velocity, angle, angular velocity). The agent gets +1 reward per step, up to 200.
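To make the four state variables concrete, here is a minimal sketch of the classic cart-pole dynamics (Barto, Sutton & Anderson, 1983) that simulations like this one typically integrate on each "Step". The constants below are the standard Gym/Gymnasium defaults and are an assumption — this page's simulation may use different values.

```python
import math

# Standard cart-pole constants (assumed; the simulation's may differ)
GRAVITY, M_CART, M_POLE = 9.8, 1.0, 0.1
HALF_LEN, FORCE_MAG, TAU = 0.5, 10.0, 0.02  # half pole length, push force, timestep

def step(state, action):
    """Advance [x, x_dot, theta, theta_dot] by one Euler step.
    action: 0 = push left, 1 = push right."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    total_mass = M_CART + M_POLE
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + M_POLE * HALF_LEN * theta_dot**2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LEN * (4.0 / 3.0 - M_POLE * cos_t**2 / total_mass))
    x_acc = temp - M_POLE * HALF_LEN * theta_acc * cos_t / total_mass
    return [x + TAU * x_dot, x_dot + TAU * x_acc,
            theta + TAU * theta_dot, theta_dot + TAU * theta_acc]

def done(state):
    # Episode ends when the cart leaves the track or the pole tips too far.
    x, _, theta, _ = state
    return abs(x) > 2.4 or abs(theta) > 12 * math.pi / 180
```

Note that pushing the cart right makes the pole rotate left (and vice versa) — this coupling is why balancing requires constant correction.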
- Click "Step" once. Watch both poles — they will likely fall quickly.
- After one episode, look at the CartPole animations. Did both poles fall? How many steps did each last?
- The state has four variables. Which one do you think matters most for keeping the pole balanced? Why?
1B — Early Training
2 min
- Click "×100" to run 100 episodes.
- Look at the statistics panels for both algorithms.
| Metric | A2C | PPO |
|---|---|---|
| Avg Reward (50) | | |
| Best Reward | | |
- Has either algorithm started learning after 100 episodes? How can you tell from the reward chart?
Part 2 — A2C: Watching the Actor and Critic Learn (8 min)
2A — A2C Learning Curve
3 min
- Click "×1K" to run 1000 episodes total.
- Examine the Reward Comparison chart, focusing on the red A2C line.
- Describe the shape of A2C's learning curve. Does the reward climb steadily, or does it fluctuate?
- What is A2C's average reward now? Record it below.
| Metric | A2C after ~1000 episodes |
|---|---|
| Avg Reward (50) | |
| Success Rate (50) | |
- The advantage A = r + γV(s') − V(s) tells the actor whether an action was better or worse than expected. In the Update Details panel, are the advantage values mostly positive or negative? What does this mean?
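The one-step advantage from the question above can be written as a few lines of Python. This is an illustrative sketch of the formula, not the simulation's actual code:

```python
def advantage(r, v_s, v_next, gamma=0.99, terminal=False):
    """One-step TD advantage: A = r + gamma * V(s') - V(s).
    Positive A: the action turned out better than the critic predicted,
    so the actor raises its probability; negative A lowers it.
    On a terminal step there is no next state, so the target is just r."""
    target = r if terminal else r + gamma * v_next
    return target - v_s
```

For example, with r = 1, V(s) = 5, V(s') = 5 and γ = 0.99, the advantage is 1 + 0.99·5 − 5 = 0.95: the action did slightly better than expected.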
2B — A2C Stability
5 min
- Click "Reset", then run "×1K" again. Compare this run's A2C curve to what you saw before.
- Do this 2–3 times. Pay attention to the Policy Stability chart (red line).
- Is A2C's learning curve consistent across different runs, or does it vary significantly?
- Look at the Policy Stability chart. The red A2C line shows the maximum policy ratio per episode. Do you see large spikes? What do spikes mean for learning?
- Why might large policy changes (high ratios) be harmful for training? Think about what happens if the agent suddenly changes its strategy mid-training.
Hint: consider what happens to the value estimates when the policy changes drastically.
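The quantity the Policy Stability chart tracks is the probability ratio π_new(a|s) / π_old(a|s). A rough sketch (the per-episode aggregation is an assumption based on this worksheet's description):

```python
def max_policy_ratio(old_probs, new_probs):
    """Largest per-action probability ratio after an update -- roughly what
    this worksheet says the red A2C stability line plots per episode.
    Values far from 1.0 mean the update drastically reweighted some action."""
    return max(new / old for old, new in zip(old_probs, new_probs))

# A2C places no constraint on this ratio, so one large gradient step can
# multiply an action's probability several-fold in a single update:
# max_policy_ratio([0.10, 0.50], [0.55, 0.40]) -> 5.5
```

A ratio of 5.5 means the policy became 5.5× more likely to take that action after a single update — exactly the kind of sudden strategy change the question above asks about.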
Part 3 — PPO: Clipping for Stability (8 min)
3A — PPO vs A2C Performance
3 min
- Click "Reset", then run "×1K".
- Compare the Reward Comparison chart (red A2C vs green PPO) and the Policy Stability chart.
- Compare the reward curves. Which algorithm reaches higher rewards faster? Which is smoother?
- Compare the stability charts. How do the PPO ratio spikes compare to A2C? Does the green PPO line stay closer to the [1−ε, 1+ε] band?
3B — Understanding the Clipping Mechanism
5 min
- Click "Step" a few times. After each step, examine the PPO Clipping Detail chart and the PPO Last Update panel.
- In the PPO Clipping Detail chart, are most dots green (within range) or red (clipped)? What does this tell you?
- In the PPO Last Update panel, find a step where the ratio was clipped. What was the ratio value? What was the clipped value?
- Explain in plain English what PPO's clipping does. Why does limiting the ratio to [1−ε, 1+ε] help?
Hint: think of it as a "speed limit" on how fast the policy can change.
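The "speed limit" in the hint is PPO's clipped surrogate objective, L = min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio and A the advantage. A minimal sketch:

```python
def ppo_clipped_objective(ratio, adv, eps=0.2):
    """PPO's clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).
    Taking the min makes clipping one-sided: the objective stops rewarding
    ratio moves beyond the [1-eps, 1+eps] band, but never hides a loss."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)
```

With ε = 0.2 and a positive advantage, pushing the ratio from 1.2 to 2.0 yields no extra objective (both give 1.2·A), so the gradient incentive to overshoot vanishes — that is the speed limit.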
Part 4 — Head-to-Head Comparison (8 min)
4A — Full Comparison
5 min
- Click "Reset", then run "×1K".
- Record the final statistics for both algorithms.
| Metric | A2C | PPO |
|---|---|---|
| Avg Reward (50) | ||
| Best Reward | ||
| Success Rate (50) |
- Which algorithm achieved a higher success rate? Which one first consistently reached reward 200?
4B — Consistency Across Runs
3 min
- Reset and re-run "×1K" two more times. Each time, note which algorithm performs better.
- Does the same algorithm win every time? Or does the winner change between runs?
- In real-world applications (robotics, game AI), why does consistency matter as much as peak performance?
Part 5 — Hyperparameter Experiments (5 min)
5A — PPO Clip Epsilon
3 min
The PPO ε slider controls how much the policy can change in one update. Try three different values.
- Reset. Set ε = 0.05 (tight clipping). Run "×1K". Record PPO avg reward.
- Reset. Set ε = 0.20 (default). Run "×1K". Record.
- Reset. Set ε = 0.40 (loose clipping). Run "×1K". Record.
| Metric | ε = 0.05 | ε = 0.20 | ε = 0.40 |
|---|---|---|---|
| PPO Avg Reward (50) | | | |
- How does tight clipping (ε = 0.05) compare to loose clipping (ε = 0.40)? Which learns faster? Which is more stable?
5B — Learning Rate
2 min
- Reset. Set ε back to 0.20. Set α_actor = 0.0100 (high). Run "×1K".
- Reset. Set α_actor = 0.0001 (low). Run "×1K".
- Describe the effect of high vs low learning rate on training speed and stability.
Part 6 — Reflection (3 min)
6A — Fill in the Blanks
2 min
In an actor-critic method, the ________ learns the policy and the ________ learns the value function.
The ________ measures whether an action was better or worse than expected.
PPO clips the probability ratio to [1−ε, 1+ε] to prevent ________.
Compared to A2C, PPO typically shows ________ stable training because of ________.
PPO reuses each trajectory for ________ update epochs, improving sample efficiency.
6B — Connecting to the Textbook
1 min
- Chapter 13 introduces policy gradient methods. CartPole has a continuous 4D state space. Why is a policy gradient approach (with a neural network) more natural here than a tabular method like Q-learning?
Hint: how many entries would a Q-table need for continuous states?
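To see the scale of the problem in the hint, try the arithmetic for even a coarse discretization. The bin count below is an arbitrary assumption for illustration:

```python
# Even a coarse Q-table over CartPole's continuous 4D state explodes in size.
bins_per_dim = 20                    # assumed: 20 bins each for x, x_dot, theta, theta_dot
n_actions = 2                        # left, right
table_entries = bins_per_dim**4 * n_actions
print(table_entries)                 # 320000 entries, before any finer resolution
```

A policy network with a few thousand weights covers the same space while generalizing between nearby states, which a table cannot do.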
Bonus Challenges (if time permits)
- Set PPO epochs to 1 (so PPO only updates once per trajectory, like A2C). Run ×1K. How does PPO-with-1-epoch compare to A2C? What does this tell you about the importance of multiple epochs?
- After training for ×1K episodes, step through a few episodes where the agent succeeds (reward = 200). Describe the agent's balancing strategy — does it keep the cart centered, or does it drift?
Your answers are auto-saved in your browser. Use the buttons above to export for submission.