CartPole PPO vs A2C — In-Class Activity


Activity Overview

Time: ~37 minutes
Format: Pairs or small groups
Materials: Laptop with CartPole_PPO_A2C.html
Method: Predict → Experiment → Explain

Type your answers directly into the text boxes below. Your work is auto-saved to your browser. When finished, click "Copy All Answers" or "Download as Text" to submit.

Part 1 — The CartPole Environment (5 min)
1A — First Impressions (3 min)

Open CartPole_PPO_A2C.html. You see two CartPole environments side by side — one for A2C (red) and one for PPO (green). The cart can move left or right, and the pole must stay upright.

The state has four components: [x, ẋ, θ, θ̇] (position, velocity, angle, angular velocity). The agent gets +1 reward per step, up to 200.

  • Click "Step" once. Watch both poles — they will likely fall quickly.
  • After one episode, look at the CartPole animations. Did both poles fall? How many steps did each last?
  • The state has four variables. Which one do you think matters most for keeping the pole balanced? Why?
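The loop you are stepping through can be sketched in code. Below is a minimal, self-contained approximation of CartPole's physics (the classic Barto–Sutton dynamics with assumed constants; the browser simulation's exact parameters may differ):

```python
import math, random

# Assumed constants: gravity, cart mass, pole mass, pole half-length,
# push force, and integration timestep.
GRAVITY, M_CART, M_POLE, LENGTH, FORCE, DT = 9.8, 1.0, 0.1, 0.5, 10.0, 0.02

def step(state, action):
    """One Euler integration step. action: 0 = push left, 1 = push right."""
    x, x_dot, theta, theta_dot = state
    force = FORCE if action == 1 else -FORCE
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    total_mass = M_CART + M_POLE
    temp = (force + M_POLE * LENGTH * theta_dot**2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        LENGTH * (4.0 / 3.0 - M_POLE * cos_t**2 / total_mass))
    x_acc = temp - M_POLE * LENGTH * theta_acc * cos_t / total_mass
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)

def episode(policy, max_steps=200):
    """Run until the pole tips past 12 degrees, the cart leaves the
    track, or the 200-step cap is reached; return total reward."""
    state, reward = (0.0, 0.0, 0.05, 0.0), 0   # [x, x_dot, theta, theta_dot]
    for _ in range(max_steps):
        state = step(state, policy(state))
        x, _, theta, _ = state
        if abs(theta) > 12 * math.pi / 180 or abs(x) > 2.4:
            break
        reward += 1   # +1 per step survived, exactly as in the activity
    return reward

random.seed(0)
print("random policy:", episode(lambda s: random.randint(0, 1)), "steps")
```

A random policy typically lasts only a few dozen steps, which is why both poles fall quickly before any training.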
1B — Early Training (2 min)
  • Click "×100" to run 100 episodes.
  • Look at the statistics panels for both algorithms.
  Metric            A2C      PPO
  Avg Reward (50)   _____    _____
  Best Reward       _____    _____
  • Has either algorithm started learning after 100 episodes? How can you tell from the reward chart?
Part 2 — A2C: Watching the Actor and Critic Learn (8 min)
2A — A2C Learning Curve (3 min)
  • Click "×1K" to run 1000 episodes total.
  • Examine the Reward Comparison chart, focusing on the red A2C line.
  • Describe the shape of A2C's learning curve. Does the reward climb steadily, or does it fluctuate?
  • What is A2C's average reward now? Record it below.
  Metric              A2C after ~1000 episodes
  Avg Reward (50)     _____
  Success Rate (50)   _____
  • The advantage A = r + γV(s') − V(s) tells the actor whether an action was better or worse than expected. In the Update Details panel, are the advantage values mostly positive or negative? What does this mean?
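The advantage formula from the last bullet can be computed directly from the critic's estimates. The values below are invented for illustration; the real estimates come from the critic's network:

```python
def advantage(r, v_s, v_next, gamma=0.99, terminal=False):
    """TD-error advantage: A = r + gamma*V(s') - V(s).
    V(s') is taken as 0 at terminal states (no future reward
    once the pole has fallen)."""
    target = r + (0.0 if terminal else gamma * v_next)
    return target - v_s

# Critic expects ~50 more steps of reward from s, ~52 from s':
print(advantage(1.0, 50.0, 52.0))                 # ~2.48: better than expected
# Pole falls: future value vanishes, so the advantage is strongly negative:
print(advantage(1.0, 50.0, 0.0, terminal=True))   # -49.0: much worse than expected
```

Positive advantages push the actor toward the action it just took; negative advantages push it away.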
2B — A2C Stability (5 min)
  • Click "Reset", then run "×1K" again. Compare this run's A2C curve to what you saw before.
  • Do this 2–3 times. Pay attention to the Policy Stability chart (red line).
  • Is A2C's learning curve consistent across different runs, or does it vary significantly?
  • Look at the Policy Stability chart. The red A2C line shows the maximum policy ratio per episode. Do you see large spikes? What do spikes mean for learning?
  • Why might large policy changes (high ratios) be harmful for training? Think about what happens if the agent suddenly changes its strategy mid-training.
    Hint: consider what happens to the value estimates when the policy changes drastically.
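The ratio tracked by the Policy Stability chart is simply the new probability of the taken action divided by the old one. A sketch with made-up probabilities:

```python
def policy_ratio(p_new, p_old):
    """Ratio of new to old action probability; 1.0 means no change."""
    return p_new / p_old

# A modest update barely moves the probability of the taken action:
print(policy_ratio(0.55, 0.50))   # ~1.1: a small, safe step
# One oversized gradient step can swing it drastically — the kind of
# spike the red A2C line shows. The critic's value estimates were
# learned under the OLD policy, so after a jump like this they no
# longer describe the behavior they are evaluating:
print(policy_ratio(0.95, 0.50))   # ~1.9: the policy nearly doubled in one update
```

A2C applies no limit on this ratio, which is why its red line can spike far above 1.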
Part 3 — PPO: Clipping for Stability (8 min)
3A — PPO vs A2C Performance (3 min)
  • Click "Reset", then run "×1K".
  • Compare the Reward Comparison chart (red A2C vs green PPO) and the Policy Stability chart.
  • Compare the reward curves. Which algorithm reaches higher rewards faster? Which is smoother?
  • Compare the stability charts. How do the PPO ratio spikes compare to A2C? Does the green PPO line stay closer to the [1−ε, 1+ε] band?
3B — Understanding the Clipping Mechanism (5 min)
  • Click "Step" a few times. After each step, examine the PPO Clipping Detail chart and the PPO Last Update panel.
  • In the PPO Clipping Detail chart, are most dots green (within range) or red (clipped)? What does this tell you?
  • In the PPO Last Update panel, find a step where the ratio was clipped. What was the ratio value? What was the clipped value?
  • Explain in plain English what PPO's clipping does. Why does limiting the ratio to [1−ε, 1+ε] help?
    Hint: think of it as a "speed limit" on how fast the policy can change.
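The "speed limit" can be written in a few lines. This is a per-sample sketch of the standard clipped surrogate, L = min(r·A, clip(r, 1−ε, 1+ε)·A), not the simulation's exact code:

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO's per-sample surrogate: the minimum of the unclipped and
    clipped terms, so oversized policy changes earn no extra credit."""
    clipped = max(1 - eps, min(1 + eps, ratio))   # clamp to [1-eps, 1+eps]
    return min(ratio * advantage, clipped * advantage)

# Good action (A > 0) but the ratio already exceeds 1 + eps: the
# objective is capped at 1.2 * A, so there is no incentive to push
# the action's probability any further this update.
print(ppo_clipped_objective(1.9, 2.0))   # ~2.4, not 3.8
# Ratio inside the band: nothing is clipped.
print(ppo_clipped_objective(1.1, 2.0))   # ~2.2
```

Clipped samples show up as red dots in the PPO Clipping Detail chart; green dots are the unclipped case.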
Part 4 — Head-to-Head Comparison (8 min)
4A — Full Comparison (5 min)
  • Click "Reset", then run "×1K".
  • Record the final statistics for both algorithms.
  Metric              A2C      PPO
  Avg Reward (50)     _____    _____
  Best Reward         _____    _____
  Success Rate (50)   _____    _____
  • Which algorithm achieved a higher success rate? Which one first consistently reached reward 200?
4B — Consistency Across Runs (3 min)
  • Reset and re-run "×1K" two more times. Each time, note which algorithm performs better.
  • Does the same algorithm win every time? Or does the winner change between runs?
  • In real-world applications (robotics, game AI), why does consistency matter as much as peak performance?
Part 5 — Hyperparameter Experiments (5 min)
5A — PPO Clip Epsilon (3 min)

The PPO ε slider controls how much the policy can change in one update. Try three different values.

  • Reset. Set ε = 0.05 (tight clipping). Run "×1K". Record PPO avg reward.
  • Reset. Set ε = 0.20 (default). Run "×1K". Record.
  • Reset. Set ε = 0.40 (loose clipping). Run "×1K". Record.
  Metric                ε = 0.05   ε = 0.20   ε = 0.40
  PPO Avg Reward (50)   _____      _____      _____
  • How does tight clipping (ε = 0.05) compare to loose clipping (ε = 0.40)? Which learns faster? Which is more stable?
5B — Learning Rate (2 min)
  • Reset. Set ε back to 0.20. Set α_actor = 0.0100 (high). Run "×1K".
  • Reset. Set α_actor = 0.0001 (low). Run "×1K".
  • Describe the effect of high vs low learning rate on training speed and stability.
Part 6 — Reflection (3 min)
6A — Fill in the Blanks (2 min)
In an actor-critic method, the ______ learns the policy and the ______ learns the value function.
The ______ measures whether an action was better or worse than expected.
PPO clips the probability ratio to [1−ε, 1+ε] to prevent ______.
Compared to A2C, PPO typically shows ______ stable training because of ______.
PPO reuses each trajectory for ______ update epochs, improving sample efficiency.
6B — Connecting to the Textbook (1 min)
  • Chapter 13 introduces policy gradient methods. CartPole has a continuous 4D state space. Why is a policy gradient approach (with a neural network) more natural here than a tabular method like Q-learning?
    Hint: how many entries would a Q-table need for continuous states?
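One way to answer the hint: count the entries a discretized Q-table would need. The bin count below is an arbitrary assumption for illustration, not anything the simulation uses:

```python
# Even a coarse discretization of CartPole's continuous 4D state
# explodes the Q-table. Hypothetically using 20 bins per variable:
bins_per_var, state_vars, actions = 20, 4, 2
q_table_entries = bins_per_var ** state_vars * actions
print(q_table_entries)   # 320000 entries for a coarse grid
```

A policy network with a few dozen weights covers the same state space by generalizing between nearby states, with no table at all.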

Bonus Challenges (if time permits)

Your answers are auto-saved in your browser. Use the buttons above to export for submission.