CartPole PPO vs A2C — In-Class Activity
Activity Overview
Time: ~37 minutes
Format: Pairs or small groups
Materials: Laptop with CartPole_PPO_A2C.html
Method: Predict → Experiment → Explain
Type your answers directly into the text boxes below. Your work is auto-saved to your browser. When finished, click "Copy All Answers" or "Download as Text" to submit.
Part 1 — The CartPole Environment (5 min)
1A — First Impressions
3 min
Open CartPole_PPO_A2C.html. You see two CartPole environments side by side — one for A2C (red) and one for PPO (green). The cart can move left or right, and the pole must stay upright.
The state has four components: [x, ẋ, θ, θ̇] (position, velocity, angle, angular velocity). The agent gets +1 reward per step, up to 200.
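To make the four state variables concrete, here is a minimal sketch of the classic cart-pole dynamics (Barto, Sutton & Anderson, 1983) that simulations like this one typically integrate on each "Step". The constants below are the standard Gym/Gymnasium defaults and are an assumption — this page's simulation may use different values.

```python
import math

# Standard cart-pole constants (assumed; the simulation's may differ)
GRAVITY, M_CART, M_POLE = 9.8, 1.0, 0.1
HALF_LEN, FORCE_MAG, TAU = 0.5, 10.0, 0.02  # half pole length, push force, timestep

def step(state, action):
    """Advance [x, x_dot, theta, theta_dot] by one Euler step.
    action: 0 = push left, 1 = push right."""
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    total_mass = M_CART + M_POLE
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + M_POLE * HALF_LEN * theta_dot**2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LEN * (4.0 / 3.0 - M_POLE * cos_t**2 / total_mass))
    x_acc = temp - M_POLE * HALF_LEN * theta_acc * cos_t / total_mass
    return [x + TAU * x_dot, x_dot + TAU * x_acc,
            theta + TAU * theta_dot, theta_dot + TAU * theta_acc]

def done(state):
    # Episode ends when the cart leaves the track or the pole tips too far.
    x, _, theta, _ = state
    return abs(x) > 2.4 or abs(theta) > 12 * math.pi / 180
```

Note that pushing the cart right makes the pole rotate left (and vice versa) — this coupling is why balancing requires constant correction.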
- Click "Step" once. Watch both poles — they will likely fall quickly.
- After one episode, look at the CartPole animations. Did both poles fall? How many steps did each last?
- The state has four variables. Which one do you think matters most for keeping the pole balanced? Why?
1B — Early Training
2 min
- Click "×100" to run 100 episodes.
- Look at the statistics panels for both algorithms.
| Metric | A2C | PPO |
|---|---|---|
| Avg Reward (50) | | |
| Best Reward | | |
- Has either algorithm started learning after 100 episodes? How can you tell from the reward chart?
Part 2 — A2C: Watching the Actor and Critic Learn (8 min)
2A — A2C Learning Curve
3 min
- Click "×1K" to run 1000 episodes total.
- Examine the Reward Comparison chart, focusing on the red A2C line.
- Describe the shape of A2C's learning curve. Does the reward climb steadily, or does it fluctuate?
- What is A2C's average reward now? Record it below.
| Metric | A2C after ~1000 episodes |
|---|---|
| Avg Reward (50) | |
| Success Rate (50) | |
- The advantage A = r + γV(s') − V(s) tells the actor whether an action was better or worse than expected. In the Update Details panel, are the advantage values mostly positive or negative? What does this mean?
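The one-step advantage from the question above can be written as a few lines of Python. This is an illustrative sketch of the formula, not the simulation's actual code:

```python
def advantage(r, v_s, v_next, gamma=0.99, terminal=False):
    """One-step TD advantage: A = r + gamma * V(s') - V(s).
    Positive A: the action turned out better than the critic predicted,
    so the actor raises its probability; negative A lowers it.
    On a terminal step there is no next state, so the target is just r."""
    target = r if terminal else r + gamma * v_next
    return target - v_s
```

For example, with r = 1, V(s) = 5, V(s') = 5 and γ = 0.99, the advantage is 1 + 0.99·5 − 5 = 0.95: the action did slightly better than expected.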
2B — A2C Stability
5 min
- Click "Reset", then run "×1K" again. Compare this run's A2C curve to what you saw before.
- Do this 2–3 times. Pay attention to the Policy Stability chart (red line).
- Is A2C's learning curve consistent across different runs, or does it vary significantly?
- Look at the Policy Stability chart. The red A2C line shows the maximum policy ratio per episode. Do you see large spikes? What do spikes mean for learning?
- Why might large policy changes (high ratios) be harmful for training? Think about what happens if the agent suddenly changes its strategy mid-training.
Hint: consider what happens to the value estimates when the policy changes drastically.
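The quantity the Policy Stability chart tracks is the probability ratio π_new(a|s) / π_old(a|s). A rough sketch (the per-episode aggregation is an assumption based on this worksheet's description):

```python
def max_policy_ratio(old_probs, new_probs):
    """Largest per-action probability ratio after an update -- roughly what
    this worksheet says the red A2C stability line plots per episode.
    Values far from 1.0 mean the update drastically reweighted some action."""
    return max(new / old for old, new in zip(old_probs, new_probs))

# A2C places no constraint on this ratio, so one large gradient step can
# multiply an action's probability several-fold in a single update:
# max_policy_ratio([0.10, 0.50], [0.55, 0.40]) -> 5.5
```

A ratio of 5.5 means the policy became 5.5× more likely to take that action after a single update — exactly the kind of sudden strategy change the question above asks about.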
Part 3 — PPO: Clipping for Stability (8 min)
3A — PPO vs A2C Performance
3 min
- Click "Reset", then run "×1K".
- Compare the Reward Comparison chart (red A2C vs green PPO) and the Policy Stability chart.
- Compare the reward curves. Which algorithm reaches higher rewards faster? Which is smoother?
- Compare the stability charts. How do the PPO ratio spikes compare to A2C? Does the green PPO line stay closer to the [1−ε, 1+ε] band?
3B — Understanding the Clipping Mechanism
5 min
- Click "Step" a few times. After each step, examine the PPO Clipping Detail chart and the PPO Last Update panel.
- In the PPO Clipping Detail chart, are most dots green (within range) or red (clipped)? What does this tell you?
- In the PPO Last Update panel, find a step where the ratio was clipped. What was the ratio value? What was the clipped value?
- Explain in plain English what PPO's clipping does. Why does limiting the ratio to [1−ε, 1+ε] help?
Hint: think of it as a "speed limit" on how fast the policy can change.
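The "speed limit" in the hint is PPO's clipped surrogate objective, L = min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio and A the advantage. A minimal sketch:

```python
def ppo_clipped_objective(ratio, adv, eps=0.2):
    """PPO's clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).
    Taking the min makes clipping one-sided: the objective stops rewarding
    ratio moves beyond the [1-eps, 1+eps] band, but never hides a loss."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)
```

With ε = 0.2 and a positive advantage, pushing the ratio from 1.2 to 2.0 yields no extra objective (both give 1.2·A), so the gradient incentive to overshoot vanishes — that is the speed limit.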
Part 4 — Head-to-Head Comparison (8 min)
4A — Full Comparison
5 min
- Click "Reset", then run "×1K".
- Record the final statistics for both algorithms.
| Metric | A2C | PPO |
|---|---|---|
| Avg Reward (50) | ||
| Best Reward | ||
| Success Rate (50) |
- Which algorithm achieved a higher success rate? Which one first consistently reached reward 200?
4B — Consistency Across Runs
3 min
- Reset and re-run "×1K" two more times. Each time, note which algorithm performs better.
- Does the same algorithm win every time? Or does the winner change between runs?
- In real-world applications (robotics, game AI), why does consistency matter as much as peak performance?
Part 5 — Hyperparameter Experiments (5 min)
5A — PPO Clip Epsilon
3 min
The PPO ε slider controls how much the policy can change in one update. Try three different values.
- Reset. Set ε = 0.05 (tight clipping). Run "×1K". Record PPO avg reward.
- Reset. Set ε = 0.20 (default). Run "×1K". Record.
- Reset. Set ε = 0.40 (loose clipping). Run "×1K". Record.
| Metric | ε = 0.05 | ε = 0.20 | ε = 0.40 |
|---|---|---|---|
| PPO Avg Reward (50) | | | |
- How does tight clipping (ε = 0.05) compare to loose clipping (ε = 0.40)? Which learns faster? Which is more stable?
5B — Learning Rate
2 min
- Reset. Set ε back to 0.20. Set α_actor = 0.0100 (high). Run "×1K".
- Reset. Set α_actor = 0.0001 (low). Run "×1K".
- Describe the effect of high vs low learning rate on training speed and stability.
Part 6 — Reflection (3 min)
6A — Fill in the Blanks
2 min
In an actor-critic method, the ________ learns the policy and the ________ learns the value function.
The ________ measures whether an action was better or worse than expected.
PPO clips the probability ratio to [1−ε, 1+ε] to prevent ________.
Compared to A2C, PPO typically shows ________ stable training because of ________.
PPO reuses each trajectory for ________ update epochs, improving sample efficiency.
6B — Connecting to the Textbook
1 min
- Chapter 13 introduces policy gradient methods. CartPole has a continuous 4D state space. Why is a policy gradient approach (with a neural network) more natural here than a tabular method like Q-learning?
Hint: how many entries would a Q-table need for continuous states?
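To see the scale of the problem in the hint, try the arithmetic for even a coarse discretization. The bin count below is an arbitrary assumption for illustration:

```python
# Even a coarse Q-table over CartPole's continuous 4D state explodes in size.
bins_per_dim = 20                    # assumed: 20 bins each for x, x_dot, theta, theta_dot
n_actions = 2                        # left, right
table_entries = bins_per_dim**4 * n_actions
print(table_entries)                 # 320000 entries, before any finer resolution
```

A policy network with a few thousand weights covers the same space while generalizing between nearby states, which a table cannot do.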
Bonus Challenges (if time permits)
- Set PPO epochs to 1 (so PPO only updates once per trajectory, like A2C). Run ×1K. How does PPO-with-1-epoch compare to A2C? What does this tell you about the importance of multiple epochs?
- After training for ×1K episodes, step through a few episodes where the agent succeeds (reward = 200). Describe the agent's balancing strategy — does it keep the cart centered, or does it drift?
Your answers are auto-saved in your browser. Use the buttons above to export for submission.