How reinforcement learning is structured to find Hamiltonian cycles in the m-ary cube digraph — and experimental results for m=3.
Companion page to Claude's Cycles |
Training script: rl_hamiltonian.py |
W&B: gitcg/rl-hamiltonian
Consider a directed graph (digraph) whose vertices are tuples (i, j, k) with each coordinate in {0, 1, …, m−1}. There are m³ vertices total. Each vertex has exactly 3 outgoing arcs, called bumps: incrementing i, j, or k by 1 (mod m). Claude proved analytically that this digraph can always be decomposed into exactly 3 edge-disjoint Hamiltonian cycles.
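The digraph's structure can be sketched in a few lines (a minimal illustration, not taken from rl_hamiltonian.py):

```python
from itertools import product

def bump_neighbors(v, m):
    """Return the 3 out-neighbors of vertex v = (i, j, k): bump one coordinate mod m."""
    return [tuple((c + (1 if idx == axis else 0)) % m for idx, c in enumerate(v))
            for axis in range(3)]

m = 3
vertices = list(product(range(m), repeat=3))
assert len(vertices) == m**3                  # 27 vertices for m=3
for v in vertices:
    assert len(set(bump_neighbors(v, m))) == 3  # exactly 3 distinct outgoing arcs
```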
The RL exercise asks a different question: can a learned agent rediscover a Hamiltonian cycle purely through trial and error?
The problem is cast as a Markov Decision Process (MDP) with the following components:
A state consists of three components:
- the current vertex (i, j, k),
- the set of vertices visited so far,
- the step count.
The full state space is enormous. The visited-set component alone has 2^(m³) possible configurations. For m=3, that is 2²⁷ ≈ 134 million visited-set states, each combined with 27 possible current vertices. This combinatorial explosion is the central difficulty.
At every step the agent chooses one of 3 actions:
- bump i: increment i by 1 (mod m),
- bump j: increment j by 1 (mod m),
- bump k: increment k by 1 (mod m).
The action space is fixed at {0, 1, 2} regardless of m. This is unusually small — the difficulty comes entirely from the state space, not the action space.
| Event | Reward | Effect |
|---|---|---|
| Visit a new vertex | +1 | Episode continues |
| Complete a Hamiltonian cycle (return to start after visiting all m³ vertices) | +100 | Episode ends (success) |
| Revisit a vertex | −10 | Episode ends (failure) |
The reward is sparse: the large +100 bonus only arrives if the agent perfectly sequences all m³ steps without revisiting any vertex. For m=3 the agent must make 27 correct decisions in a row; for m=5, it must make 125.
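The state, action, and reward logic above fits in a short environment class. This is a minimal sketch of the episode mechanics, not the actual rl_hamiltonian.py implementation (which also builds the 32-dim observation and action mask):

```python
class CycleEnv:
    """Minimal sketch of the episode logic: visit all m**3 vertices, return to start."""

    def __init__(self, m=3, start=(0, 0, 0)):
        self.m, self.start = m, start
        self.reset()

    def reset(self):
        self.v = self.start
        self.visited = {self.start}
        return self.v

    def step(self, action):              # action in {0, 1, 2}: bump i, j, or k
        nxt = list(self.v)
        nxt[action] = (nxt[action] + 1) % self.m
        nxt = tuple(nxt)
        if nxt == self.start and len(self.visited) == self.m ** 3:
            return nxt, +100, True       # Hamiltonian cycle completed: success
        if nxt in self.visited:
            return nxt, -10, True        # revisit: episode fails
        self.visited.add(nxt)
        self.v = nxt
        return nxt, +1, False            # new vertex: episode continues
```

For example, bumping i three times from (0, 0, 0) with m=3 returns to the start after only 3 of the 27 vertices, so the third step terminates the episode with the −10 revisit penalty.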
We tested four algorithms on the m=3 case. All share the same environment, observation space (32-dim: coordinates, step count, fiber index, visited mask), reward function, and action masking. Seed: 42.
| Algorithm | Episodes | Success Rate | First Solve | Runtime | W&B |
|---|---|---|---|---|---|
| REINFORCE | 10,000 | 1.16% | ~ep 150 | 32s | ssihf20k |
| PPO | 10,000 | 97.25% | ~ep 50 | 42s | 1q21fgzy |
| GNN-PPO | 5,000 | 99.42% | ~ep 10 | 203s | xsw7g6da |
| MCTS | 200 | 26.5% | ~ep 2 | 6s | ny82u4h3 |
Both use the same MLP architecture (32 → 128 → 128 → 3), but PPO's clipped objective and entropy bonus completely prevent the mode collapse that killed REINFORCE. PPO reached 97.25% success rate vs. REINFORCE's 1.16%.
The graph neural network policy (2-layer message-passing GNN with 64-dim messages) exploits the regular structure of the m-ary cube. It converged to near-perfect performance within ~100 episodes — the fastest among all learned policies. Only 29 failures in 5,000 episodes.
Pure tree search (200 simulations/move, UCB1 selection) achieves 26.5% per-episode success with no training at all. It is expensive per episode but has zero training instability. Useful as a non-learned baseline.
REINFORCE found 116 cycles in episodes 1–1,100, then converged to a deterministic 24-step suboptimal path and never found another cycle for the remaining 8,900 episodes. This is the classic high-variance policy gradient failure mode: without entropy regularization, the policy becomes too confident too early.
We ran REINFORCE with three observation/masking configurations. Action masking — zeroing out the probability of actions that lead to already-visited vertices — was the single most impactful design decision.
| Variant | Obs Dim | Action Masking | Best Path | Successes |
|---|---|---|---|---|
| No visited mask | 5 | No | 23/27 | 0 |
| With visited mask | 32 | No | 23/27 | 0 |
| With mask + action masking | 32 | Yes | 27/27 | 116 |
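Action masking can be sketched as a post-processing step on the policy's output distribution. This is an illustrative version, not the actual implementation; note that the one legal revisit (returning to the start on the final step) would need special handling, which this sketch covers only with a fallback:

```python
import numpy as np

def mask_policy(probs, v, visited, m):
    """Zero the probability of any bump that leads to an already-visited vertex,
    then renormalize. If every move is illegal (e.g., the closing move back to the
    start), fall back to the unmasked distribution."""
    masked = probs.copy()
    for a in range(3):
        nxt = list(v)
        nxt[a] = (nxt[a] + 1) % m
        if tuple(nxt) in visited:
            masked[a] = 0.0
    total = masked.sum()
    return masked / total if total > 0 else probs
```

With a uniform policy at (0, 0, 0) and (1, 0, 0) already visited, the mass on "bump i" is redistributed to the two remaining actions.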
Moving from m=3 to m=5 increases the cycle length from 27 to 125 steps and the observation dimension from 32 to 130 (5 features + 125-bit visited mask). The theoretical state space explodes from 2²⁷ to 2¹²⁵ — a factor of ~10²⁹ larger. Can the same algorithms still find Hamiltonian cycles?
| Algorithm | Episodes | Success Rate | First Solve | Runtime | W&B |
|---|---|---|---|---|---|
| PPO | 20,000 | 99.48% | ~ep 100 | ~5 min | 7cp4e667 |
| REINFORCE | 20,000 | 41.83% | ~ep 350 | ~8 min | jpucotko |
| GNN-PPO | 10,000 | DNF (too slow) | — | >8 min/ep | c9tzft69 |
| MCTS | — | Not run | — | — | — |
PPO achieved 99.48% success rate on 125 vertices — nearly identical to its 97.25% on m=3. First Hamiltonian cycle found around episode 100, near-perfect convergence by episode 200. The simple MLP actor-critic with action masking handles the 4.6× longer episode without difficulty.
REINFORCE exhibited a striking three-phase training profile on m=5:
- Plateau (roughly episodes 1–11,700): locked into a near-optimal but incorrect path, succeeding on only ~0.6% of episodes.
- Transition (~episode 11,800): an abrupt shift to a correct policy.
- Convergence (remaining episodes): essentially 100% success.
This "sudden phase transition" is a remarkable example of how REINFORCE can spend thousands of episodes in a near-optimal but incorrect mode, then abruptly shift to the correct policy. The overall 41.83% rate is misleading — it was effectively 0.6% for 11,700 episodes, then 100%.
GNN-PPO could not complete a single episode — the per-node message-passing loop with 125 nodes and 750 edges (bidirectional) takes >8 minutes per episode in Python. A sparse/batched implementation (e.g., PyTorch Geometric) would be needed. MCTS was not attempted: random rollouts on 125 vertices have near-zero probability of finding a Hamiltonian cycle without a learned value function.
| Metric | m=3 (27 vertices) | m=5 (125 vertices) | Growth factor |
|---|---|---|---|
| Vertices | 27 | 125 | 4.6× |
| Observation dim | 32 | 130 | 4.1× |
| Theoretical state space | 2²⁷ ≈ 10⁸ | 2¹²⁵ ≈ 10³⁷ | ~10²⁹× |
| PPO success rate | 97.25% | 99.48% | ≈1× (no degradation) |
| PPO first solve | ~ep 50 | ~ep 100 | 2× |
| PPO training time | 42s | ~5 min | ~7× |
| REINFORCE success rate | 1.16% (collapsed) | 41.83% (phase transition) | Different failure mode |
At m=7 the cycle length reaches 343 steps and the observation grows to 348 dimensions. The theoretical state space is 2³⁴³ ≈ 10¹⁰³ — a number larger than the estimated number of atoms in the observable universe. Does PPO's scaling hold?
| Algorithm | Episodes | Success Rate | First Solve | W&B |
|---|---|---|---|---|
| PPO | 30,000 | 99.11% | ~ep 250 | anf6fgm2 |
| REINFORCE | 30,000 | 12.55% | ~ep 2,700 | 30xx3oku |
| GNN-PPO | — | Not run | — | — |
| MCTS | — | Not run | — | — |
PPO achieved 99.11% success rate on 343 vertices — comparable to m=3 (97.25%) and m=5 (99.48%). First cycle found at ~ep 250, near-perfect by ep 300. The simple 2-layer MLP with action masking handles the 348-dimensional observation and 343-step episodes without any architectural changes. This is now three data points confirming PPO's ~99% success rate is essentially independent of m.
REINFORCE's characteristic three-phase pattern continued at m=7 with an even longer plateau:
- Plateau (~23,700 episodes): a low, near-zero success rate.
- Transition (~episode 26,400): an abrupt shift to a correct policy.
- Convergence: essentially 100% success thereafter.
The phase transition pattern is now consistent across three scales:
| m | Vertices | Plateau duration | Transition episode | Post-transition |
|---|---|---|---|---|
| 3 | 27 | No recovery | — | Mode collapse (0%) |
| 5 | 125 | ~11,400 eps | ~11,800 | 100% |
| 7 | 343 | ~23,700 eps | ~26,400 | 100% |
The plateau roughly doubles from m=5 to m=7 (while vertices increase 2.7×), suggesting REINFORCE can solve arbitrarily large instances given enough episodes — but its training budget grows rapidly and unpredictably. Interestingly, m=3 never recovered at all, possibly because the 10,000-episode budget was insufficient.
At m=9 the cycle length reaches 729 steps — the agent must make 729 consecutive correct decisions. The observation grows to 734 dimensions and the theoretical state space is 2⁷²⁹ ≈ 10²¹⁹. This was previously estimated as "uncertain" and at the frontier of feasibility.
| Algorithm | Episodes | Success Rate | First Solve | W&B |
|---|---|---|---|---|
| PPO (from scratch) | 30,000 | 99.01% | ~ep 215 | 7mp85z0c |
| Zeropad m=3 → 9 | 30,000 | 100.00% | ep 1 | ha2uub4u |
| Partial m=3 → 9 | 30,000 | 99.99% | ep 1 | uk2o72tf |
| Zeropad m=5 → 9 | 30,000 | 99.91% | ~ep 1 | 06im3aj8 |
| Partial m=5 → 9 | 30,000 | 99.13% | ~ep 10 | pv1zandf |
PPO's success rate (99.01%) is consistent with m=3 (97.25%), m=5 (99.48%), and m=7 (99.11%). This is now four data points confirming that PPO's performance is essentially independent of m. First solve around episode 215 — actually faster than m=7 (~ep 250).
Pretraining on tiny m=3 (27 vertices) with zero-padded observations and transferring to m=9 (729 vertices) — a 27× increase in problem size — produces a flawless 30,000/30,000 success rate from episode 1. This replicates the perfect result seen at m=7, confirming the pattern is robust.
Both zeropad and partial transfer from m=3 outperform their m=5 counterparts at m=9. Zeropad m=3 achieves 100.00% vs. m=5's 99.91%; partial m=3 achieves 99.99% vs. m=5's 99.13%. This may be because m=3 pretraining converges faster (smaller environment) and the learned strategy is equally general — further evidence that the navigation skill is completely scale-invariant.
| Metric | m=3 (27v) | m=5 (125v) | m=7 (343v) | m=9 (729v) | Trend |
|---|---|---|---|---|---|
| Obs dim | 32 | 130 | 348 | 734 | ~m³ |
| State space | 10⁸ | 10³⁷ | 10¹⁰³ | 10²¹⁹ | Irrelevant |
| PPO from scratch | 97.25% | 99.48% | 99.11% | 99.01% | Flat ~99% |
| PPO first solve | ~ep 50 | ~ep 100 | ~ep 250 | ~ep 215 | Sub-linear growth |
| Zeropad m=3→target | — | — | 100.00% | 100.00% | Perfect |
At m=11, the agent must make 1,331 consecutive correct decisions. The observation grows to 1,336 dimensions and the theoretical state space reaches 2¹³³¹ ≈ 10⁴⁰⁰. This was the previous estimate for the frontier of RL feasibility.
| Algorithm | Episodes | Success Rate | First Solve | W&B |
|---|---|---|---|---|
| PPO (from scratch) | 30,000 | 97.79% | ~ep 350 | j5n9b3kg |
| Zeropad m=3 → 11 | 30,000 | 100.00% | ep 1 | ls5vwqoi |
| Zeropad m=5 → 11 | 30,000 | 100.00% | ep 1 | 36snd57q |
| Partial m=3 → 11 | 30,000 | 99.97% | ~ep 1 | bm1gzwsf |
| Partial m=5 → 11 | 30,000 | 99.58% | ~ep 10 | xtq619eo |
PPO's from-scratch success rate drops to 97.79% — still high, but noticeably below the ~99% plateau seen at m=5/7/9. First solve moves to ~ep 350, up from ~215 at m=9. The 1,331-step episode is starting to stress credit assignment and the 128-unit hidden layer is compressing a 1,336-dim input. This is the first scale where from-scratch PPO shows wear.
Both zeropad sources achieve flawless 30,000/30,000 at m=11 — a 49× scale-up from m=3 (27 → 1,331 vertices). This is the first scale where m=5 zeropad matches m=3, both achieving perfect results. Transfer learning completely compensates for the from-scratch degradation.
As from-scratch PPO's success rate slowly declines with scale (97.79% at m=11 vs 99.48% at m=5), transfer learning's advantage grows from marginal to meaningful. At m=11, transfer provides both faster convergence (ep 1 vs ep 350) and higher final performance (100% vs 97.79%). Transfer is no longer just a convenience — it is becoming essential.
| Metric | m=3 | m=5 | m=7 | m=9 | m=11 | Trend |
|---|---|---|---|---|---|---|
| Vertices | 27 | 125 | 343 | 729 | 1,331 | m³ |
| Obs dim | 32 | 130 | 348 | 734 | 1,336 | ~m³ |
| State space | 10⁸ | 10³⁷ | 10¹⁰³ | 10²¹⁹ | 10⁴⁰⁰ | Irrelevant |
| PPO from scratch | 97.25% | 99.48% | 99.11% | 99.01% | 97.79% | Slight decline at m=11 |
| PPO first solve | ~50 | ~100 | ~250 | ~215 | ~350 | Growing |
| Zeropad m=3 | — | — | 100% | 100% | 100% | Perfect across all |
A key question from the Claude's Cycles page: can an RL agent trained on small m transfer that experience to larger m? If so, it would suggest the agent has learned the underlying structure of the digraph, not just memorized a specific solution. We tested this on m=7 (results below) and confirmed it scales to m=9 (see m=9 results above).
The observation includes a visited-vertex mask whose size is m³, so weights cannot be directly shared between different m values:
| Source | Obs dim | Target | Obs dim | Problem |
|---|---|---|---|---|
| m=3 | 32 | m=7 | 348 | Input layer incompatible |
| m=5 | 130 | m=7 | 348 | Input layer incompatible |
Keep the hidden layers (128×128) and output heads (actor + critic). Reinitialize the input projection for the new obs_dim.
Hypothesis: hidden layers encode a reusable "navigation strategy."
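Partial transfer amounts to copying every parameter group except the input projection. A plain-NumPy sketch of the parameter surgery, standing in for the actual actor-critic MLP (the layer names here are illustrative, not from the real code):

```python
import numpy as np

def init_mlp(obs_dim, hidden=128, n_actions=3, rng=None):
    """Toy parameter dict standing in for the 32/130/348-dim -> 128 -> 128 -> heads MLP."""
    rng = rng or np.random.default_rng(0)
    return {
        "input":  rng.normal(0, 0.1, (obs_dim, hidden)),   # obs_dim-dependent
        "hidden": rng.normal(0, 0.1, (hidden, hidden)),    # transferable
        "actor":  rng.normal(0, 0.1, (hidden, n_actions)), # transferable
        "critic": rng.normal(0, 0.1, (hidden, 1)),         # transferable
    }

def partial_transfer(src, new_obs_dim):
    """Keep hidden layer and both heads; reinitialize only the input projection."""
    dst = init_mlp(new_obs_dim, rng=np.random.default_rng(1))
    for key in ("hidden", "actor", "critic"):
        dst[key] = src[key].copy()
    return dst

m3_params = init_mlp(32)                   # pretrained on m=3 (32-dim obs)
m7_params = partial_transfer(m3_params, 348)  # finetune on m=7 (348-dim obs)
```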
During pretraining, pad observations to the target size (e.g., 348-dim for m=7). All weights transfer directly — no architecture change.
Hypothesis: network learns to ignore zero-padded dimensions.
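The zero-padding itself is a one-liner. A sketch, assuming padding is appended after the source observation (the real feature layout may interleave features and mask bits differently):

```python
import numpy as np

def pad_observation(obs, target_dim):
    """Zero-pad a small-m observation up to the target obs dimension so the same
    input layer serves both pretraining and finetuning."""
    padded = np.zeros(target_dim, dtype=obs.dtype)
    padded[: obs.size] = obs
    return padded

obs_m3 = np.ones(32, dtype=np.float32)   # a 32-dim m=3 observation
obs = pad_observation(obs_m3, 348)       # padded to the 348-dim m=7 input size
```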
| Method | Source | Finetune Success Rate | First Solve | W&B |
|---|---|---|---|---|
| From scratch (baseline) | — | 99.11% | ~ep 250 | anf6fgm2 |
| Partial transfer | m=3 | 99.99% | ep 1 | k3fv51zi |
| Partial transfer | m=5 | 99.99% | ep 1 | noxa0p61 |
| Zero-padded transfer | m=3 | 100.00% | ep 1 | 8ddfz8i5 |
| Zero-padded transfer | m=5 | 99.96% | ep 1 | 5nqe3nqo |
The baseline PPO needs ~215–250 episodes to find its first Hamiltonian cycle on m=7/m=9. All transfer methods solve from episode 1 of finetuning — confirmed on both m=7 (343 vertices) and m=9 (729 vertices). The "navigation strategy" learned on m=3 (27 vertices) transfers across a 27× increase in problem size.
Pretraining on tiny m=3 with zero-padded observations, then finetuning on m=7 or m=9, produces a perfect 30,000/30,000 success rate in both cases. This is better than from-scratch PPO (~99%) and every other transfer variant. The zero-padding preserves the input projection's understanding of visited-mask patterns, giving it a head start.
Partial transfer only preserves hidden layers — the input projection is randomly reinitialized for the new 348-dim input. Despite this, it solves m=7 from episode 1 at 99.99%. This means the hidden layers encode a generalizable navigation policy largely independent of the input encoding.
Both m=3 and m=5 pretrained models transfer equally well. The 27-vertex problem teaches essentially the same strategy as the 125-vertex problem. The core skill — "how to use the action mask and visited information to navigate a 3-cube digraph" — is scale-invariant.
As m increases, three quantities grow at very different rates, each creating a different bottleneck:
The visited mask is an m³-bit vector, so the network input grows cubically. For m=3 the observation is 32-dimensional; for m=5 it is 130; for m=7 it is 348; for m=9 it is 734; for m=11 it is 1,336. PPO handles all of these without architectural changes, though from-scratch performance shows first signs of stress at 1,336 dimensions.
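The observation sizes quoted above all follow the same formula: 5 scalar features (the three coordinates, step count, and fiber index) plus the m³-bit visited mask:

```python
def obs_dim(m):
    """5 scalar features (coordinates, step count, fiber index) + m**3-bit visited mask."""
    return 5 + m**3

assert [obs_dim(m) for m in (3, 5, 7, 9, 11)] == [32, 130, 348, 734, 1336]
```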
A successful episode requires exactly m³ steps. Longer episodes mean: (a) more forward passes per episode, increasing wall-clock time linearly; (b) longer credit-assignment horizons, making it harder for the agent to attribute the final +100 reward to early actions; (c) more opportunities for a single wrong decision to terminate the episode.
Empirical growth: PPO episodes-to-first-solve: 50 (m=3) → 100 (m=5) → 250 (m=7) → 215 (m=9) → 350 (m=11). The growth is sub-linear relative to vertex count, but m=11 shows the first uptick. Transfer learning eliminates warmup entirely (ep 1 for all tested scales m=7 through m=11).
The visited-set tracking creates an exponentially growing state space. However, results across m=3, 5, 7, 9, and 11 consistently show that this theoretical explosion does not translate to practical difficulty for function-approximation RL. PPO maintains 97–99% success rate as the state space grows from 10⁸ to 10⁴⁰⁰. The MLP never enumerates states — it learns to generalize. The real bottleneck is episode length and credit assignment, not raw state-space size.
| m | Vertices | Obs dim | Visited-set states | Status |
|---|---|---|---|---|
| 3 | 27 | 32 | 2²⁷ ≈ 10⁸ | Solved: all 4 algorithms (PPO 97%, GNN 99%) |
| 5 | 125 | 130 | 2¹²⁵ ≈ 10³⁷ | Solved: PPO 99.5%, REINFORCE 42% |
| 7 | 343 | 348 | 2³⁴³ ≈ 10¹⁰³ | Solved: PPO 99.1%, REINFORCE 12.6% |
| 9 | 729 | 734 | 2⁷²⁹ ≈ 10²¹⁹ | Solved: PPO 99.0%, transfer 100% |
| 11 | 1,331 | 1,336 | 2¹³³¹ ≈ 10⁴⁰⁰ | Solved: PPO 97.8%, transfer 100% |
With five confirmed scales (m=3,5,7,9,11) and perfect transfer learning through m=11, the estimated limit continues to rise. However, m=11 is the first scale showing from-scratch degradation (97.79%), suggesting we are approaching a capacity limit.
The key lesson from m=3 through m=11: the exponential state space is a red herring. What actually limits RL is network capacity relative to the observation dimension and episode length for credit assignment. PPO from scratch maintains 97–99% success as the state space grows from 10⁸ to 10⁴⁰⁰. Transfer learning from m=3 provides perfect 100% results at every tested scale up to m=11 (49× scale increase). The practical limit with transfer is likely where the 128-unit hidden layer can no longer compress the policy — around m=13–15.
Everything above shows that RL can find Hamiltonian cycles for specific m values. But Claude didn't just find cycles — Claude found a universal construction (6 conditional rules) that provably works for all odd m > 1. Can RL replicate that?
As documented in Knuth's paper and the discovery narrative, Claude's path over 31 explorations followed a specific intellectual trajectory:
| Phase | Explorations | What happened | Intellectual move |
|---|---|---|---|
| 1 | 1–8 | Discovered serpentine traversal and the fiber structure s = (i+j+k) mod m | Inventing a new abstraction |
| 2 | 9–15 | Realized fiber-based rules could govern all three cycles; invented "direction strings" | Representational insight |
| 3 | 16–24 | Used simulated annealing on small m to find working decompositions; extracted patterns | Computational search + pattern extraction |
| 4 | 25–31 | Crystallized 6 conditional rules; argued they work for all odd m > 1 | Generalization + universality argument |
The proof has two parts: (a) a construction (the 6 bump rules in the rule table), and (b) a correctness argument (each cycle visits all m³ vertices exactly once, and the 3 cycles partition all 3m³ arcs). This is also what a human mathematician would do given enough insight and patience.
The RL algorithms on this page operate in a fundamentally different space than proof discovery:
| Aspect | RL (what we did) | Proof discovery (what Claude did) |
|---|---|---|
| Output | A sequence of m³ bumps for one m | Symbolic rules + correctness argument for all m |
| Action space | {bump i, bump j, bump k} | Mathematical reasoning steps (infinite, unstructured) |
| State | Current vertex + visited set | Current mathematical knowledge and conjectures |
| Reward | +1 per vertex, +100 for cycle | No clear signal until the proof is complete |
| Generalization | Solves one m at a time | Must work for all m simultaneously |
| Verification | Check if all vertices visited | Check logical validity of argument |
Recognizing s = (i+j+k) mod m as the organizing principle was an act of mathematical creativity — defining a new variable that simplifies the problem. RL has no mechanism for inventing new state representations. The agent sees vertices and bumps; it cannot decide "I should group these vertices by the sum of their coordinates modulo m."
Compressing the assignment of 3m³ arcs to 3 cycles into 6 conditional rules using permutations of "012" requires understanding symmetry. The insight that each vertex's three outgoing arcs should be assigned to the three cycles via a permutation — and that the permutation depends only on fiber index and one coordinate — is a representational leap that no reward signal can guide toward.
The rules work for all odd m because the conditions (s=0, s=m−1, i=m−1, j=m−1) are defined relative to m, not using absolute values. Moving from "this works for m=3, 5, 7" to "this works for all odd m" requires logical deduction about parameterized families — a form of reasoning RL does not perform.
In explorations 16–24, Claude used simulated annealing to find solutions for small m, then looked at the solutions to extract patterns. This is meta-reasoning — using computation to generate data, then reasoning about that data. RL just runs the computation; it doesn't step back and ask "what do these solutions have in common?"
Three approaches partially bridge the gap, each with fundamental limitations:
Train an agent to write a program (like Knuth's C code) that generates valid Hamiltonian cycles for arbitrary m. The action space would be token-level code generation; the reward would be running the program for m=3, 5, 7, … and checking validity.
Problem: The search space over program tokens is astronomically larger than finding a cycle for one m. The agent would need to discover fiber structure, direction strings, and conditional rules through trial and error over code tokens — without any guidance toward the right abstractions.
Let the agent propose parametric rules (e.g., "if s=0 and j=m−1, use direction '012'") and verify them computationally for increasing m values. Reward: +1 for each m where the rules produce valid decompositions.
This is closest to what Claude actually did in explorations 16–24. But the critical step was the abstraction: recognizing that the pattern depends on fiber index and one coordinate. An RL agent over rule templates would need this abstraction space pre-defined — which raises the question of who supplies the insight.
A multi-level system where:
- Level 1: RL agents find Hamiltonian cycles for individual m values,
- Level 2: policies are trained across several m (e.g., m=3 through m=11),
- Level 3: a symbolic layer extracts general rules from the trained policies.
This mirrors Claude's actual path. But the Level 2 → 3 transition requires symbolic abstraction — going from "here are 5 trained neural network policies" to "the decision depends on (i+j+k) mod m and whether i=m−1." This is an open problem in AI.
| Capability | RL | LLM reasoning + tools | Human mathematician |
|---|---|---|---|
| Find one cycle for one m | Yes (99%+) | Yes | Yes (tedious) |
| Find cycles for many m | Yes (with transfer) | Yes | Yes |
| Discover fiber structure | No | Yes (Claude did it) | Yes |
| Invent direction strings | No | Yes (Claude did it) | Yes |
| Prove universal construction | No | Yes (Claude did it) | Yes |
| Scale to m=11+ efficiently | Yes (with transfer) | Instant (closed-form rules) | Instant (closed-form rules) |
| Component | Description |
|---|---|
| State | Current vertex (i,j,k) + visited set + step count |
| Actions | {bump i, bump j, bump k} — always 3 choices |
| Reward | +1 new vertex, +100 complete cycle, −10 revisit |
| Goal | Find a Hamiltonian cycle (visit all m³ vertices exactly once, return to start) |
| Not the goal | Prove that exactly 3 disjoint cycles exist for all m |
| Best algorithm (m=3) | GNN-PPO: 99.42% success, first solve ~ep 10 |
| Best algorithm (m=5) | PPO: 99.48% success, first solve ~ep 100 |
| Best algorithm (m=7) | PPO: 99.11% success, first solve ~ep 250 |
| Best algorithm (m=9) | PPO: 99.01% success, first solve ~ep 215 |
| Best algorithm (m=11) | PPO+transfer: 100% success; from scratch: 97.79% |
| PPO scaling trend | 97–99% from scratch, 100% with transfer (tested m=3–11) |
| Transfer learning | Zeropad m=3 → m=7/9/11: perfect 100% from ep 1 |
| Practical RL limit | m ≈ 13–15 with transfer (revised after m=11 results) |