How reinforcement learning is structured to find Hamiltonian cycles in the m-ary cube digraph — and experimental results for m=3.
Companion page to Claude's Cycles |
Training script: rl_hamiltonian.py |
W&B: gitcg/rl-hamiltonian
Consider a directed graph (digraph) whose vertices are tuples (i, j, k) with each coordinate in {0, 1, …, m−1}. There are m³ vertices total. Each vertex has exactly 3 outgoing arcs, called bumps: incrementing i, j, or k by 1 (mod m). Claude proved analytically that this digraph can always be decomposed into exactly 3 edge-disjoint Hamiltonian cycles.
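The digraph's structure can be sketched in a few lines (a minimal illustration, not taken from rl_hamiltonian.py):

```python
from itertools import product

def bump_neighbors(v, m):
    """Return the 3 out-neighbors of vertex v = (i, j, k): bump one coordinate mod m."""
    return [tuple((c + (1 if idx == axis else 0)) % m for idx, c in enumerate(v))
            for axis in range(3)]

m = 3
vertices = list(product(range(m), repeat=3))
assert len(vertices) == m**3                  # 27 vertices for m=3
for v in vertices:
    assert len(set(bump_neighbors(v, m))) == 3  # exactly 3 distinct outgoing arcs
```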
The RL exercise asks a different question: can a learned agent rediscover a Hamiltonian cycle purely through trial and error?
The problem is cast as a Markov Decision Process (MDP) with the following components:
A state consists of three components:
- the current vertex (i, j, k),
- the set of vertices visited so far,
- the step count.
The full state space is enormous. The visited-set component alone has 2^(m³) possible configurations. For m=3, that is 2²⁷ ≈ 134 million visited-set states, each combined with 27 possible current vertices. This combinatorial explosion is the central difficulty.
At every step the agent chooses one of 3 actions:
- bump i: increment i by 1 (mod m),
- bump j: increment j by 1 (mod m),
- bump k: increment k by 1 (mod m).
The action space is fixed at {0, 1, 2} regardless of m. This is unusually small — the difficulty comes entirely from the state space, not the action space.
| Event | Reward | Effect |
|---|---|---|
| Visit a new vertex | +1 | Episode continues |
| Complete a Hamiltonian cycle (return to start after visiting all m³ vertices) | +100 | Episode ends (success) |
| Revisit a vertex | −10 | Episode ends (failure) |
The reward is sparse: the large +100 bonus only arrives if the agent perfectly sequences all m³ steps without revisiting any vertex. For m=3 the agent must make 27 correct decisions in a row; for m=5, it must make 125.
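The state, action, and reward logic above fits in a short environment class. This is a minimal sketch of the episode mechanics, not the actual rl_hamiltonian.py implementation (which also builds the 32-dim observation and action mask):

```python
class CycleEnv:
    """Minimal sketch of the episode logic: visit all m**3 vertices, return to start."""

    def __init__(self, m=3, start=(0, 0, 0)):
        self.m, self.start = m, start
        self.reset()

    def reset(self):
        self.v = self.start
        self.visited = {self.start}
        return self.v

    def step(self, action):              # action in {0, 1, 2}: bump i, j, or k
        nxt = list(self.v)
        nxt[action] = (nxt[action] + 1) % self.m
        nxt = tuple(nxt)
        if nxt == self.start and len(self.visited) == self.m ** 3:
            return nxt, +100, True       # Hamiltonian cycle completed: success
        if nxt in self.visited:
            return nxt, -10, True        # revisit: episode fails
        self.visited.add(nxt)
        self.v = nxt
        return nxt, +1, False            # new vertex: episode continues
```

For example, bumping i three times from (0, 0, 0) with m=3 returns to the start after only 3 of the 27 vertices, so the third step terminates the episode with the −10 revisit penalty.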
We tested four algorithms on the m=3 case. All share the same environment, observation space (32-dim: coordinates, step count, fiber index, visited mask), reward function, and action masking. Seed: 42.
| Algorithm | Episodes | Success Rate | First Solve | Runtime | W&B |
|---|---|---|---|---|---|
| REINFORCE | 10,000 | 1.16% | ~ep 150 | 32s | ssihf20k |
| PPO | 10,000 | 97.25% | ~ep 50 | 42s | 1q21fgzy |
| GNN-PPO | 5,000 | 99.42% | ~ep 10 | 203s | xsw7g6da |
| MCTS | 200 | 26.5% | ~ep 2 | 6s | ny82u4h3 |
Both use the same MLP architecture (32 → 128 → 128 → 3), but PPO's clipped objective and entropy bonus completely prevent the mode collapse that killed REINFORCE. PPO reached 97.25% success rate vs. REINFORCE's 1.16%.
The graph neural network policy (2-layer message-passing GNN with 64-dim messages) exploits the regular structure of the m-ary cube. It converged to near-perfect performance within ~100 episodes — the fastest among all learned policies. Only 29 failures in 5,000 episodes.
Pure tree search (200 simulations/move, UCB1 selection) achieves 26.5% per-episode success with no training at all. It is expensive per episode but has zero training instability. Useful as a non-learned baseline.
REINFORCE found 116 cycles in episodes 1–1,100, then converged to a deterministic 24-step suboptimal path and never found another cycle for the remaining 8,900 episodes. This is the classic high-variance policy gradient failure mode: without entropy regularization, the policy becomes too confident too early.
We ran REINFORCE with three observation/masking configurations. Action masking — zeroing out the probability of actions that lead to already-visited vertices — was the single most impactful design decision.
| Variant | Obs Dim | Action Masking | Best Path | Successes |
|---|---|---|---|---|
| No visited mask | 5 | No | 23/27 | 0 |
| With visited mask | 32 | No | 23/27 | 0 |
| With mask + action masking | 32 | Yes | 27/27 | 116 |
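Action masking can be sketched as a post-processing step on the policy's output distribution. This is an illustrative version, not the actual implementation; note that the one legal revisit (returning to the start on the final step) would need special handling, which this sketch covers only with a fallback:

```python
import numpy as np

def mask_policy(probs, v, visited, m):
    """Zero the probability of any bump that leads to an already-visited vertex,
    then renormalize. If every move is illegal (e.g., the closing move back to the
    start), fall back to the unmasked distribution."""
    masked = probs.copy()
    for a in range(3):
        nxt = list(v)
        nxt[a] = (nxt[a] + 1) % m
        if tuple(nxt) in visited:
            masked[a] = 0.0
    total = masked.sum()
    return masked / total if total > 0 else probs
```

With a uniform policy at (0, 0, 0) and (1, 0, 0) already visited, the mass on "bump i" is redistributed to the two remaining actions.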
Moving from m=3 to m=5 increases the cycle length from 27 to 125 steps and the observation dimension from 32 to 130 (5 features + 125-bit visited mask). The theoretical state space explodes from 2²⁷ to 2¹²⁵ — a factor of ~10²⁹ larger. Can the same algorithms still find Hamiltonian cycles?
| Algorithm | Episodes | Success Rate | First Solve | Runtime | W&B |
|---|---|---|---|---|---|
| PPO | 20,000 | 99.48% | ~ep 100 | ~5 min | 7cp4e667 |
| REINFORCE | 20,000 | 41.83% | ~ep 350 | ~8 min | jpucotko |
| GNN-PPO | 10,000 | DNF (too slow) | — | >8 min/ep | c9tzft69 |
| MCTS | — | Not run | — | — | — |
PPO achieved 99.48% success rate on 125 vertices — nearly identical to its 97.25% on m=3. First Hamiltonian cycle found around episode 100, near-perfect convergence by episode 200. The simple MLP actor-critic with action masking handles the 4.6× longer episode without difficulty.
REINFORCE exhibited a striking three-phase training profile on m=5:
- Plateau (roughly episodes 1–11,700): locked into a near-optimal but incorrect path, succeeding on only ~0.6% of episodes.
- Transition (~episode 11,800): an abrupt shift to a correct policy.
- Convergence (remaining episodes): essentially 100% success.
This "sudden phase transition" is a remarkable example of how REINFORCE can spend thousands of episodes in a near-optimal but incorrect mode, then abruptly shift to the correct policy. The overall 41.83% rate is misleading — it was effectively 0.6% for 11,700 episodes, then 100%.
GNN-PPO could not complete a single episode — the per-node message-passing loop with 125 nodes and 750 edges (bidirectional) takes >8 minutes per episode in Python. A sparse/batched implementation (e.g., PyTorch Geometric) would be needed. MCTS was not attempted: random rollouts on 125 vertices have near-zero probability of finding a Hamiltonian cycle without a learned value function.
| Metric | m=3 (27 vertices) | m=5 (125 vertices) | Growth factor |
|---|---|---|---|
| Vertices | 27 | 125 | 4.6× |
| Observation dim | 32 | 130 | 4.1× |
| Theoretical state space | 2²⁷ ≈ 10⁸ | 2¹²⁵ ≈ 10³⁷ | ~10²⁹× |
| PPO success rate | 97.25% | 99.48% | ≈1× (no degradation) |
| PPO first solve | ~ep 50 | ~ep 100 | 2× |
| PPO training time | 42s | ~5 min | ~7× |
| REINFORCE success rate | 1.16% (collapsed) | 41.83% (phase transition) | Different failure mode |
At m=7 the cycle length reaches 343 steps and the observation grows to 348 dimensions. The theoretical state space is 2³⁴³ ≈ 10¹⁰³ — a number larger than the estimated number of atoms in the observable universe. Does PPO's scaling hold?
| Algorithm | Episodes | Success Rate | First Solve | W&B |
|---|---|---|---|---|
| PPO | 30,000 | 99.11% | ~ep 250 | anf6fgm2 |
| REINFORCE | 30,000 | 12.55% | ~ep 2,700 | 30xx3oku |
| GNN-PPO | — | Not run | — | — |
| MCTS | — | Not run | — | — |
PPO achieved 99.11% success rate on 343 vertices — comparable to m=3 (97.25%) and m=5 (99.48%). First cycle found at ~ep 250, near-perfect by ep 300. The simple 2-layer MLP with action masking handles the 348-dimensional observation and 343-step episodes without any architectural changes. This is now three data points confirming PPO's ~99% success rate is essentially independent of m.
REINFORCE's characteristic three-phase pattern continued at m=7 with an even longer plateau:
- Plateau (~23,700 episodes): a low, near-zero success rate.
- Transition (~episode 26,400): an abrupt shift to a correct policy.
- Convergence: essentially 100% success thereafter.
The phase transition pattern is now consistent across three scales:
| m | Vertices | Plateau duration | Transition episode | Post-transition |
|---|---|---|---|---|
| 3 | 27 | No recovery | — | Mode collapse (0%) |
| 5 | 125 | ~11,400 eps | ~11,800 | 100% |
| 7 | 343 | ~23,700 eps | ~26,400 | 100% |
The plateau roughly doubles from m=5 to m=7 (while vertices increase 2.7×), suggesting REINFORCE can solve arbitrarily large instances given enough episodes — but its training budget grows rapidly and unpredictably. Interestingly, m=3 never recovered at all, possibly because the 10,000-episode budget was insufficient.
At m=9 the cycle length reaches 729 steps — the agent must make 729 consecutive correct decisions. The observation grows to 734 dimensions and the theoretical state space is 2⁷²⁹ ≈ 10²¹⁹. This was previously estimated as "uncertain" and at the frontier of feasibility.
| Algorithm | Episodes | Success Rate | First Solve | W&B |
|---|---|---|---|---|
| PPO (from scratch) | 30,000 | 99.01% | ~ep 215 | 7mp85z0c |
| Zeropad m=3 → 9 | 30,000 | 100.00% | ep 1 | ha2uub4u |
| Partial m=3 → 9 | 30,000 | 99.99% | ep 1 | uk2o72tf |
| Zeropad m=5 → 9 | 30,000 | 99.91% | ~ep 1 | 06im3aj8 |
| Partial m=5 → 9 | 30,000 | 99.13% | ~ep 10 | pv1zandf |
PPO's success rate (99.01%) is consistent with m=3 (97.25%), m=5 (99.48%), and m=7 (99.11%). This is now four data points confirming that PPO's performance is essentially independent of m. First solve around episode 215 — actually faster than m=7 (~ep 250).
Pretraining on tiny m=3 (27 vertices) with zero-padded observations and transferring to m=9 (729 vertices) — a 27× increase in problem size — produces a flawless 30,000/30,000 success rate from episode 1. This replicates the perfect result seen at m=7, confirming the pattern is robust.
Both zeropad and partial transfer from m=3 outperform their m=5 counterparts at m=9. Zeropad m=3 achieves 100.00% vs. m=5's 99.91%; partial m=3 achieves 99.99% vs. m=5's 99.13%. This may be because m=3 pretraining converges faster (smaller environment) and the learned strategy is equally general — further evidence that the navigation skill is completely scale-invariant.
| Metric | m=3 (27v) | m=5 (125v) | m=7 (343v) | m=9 (729v) | Trend |
|---|---|---|---|---|---|
| Obs dim | 32 | 130 | 348 | 734 | ~m³ |
| State space | 10⁸ | 10³⁷ | 10¹⁰³ | 10²¹⁹ | Irrelevant |
| PPO from scratch | 97.25% | 99.48% | 99.11% | 99.01% | Flat ~99% |
| PPO first solve | ~ep 50 | ~ep 100 | ~ep 250 | ~ep 215 | Sub-linear growth |
| Zeropad m=3→target | — | — | 100.00% | 100.00% | Perfect |
At m=11, the agent must make 1,331 consecutive correct decisions. The observation grows to 1,336 dimensions and the theoretical state space reaches 2¹³³¹ ≈ 10⁴⁰⁰. This was the previous estimate for the frontier of RL feasibility.
| Algorithm | Episodes | Success Rate | First Solve | W&B |
|---|---|---|---|---|
| PPO (from scratch) | 30,000 | 97.79% | ~ep 350 | j5n9b3kg |
| Zeropad m=3 → 11 | 30,000 | 100.00% | ep 1 | ls5vwqoi |
| Zeropad m=5 → 11 | 30,000 | 100.00% | ep 1 | 36snd57q |
| Partial m=3 → 11 | 30,000 | 99.97% | ~ep 1 | bm1gzwsf |
| Partial m=5 → 11 | 30,000 | 99.58% | ~ep 10 | xtq619eo |
PPO's from-scratch success rate drops to 97.79% — still high, but noticeably below the ~99% plateau seen at m=5/7/9. First solve moves to ~ep 350, up from ~215 at m=9. The 1,331-step episode is starting to stress credit assignment and the 128-unit hidden layer is compressing a 1,336-dim input. This is the first scale where from-scratch PPO shows wear.
Both zeropad sources achieve flawless 30,000/30,000 at m=11 — a 49× scale-up from m=3 (27 → 1,331 vertices). This is the first scale where m=5 zeropad matches m=3, both achieving perfect results. Transfer learning completely compensates for the from-scratch degradation.
As from-scratch PPO's success rate slowly declines with scale (97.79% at m=11 vs 99.48% at m=5), transfer learning's advantage grows from marginal to meaningful. At m=11, transfer provides both faster convergence (ep 1 vs ep 350) and higher final performance (100% vs 97.79%). Transfer is no longer just a convenience — it is becoming essential.
| Metric | m=3 | m=5 | m=7 | m=9 | m=11 | Trend |
|---|---|---|---|---|---|---|
| Vertices | 27 | 125 | 343 | 729 | 1,331 | m³ |
| Obs dim | 32 | 130 | 348 | 734 | 1,336 | ~m³ |
| State space | 10⁸ | 10³⁷ | 10¹⁰³ | 10²¹⁹ | 10⁴⁰⁰ | Irrelevant |
| PPO from scratch | 97.25% | 99.48% | 99.11% | 99.01% | 97.79% | Slight decline at m=11 |
| PPO first solve | ~50 | ~100 | ~250 | ~215 | ~350 | Growing |
| Zeropad m=3 | — | — | 100% | 100% | 100% | Perfect across all |
A key question from the Claude's Cycles page: can an RL agent trained on small m transfer that experience to larger m? If so, it would suggest the agent has learned the underlying structure of the digraph, not just memorized a specific solution. We tested this on m=7 (results below) and confirmed it scales to m=9 (see m=9 results above).
The observation includes a visited-vertex mask whose size is m³, so weights cannot be directly shared between different m values:
| Source | Obs dim | Target | Obs dim | Problem |
|---|---|---|---|---|
| m=3 | 32 | m=7 | 348 | Input layer incompatible |
| m=5 | 130 | m=7 | 348 | Input layer incompatible |
Keep the hidden layers (128×128) and output heads (actor + critic). Reinitialize the input projection for the new obs_dim.
Hypothesis: hidden layers encode a reusable "navigation strategy."
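Partial transfer amounts to copying every parameter group except the input projection. A plain-NumPy sketch of the parameter surgery, standing in for the actual actor-critic MLP (the layer names here are illustrative, not from the real code):

```python
import numpy as np

def init_mlp(obs_dim, hidden=128, n_actions=3, rng=None):
    """Toy parameter dict standing in for the 32/130/348-dim -> 128 -> 128 -> heads MLP."""
    rng = rng or np.random.default_rng(0)
    return {
        "input":  rng.normal(0, 0.1, (obs_dim, hidden)),   # obs_dim-dependent
        "hidden": rng.normal(0, 0.1, (hidden, hidden)),    # transferable
        "actor":  rng.normal(0, 0.1, (hidden, n_actions)), # transferable
        "critic": rng.normal(0, 0.1, (hidden, 1)),         # transferable
    }

def partial_transfer(src, new_obs_dim):
    """Keep hidden layer and both heads; reinitialize only the input projection."""
    dst = init_mlp(new_obs_dim, rng=np.random.default_rng(1))
    for key in ("hidden", "actor", "critic"):
        dst[key] = src[key].copy()
    return dst

m3_params = init_mlp(32)                   # pretrained on m=3 (32-dim obs)
m7_params = partial_transfer(m3_params, 348)  # finetune on m=7 (348-dim obs)
```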
During pretraining, pad observations to the target size (e.g., 348-dim for m=7). All weights transfer directly — no architecture change.
Hypothesis: network learns to ignore zero-padded dimensions.
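The zero-padding itself is a one-liner. A sketch, assuming padding is appended after the source observation (the real feature layout may interleave features and mask bits differently):

```python
import numpy as np

def pad_observation(obs, target_dim):
    """Zero-pad a small-m observation up to the target obs dimension so the same
    input layer serves both pretraining and finetuning."""
    padded = np.zeros(target_dim, dtype=obs.dtype)
    padded[: obs.size] = obs
    return padded

obs_m3 = np.ones(32, dtype=np.float32)   # a 32-dim m=3 observation
obs = pad_observation(obs_m3, 348)       # padded to the 348-dim m=7 input size
```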
| Method | Source | Finetune Success Rate | First Solve | W&B |
|---|---|---|---|---|
| From scratch (baseline) | — | 99.11% | ~ep 250 | anf6fgm2 |
| Partial transfer | m=3 | 99.99% | ep 1 | k3fv51zi |
| Partial transfer | m=5 | 99.99% | ep 1 | noxa0p61 |
| Zero-padded transfer | m=3 | 100.00% | ep 1 | 8ddfz8i5 |
| Zero-padded transfer | m=5 | 99.96% | ep 1 | 5nqe3nqo |
The baseline PPO needs ~215–250 episodes to find its first Hamiltonian cycle on m=7/m=9. All transfer methods solve from episode 1 of finetuning — confirmed on both m=7 (343 vertices) and m=9 (729 vertices). The "navigation strategy" learned on m=3 (27 vertices) transfers across a 27× increase in problem size.
Pretraining on tiny m=3 with zero-padded observations, then finetuning on m=7 or m=9, produces a perfect 30,000/30,000 success rate in both cases. This is better than from-scratch PPO (~99%) and every other transfer variant. The zero-padding preserves the input projection's understanding of visited-mask patterns, giving it a head start.
Partial transfer only preserves hidden layers — the input projection is randomly reinitialized for the new 348-dim input. Despite this, it solves m=7 from episode 1 at 99.99%. This means the hidden layers encode a generalizable navigation policy largely independent of the input encoding.
Both m=3 and m=5 pretrained models transfer equally well. The 27-vertex problem teaches essentially the same strategy as the 125-vertex problem. The core skill — "how to use the action mask and visited information to navigate a 3-cube digraph" — is scale-invariant.
As m increases, three quantities grow at very different rates, each creating a different bottleneck:
The visited mask is an m³-bit vector, so the network input grows cubically. For m=3 the observation is 32-dimensional; for m=5 it is 130; for m=7 it is 348; for m=9 it is 734; for m=11 it is 1,336. PPO handles all of these without architectural changes, though from-scratch performance shows first signs of stress at 1,336 dimensions.
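The observation sizes quoted above all follow the same formula: 5 scalar features (the three coordinates, step count, and fiber index) plus the m³-bit visited mask:

```python
def obs_dim(m):
    """5 scalar features (coordinates, step count, fiber index) + m**3-bit visited mask."""
    return 5 + m**3

assert [obs_dim(m) for m in (3, 5, 7, 9, 11)] == [32, 130, 348, 734, 1336]
```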
A successful episode requires exactly m³ steps. Longer episodes mean: (a) more forward passes per episode, increasing wall-clock time linearly; (b) longer credit-assignment horizons, making it harder for the agent to attribute the final +100 reward to early actions; (c) more opportunities for a single wrong decision to terminate the episode.
Empirical growth: PPO episodes-to-first-solve: 50 (m=3) → 100 (m=5) → 250 (m=7) → 215 (m=9) → 350 (m=11). The growth is sub-linear relative to vertex count, but m=11 shows the first uptick. Transfer learning eliminates warmup entirely (ep 1 for all tested scales m=7 through m=11).
The visited-set tracking creates an exponentially growing state space. However, results across m=3, 5, 7, 9, and 11 consistently show that this theoretical explosion does not translate to practical difficulty for function-approximation RL. PPO maintains 97–99% success rate as the state space grows from 10⁸ to 10⁴⁰⁰. The MLP never enumerates states — it learns to generalize. The real bottleneck is episode length and credit assignment, not raw state-space size.
| m | Vertices | Obs dim | Visited-set states | Status |
|---|---|---|---|---|
| 3 | 27 | 32 | 2²⁷ ≈ 10⁸ | Solved: all 4 algorithms (PPO 97%, GNN 99%) |
| 5 | 125 | 130 | 2¹²⁵ ≈ 10³⁷ | Solved: PPO 99.5%, REINFORCE 42% |
| 7 | 343 | 348 | 2³⁴³ ≈ 10¹⁰³ | Solved: PPO 99.1%, REINFORCE 12.6% |
| 9 | 729 | 734 | 2⁷²⁹ ≈ 10²¹⁹ | Solved: PPO 99.0%, transfer 100% |
| 11 | 1,331 | 1,336 | 2¹³³¹ ≈ 10⁴⁰⁰ | Solved: PPO 97.8%, transfer 100% |
With five confirmed scales (m=3,5,7,9,11) and perfect transfer learning through m=11, the estimated limit continues to rise. However, m=11 is the first scale showing from-scratch degradation (97.79%), suggesting we are approaching a capacity limit.
The key lesson from m=3 through m=11: the exponential state space is a red herring. What actually limits RL is network capacity relative to the observation dimension and episode length for credit assignment. PPO from scratch maintains 97–99% success as the state space grows from 10⁸ to 10⁴⁰⁰. Transfer learning from m=3 provides perfect 100% results at every tested scale up to m=11 (49× scale increase). The practical limit with transfer is likely where the 128-unit hidden layer can no longer compress the policy — around m=13–15.
Everything above shows that RL can find Hamiltonian cycles for specific m values. But Claude didn't just find cycles — Claude found a universal construction (6 conditional rules) that provably works for all odd m > 1. Can RL replicate that?
As documented in Knuth's paper and the discovery narrative, Claude's path over 31 explorations followed a specific intellectual trajectory:
| Phase | Explorations | What happened | Intellectual move |
|---|---|---|---|
| 1 | 1–8 | Discovered serpentine traversal and the fiber structure s = (i+j+k) mod m | Inventing a new abstraction |
| 2 | 9–15 | Realized fiber-based rules could govern all three cycles; invented "direction strings" | Representational insight |
| 3 | 16–24 | Used simulated annealing on small m to find working decompositions; extracted patterns | Computational search + pattern extraction |
| 4 | 25–31 | Crystallized 6 conditional rules; argued they work for all odd m > 1 | Generalization + universality argument |
The proof has two parts: (a) a construction (the 6 bump rules in the rule table), and (b) a correctness argument (each cycle visits all m³ vertices exactly once, and the 3 cycles partition all 3m³ arcs). This is also what a human mathematician would do given enough insight and patience.
The RL algorithms on this page operate in a fundamentally different space than proof discovery:
| Aspect | RL (what we did) | Proof discovery (what Claude did) |
|---|---|---|
| Output | A sequence of m³ bumps for one m | Symbolic rules + correctness argument for all m |
| Action space | {bump i, bump j, bump k} | Mathematical reasoning steps (infinite, unstructured) |
| State | Current vertex + visited set | Current mathematical knowledge and conjectures |
| Reward | +1 per vertex, +100 for cycle | No clear signal until the proof is complete |
| Generalization | Solves one m at a time | Must work for all m simultaneously |
| Verification | Check if all vertices visited | Check logical validity of argument |
Recognizing s = (i+j+k) mod m as the organizing principle was an act of mathematical creativity — defining a new variable that simplifies the problem. RL has no mechanism for inventing new state representations. The agent sees vertices and bumps; it cannot decide "I should group these vertices by the sum of their coordinates modulo m."
Compressing the assignment of 3m³ arcs to 3 cycles into 6 conditional rules using permutations of "012" requires understanding symmetry. The insight that each vertex's three outgoing arcs should be assigned to the three cycles via a permutation — and that the permutation depends only on fiber index and one coordinate — is a representational leap that no reward signal can guide toward.
The rules work for all odd m because the conditions (s=0, s=m−1, i=m−1, j=m−1) are defined relative to m, not using absolute values. Moving from "this works for m=3, 5, 7" to "this works for all odd m" requires logical deduction about parameterized families — a form of reasoning RL does not perform.
In explorations 16–24, Claude used simulated annealing to find solutions for small m, then looked at the solutions to extract patterns. This is meta-reasoning — using computation to generate data, then reasoning about that data. RL just runs the computation; it doesn't step back and ask "what do these solutions have in common?"
Three approaches partially bridge the gap, each with fundamental limitations:
Train an agent to write a program (like Knuth's C code) that generates valid Hamiltonian cycles for arbitrary m. The action space would be token-level code generation; the reward would be running the program for m=3, 5, 7, … and checking validity.
Problem: The search space over program tokens is astronomically larger than finding a cycle for one m. The agent would need to discover fiber structure, direction strings, and conditional rules through trial and error over code tokens — without any guidance toward the right abstractions.
Let the agent propose parametric rules (e.g., "if s=0 and j=m−1, use direction '012'") and verify them computationally for increasing m values. Reward: +1 for each m where the rules produce valid decompositions.
This is closest to what Claude actually did in explorations 16–24. But the critical step was the abstraction: recognizing that the pattern depends on fiber index and one coordinate. An RL agent over rule templates would need this abstraction space pre-defined — which raises the question of who supplies the insight.
A multi-level system where:
- Level 1: RL agents find Hamiltonian cycles for individual m values,
- Level 2: policies are trained across several m (e.g., m=3 through m=11),
- Level 3: a symbolic layer extracts general rules from the trained policies.
This mirrors Claude's actual path. But the Level 2 → 3 transition requires symbolic abstraction — going from "here are 5 trained neural network policies" to "the decision depends on (i+j+k) mod m and whether i=m−1." This is an open problem in AI.
| Capability | RL | LLM reasoning + tools | Human mathematician |
|---|---|---|---|
| Find one cycle for one m | Yes (99%+) | Yes | Yes (tedious) |
| Find cycles for many m | Yes (with transfer) | Yes | Yes |
| Discover fiber structure | No | Yes (Claude did it) | Yes |
| Invent direction strings | No | Yes (Claude did it) | Yes |
| Prove universal construction | No | Yes (Claude did it) | Yes |
| Scale to m=11+ efficiently | Yes (with transfer) | Instant (closed-form rules) | Instant (closed-form rules) |
| Component | Description |
|---|---|
| State | Current vertex (i,j,k) + visited set + step count |
| Actions | {bump i, bump j, bump k} — always 3 choices |
| Reward | +1 new vertex, +100 complete cycle, −10 revisit |
| Goal | Find a Hamiltonian cycle (visit all m³ vertices exactly once, return to start) |
| Not the goal | Prove that exactly 3 disjoint cycles exist for all m |
| Best algorithm (m=3) | GNN-PPO: 99.42% success, first solve ~ep 10 |
| Best algorithm (m=5) | PPO: 99.48% success, first solve ~ep 100 |
| Best algorithm (m=7) | PPO: 99.11% success, first solve ~ep 250 |
| Best algorithm (m=9) | PPO: 99.01% success, first solve ~ep 215 |
| Best algorithm (m=11) | PPO+transfer: 100% success; from scratch: 97.79% |
| PPO scaling trend | 97–99% from scratch, 100% with transfer (tested m=3–11) |
| Transfer learning | Zeropad m=3 → m=7/9/11: perfect 100% from ep 1 |
| Practical RL limit | m ≈ 13–15 with transfer (revised after m=11 results) |