RL for Hamiltonian Cycle Discovery

How reinforcement learning is structured to find Hamiltonian cycles in the m-ary cube digraph — and experimental results for m=3.
Companion page to Claude's Cycles | Training script: rl_hamiltonian.py | W&B: gitcg/rl-hamiltonian

Problem Context

Consider a directed graph (digraph) whose vertices are tuples (i, j, k) with each coordinate in {0, 1, …, m−1}. There are m³ vertices total. Each vertex has exactly 3 outgoing arcs, called bumps: incrementing i, j, or k by 1 (mod m). Claude proved analytically that this digraph can always be decomposed into exactly 3 edge-disjoint Hamiltonian cycles.

The RL exercise asks a different question: can a learned agent rediscover a Hamiltonian cycle purely through trial and error?

MDP Formulation

The problem is cast as a Markov Decision Process (MDP) with the following components:

State Space

A state consists of three components:

  1. The current vertex (i, j, k)
  2. The set of vertices visited so far
  3. The step count

The full state space is enormous. The visited-set component alone has 2^(m³) possible configurations. For m=3, that is 2²⁷ ≈ 134 million visited-set states, each combined with 27 possible current vertices. This combinatorial explosion is the central difficulty.
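The counts above are easy to check directly; a quick sketch of the combinatorics:

```python
def state_counts(m: int):
    """Vertices and visited-set configurations of the m-ary cube MDP."""
    vertices = m ** 3
    visited_sets = 2 ** vertices      # one bit per vertex
    return vertices, visited_sets

v, s = state_counts(3)
print(v, s)   # → 27 134217728  (≈ 1.34e8 visited-set states)
```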

Action Space

At every step the agent chooses one of 3 actions:

  1. Bump i — set i ← (i + 1) mod m
  2. Bump j — set j ← (j + 1) mod m
  3. Bump k — set k ← (k + 1) mod m

The action space is fixed at {0, 1, 2} regardless of m. This is unusually small — the difficulty comes entirely from the state space, not the action space.

Reward Function

Event | Reward | Effect
Visit a new vertex | +1 | Episode continues
Complete a Hamiltonian cycle (return to start after visiting all m³ vertices) | +100 | Episode ends (success)
Revisit a vertex | −10 | Episode ends (failure)

The reward is sparse: the large +100 bonus only arrives if the agent perfectly sequences all m³ steps without revisiting any vertex. For m=3 the agent must make 27 correct decisions in a row; for m=5, it must make 125.

Episode Structure

Start at (0,0,0)
      |
      v
+---> Choose bump: i, j, or k
|           |
|           v
|     Move to new vertex (i',j',k')
|           |
|           +-- All m^3 visited AND back at start? --YES--> Reward: +100, EPISODE ENDS (success)
|           |
|           +-- Already visited? --YES--> Reward: -10, EPISODE ENDS (failure)
|           |
|           +-- Otherwise: Reward: +1
|                       |
+-----------------------+
       (next step)
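The loop above can be written as a minimal environment. This is an illustrative sketch; the class name and API are my own assumptions, not the actual code in rl_hamiltonian.py:

```python
class CubeCycleEnv:
    """Minimal m-ary cube environment with the reward scheme described above."""

    def __init__(self, m: int = 3):
        self.m = m
        self.reset()

    def reset(self):
        self.pos = (0, 0, 0)
        self.visited = {self.pos}      # the start vertex counts as visited
        return self.pos

    def step(self, action: int):
        # action 0/1/2 bumps coordinate i/j/k by 1 (mod m)
        i, j, k = self.pos
        di, dj, dk = [(1, 0, 0), (0, 1, 0), (0, 0, 1)][action]
        nxt = ((i + di) % self.m, (j + dj) % self.m, (k + dk) % self.m)

        # Closing move: all m^3 vertices visited and we return to the start.
        if len(self.visited) == self.m ** 3 and nxt == (0, 0, 0):
            return nxt, 100, True          # success
        if nxt in self.visited:
            return nxt, -10, True          # premature revisit: failure
        self.visited.add(nxt)
        self.pos = nxt
        return nxt, 1, False               # new vertex: episode continues
```

A successful m=3 episode is exactly 27 steps: 26 new-vertex rewards followed by the +100 closing move.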

What RL Does vs. What RL Does Not Do

What RL Does

What Claude's Reasoning Did

Key insight: The RL formulation is a complementary exercise — it asks whether a learned agent can rediscover what Claude found analytically. Even if the agent finds 3 disjoint cycles for a given m, that is an empirical demonstration, not a proof that such a decomposition exists for all m.

Experimental Results (m=3, 27 vertices)

We tested four algorithms on the m=3 case. All share the same environment, observation space (32-dim: coordinates, step count, fiber index, visited mask), reward function, and action masking. Seed: 42.

Algorithm | Episodes | Success Rate | First Solve | Runtime | W&B
REINFORCE | 10,000 | 1.16% | ~ep 150 | 32s | ssihf20k
PPO | 10,000 | 97.25% | ~ep 50 | 42s | 1q21fgzy
GNN-PPO | 5,000 | 99.42% | ~ep 10 | 203s | xsw7g6da
MCTS | 200 | 26.5% | ~ep 2 | 6s | ny82u4h3

Key Findings

1. PPO dominates REINFORCE

Both use the same MLP architecture (32 → 128 → 128 → 3), but PPO's clipped objective and entropy bonus completely prevent the mode collapse that killed REINFORCE. PPO reached 97.25% success rate vs. REINFORCE's 1.16%.

2. GNN-PPO achieves the highest success rate (99.42%)

The graph neural network policy (2-layer message-passing GNN with 64-dim messages) exploits the regular structure of the m-ary cube. It converged to near-perfect performance within ~100 episodes — the fastest among all learned policies. Only 29 failures in 5,000 episodes.

3. MCTS finds cycles without learning

Pure tree search (200 simulations/move, UCB1 selection) achieves 26.5% per-episode success with no training at all. It is expensive per episode but has zero training instability. Useful as a non-learned baseline.

4. REINFORCE suffers catastrophic mode collapse

REINFORCE found 116 cycles in episodes 1–1,100, then converged to a deterministic 24-step suboptimal path and never found another cycle for the remaining 8,900 episodes. This is the classic high-variance policy gradient failure mode: without entropy regularization, the policy becomes too confident too early.

Ablation: Action Masking Is Essential

We ran REINFORCE with three observation/masking configurations. Action masking — zeroing out the probability of actions that lead to already-visited vertices — was the single most impactful design decision.

Variant | Obs Dim | Action Masking | Best Path | Successes
No visited mask | 5 | No | 23/27 | 0
With visited mask | 32 | No | 23/27 | 0
With mask + action masking | 32 | Yes | 27/27 | 116
Without action masking, no algorithm variant found a Hamiltonian cycle in 5,000 episodes. The agent would invariably revisit a vertex before reaching all 27. Action masking prunes the impossible branches, reducing the effective search space dramatically.
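A sketch of how such masking can be applied to the three bump logits (this is my reconstruction, not necessarily how rl_hamiltonian.py implements it); note the closing move back to the start must stay legal once every vertex has been visited:

```python
import math

def bump(v, a, m):
    """Apply bump a (0/1/2) to vertex v = (i, j, k) modulo m."""
    i, j, k = v
    return ((i + (a == 0)) % m, (j + (a == 1)) % m, (k + (a == 2)) % m)

def masked_probs(logits, v, visited, m, n_total):
    """Zero out the probability of actions that lead to an already-visited
    vertex, except the cycle-closing move back to (0,0,0)."""
    mask = []
    for a in range(3):
        nxt = bump(v, a, m)
        closing = (len(visited) == n_total and nxt == (0, 0, 0))
        mask.append(0.0 if (nxt in visited and not closing) else 1.0)
    exps = [math.exp(l) * f for l, f in zip(logits, mask)]
    total = sum(exps)
    if total == 0:                 # dead end: every move revisits a vertex
        return [1 / 3] * 3         # fall back to uniform; episode will fail
    return [e / total for e in exps]
```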

Algorithm Details

REINFORCE

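As a concrete reference, a minimal sketch of the vanilla REINFORCE update (no entropy regularization, matching the failure mode described above). The tiny linear-softmax policy here is an illustrative stand-in for the actual 32 → 128 → 128 → 3 MLP:

```python
import math

def softmax(logits):
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over the episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

class LinearPolicy:
    """Linear-softmax policy; stands in for the MLP used in the experiments."""

    def __init__(self, n_features, n_actions, lr=0.1):
        self.w = [[0.0] * n_features for _ in range(n_actions)]
        self.lr = lr

    def probs(self, x):
        return softmax([sum(wi * xi for wi, xi in zip(row, x)) for row in self.w])

    def update(self, episode, returns):
        # REINFORCE: w += lr * G_t * grad log pi(a_t | s_t)
        for (x, a), g in zip(episode, returns):
            p = self.probs(x)
            for act in range(len(self.w)):
                grad = (1.0 if act == a else 0.0) - p[act]
                for f in range(len(x)):
                    self.w[act][f] += self.lr * g * grad * x[f]
```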
PPO (Proximal Policy Optimization)

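A sketch of PPO's clipped surrogate objective plus entropy bonus, the two ingredients credited above with preventing mode collapse (the coefficients are common defaults, not necessarily those used in training):

```python
import math

def ppo_objective(ratio, advantage, probs, eps=0.2, ent_coef=0.01):
    """Clipped surrogate + entropy bonus (quantity to be maximized).

    ratio: pi_new(a|s) / pi_old(a|s) for the sampled action
    advantage: estimated advantage of that action
    probs: current action distribution, used for the entropy bonus
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return surrogate + ent_coef * entropy
```

The clip keeps a single update from moving the policy too far; the entropy term keeps probabilities away from the deterministic collapse that killed REINFORCE.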
GNN-PPO (Graph Neural Network + PPO)

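A sketch of the kind of message passing a GNN policy performs over the cube digraph's bump arcs, assuming mean aggregation (layer sizes and details here are illustrative, not taken from the actual implementation):

```python
import numpy as np

def cube_edges(m):
    """Directed bump arcs (u -> v) of the m-ary cube digraph."""
    edges = []
    for i in range(m):
        for j in range(m):
            for k in range(m):
                u = (i * m + j) * m + k
                for ni, nj, nk in (((i + 1) % m, j, k),
                                   (i, (j + 1) % m, k),
                                   (i, j, (k + 1) % m)):
                    edges.append((u, (ni * m + nj) * m + nk))
    return edges

def message_passing_layer(h, edges, w_msg, w_self):
    """One round: each node averages transformed in-neighbor features."""
    n, d_out = h.shape[0], w_msg.shape[1]
    agg = np.zeros((n, d_out))
    deg = np.zeros(n)
    for u, v in edges:
        agg[v] += h[u] @ w_msg
        deg[v] += 1
    return np.tanh(h @ w_self + agg / np.maximum(deg, 1)[:, None])
```

Stacking two such rounds gives each vertex a view of its 2-hop neighborhood, which is what lets the policy exploit the cube's regular structure.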
MCTS (Monte Carlo Tree Search)

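The UCB1 selection rule used in the tree search can be sketched as follows (the dict-based node representation is my own):

```python
import math

def ucb1_select(children, c=1.4):
    """Pick the child maximizing mean value + exploration bonus.

    children: list of dicts with 'visits' and 'value' (sum of rollout returns).
    Unvisited children are expanded first.
    """
    total = sum(ch["visits"] for ch in children)
    best, best_score = None, -math.inf
    for idx, ch in enumerate(children):
        if ch["visits"] == 0:
            return idx
        score = (ch["value"] / ch["visits"]
                 + c * math.sqrt(math.log(total) / ch["visits"]))
        if score > best_score:
            best, best_score = idx, score
    return best
```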
Experimental Results (m=5, 125 vertices)

Moving from m=3 to m=5 increases the cycle length from 27 to 125 steps and the observation dimension from 32 to 130 (5 features + 125-bit visited mask). The theoretical state space explodes from 2²⁷ to 2¹²⁵, a factor of roughly 10²⁹. Can the same algorithms still find Hamiltonian cycles?

Algorithm | Episodes | Success Rate | First Solve | Runtime | W&B
PPO | 20,000 | 99.48% | ~ep 100 | ~5 min | 7cp4e667
REINFORCE | 20,000 | 41.83% | ~ep 350 | ~8 min | jpucotko
GNN-PPO | 10,000 | DNF (too slow) | – | >8 min/ep | c9tzft69
MCTS | Not run | – | – | – | –

Key Findings (m=5)

1. PPO scales beautifully

PPO achieved 99.48% success rate on 125 vertices — nearly identical to its 97.25% on m=3. First Hamiltonian cycle found around episode 100, near-perfect convergence by episode 200. The simple MLP actor-critic with action masking handles the 4.6× longer episode without difficulty.

2. REINFORCE shows a dramatic phase transition

REINFORCE exhibited a striking three-phase training profile on m=5:

  1. Early discovery: first cycle around episode 350, but successes stayed rare (~0.6%)
  2. Long plateau: roughly 11,400 episodes stuck in a near-deterministic suboptimal mode
  3. Sudden transition: around episode 11,800, success jumped to essentially 100%

This "sudden phase transition" is a remarkable example of how REINFORCE can spend thousands of episodes in a near-optimal but incorrect mode, then abruptly shift to the correct policy. The overall 41.83% rate is misleading: it was effectively 0.6% for 11,700 episodes, then 100%.

3. GNN-PPO and MCTS hit scaling walls

GNN-PPO could not complete a single episode — the per-node message-passing loop with 125 nodes and 750 edges (bidirectional) takes >8 minutes per episode in Python. A sparse/batched implementation (e.g., PyTorch Geometric) would be needed. MCTS was not attempted: random rollouts on 125 vertices have near-zero probability of finding a Hamiltonian cycle without a learned value function.

m=3 vs. m=5 Comparison

Metric | m=3 (27 vertices) | m=5 (125 vertices) | Growth factor
Vertices | 27 | 125 | 4.6×
Observation dim | 32 | 130 | 4.1×
Theoretical state space | 2²⁷ ≈ 10⁸ | 2¹²⁵ ≈ 10³⁷ | 10²⁹×
PPO success rate | 97.25% | 99.48% | ≈1× (no degradation)
PPO first solve | ~ep 50 | ~ep 100 | 2×
PPO training time | 42s | ~5 min | ~7×
REINFORCE success rate | 1.16% (collapsed) | 41.83% (phase transition) | Different failure mode
Surprising result: Despite the state space growing by a factor of 10²⁹, PPO's success rate actually improved from m=3 to m=5. The explanation is that the MLP never explicitly enumerates the state space; it learns a compressed policy that generalizes over visited-set patterns. The real scaling bottleneck is episode length (27 → 125 steps), which only caused a 2× increase in episodes-to-first-solve and a 7× increase in wall-clock time.

Experimental Results (m=7, 343 vertices)

At m=7 the cycle length reaches 343 steps and the observation grows to 348 dimensions. The theoretical state space is 2³⁴³ ≈ 10¹⁰³, a number larger than the estimated number of atoms in the observable universe. Does PPO's scaling hold?

Algorithm | Episodes | Success Rate | First Solve | W&B
PPO | 30,000 | 99.11% | ~ep 250 | anf6fgm2
REINFORCE | 30,000 | 12.55% | ~ep 2,700 | 30xx3oku
GNN-PPO | Not run | – | – | –
MCTS | Not run | – | – | –

Key Findings (m=7)

1. PPO continues to scale effortlessly

PPO achieved 99.11% success rate on 343 vertices — comparable to m=3 (97.25%) and m=5 (99.48%). First cycle found at ~ep 250, near-perfect by ep 300. The simple 2-layer MLP with action masking handles the 348-dimensional observation and 343-step episodes without any architectural changes. This is now three data points confirming PPO's ~99% success rate is essentially independent of m.

2. REINFORCE phase transition takes even longer

REINFORCE's characteristic three-phase pattern continued at m=7 with an even longer plateau:

  1. Early discovery: first cycle around episode 2,700, with successes remaining sporadic
  2. Long plateau: roughly 23,700 episodes in a suboptimal mode
  3. Sudden transition: around episode 26,400, after which success was essentially 100%

3. REINFORCE phase transition scales predictably

The phase transition pattern is now consistent across three scales:

m | Vertices | Plateau duration | Transition episode | Post-transition
3 | 27 | No recovery | – | Mode collapse (0%)
5 | 125 | ~11,400 eps | ~11,800 | 100%
7 | 343 | ~23,700 eps | ~26,400 | 100%

The plateau roughly doubles from m=5 to m=7 (while vertices increase 2.7×), suggesting REINFORCE can solve arbitrarily large instances given enough episodes — but its training budget grows rapidly and unpredictably. Interestingly, m=3 never recovered at all, possibly because the 10,000-episode budget was insufficient.

Scaling Across m=3, m=5, m=7

Experimental Results (m=9, 729 vertices)

At m=9 the cycle length reaches 729 steps: the agent must make 729 consecutive correct decisions. The observation grows to 734 dimensions and the theoretical state space is 2⁷²⁹ ≈ 10²¹⁹. This scale was previously estimated as "uncertain" and at the frontier of feasibility.

Algorithm | Episodes | Success Rate | First Solve | W&B
PPO (from scratch) | 30,000 | 99.01% | ~ep 215 | 7mp85z0c
Zeropad m=3 → 9 | 30,000 | 100.00% | ep 1 | ha2uub4u
Partial m=3 → 9 | 30,000 | 99.99% | ep 1 | uk2o72tf
Zeropad m=5 → 9 | 30,000 | 99.91% | ~ep 1 | 06im3aj8
Partial m=5 → 9 | 30,000 | 99.13% | ~ep 10 | pv1zandf

Key Findings (m=9)

1. PPO from scratch still achieves ~99% at 729 vertices

PPO's success rate (99.01%) is consistent with m=3 (97.25%), m=5 (99.48%), and m=7 (99.11%). This is now four data points confirming that PPO's performance is essentially independent of m. First solve around episode 215 — actually faster than m=7 (~ep 250).

2. Zeropad m=3 → m=9 achieves perfect 100% again

Pretraining on tiny m=3 (27 vertices) with zero-padded observations and transferring to m=9 (729 vertices) — a 27× increase in problem size — produces a flawless 30,000/30,000 success rate from episode 1. This replicates the perfect result seen at m=7, confirming the pattern is robust.

3. m=3 is a better pretraining source than m=5

Both zeropad and partial transfer from m=3 outperform their m=5 counterparts at m=9. Zeropad m=3 achieves 100.00% vs. m=5's 99.91%; partial m=3 achieves 99.99% vs. m=5's 99.13%. This may be because m=3 pretraining converges faster (smaller environment) and the learned strategy is equally general — further evidence that the navigation skill is completely scale-invariant.

Scaling Across m=3, m=5, m=7, m=9

Metric | m=3 (27v) | m=5 (125v) | m=7 (343v) | m=9 (729v) | Trend
Obs dim | 32 | 130 | 348 | 734 | ~m³
State space | 10⁸ | 10³⁷ | 10¹⁰³ | 10²¹⁹ | Irrelevant
PPO from scratch | 97.25% | 99.48% | 99.11% | 99.01% | Flat ~99%
PPO first solve | ~ep 50 | ~ep 100 | ~ep 250 | ~ep 215 | Sub-linear growth
Zeropad m=3 → target | – | – | 100.00% | 100.00% | Perfect

Experimental Results (m=11, 1,331 vertices)

At m=11, the agent must make 1,331 consecutive correct decisions. The observation grows to 1,336 dimensions and the theoretical state space reaches 2¹³³¹ ≈ 10⁴⁰⁰. This was the previous estimate for the frontier of RL feasibility.

Algorithm | Episodes | Success Rate | First Solve | W&B
PPO (from scratch) | 30,000 | 97.79% | ~ep 350 | j5n9b3kg
Zeropad m=3 → 11 | 30,000 | 100.00% | ep 1 | ls5vwqoi
Zeropad m=5 → 11 | 30,000 | 100.00% | ep 1 | 36snd57q
Partial m=3 → 11 | 30,000 | 99.97% | ~ep 1 | bm1gzwsf
Partial m=5 → 11 | 30,000 | 99.58% | ~ep 10 | xtq619eo

Key Findings (m=11)

1. PPO from scratch shows first signs of degradation

PPO's from-scratch success rate drops to 97.79% — still high, but noticeably below the ~99% plateau seen at m=5/7/9. First solve moves to ~ep 350, up from ~215 at m=9. The 1,331-step episode is starting to stress credit assignment and the 128-unit hidden layer is compressing a 1,336-dim input. This is the first scale where from-scratch PPO shows wear.

2. Zeropad transfer achieves perfect 100% from both m=3 and m=5

Both zeropad sources achieve flawless 30,000/30,000 at m=11 — a 49× scale-up from m=3 (27 → 1,331 vertices). This is the first scale where m=5 zeropad matches m=3, both achieving perfect results. Transfer learning completely compensates for the from-scratch degradation.

3. The baseline-to-transfer gap is widening

As from-scratch PPO's success rate slowly declines with scale (97.79% at m=11 vs 99.48% at m=5), transfer learning's advantage grows from marginal to meaningful. At m=11, transfer provides both faster convergence (ep 1 vs ep 350) and higher final performance (100% vs 97.79%). Transfer is no longer just a convenience — it is becoming essential.

Scaling Across m=3 to m=11

Metric | m=3 | m=5 | m=7 | m=9 | m=11 | Trend
Vertices | 27 | 125 | 343 | 729 | 1,331 |
Obs dim | 32 | 130 | 348 | 734 | 1,336 | ~m³
State space | 10⁸ | 10³⁷ | 10¹⁰³ | 10²¹⁹ | 10⁴⁰⁰ | Irrelevant
PPO from scratch | 97.25% | 99.48% | 99.11% | 99.01% | 97.79% | Slight decline at m=11
PPO first solve | ~50 | ~100 | ~250 | ~215 | ~350 | Growing
Zeropad m=3 | – | – | 100% | 100% | 100% | Perfect across all

Transfer Learning: Does Training on Small m Help?

A key question from the Claude's Cycles page: can an RL agent trained on small m transfer that experience to larger m? If so, it would suggest the agent has learned the underlying structure of the digraph, not just memorized a specific solution. We tested this on m=7 (results below) and confirmed it scales to m=9 (see m=9 results above).

The Challenge: Different Observation Sizes

The observation includes a visited-vertex mask whose size is m³, so weights cannot be directly shared between different m values:

Source | Source obs dim | Target | Target obs dim | Problem
m=3 | 32 | m=7 | 348 | Input layer incompatible
m=5 | 130 | m=7 | 348 | Input layer incompatible
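The observation sizes quoted throughout the page follow one formula: 3 coordinates, a step count, a fiber index, and the m³-bit visited mask. A quick check:

```python
def obs_dim(m: int) -> int:
    """3 coordinates + step count + fiber index + m^3-bit visited mask."""
    return 3 + 1 + 1 + m ** 3

print([(m, obs_dim(m)) for m in (3, 5, 7, 9, 11)])
# → [(3, 32), (5, 130), (7, 348), (9, 734), (11, 1336)]
```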

Two Transfer Approaches

Partial Transfer

Keep the hidden layers (128×128) and output heads (actor + critic). Reinitialize the input projection for the new obs_dim.

Hypothesis: hidden layers encode a reusable "navigation strategy."
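A sketch of the partial-transfer surgery, with a dict of NumPy weight matrices standing in for the actor-critic (key names like w_in are my own, not those of the training script):

```python
import numpy as np

def partial_transfer(src, target_obs_dim, hidden=128, seed=0):
    """Keep hidden layers and output heads; reinitialize only the input
    projection to match the new observation dimension."""
    rng = np.random.default_rng(seed)
    dst = {k: v.copy() for k, v in src.items()}
    # Fresh input layer sized for the target m's observation.
    dst["w_in"] = rng.normal(0.0, 0.1, size=(target_obs_dim, hidden))
    return dst
```

For example, a model pretrained on m=3 (32-dim input) keeps its 128×128 hidden layers while the input layer is rebuilt for m=7's 348-dim observation.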

Zero-Padded Transfer

During pretraining, pad observations to the target size (e.g., 348-dim for m=7). All weights transfer directly — no architecture change.

Hypothesis: network learns to ignore zero-padded dimensions.
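Zero-padded transfer needs no weight surgery at all, only padded observations during pretraining; a minimal sketch:

```python
import numpy as np

def pad_obs(obs, target_dim):
    """Zero-pad a small-m observation so every weight matrix already has the
    target m's input size and can transfer unchanged."""
    out = np.zeros(target_dim, dtype=float)
    out[: len(obs)] = obs
    return out
```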

Results

Method | Source | Finetune Success Rate | First Solve | W&B
From scratch (baseline) | – | 99.11% | ~ep 250 | anf6fgm2
Partial transfer | m=3 | 99.99% | ep 1 | k3fv51zi
Partial transfer | m=5 | 99.99% | ep 1 | noxa0p61
Zero-padded transfer | m=3 | 100.00% | ep 1 | 8ddfz8i5
Zero-padded transfer | m=5 | 99.96% | ep 1 | 5nqe3nqo

Key Findings

1. Transfer eliminates the warmup period entirely

The baseline PPO needs ~215–250 episodes to find its first Hamiltonian cycle on m=7/m=9. All transfer methods solve from episode 1 of finetuning — confirmed on both m=7 (343 vertices) and m=9 (729 vertices). The "navigation strategy" learned on m=3 (27 vertices) transfers across a 27× increase in problem size.

2. Zero-padded m=3 → m=7 achieves perfect 100%

Pretraining on tiny m=3 with zero-padded observations, then finetuning on m=7 or m=9, produces a perfect 30,000/30,000 success rate in both cases. This is better than from-scratch PPO (~99%) and every other transfer variant. The zero-padding preserves the input projection's understanding of visited-mask patterns, giving it a head start.

3. Even partial transfer works immediately

Partial transfer only preserves hidden layers — the input projection is randomly reinitialized for the new 348-dim input. Despite this, it solves m=7 from episode 1 at 99.99%. This means the hidden layers encode a generalizable navigation policy largely independent of the input encoding.

4. Source problem size barely matters

Both m=3 and m=5 pretrained models transfer equally well. The 27-vertex problem teaches essentially the same strategy as the 125-vertex problem. The core skill — "how to use the action mask and visited information to navigate a 3-cube digraph" — is scale-invariant.

Confirmed through m=11: Zero-padded m=3 transfer achieves perfect 100% at m=7, m=9, and m=11 (a 49× scale-up from 27 to 1,331 vertices). The recommended approach for any target m: pretrain on m=3 with zero-padded observations to the target dimension, then finetune. No curriculum chain needed — the tiny m=3 problem encodes everything.

Scaling Challenges & Feasibility Estimate

How State Space and Training Time Grow

As m increases, three quantities grow at very different rates, each creating a different bottleneck:

1. Observation dimension: O(m³)

The visited mask is an m³-bit vector, so the network input grows cubically. For m=3 the observation is 32-dimensional; for m=5 it is 130; for m=7 it is 348; for m=9 it is 734; for m=11 it is 1,336. PPO handles all of these without architectural changes, though from-scratch performance shows first signs of stress at 1,336 dimensions.

2. Episode length: O(m³)

A successful episode requires exactly m³ steps. Longer episodes mean: (a) more forward passes per episode, increasing wall-clock time linearly; (b) longer credit-assignment horizons, making it harder for the agent to attribute the final +100 reward to early actions; (c) more opportunities for a single wrong decision to terminate the episode.

Empirical growth: PPO episodes-to-first-solve: 50 (m=3) → 100 (m=5) → 250 (m=7) → 215 (m=9) → 350 (m=11). The growth is sub-linear relative to vertex count, but m=11 shows the first uptick. Transfer learning eliminates warmup entirely (ep 1 for all tested scales m=7 through m=11).

3. Theoretical state space: O(2^(m³))

The visited-set tracking creates an exponentially growing state space. However, results across m=3, 5, 7, 9, and 11 consistently show that this theoretical explosion does not translate to practical difficulty for function-approximation RL. PPO maintains 97–99% success rate as the state space grows from 10⁸ to 10⁴⁰⁰. The MLP never enumerates states; it learns to generalize. The real bottleneck is episode length and credit assignment, not raw state-space size.

m | Vertices | Obs dim | Visited-set states | Status
3 | 27 | 32 | 2²⁷ ≈ 10⁸ | Solved: all 4 algorithms (PPO 97%, GNN 99%)
5 | 125 | 130 | 2¹²⁵ ≈ 10³⁷ | Solved: PPO 99.5%, REINFORCE 42%
7 | 343 | 348 | 2³⁴³ ≈ 10¹⁰³ | Solved: PPO 99.1%, REINFORCE 12.6%
9 | 729 | 734 | 2⁷²⁹ ≈ 10²¹⁹ | Solved: PPO 99.0%, transfer 100%
11 | 1,331 | 1,336 | 2¹³³¹ ≈ 10⁴⁰⁰ | Solved: PPO 97.8%, transfer 100%

Estimated Practical Limit: m ≈ 13–15 (with transfer)

With five confirmed scales (m=3,5,7,9,11) and perfect transfer learning through m=11, the estimated limit continues to rise. However, m=11 is the first scale showing from-scratch degradation (97.79%), suggesting we are approaching a capacity limit.

The key lesson from m=3 through m=11: the exponential state space is a red herring. What actually limits RL is network capacity relative to the observation dimension and episode length for credit assignment. PPO from scratch maintains 97–99% success as the state space grows from 10⁸ to 10⁴⁰⁰. Transfer learning from m=3 provides perfect 100% results at every tested scale up to m=11 (49× scale increase). The practical limit with transfer is likely where the 128-unit hidden layer can no longer compress the policy, around m=13–15.

Can RL Find the Proof, Not Just Cycles?

Everything above shows that RL can find Hamiltonian cycles for specific m values. But Claude didn't just find cycles — Claude found a universal construction (6 conditional rules) that provably works for all odd m > 1. Can RL replicate that?

How Claude Actually Found the Proof

As documented in Knuth's paper and the discovery narrative, Claude's path over 31 explorations followed a specific intellectual trajectory:

Phase | Explorations | What happened | Intellectual move
1 | 1–8 | Discovered serpentine traversal and the fiber structure s = (i+j+k) mod m | Inventing a new abstraction
2 | 9–15 | Realized fiber-based rules could govern all three cycles; invented "direction strings" | Representational insight
3 | 16–24 | Used simulated annealing on small m to find working decompositions; extracted patterns | Computational search + pattern extraction
4 | 25–31 | Crystallized 6 conditional rules; argued they work for all odd m > 1 | Generalization + universality argument

The proof has two parts: (a) a construction (the 6 bump rules in the rule table), and (b) a correctness argument (each cycle visits all m³ vertices exactly once, and the 3 cycles partition all 3m³ arcs). This is also what a human mathematician would do given enough insight and patience.

Why Standard RL Cannot Find the Proof

The RL algorithms on this page operate in a fundamentally different space than proof discovery:

Aspect | RL (what we did) | Proof discovery (what Claude did)
Output | A sequence of m³ bumps for one m | Symbolic rules + correctness argument for all m
Action space | {bump i, bump j, bump k} | Mathematical reasoning steps (infinite, unstructured)
State | Current vertex + visited set | Current mathematical knowledge and conjectures
Reward | +1 per vertex, +100 for cycle | No clear signal until the proof is complete
Generalization | Solves one m at a time | Must work for all m simultaneously
Verification | Check if all vertices visited | Check logical validity of argument

The Critical Intellectual Moves RL Cannot Make

1. Inventing the fiber concept

Recognizing s = (i+j+k) mod m as the organizing principle was an act of mathematical creativity — defining a new variable that simplifies the problem. RL has no mechanism for inventing new state representations. The agent sees vertices and bumps; it cannot decide "I should group these vertices by the sum of their coordinates modulo m."

2. The direction string abstraction

Compressing the assignment of 3m³ arcs to 3 cycles into 6 conditional rules using permutations of "012" requires understanding symmetry. The insight that each vertex's three outgoing arcs should be assigned to the three cycles via a permutation — and that the permutation depends only on fiber index and one coordinate — is a representational leap that no reward signal can guide toward.

3. The universality argument

The rules work for all odd m because the conditions (s=0, s=m−1, i=m−1, j=m−1) are defined relative to m, not using absolute values. Moving from "this works for m=3, 5, 7" to "this works for all odd m" requires logical deduction about parameterized families — a form of reasoning RL does not perform.

4. Using computation as a tool, not an end

In explorations 16–24, Claude used simulated annealing to find solutions for small m, then looked at the solutions to extract patterns. This is meta-reasoning — using computation to generate data, then reasoning about that data. RL just runs the computation; it doesn't step back and ask "what do these solutions have in common?"

RL-Adjacent Approaches That Move Toward Proof

Three approaches partially bridge the gap, each with fundamental limitations:

Approach 1: Program Synthesis RL

Train an agent to write a program (like Knuth's C code) that generates valid Hamiltonian cycles for arbitrary m. The action space would be token-level code generation; the reward would be running the program for m=3, 5, 7, … and checking validity.

Problem: The search space over program tokens is astronomically larger than finding a cycle for one m. The agent would need to discover fiber structure, direction strings, and conditional rules through trial and error over code tokens — without any guidance toward the right abstractions.

Approach 2: Conjecture-and-Verify RL

Let the agent propose parametric rules (e.g., "if s=0 and j=m−1, use direction '012'") and verify them computationally for increasing m values. Reward: +1 for each m where the rules produce valid decompositions.

This is closest to what Claude actually did in explorations 16–24. But the critical step was the abstraction: recognizing that the pattern depends on fiber index and one coordinate. An RL agent over rule templates would need this abstraction space pre-defined, which raises the question of who supplies the insight.

Approach 3: Multi-Level Abstraction Learning (Most Ambitious)

A multi-level system where:

  1. Level 1: RL agents find Hamiltonian cycles for individual m values, as above
  2. Level 2: trained policies are collected across several m values
  3. Level 3: a symbolic layer abstracts what those policies have in common into explicit rules

This mirrors Claude's actual path. But the Level 2 → 3 transition requires symbolic abstraction — going from "here are 5 trained neural network policies" to "the decision depends on (i+j+k) mod m and whether i=m−1." This is an open problem in AI.

What Would Actually Work

The right paradigm is not RL but LLM-based mathematical reasoning with tool use. Claude's 31-exploration trajectory was a form of extended chain-of-thought with self-correction: explore small cases (tool use), notice patterns (reasoning), abstract into rules (creativity), verify computationally (tool use), argue universality (logical deduction). This interleaving of symbolic reasoning and computational verification is what a human mathematician would do — and it is what LLMs can approximate. RL's strength (optimizing sequential decisions against a reward signal) is irrelevant when the challenge is inventing the right abstractions.

Capability | RL | LLM reasoning + tools | Human mathematician
Find one cycle for one m | Yes (99%+) | Yes | Yes (tedious)
Find cycles for many m | Yes (with transfer) | Yes | Yes
Discover fiber structure | No | Yes (Claude did it) | Yes
Invent direction strings | No | Yes (Claude did it) | Yes
Prove universal construction | No | Yes (Claude did it) | Yes
Scale to m=11+ efficiently | Yes (with transfer) | Instant (closed-form rules) | Instant (closed-form rules)

Summary

Component | Description
State | Current vertex (i,j,k) + visited set + step count
Actions | {bump i, bump j, bump k} (always 3 choices)
Reward | +1 new vertex, +100 complete cycle, −10 revisit
Goal | Find a Hamiltonian cycle (visit all m³ vertices exactly once, return to start)
Not the goal | Prove that exactly 3 disjoint cycles exist for all m
Best algorithm (m=3) | GNN-PPO: 99.42% success, first solve ~ep 10
Best algorithm (m=5) | PPO: 99.48% success, first solve ~ep 100
Best algorithm (m=7) | PPO: 99.11% success, first solve ~ep 250
Best algorithm (m=9) | PPO: 99.01% success, first solve ~ep 215
Best algorithm (m=11) | PPO+transfer: 100% success; from scratch: 97.79%
PPO scaling trend | 97–99% from scratch, 100% with transfer (tested m=3–11)
Transfer learning | Zeropad m=3 → m=7/9/11: perfect 100% from ep 1
Practical RL limit | m ≈ 13–15 with transfer (revised after m=11 results)

Raw Experiment Logs & Resources