Wordle & Fibble Arenas for LLMs

Fibble adds lies to Wordle's color clues. Fibble has one lie per row, and Fibble² through Fibble⁵ progressively increase the number of lies per row, stress-testing LLM reasoning under deception. Standard Wordle is included as the lie-free baseline.
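As a rough illustration of how lies could be layered on top of Wordle feedback, the sketch below scores a guess the standard Wordle way and then changes a chosen number of tiles to a different color. The function names, the `G`/`Y`/`X` color codes, and the choice to pick lying positions uniformly at random are all assumptions made for this example; the arenas' actual generator is not described on this page.

```python
import random
from collections import Counter

def true_feedback(guess: str, secret: str) -> str:
    """Standard Wordle scoring: G = green, Y = yellow, X = gray."""
    result = ["X"] * len(guess)
    remaining = Counter()  # secret letters not matched by a green tile
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            result[i] = "G"
        else:
            remaining[s] += 1
    for i, g in enumerate(guess):
        # A yellow tile consumes one unmatched copy of the letter.
        if result[i] != "G" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)

def fibble_feedback(guess: str, secret: str, lies: int, rng=random) -> str:
    """Return feedback in which exactly `lies` tiles show a wrong color.

    Assumption: lying positions and replacement colors are chosen
    uniformly at random.
    """
    shown = list(true_feedback(guess, secret))
    for i in rng.sample(range(len(shown)), lies):
        shown[i] = rng.choice([c for c in "GYX" if c != shown[i]])
    return "".join(shown)

if __name__ == "__main__":
    print("truth:", true_feedback("crane", "slate"))
    print("shown:", fibble_feedback("crane", "slate", lies=2))
```

With `lies=0` the function reduces to plain Wordle feedback, which matches the role of Wordle as the lie-free baseline.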

LLM Showdown

Pit up to 6 LLMs against each other on the same word

Game Type

You pick a secret word

Record

Preview

Cross-Arena Analysis

Visualize model performance across deception levels

Rankings

Win Rate Gradient

Latency vs Performance

Avg Guesses

Head to Head

Model Profile

What this shows: Each line tracks one model's rank (by win rate) as it moves across arenas from Wordle (0 lies) to Fibble⁵ (5 lies), plus an overall average. What to look for: Lines that stay flat are consistently ranked; lines that spike downward reveal models that collapse under deception. The "Overall" column (right of the dashed line) averages all six arenas, rewarding breadth of robustness.

Double-click a model to see its full profile

What this shows: Win rate (%) for each model as the number of lies per row increases from 0 (Wordle) to 5 (Fibble⁵). The black dashed line is the mean across all qualifying models. The shaded region marks the deception cliff (Fibble²–Fibble⁴). What to look for: Most models plummet at Fibble² (2 lies), revealing a phase transition where deception overwhelms reasoning. A few models (e.g., Gemini 3.1 Pro) resist longer. Watch for the Fibble⁵ recovery: some models partially rebound when all tiles lie, because total deception restores deterministic (inverted) information.

What this shows: Average number of guesses used by each model across arenas, from Wordle (0 lies) to Fibble⁵ (5 lies). Lower is better. The black dashed line is the mean across all qualifying models. The shaded region marks the deception cliff (Fibble²–Fibble⁴). What to look for: Models that stay low across arenas are efficient solvers. A sharp rise indicates the model is struggling with deception and burning through its allowed guesses. Compare with the Win Rate Gradient to see whether more guesses actually translate to wins.

🔍 Click an arena name on the x-axis to zoom in. Click again to reset.

What this shows: A comparison of up to five models across five dimensions.
Avg Win Rate = sum of per-arena win rates ÷ 6 (all six arenas; a missing arena counts as 0%).
Speed = based on average response time per API call (lower latency = faster).
Reasoning = Wordle win rate (honest clues, pure word-guessing ability).
Extended Reasoning = Fibble⁵ WR ÷ Wordle WR × 100. In Fibble⁵ all five clues are lies and this is known in advance, so the dimension measures reasoning under complete information: no deception detection is needed, only deeper logical deduction. A value above 100% means the model does better at Fibble⁵ than at Wordle.
Deception Robustness = 50% × Fibble² WR + 25% × Fibble³ WR + 25% × Fibble⁴ WR. These arenas require identifying which clues are lies. Fibble² is weighted higher because most models currently score 0% on Fibble³ and Fibble⁴; the weights may shift as models improve.
How to use: Pick two to five models from the dropdowns. The radar chart summarizes the normalized dimensions; the table provides exact numbers per arena.
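The win-rate-based dimensions can be sketched in a few lines. The arena keys, the function name, and the decision to return `None` for Extended Reasoning when the Wordle win rate is zero are assumptions for this example. Treating a missing arena as 0% is stated on this page only for Avg Win Rate; applying it to the other dimensions is also an assumption. Speed is left out because its exact formula is not given.

```python
from typing import Dict, Optional

# Hypothetical arena keys, ordered from 0 lies to 5 lies.
ARENAS = ["wordle", "fibble1", "fibble2", "fibble3", "fibble4", "fibble5"]

def radar_dimensions(win_rates: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Compute the win-rate-based dimensions from per-arena win rates in percent."""
    wr = {arena: win_rates.get(arena, 0.0) for arena in ARENAS}  # missing = 0%

    avg_win_rate = sum(wr.values()) / 6
    reasoning = wr["wordle"]
    # Assumption: the ratio is undefined (None) when the Wordle win rate is 0.
    extended = wr["fibble5"] / wr["wordle"] * 100 if wr["wordle"] > 0 else None
    robustness = 0.50 * wr["fibble2"] + 0.25 * wr["fibble3"] + 0.25 * wr["fibble4"]

    return {
        "avg_win_rate": avg_win_rate,
        "reasoning": reasoning,
        "extended_reasoning": extended,
        "deception_robustness": robustness,
    }

if __name__ == "__main__":
    # Made-up win rates, for illustration only.
    example = {"wordle": 80.0, "fibble1": 60.0, "fibble2": 20.0, "fibble5": 40.0}
    for name, value in radar_dimensions(example).items():
        print(name, value)
```

These are raw values; the radar chart additionally normalizes each dimension against all qualifying models, which this sketch does not attempt.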


What this shows: Each point is a model, plotted by its average response time (X-axis, log scale) vs. win rate (Y-axis). Green points win ≥60%, orange ≥10%, red <10%. Dashed lines at 30 seconds and 50% divide the space into four quadrants. What to look for: The upper-left quadrant (fast + strong) is ideal. On Wordle, reasoning models (upper-right) generally outperform fast models. Switch to Fibble to see the key insight: most slow "reasoning" models (o3, GPT-5) fall to the slow + weak corner — extra compute does not buy deception robustness. Only a few models (Gemini 3.1 Pro, GLM-5, Kimi K2.5) remain in the "slow + strong" quadrant under deception.
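The color bands and quadrants described above amount to simple threshold checks. In the sketch below, the function names are hypothetical, and the handling of points that fall exactly on a dashed line (30 seconds or 50%) is an assumption, since the page does not say which side they belong to.

```python
def point_color(win_rate: float) -> str:
    """Color band for a win rate given in percent."""
    if win_rate >= 60:
        return "green"
    if win_rate >= 10:
        return "orange"
    return "red"

def quadrant(avg_latency_s: float, win_rate: float) -> str:
    """Quadrant label relative to the 30-second and 50% dashed lines.

    Assumption: exactly 30 s counts as slow and exactly 50% counts as strong.
    """
    speed = "fast" if avg_latency_s < 30 else "slow"
    strength = "strong" if win_rate >= 50 else "weak"
    return f"{speed} + {strength}"

if __name__ == "__main__":
    # Made-up points, for illustration only.
    for latency, wr in [(5.0, 80.0), (120.0, 70.0), (120.0, 5.0)]:
        print(latency, wr, point_color(wr), quadrant(latency, wr))
```

Note that the two scales are independent: a model winning 55% of its games is "strong" by quadrant but still orange by color.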

What this shows: A comprehensive profile for a single model across all six arenas. The summary card shows key aggregate stats. The radar chart normalizes five dimensions against all qualifying models. Bar charts break down win rate and average guesses per arena. The table shows detailed per-arena metrics. Tip: Double-click any model name in other tabs to jump here.

WORDLE

Wordle Arena

The classic word puzzle with honest clues. Zero deception — a clean baseline for comparing LLM word-guessing ability.

0 Lies · Baseline

FIBBLE

Fibble Arena

One clue per row is a lie. Models must identify which color feedback is deceptive and reason around it.

1 Lie per Row

FIBBLE

Fibble² Arena

Two clues per row are lies. The signal-to-noise ratio drops, demanding stronger deductive reasoning.

2 Lies per Row

FIBBLE

Fibble³ Arena

Three lies per row — more clues are deceptive than honest. Models must find truth in a sea of misinformation.

3 Lies per Row

FIBBLE

Fibble⁴ Arena

Four lies per row — only one clue is truthful. Extreme adversarial reasoning required to find the needle in the haystack.

4 Lies per Row

FIBBLE

Fibble⁵ Arena

All five clues per row are lies. Every piece of feedback is deceptive — the ultimate test of adversarial reasoning.

5 Lies per Row
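To make the effect of a known lie count concrete, the sketch below enumerates every true pattern that could sit behind one displayed row, given that exactly `lies` tiles are wrong. The function name and the `G`/`Y`/`X` color codes are assumptions for this example. The count of candidates is C(5,L)×2^L, which is 32 when all five tiles lie: each displayed color is then certainly wrong, leaving two possible colors per tile.

```python
from itertools import combinations, product

COLORS = "GYX"  # G = green, Y = yellow, X = gray

def possible_truths(shown: str, lies: int) -> set:
    """All true patterns that differ from `shown` in exactly `lies` tiles."""
    truths = set()
    for positions in combinations(range(len(shown)), lies):
        # Each lying tile's true color is one of the two colors not displayed.
        alternatives = [[c for c in COLORS if c != shown[i]] for i in positions]
        for replacement in product(*alternatives):
            tiles = list(shown)
            for i, color in zip(positions, replacement):
                tiles[i] = color
            truths.add("".join(tiles))
    return truths

if __name__ == "__main__":
    shown = "GXYXG"
    for lies in range(6):
        print(lies, len(possible_truths(shown, lies)))
```

A solver could keep a candidate word only if its true feedback for the guess falls inside this set, which is one way to reason around the lies without knowing which tiles are false.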


Batch Experiment Results

Cross-arena performance of LLMs on 30 deterministic words. Heatmaps, degradation charts, and per-word call logs.

Updates

March 9, 2026

Max guesses increased for Fibble²–Fibble⁵. Our information-theoretic analysis showed that higher-deception arenas had insufficient margins between the theoretical minimum guesses needed and the allowed max guesses. To ensure at least a +5 margin for all Fibble arenas, the following changes take effect immediately for all daily games going forward:

Arena	Lies	Min Guesses ⓘ	Old Max Guesses	New Max Guesses	Margin
Wordle	0	2	6	6 (unchanged)	+4
Fibble	1	3	8	8 (unchanged)	+5
Fibble²	2	5	8	10	+5
Fibble³	3	7	8	12	+5
Fibble⁴	4	7	8	12	+5
Fibble⁵	5	4	8	9	+5

ⓘ Min Guesses: Information-theoretic lower bound. Wordle feedback has 3⁵ = 243 possible patterns. With L lies, each true feedback can appear as C(5,L)×2^L different displayed patterns, reducing the effective distinguishable outcomes to 243/(C(5,L)×2^L). The lower bound on guesses is ⌈log₂(2315) / log₂(243/(C(5,L)×2^L))⌉. Lies reduce the information bandwidth of each guess by a factor of C(5,L)×2^L, turning Wordle feedback into a noisy channel.
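The Min Guesses and Margin columns can be reproduced directly from the tooltip formula, reading the noise factor as C(5,L)×2^L. The sketch below is a minimal check, with the function and constant names chosen for this example; the new guess limits are copied from the table.

```python
from math import ceil, comb, log2

ANSWER_COUNT = 2315   # number of possible answers used in the formula
PATTERNS = 3 ** 5     # 243 possible five-tile color patterns

def min_guesses(lies: int) -> int:
    """Information-theoretic lower bound on guesses with `lies` lies per row."""
    noise = comb(5, lies) * 2 ** lies   # displayed patterns per true pattern
    effective = PATTERNS / noise        # distinguishable outcomes per guess
    return ceil(log2(ANSWER_COUNT) / log2(effective))

# New max guesses per lie count, copied from the table.
NEW_MAX = {0: 6, 1: 8, 2: 10, 3: 12, 4: 12, 5: 9}

if __name__ == "__main__":
    for lies, cap in NEW_MAX.items():
        bound = min_guesses(lies)
        print(f"lies={lies} min={bound} max={cap} margin=+{cap - bound}")
```

Fibble³ and Fibble⁴ share the same bound because C(5,3)×2³ and C(5,4)×2⁴ both equal 80, while Fibble⁵ is easier in this sense because C(5,5)×2⁵ is only 32.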

Historical results played under the old 8-guess limit are preserved as-is in the leaderboards.
