Fibble adds lies to Wordle's color clues. Fibble²–Fibble⁵ progressively increase the number of lies per row, stress-testing LLM reasoning under deception. Standard Wordle is included as the lie-free baseline.
LLM Showdown
Pit up to 6 LLMs against each other on the same word
Cross-Arena Analysis
Visualize model performance across deception levels
Rankings
Win Rate Gradient
Latency vs Performance
Avg Guesses
Head to Head
Model Profile
What this shows: Each line tracks one model's rank (by win rate) as it moves across arenas from Wordle (0 lies) to Fibble⁵ (5 lies), plus an overall average.
What to look for: Lines that stay flat are consistently ranked; lines that spike downward reveal models that collapse under deception. The "Overall" column (right of the dashed line) averages all six arenas, rewarding breadth of robustness.
Double-click a model to see its full profile
What this shows: Win rate (%) for each model as the number of lies per row increases from 0 (Wordle) to 5 (Fibble⁵). The black dashed line is the mean across all qualifying models. The shaded region marks the deception cliff (Fibble²–Fibble⁴).
What to look for: Most models plummet at Fibble² (2 lies), revealing a phase transition where deception overwhelms reasoning. A few models (e.g., Gemini 3.1 Pro) resist longer. Watch for the Fibble⁵ recovery: some models partially rebound when all tiles lie, because total deception restores deterministic (inverted) information.
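The Fibble⁵ recovery can be illustrated with a small candidate filter. This is a minimal sketch, not the arena's actual implementation: it assumes a lying tile always shows one of the two colors that differ from the truth, and the function names and the `G`/`Y`/`B` encoding are hypothetical. Under that assumption, when all five tiles lie, every displayed color rules out exactly that color at its position.

```python
from collections import Counter

def wordle_feedback(guess: str, answer: str) -> str:
    """True Wordle feedback: G = green, Y = yellow, B = gray."""
    result = ["B"] * len(guess)
    # Answer letters that were not matched in place can still earn yellows.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
    for i, g in enumerate(guess):
        if result[i] == "B" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)

def consistent_all_lies(guess: str, displayed: str, candidate: str) -> bool:
    """Assumed Fibble⁵ rule: every displayed tile differs from the true tile."""
    true = wordle_feedback(guess, candidate)
    return all(t != d for t, d in zip(true, displayed))

# Hypothetical example: the true feedback for "crane" against "cigar" is GYYBB,
# so a row that lies on every tile could display BGBGY but never GBBGY.
print(consistent_all_lies("crane", "BGBGY", "cigar"))  # True
print(consistent_all_lies("crane", "GBBGY", "cigar"))  # False
```

Because the rule is deterministic, a solver can prune candidates on every guess without first working out which tiles are lying, which is the step Fibble²–Fibble⁴ require.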
What this shows: Average number of guesses used by each model across arenas, from Wordle (0 lies) to Fibble⁵ (5 lies). Lower is better. The black dashed line is the mean across all qualifying models. The shaded region marks the deception cliff (Fibble²–Fibble⁴).
What to look for: Models that stay low across arenas are efficient solvers. A sharp rise indicates the model is struggling with deception and burning through its allowed guesses. Compare with the Win Rate Gradient to see whether more guesses actually translate to wins.
🔍 Click an arena name on the x-axis to zoom in. Click again to reset.
What this shows: A comparison of up to five models across six dimensions.
Avg Win Rate = sum of per-arena win rates ÷ 6 (all arenas, missing = 0%).
Speed = based on average response time per API call (lower latency = faster).
Reasoning = Wordle win rate (honest clues, pure word-guessing ability).
Extended Reasoning = Fibble⁵ WR ÷ Wordle WR × 100. In Fibble⁵ all five clues are known to be lies, so this measures reasoning under complete information: no deception detection is needed, just deeper logical deduction. A value above 100% means the model does better at Fibble⁵ than at Wordle.
Deception Robustness = 50% × Fibble² WR + 25% × Fibble³ WR + 25% × Fibble⁴ WR. These arenas require identifying which clues are lies. Fibble² is weighted higher because most models score 0% on Fibble³/⁴ for now; weights may shift as models improve.
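The three win-rate formulas above can be sketched as follows. This is an illustrative sketch only: the arena keys, function names, and sample win rates are hypothetical, and two behaviors are assumptions the page does not state, namely returning `None` when the Wordle win rate is 0 and treating a missing arena as 0% in Deception Robustness.

```python
from typing import Dict, Optional

# Hypothetical arena keys; win rates are percentages in the range 0-100.
ARENAS = ["Wordle", "Fibble", "Fibble2", "Fibble3", "Fibble4", "Fibble5"]

def avg_win_rate(wr: Dict[str, float]) -> float:
    """Sum of per-arena win rates divided by 6; a missing arena counts as 0%."""
    return sum(wr.get(arena, 0.0) for arena in ARENAS) / len(ARENAS)

def extended_reasoning(wr: Dict[str, float]) -> Optional[float]:
    """Fibble5 WR / Wordle WR x 100. Returning None for a zero Wordle WR is an assumption."""
    wordle = wr.get("Wordle", 0.0)
    if wordle == 0:
        return None
    return wr.get("Fibble5", 0.0) / wordle * 100

def deception_robustness(wr: Dict[str, float]) -> float:
    """50% x Fibble2 WR + 25% x Fibble3 WR + 25% x Fibble4 WR."""
    return (0.50 * wr.get("Fibble2", 0.0)
            + 0.25 * wr.get("Fibble3", 0.0)
            + 0.25 * wr.get("Fibble4", 0.0))

# Made-up win rates for one model, for illustration only.
sample = {"Wordle": 80.0, "Fibble": 40.0, "Fibble2": 20.0, "Fibble5": 60.0}
print(round(avg_win_rate(sample), 2))   # 33.33
print(extended_reasoning(sample))       # 75.0
print(deception_robustness(sample))     # 10.0
```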
How to use: Pick two to five models from the dropdowns. The radar chart summarizes normalized dimensions; the table provides exact numbers per arena.
Drag to rotate. Scroll to zoom. Right-drag to pan.
What this shows: Each point is a model, plotted by its average response time (X-axis, log scale) vs. win rate (Y-axis). Green points win ≥60%, orange ≥10%, red <10%. Dashed lines at 30 seconds and 50% divide the space into four quadrants.
What to look for: The upper-left quadrant (fast + strong) is ideal. On Wordle, reasoning models (upper-right) generally outperform fast models. Switch to Fibble to see the key insight: most slow "reasoning" models (o3, GPT-5) fall to the slow + weak corner — extra compute does not buy deception robustness. Only a few models (Gemini 3.1 Pro, GLM-5, Kimi K2.5) remain in the "slow + strong" quadrant under deception.
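The color coding and quadrant split described for this chart can be sketched as below. The function names are hypothetical, and how the chart treats points that fall exactly on the 30-second or 50% dividing lines is an assumption.

```python
def point_color(win_rate_pct: float) -> str:
    """Green for a win rate of at least 60%, orange for at least 10%, red below 10%."""
    if win_rate_pct >= 60:
        return "green"
    if win_rate_pct >= 10:
        return "orange"
    return "red"

def quadrant(latency_s: float, win_rate_pct: float) -> str:
    """Split at 30 seconds and 50%; the handling of exact boundary values is assumed."""
    speed = "fast" if latency_s < 30 else "slow"
    strength = "strong" if win_rate_pct >= 50 else "weak"
    return f"{speed} + {strength}"

# Made-up points for illustration only.
print(quadrant(4.2, 72.0), point_color(72.0))    # fast + strong green
print(quadrant(95.0, 8.0), point_color(8.0))     # slow + weak red
```

Note that the two schemes use different thresholds, so an orange point (for example 55%) can still sit in a "strong" quadrant.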
What this shows: A comprehensive profile for a single model across all six arenas.
The summary card shows key aggregate stats. The radar chart normalizes six dimensions against all qualifying models.
Bar charts break down win rate and average guesses per arena. The table shows detailed per-arena metrics.
Tip: Double-click any model name in other tabs to jump here.
Max guesses increased for Fibble²–Fibble⁵.
Our information-theoretic analysis showed that higher-deception arenas had
insufficient margins between the theoretical minimum guesses needed and the
allowed max guesses. To ensure at least a +5 margin for all Fibble arenas, the following
changes take effect immediately for all daily games going forward:
| Arena | Lies | Min Guesses ⓘ | Old Max Guesses | New Max Guesses | Margin |
|---|---|---|---|---|---|
| Wordle | 0 | 2 | 6 | 6 (unchanged) | +4 |
| Fibble | 1 | 3 | 8 | 8 (unchanged) | +5 |
| Fibble² | 2 | 5 | 8 | 10 | +5 |
| Fibble³ | 3 | 7 | 8 | 12 | +5 |
| Fibble⁴ | 4 | 7 | 8 | 12 | +5 |
| Fibble⁵ | 5 | 4 | 8 | 9 | +5 |

ⓘ Min Guesses: Information-theoretic lower bound. Wordle feedback has 3⁵=243 possible patterns. With L lies, each true feedback can appear as C(5,L)×2^L different displayed patterns, reducing the effective distinguishable outcomes to 243/(C(5,L)×2^L). The lower bound on guesses is ⌈log₂(2315) / log₂(243/(C(5,L)×2^L))⌉. Lies reduce the information bandwidth of each guess by a factor of C(5,L)×2^L, turning Wordle feedback into a noisy channel.
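The Min Guesses column can be reproduced with a few lines of Python. This sketch reads the confusion factor as C(5,L)×2^L (L lying positions, each showing one of the two wrong colors), a reading that matches every value in the table. The function name and the `NEW_MAX` dictionary are hypothetical helpers, with the guess limits copied from the table.

```python
from math import ceil, comb, log2

def min_guesses(lies: int, answers: int = 2315, tiles: int = 5) -> int:
    """Lower bound: ceil(log2(answers) / log2(3^tiles / (C(tiles, lies) * 2^lies)))."""
    patterns = 3 ** tiles                        # 243 possible feedback patterns
    confusions = comb(tiles, lies) * 2 ** lies   # displayed patterns per true feedback
    return ceil(log2(answers) / log2(patterns / confusions))

# New max guesses per number of lies, copied from the table.
NEW_MAX = {0: 6, 1: 8, 2: 10, 3: 12, 4: 12, 5: 9}

bounds = [min_guesses(lies) for lies in range(6)]
margins = [NEW_MAX[lies] - bounds[lies] for lies in range(6)]
print(bounds)   # [2, 3, 5, 7, 7, 4]
print(margins)  # [4, 5, 5, 5, 5, 5]
```

Fibble³ and Fibble⁴ share the same bound because C(5,3)×2³ and C(5,4)×2⁴ both equal 80, while Fibble⁵ drops back to 4 because only one set of positions (all five) can be lying.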
Historical results played under the old 8-guess limit are preserved as-is in the leaderboards.