
FIBBLE Arena for LLMs

Can LLMs handle Fibble deception? Every row of clues contains exactly one lie.

Every day, select LLMs play the Fibble puzzle to test which models are strongest. You are welcome to try your own favorite models here, too — including local Ollama models!

Tile legend: Correct · Wrong spot · Not in word · Lied tile


Why Is One Lie So Devastating?

Fibble adds a single twist to Wordle: in every row of feedback, exactly one of the five color clues is a deliberate lie. That sounds minor — 80% of the information is still truthful — but it transforms the puzzle from a straightforward constraint-satisfaction problem into an abductive reasoning challenge, and the difference is enormous for LLMs.

In standard Wordle, each clue directly narrows the solution space. An LLM can treat feedback as a set of hard constraints and eliminate possibilities turn by turn. With one lie per row, that strategy collapses. The model must now consider five alternative worlds for every guess — one for each tile that could be the lie — and cross-reference these hypotheses across all previous rows to find the single consistent explanation. This is belief revision under uncertainty: the model must hold multiple contradictory hypotheses in mind, weigh evidence across turns, and retract earlier conclusions when new clues expose them as built on a lie.
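The consistency check described above can be sketched in a few lines of Python (the function names and the one-lie rule encoding are illustrative, not taken from the arena's implementation): a candidate answer explains a row only if the shown clue differs from the true Wordle feedback in exactly one tile.

```python
from collections import Counter

def feedback(guess: str, answer: str) -> str:
    """Standard Wordle feedback: G = correct, Y = wrong spot, B = not in word."""
    result = ["B"] * 5
    counts = Counter(answer)
    # First pass: mark greens and consume those letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
            counts[g] -= 1
    # Second pass: mark yellows from the remaining letter counts.
    for i, g in enumerate(guess):
        if result[i] == "B" and counts[g] > 0:
            result[i] = "Y"
            counts[g] -= 1
    return "".join(result)

def consistent_fibble(candidate: str, rows: list[tuple[str, str]]) -> bool:
    """A candidate explains a row iff the shown clue differs from the
    true feedback in exactly one tile (the single lie per row)."""
    for guess, shown in rows:
        true = feedback(guess, candidate)
        lies = sum(t != s for t, s in zip(true, shown))
        if lies != 1:
            return False
    return True
```

Note that the check is per-row: given a concrete candidate word, each row independently either admits a single lie position or rules the candidate out, so a solver never has to enumerate joint lie assignments across rows.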

This exposes a core LLM weakness. Standard language models process text left to right, accumulating context. They excel when each new piece of evidence reinforces a single line of reasoning. But Fibble requires non-monotonic reasoning — the ability to say "I was wrong about that earlier clue; let me reinterpret everything." Most models instead anchor on early feedback and fail to revise, which is why our daily results show win rates plummeting from Wordle to Fibble even though models get two extra guesses.
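The cost of anchoring is easy to demonstrate with a toy example (the word list and the lying clue below are invented for illustration): filtering candidates with the shown clue taken at face value can eliminate the true answer entirely, while the one-lie filter retains it.

```python
from collections import Counter

def feedback(guess: str, answer: str) -> str:
    """Standard Wordle feedback: G = correct, Y = wrong spot, B = not in word."""
    result = ["B"] * 5
    counts = Counter(answer)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
            counts[g] -= 1
    for i, g in enumerate(guess):
        if result[i] == "B" and counts[g] > 0:
            result[i] = "Y"
            counts[g] -= 1
    return "".join(result)

words = ["slate", "crane", "trace", "stale", "least", "cater", "react", "later"]

def wordle_filter(cands, guess, shown):
    # Trust every tile: keep words whose true feedback matches exactly.
    return [w for w in cands if feedback(guess, w) == shown]

def fibble_filter(cands, guess, shown):
    # One-lie rule: keep words whose true feedback differs in exactly one tile.
    return [w for w in cands
            if sum(a != b for a, b in zip(feedback(guess, w), shown)) == 1]

# Suppose the answer is "slate" and the shown clue for "crane" lies about
# the final E ("BBGBB" instead of the truthful "BBGBG").
print(wordle_filter(words, "crane", "BBGBB"))  # → []
print(fibble_filter(words, "crane", "BBGBB"))  # → ['slate', 'stale', 'least']
```

Trusting the lying tile leaves zero candidates, including the true answer; a model that anchors on that clue has no consistent path back without re-opening every earlier conclusion.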

Even the newer reasoning models (o-series, GPT-5) that have largely solved standard Wordle still struggle with Fibble. Their chain-of-thought can spell out letters and check positional constraints, but systematically evaluating 5×N lie-hypothesis combinations across N rows of feedback is a combinatorial search that grows with each turn — exactly the kind of branching exploration that sequential chain-of-thought handles poorly. Fibble thus serves as a targeted stress test for multi-hypothesis reasoning and contradiction detection, capabilities that remain at the frontier of LLM research.