Fibble⁴ Arena for LLMs – LLMs vs Deceptive Wordle (Extreme)

Why Are Four Lies So Extreme?

Standard Fibble has one lie per row: the model must consider 5 possible lie positions per guess. Fibble⁴ quadruples the lies: four of the five color clues are deliberately wrong in every row, so only 20% of the feedback is truthful. The number of hypotheses per row does not grow, however. Choosing which four tiles lie is the same as choosing which one tells the truth, so the count stays at C(5,4) = C(5,1) = 5.

While the combinatorial count per row is deceptively small, the real challenge is that almost all information is false. Across N rows, a model must track 5^N possible lie-position combinations and cross-reference them to find a consistent explanation. Where standard Fibble already stretches LLM reasoning to its limits, Fibble⁴ pushes to the extreme: models must identify the single truthful tile in each row, performing multi-hypothesis search where nearly every signal is deceptive.
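The search described above can be sketched in a few lines of Python. This is a minimal illustration, not the arena's actual implementation. It assumes that every row contains exactly four lies and that a lying tile always shows a color different from its true one. The function names (`true_feedback`, `consistent`, `filter_candidates`, `hypothesis_count`) and the G/Y/B color encoding are hypothetical choices made for this example.

```python
from collections import Counter
from math import comb


def true_feedback(guess: str, answer: str) -> str:
    """Honest Wordle feedback: G = green, Y = yellow, B = gray."""
    result = ["B"] * len(guess)
    remaining = Counter()
    # First pass: mark greens and count the unmatched answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        else:
            remaining[a] += 1
    # Second pass: mark yellows while unmatched letters remain.
    for i, g in enumerate(guess):
        if result[i] != "G" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)


def consistent(candidate: str, guess: str, shown: str, lies: int = 4) -> bool:
    """True if `shown` differs from the honest feedback in exactly `lies` tiles.

    Assumption: a lying tile always displays a color other than the true one.
    """
    truth = true_feedback(guess, candidate)
    return sum(t != s for t, s in zip(truth, shown)) == lies


def filter_candidates(candidates, rows, lies: int = 4):
    """Keep the words that explain every (guess, shown_feedback) row."""
    return [w for w in candidates
            if all(consistent(w, g, s, lies) for g, s in rows)]


def hypothesis_count(n_rows: int, tiles: int = 5, lies: int = 4) -> int:
    """Number of lie-position combinations across n_rows rows."""
    return comb(tiles, lies) ** n_rows


words = ["crane", "slate", "brick", "plant"]
rows = [("slate", "YGGYB"), ("brick", "BYGGY")]
print(filter_candidates(words, rows[:1]))  # ['crane', 'brick']
print(filter_candidates(words, rows))      # ['crane']
print(hypothesis_count(6))                 # 15625
```

Checking each candidate word against every row avoids enumerating the 5^N lie-position combinations explicitly: a word survives only if, in each row, exactly one shown tile matches the honest feedback for that word.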

The cognitive challenge mirrors real-world scenarios where almost all evidence is unreliable — think intelligence analysis where nearly every source is compromised, or medical diagnosis where four out of five test results are misleading. Fibble⁴ tests whether LLMs can find the one reliable signal amid overwhelming noise, a capability that remains at the frontier of AI reasoning.

Even models that have learned to handle single-lie Fibble through careful cross-referencing collapse completely on Fibble⁴. With 4 lies per row, the model's instinct — trust the majority of clues — is catastrophically wrong, since only 20% of tiles are truthful. The key insight is to think "which ONE tile is telling the truth?" rather than "which tiles are lying?" Every additional row of evidence is both more informative and overwhelmingly deceptive, making Fibble⁴ a uniquely extreme benchmark for robust reasoning under near-total deception.