Hard — 2 lies per row (60% truthful). Can LLMs handle Fibble deception squared?
As LLMs improve at standard Fibble (1 lie per row), harder variants are needed. Fibble² has 2 lies per row — only 3 of 5 tiles are truthful (60%). This is significantly harder: models must evaluate C(5,2) = 10 possible lie combinations per row.
Standard Fibble has one lie per row: the model must consider 5 possible lie positions per guess. Fibble² doubles the lies — now two of the five color clues are deliberately wrong in every row. That means only 60% of the feedback is truthful, and the number of hypotheses per row jumps from 5 to C(5,2) = 10.
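The per-row hypothesis count is easy to verify directly. A minimal sketch (the helper name `lie_hypotheses` is mine, not part of the game):

```python
from itertools import combinations

def lie_hypotheses(num_tiles: int, num_lies: int):
    """Enumerate every possible set of lying tile positions in one row."""
    return list(combinations(range(num_tiles), num_lies))

# Standard Fibble: 1 lie among 5 tiles -> 5 hypotheses per row.
print(len(lie_hypotheses(5, 1)))  # 5
# Fibble²: 2 lies among 5 tiles -> C(5,2) = 10 hypotheses per row.
print(len(lie_hypotheses(5, 2)))  # 10
```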
This combinatorial explosion is devastating. Across N rows, a model must track 10^N possible lie-position combinations and cross-reference them to find a consistent explanation. Where standard Fibble already stretches LLM reasoning to its limits, Fibble² pushes well beyond: models must perform systematic multi-hypothesis search at a scale that sequential chain-of-thought handles very poorly.
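The cross-referencing step can be sketched as a consistency filter: a candidate answer survives only if, for every row, the shown colors differ from the true Wordle colors in exactly two positions. This is an illustrative sketch under my own assumptions (standard Wordle coloring with duplicate handling; a "lie" always flips a tile to an incorrect color; the names `true_feedback` and `consistent` and the tiny word list are invented for the example):

```python
def true_feedback(guess: str, answer: str) -> str:
    """Honest Wordle colors: g = green, y = yellow, b = gray."""
    colors = ["b"] * len(guess)
    counts: dict[str, int] = {}
    # First pass: mark greens; count unmatched answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            colors[i] = "g"
        else:
            counts[a] = counts.get(a, 0) + 1
    # Second pass: mark yellows against the remaining letter counts.
    for i, g in enumerate(guess):
        if colors[i] == "b" and counts.get(g, 0) > 0:
            colors[i] = "y"
            counts[g] -= 1
    return "".join(colors)

def consistent(candidate: str, rows, lies_per_row: int = 2) -> bool:
    """A candidate explains a row iff the shown feedback disagrees with
    the honest feedback in exactly `lies_per_row` tile positions."""
    for guess, shown in rows:
        truth = true_feedback(guess, candidate)
        if sum(t != s for t, s in zip(truth, shown)) != lies_per_row:
            return False
    return True

# Toy example: the answer is "crane"; honest feedback for "slate" would be
# "bbgbg", but Fibble² lies on tiles 0 and 3, showing "ybgyg".
words = ["crane", "slate", "trace", "brine"]
rows = [("slate", "ybgyg")]
survivors = [w for w in words if consistent(w, rows)]
print(survivors)  # ['crane']
```

Each additional row applies the same filter again, which is why the evidence is simultaneously more informative and more treacherous: every row prunes candidates, but only under the correct joint guess about which tiles lied.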
The cognitive challenge mirrors real-world scenarios where multiple pieces of evidence are unreliable simultaneously — think intelligence analysis with multiple compromised sources, or medical diagnosis with two misleading test results. Fibble² tests whether LLMs can maintain and prune a large hypothesis space under compounded uncertainty, a capability that remains at the frontier of AI reasoning.
Even models that have learned to handle single-lie Fibble through careful cross-referencing tend to collapse on Fibble². With the deception doubled, a model's first instinct — trusting any given clue — is right only 60% of the time per tile. Every additional row of evidence is both more informative and more treacherous, making Fibble² a uniquely challenging benchmark for robust reasoning under compounded deception.