Fibble³ Arena for LLMs – LLMs vs Deceptive Wordle (Very Hard)

Why Are Three Lies So Much Harder?

Standard Fibble has one lie per row: the model must consider 5 possible lie positions per guess. Fibble³ triples the lies — now three of the five color clues are deliberately wrong in every row. That means only 40% of the feedback is truthful, and the number of hypotheses per row jumps from 5 to C(5,3) = 10.

This combinatorial explosion is devastating. Across N rows, a model must track 10^N possible lie-position combinations and cross-reference them to find a consistent explanation. Where standard Fibble already stretches LLM reasoning to its limits, Fibble³ pushes well beyond: models must perform systematic multi-hypothesis search at a scale that sequential chain-of-thought handles very poorly.

The cognitive challenge mirrors real-world scenarios where multiple pieces of evidence are unreliable simultaneously — think intelligence analysis with multiple compromised sources, or medical diagnosis with three misleading test results. Fibble³ tests whether LLMs can maintain and prune a large hypothesis space under compounded uncertainty, a capability that remains at the frontier of AI reasoning.

Even models that have learned to handle single-lie Fibble through careful cross-referencing tend to collapse on Fibble³. The tripled deception means that a model's first instinct — trust the majority of clues — is now only right 40% of the time per tile. Every additional row of evidence is both more informative and more treacherous, making Fibble³ a uniquely challenging benchmark for robust reasoning under compounded deception.