Hard — 2 lies per row (60% truthful). Can LLMs handle Fibble deception squared?
As LLMs improve at standard Fibble (1 lie per row), harder variants are needed. Fibble² has 2 lies per row — only 3 of 5 tiles are truthful (60%). This is significantly harder: models must evaluate C(5,2) = 10 possible lie combinations per row.
Standard Fibble has one lie per row: the model must consider 5 possible lie positions per guess. Fibble² doubles the lies — now two of the five color clues are deliberately wrong in every row. That means only 60% of the feedback is truthful, and the number of hypotheses per row jumps from 5 to C(5,2) = 10.
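The per-row hypothesis count is easy to verify directly. A minimal sketch (the helper name `lie_hypotheses` is mine, not part of the game):

```python
from itertools import combinations

def lie_hypotheses(num_tiles: int, num_lies: int):
    """Enumerate every possible set of lying tile positions in one row."""
    return list(combinations(range(num_tiles), num_lies))

# Standard Fibble: 1 lie among 5 tiles -> 5 hypotheses per row.
print(len(lie_hypotheses(5, 1)))  # 5
# Fibble²: 2 lies among 5 tiles -> C(5,2) = 10 hypotheses per row.
print(len(lie_hypotheses(5, 2)))  # 10
```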
This combinatorial explosion is devastating. Across N rows, a model must track 10^N possible lie-position combinations and cross-reference them to find a consistent explanation. Where standard Fibble already stretches LLM reasoning to its limits, Fibble² pushes well beyond: models must perform systematic multi-hypothesis search at a scale that sequential chain-of-thought handles very poorly.
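The cross-referencing step can be sketched as a consistency filter: a candidate answer survives only if, for every row, the shown colors differ from the true Wordle colors in exactly two positions. This is an illustrative sketch under my own assumptions (standard Wordle coloring with duplicate handling; a "lie" always flips a tile to an incorrect color; the names `true_feedback` and `consistent` and the tiny word list are invented for the example):

```python
def true_feedback(guess: str, answer: str) -> str:
    """Honest Wordle colors: g = green, y = yellow, b = gray."""
    colors = ["b"] * len(guess)
    counts: dict[str, int] = {}
    # First pass: mark greens; count unmatched answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            colors[i] = "g"
        else:
            counts[a] = counts.get(a, 0) + 1
    # Second pass: mark yellows against the remaining letter counts.
    for i, g in enumerate(guess):
        if colors[i] == "b" and counts.get(g, 0) > 0:
            colors[i] = "y"
            counts[g] -= 1
    return "".join(colors)

def consistent(candidate: str, rows, lies_per_row: int = 2) -> bool:
    """A candidate explains a row iff the shown feedback disagrees with
    the honest feedback in exactly `lies_per_row` tile positions."""
    for guess, shown in rows:
        truth = true_feedback(guess, candidate)
        if sum(t != s for t, s in zip(truth, shown)) != lies_per_row:
            return False
    return True

# Toy example: the answer is "crane"; honest feedback for "slate" would be
# "bbgbg", but Fibble² lies on tiles 0 and 3, showing "ybgyg".
words = ["crane", "slate", "trace", "brine"]
rows = [("slate", "ybgyg")]
survivors = [w for w in words if consistent(w, rows)]
print(survivors)  # ['crane']
```

Each additional row applies the same filter again, which is why the evidence is simultaneously more informative and more treacherous: every row prunes candidates, but only under the correct joint guess about which tiles lied.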
The cognitive challenge mirrors real-world scenarios where multiple pieces of evidence are unreliable simultaneously — think intelligence analysis with multiple compromised sources, or medical diagnosis with two misleading test results. Fibble² tests whether LLMs can maintain and prune a large hypothesis space under compounded uncertainty, a capability that remains at the frontier of AI reasoning.
Even models that have learned to handle single-lie Fibble through careful cross-referencing tend to collapse on Fibble². With the deception doubled, a model's first instinct — trusting any given clue — is right only 60% of the time per tile. Every additional row of evidence is both more informative and more treacherous, making Fibble² a uniquely challenging benchmark for robust reasoning under compounded deception.