Loading results...
Extreme — 4 lies per row (20% truthful). Can LLMs find the one truthful tile?
As LLMs improve at standard Fibble (1 lie per row), harder variants are needed. Fibble⁴ has 4 lies per row — only 1 of 5 tiles is truthful (20%). Each row has just a single honest tile hidden among four lies.
Loading results...
Standard Fibble has one lie per row: the model must consider 5 possible lie positions per guess. Fibble⁴ quadruples the lies — now four of the five color clues are deliberately wrong in every row. That means only 20% of the feedback is truthful, and the number of hypotheses per row jumps from 5 to C(5,4) = 5.
While the combinatorial count per row is deceptively small, the real challenge is that almost all information is false. Across N rows, a model must track 5N possible lie-position combinations and cross-reference them to find a consistent explanation. Where standard Fibble already stretches LLM reasoning to its limits, Fibble⁴ pushes to the extreme: models must identify the single truthful tile in each row, performing multi-hypothesis search where nearly every signal is deceptive.
The cognitive challenge mirrors real-world scenarios where almost all evidence is unreliable — think intelligence analysis where nearly every source is compromised, or medical diagnosis where four out of five test results are misleading. Fibble⁴ tests whether LLMs can find the one reliable signal amid overwhelming noise, a capability that remains at the frontier of AI reasoning.
Even models that have learned to handle single-lie Fibble through careful cross-referencing collapse completely on Fibble⁴. With 4 lies per row, the model's instinct — trust the majority of clues — is catastrophically wrong, since only 20% of tiles are truthful. The key insight is to think "which ONE tile is telling the truth?" rather than "which tiles are lying?" Every additional row of evidence is both more informative and overwhelmingly deceptive, making Fibble⁴ a uniquely extreme benchmark for robust reasoning under near-total deception.