Can LLMs solve the daily Wordle? Let's find out.
Every day, select LLMs play the Wordle puzzle to test which models are strongest. You are welcome to try your own favorite models here, too — including local Ollama models!
Large language models (LLMs) are AI systems — like GPT, Claude, and Gemini — that read and generate text. Before an LLM can process a word, it breaks it into tokens (sub-word chunks) and converts each token into a list of numbers called an embedding. Think of an embedding as a numerical fingerprint that captures what a word means — "king" and "queen" get similar fingerprints because they appear in similar contexts. The problem is that this fingerprint encodes meaning, not spelling. A word like "MOGUL" might be tokenized as ["MOG", "UL"], so the model never directly sees the five individual letters in their positions. Wordle, however, is all about which specific letter sits at which specific position — exactly the information that tokenization discards.
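To make this concrete, here is a toy sketch of how sub-word tokenization can hide per-letter, per-position information. The vocabulary and token IDs below are invented for illustration; real tokenizers (e.g. BPE) are far larger, but the effect is the same.

```python
# Hypothetical vocabulary mapping sub-word chunks to token IDs.
VOCAB = {"MOG": 1017, "UL": 2291, "S": 83, "TONE": 440}

def tokenize(word, vocab):
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest chunk first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return tokens

tokens = tokenize("MOGUL", VOCAB)
ids = [VOCAB[t] for t in tokens]
print(tokens)  # ['MOG', 'UL'] -- the model sees two chunks, not five letters
print(ids)     # [1017, 2291] -- "there is a G at position 2" is nowhere explicit
```

The model's input is the ID sequence `[1017, 2291]`: the fact that the third letter is G exists only implicitly, which is exactly the information Wordle demands.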
This is why early LLMs often made bizarre Wordle mistakes: reusing letters already confirmed absent, ignoring positional clues, or guessing non-words. They could discuss Wordle strategy fluently but couldn't reliably execute it, because the letter-position reasoning had to be reconstructed from lossy token representations.
Newer "reasoning" models (such as OpenAI's o-series and GPT-5) have largely closed this gap. These models perform extended internal chain-of-thought before answering — effectively "thinking out loud" step by step. By spelling out each letter and checking constraints one at a time during this hidden reasoning phase, they can recover the character-level information that tokenization lost. Our daily results show that reasoning models consistently outperform their non-reasoning counterparts on Wordle.
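The letter-by-letter checking described above can be sketched as a small constraint filter. This is our own simplification of Wordle feedback (it ignores duplicate-letter edge cases), not any model's actual internal procedure:

```python
def consistent(candidate, greens, yellows, grays):
    """Check a candidate word against accumulated Wordle clues.

    greens:  {position: letter} -- letter confirmed at that position
    yellows: {letter: positions} -- letter is in the word, but not at these positions
    grays:   set of letters confirmed absent (simplified: no duplicate handling)
    """
    for pos, letter in greens.items():
        if candidate[pos] != letter:
            return False
    for letter, bad_positions in yellows.items():
        if letter not in candidate:
            return False
        if any(candidate[p] == letter for p in bad_positions):
            return False
    return not any(letter in grays for letter in candidate)

# Example: after guessing STONE against answer MOGUL, O is yellow (in the
# word, but not at position 2) and S, T, N, E are gray.
clues = dict(greens={}, yellows={"O": {2}}, grays=set("STNE"))
print(consistent("MOGUL", **clues))  # True
print(consistent("ROBOT", **clues))  # False: contains the gray letter T
```

Spelling out each constraint explicitly like this, rather than relying on what tokenized input happens to encode, is the kind of step-by-step work reasoning models can do in their hidden chain-of-thought.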