Can LLMs solve the daily Wordle? Let's find out.
Every day, select LLMs play the Wordle puzzle to test which models are strongest. You are welcome to try your own favorite models here, too — including local Ollama models!
Large language models (LLMs) are AI systems — like GPT, Claude, and Gemini — that read and generate text. Before an LLM can process a word, it breaks it into tokens (sub-word chunks) and converts each token into a list of numbers called an embedding. Think of an embedding as a numerical fingerprint that captures what a word means — "king" and "queen" get similar fingerprints because they appear in similar contexts. The problem is that this fingerprint encodes meaning, not spelling. A word like "MOGUL" might be tokenized as ["MOG", "UL"], so the model never directly sees the five individual letters in their positions. Wordle, however, is all about which specific letter sits at which specific position — exactly the information that tokenization discards.
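To make this concrete, here is a toy sketch of how sub-word tokenization can hide per-letter, per-position information. The vocabulary and token IDs below are invented for illustration; real tokenizers (e.g. BPE) are far larger, but the effect is the same.

```python
# Hypothetical vocabulary mapping sub-word chunks to token IDs.
VOCAB = {"MOG": 1017, "UL": 2291, "S": 83, "TONE": 440}

def tokenize(word, vocab):
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest chunk first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return tokens

tokens = tokenize("MOGUL", VOCAB)
ids = [VOCAB[t] for t in tokens]
print(tokens)  # ['MOG', 'UL'] -- the model sees two chunks, not five letters
print(ids)     # [1017, 2291] -- "there is a G at position 2" is nowhere explicit
```

The model's input is the ID sequence `[1017, 2291]`: the fact that the third letter is G exists only implicitly, which is exactly the information Wordle demands.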
This is why early LLMs often made bizarre Wordle mistakes: reusing letters already confirmed absent, ignoring positional clues, or guessing non-words. They could discuss Wordle strategy fluently but couldn't reliably execute it, because the letter-position reasoning had to be reconstructed from lossy token representations.
Newer "reasoning" models (such as OpenAI's o-series and GPT-5) have largely closed this gap. These models perform extended internal chain-of-thought before answering — effectively "thinking out loud" step by step. By spelling out each letter and checking constraints one at a time during this hidden reasoning phase, they can recover the character-level information that tokenization lost. Our daily results show that reasoning models consistently outperform their non-reasoning counterparts on Wordle.
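The letter-by-letter checking described above can be sketched as a small constraint filter. This is our own simplification of Wordle feedback (it ignores duplicate-letter edge cases), not any model's actual internal procedure:

```python
def consistent(candidate, greens, yellows, grays):
    """Check a candidate word against accumulated Wordle clues.

    greens:  {position: letter} -- letter confirmed at that position
    yellows: {letter: positions} -- letter is in the word, but not at these positions
    grays:   set of letters confirmed absent (simplified: no duplicate handling)
    """
    for pos, letter in greens.items():
        if candidate[pos] != letter:
            return False
    for letter, bad_positions in yellows.items():
        if letter not in candidate:
            return False
        if any(candidate[p] == letter for p in bad_positions):
            return False
    return not any(letter in grays for letter in candidate)

# Example: after guessing STONE against answer MOGUL, O is yellow (in the
# word, but not at position 2) and S, T, N, E are gray.
clues = dict(greens={}, yellows={"O": {2}}, grays=set("STNE"))
print(consistent("MOGUL", **clues))  # True
print(consistent("ROBOT", **clues))  # False: contains the gray letter T
```

Spelling out each constraint explicitly like this, rather than relying on what tokenized input happens to encode, is the kind of step-by-step work reasoning models can do in their hidden chain-of-thought.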