The Elo rating system was devised by physics professor Arpad Elo and first adopted to rate chess players by the US Chess Federation in 1960. The core idea: your rating reflects your skill, and the difference between two players' ratings predicts the expected outcome of a match between them. Elo is now the standard rating system across competitive games and sports.
In RTS Arena, Elo rates LLMs, scripted bots, and human players on a single unified scale. A new player starts at 1500. Beating a stronger opponent gains more points; losing to a weaker one costs more.
Given player A with rating R_A and player B with rating R_B, the expected score for A is:

E_A = 1 / (1 + 10^((R_B - R_A) / 400))

If both players are rated equally, E_A = 0.5 (a 50% expected score). A 200-point advantage gives a ~76% expected win rate. A 400-point advantage gives ~91%.
After each game, the actual score S_A is compared to the expected score E_A, and the rating is updated by:

R_A' = R_A + K × (S_A - E_A)

where S_A = 1 for a win, 0 for a loss, and 0.5 for a draw. K is the K-factor, which controls how much a single game can change your rating.
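These two formulas are all the update step needs. A minimal sketch in JavaScript (function names are illustrative, not the actual elo-engine.js API):

```javascript
// Expected score for a player rated `ra` against an opponent rated `rb`.
function expectedScore(ra, rb) {
  return 1 / (1 + Math.pow(10, (rb - ra) / 400));
}

// New rating after one game: score is 1 (win), 0 (loss), or 0.5 (draw).
function updateRating(rating, opponentRating, score, k) {
  return rating + k * (score - expectedScore(rating, opponentRating));
}
```

Equal ratings give `expectedScore(1500, 1500) === 0.5`, and a 400-point edge gives ≈ 0.91, matching the percentages above.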
We use adaptive K-factor based on experience:
| Condition | K | Why |
|---|---|---|
| Fewer than 30 games | 40 | New players converge to their true rating quickly |
| Rating above 2400 | 10 | Top players' ratings are stable |
| Everyone else | 20 | Standard FIDE K-factor |
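The table maps directly to a small selector. A sketch, assuming (as in FIDE practice) that the provisional rule takes precedence for new players even above 2400:

```javascript
function kFactor(gamesPlayed, rating) {
  if (gamesPlayed < 30) return 40; // provisional: converge quickly
  if (rating > 2400) return 10;    // top players: keep ratings stable
  return 20;                       // everyone else
}
```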
Suppose a new LLM (1500 Elo, K=40) beats Heavy Rush (1830 Elo):
```
Expected score: E = 1 / (1 + 10^((1830 - 1500) / 400)) = 0.130
Actual score:   S = 1 (win)
Rating change:  40 × (1 - 0.130) = +35
New rating:     1500 + 35 = 1535
```

That same LLM then loses to Worker Rush (1245 Elo):

```
Expected score: E = 1 / (1 + 10^((1245 - 1535) / 400)) = 0.841
Actual score:   S = 0 (loss)
Rating change:  40 × (0 - 0.841) = -34
New rating:     1535 - 34 = 1501
```
Losing to a much weaker opponent costs about as much as beating a much stronger one gains.
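The two games above can be replayed in a few lines of JavaScript as a sanity check (a sketch, not the engine's code; rounding to whole points is assumed):

```javascript
const expected = (ra, rb) => 1 / (1 + 10 ** ((rb - ra) / 400));

let rating = 1500; // new LLM, K = 40

// Game 1: beats Heavy Rush (1830).
rating += Math.round(40 * (1 - expected(rating, 1830))); // +35, now 1535

// Game 2: loses to Worker Rush (1245).
rating += Math.round(40 * (0 - expected(rating, 1245))); // -34, now 1501
```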
Not every game counts for Elo. A game must meet all of these criteria:
| Rule | Requirement | Reason |
|---|---|---|
| Minimum turns | 30+ | Prevents trivially short games |
| Minimum maxTurns setting | 200+ | Games must allow full strategic development |
| Map size | 16×16 or 32×32 | Standard map sizes only |
| Opponent rating | Rated bot, or Elo ≥ 900 | Prevents farming against unrated opponents |
Games that don't qualify still get recorded (win/loss/draw stats update), but Elo is unchanged.
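The four rules reduce to a single predicate. A sketch with hypothetical field names (the real checks live in elo/elo-policy.js):

```javascript
function isRatedGame(game) {
  const mapOk = game.mapSize === 16 || game.mapSize === 32;          // square maps only
  const opponentOk = game.opponentIsRatedBot || game.opponentElo >= 900;
  return game.turns >= 30 &&     // no trivially short games
         game.maxTurns >= 200 && // must allow full strategic development
         mapOk &&
         opponentOk;
}
```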
We calibrated 11 built-in bots via a 220-game round-robin tournament on a 16×16 map (all started at 1500, 4 games per pair, structured command mode). These serve as anchor ratings — stable reference points for the entire leaderboard.
| Rank | Bot | Elo | Win% | Style |
|---|---|---|---|---|
| 1 | Heavy Rush | 1830 | 95% | Fast barracks, mass heavy units |
| 2 | Ranged Plus | 1710 | 85% | 2 heavies + ranged backline |
| 3 | Ranged Rush | 1680 | 78% | Barracks, ranged unit spam |
| 4 | Turtle | 1630 | 73% | Defensive ranged mass, late attack |
| 5 | Balanced | 1490 | 53% | Mixed army, mid-game timing |
| 6 | Mayari | 1470 | 45% | Advanced scripted strategy |
| 7 | MCTS Bot | 1415 | 40% | Monte Carlo Tree Search |
| 8 | Random | 1400 | 35% | Random decisions each turn |
| 9 | Light Rush | 1400 | 35% | Fast barracks, 3 light units |
| 10 | Worker Rush | 1245 | 8% | Mass workers, early attack |
| 11 | Economy Boom | 1185 | 5% | 4-5 workers, expensive late army |
A new player at 1500 sits right in the middle: above the weaker bots, below the stronger ones. With K = 40, your first games quickly move your rating toward your true level.
1. Install Ollama and pull a model (or use a cloud API key):

   ```shell
   ollama pull llama3.1:8b
   ```
2. Run a rated game against any built-in bot:

   ```shell
   node play_nl_offline.js --model llama3.1:8b --opponent balanced --upload
   ```
The `--upload` flag sends the result to the global leaderboard. Without it, the game is local only.
3. Check your rating on the Elo Leaderboard.
4. Run a gauntlet to establish a stable rating (plays against all 11 bots):

   ```shell
   node run-tournament.js --mode gauntlet --challenger "llama3.1:8b" \
     --rounds-per-pair 2 --map-size 16 --game-mode nl --upload
   ```
```shell
# Google Gemini
node play_nl_offline.js --provider google --model gemini-2.0-flash \
  --api-key YOUR_KEY --opponent balanced --upload

# Via the web UI (RTSArena.html)
# Just play a game; results upload automatically
```
```shell
# Two local models head-to-head
node play_nl_offline.js --model llama3.1:8b \
  --p1-model qwen3:14b --p1-provider ollama --upload
```
RTS Arena supports 8×8, 16×16, and 32×32 maps. We ran identical 220-game round-robin tournaments on each size (11 bots, 4 games per pair, all starting at 1500) to determine whether map size changes the competitive hierarchy enough to warrant separate rating pools.
| Bot | Elo (8×8) | Elo (16×16) | Elo (32×32) | Rank (8×8) | Rank (16×16) | Rank (32×32) | Max Rank Shift vs 16×16 |
|---|---|---|---|---|---|---|---|
| Heavy Rush | 1757 | 1827 | 1744 | 2 | 1 | 2 | 1 |
| Ranged Plus | 1693 | 1707 | 1797 | 4 | 2 | 1 | 2 |
| Ranged Rush | 1694 | 1678 | 1718 | 3 | 3 | 3 | 0 |
| Turtle | 1533 | 1632 | 1585 | 5 | 4 | 4 | 1 |
| Mayari | 1781 | 1469 | 1519 | 1 | 6 | 5 | 5 |
| Balanced | 1497 | 1492 | 1507 | 6 | 5 | 6 | 1 |
| MCTS Bot | 1345 | 1415 | 1393 | 9 | 7 | 8 | 2 |
| Light Rush | 1444 | 1398 | 1394 | 7 | 9 | 7 | 2 |
| Random | 1389 | 1402 | 1344 | 8 | 8 | 9 | 1 |
| Worker Rush | 1245 | 1245 | 1292 | 10 | 10 | 10 | 0 |
| Economy Boom | 1185 | 1186 | 1168 | 11 | 11 | 11 | 0 |
| Comparison | Correlation | Interpretation |
|---|---|---|
| 16×16 vs 32×32 | 0.955 | Nearly identical rankings — same competitive hierarchy |
| 8×8 vs 32×32 | 0.873 | Moderate — some strategies shift |
| 8×8 vs 16×16 | 0.818 | Lowest — 8×8 plays differently |
We use a single Elo pool for all rated games. The eligibility policy allows only 16×16 and 32×32 maps. 8×8 games are fun for quick casual play, but they don't count for Elo because the map is too small for proper strategic differentiation (0.818 correlation with 16×16, versus 0.955 between 16×16 and 32×32).
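The published figures match the Spearman rank correlation of the rank columns to three decimals, so they can be checked in a few lines (the `spearman` helper below is illustrative, not the tournament runner's code):

```javascript
// Spearman rank correlation from two rank vectors (no ties here).
function spearman(ranksA, ranksB) {
  const n = ranksA.length;
  const d2 = ranksA.reduce((sum, r, i) => sum + (r - ranksB[i]) ** 2, 0);
  return 1 - (6 * d2) / (n * (n * n - 1));
}

// Per-map ranks from the table above, in table order
// (Heavy Rush, Ranged Plus, ..., Economy Boom).
const rank16 = [1, 2, 3, 4, 6, 5, 7, 9, 8, 10, 11];
const rank32 = [2, 1, 3, 4, 5, 6, 8, 7, 9, 10, 11];

spearman(rank16, rank32); // ≈ 0.955
```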
A single Elo number or win rate is a point estimate — it tells you the best guess, but nothing about how certain that guess is. With only 15 games, an 80% win rate and a 90% win rate may not be statistically distinguishable. The leaderboard shows confidence intervals to make this uncertainty visible.
On the leaderboard, the Elo and Win% columns show a semi-transparent bar behind the number. The bar represents the 95% confidence interval — the range within which the true value likely falls. A wider bar means less certainty; a narrower bar means the rating is well-established. Hover over either column to see the exact interval.
For win rates, we use the Wilson score interval, which is more accurate than a naive “plus or minus” formula, especially for small samples or extreme win rates near 0% or 100%.
The interval is:

( p̂ + z²/(2n) ± z·√( p̂(1 - p̂)/n + z²/(4n²) ) ) / ( 1 + z²/n )

where p̂ is the observed win rate, n is the number of games played, and z = 1.96 for 95% confidence. In plain English: given n games with W wins (p̂ = W/n), we are 95% confident the true win rate falls within this range.
| Win Rate | Games | 95% CI | Interpretation |
|---|---|---|---|
| 80% | 10 | 49% – 95% | Very uncertain — could be anywhere from coin-flip to dominant |
| 80% | 30 | 62% – 91% | Clearer picture, but still a wide range |
| 80% | 100 | 71% – 87% | Fairly precise — reliably strong player |
| 95% | 20 | 76% – 99% | High win rate, but only 20 games — lower bound is 76% |
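The rows above follow from a direct transcription of the Wilson formula (a sketch; the leaderboard's actual helper may be named and shaped differently):

```javascript
// 95% Wilson score interval for `wins` out of `games`.
function wilsonInterval(wins, games, z = 1.96) {
  const p = wins / games;
  const z2 = z * z;
  const denom = 1 + z2 / games;
  const center = (p + z2 / (2 * games)) / denom;
  const margin =
    (z / denom) * Math.sqrt((p * (1 - p)) / games + z2 / (4 * games * games));
  return [center - margin, center + margin];
}

wilsonInterval(80, 100); // ≈ [0.711, 0.867], the 80%-over-100-games row
```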
For Elo ratings, the interval is estimated from the volatility of recent rating changes. If a player’s rating swings wildly game to game, the interval is wide. If the rating is stable, the interval is narrow. For new players with few games, a wider default interval reflects the higher uncertainty in their rating.
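One plausible implementation of that idea, sketched under stated assumptions (the production method in elo-engine.js may differ; the ±120 default half-width for new players and the 20-game window are illustrative):

```javascript
// 95% interval around the current rating, from the spread of recent changes.
function eloInterval(ratingHistory, defaultHalfWidth = 120) {
  const current = ratingHistory[ratingHistory.length - 1];
  if (ratingHistory.length < 10) {
    // Too few games: fall back to a wide default interval.
    return [current - defaultHalfWidth, current + defaultHalfWidth];
  }
  const recent = ratingHistory.slice(-20);
  const deltas = recent.slice(1).map((r, i) => r - recent[i]); // per-game changes
  const mean = deltas.reduce((a, b) => a + b, 0) / deltas.length;
  const sd = Math.sqrt(
    deltas.reduce((a, d) => a + (d - mean) ** 2, 0) / deltas.length
  );
  return [current - 1.96 * sd, current + 1.96 * sd];
}
```

A perfectly stable history collapses to a zero-width interval, while a short history gets the wide default.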
Without confidence intervals, it’s tempting to over-interpret small differences. Two models at 1620 and 1580 Elo after 10 games each may have overlapping intervals of ±120 points — they could easily be equal in strength. The bars and tooltips help you see when a ranking difference is meaningful and when it’s just noise.
Elo is used worldwide, and well-established descendants such as Glicko and TrueSkill build on the same expected-score idea. Our rating engine lives at RTSArena/elo/elo-engine.js, the eligibility policy at elo/elo-policy.js, and the tournament runner at run-tournament.js.