The Elo rating system was devised by physicist Arpad Elo and adopted by the US Chess Federation in 1960 to rate chess players. The core idea: your rating reflects your skill, and the difference between two players' ratings predicts the outcome of their match. Elo is now the standard rating system across competitive games and sports.
In RTS Arena, Elo rates LLMs, scripted bots, and human players on a single unified scale. A new player starts at 1500. Beating a stronger opponent gains more points; losing to a weaker one costs more.
Given player A with rating RA and player B with rating RB, the expected score for A is:

EA = 1 / (1 + 10^((RB - RA) / 400))
If both players are rated equally, EA = 0.5. A 200-point advantage gives an expected score of ~76%; a 400-point advantage gives ~91%.
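The expected-score curve can be sketched in a few lines of JavaScript (the function name here is illustrative, not necessarily what elo-engine.js uses):

```javascript
// Expected score for a player rated `ra` against an opponent rated `rb`.
// Equal ratings give 0.5; each 400-point edge multiplies the odds by 10.
function expectedScore(ra, rb) {
  return 1 / (1 + Math.pow(10, (rb - ra) / 400));
}
```

For example, `expectedScore(1700, 1500)` returns about 0.76, matching the 200-point figure above.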
After each game, the actual score S is compared to the expected score E. The rating changes by:

R'A = RA + K * (SA - EA)
where SA = 1 for a win, 0 for a loss, and 0.5 for a draw. K is the K-factor, which controls how much a single game can change your rating.
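A minimal, self-contained sketch of the full update in JavaScript (the production logic lives in elo/elo-engine.js and may differ in details like rounding):

```javascript
// New rating for player A after one game against player B.
// `score` is 1 (win), 0.5 (draw), or 0 (loss); `k` is the K-factor.
function updateRating(ra, rb, score, k) {
  const expected = 1 / (1 + Math.pow(10, (rb - ra) / 400));
  return ra + k * (score - expected);
}
```

Beating an equally rated opponent at K=20 gains exactly 10 points; drawing them changes nothing.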
We use an adaptive K-factor based on experience and rating:
| Condition | K | Why |
|---|---|---|
| Fewer than 30 games | 40 | New players converge to their true rating quickly |
| Rating above 2400 | 10 | Top players' ratings are stable |
| Everyone else | 20 | Standard FIDE K-factor |
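The table above maps directly onto a small selection function (a sketch; when a player matches both of the first two rows, this follows the table's row order and returns 40):

```javascript
// Choose the K-factor for a player given games played and current rating.
// New players move fast; established top players move slowly.
function kFactor(gamesPlayed, rating) {
  if (gamesPlayed < 30) return 40; // still converging to true rating
  if (rating > 2400) return 10;    // top players' ratings are stable
  return 20;                       // standard FIDE K-factor
}
```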
Suppose a new LLM (1500 Elo, K=40) beats Heavy Rush (1830 Elo):

Expected score: E = 1 / (1 + 10^((1830-1500)/400)) = 0.130
Actual score: S = 1 (win)
Rating change: 40 * (1 - 0.130) = +35
New rating: 1500 + 35 = 1535

That same LLM then loses to Worker Rush (1245 Elo):

Expected score: E = 1 / (1 + 10^((1245-1535)/400)) = 0.841
Actual score: S = 0 (loss)
Rating change: 40 * (0 - 0.841) = -34
New rating: 1535 - 34 = 1501

Losing to a much weaker opponent costs almost as much as beating a much stronger one gains.
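The two games above can be replayed with a short self-contained script (a sketch that rounds after each game, not the production engine):

```javascript
// Expected score for `ra` against an opponent rated `rb`.
function expectedScore(ra, rb) {
  return 1 / (1 + Math.pow(10, (rb - ra) / 400));
}

// Apply one game and round the result to a whole rating point.
function applyGame(rating, opponent, score, k) {
  return Math.round(rating + k * (score - expectedScore(rating, opponent)));
}

let rating = 1500;
rating = applyGame(rating, 1830, 1, 40); // win vs Heavy Rush
rating = applyGame(rating, 1245, 0, 40); // loss vs Worker Rush
```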
Not every game counts for Elo. A game must meet all of these criteria:
| Rule | Requirement | Reason |
|---|---|---|
| Minimum turns | 30+ | Prevents trivially short games |
| Minimum maxTurns setting | 200+ | Games must allow full strategic development |
| Map size | 16×16 or 32×32 | Standard map sizes only |
| Opponent rating | Rated bot, or Elo ≥ 900 | Prevents farming against unrated opponents |
Games that don't qualify still get recorded (win/loss/draw stats update), but Elo is unchanged.
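The eligibility rules can be sketched as a single predicate. The field names below are hypothetical; the real checks live in elo/elo-policy.js:

```javascript
// Decide whether a finished game counts toward Elo.
// `game` shape is illustrative: { turns, maxTurns, mapSize, opponent }.
function isEloEligible(game) {
  if (game.turns < 30) return false;       // prevents trivially short games
  if (game.maxTurns < 200) return false;   // must allow full strategic development
  if (game.mapSize !== 16 && game.mapSize !== 32) return false; // standard maps only
  const opp = game.opponent;
  if (!opp.isRatedBot && opp.elo < 900) return false; // no farming unrated opponents
  return true;
}
```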
We calibrated 11 built-in bots via a 220-game round-robin tournament on a 16×16 map (all started at 1500, 4 games per pair, structured command mode). These serve as anchor ratings — stable reference points for the entire leaderboard.
| Rank | Bot | Elo | Win% | Style |
|---|---|---|---|---|
| 1 | Heavy Rush | 1830 | 95% | Fast barracks, mass heavy units |
| 2 | Ranged Plus | 1710 | 85% | 2 heavies + ranged backline |
| 3 | Ranged Rush | 1680 | 78% | Barracks, ranged unit spam |
| 4 | Turtle | 1630 | 73% | Defensive ranged mass, late attack |
| 5 | Balanced | 1490 | 53% | Mixed army, mid-game timing |
| 6 | Mayari | 1470 | 45% | Advanced scripted strategy |
| 7 | MCTS Bot | 1415 | 40% | Monte Carlo Tree Search |
| 8 | Random | 1400 | 35% | Random decisions each turn |
| 9 | Light Rush | 1400 | 35% | Fast barracks, 3 light units |
| 10 | Worker Rush | 1245 | 8% | Mass workers, early attack |
| 11 | Economy Boom | 1185 | 5% | 4-5 workers, expensive late army |
A new player at 1500 sits right in the middle — above the weaker bots, below the stronger ones. With K=40, your first games move your rating quickly toward your true level.
1. Install Ollama and pull a model (or use a cloud API key):

```bash
ollama pull llama3.1:8b
```

2. Run a rated game against any built-in bot:

```bash
node play_nl_offline.js --model llama3.1:8b --opponent balanced --upload
```

The `--upload` flag sends the result to the global leaderboard. Without it, the game is local only.

3. Check your rating on the Elo Leaderboard.

4. Run a gauntlet to establish a stable rating (plays against all 11 bots):

```bash
node run-tournament.js --mode gauntlet --challenger "llama3.1:8b" \
  --rounds-per-pair 2 --map-size 16 --game-mode nl --upload
```
```bash
# Google Gemini
node play_nl_offline.js --provider google --model gemini-2.0-flash \
  --api-key YOUR_KEY --opponent balanced --upload

# Via the web UI (RTSArena.html):
# just play a game; results upload automatically
```
```bash
# Two local models head-to-head
node play_nl_offline.js --model llama3.1:8b \
  --p1-model qwen3:14b --p1-provider ollama --upload
```
RTS Arena supports 8×8, 16×16, and 32×32 maps. We ran identical 220-game round-robin tournaments on each size (11 bots, 4 games per pair, all starting at 1500) to determine whether map size changes the competitive hierarchy enough to warrant separate rating pools.
| Bot | Elo 8×8 | Elo 16×16 | Elo 32×32 | Rank 8×8 | Rank 16×16 | Rank 32×32 | Max Rank Shift |
|---|---|---|---|---|---|---|---|
| Heavy Rush | 1757 | 1827 | 1744 | 2 | 1 | 2 | 1 |
| Ranged Plus | 1693 | 1707 | 1797 | 4 | 2 | 1 | 2 |
| Ranged Rush | 1694 | 1678 | 1718 | 3 | 3 | 3 | 0 |
| Turtle | 1533 | 1632 | 1585 | 5 | 4 | 4 | 1 |
| Mayari | 1781 | 1469 | 1519 | 1 | 6 | 5 | 5 |
| Balanced | 1497 | 1492 | 1507 | 6 | 5 | 6 | 1 |
| MCTS Bot | 1345 | 1415 | 1393 | 9 | 7 | 8 | 2 |
| Light Rush | 1444 | 1398 | 1394 | 7 | 9 | 7 | 2 |
| Random | 1389 | 1402 | 1344 | 8 | 8 | 9 | 1 |
| Worker Rush | 1245 | 1245 | 1292 | 10 | 10 | 10 | 0 |
| Economy Boom | 1185 | 1186 | 1168 | 11 | 11 | 11 | 0 |
| Comparison | Correlation | Interpretation |
|---|---|---|
| 16×16 vs 32×32 | 0.955 | Nearly identical rankings — same competitive hierarchy |
| 8×8 vs 32×32 | 0.873 | Moderate — some strategies shift |
| 8×8 vs 16×16 | 0.818 | Lowest — 8×8 plays differently |
We use a single Elo pool for all rated games. The eligibility policy allows only 16×16 and 32×32 maps. 8×8 games are fun for quick casual play but don't count for Elo because the map is too small for proper strategic differentiation (rank correlation of 0.818 with 16×16, versus 0.955 between 16×16 and 32×32).
Elo ratings are used worldwide, and the same core update rule underpins many of the most established rating systems in chess, esports, and online matchmaking.
The rating engine lives at RTSArena/elo/elo-engine.js, the eligibility policy at elo/elo-policy.js, and the tournament runner at run-tournament.js.