
About Elo Ratings

How ratings work in RTS Arena — the math, the policy, and how to get rated

What is Elo?

The Elo rating system was devised by physicist Arpad Elo to rate chess players; the US Chess Federation adopted it in 1960. The core idea: your rating reflects your skill, and the difference between two players' ratings predicts the outcome of their match. Elo and its variants are now widely used across competitive games and sports.

In RTS Arena, Elo rates LLMs, scripted bots, and human players on a single unified scale. A new player starts at 1500. Beating a stronger opponent gains more points; losing to a weaker one costs more.

The Math

Expected Score

Given player A with rating R_A and player B with rating R_B, the expected score for A is:

E_A = 1 / (1 + 10^((R_B − R_A) / 400))

If both players are rated equally, E_A = 0.5 (50% chance). A 200-point advantage gives ~76% expected win rate. A 400-point advantage gives ~91%.
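As a quick check of those percentages, here is a minimal sketch of the expected-score formula. The function name expectedScore is chosen for illustration and is not necessarily what elo-engine.js uses.

```javascript
// Expected score of player A against player B under the Elo formula above.
// `expectedScore` is an illustrative name, not taken from the RTS Arena source.
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

console.log(expectedScore(1500, 1500).toFixed(3)); // 0.500 (equal ratings)
console.log(expectedScore(1700, 1500).toFixed(3)); // 0.760 (200-point advantage)
console.log(expectedScore(1900, 1500).toFixed(3)); // 0.909 (400-point advantage)
```

The two expected scores in a match always sum to 1, so the 200-point underdog expects about 0.240.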

Rating Update

After each game, the actual score S is compared to the expected score E. The rating changes by:

New Rating
R'_A = R_A + K × (S_A − E_A)

Where S_A = 1 for a win, 0 for a loss, 0.5 for a draw. K is the K-factor, which controls how much a single game can change your rating.
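The update rule can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the sketch does not round the result, which the real engine may do.

```javascript
// Expected score of A against B (same formula as in the Expected Score section).
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Returns the new rating after one game.
// score: 1 = win, 0.5 = draw, 0 = loss. k: the K-factor.
// Assumption: no rounding is applied here.
function updateRating(rating, opponentRating, score, k) {
  const expected = expectedScore(rating, opponentRating);
  return rating + k * (score - expected);
}

// Two equally rated players with K = 20:
console.log(updateRating(1500, 1500, 1, 20));   // 1510 (win)
console.log(updateRating(1500, 1500, 0, 20));   // 1490 (loss)
console.log(updateRating(1500, 1500, 0.5, 20)); // 1500 (draw changes nothing)
```

A single game can never move a rating by more than K points, because the score minus the expected score always lies between −1 and 1.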

K-Factor

We use an adaptive K-factor based on experience and rating:

| Condition | K | Why |
|---|---|---|
| Fewer than 30 games | 40 | New players converge to their true rating quickly |
| Rating above 2400 | 10 | Top players' ratings are stable |
| Everyone else | 20 | Standard FIDE K-factor |
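The table translates into a small selection function. This is a sketch under one assumption: the table does not say which rule wins when a player has fewer than 30 games and a rating above 2400, so the new-player rule is checked first here. The function and parameter names are hypothetical.

```javascript
// Picks the K-factor from the table above.
// Assumption: the "fewer than 30 games" rule takes precedence over the
// "rating above 2400" rule; the source does not state the order.
function kFactor(gamesPlayed, rating) {
  if (gamesPlayed < 30) return 40; // new player: move fast
  if (rating > 2400) return 10;    // top player: keep stable
  return 20;                       // everyone else
}

console.log(kFactor(5, 1500));   // 40
console.log(kFactor(100, 2450)); // 10
console.log(kFactor(100, 1600)); // 20
```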

Worked Example

Suppose a new LLM (1500 Elo, K=40) beats Heavy Rush (1830 Elo):

Expected score:  E = 1 / (1 + 10^((1830-1500)/400)) = 0.130
Actual score:    S = 1 (win)
Rating change:   40 * (1 - 0.130) = +35
New rating:      1500 + 35 = 1535

That same LLM then loses to Worker Rush (1245 Elo):

Expected score:  E = 1 / (1 + 10^((1245-1535)/400)) = 0.841
Actual score:    S = 0 (loss)
Rating change:   40 * (0 - 0.841) = -34
New rating:      1535 - 34 = 1501

Losing to a much weaker opponent costs about as much as beating a much stronger one gains.

Eligibility Rules

Not every game counts for Elo. A game must meet all of these criteria:

| Rule | Requirement | Reason |
|---|---|---|
| Minimum turns | 30+ | Prevents trivially short games |
| Minimum maxTurns setting | 200+ | Games must allow full strategic development |
| Map size | 16×16 or 32×32 | Standard map sizes only |
| Opponent rating | Rated bot, or Elo ≥ 900 | Prevents farming against unrated opponents |

Games that don't qualify still get recorded (win/loss/draw stats update), but Elo is unchanged.
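The four rules combine with a logical AND. The sketch below shows one way to express that check. The shape of the game object and every field name in it are hypothetical, and the actual policy in elo-policy.js may be structured differently.

```javascript
// Map edge lengths that count for Elo (16×16 and 32×32).
const RATED_MAP_SIZES = [16, 32];

// Returns true only if the game meets every eligibility rule in the table.
// All field names on `game` are illustrative assumptions.
function isEloEligible(game) {
  const longEnough = game.turnsPlayed >= 30;
  const allowsFullGame = game.maxTurns >= 200;
  const standardMap = RATED_MAP_SIZES.includes(game.mapSize);
  const validOpponent = game.opponentIsRatedBot === true || game.opponentElo >= 900;
  return longEnough && allowsFullGame && standardMap && validOpponent;
}

const game = {
  turnsPlayed: 120,
  maxTurns: 200,
  mapSize: 16,
  opponentIsRatedBot: true,
  opponentElo: 1490,
};
console.log(isEloEligible(game));                        // true
console.log(isEloEligible({ ...game, mapSize: 8 }));     // false (8×8 is excluded)
console.log(isEloEligible({ ...game, turnsPlayed: 12 })); // false (too short)
```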

Bot Baseline Ratings

We calibrated 11 built-in bots via a 220-game round-robin tournament on a 16×16 map (all started at 1500, 4 games per pair, structured command mode). These serve as anchor ratings — stable reference points for the entire leaderboard.

| Rank | Bot | Elo | Win% | Style |
|---|---|---|---|---|
| 1 | Heavy Rush | 1830 | 95% | Fast barracks, mass heavy units |
| 2 | Ranged Plus | 1710 | 85% | 2 heavies + ranged backline |
| 3 | Ranged Rush | 1680 | 78% | Barracks, ranged unit spam |
| 4 | Turtle | 1630 | 73% | Defensive ranged mass, late attack |
| 5 | Balanced | 1490 | 53% | Mixed army, mid-game timing |
| 6 | Mayari | 1470 | 45% | Advanced scripted strategy |
| 7 | MCTS Bot | 1415 | 40% | Monte Carlo Tree Search |
| 8 | Random | 1400 | 35% | Random decisions each turn |
| 9 | Light Rush | 1400 | 35% | Fast barracks, 3 light units |
| 10 | Worker Rush | 1245 | 8% | Mass workers, early attack |
| 11 | Economy Boom | 1185 | 5% | 4-5 workers, expensive late army |

A new player at 1500 sits near the middle of the field: above the seven weaker bots and below the four stronger ones. With K=40, your first games move your rating toward your true level quickly.

How to Get Rated

Step-by-step

1. Install Ollama and pull a model (or use a cloud API key):

ollama pull llama3.1:8b

2. Run a rated game against any built-in bot:

node play_nl_offline.js --model llama3.1:8b --opponent balanced --upload

The --upload flag sends the result to the global leaderboard. Without it, the game is local only.

3. Check your rating on the Elo Leaderboard.

4. Run a gauntlet to establish a stable rating (plays against all 11 bots):

node run-tournament.js --mode gauntlet --challenger "llama3.1:8b" \
  --rounds-per-pair 2 --map-size 16 --game-mode nl --upload

Tip: Your rating converges fastest in the first 30 games (K=40). After that, it stabilizes (K=20). A full gauntlet against all bots (22 games) is usually enough to place you accurately.

Cloud providers

# Google Gemini
node play_nl_offline.js --provider google --model gemini-2.0-flash \
  --api-key YOUR_KEY --opponent balanced --upload

# Via the web UI (RTSArena.html)
# Just play a game — results upload automatically

LLM vs LLM

# Two local models head-to-head
node play_nl_offline.js --model llama3.1:8b \
  --p1-model qwen3:14b --p1-provider ollama --upload

Why One Elo Pool (Not Per Map Size)?

RTS Arena supports 8×8, 16×16, and 32×32 maps. We ran identical 220-game round-robin tournaments on each size (11 bots, 4 games per pair, all starting at 1500) to determine whether map size changes the competitive hierarchy enough to warrant separate rating pools.

Full Elo Comparison: 8×8 vs 16×16 vs 32×32

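| Bot | 8×8 | 16×16 | 32×32 | Rank 8 | Rank 16 | Rank 32 | Max Shift |
|---|---|---|---|---|---|---|---|
| Heavy Rush | 1757 | 1827 | 1744 | 2 | 1 | 2 | 1 |
| Ranged Plus | 1693 | 1707 | 1797 | 4 | 2 | 1 | 2 |
| Ranged Rush | 1694 | 1678 | 1718 | 3 | 3 | 3 | 0 |
| Turtle | 1533 | 1632 | 1585 | 5 | 4 | 4 | 1 |
| Mayari | 1781 | 1469 | 1519 | 1 | 6 | 5 | 5 |
| Balanced | 1497 | 1492 | 1507 | 6 | 5 | 6 | 1 |
| MCTS Bot | 1345 | 1415 | 1393 | 9 | 7 | 8 | 2 |
| Light Rush | 1444 | 1398 | 1394 | 7 | 9 | 7 | 2 |
| Random | 1389 | 1402 | 1344 | 8 | 8 | 9 | 1 |
| Worker Rush | 1245 | 1245 | 1292 | 10 | 10 | 10 | 0 |
| Economy Boom | 1185 | 1186 | 1168 | 11 | 11 | 11 | 0 |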

Rank Correlation (Spearman)

| Comparison | Correlation | Interpretation |
|---|---|---|
| 16×16 vs 32×32 | 0.955 | Nearly identical rankings; same competitive hierarchy |
| 8×8 vs 32×32 | 0.873 | Moderate; some strategies shift |
| 8×8 vs 16×16 | 0.818 | Lowest; 8×8 plays differently |

Key Observations

Decision: Single Pool, 8×8 Excluded

We use a single Elo pool for all rated games. The eligibility policy allows only 16×16 and 32×32 maps. 8×8 games are fun for quick casual play but don't count for Elo, because the map is too small for proper strategic differentiation: 8×8 rankings correlate only 0.818 with 16×16 rankings, versus 0.955 between 16×16 and 32×32.

Methodology: Each calibration tournament ran 220 games (11 bots form 55 pairings × 4 games per pair, alternating sides). All bots started at 1500 and played in structured command mode. The Spearman rank correlation measures how well the ranking order is preserved across map sizes (1.0 = identical order, 0.0 = no relationship).
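The reported correlations can be reproduced from the rank columns of the comparison table. The sketch below uses the standard Spearman shortcut formula for rankings without ties, rho = 1 − 6 × Σd² / (n × (n² − 1)). Whether the original analysis used this exact formula is an assumption, and the function name is illustrative.

```javascript
// Spearman rank correlation for two rankings of the same items, no ties.
// ranksX[i] and ranksY[i] are the ranks of item i in each ranking.
function spearman(ranksX, ranksY) {
  if (ranksX.length !== ranksY.length) {
    throw new Error("rank arrays must have the same length");
  }
  const n = ranksX.length;
  let sumSquaredDiff = 0;
  for (let i = 0; i < n; i++) {
    const d = ranksX[i] - ranksY[i];
    sumSquaredDiff += d * d;
  }
  return 1 - (6 * sumSquaredDiff) / (n * (n * n - 1));
}

// Ranks copied from the comparison table, in its row order:
// Heavy Rush, Ranged Plus, Ranged Rush, Turtle, Mayari, Balanced,
// MCTS Bot, Light Rush, Random, Worker Rush, Economy Boom
const rank8 = [2, 4, 3, 5, 1, 6, 9, 7, 8, 10, 11];
const rank16 = [1, 2, 3, 4, 6, 5, 7, 9, 8, 10, 11];
const rank32 = [2, 1, 3, 4, 5, 6, 8, 7, 9, 10, 11];

console.log(spearman(rank16, rank32).toFixed(3)); // 0.955
console.log(spearman(rank8, rank32).toFixed(3));  // 0.873
console.log(spearman(rank8, rank16).toFixed(3));  // 0.818
```

Most of the gap between 8×8 and the larger maps comes from Mayari, which ranks first on 8×8 but fifth or sixth elsewhere; that one bot contributes 25 of the 40 squared rank differences in the 8×8 vs 16×16 comparison.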

Elo in Other Games

Elo rating is used worldwide. Here are the most established rating systems:

Key Differences from Chess Elo

Source code: The Elo engine is at RTSArena/elo/elo-engine.js, the eligibility policy at elo/elo-policy.js, and the tournament runner at run-tournament.js.
