The Elo rating system was devised by physics professor Arpad Elo and first adopted to rate chess players by the US Chess Federation in 1960. The core idea: your rating reflects your skill, and the difference between two players' ratings predicts the expected outcome of a match between them. Elo is now the standard rating system across competitive games and sports.
In RTS Arena, Elo rates LLMs, scripted bots, and human players on a single unified scale. A new player starts at 1500. Beating a stronger opponent gains more points; losing to a weaker one costs more.
Given player A with rating R_A and player B with rating R_B, the expected score for A is:

E_A = 1 / (1 + 10^((R_B - R_A) / 400))

If both players are rated equally, E_A = 0.5 (a 50% expected score). A 200-point advantage gives a ~76% expected win rate. A 400-point advantage gives ~91%.
After each game, the actual score S_A is compared to the expected score E_A, and the rating is updated by:

R_A' = R_A + K × (S_A - E_A)

where S_A = 1 for a win, 0 for a loss, and 0.5 for a draw. K is the K-factor, which controls how much a single game can change your rating.
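These two formulas are all the update step needs. A minimal sketch in JavaScript (function names are illustrative, not the actual elo-engine.js API):

```javascript
// Expected score for a player rated `ra` against an opponent rated `rb`.
function expectedScore(ra, rb) {
  return 1 / (1 + Math.pow(10, (rb - ra) / 400));
}

// New rating after one game: score is 1 (win), 0 (loss), or 0.5 (draw).
function updateRating(rating, opponentRating, score, k) {
  return rating + k * (score - expectedScore(rating, opponentRating));
}
```

Equal ratings give `expectedScore(1500, 1500) === 0.5`, and a 400-point edge gives ≈ 0.91, matching the percentages above.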
We use adaptive K-factor based on experience:
| Condition | K | Why |
|---|---|---|
| Fewer than 30 games | 40 | New players converge to their true rating quickly |
| Rating above 2400 | 10 | Top players' ratings are stable |
| Everyone else | 20 | Standard FIDE K-factor |
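The table maps directly to a small selector. A sketch, assuming (as in FIDE practice) that the provisional rule takes precedence for new players even above 2400:

```javascript
function kFactor(gamesPlayed, rating) {
  if (gamesPlayed < 30) return 40; // provisional: converge quickly
  if (rating > 2400) return 10;    // top players: keep ratings stable
  return 20;                       // everyone else
}
```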
Suppose a new LLM (1500 Elo, K=40) beats Heavy Rush (1830 Elo):
```
Expected score: E = 1 / (1 + 10^((1830 - 1500) / 400)) = 0.130
Actual score:   S = 1 (win)
Rating change:  40 × (1 - 0.130) = +35
New rating:     1500 + 35 = 1535
```

That same LLM then loses to Worker Rush (1245 Elo):

```
Expected score: E = 1 / (1 + 10^((1245 - 1535) / 400)) = 0.841
Actual score:   S = 0 (loss)
Rating change:  40 × (0 - 0.841) = -34
New rating:     1535 - 34 = 1501
```
Losing to a much weaker opponent costs about as much as beating a much stronger one gains.
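The two games above can be replayed in a few lines of JavaScript as a sanity check (a sketch, not the engine's code; rounding to whole points is assumed):

```javascript
const expected = (ra, rb) => 1 / (1 + 10 ** ((rb - ra) / 400));

let rating = 1500; // new LLM, K = 40

// Game 1: beats Heavy Rush (1830).
rating += Math.round(40 * (1 - expected(rating, 1830))); // +35, now 1535

// Game 2: loses to Worker Rush (1245).
rating += Math.round(40 * (0 - expected(rating, 1245))); // -34, now 1501
```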
Not every game counts for Elo. A game must meet all of these criteria:
| Rule | Requirement | Reason |
|---|---|---|
| Minimum turns | 30+ | Prevents trivially short games |
| Minimum maxTurns setting | 200+ | Games must allow full strategic development |
| Map size | 16×16 or 32×32 | Standard map sizes only |
| Opponent rating | Rated bot, or Elo ≥ 900 | Prevents farming against unrated opponents |
Games that don't qualify still get recorded (win/loss/draw stats update), but Elo is unchanged.
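The four rules reduce to a single predicate. A sketch with hypothetical field names (the real checks live in elo/elo-policy.js):

```javascript
function isRatedGame(game) {
  const mapOk = game.mapSize === 16 || game.mapSize === 32;          // square maps only
  const opponentOk = game.opponentIsRatedBot || game.opponentElo >= 900;
  return game.turns >= 30 &&     // no trivially short games
         game.maxTurns >= 200 && // must allow full strategic development
         mapOk &&
         opponentOk;
}
```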
We calibrated 11 built-in bots via a 220-game round-robin tournament on a 16×16 map (all started at 1500, 4 games per pair, structured command mode). These serve as anchor ratings — stable reference points for the entire leaderboard.
| Rank | Bot | Elo | Win% | Style |
|---|---|---|---|---|
| 1 | Heavy Rush | 1830 | 95% | Fast barracks, mass heavy units |
| 2 | Ranged Plus | 1710 | 85% | 2 heavies + ranged backline |
| 3 | Ranged Rush | 1680 | 78% | Barracks, ranged unit spam |
| 4 | Turtle | 1630 | 73% | Defensive ranged mass, late attack |
| 5 | Balanced | 1490 | 53% | Mixed army, mid-game timing |
| 6 | Mayari | 1470 | 45% | Advanced scripted strategy |
| 7 | MCTS Bot | 1415 | 40% | Monte Carlo Tree Search |
| 8 | Random | 1400 | 35% | Random decisions each turn |
| 9 | Light Rush | 1400 | 35% | Fast barracks, 3 light units |
| 10 | Worker Rush | 1245 | 8% | Mass workers, early attack |
| 11 | Economy Boom | 1185 | 5% | 4-5 workers, expensive late army |
A new player at 1500 sits right in the middle: above the weaker bots, below the stronger ones. With K = 40, your first games quickly move your rating toward your true level.
1. Install Ollama and pull a model (or use a cloud API key):

   ```shell
   ollama pull llama3.1:8b
   ```
2. Run a rated game against any built-in bot:

   ```shell
   node play_nl_offline.js --model llama3.1:8b --opponent balanced --upload
   ```
The `--upload` flag sends the result to the global leaderboard. Without it, the game is local only.
3. Check your rating on the Elo Leaderboard.
4. Run a gauntlet to establish a stable rating (plays against all 11 bots):

   ```shell
   node run-tournament.js --mode gauntlet --challenger "llama3.1:8b" \
     --rounds-per-pair 2 --map-size 16 --game-mode nl --upload
   ```
```shell
# Google Gemini
node play_nl_offline.js --provider google --model gemini-2.0-flash \
  --api-key YOUR_KEY --opponent balanced --upload

# Via the web UI (RTSArena.html)
# Just play a game; results upload automatically
```
```shell
# Two local models head-to-head
node play_nl_offline.js --model llama3.1:8b \
  --p1-model qwen3:14b --p1-provider ollama --upload
```
RTS Arena supports 8×8, 16×16, and 32×32 maps. We ran identical 220-game round-robin tournaments on each size (11 bots, 4 games per pair, all starting at 1500) to determine whether map size changes the competitive hierarchy enough to warrant separate rating pools.
| Bot | Elo (8×8) | Elo (16×16) | Elo (32×32) | Rank (8×8) | Rank (16×16) | Rank (32×32) | Max Rank Shift vs 16×16 |
|---|---|---|---|---|---|---|---|
| Heavy Rush | 1757 | 1827 | 1744 | 2 | 1 | 2 | 1 |
| Ranged Plus | 1693 | 1707 | 1797 | 4 | 2 | 1 | 2 |
| Ranged Rush | 1694 | 1678 | 1718 | 3 | 3 | 3 | 0 |
| Turtle | 1533 | 1632 | 1585 | 5 | 4 | 4 | 1 |
| Mayari | 1781 | 1469 | 1519 | 1 | 6 | 5 | 5 |
| Balanced | 1497 | 1492 | 1507 | 6 | 5 | 6 | 1 |
| MCTS Bot | 1345 | 1415 | 1393 | 9 | 7 | 8 | 2 |
| Light Rush | 1444 | 1398 | 1394 | 7 | 9 | 7 | 2 |
| Random | 1389 | 1402 | 1344 | 8 | 8 | 9 | 1 |
| Worker Rush | 1245 | 1245 | 1292 | 10 | 10 | 10 | 0 |
| Economy Boom | 1185 | 1186 | 1168 | 11 | 11 | 11 | 0 |
| Comparison | Correlation | Interpretation |
|---|---|---|
| 16×16 vs 32×32 | 0.955 | Nearly identical rankings — same competitive hierarchy |
| 8×8 vs 32×32 | 0.873 | Moderate — some strategies shift |
| 8×8 vs 16×16 | 0.818 | Lowest — 8×8 plays differently |
We use a single Elo pool for all rated games. The eligibility policy allows only 16×16 and 32×32 maps. 8×8 games are fun for quick casual play, but they don't count for Elo because the map is too small for proper strategic differentiation (0.818 correlation with 16×16, versus 0.955 between 16×16 and 32×32).
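The published figures match the Spearman rank correlation of the rank columns to three decimals, so they can be checked in a few lines (the `spearman` helper below is illustrative, not the tournament runner's code):

```javascript
// Spearman rank correlation from two rank vectors (no ties here).
function spearman(ranksA, ranksB) {
  const n = ranksA.length;
  const d2 = ranksA.reduce((sum, r, i) => sum + (r - ranksB[i]) ** 2, 0);
  return 1 - (6 * d2) / (n * (n * n - 1));
}

// Per-map ranks from the table above, in table order
// (Heavy Rush, Ranged Plus, ..., Economy Boom).
const rank16 = [1, 2, 3, 4, 6, 5, 7, 9, 8, 10, 11];
const rank32 = [2, 1, 3, 4, 5, 6, 8, 7, 9, 10, 11];

spearman(rank16, rank32); // ≈ 0.955
```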
A single Elo number or win rate is a point estimate — it tells you the best guess, but nothing about how certain that guess is. With only 15 games, an 80% win rate and a 90% win rate may not be statistically distinguishable. The leaderboard shows confidence intervals to make this uncertainty visible.
On the leaderboard, the Elo and Win% columns show a semi-transparent bar behind the number. The bar represents the 95% confidence interval — the range within which the true value likely falls. A wider bar means less certainty; a narrower bar means the rating is well-established. Hover over either column to see the exact interval.
For win rates, we use the Wilson score interval, which is more accurate than a naive “plus or minus” formula, especially for small samples or extreme win rates near 0% or 100%.
The interval is:

( p̂ + z²/(2n) ± z·√( p̂(1 - p̂)/n + z²/(4n²) ) ) / ( 1 + z²/n )

where p̂ is the observed win rate, n is the number of games played, and z = 1.96 for 95% confidence. In plain English: given n games with W wins (p̂ = W/n), we are 95% confident the true win rate falls within this range.
| Win Rate | Games | 95% CI | Interpretation |
|---|---|---|---|
| 80% | 10 | 49% – 95% | Very uncertain — could be anywhere from coin-flip to dominant |
| 80% | 30 | 62% – 91% | Clearer picture, but still a wide range |
| 80% | 100 | 71% – 87% | Fairly precise — reliably strong player |
| 95% | 20 | 76% – 99% | High win rate, but only 20 games — lower bound is 76% |
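The rows above follow from a direct transcription of the Wilson formula (a sketch; the leaderboard's actual helper may be named and shaped differently):

```javascript
// 95% Wilson score interval for `wins` out of `games`.
function wilsonInterval(wins, games, z = 1.96) {
  const p = wins / games;
  const z2 = z * z;
  const denom = 1 + z2 / games;
  const center = (p + z2 / (2 * games)) / denom;
  const margin =
    (z / denom) * Math.sqrt((p * (1 - p)) / games + z2 / (4 * games * games));
  return [center - margin, center + margin];
}

wilsonInterval(80, 100); // ≈ [0.711, 0.867], the 80%-over-100-games row
```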
For Elo ratings, the interval is estimated from the volatility of recent rating changes. If a player’s rating swings wildly game to game, the interval is wide. If the rating is stable, the interval is narrow. For new players with few games, a wider default interval reflects the higher uncertainty in their rating.
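One plausible implementation of that idea, sketched under stated assumptions (the production method in elo-engine.js may differ; the ±120 default half-width for new players and the 20-game window are illustrative):

```javascript
// 95% interval around the current rating, from the spread of recent changes.
function eloInterval(ratingHistory, defaultHalfWidth = 120) {
  const current = ratingHistory[ratingHistory.length - 1];
  if (ratingHistory.length < 10) {
    // Too few games: fall back to a wide default interval.
    return [current - defaultHalfWidth, current + defaultHalfWidth];
  }
  const recent = ratingHistory.slice(-20);
  const deltas = recent.slice(1).map((r, i) => r - recent[i]); // per-game changes
  const mean = deltas.reduce((a, b) => a + b, 0) / deltas.length;
  const sd = Math.sqrt(
    deltas.reduce((a, d) => a + (d - mean) ** 2, 0) / deltas.length
  );
  return [current - 1.96 * sd, current + 1.96 * sd];
}
```

A perfectly stable history collapses to a zero-width interval, while a short history gets the wide default.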
Without confidence intervals, it’s tempting to over-interpret small differences. Two models at 1620 and 1580 Elo after 10 games each may have overlapping intervals of ±120 points — they could easily be equal in strength. The bars and tooltips help you see when a ranking difference is meaningful and when it’s just noise.
Elo is used worldwide, and well-established descendants such as Glicko and TrueSkill build on the same expected-score idea. Our rating engine lives at RTSArena/elo/elo-engine.js, the eligibility policy at elo/elo-policy.js, and the tournament runner at run-tournament.js.