arXiv ??/??

Seoul National University
* These authors contributed equally to this work

Illustration of our proposed RTSGameBench

Abstract

This paper introduces RTSGameBench, a benchmark for evaluating strategic reasoning in Vision-Language Models (VLMs) using large-scale RTS games. It focuses on long-horizon planning, multi-agent coordination, opponent modeling, and decision-making under uncertainty.

  1. RTSGameBench. We propose a new RTS benchmark built on Beyond All Reason that evaluates VLMs through both diverse full-game matchups and diagnostic mini-games in large-scale strategic environments.
  2. Self-Evolving Game Generation. We introduce a self-evolving multi-agent framework that automatically generates new diagnostic games from free-form user queries while improving its efficiency and quality over successive cycles.
  3. RTSGameAgent. We also provide a baseline agent with group control and memory management that enables VLMs to operate in large-scale RTS gameplay.
  4. Findings. Experiments show that current VLMs still struggle as coordination demands, task scale, and the number of involved agents increase.
BENCHMARK

RTSGameBench

A benchmark suite spanning full games, diagnostic mini-games, and self-evolving scenario generation.

(1) Full-game evaluation across diverse matchup structures, (2) diagnostic mini-games that isolate specific competencies, and (3) a self-evolving game generation framework that converts free-form queries into new tests via multi-agent collaboration.

Overview of RTSGameBench and its three benchmark components

1) Evaluation Settings in RTSGameBench

The top shows full-game matchups with different player configurations, each reflecting distinct strategic demands. The bottom presents mini-games targeting specific RTS competencies, with fog-of-war (FoW) added only when partial observability is essential. Build, Prod., and Move denote building construction, unit production, and unit movement, respectively.
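The three command families referenced above (Build, Prod., Move) can be sketched as a small action schema. This is an illustrative sketch only; the class and field names below are assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ActionType(Enum):
    """Core RTS action families named in the evaluation settings."""
    BUILD = "build"  # building construction
    PROD = "prod"    # unit production
    MOVE = "move"    # unit movement


@dataclass
class Command:
    """A single hypothetical command an evaluated agent could emit."""
    action: ActionType
    target: str                              # building/unit type, or a group id for MOVE
    position: Optional[Tuple[int, int]] = None  # map coordinates where relevant


# Example: queue a factory, produce a tank, relocate a unit group.
plan = [
    Command(ActionType.BUILD, "factory", (40, 12)),
    Command(ActionType.PROD, "tank"),
    Command(ActionType.MOVE, "group_alpha", (85, 60)),
]
```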

2) Diagnostic Mini-Games

Full-game evaluation mixes multiple strategic skills at once, making it hard to identify specific weaknesses. To address this, we introduce mini-games that isolate individual RTS competencies under controlled settings, enabling more fine-grained evaluation.

(a) Resource Management — Time-Constrained Production (TCP).
(b) Spatial & Temporal Reasoning — Multi-Front Defense (MFD).
(c) Opponent Modeling — Fixed-Field Skirmish: Free-for-All (FS-F).
(d) Collaboration — Fixed-Field Skirmish: Team (FS-T).
(e) Adversarial Planning — Siege Planning (SP).
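As a rough illustration, the mini-game suite could be registered as a small config table. The field names and the per-game fog-of-war flags below are assumptions for illustration only; the text states only that FoW is enabled when partial observability is essential.

```python
# Hypothetical registry of the five diagnostic mini-games.
# fog_of_war values are illustrative assumptions, not the paper's settings.
MINI_GAMES = {
    "TCP":  {"name": "Time-Constrained Production",        "competency": "resource management",           "fog_of_war": False},
    "MFD":  {"name": "Multi-Front Defense",                "competency": "spatial & temporal reasoning",  "fog_of_war": False},
    "FS-F": {"name": "Fixed-Field Skirmish: Free-for-All", "competency": "opponent modeling",             "fog_of_war": True},
    "FS-T": {"name": "Fixed-Field Skirmish: Team",         "competency": "collaboration",                 "fog_of_war": True},
    "SP":   {"name": "Siege Planning",                     "competency": "adversarial planning",          "fog_of_war": True},
}

# Example lookup: which games exercise opponent modeling?
opponent_modeling = [k for k, v in MINI_GAMES.items()
                     if v["competency"] == "opponent modeling"]
```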

3) Self-Evolving Game Generation Framework


Generation pipeline

Given a user query, the framework generates a new game in four stages under project manager control. The designer first creates a scenario brief, then expands it into a full GDD. Next, the developer retrieves or implements rule scripts, and finally assembles the executable game with the required assets and configurations. At each stage, the analyst validates the output through rubric-based checks and simulations, while validated artifacts are stored in a shared knowledge database for future reuse.
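The four-stage pipeline above can be sketched as a validated chain of artifacts. The agent interfaces (`draft_brief`, `expand_to_gdd`, `build_rules`, `assemble`, `validate`) are hypothetical names for this sketch, not the framework's actual code.

```python
def generate_game(query, designer, developer, analyst, knowledge_db):
    """Sketch of the four-stage generation pipeline under analyst validation."""
    stages = [
        ("brief", lambda art: designer.draft_brief(query)),   # scenario brief
        ("gdd",   lambda art: designer.expand_to_gdd(art)),   # full game design doc
        ("rules", lambda art: developer.build_rules(art, knowledge_db)),  # rule scripts
        ("game",  lambda art: developer.assemble(art)),       # executable game
    ]
    artifact = None
    for name, step in stages:
        artifact = step(artifact)
        # Rubric-based checks and simulations gate every stage.
        if not analyst.validate(name, artifact):
            raise ValueError(f"stage {name!r} failed validation")
        # Validated artifacts are stored for future reuse.
        knowledge_db[name] = artifact
    return artifact
```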


Self-evolution mechanisms

The framework evolves through two mechanisms: a shared knowledge database that reuses validated GDDs and rule sets, and retrospective analysis that refines the analyst’s rubrics after each successful generation. Together, these make RTSGameBench a continuously extensible evaluation platform.
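A minimal sketch of the two self-evolution mechanisms, assuming hypothetical `KnowledgeDB` and `Analyst` interfaces (the real framework's internals are not published in this section):

```python
class KnowledgeDB:
    """Shared store of validated artifacts (GDDs, rule sets) for reuse."""
    def __init__(self):
        self.artifacts = {}

    def store(self, key, artifact):
        self.artifacts[key] = artifact

    def retrieve(self, key):
        # Returns None when no validated artifact matches, forcing fresh generation.
        return self.artifacts.get(key)


class Analyst:
    """Keeps per-stage rubrics and refines them retrospectively."""
    def __init__(self):
        self.rubrics = {"gdd": ["has an explicit win condition"]}

    def retrospect(self, stage, lesson):
        # After each successful generation, fold a newly learned check
        # into the rubric for that stage.
        self.rubrics.setdefault(stage, []).append(lesson)
```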


You can see the full procedure of the self-evolving game generation framework below.
AGENT

RTSGameAgent

An agent architecture that combines agentic memory and Finite State Machine (FSM) based group control.

RTSGameAgent is a baseline agent for large-scale RTS gameplay that combines FSM-based group management with agentic memory, enabling scalable coordination and sustained coherence under large unit counts and long durations.

At each decision step, the memory phase consolidates the short-term event log S_t with long-term memory L_{t-1} via an LLM, producing relevant entries m_t and updated memory L_t. The decision phase then feeds m_t, game knowledge K, and multimodal observations o_t to the VLM policy, which outputs building construction, unit production, group assignment, and group movement commands through an FSM-based controller.
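One decision step of this loop can be sketched as a two-phase function; `consolidate` and `act` are assumed interface names for the LLM memory module and the VLM policy, not RTSGameAgent's actual API.

```python
def decision_step(event_log, long_term_memory, observation, game_knowledge,
                  llm, vlm_policy):
    """One decision step: memory consolidation, then action selection.

    event_log        -- short-term events since the last step (S_t)
    long_term_memory -- memory carried over from the previous step (L_{t-1})
    observation      -- current multimodal observation (o_t)
    game_knowledge   -- static game knowledge (K)
    """
    # Memory phase: the LLM produces relevant entries and updated memory.
    relevant, updated_memory = llm.consolidate(event_log, long_term_memory)
    # Decision phase: the VLM policy emits structured commands.
    commands = vlm_policy.act(relevant, game_knowledge, observation)
    return commands, updated_memory
```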

Key takeaway: RTSGameAgent couples memory-grounded reasoning with structured action execution, which makes large-scale RTS control more stable and analyzable for VLM-based agents.
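The group-control side can be illustrated with a minimal finite state machine; the states and transition table below are illustrative assumptions, not the controller shipped with RTSGameAgent.

```python
# Hypothetical per-group FSM: (state, event) -> next state.
# Unknown events leave the group in its current state.
TRANSITIONS = {
    ("idle", "move"): "moving",
    ("moving", "arrive"): "idle",
    ("idle", "attack"): "attacking",
    ("attacking", "retreat"): "moving",
}


class GroupFSM:
    """Tracks one unit group's behavioral state."""
    def __init__(self):
        self.state = "idle"

    def step(self, event):
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state
```

Structured control like this keeps the VLM's high-level commands executable even when individual unit orders would be too numerous to issue directly.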

RESULTS

🏆Leaderboard


RTSGameBench evaluates strategic reasoning across full-game matchups and competency-targeted mini-games.

1v1: One ally team vs. one enemy team. 2v2: Two ally teams vs. two enemy teams. 3v3: Three ally teams vs. three enemy teams.
3v4: Three ally teams vs. four enemy teams. 1v1v1v1: Four separate competing teams.

Each cell reports WR / GTW / GTL (RS / GTW / GTL for 1v1v1v1); "--" denotes a missing value.

| # | Model | 1v1 | 2v2 | 3v3 | 3v4 | 1v1v1v1 |
|---|-------|-----|-----|-----|-----|---------|
| 1 | GPT-5.2 (OpenAI) | 0.50 / 27 / 37 | 0.30 / 95 / 67 | 0.40 / 71 / 55 | 0.10 / 87 / 45 | 0.37 / 31 / 11 |
| 2 | GPT-5-mini (OpenAI) | 0.05 / 24 / 22 | 0.05 / 66 / 56 | 0.10 / 77 / 41 | 0.05 / 77 / 34 | 0.18 / 23 / 11 |
| 3 | Claude-4.5-Sonnet (Anthropic) | 0.20 / 28 / 43 | 0.05 / 78 / 74 | 0.15 / 67 / 58 | 0.00 / -- / 48 | 0.57 / 28 / 18 |
| 4 | Gemini-3-Flash (Google) | 0.85 / 21 / 34 | 0.50 / 92 / 69 | 0.35 / 60 / 52 | 0.20 / 56 / 38 | 0.65 / 24 / 11 |
| 5 | Kimi-K2.5 (Moonshot AI) | 0.30 / 29 / 37 | 0.30 / 63 / 69 | 0.00 / -- / 44 | 0.00 / -- / 72 | 0.45 / 14 / 23 |
| 6 | Grok-4.1-Fast (xAI) | 0.50 / 41 / 27 | 0.00 / -- / 56 | 0.10 / 63 / 51 | 0.10 / 60 / 32 | 0.30 / 34 / 12 |
| 7 | Qwen3.5-397B (Alibaba) | 0.15 / 31 / 26 | 0.00 / -- / 62 | 0.35 / 46 / 60 | 0.00 / -- / 62 | 0.30 / 31 / 13 |
| 8 | Qwen3-VL-235B-Instruct (Alibaba) | 0.00 / -- / 28 | 0.00 / -- / 51 | 0.00 / -- / 36 | 0.00 / -- / 43 | 0.07 / -- / 9 |
| 9 | Qwen3-VL-235B-Thinking (Alibaba) | 0.20 / 23 / 25 | 0.00 / -- / 58 | 0.30 / 72 / 52 | 0.00 / -- / 32 | 0.27 / 24 / 11 |
| 10 | LLaMA4-Maverick (Meta) | 0.00 / -- / 27 | 0.10 / 86 / 53 | 0.00 / -- / 45 | 0.00 / -- / 41 | 0.17 / 21 / 11 |
| 11 | Mistral-Large-3 (Mistral AI) | 0.00 / -- / 27 | 0.00 / -- / 60 | 0.00 / -- / 45 | 0.00 / -- / 19 | 0.13 / 34 / 11 |
CITATION

BibTeX

Use the following citation if you reference RTSGameBench.

@article{kimAKCJC26,
  author    = {Kim, San and Ahn, Daechul and Kim, Reokyoung and Choi, Hyeonbeom and Jwa, Seungyeon and Choi, Jonghyun},
  title     = {RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models},
  journal   = {arXiv},
  year      = {2026},
}