BENCHMARK
RTSGameBench
A benchmark suite spanning full games, diagnostic mini-games, and self-evolving scenario generation.
(1) Full-game evaluation across diverse matchup structures, (2) diagnostic mini-games that isolate specific competencies, and (3) a self-evolving game generation framework that converts free-form queries into new tests via multi-agent collaboration.
Overview of RTSGameBench and its three benchmark components
1) Evaluation Settings in RTSGameBench
The top shows full-game matchups with different player configurations, each reflecting distinct strategic demands. The bottom presents mini-games targeting specific RTS competencies, with fog-of-war (FoW) added only when partial observability is essential. Build, Prod., and Move denote building construction, unit production, and unit movement, respectively.
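For concreteness, below is a minimal sketch of how these evaluation settings could be encoded. The names here (ActionType, MatchupConfig, fog_of_war) are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ActionType(Enum):
    """Action categories from the caption: Build / Prod. / Move."""
    BUILD = auto()    # building construction
    PRODUCE = auto()  # unit production
    MOVE = auto()     # unit movement

@dataclass
class MatchupConfig:
    """Illustrative full-game matchup configuration (field names are assumed)."""
    teams: list[list[int]]            # player ids grouped by team, e.g. [[0], [1]] for 1v1
    fog_of_war: bool                  # partial observability on/off
    allowed_actions: set[ActionType]  # which action categories are available

# Example: a 1v1 full game with all action types and fog of war enabled.
duel = MatchupConfig(
    teams=[[0], [1]],
    fog_of_war=True,
    allowed_actions={ActionType.BUILD, ActionType.PRODUCE, ActionType.MOVE},
)
```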
2) Diagnostic Mini-Games
Full-game evaluation mixes multiple strategic skills at once, making it hard to identify specific weaknesses. To address this, we introduce mini-games that isolate individual RTS competencies under controlled settings for more fine-grained evaluation; a configuration sketch follows the list below.
(a) Resource Management — Time-Constrained Production (TCP).
(b) Spatial & Temporal Reasoning — Multi-Front Defense (MFD).
(c) Opponent Modeling — Fixed-Field Skirmish: Free-for-All (FS-F).
(d) Collaboration — Fixed-Field Skirmish: Team (FS-T).
(e) Adversarial Planning — Siege Planning (SP).
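As referenced above, the sketch below shows one way this mini-game catalog could be declared. The FoW and action settings in the entries are placeholders for illustration only; the actual assignments are those given in the evaluation-settings figure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MiniGameSpec:
    """Illustrative mini-game descriptor; field names are assumptions."""
    abbrev: str
    competency: str
    fog_of_war: bool                 # enabled only when partial observability is essential
    allowed_actions: frozenset[str]  # subset of {"build", "produce", "move"}

# Placeholder settings for illustration only; see the figure for the
# benchmark's actual FoW and action assignments.
MINI_GAMES = (
    MiniGameSpec("TCP", "resource management", False, frozenset({"build", "produce"})),
    MiniGameSpec("MFD", "spatial & temporal reasoning", False, frozenset({"produce", "move"})),
    MiniGameSpec("FS-F", "opponent modeling", True, frozenset({"move"})),
    MiniGameSpec("FS-T", "collaboration", True, frozenset({"move"})),
    MiniGameSpec("SP", "adversarial planning", False, frozenset({"build", "produce", "move"})),
)
```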
3) Self-Evolving Game Generation Framework
The mini-games assess key strategic competencies in RTS play, but each competency can be evaluated across a much wider range of conditions than any fixed scenario can capture. Manually expanding this suite is costly, so we propose a Self-Evolving Game Generation Framework that automatically generates diagnostic mini-games from free-form user queries and improves over successive cycles.
Self-Evolving Game Generation Framework
Generation pipeline
Given a user query, the framework generates a new game in four stages coordinated by the project manager. The designer first creates a scenario brief, then expands it into a full game design document (GDD). Next, the developer retrieves or implements rule scripts, and finally assembles the executable game with the required assets and configurations. At each stage, the analyst validates the output through rubric-based checks and simulations, and validated artifacts are stored in a shared knowledge database for future reuse.
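A minimal Python sketch of this four-stage flow follows. The agent functions and the KnowledgeBase class are placeholder stand-ins for the LLM-backed roles, not the framework's real interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    """Shared store of validated artifacts (briefs, GDDs, rule sets, games)."""
    artifacts: dict[str, str] = field(default_factory=dict)

# Placeholder agent roles; in the framework these are LLM-backed agents.
def designer_brief(query: str) -> str:
    return f"scenario brief for: {query}"

def designer_gdd(brief: str) -> str:
    return f"full GDD expanding: {brief}"

def developer_rules(gdd: str, kb: KnowledgeBase) -> str:
    # Reuse a validated rule set if one exists, otherwise "implement" one.
    return kb.artifacts.get("rules", f"rule scripts for: {gdd}")

def developer_assemble(gdd: str, rules: str) -> str:
    return f"executable game: {gdd} + {rules} + assets/configs"

def analyst_ok(artifact: str) -> bool:
    return bool(artifact)  # stand-in for rubric-based checks and simulation

def checked(name: str, artifact: str, kb: KnowledgeBase) -> str:
    """Analyst gate applied after every stage."""
    assert analyst_ok(artifact)    # project-manager retry logic is elided
    kb.artifacts[name] = artifact  # validated output enters the shared KB
    return artifact

def generate_game(query: str, kb: KnowledgeBase) -> str:
    brief = checked("brief", designer_brief(query), kb)         # stage 1
    gdd = checked("gdd", designer_gdd(brief), kb)               # stage 2
    rules = checked("rules", developer_rules(gdd, kb), kb)      # stage 3
    return checked("game", developer_assemble(gdd, rules), kb)  # stage 4
```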
Self-evolution mechanisms
The framework evolves through two mechanisms: a shared knowledge database that reuses validated GDDs and rule sets, and retrospective analysis that refines the analyst’s rubrics after each successful generation. Together, these make RTSGameBench a continuously extensible evaluation platform.
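Sketched below is how the two mechanisms could interact across successive generation cycles. Analyst.refine and the trace-based rubric update are hypothetical simplifications of the retrospective analysis, not the framework's actual logic.

```python
from dataclasses import dataclass, field

@dataclass
class Analyst:
    """Rubrics start minimal and grow via retrospective analysis."""
    rubrics: list[str] = field(default_factory=lambda: ["game must be executable"])

    def refine(self, trace: str) -> None:
        # Hypothetical: distill a new check from the successful generation's trace.
        self.rubrics.append(f"check learned from: {trace}")

def evolution_cycle(query: str, kb: dict[str, str], analyst: Analyst) -> str:
    """One self-evolution cycle: generate a game, then tighten the rubrics."""
    # Mechanism 1: validated artifacts in `kb` shortcut future generations.
    game = kb.get(query) or f"game generated for: {query}"
    kb[query] = game
    # Mechanism 2: retrospective analysis refines the analyst's rubrics.
    analyst.refine(f"generation of {query!r}")
    return game
```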
The full procedure of the self-evolving game generation framework is shown below.