arXiv ??/??

Seoul National University
* These authors contributed equally to this work

Illustration of our proposed RTSGameBench

Abstract

This paper introduces RTSGameBench, a benchmark for evaluating strategic reasoning in Vision-Language Models (VLMs) using large-scale RTS games. It focuses on long-horizon planning, multi-agent coordination, opponent modeling, and decision-making under uncertainty.

  1. RTSGameBench. We propose a new RTS benchmark built on Beyond All Reason that evaluates VLMs through both diverse full-game matchups and diagnostic mini-games in large-scale strategic environments.
  2. Self-Evolving Game Generation. We introduce a self-evolving multi-agent framework that automatically generates new diagnostic games from free-form user queries while improving its efficiency and quality over successive cycles.
  3. RTSGameAgent. We also provide a baseline agent with group control and memory management that enables VLMs to operate in large-scale RTS gameplay.
  4. Findings. Experiments show that current VLMs still struggle as coordination demands, task scale, and the number of involved agents increase.
BENCHMARK

RTSGameBench

A benchmark suite spanning full games, diagnostic mini-games, and self-evolving scenario generation.

(1) Full-game evaluation across diverse matchup structures, (2) diagnostic mini-games that isolate specific competencies, and (3) a self-evolving game generation framework that converts free-form queries into new tests via multi-agent collaboration.

Overview of RTSGameBench and its three benchmark components

1) Evaluation Settings in RTSGameBench

The top shows full-game matchups with different player configurations, each reflecting distinct strategic demands. The bottom presents mini-games targeting specific RTS competencies, with fog-of-war (FoW) added only when partial observability is essential. Build, Prod., and Move denote building construction, unit production, and unit movement, respectively.
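The three command families referenced above (Build, Prod., Move) can be sketched as a small action schema. This is an illustrative sketch only; the class and field names below are assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ActionType(Enum):
    """Core RTS action families named in the evaluation settings."""
    BUILD = "build"  # building construction
    PROD = "prod"    # unit production
    MOVE = "move"    # unit movement


@dataclass
class Command:
    """A single hypothetical command an evaluated agent could emit."""
    action: ActionType
    target: str                              # building/unit type, or a group id for MOVE
    position: Optional[Tuple[int, int]] = None  # map coordinates where relevant


# Example: queue a factory, produce a tank, relocate a unit group.
plan = [
    Command(ActionType.BUILD, "factory", (40, 12)),
    Command(ActionType.PROD, "tank"),
    Command(ActionType.MOVE, "group_alpha", (85, 60)),
]
```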

2) Diagnostic Mini-Games

Full-game evaluation mixes multiple strategic skills at once, making it hard to identify specific weaknesses. To address this, we introduce mini-games that isolate individual RTS competencies under controlled settings, enabling more fine-grained evaluation.

(a) Resource Management — Time-Constrained Production (TCP).
(b) Spatial & Temporal Reasoning — Multi-Front Defense (MFD).
(c) Opponent Modeling — Fixed-Field Skirmish: Free-for-All (FS-F).
(d) Collaboration — Fixed-Field Skirmish: Team (FS-T).
(e) Adversarial Planning — Siege Planning (SP).
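As a rough illustration, the mini-game suite could be registered as a small config table. The field names and the per-game fog-of-war flags below are assumptions for illustration only; the text states only that FoW is enabled when partial observability is essential.

```python
# Hypothetical registry of the five diagnostic mini-games.
# fog_of_war values are illustrative assumptions, not the paper's settings.
MINI_GAMES = {
    "TCP":  {"name": "Time-Constrained Production",        "competency": "resource management",           "fog_of_war": False},
    "MFD":  {"name": "Multi-Front Defense",                "competency": "spatial & temporal reasoning",  "fog_of_war": False},
    "FS-F": {"name": "Fixed-Field Skirmish: Free-for-All", "competency": "opponent modeling",             "fog_of_war": True},
    "FS-T": {"name": "Fixed-Field Skirmish: Team",         "competency": "collaboration",                 "fog_of_war": True},
    "SP":   {"name": "Siege Planning",                     "competency": "adversarial planning",          "fog_of_war": True},
}

# Example lookup: which games exercise opponent modeling?
opponent_modeling = [k for k, v in MINI_GAMES.items()
                     if v["competency"] == "opponent modeling"]
```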

3) Self-Evolving Game Generation Framework


Generation pipeline

Given a user query, the framework generates a new game in four stages under project manager control. The designer first creates a scenario brief, then expands it into a full GDD. Next, the developer retrieves or implements rule scripts, and finally assembles the executable game with the required assets and configurations. At each stage, the analyst validates the output through rubric-based checks and simulations, while validated artifacts are stored in a shared knowledge database for future reuse.
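The four-stage pipeline above can be sketched as a validated chain of artifacts. The agent interfaces (`draft_brief`, `expand_to_gdd`, `build_rules`, `assemble`, `validate`) are hypothetical names for this sketch, not the framework's actual code.

```python
def generate_game(query, designer, developer, analyst, knowledge_db):
    """Sketch of the four-stage generation pipeline under analyst validation."""
    stages = [
        ("brief", lambda art: designer.draft_brief(query)),   # scenario brief
        ("gdd",   lambda art: designer.expand_to_gdd(art)),   # full game design doc
        ("rules", lambda art: developer.build_rules(art, knowledge_db)),  # rule scripts
        ("game",  lambda art: developer.assemble(art)),       # executable game
    ]
    artifact = None
    for name, step in stages:
        artifact = step(artifact)
        # Rubric-based checks and simulations gate every stage.
        if not analyst.validate(name, artifact):
            raise ValueError(f"stage {name!r} failed validation")
        # Validated artifacts are stored for future reuse.
        knowledge_db[name] = artifact
    return artifact
```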


Self-evolution mechanisms

The framework evolves through two mechanisms: a shared knowledge database that reuses validated GDDs and rule sets, and retrospective analysis that refines the analyst’s rubrics after each successful generation. Together, these make RTSGameBench a continuously extensible evaluation platform.
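A minimal sketch of the two self-evolution mechanisms, assuming hypothetical `KnowledgeDB` and `Analyst` interfaces (the real framework's internals are not published in this section):

```python
class KnowledgeDB:
    """Shared store of validated artifacts (GDDs, rule sets) for reuse."""
    def __init__(self):
        self.artifacts = {}

    def store(self, key, artifact):
        self.artifacts[key] = artifact

    def retrieve(self, key):
        # Returns None when no validated artifact matches, forcing fresh generation.
        return self.artifacts.get(key)


class Analyst:
    """Keeps per-stage rubrics and refines them retrospectively."""
    def __init__(self):
        self.rubrics = {"gdd": ["has an explicit win condition"]}

    def retrospect(self, stage, lesson):
        # After each successful generation, fold a newly learned check
        # into the rubric for that stage.
        self.rubrics.setdefault(stage, []).append(lesson)
```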


You can see the full procedure of the self-evolving game generation framework below.
AGENT

RTSGameAgent

An agent architecture that combines agentic memory and Finite State Machine (FSM) based group control.

RTSGameAgent is a baseline agent for large-scale RTS gameplay that combines FSM-based group management with agentic memory, enabling scalable coordination and sustained coherence under large unit counts and long durations.

At each decision step, the memory phase consolidates the short-term event log S_t with long-term memory L_{t-1} via an LLM, producing relevant entries m_t and updated memory L_t. The decision phase then feeds m_t, game knowledge K, and multimodal observations o_t to the VLM policy, which outputs building construction, unit production, group assignment, and group movement commands through an FSM-based controller.
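One decision step of this loop can be sketched as a two-phase function; `consolidate` and `act` are assumed interface names for the LLM memory module and the VLM policy, not RTSGameAgent's actual API.

```python
def decision_step(event_log, long_term_memory, observation, game_knowledge,
                  llm, vlm_policy):
    """One decision step: memory consolidation, then action selection.

    event_log        -- short-term events since the last step (S_t)
    long_term_memory -- memory carried over from the previous step (L_{t-1})
    observation      -- current multimodal observation (o_t)
    game_knowledge   -- static game knowledge (K)
    """
    # Memory phase: the LLM produces relevant entries and updated memory.
    relevant, updated_memory = llm.consolidate(event_log, long_term_memory)
    # Decision phase: the VLM policy emits structured commands.
    commands = vlm_policy.act(relevant, game_knowledge, observation)
    return commands, updated_memory
```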

Key takeaway: RTSGameAgent couples memory-grounded reasoning with structured action execution, which makes large-scale RTS control more stable and analyzable for VLM-based agents.
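The group-control side can be illustrated with a minimal finite state machine; the states and transition table below are illustrative assumptions, not the controller shipped with RTSGameAgent.

```python
# Hypothetical per-group FSM: (state, event) -> next state.
# Unknown events leave the group in its current state.
TRANSITIONS = {
    ("idle", "move"): "moving",
    ("moving", "arrive"): "idle",
    ("idle", "attack"): "attacking",
    ("attacking", "retreat"): "moving",
}


class GroupFSM:
    """Tracks one unit group's behavioral state."""
    def __init__(self):
        self.state = "idle"

    def step(self, event):
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state
```

Structured control like this keeps the VLM's high-level commands executable even when individual unit orders would be too numerous to issue directly.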

RESULTS

🏆Leaderboard


RTSGameBench evaluates strategic reasoning across full-game matchups and competency-targeted mini-games.

1v1: One ally team vs. one enemy team. 2v2: Two ally teams vs. two enemy teams. 3v3: Three ally teams vs. three enemy teams.
3v4: Three ally teams vs. four enemy teams. 1v1v1v1: Four separate competing teams.

Each cell reports WR / GTW / GTL (RS / GTW / GTL for 1v1v1v1); "--" denotes a missing value.

| # | Model | 1v1 | 2v2 | 3v3 | 3v4 | 1v1v1v1 |
|---|-------|-----|-----|-----|-----|---------|
| 1 | GPT-5.2 (OpenAI) | 0.50 / 27 / 37 | 0.30 / 95 / 67 | 0.40 / 71 / 55 | 0.10 / 87 / 45 | 0.37 / 31 / 11 |
| 2 | GPT-5-mini (OpenAI) | 0.05 / 24 / 22 | 0.05 / 66 / 56 | 0.10 / 77 / 41 | 0.05 / 77 / 34 | 0.18 / 23 / 11 |
| 3 | Claude-4.5-Sonnet (Anthropic) | 0.20 / 28 / 43 | 0.05 / 78 / 74 | 0.15 / 67 / 58 | 0.00 / -- / 48 | 0.57 / 28 / 18 |
| 4 | Gemini-3-Flash (Google) | 0.85 / 21 / 34 | 0.50 / 92 / 69 | 0.35 / 60 / 52 | 0.20 / 56 / 38 | 0.65 / 24 / 11 |
| 5 | Kimi-K2.5 (Moonshot AI) | 0.30 / 29 / 37 | 0.30 / 63 / 69 | 0.00 / -- / 44 | 0.00 / -- / 72 | 0.45 / 14 / 23 |
| 6 | Grok-4.1-Fast (xAI) | 0.50 / 41 / 27 | 0.00 / -- / 56 | 0.10 / 63 / 51 | 0.10 / 60 / 32 | 0.30 / 34 / 12 |
| 7 | Qwen3.5-397B (Alibaba) | 0.15 / 31 / 26 | 0.00 / -- / 62 | 0.35 / 46 / 60 | 0.00 / -- / 62 | 0.30 / 31 / 13 |
| 8 | Qwen3-VL-235B-Instruct (Alibaba) | 0.00 / -- / 28 | 0.00 / -- / 51 | 0.00 / -- / 36 | 0.00 / -- / 43 | 0.07 / -- / 9 |
| 9 | Qwen3-VL-235B-Thinking (Alibaba) | 0.20 / 23 / 25 | 0.00 / -- / 58 | 0.30 / 72 / 52 | 0.00 / -- / 32 | 0.27 / 24 / 11 |
| 10 | LLaMA4-Maverick (Meta) | 0.00 / -- / 27 | 0.10 / 86 / 53 | 0.00 / -- / 45 | 0.00 / -- / 41 | 0.17 / 21 / 11 |
| 11 | Mistral-Large-3 (Mistral AI) | 0.00 / -- / 27 | 0.00 / -- / 60 | 0.00 / -- / 45 | 0.00 / -- / 19 | 0.13 / 34 / 11 |
CITATION

BibTeX

Use the following citation if you reference RTSGameBench.

@article{kimAKCJC26,
  author    = {Kim, San and Ahn, Daechul and Kim, Reokyoung and Choi, Hyeonbeom and Jwa, Seungyeon and Choi, Jonghyun},
  title     = {RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models},
  journal   = {arXiv},
  year      = {2026},
}