What Happens When:
Learning Temporal Orders of Events in Videos

WACV 2026

Daechul Ahn*¹, Yura Choi*², Hyeonbeom Choi*¹,
Seongwon Cho¹, San Kim¹, Jonghyun Choi¹

¹Seoul National University ²Imperial College London

* equally contributed to this work

Abstract

We find that Video Large Multimodal Models (VLMMs) often rely on prior knowledge rather than actual visual sequences to predict event orders. To rigorously evaluate temporal understanding, we introduce VECTOR, a benchmark designed to assess models' ability to identify the temporal order of events. We also propose MECoT, which combines multi-event instruction fine-tuning with chain-of-thought prompting to enhance temporal reasoning. MECoT achieves state-of-the-art results on VECTOR while also improving performance on general video benchmarks.

MOTIVATION

Diagnosing the Prior-Knowledge Bias

Why do current Video LMMs fail at temporal reasoning?

Do VLMMs truly understand temporal order, or just rely on common sense?

Biased prediction on event-ordering task

Example: Consider a campfire-making video with three steps:

(A) Arranging stones (B) Lighting a fire (C) Observing ashes

GPT-4o correctly predicts the order A→B→C. But when we swap the events so that the video actually shows C→B→A, the model still predicts A→B→C: it follows what typically happens rather than what the video shows.

How often does this happen? We measure this with the biased ratio (η), which captures how frequently models produce the same prediction regardless of the actual visual order.
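As a rough illustration, the biased ratio could be computed along the following lines. This is a minimal sketch under an assumed definition: η is taken to be the share of (original, shuffled) video pairs for which the model predicts an identical event order. The function name `biased_ratio` and the toy predictions are hypothetical, and the paper's exact formulation may differ.

```python
from typing import Sequence


def biased_ratio(
    preds_original: Sequence[Sequence[str]],
    preds_shuffled: Sequence[Sequence[str]],
) -> float:
    """Fraction of video pairs whose predicted order is unchanged by shuffling.

    Assumed definition (not taken from the paper): a pair counts as biased
    when the prediction on the shuffled video equals the prediction on the
    original video, even though the visual order differs.
    """
    if len(preds_original) != len(preds_shuffled):
        raise ValueError("Both prediction lists must cover the same videos.")
    if not preds_original:
        raise ValueError("At least one video pair is required.")
    unchanged = sum(
        list(orig) == list(shuf)
        for orig, shuf in zip(preds_original, preds_shuffled)
    )
    return unchanged / len(preds_original)


# Toy predictions for four videos (hypothetical data, for illustration only).
original = [["A", "B", "C"], ["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]
shuffled = [["A", "B", "C"], ["C", "B", "A"], ["B", "A", "C"], ["A", "C", "B"]]

eta = biased_ratio(original, shuffled)
print(f"biased ratio = {eta:.2f}")  # 3 of 4 pairs unchanged -> 0.75
```

A high value under this definition means the prediction barely reacts to what the video shows, which is the behavior described in the campfire example.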

Event sequencing task on videos with original and shuffled event order

Key Finding: η exceeds 78% across all models and datasets, indicating that current VLMMs heavily depend on prior knowledge as a shortcut, bypassing genuine temporal reasoning. This finding motivates our work.

BENCHMARK

VECTOR: Comprehending Temporal Order of Events in Videos

Five tasks across two evaluation groups to diagnose temporal understanding

VECTOR evaluates whether VLMMs can truly understand when things happen in a video.

Overview of the VECTOR benchmark

Event-level Tasks

Test whether models can correctly track individual events over time.

Given a video of multiple unrelated actions, models must:

List all events in chronological order
Identify which events occur between two specified events
Pinpoint the exact position of queried events

Pattern-level Tasks

Models must recognize higher-level temporal structures.

Given a sequence with semantic patterns, models must:

Detect where an anomalous event breaks the pattern
Understand how actions relate across time
Go beyond recognizing individual actions

By combining both event-level and pattern-level reasoning, VECTOR provides a comprehensive diagnosis of temporal understanding in VLMMs.

METHOD

Multi-Event instruction fine-tuning with Chain-of-Thought (MECoT)

A two-stage approach to improve temporal reasoning in VLMMs

STAGE 1

Multi-Event Instruction Fine-Tuning

Problem: Existing video descriptions oversimplify multi-event videos into a single summary, losing fine-grained temporal details.

Solution:

Generate segment-wise descriptions using a pre-trained VLMM
Merge them into a coherent temporal narrative via GPT-4o-mini
Fine-tune on 120K multi-event descriptions
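The data-generation part of the steps above can be sketched as follows. This is an illustrative outline, not the authors' pipeline: `describe_segment` stands in for the pre-trained VLMM and `merge_descriptions` for the GPT-4o-mini merging step. Both are hypothetical callables supplied by the caller, and the equal-length segmentation is an assumption, since the page does not say how segments are chosen.

```python
from typing import Any, Callable, List, Sequence, Tuple


def split_into_segments(num_frames: int, num_segments: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, that cover every frame."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("Need at least one frame per segment.")
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def build_multi_event_description(
    frames: Sequence[Any],
    num_segments: int,
    describe_segment: Callable[[Sequence[Any]], str],   # hypothetical VLMM wrapper
    merge_descriptions: Callable[[List[str]], str],     # hypothetical LLM wrapper
) -> str:
    """Caption each segment in temporal order, then merge into one narrative."""
    segments = split_into_segments(len(frames), num_segments)
    captions = [describe_segment(frames[start:end]) for start, end in segments]
    return merge_descriptions(captions)


# Stub models so the sketch runs without any real VLMM or LLM.
def stub_describe(segment: Sequence[int]) -> str:
    return f"frames {segment[0]}-{segment[-1]}"


def stub_merge(captions: List[str]) -> str:
    return " Then ".join(captions)


narrative = build_multi_event_description(list(range(10)), 3, stub_describe, stub_merge)
print(narrative)  # frames 0-2 Then frames 3-5 Then frames 6-9
```

The resulting narratives would then serve as fine-tuning targets. Keeping the two model calls behind plain callables makes it easy to swap in real models without changing the segmentation logic.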

Overview of multi-event instruction fine-tuning

STAGE 2

Temporal Chain-of-Thought

At inference time:

The model first generates a detailed chronological description of the video
This description serves as explicit reasoning context
The model uses it alongside the original video to answer the query
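The inference procedure above amounts to two passes over the same video. The sketch below shows one way this might look. The `generate` callable is a hypothetical wrapper around a VLMM that takes a video and a text prompt, and the prompt wording is an assumption rather than the prompt used in the paper.

```python
from typing import Any, Callable, Tuple

# Assumed prompt wording; the paper's actual prompt is not given on this page.
DESCRIBE_PROMPT = "Describe the events in this video in chronological order."


def answer_with_temporal_cot(
    video: Any,
    question: str,
    generate: Callable[[Any, str], str],  # hypothetical VLMM call: (video, prompt) -> text
) -> Tuple[str, str]:
    """Two-pass inference: describe the video first, then answer with that context."""
    # Pass 1: produce an explicit chronological description of the video.
    description = generate(video, DESCRIBE_PROMPT)
    # Pass 2: answer the query using both the video and the description.
    prompt = (
        f"Chronological description of the video:\n{description}\n\n"
        f"Question: {question}"
    )
    answer = generate(video, prompt)
    return description, answer


# Stub model so the sketch runs end to end without a real VLMM.
def stub_generate(video: Any, prompt: str) -> str:
    if prompt == DESCRIBE_PROMPT:
        return "First stones are arranged, then a fire is lit, then ashes remain."
    return "A, B, C"


description, answer = answer_with_temporal_cot(
    "campfire.mp4", "In what order do the events occur?", stub_generate
)
print(description)
print(answer)
```

Passing the video again in the second pass matches the step in which the description is used alongside the original video, rather than replacing it.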

"Articulate what happens and when before reasoning about the answer"

Overview of temporal Chain-of-Thought

RESULTS

Experimental Results

For more details, please check out the paper.

Performance on VECTOR Benchmark

Exact match (EM) accuracy on VECTOR

Performance on General Benchmarks

Comparison of MECoT with LLaVA-OneVision (7B) on general benchmarks

BibTeX

@inproceedings{ahnCCCKC25,
  author    = {Ahn, Daechul and Choi, Yura and Choi, Hyeonbeom and Cho, Seongwon and Kim, San and Choi, Jonghyun},
  title     = {What Happens When: Learning Temporal Orders of Events in Videos},
  booktitle = {WACV},
  year      = {2026},
}