ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO

AAAI 2025

1Seoul National University, 2Yonsei University, 3University of Minnesota

Iterative Self-Retrospective Direct Preference Optimization (ISR-DPO)

Abstract

Iterative self-improvement, a concept extending beyond personal growth, has found powerful applications in machine learning, particularly in transforming weak models into strong ones. While recent advances in natural language processing have shown its efficacy through iterative preference optimization, applying this approach to Video Large Multi-modal Models (VLMMs) remains challenging due to modality misalignment. VLMMs struggle with this misalignment during iterative preference modeling, as the self-judge model often prioritizes linguistic knowledge over visual information. Additionally, iterative preference optimization can lead to visually hallucinated, verbose responses due to length bias within the self-rewarding cycle. To address these issues, we propose Iterative Self-Retrospective Direct Preference Optimization (ISR-DPO), a method that uses self-retrospection to improve preference modeling. This approach sharpens the self-judge's focus on informative video regions, resulting in more visually grounded preferences. In extensive empirical evaluations across diverse video question answering benchmarks, ISR-DPO significantly outperforms the state of the art. We are committed to open-sourcing our code, models, and datasets to encourage further investigation.

ISR-DPO Framework

Each iteration of ISR-DPO involves three stages:
1) After training iteration \( t \), the most recently updated VLMM (\( \pi_{\theta^{t}} \)) generates two different responses \( y_1 \) and \( y_2 \) for a given video \( V \) and instruction \( x \). In addition, a visual description, i.e., the visual context, is generated through self-retrospection, providing the input needed for the next stage (indicated by the black dotted line in the figure).
2) Using the information generated in the previous stage, the model (\( \pi_{\theta^{t}} \)) compares its responses \( y_1 \) and \( y_2 \) and identifies the preferred response \( y_w \) and the rejected response \( y_l \).
3) Then, the VLMM (\( \pi_{\theta^{t}} \)) is optimized with DPO on these preference pairs, updating its parameters from \( \theta^{t} \) to \( \theta^{t+1} \) and yielding \( \pi_{\theta^{t+1}} \).
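For concreteness, stage 3 corresponds to the standard DPO objective of Rafailov et al. (2023). The form below is a sketch that assumes, as is typical in iterative DPO, that a frozen copy of the current model \( \pi_{\theta^{t}} \) serves as the reference policy:

\[ \mathcal{L}_{\text{DPO}}(\theta) = - \mathbb{E}_{(V, x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(y_w \mid V, x)}{\pi_{\theta^{t}}(y_w \mid V, x)} - \beta \log \frac{\pi_{\theta}(y_l \mid V, x)}{\pi_{\theta^{t}}(y_l \mid V, x)} \right) \right] \]

where the expectation is over the preference pairs produced in stage 2, \( \sigma \) is the logistic function, and \( \beta \) controls how far the updated policy may drift from the reference.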

Overview of self-retrospective direct preference optimization for modality alignment between video and text

Self-Retrospective Preference Modeling

A key aspect of iterative DPO is that the VLMM acts as a judge, selecting the most appropriate response to a given question. To strengthen this process, the judge is conditioned not only on the video but also on a detailed visual description, i.e., the visual context, which the model generates itself.

Inspired by human learning processes, we incorporate a self-retrospective mechanism in which the previously generated visual context is leveraged to produce an improved context for the current iteration. This improves the accuracy and relevance of preference selection. The process is formulated as follows:

\[ c_{t} \sim \pi_{\theta^{t}}(\cdot \mid V, c_{t-1}) \]

where \( c_{t-1} \) denotes the visual context generated at iteration \( t-1 \).

Using the generated context \( c_t \), question \( x \), video \( V \), and responses \( \{y_1, y_2\} \), the aligned VLMM \( \pi_{\theta^{t}} \) identifies the preferred response \( y_w \) and the rejected response \( y_l \). This process, termed self-retrospective preference modeling, enables the construction of the preference data \( D_t^{\text{pref}} \) at iteration \( t \).
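To make the pipeline concrete, the following is a minimal sketch of how \( D_t^{\text{pref}} \) could be assembled. It is illustrative only, not the authors' implementation: the `VLMM` interface, the prompt wording, and the A/B judging format are all hypothetical placeholders.

```python
# A minimal, illustrative sketch of self-retrospective preference modeling.
# The VLMM interface, prompts, and A/B judging format are hypothetical
# placeholders, not the authors' actual implementation.
from dataclasses import dataclass
from typing import Any, Protocol


class VLMM(Protocol):
    """Hypothetical interface for a video large multi-modal model."""
    def generate(self, video: Any, prompt: str, temperature: float = 1.0) -> str: ...


@dataclass
class PreferencePair:
    video: Any
    question: str
    chosen: str    # y_w
    rejected: str  # y_l


def retrospect_context(model: VLMM, video: Any, prev_context: str) -> str:
    """Self-retrospection: refine the visual context c_t from c_{t-1}."""
    prompt = (
        "Previous description of this video:\n"
        f"{prev_context}\n"
        "Revise and expand the description so it covers the informative "
        "regions of the video more completely."
    )
    return model.generate(video, prompt, temperature=0.0)


def build_preference_pair(model: VLMM, video: Any, question: str,
                          context: str) -> PreferencePair:
    """Sample two responses, then let the same model judge them while
    conditioned on the self-generated visual context."""
    y1 = model.generate(video, question, temperature=1.0)
    y2 = model.generate(video, question, temperature=1.0)
    judge_prompt = (
        f"Visual context: {context}\n"
        f"Question: {question}\n"
        f"Response A: {y1}\n"
        f"Response B: {y2}\n"
        "Which response is better grounded in the video? Answer A or B."
    )
    verdict = model.generate(video, judge_prompt, temperature=0.0)
    if verdict.strip().upper().startswith("A"):
        return PreferencePair(video, question, chosen=y1, rejected=y2)
    return PreferencePair(video, question, chosen=y2, rejected=y1)


def build_preference_data(model: VLMM, samples, prev_contexts) -> list[PreferencePair]:
    """Assemble D_t^pref for one ISR-DPO iteration."""
    data = []
    for (video, question), c_prev in zip(samples, prev_contexts):
        c_t = retrospect_context(model, video, c_prev)
        data.append(build_preference_pair(model, video, question, c_t))
    return data
```

The resulting pairs would then be passed to a DPO trainer (minimizing the objective sketched above) to obtain \( \pi_{\theta^{t+1}} \), and the refined contexts \( c_t \) carry over to the next iteration.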



Results

For more experimental results, please check out the paper.

Quantitative comparison on in-domain and out-of-domain VideoQA with prior art

We quantitatively evaluate VLMMs on in-domain and out-of-domain VideoQA benchmarks, demonstrating the effectiveness of our proposed ISR-DPO.

Quantitative comparison between different VLMMs on in-domain (left) and out-of-domain (right) VideoQA benchmarks

BibTeX

@inproceedings{ahnCKYKC25,
  author    = {Ahn, Daechul and Choi, Yura and Kim, San and Yu, Youngjae and Kang, Dongyeop and Choi, Jonghyun},
  title     = {ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO},
  booktitle = {AAAI},
  year      = {2025},
}