Recent advancements in Large Language Models (LLMs) have driven the development of Video Large Multimodal Models (VLMMs). While Supervised Fine-Tuning (SFT) for multimodal alignment between video and text has shown promise, challenges persist. The primary obstacle is the scarcity of high-quality video-text instruction-tuning data, which often results in text responses that are poorly grounded in the video. Addressing this is crucial for successfully applying VLMMs to various real-world video understanding tasks.
We first fine-tune an LLM, e.g., Vicuna, using supervised learning on synthetically generated video-text instruction-tuning data. This involves integrating a vision encoder, followed by two linear projection layers, together with additional learnable parameters introduced via LoRA, into the training process. In particular, we improve the SFT process by expanding the instruction-tuning data and introducing a simple curriculum learning scheme. We refer to this fine-tuned model as the Video Large Multimodal Model with SFT, or VLM-SFT for short.
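As a concrete illustration, below is a minimal PyTorch sketch of this setup, assuming a frozen vision encoder whose features are mapped into the LLM embedding space by the two linear layers, with LoRA adapters attached to the LLM's attention projections. Module names, dimensions, the intermediate activation, and the LoRA hyperparameters are illustrative assumptions, not the exact implementation.

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two linear layers mapping frozen vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),  # nonlinearity between the two projections is an assumption
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, video_features: torch.Tensor) -> torch.Tensor:
        # video_features: [batch, num_video_tokens, vision_dim]
        return self.proj(video_features)

# LoRA adapters are added to the LLM (hypothetical hyperparameters):
# from peft import LoraConfig, get_peft_model
# lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
# llm = get_peft_model(llm, lora_cfg)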
A key aspect of RLAIF is leveraging a pre-trained AI model to generate human-like preferences between different responses to the same input. To obtain such preferences, we employ the VLM-SFT as a judge. Once preferences are collected, we train a reward model (RM) on them. The RM assigns a higher reward score to the better response and a lower score to the less appropriate one in each pair, thereby providing the signal that guides the policy model during reinforcement learning.
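As a sketch of how an RM can be trained from such pairwise preferences, the snippet below uses a standard Bradley-Terry-style ranking loss; the reward_model interface and the training step are hypothetical, shown only to make the objective concrete.

import torch
import torch.nn.functional as F

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Encourage the RM to score the preferred response above the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Illustrative training step:
# score_chosen   = reward_model(video, prompt, response_chosen)    # scalar per example
# score_rejected = reward_model(video, prompt, response_rejected)
# loss = preference_loss(score_chosen, score_rejected)
# loss.backward(); optimizer.step(); optimizer.zero_grad()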
We finally fine-tune a policy model, initialized from the VLM-SFT, to maximize the scalar reward output of the trained RM via reinforcement learning with Proximal Policy Optimization (PPO). We refer to this trained model as the Video Large Multimodal Model with RLAIF, or VLM-RLAIF for short.
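For reference, the snippet below sketches the clipped PPO surrogate objective that such a policy update typically optimizes, with the advantage derived from the RM's scalar reward (RLHF/RLAIF pipelines also commonly add a KL penalty toward the SFT reference policy, which is an assumption here). This is a generic sketch, not the paper's exact training code.

import torch

def ppo_clip_loss(logprob_new: torch.Tensor,
                  logprob_old: torch.Tensor,
                  advantage: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the current policy and the rollout (old) policy.
    ratio = torch.exp(logprob_new - logprob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # PPO maximizes the clipped surrogate; return its negation as a loss to minimize.
    return -torch.min(unclipped, clipped).mean()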
For the VLM-SFT to select preferences grounded in the video, we argue that a detailed understanding of the video content is necessary for more accurate and contextually relevant decisions. We therefore propose integrating detailed video descriptions, termed context, into the preference selection workflow to enhance the VLMM's contextual clarity. This context allows the VLM-SFT to better understand the video content and identify the most suitable response. Combining the context with the instruction inputs using a specific prompt, as shown in the dotted boxes of Figure 2 (right), facilitates the collection of context-aware preferences.
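A hypothetical prompt template illustrating how the context could be combined with the instruction and the two candidate responses when querying the VLM-SFT judge is sketched below; the exact wording used in the paper may differ.

JUDGE_PROMPT = """You are given a detailed description of a video (context), a question about the video,
and two candidate answers. Using the context, choose the answer that is more accurate and
better grounded in the video content.

Context: {context}
Question: {question}
Answer A: {response_a}
Answer B: {response_b}

Preferred answer (A or B):"""

def build_judge_prompt(context: str, question: str, response_a: str, response_b: str) -> str:
    # Fill the template before sending it to the VLM-SFT judge.
    return JUDGE_PROMPT.format(context=context, question=question,
                               response_a=response_a, response_b=response_b)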
The figure on the right illustrates the three stages of the proposed context-aware reward modeling: (1) generating detailed video descriptions to serve as context, (2) context-aware preference selection by the VLM-SFT judge, and (3) reward model training on the collected preferences.
We quantitatively evaluate VLMMs on the video-based generative performance benchmark, which measures five criteria of the generated text, demonstrating the effectiveness of our proposed VLM-RLAIF.
We qualitatively compare the performance of VLM-SFT and VLM-RLAIF in the figure below, highlighting their multimodal understanding capabilities. VLM-RLAIF consistently yields more accurate answers than VLM-SFT; accurate responses are highlighted in blue and less accurate ones in red.
@inproceedings{ahnCYKC24,
author = {Ahn, Daechul and Choi, Yura and Yu, Youngjae and Kang, Dongyeop and Choi, Jonghyun},
title = {Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback},
booktitle = {ACL},
year = {2024},
}