Advancing MLLM Alignment with MM-RLHF: A Large-Scale Human Preference Dataset for Multimodal Tasks

Multimodal Large Language Models (MLLMs) have received considerable attention for their ability to handle complex tasks involving vision, language, and audio integration. However, most of them undergo little alignment beyond basic supervised fine-tuning (SFT). Current state-of-the-art models often bypass rigorous alignment stages, leaving important aspects such as truthfulness, safety, and alignment with human preferences inadequately addressed. Existing approaches target only specific domains, such as hallucination reduction or conversational improvements, and fall short of enhancing the model's overall performance and reliability. This narrow focus raises the question of whether alignment with human preferences can improve MLLMs across a broader spectrum of tasks.

Recent years have witnessed substantial progress in MLLMs, built on advanced LLM architectures such as GPTs, LLaMA, Alpaca, Vicuna, and Mistral. These models have evolved through end-to-end training approaches, tackling complex multimodal tasks involving image-text alignment, reasoning, and instruction following. Several open-source MLLMs, including Otter, mPLUG-Owl, LLaVA, Qwen-VL, and VITA, have emerged to address fundamental multimodal challenges. However, alignment efforts have remained limited. While algorithms such as Fact-RLHF and LLaVA-Critic have shown promise in reducing hallucinations and improving conversational abilities, they have not enhanced general capabilities. Evaluation frameworks such as MME, MMBench, and Seed-Bench have been developed to assess these models.

Researchers from KuaiShou, CASIA, NJU, USTC, PKU, Alibaba, and Meta AI have proposed MM-RLHF, an innovative approach featuring a comprehensive dataset of 120K fine-grained, human-annotated preference comparison pairs. This dataset represents a significant advance in size, diversity, and annotation quality compared with existing resources. The method introduces two key innovations: a critique-based reward model that generates detailed critiques before scoring outputs, and dynamic reward scaling that optimizes sample weights based on reward signals. Together, these improve both the interpretability of model decisions and the efficiency of the alignment process, addressing the limitations of traditional scalar reward mechanisms in multimodal contexts.
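To make the two innovations concrete, the sketch below illustrates the general idea in PyTorch. It is a minimal illustration rather than the authors' implementation: the class and function names, the prompt formats, the backbone methods (`generate`, `encode`), and the specific weighting formula are all hypothetical, chosen only to show a reward model that critiques before it scores and a per-pair weight that grows with the strength of the reward signal.

```python
import torch
import torch.nn.functional as F

class CritiqueBasedRewardModel(torch.nn.Module):
    """Illustrative critique-based reward model: first generate a textual
    critique of a candidate answer, then score the answer conditioned on it."""

    def __init__(self, mllm_backbone, hidden_dim=4096):
        super().__init__()
        self.backbone = mllm_backbone            # hypothetical MLLM exposing .generate() and .encode()
        self.score_head = torch.nn.Linear(hidden_dim, 1)

    def forward(self, image, question, answer):
        # Step 1: produce a detailed critique of the answer (hypothetical prompt format).
        critique_prompt = f"Question: {question}\nAnswer: {answer}\nCritique this answer:"
        critique = self.backbone.generate(image=image, prompt=critique_prompt)

        # Step 2: score the answer conditioned on the generated critique.
        scoring_prompt = f"{critique_prompt}\n{critique}\nOverall quality:"
        hidden = self.backbone.encode(image=image, prompt=scoring_prompt)   # [B, hidden_dim]
        reward = self.score_head(hidden).squeeze(-1)                        # scalar reward per sample
        return reward, critique


def dynamic_reward_scaling(reward_chosen, reward_rejected, k=1.0, w_max=2.0):
    """Illustrative dynamic reward scaling: pairs with a larger reward margin
    (a more confident preference signal) receive a larger training weight."""
    margin = (reward_chosen - reward_rejected).clamp(min=0)
    return (1.0 + k * margin).clamp(max=w_max)


def weighted_dpo_loss(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected,
                      weight, beta=0.1):
    """Standard DPO objective with a per-sample weight supplied by dynamic reward scaling."""
    logits = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return (weight * -F.logsigmoid(logits)).mean()
```

In this sketch the critique serves two purposes: it is exposed to the user as an interpretable justification for the score, and it gives the scoring head richer context than a bare answer would, which is the intuition behind moving away from purely scalar reward mechanisms.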

The implementation of MM-RLHF involves a sophisticated data preparation and filtering process across three main domains: image understanding, video understanding, and multimodal safety. The image understanding component integrates data from multiple sources, including LLaVA-OV, VLFeedback, and LLaVA-RLHF, with multi-turn dialogues converted to a single-turn format. This collection yields over 10 million dialogue samples covering tasks from basic conversation to complex reasoning. The filtering process uses predefined sampling weights across three question categories: multiple-choice questions for testing reasoning and perception, long-text questions for evaluating conversational abilities, and short-text questions for basic image analysis, as sketched below.
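A minimal sketch of this category-weighted sampling step is shown below. The category labels, weight values, and target subset size are hypothetical placeholders; the article does not specify the exact weights used in the MM-RLHF filtering pipeline.

```python
import random

# Hypothetical sampling weights for the three question categories described above;
# the actual values used in MM-RLHF's filtering pipeline are not given in the article.
CATEGORY_WEIGHTS = {
    "multiple_choice": 0.4,   # probes reasoning and perception
    "long_text":       0.4,   # probes conversational ability
    "short_text":      0.2,   # probes basic image analysis
}

def sample_dialogues(pool, n_samples, seed=0):
    """Draw a filtered subset from a large dialogue pool, where each item is a dict
    with a 'category' key matching CATEGORY_WEIGHTS."""
    rng = random.Random(seed)
    weights = [CATEGORY_WEIGHTS[item["category"]] for item in pool]
    return rng.choices(pool, weights=weights, k=n_samples)

# Usage: reduce the ~10M single-turn dialogues to a smaller pool for preference annotation.
# subset = sample_dialogues(all_dialogues, n_samples=120_000)
```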

The evaluation of MM-RLHF and MM-DPO shows significant improvements across multiple dimensions when applied to models such as LLaVA-OV-7B, LLaVA-OV-0.5B, and InternVL-1B. Conversational abilities improved by over 10%, while unsafe behavior decreased by at least 50%. The aligned models show better results in hallucination reduction, mathematical reasoning, and multi-image understanding, even without training data specific to some of these tasks. However, model-specific variations are observed, with different models requiring distinct hyperparameter settings for optimal performance. High-resolution tasks also show limited gains, because the dataset and filtering strategies were not designed with resolution optimization in mind.

In this paper, researchers introduced MM-RLHF, a dataset and alignment approach that marks significant progress in MLLM development. Unlike previous task-specific approaches, this method takes a holistic approach to improving model performance across multiple dimensions. The dataset's rich annotation granularity, including per-dimension scores and rankings, offers untapped potential for future development. Future research directions will focus on exploiting this granularity through advanced optimization techniques, addressing high-resolution data limitations, and expanding the dataset through semi-automated methods, potentially establishing a foundation for more robust multimodal learning frameworks.


Check out the paper and project page. All credit for this research goes to the researchers of this project.



