Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization

FOS: Computer and information sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
DOI: 10.48550/arxiv.2403.08730 Publication Date: 2024-03-13
ABSTRACT
Multimodal Large Language Models (MLLMs) excel in generating responses based on visual inputs. However, they often suffer from a bias towards generating responses similar to their pretraining corpus, overshadowing the importance of visual information. We treat this bias as a "preference" for pretraining statistics, which hinders the model's grounding in visual input. To mitigate this issue, we propose Bootstrapped Preference Optimization (BPO), which conducts preference learning with datasets containing negative responses bootstrapped from the model itself. Specifically, we propose the following two strategies: 1) using distorted image inputs to the MLLM to elicit responses that contain signified pretraining bias; 2) leveraging a text-based LLM to explicitly inject erroneous but common elements into the original response. Those undesirable responses are paired with the annotated responses to construct the preference dataset, which is subsequently utilized to perform preference learning. Our approach effectively suppresses pretrained LLM bias, enabling enhanced grounding in visual inputs. Extensive experimentation demonstrates significant performance improvements across multiple benchmarks, advancing the state-of-the-art in multimodal conversational systems.
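The preference-learning step described above can be sketched with a DPO-style pairwise objective: each "chosen" response is the annotated ground truth, and each "rejected" response is bootstrapped from the model itself (via a distorted image or LLM-injected errors). This is a minimal illustration assuming a DPO-like loss; the paper's exact objective and hyperparameters may differ, and the pairing helpers are hypothetical.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style preference loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each response under
    the policy being trained and under a frozen reference model.
    This is an illustrative sketch, not the paper's exact objective.
    """
    # Implicit rewards: log-ratio of policy vs. reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # Negative log-sigmoid of the margin: loss shrinks as the policy
    # prefers the annotated response over the bootstrapped negative.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# BPO-style pair construction (hypothetical data layout):
#   chosen   = annotated ground-truth response for the image
#   rejected = model's response to a distorted image, OR the original
#              response with erroneous elements injected by a text LLM
```

When the policy assigns a higher relative log-probability to the annotated response than to the bootstrapped negative, the loss falls below log 2; when it prefers the negative, the loss grows, pushing the model away from its pretraining bias.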