Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
FOS: Computer and information sciences
Computer Vision and Pattern Recognition (cs.CV)
DOI:
10.48550/arxiv.2502.01051
Publication Date:
2025-02-02
AUTHORS (9)
ABSTRACT
Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically leverage Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images at different timesteps and require complex transformations into pixel space. In this work, we demonstrate that diffusion models are inherently well-suited for step-level reward modeling in the latent space, as they can naturally extract features from noisy latent images. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of the diffusion model to predict preferences of latent images at various timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a method designed for step-level preference optimization directly in the latent space. Experimental results indicate that LPO not only significantly enhances performance in aligning diffusion models with general, aesthetic, and text-image alignment preferences, but also achieves a 2.5-28$\times$ training speedup compared to existing preference optimization methods. Our code will be available at https://github.com/casiatao/LPO.
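To make the abstract's core idea concrete, below is a minimal, self-contained sketch of a noise-aware latent reward model: it scores noisy latents directly, conditioned on the diffusion timestep, and is trained with a Bradley-Terry preference loss over (preferred, rejected) latent pairs. The small convolutional encoder here is a hypothetical stand-in for the reused diffusion backbone described in the abstract; the class and function names are illustrative assumptions, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of diffusion timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class LatentRewardModel(nn.Module):
    """Maps a noisy latent x_t plus its timestep t to a scalar reward."""
    def __init__(self, latent_channels: int = 4, width: int = 64, t_dim: int = 128):
        super().__init__()
        # Hypothetical stand-in for a pretrained diffusion (UNet) encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(latent_channels, width, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.t_proj = nn.Linear(t_dim, width)   # inject timestep information
        self.head = nn.Sequential(nn.Linear(width, width), nn.SiLU(), nn.Linear(width, 1))
        self.t_dim = t_dim

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x_t) + self.t_proj(timestep_embedding(t, self.t_dim))
        return self.head(h).squeeze(-1)  # (B,) scalar rewards

def preference_loss(rm, x_win, x_lose, t):
    """Bradley-Terry loss: the preferred latent should score higher at timestep t."""
    return -F.logsigmoid(rm(x_win, t) - rm(x_lose, t)).mean()

# Toy usage: pairs of noisy latents at random timesteps.
rm = LatentRewardModel()
x_win, x_lose = torch.randn(8, 4, 32, 32), torch.randn(8, 4, 32, 32)
t = torch.randint(0, 1000, (8,))
loss = preference_loss(rm, x_win, x_lose, t)
loss.backward()
print(f"preference loss: {loss.item():.4f}")
```

Because the reward is computed on latents at arbitrary timesteps, such a model could in principle supervise every denoising step without decoding to pixel space, which is the efficiency argument the abstract makes for LPO.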