Separate-and-Enhance: Compositional Finetuning for Text-to-Image Diffusion Models
DOI: 10.1145/3641519.3657527
Publication Date: 2024-07-12
ABSTRACT
Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for multi-object generation. In this work, we first show the fundamental reasons for such misalignment, identifying issues related to low attention activation and mask overlaps. We then propose a finetuning framework with two novel objectives, the Separate loss and the Enhance loss, which reduce object mask overlaps and maximize attention scores, respectively. Unlike conventional test-time adaptation methods, our model, once finetuned on a set of critical parameters, can directly perform inference given an arbitrary prompt, which enhances scalability and generalizability. Through comprehensive evaluations, our model demonstrates superior performance in image realism, text-image alignment, and adaptability, significantly surpassing established baselines. Furthermore, training on a diverse range of concepts enables it to generalize effectively to novel concepts, exhibiting enhanced performance compared to models trained on individual concept pairs.
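To make the two objectives concrete, below is a minimal sketch, not the authors' released code, of how a Separate loss could penalize overlap between per-object cross-attention maps while an Enhance loss rewards high peak activations. The tensor shapes, the cosine-similarity overlap measure, and the exact loss forms are assumptions for illustration; the paper's formulation may differ.

```python
import torch

def separate_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    """Penalize spatial overlap between attention maps of different objects.

    attn_maps: (K, H, W), one non-negative cross-attention map per
    object token; K >= 2 is assumed.
    """
    flat = attn_maps.flatten(1)                          # (K, H*W)
    flat = flat / (flat.norm(dim=1, keepdim=True) + 1e-8)
    sim = flat @ flat.T                                  # pairwise cosine similarity
    off_diag = sim - torch.diag(torch.diag(sim))         # drop self-similarity
    k = attn_maps.shape[0]
    return off_diag.sum() / (k * (k - 1))                # mean over ordered pairs

def enhance_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    """Push the peak activation of each object's attention map toward 1."""
    peaks = attn_maps.flatten(1).max(dim=1).values       # (K,)
    return (1.0 - peaks).mean()

# Illustrative use: in actual finetuning, `attn` would be gathered from the
# UNet's cross-attention layers for the prompt's object tokens.
attn = torch.rand(3, 16, 16, requires_grad=True)         # 3 placeholder maps
total = separate_loss(attn) + enhance_loss(attn)
total.backward()                                         # gradients flow to `attn`
```

In this sketch, minimizing the Separate loss pulls the objects' attention masks apart (low pairwise similarity), while minimizing the Enhance loss raises each token's maximum activation, matching the abstract's description of reducing mask overlaps and maximizing attention scores.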