NFDI4DS | UHH-SEMS - Publication Details

Muse: Text-To-Image Generation via Masked Generative Transformers

FOS: Computer and information sciences Computer Science - Machine Learning Artificial Intelligence (cs.AI) Computer Science - Artificial Intelligence Computer Vision and Pattern Recognition (cs.CV) Computer Science - Computer Vision and Pattern Recognition 0202 electrical engineering, electronic engineering, information engineering 02 engineering and technology Machine Learning (cs.LG)

DOI: 10.48550/arxiv.2301.00704 Publication Date: 2023-01-01

Abstract Supplemental Material References Cited by

AUTHORS (12)

Chang, Huiwen

Zhang, Han

Barber, Jarred

Maschinot, AJ

Lezama, Jose

Jiang, Lu

Yang, Ming-Hsuan

Murphy, Kevin

Freeman, William T.

Rubinstein, Michael

Li, Yuanzhen

Krishnan, Dilip

ABSTRACT

We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io

SUPPLEMENTAL MATERIAL

Coming soon ....

REFERENCES ()

CITATIONS ()

EXTERNAL LINKS

OPENAIRE - Products

PlumX Metrics

Muse: Text-To-Image Generation via Masked Generative Transformers

RECOMMENDATIONS

FAIR ASSESSMENT

Coming soon ....

JUPYTER LAB

Coming soon ....