Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

FOS: Computer and information sciences; Computer Vision and Pattern Recognition (cs.CV)
DOI: 10.48550/arxiv.2405.08748
Publication Date: 2024-05-14
ABSTRACT
We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT