Large Language Diffusion Models

FOS: Computer and information sciences
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
DOI: 10.48550/arxiv.2502.09992
Publication Date: 2025-02-14
ABSTRACT
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer that predicts masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs such as LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that the key LLM capabilities discussed above are inherently tied to ARMs.
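To make the abstract's description concrete, the sketch below illustrates the general masked-diffusion training recipe it refers to: a masking ratio t is sampled, tokens are masked in the forward process, and a Transformer mask predictor is trained with a cross-entropy term on the masked positions, reweighted to form a likelihood bound. This is a minimal illustrative sketch, not the authors' released implementation; all names (MaskPredictor, mask_id, the toy sizes, etc.) are assumptions introduced here for exposition.

```python
# Minimal sketch (assumed, not LLaDA's actual code) of a masked-diffusion
# training step: forward masking process + Transformer mask predictor
# optimized via a reweighted cross-entropy likelihood bound.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy hyperparameters (illustrative assumptions, far smaller than an 8B model).
vocab_size, mask_id, seq_len, d_model = 1000, 999, 64, 128

class MaskPredictor(nn.Module):
    """Vanilla Transformer encoder predicting the original token at masked positions."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))  # (B, L, vocab) logits

def masked_diffusion_loss(model, x0):
    """Monte Carlo estimate of the likelihood bound: mask each token with
    probability t, score the model only on masked positions, reweight by 1/t."""
    b = x0.size(0)
    t = torch.rand(b, 1, device=x0.device).clamp(min=1e-3)   # masking ratio per sequence
    is_masked = torch.rand_like(x0, dtype=torch.float) < t   # forward masking process
    xt = torch.where(is_masked, torch.full_like(x0, mask_id), x0)
    logits = model(xt)
    token_loss = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)
    # Sum over masked tokens only, weight by 1/t, normalize by sequence length.
    per_seq = (token_loss * is_masked).sum(dim=1) / t.squeeze(1) / x0.size(1)
    return per_seq.mean()

# Usage: one training step on a random toy batch of "clean" sequences.
model = MaskPredictor()
x0 = torch.randint(0, vocab_size - 1, (8, seq_len))
loss = masked_diffusion_loss(model, x0)
loss.backward()
print(float(loss))
```

At inference time, generation would run the reverse process instead: start from a fully masked sequence and iteratively fill in tokens predicted by the same mask predictor; that sampling loop is omitted here.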