Gradient Multi-Normalization for Stateless and Scalable LLM Training
DOI:
10.48550/arxiv.2502.06742
Publication Date:
2025-02-10
AUTHORS (4)
ABSTRACT
Training large language models (LLMs) typically relies on adaptive optimizers like Adam (Kingma & Ba, 2015), which store additional state information to accelerate convergence but incur significant memory overhead. Recent efforts, such as SWAN (Ma et al., 2024), address this by eliminating the need for optimizer states while achieving comparable performance via a multi-step preprocessing procedure applied to instantaneous gradients. Motivated by the success of SWAN, we introduce a novel framework for designing stateless optimizers that normalize stochastic gradients according to multiple norms. To achieve this, we propose a simple alternating scheme to enforce normalization with respect to these norms. We show that our procedure can produce, up to arbitrary precision, a fixed point of this problem, and that SWAN is a particular instance of our approach with carefully chosen norms, providing a deeper understanding of its design. However, SWAN's computationally expensive whitening/orthogonalization step limits its practicality for large LMs. Using our principled perspective, we develop a more efficient, scalable, and practical stateless optimizer. Our algorithm relaxes the properties enforced by SWAN, significantly reducing its computational cost while retaining its memory efficiency, making it applicable to training large-scale models. Experiments on pre-training LLaMA models with up to 1 billion parameters demonstrate a 3X speedup over Adam with significantly reduced memory requirements, outperforming other memory-efficient baselines.
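The alternating scheme described above can be illustrated with a minimal sketch. The specific norms used by the paper are not given in this abstract, so the choice below — per-row and per-column RMS norms on the gradient matrix, enforced by alternately rescaling rows and columns until both constraints approximately hold — is an illustrative assumption, not the paper's actual configuration; `multi_normalize` is a hypothetical helper name.

```python
import numpy as np

def multi_normalize(G, num_iters=50, eps=1e-8):
    """Illustrative alternating multi-normalization (assumed norms):
    alternately rescale rows and columns of the gradient matrix G so
    that both the per-row and per-column RMS are (approximately) 1.
    This Sinkhorn-like iteration approaches a fixed point satisfying
    both normalization constraints simultaneously."""
    G = np.asarray(G, dtype=np.float64).copy()
    for _ in range(num_iters):
        # Enforce unit RMS for every row of G.
        row_rms = np.sqrt(np.mean(G**2, axis=1, keepdims=True)) + eps
        G = G / row_rms
        # Enforce unit RMS for every column of G.
        col_rms = np.sqrt(np.mean(G**2, axis=0, keepdims=True)) + eps
        G = G / col_rms
    return G
```

Because the iteration ends with a column step, the column constraint holds exactly at output, while the row constraint is satisfied up to the precision reached after `num_iters` alternations — mirroring the abstract's "up to arbitrary precision" fixed-point claim.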