Gradient Multi-Normalization for Stateless and Scalable LLM Training
DOI:
10.48550/arxiv.2502.06742
Publication Date:
2025-02-10
AUTHORS (4)
ABSTRACT
Training large language models (LLMs) typically relies on adaptive optimizers like Adam (Kingma & Ba, 2015), which store additional state information to accelerate convergence but incur significant memory overhead. Recent efforts, such as SWAN (Ma et al., 2024), address this by eliminating the need for optimizer states while achieving comparable performance via a multi-step preprocessing procedure applied to instantaneous gradients. Motivated by the success of SWAN, we introduce a novel framework for designing stateless optimizers that normalize stochastic gradients according to multiple norms. To achieve this, we propose a simple alternating scheme to enforce normalization with respect to these norms. We show that our procedure can produce, up to arbitrary precision, a fixed point of this problem, and that SWAN is a particular instance of our approach with carefully chosen norms, providing a deeper understanding of its design. However, SWAN's computationally expensive whitening/orthogonalization step limits its practicality for large LMs. Using our principled perspective, we develop a more efficient, scalable, and practical stateless optimizer. Our algorithm relaxes the properties enforced by SWAN, significantly reducing its computational cost while retaining its memory efficiency, making it applicable to training large-scale models. Experiments on pre-training LLaMA models with up to 1 billion parameters demonstrate a 3X speedup over Adam with significantly reduced memory requirements, outperforming other memory-efficient baselines.
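The alternating scheme described above can be illustrated with a minimal sketch. The specific norms used by the paper are not given in this abstract, so the choice below — per-row and per-column RMS norms on the gradient matrix, enforced by alternately rescaling rows and columns until both constraints approximately hold — is an illustrative assumption, not the paper's actual configuration; `multi_normalize` is a hypothetical helper name.

```python
import numpy as np

def multi_normalize(G, num_iters=50, eps=1e-8):
    """Illustrative alternating multi-normalization (assumed norms):
    alternately rescale rows and columns of the gradient matrix G so
    that both the per-row and per-column RMS are (approximately) 1.
    This Sinkhorn-like iteration approaches a fixed point satisfying
    both normalization constraints simultaneously."""
    G = np.asarray(G, dtype=np.float64).copy()
    for _ in range(num_iters):
        # Enforce unit RMS for every row of G.
        row_rms = np.sqrt(np.mean(G**2, axis=1, keepdims=True)) + eps
        G = G / row_rms
        # Enforce unit RMS for every column of G.
        col_rms = np.sqrt(np.mean(G**2, axis=0, keepdims=True)) + eps
        G = G / col_rms
    return G
```

Because the iteration ends with a column step, the column constraint holds exactly at output, while the row constraint is satisfied up to the precision reached after `num_iters` alternations — mirroring the abstract's "up to arbitrary precision" fixed-point claim.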