Variance-reduced Zeroth-Order Methods for Fine-Tuning Language Models

DOI: 10.48550/arxiv.2404.08080 Publication Date: 2024-04-11
ABSTRACT
Fine-tuning language models (LMs) has demonstrated success in a wide array of downstream tasks. However, as LMs are scaled up, the memory requirements for backpropagation become prohibitively high. Zeroth-order (ZO) optimization methods can leverage memory-efficient forward passes to estimate gradients. More recently, MeZO, an adaptation of ZO-SGD, has been shown to consistently outperform zero-shot and in-context learning when combined with suitable task prompts. In this work, we couple ZO methods with variance reduction techniques to enhance stability and convergence for inference-based LM fine-tuning. We introduce Memory-Efficient Zeroth-Order Stochastic Variance-Reduced Gradient (MeZO-SVRG) and demonstrate its efficacy across multiple LM fine-tuning tasks, eliminating the reliance on task-specific prompts. Evaluated on a range of both masked and autoregressive LMs on benchmark GLUE tasks, MeZO-SVRG outperforms MeZO with up to a 20% increase in test accuracies in both full- and partial-parameter fine-tuning settings. MeZO-SVRG benefits from reduced computation time, as it often surpasses MeZO's peak test accuracy with a $2\times$ reduction in GPU-hours. Furthermore, MeZO-SVRG significantly reduces the required memory footprint compared to first-order SGD, i.e., by $2\times$ for the considered autoregressive models. Our experiments highlight that MeZO-SVRG's memory savings over SGD progressively improve with larger batch sizes.
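To make the core idea concrete, the sketch below combines a two-point SPSA-style zeroth-order gradient estimate (two forward passes per estimate) with an SVRG control variate on a toy least-squares problem. This is a minimal NumPy illustration under our own assumptions, not the paper's MeZO-SVRG implementation; the function names, hyperparameters, and toy loss are hypothetical.

```python
# Minimal NumPy sketch of a zeroth-order SVRG-style update on a toy
# least-squares problem. Illustrative assumption only -- not the authors'
# MeZO-SVRG code; loss, zo_grad, and all hyperparameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(256, 32))   # toy design matrix (stand-in for a dataset)
b = rng.normal(size=256)         # toy targets

def loss(x, idx):
    """Mean-squared error on the rows selected by idx (a minibatch)."""
    r = A[idx] @ x - b[idx]
    return 0.5 * np.mean(r * r)

def zo_grad(x, idx, z, mu=1e-3):
    """Two-point SPSA estimate: two forward passes along perturbation z."""
    return (loss(x + mu * z, idx) - loss(x - mu * z, idx)) / (2.0 * mu) * z

x = np.zeros(32)
lr, outer_iters, inner_iters, batch = 1e-2, 20, 50, 16
full_idx = np.arange(len(b))

for _ in range(outer_iters):
    # Anchor point: average several ZO estimates of the full-batch gradient.
    x_anchor = x.copy()
    g_full = np.mean(
        [zo_grad(x_anchor, full_idx, rng.normal(size=x.shape)) for _ in range(10)],
        axis=0,
    )
    for _ in range(inner_iters):
        idx = rng.choice(len(b), size=batch, replace=False)
        z = rng.normal(size=x.shape)  # shared perturbation for both terms
        # SVRG-style control variate built entirely from ZO estimates.
        g = zo_grad(x, idx, z) - zo_grad(x_anchor, idx, z) + g_full
        x -= lr * g

print("final full-batch loss:", loss(x, full_idx))
```

Sharing the perturbation z between the minibatch estimate at the current iterate and at the anchor correlates the two terms, which is what lets the control variate reduce, rather than add, variance.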