Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Hessian matrix
Clipping
FLOPS
Convexity
Perplexity
DOI:
10.48550/arxiv.2305.14342
Publication Date:
2023-05-23
AUTHORS (5)
Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma
ABSTRACT
Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction in the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of the Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible memory and average per-step time overhead. On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time, achieving the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time. Theoretically, we show that, in a much simplified setting, Sophia adapts to heterogeneous curvatures in different parameter dimensions, and thus has a run-time bound that does not depend on the condition number of the loss.
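For a concrete picture of the update rule described in the abstract, below is a minimal Python/NumPy sketch of a Sophia-style step: an exponential moving average of gradients is divided by a periodically refreshed moving average of a diagonal Hessian estimate, and the result is clipped element-wise. This is an illustrative sketch, not the authors' reference implementation; the function name sophia_step, the hess_estimator callable, and the hyperparameter defaults (lr, beta1, beta2, rho, eps, k) are assumptions for exposition, and the paper's actual Hessian estimators and tuning may differ.

```python
import numpy as np

def sophia_step(theta, grad, state, step, hess_estimator,
                lr=1e-4, beta1=0.965, beta2=0.99, rho=0.04, eps=1e-12, k=10):
    """One Sophia-style update (illustrative sketch; hyperparameters assumed):
    an EMA of gradients divided by an EMA of a diagonal Hessian estimate,
    followed by element-wise clipping of the resulting step."""
    m = state.setdefault("m", np.zeros_like(theta))  # moving average of gradients
    h = state.setdefault("h", np.zeros_like(theta))  # moving average of diagonal Hessian

    # Moving average of gradients
    m[:] = beta1 * m + (1 - beta1) * grad

    # Refresh the diagonal Hessian estimate only every k steps, so the
    # estimator's cost is amortized to a small per-step overhead
    if step % k == 0:
        h_hat = hess_estimator(theta)  # any stochastic diagonal-Hessian estimate
        h[:] = beta2 * h + (1 - beta2) * h_hat

    # Pre-conditioned step with element-wise clipping; the clip bounds the
    # worst-case per-coordinate move by lr * rho
    update = np.clip(m / np.maximum(h, eps), -rho, rho)
    theta -= lr * update
    return theta
```

The element-wise clip is what the abstract refers to as controlling the worst-case update size: however small or noisy the Hessian estimate is in some coordinate (for example, in flat or non-convex regions), no coordinate moves by more than lr * rho in a single step.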