Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Hessian matrix
Clipping
FLOPS
Convexity
Perplexity
DOI:
10.48550/arxiv.2305.14342
Publication Date:
2023-05-23
AUTHORS (5)
Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma
ABSTRACT
Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction in the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of the Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible memory and average per-step time overhead. On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time, achieving the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time. Theoretically, we show that, in a much simplified setting, Sophia adapts to heterogeneous curvatures in different parameter dimensions, and thus has a run-time bound that does not depend on the condition number of the loss.
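For a concrete picture of the update rule described in the abstract, below is a minimal Python/NumPy sketch of a Sophia-style step: an exponential moving average of gradients is divided by a periodically refreshed moving average of a diagonal Hessian estimate, and the result is clipped element-wise. This is an illustrative sketch, not the authors' reference implementation; the function name sophia_step, the hess_estimator callable, and the hyperparameter defaults (lr, beta1, beta2, rho, eps, k) are assumptions for exposition, and the paper's actual Hessian estimators and tuning may differ.

```python
import numpy as np

def sophia_step(theta, grad, state, step, hess_estimator,
                lr=1e-4, beta1=0.965, beta2=0.99, rho=0.04, eps=1e-12, k=10):
    """One Sophia-style update (illustrative sketch; hyperparameters assumed):
    an EMA of gradients divided by an EMA of a diagonal Hessian estimate,
    followed by element-wise clipping of the resulting step."""
    m = state.setdefault("m", np.zeros_like(theta))  # moving average of gradients
    h = state.setdefault("h", np.zeros_like(theta))  # moving average of diagonal Hessian

    # Moving average of gradients
    m[:] = beta1 * m + (1 - beta1) * grad

    # Refresh the diagonal Hessian estimate only every k steps, so the
    # estimator's cost is amortized to a small per-step overhead
    if step % k == 0:
        h_hat = hess_estimator(theta)  # any stochastic diagonal-Hessian estimate
        h[:] = beta2 * h + (1 - beta2) * h_hat

    # Pre-conditioned step with element-wise clipping; the clip bounds the
    # worst-case per-coordinate move by lr * rho
    update = np.clip(m / np.maximum(h, eps), -rho, rho)
    theta -= lr * update
    return theta
```

The element-wise clip is what the abstract refers to as controlling the worst-case update size: however small or noisy the Hessian estimate is in some coordinate (for example, in flat or non-convex regions), no coordinate moves by more than lr * rho in a single step.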