MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training
FOS: Computer and information sciences
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
DOI: 10.1145/3709703
Publication Date: 2025-02-11
ABSTRACT
Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing frameworks have adopted strategies such as recomputation and various forms of parallelism. Nevertheless, these techniques rely on redundant computation or extensive communication, resulting in low Model FLOPS Utilization (MFU). In this paper, we propose MEMO, a novel LLM training framework designed for fine-grained activation memory management. Given the quadratic scaling of computation and linear scaling of memory with sequence length when using FlashAttention, we offload memory-consuming activations to CPU memory after each layer's forward pass and fetch them during the backward pass. To maximize the swapping of activations without hindering computation, and to avoid exhausting limited CPU memory, we implement a token-wise activation recomputation and swapping mechanism. Furthermore, we tackle the memory fragmentation issue by employing a bi-level Mixed Integer Programming (MIP) approach, optimizing memory reuse across transformer layers. Empirical results demonstrate that MEMO achieves an average of 1.97x and 1.80x MFU compared to Megatron-LM and DeepSpeed, respectively. This improvement is attributed to MEMO's ability to minimize memory fragmentation, reduce recomputation and intensive communication, and circumvent the delays of the memory reorganization process caused by fragmentation. By leveraging fine-grained activation memory management, MEMO facilitates the efficient training of a 7B LLM with a sequence length of 1 million on just 8 A800 GPUs, achieving an MFU of 52.30%.
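The swapping idea described in the abstract can be pictured with PyTorch's saved-tensor hooks. The sketch below is only a minimal, coarse-grained illustration, not MEMO's implementation: it offloads every activation autograd saves to pinned CPU memory during the forward pass and fetches it back for the backward pass, omitting the paper's token-wise swap-versus-recompute decisions and the overlap of transfers with computation. It assumes a CUDA-capable machine; all names are illustrative.

```python
import torch

def pack_to_cpu(saved):
    # Called when autograd saves a tensor for backward: copy it into pinned
    # CPU memory so the GPU copy can be released after the forward pass.
    if saved.device.type != "cuda":
        return saved  # nothing to offload
    cpu_copy = torch.empty(saved.size(), dtype=saved.dtype, pin_memory=True)
    cpu_copy.copy_(saved, non_blocking=True)
    return (saved.device, cpu_copy)

def unpack_to_gpu(packed):
    # Called when the backward pass needs the tensor: fetch it back to the GPU.
    if isinstance(packed, torch.Tensor):
        return packed
    device, cpu_copy = packed
    return cpu_copy.to(device, non_blocking=True)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 1024)
).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)

# Every activation saved for backward inside this context is swapped to CPU.
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_gpu):
    loss = model(x).sum()
loss.backward()
```

Because these hooks apply to every tensor autograd saves (including weights), a real system would filter by size and schedule the device-to-host copies on a separate stream so they hide behind computation; deciding, per token, what to swap and what to recompute is exactly the scheduling problem MEMO's token-wise mechanism addresses.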
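The memory-reuse planning can likewise be pictured as an integer program. MEMO's actual planner is a bi-level MIP optimizing reuse across transformer layers; the single-level toy below (solved with the PuLP/CBC solver, using hypothetical tensor names, sizes, and lifetimes) only illustrates the underlying idea: tensors whose lifetimes overlap must occupy disjoint address ranges, and the objective is to minimize the peak address so non-overlapping tensors share memory.

```python
import pulp

# Hypothetical tensors: name -> (size in MB, first step alive, last step alive).
tensors = {"qkv": (64, 0, 2), "attn_out": (64, 1, 4), "mlp_in": (128, 3, 5)}
BIG = sum(size for size, _, _ in tensors.values())  # big-M constant

prob = pulp.LpProblem("memory_reuse", pulp.LpMinimize)
offset = {t: pulp.LpVariable(f"off_{t}", lowBound=0) for t in tensors}
peak = pulp.LpVariable("peak", lowBound=0)
prob += peak  # objective: minimize the peak memory footprint

names = list(tensors)
for t in names:
    prob += offset[t] + tensors[t][0] <= peak  # each tensor fits under the peak
for i, a in enumerate(names):
    for b in names[i + 1:]:
        (sa, fa, la), (sb, fb, lb) = tensors[a], tensors[b]
        if fa <= lb and fb <= la:  # lifetimes overlap -> addresses must not
            above = pulp.LpVariable(f"{a}_above_{b}", cat="Binary")
            prob += offset[a] >= offset[b] + sb - BIG * (1 - above)
            prob += offset[b] >= offset[a] + sa - BIG * above

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({t: offset[t].value() for t in names}, "peak:", peak.value())
```

In this toy instance, `qkv` and `mlp_in` never coexist, so the solver places them at the same offset and the peak drops from 256 MB to 192 MB; the same reuse principle, applied across layers and at two levels, is what lets MEMO curb fragmentation.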