MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training
FOS: Computer and information sciences
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
DOI: 10.1145/3709703
Publication Date: 2025-02-11
ABSTRACT
Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing frameworks have adopted strategies such as recomputation and various forms of parallelism. Nevertheless, these techniques rely on redundant computation or extensive communication, resulting in low Model FLOPS Utilization (MFU). In this paper, we propose MEMO, a novel LLM training framework designed for fine-grained activation memory management. Given the quadratic scaling of computation and linear scaling of memory with sequence length when using FlashAttention, we offload memory-consuming activations to CPU memory after each layer's forward pass and fetch them during the backward pass. To maximize the swapping of activations without hindering computation, and to avoid exhausting limited CPU memory, we implement a token-wise activation recomputation and swapping mechanism. Furthermore, we tackle the memory fragmentation issue by employing a bi-level Mixed Integer Programming (MIP) approach, optimizing memory reuse across transformer layers. Empirical results demonstrate that MEMO achieves an average of 1.97x and 1.80x MFU compared to Megatron-LM and DeepSpeed, respectively. This improvement is attributed to MEMO's ability to minimize memory fragmentation, reduce recomputation and intensive communication, and circumvent the delays of the memory reorganization process caused by fragmentation. By leveraging fine-grained activation memory management, MEMO facilitates the efficient training of a 7B LLM with a sequence length of 1 million on just 8 A800 GPUs, achieving an MFU of 52.30%.
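The swapping idea described in the abstract can be pictured with PyTorch's saved-tensor hooks. The sketch below is only a minimal, coarse-grained illustration, not MEMO's implementation: it offloads every activation autograd saves to pinned CPU memory during the forward pass and fetches it back for the backward pass, omitting the paper's token-wise swap-versus-recompute decisions and the overlap of transfers with computation. It assumes a CUDA-capable machine; all names are illustrative.

```python
import torch

def pack_to_cpu(saved):
    # Called when autograd saves a tensor for backward: copy it into pinned
    # CPU memory so the GPU copy can be released after the forward pass.
    if saved.device.type != "cuda":
        return saved  # nothing to offload
    cpu_copy = torch.empty(saved.size(), dtype=saved.dtype, pin_memory=True)
    cpu_copy.copy_(saved, non_blocking=True)
    return (saved.device, cpu_copy)

def unpack_to_gpu(packed):
    # Called when the backward pass needs the tensor: fetch it back to the GPU.
    if isinstance(packed, torch.Tensor):
        return packed
    device, cpu_copy = packed
    return cpu_copy.to(device, non_blocking=True)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 1024)
).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)

# Every activation saved for backward inside this context is swapped to CPU.
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_gpu):
    loss = model(x).sum()
loss.backward()
```

Because these hooks apply to every tensor autograd saves (including weights), a real system would filter by size and schedule the device-to-host copies on a separate stream so they hide behind computation; deciding, per token, what to swap and what to recompute is exactly the scheduling problem MEMO's token-wise mechanism addresses.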
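The memory-reuse planning can likewise be pictured as an integer program. MEMO's actual planner is a bi-level MIP optimizing reuse across transformer layers; the single-level toy below (solved with the PuLP/CBC solver, using hypothetical tensor names, sizes, and lifetimes) only illustrates the underlying idea: tensors whose lifetimes overlap must occupy disjoint address ranges, and the objective is to minimize the peak address so non-overlapping tensors share memory.

```python
import pulp

# Hypothetical tensors: name -> (size in MB, first step alive, last step alive).
tensors = {"qkv": (64, 0, 2), "attn_out": (64, 1, 4), "mlp_in": (128, 3, 5)}
BIG = sum(size for size, _, _ in tensors.values())  # big-M constant

prob = pulp.LpProblem("memory_reuse", pulp.LpMinimize)
offset = {t: pulp.LpVariable(f"off_{t}", lowBound=0) for t in tensors}
peak = pulp.LpVariable("peak", lowBound=0)
prob += peak  # objective: minimize the peak memory footprint

names = list(tensors)
for t in names:
    prob += offset[t] + tensors[t][0] <= peak  # each tensor fits under the peak
for i, a in enumerate(names):
    for b in names[i + 1:]:
        (sa, fa, la), (sb, fb, lb) = tensors[a], tensors[b]
        if fa <= lb and fb <= la:  # lifetimes overlap -> addresses must not
            above = pulp.LpVariable(f"{a}_above_{b}", cat="Binary")
            prob += offset[a] >= offset[b] + sb - BIG * (1 - above)
            prob += offset[b] >= offset[a] + sa - BIG * above

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({t: offset[t].value() for t in names}, "peak:", peak.value())
```

In this toy instance, `qkv` and `mlp_in` never coexist, so the solver places them at the same offset and the peak drops from 256 MB to 192 MB; the same reuse principle, applied across layers and at two levels, is what lets MEMO curb fragmentation.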