Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
FOS: Computer and information sciences
Distributed, Parallel, and Cluster Computing (cs.DC)
Machine Learning (cs.LG)
DOI:
10.48550/arxiv.2409.17264
Publication Date:
2024-09-25
AUTHORS (8)
ABSTRACT
As large language models (LLMs) handle increasingly longer contexts, serving inference requests for context lengths in the range of millions of tokens presents unique challenges. While existing techniques are effective for training, they fail to address the distinct challenges of inference, such as the varying prefill and decode phases and their associated latency constraints -- like Time to First Token (TTFT) and Time per Output Token (TPOT). Furthermore, no long-context inference solutions today address head-of-line blocking. We present Medha, a system for efficient long-context LLM inference that introduces three key innovations: adaptive chunking with slack-aware scheduling to prevent head-of-line blocking, Sequence Pipeline Parallelism (SPP) to reduce TTFT, and KV Cache Parallelism (KVP) to minimize TPOT. By combining these into a novel 3D parallelism serving engine, Medha achieves unprecedented scale, supporting contexts of up to 10M tokens with production-grade latency. Our evaluation shows Medha reduces median latency by 30x compared to state-of-the-art systems when serving a mix of short and long requests, while improving throughput upwards of 5x. This enables, for the first time, serving multi-million context length requests at scale without compromising on shorter request latencies or efficiency.
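The adaptive chunking and slack-aware scheduling described in the abstract can be illustrated with a small sketch. The Python below is not Medha's implementation: the Request fields, the adaptive_chunk heuristic, the slack threshold, and the per-iteration token budget are all assumptions made for illustration. It only shows the general pattern, where long prefills are split into chunks and the scheduler orders work by deadline slack so short requests are not stuck behind a multi-million-token prefill.

# Illustrative sketch only (not Medha's actual scheduler): slack-aware scheduling
# of adaptively sized prefill chunks. All names and thresholds are hypothetical.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    slack: float                                   # time remaining until the TTFT deadline
    name: str = field(compare=False)
    remaining_prefill: int = field(compare=False)  # prompt tokens still to be prefilled

def adaptive_chunk(req: Request, token_budget: int, min_chunk: int = 512) -> int:
    """Pick a chunk size: a request with ample slack takes a modest chunk so that
    requests queued behind it are not blocked (head-of-line blocking)."""
    if req.slack > 1.0:
        return min(min_chunk, req.remaining_prefill, token_budget)
    return min(req.remaining_prefill, token_budget)  # tight slack: take as much as allowed

def schedule_step(queue: list[Request], token_budget: int) -> list[tuple[str, int]]:
    """One scheduling iteration: serve requests in order of least slack,
    packing chunks until the per-iteration token budget is exhausted."""
    batch, leftovers = [], []
    while queue and token_budget > 0:
        req = heapq.heappop(queue)
        chunk = adaptive_chunk(req, token_budget)
        token_budget -= chunk
        req.remaining_prefill -= chunk
        batch.append((req.name, chunk))
        if req.remaining_prefill > 0:
            leftovers.append(req)
    for req in leftovers:                          # unfinished prefills return to the queue
        heapq.heappush(queue, req)
    return batch

# Example: a 1M-token prefill with slack shares one iteration with a short request.
queue: list[Request] = []
heapq.heappush(queue, Request(slack=5.0, name="long-1M", remaining_prefill=1_000_000))
heapq.heappush(queue, Request(slack=0.2, name="short-2K", remaining_prefill=2_000))
print(schedule_step(queue, token_budget=4_096))

Running the example, the short request with little slack is prefilled in full first, and the long request contributes only a small chunk this iteration, which is the head-of-line-blocking avoidance the abstract refers to, under the stated assumptions.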