Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
FOS: Computer and information sciences
Distributed, Parallel, and Cluster Computing (cs.DC)
Machine Learning (cs.LG)
DOI:
10.48550/arxiv.2409.17264
Publication Date:
2024-09-25
AUTHORS (8)
ABSTRACT
As large language models (LLMs) handle increasingly longer contexts, serving inference requests for context lengths in the range of millions of tokens presents unique challenges. While existing techniques are effective for training, they fail to address the distinct challenges of inference, such as the varying prefill and decode phases and their associated latency constraints -- like Time to First Token (TTFT) and Time per Output Token (TPOT). Furthermore, no long-context inference solutions today address head-of-line blocking. We present Medha, a system for efficient long-context LLM inference that introduces three key innovations: adaptive chunking with slack-aware scheduling to prevent head-of-line blocking, Sequence Pipeline Parallelism (SPP) to reduce TTFT, and KV Cache Parallelism (KVP) to minimize TPOT. By combining these into a novel 3D parallelism serving engine, Medha achieves unprecedented scale, supporting contexts of up to 10M tokens with production-grade latency. Our evaluation shows Medha reduces median latency by 30x compared to state-of-the-art systems when serving a mix of short and long requests, while improving throughput upwards of 5x. This enables, for the first time, serving multi-million context length requests at scale without compromising on shorter request latencies or efficiency.
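The adaptive chunking and slack-aware scheduling described in the abstract can be illustrated with a small sketch. The Python below is not Medha's implementation: the Request fields, the adaptive_chunk heuristic, the slack threshold, and the per-iteration token budget are all assumptions made for illustration. It only shows the general pattern, where long prefills are split into chunks and the scheduler orders work by deadline slack so short requests are not stuck behind a multi-million-token prefill.

# Illustrative sketch only (not Medha's actual scheduler): slack-aware scheduling
# of adaptively sized prefill chunks. All names and thresholds are hypothetical.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    slack: float                                   # time remaining until the TTFT deadline
    name: str = field(compare=False)
    remaining_prefill: int = field(compare=False)  # prompt tokens still to be prefilled

def adaptive_chunk(req: Request, token_budget: int, min_chunk: int = 512) -> int:
    """Pick a chunk size: a request with ample slack takes a modest chunk so that
    requests queued behind it are not blocked (head-of-line blocking)."""
    if req.slack > 1.0:
        return min(min_chunk, req.remaining_prefill, token_budget)
    return min(req.remaining_prefill, token_budget)  # tight slack: take as much as allowed

def schedule_step(queue: list[Request], token_budget: int) -> list[tuple[str, int]]:
    """One scheduling iteration: serve requests in order of least slack,
    packing chunks until the per-iteration token budget is exhausted."""
    batch, leftovers = [], []
    while queue and token_budget > 0:
        req = heapq.heappop(queue)
        chunk = adaptive_chunk(req, token_budget)
        token_budget -= chunk
        req.remaining_prefill -= chunk
        batch.append((req.name, chunk))
        if req.remaining_prefill > 0:
            leftovers.append(req)
    for req in leftovers:                          # unfinished prefills return to the queue
        heapq.heappush(queue, req)
    return batch

# Example: a 1M-token prefill with slack shares one iteration with a short request.
queue: list[Request] = []
heapq.heappush(queue, Request(slack=5.0, name="long-1M", remaining_prefill=1_000_000))
heapq.heappush(queue, Request(slack=0.2, name="short-2K", remaining_prefill=2_000))
print(schedule_step(queue, token_budget=4_096))

Running the example, the short request with little slack is prefilled in full first, and the long request contributes only a small chunk this iteration, which is the head-of-line-blocking avoidance the abstract refers to, under the stated assumptions.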