Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

FOS: Computer and information sciences; Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
DOI: 10.48550/arxiv.2403.02310
Publication Date: 2024-03-04
ABSTRACT
Each LLM serving request goes through two phases. The first is prefill, which processes the entire input prompt to produce one output token, and the second is decode, which generates the rest of the output tokens one at a time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and, consequently, for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations, which makes it challenging to achieve both high throughput and low latency.

We introduce an efficient LLM inference scheduler, Sarathi-Serve, inspired by the techniques we originally proposed for optimizing throughput in Sarathi. Sarathi-Serve leverages chunked-prefills from Sarathi to create stall-free schedules that can add new requests to a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Our evaluation shows that Sarathi-Serve improves serving throughput within desired latency SLOs of Mistral-7B by up to 2.6x on a single A100 GPU and up to 6.9x for Falcon-180B on 8 A100 GPUs, over Orca and vLLM.
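To make the chunked-prefill idea concrete, below is a minimal Python sketch of a stall-free batch builder. It is an illustration under assumptions, not the Sarathi-Serve implementation: the Request class, token_budget parameter, and build_stall_free_batch function are hypothetical names chosen for this example.

```python
# A minimal sketch of chunked-prefill, stall-free batching.
# All names (Request, token_budget, build_stall_free_batch) are illustrative
# assumptions, not the actual Sarathi-Serve API.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    prompt_len: int          # total prompt tokens (prefill work)
    prefill_done: int = 0    # prompt tokens already processed
    decoded: int = 0         # output tokens generated so far

    @property
    def in_prefill(self) -> bool:
        return self.prefill_done < self.prompt_len


def build_stall_free_batch(running: List[Request],
                           token_budget: int) -> List[Tuple[Request, int]]:
    """Form one iteration's batch under a fixed per-iteration token budget.

    Ongoing decodes are admitted first (one token each), so they are never
    paused; the leftover budget is filled with a chunk of prompt tokens from
    requests that are still in their prefill phase.
    """
    batch: List[Tuple[Request, int]] = []
    # 1) Every ongoing decode contributes exactly one token.
    for req in running:
        if not req.in_prefill and token_budget > 0:
            batch.append((req, 1))
            token_budget -= 1
    # 2) Fill the remaining budget with chunked prefill work.
    for req in running:
        if req.in_prefill and token_budget > 0:
            chunk = min(req.prompt_len - req.prefill_done, token_budget)
            batch.append((req, chunk))
            token_budget -= chunk
    return batch
```

Because each decode contributes only one token, the prefill chunk size is what bounds the compute per iteration, and therefore the extra latency that admitting a new prompt can impose on ongoing decodes. This is the throughput-latency knob the paper's stall-free schedules exploit.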