Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

FOS: Computer and information sciences; Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
DOI: 10.48550/arxiv.2403.02310
Publication Date: 2024-03-04
ABSTRACT
Each LLM serving request goes through two phases. The first is prefill, which processes the entire input prompt to produce one output token, and the second is decode, which generates the rest of the output tokens one at a time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and, consequently, for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations, which makes it challenging to achieve both high throughput and low latency.

We introduce an efficient LLM inference scheduler, Sarathi-Serve, inspired by the techniques we originally proposed for optimizing throughput in Sarathi. Sarathi-Serve leverages chunked-prefills from Sarathi to create stall-free schedules that can add new requests to a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Our evaluation shows that Sarathi-Serve improves serving throughput within desired latency SLOs of Mistral-7B by up to 2.6x on a single A100 GPU and up to 6.9x for Falcon-180B on 8 A100 GPUs, over Orca and vLLM.
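To make the chunked-prefill idea concrete, below is a minimal Python sketch of a stall-free batch builder. It is an illustration under assumptions, not the Sarathi-Serve implementation: the Request class, token_budget parameter, and build_stall_free_batch function are hypothetical names chosen for this example.

```python
# A minimal sketch of chunked-prefill, stall-free batching.
# All names (Request, token_budget, build_stall_free_batch) are illustrative
# assumptions, not the actual Sarathi-Serve API.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    prompt_len: int          # total prompt tokens (prefill work)
    prefill_done: int = 0    # prompt tokens already processed
    decoded: int = 0         # output tokens generated so far

    @property
    def in_prefill(self) -> bool:
        return self.prefill_done < self.prompt_len


def build_stall_free_batch(running: List[Request],
                           token_budget: int) -> List[Tuple[Request, int]]:
    """Form one iteration's batch under a fixed per-iteration token budget.

    Ongoing decodes are admitted first (one token each), so they are never
    paused; the leftover budget is filled with a chunk of prompt tokens from
    requests that are still in their prefill phase.
    """
    batch: List[Tuple[Request, int]] = []
    # 1) Every ongoing decode contributes exactly one token.
    for req in running:
        if not req.in_prefill and token_budget > 0:
            batch.append((req, 1))
            token_budget -= 1
    # 2) Fill the remaining budget with chunked prefill work.
    for req in running:
        if req.in_prefill and token_budget > 0:
            chunk = min(req.prompt_len - req.prefill_done, token_budget)
            batch.append((req, chunk))
            token_budget -= chunk
    return batch
```

Because each decode contributes only one token, the prefill chunk size is what bounds the compute per iteration, and therefore the extra latency that admitting a new prompt can impose on ongoing decodes. This is the throughput-latency knob the paper's stall-free schedules exploit.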