Towards Efficient Large Multimodal Model Serving
FOS: Computer and information sciences
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
DOI:
10.48550/arxiv.2502.00937
Publication Date:
2025-02-02
AUTHORS (12)
ABSTRACT
Recent advances in generative AI have led to large multi-modal models (LMMs) capable of simultaneously processing inputs of various modalities such as text, images, video, and audio. While these models demonstrate impressive capabilities, efficiently serving them in production environments poses significant challenges due to their complex architectures and heterogeneous resource requirements. We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, on six representative open-source models. We investigate their multi-stage inference pipelines and resource utilization patterns that lead to unique systems design implications. We also present an in-depth analysis of production traces, uncovering unique workload characteristics, including variable, heavy-tailed request distributions, diverse modal combinations, and bursty traffic patterns. Our key findings reveal that different inference stages exhibit highly heterogeneous performance characteristics and resource demands, while concurrent requests across stages lead to significant performance interference. To address these challenges, we propose a decoupled serving architecture that enables independent resource allocation and adaptive scaling for each stage. We further propose optimizations such as stage colocation to maximize throughput while meeting latency objectives.
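The decoupled serving idea from the abstract can be illustrated with a toy simulation: each inference stage gets its own queue and its own independently scaled pool of workers, and requests flow from one stage to the next. This is a minimal sketch under assumed details; the stage names, the one-request-per-tick worker model, and the queue-depth scaling rule are illustrative inventions, not the paper's actual design.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Stage:
    """One inference stage (e.g. modality encoding, LLM decoding) with its
    own queue and an independently allocated pool of workers."""
    name: str
    workers: int = 1
    queue: deque = field(default_factory=deque)

    def scale(self) -> None:
        # Naive adaptive scaling (an assumption for this sketch):
        # roughly one worker per four queued requests, at least one.
        self.workers = max(1, len(self.queue) // 4 + 1)

    def step(self) -> list:
        # Each worker completes one queued request per simulated tick.
        done = []
        for _ in range(min(self.workers, len(self.queue))):
            done.append(self.queue.popleft())
        return done

def serve(requests, stages):
    """Drive all requests through the stages in order.

    Stages are stepped last-to-first each tick so a request spends at
    least one tick per stage. Returns (completed requests, ticks used).
    """
    stages[0].queue.extend(requests)
    completed, ticks = [], 0
    while any(s.queue for s in stages):
        for i in range(len(stages) - 1, -1, -1):
            stage = stages[i]
            stage.scale()          # each stage scales independently
            finished = stage.step()
            if i + 1 < len(stages):
                stages[i + 1].queue.extend(finished)
            else:
                completed.extend(finished)
        ticks += 1
    return completed, ticks
```

Because each `Stage` scales on its own queue depth, a burst of image-heavy requests can grow the encoding pool without over-provisioning the decoding pool, which is the core benefit the abstract attributes to decoupling.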