
WaferLLM: A Wafer-Scale LLM Inference System

FOS: Computer and information sciences; Computer Science - Machine Learning; Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Artificial Intelligence; Hardware Architecture (cs.AR); Computer Science - Emerging Technologies; Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science - Hardware Architecture; Machine Learning (cs.LG)

DOI: 10.48550/arxiv.2502.04563

Publication Date: 2025-01-01


AUTHORS (8)

He, Congjie

Huang, Yeqi

Mu, Pei

Miao, Ziming

Xue, Jilong

Ma, Lingxiao

Yang, Fan

Mai, Luo

ABSTRACT

Emerging AI accelerators increasingly adopt wafer-scale manufacturing technologies, integrating hundreds of thousands of AI cores in a mesh-based architecture with large distributed on-chip memory (tens of GB in total) and ultra-high on-chip memory bandwidth (tens of PB/s). However, current LLM inference systems, optimized for shared-memory architectures such as GPUs, fail to fully exploit these accelerators. We introduce WaferLLM, the first wafer-scale LLM inference system. WaferLLM is guided by a novel PLMR model (pronounced "Plummer") that captures the unique hardware characteristics of wafer-scale architectures. Leveraging this model, WaferLLM pioneers wafer-scale LLM parallelism, optimizing the utilization of hundreds of thousands of on-chip cores. It also introduces MeshGEMM and MeshGEMV, the first GEMM and GEMV implementations designed to scale effectively on wafer-scale accelerators. Evaluations show that WaferLLM achieves 200× better wafer-scale accelerator utilization than state-of-the-art systems. On a commodity wafer-scale accelerator, WaferLLM delivers 606× faster and 22× more energy-efficient GEMV than an advanced GPU. For LLMs, using a 16-bit data type, WaferLLM achieves a decode speed of 2700 tokens/s per request on the Llama3-8B model and 840 tokens/s per request on the Qwen2-72B model, which enables 39× faster decoding with 1.7× better energy efficiency. We anticipate these numbers will grow significantly as wafer-scale AI models, software, and hardware continue to mature.
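For readers unfamiliar with how a GEMV can be spread over a mesh of cores with distributed memory, the following is a minimal illustrative sketch. It is a generic 2D block-partitioned matrix-vector product, not the MeshGEMV algorithm from the paper; the function name `mesh_gemv`, the grid parameters, and the sequential simulation of cores are all assumptions made for illustration.

```python
# Illustrative only: a generic 2D block-partitioned GEMV simulated sequentially.
# This is NOT the paper's MeshGEMV; all names here are hypothetical.

def mesh_gemv(matrix, vector, grid_rows, grid_cols):
    """Compute matrix @ vector by splitting work over a grid_rows x grid_cols grid.

    Each simulated core (r, c) owns one tile of the matrix and the matching
    slice of the vector, computes a partial result from local data only, and
    the partial results are then summed along each grid row.
    """
    n_rows, n_cols = len(matrix), len(vector)
    if n_rows % grid_rows or n_cols % grid_cols:
        raise ValueError("matrix dimensions must divide evenly across the grid")
    tile_h, tile_w = n_rows // grid_rows, n_cols // grid_cols

    result = [0.0] * n_rows
    for r in range(grid_rows):
        for c in range(grid_cols):
            row0, col0 = r * tile_h, c * tile_w
            # Local compute on core (r, c): tile times its vector slice.
            partial = [
                sum(matrix[row0 + i][col0 + j] * vector[col0 + j]
                    for j in range(tile_w))
                for i in range(tile_h)
            ]
            # Reduction along grid row r, modelled here as a plain accumulate.
            for i, value in enumerate(partial):
                result[row0 + i] += value
    return result
```

On real hardware the accumulate step would be a communication pattern between neighbouring cores, and its cost is the part that a mesh-aware design has to minimise; this sketch only shows the data partitioning.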
