NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

DOI: 10.1145/3620666.3651380
Publication Date: 2024-04-24
ABSTRACT
Modern transformer-based Large Language Models (LLMs) are constructed with a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched processing, QKV generation and feed-forward networks involve compute-intensive matrix-matrix multiplications (GEMM), while multi-head attention requires bandwidth-heavy matrix-vector multiplications (GEMV). Machine learning accelerators like TPUs or NPUs are proficient in handling GEMM but less efficient for GEMV computations. Conversely, Processing-in-Memory (PIM) technology is tailored for GEMV computation, but it lacks the computational power to handle GEMM effectively. Inspired by this insight, we propose NeuPIMs, a heterogeneous acceleration system that jointly exploits a conventional GEMM-focused NPU and GEMV-optimized PIM devices. The main challenge in efficiently integrating the two lies in enabling concurrent operations on both platforms, each addressing a specific kernel type. First, existing PIMs typically operate in a "blocked" mode, allowing only either the NPU or the PIM to be active at any given time. Second, the inherent dependencies between GEMM and GEMV kernels in LLMs restrict their parallel processing. To tackle these challenges, NeuPIMs is equipped with dual row buffers in each bank, facilitating the simultaneous handling of regular memory read/write commands and PIM computation. Further, it employs a runtime sub-batch interleaving technique to maximize concurrent execution, leveraging batch parallelism to allow two independent sub-batches to be pipelined within a single NeuPIMs device. Our evaluation demonstrates that, compared to GPU-only, NPU-only, and naïve NPU+PIM integrated approaches, NeuPIMs achieves 3×, 2.4×, and 1.6× throughput improvement, respectively.
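
The GEMM/GEMV split described in the abstract can be made concrete with a small sketch. The following Python/NumPy snippet uses assumed, illustrative shapes (it is not taken from the paper's artifact) to show why batched decoding turns QKV generation into a matrix-matrix product, while attention over each request's private KV cache remains a matrix-vector workload:

# Minimal sketch (NumPy, hypothetical shapes) of the batched decode step:
# QKV/FFN weights are shared across requests, so their inputs stack into a
# matrix (GEMM); each request attends over its own KV cache, so attention
# stays a per-request matrix-vector computation (GEMV).
import numpy as np

d_model, batch = 4096, 16                      # assumed model width and batch size
x = np.random.randn(batch, d_model)            # one new token per request (decode step)
W_qkv = np.random.randn(d_model, 3 * d_model)  # shared projection weights

# (1) QKV generation: all requests share the weights -> one GEMM.
qkv = x @ W_qkv                                # (batch, 3 * d_model)

# (2) Attention: each request has its own KV cache of a different length,
#     so the score and context computations are GEMVs per request.
for b in range(batch):
    seq_len = np.random.randint(128, 2048)     # per-request cache length (assumed)
    K = np.random.randn(seq_len, d_model)      # cached keys for this request
    V = np.random.randn(seq_len, d_model)      # cached values
    q = qkv[b, :d_model]                       # this request's query vector
    scores = K @ q                             # GEMV: (seq_len,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    context = V.T @ probs                      # GEMV: (d_model,)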
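The sub-batch interleaving idea can likewise be sketched as a toy phase schedule. The snippet below is a hypothetical illustration, not the paper's scheduler; the phase table and names are assumptions. The batch is split into two independent sub-batches so that, in steady state, one sub-batch keeps the NPU busy with GEMM stages while the other keeps the PIM busy with attention GEMV:

# Toy sketch of runtime sub-batch interleaving: each tuple lists the work
# issued to the NPU and the PIM in the same phase. In hardware the two
# entries of a phase run concurrently; here they are only printed together.
def decoder_block_phases(a="sub-batch A", b="sub-batch B"):
    return [
        ((a, "QKV generation"), None),              # pipeline fill: PIM idle
        ((b, "QKV generation"), (a, "attention")),  # both devices busy
        ((a, "feed-forward"),   (b, "attention")),  # both devices busy
        ((b, "feed-forward"),   None),              # pipeline drain: PIM idle
    ]

for npu_work, pim_work in decoder_block_phases():
    sub, stage = npu_work
    line = f"NPU runs {stage} for {sub}"
    if pim_work is not None:
        sub_p, stage_p = pim_work
        line += f" | PIM runs {stage_p} for {sub_p}"
    print(line)

Only the middle phases keep both devices busy; the fill and drain phases are the usual cost of pipelining and shrink relative to the total work as more decoder blocks are processed back to back.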