NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

DOI: 10.1145/3620666.3651380
Publication Date: 2024-04-24
ABSTRACT
Modern transformer-based Large Language Models (LLMs) are constructed with a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched processing, QKV generation and feed-forward networks involve compute-intensive matrix-matrix multiplications (GEMM), while multi-head attention requires bandwidth-heavy matrix-vector multiplications (GEMV). Machine learning accelerators like TPUs or NPUs are proficient in handling GEMM but less efficient for GEMV computations. Conversely, Processing-in-Memory (PIM) technology is tailored for GEMV computation, but it lacks the computational power to handle GEMM effectively. Inspired by this insight, we propose NeuPIMs, a heterogeneous acceleration system that jointly exploits a conventional GEMM-focused NPU and GEMV-optimized PIM devices. The main challenge in efficiently integrating the two lies in enabling concurrent operations on both platforms, each addressing a specific kernel type. First, existing PIMs typically operate in a "blocked" mode, allowing only either the NPU or the PIM to be active at any given time. Second, the inherent dependencies between GEMM and GEMV kernels in LLMs restrict their parallel processing. To tackle these challenges, NeuPIMs is equipped with dual row buffers in each bank, facilitating the simultaneous handling of regular memory read/write commands and PIM computation. Further, it employs a runtime sub-batch interleaving technique to maximize concurrent execution, leveraging batch parallelism to allow two independent sub-batches to be pipelined within a single NeuPIMs device. Our evaluation demonstrates that, compared to GPU-only, NPU-only, and naïve NPU+PIM integrated approaches, NeuPIMs achieves 3×, 2.4×, and 1.6× throughput improvement, respectively.
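
The GEMM/GEMV split described in the abstract can be made concrete with a small sketch. The following Python/NumPy snippet uses assumed, illustrative shapes (it is not taken from the paper's artifact) to show why batched decoding turns QKV generation into a matrix-matrix product, while attention over each request's private KV cache remains a matrix-vector workload:

# Minimal sketch (NumPy, hypothetical shapes) of the batched decode step:
# QKV/FFN weights are shared across requests, so their inputs stack into a
# matrix (GEMM); each request attends over its own KV cache, so attention
# stays a per-request matrix-vector computation (GEMV).
import numpy as np

d_model, batch = 4096, 16                      # assumed model width and batch size
x = np.random.randn(batch, d_model)            # one new token per request (decode step)
W_qkv = np.random.randn(d_model, 3 * d_model)  # shared projection weights

# (1) QKV generation: all requests share the weights -> one GEMM.
qkv = x @ W_qkv                                # (batch, 3 * d_model)

# (2) Attention: each request has its own KV cache of a different length,
#     so the score and context computations are GEMVs per request.
for b in range(batch):
    seq_len = np.random.randint(128, 2048)     # per-request cache length (assumed)
    K = np.random.randn(seq_len, d_model)      # cached keys for this request
    V = np.random.randn(seq_len, d_model)      # cached values
    q = qkv[b, :d_model]                       # this request's query vector
    scores = K @ q                             # GEMV: (seq_len,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    context = V.T @ probs                      # GEMV: (d_model,)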
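The sub-batch interleaving idea can likewise be sketched as a toy phase schedule. The snippet below is a hypothetical illustration, not the paper's scheduler; the phase table and names are assumptions. The batch is split into two independent sub-batches so that, in steady state, one sub-batch keeps the NPU busy with GEMM stages while the other keeps the PIM busy with attention GEMV:

# Toy sketch of runtime sub-batch interleaving: each tuple lists the work
# issued to the NPU and the PIM in the same phase. In hardware the two
# entries of a phase run concurrently; here they are only printed together.
def decoder_block_phases(a="sub-batch A", b="sub-batch B"):
    return [
        ((a, "QKV generation"), None),              # pipeline fill: PIM idle
        ((b, "QKV generation"), (a, "attention")),  # both devices busy
        ((a, "feed-forward"),   (b, "attention")),  # both devices busy
        ((b, "feed-forward"),   None),              # pipeline drain: PIM idle
    ]

for npu_work, pim_work in decoder_block_phases():
    sub, stage = npu_work
    line = f"NPU runs {stage} for {sub}"
    if pim_work is not None:
        sub_p, stage_p = pim_work
        line += f" | PIM runs {stage_p} for {sub_p}"
    print(line)

Only the middle phases keep both devices busy; the fill and drain phases are the usual cost of pipelining and shrink relative to the total work as more decoder blocks are processed back to back.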