Jiajia Li

ORCID: 0000-0003-1270-4147
Research Areas
  • Parallel Computing and Optimization Techniques
  • Tensor decomposition and applications
  • Advanced Data Storage Technologies
  • Algorithms and Data Compression
  • Interconnection Networks and Systems
  • Distributed and Parallel Computing Systems
  • Graph Theory and Algorithms
  • Ferroelectric and Negative Capacitance Devices
  • Computational Physics and Python Applications
  • Advanced Graph Neural Networks
  • Advanced Neural Network Applications
  • Advanced Neuroimaging Techniques and Applications
  • VLSI and FPGA Design Techniques
  • Distributed systems and fault tolerance
  • Network Packet Processing and Optimization
  • Stochastic Gradient Optimization Techniques
  • Embedded Systems Design Techniques
  • Experimental Learning in Engineering
  • Melanoma and MAPK Pathways
  • Optimization and Search Problems
  • Quantum many-body systems
  • Reservoir Engineering and Simulation Methods
  • IoT and Edge/Fog Computing
  • Fire Detection and Safety Systems
  • Network Security and Intrusion Detection

North Carolina State University
2022-2025

China National Petroleum Corporation (China)
2024

National Institute of Standards and Technology
2023

William & Mary
2021-2022

Pacific Northwest National Laboratory
2018-2021

Battelle
2019

Georgia Institute of Technology
2015-2018

South China Normal University
2017

University of California, San Diego
2015

Institute of Computing Technology
2010-2013

Sparse Matrix Vector multiplication (SpMV) is an important kernel in both traditional high performance computing and emerging data-intensive applications. To date, SpMV libraries have been optimized by either application-specific or architecture-specific approaches, making them too complicated to be used extensively in real applications. In this work we develop a Sparse Matrix-vector Auto-Tuning system (SMAT) to bridge the gap between specific optimizations and general-purpose usage. SMAT provides users with a unified...

10.1145/2491956.2462181 article EN 2013-06-11
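
For readers unfamiliar with the kernel, here is a minimal sketch of SpMV in the CSR format, the kind of baseline such libraries specialize. The layout and names below are illustrative, not taken from SMAT.

```python
import numpy as np

def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x where A is stored in CSR (row_ptr, col_idx, vals)."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        # Accumulate the nonzeros of row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# 3x3 example: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = np.array([0, 2, 3, 5])
col_idx = np.array([0, 2, 1, 0, 2])
vals    = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
x       = np.array([1.0, 2.0, 3.0])
print(spmv_csr(row_ptr, col_idx, vals, x))   # [ 7.  4. 18.]
```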

High performance multi-GPU computing has become an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of a deep understanding of how modern GPUs can be connected, and of the real impact of state-of-the-art interconnect technology on application performance, has become a hurdle. In this paper, we fill this gap by conducting a thorough evaluation of five latest types of GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI...

10.1109/tpds.2019.2928289 article EN publisher-specific-oa IEEE Transactions on Parallel and Distributed Systems 2019-07-15

This work presents a systematic exploration of the promise and special challenges of deep learning for sparse matrix format selection, the problem of determining the best storage format to maximize the performance of Sparse Matrix Vector Multiplication (SpMV). It describes how to effectively bridge the gap between deep learning and the needs of this pillar HPC problem through a set of techniques on input representations, model structure, and cross-architecture model migrations. The new solution cuts format selection errors by two thirds and improves SpMV performance by 1.73X on average over the state of the art.

10.1145/3178487.3178495 article EN 2018-02-06
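
One plausible way to feed a sparse matrix to a neural classifier in this setting is to downsample its nonzero pattern into a fixed-size density map. The sketch below assumes that representation and a 32x32 resolution, both illustrative choices rather than the paper's exact pipeline.

```python
import numpy as np

def density_map(rows, cols, shape, res=32):
    """Downsample a sparse nonzero pattern (COO rows/cols) into a res x res
    density image that a small CNN classifier could consume."""
    img = np.zeros((res, res))
    r = (np.asarray(rows) * res // shape[0]).clip(0, res - 1)
    c = (np.asarray(cols) * res // shape[1]).clip(0, res - 1)
    for i, j in zip(r, c):
        img[i, j] += 1
    return img / max(len(rows), 1)   # normalize by nnz

# Toy example: a 1000 x 1000 matrix with nonzeros on the diagonal.
rows = cols = np.arange(1000)
print(density_map(rows, cols, (1000, 1000)).shape)   # (32, 32)
```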

This paper proposes a new storage format for sparse tensors, called Hierarchical COOrdinate (HiCOO; pronounced "haiku"). It derives from the coordinate (COO) format, arguably the de facto standard for general sparse tensor storage. HiCOO improves upon COO by compressing the indices in units of sparse tensor blocks, with the goals of preserving COO's "mode-agnostic" simplicity while reducing the bytes needed to represent the tensor and promoting data locality. We evaluate HiCOO by implementing a single-node, multicore-parallel version of the matricized...

10.1109/sc.2018.00022 article EN 2018-11-01
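
A minimal sketch of the blocking idea: split each COO index into a block index plus a small in-block offset so that the offsets fit in a byte. The block size and names are illustrative; this shows the compression principle, not HiCOO's exact layout.

```python
import numpy as np

def coo_to_blocks(coords, block_bits=7):
    """Group COO coordinates of a sparse tensor into blocks of 2^block_bits
    per mode: block indices are shared per block, in-block offsets are small."""
    coords = np.asarray(coords)                    # shape (nnz, nmodes)
    blk = coords >> block_bits                     # per-mode block index
    off = coords & ((1 << block_bits) - 1)         # per-mode offset, fits in 1 byte
    order = np.lexsort(blk.T[::-1])                # sort nonzeros by block
    blk, off = blk[order], off[order].astype(np.uint8)
    # Unique blocks plus a pointer into the offset arrays (a bptr-like array).
    ublk, bptr = np.unique(blk, axis=0, return_index=True)
    return ublk, np.sort(bptr), off

coords = [[0, 0, 1], [0, 3, 2], [200, 5, 9], [201, 4, 8]]
ublk, bptr, off = coo_to_blocks(coords)
print(ublk)   # two distinct 128x128x128 blocks
print(off)    # 8-bit in-block offsets
```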

This paper describes a novel framework, called InTensLi ("intensely"), for producing fast single-node implementations of dense tensor-times-matrix multiply (Ttm) of arbitrary dimension. Whereas conventional Ttm implementations rely on explicitly converting the input tensor operand into a matrix, in order to be able to use any available and fast general matrix-matrix multiply (Gemm) implementation, our framework's strategy is to carry out the Ttm in-place, avoiding this copy. As the resulting implementations expose tuning parameters, the framework also applies a heuristic empirical...

10.1145/2807591.2807671 article EN 2015-10-27
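
To make the two strategies concrete, the sketch below contrasts Ttm via explicit mode-n unfolding (the conventional copy-based route) with the same contraction expressed directly on the tensor operand. This is a NumPy illustration of the idea, not InTensLi's generated code.

```python
import numpy as np

def ttm_via_unfolding(X, U, mode):
    """Conventional Ttm: matricize X along `mode`, run one GEMM, fold back."""
    Xn = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)   # explicit copy
    Yn = U @ Xn
    new_shape = (U.shape[0],) + tuple(s for i, s in enumerate(X.shape) if i != mode)
    return np.moveaxis(Yn.reshape(new_shape), 0, mode)

def ttm_direct(X, U, mode):
    """The same contraction expressed directly on the tensor operand."""
    Y = np.tensordot(U, X, axes=([1], [mode]))   # contract U's columns with `mode`
    return np.moveaxis(Y, 0, mode)

X = np.random.rand(4, 5, 6)        # a 3-mode dense tensor
U = np.random.rand(8, 5)           # factor matrix for mode 1
print(np.allclose(ttm_via_unfolding(X, U, 1), ttm_direct(X, U, 1)))   # True
```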

In this paper, we present a methodology to understand GPU microarchitectural features and improve performance for compute-intensive kernels. The methodology relies on a reverse engineering approach to crack the ISA encodings in order to build a GPU assembler. An assembly microbenchmark suite correlates microarchitectural behaviors with their performance factors to uncover instruction-level and memory hierarchy preferences. We use SGEMM as a running example to show the ways to achieve bare-metal performance tuning. The performance boost is achieved by tuning FFMA throughput and activating dual-issue,...

10.1145/3018743.3018755 article EN 2017-01-26

Given an input tensor, its CANDECOMP/PARAFAC decomposition (or CPD) is a low-rank representation. CPDs are of particular interest in data analysis and mining, especially when the tensor is sparse and of higher order (dimension). This paper focuses on the central bottleneck of a CPD algorithm, which is evaluating a sequence of matricized tensor times Khatri-Rao products (MTTKRPs). To speed up the MTTKRP sequence, we propose a novel, adaptive memoization algorithm, AdaTM. Besides removing redundant computations within the sequence, it potentially reduces...

10.1109/ipdps.2017.80 article EN 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2017-05-01
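
For reference, here is a naive COO-based MTTKRP for one mode of a third-order tensor, the kernel that a CPD iteration evaluates once per mode; AdaTM's memoization of shared intermediates is not shown.

```python
import numpy as np

def mttkrp_coo(inds, vals, B, C, dim_i):
    """Mode-1 MTTKRP of a 3-mode sparse tensor in COO form:
    out[i, :] += val * (B[j, :] * C[k, :]) for every nonzero (i, j, k)."""
    rank = B.shape[1]
    out = np.zeros((dim_i, rank))
    for (i, j, k), v in zip(inds, vals):
        out[i] += v * B[j] * C[k]
    return out

# A 2x3x4 sparse tensor with three nonzeros and rank-2 factor matrices.
inds = [(0, 1, 2), (1, 0, 3), (1, 2, 0)]
vals = [1.0, 2.0, -1.0]
B, C = np.random.rand(3, 2), np.random.rand(4, 2)
print(mttkrp_coo(inds, vals, B, C, dim_i=2).shape)   # (2, 2)
```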

This paper proposes Gswitch, a pattern-based algorithmic auto-tuning system that dynamically switches between optimization variants with negligible overhead. Its novelty lies in a small set of patterns that allow for the configurable assembly of a graph algorithm. The fast transition of Gswitch is based on a machine learning model trained using 644 real graphs. Moreover, Gswitch provides a simple programming interface that conceals low-level tuning details from the user. We evaluate Gswitch on typical graph algorithms (BFS, CC, PR, SSSP, and...

10.1145/3293883.3295716 article EN 2019-02-05

Hypergraph partitioning finds practical applications in various fields, such as high-performance computing and VLSI circuit physical design, where solutions often demand substantial parallelism beyond what existing CPU-based partitioners can offer. While GPUs are promising in this regard, their potential for hypergraph partitioning remains unexplored. In this work, we first develop an end-to-end deterministic hypergraph partitioner on GPUs, ported from a state-of-the-art multi-threaded CPU implementation, and identify three major performance challenges by...

10.1145/3711925 article EN ACM Transactions on Architecture and Code Optimization 2025-01-10

Sparse matrix-vector semiring computation is a key operation in sparse matrix computations, with performance strongly dependent on both program design and the features of the input matrices. Given the diversity of matrices, designing a tailored program for each one is challenging. To address this, we propose SRSparse, an automatic program generator that creates programs by automatically combining optimization methods to fit a specific input. It provides two components: a problem definition configuration, which declares the computation, and a scheduling language that can be...

10.1145/3722114 article EN ACM Transactions on Architecture and Code Optimization 2025-03-07
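
A minimal sketch of what "semiring SpMV" means: the same CSR traversal with the scalar add and multiply swapped for semiring operators. The candidate semirings and names are illustrative, not SRSparse's generated programs.

```python
import numpy as np

def spmv_semiring(row_ptr, col_idx, vals, x, add, mul, identity):
    """SpMV generalized over a semiring (add, mul, identity).
    add=+, mul=* recovers ordinary SpMV; add=min, mul=+ gives one
    relaxation step of single-source shortest paths."""
    y = np.full(len(row_ptr) - 1, identity, dtype=float)
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] = add(y[i], mul(vals[k], x[col_idx[k]]))
    return y

row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [4.0, 1.0, 2.0, 3.0, 5.0]
x       = [1.0, 2.0, 3.0]
print(spmv_semiring(row_ptr, col_idx, vals, x,
                    lambda a, b: a + b, lambda a, b: a * b, 0.0))   # [ 7.  4. 18.]
print(spmv_semiring(row_ptr, col_idx, vals, x,
                    min, lambda a, b: a + b, np.inf))               # [4. 4. 4.]
```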

Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the most computationally expensive kernels in sparse tensor computations. This work focuses on optimizing the MTTKRP operation on GPUs, addressing both performance and storage requirements. We begin by identifying the bottlenecks of directly extending the state-of-the-art CSF (compressed sparse fiber) format from CPUs to GPUs. A significant challenge with GPUs compared to multicore CPUs is that utilizing a much greater degree of parallelism in a load-balanced fashion for...

10.1109/ipdps.2019.00023 article EN 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2019-05-01

Sparse matrix vector multiplication (SpMV) is an important computational kernel in traditional high-performance computing and emerging data-intensive applications. Previous SpMV libraries are optimized by either application-specific or architecture-specific approaches but present difficulties for use in real applications. In this work, we develop an auto-tuning system (SMATER) to bridge the gap between specific optimizations and general-purpose use. SMATER provides programmers with a unified interface based on...

10.1145/3218823 article EN ACM Transactions on Mathematical Software 2018-08-09
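
The core decision such an auto-tuner makes can be illustrated by timing a few candidate formats for the given matrix and keeping the fastest. The candidate set below (SciPy's COO/CSR/CSC) is an assumption for illustration, not SMATER's search space.

```python
import time
import numpy as np
import scipy.sparse as sp

def autotune_spmv(A, x, trials=10):
    """Time y = A @ x in several storage formats and keep the fastest one."""
    candidates = {"coo": A.tocoo(), "csr": A.tocsr(), "csc": A.tocsc()}
    best, best_t = None, float("inf")
    for name, M in candidates.items():
        t0 = time.perf_counter()
        for _ in range(trials):
            M @ x
        t = (time.perf_counter() - t0) / trials
        if t < best_t:
            best, best_t = name, t
    return best, best_t

A = sp.random(5000, 5000, density=1e-3, format="coo", random_state=0)
x = np.ones(5000)
print(autotune_spmv(A, x))   # e.g. ('csr', ...) depending on the machine
```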

The Compressed Sparse Fiber (CSF) representation for sparse tensors is a generalization of the Compressed Sparse Row (CSR) format for matrices. For a tensor with d modes, typical methods such as CANDECOMP/PARAFAC decomposition (CPD) require a sequence of computations, where efficient memory access with respect to a different mode is required for each of them. A straightforward solution is to use d distinct CSF representations of the tensor, one per mode being computed. However, the d-fold space overhead is often unacceptable in practice, especially in memory-constrained...

10.1145/3295500.3356216 article EN 2019-11-07
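
A small sketch of the CSF structure itself, assuming a sorted nonzero list: each level stores one index per distinct coordinate prefix plus a pointer array into the next level, much as CSR stores row pointers over column indices. This is a simplified illustration, not a library's exact layout.

```python
def build_csf(coords, vals):
    """Compressed Sparse Fiber for a d-mode sparse tensor: level m stores the
    mode-m index of every distinct length-(m+1) coordinate prefix (fids) and a
    pointer array into level m+1 (fptr)."""
    order = sorted(range(len(vals)), key=lambda t: coords[t])
    coords = [coords[t] for t in order]
    vals = [vals[t] for t in order]
    d = len(coords[0])
    # Distinct prefixes of each length, kept in sorted order.
    prefixes = [sorted({c[:m + 1] for c in coords}) for m in range(d)]
    fids = [[p[-1] for p in prefixes[m]] for m in range(d)]
    fptr = []
    for m in range(d - 1):
        ptr, child = [0], 0
        for p in prefixes[m]:
            # Children of fiber p are the contiguous run of prefixes it extends.
            while child < len(prefixes[m + 1]) and prefixes[m + 1][child][:m + 1] == p:
                child += 1
            ptr.append(child)
        fptr.append(ptr)
    return fids, fptr, vals

coords = [(0, 0, 1), (0, 0, 3), (0, 2, 0), (1, 1, 2)]
vals = [1.0, 2.0, 3.0, 4.0]
fids, fptr, vals = build_csf(coords, vals)
print(fids)   # [[0, 1], [0, 2, 1], [1, 3, 0, 2]]
print(fptr)   # [[0, 2, 3], [0, 2, 3, 4]]
```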

This paper formalizes the problem of reordering a sparse tensor to improve the spatial and temporal locality of operations with it, and proposes two reordering algorithms for this problem, which we call BFS-MCS and Lexi-Order. The BFS-MCS method is a Breadth-First Search (BFS)-like heuristic approach based on the maximum cardinality search family; Lexi-Order is an extension of doubly lexical ordering of matrices to tensors. We show the effects of these schemes within the context of a widely used tensor computation, CANDECOMP/PARAFAC decomposition (CPD), when...

10.1145/3330345.3330366 preprint EN 2019-06-18

Sparse Matrix-Vector multiplication (SpMV) is an essential computational kernel in many application scenarios. Tens of sparse matrix formats and implementations have been proposed to compress the memory storage and speed up SpMV performance. We develop AlphaSparse, a superset of all existing works that goes beyond the scope of human-designed format(s) and implementation(s). AlphaSparse automatically creates novel machine-designed formats and implementations entirely from the knowledge of input sparsity patterns and hardware architectures. Based...

10.1109/sc41404.2022.00071 article EN 2022-11-01

10.1016/j.jpdc.2018.07.018 article EN publisher-specific-oa Journal of Parallel and Distributed Computing 2018-08-06

As combinations of signoff corners grow in modern SoCs, the minimization of clock skew variation across corners is important. Large skew variations can cause difficulties in multi-corner timing closure, because fixing violations at one corner can lead to violations at other corners. Such "ping-pong" effects cause significant power and area overheads as well as longer time to signoff. We propose a novel framework encompassing both global and local clock network optimizations to minimize the sum of skew variations across different PVT corners between all sequentially adjacent sink pairs. The optimization...

10.1145/2744769.2744776 article EN 2015-06-02

This work presents a systematic exploration of the promise and special challenges of deep learning for sparse matrix format selection, the problem of determining the best storage format to maximize the performance of Sparse Matrix Vector Multiplication (SpMV). It describes how to effectively bridge the gap between deep learning and the needs of this pillar HPC problem through a set of techniques on input representations, model structure, and cross-architecture model migrations. The new solution cuts format selection errors by two thirds and improves SpMV performance by 1.73X on average over the state of the art.

10.1145/3200691.3178495 article EN ACM SIGPLAN Notices 2018-02-10

This paper presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms. This primitive is a critical bottleneck in data analysis and mining applications based on tensor methods, such as Tucker decomposition. We first implement a sequential SpTTM that avoids the explicit transformations between tensor and matrix used by the conventional approach. We further optimize SpTTM on multicore systems by parallelizing, avoiding locks, and exploiting data locality. Our implementation is up to 3.5× faster than the one from...

10.5555/3018843.3018848 article EN Irregular Applications: Architectures and Algorithms 2016-11-13
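
A minimal sketch of the operation for the last mode of a third-order COO tensor: each output fiber is dense along the product mode while the other modes stay sparse. The hash-map accumulation is illustrative, not the paper's optimized kernel.

```python
import numpy as np
from collections import defaultdict

def spttm_mode2(inds, vals, U):
    """Sparse tensor-times-dense-matrix along the last mode of a 3-mode tensor:
    Y[i, j, :] += val * U[k, :] for every nonzero (i, j, k). The output is
    semi-sparse: sparse in modes 0 and 1, dense along the product mode."""
    rank = U.shape[1]
    fibers = defaultdict(lambda: np.zeros(rank))
    for (i, j, k), v in zip(inds, vals):
        fibers[(i, j)] += v * U[k]
    return fibers

inds = [(0, 1, 2), (0, 1, 3), (2, 0, 1)]
vals = [1.0, 2.0, 3.0]
U = np.random.rand(4, 5)            # mode-2 dimension 4, rank 5
Y = spttm_mode2(inds, vals, U)
print(len(Y), Y[(0, 1)].shape)      # 2 dense output fibers of length 5
```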

Sparse tensor algebra is widely used in many applications, including scientific computing, machine learning, and data analytics. The performance of sparse kernels strongly depends on the intrinsic characteristics of the input tensors; hence, many storage formats have been designed for sparse tensors to achieve optimal performance for particular applications/architectures, which makes it challenging to implement and optimize every operation of interest on a given architecture. We propose a domain-specific language (DSL) and compiler framework...

10.1109/llvmhpc54804.2021.00009 article EN 2021-11-01

Sparse Matrix Vector multiplication (SpMV) is an important kernel in both traditional high performance computing and emerging data-intensive applications. To date, SpMV libraries have been optimized by either application-specific or architecture-specific approaches, making them too complicated to be used extensively in real applications. In this work we develop a Sparse Matrix-vector Auto-Tuning system (SMAT) to bridge the gap between specific optimizations and general-purpose usage. SMAT provides users with a unified...

10.1145/2499370.2462181 article EN ACM SIGPLAN Notices 2013-06-16

This paper presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms. This primitive is a critical bottleneck in data analysis and mining applications based on tensor methods, such as Tucker decomposition. We first implement a sequential SpTTM that avoids the explicit transformations between tensor and matrix used by the conventional approach. We further optimize SpTTM on multicore systems by parallelizing, avoiding locks, and exploiting data locality. Our implementation is up to 3.5× faster than the one from...

10.1109/ia3.2016.010 article EN 2016-11-01

Sparse tensor contractions appear commonly in many applications. Efficiently computing the product of two sparse tensors is challenging: it not only inherits the challenges of common sparse matrix-matrix multiplication (SpGEMM), i.e., indirect memory accesses and unknown output size before computation, but also raises new challenges because of the high dimensionality of tensors, the expensive multi-dimensional index search, and the massive intermediate data. To address the above challenges, we introduce three optimization techniques by using...

10.1145/3437801.3441581 article EN 2021-02-17
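
A small sketch of a hash-map strategy for a single-mode contraction of two third-order sparse tensors, which sidesteps the unknown-output-size problem; the dict-of-tuples representation is an illustrative assumption, not the paper's data structure.

```python
from collections import defaultdict

def sparse_contract(A, B):
    """Contract two sparse 3-mode tensors over their last/first mode:
    C[i, j, m, n] = sum_k A[i, j, k] * B[k, m, n].
    A and B are dicts mapping index tuples to values; the output is built
    with a hash map, so its size need not be known in advance."""
    # Index B's nonzeros by the contraction mode k for fast lookup.
    b_by_k = defaultdict(list)
    for (k, m, n), v in B.items():
        b_by_k[k].append(((m, n), v))
    C = defaultdict(float)
    for (i, j, k), a in A.items():
        for (m, n), b in b_by_k.get(k, []):
            C[(i, j, m, n)] += a * b
    return dict(C)

A = {(0, 0, 1): 2.0, (0, 1, 2): 1.0}
B = {(1, 0, 0): 3.0, (2, 1, 1): 4.0}
print(sparse_contract(A, B))   # {(0, 0, 0, 0): 6.0, (0, 1, 1, 1): 4.0}
```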