Jiajia Li

ORCID: 0000-0003-1270-4147
Research Areas
  • Parallel Computing and Optimization Techniques
  • Tensor decomposition and applications
  • Advanced Data Storage Technologies
  • Algorithms and Data Compression
  • Interconnection Networks and Systems
  • Distributed and Parallel Computing Systems
  • Graph Theory and Algorithms
  • Ferroelectric and Negative Capacitance Devices
  • Computational Physics and Python Applications
  • Advanced Graph Neural Networks
  • Advanced Neural Network Applications
  • Advanced Neuroimaging Techniques and Applications
  • VLSI and FPGA Design Techniques
  • Distributed systems and fault tolerance
  • Network Packet Processing and Optimization
  • Stochastic Gradient Optimization Techniques
  • Embedded Systems Design Techniques
  • Experimental Learning in Engineering
  • Melanoma and MAPK Pathways
  • Optimization and Search Problems
  • Quantum many-body systems
  • Reservoir Engineering and Simulation Methods
  • IoT and Edge/Fog Computing
  • Fire Detection and Safety Systems
  • Network Security and Intrusion Detection

North Carolina State University
2022-2025

China National Petroleum Corporation (China)
2024

National Institute of Standards and Technology
2023

William & Mary
2021-2022

Pacific Northwest National Laboratory
2018-2021

Battelle
2019

Georgia Institute of Technology
2015-2018

South China Normal University
2017

University of California, San Diego
2015

Institute of Computing Technology
2010-2013

Sparse Matrix Vector multiplication (SpMV) is an important kernel in both traditional high performance computing and emerging data-intensive applications. To date, SpMV libraries have been optimized by either application-specific or architecture-specific approaches, making them too complicated to be used extensively in real applications. In this work we develop a Sparse Matrix-vector Auto-Tuning system (SMAT) to bridge the gap between specific optimizations and general-purpose usage. SMAT provides users with a unified...

10.1145/2491956.2462181 article EN 2013-06-11
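
For readers unfamiliar with the kernel, here is a minimal sketch of SpMV in the CSR format, the kind of baseline such libraries specialize. The layout and names below are illustrative, not taken from SMAT.

```python
import numpy as np

def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x where A is stored in CSR (row_ptr, col_idx, vals)."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        # Accumulate the nonzeros of row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# 3x3 example: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = np.array([0, 2, 3, 5])
col_idx = np.array([0, 2, 1, 0, 2])
vals    = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
x       = np.array([1.0, 2.0, 3.0])
print(spmv_csr(row_ptr, col_idx, vals, x))   # [ 7.  4. 18.]
```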

High performance multi-GPU computing has become an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of a deep understanding of how modern GPUs can be connected, and of the real impact of state-of-the-art interconnect technology on application performance, has become a hurdle. In this paper, we fill this gap by conducting a thorough evaluation of five latest types of GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI...

10.1109/tpds.2019.2928289 article EN publisher-specific-oa IEEE Transactions on Parallel and Distributed Systems 2019-07-15

This work presents a systematic exploration of the promise and special challenges of deep learning for sparse matrix format selection, the problem of determining the best storage format to maximize the performance of Sparse Matrix Vector Multiplication (SpMV). It describes how to effectively bridge the gap between deep learning and the needs of this pillar HPC problem through a set of techniques on input representations, model structure, and cross-architecture model migrations. The new solution cuts format selection errors by two thirds and improves SpMV performance by 1.73X on average over the state of the art.

10.1145/3178487.3178495 article EN 2018-02-06
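
One plausible way to feed a sparse matrix to a neural classifier in this setting is to downsample its nonzero pattern into a fixed-size density map. The sketch below assumes that representation and a 32x32 resolution, both illustrative choices rather than the paper's exact pipeline.

```python
import numpy as np

def density_map(rows, cols, shape, res=32):
    """Downsample a sparse nonzero pattern (COO rows/cols) into a res x res
    density image that a small CNN classifier could consume."""
    img = np.zeros((res, res))
    r = (np.asarray(rows) * res // shape[0]).clip(0, res - 1)
    c = (np.asarray(cols) * res // shape[1]).clip(0, res - 1)
    for i, j in zip(r, c):
        img[i, j] += 1
    return img / max(len(rows), 1)   # normalize by nnz

# Toy example: a 1000 x 1000 matrix with nonzeros on the diagonal.
rows = cols = np.arange(1000)
print(density_map(rows, cols, (1000, 1000)).shape)   # (32, 32)
```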

This paper proposes a new storage format for sparse tensors, called Hierarchical COOrdinate (HiCOO; pronounced "haiku"). It derives from the coordinate (COO) format, arguably the de facto standard for general sparse tensor storage. HiCOO improves upon COO by compressing the indices in units of sparse tensor blocks, with the goals of preserving COO's "mode-agnostic" simplicity while reducing the bytes needed to represent the tensor and promoting data locality. We evaluate HiCOO by implementing a single-node, multicore-parallel version of the matricized...

10.1109/sc.2018.00022 article EN 2018-11-01
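
A minimal sketch of the blocking idea: split each COO index into a block index plus a small in-block offset so that the offsets fit in a byte. The block size and names are illustrative; this shows the compression principle, not HiCOO's exact layout.

```python
import numpy as np

def coo_to_blocks(coords, block_bits=7):
    """Group COO coordinates of a sparse tensor into blocks of 2^block_bits
    per mode: block indices are shared per block, in-block offsets are small."""
    coords = np.asarray(coords)                    # shape (nnz, nmodes)
    blk = coords >> block_bits                     # per-mode block index
    off = coords & ((1 << block_bits) - 1)         # per-mode offset, fits in 1 byte
    order = np.lexsort(blk.T[::-1])                # sort nonzeros by block
    blk, off = blk[order], off[order].astype(np.uint8)
    # Unique blocks plus a pointer into the offset arrays (a bptr-like array).
    ublk, bptr = np.unique(blk, axis=0, return_index=True)
    return ublk, np.sort(bptr), off

coords = [[0, 0, 1], [0, 3, 2], [200, 5, 9], [201, 4, 8]]
ublk, bptr, off = coo_to_blocks(coords)
print(ublk)   # two distinct 128x128x128 blocks
print(off)    # 8-bit in-block offsets
```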

This paper describes a novel framework, called InTensLi ("intensely"), for producing fast single-node implementations of dense tensor-times-matrix multiply (Ttm) of arbitrary dimension. Whereas conventional Ttm implementations rely on explicitly converting the input tensor operand into a matrix, in order to be able to use any available and fast general matrix-matrix multiply (Gemm) implementation, our framework's strategy is to carry out the Ttm in-place, avoiding this copy. As the resulting implementations expose tuning parameters, the framework also applies a heuristic empirical...

10.1145/2807591.2807671 article EN 2015-10-27
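
To make the two strategies concrete, the sketch below contrasts Ttm via explicit mode-n unfolding (the conventional copy-based route) with the same contraction expressed directly on the tensor operand. This is a NumPy illustration of the idea, not InTensLi's generated code.

```python
import numpy as np

def ttm_via_unfolding(X, U, mode):
    """Conventional Ttm: matricize X along `mode`, run one GEMM, fold back."""
    Xn = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)   # explicit copy
    Yn = U @ Xn
    new_shape = (U.shape[0],) + tuple(s for i, s in enumerate(X.shape) if i != mode)
    return np.moveaxis(Yn.reshape(new_shape), 0, mode)

def ttm_direct(X, U, mode):
    """The same contraction expressed directly on the tensor operand."""
    Y = np.tensordot(U, X, axes=([1], [mode]))   # contract U's columns with `mode`
    return np.moveaxis(Y, 0, mode)

X = np.random.rand(4, 5, 6)        # a 3-mode dense tensor
U = np.random.rand(8, 5)           # factor matrix for mode 1
print(np.allclose(ttm_via_unfolding(X, U, 1), ttm_direct(X, U, 1)))   # True
```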

In this paper, we present a methodology to understand GPU microarchitectural features and improve performance for compute-intensive kernels. The methodology relies on a reverse engineering approach to crack the ISA encodings in order to build a GPU assembler. An assembly microbenchmark suite correlates microarchitectural behaviors with their performance factors to uncover instruction-level and memory hierarchy preferences. We use SGEMM as a running example to show the ways to achieve bare-metal performance tuning. The performance boost is achieved by tuning FFMA throughput and activating dual-issue,...

10.1145/3018743.3018755 article EN 2017-01-26

Given an input tensor, its CANDECOMP/PARAFAC decomposition (or CPD) is a low-rank representation. CPDs are of particular interest in data analysis and mining, especially when the tensor is sparse and of higher order (dimension). This paper focuses on the central bottleneck of a CPD algorithm, which is evaluating a sequence of matricized tensor times Khatri-Rao products (MTTKRPs). To speed up the MTTKRP sequence, we propose a novel, adaptive memoization algorithm, AdaTM. Besides removing redundant computations within the sequence, it potentially reduces...

10.1109/ipdps.2017.80 article EN 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2017-05-01
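
For reference, here is a naive COO-based MTTKRP for one mode of a third-order tensor, the kernel that a CPD iteration evaluates once per mode; AdaTM's memoization of shared intermediates is not shown.

```python
import numpy as np

def mttkrp_coo(inds, vals, B, C, dim_i):
    """Mode-1 MTTKRP of a 3-mode sparse tensor in COO form:
    out[i, :] += val * (B[j, :] * C[k, :]) for every nonzero (i, j, k)."""
    rank = B.shape[1]
    out = np.zeros((dim_i, rank))
    for (i, j, k), v in zip(inds, vals):
        out[i] += v * B[j] * C[k]
    return out

# A 2x3x4 sparse tensor with three nonzeros and rank-2 factor matrices.
inds = [(0, 1, 2), (1, 0, 3), (1, 2, 0)]
vals = [1.0, 2.0, -1.0]
B, C = np.random.rand(3, 2), np.random.rand(4, 2)
print(mttkrp_coo(inds, vals, B, C, dim_i=2).shape)   # (2, 2)
```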

This paper proposes Gswitch, a pattern-based algorithmic auto-tuning system that dynamically switches between optimization variants with negligible overhead. Its novelty lies in a small set of patterns that allow for the configurable assembly of a graph algorithm. The fast transition of Gswitch is based on a machine learning model trained using 644 real graphs. Moreover, Gswitch provides a simple programming interface that conceals low-level tuning details from the user. We evaluate Gswitch on typical graph algorithms (BFS, CC, PR, SSSP, and...

10.1145/3293883.3295716 article EN 2019-02-05

Hypergraph partitioning finds practical applications in various fields, such as high-performance computing and VLSI circuit physical design, where solutions often demand substantial parallelism beyond what existing CPU-based partitioners can offer. While GPUs are promising in this regard, their potential for hypergraph partitioning remains unexplored. In this work, we first develop an end-to-end deterministic hypergraph partitioner on GPUs, ported from a state-of-the-art multi-threaded CPU implementation, and identify three major performance challenges by...

10.1145/3711925 article EN ACM Transactions on Architecture and Code Optimization 2025-01-10

Sparse matrix-vector semiring computation is a key operation in sparse matrix computations, with performance strongly dependent on both program design and the features of the input matrices. Given the diversity of matrices, designing a tailored program for each one is challenging. To address this, we propose SRSparse, an automatic program generator that creates programs by automatically combining optimization methods to fit a specific input. It provides two components: a problem definition configuration, which declares the computation, and a scheduling language that can be...

10.1145/3722114 article EN ACM Transactions on Architecture and Code Optimization 2025-03-07
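
A minimal sketch of what "semiring SpMV" means: the same CSR traversal with the scalar add and multiply swapped for semiring operators. The candidate semirings and names are illustrative, not SRSparse's generated programs.

```python
import numpy as np

def spmv_semiring(row_ptr, col_idx, vals, x, add, mul, identity):
    """SpMV generalized over a semiring (add, mul, identity).
    add=+, mul=* recovers ordinary SpMV; add=min, mul=+ gives one
    relaxation step of single-source shortest paths."""
    y = np.full(len(row_ptr) - 1, identity, dtype=float)
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] = add(y[i], mul(vals[k], x[col_idx[k]]))
    return y

row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [4.0, 1.0, 2.0, 3.0, 5.0]
x       = [1.0, 2.0, 3.0]
print(spmv_semiring(row_ptr, col_idx, vals, x,
                    lambda a, b: a + b, lambda a, b: a * b, 0.0))   # [ 7.  4. 18.]
print(spmv_semiring(row_ptr, col_idx, vals, x,
                    min, lambda a, b: a + b, np.inf))               # [4. 4. 4.]
```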

Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the most computationally expensive kernels in sparse tensor computations. This work focuses on optimizing the MTTKRP operation on GPUs, addressing both performance and storage requirements. We begin by identifying the bottlenecks of directly extending the state-of-the-art CSF (compressed sparse fiber) format from CPUs to GPUs. A significant challenge with GPUs compared to multicore CPUs is that utilizing a much greater degree of parallelism in a load-balanced fashion for...

10.1109/ipdps.2019.00023 article EN 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2019-05-01

Sparse matrix vector multiplication (SpMV) is an important computational kernel in traditional high-performance computing and emerging data-intensive applications. Previous SpMV libraries are optimized by either application-specific or architecture-specific approaches but present difficulties for use in real applications. In this work, we develop an auto-tuning system (SMATER) to bridge the gap between specific optimizations and general-purpose use. SMATER provides programmers with a unified interface based on...

10.1145/3218823 article EN ACM Transactions on Mathematical Software 2018-08-09
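
The core decision such an auto-tuner makes can be illustrated by timing a few candidate formats for the given matrix and keeping the fastest. The candidate set below (SciPy's COO/CSR/CSC) is an assumption for illustration, not SMATER's search space.

```python
import time
import numpy as np
import scipy.sparse as sp

def autotune_spmv(A, x, trials=10):
    """Time y = A @ x in several storage formats and keep the fastest one."""
    candidates = {"coo": A.tocoo(), "csr": A.tocsr(), "csc": A.tocsc()}
    best, best_t = None, float("inf")
    for name, M in candidates.items():
        t0 = time.perf_counter()
        for _ in range(trials):
            M @ x
        t = (time.perf_counter() - t0) / trials
        if t < best_t:
            best, best_t = name, t
    return best, best_t

A = sp.random(5000, 5000, density=1e-3, format="coo", random_state=0)
x = np.ones(5000)
print(autotune_spmv(A, x))   # e.g. ('csr', ...) depending on the machine
```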

The Compressed Sparse Fiber (CSF) representation for sparse tensors is a generalization of the Compressed Sparse Row (CSR) format for matrices. For a tensor with d modes, typical methods such as CANDECOMP/PARAFAC decomposition (CPD) require a sequence of computations, where efficient memory access with respect to a different mode is required for each of them. A straightforward solution is to use d distinct CSF representations of the tensor, one per mode being computed. However, the d-fold space overhead is often unacceptable in practice, especially in memory-constrained...

10.1145/3295500.3356216 article EN 2019-11-07
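
A small sketch of the CSF structure itself, assuming a sorted nonzero list: each level stores one index per distinct coordinate prefix plus a pointer array into the next level, much as CSR stores row pointers over column indices. This is a simplified illustration, not a library's exact layout.

```python
def build_csf(coords, vals):
    """Compressed Sparse Fiber for a d-mode sparse tensor: level m stores the
    mode-m index of every distinct length-(m+1) coordinate prefix (fids) and a
    pointer array into level m+1 (fptr)."""
    order = sorted(range(len(vals)), key=lambda t: coords[t])
    coords = [coords[t] for t in order]
    vals = [vals[t] for t in order]
    d = len(coords[0])
    # Distinct prefixes of each length, kept in sorted order.
    prefixes = [sorted({c[:m + 1] for c in coords}) for m in range(d)]
    fids = [[p[-1] for p in prefixes[m]] for m in range(d)]
    fptr = []
    for m in range(d - 1):
        ptr, child = [0], 0
        for p in prefixes[m]:
            # Children of fiber p are the contiguous run of prefixes it extends.
            while child < len(prefixes[m + 1]) and prefixes[m + 1][child][:m + 1] == p:
                child += 1
            ptr.append(child)
        fptr.append(ptr)
    return fids, fptr, vals

coords = [(0, 0, 1), (0, 0, 3), (0, 2, 0), (1, 1, 2)]
vals = [1.0, 2.0, 3.0, 4.0]
fids, fptr, vals = build_csf(coords, vals)
print(fids)   # [[0, 1], [0, 2, 1], [1, 3, 0, 2]]
print(fptr)   # [[0, 2, 3], [0, 2, 3, 4]]
```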

This paper formalizes the problem of reordering a sparse tensor to improve the spatial and temporal locality of operations with it, and proposes two reordering algorithms for this problem, which we call BFS-MCS and Lexi-Order. The BFS-MCS method is a Breadth-First Search (BFS)-like heuristic approach based on the maximum cardinality search family; Lexi-Order is an extension of doubly lexical ordering of matrices to tensors. We show the effects of these schemes within the context of a widely used tensor computation, CANDECOMP/PARAFAC decomposition (CPD), when...

10.1145/3330345.3330366 preprint EN 2019-06-18

Sparse Matrix-Vector multiplication (SpMV) is an essential computational kernel in many application scenarios. Tens of sparse matrix formats and implementations have been proposed to compress the memory storage and speed up SpMV performance. We develop AlphaSparse, a superset of all existing works that goes beyond the scope of human-designed format(s) and implementation(s). AlphaSparse automatically creates novel machine-designed formats and implementations entirely from the knowledge of input sparsity patterns and hardware architectures. Based...

10.1109/sc41404.2022.00071 article EN 2022-11-01

10.1016/j.jpdc.2018.07.018 article EN publisher-specific-oa Journal of Parallel and Distributed Computing 2018-08-06

As combinations of signoff corners grow in modern SoCs, the minimization of clock skew variation across corners is important. Large skew variations can cause difficulties in multi-corner timing closure, because fixing violations at one corner can lead to violations at other corners. Such "ping-pong" effects cause significant power and area overheads as well as longer time to signoff. We propose a novel framework encompassing both global and local clock network optimizations to minimize the sum of skew variations across different PVT corners between all sequentially adjacent sink pairs. The optimization...

10.1145/2744769.2744776 article EN 2015-06-02

This work presents a systematic exploration of the promise and special challenges of deep learning for sparse matrix format selection, the problem of determining the best storage format to maximize the performance of Sparse Matrix Vector Multiplication (SpMV). It describes how to effectively bridge the gap between deep learning and the needs of this pillar HPC problem through a set of techniques on input representations, model structure, and cross-architecture model migrations. The new solution cuts format selection errors by two thirds and improves SpMV performance by 1.73X on average over the state of the art.

10.1145/3200691.3178495 article EN ACM SIGPLAN Notices 2018-02-10

This paper presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms. This primitive is a critical bottleneck in data analysis and mining applications based on tensor methods, such as Tucker decomposition. We first implement a sequential SpTTM that avoids the explicit transformations between tensor and matrix used by the conventional approach. We further optimize SpTTM on multicore systems by parallelizing, avoiding locks, and exploiting data locality. Our implementation is up to 3.5× faster than the one from...

10.5555/3018843.3018848 article EN Irregular Applications: Architectures and Algorithms 2016-11-13
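
A minimal sketch of the operation for the last mode of a third-order COO tensor: each output fiber is dense along the product mode while the other modes stay sparse. The hash-map accumulation is illustrative, not the paper's optimized kernel.

```python
import numpy as np
from collections import defaultdict

def spttm_mode2(inds, vals, U):
    """Sparse tensor-times-dense-matrix along the last mode of a 3-mode tensor:
    Y[i, j, :] += val * U[k, :] for every nonzero (i, j, k). The output is
    semi-sparse: sparse in modes 0 and 1, dense along the product mode."""
    rank = U.shape[1]
    fibers = defaultdict(lambda: np.zeros(rank))
    for (i, j, k), v in zip(inds, vals):
        fibers[(i, j)] += v * U[k]
    return fibers

inds = [(0, 1, 2), (0, 1, 3), (2, 0, 1)]
vals = [1.0, 2.0, 3.0]
U = np.random.rand(4, 5)            # mode-2 dimension 4, rank 5
Y = spttm_mode2(inds, vals, U)
print(len(Y), Y[(0, 1)].shape)      # 2 dense output fibers of length 5
```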

Sparse tensor algebra is widely used in many applications, including scientific computing, machine learning, and data analytics. The performance of sparse kernels strongly depends on the intrinsic characteristics of the input tensors; hence, many storage formats have been designed for sparse tensors to achieve optimal performance for particular applications/architectures, which makes it challenging to implement and optimize every operation of interest on a given architecture. We propose a domain-specific language (DSL) and compiler framework...

10.1109/llvmhpc54804.2021.00009 article EN 2021-11-01

Sparse Matrix Vector multiplication (SpMV) is an important kernel in both traditional high performance computing and emerging data-intensive applications. To date, SpMV libraries have been optimized by either application-specific or architecture-specific approaches, making them too complicated to be used extensively in real applications. In this work we develop a Sparse Matrix-vector Auto-Tuning system (SMAT) to bridge the gap between specific optimizations and general-purpose usage. SMAT provides users with a unified...

10.1145/2499370.2462181 article EN ACM SIGPLAN Notices 2013-06-16

This paper presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms. This primitive is a critical bottleneck in data analysis and mining applications based on tensor methods, such as Tucker decomposition. We first implement a sequential SpTTM that avoids the explicit transformations between tensor and matrix used by the conventional approach. We further optimize SpTTM on multicore systems by parallelizing, avoiding locks, and exploiting data locality. Our implementation is up to 3.5× faster than the one from...

10.1109/ia3.2016.010 article EN 2016-11-01

Sparse tensor contractions appear commonly in many applications. Efficiently computing the product of two sparse tensors is challenging: it not only inherits the challenges of common sparse matrix-matrix multiplication (SpGEMM), i.e., indirect memory accesses and unknown output size before computation, but also raises new challenges because of the high dimensionality of tensors, the expensive multi-dimensional index search, and the massive intermediate data. To address the above challenges, we introduce three optimization techniques by using...

10.1145/3437801.3441581 article EN 2021-02-17
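
A small sketch of a hash-map strategy for a single-mode contraction of two third-order sparse tensors, which sidesteps the unknown-output-size problem; the dict-of-tuples representation is an illustrative assumption, not the paper's data structure.

```python
from collections import defaultdict

def sparse_contract(A, B):
    """Contract two sparse 3-mode tensors over their last/first mode:
    C[i, j, m, n] = sum_k A[i, j, k] * B[k, m, n].
    A and B are dicts mapping index tuples to values; the output is built
    with a hash map, so its size need not be known in advance."""
    # Index B's nonzeros by the contraction mode k for fast lookup.
    b_by_k = defaultdict(list)
    for (k, m, n), v in B.items():
        b_by_k[k].append(((m, n), v))
    C = defaultdict(float)
    for (i, j, k), a in A.items():
        for (m, n), b in b_by_k.get(k, []):
            C[(i, j, m, n)] += a * b
    return dict(C)

A = {(0, 0, 1): 2.0, (0, 1, 2): 1.0}
B = {(1, 0, 0): 3.0, (2, 1, 1): 4.0}
print(sparse_contract(A, B))   # {(0, 0, 0, 0): 6.0, (0, 1, 1, 1): 4.0}
```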