NFDI4DS | UHH-SEMS - Publication Details

Piyush Sao

ORCID: 0000-0002-9432-5855

Publications

Citations

Views

---

Saved

---

About

Contact & Profiles

A5048170299

Research Areas

Parallel Computing and Optimization Techniques
Interconnection Networks and Systems
Graph Theory and Algorithms
Data Management and Algorithms
Matrix Theory and Algorithms
Distributed and Parallel Computing Systems
Advanced Graph Neural Networks
Advanced Data Storage Technologies
Tensor decomposition and applications
Distributed systems and fault tolerance
Bioinformatics and Genomic Networks
Advanced Clustering Algorithms Research
Semantic Web and Ontologies
VLSI and FPGA Design Techniques
Cloud Computing and Resource Management
Algorithms and Data Compression
Radiation Effects in Electronics
Machine Learning and Data Classification
Advanced Memory and Neural Computing
Computer Graphics and Visualization Techniques
Complex Network Analysis Techniques
Cellular Automata and Applications
Machine Learning in Materials Science
Data Visualization and Analytics
Low-power high-performance VLSI design

Oak Ridge National Laboratory
2019-2024

Georgia Institute of Technology
1999-2018

Self-stabilizing iterative solvers

OPENALEX - Publications

Piyush Sao Richard Vuduc

We show how to use the idea of self-stabilization, which originates in context distributed control, make fault-tolerant iterative solvers. Generally, a self-stabilizing system is one that, starting from an arbitrary state (valid or invalid), reaches valid within finite number steps. This property imbues with natural means tolerating transient faults. give two proof-of-concept examples linear solvers: for steepest descent (SD) and conjugate gradients (CG). Our self-stabilized versions SD CG...

10.1145/2530268.2530272 article EN 2013-10-29

Traversing large graphs on GPUs with unified memory

OPENALEX - Publications

Prasun Gera Hyojong Kim Piyush Sao Hyesoon Kim David A. Bader

Due to the limited capacity of GPU memory, majority prior work on graph applications GPUs has been restricted graphs modest sizes that fit in memory. Recent hardware and software advances make it possible address much larger host memory transparently as a part feature known unified virtual While accessing over an interconnect is understandably slower, problem space not sufficiently explored context challenging workload with low computational intensity irregular data access pattern such...

10.14778/3384345.3384358 article EN Proceedings of the VLDB Endowment 2020-03-01

A Communication-Avoiding 3D LU Factorization Algorithm for Sparse Matrices

OPENALEX - Publications

Piyush Sao Xiaoye Sherry Li Richard Vuduc

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D uses three-dimensional MPI process grid, aggressively exploits elimination tree parallelism and trades off increased for reduced per-process communication. also analyze asymptotic improvements planar graphs (e.g., from 2D grid or mesh domains) certain non-planar (specifically grids meshes). For with n vertices, our reduces communication volume...

10.1109/ipdps.2018.00100 article EN 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2018-05-01

A supernodal all-pairs shortest path algorithm

OPENALEX - Publications

Piyush Sao Ramakrishnan Kannan Prasun Gera Richard Vuduc

We show how to exploit graph sparsity in the Floyd-Warshall algorithm for all-pairs shortest path (Apsp) problem. is an attractive choice Apsp on high-performing systems due its structural similarity solving dense linear and matrix multiplication. However, if of input not properly exploited, will perform unnecessary asymptotic work thus may be a suitable many graphs. To overcome this limitation, key idea our approach use known algebraic relationship between Gaussian elimination, import...

10.1145/3332466.3374533 article EN 2020-02-19

A Sparse Direct Solver for Distributed Memory Xeon Phi-Accelerated Systems

OPENALEX - Publications

Piyush Sao Xing Liu Richard Vuduc Xiaoye Li

This paper presents the first sparse direct solver for distributed memory systems comprising hybrid multicourse CPU and Intel Xeon Pico-processors. It builds on algorithmic approach of SuperLU_DIST, which is right-looking statically pivoted. Our contribution a novel algorithm, called HALO. The name shorthand highly asynchronous lazy offload, it refers tithe way algorithm combines aggressive use asynchrony with accelerated updates, data shadowing (a la halo or ghost zones), all serve to hide...

10.1109/ipdps.2015.104 article EN 2015-05-01

Newly Released Capabilities in the Distributed-Memory SuperLU Sparse Direct Solver

OPENALEX - Publications

Xiaoye Sherry Li Paul Lin Yang Liu Piyush Sao

We present the new features available in recent release of SuperLU_DIST , Version 8.1.1. is a distributed-memory parallel sparse direct solver. The include (1) 3D communication-avoiding algorithm framework that trades off inter-process communication for selective memory duplication, (2) multi-GPU support both NVIDIA GPUs and AMD GPUs, (3) mixed-precision routines perform single-precision LU factorization double-precision iterative refinement. Apart from improvements, we also modernized...

10.1145/3577197 article EN other-oa ACM Transactions on Mathematical Software 2022-12-19

A distributed kernel summation framework for general‐dimension machine learning

OPENALEX - Publications

Dongryeol Lee Piyush Sao Richard Vuduc Alexander Gray

Abstract Kernel summations are a ubiquitous key computational bottleneck in many data analysis methods. In this paper, we attempt to marry, for the first time, best relevant techniques parallel computing, where kernel low dimensions, with general‐dimension algorithms from machine learning literature. We provide distributed implementation of summation framework that can utilize: (i) various types deterministic and probabilistic approximations may be suitable high‐dimensional problems large...

10.1002/sam.11207 article EN Statistical Analysis and Data Mining The ASA Data Science Journal 2013-10-24

A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

OPENALEX - Publications

Piyush Sao Xiaoye Sherry Li Richard Vuduc

10.1016/j.jpdc.2019.03.004 article EN publisher-specific-oa Journal of Parallel and Distributed Computing 2019-04-24

A communication-avoiding 3D sparse triangular solver

OPENALEX - Publications

Piyush Sao Ramakrishnan Kannan Xiaoye Sherry Li Richard Vuduc

We present a novel distributed memory algorithm to improve the strong scalability of solution sparse triangular system. This operation appears in solve phase direct methods for solving general linear systems, Ax = b. Our 3D solver employs several techniques, including MPI process grid, elimination tree parallelism, and data replication, all which reduce per-process communication when combined. analytical models understand cost our show that can volume asymptotically by factor O(n1/4) O(n1/6)...

10.1145/3330345.3330357 article EN 2019-06-18

Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters

OPENALEX - Publications

Yang Liu Nan Ding Piyush Sao Samuel Williams Xiaoye Sherry Li

This paper presents a unified communication optimization framework for sparse triangular solve (SpTRSV) algorithms on CPU and GPU clusters. The builds upon 3D communication-avoiding (CA) layout of Px × Py Pz processes that divides matrix into submatrices, each handled by 2D grid with block-cyclic distribution. We propose three strategies: First, new SpTRSV algorithm is developed, which trades the inter-grid synchronization replicated computation. design requires only one synchronization,...

10.1145/3581784.3607092 article EN cc-by 2023-11-11

A Self-Correcting Connected Components Algorithm

OPENALEX - Publications

Piyush Sao Oded Green Chirag Jain Richard Vuduc

We present a new fault-tolerant algorithm for the problem of computing connected components graph. Our derives from highly parallel but non-resilient algorithm, which is based on technique label propagation (LP). To make (LP) resilient to transient soft faults, we apply an algorithmic design principle that refer as self-correction. Briefly, self-correcting detects if it has reached invalid state given was previously in known valid state; and so, restores itself back assuming availability...

10.1145/2909428.2909435 article EN 2016-05-27

Scalable Knowledge Graph Analytics at 136 Petaflop/s

OPENALEX - Publications

Ramakrishnan Kannan Piyush Sao Hao Lu Drahomíra Herrmannová Vijay Thakkar and 3 more

We are motivated by newly proposed methods for data mining large-scale corpora of scholarly publications, such as the full biomedical literature, which may consist tens millions papers spanning decades research. In this setting, analysts seek to discover how concepts relate one another. They construct graph representations from annotated text databases and then formulate relationship-mining problem computing all-pairs shortest paths (APSP), becomes a significant bottleneck. context, we...

10.1109/sc41405.2020.00010 article EN 2020-11-01

A single-tree algorithm to compute the Euclidean minimum spanning tree on GPUs

OPENALEX - Publications

Andrey Prokopenko Piyush Sao Damien Lebrun-Grandié

Computing the Euclidean minimum spanning tree (Emst) is a computationally demanding step of many algorithms. While work-efficient serial and multithreaded algorithms for computing Emst are known, designing an efficient GPU algorithm challenging due to complex branching structure, data dependencies, load imbalances. In this paper, we propose single-tree Borůvka-based on GPUs. We use nearest neighbor reduce number required distance calculations by avoiding traversing subtrees with leaf nodes...

10.1145/3545008.3546185 article EN 2022-08-29

Sparse Binary Matrix-Vector Multiplication on Neuromorphic Computers

OPENALEX - Publications

Catherine D. Schuman Bill Kay Prasanna Date Ramakrishnan Kannan Piyush Sao and 1 more

Neuromorphic computers offer the opportunity for low-power, efficient computation. Though they have been primarily applied to neural network tasks, there is also leverage inherent characteristics of neuromorphic (low power, massive parallelism, collocated processing and memory) perform non-neural tasks. Here, we demonstrate how an approach performing sparse binary matrix-vector multiplication on computers. We describe approach, which relies connection between breadth first search, introduce...

10.1109/ipdpsw52791.2021.00054 article EN 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 2021-06-01

PANDORA: A Parallel Dendrogram Construction Algorithm for Single Linkage Clustering on GPU

OPENALEX - Publications

Piyush Sao Andrey Prokopenko Damien Lebrun-Grandié

This paper presents \pandora, a novel parallel algorithm for efficiently constructing dendrograms single-linkage hierarchical clustering, including \hdbscan. Traditional dendrogram construction methods from minimum spanning tree (MST), such as agglomerative or divisive techniques, often fail to parallelize, especially with skewed common in real-world data. \pandora addresses these challenges through unique recursive contraction method, which simplifies the initial and then progressively...

10.48550/arxiv.2401.06089 preprint EN other-oa arXiv (Cornell University) 2024-01-01

Accelerated Constrained Sparse Tensor Factorization on Massively Parallel Architectures

OPENALEX - Publications

Yongseok Soh Ramakrishnan Kannan Piyush Sao Jee Choi

This study presents the first constrained sparse tensor factorization (cSTF) framework that optimizes and fully offloads computation to massively parallel GPU architectures, performance characterization of cSTF on architectures. In contrast prior work factorization, where matricized times Khatri-Rao product (MTTKRP) is primary bottleneck, our systematic analysis algorithm GPUs reveals adding constraints creates an additional bottleneck in update operation for many real-world tensors. While...

10.1145/3673038.3673128 article EN cc-by 2024-08-08

PANDORA: A Parallel Dendrogram Construction Algorithm for Single Linkage Clustering on GPU

OPENALEX - Publications

Piyush Sao Andrey Prokopenko Damien Lebrun-Grandié

This paper introduces Pandora, a parallel algorithm for computing dendrograms, the hierarchical cluster trees single linkage clustering (SLC). Current approaches construct dendrograms by partitioning minimum spanning tree and removing edges. However, they struggle with skewed, hard-to-parallelize real-world dendrograms. Consequently, is sequential bottleneck in HDBSCAN*[21], popular SLC variant.

10.1145/3673038.3673148 article EN public-domain 2024-08-08

Interface for Sparse Linear Algebra Operations

OPENALEX - Publications

Ahmad Abdelfattah Willow Ahrens Hartwig Anzt Chris Armstrong Benjamin Brock and 30 more

The standardization of an interface for dense linear algebra operations in the BLAS standard has enabled interoperability between different libraries, thereby boosting success scientific computing, particular HPC. Despite numerous efforts past, community not yet agreed on a sparse due to reasons. One is fact that objects allow many storage formats, and hardware may favor formats. This makes definition FORTRAN-style all-circumventing extremely challenging. Another reason opposed...

10.48550/arxiv.2411.13259 preprint EN arXiv (Cornell University) 2024-11-20

Scalable All-pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters

OPENALEX - Publications

Piyush Sao Hao Lu Ramakrishnan Kannan Vijay Thakkar Richard Vuduc and 1 more

We present an optimized Floyd-Warshall (Floyd-Warshall) algorithm that computes the All-pairs shortest path (APSP) for GPU accelerated clusters. The due to its structural similarities matrix-multiplication is well suited highly parallel architectures. To achieve high efficiency, we address two key algorithmic challenges: reducing communication overhead and addressing limited memory. reduce costs, redesign (a) expose more parallelism, (b) aggressively overlap computation with pipelined...

10.1145/3431379.3460651 article EN 2021-06-17

Self-stabilizing Connected Components

OPENALEX - Publications

Piyush Sao Christian Engelmann Srinivas Eswar Oded Green Richard Vuduc

For the problem of computing connected components a graph, this paper considers design algorithms that are resilient to transient hardware faults, like bit Hips. More specifically, it applies technique self-stabilization. A system is self-stabilizing if, when starting from valid or invalid state, guaranteed reach state after finite number steps. Therefore on machine subject fault, algorithm could recover if fault caused enter an state. We give comprehensive analysis and states during label...

10.1109/ftxs49593.2019.00011 article EN 2019-11-01

Scalable Knowledge Graph Analytics at 136 Petaflop/s

OPENALEX - Publications

Ramakrishnan Kannan Piyush Sao Hao Lu Dasha Herrmannova Vijay Thakkar and 3 more

10.1109/sc41405.2020.00010 article EN other-oa OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information) 2020-11-01

Sparse Binary Matrix-Vector Multiplication on Neuromorphic Computers

OPENALEX - Publications

Catherine D. Schuman Bill Kay Prasanna Date Ramakrishnan Kannan Piyush Sao and 1 more

10.1109/ipdpsw52791.2021.00054 article EN other-oa OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information) 2021-06-01

Coming Soon ...