Piyush Sao

ORCID: 0000-0002-9432-5855
Research Areas
  • Parallel Computing and Optimization Techniques
  • Interconnection Networks and Systems
  • Graph Theory and Algorithms
  • Data Management and Algorithms
  • Matrix Theory and Algorithms
  • Distributed and Parallel Computing Systems
  • Advanced Graph Neural Networks
  • Advanced Data Storage Technologies
  • Tensor decomposition and applications
  • Distributed systems and fault tolerance
  • Bioinformatics and Genomic Networks
  • Advanced Clustering Algorithms Research
  • Semantic Web and Ontologies
  • VLSI and FPGA Design Techniques
  • Cloud Computing and Resource Management
  • Algorithms and Data Compression
  • Radiation Effects in Electronics
  • Machine Learning and Data Classification
  • Advanced Memory and Neural Computing
  • Computer Graphics and Visualization Techniques
  • Complex Network Analysis Techniques
  • Cellular Automata and Applications
  • Machine Learning in Materials Science
  • Data Visualization and Analytics
  • Low-power high-performance VLSI design

Oak Ridge National Laboratory
2019-2024

Georgia Institute of Technology
1999-2018

We show how to use the idea of self-stabilization, which originates in the context of distributed control, to make fault-tolerant iterative solvers. Generally, a self-stabilizing system is one that, starting from an arbitrary state (valid or invalid), reaches a valid state within a finite number of steps. This property imbues the system with a natural means of tolerating transient faults. We give two proof-of-concept examples of self-stabilizing linear solvers: one for steepest descent (SD) and one for conjugate gradients (CG). Our self-stabilized versions of SD and CG...

10.1145/2530268.2530272 article EN 2013-10-29
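
The correction idea is easy to illustrate. Below is a minimal NumPy sketch (not the paper's code; the function name self_stabilizing_sd and the test system are illustrative) of a steepest-descent loop whose only persistent state is the iterate x: since the residual is recomputed from x on every iteration, a transient corruption of x is just another "arbitrary" starting state and the iteration still converges.

```python
# A minimal sketch (assumed, not the authors' implementation) of a
# self-stabilizing steepest-descent solver for an SPD system Ax = b.
import numpy as np

def self_stabilizing_sd(A, b, x0, iters=1000, tol=1e-10):
    x = x0.copy()
    for _ in range(iters):
        r = b - A @ x                      # residual recomputed from current state only
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ (A @ r))    # exact line search for SPD A
        x = x + alpha * r
    return x

# usage: a small symmetric positive definite test system
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = self_stabilizing_sd(A, b, np.zeros(50))
print(np.linalg.norm(A @ x - b))
```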

Due to the limited capacity of GPU memory, the majority of prior work on graph applications on GPUs has been restricted to graphs of modest sizes that fit in memory. Recent hardware and software advances make it possible to address much larger host memory transparently as part of a feature known as unified virtual memory. While accessing host memory over an interconnect is understandably slower, the problem space has not been sufficiently explored in the context of a challenging workload with low computational intensity and an irregular data access pattern such...

10.14778/3384345.3384358 article EN Proceedings of the VLDB Endowment 2020-03-01

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm uses a three-dimensional MPI process grid, aggressively exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., from 2D grid or mesh domains) and certain non-planar graphs (specifically 3D grids and meshes). For planar graphs with n vertices, our algorithm reduces the communication volume...

10.1109/ipdps.2018.00100 article EN 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2018-05-01
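
For illustration, the 3D process-grid layout described above can be sketched with mpi4py (an assumption; this is not the SuperLU_DIST code, and the grid shape and communicator names are made up): P = Px*Py*Pz ranks are viewed as Pz layers of 2D Px x Py grids, with independent elimination-tree subtrees mapped to different layers and a "depth" communicator combining their contributions.

```python
# A minimal mpi4py sketch (assumed) of a Px x Py x Pz process grid:
# each z-layer is a 2D grid that factors its own elimination-tree subtrees,
# and the depth communicator is used to reduce replicated updates.
from mpi4py import MPI

Px, Py, Pz = 2, 2, 2                       # illustrative grid shape; expects Px*Py*Pz ranks
world = MPI.COMM_WORLD
assert world.size == Px * Py * Pz

rank = world.rank
layer = rank // (Px * Py)                  # which z-layer this rank belongs to
# all ranks in the same layer form one 2D grid
grid2d = world.Split(color=layer, key=rank)
# ranks sharing a 2D coordinate across layers form the "depth" communicator
depth = world.Split(color=rank % (Px * Py), key=layer)

print(f"rank {rank}: layer {layer}, 2D grid size {grid2d.size}, depth size {depth.size}")
```

Run with, e.g., `mpiexec -n 8 python grid3d.py` (script name hypothetical).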

We show how to exploit graph sparsity in the Floyd-Warshall algorithm for the all-pairs shortest path (APSP) problem. Floyd-Warshall is an attractive choice for APSP on high-performing systems due to its structural similarity to solving dense linear systems and matrix multiplication. However, if the sparsity of the input is not properly exploited, it will perform unnecessary asymptotic work and thus may not be a suitable choice for many graphs. To overcome this limitation, the key idea of our approach is to use the known algebraic relationship between Floyd-Warshall and Gaussian elimination, and import...

10.1145/3332466.3374533 article EN 2020-02-19
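
As a rough illustration of the sparsity idea (a simplified sketch, not the paper's algorithm), the inner updates for pivot k can be skipped for any row i that does not reach k, mirroring how sparse Gaussian elimination only touches rows with a nonzero in the pivot column.

```python
# An illustrative sparsity-aware Floyd-Warshall sketch (assumed, simplified).
import math

def sparse_floyd_warshall(dist):
    """dist: dense n x n list of lists, math.inf where no edge."""
    n = len(dist)
    for k in range(n):
        row_k = dist[k]
        for i in range(n):
            dik = dist[i][k]
            if dik == math.inf:            # row i has no path through pivot k: skip
                continue
            row_i = dist[i]
            for j in range(n):
                alt = dik + row_k[j]
                if alt < row_i[j]:
                    row_i[j] = alt
    return dist

# usage on a small sparse directed graph
INF = math.inf
D = [[0, 3, INF, INF],
     [INF, 0, 1, INF],
     [INF, INF, 0, 7],
     [2, INF, INF, 0]]
print(sparse_floyd_warshall(D))
```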

This paper presents the first sparse direct solver for distributed memory systems comprising hybrid multicore CPUs and Intel Xeon Phi co-processors. It builds on the algorithmic approach of SuperLU_DIST, which is right-looking and statically pivoted. Our contribution is a novel algorithm, called HALO. The name is shorthand for highly asynchronous lazy offload, and it refers to the way the algorithm combines aggressive use of asynchrony with accelerated updates and data shadowing (a la halo or ghost zones), all of which serve to hide...

10.1109/ipdps.2015.104 article EN 2015-05-01

We present the new features available in the recent release of SuperLU_DIST, Version 8.1.1. SuperLU_DIST is a distributed-memory parallel sparse direct solver. The new features include (1) a 3D communication-avoiding algorithm framework that trades off inter-process communication for selective memory duplication, (2) multi-GPU support for both NVIDIA GPUs and AMD GPUs, and (3) mixed-precision routines that perform single-precision LU factorization with double-precision iterative refinement. Apart from these algorithmic improvements, we also modernized...

10.1145/3577197 article EN other-oa ACM Transactions on Mathematical Software 2022-12-19
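
The mixed-precision scheme in item (3) can be sketched in a few lines of NumPy/SciPy (an illustrative stand-in, not the SuperLU_DIST routines): factor once in float32, then recover double-precision accuracy with a few iterative-refinement steps.

```python
# A minimal mixed-precision solve sketch (assumed): float32 LU factorization
# followed by iterative refinement with float64 residuals.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, refinements=3):
    lu32, piv = lu_factor(A.astype(np.float32))       # cheap single-precision factorization
    x = lu_solve((lu32, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(refinements):
        r = b - A @ x                                  # residual in double precision
        dx = lu_solve((lu32, piv), r.astype(np.float32))
        x = x + dx.astype(np.float64)                  # correction applied in double precision
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 200)) + 200 * np.eye(200)   # well-conditioned test matrix
b = rng.standard_normal(200)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```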

Kernel summations are a ubiquitous key computational bottleneck in many data analysis methods. In this paper, we attempt to marry, for the first time, the best relevant techniques from parallel computing, where kernel summations are computed in low dimensions, with the best general-dimension algorithms from the machine learning literature. We provide a distributed implementation of the kernel summation framework that can utilize: (i) various types of deterministic and probabilistic approximations that may be suitable for high-dimensional problems with large...

10.1002/sam.11207 article EN Statistical Analysis and Data Mining The ASA Data Science Journal 2013-10-24
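
A toy example of the kind of probabilistic approximation mentioned above (a sketch under assumed names, not the paper's framework): estimate each Gaussian kernel sum from a uniform subsample of the reference points and rescale.

```python
# Exact kernel summation versus a simple Monte Carlo approximation (assumed sketch).
import numpy as np

def gaussian_kernel_sums(queries, refs, bandwidth):
    """Exact: s_q = sum_r exp(-||q - r||^2 / (2 h^2)) for every query point q."""
    d2 = ((queries[:, None, :] - refs[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2)).sum(axis=1)

def approx_kernel_sums(queries, refs, bandwidth, sample_size, rng):
    """Estimate each sum from a uniform subsample of the reference set."""
    idx = rng.choice(len(refs), size=sample_size, replace=False)
    partial = gaussian_kernel_sums(queries, refs[idx], bandwidth)
    return partial * (len(refs) / sample_size)        # rescale to the full sum

rng = np.random.default_rng(2)
refs = rng.standard_normal((5000, 3))
queries = rng.standard_normal((10, 3))
exact = gaussian_kernel_sums(queries, refs, bandwidth=1.0)
approx = approx_kernel_sums(queries, refs, 1.0, sample_size=500, rng=rng)
print(np.max(np.abs(exact - approx) / exact))          # relative error of the estimate
```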

10.1016/j.jpdc.2019.03.004 article EN publisher-specific-oa Journal of Parallel and Distributed Computing 2019-04-24

We present a novel distributed memory algorithm to improve the strong scalability of the solution of a sparse triangular system. This operation appears in the solve phase of direct methods for solving general linear systems, Ax = b. Our 3D sparse triangular solver employs several techniques, including a 3D MPI process grid, elimination tree parallelism, and data replication, all of which reduce per-process communication when combined. We present analytical models to understand the cost of our algorithm and show that it can reduce the per-process communication volume asymptotically by a factor of O(n^(1/4)) and O(n^(1/6))...

10.1145/3330345.3330357 article EN 2019-06-18
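
The underlying operation and its dependency structure can be sketched as follows (sequential, assumed CSR storage; the paper's contribution of distributing this over a 3D process grid is not shown): a lower-triangular solve plus the level sets indicating which unknowns could be solved concurrently.

```python
# Sequential sparse lower-triangular solve and its level-set schedule (assumed sketch).
import numpy as np
import scipy.sparse as sp

def sptrsv_lower(L_csr, b):
    n = L_csr.shape[0]
    x = np.zeros(n)
    indptr, indices, data = L_csr.indptr, L_csr.indices, L_csr.data
    for i in range(n):
        s, diag = b[i], 1.0
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]
            if j < i:
                s -= data[p] * x[j]        # subtract contributions of already-solved unknowns
            elif j == i:
                diag = data[p]
        x[i] = s / diag
    return x

def level_sets(L_csr):
    """Rows grouped so that every row depends only on rows in earlier levels."""
    n = L_csr.shape[0]
    level = np.zeros(n, dtype=int)
    for i in range(n):
        for p in range(L_csr.indptr[i], L_csr.indptr[i + 1]):
            j = L_csr.indices[p]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    return level

L = sp.csr_matrix(np.array([[2., 0, 0, 0],
                            [1., 3, 0, 0],
                            [0., 0, 4, 0],
                            [1., 0, 2, 5]]))
b = np.array([2., 5., 8., 15.])
print(sptrsv_lower(L, b), level_sets(L))
```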

This paper presents a unified communication optimization framework for sparse triangular solve (SpTRSV) algorithms on CPU and GPU clusters. The framework builds upon a 3D communication-avoiding (CA) layout of Px × Py × Pz processes that divides the matrix into submatrices, each handled by a 2D process grid with a block-cyclic distribution. We propose three strategies: First, a new SpTRSV algorithm is developed, which trades inter-grid synchronization for replicated computation. The design requires only one synchronization,...

10.1145/3581784.3607092 article EN cc-by 2023-11-11

We present a new fault-tolerant algorithm for the problem of computing the connected components of a graph. Our algorithm derives from a highly parallel but non-resilient algorithm, which is based on the technique of label propagation (LP). To make label propagation (LP) resilient to transient soft faults, we apply an algorithmic design principle that we refer to as self-correction. Briefly, a self-correcting algorithm detects if it has reached an invalid state, given that it was previously in a known valid state; and if so, it restores itself back to a valid state, assuming the availability...

10.1145/2909428.2909435 article EN 2016-05-27
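
A compact shared-memory sketch (assumed, simplified) of label propagation with a self-correction pass: the validity check below resets any label that could not have arisen from a valid execution, after which ordinary propagation converges again.

```python
# Label-propagation connected components with a simplified self-correction pass
# (an assumed sketch, not the paper's parallel implementation).
def connected_components_lp(n, edges, max_rounds=1000):
    label = list(range(n))                 # initially every vertex is its own component
    for _ in range(max_rounds):
        # self-correction: any label outside [0, n) is invalid; reset that vertex
        for v in range(n):
            if not (0 <= label[v] < n):
                label[v] = v
        changed = False
        for (u, v) in edges:               # propagate the smaller label across each edge
            m = min(label[u], label[v])
            if label[u] != m:
                label[u], changed = m, True
            if label[v] != m:
                label[v], changed = m, True
        if not changed:
            break
    return label

edges = [(0, 1), (1, 2), (3, 4)]
print(connected_components_lp(5, edges))   # two components: {0,1,2} and {3,4}
```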

We are motivated by newly proposed methods for data mining large-scale corpora of scholarly publications, such as the full biomedical literature, which may consist of tens of millions of papers spanning decades of research. In this setting, analysts seek to discover how concepts relate to one another. They construct graph representations from annotated text databases and then formulate the relationship-mining problem as one of computing all-pairs shortest paths (APSP), which becomes a significant bottleneck. In this context, we...

10.1109/sc41405.2020.00010 article EN 2020-11-01

Computing the Euclidean minimum spanning tree (Emst) is a computationally demanding step of many algorithms. While work-efficient serial and multithreaded algorithms for computing Emst are known, designing an efficient GPU algorithm is challenging due to a complex branching structure, data dependencies, and load imbalances. In this paper, we propose a single-tree Borůvka-based algorithm on GPUs. We use nearest-neighbor queries to reduce the number of required distance calculations by avoiding traversing subtrees with leaf nodes...

10.1145/3545008.3546185 article EN 2022-08-29
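
Borůvka's structure, which the GPU algorithm builds on, is easy to show in a brute-force sketch (assumed; the paper's nearest-neighbor queries and GPU kernels are not shown): in each round every component selects its lightest outgoing edge, so the number of components at least halves per round.

```python
# Brute-force Boruvka EMST sketch (assumed, O(n^2) distance search).
import numpy as np

def boruvka_emst(points):
    n = len(points)
    comp = np.arange(n)
    mst = []
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    while len(set(comp)) > 1:
        best = {}                                  # component -> (dist2, u, v)
        for u in range(n):
            for v in range(n):
                if comp[u] != comp[v]:
                    cand = (d2[u, v], u, v)
                    if comp[u] not in best or cand < best[comp[u]]:
                        best[comp[u]] = cand
        for _, u, v in best.values():              # add each component's lightest outgoing edge
            if comp[u] != comp[v]:
                mst.append((u, v))
                old, new = comp[v], comp[u]
                comp[comp == old] = new            # merge the two components
    return mst

pts = np.array([[0., 0.], [0., 1.], [5., 0.], [5., 1.], [2.5, 0.5]])
print(boruvka_emst(pts))
```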

Neuromorphic computers offer the opportunity for low-power, efficient computation. Though they have been primarily applied to neural network tasks, there is also an opportunity to leverage the inherent characteristics of neuromorphic hardware (low power, massive parallelism, collocated processing and memory) to perform non-neural computational tasks. Here, we demonstrate an approach for performing sparse binary matrix-vector multiplication on neuromorphic computers. We describe the approach, which relies on a connection with breadth-first search, and introduce...

10.1109/ipdpsw52791.2021.00054 article EN 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 2021-06-01
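
The connection referred to above can be shown directly (a tiny assumed sketch): a binary adjacency matrix times a binary vector over the Boolean semiring is exactly one BFS frontier expansion, which is the kind of event-driven operation a spiking neuromorphic substrate performs naturally.

```python
# Binary sparse matrix-vector product as a BFS frontier expansion (assumed sketch).
import numpy as np

def boolean_spmv(adj, frontier):
    """adj: n x n 0/1 matrix, frontier: length-n 0/1 vector -> vertices reachable in one step."""
    return (adj @ frontier > 0).astype(np.uint8)

# adjacency of a path 0 -> 1 -> 2 -> 3 (rows are destinations, columns sources)
adj = np.array([[0, 0, 0, 0],
                [1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0]], dtype=np.uint8)
frontier = np.array([1, 0, 0, 0], dtype=np.uint8)   # BFS starts at vertex 0
for step in range(3):
    frontier = boolean_spmv(adj, frontier)
    print(f"step {step + 1}: frontier = {frontier}")
```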

This paper presents Pandora, a novel parallel algorithm for efficiently constructing dendrograms for single-linkage hierarchical clustering, including HDBSCAN. Traditional dendrogram construction methods from a minimum spanning tree (MST), such as agglomerative or divisive techniques, often fail to parallelize, especially with the skewed dendrograms common in real-world data. Pandora addresses these challenges through a unique recursive contraction method, which simplifies the initial tree and then progressively...

10.48550/arxiv.2401.06089 preprint EN other-oa arXiv (Cornell University) 2024-01-01

This study presents the first constrained sparse tensor factorization (cSTF) framework that optimizes and fully offloads computation to massively parallel GPU architectures, along with a performance characterization of cSTF on these architectures. In contrast to prior work on factorization, where the matricized tensor times Khatri-Rao product (MTTKRP) is the primary bottleneck, our systematic analysis of the algorithm on GPUs reveals that adding constraints creates an additional bottleneck in the update operation for many real-world tensors. While...

10.1145/3673038.3673128 article EN cc-by 2024-08-08
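
For reference, the MTTKRP operation named above, for a sparse 3-way tensor in COO form, looks like this (a minimal NumPy sketch, not the paper's GPU kernels; the function name and test data are illustrative).

```python
# Mode-0 MTTKRP sketch: every nonzero X[i,j,k] contributes v * (B[j,:] * C[k,:])
# to row i of the output matrix.
import numpy as np

def mttkrp_mode0(coords, vals, B, C, dim0):
    """coords: (nnz, 3) indices, vals: (nnz,) values, B/C: factor matrices."""
    rank = B.shape[1]
    out = np.zeros((dim0, rank))
    for (i, j, k), v in zip(coords, vals):
        out[i] += v * B[j] * C[k]          # elementwise (Khatri-Rao) row product
    return out

rng = np.random.default_rng(3)
coords = np.array([[0, 1, 2], [1, 0, 0], [0, 2, 1]])
vals = np.array([1.0, 2.0, -1.0])
B, C = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
print(mttkrp_mode0(coords, vals, B, C, dim0=2))
```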

This paper introduces Pandora, a parallel algorithm for computing dendrograms, the hierarchical cluster trees of single-linkage clustering (SLC). Current approaches construct dendrograms by partitioning the minimum spanning tree and removing edges. However, they struggle with the skewed, hard-to-parallelize dendrograms found in real-world data. Consequently, dendrogram construction is the sequential bottleneck in HDBSCAN* [21], a popular SLC variant.

10.1145/3673038.3673148 article EN public-domain 2024-08-08

The standardization of an interface for dense linear algebra operations in the BLAS standard has enabled interoperability between different libraries, thereby boosting the success of scientific computing, in particular in HPC. Despite numerous efforts in the past, the community has not yet agreed on a standardized interface for sparse linear algebra operations, due to several reasons. One is the fact that sparse objects allow many storage formats, and different hardware may favor different formats. This makes the definition of a FORTRAN-style all-circumventing interface extremely challenging. Another reason, as opposed to...

10.48550/arxiv.2411.13259 preprint EN arXiv (Cornell University) 2024-11-20

We present an optimized Floyd-Warshall (FW) algorithm that computes the all-pairs shortest paths (APSP) on GPU-accelerated clusters. The FW algorithm, due to its structural similarities to matrix multiplication, is well suited to highly parallel GPU architectures. To achieve high efficiency, we address two key algorithmic challenges: reducing communication overhead and addressing limited GPU memory. To reduce communication costs, we redesign the algorithm to (a) expose more parallelism and (b) aggressively overlap communication and computation with pipelined...

10.1145/3431379.3460651 article EN 2021-06-17
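
The "similarity to matrix multiplication" can be made concrete with a single-node blocked Floyd-Warshall sketch (assumed, simplified; not the paper's distributed GPU implementation): after the diagonal and pivot row/column tiles are updated, the bulk of the work is a min-plus "GEMM" on the remaining tiles, which is what maps well onto accelerators.

```python
# Blocked Floyd-Warshall sketch (assumed): most work is the min-plus tile update.
import numpy as np

def dense_fw(D):
    """Plain Floyd-Warshall on a small dense tile."""
    for k in range(D.shape[0]):
        D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
    return D

def minplus(A, B, C):
    """C = min(C, A (min,+) B): the tropical-semiring analogue of GEMM."""
    return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

def blocked_floyd_warshall(D, b):
    n = D.shape[0]
    assert n % b == 0
    for k in range(0, n, b):
        K = slice(k, k + b)
        D[K, K] = dense_fw(D[K, K])                        # 1. diagonal tile
        for j in range(0, n, b):                           # 2. pivot row and column tiles
            if j != k:
                J = slice(j, j + b)
                D[K, J] = minplus(D[K, K], D[K, J], D[K, J])
                D[J, K] = minplus(D[J, K], D[K, K], D[J, K])
        for i in range(0, n, b):                           # 3. remaining tiles: pure min-plus GEMM
            for j in range(0, n, b):
                if i != k and j != k:
                    I, J = slice(i, i + b), slice(j, j + b)
                    D[I, J] = minplus(D[I, K], D[K, J], D[I, J])
    return D

# usage: a small directed cycle 0 -> 1 -> 2 -> 3 -> 0
D = np.full((4, 4), np.inf)
np.fill_diagonal(D, 0)
D[0, 1], D[1, 2], D[2, 3], D[3, 0] = 3, 1, 7, 2
print(blocked_floyd_warshall(D, b=2))
```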

For the problem of computing the connected components of a graph, this paper considers the design of algorithms that are resilient to transient hardware faults, like bit flips. More specifically, it applies the technique of self-stabilization. A system is self-stabilizing if, when starting from a valid or invalid state, it is guaranteed to reach a valid state after a finite number of steps. Therefore, on a machine subject to a transient fault, a self-stabilizing algorithm could recover if the fault caused the system to enter an invalid state. We give a comprehensive analysis of the valid and invalid states during label...

10.1109/ftxs49593.2019.00011 article EN 2019-11-01
