- Parallel Computing and Optimization Techniques
- Interconnection Networks and Systems
- Graph Theory and Algorithms
- Data Management and Algorithms
- Matrix Theory and Algorithms
- Distributed and Parallel Computing Systems
- Advanced Graph Neural Networks
- Advanced Data Storage Technologies
- Tensor decomposition and applications
- Distributed systems and fault tolerance
- Bioinformatics and Genomic Networks
- Advanced Clustering Algorithms Research
- Semantic Web and Ontologies
- VLSI and FPGA Design Techniques
- Cloud Computing and Resource Management
- Algorithms and Data Compression
- Radiation Effects in Electronics
- Machine Learning and Data Classification
- Advanced Memory and Neural Computing
- Computer Graphics and Visualization Techniques
- Complex Network Analysis Techniques
- Cellular Automata and Applications
- Machine Learning in Materials Science
- Data Visualization and Analytics
- Low-power high-performance VLSI design
Oak Ridge National Laboratory
2019-2024
Georgia Institute of Technology
1999-2018
We show how to use the idea of self-stabilization, which originates in context distributed control, make fault-tolerant iterative solvers. Generally, a self-stabilizing system is one that, starting from an arbitrary state (valid or invalid), reaches valid within finite number steps. This property imbues with natural means tolerating transient faults. give two proof-of-concept examples linear solvers: for steepest descent (SD) and conjugate gradients (CG). Our self-stabilized versions SD CG...
Due to the limited capacity of GPU memory, majority prior work on graph applications GPUs has been restricted graphs modest sizes that fit in memory. Recent hardware and software advances make it possible address much larger host memory transparently as a part feature known unified virtual While accessing over an interconnect is understandably slower, problem space not sufficiently explored context challenging workload with low computational intensity irregular data access pattern such...
We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D uses three-dimensional MPI process grid, aggressively exploits elimination tree parallelism and trades off increased for reduced per-process communication. also analyze asymptotic improvements planar graphs (e.g., from 2D grid or mesh domains) certain non-planar (specifically grids meshes). For with n vertices, our reduces communication volume...
We show how to exploit graph sparsity in the Floyd-Warshall algorithm for all-pairs shortest path (Apsp) problem. is an attractive choice Apsp on high-performing systems due its structural similarity solving dense linear and matrix multiplication. However, if of input not properly exploited, will perform unnecessary asymptotic work thus may be a suitable many graphs. To overcome this limitation, key idea our approach use known algebraic relationship between Gaussian elimination, import...
This paper presents the first sparse direct solver for distributed memory systems comprising hybrid multicourse CPU and Intel Xeon Pico-processors. It builds on algorithmic approach of SuperLU_DIST, which is right-looking statically pivoted. Our contribution a novel algorithm, called HALO. The name shorthand highly asynchronous lazy offload, it refers tithe way algorithm combines aggressive use asynchrony with accelerated updates, data shadowing (a la halo or ghost zones), all serve to hide...
We present the new features available in recent release of SuperLU_DIST , Version 8.1.1. is a distributed-memory parallel sparse direct solver. The include (1) 3D communication-avoiding algorithm framework that trades off inter-process communication for selective memory duplication, (2) multi-GPU support both NVIDIA GPUs and AMD GPUs, (3) mixed-precision routines perform single-precision LU factorization double-precision iterative refinement. Apart from improvements, we also modernized...
Abstract Kernel summations are a ubiquitous key computational bottleneck in many data analysis methods. In this paper, we attempt to marry, for the first time, best relevant techniques parallel computing, where kernel low dimensions, with general‐dimension algorithms from machine learning literature. We provide distributed implementation of summation framework that can utilize: (i) various types deterministic and probabilistic approximations may be suitable high‐dimensional problems large...
We present a novel distributed memory algorithm to improve the strong scalability of solution sparse triangular system. This operation appears in solve phase direct methods for solving general linear systems, Ax = b. Our 3D solver employs several techniques, including MPI process grid, elimination tree parallelism, and data replication, all which reduce per-process communication when combined. analytical models understand cost our show that can volume asymptotically by factor O(n1/4) O(n1/6)...
This paper presents a unified communication optimization framework for sparse triangular solve (SpTRSV) algorithms on CPU and GPU clusters. The builds upon 3D communication-avoiding (CA) layout of Px × Py Pz processes that divides matrix into submatrices, each handled by 2D grid with block-cyclic distribution. We propose three strategies: First, new SpTRSV algorithm is developed, which trades the inter-grid synchronization replicated computation. design requires only one synchronization,...
We present a new fault-tolerant algorithm for the problem of computing connected components graph. Our derives from highly parallel but non-resilient algorithm, which is based on technique label propagation (LP). To make (LP) resilient to transient soft faults, we apply an algorithmic design principle that refer as self-correction. Briefly, self-correcting detects if it has reached invalid state given was previously in known valid state; and so, restores itself back assuming availability...
We are motivated by newly proposed methods for data mining large-scale corpora of scholarly publications, such as the full biomedical literature, which may consist tens millions papers spanning decades research. In this setting, analysts seek to discover how concepts relate one another. They construct graph representations from annotated text databases and then formulate relationship-mining problem computing all-pairs shortest paths (APSP), becomes a significant bottleneck. context, we...
Computing the Euclidean minimum spanning tree (Emst) is a computationally demanding step of many algorithms. While work-efficient serial and multithreaded algorithms for computing Emst are known, designing an efficient GPU algorithm challenging due to complex branching structure, data dependencies, load imbalances. In this paper, we propose single-tree Borůvka-based on GPUs. We use nearest neighbor reduce number required distance calculations by avoiding traversing subtrees with leaf nodes...
Neuromorphic computers offer the opportunity for low-power, efficient computation. Though they have been primarily applied to neural network tasks, there is also leverage inherent characteristics of neuromorphic (low power, massive parallelism, collocated processing and memory) perform non-neural tasks. Here, we demonstrate how an approach performing sparse binary matrix-vector multiplication on computers. We describe approach, which relies connection between breadth first search, introduce...
This paper presents \pandora, a novel parallel algorithm for efficiently constructing dendrograms single-linkage hierarchical clustering, including \hdbscan. Traditional dendrogram construction methods from minimum spanning tree (MST), such as agglomerative or divisive techniques, often fail to parallelize, especially with skewed common in real-world data. \pandora addresses these challenges through unique recursive contraction method, which simplifies the initial and then progressively...
This study presents the first constrained sparse tensor factorization (cSTF) framework that optimizes and fully offloads computation to massively parallel GPU architectures, performance characterization of cSTF on architectures. In contrast prior work factorization, where matricized times Khatri-Rao product (MTTKRP) is primary bottleneck, our systematic analysis algorithm GPUs reveals adding constraints creates an additional bottleneck in update operation for many real-world tensors. While...
This paper introduces Pandora, a parallel algorithm for computing dendrograms, the hierarchical cluster trees single linkage clustering (SLC). Current approaches construct dendrograms by partitioning minimum spanning tree and removing edges. However, they struggle with skewed, hard-to-parallelize real-world dendrograms. Consequently, is sequential bottleneck in HDBSCAN*[21], popular SLC variant.
The standardization of an interface for dense linear algebra operations in the BLAS standard has enabled interoperability between different libraries, thereby boosting success scientific computing, particular HPC. Despite numerous efforts past, community not yet agreed on a sparse due to reasons. One is fact that objects allow many storage formats, and hardware may favor formats. This makes definition FORTRAN-style all-circumventing extremely challenging. Another reason opposed...
We present an optimized Floyd-Warshall (Floyd-Warshall) algorithm that computes the All-pairs shortest path (APSP) for GPU accelerated clusters. The due to its structural similarities matrix-multiplication is well suited highly parallel architectures. To achieve high efficiency, we address two key algorithmic challenges: reducing communication overhead and addressing limited memory. reduce costs, redesign (a) expose more parallelism, (b) aggressively overlap computation with pipelined...
For the problem of computing connected components a graph, this paper considers design algorithms that are resilient to transient hardware faults, like bit Hips. More specifically, it applies technique self-stabilization. A system is self-stabilizing if, when starting from valid or invalid state, guaranteed reach state after finite number steps. Therefore on machine subject fault, algorithm could recover if fault caused enter an state. We give comprehensive analysis and states during label...