- Parallel Computing and Optimization Techniques
- Graph Theory and Algorithms
- Advanced Data Storage Technologies
- Cloud Computing and Resource Management
- Interconnection Networks and Systems
- Advanced Neural Network Applications
- Numerical Methods and Algorithms
- Advanced Memory and Neural Computing
- Embedded Systems Design Techniques
- Advanced Graph Neural Networks
- Ferroelectric and Negative Capacitance Devices
- Tensor decomposition and applications
- Formal Methods in Verification
- Error Correcting Code Techniques
- Data Management and Algorithms
- Real-Time Systems Scheduling
- Complexity and Algorithms in Graphs
- Matrix Theory and Algorithms
- Adversarial Robustness in Machine Learning
- Genetic Associations and Epidemiology
- Caching and Content Delivery
- CCD and CMOS Imaging Sensors
- Evolution and Genetic Dynamics
- Complex Network Analysis Techniques
- Low-power high-performance VLSI design
Carnegie Mellon University, 2015-2024
Universidade Estadual de Campinas (UNICAMP), 2018
The University of Texas at Austin, 2005-2016
We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine the tuning parameters for high-end instantiations of matrix-matrix multiplication. This is of both practical and scientific importance: it greatly reduces the development effort required to implement the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with...
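For readers unfamiliar with the structure being modeled, the following is a minimal sketch, not the BLIS code itself, of the Goto/BLIS-style layered loop nest around a micro-kernel. The block sizes NC, KC, MC and the micro-tile sizes MR, NR (the values below are illustrative) are the tuning parameters the analytical model determines; packing and the hand-optimized micro-kernel are omitted.

```c
/* Sketch of the layered (Goto/BLIS-style) blocking for C += A*B,
 * column-major storage with leading dimensions lda/ldb/ldc.
 * NC/KC/MC target the L3/L2/L1 caches; MR x NR is the register tile. */
enum { NC = 4096, KC = 256, MC = 96, MR = 8, NR = 4 };   /* illustrative values */

void gemm_blocked(int m, int n, int k,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc)
{
    for (int jc = 0; jc < n; jc += NC)                       /* 5th loop: panels of B/C  */
      for (int pc = 0; pc < k; pc += KC)                     /* 4th loop: k dimension    */
        for (int ic = 0; ic < m; ic += MC)                   /* 3rd loop: panels of A/C  */
          for (int jr = jc; jr < jc + NC && jr < n; jr += NR)    /* 2nd loop: NR columns */
            for (int ir = ic; ir < ic + MC && ir < m; ir += MR)  /* 1st loop: MR rows    */
              /* "micro-kernel": an MR x NR block of C updated with KC rank-1 updates */
              for (int p = pc; p < pc + KC && p < k; ++p)
                for (int j = jr; j < jr + NR && j < n; ++j)
                  for (int i = ir; i < ir + MR && i < m; ++i)
                    C[i + j*ldc] += A[i + p*lda] * B[p + j*ldb];
}
```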
This paper has two main contributions. First, we propose a novel coding technique, Generalized PolyDot, for matrix-vector products that advances on existing techniques for coded matrix operations under storage and communication constraints. Next, we apply this technique to the problem of training large Deep Neural Networks (DNNs) on unreliable nodes that are prone to soft errors, e.g., bit flips during computation that produce erroneous outputs. An additional difficulty imposed by DNN training is that the parameter values (weight matrices)...
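To make the general idea of coded matrix-vector products concrete, here is a toy sketch using a plain checksum (parity) code, not Generalized PolyDot itself: one extra worker holds the sum of the other workers' blocks of A, so a single lost or corrupted block of A*x can be reconstructed. All sizes and names are illustrative.

```c
/* Toy checksum-coded matrix-vector product: recover a lost block of y = A*x. */
#include <stdio.h>

#define N       4          /* matrix dimension (toy size)              */
#define WORKERS 2          /* systematic workers, N/WORKERS rows each  */

int main(void) {
    double A[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    double x[N]    = {1,1,1,1};
    int rows = N / WORKERS;

    /* Parity block: element-wise sum of the workers' row blocks of A. */
    double P[N/WORKERS][N] = {{0}};
    for (int w = 0; w < WORKERS; ++w)
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < N; ++j)
                P[i][j] += A[w*rows + i][j];

    /* Each worker computes its block of y = A*x; the parity worker computes P*x. */
    double y[N], yp[N/WORKERS];
    for (int i = 0; i < N; ++i) {
        y[i] = 0;
        for (int j = 0; j < N; ++j) y[i] += A[i][j] * x[j];
    }
    for (int i = 0; i < rows; ++i) {
        yp[i] = 0;
        for (int j = 0; j < N; ++j) yp[i] += P[i][j] * x[j];
    }

    /* If worker 0's block of y is lost, recover it from the parity result. */
    for (int i = 0; i < rows; ++i) {
        double recovered = yp[i];
        for (int w = 1; w < WORKERS; ++w) recovered -= y[w*rows + i];
        printf("row %d: recovered %.1f, true %.1f\n", i, recovered, y[i]);
    }
    return 0;
}
```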
In this paper, we address the question of how to automatically map computational kernels to highly efficient code for a wide range of computing platforms and how to establish the correctness of the synthesized code. More specifically, we focus on two fundamental problems that software developers are faced with: performance portability across the ever-changing landscape of parallel platforms, and correctness guarantees for sophisticated floating-point code. The problem is approached as follows: we develop a formal framework to capture algorithms, platforms, and program...
BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra libraries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The systems for which we do so include state-of-the-art general-purpose, low-power, and many-core architectures. We show that, with very little effort, BLIS yields sequential and parallel implementations that are competitive in performance with ATLAS, OpenBLAS (an effort to maintain and extend the GotoBLAS),...
The importance of the Sparse Matrix dense Vector multiplication (SpMV) operation in graph analytics and numerous scientific applications has led to the development of custom accelerators intended to overcome the difficulties of sparse data operations on general-purpose architectures. However, efficient SpMV on large problems (i.e., where the working set exceeds on-chip storage) is severely constrained by its strong dependence on the limited amount of fast random access memory available at scale. Additionally, unstructured matrices with high...
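For reference, a baseline CSR SpMV kernel is shown below (names are illustrative); the irregular, data-dependent gathers from the dense vector via the column indices are the source of the random-access pressure described above.

```c
/* Baseline CSR SpMV: y = A*x for an n_rows x n_cols sparse matrix A. */
void spmv_csr(int n_rows,
              const int *row_ptr, const int *col_idx, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];   /* gather from x: irregular access */
        y[i] = sum;
    }
}
```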
Symmetric tensor operations arise in a wide variety of computations. However, the benefit of exploiting symmetry in order to reduce storage and computation is in conflict with the desire to simplify memory access patterns. In this paper, we propose a blocked data structure (Blocked Compact Storage) wherein we consider the tensor by blocks and store only the unique blocks of a symmetric tensor. We propose an algorithm-by-blocks, already shown to be of benefit for matrix computations, that exploits this format by utilizing a series of temporary tensors to avoid redundant...
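As a down-scaled illustration of "store only the unique blocks" (the matrix case only; the paper targets higher-order tensors, and all names and sizes here are assumptions), an N x N symmetric matrix tiled into blocks can keep just the blocks with block-row <= block-col and transpose indices on access:

```c
/* Blocked compact storage of a symmetric matrix: only upper-triangular blocks kept. */
#define NB 4           /* number of blocks per dimension (assumed) */
#define BS 64          /* block size (assumed)                     */

/* Linear index of stored block (bi, bj), bi <= bj, upper triangle laid out row by row. */
static int block_index(int bi, int bj) {
    return bi * NB - bi * (bi - 1) / 2 + (bj - bi);
}

/* Element (i, j): map to the stored block, swapping indices when below the diagonal. */
double sym_get(const double *blocks, int i, int j) {
    int bi = i / BS, bj = j / BS, ii = i % BS, jj = j % BS;
    if (bi > bj) { int t = bi; bi = bj; bj = t; t = ii; ii = jj; jj = t; }
    return blocks[(long)block_index(bi, bj) * BS * BS + ii * BS + jj];
}
```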
We present two high-performance implementations of the convolution operator via the direct algorithm that outperform the so-called lowering approach, based on the im2col transform plus the gemm kernel, on an ARMv8-based processor. One of our methods presents the additional advantage of zero memory overhead, while the other employs an additional yet rather moderate workspace, substantially smaller than that required by the im2col+gemm solution. In contrast with a previous implementation of a similar direct convolution, this work exhibits the key advantage of preserving...
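For contrast with the lowering approach, a reference direct convolution is sketched below (NCHW layout, unit stride, no padding; the paper's implementations reorganize and vectorize these loops, and the names here are illustrative):

```c
/* Reference direct convolution: out = filt (*) in, no extra workspace. */
void conv_direct(int C_in, int C_out, int H, int W, int K,
                 const float *in,    /* C_in  x H x W             */
                 const float *filt,  /* C_out x C_in x K x K      */
                 float *out)         /* C_out x (H-K+1) x (W-K+1) */
{
    int Ho = H - K + 1, Wo = W - K + 1;
    for (int co = 0; co < C_out; ++co)
        for (int ho = 0; ho < Ho; ++ho)
            for (int wo = 0; wo < Wo; ++wo) {
                float acc = 0.0f;
                for (int ci = 0; ci < C_in; ++ci)
                    for (int kh = 0; kh < K; ++kh)
                        for (int kw = 0; kw < K; ++kw)
                            acc += in[(ci * H + ho + kh) * W + wo + kw] *
                                   filt[((co * C_in + ci) * K + kh) * K + kw];
                out[(co * Ho + ho) * Wo + wo] = acc;
            }
}
```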
The computation of convolution layers in deep neural networks typically relies on high-performance routines that trade space for time by using additional memory (either for packing purposes or as required by the algorithm) to improve performance. The problems with such an approach are two-fold. First, these routines incur a memory overhead which reduces the overall size of the network that can fit on embedded devices with limited memory capacity. Second, these routines were not optimized for performing convolution, which means that the performance obtained is usually less than conventionally...
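The space-for-time trade referred to above is easiest to see in the im2col lowering step, sketched here (illustrative names; same layout assumptions as the direct convolution sketch): the input is expanded into a (C_in*K*K) x (Ho*Wo) matrix so that convolution becomes one gemm, at the cost of a workspace roughly K*K times the input size.

```c
/* im2col: expand the input so that convolution reduces to a single gemm. */
void im2col(int C_in, int H, int W, int K,
            const float *in,   /* C_in x H x W                    */
            float *col)        /* (C_in*K*K) x (Ho*Wo), row-major */
{
    int Ho = H - K + 1, Wo = W - K + 1;
    for (int ci = 0; ci < C_in; ++ci)
        for (int kh = 0; kh < K; ++kh)
            for (int kw = 0; kw < K; ++kw) {
                int row = (ci * K + kh) * K + kw;
                for (int ho = 0; ho < Ho; ++ho)
                    for (int wo = 0; wo < Wo; ++wo)
                        col[row * Ho * Wo + ho * Wo + wo] =
                            in[(ci * H + ho + kh) * W + wo + kw];
            }
}
```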
Cyber-physical systems (CPSs), ranging from critical infrastructures such as power plants, to modern (semi) autonomous vehicles, are systems that use software to control physical processes. CPSs are made up of many different computational components. Each component runs its own piece of software that implements its algorithms, based on its model of the environment. Each component then interacts with other components through the signals and values it sends out. Collectively, these components, and the code they run, drive the complex behaviors society has come...
A theorem related to the accumulation of Householder transformations into a single orthogonal transformation, known as the compact WY transform, is presented. It provides a simple characterization of the computation of this transform and suggests an alternative algorithm for computing it. It also suggests an alternative transformation, the UT transform, with the same utility as the compact WY transform, which requires less computation and has similar stability properties. That transformation was first published over a decade ago but has gone unnoticed by the community.
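For context, a sketch of the two accumulation forms in one common convention; the convention and symbols below are assumptions for illustration, not taken from the abstract.

```latex
% Convention (assumed): H_i = I - \tfrac{1}{\tau_i}\, u_i u_i^{\mathsf T},
% with \tau_i = \tfrac{1}{2} u_i^{\mathsf T} u_i so that each H_i is orthogonal.
% Compact WY accumulation: for U = [\,u_0 \mid \cdots \mid u_{k-1}\,] there is an
% upper triangular T, built column by column via a recurrence, such that
\[
  H_0 H_1 \cdots H_{k-1} \;=\; I - U\,T\,U^{\mathsf T}.
\]
% UT transform: the same product written with an inverse instead,
\[
  H_0 H_1 \cdots H_{k-1} \;=\; I - U\,T_{\mathrm{UT}}^{-1}\,U^{\mathsf T},
  \qquad
  T_{\mathrm{UT}} \;=\; \operatorname{striu}(U^{\mathsf T} U)
                       + \operatorname{diag}(\tau_0,\ldots,\tau_{k-1}),
\]
% i.e. T_{UT} is read off directly from U^{\mathsf T} U rather than accumulated.
```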
Data movement between the processor and the memory hierarchy is a fundamental bottleneck that limits the performance of many applications on modern computer architectures. Tiling and loop permutation are key techniques for improving data locality. However, selecting effective tile-sizes and loop permutations is particularly challenging for tensor contractions due to the large number of loops. Even state-of-the-art compilers usually produce sub-optimal tile-sizes and permutations, as they rely on naïve cost models. In this paper we provide an...
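The two transformations in question are shown below for the simplest contraction, C(i,j) += A(i,k) * B(k,j); the loop order of the tile loops is the permutation choice and Ti/Tj/Tk are the tile sizes, both of which must be chosen per contraction and per machine (the values and names here are illustrative).

```c
/* Tiled, permuted loop nest for C(i,j) += A(i,k) * B(k,j), row-major. */
enum { Ti = 32, Tj = 32, Tk = 32 };   /* illustrative tile sizes */

void contract_tiled(int NI, int NJ, int NK,
                    const double *A, const double *B, double *C)
{
    for (int it = 0; it < NI; it += Ti)
      for (int kt = 0; kt < NK; kt += Tk)
        for (int jt = 0; jt < NJ; jt += Tj)            /* one of many tile-loop permutations */
          for (int i = it; i < it + Ti && i < NI; ++i)
            for (int k = kt; k < kt + Tk && k < NK; ++k)
              for (int j = jt; j < jt + Tj && j < NJ; ++j)
                C[i * NJ + j] += A[i * NK + k] * B[k * NJ + j];
}
```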
Computing systems are evolving rapidly. At the device level, emerging devices are beginning to compete with traditional CMOS systems. At the architecture level, novel architectures are successfully avoiding the communication bottleneck that is a central feature, and limitation, of the von Neumann architecture. Furthermore, such systems are increasingly plagued by unreliability. This unreliability arises at the device or gate level and can percolate up to the processor and system level if left unchecked. The goal of this article is to survey recent...
Linear algebra-based approaches to exact triangle counting often require sparse matrix multiplication as a primitive operation. Non-linear-algebra approaches to the same problem assume that the adjacency matrix of the graph is not available. In this paper, we show that both approaches can be unified into a single approach that separates the data format from the algorithm design. By casting the problem as a matrix multiplication, different algorithms in which each triangle is counted exactly once can be identified. In addition, by choosing the appropriate data format, an algorithm equivalent to compact-forward is attained. We show that our approach yields...
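A dense toy version of the "each triangle counted exactly once" formulation is sketched below: with L the strictly lower-triangular part of the adjacency matrix A, the count is sum((L*L) .* L). Real implementations use a sparse, masked matrix product; this sketch (names assumed) is only to make the formulation concrete.

```c
/* Count triangles of a simple undirected graph given its dense 0/1 adjacency matrix. */
long count_triangles(int n, const unsigned char *A /* n x n, symmetric, zero diagonal */)
{
    long count = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j) {              /* only where L(i,j) = 1 */
            if (!A[i * n + j]) continue;
            long dot = 0;                          /* (L*L)(i,j) = sum_k L(i,k)*L(k,j) */
            for (int k = j + 1; k < i; ++k)
                dot += (A[i * n + k] & A[k * n + j]);
            count += dot;                          /* triangle {j,k,i} counted once */
        }
    return count;
}
```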
In this work we present a performance exploration of Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We address issues related to the load imbalance of parallel tasks in symmetric, triangular graphs by presenting a fine-grained approach to executing the support computation. This approach also increases the available parallelism, making it amenable to GPU execution. We demonstrate our approach using implementations in Kokkos and evaluate them on an Intel Skylake CPU and an Nvidia Tesla V100 GPU. Overall, we observe between...
Graphs play a key role in data analytics. Graphs and the software systems used to work with them are highly diverse. Algorithms interact with hardware in different ways, and which graph solution works best on a given platform changes with the structure of the graph. This makes it difficult to decide which programming framework is best for a given situation. In this paper, we try to make sense of this diverse landscape. We evaluate five frameworks for graph analytics: SuiteSparse GraphBLAS, Galois, the NWGraph library, the Graph Kernel Collection, and GraphIt. We use the GAP Benchmark...
Edge-centric algorithms using the linear algebraic approach typically require the use of both incidence and adjacency matrices. Due to the two different data formats, information contained in the graph is replicated, thereby incurring time and space overheads for the replication. Using K-truss as an example, we demonstrate that an edge-centric algorithm, Eager K-truss, can be identified from a formulation that uses only the adjacency matrix. In addition, our implementation of the algorithm out-performs Galois' by an average (over 53 graphs) of more than 13 times, and by up to 71 times.
We propose a coded computing strategy for the Fast Fourier Transform (FFT) algorithm in a fully distributed setting, which does not have a powerful master node orchestrating the worker nodes. The setting requires a large amount of data movement between nodes, and this communication is often the bottleneck in parallel computing. We identify the cost of each step of the FFT using the α-β model, commonly used in the high-performance computing literature to estimate communication latency. We show that by using a (P, K) systematic MDS code, the overhead of coding is negligible...
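For readers unfamiliar with it, the α-β model charges each message by a latency term plus a bandwidth term; the per-step costs the paper accumulates are sums of such terms along the critical path (the specifics of the (P, K) code are not reproduced from the truncated abstract).

```latex
% alpha-beta (latency-bandwidth) model: sending a message of $n$ words costs
\[
  T(n) \;=\; \alpha + \beta\, n ,
\]
% where $\alpha$ is the per-message latency and $\beta$ the per-word
% (inverse-bandwidth) cost. The total cost of a distributed algorithm is the
% sum of these terms over all communication steps on the critical path, which
% is how the coded and uncoded FFT variants are compared.
```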
We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use of loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into...
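A generic illustration (not the FLAME code itself, and process_block is a hypothetical routine) of the workqueuing idea in today's OpenMP terms: independent block operations are enqueued as tasks and scheduled by the runtime, with no explicit index-based partitioning of the loop. The paper predates the standard task construct and used the experimental taskq pragma instead.

```c
/* Task-based parallelism over matrix blocks, without explicit index partitioning. */
#include <omp.h>

void process_block(double *block, int size);   /* hypothetical per-block operation */

void apply_to_blocks(double **blocks, int nblocks, int block_size)
{
    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < nblocks; ++b) {
            double *blk = blocks[b];
            #pragma omp task firstprivate(blk)
            process_block(blk, block_size);
        }
    }   /* tasks complete at the implicit barrier closing the single region */
}
```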
PageRank is an important vertex-ranking algorithm that suffers from poor performance and efficiency due to its notorious memory access behavior. Furthermore, as graphs become bigger and sparser, applications are inhibited because most current solutions rely profoundly on large amounts of fast random-access memory, which is not easily scalable. In this paper we present a 16nm ASIC-based shared memory platform for the implementation of PageRank that fundamentally accelerates Sparse Matrix dense Vector multiplication (SpMV), the core kernel of PageRank....
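The role of SpMV is visible in a reference power iteration for PageRank, sketched below (simplified: dangling nodes ignored, CSR-stored transition matrix, illustrative names; d is the damping factor). The inner CSR loop is the SpMV that dominates the run time and exhibits the irregular memory accesses described above.

```c
/* Simplified PageRank power iteration built around a CSR SpMV. */
#include <stdlib.h>

void pagerank(int n, const int *row_ptr, const int *col_idx, const double *val,
              double *rank, int iters, double d)
{
    double *next = malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) rank[i] = 1.0 / n;
    for (int t = 0; t < iters; ++t) {
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;                       /* SpMV row: irregular gathers of rank[] */
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                sum += val[k] * rank[col_idx[k]];
            next[i] = (1.0 - d) / n + d * sum;      /* damping / teleport term */
        }
        for (int i = 0; i < n; ++i) rank[i] = next[i];
    }
    free(next);
}
```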
Software tools for linkage disequilibrium (LD) analyses are designed to calculate LD among all genetic variants in a single region. Since compute and memory requirements grow quadratically with the distance between variants, using these tools for long-range LD calculations leads to long execution times and an increased allocation of computational resources. Furthermore, widely used tools do not fully utilize the computational resources of modern processors and/or graphics processing cards, limiting future large-scale analyses on thousands of samples....
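For reference, the standard pairwise LD statistics such tools compute for a pair of biallelic variants are:

```latex
% Allele frequencies $p_A$, $p_B$ and haplotype frequency $p_{AB}$ give
\[
  D \;=\; p_{AB} - p_A\, p_B ,
  \qquad
  r^2 \;=\; \frac{D^2}{p_A(1 - p_A)\, p_B(1 - p_B)} .
\]
% Evaluating $r^2$ for all pairs of $m$ variants is an $O(m^2)$ workload,
% which is why cost grows quadratically with the genomic range considered.
```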
Current microprocessor trends show a steady increase in the number of cores and/or threads present on the same CPU die. While this improves performance for compute-bound applications, the benefits for memory-bound applications are limited. The discrete Fourier transform (DFT) is an example of such an application, where increasing the number of cores does not yield a corresponding increase in performance. In this paper, we present an alternate solution for using the increased number of cores/threads available on a typical multicore system. We propose to repurpose some of the cores as soft...
Implementing complex arithmetic routines with Single Instruction Multiple Data (SIMD) instructions requires the use of instructions that are usually not found in their real-arithmetic counterparts. These instructions, such as shuffles and addsubs, are often bottlenecks for many kernels, as modern architectures can execute more arithmetic operations than such data-reorganization operations. In this work, we focus on using a variety of data layouts (mixed format) for storing complex numbers at different stages of the computation so as to limit the use of these instructions. Using matrix...
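The layout effect is easy to see in plain C (illustrative sketch, not the paper's kernels): with the interleaved (re,im,re,im,...) format a vectorized complex multiply needs shuffle/addsub-style reorganization, whereas the split format turns the same multiply into pure element-wise vector arithmetic.

```c
/* Interleaved layout: a[2*i] = Re(a_i), a[2*i+1] = Im(a_i). */
void cmul_interleaved(int n, const float *a, const float *b, float *c) {
    for (int i = 0; i < n; ++i) {
        float ar = a[2*i], ai = a[2*i+1], br = b[2*i], bi = b[2*i+1];
        c[2*i]   = ar * br - ai * bi;    /* vectorizing this needs lane shuffles */
        c[2*i+1] = ar * bi + ai * br;
    }
}

/* Split layout: separate arrays of real and imaginary parts.
 * Each output is an element-wise expression over whole arrays, so SIMD code
 * needs no cross-lane data reorganization. */
void cmul_split(int n, const float *ar, const float *ai,
                const float *br, const float *bi, float *cr, float *ci) {
    for (int i = 0; i < n; ++i) {
        cr[i] = ar[i] * br[i] - ai[i] * bi[i];
        ci[i] = ar[i] * bi[i] + ai[i] * br[i];
    }
}
```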
We propose FFTX, a new framework for building high-performance FFT-based applications on exascale machines. Complex node architectures lead to multiple levels of parallelism and demand efficient ways of managing data communication. The current FFTW interface falls short in maximizing performance in such scenarios. FFTX is designed to enable application developers to leverage expert-level, automatic optimizations while navigating a familiar interface. It is backwards compatible with FFTW and extends the FFTW interface into an embedded...
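The "familiar interface" is FFTW's plan/execute style, shown below with standard FFTW 3 calls (FFTX's own extensions are not reproduced from the truncated abstract); the separation of an expensive planning step from execution is the hook such frameworks build on.

```c
/* Standard FFTW 3 plan/execute usage for a 1-D complex forward transform. */
#include <fftw3.h>

void forward_fft(int n)
{
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    /* Planning: the optimization step, done once and reused across executions. */
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    /* ... fill `in` with data ... */
    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
}
```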
Genomic datasets are steadily growing in size as more genomes are sequenced and new genetic variants are discovered. Datasets that comprise thousands of samples and millions of single-nucleotide polymorphisms (SNPs) exhibit excessive computational demands and can lead to prohibitively long analyses, making the deployment of high-performance approaches a prerequisite for the thorough analysis of current and future large-scale datasets. In this work, we demonstrate a kernel for calculating linkage disequilibrium (LD) across genomes, i.e.,...