- Parallel Computing and Optimization Techniques
- Advanced Neural Network Applications
- Stochastic Gradient Optimization Techniques
- Matrix Theory and Algorithms
- Ferroelectric and Negative Capacitance Devices
- Advanced Memory and Neural Computing
- Advanced Data Storage Technologies
- Brain Tumor Detection and Classification
- Logic, programming, and type systems
- Scheduling and Optimization Algorithms
- Embedded Systems Design Techniques
- Machine Learning in Materials Science
- Tensor decomposition and applications
- Low-power high-performance VLSI design
- Interconnection Networks and Systems
- VLSI and FPGA Design Techniques
- Sparse and Compressive Sensing Techniques
- Formal Methods in Verification
- Numerical Methods and Algorithms
- Quantum Computing Algorithms and Architecture
Cornell University (2025)
University of California, Berkeley (2020-2024)
Berkeley College (2021-2024)
University of California System (2022)
Recent advances in Deep Neural Networks (DNNs) have led to active development of specialized DNN accelerators, many of which feature a large number of processing elements laid out spatially, together with a multi-level memory hierarchy and a flexible interconnect. While such accelerators can take advantage of data reuse to achieve high peak throughput, they also expose runtime parameters to the programmers, who need to explicitly manage how computation is scheduled both spatially and temporally. In fact, different...
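As a rough illustration of what "scheduling spatially and temporally" means, the sketch below enumerates ways to split two GEMM-like loop dimensions between a hypothetical PE array and a temporal loop. The function names, the `num_pes` parameter, and the toy cost model are assumptions for illustration only, not the scheduler described in the work above.

```python
# Minimal sketch (assumed setup, not the paper's tool): enumerate spatial/temporal
# factorizations of a 64x64 loop nest onto a hypothetical 16-PE accelerator.
from itertools import product

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def enumerate_mappings(M=64, N=64, num_pes=16):
    """Yield (spatial_m, spatial_n, temporal_m, temporal_n) factorizations."""
    for sm, sn in product(divisors(M), divisors(N)):
        if sm * sn <= num_pes:              # spatial tile must fit on the PE array
            yield sm, sn, M // sm, N // sn  # remaining iterations run temporally

# Toy objective: pick the mapping with the fewest temporal steps.
best = min(enumerate_mappings(), key=lambda m: m[2] * m[3])
print("spatial (m,n) =", best[:2], "temporal (m,n) =", best[2:])
```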
Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications, and this trend has been consistent over the several years since they were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, which has made their deployment in latency-sensitive applications challenging. As such, there is an increased focus on making these models more...
The optimization of matrix multiplication (or GEMM) has been a need during the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel oneAPI because of its widespread use in a large variety of scientific applications. GEMM is usually implemented following the GotoBLAS philosophy, which tiles the GEMM operands and uses a series of nested loops for performance improvement. These approaches extract the maximum computational power of the architectures through small pieces...
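A minimal sketch of the blocked loop structure this philosophy refers to is given below. The block sizes `MC`, `NC`, `KC` are illustrative placeholders; real libraries choose them to fit the cache hierarchy and replace the innermost block update with a packed micro-kernel, which this sketch omits.

```python
# Minimal sketch of a GotoBLAS-style blocked GEMM (C += A @ B).
# Block sizes are illustrative assumptions, not the values a tuned library uses.
import numpy as np

def blocked_gemm(A, B, C, MC=64, NC=64, KC=64):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and C.shape == (M, N)
    for jc in range(0, N, NC):            # loop over column panels of B and C
        for pc in range(0, K, KC):        # loop over panels of the K dimension
            for ic in range(0, M, MC):    # loop over row panels of A and C
                # innermost block update (packing and micro-kernel omitted)
                C[ic:ic+MC, jc:jc+NC] += A[ic:ic+MC, pc:pc+KC] @ B[pc:pc+KC, jc:jc+NC]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((128, 96)), rng.standard_normal((96, 160))
C = np.zeros((128, 160))
assert np.allclose(blocked_gemm(A, B, C), A @ B)
```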
In the hardware design space exploration process, it is critical to optimize both hardware parameters and algorithm-to-hardware mappings. Previous work has largely approached this simultaneous optimization problem by exploring the hardware design space and the mapspace, both individually large and highly nonconvex spaces, independently. The resulting combinatorial explosion has created significant difficulties for optimizers.
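For a sense of scale, the back-of-the-envelope calculation below multiplies assumed sizes for the two spaces; the specific counts are made up for illustration and are not taken from the work above.

```python
# Illustrative only: assumed 8 hardware parameters with ~10 values each and
# 7 mapping decisions with ~20 values each. The joint space dwarfs either alone.
hw_space  = 10 ** 8        # hardware parameter combinations (assumed)
map_space = 20 ** 7        # mapping combinations for one workload (assumed)
print(f"hardware space: {hw_space:.1e}, mapspace: {map_space:.1e}, "
      f"joint space: {hw_space * map_space:.1e}")
```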
Reducing communication, either between levels of a memory hierarchy or between processors over a network, is a key component of performance optimization (in both time and energy) for many nested-loop problems, including dense linear algebra, particle interactions, and machine learning. Previous tiling-based approaches to these problems have been used to find lower bounds on the communication required to execute them and optimal rearrangements, or blockings, that attain such bounds. However, such approaches have typically assumed the problem sizes...
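The classical example of such a bound is dense matrix multiplication: with a fast memory of S words, roughly M·N·K/√S words must move between slow and fast memory, and a square blocking attains this up to a constant factor. The sketch below compares that bound with the traffic of a simple blocking; the tile-size choice and traffic count are the standard textbook estimates, not results specific to the work above.

```python
# Illustrative comparison: matmul communication lower bound vs. blocked traffic.
from math import sqrt, ceil

def matmul_lower_bound(M, N, K, S):
    return M * N * K / sqrt(S)             # asymptotic lower bound (words moved)

def blocked_traffic(M, N, K, S):
    b = int(sqrt(S / 3))                   # tile side so three b x b tiles fit in S
    tiles = ceil(M / b) * ceil(N / b) * ceil(K / b)
    return tiles * 2 * b * b               # each tile multiply reads two b x b tiles

M = N = K = 1024
S = 32 * 1024                              # fast-memory capacity in words
print(matmul_lower_bound(M, N, K, S), blocked_traffic(M, N, K, S))
```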
Efficiently executing convolutional neural nets (CNNs) is important in many machine-learning tasks. Since the cost of moving a word of data, either between levels of a memory hierarchy or between processors over a network, is much higher than the cost of an arithmetic operation, minimizing data movement is critical to performance optimization. In this paper, we present both new lower bounds on the data movement needed for CNNs and optimal sequential algorithms that attain these bounds. In the most common cases, our algorithms can attain significantly more data reuse than matrix...
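A quick way to see why convolutions admit more reuse than matrix multiplication is to compare arithmetic against the operand footprint for a typical layer. The layer sizes below are assumptions chosen for illustration; the counts are the standard FLOP and footprint formulas, not the lower bounds derived in the paper.

```python
# Rough reuse estimate for a stride-1 convolution layer (assumed sizes).
def conv_stats(B=8, C=64, K=128, H=56, W=56, R=3, S=3):
    flops = 2 * B * K * H * W * C * R * S               # multiply-adds counted as 2 flops
    words = B*C*(H+R-1)*(W+S-1) + K*C*R*S + B*K*H*W     # input + filters + output
    return flops, words, flops / words                  # arithmetic intensity

flops, words, intensity = conv_stats()
print(f"{flops:.2e} flops over {words:.2e} words -> reuse ~{intensity:.0f}x")
```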
The standardization of an interface for dense linear algebra operations in the BLAS standard has enabled interoperability between different libraries, thereby boosting the success of scientific computing, in particular in HPC. Despite numerous efforts in the past, the community has not yet agreed on a sparse counterpart, for several reasons. One is the fact that sparse objects allow many storage formats, and different hardware may favor different formats. This makes the definition of a FORTRAN-style, all-circumventing interface extremely challenging. Another reason, as opposed...
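The format-diversity point is easy to see with a tiny example: the same matrix stored as COO triplets and as CSR arrays, two of the most common layouts, shown in the sketch below. The example matrix is arbitrary and the conversion is written out by hand purely for illustration.

```python
# The same sparse matrix in two common layouts: COO triplets and CSR arrays.
import numpy as np

dense = np.array([[5, 0, 0],
                  [0, 0, 3],
                  [2, 0, 4]])

# COO: parallel (row, col, value) triplets
rows, cols = np.nonzero(dense)
vals = dense[rows, cols]

# CSR: values and column indices plus a row-pointer array
row_ptr = np.concatenate(([0], np.cumsum(np.bincount(rows, minlength=dense.shape[0]))))
print("COO:", list(zip(rows.tolist(), cols.tolist(), vals.tolist())))
print("CSR: data =", vals.tolist(), "indices =", cols.tolist(), "indptr =", row_ptr.tolist())
```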
Convolutional neural networks (CNNs) are important in a wide variety of machine learning tasks and applications, so optimizing their performance is essential. Since moving words of data between levels of a memory hierarchy or between processors on a network is much more expensive than the cost of arithmetic, minimizing communication is critical to performance. In this paper, we present new lower bounds on data movement for mixed precision convolutions in both single-processor and parallel distributed memory models, as well as algorithms that...
Reducing communication, either between levels of a memory hierarchy or between processors over a network, is a key component of performance optimization (in both time and energy) for many problems, including dense linear algebra, particle interactions, and machine learning. For problems that can be represented as nested-loop computations, previous tiling-based approaches have been used to find lower bounds on the communication required to execute them and optimal rearrangements, or blockings, that attain such bounds. However, general...