- Advanced Memory and Neural Computing
- Advanced Neural Network Applications
- Ferroelectric and Negative Capacitance Devices
- Parallel Computing and Optimization Techniques
- Evolutionary Algorithms and Applications
- Interconnection Networks and Systems
- IoT and Edge/Fog Computing
- Neural Networks and Applications
- CCD and CMOS Imaging Sensors
- Age of Information Optimization
- Reinforcement Learning in Robotics
- Particle Accelerators and Beam Dynamics
- Embedded Systems Design Techniques
- Particle Detector Development and Performance
- Advanced Data Storage Technologies
- Energy Harvesting in Wireless Networks
- Distributed and Parallel Computing Systems
- 3D IC and TSV Technologies
- Low-Power High-Performance VLSI Design
- Context-Aware Activity Recognition Systems
- Computability, Logic, AI Algorithms
- Advanced Database Systems and Queries
- Blind Source Separation Techniques
- Advanced Manufacturing and Logistics Optimization
- Modular Robots and Swarm Intelligence
IBM (United States)
2024
Georgia Institute of Technology
2017-2023
The advent of Deep Learning (DL) has radically transformed the computing industry across the entire spectrum, from algorithms to circuits. As myriad application domains embrace DL, it has become synonymous with a genre of workloads spanning vision, speech, language, recommendations, robotics, and games. The key compute kernel within most DL workloads is the general matrix-matrix multiplication (GEMM), which appears frequently during both the forward pass (inference and training) and the backward pass (training). GEMMs are a natural choice for...
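The reduction of both passes to GEMMs described above can be sketched in a few lines of NumPy. This is an illustrative example with made-up layer sizes, not code from the work itself:

```python
import numpy as np

# Hypothetical sizes: batch of 4 inputs, 8 features, 3 output neurons.
batch, in_dim, out_dim = 4, 8, 3
X = np.random.rand(batch, in_dim)    # activations
W = np.random.rand(in_dim, out_dim)  # weights

# Forward pass of a fully-connected layer is a single GEMM: Y = X @ W.
Y = X @ W

# The backward pass is also expressed as GEMMs:
dY = np.random.rand(batch, out_dim)  # upstream gradient
dW = X.T @ dY                        # weight gradient
dX = dY @ W.T                        # input gradient

assert Y.shape == (batch, out_dim)
assert dW.shape == W.shape and dX.shape == X.shape
```

Convolutions are commonly lowered to the same GEMM form (e.g., via im2col), which is why GEMM efficiency dominates accelerator design.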
Deep neural networks (DNNs) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and the need for high energy efficiency have led to a surge in research on hardware accelerators. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PEs) operating in parallel and communicating with each other directly...
Systolic arrays are one of the most popular compute substrates within deep learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications. However, the research community lacks tools to gain insights into both design trade-offs and efficient mapping strategies for systolic-array-based accelerators. We introduce the Systolic CNN Accelerator Simulator (SCALE-Sim), a configurable, cycle-accurate systolic-array DNN accelerator simulator. SCALE-Sim exposes various...
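The kind of trade-off such a simulator exposes can be illustrated with a first-order analytical model. The sketch below is not SCALE-Sim's actual equations; it assumes an output-stationary array where each output tile costs K accumulation cycles plus roughly one fill/drain skew of the array:

```python
import math

def systolic_cycles(M, K, N, R=32, C=32):
    """First-order cycle estimate for an output-stationary R x C systolic
    array computing an (M x K) @ (K x N) GEMM. Illustrative model only:
    each of the ceil(M/R)*ceil(N/C) output tiles takes K accumulation
    cycles plus roughly R + C - 2 cycles of pipeline fill/drain skew."""
    tiles = math.ceil(M / R) * math.ceil(N / C)
    cycles_per_tile = K + R + C - 2
    return tiles * cycles_per_tile

def utilization(M, K, N, R=32, C=32):
    """Fraction of peak throughput used: ideal MAC operations over
    provisioned PE-cycles."""
    ideal = M * K * N  # total MAC operations in the GEMM
    provisioned = systolic_cycles(M, K, N, R, C) * R * C
    return ideal / provisioned

# A large square GEMM keeps the array busy; a skinny one does not.
print(round(utilization(1024, 1024, 1024), 3))  # -> 0.943
print(round(utilization(1024, 1024, 8), 3))     # -> 0.236
```

Even this crude model shows why mapping strategy matters: skinny GEMM shapes leave most of the array idle, which is exactly the class of insight a cycle-accurate simulator quantifies precisely.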
The compute demand of deep learning workloads is well known and is a prime motivator for powerful parallel computing platforms such as GPUs or dedicated hardware accelerators. The massive inherent parallelism of these workloads enables us to extract more performance by simply provisioning more compute for a given task. This strategy can be exploited directly to build higher-performing systems for DNN workloads by incorporating as many compute units as possible in a single system, referred to as scaling up. Alternatively, it is feasible to arrange multiple systems to work on...
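The scale-up versus scale-out trade-off can be sketched with a toy latency model. All numbers and the fixed synchronization cost below are illustrative assumptions, not measurements from the work:

```python
def scale_up_time(work, units):
    """Idealized scale-up: all compute units share one system, so the
    model assumes perfect parallel speedup and no inter-system traffic."""
    return work / units

def scale_out_time(work, systems, units_per_system, sync_overhead):
    """Idealized scale-out: work is split across systems, but each pays a
    fixed synchronization cost (e.g., a gradient all-reduce per step)."""
    return work / (systems * units_per_system) + sync_overhead

# 16 units total: one big system (scale-up) vs. 4 systems of 4 (scale-out).
work = 1600.0
print(scale_up_time(work, 16))           # -> 100.0
print(scale_out_time(work, 4, 4, 25.0))  # -> 125.0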
Applications across image processing, speech recognition, and classification rely heavily on neural-network-based algorithms that have demonstrated highly promising accuracy. However, such algorithms involve massive computations that are not manageable on general-purpose processors. To cope with this challenge, spatial-architecture-based accelerators, which consist of an array of hundreds of processing elements (PEs), have emerged. These accelerators achieve high throughput by exploiting parallelism over the PEs;...
DSP48s, BRAMs, and URAMs in the Xilinx UltraScale+ family support dedicated cascade interconnect for high-frequency, nearest-neighbor data movement over hard wiring resources. We demonstrate how to leverage these structures effectively for the requirements of dense machine learning (ML) workloads at the URAM-limited frequency of 650 MHz (714 MHz reported by Vivado). We reformulate convolution and matrix-vector multiplication operations to make effective use of (1) DSP48s supporting common multiply-accumulate chains,...
Modern deep learning systems rely on (a) a hand-tuned neural network topology, (b) massive amounts of labeled training data, and (c) extensive training over large-scale compute resources to build a system that can perform efficient image classification or speech recognition. Unfortunately, we are still far from implementing adaptive general-purpose intelligent agents, which would need to learn autonomously in unknown environments and may not have access to some or any of these three components. Reinforcement and evolutionary...
This work demonstrates a scalable reconfigurable accelerator (RA) architecture designed to extract maximum performance and energy efficiency for GEMM workloads. We also present a self-adaptive (SA) unit, which runs a learnt model for one-shot configuration optimization in hardware, offloading the software stack and thus easing deployment of the proposed design. We evaluate an instance of the methodology with a 32.768-TOPS reference implementation called SAGAR, which can provide the same mapping flexibility as compute...
The high computational demands of deep neural networks (DNNs), coupled with their pervasiveness across cloud and IoT platforms, have led to the emergence of DNN accelerators employing hundreds of processing elements (PEs). Most are optimized for regular mapping problems, or dataflows, emanating from the dense matrix multiplications in convolutional layers. However, continuous innovations, including myriad layer types/shapes, cross-layer fusion, and sparsity, induce irregular dataflows within accelerators, which...
As AI-based applications become pervasive, CPU vendors are starting to incorporate matrix engines within the datapath to boost efficiency. Systolic arrays have been the premier architectural choice for such engines in offload accelerators. However, we demonstrate that incorporating them inside CPUs can introduce under-utilization and stalls due to limited register storage to amortize the fill and drain times of the array. To address this, we propose RASA, a Register-Aware Systolic Array. We develop techniques to divide an execution stage into...
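The fill/drain amortization problem can be made concrete with a small efficiency model. This is a hypothetical first-order sketch, not RASA's actual analysis; it assumes the array pays roughly one full skew of fill and one of drain per issued instruction:

```python
def engine_efficiency(rows, cols, steady_cycles):
    """Efficiency of one systolic-array instruction issued from a CPU:
    useful steady-state cycles over total cycles, where fill plus drain
    cost roughly 2 * (rows + cols - 1) cycles in this toy model."""
    overhead = 2 * (rows + cols - 1)  # pipeline fill + drain skew
    return steady_cycles / (steady_cycles + overhead)

# With few registers, each instruction carries only a short burst of work,
# so fill/drain dominates; more register storage amortizes the overhead.
print(round(engine_efficiency(8, 8, 8), 2))    # short burst   -> 0.21
print(round(engine_efficiency(8, 8, 128), 2))  # amortized run -> 0.81
```

The gap between the two numbers is precisely the under-utilization the abstract attributes to limited register storage.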
This Synthesis Lecture focuses on techniques for efficient data orchestration within DNN accelerators. The end of Moore's Law, coupled with the increasing growth in deep learning and other AI applications...
The everlasting demand for higher computing power for deep neural networks (DNNs) drives the development of parallel architectures. 3D integration, in which chips are integrated and connected vertically, can further increase performance because it introduces another level of spatial parallelism. We therefore analyze the dataflows, performance, area, and temperature of such 3D DNN accelerators, comparing monolithic and TSV-based stacked 3D-ICs against 2D-ICs. We identify workload properties and architectural...
The remarkable advancements in artificial intelligence (AI), primarily driven by deep neural networks, are facing challenges surrounding unsustainable computational trajectories, limited robustness, and a lack of explainability. To develop next-generation cognitive AI systems, neuro-symbolic AI emerges as a promising paradigm, fusing neural and symbolic approaches to enhance interpretability and trustworthiness while facilitating learning from much less data. Recent neuro-symbolic systems have demonstrated great potential...
Recent advancements in machine learning algorithms, especially the development of Deep Neural Networks (DNNs), have transformed the landscape of Artificial Intelligence (AI). With every passing day, deep learning based methods are applied to solve new problems with exceptional results. The portal to the real world is the edge: the true impact of AI can only be fully realized if we have agents continuously interacting with the world and solving everyday problems. Unfortunately, the high compute and memory requirements of DNNs act as a huge barrier towards...
Design space exploration and optimization is an essential but iterative step in custom accelerator design, involving a costly search-based method to extract maximum performance and energy efficiency. State-of-the-art methods employ data-centric approaches to reduce the cost of each iteration but still rely on search algorithms to obtain the optima. This work proposes a learned, constant-time optimizer that uses a recommendation network called AIrchitect, which is capable of learning the architecture mapping space with 94.3% test...