- Advanced Memory and Neural Computing
- Advanced Neural Network Applications
- Ferroelectric and Negative Capacitance Devices
- Parallel Computing and Optimization Techniques
- Evolutionary Algorithms and Applications
- Interconnection Networks and Systems
- IoT and Edge/Fog Computing
- Neural Networks and Applications
- CCD and CMOS Imaging Sensors
- Age of Information Optimization
- Reinforcement Learning in Robotics
- Particle Accelerators and Beam Dynamics
- Embedded Systems Design Techniques
- Particle Detector Development and Performance
- Advanced Data Storage Technologies
- Energy Harvesting in Wireless Networks
- Distributed and Parallel Computing Systems
- 3D IC and TSV Technologies
- Low-Power High-Performance VLSI Design
- Context-Aware Activity Recognition Systems
- Computability, Logic, AI Algorithms
- Advanced Database Systems and Queries
- Blind Source Separation Techniques
- Advanced Manufacturing and Logistics Optimization
- Modular Robots and Swarm Intelligence
IBM (United States)
2024
Georgia Institute of Technology
2017-2023
The advent of Deep Learning (DL) has radically transformed the computing industry across the entire spectrum, from algorithms to circuits. As myriad application domains embrace DL, it has become synonymous with a genre of workloads spanning vision, speech, language, recommendations, robotics, and games. The key compute kernel within most DL workloads is the general matrix-matrix multiplication (GEMM), which appears frequently during both the forward pass (inference and training) and the backward pass (training). GEMMs are a natural choice for...
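The reduction of both passes to GEMMs described above can be sketched in a few lines of NumPy. This is an illustrative example with made-up layer sizes, not code from the work itself:

```python
import numpy as np

# Hypothetical sizes: batch of 4 inputs, 8 features, 3 output neurons.
batch, in_dim, out_dim = 4, 8, 3
X = np.random.rand(batch, in_dim)    # activations
W = np.random.rand(in_dim, out_dim)  # weights

# Forward pass of a fully-connected layer is a single GEMM: Y = X @ W.
Y = X @ W

# The backward pass is also expressed as GEMMs:
dY = np.random.rand(batch, out_dim)  # upstream gradient
dW = X.T @ dY                        # weight gradient
dX = dY @ W.T                        # input gradient

assert Y.shape == (batch, out_dim)
assert dW.shape == W.shape and dX.shape == X.shape
```

Convolutions are commonly lowered to the same GEMM form (e.g., via im2col), which is why GEMM efficiency dominates accelerator design.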
Deep neural networks (DNNs) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and the need for high energy efficiency have led to a surge in research on hardware accelerators. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PEs) operating in parallel and communicating with each other directly...
Systolic arrays are one of the most popular compute substrates within deep learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications. However, the research community lacks tools to gain insights into both design trade-offs and efficient mapping strategies for systolic-array-based accelerators. We introduce the Systolic CNN Accelerator Simulator (SCALE-Sim), a configurable, cycle-accurate systolic-array DNN accelerator simulator. SCALE-Sim exposes various...
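The kind of trade-off such a simulator exposes can be illustrated with a first-order analytical model. The sketch below is not SCALE-Sim's actual equations; it assumes an output-stationary array where each output tile costs K accumulation cycles plus roughly one fill/drain skew of the array:

```python
import math

def systolic_cycles(M, K, N, R=32, C=32):
    """First-order cycle estimate for an output-stationary R x C systolic
    array computing an (M x K) @ (K x N) GEMM. Illustrative model only:
    each of the ceil(M/R)*ceil(N/C) output tiles takes K accumulation
    cycles plus roughly R + C - 2 cycles of pipeline fill/drain skew."""
    tiles = math.ceil(M / R) * math.ceil(N / C)
    cycles_per_tile = K + R + C - 2
    return tiles * cycles_per_tile

def utilization(M, K, N, R=32, C=32):
    """Fraction of peak throughput used: ideal MAC operations over
    provisioned PE-cycles."""
    ideal = M * K * N  # total MAC operations in the GEMM
    provisioned = systolic_cycles(M, K, N, R, C) * R * C
    return ideal / provisioned

# A large square GEMM keeps the array busy; a skinny one does not.
print(round(utilization(1024, 1024, 1024), 3))  # -> 0.943
print(round(utilization(1024, 1024, 8), 3))     # -> 0.236
```

Even this crude model shows why mapping strategy matters: skinny GEMM shapes leave most of the array idle, which is exactly the class of insight a cycle-accurate simulator quantifies precisely.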
The compute demand of deep learning workloads is well known and is a prime motivator for powerful parallel computing platforms such as GPUs or dedicated hardware accelerators. The massive inherent parallelism of these workloads enables us to extract more performance by simply provisioning more compute for a given task. This strategy can be exploited directly to build higher-performing systems for DNN workloads by incorporating as many compute units as possible in a single system, referred to as scaling up. Alternatively, it is feasible to arrange multiple systems to work on...
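The scale-up versus scale-out trade-off can be sketched with a toy latency model. All numbers and the fixed synchronization cost below are illustrative assumptions, not measurements from the work:

```python
def scale_up_time(work, units):
    """Idealized scale-up: all compute units share one system, so the
    model assumes perfect parallel speedup and no inter-system traffic."""
    return work / units

def scale_out_time(work, systems, units_per_system, sync_overhead):
    """Idealized scale-out: work is split across systems, but each pays a
    fixed synchronization cost (e.g., a gradient all-reduce per step)."""
    return work / (systems * units_per_system) + sync_overhead

# 16 units total: one big system (scale-up) vs. 4 systems of 4 (scale-out).
work = 1600.0
print(scale_up_time(work, 16))           # -> 100.0
print(scale_out_time(work, 4, 4, 25.0))  # -> 125.0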
Applications across image processing, speech recognition, and classification rely heavily on neural-network-based algorithms that have demonstrated highly promising accuracy. However, such algorithms involve massive computations that are not manageable on general-purpose processors. To cope with this challenge, spatial-architecture-based accelerators, which consist of an array of hundreds of processing elements (PEs), have emerged. These accelerators achieve high throughput by exploiting parallelism over the PEs;...
DSP48s, BRAMs, and URAMs in the Xilinx UltraScale+ family support dedicated cascade interconnect for high-frequency, nearest-neighbor data movement over hard wiring resources. We demonstrate how to leverage these structures effectively for the requirements of dense machine learning (ML) workloads at the URAM-limited frequency of 650 MHz (714 MHz reported by Vivado). We reformulate convolution and matrix-vector multiplication operations to make effective use of (1) DSP48s supporting common multiply-accumulate chains,...
Modern deep learning systems rely on (a) a hand-tuned neural network topology, (b) massive amounts of labeled training data, and (c) extensive training over large-scale compute resources to build a system that can perform efficient image classification or speech recognition. Unfortunately, we are still far from implementing adaptive general-purpose intelligent agents, which would need to learn autonomously in unknown environments and may not have access to some or any of these three components. Reinforcement and evolutionary...
This work demonstrates a scalable reconfigurable accelerator (RA) architecture designed to extract maximum performance and energy efficiency for GEMM workloads. We also present a self-adaptive (SA) unit, which runs a learnt model for one-shot configuration optimization in hardware, offloading the software stack and thus easing deployment of the proposed design. We evaluate an instance of the methodology with a 32.768-TOPS reference implementation called SAGAR, which can provide the same mapping flexibility as compute...
The high computational demands of deep neural networks (DNNs), coupled with their pervasiveness across cloud and IoT platforms, have led to the emergence of DNN accelerators employing hundreds of processing elements (PEs). Most are optimized for regular mapping problems, or dataflows, emanating from the dense matrix multiplications in convolutional layers. However, continuous innovations, including myriad layer types/shapes, cross-layer fusion, and sparsity, induce irregular dataflows within accelerators, which...
As AI-based applications become pervasive, CPU vendors are starting to incorporate matrix engines within the datapath to boost efficiency. Systolic arrays have been the premier architectural choice for such engines in offload accelerators. However, we demonstrate that incorporating them inside CPUs can introduce under-utilization and stalls due to limited register storage to amortize the fill and drain times of the array. To address this, we propose RASA, a Register-Aware Systolic Array. We develop techniques to divide an execution stage into...
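The fill/drain amortization problem can be made concrete with a small efficiency model. This is a hypothetical first-order sketch, not RASA's actual analysis; it assumes the array pays roughly one full skew of fill and one of drain per issued instruction:

```python
def engine_efficiency(rows, cols, steady_cycles):
    """Efficiency of one systolic-array instruction issued from a CPU:
    useful steady-state cycles over total cycles, where fill plus drain
    cost roughly 2 * (rows + cols - 1) cycles in this toy model."""
    overhead = 2 * (rows + cols - 1)  # pipeline fill + drain skew
    return steady_cycles / (steady_cycles + overhead)

# With few registers, each instruction carries only a short burst of work,
# so fill/drain dominates; more register storage amortizes the overhead.
print(round(engine_efficiency(8, 8, 8), 2))    # short burst   -> 0.21
print(round(engine_efficiency(8, 8, 128), 2))  # amortized run -> 0.81
```

The gap between the two numbers is precisely the under-utilization the abstract attributes to limited register storage.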
This Synthesis Lecture focuses on techniques for efficient data orchestration within DNN accelerators. The end of Moore's Law, coupled with the increasing growth in deep learning and other AI applications...
The everlasting demand for higher computing power for deep neural networks (DNNs) drives the development of parallel architectures. 3D integration, in which chips are integrated and connected vertically, can further increase performance because it introduces another level of spatial parallelism. We therefore analyze the dataflows, performance, area, and temperature of such 3D DNN accelerators, comparing monolithic and TSV-based stacked 3D-ICs against 2D-ICs. We identify workload properties and architectural...
The remarkable advancements in artificial intelligence (AI), primarily driven by deep neural networks, are facing challenges surrounding unsustainable computational trajectories, limited robustness, and a lack of explainability. To develop next-generation cognitive AI systems, neuro-symbolic AI emerges as a promising paradigm, fusing neural and symbolic approaches to enhance interpretability and trustworthiness while facilitating learning from much less data. Recent neuro-symbolic systems have demonstrated great potential...
Recent advancements in machine learning algorithms, especially the development of Deep Neural Networks (DNNs), have transformed the landscape of Artificial Intelligence (AI). With every passing day, deep learning based methods are applied to solve new problems with exceptional results. The portal to the real world is the edge: the true impact of AI can only be fully realized if we have agents continuously interacting with the world and solving everyday problems. Unfortunately, the high compute and memory requirements of DNNs act as a huge barrier towards...
Design space exploration and optimization is an essential but iterative step in custom accelerator design, involving a costly search-based method to extract maximum performance and energy efficiency. State-of-the-art methods employ data-centric approaches to reduce the cost of each iteration but still rely on search algorithms to obtain the optima. This work proposes a learned, constant-time optimizer that uses a recommendation network called AIrchitect, which is capable of learning the architecture mapping space with 94.3% test...