Tushar Krishna

ORCID: 0000-0001-5738-6942
Research Areas
  • Parallel Computing and Optimization Techniques
  • Advanced Memory and Neural Computing
  • Interconnection Networks and Systems
  • Advanced Neural Network Applications
  • Ferroelectric and Negative Capacitance Devices
  • Embedded Systems Design Techniques
  • Advanced Data Storage Technologies
  • Distributed and Parallel Computing Systems
  • Neural Networks and Applications
  • 3D IC and TSV technologies
  • Low-power high-performance VLSI design
  • Stochastic Gradient Optimization Techniques
  • CCD and CMOS Imaging Sensors
  • Cloud Computing and Resource Management
  • Superconducting Materials and Applications
  • IoT and Edge/Fog Computing
  • Scientific Computing and Data Management
  • Supercapacitor Materials and Fabrication
  • Particle Detector Development and Performance
  • VLSI and FPGA Design Techniques
  • Adversarial Robustness in Machine Learning
  • Antenna Design and Analysis
  • VLSI and Analog Circuit Testing
  • Evolutionary Algorithms and Applications
  • Antenna Design and Optimization

Affiliations

Georgia Institute of Technology
2015-2024

SRM University
2024

VIT-AP University
2024

Carnegie Mellon University
2023

Atlanta Technical College
2018-2021

Koneru Lakshmaiah Education Foundation
2008-2021

West Bengal National University of Juridical Sciences
2021

University of Rochester
2019

Purdue University System
2019

University of Utah
2019

Publications

The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP,...

10.1145/2024716.2024718 article EN ACM SIGARCH Computer Architecture News 2011-05-31
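
For flavor, a minimal gem5 syscall-emulation (SE) mode configuration sketch is shown below. gem5's Python config API changes across releases, so the class and port names here (TimingSimpleCPU, SystemXBar, cpu_side_ports, and so on) assume the post-v21 naming, and the binary path is gem5's stock "hello" test program; treat this as a sketch, not a drop-in script.

```python
# Minimal gem5 SE-mode config sketch (post-v21 API names assumed). Run as:
#   build/X86/gem5.opt this_script.py
# from a gem5 checkout. One timing CPU + one DDR3 channel, no caches.
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock="1GHz",
                                   voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("512MB")]

system.cpu = TimingSimpleCPU()
system.membus = SystemXBar()
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports

# x86-specific interrupt port wiring
system.cpu.createInterruptController()
system.cpu.interrupts[0].pio = system.membus.mem_side_ports
system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports

system.mem_ctrl = MemCtrl(dram=DDR3_1600_8x8(range=system.mem_ranges[0]))
system.mem_ctrl.port = system.membus.mem_side_ports
system.system_port = system.membus.cpu_side_ports

binary = "tests/test-progs/hello/bin/x86/linux/hello"  # stock test program
system.workload = SEWorkload.init_compatible(binary)
system.cpu.workload = Process(cmd=[binary])
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
event = m5.simulate()
print("Exiting @ tick %d: %s" % (m5.curTick(), event.getCause()))
```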

Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs). It optimizes the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, for various CNN shapes by reconfiguring its architecture. CNNs are widely used in modern AI systems but also bring challenges on throughput and energy efficiency to the underlying hardware. This is because their computation requires a large amount of data, creating significant data movement on-chip and off-chip that is more energy-consuming than the computation itself. Minimizing...

10.1109/jssc.2016.2616357 article EN IEEE Journal of Solid-State Circuits 2016-11-08

Until very recently, microprocessor designs were computation-centric. On-chip communication was frequently ignored. This was because of fast, single-cycle on-chip communication. The interconnect power was also insignificant compared to the transistor power. With uniprocessor designs providing diminishing returns and the advent of chip multiprocessors (CMPs) in mainstream systems, the on-chip network that connects different processing cores has become a critical part of the design. Transistor miniaturization has led to high global wire...

10.1109/ispass.2009.4919636 article EN 2009-04-01

Deep learning using convolutional neural networks (CNN) gives state-of-the-art accuracy on many computer vision tasks (e.g. object detection, recognition, segmentation). Convolutions account for over 90% of the processing in CNNs for both inference/testing and training, and fully convolutional networks are increasingly being used. Achieving state-of-the-art accuracy requires CNNs with not only a larger number of layers, but also millions of filters and weights, and varying shapes (i.e. filter sizes, number of filters, number of channels) as shown in Fig. 14.5.1. For instance, AlexNet [1]...

10.1109/isscc.2016.7418007 article EN 2016 IEEE International Solid-State Circuits Conference (ISSCC) 2016-01-01
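
To make concrete why convolutions dominate CNN processing, the sketch below (our illustration; the toy shape and the AlexNet-like layer dimensions are ours, not figures from the paper) shows the 7-deep loop nest behind one convolution layer and the multiply-accumulate (MAC) count it implies.

```python
# Naive 7-loop CNN convolution (tiny toy shape so it runs instantly) plus
# MAC-count arithmetic for an AlexNet-CONV2-like shape. Shapes illustrative.
import numpy as np

def conv(x, w):
    N, C, Hi, Wi = x.shape           # batch, in-channels, input H/W
    K, _, R, S = w.shape             # filters, in-channels, filter H/W
    H, W = Hi - R + 1, Wi - S + 1    # output H/W (stride 1, no padding)
    y = np.zeros((N, K, H, W))
    for n in range(N):               # batch
        for k in range(K):           # output channels (filters)
            for h in range(H):       # output rows
                for wo in range(W):  # output cols
                    for c in range(C):          # input channels
                        for r in range(R):      # filter rows
                            for s in range(S):  # filter cols
                                y[n, k, h, wo] += x[n, c, h + r, wo + s] * w[k, c, r, s]
    return y

y = conv(np.random.rand(1, 2, 6, 6), np.random.rand(4, 2, 3, 3))

# MACs = N*K*H*W*C*R*S; for a 27x27 output, 5x5 filter, 48->128 channel layer:
macs = 1 * 128 * 27 * 27 * 48 * 5 * 5
print(y.shape, f"{macs:,}")          # (1, 4, 4, 4)  111,974,400 MACs in one layer
```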

The advent of Deep Learning (DL) has radically transformed the computing industry across the entire spectrum from algorithms to circuits. As myriad application domains embrace DL, it has become synonymous with a genre of workloads spanning vision, speech, language, recommendations, robotics, and games. The key compute kernel within most DL workloads is general matrix-matrix multiplications (GEMMs), which appear frequently during both the forward pass (inference and training) and the backward pass (training). GEMMs are a natural choice for...

10.1109/hpca47549.2020.00015 article EN 2020-02-01
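
A small sketch of ours (not from the paper) of the point about GEMMs: even a plain fully connected layer reduces to one GEMM in the forward pass and two more in the backward pass.

```python
# A fully connected layer as three GEMMs: forward, weight-grad, input-grad.
import numpy as np

B, In, Out = 32, 1024, 512          # batch, input dim, output dim
X = np.random.randn(B, In)          # activations
W = np.random.randn(In, Out)        # weights

Y = X @ W                           # forward: (B x In) @ (In x Out)

dY = np.random.randn(B, Out)        # upstream gradient
dW = X.T @ dY                       # backward, weight grad: (In x B) @ (B x Out)
dX = dY @ W.T                       # backward, input grad: (B x Out) @ (Out x In)

print(Y.shape, dW.shape, dX.shape)  # (32, 512) (1024, 512) (32, 1024)
```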

Deep neural networks (DNN) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and a need for high energy-efficiency has led to a surge in research on hardware accelerators for this paradigm. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PE) operating in parallel and communicating with each other directly....

10.1145/3173162.3173176 article EN 2018-03-19

The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, which directly impacts the performance and energy efficiency of accelerators. An accelerator's microarchitecture dictates the dataflow(s) that can be employed to execute layers in a DNN. Selecting a dataflow for a layer can have a large impact on utilization and energy efficiency, but there is a lack of understanding of dataflow choices and their consequences, and of tools and methodologies to help architects explore the co-optimization design space.

10.1145/3352460.3358252 article EN 2019-10-11
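
To make the dataflow point concrete, here is a toy analytical comparison of ours (not the paper's tool): for C[M,N] += A[M,K] * B[K,N] with a single-entry register per operand, merely changing the loop order changes which operand is reused and how many buffer fetches occur.

```python
# Toy first-order model of dataflow choice: with one single-entry register
# per operand, the loop order alone decides which operand stays put and how
# often the other two must be re-fetched from the buffer.
M = N = K = 64

# Order (m, n, k): k innermost -> C[m,n] is stationary ("output-stationary").
output_stationary = {"A": M * N * K, "B": M * N * K, "C": M * N}

# Order (m, k, n): n innermost -> A[m,k] is stationary (a "weight-stationary"
# flavor if A holds the weights).
a_stationary = {"A": M * K, "B": M * K * N, "C": M * K * N}

for name, f in [("output-stationary", output_stationary),
                ("A-stationary", a_stationary)]:
    print(f"{name:18s} fetches={f} total={sum(f.values()):,}")
```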

Systolic Arrays are one of the most popular compute substrates within Deep Learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications. However, the research community lacks tools to provide insights on both the design trade-offs and efficient mapping strategies for systolic-array-based accelerators. We introduce the Systolic CNN Accelerator Simulator (SCALE-Sim), which is a configurable, cycle-accurate simulator for systolic-array-based DNN accelerators. SCALE-Sim exposes various...

10.48550/arxiv.1811.02883 preprint EN arXiv (Cornell University) 2018-01-01
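
In the spirit of SCALE-Sim's analytical mode, the back-of-envelope sketch below (ours, not SCALE-Sim's actual code or config format) gives a first-order runtime estimate for a weight-stationary systolic array and makes the array-sizing trade-off visible.

```python
# First-order cycle estimate for a weight-stationary rows x cols systolic
# array running an (M x K) @ (K x N) GEMM: weights are loaded one tile
# ("fold") at a time, and each fold streams M input rows through the array
# with pipeline fill and drain overheads.
from math import ceil

def ws_systolic_cycles(M, K, N, rows=32, cols=32):
    folds = ceil(K / rows) * ceil(N / cols)   # weight tiles to load
    per_fold = rows + (M - 1) + cols          # fill + stream + drain (approx.)
    return folds * per_fold

# Bigger arrays need fewer folds but pay more fill/drain per fold:
for dim in (16, 32, 128):
    print(dim, ws_systolic_cycles(M=256, K=512, N=256, rows=dim, cols=dim))
```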

Deep neural networks (DNN) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and a need for high energy-efficiency has led to a surge in research on hardware accelerators for this paradigm. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PE) operating in parallel and communicating with each other directly....

10.1145/3296957.3173176 article EN ACM SIGPLAN Notices 2018-03-19

The compute demand for deep learning workloads is well known and is a prime motivator for powerful parallel computing platforms such as GPUs or dedicated hardware accelerators. The massive inherent parallelism of these workloads enables us to extract more performance by simply provisioning more compute for a given task. This strategy can be directly exploited to build higher-performing hardware for DNN workloads, by incorporating as many compute units as possible in a single system. This approach is referred to as scaling up. Alternatively, it is feasible to arrange multiple systems to work on...

10.1109/ispass48437.2020.00016 article EN 2020-08-01

The efficiency of an accelerator depends on three factors: mapping, deep neural network (DNN) layers, and hardware. Together they construct an extremely complicated design space for DNN accelerators. To demystify such a design space and guide the design for better efficiency, we propose an analytical cost model, MAESTRO. MAESTRO receives a DNN model description and hardware resources information as a list, and a mapping described in a data-centric representation, as inputs. The data-centric representation consists of directives that enable concise descriptions of mappings in a compiler-friendly form. MAESTRO analyzes...

10.1109/mm.2020.2985963 article EN IEEE Micro 2020-04-22
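
For flavor, here is a hypothetical Python rendering of a data-centric mapping; MAESTRO's real input is a directive file with its own syntax, and the dimension names and tile sizes below are ours.

```python
# Hypothetical Python rendering (not MAESTRO's actual file format) of a
# data-centric mapping for a CONV layer with dims K (filters), C (channels),
# R/S (filter), Y/X (activations). Each directive binds one dimension to
# space (across PEs) or time (across steps) with a (tile_size, offset) pair;
# offset < tile_size implies sliding-window overlap, i.e. reuse.
mapping = [
    ("SpatialMap",  "K", dict(size=1,  offset=1)),   # one filter per PE
    ("TemporalMap", "C", dict(size=64, offset=64)),  # channel tiles over time
    ("TemporalMap", "R", dict(size=3,  offset=3)),
    ("TemporalMap", "S", dict(size=3,  offset=3)),
    ("TemporalMap", "Y", dict(size=3,  offset=1)),   # 2-row overlap -> input reuse
    ("TemporalMap", "X", dict(size=3,  offset=1)),
]

# A cost model can read reuse straight off the directives, e.g. temporal
# input reuse along Y is (size - offset) / size = 2/3 between steps.
for kind, dim, p in mapping:
    print(f"{kind}({p['size']},{p['offset']}) {dim}")
```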

As the number of on-chip cores increases, scalable topologies such as meshes inevitably add multiple hops in each network traversal. The best we can do right now is to design 1-cycle routers, such that the low-load latency between a source and destination equals the number of routers plus links (i.e. hops×2) between them. OS/compiler and cache coherence protocol designers often try to limit communication to within a few hops, since network latency is critical for their scalability. In this work, we propose an on-chip network called SMART (Single-cycle Multi-hop Asynchronous...

10.1109/hpca.2013.6522334 article EN 2013-02-01
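
Worked low-load latency arithmetic (ours, following the baseline model stated in the abstract): with 1-cycle routers and 1-cycle links every hop costs 2 cycles, while an idealized single-cycle multi-hop bypass covers up to HPCmax hops per cycle.

```python
# Low-load latency: baseline 1-cycle-router + 1-cycle-link mesh vs an
# idealized SMART path that bypasses up to hpc_max routers per cycle.
from math import ceil

def baseline_cycles(hops):
    return 2 * hops                 # router + link at every hop

def smart_cycles(hops, hpc_max=8):
    return ceil(hops / hpc_max)     # idealized single-cycle multi-hop traversal

for hops in (4, 8, 14):             # 14 = corner-to-corner in an 8x8 mesh
    print(f"{hops:2d} hops: baseline {baseline_cycles(hops):2d} cycles, "
          f"SMART ~{smart_cycles(hops)} cycle(s)")
```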

In the many-core era, scalable coherence and on-chip interconnects are crucial for shared-memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy coherence relies on ordered interconnects, which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full-chip designs. We present SCORPIO, an ordered mesh...

10.1145/2678373.2665680 article EN ACM SIGARCH Computer Architecture News 2014-06-14

The great success of deep neural networks (DNNs) has significantly assisted humans in numerous applications such as computer vision, and DNNs are widely used in today's applications and systems. However, in-the-edge inference is still a severe challenge, mainly because of the contradiction between the inherently intensive resource requirements of DNNs and the tight resource availability of edge devices. Nevertheless, in-the-edge inferencing preserves privacy in several user-centric domains and applies to scenarios with limited Internet connectivity (e.g., drones,...

10.1109/iiswc47752.2019.9041955 article EN 2019-11-01

Emerging AI-enabled applications such as augmented and virtual reality (AR/VR) leverage multiple deep neural network (DNN) models for various sub-tasks such as object detection, image segmentation, eye-tracking, speech recognition, and so on. Because of the diversity of the sub-tasks, the layers within and across the DNN models are highly heterogeneous in operation and shape. Such diverse layer operations and shapes are major challenges for a fixed dataflow accelerator (FDA) that employs a fixed dataflow strategy on a single accelerator substrate, since each layer prefers different...

10.1109/hpca51647.2021.00016 article EN 2021-02-01

Neural Architecture Search (NAS) has demonstrated its power on various AI accelerating platforms such as Field Programmable Gate Arrays (FPGAs) and Graphic Processing Units (GPUs). However, it remains an open problem how to integrate NAS with Application-Specific Integrated Circuits (ASICs), despite them being the most powerful AI accelerating platforms. The major bottleneck comes from the large design freedom associated with ASIC designs. Moreover, with the consideration that multiple DNNs will run in parallel for different...

10.1109/dac18072.2020.9218676 article EN 2020-07-01

DNN accelerators provide efficiency by leveraging reuse of activations/weights/outputs during computation to reduce data movement between DRAM and the chip. This reuse is captured by the accelerator's dataflow. While there has been significant prior work in exploring and comparing various dataflows, the strategy for assigning on-chip hardware resources (i.e., compute and memory), given a dataflow, that can optimize performance/energy while meeting platform constraints of area/power for the DNN(s) of interest is still relatively...

10.1109/micro50266.2020.00058 article EN 2020-10-01

DNN layers are multi-dimensional loops that can be ordered, tiled, and scheduled in myriad ways across space and time on DNN accelerators. Each of these choices is called a mapping. It has been shown that the mapping plays an extremely crucial role in overall performance and efficiency, as it directly determines the amount of reuse the accelerator can leverage from the DNN. Moreover, instead of using a fixed mapping for every layer, research has revealed the benefit of optimizing mappings per layer. However, determining the right mapping for a given layer is still...

10.1145/3400302.3415639 article EN 2020-11-02
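
Rough arithmetic of ours (not the paper's numbers) on why per-layer mapping search is hard: even one level of tiling with divisor-sized tiles, combined with the possible loop orders, already yields tens of millions of candidate mappings for a single layer.

```python
# Map-space size estimate: loop orderings x per-dimension tile-size choices
# (restricted to exact divisors, one tiling level) for one conv layer.
from math import factorial, prod

dims = {"K": 128, "C": 64, "Y": 56, "X": 56, "R": 3, "S": 3, "N": 1}

def num_divisors(n):
    return sum(1 for d in range(1, n + 1) if n % d == 0)

orders = factorial(len(dims))                        # 7! loop orders
tilings = prod(num_divisors(v) for v in dims.values())
print(f"{orders:,} orders x {tilings:,} tilings = {orders * tilings:,} mappings")
# -> 5,040 x 14,336 = 72,253,440 candidate mappings for one layer
```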

In this paper, we present a case study of our chip prototype of a 16-node 4x4 mesh NoC fabricated in 45nm SOI CMOS that aims to simultaneously optimize energy, latency, and throughput for unicasts, multicasts, and broadcasts. We first define and analyze the theoretical limits of latency, throughput, and energy, then describe how we approach these limits through a combination of microarchitecture and circuit techniques. Our 1.1V 1GHz chip achieves 1-cycle router-and-link latency at each hop and energy-efficient router-level multicast support,...

10.1145/2228360.2228431 article EN 2012-05-31
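
A quick sketch of the kind of theoretical-limit arithmetic the paper performs (our numbers, using the standard mesh hop-count formula): on a k x k mesh under uniform-random traffic the average Manhattan distance is 2(k^2-1)/(3k) hops, so a chip with 1-cycle router-and-link hops approaches that bound directly.

```python
# Average hop count on a k x k mesh under uniform-random traffic, and the
# ideal low-load latency when each hop (router + link) costs one cycle, as
# in the 4x4 prototype above.
def avg_hops(k):
    # expected |dx| + |dy| for endpoints drawn uniformly from a k x k grid
    return 2 * (k * k - 1) / (3 * k)

k = 4
t_hop = 1                               # 1-cycle router-and-link per hop
print(avg_hops(k))                      # 2.5 average hops on a 4x4 mesh
print(avg_hops(k) * t_hop)              # -> 2.5-cycle ideal average latency
```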

The prevalence of multicore architectures has accentuated the need for scalable cache coherence solutions. Many proposed designs use a mix of 1-to-1, 1-to-many (1-to-M), and many-to-1 (M-to-1) communication to maintain data consistency. The on-chip network is the backbone that needs to handle all these flows efficiently to allow protocols to scale. However, most research in on-chip networks has focused on optimizing only 1-to-1 traffic. There has been some recent work addressing 1-to-M traffic by proposing forking multicast...

10.1145/2155620.2155630 article EN 2011-12-03

Applications across image processing, speech recognition, and classification heavily rely on neural-network-based algorithms that have demonstrated highly promising results in accuracy. However, such algorithms involve massive computations that are not manageable on general-purpose processors. To cope with this challenge, spatial-architecture-based accelerators, which consist of an array of hundreds of processing elements (PEs), have emerged. These accelerators achieve high throughput by exploiting parallel computation over the PEs;...

10.1145/3130218.3130230 article EN 2017-09-20

A new trend in system-on-chip (SoC) design is chiplet-based IP reuse using 2.5-D integration. Complete electronic systems can be created through the integration of chiplets on an interposer, rather than through a monolithic flow. This approach expands access to a large catalog of off-the-shelf intellectual properties (IPs), allows reuse of them across designs, and enables mixing of heterogeneous blocks in different technologies. In this article, we present a highly integrated design flow that encompasses architecture, circuit, and package to build and simulate...

10.1109/tvlsi.2020.3015494 article EN IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2020-08-24

As Deep Learning continues to drive a variety of applications in edge and cloud data centers, there is a growing trend towards building large accelerators with several sub-accelerator cores/chiplets. This work looks at the problem of supporting multi-tenancy on such accelerators. In particular, we focus on mapping jobs from several DNNs simultaneously on an accelerator. Given the extremely large search space, we formulate this as an optimization problem and develop a framework called M3E. In addition, we develop a specialized optimization algorithm called MAGMA with custom operators...

10.1109/hpca53966.2022.00065 article EN 2022-04-01