Tushar Krishna

ORCID: 0000-0001-5738-6942
Research Areas
  • Parallel Computing and Optimization Techniques
  • Advanced Memory and Neural Computing
  • Interconnection Networks and Systems
  • Advanced Neural Network Applications
  • Ferroelectric and Negative Capacitance Devices
  • Embedded Systems Design Techniques
  • Advanced Data Storage Technologies
  • Distributed and Parallel Computing Systems
  • Neural Networks and Applications
  • 3D IC and TSV technologies
  • Low-power high-performance VLSI design
  • Stochastic Gradient Optimization Techniques
  • CCD and CMOS Imaging Sensors
  • Cloud Computing and Resource Management
  • Superconducting Materials and Applications
  • IoT and Edge/Fog Computing
  • Scientific Computing and Data Management
  • Supercapacitor Materials and Fabrication
  • Particle Detector Development and Performance
  • VLSI and FPGA Design Techniques
  • Adversarial Robustness in Machine Learning
  • Antenna Design and Analysis
  • VLSI and Analog Circuit Testing
  • Evolutionary Algorithms and Applications
  • Antenna Design and Optimization

Affiliations

Georgia Institute of Technology
2015-2024

SRM University
2024

VIT-AP University
2024

Carnegie Mellon University
2023

Atlanta Technical College
2018-2021

Koneru Lakshmaiah Education Foundation
2008-2021

West Bengal National University of Juridical Sciences
2021

University of Rochester
2019

Purdue University System
2019

University of Utah
2019

Publications

The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP,...

10.1145/2024716.2024718 article EN ACM SIGARCH Computer Architecture News 2011-05-31
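
For flavor, a minimal gem5 syscall-emulation (SE) mode configuration sketch is shown below. gem5's Python config API changes across releases, so the class and port names here (TimingSimpleCPU, SystemXBar, cpu_side_ports, and so on) assume the post-v21 naming, and the binary path is gem5's stock "hello" test program; treat this as a sketch, not a drop-in script.

```python
# Minimal gem5 SE-mode config sketch (post-v21 API names assumed). Run as:
#   build/X86/gem5.opt this_script.py
# from a gem5 checkout. One timing CPU + one DDR3 channel, no caches.
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock="1GHz",
                                   voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("512MB")]

system.cpu = TimingSimpleCPU()
system.membus = SystemXBar()
system.cpu.icache_port = system.membus.cpu_side_ports
system.cpu.dcache_port = system.membus.cpu_side_ports

# x86-specific interrupt port wiring
system.cpu.createInterruptController()
system.cpu.interrupts[0].pio = system.membus.mem_side_ports
system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports

system.mem_ctrl = MemCtrl(dram=DDR3_1600_8x8(range=system.mem_ranges[0]))
system.mem_ctrl.port = system.membus.mem_side_ports
system.system_port = system.membus.cpu_side_ports

binary = "tests/test-progs/hello/bin/x86/linux/hello"  # stock test program
system.workload = SEWorkload.init_compatible(binary)
system.cpu.workload = Process(cmd=[binary])
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
event = m5.simulate()
print("Exiting @ tick %d: %s" % (m5.curTick(), event.getCause()))
```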

Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs). It optimizes the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, for various CNN shapes by reconfiguring its architecture. CNNs are widely used in modern AI systems but also bring challenges on throughput and energy efficiency to the underlying hardware. This is because their computation requires a large amount of data, creating significant data movement on-chip and off-chip that is more energy-consuming than the computation itself. Minimizing...

10.1109/jssc.2016.2616357 article EN IEEE Journal of Solid-State Circuits 2016-11-08

Until very recently, microprocessor designs were computation-centric. On-chip communication was frequently ignored. This was because of fast, single-cycle on-chip communication. The interconnect power was also insignificant compared to the transistor power. With uniprocessor designs providing diminishing returns and the advent of chip multiprocessors (CMPs) in mainstream systems, the on-chip network that connects different processing cores has become a critical part of the design. Transistor miniaturization has led to high global wire...

10.1109/ispass.2009.4919636 article EN 2009-04-01

Deep learning using convolutional neural networks (CNN) gives state-of-the-art accuracy on many computer vision tasks (e.g. object detection, recognition, segmentation). Convolutions account for over 90% of the processing in CNNs for both inference/testing and training, and fully convolutional networks are increasingly being used. Achieving state-of-the-art accuracy requires CNNs with not only a larger number of layers, but also millions of filters and weights, and varying shapes (i.e. filter sizes, number of filters, number of channels) as shown in Fig. 14.5.1. For instance, AlexNet [1]...

10.1109/isscc.2016.7418007 article EN 2016 IEEE International Solid-State Circuits Conference (ISSCC) 2016-01-01
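
To make concrete why convolutions dominate CNN processing, the sketch below (our illustration; the toy shape and the AlexNet-like layer dimensions are ours, not figures from the paper) shows the 7-deep loop nest behind one convolution layer and the multiply-accumulate (MAC) count it implies.

```python
# Naive 7-loop CNN convolution (tiny toy shape so it runs instantly) plus
# MAC-count arithmetic for an AlexNet-CONV2-like shape. Shapes illustrative.
import numpy as np

def conv(x, w):
    N, C, Hi, Wi = x.shape           # batch, in-channels, input H/W
    K, _, R, S = w.shape             # filters, in-channels, filter H/W
    H, W = Hi - R + 1, Wi - S + 1    # output H/W (stride 1, no padding)
    y = np.zeros((N, K, H, W))
    for n in range(N):               # batch
        for k in range(K):           # output channels (filters)
            for h in range(H):       # output rows
                for wo in range(W):  # output cols
                    for c in range(C):          # input channels
                        for r in range(R):      # filter rows
                            for s in range(S):  # filter cols
                                y[n, k, h, wo] += x[n, c, h + r, wo + s] * w[k, c, r, s]
    return y

y = conv(np.random.rand(1, 2, 6, 6), np.random.rand(4, 2, 3, 3))

# MACs = N*K*H*W*C*R*S; for a 27x27 output, 5x5 filter, 48->128 channel layer:
macs = 1 * 128 * 27 * 27 * 48 * 5 * 5
print(y.shape, f"{macs:,}")          # (1, 4, 4, 4)  111,974,400 MACs in one layer
```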

The advent of Deep Learning (DL) has radically transformed the computing industry across the entire spectrum from algorithms to circuits. As myriad application domains embrace DL, it has become synonymous with a genre of workloads spanning vision, speech, language, recommendations, robotics, and games. The key compute kernel within most DL workloads is general matrix-matrix multiplications (GEMMs), which appear frequently during both the forward pass (inference and training) and the backward pass (training). GEMMs are a natural choice for...

10.1109/hpca47549.2020.00015 article EN 2020-02-01
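
A small sketch of ours (not from the paper) of the point about GEMMs: even a plain fully connected layer reduces to one GEMM in the forward pass and two more in the backward pass.

```python
# A fully connected layer as three GEMMs: forward, weight-grad, input-grad.
import numpy as np

B, In, Out = 32, 1024, 512          # batch, input dim, output dim
X = np.random.randn(B, In)          # activations
W = np.random.randn(In, Out)        # weights

Y = X @ W                           # forward: (B x In) @ (In x Out)

dY = np.random.randn(B, Out)        # upstream gradient
dW = X.T @ dY                       # backward, weight grad: (In x B) @ (B x Out)
dX = dY @ W.T                       # backward, input grad: (B x Out) @ (Out x In)

print(Y.shape, dW.shape, dX.shape)  # (32, 512) (1024, 512) (32, 1024)
```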

Deep neural networks (DNN) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and a need for high energy-efficiency has led to a surge in research on hardware accelerators for this paradigm. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PE) operating in parallel and communicating with each other directly....

10.1145/3173162.3173176 article EN 2018-03-19

The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, which directly impacts the performance and energy efficiency of accelerators. An accelerator's microarchitecture dictates the dataflow(s) that can be employed to execute layers in a DNN. Selecting a dataflow for a layer can have a large impact on utilization and energy efficiency, but there is a lack of understanding of dataflow choices and their consequences, and of tools and methodologies to help architects explore the co-optimization design space.

10.1145/3352460.3358252 article EN 2019-10-11
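
To make the dataflow point concrete, here is a toy analytical comparison of ours (not the paper's tool): for C[M,N] += A[M,K] * B[K,N] with a single-entry register per operand, merely changing the loop order changes which operand is reused and how many buffer fetches occur.

```python
# Toy first-order model of dataflow choice: with one single-entry register
# per operand, the loop order alone decides which operand stays put and how
# often the other two must be re-fetched from the buffer.
M = N = K = 64

# Order (m, n, k): k innermost -> C[m,n] is stationary ("output-stationary").
output_stationary = {"A": M * N * K, "B": M * N * K, "C": M * N}

# Order (m, k, n): n innermost -> A[m,k] is stationary (a "weight-stationary"
# flavor if A holds the weights).
a_stationary = {"A": M * K, "B": M * K * N, "C": M * K * N}

for name, f in [("output-stationary", output_stationary),
                ("A-stationary", a_stationary)]:
    print(f"{name:18s} fetches={f} total={sum(f.values()):,}")
```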

Systolic Arrays are one of the most popular compute substrates within Deep Learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications. However, the research community lacks tools to provide insights on both the design trade-offs and efficient mapping strategies for systolic-array-based accelerators. We introduce the Systolic CNN Accelerator Simulator (SCALE-Sim), which is a configurable, cycle-accurate simulator for systolic-array-based DNN accelerators. SCALE-Sim exposes various...

10.48550/arxiv.1811.02883 preprint EN arXiv (Cornell University) 2018-01-01
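
In the spirit of SCALE-Sim's analytical mode, the back-of-envelope sketch below (ours, not SCALE-Sim's actual code or config format) gives a first-order runtime estimate for a weight-stationary systolic array and makes the array-sizing trade-off visible.

```python
# First-order cycle estimate for a weight-stationary rows x cols systolic
# array running an (M x K) @ (K x N) GEMM: weights are loaded one tile
# ("fold") at a time, and each fold streams M input rows through the array
# with pipeline fill and drain overheads.
from math import ceil

def ws_systolic_cycles(M, K, N, rows=32, cols=32):
    folds = ceil(K / rows) * ceil(N / cols)   # weight tiles to load
    per_fold = rows + (M - 1) + cols          # fill + stream + drain (approx.)
    return folds * per_fold

# Bigger arrays need fewer folds but pay more fill/drain per fold:
for dim in (16, 32, 128):
    print(dim, ws_systolic_cycles(M=256, K=512, N=256, rows=dim, cols=dim))
```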

Deep neural networks (DNN) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and a need for high energy-efficiency has led to a surge in research on hardware accelerators for this paradigm. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PE) operating in parallel and communicating with each other directly....

10.1145/3296957.3173176 article EN ACM SIGPLAN Notices 2018-03-19

The compute demand for deep learning workloads is well known and is a prime motivator for powerful parallel computing platforms such as GPUs or dedicated hardware accelerators. The massive inherent parallelism of these workloads enables us to extract more performance by simply provisioning more compute for a given task. This strategy can be directly exploited to build higher-performing hardware for DNN workloads, by incorporating as many compute units as possible in a single system. This approach is referred to as scaling up. Alternatively, it is feasible to arrange multiple systems to work on...

10.1109/ispass48437.2020.00016 article EN 2020-08-01

The efficiency of an accelerator depends on three factors: mapping, deep neural network (DNN) layers, and hardware. Together they construct an extremely complicated design space for DNN accelerators. To demystify such a design space and guide the design for better efficiency, we propose an analytical cost model, MAESTRO. MAESTRO receives a DNN model description and hardware resources information as a list, and a mapping described in a data-centric representation, as inputs. The data-centric representation consists of directives that enable concise descriptions of mappings in a compiler-friendly form. MAESTRO analyzes...

10.1109/mm.2020.2985963 article EN IEEE Micro 2020-04-22
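
For flavor, here is a hypothetical Python rendering of a data-centric mapping; MAESTRO's real input is a directive file with its own syntax, and the dimension names and tile sizes below are ours.

```python
# Hypothetical Python rendering (not MAESTRO's actual file format) of a
# data-centric mapping for a CONV layer with dims K (filters), C (channels),
# R/S (filter), Y/X (activations). Each directive binds one dimension to
# space (across PEs) or time (across steps) with a (tile_size, offset) pair;
# offset < tile_size implies sliding-window overlap, i.e. reuse.
mapping = [
    ("SpatialMap",  "K", dict(size=1,  offset=1)),   # one filter per PE
    ("TemporalMap", "C", dict(size=64, offset=64)),  # channel tiles over time
    ("TemporalMap", "R", dict(size=3,  offset=3)),
    ("TemporalMap", "S", dict(size=3,  offset=3)),
    ("TemporalMap", "Y", dict(size=3,  offset=1)),   # 2-row overlap -> input reuse
    ("TemporalMap", "X", dict(size=3,  offset=1)),
]

# A cost model can read reuse straight off the directives, e.g. temporal
# input reuse along Y is (size - offset) / size = 2/3 between steps.
for kind, dim, p in mapping:
    print(f"{kind}({p['size']},{p['offset']}) {dim}")
```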

As the number of on-chip cores increases, scalable topologies such as meshes inevitably add multiple hops in each network traversal. The best we can do right now is to design 1-cycle routers, such that the low-load latency between a source and destination equals the number of routers plus links (i.e. hops×2) between them. OS/compiler and cache coherence protocol designers often try to limit communication to within a few hops, since network latency is critical for their scalability. In this work, we propose an on-chip network called SMART (Single-cycle Multi-hop Asynchronous...

10.1109/hpca.2013.6522334 article EN 2013-02-01
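
Worked low-load latency arithmetic (ours, following the baseline model stated in the abstract): with 1-cycle routers and 1-cycle links every hop costs 2 cycles, while an idealized single-cycle multi-hop bypass covers up to HPCmax hops per cycle.

```python
# Low-load latency: baseline 1-cycle-router + 1-cycle-link mesh vs an
# idealized SMART path that bypasses up to hpc_max routers per cycle.
from math import ceil

def baseline_cycles(hops):
    return 2 * hops                 # router + link at every hop

def smart_cycles(hops, hpc_max=8):
    return ceil(hops / hpc_max)     # idealized single-cycle multi-hop traversal

for hops in (4, 8, 14):             # 14 = corner-to-corner in an 8x8 mesh
    print(f"{hops:2d} hops: baseline {baseline_cycles(hops):2d} cycles, "
          f"SMART ~{smart_cycles(hops)} cycle(s)")
```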

In the many-core era, scalable coherence and on-chip interconnects are crucial for shared-memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy coherence relies on ordered interconnects, which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full-chip designs. We present SCORPIO, an ordered mesh...

10.1145/2678373.2665680 article EN ACM SIGARCH Computer Architecture News 2014-06-14

The great success of deep neural networks (DNNs) has significantly assisted humans in numerous applications such as computer vision, and DNNs are widely used in today's applications and systems. However, in-the-edge inference is still a severe challenge, mainly because of the contradiction between the inherently intensive resource requirements of DNNs and the tight resource availability of edge devices. Nevertheless, in-the-edge inferencing preserves privacy in several user-centric domains and applies to scenarios with limited Internet connectivity (e.g., drones,...

10.1109/iiswc47752.2019.9041955 article EN 2019-11-01

Emerging AI-enabled applications such as augmented and virtual reality (AR/VR) leverage multiple deep neural network (DNN) models for various sub-tasks such as object detection, image segmentation, eye-tracking, speech recognition, and so on. Because of the diversity of the sub-tasks, the layers within and across the DNN models are highly heterogeneous in operation and shape. Such diverse layer operations and shapes are major challenges for a fixed dataflow accelerator (FDA) that employs a fixed dataflow strategy on a single accelerator substrate, since each layer prefers different...

10.1109/hpca51647.2021.00016 article EN 2021-02-01

Neural Architecture Search (NAS) has demonstrated its power on various AI accelerating platforms such as Field Programmable Gate Arrays (FPGAs) and Graphic Processing Units (GPUs). However, it remains an open problem how to integrate NAS with Application-Specific Integrated Circuits (ASICs), despite them being the most powerful AI accelerating platforms. The major bottleneck comes from the large design freedom associated with ASIC designs. Moreover, with the consideration that multiple DNNs will run in parallel for different...

10.1109/dac18072.2020.9218676 article EN 2020-07-01

DNN accelerators provide efficiency by leveraging reuse of activations/weights/outputs during computation to reduce data movement between DRAM and the chip. This reuse is captured by the accelerator's dataflow. While there has been significant prior work in exploring and comparing various dataflows, the strategy for assigning on-chip hardware resources (i.e., compute and memory), given a dataflow, that can optimize performance/energy while meeting platform constraints of area/power for the DNN(s) of interest is still relatively...

10.1109/micro50266.2020.00058 article EN 2020-10-01

DNN layers are multi-dimensional loops that can be ordered, tiled, and scheduled in myriad ways across space and time on DNN accelerators. Each of these choices is called a mapping. It has been shown that the mapping plays an extremely crucial role in overall performance and efficiency, as it directly determines the amount of reuse the accelerator can leverage from the DNN. Moreover, instead of using a fixed mapping for every layer, research has revealed the benefit of optimizing mappings per layer. However, determining the right mapping for a given layer is still...

10.1145/3400302.3415639 article EN 2020-11-02
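
Rough arithmetic of ours (not the paper's numbers) on why per-layer mapping search is hard: even one level of tiling with divisor-sized tiles, combined with the possible loop orders, already yields tens of millions of candidate mappings for a single layer.

```python
# Map-space size estimate: loop orderings x per-dimension tile-size choices
# (restricted to exact divisors, one tiling level) for one conv layer.
from math import factorial, prod

dims = {"K": 128, "C": 64, "Y": 56, "X": 56, "R": 3, "S": 3, "N": 1}

def num_divisors(n):
    return sum(1 for d in range(1, n + 1) if n % d == 0)

orders = factorial(len(dims))                        # 7! loop orders
tilings = prod(num_divisors(v) for v in dims.values())
print(f"{orders:,} orders x {tilings:,} tilings = {orders * tilings:,} mappings")
# -> 5,040 x 14,336 = 72,253,440 candidate mappings for one layer
```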

In this paper, we present a case study of our chip prototype of a 16-node 4x4 mesh NoC fabricated in 45nm SOI CMOS that aims to simultaneously optimize energy, latency, and throughput for unicasts, multicasts, and broadcasts. We first define and analyze the theoretical limits of latency, throughput, and energy, then describe how we approach these limits through a combination of microarchitecture and circuit techniques. Our 1.1V 1GHz chip achieves 1-cycle router-and-link latency at each hop and energy-efficient router-level multicast support,...

10.1145/2228360.2228431 article EN 2012-05-31
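
A quick sketch of the kind of theoretical-limit arithmetic the paper performs (our numbers, using the standard mesh hop-count formula): on a k x k mesh under uniform-random traffic the average Manhattan distance is 2(k^2-1)/(3k) hops, so a chip with 1-cycle router-and-link hops approaches that bound directly.

```python
# Average hop count on a k x k mesh under uniform-random traffic, and the
# ideal low-load latency when each hop (router + link) costs one cycle, as
# in the 4x4 prototype above.
def avg_hops(k):
    # expected |dx| + |dy| for endpoints drawn uniformly from a k x k grid
    return 2 * (k * k - 1) / (3 * k)

k = 4
t_hop = 1                               # 1-cycle router-and-link per hop
print(avg_hops(k))                      # 2.5 average hops on a 4x4 mesh
print(avg_hops(k) * t_hop)              # -> 2.5-cycle ideal average latency
```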

The prevalence of multicore architectures has accentuated the need for scalable cache coherence solutions. Many proposed designs use a mix of 1-to-1, 1-to-many (1-to-M), and many-to-1 (M-to-1) communication to maintain data consistency. The on-chip network is the backbone that needs to handle all these flows efficiently to allow protocols to scale. However, most research in on-chip networks has focused on optimizing only 1-to-1 traffic. There has been some recent work addressing 1-to-M traffic by proposing forking multicast...

10.1145/2155620.2155630 article EN 2011-12-03

Applications across image processing, speech recognition, and classification heavily rely on neural-network-based algorithms that have demonstrated highly promising results in accuracy. However, such algorithms involve massive computations that are not manageable on general-purpose processors. To cope with this challenge, spatial-architecture-based accelerators, which consist of an array of hundreds of processing elements (PEs), have emerged. These accelerators achieve high throughput by exploiting parallel computation over the PEs;...

10.1145/3130218.3130230 article EN 2017-09-20

A new trend in system-on-chip (SoC) design is chiplet-based IP reuse using 2.5-D integration. Complete electronic systems can be created through the integration of chiplets on an interposer, rather than through a monolithic flow. This approach expands access to a large catalog of off-the-shelf intellectual properties (IPs), allows reuse of them across designs, and enables mixing of heterogeneous blocks in different technologies. In this article, we present a highly integrated design flow that encompasses architecture, circuit, and package to build and simulate...

10.1109/tvlsi.2020.3015494 article EN IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2020-08-24

As Deep Learning continues to drive a variety of applications in edge and cloud data centers, there is a growing trend towards building large accelerators with several sub-accelerator cores/chiplets. This work looks at the problem of supporting multi-tenancy on such accelerators. In particular, we focus on mapping jobs from several DNNs simultaneously on an accelerator. Given the extremely large search space, we formulate this as an optimization problem and develop a framework called M3E. In addition, we develop a specialized optimization algorithm called MAGMA with custom operators...

10.1109/hpca53966.2022.00065 article EN 2022-04-01