Sai Rahul Chalamalasetti

ORCID: 0000-0001-9004-440X
Research Areas
  • Parallel Computing and Optimization Techniques
  • Embedded Systems Design Techniques
  • Advanced Memory and Neural Computing
  • Ferroelectric and Negative Capacitance Devices
  • Interconnection Networks and Systems
  • Cloud Computing and Resource Management
  • Network Packet Processing and Optimization
  • Low-power high-performance VLSI design
  • Radiation Effects in Electronics
  • Distributed and Parallel Computing Systems
  • Algorithms and Data Compression
  • VLSI and Analog Circuit Testing
  • CCD and CMOS Imaging Sensors
  • Caching and Content Delivery
  • Machine Learning in Materials Science
  • Stochastic Gradient Optimization Techniques
  • Advanced Image and Video Retrieval Techniques
  • Text and Document Classification Technologies
  • Scientific Computing and Data Management
  • Digital Filter Design and Implementation
  • Digital Transformation in Industry
  • Polymer Nanocomposite Synthesis and Irradiation
  • Video Coding and Compression Technologies
  • Advanced Data Storage Technologies
  • Advanced Optical Sensing Technologies

Hewlett-Packard (United States)
2013-2024

Matrix Research (United States)
2024

Hewlett Packard Enterprise (United States)
2018-2021

Universidade de São Paulo
2017

Silicon Labs (United States)
2014

University of Massachusetts Lowell
2008-2013

University of Massachusetts Amherst
2010

Memristor crossbars are circuits capable of performing analog matrix-vector multiplications, overcoming the fundamental energy efficiency limitations of digital logic. They have been shown to be effective in special-purpose accelerators for a limited set of neural network applications. We present the Programmable Ultra-efficient Memristor-based Accelerator (PUMA), which enhances memristor crossbars with general purpose execution units to enable the acceleration of a wide variety of Machine Learning (ML) inference workloads....

10.1145/3297858.3304049 article EN 2019-04-04

The wide adoption of deep neural networks has been accompanied by ever-increasing energy and performance demands due to the expensive nature of training them. Numerous special-purpose architectures have been proposed to accelerate training: both digital and hybrid digital-analog using resistive RAM (ReRAM) crossbars. ReRAM-based accelerators have demonstrated the effectiveness of ReRAM crossbars at performing matrix-vector multiplication operations that are prevalent in training. However, they still suffer from...

10.1109/tc.2020.2998456 article EN publisher-specific-oa IEEE Transactions on Computers 2020-05-30

Providing low-latency access to large amounts of data is one of the foremost requirements for many web services. To address these needs, systems such as Memcached have been created which provide a distributed, all in-memory key-value store. These systems are critical and often deployed across hundreds or thousands of servers. However, they are not well matched to commodity servers: they require significant CPU resources to achieve reasonable network bandwidth, yet the core functions do not benefit from the high performance of standard...

10.1145/2435264.2435306 article EN 2013-02-11

We propose memristor-based TCAM (Ternary Content Addressable Memory) circuits to accelerate Regular Expression (RegEx) matching through in-memory processing of finite automata. RegEx matching is a key function in network security to find malicious actors. However, RegEx matching latency and power can be incredibly high, and current proposals are challenged to perform wire-speed matching for large rulesets. Our approach dramatically decreases operating power, enables high throughput, and the use of nanoscale memristor TCAM circuits (mTCAMs) enables compression...

10.1109/tnano.2019.2936239 article EN IEEE Transactions on Nanotechnology 2019-01-01

Memristor crossbars are circuits capable of performing analog matrix-vector multiplications, overcoming the fundamental energy efficiency limitations of digital logic. They have been shown to be effective in special-purpose accelerators for a limited set of neural network applications. We present the Programmable Ultra-efficient Memristor-based Accelerator (PUMA), which enhances memristor crossbars with general purpose execution units to enable the acceleration of a wide variety of Machine Learning (ML) inference workloads....

10.48550/arxiv.1901.10351 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Heterogeneous CPU-FPGA systems are evolving towards tighter integration between CPUs and FPGAs for improved performance and energy efficiency. At the same time, programmability is also improving with High Level Synthesis tools (e.g., OpenCL Software Development Kits), which allow programmers to express their designs in high-level programming languages and avoid time-consuming and error-prone register-transfer level (RTL) programming. In the traditional loosely-coupled accelerator mode, FPGAs work as offload...

10.1145/3297663.3310305 article EN 2019-04-04

ReRAM-based accelerators have shown great potential for accelerating DNN inference because ReRAM crossbars can perform analog matrix-vector multiplication (MVM) operations with low latency and energy consumption. However, these crossbars require the use of ADCs, which constitute a significant fraction of the cost of MVM operations. The overhead of ADCs can be mitigated via partial sum quantization. However, prior quantization flows do not consider partial sum quantization, which is not highly relevant to traditional digital architectures. To address this issue, we propose...

10.1145/3394885.3431554 article EN Proceedings of the 26th Asia and South Pacific Design Automation Conference 2021-01-18

The increasing deployment of machine learning at the core and edge for applications such as video and image recognition has resulted in a number of special purpose accelerators in this domain. However, these accelerators do not have full end-to-end software stacks for application development, resulting in hard-to-develop, proprietary, and suboptimal application programming and executables. In this paper, we describe a software stack for a memristor-based hybrid (analog-digital) accelerator. The software stack consists of an ONNX converter, optimizer, compiler, driver, and emulators....

10.1109/icrc.2018.8638612 article EN 2018-11-01

Traditional high-performance computing and modern artificial intelligence are converging, with workflows as a common paradigm. We predict nine principles of heterogeneity and serverless for this convergence, from high-level programming to low-level hardware.

10.1109/mc.2023.3332973 article EN Computer 2024-01-01

This paper presents an architecture and implementation details for MORA, a novel coarse grained reconfigurable processor for accelerating media processing applications. The MORA architecture involves a 2-D array of several such processors, to deliver low cost, high throughput performance in media processing applications. A distinguishing feature of the MORA architecture is the co-design of hardware architecture and low-level programming language throughout the design cycle. The implementation details for the single MORA processor, and benchmark evaluation using a cycle accurate simulator, are presented.

10.1109/ahs.2009.37 article EN 2009-07-01

Heterogeneous computing offers a promising solution for energy efficient computing in the data center. FPGA based heterogeneous computing is an especially promising direction since it allows the creation of custom hardware solutions for data centric parallel applications. One of the main issues delaying the widespread adoption of FPGAs as mainstream high performance computing devices is the difficulty in programming them. OpenCL was meant to address the difficulties and non-uniformity related to programming heterogeneous devices; unfortunately, because of its complexity, it sets a high bar for many software programmers,...

10.1109/fpl.2014.6927442 article EN 2014-09-01

This brief presents the implementation and evaluation of an 8-bit adaptable processor core to be part of a power-throughput-area efficient multimedia oriented reconfigurable architecture array. The design was custom implemented in IBM's 90 nm CMOS technology and occupies 0.115 mm² of silicon area with approximately 70% utilized by circuits. The design shows a peak throughput performance of 75 MOPS/mW. Benchmarking...

10.1109/tvlsi.2012.2206063 article EN IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2012-07-25

MORA is a novel platform for high-level FPGA programming of streaming vector and matrix operations, aimed at multimedia applications. It consists of a soft array of pipelined low-complexity SIMD processors-in-memory (PIM). We present a Domain-Specific Language (DSL) for high-level programming of the MORA soft processor array. The DSL is embedded in C++, providing designers with a familiar language framework and the ability to compile designs using a standard compiler for functional testing before generating the FPGA bitstream using the MORA toolchain. The paper discusses the MORA-C++...

10.1109/asap.2010.5540750 article EN 2010-07-01

In this paper, we present the design and evaluation of two new processing elements for reconfigurable computing. We also present a circuit-level implementation of the data paths in static and dynamic design styles to explore the various performance-power tradeoffs involved. When implemented in the IBM 90-nm CMOS process, the 8-b data paths achieve operating frequencies ranging over 1 GHz for both static and dynamic implementations, with each data path supporting single-cycle computational capability. A novel single-precision floating point processing element (FPPE) using a 24-b...

10.1109/tvlsi.2012.2220868 article EN IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2012-10-23

The deceleration of transistor feature size scaling has motivated growing adoption of specialized accelerators implemented as GPUs, FPGAs, ASICs, and more recently new types of computing such as neuromorphic, bio-inspired, ultra low energy, reversible, stochastic, optical, quantum, combinations, and others unforeseen. There is a tension between specialization and generalization, with the current state trending to master slave models where accelerators (slaves) are instructed by a general purpose system (master) running an...

10.1109/icrc.2017.8123649 article EN 2017-11-01

Regular expression (RegEx) matching is a key function in network security, where matching of packet data against known malicious signatures filters and alerts against active network intrusions. RegExs are widely used in open source and commercial network security systems as they can easily and concisely represent complex patterns like those in malicious signatures. However, the latency and power required to perform RegEx matching is incredibly high, and current approaches to this problem struggle to achieve > 1 Gbps on real-world rulesets while internet wirespeeds continue...

10.1109/icrc.2018.8638603 article EN 2018-11-01

This paper presents new power efficient high throughput data paths for portable multimedia devices. The various data paths provide support for dense arithmetic operations. The work provides the performance evaluation of a library of reconfigurable data path elements (Processing Elements) previously proposed and two new processing element architectures to be part of portable multimedia systems. The results show that the new designs will provide higher efficiency in area and power consumption compared to the previously suggested designs and commercial solutions, and could prove highly beneficial...

10.1109/reconfig.2008.58 article EN 2008-12-01

This paper presents an FPGA implementation of a low cost 8 bit reconfigurable processor core for media processing applications. The core is optimized to provide all basic arithmetic and logic functions required by the media processing and other domains, as well as to make it easily integrable into a 2D array. This paper presents an investigation of the feasibility and potential of the soft processor architecture for FPGA platforms. The core was synthesized on the entire Virtex FPGA family to evaluate its overall performance, scalability and portability. A special feature of the proposed architecture is a simple programming model...

10.1109/fpl.2009.5272461 article EN 2009-08-01

We propose using memristor-based TCAMs (Ternary Content Addressable Memory) to accelerate Regular Expression (RegEx) matching. RegEx matching is a key function in network security, where deep packet inspection finds and filters out malicious actors. However, RegEx matching latency and power can be incredibly high, and current proposals are challenged to perform wire-speed matching for large scale rulesets. Our approach dramatically decreases RegEx matching operating power, provides high throughput, and the use of mTCAMs enables novel compression...

10.1145/3232195.3232201 article EN 2018-07-17

Changes in Moore's law and Dennard's scaling made hardware accelerators critical for performance improvement, but configuring them for performance, area, and energy efficiency is hard and requires expert knowledge. High-Level Synthesis (HLS) tools enable the design of hardware accelerators for FPGAs to be done in high-level languages, reducing the cost and time needed but still requiring expert configuration. This paper presents an open-source, flexible and virtualized autotuner for LegUp HLS parameters. Our optimization target was the Weighted Normalized Sum (WNS) of 8...

10.1109/reconfig.2017.8279778 article EN 2017-12-01

This work presents an effort to bridge the gap between abstract high level programming and OpenCL by extending an existing high level Java programming framework (APARAPI), based on OpenCL, so that it can be used to program FPGAs at a high level of abstraction and with increased ease of programmability. We run several real world algorithms to assess the performance of the framework on both a low end and a high end system. On the low end and high end systems respectively we observed up to 78-80 percent power reduction and 4.8X-5.3X speed increase running NBody simulation, as well as up to 65-80 percent power reduction and 6.2X-7X speed increase for a KMeans, MapReduce...

10.48550/arxiv.1408.4964 preprint EN other-oa arXiv (Cornell University) 2014-01-01

We propose an FPGA design for the relevancy computation part of a high-throughput real-time search application. The application matches terms in a stream of documents against a static profile, held in off-chip memory. We present a mathematical analysis of the throughput of the application and apply it to the problem of scaling the Bloom filter used to discard nonmatches.

10.1155/2012/507173 article EN cc-by International Journal of Reconfigurable Computing 2012-01-01
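
The abstract above mentions a Bloom filter used to discard nonmatching terms before the relevancy computation. As a rough software illustration of that general idea only (not the paper's FPGA design), a minimal sketch might look like the following. The class name `BloomFilter`, the helper `filter_stream`, the SHA-256 double-hashing scheme, and the sizes used are all assumptions chosen for illustration.

```python
import hashlib
from typing import Iterable, Iterator, List


class BloomFilter:
    """Illustrative Bloom filter; parameters and hashing are assumptions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per position, packed into bytes.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, term: str) -> Iterator[int]:
        # Derive k bit positions from two 64-bit halves of one SHA-256
        # digest (double hashing). Forcing h2 odd avoids a zero stride.
        digest = hashlib.sha256(term.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, term: str) -> None:
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, term: str) -> bool:
        # False means "definitely not in the profile";
        # True means "possibly in the profile" (false positives can occur).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(term)
        )


def filter_stream(terms: Iterable[str], profile: BloomFilter) -> List[str]:
    """Keep only terms that may be in the profile; discard sure nonmatches."""
    return [t for t in terms if profile.might_contain(t)]
```

Terms that survive `filter_stream` would still need an exact lookup against the stored profile, since a Bloom filter never produces false negatives but can produce false positives.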

In this paper, we describe a novel scheme for radiation hardening of high performance pipelined architectures and data paths. The proposed technique uses a local ground bus decoupled from the global ground bus using an additional pull down device, to detect a transient error. Combining the detector output with duplicated pipeline registers enables the instruction execution through the data path to be repeated as soon as an error is detected. The outputs of the various stages in the pipeline are manipulated to maintain correctness in the event of error detection...

10.1109/ahs.2010.5546228 article EN 2010-06-01