NFDI4DS | UHH-SEMS - Publication Details

Sai Rahul Chalamalasetti

ORCID: 0000-0001-9004-440X

About

Contact & Profiles

A5079257666

Research Areas

Parallel Computing and Optimization Techniques
Embedded Systems Design Techniques
Advanced Memory and Neural Computing
Ferroelectric and Negative Capacitance Devices
Interconnection Networks and Systems
Cloud Computing and Resource Management
Network Packet Processing and Optimization
Low-power high-performance VLSI design
Radiation Effects in Electronics
Distributed and Parallel Computing Systems
Algorithms and Data Compression
VLSI and Analog Circuit Testing
CCD and CMOS Imaging Sensors
Caching and Content Delivery
Machine Learning in Materials Science
Stochastic Gradient Optimization Techniques
Advanced Image and Video Retrieval Techniques
Text and Document Classification Technologies
Scientific Computing and Data Management
Digital Filter Design and Implementation
Digital Transformation in Industry
Polymer Nanocomposite Synthesis and Irradiation
Video Coding and Compression Technologies
Advanced Data Storage Technologies
Advanced Optical Sensing Technologies

Hewlett-Packard (United States)
2013-2024

Matrix Research (United States)
2024

Hewlett Packard Enterprise (United States)
2018-2021

Universidade de São Paulo
2017

Silicon Labs (United States)
2014

University of Massachusetts Lowell
2008-2013

University of Massachusetts Amherst
2010

PUMA

OPENALEX - Publications

Aayush Ankit Izzat El Hajj Sai Rahul Chalamalasetti Geoffrey Ndu Martin Foltín and 6 more

Memristor crossbars are circuits capable of performing analog matrix-vector multiplications, overcoming the fundamental energy efficiency limitations of digital logic. They have been shown to be effective in special-purpose accelerators for a limited set of neural network applications. We present the Programmable Ultra-efficient Memristor-based Accelerator (PUMA) which enhances memristor crossbars with general purpose execution units to enable the acceleration of a wide variety of Machine Learning (ML) inference workloads....

10.1145/3297858.3304049 article EN 2019-04-04

PANTHER: A Programmable Architecture for Neural Network Training Harnessing Energy-Efficient ReRAM

OPENALEX - Publications

Aayush Ankit Izzat El Hajj Sai Rahul Chalamalasetti Sapan Agarwal Matthew Marinella and 5 more

The wide adoption of deep neural networks has been accompanied by ever-increasing energy and performance demands due to the expensive nature of training them. Numerous special-purpose architectures have been proposed to accelerate training: both digital and hybrid digital-analog using resistive RAM (ReRAM) crossbars. ReRAM-based accelerators have demonstrated the effectiveness of ReRAM crossbars at performing matrix-vector multiplication operations that are prevalent in training. However, they still suffer from...

10.1109/tc.2020.2998456 article EN publisher-specific-oa IEEE Transactions on Computers 2020-05-30

An FPGA memcached appliance

OPENALEX - Publications

Sai Rahul Chalamalasetti Kevin Lim Mitch Wright Alvin AuYoung Parthasarathy Ranganathan and 1 more

Providing low-latency access to large amounts of data is one of the foremost requirements for many web services. To address these needs, systems such as Memcached have been created which provide a distributed, all in-memory key-value store. These systems are critical and often deployed across hundreds or thousands of servers. However, these systems are not well matched for commodity servers, as they require significant CPU resources to achieve reasonable network bandwidth, yet the core Memcached functions do not benefit from the high performance of standard...

10.1145/2435264.2435306 article EN 2013-02-11

Memristor TCAMs Accelerate Regular Expression Matching for Network Intrusion Detection

OPENALEX - Publications

Catherine E. Graves Sity Lam Xuema Li Lennie Kiyama Martin Foltín and 10 more

We propose memristor-based TCAMs (Ternary Content Addressable Memory) circuits to accelerate Regular Expression (RegEx) matching through in-memory processing of finite automata. RegEx matching is a key function in network security to find malicious actors. However, RegEx matching latency and power can be incredibly high and current proposals are challenged to perform wire-speed matching for large rulesets. Our approach dramatically decreases RegEx matching operating power, enables high throughput, and the use of nanoscale memristor TCAM circuits (mTCAMs) enables compression...

10.1109/tnano.2019.2936239 article EN IEEE Transactions on Nanotechnology 2019-01-01

PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference

OPENALEX - Publications

Aayush Ankit Izzat El Hajj Sai Rahul Chalamalasetti Geoffrey Ndu Martin Foltín and 6 more

10.48550/arxiv.1901.10351 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Analysis and Modeling of Collaborative Execution Strategies for Heterogeneous CPU-FPGA Architectures

OPENALEX - Publications

Sitao Huang Li‐Wen Chang Izzat El Hajj Simon Garcia de Gonzalo Juan Gómez-Luna and 6 more

Heterogeneous CPU-FPGA systems are evolving towards tighter integration between CPUs and FPGAs for improved performance and energy efficiency. At the same time, programmability is also improving with High Level Synthesis tools (e.g., OpenCL Software Development Kits), which allow programmers to express their designs with high-level programming languages, and avoid time-consuming and error-prone register-transfer level (RTL) programming. In the traditional loosely-coupled accelerator mode, FPGAs work as offload...

10.1145/3297663.3310305 article EN 2019-04-04

Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators

OPENALEX - Publications

Sitao Huang Aayush Ankit Plínio Silveira Rodrigo Antunes Sai Rahul Chalamalasetti and 13 more

ReRAM-based accelerators have shown great potential for accelerating DNN inference because ReRAM crossbars can perform analog matrix-vector multiplication operations with low latency and energy consumption. However, these crossbars require the use of ADCs which constitute a significant fraction of the cost of MVM operations. The overhead of ADCs can be mitigated via partial sum quantization. However, prior quantization flows for DNN inference accelerators do not consider partial sum quantization which is not highly relevant to traditional digital architectures. To address this issue, we propose...

10.1145/3394885.3431554 article EN Proceedings of the 26th Asia and South Pacific Design Automation Conference 2021-01-18

Hardware-Software Co-Design for an Analog-Digital Accelerator for Machine Learning

OPENALEX - Publications

Joao Ambrosi Aayush Ankit Rodrigo Antunes Sai Rahul Chalamalasetti Soumitra Chatterjee and 15 more

The increasing deployment of machine learning at the core and at the edge for applications such as video and image recognition has resulted in a number of special purpose accelerators in this domain. However, these accelerators do not have full end-to-end software stacks for application development, resulting in hard-to-develop, proprietary, and suboptimal application programming and executables. In this paper, we describe a software stack for a memristor-based hybrid (analog-digital) accelerator. The software stack consists of an ONNX converter, an application optimizer, a compiler, a driver, and emulators....

10.1109/icrc.2018.8638612 article EN 2018-11-01

Predicting Heterogeneity and Serverless Principles of Converged High-Performance Computing, Artificial Intelligence, and Workflows

OPENALEX - Publications

Pedro Bruel Sai Rahul Chalamalasetti Aditya Dhakal Eitan Frachtenberg Ninad Hogade and 5 more

Traditional high-performance computing and modern artificial intelligence are converging with workflows as a common paradigm. We predict nine principles of heterogeneity and serverless for this convergence, from high-level programming to low-level hardware.

10.1109/mc.2023.3332973 article EN Computer 2024-01-01

MORA - An Architecture and Programming Model for a Resource Efficient Coarse Grained Reconfigurable Processor

OPENALEX - Publications

Sai Rahul Chalamalasetti Sohan Purohit Martin Margala Wim Vanderbauwhede

This paper presents an architecture and implementation details for MORA, a novel coarse grained reconfigurable processor for accelerating media processing applications. The MORA architecture involves a 2-D array of several such processors, to deliver low cost, high throughput performance in media processing applications. A distinguishing feature of the MORA architecture is the co-design of hardware architecture and low-level programming language throughout the design cycle. The implementation details for the single MORA processor, and benchmark evaluation using a cycle accurate simulator are presented.

10.1109/ahs.2009.37 article EN 2009-07-01

High level programming framework for FPGAs in the data center

OPENALEX - Publications

Oren Segal Martin Margala Sai Rahul Chalamalasetti Mitch Wright

Heterogeneous computing offers a promising solution for energy efficient computing in the data center. FPGA based heterogeneous computing is an especially promising direction since it allows for the creation of custom hardware solutions for data centric parallel applications. One of the main issues delaying the wide spread adoption of FPGAs as main stream high performance computing devices is the difficulty in programming them. OpenCL was meant to address the difficulties and non-uniformity related to programming heterogeneous devices, unfortunately because of its complexity it sets a high bar for many software programmers,...

10.1109/fpl.2014.6927442 article EN 2014-09-01

Throughput/Resource-Efficient Reconfigurable Processor for Multimedia Applications

OPENALEX - Publications

Sohan Purohit Sai Rahul Chalamalasetti Martin Margala Wim Vanderbauwhede

This brief presents the implementation and evaluation of an 8-bit adaptable processor core to be part of a power-throughput-area efficient multimedia oriented reconfigurable architecture array. The design was custom implemented in IBM's 90 nm CMOS technology and occupies 0.115 mm² of silicon area with approximately 70% utilized by circuits. The design shows a peak throughput performance of 75 MOPS/mW. Benchmarking...

10.1109/tvlsi.2012.2206063 article EN IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2012-07-25

A C++-embedded Domain-Specific Language for programming the MORA soft processor array

OPENALEX - Publications

Wim Vanderbauwhede Martin Margala Sai Rahul Chalamalasetti Sohan Purohit

MORA is a novel platform for high-level FPGA programming of streaming vector and matrix operations, aimed at multimedia applications. It consists of a soft array of pipelined low-complexity SIMD processors-in-memory (PIM). We present a Domain-Specific Language (DSL) for high-level programming of the MORA soft processor array. The DSL is embedded in C++, providing designers with a familiar language framework and the ability to compile designs using a standard compiler for functional testing before generating the FPGA bitstream using the MORA toolchain. The paper discusses the MORA-C++...

10.1109/asap.2010.5540750 article EN 2010-07-01

Design and Evaluation of High-Performance Processing Elements for Reconfigurable Systems

OPENALEX - Publications

Sohan Purohit Sai Rahul Chalamalasetti Martin Margala Wim Vanderbauwhede

In this paper, we present the design and evaluation of two new processing elements for reconfigurable computing. We also present a circuit-level implementation of the data paths in static and dynamic design styles to explore the various performance-power tradeoffs involved. When implemented in IBM 90-nm CMOS process, the 8-b data paths achieve operating frequencies ranging over 1 GHz for both static and dynamic implementations, with each data path supporting single-cycle computational capability. A novel single-precision floating point processing element (FPPE) using a 24-b...

10.1109/tvlsi.2012.2220868 article EN IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2012-10-23

Generalize or Die: Operating Systems Support for Memristor-Based Accelerators

OPENALEX - Publications

Pedro Bruel Sai Rahul Chalamalasetti Chris Dalton Izzat El Hajj Alfredo Goldman and 6 more

The deceleration of transistor feature size scaling has motivated growing adoption of specialized accelerators implemented as GPUs, FPGAs, ASICs, and more recently new types of computing such as neuromorphic, bio-inspired, ultra low energy, reversible, stochastic, optical, quantum, combinations, and others unforeseen. There is a tension between specialization and generalization, with the current state trending to master slave models where accelerators (slaves) are instructed by a general purpose system (master) running an...

10.1109/icrc.2017.8123649 article EN 2017-11-01

Regular Expression Matching with Memristor TCAMs

OPENALEX - Publications

Catherine E. Graves Wen Ma Xia Sheng B. Buchanan Le Zheng and 7 more

Regular expression (RegEx) matching is a key function in network security, where matching of packet data against known malicious signatures filters and alerts of active network intrusions. RegExs are widely used in open source and commercial network security systems as they easily and concisely represent complex patterns like those in malicious signatures. However, the latency and power required to perform RegEx matching is incredibly high and approaches to this problem struggle to achieve > 1 Gbps on real-world rulesets while internet wirespeeds continue...

10.1109/icrc.2018.8638603 article EN 2018-11-01

Power-Efficient High Throughput Reconfigurable Datapath Design for Portable Multimedia Devices

OPENALEX - Publications

Sohan Purohit Sai Rahul Chalamalasetti Martin Margala Pasquale Corsonello

This paper presents new power efficient high throughput data paths for portable multimedia devices. The various data paths provide support for dense arithmetic operations. The work provides the performance evaluation for a library of reconfigurable data path elements (Processing Elements) previously proposed and two new processing element architectures to be part of portable, high throughput multimedia systems. The results show that the new designs will provide higher efficiency in power and area consumption compared to suggested commercial solutions, and could prove highly beneficial...

10.1109/reconfig.2008.58 article EN 2008-12-01

A low cost reconfigurable soft processor for multimedia applications: Design synthesis and programming model

OPENALEX - Publications

Sai Rahul Chalamalasetti Wim Vanderbauwhede Sohan Purohit Martin Margala

This paper presents an FPGA implementation of a low cost 8 bit reconfigurable processor core for media processing applications. The core is optimized to provide all basic arithmetic and logic functions required by the media processing and other domains, as well as to make it easily integrable into a 2D array. The paper presents an investigation of the feasibility and potential of the soft processor architecture for FPGA platforms. The core was synthesized on the entire Virtex FPGA family to evaluate its overall performance, scalability and portability. A special feature of the proposed architecture is a simple programming model...

10.1109/fpl.2009.5272461 article EN 2009-08-01

Regular Expression Matching with Memristor TCAMs for Network Security

OPENALEX - Publications

Catherine E. Graves Wen Ma Xia Sheng B. Buchanan Le Zheng and 7 more

We propose using memristor-based TCAMs (Ternary Content Addressable Memory) to accelerate Regular Expression (RegEx) matching. RegEx matching is a key function in network security, where deep packet inspection finds and filters out malicious actors. However, RegEx matching latency and power can be incredibly high and current proposals are challenged to perform wire-speed matching for large scale rulesets. Our approach dramatically decreases RegEx matching operating power, provides high throughput, and the use of mTCAMs enables novel compression...

10.1145/3232195.3232201 article EN 2018-07-17

Autotuning high-level synthesis for FPGAs using OpenTuner and LegUp

OPENALEX - Publications

Pedro Bruel Alfredo Goldman Sai Rahul Chalamalasetti Dejan Milojičić

Changes in Moore's law and Dennard's scaling made hardware accelerators critical for performance improvement, but configuring them for performance, area, and energy efficiency is hard and requires expert knowledge. High-Level Synthesis (HLS) tools enable hardware design for FPGAs to be done in high-level languages, reducing the cost and time needed, but still requiring expert configuration. This paper presents an open-source, flexible and virtualized autotuner for LegUp HLS parameters. Our optimization target was the Weighted Normalized Sum (WNS) of 8...

10.1109/reconfig.2017.8279778 article EN 2017-12-01

High Level Programming for Heterogeneous Architectures

OPENALEX - Publications

Oren Segal Martin Margala Sai Rahul Chalamalasetti Mitch Wright

This work presents an effort to bridge the gap between abstract high level programming and OpenCL by extending an existing high level Java programming framework (APARAPI), based on OpenCL, so that it can be used to program FPGAs at a high level of abstraction and increased ease of programmability. We run several real world algorithms to assess the performance of the framework on both a low end and a high end system. On the low end and high end systems respectively we observed up to 78-80 percent power reduction and 4.8X-5.3X speed increase running NBody simulation, as well as up to 65-80 percent power reduction and 6.2X-7X speed increase for a KMeans, MapReduce...

10.48550/arxiv.1408.4964 preprint EN other-oa arXiv (Cornell University) 2014-01-01

Throughput Analysis for a High-Performance FPGA-Accelerated Real-Time Search Application

OPENALEX - Publications

Wim Vanderbauwhede Sai Rahul Chalamalasetti Martin Margala

We propose an FPGA design for the relevancy computation part of a high-throughput real-time search application. The application matches terms in a stream of documents against a static profile, held in off-chip memory. We present a mathematical analysis of the throughput of the application and apply it to the problem of scaling the Bloom filter used to discard nonmatches.

10.1155/2012/507173 article EN cc-by International Journal of Reconfigurable Computing 2012-01-01

Low overhead soft error detection and correction scheme for reconfigurable pipelined data paths

OPENALEX - Publications

Sohan Purohit Sai Rahul Chalamalasetti Martin Margala

In this paper, we describe a novel scheme for radiation hardening of high performance pipelined architectures and data paths. The proposed technique uses a local ground bus decoupled from the global ground bus using an additional pull down device, to detect a transient error. Combining the detector output with duplicated pipeline registers enables instruction execution through the data path to be repeated as soon as the error is detected. The outputs of the various pipeline stages in the data path are manipulated to maintain correctness in the event of error detection...

10.1109/ahs.2010.5546228 article EN 2010-06-01
