- Parallel Computing and Optimization Techniques
- Embedded Systems Design Techniques
- Advanced Memory and Neural Computing
- Ferroelectric and Negative Capacitance Devices
- Interconnection Networks and Systems
- Cloud Computing and Resource Management
- Network Packet Processing and Optimization
- Low-power high-performance VLSI design
- Radiation Effects in Electronics
- Distributed and Parallel Computing Systems
- Algorithms and Data Compression
- VLSI and Analog Circuit Testing
- CCD and CMOS Imaging Sensors
- Caching and Content Delivery
- Machine Learning in Materials Science
- Stochastic Gradient Optimization Techniques
- Advanced Image and Video Retrieval Techniques
- Text and Document Classification Technologies
- Scientific Computing and Data Management
- Digital Filter Design and Implementation
- Digital Transformation in Industry
- Polymer Nanocomposite Synthesis and Irradiation
- Video Coding and Compression Technologies
- Advanced Data Storage Technologies
- Advanced Optical Sensing Technologies
Hewlett-Packard (United States)
2013-2024
Matrix Research (United States)
2024
Hewlett Packard Enterprise (United States)
2018-2021
Universidade de São Paulo
2017
Silicon Labs (United States)
2014
University of Massachusetts Lowell
2008-2013
University of Massachusetts Amherst
2010
Memristor crossbars are circuits capable of performing analog matrix-vector multiplications, overcoming the fundamental energy efficiency limitations of digital logic. They have been shown to be effective in special-purpose accelerators for a limited set of neural network applications. We present the Programmable Ultra-efficient Memristor-based Accelerator (PUMA), which enhances memristor crossbars with general purpose execution units to enable the acceleration of a wide variety of Machine Learning (ML) inference workloads....
The wide adoption of deep neural networks has been accompanied by ever-increasing energy and performance demands due to the expensive nature of training them. Numerous special-purpose architectures have been proposed to accelerate training: both digital and hybrid digital-analog using resistive RAM (ReRAM) crossbars. ReRAM-based accelerators have demonstrated the effectiveness of ReRAM crossbars at performing matrix-vector multiplication operations that are prevalent in training. However, they still suffer from...
Providing low-latency access to large amounts of data is one of the foremost requirements for many web services. To address these needs, systems such as Memcached have been created which provide a distributed, all in-memory key-value store. These systems are critical and often deployed across hundreds or thousands of servers. However, they are not well matched to commodity servers, as they require significant CPU resources to achieve reasonable network bandwidth, yet the core functions do not benefit from the high performance of standard...
We propose memristor-based Ternary Content Addressable Memory (TCAM) circuits to accelerate Regular Expression (RegEx) matching through in-memory processing of finite automata. RegEx matching is a key function in network security to find malicious actors. However, RegEx matching latency and power can be incredibly high, and current proposals are challenged to perform wire-speed matching for large rulesets. Our approach dramatically decreases operating power, enables high throughput, and the use of nanoscale memristor TCAM circuits (mTCAMs) enables compression...
Heterogeneous CPU-FPGA systems are evolving towards tighter integration between CPUs and FPGAs for improved performance and energy efficiency. At the same time, programmability is also improving with High Level Synthesis tools (e.g., OpenCL Software Development Kits), which allow programmers to express their designs with high-level programming languages, and avoid time-consuming and error-prone register-transfer level (RTL) programming. In the traditional loosely-coupled accelerator mode, FPGAs work as offload...
ReRAM-based accelerators have shown great potential for accelerating DNN inference because ReRAM crossbars can perform analog matrix-vector multiplication operations with low latency and energy consumption. However, these require the use of ADCs, which constitute a significant fraction of the cost of MVM operations. The overhead of ADCs can be mitigated via partial sum quantization. However, prior quantization flows do not consider partial sum quantization, which is not highly relevant to traditional digital architectures. To address this issue, we propose...
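As a rough illustration of the partial sum quantization mentioned in the abstract above, the following Python sketch splits a matrix-vector product into row tiles (standing in for crossbars) and passes each tile's partial sum through a uniform ADC model before accumulating digitally. The function names, the tile size, the unsigned uniform quantizer, and the `full_scale` parameter are all assumptions made for this example; they are not taken from the paper.

```python
def quantize(value, bits, full_scale):
    """Hypothetical uniform ADC model: clip to [0, full_scale] and
    round to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    step = full_scale / levels
    clipped = max(0.0, min(full_scale, value))
    return round(clipped / step) * step


def tiled_mvm(matrix, vector, tile_rows, adc_bits, full_scale):
    """Compute y[j] = sum_i vector[i] * matrix[i][j] tile by tile.

    Each group of `tile_rows` rows stands in for one crossbar; its
    partial sum is quantized (the ADC step) before digital accumulation.
    """
    n_cols = len(matrix[0])
    out = [0.0] * n_cols
    for start in range(0, len(matrix), tile_rows):
        rows = range(start, min(start + tile_rows, len(matrix)))
        for j in range(n_cols):
            partial = sum(vector[i] * matrix[i][j] for i in rows)
            out[j] += quantize(partial, adc_bits, full_scale)
    return out


if __name__ == "__main__":
    weights = [[1, 2], [3, 4], [5, 6], [7, 8]]
    inputs = [1, 0, 2, 1]
    # High-resolution ADC versus a coarse 2-bit ADC on the same tiles.
    print(tiled_mvm(weights, inputs, tile_rows=2, adc_bits=8, full_scale=255.0))
    print(tiled_mvm(weights, inputs, tile_rows=2, adc_bits=2, full_scale=30.0))
```

Lowering `adc_bits` reduces the number of levels each partial sum can take, which is the accuracy versus ADC cost trade-off the abstract alludes to.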
The increasing deployment of machine learning at the core and edge for applications such as video and image recognition has resulted in a number of special purpose accelerators in this domain. However, these do not have full end-to-end software stacks for application development, resulting in hard-to-develop, proprietary, and suboptimal application programming and executables. In this paper, we describe a software stack for a memristor-based hybrid (analog-digital) accelerator. The stack consists of an ONNX converter, optimizer, compiler, driver, and emulators....
Traditional high-performance computing and modern artificial intelligence are converging with workflows as a common paradigm. We predict nine principles of heterogeneity and serverless for this convergence, from high-level programming to low-level hardware.
This paper presents an architecture and implementation details for MORA, a novel coarse grained reconfigurable processor for accelerating media processing applications. The MORA architecture involves a 2-D array of several such processors, to deliver low cost, high throughput performance in media processing applications. A distinguishing feature of the MORA architecture is the co-design of hardware architecture and low-level programming language throughout the design cycle. The implementation details for the single MORA processor, and benchmark evaluation using a cycle accurate simulator, are presented.
Heterogeneous computing offers a promising solution for energy efficient computing in the data center. FPGA based heterogeneous computing is an especially promising direction since it allows for the creation of custom hardware solutions for data centric parallel applications. One of the main issues delaying the wide spread adoption of FPGAs as main stream high performance computing devices is the difficulty in programming them. OpenCL was meant to address the difficulties and non-uniformity related to programming heterogeneous devices; unfortunately, because of its complexity, it sets a high bar for many software programmers,...
This brief presents the implementation and evaluation of an 8-bit adaptable processor core to be part of a power-throughput-area efficient multimedia oriented reconfigurable architecture array. The design was custom implemented in IBM's 90 nm CMOS technology and occupies 0.115 mm² of silicon area with approximately 70% of the area utilized by circuits. The design shows a peak throughput performance of 75 MOPS/mW. Benchmarking...
MORA is a novel platform for high-level FPGA programming of streaming vector and matrix operations, aimed at multimedia applications. It consists of a soft array of pipelined low-complexity SIMD processors-in-memory (PIM). We present a Domain-Specific Language (DSL) for high-level programming of the MORA soft processor array. The DSL is embedded in C++, providing designers with a familiar language framework and the ability to compile designs using a standard compiler for functional testing before generating the FPGA bitstream using the MORA toolchain. The paper discusses the MORA-C++...
In this paper, we present the design and evaluation of two new processing elements for reconfigurable computing. We also present a circuit-level implementation of the data paths in static and dynamic styles to explore the various performance-power tradeoffs involved. When implemented in the IBM 90-nm CMOS process, the 8-b data paths achieve operating frequencies ranging over 1 GHz for both static and dynamic implementations, with each data path supporting single-cycle computational capability. A novel single-precision floating point processing element (FPPE) using a 24-b...
The deceleration of transistor feature size scaling has motivated growing adoption of specialized accelerators implemented as GPUs, FPGAs, ASICs, and more recently new types of computing such as neuromorphic, bio-inspired, ultra low energy, reversible, stochastic, optical, quantum, combinations, and others unforeseen. There is a tension between specialization and generalization, with the current state trending to master slave models where accelerators (slaves) are instructed by a general purpose system (master) running an...
Regular expression (RegEx) matching is a key function in network security, where matching of packet data against known malicious signatures filters and alerts against active network intrusions. RegExs are widely used in open source and commercial network security systems as they can easily and concisely represent complex patterns like those in malicious signatures. However, the latency and power required to perform RegEx matching are incredibly high, and approaches to this problem struggle to achieve > 1 Gbps on real-world rulesets while internet wirespeeds continue...
This paper presents new power efficient high throughput data paths for portable multimedia devices. The various data paths provide support for dense arithmetic operations. The work provides the performance evaluation of a library of reconfigurable data path elements (Processing Elements) previously proposed and two new processing element architectures to be part of portable, reconfigurable multimedia systems. The results show that the new designs will provide higher efficiency in power and area consumption compared to the suggested commercial solutions, and could prove highly beneficial...
This paper presents an FPGA implementation of a low cost 8 bit reconfigurable processor core for media processing applications. The core is optimized to provide all basic arithmetic and logic functions required by the media processing and other domains, as well as to make it easily integrable into a 2D array. The paper presents an investigation of the feasibility and potential of the soft processor architecture for FPGA platforms. The core was synthesized on the entire Virtex FPGA family to evaluate its overall performance, scalability and portability. A special feature of the proposed architecture is the simple programming model...
We propose using memristor-based TCAMs (Ternary Content Addressable Memory) to accelerate Regular Expression (RegEx) matching. RegEx matching is a key function in network security, where deep packet inspection finds and filters out malicious actors. However, RegEx matching latency and power can be incredibly high, and current proposals are challenged to perform wire-speed matching for large scale rulesets. Our approach dramatically decreases RegEx matching operating power, provides high throughput, and the use of mTCAMs enables novel compression...
Changes in Moore's law and Dennard's scaling made hardware accelerators critical for performance improvement, but configuring them for performance, area, and energy efficiency is hard and requires expert knowledge. High-Level Synthesis (HLS) tools enable hardware design for FPGAs to be done in high-level languages, reducing the cost and time needed, but still requiring expert configuration. This paper presents an open-source, flexible and virtualized autotuner for LegUp HLS parameters. Our optimization target was the Weighted Normalized Sum (WNS) of 8...
This work presents an effort to bridge the gap between abstract high level programming and OpenCL by extending an existing high level Java programming framework (APARAPI), based on OpenCL, so that it can be used to program FPGAs at a high level of abstraction and increased ease of programmability. We run several real world algorithms to assess the performance of the framework on both a low end and a high end system. On the low end and high end systems respectively we observed up to 78-80 percent power reduction and 4.8X-5.3X speed increase running NBody simulation, as well as up to 65-80 percent power reduction and 6.2X-7X speed increase for KMeans, MapReduce...
We propose an FPGA design for the relevancy computation part of a high-throughput real-time search application. The application matches terms in a stream of documents against a static profile, held in off-chip memory. We present a mathematical analysis of the throughput of the application and apply it to the problem of scaling the Bloom filter used to discard nonmatches.
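To make the role of the Bloom filter in the abstract above concrete, here is a minimal software sketch in Python of using one to discard document terms that cannot be in a profile, so that only candidate terms would need a lookup in the slower profile store. The class, the SHA-256 based hashing, the sizes, and the `candidate_terms` helper are assumptions chosen for illustration; the paper's FPGA design and its parameters are not reproduced here.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; sizes and hashing scheme are assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, term):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos] = 1

    def might_contain(self, term):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(term))


def candidate_terms(document_terms, profile_filter):
    """Keep only the terms that pass the filter; the rest are discarded
    without consulting the full profile."""
    return [t for t in document_terms if profile_filter.might_contain(t)]


if __name__ == "__main__":
    profile = BloomFilter()
    for term in ["fpga", "bloom", "search"]:
        profile.add(term)
    print(candidate_terms(["fpga", "weather", "search"], profile))
```

A Bloom filter never rejects a term that was added, but it can occasionally pass a term that was not, so terms that survive the filter still need an exact check against the profile.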
In this paper, we describe a novel scheme for radiation hardening of high performance pipelined architectures and data paths. The proposed technique uses a local ground bus decoupled from the global ground bus using an additional pull down device, to detect a transient error. Combining the detector output with duplicated pipeline registers enables the instruction execution through the data path to be repeated as soon as an error is detected. The outputs of the various stages in the pipeline are manipulated to maintain correctness in the event of error detection...