Mohamed M. Sabry Aly

ORCID: 0000-0002-8018-1264
Research Areas
  • Advanced Memory and Neural Computing
  • Advanced Neural Network Applications
  • Semiconductor materials and devices
  • Ferroelectric and Negative Capacitance Devices
  • Parallel Computing and Optimization Techniques
  • Advanced Image and Video Retrieval Techniques
  • Low-power high-performance VLSI design
  • CCD and CMOS Imaging Sensors
  • Domain Adaptation and Few-Shot Learning
  • Adversarial Robustness in Machine Learning
  • VLSI and FPGA Design Techniques
  • Anomaly Detection Techniques and Applications
  • Analog and Mixed-Signal Circuit Design
  • Advanced Data Storage Technologies
  • Neural Networks and Applications
  • Sparse and Compressive Sensing Techniques
  • Image and Signal Denoising Methods
  • Advanced Data Compression Techniques
  • 3D IC and TSV technologies
  • Interconnection Networks and Systems
  • Spacecraft Design and Technology
  • Physical Unclonable Functions (PUFs) and Hardware Security
  • Error Correcting Code Techniques
  • Quantum-Dot Cellular Automata
  • Advanced Image Processing Techniques

Nanyang Technological University
2018-2024

University of Sharjah
2023

Agency for Science, Technology and Research
2021

Polytechnique Montréal
2019

University of Nebraska–Lincoln
2019

Stanford University
2015-2018

Next-generation information technologies will process unprecedented amounts of loosely structured data that overwhelm existing computing systems. N3XT improves the energy efficiency of abundant-data applications 1,000-fold by using new logic and memory technologies, 3D integration with fine-grained connectivity, and architectures for computation immersed in memory.

10.1109/mc.2015.376 article EN Computer 2015-12-01

The world's appetite for analyzing massive amounts of structured and unstructured data has grown dramatically. The computational demands of these abundant-data applications, such as deep learning, far exceed the capabilities of today's computing systems and are unlikely to be met with isolated improvements in transistor or memory technologies, or integrated circuit architectures alone. To achieve unprecedented functionality, speed, and energy efficiency, one must create transformative nanosystems based on...

10.1109/jproc.2018.2882603 article EN publisher-specific-oa Proceedings of the IEEE 2018-12-27

As Deep Neural Networks (DNNs) usually are overparameterized and have millions of weight parameters, it is challenging to deploy these large DNN models on resource-constrained hardware platforms, e.g., smartphones. Numerous network compression methods such as pruning and quantization have been proposed to reduce the model size significantly, in which the key is to find a suitable compression allocation (e.g., sparsity and codebook) for each layer. Existing solutions obtain this allocation in an iterative/manual fashion while finetuning the compressed model, thus...

10.1609/aaai.v35i9.16950 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2021-05-18
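As background for the per-layer sparsity allocation the abstract above refers to, a minimal magnitude-pruning sketch might look as follows. This is an illustrative helper, not the paper's automated allocation method; the function name and the tie-breaking behavior are assumptions.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of a layer's weights.

    `sparsity` is the target fraction of zeros (e.g., 0.5 keeps the
    largest-magnitude half). Illustrative only: ties at the threshold
    are all pruned, so the realized sparsity can slightly exceed the target.
    """
    w = weights.copy()
    k = int(round(sparsity * w.size))
    if k == 0:
        return w
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(w), k - 1, axis=None)[k - 1]
    w[np.abs(w) <= threshold] = 0.0
    return w
```

In automated compression-allocation work, a different `sparsity` would be chosen per layer (e.g., by a search or learning procedure) rather than fixed globally.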

Non-volatility is emerging as an essential on-chip memory characteristic across a wide range of application domains, from edge nodes for the Internet of Things (IoT) to large computing clusters. On-chip non-volatile memory (NVM) is critical for low-energy operation, real-time responses, privacy and security, operation in unpredictable environments, and fault-tolerance [1]. Existing NVMs (e.g., Flash, FRAM, EEPROM) suffer from high read/write energy/latency, low density, and integration challenges. For example, an ideal IoT...

10.1109/isscc.2019.8662402 article EN 2019 IEEE International Solid-State Circuits Conference (ISSCC) 2019-02-01

Fast Fourier Transform (FFT) is an essential algorithm for numerous scientific and engineering applications. It is key to implement FFT in a high-performance and energy-efficient manner. In this paper, we leverage the properties of ultrasonic wave propagation in silicon for computation. We introduce SonicFFT, a system architecture for ultrasonic-based FFT acceleration. To evaluate the benefits of SonicFFT, we use a compact-model based simulation framework that quantifies the performance and energy of an integrated system comprising digital computing...

10.1109/asp-dac52403.2022.9712586 article EN 2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC) 2022-01-17
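For context, the textbook radix-2 Cooley-Tukey recursion that FFT accelerators implement can be sketched as below. This is a generic reference version for illustration, unrelated to the ultrasonic hardware itself.

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two.

    Splits the input into even- and odd-indexed halves, transforms each
    recursively, then combines them with twiddle factors, giving
    O(n log n) work versus O(n^2) for the direct DFT.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```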

Embedded deep learning platforms have witnessed two simultaneous improvements. First, the accuracy of convolutional neural networks (CNNs) has been significantly improved through the use of automated neural-architecture search (NAS) algorithms to determine the CNN structure. Second, there is increasing interest in developing hardware accelerators for CNNs that provide improved inference performance and energy consumption compared to GPUs. Such embedded accelerators differ in the amount of compute resources and memory-access bandwidth, which...

10.1109/islped.2019.8824934 article EN 2019-07-01

As Deep Neural Networks (DNNs) usually are overparameterized and have millions of weight parameters, it is challenging to deploy these large DNN models on resource-constrained hardware platforms, e.g., smartphones. Numerous network compression methods such as pruning and quantization have been proposed to reduce the model size significantly, in which the key is to find a suitable compression allocation (e.g., sparsity and codebook) for each layer. Existing solutions obtain this allocation in an iterative/manual fashion while finetuning the compressed model, thus...

10.48550/arxiv.2205.11141 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Deep neural networks (DNNs) have been widely used in many artificial intelligence (AI) tasks. However, deploying them brings significant challenges due to the huge cost of memory, energy, and computation. To address these challenges, researchers have developed various model compression techniques such as quantization and pruning. Recently, there has been a surge of research on compression methods to achieve model efficiency while retaining performance. Furthermore, more and more works focus on customizing DNN hardware accelerators to better...

10.1109/tnnls.2024.3394494 article EN IEEE Transactions on Neural Networks and Learning Systems 2024-01-01

The world's appetite for abundant-data computing, where a massive amount of structured and unstructured data is analyzed, has increased dramatically. The computational demands of these applications, such as deep learning, far exceed the capabilities of today's systems, especially energy-constrained embedded systems (e.g., mobile systems with limited battery capacity). These demands are unlikely to be met by isolated improvements in transistor or memory technologies, or integrated circuit (IC) architectures alone....

10.1145/3125502.3125531 article EN 2017-10-15

Wireless body sensor nodes (WBSNs) are miniaturized devices that are able to acquire, process and transmit bio-signals (such as electrocardiograms, respiration or human-body kinetics). WBSNs face major design challenges due to extremely limited power budgets and very small form factors. We demonstrate, for the first time in the literature, the use of disruptive nanotechnologies to create new nano-engineered ultra-low-power (ULP) WBSN architectures. Compared to state-of-the-art multi-core designs, our architectures...

10.1145/2968456.2968464 article EN 2016-10-01

For the first time, we investigated ultra-short-channel ZnO thin-film FETs with Lch = 8 nm and an extremely scaled channel thickness tZnO of 3 nm. The device exhibits ultra-low sub-pA/µm off-state leakage (1.2 pA/µm), high electron mobility (µeff = 84 cm²/V·s) and record peak transconductance (Gm) of 254 µS/µm at V...

10.1109/vlsitechnologyandcir46769.2022.9830250 article EN 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits) 2022-06-12

This paper builds a novel Machine-Learning-based classification framework that performs multi-class classification optimally when the features are scarce. It uses the context of undergraduate STEM courses for making periodic predictions of student performance at a fine-grained level. It shows that neither a vanilla classifier nor an ensemble performs well when the number of features is small (during early predictions). The hybrid ML framework predicts four classes at equally spaced intervals during the semester: at-risk (grade below C), prone to risk (C), ok (B), and good (A)....

10.1109/csci49370.2019.00157 article EN 2019 International Conference on Computational Science and Computational Intelligence (CSCI) 2019-12-01

Multiplication is an important fundamental operation that is critical in most signal and image processing applications. It is also essential for all types of wireless communications. We compare general multipliers from an architecture point of view in terms of maximum clock frequency, latency, throughput, resource usage, as well as dynamic power consumption. We use a flopped combinational baseline multiplier in our comparison, and we use the same FPGA platform to ensure a fair analysis. We conclude that the regular approach of inferring DSP elements in HDL...

10.1109/icedsa.2012.6507811 article EN 2012-11-01

Spin Transfer Torque Random Access Memory (STT-RAM) has garnered interest due to its various characteristics such as non-volatility, low leakage power, and high density. Its magnetic properties have a vital role in STT switching operations through thermal effectiveness. A key challenge for STT-RAM industrial adoption is the write energy and latency. In this paper, we overcome this by exploiting the stochastic behavior of cells and, in tandem, circuit-level approximation. We enforce the robustness of our technique...

10.1109/access.2022.3194679 article EN cc-by IEEE Access 2022-01-01

The proliferation of advanced analytics and artificial intelligence has been driven by huge volumes of data that are mostly generated at the edge. Simultaneously, there is a rising demand to perform analytics on edge platforms (i.e., near-sensor analytics). However, conventional architectures may not execute the targeted applications in an energy-efficient manner. Emerging near- and in-memory computing paradigms can increase energy efficiency by relying on emerging logic and memory devices. More importantly, these...

10.23919/date48585.2020.9116423 article EN Design, Automation & Test in Europe Conference & Exhibition (DATE) 2020-03-01

Current data-centric workloads, such as deep learning, expose the memory-access inefficiencies of current computing systems. Monolithic 3D integration can overcome this limitation by leveraging fine-grained and dense vertical connectivity to enable massively-concurrent accesses between compute and memory units. Thin-Film Transistors (TFTs) and Resistive RAM (RRAM) are naturally suited to monolithic 3D integration as they are fabricated at low temperature (a crucial requirement). In this paper, we explore ZnO-based TFTs and HfO2...

10.23919/date48585.2020.9116410 article EN Design, Automation & Test in Europe Conference & Exhibition (DATE) 2020-03-01

Non-maximum Suppression (NMS) is an essential post-processing step in modern convolutional neural networks for object detection. Unlike convolutions, which are inherently parallel, the de-facto standard for NMS, namely GreedyNMS, cannot be easily parallelized and thus could become the performance bottleneck in detection pipelines. MaxpoolNMS is introduced as a parallelizable alternative to GreedyNMS, which in turn enables faster speed than GreedyNMS at comparable accuracy. However, it is only capable of replacing the first stage of two-stage...

10.1109/cvpr46437.2021.01558 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
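The sequential GreedyNMS baseline that the abstract above contrasts with parallelizable alternatives can be sketched minimally in Python. This is an illustrative reference, not the paper's implementation; the corner-coordinate box format and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Repeatedly keep the highest-scoring box, then discard all boxes
    overlapping it above `iou_thresh`. The data dependence between
    iterations is what makes this loop hard to parallelize."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```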

Non-maximum Suppression (NMS) in one- and two-stage object detection deep neural networks (e.g., SSD and Faster-RCNN) is becoming the computation bottleneck. In this paper, we introduce a hardware acceleration architecture for the scalable PSRR-MaxpoolNMS algorithm. Our architecture shows 75.0× and 305× speedups compared to software implementations of PSRR-MaxpoolNMS as well as GreedyNMS, respectively, while simultaneously achieving comparable Mean Average Precision (mAP) to software-based floating-point implementations....

10.23919/date54114.2022.9774717 article EN Design, Automation & Test in Europe Conference & Exhibition (DATE) 2022-03-14

This paper addresses a challenging problem: how to reduce energy consumption without incurring a performance drop when deploying deep neural networks (DNNs) at the inference stage. In order to alleviate computation and storage burdens, we propose a novel dataflow-based joint quantization approach with the hypothesis that a fewer number of operations would incur less information loss and thus improve the final performance. It first introduces a scheme with efficient bit-shifting and rounding to represent network parameters...

10.48550/arxiv.1901.02064 preprint EN other-oa arXiv (Cornell University) 2019-01-01
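A bit-shift-friendly rounding scheme of the kind the abstract above mentions can be illustrated by rounding each parameter to its nearest power of two, so that multiplication by a weight reduces to a binary shift. This is a generic sketch of the idea, not the paper's exact quantization scheme; the function name is an assumption.

```python
import math

def quantize_pow2(x):
    """Round a value to the nearest signed power of two.

    Rounding is done in the log2 domain; the result is exactly
    representable as a shift of the other multiplicand, which is the
    hardware motivation for this style of quantization.
    """
    if x == 0:
        return 0.0
    sign = 1.0 if x > 0 else -1.0
    exp = round(math.log2(abs(x)))  # nearest integer exponent
    return sign * (2.0 ** exp)
```

With such weights, `w * a` becomes `a << exp` (or `a >> -exp`) in fixed-point hardware, eliminating full multipliers from the datapath.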

This paper addresses a challenging problem: how to reduce energy consumption without incurring a performance drop when deploying deep neural networks (DNNs) at the inference stage. In order to alleviate computation and storage burdens, we propose a novel dataflow-based joint quantization approach with the hypothesis that a fewer number of operations would incur less information loss and thus improve the final performance. It first introduces a scheme with efficient bit-shifting and rounding to represent network parameters...

10.1109/dcc.2019.00086 article EN 2019-03-01