Mohamed M. Sabry Aly

ORCID: 0000-0002-8018-1264
Research Areas
  • Advanced Memory and Neural Computing
  • Advanced Neural Network Applications
  • Semiconductor materials and devices
  • Ferroelectric and Negative Capacitance Devices
  • Parallel Computing and Optimization Techniques
  • Advanced Image and Video Retrieval Techniques
  • Low-power high-performance VLSI design
  • CCD and CMOS Imaging Sensors
  • Domain Adaptation and Few-Shot Learning
  • Adversarial Robustness in Machine Learning
  • VLSI and FPGA Design Techniques
  • Anomaly Detection Techniques and Applications
  • Analog and Mixed-Signal Circuit Design
  • Advanced Data Storage Technologies
  • Neural Networks and Applications
  • Sparse and Compressive Sensing Techniques
  • Image and Signal Denoising Methods
  • Advanced Data Compression Techniques
  • 3D IC and TSV technologies
  • Interconnection Networks and Systems
  • Spacecraft Design and Technology
  • Physical Unclonable Functions (PUFs) and Hardware Security
  • Error Correcting Code Techniques
  • Quantum-Dot Cellular Automata
  • Advanced Image Processing Techniques

Nanyang Technological University
2018-2024

University of Sharjah
2023

Agency for Science, Technology and Research
2021

Polytechnique Montréal
2019

University of Nebraska–Lincoln
2019

Stanford University
2015-2018

Next-generation information technologies will process unprecedented amounts of loosely structured data that overwhelm existing computing systems. N3XT improves the energy efficiency of abundant-data applications 1,000-fold by using new logic and memory technologies, 3D integration with fine-grained connectivity, and architectures for computation immersed in memory.

10.1109/mc.2015.376 article EN Computer 2015-12-01

The world's appetite for analyzing massive amounts of structured and unstructured data has grown dramatically. The computational demands of these abundant-data applications, such as deep learning, far exceed the capabilities of today's computing systems and are unlikely to be met with isolated improvements in transistor or memory technologies, or integrated circuit architectures alone. To achieve unprecedented functionality, speed, and energy efficiency, one must create transformative nanosystems based on...

10.1109/jproc.2018.2882603 article EN publisher-specific-oa Proceedings of the IEEE 2018-12-27

As Deep Neural Networks (DNNs) usually are overparameterized and have millions of weight parameters, it is challenging to deploy these large DNN models on resource-constrained hardware platforms, e.g., smartphones. Numerous network compression methods such as pruning and quantization have been proposed to reduce the model size significantly, in which the key is to find a suitable compression allocation (e.g., sparsity and codebook) for each layer. Existing solutions obtain this allocation in an iterative/manual fashion while finetuning the compressed model, thus...

10.1609/aaai.v35i9.16950 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2021-05-18
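As background for the per-layer sparsity allocation the abstract above refers to, a minimal magnitude-pruning sketch might look as follows. This is an illustrative helper, not the paper's automated allocation method; the function name and the tie-breaking behavior are assumptions.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of a layer's weights.

    `sparsity` is the target fraction of zeros (e.g., 0.5 keeps the
    largest-magnitude half). Illustrative only: ties at the threshold
    are all pruned, so the realized sparsity can slightly exceed the target.
    """
    w = weights.copy()
    k = int(round(sparsity * w.size))
    if k == 0:
        return w
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(w), k - 1, axis=None)[k - 1]
    w[np.abs(w) <= threshold] = 0.0
    return w
```

In automated compression-allocation work, a different `sparsity` would be chosen per layer (e.g., by a search or learning procedure) rather than fixed globally.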

Non-volatility is emerging as an essential on-chip memory characteristic across a wide range of application domains, from edge nodes for the Internet of Things (IoT) to large computing clusters. On-chip non-volatile memory (NVM) is critical for low-energy operation, real-time responses, privacy and security, operation in unpredictable environments, and fault-tolerance [1]. Existing NVMs (e.g., Flash, FRAM, EEPROM) suffer from high read/write energy/latency, low density, and integration challenges. For example, an ideal IoT...

10.1109/isscc.2019.8662402 article EN 2019 IEEE International Solid-State Circuits Conference (ISSCC) 2019-02-01

Fast Fourier Transform (FFT) is an essential algorithm for numerous scientific and engineering applications. It is key to implement FFT in a high-performance and energy-efficient manner. In this paper, we leverage the properties of ultrasonic wave propagation in silicon for computation. We introduce SonicFFT, a system architecture for ultrasonic-based FFT acceleration. To evaluate the benefits of SonicFFT, we use a compact-model based simulation framework that quantifies the performance and energy of an integrated system comprising digital computing...

10.1109/asp-dac52403.2022.9712586 article EN 2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC) 2022-01-17
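For context, the textbook radix-2 Cooley-Tukey recursion that FFT accelerators implement can be sketched as below. This is a generic reference version for illustration, unrelated to the ultrasonic hardware itself.

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two.

    Splits the input into even- and odd-indexed halves, transforms each
    recursively, then combines them with twiddle factors, giving
    O(n log n) work versus O(n^2) for the direct DFT.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```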

Embedded deep learning platforms have witnessed two simultaneous improvements. First, the accuracy of convolutional neural networks (CNNs) has been significantly improved through the use of automated neural-architecture search (NAS) algorithms to determine the CNN structure. Second, there is increasing interest in developing hardware accelerators for CNNs that provide improved inference performance and energy consumption compared to GPUs. Such embedded accelerators differ in the amount of compute resources and memory-access bandwidth, which...

10.1109/islped.2019.8824934 article EN 2019-07-01

As Deep Neural Networks (DNNs) usually are overparameterized and have millions of weight parameters, it is challenging to deploy these large DNN models on resource-constrained hardware platforms, e.g., smartphones. Numerous network compression methods such as pruning and quantization have been proposed to reduce the model size significantly, in which the key is to find a suitable compression allocation (e.g., sparsity and codebook) for each layer. Existing solutions obtain this allocation in an iterative/manual fashion while finetuning the compressed model, thus...

10.48550/arxiv.2205.11141 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Deep neural networks (DNNs) have been widely used in many artificial intelligence (AI) tasks. However, deploying them brings significant challenges due to the huge cost of memory, energy, and computation. To address these challenges, researchers have developed various model compression techniques such as quantization and pruning. Recently, there has been a surge of research on compression methods to achieve model efficiency while retaining performance. Furthermore, more and more works focus on customizing DNN hardware accelerators to better...

10.1109/tnnls.2024.3394494 article EN IEEE Transactions on Neural Networks and Learning Systems 2024-01-01

The world's appetite for abundant-data computing, where a massive amount of structured and unstructured data is analyzed, has increased dramatically. The computational demands of these applications, such as deep learning, far exceed the capabilities of today's systems, especially energy-constrained embedded systems (e.g., mobile systems with limited battery capacity). These demands are unlikely to be met by isolated improvements in transistor or memory technologies, or integrated circuit (IC) architectures alone....

10.1145/3125502.3125531 article EN 2017-10-15

Wireless body sensor nodes (WBSNs) are miniaturized devices that are able to acquire, process and transmit bio-signals (such as electrocardiograms, respiration or human-body kinetics). WBSNs face major design challenges due to extremely limited power budgets and very small form factors. We demonstrate, for the first time in the literature, the use of disruptive nanotechnologies to create new nano-engineered ultra-low-power (ULP) WBSN architectures. Compared to state-of-the-art multi-core designs, our architectures...

10.1145/2968456.2968464 article EN 2016-10-01

For the first time, we investigated ultra-short-channel ZnO thin-film FETs with Lch = 8 nm and an extremely scaled channel thickness tZnO of 3 nm. The device exhibits ultra-low sub-pA/µm off-state leakage (1.2 pA/µm), high electron mobility (µeff = 84 cm²/V·s) and record peak transconductance (Gm) of 254 µS/µm at V...

10.1109/vlsitechnologyandcir46769.2022.9830250 article EN 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits) 2022-06-12

This paper builds a novel Machine-Learning-based classification framework that performs multi-class classification optimally when the features are scarce. It uses the context of undergraduate STEM courses for making periodic predictions of student performance at a fine-grained level. It shows that neither a vanilla classifier nor an ensemble performs well when the number of features is small (during early predictions). The hybrid ML framework predicts four classes at equally spaced intervals during the semester: at-risk (grade below C), prone to risk (C), ok (B), and good (A)....

10.1109/csci49370.2019.00157 article EN 2019 International Conference on Computational Science and Computational Intelligence (CSCI) 2019-12-01

Multiplication is an important fundamental operation that is critical in most signal and image processing applications. It is also essential for all types of wireless communications. We compare general multipliers from an architecture point of view in terms of maximum clock frequency, latency, throughput, resource usage, as well as dynamic power consumption. We use a flopped combinational baseline multiplier in our comparison, and we use the same FPGA platform to ensure a fair analysis. We conclude that the regular approach of inferring DSP elements in HDL...

10.1109/icedsa.2012.6507811 article EN 2012-11-01

Spin Transfer Torque Random Access Memory (STT-RAM) has garnered interest due to its various characteristics such as non-volatility, low leakage power, and high density. Its magnetic properties have a vital role in STT switching operations through thermal effectiveness. A key challenge for STT-RAM industrial adoption is the write energy and latency. In this paper, we overcome this by exploiting the stochastic behavior of cells and, in tandem, circuit-level approximation. We enforce the robustness of our technique...

10.1109/access.2022.3194679 article EN cc-by IEEE Access 2022-01-01

The proliferation of advanced analytics and artificial intelligence has been driven by huge volumes of data that are mostly generated at the edge. Simultaneously, there is a rising demand to perform analytics on edge platforms (i.e., near-sensor analytics). However, conventional architectures may not execute the targeted applications in an energy-efficient manner. Emerging near- and in-memory computing paradigms can increase energy efficiency by relying on emerging logic and memory devices. More importantly, these...

10.23919/date48585.2020.9116423 article EN Design, Automation & Test in Europe Conference & Exhibition (DATE) 2020-03-01

Current data-centric workloads, such as deep learning, expose the memory-access inefficiencies of current computing systems. Monolithic 3D integration can overcome this limitation by leveraging fine-grained and dense vertical connectivity to enable massively-concurrent accesses between compute and memory units. Thin-Film Transistors (TFTs) and Resistive RAM (RRAM) are naturally suited to monolithic 3D integration as they are fabricated at low temperature (a crucial requirement). In this paper, we explore ZnO-based TFTs and HfO2...

10.23919/date48585.2020.9116410 article EN Design, Automation & Test in Europe Conference & Exhibition (DATE) 2020-03-01

Non-maximum Suppression (NMS) is an essential post-processing step in modern convolutional neural networks for object detection. Unlike convolutions, which are inherently parallel, the de-facto standard for NMS, namely GreedyNMS, cannot be easily parallelized and thus could become the performance bottleneck in detection pipelines. MaxpoolNMS is introduced as a parallelizable alternative to GreedyNMS, which in turn enables faster speed than GreedyNMS at comparable accuracy. However, it is only capable of replacing the first stage of two-stage...

10.1109/cvpr46437.2021.01558 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
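The sequential GreedyNMS baseline that the abstract above contrasts with parallelizable alternatives can be sketched minimally in Python. This is an illustrative reference, not the paper's implementation; the corner-coordinate box format and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Repeatedly keep the highest-scoring box, then discard all boxes
    overlapping it above `iou_thresh`. The data dependence between
    iterations is what makes this loop hard to parallelize."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```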

Non-maximum Suppression (NMS) in one- and two-stage object detection deep neural networks (e.g., SSD and Faster-RCNN) is becoming the computation bottleneck. In this paper, we introduce a hardware acceleration architecture for the scalable PSRR-MaxpoolNMS algorithm. Our architecture shows 75.0× and 305× speedups compared to software implementations of PSRR-MaxpoolNMS as well as GreedyNMS, respectively, while simultaneously achieving comparable Mean Average Precision (mAP) to software-based floating-point implementations....

10.23919/date54114.2022.9774717 article EN Design, Automation & Test in Europe Conference & Exhibition (DATE) 2022-03-14

This paper addresses a challenging problem: how to reduce energy consumption without incurring a performance drop when deploying deep neural networks (DNNs) at the inference stage. In order to alleviate computation and storage burdens, we propose a novel dataflow-based joint quantization approach with the hypothesis that a fewer number of operations would incur less information loss and thus improve the final performance. It first introduces a scheme with efficient bit-shifting and rounding to represent network parameters...

10.48550/arxiv.1901.02064 preprint EN other-oa arXiv (Cornell University) 2019-01-01
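A bit-shift-friendly rounding scheme of the kind the abstract above mentions can be illustrated by rounding each parameter to its nearest power of two, so that multiplication by a weight reduces to a binary shift. This is a generic sketch of the idea, not the paper's exact quantization scheme; the function name is an assumption.

```python
import math

def quantize_pow2(x):
    """Round a value to the nearest signed power of two.

    Rounding is done in the log2 domain; the result is exactly
    representable as a shift of the other multiplicand, which is the
    hardware motivation for this style of quantization.
    """
    if x == 0:
        return 0.0
    sign = 1.0 if x > 0 else -1.0
    exp = round(math.log2(abs(x)))  # nearest integer exponent
    return sign * (2.0 ** exp)
```

With such weights, `w * a` becomes `a << exp` (or `a >> -exp`) in fixed-point hardware, eliminating full multipliers from the datapath.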

This paper addresses a challenging problem: how to reduce energy consumption without incurring a performance drop when deploying deep neural networks (DNNs) at the inference stage. In order to alleviate computation and storage burdens, we propose a novel dataflow-based joint quantization approach with the hypothesis that a fewer number of operations would incur less information loss and thus improve the final performance. It first introduces a scheme with efficient bit-shifting and rounding to represent network parameters...

10.1109/dcc.2019.00086 article EN 2019-03-01