Mohamed Assem Ibrahim

ORCID: 0000-0002-4129-0310
Research Areas
  • Parallel Computing and Optimization Techniques
  • Advanced Data Storage Technologies
  • Advanced Memory and Neural Computing
  • Interconnection Networks and Systems
  • Cloud Computing and Resource Management
  • Network Packet Processing and Optimization
  • Particle Detector Development and Performance
  • Embedded Systems Design Techniques
  • Recommender Systems and Techniques
  • Magnetic confinement fusion research
  • Radio Frequency Integrated Circuit Design
  • Silicon Carbide Semiconductor Technologies
  • Plasma Diagnostics and Applications
  • Advanced Neural Network Applications
  • Distributed and Parallel Computing Systems
  • Caching and Content Delivery
  • Low-power high-performance VLSI design
  • Software Testing and Debugging Techniques
  • Superconducting Materials and Applications
  • Ferroelectric and Negative Capacitance Devices
  • Quantum-Dot Cellular Automata

Advanced Micro Devices (United States)
2021-2024

Advanced Micro Devices (Canada)
2021-2024

William & Mary
2017-2021

Williams (United States)
2020

Dynamic parallelism (DP) is a promising feature for GPUs, which allows on-demand spawning of kernels on the GPU without any CPU intervention. However, this feature has two major drawbacks. First, launching kernels can incur significant performance penalties. Second, dynamically-generated kernels are not always able to efficiently utilize the GPU cores due to hardware limits. To address these concerns cohesively, we propose SPAWN, a runtime framework that controls the dynamically-generated kernels, thereby directly reducing the associated launch overheads...
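Below is a minimal Python sketch of the launch-throttling intuition (an illustration only, not the SPAWN implementation; the overhead constants and the spawn heuristic are hypothetical): a toy runtime decides whether a dynamically-generated child kernel is worth its launch overhead or should instead be executed inline by the parent.

# Toy model of device-side kernel launch throttling (illustrative, not SPAWN).
# Assumption: each child-kernel launch pays a fixed overhead, so very small
# children are better folded into the parent's own threads.
LAUNCH_OVERHEAD_CYCLES = 5_000      # hypothetical per-launch cost
CYCLES_PER_WORK_ITEM = 40           # hypothetical cost of one work item

def should_spawn_child(work_items: int, free_sm_slots: int) -> bool:
    """Spawn a child kernel only if its launch overhead is amortized by the
    work it carries and there is spare hardware to run it concurrently."""
    overhead_ratio = LAUNCH_OVERHEAD_CYCLES / (work_items * CYCLES_PER_WORK_ITEM)
    return free_sm_slots > 0 and overhead_ratio < 0.25

if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(n, should_spawn_child(n, free_sm_slots=4))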

10.1109/hpca.2017.14 article EN 2017-02-01

Managing the thread-level parallelism (TLP) of GPGPU applications by limiting it to a certain degree is known to be effective in improving overall performance. However, we find that such prior techniques can lead to sub-optimal system throughput and fairness when two or more applications are co-scheduled on the same GPU. This is because they attempt to maximize the performance of individual applications in isolation, ultimately allowing each application to take a disproportionate amount of shared resources. This leads to high contention in the shared cache and memory. To...
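The throughput/fairness tension can be quantified with standard multiprogram metrics; the Python sketch below (my own illustration with made-up IPC numbers, not the paper's evaluation code) computes system throughput (STP) and a fairness index from per-application IPCs measured alone and when co-scheduled.

# Standard multiprogram metrics for co-scheduled applications (illustrative).
def normalized_progress(ipc_shared, ipc_alone):
    # Progress of each co-scheduled application relative to running alone.
    return [s / a for s, a in zip(ipc_shared, ipc_alone)]

def system_throughput(ipc_shared, ipc_alone):
    # STP: sum of normalized progress across all co-scheduled applications.
    return sum(normalized_progress(ipc_shared, ipc_alone))

def fairness(ipc_shared, ipc_alone):
    # Ratio of slowest to fastest normalized progress (1.0 = perfectly fair).
    progress = normalized_progress(ipc_shared, ipc_alone)
    return min(progress) / max(progress)

ipc_alone = [2.0, 1.0]      # hypothetical IPCs when each app runs alone
ipc_shared = [1.6, 0.3]     # hypothetical IPCs when co-scheduled on one GPU
print("STP:", system_throughput(ipc_shared, ipc_alone))   # 1.1
print("fairness:", fairness(ipc_shared, ipc_alone))       # 0.375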

10.1109/hpca.2018.00030 article EN 2018-02-01

Graphics Processing Units (GPUs) use caches to provide on-chip bandwidth as a way to address the memory wall. However, the caches are not always efficiently utilized for optimal GPU performance. We find that the main source of this inefficiency stems from the tightly-coupled design of cores with L1 caches. First, such a design assumes a per-core private local cache in which each core independently caches its required data. This allows the same cache line to get replicated across cores, which wastes precious cache capacity. Second, due to the many-to-few traffic...
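A small counting sketch (illustrative; the core count, capacities, and access pattern are assumptions of mine) of why private per-core L1s waste capacity: when every core caches its own copy of the lines it touches, the number of resident copies can far exceed the number of distinct lines that a decoupled or shared organization would need to hold.

# Count cache-line replication across private per-core L1s (illustrative).
import random

random.seed(0)
NUM_CORES = 8
LINES_PER_CORE = 256        # hypothetical per-core L1 capacity in cache lines
WORKING_SET = 512           # hypothetical number of distinct hot lines shared by the kernel

# Each core independently caches lines drawn from the same shared working set.
per_core_l1 = [set(random.choices(range(WORKING_SET), k=LINES_PER_CORE))
               for _ in range(NUM_CORES)]

copies = sum(len(l1) for l1 in per_core_l1)        # resident lines, counting duplicates
distinct = len(set().union(*per_core_l1))          # resident lines after deduplication
print(f"copies: {copies}, distinct: {distinct}, "
      f"replication factor: {copies / distinct:.2f}")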

10.1109/hpca51647.2021.00047 article EN 2021-02-01

The Automata Processor (AP) accelerates applications from domains ranging from machine learning to genomics. However, as a spatial architecture, it is unable to handle larger automata programs without repeated reconfiguration and re-execution. To achieve high throughput, this paper proposes, for the first time, architectural support for the AP to efficiently execute large-scale applications. We find that a large number of existing and new Non-deterministic Finite Automata (NFA) based applications have states that are never enabled but still...
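A compact NFA-simulation sketch (my own illustration, not the paper's mechanism) of the observation that many states are never enabled for a given input: simulating the NFA while recording which states were ever activated identifies states that, for such inputs, need not occupy the spatial fabric.

# Simulate a small NFA and record which states are ever enabled (illustrative).
# transitions[state][symbol] -> set of next states.
transitions = {
    "q0": {"a": {"q1"}, "b": {"q0"}},
    "q1": {"a": {"q1"}, "b": {"q2"}},
    "q2": {"a": {"q3"}},
    "q3": {"b": {"q3"}},
}
start, accepting = {"q0"}, {"q2", "q3"}

def run(nfa, start_states, text):
    active = set(start_states)
    ever_enabled = set(start_states)
    for symbol in text:
        active = {nxt for s in active for nxt in nfa.get(s, {}).get(symbol, set())}
        ever_enabled |= active
    return active, ever_enabled

active, ever = run(transitions, start, "aab")
print("accepts:", bool(active & accepting))                 # True
print("never enabled:", sorted(set(transitions) - ever))    # ['q3'] for this input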

10.1109/micro.2018.00078 article EN 2018-10-01

Bandwidth achieved from local/shared caches and memory is a major performance determinant in Graphics Processing Units (GPUs). These existing sources of bandwidth are often not enough for optimal GPU performance. Therefore, to enhance the bandwidth further, we focus on efficiently unlocking an additional potential source of bandwidth, which we call remote-core bandwidth. This is based on the observation that a fraction of the data (i.e., L1 read misses) required by one core can also be found in the local (L1) caches of other cores. In...
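A back-of-the-envelope sketch (illustrative; the traces are made up) of the observation behind remote-core bandwidth: a fraction of one core's L1 read misses can be served from the L1s of peer cores rather than from L2 or memory.

# Estimate what fraction of a core's L1 misses could hit in peer L1s (illustrative).
peer_l1_contents = [            # hypothetical resident line addresses in peer cores' L1s
    {0x100, 0x140, 0x180, 0x1c0},
    {0x200, 0x140, 0x280, 0x2c0},
    {0x300, 0x340, 0x180, 0x3c0},
]
l1_read_misses = [0x140, 0x500, 0x180, 0x540, 0x2c0, 0x580]  # one core's miss stream

remote_hits = sum(any(addr in peer for peer in peer_l1_contents)
                  for addr in l1_read_misses)
print(f"remote-core hit fraction: {remote_hits / len(l1_read_misses):.0%}")  # 50%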

10.1109/pact.2019.00028 article EN 2019-09-01

Graphics Processing Units (GPUs) concurrently execute thousands of threads, which makes them effective for achieving high throughput across a wide range of applications. However, the memory wall often limits peak throughput. GPUs use caches to address this limitation, and hence several prior works have focused on improving cache hit rates, which in turn can improve the performance of memory-intensive applications. Almost all of them assume a conventional cache hierarchy where each GPU core has a private local L1 cache and all cores share the L2 cache. Our analysis shows that...

10.1145/3410463.3414623 article EN 2020-09-30

With unprecedented demand for generative AI (GenAI) inference, acceleration of primitives that dominate GenAI, such as general matrix-vector multiplication (GEMV), is receiving considerable attention. A challenge with GEMVs is the high memory bandwidth this primitive demands. Multiple vendors have proposed commercially viable processing-in-memory (PIM) prototypes that attain a bandwidth boost over the processor by augmenting memory banks with compute capabilities and broadcasting the same command to all banks. While these PIM designs...
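A numpy sketch of the general bank-level PIM idea for GEMV (my own illustration, not any vendor's design or the paper's mapping): the matrix is partitioned across banks, the same multiply-accumulate command is broadcast to every bank against its local slice, and each bank produces its portion of the output in parallel.

# GEMV y = A @ x split across DRAM banks with a broadcast command (illustrative).
import numpy as np

rng = np.random.default_rng(0)
NUM_BANKS, ROWS, COLS = 8, 64, 128
A = rng.standard_normal((ROWS, COLS)).astype(np.float32)
x = rng.standard_normal(COLS).astype(np.float32)

# Partition matrix rows across banks; every bank receives the same (broadcast)
# MAC command stream together with the shared input vector x.
bank_rows = np.array_split(np.arange(ROWS), NUM_BANKS)
y = np.empty(ROWS, dtype=np.float32)
for rows in bank_rows:            # in hardware these per-bank MACs proceed in parallel
    y[rows] = A[rows] @ x         # each bank multiplies only its local slice

assert np.allclose(y, A @ x, atol=1e-5)
print("per-bank GEMV matches the reference result")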

10.48550/arxiv.2403.20297 preprint EN arXiv (Cornell University) 2024-03-29

10.1145/3695794.3695796 article EN Proceedings of the International Symposium on Memory Systems 2024-09-30

This paper evaluates the efficacy of recent commercial processing-in-memory (PIM) solutions to accelerate the fast Fourier transform (FFT), an important primitive across several domains. Specifically, we observe that efficient implementations of FFT on modern GPUs are memory bandwidth bound. As such, the bandwidth boost availed by PIM makes a case for PIM-accelerated FFT. To this end, we first deduce a mapping of FFT computation onto a strawman PIM architecture representative of recent designs. We find that, even with careful data mapping, PIM is not effective in...
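A quick arithmetic-intensity estimate (illustrative; these are the usual textbook radix-2 constants and an assumed problem size, not the paper's model) of why large out-of-cache FFTs are memory bandwidth bound: an N-point complex FFT performs roughly 5·N·log2(N) flops, but each of the log2(N) stages streams the whole array through memory.

# Rough arithmetic-intensity estimate for a large out-of-cache FFT (illustrative).
import math

N = 1 << 26                      # 64M-point complex FFT (hypothetical size)
BYTES_PER_ELEM = 8               # complex64: 2 x float32
stages = math.log2(N)

flops = 5 * N * stages                            # textbook radix-2 flop count
bytes_moved = 2 * N * BYTES_PER_ELEM * stages     # read + write the array per stage

print(f"arithmetic intensity ~ {flops / bytes_moved:.2f} flop/byte")   # ~0.31
# With GPUs sustaining tens of flops per byte of DRAM bandwidth at peak,
# such a low intensity leaves the FFT limited by memory bandwidth.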

10.48550/arxiv.2308.03973 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Value prediction holds the promise of significantly improving performance and energy efficiency. However, if values are predicted incorrectly, significant overheads are observed due to execution rollbacks. To address these overheads, value approximation has been introduced, which leverages the observation that rollbacks are not necessary as long as the application-level loss in quality due to misprediction is acceptable to the user. In the context of Graphics Processing Units (GPUs), our evaluations show that existing approximate value predictors...
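A toy value-approximation sketch (illustration only, not the paper's predictor): a last-value predictor whose mispredictions are tolerated, rather than rolled back, whenever the relative error stays within a user-supplied quality threshold.

# Last-value predictor with tolerance-based (approximate) acceptance (illustrative).
def run_predictor(values, rel_tolerance=0.05):
    last, exact_hits, approx_hits, rollbacks = None, 0, 0, 0
    for v in values:
        if last is not None:
            pred = last                              # last-value prediction
            if pred == v:
                exact_hits += 1
            elif v != 0 and abs(pred - v) / abs(v) <= rel_tolerance:
                approx_hits += 1                     # wrong but close enough: no rollback
            else:
                rollbacks += 1                       # quality loss too large: roll back
        last = v
    return exact_hits, approx_hits, rollbacks

stream = [1.00, 1.00, 1.02, 1.03, 2.50, 2.52, 2.52, 0.10]   # hypothetical value stream
print(run_predictor(stream))                                 # (2, 3, 2)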

10.1145/3330345.3330362 article EN 2019-06-18

Deep neural network (DNN) based recommendation models (RMs) represent a class of critical workloads that are broadly used in social media, entertainment content, and online businesses. Given their pervasive usage, understanding the memory subsystem behavior of these models is crucial, particularly from the perspective of future memory system design. To this end, in this work, we first perform an in-depth memory footprint and traffic analysis of emerging RMs. We observe that RMs will severely stress existing (and possibly larger) caches and memories.
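A simple footprint/traffic estimate (illustrative; the table sizes, embedding width, and lookup counts are hypothetical, not the paper's workloads) for the embedding tables that dominate the memory behavior of DNN-based recommendation models.

# Embedding-table footprint and per-query traffic for a recommendation model (illustrative).
EMB_DIM = 64                 # embedding vector width
DTYPE_BYTES = 4              # float32
tables = {                   # table name -> (number of rows, lookups per query)
    "user_id":  (50_000_000, 1),
    "item_id":  (10_000_000, 30),
    "category": (100_000,    30),
}

footprint = sum(rows * EMB_DIM * DTYPE_BYTES for rows, _ in tables.values())
traffic_per_query = sum(lookups * EMB_DIM * DTYPE_BYTES for _, lookups in tables.values())

print(f"embedding footprint: {footprint / 2**30:.1f} GiB")
print(f"embedding traffic per query: {traffic_per_query / 2**10:.1f} KiB (sparse, cache-unfriendly)")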

10.1145/3488423.3519317 article EN 2021-09-27

Data format innovations have been critical for machine learning (ML) scaling, which in turn fuels ground-breaking ML capabilities. However, even in the presence of low-precision formats, model weights are often stored in both high-precision and low-precision formats during training. Furthermore, with emerging directional data formats (e.g., MX9, MX6, etc.), multiple weight copies can be required. To lower the memory capacity needs of weights, we explore just-in-time quantization (JIT-Q), where we store only the high-precision weights and generate low-precision copies when needed....
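A numpy sketch of the just-in-time idea (my own simplification; real MX-style formats are more involved): keep only the high-precision weights resident and generate a low-precision, block-scaled copy on demand instead of storing both.

# Just-in-time block quantization of weights (illustrative simplification of JIT-Q).
import numpy as np

def quantize_blockwise_int8(w_fp32: np.ndarray, block: int = 32):
    """Generate an int8, per-block-scaled copy of the weights on demand."""
    w = w_fp32.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
weights = rng.standard_normal(1024).astype(np.float32)   # the only copy kept resident

q, s = quantize_blockwise_int8(weights)                   # produced just in time
err = np.abs(dequantize(q, s) - weights).max()
print(f"max abs quantization error: {err:.4f}")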

10.48550/arxiv.2311.05034 preprint EN other-oa arXiv (Cornell University) 2023-01-01