Mohamed Assem Ibrahim

ORCID: 0000-0002-4129-0310
Research Areas
  • Parallel Computing and Optimization Techniques
  • Advanced Data Storage Technologies
  • Advanced Memory and Neural Computing
  • Interconnection Networks and Systems
  • Cloud Computing and Resource Management
  • Network Packet Processing and Optimization
  • Particle Detector Development and Performance
  • Embedded Systems Design Techniques
  • Recommender Systems and Techniques
  • Magnetic confinement fusion research
  • Radio Frequency Integrated Circuit Design
  • Silicon Carbide Semiconductor Technologies
  • Plasma Diagnostics and Applications
  • Advanced Neural Network Applications
  • Distributed and Parallel Computing Systems
  • Caching and Content Delivery
  • Low-power high-performance VLSI design
  • Software Testing and Debugging Techniques
  • Superconducting Materials and Applications
  • Ferroelectric and Negative Capacitance Devices
  • Quantum-Dot Cellular Automata

Advanced Micro Devices (United States)
2021-2024

Advanced Micro Devices (Canada)
2021-2024

William & Mary
2017-2021

Williams (United States)
2020

Dynamic parallelism (DP) is a promising feature for GPUs, which allows on-demand spawning of kernels on the GPU without any CPU intervention. However, this feature has two major drawbacks. First, launching kernels can incur significant performance penalties. Second, dynamically-generated kernels are not always able to efficiently utilize the GPU cores due to hardware limits. To address these concerns cohesively, we propose SPAWN, a runtime framework that controls the dynamically-generated kernels, thereby directly reducing the associated launch overheads...
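Below is a minimal Python sketch of the launch-throttling intuition (an illustration only, not the SPAWN implementation; the overhead constants and the spawn heuristic are hypothetical): a toy runtime decides whether a dynamically-generated child kernel is worth its launch overhead or should instead be executed inline by the parent.

# Toy model of device-side kernel launch throttling (illustrative, not SPAWN).
# Assumption: each child-kernel launch pays a fixed overhead, so very small
# children are better folded into the parent's own threads.
LAUNCH_OVERHEAD_CYCLES = 5_000      # hypothetical per-launch cost
CYCLES_PER_WORK_ITEM = 40           # hypothetical cost of one work item

def should_spawn_child(work_items: int, free_sm_slots: int) -> bool:
    """Spawn a child kernel only if its launch overhead is amortized by the
    work it carries and there is spare hardware to run it concurrently."""
    overhead_ratio = LAUNCH_OVERHEAD_CYCLES / (work_items * CYCLES_PER_WORK_ITEM)
    return free_sm_slots > 0 and overhead_ratio < 0.25

if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(n, should_spawn_child(n, free_sm_slots=4))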

10.1109/hpca.2017.14 article EN 2017-02-01

Managing the thread-level parallelism (TLP) of GPGPU applications by limiting it to a certain degree is known to be effective in improving overall performance. However, we find that such prior techniques can lead to sub-optimal system throughput and fairness when two or more applications are co-scheduled on the same GPU. This is because they attempt to maximize the performance of individual applications in isolation, ultimately allowing each application to take a disproportionate amount of shared resources. This leads to high contention in the shared cache and memory. To...
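The throughput/fairness tension can be quantified with standard multiprogram metrics; the Python sketch below (my own illustration with made-up IPC numbers, not the paper's evaluation code) computes system throughput (STP) and a fairness index from per-application IPCs measured alone and when co-scheduled.

# Standard multiprogram metrics for co-scheduled applications (illustrative).
def normalized_progress(ipc_shared, ipc_alone):
    # Progress of each co-scheduled application relative to running alone.
    return [s / a for s, a in zip(ipc_shared, ipc_alone)]

def system_throughput(ipc_shared, ipc_alone):
    # STP: sum of normalized progress across all co-scheduled applications.
    return sum(normalized_progress(ipc_shared, ipc_alone))

def fairness(ipc_shared, ipc_alone):
    # Ratio of slowest to fastest normalized progress (1.0 = perfectly fair).
    progress = normalized_progress(ipc_shared, ipc_alone)
    return min(progress) / max(progress)

ipc_alone = [2.0, 1.0]      # hypothetical IPCs when each app runs alone
ipc_shared = [1.6, 0.3]     # hypothetical IPCs when co-scheduled on one GPU
print("STP:", system_throughput(ipc_shared, ipc_alone))   # 1.1
print("fairness:", fairness(ipc_shared, ipc_alone))       # 0.375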

10.1109/hpca.2018.00030 article EN 2018-02-01

Graphics Processing Units (GPUs) use caches to provide on-chip bandwidth as a way to address the memory wall. However, the caches are not always efficiently utilized for optimal GPU performance. We find that the main source of this inefficiency stems from the tightly-coupled design of cores with L1 caches. First, such a design assumes a per-core private local cache in which each core independently caches its required data. This allows the same cache line to get replicated across cores, which wastes precious cache capacity. Second, due to the many-to-few traffic...
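A small counting sketch (illustrative; the core count, capacities, and access pattern are assumptions of mine) of why private per-core L1s waste capacity: when every core caches its own copy of the lines it touches, the number of resident copies can far exceed the number of distinct lines that a decoupled or shared organization would need to hold.

# Count cache-line replication across private per-core L1s (illustrative).
import random

random.seed(0)
NUM_CORES = 8
LINES_PER_CORE = 256        # hypothetical per-core L1 capacity in cache lines
WORKING_SET = 512           # hypothetical number of distinct hot lines shared by the kernel

# Each core independently caches lines drawn from the same shared working set.
per_core_l1 = [set(random.choices(range(WORKING_SET), k=LINES_PER_CORE))
               for _ in range(NUM_CORES)]

copies = sum(len(l1) for l1 in per_core_l1)        # resident lines, counting duplicates
distinct = len(set().union(*per_core_l1))          # resident lines after deduplication
print(f"copies: {copies}, distinct: {distinct}, "
      f"replication factor: {copies / distinct:.2f}")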

10.1109/hpca51647.2021.00047 article EN 2021-02-01

The Automata Processor (AP) accelerates applications from domains ranging from machine learning to genomics. However, as a spatial architecture, it is unable to handle larger automata programs without repeated reconfiguration and re-execution. To achieve high throughput, this paper proposes, for the first time, architectural support for the AP to efficiently execute large-scale applications. We find that a large number of existing and new Non-deterministic Finite Automata (NFA) based applications have states that are never enabled but still...
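A compact NFA-simulation sketch (my own illustration, not the paper's mechanism) of the observation that many states are never enabled for a given input: simulating the NFA while recording which states were ever activated identifies states that, for such inputs, need not occupy the spatial fabric.

# Simulate a small NFA and record which states are ever enabled (illustrative).
# transitions[state][symbol] -> set of next states.
transitions = {
    "q0": {"a": {"q1"}, "b": {"q0"}},
    "q1": {"a": {"q1"}, "b": {"q2"}},
    "q2": {"a": {"q3"}},
    "q3": {"b": {"q3"}},
}
start, accepting = {"q0"}, {"q2", "q3"}

def run(nfa, start_states, text):
    active = set(start_states)
    ever_enabled = set(start_states)
    for symbol in text:
        active = {nxt for s in active for nxt in nfa.get(s, {}).get(symbol, set())}
        ever_enabled |= active
    return active, ever_enabled

active, ever = run(transitions, start, "aab")
print("accepts:", bool(active & accepting))                 # True
print("never enabled:", sorted(set(transitions) - ever))    # ['q3'] for this input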

10.1109/micro.2018.00078 article EN 2018-10-01

Bandwidth achieved from local/shared caches and memory is a major performance determinant in Graphics Processing Units (GPUs). These existing sources of bandwidth are often not enough for optimal GPU performance. Therefore, to enhance the bandwidth further, we focus on efficiently unlocking an additional potential source of bandwidth, which we call remote-core bandwidth. This is based on the observation that a fraction of the data (i.e., L1 read misses) required by one core can also be found in the local (L1) caches of other cores. In...
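A back-of-the-envelope sketch (illustrative; the traces are made up) of the observation behind remote-core bandwidth: a fraction of one core's L1 read misses can be served from the L1s of peer cores rather than from L2 or memory.

# Estimate what fraction of a core's L1 misses could hit in peer L1s (illustrative).
peer_l1_contents = [            # hypothetical resident line addresses in peer cores' L1s
    {0x100, 0x140, 0x180, 0x1c0},
    {0x200, 0x140, 0x280, 0x2c0},
    {0x300, 0x340, 0x180, 0x3c0},
]
l1_read_misses = [0x140, 0x500, 0x180, 0x540, 0x2c0, 0x580]  # one core's miss stream

remote_hits = sum(any(addr in peer for peer in peer_l1_contents)
                  for addr in l1_read_misses)
print(f"remote-core hit fraction: {remote_hits / len(l1_read_misses):.0%}")  # 50%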

10.1109/pact.2019.00028 article EN 2019-09-01

Graphics Processing Units (GPUs) concurrently execute thousands of threads, which makes them effective for achieving high throughput across a wide range of applications. However, the memory wall often limits peak throughput. GPUs use caches to address this limitation, and hence several prior works have focused on improving cache hit rates, which in turn can improve the performance of memory-intensive applications. Almost all of them assume a conventional cache hierarchy where each GPU core has a private local L1 cache and all cores share the L2 cache. Our analysis shows that...

10.1145/3410463.3414623 article EN 2020-09-30

With unprecedented demand for generative AI (GenAI) inference, acceleration of primitives that dominate GenAI, such as general matrix-vector multiplication (GEMV), is receiving considerable attention. A challenge with GEMVs is the high memory bandwidth this primitive demands. Multiple vendors have proposed commercially viable processing-in-memory (PIM) prototypes that attain a bandwidth boost over the processor by augmenting memory banks with compute capabilities and broadcasting the same command to all banks. While these PIM designs...
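A numpy sketch of the general bank-level PIM idea for GEMV (my own illustration, not any vendor's design or the paper's mapping): the matrix is partitioned across banks, the same multiply-accumulate command is broadcast to every bank against its local slice, and each bank produces its portion of the output in parallel.

# GEMV y = A @ x split across DRAM banks with a broadcast command (illustrative).
import numpy as np

rng = np.random.default_rng(0)
NUM_BANKS, ROWS, COLS = 8, 64, 128
A = rng.standard_normal((ROWS, COLS)).astype(np.float32)
x = rng.standard_normal(COLS).astype(np.float32)

# Partition matrix rows across banks; every bank receives the same (broadcast)
# MAC command stream together with the shared input vector x.
bank_rows = np.array_split(np.arange(ROWS), NUM_BANKS)
y = np.empty(ROWS, dtype=np.float32)
for rows in bank_rows:            # in hardware these per-bank MACs proceed in parallel
    y[rows] = A[rows] @ x         # each bank multiplies only its local slice

assert np.allclose(y, A @ x, atol=1e-5)
print("per-bank GEMV matches the reference result")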

10.48550/arxiv.2403.20297 preprint EN arXiv (Cornell University) 2024-03-29

10.1145/3695794.3695796 article EN Proceedings of the International Symposium on Memory Systems 2024-09-30

This paper evaluates the efficacy of recent commercial processing-in-memory (PIM) solutions to accelerate the fast Fourier transform (FFT), an important primitive across several domains. Specifically, we observe that efficient implementations of FFT on modern GPUs are memory bandwidth bound. As such, the bandwidth boost availed by PIM makes a case for PIM-accelerated FFT. To this end, we first deduce a mapping of FFT computation onto a strawman PIM architecture representative of recent designs. We find that, even with careful data mapping, PIM is not effective in...
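A quick arithmetic-intensity estimate (illustrative; these are the usual textbook radix-2 constants and an assumed problem size, not the paper's model) of why large out-of-cache FFTs are memory bandwidth bound: an N-point complex FFT performs roughly 5·N·log2(N) flops, but each of the log2(N) stages streams the whole array through memory.

# Rough arithmetic-intensity estimate for a large out-of-cache FFT (illustrative).
import math

N = 1 << 26                      # 64M-point complex FFT (hypothetical size)
BYTES_PER_ELEM = 8               # complex64: 2 x float32
stages = math.log2(N)

flops = 5 * N * stages                            # textbook radix-2 flop count
bytes_moved = 2 * N * BYTES_PER_ELEM * stages     # read + write the array per stage

print(f"arithmetic intensity ~ {flops / bytes_moved:.2f} flop/byte")   # ~0.31
# With GPUs sustaining tens of flops per byte of DRAM bandwidth at peak,
# such a low intensity leaves the FFT limited by memory bandwidth.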

10.48550/arxiv.2308.03973 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Value prediction holds the promise of significantly improving performance and energy efficiency. However, if values are predicted incorrectly, significant overheads are observed due to execution rollbacks. To address these overheads, value approximation has been introduced, which leverages the observation that rollbacks are not necessary as long as the application-level loss in quality due to misprediction is acceptable to the user. In the context of Graphics Processing Units (GPUs), our evaluations show that existing approximate value predictors...
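A toy value-approximation sketch (illustration only, not the paper's predictor): a last-value predictor whose mispredictions are tolerated, rather than rolled back, whenever the relative error stays within a user-supplied quality threshold.

# Last-value predictor with tolerance-based (approximate) acceptance (illustrative).
def run_predictor(values, rel_tolerance=0.05):
    last, exact_hits, approx_hits, rollbacks = None, 0, 0, 0
    for v in values:
        if last is not None:
            pred = last                              # last-value prediction
            if pred == v:
                exact_hits += 1
            elif v != 0 and abs(pred - v) / abs(v) <= rel_tolerance:
                approx_hits += 1                     # wrong but close enough: no rollback
            else:
                rollbacks += 1                       # quality loss too large: roll back
        last = v
    return exact_hits, approx_hits, rollbacks

stream = [1.00, 1.00, 1.02, 1.03, 2.50, 2.52, 2.52, 0.10]   # hypothetical value stream
print(run_predictor(stream))                                 # (2, 3, 2)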

10.1145/3330345.3330362 article EN 2019-06-18

Deep neural network (DNN) based recommendation models (RMs) represent a class of critical workloads that are broadly used in social media, entertainment content, and online businesses. Given their pervasive usage, understanding the memory subsystem behavior of these models is crucial, particularly from the perspective of future memory system design. To this end, in this work, we first perform an in-depth memory footprint and traffic analysis of emerging RMs. We observe that RMs will severely stress existing (and possibly larger) caches and memories.
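A simple footprint/traffic estimate (illustrative; the table sizes, embedding width, and lookup counts are hypothetical, not the paper's workloads) for the embedding tables that dominate the memory behavior of DNN-based recommendation models.

# Embedding-table footprint and per-query traffic for a recommendation model (illustrative).
EMB_DIM = 64                 # embedding vector width
DTYPE_BYTES = 4              # float32
tables = {                   # table name -> (number of rows, lookups per query)
    "user_id":  (50_000_000, 1),
    "item_id":  (10_000_000, 30),
    "category": (100_000,    30),
}

footprint = sum(rows * EMB_DIM * DTYPE_BYTES for rows, _ in tables.values())
traffic_per_query = sum(lookups * EMB_DIM * DTYPE_BYTES for _, lookups in tables.values())

print(f"embedding footprint: {footprint / 2**30:.1f} GiB")
print(f"embedding traffic per query: {traffic_per_query / 2**10:.1f} KiB (sparse, cache-unfriendly)")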

10.1145/3488423.3519317 article EN 2021-09-27

Data format innovations have been critical for machine learning (ML) scaling, which in turn fuels ground-breaking ML capabilities. However, even in the presence of low-precision formats, model weights are often stored in both high-precision and low-precision formats during training. Furthermore, with emerging directional data formats (e.g., MX9, MX6, etc.), multiple weight copies can be required. To lower the memory capacity needs of weights, we explore just-in-time quantization (JIT-Q), where we store only the high-precision weights and generate low-precision copies when needed....
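A numpy sketch of the just-in-time idea (my own simplification; real MX-style formats are more involved): keep only the high-precision weights resident and generate a low-precision, block-scaled copy on demand instead of storing both.

# Just-in-time block quantization of weights (illustrative simplification of JIT-Q).
import numpy as np

def quantize_blockwise_int8(w_fp32: np.ndarray, block: int = 32):
    """Generate an int8, per-block-scaled copy of the weights on demand."""
    w = w_fp32.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
weights = rng.standard_normal(1024).astype(np.float32)   # the only copy kept resident

q, s = quantize_blockwise_int8(weights)                   # produced just in time
err = np.abs(dequantize(q, s) - weights).max()
print(f"max abs quantization error: {err:.4f}")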

10.48550/arxiv.2311.05034 preprint EN other-oa arXiv (Cornell University) 2023-01-01