- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Distributed and Parallel Computing Systems
- Embedded Systems Design Techniques
- Computer Graphics and Visualization Techniques
- Computational Geometry and Mesh Generation
- Interconnection Networks and Systems
- Geological Modeling and Analysis
- Seismic Imaging and Inversion Techniques
- Stochastic Gradient Optimization Techniques
- Seismology and Earthquake Studies
- 3D Shape Modeling and Analysis
- Advanced Data Compression Techniques
- Methane Hydrates and Related Phenomena
- Advanced Numerical Methods in Computational Mathematics
Norwegian University of Science and Technology
2016-2017
Simula Research Laboratory
2012-2016
University of Oslo
2012-2015
Energy efficiency is an important aspect of future exascale systems, mainly due to rising energy costs. Although high-performance computing (HPC) applications are compute-centric, they still exhibit varying computational characteristics in different regions of the program, such as compute-, memory-, and I/O-bound code regions. Some of today's clusters already offer mechanisms to adjust system resources to the requirements of an application, e.g., by controlling the CPU frequency. However, manually tuning for improved a...
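One common way to decide per-region tuning of this kind is to look at arithmetic intensity. The sketch below is our own illustration (not the paper's method), with made-up thresholds and region data, showing how a memory-bound region can be flagged as a candidate for a lower CPU frequency:

```python
# Hypothetical sketch: classify code regions by arithmetic intensity
# (flops per byte of memory traffic) to decide whether lowering the CPU
# frequency is likely to save energy without hurting performance.
# Thresholds and region numbers are illustrative assumptions.

def classify_region(flops, bytes_moved, threshold=1.0):
    """Label a region compute-bound or memory-bound by its flop/byte ratio."""
    intensity = flops / bytes_moved
    return "compute-bound" if intensity >= threshold else "memory-bound"

def suggest_frequency(label, f_max=3.0, f_min=1.2):
    """Memory-bound regions tolerate a lower core frequency (GHz)."""
    return f_max if label == "compute-bound" else f_min

regions = {
    "stencil_update": (2.0e9, 4.0e9),   # few flops per byte -> memory-bound
    "dense_matmul":   (8.0e9, 1.0e9),   # many flops per byte -> compute-bound
}

for name, (flops, nbytes) in regions.items():
    label = classify_region(flops, nbytes)
    print(name, label, suggest_frequency(label))
```

In practice such a classifier would be driven by hardware performance counters rather than hard-coded numbers.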
In the context of multiple GPUs that share the same PCIe bus, we propose a new communication scheme that leads to a more effective overlap of communication and computation. Multiple CUDA streams and OpenMP threads are adopted so that data can simultaneously be sent and received. A representative 3D stencil example is used to demonstrate the effectiveness of our scheme. We compare its performance with a state-of-the-art MPI-based scheme. Results show that our approach outperforms that scheme, being up to 1.85× faster. However, the results also indicate current underlying...
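The core idea of overlapping halo exchange with interior computation can be sketched in plain Python threads (the paper itself uses CUDA streams plus OpenMP threads; the 1D three-point stencil below is only an illustrative stand-in):

```python
# Sketch of communication/computation overlap: update the interior of a
# 1D stencil while the boundary ("halo") values are installed concurrently
# by a second thread. The halo thread touches only u[0] and u[-1], the
# interior loop only reads u[1]..u[-2], so the two can safely run at once.

import threading

def halo_exchange(u, left_halo, right_halo, done):
    # Stand-in for a PCIe/MPI transfer: install neighbour values.
    u[0], u[-1] = left_halo, right_halo
    done.set()

def stencil_step(u, left_halo, right_halo):
    new = list(u)
    done = threading.Event()
    t = threading.Thread(target=halo_exchange,
                         args=(u, left_halo, right_halo, done))
    t.start()                        # communication runs concurrently...
    for i in range(2, len(u) - 2):   # ...while the interior is updated
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    done.wait()                      # halos needed for boundary-adjacent points
    for i in (1, len(u) - 2):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    t.join()
    new[0], new[-1] = u[0], u[-1]
    return new

print(stencil_step([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], -1.0, 6.0))
```

On real hardware the same pattern uses one CUDA stream per transfer direction so sends and receives over the shared PCIe bus proceed while kernels update the interior.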
A recent trend in modern high-performance computing environments is the introduction of powerful, energy-efficient hardware accelerators such as GPUs and Xeon Phi coprocessors. These specialized devices coexist with CPUs and are optimized for highly parallel applications. In regular computing-intensive applications with predictable data access patterns, these accelerators often far outperform CPUs and thus relegate the latter to pure control functions instead of computations. For irregular applications, however, the performance gap...
There is a consensus that exascale systems should operate within a power envelope of 20 MW. Consequently, energy conservation is still considered the most crucial constraint if such systems are to be realized.
On modern GPU clusters, the role of the CPUs is often restricted to controlling the GPUs and handling MPI communication. The unused computing power of the CPUs, however, can be considerable for computations whose performance is bounded by memory traffic. This paper investigates the challenges of simultaneously using CPUs and GPUs for computation. Our emphasis is on deriving a heterogeneous CPU+GPU programming approach that combines MPI, OpenMP and CUDA. To effectively hide the overhead of various inter- and intra-node communications, a new level of task...
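A first question in any such hybrid approach is how to split the work so that CPU and GPU finish at the same time. The sketch below is a hypothetical illustration (the throughput numbers are made-up placeholders, not measurements from the paper) of a static partitioning by relative throughput:

```python
# Hypothetical sketch of hybrid CPU+GPU load partitioning: split the rows
# of a domain between host and device in proportion to their measured
# throughputs, so that neither side idles while the other finishes.

def partition_rows(n_rows, cpu_rate, gpu_rate):
    """Return (cpu_rows, gpu_rows) proportional to relative throughput."""
    cpu_rows = round(n_rows * cpu_rate / (cpu_rate + gpu_rate))
    return cpu_rows, n_rows - cpu_rows

# A memory-bound kernel where the CPU still contributes ~20% of the
# aggregate bandwidth (placeholder rates in GB/s).
print(partition_rows(1000, 50.0, 200.0))   # -> (200, 800)
```

In a real code the rates would be calibrated at runtime, and the split re-balanced as the computation proceeds.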
We study the problem of contention for memory bandwidth between computation and communication in supercomputers that feature multicore CPUs. The problem arises when computation and communication are overlapped and both operations compete for the same memory bandwidth. The effect is most visible at the limits of scalability, where both operations take similar amounts of time and thus must be taken into account in order to reach the maximum scalability of memory-bound applications. Typical examples of codes affected by this are sparse matrix-vector computations, graph algorithms, and many machine learning problems, as they...
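The effect can be illustrated with a back-of-the-envelope model (our own simplification, not the paper's): overlap is only "free" while the combined bandwidth demand of computation and communication stays under the machine's total; beyond that, both streams are throttled proportionally.

```python
# Crude shared-bandwidth contention model. If computation needs b_comp GB/s
# and communication needs b_comm GB/s out of b_total available, overlapping
# the two phases only hides the shorter one while demand fits the budget;
# otherwise both slow down by the oversubscription factor.

def overlapped_time(t_comp, t_comm, b_comp, b_comm, b_total):
    """Execution time of two perfectly overlapped phases under a bandwidth cap."""
    demand = b_comp + b_comm
    if demand <= b_total:
        return max(t_comp, t_comm)           # ideal overlap
    slowdown = demand / b_total              # both streams throttled
    return max(t_comp, t_comm) * slowdown    # proportional-sharing model

# Ideal: 10 s of compute fully hides 8 s of communication.
print(overlapped_time(10.0, 8.0, 40.0, 20.0, 80.0))   # -> 10.0
# Contended: 120 GB/s demanded against 80 GB/s available.
print(overlapped_time(10.0, 8.0, 80.0, 40.0, 80.0))   # -> 15.0
```

Even this toy model shows why overlap gains shrink exactly where they matter most, at the scalability limit where both phases are bandwidth-bound.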
We present a novel method for 3D anisotropic front propagation and apply it to the simulation of geological folding. The new iterative algorithm has a simple structure and abundant parallelism, and is easily adapted to multithreaded architectures using OpenMP. Moreover, we have used an automated C-to-CUDA source code translator, Mint, to achieve greatly enhanced computing speed on GPUs. Both the OpenMP and CUDA implementations have been tested and benchmarked on several examples.
Two new algorithms for the numerical solution of static Hamilton-Jacobi equations are presented. These are designed to work efficiently on different parallel computing architectures, and results from multicore CPU and GPU implementations are reported and discussed. The experiments show that the proposed strategies scale well with the computational power of the hardware. The performance of the methods is investigated for two types of formulations: the isotropic eikonal equation and an anisotropic formulation used to simulate geological folding....
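To make the structure of such solvers concrete, here is a minimal sketch of one classic iterative scheme for the isotropic eikonal equation |∇u| = 1, the fast sweeping method (this is a textbook baseline, not the paper's own algorithms): repeated grid sweeps in alternating orderings apply a local upwind update, and it is exactly this regular, local structure that exposes the parallelism the abstracts refer to.

```python
# Fast sweeping for |grad u| = 1 on an n x n grid with spacing h,
# u = 0 at the given source cells. Each cell is updated from its upwind
# neighbours; four alternating sweep orderings propagate the front.

import math

INF = float("inf")

def fast_sweep_eikonal(n, h, sources, n_sweeps=4):
    u = [[INF] * n for _ in range(n)]
    for (i, j) in sources:
        u[i][j] = 0.0
    orders = [(range(n), range(n)),
              (range(n - 1, -1, -1), range(n)),
              (range(n), range(n - 1, -1, -1)),
              (range(n - 1, -1, -1), range(n - 1, -1, -1))]
    for _ in range(n_sweeps):
        for ii, jj in orders:
            for i in ii:
                for j in jj:
                    if (i, j) in sources:
                        continue
                    a = min(u[i - 1][j] if i > 0 else INF,
                            u[i + 1][j] if i < n - 1 else INF)
                    b = min(u[i][j - 1] if j > 0 else INF,
                            u[i][j + 1] if j < n - 1 else INF)
                    if a == INF and b == INF:
                        continue
                    if abs(a - b) >= h:   # one-sided (causal) update
                        cand = min(a, b) + h
                    else:                 # two-sided quadratic update
                        cand = (a + b + math.sqrt(2 * h * h - (a - b) ** 2)) / 2
                    u[i][j] = min(u[i][j], cand)
    return u

u = fast_sweep_eikonal(11, 1.0, {(5, 5)})
print(u[5][0], u[0][5])   # distances along the axes from the centre source
```

An anisotropic formulation, such as the one used for geological folding, replaces the local update with one derived from the anisotropic Hamiltonian, but the sweep structure stays the same.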
This paper studies the CUDA programming challenges of using multiple GPUs inside a single machine to carry out plane-by-plane updates in parallel 3D sweeping algorithms. In particular, care must be taken to mask the overhead of the various data movements between the GPUs. Multiple OpenMP threads on the CPU side should be combined with multiple streams per GPU to hide the transfer cost related to the halo computation of each 2D plane. Moreover, the technique of peer-to-peer data motion can be used to reduce the impact of the volumetric shuffles that have to be done mandatory...
Using large-scale multicore systems to get the maximum performance and energy efficiency with manageable programmability is a major challenge. The partitioned global address space (PGAS) programming model enhances programmability by providing a global address space over distributed computing systems. However, so far the performance of PGAS on multicore-based parallel architectures has not been investigated thoroughly. In this paper we use a set of selected kernels from the well-known NAS Parallel Benchmarks to evaluate the UPC language, which is a widely used implementation...