Hatem Ltaief

ORCID: 0000-0002-6897-1095
Research Areas
  • Parallel Computing and Optimization Techniques
  • Matrix Theory and Algorithms
  • Distributed and Parallel Computing Systems
  • Advanced Data Storage Technologies
  • Interconnection Networks and Systems
  • Soil Geostatistics and Mapping
  • Electromagnetic Scattering and Analysis
  • Adaptive optics and wavefront sensing
  • Seismic Imaging and Inversion Techniques
  • Sparse and Compressive Sensing Techniques
  • Tensor decomposition and applications
  • Advanced Numerical Methods in Computational Mathematics
  • Spatial and Panel Data Analysis
  • Data Management and Algorithms
  • Seismic Waves and Analysis
  • Cloud Computing and Resource Management
  • Numerical Methods and Algorithms
  • Electromagnetic Simulation and Numerical Methods
  • Advanced Wireless Communication Techniques
  • Geophysical Methods and Applications
  • Astronomy and Astrophysical Research
  • Gaussian Processes and Bayesian Inference
  • Error Correcting Code Techniques
  • Stellar, planetary, and galactic studies
  • Numerical methods for differential equations

King Abdullah University of Science and Technology
2015-2024

Kootenay Association for Science & Technology
2016-2023

Beijing Institute of Technology
2023

National Institutes of Natural Sciences
2020

Friedrich-Alexander-Universität Erlangen-Nürnberg
2015

University of Tennessee at Knoxville
2008-2012

University of Houston
1996-2008

The emergence and continuing use of multi-core architectures and graphics processing units require changes in the existing software, and sometimes even a redesign of established algorithms, in order to take advantage of the now prevailing parallelism. Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) and Matrix Algebra on GPU and Multicore Architectures (MAGMA) are two projects that aim to achieve high performance and portability across a wide range of multicore and hybrid systems, respectively. We present in this document a comparative study of PLASMA's...

10.1088/1742-6596/180/1/012037 article EN Journal of Physics Conference Series 2009-07-01

Solving dense linear systems of equations is a fundamental problem in scientific computing. Numerical simulations involving complex systems, represented in terms of unknown variables and relations between them, often lead to linear systems that must be solved as fast as possible. We describe current efforts toward the development of these critical solvers in the area of dense linear algebra (DLA) for multicore architectures with GPU accelerators. We show how to code and develop solvers to effectively use the high computing power available in these new and emerging hybrid architectures. The approach taken...

10.1109/ipdpsw.2010.5470941 article EN 2010-04-01

We present a method for developing dense linear algebra algorithms that seamlessly scales to thousands of cores. It can be done with our project called DPLASMA (Distributed PLASMA), which uses a novel generic distributed Direct Acyclic Graph Engine (DAGuE). The engine has been designed for high performance computing and thus it enables scaling of tile algorithms, originating in PLASMA, on large distributed memory systems. The underlying DAGuE framework has many appealing features when considering distributed-memory platforms...

10.1109/ipdps.2011.299 article EN 2011-05-01
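The core idea of such a DAG engine is data-driven execution: a task becomes runnable only once all of its input tasks have completed. A minimal pure-Python sketch of that scheduling discipline is below; it is a Kahn-style topological executor, not DAGuE's actual API, and the task names (`potrf`, `trsm`, `gemm`) are used only as illustrative stand-ins for tile-algorithm kernels.

```python
from collections import defaultdict, deque

def run_dag(tasks, deps):
    # tasks: name -> callable; deps: name -> list of prerequisite task names.
    # Kahn-style scheduler: a task is released once all its inputs complete,
    # mimicking the data-driven execution of a tile-algorithm DAG.
    indeg = {t: len(deps.get(t, [])) for t in tasks}
    children = defaultdict(list)
    for t, prereqs in deps.items():
        for p in prereqs:
            children[p].append(t)
    ready = deque(t for t, d in indeg.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()          # execute the task body
        order.append(t)
        for c in children[t]:
            indeg[c] -= 1
            if indeg[c] == 0:   # last dependency satisfied -> release
                ready.append(c)
    return order
```

In a real runtime the ready queue would feed a pool of CPU cores and GPUs; here the single loop only demonstrates the dependency-release mechanism.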

One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerator-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present a highly efficient QR factorization for such a node. Our method consists of three steps. The first step consists of expressing the factorization as a sequence of tasks of well chosen granularity that will aim at being executed on a CPU core or a GPU. We show that we can efficiently adapt...

10.1109/ipdps.2011.90 article EN 2011-05-01

The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for the expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit...

10.1109/tpds.2017.2703149 article EN IEEE Transactions on Parallel and Distributed Systems 2017-05-12

We present ExaGeoStat, a high performance framework for geospatial statistics in climate and environment modeling. In contrast to simulation based on partial differential equations derived from first-principles modeling, ExaGeoStat employs a statistical model based on the evaluation of the Gaussian log-likelihood function, which operates on a large dense covariance matrix. Generated by a parametrizable Matern covariance function, the resulting matrix is symmetric and positive definite. The computational tasks involved during the evaluation of the Gaussian log-likelihood function become...

10.1109/tpds.2018.2850749 article EN IEEE Transactions on Parallel and Distributed Systems 2018-06-26
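The computational pattern behind this abstract can be shown on a toy scale: build a Matern covariance matrix from point locations, Cholesky-factor it, and evaluate the Gaussian log-likelihood. The sketch below is pure scalar Python, uses the Matern smoothness nu = 1/2 (where the kernel reduces to the exponential form, avoiding Bessel functions), and adds a small nugget for numerical stability; the parameter values are illustrative, not ExaGeoStat defaults.

```python
import math

def matern_nu_half(r, sigma2=1.0, length=0.1):
    # Matern covariance with smoothness nu = 1/2 (exponential kernel)
    return sigma2 * math.exp(-r / length)

def cholesky(A):
    # dense Cholesky A = L L^T, lower triangular L (no pivoting needed: A is SPD)
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def gaussian_loglik(locs, z, sigma2=1.0, length=0.1, nugget=1e-10):
    # log N(z | 0, C) via one Cholesky: quadratic form and log-determinant
    n = len(locs)
    C = [[matern_nu_half(math.dist(locs[i], locs[j]), sigma2, length)
          + (nugget if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(C)
    y = [0.0] * n                      # forward solve L y = z
    for i in range(n):
        y[i] = (z[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    quad = sum(v * v for v in y)       # z^T C^{-1} z = ||L^{-1} z||^2
    return -0.5 * (quad + logdet + n * math.log(2 * math.pi))
```

At exascale the `cholesky` call is exactly the piece that is replaced by a tile-based, possibly low-rank, distributed factorization.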

The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multi-core wavefront temporal blocking and diamond tiling to arrive at...

10.1137/140991133 article EN SIAM Journal on Scientific Computing 2015-01-01
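The essence of temporal blocking can be demonstrated in a few lines: advance a spatial tile several timesteps while it is cache-resident, instead of streaming the whole array once per step. The sketch below is the simplest overlapped (ghost-zone) variant on a 1D 3-point stencil, not the wavefront/diamond scheme of the paper: each tile is widened by a halo of width T, advanced T steps locally, and only its interior is kept, trading redundant halo flops for temporal locality.

```python
def step(u):
    # one Jacobi sweep of a 3-point averaging stencil; endpoints held fixed
    return ([u[0]]
            + [(u[i-1] + u[i] + u[i+1]) / 3.0 for i in range(1, len(u) - 1)]
            + [u[-1]])

def naive_sweeps(u, T):
    # baseline: T full-array sweeps, i.e. T passes over main memory
    for _ in range(T):
        u = step(u)
    return u

def blocked_sweeps(u, T, B):
    # overlapped temporal blocking: tile of width B + halo of width T,
    # advanced T steps locally; interior cells are then exact
    n = len(u)
    out = u[:]
    for start in range(0, n, B):
        end = min(start + B, n)
        lo, hi = max(start - T, 0), min(end + T, n)
        tile = u[lo:hi]
        for _ in range(T):
            tile = step(tile)
        out[start:end] = tile[start - lo:end - lo]
    return out
```

Production schemes (wavefront, diamond tiling) avoid the redundant halo computation by sharing updated halo cells between tiles; this sketch keeps the dependency argument visible at the cost of extra flops.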

Modified Bessel functions of the second kind are widely used in physics, engineering, spatial statistics, and machine learning. Since contemporary scientific applications, including machine learning, rely on GPUs for acceleration, providing robust GPU-hosted implementations of special functions, such as the modified Bessel function, is crucial for performance. Existing implementations of this function for CPUs have limited coverage of the full range of values needed by some applications. In this work, we present an implementation for GPUs, eliminating the dependence...

10.48550/arxiv.2502.00356 preprint EN arXiv (Cornell University) 2025-02-01
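One standard route to K_nu(x), usable as a reference oracle when validating a fast implementation, is the integral representation K_nu(x) = ∫₀^∞ exp(−x cosh t) cosh(nu t) dt. The quadrature below is a naive trapezoid sketch for moderate x, not the paper's GPU algorithm; step size and cutoff are illustrative. For half-integer orders the closed forms, e.g. K₁/₂(x) = √(π/2x) e^(−x), make convenient checks.

```python
import math

def bessel_k(nu, x, t_max=25.0, dt=1e-3):
    # K_nu(x) via its integral representation, trapezoidal quadrature.
    # The integrand decays double-exponentially, so a fixed cutoff suffices
    # for moderate x; exp() underflows harmlessly to 0.0 for large t.
    n = int(t_max / dt)
    total = 0.0
    for i in range(n + 1):
        t = i * dt
        w = 0.5 if i in (0, n) else 1.0
        total += w * math.exp(-x * math.cosh(t)) * math.cosh(nu * t)
    return total * dt
```

Since the integrand is even in t with zero derivative at t = 0 and at the (effective) right endpoint, the trapezoid rule is far more accurate here than its generic O(dt²) bound suggests.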

The emergence and continuing use of multi-core architectures require changes in the existing software and sometimes even a redesign of established algorithms in order to take advantage of the now prevailing parallelism. Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) is a project that aims to achieve both high performance and portability across a wide range of multicore architectures. We present in this paper a comparative study of PLASMA's performance against established linear algebra packages (LAPACK and ScaLAPACK), and against new approaches at...

10.1145/1654059.1654080 article EN 2009-11-14

Multicore architectures enhanced with multiple GPUs are likely to become mainstream High Performance Computing (HPC) platforms in the near future. In this paper, we present the design and implementation of an LU factorization using a tile algorithm that can fully exploit the potential of such platforms in spite of their complexity. We use a methodology derived from previous work on Cholesky and QR factorizations. Our contributions essentially consist of providing new CPU/GPU hybrid kernels and studying the impact on performance of looking...

10.1109/aiccsa.2011.6126599 preprint EN 2011-12-01

This paper introduces a novel implementation for reducing a symmetric dense matrix to tridiagonal form, which is the preprocessing step toward solving symmetric eigenvalue problems. Based on tile algorithms, the reduction follows a two-stage approach, where the matrix is first reduced to band form prior to the final condensed structure. The challenging trade-off between algorithmic performance and task granularity has been tackled through a grouping technique, which consists of aggregating fine-grained and memory-aware computational tasks during...

10.1145/2063384.2063394 article EN 2011-11-08

We propose to study the impact on the energy footprint of two advanced algorithmic strategies in the context of high performance dense linear algebra libraries: (1) mixed precision algorithms with iterative refinement allow one to run at the peak performance of single precision floating-point arithmetic while achieving double precision accuracy, and (2) the tree reduction technique exposes more parallelism when factorizing tall and skinny matrices for solving overdetermined systems of equations or calculating the singular value decomposition. Integrated within...

10.1109/cgc.2012.113 article EN 2012-11-01
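Strategy (1) above can be sketched end to end in pure Python by simulating single precision with explicit rounding to binary32: factor and solve in "single", then repeatedly correct with residuals computed in double. This is a minimal illustration of mixed precision iterative refinement, not any library's implementation; it omits pivoting (so it assumes a diagonally dominant matrix) and wastefully refactors on every correction solve.

```python
import struct

def to_f32(x):
    # round a Python float (binary64) to binary32 precision
    return struct.unpack('f', struct.pack('f', x))[0]

def lu_solve_f32(A, b):
    # Gaussian elimination with results rounded to float32 after each update,
    # standing in for a fast low-precision factorization (no pivoting:
    # assumes a diagonally dominant matrix)
    n = len(b)
    M = [[to_f32(v) for v in row] for row in A]
    x = [to_f32(v) for v in b]
    for k in range(n):
        for i in range(k + 1, n):
            f = to_f32(M[i][k] / M[k][k])
            for j in range(k, n):
                M[i][j] = to_f32(M[i][j] - to_f32(f * M[k][j]))
            x[i] = to_f32(x[i] - to_f32(f * x[k]))
    for i in reversed(range(n)):
        s = x[i] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = to_f32(s / M[i][i])
    return x

def refine(A, b, iters=5):
    # mixed precision iterative refinement: low-precision solve,
    # double-precision residual, low-precision correction
    n = len(b)
    x = lu_solve_f32(A, b)
    for _ in range(iters):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = lu_solve_f32(A, r)
        x = [x[i] + d[i] for i in range(n)]
    return x
```

For a well-conditioned system the single-precision solve is accurate to roughly 1e-7, while a few refinement sweeps recover close to full double precision, which is exactly the energy argument: the expensive O(n³) work stays in the cheap precision.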

...and performance portability obtained by using data locality abstractions. Fortunately, the trend emerging in the recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various concepts to develop a comprehensive...

10.2172/1172915 preprint EN 2014-11-01

KBLAS is an open-source, high-performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Since the performance of dense matrix-vector multiplication is hindered by the overhead of memory accesses, a double-buffering optimization technique is employed to overlap data motion with computation. After identifying a proper set of tuning parameters, KBLAS efficiently runs on various GPU architectures while avoiding code rewriting and retaining compliance with the standard BLAS API....

10.1145/2818311 article EN ACM Transactions on Mathematical Software 2016-05-10
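Double buffering means that while one buffer is being computed on, the next one is already being filled, so data motion hides behind arithmetic. The sketch below mimics the idea in host-side Python with a depth-2 bounded queue and a producer thread standing in for asynchronous copies; it is a conceptual analogue, not KBLAS's CUDA kernel (which double-buffers in registers/shared memory).

```python
import threading
import queue

def matvec_double_buffered(A, x, block=4):
    # Producer stages row blocks (standing in for host-to-device copies) into
    # a depth-2 queue while the consumer computes, overlapping the two phases.
    n = len(A)
    buf = queue.Queue(maxsize=2)   # one block in flight, one being computed

    def producer():
        for start in range(0, n, block):
            rows = [row[:] for row in A[start:start + block]]  # copy = "data motion"
            buf.put((start, rows))                             # blocks when pipeline full
        buf.put(None)                                          # sentinel: no more blocks

    threading.Thread(target=producer, daemon=True).start()
    y = [0.0] * n
    while True:
        item = buf.get()
        if item is None:
            break
        start, rows = item
        for i, row in enumerate(rows):                         # compute on staged block
            y[start + i] = sum(a * b for a, b in zip(row, x))
    return y
```

The `maxsize=2` bound is the whole point: it caps buffer memory while still letting the next transfer proceed concurrently with the current computation.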

The compute and control for adaptive optics (cacao) package is an open-source modular software environment for real-time control of a modern adaptive optics system. By leveraging many-core CPU and GPU hardware, it can scale up to meet the demanding computing requirements of current and future high frame rate, high actuator count adaptive optics (AO) systems. cacao's design enables both simple/barebone operation and complex full-featured AO control, centered on data streams that hold data in shared memory along with a synchronization mechanism between processes. Users...

10.1117/12.2314315 article EN Adaptive Optics Systems VI 2018-07-11

Climate and weather can be predicted statistically via geospatial Maximum Likelihood Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE-based iterative optimization procedure requires solving large-scale linear systems, which is done by performing a Cholesky factorization on a symmetric positive-definite covariance matrix, a demanding dense operation in terms of memory footprint and computation. We propose a novel solution to this problem: at the mathematical level, we reduce the computational...

10.1145/3394277.3401846 article EN 2020-06-18

State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to its inability to fully exploit thread-level parallelism. At the same time, the coarse-grain dataflow model gains popularity as a paradigm for programming multicore architectures. This work looks at implementing classic dense linear algebra workloads, the Cholesky factorization, QR factorization and LU factorization, using dynamic data-driven execution. Two emerging approaches are examined, one of nested...

10.1002/cpe.1467 article EN Concurrency and Computation Practice and Experience 2009-08-11
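The tile formulation that makes these factorizations amenable to dataflow execution can be written out directly. Below is a pure-Python right-looking tile Cholesky over a dictionary of lower tiles; the kernel names (`potrf`, `trsm`, `gemm_sub`) follow BLAS/LAPACK naming only by analogy, and the scalar loops stand in for optimized kernels. Every call here would become one task node in the DAG, with edges given by the tiles it reads and writes.

```python
import math

def potrf(T):
    # Cholesky of one diagonal tile; returns lower-triangular factor
    n = len(T)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(T[i][i] - s) if i == j else (T[i][j] - s) / L[j][j]
    return L

def trsm(L, B):
    # solve X * L^T = B for a panel tile X
    X = [row[:] for row in B]
    for i in range(len(B)):
        for j in range(len(L)):
            s = X[i][j] - sum(X[i][k] * L[j][k] for k in range(j))
            X[i][j] = s / L[j][j]
    return X

def gemm_sub(C, A, B):
    # trailing update C -= A * B^T (SYRK when A is B)
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] -= sum(A[i][k] * B[j][k] for k in range(len(A[0])))

def tile_cholesky(tiles, p):
    # right-looking tile Cholesky on a p x p grid of lower tiles:
    # factor diagonal tile, solve the panel below it, update the trailing submatrix
    for k in range(p):
        tiles[(k, k)] = potrf(tiles[(k, k)])
        for i in range(k + 1, p):
            tiles[(i, k)] = trsm(tiles[(k, k)], tiles[(i, k)])
        for i in range(k + 1, p):
            for j in range(k + 1, i + 1):
                gemm_sub(tiles[(i, j)], tiles[(i, k)], tiles[(j, k)])
    return tiles
```

Unlike the fork-join style of blocked LAPACK, nothing here forces the `gemm_sub` updates of step k to finish before the `potrf` of step k+1 starts on an already-updated tile, which is exactly the lookahead a dataflow runtime exploits.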

While successful implementations have already been written for one-sided transformations (e.g., QR, LU and Cholesky factorizations) on multicore architectures, getting high performance for two-sided reductions (e.g., Hessenberg, tridiagonal and bidiagonal reductions) is still an open and difficult research problem due to the expensive memory-bound operations occurring during the panel factorization. The processor-memory speed gap continues to widen, which has even further exacerbated the problem. This paper focuses...

10.1109/ipdps.2011.91 article EN 2011-05-01

A traditional goal of algorithmic optimality, squeezing out flops, has been superseded by evolution in architecture. Flops no longer serve as a reasonable proxy for all aspects of complexity. Instead, algorithms must now squeeze memory, data transfers, and synchronizations, while extra flops on locally cached data represent only small costs in time and energy. Hierarchically low-rank matrices realize a rarely achieved combination of optimal storage complexity and high computational intensity for a wide class of formally...

10.1098/rsta.2019.0055 article EN cc-by Philosophical Transactions of the Royal Society A Mathematical Physical and Engineering Sciences 2020-01-20
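A standard way to build the low-rank blocks such hierarchical formats rely on is Adaptive Cross Approximation (ACA), which constructs a rank-k factorization from O(k(m+n)) individual entries rather than the full block. The sketch below is a partially pivoted ACA in pure Python; the kernel used in the test is an illustrative smooth, well-separated interaction, not from the paper.

```python
def aca(get_entry, m, n, tol=1e-8, max_rank=20):
    # Partially pivoted Adaptive Cross Approximation.
    # Returns factors as lists: U[t] is a column of length m, V[t] a row of
    # length n, with A[i][j] ~= sum_t U[t][i] * V[t][j].
    U, V = [], []
    row_pivot, used_rows = 0, set()
    for _ in range(max_rank):
        # residual of the pivot row under the current approximation
        row = [get_entry(row_pivot, j)
               - sum(U[t][row_pivot] * V[t][j] for t in range(len(U)))
               for j in range(n)]
        col_pivot = max(range(n), key=lambda j: abs(row[j]))
        pivot = row[col_pivot]
        if abs(pivot) < tol:       # residual small enough: stop
            break
        col = [(get_entry(i, col_pivot)
                - sum(U[t][i] * V[t][col_pivot] for t in range(len(U)))) / pivot
               for i in range(m)]
        U.append(col)
        V.append(row)
        used_rows.add(row_pivot)
        # next pivot row: largest entry of the new column among unused rows
        row_pivot = max((i for i in range(m) if i not in used_rows),
                        key=lambda i: abs(col[i]), default=None)
        if row_pivot is None:
            break
    return U, V
```

The point matches the abstract's thesis: storage and data movement drop from O(mn) to O(k(m+n)), while the arithmetic that remains (the rank-k products) is compute-intensive.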

Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Spatial data are assumed to possess properties of stationarity or non-stationarity, captured via a kernel-fitted covariance matrix. A primary workhorse of stationary spatial statistics is Gaussian maximum log-likelihood estimation (MLE), whose central data structure is a dense, symmetric positive...

10.1109/tpds.2021.3084071 article EN publisher-specific-oa IEEE Transactions on Parallel and Distributed Systems 2021-05-26

To exploit the potential of multicore architectures, recent dense linear algebra libraries have used tile algorithms, which consist in scheduling a Directed Acyclic Graph (DAG) of tasks of fine granularity, where nodes represent tasks, either the panel factorization or the update of a block-column, and edges represent dependencies among them. Although past approaches already achieve high performance on moderate and large square matrices, their way of processing a panel in sequence leads to limited performance when factorizing tall and skinny matrices...

10.1109/ipdps.2010.5470443 article EN 2010-01-01
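The remedy for the sequential panel is a communication-avoiding tree reduction, often called TSQR: factor independent row blocks in parallel, then QR the stacked R factors. The sketch below is a one-level tree in pure Python with modified Gram-Schmidt standing in for the Householder kernels used in practice; it assumes each row block has at least as many rows as columns.

```python
import math

def mgs_qr(A):
    # modified Gram-Schmidt QR of an m x n (m >= n) matrix with
    # linearly independent columns; returns (Q, R)
    m, n = len(A), len(A[0])
    Q = [row[:] for row in A]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        nrm = math.sqrt(sum(Q[i][j] ** 2 for i in range(m)))
        R[j][j] = nrm
        for i in range(m):
            Q[i][j] /= nrm
        for k in range(j + 1, n):
            dot = sum(Q[i][j] * Q[i][k] for i in range(m))
            R[j][k] = dot
            for i in range(m):
                Q[i][k] -= dot * Q[i][j]
    return Q, R

def tsqr(A, block):
    # tall-skinny QR, one tree level: local QR per row block,
    # then a reduction QR on the stacked R factors
    m, n = len(A), len(A[0])
    Qs, Rs = [], []
    for s in range(0, m, block):
        Qb, Rb = mgs_qr(A[s:s + block])
        Qs.append(Qb)
        Rs.append(Rb)
    stacked = [row for Rb in Rs for row in Rb]
    Q2, R = mgs_qr(stacked)
    # assemble global Q = diag(Q_1 .. Q_p) * Q2
    Q = [[0.0] * n for _ in range(m)]
    for p_idx, Qb in enumerate(Qs):
        for i, rowq in enumerate(Qb):
            gi = p_idx * block + i
            for j in range(n):
                Q[gi][j] = sum(rowq[k] * Q2[p_idx * n + k][j] for k in range(n))
    return Q, R
```

The local factorizations are independent, so a DAG scheduler can run them concurrently, which is precisely what the sequential panel of classic tile algorithms prevents on tall and skinny matrices.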