Hatem Ltaief

ORCID: 0000-0002-6897-1095
Research Areas
  • Parallel Computing and Optimization Techniques
  • Matrix Theory and Algorithms
  • Distributed and Parallel Computing Systems
  • Advanced Data Storage Technologies
  • Interconnection Networks and Systems
  • Soil Geostatistics and Mapping
  • Electromagnetic Scattering and Analysis
  • Adaptive optics and wavefront sensing
  • Seismic Imaging and Inversion Techniques
  • Sparse and Compressive Sensing Techniques
  • Tensor decomposition and applications
  • Advanced Numerical Methods in Computational Mathematics
  • Spatial and Panel Data Analysis
  • Data Management and Algorithms
  • Seismic Waves and Analysis
  • Cloud Computing and Resource Management
  • Numerical Methods and Algorithms
  • Electromagnetic Simulation and Numerical Methods
  • Advanced Wireless Communication Techniques
  • Geophysical Methods and Applications
  • Astronomy and Astrophysical Research
  • Gaussian Processes and Bayesian Inference
  • Error Correcting Code Techniques
  • Stellar, planetary, and galactic studies
  • Numerical methods for differential equations

King Abdullah University of Science and Technology
2015-2024

Kootenay Association for Science & Technology
2016-2023

Beijing Institute of Technology
2023

National Institutes of Natural Sciences
2020

Friedrich-Alexander-Universität Erlangen-Nürnberg
2015

University of Tennessee at Knoxville
2008-2012

University of Houston
1996-2008

The emergence and continuing use of multi-core architectures and graphics processing units require changes in the existing software, and sometimes even a redesign of established algorithms, in order to take advantage of the now prevailing parallelism. Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) and Matrix Algebra on GPU and Multicore Architectures (MAGMA) are two projects that aim to achieve high performance and portability across a wide range of multicore and hybrid systems, respectively. We present in this document a comparative study of PLASMA's...

10.1088/1742-6596/180/1/012037 article EN Journal of Physics Conference Series 2009-07-01

Solving dense linear systems of equations is a fundamental problem in scientific computing. Numerical simulations involving complex systems, represented in terms of unknown variables and relations between them, often lead to linear systems that must be solved as fast as possible. We describe current efforts toward the development of these critical solvers in the area of dense linear algebra (DLA) for multicore architectures with GPU accelerators. We show how to code and develop solvers to effectively use the high computing power available in these new and emerging hybrid architectures. The approach taken...

10.1109/ipdpsw.2010.5470941 article EN 2010-04-01

We present a method for developing dense linear algebra algorithms that seamlessly scales to thousands of cores. It can be done with our project called DPLASMA (Distributed PLASMA), which uses a novel generic distributed Direct Acyclic Graph Engine (DAGuE). The engine has been designed for high performance computing and thus it enables scaling of tile algorithms, originating in PLASMA, on large distributed memory systems. The underlying DAGuE framework has many appealing features when considering distributed-memory platforms...

10.1109/ipdps.2011.299 article EN 2011-05-01
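The core idea of such a DAG engine is data-driven execution: a task becomes runnable only once all of its input tasks have completed. A minimal pure-Python sketch of that scheduling discipline is below; it is a Kahn-style topological executor, not DAGuE's actual API, and the task names (`potrf`, `trsm`, `gemm`) are used only as illustrative stand-ins for tile-algorithm kernels.

```python
from collections import defaultdict, deque

def run_dag(tasks, deps):
    # tasks: name -> callable; deps: name -> list of prerequisite task names.
    # Kahn-style scheduler: a task is released once all its inputs complete,
    # mimicking the data-driven execution of a tile-algorithm DAG.
    indeg = {t: len(deps.get(t, [])) for t in tasks}
    children = defaultdict(list)
    for t, prereqs in deps.items():
        for p in prereqs:
            children[p].append(t)
    ready = deque(t for t, d in indeg.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()          # execute the task body
        order.append(t)
        for c in children[t]:
            indeg[c] -= 1
            if indeg[c] == 0:   # last dependency satisfied -> release
                ready.append(c)
    return order
```

In a real runtime the ready queue would feed a pool of CPU cores and GPUs; here the single loop only demonstrates the dependency-release mechanism.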

One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerator-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present a highly efficient QR factorization for such a node. Our method consists of three steps. The first step consists of expressing the factorization as a sequence of tasks of well chosen granularity that will aim at being executed on a CPU core or a GPU. We show that we can efficiently adapt...

10.1109/ipdps.2011.90 article EN 2011-05-01

The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for the expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit...

10.1109/tpds.2017.2703149 article EN IEEE Transactions on Parallel and Distributed Systems 2017-05-12

We present ExaGeoStat, a high performance framework for geospatial statistics in climate and environment modeling. In contrast to simulation based on partial differential equations derived from first-principles modeling, ExaGeoStat employs a statistical model based on the evaluation of the Gaussian log-likelihood function, which operates on a large dense covariance matrix. Generated by a parametrizable Matern covariance function, the resulting matrix is symmetric and positive definite. The computational tasks involved during the evaluation of the Gaussian log-likelihood function become...

10.1109/tpds.2018.2850749 article EN IEEE Transactions on Parallel and Distributed Systems 2018-06-26
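The computational pattern behind this abstract can be shown on a toy scale: build a Matern covariance matrix from point locations, Cholesky-factor it, and evaluate the Gaussian log-likelihood. The sketch below is pure scalar Python, uses the Matern smoothness nu = 1/2 (where the kernel reduces to the exponential form, avoiding Bessel functions), and adds a small nugget for numerical stability; the parameter values are illustrative, not ExaGeoStat defaults.

```python
import math

def matern_nu_half(r, sigma2=1.0, length=0.1):
    # Matern covariance with smoothness nu = 1/2 (exponential kernel)
    return sigma2 * math.exp(-r / length)

def cholesky(A):
    # dense Cholesky A = L L^T, lower triangular L (no pivoting needed: A is SPD)
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def gaussian_loglik(locs, z, sigma2=1.0, length=0.1, nugget=1e-10):
    # log N(z | 0, C) via one Cholesky: quadratic form and log-determinant
    n = len(locs)
    C = [[matern_nu_half(math.dist(locs[i], locs[j]), sigma2, length)
          + (nugget if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(C)
    y = [0.0] * n                      # forward solve L y = z
    for i in range(n):
        y[i] = (z[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    quad = sum(v * v for v in y)       # z^T C^{-1} z = ||L^{-1} z||^2
    return -0.5 * (quad + logdet + n * math.log(2 * math.pi))
```

At exascale the `cholesky` call is exactly the piece that is replaced by a tile-based, possibly low-rank, distributed factorization.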

The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multi-core wavefront temporal blocking and diamond tiling to arrive at...

10.1137/140991133 article EN SIAM Journal on Scientific Computing 2015-01-01
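The essence of temporal blocking can be demonstrated in a few lines: advance a spatial tile several timesteps while it is cache-resident, instead of streaming the whole array once per step. The sketch below is the simplest overlapped (ghost-zone) variant on a 1D 3-point stencil, not the wavefront/diamond scheme of the paper: each tile is widened by a halo of width T, advanced T steps locally, and only its interior is kept, trading redundant halo flops for temporal locality.

```python
def step(u):
    # one Jacobi sweep of a 3-point averaging stencil; endpoints held fixed
    return ([u[0]]
            + [(u[i-1] + u[i] + u[i+1]) / 3.0 for i in range(1, len(u) - 1)]
            + [u[-1]])

def naive_sweeps(u, T):
    # baseline: T full-array sweeps, i.e. T passes over main memory
    for _ in range(T):
        u = step(u)
    return u

def blocked_sweeps(u, T, B):
    # overlapped temporal blocking: tile of width B + halo of width T,
    # advanced T steps locally; interior cells are then exact
    n = len(u)
    out = u[:]
    for start in range(0, n, B):
        end = min(start + B, n)
        lo, hi = max(start - T, 0), min(end + T, n)
        tile = u[lo:hi]
        for _ in range(T):
            tile = step(tile)
        out[start:end] = tile[start - lo:end - lo]
    return out
```

Production schemes (wavefront, diamond tiling) avoid the redundant halo computation by sharing updated halo cells between tiles; this sketch keeps the dependency argument visible at the cost of extra flops.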

Modified Bessel functions of the second kind are widely used in physics, engineering, spatial statistics, and machine learning. Since contemporary scientific applications, including machine learning, rely on GPUs for acceleration, providing robust GPU-hosted implementations of special functions, such as the modified Bessel function, is crucial for performance. Existing implementations of this function for CPUs have limited coverage of the full range of values needed by some applications. In this work, we present an implementation for GPUs, eliminating the dependence...

10.48550/arxiv.2502.00356 preprint EN arXiv (Cornell University) 2025-02-01
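One standard route to K_nu(x), usable as a reference oracle when validating a fast implementation, is the integral representation K_nu(x) = ∫₀^∞ exp(−x cosh t) cosh(nu t) dt. The quadrature below is a naive trapezoid sketch for moderate x, not the paper's GPU algorithm; step size and cutoff are illustrative. For half-integer orders the closed forms, e.g. K₁/₂(x) = √(π/2x) e^(−x), make convenient checks.

```python
import math

def bessel_k(nu, x, t_max=25.0, dt=1e-3):
    # K_nu(x) via its integral representation, trapezoidal quadrature.
    # The integrand decays double-exponentially, so a fixed cutoff suffices
    # for moderate x; exp() underflows harmlessly to 0.0 for large t.
    n = int(t_max / dt)
    total = 0.0
    for i in range(n + 1):
        t = i * dt
        w = 0.5 if i in (0, n) else 1.0
        total += w * math.exp(-x * math.cosh(t)) * math.cosh(nu * t)
    return total * dt
```

Since the integrand is even in t with zero derivative at t = 0 and at the (effective) right endpoint, the trapezoid rule is far more accurate here than its generic O(dt²) bound suggests.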

The emergence and continuing use of multi-core architectures require changes in the existing software and sometimes even a redesign of established algorithms in order to take advantage of the now prevailing parallelism. Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) is a project that aims to achieve both high performance and portability across a wide range of multicore architectures. We present in this paper a comparative study of PLASMA's performance against established linear algebra packages (LAPACK and ScaLAPACK), and against new approaches at...

10.1145/1654059.1654080 article EN 2009-11-14

Multicore architectures enhanced with multiple GPUs are likely to become mainstream High Performance Computing (HPC) platforms in the near future. In this paper, we present the design and implementation of an LU factorization using a tile algorithm that can fully exploit the potential of such platforms in spite of their complexity. We use a methodology derived from previous work on Cholesky and QR factorizations. Our contributions essentially consist of providing new CPU/GPU hybrid kernels and studying the impact on performance of looking...

10.1109/aiccsa.2011.6126599 preprint EN 2011-12-01

This paper introduces a novel implementation for reducing a symmetric dense matrix to tridiagonal form, which is the preprocessing step toward solving symmetric eigenvalue problems. Based on tile algorithms, the reduction follows a two-stage approach, where the matrix is first reduced to band form prior to the final condensed structure. The challenging trade-off between algorithmic performance and task granularity has been tackled through a grouping technique, which consists of aggregating fine-grained and memory-aware computational tasks during...

10.1145/2063384.2063394 article EN 2011-11-08

We propose to study the impact on the energy footprint of two advanced algorithmic strategies in the context of high performance dense linear algebra libraries: (1) mixed precision algorithms with iterative refinement allow one to run at the peak performance of single precision floating-point arithmetic while achieving double precision accuracy, and (2) the tree reduction technique exposes more parallelism when factorizing tall and skinny matrices for solving overdetermined systems of equations or calculating the singular value decomposition. Integrated within...

10.1109/cgc.2012.113 article EN 2012-11-01
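Strategy (1) above can be sketched end to end in pure Python by simulating single precision with explicit rounding to binary32: factor and solve in "single", then repeatedly correct with residuals computed in double. This is a minimal illustration of mixed precision iterative refinement, not any library's implementation; it omits pivoting (so it assumes a diagonally dominant matrix) and wastefully refactors on every correction solve.

```python
import struct

def to_f32(x):
    # round a Python float (binary64) to binary32 precision
    return struct.unpack('f', struct.pack('f', x))[0]

def lu_solve_f32(A, b):
    # Gaussian elimination with results rounded to float32 after each update,
    # standing in for a fast low-precision factorization (no pivoting:
    # assumes a diagonally dominant matrix)
    n = len(b)
    M = [[to_f32(v) for v in row] for row in A]
    x = [to_f32(v) for v in b]
    for k in range(n):
        for i in range(k + 1, n):
            f = to_f32(M[i][k] / M[k][k])
            for j in range(k, n):
                M[i][j] = to_f32(M[i][j] - to_f32(f * M[k][j]))
            x[i] = to_f32(x[i] - to_f32(f * x[k]))
    for i in reversed(range(n)):
        s = x[i] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = to_f32(s / M[i][i])
    return x

def refine(A, b, iters=5):
    # mixed precision iterative refinement: low-precision solve,
    # double-precision residual, low-precision correction
    n = len(b)
    x = lu_solve_f32(A, b)
    for _ in range(iters):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = lu_solve_f32(A, r)
        x = [x[i] + d[i] for i in range(n)]
    return x
```

For a well-conditioned system the single-precision solve is accurate to roughly 1e-7, while a few refinement sweeps recover close to full double precision, which is exactly the energy argument: the expensive O(n³) work stays in the cheap precision.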

...and performance portability obtained by using data locality abstractions. Fortunately, the trend emerging in the recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various concepts to develop a comprehensive...

10.2172/1172915 preprint EN 2014-11-01

KBLAS is an open-source, high-performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Since the performance of dense matrix-vector multiplication is hindered by the overhead of memory accesses, a double-buffering optimization technique is employed to overlap data motion with computation. After identifying a proper set of tuning parameters, KBLAS efficiently runs on various GPU architectures while avoiding code rewriting and retaining compliance with the standard BLAS API....

10.1145/2818311 article EN ACM Transactions on Mathematical Software 2016-05-10
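Double buffering means that while one buffer is being computed on, the next one is already being filled, so data motion hides behind arithmetic. The sketch below mimics the idea in host-side Python with a depth-2 bounded queue and a producer thread standing in for asynchronous copies; it is a conceptual analogue, not KBLAS's CUDA kernel (which double-buffers in registers/shared memory).

```python
import threading
import queue

def matvec_double_buffered(A, x, block=4):
    # Producer stages row blocks (standing in for host-to-device copies) into
    # a depth-2 queue while the consumer computes, overlapping the two phases.
    n = len(A)
    buf = queue.Queue(maxsize=2)   # one block in flight, one being computed

    def producer():
        for start in range(0, n, block):
            rows = [row[:] for row in A[start:start + block]]  # copy = "data motion"
            buf.put((start, rows))                             # blocks when pipeline full
        buf.put(None)                                          # sentinel: no more blocks

    threading.Thread(target=producer, daemon=True).start()
    y = [0.0] * n
    while True:
        item = buf.get()
        if item is None:
            break
        start, rows = item
        for i, row in enumerate(rows):                         # compute on staged block
            y[start + i] = sum(a * b for a, b in zip(row, x))
    return y
```

The `maxsize=2` bound is the whole point: it caps buffer memory while still letting the next transfer proceed concurrently with the current computation.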

The compute and control for adaptive optics (cacao) package is an open-source modular software environment for real-time control of a modern adaptive optics system. By leveraging many-core CPU and GPU hardware, it can scale up to meet the demanding computing requirements of current and future high frame rate, high actuator count adaptive optics (AO) systems. cacao's design enables both simple/barebone operation and complex full-featured AO control, centered on data streams that hold data in shared memory along with a synchronization mechanism between processes. Users...

10.1117/12.2314315 article EN Adaptive Optics Systems VI 2018-07-11

Climate and weather can be predicted statistically via geospatial Maximum Likelihood Estimates (MLE), as an alternative to running large ensembles of forward models. The MLE-based iterative optimization procedure requires solving large-scale linear systems, which is done by performing a Cholesky factorization on a symmetric positive-definite covariance matrix, a demanding dense operation in terms of memory footprint and computation. We propose a novel solution to this problem: at the mathematical level, we reduce the computational...

10.1145/3394277.3401846 article EN 2020-06-18

State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to its inability to fully exploit thread-level parallelism. At the same time, the coarse-grain dataflow model gains popularity as a paradigm for programming multicore architectures. This work looks at implementing classic dense linear algebra workloads, the Cholesky factorization, QR factorization and LU factorization, using dynamic data-driven execution. Two emerging approaches are examined, one of nested...

10.1002/cpe.1467 article EN Concurrency and Computation Practice and Experience 2009-08-11
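The tile formulation that makes these factorizations amenable to dataflow execution can be written out directly. Below is a pure-Python right-looking tile Cholesky over a dictionary of lower tiles; the kernel names (`potrf`, `trsm`, `gemm_sub`) follow BLAS/LAPACK naming only by analogy, and the scalar loops stand in for optimized kernels. Every call here would become one task node in the DAG, with edges given by the tiles it reads and writes.

```python
import math

def potrf(T):
    # Cholesky of one diagonal tile; returns lower-triangular factor
    n = len(T)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(T[i][i] - s) if i == j else (T[i][j] - s) / L[j][j]
    return L

def trsm(L, B):
    # solve X * L^T = B for a panel tile X
    X = [row[:] for row in B]
    for i in range(len(B)):
        for j in range(len(L)):
            s = X[i][j] - sum(X[i][k] * L[j][k] for k in range(j))
            X[i][j] = s / L[j][j]
    return X

def gemm_sub(C, A, B):
    # trailing update C -= A * B^T (SYRK when A is B)
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] -= sum(A[i][k] * B[j][k] for k in range(len(A[0])))

def tile_cholesky(tiles, p):
    # right-looking tile Cholesky on a p x p grid of lower tiles:
    # factor diagonal tile, solve the panel below it, update the trailing submatrix
    for k in range(p):
        tiles[(k, k)] = potrf(tiles[(k, k)])
        for i in range(k + 1, p):
            tiles[(i, k)] = trsm(tiles[(k, k)], tiles[(i, k)])
        for i in range(k + 1, p):
            for j in range(k + 1, i + 1):
                gemm_sub(tiles[(i, j)], tiles[(i, k)], tiles[(j, k)])
    return tiles
```

Unlike the fork-join style of blocked LAPACK, nothing here forces the `gemm_sub` updates of step k to finish before the `potrf` of step k+1 starts on an already-updated tile, which is exactly the lookahead a dataflow runtime exploits.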

While successful implementations have already been written for one-sided transformations (e.g., QR, LU and Cholesky factorizations) on multicore architectures, getting high performance for two-sided reductions (e.g., Hessenberg, tridiagonal and bidiagonal reductions) is still an open and difficult research problem due to the expensive memory-bound operations occurring during the panel factorization. The processor-memory speed gap continues to widen, which has even further exacerbated the problem. This paper focuses...

10.1109/ipdps.2011.91 article EN 2011-05-01

A traditional goal of algorithmic optimality, squeezing out flops, has been superseded by evolution in architecture. Flops no longer serve as a reasonable proxy for all aspects of complexity. Instead, algorithms must now squeeze memory, data transfers, and synchronizations, while extra flops on locally cached data represent only small costs in time and energy. Hierarchically low-rank matrices realize a rarely achieved combination of optimal storage complexity and high computational intensity for a wide class of formally...

10.1098/rsta.2019.0055 article EN cc-by Philosophical Transactions of the Royal Society A Mathematical Physical and Engineering Sciences 2020-01-20
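A standard way to build the low-rank blocks such hierarchical formats rely on is Adaptive Cross Approximation (ACA), which constructs a rank-k factorization from O(k(m+n)) individual entries rather than the full block. The sketch below is a partially pivoted ACA in pure Python; the kernel used in the test is an illustrative smooth, well-separated interaction, not from the paper.

```python
def aca(get_entry, m, n, tol=1e-8, max_rank=20):
    # Partially pivoted Adaptive Cross Approximation.
    # Returns factors as lists: U[t] is a column of length m, V[t] a row of
    # length n, with A[i][j] ~= sum_t U[t][i] * V[t][j].
    U, V = [], []
    row_pivot, used_rows = 0, set()
    for _ in range(max_rank):
        # residual of the pivot row under the current approximation
        row = [get_entry(row_pivot, j)
               - sum(U[t][row_pivot] * V[t][j] for t in range(len(U)))
               for j in range(n)]
        col_pivot = max(range(n), key=lambda j: abs(row[j]))
        pivot = row[col_pivot]
        if abs(pivot) < tol:       # residual small enough: stop
            break
        col = [(get_entry(i, col_pivot)
                - sum(U[t][i] * V[t][col_pivot] for t in range(len(U)))) / pivot
               for i in range(m)]
        U.append(col)
        V.append(row)
        used_rows.add(row_pivot)
        # next pivot row: largest entry of the new column among unused rows
        row_pivot = max((i for i in range(m) if i not in used_rows),
                        key=lambda i: abs(col[i]), default=None)
        if row_pivot is None:
            break
    return U, V
```

The point matches the abstract's thesis: storage and data movement drop from O(mn) to O(k(m+n)), while the arithmetic that remains (the rank-k products) is compute-intensive.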

Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Spatial data are assumed to possess properties of stationarity or non-stationarity, captured via a kernel-fitted covariance matrix. A primary workhorse of stationary spatial statistics is Gaussian maximum log-likelihood estimation (MLE), whose central data structure is a dense, symmetric positive...

10.1109/tpds.2021.3084071 article EN publisher-specific-oa IEEE Transactions on Parallel and Distributed Systems 2021-05-26

To exploit the potential of multicore architectures, recent dense linear algebra libraries have used tile algorithms, which consist in scheduling a Directed Acyclic Graph (DAG) of tasks of fine granularity, where nodes represent tasks, either the panel factorization or the update of a block-column, and edges represent dependencies among them. Although past approaches already achieve high performance on moderate and large square matrices, their way of processing a panel in sequence leads to limited performance when factorizing tall and skinny matrices...

10.1109/ipdps.2010.5470443 article EN 2010-01-01
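The remedy for the sequential panel is a communication-avoiding tree reduction, often called TSQR: factor independent row blocks in parallel, then QR the stacked R factors. The sketch below is a one-level tree in pure Python with modified Gram-Schmidt standing in for the Householder kernels used in practice; it assumes each row block has at least as many rows as columns.

```python
import math

def mgs_qr(A):
    # modified Gram-Schmidt QR of an m x n (m >= n) matrix with
    # linearly independent columns; returns (Q, R)
    m, n = len(A), len(A[0])
    Q = [row[:] for row in A]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        nrm = math.sqrt(sum(Q[i][j] ** 2 for i in range(m)))
        R[j][j] = nrm
        for i in range(m):
            Q[i][j] /= nrm
        for k in range(j + 1, n):
            dot = sum(Q[i][j] * Q[i][k] for i in range(m))
            R[j][k] = dot
            for i in range(m):
                Q[i][k] -= dot * Q[i][j]
    return Q, R

def tsqr(A, block):
    # tall-skinny QR, one tree level: local QR per row block,
    # then a reduction QR on the stacked R factors
    m, n = len(A), len(A[0])
    Qs, Rs = [], []
    for s in range(0, m, block):
        Qb, Rb = mgs_qr(A[s:s + block])
        Qs.append(Qb)
        Rs.append(Rb)
    stacked = [row for Rb in Rs for row in Rb]
    Q2, R = mgs_qr(stacked)
    # assemble global Q = diag(Q_1 .. Q_p) * Q2
    Q = [[0.0] * n for _ in range(m)]
    for p_idx, Qb in enumerate(Qs):
        for i, rowq in enumerate(Qb):
            gi = p_idx * block + i
            for j in range(n):
                Q[gi][j] = sum(rowq[k] * Q2[p_idx * n + k][j] for k in range(n))
    return Q, R
```

The local factorizations are independent, so a DAG scheduler can run them concurrently, which is precisely what the sequential panel of classic tile algorithms prevents on tall and skinny matrices.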