- Parallel Computing and Optimization Techniques
- Distributed and Parallel Computing Systems
- Matrix Theory and Algorithms
- Numerical Methods and Algorithms
- Interconnection Networks and Systems
- Advanced Data Storage Technologies
- Embedded Systems Design Techniques
- Cloud Computing and Resource Management
- Electromagnetic Scattering and Analysis
- Stochastic Gradient Optimization Techniques
- Algorithms and Data Compression
- Scheduling and Optimization Algorithms
- Particle Accelerators and Beam Dynamics
- Scientific Computing and Data Management
- Digital Filter Design and Implementation
- Quantum Computing Algorithms and Architecture
- Sparse and Compressive Sensing Techniques
- Industrial Automation and Control Systems
- Advanced Neural Network Applications
- Electromagnetic Simulation and Numerical Methods
- Advanced Graph Neural Networks
- Computational Geometry and Mesh Generation
- Robotics and Sensor-Based Localization
- Advanced Image and Video Retrieval Techniques
- Scheduling and Timetabling Solutions
Advanced Micro Devices (United States)
2021-2024
Advanced Micro Devices (Canada)
2024
University of Tennessee at Knoxville
2010-2019
Oak Ridge National Laboratory
2017-2018
University of Manchester
2017-2018
University of Houston
2004-2008
The emergence and continuing use of multi-core architectures and graphics processing units require changes in existing software, and sometimes even a redesign of established algorithms, in order to take advantage of the now prevailing parallelism. Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) and Matrix Algebra on GPU and Multicore Architectures (MAGMA) are two projects that aim to achieve high performance and portability across a wide range of multi-core architectures and hybrid systems, respectively. We present in this document a comparative study of PLASMA's...
We present a method for developing dense linear algebra algorithms that seamlessly scales to thousands of cores. It can be done with our project called DPLASMA (Distributed PLASMA), which uses a novel generic distributed Direct Acyclic Graph Engine (DAGuE). The engine has been designed for high performance computing and thus it enables scaling of tile algorithms, originating in PLASMA, on large distributed memory systems. The underlying DAGuE framework has many appealing features when considering distributed-memory platforms...
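The dataflow idea behind such engines can be sketched generically: a task becomes runnable once all of its predecessors in the DAG have completed. Below is a toy sequential executor in Python; the names `run_dag`, `tasks`, and `deps` are illustrative and are not DAGuE's API, which tracks dependencies symbolically and runs independent tasks in parallel across distributed nodes.

```python
def run_dag(tasks, deps):
    """Execute a task DAG in dependency order.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> list of prerequisite task names
    Returns a dict of task results, computed in a valid topological order.
    """
    done, order = set(), []

    def visit(name):
        # Depth-first traversal: schedule prerequisites before the task.
        if name in done:
            return
        for d in deps.get(name, ()):
            visit(d)
        done.add(name)
        order.append(name)

    for name in tasks:
        visit(name)
    # Run the tasks in the computed order.
    return {name: tasks[name]() for name in order}
```

A parallel runtime replaces the sequential loop with a scheduler that launches every task whose prerequisites are satisfied, which is what exposes the fine-grain parallelism of tile algorithms.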
In recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected by error correction codes. Being a crucial component of software packages such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the most important workloads to be implemented on these devices. This...
The computation of the singular value decomposition, or SVD, has a long history, with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of the changes. There are two main branches of methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EISPACK library, and later more efficiently...
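The Jacobi branch can be illustrated with a textbook one-sided Jacobi SVD: rotate pairs of columns until all columns are mutually orthogonal, at which point the singular values are the column norms. This is a minimal sketch, not a production implementation; `np.linalg.svd` follows the other branch (bidiagonalization, via LAPACK).

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """One-sided Jacobi SVD sketch: returns (U, sigma, V) with A = U @ diag(sigma) @ V.T."""
    U = A.astype(np.float64, copy=True)
    n = U.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        off = 0.0  # largest remaining normalized off-diagonal element
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                off = max(off, abs(gamma) / np.sqrt(alpha * beta))
                if gamma == 0.0:
                    continue
                # Jacobi rotation that orthogonalizes columns p and q.
                zeta = (beta - alpha) / (2.0 * gamma)
                sign = 1.0 if zeta >= 0.0 else -1.0
                t = sign / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                G = np.array([[c, s], [-s, c]])
                U[:, [p, q]] = U[:, [p, q]] @ G
                V[:, [p, q]] = V[:, [p, q]] @ G
        if off < tol:
            break
    sigma = np.linalg.norm(U, axis=0)  # singular values = column norms
    return U / sigma, sigma, V
```

Jacobi methods are attractive for their high relative accuracy and inherent parallelism across independent column pairs, at the cost of more floating point work than bidiagonalization.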
By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the accuracy of the resulting solution. The approach presented here applies not only to conventional processors but also to exotic technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the Cell BE processor. Results on modern processor architectures are presented.
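The core idea can be sketched in a few lines of NumPy/SciPy: do the expensive O(n^3) factorization in single precision, then recover double precision accuracy with cheap O(n^2) iterative refinement steps. This is an illustrative sketch assuming a reasonably well-conditioned system; the function name and defaults are not from the paper.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Solve Ax = b: LU factorization in float32, refinement in float64."""
    A32 = A.astype(np.float32)
    lu, piv = lu_factor(A32)                 # O(n^3) work, done in single precision
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                        # residual computed in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        d = lu_solve((lu, piv), r.astype(np.float32))  # cheap O(n^2) correction solve
        x += d.astype(np.float64)
    return x
```

Since single precision arithmetic runs faster on most hardware and halves memory traffic, the refinement loop typically converges in a handful of iterations while the bulk of the flops stay in the fast format.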
As multicore systems continue to gain ground in the high-performance computing world, linear algebra algorithms have to be reformulated, or new algorithms developed, in order to take advantage of the architectural features of these processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks...
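A simplified blocked QR in NumPy illustrates the decomposition into panel factorizations and trailing-matrix updates; the actual tile algorithm breaks this down further into small kernels on square tiles (GEQRT/TSQRT-style) so the tasks can run loosely synchronized. The function below is only a sketch of the blocking idea.

```python
import numpy as np

def blocked_qr(A, nb=2):
    """Right-looking blocked QR: factor a panel of nb columns, then
    apply its orthogonal transformation to the trailing columns."""
    A = A.astype(np.float64, copy=True)
    m, n = A.shape
    Q = np.eye(m)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # Panel factorization (one "task" per panel here).
        Qk, Rk = np.linalg.qr(A[k:, k:k + kb], mode="complete")
        A[k:, k:k + kb] = np.triu(Rk)
        # Trailing-matrix update: apply Qk^T to the remaining columns.
        A[k:, k + kb:] = Qk.T @ A[k:, k + kb:]
        # Accumulate the orthogonal factor.
        Q[:, k:] = Q[:, k:] @ Qk
    return Q, np.triu(A)
```

In the tile formulation, each panel and update is itself split into tile-sized tasks whose dependencies form a DAG, which is what removes the fork-join synchronization points of the blocked version above.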
The Sony/Toshiba/IBM (STI) CELL processor introduces pioneering solutions in processor architecture. At the same time it presents new challenges for the development of numerical algorithms. One is the effective exploitation of the differential between the speed of single and double precision arithmetic; the other is the efficient parallelization among the short vector SIMD cores. The first challenge is addressed by utilizing the well known technique of iterative refinement for the solution of a dense symmetric positive definite system of linear equations, resulting...
The SLATE (Software for Linear Algebra Targeting Exascale) library is being developed to provide fundamental dense linear algebra capabilities for current and upcoming distributed high-performance systems, both accelerated CPU-GPU based and CPU based. SLATE will provide coverage of existing ScaLAPACK functionality, including the parallel BLAS; linear systems using LU and Cholesky; least squares problems using QR; and eigenvalue and singular value problems. In this respect, it will serve as a replacement for ScaLAPACK, which after two decades...
By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many sparse linear algebra algorithms can be significantly enhanced while maintaining the accuracy of the resulting solution. These ideas can be applied to multifrontal and supernodal direct techniques, and to iterative techniques such as Krylov subspace methods. The approach presented here applies not only to conventional processors but also to exotic technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the Cell BE processor.
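For the iterative case, the same pattern applies: run an inner Krylov solve in float32 and wrap it in an outer refinement loop that computes residuals in float64. The sketch below uses SciPy's conjugate gradient at its default inner tolerance; the function name and loop bounds are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def mixed_precision_cg(A, b, outer_tol=1e-10, max_outer=20):
    """Outer iterative refinement in float64 around inner float32 CG solves.
    A must be symmetric positive definite (sparse or dense)."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b)
    for _ in range(max_outer):
        r = b - A @ x                         # double-precision residual
        if np.linalg.norm(r) <= outer_tol * np.linalg.norm(b):
            break
        # Inner solve for the correction, entirely in single precision.
        d, _ = cg(A32, r.astype(np.float32))
        x += d.astype(np.float64)
    return x
```

As long as each inner solve reduces the residual by some fixed factor, the outer loop converges to double precision accuracy while most of the memory traffic, often the bottleneck for sparse kernels, moves at single precision width.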
Recent versions of microprocessors exhibit performance characteristics for 32 bit floating point arithmetic (single precision) that are substantially higher than for 64 bit floating point arithmetic (double precision). Examples include Intel's Pentium IV and M processors, AMD's Opteron architectures, and IBM's Cell Broadband Engine processor. When working in single precision, floating point operations can be performed up to two times faster on the Pentium and up to ten times faster on the Cell than in double precision. The performance enhancements in these architectures are derived by accessing extensions to the basic...
This paper describes the design concepts behind implementations of mixed-precision linear algebra routines targeted for the Cell processor. It describes in detail the implementation of code to solve a system of linear equations using Gaussian elimination in single precision with iterative refinement of the solution to full double-precision accuracy. By utilizing this approach the algorithm achieves close to an order of magnitude higher performance on the Cell processor than that offered by the standard double precision algorithm. The result is effectively a high-performance...
Many problems in engineering and scientific computing require the solution of a large number of small systems of linear equations. Due to their high processing power, Graphics Processing Units became an attractive target for this class of problems, and routines based on the LU and QR factorizations have been provided by the NVIDIA cuBLAS library. This work addresses the situation where the systems of equations are symmetric positive definite. The paper describes the implementation and tuning of kernels for the Cholesky factorization and the forward and backward substitution....
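The batched computation can be sketched in NumPy, where the Cholesky factorization is applied across the whole batch at once and the forward/backward substitutions are vectorized over the batch dimension. This is only an illustration of the algorithmic structure; the production kernels live in GPU libraries such as cuBLAS/cuSOLVER, and the function name below is an assumption.

```python
import numpy as np

def batched_cholesky_solve(As, bs):
    """Solve a batch of small SPD systems A_i x_i = b_i.
    As: (batch, n, n) SPD matrices; bs: (batch, n) right-hand sides."""
    Ls = np.linalg.cholesky(As)          # batched factorization: A_i = L_i L_i^T
    n = Ls.shape[-1]
    ys = np.empty_like(bs)
    # Forward substitution L y = b, one row at a time, vectorized over the batch.
    for i in range(n):
        ys[:, i] = (bs[:, i]
                    - np.einsum('bj,bj->b', Ls[:, i, :i], ys[:, :i])) / Ls[:, i, i]
    xs = np.empty_like(bs)
    # Backward substitution L^T x = y.
    for i in range(n - 1, -1, -1):
        xs[:, i] = (ys[:, i]
                    - np.einsum('bj,bj->b', Ls[:, i + 1:, i], xs[:, i + 1:])) / Ls[:, i, i]
    return xs
```

The key performance point is that the loop is over the (small) matrix dimension, not over the (large) batch, so each step does enough uniform work to keep a wide processor busy.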
The recent version of the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) library is based on tasks with dependencies from the OpenMP standard. The main functionality of the library is presented. Extensive benchmarks target three multicore and manycore architectures, namely, Intel Xeon, Intel Xeon Phi, and IBM POWER 8 processors.
The growth of simulations of particle systems has been aided by advances in computer speed and algorithms. The adoption of O(N) algorithms to solve N-body simulation problems has been less rapid due to the fact that such scaling was only competitive for relatively large N. Our work seeks to find algorithmic modifications and practical implementations for intermediate values of N typical of use in molecular simulations. This article reviews fast multipole techniques for the calculation of electrostatic interactions in particle systems. The basic mathematics...
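For reference, the computation the fast multipole method accelerates is the direct pairwise sum, which costs O(N^2). A minimal sketch in NumPy (Gaussian units, charges `q` at positions `pos`; the function name is illustrative):

```python
import numpy as np

def direct_coulomb_energy(pos, q):
    """Direct O(N^2) pairwise electrostatic energy:
    E = sum over i < j of q_i * q_j / r_ij."""
    diff = pos[:, None, :] - pos[None, :, :]     # all pairwise displacement vectors
    r = np.linalg.norm(diff, axis=-1)            # pairwise distances
    iu = np.triu_indices(len(q), k=1)            # each unordered pair once
    return np.sum(q[iu[0]] * q[iu[1]] / r[iu])
```

The FMM replaces the far-field part of this sum with hierarchical multipole expansions, reducing the cost to O(N) at a controllable accuracy; the crossover N at which this beats the direct sum is exactly the practical question the article addresses.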
State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to its inability to fully exploit thread-level parallelism. At the same time, the coarse-grain dataflow model gains popularity as a paradigm for programming multicore architectures. This work looks at implementing classic dense linear algebra workloads, the Cholesky factorization, QR factorization and LU factorization, using dynamic data-driven execution. Two emerging approaches are examined, the model of nested...
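The tile Cholesky factorization illustrates how such a workload decomposes into the four kernels (POTRF, TRSM, SYRK, GEMM) whose data dependencies form the DAG that a dataflow runtime schedules. The sketch below executes the kernels sequentially for clarity and assumes the matrix order is a multiple of the tile size `nb`.

```python
import numpy as np

def tile_cholesky(A, nb):
    """Tile Cholesky of an SPD matrix; returns lower triangular L with A = L L^T."""
    A = A.astype(np.float64, copy=True)
    n = A.shape[0]  # assumed to be a multiple of nb
    for k in range(0, n, nb):
        kk = slice(k, k + nb)
        # POTRF: Cholesky of the diagonal tile.
        A[kk, kk] = np.linalg.cholesky(A[kk, kk])
        for i in range(k + nb, n, nb):
            ii = slice(i, i + nb)
            # TRSM: A_ik <- A_ik * L_kk^{-T}
            A[ii, kk] = np.linalg.solve(A[kk, kk], A[ii, kk].T).T
        for i in range(k + nb, n, nb):
            ii = slice(i, i + nb)
            # SYRK: update the diagonal tile of the trailing matrix.
            A[ii, ii] -= A[ii, kk] @ A[ii, kk].T
            for j in range(i + nb, n, nb):
                jj = slice(j, j + nb)
                # GEMM: update an off-diagonal tile below the diagonal.
                A[jj, ii] -= A[jj, kk] @ A[ii, kk].T
    return np.tril(A)
```

Each kernel invocation touches only a few tiles, so a data-driven runtime can run, for example, all TRSMs of step k and early kernels of step k+1 concurrently as soon as their inputs are ready, which is the thread-level parallelism LAPACK's fork-join structure leaves on the table.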