Jakub Kurzak

ORCID: 0000-0002-9697-0145
Research Areas
  • Parallel Computing and Optimization Techniques
  • Distributed and Parallel Computing Systems
  • Matrix Theory and Algorithms
  • Numerical Methods and Algorithms
  • Interconnection Networks and Systems
  • Advanced Data Storage Technologies
  • Embedded Systems Design Techniques
  • Cloud Computing and Resource Management
  • Electromagnetic Scattering and Analysis
  • Stochastic Gradient Optimization Techniques
  • Algorithms and Data Compression
  • Scheduling and Optimization Algorithms
  • Particle accelerators and beam dynamics
  • Scientific Computing and Data Management
  • Digital Filter Design and Implementation
  • Quantum Computing Algorithms and Architecture
  • Sparse and Compressive Sensing Techniques
  • Industrial Automation and Control Systems
  • Advanced Neural Network Applications
  • Electromagnetic Simulation and Numerical Methods
  • Advanced Graph Neural Networks
  • Computational Geometry and Mesh Generation
  • Robotics and Sensor-Based Localization
  • Advanced Image and Video Retrieval Techniques
  • Scheduling and Timetabling Solutions

Advanced Micro Devices (United States)
2021-2024

Advanced Micro Devices (Canada)
2024

University of Tennessee at Knoxville
2010-2019

Oak Ridge National Laboratory
2017-2018

University of Manchester
2017-2018

University of Houston
2004-2008

The emergence and continuing use of multi-core architectures and graphics processing units require changes in existing software, and sometimes even a redesign of established algorithms, in order to take advantage of the now prevailing parallelism. Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) and Matrix Algebra on GPU and Multicore Architectures (MAGMA) are two projects that aim to achieve high performance and portability across a wide range of multi-core architectures and hybrid systems, respectively. We present in this document a comparative study of PLASMA's...

10.1088/1742-6596/180/1/012037 article EN Journal of Physics Conference Series 2009-07-01

We present a method for developing dense linear algebra algorithms that seamlessly scales to thousands of cores. It can be done with our project called DPLASMA (Distributed PLASMA), which uses a novel generic distributed Direct Acyclic Graph Engine (DAGuE). The engine has been designed for high performance computing and thus it enables the scaling of tile algorithms, originating in PLASMA, on large distributed memory systems. The underlying DAGuE framework has many appealing features when considering distributed-memory platforms...

10.1109/ipdps.2011.299 article EN 2011-05-01

In recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected with error correction codes. Being a crucial component of software packages such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the most important workloads to be implemented on these devices. This...

10.1109/tpds.2011.311 article EN IEEE Transactions on Parallel and Distributed Systems 2012-01-04
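
To make the blocking idea concrete, here is a minimal NumPy sketch (the function name and tile size are illustrative, not from the paper): C = AB is accumulated tile by tile, and on a GPU each (i, j) tile of C would map to a thread block streaming nb-wide slabs of A and B through fast shared memory.

    import numpy as np

    def blocked_gemm(A, B, nb=64):
        # Tile-by-tile matrix multiply; the slices handle ragged edges.
        m, k = A.shape
        n = B.shape[1]
        C = np.zeros((m, n), dtype=A.dtype)
        for i in range(0, m, nb):
            for j in range(0, n, nb):
                for p in range(0, k, nb):
                    C[i:i+nb, j:j+nb] += A[i:i+nb, p:p+nb] @ B[p:p+nb, j:j+nb]
        return C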

The computation of the singular value decomposition, or SVD, has a long history with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of changes. There are two main branches of methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EISPACK library, and later more efficiently...

10.1137/17m1117732 article EN SIAM Review 2018-01-01
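
As a small illustration of the Jacobi branch surveyed here, the following sketch (assuming NumPy; the function name is hypothetical) computes singular values by one-sided Jacobi: column pairs are rotated until mutually orthogonal, after which the column norms are the singular values.

    import numpy as np

    def jacobi_singular_values(A, tol=1e-12, max_sweeps=30):
        # One-sided Jacobi: rotate column pairs until mutually orthogonal;
        # the singular values are then the Euclidean norms of the columns.
        U = A.astype(np.float64).copy()
        n = U.shape[1]
        for _ in range(max_sweeps):
            off = 0.0
            for p in range(n - 1):
                for q in range(p + 1, n):
                    a = U[:, p] @ U[:, p]
                    b = U[:, q] @ U[:, q]
                    c = U[:, p] @ U[:, q]
                    off = max(off, abs(c) / np.sqrt(a * b))
                    if c == 0.0:
                        continue
                    # Stable rotation that zeroes the (p, q) inner product.
                    zeta = (b - a) / (2.0 * c)
                    s = 1.0 if zeta >= 0.0 else -1.0
                    t = s / (abs(zeta) + np.hypot(1.0, zeta))
                    cs = 1.0 / np.hypot(1.0, t)
                    sn = cs * t
                    up, uq = U[:, p].copy(), U[:, q].copy()
                    U[:, p] = cs * up - sn * uq
                    U[:, q] = sn * up + cs * uq
            if off < tol:
                break
        return np.sort(np.linalg.norm(U, axis=0))[::-1]

    A = np.random.default_rng(1).standard_normal((60, 20))
    assert np.allclose(jacobi_singular_values(A),
                       np.linalg.svd(A, compute_uv=False))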

By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the accuracy of the resulting solution. The approach presented here applies not only to conventional processors but also to exotic technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the Cell BE processor. Results on modern processor architectures are presented.

10.1177/1094342007084026 article EN The International Journal of High Performance Computing Applications 2007-10-29
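
A minimal NumPy/SciPy sketch of the idea (the function name, test problem, and tolerances are illustrative): the O(n^3) LU factorization is performed once in 32-bit arithmetic, and cheap O(n^2) 64-bit refinement steps recover full double precision accuracy.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
        # Expensive factorization in single precision...
        lu, piv = lu_factor(A.astype(np.float32))
        x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
        for _ in range(max_iter):
            r = b - A @ x                 # true residual in double precision
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
            # ...cheap corrections reusing the single precision factors.
            d = lu_solve((lu, piv), r.astype(np.float32))
            x += d.astype(np.float64)
        return x

    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 500)) + 500 * np.eye(500)
    b = rng.standard_normal(500)
    x = mixed_precision_solve(A, b)
    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))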

As multicore systems continue to gain ground in the high-performance computing world, linear algebra algorithms have to be reformulated, or new algorithms developed, in order to take advantage of the architectural features of these processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks...

10.1002/cpe.1301 article EN Concurrency and Computation Practice and Experience 2008-06-03
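
The following is a schematic NumPy rendition of such a tiled QR, under the simplifying assumptions that Q is left implicit and the structured update kernels are replaced by plain QR on stacked tile pairs; the names and block size are illustrative, not from the paper.

    import numpy as np

    def tiled_qr_r(A, nb):
        # Compute the R factor of a square A by tile operations.
        A = A.astype(np.float64).copy()
        n = A.shape[0]
        t = n // nb
        for k in range(t):
            top = slice(k * nb, (k + 1) * nb)
            # GEQRT-like task: triangularize the diagonal tile's row block.
            _, A[top, k * nb:] = np.linalg.qr(A[top, k * nb:])
            for i in range(k + 1, t):
                low = slice(i * nb, (i + 1) * nb)
                # TSQRT/SSRFB-like task: eliminate tile (i, k) by factoring
                # the stacked pair of row blocks and writing back [R; 0].
                stacked = np.vstack([A[top, k * nb:], A[low, k * nb:]])
                _, r = np.linalg.qr(stacked)
                z = np.zeros_like(stacked)
                z[:r.shape[0]] = r
                A[top, k * nb:] = z[:nb]
                A[low, k * nb:] = z[nb:]
        return np.triu(A)

    rng = np.random.default_rng(0)
    A0 = rng.standard_normal((8 * 32, 8 * 32))
    R = tiled_qr_r(A0, 32)
    assert np.allclose(R.T @ R, A0.T @ A0)   # same Gram matrix as A0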

The Sony/Toshiba/IBM (STI) CELL processor introduces pioneering solutions in processor architecture. At the same time it presents new challenges for the development of numerical algorithms. One is the effective exploitation of the differential between the speed of single and double precision arithmetic; the other is the efficient parallelization between the short vector SIMD cores. The first challenge is addressed by utilizing the well-known technique of iterative refinement for the solution of a dense symmetric positive definite system of linear equations, resulting...

10.1109/tpds.2007.70813 article EN IEEE Transactions on Parallel and Distributed Systems 2008-08-01

The SLATE (Software for Linear Algebra Targeting Exascale) library is being developed to provide fundamental dense linear algebra capabilities for current and upcoming distributed high-performance systems, both accelerated CPU-GPU based and CPU based. SLATE will provide coverage of existing ScaLAPACK functionality, including the parallel BLAS; linear systems using LU and Cholesky; least squares problems using QR; and eigenvalue and singular value problems. In this respect, it will serve as a replacement for ScaLAPACK, which after two decades...

10.1145/3295500.3356223 article EN 2019-11-07

By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many sparse linear algebra algorithms can be significantly enhanced while maintaining the accuracy of the resulting solution. These ideas can be applied to multifrontal and supernodal direct techniques, and to iterative techniques such as Krylov subspace methods. The approach presented here applies not only to conventional processors but also to exotic technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the Cell BE processor.

10.1145/1377596.1377597 article EN ACM Transactions on Mathematical Software 2008-07-01
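
A hedged sketch of how the idea carries over to Krylov methods, assuming SciPy's sparse module (the test matrix and names are illustrative): the inner conjugate gradient solves run in single precision, while an outer refinement loop computes true residuals in double precision.

    import numpy as np
    from scipy.sparse import identity, random as sparse_random
    from scipy.sparse.linalg import cg

    def mixed_precision_cg(A, b, outer_tol=1e-12, max_outer=20):
        # Inner Krylov solves in float32; outer refinement in float64.
        A32 = A.astype(np.float32)
        x = np.zeros(b.shape[0], dtype=np.float64)
        for _ in range(max_outer):
            r = b - A @ x               # true residual in double precision
            if np.linalg.norm(r) <= outer_tol * np.linalg.norm(b):
                break
            d, _ = cg(A32, r.astype(np.float32), maxiter=100)
            x += d.astype(np.float64)
        return x

    # Small SPD sparse test problem (illustrative only).
    n = 500
    M = sparse_random(n, n, density=0.01, random_state=0)
    A = (M @ M.T + 10.0 * identity(n)).tocsr()
    b = np.ones(n)
    x = mixed_precision_cg(A, b)
    print(np.linalg.norm(b - A @ x))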

Recent versions of microprocessors exhibit performance characteristics for 32 bit floating point arithmetic (single precision) that are substantially higher than for 64 bit (double precision). Examples include Intel's Pentium IV and M processors, AMD's Opteron architectures, and IBM's Cell Broadband Engine processor. When working in single precision, floating point operations can be performed up to two times faster on the Pentium and up to ten times faster on the Cell over double precision. The performance enhancements in these architectures are derived by accessing extensions to the basic...

10.1145/1188455.1188573 article EN 2006-01-01
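
The gap is easy to observe on current hardware with a quick NumPy timing sketch; the measured ratio depends entirely on the BLAS backend and machine, so treat it as illustrative only.

    import time
    import numpy as np

    # Rough single vs. double precision comparison on matrix multiply.
    n = 2048
    for dtype in (np.float32, np.float64):
        A = np.random.rand(n, n).astype(dtype)
        t0 = time.perf_counter()
        A @ A
        print(dtype.__name__, round(time.perf_counter() - t0, 3), "s")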

Recent versions of microprocessors exhibit performance characteristics for 32 bit floating point arithmetic (single precision) that are substantially higher than for 64 bit (double precision). Examples include Intel's Pentium IV and M processors, AMD's Opteron architectures, and IBM's Cell Broadband Engine processor. When working in single precision, floating point operations can be performed up to two times faster on the Pentium and up to ten times faster on the Cell over double precision. The performance enhancements in these architectures are derived by accessing extensions to the basic...

10.1109/sc.2006.30 article EN 2006-11-01

This paper describes the design concepts behind implementations of mixed-precision linear algebra routines targeted for the Cell processor. It describes in detail the implementation of code to solve a system of linear equations using Gaussian elimination in single precision with iterative refinement of the solution to full double-precision accuracy. By utilizing this approach the algorithm achieves close to an order of magnitude higher performance on the Cell processor than that offered by the standard double precision algorithm. The code is effectively a high-performance...

10.1002/cpe.1164 article EN Concurrency and Computation Practice and Experience 2007-01-09

Many problems in engineering and scientific computing require the solution of a large number of small systems of linear equations. Due to their high processing power, Graphics Processing Units became an attractive target for this class of problems, and routines based on the LU and QR factorizations have been provided by NVIDIA in the cuBLAS library. This work addresses the situation where the systems of equations are symmetric positive definite. The paper describes the implementation and tuning of the kernels for the Cholesky factorization and the forward and backward substitution...

10.1109/tpds.2015.2481890 article EN IEEE Transactions on Parallel and Distributed Systems 2015-09-24
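
The batched formulation is natural even in NumPy, which applies Cholesky across a stacked batch of small SPD systems; this sketch is illustrative (it uses a generic solve for the substitutions rather than dedicated triangular kernels, which is what the tuned GPU kernels provide).

    import numpy as np

    # Batched solve of many small SPD systems, stacked along the leading axis.
    rng = np.random.default_rng(0)
    batch, n = 10000, 8
    M = rng.standard_normal((batch, n, n))
    A = M @ M.transpose(0, 2, 1) + n * np.eye(n)   # make every system SPD
    b = rng.standard_normal((batch, n, 1))

    L = np.linalg.cholesky(A)                      # batched factorizations
    y = np.linalg.solve(L, b)                      # forward substitution
    x = np.linalg.solve(L.transpose(0, 2, 1), y)   # backward substitution
    assert np.allclose(A @ x, b)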

10.1007/s10766-016-0441-6 article EN International Journal of Parallel Programming 2016-06-14

The recent version of the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) library is based on tasks with dependencies from the OpenMP standard. The main functionality of the library is presented. Extensive benchmarks are targeted at three multicore and manycore architectures, namely, Intel Xeon, Intel Xeon Phi, and IBM POWER 8 processors.

10.1145/3264491 article EN ACM Transactions on Mathematical Software 2019-05-03

The growth of simulations of particle systems has been aided by advances in computer speed and algorithms. The adoption of O(N) algorithms to solve N-body simulation problems has been less rapid due to the fact that such scaling was only competitive for relatively large N. Our work seeks to find algorithmic modifications and practical implementations for intermediate values of N typical of use in molecular simulations. This article reviews fast multipole techniques for the calculation of electrostatic interactions in particle systems. The basic mathematics...

10.1080/08927020600991161 article EN Molecular Simulation 2006-09-01
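
For context, the quantity such fast multipole methods approximate in O(N) is the direct O(N^2) pairwise electrostatic sum, sketched below with NumPy (the function name and test data are illustrative):

    import numpy as np

    def coulomb_energy(pos, q):
        # Sum q_i * q_j / r_ij over all unique pairs: O(N^2) work and memory.
        d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
        i, j = np.triu_indices(len(q), k=1)
        return np.sum(q[i] * q[j] / d[i, j])

    rng = np.random.default_rng(1)
    pos = rng.random((500, 3))
    q = rng.choice([-1.0, 1.0], size=500)
    print(coulomb_energy(pos, q))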

State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to its inability to fully exploit thread-level parallelism. At the same time, the coarse-grain dataflow model gains popularity as a paradigm for programming multicore architectures. This work looks at implementing classic dense linear algebra workloads, the Cholesky factorization, the QR factorization and the LU factorization, using dynamic data-driven execution. Two emerging approaches are examined, that of nested...

10.1002/cpe.1467 article EN Concurrency and Computation Practice and Experience 2009-08-11
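
To illustrate what a sequence of tasks with data dependencies means for one of these workloads, here is a schematic, serially executed tiled Cholesky in NumPy/SciPy; the POTRF/TRSM/SYRK labels follow LAPACK/BLAS naming, and a dataflow runtime would dispatch each task as soon as its input tiles are ready. The code is an illustrative sketch, not the paper's implementation.

    import numpy as np
    from scipy.linalg import solve_triangular

    def tiled_cholesky(A, nb):
        # Each commented step is one task; its tile reads and writes define
        # the DAG edges a dataflow runtime would use to schedule execution.
        A = A.copy()
        t = A.shape[0] // nb
        T = lambda i: slice(i * nb, (i + 1) * nb)
        for k in range(t):
            # POTRF task: factor the diagonal tile.
            A[T(k), T(k)] = np.linalg.cholesky(A[T(k), T(k)])
            for i in range(k + 1, t):
                # TRSM task: depends on POTRF(k).
                A[T(i), T(k)] = solve_triangular(
                    A[T(k), T(k)], A[T(i), T(k)].T, lower=True).T
            for i in range(k + 1, t):
                for j in range(k + 1, i + 1):
                    # SYRK/GEMM task: depends on TRSM(i, k) and TRSM(j, k).
                    A[T(i), T(j)] -= A[T(i), T(k)] @ A[T(j), T(k)].T
        return np.tril(A)

    rng = np.random.default_rng(0)
    n = 8 * 16
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)
    L = tiled_cholesky(A, 16)
    assert np.allclose(L @ L.T, A)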