- Parallel Computing and Optimization Techniques
- Distributed and Parallel Computing Systems
- Matrix Theory and Algorithms
- Numerical Methods and Algorithms
- Interconnection Networks and Systems
- Advanced Data Storage Technologies
- Embedded Systems Design Techniques
- Cloud Computing and Resource Management
- Electromagnetic Scattering and Analysis
- Stochastic Gradient Optimization Techniques
- Algorithms and Data Compression
- Scheduling and Optimization Algorithms
- Particle Accelerators and Beam Dynamics
- Scientific Computing and Data Management
- Digital Filter Design and Implementation
- Quantum Computing Algorithms and Architecture
- Sparse and Compressive Sensing Techniques
- Industrial Automation and Control Systems
- Advanced Neural Network Applications
- Electromagnetic Simulation and Numerical Methods
- Advanced Graph Neural Networks
- Computational Geometry and Mesh Generation
- Robotics and Sensor-Based Localization
- Advanced Image and Video Retrieval Techniques
- Scheduling and Timetabling Solutions
Advanced Micro Devices (United States)
2021-2024
Advanced Micro Devices (Canada)
2024
University of Tennessee at Knoxville
2010-2019
Oak Ridge National Laboratory
2017-2018
University of Manchester
2017-2018
University of Houston
2004-2008
The emergence and continuing use of multi-core architectures and graphics processing units require changes in existing software, and sometimes even a redesign of established algorithms, in order to take advantage of the now prevailing parallelism. Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) and Matrix Algebra on GPU and Multicore Architectures (MAGMA) are two projects that aim to achieve high performance and portability across a wide range of multi-core architectures and hybrid systems, respectively. We present in this document a comparative study of PLASMA's...
We present a method for developing dense linear algebra algorithms that seamlessly scales to thousands of cores. It can be done with our project called DPLASMA (Distributed PLASMA), which uses a novel generic distributed Direct Acyclic Graph Engine (DAGuE). The engine has been designed for high performance computing and thus it enables scaling of tile algorithms, originating in PLASMA, on large distributed memory systems. The underlying DAGuE framework has many appealing features when considering distributed-memory platforms...
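The dataflow idea behind such engines can be sketched generically: a task becomes runnable once all of its predecessors in the DAG have completed. Below is a toy sequential executor in Python; the names `run_dag`, `tasks`, and `deps` are illustrative and are not DAGuE's API, which tracks dependencies symbolically and runs independent tasks in parallel across distributed nodes.

```python
def run_dag(tasks, deps):
    """Execute a task DAG in dependency order.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> list of prerequisite task names
    Returns a dict of task results, computed in a valid topological order.
    """
    done, order = set(), []

    def visit(name):
        # Depth-first traversal: schedule prerequisites before the task.
        if name in done:
            return
        for d in deps.get(name, ()):
            visit(d)
        done.add(name)
        order.append(name)

    for name in tasks:
        visit(name)
    # Run the tasks in the computed order.
    return {name: tasks[name]() for name in order}
```

A parallel runtime replaces the sequential loop with a scheduler that launches every task whose prerequisites are satisfied, which is what exposes the fine-grain parallelism of tile algorithms.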
In recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected by error correction codes. Being a crucial component of software packages such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the most important workloads to be implemented on these devices. This...
The computation of the singular value decomposition, or SVD, has a long history, with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of the changes. There are two main branches of methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EISPACK library, and later more efficiently...
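The Jacobi branch can be illustrated with a textbook one-sided Jacobi SVD: rotate pairs of columns until all columns are mutually orthogonal, at which point the singular values are the column norms. This is a minimal sketch, not a production implementation; `np.linalg.svd` follows the other branch (bidiagonalization, via LAPACK).

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """One-sided Jacobi SVD sketch: returns (U, sigma, V) with A = U @ diag(sigma) @ V.T."""
    U = A.astype(np.float64, copy=True)
    n = U.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        off = 0.0  # largest remaining normalized off-diagonal element
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                off = max(off, abs(gamma) / np.sqrt(alpha * beta))
                if gamma == 0.0:
                    continue
                # Jacobi rotation that orthogonalizes columns p and q.
                zeta = (beta - alpha) / (2.0 * gamma)
                sign = 1.0 if zeta >= 0.0 else -1.0
                t = sign / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                G = np.array([[c, s], [-s, c]])
                U[:, [p, q]] = U[:, [p, q]] @ G
                V[:, [p, q]] = V[:, [p, q]] @ G
        if off < tol:
            break
    sigma = np.linalg.norm(U, axis=0)  # singular values = column norms
    return U / sigma, sigma, V
```

Jacobi methods are attractive for their high relative accuracy and inherent parallelism across independent column pairs, at the cost of more floating point work than bidiagonalization.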
By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the accuracy of the resulting solution. The approach presented here applies not only to conventional processors but also to exotic technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the Cell BE processor. Results on modern processor architectures are presented.
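The core idea can be sketched in a few lines of NumPy/SciPy: do the expensive O(n^3) factorization in single precision, then recover double precision accuracy with cheap O(n^2) iterative refinement steps. This is an illustrative sketch assuming a reasonably well-conditioned system; the function name and defaults are not from the paper.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Solve Ax = b: LU factorization in float32, refinement in float64."""
    A32 = A.astype(np.float32)
    lu, piv = lu_factor(A32)                 # O(n^3) work, done in single precision
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                        # residual computed in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        d = lu_solve((lu, piv), r.astype(np.float32))  # cheap O(n^2) correction solve
        x += d.astype(np.float64)
    return x
```

Since single precision arithmetic runs faster on most hardware and halves memory traffic, the refinement loop typically converges in a handful of iterations while the bulk of the flops stay in the fast format.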
As multicore systems continue to gain ground in the high-performance computing world, linear algebra algorithms have to be reformulated, or new algorithms developed, in order to take advantage of the architectural features of these processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks...
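A simplified blocked QR in NumPy illustrates the decomposition into panel factorizations and trailing-matrix updates; the actual tile algorithm breaks this down further into small kernels on square tiles (GEQRT/TSQRT-style) so the tasks can run loosely synchronized. The function below is only a sketch of the blocking idea.

```python
import numpy as np

def blocked_qr(A, nb=2):
    """Right-looking blocked QR: factor a panel of nb columns, then
    apply its orthogonal transformation to the trailing columns."""
    A = A.astype(np.float64, copy=True)
    m, n = A.shape
    Q = np.eye(m)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # Panel factorization (one "task" per panel here).
        Qk, Rk = np.linalg.qr(A[k:, k:k + kb], mode="complete")
        A[k:, k:k + kb] = np.triu(Rk)
        # Trailing-matrix update: apply Qk^T to the remaining columns.
        A[k:, k + kb:] = Qk.T @ A[k:, k + kb:]
        # Accumulate the orthogonal factor.
        Q[:, k:] = Q[:, k:] @ Qk
    return Q, np.triu(A)
```

In the tile formulation, each panel and update is itself split into tile-sized tasks whose dependencies form a DAG, which is what removes the fork-join synchronization points of the blocked version above.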
The Sony/Toshiba/IBM (STI) CELL processor introduces pioneering solutions in processor architecture. At the same time it presents new challenges for the development of numerical algorithms. One is the effective exploitation of the differential between the speed of single and double precision arithmetic; the other is the efficient parallelization among the short vector SIMD cores. The first challenge is addressed by utilizing the well known technique of iterative refinement for the solution of a dense symmetric positive definite system of linear equations, resulting...
The SLATE (Software for Linear Algebra Targeting Exascale) library is being developed to provide fundamental dense linear algebra capabilities for current and upcoming distributed high-performance systems, both accelerated CPU-GPU based and CPU based. SLATE will provide coverage of existing ScaLAPACK functionality, including the parallel BLAS; linear systems using LU and Cholesky; least squares problems using QR; and eigenvalue and singular value problems. In this respect, it will serve as a replacement for ScaLAPACK, which after two decades...
By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many sparse linear algebra algorithms can be significantly enhanced while maintaining the accuracy of the resulting solution. These ideas can be applied to multifrontal and supernodal direct techniques, and to iterative techniques such as Krylov subspace methods. The approach presented here applies not only to conventional processors but also to exotic technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the Cell BE processor.
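For the iterative case, the same pattern applies: run an inner Krylov solve in float32 and wrap it in an outer refinement loop that computes residuals in float64. The sketch below uses SciPy's conjugate gradient at its default inner tolerance; the function name and loop bounds are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def mixed_precision_cg(A, b, outer_tol=1e-10, max_outer=20):
    """Outer iterative refinement in float64 around inner float32 CG solves.
    A must be symmetric positive definite (sparse or dense)."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b)
    for _ in range(max_outer):
        r = b - A @ x                         # double-precision residual
        if np.linalg.norm(r) <= outer_tol * np.linalg.norm(b):
            break
        # Inner solve for the correction, entirely in single precision.
        d, _ = cg(A32, r.astype(np.float32))
        x += d.astype(np.float64)
    return x
```

As long as each inner solve reduces the residual by some fixed factor, the outer loop converges to double precision accuracy while most of the memory traffic, often the bottleneck for sparse kernels, moves at single precision width.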
Recent versions of microprocessors exhibit performance characteristics for 32 bit floating point arithmetic (single precision) that are substantially higher than for 64 bit floating point arithmetic (double precision). Examples include Intel's Pentium IV and M processors, AMD's Opteron architectures, and IBM's Cell Broadband Engine processor. When working in single precision, floating point operations can be performed up to two times faster on the Pentium and up to ten times faster on the Cell than in double precision. The performance enhancements in these architectures are derived by accessing extensions to the basic...
This paper describes the design concepts behind implementations of mixed-precision linear algebra routines targeted for the Cell processor. It describes in detail the implementation of code to solve a system of linear equations using Gaussian elimination in single precision with iterative refinement of the solution to full double-precision accuracy. By utilizing this approach the algorithm achieves close to an order of magnitude higher performance on the Cell processor than that offered by the standard double precision algorithm. The result is effectively a high-performance...
Many problems in engineering and scientific computing require the solution of a large number of small systems of linear equations. Due to their high processing power, Graphics Processing Units became an attractive target for this class of problems, and routines based on the LU and QR factorizations have been provided by the NVIDIA cuBLAS library. This work addresses the situation where the systems of equations are symmetric positive definite. The paper describes the implementation and tuning of kernels for the Cholesky factorization and the forward and backward substitution....
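The batched computation can be sketched in NumPy, where the Cholesky factorization is applied across the whole batch at once and the forward/backward substitutions are vectorized over the batch dimension. This is only an illustration of the algorithmic structure; the production kernels live in GPU libraries such as cuBLAS/cuSOLVER, and the function name below is an assumption.

```python
import numpy as np

def batched_cholesky_solve(As, bs):
    """Solve a batch of small SPD systems A_i x_i = b_i.
    As: (batch, n, n) SPD matrices; bs: (batch, n) right-hand sides."""
    Ls = np.linalg.cholesky(As)          # batched factorization: A_i = L_i L_i^T
    n = Ls.shape[-1]
    ys = np.empty_like(bs)
    # Forward substitution L y = b, one row at a time, vectorized over the batch.
    for i in range(n):
        ys[:, i] = (bs[:, i]
                    - np.einsum('bj,bj->b', Ls[:, i, :i], ys[:, :i])) / Ls[:, i, i]
    xs = np.empty_like(bs)
    # Backward substitution L^T x = y.
    for i in range(n - 1, -1, -1):
        xs[:, i] = (ys[:, i]
                    - np.einsum('bj,bj->b', Ls[:, i + 1:, i], xs[:, i + 1:])) / Ls[:, i, i]
    return xs
```

The key performance point is that the loop is over the (small) matrix dimension, not over the (large) batch, so each step does enough uniform work to keep a wide processor busy.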
The recent version of the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) library is based on tasks with dependencies from the OpenMP standard. The main functionality of the library is presented. Extensive benchmarks target three multicore and manycore architectures, namely, Intel Xeon, Intel Xeon Phi, and IBM POWER 8 processors.
The growth of simulations of particle systems has been aided by advances in computer speed and algorithms. The adoption of O(N) algorithms to solve N-body simulation problems has been less rapid due to the fact that such scaling was only competitive for relatively large N. Our work seeks to find algorithmic modifications and practical implementations for intermediate values of N typical of use in molecular simulations. This article reviews fast multipole techniques for the calculation of electrostatic interactions in particle systems. The basic mathematics...
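For reference, the computation the fast multipole method accelerates is the direct pairwise sum, which costs O(N^2). A minimal sketch in NumPy (Gaussian units, charges `q` at positions `pos`; the function name is illustrative):

```python
import numpy as np

def direct_coulomb_energy(pos, q):
    """Direct O(N^2) pairwise electrostatic energy:
    E = sum over i < j of q_i * q_j / r_ij."""
    diff = pos[:, None, :] - pos[None, :, :]     # all pairwise displacement vectors
    r = np.linalg.norm(diff, axis=-1)            # pairwise distances
    iu = np.triu_indices(len(q), k=1)            # each unordered pair once
    return np.sum(q[iu[0]] * q[iu[1]] / r[iu])
```

The FMM replaces the far-field part of this sum with hierarchical multipole expansions, reducing the cost to O(N) at a controllable accuracy; the crossover N at which this beats the direct sum is exactly the practical question the article addresses.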
State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to its inability to fully exploit thread-level parallelism. At the same time, the coarse-grain dataflow model gains popularity as a paradigm for programming multicore architectures. This work looks at implementing classic dense linear algebra workloads, the Cholesky factorization, QR factorization and LU factorization, using dynamic data-driven execution. Two emerging approaches are examined, the model of nested...
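The tile Cholesky factorization illustrates how such a workload decomposes into the four kernels (POTRF, TRSM, SYRK, GEMM) whose data dependencies form the DAG that a dataflow runtime schedules. The sketch below executes the kernels sequentially for clarity and assumes the matrix order is a multiple of the tile size `nb`.

```python
import numpy as np

def tile_cholesky(A, nb):
    """Tile Cholesky of an SPD matrix; returns lower triangular L with A = L L^T."""
    A = A.astype(np.float64, copy=True)
    n = A.shape[0]  # assumed to be a multiple of nb
    for k in range(0, n, nb):
        kk = slice(k, k + nb)
        # POTRF: Cholesky of the diagonal tile.
        A[kk, kk] = np.linalg.cholesky(A[kk, kk])
        for i in range(k + nb, n, nb):
            ii = slice(i, i + nb)
            # TRSM: A_ik <- A_ik * L_kk^{-T}
            A[ii, kk] = np.linalg.solve(A[kk, kk], A[ii, kk].T).T
        for i in range(k + nb, n, nb):
            ii = slice(i, i + nb)
            # SYRK: update the diagonal tile of the trailing matrix.
            A[ii, ii] -= A[ii, kk] @ A[ii, kk].T
            for j in range(i + nb, n, nb):
                jj = slice(j, j + nb)
                # GEMM: update an off-diagonal tile below the diagonal.
                A[jj, ii] -= A[jj, kk] @ A[ii, kk].T
    return np.tril(A)
```

Each kernel invocation touches only a few tiles, so a data-driven runtime can run, for example, all TRSMs of step k and early kernels of step k+1 concurrently as soon as their inputs are ready, which is the thread-level parallelism LAPACK's fork-join structure leaves on the table.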