Pedro Valero‐Lara

ORCID: 0000-0002-1479-4310
Research Areas
  • Parallel Computing and Optimization Techniques
  • Distributed and Parallel Computing Systems
  • Advanced Data Storage Technologies
  • Lattice Boltzmann Simulation Studies
  • Interconnection Networks and Systems
  • Cloud Computing and Resource Management
  • Aerosol Filtration and Electrostatic Precipitation
  • Generative Adversarial Networks and Image Synthesis
  • Embedded Systems Design Techniques
  • Advanced Image and Video Retrieval Techniques
  • Advanced Neural Network Applications
  • Data Management and Algorithms
  • Advanced Memory and Neural Computing
  • Graph Theory and Algorithms
  • Scientific Computing and Data Management
  • Matrix Theory and Algorithms
  • Advanced Numerical Methods in Computational Mathematics
  • Algorithms and Data Compression
  • Software Engineering Research
  • Advanced Database Systems and Queries
  • Fluid Dynamics and Vibration Analysis
  • Medical Image Segmentation Techniques
  • Modular Robots and Swarm Intelligence
  • Topic Modeling
  • Geophysical Methods and Applications

Oak Ridge National Laboratory
2021-2025

Universitat Politècnica de Catalunya
2016-2022

Barcelona Supercomputing Center
2016-2022

Brigham Young University
2022

Software Competence Center Hagenberg (Austria)
2019

University of St Andrews
2019

Pacific Northwest National Laboratory
2019

Noesis Solutions (Belgium)
2019

Institut national de recherche en informatique et en automatique
2019

Politecnico di Milano
2019

A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform these small operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current proposals through a number of experiments, focusing on...

10.1016/j.procs.2017.05.138 article EN Procedia Computer Science 2017-01-01

Convolutional neural networks (CNNs) have recently attracted considerable attention due to their outstanding accuracy in applications such as image recognition and natural language processing. While one advantage of CNNs over other types of neural networks is their reduced computational cost, faster execution is still desired for both training and inference. Since convolution operations consume most of the execution time, multiple algorithms have been and are being developed with the aim of accelerating this type of operation. However, the wide range...

10.1109/access.2019.2918851 article EN cc-by IEEE Access 2019-01-01

We evaluate AI-assisted generative capabilities on fundamental numerical kernels in high-performance computing (HPC), including AXPY, GEMV, GEMM, SpMV, Jacobi Stencil, and CG. We test the generated kernel codes for a variety of language-supported programming models, including (1) C++ (e.g., OpenMP [including offload], OpenACC, Kokkos, SyCL, CUDA, and HIP), (2) Fortran (e.g., OpenMP [including offload] and OpenACC), (3) Python (e.g., Numba, cuPy, and pyCUDA), and (4) Julia (e.g., Threads, CUDA.jl, AMDGPU.jl, and KernelAbstractions.jl). We use GitHub Copilot...

10.1145/3605731.3605886 preprint EN 2023-08-07

We explore the performance and portability of high-level programming models: LLVM-based Julia and Python/Numba, and Kokkos, on high-performance computing (HPC) nodes: AMD Epyc CPUs and MI250X graphical processing units (GPUs) on Frontier's test bed Crusher system, and Ampere's Arm-based CPUs and NVIDIA's A100 GPUs on the Wombat system at the Oak Ridge Leadership Computing Facility. We compare the default performance of a hand-rolled dense matrix multiplication algorithm against vendor-compiled C/OpenMP implementations, and on each GPU against CUDA and HIP. Rather than focusing...

10.1109/ipdpsw59300.2023.00068 article EN 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 2023-05-01

Summary: The use of mesh refinement in CFD is an efficient and widely used methodology to minimize the computational cost by solving those regions of high geometrical complexity with a finer grid. In this work, the author focuses on studying two methods to deal with mesh refinement over LBM simulations, one based on Multi-Domain and one on Irregular meshing. The numerical formulation is presented in detail. Two parallel approaches, homogeneous GPU and heterogeneous CPU+GPU, are proposed for each of the methods. Obviously, both architectures, CPU and GPU, compute the same problem...

10.1002/cpe.3919 article EN Concurrency and Computation Practice and Experience 2016-08-27

Summary: The solving of tridiagonal systems is one of the most computationally expensive parts of many applications, so multiple studies have explored the use of NVIDIA GPUs to accelerate such computation. However, these studies have mainly focused on parallel algorithms for computing small systems, which can efficiently exploit shared memory and are able to saturate the GPU's capacity with a low number of systems, but present poor scalability when dealing with relatively large systems. The gtsvStridedBatch routine of the cuSPARSE package is one such example, used here as...

10.1002/cpe.4909 article EN Concurrency and Computation Practice and Experience 2018-08-27

We propose a numerical approach based on the Lattice-Boltzmann (LBM) and Immersed Boundary (IB) methods to tackle the problem of the interaction of solids with an incompressible fluid flow. The proposed method uses a uniform Cartesian grid that incorporates both the fluid and the solid domains, a novel and highly effective way to address this growing research topic in Computational Fluid Dynamics. We explain in detail the parallelization of the whole method on GPUs and on a heterogeneous GPU-Multicore platform, and describe the different optimizations, focusing on memory...

10.1016/j.procs.2014.05.005 article EN Procedia Computer Science 2014-01-01

Modern multi-core and many-core systems offer a very impressive cost/performance ratio. In this paper, a set of new parallel implementations for the solution of linear systems with a block-tridiagonal coefficient matrix on current architectures is proposed and evaluated: one of them for multi-core processors, others for many-core accelerators, and finally a heterogeneous implementation using both architectures. The results show a speedup higher than 6 for certain parts of the problem, with the heterogeneous implementation being the fastest.

10.1109/ispa.2012.91 preprint EN 2012-07-01

The simulation of the behavior of the Human Brain is one of the most important challenges in computing today. The main problem consists of finding efficient ways to manipulate and compute the huge volume of data that this kind of simulation needs, using current technology. In this sense, this work is focused on one of the main steps of such a simulation, which consists of computing the Voltage on the neurons' morphology. This is carried out using the Hines Algorithm. Although this algorithm is the optimum method in terms of number of operations, it needs non-trivial modifications to be efficiently parallelized...

10.1016/j.procs.2017.05.145 article EN Procedia Computer Science 2017-01-01

Many scientific applications need to solve a high number of small-size independent problems. These individual problems do not provide enough parallelism on their own, so they must be computed as a batch. Today, vendors such as Intel and NVIDIA are developing their own suites of batch routines. Although most of the existing works focus on computing batches of fixed size, in real applications we cannot assume a uniform size for the whole set. We explore and analyze different strategies based on the parallel for, task, and taskloop OpenMP pragmas. Straightforward from...

10.1109/pdp2018.2018.00065 article EN 2018-03-01

We propose a numerical approach based on the Lattice-Boltzmann method (LBM) for dealing with mesh refinement on a Non-uniform Staggered Cartesian Grid. We explain, in detail, the strategy for mapping the LBM over such geometries. The main benefit of this approach, compared to others, consists of solving all fluid units only once per time-step, and also of considerably reducing the complexity of the communication and memory management between the different refined levels. Also, it exhibits a better matching to parallel processors. To...

10.1016/j.procs.2015.05.245 article EN Procedia Computer Science 2015-01-01

Summary: The scientific community, on its never-ending road toward larger and more efficient computational resources, is in need of implementations that can adapt efficiently to the current parallel platforms. Graphics processing units are an appropriate platform to cover some of these demands. This architecture presents high performance with reduced cost and power consumption. However, the memory capacity of these devices is limited, so expensive transfers are necessary to deal with big problems. Today, the lattice-Boltzmann method (LBM) has...

10.1002/cpe.4221 article EN Concurrency and Computation Practice and Experience 2017-06-20

Many problems of industrial and scientific interest require the solving of tridiagonal linear systems. This paper presents several implementations for the parallel solving of large tridiagonal systems on multi-core architectures, using the OmpSs programming model. The strategy used for the parallelization is based on the combination of two different existing algorithms, PCR and Thomas. The Thomas algorithm, which cannot be parallelized, requires the fewest number of floating point operations. The PCR algorithm is the most popular parallel method, but it is more computationally...

10.1109/access.2019.2900122 article EN cc-by-nc-nd IEEE Access 2019-01-01

Template metaprogramming is gaining popularity as a high-level solution for achieving performance portability on heterogeneous computing resources. Kokkos is a representative approach that offers programmers abstractions for generic programming while most of the device-specific code generation and optimizations are delegated to the compiler through template specializations. For this, Kokkos provides a set of specializations in multiple back ends, such as CUDA and HIP. Unlike CUDA or HIP, OpenACC is a directive-based model. This...

10.1109/waccpd56842.2022.00009 article EN 2022-11-01

Medical image processing is becoming a significant discipline within the bioinformatics community. In particular, deformable registration methods are one of the most sophisticated and important lines of research in biomedical processing, due to the valuable information they provide. However, these methods consume considerable computing time and power, and require large amounts of memory. Current Graphics Processing Units (GPUs) have a high number of cores and high memory bandwidth, providing an excellent platform for reducing the cost in terms of time...

10.1109/cluster.2014.6968783 article EN 2014-09-01

Many-Task Computing (MTC) is a common scenario for multiple parallel systems, such as clusters, grids, clouds, and supercomputers, but it is not so popular on shared-memory processors. In this sense, given the spectacular growth in performance and in the number of cores integrated in many-core architectures, the study of MTC on such architectures is becoming more relevant. In this paper, the authors present the programming mechanisms available to take advantage of such massively parallel features for the particular target of MTC. Also, the hardware of the two dominant platforms...

10.12694/scpe.v17i1.1148 article EN Scalable Computing Practice and Experience 2016-03-25