Pedro Valero‐Lara

ORCID: 0000-0002-1479-4310
Research Areas
  • Parallel Computing and Optimization Techniques
  • Distributed and Parallel Computing Systems
  • Advanced Data Storage Technologies
  • Lattice Boltzmann Simulation Studies
  • Interconnection Networks and Systems
  • Cloud Computing and Resource Management
  • Aerosol Filtration and Electrostatic Precipitation
  • Generative Adversarial Networks and Image Synthesis
  • Embedded Systems Design Techniques
  • Advanced Image and Video Retrieval Techniques
  • Advanced Neural Network Applications
  • Data Management and Algorithms
  • Advanced Memory and Neural Computing
  • Graph Theory and Algorithms
  • Scientific Computing and Data Management
  • Matrix Theory and Algorithms
  • Advanced Numerical Methods in Computational Mathematics
  • Algorithms and Data Compression
  • Software Engineering Research
  • Advanced Database Systems and Queries
  • Fluid Dynamics and Vibration Analysis
  • Medical Image Segmentation Techniques
  • Modular Robots and Swarm Intelligence
  • Topic Modeling
  • Geophysical Methods and Applications

Oak Ridge National Laboratory
2021-2025

Universitat Politècnica de Catalunya
2016-2022

Barcelona Supercomputing Center
2016-2022

Brigham Young University
2022

Software Competence Center Hagenberg (Austria)
2019

University of St Andrews
2019

Pacific Northwest National Laboratory
2019

Noesis Solutions (Belgium)
2019

Institut national de recherche en informatique et en automatique
2019

Politecnico di Milano
2019

A current trend in high-performance computing is to decompose a large linear algebra problem into batches containing thousands of smaller problems that can be solved independently, before collating the results. To standardize the interface to these routines, the community is developing an extension to the BLAS standard (the batched BLAS), enabling users to perform these small operations in parallel whilst making efficient use of their hardware. We discuss the benefits and drawbacks of the current proposals through a number of experiments, focusing on...

10.1016/j.procs.2017.05.138 article EN Procedia Computer Science 2017-01-01

Convolutional neural networks (CNNs) have recently attracted considerable attention due to their outstanding accuracy in applications such as image recognition and natural language processing. While one advantage of CNNs over other types of neural networks is their reduced computational cost, faster execution is still desired for both training and inference. Since convolution operations consume most of the execution time, multiple algorithms have been and are being developed with the aim of accelerating this type of operation. However, the wide range...

10.1109/access.2019.2918851 article EN cc-by IEEE Access 2019-01-01

We evaluate AI-assisted generative capabilities on fundamental numerical kernels in high-performance computing (HPC), including AXPY, GEMV, GEMM, SpMV, Jacobi Stencil, and CG. We test the generated kernel codes for a variety of language-supported programming models, including (1) C++ (e.g., OpenMP [including offload], OpenACC, Kokkos, SyCL, CUDA, and HIP), (2) Fortran (e.g., OpenMP [including offload] and OpenACC), (3) Python (e.g., Numba, cuPy, and pyCUDA), and (4) Julia (e.g., Threads, CUDA.jl, AMDGPU.jl, and KernelAbstractions.jl). We use GitHub Copilot...

10.1145/3605731.3605886 preprint EN 2023-08-07

We explore the performance and portability of high-level programming models: LLVM-based Julia and Python/Numba, and Kokkos, on high-performance computing (HPC) nodes: AMD Epyc CPUs and MI250X graphical processing units (GPUs) on Frontier's test bed Crusher system, and Ampere's Arm-based CPUs and NVIDIA's A100 GPUs on the Wombat system at the Oak Ridge Leadership Computing Facility. We compare the default performance of a hand-rolled dense matrix multiplication algorithm against vendor-compiled C/OpenMP implementations, and on each GPU against CUDA and HIP. Rather than focusing...

10.1109/ipdpsw59300.2023.00068 article EN 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 2023-05-01

Summary: The use of mesh refinement in CFD is an efficient and widely used methodology to minimize the computational cost by solving those regions of high geometrical complexity with a finer grid. In this work, the author focuses on studying two methods to deal with mesh refinement over LBM simulations, one based on Multi-Domain and one on Irregular meshing. The numerical formulation is presented in detail. Two parallel approaches, homogeneous GPU and heterogeneous CPU+GPU, are proposed for each of the methods. Obviously, both architectures, CPU and GPU, compute the same problem...

10.1002/cpe.3919 article EN Concurrency and Computation Practice and Experience 2016-08-27

Summary: The solving of tridiagonal systems is one of the most computationally expensive parts of many applications, so multiple studies have explored the use of NVIDIA GPUs to accelerate such computation. However, these studies have mainly focused on parallel algorithms for computing small systems, which can efficiently exploit shared memory and are able to saturate the GPU's capacity with a low number of systems, but present poor scalability when dealing with relatively large systems. The gtsvStridedBatch routine of the cuSPARSE package is one such example, used here as...

10.1002/cpe.4909 article EN Concurrency and Computation Practice and Experience 2018-08-27

We propose a numerical approach based on the Lattice-Boltzmann (LBM) and Immersed Boundary (IB) methods to tackle the problem of the interaction of solids with an incompressible fluid flow. The proposed method uses a uniform Cartesian grid that incorporates both the fluid and the solid domains, a novel and highly effective way to address this growing research topic in Computational Fluid Dynamics. We explain in detail the parallelization of the whole method on GPUs and on a heterogeneous GPU-Multicore platform, and describe the different optimizations, focusing on memory...

10.1016/j.procs.2014.05.005 article EN Procedia Computer Science 2014-01-01

Modern multi-core and many-core systems offer a very impressive cost/performance ratio. In this paper, a set of new parallel implementations for the solution of linear systems with a block-tridiagonal coefficient matrix on current architectures is proposed and evaluated: one of them for multi-core processors, others for many-core accelerators, and finally a heterogeneous implementation using both architectures. The results show a speedup higher than 6 for certain parts of the problem, with the heterogeneous implementation being the fastest.

10.1109/ispa.2012.91 preprint EN 2012-07-01

The simulation of the behavior of the Human Brain is one of the most important challenges in computing today. The main problem consists of finding efficient ways to manipulate and compute the huge volume of data that this kind of simulation needs, using current technology. In this sense, this work is focused on one of the main steps of such a simulation, which consists of computing the Voltage on the neurons' morphology. This is carried out using the Hines Algorithm. Although this algorithm is the optimum method in terms of number of operations, it needs non-trivial modifications to be efficiently parallelized...

10.1016/j.procs.2017.05.145 article EN Procedia Computer Science 2017-01-01

Many scientific applications need to solve a high number of small-size independent problems. These individual problems do not provide enough parallelism on their own, so they must be computed as a batch. Today, vendors such as Intel and NVIDIA are developing their own suites of batch routines. Although most of the existing works focus on computing batches of fixed size, in real applications we cannot assume a uniform size for the whole set. We explore and analyze different strategies based on the parallel for, task, and taskloop OpenMP pragmas. Straightforward from...

10.1109/pdp2018.2018.00065 article EN 2018-03-01

We propose a numerical approach based on the Lattice-Boltzmann method (LBM) for dealing with mesh refinement on a Non-uniform Staggered Cartesian Grid. We explain, in detail, the strategy for mapping the LBM over such geometries. The main benefit of this approach, compared to others, consists of solving all fluid units only once per time-step, and also of considerably reducing the complexity of the communication and memory management between the different refined levels. Also, it exhibits a better matching to parallel processors. To...

10.1016/j.procs.2015.05.245 article EN Procedia Computer Science 2015-01-01

Summary: The scientific community, on its never-ending road toward larger and more efficient computational resources, is in need of implementations that can adapt efficiently to the current parallel platforms. Graphics processing units are an appropriate platform to cover some of these demands. This architecture presents high performance with reduced cost and power consumption. However, the memory capacity of these devices is limited, so expensive transfers are necessary to deal with big problems. Today, the lattice-Boltzmann method (LBM) has...

10.1002/cpe.4221 article EN Concurrency and Computation Practice and Experience 2017-06-20

Many problems of industrial and scientific interest require the solving of tridiagonal linear systems. This paper presents several implementations for the parallel solving of large tridiagonal systems on multi-core architectures, using the OmpSs programming model. The strategy used for the parallelization is based on the combination of two different existing algorithms, PCR and Thomas. The Thomas algorithm, which cannot be parallelized, requires the fewest number of floating point operations. The PCR algorithm is the most popular parallel method, but it is more computationally...

10.1109/access.2019.2900122 article EN cc-by-nc-nd IEEE Access 2019-01-01

Template metaprogramming is gaining popularity as a high-level solution for achieving performance portability on heterogeneous computing resources. Kokkos is a representative approach that offers programmers abstractions for generic programming while most of the device-specific code generation and optimizations are delegated to the compiler through template specializations. For this, Kokkos provides a set of specializations in multiple back ends, such as CUDA and HIP. Unlike CUDA or HIP, OpenACC is a directive-based model. This...

10.1109/waccpd56842.2022.00009 article EN 2022-11-01

Medical image processing is becoming a significant discipline within the bioinformatics community. In particular, deformable registration methods are one of the most sophisticated and important lines of research in biomedical processing, due to the valuable information they provide. However, these methods consume considerable computing time and power, and require large amounts of memory. Current Graphics Processing Units (GPUs) have a high number of cores and high memory bandwidth, providing an excellent platform for reducing the cost in terms of time...

10.1109/cluster.2014.6968783 article EN 2014-09-01

Many-Task Computing (MTC) is a common scenario for multiple parallel systems, such as clusters, grids, clouds, and supercomputers, but it is not so popular on shared-memory processors. In this sense, given the spectacular growth in performance and in the number of cores integrated in many-core architectures, the study of MTC on such architectures is becoming more relevant. In this paper, the authors present the programming mechanisms available to take advantage of such massively parallel features for the particular target of MTC. Also, the hardware of the two dominant platforms...

10.12694/scpe.v17i1.1148 article EN Scalable Computing Practice and Experience 2016-03-25