Lluc Alvarez

ORCID: 0000-0003-0506-8867
Research Areas
  • Parallel Computing and Optimization Techniques
  • Advanced Data Storage Technologies
  • Embedded Systems Design Techniques
  • Interconnection Networks and Systems
  • Cloud Computing and Resource Management
  • Genomics and Phylogenetic Studies
  • Algorithms and Data Compression
  • Chromosomal and Genetic Variations
  • Caching and Content Delivery
  • Advanced Neural Network Applications
  • Distributed and Parallel Computing Systems
  • Ferroelectric and Negative Capacitance Devices
  • Graph Theory and Algorithms
  • IoT and Edge/Fog Computing
  • Plant Virus Research Studies
  • Advanced Memory and Neural Computing
  • Photonic and Optical Devices
  • Radiation Effects in Electronics
  • Evolutionary Algorithms and Applications
  • Artificial Intelligence in Healthcare and Education
  • Semiconductor materials and devices
  • Computational Physics and Python Applications
  • Brain Tumor Detection and Classification
  • Network Packet Processing and Optimization
  • Neural Networks and Reservoir Computing

Barcelona Supercomputing Center
2013-2024

Universitat Politècnica de Catalunya
2013-2024

Universitat Autònoma de Barcelona
2021

Université Grenoble Alpes
2018

Universidad del Noreste
2018

Lewis & Clark College
2018

University of Hawaiʻi at Mānoa
2018

Computing Center
2018

Universidad de Cantabria
2016

The increasing number of cores in manycore architectures causes important power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. A good way to hide the programmability difficulties from the programmer is to give the compiler the responsibility of generating code to manage the scratchpad memories. Unfortunately, compilers...

10.1145/2749469.2750411 article EN 2015-05-26

To alleviate the performance and energy overheads of contemporary applications with large data footprints, we propose the Two Level Perceptron (TLP) predictor, a neural mechanism that effectively combines predicting whether an access will be off-chip with adaptive prefetch filtering at the first-level data cache (L1D). TLP is composed of two connected microarchitectural perceptron predictors, named the First Level Predictor (FLP) and the Second Level Predictor (SLP). FLP performs accurate off-chip prediction by using several program features based on...

10.1109/hpca57654.2024.00046 article EN 2024-03-02
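The core idea of perceptron-based off-chip prediction can be illustrated with a small sketch. This is not the paper's TLP microarchitecture; the class name, feature choice, table size, threshold, and update rule below are illustrative assumptions.

```python
class OffchipPerceptron:
    """Sketch of a perceptron predictor for off-chip accesses: each
    program feature indexes a small weight table, and the prediction
    is taken from the sum of the selected weights."""

    def __init__(self, table_size=256, threshold=4):
        self.table_size = table_size
        self.threshold = threshold
        self.tables = {}          # feature name -> weight table

    def _slot(self, name, value):
        table = self.tables.setdefault(name, [0] * self.table_size)
        # Knuth multiplicative hash folds the feature value into the table.
        return table, (value * 2654435761) % self.table_size

    def predict(self, features):
        """Predict off-chip when the summed weights reach the threshold."""
        total = sum(t[i] for t, i in
                    (self._slot(n, v) for n, v in features.items()))
        return total >= self.threshold

    def train(self, features, went_offchip):
        # Nudge each selected weight toward the observed outcome,
        # saturating like a small signed hardware counter.
        delta = 1 if went_offchip else -1
        for n, v in features.items():
            table, i = self._slot(n, v)
            table[i] = max(-32, min(31, table[i] + delta))
```

In hardware the "tables" would be small SRAM arrays and the features would be hashed PC and address bits; the Python dict merely stands in for that indexing.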

The increasing number of cores and the anticipated level of heterogeneity in upcoming multicore architectures cause important problems in traditional cache hierarchies. A good way to alleviate these problems is to add scratchpad memories alongside the cache hierarchy, forming a hybrid memory hierarchy. This organization has the potential to improve performance and to reduce power consumption and on-chip network traffic, but exposing such a complex memory model to the programmer has a very negative impact on the programmability of the architecture. Emerging task-based...

10.1109/pact.2015.26 article EN 2015-10-01

Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality-aware task schedulers can benefit from these opportunities by scheduling tasks to the most appropriate cores. However, they may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that criticality information can be exploited to drive hardware reconfigurations, we propose the Criticality Aware Task Acceleration (CATA) mechanism...

10.1109/ipdps.2016.49 article EN 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016-05-01
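A common way to quantify task criticality, which a scheduler like the one described above can consume, is the bottom level of each task in the dependence DAG. The sketch below is a static simplification under assumed inputs (task names, costs, and the two-speed core split are hypothetical), not the CATA mechanism itself.

```python
def bottom_levels(successors, cost):
    """Classic criticality metric: a task's bottom level is the longest
    path, in execution time, from the task to any sink of the task DAG.
    `successors` maps a task to the tasks that depend on it."""
    memo = {}

    def level(t):
        if t not in memo:
            memo[t] = cost[t] + max((level(s) for s in successors.get(t, [])),
                                    default=0)
        return memo[t]

    for t in cost:
        level(t)
    return memo

def assign_cores(successors, cost, n_fast):
    """Send the n_fast most critical tasks to fast cores, the rest to
    slow cores; a real runtime does this online as tasks become ready."""
    levels = bottom_levels(successors, cost)
    ranked = sorted(levels, key=levels.get, reverse=True)
    return {t: ("fast" if rank < n_fast else "slow")
            for rank, t in enumerate(ranked)}
```

Static ranking like this is exactly what can cause the priority inversion and static binding problems the abstract mentions, which is why hardware support for dynamic reconfiguration helps.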

Given the overwhelming impact of machine learning over the last decade, several libraries and frameworks have been developed in recent years to simplify the design and training of neural networks, providing array-based programming, automatic differentiation, and user-friendly access to hardware accelerators. None of those tools, however, was designed with native and transparent support for Cloud Computing or heterogeneous High-Performance Computing (HPC). The DeepHealth Toolkit is an open source Deep Learning toolkit aimed at...

10.1109/icpr48806.2021.9411954 article EN 26th International Conference on Pattern Recognition (ICPR) 2021-01-10

Sequence alignment remains a fundamental problem with practical applications ranging from pattern recognition to computational biology. Traditional algorithms based on dynamic programming are hard to parallelize, require significant amounts of memory, and fail to scale for large inputs. This work presents eWFA-GPU, a GPU (graphics processing unit)-accelerated tool to compute the exact edit-distance sequence alignment based on the wavefront alignment algorithm (WFA). This approach exploits the similarities between the input sequences to accelerate...

10.1109/access.2022.3182714 article EN cc-by IEEE Access 2022-01-01
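The wavefront idea for exact edit distance can be shown in a few lines of sequential Python: instead of filling the full dynamic-programming matrix, each wavefront stores, per diagonal, the furthest offset reachable with a given score, and match runs are skipped for free. This is a minimal CPU sketch of the algorithm, not the GPU implementation.

```python
def wfa_edit_distance(a: str, b: str) -> int:
    """Exact edit distance (unit costs) via the wavefront algorithm.
    wf[k] holds the furthest offset h into b on diagonal k = h - v."""
    n, m = len(a), len(b)

    def extend(k: int, h: int) -> int:
        # Advance along diagonal k while characters match (free moves).
        v = h - k
        while v < n and h < m and a[v] == b[h]:
            v += 1
            h += 1
        return h

    NEG = -10**9
    k_end = m - n                     # diagonal containing the end cell
    wf = {0: extend(0, 0)}            # wavefront for score 0
    s = 0
    while wf.get(k_end, -1) < m:
        s += 1
        new = {}
        for k in range(-s, s + 1):
            h = max(wf.get(k - 1, NEG) + 1,   # insertion
                    wf.get(k, NEG) + 1,       # mismatch
                    wf.get(k + 1, NEG))       # deletion
            v = h - k
            if 0 <= h <= m and 0 <= v <= n:   # keep in-bounds cells only
                new[k] = extend(k, h)
        wf = new
    return s
```

The memory footprint is proportional to the score rather than to n×m, which is what makes the approach attractive for similar sequences and, as the paper exploits, for GPU parallelization across diagonals.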

The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, finer-grained tasking will be required to exploit the available parallelism, which will increase the overheads introduced by the runtime system. This work presents the Task Dependence Manager (TDM), a hardware/software...

10.1109/hpca.2018.00033 article EN 2018-02-01

In the last years, advances in next-generation sequencing technologies have enabled the proliferation of genomic applications that guide personalized medicine. These applications have an enormous computational cost due to the large amount of data they process. The first step of many of these applications consists in aligning reads against a reference genome. Very recently, the wavefront alignment algorithm has been introduced, significantly reducing the execution time of the read alignment process. This paper presents an FPGA-based hardware/software co-designed accelerator for such...

10.1109/fpl53798.2021.00033 article EN 2021-08-01

Frequent Translation Lookaside Buffer (TLB) misses incur high performance and energy costs due to the page walks required for fetching the corresponding address translations. Prefetching page table entries (PTEs) ahead of demand TLB accesses can mitigate the address translation bottleneck, but each prefetch requires traversing the page table, triggering additional accesses to the memory hierarchy. Therefore, TLB prefetching is a costly technique that may undermine performance when prefetches are not accurate. In this paper we exploit the locality in the last...

10.1109/isca52012.2021.00016 article EN 2021-06-01
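The locality being exploited can be made concrete with a tiny sketch: on x86-64, last-level PTEs are 8 bytes, so the 64-byte cache line fetched by a demand page walk already contains the translations of 7 neighbouring virtual pages, which can be cached without issuing any extra page walk. The function name is hypothetical; the sizes are standard x86-64 assumptions.

```python
PTE_SIZE = 8                             # bytes per page-table entry (x86-64)
LINE_SIZE = 64                           # bytes per cache line
PTES_PER_LINE = LINE_SIZE // PTE_SIZE    # 8 translations per line

def sibling_translations(vpn):
    """Virtual page numbers whose last-level PTEs share a cache line
    with the PTE of `vpn`: candidates for free prefetching, since the
    demand walk has already fetched the line."""
    base = vpn - (vpn % PTES_PER_LINE)
    return [base + i for i in range(PTES_PER_LINE) if base + i != vpn]
```

Because these candidate translations ride along on a line the walker fetched anyway, they avoid the extra page walks that make conventional TLB prefetching costly.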

Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the benefits they provide for scaling core count and memory capacity. Also, the flat address space they offer considerably improves programmability. However, ccNUMA architectures require sophisticated and expensive cache coherence protocols to enforce correctness during parallel executions, which trigger a significant amount of on- and off-chip traffic in the system. This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform comprising 288 cores...

10.1109/tpds.2017.2787123 article EN IEEE Transactions on Parallel and Distributed Systems 2017-12-25

Stacked DRAM memories have become a reality in High-Performance Computing (HPC) architectures. These memories provide much higher bandwidth while consuming less power than traditional off-chip memories, but their limited memory capacity is insufficient for modern HPC systems. For this reason, both stacked and off-chip memories are expected to co-exist in HPC architectures, giving rise to different approaches for architecting the memory system.

10.1145/3205289.3205312 article EN 2018-06-12

In the last decades, the continuous proliferation of High-Performance Computing (HPC) systems and data centers has augmented the demand for expert HPC system designers, administrators, and programmers. For this reason, most universities have introduced courses on parallel programming in their degrees. However, the laboratory assignments of these courses generally use clusters that are owned, managed and administrated by the university. This methodology has been shown effective to teach parallel programming, but using a remote cluster...

10.1109/eduhpc.2018.00004 article EN 2018-11-01

This paper focuses on the implementation of a neural network accelerator optimized for speed and energy efficiency, for use in embedded machine learning. Specifically, we explore power reduction at the hardware level through a systolic array and low-precision data systems, including quantized approaches. We present a comprehensive analysis comparing a full precision (FP16) version with a quantized (INT16) version of the accelerator on an FPGA. We upgraded the FP16 modules to handle INT16 values, employing shifts to enhance value density while maintaining...

10.3390/electronics13142822 article EN Electronics 2024-07-18

The effort to reduce address translation overheads has typically targeted data accesses, since they constitute the overwhelming portion of second-level TLB (STLB) misses in desktop and HPC applications. The cost of instruction address translation has been relatively neglected due to historically small instruction footprints. However, state-of-the-art datacenter and server applications feature massive instruction footprints owing to deep software stacks, resulting in high STLB miss rates for instruction accesses.

10.1145/3466752.3480049 article EN 2021-10-17

The increase in the working set sizes of contemporary applications outpaces the growth of cache sizes, resulting in frequent main memory accesses that deteriorate system performance due to the disparity between processor and memory speeds. Prefetching data blocks into the cache hierarchy ahead of demand accesses has proven successful at attenuating this bottleneck. However, spatial prefetchers operating in the physical address space leave significant performance on the table by limiting their pattern detection to within 4KB page boundaries when modern systems...

10.1109/micro56248.2022.00070 article EN 2022-10-01

With increasing core counts, the scalability of directory-based cache coherence has become a challenging problem. To reduce the area and power needs of the directory, recent proposals reduce its size by classifying data as private or shared, and disable coherence for private data. However, existing classification methods suffer from inaccuracies and require complex hardware support with limited scalability. This paper proposes a hardware/software co-designed approach: the runtime system identifies data that is guaranteed by the programming model...

10.1109/sc.2018.00038 article EN 2018-11-01

Peachy Parallel Assignments are a resource for instructors teaching parallel and distributed programming. These are high-quality assignments, previously tested in class, that are readily adoptable. This collection of assignments includes implementing a subset of OpenMP using pthreads, creating an animated fractal, image processing with histogram equalization, simulating a storm of high-energy particles, and solving the wave equation in a variety of settings. All of these come with sample assignment sheets and the necessary starter code.

10.1109/eduhpc.2018.00012 article EN 2018-11-01
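One of the assignments listed, histogram equalization, is compact enough to sketch here. This is a plain sequential Python version with a hypothetical function name, using the textbook CDF remapping formula; the parallel version students would write privatizes per-thread histograms and merges them.

```python
def equalize(image, levels=256):
    """Histogram equalization of a grayscale image, given as a list of
    rows of integer intensities in [0, levels)."""
    flat = [p for row in image for p in row]
    n = len(flat)
    # Histogram of intensities, then its cumulative distribution.
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    cdf, running = [0] * levels, 0
    for i, count in enumerate(hist):
        running += count
        cdf[i] = running
    cdf_min = next(c for c in cdf if c > 0)

    def remap(p):
        if n == cdf_min:          # flat image: nothing to stretch
            return p
        # Textbook formula: spread the CDF over the full intensity range.
        return round((cdf[p] - cdf_min) * (levels - 1) / (n - cdf_min))

    return [[remap(p) for p in row] for row in image]
```

The histogram build is the interesting part for a parallel programming course: concurrent increments to `hist` race unless each thread accumulates locally first.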

Current microprocessors include several knobs to modify the hardware behavior in order to improve performance, power, and energy under different workload demands. An impractical and time-consuming offline profiling is needed to evaluate the design space and find the optimal knob configuration. Different knobs are typically configured in a decoupled manner to avoid this time-consuming offline process. This can often lead to underperforming configurations and conflicting decisions that jeopardize the system's power-performance efficiency. Thus, dynamic...

10.1109/tc.2020.2980230 article EN IEEE Transactions on Computers 2020-03-13

Cache coherence protocols limit the scalability of multicore and manycore architectures and are responsible for an important amount of the power consumed in the chip. A good way to alleviate these problems is to introduce a local memory alongside the cache hierarchy, forming a hybrid memory system. Local memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. When non-predictable memory access patterns are found, compilers do not succeed in generating code because of the incoherence between the two...

10.1109/tc.2013.194 article EN IEEE Transactions on Computers 2013-10-01

The increasing number of cores in manycore architectures causes important power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. A good way to hide the programmability difficulties from the programmer is to give the compiler the responsibility of generating code to manage the scratchpad memories. Unfortunately, compilers...

10.1145/2872887.2750411 article EN ACM SIGARCH Computer Architecture News 2015-06-13

Cache coherence protocols limit the scalability of chip multiprocessors. One solution is to introduce a local memory alongside the cache hierarchy, forming a hybrid memory system. Local memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. When non-predictable memory access patterns are found, compilers do not succeed in generating code because of the incoherency between the two storages. This paper proposes a coherence protocol for hybrid memory systems that allows the compiler to generate code even in the presence of aliasing...

10.5555/2388996.2389117 article EN IEEE International Conference on High Performance Computing, Data, and Analytics 2012-11-10

The vast disparity between Last Level Cache (LLC) and memory latencies has motivated the need for efficient cache management policies. The computer architecture literature abounds with work on LLC replacement policies. Although these works greatly improve over the least-recently-used (LRU) policy, they tend to focus only on the SPEC CPU 2006 benchmark suite, and more recently on SPEC CPU 2017, for evaluation. However, these workloads are representative of only a subset of current High-Performance Computing (HPC) workloads. In this paper we...

10.1109/iiswc50251.2020.00022 article EN 2020-10-01