- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Embedded Systems Design Techniques
- Interconnection Networks and Systems
- Cloud Computing and Resource Management
- Genomics and Phylogenetic Studies
- Algorithms and Data Compression
- Chromosomal and Genetic Variations
- Caching and Content Delivery
- Advanced Neural Network Applications
- Distributed and Parallel Computing Systems
- Ferroelectric and Negative Capacitance Devices
- Graph Theory and Algorithms
- IoT and Edge/Fog Computing
- Plant Virus Research Studies
- Advanced Memory and Neural Computing
- Photonic and Optical Devices
- Radiation Effects in Electronics
- Evolutionary Algorithms and Applications
- Artificial Intelligence in Healthcare and Education
- Semiconductor Materials and Devices
- Computational Physics and Python Applications
- Brain Tumor Detection and Classification
- Network Packet Processing and Optimization
- Neural Networks and Reservoir Computing
Barcelona Supercomputing Center
2013-2024
Universitat Politècnica de Catalunya
2013-2024
Universitat Autònoma de Barcelona
2021
Université Grenoble Alpes
2018
Universidad del Noreste
2018
Lewis & Clark College
2018
University of Hawaiʻi at Mānoa
2018
Computing Center
2018
Universidad de Cantabria
2016
The increasing number of cores in manycore architectures causes important power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. A good way to hide these programmability difficulties from the programmer is to give the compiler the responsibility of generating code to manage the scratchpad memories. Unfortunately, compilers...
To alleviate the performance and energy overheads of contemporary applications with large data footprints, we propose the Two Level Perceptron (TLP) predictor, a neural mechanism that effectively combines predicting whether an access will be off-chip with adaptive prefetch filtering at the first-level data cache (L1D). TLP is composed of two connected microarchitectural perceptron predictors, named First Level Predictor (FLP) and Second Level Predictor (SLP). FLP performs accurate off-chip prediction by using several program features based on...
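The perceptron-style prediction described above can be illustrated with a small sketch. This is not TLP's actual design: the features (load PC and 4KB-page address), table sizes, and threshold below are illustrative assumptions, showing only the generic hashed-perceptron mechanism of summing per-feature weights and training them on the observed outcome.

```python
# Minimal sketch of a hashed-perceptron off-chip predictor (illustrative,
# not the TLP/FLP design from the paper).

class PerceptronPredictor:
    def __init__(self, table_size=256, threshold=4, max_w=31):
        # one saturating-weight table per program feature
        self.tables = [[0] * table_size for _ in range(2)]
        self.table_size = table_size
        self.threshold = threshold
        self.max_w = max_w

    def _indices(self, pc, addr):
        # hash each feature into its own table (assumed features:
        # load PC and the 4KB-page address)
        return [pc % self.table_size, (addr >> 12) % self.table_size]

    def predict(self, pc, addr):
        # predict "off-chip" when the summed weights reach the threshold
        s = sum(t[i] for t, i in zip(self.tables, self._indices(pc, addr)))
        return s >= self.threshold

    def train(self, pc, addr, went_off_chip):
        # saturating increment/decrement driven by the actual outcome
        delta = 1 if went_off_chip else -1
        for t, i in zip(self.tables, self._indices(pc, addr)):
            t[i] = max(-self.max_w, min(self.max_w, t[i] + delta))
```

After a few training updates, loads that consistently go off-chip accumulate positive weights and cross the threshold, while on-chip loads accumulate negative weights and are filtered out.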
The increasing number of cores and the anticipated level of heterogeneity in upcoming multicore architectures cause important problems for traditional cache hierarchies. A good way to alleviate these problems is to add scratchpad memories alongside the cache hierarchy, forming a hybrid memory hierarchy. This organization has the potential to improve performance and reduce power consumption and on-chip network traffic, but exposing such a complex memory model to the programmer has a very negative impact on the programmability of the architecture. Emerging task-based...
Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality-aware task schedulers can benefit from these opportunities by scheduling tasks to the most appropriate cores. However, they may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that criticality information can be exploited to drive hardware reconfigurations, we propose the Criticality Aware Task Acceleration (CATA) mechanism...
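The core idea of criticality-aware scheduling can be sketched in a few lines. The toy scheduler below, an assumption-laden simplification rather than CATA itself, steers critical tasks to fast cores and lets an idle fast core absorb non-critical work, which is one simple way to avoid the idling that priority inversion causes.

```python
# Toy criticality-aware list scheduler: critical tasks go to fast cores,
# non-critical tasks to slow cores; a fast core takes non-critical work
# when it would otherwise sit idle. Core counts and the speedup factor
# are illustrative assumptions.
import heapq

def schedule(tasks, fast_cores=2, slow_cores=2, speedup=2.0):
    """tasks: list of (name, work, is_critical). Returns the makespan."""
    # min-heaps of (time_when_free, core_id)
    fast = [(0.0, i) for i in range(fast_cores)]
    slow = [(0.0, i) for i in range(slow_cores)]
    heapq.heapify(fast)
    heapq.heapify(slow)
    # dispatch critical tasks first
    for name, work, critical in sorted(tasks, key=lambda t: not t[2]):
        if critical or slow[0][0] > fast[0][0]:
            t, c = heapq.heappop(fast)
            heapq.heappush(fast, (t + work / speedup, c))  # fast core
        else:
            t, c = heapq.heappop(slow)
            heapq.heappush(slow, (t + work, c))            # slow core
    return max(t for t, _ in fast + slow)
```

With two critical and two non-critical tasks of equal work, the critical pair finishes early on the fast cores while the slow cores handle the rest.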
Given the overwhelming impact of machine learning over the last decade, several libraries and frameworks have been developed in recent years to simplify the design and training of neural networks, providing array-based programming, automatic differentiation and user-friendly access to hardware accelerators. None of those tools, however, was designed with native and transparent support for Cloud Computing or heterogeneous High-Performance Computing (HPC). The DeepHealth Toolkit is an open source Deep Learning toolkit aimed at...
Sequence alignment remains a fundamental problem with practical applications ranging from pattern recognition to computational biology. Traditional algorithms based on dynamic programming are hard to parallelize, require significant amounts of memory, and fail to scale for large inputs. This work presents eWFA-GPU, a GPU (graphics processing unit)-accelerated tool to compute the exact edit-distance sequence alignment based on the wavefront alignment algorithm (WFA). This approach exploits the similarities between the input sequences to accelerate...
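The wavefront idea can be shown in a compact sequential sketch (this is a minimal CPU illustration of the WFA principle, not the eWFA-GPU implementation): wavefront s stores, per diagonal k = i - j, the furthest offset i reachable with edit distance s, and matching characters are consumed for free, so work scales with the distance rather than with the full DP matrix.

```python
# Minimal wavefront (WFA) edit-distance sketch. Illustrative only.

def wfa_edit_distance(a, b):
    n, m = len(a), len(b)

    def extend(k, i):
        # follow free matches along diagonal k (j = i - k)
        j = i - k
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return i

    k_final = n - m
    wf = {0: extend(0, 0)}            # wavefront for score s = 0
    s = 0
    while wf.get(k_final, -1) != n:   # stop when (n, m) is reached
        s += 1
        nxt = {}
        for k in range(min(wf) - 1, max(wf) + 2):
            cands = []
            if k in wf:
                cands.append(wf[k] + 1)      # mismatch (substitution)
            if k - 1 in wf:
                cands.append(wf[k - 1] + 1)  # delete a[i]
            if k + 1 in wf:
                cands.append(wf[k + 1])      # insert b[j]
            cands = [i for i in cands if i <= n and i - k <= m]
            if cands:
                nxt[k] = extend(k, max(cands))
        wf = nxt
    return s
```

Each iteration grows the set of reachable diagonals by one in each direction, so identical or near-identical sequences terminate after very few wavefronts.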
The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, finer-grained tasking will be required to exploit the available parallelism, which will magnify the overheads introduced by the runtime system. This work presents the Task Dependence Manager (TDM), a hardware/software...
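The runtime bookkeeping that such a mechanism targets can be sketched in software. The toy tracker below, a simplification assumed for illustration (it handles only read-after-write dependences), shows the kind of work a tasking runtime does per task: record the last writer of each datum and mark a task ready once its producers have finished.

```python
# Minimal software dependence tracker of the kind a tasking runtime
# performs per task. Tracks only RAW dependences for brevity; WAR/WAW
# handling is omitted in this sketch.
from collections import defaultdict

def topological_schedule(tasks):
    """tasks: list of (name, inputs, outputs). Returns a valid order."""
    last_writer = {}
    deps = defaultdict(set)
    for name, ins, outs in tasks:
        for d in ins:
            if d in last_writer:
                deps[name].add(last_writer[d])   # read-after-write edge
        for d in outs:
            last_writer[d] = name
    order, done = [], set()
    pending = [t[0] for t in tasks]
    while pending:
        for name in pending:
            if deps[name] <= done:               # all producers finished
                order.append(name)
                done.add(name)
                pending.remove(name)
                break
    return order
```

Even this trivial version touches several hash tables per task, which hints at why fine-grained tasks make software-only dependence tracking a bottleneck.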
In recent years, advances in next-generation sequencing technologies have enabled the proliferation of genomic applications that guide personalized medicine. These applications have an enormous computational cost due to the large amount of data they process. The first step in many of these applications consists of aligning reads against a reference genome. Very recently, the wavefront alignment algorithm has been introduced, significantly reducing the execution time of read alignment. This paper presents an FPGA-based hardware/software co-designed accelerator for such...
Frequent Translation Lookaside Buffer (TLB) misses incur high performance and energy costs due to the page walks required for fetching the corresponding address translations. Prefetching page table entries (PTEs) ahead of demand TLB accesses can mitigate the address translation bottleneck, but each prefetch requires traversing the page table, triggering additional accesses to the memory hierarchy. Therefore, page table prefetching is a costly technique that may undermine performance when prefetches are not accurate. In this paper we exploit the locality in the last...
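One concrete form of page-table locality is that PTEs are packed together: with 8-byte PTEs, a single 64-byte cache line fetched by a page walk holds 8 consecutive translations. The toy model below (parameters and policy are illustrative assumptions, not the paper's mechanism) counts how many walks a sequential access stream needs with and without installing those free neighbouring translations.

```python
# Toy model of exploiting PTE packing: one page walk fetches a 64-byte
# line holding 8 consecutive 8-byte PTEs, so neighbouring translations
# can be installed without extra walks. Illustrative assumptions only.

PTES_PER_LINE = 8

def count_page_walks(page_trace, fill_neighbors):
    tlb = set()
    walks = 0
    for vpage in page_trace:
        if vpage not in tlb:
            walks += 1                              # demand miss -> walk
            if fill_neighbors:
                base = vpage - vpage % PTES_PER_LINE
                tlb.update(range(base, base + PTES_PER_LINE))
            else:
                tlb.add(vpage)
    return walks
```

On a 64-page sequential trace the walk count drops by the packing factor, from 64 to 8, without issuing any speculative memory traffic.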
Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the benefits they provide for scaling core count and memory capacity. Also, the flat address space they offer considerably improves programmability. However, ccNUMA architectures require sophisticated and expensive cache coherence protocols to enforce correctness during parallel executions, which trigger a significant amount of on- and off-chip traffic in the system. This paper analyses how this traffic may be best constrained in a large, real platform comprising 288 cores...
Stacked DRAM memories have become a reality in High-Performance Computing (HPC) architectures. These memories provide much higher bandwidth while consuming less power than traditional off-chip memories, but their limited memory capacity is insufficient for modern HPC systems. For this reason, both stacked and off-chip memories are expected to co-exist in HPC architectures, giving rise to different approaches for architecting the memory system.
In the last decades, the continuous proliferation of High-Performance Computing (HPC) systems and data centers has augmented the demand for expert HPC system designers, administrators, and programmers. For this reason, most universities have introduced courses on parallel programming in their degrees. However, the laboratory assignments of these courses generally use clusters that are owned, managed and administrated by the university. This methodology has been shown effective to teach parallel programming, but using a remote cluster...
This paper focuses on the implementation of a neural network accelerator optimized for speed and energy efficiency, for use in embedded machine learning. Specifically, we explore power reduction at the hardware level through a systolic array and low-precision data systems, including quantized approaches. We present a comprehensive analysis comparing a full precision (FP16) version with an integer (INT16) version on an FPGA. We upgraded the FP16 modules to handle INT16 values, employing shifts to enhance value density while maintaining...
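The shift-based integer representation mentioned above can be illustrated with a small sketch of symmetric power-of-two quantization (an assumed scheme for illustration, not necessarily the paper's exact format): values share a single shift amount chosen so the largest magnitude still fits in the signed 16-bit range, and rescaling becomes a bit shift instead of a floating-point multiply.

```python
# Sketch of symmetric power-of-two (shift-based) INT16 quantization.
# Illustrative assumptions, not the paper's FPGA datapath.

def choose_shift(values, bits=16):
    # largest shift that keeps every value inside the signed range
    limit = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    shift = 0
    while peak * (1 << (shift + 1)) <= limit:
        shift += 1
    return shift

def quantize(values, shift):
    # fixed-point encode: multiply by 2^shift and round to an integer
    return [round(v * (1 << shift)) for v in values]

def dequantize(q, shift):
    # decode: divide by 2^shift (a right shift in hardware)
    return [x / (1 << shift) for x in q]
```

Values that are exact binary fractions round-trip losslessly, and the shared shift means the hardware never needs a per-value scale factor.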
The effort to reduce address translation overheads has typically targeted data accesses since they constitute the overwhelming portion of second-level TLB (STLB) misses in desktop and HPC applications. The cost of instruction address translation has been relatively neglected due to historically small instruction footprints. However, state-of-the-art datacenter and server applications feature massive instruction footprints owing to deep software stacks, resulting in high STLB miss rates for instruction accesses.
The increase in the working set sizes of contemporary applications outpaces the growth of cache sizes, resulting in frequent main memory accesses that deteriorate system performance due to the disparity between processor and memory speeds. Prefetching data blocks into the cache hierarchy ahead of demand accesses has proven successful at attenuating this bottleneck. However, spatial prefetchers operating in the physical address space leave significant performance on the table by limiting their pattern detection within 4KB page boundaries, while modern systems...
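The region-based mechanism that such spatial prefetchers use can be sketched briefly. The toy below follows the general SMS-style scheme (an assumed simplification, not a specific proposal): it learns which 64-byte blocks of a region are touched, keyed by the trigger's PC and offset, and replays that footprint on the next visit to a new region; the `region_size` parameter makes explicit how a 4KB physical-page region caps the pattern's reach.

```python
# Toy SMS-style spatial pattern prefetcher. Illustrative assumptions only.

BLOCK = 64

class SpatialPrefetcher:
    def __init__(self, region_size=4096):
        self.region_size = region_size
        self.training = {}   # region base -> (signature, touched offsets)
        self.patterns = {}   # signature -> learned footprint

    def access(self, pc, addr):
        base = addr - addr % self.region_size
        off = (addr % self.region_size) // BLOCK
        if base not in self.training:
            sig = (pc, off)                  # trigger signature
            self.training[base] = (sig, set())
            learned = self.patterns.get(sig, frozenset())
            # replay the footprint learned for this signature
            return [base + b * BLOCK for b in learned]
        self.training[base][1].add(off)      # keep training this region
        return []

    def evict(self, base):
        # on region eviction, record the observed footprint
        sig, offs = self.training.pop(base)
        self.patterns[sig] = frozenset(offs)
```

After one region has been observed, a later access by the same trigger to a fresh region immediately prefetches the previously touched blocks at the matching offsets.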
With increasing core counts, the scalability of directory-based cache coherence has become a challenging problem. To reduce the area and power needs of the directory, recent proposals reduce its size by classifying data as private or shared, and disable coherence for private data. However, existing classification methods suffer from inaccuracies and require complex hardware support with limited scalability. This paper proposes a hardware/software co-designed approach: the runtime system identifies data that is guaranteed by the programming model...
Peachy Parallel Assignments are a resource for instructors teaching parallel and distributed programming. These are high-quality assignments, previously tested in class, that are readily adoptable. This collection of assignments includes implementing a subset of OpenMP using pthreads, creating an animated fractal, performing image processing with histogram equalization, simulating a storm of high-energy particles, and solving the wave equation in a variety of settings. All of these come with sample assignment sheets and the necessary starter code.
Current microprocessors include several knobs to modify the hardware behavior in order to improve performance, power, and energy under different workload demands. An impractical and time-consuming offline profiling is needed to evaluate the design space and find the optimal knob configuration. Different knobs are typically configured in a decoupled manner to avoid this time-consuming process. This can often lead to underperforming configurations and conflicting decisions that jeopardize system power-performance efficiency. Thus, dynamic...
Cache coherence protocols limit the scalability of multicore and manycore architectures and are responsible for an important amount of the power consumed in the chip. A good way to alleviate these problems is to introduce a local memory alongside the cache hierarchy, forming a hybrid memory system. Local memories are more power-efficient than caches and do not generate coherence traffic, but they suffer from poor programmability. When non-predictable access patterns are found, compilers do not succeed in generating code because of the incoherence between the two...
Cache coherence protocols limit the scalability of chip multiprocessors. One solution is to introduce a local memory alongside the cache hierarchy, forming a hybrid memory system. Local memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. When non-predictable access patterns are found, compilers do not succeed in generating code because of the incoherency between the two storages. This paper proposes a coherence protocol for hybrid memory systems that allows the compiler to generate code even in the presence of aliasing...
The vast disparity between Last Level Cache (LLC) and memory latencies has motivated the need for efficient cache management policies. The computer architecture literature abounds with work on LLC replacement policies. Although these works greatly improve over the least-recently-used (LRU) policy, they tend to focus only on the SPEC CPU 2006 benchmark suite, and more recently the SPEC CPU 2017 suite, for evaluation. However, these workloads are representative of only a subset of current High-Performance Computing (HPC) workloads. In this paper we...
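For reference, the LRU baseline that LLC replacement research improves upon fits in a few lines: each set orders its lines from least to most recently used, promotes a line on a hit, and evicts the oldest on a miss. A minimal single-set sketch:

```python
# Baseline LRU replacement for one cache set (the policy that LLC
# replacement proposals are measured against). Illustrative sketch.
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()   # tag -> None, ordered oldest-first

    def access(self, tag):
        """Returns True on a hit, False on a miss (which fills the line)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)        # promote to MRU
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)     # evict the LRU line
        self.lines[tag] = None
        return False
```

LRU's well-known weakness, and the motivation for smarter policies, is that a streaming access pattern wider than the set's associativity evicts every line before it can be reused.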