- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Embedded Systems Design Techniques
- Interconnection Networks and Systems
- Cloud Computing and Resource Management
- Genomics and Phylogenetic Studies
- Algorithms and Data Compression
- Chromosomal and Genetic Variations
- Caching and Content Delivery
- Advanced Neural Network Applications
- Distributed and Parallel Computing Systems
- Ferroelectric and Negative Capacitance Devices
- Graph Theory and Algorithms
- IoT and Edge/Fog Computing
- Plant Virus Research Studies
- Advanced Memory and Neural Computing
- Photonic and Optical Devices
- Radiation Effects in Electronics
- Evolutionary Algorithms and Applications
- Artificial Intelligence in Healthcare and Education
- Semiconductor Materials and Devices
- Computational Physics and Python Applications
- Brain Tumor Detection and Classification
- Network Packet Processing and Optimization
- Neural Networks and Reservoir Computing
Barcelona Supercomputing Center
2013-2024
Universitat Politècnica de Catalunya
2013-2024
Universitat Autònoma de Barcelona
2021
Université Grenoble Alpes
2018
Universidad del Noreste
2018
Lewis & Clark College
2018
University of Hawaiʻi at Mānoa
2018
Computing Center
2018
Universidad de Cantabria
2016
The increasing number of cores in manycore architectures causes important power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. A good way to hide these programmability difficulties from the programmer is to give the compiler the responsibility of generating code to manage the scratchpad memories. Unfortunately, compilers...
To alleviate the performance and energy overheads of contemporary applications with large data footprints, we propose the Two Level Perceptron (TLP) predictor, a neural mechanism that effectively combines predicting whether an access will be off-chip with adaptive prefetch filtering at the first-level data cache (L1D). TLP is composed of two connected microarchitectural perceptron predictors, named First Level Predictor (FLP) and Second Level Predictor (SLP). FLP performs accurate off-chip prediction by using several program features based on...
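The perceptron-style prediction described above can be illustrated with a small sketch. This is not TLP's actual design: the features (load PC and 4KB-page address), table sizes, and threshold below are illustrative assumptions, showing only the generic hashed-perceptron mechanism of summing per-feature weights and training them on the observed outcome.

```python
# Minimal sketch of a hashed-perceptron off-chip predictor (illustrative,
# not the TLP/FLP design from the paper).

class PerceptronPredictor:
    def __init__(self, table_size=256, threshold=4, max_w=31):
        # one saturating-weight table per program feature
        self.tables = [[0] * table_size for _ in range(2)]
        self.table_size = table_size
        self.threshold = threshold
        self.max_w = max_w

    def _indices(self, pc, addr):
        # hash each feature into its own table (assumed features:
        # load PC and the 4KB-page address)
        return [pc % self.table_size, (addr >> 12) % self.table_size]

    def predict(self, pc, addr):
        # predict "off-chip" when the summed weights reach the threshold
        s = sum(t[i] for t, i in zip(self.tables, self._indices(pc, addr)))
        return s >= self.threshold

    def train(self, pc, addr, went_off_chip):
        # saturating increment/decrement driven by the actual outcome
        delta = 1 if went_off_chip else -1
        for t, i in zip(self.tables, self._indices(pc, addr)):
            t[i] = max(-self.max_w, min(self.max_w, t[i] + delta))
```

After a few training updates, loads that consistently go off-chip accumulate positive weights and cross the threshold, while on-chip loads accumulate negative weights and are filtered out.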
The increasing number of cores and the anticipated level of heterogeneity in upcoming multicore architectures cause important problems for traditional cache hierarchies. A good way to alleviate these problems is to add scratchpad memories alongside the cache hierarchy, forming a hybrid memory hierarchy. This organization has the potential to improve performance and reduce power consumption and on-chip network traffic, but exposing such a complex memory model to the programmer has a very negative impact on the programmability of the architecture. Emerging task-based...
Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality-aware task schedulers can benefit from these opportunities by scheduling tasks to the most appropriate cores. However, they may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that criticality information can be exploited to drive hardware reconfigurations, we propose the Criticality Aware Task Acceleration (CATA) mechanism...
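The core idea of criticality-aware scheduling can be sketched in a few lines. The toy scheduler below, an assumption-laden simplification rather than CATA itself, steers critical tasks to fast cores and lets an idle fast core absorb non-critical work, which is one simple way to avoid the idling that priority inversion causes.

```python
# Toy criticality-aware list scheduler: critical tasks go to fast cores,
# non-critical tasks to slow cores; a fast core takes non-critical work
# when it would otherwise sit idle. Core counts and the speedup factor
# are illustrative assumptions.
import heapq

def schedule(tasks, fast_cores=2, slow_cores=2, speedup=2.0):
    """tasks: list of (name, work, is_critical). Returns the makespan."""
    # min-heaps of (time_when_free, core_id)
    fast = [(0.0, i) for i in range(fast_cores)]
    slow = [(0.0, i) for i in range(slow_cores)]
    heapq.heapify(fast)
    heapq.heapify(slow)
    # dispatch critical tasks first
    for name, work, critical in sorted(tasks, key=lambda t: not t[2]):
        if critical or slow[0][0] > fast[0][0]:
            t, c = heapq.heappop(fast)
            heapq.heappush(fast, (t + work / speedup, c))  # fast core
        else:
            t, c = heapq.heappop(slow)
            heapq.heappush(slow, (t + work, c))            # slow core
    return max(t for t, _ in fast + slow)
```

With two critical and two non-critical tasks of equal work, the critical pair finishes early on the fast cores while the slow cores handle the rest.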
Given the overwhelming impact of machine learning over the last decade, several libraries and frameworks have been developed in recent years to simplify the design and training of neural networks, providing array-based programming, automatic differentiation and user-friendly access to hardware accelerators. None of those tools, however, was designed with native and transparent support for Cloud Computing or heterogeneous High-Performance Computing (HPC). The DeepHealth Toolkit is an open source Deep Learning toolkit aimed at...
Sequence alignment remains a fundamental problem with practical applications ranging from pattern recognition to computational biology. Traditional algorithms based on dynamic programming are hard to parallelize, require significant amounts of memory, and fail to scale for large inputs. This work presents eWFA-GPU, a GPU (graphics processing unit)-accelerated tool to compute the exact edit-distance sequence alignment based on the wavefront alignment algorithm (WFA). This approach exploits the similarities between the input sequences to accelerate...
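The wavefront idea can be shown in a compact sequential sketch (this is a minimal CPU illustration of the WFA principle, not the eWFA-GPU implementation): wavefront s stores, per diagonal k = i - j, the furthest offset i reachable with edit distance s, and matching characters are consumed for free, so work scales with the distance rather than with the full DP matrix.

```python
# Minimal wavefront (WFA) edit-distance sketch. Illustrative only.

def wfa_edit_distance(a, b):
    n, m = len(a), len(b)

    def extend(k, i):
        # follow free matches along diagonal k (j = i - k)
        j = i - k
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return i

    k_final = n - m
    wf = {0: extend(0, 0)}            # wavefront for score s = 0
    s = 0
    while wf.get(k_final, -1) != n:   # stop when (n, m) is reached
        s += 1
        nxt = {}
        for k in range(min(wf) - 1, max(wf) + 2):
            cands = []
            if k in wf:
                cands.append(wf[k] + 1)      # mismatch (substitution)
            if k - 1 in wf:
                cands.append(wf[k - 1] + 1)  # delete a[i]
            if k + 1 in wf:
                cands.append(wf[k + 1])      # insert b[j]
            cands = [i for i in cands if i <= n and i - k <= m]
            if cands:
                nxt[k] = extend(k, max(cands))
        wf = nxt
    return s
```

Each iteration grows the set of reachable diagonals by one in each direction, so identical or near-identical sequences terminate after very few wavefronts.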
The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, finer-grained tasking will be required to exploit the available parallelism, which will magnify the overheads introduced by the runtime system. This work presents the Task Dependence Manager (TDM), a hardware/software...
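The runtime bookkeeping that such a mechanism targets can be sketched in software. The toy tracker below, a simplification assumed for illustration (it handles only read-after-write dependences), shows the kind of work a tasking runtime does per task: record the last writer of each datum and mark a task ready once its producers have finished.

```python
# Minimal software dependence tracker of the kind a tasking runtime
# performs per task. Tracks only RAW dependences for brevity; WAR/WAW
# handling is omitted in this sketch.
from collections import defaultdict

def topological_schedule(tasks):
    """tasks: list of (name, inputs, outputs). Returns a valid order."""
    last_writer = {}
    deps = defaultdict(set)
    for name, ins, outs in tasks:
        for d in ins:
            if d in last_writer:
                deps[name].add(last_writer[d])   # read-after-write edge
        for d in outs:
            last_writer[d] = name
    order, done = [], set()
    pending = [t[0] for t in tasks]
    while pending:
        for name in pending:
            if deps[name] <= done:               # all producers finished
                order.append(name)
                done.add(name)
                pending.remove(name)
                break
    return order
```

Even this trivial version touches several hash tables per task, which hints at why fine-grained tasks make software-only dependence tracking a bottleneck.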
In recent years, advances in next-generation sequencing technologies have enabled the proliferation of genomic applications that guide personalized medicine. These applications have an enormous computational cost due to the large amount of data they process. The first step in many of these applications consists of aligning reads against a reference genome. Very recently, the wavefront alignment algorithm has been introduced, significantly reducing the execution time of read alignment. This paper presents an FPGA-based hardware/software co-designed accelerator for such...
Frequent Translation Lookaside Buffer (TLB) misses incur high performance and energy costs due to the page walks required for fetching the corresponding address translations. Prefetching page table entries (PTEs) ahead of demand TLB accesses can mitigate the address translation bottleneck, but each prefetch requires traversing the page table, triggering additional accesses to the memory hierarchy. Therefore, page table prefetching is a costly technique that may undermine performance when prefetches are not accurate. In this paper we exploit the locality in the last...
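One concrete form of page-table locality is that PTEs are packed together: with 8-byte PTEs, a single 64-byte cache line fetched by a page walk holds 8 consecutive translations. The toy model below (parameters and policy are illustrative assumptions, not the paper's mechanism) counts how many walks a sequential access stream needs with and without installing those free neighbouring translations.

```python
# Toy model of exploiting PTE packing: one page walk fetches a 64-byte
# line holding 8 consecutive 8-byte PTEs, so neighbouring translations
# can be installed without extra walks. Illustrative assumptions only.

PTES_PER_LINE = 8

def count_page_walks(page_trace, fill_neighbors):
    tlb = set()
    walks = 0
    for vpage in page_trace:
        if vpage not in tlb:
            walks += 1                              # demand miss -> walk
            if fill_neighbors:
                base = vpage - vpage % PTES_PER_LINE
                tlb.update(range(base, base + PTES_PER_LINE))
            else:
                tlb.add(vpage)
    return walks
```

On a 64-page sequential trace the walk count drops by the packing factor, from 64 to 8, without issuing any speculative memory traffic.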
Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the benefits they provide for scaling core count and memory capacity. Also, the flat address space they offer considerably improves programmability. However, ccNUMA architectures require sophisticated and expensive cache coherence protocols to enforce correctness during parallel executions, which trigger a significant amount of on- and off-chip traffic in the system. This paper analyses how this traffic may be best constrained in a large, real platform comprising 288 cores...
Stacked DRAM memories have become a reality in High-Performance Computing (HPC) architectures. These memories provide much higher bandwidth while consuming less power than traditional off-chip memories, but their limited memory capacity is insufficient for modern HPC systems. For this reason, both stacked and off-chip memories are expected to co-exist in HPC architectures, giving rise to different approaches for architecting the memory system.
In the last decades, the continuous proliferation of High-Performance Computing (HPC) systems and data centers has augmented the demand for expert HPC system designers, administrators, and programmers. For this reason, most universities have introduced courses on parallel programming in their degrees. However, the laboratory assignments of these courses generally use clusters that are owned, managed and administrated by the university. This methodology has been shown effective to teach parallel programming, but using a remote cluster...
This paper focuses on the implementation of a neural network accelerator optimized for speed and energy efficiency, for use in embedded machine learning. Specifically, we explore power reduction at the hardware level through a systolic array and low-precision data systems, including quantized approaches. We present a comprehensive analysis comparing a full precision (FP16) version with an integer (INT16) version on an FPGA. We upgraded the FP16 modules to handle INT16 values, employing shifts to enhance value density while maintaining...
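The shift-based integer representation mentioned above can be illustrated with a small sketch of symmetric power-of-two quantization (an assumed scheme for illustration, not necessarily the paper's exact format): values share a single shift amount chosen so the largest magnitude still fits in the signed 16-bit range, and rescaling becomes a bit shift instead of a floating-point multiply.

```python
# Sketch of symmetric power-of-two (shift-based) INT16 quantization.
# Illustrative assumptions, not the paper's FPGA datapath.

def choose_shift(values, bits=16):
    # largest shift that keeps every value inside the signed range
    limit = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    shift = 0
    while peak * (1 << (shift + 1)) <= limit:
        shift += 1
    return shift

def quantize(values, shift):
    # fixed-point encode: multiply by 2^shift and round to an integer
    return [round(v * (1 << shift)) for v in values]

def dequantize(q, shift):
    # decode: divide by 2^shift (a right shift in hardware)
    return [x / (1 << shift) for x in q]
```

Values that are exact binary fractions round-trip losslessly, and the shared shift means the hardware never needs a per-value scale factor.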
The effort to reduce address translation overheads has typically targeted data accesses since they constitute the overwhelming portion of second-level TLB (STLB) misses in desktop and HPC applications. The cost of instruction address translation has been relatively neglected due to historically small instruction footprints. However, state-of-the-art datacenter and server applications feature massive instruction footprints owing to deep software stacks, resulting in high STLB miss rates for instruction accesses.
The increase in the working set sizes of contemporary applications outpaces the growth of cache sizes, resulting in frequent main memory accesses that deteriorate system performance due to the disparity between processor and memory speeds. Prefetching data blocks into the cache hierarchy ahead of demand accesses has proven successful at attenuating this bottleneck. However, spatial prefetchers operating in the physical address space leave significant performance on the table by limiting their pattern detection within 4KB page boundaries, while modern systems...
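The region-based mechanism that such spatial prefetchers use can be sketched briefly. The toy below follows the general SMS-style scheme (an assumed simplification, not a specific proposal): it learns which 64-byte blocks of a region are touched, keyed by the trigger's PC and offset, and replays that footprint on the next visit to a new region; the `region_size` parameter makes explicit how a 4KB physical-page region caps the pattern's reach.

```python
# Toy SMS-style spatial pattern prefetcher. Illustrative assumptions only.

BLOCK = 64

class SpatialPrefetcher:
    def __init__(self, region_size=4096):
        self.region_size = region_size
        self.training = {}   # region base -> (signature, touched offsets)
        self.patterns = {}   # signature -> learned footprint

    def access(self, pc, addr):
        base = addr - addr % self.region_size
        off = (addr % self.region_size) // BLOCK
        if base not in self.training:
            sig = (pc, off)                  # trigger signature
            self.training[base] = (sig, set())
            learned = self.patterns.get(sig, frozenset())
            # replay the footprint learned for this signature
            return [base + b * BLOCK for b in learned]
        self.training[base][1].add(off)      # keep training this region
        return []

    def evict(self, base):
        # on region eviction, record the observed footprint
        sig, offs = self.training.pop(base)
        self.patterns[sig] = frozenset(offs)
```

After one region has been observed, a later access by the same trigger to a fresh region immediately prefetches the previously touched blocks at the matching offsets.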
With increasing core counts, the scalability of directory-based cache coherence has become a challenging problem. To reduce the area and power needs of the directory, recent proposals reduce its size by classifying data as private or shared, and disable coherence for private data. However, existing classification methods suffer from inaccuracies and require complex hardware support with limited scalability. This paper proposes a hardware/software co-designed approach: the runtime system identifies data that is guaranteed by the programming model...
Peachy Parallel Assignments are a resource for instructors teaching parallel and distributed programming. These are high-quality assignments, previously tested in class, that are readily adoptable. This collection of assignments includes implementing a subset of OpenMP using pthreads, creating an animated fractal, performing image processing with histogram equalization, simulating a storm of high-energy particles, and solving the wave equation in a variety of settings. All of these come with sample assignment sheets and the necessary starter code.
Current microprocessors include several knobs to modify the hardware behavior in order to improve performance, power, and energy under different workload demands. An impractical and time-consuming offline profiling is needed to evaluate the design space and find the optimal knob configuration. Different knobs are typically configured in a decoupled manner to avoid this time-consuming process. This can often lead to underperforming configurations and conflicting decisions that jeopardize system power-performance efficiency. Thus, dynamic...
Cache coherence protocols limit the scalability of multicore and manycore architectures and are responsible for an important amount of the power consumed in the chip. A good way to alleviate these problems is to introduce a local memory alongside the cache hierarchy, forming a hybrid memory system. Local memories are more power-efficient than caches and do not generate coherence traffic, but they suffer from poor programmability. When non-predictable access patterns are found, compilers do not succeed in generating code because of the incoherence between the two...
Cache coherence protocols limit the scalability of chip multiprocessors. One solution is to introduce a local memory alongside the cache hierarchy, forming a hybrid memory system. Local memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. When non-predictable access patterns are found, compilers do not succeed in generating code because of the incoherency between the two storages. This paper proposes a coherence protocol for hybrid memory systems that allows the compiler to generate code even in the presence of aliasing...
The vast disparity between Last Level Cache (LLC) and memory latencies has motivated the need for efficient cache management policies. The computer architecture literature abounds with work on LLC replacement policies. Although these works greatly improve over the least-recently-used (LRU) policy, they tend to focus only on the SPEC CPU 2006 benchmark suite, and more recently the SPEC CPU 2017 suite, for evaluation. However, these workloads are representative of only a subset of current High-Performance Computing (HPC) workloads. In this paper we...
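For reference, the LRU baseline that LLC replacement research improves upon fits in a few lines: each set orders its lines from least to most recently used, promotes a line on a hit, and evicts the oldest on a miss. A minimal single-set sketch:

```python
# Baseline LRU replacement for one cache set (the policy that LLC
# replacement proposals are measured against). Illustrative sketch.
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()   # tag -> None, ordered oldest-first

    def access(self, tag):
        """Returns True on a hit, False on a miss (which fills the line)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)        # promote to MRU
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)     # evict the LRU line
        self.lines[tag] = None
        return False
```

LRU's well-known weakness, and the motivation for smarter policies, is that a streaming access pattern wider than the set's associativity evicts every line before it can be reused.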