NFDI4DS | UHH-SEMS - Publication Details

An Effective Gray-Box Identification Procedure for Multicore Thermal Modeling

OPENALEX - Publications

Francesco Beneventi Andrea Bartolini Andrea Tilli Luca Benini

Aggressive thermal management is a critical feature for high-end computing platforms, as worst-case thermal budgeting is becoming unaffordable. Reactive thermal management, which sets temperature thresholds to trigger thermal capping actions, is too “near-sighted,” and it may lead to severe performance degradation and thermal overshoots. More aggressive proactive thermal managements minimize the performance penalty with smooth optimal control. These techniques require knowledge of thermal models, which have to be accurate and simple to make the controls effective, while keeping...

10.1109/tc.2012.293 article EN IEEE Transactions on Computers 2012-12-13

Continuous learning of HPC infrastructure models using big data analytics and in-memory processing tools

Francesco Beneventi Andrea Bartolini Carlo Cavazzoni Luca Benini

Exascale computing represents the next leap in the HPC race. Reaching this level of performance is subject to several engineering challenges such as energy consumption, equipment cooling, reliability and massive parallelism. Model-based optimization is an essential tool in the design process and control of efficient, reliable and thermally constrained systems. However, in the Exascale domain, model learning techniques tailored to the specific supercomputer require real measurements and must therefore handle and analyze a massive amount of data coming from...

10.23919/date.2017.7927143 article EN Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015 2017-03-01

M100 ExaData: a data collection campaign on the CINECA’s Marconi100 Tier-0 supercomputer

Andrea Borghesi Carmine Di Santi Martin Molan Mohsen Seyedkazemi Ardebili Alessio Mauri and 7 more

Supercomputers are the most powerful computing machines available to society. They play a central role in economic, industrial, and societal development. While they are used by scientists, engineers, decision-makers, and data analysts to computationally solve complex problems, supercomputers and their hosting datacenters are themselves power-hungry systems. Improving their efficiency, availability, and resiliency is vital and the subject of many research and engineering efforts. Still, a major roadblock hinders researchers: the dearth...

10.1038/s41597-023-02174-3 article EN cc-by Scientific Data 2023-05-18

Paving the Way Toward Energy-Aware and Automated Datacentre

Andrea Bartolini Francesco Beneventi Andrea Borghesi Daniele Cesarini Antonio Libri and 2 more

Energy efficiency and datacentre automation are critical targets of the research and deployment agenda of CINECA and its research partners in the Energy Efficient System Laboratory of the University of Bologna and the Integrated System Laboratory of ETH Zurich. In this manuscript, we present the primary outcomes of the research conducted in this domain under the umbrella of several European, National and Private funding schemes. These outcomes consist of: (i) the ExaMon scalable, flexible, holistic monitoring framework, which is capable of ingesting 70GB/day of telemetry data of the entire datacentre and link this data with machine learning...

10.1145/3339186.3339215 article EN 2019-07-22

Experimenting with Emerging RISC-V Systems for Decentralised Machine Learning

Gianluca Mittone Nicolò Tonci Robert Birke Iacopo Colonnelli Doriana Medić and 8 more

Decentralised Machine Learning (DML) enables collaborative machine learning without centralised input data. Federated Learning (FL) and Edge Inference are examples of DML. While tools for DML (especially FL) are starting to flourish, many are not flexible and portable enough to experiment with novel processors (e.g., RISC-V), non-fully connected network topologies, and asynchronous collaboration schemes. We overcome these limitations via a domain-specific language allowing us to map DML schemes to an underlying middleware, i.e....

10.1145/3587135.3592211 article EN 2023-05-09

Bias-Compensated Least Squares Identification of Distributed Thermal Models for Many-Core Systems-on-Chip

Roberto Diversi Andrea Tilli Andrea Bartolini Francesco Beneventi Luca Benini

The thermal wall for many-core systems on-chip calls for advanced thermal management techniques to maximize performance, while capping temperatures. Distributed and compact thermal models are a cornerstone for such techniques. System identification methodologies allow extracting models directly from the target device's thermal response. Unfortunately, standard Auto-Regressive eXogenous models and Least Squares techniques cannot effectively tackle both model approximation and measurement noise typical of real systems. In this work, we propose a novel distributed...

10.1109/tcsi.2014.2312495 article EN IEEE Transactions on Circuits and Systems I Regular Papers 2014-04-08

Design of an Energy Aware Petaflops Class High Performance Cluster Based on Power Architecture

Wissam Abu Ahmad Andrea Bartolini Francesco Beneventi Luca Benini Andrea Borghesi and 7 more

In this paper we present D.A.V.I.D.E. (Development for an Added Value Infrastructure Designed in Europe), an innovative and energy efficient High Performance Computing cluster designed by E4 Computer Engineering for PRACE (Partnership for Advanced Computing in Europe). D.A.V.I.D.E. is built using best-in-class components (IBM's POWER8-NVLink CPUs, NVIDIA TESLA P100 GPUs, Mellanox InfiniBand EDR 100 Gb/s networking) plus custom hardware and an innovative system middleware software. D.A.V.I.D.E. features (i) a dedicated power monitor interface, built around the...

10.1109/ipdpsw.2017.22 preprint EN 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 2017-05-01

Thermal Analysis and Interpolation Techniques for a Logic + WideIO Stacked DRAM Test Chip

Francesco Beneventi Andrea Bartolini Pascal Vivet Luca Benini

Self-heating and high-operating temperature are major concerns in 3-D-chip integration. In this paper, we leverage a 3-D test chip (WideIO dynamic random access memory on top of a logic die) equipped with temperature sensors and heaters to explore thermal effects and to develop advanced modeling strategies suitable for complex 3-D-stacked circuits. We correlate measurements with the power dissipated by the heaters using model learning techniques. Moreover, we defined a basis function obtained from the data available from the on-chip sensors. This can...

10.1109/tcad.2015.2474382 article EN IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2015-08-28

The D.A.V.I.D.E. big-data-powered fine-grain power and performance monitoring support

Andrea Bartolini Andrea Borghesi Antonio Libri Francesco Beneventi Daniele Gregori and 3 more

On the race toward exascale, supercomputing systems are facing important challenges which limit the efficiency of the system. Among all, power and energy consumption, fueled by the end of Dennard's scaling, start to show their impact on limiting supercomputers' peak performance and cost effectiveness.

10.1145/3203217.3205863 article EN 2018-05-08

Meet Monte Cimone

Federico Ficarelli Andrea Bartolini Emanuele Parisi Francesco Beneventi Francesco Barchi and 7 more

The new open and royalty-free RISC-V ISA is attracting interest across the whole computing continuum, from microcontrollers to supercomputers. High-performance RISC-V processors and accelerators have been announced, but RISC-V-based HPC systems will need a holistic co-design effort, spanning memory, storage hierarchy, interconnects and the full software stack. In this paper, we describe Monte Cimone, a fully-operational multi-blade computer prototype and hardware-software test-bed based on the U740, a double precision...

10.1145/3528416.3530869 article EN 2022-05-05

Cooling-aware node-level task allocation for next-generation green HPC systems

Francesco Beneventi Andrea Bartolini Carlo Cavazzoni Luca Benini

Energy-efficiency is of primary interest in future HPC systems, as their computational growth is limited by the supercomputer peak power consumption. A significant part of the power consumed by a supercomputing machine is caused by the cooling infrastructure. Today's thermal design is based on coarse-grain models which consider the silicon die of the processing elements as an isothermal surface. Similarly, the feedback control loops use the same assumption to modulate the cooling effort with the goal of reducing the cooling cost and maintaining the temperature in a safe working range. Recent...

10.1109/hpcsim.2016.7568402 article EN 2016-07-01

Self-Aware Thermal Management for High-Performance Computing Processors

Andrea Bartolini Roberto Diversi Daniele Cesarini Francesco Beneventi

Processors for high performance computing and server workloads are today thermally constrained. To preserve a safe working temperature, state-of-the-art processors in this market segment integrate many cores on the same die and feature fine-grain power management and thermal feedback loops implemented in hardware. However, to keep the control policy simple, these controllers fail in taking advantage of the underlying heterogeneity, the long thermal transients and the specific user mode. In this paper, we present a self-aware framework making...

10.1109/mdat.2017.2774774 article EN IEEE Design and Test 2017-11-16

SCC thermal model identification via advanced bias-compensated least-squares

Roberto Diversi Andrea Bartolini Andrea Tilli Francesco Beneventi Luca Benini

Compact thermal models and modeling strategies are today a cornerstone for advanced power management to counteract the emerging thermal crisis for many-core systems-on-chip. System identification techniques allow extracting models directly from the target device's thermal response. Unfortunately, standard Least Squares techniques cannot effectively cope with both model approximation and measurement noise typical of real systems. In this work, we present a novel distributed identification strategy capable of coping with real-life temperature sensor noise and extracting a set...

10.5555/2485288.2485347 article EN Design, Automation, and Test in Europe 2013-03-18

SCC Thermal Model Identification via Advanced Bias-Compensated Least-Squares

Roberto Diversi Andrea Bartolini Andrea Tilli Francesco Beneventi Luca Benini

Compact thermal models and modeling strategies are today a cornerstone for advanced power management to counteract the emerging thermal crisis for many-core systems-on-chip. System identification techniques allow extracting models directly from the target device's thermal response. Unfortunately, standard Least Squares techniques cannot effectively cope with both model approximation and measurement noise typical of real systems. In this work, we present a novel distributed identification strategy capable of coping with real-life temperature sensor noise and extracting a set...

10.7873/date.2013.060 article EN Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015 2013-01-01

GRAAFE: GRaph Anomaly Anticipation Framework for Exascale HPC systems

Martin Molan Mohsen Seyedkazemi Ardebili Junaid Ahmed Khan Francesco Beneventi Daniele Cesarini and 2 more

10.1016/j.future.2024.06.032 article EN Future Generation Computer Systems 2024-06-21

Monte Cimone: Paving the Road for the First Generation of RISC-V High-Performance Computers

Andrea Bartolini Federico Ficarelli Emanuele Parisi Francesco Beneventi Francesco Barchi and 7 more

The new open and royalty-free RISC-V ISA is attracting interest across the whole computing continuum, from microcontrollers to supercomputers. High-performance RISC-V processors and accelerators have been announced, but RISC-V-based HPC systems will need a holistic co-design effort, spanning memory, storage hierarchy, interconnects and the full software stack. In this paper, we describe Monte Cimone, a fully-operational multi-blade computer prototype and hardware-software test-bed based on the U740, a double-precision...

10.1109/socc56010.2022.9908096 article EN 2022-09-05

Static Thermal Model Learning for High-Performance Multicore Servers

Francesco Beneventi Andrea Bartolini Luca Benini

Aggressive thermal management is a critical feature for high-end computing platforms, as worst-case thermal budgeting is becoming unaffordable. Reactive thermal management, which sets temperature thresholds to trigger thermal capping actions, is too "near-sighted", and it may lead to severe performance degradation and thermal overshoots. More aggressive proactive thermal management minimizes the performance penalty with smooth optimal control, but requires the knowledge of system thermal models to be precise. Unfortunately, in practice these models are not provided by equipment...

10.1109/icccn.2011.6006065 article EN 2011-07-01

Thermal model identification of supercomputing nodes in production environment

Roberto Diversi Andrea Bartolini Francesco Beneventi Luca Benini

Distributed and compact thermal models are at the basis of thermal-aware design and on-line optimization of the cooling effort in future High-Performance Computing systems. These models can be directly extracted from the target device's thermal response by means of system identification techniques. This paper proposes a novel approach for the thermal identification of real-life production HPC systems. Our approach is capable of extracting a MISO thermal model of a supercomputing node in a real deployment scenario affected by quantization noise on the temperature measurements as well as operating in free-cooling, with...

10.1109/iecon.2016.7793664 article EN 2016-10-01

On-line thermal emulation: How to speed-up your thermal controller design

Francesco Beneventi Andrea Bartolini Luca Benini

Dynamic thermal management (DTM) is a key technology for future many-core systems. Indeed, many-core systems, as both server-class and embedded chip multiprocessors, are thermally constrained. DTM design requires consideration of the chain of interactions between HW operating points, workload phases, power consumption, die temperature, the monitor infrastructure, and the control policy. Hugely different time scales are involved, from microseconds to hours. Simulating the performance of DTM solutions for a many-core system in reasonable time is an open...

10.1109/patmos.2013.6662161 article EN 2013-09-01

A Scalable Framework for Online Power Modelling of High-Performance Computing Nodes in Production

Federico Pittino Francesco Beneventi Andrea Bartolini Luca Benini

Power and thermal design and management are critical components of high performance computing (HPC) systems, due to their cutting-edge position in terms of power density and large total power consumption. Many HPC power management strategies rely on the availability of accurate and compact power models, capable of predicting power consumption and tracking its sensitivity to workload parameters and operating points. In this paper we describe a methodology and a framework for training power models derived with two best-in-class procedures directly on online production nodes...

10.1109/hpcs.2018.00058 article EN 2018-07-01

Monte Cimone: Paving the Road for the First Generation of RISC-V High-Performance Computers

Andrea Bartolini Federico Ficarelli Emanuele Parisi Francesco Beneventi Francesco Barchi and 7 more

The new open and royalty-free RISC-V ISA is attracting interest across the whole computing continuum, from microcontrollers to supercomputers. High-performance RISC-V processors and accelerators have been announced, but RISC-V-based HPC systems will need a holistic co-design effort, spanning memory, storage hierarchy, interconnects and the full software stack. In this paper, we describe Monte Cimone, a fully-operational multi-blade computer prototype and hardware-software test-bed based on the U740, a double-precision...

10.48550/arxiv.2205.03725 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Thermal analysis and model identification techniques for a logic + WIDEIO stacked DRAM test chip

Francesco Beneventi Andrea Bartolini Pascal Vivet Denis Dutoit Luca Benini

High temperature is one of the limiting factors and major concerns in 3D-chip integration. In this paper we use a 3D test chip (WIDEIO DRAM on top of a logic die) equipped with temperature sensors and heaters to explore thermal effects. We correlated real temperature measurements with the power dissipated by the heaters using model learning techniques. The resulting compact model is able to predict temperatures at locations far from the sensors and to infer the power dissipation at any location of the chip. Results are verified by means of an off-sample validation technique and show high accuracy when...

10.5555/2616606.2617079 article EN Design, Automation, and Test in Europe 2014-03-24

Prediction of Thermal Hazards in a Real Datacenter Room Using Temporal Convolutional Networks

Mohsen Seyedkazemi Ardebili Marcello Zanghieri Alessio Burrello Francesco Beneventi Andrea Acquaviva and 2 more

Datacenters play a vital role in today's society. At large, a datacenter room is a complex controlled environment composed of thousands of computing nodes, which consume kW of power. To dissipate the power, forced air/liquid flow is employed, with a cost of millions of euros per year. Reducing this cost involves using free-cooling and average case design, which can create a cooling shortage and thermal hazards. When a thermal hazard happens, system administrators and the facility manager must stop production to avoid IT equipment damage and wear-out....

10.23919/date51398.2021.9474116 article EN Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015 2021-02-01