- Radiation Effects in Electronics
- Parallel Computing and Optimization Techniques
- Distributed systems and fault tolerance
- Cloud Computing and Resource Management
- Quantum Computing Algorithms and Architecture
- Particle Detector Development and Performance
- VLSI and Analog Circuit Testing
- Distributed and Parallel Computing Systems
- Advanced Data Storage Technologies
- Low-power high-performance VLSI design
- Quantum Information and Cryptography
- Advancements in Semiconductor Devices and Circuit Design
- Semiconductor materials and devices
- Security and Verification in Computing
- Radiation Therapy and Dosimetry
- semigroups and automata theory
- Software Reliability and Analysis Research
- Nutrition and Health in Aging
- Healthcare during COVID-19 Pandemic
- Cardiovascular and exercise physiology
- Graphite, nuclear technology, radiation studies
- Microgrid Control and Optimization
- Body Composition Measurement Techniques
- Logic, programming, and type systems
- Health, Nursing, Elderly Care
Universidade Federal do Paraná
2019-2023
Universidade Federal do Rio Grande do Sul
2012-2019
University of Rio Grande and Rio Grande Community College
2012-2019
Instituto Superior Técnico
2019
Institute of Informatics of the Slovak Academy of Sciences
2017-2019
Instituto Politécnico de Lisboa
2019
Universidade Federal do Rio Grande
2015
Increases in graphics hardware performance and improvements in programmability have enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose computing device. Titan, the world's second fastest supercomputer for open science in 2014, consists of more than 18,000 GPUs that scientists from various domains such as astrophysics, fusion, climate, and combustion use routinely to run large-scale simulations. Unfortunately, while their efficiency is well understood, their resilience characteristics in a large-scale system have...
Graphics processing units (GPUs) are increasingly attractive for both safety-critical and High-Performance Computing applications. GPU reliability is a primary concern for the automotive and aerospace markets and is also becoming an issue for supercomputers. In fact, the high number of devices in large data centers makes the probability of having at least one device corrupted very high. In this paper, we aim at giving novel insights on GPU reliability by evaluating the neutron sensitivity of modern GPUs' memory structures, highlighting pattern...
We present an in-depth analysis of the effects of transient faults on HPC applications in Intel Xeon Phi processors, based on radiation experiments and high-level fault injection. Besides measuring the realistic error rates of the Xeon Phi, we quantify Silent Data Corruptions (SDCs) by correlating the distribution of corrupted elements in the output to the application's characteristics. We evaluate the benefits of imprecise computing for reducing the programs' error rate. For example, for HotSpot a 0.5% tolerance in the output value reduces the error rate by 85%.
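The tolerance idea in the abstract above can be illustrated with a minimal sketch. It assumes a relative tolerance checked against a fault-free ("golden") output; the function name `count_sdc` and the sample values are hypothetical and are not taken from the paper.

```python
def count_sdc(golden, observed, rel_tol=0.0):
    """Count output elements whose relative error exceeds rel_tol.

    Hypothetical helper: the comparison rule is an assumption,
    not the paper's actual methodology.
    """
    if len(golden) != len(observed):
        raise ValueError("outputs must have the same length")
    mismatches = 0
    for g, o in zip(golden, observed):
        # Relative error against the golden value; use absolute
        # error when the golden value is zero.
        denom = abs(g) if g != 0 else 1.0
        if abs(o - g) / denom > rel_tol:
            mismatches += 1
    return mismatches


# Made-up example data: two elements differ from the golden output.
golden = [100.0, 200.0, 300.0, 400.0]
observed = [100.0, 200.4, 300.0, 440.0]

strict = count_sdc(golden, observed)          # exact comparison
relaxed = count_sdc(golden, observed, 0.005)  # 0.5% tolerance
print(strict, relaxed)
```

With these made-up values, the exact comparison flags both deviating elements, while the 0.5% tolerance flags only the element that is off by 10%.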
Graphics processing units (GPUs) are increasingly common in both safety-critical and high-performance computing (HPC) applications. Some current supercomputers are composed of thousands of GPUs, so the probability of device corruption becomes very high. Moreover, the GPU's parallel capabilities are attractive for the automotive and aerospace markets, where reliability is a serious concern. In this paper, the neutron sensitivity of modern GPU caches and internal resources is experimentally evaluated. Various Duplication With...
Novel computing architectures offer the possibility to execute floating-point operations with different precisions. The execution of reduced-precision operations, when acceptable for certain applications, is likely to reduce both execution time and power consumption. However, the application's error rate and the device's reliability can also be impacted by these changes. In this paper, we study the impact of data and operation precision changes on modern architectures. We consider Xilinx Field-Programmable Gate Arrays (FPGA), Intel Xeon...
The increased need for computing capabilities and higher efficiency has stimulated industries to make available in the market novel architectures with increased complexity. The variety of codes that can be executed, combined with the complexity of the architectures, introduces challenges to the reliability evaluation of systems and applications. This paper compares the reliability behaviors of six different architectures (an Intel co-processor, three NVIDIA GPUs, an AMD APU, and an embedded ARM) executing eight different codes. To support our evaluation, we present and discuss experimental beam data that covers...
In this paper, we evaluate the error criticality of radiation-induced errors on modern High-Performance Computing (HPC) accelerators (Intel Xeon Phi and NVIDIA K40) through a dedicated set of metrics. We show that, as long as imprecise computing is concerned, simple mismatch detection is not sufficient to compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications' output, correlating the number of corrupted elements with their spatial locality. Also, we provide...
In this paper we assess and discuss the efficiency and overhead of the Error-Correcting Code (ECC) mechanism available on modern GPGPUs, which are increasingly used for both High Performance Computing and safety-critical applications. Both the resilience to radiation-induced silent data corruption and to functional interruption are experimentally and analytically addressed. The provided experimental analysis demonstrates that ECC significantly reduces their occurrence but may not be sufficient to guarantee high reliability....
Most High Performance Computing (HPC) systems today are known as "power hungry" because they aim at computing speed regardless of energy consumption. Some scientific applications still claim more speed, and the community expects to reach exascale by the end of the decade. Nevertheless, we need to search for alternatives to cope with energy constraints. A promising step forward in this direction is the usage of low power processors such as ARM. ARM processors target low power consumption, in contrast to the Xeon processors that are conventional on HPC systems aiming at computing speed. This paper...
We present the results of accelerated radiation testing on an AMD accelerated processing unit, three Nvidia graphic processing units, an Intel accelerator, a field-programmable gate array, and two double-data-rate memories under thermal and high-energy neutrons separately. The sensitivity depends on the device type and the code being executed, and we show that thermal neutrons contribute to the error rate of modern computing devices under certain conditions.
Transient faults are a major problem for large scale HPC systems, and the mitigation of adverse fault effects needs to be highly efficient as we approach exascale. We developed a fault injection tool (CAROL-FI) to identify the potential sources of adverse fault effects. With a deeper understanding of such effects, we provide useful insights to design efficient mitigation techniques, like selective hardening of critical portions of the code.
The High-Performance Computing (HPC) community aimed for many years at increasing performance regardless of energy consumption. However, energy is limiting the scalability of next generation supercomputers. Current HPC systems already consume huge amounts of power, on the order of a few megawatts (MW). Future HPC systems intend to achieve 10 to 100 times more performance, but the accepted power consumption for those machines must remain below 20 MW. Therefore, the scientific community is investigating ways to improve energy efficiency. This paper presents a study of the execution...
Quantum computing is an up-and-coming technology that is expected to revolutionize the computation paradigm in the next few years. Qubits, the primary elements of quantum circuits, exploit quantum physics properties to increase parallelism and speed drastically. Unfortunately, besides being intrinsically noisy, qubits have also been shown to be highly susceptible to external sources of faults, such as ionizing radiation. The latest discoveries highlight a much higher radiation sensitivity of qubits than traditional transistors...
Quantum computing is one of the most promising technology advances of the latest years. Qubits are highly sensitive to noise, which can make the output useless. Lately, it has been shown that superconducting qubits are extremely susceptible to external sources of faults, such as ionizing radiation. When adopted in large scale, radiation-induced errors are expected to become a serious challenge for reliability. We propose an evaluation of the impact of transient faults on the execution of quantum circuits on chips. Inspired by...
The high performance, efficiency, and low cost of Commercial Off-The-Shelf (COTS) devices make them attractive for applications with strict reliability constraints. Today, COTS devices are adopted in HPC and safety-critical applications such as autonomous driving. Unfortunately, the cheap natural Boron widely used in the COTS chip manufacturing process makes them highly susceptible to thermal (low energy) neutrons. In this paper, we demonstrate that thermal neutrons are a significant threat to COTS device reliability. For our study, we consider an AMD...
In this paper we assess and discuss the radiation sensitivity of a set of HPC applications executed on NVIDIA K20 GPGPUs. The occurrence of both radiation-induced silent data corruption and functional interruption is experimentally addressed for Hotspot, LavaMD, and Matrix Transpose. Each of the tested codes requires its own level of computational power and processes a different amount of data. Both these characteristics play a significant role in the application's radiation sensitivity. Additionally, an evaluation of the error rate at...
Quantum computing (QC), by exploiting the quantum properties of quantum bits (qubits), significantly improves the performance and efficiency of computation. Unfortunately, quantum devices are very susceptible to external perturbation, including natural radiation. In this article, we measure, through GEANT4 simulations, the charge deposited by impinging neutrons in superconducting devices. As we show, most atmospheric neutrons deposit sufficient energy to break Cooper pairs and, thus, can potentially modify the qubit state. Then, with a...
The constant need for higher performance and reduced power consumption has led vendors to design heterogeneous devices that embed a traditional CPU and an accelerator, like a GPU or FPGA. When the CPU and the accelerator are used collaboratively, the device's computational performance reaches its peak. However, the higher amount of resources employed for computation has, potentially, the side effect of increasing the soft error rate. In this paper we evaluate the reliability behavior of AMD Kaveri Accelerated Processing Units executing a set of applications. We...
In this paper, we inspect the impact of modifying benchmarks' input sizes on parallel processors' reliability. A larger input size imposes a higher scheduler strain, potentially increasing the parallel processor's radiation sensitivity. Additionally, the input size affects the code's throughput, the number of resources used for computation, and their criticality. The impact of input size is experimentally studied by comparing the radiation sensitivity of three modern parallel processors: Intel Xeon Phi, NVIDIA K20, and NVIDIA K40. Our test procedure has shown that threads management...
In this paper, we investigate neutron-induced errors in three implementations of sort algorithms (QuickSort, MergeSort, and RadixSort) executed on modern graphics processing units designed for high-performance computing and large server applications. We measure the radiation-induced error rate taking advantage of the neutron beam available at the Los Alamos Neutron Science Center facility. We also analyze the output error criticality by identifying specific error patterns. We found that radiation can cause wrong elements to...
Quantum Computing is a highly promising new computation paradigm. Unfortunately, quantum bits (qubits) are extremely fragile and their state can be gradually or suddenly modified by intrinsic noise and external perturbation. In this paper, we target the sensitivity of quantum circuits to radiation-induced transient faults. We consider quantum circuit cuts that split the circuit into smaller independent portions, and understand how faults propagate in each portion. As we show, the portions have different vulnerabilities, and our methodology...
The error rate of current High Performance Computing (HPC) systems is already on the order of one per dozens of hours. Understanding the reliability behavior of HPC applications will be required for the next generation of supercomputers. With this understanding, one can select efficient mitigation techniques for the application and fine-tune parameters such as checkpoint frequency. In this paper, we investigate a machine learning model to predict the reliability behavior of applications. We inject faults in more than 30 applications executing on Intel Xeon Phi Knights Landing (KNL) and use...
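As a hedged illustration of how an error rate feeds into checkpoint frequency, the sketch below uses Young's first-order approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF). This is a standard textbook rule and is not claimed to be the method of the paper above; the function name and the numbers are hypothetical.

```python
import math


def young_checkpoint_interval(checkpoint_cost_h, mtbf_h):
    """Young's first-order estimate of the optimal time between checkpoints.

    Both arguments are in hours. Hypothetical helper for illustration only.
    """
    if checkpoint_cost_h <= 0 or mtbf_h <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_h * mtbf_h)


# Assumed values: writing a checkpoint takes 0.5 h and the system
# sees one failure every 36 h on average.
interval_h = young_checkpoint_interval(0.5, 36.0)
print(interval_h)  # sqrt(2 * 0.5 * 36) = 6.0
```

Under these assumed values, a lower mean time between failures shortens the suggested interval, which is why a per-application error-rate prediction is useful for tuning.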
In this paper, we evaluate the effects of reducing the average memory access time (AMAT) on graphics processing units' (GPU) performance and reliability, based on data obtained at the Los Alamos Neutron Science Center (LANSCE). We also measure the effects of input size changes on the neutron radiation sensitivity of a GPU running different applications. Results show an increase in the silent data corruption (SDC) cross section with AMAT optimizations, from a higher usage of unprotected registers and SRAM resources, single event functional...
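For readers unfamiliar with AMAT, a minimal sketch of the textbook single-level definition follows: AMAT = hit time + miss rate × miss penalty. The truncated abstract does not give the paper's own formulation, so the function name and the cycle counts here are assumptions.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time for a single cache level.

    hit_time and miss_penalty share one unit (cycles here);
    miss_rate is a fraction in [0, 1]. Hypothetical helper.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Assumed numbers: 4-cycle hit, 200-cycle miss penalty.
baseline = amat(4, 0.25, 200)    # 4 + 0.25 * 200
optimized = amat(4, 0.125, 200)  # 4 + 0.125 * 200
print(baseline, optimized)
```

Halving the miss rate in this example roughly halves AMAT, typically by keeping more data in on-chip storage, which relates to the resource-usage trade-off the abstract describes.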