NFDI4DS | UHH-SEMS - Publication Details

Ana Gainaru

ORCID: 0000-0002-1375-9468

Contact & Profiles

A5021627261

Research Areas

Distributed and Parallel Computing Systems
Advanced Data Storage Technologies
Cloud Computing and Resource Management
Parallel Computing and Optimization Techniques
Scientific Computing and Data Management
Software System Performance and Reliability
Distributed systems and fault tolerance
Software Reliability and Analysis Research
Age of Information Optimization
Data Mining Algorithms and Applications
Algorithms and Data Compression
Laser-induced spectroscopy and plasma
Magnetic confinement fusion research
Advanced Bandit Algorithms Research
Machine Learning and Algorithms
Ferroelectric and Negative Capacitance Devices
Network Security and Intrusion Detection
Advanced Text Analysis Techniques
Mobile Ad Hoc Networks
Medical Image Segmentation Techniques
Vehicular Ad Hoc Networks (VANETs)
Cell Image Analysis Techniques
Advanced Queuing Theory Analysis
VLSI and Analog Circuit Testing
Business Process Modeling and Analysis

Oak Ridge National Laboratory
2020-2025

Vanderbilt University
2017-2020

University of Illinois Urbana-Champaign
2011-2017

Mellanox Technologies (United States)
2016-2017

Illinois College
2016

National Center for Supercomputing Applications
2011-2013

Institut national de recherche en informatique et en automatique
2013

Urbana University
2012

Universitatea Națională de Știință și Tehnologie Politehnica București
2009-2011

Scheduling the I/O of HPC Applications Under Congestion

OPENALEX - Publications

Ana Gainaru Guillaume Aupy Anne Benoît Franck Cappello Yves Robert and 1 more

A significant percentage of the computing capacity of large-scale platforms is wasted because of interferences incurred by multiple applications that access a shared parallel file system concurrently. One solution to handling I/O bursts in large-scale HPC systems is to absorb them at an intermediate storage layer consisting of burst buffers. However, our analysis of Argonne's Mira shows that burst buffers cannot prevent congestion at all times. Consequently, I/O performances are dramatically degraded, showing in some cases a decrease...

10.1109/ipdps.2015.116 preprint EN 2015-05-01

2022 Review of Data-Driven Plasma Science

OPENALEX - Publications

Rushil Anirudh Richard Archibald M. Salman Asif Markus M. Becker S. Benkadda and 58 more

Data-driven science and technology offer transformative tools and methods to science. This review article highlights the latest development and progress in the interdisciplinary field of data-driven plasma science (DDPS), i.e., plasma science whose progress is driven strongly by data and data analyses. Plasma is considered to be the most ubiquitous form of observable matter in the universe. Data associated with plasmas can, therefore, cover extremely large spatial and temporal scales, and often provide essential information for other scientific disciplines. Thanks...

10.1109/tps.2023.3268170 article EN cc-by IEEE Transactions on Plasma Science 2023-07-01

Fault prediction under the microscope: a closer look into HPC systems

OPENALEX - Publications

Ana Gainaru Franck Cappello Marc Snir William Kramer

A large percentage of computing capacity in today's high-performance computing systems is wasted because of failures. Consequently, current research is focusing on providing fault tolerance strategies that aim to minimize a fault's effects on applications. By far the most popular technique is the checkpoint-restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and preventive measures are taken. This requires a reliable prediction system to anticipate failures and their locations....

10.5555/2388996.2389101 article EN IEEE International Conference on High Performance Computing, Data, and Analytics 2012-11-10

Modeling and tolerating heterogeneous failures in large parallel systems

OPENALEX - Publications

E. M. Heien Derrick Kondo Ana Gainaru Daniel LaPine Bill Kramer and 1 more

As supercomputers and clusters increase in size and complexity, system failures are inevitable. Different hardware components (such as memory, disk, or network) of such systems can have different failure rates. Prior works assume failures equally affect an application, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because failure dynamics are heterogeneous in space and time.

10.1145/2063384.2063444 preprint EN 2011-11-08

Fault prediction under the microscope: A closer look into HPC systems

OPENALEX - Publications

Ana Gainaru Franck Cappello Marc Snir William Kramer

10.1109/sc.2012.57 article EN International Conference for High Performance Computing, Networking, Storage and Analysis 2012-11-01

Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems

OPENALEX - Publications

Ana Gainaru Franck Cappello William Kramer

HPC systems are complex machines that generate a huge volume of system state data called "events". Events are generated without following a general consistent rule, and different hardware and software components of such systems have different failure rates. Distinguishing between normal system behaviour and a faulty situation relies on event analysis. Being able to quickly detect deviations from normality is essential for system administration and is the foundation of fault prediction. As HPC systems continue to grow in size and complexity, mining event flows become more...

10.1109/ipdps.2012.107 article EN 2012-05-01

Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing

OPENALEX - Publications

Mohamed Slim Bouguerra Ana Gainaru Leonardo Bautista-Gomez Franck Cappello Satoshi Matsuoka and 1 more

As the failure frequency is increasing with the components count in modern and future supercomputers, resilience is becoming critical for extreme scale systems. The association of failure prediction with proactive checkpointing seeks to reduce the effect of failures in the execution time of parallel applications. Unfortunately, proactive checkpointing does not systematically avoid restarting from scratch. To mitigate this issue, failure prediction and proactive checkpointing can be coupled with periodic checkpointing. However, blind use of these techniques does not always improve system efficiency, because...

10.1109/ipdps.2013.74 article EN 2013-05-01

AI-coupled HPC Workflow Applications, Middleware and Performance

OPENALEX - Publications

Wes Brewer Ana Gainaru Frédéric Suter Feiyi Wang Murali Emani and 1 more

AI integration is revolutionizing the landscape of HPC simulations, enhancing the importance, use, and performance of AI-driven HPC workflows. This paper surveys the diverse and rapidly evolving field of AI-driven HPC and provides a common conceptual basis for understanding AI-driven HPC workflows. Specifically, we use insights from different modes of coupling AI into HPC workflows to propose six execution motifs most commonly found in scientific applications. The proposed set of execution motifs is by definition incomplete and evolving. However, they allow us to analyze the primary challenges...

10.48550/arxiv.2406.14315 preprint EN arXiv (Cornell University) 2024-06-20

HPC I/O innovations in the exascale era

OPENALEX - Publications

Greg Eisenhauer Norbert Podhorszki Ana Gainaru Scott Klasky Junmin Gu and 9 more

As high performance computing architecture evolves to deliver ever-increasing performance, the middleware tools also need to adapt in order for applications to better use these higher-performance features. The Adaptable Input Output System (ADIOS), which provides scalable IO for exascale HPC applications, is one such middleware. During the Exascale Computing Project (ECP), key portions of the ADIOS environment were adapted to respond to ongoing developments and the stresses and opportunities inherent in those changes. This paper examines...

10.1177/10943420251330446 article EN The International Journal of High Performance Computing Applications 2025-04-04

A Realistic Mobility Model Based on Social Networks for the Simulation of VANETs

OPENALEX - Publications

Ana Gainaru Ciprian Dobre Valentin Cristea

The validation of mobile ad hoc technologies relies almost exclusively on modeling and simulation. In this paper we present a novel mobility model based on social network theory. The model is designed to accurately reflect the realistic mobility of the involved actors in various VANET simulation scenarios. This is much needed as, in order to have a high degree of confidence in using simulation, the simulation (as well as the model) must act very realistic. However, most of the mobility models currently used are simplistic. The model being presented is part of VNSim, a generic simulator...

10.1109/vetecs.2009.5073334 article EN 2009-04-01

Reducing Waste in Extreme Scale Systems through Introspective Analysis

OPENALEX - Publications

Leonardo Bautista-Gomez Ana Gainaru Swann Perarnau Devesh Tiwari Saurabh Gupta and 3 more

Resilience is an important challenge for extreme-scale supercomputers. Today, failures in supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. Our study of the failure logs of multiple supercomputers shows that periods of higher failure density occur with up to three times more failures than the average. We design a monitoring system that listens to hardware events and forwards important events to the runtime to detect those regime changes. We implement a runtime capable...

10.1109/ipdps.2016.100 article EN IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016-05-01

MGARD: A multigrid framework for high-performance, error-controlled data compression and refactoring

OPENALEX - Publications

Qian Gong Jieyang Chen Ben Whitney Xin Liang Viktor Reshniak and 11 more

We describe MGARD, a software providing MultiGrid Adaptive Reduction for floating-point scientific data on structured and unstructured grids. With exceptional data compression capability and precise error control, MGARD addresses a wide range of requirements, including storage reduction, high-performance I/O, and in-situ data analysis. It features a unified application programming interface (API) that seamlessly operates across diverse computing architectures. MGARD has been optimized with highly-tuned GPU kernels...

10.1016/j.softx.2023.101590 article EN cc-by-nc-nd SoftwareX 2023-11-22

Adaptive event prediction strategy with dynamic time window for large-scale HPC systems

OPENALEX - Publications

Ana Gainaru Franck Cappello Joshi Fullop Ştefan Trăuşan-Matu William Kramer

In this paper, we analyse messages generated by different HPC large-scale systems in order to extract sequences of correlated events, which we later use to predict the normal and faulty behaviour of the system. Our method uses a dynamic window strategy that is able to find frequent sequences of events regardless of the time delay between them. Most of the current related research narrows the correlation extraction to fixed and relatively small time windows that do not reflect the whole behaviour of the system. The generated events are in constant change during the lifetime of the machine. We consider that it is important...

10.1145/2038633.2038637 article EN 2011-10-23

Failure prediction for HPC systems and applications

OPENALEX - Publications

Ana Gainaru Franck Cappello Marc Snir William Kramer

As large-scale systems evolve towards post-petascale computing, it is crucial to focus on providing fault-tolerance strategies that aim to minimize a fault’s effects on applications. By far the most popular technique is the checkpoint–restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and proactive measures are taken. This requires a reliable prediction system to anticipate failures and their locations. One way of offering prediction is by the analysis of logs generated during...

10.1177/1094342013488258 article EN The International Journal of High Performance Computing Applications 2013-07-03

Scheduling Parallel Tasks under Multiple Resources: List Scheduling vs. Pack Scheduling

OPENALEX - Publications

Hongyang Sun Redouane Elghazi Ana Gainaru Guillaume Aupy Padma Raghavan

Scheduling in High-Performance Computing (HPC) has been traditionally centered around computing resources (e.g., processors/cores). The ever-growing amount of data produced by modern scientific applications starts to drive novel architectures and new frameworks to support more efficient data processing, transfer and storage for future HPC systems. This trend towards data-driven computing demands that scheduling solutions also consider other resources (e.g., I/O, memory, cache) that can be shared amongst competing applications. In...

10.1109/ipdps.2018.00029 article EN IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2018-05-01

Reservation Strategies for Stochastic Jobs

OPENALEX - Publications

Guillaume Aupy Ana Gainaru Valentin Honoré Padma Raghavan Yves Robert and 1 more

In this paper, we are interested in scheduling stochastic jobs on a reservation-based platform. Specifically, we consider jobs whose execution time follows a known probability distribution. The platform is reservation-based, meaning that the user has to request fixed-length time slots. The cost then depends on both (i) the request duration (pay for what you ask); and (ii) the actual execution time of the job (pay for what you use). A reservation strategy determines a sequence of increasing-length reservations, which are paid for until one of them allows the job to successfully complete. The goal...

10.1109/ipdps.2019.00027 article EN IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2019-05-01
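
As a rough illustration of the reservation cost model described in the abstract above, the sketch below charges each reservation for its requested length and for the time actually used, and keeps paying until a reservation is long enough for the job. The linear weights `alpha` and `beta`, the function names, and the discrete outcome list are assumptions made for this example; they are not taken from the listing and should not be read as the paper's exact formulation.

```python
from typing import Sequence, Tuple


def total_cost(reservations: Sequence[float], actual_time: float,
               alpha: float = 1.0, beta: float = 0.0) -> float:
    """Cost of running one job through a sequence of reservations.

    alpha weights the requested length (pay for what you ask);
    beta weights the time actually used (pay for what you use).
    Both weights are illustrative assumptions.
    """
    cost = 0.0
    for length in reservations:
        used = min(length, actual_time)
        cost += alpha * length + beta * used
        if actual_time <= length:
            # The job fits in this reservation, so it completes here.
            return cost
    raise ValueError("no reservation is long enough for the job")


def expected_cost(reservations: Sequence[float],
                  outcomes: Sequence[Tuple[float, float]],
                  alpha: float = 1.0, beta: float = 0.0) -> float:
    """Expected cost over a discrete distribution of (time, probability)."""
    return sum(p * total_cost(reservations, t, alpha, beta)
               for t, p in outcomes)


if __name__ == "__main__":
    # A job taking 1 or 3 time units with equal probability.
    print(expected_cost([2.0, 4.0], [(1.0, 0.5), (3.0, 0.5)]))
```

A failed reservation is still paid in full here, which is why a strategy has to trade off short first requests against the risk of paying several times.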

Driving Next-Generation Workflows from the Data Plane

OPENALEX - Publications

Frédéric Suter Rafael Ferreira da Silva Ana Gainaru Scott Klasky

We observe the emergence of a new generation of scientific workflows that process data produced at a sustained rate by scientific instruments and large scale numerical simulations. This data is consumed by multiple analysis, visualization, or Machine Learning components not only to enable the scientific inference that justifies the program, but also to monitor and steer the evolution of these experiments. In such workflows, moving intermediate data efficiently is key to performance, more than scheduling computational tasks. However, most traditional workflow...

10.1109/e-science58273.2023.10254849 article EN 2023-10-09

Hades: A Context-Aware Active Storage Framework for Accelerating Large-Scale Data Analysis

OPENALEX - Publications

Jaime Cernuda Luke Logan Ana Gainaru Scott Klasky Jay Lofstead and 2 more

10.1109/ccgrid59990.2024.00070 article EN 2024-05-06

Speculative Scheduling for Stochastic HPC Applications

OPENALEX - Publications

Ana Gainaru Guillaume Aupy Hongyang Sun Padma Raghavan

New emerging fields are developing a growing number of large-scale applications with heterogeneous, dynamic and data-intensive requirements that put a high emphasis on productivity and thus are not tuned to run efficiently on today's high performance computing (HPC) systems. Some of these applications, such as neuroscience workloads and those that use adaptive numerical algorithms, develop modeling and simulation workflows with stochastic execution times and unpredictable resource requirements. When they are deployed on current HPC systems...

10.1145/3337821.3337890 article EN 2019-07-25

Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All

OPENALEX - Publications

Ana Gainaru Richard L. Graham Artem Y. Polyakov Gilad Shainer

The MPI all-to-all algorithm is a data intensive, high-cost collective algorithm used by many scientific High Performance Computing applications. Optimizations for small data exchange use aggregation techniques, such as the Bruck algorithm, to minimize the number of messages sent, and minimize overall operation latency. This paper presents three variants of the Bruck algorithm, which differ in the way data is laid out in memory at intermediate steps of the algorithm. Mellanox's InfiniBand support for Host Channel Adapter (HCA) hardware scatter/gather is used selectively...

10.1145/2966884.2966918 article EN 2016-09-25
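
To make the aggregation idea mentioned in the abstract above more concrete, here is a small single-process simulation of the textbook Bruck all-to-all exchange. It is only a sketch: it is not one of the three variants the paper presents, it does not use MPI or InfiniBand scatter/gather, and the function name and the list-of-lists data layout are assumptions chosen for readability.

```python
from typing import Any, List, Tuple


def bruck_all_to_all(send: List[List[Any]]) -> Tuple[List[List[Any]], int]:
    """Simulate the textbook Bruck all-to-all on n ranks in one process.

    send[p][d] is the block rank p wants to deliver to rank d.
    Returns (recv, messages) where recv[q][s] is the block rank q
    received from rank s, and messages counts aggregated sends.
    """
    n = len(send)
    # Phase 1: local rotation, so index j holds the block for rank (p + j) % n.
    buf = [[send[p][(p + j) % n] for j in range(n)] for p in range(n)]

    messages = 0
    k = 1
    while k < n:
        # Blocks whose index has this bit set travel k ranks forward,
        # packed together into a single message per rank.
        moving = [j for j in range(n) if j & k]
        new_buf = [row[:] for row in buf]
        for p in range(n):
            dest = (p + k) % n
            for j in moving:
                new_buf[dest][j] = buf[p][j]
            messages += 1
        buf = new_buf
        k <<= 1

    # Phase 3: index j on rank q now holds the block sent by rank (q - j) % n.
    recv = [[buf[q][(q - s) % n] for s in range(n)] for q in range(n)]
    return recv, messages


if __name__ == "__main__":
    n = 5
    data = [[(p, d) for d in range(n)] for p in range(n)]
    result, sent = bruck_all_to_all(data)
    print(sent, result[2])
```

Each rank sends one aggregated message per round and there are about log2(n) rounds, compared with n - 1 separate messages per rank in a direct pairwise exchange; the price is that blocks are forwarded through intermediate ranks.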