Ana Gainaru

ORCID: 0000-0002-1375-9468
Research Areas
  • Distributed and Parallel Computing Systems
  • Advanced Data Storage Technologies
  • Cloud Computing and Resource Management
  • Parallel Computing and Optimization Techniques
  • Scientific Computing and Data Management
  • Software System Performance and Reliability
  • Distributed systems and fault tolerance
  • Software Reliability and Analysis Research
  • Age of Information Optimization
  • Data Mining Algorithms and Applications
  • Algorithms and Data Compression
  • Laser-induced spectroscopy and plasma
  • Magnetic confinement fusion research
  • Advanced Bandit Algorithms Research
  • Machine Learning and Algorithms
  • Ferroelectric and Negative Capacitance Devices
  • Network Security and Intrusion Detection
  • Advanced Text Analysis Techniques
  • Mobile Ad Hoc Networks
  • Medical Image Segmentation Techniques
  • Vehicular Ad Hoc Networks (VANETs)
  • Cell Image Analysis Techniques
  • Advanced Queuing Theory Analysis
  • VLSI and Analog Circuit Testing
  • Business Process Modeling and Analysis

Oak Ridge National Laboratory
2020-2025

Vanderbilt University
2017-2020

University of Illinois Urbana-Champaign
2011-2017

Mellanox Technologies (United States)
2016-2017

Illinois College
2016

National Center for Supercomputing Applications
2011-2013

Institut national de recherche en informatique et en automatique
2013

Urbana University
2012

Universitatea Națională de Știință și Tehnologie Politehnica București
2009-2011

A significant percentage of the computing capacity of large-scale platforms is wasted because of interferences incurred by multiple applications that access a shared parallel file system concurrently. One solution to handling I/O bursts in large-scale HPC systems is to absorb them at an intermediate storage layer consisting of burst buffers. However, our analysis of Argonne's Mira system shows that burst buffers cannot prevent congestion at all times. Consequently, I/O performance is dramatically degraded, showing in some cases a decrease...

10.1109/ipdps.2015.116 preprint EN 2015-05-01
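
The congestion mechanism described above can be made concrete with a toy model (all parameters below are invented, not measurements from Mira): a burst buffer with finite capacity and drain bandwidth absorbs bursts only while the sustained burst rate stays below its drain rate; once the buffer fills, writers stall.

```python
# Toy burst-buffer model: applications emit I/O bursts into a buffer that
# drains to the parallel file system (PFS) at a fixed bandwidth.
# All numbers are illustrative, not measurements from Mira.

BUFFER_CAPACITY_GB = 100.0
DRAIN_RATE_GBPS = 2.0          # buffer -> PFS bandwidth

def simulate(bursts_gbps, dt=1.0):
    """bursts_gbps: aggregate application write rate per time step."""
    level = 0.0
    stalled_steps = 0
    for rate in bursts_gbps:
        level += rate * dt             # data entering the buffer
        level -= DRAIN_RATE_GBPS * dt  # data drained to the PFS
        level = max(level, 0.0)
        if level > BUFFER_CAPACITY_GB:
            stalled_steps += 1         # writers must block: congestion
            level = BUFFER_CAPACITY_GB
    return stalled_steps

# A sustained 4 GB/s burst for 60 s overwhelms a 2 GB/s drain rate:
print(simulate([4.0] * 60))   # -> 10 steps of congestion after the buffer fills
```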

Data-driven science and technology offer transformative tools and methods to science. This review article highlights the latest development and progress in the interdisciplinary field of data-driven plasma science (DDPS), i.e., plasma science whose progress is driven strongly by data and data analyses. Plasma is considered to be the most ubiquitous form of observable matter in the universe. Data associated with plasmas can, therefore, cover extremely large spatial and temporal scales, and often provide essential information for other scientific disciplines. Thanks...

10.1109/tps.2023.3268170 article EN cc-by IEEE Transactions on Plasma Science 2023-07-01

A large percentage of the computing capacity in today's high-performance systems is wasted because of failures. Consequently, current research is focusing on providing fault tolerance strategies that aim to minimize a fault's effects on applications. By far, the most popular technique is the checkpoint-restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and preventive measures are taken. This requires a reliable prediction system to anticipate failures and their locations...

10.5555/2388996.2389101 article EN IEEE International Conference on High Performance Computing, Data, and Analytics 2012-11-10

As supercomputers and clusters increase in size and complexity, system failures are inevitable. Different hardware components (such as memory, disk, or network) of such systems can have different failure rates. Prior works assume that failures equally affect an application, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because failure dynamics are heterogeneous in space and time.

10.1145/2063384.2063444 preprint EN 2011-11-08
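
As a hedged illustration of component-aware failure modeling (invented rates, and the simplifying assumption of independent, exponentially distributed component failures, which is not necessarily the paper's model): an application's effective failure rate can be composed from the rates of only the components it actually exercises, so rates simply add.

```python
# Per-component failure rates (failures per hour) -- illustrative values only.
component_rates = {"memory": 1e-4, "disk": 5e-5, "network": 2e-4}

def app_failure_rate(usage):
    """usage: fraction of time the application exercises each component.
    Assuming independent exponential failures, weighted rates add."""
    return sum(component_rates[c] * u for c, u in usage.items())

# A network-heavy application vs. a disk-heavy one sees different MTBFs:
mpi_app = app_failure_rate({"memory": 1.0, "network": 0.8, "disk": 0.1})
io_app  = app_failure_rate({"memory": 1.0, "network": 0.2, "disk": 0.9})
print(f"MTBF (network-heavy): {1/mpi_app:.0f} h, MTBF (I/O-heavy): {1/io_app:.0f} h")
```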

10.1109/sc.2012.57 article EN International Conference for High Performance Computing, Networking, Storage and Analysis 2012-11-01

HPC systems are complex machines that generate a huge volume of system state data called "events". Events are generated without following a general consistent rule, and different hardware and software components can have different failure rates. Distinguishing between normal behaviour and a faulty situation relies on event analysis. Being able to quickly detect deviations from normality is essential for system administration and is the foundation of fault prediction. As systems continue to grow in size and complexity, mining event flows becomes more...

10.1109/ipdps.2012.107 article EN 2012-05-01

As the failure frequency increases with the component count in modern and future supercomputers, resilience is becoming critical for extreme-scale systems. The association of failure prediction with proactive checkpointing seeks to reduce the effect of failures on the execution time of parallel applications. Unfortunately, failure prediction does not systematically avoid restarting from scratch. To mitigate this issue, proactive checkpointing can be coupled with periodic checkpointing. However, the blind use of these techniques does not always improve system efficiency, because...

10.1109/ipdps.2013.74 article EN 2013-05-01
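
To make the trade-off concrete, here is a back-of-the-envelope sketch (illustrative numbers, not the paper's model) using the classic Young/Daly first-order approximation for the optimal periodic checkpoint interval, together with the intuition that a predictor with recall r leaves only a fraction (1 - r) of failures to be handled by periodic checkpointing.

```python
import math

# Illustrative platform parameters (not from the paper).
MTBF_HOURS = 24.0   # platform mean time between failures
C_HOURS = 0.1       # cost of writing one checkpoint

def young_daly_interval(mtbf, c):
    """First-order optimal checkpoint period: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * c * mtbf)

def waste(mtbf, c):
    """Approximate fraction of time lost to checkpoints + re-execution."""
    t = young_daly_interval(mtbf, c)
    return c / t + t / (2.0 * mtbf)

# A predictor with recall r turns a fraction r of failures into cheap,
# proactively checkpointed events; the effective MTBF seen by periodic
# checkpointing grows to MTBF / (1 - r).
for recall in (0.0, 0.5, 0.9):
    effective_mtbf = MTBF_HOURS / (1.0 - recall)
    print(f"recall={recall:.1f}: waste ~ {waste(effective_mtbf, C_HOURS):.1%}")
```

Under these toy numbers the waste drops from roughly 9% with no prediction to roughly 3% with 90% recall, which is why combining the two techniques pays off only when prediction quality is high enough to offset the cost of proactive actions.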

AI integration is revolutionizing the landscape of HPC simulations, enhancing the importance, use, and performance of AI-driven workflows. This paper surveys this diverse and rapidly evolving field and provides a common conceptual basis for understanding it. Specifically, we use insights from different modes of coupling AI into workflows to propose six execution motifs most commonly found in scientific applications. The proposed set is by definition incomplete and evolving. However, the motifs allow us to analyze the primary challenges...

10.48550/arxiv.2406.14315 preprint EN arXiv (Cornell University) 2024-06-20

As high performance computing architecture evolves to deliver ever-increasing performance, the middleware tools also need to adapt in order for applications to better use these higher-performance features. The Adaptable Input Output System (ADIOS), which provides scalable I/O for exascale HPC applications, is one such middleware. During the Exascale Computing Project (ECP), key portions of the ADIOS environment were adapted to respond to ongoing developments and the stresses and opportunities inherent in those changes. This paper examines...

10.1177/10943420251330446 article EN The International Journal of High Performance Computing Applications 2025-04-04

The validation of mobile ad hoc technologies relies almost exclusively on modeling and simulation. In this paper we present a novel mobility model based on social network theory. It is designed to accurately reflect the realistic behavior of the actors involved in various VANET simulation scenarios. This is much needed as, in order to have a high degree of confidence when using simulation, the simulator (as well as the mobility model) must act very realistically. However, most models currently in use are simplistic. The model being presented is part of VNSim, a generic simulator...

10.1109/vetecs.2009.5073334 article EN 2009-04-01

Resilience is an important challenge for extreme-scale supercomputers. Today, failures in supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. Our study of the logs of multiple systems shows that periods of higher failure density can see up to three times more failures than average. We design a monitoring system that listens to hardware events and forwards them to the runtime to detect those regime changes. We implement a system capable...

10.1109/ipdps.2016.100 article EN IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016-05-01
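
A minimal sketch of the kind of regime-change detection the abstract alludes to (sliding-window event counting with invented window length and threshold; the actual system listens to hardware events at runtime and is more sophisticated):

```python
from collections import deque

class FailureDensityMonitor:
    """Flag periods where the failure count in a window exceeds k times
    the long-run expectation. All parameters are illustrative choices."""
    def __init__(self, window_s=3600.0, avg_rate_per_s=1/7200.0, k=3.0):
        self.window_s = window_s
        self.threshold = k * avg_rate_per_s * window_s  # expected count * k
        self.events = deque()

    def observe(self, timestamp_s):
        self.events.append(timestamp_s)
        while self.events and self.events[0] < timestamp_s - self.window_s:
            self.events.popleft()
        return len(self.events) > self.threshold  # True => high-density regime

mon = FailureDensityMonitor()
# A burst of four failures within ten minutes trips the detector:
for t in (0.0, 100.0, 300.0, 600.0):
    print(mon.observe(t))   # False, then True for the rest of the burst
```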

We describe MGARD, a software package providing MultiGrid Adaptive Reduction for floating-point scientific data on structured and unstructured grids. With exceptional compression capability and precise error control, MGARD addresses a wide range of requirements, including storage reduction, high-performance I/O, and in-situ data analysis. It features a unified application programming interface (API) that seamlessly operates across diverse computing architectures. MGARD has been optimized with highly-tuned GPU kernels...

10.1016/j.softx.2023.101590 article EN cc-by-nc-nd SoftwareX 2023-11-22
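
MGARD's real interface is a C++/GPU API built around a multigrid transform; the toy below only illustrates, in Python, the kind of hard pointwise error bound such error-controlled compressors guarantee, using plain uniform quantization (no multigrid transform, no entropy coding, and none of MGARD's actual API):

```python
import numpy as np

def quantize(data, tol):
    """Uniform scalar quantization with bin width 2*tol guarantees a
    pointwise (L-infinity) reconstruction error of at most tol -- the
    same style of hard error bound MGARD offers, minus the multigrid
    decomposition and lossless coding that give it its compression."""
    return np.round(data / (2.0 * tol)).astype(np.int64)

def dequantize(q, tol):
    return q * (2.0 * tol)

rng = np.random.default_rng(0)
data = np.cumsum(rng.normal(size=1_000_000))  # smooth-ish, signal-like field
tol = 1e-2

q = quantize(data, tol)
recon = dequantize(q, tol)
# The bound holds pointwise (tiny slack for floating-point rounding):
assert np.max(np.abs(data - recon)) <= tol + 1e-9
print(np.max(np.abs(data - recon)))
```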

In this paper, we analyse messages generated by different large-scale HPC systems in order to extract sequences of correlated events, which we later use to predict the normal and faulty behaviour of the system. Our method uses a dynamic window strategy that is able to find frequent sequences regardless of the time delay between the events. Most current related research narrows correlation extraction to fixed and relatively small time windows that do not reflect the whole system. The correlations are in constant change during the lifetime of a machine. We consider it important...

10.1145/2038633.2038637 article EN 2011-10-23
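
As a rough illustration of why a fixed small window is too rigid (invented event names and a deliberately simplified correlation test; the paper's dynamic-window method is more elaborate): the sketch below counts ordered event-type pairs co-occurring within a generous window and reports the delays actually observed, rather than imposing one fixed small lag on every pair.

```python
from collections import Counter
from itertools import combinations

# (timestamp_s, event_type) -- invented log excerpt.
log = [(0, "ECC_WARN"), (40, "ECC_WARN"), (55, "NODE_FAIL"),
       (300, "ECC_WARN"), (330, "NODE_FAIL"), (900, "LINK_FLAP"),
       (1500, "ECC_WARN"), (1520, "NODE_FAIL")]

def correlated_pairs(log, max_window_s=600, min_support=2):
    """Count ordered event-type pairs (a -> b); the effective window per
    pair is whatever delays the data exhibits, up to a loose cap."""
    counts, delays = Counter(), {}
    for (t1, a), (t2, b) in combinations(log, 2):
        if a != b and t2 - t1 <= max_window_s:
            counts[(a, b)] += 1
            delays.setdefault((a, b), []).append(t2 - t1)
    return {p: (c, max(delays[p])) for p, c in counts.items() if c >= min_support}

for (a, b), (n, d) in correlated_pairs(log).items():
    print(f"{a} -> {b}: {n} occurrences, delays up to {d}s")
# -> ECC_WARN -> NODE_FAIL survives the support filter despite delays
#    ranging from 15s to 330s, which a fixed 60s window would miss.
```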

As large-scale systems evolve towards post-petascale computing, it is crucial to focus on providing fault-tolerance strategies that aim to minimize a fault's effects on applications. By far, the most popular technique is the checkpoint–restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and proactive measures are taken. This requires a reliable prediction system to anticipate failures and their locations. One way of offering this is the analysis of the logs generated during...

10.1177/1094342013488258 article EN The International Journal of High Performance Computing Applications 2013-07-03

Scheduling in High-Performance Computing (HPC) has been traditionally centered around computing resources (e.g., processors/cores). The ever-growing amount of data produced by modern scientific applications starts to drive novel architectures and new computing frameworks to support more efficient data processing, transfer and storage for future HPC systems. This trend towards data-driven computing demands that scheduling solutions also consider other resources (e.g., I/O, memory, cache) that can be shared amongst competing applications. In...

10.1109/ipdps.2018.00029 article EN IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2018-05-01

In this paper, we are interested in scheduling stochastic jobs on a reservation-based platform. Specifically, we consider jobs whose execution time follows a known probability distribution. The platform is reservation-based, meaning that the user has to request fixed-length time slots. The cost then depends both on (i) the requested duration (pay for what you ask); and (ii) the actual execution time of the job (pay for what you use). A reservation strategy determines a sequence of increasing-length reservations, which are paid for until one of them allows the job to successfully complete. The goal...

10.1109/ipdps.2019.00027 article EN IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2019-05-01
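
A small sketch of the "pay for what you ask" part of this cost model (the paper's pricing and optimal strategies are richer than this): given a job-length distribution and an increasing sequence of reservations, the expected cost sums each reservation's price weighted by the probability that all shorter reservations failed.

```python
import math

def expected_cost(reservations, survival):
    """Expected pay-for-what-you-ask cost of an increasing reservation
    sequence: reservation t_i is purchased whenever the job exceeded
    every earlier reservation, i.e. with probability P(X > t_{i-1})."""
    cost, prev = 0.0, 0.0
    for t in reservations:
        cost += t * survival(prev)   # attempted (and paid) with this probability
        prev = t
    return cost

# Example: job length X ~ Exp(mean 2 h), so P(X > t) = exp(-t / 2).
surv = lambda t: math.exp(-t / 2.0)

# Doubling strategy vs. one big reservation (sequence truncated for
# illustration; with unbounded X a real strategy keeps growing):
print(expected_cost([2.0, 4.0, 8.0, 16.0], surv))   # ~4.85 h expected
print(expected_cost([8.0], surv))                   #  8.00 h, always paid
```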

We observe the emergence of a new generation of scientific workflows that process data produced at a sustained rate by instruments and large-scale numerical simulations. This data is consumed by multiple analysis, visualization, or Machine Learning components, not only to enable inference and justify the program, but also to monitor and steer the evolution of these experiments. In such workflows, moving intermediate data efficiently is key to performance, even more than scheduling computational tasks. However, most traditional workflow...

10.1109/e-science58273.2023.10254849 article EN 2023-10-09

New emerging fields are developing a growing number of large-scale applications with heterogeneous, dynamic and data-intensive requirements that put a high emphasis on productivity, and thus are not tuned to run efficiently on today's high performance computing (HPC) systems. Some of these applications, such as neuroscience workloads and those that use adaptive numerical algorithms, develop modeling and simulation workflows with stochastic execution times and unpredictable resource requirements. When they are deployed on current HPC systems...

10.1145/3337821.3337890 article EN 2019-07-25

The MPI all-to-all algorithm is a data intensive, high-cost collective used by many scientific High Performance Computing applications. Optimizations for small data exchanges use aggregation techniques, such as the Bruck algorithm, to minimize the number of messages sent and the overall operation latency. This paper presents three variants of the algorithm which differ in the way data is laid out in memory at the intermediate steps of the algorithm. Mellanox's InfiniBand support for Host Channel Adapter (HCA) hardware scatter/gather is used to selectively...

10.1145/2966884.2966918 article EN 2016-09-25
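
For readers unfamiliar with the Bruck algorithm the paper builds on, here is a minimal single-process simulation (plain Python lists standing in for MPI buffers; the memory-layout variants that are the paper's actual contribution are glossed over). A block destined a relative distance j away hops forward by 2^k exactly when bit k of j is set, so every process sends only about log2(P) aggregated messages instead of P individual ones.

```python
def bruck_alltoall(send):
    """Simulate the Bruck all-to-all among P 'processes'.
    send[i][d] = block that process i wants delivered to process d."""
    P = len(send)
    # Phase 1: local rotation; buf[i][j] = data from i at destination offset j.
    buf = [[send[i][(i + j) % P] for j in range(P)] for i in range(P)]
    # Phase 2: ~log2(P) exchange rounds; a block at index j travels a total
    # distance j by moving +k (k = 1, 2, 4, ...) whenever bit k of j is set.
    k = 1
    while k < P:
        new = [row[:] for row in buf]
        for i in range(P):
            for j in range(P):
                if j & k:                       # bit set -> hop forward by k
                    new[(i + k) % P][j] = buf[i][j]
        buf, k = new, k * 2
    # Phase 3: inverse rotation; the block at index j came from process p - j.
    return [[buf[p][(p - src) % P] for src in range(P)] for p in range(P)]

P = 5  # works for non-powers of two as well
send = [[f"{src}->{dst}" for dst in range(P)] for src in range(P)]
recv = bruck_alltoall(send)
assert all(recv[p][s] == f"{s}->{p}" for p in range(P) for s in range(P))
print(recv[2])   # ['0->2', '1->2', '2->2', '3->2', '4->2']
```

The aggregation trade-off is visible in phase 2: each block may be forwarded several times, which is why the data layout at intermediate steps, the subject of the paper's variants, matters for performance.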