- Distributed and Parallel Computing Systems
- Advanced Data Storage Technologies
- Cloud Computing and Resource Management
- Parallel Computing and Optimization Techniques
- Scientific Computing and Data Management
- Software System Performance and Reliability
- Distributed systems and fault tolerance
- Software Reliability and Analysis Research
- Age of Information Optimization
- Data Mining Algorithms and Applications
- Algorithms and Data Compression
- Laser-induced spectroscopy and plasma
- Magnetic confinement fusion research
- Advanced Bandit Algorithms Research
- Machine Learning and Algorithms
- Ferroelectric and Negative Capacitance Devices
- Network Security and Intrusion Detection
- Advanced Text Analysis Techniques
- Mobile Ad Hoc Networks
- Medical Image Segmentation Techniques
- Vehicular Ad Hoc Networks (VANETs)
- Cell Image Analysis Techniques
- Advanced Queuing Theory Analysis
- VLSI and Analog Circuit Testing
- Business Process Modeling and Analysis
Oak Ridge National Laboratory
2020-2025
Vanderbilt University
2017-2020
University of Illinois Urbana-Champaign
2011-2017
Mellanox Technologies (United States)
2016-2017
Illinois College
2016
National Center for Supercomputing Applications
2011-2013
Institut national de recherche en informatique et en automatique
2013
Urbana University
2012
Universitatea Națională de Știință și Tehnologie Politehnica București
2009-2011
A significant percentage of the computing capacity of large-scale platforms is wasted because of interferences incurred by multiple applications that access a shared parallel file system concurrently. One solution to handling I/O bursts in large-scale HPC systems is to absorb them at an intermediate storage layer consisting of burst buffers. However, our analysis of Argonne's Mira shows that burst buffers cannot prevent congestion at all times. Consequently, performance is dramatically degraded, showing in some cases a decrease...
Data-driven science and technology offer transformative tools and methods to science. This review article highlights the latest development and progress in the interdisciplinary field of data-driven plasma science (DDPS), i.e., plasma science whose progress is driven strongly by data and data analyses. Plasma is considered to be the most ubiquitous form of observable matter in the universe. Data associated with plasmas can, therefore, cover extremely large spatial and temporal scales, and often provide essential information for other scientific disciplines. Thanks...
A large percentage of computing capacity in today's high-performance computing systems is wasted because of failures. Consequently, current research is focusing on providing fault tolerance strategies that aim to minimize a fault's effects on applications. By far the most popular technique is the checkpoint-restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and preventive measures are taken. This requires a reliable prediction system to anticipate failures and their locations....
As supercomputers and clusters increase in size and complexity, system failures are inevitable. Different hardware components (such as memory, disk, or network) of such systems can have different failure rates. Prior works assume that failures equally affect an application, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because the dynamics are heterogeneous in space and time.
HPC systems are complex machines that generate a huge volume of system state data called "events". Events are generated without following a general consistent rule, and different hardware and software components of such systems have different failure rates. Distinguishing between normal system behaviour and a faulty situation relies on event analysis. Being able to quickly detect deviations from normality is essential for system administration and is the foundation of fault prediction. As HPC systems continue to grow in size and complexity, mining event flows becomes more...
As the failure frequency is increasing with the component count in modern and future supercomputers, resilience is becoming critical for extreme-scale systems. The association of failure prediction with proactive checkpointing seeks to reduce the effect of failures on the execution time of parallel applications. Unfortunately, proactive checkpointing does not systematically avoid restarting from scratch. To mitigate this issue, failure prediction and proactive checkpointing can be coupled with periodic checkpointing. However, blind use of these techniques does not always improve system efficiency, because...
AI integration is revolutionizing the landscape of HPC simulations, enhancing the importance, use, and performance of AI-driven HPC workflows. This paper surveys this diverse and rapidly evolving field and provides a common conceptual basis for understanding AI-driven workflows. Specifically, we use insights from different modes of coupling AI into HPC workflows to propose six execution motifs most commonly found in scientific applications. The proposed set is by definition incomplete and evolving. However, these motifs allow us to analyze the primary challenges...
As high performance computing architecture evolves to deliver ever-increasing performance, the middleware tools also need to adapt in order for applications to better use these higher-performance features. The Adaptable Input Output System (ADIOS), which provides scalable I/O for exascale HPC, is one such middleware. During the Exascale Computing Project (ECP), key portions of the ADIOS environment were adapted to respond to ongoing developments and to the stresses and opportunities inherent in those changes. This paper examines...
The validation of mobile ad hoc technologies relies almost exclusively on modeling and simulation. In this paper we present a novel mobility model based on social network theory. The model is designed to accurately reflect the realistic mobility of the actors involved in various VANET simulation scenarios. This is much needed because, in order to have a high degree of confidence when using simulation, the simulator (as well as the model) must behave very realistically. However, most mobility models currently used are simplistic. The model being presented is part of VNSim, a generic simulator...
Resilience is an important challenge for extreme-scale supercomputers. Today, failures in supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. Our study of the failure logs of multiple supercomputers shows that periods of higher failure density occur with up to three times more failures than average. We design a monitoring system that listens to hardware events and forwards them to the runtime to detect those regime changes. We implement a runtime capable...
We describe MGARD, a software package providing MultiGrid Adaptive Reduction for floating-point scientific data on structured and unstructured grids. With exceptional data compression capability and precise error control, MGARD addresses a wide range of requirements, including storage reduction, high-performance I/O, and in-situ data analysis. It features a unified application programming interface (API) that seamlessly operates across diverse computing architectures. MGARD has been optimized with highly-tuned GPU kernels...
In this paper, we analyse messages generated by different large-scale HPC systems in order to extract sequences of correlated events, which we later use to predict the normal and faulty behaviour of the system. Our method uses a dynamic window strategy that is able to find frequent sequences of events regardless of the time delay between them. Most current related research narrows the correlation extraction to fixed and relatively small time windows that do not reflect the whole behaviour of the system. The events are in constant change during the lifetime of the machine. We consider it important...
As large-scale systems evolve towards post-petascale computing, it is crucial to focus on providing fault-tolerance strategies that aim to minimize a fault’s effects on applications. By far the most popular technique is the checkpoint–restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and proactive measures are taken. This requires a reliable prediction system to anticipate failures and their locations. One way of offering prediction is the analysis of logs generated during...
Scheduling in High-Performance Computing (HPC) has been traditionally centered around computing resources (e.g., processors/cores). The ever-growing amount of data produced by modern scientific applications is starting to drive novel architectures and new frameworks to support more efficient data processing, transfer and storage for future HPC systems. This trend towards data-driven computing demands that scheduling solutions also consider other resources (e.g., I/O, memory, cache) that can be shared amongst competing applications. In...
In this paper, we are interested in scheduling stochastic jobs on a reservation-based platform. Specifically, we consider jobs whose execution time follows a known probability distribution. The platform is reservation-based, meaning that the user has to request fixed-length time slots. The cost then depends on both (i) the request duration (pay for what you ask); and (ii) the actual execution time of the job (pay for what you use). A reservation strategy determines a sequence of reservations of increasing length, which are paid for until one of them allows the job to successfully complete. The goal...
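The cost model described in this abstract can be illustrated with a minimal sketch. The linear form of the cost and the parameter names `alpha` (price per requested time unit), `beta` (price per used time unit), and `gamma` (fixed per-reservation overhead) are assumptions made for illustration; the abstract only states that the cost depends on both the requested duration and the actual execution time.

```python
from typing import Sequence


def reservation_cost(
    reservations: Sequence[float],
    exec_time: float,
    alpha: float = 1.0,   # assumed price per requested time unit ("pay for what you ask")
    beta: float = 0.0,    # assumed price per used time unit ("pay for what you use")
    gamma: float = 0.0,   # assumed fixed overhead per reservation
) -> float:
    """Total cost of running one job under a sequence of increasing reservations.

    Every reservation is paid for until the first one that is long enough
    for the job to complete. A failed reservation is used in full.
    """
    total = 0.0
    for length in reservations:
        used = min(length, exec_time)
        total += alpha * length + beta * used + gamma
        if length >= exec_time:
            return total
    raise ValueError("no reservation is long enough for this job")


def expected_cost(reservations, outcomes, **prices) -> float:
    """Expected cost over a discrete distribution given as (exec_time, probability) pairs."""
    return sum(p * reservation_cost(reservations, x, **prices) for x, p in outcomes)


if __name__ == "__main__":
    # A job that needs 3 time units, with reservations of length 1, 2 and 4.
    print(reservation_cost([1, 2, 4], 3, alpha=1, beta=1))
```

With `alpha = beta = 1`, the job above pays 2 for the first reservation, 4 for the second, and 7 for the third (4 requested plus 3 used), so a shorter sequence is cheaper whenever the first reservation is likely to be long enough.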
We observe the emergence of a new generation of scientific workflows that process data produced at a sustained rate by instruments and large-scale numerical simulations. This data is consumed by multiple analysis, visualization, or Machine Learning components, not only to enable the inference that justifies the program, but also to monitor and steer the evolution of these experiments. In such workflows, moving intermediate data efficiently is key to performance, more than scheduling computational tasks. However, most traditional workflow...
New emerging fields are developing a growing number of large-scale applications with heterogeneous, dynamic and data-intensive requirements that put a high emphasis on productivity and are thus not tuned to run efficiently on today's high performance computing (HPC) systems. Some of these applications, such as neuroscience workloads and those that use adaptive numerical algorithms, develop modeling and simulation workflows with stochastic execution times and unpredictable resource requirements. When they are deployed on current HPC systems...
The MPI all-to-all algorithm is a data-intensive, high-cost collective algorithm used by many scientific High Performance Computing applications. Optimizations for small data exchanges use aggregation techniques, such as the Bruck algorithm, to minimize the number of messages sent and the overall operation latency. This paper presents three variants of the Bruck algorithm, which differ in the way data is laid out in memory at intermediate steps of the algorithm. Mellanox's InfiniBand support for Host Channel Adapter (HCA) hardware scatter/gather is used selectively...
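For readers unfamiliar with the Bruck algorithm mentioned in this abstract, the following is a minimal single-process simulation of the textbook version. It is not one of the three variants the paper presents and it does not use MPI or InfiniBand. The function name `bruck_alltoall` and the list-of-lists representation of per-rank buffers are illustrative assumptions.

```python
def bruck_alltoall(send):
    """Simulate the textbook Bruck all-to-all exchange.

    send[r][d] is the block that rank r wants to deliver to rank d.
    Returns recv, where recv[r][s] is the block that rank r received from rank s.
    Each rank sends one aggregated message per step, so the exchange takes
    ceil(log2(p)) steps instead of p - 1 individual messages per rank.
    """
    p = len(send)

    # Phase 1: local rotation, so that block i on rank r is destined for rank (r + i) % p.
    buf = [[send[r][(r + i) % p] for i in range(p)] for r in range(p)]

    # Phase 2: in step k, every block whose index has bit k set moves k ranks forward.
    k = 1
    while k < p:
        nxt = [row[:] for row in buf]
        for r in range(p):
            dst = (r + k) % p
            for i in range(p):
                if i & k:
                    nxt[dst][i] = buf[r][i]
        buf = nxt
        k <<= 1

    # Phase 3: local inverse rotation. Block i on rank r now comes from rank (r - i) % p.
    return [[buf[r][(r - s) % p] for s in range(p)] for r in range(p)]


if __name__ == "__main__":
    p = 5
    send = [[(src, dst) for dst in range(p)] for src in range(p)]
    recv = bruck_alltoall(send)
    print(recv[2])  # blocks received by rank 2, ordered by source rank
```

The intermediate buffer layout in phase 2 is the part that the paper's variants are said to change; this sketch keeps the simplest layout, in which a block stays at the same index as it moves between ranks.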