- Parallel Computing and Optimization Techniques
- Distributed and Parallel Computing Systems
- Advanced Data Storage Technologies
- Software System Performance and Reliability
- Scientific Computing and Data Management
- Cloud Computing and Resource Management
- Distributed Systems and Fault Tolerance
- Low-Power High-Performance VLSI Design
- Software Testing and Debugging Techniques
- Research Data Management Practices
- Interconnection Networks and Systems
- Algorithms and Data Compression
- Advanced Software Engineering Methodologies
- Radiation Effects in Electronics
- Advanced Database Systems and Queries
- Scheduling and Optimization Algorithms
- Embedded Systems Design Techniques
- Big Data and Business Intelligence
- Software Engineering Research
- Numerical Methods and Algorithms
- Protein Structure and Dynamics
- Computational Physics and Python Applications
- Peer-to-Peer Network Technologies
- Speech and Audio Processing
- Machine Learning in Bioinformatics
Nvidia (United States)
2022-2023
Lawrence Livermore National Laboratory
2013-2022
Bavarian Academy of Sciences and Humanities
2021
Leibniz Supercomputing Centre
2021
Irish Centre for High-End Computing
2021
National University of Ireland
2021
Red Hat (United States)
2021
IBM (United States)
2021
Lawrence Livermore National Security
2018
University of Utah
2018
Dynamic Voltage and Frequency Scaling (DVFS) has been the tool of choice for balancing power and performance in high-performance computing (HPC). With the introduction of Intel's Sandy Bridge family of processors, researchers now have a far more attractive option: user-specified, dynamic, hardware-enforced processor power bounds. In this paper we provide a first look at this technology in an HPC environment and detail both the opportunities and the potential pitfalls of using this technique to control power. As part of our evaluation we measure...
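A package power bound of this kind is exposed on Linux through the powercap sysfs interface; a minimal sketch, assuming the intel_rapl driver is loaded and that writing the limit is done as root:

```python
# Minimal sketch: reading and setting a package power bound through the
# Linux powercap (intel-rapl) sysfs interface. Paths assume the intel_rapl
# driver is loaded; writing the limit requires root privileges.
from pathlib import Path

PKG = Path("/sys/class/powercap/intel-rapl:0")  # package 0 power domain

def read_power_limit_watts() -> float:
    """Return the current long-term package power limit in watts."""
    microwatts = int((PKG / "constraint_0_power_limit_uw").read_text())
    return microwatts / 1e6

def set_power_limit_watts(watts: float) -> None:
    """Set a hardware-enforced package power bound (needs root)."""
    (PKG / "constraint_0_power_limit_uw").write_text(str(int(watts * 1e6)))

if __name__ == "__main__":
    print(f"current package power bound: {read_power_limit_watts():.1f} W")
```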
We present the Stack Trace Analysis Tool (STAT) to aid in debugging extreme-scale applications. STAT can reduce problem exploration spaces from thousands of processes to a few by sampling stack traces to form process equivalence classes, groups of processes exhibiting similar behavior. We can then use full-featured debuggers on representatives of these behavior classes for root cause analysis. STAT scalably collects stack trace samples over a period of time to assemble a profile of the application's routines; it organizes the samples into a call graph prefix tree that encodes common...
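The prefix-tree merge at the heart of this approach can be sketched in a few lines; the ranks, frame names, and sample data below are hypothetical:

```python
# Minimal sketch of the idea behind a call-graph prefix tree: merge
# per-process stack traces into one tree whose nodes record which ranks
# reached each call path, then read off process equivalence classes.
from collections import defaultdict

def build_prefix_tree(traces):
    """traces: {rank: [frame0, frame1, ...]} from outermost to innermost."""
    tree = {}  # frame -> (set of ranks, subtree)
    for rank, frames in traces.items():
        node = tree
        for frame in frames:
            ranks, subtree = node.setdefault(frame, (set(), {}))
            ranks.add(rank)
            node = subtree
    return tree

def equivalence_classes(traces):
    """Group ranks whose full call paths are identical."""
    classes = defaultdict(set)
    for rank, frames in traces.items():
        classes[tuple(frames)].add(rank)
    return classes

# Hypothetical samples: ranks 0-2 wait in MPI_Barrier, rank 3 lags in compute().
samples = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "MPI_Barrier"],
    3: ["main", "solve", "compute"],
}
for path, ranks in equivalence_classes(samples).items():
    print(sorted(ranks), "->", " > ".join(path))
```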
The economics of flash vs. disk storage is driving HPC centers to incorporate faster solid-state burst buffers into the storage hierarchy in exchange for smaller parallel file system (PFS) bandwidth. In systems with an underprovisioned PFS, avoiding I/O contention at the PFS level will become crucial to achieving high computational efficiency. In this paper, we propose novel batch job scheduling techniques that reduce such contention by integrating I/O awareness into scheduling policies such as EASY backfilling. We model the available bandwidth...
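A minimal sketch of an I/O-aware backfilling test, assuming each job carries an estimated PFS bandwidth demand; the job fields, capacity figure, and `can_backfill` helper are illustrative, not an actual scheduler API (the EASY reservation check is omitted for brevity):

```python
# A job may backfill only if both free nodes and residual PFS bandwidth
# can absorb it; figures and fields here are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int
    io_bw: float  # estimated PFS bandwidth demand, GB/s

PFS_CAPACITY = 100.0  # aggregate PFS bandwidth, GB/s (assumed)

def can_backfill(job, free_nodes, running):
    used_bw = sum(j.io_bw for j in running)
    return job.nodes <= free_nodes and used_bw + job.io_bw <= PFS_CAPACITY

running = [Job("sim-A", 512, 40.0), Job("sim-B", 256, 35.0)]
candidate = Job("analysis", 64, 30.0)
# 30 GB/s would push the PFS past 100 GB/s, so the job waits despite free nodes.
print(can_backfill(candidate, free_nodes=128, running=running))  # False
```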
OpenMP plays a growing role as a portable programming model to harness on-node parallelism; yet existing data race checkers for OpenMP have high overheads and generate many false positives. In this paper, we propose the first OpenMP race checker, ARCHER, that achieves high accuracy, low overheads on large applications, and portability. ARCHER incorporates scalable happens-before tracking, exploits structured parallelism via combined static and dynamic analysis, and modularly interfaces with OpenMP runtimes. It significantly outperforms TSan and Intel®...
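The happens-before reasoning such checkers rely on can be illustrated with vector clocks; a minimal sketch with hypothetical clock values, not ARCHER's actual implementation:

```python
# Two accesses to the same location race if neither one's vector clock is
# ordered before the other's and at least one of them is a write.
def happens_before(vc_a, vc_b):
    """True if clock vc_a is ordered before vc_b (componentwise <=, not equal)."""
    return all(a <= b for a, b in zip(vc_a, vc_b)) and vc_a != vc_b

def is_race(access_a, access_b):
    (vc_a, writes_a), (vc_b, writes_b) = access_a, access_b
    concurrent = not happens_before(vc_a, vc_b) and not happens_before(vc_b, vc_a)
    return concurrent and (writes_a or writes_b)

# Two threads: a write at clock (1, 0) and a read at (0, 1) are concurrent.
print(is_race(((1, 0), True), ((0, 1), False)))   # True: unordered write/read
# A read that has synchronized with the write, clock (1, 1), is ordered after it.
print(is_race(((1, 0), True), ((1, 1), False)))   # False
```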
Resource and job management software is crucial to High Performance Computing (HPC) for efficient application execution. However, current software systems and approaches can no longer keep up with the challenges large HPC centers are facing due to ever-increasing system scales, resource and workload diversity, interplays between various resources (e.g., compute clusters and a global file system), and the complexity of constraints such as strict power budgeting. To address this gap, we propose Flux, an extensible framework...
We improved the quality of, and reduced the time to produce, machine-learned models for use in small molecule antiviral design. Our globally asynchronous multi-level parallel training approach strong-scales to all of Sierra with up to 97.7% efficiency. We trained a novel, character-based Wasserstein autoencoder that produces a higher-quality model on 1.613 billion compounds in 23 minutes, whereas the previous state of the art takes a day on 1 million compounds. Reducing training time from a day to minutes shifts the model creation bottleneck from computer job turnaround to human...
We present a scalable temporal order analysis technique that supports debugging of large-scale applications by classifying MPI tasks based on their logical program execution order. Our approach combines static analysis techniques with dynamic analysis to determine this order scalably. It uses stack trace analysis to guide the selection of critical points in anomalous application runs. A novel temporal ordering engine then leverages this information, along with the application's control structure, to apply data flow analysis to key variables such as loop control variables. We use lightweight...
Today's largest systems have over 100,000 cores, with million-core systems expected in the next few years. This growing scale makes debugging the applications that run on them a daunting challenge. Few tools perform well at this scale, and most provide an overload of information about the entire job. Developers instead need to quickly direct their attention to the root cause of the problem. This paper presents AutomaDeD, a tool that identifies which tasks of a large-scale application first manifest a bug at a specific code region and program execution point. AutomaDeD statistically...
For job allocation decisions, current batch schedulers have access to and use only information on the number of nodes and the runtime, because it is readily available at submission time from user scripts. User-provided runtimes are typically inaccurate because users overestimate or lack an understanding of their resource requirements. Beyond runtime, other system resources, including I/O and the network, are not available at submission time but play a key role in performance. There is a need for automatic, general, and scalable tools that provide accurate usage information so that,...
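One hedged sketch of how such a tool might derive estimates automatically: predict a new job's usage from the accounting history of similar jobs. The records and `predict` helper below are hypothetical, not the paper's method:

```python
# Predict runtime and I/O volume from prior jobs by the same user with the
# same job name, rather than trusting user-supplied estimates.
from statistics import median

history = [  # hypothetical accounting records: (user, name, runtime_s, io_gb)
    ("alice", "lulesh", 3400, 120.0),
    ("alice", "lulesh", 3550, 130.0),
    ("alice", "lulesh", 3280, 118.0),
]

def predict(user, name, default=(7200, 50.0)):
    matches = [(rt, io) for u, n, rt, io in history if (u, n) == (user, name)]
    if not matches:
        return default  # no history: fall back to a conservative default
    runtimes, ios = zip(*matches)
    return median(runtimes), median(ios)

print(predict("alice", "lulesh"))  # (3400, 120.0)
```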
Scientific workflows have been used almost universally across scientific domains and have underpinned some of the most significant discoveries of the past several decades. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale high-performance computing (HPC) platforms. These executions must be managed using some software infrastructure. Due to the popularity of workflows, workflow management systems (WMSs)...
Contemporary microprocessors provide a rich set of integrated performance counters that allow application developers and system architects alike the opportunity to gather important information about workload behaviors. Current techniques for analyzing the data produced from these counters use raw counts, ratios, and visualization to help users make decisions about their application's performance. While these techniques are appropriate for one process, they do not scale easily to the new levels demanded by contemporary computing systems. Very simply, this...
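A minimal sketch of the scaling idea: reduce each process's counter vector to a coarse signature and cluster similar processes, so a user inspects a handful of behavior classes rather than every rank. The counters and bucket width are illustrative assumptions:

```python
# Cluster per-rank counter vectors by a quantized, normalized signature.
from collections import defaultdict

def signature(counters, bucket=0.25):
    """Normalize a counter vector and quantize each ratio into buckets."""
    total = sum(counters.values()) or 1
    return tuple(sorted((k, round(v / total / bucket) * bucket)
                        for k, v in counters.items()))

# Hypothetical per-rank counters: cycles, L2 misses, FLOPs.
ranks = {
    0: {"cycles": 9e9, "l2_miss": 1e7, "flops": 4e9},
    1: {"cycles": 9.1e9, "l2_miss": 1.1e7, "flops": 4.1e9},
    2: {"cycles": 9e9, "l2_miss": 9e8, "flops": 1e9},  # memory-bound outlier
}
clusters = defaultdict(list)
for rank, counters in ranks.items():
    clusters[signature(counters)].append(rank)
print([sorted(v) for v in clusters.values()])  # [[0, 1], [2]]
```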
Dynamic linking has many advantages for managing large code bases, but dynamically linked applications have not typically scaled well on high performance computing systems. Splitting a monolithic executable into dynamic shared object (DSO) files decreases the compile time of large codes, reduces runtime memory requirements by allowing modules to be loaded and unloaded as needed, and allows common DSOs to be shared among executables. However, launching an application that depends on many DSOs causes a flood of file system operations at program...
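One common mitigation is to stage libraries through a scalable broadcast so only one process touches the parallel file system; a minimal mpi4py sketch of that general idea, not any specific tool's mechanism (paths are hypothetical):

```python
# Stage a shared library to node-local storage via an MPI broadcast, then
# load the local copy, so the PFS sees one read instead of N.
import ctypes, os
from mpi4py import MPI

comm = MPI.COMM_WORLD

def load_dso_broadcast(pfs_path, local_dir="/tmp"):
    data = None
    if comm.rank == 0:
        with open(pfs_path, "rb") as f:   # single PFS read
            data = f.read()
    data = comm.bcast(data, root=0)        # scalable fan-out to all ranks
    # A real implementation would elect one writer per node; for this
    # sketch, each rank writes a rank-keyed copy to avoid write races.
    local_path = os.path.join(local_dir,
                              f"{comm.rank}-" + os.path.basename(pfs_path))
    with open(local_path, "wb") as f:
        f.write(data)
    return ctypes.CDLL(local_path)

lib = load_dso_broadcast("/pfs/apps/libphysics.so")  # hypothetical path
```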
The detection and elimination of data races in large-scale OpenMP programs is of critical importance. Unfortunately, today's state-of-the-art race checkers suffer from high memory overheads and/or miss races. In this paper, we present SWORD, a data race detector that significantly improves upon these limitations. SWORD limits the application slowdown and memory usage by utilizing only a bounded, user-adjustable buffer to collect targeted memory accesses. When the buffer fills up, the accesses are compressed and flushed to the file system for later...
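The bounded-buffer idea can be sketched briefly; the record format, buffer size, and file layout below are illustrative assumptions, not SWORD's actual on-disk format:

```python
# Record memory accesses in a fixed-size buffer; when it fills, compress
# the chunk and flush it to disk for offline race analysis.
import struct, zlib

class AccessLog:
    def __init__(self, path, capacity=4096):
        self.out = open(path, "ab")
        self.capacity = capacity
        self.buf = []

    def record(self, thread_id, addr, is_write):
        self.buf.append(struct.pack("<QQB", thread_id, addr, is_write))
        if len(self.buf) >= self.capacity:
            self.flush()

    def flush(self):
        if self.buf:
            chunk = zlib.compress(b"".join(self.buf))
            # length-prefix each chunk so the offline analyzer can re-split
            self.out.write(struct.pack("<I", len(chunk)) + chunk)
            self.buf.clear()

log = AccessLog("/tmp/accesses.bin", capacity=2)
log.record(0, 0x7ffe0010, True)
log.record(1, 0x7ffe0010, False)  # triggers a compressed flush
log.flush()
```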
Dynamic analysis techniques help programmers find the root cause of bugs in large-scale parallel applications.
Debugging large-scale parallel applications is challenging. In most HPC applications, tasks progress in a coordinated fashion, and thus a fault in one task can quickly propagate to other tasks, making the application difficult to debug. Finding the least-progressed tasks can significantly reduce the effort to identify where a fault originated. However, existing approaches for detecting them suffer from low accuracy and large overheads; they either use imprecise static analysis or are unable to infer progress dependence inside loops. We present a loop-aware...
Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and process application data. In addition, at such scales, each tool itself becomes a large parallel application - already, debugging the full Blue Gene/L (BG/L) installation at Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach such sizes and beyond, tools must use scalable communication...
The ability to record and replay program execution helps significantly in debugging non-deterministic MPI applications by reproducing message-receive orders. However, the large amount of data that traditional record-and-replay techniques record precludes their practical applicability to massively parallel applications. In this paper, we propose a new compression algorithm, Clock Delta Compression (CDC), for scalable record and replay. CDC defines a reference order of message receives based on a totally ordered relation using...
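The intuition behind delta-encoding against a clock-derived reference order can be sketched as follows; the clock values and record layout are illustrative, not CDC's actual encoding:

```python
# Derive a reference receive order from logical clocks, then record only
# the positions where the actual order deviates from it.
def reference_order(receives):
    """Totally order receives by (logical clock, sender rank)."""
    return sorted(receives, key=lambda r: (r["clock"], r["sender"]))

def encode_delta(actual, reference):
    """Store only positions where the actual order differs from reference."""
    return [(i, r) for i, r in enumerate(actual) if r != reference[i]]

def decode(delta, reference):
    replay = list(reference)
    for i, r in delta:
        replay[i] = r
    return replay

actual = [
    {"sender": 1, "clock": 3}, {"sender": 0, "clock": 2},
    {"sender": 2, "clock": 5},
]
ref = reference_order(actual)
delta = encode_delta(actual, ref)
print(len(delta), "of", len(actual), "entries recorded")  # 2 of 3
assert decode(delta, ref) == actual
```

The closer the actual receive order tracks the clock-derived reference, the smaller the delta, which is what makes the encoding scale.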
The advancement of machine learning techniques and the heterogeneous architectures of most current supercomputers are propelling the demand for large multiscale simulations that can automatically and autonomously couple diverse components and map them to relevant resources to solve complex problems at multiple scales. Nevertheless, despite recent progress in workflow technologies, current capabilities are limited to coupling two scales. In the first-ever demonstration using three scales of resolution, we present a scalable and generalizable...
Exascale computers will offer transformative capabilities to combine data-driven and learning-based approaches with traditional simulation applications to accelerate scientific discovery and insight. These software combinations and integrations, however, are difficult to achieve due to the challenges of coordinating and deploying heterogeneous software components on diverse and massive platforms. We present the ExaWorks project, which addresses many of these challenges: it is leading a co-design process to create a workflow Software...
As High Performance Computing (HPC) workflows increase in complexity, their designers seek to enable the automation and flexibility offered by cloud technologies. Container orchestration through Kubernetes enables highly desirable capabilities but does not satisfy the performance demands of HPC: tools that automate the lifecycle of Message Passing Interface (MPI)-based applications do not scale, and the scheduler does not provide crucial HPC scheduling capabilities. In this work, we detail our efforts to port a CORAL-2 benchmark...
Many tools that target parallel and distributed environments must co-locate a set of daemons with the processes of the target application. However, efficient and portable deployment of these daemons on large-scale systems is an unsolved problem. We overcome this gap with LaunchMON, a scalable, robust, portable, secure, and general-purpose infrastructure for launching tool daemons. Its API allows tool builders to identify all processes of a target job, launch daemons on the relevant nodes, and control daemon interaction. Our results show that LaunchMON scales to very large daemon counts...
Debugging large-scale parallel applications is challenging. Most existing techniques provide mechanisms for process control but offer little information about the causes of failures; most debuggers also scale poorly despite continued growth in supercomputer core counts. Our novel, highly scalable tool helps developers to understand and fix performance failures and correctness problems at scale. It probabilistically infers the least progressed task in MPI programs using Markov models of execution history and dependence...
Debugging large-scale parallel applications is challenging. Most existing techniques provide little information about failure root causes. Further, most debuggers significantly slow down program execution and run sluggishly with massively parallel applications. This paper presents a novel technique that scalably infers the tasks in a parallel program on which a failure occurred, as well as the code in which it originated. Our technique combines scalable runtime analysis with static analysis to determine the least-progressed task(s) and identify the lines of code at which the failure arose. We present...
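The least-progressed task(s) can be read off a progress dependence graph, in which an edge A -> B records that task A is blocked waiting on task B; a minimal sketch with a hypothetical hang:

```python
# The least-progressed tasks are the sinks of the progress dependence
# graph: everyone else transitively waits on them.
from collections import defaultdict

def least_progressed(edges, tasks):
    """edges: iterable of (waiter, waitee) pairs; returns sink tasks."""
    out_deg = defaultdict(int)
    for waiter, _ in edges:
        out_deg[waiter] += 1
    return sorted(t for t in tasks if out_deg[t] == 0)

# Hypothetical run: ranks 0-2 block in MPI_Barrier behind rank 3, which in
# turn waits on a receive from rank 4 that is stuck in a compute loop.
edges = [(0, 3), (1, 3), (2, 3), (3, 4)]
print(least_progressed(edges, tasks=range(5)))  # [4]
```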