Dong H. Ahn

ORCID: 0000-0001-6722-0532
Research Areas
  • Parallel Computing and Optimization Techniques
  • Distributed and Parallel Computing Systems
  • Advanced Data Storage Technologies
  • Software System Performance and Reliability
  • Scientific Computing and Data Management
  • Cloud Computing and Resource Management
  • Distributed Systems and Fault Tolerance
  • Low-Power High-Performance VLSI Design
  • Software Testing and Debugging Techniques
  • Research Data Management Practices
  • Interconnection Networks and Systems
  • Algorithms and Data Compression
  • Advanced Software Engineering Methodologies
  • Radiation Effects in Electronics
  • Advanced Database Systems and Queries
  • Scheduling and Optimization Algorithms
  • Embedded Systems Design Techniques
  • Big Data and Business Intelligence
  • Software Engineering Research
  • Numerical Methods and Algorithms
  • Protein Structure and Dynamics
  • Computational Physics and Python Applications
  • Peer-to-Peer Network Technologies
  • Speech and Audio Processing
  • Machine Learning in Bioinformatics

Nvidia (United States)
2022-2023

Lawrence Livermore National Laboratory
2013-2022

Bavarian Academy of Sciences and Humanities
2021

Leibniz Supercomputing Centre
2021

Irish Centre for High-End Computing
2021

National University of Ireland
2021

Red Hat (United States)
2021

IBM (United States)
2021

Lawrence Livermore National Security
2018

University of Utah
2018

Dynamic Voltage Frequency Scaling (DVFS) has been the tool of choice for balancing power and performance in high-performance computing (HPC). With the introduction of Intel's Sandy Bridge family of processors, researchers now have a far more attractive option: user-specified, dynamic, hardware-enforced processor power bounds. In this paper we provide a first look at this technology in the HPC environment and detail both the opportunities and the potential pitfalls of using this technique to control power. As part of our evaluation we measure...

10.1109/ipdpsw.2012.116 article EN 2012-05-01

We present the Stack Trace Analysis Tool (STAT) to aid in debugging extreme-scale applications. STAT can reduce problem exploration spaces from thousands of processes to a few by sampling stack traces to form process equivalence classes, groups of processes exhibiting similar behavior. We then use full-featured debuggers on representatives of these behavior classes for root cause analysis. STAT scalably collects stack traces over a sampling period to assemble a profile of the application's routines, merging the samples into a call graph prefix tree that encodes common...

10.1109/ipdps.2007.370254 article EN 2007-01-01
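The core data structure behind STAT's reduction, a call-graph prefix tree that groups ranks with identical call paths into equivalence classes, can be sketched as follows. This is a minimal illustration of the idea only, not STAT's actual implementation; the rank numbering and frame names are hypothetical.

```python
def merge_traces(traces):
    """Merge per-rank stack traces (root-to-leaf frame lists) into a
    prefix tree; each node records which ranks passed through it."""
    tree = {"frame": "<root>", "ranks": set(), "children": {}}
    for rank, frames in traces.items():
        node = tree
        node["ranks"].add(rank)
        for f in frames:
            node = node["children"].setdefault(
                f, {"frame": f, "ranks": set(), "children": {}})
            node["ranks"].add(rank)
    return tree

def equivalence_classes(tree):
    """Group ranks by full call path: each leaf of the prefix tree is one
    behavior class, so a debugger only needs one representative per class."""
    classes = {}
    def walk(node, path):
        path = path + (node["frame"],)
        if not node["children"]:
            classes[path] = sorted(node["ranks"])
        for child in node["children"].values():
            walk(child, path)
    walk(tree, ())
    return classes
```

With thousands of ranks stuck in the same barrier, this collapses the search space to one class per distinct call path.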

The economics of flash vs. disk storage is driving HPC centers to incorporate faster solid-state burst buffers into the storage hierarchy in exchange for smaller parallel file system (PFS) bandwidth. In systems with an underprovisioned PFS, avoiding I/O contention at the PFS level will become crucial to achieving high computational efficiency. In this paper, we propose novel batch job scheduling techniques that reduce such contention by integrating I/O awareness into scheduling policies such as EASY backfilling. We model available bandwidth...

10.1145/2907294.2907316 article EN 2016-05-31
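The I/O-aware backfilling idea can be sketched as a single scheduling pass: a waiting job may jump the queue only if it fits the free nodes, fits the remaining PFS bandwidth (the I/O-aware addition), and finishes before the head job's reservation (the EASY rule). The job fields and units below are hypothetical, not the paper's actual model.

```python
def io_aware_backfill(queue, free_nodes, free_bw, shadow_time, now):
    """One EASY-style backfill pass over the waiting queue.
    free_bw models the remaining PFS bandwidth budget; a job whose I/O
    demand would oversubscribe it is skipped even if nodes are free."""
    started = []
    for job in list(queue):
        if (job["nodes"] <= free_nodes
                and job["bw"] <= free_bw
                and now + job["walltime"] <= shadow_time):
            started.append(job["id"])
            free_nodes -= job["nodes"]
            free_bw -= job["bw"]
            queue.remove(job)
    return started, free_nodes, free_bw
```

A plain EASY backfiller is the same loop without the bandwidth check, which is exactly how it creates PFS contention.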

OpenMP plays a growing role as a portable programming model to harness on-node parallelism, yet existing data race checkers for OpenMP have high overheads and generate many false positives. In this paper, we propose the first OpenMP race checker, ARCHER, that achieves high accuracy, low overheads on large applications, and portability. ARCHER incorporates scalable happens-before tracking, exploits structured parallelism via combined static and dynamic analysis, and modularly interfaces with OpenMP runtimes. It significantly outperforms TSan and Intel®...

10.1109/ipdps.2016.68 article EN 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016-05-01

Resource and job management software is crucial to High Performance Computing (HPC) for efficient application execution. However, current systems and approaches can no longer keep up with the challenges large HPC centers are facing due to ever-increasing system scales, resource and workload diversity, interplays between various resources (e.g., compute clusters and a global file system), and the complexity of constraints such as strict power budgeting. To address this gap, we propose Flux, an extensible framework...

10.1109/icppw.2014.15 article EN 2014-09-01

We improved the quality of, and reduced the time to produce, machine-learned models for use in small molecule antiviral design. Our globally asynchronous multi-level parallel training approach strong scales to all of Sierra with up to 97.7% efficiency. We trained a novel, character-based Wasserstein autoencoder that produces a higher-quality model on 1.613 billion compounds in 23 minutes, while the previous state of the art takes a day on 1 million compounds. Reducing training time from a day to minutes shifts the model creation bottleneck from computer job turnaround to human...

10.1177/10943420211010930 article EN cc-by-nc The International Journal of High Performance Computing Applications 2021-05-03

We present a scalable temporal order analysis technique that supports debugging of large-scale applications by classifying MPI tasks based on their logical program execution order. Our approach combines static analysis techniques with dynamic analysis to determine this order scalably. It uses scalable stack trace analysis to guide the selection of critical points in anomalous application runs. A novel temporal ordering engine then leverages this information along with the application's static control structure to apply data flow analysis to key variables such as loop control variables. We use lightweight...

10.1145/1654059.1654104 article EN 2009-11-14

Today's largest systems have over 100,000 cores, with million-core systems expected in the next few years. This growing scale makes debugging the applications that run on them a daunting challenge. Few tools perform well at this scale, and most provide an overload of information about the entire job. Developers need to quickly direct debugging efforts to the root cause of the problem. This paper presents AutomaDeD, a tool that identifies which tasks of a large-scale application first manifest a bug at a specific code region and program execution point. AutomaDeD statistically...

10.1109/dsn.2010.5544927 article EN 2010-06-01

For job allocation decisions, current batch schedulers have access to and use only information on the number of nodes and the runtime, because it is readily available at submission time from user scripts. User-provided runtimes are typically inaccurate because users overestimate or lack an understanding of resource requirements. Beyond runtime, other system resources, including I/O and the network, are not available at submission time but play a key role in performance. There is a need for automatic, general, and scalable tools that provide accurate usage information so that,...

10.1145/3225058.3225091 article EN 2018-08-08

Scientific workflows have been used almost universally across scientific domains and have underpinned some of the most significant discoveries of the past several decades. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale high-performance computing (HPC) platforms. These executions must be managed using some software infrastructure. Due to the popularity of workflows, workflow management systems (WMSs)...

10.48550/arxiv.2103.09181 preprint EN cc-by-sa arXiv (Cornell University) 2021-01-01

Contemporary microprocessors provide a rich set of integrated performance counters that allow application developers and system architects alike the opportunity to gather important information about workload behaviors. Current techniques for analyzing the data produced from these counters use raw counts, ratios, and visualization to help users make decisions about their performance. While these techniques are appropriate for one process, they do not scale easily to the new levels demanded by contemporary computing systems. Very simply, this...

10.5555/762761.762802 article EN Conference on High Performance Computing (Supercomputing) 2002-11-16
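One way to make per-process counter data scale, in the spirit of the statistical reduction this paper argues for, is to group processes whose counter profiles are statistically similar and report one representative per group instead of raw counts for every process. The greedy tolerance-based grouping below is a hypothetical sketch of that idea, not the paper's actual method.

```python
def cluster_counter_profiles(samples, eps=0.1):
    """Greedily group per-rank counter vectors: a rank joins the first
    cluster whose representative is within a relative tolerance eps on
    every counter; otherwise it seeds a new cluster."""
    def close(a, b):
        return all(abs(x - y) <= eps * max(abs(x), abs(y), 1)
                   for x, y in zip(a, b))
    clusters = []  # list of (representative vector, member ranks)
    for rank, vec in sorted(samples.items()):
        for rep, members in clusters:
            if close(rep, vec):
                members.append(rank)
                break
        else:
            clusters.append((vec, [rank]))
    return [members for _, members in clusters]
```

With tens of thousands of processes, the analyst then inspects a handful of cluster representatives rather than every rank's counters.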

Dynamic linking has many advantages for managing large code bases, but dynamically linked applications have not typically scaled well on high performance computing systems. Splitting a monolithic executable into dynamic shared object (DSO) files decreases compile time for large codes, reduces runtime memory requirements by allowing modules to be loaded and unloaded as needed, and allows common DSOs to be shared among executables. However, launching an application that depends on many DSOs causes a flood of file system operations at program...

10.1145/2464996.2465020 article EN 2013-05-28
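The essence of the fix for that file-system flood is load-once-and-share: only the first request for a DSO on a node touches the shared file system, and every later request reuses the cached result. The class below is a minimal sketch of that caching idea under hypothetical names; it is not the actual protocol of the paper's tool.

```python
class CachingLoader:
    """Node-local DSO cache: the first request for a library reads it from
    the parallel file system (via `fetch`); subsequent requests reuse the
    cached bytes, so N processes cause one PFS read instead of N."""
    def __init__(self, fetch):
        self.fetch = fetch      # function that reads a DSO from the PFS
        self.cache = {}
        self.fs_reads = 0       # how often the PFS was actually hit
    def load(self, path):
        if path not in self.cache:
            self.fs_reads += 1
            self.cache[path] = self.fetch(path)
        return self.cache[path]
```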

The detection and elimination of data races in large-scale OpenMP programs is of critical importance. Unfortunately, today's state-of-the-art race checkers suffer from high memory overheads and/or miss races. In this paper, we present SWORD, a data race detector that significantly improves upon these limitations. SWORD limits the application slowdown and memory usage by utilizing only a bounded, user-adjustable buffer to collect targeted memory accesses. When the buffer fills up, the accesses are compressed and flushed to the file system for later...

10.1109/ipdps.2018.00094 article EN 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2018-05-01
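The bounded-buffer mechanism described above can be sketched in a few lines: accesses accumulate in a fixed-capacity buffer, and each full batch is compressed and flushed for later offline analysis. This illustrates only the buffering idea, under invented record strings, not SWORD's actual trace format.

```python
import zlib

class BoundedAccessLog:
    """Bounded, user-adjustable access buffer: when it fills, the batch is
    zlib-compressed and flushed to `sink` (standing in for the file
    system), keeping the in-memory footprint constant."""
    def __init__(self, capacity, sink):
        self.capacity, self.sink, self.buf = capacity, sink, []
    def record(self, access):
        self.buf.append(access)
        if len(self.buf) >= self.capacity:
            self.flush()
    def flush(self):
        if self.buf:
            self.sink.append(zlib.compress("\n".join(self.buf).encode()))
            self.buf = []
```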

Dynamic analysis techniques help programmers find the root cause of bugs in large-scale parallel applications.

10.1145/2667219 article EN Communications of the ACM 2015-08-24

Debugging large-scale parallel applications is challenging. In most HPC applications, tasks progress in a coordinated fashion, and thus a fault in one task can quickly propagate to other tasks, making it difficult to debug. Finding the least-progressed tasks can significantly reduce the effort to identify the task where the fault originated. However, existing approaches for detecting them suffer low accuracy and large overheads; either they use imprecise static analysis or are unable to infer progress dependence inside loops. We present a loop-aware...

10.1145/2594291.2594336 article EN 2014-05-13
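Once per-task progress dependences are known, picking least-progressed candidates reduces to a small graph query: a task that blocks others but waits on no one is the place to start debugging. The map-based encoding below is a hypothetical sketch of that final step, not the paper's analysis itself.

```python
def least_progressed(waits_on):
    """Given a progress-dependence map waits_on[task] = set of tasks that
    task is blocked on, return the tasks that block others but are
    themselves waiting on no one (the least-progressed candidates)."""
    blocked_on = set()
    for deps in waits_on.values():
        blocked_on |= deps
    return sorted(t for t in blocked_on if not waits_on.get(t))
```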

Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and process application data. In addition, at such scales, each tool itself will become a large parallel application - already, debugging the full Blue-Gene/L (BG/L) installation at Lawrence Livermore National Laboratory requires employing 1664 daemons. To reach such sizes and beyond, tools must use scalable communication...

10.5555/1413370.1413397 article EN IEEE International Conference on High Performance Computing, Data, and Analytics 2008-11-15

The ability to record and replay program execution helps significantly in debugging non-deterministic MPI applications by reproducing message-receive orders. However, the large amount of data that traditional record-and-replay techniques record precludes their practical applicability to massively parallel applications. In this paper, we propose a new compression algorithm, Clock Delta Compression (CDC), for scalable record and replay. CDC defines a reference order of message receives based on a totally ordered relation using...

10.1145/2807591.2807642 article EN 2015-10-27
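The intuition behind delta-based compression of receive orders can be sketched as follows: if a clock-derived reference order predicts most receives correctly, only the positions where the observed order deviates from it need to be recorded. This is an illustration of the delta idea under invented message names, not CDC's actual encoding.

```python
def cdc_compress(observed, reference):
    """Store only the positions where the observed receive order deviates
    from the reference order; a run that matches the reference costs
    nothing to record."""
    return [(i, m) for i, m in enumerate(observed) if reference[i] != m]

def cdc_decompress(deltas, reference):
    """Replay: start from the reference order and patch in the recorded
    deviations to recover the observed receive order exactly."""
    order = list(reference)
    for i, m in deltas:
        order[i] = m
    return order
```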

The advancement of machine learning techniques and the heterogeneous architectures of most current supercomputers are propelling the demand for large multiscale simulations that can automatically and autonomously couple diverse components and map them to relevant resources to solve complex problems at multiple scales. Nevertheless, despite recent progress in workflow technologies, current capabilities are limited to coupling two scales. In the first-ever demonstration using three scales of resolution, we present a scalable and generalizable...

10.1145/3458817.3476210 article EN 2021-10-21

Exascale computers will offer transformative capabilities to combine data-driven and learning-based approaches with traditional simulation applications to accelerate scientific discovery and insight. These software combinations and integrations, however, are difficult to achieve due to the challenges of coordination and deployment of heterogeneous software components on diverse and massive platforms. We present the ExaWorks project, which can address many of these challenges: ExaWorks is leading a co-design process to create a workflow Software...

10.1109/works54523.2021.00012 article EN 2021-11-01

As High Performance Computing (HPC) workflows increase in complexity, their designers seek to enable the automation and flexibility offered by cloud technologies. Container orchestration through Kubernetes enables highly desirable capabilities but does not satisfy the performance demands of HPC. Tools that automate the lifecycle of Message Passing Interface (MPI)-based applications do not scale, nor does the scheduler provide crucial scheduling capabilities. In this work, we detail our efforts to port the CORAL-2 benchmark...

10.1109/canopie-hpc56864.2022.00011 article EN 2022-11-01

Many tools that target parallel and distributed environments must co-locate a set of daemons with the processes of a target application. However, efficient and portable deployment of these daemons on large-scale systems is an unsolved problem. We overcome this gap with LaunchMON, a scalable, robust, portable, secure, and general-purpose infrastructure for launching tool daemons. Its API allows tool builders to identify all processes of a target job, launch daemons on the relevant nodes, and control daemon interaction. Our results show that LaunchMON scales to very large daemon counts...

10.1109/icpp.2008.63 article EN 2008-09-01

Debugging large-scale parallel applications is challenging. Most existing techniques provide mechanisms for process control but little information about the causes of failures. Most debuggers also scale poorly despite continued growth in supercomputer core counts. Our novel, highly scalable tool helps developers to understand and fix performance failures and correctness problems at scale. It probabilistically infers the least progressed task in MPI programs using Markov models of execution history and dependence...

10.1145/2370816.2370848 article EN 2012-09-19

Debugging large-scale parallel applications is challenging. Most existing techniques provide little information about failure root causes. Further, most debuggers significantly slow down program execution, and run sluggishly with massively parallel applications. This paper presents a novel technique that scalably infers the tasks in a parallel program on which a failure occurred, as well as the code in which it originated. Our technique combines scalable runtime analysis with static analysis to determine the least-progressed task(s) and identify the code lines at which the failure arose. We present...

10.1109/tpds.2014.2314100 article EN IEEE Transactions on Parallel and Distributed Systems 2014-04-21


10.1109/sc.2002.10066 article EN 2002-01-01