- Parallel Computing and Optimization Techniques
- Distributed systems and fault tolerance
- Gene Regulatory Network Analysis
- Advanced Data Storage Technologies
- Software System Performance and Reliability
- Interconnection Networks and Systems
- Evolutionary Algorithms and Applications
- Cloud Computing and Resource Management
- Distributed and Parallel Computing Systems
- Cellular Automata and Applications
- CCD and CMOS Imaging Sensors
- Welding Techniques and Residual Stresses
- Remote Sensing and LiDAR Applications
- Atmospheric aerosols and clouds
- Advanced Software Engineering Methodologies
- Remote Sensing in Agriculture
- Electromagnetic Scattering and Analysis
- Advanced Welding Techniques Analysis
- Real-Time Systems Scheduling
- African history and culture analysis
- Context-Aware Activity Recognition Systems
- Neural Networks and Applications
- Music Technology and Sound Studies
- Software Engineering Research
- Quality and Management Systems
Lawrence Livermore National Laboratory
2015-2020
Forschungszentrum Jülich
2009-2018
German Research School for Simulation Sciences
2012-2014
RWTH Aachen University
2010-2013
Institute for Advanced Study
2010
University of Potsdam
2008
Welding Institute (Slovenia)
1998
The critical path, which describes the longest execution sequence without wait states in a parallel program, identifies the activities that determine the overall program runtime. Combining knowledge of the critical path with traditional parallel profiles, we have defined a set of compact performance indicators that help answer a variety of important performance-analysis questions, such as identifying load imbalance, quantifying the impact of imbalance on runtime, and characterizing resource consumption. By replaying event traces in parallel, we can...
Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, as delays of single processes may spread wait states across the entire machine. Moreover, when employing complex point-to-point communication patterns, wait states may propagate along far-reaching cause-effect chains that are hard...
The fat-tree topology is one of the most commonly used network topologies in HPC systems. Vendors support several options that can be configured when deploying fat-tree networks on production systems, such as link bandwidth, number of rails, number of planes, and tapering. This paper showcases the use of simulations to compare the impact of these design options on representative applications, libraries, and multi-job workloads. We present advances in the TraceR-CODES simulation framework that enable this analysis and evaluate its prediction accuracy against...
Cray XT and IBM Blue Gene systems present current alternative approaches to constructing leadership computer systems relying on applications being able to exploit very large configurations of processor cores, and associated analysis tools must also scale commensurately to isolate and quantify performance issues that manifest at the largest scales. In studying the scalability of the Scalasca performance analysis toolset to several hundred thousand MPI processes on XT5 and BG/P systems, we investigated a progressive execution performance deterioration of the well-known ASCI...
Load or communication imbalance prevents many codes from taking advantage of the parallelism available on modern supercomputers. We present two scalable methods to highlight imbalance in parallel programs: The first method identifies delays that inflict wait states at subsequent synchronization points, and attributes their costs in terms of resource waste to the original cause. The second method combines knowledge of the critical path with traditional parallel profiles to derive a set of compact performance indicators that help answer a variety...
Asynchrony and non-determinism in Charm++ programs present a significant challenge in analyzing their event traces. We present a new framework to organize event traces of parallel programs written in Charm++. Our reorganization allows one to more easily explore and analyze such traces by providing context through logical structure. We describe several heuristics to compensate for missing dependencies between events that currently cannot be recorded. We introduce a new task ordering that recovers logical structure from the non-deterministic execution order. Using...
In studying the scalability of the Scalasca performance analysis toolset to several hundred thousand MPI processes on IBM Blue Gene/P, we investigated a progressive execution performance deterioration of the well-known ASCI Sweep3D compact application. Scalasca runtime summarization analysis quantified MPI communication time that correlated with computational imbalance, and automated trace analysis confirmed growing amounts of MPI waiting times. Further instrumentation, measurement and analyses pinpointed a conditional section of highly imbalanced computation...
To better understand the formation of wait states in MPI programs and to support the user in finding optimization targets in the case of load imbalance, a major source of wait states, we added in our earlier work two new trace-analysis techniques to Scalasca, a performance analysis tool designed for large-scale applications. In this paper, we show how the two techniques, which were originally restricted to two-sided and collective MPI communication, are extended to cover also one-sided communication. We demonstrate our experiences with benchmark...
Load imbalance usually introduces wait states into the execution of parallel programs. Being able to identify and quantify wait states is therefore essential for the diagnosis and remediation of this phenomenon. An established method of detecting wait states is to generate event traces and compare relevant timestamps across process boundaries. However, large trace volumes usually prevent the analysis of longer execution periods. In this paper, we present an extremely lightweight wait-state profiler which does not rely on traces that can be used to estimate wait states in MPI codes with...
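The trace-based approach named in the abstract above, comparing timestamps across process boundaries, can be illustrated with a minimal sketch. It is not the authors' tool or algorithm. It assumes a blocking receive, globally synchronized clocks, and already matched send/receive pairs, and it covers only the common case often called a late sender, where the receiver enters its receive before the sender enters the matching send. The `Message` record and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical record of one matched send/receive pair from a trace."""
    sender: int        # rank that issued the send
    receiver: int      # rank that issued the receive
    send_enter: float  # time the sender entered the send call
    recv_enter: float  # time the receiver entered the receive call


def late_sender_wait(msg: Message) -> float:
    """Time the receiver spent blocked because the send started later."""
    return max(0.0, msg.send_enter - msg.recv_enter)


def wait_time_per_process(messages: list[Message]) -> dict[int, float]:
    """Sum the late-sender waiting time for every receiving rank."""
    totals: dict[int, float] = {}
    for msg in messages:
        totals[msg.receiver] = totals.get(msg.receiver, 0.0) + late_sender_wait(msg)
    return totals


# Illustrative data only, not measurements from any real run.
trace = [
    Message(sender=0, receiver=1, send_enter=5.0, recv_enter=2.0),
    Message(sender=1, receiver=2, send_enter=6.0, recv_enter=6.5),
    Message(sender=0, receiver=2, send_enter=9.0, recv_enter=8.0),
]

print(wait_time_per_process(trace))  # {1: 3.0, 2: 1.0}
```

Because every message needs both timestamps, the trace grows with the number of communication events, which is the volume problem the abstract raises for long-running executions.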