- Parallel Computing and Optimization Techniques
- Distributed and Parallel Computing Systems
- Advanced Data Storage Technologies
- Cloud Computing and Resource Management
- Interconnection Networks and Systems
- Probabilistic and Robust Engineering Design
- Software System Performance and Reliability
- Software Engineering Research
- Lattice Boltzmann Simulation Studies
- Computational Physics and Python Applications
- Machine Learning and Data Classification
- Automotive and Human Injury Biomechanics
- Metaheuristic Optimization Algorithms Research
- Algorithms and Data Compression
- Earthquake and Tectonic Studies
- Low-power high-performance VLSI design
- Structural Response to Dynamic Loads
- Artificial Intelligence in Games
- Evolutionary Algorithms and Applications
- Logic, programming, and type systems
- Embedded Systems Design Techniques
- Distributed systems and fault tolerance
- Caching and Content Delivery
- Big Data Technologies and Applications
- Anomaly Detection Techniques and Applications
Hunan University
2019-2025
Argonne National Laboratory
2003-2024
University of Chicago
1991-2023
Yunnan Academy of Agricultural Sciences
2018
Texas A&M University
2006-2016
Mitchell Institute
2004-2016
Northwestern University
2000-2003
Beihang University
1998-2002
Louisiana State University
1999-2000
Institute of Computing Technology
2000
Abstract Solving large-scale inverse problems using deep-learning algorithms has become an essential part of modern research and industrial applications. The complexity of the underlying inverse problem may require the use of high-performance computing systems, which poses a challenge for the algorithmic design of the solver. Most deep-learning approaches require, due to their design, custom parallelization techniques in order to be resource efficient while showing reasonable convergence. In this paper we introduce...
Performance is an important issue with any application, especially grid applications. Efficient execution of applications requires insight into how the system features impact the performance of the application. This insight generally results from significant experimental analysis and possibly the development of performance models. This paper presents the Prophesy system, which supports novel component model development. In particular, it discusses the use of our coupling parameter (i.e., a metric that attempts to quantify the interaction between the kernels that compose...
Energy consumption is a major concern with high-performance multicore systems. In this paper, we explore the energy and performance (execution time) characteristics of different parallel implementations of scientific applications. In particular, our experiments focus on message-passing interface (MPI)-only versus hybrid MPI/OpenMP implementations of the NAS (NASA Advanced Supercomputing) BT (Block Tridiagonal) benchmark (strong scaling), a Lattice Boltzmann application, and the Gyrokinetic Toroidal Code — GTC (weak scaling), as well as central...
Energy-efficient scientific applications require insight into how high-performance computing system features impact the applications' power and performance. This insight results from the development of performance models. When used with an earthquake simulation and an aerospace application, the proposed modeling framework reduces energy consumption by up to 48.65 percent and 30.67 percent, respectively.
ABSTRACT As we enter the exascale computing era, efficiently utilizing power and optimizing the performance of scientific applications under power and energy constraints has become critical and challenging. We propose a low-overhead autotuning framework to autotune performance and energy for various hybrid MPI/OpenMP applications at large scales and to explore the tradeoffs between application runtime and power/energy-efficient execution; we then use this framework to autotune four ECP proxy applications: XSBench, AMG, SWFFT, and SW4lite. Our approach uses Bayesian optimization with...
Performance projections of high-performance computing (HPC) applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid in the design of future systems, enable vendors to compare application performance across different existing systems, and help users with system procurement and refinements. In this paper, we present a method for projecting node-level performance using published data from the industry-standard SPEC CFP2006 benchmarks and hardware counter data from one base machine. In particular, we project eight applications onto four systems utilizing...
Ytopt is a Python machine-learning-based autotuning software package developed within the ECP PROTEAS-TUNE project. ytopt adopts an asynchronous search framework that consists of sampling a small number of input parameter configurations and progressively fitting a surrogate model over the input-output space until exhausting the user-defined maximum number of evaluations or wall-clock time. libEnsemble is a toolkit for coordinating workflows of dynamic ensembles of calculations across massively parallel resources within the PETSc/TAO...
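The sample-then-refine loop described in the abstract above can be illustrated with a toy sketch. This is not the real ytopt API (which uses Bayesian optimization over a proper surrogate such as a random forest); the `autotune` function, the 1-nearest-neighbour surrogate, and all parameter names here are simplifications invented for illustration.

```python
import random

def autotune(objective, space, max_evals=20, n_init=5, seed=0):
    """Toy surrogate-guided search in the spirit of ytopt's loop:
    sample a few configurations, fit a crude surrogate over the
    input-output history, and let the surrogate pick what to
    evaluate next until the evaluation budget is exhausted."""
    rng = random.Random(seed)
    history = []  # list of (config, measured_runtime) pairs

    def sample():
        return tuple(rng.choice(vals) for vals in space)

    def surrogate(cfg):
        # 1-nearest-neighbour stand-in for a real surrogate model
        dist = lambda a, b: sum(x != y for x, y in zip(a, b))
        return min(history, key=lambda h: dist(h[0], cfg))[1]

    for _ in range(n_init):                 # initial random sampling
        c = sample()
        history.append((c, objective(c)))
    while len(history) < max_evals:         # surrogate-guided refinement
        cands = [sample() for _ in range(32)]
        best = min(cands, key=surrogate)    # most promising candidate
        history.append((best, objective(best)))
    return min(history, key=lambda h: h[1]), history
```

A user would pass an `objective` that compiles and times the code under a given configuration; here any function over the tuple of parameter values works.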
The NAS Parallel Benchmarks (NPB) are well-known applications with fixed algorithms for evaluating parallel systems and tools. Multicore supercomputers provide a natural programming paradigm for hybrid programs, whereby OpenMP can be used for data sharing within the multicores that comprise a node and MPI for communication between nodes. In this paper, we use the SP and BT benchmarks of NPB 3.3 as the basis for a comparative approach to implement hybrid MPI/OpenMP versions of SP and BT. In particular, we compare the performance of the hybrid versions with their MPI counterparts on large-scale...
Performance models provide significant insight into the performance relationships between an application and the system used for execution. A major obstacle to developing such models is the lack of knowledge about the interactions between the different functions that compose an application. This paper addresses this issue by using a coupling parameter, which quantifies the interaction between kernels, to develop performance predictions. The results, for three NAS parallel benchmarks, indicate that predictions using the coupling parameter were greatly improved over the traditional technique of summing execution...
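One plausible reading of the coupling-parameter idea, sketched below under stated assumptions: rather than predicting a composed application's runtime as a plain sum of its kernels' standalone times, each kernel's contribution is weighted by a coefficient capturing how it interacts with the kernel before it (below 1.0 for constructive sharing, e.g. warmed caches; above 1.0 for destructive interference). The function name, the pairwise-coefficient representation, and the numbers are illustrative, not the paper's actual formulation.

```python
def predict_runtime(kernel_times, coupling):
    """Coupling-aware runtime prediction (illustrative sketch).

    kernel_times : standalone execution time of each kernel, in order
    coupling     : {(i, j): c_ij} weight applied to kernel j when it
                   runs after kernel i; c_ij == 1.0 reduces this to
                   the traditional sum of standalone times
    """
    total = kernel_times[0]
    for i in range(1, len(kernel_times)):
        total += coupling[(i - 1, i)] * kernel_times[i]
    return total
```

With all coefficients at 1.0 the prediction degenerates to the naive sum, which is exactly the baseline the abstract says the coupling parameter improves upon.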
Understanding workload behavior plays an important role in performance studies. The growing complexity of applications and architectures has increased the gap among application developers, performance engineers, and hardware designers. To reduce this gap, we propose SKOPE, a SKeleton framework for Performance Exploration, which produces a descriptive model of a workload's semantics; the model can infer potential transformations and help users understand how workloads may interact with and adapt to emerging hardware. SKOPE...
Chip multiprocessors (CMPs) are widely used for high-performance computing. Further, these CMPs are being configured in a hierarchical manner to compose a node in a cluster system. A major challenge to be addressed is the efficient use of such systems for large-scale scientific applications. In this paper, we quantify the performance gap resulting from using different numbers of processors per node; this information provides a baseline for the amount of optimization needed when using all processors per node on CMP clusters. We conduct a detailed analysis to identify how...
Training scientific deep learning models requires the significant compute power of high-performance computing systems. In this paper, we analyze the performance characteristics of benchmarks from the exploratory research project CANDLE (Cancer Distributed Learning Environment) with a focus on the hyperparameters epochs, batch sizes, and learning rates. We present a parallel methodology that uses the distributed deep learning framework Horovod to parallelize the benchmarks. We then use scaling strategies for both epochs and batch size with a linear learning rate...
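The linear learning-rate scaling the abstract above alludes to is a widely used rule in distributed training: when the global batch size grows by a factor of k (e.g., because Horovod adds k workers), the learning rate is multiplied by k, usually with a warmup to avoid early divergence. The sketch below is a generic illustration of that rule, not code from the paper; the function and parameter names are made up.

```python
def scaled_lr(base_lr, base_batch, batch, step, warmup_steps=5):
    """Linear scaling rule with linear warmup (illustrative).

    The target rate grows proportionally with the global batch size;
    during the first `warmup_steps` steps the rate ramps linearly
    from base_lr up to the target to keep early training stable.
    """
    target = base_lr * (batch / base_batch)
    if step < warmup_steps:
        return base_lr + (target - base_lr) * step / warmup_steps
    return target
```

For example, going from a batch of 32 on one worker to a global batch of 256 across eight workers scales a base rate of 0.1 up to 0.8 once warmup completes.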
Abstract We develop the ytopt autotuning framework, which leverages Bayesian optimization to explore the parameter search space, and compare four different supervised learning methods within it to evaluate their effectiveness. We select six of the most complex PolyBench benchmarks and apply the newly developed LLVM Clang/Polly loop optimization pragmas to optimize them. We then use the framework to tune the pragma parameters to improve performance. The experimental results show that our approach outperforms the other compiling methods, providing the smallest execution time for syr2k,...
Efficient execution of applications requires insight into how the system features impact the performance of the application. For distributed systems, the task of gaining this insight is complicated by the complexity of the system features. This insight generally results from significant experimental analysis and possibly the development of performance models. This paper presents the Prophesy project, an infrastructure that aids in gaining the needed insight based upon experience. The core component is a relational database that allows for the recording of performance data, system features, and application details.
Journal Article: Performance Characteristics of Hybrid MPI/OpenMP Implementations of NAS Parallel Benchmarks SP and BT on Large-Scale Multicore Clusters. Xingfu Wu (Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843, USA; corresponding author: wuxf@cse.tamu.edu) and Valerie Taylor. The Computer Journal, Volume 55, Issue 2, February 2012, Pages 154–167.
Hardware performance counters are used as effective proxies to estimate power consumption and runtime. In this paper we present a counter-based modeling and optimization method and use it to model four metrics: runtime, system power, CPU power, and memory power. The counters that compose the models guide counter-based optimizations, which we explore with two large-scale scientific applications: an earthquake simulation and an aerospace application. We demonstrate the method on power-aware supercomputers, including Mira at Argonne National...
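Counter-based power modeling of the kind described above typically boils down to regressing measured power against counter rates. The minimal sketch below fits a one-counter ordinary-least-squares model; the real models in the paper combine several counters, and the counter choice, numbers, and function names here are all hypothetical.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for power ~ a * counter_rate + b.
    A single-feature stand-in for multi-counter regression models."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx  # slope, intercept

# Synthetic training samples: a counter rate (e.g., last-level-cache
# misses per second) paired with measured system power in watts.
rates = [1e6, 2e6, 3e6, 4e6]
watts = [50.0, 60.0, 70.0, 80.0]
a, b = fit_linear(rates, watts)
predict = lambda r: a * r + b   # estimate power for a new counter rate
```

Once fitted, such a model lets the counters that carry the most weight point at optimization targets (e.g., reducing cache misses to cut memory power), which is the spirit of the counter-guided optimizations the abstract mentions.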
Efficiently utilizing procured power and optimizing the performance of scientific applications under power and energy constraints are challenging. The HPC PowerStack defines a software stack for managing power on high-performance computing systems and standardizes the interfaces between the different components of the stack. This survey paper presents the findings of a working group focused on the end-to-end tuning of the PowerStack. First, we provide background on layer-specific efforts in terms of their high-level objectives, optimization goals,...