- Parallel Computing and Optimization Techniques
- Interconnection Networks and Systems
- Distributed and Parallel Computing Systems
- Cloud Computing and Resource Management
- Advanced Data Storage Technologies
- Software System Performance and Reliability
- Embedded Systems Design Techniques
- Software Engineering Research
- Data Visualization and Analytics
- Scientific Computing and Data Management
- Complex Network Analysis Techniques
- Advanced Neural Network Applications
- Distributed Systems and Fault Tolerance
- Advanced Memory and Neural Computing
- Protein Structure and Dynamics
- Stochastic Gradient Optimization Techniques
- Algorithms and Data Compression
- Matrix Theory and Algorithms
- Tensor Decomposition and Applications
- Software-Defined Networks and 5G
- Machine Learning and ELM
- Opinion Dynamics and Social Influence
- Caching and Content Delivery
- Manufacturing Process and Optimization
- Advanced Database Systems and Queries
University of Maryland, College Park
2019-2024
Nvidia (United Kingdom)
2024
Iowa State University
2023
University of Oregon
2023
Lawrence Livermore National Laboratory
2011-2022
Leibniz Supercomputing Centre
2019
University of Illinois Urbana-Champaign
2007-2011
Indian Institute of Technology Kanpur
2007
Parallel machines are becoming more complex with increasing core counts and heterogeneous architectures. However, the commonly used parallel programming models, C/C++ with MPI and/or OpenMP, make it difficult to write source code that is easily tuned for many targets. Newer language approaches attempt to ease this burden by providing optimization features such as automatic load balancing, overlap of computation and communication, message-driven execution, and implicit data layout optimizations. In this paper, we...
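The overlap of computation and communication mentioned above can be illustrated with a toy message-driven sketch (this uses Python's asyncio purely for illustration; it is not Charm++ or any of the runtimes the abstract refers to): post the receive first, compute while the message is in flight, then handle the message when it arrives.

```python
# Toy sketch of computation-communication overlap via message-driven
# execution. fake_recv is a hypothetical stand-in for an asynchronous
# message arrival; all values are fabricated.
import asyncio

async def fake_recv(value, delay):
    """Stand-in for a message arriving over the network after `delay` seconds."""
    await asyncio.sleep(delay)
    return value

async def worker():
    # Post the receive first, so the "message" travels while we compute...
    pending = asyncio.create_task(fake_recv(21, 0.01))
    local = sum(i * i for i in range(1000))  # overlapped local computation
    msg = await pending                      # ...then react when it arrives
    return msg * 2, local

result, local = asyncio.run(worker())
print(result)  # 42
```

A blocking receive would instead serialize the two phases; the message-driven style lets the runtime schedule whatever work is ready.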
Predictable performance is important for understanding and alleviating application performance issues; quantifying the effects of source code, compiler, or system software changes; estimating the time required for batch jobs; and determining the allocation requests for proposals. Our experiments show that on a Cray XE system, the execution time of communication-heavy parallel applications ranges from 28% faster to 41% slower than the average observed performance. Blue Gene systems, on the other hand, demonstrate no noticeable run-to-run variability. In this...
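Variability figures of this kind can be computed from repeated timings of the same job as deviations of the fastest and slowest runs from the mean. A minimal sketch (the timings below are fabricated, not data from the paper):

```python
# Hypothetical run-to-run timings (seconds) for the same job; illustrative only.
timings = [104.0, 96.5, 120.3, 98.1, 101.1]

mean_t = sum(timings) / len(timings)

# Deviation of the fastest and slowest runs from the mean, as percentages.
fastest_pct = (mean_t - min(timings)) / mean_t * 100  # % faster than average
slowest_pct = (max(timings) - mean_t) / mean_t * 100  # % slower than average

print(f"mean={mean_t:.1f}s, {fastest_pct:.1f}% faster to {slowest_pct:.1f}% slower")
```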
Parallel programs in high performance computing (HPC) continue to grow in complexity and scale in the exascale era. The diversity of hardware and parallel programming models makes developing, optimizing, and maintaining parallel software even more burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. Such tools can perform complex and/or remedial tasks for developers that increase their productivity and decrease the chance of error. Until recently, such tools for code have been...
Performance visualization comprises techniques that aid developers and analysts in improving the time and energy efficiency of their software. In this work, we discuss performance as it relates to visualization and survey existing approaches to performance visualization. We present an overview of what types of performance data can be collected and a categorization of the goals such visualizations address. We develop a taxonomy for the contexts in which different visualizations reside and describe the state of the art of research pertaining to each. Finally, we discuss unaddressed issues and future challenges...
Large language models are increasingly becoming a popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation, and lookup. However, they often struggle with complex programs. In this paper, we study the capabilities of state-of-the-art language models to generate parallel code. In order to evaluate language models, we create a benchmark, ParEval, consisting of prompts that represent 420 different coding tasks related to scientific...
NAMD (nanoscale molecular dynamics) is a production molecular dynamics (MD) application for biomolecular simulations that include assemblages of proteins, cell membranes, and water molecules. In a biomolecular simulation, the problem size is fixed and a large number of iterations must be executed in order to understand interesting biological phenomena. Hence, we need MD applications to scale to thousands of processors, even though an individual timestep on one processor is quite small. NAMD has demonstrated its performance on several parallel...
Molecular Dynamics applications enhance our understanding of biological phenomena through bio-molecular simulations. Large-scale parallelization of MD simulations is challenging because of the small number of atoms and the small time scales involved. Load balancing in parallel programs is crucial for good performance on large machines. This paper discusses the load balancing algorithms deployed in a code called NAMD. It focuses on new schemes for the load balancers and provides an analysis of the benefits achieved. Specifically, it presents a technique...
NAMD is a portable parallel application for biomolecular simulations. NAMD pioneered the use of hybrid spatial and force decomposition, a technique now used by most scalable programs for biomolecular simulations, including Blue Matter and Desmond, developed by IBM and D. E. Shaw respectively. NAMD has been developed using Charm++ and benefits from its adaptive communication-computation overlap and dynamic load balancing. This paper focuses on new scalability challenges in biomolecular simulations: much larger machines and simulating molecular systems with millions...
Large parallel machines with hundreds of thousands of processors are becoming more prevalent. Ensuring good load balance is critical for scaling certain classes of applications even on thousands of processors. Centralized load balancing algorithms suffer from scalability problems, especially on machines with a relatively small amount of memory. Fully distributed algorithms, on the other hand, tend to take longer to arrive at good solutions. In this paper, we present an automatic dynamic hierarchical load balancing method that overcomes the challenges of centralized...
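The hierarchical idea can be sketched in a few lines: a root level first distributes work across groups of cores, and then each group balances its own share independently, so no single balancer ever sees the whole machine. This is only a toy illustration using a greedy largest-processing-time heuristic, not the Charm++ implementation the abstract describes; the task costs are fabricated.

```python
import heapq

def greedy_assign(tasks, n_bins):
    """Assign task costs to n_bins greedily, largest cost first (LPT)."""
    heap = [(0.0, i) for i in range(n_bins)]  # (current load, bin index)
    bins = [[] for _ in range(n_bins)]
    for cost in sorted(tasks, reverse=True):
        load, i = heapq.heappop(heap)         # least-loaded bin so far
        bins[i].append(cost)
        heapq.heappush(heap, (load + cost, i))
    return bins

def hierarchical_balance(tasks, n_groups, cores_per_group):
    # Level 1: a root balancer distributes tasks across groups.
    group_tasks = greedy_assign(tasks, n_groups)
    # Level 2: each group balances its own tasks across its cores,
    # without global coordination (this is what bounds memory and cost).
    return [greedy_assign(g, cores_per_group) for g in group_tasks]

tasks = [5, 3, 8, 2, 7, 4, 6, 1]
placement = hierarchical_balance(tasks, n_groups=2, cores_per_group=2)
print(placement)
```

Because each level only sees its own subset of tasks, the memory and decision cost per balancer stay small even as the machine grows.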
A low-diameter, fast interconnection network is going to be a prerequisite for building exascale machines. A two-level direct network has been proposed by several groups as a scalable design for future machines. IBM's PERCS topology and the dragonfly topology discussed in the DARPA hardware study are examples of this design. The presence of multiple levels leads to hot-spots on a few links when processes are grouped together at the lowest level to minimize total communication volume. This is especially true for communication graphs with a small number of neighbors per task....
Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that, on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to use optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this...
With the continuous rise in complexity of modern supercomputers, optimizing the performance of large-scale parallel programs is becoming increasingly challenging. Simultaneously, the growth in scale magnifies the impact of even minor inefficiencies: potentially millions of compute hours and megawatts of power consumption can be wasted on avoidable mistakes or sub-optimal algorithms. This makes performance analysis and optimization critical elements of the software development process. One of the most common forms of performance analysis is to study execution traces, which...
Interconnection networks are a critical resource for large supercomputers. The dragonfly topology, which provides a low network diameter and high bisection bandwidth, is being explored as a promising option for building multi-Petaflop/s and Exaflop/s systems. Unlike the extensively studied torus networks, the best choices of message routing and job placement strategies for this topology are not well understood. This paper aims at analyzing the behavior of a machine built using this topology for various routing strategies, job placement policies, and application communication...
The poor performance of transformers on arithmetic tasks seems to stem in large part from their inability to keep track of the exact position of each digit inside a span of digits. We mend this problem by adding an embedding to each digit that encodes its position relative to the start of the number. In addition to the boost these embeddings provide on their own, we show that this fix enables architectural modifications such as input injection and recurrent layers to improve performance even further. With the positions resolved, we can study the logical extrapolation ability of transformers....
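The core of the fix is computing, for each digit token, its offset from the start of the number it belongs to; that offset then indexes a learned embedding added to the token embedding. A simplified sketch of the offset computation (the actual method in the paper also involves details, such as training-time offsets, that are omitted here):

```python
# Sketch of digit-position indices: each digit is tagged with its offset
# inside its run of digits; non-digit tokens reset the counter. In the full
# method each index selects a learned vector added to the token embedding.
def digit_position_ids(tokens):
    """Return, for each token, its offset within a run of digits (0 otherwise)."""
    ids, offset = [], 0
    for tok in tokens:
        if tok.isdigit():
            ids.append(offset)
            offset += 1
        else:
            ids.append(0)
            offset = 0
    return ids

print(digit_position_ids(list("12+345=")))  # [0, 1, 0, 0, 1, 2, 0]
```

With these indices, the model can tell the hundreds digit from the tens digit regardless of where the number sits in the sequence, which is exactly the information plain positional encodings fail to expose.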
Network contention has a significantly adverse effect on the performance of parallel applications, and it grows with the increasing size of machines. Machines of the petascale era are forcing application developers to map tasks intelligently to job partitions to achieve the best performance possible. This paper presents a framework for the automated mapping of regular communication graphs to two and three dimensional mesh and torus networks. It will save much effort on the part of developers to generate mappings for their individual applications. One component of the framework is a process topology...
The performance of massively parallel applications is often heavily impacted by the cost of communication among compute nodes. However, determining how to best use the network is a formidable task, made challenging by the ever increasing size and complexity of modern supercomputers. This paper applies visualization techniques to aid application developers in understanding network activity by enabling detailed exploration of the flow of packets through the hardware interconnect. In order to visualize this large and complex data, we employ two...
Large parallel machines with hundreds of thousands of processors are being built. Recent studies have shown that ensuring good load balance is critical for scaling certain classes of applications even on thousands of processors. Centralized load balancing algorithms suffer from scalability problems, especially on machines with a relatively small amount of memory. Fully distributed algorithms, on the other hand, tend to yield poor load balance on very large machines. In this paper, we present an automatic dynamic hierarchical method that overcomes the challenges...
Task mapping on torus networks has traditionally focused on either reducing the maximum dilation or the average number of hops per byte for messages in an application. These metrics make simplified assumptions about the causes of network congestion, and do not provide an accurate correlation with execution time. Hence, these metrics cannot be used to reasonably predict or compare application performance for different mappings. In this paper, we attempt to model performance using communication data, such as the communication graph and hardware counters. We use...
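The two traditional metrics named above are easy to compute from a mapping: dilation is the hop distance a message travels, and hops per byte weights those distances by message volume. A minimal sketch on a 2D torus (the communication graph, mapping, and byte counts below are fabricated for illustration):

```python
def torus_hops(a, b, dims):
    """Minimal hop distance between coordinates a and b on a torus of size dims."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def mapping_metrics(comm, mapping, dims):
    """comm: {(task_i, task_j): bytes}; mapping: task -> torus coordinate.

    Returns (max dilation in hops, average hops per byte): the two classic
    mapping metrics that, as the paper argues, are insufficient on their own.
    """
    hops = {p: torus_hops(mapping[p[0]], mapping[p[1]], dims) for p in comm}
    total_bytes = sum(comm.values())
    hop_bytes = sum(h * comm[p] for p, h in hops.items())
    return max(hops.values()), hop_bytes / total_bytes

comm = {(0, 1): 1000, (1, 2): 500, (0, 2): 250}   # fabricated byte counts
mapping = {0: (0, 0), 1: (0, 1), 2: (3, 1)}        # fabricated placement
dims = (4, 4)
max_dilation, avg_hops_per_byte = mapping_metrics(comm, mapping, dims)
print(max_dilation, avg_hops_per_byte)
```

Note that neither number says anything about when messages share a link, which is why such metrics can rank mappings differently than measured execution time.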
Network congestion is one of the primary causes of performance degradation, variability and poor scaling in communication-heavy parallel applications. However, the mechanisms of network congestion on modern interconnection networks are not well understood. We need new approaches to analyze, model and predict this critical behaviour in order to improve the performance of large-scale applications. This paper applies supervised learning algorithms, such as forests of extremely randomized trees and gradient boosted regression trees, to perform regression analysis on communication...
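The shape of the regression problem (features derived from network counters, target equal to observed execution time) can be shown with a tiny dependency-free example. The paper uses tree ensembles; a k-nearest-neighbour regressor stands in here only to keep the sketch self-contained, and all numbers are fabricated.

```python
# Minimal supervised-learning sketch: predict execution time from network
# hardware counters. k-NN regression stands in for the tree ensembles
# (extra-trees, gradient boosted trees) used in the paper.
def knn_predict(train_X, train_y, x, k=3):
    """Average the targets of the k training points nearest to x."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), y)
        for row, y in zip(train_X, train_y)
    )
    return sum(y for _, y in dists[:k]) / k

# (avg network stalls, avg hops) -> observed execution time (s); fabricated data
train_X = [(0.1, 2.0), (0.2, 2.5), (0.8, 5.0), (0.9, 5.5), (0.5, 3.5)]
train_y = [10.0, 12.0, 30.0, 33.0, 20.0]
pred = knn_predict(train_X, train_y, (0.85, 5.2))
print(pred)
```

Once trained, such a model can also be inspected (e.g. via feature importances in the tree-ensemble case) to ask which counters explain congestion-induced slowdowns.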
Tuning application parameters for optimal performance is a challenging combinatorial problem. Hence, techniques for modeling the functional relationships between various input features in the parameter space and application performance are important. We show that simple statistical inference techniques are inadequate to capture these relationships. Even with more complex ensembles of models, the minimum coverage required via experimental observations is still quite large. We propose a deep learning based approach that can combine information from...
The dragonfly topology is a popular choice for building high-radix, low-diameter, hierarchical networks with high-bandwidth links. On Cray installations of the network, job placement policies and routing inefficiencies can lead to significant network congestion for single jobs and multi-job workloads. In this paper, we explore the effects of job placement, parallel workloads and network configurations on network health to develop a better understanding of inter-job interference. We have developed a functional simulator, Damselfly, to model...
This paper presents an evaluation and comparison of three topologies that are popular for building interconnection networks in large-scale supercomputers: torus, fat-tree, and dragonfly. To perform this evaluation, we propose a comprehensive methodology and present a scalable packet-level network simulator, TraceR. Our methodology includes the design of prototype systems being evaluated, the use of proxy applications to determine computation and communication load, simulating individual and multi-job workloads, and computing aggregated...