- Parallel Computing and Optimization Techniques
- Distributed and Parallel Computing Systems
- Cloud Computing and Resource Management
- Advanced Data Storage Technologies
- Scheduling and Optimization Algorithms
- Scientific Computing and Data Management
- Distributed Systems and Fault Tolerance
- Neural Networks and Applications
- Software System Performance and Reliability
- Solar Radiation and Photovoltaics
- Advanced Neural Network Applications
- Meteorological Phenomena and Simulations
- Model Reduction and Neural Networks
- Graph Theory and Algorithms
- Image Enhancement Techniques
- Interconnection Networks and Systems
- Probabilistic and Robust Engineering Design
- Advanced Software Engineering Methodologies
- Network Packet Processing and Optimization
- Non-Destructive Testing Techniques
- Gaussian Processes and Bayesian Inference
- Machine Learning and Data Classification
- Radiation Effects in Electronics
- Machine Learning in Materials Science
- Stochastic Gradient Optimization Techniques
Nvidia (United States)
2017-2023
Hewlett Packard Enterprise (United States)
2019
Universitat Politècnica de Catalunya
2019
Barcelona Supercomputing Center
2019
Stanford University
2011-2018
Nvidia (United Kingdom)
2018
Stanford Medicine
2015-2018
Modern parallel architectures have both heterogeneous processors and deep, complex memory hierarchies. We present Legion, a programming model and runtime system for achieving high performance on these machines. Legion is organized around logical regions, which express both locality and independence of program data, and tasks, functions that perform computations on regions. We describe a runtime system that dynamically extracts parallelism from Legion programs, using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested...
We extract pixel-level masks of extreme weather patterns using variants of Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and the network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained throughput of 21.0 PF/s and parallel efficiency of 79.0%. DeepLabv3+ scales up to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision. By taking advantage of the FP16 Tensor Cores, a half-precision version of the DeepLabv3+ network achieves a peak and sustained throughput of 1.13...
We present Regent, a high-productivity programming language for high performance computing with logical regions. Regent users compose programs with tasks (functions eligible for parallel execution) and logical regions (hierarchical collections of structured objects). Regent programs appear to execute sequentially, require no explicit synchronization, and are trivially deadlock-free. Regent's type system catches many common classes of mistakes and guarantees that a program with correct serial execution produces identical results on...
We present Realm, an event-based runtime system for heterogeneous, distributed memory machines. Realm is fully asynchronous: all runtime actions are non-blocking. Realm supports spawning computations, moving data, and reservations, a novel synchronization primitive. Asynchrony is exposed via a light-weight event system capable of operating without central management.
Applications written for distributed-memory parallel architectures must partition their data to enable parallel execution. As memory hierarchies become deeper, it is increasingly necessary that the data partitioning also be hierarchical to match. Current language proposals perform this hierarchical partitioning statically, which excludes many important applications where the appropriate partitioning is itself data dependent and so must be computed dynamically. We describe Legion, a region-based programming system, where each region may be partitioned into subregions....
We present Singe, a Domain Specific Language (DSL) compiler for combustion chemistry that leverages warp specialization to produce high performance code for GPUs. Instead of relying on traditional GPU programming models that emphasize data-parallel computations, warp specialization allows compilers like Singe to partition computations into sub-computations which are then assigned to different warps within a thread block. Fine-grain synchronization between warps is performed efficiently in hardware using producer-consumer named...
Uncertainty quantification for forward and inverse problems is a central challenge across physical and biomedical disciplines. We address this challenge for the problem of modeling subsurface flow at the Hanford Site by combining stochastic computational models with observational data using physics-informed GAN models. The geographic extent, spatial heterogeneity, and multiple correlation length scales require training a computationally intensive model to thousands of dimensions. We develop a highly optimized implementation...
We introduce novel communication strategies in synchronous distributed Deep Learning consisting of decentralized gradient reduction orchestration and computational graph-aware grouping of gradient tensors. These new techniques produce an optimal overlap between computation and communication and result in near-linear scaling (0.93) of distributed training up to 27,600 NVIDIA V100 GPUs on the Summit Supercomputer. We demonstrate our gradient reduction techniques in the context of training a Fully Convolutional Neural Network to approximate the solution of a longstanding scientific inverse problem in materials...
Applications on modern supercomputers are increasingly limited by the cost of data movement, but mainstream programming systems have few abstractions for describing the structure of a program's data. Consequently, the burden of managing data movement, placement, and layout currently falls primarily upon the programmer. To address this problem we previously proposed a data model based on logical regions and described Legion, a programming system incorporating logical regions. In this paper, we present structure slicing, which incorporates fields into the logical region data model. We show that...
Many recent programming systems for both supercomputing and data center workloads generate task graphs to express computations that run on parallel and distributed machines. Due to the overhead associated with constructing these graphs, the dependence analysis that generates them is often statically computed and memoized, and the resulting graph executed repeatedly at runtime. However, many applications require a dynamic dependence analysis due to data dependent behavior, but there are new challenges in capturing and re-executing task graphs at runtime. In this work, we introduce...
We explore the use of asynchronous many-task (AMT) programming models for the implementation of in situ analysis towards the goal of maximizing programmer productivity and overall performance on next generation platforms. We describe how a broad class of statistics algorithms can be transformed from a traditional single-program multiple-data (SPMD) implementation to an AMT implementation, demonstrating with a concrete example: a measurement of descriptive statistics implemented in Legion. Our experiments quantify the benefit and possible drawbacks of this...
A key problem in parallel programming is how data is partitioned: divided into subsets that can be operated on in parallel and, in distributed memory machines, spread across multiple address spaces.
We present control replication, a technique for generating high-performance and scalable SPMD code from implicitly parallel programs. In contrast to traditional parallel programming models that require the programmer to explicitly manage threads and the communication and synchronization between them, implicitly parallel programs have sequential execution semantics and naturally avoid the pitfalls of explicitly parallel code. However, without optimizations to distribute control overhead, scalability is often poor.
Uncertainty quantification for forward and inverse problems is a central challenge across physical and biomedical disciplines. We address this challenge for the problem of modeling subsurface flow at the Hanford Site by combining stochastic computational models with observational data using physics-informed GAN models. The geographic extent, spatial heterogeneity, and multiple correlation length scales require training a computationally intensive model to thousands of dimensions. We develop a hierarchical scheme exploiting...
Researchers built the EcoG GPU-based cluster to show that a system can be designed around GPU computing and still be power efficient.
The rapid growth in simulation data requires large-scale parallel implementations of scientific analysis and visualization algorithms, both to produce results within an acceptable timeframe and to enable in situ deployment. However, efficient and scalable implementations, especially of more complex analysis approaches, require not only advanced algorithms but also an in-depth knowledge of the underlying runtime. Furthermore, different machine configurations and different applications may favor different runtimes, i.e., MPI vs Charm++ vs Legion, etc., and different hardware...
Implicitly parallel programming systems must solve the joint problems of dependence analysis and coherence to ensure apparently-sequential semantics for applications run on distributed memory machines. Solving these problems in the presence of data-dependent control flow and arbitrary aliasing is a challenge that most existing systems eschew by compromising the expressivity of their programming models and/or the performance of their implementations. We demonstrate a general class of solutions to these problems via a reduction to the visibility problem from computer graphics.
Sparse tensor algorithms are becoming widespread, particularly in the domains of deep learning, graph and data analytics, and scientific computing. Current high-performance broad-domain architectures, such as GPUs, often suffer memory system inefficiencies by moving too much data or moving it too far through the memory hierarchy. To increase performance and efficiency, proposed domain-specific accelerators tailor their architectures to the data needs of a narrow application domain, but as a result cannot be applied to a wide range of applications...