NFDI4DS | UHH-SEMS - Publication Details

Legion: Expressing locality and independence with logical regions

OPENALEX - Publications

Michael Bauer Sean Treichler Elliott Slaughter Alex Aiken

Modern parallel architectures have both heterogeneous processors and deep, complex memory hierarchies. We present Legion, a programming model and runtime system for achieving high performance on these machines. Legion is organized around logical regions, which express both locality and independence of program data, and tasks, functions that perform computations on regions. We describe a runtime system that dynamically extracts parallelism from Legion programs, using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested...

10.1109/sc.2012.71 article EN International Conference for High Performance Computing, Networking, Storage and Analysis 2012-11-01

Legion: expressing locality and independence with logical regions

Michael Bauer Sean Treichler Elliott Slaughter Alex Aiken

Modern parallel architectures have both heterogeneous processors and deep, complex memory hierarchies. We present Legion, a programming model and runtime system for achieving high performance on these machines. Legion is organized around logical regions, which express both locality and independence of program data, and tasks, functions that perform computations on regions. We describe a runtime system that dynamically extracts parallelism from Legion programs, using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested...

10.5555/2388996.2389086 article EN IEEE International Conference on High Performance Computing, Data, and Analytics 2012-11-10

Exascale Deep Learning for Climate Analytics

Thorsten Kurth Sean Treichler Joshua Romero Mayur Mudigonda Nathan Luehr and 7 more

We extract pixel-level masks of extreme weather patterns using variants of Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and the network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained throughput of 21.0 PF/s and parallel efficiency of 79.0%. DeepLabv3+ scales up to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision. By taking advantage of the FP16 Tensor Cores, a half-precision version of the DeepLabv3+ network achieves a peak and sustained throughput of 1.13...

10.1109/sc.2018.00054 preprint EN 2018-11-01

Regent

Elliott Slaughter Wonchan Lee Sean Treichler Michael Bauer Alex Aiken

We present Regent, a high-productivity programming language for high performance computing with logical regions. Regent users compose programs with tasks (functions eligible for parallel execution) and logical regions (hierarchical collections of structured objects). Regent programs appear to execute sequentially, require no explicit synchronization, and are trivially deadlock-free. Regent's type system catches many common classes of mistakes and guarantees that a program with correct serial execution produces identical results on...

10.1145/2807591.2807629 article EN 2015-10-27

Exascale deep learning for climate analytics

Thorsten Kurth Sean Treichler Joshua Romero Mayur Mudigonda Nathan Luehr and 7 more

We extract pixel-level masks of extreme weather patterns using variants of Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and the network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained throughput of 21.0 PF/s and parallel efficiency of 79.0%. DeepLabv3+ scales up to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision. By taking advantage of the FP16 Tensor Cores, a half-precision version of the DeepLabv3+ network achieves a peak and sustained throughput of 1.13...

10.5555/3291656.3291724 article EN arXiv (Cornell University) 2018-11-11

Realm

Sean Treichler Michael Bauer Alex Aiken

We present Realm, an event-based runtime system for heterogeneous, distributed memory machines. Realm is fully asynchronous: all runtime actions are non-blocking. Realm supports spawning computations, moving data, and reservations, a novel synchronization primitive. Asynchrony is exposed via a light-weight event system capable of operating without central management.

10.1145/2628071.2628084 article EN 2014-08-21

Language support for dynamic, hierarchical data partitioning

Sean Treichler Michael Bauer Alex Aiken

Applications written for distributed-memory parallel architectures must partition their data to enable parallel execution. As memory hierarchies become deeper, it is increasingly necessary that the data partitioning also be hierarchical to match. Current language proposals perform this hierarchical partitioning statically, which excludes many important applications where the appropriate partitioning is itself data dependent and so must be computed dynamically. We describe Legion, a region-based programming system, where each region may be partitioned into subregions....

10.1145/2509136.2509545 article EN 2013-10-23

Singe

Michael Bauer Sean Treichler Alex Aiken

We present Singe, a Domain Specific Language (DSL) compiler for combustion chemistry that leverages warp specialization to produce high performance code for GPUs. Instead of relying on traditional GPU programming models that emphasize data-parallel computations, warp specialization allows compilers like Singe to partition computations into sub-computations which are then assigned to different warps within a thread block. Fine-grain synchronization between warps is performed efficiently in hardware using producer-consumer named...

10.1145/2555243.2555258 article EN 2014-02-06

Highly-scalable, Physics-Informed GANs for Learning Solutions of Stochastic PDEs

Liu Yang Mr Prabhat George Em Karniadakis Sean Treichler Thorsten Kurth and 6 more

Uncertainty quantification for forward and inverse problems is a central challenge across physical and biomedical disciplines. We address this challenge for the problem of modeling subsurface flow at the Hanford Site by combining stochastic computational models with observational data using physics-informed GAN models. The geographic extent, spatial heterogeneity, and multiple correlation length scales of the Hanford Site require training a computationally intensive GAN model to thousands of dimensions. We develop a highly optimized implementation...

10.1109/dls49591.2019.00006 article EN 2019-11-01

Exascale Deep Learning for Scientific Inverse Problems

Nouamane Laanait Joshua Romero Junqi Yin M. Todd Young Sean Treichler and 4 more

We introduce novel communication strategies in synchronous distributed Deep Learning consisting of decentralized gradient reduction orchestration and computational graph-aware grouping of gradient tensors. These new techniques produce an optimal overlap between computation and communication and result in near-linear scaling (0.93) of distributed training up to 27,600 NVIDIA V100 GPUs on the Summit Supercomputer. We demonstrate our gradient reduction techniques in the context of training a Fully Convolutional Neural Network to approximate the solution of a longstanding scientific inverse problem in materials...

10.48550/arxiv.1909.11150 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Structure Slicing: Extending Logical Regions with Fields

Michael Bauer Sean Treichler Elliott Slaughter Alex Aiken

Applications on modern supercomputers are increasingly limited by the cost of data movement, but mainstream programming systems have few abstractions for describing the structure of a program's data. Consequently, the burden of managing data movement, placement, and layout currently falls primarily upon the programmer. To address this problem we previously proposed a data model based on logical regions and described Legion, a programming system incorporating logical regions. In this paper, we present structure slicing, which incorporates fields into the logical region data model. We show that...

10.1109/sc.2014.74 article EN 2014-11-01

Dynamic Tracing: Memoization of Task Graphs for Dynamic Task-Based Runtimes

Wonchan Lee Elliott Slaughter Michael Bauer Sean Treichler Todd Warszawski and 2 more

Many recent programming systems for both supercomputing and data center workloads generate task graphs to express computations that run on parallel and distributed machines. Due to the overhead associated with constructing these graphs, the dependence analysis that generates them is often statically computed and memoized, and the resulting graph executed repeatedly at runtime. However, many applications require a dynamic dependence analysis due to data dependent behavior, but there are new challenges in capturing and re-executing task graphs at runtime. In this work, we introduce...

10.1109/sc.2018.00037 article EN 2018-11-01

Towards Asynchronous Many-Task in Situ Data Analysis Using Legion

Philippe Pébaÿ Janine Camille Bennett David S Hollman Sean Treichler Patrick McCormick and 3 more

We explore the use of asynchronous many-task (AMT) programming models for the implementation of in situ analysis towards the goal of maximizing programmer productivity and overall performance on next generation platforms. We describe how a broad class of statistics algorithms can be transformed from a traditional single-program multiple-data (SPMD) implementation to an AMT implementation, demonstrating with a concrete example: a measurement of descriptive statistics implemented in Legion. Our experiments quantify the benefit and possible drawbacks of this...

10.1109/ipdpsw.2016.24 article EN 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 2016-05-01

Dependent partitioning

Sean Treichler Michael Bauer Rahul Sharma Elliott Slaughter Alex Aiken

A key problem in parallel programming is how data is partitioned: divided into subsets that can be operated on in parallel and, in distributed memory machines, spread across multiple address spaces.

10.1145/2983990.2984016 article EN 2016-10-19

Language support for dynamic, hierarchical data partitioning

Sean Treichler Michael Bauer Alex Aiken

Applications written for distributed-memory parallel architectures must partition their data to enable parallel execution. As memory hierarchies become deeper, it is increasingly necessary that the data partitioning also be hierarchical to match. Current language proposals perform this hierarchical partitioning statically, which excludes many important applications where the appropriate partitioning is itself data dependent and so must be computed dynamically. We describe Legion, a region-based programming system, where each region may be partitioned into subregions....

10.1145/2544173.2509545 article EN ACM SIGPLAN Notices 2013-10-29

Control replication

Elliott Slaughter Wonchan Lee Sean Treichler Wen Zhang Michael Bauer and 3 more

We present control replication, a technique for generating high-performance and scalable SPMD code from implicitly parallel programs. In contrast to traditional parallel programming models that require the programmer to explicitly manage threads and the communication and synchronization between them, implicitly parallel programs have sequential execution semantics and naturally avoid the pitfalls of explicitly parallel code. However, without optimizations to distribute control overhead, scalability is often poor.

10.1145/3126908.3126949 article EN 2017-11-08

Highly-scalable, physics-informed GANs for learning solutions of stochastic PDEs

Liu Yang Sean Treichler Thorsten Kurth Keno Fischer David A. Barajas‐Solano and 6 more

Uncertainty quantification for forward and inverse problems is a central challenge across physical and biomedical disciplines. We address this challenge for the problem of modeling subsurface flow at the Hanford Site by combining stochastic computational models with observational data using physics-informed GAN models. The geographic extent, spatial heterogeneity, and multiple correlation length scales of the Hanford Site require training a computationally intensive GAN model to thousands of dimensions. We develop a hierarchical scheme for exploiting...

10.48550/arxiv.1910.13444 preprint EN other-oa arXiv (Cornell University) 2019-01-01

EcoG: A Power-Efficient GPU Cluster Architecture for Scientific Computing

Mike Showerman Jeremy Enos Craig Steffen Sean Treichler William Gropp and 1 more

Researchers built the EcoG GPU-based cluster to show that a system can be designed around GPU computing and still be power efficient.

10.1109/mcse.2011.30 article EN Computing in Science & Engineering 2011-03-01

BabelFlow: An Embedded Domain Specific Language for Parallel Analysis and Visualization

Steve Petruzza Sean Treichler Valerio Pascucci Peer‐Timo Bremer

The rapid growth in simulation data requires large-scale parallel implementations of scientific analysis and visualization algorithms, both to produce results within an acceptable timeframe and to enable in situ deployment. However, efficient and scalable implementations, especially of more complex analysis approaches, require not only advanced algorithms but also an in-depth knowledge of the underlying runtime. Furthermore, different machine configurations and different applications may favor different runtimes, i.e., MPI vs Charm++ vs Legion, etc., and different hardware...

10.1109/ipdps.2018.00056 article EN 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2018-05-01

Singe

Michael Bauer Sean Treichler Alex Aiken

We present Singe, a Domain Specific Language (DSL) compiler for combustion chemistry that leverages warp specialization to produce high performance code for GPUs. Instead of relying on traditional GPU programming models that emphasize data-parallel computations, warp specialization allows compilers like Singe to partition computations into sub-computations which are then assigned to different warps within a thread block. Fine-grain synchronization between warps is performed efficiently in hardware using producer-consumer named...

10.1145/2692916.2555258 article EN ACM SIGPLAN Notices 2014-02-06

Visibility Algorithms for Dynamic Dependence Analysis and Distributed Coherence

Michael Bauer Elliott Slaughter Sean Treichler Wonchan Lee Michael Garland and 1 more

Implicitly parallel programming systems must solve the joint problems of dependence analysis and coherence to ensure apparently-sequential semantics for applications run on distributed memory machines. Solving these problems in the presence of data-dependent control flow and arbitrary aliasing is a challenge that most existing systems eschew by compromising the expressivity of their programming models and/or the performance of their implementations. We demonstrate a general class of solutions to these problems via a reduction to the visibility problem from computer graphics.

10.1145/3572848.3577515 article EN cc-by-nc-nd 2023-02-21

Symphony: Orchestrating Sparse and Dense Tensors with Hierarchical Heterogeneous Processing

Michael Pellauer Jason Clemons Vignesh Balaji Neal Crago Aamer Jaleel and 7 more

Sparse tensor algorithms are becoming widespread, particularly in the domains of deep learning, graph and data analytics, and scientific computing. Current high-performance broad-domain architectures, such as GPUs, often suffer memory system inefficiencies by moving too much data or moving it too far through the memory hierarchy. To increase performance and efficiency, proposed domain-specific accelerators tailor their architectures to the data needs of a narrow application domain, but as a result cannot be applied to a wide range of applications...

10.1145/3630007 article EN ACM Transactions on Computer Systems 2023-10-27