Sean Treichler

ORCID: 0000-0003-2189-4026
Research Areas
  • Parallel Computing and Optimization Techniques
  • Distributed and Parallel Computing Systems
  • Cloud Computing and Resource Management
  • Advanced Data Storage Technologies
  • Scheduling and Optimization Algorithms
  • Scientific Computing and Data Management
  • Distributed Systems and Fault Tolerance
  • Neural Networks and Applications
  • Software System Performance and Reliability
  • Solar Radiation and Photovoltaics
  • Advanced Neural Network Applications
  • Meteorological Phenomena and Simulations
  • Model Reduction and Neural Networks
  • Graph Theory and Algorithms
  • Image Enhancement Techniques
  • Interconnection Networks and Systems
  • Probabilistic and Robust Engineering Design
  • Advanced Software Engineering Methodologies
  • Network Packet Processing and Optimization
  • Non-Destructive Testing Techniques
  • Gaussian Processes and Bayesian Inference
  • Machine Learning and Data Classification
  • Radiation Effects in Electronics
  • Machine Learning in Materials Science
  • Stochastic Gradient Optimization Techniques

Nvidia (United States)
2017-2023

Hewlett Packard Enterprise (United States)
2019

Universitat Politècnica de Catalunya
2019

Barcelona Supercomputing Center
2019

Stanford University
2011-2018

Nvidia (United Kingdom)
2018

Stanford Medicine
2015-2018

Modern parallel architectures have both heterogeneous processors and deep, complex memory hierarchies. We present Legion, a programming model and runtime system for achieving high performance on these machines. Legion is organized around logical regions, which express both locality and independence of program data, and tasks, functions that perform computations on regions. We describe a runtime that dynamically extracts parallelism from Legion programs, using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested...

10.1109/sc.2012.71 article EN International Conference for High Performance Computing, Networking, Storage and Analysis 2012-11-01
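The region/task model in the abstract above can be caricatured in a few lines. The sketch below is illustrative only, with hypothetical names (not the real Legion C++ API): tasks declare which regions they read and write, and the runtime can treat two tasks as independent when neither writes data the other touches.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    name: str

@dataclass
class Task:
    name: str
    reads: set
    writes: set

def independent(a, b):
    # Two tasks may run in parallel if neither writes a region the other touches.
    return (not (a.writes & (b.reads | b.writes))
            and not (b.writes & (a.reads | a.writes)))

cells, faces = Region("cells"), Region("faces")
flux   = Task("flux",   reads={cells}, writes={faces})
smooth = Task("smooth", reads={cells}, writes={cells})
stats  = Task("stats",  reads={cells}, writes=set())

print(independent(flux, stats))   # True: only shared access is a read of cells
print(independent(flux, smooth))  # False: smooth writes cells, which flux reads
```

A real implementation performs this analysis dynamically and in a distributed fashion; this toy only shows the pairwise independence test at the heart of it.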

Modern parallel architectures have both heterogeneous processors and deep, complex memory hierarchies. We present Legion, a programming model and runtime system for achieving high performance on these machines. Legion is organized around logical regions, which express both locality and independence of program data, and tasks, functions that perform computations on regions. We describe a runtime that dynamically extracts parallelism from Legion programs, using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested...

10.5555/2388996.2389086 article EN IEEE International Conference on High Performance Computing, Data, and Analytics 2012-11-10

We extract pixel-level masks of extreme weather patterns using variants of the Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained throughput of 21.0 PF/s and a parallel efficiency of 79.0%. DeepLabv3+ scales up to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision. By taking advantage of the FP16 Tensor Cores, a half-precision version achieves a peak of 1.13...

10.1109/sc.2018.00054 preprint EN 2018-11-01

We present Regent, a high-productivity programming language for high performance computing with logical regions. Regent users compose programs from tasks (functions eligible for parallel execution) and logical regions (hierarchical collections of structured objects). Regent programs appear to execute sequentially, require no explicit synchronization, and are trivially deadlock-free. Regent's type system catches many common classes of mistakes and guarantees that a program with correct serial execution produces identical results on...

10.1145/2807591.2807629 article EN 2015-10-27

We extract pixel-level masks of extreme weather patterns using variants of the Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained throughput of 21.0 PF/s and a parallel efficiency of 79.0%. DeepLabv3+ scales up to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision. By taking advantage of the FP16 Tensor Cores, a half-precision version achieves a peak of 1.13...

10.5555/3291656.3291724 article EN arXiv (Cornell University) 2018-11-11

We present Realm, an event-based runtime system for heterogeneous, distributed memory machines. Realm is fully asynchronous: all runtime actions are non-blocking. Realm supports spawning computations, moving data, and reservations, a novel synchronization primitive. Asynchrony is exposed via a light-weight event system capable of operating without central management.

10.1145/2628071.2628084 article EN 2014-08-21
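The "everything returns an event" style described above can be sketched minimally. This is a hypothetical single-threaded simulation, not the real Realm API: each spawn returns a completion event immediately, and an operation may name an earlier event as its precondition.

```python
# Toy event graph in the spirit of an event-based runtime (illustrative only).
class Event:
    def __init__(self):
        self.triggered = False
        self.waiters = []
    def trigger(self):
        self.triggered = True
        for fn in self.waiters:
            fn()
        self.waiters.clear()
    def then(self, fn):
        # Run fn now if already triggered, otherwise defer it.
        if self.triggered:
            fn()
        else:
            self.waiters.append(fn)

def spawn(fn, precondition=None):
    """Non-blocking: returns a completion event immediately."""
    done = Event()
    def run():
        fn()
        done.trigger()
    if precondition is None:
        run()
    else:
        precondition.then(run)
    return done

log = []
e1 = spawn(lambda: log.append("copy"))              # e.g. a data movement
e2 = spawn(lambda: log.append("compute"), precondition=e1)
print(log)  # ['copy', 'compute']
```

The point of the design is that the caller never blocks: dependencies are expressed as event preconditions, so the runtime can chain operations without any central coordinator.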

Applications written for distributed-memory parallel architectures must partition their data to enable parallel execution. As memory hierarchies become deeper, it is increasingly necessary that the data partitioning also be hierarchical to match. Current language proposals perform this partitioning statically, which excludes many important applications where the appropriate partitioning is itself data dependent and so must be computed dynamically. We describe Legion, a region-based programming system, in which each region may be partitioned into subregions...

10.1145/2509136.2509545 article EN 2013-10-23
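Data-dependent, hierarchical partitioning as described above can be illustrated with a toy coloring function. This sketch uses invented names (not real Legion syntax): the subregion an index lands in is computed from the data itself, and a subregion can be partitioned again to match a deeper memory level.

```python
values = [-3, 5, -1, 2]

def partition(index_space, color_of):
    # Group indices into subregions by a data-dependent "color".
    subregions = {}
    for i in index_space:
        subregions.setdefault(color_of(i), set()).add(i)
    return subregions

# Top-level partition: the color of each index depends on the data.
top = partition(range(len(values)), lambda i: "neg" if values[i] < 0 else "pos")
# Hierarchical: a subregion may itself be partitioned into sub-subregions.
nested = partition(top["pos"], lambda i: "even" if i % 2 == 0 else "odd")

print(sorted(top["neg"]))     # [0, 2]
print(sorted(nested["odd"]))  # [1, 3]
```

Because the coloring function reads the data, this partition cannot be computed statically, which is exactly the class of applications the abstract says static proposals exclude.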

We present Singe, a Domain Specific Language (DSL) compiler for combustion chemistry that leverages warp specialization to produce high performance code for GPUs. Instead of relying on traditional GPU programming models that emphasize data-parallel computations, warp specialization allows compilers like Singe to partition computations into sub-computations which are then assigned to different warps within a thread block. Fine-grain synchronization between warps is performed efficiently in hardware using producer-consumer named...

10.1145/2555243.2555258 article EN 2014-02-06
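Warp specialization assigns different roles to different warps instead of having every thread do the same data-parallel work. The Python sketch below mimics that producer-consumer structure with two threads and a bounded queue standing in for shared-memory buffers synchronized by named barriers; it is an analogy, not GPU code.

```python
import queue
import threading

buf = queue.Queue(maxsize=2)   # stand-in for shared-memory staging buffers
results = []

def producer():                # the "memory" warp: fetches inputs
    for x in [1.0, 2.0, 3.0]:
        buf.put(x)
    buf.put(None)              # sentinel: no more work

def consumer():                # the "compute" warp: runs the kernel body
    while True:
        x = buf.get()
        if x is None:
            break
        results.append(x * x)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print(results)  # [1.0, 4.0, 9.0]
```

The design choice being illustrated: by decoupling data movement from computation, the producer can stay ahead of the consumer, hiding memory latency the way a specialized memory warp does on a GPU.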

Uncertainty quantification for forward and inverse problems is a central challenge across physical and biomedical disciplines. We address this challenge for the problem of modeling subsurface flow at the Hanford Site by combining stochastic computational models with observational data using physics-informed GAN models. The geographic extent, spatial heterogeneity, and multiple correlation length scales of the site require training a computationally intensive GAN model to thousands of dimensions. We develop a highly optimized implementation...

10.1109/dls49591.2019.00006 article EN 2019-11-01

We introduce novel communication strategies in synchronous distributed Deep Learning consisting of decentralized gradient reduction orchestration and computational graph-aware grouping of gradient tensors. These new techniques produce an optimal overlap between computation and communication and result in near-linear scaling (0.93) of training up to 27,600 NVIDIA V100 GPUs on the Summit Supercomputer. We demonstrate our techniques in the context of a Fully Convolutional Neural Network that approximates the solution of a longstanding scientific inverse problem in materials...

10.48550/arxiv.1909.11150 preprint EN other-oa arXiv (Cornell University) 2019-01-01
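Graph-aware grouping of gradient tensors can be sketched as a simple bucketing pass. The helper below is hypothetical (not from the paper's code): tensors are walked in reverse graph order, since gradients are produced back-to-front during backpropagation, and a bucket's reduction can start as soon as it fills, overlapping with the remaining backward work.

```python
def make_buckets(tensor_sizes, cap):
    """Group (name, size) pairs into buckets of at most `cap` total size,
    walking layers in reverse order (the order gradients become ready)."""
    buckets, current, used = [], [], 0
    for name, size in reversed(tensor_sizes):
        if used + size > cap and current:
            buckets.append(current)    # bucket full: its allreduce can launch now
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets

layers = [("conv1", 3), ("conv2", 5), ("fc", 4)]
print(make_buckets(layers, cap=8))  # [['fc'], ['conv2', 'conv1']]
```

Tuning `cap` trades per-message overhead against overlap: smaller buckets start reducing earlier, larger ones amortize communication latency better.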

Applications on modern supercomputers are increasingly limited by the cost of data movement, but mainstream programming systems have few abstractions for describing the structure of a program's data. Consequently, the burden of managing data placement, movement, and layout currently falls primarily upon the programmer. To address this problem we previously proposed a data model based on logical regions and described Legion, a programming system incorporating logical regions. In this paper, we present structure slicing, which incorporates fields into the logical region data model. We show that...

10.1109/sc.2014.74 article EN 2014-11-01
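The payoff of incorporating fields into the data model is that a task can name exactly the fields it touches, so only those columns move. A minimal sketch, with invented names (not the real structure-slicing interface):

```python
# A region with three fields, stored column-wise.
region = {"pressure": [1.0, 2.0], "velocity": [0.1, 0.2], "temp": [300.0, 310.0]}

def fields_to_move(task_fields, region):
    # Only the fields the task declares are copied; the rest stay home.
    return {f: region[f] for f in task_fields}

payload = fields_to_move({"pressure"}, region)
print(sorted(payload))  # ['pressure'] -- velocity and temp are never moved
```

Without field-level information, the runtime would have to move whole records, paying for `velocity` and `temp` even when a task reads only `pressure`.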

Many recent programming systems for both supercomputing and data center workloads generate task graphs to express computations that run on parallel and distributed machines. Due to the overhead associated with constructing these graphs, the dependence analysis that generates them is often statically computed and memoized, resulting in a task graph that is executed repeatedly at runtime. However, many applications require a dynamic task graph due to data dependent behavior, and there are new challenges in capturing and re-executing task graphs at runtime. In this work, we introduce...

10.1109/sc.2018.00037 article EN 2018-11-01
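The memoization idea above can be sketched as a tiny tracer (hypothetical API, not the paper's implementation): the first execution of a named trace pays for dependence analysis and records the resulting task graph; every later execution replays the cached graph.

```python
class Tracer:
    def __init__(self):
        self.cache = {}      # trace id -> recorded task graph
        self.analyses = 0    # how many times dependence analysis actually ran

    def run(self, trace_id, tasks):
        if trace_id not in self.cache:
            self.analyses += 1             # expensive dynamic analysis, once
            self.cache[trace_id] = list(tasks)
        return self.cache[trace_id]        # replay the memoized graph

tr = Tracer()
for _ in range(100):                       # e.g. 100 time steps of a solver
    graph = tr.run("main_loop", ["halo_exchange", "stencil", "reduce"])
print(tr.analyses)  # 1
```

The hard part the paper addresses, elided here, is validating that a replay is still legal when the application's behavior is data dependent.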

We explore the use of asynchronous many-task (AMT) programming models for the implementation of in situ analysis towards the goal of maximizing programmer productivity and overall performance on next generation platforms. We describe how a broad class of statistics algorithms can be transformed from a traditional single-program multiple-data (SPMD) implementation to an AMT implementation, demonstrating with a concrete example: the measurement of descriptive statistics implemented in Legion. Our experiments quantify the benefits and possible drawbacks of this...

10.1109/ipdpsw.2016.24 article EN 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 2016-05-01

A key problem in parallel programming is how data is partitioned: divided into subsets that can be operated on in parallel and, on distributed memory machines, spread across multiple address spaces.

10.1145/2983990.2984016 article EN 2016-10-19
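One way partitions are derived from one another in this line of work is via image and preimage operations. The sketch below shows a preimage-style operation with invented names (not the actual dependent-partitioning primitives): given a partition of graph nodes and an edge-to-destination mapping, it partitions the edges so each subset holds the edges pointing into one node subset.

```python
def preimage(edge_dst, node_partition):
    # For each colored node subset, collect edges whose destination lies in it.
    return {color: {e for e, n in edge_dst.items() if n in nodes}
            for color, nodes in node_partition.items()}

edge_dst = {"e0": 0, "e1": 1, "e2": 0, "e3": 2}
node_partition = {"p0": {0, 1}, "p1": {2}}
edges = preimage(edge_dst, node_partition)
print(sorted(edges["p0"]))  # ['e0', 'e1', 'e2']
print(sorted(edges["p1"]))  # ['e3']
```

Deriving the edge partition from the node partition this way keeps the two consistent automatically, instead of asking the programmer to maintain both by hand.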

Applications written for distributed-memory parallel architectures must partition their data to enable parallel execution. As memory hierarchies become deeper, it is increasingly necessary that the data partitioning also be hierarchical to match. Current language proposals perform this partitioning statically, which excludes many important applications where the appropriate partitioning is itself data dependent and so must be computed dynamically. We describe Legion, a region-based programming system, in which each region may be partitioned into subregions...

10.1145/2544173.2509545 article EN ACM SIGPLAN Notices 2013-10-29

We present control replication, a technique for generating high-performance and scalable SPMD code from implicitly parallel programs. In contrast to traditional programming models that require the programmer to explicitly manage threads and the communication and synchronization between them, implicitly parallel programs have sequential execution semantics and naturally avoid the pitfalls of explicitly parallel code. However, without optimizations to distribute control overhead, the scalability of implicitly parallel programs is often poor.

10.1145/3126908.3126949 article EN 2017-11-08
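The essence of the transformation can be caricatured as follows (a toy sketch, not the actual compiler output): rather than one control thread launching every task, each of N shards replicates the sequential control loop but launches only the tasks it owns, removing the centralized launch bottleneck.

```python
def replicated_control(num_shards, num_tasks):
    launched = {s: [] for s in range(num_shards)}
    for shard in range(num_shards):        # each shard runs the *same* loop...
        for t in range(num_tasks):
            if t % num_shards == shard:    # ...but launches only its own tasks
                launched[shard].append(t)
    return launched

out = replicated_control(num_shards=2, num_tasks=6)
print(out)  # {0: [0, 2, 4], 1: [1, 3, 5]}
```

Because every shard executes identical control logic, the program keeps its sequential semantics while the per-task launch work is spread across shards; the real technique must also insert the communication the shards need to stay consistent, which this sketch omits.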

Uncertainty quantification for forward and inverse problems is a central challenge across physical and biomedical disciplines. We address this challenge for the problem of modeling subsurface flow at the Hanford Site by combining stochastic computational models with observational data using physics-informed GAN models. The geographic extent, spatial heterogeneity, and multiple correlation length scales of the site require training a computationally intensive GAN model to thousands of dimensions. We develop a hierarchical scheme exploiting...

10.48550/arxiv.1910.13444 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Researchers built the EcoG GPU-based cluster to show that a system can be designed around GPU computing and still be power efficient.

10.1109/mcse.2011.30 article EN Computing in Science & Engineering 2011-03-01

The rapid growth in simulation data requires large-scale parallel implementations of scientific analysis and visualization algorithms, both to produce results within an acceptable timeframe and to enable in situ deployment. However, efficient and scalable implementations, especially of the more complex approaches, require not only advanced algorithms but also in-depth knowledge of the underlying runtime. Furthermore, different machine configurations and applications may favor different runtimes, i.e., MPI vs. Charm++ vs. Legion, etc., and hardware...

10.1109/ipdps.2018.00056 article EN 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2018-05-01

We present Singe, a Domain Specific Language (DSL) compiler for combustion chemistry that leverages warp specialization to produce high performance code for GPUs. Instead of relying on traditional GPU programming models that emphasize data-parallel computations, warp specialization allows compilers like Singe to partition computations into sub-computations which are then assigned to different warps within a thread block. Fine-grain synchronization between warps is performed efficiently in hardware using producer-consumer named...

10.1145/2692916.2555258 article EN ACM SIGPLAN Notices 2014-02-06

Implicitly parallel programming systems must solve the joint problems of dependence analysis and coherence to ensure apparently-sequential semantics for applications run on distributed memory machines. Solving these problems in the presence of data-dependent control flow and arbitrary aliasing is a challenge that most existing systems eschew by compromising the expressivity of their programming models and/or the performance of their implementations. We demonstrate a general class of solutions to these problems via a reduction to the visibility problem from computer graphics.

10.1145/3572848.3577515 article EN cc-by-nc-nd 2023-02-21

Sparse tensor algorithms are becoming widespread, particularly in the domains of deep learning, graph and data analytics, and scientific computing. Current high-performance broad-domain architectures, such as GPUs, often suffer memory system inefficiencies by moving too much data or moving it too far through the memory hierarchy. To increase performance and efficiency, proposed domain-specific accelerators tailor their architectures to the needs of a narrow application domain, but as a result they cannot be applied to a wide range of applications...

10.1145/3630007 article EN ACM Transactions on Computer Systems 2023-10-27
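The kind of kernel such sparse accelerators target can be illustrated with a two-finger merge over compressed coordinates, the core of a sparse dot product. This is a generic illustration of the computation pattern, not code from the paper:

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors given as sorted (index, value) lists.
    Only coordinates present in *both* inputs contribute, so the merge skips
    past the rest without touching their values."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        ia, ib = a[i][0], b[j][0]
        if ia == ib:
            total += a[i][1] * b[j][1]
            i += 1; j += 1
        elif ia < ib:
            i += 1
        else:
            j += 1
    return total

print(sparse_dot([(0, 2.0), (3, 1.0), (7, 4.0)], [(3, 5.0), (7, 0.5)]))  # 7.0
```

On a GPU this intersection pattern moves many coordinates through the memory hierarchy just to discard them, which is exactly the inefficiency the abstract says domain-specific accelerators are built to avoid.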