NFDI4DS | UHH-SEMS - Publication Details

AMReX: a framework for block-structured adaptive mesh refinement

OPENALEX - Publications

Weiqun Zhang, Ann Almgren, Vince Beckner, John B. Bell, Johannes Blaschke, and 12 more

AMReX is a C++ software framework that supports the development of block-structured adaptive mesh refinement (AMR) algorithms for solving systems of partial differential equations (PDEs) with complex boundary conditions on current and emerging architectures.

10.21105/joss.01370 article EN cc-by The Journal of Open Source Software 2019-05-12

High Level Synthesis of Complex Applications

OPENALEX - Publications

Xinheng Liu, Yao Chen, Tan Nguyen, Swathi Gurumani, Kyle Rupnow, and 1 more

High level synthesis (HLS) is gaining wider acceptance for hardware design due to its higher productivity and better design space exploration features. In recent years, HLS techniques and design flows have also advanced significantly, and as a result, many new FPGA designs are developed with HLS. However, despite many studies using HLS, the size and complexity of such applications remain generally small, and it is not well understood how to optimize large, complex HLS reference code. Typical benchmarks contain somewhere between 100 and 1400...

10.1145/2847263.2847274 article EN 2016-02-04

BoxLib with Tiling: An Adaptive Mesh Refinement Software Framework

OPENALEX - Publications

Weiqun Zhang, Ann Almgren, Marc Day, Tan Nguyen, John Shalf, and 1 more

In this paper we introduce a block-structured adaptive mesh refinement software framework that incorporates tiling, a well-known loop transformation. Because the multiscale, multiphysics codes built in BoxLib are designed to solve complex systems at high resolution, performance on current and next generation architectures is essential. With the expectation of many more cores per node on next generation architectures, the ability to effectively utilize threads within a node is essential, and the current model for parallelization will not be...

10.1137/15m102616x article EN SIAM Journal on Scientific Computing 2016-01-01

The Performance and Energy Efficiency Potential of FPGAs in Scientific Computing

OPENALEX - Publications

Tan Nguyen, Samuel Williams, Marco Siracusa, Colin MacLean, Douglas W. Doerfler, and 1 more

Hardware specialization is a promising direction for the future of digital computing. Reconfigurable technologies enable hardware specialization with modest non-recurring engineering cost. In this paper, we use FPGAs to evaluate the benefits of building specialized hardware for numerical kernels found in scientific applications. In order to properly evaluate performance, we not only compare Intel Arria 10 and Xilinx U280 performance against Xeon, Xeon Phi, and NVIDIA V100 GPUs, but also extend the Empirical Roofline Toolkit (ERT) to assess our results...

10.1109/pmbs51919.2020.00007 article EN 2020-11-01

FPGA‐based HPC accelerators: An evaluation on performance and energy efficiency

OPENALEX - Publications

Tan Nguyen, Colin MacLean, Marco Siracusa, Douglas W. Doerfler, Nicholas J. Wright, and 1 more

Hardware specialization is a promising direction for the future of digital computing. Reconfigurable technologies enable hardware specialization with modest non‐recurring engineering cost, but their performance and energy efficiency compared to state‐of‐the‐art processor architectures remain an open question. In this article, we use FPGAs to evaluate the benefits of building specialized hardware for numerical kernels found in scientific applications. In order to properly evaluate performance, we not only compare Intel Arria 10 and Xilinx...

10.1002/cpe.6570 article EN Concurrency and Computation Practice and Experience 2021-08-22

Bamboo: translating MPI applications to a latency-tolerant, data-driven form

OPENALEX - Publications

Tan Nguyen, Pietro Cicotti, Eric J. Bylaska, Dan Quinlan, Scott B. Baden

We present Bamboo, a custom source-to-source translator that transforms MPI C source into a data-driven form that automatically overlaps communication with available computation. Running on up to 98304 processors of NERSC's Hopper system, we observe that Bamboo's overlap capability speeds up MPI implementations of a 3D Jacobi iterative solver and Cannon's matrix multiplication. Bamboo's generated code meets or exceeds the performance of hand optimized MPI, which includes split-phase coding, the method classically employed to hide...

10.5555/2388996.2389050 article EN IEEE International Conference on High Performance Computing, Data, and Analytics 2012-11-10

A software-based dynamic-warp scheduling approach for load-balancing the Viola–Jones face detection algorithm on GPUs

OPENALEX - Publications

Tan Nguyen, Daniel Hefenbrock, Jason Oberg, Ryan Kastner, Scott B. Baden

10.1016/j.jpdc.2013.01.012 article EN Journal of Parallel and Distributed Computing 2013-01-29

Architectural Requirements for Deep Learning Workloads in HPC Environments

OPENALEX - Publications

Khaled Z. Ibrahim, Tan Nguyen, Hai Ah Nam, W. Bhimji, Steven Farrell, and 4 more

Scientific machine learning (SciML) promises to have a transformational impact on scientific exploration by combining state-of-the-art AI methods with the latest generation of supercomputers. However, to efficiently leverage ML techniques on high-performance computing (HPC) systems, it is critical to understand the performance characteristics of the underlying algorithms on modern computational systems. In this work, we present a new methodology for developing a detailed understanding of benchmarks. To demonstrate our...

10.1109/pmbs54543.2021.00007 article EN 2021-11-01

Devastator: A Scalable Parallel Discrete Event Simulation Framework for Modern C++

OPENALEX - Publications

John Bachan, Jianlan Ye, Xuan Jiang, Tan Nguyen, Mahesh Natarajan, and 2 more

10.1145/3615979.3656061 article EN 2024-06-18

Bamboo -- Translating MPI applications to a latency-tolerant, data-driven form

OPENALEX - Publications

Tan Nguyen, Pietro Cicotti, Eric J. Bylaska, Dan Quinlan, Scott B. Baden

We present Bamboo, a custom source-to-source translator that transforms MPI C source into a data-driven form that automatically overlaps communication with available computation. Running on up to 98304 processors of NERSC's Hopper system, we observe that Bamboo's overlap capability speeds up MPI implementations of a 3D Jacobi iterative solver and Cannon's matrix multiplication. Bamboo's generated code meets or exceeds the performance of hand optimized MPI, which includes split-phase coding, the method classically employed to hide...

10.1109/sc.2012.23 article EN International Conference for High Performance Computing, Networking, Storage and Analysis 2012-11-01

FCUDA-SoC

OPENALEX - Publications

Tan Nguyen, Swathi Gurumani, Kyle Rupnow, Yao Chen

Throughput oriented high level synthesis allows efficient design and optimization using parallel input languages. Parallel languages offer the benefit of parallelism extraction at multiple levels of granularity, offering effective design space exploration to select single core implementations and easy scaling through multiple instantiations. However, prior study has concentrated on on-chip communications, while neglecting platform integration, which can have a significant impact on achieved performance. In this paper,...

10.1145/2847263.2847344 article EN 2016-02-04

BoxLib with Tiling: An AMR Software Framework

OPENALEX - Publications

Weiqun Zhang, Ann Almgren, Marc Day, Tan Nguyen, John Shalf, and 1 more

In this paper we introduce a block-structured adaptive mesh refinement (AMR) software framework that incorporates tiling, a well-known loop transformation. Because the multiscale, multiphysics codes built in BoxLib are designed to solve complex systems at high resolution, performance on current and next generation architectures is essential. With the expectation of many more cores per node on next generation architectures, the ability to effectively utilize threads within a node is essential, and the current model for parallelization will not be...

10.48550/arxiv.1604.03570 preprint EN other-oa arXiv (Cornell University) 2016-01-01

SPADES: A Productive Design Flow for Versal Programmable Logic

OPENALEX - Publications

Tan Nguyen, Zachary Taylor Blair, Stephen Neuendorffer, John Wawrzynek

With the increasing growth in complexity and heterogeneity of modern FPGA fabrics, the conventional "flat" design flow relying on standard tools, from Synthesis and Implementation to Bitstream Generation, has become more arduous than ever. This leads to an inordinate turn-around time, which severely impacts the productivity of application developers in their quest for design space exploration. We propose an open-source tool built around a customizable overlay of Spatially Distributed Socket Engines (SPADES) to address this issue. SPADES...

10.1109/fpl60245.2023.00017 article EN 2023-09-04

System-level design solutions: Enabling the IoT explosion

OPENALEX - Publications

Liwei Yang, Yao Chen, Wei Zuo, Tan Nguyen, Swathi Gurumani, and 2 more

Analysts estimate that there will be 50 billion internet-connected devices by 2020, up from 25 billion in 2015. This predicted explosion of IoT affects various evolving and growing markets as well as entirely new applications. Despite the variety of target applications, all such devices demand low energy/power consumption, high reliability, connectivity, interoperability, security, and privacy. Furthermore, while meeting these demands, time-to-market is a critical metric in determining whether an IoT product can capture market...

10.1109/asicon.2015.7517023 article EN 2015 IEEE 11th International Conference on ASIC (ASICON) 2015-11-01

Perilla: Metadata-Based Optimizations of an Asynchronous Runtime for Adaptive Mesh Refinement

OPENALEX - Publications

Tan Nguyen, Didem Unat, Weiqun Zhang, Ann Almgren, Muhammad Nufail Farooqi, and 1 more

Hardware architecture is increasingly complex, urging the development of asynchronous runtime systems with advanced resource and locality management support. However, this support may come at the cost of complicating the user interface, while programming remains one of the major constraints to wide adoption of asynchronous runtimes in practice. In this paper, we propose a solution that leverages application metadata to enable challenging optimizations as well as to facilitate the task of transforming legacy code to an asynchronous representation. We...

10.1109/sc.2016.80 article EN 2016-11-01

A 0.9 V to 1.95 V dynamic voltage-scalable and frequency-scalable 32 b PowerPC processor

OPENALEX - Publications

Kevin Nowka, G. Carpenter, E. Mac Donald, Hung Q. Ngo, B. Brock, and 3 more

A 32 b PowerPC™ system-on-a-chip supporting dynamic voltage supply and frequency scaling operates from 366 MHz at 1.8 V and 600 mW down to 150 MHz at 1.0 V and 53 mW in a 0.18 μm CMOS process. Maximum supply change without PLL relock is 10 mV/μs. Processor state save/restore enables a deep-sleep state.

10.1109/isscc.2002.993071 article EN 2002 IEEE International Solid-State Circuits Conference. Digest of Technical Papers (Cat. No.02CH37315) 2003-06-25

Phase Asynchronous AMR Execution for Productive and Performant Astrophysical Flows

OPENALEX - Publications

Muhammad Nufail Farooqi, Tan Nguyen, Weiqun Zhang, Ann Almgren, John Shalf, and 1 more

Adaptive Mesh Refinement (AMR) is an approach to solving PDEs that reduces the computational and memory requirements at the expense of increased communication. Although adopting asynchronous execution can overcome communication issues, manually restructuring an AMR application to realize asynchrony is extremely complicated and hinders readability and long-term maintainability. To balance performance against productivity, we design a user-friendly API and adopt a phase asynchronous model where all subgrids at a level can be computed...

10.1109/sc.2018.00073 article EN 2018-11-01

Perilla: metadata-based optimizations of an asynchronous runtime for adaptive mesh refinement

OPENALEX - Publications

Tan Nguyen, Didem Unat, Weiqun Zhang, Ann Almgren, Muhammad Nufail Farooqi, and 1 more

Hardware architecture is increasingly complex, urging the development of asynchronous runtime systems with advanced resource and locality management support. However, this support may come at the cost of complicating the user interface, while programming remains one of the major constraints to wide adoption of asynchronous runtimes in practice. In this paper, we propose a solution that leverages application metadata to enable challenging optimizations as well as to facilitate the task of transforming legacy code to an asynchronous representation. We...

10.5555/3014904.3015013 article EN IEEE International Conference on High Performance Computing, Data, and Analytics 2016-11-13

Automatic translation of MPI source into a latency-tolerant, data-driven form

OPENALEX - Publications

Tan Nguyen, Pietro Cicotti, Eric J. Bylaska, Dan Quinlan, Scott B. Baden

10.1016/j.jpdc.2017.02.009 article EN publisher-specific-oa Journal of Parallel and Distributed Computing 2017-03-06

FCUDA-HB: Hierarchical and Scalable Bus Architecture Generation on FPGAs With the FCUDA Flow

OPENALEX - Publications

Ying Chen, Tan Nguyen, Yao Chen, Swathi Gurumani, Yun Liang, and 4 more

Recent progress in high-level synthesis (HLS) has helped raise the abstraction level of hardware design. HLS flows reduce designer effort by allowing development in a high-level language, which improves debugging, code reuse, and the ability to explore different implementation options. However, although the HLS process is fast, performance analysis still requires lengthy logic synthesis and physical design. For design optimization, HLS tools require design space exploration to obtain parallelism at multiple levels of granularity, including within a single...

10.1109/tcad.2016.2552821 article EN IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2016-01-01

Phase asynchronous AMR execution for productive and performant astrophysical flows

OPENALEX - Publications

Muhammad Nufail Farooqi, Tan Nguyen, Weiqun Zhang, Ann Almgren, John Shalf, and 1 more

Adaptive Mesh Refinement (AMR) is an approach to solving PDEs that reduces the computational and memory requirements at the expense of increased communication. Although adopting asynchronous execution can overcome communication issues, manually restructuring an AMR application to realize asynchrony is extremely complicated and hinders readability and long-term maintainability. To balance performance against productivity, we design a user-friendly API and adopt a phase asynchronous model where all subgrids at a level can be computed...

10.5555/3291656.3291750 article EN IEEE International Conference on High Performance Computing, Data, and Analytics 2018-11-11

LU Factorization: Towards Hiding Communication Overheads with a Lookahead-Free Algorithm

OPENALEX - Publications

Tan Nguyen, Scott B. Baden

Lookahead is a well-known technique for masking communication in matrix factorization, but at the cost of complicating application software. We present a new approach, based on automated code-restructuring, that realizes the benefits of lookahead while avoiding the complications. We apply our approach to HPL, the Linpack benchmark used to assess the performance of supercomputers. Starting with a simpler non-lookahead version of the application, we are able to meet the performance of the lookahead version on the Stampede mainframe.

10.1109/cluster.2015.61 article EN 2015-09-01

SoC, NoC and Hierarchical Bus Implementations of Applications on FPGAs Using the FCUDA Flow

OPENALEX - Publications

Tan Nguyen, Yao Chen, Kyle Rupnow, Swathi Gurumani, Yao Chen

The FCUDA project aims to improve the programmability of FPGAs and the expression of application parallelism in High Level Synthesis (HLS) through the use of the CUDA language. The CUDA language is a popular single-instruction multiple data (SIMD) style programming language with wide adoption, thus offering a significant opportunity to bring experienced CUDA programmers to FPGA computing. The FCUDA project now has open-sourced the core CUDA-to-RTL transformation as well as infrastructure for design space exploration, bus-based and NoC-based on-chip communications, platform...

10.1109/isvlsi.2016.131 article EN 2016-07-01