- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Distributed and Parallel Computing Systems
- Interconnection Networks and Systems
- Embedded Systems Design Techniques
- Computer Graphics and Visualization Techniques
- VLSI and FPGA Design Techniques
- Simulation Techniques and Applications
- Optimization and Search Problems
- Analog and Mixed-Signal Circuit Design
- Face and Expression Recognition
- Advanced Image and Video Retrieval Techniques
- Advanced Numerical Methods in Computational Mathematics
- IoT and Edge/Fog Computing
- Cloud Computing and Resource Management
- Natural Language Processing Techniques
- Low-power high-performance VLSI design
- Mathematics, Computing, and Information Processing
- Computational Fluid Dynamics and Aerodynamics
- Real-time simulation and control systems
- Scientific Computing and Data Management
- Algorithms and Data Compression
- Model-Driven Software Engineering Techniques
- Computational Geometry and Mesh Generation
- Modular Robots and Swarm Intelligence
Lawrence Berkeley National Laboratory
2016-2024
Berkeley College
2023
University of California, Berkeley
2023
National Energy Research Scientific Computing Center
2021
University of California, San Diego
2012-2017
Advanced Digital Sciences Center
2015-2016
IBM Research - Austin
2003
AMReX is a C++ software framework that supports the development of block-structured adaptive mesh refinement (AMR) algorithms for solving systems of partial differential equations (PDEs) with complex boundary conditions on current and emerging architectures.
High-level synthesis (HLS) is gaining wider acceptance for hardware design due to its higher productivity and better design space exploration features. In recent years, HLS techniques and flows have also advanced significantly; as a result, many new FPGA designs are developed with HLS. However, despite studies using HLS, the size and complexity of such applications remain generally small, and it is not well understood how to optimize large, complex reference code. Typical benchmark designs contain somewhere between 100 and 1400...
In this paper we introduce a block-structured adaptive mesh refinement software framework that incorporates tiling, a well-known loop transformation. Because the multiscale, multiphysics codes built in BoxLib are designed to solve complex systems at high resolution, performance on current and next-generation architectures is essential. With the expectation of many more cores per node on next-generation architectures, the ability to effectively utilize threads within a node is essential, and the current model for parallelization will not be...
Hardware specialization is a promising direction for the future of digital computing. Reconfigurable technologies enable hardware specialization with modest non-recurring engineering cost. In this paper, we use FPGAs to evaluate the benefits of building specialized hardware for numerical kernels found in scientific applications. In order to properly evaluate performance, we not only compare Intel Arria 10 and Xilinx U280 performance against Xeon, Xeon Phi, and NVIDIA V100 GPUs, but also extend the Empirical Roofline Toolkit (ERT) to FPGAs in order to assess our results...
Hardware specialization is a promising direction for the future of digital computing. Reconfigurable technologies enable hardware specialization with modest non-recurring engineering cost, but their performance and energy efficiency compared to state-of-the-art processor architectures remain an open question. In this article, we use FPGAs to evaluate the benefits of building specialized hardware for numerical kernels found in scientific applications. In order to properly evaluate performance, we not only compare Intel Arria 10 and Xilinx...
We present Bamboo, a custom source-to-source translator that transforms MPI C source into a data-driven form that automatically overlaps communication with available computation. Running on up to 98304 processors of NERSC's Hopper system, we observe that Bamboo's overlap capability speeds up MPI implementations of a 3D Jacobi iterative solver and Cannon's matrix multiplication. Bamboo's generated code meets or exceeds the performance of hand-optimized MPI, which includes split-phase coding, the method classically employed to hide...
Scientific machine learning (SciML) promises to have a transformational impact on scientific exploration by combining state-of-the-art AI methods with the latest generation of supercomputers. However, to efficiently leverage ML techniques on high-performance computing (HPC) systems, it is critical to understand the performance characteristics of the underlying algorithms on modern computational systems. In this work, we present a new methodology for developing a detailed understanding of benchmarks. To demonstrate our...
Throughput-oriented high-level synthesis allows efficient design and optimization using parallel input languages. Parallel languages offer the benefit of parallelism extraction at multiple levels of granularity, offering effective design space exploration to select single-core implementations and easy scaling through multiple core instantiations. However, study of high-level synthesis has concentrated on on-chip communications while neglecting platform integration, which can have a significant impact on achieved performance. In this paper,...
In this paper we introduce a block-structured adaptive mesh refinement (AMR) software framework that incorporates tiling, a well-known loop transformation. Because the multiscale, multiphysics codes built in BoxLib are designed to solve complex systems at high resolution, performance on current and next-generation architectures is essential. With the expectation of many more cores per node on next-generation architectures, the ability to effectively utilize threads within a node is essential, and the current model for parallelization will not be...
With the increasing growth of complexity and heterogeneity of modern FPGA fabrics, the conventional "flat" design flow relying on standard tools, from Synthesis, Implementation, to Bitstream Generation, has become more arduous than ever. This leads to an inordinate turn-around time, which severely impacts the productivity of application developers in the quest for design space exploration. We propose an open-source tool built around a customizable overlay of Spatially Distributed Socket Engines (SPADES) to address this issue. SPADES...
Analysts estimate that there will be 50 billion internet-connected devices by 2020, up from 25 billion in 2015. This predicted explosion of IoT devices affects various evolving and growing markets as well as entirely new applications. Despite the variety of target applications, all such devices demand low energy/power consumption, high reliability, connectivity, interoperability, security, and privacy. Furthermore, while meeting these demands, time-to-market is a critical metric in determining whether an IoT product will capture market...
Hardware architecture is increasingly complex, urging the development of asynchronous runtime systems with advanced resource and locality management supports. However, these supports may come at the cost of complicating the user interface, while programming remains one of the major constraints to wide adoption of asynchronous runtimes in practice. In this paper, we propose a solution that leverages application metadata to enable challenging optimizations as well as to facilitate the task of transforming legacy code to an asynchronous representation. We...
A 32 b PowerPC™ system-on-a-chip supporting dynamic voltage supply and frequency scaling operates from 366 MHz at 1.8 V and 600 mW down to 150 MHz at 1.0 V and 53 mW in a 0.18 μm CMOS process. Maximum supply change without PLL relock is 10 mV/μs. Processor state save/restore enables a deep-sleep state.
Adaptive Mesh Refinement (AMR) is an approach to solving PDEs that reduces the computational and memory requirements at the expense of increased communication. Although adopting asynchronous execution can overcome communication issues, manually restructuring an AMR application to realize asynchrony is extremely complicated and hinders readability and long-term maintainability. To balance performance against productivity, we design a user-friendly API and adopt a phase model where all subgrids at a level can be computed...
Recent progress in high-level synthesis (HLS) has helped raise the abstraction level of hardware design. HLS flows reduce designer effort by allowing development in a high-level language, which improves debugging, code reuse, and the ability to explore different implementation options. However, although the HLS process is fast, performance analysis still requires lengthy logic synthesis and physical design. For design optimization, tools for design space exploration obtain parallelism at multiple levels of granularity, including within a single...
Lookahead is a well-known technique for masking communication in matrix factorization, but at the cost of complicating application software. We present a new approach, based on automated code-restructuring, that realizes the benefits of lookahead while avoiding the complications. We apply our approach to HPL, the Linpack benchmark used to assess the performance of supercomputers. Starting with a simpler non-lookahead version of the application, we are able to meet the performance of HPL with lookahead on the Stampede mainframe.
The FCUDA project aims to improve the programmability of FPGAs and the expression of application parallelism in High Level Synthesis (HLS) through the use of the CUDA language. The CUDA language is a popular single-instruction multiple-data (SIMD) style programming language with wide adoption, thus offering a significant opportunity to bring experienced CUDA programmers to FPGA computing. FCUDA now has open-sourced the core CUDA-to-RTL transformation as well as infrastructure for design space exploration, bus-based and NoC-based on-chip communications, platform...