- Parallel Computing and Optimization Techniques
- Advanced Neural Network Applications
- Stochastic Gradient Optimization Techniques
- Matrix Theory and Algorithms
- Ferroelectric and Negative Capacitance Devices
- Advanced Memory and Neural Computing
- Advanced Data Storage Technologies
- Brain Tumor Detection and Classification
- Logic, programming, and type systems
- Scheduling and Optimization Algorithms
- Embedded Systems Design Techniques
- Machine Learning in Materials Science
- Tensor decomposition and applications
- Low-power high-performance VLSI design
- Interconnection Networks and Systems
- VLSI and FPGA Design Techniques
- Sparse and Compressive Sensing Techniques
- Formal Methods in Verification
- Numerical Methods and Algorithms
- Quantum Computing Algorithms and Architecture
Cornell University (2025)
University of California, Berkeley (2020-2024)
Berkeley College (2021-2024)
University of California System (2022)
Recent advances in Deep Neural Networks (DNNs) have led to active development of specialized DNN accelerators, many of which feature a large number of processing elements laid out spatially, together with a multi-level memory hierarchy and a flexible interconnect. While such accelerators can take advantage of data reuse to achieve high peak throughput, they also expose runtime parameters to the programmers, who need to explicitly manage how computation is scheduled both spatially and temporally. In fact, different...
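As a rough illustration of what "scheduling spatially and temporally" means, the sketch below enumerates ways to split two GEMM-like loop dimensions between a hypothetical PE array and a temporal loop. The function names, the `num_pes` parameter, and the toy cost model are assumptions for illustration only, not the scheduler described in the work above.

```python
# Minimal sketch (assumed setup, not the paper's tool): enumerate spatial/temporal
# factorizations of a 64x64 loop nest onto a hypothetical 16-PE accelerator.
from itertools import product

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def enumerate_mappings(M=64, N=64, num_pes=16):
    """Yield (spatial_m, spatial_n, temporal_m, temporal_n) factorizations."""
    for sm, sn in product(divisors(M), divisors(N)):
        if sm * sn <= num_pes:              # spatial tile must fit on the PE array
            yield sm, sn, M // sm, N // sn  # remaining iterations run temporally

# Toy objective: pick the mapping with the fewest temporal steps.
best = min(enumerate_mappings(), key=lambda m: m[2] * m[3])
print("spatial (m,n) =", best[:2], "temporal (m,n) =", best[2:])
```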
Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications, and this trend has been consistent over the several years since they were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, which has made their deployment in latency-sensitive applications challenging. As such, there is an increased focus on making these models more...
The optimization of matrix multiplication (or GEMM) has been a need during the last decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel oneAPI because of its widespread use in a large variety of scientific applications. GEMM is usually implemented following the GotoBLAS philosophy, which tiles the GEMM operands and uses a series of nested loops for performance improvement. These approaches extract the maximum computational power of the architectures through small pieces...
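A minimal sketch of the blocked loop structure this philosophy refers to is given below. The block sizes `MC`, `NC`, `KC` are illustrative placeholders; real libraries choose them to fit the cache hierarchy and replace the innermost block update with a packed micro-kernel, which this sketch omits.

```python
# Minimal sketch of a GotoBLAS-style blocked GEMM (C += A @ B).
# Block sizes are illustrative assumptions, not the values a tuned library uses.
import numpy as np

def blocked_gemm(A, B, C, MC=64, NC=64, KC=64):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and C.shape == (M, N)
    for jc in range(0, N, NC):            # loop over column panels of B and C
        for pc in range(0, K, KC):        # loop over panels of the K dimension
            for ic in range(0, M, MC):    # loop over row panels of A and C
                # innermost block update (packing and micro-kernel omitted)
                C[ic:ic+MC, jc:jc+NC] += A[ic:ic+MC, pc:pc+KC] @ B[pc:pc+KC, jc:jc+NC]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((128, 96)), rng.standard_normal((96, 160))
C = np.zeros((128, 160))
assert np.allclose(blocked_gemm(A, B, C), A @ B)
```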
In the hardware design space exploration process, it is critical to optimize both hardware parameters and algorithm-to-hardware mappings. Previous work has largely approached this simultaneous optimization problem by exploring the hardware design space and the mapspace, both individually large and highly nonconvex spaces, independently. The resulting combinatorial explosion has created significant difficulties for optimizers.
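For a sense of scale, the back-of-the-envelope calculation below multiplies assumed sizes for the two spaces; the specific counts are made up for illustration and are not taken from the work above.

```python
# Illustrative only: assumed 8 hardware parameters with ~10 values each and
# 7 mapping decisions with ~20 values each. The joint space dwarfs either alone.
hw_space  = 10 ** 8        # hardware parameter combinations (assumed)
map_space = 20 ** 7        # mapping combinations for one workload (assumed)
print(f"hardware space: {hw_space:.1e}, mapspace: {map_space:.1e}, "
      f"joint space: {hw_space * map_space:.1e}")
```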
Reducing communication, either between levels of a memory hierarchy or between processors over a network, is a key component of performance optimization (in both time and energy) for many nested-loop problems, including dense linear algebra, particle interactions, and machine learning. Previous tiling-based approaches to these problems have been used to find lower bounds on the communication required to execute them and optimal rearrangements, or blockings, that attain such bounds. However, such approaches have typically assumed the problem sizes...
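The classical example of such a bound is dense matrix multiplication: with a fast memory of S words, roughly M·N·K/√S words must move between slow and fast memory, and a square blocking attains this up to a constant factor. The sketch below compares that bound with the traffic of a simple blocking; the tile-size choice and traffic count are the standard textbook estimates, not results specific to the work above.

```python
# Illustrative comparison: matmul communication lower bound vs. blocked traffic.
from math import sqrt, ceil

def matmul_lower_bound(M, N, K, S):
    return M * N * K / sqrt(S)             # asymptotic lower bound (words moved)

def blocked_traffic(M, N, K, S):
    b = int(sqrt(S / 3))                   # tile side so three b x b tiles fit in S
    tiles = ceil(M / b) * ceil(N / b) * ceil(K / b)
    return tiles * 2 * b * b               # each tile multiply reads two b x b tiles

M = N = K = 1024
S = 32 * 1024                              # fast-memory capacity in words
print(matmul_lower_bound(M, N, K, S), blocked_traffic(M, N, K, S))
```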
Efficiently executing convolutional neural nets (CNNs) is important in many machine-learning tasks. Since the cost of moving a word of data, either between levels of a memory hierarchy or between processors over a network, is much higher than the cost of an arithmetic operation, minimizing data movement is critical to performance optimization. In this paper, we present both new lower bounds on the data movement needed for CNNs and optimal sequential algorithms that attain these bounds. In the most common cases, our algorithms can attain significantly more data reuse than matrix...
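A quick way to see why convolutions admit more reuse than matrix multiplication is to compare arithmetic against the operand footprint for a typical layer. The layer sizes below are assumptions chosen for illustration; the counts are the standard FLOP and footprint formulas, not the lower bounds derived in the paper.

```python
# Rough reuse estimate for a stride-1 convolution layer (assumed sizes).
def conv_stats(B=8, C=64, K=128, H=56, W=56, R=3, S=3):
    flops = 2 * B * K * H * W * C * R * S               # multiply-adds counted as 2 flops
    words = B*C*(H+R-1)*(W+S-1) + K*C*R*S + B*K*H*W     # input + filters + output
    return flops, words, flops / words                  # arithmetic intensity

flops, words, intensity = conv_stats()
print(f"{flops:.2e} flops over {words:.2e} words -> reuse ~{intensity:.0f}x")
```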
The standardization of an interface for dense linear algebra operations in the BLAS standard has enabled interoperability between different libraries, thereby boosting the success of scientific computing, in particular in HPC. Despite numerous efforts in the past, the community has not yet agreed on a sparse counterpart, for several reasons. One is the fact that sparse objects allow many storage formats, and different hardware may favor different formats. This makes the definition of a FORTRAN-style, all-circumventing interface extremely challenging. Another reason, as opposed...
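The format-diversity point is easy to see with a tiny example: the same matrix stored as COO triplets and as CSR arrays, two of the most common layouts, shown in the sketch below. The example matrix is arbitrary and the conversion is written out by hand purely for illustration.

```python
# The same sparse matrix in two common layouts: COO triplets and CSR arrays.
import numpy as np

dense = np.array([[5, 0, 0],
                  [0, 0, 3],
                  [2, 0, 4]])

# COO: parallel (row, col, value) triplets
rows, cols = np.nonzero(dense)
vals = dense[rows, cols]

# CSR: values and column indices plus a row-pointer array
row_ptr = np.concatenate(([0], np.cumsum(np.bincount(rows, minlength=dense.shape[0]))))
print("COO:", list(zip(rows.tolist(), cols.tolist(), vals.tolist())))
print("CSR: data =", vals.tolist(), "indices =", cols.tolist(), "indptr =", row_ptr.tolist())
```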
Convolutional neural networks (CNNs) are important in a wide variety of machine learning tasks and applications, so optimizing their performance is essential. Since moving words of data between levels of a memory hierarchy or between processors on a network is much more expensive than the cost of arithmetic, minimizing communication is critical to performance. In this paper, we present new lower bounds on data movement for mixed precision convolutions in both single-processor and parallel distributed memory models, as well as algorithms that...
Reducing communication, either between levels of a memory hierarchy or between processors over a network, is a key component of performance optimization (in both time and energy) for many problems, including dense linear algebra, particle interactions, and machine learning. For problems that can be represented as nested-loop computations, previous tiling-based approaches have been used to find lower bounds on the communication required to execute them and optimal rearrangements, or blockings, that attain such bounds. However, general...