- Parallel Computing and Optimization Techniques
- Software Testing and Debugging Techniques
- Embedded Systems Design Techniques
- Cloud Computing and Resource Management
- Advanced Neural Network Applications
- Interconnection Networks and Systems
- Logic, programming, and type systems
- Software Engineering Research
- Ferroelectric and Negative Capacitance Devices
- Software System Performance and Reliability
- Distributed and Parallel Computing Systems
- Caching and Content Delivery
- Optimization and Search Problems
- Topic Modeling
- Online Learning and Analytics
- Formal Methods in Verification
- Semiconductor materials and devices
- Domain Adaptation and Few-Shot Learning
- Complexity and Algorithms in Graphs
- Machine Learning and Data Classification
- Data Stream Mining Techniques
- Natural Language Processing Techniques
- Advanced Image and Video Retrieval Techniques
- Advanced Graph Neural Networks
- Intelligent Tutoring Systems and Adaptive Learning
Google (United States)
2019-2025
Brain (Germany)
2022
University of California, Berkeley
2013-2019
Berkeley College
2016
Massachusetts Institute of Technology
2013
Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from those of conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage...
Developing a code optimizer is challenging, especially for new, idiosyncratic ISAs. Superoptimization can, in principle, discover machine-specific optimizations automatically by searching the space of all instruction sequences. If we can increase the size of the program fragments a superoptimizer can optimize, we will be able to discover more optimizations. We develop LENS, a search algorithm that increases the size of programs a superoptimizer can synthesize by rapidly pruning away invalid candidate programs. Pruning is achieved by selectively refining the abstraction under which...
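As a rough illustration of the search-and-prune idea described above, here is a toy Python sketch. The tiny ISA, the reference fragment, and the prune-by-concrete-tests strategy are all invented for illustration; concrete test inputs stand in for LENS's selectively refined abstractions.

```python
# Toy enumerative superoptimizer sketch (illustrative only; not LENS itself).
# It searches short instruction sequences over a tiny 8-bit ISA and prunes any
# candidate that disagrees with the reference fragment on some test input.
from itertools import product

OPS = {
    "inc": lambda x: (x + 1) & 0xFF,
    "dec": lambda x: (x - 1) & 0xFF,
    "dbl": lambda x: (x * 2) & 0xFF,
    "neg": lambda x: (-x) & 0xFF,
}

def run(program, x):
    for op in program:
        x = OPS[op](x)
    return x

def reference(x):
    # Fragment to optimize: inc; inc; dbl; dec; inc  (computes 2x + 4 mod 256)
    return run(["inc", "inc", "dbl", "dec", "inc"], x)

def superoptimize(max_len, tests):
    for length in range(1, max_len + 1):
        for candidate in product(OPS, repeat=length):
            # Prune as soon as one test input distinguishes candidate from reference;
            # testing all 256 inputs here doubles as a proof over the 8-bit domain.
            if all(run(candidate, t) == reference(t) for t in tests):
                return list(candidate)
    return None

print(superoptimize(max_len=3, tests=range(256)))   # ['inc', 'inc', 'dbl']
```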
Developing server applications that offload computation to a NIC accelerator is complex and laborious. Developers have to explore the design space, which includes semantic changes for different offloading strategies, as well as variations on parallelization, program-to-resource mapping, and communication strategies between program components across devices. We therefore present FLOEM -- a language, compiler, and runtime for programming NIC-accelerated applications. FLOEM enables this exploration by providing programming abstractions to assign computation to hardware...
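The sketch below conveys the flavor of assigning pipeline components to devices; the `Element` class, the key-value pipeline, and the round-robin "runtime" are hypothetical stand-ins, not FLOEM's actual API.

```python
# Minimal sketch of a dataflow program whose elements carry a device placement.
# Remapping a component between CPU and NIC is just a change to its annotation.
from collections import deque

class Element:
    """A program component with an input queue and a device placement tag."""
    def __init__(self, name, fn, device):
        self.name, self.fn, self.device = name, fn, device
        self.inbox, self.downstream = deque(), []

    def step(self):
        if self.inbox:
            out = self.fn(self.inbox.popleft())
            if out is not None:
                for nxt in self.downstream:
                    nxt.inbox.append(out)

store, results = {"k1": "v1", "k2": "v2"}, []

# Pipeline: parse request -> hash key -> look up value; the first two elements
# are placed on the NIC, the last on the host CPU.
parse  = Element("parse",  lambda pkt: pkt.split()[1], device="NIC")
hashk  = Element("hash",   lambda key: (key, hash(key) & 0xFFFF), device="NIC")
lookup = Element("lookup", lambda kh: results.append(store.get(kh[0], "MISS")), device="CPU")
parse.downstream, hashk.downstream = [hashk], [lookup]

parse.inbox.extend(["GET k1", "GET k3"])
for _ in range(3):                        # naive round-robin "runtime"
    for el in (parse, hashk, lookup):
        el.step()
print(results)                            # ['v1', 'MISS']
```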
Utilizing memory and register bandwidth in modern architectures may require swizzles --- non-trivial mappings of data and computations onto hardware resources such as shuffles. We develop Swizzle Inventor to help programmers implement swizzle programs, by writing program sketches that omit swizzles and delegating their creation to an automatic synthesizer. Our synthesis algorithm scales to real-world problems, allowing us to invent new GPU kernels for stencil computations, matrix transposition, a finite field multiplication...
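To make the "sketch with a hole" idea concrete, here is a toy Python example: the data-placement swizzle is left unspecified, and a brute-force search over a small candidate space fills it so that a correctness-style specification holds. The candidate formulas and the bank-conflict specification are invented for illustration and do not reflect Swizzle Inventor's actual algorithm.

```python
# Toy "swizzle synthesis" sketch. The hole is the placement function mapping
# tile element (r, c) to a shared-memory bank; the spec requires that both row
# reads and column reads touch N distinct banks (no bank conflicts).
N = 8  # tile size == number of banks, the conflict-prone case

CANDIDATES = {                       # candidate formulas for the hole
    "identity": lambda r, c: c,
    "shift":    lambda r, c: (c + r) % N,
    "xor":      lambda r, c: c ^ r,
    "reverse":  lambda r, c: N - 1 - c,
}

def conflict_free(swizzle):
    for r in range(N):
        if len({swizzle(r, c) for c in range(N)}) != N:
            return False                     # a row read hits some bank twice
    for c in range(N):
        if len({swizzle(r, c) for r in range(N)}) != N:
            return False                     # a column read hits some bank twice
    return True

print([name for name, f in CANDIDATES.items() if conflict_free(f)])
# ['shift', 'xor'] satisfy the spec; 'identity' and 'reverse' do not
```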
Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor...
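A minimal sketch of the "learn a performance model from a corpus" idea follows; it uses synthetic data, hand-picked features, and a linear least-squares fit purely as a stand-in for the paper's graph-based model.

```python
# Toy learned performance model: featurize tensor programs, fit on a corpus,
# then predict runtime for an unseen program. Data and features are synthetic.
import numpy as np

rng = np.random.default_rng(0)

def featurize(graph):
    # graph: list of (op, flops, bytes_moved) tuples for one tensor program
    return np.array([sum(g[1] for g in graph),     # total compute
                     sum(g[2] for g in graph),     # total memory traffic
                     len(graph), 1.0])             # op count, bias term

corpus = []
for _ in range(200):
    graph = [("op", rng.integers(1, 100), rng.integers(1, 50))
             for _ in range(rng.integers(2, 10))]
    runtime = 0.7 * sum(g[1] for g in graph) + 1.5 * sum(g[2] for g in graph) + rng.normal(0, 1)
    corpus.append((graph, runtime))

X = np.stack([featurize(g) for g, _ in corpus])
y = np.array([t for _, t in corpus])
w, *_ = np.linalg.lstsq(X, y, rcond=None)          # fit the cost model

test_graph = [("matmul", 80, 10), ("relu", 5, 5)]
print("predicted runtime:", featurize(test_graph) @ w)
```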
Tensor compilers, essential for generating efficient code for deep learning models across various applications, employ tensor graph rewrites as one of their key optimizations. These rewrites optimize tensor computational graphs with the expectation of preserving semantics for tensors of arbitrary rank and size. Despite this expectation, to the best of our knowledge, there does not exist a fully automated verification system to prove the soundness of these rewrites. Previous works, while successful in verifying rewrites for concrete rank, do not provide guarantees...
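The snippet below, written as an illustration rather than anything from the paper, shows a typical tensor graph rewrite and why testing it at a few concrete ranks is weaker than the guarantee the paper is after: random checks build confidence but are not a proof for arbitrary rank and size.

```python
# Rewrite under consideration:
#   reduce_sum(concat(a, b, axis=0))  ->  reduce_sum(a) + reduce_sum(b)
# Random testing over a few concrete ranks/shapes is evidence, not a proof.
import numpy as np

def lhs(a, b):
    return np.sum(np.concatenate([a, b], axis=0))

def rhs(a, b):
    return np.sum(a) + np.sum(b)

rng = np.random.default_rng(0)
for rank in range(1, 4):                         # concrete ranks only
    for _ in range(100):
        shape = tuple(rng.integers(1, 5, size=rank))
        a, b = rng.normal(size=shape), rng.normal(size=shape)
        assert np.allclose(lhs(a, b), rhs(a, b))
print("rewrite holds on all sampled shapes (evidence, not a soundness proof)")
```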
2D image convolution is ubiquitous in image processing and computer vision problems such as feature extraction. Exploiting parallelism is a common strategy for accelerating convolution. Parallel processors keep getting faster, but convolution algorithms remain memory bounded on parallel processors such as GPUs. Therefore, reducing memory communication is fundamental to accelerating convolution. To reduce communication, we reorganize the convolution algorithm to prefetch image regions into registers, and we do more work per thread with fewer threads. To enable portability to future architectures, we implement...
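A plain-Python model of the "more work per thread" idea follows (a 1D convolution is used for brevity; the load-counting scheme is an illustrative simplification of the GPU implementation described above).

```python
# Each "thread" computes T adjacent outputs and keeps the shared inputs in
# "registers", so memory loads per output drop compared to the naive version.
def conv_naive(signal, kernel):
    k, loads, out = len(kernel), 0, []
    for i in range(len(signal) - k + 1):
        acc = 0.0
        for j in range(k):
            acc += signal[i + j] * kernel[j]   # every product reloads from memory
            loads += 1
        out.append(acc)
    return out, loads

def conv_register_tiled(signal, kernel, T=4):
    k, loads, out = len(kernel), 0, []
    for base in range(0, len(signal) - k + 1, T):
        n = min(T, len(signal) - k + 1 - base)
        window = signal[base: base + n + k - 1]    # loaded once into "registers"
        loads += len(window)
        for t in range(n):
            out.append(sum(window[t + j] * kernel[j] for j in range(k)))
    return out, loads

sig, ker = list(range(32)), [0.25, 0.5, 0.25]
naive, naive_loads = conv_naive(sig, ker)
tiled, tiled_loads = conv_register_tiled(sig, ker)
assert naive == tiled
print(f"loads: naive={naive_loads}, register-tiled={tiled_loads}")   # 90 vs 46
```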
We developed Chlorophyll, a synthesis-aided programming model and compiler for the GreenArrays GA144, an extremely minimalist low-power spatial architecture that requires partitioning the program into fragments of no more than 256 instructions and 64 words of data. This processor is 100 times more energy efficient than its competitors, but it can currently only be programmed using a low-level stack-based language. The Chlorophyll programming model allows programmers to provide human insight by specifying a partial partitioning of data and computation...
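To illustrate what "partial partitioning plus synthesis" can mean, here is a toy sketch: the programmer pins some fragments to cores, and a brute-force search fills in the rest subject to a capacity limit while minimizing cross-core communication. The fragments, sizes, and capacity are invented; Chlorophyll itself is a separate language and its synthesis is far more involved.

```python
# Toy partition synthesis with partial programmer annotations (illustrative only).
from itertools import product

fragments = {"read": 2, "scale": 3, "filter": 3, "emit": 2}    # code size per fragment
edges = [("read", "scale"), ("scale", "filter"), ("filter", "emit")]
cores, capacity = ["c0", "c1"], 6                              # tiny per-core capacity
annotations = {"read": "c0", "emit": "c1"}                     # human insight, partial

free = [f for f in fragments if f not in annotations]
best = None
for choice in product(cores, repeat=len(free)):
    placement = dict(annotations, **dict(zip(free, choice)))
    load = {c: sum(sz for f, sz in fragments.items() if placement[f] == c) for c in cores}
    if any(l > capacity for l in load.values()):
        continue                                              # violates core capacity
    comm = sum(placement[a] != placement[b] for a, b in edges) # cross-core messages
    if best is None or comm < best[0]:
        best = (comm, placement)

print(best)   # (1, {'read': 'c0', 'emit': 'c1', 'scale': 'c0', 'filter': 'c1'})
```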
Representative modeling of I/O activity is crucial when designing large-scale distributed storage systems. Particularly important use cases are counterfactual "what-if" analyses that assess the impact of anticipated or hypothetical new policies or hardware prior to deployment. We propose Thesios, a methodology to accurately synthesize such full-resolution traces by carefully combining down-sampled traces collected from multiple disks attached to storage servers. Applying this approach to real-world traces that are already routinely...
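The following sketch only conveys the intuition of combining down-sampled traces: if each of K similar disks logs a 1/K sample of its events, merging the K sampled streams by timestamp approximates one full-resolution trace. The event schema and workload here are synthetic, and this is a simplification of Thesios, not its methodology.

```python
# Simplified trace-synthesis intuition (synthetic data; not Thesios itself).
import random

random.seed(0)
K = 10                                   # number of disks, each sampled at rate 1/K

def sampled_trace(disk_id, n_events=10_000, rate=1 / K):
    events = [(random.uniform(0, 60.0), disk_id,
               random.choice(["read", "write"]),
               random.choice([4, 64, 512]))          # (time_s, disk, op, size_KiB)
              for _ in range(n_events)]
    return [e for e in events if random.random() < rate]   # down-sampled at collection

synthesized = sorted(e for d in range(K) for e in sampled_trace(d))
print(f"{len(synthesized)} events in the synthesized full-resolution trace")
print("approx. IOPS:", len(synthesized) / 60.0)
```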
Developing an optimizing compiler backend remains a laborious process, especially for nontraditional ISAs that have been appearing recently. Superoptimization sidesteps the need for many code transformations by searching for the most optimal instruction sequence that is semantically equivalent to the original code fragment. Even though superoptimization can discover the best machine-specific optimizations, it has yet to become widely used. We propose GreenThumb, an extensible framework that reduces the cost of constructing superoptimizers...
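The sketch below illustrates the extensibility idea only: a framework-owned, ISA-agnostic search loop, where supporting a new ISA means describing its instructions' semantics behind a small interface. The interface and toy ISA are hypothetical and much simpler than GreenThumb's actual design.

```python
# Hypothetical extension point: the framework owns the generic search loop,
# and a new ISA is added by subclassing with its instruction semantics.
from itertools import product

class ISA:
    """Per-ISA description supplied by the framework user."""
    instructions: dict = {}
    def execute(self, program, state):
        for op in program:
            state = self.instructions[op](state)
        return state

class ToyBitISA(ISA):
    instructions = {
        "not":  lambda x: ~x & 0xF,
        "shl1": lambda x: (x << 1) & 0xF,
        "shr1": lambda x: x >> 1,
    }

def search(isa, spec, max_len, tests):       # framework-provided, ISA-agnostic
    for length in range(1, max_len + 1):
        for cand in product(isa.instructions, repeat=length):
            if all(isa.execute(cand, t) == spec(t) for t in tests):
                return cand
    return None

# Find a short sequence for "clear the low bit": spec(x) = x & 0b1110
print(search(ToyBitISA(), spec=lambda x: x & 0b1110, max_len=2, tests=range(16)))
# ('shr1', 'shl1')
```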
In massive programming courses, automated hint generation offers the promise of zero-cost, zero-latency assistance for students who are struggling to make progress on solving a program. While the more robust approach based on path construction requires tremendous engineering effort to build, another, easier-to-build approach based on program mutations suffers from low coverage.
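A toy example of the mutation-based approach follows: apply small candidate edits to the student's program and report an edit whose result passes the tests. Real systems mutate ASTs of the course language rather than source strings; the program, tests, and mutation list here are made up for illustration.

```python
# Toy mutation-based hint generation (illustrative only).
tests = [((0,), 0), ((3,), 6), ((5,), 15)]       # spec: sum of 0..n

student_src = """
def solve(n):
    total = 0
    for i in range(n):
        total += i
    return total
"""

MUTATIONS = [("range(n)", "range(n + 1)"), ("+=", "*="), ("total = 0", "total = 1")]

def passes(src):
    env = {}
    exec(src, env)
    return all(env["solve"](*args) == want for args, want in tests)

hints = []
for old, new in MUTATIONS:
    mutated = student_src.replace(old, new)
    if mutated != student_src and passes(mutated):
        hints.append(f"try replacing `{old}` with `{new}`")
print(hints)    # ['try replacing `range(n)` with `range(n + 1)`']
```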
Search-based techniques have been demonstrated effective in solving complex optimization problems that arise in domain-specific compilers for machine learning (ML). Unfortunately, deploying such techniques in production compilers is impeded by two limitations. First, prior works require factorization of a computation graph into smaller subgraphs over which search is applied. This decomposition is not only non-trivial but also significantly limits the scope of optimization. Second, to be applied within a single stage of the compilation flow,...
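The toy sketch below shows why searching over a whole graph at once matters: per-node decisions interact through relayout penalties, so optimizing each node (or small subgraph) in isolation can miss the best joint configuration. The cost model and configuration space are invented for illustration.

```python
# Random search over a whole-graph configuration (illustrative cost model).
import random

random.seed(0)
nodes = ["conv1", "conv2", "matmul", "softmax"]
LAYOUTS = ["NHWC", "NCHW"]                 # one decision per node

def cost(config):
    # Each node has a preferred layout, and every layout change between
    # adjacent nodes adds a relayout penalty -- decisions interact.
    preferred = {"conv1": "NHWC", "conv2": "NHWC", "matmul": "NCHW", "softmax": "NCHW"}
    c = sum(0 if config[n] == preferred[n] else 2 for n in nodes)
    c += sum(1 for a, b in zip(nodes, nodes[1:]) if config[a] != config[b])
    return c

def random_search(iters=200):
    best_cfg, best_cost = None, float("inf")
    for _ in range(iters):
        cfg = {n: random.choice(LAYOUTS) for n in nodes}
        if cost(cfg) < best_cost:
            best_cfg, best_cost = cfg, cost(cfg)
    return best_cfg, best_cost

print(random_search())   # all-preferred layouts with one relayout, cost 1
```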
In the past few years, neural architecture search (NAS) has become an increasingly important tool within the deep learning community. Despite the many recent successes of NAS, however, most existing approaches operate within highly structured design spaces, and hence explore only a small fraction of the full space of architectures while also requiring significant manual effort from domain experts. In this work, we develop techniques that enable efficient NAS in a significantly larger design space. To accomplish this, we propose to...
Precise hardware performance models play a crucial role in code optimizations. They can assist compilers in making heuristic decisions or aid autotuners in identifying the optimal configuration for a given program. For example, the autotuner for XLA, a machine learning compiler, discovered a 10-20% speedup on state-of-the-art models serving substantial production traffic at Google. Although there exist a few datasets for program performance prediction, they target small sub-programs such as basic blocks or kernels. This paper introduces...
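As a rough illustration of how such a dataset is used, the sketch below shows records pairing a program graph and a configuration with a measured runtime, plus a ranking-style metric (how much slower the model's chosen configuration is than the truly best one). The record schema, cost model, and numbers are hypothetical, not the dataset's actual format.

```python
# Hypothetical graph-level performance records and a top-1 slowdown metric.
records = [
    # (graph_id, config_features, measured_runtime_ms)
    ("resnet_block",     {"tile": 32, "fusion": 1}, 1.8),
    ("resnet_block",     {"tile": 64, "fusion": 1}, 1.2),
    ("resnet_block",     {"tile": 64, "fusion": 0}, 2.5),
    ("transformer_ffn",  {"tile": 32, "fusion": 1}, 3.1),
    ("transformer_ffn",  {"tile": 64, "fusion": 1}, 2.7),
]

def predicted_runtime(config):          # stand-in for a learned cost model
    return 100 / config["tile"] + (0 if config["fusion"] else 1)

def top1_slowdown(graph_id):
    # How much slower is the config the model picks than the truly best config?
    rows = [r for r in records if r[0] == graph_id]
    picked = min(rows, key=lambda r: predicted_runtime(r[1]))
    best = min(rows, key=lambda r: r[2])
    return picked[2] / best[2]

print({g: round(top1_slowdown(g), 2) for g in {"resnet_block", "transformer_ffn"}})
```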
We provide an implementation of an algorithm that, given a triangulated planar graph with m edges, returns a simple cycle that is a 3/4-balanced separator consisting of at most √(8m) edges. An efficient construction of short and balanced separators forms an essential ingredient in numerous algorithms, for example, for computing shortest paths, minimum cuts, or maximum flows. To the best of our knowledge, this is the first implementation with such a worst-case guarantee on the cycle length. We evaluate its performance and compare it to the algorithms recently studied by Holzer et al. [2009]....
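For intuition, the sketch below checks the balance condition for a candidate separator: after removing the separator's vertices, every remaining connected component must contain at most 3/4 of the graph's vertices. This is a simplification (the paper's separator is a simple cycle in a triangulated planar graph, with the √(8m) bound on its length); the grid example is invented for illustration.

```python
# Check the 3/4-balance condition for a candidate separator (vertex-count version).
from collections import deque

def components_after_removal(adj, removed):
    """Sizes of connected components left after deleting `removed` vertices."""
    seen, sizes = set(removed), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        sizes.append(size)
    return sizes

def is_balanced_separator(adj, separator, alpha=0.75):
    n = len(adj)
    return all(s <= alpha * n for s in components_after_removal(adj, separator))

# 3x3 grid graph; removing the middle column splits the rest into two components
# of 3 vertices each, so the 3/4-balance condition holds.
grid = {(r, c): [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= r + dr < 3 and 0 <= c + dc < 3]
        for r in range(3) for c in range(3)}
print(is_balanced_separator(grid, {(0, 1), (1, 1), (2, 1)}))   # True
```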
Analytical hardware performance models yield swift estimation of desired performance metrics. However, developing these analytical models for modern processors with sophisticated microarchitectures is an extremely laborious task and requires a firm understanding of the target microarchitecture's internal structure. In this paper, we introduce GRANITE, a new machine learning model that estimates the throughput of basic...