- Parallel Computing and Optimization Techniques
- Software Testing and Debugging Techniques
- Embedded Systems Design Techniques
- Cloud Computing and Resource Management
- Advanced Neural Network Applications
- Interconnection Networks and Systems
- Logic, programming, and type systems
- Software Engineering Research
- Ferroelectric and Negative Capacitance Devices
- Software System Performance and Reliability
- Distributed and Parallel Computing Systems
- Caching and Content Delivery
- Optimization and Search Problems
- Topic Modeling
- Online Learning and Analytics
- Formal Methods in Verification
- Semiconductor materials and devices
- Domain Adaptation and Few-Shot Learning
- Complexity and Algorithms in Graphs
- Machine Learning and Data Classification
- Data Stream Mining Techniques
- Natural Language Processing Techniques
- Advanced Image and Video Retrieval Techniques
- Advanced Graph Neural Networks
- Intelligent Tutoring Systems and Adaptive Learning
Google (United States)
2019-2025
Brain (Germany)
2022
University of California, Berkeley
2013-2019
Berkeley College
2016
Massachusetts Institute of Technology
2013
Trends in both consumer and high performance computing are bringing not only more cores, but also increased heterogeneity among the computational resources within a single machine. In many machines, one of the greatest computational resources is now their graphics coprocessors (GPUs), not just their primary CPUs. But GPU programming and memory models differ dramatically from those of conventional CPUs, and the relative performance characteristics of the different processors vary widely between machines. Different processors within a system often perform best with different algorithms and memory usage...
Developing a code optimizer is challenging, especially for new, idiosyncratic ISAs. Superoptimization can, in principle, discover machine-specific optimizations automatically by searching the space of all instruction sequences. If we can increase the size of the program fragments a superoptimizer can optimize, we will be able to discover more optimizations. We develop LENS, a search algorithm that increases the size of programs a superoptimizer can synthesize by rapidly pruning away invalid candidate programs. Pruning is achieved by selectively refining the abstraction under which...
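As a rough illustration of the search-and-prune idea described above, here is a toy Python sketch. The tiny ISA, the reference fragment, and the prune-by-concrete-tests strategy are all invented for illustration; concrete test inputs stand in for LENS's selectively refined abstractions.

```python
# Toy enumerative superoptimizer sketch (illustrative only; not LENS itself).
# It searches short instruction sequences over a tiny 8-bit ISA and prunes any
# candidate that disagrees with the reference fragment on some test input.
from itertools import product

OPS = {
    "inc": lambda x: (x + 1) & 0xFF,
    "dec": lambda x: (x - 1) & 0xFF,
    "dbl": lambda x: (x * 2) & 0xFF,
    "neg": lambda x: (-x) & 0xFF,
}

def run(program, x):
    for op in program:
        x = OPS[op](x)
    return x

def reference(x):
    # Fragment to optimize: inc; inc; dbl; dec; inc  (computes 2x + 4 mod 256)
    return run(["inc", "inc", "dbl", "dec", "inc"], x)

def superoptimize(max_len, tests):
    for length in range(1, max_len + 1):
        for candidate in product(OPS, repeat=length):
            # Prune as soon as one test input distinguishes candidate from reference;
            # testing all 256 inputs here doubles as a proof over the 8-bit domain.
            if all(run(candidate, t) == reference(t) for t in tests):
                return list(candidate)
    return None

print(superoptimize(max_len=3, tests=range(256)))   # ['inc', 'inc', 'dbl']
```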
Developing server applications that offload computation to a NIC accelerator is complex and laborious. Developers have to explore the design space, which includes semantic changes for different offloading strategies, as well as variations on parallelization, program-to-resource mapping, and communication strategies between program components across devices. We therefore present FLOEM -- a language, compiler, and runtime for programming NIC-accelerated applications. FLOEM enables this exploration by providing programming abstractions to assign computation to hardware...
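The sketch below conveys the flavor of assigning pipeline components to devices; the `Element` class, the key-value pipeline, and the round-robin "runtime" are hypothetical stand-ins, not FLOEM's actual API.

```python
# Minimal sketch of a dataflow program whose elements carry a device placement.
# Remapping a component between CPU and NIC is just a change to its annotation.
from collections import deque

class Element:
    """A program component with an input queue and a device placement tag."""
    def __init__(self, name, fn, device):
        self.name, self.fn, self.device = name, fn, device
        self.inbox, self.downstream = deque(), []

    def step(self):
        if self.inbox:
            out = self.fn(self.inbox.popleft())
            if out is not None:
                for nxt in self.downstream:
                    nxt.inbox.append(out)

store, results = {"k1": "v1", "k2": "v2"}, []

# Pipeline: parse request -> hash key -> look up value; the first two elements
# are placed on the NIC, the last on the host CPU.
parse  = Element("parse",  lambda pkt: pkt.split()[1], device="NIC")
hashk  = Element("hash",   lambda key: (key, hash(key) & 0xFFFF), device="NIC")
lookup = Element("lookup", lambda kh: results.append(store.get(kh[0], "MISS")), device="CPU")
parse.downstream, hashk.downstream = [hashk], [lookup]

parse.inbox.extend(["GET k1", "GET k3"])
for _ in range(3):                        # naive round-robin "runtime"
    for el in (parse, hashk, lookup):
        el.step()
print(results)                            # ['v1', 'MISS']
```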
Utilizing memory and register bandwidth in modern architectures may require swizzles --- non-trivial mappings of data and computations onto hardware resources such as shuffles. We develop Swizzle Inventor to help programmers implement swizzle programs, by writing program sketches that omit swizzles and delegating their creation to an automatic synthesizer. Our synthesis algorithm scales to real-world problems, allowing us to invent new GPU kernels for stencil computations, matrix transposition, a finite field multiplication...
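To make the "sketch with a hole" idea concrete, here is a toy Python example: the data-placement swizzle is left unspecified, and a brute-force search over a small candidate space fills it so that a correctness-style specification holds. The candidate formulas and the bank-conflict specification are invented for illustration and do not reflect Swizzle Inventor's actual algorithm.

```python
# Toy "swizzle synthesis" sketch. The hole is the placement function mapping
# tile element (r, c) to a shared-memory bank; the spec requires that both row
# reads and column reads touch N distinct banks (no bank conflicts).
N = 8  # tile size == number of banks, the conflict-prone case

CANDIDATES = {                       # candidate formulas for the hole
    "identity": lambda r, c: c,
    "shift":    lambda r, c: (c + r) % N,
    "xor":      lambda r, c: c ^ r,
    "reverse":  lambda r, c: N - 1 - c,
}

def conflict_free(swizzle):
    for r in range(N):
        if len({swizzle(r, c) for c in range(N)}) != N:
            return False                     # a row read hits some bank twice
    for c in range(N):
        if len({swizzle(r, c) for r in range(N)}) != N:
            return False                     # a column read hits some bank twice
    return True

print([name for name, f in CANDIDATES.items() if conflict_free(f)])
# ['shift', 'xor'] satisfy the spec; 'identity' and 'reverse' do not
```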
Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor...
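A minimal sketch of the "learn a performance model from a corpus" idea follows; it uses synthetic data, hand-picked features, and a linear least-squares fit purely as a stand-in for the paper's graph-based model.

```python
# Toy learned performance model: featurize tensor programs, fit on a corpus,
# then predict runtime for an unseen program. Data and features are synthetic.
import numpy as np

rng = np.random.default_rng(0)

def featurize(graph):
    # graph: list of (op, flops, bytes_moved) tuples for one tensor program
    return np.array([sum(g[1] for g in graph),     # total compute
                     sum(g[2] for g in graph),     # total memory traffic
                     len(graph), 1.0])             # op count, bias term

corpus = []
for _ in range(200):
    graph = [("op", rng.integers(1, 100), rng.integers(1, 50))
             for _ in range(rng.integers(2, 10))]
    runtime = 0.7 * sum(g[1] for g in graph) + 1.5 * sum(g[2] for g in graph) + rng.normal(0, 1)
    corpus.append((graph, runtime))

X = np.stack([featurize(g) for g, _ in corpus])
y = np.array([t for _, t in corpus])
w, *_ = np.linalg.lstsq(X, y, rcond=None)          # fit the cost model

test_graph = [("matmul", 80, 10), ("relu", 5, 5)]
print("predicted runtime:", featurize(test_graph) @ w)
```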
Tensor compilers, essential for generating efficient code for deep learning models across various applications, employ tensor graph rewrites as one of their key optimizations. These rewrites optimize tensor computational graphs with the expectation of preserving semantics for tensors of arbitrary rank and size. Despite this expectation, to the best of our knowledge, there does not exist a fully automated verification system to prove the soundness of these rewrites. Previous works, while successful in verifying rewrites for concrete rank, do not provide guarantees...
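The snippet below, written as an illustration rather than anything from the paper, shows a typical tensor graph rewrite and why testing it at a few concrete ranks is weaker than the guarantee the paper is after: random checks build confidence but are not a proof for arbitrary rank and size.

```python
# Rewrite under consideration:
#   reduce_sum(concat(a, b, axis=0))  ->  reduce_sum(a) + reduce_sum(b)
# Random testing over a few concrete ranks/shapes is evidence, not a proof.
import numpy as np

def lhs(a, b):
    return np.sum(np.concatenate([a, b], axis=0))

def rhs(a, b):
    return np.sum(a) + np.sum(b)

rng = np.random.default_rng(0)
for rank in range(1, 4):                         # concrete ranks only
    for _ in range(100):
        shape = tuple(rng.integers(1, 5, size=rank))
        a, b = rng.normal(size=shape), rng.normal(size=shape)
        assert np.allclose(lhs(a, b), rhs(a, b))
print("rewrite holds on all sampled shapes (evidence, not a soundness proof)")
```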
2D image convolution is ubiquitous in image processing and computer vision problems such as feature extraction. Exploiting parallelism is a common strategy for accelerating convolution. Parallel processors keep getting faster, but convolution algorithms remain memory bounded on parallel processors such as GPUs. Therefore, reducing memory communication is fundamental to accelerating convolution. To reduce communication, we reorganize the convolution algorithm to prefetch image regions into registers, and we do more work per thread with fewer threads. To enable portability to future architectures, we implement...
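A plain-Python model of the "more work per thread" idea follows (a 1D convolution is used for brevity; the load-counting scheme is an illustrative simplification of the GPU implementation described above).

```python
# Each "thread" computes T adjacent outputs and keeps the shared inputs in
# "registers", so memory loads per output drop compared to the naive version.
def conv_naive(signal, kernel):
    k, loads, out = len(kernel), 0, []
    for i in range(len(signal) - k + 1):
        acc = 0.0
        for j in range(k):
            acc += signal[i + j] * kernel[j]   # every product reloads from memory
            loads += 1
        out.append(acc)
    return out, loads

def conv_register_tiled(signal, kernel, T=4):
    k, loads, out = len(kernel), 0, []
    for base in range(0, len(signal) - k + 1, T):
        n = min(T, len(signal) - k + 1 - base)
        window = signal[base: base + n + k - 1]    # loaded once into "registers"
        loads += len(window)
        for t in range(n):
            out.append(sum(window[t + j] * kernel[j] for j in range(k)))
    return out, loads

sig, ker = list(range(32)), [0.25, 0.5, 0.25]
naive, naive_loads = conv_naive(sig, ker)
tiled, tiled_loads = conv_register_tiled(sig, ker)
assert naive == tiled
print(f"loads: naive={naive_loads}, register-tiled={tiled_loads}")   # 90 vs 46
```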
We developed Chlorophyll, a synthesis-aided programming model and compiler for the GreenArrays GA144, an extremely minimalist low-power spatial architecture that requires partitioning the program into fragments of no more than 256 instructions and 64 words of data. This processor is 100 times more energy efficient than its competitors, but it can currently only be programmed using a low-level stack-based language. The Chlorophyll programming model allows programmers to provide human insight by specifying a partial partitioning of data and computation...
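To illustrate what "partial partitioning plus synthesis" can mean, here is a toy sketch: the programmer pins some fragments to cores, and a brute-force search fills in the rest subject to a capacity limit while minimizing cross-core communication. The fragments, sizes, and capacity are invented; Chlorophyll itself is a separate language and its synthesis is far more involved.

```python
# Toy partition synthesis with partial programmer annotations (illustrative only).
from itertools import product

fragments = {"read": 2, "scale": 3, "filter": 3, "emit": 2}    # code size per fragment
edges = [("read", "scale"), ("scale", "filter"), ("filter", "emit")]
cores, capacity = ["c0", "c1"], 6                              # tiny per-core capacity
annotations = {"read": "c0", "emit": "c1"}                     # human insight, partial

free = [f for f in fragments if f not in annotations]
best = None
for choice in product(cores, repeat=len(free)):
    placement = dict(annotations, **dict(zip(free, choice)))
    load = {c: sum(sz for f, sz in fragments.items() if placement[f] == c) for c in cores}
    if any(l > capacity for l in load.values()):
        continue                                              # violates core capacity
    comm = sum(placement[a] != placement[b] for a, b in edges) # cross-core messages
    if best is None or comm < best[0]:
        best = (comm, placement)

print(best)   # (1, {'read': 'c0', 'emit': 'c1', 'scale': 'c0', 'filter': 'c1'})
```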
Representative modeling of I/O activity is crucial when designing large-scale distributed storage systems. Particularly important use cases are counterfactual "what-if" analyses that assess the impact of anticipated or hypothetical new policies or hardware prior to deployment. We propose Thesios, a methodology to accurately synthesize such full-resolution traces by carefully combining down-sampled traces collected from multiple disks attached to storage servers. Applying this approach to real-world traces that are already routinely...
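The following sketch only conveys the intuition of combining down-sampled traces: if each of K similar disks logs a 1/K sample of its events, merging the K sampled streams by timestamp approximates one full-resolution trace. The event schema and workload here are synthetic, and this is a simplification of Thesios, not its methodology.

```python
# Simplified trace-synthesis intuition (synthetic data; not Thesios itself).
import random

random.seed(0)
K = 10                                   # number of disks, each sampled at rate 1/K

def sampled_trace(disk_id, n_events=10_000, rate=1 / K):
    events = [(random.uniform(0, 60.0), disk_id,
               random.choice(["read", "write"]),
               random.choice([4, 64, 512]))          # (time_s, disk, op, size_KiB)
              for _ in range(n_events)]
    return [e for e in events if random.random() < rate]   # down-sampled at collection

synthesized = sorted(e for d in range(K) for e in sampled_trace(d))
print(f"{len(synthesized)} events in the synthesized full-resolution trace")
print("approx. IOPS:", len(synthesized) / 60.0)
```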
Developing an optimizing compiler backend remains a laborious process, especially for nontraditional ISAs that have been appearing recently. Superoptimization sidesteps the need for many code transformations by searching for the most optimal instruction sequence that is semantically equivalent to the original code fragment. Even though superoptimization can discover the best machine-specific optimizations, it has yet to become widely used. We propose GreenThumb, an extensible framework that reduces the cost of constructing superoptimizers...
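The sketch below illustrates the extensibility idea only: a framework-owned, ISA-agnostic search loop, where supporting a new ISA means describing its instructions' semantics behind a small interface. The interface and toy ISA are hypothetical and much simpler than GreenThumb's actual design.

```python
# Hypothetical extension point: the framework owns the generic search loop,
# and a new ISA is added by subclassing with its instruction semantics.
from itertools import product

class ISA:
    """Per-ISA description supplied by the framework user."""
    instructions: dict = {}
    def execute(self, program, state):
        for op in program:
            state = self.instructions[op](state)
        return state

class ToyBitISA(ISA):
    instructions = {
        "not":  lambda x: ~x & 0xF,
        "shl1": lambda x: (x << 1) & 0xF,
        "shr1": lambda x: x >> 1,
    }

def search(isa, spec, max_len, tests):       # framework-provided, ISA-agnostic
    for length in range(1, max_len + 1):
        for cand in product(isa.instructions, repeat=length):
            if all(isa.execute(cand, t) == spec(t) for t in tests):
                return cand
    return None

# Find a short sequence for "clear the low bit": spec(x) = x & 0b1110
print(search(ToyBitISA(), spec=lambda x: x & 0b1110, max_len=2, tests=range(16)))
# ('shr1', 'shl1')
```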
In massive programming courses, automated hint generation offers the promise of zero-cost, zero-latency assistance for students who are struggling to make progress on solving a program. While the more robust approach based on path construction requires tremendous engineering effort to build, another, easier-to-build approach based on program mutations suffers from low coverage.
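A toy example of the mutation-based approach follows: apply small candidate edits to the student's program and report an edit whose result passes the tests. Real systems mutate ASTs of the course language rather than source strings; the program, tests, and mutation list here are made up for illustration.

```python
# Toy mutation-based hint generation (illustrative only).
tests = [((0,), 0), ((3,), 6), ((5,), 15)]       # spec: sum of 0..n

student_src = """
def solve(n):
    total = 0
    for i in range(n):
        total += i
    return total
"""

MUTATIONS = [("range(n)", "range(n + 1)"), ("+=", "*="), ("total = 0", "total = 1")]

def passes(src):
    env = {}
    exec(src, env)
    return all(env["solve"](*args) == want for args, want in tests)

hints = []
for old, new in MUTATIONS:
    mutated = student_src.replace(old, new)
    if mutated != student_src and passes(mutated):
        hints.append(f"try replacing `{old}` with `{new}`")
print(hints)    # ['try replacing `range(n)` with `range(n + 1)`']
```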
Search-based techniques have been demonstrated effective in solving complex optimization problems that arise in domain-specific compilers for machine learning (ML). Unfortunately, deploying such techniques in production compilers is impeded by two limitations. First, prior works require factorization of a computation graph into smaller subgraphs over which search is applied. This decomposition is not only non-trivial but also significantly limits the scope of optimization. Second, to be applied within a single stage of the compilation flow,...
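The toy sketch below shows why searching over a whole graph at once matters: per-node decisions interact through relayout penalties, so optimizing each node (or small subgraph) in isolation can miss the best joint configuration. The cost model and configuration space are invented for illustration.

```python
# Random search over a whole-graph configuration (illustrative cost model).
import random

random.seed(0)
nodes = ["conv1", "conv2", "matmul", "softmax"]
LAYOUTS = ["NHWC", "NCHW"]                 # one decision per node

def cost(config):
    # Each node has a preferred layout, and every layout change between
    # adjacent nodes adds a relayout penalty -- decisions interact.
    preferred = {"conv1": "NHWC", "conv2": "NHWC", "matmul": "NCHW", "softmax": "NCHW"}
    c = sum(0 if config[n] == preferred[n] else 2 for n in nodes)
    c += sum(1 for a, b in zip(nodes, nodes[1:]) if config[a] != config[b])
    return c

def random_search(iters=200):
    best_cfg, best_cost = None, float("inf")
    for _ in range(iters):
        cfg = {n: random.choice(LAYOUTS) for n in nodes}
        if cost(cfg) < best_cost:
            best_cfg, best_cost = cfg, cost(cfg)
    return best_cfg, best_cost

print(random_search())   # all-preferred layouts with one relayout, cost 1
```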
In the past few years, neural architecture search (NAS) has become an increasingly important tool within the deep learning community. Despite the many recent successes of NAS, however, most existing approaches operate within highly structured design spaces, and hence explore only a small fraction of the full space of architectures while also requiring significant manual effort from domain experts. In this work, we develop techniques that enable efficient NAS in a significantly larger design space. To accomplish this, we propose to...
Precise hardware performance models play a crucial role in code optimizations. They can assist compilers in making heuristic decisions or aid autotuners in identifying the optimal configuration for a given program. For example, the autotuner for XLA, a machine learning compiler, discovered a 10-20% speedup on state-of-the-art models serving substantial production traffic at Google. Although there exist a few datasets for program performance prediction, they target small sub-programs such as basic blocks or kernels. This paper introduces...
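As a rough illustration of how such a dataset is used, the sketch below shows records pairing a program graph and a configuration with a measured runtime, plus a ranking-style metric (how much slower the model's chosen configuration is than the truly best one). The record schema, cost model, and numbers are hypothetical, not the dataset's actual format.

```python
# Hypothetical graph-level performance records and a top-1 slowdown metric.
records = [
    # (graph_id, config_features, measured_runtime_ms)
    ("resnet_block",     {"tile": 32, "fusion": 1}, 1.8),
    ("resnet_block",     {"tile": 64, "fusion": 1}, 1.2),
    ("resnet_block",     {"tile": 64, "fusion": 0}, 2.5),
    ("transformer_ffn",  {"tile": 32, "fusion": 1}, 3.1),
    ("transformer_ffn",  {"tile": 64, "fusion": 1}, 2.7),
]

def predicted_runtime(config):          # stand-in for a learned cost model
    return 100 / config["tile"] + (0 if config["fusion"] else 1)

def top1_slowdown(graph_id):
    # How much slower is the config the model picks than the truly best config?
    rows = [r for r in records if r[0] == graph_id]
    picked = min(rows, key=lambda r: predicted_runtime(r[1]))
    best = min(rows, key=lambda r: r[2])
    return picked[2] / best[2]

print({g: round(top1_slowdown(g), 2) for g in {"resnet_block", "transformer_ffn"}})
```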
We provide an implementation of an algorithm that, given a triangulated planar graph with m edges, returns a simple cycle that is a 3/4-balanced separator consisting of at most √(8m) edges. An efficient construction of short and balanced separators forms an essential ingredient in numerous algorithms, for example, for computing shortest paths, minimum cuts, or maximum flows. To the best of our knowledge, this is the first implementation with such a worst-case guarantee on the cycle length. We evaluate its performance and compare it to the algorithms recently studied by Holzer et al. [2009]....
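For intuition, the sketch below checks the balance condition for a candidate separator: after removing the separator's vertices, every remaining connected component must contain at most 3/4 of the graph's vertices. This is a simplification (the paper's separator is a simple cycle in a triangulated planar graph, with the √(8m) bound on its length); the grid example is invented for illustration.

```python
# Check the 3/4-balance condition for a candidate separator (vertex-count version).
from collections import deque

def components_after_removal(adj, removed):
    """Sizes of connected components left after deleting `removed` vertices."""
    seen, sizes = set(removed), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        sizes.append(size)
    return sizes

def is_balanced_separator(adj, separator, alpha=0.75):
    n = len(adj)
    return all(s <= alpha * n for s in components_after_removal(adj, separator))

# 3x3 grid graph; removing the middle column splits the rest into two components
# of 3 vertices each, so the 3/4-balance condition holds.
grid = {(r, c): [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= r + dr < 3 and 0 <= c + dc < 3]
        for r in range(3) for c in range(3)}
print(is_balanced_separator(grid, {(0, 1), (1, 1), (2, 1)}))   # True
```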
Analytical hardware performance models yield swift estimation of desired performance metrics. However, developing these analytical models for modern processors with sophisticated microarchitectures is an extremely laborious task and requires a firm understanding of the target microarchitecture's internal structure. In this paper, we introduce GRANITE, a new machine learning model that estimates the throughput of basic...