- Parallel Computing and Optimization Techniques
- Distributed and Parallel Computing Systems
- Tensor decomposition and applications
- Embedded Systems Design Techniques
- Algorithms and Data Compression
- Advanced Data Storage Technologies
- Computational Physics and Python Applications
- Interconnection Networks and Systems
- Logic, programming, and type systems
- Advanced Neural Network Applications
- Computer Graphics and Visualization Techniques
- Distributed systems and fault tolerance
- Software Testing and Debugging Techniques
- Graph Theory and Algorithms
- Model-Driven Software Engineering Techniques
- Modular Robots and Swarm Intelligence
- CCD and CMOS Imaging Sensors
- Network Packet Processing and Optimization
- Simulation Techniques and Applications
- Machine Learning and Data Classification
- Manufacturing Process and Optimization
- Data Visualization and Analytics
- VLSI and Analog Circuit Testing
- Machine Learning and Algorithms
- Cellular Automata and Applications
Stanford University
2020-2025
Massachusetts Institute of Technology
2012-2020
University of Illinois Urbana-Champaign
2010-2011
Tensor algebra is a powerful tool with applications in machine learning, data analytics, engineering and the physical sciences. Tensors are often sparse, and compound operations must frequently be computed in a single kernel for performance and to save memory. Programmers are left to write kernels for every operation of interest, with different mixes of dense and sparse tensors in different formats. The combinations are infinite, which makes it impossible to manually implement and optimize them all. This paper introduces the first compiler technique...
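A minimal sketch of what this looks like from the user's side, written against taco's public C++ API (the dimensions, names, and format choices below are illustrative, and the fused expression stands in for the compound operations the paper targets):

```cpp
// Sketch: expressing a compound sparse operation in taco's C++ API.
#include "taco.h"
using namespace taco;

int main() {
  Format csr({Dense, Sparse});   // compressed sparse row
  Format dv({Dense});            // dense vector

  Tensor<double> A("A", {512, 512}, csr);
  Tensor<double> x("x", {512}, dv);
  Tensor<double> z("z", {512}, dv);
  Tensor<double> y("y", {512}, dv);

  A.insert({0, 1}, 2.0);         // populate a few nonzeros (illustrative)
  x.insert({1}, 3.0);
  z.insert({0}, 1.0);
  A.pack(); x.pack(); z.pack();

  // One fused kernel for the compound operation y = A*x + z, instead of
  // a temporary for A*x followed by a separate vector addition.
  IndexVar i, j;
  y(i) = A(i, j) * x(j) + z(i);

  y.compile();    // generate the kernel for this mix of formats
  y.assemble();   // allocate output structures
  y.compute();    // run it
  return 0;
}
```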
Writing highly performant simulations requires a lot of human effort to optimize for an increasingly diverse set of hardware platforms, such as multi-core CPUs, GPUs, and distributed machines. Since these optimizations cut across both the design of geometric data structures and numerical linear algebra, code reusability and portability is frequently sacrificed for performance. We believe the key to making simulation programmers more productive at developing portable code is to introduce new linguistic abstractions, as in rendering...
Sparse tensors arise in problems in science, engineering, machine learning, and data analytics. Programs that operate on such tensors can exploit sparsity to reduce storage requirements and computational time. Developing and maintaining sparse software by hand, however, is a complex and error-prone task. Therefore, we propose treating sparsity as a property of tensors, not a tedious implementation task, and letting a compiler generate sparse code automatically from a sparsity-agnostic definition of the computation. This paper discusses...
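For concreteness, this is the kind of kernel such a compiler can derive from a sparsity-agnostic definition like y(i) = A(i,j) * x(j) when A is stored in CSR; the code below is a hand-written illustration (array names follow the usual row-pointer/column-index convention), not actual compiler output:

```cpp
// Hand-written illustration of a derived CSR sparse matrix-vector kernel.
void spmv_csr(int m, const int* rowptr, const int* col,
              const double* val, const double* x, double* y) {
  for (int i = 0; i < m; i++) {
    double sum = 0.0;
    // Iterate only over the stored nonzeros of row i.
    for (int p = rowptr[i]; p < rowptr[i + 1]; p++) {
      sum += val[p] * x[col[p]];
    }
    y[i] = sum;
  }
}
```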
With existing programming tools, writing high-performance simulation code is labor intensive and requires sacrificing readability and portability. The alternative is to prototype simulations in a high-level language like Matlab, thereby sacrificing performance. The Matlab programming model naturally describes the behavior of an entire physical system using the language of linear algebra. However, simulations also manipulate individual geometric elements, which are best represented using linked data structures such as meshes. Translating between linear algebra and these linked data structures comes at...
Many problems consist of a structured grid of points that are updated repeatedly based on the values of a fixed set of neighboring points in the same grid. To parallelize these problems we can geometrically divide the grid into chunks that are processed by different processors. One challenge with this approach is that the update of points at the periphery of a chunk requires values from neighboring chunks. These are often located in remote memory belonging to different processes. The naive implementation results in a lot of time spent on communication, leaving less time for useful computation. By using the Ghost Cell...
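A minimal sketch of the pattern in one dimension, assuming MPI and a decomposition where each rank owns n interior points plus one ghost cell on each side (names and layout are illustrative):

```cpp
// 1D ghost cell (halo) exchange. Each rank owns grid[1..n]; grid[0] and
// grid[n+1] are ghost cells mirroring the neighbors' boundary points.
// Ranks at the ends of the decomposition pass MPI_PROC_NULL as the
// missing neighbor.
#include <mpi.h>
#include <vector>

void exchange_ghosts(std::vector<double>& grid, int n,
                     int left, int right, MPI_Comm comm) {
  // Send my left boundary point to the left neighbor while receiving
  // the right neighbor's boundary point into my right ghost cell.
  MPI_Sendrecv(&grid[1], 1, MPI_DOUBLE, left, 0,
               &grid[n + 1], 1, MPI_DOUBLE, right, 0,
               comm, MPI_STATUS_IGNORE);
  // And the symmetric exchange in the other direction.
  MPI_Sendrecv(&grid[n], 1, MPI_DOUBLE, right, 1,
               &grid[0], 1, MPI_DOUBLE, left, 1,
               comm, MPI_STATUS_IGNORE);
  // Each rank can now update grid[1..n] without further communication
  // this iteration.
}
```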
This paper shows how to extend sparse tensor algebra compilers to introduce temporary tensors called workspaces to avoid inefficient sparse data structure accesses. We develop an intermediate representation (IR) for tensor operations called concrete index notation that specifies when sub-computations occur and where they are stored. We then describe the workspace transformation in this IR, how to programmatically invoke it, and how the IR is compiled to sparse code. Finally, we show how the transformation can be used to optimize sparse tensor kernels, including sparse matrix multiplication,...
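To illustrate the shape of kernel the workspace idea leads to, here is a hand-written sketch of one row of a sparse matrix multiply that scatters into a dense workspace instead of inserting into a compressed row out of order (the names and the split between this routine and its caller are illustrative):

```cpp
// One row of C = A*B (all CSR) using a dense row workspace w.
#include <vector>

void spgemm_row(int i, const int* Ap, const int* Aj, const double* Ax,
                const int* Bp, const int* Bj, const double* Bx,
                std::vector<double>& w, std::vector<char>& marked,
                std::vector<int>& nzlist) {
  for (int pa = Ap[i]; pa < Ap[i + 1]; pa++) {
    int k = Aj[pa];
    for (int pb = Bp[k]; pb < Bp[k + 1]; pb++) {
      int j = Bj[pb];
      if (!marked[j]) { marked[j] = 1; nzlist.push_back(j); }
      w[j] += Ax[pa] * Bx[pb];   // scatter into the dense workspace
    }
  }
  // The caller gathers (j, w[j]) for each j in nzlist into row i of C,
  // then clears w, marked, and nzlist before the next row.
}
```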
This paper shows how to build a sparse tensor algebra compiler that is agnostic to tensor formats (data layouts). We develop an interface that describes formats in terms of their capabilities and properties, and show how to build a modular code generator where new formats can be added as plugins. We then describe six implementations of the interface that compose to form dense, CSR/CSF, COO, DIA, ELL, and HASH formats and countless variants thereof. With these implementations at hand, our code generator can generate code to compute any expression on any combination of the aforementioned formats. To demonstrate our technique, we have...
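A hedged sketch of the idea behind such a per-dimension interface: each level describes its properties and how to iterate it, and a modular code generator composes levels into whole-tensor formats (a dense level over a compressed level gives CSR, and so on). The names below are illustrative, not the paper's actual interface:

```cpp
#include <utility>

struct LevelFormat {
  virtual ~LevelFormat() = default;
  virtual bool ordered() const = 0;   // are coordinates sorted?
  virtual bool unique() const = 0;    // no duplicate coordinates?
  // Position range at this level beneath parent position p.
  virtual std::pair<int, int> posBounds(int p) const = 0;
  virtual int coordAt(int pos) const = 0;
};

// A compressed level (e.g., the column dimension of CSR), backed by the
// usual pos/crd arrays.
struct CompressedLevel : LevelFormat {
  const int* pos = nullptr;
  const int* crd = nullptr;
  bool ordered() const override { return true; }
  bool unique() const override { return true; }
  std::pair<int, int> posBounds(int p) const override {
    return {pos[p], pos[p + 1]};
  }
  int coordAt(int q) const override { return crd[q]; }
};
```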
We address the problem of optimizing sparse tensor algebra in a compiler and show how to define standard loop transformations---split, collapse, and reorder---on sparse iteration spaces. The key idea is to track the transformation functions that map the original iteration space to derived iteration spaces. These functions are needed by the code generator to emit code that maps coordinates between spaces at runtime, since the data structures remain in the original space. We further demonstrate that we can tile both the universe of coordinates and the subset of nonzero coordinates: the former is analogous to tiling dense spaces, while the latter...
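A worked example of the coordinate-mapping idea for a split with factor 4: the derived space indexes i as (i1, i0) with i = i1*4 + i0, but the compressed data structure still stores coordinates in the original space, so the generated loop recovers the derived coordinates at runtime (hand-written sketch, not generated code):

```cpp
void iterate_split(const int* pos, const int* crd) {
  for (int p = pos[0]; p < pos[1]; p++) {
    int i  = crd[p];    // coordinate in the original space
    int i1 = i / 4;     // derived outer (tile) coordinate: i1 = i / 4
    int i0 = i % 4;     // derived inner coordinate:        i0 = i % 4
    (void)i1; (void)i0; // ... loop body indexed by (i1, i0) ...
  }
}
```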
We introduce DISTAL, a compiler for dense tensor algebra that targets modern distributed and heterogeneous systems. DISTAL lets users independently describe how tensors and computation map onto target machines through separate format and scheduling languages. The combination of choices for data and computation distribution creates a large design space that includes many algorithms from both the past (e.g., Cannon's algorithm) and the present (e.g., COSMA). DISTAL compiles a tensor algebra domain specific language to a task-based runtime system and supports nodes with...
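As a hedged sketch of one point in that data-distribution design space, consider a block-cyclic assignment of matrix elements to a pi x pj processor grid, the kind of distribution Cannon-style algorithms assume; the mapping function below is purely illustrative and is not DISTAL's format or scheduling language:

```cpp
struct Machine { int pi, pj; };  // processor grid dimensions

int ownerOfElement(int row, int col, int tileRows, int tileCols, Machine m) {
  int bi = row / tileRows;                  // tile row index
  int bj = col / tileCols;                  // tile column index
  return (bi % m.pi) * m.pj + (bj % m.pj);  // cyclic over the grid
}
```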
We propose the Sparse Abstract Machine (SAM), an abstract machine model for targeting sparse tensor algebra to reconfigurable and fixed-function spatial dataflow accelerators. SAM defines a streaming dataflow abstraction with sparse primitives that encompass a large space of scheduled tensor algebra expressions. SAM graphs naturally separate tensor formats from algorithms and are expressive enough to incorporate arbitrary iteration orderings and many hardware-specific optimizations. We also present Custard, a compiler from a high-level language to SAM that demonstrates...
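A software analogue of one streaming primitive such dataflow designs rely on: intersecting two sorted sparse coordinate streams, as required when multiplying two sparse operands. This sketch is for illustration only and does not reproduce SAM's actual primitives:

```cpp
#include <vector>

std::vector<int> intersect(const std::vector<int>& a,
                           const std::vector<int>& b) {
  std::vector<int> out;
  size_t pa = 0, pb = 0;
  while (pa < a.size() && pb < b.size()) {
    if (a[pa] == b[pb]) {          // coordinate present in both streams
      out.push_back(a[pa]);
      pa++; pb++;
    } else if (a[pa] < b[pb]) {
      pa++;                        // advance the stream that is behind
    } else {
      pb++;
    }
  }
  return out;
}
```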
We describe an abstract loop-based intermediate representation that can express fused implementations of relational algebra expressions on sets and bags (multisets). The loops are abstracted away from physical data structures, thus making it easier to generate, reason about, and perform optimizations like fusion on. The IR supports the natural relational algebra operators as well as complex operators used in production database systems, including outer joins, non-equi joins, and differences. We then show how to compile this IR to efficient C++ code...
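An illustration of the kind of fused loop such an IR abstracts: a merge equi-join of two relations sorted on the join key, with physical storage reduced to key arrays for clarity. Handling bags (duplicate keys), outer joins, and non-equi joins adds cases to this same loop shape; the sketch is illustrative, not the paper's IR:

```cpp
#include <cstdio>
#include <vector>

void mergeJoin(const std::vector<int>& keyA, const std::vector<int>& keyB) {
  size_t a = 0, b = 0;
  while (a < keyA.size() && b < keyB.size()) {
    if (keyA[a] == keyB[b]) {
      std::printf("match on key %d\n", keyA[a]);  // emit joined tuple
      a++; b++;
    } else if (keyA[a] < keyB[b]) {
      a++;
    } else {
      b++;
    }
  }
}
```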
Tensor algebra is an important computational abstraction that is increasingly used in data analytics, machine learning, engineering, and the physical sciences. However, the number of tensor expressions is unbounded, which makes it hard to develop and optimize libraries. Furthermore, tensors are often sparse (most components are zero), which means the code has to traverse compressed formats. To support programmers we have developed taco, a code generation tool that generates dense, sparse, and mixed kernels from tensor algebra expressions. This paper...
While loop reordering and fusion can make big impacts on the constant-factor performance of dense tensor programs, their effects on sparse tensor programs are asymptotic, often leading to orders of magnitude differences in practice. Sparse tensors also introduce a choice of compressed storage formats that likewise has asymptotic effects. Research into sparse tensor compilers has led to simplified languages that express these tradeoffs, but the user is still expected to provide a schedule that makes the decisions. This is challenging because schedulers must anticipate...
With the slowing of Moore’s law, computer architects have turned to domain-specific hardware specialization to continue improving the performance and efficiency of computing systems. However, specialization typically entails significant modifications to the software stack to properly leverage the updated hardware. The lack of a structured approach for updating the compiler and the accelerator in tandem has impeded many attempts to systematize this procedure. We propose a new approach that enables flexible and evolvable hardware specialization based on coarse-grained reconfigurable arrays...
This paper shows how to compile sparse array programming languages. A sparse array programming language is an array programming language that supports element-wise application, reduction, and broadcasting of arbitrary functions over dense and sparse arrays with any fill value. Such a language has great expressive power and can express sparse and dense linear and tensor algebra, functions over images, exclusion and inclusion filters, and even graph algorithms. Our compiler strategy generalizes prior work in the literature on sparse tensor algebra compilation to support any function applied to sparse arrays, instead of only addition...
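A sketch of the semantics this generality requires, assuming two sparse vectors: stored coordinates are merged, and the function sees an operand's fill value wherever it has no stored entry. The types and names below are illustrative of the behavior only, not the paper's compiler output:

```cpp
#include <climits>
#include <functional>
#include <vector>

struct SparseVec {
  std::vector<int> crd;     // sorted coordinates of stored entries
  std::vector<double> val;  // stored values
  double fill;              // value at every unstored coordinate
};

SparseVec apply(const SparseVec& a, const SparseVec& b,
                const std::function<double(double, double)>& f) {
  SparseVec out{{}, {}, f(a.fill, b.fill)};  // the fill maps through f
  size_t pa = 0, pb = 0;
  while (pa < a.crd.size() || pb < b.crd.size()) {
    int ia = pa < a.crd.size() ? a.crd[pa] : INT_MAX;
    int ib = pb < b.crd.size() ? b.crd[pb] : INT_MAX;
    int i = ia < ib ? ia : ib;
    double va = (ia == i) ? a.val[pa++] : a.fill;
    double vb = (ib == i) ? b.val[pb++] : b.fill;
    out.crd.push_back(i);
    out.val.push_back(f(va, vb));  // compaction of fill results omitted
  }
  return out;
}
```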
Real world arrays often contain underlying structure, such as sparsity, runs of repeated values, or symmetry. Specializing code for this structure yields significant speedups. But automatically generating efficient code for structured data is challenging, especially when arrays with different structures interact. We show how to abstract over array structures so that the compiler can generate code to coiterate over any combination of them. Our technique enables new formats (such as 1DVBL for irregular clustered sparsity), new iteration strategies...
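A minimal sketch of what coiteration over structure buys, assuming a run-length-encoded vector interacting with a dense one: within a run the RLE value is constant, so it is hoisted out of the inner loop instead of re-read per element (format and names are illustrative):

```cpp
#include <vector>

struct Rle {
  std::vector<int> runEnd;    // exclusive end index of each run
  std::vector<double> value;  // constant value within each run
};

double dotRleDense(const Rle& r, const std::vector<double>& d) {
  double sum = 0.0;
  int start = 0;
  for (size_t run = 0; run < r.runEnd.size(); run++) {
    double v = r.value[run];  // hoisted: constant across the run
    for (int i = start; i < r.runEnd[run]; i++) {
      sum += v * d[i];
    }
    start = r.runEnd[run];
  }
  return sum;
}
```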
It is common for object-oriented programs to have both mutable and immutable classes. Immutable classes simplify programming because the programmer does not have to reason about side-effects. Sometimes programmers write immutable classes from scratch; other times they transform mutable classes into immutable ones. To transform a class, they must find all methods that mutate its transitive state and all objects that can enter or escape the state of the class. The analyses are non-trivial and the rewriting is tedious. Fortunately, this can be automated.
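The shape of the rewrite, sketched in C++ for consistency with the other examples here (the paper's setting is object-oriented programs generally): a mutator is turned into a method that constructs and returns a fresh instance, so no object changes after construction:

```cpp
class Point {
  const double x_, y_;
public:
  Point(double x, double y) : x_(x), y_(y) {}
  // Before: void translate(double dx, double dy) { x_ += dx; y_ += dy; }
  // After: return a new object instead of mutating this one.
  Point translate(double dx, double dy) const {
    return Point(x_ + dx, y_ + dy);
  }
  double x() const { return x_; }
  double y() const { return y_; }
};
```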
Although an agile approach is standard for software design, how to properly adapt this method to hardware is still an open question. This work addresses this question while building a system on chip (SoC) with specialized accelerators. Rather than using a traditional waterfall design flow, which starts by studying the application to be accelerated, we begin by constructing a complete flow from an application expressed in a high-level domain-specific language (DSL), in our case Halide, to a generic coarse-grained reconfigurable array...
Data packing before and after communication can make up as much as 90% of the communication time on modern computers. Despite MPI's well-defined datatype interface for non-contiguous data access, many codes use manual pack loops for performance reasons. Programmers write access-pattern specific pack loops (e.g., with loop unrolling) for which compilers emit optimized code. In contrast, MPI implementations in use today interpret datatypes at pack time, resulting in high overheads. In this work we explore the effectiveness of using runtime compilation...
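A sketch of the two approaches being contrasted, for sending one column of a row-major matrix with MPI: a hand-written pack loop versus a derived datatype describing the same layout (the function, buffer size, and tags are illustrative):

```cpp
#include <mpi.h>

void send_column(const double* matrix, int rows, int cols, int col,
                 int dest, MPI_Comm comm) {
  // (a) Manual pack loop: compilers can specialize and unroll this.
  double buf[1024];  // assumes rows <= 1024 for the sketch
  for (int i = 0; i < rows; i++) buf[i] = matrix[i * cols + col];
  MPI_Send(buf, rows, MPI_DOUBLE, dest, 0, comm);

  // (b) MPI derived datatype: the layout is described once, but typical
  // MPI libraries interpret it at pack time, which is the overhead that
  // runtime compilation of datatypes aims to remove.
  MPI_Datatype column;
  MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
  MPI_Type_commit(&column);
  MPI_Send(matrix + col, 1, column, dest, 1, comm);
  MPI_Type_free(&column);
}
```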
Image processing and machine learning applications benefit tremendously from hardware acceleration. Existing compilers target either FPGAs, which sacrifice power and performance for programmability, or ASICs, which become obsolete as applications change. Programmable domain-specific accelerators, such as coarse-grained reconfigurable arrays (CGRAs), have emerged as a promising middle-ground, but they have traditionally been difficult compiler targets since they use a different memory abstraction. In contrast to CPUs and GPUs, the...