- Parallel Computing and Optimization Techniques
- Embedded Systems Design Techniques
- Logic, programming, and type systems
- Advanced Data Storage Technologies
- Distributed and Parallel Computing Systems
- Formal Methods in Verification
- Cloud Computing and Resource Management
- Distributed systems and fault tolerance
- Advanced Memory and Neural Computing
- Software Testing and Debugging Techniques
- Software System Performance and Reliability
- Ferroelectric and Negative Capacitance Devices
- VLSI and Analog Circuit Testing
- Interconnection Networks and Systems
- Model-Driven Software Engineering Techniques
- Quantum Computing Algorithms and Architecture
- Numerical Methods and Algorithms
- Cellular Automata and Applications
- Software Engineering Research
- Advanced Electron Microscopy Techniques and Applications
- Scientific Computing and Data Management
- Security and Verification in Computing
- Integrated Circuits and Semiconductor Failure Analysis
- Algorithms and Data Compression
- Tensor decomposition and applications
University of Cambridge (2024-2025)
University of Edinburgh (2020-2024)
Edinburgh College (2024)
Lawrence Livermore National Laboratory (2023)
ETH Zurich (2015-2020)
Board of the Swiss Federal Institutes of Technology (2018)
Institut national de recherche en informatique et en automatique (2012-2015)
École Normale Supérieure - PSL (2012-2015)
École Normale Supérieure (2013-2014)
Laboratoire de l'Informatique du Parallélisme (2013-2014)
The polyhedral model for loop parallelization has proved to be an effective tool for advanced optimization and automatic parallelization of programs in higher-level languages. Yet, to integrate such optimizations seamlessly into production compilers, they must be performed on the compiler's internal, low-level intermediate representation (IR). With Polly, we present an infrastructure for polyhedral optimizations on such an IR. We describe the detection of program parts amenable to a polyhedral optimization (so-called static control parts), their translation into a Z-polyhedral representation, and optimizations on this...
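Since static control parts are central here, a minimal sketch may help; the following C loop nest (names and kernel mine, not Polly's) is the kind of region such polyhedral infrastructures detect: loop bounds, conditionals, and array subscripts are all affine functions of the surrounding loop counters and runtime parameters.

```c
/* A static control part (SCoP): all bounds and subscripts are affine
 * expressions in the loop counters i, j and the parameter N. */
void triangular_matvec(int N, double A[N][N], double x[N], double y[N]) {
  for (int i = 0; i < N; i++)        /* affine bounds: 0 <= i < N */
    for (int j = 0; j <= i; j++)     /* triangular, still affine  */
      y[i] += A[i][j] * x[j];        /* affine subscripts (i, j)  */
}
```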
Programming accelerators such as GPUs with low-level APIs and languages such as OpenCL and CUDA is difficult, error-prone, and not performance-portable. Automatic parallelization and domain-specific languages (DSLs) have been proposed to hide this complexity and regain performance portability. We present PENCIL, a rigorously-defined subset of GNU C99, enriched with additional language constructs, that enables compilers to exploit parallelism and produce highly optimized code when targeting accelerators. PENCIL aims to serve both as a portable...
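To give a flavor of the approach, here is a hedged sketch using PENCIL-style constructs (the spellings are illustrative and the no-op fallback macro is mine, not from the specification): the annotations state facts about data and iterations that a compiler targeting accelerators could not otherwise derive.

```c
#ifndef __pencil_assume
#define __pencil_assume(e) ((void)0)   /* no-op stand-in outside PENCIL */
#endif

/* Illustrative PENCIL-style kernel: C99 plus extra guarantees. */
void scale(int n, double a[restrict static n], double s) {
  __pencil_assume(n > 0);              /* caller-provided fact */
#pragma pencil independent             /* iterations declared independent */
  for (int i = 0; i < n; i++)
    a[i] *= s;
}
```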
Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel loops may not enable enough reuse in the cache. Tiling along the time dimension as well enhances locality, but may require other transformations, like loop skewing, that inhibit inter-tile parallelism.
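A minimal sketch of both variants on a 1-D Jacobi-style stencil (the tile sizes TT, TJ and all names are illustrative):

```c
/* Untiled: sequential time loop around a parallel space loop. Tiling
 * only the i loop reuses each row within a single time step only. */
void jacobi(int T, int N, double A[2][N]) {
  for (int t = 0; t < T; t++)
    for (int i = 1; i < N - 1; i++)
      A[(t+1)%2][i] = 0.25 * (A[t%2][i-1] + 2.0*A[t%2][i] + A[t%2][i+1]);
}

/* Time tiling after skewing j = i + t: the skew makes all dependences
 * point lexicographically forward, so rectangular (t, j) tiles become
 * legal, but tiles in the same time band now depend on their left
 * neighbor -- exactly the loss of inter-tile parallelism noted above. */
void jacobi_skewed_tiled(int T, int N, double A[2][N], int TT, int TJ) {
  for (int tt = 0; tt < T; tt += TT)
    for (int jj = 1 + tt; jj < (N - 2) + (tt + TT); jj += TJ)
      for (int t = tt; t < tt + TT && t < T; t++) {
        int lo = jj > 1 + t ? jj : 1 + t;
        int hi = jj + TJ - 1 < N - 2 + t ? jj + TJ - 1 : N - 2 + t;
        for (int j = lo; j <= hi; j++) {
          int i = j - t;               /* undo the skew */
          A[(t+1)%2][i] = 0.25 * (A[t%2][i-1] + 2.0*A[t%2][i] + A[t%2][i+1]);
        }
      }
}
```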
Time-tiling is necessary for the efficient execution of iterative stencil computations. Classical hyper-rectangular tiles cannot be used due to the combination of backward and forward dependences along the space dimensions. Existing techniques trade temporal data reuse for inefficiencies in other areas, such as load imbalance, redundant computations, or increased control flow overhead, making them challenging to use on GPUs.
Most compilers have a single core intermediate representation (IR) (e.g., LLVM), sometimes complemented with vaguely defined IR-like data structures. This IR is commonly low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analyses to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level-by-level, and performs code...
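As a toy illustration of level-by-level lowering (all names hypothetical, sketched in C rather than in any real dialect syntax):

```c
/* Level 1, domain ops: the whole operation is one IR node, so rewrites
 * like "matmul(A, transpose(transpose(B))) -> matmul(A, B)" are simple
 * pattern matches.  (matmul here is a hypothetical high-level op.) */
matmul(C, A, B, M, N, K);

/* Level 2, loops: the same op after one lowering step; tiling and
 * interchange are natural rewrites at this level. */
for (int i = 0; i < M; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < K; k++)
      C[i*N + j] += A[i*K + k] * B[k*N + j];

/* Level 3, instructions: after the final lowering only loads, stores,
 * and arithmetic remain -- the domain information is gone, which is why
 * recovering it from a single low-level IR is so costly. */
```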
Abstract mathematical representations such as integer polyhedra have been shown to be useful for precisely analyzing computational kernels and for expressing complex loop transformations. Such transformations rely on abstract syntax tree (AST) generators to convert the mathematical representation back into an imperative program. Generic AST generators avoid the need to resort to transformation-specific code generators, which may be very costly or technically difficult to develop as transformations become more complex. Existing generators have proven their effectiveness, but they hit...
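For intuition, consider generating an AST for the triangular iteration domain {(i, j) : 0 ≤ i < N ∧ i ≤ j < N} (an illustrative example, not drawn from the paper):

```c
/* Iteration domain: { (i,j) : 0 <= i < N and i <= j < N }.
 * Projecting out j yields the outer bounds 0 <= i < N; the per-i bounds
 * i <= j < N follow.  A generic AST generator therefore emits: */
for (int i = 0; i < N; i++)
  for (int j = i; j < N; j++)
    S(i, j);   /* the statement instance executed at (i, j) */
```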
Code transformations, such as loop tiling and loop fusion, are of key importance for the efficient implementation of stencil computations. However, their direct application to a large code base is costly and severely impacts program maintainability. While recently introduced domain-specific languages facilitate the application of such transformations, they typically still require manual tuning or auto-tuning techniques to select the transformations that yield optimal performance. In this paper, we introduce MODESTO, a model-driven optimization...
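The kind of tradeoff such a model must capture shows up already in a two-stage 1-D stencil (illustrative sketch): fusing the stages eliminates the memory round trip for the temporary at the cost of redundant computation.

```c
/* Unfused: tmp is fully produced, then consumed -- two memory sweeps. */
void two_stage(int N, const double *in, double *tmp, double *out) {
  for (int i = 1; i < N - 1; i++) tmp[i] = 0.5 * (in[i-1] + in[i+1]);
  for (int i = 2; i < N - 2; i++) out[i] = 0.5 * (tmp[i-1] + tmp[i+1]);
}

/* Fused: each out[i] recomputes the two tmp values it needs, trading
 * extra flops for locality -- the choice a cost model must arbitrate. */
void two_stage_fused(int N, const double *in, double *out) {
  for (int i = 2; i < N - 2; i++)
    out[i] = 0.5 * (0.5 * (in[i-2] + in[i]) + 0.5 * (in[i] + in[i+2]));
}
```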
Programming today's increasingly complex heterogeneous hardware is difficult, as it commonly requires the use of data-parallel languages, pragma annotations, specialized libraries, or DSL compilers. Adding explicit accelerator support to a larger code base is not only costly, but also introduces additional complexity that hinders long-term maintenance. We propose a new compiler that brings us closer to the dream of automatic accelerator mapping. Starting from a sequential IR, we automatically generate a hybrid executable...
Iterative stencil computations are important in scientific computing and, increasingly, in the embedded and mobile domain. Recent publications have shown that tiling schemes that ensure concurrent start provide efficient ways to execute these kernels. Diamond tiling and hybrid-hexagonal tiling are two schemes that enable concurrent start. Both have different advantages: diamond tiling has been integrated into a general-purpose optimization framework and uses a cost function to choose among tiling hyperplanes, whereas hybrid-hexagonal tiling offers greater flexibility with tile sizes, which has been exploited for effective...
The construction of effective loop nest optimizers and parallelizers remains challenging despite decades of work in the area. Due to the increasing diversity of loop-intensive applications and the complex memory/computation hierarchies of modern processors, optimization heuristics are pulled towards conflicting goals, highlighting the lack of a systematic approach to optimizing for locality and parallelism. Acknowledging these demands on optimization, we propose an algorithmic template capable of modeling multi-level parallelism...
The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized external code. We introduce a compiler optimization that reaches the performance of optimized BLAS libraries without the need for an external implementation or automatic tuning. Our approach provides competitive performance across hardware architectures and can be generalized to deliver the same benefits for algebraic path problems. By making fast...
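A sketch of the underlying observation (illustrative, not the paper's algorithm): many contractions become a plain matrix product once indices are grouped, after which BLAS-style blocking and packing apply directly.

```c
/* Contraction C[i][j][k] += A[i][j][l] * B[l][k]: fusing (i, j) into a
 * single row index m exposes a GEMM, C'[m][k] += A'[m][l] * B[l][k]. */
void contract(int I, int J, int K, int L,
              const double *Am, const double *B, double *Cm) {
  for (int m = 0; m < I * J; m++)      /* m enumerates (i, j) pairs */
    for (int k = 0; k < K; k++)
      for (int l = 0; l < L; l++)
        Cm[m*K + k] += Am[m*L + l] * B[l*K + k];
}
```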
While the cost of computation is an easy-to-understand local property, data movement on cached architectures depends on global state, does not compose, and is hard to predict. As a result, programmers often fail to consider data movement. Existing cache models and simulators provide the missing information but are computationally expensive. We present a lightweight model for fully associative caches with a least-recently-used (LRU) replacement policy that gives fast and accurate results. We count misses without explicit...
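For contrast, here is the brute-force baseline such a model replaces (a minimal sketch, assuming unit-size blocks and a plain address trace): simulating the LRU stack explicitly costs time per access, which is precisely what an analytical count of misses avoids.

```c
#include <stddef.h>

/* Misses of a fully associative LRU cache of `capacity` blocks over an
 * address trace, by explicit simulation (O(n * capacity) time). */
size_t lru_misses(const size_t *trace, size_t n, size_t capacity,
                  size_t *stack /* scratch array of length `capacity` */) {
  size_t used = 0, misses = 0;
  for (size_t t = 0; t < n; t++) {
    size_t i = 0;
    while (i < used && stack[i] != trace[t]) i++;   /* LRU-stack depth */
    if (i == used) {                                /* miss */
      misses++;
      if (used < capacity) used++;                  /* cold: grow */
      else i = capacity - 1;                        /* evict LRU entry */
    }
    for (size_t j = i; j > 0; j--) stack[j] = stack[j-1];
    stack[0] = trace[t];                            /* move to MRU */
  }
  return misses;
}
```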
Modern Hardware Description Languages (HDLs) such as SystemVerilog or VHDL are, due to their sheer complexity, insufficient to transport designs through modern circuit design flows. Instead, each automation tool lowers HDLs into its own Intermediate Representation (IR). These tools are monolithic and mostly proprietary, disagree in their implementation of the HDLs, and while many redundant IRs exist, no single IR today can be used throughout the entire flow. To solve this problem, we propose LLHD, a multi-level IR. LLHD is designed...
Inlining is a core transformation in optimizing compilers. It replaces a function call (call site) with the body of the called function (callee). It helps reduce call overhead and binary size and, more importantly, enables other optimizations. The problem of inlining has been extensively studied, but it is far from being solved; predicting which inlining decisions are beneficial is nontrivial due to interactions with the rest of the compiler pipeline. Previous work has mainly focused on designing heuristics for better inlining decisions and has not investigated optimal inlining,...
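A minimal illustration of why inlining is an enabler rather than just a call-overhead saver (example mine):

```c
static int clamp(int x, int lo, int hi) {
  return x < lo ? lo : (x > hi ? hi : x);
}

int brighten(int pixel) {
  return clamp(pixel + 16, 0, 255);
}

/* After inlining, the call site becomes
 *   return pixel + 16 < 0 ? 0 : (pixel + 16 > 255 ? 255 : pixel + 16);
 * and if the caller's context proves pixel >= 0, later passes can fold
 * the first branch away entirely -- impossible across an opaque call. */
```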
High-performance micro-kernels must fully exploit today's diverse and specialized hardware to deliver peak performance for DNNs. While higher-level optimizations for DNNs are offered by numerous compilers (e.g., MLIR, TVM, OpenXLA), performance-critical micro-kernels are left to specialized code generators or handwritten assembly. Even though widely adopted compilers (e.g., LLVM, GCC) offer tuned backends, their CPU-focused input abstractions, unstructured IR, and general-purpose best-effort design inhibit tailored code generation for innovative hardware. We...
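The flavor of kernel in question, sketched in plain C (sizes and names illustrative): a register-blocked outer-product update whose accumulators must map exactly onto the target's vector registers and FMA units, which is the level of control general-purpose backends struggle to guarantee.

```c
/* 4x4 register-blocked GEMM micro-kernel: C += A * B for one packed
 * tile.  The 16 accumulators are intended to live entirely in registers. */
void microkernel_4x4(int K, const float *A, const float *B,
                     float *C, int ldc) {
  float acc[4][4] = {{0.0f}};
  for (int k = 0; k < K; k++)
    for (int i = 0; i < 4; i++)
      for (int j = 0; j < 4; j++)
        acc[i][j] += A[k*4 + i] * B[k*4 + j];   /* one FMA each */
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      C[i*ldc + j] += acc[i][j];
}
```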
Tensor algebra is a crucial component of data-intensive workloads such as machine learning and scientific computing. As the complexity of data grows, scientists often face a dilemma between the highly specialized routines of dense tensor algebra and the efficient structure-aware algorithms provided by sparse tensor algebra. In this paper, we introduce DASTAC, a framework that propagates a tensor's captured high-level structure down to low-level code generation by incorporating techniques such as automatic layout compression, polyhedral analysis,...
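One instance of that dilemma, as a sketch (illustrative, not DASTAC's technique): a triangular matrix stored densely wastes half its space, while a packed layout with a closed-form index stays both compressed and analyzable.

```c
#include <stddef.h>

/* Upper-triangular n x n matrix packed row by row into n*(n+1)/2 slots.
 * Row i starts at offset i*n - i*(i-1)/2; the closed-form index keeps
 * accesses branch-free and amenable to polyhedral-style analysis. */
static inline size_t tri_idx(size_t n, size_t i, size_t j) {  /* i <= j */
  return i * n - i * (i - 1) / 2 + (j - i);
}
```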
The freedom to reorder computations involving associative operators has been widely recognized and exploited in designing parallel algorithms and, to a more limited extent, in optimizing compilers. In this paper, we develop a novel framework that utilizes the associativity and commutativity of operations in regular loop computations to enhance register reuse. Stencils represent a particular class of important computations where this optimization can be applied to improve performance. We show how stencil computations can be implemented to better exploit register reuse and reduce loads/stores....
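A small sketch of the idea on a 1-D sum-stencil (mine, not the paper's code): reassociating the three-term sum lets consecutive iterations share a partial sum held in a register.

```c
/* Direct form: three loads and two adds per iteration. */
void sum3(int N, const double *a, double *b) {
  for (int i = 1; i < N - 1; i++)
    b[i] = a[i-1] + a[i] + a[i+1];
}

/* Reassociated: the pair sum a[i] + a[i+1] computed in one iteration is
 * exactly a[i-1] + a[i] in the next, so it is carried in a register,
 * saving one load per iteration. */
void sum3_reassoc(int N, const double *a, double *b) {
  double s = a[0] + a[1];
  for (int i = 1; i < N - 1; i++) {
    double t = a[i] + a[i+1];
    b[i] = s + a[i+1];      /* (a[i-1] + a[i]) + a[i+1], reassociated */
    s = t;
  }
}
```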
To optimize code effectively, compilers must deal with memory dependencies. However, the state-of-the-art heuristics available in the literature to track memory dependencies are inherently imprecise and computationally expensive. Consequently, the most advanced transformations that compilers have today are ineffective when applied to real-world programs. The goal of this paper is to solve this conundrum through the dynamic disambiguation of pointers. We provide different ways to determine at runtime whether two memory locations can overlap. We then...
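The core trick fits in a few lines (a minimal sketch, not the paper's mechanism): guard an alias-free version of the loop with a cheap runtime range test, falling back to conservative code otherwise.

```c
#include <stdint.h>

static void add_noalias(double *restrict a, const double *restrict b,
                        long n) {
  for (long i = 0; i < n; i++) a[i] += b[i];   /* freely vectorizable */
}

void add(double *a, const double *b, long n) {
  uintptr_t ai = (uintptr_t)a, bi = (uintptr_t)b, sz = n * sizeof(double);
  if (ai + sz <= bi || bi + sz <= ai)
    add_noalias(a, b, n);                      /* proven disjoint at runtime */
  else
    for (long i = 0; i < n; i++) a[i] += b[i]; /* conservative path */
}
```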
A number of legacy codes make use of linearized array references (i.e., references to one-dimensional arrays) to encode accesses to multi-dimensional arrays. The same is true of a number of optimized libraries and of the well-known LLVM intermediate representation, which linearizes accesses. In many cases, the only information available is an array base pointer and a single-dimensional offset. For problems with parametric extents, this offset is usually a multivariate polynomial. Compiler analyses such as data dependence analysis are impeded because...
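A small example of the problem and of the optimistic fix (illustrative):

```c
/* Linearized: the subscript i*M + j is a multivariate polynomial in the
 * parameter M, which defeats standard affine dependence tests. */
void kernel(long N, long M, double *A) {
  for (long i = 0; i < N; i++)
    for (long j = 0; j < M; j++)
      A[i * M + j] += 1.0;
}

/* Optimistic delinearization: guess that the access means A[i][j] on an
 * N x M array.  The guess is valid exactly when 0 <= j < M -- here
 * guaranteed by the loop bounds -- making the subscripts affine again. */
```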