- Parallel Computing and Optimization Techniques
- Embedded Systems Design Techniques
- Interconnection Networks and Systems
- Cloud Computing and Resource Management
- Advanced Data Storage Technologies
- VLSI and Analog Circuit Testing
- Semiconductor materials and devices
- Distributed and Parallel Computing Systems
- Distributed systems and fault tolerance
- Low-power high-performance VLSI design
- Advancements in Semiconductor Devices and Circuit Design
- Advanced Malware Detection Techniques
- Photonic and Optical Devices
- Security and Verification in Computing
- Graph Theory and Algorithms
- Cryptography and Data Security
- VLSI and FPGA Design Techniques
- Software Engineering Research
- Algorithms and Data Compression
- Software System Performance and Reliability
- Formal Methods in Verification
- Advanced Electron Microscopy Techniques and Applications
- Software Testing and Debugging Techniques
- Logic, Reasoning, and Knowledge
- Logic, programming, and type systems
University of California, Santa Barbara
2021-2025
Princeton University
2014-2020
Princeton Public Schools
2017
Serverless computing is a rapidly growing cloud application model, popularized by Amazon's Lambda platform. Serverless cloud services provide fine-grained provisioning of resources, which scale automatically with user demand. Function-as-a-Service (FaaS) applications follow this serverless model, with the developer providing their application as a set of functions which are executed in response to a user- or system-generated event. Functions are designed to be short-lived and execute inside containers or virtual machines, introducing a range of system-level...
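As a rough sketch of the FaaS model this abstract describes, here is a minimal Python handler following Lambda's Python handler convention; the event shape and the driver call at the bottom are invented for illustration, standing in for the platform invoking the function on an event:

```python
import json

def handler(event, context):
    # Business logic only: resource provisioning and scaling are the
    # platform's job, not the developer's.
    name = event.get("name", "world")
    return {"statusCode": 200,
            "body": json.dumps({"greeting": f"hello {name}"})}

# Stand-in for one platform-triggered, short-lived invocation.
print(handler({"name": "faas"}, context=None))
```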
Industry is building larger, more complex, manycore processors on the back of strong institutional knowledge, but academic projects face difficulties in replicating that scale. To alleviate these difficulties and to develop and share knowledge, the community needs open architecture frameworks for simulation, synthesis, and software exploration which support extensibility, scalability, and configurability, alongside an established base of verification tools and supported software. In this paper we present OpenPiton, an open source framework...
The end of Dennard's scaling and the looming power wall have made power and energy primary design goals for modern processors. Further, new applications such as cloud computing and the Internet of Things (IoT) continue to necessitate increased performance and energy efficiency. Manycore processors show potential in addressing some of these issues. However, there is little detailed power and energy data on manycore processors. In this work, we carefully study the power and energy characteristics of Piton, a 25-core open source academic processor, including voltage versus...
Heterogeneous architectures and heterogeneous-ISA designs are growing areas of computer architecture and system software research. Unfortunately, this line of research is significantly hindered by the lack of experimental systems and modifiable hardware frameworks. This work proposes BYOC, a "Bring Your Own Core" framework that is specifically designed to enable heterogeneous architecture research. BYOC is an open-source hardware framework that provides a scalable cache coherence system and includes out-of-the-box support for four different ISAs (RISC-V 32-bit,...
Modern computing systems employ significant heterogeneity and specialization to meet performance targets at manageable power. However, memory latency bottlenecks remain problematic, particularly for sparse neural network and graph analytic applications where indirect memory accesses (IMAs) challenge the memory hierarchy.
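For readers unfamiliar with the pattern, a tiny illustration of an indirect memory access, where the address of one load depends on the result of another (the arrays here are invented for illustration):

```python
import numpy as np

# IMA pattern: the address of each load on `values` depends on data
# loaded from `idx`, which defeats simple stride prefetchers.
values = np.arange(10.0)            # dense payload array
idx = np.array([7, 2, 9, 0, 4])     # data-dependent indices (e.g. an edge list)
gathered = values[idx]              # values[idx[i]]: two dependent loads per element
print(gathered)                     # [7. 2. 9. 0. 4.]
```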
Philosophically, our approaches to acceleration focus on the extreme. We must optimise accelerators to the maximum, leaving software to fix any hardware-software mismatches. Today's abstractions for programming accelerators leak hardware details, requiring changes to data formats and manual memory and coherence management, among other issues. This harms generality and requires deep hardware knowledge to efficiently program accelerators, a state of affairs which we consider hardware-oriented.
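A hedged sketch of the "hardware-oriented" programming style being critiqued; every accelerator-facing name here is a hypothetical placeholder, not a real driver API:

```python
import numpy as np

def cache_flush(buf):
    # Placeholder: real accelerator drivers often expose an explicit flush
    # so device DMA sees the CPU's latest writes.
    pass

def accel_mmul(a, b):
    # Placeholder: pretend this dispatches to the device.
    return a @ b

def offload_mmul(a, b):
    # 1. Manual data-format change: the device wants contiguous row-major data.
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    # 2. Manual coherence management: flush CPU writes before the device reads.
    cache_flush(a)
    cache_flush(b)
    return accel_mmul(a, b)

print(offload_mmul(np.eye(2), np.ones((2, 2))))
```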
The shared cloud-based computing paradigm has experienced enormous growth. Multitenant clouds are conventionally built atop datacenters that utilize commodity hardware connected hierarchically with standard network protocols. Piton is a 25-core manycore processor that takes a different perspective, rethinking the architecture of the datacenter and specializing for Infrastructure as a Service (IaaS) clouds. The tile-based manycore is designed to be not only a single chip, but a large-scale system. Up to 8,192 chips (204,800 cores) can be...
Computation is increasingly moving to the data center. Thus, the energy used by CPUs in the data center is gaining importance. The centralization of computation in the data center has also led to much commonality between the applications running there. For example, there are many instances of similar or identical versions of the Apache web server in a large data center. Many of these applications, such as bulk image resizing or video transcoding, favor increasing throughput over single stream performance. In this work, we propose Execution...
Embedded FPGAs (eFPGAs) are increasingly being used in SoCs, enabling post-silicon hardware specialization. Existing CPU-eFPGA SoCs have three deficiencies. First, their low core count hinders efficient execution of thread-level-parallel workloads. Second, noncoherent or partially coherent integration inhibits dynamic, random memory sharing. Third, the use of full-custom circuits makes proprietary eFPGAs technology-dependent, inflexible in physical layout, and lacking in architectural customizability.
As Moore's Law is coming to an end, heterogeneous SoCs have become ubiquitous, improving performance and energy efficiency with specialized hardware. However, the addition of hardware accelerators makes data supply more challenging. Feeding the accelerators becomes a bottleneck, especially for data-intensive workloads such as graph analytics, sparse linear algebra, and machine learning applications. DECADES addresses this issue with a combination of accelerators, an embedded FPGA (eFPGA), and its unique "intelligent storage" (IS)...
We introduce the new problem of hardware decompilation. Analogous to software decompilation, hardware decompilation is about analyzing a low-level artifact (in this case a netlist, i.e., a graph of wires and logic gates representing a digital circuit) in order to recover higher-level programming abstractions, and using those abstractions to generate code written in a hardware description language (HDL). The overall problem requires a number of pieces. In this paper we focus on one specific piece of the puzzle: a technique we call hardware loop rerolling, which leverages clone...
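A toy illustration of the rerolling idea; the netlist encoding, the uniform-clone check, and the loop syntax below are all invented for illustration and are not the paper's algorithm:

```python
# A "netlist" of eight structurally identical AND gates, i.e. an
# unrolled loop: (op, input_a, input_b, output) per gate.
netlist = [("and", f"a{i}", f"b{i}", f"y{i}") for i in range(8)]

def reroll(gates):
    """Collapse a run of identical gate clones into one parameterised loop."""
    op = gates[0][0]
    if all(g == (op, f"a{i}", f"b{i}", f"y{i}") for i, g in enumerate(gates)):
        return f"for i in 0..{len(gates) - 1}: y[i] = a[i] {op} b[i]"
    return None  # clones not uniform; leave the netlist unrolled

print(reroll(netlist))  # for i in 0..7: y[i] = a[i] and b[i]
```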
For five years, OpenPiton has provided hardware designs, build and verification scripts, and other infrastructure to enable efficient, detailed research into manycore systems-on-chip. It enables open-source development through its open design and its support of a plethora of simulators and CAD tools. OpenPiton was first designed to perform cutting-edge computer architecture research at Princeton University, and opening it up to the public has led to thousands of downloads and numerous academic publications spanning many subfields within computing. In...
Garbage collection greatly improves programmer productivity and ensures memory safety. Manual memory management on the other hand often delivers better performance but is typically unsafe and can lead to system crashes or security vulnerabilities. We propose integrating safe manual memory management with garbage collection in the .NET runtime to get the best of both worlds. In our design, programmers can choose between allocating objects on the garbage collected heap or the manual heap. All existing applications run unmodified, and without any performance degradation, using the garbage collected heap. Our programming...
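A conceptual sketch of that programming model, in Python rather than C#, and using a simple handle table rather than the runtime's actual safety mechanism; the point is only that manual allocation coexists with ordinary GC objects and that a use-after-free fails safely instead of corrupting memory:

```python
class ManualHeap:
    """Toy manual heap: explicit free, with safe failure on stale access."""
    def __init__(self):
        self._objs = {}
        self._next = 0

    def alloc(self, value):
        self._next += 1
        self._objs[self._next] = value
        return self._next              # opaque handle, not a raw pointer

    def read(self, handle):
        try:
            return self._objs[handle]
        except KeyError:
            raise RuntimeError("safe failure: use-after-free detected")

    def free(self, handle):
        self._objs.pop(handle, None)

heap = ManualHeap()
h = heap.alloc([1, 2, 3])   # manual allocation; GC-heap objects need no change
print(heap.read(h))
heap.free(h)
try:
    heap.read(h)
except RuntimeError as e:
    print(e)                # raises instead of crashing or leaking data
```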
Chips with tens of billions of transistors have become today's norm. These designs are straining our electronic design automation tools throughout the design process, requiring ever more computational resources. In many tools, parallelisation has improved both latency and throughput for the designer's benefit. However, the tools largely remain restricted to a single machine. In the case of RTL simulation, we believe that this leaves much potential performance on the table. We introduce Metro-MPI to improve the simulation of modern 10...
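A hedged sketch of the distribution idea using mpi4py, not Metro-MPI's actual code: each MPI rank simulates one partition of the design and exchanges the traffic crossing partition boundaries once per simulated cycle (the integer "state" update is a stand-in for evaluating one partition's RTL):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

state = 0  # stand-in for one partition's RTL state
for cycle in range(100):
    out_flit = (rank, cycle, state)                 # traffic leaving this partition
    # Exchange boundary traffic with neighbours, one message per cycle.
    in_flit = comm.sendrecv(out_flit, dest=right, source=left)
    state = (state + in_flit[2] + 1) % 1_000_003    # stand-in for a cycle of eval

print(f"rank {rank} final state {state}")
```

Run under an MPI launcher, e.g. `mpirun -n 4 python sim.py`, with each rank mapped to its own cores or node.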
This paper presents CIFER, the world's first open-source, fully cache-coherent, heterogeneous many-core CPU-FPGA SoC. The 12nm, 16mm² chip integrates four 64-bit, OS-capable, RISC-V application cores; three TinyCore clusters that each contain six 32-bit compute cores (18 in total); and an EDA-synthesized, standard-cell-based eFPGA. CIFER enables the decomposition of real-world applications and tailored execution (parallelization or specialization) per decomposed task. Our evaluation shows that: 1)...
Energy efficiency has become an increasingly important concern in computer architecture due to the end of Dennard scaling. Heterogeneity has been explored as a way to achieve better energy efficiency, and chips with heterogeneous microarchitectures have become common in the mobile setting. Recent research using heterogeneous-ISA, heterogeneous-microarchitecture, general-purpose cores has shown further gains. However, there is no open-source hardware implementation of a heterogeneous-ISA processor available for research, and effective research on such processors necessitates...
Effective digital hardware design fundamentally requires decomposing a design into a set of interconnected modules, each a distinct unit of computation and state. However, naively connecting modules leads to real-world pathological cases which are surprisingly far from obvious when looking at the interfaces alone and very difficult to debug after synthesis. We show for the first time that it is possible to soundly abstract even the complex combinational dependencies of arbitrary modules through the assignment of IO ports to one of four new sorts...
To better facilitate application performance programming, we propose a software optimization strategy enabled by a novel low-latency Prediction System Service (PSS). Rather than relying on nuanced domain-specific knowledge or slapdash heuristics, a system service for prediction encourages programmers to spend their time uncovering new performance levers rather than worrying about the details of their control. The core idea is to write optimizations that improve performance in specific cases, under certain tunings, and leave the decision of how and when...
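A minimal sketch of that programming model; the service interface and the trivial threshold predictor below are hypothetical stand-ins for the PSS, which would answer from a trained model:

```python
import random

def predict(density):
    # Stand-in for the low-latency prediction service: the programmer asks
    # which variant to run; the tuning logic lives outside application code.
    return "sparse" if density < 0.1 else "dense"

# Two programmer-supplied optimization variants, each best in some regime.
def dot_dense(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

def dot_sparse(xs, ys):
    return sum(x * y for x, y in zip(xs, ys) if x != 0.0)

xs = [random.choice([0.0, 1.0]) for _ in range(1000)]
ys = [1.0] * 1000
density = sum(1 for x in xs if x != 0.0) / len(xs)
variant = dot_sparse if predict(density) == "sparse" else dot_dense
print(variant(xs, ys))
```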
EDA toolchains are notoriously unpredictable, incomplete, and error-prone; the generally-accepted remedy has been to re-imagine EDA tasks as compilation problems. However, any compiler framework we apply must be prepared to handle a wide range of tasks, including not only "forward" tasks like technology mapping and optimization (the "there" in our title), but also "backward" tasks like decompilation and loop rerolling (the "back again"). In this paper, we advocate for equality saturation -- a term rewriting framework -- as the framework of choice when building hardware toolchains....
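To make the term-rewriting idea concrete, here is a deliberately naive saturation loop over a set of terms: rules are applied non-destructively until no new equal terms appear, then the cheapest representative is extracted. Real equality saturation uses e-graphs to share subterms compactly, which this sketch omits, and the two rules and the cost table are invented for illustration:

```python
# Terms are nested tuples, e.g. ("mul", ("var", "x"), ("const", 2)).

def rewrites(t):
    """Yield terms equal to t under two sample rules, at any position."""
    if isinstance(t, tuple) and t[0] == "mul":
        _, a, b = t
        yield ("mul", b, a)                     # commutativity of mul
        if b == ("const", 2):
            yield ("shl", a, ("const", 1))      # strength reduction: x*2 = x<<1
    if isinstance(t, tuple):                    # recurse into children
        for i in range(1, len(t)):
            for sub in rewrites(t[i]):
                yield t[:i] + (sub,) + t[i + 1:]

def saturate(t, max_iters=10):
    """Grow the set of equal terms until fixpoint (or an iteration cap)."""
    seen, frontier = {t}, {t}
    for _ in range(max_iters):
        new = {r for s in frontier for r in rewrites(s)} - seen
        if not new:
            break
        seen |= new
        frontier = new
    return seen

COST = {"mul": 4, "shl": 1, "const": 0, "var": 0}
def cost(t):
    return COST[t[0]] + sum(cost(c) for c in t[1:] if isinstance(c, tuple))

# Extraction: pick the cheapest term from the saturated equivalence set.
best = min(saturate(("mul", ("var", "x"), ("const", 2))), key=cost)
print(best)  # ("shl", ("var", "x"), ("const", 1))
```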
State-of-the-art domain specific architectures (DSAs) work with sparse data, and need hardware support for index data-structures [31, 43, 57, 61]. Indexes are more space-efficient than sparse data and reduce DRAM bandwidth, if data reuse can be managed. However, indexes exhibit dynamic accesses, chase pointers, and need to walk-and-search. This inflates the working set and thrashes the cache. We observe that the cache organization itself is responsible for this behavior.
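As an illustration of the index-driven access behaviour described, a small CSR sparse matrix-vector multiply (the matrix here is invented): the index arrays `indptr` and `indices` are walked sequentially, but they steer data-dependent loads into `x`, which is what inflates the working set:

```python
import numpy as np

# 3x3 CSR matrix: indptr gives row start offsets, indices gives column ids.
indptr  = np.array([0, 2, 3, 5])         # the index structure being walked
indices = np.array([0, 2, 1, 0, 2])      # column ids per stored nonzero
data    = np.array([1., 2., 3., 4., 5.]) # nonzero values
x       = np.array([1., 1., 1.])

y = np.zeros(3)
for row in range(3):
    for k in range(indptr[row], indptr[row + 1]):
        y[row] += data[k] * x[indices[k]]   # indirect, data-dependent load
print(y)  # [3. 3. 9.]
```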
RTL simulation has become a crucial bottleneck in the design of emerging SoCs for AI. To clear this bottleneck, design teams are leaning ever more heavily on emulation and other alternative tools. We find that designers can instead exploit the natural boundaries in these SoCs in order to parallelise their RTL simulations using HPC techniques. By distributing Verilog simulation across tens of nodes (and thousands of physical cores), we simulate a 10B+ transistor, 1024-core SoC with over 2.7 MIPS of aggregate throughput across the simulated cores....