- Parallel Computing and Optimization Techniques
- Embedded Systems Design Techniques
- Low-power high-performance VLSI design
- Interconnection Networks and Systems
- Radiation Effects in Electronics
- Advanced Data Storage Technologies
- Distributed and Parallel Computing Systems
- Semiconductor materials and devices
- Advanced Memory and Neural Computing
- Quantum-Dot Cellular Automata
- CCD and CMOS Imaging Sensors
- Formal Methods in Verification
University of California, Berkeley
2016-2019
University of California System
2016
Massachusetts Institute of Technology
2010
This paper introduces the Graphite open-source distributed parallel multicore simulator infrastructure. Graphite is designed from the ground up for exploration of future multi-core processors containing dozens, hundreds, or even thousands of cores. It provides high performance for fast design space exploration and software development. Several techniques are used to achieve this, including direct execution, seamless multi-machine distribution, and lax synchronization. Graphite is capable of accelerating simulations by distributing them...
This paper presents a sample-based energy simulation methodology that enables fast and accurate estimations of performance and average power for arbitrary RTL designs. Our approach uses an FPGA to simultaneously simulate the design and collect samples containing exact state snapshots. Each snapshot is then replayed in gate-level simulation, resulting in a workload-specific average power estimate with confidence intervals. For arbitrary workloads, our methodology guarantees a minimum four-orders-of-magnitude speedup over commercial CAD tools...
The Berkeley resilient out-of-order machine (BROOM) is a resilient, wide-voltage-range implementation of an open-source out-of-order (OoO) RISC-V processor implemented in an ASIC flow. A 28-nm test chip contains a BOOM OoO core and a 1-MiB level-2 (L2) cache, enhanced with architectural error tolerance for low-voltage operation. It was implemented using an agile design methodology, where the initial OoO architecture was transformed to perform well in a high-performance, low-leakage CMOS process, informed by synthesis, place, and route data...
We present DESSERT, an FPGA-accelerated methodology for simulation-based RTL verification. The RTL design is automatically transformed and instrumented to allow deterministic simulation on the FPGA with initialization and state snapshot capture. Assert statements, which are used for error checking in software simulation, are synthesized for quick hardware-based checking. Print statements are also transformed to generate logs from the FPGA, which are compared on the fly against a functional golden-model simulator for more exhaustive checking. To rapidly provide...
This report makes the case that a well-designed Reduced Instruction Set Computer (RISC) can match, and even exceed, the performance and code density of existing commercial Complex Instruction Set Computers (CISC) while maintaining the simplicity and cost-effectiveness that underpins the original RISC goals. We begin by comparing the dynamic instruction counts and dynamic instruction bytes fetched for the popular proprietary ARMv7, ARMv8, IA-32, and x86-64 Instruction Set Architectures (ISAs) against the free and open RISC-V RV64G and RV64GC ISAs when running the SPEC CINT2006 benchmark suite. RISC-V was...
An open-source out-of-order superscalar processor implements the 64-bit RISC-V instruction set architecture (ISA) and achieves 3.77 CoreMark/MHz. The 2.7 mm × 1.8 mm chip includes one core operating at 1.0 GHz at a nominal 0.9 V with 1 MB of level-2 (L2) cache in a 28 nm HPM process. A line recycling (LR) technique reuses faulty cache lines that fail at low voltages to correct errors with only 0.77% L2 area overhead. LR reduces the minimum voltage to 0.47 V, improving energy efficiency by 43% with negligible impact on CPI.
Architecture-level assist techniques enable low-voltage operation by tolerating errors in SRAM-based caches. A line recycling (LR) technique is proposed to reuse faulty cache lines that fail at low voltages to correct errors with only 0.77% level-2 (L2) area overhead. LR can either save 33% of the capacity loss from line disable or allow further reduction of the minimum operating voltage (Vmin). Bit bypass implemented in SRAM extends the tag array to log error entries, providing multibit-error protection for metadata...