- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Interconnection Networks and Systems
- Advanced Memory and Neural Computing
- Low-Power High-Performance VLSI Design
- Embedded Systems Design Techniques
- Photonic and Optical Devices
- Radiation Effects in Electronics
- Optical Network Technologies
- Ferroelectric and Negative Capacitance Devices
- Distributed and Parallel Computing Systems
- Distributed Systems and Fault Tolerance
- Semiconductor Materials and Devices
- Cloud Computing and Resource Management
- Advanced Neural Network Applications
- Semiconductor Lasers and Optical Devices
- Advanced Optical Network Technologies
- Advanced Photonic Communication Systems
- VLSI and FPGA Design Techniques
- Advancements in Semiconductor Devices and Circuit Design
- Phase-Change Materials and Chalcogenides
- Photonic Crystals and Applications
- Interactive and Immersive Displays
- Virtual Reality Applications and Impacts
- VLSI and Analog Circuit Testing
Google (United States)
2013-2024
Hewlett-Packard (United States)
2005-2014
IEEE Computer Society
2013
Seoul National University
2013
Intel (United States)
2011-2012
Intel (United Kingdom)
2009
Laboratory for Research on Enterprise and Decisions
2007
HRL Laboratories (United States)
2006
FX Palo Alto Laboratory
2005
Digital Wave (United States)
1989-2003
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to...
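The 92 TOPS figure follows directly from the MAC count and clock rate; a quick sanity check, assuming the TPU's published 700 MHz clock (a figure from the paper itself, not this page):

```python
# Peak-throughput check for the first-generation TPU: each of the
# 65,536 8-bit MACs performs one multiply and one accumulate per cycle.
macs = 65536          # 256 x 256 systolic array of 8-bit MACs
ops_per_mac = 2       # one multiply + one add per cycle
clock_hz = 700e6      # 700 MHz (assumed from the published design)
peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"{peak_tops:.1f} TOPS")  # → 91.8 TOPS, i.e. the ~92 TOPS quoted
```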
This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order cores, networks-on-chip, shared caches, memory controllers, and multiple-domain clocking. At the circuit and technology levels, it supports critical-path timing modeling,...
Norman P. Jouppi (Digital Equipment Corporation Western Research Lab, Palo Alto, CA). "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers." ISCA '98: 25 Years of the International Symposia on Computer Architecture (Selected Papers), August 1998, pages 388–397. https://doi.org/10.1145/285930.285998
Various new nonvolatile memory (NVM) technologies have emerged recently. Among all the investigated NVM candidate technologies, spin-torque-transfer RAM (STT-RAM, or MRAM), phase-change random-access memory (PCRAM), and resistive RAM (ReRAM) are regarded as the most promising candidates. As the ultimate goal of this research is to deploy them into multiple levels of the memory hierarchy, it is necessary to explore a wide design space to find the proper implementation at different levels of the hierarchy, from highly latency-optimized caches to density-...
This paper describes an analytical model for the access and cycle times of on-chip direct-mapped and set-associative caches. The inputs to the model are cache size, block size, and associativity, as well as array organization and process parameters. The model gives estimates that are within 6% of Hspice results for the circuits we have chosen. It extends previous models and fixes many of their major shortcomings. New features include models for the tag array, comparator, and multiplexor drivers, nonstep stage input slopes, rectangular stacking of memory subarrays, a...
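Before any timing is estimated, a model like this derives the array geometry from its inputs (cache size, block size, associativity). A minimal sketch of that first step, with an illustrative helper and parameter choices that are not the model's actual code:

```python
# Derive basic cache-array geometry from the model's inputs.
# Assumes sizes are powers of two and a 32-bit physical address.
def cache_geometry(cache_bytes, block_bytes, assoc, addr_bits=32):
    sets = cache_bytes // (block_bytes * assoc)
    index_bits = sets.bit_length() - 1        # log2(sets)
    offset_bits = block_bytes.bit_length() - 1  # log2(block size)
    tag_bits = addr_bits - index_bits - offset_bits
    return {"sets": sets, "index_bits": index_bits,
            "offset_bits": offset_bits, "tag_bits": tag_bits}

# Example: a 16 KiB direct-mapped cache with 32-byte blocks gives
# 512 sets, a 9-bit index, 5-bit block offset, and 18-bit tags.
print(cache_geometry(16 * 1024, 32, 1))
```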
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8um, 0.35um, and 0.18um. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that wakeup and selection logic, as well as operand bypass logic, are likely to be the most critical in the future. A microarchitecture that simplifies...
This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation. Our design incorporates heterogeneous cores representing different points in the power/performance design space; during an application's execution, system software dynamically chooses the most appropriate core to meet specific performance requirements. Our evaluation of this architecture shows significant energy benefits. For an objective function that optimizes for energy efficiency with a tight...
We expect that many-core microprocessors will push performance per chip from the 10 gigaflop to the teraflop range in the coming decade. To support this increased performance, memory and inter-core bandwidths must also scale by orders of magnitude. Pin limitations, the energy cost of electrical signaling, and the non-scalability of chip-length global wires are significant bandwidth impediments. Recent developments in silicon nanophotonic technology have the potential to meet these off-stack and on-stack bandwidth requirements at acceptable power...
A single-ISA heterogeneous multi-core architecture is a chip multiprocessor composed of cores of varying size, performance, and complexity. This paper demonstrates that this architecture can provide significantly higher performance in the same area than a conventional chip multiprocessor. It does so by matching the various jobs of a diverse workload to the various cores. This type of architecture covers a spectrum of workloads particularly well, providing high single-thread performance when thread parallelism is low, and...
Projections of computer technology forecast processors with a peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches.
Heterogeneous (or asymmetric) chip multiprocessors present unique opportunities for improving system throughput, reducing processor power, and mitigating Amdahl's law. On-chip heterogeneity allows the processor to better match execution resources to each application's needs and to address a much wider spectrum of system loads - from low to high thread parallelism - with high efficiency.
Persistent memory is an emerging technology which allows in-memory persistent data objects to be updated at much higher throughput than when using disks as persistent storage. Previous persistent memory designs use logging or copy-on-write mechanisms to update persistent data, which unfortunately reduces the system performance to roughly half that of a native system with no persistence support. One of the great challenges in this application class is therefore how to efficiently enable atomic, consistent, and durable updates that ensure data survives application and/or system failures. Our...
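A minimal sketch of the logging approach the abstract refers to, here as undo logging over an in-memory dict (the class, its methods, and the rollback policy are illustrative assumptions; real persistent-memory designs additionally order writes with cache-line flushes and fences):

```python
# Undo-logging sketch: before each in-place update, record the old
# value; on failure, replay the log in reverse to roll back, which is
# what makes the update atomic. A plain list stands in for the
# durable log region in persistent memory.
class UndoLogStore:
    def __init__(self):
        self.data = {}     # the persistent in-memory object
        self.log = []      # undo log of (key, old_value) pairs

    def begin(self):
        self.log.clear()

    def put(self, key, value):
        self.log.append((key, self.data.get(key)))  # log old value first
        self.data[key] = value                      # then update in place

    def commit(self):
        self.log.clear()   # updates are now durable; drop undo records

    def abort(self):
        for key, old in reversed(self.log):  # roll back newest-first
            if old is None:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self.log.clear()

store = UndoLogStore()
store.begin()
store.put("balance", 100)
store.commit()

store.begin()
store.put("balance", 42)      # update happens in place...
store.abort()                 # ...but a simulated failure rolls it back
print(store.data["balance"])  # → 100
```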
The first-generation tensor processing unit (TPU) runs deep neural network (DNN) inference 15-30 times faster with 30-80 times better energy efficiency than contemporary CPUs and GPUs in similar semiconductor technologies. This domain-specific architecture (DSA) is a custom chip that has been deployed in Google datacenters since 2015, where it serves billions of people.
Google has deployed several TPU generations since 2015, teaching us lessons that changed our views: semiconductor technology advances unequally; compiler compatibility trumps binary compatibility, especially for VLIW domain-specific architectures (DSAs); target total cost of ownership vs initial cost; support multi-tenancy; deep neural networks (DNNs) grow 1.5X annually; DNN advances evolve workloads; some inference tasks require floating point; inference DSAs need air-cooling; apps limit latency, not batch size;...
This article introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a complete chip multiprocessor, including in-order and out-of-order cores, networks-on-chip, shared caches, and system components such as memory controllers and Ethernet controllers. At the circuit level, it supports detailed critical-path...
Google's TPU supercomputers train deep neural networks 50x faster than general-purpose supercomputers running a high-performance computing benchmark.
In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain-specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than Infiniband, OCSes...
Hardware techniques for improving the performance of caches are presented. Miss caching places a small, fully associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a 1-cycle penalty. Small miss caches of 2 to 5 entries are shown to be very effective at removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed...
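The victim-cache mechanism described above can be sketched in a toy simulator: a direct-mapped cache thrashes when two hot addresses map to the same set, and a few fully associative victim entries turn those repeated conflict misses into cheap victim hits. The class, sizes, and access trace below are illustrative assumptions, not the paper's experimental setup:

```python
# Toy direct-mapped cache backed by a small fully-associative
# victim cache holding recently evicted lines (FIFO replacement).
from collections import deque

class DirectMappedWithVictim:
    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.lines = [None] * num_sets              # one tag per set
        self.victim = deque(maxlen=victim_entries)  # evicted tags
        self.hits = self.victim_hits = self.misses = 0

    def access(self, addr):
        idx, tag = addr % self.num_sets, addr
        if self.lines[idx] == tag:
            self.hits += 1
        elif tag in self.victim:
            # Victim hit: swap the victim line with the conflicting line.
            self.victim_hits += 1
            self.victim.remove(tag)
            if self.lines[idx] is not None:
                self.victim.append(self.lines[idx])
            self.lines[idx] = tag
        else:
            self.misses += 1
            if self.lines[idx] is not None:
                self.victim.append(self.lines[idx])  # evictee becomes a victim
            self.lines[idx] = tag

# Addresses 0 and 8 both map to set 0 of an 8-set cache: a plain
# direct-mapped cache would miss on every access after the first,
# but the victim cache absorbs all the conflict misses.
cache = DirectMappedWithVictim(num_sets=8, victim_entries=4)
for _ in range(10):
    cache.access(0)
    cache.access(8)
print(cache.misses, cache.victim_hits)  # → 2 18
```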
Projections of computer technology forecast processors with a peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one-cycle penalty, as opposed to the many-cycle penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be...
High performance general-purpose processors are increasingly being used for a variety of application domains - scientific, engineering, databases, and more recently, media processing. It is therefore important to ensure that architectural features which use a significant fraction of the on-chip transistors are applicable across these different domains. For example, current processor designs often devote the largest fraction of on-chip transistors (up to 80%) to caches. Many workloads, however, do not make effective use of large caches; e.g., media processing...