- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Interconnection Networks and Systems
- Advanced Memory and Neural Computing
- Low-Power High-Performance VLSI Design
- Embedded Systems Design Techniques
- Photonic and Optical Devices
- Radiation Effects in Electronics
- Optical Network Technologies
- Ferroelectric and Negative Capacitance Devices
- Distributed and Parallel Computing Systems
- Distributed Systems and Fault Tolerance
- Semiconductor Materials and Devices
- Cloud Computing and Resource Management
- Advanced Neural Network Applications
- Semiconductor Lasers and Optical Devices
- Advanced Optical Network Technologies
- Advanced Photonic Communication Systems
- VLSI and FPGA Design Techniques
- Advancements in Semiconductor Devices and Circuit Design
- Phase-Change Materials and Chalcogenides
- Photonic Crystals and Applications
- Interactive and Immersive Displays
- Virtual Reality Applications and Impacts
- VLSI and Analog Circuit Testing
Google (United States)
2013-2024
Hewlett-Packard (United States)
2005-2014
IEEE Computer Society
2013
Seoul National University
2013
Intel (United States)
2011-2012
Intel (United Kingdom)
2009
Laboratory for Research on Enterprise and Decisions
2007
HRL Laboratories (United States)
2006
FX Palo Alto Laboratory
2005
Digital Wave (United States)
1989-2003
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to...
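The 92 TOPS figure follows directly from the MAC count and clock rate; a quick sanity check, assuming the TPU's published 700 MHz clock (a figure from the paper itself, not this page):

```python
# Peak-throughput check for the first-generation TPU: each of the
# 65,536 8-bit MACs performs one multiply and one accumulate per cycle.
macs = 65536          # 256 x 256 systolic array of 8-bit MACs
ops_per_mac = 2       # one multiply + one add per cycle
clock_hz = 700e6      # 700 MHz (assumed from the published design)
peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"{peak_tops:.1f} TOPS")  # → 91.8 TOPS, i.e. the ~92 TOPS quoted
```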
This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order cores, networks-on-chip, shared caches, memory controllers, and multiple-domain clocking. At the circuit and technology levels, it supports critical-path timing modeling,...
Norman P. Jouppi (Digital Equipment Corporation Western Research Lab, Palo Alto, CA). "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers." ISCA '98: 25 Years of the International Symposia on Computer Architecture (Selected Papers), August 1998, pages 388–397. https://doi.org/10.1145/285930.285998
Various new nonvolatile memory (NVM) technologies have emerged recently. Among all the investigated NVM candidate technologies, spin-torque-transfer RAM (STT-RAM, or MRAM), phase-change random-access memory (PCRAM), and resistive RAM (ReRAM) are regarded as the most promising candidates. As the ultimate goal of this research is to deploy them into multiple levels of the memory hierarchy, it is necessary to explore a wide design space to find the proper implementation at different levels of the hierarchy, from highly latency-optimized caches to density-...
This paper describes an analytical model for the access and cycle times of on-chip direct-mapped and set-associative caches. The inputs to the model are cache size, block size, and associativity, as well as array organization and process parameters. The model gives estimates that are within 6% of Hspice results for the circuits we have chosen. It extends previous models and fixes many of their major shortcomings. New features include models for the tag array, comparator, and multiplexor drivers, nonstep stage input slopes, rectangular stacking of memory subarrays, a...
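Before any timing is estimated, a model like this derives the array geometry from its inputs (cache size, block size, associativity). A minimal sketch of that first step, with an illustrative helper and parameter choices that are not the model's actual code:

```python
# Derive basic cache-array geometry from the model's inputs.
# Assumes sizes are powers of two and a 32-bit physical address.
def cache_geometry(cache_bytes, block_bytes, assoc, addr_bits=32):
    sets = cache_bytes // (block_bytes * assoc)
    index_bits = sets.bit_length() - 1        # log2(sets)
    offset_bits = block_bytes.bit_length() - 1  # log2(block size)
    tag_bits = addr_bits - index_bits - offset_bits
    return {"sets": sets, "index_bits": index_bits,
            "offset_bits": offset_bits, "tag_bits": tag_bits}

# Example: a 16 KiB direct-mapped cache with 32-byte blocks gives
# 512 sets, a 9-bit index, 5-bit block offset, and 18-bit tags.
print(cache_geometry(16 * 1024, 32, 1))
```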
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8um, 0.35um, and 0.18um. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that wakeup and selection logic, as well as operand bypass logic, are likely to be the most critical in the future. A microarchitecture that simplifies...
This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation. Our design incorporates heterogeneous cores representing different points in the power/performance design space; during an application's execution, system software dynamically chooses the most appropriate core to meet specific performance requirements. Our evaluation of this architecture shows significant energy benefits. For an objective function that optimizes for energy efficiency with a tight...
We expect that many-core microprocessors will push performance per chip from the 10 gigaflop to the teraflop range in the coming decade. To support this increased performance, memory and inter-core bandwidths must also scale by orders of magnitude. Pin limitations, the energy cost of electrical signaling, and the non-scalability of chip-length global wires are significant bandwidth impediments. Recent developments in silicon nanophotonic technology have the potential to meet these off-stack and on-stack bandwidth requirements at acceptable power...
A single-ISA heterogeneous multi-core architecture is a chip multiprocessor composed of cores of varying size, performance, and complexity. This paper demonstrates that this architecture can provide significantly higher performance in the same area than a conventional chip multiprocessor. It does so by matching the various jobs of a diverse workload to the various cores. This type of architecture covers a spectrum of workloads particularly well, providing high single-thread performance when thread parallelism is low, and...
Projections of computer technology forecast processors with a peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches.
Heterogeneous (or asymmetric) chip multiprocessors present unique opportunities for improving system throughput, reducing processor power, and mitigating Amdahl's law. On-chip heterogeneity allows the processor to better match execution resources to each application's needs and to address a much wider spectrum of system loads - from low to high thread parallelism - with high efficiency.
Persistent memory is an emerging technology which allows in-memory persistent data objects to be updated at much higher throughput than when using disks as persistent storage. Previous persistent memory designs use logging or copy-on-write mechanisms to update persistent data, which unfortunately reduces the system performance to roughly half that of a native system with no persistence support. One of the great challenges in this application class is therefore how to efficiently enable atomic, consistent, and durable updates that ensure data survives application and/or system failures. Our...
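A minimal sketch of the logging approach the abstract refers to, here as undo logging over an in-memory dict (the class, its methods, and the rollback policy are illustrative assumptions; real persistent-memory designs additionally order writes with cache-line flushes and fences):

```python
# Undo-logging sketch: before each in-place update, record the old
# value; on failure, replay the log in reverse to roll back, which is
# what makes the update atomic. A plain list stands in for the
# durable log region in persistent memory.
class UndoLogStore:
    def __init__(self):
        self.data = {}     # the persistent in-memory object
        self.log = []      # undo log of (key, old_value) pairs

    def begin(self):
        self.log.clear()

    def put(self, key, value):
        self.log.append((key, self.data.get(key)))  # log old value first
        self.data[key] = value                      # then update in place

    def commit(self):
        self.log.clear()   # updates are now durable; drop undo records

    def abort(self):
        for key, old in reversed(self.log):  # roll back newest-first
            if old is None:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self.log.clear()

store = UndoLogStore()
store.begin()
store.put("balance", 100)
store.commit()

store.begin()
store.put("balance", 42)      # update happens in place...
store.abort()                 # ...but a simulated failure rolls it back
print(store.data["balance"])  # → 100
```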
The first-generation tensor processing unit (TPU) runs deep neural network (DNN) inference 15-30 times faster with 30-80 times better energy efficiency than contemporary CPUs and GPUs in similar semiconductor technologies. This domain-specific architecture (DSA) is a custom chip that has been deployed in Google datacenters since 2015, where it serves billions of people.
Google has deployed several TPU generations since 2015, teaching us lessons that changed our views: semiconductor technology advances unequally; compiler compatibility trumps binary compatibility, especially for VLIW domain-specific architectures (DSAs); target total cost of ownership vs initial cost; support multi-tenancy; deep neural networks (DNNs) grow 1.5X annually; DNN advances evolve workloads; some inference tasks require floating point; inference DSAs need air-cooling; apps limit latency, not batch size;...
This article introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a complete chip multiprocessor, including in-order and out-of-order cores, networks-on-chip, shared caches, and system components such as memory controllers and Ethernet controllers. At the circuit level, it supports detailed critical-path...
Google's TPU supercomputers train deep neural networks 50x faster than general-purpose supercomputers running a high-performance computing benchmark.
In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain-specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than Infiniband, OCSes...
Hardware techniques for improving the performance of caches are presented. Miss caching places a small, fully associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a 1-cycle penalty. Small miss caches of 2 to 5 entries are shown to be very effective at removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed...
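The victim-cache mechanism described above can be sketched in a toy simulator: a direct-mapped cache thrashes when two hot addresses map to the same set, and a few fully associative victim entries turn those repeated conflict misses into cheap victim hits. The class, sizes, and access trace below are illustrative assumptions, not the paper's experimental setup:

```python
# Toy direct-mapped cache backed by a small fully-associative
# victim cache holding recently evicted lines (FIFO replacement).
from collections import deque

class DirectMappedWithVictim:
    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.lines = [None] * num_sets              # one tag per set
        self.victim = deque(maxlen=victim_entries)  # evicted tags
        self.hits = self.victim_hits = self.misses = 0

    def access(self, addr):
        idx, tag = addr % self.num_sets, addr
        if self.lines[idx] == tag:
            self.hits += 1
        elif tag in self.victim:
            # Victim hit: swap the victim line with the conflicting line.
            self.victim_hits += 1
            self.victim.remove(tag)
            if self.lines[idx] is not None:
                self.victim.append(self.lines[idx])
            self.lines[idx] = tag
        else:
            self.misses += 1
            if self.lines[idx] is not None:
                self.victim.append(self.lines[idx])  # evictee becomes a victim
            self.lines[idx] = tag

# Addresses 0 and 8 both map to set 0 of an 8-set cache: a plain
# direct-mapped cache would miss on every access after the first,
# but the victim cache absorbs all the conflict misses.
cache = DirectMappedWithVictim(num_sets=8, victim_entries=4)
for _ in range(10):
    cache.access(0)
    cache.access(8)
print(cache.misses, cache.victim_hits)  # → 2 18
```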
Projections of computer technology forecast processors with a peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one-cycle penalty, as opposed to the many-cycle penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be...
High performance general-purpose processors are increasingly being used for a variety of application domains - scientific, engineering, databases, and more recently, media processing. It is therefore important to ensure that architectural features which use a significant fraction of the on-chip transistors are applicable across these different domains. For example, current processor designs often devote the largest fraction of on-chip transistors (up to 80%) to caches. Many workloads, however, do not make effective use of large caches; e.g., media processing...