NFDI4DS | UHH-SEMS - Publication Details

A reconfigurable fabric for accelerating large-scale datacenter services

OPENALEX - Publications

Andrew Putnam Adrian M. Caulfield Eric S. Chung Derek Chiou Kypros Constantinides and 18 more

Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed built a composable, reconfigurablefabric accelerate portions large-scale software services. Each instantiation the fabric consists 6x8 2-D torus high-end Stratix V FPGAs embedded into half-rack 48 machines. One FPGA placed...

10.1145/2678373.2665678 article EN ACM SIGARCH Computer Architecture News 2014-06-14

A reconfigurable fabric for accelerating large-scale datacenter services

OPENALEX - Publications

Andrew Putnam Adrian M. Caulfield Eric S. Chung Derek Chiou Kypros Constantinides and 18 more

Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed built a composable, reconfigurable fabric accelerate portions large-scale software services. Each instantiation the consists 6×8 2-D torus high-end Stratix V FPGAs embedded into half-rack 48 machines. One FPGA placed each...

10.1109/isca.2014.6853195 article EN 2014-06-01

A cloud-scale acceleration architecture

OPENALEX - Publications

Adrian M. Caulfield Eric S. Chung Andrew Putnam Hari Angepat Jeremy Fowers and 13 more

Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware (efficiency) with economic benefits of homogeneity (manageability). In this paper we propose a new cloud architecture that uses reconfigurable logic accelerate both network plane functions and applications. This Configurable Cloud places layer (FPGAs) between switches servers, enabling flows be programmably transformed at line rate, acceleration local applications running on server, FPGAs...

10.1109/micro.2016.7783710 article EN 2016-10-01

A cloud-scale acceleration architecture

OPENALEX - Publications

Adrian M. Caulfield Eric S. Chung Andrew Putnam Hari Angepat Jeremy Fowers and 13 more

Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware (efficiency) with economic benefits of homogeneity (manageability). In this paper we propose a new cloud architecture that uses reconfigurable logic accelerate both network plane functions and applications. This Configurable Cloud places layer (FPGAs) between switches servers, enabling flows be programmably transformed at line rate, acceleration local applications running on server, FPGAs...

10.5555/3195638.3195647 article EN International Symposium on Microarchitecture 2016-10-15

Serving DNNs in Real Time at Datacenter Scale with Project Brainwave

OPENALEX - Publications

Eric S. Chung Jeremy Fowers Kalin Ovtcharov Michael Papamichael Adrian M. Caulfield and 38 more

To meet the computational demands required of deep learning, cloud operators are turning toward specialized hardware for improved efficiency and performance. Project Brainwave, Microsofts principal infrastructure AI serving in real time, accelerates neural network (DNN) inferencing major services such as Bings intelligent search features Azure. Exploiting distributed model parallelism pinning over low-latency microservices, Brainwave serves state-of-the-art, pre-trained DNN models with high...

10.1109/mm.2018.022071131 article EN IEEE Micro 2018-03-01

GPGPU performance and power estimation using machine learning

OPENALEX - Publications

Gene Wu Joseph L. Greathouse Alexander Lyashevsky Nuwan Jayasena Derek Chiou

Graphics Processing Units (GPUs) have numerous configuration and design options, including core frequency, number of parallel compute units (CUs), available memory bandwidth. At many stages the process, it is important to estimate how application performance power are impacted by these options. This paper describes a GPU estimation model that uses machine learning techniques on measurements from real hardware. The trained collection applications run at different hardware configurations. From...

10.1109/hpca.2015.7056063 article EN 2015-02-01

RAMP: Research Accelerator for Multiple Processors

OPENALEX - Publications

John Wawrzynek David A. Patterson Mark Oskin Shih‐Lien Lu Christoforos Kozyrakis and 3 more

The RAMP project's goal is to enable the intensive, multidisciplinary innovation that computing industry will need tackle problems of parallel processing. itself an open-source, community-developed, FPGA-based emulator architectures. its design framework lets a large, collaborative community develop and contribute reusable, composable modules. three complete designs - for transactional memory, distributed systems, distributed-shared memory demonstrate platform's potential.

10.1109/mm.2007.39 article EN IEEE Micro 2007-03-01

A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

OPENALEX - Publications

Andrew Putnam Adrian M. Caulfield Eric S. Chung Derek Chiou Kypros Constantinides and 18 more

To advance datacenter capabilities beyond what commodity server designs can provide, the authors designed and built a composable, reconfigurable fabric to accelerate large-scale software services. Each instantiation of consists 6 x 8 2D torus high-end field-programmable gate arrays (FPGAs) embedded into half-rack 48 servers. The deployed in bed 1,632 servers FPGAs production successfully used it ranking portion Bing Web search engine by nearly factor two.

10.1109/mm.2015.42 article EN IEEE Micro 2015-05-01

FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators

OPENALEX - Publications

Derek Chiou Dam Sunwoo Joonsoo Kim Nikhil A. Patil William Reinhart and 3 more

This paper describes FAST, a novel simulation methodol- ogy that can produce simulators (i) are orders of mag- nitude faster than comparable simulators, (ii) cycle- accurate, (iii) model the entire system running unmodified applications and operating systems, (iv) provide visibility with minimal performance impact (v) capable current instruction sets such as x86. It achieves its capabilities by partitioning into speculative functional component simulates set architecture timing com- ponent...

10.5555/1331699.1331723 article EN International Symposium on Microarchitecture 2007-12-01

FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators

OPENALEX - Publications

Derek Chiou Dam Sunwoo Joon-Soo Kim Nikhil A. Patil William Reinhart and 3 more

This paper describes FAST, a novel simulation methodology that can produce simulators (i) are orders of magnitude faster than comparable simulators, (ii) cycle- accurate, (Hi) model the entire system running unmodified applications and operating systems, (iv) provide visibility with minimal performance impact (v) capable current instruction sets such as x86. It achieves its capabilities by partitioning into speculative functional component simulates set architecture timing predicts...

10.1109/micro.2007.36 article EN 2007-01-01

An FPGA-based In-Line Accelerator for Memcached

OPENALEX - Publications

Maysam Lavasani Hari Angepat Derek Chiou

We present a method for accelerating server applications using hybrid CPU+FPGA architecture and demonstrate its advantages by Memcached, distributed key-value system. The accelerator, implemented on the FPGA fabric, processes request packets directly from network, avoiding CPU in most cases. accelerator is created profiling application to determine commonly executed trace of basic blocks which are then extracted. Traces speculatively within FPGA. If control flow exits prematurely, side...

10.1109/l-ca.2013.17 article EN IEEE Computer Architecture Letters 2013-07-16

Application-specific memory management for embedded systems using software-controlled caches

OPENALEX - Publications

Derek Chiou Prabhat Jain Larry Rudolph Srinivas Devadas

We propose a way to improve the performance of embedded processors running data-intensive applications by allowing software allocate on-chip memory on an application-specific basis. On-chip in form cache can be made act like scratch-pad via novel hardware mechanism, which we call column caching. Column caching enables dynamic partitioning software, mapping data regions specified sets “columns” or “ways.” When region is exclusively mapped equivalent sized partition cache, provides same...

10.1145/337292.337523 article EN Proceedings of the 40th conference on Design automation - DAC '03 2000-01-01

Cryptoraptor: High throughput reconfigurable cryptographic processor

OPENALEX - Publications

Gokhan Sayilar Derek Chiou

This paper describes a high performance, low power, and highly flexible cryptographic processor, Cryptoraptor, which is designed to support both today's tomorrow's symmetric-key cryptography algorithms standards. To the best of our knowledge, proposed processor supports widest range compared other solutions in literature only crypto-specific targeting future standards as well. Our 1GHz design achieves peak throughput 128Gbps for AES-128 competitive with ASIC designs has 25X 160X higher...

10.1109/iccad.2014.7001346 article EN 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) 2014-11-01

PrEsto: An FPGA-accelerated Power Estimation Methodology for Complex Systems

OPENALEX - Publications

Dam Sunwoo Gene Wu Nikhil A. Patil Derek Chiou

Reduced or bounded power consumption has become a first-order requirement for modern hardware design. As design progresses and more detailed information becomes available, accurate estimations possible but at the cost of significantly slower simulation speeds. Power that is both sufficiently-accurate fast would have positive impact on architecture In this paper, we propose PrEsto, modeling methodology improves speed accuracy estimation through FPGA-acceleration. PrEsto automatically...

10.1109/fpl.2010.69 article EN 2010-08-01

A reconfigurable fabric for accelerating large-scale datacenter services

OPENALEX - Publications

Andrew Putnam Adrian M. Caulfield Eric S. Chung Derek Chiou Kypros Constantinides and 18 more

Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we designed built a composable, reconfigurable hardware fabric based on field programmable gate arrays (FPGA). Each in the contains one FPGA, FPGAs within 48-server rack are interconnected over low-latency, high-bandwidth network. We...

10.1145/2996868 article EN Communications of the ACM 2016-10-28

Cryptoraptor: high throughput reconfigurable cryptographic processor

OPENALEX - Publications

Gokhan Sayilar Derek Chiou

This paper describes a high performance, low power, and highly flexible cryptographic processor, Cryptoraptor, which is designed to support both today's tomorrow's symmetric-key cryptography algorithms standards. To the best of our knowledge, proposed processor supports widest range compared other solutions in literature only crypto-specific targeting future standards as well. Our 1GHz design achieves peak throughput 128Gbps for AES-128 competitive with ASIC designs has 25X 160X higher...

10.5555/2691365.2691398 article EN 2014-11-03

FPGAs versus GPUs in Data centers

OPENALEX - Publications

Babak Falsafi Bill Dally Desh Deepak Singh Derek Chiou Joshua J. Yi and 1 more

This article presents position statements and a question-and-answer session by panelists at the Fourth Workshop on Computer Architecture Research Directions. The subject of debate was use field-programmable gate arrays versus GPUs in datacenters.

10.1109/mm.2017.19 article EN IEEE Micro 2017-01-01

Minnow

OPENALEX - Publications

Dan Zhang Xiaoyu Ma Michael Thomson Derek Chiou

The importance of irregular applications such as graph analytics is rapidly growing with the rise Big Data. However, parallel workloads tend to perform poorly on general-purpose chip multiprocessors (CMPs) due poor cache locality, low compute intensity, frequent synchronization, uneven task sizes, and dynamic generation. At high thread counts, execution time dominated by worklist synchronization overhead misses. Researchers have proposed hardware accelerators address scheduling costs, but...

10.1145/3173162.3173197 article EN 2018-03-19

Research accelerator for multiple processors

OPENALEX - Publications

David A. Patterson Arvind Arvind Krste Asanović Derek Chiou James C. Hoe and 5 more

This article consists of a collection slides from the author's conference presentation on RAMP, or research acclerators for multiple processors. Some specific topics discussed include: system specifications and architecture; uniprocessor performance capabilities; RAMP hardware description language features; applications development; storage future areas technological development.

10.1109/hotchips.2006.7477751 article EN 2006-08-01

FPGA-Accelerated Transactional Execution of Graph Workloads

OPENALEX - Publications

Xiaoyu Ma Dan Zhang Derek Chiou

Many applications that operate on large graphs can be intuitively parallelized by executing a number of the graph operations concurrently and as transactions to deal with potential conflicts. However, numbers occurring might incur too many conflicts would negate benefits parallelization which has probably made highly multi-threaded transactional machines seem impractical. Given size topology modern graphs, however, such provide real performance, energy efficiency, programability benefits....

10.1145/3020078.3021743 article EN 2017-02-02

Performance Studies of Id on the Monsoon Dataflow System

OPENALEX - Publications

James E. Hicks Derek Chiou Boon Seong Ang Arvi

10.1006/jpdc.1993.1065 article EN Journal of Parallel and Distributed Computing 1993-07-01

The FAST methodology for high-speed SoC/computer simulation

OPENALEX - Publications

Derek Chiou Dam Sunwoo Joonsoo Kim Nikita Patil William Reinhart and 2 more

This paper describes the FAST methodology that enables a single FPGA to accelerate performance of cycle-accurate computer system simulators modeling modem, realistic SoCs, embedded systems and standard desktop/laptop/server systems. The partitions simulator into (i) functional model simulates functionality (ii) predictive predicts other metrics. partitioning is crafted map most parallel work onto hardware-based model, eliminating much complexity difficulty simulating constructs on sequential...

10.5555/1326073.1326133 article EN International Conference on Computer Aided Design 2007-11-05

Configurable Clouds

OPENALEX - Publications

Adrian M. Caulfield Eric S. Chung Andrew Putnam Hari Angepat Daniel Firestone and 14 more

Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware with economic benefits of homogeneity. The Configurable Cloud architecture introduces a layer reconfigurable logic (FPGAs) between network switches and servers. This enables line-rate transformation packets, acceleration local applications running on server, direct communication among FPGAs, at scale. low latency, ubiquitous deployment services spanning any number FPGAs be used shared quickly...

10.1109/mm.2017.51 article EN IEEE Micro 2017-01-01