Derek Chiou

ORCID: 0009-0008-6762-4527
Research Areas
  • Parallel Computing and Optimization Techniques
  • Embedded Systems Design Techniques
  • Interconnection Networks and Systems
  • Distributed and Parallel Computing Systems
  • Cloud Computing and Resource Management
  • Advanced Data Storage Technologies
  • Real-time simulation and control systems
  • Distributed systems and fault tolerance
  • Software-Defined Networks and 5G
  • Low-power high-performance VLSI design
  • Simulation Techniques and Applications
  • Real-Time Systems Scheduling
  • VLSI and Analog Circuit Testing
  • Network Packet Processing and Optimization
  • Software System Performance and Reliability
  • VLSI and FPGA Design Techniques
  • Advanced Memory and Neural Computing
  • Model-Driven Software Engineering Techniques
  • Graph Theory and Algorithms
  • IoT and Edge/Fog Computing
  • Ferroelectric and Negative Capacitance Devices
  • Caching and Content Delivery
  • Radiation Effects in Electronics
  • Cryptographic Implementations and Security
  • Coding theory and cryptography

The University of Texas at Austin
2014-2024

Microsoft (United States)
2014-2024

Microsoft Research (United Kingdom)
2014-2021

Microsoft (Finland)
2016-2018

The University of Texas at San Antonio
2016

Pennsylvania State University
2012

Ghent University Hospital
2012

Institut national de recherche en informatique et en automatique
2012

Institut de Recherche en Informatique et Systèmes Aléatoires
2012

Texas Oncology
2006

Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed...

10.1145/2678373.2665678 article EN ACM SIGARCH Computer Architecture News 2014-06-14
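The 6x8 2-D torus above gives each FPGA four directly wired neighbors, with links wrapping around at the edges of the rack. A minimal sketch of that neighbor addressing; the coordinate scheme and function name are illustrative, not from the paper:

```python
# Toy sketch (not the paper's code): neighbor addressing in a 6x8 2-D torus.
ROWS, COLS = 6, 8  # 48 FPGAs, one per server in the half-rack

def torus_neighbors(row, col, rows=ROWS, cols=COLS):
    """Return the four neighbors of (row, col), wrapping at the edges."""
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west":  (row, (col - 1) % cols),
        "east":  (row, (col + 1) % cols),
    }
```

The wraparound links are what make this a torus rather than a mesh: even corner node (0, 0) has four single-hop neighbors.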

10.1109/isca.2014.6853195 article EN 2014-06-01

Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware (efficiency) with the economic benefits of homogeneity (manageability). In this paper we propose a new cloud architecture that uses reconfigurable logic to accelerate both network plane functions and applications. This Configurable Cloud architecture places a layer of reconfigurable logic (FPGAs) between the network switches and the servers, enabling network flows to be programmably transformed at line rate, acceleration of local applications running on the server, and the FPGAs...

10.1109/micro.2016.7783710 article EN 2016-10-01

10.5555/3195638.3195647 article EN International Symposium on Microarchitecture 2016-10-15

To meet the computational demands required of deep learning, cloud operators are turning toward specialized hardware for improved efficiency and performance. Project Brainwave, Microsoft's principal infrastructure for AI serving in real time, accelerates deep neural network (DNN) inferencing in major services such as Bing's intelligent search features and Azure. Exploiting distributed model parallelism and pinning over low-latency hardware microservices, Brainwave serves state-of-the-art, pre-trained DNN models with high...

10.1109/mm.2018.022071131 article EN IEEE Micro 2018-03-01

Graphics Processing Units (GPUs) have numerous configuration and design options, including core frequency, number of parallel compute units (CUs), and available memory bandwidth. At many stages of the design process, it is important to estimate how application performance and power are impacted by these options. This paper describes a GPU performance and power estimation model that uses machine learning techniques on measurements from real hardware. The model is trained on a collection of applications run at different hardware configurations. From...

10.1109/hpca.2015.7056063 article EN 2015-02-01
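The abstract above describes fitting a model to measurements taken across hardware configurations. As a hedged sketch of the general idea only, using synthetic numbers and a simple log-linear fit rather than the paper's actual machine learning techniques:

```python
import numpy as np

# Synthetic measurements: one row per hardware configuration.
# Columns: core clock (MHz), compute units, memory bandwidth (GB/s).
# All values below are made up for illustration.
X = np.array([
    [300.0,  4.0,  96.0],
    [500.0,  8.0, 160.0],
    [700.0, 16.0, 224.0],
    [900.0, 32.0, 320.0],
])
runtime_ms = np.array([40.0, 22.0, 12.0, 7.0])  # "measured" runtimes

# Fit in log space so the model captures multiplicative scaling
# (doubling CUs roughly divides runtime by a learned factor).
A = np.column_stack([np.ones(len(X)), np.log(X)])
coef, *_ = np.linalg.lstsq(A, np.log(runtime_ms), rcond=None)

def predict_runtime_ms(core_mhz, num_cus, mem_gbps):
    """Predict runtime for an unmeasured hardware configuration."""
    feats = np.array([1.0, np.log(core_mhz), np.log(num_cus), np.log(mem_gbps)])
    return float(np.exp(feats @ coef))
```

The design choice worth noting is the one the abstract implies: the model is trained once on real-hardware measurements and then used to estimate points in the configuration space without running them.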

The RAMP project's goal is to enable the intensive, multidisciplinary innovation that the computing industry will need to tackle the problems of parallel processing. RAMP itself is an open-source, community-developed, FPGA-based emulator of parallel architectures. Its design framework lets a large, collaborative community develop and contribute reusable, composable design modules. Three complete designs - for transactional memory, distributed systems, and distributed-shared memory - demonstrate the platform's potential.

10.1109/mm.2007.39 article EN IEEE Micro 2007-03-01

To advance datacenter capabilities beyond what commodity server designs can provide, the authors designed and built a composable, reconfigurable fabric to accelerate large-scale software services. Each instantiation of the fabric consists of a 6 x 8 2-D torus of high-end field-programmable gate arrays (FPGAs) embedded into a half-rack of 48 servers. The authors deployed the fabric in a bed of 1,632 servers and FPGAs in a production environment and successfully used it to accelerate the ranking portion of the Bing Web search engine by nearly a factor of two.

10.1109/mm.2015.42 article EN IEEE Micro 2015-05-01

This paper describes FAST, a novel simulation methodology that can produce simulators that (i) are orders of magnitude faster than comparable simulators, (ii) are cycle-accurate, (iii) model the entire system running unmodified applications and operating systems, (iv) provide visibility with minimal performance impact, and (v) are capable of running current instruction sets such as x86. It achieves its capabilities by partitioning the simulator into a speculative functional component that simulates the instruction set architecture and a timing component...

10.5555/1331699.1331723 article EN International Symposium on Microarchitecture 2007-12-01
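The functional/timing split described above can be illustrated with a toy sketch. The instruction set, trace format, and latency numbers below are invented, and FAST's speculation-and-rollback machinery between the two components is omitted entirely:

```python
from collections import namedtuple

# Toy two-part simulator: a functional component computes architectural
# results and streams a trace; a timing component charges cycles per entry.
TraceEntry = namedtuple("TraceEntry", ["op", "is_mem"])

def functional_model(program, regs):
    """Execute a tiny register program, returning final state and a trace."""
    trace = []
    for op, dst, src_a, src_b in program:
        if op == "add":
            regs[dst] = regs[src_a] + regs[src_b]
            trace.append(TraceEntry("add", is_mem=False))
        elif op == "load":
            regs[dst] = regs.get(src_a, 0)  # toy stand-in for memory
            trace.append(TraceEntry("load", is_mem=True))
    return regs, trace

def timing_model(trace, mem_latency=3):
    """Charge 1 cycle per ALU op and mem_latency cycles per memory op."""
    return sum(mem_latency if e.is_mem else 1 for e in trace)
```

The point of the split, as the abstract states, is that functional correctness and timing prediction are separable concerns: here the timing model can be swapped (e.g. a different `mem_latency`) without touching functional execution.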

10.1109/micro.2007.36 article EN 2007-01-01

We present a method for accelerating server applications using a hybrid CPU+FPGA architecture and demonstrate its advantages by accelerating Memcached, a distributed key-value system. The accelerator, implemented on the FPGA fabric, processes request packets directly from the network, avoiding the CPU in most cases. The accelerator is created by profiling the application to determine the commonly executed trace of basic blocks, which are then extracted. Traces execute speculatively within the FPGA. If control flow exits a trace prematurely, side...

10.1109/l-ca.2013.17 article EN IEEE Computer Architecture Letters 2013-07-16
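The profiling step the abstract mentions, selecting a commonly executed trace of basic blocks, might be sketched as follows. The fixed trace length, block names, and simple frequency heuristic are illustrative assumptions, not the paper's algorithm:

```python
from collections import Counter

def hottest_trace(block_stream, length=3):
    """Return the most frequently executed run of `length` basic blocks."""
    windows = (
        tuple(block_stream[i:i + length])
        for i in range(len(block_stream) - length + 1)
    )
    trace, _count = Counter(windows).most_common(1)[0]
    return trace

# Toy profile of a request-handling loop in which B1->B2->B3 dominates.
profile = ["B1", "B2", "B3"] * 5 + ["B0", "B1", "B2", "B3"]
```

For this toy profile, `hottest_trace(profile)` returns `("B1", "B2", "B3")`: the candidate to extract and execute speculatively on the FPGA, falling back to the CPU when control flow leaves the trace.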

We propose a way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis. On-chip memory in the form of a cache can be made to act like scratch-pad memory via a novel hardware mechanism, which we call column caching. Column caching enables dynamic cache partitioning by software, mapping data regions to specified sets of cache “columns” or “ways.” When a memory region is exclusively mapped to an equivalently sized partition of the cache, column caching provides the same...

10.1145/337292.337523 article EN Proceedings of the 37th conference on Design automation - DAC '00 2000-01-01
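The mapping of data regions to cache "columns" can be illustrated with a toy software model. The set-indexing scheme and replacement policy below are simplifications of mine, not the hardware mechanism from the paper:

```python
# Toy model of software-controlled way partitioning ("column caching").
# Indexing by addr % num_sets and the replacement policy are simplified.
class ColumnCache:
    def __init__(self, num_sets=4, num_ways=4):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.lines = [[None] * num_ways for _ in range(num_sets)]
        self.regions = []  # ((lo, hi) address range, allowed way indices)

    def map_region(self, lo, hi, allowed_ways):
        """Restrict addresses in [lo, hi) to the given cache columns."""
        self.regions.append(((lo, hi), list(allowed_ways)))

    def _allowed_ways(self, addr):
        for (lo, hi), ways in self.regions:
            if lo <= addr < hi:
                return ways
        return list(range(self.num_ways))  # unmapped data may use any way

    def access(self, addr):
        """Return True on hit; on a miss, fill one of the allowed ways."""
        s = addr % self.num_sets
        allowed = self._allowed_ways(addr)
        if any(self.lines[s][w] == addr for w in allowed):
            return True
        # Prefer an empty allowed way, else evict the first allowed way.
        victim = next((w for w in allowed if self.lines[s][w] is None),
                      allowed[0])
        self.lines[s][victim] = addr
        return False
```

Because a mapped region is confined to its columns, fills from other data cannot evict it, which gives the scratch-pad-like isolation the abstract describes.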

This paper describes a high performance, low power, and highly flexible cryptographic processor, Cryptoraptor, which is designed to support both today's and tomorrow's symmetric-key cryptography algorithms and standards. To the best of our knowledge, the proposed processor supports the widest range of cryptographic algorithms compared to other solutions in the literature and is the only crypto-specific processor targeting future standards as well. Our 1GHz design achieves a peak throughput of 128Gbps for AES-128, is competitive with ASIC designs, and has 25X and 160X higher...

10.1109/iccad.2014.7001346 article EN 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) 2014-11-01

Reduced or bounded power consumption has become a first-order requirement for modern hardware design. As a design progresses and more detailed information becomes available, accurate power estimations are possible but at the cost of significantly slower simulation speeds. Power estimation that is both sufficiently accurate and fast would have a positive impact on architecture research. In this paper, we propose PrEsto, a power modeling methodology that improves the speed and accuracy of power estimation through FPGA acceleration. PrEsto automatically...

10.1109/fpl.2010.69 article EN 2010-08-01

Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field-programmable gate arrays (FPGAs). Each server in the fabric contains one FPGA, and the FPGAs within a 48-server rack are interconnected over a low-latency, high-bandwidth network. We...

10.1145/2996868 article EN Communications of the ACM 2016-10-28

10.5555/2691365.2691398 article EN 2014-11-03

This article presents position statements and a question-and-answer session by panelists at the Fourth Workshop on Computer Architecture Research Directions. The subject of the debate was the use of field-programmable gate arrays versus GPUs in datacenters.

10.1109/mm.2017.19 article EN IEEE Micro 2017-01-01

The importance of irregular applications such as graph analytics is rapidly growing with the rise of Big Data. However, these parallel workloads tend to perform poorly on general-purpose chip multiprocessors (CMPs) due to poor cache locality, low compute intensity, frequent synchronization, uneven task sizes, and dynamic task generation. At high thread counts, execution time is dominated by worklist synchronization overhead and cache misses. Researchers have proposed hardware accelerators to address scheduling costs, but...

10.1145/3173162.3173197 article EN 2018-03-19

This article consists of a collection of slides from the author's conference presentation on RAMP, or research accelerators for multiple processors. Specific topics discussed include: system specifications and architecture; uniprocessor performance capabilities; RAMP hardware description language features; applications development; and storage and future areas of technological development.

10.1109/hotchips.2006.7477751 article EN 2006-08-01

Many applications that operate on large graphs can be intuitively parallelized by executing a number of the graph operations concurrently and as transactions to deal with potential conflicts. However, the number of conflicts that occur might negate the benefits of parallelization, which has probably made highly multi-threaded transactional machines seem impractical. Given the size and topology of modern graphs, however, such machines can provide real performance, energy efficiency, and programmability benefits....

10.1145/3020078.3021743 article EN 2017-02-02

10.1006/jpdc.1993.1065 article EN Journal of Parallel and Distributed Computing 1993-07-01

This paper describes the FAST methodology that enables a single FPGA to accelerate the performance of cycle-accurate computer system simulators modeling modern, realistic SoCs, embedded systems, and standard desktop/laptop/server systems. The methodology partitions the simulator into (i) a functional model that simulates functionality and (ii) a predictive model that predicts other metrics. The partitioning is crafted to map most of the parallel work onto the hardware-based functional model, eliminating much of the complexity and difficulty of simulating parallel constructs on sequential...

10.5555/1326073.1326133 article EN International Conference on Computer Aided Design 2007-11-05

Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware with the economic benefits of homogeneity. The Configurable Cloud architecture introduces a layer of reconfigurable logic (FPGAs) between the network switches and servers. This enables line-rate transformation of network packets, acceleration of local applications running on the server, and direct communication among FPGAs at scale. Its low latency and ubiquitous deployment allow services spanning any number of FPGAs to be used and shared quickly...

10.1109/mm.2017.51 article EN IEEE Micro 2017-01-01