- Parallel Computing and Optimization Techniques
- Embedded Systems Design Techniques
- Interconnection Networks and Systems
- Distributed and Parallel Computing Systems
- Cloud Computing and Resource Management
- Advanced Data Storage Technologies
- Real-time simulation and control systems
- Distributed systems and fault tolerance
- Software-Defined Networks and 5G
- Low-power high-performance VLSI design
- Simulation Techniques and Applications
- Real-Time Systems Scheduling
- VLSI and Analog Circuit Testing
- Network Packet Processing and Optimization
- Software System Performance and Reliability
- VLSI and FPGA Design Techniques
- Advanced Memory and Neural Computing
- Model-Driven Software Engineering Techniques
- Graph Theory and Algorithms
- IoT and Edge/Fog Computing
- Ferroelectric and Negative Capacitance Devices
- Caching and Content Delivery
- Radiation Effects in Electronics
- Cryptographic Implementations and Security
- Coding theory and cryptography
- The University of Texas at Austin, 2014-2024
- Microsoft (United States), 2014-2024
- Microsoft Research (United Kingdom), 2014-2021
- Microsoft (Finland), 2016-2018
- The University of Texas at San Antonio, 2016
- Pennsylvania State University, 2012
- Ghent University Hospital, 2012
- Institut national de recherche en informatique et en automatique, 2012
- Institut de Recherche en Informatique et Systèmes Aléatoires, 2012
- Texas Oncology, 2006
Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6×8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed...
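The abstract above describes a 6×8 2-D torus of FPGAs. Below is a minimal sketch of how positions in such a torus map to their four wrap-around neighbors; the coordinate scheme is an assumption for illustration only, not the deployed Catapult wiring or shell routing.

```python
# Hypothetical illustration: neighbor addressing in a 6x8 2-D torus of FPGAs.
ROWS, COLS = 6, 8  # 48 FPGAs, one per server in a half-rack

def torus_neighbors(row: int, col: int) -> dict:
    """Return the four torus neighbors of the FPGA at (row, col),
    wrapping around at the edges so every node has degree four."""
    return {
        "north": ((row - 1) % ROWS, col),
        "south": ((row + 1) % ROWS, col),
        "west":  (row, (col - 1) % COLS),
        "east":  (row, (col + 1) % COLS),
    }

if __name__ == "__main__":
    # A corner node still has four neighbors because the torus wraps.
    print(torus_neighbors(0, 0))
    # {'north': (5, 0), 'south': (1, 0), 'west': (0, 7), 'east': (0, 1)}
```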
Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware (efficiency) with the economic benefits of homogeneity (manageability). In this paper we propose a new cloud architecture that uses reconfigurable logic to accelerate both network plane functions and applications. This Configurable Cloud places a layer of reconfigurable logic (FPGAs) between the network switches and the servers, enabling network flows to be programmably transformed at line rate, enabling acceleration of local applications running on the server, and enabling the FPGAs...
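To make the "layer between switches and servers" concrete, here is a hedged sketch of the bump-in-the-wire decision such a layer could make: recognized flows are transformed in-line, everything else passes through. The field names, port number, and transform are invented for illustration and are not the Configurable Cloud's actual shell interface.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    dst_port: int
    payload: bytes

def transform(payload: bytes) -> bytes:
    # Placeholder for an offloaded function such as encryption or compression.
    return payload[::-1]

def process_inline(pkt: Packet) -> Packet:
    """Transform recognized flows in-line; pass everything else through unchanged."""
    if pkt.dst_port == 4791:  # assumed port for an accelerated flow
        pkt = Packet(pkt.dst_port, transform(pkt.payload))
    return pkt                # forwarded either way; in hardware this happens at line rate

print(process_inline(Packet(4791, b"hello")).payload)  # b'olleh'
print(process_inline(Packet(80, b"hello")).payload)    # b'hello' (pass-through)
```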
To meet the computational demands required of deep learning, cloud operators are turning toward specialized hardware for improved efficiency and performance. Project Brainwave, Microsoft's principal infrastructure for AI serving in real time, accelerates deep neural network (DNN) inferencing in major services such as Bing's intelligent search features and Azure. Exploiting distributed model parallelism and pinning over low-latency hardware microservices, Brainwave serves state-of-the-art, pre-trained DNN models with high...
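A minimal sketch of the idea behind distributed model parallelism with pinned weights: split a model's layers across devices so each device's share fits in on-chip memory and never has to be re-fetched per request. The capacities, layer sizes, and greedy policy below are invented placeholders, not Brainwave's actual partitioner.

```python
def pin_layers(layer_sizes_mb, fpga_capacity_mb):
    """Greedily assign consecutive layers to FPGAs without exceeding on-chip capacity."""
    placement, current, used = [], [], 0.0
    for i, size in enumerate(layer_sizes_mb):
        if used + size > fpga_capacity_mb and current:
            placement.append(current)      # this FPGA is full; start the next one
            current, used = [], 0.0
        current.append(i)
        used += size
    if current:
        placement.append(current)
    return placement  # placement[k] = layer indices pinned on FPGA k

print(pin_layers([8, 8, 12, 6, 10, 4], fpga_capacity_mb=20))
# [[0, 1], [2, 3], [4, 5]]
```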
Graphics Processing Units (GPUs) have numerous configuration and design options, including core frequency, number of parallel compute units (CUs), and available memory bandwidth. At many stages of the design process, it is important to estimate how application performance and power are impacted by these options. This paper describes a GPU performance and power estimation model that uses machine learning techniques on measurements from real hardware. The model is trained on a collection of applications run at different hardware configurations. From...
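Here is a hedged sketch of the general shape of such a model: fit configuration knobs (CU count, core clock, memory bandwidth) against measured runtimes, then predict untested configurations. The paper uses machine-learning techniques on real-hardware measurements; a plain least-squares fit on invented numbers stands in here.

```python
import numpy as np

# Assumed measurements: [compute_units, core_MHz, mem_GBps] -> runtime (ms)
configs = np.array([[8, 800, 120], [16, 800, 120], [16, 1000, 160], [32, 1000, 160]], float)
runtime_ms = np.array([42.0, 24.0, 19.0, 12.0])

# Fit runtime ~ a/CUs + b/MHz + c/GBps + d (a crude inverse-scaling feature set).
features = np.column_stack([1.0 / configs[:, 0], 1.0 / configs[:, 1],
                            1.0 / configs[:, 2], np.ones(len(configs))])
coef, *_ = np.linalg.lstsq(features, runtime_ms, rcond=None)

def predict(cus, mhz, gbps):
    return float(coef @ np.array([1.0 / cus, 1.0 / mhz, 1.0 / gbps, 1.0]))

print(round(predict(24, 900, 140), 1))  # estimated runtime for an untested configuration
```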
The RAMP project's goal is to enable the intensive, multidisciplinary innovation that the computing industry will need to tackle the problems of parallel processing. RAMP itself is an open-source, community-developed, FPGA-based emulator of parallel architectures. Its design framework lets a large, collaborative community develop and contribute reusable, composable design modules. Three complete designs - for transactional memory, distributed systems, and distributed-shared memory - demonstrate the platform's potential.
To advance datacenter capabilities beyond what commodity server designs can provide, the authors designed and built a composable, reconfigurable fabric to accelerate large-scale software services. Each instantiation of the fabric consists of a 6 x 8 2-D torus of high-end field-programmable gate arrays (FPGAs) embedded into a half-rack of 48 servers. The authors deployed the fabric in a bed of 1,632 servers and FPGAs in a production datacenter and successfully used it to accelerate the ranking portion of the Bing Web search engine by nearly a factor of two.
This paper describes FAST, a novel simulation methodology that can produce simulators that (i) are orders of magnitude faster than comparable simulators, (ii) are cycle-accurate, (iii) model the entire system running unmodified applications and operating systems, (iv) provide visibility with minimal performance impact, and (v) are capable of running current instruction sets such as x86. It achieves its capabilities by partitioning the simulator into a speculative functional component that simulates the instruction set architecture and a timing component that predicts...
We present a method for accelerating server applications using a hybrid CPU+FPGA architecture and demonstrate its advantages by accelerating Memcached, a distributed key-value system. The accelerator, implemented on the FPGA fabric, processes request packets directly from the network, avoiding the CPU in most cases. The accelerator is created by profiling the application to determine the commonly executed trace of basic blocks, which are then extracted. Traces are executed speculatively within the FPGA. If control flow exits a trace prematurely, the side...
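An illustrative sketch of the trace-speculation idea: a hot trace handles the common case directly, and anything off-trace falls back to the slower general-purpose path, where side effects actually commit. The request format and handlers below are assumptions, not the paper's implementation.

```python
class TraceExit(Exception):
    """Raised when control flow leaves the extracted hot trace."""

def fpga_trace_get(request, store):
    key = request.get("key")
    if request.get("op") != "get" or key not in store:   # off-trace cases
        raise TraceExit
    return store[key]                                     # common case, CPU never involved

def cpu_slow_path(request, store):
    if request.get("op") == "set":
        store[request["key"]] = request["value"]
        return "STORED"
    return store.get(request.get("key"), "MISS")

def handle(request, store):
    try:
        return fpga_trace_get(request, store)
    except TraceExit:
        return cpu_slow_path(request, store)              # side effects only commit here

store = {"k1": "v1"}
print(handle({"op": "get", "key": "k1"}, store))                  # 'v1' (served by the "FPGA" trace)
print(handle({"op": "set", "key": "k2", "value": "v2"}, store))   # 'STORED' (CPU fallback)
```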
We propose a way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis. On-chip memory in the form of a cache can be made to act like a scratch-pad memory via a novel hardware mechanism, which we call column caching. Column caching enables dynamic cache partitioning in software by mapping data regions to a specified set of cache "columns" or "ways." When a memory region is exclusively mapped to an equivalently sized partition of the cache, column caching provides the same...
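A minimal sketch of the column-caching idea, assuming a small set-associative cache whose ways ("columns") can be restricted per region so one region cannot evict another's data. The sizes, bitmask interface, and naive fill policy are simplifications for illustration.

```python
NUM_SETS, NUM_WAYS, LINE = 64, 4, 64

class ColumnCache:
    def __init__(self):
        # tags[set][way]; region name -> bitmask of ways it may occupy
        self.tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
        self.column_map = {"default": 0b1111}

    def map_region(self, region, way_mask):
        self.column_map[region] = way_mask      # software-controlled partitioning

    def access(self, addr, region="default"):
        s, tag = (addr // LINE) % NUM_SETS, addr // (LINE * NUM_SETS)
        ways = [w for w in range(NUM_WAYS) if self.column_map[region] >> w & 1]
        if any(self.tags[s][w] == tag for w in ways):
            return "hit"
        self.tags[s][ways[0]] = tag             # naive fill into the first allowed way
        return "miss"

c = ColumnCache()
c.map_region("stream", 0b0001)                  # streaming data confined to one way
c.map_region("default", 0b1110)                 # so it cannot evict the other three
print(c.access(0x1000, "stream"), c.access(0x1000, "stream"))  # miss hit
```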
This paper describes a high performance, low power, and highly flexible cryptographic processor, Cryptoraptor, which is designed to support both today's and tomorrow's symmetric-key cryptography algorithms and standards. To the best of our knowledge, the proposed processor supports the widest range of cryptographic algorithms compared to other solutions in the literature and is the only crypto-specific processor targeting future standards as well. Our 1GHz design achieves a peak throughput of 128Gbps for AES-128, is competitive with ASIC designs, and has 25X to 160X higher...
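A back-of-the-envelope check of the quoted peak figure: at 1 GHz, retiring one 128-bit AES block per cycle yields 128 Gbps. The assumption of a fully pipelined datapath producing one block every cycle is the usual basis for such peak numbers, not a detail taken from the paper.

```python
clock_hz = 1_000_000_000      # 1 GHz
block_bits = 128              # AES block size
blocks_per_cycle = 1          # assumed fully pipelined datapath

peak_gbps = clock_hz * block_bits * blocks_per_cycle / 1e9
print(peak_gbps)              # 128.0
```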
Reduced or bounded power consumption has become a first-order requirement for modern hardware design. As the design progresses and more detailed information becomes available, more accurate power estimations are possible, but at the cost of significantly slower simulation speeds. Power estimation that is both sufficiently accurate and fast would have a positive impact on architecture design. In this paper, we propose PrEsto, a power modeling methodology that improves the speed and accuracy of power estimation through FPGA acceleration. PrEsto automatically...
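To show the shape such a fast power model can take: power estimated each interval as a weighted sum of activity counts, a form simple enough to evaluate alongside a performance model in hardware. The events and coefficients below are invented placeholders; PrEsto derives its models automatically, and the training step is omitted here.

```python
# Watts contributed per event (hypothetical; in practice derived from model training)
COEFF = {"alu_op": 0.0008, "cache_access": 0.0021, "dram_access": 0.0125, "static": 1.9}

def estimate_power_watts(counts_per_interval: dict) -> float:
    """Static power plus a weighted sum of per-interval event counts."""
    dynamic = sum(COEFF[e] * n for e, n in counts_per_interval.items())
    return COEFF["static"] + dynamic

print(round(estimate_power_watts({"alu_op": 1200, "cache_access": 300, "dram_access": 12}), 3))
# ~3.64 W for this made-up interval
```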
Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field programmable gate arrays (FPGAs). Each server in the fabric contains one FPGA, and FPGAs within a 48-server rack are interconnected over a low-latency, high-bandwidth network. We...
This article presents position statements and a question-and-answer session by panelists at the Fourth Workshop on Computer Architecture Research Directions. The subject of debate was the use of field-programmable gate arrays versus GPUs in datacenters.
The importance of irregular applications such as graph analytics is rapidly growing with the rise of Big Data. However, these parallel workloads tend to perform poorly on general-purpose chip multiprocessors (CMPs) due to poor cache locality, low compute intensity, frequent synchronization, uneven task sizes, and dynamic task generation. At high thread counts, execution time is dominated by worklist synchronization overhead and cache misses. Researchers have proposed hardware accelerators to address scheduling costs, but...
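A small sketch of the worklist pattern the abstract refers to: threads pull vertices from a shared worklist, and processing a vertex can dynamically push new work. The single lock protecting the queue is exactly where synchronization overhead concentrates at high thread counts. The graph and thread count are illustrative.

```python
from collections import deque
from threading import Lock, Thread

graph = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
dist, worklist, lock = {0: 0}, deque([0]), Lock()

def worker():
    while True:
        with lock:                      # every pop/push contends on this one lock
            if not worklist:
                return
            v = worklist.popleft()
            for nbr in graph[v]:
                if nbr not in dist:     # dynamic task generation
                    dist[nbr] = dist[v] + 1
                    worklist.append(nbr)

threads = [Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(dist)   # BFS levels: {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3}
```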
This article consists of a collection slides from the author's conference presentation on RAMP, or research acclerators for multiple processors. Some specific topics discussed include: system specifications and architecture; uniprocessor performance capabilities; RAMP hardware description language features; applications development; storage future areas technological development.
Many applications that operate on large graphs can be intuitively parallelized by executing a number of the graph operations concurrently, as transactions, to deal with potential conflicts. However, the number of conflicts that occur might be high enough to negate the benefits of parallelization, which has probably made highly multi-threaded transactional machines seem impractical. Given the size and topology of modern graphs, however, such machines can provide real performance, energy efficiency, and programmability benefits....
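As a toy sketch of running graph operations as transactions: each operation records the versions of the items it read and only commits if none of them changed, otherwise it retries. The versioning scheme is an invented stand-in for whatever a transactional machine would do in hardware, and a real implementation would need the validate-and-commit step to be atomic across threads.

```python
values = {"a": 1, "b": 2}
versions = {"a": 0, "b": 0}

def transfer(src, dst, amount):
    """Optimistic transaction: read, compute, validate read set, then commit or retry."""
    while True:
        snap = {v: versions[v] for v in (src, dst)}        # read set with versions
        new_src, new_dst = values[src] - amount, values[dst] + amount
        if all(versions[v] == snap[v] for v in snap):      # validate: no conflicting writer
            values[src], values[dst] = new_src, new_dst    # commit (must be atomic in practice)
            versions[src] += 1
            versions[dst] += 1
            return
        # conflict detected: another transaction touched our read set; retry

transfer("a", "b", 1)
print(values)   # {'a': 0, 'b': 3}
```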
This paper describes the FAST methodology that enables a single FPGA to accelerate the performance of cycle-accurate computer system simulators modeling modern, realistic SoCs, embedded systems, and standard desktop/laptop/server systems. The methodology partitions the simulator into (i) a functional model that simulates functionality and (ii) a predictive model that predicts other metrics. The partitioning is crafted to map most of the parallel work onto the hardware-based functional model, eliminating much of the complexity and difficulty of simulating parallel constructs on sequential...
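A rough sketch of the functional/predictive split described here: a functional model executes instructions and streams a trace to a predictive (timing) model that charges cycles. The toy ISA, latencies, and interface are invented to show the partitioning idea only, not FAST's actual implementation.

```python
def functional_model(program, regs):
    """Execute instructions for correctness and yield a trace of (opcode, operands)."""
    for op, *args in program:
        if op == "addi":
            regs[args[0]] = regs.get(args[1], 0) + args[2]
        elif op == "load":
            regs[args[0]] = 0          # loaded value irrelevant to this sketch
        yield (op, args)

def timing_model(trace, latencies):
    """Consume the trace and predict a cycle count."""
    return sum(latencies[op] for op, _ in trace)

program = [("addi", "r1", "r0", 5), ("load", "r2"), ("addi", "r3", "r1", 1)]
latencies = {"addi": 1, "load": 4}     # assumed fixed latencies
regs = {}
print(timing_model(functional_model(program, regs), latencies), regs)
# 6 {'r1': 5, 'r2': 0, 'r3': 6}
```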
Hyperscale datacenter providers have struggled to balance the growing need for specialized hardware with the economic benefits of homogeneity. The Configurable Cloud architecture introduces a layer of reconfigurable logic (FPGAs) between the network switches and the servers. This enables line-rate transformation of network packets, acceleration of local applications running on the server, and direct communication among FPGAs at datacenter scale. With low latency and ubiquitous deployment, services spanning any number of FPGAs can be used and shared quickly...