- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Cloud Computing and Resource Management
- Interconnection Networks and Systems
- Superconducting Materials and Applications
- Caching and Content Delivery
- Distributed and Parallel Computing Systems
- Advanced Memory and Neural Computing
- Network Traffic and Congestion Control
- Real-Time Systems Scheduling
- Nuclear Physics and Applications
- Green IT and Sustainability
- Domain Adaptation and Few-Shot Learning
- Data Management and Algorithms
- Cloud Computing and Remote Desktop Technologies
- Ferroelectric and Negative Capacitance Devices
- Advanced Image and Video Retrieval Techniques
- Advanced Drug Delivery Systems
- Privacy-Preserving Technologies in Data
- Radiation Effects in Electronics
- Stochastic Gradient Optimization Techniques
- Computer Graphics and Visualization Techniques
- Simulation Techniques and Applications
- Advanced Neural Network Applications
- CCD and CMOS Imaging Sensors
- Cornell University, 2024-2025
- University of Kansas, 2020-2024
- Birzeit University, 2024
- University of Illinois Urbana-Champaign, 2015-2021
- International University of the Caribbean, 2020
- Samsung (South Korea), 2018
- Seoul National University, 2018
- University of Illinois System, 2017
Deep Neural Networks (DNNs) have reinvigorated real-world applications that rely on learning patterns of data and are permeating into different industry markets. Cloud infrastructure and accelerators that offer INFerence-as-a-Service (INFaaS) have become the enablers of this rather quick and invasive shift in industry. To this end, mostly accelerator-based INFaaS (Google's TPU [1], NVIDIA T4 [2], Microsoft Brainwave [3], etc.) has become the backbone of many real-life applications. However, as the demand for such services grows,...
Training real-world Deep Neural Networks (DNNs) can take an eon (i.e., weeks or months) without leveraging distributed systems. Even distributed training takes an inordinate amount of time, of which a large fraction is spent in communicating weights and gradients over the network. State-of-the-art distributed training algorithms use a hierarchy of worker-aggregator nodes. The aggregators repeatedly receive gradient updates from their allocated group of workers and send back the updated weights. This paper sets out to reduce this significant...
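One aggregation round in the worker-aggregator hierarchy described above can be sketched as follows. This is a minimal illustration of gradient averaging plus an SGD step, not the paper's actual code; the function name, learning rate, and list-based tensors are assumptions for clarity.

```python
# Sketch of one round at an aggregator node (hypothetical names):
# receive gradients from a group of workers, average them, apply one
# SGD step, and return the updated weights to broadcast back.

def sgd_round(weights, worker_grads, lr=0.1):
    """Average per-parameter gradients from all workers, then update."""
    n = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / n
           for i in range(len(weights))]
    # w <- w - lr * mean(gradients)
    return [w - lr * g for w, g in zip(weights, avg)]

# Two workers report gradients for a 3-parameter model.
w = sgd_round([1.0, 2.0, 3.0], [[0.2, 0.4, 0.6], [0.0, 0.0, 0.2]])
```

In the hierarchical scheme, each aggregator runs this round for its worker group, and the averaged result flows further up the hierarchy; the communication of `worker_grads` over the network is the cost the paper targets.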
The physical memory capacity of servers is expected to increase drastically with the deployment of forthcoming non-volatile memory technologies. This is a welcome improvement for emerging data-intensive applications. For such servers to be cost-effective, nonetheless, we must cost-effectively increase compute throughput and memory bandwidth commensurate with the increase in memory capacity without compromising application readiness. Tackling this challenge, we present the Memory Channel Network (MCN) architecture in this paper. Specifically, first, we propose an MCN DIMM,...
When analyzing a distributed computer system, we often observe that the complex interplay among processor, node, and network sub-systems can profoundly affect the performance and power efficiency of the system. Therefore, to effectively cross-optimize hardware and software components, we need a full-system simulation infrastructure that can precisely capture this interplay. Responding to the aforementioned need, we present dist-gem5, a flexible, detailed, open-source full-system simulation infrastructure that can model and simulate a distributed computer system using multiple simulation hosts. We then validate dist-gem5...
A modern datacenter server aims to achieve high energy efficiency by co-running multiple applications. Some of such applications (e.g., web search) are latency-sensitive. Therefore, they require low-latency I/O services to respond fast to requests from clients. However, we observe that simply replacing the storage devices of servers with Ultra-Low-Latency (ULL) SSDs does not notably reduce the latency of I/O services, especially when multiple applications are co-running. In this paper, we propose FLASHSHARE to assist ULL SSDs to satisfy different levels of I/O service...
In modern server CPUs, the last-level cache (LLC) is a critical hardware resource that exerts significant influence on the performance of workloads, and how to manage the LLC is key to performance isolation and QoS in the cloud with multi-tenancy. In this paper, we argue that, in addition to CPU cores, high-speed I/O is also important for LLC management. This is because of an Intel architectural innovation, Data Direct I/O (DDIO), which directly injects inbound I/O traffic into (part of) the LLC instead of main memory. We summarize two problems caused by DDIO and show that (1) the default...
Improving the performance and power efficiency of a single processor has been fraught with various challenges stemming from the end of classical technology scaling. Thus, the importance of efficiently running applications on a parallel/distributed computer system has continued to increase. In developing and optimizing such a system, it is critical to study the impact of the complex interplay amongst processor, node, and network architectures in detail. This necessitates a flexible, detailed, and open-source full-system simulation...
Optimizing bandwidth was the main focus of designing scale-out networks for several decades, and this optimization trend has served traditional Internet applications well. However, the emergence of datacenters as single computer entities has made latency as important as bandwidth in datacenter networks. The PCIe interconnect is known to be a bottleneck in communication, and its overhead can contribute up to ~90% of the overall latency. Despite its overheads, PCIe has remained the de facto standard in servers, and it has been established and maintained for more than two decades. In...
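The "~90% of overall latency" figure above implies an Amdahl's-law-style limit worth making explicit: as long as the PCIe share dominates, speeding up everything else barely moves end-to-end latency. The function below is my own back-of-the-envelope illustration, not a model from the paper.

```python
# Amdahl-style latency estimate: the PCIe fraction of end-to-end
# latency is untouched, while the remaining fraction is sped up.
# `pcie_frac` and `other_speedup` are illustrative parameters.

def total_latency(pcie_frac, other_speedup, base=1.0):
    """Normalized end-to-end latency after speeding up the non-PCIe part."""
    return base * (pcie_frac + (1 - pcie_frac) / other_speedup)

# With PCIe at 90% of latency, a 2x speedup of everything else only
# brings normalized latency from 1.0 down to about 0.95.
remaining = total_latency(0.9, 2.0)
```

This is why reducing the PCIe overhead itself, rather than the surrounding fabric, is the lever that matters at these ratios.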
The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 simulator has been under active development over the last nine years since the original release. In this time, there have been 7500 commits to the codebase from 250 unique...
The rate of network packets encapsulating requests from clients can significantly affect the utilization, and thus the performance and sleep states, of processors in servers deploying a power management policy. To improve energy efficiency, servers may adopt an aggressive policy that frequently transitions a processor to a low-performance or sleep state at low utilization. However, such a policy may not respond to a sudden increase in requests early enough, due to the considerable penalty of transitioning the processor to a high-performance state. This in turn entails violations...
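The responsiveness problem described above can be made concrete with a toy model: a core that aggressively sleeps below a utilization threshold pays a fixed wake-up penalty on the first request of a burst. This is my own illustration with invented parameters, not the paper's policy or numbers.

```python
# Toy model of an aggressive sleep policy: below `thresh` utilization
# the core enters a sleep state; the first request after a sleep
# period pays `wake_penalty` on top of the base service latency.

def serve(utilization_trace, base_lat=1.0, wake_penalty=5.0, thresh=0.3):
    asleep, lats = False, []
    for u in utilization_trace:
        if u < thresh:
            asleep = True            # aggressively enter a sleep state
        else:
            # a sudden burst pays the transition penalty before serving
            lats.append(base_lat + (wake_penalty if asleep else 0.0))
            asleep = False
    return lats

# A burst arriving right after an idle period suffers the penalty.
lats = serve([0.1, 0.1, 0.9, 0.9])
```

Even in this crude sketch, the first post-idle request is several times slower than steady-state requests, which is exactly the tail-latency (and hence SLO) risk the abstract points to.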
I/O performance plays a critical role in the overall performance of modern servers. The emergence of ultra high-speed I/O devices makes data movement between processors, main memory, and I/O devices a major bottleneck. Conventionally, main memory is used as an intermediate buffer because the processor cannot directly access device-side caches. The Data Direct I/O (DDIO) technology aims to reduce memory bandwidth utilization by enabling I/O devices to leverage the Last Level Cache (LLC) as a buffer. Our experimental results show that DDIO can completely eliminate this data movement to main memory while running...
There has been significant focus on offloading upper-layer network protocols (ULPs) to accelerators located on CPUs and SmartNICs. However, restricting accelerator placement to these locations limits both the variety of ULPs that can be accelerated and the overall performance. In particular, it overlooks the opportunity to accelerate ULPs running atop a stateful transport protocol in the face of high cache contention. That is, at high rates, frequent DRAM accesses and SmartNIC-CPU synchronizations outweigh the benefits of hardware...
While (I) serverless computing is emerging as a popular form of cloud execution, datacenters are going through major changes: (II) storage disaggregation at the system infrastructure level and (III) integration of domain-specific accelerators at the hardware level. Each of these three trends individually provides significant benefits; however, when combined, the benefits diminish. On the convergence of these trends, this paper makes the observation that for serverless functions, the overhead of accessing disaggregated storage overshadows the gains from...
High-bandwidth network interface cards (NICs), each capable of transferring 100s of Gigabits per second, are making inroads into the servers of next-generation datacenters. Such unprecedented data delivery rates impose immense pressure, especially on the server's memory subsystem, as NICs first transfer data to DRAM before processing. To alleviate this pressure, the cache hierarchy has evolved, supporting a Data Direct I/O (DDIO) technology to directly place the data in the last-level cache (LLC). Subsequently, various policies have been explored...
Processor power management exploiting Dynamic Voltage and Frequency Scaling (DVFS) plays a crucial role in improving a datacenter's energy efficiency. However, we observe that the current power management policies of Linux (i.e., governors) often considerably increase the tail response time (i.e., violate a given Service Level Objective (SLO)) or the power consumption of latency-critical applications. Furthermore, previously proposed SLO-aware policies oversimplify network request processing and ignore the fact that requests arrive at the application layer...
The PCI-Express interconnect is the dominant interconnection technology within a single computer node that is used for connecting off-chip devices such as network interface cards (NICs) and GPUs to the processor chip. Its bandwidth and latency are often a bottleneck in processor, memory, and device interactions and impact the overall performance of the connected devices. Architecture simulators that focus on modeling processors and memory lack a detailed model of I/O interconnections. In this work, we implement a flexible and detailed PCIe model in the widely known architecture...
While (1) serverless computing is emerging as a popular form of cloud execution, datacenters are going through major changes: (2) storage disaggregation at the system infrastructure level and (3) integration of domain-specific accelerators at the hardware level. Each of these three trends individually provides significant benefits; however, when combined, the benefits diminish. Specifically, this paper makes the key observation that for serverless functions, the overhead of accessing disaggregated persistent storage overshadows the gains from...
In this work, we set out to find the answers to the following questions: (1) Where are the bottlenecks in a state-of-the-art architectural simulator? (2) How much faster can simulations run by tuning system configurations? (3) What are the opportunities for accelerating software simulation using hardware accelerators? We choose gem5 as the representative simulator, run several simulations with various configurations, perform a detailed analysis of the source code on different server platforms, tune both system and simulator settings for running simulations,...
The advance of DRAM manufacturing technology is slowing down, whereas the density and performance needs continue to increase. This desire has motivated the industry to explore emerging Non-Volatile Memory (e.g., 3D XPoint) and high-density DRAM (e.g., Managed DRAM Solution). Since such memory technologies increase density at the cost of longer latency, lower bandwidth, or both, it is essential to use them with fast memory (e.g., conventional DRAM) to which hot pages are transferred at runtime. Nonetheless, we observe that such page transfers often block memory channels...
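The hot-page transfer mechanism above can be sketched as a simple threshold policy: count accesses to pages residing in slow memory and promote the ones accessed often enough to fast memory. The function name, counter scheme, and threshold are my own illustrative assumptions, not the paper's design.

```python
# Sketch of threshold-based hot-page promotion from slow memory
# (e.g., NVM or high-density DRAM) to fast conventional DRAM.
from collections import Counter

def promote_hot_pages(accesses, slow_pages, threshold=3):
    """Return the set of slow-memory pages whose access count in the
    observed window reaches `threshold`; these migrate to fast memory."""
    counts = Counter(a for a in accesses if a in slow_pages)
    return {p for p, c in counts.items() if c >= threshold}

# Page 7 is touched four times and gets promoted; page 9 stays cold.
hot = promote_hot_pages([7, 7, 7, 9, 7, 9], slow_pages={7, 9})
```

The cost hiding behind this sketch is the migration traffic itself: each promoted page occupies a memory channel while it is copied, which is the blocking effect the abstract goes on to address.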
Modern commercial-off-the-shelf (COTS) multicore processors have advanced memory hierarchies that enhance memory-level parallelism (MLP), which is crucial for high performance. To support high MLP, shared last-level caches (LLCs) are divided into multiple banks, allowing parallel access. However, uneven distribution of cache requests from the cores, especially when requests from multiple cores are concentrated on a single bank, can result in significant contention affecting all cores that access the cache. Such bank contention can even be maliciously...
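The single-bank concentration described above follows directly from how bank indices are typically derived from addresses. The sketch below assumes a simple modulo mapping over cache-line addresses (64-byte lines, 8 banks); real LLCs use more complex hash functions, so these parameters are purely illustrative.

```python
# Illustrative LLC bank mapping: with 64-byte lines and 8 banks,
# the low-order bits of the line address select the bank.

LINE_BYTES, NUM_BANKS = 64, 8

def llc_bank(addr):
    """Bank index of a physical address under a simple modulo mapping."""
    return (addr // LINE_BYTES) % NUM_BANKS

# A core streaming with a 512-byte stride (= LINE_BYTES * NUM_BANKS)
# lands every access on the same bank, creating worst-case contention.
same_bank = {llc_bank(a) for a in range(0, 4096, LINE_BYTES * NUM_BANKS)}
```

Under this mapping, an access pattern whose stride is a multiple of `LINE_BYTES * NUM_BANKS` serializes on one bank, which is also why such contention can be triggered deliberately, as the abstract notes.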