- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Cloud Computing and Resource Management
- Distributed and Parallel Computing Systems
- Superconducting Materials and Applications
- Distributed Systems and Fault Tolerance
- Algorithms and Data Compression
- IoT and Edge/Fog Computing
- Embedded Systems Design Techniques
- Interconnection Networks and Systems
- Security and Verification in Computing
- Healthcare Technology and Patient Monitoring
- Advanced Malware Detection Techniques
- Computational Geometry and Mesh Generation
- COVID-19 Diagnosis Using AI
- Anomaly Detection Techniques and Applications
- Network Security and Intrusion Detection
- Advanced Memory and Neural Computing
- Ferroelectric and Negative Capacitance Devices
- Low-Power High-Performance VLSI Design
- Cellular Automata and Applications
- Radiation Effects in Electronics
Menlo School
2024
Alpha Omega Alpha Medical Honor Society
2024
University of Wisconsin–Madison
2011-2017
Kitware (United States)
2017
North Carolina State University
2012
Page-based virtual memory improves programmer productivity, security, and memory utilization, but incurs performance overheads due to costly page table walks after TLB misses. This overhead can reach 50% for modern workloads that access increasingly vast memories with stagnating TLB sizes.
A growing body of work has compiled a strong case for the single-ISA heterogeneous multi-core paradigm. It provides multiple, differently-designed superscalar core types that can streamline the execution of diverse programs and program phases. No prior research has addressed the 'Achilles' heel' of this paradigm: design and verification effort is multiplied by the number of different core types.
Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that...
Virtualization provides value for many workloads, but its cost rises for workloads with poor memory access locality. This overhead comes from translation lookaside buffer (TLB) misses where the hardware performs a 2D page walk (up to 24 memory references on x86-64) rather than a native TLB miss (up to only 4 memory references). The first dimension translates guest virtual addresses to guest physical addresses, while the second translates guest physical addresses to host physical addresses. This paper proposes new hardware using direct segments with three new virtualized modes of operation that...
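For concreteness, the arithmetic behind the 24-versus-4 figure quoted above can be reproduced in a few lines. The sketch assumes the standard 4-level x86-64 radix page table in both dimensions, which is the configuration the abstract refers to.

```python
# Worked example: memory references for a TLB miss, native vs. virtualized.
# Assumes 4-level radix page tables in both the guest and host dimensions.

GUEST_LEVELS = 4  # guest page-table levels (guest VA -> guest PA)
HOST_LEVELS = 4   # host page-table levels (guest PA -> host PA)

# Native: one memory reference per level of the page walk.
native_refs = GUEST_LEVELS

# Nested (2D) walk: every guest page-table pointer is a guest-physical
# address, so each of the 4 guest references first needs a full host walk,
# and the final guest PTE's target needs one more host walk:
#   (GUEST_LEVELS + 1) * (HOST_LEVELS + 1) - 1 = 24 on x86-64.
nested_refs = (GUEST_LEVELS + 1) * (HOST_LEVELS + 1) - 1

print(f"native TLB miss: up to {native_refs} memory references")
print(f"nested TLB miss: up to {nested_refs} memory references")
```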
As the latency of the network approaches that of memory, it becomes increasingly attractive for applications to use remote memory: random-access memory at another computer that is accessed using the virtual memory subsystem. This is an old idea whose time has come in the age of fast networks. To work effectively, remote memory must address many technical challenges. In this paper, we enumerate these challenges, discuss their feasibility, explain how some of them are addressed by recent work, and indicate other promising ways to tackle them...
Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources...
Address translation is fundamental to processor performance. Prior work has focused on reducing Translation Lookaside Buffer (TLB) misses to improve performance and energy, whereas we show that even TLB hits consume a significant amount of dynamic energy. To reduce the energy cost of address translation, we first propose Lite, a mechanism that monitors the utility of the L1 TLBs and adaptively changes their sizes with way-disabling. The resulting TLBLite organization opportunistically reduces the energy spent in address translation by 23% on average with minimal...
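The adaptive resizing that Lite performs can be illustrated with a small monitoring loop. The epoch length and miss-ratio thresholds below are assumptions made for illustration, not values from the paper; the point is the shape of the utility-driven resize decision.

```python
# Minimal sketch of utility-monitored way-disabling in the spirit of Lite.
# Epoch length and thresholds are illustrative assumptions.

class AdaptiveTLB:
    EPOCH = 100_000          # accesses per monitoring epoch (assumed)
    LOW, HIGH = 0.001, 0.01  # miss-ratio thresholds (assumed)

    def __init__(self, max_ways=8):
        self.max_ways = max_ways
        self.active_ways = max_ways   # ways currently powered on
        self.accesses = 0
        self.misses = 0

    def access(self, hit: bool):
        self.accesses += 1
        self.misses += 0 if hit else 1
        if self.accesses == self.EPOCH:
            self._resize()

    def _resize(self):
        miss_ratio = self.misses / self.accesses
        if miss_ratio < self.LOW and self.active_ways > 1:
            self.active_ways -= 1    # low utility: disable a way, save energy
        elif miss_ratio > self.HIGH and self.active_ways < self.max_ways:
            self.active_ways += 1    # misses rising: re-enable a way
        self.accesses = self.misses = 0
```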
Our analysis shows that many "big-memory" server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for page-based virtual memory. They consume as much as 10% of execution cycles on TLB misses, even when using large pages. On the other hand, we find that these workloads use read-write permission on most pages, are provisioned not to swap, and rarely benefit from the full flexibility of page-based virtual memory. To remove the TLB miss overhead for big-memory workloads, we propose mapping part of a process's linear address space with a direct segment...
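The direct-segment idea lends itself to a short sketch: a single base/limit/offset check bypasses the TLB and page walk entirely for the mapped region, with everything else falling back to conventional paging. The register values below are hypothetical.

```python
# Minimal sketch of direct-segment translation: one base/limit/offset
# register set maps a contiguous slice of the virtual address space.
# Register values are hypothetical.

DS_BASE, DS_LIMIT, DS_OFFSET = 0x1000_0000, 0x9000_0000, 0x2_0000_0000

def page_walk(va: int) -> int:
    # Stand-in for the conventional multi-level page walk.
    raise NotImplementedError("fall back to the normal paged path")

def translate(va: int) -> int:
    if DS_BASE <= va < DS_LIMIT:
        return va + DS_OFFSET        # direct segment: no TLB, no page walk
    return page_walk(va)             # conventional paged translation
```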
The overheads of memory management units (MMUs) have gained importance in today's systems. Detailed simulators may be too slow to gain insights into micro-architectural techniques that improve MMU efficiency. To address this issue, we propose a novel tool, BadgerTrap, which allows online instrumentation of TLB misses and enables first-order analysis of new hardware ideas. The tool helps create and analyze an x86-64 TLB miss trace. We describe example studies that show the various ways the tool can be applied to gain research insights.
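As an illustration of the kind of first-order analysis such a trace enables, the sketch below counts the pages responsible for the most TLB misses. The one-record-per-line "pid, virtual_address" trace format is a hypothetical stand-in, not BadgerTrap's actual output format.

```python
# Sketch: find the hottest miss-causing pages in a TLB miss trace.
# Trace format ("pid, 0x<virtual address>" per line) is a hypothetical
# stand-in for illustration.

from collections import Counter

PAGE_SHIFT = 12  # 4 KB base pages

def hot_pages(trace_path, top=10):
    misses = Counter()
    with open(trace_path) as f:
        for line in f:
            pid, va = line.split(",")
            misses[(int(pid), int(va, 16) >> PAGE_SHIFT)] += 1
    return misses.most_common(top)   # pages causing the most TLB misses
```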
Virtualization provides benefits for many workloads, but the overheads of virtualizing memory are not universally low. The cost comes from managing two levels of address translation, one in the guest virtual machine (VM) and the other in the host virtual machine monitor (VMM), with either nested or shadow paging. Nested paging directly performs a two-level page walk that makes TLB misses slower than in unvirtualized native execution, but enables fast page table changes. Alternatively, shadow paging restores native TLB miss speeds, but requires costly VMM intervention...
Multi-socket machines with 1-100 TBs of physical memory are becoming prevalent. Applications running on such multi-socket machines suffer non-uniform bandwidth and latency when accessing physical memory. Decades of research have focused on data allocation and placement policies in NUMA settings, but there have been no studies on the question of how to place page-tables amongst sockets. We make the case for explicit page-table allocation policies and show that the placement of page-tables is crucial to overall performance. We propose Mitosis to mitigate the NUMA effects of page-table walks by transparently replicating...
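The replication idea can be sketched in a few lines: mapping updates go to every per-socket replica, while page walks read the replica local to the socket doing the walk. The data structures below are illustrative, not the kernel implementation.

```python
# Sketch of per-socket page-table replication in the spirit of Mitosis.
# Structures are illustrative only.

class ReplicatedPageTable:
    def __init__(self, num_sockets):
        self.replicas = [dict() for _ in range(num_sockets)]  # vpn -> pte

    def map(self, vpn, pte):
        for replica in self.replicas:         # updates hit every replica
            replica[vpn] = pte

    def walk(self, vpn, socket_id):
        return self.replicas[socket_id][vpn]  # walks stay socket-local
```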
Modern workloads suffer high execution-time overhead due to page-based virtual memory. The authors introduce range translations that map arbitrary-sized virtual memory ranges to contiguous physical pages while retaining the flexibility of paging. A range translation reduces address translation to a range lookup, delivering near-zero virtual memory overhead.
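A minimal sketch of the lookup, assuming a sorted list of (virtual base, virtual limit, physical base) triples, each of which can stand in for thousands of page-table entries. The addresses are hypothetical.

```python
# Sketch of a range translation lookup: each entry maps an arbitrary-sized
# virtual range to contiguous physical memory. Addresses are hypothetical.

import bisect

# (virtual_base, virtual_limit, physical_base), sorted by virtual_base
RANGES = [
    (0x1000_0000, 0x1800_0000, 0x7000_0000),
    (0x4000_0000, 0x8000_0000, 0xB000_0000),
]

def range_translate(va: int):
    i = bisect.bisect_right(RANGES, (va, float("inf"), 0)) - 1
    if i >= 0:
        vbase, vlimit, pbase = RANGES[i]
        if va < vlimit:
            return pbase + (va - vbase)  # base + offset, like a segment
    return None  # miss: fall back to the regular paging path
```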
We propose synergistic software and hardware mechanisms that alleviate the address translation overhead, focusing particularly on virtualized execution. On the software side, we propose contiguity-aware (CA) paging, a novel physical memory allocation technique that creates larger-than-a-page contiguous mappings while preserving the flexibility of demand paging. CA paging applies to the hypervisor and guest OS memory managers independently, as well as to native systems. Moreover, it benefits any translation scheme that leverages contiguous mappings. On the hardware side, we propose SpOT, a simple...
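A toy version of the allocation decision behind CA paging: on a page fault, prefer the free frame that extends the faulting region's existing contiguous run. The fallback policy here is deliberately simplistic, and the structures are illustrative rather than the paper's allocator.

```python
# Sketch of a contiguity-aware frame choice on a page fault.
# Structures and fallback policy are illustrative assumptions.

def pick_frame(free_frames: set, vpn: int, mapping: dict):
    # mapping: vpn -> pfn for pages already backed in this region
    prev = mapping.get(vpn - 1)
    if prev is not None and prev + 1 in free_frames:
        return prev + 1              # extends the contiguous run
    return min(free_frames)          # fallback: any free frame (simplistic)
```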
The TLB is increasingly a bottleneck for big data applications. In most designs, the number of TLB entries is highly constrained by latency requirements and grows much more slowly than the working sets of applications. Many solutions to this problem, such as huge pages, perforated pages, or TLB coalescing, rely on physical contiguity for performance gains, yet the cost of defragmenting memory can easily nullify these gains. This paper introduces mosaic pages, which increase TLB reach by compressing multiple, discrete translations into one TLB entry...
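The compression idea can be sketched as one entry covering a small group of consecutive virtual pages, each backed by its own discrete frame. The group size and encoding below are assumptions for illustration, not the paper's actual entry format.

```python
# Sketch: one TLB entry covering several consecutive virtual pages, each
# with its own frame. Group size and encoding are illustrative assumptions.

MOSAIC = 4  # virtual pages per TLB entry (assumed)

class MosaicEntry:
    def __init__(self, base_vpn, frames):
        assert len(frames) == MOSAIC
        self.base_vpn = base_vpn     # first virtual page of the group
        self.frames = frames         # one compactly encoded frame per page

    def lookup(self, vpn):
        if self.base_vpn <= vpn < self.base_vpn + MOSAIC:
            return self.frames[vpn - self.base_vpn]
        return None                  # not covered: regular TLB miss path
```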
Providing multiple superscalar core types on a chip, each tailored to a different class of instruction-level behavior, is an exciting direction for increasing processor performance and energy efficiency. Unfortunately, design and verification effort increases with each additional core type, limiting the microarchitectural diversity that can be practically implemented. FabScalar aims to automate superscalar core design, opening up the paradigm's many opportunities.
Increasing heterogeneity in the memory system mandates careful data placement to hide non-uniform memory access (NUMA) effects on applications. However, NUMA optimizations have predominantly focused on application data over the past decades, largely ignoring the placement of kernel data structures due to their small memory footprint; this is evident in typical OS designs that pin kernel objects in memory. In this paper, we show that kernel data placement is gaining importance in the context of page-tables: sub-optimal placement of page-tables causes severe slowdown (up to 3.1x) on virtualized NUMA servers.
The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures, including x86, Arm, and RISC-V. The simulator has been under active development over the last nine years since the original release. In this time, there have been over 7500 commits to the codebase from over 250 unique...
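For readers new to gem5, a minimal syscall-emulation configuration in the style of the project's learning-gem5 simple.py script looks roughly like the sketch below. Exact class names, port names, and workload setup vary across gem5 versions (this follows the post-v20.1 port naming), so treat it as a sketch rather than a drop-in script.

```python
# Minimal gem5 SE-mode config, modeled on the learning-gem5 simple.py
# example. Details vary across gem5 versions.

import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock="1GHz",
                                   voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("512MB")]

system.cpu = TimingSimpleCPU()
system.membus = SystemXBar()
system.cpu.icache_port = system.membus.cpu_side_ports  # no caches: straight to the bus
system.cpu.dcache_port = system.membus.cpu_side_ports

system.mem_ctrl = MemCtrl()
system.mem_ctrl.dram = DDR3_1600_8x8()
system.mem_ctrl.dram.range = system.mem_ranges[0]
system.mem_ctrl.port = system.membus.mem_side_ports

system.cpu.createInterruptController()
# x86-specific interrupt port hookup:
system.cpu.interrupts[0].pio = system.membus.mem_side_ports
system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports
system.system_port = system.membus.cpu_side_ports

binary = "tests/test-progs/hello/bin/x86/linux/hello"
system.workload = SEWorkload.init_compatible(binary)
process = Process()
process.cmd = [binary]
system.cpu.workload = process
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
exit_event = m5.simulate()
print(f"Exited @ tick {m5.curTick()} because {exit_event.getCause()}")
```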