- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Cloud Computing and Resource Management
- Distributed and Parallel Computing Systems
- Superconducting Materials and Applications
- Distributed Systems and Fault Tolerance
- Algorithms and Data Compression
- IoT and Edge/Fog Computing
- Embedded Systems Design Techniques
- Interconnection Networks and Systems
- Security and Verification in Computing
- Healthcare Technology and Patient Monitoring
- Advanced Malware Detection Techniques
- Computational Geometry and Mesh Generation
- COVID-19 Diagnosis Using AI
- Anomaly Detection Techniques and Applications
- Network Security and Intrusion Detection
- Advanced Memory and Neural Computing
- Ferroelectric and Negative Capacitance Devices
- Low-Power High-Performance VLSI Design
- Cellular Automata and Applications
- Radiation Effects in Electronics
Menlo School
2024
Alpha Omega Alpha Medical Honor Society
2024
University of Wisconsin–Madison
2011-2017
Kitware (United States)
2017
North Carolina State University
2012
Page-based virtual memory improves programmer productivity, security, and memory utilization, but incurs performance overheads due to costly page table walks after TLB misses. This overhead can reach 50% for modern workloads that access increasingly vast memories with stagnating TLB sizes.
A growing body of work has compiled a strong case for the single-ISA heterogeneous multi-core paradigm. It provides multiple, differently-designed superscalar core types that can streamline the execution of diverse programs and program phases. No prior research has addressed the 'Achilles' heel' of this paradigm: design and verification effort is multiplied by the number of different core types.
Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that...
Virtualization provides value for many workloads, but its cost rises for workloads with poor memory access locality. This overhead comes from translation lookaside buffer (TLB) misses where the hardware performs a 2D page walk (up to 24 memory references on x86-64) rather than a native TLB miss (up to only 4 memory references). The first dimension translates guest virtual addresses to guest physical addresses, while the second translates guest physical addresses to host physical addresses. This paper proposes new hardware using direct segments with three new virtualized modes of operation that...
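For concreteness, the arithmetic behind the 24-versus-4 figure quoted above can be reproduced in a few lines. The sketch assumes the standard 4-level x86-64 radix page table in both dimensions, which is the configuration the abstract refers to.

```python
# Worked example: memory references for a TLB miss, native vs. virtualized.
# Assumes 4-level radix page tables in both the guest and host dimensions.

GUEST_LEVELS = 4  # guest page-table levels (guest VA -> guest PA)
HOST_LEVELS = 4   # host page-table levels (guest PA -> host PA)

# Native: one memory reference per level of the page walk.
native_refs = GUEST_LEVELS

# Nested (2D) walk: every guest page-table pointer is a guest-physical
# address, so each of the 4 guest references first needs a full host walk,
# and the final guest PTE's target needs one more host walk:
#   (GUEST_LEVELS + 1) * (HOST_LEVELS + 1) - 1 = 24 on x86-64.
nested_refs = (GUEST_LEVELS + 1) * (HOST_LEVELS + 1) - 1

print(f"native TLB miss: up to {native_refs} memory references")
print(f"nested TLB miss: up to {nested_refs} memory references")
```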
As the latency of the network approaches that of memory, it becomes increasingly attractive for applications to use remote memory: random-access memory at another computer that is accessed using the virtual memory subsystem. This is an old idea whose time has come in the age of fast networks. To work effectively, remote memory must address many technical challenges. In this paper, we enumerate these challenges, discuss their feasibility, explain how some of them are addressed by recent work, and indicate other promising ways to tackle them...
Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources...
Address translation is fundamental to processor performance. Prior work has focused on reducing Translation Lookaside Buffer (TLB) misses to improve performance and energy, whereas we show that even TLB hits consume a significant amount of dynamic energy. To reduce the energy cost of address translation, we first propose Lite, a mechanism that monitors the utility of the L1 TLBs and adaptively changes their sizes with way-disabling. The resulting TLBLite organization opportunistically reduces the energy spent in address translation by 23% on average with minimal...
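The adaptive resizing that Lite performs can be illustrated with a small monitoring loop. The epoch length and miss-ratio thresholds below are assumptions made for illustration, not values from the paper; the point is the shape of the utility-driven resize decision.

```python
# Minimal sketch of utility-monitored way-disabling in the spirit of Lite.
# Epoch length and thresholds are illustrative assumptions.

class AdaptiveTLB:
    EPOCH = 100_000          # accesses per monitoring epoch (assumed)
    LOW, HIGH = 0.001, 0.01  # miss-ratio thresholds (assumed)

    def __init__(self, max_ways=8):
        self.max_ways = max_ways
        self.active_ways = max_ways   # ways currently powered on
        self.accesses = 0
        self.misses = 0

    def access(self, hit: bool):
        self.accesses += 1
        self.misses += 0 if hit else 1
        if self.accesses == self.EPOCH:
            self._resize()

    def _resize(self):
        miss_ratio = self.misses / self.accesses
        if miss_ratio < self.LOW and self.active_ways > 1:
            self.active_ways -= 1    # low utility: disable a way, save energy
        elif miss_ratio > self.HIGH and self.active_ways < self.max_ways:
            self.active_ways += 1    # misses rising: re-enable a way
        self.accesses = self.misses = 0
```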
Our analysis shows that many "big-memory" server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for page-based virtual memory. They consume as much as 10% of execution cycles on TLB misses, even when using large pages. On the other hand, we find that these workloads use read-write permission on most pages, are provisioned not to swap, and rarely benefit from the full flexibility of page-based virtual memory. To remove the TLB miss overhead for big-memory workloads, we propose mapping part of a process's linear address space with a direct segment...
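The direct-segment idea lends itself to a short sketch: a single base/limit/offset check bypasses the TLB and page walk entirely for the mapped region, with everything else falling back to conventional paging. The register values below are hypothetical.

```python
# Minimal sketch of direct-segment translation: one base/limit/offset
# register set maps a contiguous slice of the virtual address space.
# Register values are hypothetical.

DS_BASE, DS_LIMIT, DS_OFFSET = 0x1000_0000, 0x9000_0000, 0x2_0000_0000

def page_walk(va: int) -> int:
    # Stand-in for the conventional multi-level page walk.
    raise NotImplementedError("fall back to the normal paged path")

def translate(va: int) -> int:
    if DS_BASE <= va < DS_LIMIT:
        return va + DS_OFFSET        # direct segment: no TLB, no page walk
    return page_walk(va)             # conventional paged translation
```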
The overheads of memory management units (MMUs) have gained importance in today's systems. Detailed simulators may be too slow to gain insights into micro-architectural techniques that improve MMU efficiency. To address this issue, we propose a novel tool, BadgerTrap, which allows online instrumentation of TLB misses and enables first-order analysis of new hardware ideas. The tool helps create and analyze an x86-64 TLB miss trace. We describe example studies that show the various ways the tool can be applied to gain research insights.
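As an illustration of the kind of first-order analysis such a trace enables, the sketch below counts the pages responsible for the most TLB misses. The one-record-per-line "pid, virtual_address" trace format is a hypothetical stand-in, not BadgerTrap's actual output format.

```python
# Sketch: find the hottest miss-causing pages in a TLB miss trace.
# Trace format ("pid, 0x<virtual address>" per line) is a hypothetical
# stand-in for illustration.

from collections import Counter

PAGE_SHIFT = 12  # 4 KB base pages

def hot_pages(trace_path, top=10):
    misses = Counter()
    with open(trace_path) as f:
        for line in f:
            pid, va = line.split(",")
            misses[(int(pid), int(va, 16) >> PAGE_SHIFT)] += 1
    return misses.most_common(top)   # pages causing the most TLB misses
```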
Virtualization provides benefits for many workloads, but the overheads of virtualizing memory are not universally low. The cost comes from managing two levels of address translation, one in the guest virtual machine (VM) and the other in the host virtual machine monitor (VMM), with either nested or shadow paging. Nested paging directly performs a two-level page walk that makes TLB misses slower than in unvirtualized native execution, but enables fast page table changes. Alternatively, shadow paging restores native TLB miss speeds, but requires costly VMM intervention...
Multi-socket machines with 1-100 TBs of physical memory are becoming prevalent. Applications running on such multi-socket machines suffer non-uniform bandwidth and latency when accessing physical memory. Decades of research have focused on data allocation and placement policies in NUMA settings, but there have been no studies on the question of how to place page-tables amongst sockets. We make the case for explicit page-table allocation policies and show that the placement of page-tables is crucial to overall performance. We propose Mitosis to mitigate the NUMA effects of page-table walks by transparently replicating...
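The replication idea can be sketched in a few lines: mapping updates go to every per-socket replica, while page walks read the replica local to the socket doing the walk. The data structures below are illustrative, not the kernel implementation.

```python
# Sketch of per-socket page-table replication in the spirit of Mitosis.
# Structures are illustrative only.

class ReplicatedPageTable:
    def __init__(self, num_sockets):
        self.replicas = [dict() for _ in range(num_sockets)]  # vpn -> pte

    def map(self, vpn, pte):
        for replica in self.replicas:         # updates hit every replica
            replica[vpn] = pte

    def walk(self, vpn, socket_id):
        return self.replicas[socket_id][vpn]  # walks stay socket-local
```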
Modern workloads suffer high execution-time overhead due to page-based virtual memory. The authors introduce range translations that map arbitrary-sized virtual memory ranges to contiguous physical pages while retaining the flexibility of paging. A range translation reduces address translation to a range lookup, delivering near-zero virtual memory overhead.
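A minimal sketch of the lookup, assuming a sorted list of (virtual base, virtual limit, physical base) triples, each of which can stand in for thousands of page-table entries. The addresses are hypothetical.

```python
# Sketch of a range translation lookup: each entry maps an arbitrary-sized
# virtual range to contiguous physical memory. Addresses are hypothetical.

import bisect

# (virtual_base, virtual_limit, physical_base), sorted by virtual_base
RANGES = [
    (0x1000_0000, 0x1800_0000, 0x7000_0000),
    (0x4000_0000, 0x8000_0000, 0xB000_0000),
]

def range_translate(va: int):
    i = bisect.bisect_right(RANGES, (va, float("inf"), 0)) - 1
    if i >= 0:
        vbase, vlimit, pbase = RANGES[i]
        if va < vlimit:
            return pbase + (va - vbase)  # base + offset, like a segment
    return None  # miss: fall back to the regular paging path
```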
We propose synergistic software and hardware mechanisms that alleviate the address translation overhead, focusing particularly on virtualized execution. On the software side, we propose contiguity-aware (CA) paging, a novel physical memory allocation technique that creates larger-than-a-page contiguous mappings while preserving the flexibility of demand paging. CA paging applies to the hypervisor and guest OS memory managers independently, as well as to native systems. Moreover, it benefits any translation scheme that leverages contiguous mappings. On the hardware side, we propose SpOT, a simple...
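A toy version of the allocation decision behind CA paging: on a page fault, prefer the free frame that extends the faulting region's existing contiguous run. The fallback policy here is deliberately simplistic, and the structures are illustrative rather than the paper's allocator.

```python
# Sketch of a contiguity-aware frame choice on a page fault.
# Structures and fallback policy are illustrative assumptions.

def pick_frame(free_frames: set, vpn: int, mapping: dict):
    # mapping: vpn -> pfn for pages already backed in this region
    prev = mapping.get(vpn - 1)
    if prev is not None and prev + 1 in free_frames:
        return prev + 1              # extends the contiguous run
    return min(free_frames)          # fallback: any free frame (simplistic)
```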
The TLB is increasingly a bottleneck for big data applications. In most designs, the number of TLB entries is highly constrained by latency requirements and grows much more slowly than the working sets of applications. Many solutions to this problem, such as huge pages, perforated pages, or TLB coalescing, rely on physical contiguity for performance gains, yet the cost of defragmenting memory can easily nullify these gains. This paper introduces mosaic pages, which increase TLB reach by compressing multiple, discrete translations into one TLB entry...
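The compression idea can be sketched as one entry covering a small group of consecutive virtual pages, each backed by its own discrete frame. The group size and encoding below are assumptions for illustration, not the paper's actual entry format.

```python
# Sketch: one TLB entry covering several consecutive virtual pages, each
# with its own frame. Group size and encoding are illustrative assumptions.

MOSAIC = 4  # virtual pages per TLB entry (assumed)

class MosaicEntry:
    def __init__(self, base_vpn, frames):
        assert len(frames) == MOSAIC
        self.base_vpn = base_vpn     # first virtual page of the group
        self.frames = frames         # one compactly encoded frame per page

    def lookup(self, vpn):
        if self.base_vpn <= vpn < self.base_vpn + MOSAIC:
            return self.frames[vpn - self.base_vpn]
        return None                  # not covered: regular TLB miss path
```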
Providing multiple superscalar core types on a chip, each tailored to a different class of instruction-level behavior, is an exciting direction for increasing processor performance and energy efficiency. Unfortunately, design and verification effort increases with each additional core type, limiting the microarchitectural diversity that can be practically implemented. FabScalar aims to automate superscalar core design, opening up the paradigm's many opportunities.
Increasing heterogeneity in the memory system mandates careful data placement to hide non-uniform memory access (NUMA) effects on applications. However, NUMA optimizations have predominantly focused on application data over the past decades, largely ignoring the placement of kernel data structures due to their small memory footprint; this is evident in typical OS designs that pin kernel objects in memory. In this paper, we show that kernel data placement is gaining importance in the context of page-tables: sub-optimal placement of page-tables causes severe slowdown (up to 3.1x) on virtualized NUMA servers.
The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures, including x86, Arm, and RISC-V. The simulator has been under active development over the last nine years since the original release. In this time, there have been over 7500 commits to the codebase from over 250 unique...
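For readers new to gem5, a minimal syscall-emulation configuration in the style of the project's learning-gem5 simple.py script looks roughly like the sketch below. Exact class names, port names, and workload setup vary across gem5 versions (this follows the post-v20.1 port naming), so treat it as a sketch rather than a drop-in script.

```python
# Minimal gem5 SE-mode config, modeled on the learning-gem5 simple.py
# example. Details vary across gem5 versions.

import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock="1GHz",
                                   voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("512MB")]

system.cpu = TimingSimpleCPU()
system.membus = SystemXBar()
system.cpu.icache_port = system.membus.cpu_side_ports  # no caches: straight to the bus
system.cpu.dcache_port = system.membus.cpu_side_ports

system.mem_ctrl = MemCtrl()
system.mem_ctrl.dram = DDR3_1600_8x8()
system.mem_ctrl.dram.range = system.mem_ranges[0]
system.mem_ctrl.port = system.membus.mem_side_ports

system.cpu.createInterruptController()
# x86-specific interrupt port hookup:
system.cpu.interrupts[0].pio = system.membus.mem_side_ports
system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports
system.system_port = system.membus.cpu_side_ports

binary = "tests/test-progs/hello/bin/x86/linux/hello"
system.workload = SEWorkload.init_compatible(binary)
process = Process()
process.cmd = [binary]
system.cpu.workload = process
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
exit_event = m5.simulate()
print(f"Exited @ tick {m5.curTick()} because {exit_event.getCause()}")
```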