Jayneel Gandhi

ORCID: 0000-0003-1696-400X
Research Areas
  • Parallel Computing and Optimization Techniques
  • Advanced Data Storage Technologies
  • Cloud Computing and Resource Management
  • Distributed and Parallel Computing Systems
  • Superconducting Materials and Applications
  • Distributed systems and fault tolerance
  • Algorithms and Data Compression
  • IoT and Edge/Fog Computing
  • Embedded Systems Design Techniques
  • Interconnection Networks and Systems
  • Security and Verification in Computing
  • Healthcare Technology and Patient Monitoring
  • Advanced Malware Detection Techniques
  • Computational Geometry and Mesh Generation
  • COVID-19 diagnosis using AI
  • Anomaly Detection Techniques and Applications
  • Network Security and Intrusion Detection
  • Advanced Memory and Neural Computing
  • Ferroelectric and Negative Capacitance Devices
  • Low-power high-performance VLSI design
  • Cellular Automata and Applications
  • Radiation Effects in Electronics

Menlo School
2024

Alpha Omega Alpha Medical Honor Society
2024

University of Wisconsin–Madison
2011-2017

Kitware (United States)
2017

North Carolina State University
2012

Our analysis shows that many "big-memory" server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for page-based virtual memory. They consume as much as 10% of execution cycles on TLB misses, even when using large pages. On the other hand, we find that these workloads use read-write permission on most pages, are provisioned not to swap, and rarely benefit from the full flexibility of page-based virtual memory.

10.1145/2485922.2485943 article EN 2013-06-23

Page-based virtual memory improves programmer productivity, security, and memory utilization, but incurs performance overheads due to costly page table walks after TLB misses. This overhead can reach 50% for modern workloads that access increasingly vast memory with stagnating TLB sizes.

10.1145/2749469.2749471 article EN 2015-05-26

A growing body of work has compiled a strong case for the single-ISA heterogeneous multi-core paradigm, which provides multiple, differently-designed superscalar core types that can streamline the execution of diverse programs and program phases. No prior research has addressed the 'Achilles' heel' of this paradigm: design and verification effort is multiplied by the number of different core types.

10.1145/2000064.2000067 article EN 2011-06-04

Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that...

10.1145/3123939.3123975 article EN 2017-10-04

Virtualization provides value for many workloads, but its cost rises for workloads with poor memory access locality. This overhead comes from translation lookaside buffer (TLB) misses where the hardware performs a 2D page walk (up to 24 memory references on x86-64) rather than a native TLB miss (up to only 4 memory references). The first dimension translates guest virtual addresses to guest physical addresses, while the second translates guest physical addresses to host physical addresses. This paper proposes new hardware using direct segments with three new virtualized modes of operation that...

10.1109/micro.2014.37 article EN 2014-12-01
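
The 24-versus-4 figure quoted above follows from simple counting: with g guest page-table levels and h host levels, a nested walk can touch g*h + g + h entries. A small C sketch of that arithmetic (the function names are illustrative, not from the paper):

```c
/* Counting page-walk memory references for the 2D walk described above.
 * native_refs and nested_refs are illustrative helpers, not from the paper. */
#include <stdio.h>

static int native_refs(int levels)
{
    return levels;                       /* one reference per level */
}

static int nested_refs(int guest_levels, int host_levels)
{
    /* Every guest page-table pointer is a guest-physical address and needs
     * a full host walk (g*h), plus the g guest entries themselves, plus a
     * final host walk for the leaf guest-physical address (h). */
    return guest_levels * host_levels + guest_levels + host_levels;
}

int main(void)
{
    printf("native x86-64 walk: up to %d references\n", native_refs(4));
    printf("nested x86-64 walk: up to %d references\n", nested_refs(4, 4));
    return 0;
}
```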

As the latency of the network approaches that of memory, it becomes increasingly attractive for applications to use remote memory---random-access memory at another computer that is accessed using the virtual memory subsystem. This is an old idea whose time has come, in the age of fast networks. To work effectively, however, remote memory must address many technical challenges. In this paper, we enumerate these challenges, discuss their feasibility, explain how some of them are addressed by recent work, and indicate other promising ways to tackle them....

10.1145/3127479.3131612 article EN 2017-09-24

Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can place vastly different compute and memory demands on the GPU. In a large-scale computing environment, accommodating such wide-ranging demands without leaving GPU resources...

10.1145/3173162.3173169 article EN 2018-03-19

Address translation is fundamental to processor performance. Prior work focused on reducing Translation Lookaside Buffer (TLB) misses to improve performance and energy, whereas we show that even TLB hits consume a significant amount of dynamic energy. To reduce the energy cost of address translation, we first propose Lite, a mechanism that monitors the utility of the L1 TLBs and adaptively changes their sizes with way-disabling. The resulting TLBLite organization opportunistically reduces the energy spent in address translation by 23% on average with minimal...

10.1109/hpca.2016.7446100 article EN 2016-03-01

Our analysis shows that many "big-memory" server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for page-based virtual memory. They consume as much as 10% of execution cycles on TLB misses, even when using large pages. On the other hand, we find that these workloads use read-write permission on most pages, are provisioned not to swap, and rarely benefit from the full flexibility of page-based virtual memory. To remove the TLB miss overhead for big-memory workloads, we propose mapping part of a process's linear address space with a direct...

10.1145/2508148.2485943 article EN ACM SIGARCH Computer Architecture News 2013-06-23
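
The direct-segment idea described above reduces the fast path to a range check plus one add. A minimal user-space C sketch of that check, with illustrative BASE/LIMIT/OFFSET values and a placeholder fallback standing in for the normal page-table walk:

```c
/* Minimal sketch of the direct-segment check described above: addresses
 * inside a contiguous [BASE, LIMIT) range translate with one add, bypassing
 * the TLB and page table, while all other addresses fall back to ordinary
 * paging. Register names and values are illustrative, not the paper's design. */
#include <stdint.h>
#include <stdio.h>

#define DS_BASE   0x100000000ULL  /* start of direct-mapped virtual range */
#define DS_LIMIT  0x500000000ULL  /* end of direct-mapped virtual range   */
#define DS_OFFSET 0x040000000ULL  /* virtual-to-physical displacement     */

/* Stand-in for the conventional TLB / page-table-walk path. */
static uint64_t page_table_walk(uint64_t va) { return va; }

static uint64_t translate(uint64_t va)
{
    if (va >= DS_BASE && va < DS_LIMIT)
        return va + DS_OFFSET;        /* direct segment: range check + add */
    return page_table_walk(va);       /* everything else still uses paging */
}

int main(void)
{
    uint64_t va = 0x123456000ULL;
    printf("0x%llx -> 0x%llx\n",
           (unsigned long long)va, (unsigned long long)translate(va));
    return 0;
}
```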

The overheads of memory management units (MMUs) have gained importance in today's systems. Detailed simulators may be too slow to gain insights into micro-architectural techniques that improve MMU efficiency. To address this issue, we propose a novel tool, BadgerTrap, which allows online instrumentation of TLB misses. It enables first-order analysis of new hardware: the tool helps to create and analyze an x86-64 TLB miss trace. We describe example studies that show various ways the tool can be applied to gain research insights.

10.1145/2669594.2669599 article EN ACM SIGARCH Computer Architecture News 2014-09-15
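
BadgerTrap's trick, as described above, is to make TLB fills trap so they can be logged. A toy user-space model of that control flow (the real tool poisons x86-64 PTEs inside the Linux kernel; the struct and handler below are illustrative only):

```c
/* Toy model of the instrumentation flow described above: a "poisoned" PTE
 * makes the simulated TLB fill call a handler, which records the miss,
 * services it, and re-arms the entry. This is only a sketch of the control
 * flow, not the real kernel implementation. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct pte {
    uint64_t frame;
    bool poisoned;      /* stands in for the reserved bit used for trapping */
};

static unsigned long tlb_miss_count;

static void badger_trap_handler(uint64_t vpn, struct pte *p)
{
    (void)vpn;                       /* a real trace would log vpn, CR3, ... */
    tlb_miss_count++;                /* record the miss */
    p->poisoned = false;             /* let this fill complete */
}

/* Simulated TLB fill: a poisoned PTE traps before the translation loads. */
static uint64_t tlb_fill(uint64_t vpn, struct pte *p)
{
    if (p->poisoned)
        badger_trap_handler(vpn, p);
    uint64_t frame = p->frame;
    p->poisoned = true;              /* re-arm so the next fill traps too */
    return frame;
}

int main(void)
{
    struct pte p = { .frame = 42, .poisoned = true };
    for (int i = 0; i < 3; i++)
        tlb_fill(7, &p);
    printf("instrumented TLB misses: %lu\n", tlb_miss_count);
    return 0;
}
```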

Virtualization provides benefits for many workloads, but the overheads of virtualizing memory are not universally low. The cost comes from managing two levels of address translation---one in the guest virtual machine (VM) and the other in the host virtual machine monitor (VMM)---with either nested or shadow paging. Nested paging directly performs a two-level page walk that makes TLB misses slower than unvirtualized native execution, but enables fast page table changes. Alternatively, shadow paging restores native TLB miss speeds, but requires costly VMM...

10.1145/3007787.3001212 article EN ACM SIGARCH Computer Architecture News 2016-06-18

Multi-socket machines with 1-100 TBs of physical memory are becoming prevalent. Applications running on such multi-socket machines suffer non-uniform bandwidth and latency when accessing physical memory. Decades of research have focused on data allocation and placement policies in NUMA settings, but there have been no studies on the question of how to place page-tables amongst sockets. We make the case for explicit page-table placement and show that it is crucial to overall performance. We propose Mitosis to mitigate NUMA effects on page-table walks by transparently replicating...

10.1145/3373376.3378468 article EN 2020-03-09
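
A toy C illustration of the replication idea behind Mitosis, assuming a flat per-socket lookup table instead of real multi-level page tables: mapping updates go to every replica, while each walk reads only the replica on its own socket.

```c
/* Toy illustration of per-socket page-table replication. The flat table and
 * two-socket setup are illustrative only; the real system replicates
 * multi-level x86-64 page tables inside the OS. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_SOCKETS 2
#define NUM_PAGES   16

/* replica[s][vpn] holds the frame number for vpn, as seen from socket s. */
static uint32_t replica[NUM_SOCKETS][NUM_PAGES];

/* Mapping changes are rare relative to walks, so they update all replicas. */
static void map_page(uint32_t vpn, uint32_t frame)
{
    for (int s = 0; s < NUM_SOCKETS; s++)
        replica[s][vpn] = frame;
}

/* A "page walk" only touches memory on the walking thread's own socket. */
static uint32_t walk(int socket, uint32_t vpn)
{
    return replica[socket][vpn];
}

int main(void)
{
    memset(replica, 0, sizeof(replica));
    map_page(3, 42);
    printf("socket 0 walk: vpn 3 -> frame %u\n", walk(0, 3));
    printf("socket 1 walk: vpn 3 -> frame %u\n", walk(1, 3));
    return 0;
}
```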

Virtualization provides benefits for many workloads, but the overheads of virtualizing memory are not universally low. The cost comes from managing two levels of address translation - one in the guest virtual machine (VM) and the other in the host virtual machine monitor (VMM) - with either nested or shadow paging. Nested paging directly performs a two-level page walk that makes TLB misses slower than unvirtualized native execution, but enables fast page table changes. Alternatively, shadow paging restores native TLB miss speeds, but requires costly VMM intervention...

10.1109/isca.2016.67 article EN 2016-06-01

Modern workloads suffer high execution-time overhead due to page-based virtual memory. The authors introduce range translations that map arbitrary-sized virtual memory ranges to contiguous physical pages while retaining the flexibility of paging. A range translation reduces address translation to a range lookup that delivers near-zero overhead.

10.1109/mm.2016.10 article EN IEEE Micro 2016-03-18
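
A range translation as described above is just a virtual base, a length, and a physical base, with paging as the fallback. A minimal C sketch with illustrative ranges:

```c
/* Illustrative range table and lookup; field names and values are assumed,
 * and page_table_walk stands in for the normal paging path. */
#include <stdint.h>
#include <stdio.h>

struct range_translation {
    uint64_t va_base;   /* start of the virtual range           */
    uint64_t length;    /* size of the range in bytes           */
    uint64_t pa_base;   /* start of the backing physical range  */
};

static const struct range_translation ranges[] = {
    { 0x200000000ULL, 0x080000000ULL, 0x010000000ULL },
    { 0x400000000ULL, 0x001000000ULL, 0x0a0000000ULL },
};

static uint64_t page_table_walk(uint64_t va) { return va; }   /* fallback */

static uint64_t translate(uint64_t va)
{
    for (int i = 0; i < (int)(sizeof(ranges) / sizeof(ranges[0])); i++) {
        if (va >= ranges[i].va_base &&
            va <  ranges[i].va_base + ranges[i].length)
            return ranges[i].pa_base + (va - ranges[i].va_base);
    }
    return page_table_walk(va);       /* not in any range: use paging */
}

int main(void)
{
    uint64_t va = 0x200004000ULL;
    printf("0x%llx -> 0x%llx\n",
           (unsigned long long)va, (unsigned long long)translate(va));
    return 0;
}
```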

We propose synergistic software and hardware mechanisms that alleviate the address translation overhead, focusing particularly on virtualized execution. On the software side, we propose contiguity-aware (CA) paging, a novel physical memory allocation technique that creates larger-than-a-page contiguous mappings while preserving the flexibility of demand paging. CA paging applies to the hypervisor and guest OS memory manager independently, as well as to native systems. Moreover, it benefits any scheme that leverages contiguous mappings. On the hardware side, we propose SpOT, a simple...

10.1109/isca45697.2020.00050 article EN 2020-05-01

The TLB is increasingly a bottleneck for big data applications. In most designs, the number of TLB entries is highly constrained by latency requirements, and is growing much more slowly than the working sets of applications. Many solutions to this problem, such as huge pages, perforated pages, or TLB coalescing, rely on physical contiguity for performance gains, yet the cost of defragmenting memory can easily nullify these gains. This paper introduces mosaic pages, which increase TLB reach by compressing multiple, discrete translations into one entry....

10.1145/3582016.3582021 article EN 2023-03-20
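
One way to read the mosaic-pages idea above: if the allocator restricts each virtual page to a few hash-determined candidate frames, one entry can cover several pages by storing only a small slot index per page. The sketch below is a simplified model under that assumption, not the paper's actual hash or entry encoding.

```c
/* Simplified model of compressing several translations into one entry via
 * hash-restricted placement. The hash, field layout, and sizes are all
 * illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define PAGES_PER_ENTRY 4   /* translations packed into one toy entry (assumed) */
#define NUM_FRAMES      4096

/* Toy hash giving the candidate frame for (vpn, slot). */
static uint32_t candidate_frame(uint64_t vpn, uint32_t slot)
{
    uint64_t h = vpn * 0x9e3779b97f4a7c15ULL + slot * 0xbf58476d1ce4e5b9ULL;
    return (uint32_t)(h % NUM_FRAMES);
}

/* One toy TLB entry: a base VPN plus a small slot index per covered page. */
struct mosaic_entry {
    uint64_t base_vpn;
    uint8_t slot[PAGES_PER_ENTRY];
};

static uint32_t lookup(const struct mosaic_entry *e, uint64_t vpn)
{
    uint64_t idx = vpn - e->base_vpn;   /* caller checks vpn is covered */
    return candidate_frame(vpn, e->slot[idx]);
}

int main(void)
{
    struct mosaic_entry e = { .base_vpn = 0x1000, .slot = { 2, 0, 5, 7 } };
    for (uint64_t vpn = 0x1000; vpn < 0x1000 + PAGES_PER_ENTRY; vpn++)
        printf("vpn 0x%llx -> frame %u\n",
               (unsigned long long)vpn, lookup(&e, vpn));
    return 0;
}
```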

Providing multiple superscalar core types on a chip, each tailored to different classes of instruction-level behavior, is an exciting direction for increasing processor performance and energy efficiency. Unfortunately, design and verification effort increases with each additional core type, limiting the microarchitectural diversity that can be practically implemented. FabScalar aims to automate superscalar core design, opening up its many opportunities.

10.1109/mm.2012.23 article EN IEEE Micro 2012-04-12

Increasing heterogeneity in the memory system mandates careful data placement to hide non-uniform memory access (NUMA) effects on applications. However, NUMA optimizations have predominantly focused on application data over the past decades, largely ignoring the placement of kernel data structures due to their small memory footprint; this is evident in typical OS designs that pin kernel objects in memory. In this paper, we show that kernel data placement is gaining importance in the context of page-tables: sub-optimal placement of page-tables causes severe slowdown (up to 3.1x) on virtualized NUMA servers.

10.1145/3445814.3446709 article EN 2021-04-11

The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. gem5 has been under active development over the last nine years since the original release. In this time, there have been 7500 commits to the codebase from 250 unique...

10.48550/arxiv.2007.03152 preprint EN cc-by arXiv (Cornell University) 2020-01-01