- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Cloud Computing and Resource Management
- Network Packet Processing and Optimization
- Distributed Systems and Fault Tolerance
- Interconnection Networks and Systems
- Advanced Vision and Imaging
- Embedded Systems Design Techniques
- Computer Graphics and Visualization Techniques
- Distributed and Parallel Computing Systems
- Advanced Memory and Neural Computing
- Security and Verification in Computing
- 3D Shape Modeling and Analysis
- Domain Adaptation and Few-Shot Learning
- Radiation Detection and Scintillator Technologies
- Ferroelectric and Negative Capacitance Devices
- Real-Time Simulation and Control Systems
- Image Enhancement Techniques
- Cellular Automata and Applications
Nvidia (United States)
2023-2024
University of California, Santa Cruz
2019
Rutgers Sexual and Reproductive Health and Rights
2016-2018
Rutgers, The State University of New Jersey
2012-2018
Universidade Federal do Rio de Janeiro
2009
Universidade do Estado do Rio de Janeiro
2009
Processors and operating systems (OSes) support multiple memory page sizes. Superpages increase Translation Lookaside Buffer (TLB) hits, while small pages provide fine-grained protection. Ideally, TLBs should perform well for any distribution of page sizes. In reality, set-associative TLBs -- used frequently for their energy efficiency compared to fully-associative TLBs -- cannot (easily) support multiple page sizes concurrently. Instead, commercial designs typically implement separate set-associative TLBs for different page sizes. This means that when superpages are allocated...
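The split-TLB organization this abstract describes can be made concrete with a toy model (a sketch only; the structure sizes, page sizes, and replacement policy below are illustrative, not any real product's design). Because each page size gets its own structure, a miss in one TLB cannot be absorbed by spare capacity in the other, which is the imbalance the abstract points at.

```cpp
// Toy model of per-page-size TLBs; capacities and policies are illustrative.
#include <cstdint>
#include <vector>

struct SetAssocTLB {
    SetAssocTLB(size_t sets, size_t ways)
        : sets_(sets), ways_(ways), entries_(sets, std::vector<uint64_t>()) {}

    bool lookup(uint64_t vpn) {
        auto& set = entries_[vpn % sets_];
        for (uint64_t tag : set)
            if (tag == vpn) return true;             // TLB hit
        return false;                                // TLB miss
    }

    void fill(uint64_t vpn) {
        auto& set = entries_[vpn % sets_];
        if (set.size() == ways_) set.erase(set.begin());  // evict (FIFO for simplicity)
        set.push_back(vpn);
    }

private:
    size_t sets_, ways_;
    std::vector<std::vector<uint64_t>> entries_;
};

// Separate structures per page size, as in the split designs described above.
struct SplitTLB {
    SetAssocTLB small{16, 4};   // e.g., 4 KB pages
    SetAssocTLB super{4, 4};    // e.g., 2 MB superpages

    bool access(uint64_t vaddr, bool is_superpage) {
        uint64_t vpn = vaddr >> (is_superpage ? 21 : 12);
        SetAssocTLB& tlb = is_superpage ? super : small;
        if (tlb.lookup(vpn)) return true;
        tlb.fill(vpn);          // models the page-table walk and fill on a miss
        return false;
    }
};
```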
Recent studies on commercial hardware demonstrated that irregular GPU applications can bottleneck on virtual-to-physical address translations. In this work, we explore ways to reduce translation overheads for such applications. We discover that the order of servicing the GPU's translation requests (specifically, page table walks) plays a key role in determining the amount of overhead experienced by an application. We find that different SIMD instructions executed by an application require vastly different amounts of work to service their translation needs, ...
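A minimal sketch of the kind of ordering decision at stake: group pending page table walks by the SIMD instruction (warp) that generated them and service the group needing the least remaining work first. This shortest-job-first style heuristic is offered only to illustrate the observation that instructions differ widely in translation work; it is not claimed to be the paper's exact policy.

```cpp
// Illustrative walk-scheduling heuristic, not the paper's actual mechanism.
#include <algorithm>
#include <cstdint>
#include <vector>

struct WalkGroup {
    uint32_t warp_id;                    // SIMD instruction / warp that stalled
    std::vector<uint64_t> pending_vpns;  // distinct pages still to be walked
};

// Service the group that can be unblocked with the least remaining work first,
// so cheap instructions do not wait behind expensive, divergent ones.
void schedule_walks(std::vector<WalkGroup>& groups) {
    std::sort(groups.begin(), groups.end(),
              [](const WalkGroup& a, const WalkGroup& b) {
                  return a.pending_vpns.size() < b.pending_vpns.size();
              });
}
```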
To improve system performance, operating systems (OSes) often undertake activities that require modification of virtual-to-physical address translations. For example, the OS may migrate data between physical pages to manage heterogeneous memory devices. We refer to such activities as page remappings. Unfortunately, remappings are expensive. We show that a big part of this cost arises from translation coherence, particularly on systems employing virtualization. In response, we propose hardware translation invalidation and coherence, or...
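To make the cost structure concrete, here is a minimal sketch of a page remapping and the translation-coherence work it triggers. The function names are hypothetical stand-ins for OS internals; the point is that beyond the data copy, every core that may cache the stale translation must be interrupted and flush it (a TLB shootdown), and under virtualization both guest and host translations add to that work.

```cpp
// Sketch only: hypothetical OS-internal helpers modeling a page remapping.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

struct Core { int id; };

// Stubs standing in for OS internals (hypothetical names).
void send_shootdown_ipi(const Core& c, uint64_t vaddr) {
    std::printf("IPI to core %d: invalidate VA 0x%llx\n", c.id,
                (unsigned long long)vaddr);
}
void invalidate_local_tlb(uint64_t) { /* e.g., an INVLPG-style operation */ }
void update_page_table(uint64_t, void*) { /* rewrite the PTE */ }

void remap_page(uint64_t vaddr, void* old_frame, void* new_frame,
                const std::vector<Core>& sharers) {
    std::memcpy(new_frame, old_frame, 4096);   // move the data (e.g., between memory tiers)
    update_page_table(vaddr, new_frame);       // install the new translation
    invalidate_local_tlb(vaddr);
    // The expensive part: synchronous shootdowns to every core that may cache
    // the stale translation. Under virtualization, guest and host each keep
    // translations, multiplying this coherence work.
    for (const Core& c : sharers) send_shootdown_ipi(c, vaddr);
}
```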
Future servers will incorporate many active low-power modes for different system components, such as cores and memory. Though these modes provide flexibility for power management via Dynamic Voltage Frequency Scaling (DVFS), they must be operated in a coordinated manner. Such coordinated control creates a combinatorial space of possible mode configurations. Given the rapid growth in the number of cores, it is becoming increasingly challenging to quickly select the configuration that maximizes performance under a given power budget....
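The combinatorial search the abstract refers to can be illustrated with a toy exhaustive selector: pick one mode per component to maximize predicted performance under a power budget. The performance and power models below are placeholders; the takeaway is that the space grows as (#modes)^(#components), which is why faster selection policies are needed.

```cpp
// Toy exhaustive DVFS configuration search; models are placeholders.
#include <algorithm>
#include <vector>

struct Mode   { double freq_ghz; double power_w; };
struct Config { int core_mode; int mem_mode; double perf; double power; };

Config pick_best(const std::vector<Mode>& core_modes,
                 const std::vector<Mode>& mem_modes,
                 double power_budget_w) {
    Config best{-1, -1, 0.0, 0.0};
    for (size_t c = 0; c < core_modes.size(); ++c) {
        for (size_t m = 0; m < mem_modes.size(); ++m) {
            double power = core_modes[c].power_w + mem_modes[m].power_w;
            if (power > power_budget_w) continue;   // violates the budget
            // Placeholder performance model: throughput limited by the slower side.
            double perf = std::min(core_modes[c].freq_ghz, 2.0 * mem_modes[m].freq_ghz);
            if (perf > best.perf) best = {(int)c, (int)m, perf, power};
        }
    }
    return best;
}
```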
Recent studies have shown the potential of last-level TLBs shared by multiple cores in tackling the memory translation performance challenges posed by "big data" workloads. A key stumbling block hindering their effectiveness, however, is their high access time. We present a design methodology to reduce these access times so as to realize high-performance and scalable shared L2 TLBs. As a first step, we study the benefits of replacing the monolithic shared TLB with a distributed set of small TLB slices. While this approach does reduce lookup latency, it...
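A brief sketch of the distribution step described above: each virtual page is steered to a home slice by hashing its virtual page number, so every slice stays small and fast, at the cost of a trip across the on-chip network to reach that slice. The slice count and hash below are illustrative only.

```cpp
// Illustrative VPN-to-slice steering for a distributed shared L2 TLB.
#include <cstdint>

constexpr int kNumSlices = 8;   // illustrative slice count

int home_slice(uint64_t vpn) {
    // Simple XOR-fold hash to spread pages across slices.
    return static_cast<int>((vpn ^ (vpn >> 7) ^ (vpn >> 17)) % kNumSlices);
}
```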
To improve system performance, modern operating systems (OSes) often undertake activities that require modification of virtual-to-physical page translation mappings. For example, the OS may migrate data between physical frames to defragment memory and enable superpages, or to move pages between heterogeneous memory devices. We refer to all such activities as page remappings. Unfortunately, remappings are expensive. We show that translation coherence is a major culprit, and that systems employing virtualization are especially badly affected by their overheads. In response,...
Direct volume rendering has become a popular technique for visualizing volumetric data from sources such as scientific simulations, analytic functions, and medical scanners, among others. Volume rendering algorithms, such as raycasting, can produce high-quality images; however, the use of raycasting has been limited due to its high demands on computational power and memory bandwidth. In this paper, we propose a new implementation of the raycasting algorithm that takes advantage of the highly parallel architecture of the Cell Broadband Engine...
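For context, this is the per-ray kernel that raycasting-based direct volume rendering repeats for every pixel and that such work parallelizes across cores; the compositing loop below is a generic front-to-back sketch with placeholder sampling and transfer functions, not the paper's optimized Cell BE implementation.

```cpp
// Generic front-to-back raycasting/compositing for one ray; placeholders only.
#include <array>

struct RGBA { float r, g, b, a; };

RGBA classify(float d) { return {d, d, d, d * 0.05f}; }       // toy transfer function
float sample_volume(float x, float y, float z) {              // toy density field
    return (x * x + y * y + z * z) < 1.0f ? 1.0f : 0.0f;
}

RGBA cast_ray(std::array<float, 3> origin, std::array<float, 3> dir,
              float step, int max_steps) {
    RGBA out{0, 0, 0, 0};
    for (int i = 0; i < max_steps && out.a < 0.99f; ++i) {    // early ray termination
        float t = step * i;
        float d = sample_volume(origin[0] + dir[0] * t,
                                origin[1] + dir[1] * t,
                                origin[2] + dir[2] * t);
        RGBA s = classify(d);
        float w = (1.0f - out.a) * s.a;                       // front-to-back "over" operator
        out.r += w * s.r; out.g += w * s.g; out.b += w * s.b;
        out.a += w;
    }
    return out;
}
```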
Many security and forensic analyses rely on the ability to fetch memory snapshots from a target machine. To date, the community has relied on virtualization, external hardware, or trusted components to obtain such snapshots. These techniques either sacrifice snapshot consistency or degrade the performance of applications executing atop the target. We present SnipSnap, a new snapshot acquisition system based on on-package DRAM technologies that offers consistent snapshots without excessively hurting the performance of the target's applications. We realize SnipSnap and evaluate its...
Future servers will incorporate many active low-power modes for each core and the main memory subsystem. Though these modes provide flexibility for power and/or energy management via Dynamic Voltage Frequency Scaling (DVFS), prior work has shown that they must be managed in a coordinated manner. This requirement creates a combinatorial space of possible mode configurations. As a result, it becomes increasingly challenging to quickly select the configuration that optimizes both performance and power/energy efficiency....
An increasing number of applications benefit from heterogeneous hardware accelerators. Such accelerators often require the application to manually manage memory buffers on the devices and transfer data between host and device buffers. A programming model that unifies the virtual address space across the host and devices is appealing because it enables automatic data transfers and simplifies application-level programming. However, these automatic transfers can sometimes be redundant, which decreases performance. NVIDIA's UVM (unified memory) driver provides a...
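The host-side sketch below shows the unified-memory programming model the abstract discusses: a single pointer usable on CPU and GPU, with the UVM driver migrating pages on demand. The prefetch/advise hints illustrate one generic way applications curb redundant automatic transfers; this is standard CUDA runtime API usage, not a description of the paper's own technique, and it requires the CUDA toolkit to build.

```cpp
// Host-side C++ using the CUDA runtime (compile with nvcc or link -lcudart).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1 << 20;
    float* data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return 1;

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // touched on the CPU first

    // Without hints, the first GPU access would fault pages over one at a time.
    // Bulk prefetching to device 0 avoids that per-page migration traffic.
    cudaMemAdvise(data, n * sizeof(float), cudaMemAdviseSetReadMostly, 0);
    cudaMemPrefetchAsync(data, n * sizeof(float), /*dstDevice=*/0);

    // ... a kernel that reads `data` would be launched here (omitted) ...

    cudaDeviceSynchronize();
    std::printf("first element: %f\n", data[0]);     // pages migrate back on CPU access
    cudaFree(data);
    return 0;
}
```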
Direct volume rendering of irregular 3D datasets demands high computational power and memory bandwidth. Recent research on optimizing these algorithms is exploring the processing power offered by a new trend in hardware design: multithreaded accelerator devices. Accelerators like Graphics Processing Units (GPUs) and the Cell Broadband Engine processor (Cell BE), used as integrated coprocessors, off-load application work from the CPU and offer promising speedups. The difficulty in using these devices, however, is how to program...
Memory prefetching improves performance across many system layers. However, achieving high prefetch accuracy with low overhead is challenging, as memory hierarchies and application access patterns become more complicated. Furthermore, a prefetcher's ability to adapt to new access patterns as they emerge is becoming more crucial than ever. Recent work has demonstrated the use of deep learning techniques to improve prefetch accuracy, albeit with impractical compute and storage overheads. This paper suggests taking inspiration from mechanisms...
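To make the accuracy-versus-overhead tension concrete, here is a sketch of a classic low-overhead, table-based stride prefetcher of the kind learning-based proposals are typically compared against; it is included for illustration and is not the design proposed in the paper.

```cpp
// Illustrative PC-indexed stride prefetcher; not the paper's mechanism.
#include <cstdint>
#include <unordered_map>
#include <vector>

class StridePrefetcher {
public:
    // Called on every demand access; returns addresses to prefetch (possibly none).
    std::vector<uint64_t> on_access(uint64_t pc, uint64_t addr) {
        std::vector<uint64_t> prefetches;
        auto it = table_.find(pc);
        if (it != table_.end()) {
            int64_t stride = static_cast<int64_t>(addr) -
                             static_cast<int64_t>(it->second.last_addr);
            if (stride != 0 && stride == it->second.last_stride) {
                ++it->second.confidence;
                if (it->second.confidence >= 2)            // pattern looks stable
                    prefetches.push_back(addr + stride);   // prefetch next element in stride
            } else {
                it->second.confidence = 0;
            }
            it->second.last_stride = stride;
            it->second.last_addr = addr;
        } else {
            table_[pc] = {addr, 0, 0};
        }
        return prefetches;
    }

private:
    struct Entry { uint64_t last_addr; int64_t last_stride; int confidence; };
    std::unordered_map<uint64_t, Entry> table_;   // indexed by load PC
};
```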
This tutorial is intended for programmers who are interested in boosting their graphics applications using a different architectural paradigm: the Cell Broadband Engine (Cell BE). Our main idea is to focus on performance issues that can be efficiently handled by the multicore and vector facilities of the Cell BE. We aim to offer an alternative way to achieve high performance, rather than the use of graphics processing units (GPUs). The Cell BE processor is the first implementation of a chip multiprocessor with a significant number of general-purpose...
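The core pattern such a tutorial revolves around is partitioning a graphics workload across parallel workers. The portable sketch below uses std::thread only as a stand-in to show that structure; on the Cell BE the workers would be SPE programs fed via explicit DMA, and the inner loop would use the SPE's 128-bit SIMD facilities.

```cpp
// Portable stand-in for multicore work partitioning (std::thread, not SPEs).
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

void render_rows(std::vector<float>& image, size_t width,
                 size_t row_begin, size_t row_end) {
    for (size_t y = row_begin; y < row_end; ++y)
        for (size_t x = 0; x < width; ++x)
            image[y * width + x] = static_cast<float>(x + y);  // placeholder shading
}

void render(std::vector<float>& image, size_t width, size_t height,
            unsigned workers) {
    std::vector<std::thread> pool;
    size_t rows_per_worker = (height + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        size_t begin = w * rows_per_worker;
        size_t end = std::min(height, begin + rows_per_worker);
        if (begin >= end) break;
        pool.emplace_back(render_rows, std::ref(image), width, begin, end);
    }
    for (auto& t : pool) t.join();
}
```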
This paper presents a broad, pathfinding design space exploration of memory management units (MMUs) for heterogeneous systems. We consider a variety of designs, ranging from accelerators tightly coupled with CPUs (and using their MMUs) to fully independent accelerators that have their own MMUs. We find that, regardless of the nature of the CPU-accelerator communication, accelerators should not rely on the CPU MMU for any aspect of address translation, and instead each must have its own, local, fully-fledged MMU. That MMU, however, can be as application-specific as the accelerator...
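An interface-level sketch of that conclusion: the accelerator carries its own small TLB and its own walker, sharing only the in-memory page tables with the CPU rather than borrowing the CPU's MMU. The names and the single-level table below are simplifications for illustration, not the paper's design.

```cpp
// Sketch of an accelerator-local MMU (C++17); single-level table for brevity.
#include <cstdint>
#include <optional>
#include <unordered_map>

struct AcceleratorMMU {
    // Local, accelerator-sized TLB.
    std::unordered_map<uint64_t, uint64_t> tlb;            // VPN -> PFN
    // The page table is shared with the CPU; only the walker is local.
    const std::unordered_map<uint64_t, uint64_t>* page_table;

    std::optional<uint64_t> translate(uint64_t vaddr) {
        uint64_t vpn = vaddr >> 12;
        if (auto it = tlb.find(vpn); it != tlb.end())
            return (it->second << 12) | (vaddr & 0xFFF);   // local TLB hit
        // Local page walk; no round trip to the CPU's MMU.
        if (auto it = page_table->find(vpn); it != page_table->end()) {
            tlb[vpn] = it->second;
            return (it->second << 12) | (vaddr & 0xFFF);
        }
        return std::nullopt;                               // fault: raise to the host OS
    }
};
```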