Guilherme Cox

ORCID: 0000-0001-8292-4554
Research Areas
  • Parallel Computing and Optimization Techniques
  • Advanced Data Storage Technologies
  • Cloud Computing and Resource Management
  • Network Packet Processing and Optimization
  • Distributed Systems and Fault Tolerance
  • Interconnection Networks and Systems
  • Advanced Vision and Imaging
  • Embedded Systems Design Techniques
  • Computer Graphics and Visualization Techniques
  • Distributed and Parallel Computing Systems
  • Advanced Memory and Neural Computing
  • Security and Verification in Computing
  • 3D Shape Modeling and Analysis
  • Domain Adaptation and Few-Shot Learning
  • Radiation Detection and Scintillator Technologies
  • Ferroelectric and Negative Capacitance Devices
  • Real-Time Simulation and Control Systems
  • Image Enhancement Techniques
  • Cellular Automata and Applications

Nvidia (United States)
2023-2024

University of California, Santa Cruz
2019

Rutgers, The State University of New Jersey
2012-2018

Universidade Federal do Rio de Janeiro
2009

Universidade do Estado do Rio de Janeiro
2009

Processors and operating systems (OSes) support multiple memory page sizes. Superpages increase Translation Lookaside Buffer (TLB) hits, while small pages provide fine-grained protection. Ideally, TLBs should perform well for any distribution of page sizes. In reality, set-associative TLBs -- used frequently for their energy efficiency compared to fully-associative TLBs -- cannot (easily) support multiple page sizes concurrently. Instead, commercial systems typically implement separate set-associative TLBs for different page sizes. This means that when superpages are allocated...

10.1145/3037697.3037704 article EN 2017-04-04
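
The core tension in this abstract can be illustrated with a toy set-index calculation: a set-associative TLB picks a set from bits of the virtual page number, but which bits form the page number depends on the page size, which is not known until the translation is found. The sketch below is illustrative only; the constants are typical x86-64 page sizes and a hypothetical set count, not parameters from the paper.

    // Illustrative only: why one set-associative TLB struggles with mixed page
    // sizes. The set index is derived from the virtual page number (VPN), but
    // the VPN depends on the page size -- which is unknown before the lookup.
    // Constants are typical x86-64 sizes (4 KiB / 2 MiB), not from the paper.
    #include <cstdint>
    #include <cstdio>

    constexpr int SETS = 128;               // hypothetical number of TLB sets

    uint64_t set_index(uint64_t vaddr, uint64_t page_bytes) {
        uint64_t vpn = vaddr / page_bytes;  // virtual page number for this size
        return vpn % SETS;                  // simple modulo set indexing
    }

    int main() {
        uint64_t va = 0x7f3a12345678ULL;
        // The same address maps to different sets depending on the assumed page
        // size, so hardware must probe per size or keep separate structures.
        printf("4 KiB assumption -> set %llu\n",
               (unsigned long long)set_index(va, 4ULL << 10));
        printf("2 MiB assumption -> set %llu\n",
               (unsigned long long)set_index(va, 2ULL << 20));
        return 0;
    }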

Recent studies on commercial hardware demonstrated that irregular GPU applications can bottleneck on virtual-to-physical address translations. In this work, we explore ways to reduce translation overheads for such applications. We discover that the order in which the GPU's translation requests (specifically, page table walks) are serviced plays a key role in determining the amount of translation overhead experienced by an application. We find that different SIMD instructions executed by an application require vastly different amounts of work to service their translation needs,...

10.1109/isca.2018.00025 article EN 2018-06-01

To improve system performance, operating systems (OSes) often undertake activities that require modification of virtual-to-physical address translations. For example, the OS may migrate data between physical pages to manage heterogeneous memory devices. We refer to such activities as page remappings. Unfortunately, page remappings are expensive. We show that a big part of this cost arises from translation coherence, particularly on systems employing virtualization. In response, we propose hardware translation invalidation and coherence or...

10.1145/3079856.3080211 preprint EN 2017-06-15

Future servers will incorporate many active low-power modes for different system components, such as cores and memory. Though these modes provide flexibility for power management via Dynamic Voltage and Frequency Scaling (DVFS), they must be operated in a coordinated manner. Such coordinated control creates a combinatorial space of possible mode configurations. Given the rapid growth in the number of cores, it is becoming increasingly challenging to quickly select the configuration that maximizes performance under a given power budget....

10.1109/ispass.2016.7482074 article EN 2016-04-01
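
The combinatorial space mentioned in this abstract is easy to quantify with a back-of-the-envelope count: if each of N cores can sit at one of F frequency levels and memory at one of M levels, there are F^N x M possible configurations. The core and level counts below are hypothetical examples chosen only to show the scale, not the platforms evaluated in the paper.

    // Back-of-the-envelope size of the power-mode configuration space.
    // The counts below are hypothetical examples, not the paper's platforms.
    #include <cmath>
    #include <cstdio>

    int main() {
        const int cores = 64;        // example core count
        const int core_levels = 10;  // per-core DVFS levels (example)
        const int mem_levels = 5;    // memory DVFS levels (example)

        // Each core picks a level independently, so the space holds
        // core_levels^cores * mem_levels configurations.
        double log10_space = cores * std::log10((double)core_levels)
                           + std::log10((double)mem_levels);
        printf("~10^%.1f possible configurations\n", log10_space);
        // Roughly 10^64.7 here -- far too many to search exhaustively every
        // time the power budget changes, which is the problem targeted above.
        return 0;
    }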

Recent studies have shown the potential of last-level TLBs shared by multiple cores in tackling the memory translation performance challenges posed by "big data" workloads. A key stumbling block hindering their effectiveness, however, is their high access time. We present a design methodology to reduce these access times so as to realize high-performance and scalable shared L2 TLBs. As a first step, we study the benefits of replacing a monolithic shared TLB with a distributed set of small TLB slices. While this approach does reduce lookup latency, it...

10.1109/micro.2018.00030 article EN 2018-10-01

To improve system performance, modern operating systems (OSes) often undertake activities that require modification of virtual-to-physical page translation mappings. For example, the OS may migrate data between physical frames to defragment memory and enable superpages, or move pages between heterogeneous memory devices. We refer to all such activities as page remappings. Unfortunately, page remappings are expensive. We show that translation coherence is a major culprit, and that systems employing virtualization are especially badly affected by their overheads. In response,...

10.48550/arxiv.1701.07517 preprint EN other-oa arXiv (Cornell University) 2017-01-01

Direct volume rendering has become a popular technique for visualizing volumetric data from sources such as scientific simulations, analytic functions, and medical scanners, among others. Volume rendering algorithms, such as raycasting, can produce high-quality images; however, the use of raycasting has been limited due to its high demands on computational power and memory bandwidth. In this paper, we propose a new implementation of the raycasting algorithm that takes advantage of the highly parallel architecture of the Cell Broadband Engine...

10.1109/sbac-pad.2009.15 article EN 2009-10-01
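
Raycasting's appetite for compute and memory bandwidth comes from marching every ray through the volume and compositing samples front to back. The fragment below sketches that standard inner loop for a single ray; it is an illustrative sample with a made-up sample_volume stub, not the Cell Broadband Engine implementation described in the paper.

    // Standard front-to-back compositing for one ray -- the loop a raycaster
    // repeats for every pixel. Illustrative sketch only; sample_volume is a
    // stand-in for trilinear interpolation plus a transfer-function lookup.
    #include <cstdio>

    struct Sample { float r, g, b, a; };    // pre-multiplied color + opacity

    Sample sample_volume(int /*step*/) {
        float a = 0.05f;                     // fake, constant-opacity sample
        return { 0.8f * a, 0.6f * a, 0.4f * a, a };
    }

    int main() {
        float r = 0, g = 0, b = 0, alpha = 0;
        for (int step = 0; step < 256 && alpha < 0.99f; ++step) {  // early ray termination
            Sample s = sample_volume(step);
            // Front-to-back "over" compositing.
            r     += (1.0f - alpha) * s.r;
            g     += (1.0f - alpha) * s.g;
            b     += (1.0f - alpha) * s.b;
            alpha += (1.0f - alpha) * s.a;
        }
        printf("final pixel: %.3f %.3f %.3f (alpha %.3f)\n", r, g, b, alpha);
        return 0;
    }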

Processors and operating systems (OSes) support multiple memory page sizes. Superpages increase Translation Lookaside Buffer (TLB) hits, while small pages provide fine-grained protection. Ideally, TLBs should perform well for any distribution of page sizes. In reality, set-associative TLBs -- used frequently for their energy efficiency compared to fully-associative TLBs -- cannot (easily) support multiple page sizes concurrently. Instead, commercial systems typically implement separate set-associative TLBs for different page sizes. This means that when superpages are allocated...

10.1145/3093315.3037704 article EN ACM SIGOPS Operating Systems Review 2017-04-04

Many security and forensic analyses rely on the ability to fetch memory snapshots from a target machine. To date, the community has relied on virtualization, external hardware, or trusted hardware to obtain such snapshots. These techniques either sacrifice snapshot consistency or degrade the performance of applications executing atop the target. We present SnipSnap, a new snapshot acquisition system based on on-package DRAM technologies that offers consistency without excessively hurting the performance of the target's applications. We realize SnipSnap and evaluate its...

10.1145/3176258.3176325 article EN 2018-03-13

Future servers will incorporate many active low-power modes for each core and the main memory subsystem. Though these modes provide flexibility for power and/or energy management via Dynamic Voltage and Frequency Scaling (DVFS), prior work has shown that they must be managed in a coordinated manner. This requirement creates a combinatorial space of possible mode configurations. As a result, it becomes increasingly challenging to quickly select the configuration that optimizes both performance and power/energy efficiency....

10.1145/3086504 article EN ACM Transactions on Modeling and Performance Evaluation of Computing Systems 2017-09-05

Processors and operating systems (OSes) support multiple memory page sizes. Superpages increase Translation Lookaside Buffer (TLB) hits, while small pages provide fine-grained protection. Ideally, TLBs should perform well for any distribution of page sizes. In reality, set-associative TLBs -- used frequently for their energy efficiency compared to fully-associative TLBs -- cannot (easily) support multiple page sizes concurrently. Instead, commercial systems typically implement separate set-associative TLBs for different page sizes. This means that when superpages are allocated...

10.1145/3093336.3037704 article EN ACM SIGPLAN Notices 2017-04-04

An increasing number of applications benefit from heterogeneous hardware accelerators. Such accelerators often require the application to manually manage memory buffers on the devices and transfer data between host and device buffers. A programming model that unifies the virtual address space across the host and devices is appealing because it enables automatic data transfers and simplifies application-level programming. However, the automatic transfers can sometimes be redundant, which decreases performance. NVIDIA's UVM (Unified Virtual Memory) driver provides a...

10.1109/iiswc55918.2022.00013 article EN 2022-11-01
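
The programming-model difference this abstract refers to is visible in a few lines of the public CUDA runtime API: with explicit buffers the host must allocate device memory and copy data in both directions, whereas with managed (unified) memory a single allocation is valid on host and device and the driver migrates pages on demand. This is a minimal sketch of that API, not code from the paper, and error checking is omitted.

    // Minimal unified-memory example: one pointer is touched on the host and
    // on the device; the UVM driver migrates pages automatically (sometimes
    // redundantly, as the abstract notes). Not code from the paper.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float* data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float));   // visible to host and device
        for (int i = 0; i < n; ++i) data[i] = 1.0f;    // touched on the host
        scale<<<(n + 255) / 256, 256>>>(data, n);      // touched on the device
        cudaDeviceSynchronize();
        printf("data[0] = %f\n", data[0]);             // host access again
        cudaFree(data);
        // The explicit alternative needs cudaMalloc plus cudaMemcpy to the
        // device before the kernel and back to the host afterwards.
        return 0;
    }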

Direct volume rendering of irregular 3D datasets demands high computational power and memory bandwidth. Recent research on optimizing these algorithms explores the processing power offered by a new trend in hardware design: multithreaded accelerator devices. Accelerators like Graphics Processing Units (GPUs) and the Cell Broadband Engine processor (Cell BE), used as integrated coprocessors, off-load the application from the CPU and offer promising speedups. The difficulty of using these devices, however, is how to program...

10.1145/2141702.2141710 article EN 2012-02-26

Memory prefetching improves performance across many systems layers. However, achieving high prefetch accuracy with low overhead is challenging, as memory hierarchies and application access patterns become more complicated. Furthermore, a prefetcher's ability to adapt to new access patterns as they emerge is becoming more crucial than ever. Recent work has demonstrated the use of deep learning techniques to improve prefetch accuracy, albeit with impractical compute and storage overheads. This paper suggests taking inspiration from mechanisms...

10.1145/3593856.3595901 article EN 2023-06-22
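
For context on what a conventional prefetcher tracks, the sketch below implements a textbook table-based stride prefetcher: one entry per load PC records the last address and observed stride, and a prefetch is issued once the stride repeats. This classic baseline is included only to ground the discussion of accuracy and adaptivity; it is not the learning-inspired design the paper proposes.

    // Textbook stride prefetcher: per-PC entries remember the last address and
    // stride; a prefetch is issued when the same stride is seen twice in a row.
    // Conventional baseline only -- not the paper's proposal.
    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    struct Entry { uint64_t last_addr = 0; int64_t stride = 0; bool confident = false; };

    std::unordered_map<uint64_t, Entry> table;   // indexed by load PC

    // Returns the address to prefetch, or 0 if there is no confident prediction.
    uint64_t on_access(uint64_t pc, uint64_t addr) {
        Entry& e = table[pc];
        int64_t stride = (int64_t)addr - (int64_t)e.last_addr;
        uint64_t prefetch = 0;
        if (e.confident && stride == e.stride && stride != 0)
            prefetch = addr + stride;            // next address in the pattern
        e.confident = (stride == e.stride);      // require two matching strides
        e.stride = stride;
        e.last_addr = addr;
        return prefetch;
    }

    int main() {
        // A 64-byte strided stream becomes predictable after a few accesses.
        for (uint64_t a = 0x1000; a < 0x1400; a += 64) {
            uint64_t p = on_access(/*pc=*/0x400123, a);
            if (p) printf("access %#llx -> prefetch %#llx\n",
                          (unsigned long long)a, (unsigned long long)p);
        }
        return 0;
    }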

To improve system performance, operating systems (OSes) often undertake activities that require modification of virtual-to-physical address translations. For example, the OS may migrate data between physical pages to manage heterogeneous memory devices. We refer to such activities as page remappings. Unfortunately, page remappings are expensive. We show that a big part of this cost arises from translation coherence, particularly on systems employing virtualization. In response, we propose hardware translation invalidation and coherence or...

10.1145/3273982.3273988 article EN ACM SIGOPS Operating Systems Review 2018-08-28

Future servers will incorporate many active low-power modes for different system components, such as cores and memory. Though these modes provide flexibility for power management via Dynamic Voltage and Frequency Scaling (DVFS), they must be operated in a coordinated manner. Such coordinated control creates a combinatorial space of possible mode configurations. Given the rapid growth in the number of cores, it is becoming increasingly challenging to quickly select the configuration that maximizes performance under a given power budget. Prior...

10.48550/arxiv.1603.01313 preprint EN other-oa arXiv (Cornell University) 2016-01-01

This tutorial is intended for programmers who are interested in boosting the performance of their graphics applications using a different architectural paradigm: the Cell Broadband Engine (Cell BE). Our main idea is to focus on performance issues that can be efficiently handled by the multicore and vector facilities of the Cell BE. We aim to offer an alternative path to high performance, rather than the use of graphics processing units (GPUs). The Cell BE processor was the first implementation of a chip multiprocessor with a significant number of general-purpose...

10.1109/sibgrapi-tutorials.2009.12 article EN 2009-10-01

To improve system performance, operating systems (OSes) often undertake activities that require modification of virtual-to-physical address translations. For example, the OS may migrate data between physical pages to manage heterogeneous memory devices. We refer to such activities as page remappings. Unfortunately, page remappings are expensive. We show that a big part of this cost arises from translation coherence, particularly on systems employing virtualization. In response, we propose hardware translation invalidation and coherence or...

10.1145/3140659.3080211 article EN ACM SIGARCH Computer Architecture News 2017-06-24

Processors and operating systems (OSes) support multiple memory page sizes. Superpages increase Translation Lookaside Buffer (TLB) hits, while small pages provide fine-grained protection. Ideally, TLBs should perform well for any distribution of page sizes. In reality, set-associative TLBs -- used frequently for their energy efficiency compared to fully-associative TLBs -- cannot (easily) support multiple page sizes concurrently. Instead, commercial systems typically implement separate set-associative TLBs for different page sizes. This means that when superpages are allocated...

10.1145/3093337.3037704 article EN ACM SIGARCH Computer Architecture News 2017-04-04

This paper presents a broad, pathfinding design space exploration of memory management units (MMUs) for heterogeneous systems. We consider a variety of designs, ranging from accelerators tightly coupled with CPUs (and using their MMUs) to fully independent accelerators that have their own MMUs. We find that, regardless of the CPU-accelerator communication mechanism, accelerators should not rely on the CPU MMU for any aspect of address translation, and instead must have their own, local, fully-fledged MMU. That MMU, however, can be as application-specific as the accelerator...

10.48550/arxiv.1707.09450 preprint EN other-oa arXiv (Cornell University) 2017-01-01
Co-Chairs: Bridges, Ron Brightwell, Patrick McCormick, Martin Schulz, Eishi Arima, and 95 more committee members (including Guilherme Cox).

10.1109/cluster.2019.8891050 article 2019-09-01