- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Cloud Computing and Resource Management
- Network Packet Processing and Optimization
- Distributed Systems and Fault Tolerance
- Interconnection Networks and Systems
- Advanced Vision and Imaging
- Embedded Systems Design Techniques
- Computer Graphics and Visualization Techniques
- Distributed and Parallel Computing Systems
- Advanced Memory and Neural Computing
- Security and Verification in Computing
- 3D Shape Modeling and Analysis
- Domain Adaptation and Few-Shot Learning
- Radiation Detection and Scintillator Technologies
- Ferroelectric and Negative Capacitance Devices
- Real-Time Simulation and Control Systems
- Image Enhancement Techniques
- Cellular Automata and Applications
Nvidia (United States)
2023-2024
University of California, Santa Cruz
2019
Rutgers Sexual and Reproductive Health and Rights
2016-2018
Rutgers, The State University of New Jersey
2012-2018
Universidade Federal do Rio de Janeiro
2009
Universidade do Estado do Rio de Janeiro
2009
Processors and operating systems (OSes) support multiple memory page sizes. Superpages increase Translation Lookaside Buffer (TLB) hits, while small pages provide fine-grained protection. Ideally, TLBs should perform well for any distribution of page sizes. In reality, set-associative TLBs -- used frequently for their energy efficiency compared to fully-associative TLBs -- cannot (easily) support multiple page sizes concurrently. Instead, commercial designs typically implement separate set-associative TLBs for different page sizes. This means that when superpages are allocated...
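The split-TLB organization this abstract describes can be made concrete with a toy model (a sketch only; the structure sizes, page sizes, and replacement policy below are illustrative, not any real product's design). Because each page size gets its own structure, a miss in one TLB cannot be absorbed by spare capacity in the other, which is the imbalance the abstract points at.

```cpp
// Toy model of per-page-size TLBs; capacities and policies are illustrative.
#include <cstdint>
#include <vector>

struct SetAssocTLB {
    SetAssocTLB(size_t sets, size_t ways)
        : sets_(sets), ways_(ways), entries_(sets, std::vector<uint64_t>()) {}

    bool lookup(uint64_t vpn) {
        auto& set = entries_[vpn % sets_];
        for (uint64_t tag : set)
            if (tag == vpn) return true;             // TLB hit
        return false;                                // TLB miss
    }

    void fill(uint64_t vpn) {
        auto& set = entries_[vpn % sets_];
        if (set.size() == ways_) set.erase(set.begin());  // evict (FIFO for simplicity)
        set.push_back(vpn);
    }

private:
    size_t sets_, ways_;
    std::vector<std::vector<uint64_t>> entries_;
};

// Separate structures per page size, as in the split designs described above.
struct SplitTLB {
    SetAssocTLB small{16, 4};   // e.g., 4 KB pages
    SetAssocTLB super{4, 4};    // e.g., 2 MB superpages

    bool access(uint64_t vaddr, bool is_superpage) {
        uint64_t vpn = vaddr >> (is_superpage ? 21 : 12);
        SetAssocTLB& tlb = is_superpage ? super : small;
        if (tlb.lookup(vpn)) return true;
        tlb.fill(vpn);          // models the page-table walk and fill on a miss
        return false;
    }
};
```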
Recent studies on commercial hardware demonstrated that irregular GPU applications can bottleneck on virtual-to-physical address translations. In this work, we explore ways to reduce translation overheads for such applications. We discover that the order of servicing the GPU's translation requests (specifically, page table walks) plays a key role in determining the amount of overhead experienced by an application. We find that different SIMD instructions executed by an application require vastly different amounts of work to service their translation needs, ...
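A minimal sketch of the kind of ordering decision at stake: group pending page table walks by the SIMD instruction (warp) that generated them and service the group needing the least remaining work first. This shortest-job-first style heuristic is offered only to illustrate the observation that instructions differ widely in translation work; it is not claimed to be the paper's exact policy.

```cpp
// Illustrative walk-scheduling heuristic, not the paper's actual mechanism.
#include <algorithm>
#include <cstdint>
#include <vector>

struct WalkGroup {
    uint32_t warp_id;                    // SIMD instruction / warp that stalled
    std::vector<uint64_t> pending_vpns;  // distinct pages still to be walked
};

// Service the group that can be unblocked with the least remaining work first,
// so cheap instructions do not wait behind expensive, divergent ones.
void schedule_walks(std::vector<WalkGroup>& groups) {
    std::sort(groups.begin(), groups.end(),
              [](const WalkGroup& a, const WalkGroup& b) {
                  return a.pending_vpns.size() < b.pending_vpns.size();
              });
}
```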
To improve system performance, operating systems (OSes) often undertake activities that require modification of virtual-to-physical address translations. For example, the OS may migrate data between physical pages to manage heterogeneous memory devices. We refer to such activities as page remappings. Unfortunately, remappings are expensive. We show that a big part of this cost arises from translation coherence, particularly on systems employing virtualization. In response, we propose hardware translation invalidation and coherence, or...
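To make the cost structure concrete, here is a minimal sketch of a page remapping and the translation-coherence work it triggers. The function names are hypothetical stand-ins for OS internals; the point is that beyond the data copy, every core that may cache the stale translation must be interrupted and flush it (a TLB shootdown), and under virtualization both guest and host translations add to that work.

```cpp
// Sketch only: hypothetical OS-internal helpers modeling a page remapping.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

struct Core { int id; };

// Stubs standing in for OS internals (hypothetical names).
void send_shootdown_ipi(const Core& c, uint64_t vaddr) {
    std::printf("IPI to core %d: invalidate VA 0x%llx\n", c.id,
                (unsigned long long)vaddr);
}
void invalidate_local_tlb(uint64_t) { /* e.g., an INVLPG-style operation */ }
void update_page_table(uint64_t, void*) { /* rewrite the PTE */ }

void remap_page(uint64_t vaddr, void* old_frame, void* new_frame,
                const std::vector<Core>& sharers) {
    std::memcpy(new_frame, old_frame, 4096);   // move the data (e.g., between memory tiers)
    update_page_table(vaddr, new_frame);       // install the new translation
    invalidate_local_tlb(vaddr);
    // The expensive part: synchronous shootdowns to every core that may cache
    // the stale translation. Under virtualization, guest and host each keep
    // translations, multiplying this coherence work.
    for (const Core& c : sharers) send_shootdown_ipi(c, vaddr);
}
```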
Future servers will incorporate many active low-power modes for different system components, such as cores and memory. Though these modes provide flexibility for power management via Dynamic Voltage Frequency Scaling (DVFS), they must be operated in a coordinated manner. Such coordinated control creates a combinatorial space of possible mode configurations. Given the rapid growth in the number of cores, it is becoming increasingly challenging to quickly select the configuration that maximizes performance under a given power budget....
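The combinatorial search the abstract refers to can be illustrated with a toy exhaustive selector: pick one mode per component to maximize predicted performance under a power budget. The performance and power models below are placeholders; the takeaway is that the space grows as (#modes)^(#components), which is why faster selection policies are needed.

```cpp
// Toy exhaustive DVFS configuration search; models are placeholders.
#include <algorithm>
#include <vector>

struct Mode   { double freq_ghz; double power_w; };
struct Config { int core_mode; int mem_mode; double perf; double power; };

Config pick_best(const std::vector<Mode>& core_modes,
                 const std::vector<Mode>& mem_modes,
                 double power_budget_w) {
    Config best{-1, -1, 0.0, 0.0};
    for (size_t c = 0; c < core_modes.size(); ++c) {
        for (size_t m = 0; m < mem_modes.size(); ++m) {
            double power = core_modes[c].power_w + mem_modes[m].power_w;
            if (power > power_budget_w) continue;   // violates the budget
            // Placeholder performance model: throughput limited by the slower side.
            double perf = std::min(core_modes[c].freq_ghz, 2.0 * mem_modes[m].freq_ghz);
            if (perf > best.perf) best = {(int)c, (int)m, perf, power};
        }
    }
    return best;
}
```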
Recent studies have shown the potential of last-level TLBs shared by multiple cores in tackling the memory translation performance challenges posed by "big data" workloads. A key stumbling block hindering their effectiveness, however, is their high access time. We present a design methodology to reduce these access times so as to realize high-performance and scalable shared L2 TLBs. As a first step, we study the benefits of replacing the monolithic shared TLB with a distributed set of small TLB slices. While this approach does reduce lookup latency, it...
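A brief sketch of the distribution step described above: each virtual page is steered to a home slice by hashing its virtual page number, so every slice stays small and fast, at the cost of a trip across the on-chip network to reach that slice. The slice count and hash below are illustrative only.

```cpp
// Illustrative VPN-to-slice steering for a distributed shared L2 TLB.
#include <cstdint>

constexpr int kNumSlices = 8;   // illustrative slice count

int home_slice(uint64_t vpn) {
    // Simple XOR-fold hash to spread pages across slices.
    return static_cast<int>((vpn ^ (vpn >> 7) ^ (vpn >> 17)) % kNumSlices);
}
```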
To improve system performance, modern operating systems (OSes) often undertake activities that require modification of virtual-to-physical page translation mappings. For example, the OS may migrate data between physical frames to defragment memory and enable superpages, or to move pages between heterogeneous memory devices. We refer to all such activities as page remappings. Unfortunately, remappings are expensive. We show that translation coherence is a major culprit, and that systems employing virtualization are especially badly affected by their overheads. In response,...
Direct volume rendering has become a popular technique for visualizing volumetric data from sources such as scientific simulations, analytic functions, and medical scanners, among others. Volume rendering algorithms, such as raycasting, can produce high-quality images; however, the use of raycasting has been limited due to its high demands on computational power and memory bandwidth. In this paper, we propose a new implementation of the raycasting algorithm that takes advantage of the highly parallel architecture of the Cell Broadband Engine...
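For context, this is the per-ray kernel that raycasting-based direct volume rendering repeats for every pixel and that such work parallelizes across cores; the compositing loop below is a generic front-to-back sketch with placeholder sampling and transfer functions, not the paper's optimized Cell BE implementation.

```cpp
// Generic front-to-back raycasting/compositing for one ray; placeholders only.
#include <array>

struct RGBA { float r, g, b, a; };

RGBA classify(float d) { return {d, d, d, d * 0.05f}; }       // toy transfer function
float sample_volume(float x, float y, float z) {              // toy density field
    return (x * x + y * y + z * z) < 1.0f ? 1.0f : 0.0f;
}

RGBA cast_ray(std::array<float, 3> origin, std::array<float, 3> dir,
              float step, int max_steps) {
    RGBA out{0, 0, 0, 0};
    for (int i = 0; i < max_steps && out.a < 0.99f; ++i) {    // early ray termination
        float t = step * i;
        float d = sample_volume(origin[0] + dir[0] * t,
                                origin[1] + dir[1] * t,
                                origin[2] + dir[2] * t);
        RGBA s = classify(d);
        float w = (1.0f - out.a) * s.a;                       // front-to-back "over" operator
        out.r += w * s.r; out.g += w * s.g; out.b += w * s.b;
        out.a += w;
    }
    return out;
}
```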
Many security and forensic analyses rely on the ability to fetch memory snapshots from a target machine. To date, the community has relied on virtualization, external hardware, or trusted components to obtain such snapshots. These techniques either sacrifice snapshot consistency or degrade the performance of applications executing atop the target. We present SnipSnap, a new snapshot acquisition system based on on-package DRAM technologies that offers consistent snapshots without excessively hurting the performance of the target's applications. We realize SnipSnap and evaluate its...
Future servers will incorporate many active low-power modes for each core and the main memory subsystem. Though these modes provide flexibility for power and/or energy management via Dynamic Voltage Frequency Scaling (DVFS), prior work has shown that they must be managed in a coordinated manner. This requirement creates a combinatorial space of possible mode configurations. As a result, it becomes increasingly challenging to quickly select the configuration that optimizes both performance and power/energy efficiency....
An increasing number of applications benefit from heterogeneous hardware accelerators. Such accelerators often require the application to manually manage memory buffers on the devices and transfer data between host and device buffers. A programming model that unifies the virtual address space across the host and devices is appealing because it enables automatic data transfers and simplifies application-level programming. However, these automatic transfers can sometimes be redundant, which decreases performance. NVIDIA's UVM (unified memory) driver provides a...
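The host-side sketch below shows the unified-memory programming model the abstract discusses: a single pointer usable on CPU and GPU, with the UVM driver migrating pages on demand. The prefetch/advise hints illustrate one generic way applications curb redundant automatic transfers; this is standard CUDA runtime API usage, not a description of the paper's own technique, and it requires the CUDA toolkit to build.

```cpp
// Host-side C++ using the CUDA runtime (compile with nvcc or link -lcudart).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1 << 20;
    float* data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return 1;

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // touched on the CPU first

    // Without hints, the first GPU access would fault pages over one at a time.
    // Bulk prefetching to device 0 avoids that per-page migration traffic.
    cudaMemAdvise(data, n * sizeof(float), cudaMemAdviseSetReadMostly, 0);
    cudaMemPrefetchAsync(data, n * sizeof(float), /*dstDevice=*/0);

    // ... a kernel that reads `data` would be launched here (omitted) ...

    cudaDeviceSynchronize();
    std::printf("first element: %f\n", data[0]);     // pages migrate back on CPU access
    cudaFree(data);
    return 0;
}
```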
Direct volume rendering of irregular 3D datasets demands high computational power and memory bandwidth. Recent research on optimizing these algorithms is exploring the processing power offered by a new trend in hardware design: multithreaded accelerator devices. Accelerators like Graphics Processing Units (GPUs) and the Cell Broadband Engine processor (Cell BE), used as integrated coprocessors, off-load application work from the CPU and offer promising speedups. The difficulty in using these devices, however, is how to program...
Memory prefetching improves performance across many system layers. However, achieving high prefetch accuracy with low overhead is challenging, as memory hierarchies and application access patterns become more complicated. Furthermore, a prefetcher's ability to adapt to new access patterns as they emerge is becoming more crucial than ever. Recent work has demonstrated the use of deep learning techniques to improve prefetch accuracy, albeit with impractical compute and storage overheads. This paper suggests taking inspiration from mechanisms...
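To make the accuracy-versus-overhead tension concrete, here is a sketch of a classic low-overhead, table-based stride prefetcher of the kind learning-based proposals are typically compared against; it is included for illustration and is not the design proposed in the paper.

```cpp
// Illustrative PC-indexed stride prefetcher; not the paper's mechanism.
#include <cstdint>
#include <unordered_map>
#include <vector>

class StridePrefetcher {
public:
    // Called on every demand access; returns addresses to prefetch (possibly none).
    std::vector<uint64_t> on_access(uint64_t pc, uint64_t addr) {
        std::vector<uint64_t> prefetches;
        auto it = table_.find(pc);
        if (it != table_.end()) {
            int64_t stride = static_cast<int64_t>(addr) -
                             static_cast<int64_t>(it->second.last_addr);
            if (stride != 0 && stride == it->second.last_stride) {
                ++it->second.confidence;
                if (it->second.confidence >= 2)            // pattern looks stable
                    prefetches.push_back(addr + stride);   // prefetch next element in stride
            } else {
                it->second.confidence = 0;
            }
            it->second.last_stride = stride;
            it->second.last_addr = addr;
        } else {
            table_[pc] = {addr, 0, 0};
        }
        return prefetches;
    }

private:
    struct Entry { uint64_t last_addr; int64_t last_stride; int confidence; };
    std::unordered_map<uint64_t, Entry> table_;   // indexed by load PC
};
```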
This tutorial is intended for programmers who are interested in boosting their graphics applications using a different architectural paradigm: the Cell Broadband Engine (Cell BE). Our main idea is to focus on performance issues that can be efficiently handled by the multicore and vector facilities of the Cell BE. We aim to offer an alternative way to achieve high performance, rather than the use of graphics processing units (GPUs). The Cell BE processor is the first implementation of a chip multiprocessor with a significant number of general-purpose...
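The core pattern such a tutorial revolves around is partitioning a graphics workload across parallel workers. The portable sketch below uses std::thread only as a stand-in to show that structure; on the Cell BE the workers would be SPE programs fed via explicit DMA, and the inner loop would use the SPE's 128-bit SIMD facilities.

```cpp
// Portable stand-in for multicore work partitioning (std::thread, not SPEs).
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

void render_rows(std::vector<float>& image, size_t width,
                 size_t row_begin, size_t row_end) {
    for (size_t y = row_begin; y < row_end; ++y)
        for (size_t x = 0; x < width; ++x)
            image[y * width + x] = static_cast<float>(x + y);  // placeholder shading
}

void render(std::vector<float>& image, size_t width, size_t height,
            unsigned workers) {
    std::vector<std::thread> pool;
    size_t rows_per_worker = (height + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        size_t begin = w * rows_per_worker;
        size_t end = std::min(height, begin + rows_per_worker);
        if (begin >= end) break;
        pool.emplace_back(render_rows, std::ref(image), width, begin, end);
    }
    for (auto& t : pool) t.join();
}
```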
This paper presents a broad, pathfinding design space exploration of memory management units (MMUs) for heterogeneous systems. We consider a variety of designs, ranging from accelerators tightly coupled with CPUs (and using their MMUs) to fully independent accelerators that have their own MMUs. We find that, regardless of the nature of the CPU-accelerator communication, accelerators should not rely on the CPU MMU for any aspect of address translation, and instead each must have its own, local, fully-fledged MMU. That MMU, however, can be as application-specific as the accelerator...
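An interface-level sketch of that conclusion: the accelerator carries its own small TLB and its own walker, sharing only the in-memory page tables with the CPU rather than borrowing the CPU's MMU. The names and the single-level table below are simplifications for illustration, not the paper's design.

```cpp
// Sketch of an accelerator-local MMU (C++17); single-level table for brevity.
#include <cstdint>
#include <optional>
#include <unordered_map>

struct AcceleratorMMU {
    // Local, accelerator-sized TLB.
    std::unordered_map<uint64_t, uint64_t> tlb;            // VPN -> PFN
    // The page table is shared with the CPU; only the walker is local.
    const std::unordered_map<uint64_t, uint64_t>* page_table;

    std::optional<uint64_t> translate(uint64_t vaddr) {
        uint64_t vpn = vaddr >> 12;
        if (auto it = tlb.find(vpn); it != tlb.end())
            return (it->second << 12) | (vaddr & 0xFFF);   // local TLB hit
        // Local page walk; no round trip to the CPU's MMU.
        if (auto it = page_table->find(vpn); it != page_table->end()) {
            tlb[vpn] = it->second;
            return (it->second << 12) | (vaddr & 0xFFF);
        }
        return std::nullopt;                               // fault: raise to the host OS
    }
};
```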