Sudhanva Gurumurthi

ORCID: 0000-0002-1740-7304
Research Areas
  • Parallel Computing and Optimization Techniques
  • Advanced Data Storage Technologies
  • Radiation Effects in Electronics
  • Cloud Computing and Resource Management
  • Semiconductor materials and devices
  • Distributed systems and fault tolerance
  • Distributed and Parallel Computing Systems
  • Interconnection Networks and Systems
  • Caching and Content Delivery
  • Security and Verification in Computing
  • Magnetic properties of thin films
  • Advanced Memory and Neural Computing
  • Advancements in Semiconductor Devices and Circuit Design
  • Software System Performance and Reliability
  • VLSI and Analog Circuit Testing
  • Software Reliability and Analysis Research
  • Low-power high-performance VLSI design
  • Reliability and Maintenance Optimization
  • Data Management and Algorithms
  • 3D IC and TSV technologies
  • Embedded Systems Design Techniques
  • Real-Time Systems Scheduling
  • Context-Aware Activity Recognition Systems
  • Iterative Learning Control Systems
  • Software-Defined Networks and 5G

Affiliations

Advanced Micro Devices (United States)
2014-2024

Google (United States)
2024

Ghent University Hospital
2023

KU Leuven
2023

University of Toronto
2023

Advanced Micro Devices (Canada)
2013-2018

IBM (United States)
2016

University of Virginia
2006-2014

McCormick (United States)
2011

Pennsylvania State University
2001-2005

Publications

Spin-Transfer Torque RAM (STT-RAM) is an emerging non-volatile memory technology and a potential universal memory that could replace SRAM in processor caches. This paper presents a novel approach for redesigning STT-RAM cells to reduce their high dynamic energy and slow write latencies. We lower the retention time by reducing the planar area of the cell, and thereby the write current, which we then use with CACTI to design caches and memories. We simulate quad-core designs using a combination of SRAM- and STT-RAM-based caches. Since ultra-low-retention cells may lose data,...

10.1109/hpca.2011.5749716 article EN 2011-02-01
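The retention relaxation above trades a cell's thermal stability for lower write energy. A minimal sketch of the standard Néel-Arrhenius retention relation is below; the attempt period and the stability factors used are illustrative assumptions, not values from the paper:

```python
import math

def retention_time_s(delta, tau0=1e-9):
    """Néel-Arrhenius retention time of an MTJ free layer.

    delta: thermal stability factor (barrier energy / kT), dimensionless.
    tau0:  attempt period, commonly taken to be on the order of 1 ns.
    """
    return tau0 * math.exp(delta)

# Thermal stability scales roughly with the free layer's planar area,
# so shrinking the cell trades retention time for a lower write current.
long_retention = retention_time_s(40)   # multi-year retention
relaxed_retention = retention_time_s(25)  # seconds-scale retention
```

Because retention falls exponentially with the stability factor, even a modest area reduction moves a cell from years of retention to seconds, which is why the relaxed design must handle possible data loss.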

A large portion of the power budget in server environments goes into the I/O subsystem, the disk array in particular. Traditional approaches to disk power management involve completely stopping the disk rotation, which can take a considerable amount of time, making them less useful in cases where the idle times between requests may not be long enough to outweigh the overheads. This paper presents a new approach called DRPM to modulate disk speed (RPM) dynamically, and gives a practical implementation to exploit this mechanism. Extensive simulations...

10.1145/859618.859638 article EN 2003-01-01
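The DRPM idea above can be sketched as a speed-selection policy: during a predicted idle window, drop to the slowest RPM level the drive can ramp back up from in time. The RPM levels, power exponent, and ramp-up rate below are illustrative assumptions, not numbers from the paper:

```python
# Illustrative DRPM-style speed selection.
RPM_LEVELS = [3600, 5400, 7200, 10000, 12000]

def spindle_power_w(rpm, full_rpm=12000, full_power_w=12.0, exponent=2.8):
    """Spindle power modeled as proportional to RPM^exponent (a common
    rule-of-thumb model; the exponent here is an assumption)."""
    return full_power_w * (rpm / full_rpm) ** exponent

def pick_rpm(predicted_idle_s, ramp_s_per_krpm=0.4, full_rpm=12000):
    """Pick the slowest RPM level whose ramp back to full speed still
    fits inside the predicted idle window."""
    for rpm in RPM_LEVELS:
        ramp_back_s = (full_rpm - rpm) / 1000 * ramp_s_per_krpm
        if ramp_back_s <= predicted_idle_s:
            return rpm
    return full_rpm
```

The key contrast with spin-down schemes is that intermediate speeds remain useful even for idle windows far too short to amortize a full stop-and-restart.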

Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current resilience techniques to determine whether they will be suitable for future systems. In this paper, we...

10.1145/2694344.2694348 article EN 2015-03-03

Power dissipation has become one of the most critical factors for the continued development of both high-end and low-end computer systems. We present a complete-system power simulator, called SoftWatt, that models the CPU, memory hierarchy, and a low-power disk subsystem, and quantifies the power behavior of both the application and the operating system. This tool, built on top of the SimOS infrastructure, uses validated analytical energy models to identify the power hotspots in the system components, capture the relative contributions of user and kernel code to the power profile, and identify power-hungry...

10.1109/hpca.2002.995705 article EN 2004-04-23

Several recent publications confirm that faults are common in high-performance computing systems. Therefore, further attention to the faults experienced by such systems is warranted. In this paper, we present a study of DRAM and SRAM faults in large high-performance computing systems. Our goal is to understand the factors that influence faults in production settings.

10.1145/2503210.2503257 article EN 2013-10-30

While graphics processing units (GPUs) have gained wide adoption as accelerators for general-purpose applications (GPGPU), the end-to-end reliability implications of their use have not been quantified. Fault injection is a widely used method for evaluating the reliability of applications. However, building a fault injector for GPGPU applications is challenging due to their massive parallelism, which makes it difficult to achieve representativeness while being time-efficient. This paper makes three key contributions. First, it presents the design of a fault-injection...

10.1109/ispass.2014.6844486 article EN 2014-03-01
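The core operation in a fault-injection campaign like the one described above is flipping a single bit of architectural state and classifying the outcome against a fault-free run. A minimal sketch of that loop (the workload, input value, and outcome labels here are illustrative, not the paper's actual injector):

```python
import random

def inject_bit_flip(value, width=32, rng=random):
    """Flip one uniformly chosen bit of a `width`-bit register value,
    mimicking a single-event upset."""
    bit = rng.randrange(width)
    return (value ^ (1 << bit)) & ((1 << width) - 1)

def campaign(workload, golden, trials=1000, seed=0):
    """Run many injections and classify each outcome against the
    golden (fault-free) result."""
    rng = random.Random(seed)
    outcomes = {"masked": 0, "sdc": 0}
    for _ in range(trials):
        faulty_input = inject_bit_flip(42, rng=rng)  # hypothetical 32-bit input
        result = workload(faulty_input)
        outcomes["masked" if result == golden else "sdc"] += 1
    return outcomes
```

The representativeness problem the paper addresses is choosing *where and when* to inject across thousands of GPU threads so that a tractable number of trials still estimates the true outcome distribution.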

Transient faults due to particle strikes are a key challenge in microprocessor design. Driven by exponentially increasing transistor counts, per-chip fault rates are a growing burden. To protect against soft errors, redundancy techniques such as redundant multithreading (RMT) are often used. However, these techniques assume that the probability that a structural fault will result in an architectural error (i.e., the Architectural Vulnerability Factor (AVF)) is 100 percent, unnecessarily draining processor resources. Due to the high cost of redundancy, there...

10.1145/1250662.1250726 article EN 2007-06-09
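The AVF mentioned above is, by definition, the average fraction of a structure's bits that are ACE (required for architecturally correct execution) over time. A minimal sketch of that computation over an occupancy trace; the trace values are a made-up example:

```python
def avf(ace_bits_per_cycle, structure_bits):
    """Architectural Vulnerability Factor: the average fraction of a
    structure's bits that are ACE over the sampled cycles."""
    cycles = len(ace_bits_per_cycle)
    return sum(ace_bits_per_cycle) / (structure_bits * cycles)

# Toy trace: a 64-bit buffer holds ACE state only part of the time,
# so its AVF is well below the 100% a conservative design assumes.
trace = [64, 64, 16, 0, 0, 0, 0, 0]
```

A measured AVF far below 100% is exactly the headroom the paper exploits: redundancy can be dialed down when the structure is unlikely to be holding ACE state.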

Processor caches already play a critical role in the performance of today's computer systems. At the same time, the integrity of the data words coming out of the caches can have serious consequences on the ability of a program to execute correctly, or even to proceed. The integrity checks need to be performed in a time-sensitive manner so as not to slow down execution when there are no errors, which is the common case, and they should not excessively increase the power budget, which is already high. The ECC and parity-based protection techniques in use today fall at either extreme in terms of compromising...

10.1109/dsn.2003.1209939 article EN 2004-06-22
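The parity end of the spectrum above is cheap but detection-only; a one-line sketch makes the limitation concrete (this is generic parity, not the paper's proposed scheme):

```python
def parity_bit(word):
    """Even parity over a word: detects any single-bit error, but
    unlike ECC it cannot say which bit flipped, so it cannot correct."""
    return bin(word).count("1") % 2

def check(word, stored_parity):
    """True if the word is consistent with the parity stored at write time."""
    return parity_bit(word) == stored_parity
```

ECC codes such as SECDED pay more check bits and XOR logic per access for the ability to correct; the paper's point is that both fixed choices compromise either power or protection.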

Spin-Transfer Torque RAM (STT-RAM) has emerged as a potential candidate for universal memory. However, there are two challenges to using STT-RAM in memory system design: (1) the intrinsic variation in the storage element, the Magnetic Tunnel Junction (MTJ), and (2) the high write energy. In this paper, we present a physically based thermal noise model for simulating the statistical variations of MTJs. We have implemented it in HSPICE and validated it against analytical results. We demonstrate its use in setting the write pulse width for a given...

10.1109/islped.2011.5993623 article EN 2011-08-01
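Because MTJ switching is stochastic, the write pulse width is set against a target failure probability. A simple exponential (thermally activated) switching model illustrates the relationship; this is an assumed textbook model, not the paper's full HSPICE noise model:

```python
import math

def write_failure_prob(pulse_ns, tau_ns):
    """Probability the MTJ has *not* switched after a write pulse of
    `pulse_ns`, assuming exponential switching with time constant
    `tau_ns` (thermal-activation regime)."""
    return math.exp(-pulse_ns / tau_ns)

def min_pulse_ns(tau_ns, target_failure=1e-9):
    """Shortest pulse meeting a per-write failure target under the
    same exponential model."""
    return -tau_ns * math.log(target_failure)
```

The practical consequence is that each extra decade of write reliability costs a fixed increment of pulse width (and hence write energy), which is why the variation model matters for sizing.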

This article provides an overview of AMD's vision for exascale computing and, in particular, how heterogeneity will play a central role in realizing this vision. Exascale computing requires high levels of performance while staying within stringent power budgets. Using hardware optimized for specific functions is much more energy efficient than implementing those functions with general-purpose cores. However, there is a strong desire among supercomputer customers not to have to pay for custom components designed only...

10.1109/mm.2015.71 article EN IEEE Micro 2015-07-01

This paper presents a study of DDR4 DRAM faults in a large fleet of commodity servers, covering several billion memory device-hours of data. The goal of this study is to understand the faults experienced by these devices, measure the efficacy of existing hardware resilience techniques, and aid in designing more resilient systems for future large-scale deployments. The study has key findings about the fault characteristics of DDR4 DRAMs and adds novel insights to the system reliability literature. Specifically, the data show sixteen unique fault modes in the devices under study, including modes that have not...

10.1109/hpca56546.2023.10071066 article EN 2023-02-01
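Field studies like this one bucket a device's corrected-error addresses into fault modes by their spatial pattern. A coarse sketch of that classification (the mode names and the row/column heuristic are simplified assumptions, not the paper's sixteen-mode taxonomy):

```python
def classify_fault(errors):
    """Classify a bank's error addresses into a coarse fault mode.

    `errors` is a set of (row, column) tuples observed for one bank.
    """
    rows = {r for r, _ in errors}
    cols = {c for _, c in errors}
    if len(errors) == 1:
        return "single-bit"
    if len(rows) == 1:
        return "single-row"
    if len(cols) == 1:
        return "single-column"
    return "multi-row/column (bank-level)"
```

The mode matters because resilience techniques differ in coverage: SECDED handles a single-bit fault, while row, column, and bank faults need stronger (e.g., symbol-based) codes or page retirement.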

Although effective techniques exist for tackling disk power in laptops and workstations, applying them in a server environment presents a considerable challenge, especially under stringent performance requirements. Using a dynamic rotations-per-minute approach to speed control in disk arrays can provide significant savings in I/O system power consumption without lessening performance.

10.1109/mc.2003.1250884 article EN Computer 2003-12-01

In temperature-aware design, the presence or absence of a heatsink fundamentally changes the thermal behavior, with important design implications. In recent years, chip-level infrared (IR) imaging has been gaining popularity in studying thermal phenomena and thermal management, as well as in reverse-engineering chip power consumption. Unfortunately, IR imaging needs a peculiar cooling solution, which removes the heatsink and applies an IR-transparent liquid flow over the exposed bare die to carry away the dissipated heat. Because this cooling solution is...

10.1109/ispass.2009.4919633 article EN 2009-04-01

Spin-Transfer Torque RAM (STT-RAM) has emerged as a potential candidate for universal memory. However, there are two challenges to using STT-RAM in memory system design: (1) the intrinsic variation in the storage element, the Magnetic Tunnel Junction (MTJ), and (2) the high write energy. In this paper, we present a physically based thermal noise model for simulating the statistical variations of MTJs. We have implemented it in HSPICE and validated it against analytical results. We demonstrate its use in setting the write pulse width for a given...

10.5555/2016802.2016836 article EN International Symposium on Low Power Electronics and Design 2011-08-01

As conventional memory technologies such as DRAM and Flash run into scaling challenges, architects and system designers are forced to look at alternative technologies for building future computer systems.

10.2200/s00381ed1v01y201109cac018 article EN Synthesis lectures on computer architecture 2011-11-30

Reliability for general-purpose processing on the GPU (GPGPU) is becoming a weak link in the construction of reliable supercomputer systems. Because hardware protection is expensive to develop, requires dedicated on-chip resources, and is not portable across different architectures, the efficiency of software solutions such as redundant multithreading (RMT) must be explored. This paper presents a real-world design and evaluation of automatic software RMT on GPU hardware. We first describe a compiler pass that automatically converts...

10.1109/isca.2014.6853227 article EN 2014-06-01
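The essence of software RMT is running the computation redundantly and trapping on divergence. A language-level sketch of the detect-on-mismatch contract (the paper's compiler pass does this at instruction granularity across GPU threads, not at whole-function granularity as here):

```python
def rmt_execute(fn, *args):
    """Software redundant-execution sketch: run the computation twice
    and compare the results, signaling a detected fault on mismatch."""
    first = fn(*args)
    second = fn(*args)
    if first != second:
        raise RuntimeError("RMT mismatch: transient fault detected")
    return first
```

The efficiency question the paper studies is where this duplication and comparison run on the GPU (within a thread, across paired threads) and how much of the doubled work the hardware can hide.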

The Program Vulnerability Factor (PVF) has been proposed as a metric to understand the impact of hardware faults on software. PVF is calculated by identifying the program bits required for architecturally correct execution (ACE bits). PVF, however, is conservative: it assumes that all erroneous executions are a major concern, not just those that result in silent data corruptions, and it also does not account for errors that are detected at runtime, i.e., those that lead to crashes. A more discriminating metric can inform the choice of appropriate...

10.1109/dsn.2016.24 article EN 2016-06-01
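The refinement argued for above can be phrased as two ratios over injection outcomes: a conservative PVF-style estimate that counts every non-masked fault, versus an outcome-aware one that counts only silent data corruptions. The outcome labels below are illustrative:

```python
def pvf_conservative(outcomes):
    """Conservative PVF-style estimate: every non-masked outcome counts
    as vulnerable, including crashes the system would actually detect."""
    total = sum(outcomes.values())
    return (total - outcomes.get("masked", 0)) / total if total else 0.0

def sdc_vulnerability(outcomes):
    """Outcome-aware estimate: only silent data corruptions count;
    masked faults and detected crashes are excluded."""
    total = sum(outcomes.values())
    return outcomes.get("sdc", 0) / total if total else 0.0
```

When most erroneous executions crash rather than silently corrupt data, the two estimates diverge sharply, which is the paper's case for the more discriminating metric.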

The increasing demand of big data analytics for more main memory capacity in datacenters and exascale computing environments is driving the integration of heterogeneous memory technologies. The new technologies exhibit vastly greater differences in access latencies and bandwidth compared to traditional NUMA systems. Leveraging this heterogeneity while also delivering application performance enhancements requires intelligent data placement. We present Kleio, a page scheduler with machine intelligence for applications...

10.1145/3307681.3325398 article EN 2019-06-17
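A page scheduler's baseline policy is history-based: keep the most frequently accessed pages in the fast tier. A minimal sketch of that baseline follows; Kleio's contribution is replacing the history counts with learned per-page access predictions, which this sketch does not attempt:

```python
def place_pages(access_counts, fast_capacity):
    """History-based page placement: rank pages by observed access
    count and keep the hottest ones in the fast memory tier."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    fast = set(ranked[:fast_capacity])
    slow = set(ranked[fast_capacity:])
    return fast, slow
```

History-based ranking misplaces pages whose hotness shifts between scheduling intervals; predicting the next interval's accesses instead of assuming the last interval repeats is what the machine-intelligence component addresses.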

The importance of pushing the performance envelope of disk drives continues to grow, not just in the server market but also in numerous consumer electronics products. One of the most fundamental factors impacting drive design is heat dissipation and its effect on reliability, since high temperatures can cause off-track errors, or even head crashes. Until now, manufacturers have continued to meet the 40% annual growth target for internal data rates (IDR) by increasing RPMs and shrinking platter sizes, both of which...

10.1145/1080695.1069975 article EN ACM SIGARCH Computer Architecture News 2005-05-01

With the advent of cloud computing and online services, large enterprises rely heavily on their datacenters to serve end users. A datacenter facility incurs increased maintenance costs, in addition to service unavailability, when there are failures. Among the different server components, hard disk drives are known to contribute significantly to failures; however, there is very little understanding of the major determinants of disk failures in datacenters. In this work, we focus on the interrelationship between temperature, workload, and drive failures in a...

10.1145/2491472.2491475 article EN ACM Transactions on Storage 2013-07-01

GPGPUs are used increasingly in several domains, from gaming to different kinds of computationally intensive applications. In many applications, GPGPU reliability is becoming a serious issue, and research activities are focusing on its evaluation. This paper offers an overview of some major results in the area. First, it shows and analyzes experiments assessing GPGPU reliability in HPC datacenters. Second, it provides recent results derived from radiation experiments on GPGPUs. Third, it describes the characteristics of an advanced fault-injection environment,...

10.5555/2616606.2617090 article EN 2014-03-24

Reliability for general-purpose processing on the GPU (GPGPU) is becoming a weak link in the construction of reliable supercomputer systems. Because hardware protection is expensive to develop, requires dedicated on-chip resources, and is not portable across different architectures, the efficiency of software solutions such as redundant multithreading (RMT) must be explored. This paper presents a real-world design and evaluation of automatic software RMT on GPU hardware. We first describe a compiler pass that automatically converts...

10.1145/2678373.2665686 article EN ACM SIGARCH Computer Architecture News 2014-06-14