Rachata Ausavarungnirun

ORCID: 0000-0002-1459-0852
Research Areas
  • Parallel Computing and Optimization Techniques
  • Advanced Data Storage Technologies
  • Interconnection Networks and Systems
  • Advanced Memory and Neural Computing
  • Cloud Computing and Resource Management
  • Distributed and Parallel Computing Systems
  • Caching and Content Delivery
  • Supercapacitor Materials and Fabrication
  • Low-power high-performance VLSI design
  • Ferroelectric and Negative Capacitance Devices
  • Algorithms and Data Compression
  • Graph Theory and Algorithms
  • Network Packet Processing and Optimization
  • Genomics and Phylogenetic Studies
  • Distributed systems and fault tolerance
  • Green IT and Sustainability
  • Graphene research and applications
  • Software-Defined Networks and 5G
  • Semiconductor materials and devices
  • Embedded Systems Design Techniques
  • Gene expression and cancer classification
  • Catalytic Processes in Materials Science
  • Technology Adoption and User Behaviour
  • Advanced Graph Neural Networks
  • Mobile and Web Applications

King Mongkut's University of Technology North Bangkok
2018-2024

Carnegie Mellon University
2011-2020

Several system-level operations trigger bulk data copy or initialization. Even though these operations do not require any computation, current systems transfer a large quantity of data back and forth on the memory channel to perform them. As a result, they consume high latency, bandwidth, and energy, degrading both system performance and energy efficiency.

10.1145/2540708.2540725 article EN 2013-12-07
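The cost described above can be illustrated with a toy traffic model (a sketch of my own, with assumed page and cache-line sizes; this is not the paper's mechanism): a conventional bulk copy moves every byte over the memory channel twice, whereas a copy performed inside memory would put only commands on the channel.

```python
# Toy model of memory-channel traffic for a bulk page copy.
# PAGE_SIZE and CACHE_LINE are assumptions for illustration.

PAGE_SIZE = 4096   # bytes in one page
CACHE_LINE = 64    # bytes per memory-channel transfer

def channel_traffic_conventional(num_pages: int) -> int:
    """Every byte crosses the channel twice: read to the CPU, then written back."""
    return num_pages * PAGE_SIZE * 2

def channel_traffic_in_memory(num_pages: int) -> int:
    """If the copy happens inside memory, only commands cross the channel.
    We charge one cache line per page as a rough command/bookkeeping cost."""
    return num_pages * CACHE_LINE

if __name__ == "__main__":
    pages = 1024  # copy 4 MiB
    conv = channel_traffic_conventional(pages)
    inmem = channel_traffic_in_memory(pages)
    print(f"conventional: {conv} bytes on channel")
    print(f"in-memory   : {inmem} bytes on channel")
    print(f"reduction   : {conv / inmem:.0f}x")
```

Even this crude model shows a two-orders-of-magnitude gap in channel traffic, which is why bulk copy and initialization are attractive targets for in-memory execution.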

We are experiencing explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to total system energy and execution time. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for these devices. In this work, we...

10.1145/3173162.3173177 article EN 2018-03-19

Phase change memory (PCM) is a promising technology that can offer higher capacity than DRAM. Unfortunately, PCM's access latency and energy are higher than DRAM's, and its endurance is lower. Many DRAM-PCM hybrid memory systems use DRAM as a cache to PCM, to achieve the low access latency and energy and high endurance of DRAM, while taking advantage of PCM's large capacity. A key question is what data to cache in DRAM to best exploit the advantages of each technology while avoiding its disadvantages as much as possible. We propose a new caching policy that improves hybrid memory performance and energy efficiency. Our key observation is that both DRAM and PCM banks...

10.1109/iccd.2012.6378661 article EN 2012 IEEE 30th International Conference on Computer Design (ICCD) 2012-09-01
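One plausible instance of such a caching policy can be sketched as follows (a minimal sketch of my own; the row-buffer heuristic and the threshold are assumptions, not the paper's exact design): cache a row in DRAM only after it has repeatedly missed in the row buffer, since rows with poor row-buffer locality pay PCM's higher array-access latency most often.

```python
# Sketch of a row-buffer-locality-aware caching decision for a single bank
# in a DRAM-PCM hybrid. MISS_THRESHOLD is an assumed tuning parameter.

from collections import defaultdict

MISS_THRESHOLD = 2  # cache a row in DRAM after this many row-buffer misses

class HybridCachePolicy:
    def __init__(self):
        self.open_row = None                # currently open row in the bank
        self.miss_count = defaultdict(int)  # row-buffer misses per row
        self.cached_in_dram = set()

    def access(self, row: int) -> str:
        """Return where the access is served ("DRAM" or "PCM") and update state."""
        if row != self.open_row:            # row-buffer miss
            self.open_row = row
            self.miss_count[row] += 1
            if (row not in self.cached_in_dram
                    and self.miss_count[row] >= MISS_THRESHOLD):
                self.cached_in_dram.add(row)
        return "DRAM" if row in self.cached_in_dram else "PCM"
```

For example, two rows accessed in alternation keep missing the row buffer, so both cross the threshold and migrate to DRAM, while a row streamed with high row-buffer locality stays in PCM.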

When multiple processor (CPU) cores and a GPU integrated together on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of the CPU cores. Unfortunately, state-of-the-art application-aware memory scheduling algorithms are ineffective at solving this problem at low complexity due to the large amount of GPU traffic. A large and costly request buffer is needed to provide these algorithms with enough visibility across the global request stream, requiring relatively complex...

10.1145/2366231.2337207 article EN ACM SIGARCH Computer Architecture News 2012-09-05

Heterogeneous architectures consisting of general-purpose CPUs and throughput-optimized GPUs are projected to be the dominant computing platforms for many classes of applications. The design of such systems is more complex than that of homogeneous architectures because maximizing resource utilization while minimizing shared resource interference between CPU and GPU applications is difficult. We show that GPU applications tend to monopolize shared hardware resources, such as memory and the network, due to their high thread-level parallelism (TLP), and discuss the limitations of existing...

10.1109/micro.2014.62 article EN 2014-12-01

A conventional Network-on-Chip (NoC) router uses input buffers to store in-flight packets. These buffers improve performance, but consume significant power. It is possible to bypass the buffers when they are empty, reducing dynamic power, but the static buffer power, and the dynamic power when the buffers are utilized, remain. To improve energy efficiency, bufferless deflection routing removes the buffers entirely and instead uses deflection (misrouting) to resolve contention. However, at high network load, deflections cause unnecessary hops, wasting power and performance. In this work, we propose a new NoC...

10.1109/nocs.2012.8 article EN 2012-05-01
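The core of deflection routing mentioned above can be sketched in a few lines (my own illustration of the general technique, not the router proposed in the paper): each cycle, every arriving flit requests an output port; when several flits want the same port, one wins and the rest are deflected to any free port instead of being buffered.

```python
# Toy deflection-routing port arbitration for one router cycle.
# Ports 0..3 stand for the four directions; a router receives at most
# four flits per cycle, so a free port always exists for every loser.

def route_with_deflection(flits):
    """flits: list of (flit_id, desired_port). Returns {flit_id: assigned_port}."""
    assert len(flits) <= 4
    assignment = {}
    free_ports = {0, 1, 2, 3}
    deflected = []
    for flit_id, want in flits:        # first pass: grant desired ports
        if want in free_ports:
            assignment[flit_id] = want
            free_ports.remove(want)
        else:
            deflected.append(flit_id)  # port taken: this flit loses arbitration
    for flit_id in deflected:          # second pass: misroute the losers
        assignment[flit_id] = free_ports.pop()
    return assignment
```

Because every flit always leaves on some port, no buffer is needed; the price is that deflected flits take extra hops, which is exactly the overhead that grows at high load.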

Memory channel contention is a critical performance bottleneck in modern systems that have highly parallelized processing units operating on large data sets. The memory channel is contended not only by requests from different user applications (CPU access) but also by system requests for peripheral data (IO access), usually controlled by Direct Memory Access (DMA) engines. Our goal, in this work, is to improve system performance by eliminating the contention between CPU accesses and IO accesses. To this end, we propose a hardware-software cooperative data transfer mechanism,...

10.1109/pact.2015.51 article EN 2015-10-01

Future many-core processors are likely to concurrently execute a large number of diverse applications. How these applications are mapped to cores largely determines the interference between them in critical shared hardware resources. This paper proposes new application-to-core mapping policies that improve system performance by reducing inter-application interference in the on-chip network and memory controllers. The major ideas of our policies are to: 1) map network-latency-sensitive applications to separate parts of the network from network-bandwidth-intensive applications such that...

10.1109/hpca.2013.6522311 article EN 2013-02-01

Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in the utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.

10.1145/2749469.2750399 article EN 2015-05-26

Variation has been shown to exist across the cells within a modern DRAM chip. Prior work has studied and exploited several forms of variation, such as manufacturing-process- or temperature-induced variation. We empirically demonstrate a new form of variation that exists within a real DRAM chip, induced by the design and placement of different components in the chip: different regions in DRAM, based on their relative distances from the peripheral structures, require different minimum access latencies for reliable operation. In particular, we show that most...

10.1145/3078505.3078533 article EN 2017-06-05

Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that...

10.1145/3123939.3123975 article EN 2017-10-04

Specialized on-chip accelerators are widely used to improve the energy efficiency of computing systems. Recent advances in memory technology have enabled near-data accelerators (NDAs), which reside off-chip close to main memory and can yield further benefits than on-chip accelerators. However, enforcing coherence with the rest of the system, which is already a major challenge for accelerators, becomes more difficult for NDAs. This is because (1) the cost of communication between NDAs and CPUs is high, and (2) NDA applications generate a lot of data movement. As...

10.1145/3307650.3322266 article EN 2019-06-14

Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. To perform genome sequencing, devices extract small random fragments of an organism's DNA (known as reads). The first step of genome analysis is a computational process known as read mapping. In read mapping, each fragment is matched to its potential location in the reference genome, with the goal of identifying the original location of each read in the genome. Unfortunately, rapid genome sequencing is currently...

10.1109/micro50266.2020.00081 article EN 2020-10-01
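The computational kernel behind read mapping can be sketched in its most naive form (my own brute-force illustration; real mappers, including accelerated ones, use seeding, filtering, and far more efficient approximate string matching): slide each read along the reference and report the position with the smallest edit distance.

```python
# Brute-force read mapping: dynamic-programming edit distance at every
# candidate position of the reference. Purely illustrative; the cost of
# exactly this kernel is why read mapping is a hardware-acceleration target.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution/match
        prev = cur
    return prev[-1]

def map_read(read: str, reference: str):
    """Return (best_position, best_distance) of read within reference."""
    best = (None, len(read) + 1)
    for pos in range(len(reference) - len(read) + 1):
        d = edit_distance(read, reference[pos:pos + len(read)])
        if d < best[1]:
            best = (pos, d)
    return best
```

For a read of length m and a reference of length n, this does O(n * m^2) work, which makes the scale problem for full genomes obvious.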

When multiple processor (CPU) cores and a GPU integrated together on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of the CPU cores. Unfortunately, state-of-the-art application-aware memory scheduling algorithms are ineffective at solving this problem at low complexity due to the large amount of GPU traffic. A large and costly request buffer is needed to provide these algorithms with enough visibility across the global request stream, requiring relatively complex...

10.1109/isca.2012.6237036 article EN 2012-06-01

In a GPU, all threads within a warp execute the same instruction in lockstep. For a memory instruction, this can lead to memory divergence: the requests for some threads are serviced early, while the remaining requests incur long latencies. This divergence stalls the warp, as it cannot execute the next instruction until all requests from the current one complete. In this work, we make three new observations. First, GPGPU warps exhibit heterogeneous behavior at the shared cache: some warps have most of their requests hit in the cache (high cache utility), while other warps see most of their requests miss (low cache utility). Second, a warp tends to retain its behavior for long periods...

10.1109/pact.2015.38 article EN 2015-10-01

Modern discrete GPUs support unified memory and demand paging. Automatic management of data movement between the CPU and the GPU dramatically reduces developer effort. However, when application working sets exceed the GPU's physical memory capacity, the resulting data movement can cause great performance loss.

10.1145/3297858.3304044 article EN 2019-04-04

Read mapping is a fundamental step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). Read mapping is costly because it needs to perform approximate string matching (ASM) on large amounts of data. To address the computational challenges in genome analysis, prior works propose various approaches, such as accurate filters that select the reads within a dataset (i.e., a genomic read set) that must undergo expensive...

10.1145/3503222.3507702 article EN 2022-02-22
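The filtering idea the abstract refers to can be sketched as follows (my own toy version, not this work's design; the k-mer length is an arbitrary assumption): before running expensive approximate string matching, cheaply reject reads that share no k-mer with the reference, since such reads cannot map well anywhere.

```python
# Toy pre-alignment filter: keep a read for full approximate string
# matching only if it shares at least one k-mer with the reference.

K = 4  # k-mer length; an arbitrary choice for this sketch

def kmers(seq: str, k: int = K):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def filter_reads(reads, reference):
    """Split reads into (candidates for full ASM, safely rejected reads)."""
    ref_kmers = kmers(reference)
    candidates, rejected = [], []
    for read in reads:
        (candidates if kmers(read) & ref_kmers else rejected).append(read)
    return candidates, rejected
```

The filter's value comes from asymmetry: the set intersection is near-free compared to dynamic-programming alignment, so every rejected read is expensive work avoided.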

The network-on-chip (NoC) is a primary shared resource in a chip multiprocessor (CMP) system. As core counts continue to increase and applications become increasingly data-intensive, the network load will also increase, leading to more congestion in the network. This congestion can degrade system performance if not appropriately controlled. Prior works have proposed source-throttling congestion control, which limits the rate at which new traffic (packets) enters the NoC in order to reduce congestion and improve performance. These prior congestion control mechanisms...

10.1109/sbac-pad.2012.44 article EN 2012-10-01
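The basic feedback structure of source throttling can be sketched as a small control loop (my own illustration with made-up thresholds, not this paper's mechanism): each epoch, measure congestion and lower or raise the allowed injection rate accordingly.

```python
# Toy source-throttling controller. All constants are assumptions
# chosen only to make the feedback behavior visible.

CONGESTION_HIGH = 0.7  # above this, the network is considered congested
CONGESTION_LOW = 0.3   # below this, throttling can be relaxed
STEP = 0.1             # rate adjustment per epoch

def adjust_rate(rate: float, congestion: float) -> float:
    """Return the new allowed injection rate (fraction of cycles a node
    may inject) given the congestion measured this epoch."""
    if congestion > CONGESTION_HIGH:
        rate -= STEP               # throttle sources
    elif congestion < CONGESTION_LOW:
        rate += STEP               # release throttling
    return min(1.0, max(0.1, rate))  # clamp to a sane range
```

A real mechanism must also decide *which* sources to throttle and how to estimate congestion; treating all nodes uniformly, as this sketch does, is one of the simplifications that application-aware schemes improve on.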

We are experiencing explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to total system energy and execution time. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for these devices. In this work, we...

10.1145/3296957.3173177 article EN ACM SIGPLAN Notices 2018-03-19

Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, accommodating such wide-ranging demands without leaving GPU resources...

10.1145/3173162.3173169 article EN 2018-03-19

Modern computing systems suffer from a dichotomy: computation, on one side, is performed only in the processor (and accelerators), while data storage and movement, on the other side, is what all other parts of the system are dedicated to. Due to this dichotomy, data moves around a lot in order for the system to perform computation on it. Unfortunately, data movement is extremely expensive in terms of both energy and latency, much more so than computation. As a result, a large fraction of system energy is spent, and performance is lost, solely on moving data in a modern computing system.

10.1145/3316781.3323476 article EN 2019-05-23

Simple graph algorithms such as PageRank have been the target of numerous hardware accelerators. Yet, there also exist much more complex graph mining algorithms for problems such as clustering or maximal clique listing. These algorithms are memory-bound and thus could be accelerated by techniques such as Processing-in-Memory (PIM). However, they come with non-straightforward parallelism and complicated memory access patterns. In this work, we address this problem with a simple yet surprisingly powerful observation: operations on sets of vertices,...

10.1145/3466752.3480133 article EN 2021-10-17

Faster app launching is crucial for the user experience on mobile devices. Apps launched from a background cached state, called hot-launching, perform much better than apps launched from scratch. To increase the number of hot-launches, leading vendors now cache more apps in memory by enabling swap. Recent work also proposed reducing the Java heap to cache more apps. However, this paper finds that existing methods deteriorate hot-launch performance while increasing the number of cached apps. To simultaneously improve app caching and hot-launch performance, this paper proposes Fleet,...

10.1145/3620666.3651377 article EN 2024-04-24