Rachata Ausavarungnirun

ORCID: 0000-0002-1459-0852
Research Areas
  • Parallel Computing and Optimization Techniques
  • Advanced Data Storage Technologies
  • Interconnection Networks and Systems
  • Advanced Memory and Neural Computing
  • Cloud Computing and Resource Management
  • Distributed and Parallel Computing Systems
  • Caching and Content Delivery
  • Supercapacitor Materials and Fabrication
  • Low-power high-performance VLSI design
  • Ferroelectric and Negative Capacitance Devices
  • Algorithms and Data Compression
  • Graph Theory and Algorithms
  • Network Packet Processing and Optimization
  • Genomics and Phylogenetic Studies
  • Distributed systems and fault tolerance
  • Green IT and Sustainability
  • Graphene research and applications
  • Software-Defined Networks and 5G
  • Semiconductor materials and devices
  • Embedded Systems Design Techniques
  • Gene expression and cancer classification
  • Catalytic Processes in Materials Science
  • Technology Adoption and User Behaviour
  • Advanced Graph Neural Networks
  • Mobile and Web Applications

King Mongkut's University of Technology North Bangkok
2018-2024

Carnegie Mellon University
2011-2020

Several system-level operations trigger bulk data copy or initialization. Even though these operations do not require any computation, current systems transfer a large quantity of data back and forth on the memory channel to perform them. As a result, they consume high latency, bandwidth, and energy, degrading both system performance and energy efficiency.

10.1145/2540708.2540725 article EN 2013-12-07
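The cost described above can be illustrated with a toy traffic model (a sketch of my own, with assumed page and cache-line sizes; this is not the paper's mechanism): a conventional bulk copy moves every byte over the memory channel twice, whereas a copy performed inside memory would put only commands on the channel.

```python
# Toy model of memory-channel traffic for a bulk page copy.
# PAGE_SIZE and CACHE_LINE are assumptions for illustration.

PAGE_SIZE = 4096   # bytes in one page
CACHE_LINE = 64    # bytes per memory-channel transfer

def channel_traffic_conventional(num_pages: int) -> int:
    """Every byte crosses the channel twice: read to the CPU, then written back."""
    return num_pages * PAGE_SIZE * 2

def channel_traffic_in_memory(num_pages: int) -> int:
    """If the copy happens inside memory, only commands cross the channel.
    We charge one cache line per page as a rough command/bookkeeping cost."""
    return num_pages * CACHE_LINE

if __name__ == "__main__":
    pages = 1024  # copy 4 MiB
    conv = channel_traffic_conventional(pages)
    inmem = channel_traffic_in_memory(pages)
    print(f"conventional: {conv} bytes on channel")
    print(f"in-memory   : {inmem} bytes on channel")
    print(f"reduction   : {conv / inmem:.0f}x")
```

Even this crude model shows a two-orders-of-magnitude gap in channel traffic, which is why bulk copy and initialization are attractive targets for in-memory execution.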

We are experiencing explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to total system energy and execution time. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for these devices. In this work, we...

10.1145/3173162.3173177 article EN 2018-03-19

Phase change memory (PCM) is a promising technology that can offer higher capacity than DRAM. Unfortunately, PCM's access latency and energy are higher than DRAM's, and its endurance is lower. Many DRAM-PCM hybrid memory systems use DRAM as a cache to PCM, to achieve the low access latency and energy and high endurance of DRAM, while taking advantage of PCM's large capacity. A key question is what data to cache in DRAM to best exploit the advantages of each technology while avoiding its disadvantages as much as possible. We propose a new caching policy that improves hybrid memory performance and energy efficiency. Our key observation is that both DRAM and PCM banks...

10.1109/iccd.2012.6378661 article EN 2012 IEEE 30th International Conference on Computer Design (ICCD) 2012-09-01
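One plausible instance of such a caching policy can be sketched as follows (a minimal sketch of my own; the row-buffer heuristic and the threshold are assumptions, not the paper's exact design): cache a row in DRAM only after it has repeatedly missed in the row buffer, since rows with poor row-buffer locality pay PCM's higher array-access latency most often.

```python
# Sketch of a row-buffer-locality-aware caching decision for a single bank
# in a DRAM-PCM hybrid. MISS_THRESHOLD is an assumed tuning parameter.

from collections import defaultdict

MISS_THRESHOLD = 2  # cache a row in DRAM after this many row-buffer misses

class HybridCachePolicy:
    def __init__(self):
        self.open_row = None                # currently open row in the bank
        self.miss_count = defaultdict(int)  # row-buffer misses per row
        self.cached_in_dram = set()

    def access(self, row: int) -> str:
        """Return where the access is served ("DRAM" or "PCM") and update state."""
        if row != self.open_row:            # row-buffer miss
            self.open_row = row
            self.miss_count[row] += 1
            if (row not in self.cached_in_dram
                    and self.miss_count[row] >= MISS_THRESHOLD):
                self.cached_in_dram.add(row)
        return "DRAM" if row in self.cached_in_dram else "PCM"
```

For example, two rows accessed in alternation keep missing the row buffer, so both cross the threshold and migrate to DRAM, while a row streamed with high row-buffer locality stays in PCM.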

When multiple processor (CPU) cores and a GPU integrated together on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of the CPU cores. Unfortunately, state-of-the-art application-aware memory scheduling algorithms are ineffective at solving this problem at low complexity due to the large amount of GPU traffic. A large and costly request buffer is needed to provide these algorithms with enough visibility across the global request stream, requiring relatively complex...

10.1145/2366231.2337207 article EN ACM SIGARCH Computer Architecture News 2012-09-05

Heterogeneous architectures consisting of general-purpose CPUs and throughput-optimized GPUs are projected to be the dominant computing platforms for many classes of applications. The design of such systems is more complex than that of homogeneous architectures because maximizing resource utilization while minimizing shared resource interference between CPU and GPU applications is difficult. We show that GPU applications tend to monopolize shared hardware resources, such as memory and the network, due to their high thread-level parallelism (TLP), and discuss the limitations of existing...

10.1109/micro.2014.62 article EN 2014-12-01

A conventional Network-on-Chip (NoC) router uses input buffers to store in-flight packets. These buffers improve performance, but consume significant power. It is possible to bypass the buffers when they are empty, reducing dynamic power, but the static buffer power, and the dynamic power when the buffers are utilized, remain. To improve energy efficiency, bufferless deflection routing removes the buffers entirely and instead uses deflection (misrouting) to resolve contention. However, at high network load, deflections cause unnecessary hops, wasting power and performance. In this work, we propose a new NoC...

10.1109/nocs.2012.8 article EN 2012-05-01
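The core of deflection routing mentioned above can be sketched in a few lines (my own illustration of the general technique, not the router proposed in the paper): each cycle, every arriving flit requests an output port; when several flits want the same port, one wins and the rest are deflected to any free port instead of being buffered.

```python
# Toy deflection-routing port arbitration for one router cycle.
# Ports 0..3 stand for the four directions; a router receives at most
# four flits per cycle, so a free port always exists for every loser.

def route_with_deflection(flits):
    """flits: list of (flit_id, desired_port). Returns {flit_id: assigned_port}."""
    assert len(flits) <= 4
    assignment = {}
    free_ports = {0, 1, 2, 3}
    deflected = []
    for flit_id, want in flits:        # first pass: grant desired ports
        if want in free_ports:
            assignment[flit_id] = want
            free_ports.remove(want)
        else:
            deflected.append(flit_id)  # port taken: this flit loses arbitration
    for flit_id in deflected:          # second pass: misroute the losers
        assignment[flit_id] = free_ports.pop()
    return assignment
```

Because every flit always leaves on some port, no buffer is needed; the price is that deflected flits take extra hops, which is exactly the overhead that grows at high load.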

Memory channel contention is a critical performance bottleneck in modern systems that have highly parallelized processing units operating on large data sets. The memory channel is contended not only by requests from different user applications (CPU access) but also by system requests for peripheral data (IO access), usually controlled by Direct Memory Access (DMA) engines. Our goal, in this work, is to improve system performance by eliminating the contention between CPU accesses and IO accesses. To this end, we propose a hardware-software cooperative data transfer mechanism,...

10.1109/pact.2015.51 article EN 2015-10-01

Future many-core processors are likely to concurrently execute a large number of diverse applications. How these applications are mapped to cores largely determines the interference between them in critical shared hardware resources. This paper proposes new application-to-core mapping policies that improve system performance by reducing inter-application interference in the on-chip network and memory controllers. The major ideas of our policies are to: 1) map network-latency-sensitive applications to separate parts of the network from network-bandwidth-intensive applications such that...

10.1109/hpca.2013.6522311 article EN 2013-02-01

Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in the utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.

10.1145/2749469.2750399 article EN 2015-05-26

Variation has been shown to exist across the cells within a modern DRAM chip. Prior work has studied and exploited several forms of variation, such as manufacturing-process- or temperature-induced variation. We empirically demonstrate a new form of variation that exists within a real DRAM chip, induced by the design and placement of different components in the chip: different regions in DRAM, based on their relative distances from the peripheral structures, require different minimum access latencies for reliable operation. In particular, we show that most...

10.1145/3078505.3078533 article EN 2017-06-05

Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that...

10.1145/3123939.3123975 article EN 2017-10-04

Specialized on-chip accelerators are widely used to improve the energy efficiency of computing systems. Recent advances in memory technology have enabled near-data accelerators (NDAs), which reside off-chip close to main memory and can yield further benefits than on-chip accelerators. However, enforcing coherence with the rest of the system, which is already a major challenge for accelerators, becomes more difficult for NDAs. This is because (1) the cost of communication between NDAs and CPUs is high, and (2) NDA applications generate a lot of data movement. As...

10.1145/3307650.3322266 article EN 2019-06-14

Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. To perform genome sequencing, devices extract small random fragments of an organism's DNA (known as reads). The first step of genome analysis is a computational process known as read mapping. In read mapping, each fragment is matched to its potential location in the reference genome, with the goal of identifying the original location of each read in the genome. Unfortunately, rapid genome sequencing is currently...

10.1109/micro50266.2020.00081 article EN 2020-10-01
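The computational kernel behind read mapping can be sketched in its most naive form (my own brute-force illustration; real mappers, including accelerated ones, use seeding, filtering, and far more efficient approximate string matching): slide each read along the reference and report the position with the smallest edit distance.

```python
# Brute-force read mapping: dynamic-programming edit distance at every
# candidate position of the reference. Purely illustrative; the cost of
# exactly this kernel is why read mapping is a hardware-acceleration target.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution/match
        prev = cur
    return prev[-1]

def map_read(read: str, reference: str):
    """Return (best_position, best_distance) of read within reference."""
    best = (None, len(read) + 1)
    for pos in range(len(reference) - len(read) + 1):
        d = edit_distance(read, reference[pos:pos + len(read)])
        if d < best[1]:
            best = (pos, d)
    return best
```

For a read of length m and a reference of length n, this does O(n * m^2) work, which makes the scale problem for full genomes obvious.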

When multiple processor (CPU) cores and a GPU integrated together on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of the CPU cores. Unfortunately, state-of-the-art application-aware memory scheduling algorithms are ineffective at solving this problem at low complexity due to the large amount of GPU traffic. A large and costly request buffer is needed to provide these algorithms with enough visibility across the global request stream, requiring relatively complex...

10.1109/isca.2012.6237036 article EN 2012-06-01

In a GPU, all threads within a warp execute the same instruction in lockstep. For a memory instruction, this can lead to memory divergence: the requests for some threads are serviced early, while the remaining requests incur long latencies. This divergence stalls the warp, as it cannot execute the next instruction until all requests from the current one complete. In this work, we make three new observations. First, GPGPU warps exhibit heterogeneous behavior at the shared cache: some warps have most of their requests hit in the cache (high cache utility), while other warps see most of their requests miss (low cache utility). Second, a warp tends to retain its behavior for long periods...

10.1109/pact.2015.38 article EN 2015-10-01

Modern discrete GPUs support unified memory and demand paging. Automatic management of data movement between the CPU and the GPU dramatically reduces developer effort. However, when application working sets exceed the GPU's physical memory capacity, the resulting data movement can cause great performance loss.

10.1145/3297858.3304044 article EN 2019-04-04

Read mapping is a fundamental step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). Read mapping is costly because it needs to perform approximate string matching (ASM) on large amounts of data. To address the computational challenges in genome analysis, prior works propose various approaches, such as accurate filters that select the reads within a dataset (i.e., a genomic read set) that must undergo expensive...

10.1145/3503222.3507702 article EN 2022-02-22
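The filtering idea the abstract refers to can be sketched as follows (my own toy version, not this work's design; the k-mer length is an arbitrary assumption): before running expensive approximate string matching, cheaply reject reads that share no k-mer with the reference, since such reads cannot map well anywhere.

```python
# Toy pre-alignment filter: keep a read for full approximate string
# matching only if it shares at least one k-mer with the reference.

K = 4  # k-mer length; an arbitrary choice for this sketch

def kmers(seq: str, k: int = K):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def filter_reads(reads, reference):
    """Split reads into (candidates for full ASM, safely rejected reads)."""
    ref_kmers = kmers(reference)
    candidates, rejected = [], []
    for read in reads:
        (candidates if kmers(read) & ref_kmers else rejected).append(read)
    return candidates, rejected
```

The filter's value comes from asymmetry: the set intersection is near-free compared to dynamic-programming alignment, so every rejected read is expensive work avoided.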

The network-on-chip (NoC) is a primary shared resource in a chip multiprocessor (CMP) system. As core counts continue to increase and applications become increasingly data-intensive, the network load will also increase, leading to more congestion in the network. This congestion can degrade system performance if not appropriately controlled. Prior works have proposed source-throttling congestion control, which limits the rate at which new traffic (packets) enters the NoC in order to reduce congestion and improve performance. These prior congestion control mechanisms...

10.1109/sbac-pad.2012.44 article EN 2012-10-01
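The basic feedback structure of source throttling can be sketched as a small control loop (my own illustration with made-up thresholds, not this paper's mechanism): each epoch, measure congestion and lower or raise the allowed injection rate accordingly.

```python
# Toy source-throttling controller. All constants are assumptions
# chosen only to make the feedback behavior visible.

CONGESTION_HIGH = 0.7  # above this, the network is considered congested
CONGESTION_LOW = 0.3   # below this, throttling can be relaxed
STEP = 0.1             # rate adjustment per epoch

def adjust_rate(rate: float, congestion: float) -> float:
    """Return the new allowed injection rate (fraction of cycles a node
    may inject) given the congestion measured this epoch."""
    if congestion > CONGESTION_HIGH:
        rate -= STEP               # throttle sources
    elif congestion < CONGESTION_LOW:
        rate += STEP               # release throttling
    return min(1.0, max(0.1, rate))  # clamp to a sane range
```

A real mechanism must also decide *which* sources to throttle and how to estimate congestion; treating all nodes uniformly, as this sketch does, is one of the simplifications that application-aware schemes improve on.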

We are experiencing explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to total system energy and execution time. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for these devices. In this work, we...

10.1145/3296957.3173177 article EN ACM SIGPLAN Notices 2018-03-19

Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, accommodating such wide-ranging demands without leaving GPU resources...

10.1145/3173162.3173169 article EN 2018-03-19

Modern computing systems suffer from a dichotomy: computation, on one side, is performed only in the processor (and accelerators), while data storage and movement, on the other side, is what all other parts of the system are dedicated to. Due to this dichotomy, data moves around a lot in order for the system to perform computation on it. Unfortunately, data movement is extremely expensive in terms of both energy and latency, much more so than computation. As a result, a large fraction of system energy is spent, and performance is lost, solely on moving data in a modern computing system.

10.1145/3316781.3323476 article EN 2019-05-23

Simple graph algorithms such as PageRank have been the target of numerous hardware accelerators. Yet, there also exist much more complex graph mining algorithms for problems such as clustering or maximal clique listing. These algorithms are memory-bound and thus could be accelerated by techniques such as Processing-in-Memory (PIM). However, they come with non-straightforward parallelism and complicated memory access patterns. In this work, we address this problem with a simple yet surprisingly powerful observation: operations on sets of vertices,...

10.1145/3466752.3480133 article EN 2021-10-17

Faster app launching is crucial for the user experience on mobile devices. Apps launched from a background cached state, called hot-launching, perform much better than apps launched from scratch. To increase the number of hot-launches, leading vendors now cache more apps in memory by enabling swap. Recent work also proposed reducing the Java heap to cache more apps. However, this paper finds that existing methods deteriorate hot-launch performance while increasing the number of cached apps. To simultaneously improve app caching and hot-launch performance, this paper proposes Fleet,...

10.1145/3620666.3651377 article EN 2024-04-24