Shinobu Miwa

ORCID: 0000-0003-0315-3216
Research Areas
  • Parallel Computing and Optimization Techniques
  • Interconnection Networks and Systems
  • Advanced Data Storage Technologies
  • Embedded Systems Design Techniques
  • Low-power high-performance VLSI design
  • Advanced Memory and Neural Computing
  • Network Packet Processing and Optimization
  • Caching and Content Delivery
  • Cloud Computing and Resource Management
  • Distributed and Parallel Computing Systems
  • Ferroelectric and Negative Capacitance Devices
  • Neural Networks and Applications
  • Semiconductor materials and devices
  • Software-Defined Networks and 5G
  • Supercapacitor Materials and Fabrication
  • Image and Signal Denoising Methods
  • Computer Graphics and Visualization Techniques
  • Advancements in Semiconductor Devices and Circuit Design
  • Algorithms and Data Compression
  • Optical measurement and interference techniques
  • Advanced Neural Network Applications
  • VLSI and FPGA Design Techniques
  • Network Traffic and Congestion Control
  • Video Coding and Compression Technologies
  • Image Enhancement Techniques

University of Electro-Communications
2015-2025

The University of Tokyo
2011-2014

Tokyo University of Agriculture and Technology
2007-2011

Tokyo University of Agriculture
2010

Kyoto University
2007

Future computer systems are built under much more stringent power budgets due to the limitations of power delivery and cooling systems. To this end, sophisticated power management techniques are required. Power capping is a technique that limits the power consumption of a system to a predetermined level; it has been extensively studied in homogeneous systems. However, few studies on CPU-GPU heterogeneous systems have been done yet. In this paper, we propose an efficient power-capping technique that coordinates DVFS and task mapping on a single computing node equipped with GPUs. ...
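A toy sketch of the coordination idea (not the paper's actual algorithm; all DVFS states and power numbers below are invented): enumerate CPU/GPU frequency states and pick the best-performing combination whose total power stays under the cap.

```python
from itertools import product

# Hypothetical (freq_GHz, power_W) DVFS states for CPU and GPU.
CPU_STATES = [(1.2, 40), (2.0, 65), (2.8, 95)]
GPU_STATES = [(0.6, 80), (0.9, 130), (1.2, 180)]

def pick_config(power_cap_w, gpu_task_fraction):
    """Pick the CPU/GPU DVFS pair under the power cap that maximizes a
    crude throughput proxy: work mapped to each device scaled by its
    frequency. `gpu_task_fraction` models the task-mapping decision."""
    best, best_perf = None, -1.0
    for (cf, cp), (gf, gp) in product(CPU_STATES, GPU_STATES):
        if cp + gp > power_cap_w:
            continue  # violates the power cap
        perf = (1 - gpu_task_fraction) * cf + gpu_task_fraction * gf
        if perf > best_perf:
            best, best_perf = ((cf, cp), (gf, gp)), perf
    return best
```

With a 250 W cap and 70% of the work on the GPU, the sketch picks the fastest CPU state paired with a mid-range GPU state rather than the fastest GPU state, illustrating why DVFS and mapping must be decided jointly.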

10.1109/iccd.2013.6657064 article EN 2013-10-01

GPUs have become promising computing devices in current and future computer systems due to their high performance, energy efficiency, and low price. However, the lack of high-level GPU programming models hinders the wide spread of GPU applications. To resolve this issue, OpenACC was developed as the first industry standard for a directive-based model, and several implementations are now available. Although early evaluations showed significant performance improvements with modest efforts, they also revealed limitations ...

10.1109/icpp.2013.35 article EN 2013-10-01

CUDA toolkits are widely used to develop applications running on NVIDIA GPUs. They include compilers and are frequently updated to integrate state-of-the-art compilation techniques. Hence, many HPC users believe that the latest toolkit will improve application performance; however, considering results from CPU compilers, there are cases where this is not true. In this paper, we thoroughly evaluate the impact of toolkit version on the performance, power consumption, and energy consumption of GPU applications with three GPU architectures. Our results show ...

10.1016/j.parco.2024.103081 article EN cc-by-nc Parallel Computing 2024-02-29

Normally-off computing is a way of computing that aggressively powers off components of computer systems when they do not need to operate. Simple power gating cannot fully exploit the chances for power reduction because volatile memories lose their data when turned off. Recently, new non-volatile memories (NVMs) have appeared, and high attention has been paid to normally-off computing using these NVMs. In this paper, its expectations and challenges are addressed with a brief introduction to our project, which started in 2011.

10.1109/aspdac.2014.6742850 article EN 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC) 2014-01-01

In this paper, we show a parallel implementation of the Hilbert-Huang Transform on a GPU. This work focused on reducing the computation complexity from O(N) on a single CPU to O(N/P log(N)) on a GPU, as well as on the use of a 'shared-global' switching method to increase performance. Evaluation results show that our GPU implementation using a Tesla C1060 achieves a 29.0x speedup in the best case, and a 7.1x speedup in total for all cases, when compared with an Intel dual-core CPU.

10.1109/ic-nc.2010.44 article EN 2010-11-01

Implementing last-level caches (LLCs) with STT-MRAM is a promising approach for designing energy-efficient microprocessors due to the high density and low leakage power of its memory cells. However, the peripheral circuits of an STT-MRAM cache still suffer from leakage because large leaky transistors are required to drive the write current of the memory element. To overcome this problem, we propose a new power management scheme called Immediate Sleep (IS). IS immediately turns off a subarray if the next access to it is predicted to be not critical in...
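A minimal sketch of the Immediate Sleep idea, with the criticality predictor left as a stand-in parameter (the paper's actual prediction logic is not reproduced here):

```python
class Subarray:
    """Toy model of one STT-MRAM subarray with power-gated peripherals."""
    def __init__(self):
        self.awake = True

class ImmediateSleepController:
    """Sketch of Immediate Sleep: right after an access completes, gate
    the subarray unless its next access is predicted to be latency-critical."""
    def __init__(self, n_subarrays):
        self.subarrays = [Subarray() for _ in range(n_subarrays)]

    def on_access_done(self, idx, next_access_critical):
        # `next_access_critical` stands in for the paper's predictor.
        if not next_access_critical:
            self.subarrays[idx].awake = False  # gate peripheral circuits now

    def on_access(self, idx):
        woken = not self.subarrays[idx].awake  # a gated subarray pays wakeup latency
        self.subarrays[idx].awake = True
        return woken
```

The trade-off the scheme navigates: gating immediately maximizes leakage savings, but a mispredicted critical access pays the wakeup penalty.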

10.1109/iccd.2015.7357096 article EN 2015-10-01

Overprovisioning hardware devices and coordinating their power budgets have been proposed to improve the application performance of future power-constrained HPC systems. This coordination process is called power shifting. Meanwhile, recent studies have revealed that on/off links can save network power in HPC systems. Future systems will thus adopt both techniques. This paper explores power shifting in interconnection networks with on/off links. Given that on/off links keep network power low at runtime, we can transfer appreciable quantities of power to other devices before an application runs. We propose a...

10.1145/2807591.2807639 article EN 2015-10-27

Shifting to multi-core designs is a pervasive trend to overcome the power wall, and it is a necessary move for embedded systems in our rapidly evolving information society. Meanwhile, the need to increase battery life and reduce maintenance costs of such systems is very critical. Therefore, a wide variety of power reduction techniques have been proposed and realized, including clock gating, DVFS, and power gating. To maximize the effectiveness of these techniques, task scheduling is key but complicated due to the huge exploration space. This problem is a major...

10.2197/ipsjtrans.7.122 article EN IPSJ Online Transactions 2014-01-01

In this paper, we propose CNFET7, the first open-source cell library for 7-nm carbon nanotube field-effect transistor (CNFET) technology. CNFET7 is based on a CNFET SPICE model called VS-CNFET, and various parameters such as channel width and diameter are carefully tuned to mimic the predictive technology presented in a published paper. Some undisclosed parameters, such as cell size and pin layout, are derived from those of the NanGate 15-nm library in the same framework for circuit design. CNFET7 includes two types of delay (i.e., composite...

10.1145/3566097.3567939 article EN Proceedings of the 28th Asia and South Pacific Design Automation Conference 2023-01-16

The inevitable advent of the multi-core era has driven an increasing demand for low-latency on-chip interconnection networks (or NoCs). Being a critical part of the memory hierarchy of modern chip multi-processors (CMPs), these networks face stringent design constraints to provide fast communication within a tight power budget. A modern NoC's first-order concern is clearly its latency, while we also find that the internal bandwidth of routers is relatively plentiful; thus, we present a router utilizing a technique we call “multicast...

10.5555/2523721.2523765 article EN International Conference on Parallel Architectures and Compilation Techniques 2013-10-07

This paper describes a proposal for a non-volatile cache architecture utilizing a novel DRAM/MRAM cell-level hybrid structured memory (D-MRAM) that enables effective power reduction for high-performance mobile SoCs without area overhead. Here, the key point to reduce active power is an intermittent refresh process in DRAM mode. D-MRAM has an advantage in static power consumption compared with conventional SRAM, because there are no leakage paths in the cell and it is not necessary to supply voltage to its cells when used in MRAM mode. ...

10.7873/date.2013.363 article EN Design, Automation & Test in Europe Conference & Exhibition (DATE) 2013-01-01

This paper describes a proposal for a non-volatile cache architecture utilizing a novel DRAM/MRAM cell-level hybrid structured memory (D-MRAM) that enables effective power reduction for high-performance mobile SoCs without area overhead. Here, the key point to reduce active power is an intermittent refresh process in DRAM mode. D-MRAM has an advantage in static power consumption compared with conventional SRAM, because there are no leakage paths in the cell and it is not necessary to supply voltage to its cells when used in MRAM mode. ...

10.5555/2485288.2485716 article EN Design, Automation, and Test in Europe 2013-03-18

Energy Efficient Ethernet (EEE) is a standard for lowering power consumption in commodity network devices. When the load of a link is low, EEE allows the link to turn into a low-power mode and therefore can significantly save power in the device. EEE is expected to be adopted in high-performance computing (HPC) systems in a few years, but the impact caused by EEE on HPC systems is still unknown. To encourage system developers to adopt the technology, an estimation of non-existing systems that would utilize the technology is required. This paper presents a method...
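A crude analytical illustration of why EEE saves power at low load (not the paper's estimation method; the active and Low Power Idle wattages below are invented):

```python
def eee_link_power(utilization, p_active=1.0, p_lpi=0.1, overhead=0.0):
    """Toy EEE power model: a link draws p_active watts while busy and
    p_lpi watts in Low Power Idle (LPI); `overhead` adds a utilization
    penalty for sleep/wake transitions. All parameter values are invented."""
    busy = min(1.0, utilization + overhead)
    return busy * p_active + (1.0 - busy) * p_lpi
```

At 20% utilization the model gives 0.28 W instead of a constant 1.0 W, which is the kind of gap an EEE-enabled HPC network could exploit during idle phases.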

10.1007/s00450-013-0238-4 article EN cc-by Computer Science - Research and Development 2013-07-24

This paper describes state-of-the-art STT-MRAM, which can drastically save the energy consumption dissipated in a cache memory system compared with conventional SRAM-based ones. It also presents how to build a cache hierarchy with both state-of-the-art STT-MRAM and SRAM to reduce energy consumption. The key point is a "break-even-time aware design" based on normally-off operation. For further power reduction, an intelligent power management technique for STT-MRAM-based caches is discussed.
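The break-even-time (BET) criterion can be sketched in a few lines (numbers invented): power gating only pays off when the leakage energy saved during an idle period exceeds the energy spent switching off and on.

```python
def should_power_gate(predicted_idle_ns, break_even_time_ns):
    """BET-aware gating rule: gate only if the predicted idle period
    exceeds the break-even time."""
    return predicted_idle_ns > break_even_time_ns

def gating_saves_energy(idle_ns, leak_w=0.5, switch_energy_nj=100.0):
    """The same rule stated as an energy comparison, with invented
    leakage power and switching-energy values."""
    leakage_saved_nj = leak_w * idle_ns  # W x ns = nJ
    return leakage_saved_nj > switch_energy_nj
```

Here the implied break-even time is switch_energy / leakage_power (100 nJ / 0.5 W = 200 ns), so the two formulations agree.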

10.1109/isocc.2015.7401759 article EN 2015-11-01

We propose a technique to reduce the compulsory misses of a packet processing cache (PPC), which largely affect both the throughput and energy of core routers. Rather than prefetching data, our technique, called response prediction cache (RPC), speculatively stores predicted data into the PPC without additional accesses to the low-throughput and power-consuming memory (i.e., TCAM). RPC predicts the related flow at the arrival of the corresponding request flow, based on the request-response model of internet communications. It can improve the miss rate,...
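A toy sketch of the request-response idea (a simplification, not the paper's exact mechanism): when a request flow misses, also prefill the entry for the reverse 5-tuple, so the first response packet hits without a TCAM access.

```python
def predict_response_flow(request_flow):
    """Predict the response flow by swapping source and destination of
    the request flow's 5-tuple."""
    src, dst, sport, dport, proto = request_flow
    return (dst, src, dport, sport, proto)

class PacketProcessingCache:
    """Toy PPC with response prediction on compulsory misses."""
    def __init__(self):
        self.table = {}

    def lookup(self, flow, tcam):
        if flow in self.table:
            return self.table[flow], True   # PPC hit
        entry = tcam[flow]                  # slow, power-hungry TCAM path
        self.table[flow] = entry
        # Response prediction: speculatively insert the reverse flow too.
        rev = predict_response_flow(flow)
        self.table.setdefault(rev, tcam.get(rev, entry))
        return entry, False
```

In this sketch the response flow's first packet, normally a compulsory miss, becomes a hit, which is exactly the class of miss RPC targets.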

10.1145/3195970.3196021 article EN 2018-06-19

Network-on-Chip (NoC) is a critical part of the memory hierarchy of emerging multicores. Lowering its communication latency while preserving bandwidth is key to achieving high system performance. By now, one of the most effective methods that helps this goal is the prediction router (PR). A PR works by predicting the route over which an incoming packet may be transferred; it speculatively allocates resources (virtual channels and the switch crossbar) and traverses the packet's flits using the predicted route in a single cycle without waiting for...
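One simple prediction rule a PR could use, sketched below, is "same output port as the previous packet from this input port"; a correct guess lets the flit traverse in a single cycle, while a wrong guess falls back to the normal multi-cycle pipeline. (Real prediction routers support several algorithms; this rule is just one illustrative choice.)

```python
class PredictionRouter:
    """Toy prediction router keeping one predicted output port per input port."""
    def __init__(self, n_ports):
        self.last_out = [0] * n_ports  # predicted output per input port

    def route(self, in_port, actual_out):
        predicted = self.last_out[in_port]
        hit = (predicted == actual_out)  # hit -> single-cycle traversal
        self.last_out[in_port] = actual_out
        return hit
```

Streams of packets following the same path (common in dimension-ordered routing) make such a predictor hit repeatedly after the first packet.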

10.1109/ipdpsw.2013.40 article EN 2013-05-01

To reduce processor energy consumption under low-workload and low-clock-frequency executions, a possible solution is to use ALU cascading while keeping the supply voltage unchanged. This scheme uses a single cycle to execute multiple instructions that have a data dependence relationship between them and thus saves cycles over the whole execution. Since energy is the product of both power and execution time, this is expected to help the energy optimization of microprocessors in such an operating status. To implement it in a current superscalar processor, specific...
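The cycle savings can be illustrated with a naive greedy pass (my simplification, not the paper's hardware mechanism) that fuses an instruction with its immediately following dependent instruction into one cycle:

```python
def cycles_with_cascading(instrs):
    """Count execution cycles when two data-dependent back-to-back
    instructions can be cascaded into a single cycle. `instrs` is a
    list of (dest_reg, src_regs) tuples."""
    cycles, i = 0, 0
    while i < len(instrs):
        dest, _ = instrs[i]
        # Fuse if the next instruction reads this one's result.
        if i + 1 < len(instrs) and dest in instrs[i + 1][1]:
            i += 2  # two dependent instructions share one cycle
        else:
            i += 1
        cycles += 1
    return cycles

prog = [("r1", ("r2", "r3")),   # r1 = r2 + r3
        ("r4", ("r1", "r5")),   # r4 = r1 + r5 (depends on r1 -> cascade)
        ("r6", ("r7", "r8"))]   # independent
```

The three-instruction sequence above takes two cycles instead of three; with supply voltage unchanged, fewer cycles at the same power directly cut energy.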

10.2197/ipsjtrans.2.122 article EN IPSJ Online Transactions 2009-01-01

Recently, a method called pipeline stage unification (PSU) has been proposed to reduce energy consumption in mobile processors by inactivating and bypassing some of the pipeline registers, thus adopting shallower pipelines. It is designed to be efficient especially under future process technologies. In this paper, we present a mechanism for a PSU controller which can dynamically predict the suitable configuration based on program phase detection. Our results show that the predictor achieves a high degree of prediction accuracy...

10.1093/ietisy/e91-d.4.1010 article EN IEICE Transactions on Information and Systems 2008-04-01

User-friendly parallel programming environments, such as CUDA and OpenCL, are widely used for accelerators. They provide programmers with useful APIs, but the APIs are still low-level primitives. Therefore, in order to apply communication optimization techniques such as double buffering, programmers have to manually write programs. Manual optimization requires significant knowledge of both application characteristics and the CPU-accelerator architecture. This prevents many developers from effectively utilizing accelerators. In addition, such management is a...

10.1109/ipdpsw.2012.68 article EN 2012-05-01

Interconnection networks grow larger as supercomputers include more nodes and require higher bandwidth for performance. This scaling significantly increases the fraction of power consumed by the network, by increasing the number of network components (links and switches). Typically, links consume power continuously once they are turned on. However, recent proposals for energy-efficient interconnects have introduced low-power operation modes for periods when links are idle. Low-power modes can increase messaging time, because switching a link from...
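The latency cost of low-power modes can be captured by a one-line model (my sketch, with invented parameters): a sleeping link adds a fixed wakeup delay before the transfer can start.

```python
def message_time_us(msg_bytes, bandwidth_gbps, link_asleep, wakeup_us):
    """Toy messaging-time model with on/off links: serialization time
    plus a wakeup latency when the link was in a low-power mode."""
    transfer_us = msg_bytes * 8 / (bandwidth_gbps * 1e3)  # 1 Gb/s = 1e3 bits/us
    return (wakeup_us if link_asleep else 0.0) + transfer_us
```

For a 125 KB message on a 1 Gb/s link, a 5 us wakeup turns a 1000 us transfer into 1005 us; the penalty is negligible for large messages but can dominate for small, latency-sensitive ones.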

10.1109/hipc.2019.00044 article EN 2019-12-01

We propose a technique to reduce the compulsory misses of a packet processing cache (PPC), which largely affect both the throughput and energy of core routers. Rather than prefetching data, our technique, called response prediction cache (RPC), speculatively stores predicted data into the PPC without additional accesses to the low-throughput and power-consuming memory (i.e., TCAM). RPC predicts the related flow at the arrival of the corresponding request flow, based on the request-response model of internet communications. It can improve the miss rate,...

10.1109/dac.2018.8465884 article EN 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC) 2018-06-01

Intel SGX, which provides a strongly isolated execution environment for running applications to protect their code and data from privileged adversaries, is hopeful for increased integrity and confidentiality in multi-user systems. However, SGX is rarely used in the field of HPC despite the increasing importance of protection. This is because SGX had limited memory resources and required supercomputing users to modify their programs for execution. Lately, 3rd generation Xeon Scalable processors have many advanced protection features...

10.1145/3624062.3624267 article EN cc-by 2023-11-10

Understanding the variations in performance and power efficiency of compute nodes is important for enhancing these factors in modern supercomputing systems. Previous studies have focused on CPUs and DRAMs, but there has been little attention to GPUs. This is despite many current systems employing GPUs (which consume a significant fraction of the power of such systems) as power-efficient accelerators for HPC applications. This paper describes the first thorough evaluation of such variations in GPUs. Specifically, we execute 48 CUDA kernels on 856 devices...

10.1145/3545008.3545084 article EN 2022-08-29

The inevitable advent of the multi-core era has driven an increasing demand for low-latency on-chip interconnection networks (or NoCs). Being a critical part of the memory hierarchy of modern chip multi-processors (CMPs), these networks face stringent design constraints to provide fast communication within a tight power budget. A modern NoC's first-order concern is clearly its latency, while we also find that the internal bandwidth of routers is relatively plentiful; thus, we present a router utilizing a technique we call "multicast...

10.1109/pact.2013.6618828 article EN 2013-09-01