- Parallel Computing and Optimization Techniques
- Advanced Data Storage Technologies
- Distributed and Parallel Computing Systems
- Cloud Computing and Resource Management
- Caching and Content Delivery
- Distributed Systems and Fault Tolerance
- Interconnection Networks and Systems
- Scientific Computing and Data Management
- Advanced Neural Network Applications
- Matrix Theory and Algorithms
- Meteorological Phenomena and Simulations
- Network Traffic and Congestion Control
- Advanced Memory and Neural Computing
- Stochastic Gradient Optimization Techniques
- Computational Geometry and Mesh Generation
- Advanced Electron Microscopy Techniques and Applications
- Tropical and Extratropical Cyclones Research
- Advanced Optimization Algorithms Research
- Adversarial Robustness in Machine Learning
- Security and Verification in Computing
- X-ray Spectroscopy and Fluorescence Analysis
- Neural Networks and Applications
- Peer-to-Peer Network Technologies
- Advanced Algorithms and Applications
- Ferroelectric and Negative Capacitance Devices
RIKEN Center for Computational Science
2015-2024
Intel (United States)
2022-2024
Tokyo Institute of Technology
2022
University of Chicago
2022
National Institute of Informatics
2022
Argonne National Laboratory
2022
Fujitsu (Japan)
2022
Institut Polytechnique de Paris
2022
Lawrence Berkeley National Laboratory
2022
Columbia University
2022
Following the inventions of the telegraph, the electronic computer, and remote sensing, "big data" is bringing another revolution to weather prediction. As sensor and computer technologies advance, orders-of-magnitude bigger data are produced by new sensors and high-precision simulations. Data assimilation (DA) is a key to numerical weather prediction (NWP), integrating real-world data into simulation. However, current DA and NWP systems are not designed to handle data from next-generation big sensors. Therefore, we propose "big data assimilation"...
The extreme degree of parallelism in high-end computing requires low operating system noise so that large-scale, bulk-synchronous parallel applications can run efficiently. Noiseless execution has historically been achieved by deploying lightweight kernels (LWK), which, on the other hand, provide only a restricted set of POSIX APIs in exchange for scalability. However, the increasing prevalence of more complex application constructs, such as in-situ analysis and workflow composition, dictates the need for rich...
Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural Networks (DNN). SGD iterates over the input data set in each epoch, processing samples in a random-access fashion. Because this puts enormous pressure on the I/O subsystem, a common approach in distributed HPC environments is to replicate the entire dataset to node-local SSDs. However, due to rapidly growing dataset sizes, this has become increasingly infeasible. Surprisingly, the questions of why full random access is needed and to what extent it is required have not received a lot of attention...
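The trade-off this abstract raises can be illustrated with a minimal sketch (function names, the sharding scheme, and seeding are illustrative assumptions, not the paper's method): a full per-epoch shuffle requires access to the entire dataset, while a static per-worker shard with local shuffling needs only node-local data.

```python
import random

def full_shuffle_epoch(dataset, seed):
    # Baseline: a global random permutation; every worker needs the full dataset.
    order = list(range(len(dataset)))
    random.Random(seed).shuffle(order)
    return [dataset[i] for i in order]

def partial_local_epoch(dataset, num_workers, worker_id, seed):
    # Sketch: each worker keeps a static 1/num_workers shard (e.g., on a local SSD)
    # and shuffles only within it, trading global randomness for local I/O.
    shard = dataset[worker_id::num_workers]
    order = list(range(len(shard)))
    random.Random(seed + worker_id).shuffle(order)  # per-worker seed (illustrative)
    return [shard[i] for i in order]

data = list(range(16))
epoch = partial_local_epoch(data, num_workers=4, worker_id=1, seed=0)
assert sorted(epoch) == [1, 5, 9, 13]  # worker 1 only ever touches its own shard
```

The sketch makes the I/O argument concrete: the local variant never reads outside its shard, so the question becomes how much of the global shuffle's statistical benefit survives.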
Turning towards exascale systems and beyond, it has been widely argued that the currently available software is not going to be feasible due to various requirements, such as the ability to deal with heterogeneous architectures, the need for low-level optimization targeting specific applications, the elimination of OS noise, and, at the same time, compatibility with legacy applications. To cope with these issues, a hybrid operating system design, where light-weight specialized kernels cooperate with a traditional kernel, seems adequate, and a number...
The two most common parallel execution models for many-core CPUs today are multiprocess (e.g., MPI) and multithread (e.g., OpenMP). The multiprocess model allows each process to own a private address space, although processes can explicitly allocate shared-memory regions. The multithread model shares all address space by default, although threads can move data to thread-private storage. In this paper, we present a third model called process-in-process (PiP), where multiple processes are mapped into a single virtual address space. Thus, each process still owns its process-private storage...
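The contrast between the two conventional models can be sketched in a few lines (this illustrates the default sharing semantics only, not PiP itself, which requires mapping processes into one address space at the runtime level):

```python
import threading
import multiprocessing as mp

# Multithread model: all threads share the address space by default.
shared = {"x": 0}

def thread_body():
    shared["x"] += 1          # directly visible to the parent thread

t = threading.Thread(target=thread_body)
t.start(); t.join()
assert shared["x"] == 1

# Multiprocess model: private address spaces; sharing must be explicit.
def proc_body(val):
    with val.get_lock():
        val.value += 1        # visible only because this region was explicitly shared

ctx = mp.get_context("fork")  # fork avoids re-importing this module in the child
v = ctx.Value("i", 0)
p = ctx.Process(target=proc_body, args=(v,))
p.start(); p.join()
assert v.value == 1
```

PiP sits between these: process-private storage as in the first model, but with the zero-copy data visibility of the second.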
This article proposes a pattern-based prefetching scheme with the support of adaptive cache management at the flash translation layer of solid-state drives (SSDs). It works inside SSDs and has the features of OS independence and use transparency. Specifically, it first mines frequent block access patterns that reflect the correlation among previously occurred I/O requests. Then, it compares the requests in the current time window against the identified patterns to direct prefetching data into the cache of SSDs. More importantly, to maximize cache use efficiency, we build a mathematical model...
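The mine-then-match idea can be sketched as follows (the pattern length, support threshold, and matching rule are illustrative placeholders for whatever the actual scheme uses):

```python
from collections import Counter

def mine_patterns(trace, length=2, min_support=2):
    # Mine frequent consecutive block-access patterns from an I/O history trace.
    windows = Counter(tuple(trace[i:i + length])
                      for i in range(len(trace) - length + 1))
    return {p for p, c in windows.items() if c >= min_support}

def prefetch_candidates(current_window, patterns):
    # If the tail of the current window matches a pattern's prefix,
    # the pattern's final block is a prefetch candidate.
    out = []
    for p in patterns:
        k = len(p) - 1
        if tuple(current_window[-k:]) == p[:k]:
            out.append(p[-1])
    return out

history = [3, 7, 3, 7, 1, 3, 7, 9]
pats = mine_patterns(history)            # block 7 frequently follows block 3
assert (3, 7) in pats
assert 7 in prefetch_candidates([5, 3], pats)
```

A real FTL-resident scheme would additionally bound the pattern table and, as the abstract notes, manage the cache adaptively rather than prefetching unconditionally.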
Heterogeneous architectures, where a multicore processor is accompanied by a large number of simpler but more power-efficient CPU cores optimized for parallel workloads, have been receiving a lot of attention recently. At present, these co-processors, such as the Intel Xeon Phi product family, come with limited on-board memory, which requires manually partitioning computational problems into pieces that can fit into the device's RAM, as well as efficiently overlapping computation and communication. In this paper we...
Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need for fair and effective benchmarking that is representative of real-world use cases. MLPerf™ is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper,...
Distributed file systems have been widely deployed as back-end storage to offer I/O services for parallel/distributed applications that process large amounts of data. Data prefetching in distributed file systems is a well-known optimization technique which can mask both network and disk latency and consequently boost performance. Traditionally, data prefetching is initiated by the client file systems; however, such conventional schemes are not well suited to machines with limited memory and computing capacity. To offer an efficient prefetching approach...
Distributed virtual environments (DVE), such as multi-player online games and distributed simulations, may involve a massive number of concurrent clients. Deploying multi-server architectures is currently the most prevalent way of providing large-scale DVE services, where the virtual space is typically divided into several distinct regions, requiring each server to handle only a part of the virtual world. Inequalities in client distribution may, however, cause certain servers to become overloaded, which potentially degrades interactivity...
Most flash-based solid-state drives (SSDs) adopt an onboard dynamic random access memory (DRAM) to buffer hot write data. The write or overwrite operations can then be absorbed by the DRAM cache, given that there is sufficient locality in the applications' I/O pattern, consequently avoiding flushing data onto the underlying SSD cells. After analyzing typical real-world workloads over SSDs, we observed that the buffered data of small-size requests are more likely to be reaccessed than those of large requests. To efficiently utilize...
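The observation suggests an eviction policy that prefers flushing data written by large requests. A minimal sketch, assuming a simple LRU order and an illustrative size threshold (neither is claimed to match the paper's actual design):

```python
from collections import OrderedDict

class SizeAwareWriteBuffer:
    # Hypothetical DRAM write buffer: on overflow, evict the least-recently-used
    # entry that came from a *large* request first, since (per the observation
    # above) small-request data is more likely to be reaccessed.
    def __init__(self, capacity, small_threshold=4):
        self.capacity = capacity
        self.small_threshold = small_threshold
        self.buf = OrderedDict()     # lba -> originating request size, LRU order
        self.flushed = []            # entries written back to the flash cells

    def write(self, lba, size):
        self.buf.pop(lba, None)      # overwrite hit: absorb in DRAM
        self.buf[lba] = size
        while len(self.buf) > self.capacity:
            self._evict()

    def _evict(self):
        victim = next((l for l, s in self.buf.items() if s > self.small_threshold),
                      next(iter(self.buf)))   # fall back to plain LRU
        self.flushed.append(victim)
        del self.buf[victim]

buf = SizeAwareWriteBuffer(capacity=2)
buf.write(10, size=1)   # small write
buf.write(20, size=8)   # large write
buf.write(30, size=1)   # overflow: the large entry is flushed, not the older small one
assert buf.flushed == [20] and 10 in buf.buf
```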
Checkpoint-recovery based Virtual Machine (VM) replication is an emerging approach towards accommodating VM installations with high availability. However, it comes at the price of significant performance degradation of the application executed in the VM, due to the large amount of state that needs to be synchronized between the primary and backup machines. It is therefore critical to find new ways of attaining good performance while, at the same time, maintaining fault tolerant execution. In this paper, we present a novel approach to improve...
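The synchronization cost the abstract refers to is easiest to see in an incremental-checkpoint sketch: each epoch, only the pages dirtied since the last checkpoint are shipped to the backup. The page representation and function names here are hypothetical illustrations, not the paper's mechanism:

```python
def checkpoint_delta(prev_pages, cur_pages):
    # Compute the dirty set: pages whose contents changed since the last epoch.
    return {pid: data for pid, data in cur_pages.items()
            if prev_pages.get(pid) != data}

def replicate(primary_epochs):
    # Toy replication loop: the backup applies each epoch's delta.
    backup, prev, transferred = {}, {}, 0
    for cur in primary_epochs:
        delta = checkpoint_delta(prev, cur)
        transferred += len(delta)     # proxy for synchronization traffic
        backup.update(delta)
        prev = dict(cur)
    return backup, transferred

epochs = [{1: "a", 2: "b"}, {1: "a", 2: "c"}, {1: "a", 2: "c"}]
backup, n = replicate(epochs)
assert backup == {1: "a", 2: "c"} and n == 3   # 2 initial pages + 1 dirty page
```

The sketch shows why write-heavy guests degrade badly: the per-epoch transfer grows with the dirty set, which checkpoint epochs must wait on.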
As system sizes increase to exascale and beyond, there is a need to enhance the system software to meet the needs and challenges of applications. The evolutionary versus revolutionary debate can be set aside by providing system software that simultaneously supports existing and new programming models. The seemingly contradictory requirements of scalable performance and traditional rich APIs (POSIX, and Linux in particular) suggest a multi-kernel approach, which has led to a class of research. Traditionally, operating systems for extreme-scale computing have followed two...
Lightweight kernels (LWK) have been in use on the compute nodes of supercomputers for decades. Although many high-end systems now run Linux, interest in LWK options and alternatives has increased in the last couple of years. Future extreme-scale systems will require rethinking the operating system, and modern LWKs may well play a role in the final solution.
Read disturb is a circuit-level noise in solid-state drives (SSDs), which may corrupt existing data in SSD blocks and then cause a high read error rate and longer read latency. The approach of refresh is commonly used to avoid read errors by periodically migrating the data of hot blocks to other free blocks, but it places considerable negative impacts on I/O (Input/Output) responsiveness. This article proposes scheduling approaches for read and write operations to mitigate the effects caused by read disturb. To be specific, we first construct a model to classify...
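The baseline refresh mechanism the article aims to improve on can be sketched as a per-block read counter with threshold-triggered migration (the threshold value and class names are illustrative, not taken from the article):

```python
class ReadDisturbTracker:
    # Toy refresh policy: migrate a block's data to a free block and reset its
    # counter once its accumulated read count reaches the disturb threshold.
    def __init__(self, threshold=100):
        self.threshold = threshold
        self.reads = {}
        self.refreshed = []   # log of refresh (migration) events

    def read(self, block):
        self.reads[block] = self.reads.get(block, 0) + 1
        if self.reads[block] >= self.threshold:
            self.refreshed.append(block)   # migration competes with host I/O
            self.reads[block] = 0

t = ReadDisturbTracker(threshold=3)
for _ in range(7):
    t.read("blk0")
assert t.refreshed == ["blk0", "blk0"]   # refreshed after the 3rd and 6th read
```

Each append models a background migration that steals bandwidth from host requests, which is exactly the responsiveness impact the proposed scheduling targets.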
The increasing prevalence of co-processors, such as the Intel Xeon Phi, has been reshaping the high performance computing (HPC) landscape. The Phi comes with a large number of power-efficient CPU cores, but at the same time it is a highly memory-constrained environment, leaving memory management entirely up to application developers. To reduce programming complexity, we are focusing on transparent, operating system (OS) level hierarchical memory management.
Multi-kernels leverage today's multi-core chips to run multiple operating system (OS) kernels, typically a Light Weight Kernel (LWK) and a Linux kernel, simultaneously. The LWK provides high performance and scalability, while the Linux kernel provides compatibility. Multi-kernels show promise of being able to meet tomorrow's extreme-scale computing needs by providing strong isolation, yielding the scalability needed by classical HPC applications. McKernel and mOS started as independent research initiatives to explore the above potential. Previous...
On the verge of convergence between high-performance computing and Big Data processing, it has become increasingly prevalent to deploy large-scale data analytics workloads on high-end supercomputers. Such applications often come in the form of complex workflows with various different components, assimilating data from scientific simulations as well as measurements streamed from sensor networks, such as radars and satellites. For example, as part of the Flagship 2020 (post-K) supercomputer project of Japan, RIKEN is...
In HPC, two trends have led to the emergence and popularity of an operating-system approach in which multiple kernels are run simultaneously on each compute node. The first trend has been the increasing complexity of the HPC software environment, which has placed traditional kernel approaches under stress. Meanwhile, microprocessors with more cores are being produced, allowing specialization within a node. As is typical of an emerging field, different groups are considering many ways of deploying multi-kernels.
Page-based memory management (paging) is utilized by most of the current operating systems (OSs) due to its rich features, such as the prevention of memory fragmentation and fine-grained access control. Paged virtual memory, however, stores virtual-to-physical mappings in page tables that also reside in main memory. Because translating addresses requires walking the page tables, which in turn implies additional memory accesses, modern CPUs employ translation lookaside buffers (TLBs) to cache mappings. Nevertheless, TLBs are limited in size...
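The translation path described above can be sketched as a small simulation, assuming a single-level page table, a 4 KiB page size, and LRU replacement (real hardware uses multi-level tables and set-associative TLBs):

```python
from collections import OrderedDict

PAGE_SIZE = 4096

def translate(vaddr, tlb, page_table, stats, tlb_entries=4):
    # TLB lookup first; on a miss, "walk" the page table (extra memory accesses)
    # and install the mapping, evicting the least recently used entry if full.
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:
        stats["hits"] += 1
        tlb.move_to_end(vpn)
    else:
        stats["misses"] += 1
        tlb[vpn] = page_table[vpn]
        if len(tlb) > tlb_entries:
            tlb.popitem(last=False)
    return tlb[vpn] * PAGE_SIZE + offset

pt = {0: 7, 1: 3}                         # vpn -> pfn
tlb, stats = OrderedDict(), {"hits": 0, "misses": 0}
assert translate(4100, tlb, pt, stats) == 3 * PAGE_SIZE + 4   # miss: table walk
translate(4200, tlb, pt, stats)                               # same page: TLB hit
assert stats == {"hits": 1, "misses": 1}
```

Because the TLB is small, workloads with large or sparse footprints take the miss path often, which motivates larger page sizes and the size limitation noted in the abstract.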
Upcoming high-performance computing (HPC) platforms will have more complex memory hierarchies, with high-bandwidth on-package memory and, in the future, also non-volatile memory. How to use such deep hierarchies effectively remains an open research question. In this paper we evaluate the performance implications of a scheme based on a software-managed scratchpad with coarse-grained memory-copy operations for migrating application data structures between hierarchy levels. We expect that it can, under specific circumstances,...
Over the last three decades, innovations in the memory subsystem have primarily targeted overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities of future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a memory-oblivious method to gauge the upper-bound performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache...
With the growing prevalence of cloud computing and the increasing number of CPU cores in modern processors, symmetric multiprocessing (SMP) Virtual Machines (VM), i.e., virtual machines with multiple virtual CPUs, are gaining significance. However, accommodating SMP VMs with high availability at low overhead is still an open problem. Checkpoint-recovery based VM replication is an emerging approach, but it comes at the price of significant performance degradation of the application executed in the VM, due to the large amount of state that needs to be...
Many-core processors are gathering attention in the area of embedded systems due to their power-performance ratios. To utilize the cores of a many-core processor in parallel, programmers build multi-task applications that use task models provided by operating systems. However, conventional task models cause some scalability problems when executed on many-core processors. In this paper, a new task model named Partitioned Virtual Address Space (PVAS), which solves these problems, is proposed. PVAS enhances inter-task communication and...