Balazs Gerofi

ORCID: 0009-0004-8585-6031
Research Areas
  • Parallel Computing and Optimization Techniques
  • Advanced Data Storage Technologies
  • Distributed and Parallel Computing Systems
  • Cloud Computing and Resource Management
  • Caching and Content Delivery
  • Distributed Systems and Fault Tolerance
  • Interconnection Networks and Systems
  • Scientific Computing and Data Management
  • Advanced Neural Network Applications
  • Matrix Theory and Algorithms
  • Meteorological Phenomena and Simulations
  • Network Traffic and Congestion Control
  • Advanced Memory and Neural Computing
  • Stochastic Gradient Optimization Techniques
  • Computational Geometry and Mesh Generation
  • Advanced Electron Microscopy Techniques and Applications
  • Tropical and Extratropical Cyclones Research
  • Advanced Optimization Algorithms Research
  • Adversarial Robustness in Machine Learning
  • Security and Verification in Computing
  • X-ray Spectroscopy and Fluorescence Analysis
  • Neural Networks and Applications
  • Peer-to-Peer Network Technologies
  • Advanced Algorithms and Applications
  • Ferroelectric and Negative Capacitance Devices

RIKEN Center for Computational Science
2015-2024

Intel (United States)
2022-2024

Tokyo Institute of Technology
2022

University of Chicago
2022

National Institute of Informatics
2022

Argonne National Laboratory
2022

Fujitsu (Japan)
2022

Institut Polytechnique de Paris
2022

Lawrence Berkeley National Laboratory
2022

Columbia University
2022

Following the invention of the telegraph, the electronic computer, and remote sensing, "big data" is bringing another revolution to weather prediction. As sensor and computer technologies advance, orders-of-magnitude bigger data are produced by new sensors and high-precision simulations. Data assimilation (DA) is a key component of numerical weather prediction (NWP), integrating real-world sensor data into computer simulation. However, current DA and NWP systems are not designed to handle the data from next-generation big sensors. Therefore, we propose "big data assimilation"...

10.1109/jproc.2016.2602560 article EN cc-by Proceedings of the IEEE 2016-09-26

The extreme degree of parallelism in high-end computing requires low operating system noise so that large scale, bulk-synchronous parallel applications can be run efficiently. Noiseless execution has been historically achieved by deploying lightweight kernels (LWK), which, on the other hand, provide only a restricted set of the POSIX API in exchange for scalability. However, the increasing prevalence of more complex application constructs, such as in-situ analysis and workflow composition, dictates the need for rich...

10.1109/ipdps.2016.80 article EN 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016-05-01

Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural Networks (DNN). SGD iterates over the input data set in each epoch, processing samples in a random access fashion. Because this puts enormous pressure on the I/O subsystem, the most common approach in distributed HPC environments is to replicate the entire dataset to node local SSDs. However, due to rapidly growing data set sizes, this has become increasingly infeasible. Surprisingly, the questions of why random access is required and to what extent it is required have not received a lot of attention...

10.1109/ipdps53621.2022.00109 article EN 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2022-05-01
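The access pattern discussed in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation; `partial_cache_hits` and its cache-the-lowest-indices policy are hypothetical stand-ins for a real storage backend and caching strategy:

```python
import random

def epoch_indices(num_samples, seed):
    """Return the globally shuffled visit order for one epoch.

    SGD-style training touches every sample exactly once per epoch,
    but in a random order -- which is why naive streaming from shared
    storage degenerates into many small random reads.
    """
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)
    return order

def partial_cache_hits(order, cached_fraction):
    """Fraction of accesses served by a node-local cache holding only
    `cached_fraction` of the dataset (hypothetical policy: cache the
    lowest-numbered samples)."""
    cutoff = int(len(order) * cached_fraction)
    hits = sum(1 for i in order if i < cutoff)
    return hits / len(order)
```

With uniformly random access, a cache holding 30% of the samples serves 30% of the reads per epoch, which is why full replication is the naive default.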

Turning towards exascale systems and beyond, it has been widely argued that the currently available system software is not going to be feasible due to various requirements, such as the ability to deal with heterogeneous architectures, the need for node-level optimization targeting specific applications, the elimination of OS noise, and, at the same time, compatibility with legacy applications. To cope with these issues, a hybrid operating system design, where light-weight specialized kernels can cooperate with a traditional kernel, seems adequate, and a number...

10.1109/hipc.2014.7116885 article EN 2014-12-01

The two most common parallel execution models for many-core CPUs today are multiprocess (e.g., MPI) and multithread (e.g., OpenMP). The multiprocess model allows each process to own a private address space, although processes can explicitly allocate shared-memory regions. The multithreaded model shares all address space by default, although threads can move data to thread-private storage. In this paper, we present a third model called process-in-process (PiP), where multiple processes are mapped into a single virtual address space. Thus, each process still owns its process-private storage...

10.1145/3208040.3208045 article EN 2018-06-11
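The conventional multiprocess baseline this abstract contrasts with can be illustrated with Python's standard library: sharing between private address spaces must be set up explicitly through a named segment. The segment name is illustrative, and this is not the PiP API:

```python
from multiprocessing import shared_memory

# In the classic multiprocess model each process has a private address
# space; sharing requires explicitly creating a named shared region.
seg = shared_memory.SharedMemory(create=True, size=16, name="demo_seg")
seg.buf[:5] = b"hello"

# A second process would attach by name; here we attach a second
# handle in the same process to keep the sketch self-contained.
peer = shared_memory.SharedMemory(name="demo_seg")
data = bytes(peer.buf[:5])

peer.close()
seg.close()
seg.unlink()
```

PiP removes this setup step by mapping the processes into one virtual address space, so any process-private data is directly addressable by its peers.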

This article proposes a pattern-based prefetching scheme with the support of adaptive cache management at the flash translation layer of solid-state drives (SSDs). It works inside SSDs and has the features of OS independence and use transparency. Specifically, it first mines frequent block access patterns that reflect the correlation among occurred I/O requests. Then, it compares the I/O requests in the current time window with the identified patterns to direct prefetching of data into the cache of SSDs. More importantly, to maximize cache use efficiency, we build a mathematical model...

10.1145/3474393 article EN ACM Transactions on Storage 2022-01-29
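The mine-then-match idea in this abstract can be sketched as follows. This is a strong simplification under assumed semantics (patterns reduced to frequent block-successor pairs; `min_support` is a hypothetical parameter), not the paper's scheme:

```python
from collections import Counter, defaultdict

def mine_successors(history, min_support=2):
    """Mine frequent block -> most-common-successor pairs from a past
    access trace; keep a pair only if it was seen min_support times."""
    pairs = defaultdict(Counter)
    for a, b in zip(history, history[1:]):
        pairs[a][b] += 1
    return {a: c.most_common(1)[0][0]
            for a, c in pairs.items()
            if c.most_common(1)[0][1] >= min_support}

def prefetch_candidate(window, patterns):
    """Match the tail of the current request window against the mined
    patterns and suggest the next block to prefetch (or None)."""
    return patterns.get(window[-1])
```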

Heterogeneous architectures, where a multicore processor is accompanied by a large number of simpler, but more power-efficient CPU cores optimized for parallel workloads, have been receiving a lot of attention recently. At present, these co-processors, such as the Intel Xeon Phi product family, come with limited on-board memory, which requires partitioning computational problems manually into pieces that can fit into the device's RAM, as well as efficiently overlapping computation and communication. In this paper we...

10.1109/ccgrid.2013.59 article EN 2013-05-01

Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that is representative of real-world use cases. MLPerf™ is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper,...

10.1109/mlhpc54614.2021.00009 article EN 2021-11-01

Distributed file systems have been widely deployed as back-end storage to offer I/O services for parallel/distributed applications that process large amounts of data. Data prefetching in distributed file systems is a well-known optimization technique which can mask both network and disk latency and consequently boost I/O performance. Traditionally, data prefetching is initiated by the client machines of the file systems; however, conventional prefetching schemes are not well suited for machines with limited memory and computing capacity. To provide an efficient approach...

10.1109/tpds.2015.2496595 article EN IEEE Transactions on Parallel and Distributed Systems 2015-10-30

Distributed virtual environments (DVE), such as multi-player online games and distributed simulations, may involve a massive amount of concurrent clients. Deploying multi-server architectures is currently the most prevalent way of providing large-scale DVE services, where typically the virtual space is divided into several distinct regions, requiring each server to handle only a part of the virtual world. Inequalities in client distribution may, however, cause certain servers to become overloaded, which potentially degrades interactivity...

10.1109/cluster.2010.25 article EN 2010-09-01

Most flash-based solid-state drives (SSDs) adopt an onboard dynamic random access memory (DRAM) to buffer hot write data. Then, the write or overwrite operations can be absorbed by the DRAM cache, given that there is sufficient locality in the applications' I/O access pattern, and consequently avoid flushing data onto the underlying SSD cells. After analyzing typical real-world workloads over SSDs, we observed that the buffered data of small-size write requests are more likely to be reaccessed than those of large write requests. To efficiently utilize...

10.1109/tcad.2022.3229293 article EN IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2022-12-14

Checkpoint-recovery based Virtual Machine (VM) replication is an emerging approach towards accommodating VM installations with high availability. However, it comes the price of significant performance degradation application executed in due to large amount state that needs be synchronized between primary and backup machines. It therefore critical find new ways for attaining good performance, at same time, maintaining fault tolerant execution. In this paper, we present a novel improve...

10.1109/ucc.2011.20 article EN 2011-12-01

As systems sizes increase to exascale and beyond, there is a need enhance the system software meet needs challenges of applications. The evolutionary versus revolutionary debate can be set aside by providing that simultaneously supports existing new programming models. seemingly contradictory requirements scalable performance traditional rich APIs (POSIX, Linux in particular) suggest approach, has lead class research. Traditionally, operating for extreme-scale computing have followed two...

10.1145/2768405.2768410 article EN 2015-06-12

Lightweight kernels (LWK) have been in use on the compute nodes of supercomputers for decades. Although many high-end systems now run Linux, interest options and alternatives has increased last couple years. Future extreme-scale require rethinking operating system, modern LWKs may well play a role final solution.

10.1145/2768405.2768414 article EN 2015-06-12

Read disturb is a circuit-level noise in solid-state drives (SSDs), which may corrupt existing data SSD blocks and then cause high read error rate longer latency. The approach of refresh commonly used to avoid errors by periodically migrating the hot other free blocks, but it places considerable negative impacts on I/O (Input/Output) responsiveness. This article proposes scheduling approaches write operations, mitigate effects caused disturb. To be specific, we first construct model classify...

10.1145/3410332 article EN ACM Transactions on Design Automation of Electronic Systems 2020-09-01

The increasing prevalence of co-processors such as the Intel Xeon Phi, has been reshaping high performance computing (HPC) landscape. Phi comes with a large number power efficient CPU cores, but at same time, it's highly memory constraint environment leaving task management entirely up to application developers. To reduce programming complexity, we are focusing on transparent, operating system (OS) level hierarchical management.

10.1145/2600212.2600231 article EN 2014-06-20

Multi-kernels leverage today's multi-core chips to run multiple operating system (OS) kernels, typically a Light Weight Kernel (LWK) and Linux kernel, simultaneously. The LWK provides high performance scalability, while the kernel compatibility. show promise of being able meet tomorrow's extreme-scale computing needs providing strong isolation, yielding scalability needed by classical HPC applications. McKernel mOS started as independent research initiatives explore above potential. Previous...

10.1109/ipdps.2018.00022 article EN 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2018-05-01

Summary On the verge of convergence between high‐performance computing and Big Data processing, it has become increasingly prevalent to deploy large‐scale data analytics workloads on high‐end supercomputers. Such applications often come in form complex workflows with various different components, assimilating from scientific simulations as well measurements streamed sensor networks, such radars satellites. For example, part Flagship 2020 (post‐K) supercomputer project Japan, RIKEN is...

10.1002/cpe.4161 article EN Concurrency and Computation Practice and Experience 2017-05-15

In HPC, two trends have led to the emergence and popularity of an operating-system approach in which multiple kernels are run simultaneously on each compute node. The first trend has been increase complexity HPC software environment, placed traditional kernel approaches under stress. Meanwhile, microprocessors with more cores being produced, allowing specialization within a As is typical emerging field, different groups considering many deploying multi-kernels.

10.1145/2931088.2931092 article EN 2016-05-25

Page-based memory management (paging) is utilized by most of the current operating systems (OSs) due to its rich features such as prevention fragmentation and fine-grained access control. virtual memory, however, stores physical mappings in page tables that also reside main memory. Because translating addresses requires walking tables, which turn implies additional accesses, modern CPUs employ translation lookaside buffers (TLBs) cache mappings. Nevertheless, TLBs are limited size...

10.1145/2612262.2612264 article EN 2014-06-10

Upcoming high-performance computing (HPC) platforms will have more complex memory hierarchies with high-bandwidth on-package and in the future also non-volatile memory. How to use such deep effectively remains an open research question. In this paper we evaluate performance implications of a scheme based on software-managed scratchpad coarse-grained memory-copy operations migrating application data structures between hierarchy levels. We expect that can, under specificcircumstances,...

10.1109/cluster.2016.42 article EN 2016-09-01

Over the last three decades, innovations in memory subsystem were primarily targeted at overcoming data movement bottleneck. In this paper, we focus on a specific market trend technology: 3D-stacked and caches. We investigate impact of extending on-chip capabilities future HPC-focused processors, particularly by SRAM. First, propose method oblivious to gauge upper-bound performance improvements when costs are eliminated. Then, using gem5 simulator, model two variants hypothetical LARge Cache...

10.1145/3629520 article EN ACM Transactions on Architecture and Code Optimization 2023-10-25

With the growing prevalence of cloud computing and increasing number CPU cores in modern processors, symmetric multiprocessing (SMP) Virtual Machines (VM), i.e. virtual machines with multiple CPUs, are gaining significance. However, accommodating SMP high availability at low overhead is still an open problem. Checkpoint-recovery based VM replication emerging approach, but it comes price significant performance degradation application executed due to large amount state that needs be...

10.1109/cluster.2011.13 article EN 2011-09-01

Many-core processors are gathering attention in the areas of embedded systems due to their power-performance ratios. To utilize cores a many-core processor parallel, programmers build multi-task applications that use task models provided by operating systems. However, conventional cause some scalability problems when executed on processors. In this paper, new model named Partitioned Virtual Address Space (PVAS), which solves problems, is proposed. PVAS enhances inter-task communications and...

10.1145/2489068.2489075 article EN 2013-06-24
Coming Soon ...