NFDI4DS | UHH-SEMS - Publication Details

PROV-IO: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems

OPENALEX - Publications

Runzhou Han, Mai Zheng, Suren Byna, Houjun Tang, Bin Dong, and 5 more

Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot address the challenges due to their incompatible provenance models and/or system implementations. In this paper, we analyze four representative scientific workflows in collaboration with the domain scientists to identify concrete provenance needs. Based on the first-hand analysis, we propose a provenance framework called PROV-IO...

10.1109/tpds.2024.3374555 article EN IEEE Transactions on Parallel and Distributed Systems 2024-03-14

SentiLog

Di Zhang, Dong Dai, Runzhou Han, Mai Zheng

As core components of high-performance computing (HPC) platforms, parallel file systems (PFSes) grow quickly in scale and complexity, and hence are subject to various failures and anomalies. Identifying their anomalies at runtime is critically helpful for HPC operators and administrators. Analyzing the logs to detect anomalies in large-scale systems has been proven effective in many recent studies. However, applying them to PFSes faces significant challenges due to the large volume and irregularity of PFS logs. This study proposes SentiLog, a new...

10.1145/3465332.3470873 article EN 2021-07-20

A Study of Failure Recovery and Logging of High-Performance Parallel File Systems

Runzhou Han, Om Rameshwar Gatla, Mai Zheng, Jinrui Cao, Di Zhang, and 3 more

Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied or understood compared with that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed the latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address the challenge, we perform a study of the failure recovery and logging mechanisms of PFSs in this article. First, to trigger the failure recovery and logging operations of the target...

10.1145/3483447 article EN ACM Transactions on Storage 2022-03-29

PROV-IO

Runzhou Han, Suren Byna, Houjun Tang, Bin Dong, Mai Zheng

Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot address the challenges due to their incompatible provenance models and/or system implementations.

10.1145/3502181.3531477 article EN 2022-06-23

Revisiting Erasure Codes: A Configuration Perspective

Runzhou Han, Shi Chao, Tabassum Mahmud, Z. Yang, Vladislav Esaulov, and 5 more

10.1145/3655038.3665951 article EN 2024-06-27

Fingerprinting the Checker Policies of Parallel File Systems

Runzhou Han, Duo Zhang, Mai Zheng

Parallel file systems (PFSes) play an essential role in high performance computing. To ensure the integrity, many PFSes are designed with a checker component, which serves as the last line of defense to bring a corrupted PFS back to a healthy state. Motivated by real-world incidents of PFS corruptions, we perform a fine-grained study on the capability of PFS checkers in this paper. We apply type-aware fault injection to specific PFS structures, and examine the detection and repair policies of the checkers meticulously via a well-defined taxonomy. The...

10.1109/pdsw51947.2020.00013 article EN 2020-11-01

λFS: A Scalable and Elastic Distributed File System Metadata Service using Serverless Functions

Benjamin Carver, Runzhou Han, Jingyuan Zhang, Mai Zheng, Yue Cheng

The metadata service (MDS) sits on the critical path for distributed file system (DFS) operations, and therefore it is key to the overall performance of a large-scale DFS. Common "serverful" MDS architectures, such as a single server or a cluster of servers, have a significant shortcoming: either they are not scalable, or they make it difficult to achieve an optimal balance of performance, resource utilization, and cost. A modern MDS requires a novel architecture that addresses this shortcoming.

10.1145/3623278.3624765 article EN cc-by-nc-sa 2023-03-25

λFS: A Scalable and Elastic Distributed File System Metadata Service using Serverless Functions

Benjamin Carver, Runzhou Han, J. Zhang, Mai Zheng, Yue Cheng

The metadata service (MDS) sits on the critical path for distributed file system (DFS) operations, and therefore it is key to the overall performance of a large-scale DFS. Common "serverful" MDS architectures, such as a single server or a cluster of servers, have a significant shortcoming: either they are not scalable, or they make it difficult to achieve an optimal balance of performance, resource utilization, and cost. A modern MDS requires a novel architecture that addresses this shortcoming. To this end, we design and implement...

10.48550/arxiv.2306.11877 preprint EN other-oa arXiv (Cornell University) 2023-01-01

PROV-IO+: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems

Runzhou Han, Mai Zheng, Suren Byna, Houjun Tang, Bin Dong, and 6 more

Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot address the challenges due to their incompatible provenance models and/or system implementations. In this paper, we analyze four representative scientific workflows in collaboration with the domain scientists to identify concrete provenance needs. Based on the first-hand analysis, we propose a provenance framework called PROV-IO+, which...

10.48550/arxiv.2308.00891 preprint EN cc-by-nc-sa arXiv (Cornell University) 2023-01-01

On Failure Diagnosis of the Storage Stack

Duo Zhang, Om Rameshwar Gatla, Runzhou Han, Mai Zheng

Diagnosing storage system failures is challenging even for professionals. One example is the "When Solid State Drives Are Not That Solid" incident that occurred at the Algolia data center, where Samsung SSDs were mistakenly blamed for failures caused by a Linux kernel bug. As system complexity keeps increasing, such obscure failures will likely occur more often. As one step to address the challenge, we present our on-going efforts called X-Ray. Different from traditional methods that focus on either the software or the hardware, X-Ray...

10.48550/arxiv.2005.02547 preprint EN other-oa arXiv (Cornell University) 2020-01-01

On the Reproducibility of Bugs in File-System Aware Storage Applications

Duo Zhang, Tabassum Mahmud, Om Rameshwar Gatla, Runzhou Han, Yong Chen, and 1 more

Many storage applications such as file system checkers, defragmentation tools, etc. require a detailed understanding of file systems. Such file-system aware storage applications play an essential role today, but unfortunately they are error-prone. To better understand the challenges as well as the opportunities to address the issues, this paper presents an empirical study of real world bugs in file-system aware storage applications. By analyzing 59 bug cases from 4 representative applications in depth, we derive multiple insights in terms of general bug patterns, triggering conditions,...

10.1109/nas55553.2022.9925445 article EN 2022-10-01