- Advanced Data Storage Technologies
- Distributed and Parallel Computing Systems
- Caching and Content Delivery
- Cloud Computing and Resource Management
- Distributed Systems and Fault Tolerance
- Research Data Management Practices
- Scientific Computing and Data Management
- Parallel Computing and Optimization Techniques
- Software System Performance and Reliability
- Error Correcting Code Techniques
- Network Security and Intrusion Detection
- Algorithms and Data Compression
Iowa State University
2020-2024
Samsung (United States)
2022
Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot address the challenges due to their incompatible provenance models and/or system implementations. In this paper, we analyze four representative scientific workflows in collaboration with the domain scientists to identify concrete provenance needs. Based on the first-hand analysis, we propose a provenance framework called PROV-IO...
As core components of high-performance computing (HPC) platforms, parallel file systems (PFSes) grow quickly in scale and complexity, and hence are subject to various failures and anomalies. Identifying their anomalies in runtime is critically helpful for HPC operators and administrators. Analyzing the logs to detect anomalies of large-scale systems has been proven effective in many recent studies. However, applying them to PFSes faces significant challenges due to the large volume and irregularity of PFSes logs. This study proposes SentiLog, a new...
Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied or understood compared with that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed the latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address the challenge, we perform a study of the failure recovery and logging mechanisms in PFSs in this article. First, to trigger the failure recovery and logging operations in the target...
Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot address the challenges due to their incompatible provenance models and/or system implementations.
Parallel file systems (PFSes) play an essential role in high performance computing. To ensure the integrity, many PFSes are designed with a checker component, which serves as the last line of defense to bring a corrupted PFS back to a healthy state. Motivated by real-world incidents of PFS corruptions, we perform a fine-grained study on the capability of PFS checkers in this paper. We apply type-aware fault injection on specific PFS structures, and examine the detection and repair policies of PFS checkers meticulously via a well-defined taxonomy. The...
The metadata service (MDS) sits on the critical path for distributed file system (DFS) operations, and therefore it is key to the overall performance of a large-scale DFS. Common "serverful" MDS architectures, such as a single server or a cluster of servers, have a significant shortcoming: either they are not scalable, or they make it difficult to achieve an optimal balance of performance, resource utilization, and cost. A modern MDS requires a novel architecture that addresses this shortcoming.
The metadata service (MDS) sits on the critical path for distributed file system (DFS) operations, and therefore it is key to the overall performance of a large-scale DFS. Common "serverful" MDS architectures, such as a single server or a cluster of servers, have a significant shortcoming: either they are not scalable, or they make it difficult to achieve an optimal balance of performance, resource utilization, and cost. A modern MDS requires a novel architecture that addresses this shortcoming. To this end, we design and implement...
Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot address the challenges due to their incompatible provenance models and/or system implementations. In this paper, we analyze four representative scientific workflows in collaboration with the domain scientists to identify concrete provenance needs. Based on the first-hand analysis, we propose a provenance framework called PROV-IO+, which...
Diagnosing storage system failures is challenging even for professionals. One example is the "When Solid State Drives Are Not That Solid" incident that occurred at the Algolia data center, where Samsung SSDs were mistakenly blamed for failures caused by a Linux kernel bug. As system complexity keeps increasing, such obscure failures will likely occur more often. As one step to address the challenge, we present our on-going efforts called X-Ray. Different from traditional methods that focus on either the software or the hardware, X-Ray...
Many storage applications such as file system checkers, defragmentation tools, etc. require a detailed understanding of file systems. Such file-system aware applications play an essential role today, but unfortunately they are error-prone. To better understand the challenges as well as the opportunities to address the issues, this paper presents an empirical study of real world bugs in file-system aware applications. By analyzing 59 bug cases from 4 representative applications in depth, we derive multiple insights in terms of general bug patterns, triggering conditions,...