Sarp Oral

ORCID: 0000-0001-8745-7078
Research Areas
  • Advanced Data Storage Technologies
  • Distributed and Parallel Computing Systems
  • Parallel Computing and Optimization Techniques
  • Caching and Content Delivery
  • Cloud Computing and Resource Management
  • Scientific Computing and Data Management
  • Interconnection Networks and Systems
  • Distributed systems and fault tolerance
  • Software System Performance and Reliability
  • Advanced Optical Network Technologies
  • Big Data and Business Intelligence
  • Peer-to-Peer Network Technologies
  • Research Data Management Practices
  • Privacy-Preserving Technologies in Data
  • Advanced Database Systems and Queries
  • Cloud Data Security Solutions
  • Network Time Synchronization Technologies
  • Software-Defined Networks and 5G
  • Time Series Analysis and Forecasting
  • Data Stream Mining Techniques
  • Data-Driven Disease Surveillance
  • China's Ethnic Minorities and Relations
  • Technology Assessment and Management
  • Reservoir Engineering and Simulation Methods
  • Innovation, Sustainability, Human-Machine Systems

Oak Ridge National Laboratory
2015-2024

Office of Scientific and Technical Information
2024

Naval Research Laboratory Information Technology Division
2016-2023

Oak Ridge Leadership Computing Facility
2023

Lawrence Berkeley National Laboratory
2021

Sandia National Laboratories
2017

University of Florida
2003-2005

CORAL, the Collaboration of Oak Ridge, Argonne, and Livermore, is fielding two similar IBM systems, Summit and Sierra, with NVIDIA GPUs that will replace the existing Titan and Sequoia systems. Summit and Sierra are currently ranked No. 1 and No. 3, respectively, on the Top500 list. We discuss the design and key differences of the two systems. Our evaluation of the systems highlights the following. Applications that fit in HBM see the most benefit and may prefer more GPUs; however, for some applications, CPU-GPU bandwidth is more important than the number of GPUs. The node-local burst buffer...

10.1109/sc.2018.00055 article EN 2018-11-01

As the US Department of Energy (DOE) computing facilities began deploying petascale systems in 2008, DOE was already setting its sights on exascale. In that year, DARPA published a report on the feasibility of reaching exascale. The authors identified several key challenges in the pursuit of exascale, including power, memory, concurrency, and resiliency. That report informed DOE's exascale strategy. With the deployment of Oak Ridge National Laboratory's Frontier supercomputer, we have officially entered the exascale era. In this paper, we discuss Frontier's...

10.1145/3581784.3607089 article EN 2023-11-11

The growth of computing power on large-scale systems requires commensurate high-bandwidth I/O systems. Many parallel file systems are designed to provide fast and sustainable I/O in response to applications' soaring requirements. To meet this need, a novel storage system is imperative to temporarily buffer the bursty I/O and gradually flush datasets to long-term parallel file systems. In this paper, we introduce the design of BurstMem, a high-performance burst buffer system. BurstMem provides a storage framework with efficient communication and I/O management strategies. Our...

10.1109/bigdata.2014.7004215 article EN 2014 IEEE International Conference on Big Data (Big Data) 2014-10-01
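
The core burst-buffer idea in this line of work, absorbing bursty application writes into a fast intermediate tier and draining them to the parallel file system in the background, can be sketched in a few lines. The following is an illustrative toy under invented names and timings, not BurstMem's actual implementation:

```python
import queue
import threading
import time

class ToyBurstBuffer:
    """Toy burst buffer: absorb writes fast, drain to slow storage async."""

    def __init__(self, drain_fn, capacity=64):
        self.buf = queue.Queue(maxsize=capacity)  # fast tier (here: memory)
        self.drain_fn = drain_fn                  # slow tier, e.g. PFS write
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, chunk):
        # Application-facing write: returns as soon as the fast tier accepts
        # the chunk, decoupling the application from PFS bandwidth.
        self.buf.put(chunk)

    def _drain(self):
        # Background flusher: gradually pushes buffered data to the PFS.
        while True:
            chunk = self.buf.get()
            self.drain_fn(chunk)
            self.buf.task_done()

    def flush(self):
        self.buf.join()  # block until all buffered data has been drained

def slow_pfs_write(chunk):
    time.sleep(0.01)     # stand-in for limited PFS bandwidth

bb = ToyBurstBuffer(slow_pfs_write)
for i in range(100):     # bursty output phase completes quickly
    bb.write(f"block-{i}")
bb.flush()               # data reaches long-term storage gradually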

This paper presents an extensive characterization, tuning, and optimization of parallel I/O on the Cray XT supercomputer, named Jaguar, at Oak Ridge National Laboratory. We have characterized the performance and scalability for different levels of the storage hierarchy, including a single Lustre object storage target, an S2A couplet, and the entire system. Our analysis covers both data- and metadata-intensive I/O patterns. In particular, for small, non-contiguous, data-intensive patterns we evaluated several techniques, such as data sieving and two-phase...

10.1109/ipdps.2008.4536277 article EN Proceedings - IEEE International Parallel and Distributed Processing Symposium 2008-04-01
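
Of the techniques named here, data sieving is easy to illustrate: instead of issuing many small non-contiguous reads, the I/O layer reads one covering extent and extracts the needed pieces in memory. A minimal sketch, with made-up file name and sizes:

```python
import os

def read_noncontiguous_naive(f, offsets, size):
    # One small read per fragment: many separate I/O requests.
    out = []
    for off in offsets:
        f.seek(off)
        out.append(f.read(size))
    return out

def read_noncontiguous_sieved(f, offsets, size):
    # Data sieving: one large covering read, then in-memory extraction.
    lo, hi = min(offsets), max(offsets) + size
    f.seek(lo)
    extent = f.read(hi - lo)
    return [extent[off - lo: off - lo + size] for off in offsets]

with open("demo.bin", "wb") as f:
    f.write(os.urandom(1 << 20))

with open("demo.bin", "rb") as f:
    offsets = range(0, 1 << 20, 4096)   # a small fragment every 4 KiB
    a = read_noncontiguous_naive(f, offsets, 64)
    b = read_noncontiguous_sieved(f, offsets, 64)
    assert a == b   # same data, far fewer requests in the sieved case
```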

Supercomputer I/O loads are often dominated by writes. HPC (High Performance Computing) file systems are designed to absorb these bursty outputs at high bandwidth through massive parallelism. However, the delivered write bandwidth often falls well below peak. This paper characterizes the data absorption behavior of a center-wide shared Lustre parallel file system on the Jaguar supercomputer. We use a statistical methodology to address the challenges of accurately measuring a shared machine under production load and to obtain the distribution of bandwidth across...

10.5555/2388996.2389007 article EN IEEE International Conference on High Performance Computing, Data, and Analytics 2012-11-10
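
The measurement approach, sampling aggregate write traffic on a production machine and reducing it to a bandwidth distribution rather than a single peak number, comes down to order statistics. A toy version over synthetic samples (all numbers invented):

```python
import random
import statistics

random.seed(0)
# Synthetic per-interval write-bandwidth samples (GB/s) from a busy system:
# mostly moderate traffic with occasional checkpoint bursts.
samples = [random.lognormvariate(2.0, 0.8) for _ in range(10_000)]

def percentile(data, p):
    s = sorted(data)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

print(f"mean   {statistics.mean(samples):7.1f} GB/s")
print(f"median {percentile(samples, 50):7.1f} GB/s")
print(f"p95    {percentile(samples, 95):7.1f} GB/s")
print(f"p99    {percentile(samples, 99):7.1f} GB/s")
# The gap between median and p99 is the point: delivered bandwidth is a
# distribution, and peak-like values are rare under production load.
```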

NAND flash memory is a preferred storage media for various platforms ranging from embedded systems to enterprise-scale systems. Flash devices do not have any mechanical moving parts and provide low-latency access. They also require less power compared to rotating media. Unlike hard disks, flash devices use out-of-place update operations, and they require a garbage collection (GC) process to reclaim invalid pages and create free blocks. This GC process is a major cause of performance degradation when running concurrently with other I/O operations, as internal...

10.1109/ispass.2011.5762711 article EN 2011-04-01
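
The GC mechanism described here, reclaiming blocks whose pages were invalidated by out-of-place updates, is commonly illustrated with a greedy policy: pick the block with the fewest valid pages, relocate those pages, and erase the block. A toy flash translation layer sketch (parameters arbitrary, not any real device's logic):

```python
PAGES_PER_BLOCK = 4

class ToyFTL:
    """Toy flash translation layer with greedy garbage collection."""

    def __init__(self, nblocks):
        self.blocks = [[] for _ in range(nblocks)]  # pages: (lpn, valid)
        self.map = {}       # logical page number -> (block, slot)
        self.copies = 0     # pages copied by GC (write amplification)

    def write(self, lpn):
        if lpn in self.map:                   # out-of-place update:
            b, s = self.map[lpn]              # old copy becomes invalid
            self.blocks[b][s] = (lpn, False)
        self._append(lpn)

    def _append(self, lpn):
        for b, blk in enumerate(self.blocks):
            if len(blk) < PAGES_PER_BLOCK:
                self.map[lpn] = (b, len(blk))
                blk.append((lpn, True))
                return
        self._gc()          # no free slots: reclaim, then retry
        self._append(lpn)

    def _gc(self):
        # Greedy victim: block with the fewest valid pages.
        victim = min(range(len(self.blocks)),
                     key=lambda b: sum(v for _, v in self.blocks[b]))
        valid = [lpn for lpn, v in self.blocks[victim] if v]
        self.blocks[victim] = []              # erase the block
        for lpn in valid:                     # relocate valid pages
            self.copies += 1
            del self.map[lpn]
            self._append(lpn)

ftl = ToyFTL(nblocks=8)
for i in range(200):
    ftl.write(i % 12)       # update-heavy workload forces GC
print("pages copied by GC:", ftl.copies)
```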

Unlike hard disks, flash devices use out-of-place update operations and require a garbage collection (GC) process to reclaim invalid pages and create free blocks. This GC process is a major cause of performance degradation when running concurrently with other I/O, as internal bandwidth is consumed to copy valid pages. The invocation of the GC process is generally governed by a low watermark on free blocks and other device metrics that different workloads meet at different intervals. This results in performance that is highly dependent on workload characteristics. In this paper, we...

10.1109/tcad.2012.2227479 article EN IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2013-01-21
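
The watermark behavior called out here can be made concrete: GC fires whenever the free-block count drops below a threshold, so the interval between GC invocations is set by the workload's write volume rather than by any fixed schedule. A toy illustration with invented thresholds:

```python
def gc_invocations(writes, free_blocks=100, low_watermark=20,
                   reclaimed_per_gc=30):
    """Count GC invocations for a write stream, where each write consumes
    a free block and GC fires at the low watermark (GC cost not modeled)."""
    invocations, free = 0, free_blocks
    for _ in range(writes):
        free -= 1
        if free <= low_watermark:       # low-watermark trigger
            free += reclaimed_per_gc
            invocations += 1
    return invocations

# Same device, different workloads -> very different GC frequency:
for writes in (200, 1000, 5000):
    print(f"{writes:5d} writes -> {gc_invocations(writes):4d} GC cycles")
```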

In this paper, we develop a predictive model useful for output performance prediction of supercomputer file systems under production load. Our target environment is Titan---the 3rd fastest supercomputer in the world---and its Lustre-based multi-stage write path. We observe from Titan that although output performance is highly variable at small time scales, the mean performance is stable and consistent over typical application run times. Moreover, we find that output performance is non-linearly related to its correlated parameters due to interference and saturation on individual stages...

10.1145/3078597.3078614 article EN 2017-06-23
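
The modeling claim, that mean write performance is stable over application-scale windows but non-linearly related to load parameters, suggests fitting a simple nonlinear curve to (load, bandwidth) observations. A sketch on synthetic data; the paper's actual model and features are richer, and everything below is invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic observations: bandwidth rises with offered load, then
# saturates and degrades as stages of the write path become contended.
load = rng.uniform(0, 10, 500)
bw = 40 * load / (1 + 0.12 * load**2) + rng.normal(0, 2, 500)

# Low-degree polynomial as a stand-in nonlinear predictor.
model = np.poly1d(np.polyfit(load, bw, deg=3))

for probe in (1.0, 3.0, 8.0):
    print(f"load {probe:3.1f} -> predicted {model(probe):5.1f} GB/s")
```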

Solid-State Drives (SSDs) offer significant performance improvements over hard disk drives (HDDs) on a number of workloads. The frequency of garbage collection (GC) activity is directly correlated with the pattern, frequency, and volume of write requests, and the scheduling of GC is controlled by logic internal to the SSD. SSDs can exhibit significant performance degradations when GC conflicts with an ongoing I/O request stream. When SSDs are used in a RAID array, the lack of coordination of the local GC processes amplifies these degradations. No RAID controller or SSD available...

10.1109/msst.2011.5937224 article EN 2011-05-01
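
The coordination problem is easy to see in miniature: a RAID stripe completes only when every SSD responds, so uncoordinated per-device GC pauses compound, while aligning the pauses bounds them. A toy model with invented timings and probabilities:

```python
import random

random.seed(2)
N_DEVICES, N_REQS = 8, 10_000
GC_PROB, GC_PENALTY = 0.02, 20   # per-request GC chance, stall multiplier

def mean_stripe_latency(coordinated):
    total = 0.0
    for _ in range(N_REQS):
        if coordinated:
            # Devices enter GC together: a stripe is either all-fast or
            # all-slow, so stalls are not amplified across devices.
            total += GC_PENALTY if random.random() < GC_PROB else 1
        else:
            # Each device schedules GC locally; the stripe waits for the
            # slowest device, so any one stall delays the whole request.
            total += max(GC_PENALTY if random.random() < GC_PROB else 1
                         for _ in range(N_DEVICES))
    return total / N_REQS

print("uncoordinated:", round(mean_stripe_latency(False), 2))
print("coordinated:  ", round(mean_stripe_latency(True), 2))
```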

The growing computing power on leadership HPC systems is often accompanied by ever-escalating failure rates. Checkpointing is a common defensive mechanism used by scientific applications for failure recovery. However, directly writing the large and bursty checkpointing dataset to the parallel file system can incur significant I/O contention on the storage servers. Such contention in turn degrades the bandwidth utilization of the storage servers and prolongs the average job I/O time of concurrent applications. Recently, burst buffers have been proposed as an...

10.1109/cluster.2015.38 article EN 2015-09-01
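
A standard back-of-envelope for the checkpointing trade-off this abstract describes (a classic rule of thumb, not a formula taken from the paper) is Young's approximation for the optimal checkpoint interval, tau ~ sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint. Faster checkpoint absorption, e.g., via a burst buffer, shrinks C and permits tighter intervals:

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation: near-optimal checkpoint interval (seconds)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

MTBF = 24 * 3600            # assumed one failure per day (invented)
for label, cost in (("direct to PFS", 600), ("via burst buffer", 60)):
    tau = young_interval(cost, MTBF)
    print(f"{label:17s} C={cost:4d}s -> checkpoint every {tau/60:5.1f} min")
```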

The Oak Ridge Leadership Computing Facility (OLCF) has deployed multiple large-scale parallel file systems (PFS) to support its operations. During this process, OLCF acquired significant expertise in storage system design, software development, technology evaluation, benchmarking, procurement, deployment, and operational practices. Based on the lessons learned from each new PFS deployment, OLCF improved its operating procedures and strategies. This paper provides an account of our experience acquiring, deploying,...

10.1109/sc.2014.23 article EN 2014-11-01

We introduce UnifyFS, a user-level file system that aggregates the node-local storage tiers available on high performance computing (HPC) systems and makes them available to HPC applications under a unified namespace. UnifyFS employs transparent I/O interception, so it does not require changes to application code and is compatible with commonly used I/O libraries. The design of UnifyFS supports the predominant HPC I/O workloads and is optimized for bulk-synchronous I/O patterns. Furthermore, UnifyFS provides customizable file system semantics to flexibly adapt its...

10.1109/ipdps54959.2023.00037 article EN 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2023-05-01
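
UnifyFS's transparent interception redirects I/O aimed at a unified mount point to node-local storage without application changes. As a loose user-space analogy only (the real system intercepts at the I/O-library and syscall-wrapper level in C; the prefix and paths below are made up), one can shim Python's open() to remap a namespace prefix:

```python
import builtins
import os
import tempfile

UNIFY_PREFIX = "/unifyfs/"           # hypothetical unified namespace
NODE_LOCAL = tempfile.mkdtemp()      # stand-in for a node-local SSD
_real_open = builtins.open

def unify_open(path, *args, **kwargs):
    # Transparent redirection: paths under the unified prefix are served
    # from node-local storage; everything else passes through untouched.
    if isinstance(path, str) and path.startswith(UNIFY_PREFIX):
        local = os.path.join(NODE_LOCAL, path[len(UNIFY_PREFIX):])
        os.makedirs(os.path.dirname(local), exist_ok=True)
        return _real_open(local, *args, **kwargs)
    return _real_open(path, *args, **kwargs)

builtins.open = unify_open           # "interception": invisible to the app

# Unmodified "application" code:
with open("/unifyfs/ckpt/rank0.dat", "w") as f:
    f.write("checkpoint data")
with open("/unifyfs/ckpt/rank0.dat") as f:
    print(f.read())                  # -> checkpoint data
```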

Although solid-state drives (SSDs) offer significant performance improvements over hard disk drives (HDDs) for a number of workloads, they can exhibit substantial variance in request latency and throughput as a result of garbage collection (GC). When GC conflicts with an I/O stream, the stream can make no forward progress until the GC cycle completes. GC cycles are scheduled by logic internal to the SSD based on several factors, such as the pattern, frequency, and volume of write requests. When SSDs are used in a RAID with currently available technology,...

10.1109/tc.2012.256 article EN IEEE Transactions on Computers 2014-03-18

The Oak Ridge Leadership Computing Facility (OLCF) is a leader in large-scale parallel file system development, design, deployment and continuous operation. For the last decade, OLCF has designed and deployed two large center-wide file systems. The first instantiation, Spider 1, served the Jaguar supercomputer and its predecessor, and the second instantiation, Spider 2, now serves the Titan supercomputer, among many other computational resources. OLCF has been rigorously collecting storage statistics from these systems since their transition to production state.

10.1145/2834976.2834985 article EN 2015-11-11

The I/O subsystem for the Summit supercomputer, No. 1 on the Top500 list, and its ecosystem of analysis platforms is composed of two distinct layers, namely the in-system layer and the center-wide parallel file system (PFS), Spider 3. The in-system layer uses node-local SSDs and provides 26.7 TB/s for reads, 9.7 TB/s for writes, and 4.6 billion IOPS to Summit. The Spider 3 PFS layer is based on IBM's Spectrum Scale™ and provides 2.5 TB/s and 2.6 million IOPS to Summit and other systems. While deploying them as two distinct layers was operationally efficient, it also presented usability challenges in terms of multiple mount points and lack...

10.1145/3295500.3356157 article EN 2019-11-07

Scientific computing workloads at HPC facilities have been shifting from traditional numerical simulations to AI/ML applications for training and inference, while processing and producing ever-increasing amounts of scientific data. To address the growing need for increased storage capacity, lower access latency, and higher bandwidth, emerging technologies such as non-volatile memory are being integrated into supercomputer I/O subsystems. With these trends, we need a better understanding of multilayer storage systems and ways to use...

10.1145/3502181.3531461 article EN 2022-06-23

Journaling is a widely used technique to increase file system robustness against metadata and/or data corruptions. While the overhead of journaling can be masked by the page cache for small-scale, local file systems, we found that journaling in Lustre's backend object store significantly impacted the overall performance of our large-scale center-wide parallel file system. By requiring each write request to wait for a journal transaction commit, Lustre introduced serialization to the client I/O stream and imposed additional latency due to disk head...

10.5555/1855511.1855522 article EN 2010-02-23
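
The serialization cost described here, each write stalling on its own journal transaction commit, can be contrasted with amortizing commits over many writes. A toy arithmetic sketch with invented service times (illustrating the bottleneck shape only, not Lustre's actual remedy, which the paper details):

```python
N_WRITES = 1000
WRITE_US, COMMIT_US = 50, 500   # invented per-op service times (microseconds)

# Per-write commit: every write stalls for its own journal commit.
per_write = N_WRITES * (WRITE_US + COMMIT_US)

# Batched/asynchronous commit: writes proceed, and commit cost is
# amortized over groups of writes instead of serializing each request.
GROUP = 100
grouped = N_WRITES * WRITE_US + (N_WRITES // GROUP) * COMMIT_US

print(f"per-write commit: {per_write / 1e6:.3f} s of journal-bound time")
print(f"group commit:     {grouped / 1e6:.3f} s")
```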

Ceph is an emerging open-source parallel distributed file and storage system. By design, Ceph leverages unreliable commodity network and storage hardware, and provides reliability and fault-tolerance via controlled object placement and data replication. This paper presents our block I/O performance and scalability evaluation of Ceph for scientific high-performance computing (HPC) environments. Our work makes two unique contributions. First, our evaluation is performed under a realistic setup for a large-scale capability HPC environment using a commercial...

10.1145/2538542.2538562 article EN 2013-11-15

Using parallel file systems efficiently is a tricky problem due to inter-dependencies among multiple layers of I/O software, including high-level libraries (HDF5, netCDF, etc.), MPI-IO, POSIX, and parallel file systems (GPFS, Lustre, etc.). Profiling tools such as Darshan collect traces to help understand the I/O performance behavior. However, there are significant gaps in analyzing the collected traces and then applying the tuning options offered by the various layers of I/O software. Seeking to connect the dots between bottleneck detection and tuning, we propose...

10.1109/pdsw54622.2021.00008 article EN 2021-11-01
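
The gap the paper targets, going from collected counters to a tuning suggestion, can be illustrated with a simple rule applied to Darshan-style per-file counters. The record fields and thresholds below are invented stand-ins, not Darshan's actual log schema:

```python
# Toy per-file I/O counters in the spirit of a Darshan log (names invented).
records = [
    {"file": "out.h5", "writes": 120_000, "bytes": 480_000_000,
     "collective": False},
    {"file": "mesh.nc", "writes": 64, "bytes": 512_000_000,
     "collective": True},
]

def suggest(rec):
    avg = rec["bytes"] / rec["writes"]
    if avg < 64 * 1024 and not rec["collective"]:
        # Many small independent writes: a classic PFS bottleneck; a common
        # remedy is collective buffering / two-phase I/O at the MPI-IO layer.
        return f"{rec['file']}: avg write {avg:,.0f} B -> try collective I/O"
    return f"{rec['file']}: no obvious small-write bottleneck"

for rec in records:
    print(suggest(rec))
```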

In recent years, non-volatile memory devices such as SSDs have emerged as a viable storage solution due to their increasing capacity and decreasing cost. Due to the unique capability requirements of the large-scale HPC (High Performance Computing) environment, a hybrid configuration (SSD and HDD) may represent one of the most available and balanced solutions considering cost and performance. Under this setting, effective data placement as well as data movement with controlled overhead become a pressing challenge. In this paper, we...

10.1109/msst.2014.6855552 article EN 2014-06-01
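
A minimal version of the placement problem studied here is heat-based tiering: track per-object access counts, keep the hottest working set on the SSD tier, and demote cold data to HDD under a cap on migration traffic. A sketch under invented parameters, not the paper's algorithm:

```python
from collections import Counter

SSD_CAPACITY = 4          # objects that fit on the fast tier (invented)
MAX_MIGRATIONS = 2        # per-epoch cap to control movement overhead

heat = Counter()          # per-object access counts ("heat")
ssd = set()               # objects currently on the SSD tier

def access(obj):
    heat[obj] += 1

def rebalance():
    """Promote hottest objects to SSD, bounded by a migration budget."""
    want = {obj for obj, _ in heat.most_common(SSD_CAPACITY)}
    demote = list(ssd - want)[:MAX_MIGRATIONS]
    for obj in demote:
        ssd.discard(obj)            # cold data moves down to HDD
    room = SSD_CAPACITY - len(ssd)
    promote = list(want - ssd)[:min(MAX_MIGRATIONS, room)]
    ssd.update(promote)             # hot data moves up to SSD
    return promote, demote

for obj in "aabbbbccccccddeeeeeeeeef":
    access(obj)
print("promoted/demoted:", rebalance())
print("SSD tier now holds:", sorted(ssd))
```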