Sarp Oral

ORCID: 0000-0001-8745-7078
Research Areas
  • Advanced Data Storage Technologies
  • Distributed and Parallel Computing Systems
  • Parallel Computing and Optimization Techniques
  • Caching and Content Delivery
  • Cloud Computing and Resource Management
  • Scientific Computing and Data Management
  • Interconnection Networks and Systems
  • Distributed systems and fault tolerance
  • Software System Performance and Reliability
  • Advanced Optical Network Technologies
  • Big Data and Business Intelligence
  • Peer-to-Peer Network Technologies
  • Research Data Management Practices
  • Privacy-Preserving Technologies in Data
  • Advanced Database Systems and Queries
  • Cloud Data Security Solutions
  • Network Time Synchronization Technologies
  • Software-Defined Networks and 5G
  • Time Series Analysis and Forecasting
  • Data Stream Mining Techniques
  • Data-Driven Disease Surveillance
  • China's Ethnic Minorities and Relations
  • Technology Assessment and Management
  • Reservoir Engineering and Simulation Methods
  • Innovation, Sustainability, Human-Machine Systems

Oak Ridge National Laboratory
2015-2024

Office of Scientific and Technical Information
2024

Naval Research Laboratory Information Technology Division
2016-2023

Oak Ridge Leadership Computing Facility
2023

Lawrence Berkeley National Laboratory
2021

Sandia National Laboratories
2017

University of Florida
2003-2005

CORAL, the Collaboration of Oak Ridge, Argonne, and Livermore, is fielding two similar IBM systems, Summit and Sierra, with NVIDIA GPUs that will replace the existing Titan and Sequoia systems. Summit and Sierra are currently ranked No. 1 and No. 3, respectively, on the Top500 list. We discuss the design and key differences of the two systems. Our evaluation of the systems highlights the following. Applications that fit in HBM see the most benefit and may prefer more GPUs; however, for some applications, CPU-GPU bandwidth is more important than the number of GPUs. The node-local burst buffer...

10.1109/sc.2018.00055 article EN 2018-11-01

As the US Department of Energy (DOE) computing facilities began deploying petascale systems in 2008, DOE was already setting its sights on exascale. In that year, DARPA published a report on the feasibility of reaching exascale. The authors identified several key challenges in the pursuit of exascale, including power, memory, concurrency, and resiliency. That report informed DOE's exascale strategy. With the deployment of Oak Ridge National Laboratory's Frontier supercomputer, we have officially entered the exascale era. In this paper, we discuss Frontier's...

10.1145/3581784.3607089 article EN 2023-11-11

The growth of computing power on large-scale systems requires commensurate high-bandwidth I/O systems. Many parallel file systems are designed to provide fast and sustainable I/O in response to applications' soaring requirements. To meet this need, a novel storage system is imperative to temporarily buffer the bursty I/O and gradually flush datasets to long-term parallel file systems. In this paper, we introduce the design of BurstMem, a high-performance burst buffer system. BurstMem provides a storage framework with efficient communication and I/O management strategies. Our...

10.1109/bigdata.2014.7004215 article EN 2014 IEEE International Conference on Big Data (Big Data) 2014-10-01
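
The core burst-buffer idea in this line of work, absorbing bursty application writes into a fast intermediate tier and draining them to the parallel file system in the background, can be sketched in a few lines. The following is an illustrative toy under invented names and timings, not BurstMem's actual implementation:

```python
import queue
import threading
import time

class ToyBurstBuffer:
    """Toy burst buffer: absorb writes fast, drain to slow storage async."""

    def __init__(self, drain_fn, capacity=64):
        self.buf = queue.Queue(maxsize=capacity)  # fast tier (here: memory)
        self.drain_fn = drain_fn                  # slow tier, e.g. PFS write
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, chunk):
        # Application-facing write: returns as soon as the fast tier accepts
        # the chunk, decoupling the application from PFS bandwidth.
        self.buf.put(chunk)

    def _drain(self):
        # Background flusher: gradually pushes buffered data to the PFS.
        while True:
            chunk = self.buf.get()
            self.drain_fn(chunk)
            self.buf.task_done()

    def flush(self):
        self.buf.join()  # block until all buffered data has been drained

def slow_pfs_write(chunk):
    time.sleep(0.01)     # stand-in for limited PFS bandwidth

bb = ToyBurstBuffer(slow_pfs_write)
for i in range(100):     # bursty output phase completes quickly
    bb.write(f"block-{i}")
bb.flush()               # data reaches long-term storage gradually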

This paper presents an extensive characterization, tuning, and optimization of parallel I/O on the Cray XT supercomputer, named Jaguar, at Oak Ridge National Laboratory. We have characterized the performance and scalability for different levels of the storage hierarchy, including a single Lustre object storage target, an S2A couplet, and the entire system. Our analysis covers both data- and metadata-intensive I/O patterns. In particular, for small, non-contiguous, data-intensive patterns we evaluated several techniques, such as data sieving and two-phase...

10.1109/ipdps.2008.4536277 article EN Proceedings - IEEE International Parallel and Distributed Processing Symposium 2008-04-01
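
Of the techniques named here, data sieving is easy to illustrate: instead of issuing many small non-contiguous reads, the I/O layer reads one covering extent and extracts the needed pieces in memory. A minimal sketch, with made-up file name and sizes:

```python
import os

def read_noncontiguous_naive(f, offsets, size):
    # One small read per fragment: many separate I/O requests.
    out = []
    for off in offsets:
        f.seek(off)
        out.append(f.read(size))
    return out

def read_noncontiguous_sieved(f, offsets, size):
    # Data sieving: one large covering read, then in-memory extraction.
    lo, hi = min(offsets), max(offsets) + size
    f.seek(lo)
    extent = f.read(hi - lo)
    return [extent[off - lo: off - lo + size] for off in offsets]

with open("demo.bin", "wb") as f:
    f.write(os.urandom(1 << 20))

with open("demo.bin", "rb") as f:
    offsets = range(0, 1 << 20, 4096)   # a small fragment every 4 KiB
    a = read_noncontiguous_naive(f, offsets, 64)
    b = read_noncontiguous_sieved(f, offsets, 64)
    assert a == b   # same data, far fewer requests in the sieved case
```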

Supercomputer I/O loads are often dominated by writes. HPC (High Performance Computing) file systems are designed to absorb these bursty outputs at high bandwidth through massive parallelism. However, the delivered write bandwidth often falls well below peak. This paper characterizes the data absorption behavior of a center-wide shared Lustre parallel file system on the Jaguar supercomputer. We use a statistical methodology to address the challenges of accurately measuring a shared machine under production load and to obtain the distribution of bandwidth across...

10.5555/2388996.2389007 article EN IEEE International Conference on High Performance Computing, Data, and Analytics 2012-11-10
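
The measurement approach, sampling aggregate write traffic on a production machine and reducing it to a bandwidth distribution rather than a single peak number, comes down to order statistics. A toy version over synthetic samples (all numbers invented):

```python
import random
import statistics

random.seed(0)
# Synthetic per-interval write-bandwidth samples (GB/s) from a busy system:
# mostly moderate traffic with occasional checkpoint bursts.
samples = [random.lognormvariate(2.0, 0.8) for _ in range(10_000)]

def percentile(data, p):
    s = sorted(data)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

print(f"mean   {statistics.mean(samples):7.1f} GB/s")
print(f"median {percentile(samples, 50):7.1f} GB/s")
print(f"p95    {percentile(samples, 95):7.1f} GB/s")
print(f"p99    {percentile(samples, 99):7.1f} GB/s")
# The gap between median and p99 is the point: delivered bandwidth is a
# distribution, and peak-like values are rare under production load.
```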

NAND flash memory is a preferred storage media for various platforms ranging from embedded systems to enterprise-scale systems. Flash devices do not have any mechanical moving parts and provide low-latency access. They also require less power compared to rotating media. Unlike hard disks, flash devices use out-of-place update operations, and they require a garbage collection (GC) process to reclaim invalid pages and create free blocks. This GC process is a major cause of performance degradation when running concurrently with other I/O operations, as internal...

10.1109/ispass.2011.5762711 article EN 2011-04-01
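
The GC mechanism described here, reclaiming blocks whose pages were invalidated by out-of-place updates, is commonly illustrated with a greedy policy: pick the block with the fewest valid pages, relocate those pages, and erase the block. A toy flash translation layer sketch (parameters arbitrary, not any real device's logic):

```python
PAGES_PER_BLOCK = 4

class ToyFTL:
    """Toy flash translation layer with greedy garbage collection."""

    def __init__(self, nblocks):
        self.blocks = [[] for _ in range(nblocks)]  # pages: (lpn, valid)
        self.map = {}       # logical page number -> (block, slot)
        self.copies = 0     # pages copied by GC (write amplification)

    def write(self, lpn):
        if lpn in self.map:                   # out-of-place update:
            b, s = self.map[lpn]              # old copy becomes invalid
            self.blocks[b][s] = (lpn, False)
        self._append(lpn)

    def _append(self, lpn):
        for b, blk in enumerate(self.blocks):
            if len(blk) < PAGES_PER_BLOCK:
                self.map[lpn] = (b, len(blk))
                blk.append((lpn, True))
                return
        self._gc()          # no free slots: reclaim, then retry
        self._append(lpn)

    def _gc(self):
        # Greedy victim: block with the fewest valid pages.
        victim = min(range(len(self.blocks)),
                     key=lambda b: sum(v for _, v in self.blocks[b]))
        valid = [lpn for lpn, v in self.blocks[victim] if v]
        self.blocks[victim] = []              # erase the block
        for lpn in valid:                     # relocate valid pages
            self.copies += 1
            del self.map[lpn]
            self._append(lpn)

ftl = ToyFTL(nblocks=8)
for i in range(200):
    ftl.write(i % 12)       # update-heavy workload forces GC
print("pages copied by GC:", ftl.copies)
```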

Unlike hard disks, flash devices use out-of-place update operations and require a garbage collection (GC) process to reclaim invalid pages and create free blocks. This GC process is a major cause of performance degradation when running concurrently with other I/O, as internal bandwidth is consumed to copy valid pages. The invocation of the GC process is generally governed by a low watermark on free blocks and other device metrics that different workloads meet at different intervals. This results in performance that is highly dependent on workload characteristics. In this paper, we...

10.1109/tcad.2012.2227479 article EN IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2013-01-21
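
The watermark behavior called out here can be made concrete: GC fires whenever the free-block count drops below a threshold, so the interval between GC invocations is set by the workload's write volume rather than by any fixed schedule. A toy illustration with invented thresholds:

```python
def gc_invocations(writes, free_blocks=100, low_watermark=20,
                   reclaimed_per_gc=30):
    """Count GC invocations for a write stream, where each write consumes
    a free block and GC fires at the low watermark (GC cost not modeled)."""
    invocations, free = 0, free_blocks
    for _ in range(writes):
        free -= 1
        if free <= low_watermark:       # low-watermark trigger
            free += reclaimed_per_gc
            invocations += 1
    return invocations

# Same device, different workloads -> very different GC frequency:
for writes in (200, 1000, 5000):
    print(f"{writes:5d} writes -> {gc_invocations(writes):4d} GC cycles")
```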

In this paper, we develop a predictive model useful for output performance prediction of supercomputer file systems under production load. Our target environment is Titan---the 3rd fastest supercomputer in the world---and its Lustre-based multi-stage write path. We observe from Titan that although output performance is highly variable at small time scales, the mean performance is stable and consistent over typical application run times. Moreover, we find that output performance is non-linearly related to its correlated parameters due to interference and saturation on individual stages...

10.1145/3078597.3078614 article EN 2017-06-23
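
The modeling claim, that mean write performance is stable over application-scale windows but non-linearly related to load parameters, suggests fitting a simple nonlinear curve to (load, bandwidth) observations. A sketch on synthetic data; the paper's actual model and features are richer, and everything below is invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic observations: bandwidth rises with offered load, then
# saturates and degrades as stages of the write path become contended.
load = rng.uniform(0, 10, 500)
bw = 40 * load / (1 + 0.12 * load**2) + rng.normal(0, 2, 500)

# Low-degree polynomial as a stand-in nonlinear predictor.
model = np.poly1d(np.polyfit(load, bw, deg=3))

for probe in (1.0, 3.0, 8.0):
    print(f"load {probe:3.1f} -> predicted {model(probe):5.1f} GB/s")
```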

Solid-State Drives (SSDs) offer significant performance improvements over hard disk drives (HDDs) on a number of workloads. The frequency of garbage collection (GC) activity is directly correlated with the pattern, frequency, and volume of write requests, and the scheduling of GC is controlled by logic internal to the SSD. SSDs can exhibit significant performance degradations when GC conflicts with an ongoing I/O request stream. When SSDs are used in a RAID array, the lack of coordination of the local GC processes amplifies these degradations. No RAID controller or SSD available...

10.1109/msst.2011.5937224 article EN 2011-05-01
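
The coordination problem is easy to see in miniature: a RAID stripe completes only when every SSD responds, so uncoordinated per-device GC pauses compound, while aligning the pauses bounds them. A toy model with invented timings and probabilities:

```python
import random

random.seed(2)
N_DEVICES, N_REQS = 8, 10_000
GC_PROB, GC_PENALTY = 0.02, 20   # per-request GC chance, stall multiplier

def mean_stripe_latency(coordinated):
    total = 0.0
    for _ in range(N_REQS):
        if coordinated:
            # Devices enter GC together: a stripe is either all-fast or
            # all-slow, so stalls are not amplified across devices.
            total += GC_PENALTY if random.random() < GC_PROB else 1
        else:
            # Each device schedules GC locally; the stripe waits for the
            # slowest device, so any one stall delays the whole request.
            total += max(GC_PENALTY if random.random() < GC_PROB else 1
                         for _ in range(N_DEVICES))
    return total / N_REQS

print("uncoordinated:", round(mean_stripe_latency(False), 2))
print("coordinated:  ", round(mean_stripe_latency(True), 2))
```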

The growing computing power on leadership HPC systems is often accompanied by ever-escalating failure rates. Checkpointing is a common defensive mechanism used by scientific applications for failure recovery. However, directly writing the large and bursty checkpointing dataset to the parallel file system can incur significant I/O contention on the storage servers. Such contention in turn degrades the bandwidth utilization of the storage servers and prolongs the average job I/O time of concurrent applications. Recently, burst buffers have been proposed as an...

10.1109/cluster.2015.38 article EN 2015-09-01
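
A standard back-of-envelope for the checkpointing trade-off this abstract describes (a classic rule of thumb, not a formula taken from the paper) is Young's approximation for the optimal checkpoint interval, tau ~ sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint. Faster checkpoint absorption, e.g., via a burst buffer, shrinks C and permits tighter intervals:

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation: near-optimal checkpoint interval (seconds)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

MTBF = 24 * 3600            # assumed one failure per day (invented)
for label, cost in (("direct to PFS", 600), ("via burst buffer", 60)):
    tau = young_interval(cost, MTBF)
    print(f"{label:17s} C={cost:4d}s -> checkpoint every {tau/60:5.1f} min")
```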

The Oak Ridge Leadership Computing Facility (OLCF) has deployed multiple large-scale parallel file systems (PFS) to support its operations. During this process, OLCF acquired significant expertise in storage system design, software development, technology evaluation, benchmarking, procurement, deployment, and operational practices. Based on the lessons learned from each new PFS deployment, OLCF improved its operating procedures and strategies. This paper provides an account of our experience acquiring, deploying,...

10.1109/sc.2014.23 article EN 2014-11-01

We introduce UnifyFS, a user-level file system that aggregates the node-local storage tiers available on high performance computing (HPC) systems and makes them available to HPC applications under a unified namespace. UnifyFS employs transparent I/O interception, so it does not require changes to application code and is compatible with commonly used I/O libraries. The design of UnifyFS supports the predominant HPC I/O workloads and is optimized for bulk-synchronous I/O patterns. Furthermore, UnifyFS provides customizable file system semantics to flexibly adapt its...

10.1109/ipdps54959.2023.00037 article EN 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2023-05-01
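
UnifyFS's transparent interception redirects I/O aimed at a unified mount point to node-local storage without application changes. As a loose user-space analogy only (the real system intercepts at the I/O-library and syscall-wrapper level in C; the prefix and paths below are made up), one can shim Python's open() to remap a namespace prefix:

```python
import builtins
import os
import tempfile

UNIFY_PREFIX = "/unifyfs/"           # hypothetical unified namespace
NODE_LOCAL = tempfile.mkdtemp()      # stand-in for a node-local SSD
_real_open = builtins.open

def unify_open(path, *args, **kwargs):
    # Transparent redirection: paths under the unified prefix are served
    # from node-local storage; everything else passes through untouched.
    if isinstance(path, str) and path.startswith(UNIFY_PREFIX):
        local = os.path.join(NODE_LOCAL, path[len(UNIFY_PREFIX):])
        os.makedirs(os.path.dirname(local), exist_ok=True)
        return _real_open(local, *args, **kwargs)
    return _real_open(path, *args, **kwargs)

builtins.open = unify_open           # "interception": invisible to the app

# Unmodified "application" code:
with open("/unifyfs/ckpt/rank0.dat", "w") as f:
    f.write("checkpoint data")
with open("/unifyfs/ckpt/rank0.dat") as f:
    print(f.read())                  # -> checkpoint data
```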

Although solid-state drives (SSDs) offer significant performance improvements over hard disk drives (HDDs) for a number of workloads, they can exhibit substantial variance in request latency and throughput as a result of garbage collection (GC). When GC conflicts with an I/O stream, the stream can make no forward progress until the GC cycle completes. GC cycles are scheduled by logic internal to the SSD based on several factors, such as the pattern, frequency, and volume of write requests. When SSDs are used in a RAID with currently available technology,...

10.1109/tc.2012.256 article EN IEEE Transactions on Computers 2014-03-18

The Oak Ridge Leadership Computing Facility (OLCF) is a leader in large-scale parallel file system development, design, deployment and continuous operation. For the last decade, OLCF has designed and deployed two large center-wide file systems. The first instantiation, Spider 1, served the Jaguar supercomputer and its predecessor, and the second instantiation, Spider 2, now serves the Titan supercomputer, among many other computational resources. OLCF has been rigorously collecting storage statistics from these systems since their transition to production state.

10.1145/2834976.2834985 article EN 2015-11-11

The I/O subsystem for the Summit supercomputer, No. 1 on the Top500 list, and its ecosystem of analysis platforms is composed of two distinct layers, namely the in-system layer and the center-wide parallel file system (PFS), Spider 3. The in-system layer uses node-local SSDs and provides 26.7 TB/s for reads, 9.7 TB/s for writes, and 4.6 billion IOPS to Summit. The Spider 3 PFS layer is based on IBM's Spectrum Scale™ and provides 2.5 TB/s and 2.6 million IOPS to Summit and other systems. While deploying them as two distinct layers was operationally efficient, it also presented usability challenges in terms of multiple mount points and lack...

10.1145/3295500.3356157 article EN 2019-11-07

Scientific computing workloads at HPC facilities have been shifting from traditional numerical simulations to AI/ML applications for training and inference, while processing and producing ever-increasing amounts of scientific data. To address the growing need for increased storage capacity, lower access latency, and higher bandwidth, emerging technologies such as non-volatile memory are being integrated into supercomputer I/O subsystems. With these trends, we need a better understanding of multilayer storage systems and ways to use...

10.1145/3502181.3531461 article EN 2022-06-23

Journaling is a widely used technique to increase file system robustness against metadata and/or data corruptions. While the overhead of journaling can be masked by the page cache for small-scale, local file systems, we found that journaling in Lustre's backend object store significantly impacted the overall performance of our large-scale center-wide parallel file system. By requiring each write request to wait for a journal transaction commit, Lustre introduced serialization to the client I/O stream and imposed additional latency due to disk head...

10.5555/1855511.1855522 article EN 2010-02-23
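
The serialization cost described here, each write stalling on its own journal transaction commit, can be contrasted with amortizing commits over many writes. A toy arithmetic sketch with invented service times (illustrating the bottleneck shape only, not Lustre's actual remedy, which the paper details):

```python
N_WRITES = 1000
WRITE_US, COMMIT_US = 50, 500   # invented per-op service times (microseconds)

# Per-write commit: every write stalls for its own journal commit.
per_write = N_WRITES * (WRITE_US + COMMIT_US)

# Batched/asynchronous commit: writes proceed, and commit cost is
# amortized over groups of writes instead of serializing each request.
GROUP = 100
grouped = N_WRITES * WRITE_US + (N_WRITES // GROUP) * COMMIT_US

print(f"per-write commit: {per_write / 1e6:.3f} s of journal-bound time")
print(f"group commit:     {grouped / 1e6:.3f} s")
```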

Ceph is an emerging open-source parallel distributed file and storage system. By design, Ceph leverages unreliable commodity network and storage hardware, and provides reliability and fault-tolerance via controlled object placement and data replication. This paper presents our block I/O performance and scalability evaluation of Ceph for scientific high-performance computing (HPC) environments. Our work makes two unique contributions. First, our evaluation is performed under a realistic setup for a large-scale capability HPC environment using a commercial...

10.1145/2538542.2538562 article EN 2013-11-15

Using parallel file systems efficiently is a tricky problem due to inter-dependencies among multiple layers of I/O software, including high-level libraries (HDF5, netCDF, etc.), MPI-IO, POSIX, and parallel file systems (GPFS, Lustre, etc.). Profiling tools such as Darshan collect traces to help understand the I/O performance behavior. However, there are significant gaps in analyzing the collected traces and then applying the tuning options offered by the various layers of I/O software. Seeking to connect the dots between bottleneck detection and tuning, we propose...

10.1109/pdsw54622.2021.00008 article EN 2021-11-01
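
The gap the paper targets, going from collected counters to a tuning suggestion, can be illustrated with a simple rule applied to Darshan-style per-file counters. The record fields and thresholds below are invented stand-ins, not Darshan's actual log schema:

```python
# Toy per-file I/O counters in the spirit of a Darshan log (names invented).
records = [
    {"file": "out.h5", "writes": 120_000, "bytes": 480_000_000,
     "collective": False},
    {"file": "mesh.nc", "writes": 64, "bytes": 512_000_000,
     "collective": True},
]

def suggest(rec):
    avg = rec["bytes"] / rec["writes"]
    if avg < 64 * 1024 and not rec["collective"]:
        # Many small independent writes: a classic PFS bottleneck; a common
        # remedy is collective buffering / two-phase I/O at the MPI-IO layer.
        return f"{rec['file']}: avg write {avg:,.0f} B -> try collective I/O"
    return f"{rec['file']}: no obvious small-write bottleneck"

for rec in records:
    print(suggest(rec))
```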

In recent years, non-volatile memory devices such as SSDs have emerged as a viable storage solution due to their increasing capacity and decreasing cost. Due to the unique capability requirements of the large-scale HPC (High Performance Computing) environment, a hybrid configuration (SSD and HDD) may represent one of the most available and balanced solutions considering cost and performance. Under this setting, effective data placement as well as data movement with controlled overhead become a pressing challenge. In this paper, we...

10.1109/msst.2014.6855552 article EN 2014-06-01
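
A minimal version of the placement problem studied here is heat-based tiering: track per-object access counts, keep the hottest working set on the SSD tier, and demote cold data to HDD under a cap on migration traffic. A sketch under invented parameters, not the paper's algorithm:

```python
from collections import Counter

SSD_CAPACITY = 4          # objects that fit on the fast tier (invented)
MAX_MIGRATIONS = 2        # per-epoch cap to control movement overhead

heat = Counter()          # per-object access counts ("heat")
ssd = set()               # objects currently on the SSD tier

def access(obj):
    heat[obj] += 1

def rebalance():
    """Promote hottest objects to SSD, bounded by a migration budget."""
    want = {obj for obj, _ in heat.most_common(SSD_CAPACITY)}
    demote = list(ssd - want)[:MAX_MIGRATIONS]
    for obj in demote:
        ssd.discard(obj)            # cold data moves down to HDD
    room = SSD_CAPACITY - len(ssd)
    promote = list(want - ssd)[:min(MAX_MIGRATIONS, room)]
    ssd.update(promote)             # hot data moves up to SSD
    return promote, demote

for obj in "aabbbbccccccddeeeeeeeeef":
    access(obj)
print("promoted/demoted:", rebalance())
print("SSD tier now holds:", sorted(ssd))
```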