- Advanced Data Storage Technologies
- Cloud Computing and Resource Management
- Caching and Content Delivery
- Distributed and Parallel Computing Systems
- Parallel Computing and Optimization Techniques
- Scientific Computing and Data Management
- Distributed Systems and Fault Tolerance
- Peer-to-Peer Network Technologies
- Algorithms and Data Compression
- Data Management and Algorithms
- Advanced Database Systems and Queries
- Cloud Data Security Solutions
- Advanced Neural Network Applications
- Software System Performance and Reliability
- Topic Modeling
- Big Data and Digital Economy
- Digital Rights Management and Security
- Text Readability and Simplification
- Natural Language Processing Techniques
- Ferroelectric and Negative Capacitance Devices
- Data Stream Mining Techniques
- Big Data Technologies and Applications
- Data Quality and Management
- Data Mining Algorithms and Applications
- IoT and Edge/Fog Computing
Carnegie Mellon University
2017-2025
University of Toronto
2012-2016
The energy consumed by data centers is starting to make up a significant fraction of the world's energy consumption and carbon emissions. A large fraction of that energy is spent on data center cooling, which has motivated a large body of work on temperature management in data centers. Interestingly, one key aspect has not been well understood: controlling the setpoint temperature at which to run a data center's cooling system. Most data centers set their thermostat based on (conservative) suggestions by manufacturers, as there is limited understanding of how higher temperatures will affect the system. At the same time, studies...
For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed systems today because it allows them to benefit from the convenience and maturity of battle-tested code. Ceph's experience, however, shows that this comes at a high price. First, developing a zero-overhead transaction mechanism is challenging. Second, metadata performance at the local level can significantly affect performance at the distributed level. Third, supporting emerging storage hardware...
Zoned Namespace (ZNS) SSDs are the latest evolution of host-managed flash storage, enabling improved performance at a lower cost-per-byte than traditional block interface (conventional) SSDs. To date, there is no support for arranging these new devices in arrays that offer increased throughput and reliability (RAID). We identify the key challenges of designing redundant ZNS SSD arrays, such as managing metadata updates and persisting partial stripe writes in the absence of overwrite support from the device. We present RAIZN,...
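The redundancy at the heart of such arrays can be illustrated with a minimal sketch of RAID-5-style XOR parity across a stripe; this is a generic parity example, not RAIZN's implementation, and all names are illustrative.

```python
def xor_blocks(blocks):
    """XOR a list of equal-sized byte blocks into a parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def recover_block(surviving_blocks, parity):
    """Rebuild a lost data block: XOR of survivors and parity."""
    return xor_blocks(list(surviving_blocks) + [parity])

d0, d1, d2 = b"abcd", b"efgh", b"ijkl"   # one stripe across three data zones
p = xor_blocks([d0, d1, d2])             # parity written to the parity zone
assert recover_block([d0, d2], p) == d1  # lose zone 1, rebuild its block
```

The difficulty the abstract points to is that on ZNS devices the parity zone cannot simply be overwritten in place as a stripe fills, which is what makes partial stripe writes hard to persist.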
Datacenters need to reduce embodied carbon emissions, particularly for flash, which accounts for 40% of the embodied carbon in servers. However, decreasing flash's embodied emissions is challenging due to its limited write endurance, which more than halves with each generation of denser flash. Reducing embodied emissions requires extending flash lifetime, stressing its limited endurance even further. The legacy Logical Block-Addressable Device (LBAD) interface exacerbates the problem by forcing devices to perform garbage collection, leading to additional writes. Flash-based caches...
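The extra writes that garbage collection induces are commonly summarized by the textbook greedy-GC approximation: if victim blocks still hold a fraction u of valid pages, each user write forces roughly 1/(1-u) physical writes. This is a standard model, not a figure from the paper.

```python
def write_amplification(valid_fraction):
    """Approximate GC write amplification when victim blocks are
    `valid_fraction` full of still-valid pages: those pages must be
    copied out before the block can be erased and reused."""
    assert 0 <= valid_fraction < 1
    return 1.0 / (1.0 - valid_fraction)

assert write_amplification(0.0) == 1.0  # empty victims: no extra writes
assert write_amplification(0.5) == 2.0  # half-valid victims double the writes
```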
Storage systems rely on maintenance tasks, such as backup and layout optimization, to ensure data availability and good performance. These tasks access large amounts of data and can significantly impact foreground applications. We argue that storage maintenance can be performed more efficiently by prioritizing the processing of data that is currently cached in memory. Data can be cached either because other workloads requested it previously, or due to overlapping I/O activity.
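A minimal sketch of the prioritization idea (illustrative only, not the paper's implementation): reorder the maintenance work queue so blocks already resident in the cache are processed first, avoiding extra disk reads.

```python
def order_maintenance(blocks, cached):
    """Return block IDs with currently-cached blocks first.
    `cached` is the set of block IDs resident in memory."""
    # False sorts before True, so cached blocks come first; the sort
    # is stable, preserving the original order within each group.
    return sorted(blocks, key=lambda b: b not in cached)

blocks = [1, 2, 3, 4]
cached = {2, 4}
assert order_maintenance(blocks, cached) == [2, 4, 1, 3]
```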
Analysis of large-scale simulation output is a core element of scientific inquiry, but analysis queries may experience significant I/O overhead when the data is not structured for efficient retrieval. While in-situ processing allows improved time-to-insight for many applications, scaling such frameworks to hundreds of thousands of cores can be difficult in practice. The DeltaFS in-situ indexing is a new approach for indexing massive amounts of data to achieve efficient point and small-range queries. This paper describes the challenges and lessons learned from this...
Latent sector errors (LSEs) are a common hard disk failure mode, where individual sectors become inaccessible while the rest of the disk remains unaffected. To protect against LSEs, commercial storage systems use scrubbers: background processes that verify disk data. The efficiency of different scrubbing algorithms in detecting LSEs has been studied in depth; however, no attempts have been made to evaluate or mitigate their impact on application performance. We provide the first known evaluation of the performance impact of scrubbing policies through implementation,...
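Two classic scrubbing orders studied in this line of work can be sketched in a few lines: a sequential pass over all segments, and a staggered (strided) pass that samples across the disk early. This is a toy illustration of the policies, not the paper's implementation.

```python
def sequential_order(n):
    """Scrub segments 0..n-1 in order."""
    return list(range(n))

def staggered_order(n, stride):
    """Visit every `stride`-th segment, then shift the start by one,
    so distant regions of the disk are sampled early in the pass."""
    return [s + offset
            for offset in range(stride)
            for s in range(0, n, stride)
            if s + offset < n]

assert sequential_order(4) == [0, 1, 2, 3]
assert staggered_order(6, 2) == [0, 2, 4, 1, 3, 5]
assert sorted(staggered_order(7, 3)) == list(range(7))  # full coverage
```

Both orders scrub every segment exactly once; they differ in how soon a burst of spatially clustered LSEs is likely to be hit, and in how much seek overhead they impose on foreground I/O.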
Although large language models (LLMs) have been touted for their ability to generate natural-sounding text, there are growing concerns around possible negative effects of LLMs such as data memorization, bias, and inappropriate language. Unfortunately, the complexity and generation capacities of LLMs make validating (and correcting) such concerns difficult. In this work, we introduce ReLM, a system for validating and querying LLMs using standard regular expressions. ReLM formalizes and enables a broad range of language model evaluations, reducing complex...
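ReLM itself compiles regular expressions into automata over model tokens; a much simpler post-hoc flavor of the idea is checking sampled outputs against a pattern, sketched below with a hypothetical memorization probe (the SSN pattern and function names are illustrative, not ReLM's API).

```python
import re

# Example probe: flag samples containing an SSN-shaped string,
# a stand-in for testing whether a model regurgitates training data.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")

def flags_memorization(sample):
    """Return True if the sample contains an SSN-shaped substring."""
    return SSN_PATTERN.search(sample) is not None

assert flags_memorization("my number is 123-45-6789")
assert not flags_memorization("no sensitive data here")
```

The advantage of the automaton-based approach over this post-hoc check is that it can constrain or enumerate model outputs matching the pattern directly, rather than sampling and filtering.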
Deep learning accelerators efficiently train over vast and growing amounts of data, placing a newfound burden on commodity networks and storage devices. A common approach to conserve bandwidth involves resizing or compressing data prior to training. We introduce Progressive Compressed Records (PCRs), a data format that uses progressive compression to reduce the overhead of fetching and transporting data, effectively reducing the training time required to achieve a target accuracy. PCRs deviate from previous formats by combining progressive...
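The progressive idea can be sketched with a toy record layout (assumed framing, not the PCR wire format): a record is stored as ordered "scans" of increasing fidelity, so a reader can fetch only a byte prefix and still decode a usable lower-fidelity sample.

```python
def encode_progressive(scans):
    """Concatenate scans with 4-byte length framing."""
    out = bytearray()
    for scan in scans:
        out += len(scan).to_bytes(4, "big") + scan
    return bytes(out)

def decode_prefix(data, num_scans):
    """Decode only the first `num_scans` scans of a progressive record,
    i.e. read a prefix of the bytes instead of the whole record."""
    scans, pos = [], 0
    for _ in range(num_scans):
        n = int.from_bytes(data[pos:pos + 4], "big")
        scans.append(data[pos + 4:pos + 4 + n])
        pos += 4 + n
    return scans

record = encode_progressive([b"coarse", b"detail1", b"detail2"])
assert decode_prefix(record, 1) == [b"coarse"]  # fewer bytes, lower fidelity
assert decode_prefix(record, 3) == [b"coarse", b"detail1", b"detail2"]
```

The bandwidth saving comes from the reader choosing `num_scans` per epoch or per sample, trading fidelity for I/O.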
Input pipelines, which ingest and transform input data, are an essential part of training Machine Learning (ML) models. However, it is challenging to implement efficient input pipelines, as doing so requires reasoning about parallelism, asynchrony, and variability in fine-grained profiling information. Our analysis of over two million ML jobs in Google datacenters reveals that a significant fraction of model training jobs could benefit from faster input data pipelines. At the same time, our analysis indicates that most jobs do not saturate host hardware, pointing...
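One of the asynchrony knobs such tuning reasons about is prefetching: overlapping data loading with training by buffering items from a background thread. A minimal sketch (illustrative, not Google's tooling):

```python
import queue
import threading

def prefetch(generator, buffer_size=4):
    """Run `generator` in a background thread, buffering up to
    `buffer_size` items so the consumer rarely waits on I/O."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking end of stream

    def worker():
        for item in generator:
            q.put(item)
        q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item

# Consumer sees the same items, but production overlaps consumption.
assert list(prefetch(iter(range(5)))) == [0, 1, 2, 3, 4]
```

Choosing `buffer_size` (and the number of worker threads, in a fuller pipeline) is exactly the kind of decision the abstract says requires fine-grained profiling.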
In this paper we introduce the Indexed Massive Directory, a new technique for indexing data within DeltaFS. With its design as a scalable, server-less file system for HPC platforms, DeltaFS scales file system metadata performance with application scale. The Indexed Massive Directory is a novel extension to the DeltaFS data plane, enabling in-situ indexing of massive amounts of data written to a single directory simultaneously, and in an arbitrarily large number of files. We achieve this through a memory-efficient mechanism for reordering and indexing data, and a log-structured storage layout to pack...
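A toy sketch of the reordering idea (illustrative, not DeltaFS's code): buffer incoming writes in memory, partition them by key, and flush each partition as a sorted run, so a later point query needs to consult only one partition.

```python
from collections import defaultdict

def partition_and_sort(writes, num_partitions):
    """Group (key, value) writes by hash partition and sort each run,
    mimicking an in-memory reorder step before a log-structured flush."""
    runs = defaultdict(list)
    for key, value in writes:
        runs[hash(key) % num_partitions].append((key, value))
    return {p: sorted(run) for p, run in runs.items()}

writes = [("c", 3), ("a", 1), ("b", 2), ("a", 4)]
runs = partition_and_sort(writes, 2)
# No write is lost or duplicated by the reorder:
assert sorted(kv for run in runs.values() for kv in run) == sorted(writes)
```

The memory efficiency challenge the abstract alludes to is doing this reordering at scales where buffering full sorted runs per writer is not affordable.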
Complex storage stacks providing data compression, indexing, and analytics help leverage the massive amounts of data generated today to derive insights. It is challenging to perform this computation, however, while fully utilizing the underlying storage media. This is because, although servers with large core counts are widely available, single-core performance and memory bandwidth per core grow more slowly than the core count per die. Computational storage offers a promising solution to this problem by adding dedicated compute resources along the data processing path. We present...
An increasing demand for cross-cloud and cross-region data access is bringing forth challenges related to high data transfer costs and latency. In response, we introduce Macaron, an auto-configuring cache system designed to minimize the cost of remote data access. A key insight behind Macaron is that cloud cache size is tied to cost, not hardware limits, shifting the way we think about cache design and eviction policies. Macaron dynamically configures cache size and utilizes a mix of storage types to adapt to workload changes and reduce costs. We demonstrate that Macaron reduces remote data access costs by 65%...
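The key insight can be illustrated with back-of-the-envelope arithmetic (all prices and hit rates below are made up for illustration): in the cloud, cache capacity costs money per GB, so the "right" cache size balances storage rent against the transfer cost of misses, rather than filling a fixed device.

```python
def net_saving(cache_gb, hit_rate, storage_cost_per_gb,
               demand_gb, transfer_cost_per_gb):
    """Transfer cost avoided by cache hits, minus the cost of
    renting the cache capacity itself."""
    avoided = hit_rate * demand_gb * transfer_cost_per_gb
    return avoided - cache_gb * storage_cost_per_gb

# Growing the cache only pays off while extra hits beat extra rent:
small = net_saving(100, 0.50, 0.02, 1000, 0.09)   # 45.0 - 2.0  = 43.0
large = net_saving(1000, 0.55, 0.02, 1000, 0.09)  # 49.5 - 20.0 = 29.5
assert small > large  # past a point, a bigger cache loses money
```

This is why the abstract frames cache size as a cost variable to be configured dynamically as the workload's hit-rate curve shifts.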