- Cloud Computing and Resource Management
- Distributed and Parallel Computing Systems
- Caching and Content Delivery
- Advanced Data Storage Technologies
- Software-Defined Networks and 5G
- IoT and Edge/Fog Computing
- Graph Theory and Algorithms
- Peer-to-Peer Network Technologies
- Scientific Computing and Data Management
- Interconnection Networks and Systems
- Distributed Systems and Fault Tolerance
- Data Stream Mining Techniques
- Parallel Computing and Optimization Techniques
- Advanced Database Systems and Queries
- Software System Performance and Reliability
- Gamma-Ray Bursts and Supernovae
- Advanced Queuing Theory Analysis
- Advanced Malware Detection Techniques
- Multimedia Communication and Technology
- Online Learning and Analytics
- Wireless Communication Networks Research
- Scheduling and Optimization Algorithms
- Digital and Cyber Forensics
- Network Security and Intrusion Detection
- Stellar, Planetary, and Galactic Studies
LinkedIn (United States), 2021
Meta (Israel), 2018-2019
Meta (United States), 2019
Microsoft (United States), 2012-2018
Microsoft Research (United Kingdom), 2013-2018
University of Southern California, 2017
Southern California University for Professional Studies, 2017
National Institute of Technology Karnataka, 2015
Yahoo (United States), 2012
Microsoft (Finland), 2012
Tasks in modern data-parallel clusters have highly diverse resource requirements along CPU, memory, disk and network. Any of these resources may become a bottleneck and hence, the likelihood of wasting resources due to fragmentation is now larger. Today's schedulers do not explicitly reduce fragmentation. Worse, since they only allocate cores and memory, the resources that they ignore (disk and network) can be over-allocated, leading to interference and failures, or to hogging of cores or memory that could have been used by other tasks. We present Tetris, a cluster scheduler...
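A minimal sketch of dot-product-based multi-resource packing in the spirit of Tetris: score each feasible (task, machine) pair by how well the task's demand vector aligns with the machine's free capacity, and pick the best. The task/machine dictionaries and field names below are illustrative, not Tetris's actual data model.

```python
# Multi-resource packing sketch (assumed shapes, not Tetris's real API).
RESOURCES = ("cpu", "mem", "disk", "net")

def alignment(task_demand, machine_free):
    """Dot product of demand and free capacity: higher means the task
    fits along the machine's spare resources with less fragmentation."""
    return sum(task_demand[r] * machine_free[r] for r in RESOURCES)

def fits(task_demand, machine_free):
    return all(task_demand[r] <= machine_free[r] for r in RESOURCES)

def pick_task(pending_tasks, machine_free):
    """Choose the pending task best aligned with this machine's
    remaining capacity across all four resource dimensions."""
    feasible = [t for t in pending_tasks if fits(t, machine_free)]
    return max(feasible, key=lambda t: alignment(t, machine_free), default=None)

# Example: a machine with spare network capacity favors a network-heavy task.
machine = {"cpu": 4, "mem": 8, "disk": 2, "net": 10}
tasks = [{"cpu": 2, "mem": 4, "disk": 1, "net": 1},
         {"cpu": 1, "mem": 2, "disk": 1, "net": 8}]
print(pick_task(tasks, machine))  # the network-heavy task wins
```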
Modern resource management frameworks for large-scale analytics leave unresolved the problematic tension between high cluster utilization and jobs' performance predictability--respectively coveted by operators and users. We address this in Morpheus, a new system that: 1) codifies implicit user expectations as explicit Service Level Objectives (SLOs), inferred from historical data, 2) enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced variability, and 3) mitigates...
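A deliberately simplified illustration of the first idea, deriving an explicit SLO from history: take a high percentile of a recurring job's past completion times as its deadline. Morpheus itself infers SLOs through more careful analysis of historical runs; this conveys only the flavor.

```python
# Toy SLO inference from past runs (percentile rule is an assumption).
def inferred_slo(past_completion_minutes, percentile=0.95):
    ordered = sorted(past_completion_minutes)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx]

history = [42, 45, 44, 47, 43, 61, 44]   # minutes, one per past run
print(inferred_slo(history))  # a deadline nearly all past runs met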
To reduce the impact of network congestion on big data jobs, cluster management frameworks use various heuristics to schedule compute tasks and/or network flows. Most of these schedulers consider the job input as fixed and greedily schedule the tasks and flows that are ready to run. However, a large fraction of production jobs are recurring with predictable characteristics, which allows us to plan ahead for them. Coordinating the placement of data and tasks of these jobs allows for significantly improving their locality and freeing up bandwidth, which can be used by other jobs running on the cluster. With this...
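A toy sketch of the planning-ahead idea: because recurring jobs are predictable, each job's input data and compute can be assigned to the same racks offline rather than greedily at run time. The real planner is far more sophisticated; the job names, sizes, and rack count here are made up.

```python
# Offline rack planning sketch for recurring jobs (illustrative inputs).
def plan_racks(jobs, num_racks):
    """Greedily spread recurring jobs across racks so each job's input
    and tasks land together (good locality) while balancing rack load."""
    load = [0] * num_racks
    placement = {}
    for name, size in sorted(jobs.items(), key=lambda kv: -kv[1]):
        rack = load.index(min(load))    # least-loaded rack
        placement[name] = rack          # store data AND run tasks here
        load[rack] += size
    return placement

print(plan_racks({"hourly_etl": 8, "daily_report": 5, "scrub": 3}, num_racks=2))
```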
The Quantcast File System (QFS) is an efficient alternative to the Hadoop Distributed File System (HDFS). QFS is written in C++, is plugin-compatible with Hadoop MapReduce, and offers several efficiency improvements relative to HDFS: 50% disk space savings through erasure coding instead of replication, a resulting doubling of write throughput, a faster name node, support for faster sorting and logging through a concurrent append feature, a native command line client that is much faster than hadoop fs, and global feedback-directed I/O device management. As QFS works out...
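The arithmetic behind the headline numbers, assuming Reed-Solomon (6, 3) encoding (QFS's default): 6 data chunks plus 3 parity chunks means 1.5x the raw bytes on disk, versus 3x for HDFS-style triple replication, which yields the 50% space savings and roughly halves the bytes written per block.

```python
# Storage-overhead arithmetic for erasure coding vs. replication.
def storage_overhead(data_chunks, parity_chunks):
    return (data_chunks + parity_chunks) / data_chunks

rs_6_3 = storage_overhead(6, 3)      # 1.5x raw bytes
replication_3x = 3.0                 # three full copies
savings = 1 - rs_6_3 / replication_3x
print(f"{savings:.0%} disk space saved")  # 50% -- and about half the
# bytes written, hence the "resulting doubling of write throughput"
```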
In recent years, there has been an explosion of large-scale real-time analytics needs, and a plethora of streaming systems have been developed to support such applications. These systems are able to continue stream processing even when faced with hardware and software failures. However, they do not address some crucial challenges facing their operators: the manual, time-consuming and error-prone tasks of tuning various configuration knobs to achieve service level objectives (SLOs), as well as the maintenance of SLOs in the face of sudden,...
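A bare-bones feedback loop of the kind such self-tuning systems automate: observe an SLO metric, diagnose, and adjust a knob. The metric name, thresholds, and doubling/halving policy below are invented for illustration; real systems use richer diagnosers and resolvers.

```python
# Toy SLO control loop (policy and parameters are assumptions).
def control_step(observed_p99_ms, slo_p99_ms, parallelism, max_parallelism=64):
    if observed_p99_ms > slo_p99_ms:          # SLO violated: scale up
        return min(max_parallelism, parallelism * 2)
    if observed_p99_ms < 0.5 * slo_p99_ms:    # ample headroom: scale down
        return max(1, parallelism // 2)
    return parallelism                        # within band: hold steady

p = 4
for latency in (180, 250, 320, 90, 40):       # simulated p99 samples (ms)
    p = control_step(latency, slo_p99_ms=200, parallelism=p)
    print(latency, "->", p)
```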
The continuous shift towards data-driven approaches to business, and a growing attention to improving return on investment (ROI) for cluster infrastructures, is generating new challenges for big-data frameworks. Systems originally designed for big batch jobs now handle an increasingly complex mix of computations. Moreover, they are expected to guarantee stringent SLAs for production jobs and to minimize latency for best-effort jobs.
We present a new cluster scheduler, GRAPHENE, aimed at jobs that have a complex dependency structure and heterogeneous resource demands. Relaxing either of these challenges, i.e., scheduling a DAG of homogeneous tasks or an independent set of heterogeneous tasks, leads to NP-hard problems. Reasonable heuristics exist for these simpler problems, but they perform poorly when scheduling heterogeneous DAGs. Our key insights are: (1) focus on the long-running tasks and those with tough-to-pack resource demands, and (2) compute the schedule, offline, by first placing such troublesome...
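A toy rendition of that first insight: identify "troublesome" tasks (long-running or hard to pack) and lay them out before the rest. Real GRAPHENE builds a full offline schedule that respects DAG dependencies; this sketch ignores dependencies, and the thresholds and field names are illustrative.

```python
# Troublesome-first ordering sketch (thresholds are assumptions).
def troublesome(task, dur_thresh=100, demand_thresh=0.8):
    return task["duration"] > dur_thresh or task["peak_demand"] > demand_thresh

def schedule_order(tasks):
    hard = [t for t in tasks if troublesome(t)]
    easy = [t for t in tasks if not troublesome(t)]
    # Place the hard tasks first (longest first), then backfill the rest.
    hard.sort(key=lambda t: -t["duration"])
    easy.sort(key=lambda t: -t["duration"])
    return hard + easy

tasks = [{"name": "a", "duration": 300, "peak_demand": 0.2},
         {"name": "b", "duration": 20,  "peak_demand": 0.9},
         {"name": "c", "duration": 15,  "peak_demand": 0.1}]
print([t["name"] for t in schedule_order(tasks)])  # ['a', 'b', 'c']
```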
Job scheduling in Big Data clusters is crucial both for cluster operators' return on investment and for overall user experience. In this context, we observe several anomalies in how modern schedulers manage queues, and argue that maintaining queues of tasks at worker nodes has significant benefits. On one hand, centralized approaches do not use worker-side queues. Given the inherent feedback delays that these systems incur, they achieve suboptimal utilization, particularly for workloads dominated by short...
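A minimal sketch of why worker-side queues help: with a short bounded queue, the worker always has the next task ready when one finishes, instead of idling for a scheduler round trip. The queue bound and task representation are illustrative.

```python
# Bounded worker-side queue sketch (shapes are assumptions).
from collections import deque

class Worker:
    def __init__(self, queue_bound=3):
        self.queue = deque()
        self.bound = queue_bound

    def has_room(self):                 # scheduler can push ahead of time
        return len(self.queue) < self.bound

    def enqueue(self, task):
        assert self.has_room()
        self.queue.append(task)

    def next_task(self):
        # No idle gap: the next task was queued before the last finished.
        return self.queue.popleft() if self.queue else None

w = Worker()
for t in ("t1", "t2", "t3"):
    w.enqueue(t)
print(w.next_task(), w.has_room())      # t1 True -- room for a refill
```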
Query optimizers are notorious for inaccurate cost estimates, leading to poor performance. The root of the problem lies in cardinality estimation, i.e., the size of intermediate (and final) results in a query plan. These estimates also determine the resources consumed in modern shared cloud infrastructures. In this paper, we present CardLearner, a machine learning based approach to learn cardinality models from previous job executions and use them to predict the cardinalities of future jobs. The key intuition in our approach is that workloads are often recurring...
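A stripped-down version of the learned-cardinality idea: for recurring subexpressions, remember the observed selectivity (output rows / input rows) from past runs and reuse it for future estimates. CardLearner trains actual ML models per subexpression template; this shows only the intuition, and the signature strings are invented.

```python
# Learned-selectivity sketch (signature format is an assumption).
from collections import defaultdict

class SelectivityModel:
    def __init__(self):
        self.samples = defaultdict(list)   # subexpression signature -> ratios

    def observe(self, signature, in_rows, out_rows):
        self.samples[signature].append(out_rows / in_rows)

    def estimate(self, signature, in_rows, default_sel=0.1):
        ratios = self.samples.get(signature)
        sel = sum(ratios) / len(ratios) if ratios else default_sel
        return int(in_rows * sel)

m = SelectivityModel()
m.observe("scan(logs)|filter(region)", 1_000_000, 52_000)
m.observe("scan(logs)|filter(region)", 2_000_000, 101_000)
print(m.estimate("scan(logs)|filter(region)", 5_000_000))  # ~256k rows
```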
Distributed systems are easier to build than ever with the emergence of new, data-centric abstractions for storing and computing over massive datasets. However, similar abstractions do not exist for accessing meta-data. To fill this gap, Tango provides developers with the abstraction of a replicated, in-memory data structure (such as a map or a tree) backed by a shared log. Tango objects are easy to build and use, replicating state via simple append and read operations on the shared log instead of complex distributed protocols; in the process, they obtain properties...
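A miniature log-backed map in the style the abstract describes: mutations are appends to a shared log, and a replica serves reads by replaying the log to its tail. The in-process list stands in for the real shared log service; class names are mine, not Tango's API.

```python
# Log-backed replicated map sketch (all names are illustrative).
class SharedLog:
    def __init__(self):
        self.entries = []
    def append(self, entry):
        self.entries.append(entry)
    def read_from(self, pos):
        return self.entries[pos:]

class LogBackedMap:
    def __init__(self, log):
        self.log, self.state, self.pos = log, {}, 0
    def put(self, k, v):
        self.log.append(("put", k, v))    # replicate via append, no protocol
    def get(self, k):
        for op, key, val in self.log.read_from(self.pos):   # sync to tail
            self.state[key] = val
        self.pos = len(self.log.entries)
        return self.state.get(k)

log = SharedLog()
a, b = LogBackedMap(log), LogBackedMap(log)   # two replicas, one log
a.put("owner", "alice")
print(b.get("owner"))                          # "alice" -- b replayed the log
```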
In this paper, we present Sailfish, a new Map-Reduce framework for large scale data processing. The Sailfish design is centered around aggregating intermediate data, specifically the data produced by map tasks and consumed later by reduce tasks, to improve performance by batching disk I/O. We introduce an abstraction called I-files for supporting data aggregation, and describe how we implemented it as an extension of the distributed filesystem, to efficiently batch data written by multiple writers and read by multiple readers. Sailfish adapts the intermediate data handling layer in Hadoop...
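A sketch of the batching idea behind I-files: records from many writers are buffered and flushed to disk in large sequential chunks, instead of each writer issuing its own small I/O. The flush threshold and class shape are illustrative, not the actual I-file implementation.

```python
# Batched multi-writer append sketch (flush policy is an assumption).
class IFile:
    def __init__(self, flush_bytes=1 << 20):   # flush in ~1 MB batches
        self.buffer, self.buffered, self.chunks = [], 0, []
        self.flush_bytes = flush_bytes

    def append(self, record: bytes):            # called by many map tasks
        self.buffer.append(record)
        self.buffered += len(record)
        if self.buffered >= self.flush_bytes:
            self.flush()

    def flush(self):
        # One large sequential write replaces many small random ones.
        self.chunks.append(b"".join(self.buffer))
        self.buffer, self.buffered = [], 0

f = IFile(flush_bytes=10)
for rec in (b"abcd", b"efgh", b"ijkl"):
    f.append(rec)
print(len(f.chunks), f.buffered)   # one flushed chunk, 0 bytes pending
```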
We observe significant overlaps in the computations performed by user jobs in modern shared analytics clusters. Naïvely computing the same subexpressions multiple times results in wasted cluster resources and longer execution times. Given that these workloads consist of tens of thousands of jobs, identifying overlapping computations across jobs is of great interest to both operators and users. Nevertheless, existing approaches support workloads that are orders of magnitude smaller, or employ heuristics with limited effectiveness. In this paper, we focus...
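One common way to detect such overlaps, shown in miniature: give every subplan a canonical signature and count signatures across jobs; any signature seen in more than one job is a reuse candidate. The nested-tuple plan representation and hashing scheme are illustrative, not the paper's.

```python
# Shared-subexpression detection sketch (plan encoding is an assumption).
import hashlib
from collections import defaultdict

def signatures(plan):
    """Yield a stable signature for each subtree of a nested-tuple plan
    like ("join", ("scan", "A"), ("filter", ("scan", "B")))."""
    yield hashlib.sha1(repr(plan).encode()).hexdigest()[:10]
    for child in plan[1:]:
        if isinstance(child, tuple):
            yield from signatures(child)

jobs = {
    "job1": ("join", ("scan", "A"), ("filter", ("scan", "B"))),
    "job2": ("agg",  ("filter", ("scan", "B"))),
}
seen = defaultdict(set)
for job, plan in jobs.items():
    for sig in signatures(plan):
        seen[sig].add(job)
print([sig for sig, js in seen.items() if len(js) > 1])  # shared subplans
```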
Data-intensive computing (DISC) frameworks scale by partitioning a job across a set of fault-tolerant tasks, then diffusing those tasks across large clusters. Multi-tenanted clusters must accommodate service-level objectives (SLOs) in their resource model, often expressed as a maximum latency for allocating the desired resources to every job. When jobs are partitioned into tasks statically, the cluster cannot meet its SLOs while maintaining both high utilization and efficiency. Ideally, we want to give jobs more resources when they...
The rise in popularity of machine learning, streaming, and latency-sensitive online applications in shared production clusters has raised new challenges for cluster schedulers. To optimize their performance and resilience, these applications require precise control of their placements, by means of complex constraints, e.g., to collocate or separate their long-running containers across groups of nodes. In the presence of these applications, the scheduler must attain global optimization objectives, such as maximizing the number of deployed applications or minimizing...
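A minimal encoding of the kind of placement constraints described here: affinity (collocate two containers' node groups) and anti-affinity (keep them apart). The constraint tuples, container names, and rack groups are illustrative; real schedulers solve a global optimization over many such constraints.

```python
# Affinity / anti-affinity constraint check (shapes are assumptions).
def satisfied(constraint, placement, node_group):
    """placement: container -> node; node_group: node -> group (e.g. rack)."""
    kind, c1, c2 = constraint
    g1, g2 = node_group[placement[c1]], node_group[placement[c2]]
    return g1 == g2 if kind == "affinity" else g1 != g2

node_group = {"n1": "rackA", "n2": "rackA", "n3": "rackB"}
placement = {"hbase-master": "n1", "hbase-region": "n2", "backup": "n3"}
constraints = [("affinity", "hbase-master", "hbase-region"),
               ("anti-affinity", "hbase-master", "backup")]
print(all(satisfied(c, placement, node_group) for c in constraints))  # True
```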
Stream-processing workloads and modern shared cluster environments exhibit high variability and unpredictability. Combined with the large parameter space and the diverse set of user SLOs, this makes streaming systems very challenging to statically configure and tune. To address these issues, in this paper we investigate a novel control-plane design, Chi, which supports continuous monitoring and feedback, and enables dynamic re-configuration. Chi leverages the key insight of embedding control-plane messages in the data-plane channels to achieve...
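The key insight in miniature: control messages ride the same ordered channels as data, so each operator applies a reconfiguration exactly at the point in the stream where it was injected. The operator and message shapes below are illustrative.

```python
# In-band control message sketch (message format is an assumption).
class FilterOperator:
    def __init__(self, threshold):
        self.threshold = threshold

    def process(self, msg):
        kind, payload = msg
        if kind == "control":             # in-band reconfiguration
            self.threshold = payload["threshold"]
            return None
        return payload if payload >= self.threshold else None

op = FilterOperator(threshold=10)
channel = [("data", 5), ("data", 12),
           ("control", {"threshold": 3}),  # flows with the data
           ("data", 5)]
print([out for m in channel if (out := op.process(m)) is not None])
# [12, 5] -- the second 5 passes because the new config took effect in order
```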
Analytics-as-a-service, or analytics job service, is emerging as a new paradigm for data analytics, be it in a cloud environment or within enterprises. In this setting, users are not required to manage or tune their hardware and software infrastructure, and they pay only for the processing resources consumed per job. However, the shared nature of these services across several users and teams leads to significant overlaps in partial computations, i.e., parts of the processing are duplicated across multiple jobs, thus generating redundant costs. In this paper, we...
Walnut is an object-store being developed at Yahoo! with the goal of serving as a common low-level storage layer for a variety of cloud data management systems, including Hadoop (a MapReduce system), MObStor (a multimedia serving system), and PNUTS (an extended key-value system). Thus, a key performance challenge is to meet the latency and throughput requirements of the wide range of workloads commonly observed across these diverse systems. The motivation is to leverage a carefully optimized low-level storage system, with support for elasticity and high-availability, across all...
Large inter-datacenter transfers are crucial for cloud service efficiency and are increasingly used by organizations that have dedicated wide area networks between datacenters. A recent work uses multicast forwarding trees to reduce the bandwidth needs and improve the completion times of point-to-multipoint transfers. Using a single tree per transfer, however, leads to poor performance because the slowest receiver dictates the completion time for all receivers. Using multiple trees per transfer alleviates this concern--the average receiver could finish...
Using multiple datacenters allows for higher availability, load balancing and reduced latency to customers of cloud services. To distribute copies of data, providers depend on inter-datacenter WANs that ought to be used efficiently, considering their limited capacity and the ever-increasing data demands. In this paper, we focus on applications that transfer objects from one datacenter to several datacenters over dedicated networks. We present DCCast, a centralized Point to Multi-Point (P2MP) algorithm that uses forwarding trees...
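A toy version of load-aware forwarding-tree selection for P2MP transfers: among candidate trees connecting the source to all receivers, pick the one whose edges are currently least loaded, then account for the new transfer. The topology, tree enumeration, and cost function are illustrative; DCCast's actual weight assignment differs in detail.

```python
# Load-aware P2MP tree selection sketch (inputs are assumptions).
def pick_tree(candidate_trees, edge_load, volume):
    best = min(candidate_trees, key=lambda t: sum(edge_load[e] for e in t))
    for e in best:                        # one stream serves all receivers
        edge_load[e] += volume
    return best

edge_load = {("s", "a"): 10, ("s", "b"): 2, ("a", "b"): 1, ("b", "a"): 0}
trees = [[("s", "a"), ("a", "b")],       # reach a and b via a
         [("s", "b"), ("b", "a")]]       # reach a and b via b
print(pick_tree(trees, edge_load, volume=5))  # the lighter tree via b
```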
Twitter's data centers process billions of events per day the instant the data is generated. To achieve real-time performance, Twitter has developed Heron, a streaming engine that provides unparalleled performance at large scale. Heron has been recently open-sourced and is thus now accessible to various other organizations. In this paper, we discuss the challenges we faced when transforming Heron from a system tailored for Twitter's applications and software stack to one that efficiently handles applications with diverse characteristics on top of various Big Data...