- Graph Theory and Algorithms
- Advanced Neural Network Applications
- Parallel Computing and Optimization Techniques
- Caching and Content Delivery
- Software System Performance and Reliability
- Privacy-Preserving Technologies in Data
- Computability, Logic, AI Algorithms
- Software-Defined Networks and 5G
- Topic Modeling
- Cloud Computing and Resource Management
- Bayesian Modeling and Causal Inference
- Scientific Computing and Data Management
- Advanced Database Systems and Queries
- Advanced Graph Neural Networks
- Adversarial Robustness in Machine Learning
- Speech Recognition and Synthesis
- Data Management and Algorithms
- Stochastic Gradient Optimization Techniques
- Natural Language Processing Techniques
- Brain Tumor Detection and Classification
- Advanced Memory and Neural Computing
Amazon (United States)
2023-2025
Johns Hopkins University
2024
Northeastern University
2015
Large deep learning models have recently garnered substantial attention from both academia and industry. Nonetheless, frequent failures are observed during large model training due to the large-scale resources involved and the extended training time. Existing solutions incur significant failure recovery costs due to the severe restriction imposed by the bandwidth of the remote storage in which they store checkpoints.
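A back-of-the-envelope sketch of the recovery-cost argument above: with periodic checkpoints kept in remote storage, recovery time is dominated by reloading the checkpoint over the storage link plus the work lost since the last checkpoint. The function and all numbers below are illustrative assumptions, not figures from the paper.

```python
# Illustrative sketch only: recovery cost under remote-storage checkpointing.
def recovery_cost_seconds(checkpoint_size_gb: float,
                          remote_bandwidth_gbps: float,
                          checkpoint_interval_s: float) -> float:
    """Time to reload the last checkpoint plus the expected lost work."""
    reload_time = checkpoint_size_gb * 8 / remote_bandwidth_gbps  # GB -> Gb over Gb/s
    lost_work = checkpoint_interval_s / 2                         # expected wasted compute
    return reload_time + lost_work


if __name__ == "__main__":
    # Assumed example: a 700 GB checkpoint (model + optimizer states) over a
    # 20 Gb/s link already takes minutes to reload, before recomputing lost steps.
    print(f"{recovery_cost_seconds(700, 20, 1800):.0f} s")
```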
Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least f + 1 logically equivalent pipeline replicas to tolerate any f simultaneous failures. During execution, it relies on already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all...
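A minimal sketch of the f + 1 replica idea, assuming hypothetical names (`cover_nodes`, `template_sizes`) that are not Oobleck's actual API: it only illustrates that, given heterogeneous pipeline template sizes, some combination of templates can still cover the nodes that remain after up to f simultaneous failures.

```python
# Hypothetical sketch, not Oobleck's real interface.
from itertools import combinations_with_replacement


def cover_nodes(num_nodes: int, template_sizes: list[int], max_pipelines: int = 8):
    """Return one multiset of pipeline template sizes whose total equals
    num_nodes, or None if no combination of at most max_pipelines fits."""
    for k in range(1, max_pipelines + 1):
        for combo in combinations_with_replacement(template_sizes, k):
            if sum(combo) == num_nodes:
                return list(combo)
    return None


if __name__ == "__main__":
    template_sizes = [2, 3, 4]   # heterogeneous templates: 2-, 3-, 4-node pipelines
    total_nodes, f = 13, 3       # tolerate up to f simultaneous node failures
    print("initial plan:", cover_nodes(total_nodes, template_sizes))
    # After f node failures, the surviving nodes are re-covered by some template mix.
    print("after failures:", cover_nodes(total_nodes - f, template_sizes))
```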
Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice struggles with the trade-off between usability and performance. On one hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model developers, at the price of sub-optimal performance. On the other hand, practitioners propose various approaches to improving training efficiency by sacrificing some flexibility, ranging from making the graph static for more thorough optimization (e.g., XLA)...
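The trade-off can be seen directly in PyTorch: the same module either runs eagerly on a dynamic graph (easy to write and debug) or is captured with `torch.compile` (available since PyTorch 2.0) into a more static graph that is amenable to more thorough optimization. A minimal sketch:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024)

eager_out = model(x)                   # dynamic / eager execution, fully flexible
compiled_model = torch.compile(model)  # capture into an optimizable graph (PyTorch 2.0+)
compiled_out = compiled_model(x)       # first call triggers compilation

# Both paths compute the same function, up to small numerical differences.
torch.testing.assert_close(eager_out, compiled_out, rtol=1e-3, atol=1e-3)
```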
Deep learning (DL) systems suffer from low resource utilization due to 1) a monolithic server model that tightly couples compute and memory, and 2) limited sharing between different inference applications, as well as across inference and training, because of strict service level objectives (SLOs). To address this problem, we present a disaggregated DL system that enables efficient multiplexing of DL applications with near-optimal resource utilization. It decouples GPU compute from host memory and exposes the abstractions of a GPU pool and a memory pool, each of which can be...
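A hypothetical sketch of the pool abstractions described above; the names `GpuPool`, `MemoryPool`, and `lease` are illustrative, not the system's actual API. It only shows how decoupling compute from host memory lets applications request each resource in independent ratios instead of receiving a tightly coupled server.

```python
# Illustrative sketch of disaggregated resource pools, not the paper's API.
from dataclasses import dataclass


@dataclass
class GpuPool:
    free_gpus: int

    def lease(self, n: int) -> int:
        """Hand out n GPUs, independently of where host memory is placed."""
        if n > self.free_gpus:
            raise RuntimeError("GPU pool exhausted")
        self.free_gpus -= n
        return n


@dataclass
class MemoryPool:
    free_gb: int

    def lease(self, gb: int) -> int:
        """Hand out host memory, independently of which GPUs were leased."""
        if gb > self.free_gb:
            raise RuntimeError("memory pool exhausted")
        self.free_gb -= gb
        return gb


if __name__ == "__main__":
    gpus, mem = GpuPool(free_gpus=16), MemoryPool(free_gb=2048)
    # An inference job and a training job multiplex the same pools with
    # independent GPU-to-memory ratios.
    inference = (gpus.lease(1), mem.lease(256))
    training = (gpus.lease(8), mem.lease(512))
    print(inference, training, gpus.free_gpus, mem.free_gb)
```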