Zhen Zhang

ORCID: 0000-0002-0164-0849
Research Areas
  • Graph Theory and Algorithms
  • Advanced Neural Network Applications
  • Parallel Computing and Optimization Techniques
  • Caching and Content Delivery
  • Software System Performance and Reliability
  • Privacy-Preserving Technologies in Data
  • Computability, Logic, AI Algorithms
  • Software-Defined Networks and 5G
  • Topic Modeling
  • Cloud Computing and Resource Management
  • Bayesian Modeling and Causal Inference
  • Scientific Computing and Data Management
  • Advanced Database Systems and Queries
  • Advanced Graph Neural Networks
  • Adversarial Robustness in Machine Learning
  • Speech Recognition and Synthesis
  • Data Management and Algorithms
  • Stochastic Gradient Optimization Techniques
  • Natural Language Processing Techniques
  • Brain Tumor Detection and Classification
  • Advanced Memory and Neural Computing

Amazon (United States)
2023-2025

Johns Hopkins University
2024

Northeastern University
2015

Large deep learning models have recently garnered substantial attention from both academia and industry. Nonetheless, frequent failures are observed during large model training due to the large-scale resources involved and the extended training time. Existing solutions incur significant failure recovery costs because of the severe restriction imposed by the bandwidth of the remote storage in which they store checkpoints.

10.1145/3600006.3613145 article EN 2023-10-03
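The recovery-cost argument above comes down to how training state is checkpointed. Below is a minimal sketch, assuming PyTorch; the checkpoint path, interval, and training loop are illustrative assumptions, not the paper's mechanism, but they show why the write (and post-failure restore) is bounded by remote-storage bandwidth.

```python
import torch

def train_with_checkpoints(model, optimizer, data_loader,
                           ckpt_path="/mnt/remote_storage/ckpt.pt",  # hypothetical remote mount
                           ckpt_interval=1000):
    """Periodically persist training state so a failure only loses work since
    the last checkpoint. When ckpt_path sits on remote storage, both the save
    and the restore after a failure are limited by that storage's bandwidth."""
    for step, (inputs, targets) in enumerate(data_loader):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % ckpt_interval == 0:
            torch.save({"step": step,
                        "model": model.state_dict(),
                        "optimizer": optimizer.state_dict()},
                       ckpt_path)
```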

Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least f + 1 logically equivalent pipeline replicas to tolerate any f simultaneous failures. During execution, it relies on the already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all...

10.1145/3600006.3613152 preprint EN 2023-10-03
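A minimal Python sketch of the f + 1 replica idea described above. The PipelineTemplate type, the greedy packing, and the node counts are illustrative assumptions, not Oobleck's actual planner; the point is that replicas instantiated from pre-generated templates can re-cover the surviving nodes after failures.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class PipelineTemplate:
    num_nodes: int   # nodes this template needs
    num_stages: int  # pipeline-parallel stages it defines

def instantiate_replicas(templates: List[PipelineTemplate],
                         available_nodes: int, f: int) -> List[PipelineTemplate]:
    """Toy greedy packing: pick pipeline replicas (possibly from different
    templates) that fit the node budget; with at least f + 1 replicas, up to
    f simultaneous failures still leave a live copy of the model state."""
    replicas, used = [], 0
    for tpl in sorted(templates, key=lambda t: t.num_nodes):
        while used + tpl.num_nodes <= available_nodes:
            replicas.append(tpl)
            used += tpl.num_nodes
    if len(replicas) < f + 1:
        raise RuntimeError("not enough nodes to tolerate f simultaneous failures")
    return replicas

def recover(templates: List[PipelineTemplate],
            surviving_nodes: int) -> List[PipelineTemplate]:
    """After failures, re-cover the surviving nodes with some combination of
    the initially created templates instead of restarting from a checkpoint."""
    return instantiate_replicas(templates, surviving_nodes, f=0)

# Example: three heterogeneous templates, 16 nodes, tolerate f = 2 failures.
templates = [PipelineTemplate(2, 2), PipelineTemplate(4, 4), PipelineTemplate(8, 8)]
replicas = instantiate_replicas(templates, available_nodes=16, f=2)
after_failure = recover(templates, surviving_nodes=13)
```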

Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice struggles with the trade-off between usability and performance. On one hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model developers at the price of sub-optimal performance; on the other hand, practitioners propose various approaches to improving training efficiency by sacrificing some flexibility, ranging from making the graph static for more thorough optimization (e.g., XLA)...

10.1145/3620665.3640399 article EN 2024-04-22
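The usability-versus-performance trade-off mentioned above can be illustrated in a few lines. This is a minimal sketch, assuming PyTorch 2.x; the model shape and sizes are arbitrary, and torch.compile stands in for the general "make the graph static for more thorough optimization" approach, not the paper's own technique.

```python
import torch
import torch.nn as nn

# Eager (dynamic-graph) definition: easy to write and debug,
# but leaves cross-operator optimization on the table.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024)
eager_out = model(x)

# Compiling the same model trades some flexibility for performance:
# the graph is captured and optimized ahead of execution, similar in
# spirit to static-graph systems such as XLA.
compiled_model = torch.compile(model)   # requires PyTorch 2.x
compiled_out = compiled_model(x)        # same semantics, optimized execution
```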

Deep learning (DL) systems suffer from low resource utilization due to 1) a monolithic server model that tightly couples compute and memory, and 2) limited sharing between different inference applications, and across inference and training, because of strict service level objectives (SLOs). To address this problem, we present a disaggregated DL system that enables efficient multiplexing of applications with near-optimal utilization. It decouples compute from host memory and exposes the abstractions of a GPU pool and a memory pool, each of which can be...

10.1109/tnet.2024.3355010 article EN IEEE/ACM Transactions on Networking 2024-01-24
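A minimal sketch of the GPU-pool and memory-pool abstractions described above, in Python. The class and method names (MemoryPool, GPUPool, acquire, release) are illustrative assumptions rather than the system's actual API; they only show how decoupling compute from host memory lets applications share both pools.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MemoryPool:
    """Disaggregated host-memory pool: model state is kept here once and can
    be streamed to whichever GPU currently needs it."""
    cached_models: Dict[str, bytes] = field(default_factory=dict)

    def put(self, model_id: str, weights: bytes) -> None:
        self.cached_models[model_id] = weights

    def get(self, model_id: str) -> bytes:
        return self.cached_models[model_id]

@dataclass
class GPUPool:
    """Disaggregated GPU pool: any free GPU can be assigned to any
    application (inference or training), improving utilization."""
    free_gpus: List[int]

    def acquire(self) -> int:
        return self.free_gpus.pop()

    def release(self, gpu_id: int) -> None:
        self.free_gpus.append(gpu_id)

# An application borrows a GPU, loads weights from the shared memory pool,
# runs its work, and returns the GPU so another application can reuse it.
mem_pool = MemoryPool()
gpu_pool = GPUPool(free_gpus=[0, 1, 2, 3])
mem_pool.put("resnet50", b"...serialized weights...")
gpu = gpu_pool.acquire()
weights = mem_pool.get("resnet50")
gpu_pool.release(gpu)
```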