Zhen Zhang

ORCID: 0000-0002-0164-0849
Research Areas
  • Graph Theory and Algorithms
  • Advanced Neural Network Applications
  • Parallel Computing and Optimization Techniques
  • Caching and Content Delivery
  • Software System Performance and Reliability
  • Privacy-Preserving Technologies in Data
  • Computability, Logic, AI Algorithms
  • Software-Defined Networks and 5G
  • Topic Modeling
  • Cloud Computing and Resource Management
  • Bayesian Modeling and Causal Inference
  • Scientific Computing and Data Management
  • Advanced Database Systems and Queries
  • Advanced Graph Neural Networks
  • Adversarial Robustness in Machine Learning
  • Speech Recognition and Synthesis
  • Data Management and Algorithms
  • Stochastic Gradient Optimization Techniques
  • Natural Language Processing Techniques
  • Brain Tumor Detection and Classification
  • Advanced Memory and Neural Computing

Amazon (United States)
2023-2025

Johns Hopkins University
2024

Northeastern University
2015

Large deep learning models have recently garnered substantial attention from both academia and industry. Nonetheless, frequent failures are observed during large model training due to the large-scale resources involved and the extended training time. Existing solutions incur significant failure recovery costs because of the severe restriction imposed by the bandwidth of the remote storage in which they store checkpoints.

10.1145/3600006.3613145 article EN 2023-10-03
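The recovery-cost argument above comes down to how training state is checkpointed. Below is a minimal sketch, assuming PyTorch; the checkpoint path, interval, and training loop are illustrative assumptions, not the paper's mechanism, but they show why the write (and post-failure restore) is bounded by remote-storage bandwidth.

```python
import torch

def train_with_checkpoints(model, optimizer, data_loader,
                           ckpt_path="/mnt/remote_storage/ckpt.pt",  # hypothetical remote mount
                           ckpt_interval=1000):
    """Periodically persist training state so a failure only loses work since
    the last checkpoint. When ckpt_path sits on remote storage, both the save
    and the restore after a failure are limited by that storage's bandwidth."""
    for step, (inputs, targets) in enumerate(data_loader):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % ckpt_interval == 0:
            torch.save({"step": step,
                        "model": model.state_dict(),
                        "optimizer": optimizer.state_dict()},
                       ckpt_path)
```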

Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least f + 1 logically equivalent pipeline replicas to tolerate any f simultaneous failures. During execution, it relies on the already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all...

10.1145/3600006.3613152 preprint EN 2023-10-03
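A minimal Python sketch of the f + 1 replica idea described above. The PipelineTemplate type, the greedy packing, and the node counts are illustrative assumptions, not Oobleck's actual planner; the point is that replicas instantiated from pre-generated templates can re-cover the surviving nodes after failures.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class PipelineTemplate:
    num_nodes: int   # nodes this template needs
    num_stages: int  # pipeline-parallel stages it defines

def instantiate_replicas(templates: List[PipelineTemplate],
                         available_nodes: int, f: int) -> List[PipelineTemplate]:
    """Toy greedy packing: pick pipeline replicas (possibly from different
    templates) that fit the node budget; with at least f + 1 replicas, up to
    f simultaneous failures still leave a live copy of the model state."""
    replicas, used = [], 0
    for tpl in sorted(templates, key=lambda t: t.num_nodes):
        while used + tpl.num_nodes <= available_nodes:
            replicas.append(tpl)
            used += tpl.num_nodes
    if len(replicas) < f + 1:
        raise RuntimeError("not enough nodes to tolerate f simultaneous failures")
    return replicas

def recover(templates: List[PipelineTemplate],
            surviving_nodes: int) -> List[PipelineTemplate]:
    """After failures, re-cover the surviving nodes with some combination of
    the initially created templates instead of restarting from a checkpoint."""
    return instantiate_replicas(templates, surviving_nodes, f=0)

# Example: three heterogeneous templates, 16 nodes, tolerate f = 2 failures.
templates = [PipelineTemplate(2, 2), PipelineTemplate(4, 4), PipelineTemplate(8, 8)]
replicas = instantiate_replicas(templates, available_nodes=16, f=2)
after_failure = recover(templates, surviving_nodes=13)
```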

Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice struggles with the trade-off between usability and performance. On one hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model developers at the price of sub-optimal performance; on the other hand, practitioners propose various approaches to improving training efficiency by sacrificing some flexibility, ranging from making the graph static for more thorough optimization (e.g., XLA)...

10.1145/3620665.3640399 article EN 2024-04-22
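The usability-versus-performance trade-off mentioned above can be illustrated in a few lines. This is a minimal sketch, assuming PyTorch 2.x; the model shape and sizes are arbitrary, and torch.compile stands in for the general "make the graph static for more thorough optimization" approach, not the paper's own technique.

```python
import torch
import torch.nn as nn

# Eager (dynamic-graph) definition: easy to write and debug,
# but leaves cross-operator optimization on the table.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024)
eager_out = model(x)

# Compiling the same model trades some flexibility for performance:
# the graph is captured and optimized ahead of execution, similar in
# spirit to static-graph systems such as XLA.
compiled_model = torch.compile(model)   # requires PyTorch 2.x
compiled_out = compiled_model(x)        # same semantics, optimized execution
```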

Deep learning (DL) systems suffer from low resource utilization due to 1) a monolithic server model that tightly couples compute and memory, and 2) limited sharing between different inference applications, and across inference and training, because of strict service level objectives (SLOs). To address this problem, we present a disaggregated DL system that enables efficient multiplexing of applications with near-optimal utilization. It decouples compute from host memory and exposes the abstractions of a GPU pool and a memory pool, each of which can be...

10.1109/tnet.2024.3355010 article EN IEEE/ACM Transactions on Networking 2024-01-24
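A minimal sketch of the GPU-pool and memory-pool abstractions described above, in Python. The class and method names (MemoryPool, GPUPool, acquire, release) are illustrative assumptions rather than the system's actual API; they only show how decoupling compute from host memory lets applications share both pools.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MemoryPool:
    """Disaggregated host-memory pool: model state is kept here once and can
    be streamed to whichever GPU currently needs it."""
    cached_models: Dict[str, bytes] = field(default_factory=dict)

    def put(self, model_id: str, weights: bytes) -> None:
        self.cached_models[model_id] = weights

    def get(self, model_id: str) -> bytes:
        return self.cached_models[model_id]

@dataclass
class GPUPool:
    """Disaggregated GPU pool: any free GPU can be assigned to any
    application (inference or training), improving utilization."""
    free_gpus: List[int]

    def acquire(self) -> int:
        return self.free_gpus.pop()

    def release(self, gpu_id: int) -> None:
        self.free_gpus.append(gpu_id)

# An application borrows a GPU, loads weights from the shared memory pool,
# runs its work, and returns the GPU so another application can reuse it.
mem_pool = MemoryPool()
gpu_pool = GPUPool(free_gpus=[0, 1, 2, 3])
mem_pool.put("resnet50", b"...serialized weights...")
gpu = gpu_pool.acquire()
weights = mem_pool.get("resnet50")
gpu_pool.release(gpu)
```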