- Advanced Neural Network Applications
- Topic Modeling
- Parallel Computing and Optimization Techniques
- Cloud Computing and Resource Management
- Distributed and Parallel Computing Systems
- Caching and Content Delivery
- Natural Language Processing Techniques
- Stochastic Gradient Optimization Techniques
- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Graph Theory and Algorithms
- Advanced Image and Video Retrieval Techniques
- Ferroelectric and Negative Capacitance Devices
- Machine Learning and Data Classification
- Data Management and Algorithms
- Optimization and Search Problems
- IoT and Edge/Fog Computing
- Advanced Graph Neural Networks
- Age of Information Optimization
- Algorithms and Data Compression
- Speech Recognition and Synthesis
- Adversarial Robustness in Machine Learning
- Interconnection Networks and Systems
- Neural Networks and Applications
- Scheduling and Optimization Algorithms
- Microsoft (United States), 2015-2024
- Microsoft Research (United Kingdom), 2011-2024
- Bellevue Hospital Center, 2019-2024
- The Ohio State University, 2023
- Microsoft (Finland), 2023
- Microsoft (Germany), 2023
- Max Planck Institute for Software Systems, 2017
- Google (United States), 2017
- Futures Group (United States), 2016
- Nanyang Technological University, 2008-2010
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelism exhibit fundamental limitations in fitting these models into limited device memory while obtaining computation, communication, and development efficiency. We develop a novel solution, the Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory...
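The core idea (partitioning, rather than replicating, optimizer state across data-parallel ranks) can be shown in a toy sketch. This is a minimal illustration of ZeRO stage-1-style partitioning in plain NumPy, not the DeepSpeed implementation; all names and constants are hypothetical.

```python
# Toy sketch of ZeRO-style optimizer-state partitioning (hypothetical names,
# not the DeepSpeed implementation). Classic data parallelism keeps a full
# copy of the optimizer state on every rank; ZeRO stage 1 gives each rank
# only a 1/N shard of it.
import numpy as np

NUM_RANKS = 4
NUM_PARAMS = 1_000_000

def state_bytes_per_rank(replicated: bool) -> int:
    # Mixed-precision Adam keeps fp32 momentum, variance, and a master copy
    # of the parameters: 3 state tensors x 4 bytes per parameter.
    full_state = 3 * 4 * NUM_PARAMS
    return full_state if replicated else full_state // NUM_RANKS

print("replicated :", state_bytes_per_rank(True), "bytes/rank")
print("ZeRO shard :", state_bytes_per_rank(False), "bytes/rank")

# Each rank updates only its own shard of the parameters; the updated
# shards are then re-assembled (an all-gather in a real system).
params = np.zeros(NUM_PARAMS, dtype=np.float32)
grads = np.random.randn(NUM_PARAMS).astype(np.float32)
shards = np.array_split(np.arange(NUM_PARAMS), NUM_RANKS)
for rank, idx in enumerate(shards):
    params[idx] -= 0.01 * grads[idx]   # rank-local update of its shard
```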
Explore new techniques in Microsoft's open source library called DeepSpeed, which advances large model training by improving scale, speed, cost, and usability, unlocking the ability to train 100-billion-parameter models. DeepSpeed is compatible with PyTorch. One piece of our library, ZeRO, is a parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. Researchers have used these breakthroughs to create Turing...
Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot, and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable the training of such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer-based language model,...
In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while GPU memory has only grown by 5x (16 GB to 80 GB). Therefore, the growth in model scale has been supported primarily through system innovations that allow large models to fit in the aggregate memory of multiple GPUs. However, we are getting close to the GPU memory wall. It requires 800 NVIDIA V100 GPUs just to fit a trillion-parameter model for training, and such clusters are simply out of reach for most data scientists. In addition, training at that scale involves complex...
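The memory arithmetic behind this "wall" is easy to reproduce. A rough sketch, assuming the common mixed-precision Adam accounting of about 16 bytes of model states per parameter; exact figures vary by recipe, and activations and working buffers (not counted here) push the real requirement higher.

```python
# Back-of-the-envelope model-state memory for mixed-precision Adam training:
# fp16 params (2 B) + fp16 grads (2 B) + fp32 master params, momentum, and
# variance (3 x 4 B) = 16 bytes per parameter. Activations, buffers, and
# fragmentation are extra, which is why real clusters need more headroom.
BYTES_PER_PARAM = 2 + 2 + 3 * 4

def gpus_needed(num_params: float, gpu_mem_gb: float) -> float:
    total_gb = num_params * BYTES_PER_PARAM / 2**30
    return -(-total_gb // gpu_mem_gb)  # ceiling division

for params in (1e9, 100e9, 1e12):
    print(f"{params/1e9:>6.0f}B params -> "
          f"{params * BYTES_PER_PARAM / 2**40:.1f} TiB of model states, "
          f">= {gpus_needed(params, 32):.0f} x 32 GB V100s just to hold them")
```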
AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein–ligand complex structure prediction, (2) investigate the process by which the model learns, and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient, and trainable implementation of AlphaFold2. We...
The landscape of transformer model inference is increasingly diverse in model size, model characteristics, latency and throughput requirements, hardware, etc. With such diversity, designing a versatile inference system is challenging. DeepSpeed-Inference addresses these challenges by offering (1) a multi-GPU inference solution to minimize latency while maximizing throughput for both dense and sparse transformers when the model fits in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU/NVMe/GPU memory to enable high-throughput inference for models larger than aggregate GPU memory. It reduces...
Model compression is significant for the wide adoption of Recurrent Neural Networks (RNNs) in both user devices possessing limited resources and business clusters requiring quick responses to large-scale service requests. This work aims to learn structurally sparse Long Short-Term Memory (LSTM) networks by reducing the sizes of basic structures within LSTM units, including input updates, gates, hidden states, cell states, and outputs. Independently reducing the sizes of these basic structures can result in inconsistent dimensions among them and, consequently, end up...
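A minimal sketch of the "structured" part of the idea: score whole hidden units by a group norm pooled across all gate matrices and prune entire units, so dimensions stay consistent across gates. This is a NumPy illustration of group-wise pruning, not the paper's training procedure; all sizes are hypothetical.

```python
# Toy structured pruning of an LSTM-like weight block (illustration only).
# An LSTM layer stacks 4 gate matrices; removing hidden unit j only keeps
# dimensions consistent if the SAME j is removed from all four gates.
import numpy as np

hidden, inputs = 8, 6
W = np.random.randn(4 * hidden, inputs)          # [i; f; g; o] gate weights
gates = W.reshape(4, hidden, inputs)             # (gate, hidden unit, input)

# Group score: one L2 norm per hidden unit, pooled across all 4 gates,
# so a unit is kept or dropped everywhere at once.
group_norm = np.sqrt((gates ** 2).sum(axis=(0, 2)))
keep = group_norm >= np.median(group_norm)       # prune the weakest units

W_pruned = gates[:, keep, :].reshape(-1, inputs)
print("kept units:", keep.sum(), "new shape:", W_pruned.shape)
```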
Big deep neural network (DNN) models trained on large amounts of data have recently achieved the best accuracy on hard tasks, such as image and speech recognition. Training these DNNs using a cluster of commodity machines is a promising approach, since training is time-consuming and compute-intensive. To enable training of extremely large DNNs, models are partitioned across machines. To expedite training on very large data sets, multiple model replicas are trained in parallel on different subsets of the training examples, with a global parameter server maintaining shared weights across the replicas....
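The data-parallel pattern described here is easy to demonstrate in miniature: replicas compute gradients on their own data shards while a parameter server holds the shared weights. A single-process toy sketch with hypothetical names; real systems run replicas across machines, asynchronously.

```python
# Minimal parameter-server pattern, simulated in one process (hypothetical
# sketch, not the paper's system). Replicas pull weights, compute gradients
# on their own data shard, and push updates to the shared server copy.
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.w, self.lr = np.zeros(dim), lr
    def pull(self):
        return self.w.copy()
    def push(self, grad):
        self.w -= self.lr * grad     # apply a replica's gradient

def replica_grad(w, X, y):
    # Gradient of mean squared error for a linear model on one data shard.
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(400, 5)), np.arange(5.0)
y = X @ true_w
ps = ParameterServer(dim=5)
shards = np.array_split(np.arange(400), 4)       # 4 "replicas"

for epoch in range(50):
    for shard in shards:                         # round-robin stands in for
        w = ps.pull()                            # asynchronous replicas
        ps.push(replica_grad(w, X[shard], y[shard]))
print("learned:", ps.w.round(2))
```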
Developers use Machine Learning (ML) platforms to train ML models and then deploy these models as web services for inference (prediction). A key challenge for platform providers is to guarantee response-time Service Level Agreements (SLAs) for inference workloads while maximizing resource efficiency. Swayam is a fully distributed autoscaling framework that exploits characteristics of production ML inference workloads to deliver on the dual challenge of resource efficiency and SLA compliance. Our contributions are (1) a model-based autoscaler that takes into account SLAs and ML inference workload...
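As a rough illustration of model-based autoscaling, the sketch below sizes a server pool from a measured request rate and service-time tail so that a latency SLA is met with headroom. It is a hypothetical simplification (a crude utilization-based tail-inflation model), not Swayam's actual estimator.

```python
# Hypothetical model-based autoscaler sketch (not Swayam's algorithm).
# Choose the smallest number of servers such that the predicted p99
# latency, under a simple utilization-based inflation model, meets the SLA.
import math

def predicted_p99(rate_rps, servers, service_p99_s, capacity_rps):
    util = rate_rps / (servers * capacity_rps)
    if util >= 1.0:
        return math.inf                 # overloaded: queue grows unboundedly
    return service_p99_s / (1.0 - util) # queueing inflates tail latency

def servers_needed(rate_rps, sla_s, service_p99_s=0.05, capacity_rps=40):
    n = max(1, math.ceil(rate_rps / capacity_rps))  # enough raw capacity
    while predicted_p99(rate_rps, n, service_p99_s, capacity_rps) > sla_s:
        n += 1
    return n

for rate in (100, 500, 2000):
    print(rate, "rps ->", servers_needed(rate, sla_s=0.1), "servers")
```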
How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements. In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed ZeroQuant. ZeroQuant is an end-to-end quantization and inference pipeline with three main components: (1) a fine-grained, hardware-friendly quantization scheme for both weights and activations;...
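The flavor of a fine-grained scheme can be shown in a few lines: instead of one scale for a whole tensor, each small group of weights gets its own scale, so an outlier in one group does not destroy precision everywhere else. A NumPy sketch of symmetric per-group int8 quantization, illustrative only and not ZeroQuant's exact recipe.

```python
# Symmetric per-group int8 quantization sketch (illustration, not the exact
# ZeroQuant scheme). Finer groups track local weight ranges, which is what
# keeps low-bit quantization both hardware-friendly and accurate.
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int):
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(groups / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
w[0, 0] = 1.0                                   # a single outlier weight

for gs in (w.size, 64):                         # per-tensor vs. per-group
    q, s = quantize_groupwise(w, gs)
    err = np.abs(dequantize(q, s, w.shape) - w).mean()
    print(f"group size {gs:>6}: mean abs error {err:.2e}")
```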
The Cilkview scalability analyzer is a software tool for profiling, estimating the scalability of, and benchmarking multithreaded Cilk++ applications. Cilkview monitors logical parallelism during an instrumented execution of the application on a single processing core. As the application executes, Cilkview analyzes the dependencies within the computation to determine its work and span (critical-path length). These metrics allow Cilkview to estimate parallelism and predict how the application will scale with the number of cores. In addition, it analyzes scheduling overhead using the concept of a "burdened dag,"...
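The work/span metrics reported here obey simple laws: with work T1 and span T∞, any greedy schedule on P cores satisfies max(T1/P, T∞) ≤ T_P ≤ T1/P + T∞, so the parallelism T1/T∞ bounds achievable speedup. A small Python sketch computing these quantities for a toy task DAG, illustrating the metrics rather than Cilkview's instrumentation.

```python
# Work/span analysis of a task DAG (illustration of the metrics Cilkview
# estimates, not its instrumentation). work = total node cost; span = cost
# of the longest (critical) path; parallelism = work / span.
from functools import lru_cache

cost = {"a": 1, "b": 4, "c": 3, "d": 2, "e": 1}
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["d"]}

work = sum(cost.values())

@lru_cache(maxsize=None)
def finish(node):            # longest-path (critical-path) completion time
    return cost[node] + max((finish(d) for d in deps[node]), default=0)

span = max(finish(n) for n in cost)
print(f"work={work} span={span} parallelism={work/span:.2f}")

# Work and span laws: speedup on P cores is at most min(P, work/span).
for p in (1, 2, 4, 8):
    print(f"P={p}: speedup <= {min(p, work/span):.2f}")
```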
Decreasing the soaring energy cost is imperative in large data centers. Meanwhile, limited computational resources need to be fairly allocated among different organizations. Latency is another major concern for resource management. Nevertheless, energy cost, allocation fairness, and latency are important but often contradicting metrics in scheduling data center workloads. In this paper, we explore the benefit of electricity price variations across time and locations. We study the problem of scheduling batch jobs, which originate...
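A toy version of the price-aware placement decision: given per-location, per-hour electricity prices and a job's energy demand, a greedy scheduler sends each batch job to the cheapest feasible (location, hour) slot before its deadline. This is a hypothetical sketch; the paper's formulation also balances fairness and latency, which this greedy rule ignores.

```python
# Greedy electricity-price-aware batch-job placement (toy sketch; the real
# problem also trades off fairness and latency). price[loc][hour] in $/kWh.
price = {
    "us-west": [0.10, 0.08, 0.12, 0.15],
    "us-east": [0.14, 0.11, 0.09, 0.10],
}
capacity = {(loc, h): 2 for loc in price for h in range(4)}  # jobs per slot

def place(job_kwh, deadline_hour):
    # Consider only slots that run before the job's deadline.
    slots = [(price[loc][h] * job_kwh, loc, h)
             for loc in price for h in range(deadline_hour)
             if capacity[(loc, h)] > 0]
    cost, loc, h = min(slots)
    capacity[(loc, h)] -= 1
    return loc, h, cost

for job, (kwh, dl) in enumerate([(50, 4), (50, 4), (50, 2), (80, 3)]):
    loc, h, cost = place(kwh, dl)
    print(f"job {job}: {loc} hour {h}, ${cost:.2f}")
```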
Web search engines are optimized to reduce the high-percentile response time so as to consistently provide fast responses to almost all user queries. This is a challenging task because the query workload exhibits large variability, consisting of many short-running queries and a few long-running queries that significantly impact the response time. With modern multicore servers, parallelizing the processing of an individual query is a promising solution to reduce execution time, but it gives limited benefits compared to sequential execution since most queries see little or no...
Interactive services, such as Web search, recommendations, games, and finance, must respond quickly to satisfy customers. Achieving this goal requires optimizing tail (e.g., 99th+ percentile) latency. Although every server is multicore, parallelizing individual requests to reduce tail latency is challenging because (1) service demand is unknown when requests arrive; (2) blindly parallelizing all requests oversubscribes hardware resources; and (3) parallelizing the numerous short requests will not improve tail latency. This paper introduces Few-to-Many (FM) incremental...
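The FM idea, running a request sequentially at first and adding threads only if it turns out to be long, can be sketched as a schedule that maps elapsed-time thresholds to degrees of parallelism. A simplified simulation follows, with hypothetical thresholds and efficiency; the paper's contribution is computing such schedules from the demand distribution.

```python
# Few-to-Many-style incremental parallelism, simulated (toy sketch; the
# paper derives the threshold schedule from measured demand profiles).
# A request starts with 1 thread; as it ages past each threshold it is
# granted more threads, so only the rare long requests consume extra cores.
def completion_time(demand, schedule=((0.0, 1), (10.0, 2), (30.0, 4)),
                    efficiency=0.9):
    done = 0.0
    for i, (start, threads) in enumerate(schedule):
        end = schedule[i + 1][0] if i + 1 < len(schedule) else float("inf")
        # Parallel speedup is sublinear: each phase processes work at
        # rate threads * efficiency**(threads - 1).
        rate = threads * efficiency ** (threads - 1)
        if demand - done <= (end - start) * rate:
            return start + (demand - done) / rate
        done += (end - start) * rate

for demand in (5, 20, 120):          # short, medium, long requests (ms)
    print(f"demand {demand:>3} ms -> finishes at "
          f"{completion_time(demand):.1f} ms")
```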
Large-scale model training has been a playing ground for a limited few, requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large-model-training landscape by making it accessible to nearly everyone. It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular frameworks such as PyTorch, and it does so without requiring any model change from data scientists or sacrificing computational efficiency. ZeRO-Offload enables large model training by offloading data and compute to the CPU. To...
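The division of labor in such offloading designs is: the GPU holds fp16 weights and runs forward/backward, while the CPU holds the fp32 Adam states and performs the optimizer step on gradients streamed over. A single-process NumPy sketch of that CPU step; illustrative only, since the real ZeRO-Offload also overlaps PCIe transfers with GPU compute.

```python
# Sketch of a CPU-side mixed-precision Adam step as used by offloading
# designs (illustration only, not the ZeRO-Offload implementation). fp16
# arrays stand in for GPU tensors; fp32 arrays for CPU-resident states.
import numpy as np

def cpu_adam_step(state, grad_fp16, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    g = grad_fp16.astype(np.float32)          # "transfer" grad to CPU fp32
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g
    state["v"] = b2 * state["v"] + (1 - b2) * g * g
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    state["w"] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return state["w"].astype(np.float16)      # "transfer" weights back

n = 1000
state = {"w": np.ones(n, np.float32), "m": np.zeros(n, np.float32),
         "v": np.zeros(n, np.float32), "t": 0}
w_gpu = state["w"].astype(np.float16)
for _ in range(3):
    grad_gpu = np.random.randn(n).astype(np.float16)  # from backward pass
    w_gpu = cpu_adam_step(state, grad_gpu)
print("fp16 weights on 'GPU':", w_gpu[:4])
```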
As the training of giant dense models hits the boundary of the availability and capability of hardware resources today, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. Their cost saving has been demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work, along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE inference remains challenging...
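The defining computation in MoE models is sparse gating: a router scores experts per token and only the top-scoring experts run, so per-token compute stays near-constant as expert count grows. A NumPy sketch of top-1 gating follows; it is illustrative, not DeepSpeed-MoE's kernels or load-balancing logic.

```python
# Top-1 mixture-of-experts gating sketch in NumPy (illustration only).
# Each token is routed to its single best-scoring expert, so adding
# experts grows capacity without growing per-token FLOPs.
import numpy as np

rng = np.random.default_rng(0)
tokens, d_model, n_experts = 8, 16, 4
x = rng.normal(size=(tokens, d_model))
W_gate = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

logits = x @ W_gate
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
choice = probs.argmax(axis=1)                  # top-1 expert per token

y = np.zeros_like(x)
for e in range(n_experts):
    idx = np.where(choice == e)[0]
    if idx.size:                               # run expert e on its tokens,
        y[idx] = probs[idx, e, None] * (x[idx] @ experts[e])  # gate-scaled
print("tokens per expert:", np.bincount(choice, minlength=n_experts))
```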
Recent advances in deep learning models come at the price of formidable training cost. The increasing model size is one root cause, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework focused on data efficiency capabilities. To this end, we present DeepSpeed Data...
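One common data-efficiency technique of the kind such a framework covers is curriculum learning: present easier samples first and grow the eligible pool on a schedule. A hypothetical batch-sampler sketch follows; it is not the DeepSpeed Data Efficiency API, and the difficulty metric (length) is a stand-in.

```python
# Curriculum-learning batch sampler sketch (hypothetical; not the DeepSpeed
# Data Efficiency API). Early steps draw only from the easiest fraction of
# the data; the eligible pool grows linearly until it covers everything.
import random

def curriculum_batches(samples, difficulty, steps, batch_size,
                       start_frac=0.2):
    ranked = sorted(samples, key=difficulty)   # easy -> hard
    for step in range(steps):
        frac = min(1.0, start_frac + (1 - start_frac) * step / (steps - 1))
        pool = ranked[: max(batch_size, int(frac * len(ranked)))]
        yield step, random.sample(pool, batch_size)

sentences = [f"s{i}" * (i + 1) for i in range(100)]   # toy corpus
for step, batch in curriculum_batches(sentences, difficulty=len,
                                      steps=4, batch_size=3):
    print(f"step {step}: batch = {[s[:6] for s in batch]}")
```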
This paper presents Mercury, a system for real-time support of top-k spatio-temporal queries on microblogs, where users are able to browse recent microblogs near their locations. With high arrival rates of microblogs, Mercury ensures real-time query response within a tight memory-constrained environment. Mercury bounds its search space to include only those microblogs that have arrived within certain spatial and temporal boundaries, within which the top-k microblogs, according to a ranking function, are returned in the query results. Mercury employs: (a) a scalable, dynamic in-memory index structure...
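The heart of such a query is the ranking function, typically a weighted blend of spatial proximity and recency evaluated only inside the spatial/temporal boundary. A minimal top-k sketch with hypothetical weights; it is not Mercury's exact scoring function, and Mercury's contribution is the index that avoids scanning everything.

```python
# Top-k spatio-temporal ranking sketch (hypothetical scoring; Mercury's
# in-memory index prunes the search space before any scoring happens).
import heapq, math

def score(mb, user_xy, now, alpha=0.7, max_dist=10.0, max_age=60.0):
    # Linear blend of spatial proximity and recency, each normalized to
    # [0, 1]; alpha weights space vs. time.
    d = math.dist(mb["xy"], user_xy)
    age = now - mb["t"]
    if d > max_dist or age > max_age:
        return None                      # outside the search boundary
    return alpha * (1 - d / max_dist) + (1 - alpha) * (1 - age / max_age)

def topk(microblogs, user_xy, now, k=3):
    scored = [(s, mb["id"]) for mb in microblogs
              if (s := score(mb, user_xy, now)) is not None]
    return heapq.nlargest(k, scored)

blogs = [{"id": i, "xy": (i % 7, i % 5), "t": 100 - 3 * i} for i in range(30)]
print(topk(blogs, user_xy=(1.0, 1.0), now=100.0))
```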
We propose a SPARQL-like language, G-SPARQL, for querying attributed graphs. The language expresses the types of queries that are of large interest for applications that model their data as graphs, such as pattern matching, reachability, and shortest-path queries. Each query can combine both structural predicates and value-based predicates (on the attributes of graph nodes and edges). We describe an algebraic compilation mechanism for our proposed language, which is extended from the relational algebra and based on the basic construct for building SPARQL queries, the Triple...
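The query classes named here are easy to demonstrate on a small attributed graph: filter nodes by attribute (value-based) predicates, then test reachability along edges satisfying an edge predicate (a structural one). A plain-Python sketch; this is not G-SPARQL syntax nor its relational-algebra compilation.

```python
# Attributed-graph predicate + reachability sketch (plain Python; G-SPARQL
# itself is a SPARQL-like surface language compiled to an extended
# relational algebra). Nodes and edges both carry attribute dicts.
from collections import deque

nodes = {"alice": {"role": "engineer"}, "bob": {"role": "manager"},
         "carol": {"role": "engineer"}, "dave": {"role": "director"}}
edges = [("alice", "bob", {"rel": "reports_to"}),
         ("bob", "dave", {"rel": "reports_to"}),
         ("alice", "carol", {"rel": "knows"})]

def reachable(src, dst, edge_pred):
    # BFS restricted to edges whose attributes satisfy edge_pred.
    seen, frontier = {src}, deque([src])
    while frontier:
        u = frontier.popleft()
        if u == dst:
            return True
        for a, b, attrs in edges:
            if a == u and b not in seen and edge_pred(attrs):
                seen.add(b)
                frontier.append(b)
    return False

# "Which engineers transitively report to a director?" -- a query mixing a
# value-based predicate (role) with a structural one (reachability).
print([n for n, a in nodes.items()
       if a["role"] == "engineer"
       and reachable(n, "dave", lambda e: e["rel"] == "reports_to")])
```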
Recurrent neural networks (RNNs) have gained significant attention due to their effectiveness in modeling sequential data, such as text and voice signals. However, due to complex data dependencies and limited parallelism, current inference libraries for RNNs on GPUs produce either high latency or poor scalability, leading to inefficient resource utilization. Consequently, companies like Microsoft and Facebook use CPUs to serve RNN models.
The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these models become larger and more complex, they also become more challenging to train and deploy. One approach to addressing this challenge is the use of sparsely gated mixture-of-experts (MoE) techniques, which divide the model into smaller,...
Text-to-SQL demands precise reasoning to convert natural language questions into structured queries. While large language models (LLMs) excel in many reasoning tasks, their ability to leverage Chain-of-Thought (CoT) reasoning for text-to-SQL remains underexplored. We identify critical limitations: zero-shot CoT offers minimal gains, and Direct Preference Optimization (DPO) applied without CoT yields marginal improvements. We propose ExCoT, a novel framework that iteratively optimizes open-source LLMs by combining CoT reasoning with off-policy...
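The DPO component referenced here has a closed-form per-pair loss: L = -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]), where y_w is the preferred and y_l the rejected completion. A NumPy sketch of that computation on toy log-probabilities; it illustrates the objective only, not ExCoT's training pipeline.

```python
# Direct Preference Optimization (DPO) loss on toy sequence log-probs
# (illustration of the objective only, not ExCoT's pipeline).
# loss = -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))
import numpy as np

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    margin = (logp_w - ref_w) - (logp_l - ref_l)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# Policy log-probs for the preferred (w: e.g., correct SQL with CoT) and
# rejected (l: incorrect SQL) completions, plus the frozen reference's.
logp_w, logp_l = np.array([-12.0, -9.5]), np.array([-11.0, -14.0])
ref_w, ref_l = np.array([-13.0, -10.0]), np.array([-10.5, -12.0])
print("per-pair DPO loss:", dpo_loss(logp_w, logp_l, ref_w, ref_l).round(3))
```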