- Advanced Neural Network Applications
- Topic Modeling
- Parallel Computing and Optimization Techniques
- Cloud Computing and Resource Management
- Distributed and Parallel Computing Systems
- Caching and Content Delivery
- Natural Language Processing Techniques
- Stochastic Gradient Optimization Techniques
- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Graph Theory and Algorithms
- Advanced Image and Video Retrieval Techniques
- Ferroelectric and Negative Capacitance Devices
- Machine Learning and Data Classification
- Data Management and Algorithms
- Optimization and Search Problems
- IoT and Edge/Fog Computing
- Advanced Graph Neural Networks
- Age of Information Optimization
- Algorithms and Data Compression
- Speech Recognition and Synthesis
- Adversarial Robustness in Machine Learning
- Interconnection Networks and Systems
- Neural Networks and Applications
- Scheduling and Optimization Algorithms
- Microsoft (United States), 2015-2024
- Microsoft Research (United Kingdom), 2011-2024
- Bellevue Hospital Center, 2019-2024
- The Ohio State University, 2023
- Microsoft (Finland), 2023
- Microsoft (Germany), 2023
- Max Planck Institute for Software Systems, 2017
- Google (United States), 2017
- Futures Group (United States), 2016
- Nanyang Technological University, 2008-2010
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelism exhibit fundamental limitations in fitting these models into limited device memory while obtaining computation, communication, and development efficiency. We develop a novel solution, the Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory...
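The core idea (partitioning, rather than replicating, optimizer state across data-parallel ranks) can be shown in a toy sketch. This is a minimal illustration of ZeRO stage-1-style partitioning in plain NumPy, not the DeepSpeed implementation; all names and constants are hypothetical.

```python
# Toy sketch of ZeRO-style optimizer-state partitioning (hypothetical names,
# not the DeepSpeed implementation). Classic data parallelism keeps a full
# copy of the optimizer state on every rank; ZeRO stage 1 gives each rank
# only a 1/N shard of it.
import numpy as np

NUM_RANKS = 4
NUM_PARAMS = 1_000_000

def state_bytes_per_rank(replicated: bool) -> int:
    # Mixed-precision Adam keeps fp32 momentum, variance, and a master copy
    # of the parameters: 3 state tensors x 4 bytes per parameter.
    full_state = 3 * 4 * NUM_PARAMS
    return full_state if replicated else full_state // NUM_RANKS

print("replicated :", state_bytes_per_rank(True), "bytes/rank")
print("ZeRO shard :", state_bytes_per_rank(False), "bytes/rank")

# Each rank updates only its own shard of the parameters; the updated
# shards are then re-assembled (an all-gather in a real system).
params = np.zeros(NUM_PARAMS, dtype=np.float32)
grads = np.random.randn(NUM_PARAMS).astype(np.float32)
shards = np.array_split(np.arange(NUM_PARAMS), NUM_RANKS)
for rank, idx in enumerate(shards):
    params[idx] -= 0.01 * grads[idx]   # rank-local update of its shard
```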
Explore new techniques in Microsoft's open source library called DeepSpeed, which advances large model training by improving scale, speed, cost, and usability, unlocking the ability to train 100-billion-parameter models. DeepSpeed is compatible with PyTorch. One piece of our library, ZeRO, is a parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. Researchers have used these breakthroughs to create Turing...
Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot, and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable the training of such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer-based language model,...
In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while GPU memory has only grown by 5x (16 GB to 80 GB). Therefore, the growth in model scale has been supported primarily through system innovations that allow large models to fit in the aggregate memory of multiple GPUs. However, we are getting close to the GPU memory wall. It requires 800 NVIDIA V100 GPUs just to fit a trillion-parameter model for training, and such clusters are simply out of reach for most data scientists. In addition, training at that scale involves complex...
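The memory arithmetic behind this "wall" is easy to reproduce. A rough sketch, assuming the common mixed-precision Adam accounting of about 16 bytes of model states per parameter; exact figures vary by recipe, and activations and working buffers (not counted here) push the real requirement higher.

```python
# Back-of-the-envelope model-state memory for mixed-precision Adam training:
# fp16 params (2 B) + fp16 grads (2 B) + fp32 master params, momentum, and
# variance (3 x 4 B) = 16 bytes per parameter. Activations, buffers, and
# fragmentation are extra, which is why real clusters need more headroom.
BYTES_PER_PARAM = 2 + 2 + 3 * 4

def gpus_needed(num_params: float, gpu_mem_gb: float) -> float:
    total_gb = num_params * BYTES_PER_PARAM / 2**30
    return -(-total_gb // gpu_mem_gb)  # ceiling division

for params in (1e9, 100e9, 1e12):
    print(f"{params/1e9:>6.0f}B params -> "
          f"{params * BYTES_PER_PARAM / 2**40:.1f} TiB of model states, "
          f">= {gpus_needed(params, 32):.0f} x 32 GB V100s just to hold them")
```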
AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein–ligand complex structure prediction, (2) investigate the process by which the model learns, and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient, and trainable implementation of AlphaFold2. We...
The landscape of transformer model inference is increasingly diverse in model size, model characteristics, latency and throughput requirements, hardware, etc. With such diversity, designing a versatile inference system is challenging. DeepSpeed-Inference addresses these challenges by offering (1) a multi-GPU inference solution to minimize latency while maximizing throughput for both dense and sparse transformers when the model fits in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU/NVMe/GPU memory to enable high-throughput inference for models larger than aggregate GPU memory. It reduces...
Model compression is significant for the wide adoption of Recurrent Neural Networks (RNNs) in both user devices possessing limited resources and business clusters requiring quick responses to large-scale service requests. This work aims to learn structurally sparse Long Short-Term Memory (LSTM) networks by reducing the sizes of basic structures within LSTM units, including input updates, gates, hidden states, cell states, and outputs. Independently reducing the sizes of these basic structures can result in inconsistent dimensions among them and, consequently, end up...
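A minimal sketch of the "structured" part of the idea: score whole hidden units by a group norm pooled across all gate matrices and prune entire units, so dimensions stay consistent across gates. This is a NumPy illustration of group-wise pruning, not the paper's training procedure; all sizes are hypothetical.

```python
# Toy structured pruning of an LSTM-like weight block (illustration only).
# An LSTM layer stacks 4 gate matrices; removing hidden unit j only keeps
# dimensions consistent if the SAME j is removed from all four gates.
import numpy as np

hidden, inputs = 8, 6
W = np.random.randn(4 * hidden, inputs)          # [i; f; g; o] gate weights
gates = W.reshape(4, hidden, inputs)             # (gate, hidden unit, input)

# Group score: one L2 norm per hidden unit, pooled across all 4 gates,
# so a unit is kept or dropped everywhere at once.
group_norm = np.sqrt((gates ** 2).sum(axis=(0, 2)))
keep = group_norm >= np.median(group_norm)       # prune the weakest units

W_pruned = gates[:, keep, :].reshape(-1, inputs)
print("kept units:", keep.sum(), "new shape:", W_pruned.shape)
```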
Big deep neural network (DNN) models trained on large amounts of data have recently achieved the best accuracy on hard tasks, such as image and speech recognition. Training these DNNs using a cluster of commodity machines is a promising approach, since training is time-consuming and compute-intensive. To enable training of extremely large DNNs, models are partitioned across machines. To expedite training on very large data sets, multiple model replicas are trained in parallel on different subsets of the training examples, with a global parameter server maintaining shared weights across the replicas....
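The data-parallel pattern described here is easy to demonstrate in miniature: replicas compute gradients on their own data shards while a parameter server holds the shared weights. A single-process toy sketch with hypothetical names; real systems run replicas across machines, asynchronously.

```python
# Minimal parameter-server pattern, simulated in one process (hypothetical
# sketch, not the paper's system). Replicas pull weights, compute gradients
# on their own data shard, and push updates to the shared server copy.
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.w, self.lr = np.zeros(dim), lr
    def pull(self):
        return self.w.copy()
    def push(self, grad):
        self.w -= self.lr * grad     # apply a replica's gradient

def replica_grad(w, X, y):
    # Gradient of mean squared error for a linear model on one data shard.
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(400, 5)), np.arange(5.0)
y = X @ true_w
ps = ParameterServer(dim=5)
shards = np.array_split(np.arange(400), 4)       # 4 "replicas"

for epoch in range(50):
    for shard in shards:                         # round-robin stands in for
        w = ps.pull()                            # asynchronous replicas
        ps.push(replica_grad(w, X[shard], y[shard]))
print("learned:", ps.w.round(2))
```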
Developers use Machine Learning (ML) platforms to train ML models and then deploy these models as web services for inference (prediction). A key challenge for platform providers is to guarantee response-time Service Level Agreements (SLAs) for inference workloads while maximizing resource efficiency. Swayam is a fully distributed autoscaling framework that exploits characteristics of production ML inference workloads to deliver on the dual challenge of resource efficiency and SLA compliance. Our contributions are (1) a model-based autoscaler that takes into account SLAs and ML inference workload...
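As a rough illustration of model-based autoscaling, the sketch below sizes a server pool from a measured request rate and service-time tail so that a latency SLA is met with headroom. It is a hypothetical simplification (a crude utilization-based tail-inflation model), not Swayam's actual estimator.

```python
# Hypothetical model-based autoscaler sketch (not Swayam's algorithm).
# Choose the smallest number of servers such that the predicted p99
# latency, under a simple utilization-based inflation model, meets the SLA.
import math

def predicted_p99(rate_rps, servers, service_p99_s, capacity_rps):
    util = rate_rps / (servers * capacity_rps)
    if util >= 1.0:
        return math.inf                 # overloaded: queue grows unboundedly
    return service_p99_s / (1.0 - util) # queueing inflates tail latency

def servers_needed(rate_rps, sla_s, service_p99_s=0.05, capacity_rps=40):
    n = max(1, math.ceil(rate_rps / capacity_rps))  # enough raw capacity
    while predicted_p99(rate_rps, n, service_p99_s, capacity_rps) > sla_s:
        n += 1
    return n

for rate in (100, 500, 2000):
    print(rate, "rps ->", servers_needed(rate, sla_s=0.1), "servers")
```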
How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements. In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed ZeroQuant. ZeroQuant is an end-to-end quantization and inference pipeline with three main components: (1) a fine-grained, hardware-friendly quantization scheme for both weights and activations;...
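The flavor of a fine-grained scheme can be shown in a few lines: instead of one scale for a whole tensor, each small group of weights gets its own scale, so an outlier in one group does not destroy precision everywhere else. A NumPy sketch of symmetric per-group int8 quantization, illustrative only and not ZeroQuant's exact recipe.

```python
# Symmetric per-group int8 quantization sketch (illustration, not the exact
# ZeroQuant scheme). Finer groups track local weight ranges, which is what
# keeps low-bit quantization both hardware-friendly and accurate.
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int):
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(groups / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
w[0, 0] = 1.0                                   # a single outlier weight

for gs in (w.size, 64):                         # per-tensor vs. per-group
    q, s = quantize_groupwise(w, gs)
    err = np.abs(dequantize(q, s, w.shape) - w).mean()
    print(f"group size {gs:>6}: mean abs error {err:.2e}")
```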
The Cilkview scalability analyzer is a software tool for profiling, estimating the scalability of, and benchmarking multithreaded Cilk++ applications. Cilkview monitors logical parallelism during an instrumented execution of the application on a single processing core. As the application executes, Cilkview analyzes the dependencies within the computation to determine its work and span (critical-path length). These metrics allow Cilkview to estimate parallelism and predict how the application will scale with the number of cores. In addition, it analyzes scheduling overhead using the concept of a "burdened dag,"...
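The work/span metrics reported here obey simple laws: with work T1 and span T∞, any greedy schedule on P cores satisfies max(T1/P, T∞) ≤ T_P ≤ T1/P + T∞, so the parallelism T1/T∞ bounds achievable speedup. A small Python sketch computing these quantities for a toy task DAG, illustrating the metrics rather than Cilkview's instrumentation.

```python
# Work/span analysis of a task DAG (illustration of the metrics Cilkview
# estimates, not its instrumentation). work = total node cost; span = cost
# of the longest (critical) path; parallelism = work / span.
from functools import lru_cache

cost = {"a": 1, "b": 4, "c": 3, "d": 2, "e": 1}
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["d"]}

work = sum(cost.values())

@lru_cache(maxsize=None)
def finish(node):            # longest-path (critical-path) completion time
    return cost[node] + max((finish(d) for d in deps[node]), default=0)

span = max(finish(n) for n in cost)
print(f"work={work} span={span} parallelism={work/span:.2f}")

# Work and span laws: speedup on P cores is at most min(P, work/span).
for p in (1, 2, 4, 8):
    print(f"P={p}: speedup <= {min(p, work/span):.2f}")
```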
Decreasing the soaring energy cost is imperative in large data centers. Meanwhile, limited computational resources need to be fairly allocated among different organizations. Latency is another major concern for resource management. Nevertheless, energy cost, allocation fairness, and latency are important but often contradicting metrics in scheduling data center workloads. In this paper, we explore the benefit of electricity price variations across time and locations. We study the problem of scheduling batch jobs, which originate...
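A toy version of the price-aware placement decision: given per-location, per-hour electricity prices and a job's energy demand, a greedy scheduler sends each batch job to the cheapest feasible (location, hour) slot before its deadline. This is a hypothetical sketch; the paper's formulation also balances fairness and latency, which this greedy rule ignores.

```python
# Greedy electricity-price-aware batch-job placement (toy sketch; the real
# problem also trades off fairness and latency). price[loc][hour] in $/kWh.
price = {
    "us-west": [0.10, 0.08, 0.12, 0.15],
    "us-east": [0.14, 0.11, 0.09, 0.10],
}
capacity = {(loc, h): 2 for loc in price for h in range(4)}  # jobs per slot

def place(job_kwh, deadline_hour):
    # Consider only slots that run before the job's deadline.
    slots = [(price[loc][h] * job_kwh, loc, h)
             for loc in price for h in range(deadline_hour)
             if capacity[(loc, h)] > 0]
    cost, loc, h = min(slots)
    capacity[(loc, h)] -= 1
    return loc, h, cost

for job, (kwh, dl) in enumerate([(50, 4), (50, 4), (50, 2), (80, 3)]):
    loc, h, cost = place(kwh, dl)
    print(f"job {job}: {loc} hour {h}, ${cost:.2f}")
```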
Web search engines are optimized to reduce the high-percentile response time so as to consistently provide fast responses to almost all user queries. This is a challenging task because the query workload exhibits large variability, consisting of many short-running queries and a few long-running queries that significantly impact the response time. With modern multicore servers, parallelizing the processing of an individual query is a promising solution to reduce execution time, but it gives limited benefits compared to sequential execution since most queries see little or no...
Interactive services, such as Web search, recommendations, games, and finance, must respond quickly to satisfy customers. Achieving this goal requires optimizing tail (e.g., 99th+ percentile) latency. Although every server is multicore, parallelizing individual requests to reduce tail latency is challenging because (1) service demand is unknown when requests arrive; (2) blindly parallelizing all requests oversubscribes hardware resources; and (3) parallelizing the numerous short requests will not improve tail latency. This paper introduces Few-to-Many (FM) incremental...
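The FM idea, running a request sequentially at first and adding threads only if it turns out to be long, can be sketched as a schedule that maps elapsed-time thresholds to degrees of parallelism. A simplified simulation follows, with hypothetical thresholds and efficiency; the paper's contribution is computing such schedules from the demand distribution.

```python
# Few-to-Many-style incremental parallelism, simulated (toy sketch; the
# paper derives the threshold schedule from measured demand profiles).
# A request starts with 1 thread; as it ages past each threshold it is
# granted more threads, so only the rare long requests consume extra cores.
def completion_time(demand, schedule=((0.0, 1), (10.0, 2), (30.0, 4)),
                    efficiency=0.9):
    done = 0.0
    for i, (start, threads) in enumerate(schedule):
        end = schedule[i + 1][0] if i + 1 < len(schedule) else float("inf")
        # Parallel speedup is sublinear: each phase processes work at
        # rate threads * efficiency**(threads - 1).
        rate = threads * efficiency ** (threads - 1)
        if demand - done <= (end - start) * rate:
            return start + (demand - done) / rate
        done += (end - start) * rate

for demand in (5, 20, 120):          # short, medium, long requests (ms)
    print(f"demand {demand:>3} ms -> finishes at "
          f"{completion_time(demand):.1f} ms")
```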
Large-scale model training has been a playing ground for a limited few, requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large-model-training landscape by making it accessible to nearly everyone. It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular frameworks such as PyTorch, and it does so without requiring any model change from data scientists or sacrificing computational efficiency. ZeRO-Offload enables large model training by offloading data and compute to the CPU. To...
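The division of labor in such offloading designs is: the GPU holds fp16 weights and runs forward/backward, while the CPU holds the fp32 Adam states and performs the optimizer step on gradients streamed over. A single-process NumPy sketch of that CPU step; illustrative only, since the real ZeRO-Offload also overlaps PCIe transfers with GPU compute.

```python
# Sketch of a CPU-side mixed-precision Adam step as used by offloading
# designs (illustration only, not the ZeRO-Offload implementation). fp16
# arrays stand in for GPU tensors; fp32 arrays for CPU-resident states.
import numpy as np

def cpu_adam_step(state, grad_fp16, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    g = grad_fp16.astype(np.float32)          # "transfer" grad to CPU fp32
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g
    state["v"] = b2 * state["v"] + (1 - b2) * g * g
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    state["w"] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return state["w"].astype(np.float16)      # "transfer" weights back

n = 1000
state = {"w": np.ones(n, np.float32), "m": np.zeros(n, np.float32),
         "v": np.zeros(n, np.float32), "t": 0}
w_gpu = state["w"].astype(np.float16)
for _ in range(3):
    grad_gpu = np.random.randn(n).astype(np.float16)  # from backward pass
    w_gpu = cpu_adam_step(state, grad_gpu)
print("fp16 weights on 'GPU':", w_gpu[:4])
```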
As the training of giant dense models hits the boundary of the availability and capability of hardware resources today, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. Their cost saving has been demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work, along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE inference remains challenging...
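The defining computation in MoE models is sparse gating: a router scores experts per token and only the top-scoring experts run, so per-token compute stays near-constant as expert count grows. A NumPy sketch of top-1 gating follows; it is illustrative, not DeepSpeed-MoE's kernels or load-balancing logic.

```python
# Top-1 mixture-of-experts gating sketch in NumPy (illustration only).
# Each token is routed to its single best-scoring expert, so adding
# experts grows capacity without growing per-token FLOPs.
import numpy as np

rng = np.random.default_rng(0)
tokens, d_model, n_experts = 8, 16, 4
x = rng.normal(size=(tokens, d_model))
W_gate = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

logits = x @ W_gate
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
choice = probs.argmax(axis=1)                  # top-1 expert per token

y = np.zeros_like(x)
for e in range(n_experts):
    idx = np.where(choice == e)[0]
    if idx.size:                               # run expert e on its tokens,
        y[idx] = probs[idx, e, None] * (x[idx] @ experts[e])  # gate-scaled
print("tokens per expert:", np.bincount(choice, minlength=n_experts))
```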
Recent advances in deep learning models come at the price of formidable training cost. The increasing model size is one root cause, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework focused on data efficiency capabilities. To this end, we present DeepSpeed Data...
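One common data-efficiency technique of the kind such a framework covers is curriculum learning: present easier samples first and grow the eligible pool on a schedule. A hypothetical batch-sampler sketch follows; it is not the DeepSpeed Data Efficiency API, and the difficulty metric (length) is a stand-in.

```python
# Curriculum-learning batch sampler sketch (hypothetical; not the DeepSpeed
# Data Efficiency API). Early steps draw only from the easiest fraction of
# the data; the eligible pool grows linearly until it covers everything.
import random

def curriculum_batches(samples, difficulty, steps, batch_size,
                       start_frac=0.2):
    ranked = sorted(samples, key=difficulty)   # easy -> hard
    for step in range(steps):
        frac = min(1.0, start_frac + (1 - start_frac) * step / (steps - 1))
        pool = ranked[: max(batch_size, int(frac * len(ranked)))]
        yield step, random.sample(pool, batch_size)

sentences = [f"s{i}" * (i + 1) for i in range(100)]   # toy corpus
for step, batch in curriculum_batches(sentences, difficulty=len,
                                      steps=4, batch_size=3):
    print(f"step {step}: batch = {[s[:6] for s in batch]}")
```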
This paper presents Mercury, a system for real-time support of top-k spatio-temporal queries on microblogs, where users are able to browse recent microblogs near their locations. With high arrival rates of microblogs, Mercury ensures real-time query response within a tight memory-constrained environment. Mercury bounds its search space to include only those microblogs that have arrived within certain spatial and temporal boundaries, within which the top-k microblogs, according to a ranking function, are returned in the query results. Mercury employs: (a) a scalable, dynamic in-memory index structure...
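The heart of such a query is the ranking function, typically a weighted blend of spatial proximity and recency evaluated only inside the spatial/temporal boundary. A minimal top-k sketch with hypothetical weights; it is not Mercury's exact scoring function, and Mercury's contribution is the index that avoids scanning everything.

```python
# Top-k spatio-temporal ranking sketch (hypothetical scoring; Mercury's
# in-memory index prunes the search space before any scoring happens).
import heapq, math

def score(mb, user_xy, now, alpha=0.7, max_dist=10.0, max_age=60.0):
    # Linear blend of spatial proximity and recency, each normalized to
    # [0, 1]; alpha weights space vs. time.
    d = math.dist(mb["xy"], user_xy)
    age = now - mb["t"]
    if d > max_dist or age > max_age:
        return None                      # outside the search boundary
    return alpha * (1 - d / max_dist) + (1 - alpha) * (1 - age / max_age)

def topk(microblogs, user_xy, now, k=3):
    scored = [(s, mb["id"]) for mb in microblogs
              if (s := score(mb, user_xy, now)) is not None]
    return heapq.nlargest(k, scored)

blogs = [{"id": i, "xy": (i % 7, i % 5), "t": 100 - 3 * i} for i in range(30)]
print(topk(blogs, user_xy=(1.0, 1.0), now=100.0))
```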
We propose a SPARQL-like language, G-SPARQL, for querying attributed graphs. The language expresses the types of queries that are of large interest for applications that model their data as graphs, such as pattern matching, reachability, and shortest-path queries. Each query can combine both structural predicates and value-based predicates (on the attributes of graph nodes and edges). We describe an algebraic compilation mechanism for our proposed language, which is extended from the relational algebra and based on the basic construct for building SPARQL queries, the Triple...
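The query classes named here are easy to demonstrate on a small attributed graph: filter nodes by attribute (value-based) predicates, then test reachability along edges satisfying an edge predicate (a structural one). A plain-Python sketch; this is not G-SPARQL syntax nor its relational-algebra compilation.

```python
# Attributed-graph predicate + reachability sketch (plain Python; G-SPARQL
# itself is a SPARQL-like surface language compiled to an extended
# relational algebra). Nodes and edges both carry attribute dicts.
from collections import deque

nodes = {"alice": {"role": "engineer"}, "bob": {"role": "manager"},
         "carol": {"role": "engineer"}, "dave": {"role": "director"}}
edges = [("alice", "bob", {"rel": "reports_to"}),
         ("bob", "dave", {"rel": "reports_to"}),
         ("alice", "carol", {"rel": "knows"})]

def reachable(src, dst, edge_pred):
    # BFS restricted to edges whose attributes satisfy edge_pred.
    seen, frontier = {src}, deque([src])
    while frontier:
        u = frontier.popleft()
        if u == dst:
            return True
        for a, b, attrs in edges:
            if a == u and b not in seen and edge_pred(attrs):
                seen.add(b)
                frontier.append(b)
    return False

# "Which engineers transitively report to a director?" -- a query mixing a
# value-based predicate (role) with a structural one (reachability).
print([n for n, a in nodes.items()
       if a["role"] == "engineer"
       and reachable(n, "dave", lambda e: e["rel"] == "reports_to")])
```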
Recurrent neural networks (RNNs) have gained significant attention due to their effectiveness in modeling sequential data, such as text and voice signals. However, due to complex data dependencies and limited parallelism, current inference libraries for RNNs on GPUs produce either high latency or poor scalability, leading to inefficient resource utilization. Consequently, companies like Microsoft and Facebook use CPUs to serve RNN models.
The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these models become larger and more complex, they also become more challenging to train and deploy. One approach to addressing this challenge is the use of sparsely gated mixture-of-experts (MoE) techniques, which divide the model into smaller,...
Text-to-SQL demands precise reasoning to convert natural language questions into structured queries. While large language models (LLMs) excel in many reasoning tasks, their ability to leverage Chain-of-Thought (CoT) reasoning for text-to-SQL remains underexplored. We identify critical limitations: zero-shot CoT offers minimal gains, and Direct Preference Optimization (DPO) applied without CoT yields marginal improvements. We propose ExCoT, a novel framework that iteratively optimizes open-source LLMs by combining CoT reasoning with off-policy...
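The DPO component referenced here has a closed-form per-pair loss: L = -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]), where y_w is the preferred and y_l the rejected completion. A NumPy sketch of that computation on toy log-probabilities; it illustrates the objective only, not ExCoT's training pipeline.

```python
# Direct Preference Optimization (DPO) loss on toy sequence log-probs
# (illustration of the objective only, not ExCoT's pipeline).
# loss = -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))
import numpy as np

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    margin = (logp_w - ref_w) - (logp_l - ref_l)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# Policy log-probs for the preferred (w: e.g., correct SQL with CoT) and
# rejected (l: incorrect SQL) completions, plus the frozen reference's.
logp_w, logp_l = np.array([-12.0, -9.5]), np.array([-11.0, -14.0])
ref_w, ref_l = np.array([-13.0, -10.0]), np.array([-10.5, -12.0])
print("per-pair DPO loss:", dpo_loss(logp_w, logp_l, ref_w, ref_l).round(3))
```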