- Distributed Systems and Fault Tolerance
- Stochastic Gradient Optimization Techniques
- Optimization and Search Problems
- Parallel Computing and Optimization Techniques
- Advanced Neural Network Applications
- Privacy-Preserving Technologies in Data
- Topic Modeling
- Sparse and Compressive Sensing Techniques
- Domain Adaptation and Few-Shot Learning
- Cryptography and Data Security
- Advanced Data Storage Technologies
- Complexity and Algorithms in Graphs
- Natural Language Processing Techniques
- Machine Learning and Algorithms
- Adversarial Robustness in Machine Learning
- Cloud Computing and Resource Management
- Speech Recognition and Synthesis
- Ferroelectric and Negative Capacitance Devices
- Graph Theory and Algorithms
- Machine Learning and ELM
- Interconnection Networks and Systems
- Advanced Image and Video Retrieval Techniques
- Age of Information Optimization
- Algorithms and Data Compression
- Real-Time Systems Scheduling
- Institute of Science and Technology Austria, 2016-2024
- Vassar College, 2017
- ETH Zurich, 2013-2017
- Computer Algorithms for Medicine, 2017
- Microsoft Research (United Kingdom), 2014-2016
- Microsoft (United States), 2014-2015
- Moscow Institute of Thermal Technology, 2014
- Massachusetts Institute of Technology, 2013
- École Polytechnique Fédérale de Lausanne, 2008-2012
- Laboratoire d'Informatique Fondamentale de Lille, 2012
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to the excellent scalability properties of this algorithm and its efficiency in the context of training deep neural networks. A fundamental barrier to parallelizing large-scale SGD is the fact that the cost of communicating gradient updates between nodes can be very large. Consequently, lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice,...
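The quantized-gradient idea described here can be sketched with a small, unbiased stochastic quantizer. The function names, the uniform-level scheme, and the 256-level default below are illustrative assumptions for a NumPy setting, not the paper's exact encoding:

```python
import numpy as np

def quantize_stochastic(v, levels=256, rng=None):
    """Keep the vector norm and signs; round each normalized magnitude to one of
    `levels` uniform levels, up or down with probability chosen so that the
    reconstruction is unbiased in expectation."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v), norm
    scaled = np.abs(v) / norm * (levels - 1)
    lower = np.floor(scaled)
    q = lower + (rng.random(v.shape) < (scaled - lower))  # stochastic rounding
    return np.sign(v) * q, norm  # transmit small integers, signs, and one float norm

def dequantize(q, norm, levels=256):
    return q * norm / (levels - 1)
```

Because the rounding is unbiased, the dequantized gradient equals the original in expectation, which is what lets SGD tolerate the extra quantization variance.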
Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation and reinforcement learning. One aspect of the field receiving considerable attention is efficiently executing deep models in resource-constrained environments, such as mobile or embedded devices. This paper focuses on this problem and proposes two new compression methods, which jointly leverage weight quantization and the distillation of larger teacher networks into smaller student networks. The...
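A minimal sketch of the two ingredients named above, assuming a PyTorch setting: a distillation loss that blends the teacher's softened outputs with the hard labels, and a uniform weight quantizer applied with a straight-through estimator during training. The temperature, bit-width, and blending weight are illustrative defaults rather than the paper's settings:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.7):
    """Blend soft-target KL divergence (teacher) with hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard

def quantize_weights(w, bits=4):
    """Uniform symmetric quantization of a weight tensor; the straight-through
    trick uses the quantized values in the forward pass while letting gradients
    flow to the full-precision weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (q - w).detach()
```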
The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well as, if not better than, the original dense networks. Sparsity can reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever-growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial on sparsification for both inference and training. We describe...
Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, sparsification methods - where each node sorts its gradients by magnitude and only communicates a subset of the components, accumulating the rest locally - are known to yield some of the largest practical gains. Such methods can...
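The sort-by-magnitude-and-accumulate scheme described above can be written down directly; the class below is a NumPy sketch for a single worker's flat gradient vector, with names chosen for illustration:

```python
import numpy as np

class TopKCompressor:
    """Send only the k largest-magnitude gradient entries; keep the rest as a
    local residual that is added back into the next step's gradient."""
    def __init__(self, k):
        self.k = k
        self.residual = None

    def compress(self, grad):
        if self.residual is None:
            self.residual = np.zeros_like(grad)
        acc = grad + self.residual                        # error feedback
        idx = np.argpartition(np.abs(acc), -self.k)[-self.k:]
        self.residual = acc.copy()
        self.residual[idx] = 0.0                          # what was sent is no longer owed
        return idx, acc[idx]                              # sparse (index, value) payload
```

Every node sends its (index, value) pairs, the aggregator sums them into a dense update, and the untransmitted gradient mass is not lost but merely delayed by the residual.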
Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability of existing...
We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. We can execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and can reach 60% unstructured sparsity with negligible increase...
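For contrast, the simplest form of one-shot pruning is per-layer magnitude pruning, sketched below in NumPy with hypothetical names; SparseGPT itself selects and updates weights using second-order information, so this is only a baseline illustration of the one-shot setting:

```python
import numpy as np

def oneshot_magnitude_prune(layer_weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights in each layer,
    in a single pass and without any retraining."""
    pruned = {}
    for name, w in layer_weights.items():
        k = int(w.size * sparsity)
        if k == 0:
            pruned[name] = w.copy()
            continue
        threshold = np.partition(np.abs(w), k - 1, axis=None)[k - 1]
        pruned[name] = np.where(np.abs(w) > threshold, w, 0.0)
    return pruned
```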
This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine and can behave arbitrarily adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\varepsilon$-approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional mini-batch SGD needs...
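As a concrete (if much simpler) illustration of aggregation that tolerates Byzantine workers, the step below replaces the usual gradient average with a coordinate-wise median. This is not the paper's algorithm, whose aggregation rule and analysis are considerably more refined, and the function names are hypothetical:

```python
import numpy as np

def robust_aggregate(worker_grads):
    """Coordinate-wise median of the m reported gradients: a minority of
    arbitrarily corrupted vectors cannot drag any coordinate arbitrarily far."""
    return np.median(np.stack(worker_grads), axis=0)

def sgd_step(x, worker_grads, lr=0.1):
    # worker_grads: list of m gradient vectors, an alpha-fraction possibly Byzantine
    return x - lr * robust_aggregate(worker_grads)
```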
Applying machine learning techniques to the quickly growing data in science and industry requires highly-scalable algorithms. Large datasets are most commonly processed "data parallel", distributed across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication and thus scalability bottleneck for most machine learning workloads. We observe that frequently, many gradient values are (close to) zero, leading to sparse or sparsifyable communications. To exploit this insight,...
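The core operation is a reduction over sparse contributions. The sketch below models it sequentially with index-to-value dictionaries; an actual MPI-style implementation merges index/value lists pairwise and switches to a dense representation once the merged result stops being sparse. Names are illustrative:

```python
def sparse_allreduce(contributions):
    """Sum per-node sparse gradients, each given as an {index: value} dict."""
    total = {}
    for contrib in contributions:
        for idx, val in contrib.items():
            total[idx] = total.get(idx, 0.0) + val
    return total  # every node would receive this merged sparse result

# Example: three nodes contribute mostly disjoint nonzeros.
print(sparse_allreduce([{0: 0.5, 7: -1.0}, {7: 0.25}, {3: 2.0}]))
```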
High-performance concurrent priority queues are essential for applications such as task scheduling and discrete event simulation. Unfortunately, even the best performing implementations do not scale past a number of threads in the single digits. This is because of the sequential bottleneck in accessing the elements at the head of the queue in order to perform a DeleteMin operation. In this paper, we present the SprayList, a scalable priority queue with relaxed ordering semantics. Starting from a non-blocking SkipList, the main innovation behind our...
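The relaxed semantics can be modelled sequentially: instead of always removing the exact minimum, DeleteMin removes a uniformly random element among roughly the first O(p log p) keys, which is what a random "spray" walk over the SkipList achieves without a head bottleneck. The class below sketches those semantics on a sorted Python list; it is not a concurrent implementation, and the window size is an illustrative choice:

```python
import bisect
import math
import random

class RelaxedPriorityQueue:
    def __init__(self, num_threads=8):
        self.keys = []              # kept sorted; stands in for the SkipList
        self.p = num_threads

    def insert(self, key):
        bisect.insort(self.keys, key)

    def delete_min(self):
        """Remove a random key among the first ~p*log(p) smallest ones."""
        window = min(len(self.keys), max(1, int(self.p * math.log2(self.p + 1))))
        return self.keys.pop(random.randrange(window))
```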
Population protocols, roughly defined as systems consisting of large numbers of simple, identical agents, interacting at random and updating their state following simple rules, are an important research topic at the intersection of distributed computing and biology. One of the fundamental tasks that a population protocol may solve is majority: each node starts in one of two states; the goal is for all nodes to reach a correct consensus on which of the two states was initially the majority. Despite considerable effort, known protocols for this...
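To make the model concrete, the simulation below runs the classic three-state approximate-majority protocol (opinions A and B plus a blank state) under uniformly random pairwise interactions. It illustrates how such protocols are specified and converges to the initial majority with high probability when the initial gap is large; it is not one of the exact-majority protocols this line of work studies:

```python
import random

def approximate_majority(n_a, n_b, seed=0):
    """Simulate random pairwise interactions until all agents agree."""
    rng = random.Random(seed)
    pop = ["A"] * n_a + ["B"] * n_b
    n = len(pop)
    while len(set(pop)) > 1:
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            continue
        a, b = pop[i], pop[j]
        if {a, b} == {"A", "B"}:
            pop[j] = "_"          # conflicting opinions: the responder goes blank
        elif a in ("A", "B") and b == "_":
            pop[j] = a            # a blank responder adopts the initiator's opinion
    return pop[0]                 # "A" or "B"

print(approximate_majority(600, 400))   # prints "A" with high probability
```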
Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides similar performance as a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme called stochastic quantization, specifically...
Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, Dan Alistarh. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.
Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especially for smaller models in the 1-10B range, which are well-suited for edge deployments. To address this issue, we introduce the Sparse-Quantized...
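The name hints at the representation: keep a small fraction of large-magnitude outlier weights in full precision as a sparse component, and quantize everything else to a few bits. The sketch below shows that decomposition for a single weight matrix; the group-wise scales, bit packing, and outlier detection of the actual format are omitted, and all names are illustrative:

```python
import numpy as np

def sparse_quantize(w, bits=3, outlier_frac=0.01):
    """Split w into a few-bit quantized dense part plus full-precision outliers."""
    k = max(1, int(w.size * outlier_frac))
    cutoff = np.partition(np.abs(w), w.size - k, axis=None)[w.size - k]
    outlier_mask = np.abs(w) >= cutoff
    dense = np.where(outlier_mask, 0.0, w)
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(dense).max(), 1e-8) / qmax
    q = np.clip(np.round(dense / scale), -qmax, qmax).astype(np.int8)
    outliers = {tuple(i): w[tuple(i)] for i in np.argwhere(outlier_mask)}
    return q, scale, outliers

def reconstruct(q, scale, outliers):
    w = q.astype(np.float32) * scale
    for idx, val in outliers.items():
        w[idx] = val
    return w
```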
Dynamic memory reclamation is arguably the biggest open problem in concurrent data structure design: all known solutions induce high overhead, or must be customized to the specific data structure by the programmer, or both. This paper presents StackTrack, the first scheme that can be applied automatically by a compiler, while maintaining efficiency. StackTrack eliminates most of the expensive bookkeeping required for reclamation by leveraging the power of hardware transactional memory (HTM) in a new way: it tracks thread variables dynamically, and in an atomic...
Population protocols are a popular model of distributed computing, in which randomly-interacting agents with little computational power cooperate to jointly perform tasks. Inspired by developments in molecular computation, and in particular DNA computing, recent algorithmic work has focused on the complexity of solving simple yet fundamental tasks in the population model, such as leader election (which requires convergence to a single agent in a special state) and majority (in which agents must converge to a decision as to which of two possible initial states had...
Population protocols are a popular model of distributed computing, introduced by Angluin, Aspnes, Diamadi, Fischer, and Peralta [6] a little over a decade ago. In the meantime, the model has proved a useful abstraction for modeling various settings, from wireless sensor networks [35, 26], to gene regulatory networks [17], and chemical reaction networks [21]. In a nutshell, a population protocol consists of n agents with limited local state that interact randomly in pairs, according to an underlying communication graph, and cooperate to collectively...
Population protocols are a popular model of distributed computing, in which randomly-interacting agents with little computational power cooperate to jointly perform tasks. Inspired by developments in molecular computation, and in particular DNA computing, recent algorithmic work has focused on the complexity of solving simple yet fundamental tasks in the population model, such as leader election (which requires convergence to a single agent in a special "leader" state) and majority (in which agents must converge to a decision as to which of two possible initial...
Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems. Recently, there has been significant interest in utilizing this information in the context of deep neural networks; however, relatively little is known about the quality of existing approximations in this context. Our work examines this question, identifies issues with existing approaches, and proposes a method called WoodFisher to compute a faithful and efficient estimate of the inverse Hessian. Our main...
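The estimator at the heart of this approach applies the Woodbury (rank-one Sherman-Morrison) identity once per gradient sample to maintain the inverse of a damped empirical Fisher matrix. The dense sketch below shows the recursion for a small parameter count; scaling it to real networks requires the block-wise structure the paper develops, and the damping value is an illustrative placeholder:

```python
import numpy as np

def inverse_empirical_fisher(grads, damping=1e-4):
    """Return the inverse of F = damping*I + (1/N) * sum_i g_i g_i^T,
    built by N successive rank-one Woodbury updates."""
    d = grads[0].shape[0]
    N = len(grads)
    F_inv = np.eye(d) / damping
    for g in grads:
        Fg = F_inv @ g
        F_inv = F_inv - np.outer(Fg, Fg) / (N + g @ Fg)
    return F_inv
```

Such an inverse-Hessian estimate is the ingredient that second-order pruning criteria in the Optimal Brain Surgeon tradition rank weights with, e.g. via the statistic w_i^2 / [F^{-1}]_{ii}.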