Dan Alistarh

ORCID: 0000-0003-3650-940X
Research Areas
  • Distributed systems and fault tolerance
  • Stochastic Gradient Optimization Techniques
  • Optimization and Search Problems
  • Parallel Computing and Optimization Techniques
  • Advanced Neural Network Applications
  • Privacy-Preserving Technologies in Data
  • Topic Modeling
  • Sparse and Compressive Sensing Techniques
  • Domain Adaptation and Few-Shot Learning
  • Cryptography and Data Security
  • Advanced Data Storage Technologies
  • Complexity and Algorithms in Graphs
  • Natural Language Processing Techniques
  • Machine Learning and Algorithms
  • Adversarial Robustness in Machine Learning
  • Cloud Computing and Resource Management
  • Speech Recognition and Synthesis
  • Ferroelectric and Negative Capacitance Devices
  • Graph Theory and Algorithms
  • Machine Learning and ELM
  • Interconnection Networks and Systems
  • Advanced Image and Video Retrieval Techniques
  • Age of Information Optimization
  • Algorithms and Data Compression
  • Real-Time Systems Scheduling

Institute of Science and Technology Austria
2016-2024

Vassar College
2017

ETH Zurich
2013-2017

Computer Algorithms for Medicine
2017

Microsoft Research (United Kingdom)
2014-2016

Microsoft (United States)
2014-2015

Moscow Institute of Thermal Technology
2014

Massachusetts Institute of Technology
2013

École Polytechnique Fédérale de Lausanne
2008-2012

Laboratoire d'Informatique Fondamentale de Lille
2012

Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to the excellent scalability properties of this algorithm, and its efficiency in the context of training deep neural networks. A fundamental barrier to parallelizing large-scale SGD is the fact that the cost of communicating gradient updates between nodes can be very large. Consequently, lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice,...

10.48550/arxiv.1610.02132 preprint EN other-oa arXiv (Cornell University) 2016-01-01
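
The lossy gradient compression described above can be made concrete with a small, hedged sketch: the snippet below implements a generic unbiased stochastic quantizer in numpy, in the spirit of schemes like QSGD but not the paper's exact encoding or variance analysis; the function name and the `levels` parameter are illustrative choices.

```python
import numpy as np

def quantize_gradient(g, levels=4, seed=None):
    """Stochastically quantize a gradient vector to a few levels.

    Each coordinate is scaled by the gradient norm, mapped onto `levels`
    uniform buckets, and rounded up or down at random so that the
    quantized vector is an unbiased estimate of g.
    """
    rng = np.random.default_rng(seed)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)
    scaled = np.abs(g) / norm * levels            # values in [0, levels]
    lower = np.floor(scaled)
    round_up = rng.random(g.shape) < (scaled - lower)
    return np.sign(g) * (lower + round_up) * norm / levels  # E[q] = g

# Example: average quantized gradients from several workers.
workers = [np.random.randn(10) for _ in range(4)]
avg = np.mean([quantize_gradient(g) for g in workers], axis=0)
```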

Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning. One aspect of the field receiving considerable attention is efficiently executing deep models in resource-constrained environments, such as mobile or embedded devices. This paper focuses on this problem, and proposes two new compression methods, which jointly leverage weight quantization and distillation of larger teacher networks into smaller student networks. The...

10.48550/arxiv.1802.05668 preprint EN other-oa arXiv (Cornell University) 2018-01-01
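
As a rough illustration of combining quantization with distillation, the hedged PyTorch sketch below quantizes student weights with a straight-through estimator and trains against softened teacher outputs; both helper functions are simplified stand-ins, not the paper's exact "quantized distillation" or "differentiable quantization" procedures.

```python
import torch
import torch.nn.functional as F

def uniform_quantize(w, bits=4):
    """Uniformly quantize a weight tensor, keeping gradients via straight-through."""
    scale = w.abs().max().clamp(min=1e-8)
    levels = 2 ** (bits - 1) - 1
    q = torch.round(w / scale * levels) / levels * scale
    return w + (q - w).detach()       # forward uses q, backward sees identity

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend softened teacher targets with the hard-label cross-entropy loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```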

The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well, if not better than, the original dense networks. Sparsity can reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever-growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe...

10.48550/arxiv.2102.00554 preprint EN other-oa arXiv (Cornell University) 2021-01-01
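
One of the simplest sparsification techniques covered by such surveys is magnitude pruning; the minimal PyTorch sketch below zeroes out the smallest-magnitude weights of a layer and keeps the resulting mask (function name and parameters are illustrative, not from the paper).

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a boolean mask that keeps only the largest-magnitude weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() > threshold

# Prune a layer to 90% sparsity; the mask can be reused during fine-tuning.
layer = torch.nn.Linear(512, 512)
mask = magnitude_prune(layer.weight.data, sparsity=0.9)
layer.weight.data *= mask
```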

Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, sparsification methods - where each node sorts gradients by magnitude, and only communicates a subset of the components, accumulating the rest locally - are known to yield some of the largest practical gains. Such methods can...

10.48550/arxiv.1809.10505 preprint EN other-oa arXiv (Cornell University) 2018-01-01
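
The top-k-with-local-accumulation scheme described above can be sketched in a few lines of numpy; this is a hedged illustration of the mechanism analyzed in the paper (class and parameter names are mine), not its convergence argument.

```python
import numpy as np

class TopKCompressor:
    """Send only the k largest-magnitude gradient entries; accumulate the
    rest locally and add it back before the next step (error feedback)."""

    def __init__(self, dim, k):
        self.k = k
        self.residual = np.zeros(dim)

    def compress(self, grad):
        corrected = grad + self.residual
        idx = np.argpartition(np.abs(corrected), -self.k)[-self.k:]
        sparse = np.zeros_like(corrected)
        sparse[idx] = corrected[idx]
        self.residual = corrected - sparse   # remember what was not sent
        return idx, corrected[idx]           # indices + values to communicate
```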

Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability of existing...

10.48550/arxiv.2210.17323 preprint EN cc-by arXiv (Cornell University) 2022-01-01
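
For context, the hedged sketch below shows only the plain round-to-nearest, group-wise weight quantization baseline that one-shot methods like the one above improve on; the paper's own method additionally uses approximate second-order information to correct rounding errors, which is not reproduced here.

```python
import torch

@torch.no_grad()
def rtn_quantize_linear(weight: torch.Tensor, bits=4, group_size=128):
    """Round-to-nearest quantization of a Linear weight, one scale per
    group of `group_size` input columns (symmetric, for brevity)."""
    q = weight.clone()
    levels = 2 ** (bits - 1) - 1
    for start in range(0, weight.shape[1], group_size):
        block = q[:, start:start + group_size]
        scale = block.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / levels
        q[:, start:start + group_size] = torch.round(block / scale) * scale
    return q
```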

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. We execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and reach 60% unstructured sparsity with negligible increase...

10.48550/arxiv.2301.00774 preprint EN cc-by arXiv (Cornell University) 2023-01-01
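
To make "one-shot pruning without retraining" concrete, the sketch below walks a model's linear layers and prunes each to a target sparsity in a single pass; note that it uses plain weight magnitude as a stand-in saliency score, whereas SparseGPT itself solves a layer-wise reconstruction problem with approximate second-order information.

```python
import torch

@torch.no_grad()
def one_shot_prune(model: torch.nn.Module, sparsity=0.5):
    """Prune every Linear layer to the target unstructured sparsity in one
    shot (no retraining), using magnitude as an illustrative criterion."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            k = int(w.numel() * sparsity)
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_(w.abs() > threshold)
```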

This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\varepsilon$-approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional mini-batch SGD needs...

10.48550/arxiv.1803.08917 preprint EN other-oa arXiv (Cornell University) 2018-01-01
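
As a simple frame of reference for Byzantine-robust aggregation, the hedged snippet below replaces the usual gradient average with a coordinate-wise median, which survives a minority of arbitrarily corrupted reports; the paper's actual algorithm uses a different, concentration-based filtering scheme with the stated iteration bound.

```python
import numpy as np

def robust_aggregate(gradients):
    """Coordinate-wise median of per-machine gradients: tolerant to a
    minority of arbitrarily corrupted (Byzantine) inputs."""
    return np.median(np.stack(gradients), axis=0)

# Nine honest machines report noisy gradients; one machine is Byzantine.
true_grad = np.array([1.0, -2.0, 0.5])
reports = [true_grad + 0.1 * np.random.randn(3) for _ in range(9)]
reports.append(np.array([1e6, 1e6, 1e6]))   # adversarial outlier
print(robust_aggregate(reports))             # stays close to true_grad
```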

Applying machine learning techniques to the quickly growing data in science and industry requires highly-scalable algorithms. Large datasets are most commonly processed "data parallel", distributed across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication and thus scalability bottleneck for most machine learning workloads. We observe that frequently, many gradient values are (close to) zero, leading to sparse or sparsifyable communications. To exploit this insight,...

10.1145/3295500.3356222 article EN 2019-11-07
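
The semantics of a sparse allreduce can be simulated on one machine as below: each node contributes only its non-zero (index, value) pairs, yet the result equals the dense sum. This is a hedged sketch of the operation's contract; the paper's contribution is the distributed reduction algorithms and sparse/dense switching, which are not shown.

```python
import numpy as np

def sparse_allreduce(contributions, dim):
    """Sum sparse (indices, values) contributions from all nodes into a
    dense vector, matching what a classic dense allreduce would produce."""
    total = np.zeros(dim)
    for indices, values in contributions:
        np.add.at(total, indices, values)   # correctly handles repeated indices
    return total

# Three nodes, each with a mostly-zero gradient of dimension 8.
node_msgs = [
    (np.array([0, 3]), np.array([0.5, -1.0])),
    (np.array([3, 7]), np.array([2.0, 0.25])),
    (np.array([1]),    np.array([4.0])),
]
print(sparse_allreduce(node_msgs, dim=8))
```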

High-performance concurrent priority queues are essential for applications such as task scheduling and discrete event simulation. Unfortunately, even the best performing implementations do not scale past a number of threads in the single digits. This is because of the sequential bottleneck in accessing the elements at the head of the queue in order to perform a DeleteMin operation. In this paper, we present the SprayList, a scalable priority queue with relaxed ordering semantics. Starting from a non-blocking SkipList, the main innovation behind our...

10.1145/2688500.2688523 article EN 2015-01-24
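
To convey what "relaxed ordering semantics" means, here is a sequential toy sketch in which DeleteMin removes a random element among the few smallest instead of the exact minimum; the real SprayList is a lock-free SkipList where concurrent threads perform a randomized "spray" walk near the head, which this single-threaded sketch does not attempt to model.

```python
import bisect
import random

class RelaxedPriorityQueue:
    """Toy relaxed priority queue: DeleteMin returns one of the smallest
    `spray_width` keys, spreading contention away from the exact head."""

    def __init__(self, spray_width=8):
        self.items = []                 # kept sorted, smallest first
        self.spray_width = spray_width

    def insert(self, key):
        bisect.insort(self.items, key)

    def delete_min(self):
        # Assumes the queue is non-empty.
        window = min(self.spray_width, len(self.items))
        return self.items.pop(random.randrange(window))
```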

Population protocols, roughly defined as systems consisting of large numbers of simple, identical agents, interacting at random and updating their state following simple rules, are an important research topic at the intersection of distributed computing and biology. One of the fundamental tasks that a population protocol may solve is majority: each node starts in one of two states; the goal is for all nodes to reach a correct consensus on which of the two states was initially in the majority. Despite considerable effort, known protocols for this...

10.1145/2767386.2767429 article EN 2015-07-17
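
To make the majority task concrete, the sketch below simulates the classic four-state exact majority protocol (strong opinions cancel, strong recruits weak); this well-known baseline is shown for illustration only and is not the protocol proposed in the paper.

```python
import random

def exact_majority(n_a, n_b, seed=0):
    """Simulate the classic four-state exact majority population protocol.

    'A'/'B' are strong opinions, 'a'/'b' are weak ones.  Assumes a strict
    initial majority (a perfect tie would never converge here).
    Returns the consensus opinion and the number of pairwise interactions.
    """
    rng = random.Random(seed)
    agents = ['A'] * n_a + ['B'] * n_b
    steps = 0
    while len({s.lower() for s in agents}) > 1:       # not yet in consensus
        i, j = rng.sample(range(len(agents)), 2)
        u, v = agents[i], agents[j]
        if {u, v} == {'A', 'B'}:                      # strong opposites cancel
            agents[i], agents[j] = 'a', 'b'
        elif u in 'AB' and v in 'ab' and u.lower() != v:
            agents[j] = u.lower()                     # strong recruits weak
        elif v in 'AB' and u in 'ab' and v.lower() != u:
            agents[i] = v.lower()
        steps += 1
    return agents[0].upper(), steps

print(exact_majority(60, 40))   # converges to the initial majority 'A'
```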

Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides similar performance as a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme - called stochastic quantization - specifically...

10.1109/fccm.2017.39 article EN 2017-04-01
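
The general idea of stochastically quantizing training data to a few bits can be sketched as below: values are rounded up or down at random so that dequantization is unbiased. This is a hedged, software-only illustration; the paper's scheme is tailored to the FPGA datapath and its details differ.

```python
import numpy as np

def stochastic_quantize_data(x, bits=4, seed=0):
    """Compress feature values to `bits` bits with unbiased stochastic
    rounding.  Returns integer codes plus the offset and step needed to
    dequantize; E[dequantized] equals the original x."""
    rng = np.random.default_rng(seed)
    lo, hi = x.min(), x.max()
    span = max(hi - lo, 1e-12)
    levels = 2 ** bits - 1
    scaled = (x - lo) / span * levels
    lower = np.floor(scaled)
    codes = lower + (rng.random(x.shape) < (scaled - lower))
    return codes.astype(np.uint8), lo, span / levels

codes, offset, step = stochastic_quantize_data(np.random.randn(1000))
dequant = offset + codes * step     # unbiased reconstruction of the data
```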

Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, Dan Alistarh. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.

10.18653/v1/2022.emnlp-main.279 article EN cc-by Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 2022-01-01

Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especially for smaller models in the 1-10B range, which are well-suited for edge deployments. To address this issue, we introduce the Sparse-Quantized...

10.48550/arxiv.2306.03078 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Dynamic memory reclamation is arguably the biggest open problem in concurrent data structure design: all known solutions induce high overhead, or must be customized to the specific data structure by the programmer, or both. This paper presents StackTrack, the first concurrent memory reclamation scheme that can be applied automatically by a compiler, while maintaining efficiency. StackTrack eliminates most of the expensive bookkeeping required for memory reclamation by leveraging the power of hardware transactional memory (HTM) in a new way: it tracks thread variables dynamically, and in an atomic...

10.1145/2592798.2592808 article EN 2014-04-14

Population protocols are a popular model of distributed computing, in which randomly-interacting agents with little computational power cooperate to jointly perform computational tasks. Inspired by developments in molecular computation, and in particular DNA computing, recent algorithmic work has focused on the complexity of solving simple yet fundamental tasks in the population model, such as leader election (which requires convergence to a single agent in a special state), and majority (in which agents must converge to a decision as to which of two possible initial states had...

10.5555/3039686.3039855 article EN 2017-01-16

Population protocols are a popular model of distributed computing, introduced by Angluin, Aspnes, Diamadi, Fischer, and Peralta [6] a little over a decade ago. In the meantime, the model has proved a useful abstraction for modeling various settings, from wireless sensor networks [35, 26], to gene regulatory networks [17], and chemical reaction networks [21]. In a nutshell, a population protocol consists of n agents with limited local state that interact randomly in pairs, according to an underlying communication graph, and cooperate to collectively...

10.1145/3289137.3289150 article EN ACM SIGACT News 2018-10-24

Population protocols are a popular model of distributed computing, in which randomly-interacting agents with little computational power cooperate to jointly perform computational tasks. Inspired by developments in molecular computation, and in particular DNA computing, recent algorithmic work has focused on the complexity of solving simple yet fundamental tasks in the population model, such as leader election (which requires convergence to a single agent in a special “leader” state), and majority (in which agents must converge to a decision as to which of two possible initial...

10.1137/1.9781611974782.169 preprint EN 2017-01-01

Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems. Recently, there has been significant interest in utilizing this information in the context of deep neural networks; however, relatively little is known about the quality of existing approximations in this context. Our work examines this question, identifies issues with existing approaches, and proposes a method called WoodFisher to compute a faithful and efficient estimate of the inverse Hessian. Our main...

10.48550/arxiv.2004.14340 preprint EN cc-by arXiv (Cornell University) 2020-01-01
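
The core idea of estimating the inverse Hessian via the empirical Fisher can be sketched with the Sherman-Morrison formula, which folds in one per-sample gradient at a time without ever forming or inverting the full matrix. This is a hedged, dense-matrix sketch of that recursion; the blockwise treatment and the use of the estimate for pruning decisions are omitted.

```python
import numpy as np

def empirical_fisher_inverse(grads, damping=1e-4):
    """Inverse of the damped empirical Fisher
        F = damping * I + (1/N) * sum_i g_i g_i^T,
    built with rank-one Sherman-Morrison updates (no explicit inversion)."""
    n, d = len(grads), grads[0].shape[0]
    f_inv = np.eye(d) / damping
    for g in grads:
        fg = f_inv @ g
        f_inv -= np.outer(fg, fg) / (n + g @ fg)
    return f_inv

# Example with random placeholder per-sample gradients of dimension 20.
grads = [np.random.randn(20) for _ in range(50)]
h_inv_estimate = empirical_fisher_inverse(grads)
```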