Dan Alistarh

ORCID: 0000-0003-3650-940X
Research Areas
  • Distributed systems and fault tolerance
  • Stochastic Gradient Optimization Techniques
  • Optimization and Search Problems
  • Parallel Computing and Optimization Techniques
  • Advanced Neural Network Applications
  • Privacy-Preserving Technologies in Data
  • Topic Modeling
  • Sparse and Compressive Sensing Techniques
  • Domain Adaptation and Few-Shot Learning
  • Cryptography and Data Security
  • Advanced Data Storage Technologies
  • Complexity and Algorithms in Graphs
  • Natural Language Processing Techniques
  • Machine Learning and Algorithms
  • Adversarial Robustness in Machine Learning
  • Cloud Computing and Resource Management
  • Speech Recognition and Synthesis
  • Ferroelectric and Negative Capacitance Devices
  • Graph Theory and Algorithms
  • Machine Learning and ELM
  • Interconnection Networks and Systems
  • Advanced Image and Video Retrieval Techniques
  • Age of Information Optimization
  • Algorithms and Data Compression
  • Real-Time Systems Scheduling

Institute of Science and Technology Austria
2016-2024

Vassar College
2017

ETH Zurich
2013-2017

Computer Algorithms for Medicine
2017

Microsoft Research (United Kingdom)
2014-2016

Microsoft (United States)
2014-2015

Moscow Institute of Thermal Technology
2014

Massachusetts Institute of Technology
2013

École Polytechnique Fédérale de Lausanne
2008-2012

Laboratoire d'Informatique Fondamentale de Lille
2012

Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to the excellent scalability properties of this algorithm, and its efficiency in the context of training deep neural networks. A fundamental barrier to parallelizing large-scale SGD is the fact that the cost of communicating gradient updates between nodes can be very large. Consequently, lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice,...

10.48550/arxiv.1610.02132 preprint EN other-oa arXiv (Cornell University) 2016-01-01
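
The lossy gradient compression described above can be made concrete with a small, hedged sketch: the snippet below implements a generic unbiased stochastic quantizer in numpy, in the spirit of schemes like QSGD but not the paper's exact encoding or variance analysis; the function name and the `levels` parameter are illustrative choices.

```python
import numpy as np

def quantize_gradient(g, levels=4, seed=None):
    """Stochastically quantize a gradient vector to a few levels.

    Each coordinate is scaled by the gradient norm, mapped onto `levels`
    uniform buckets, and rounded up or down at random so that the
    quantized vector is an unbiased estimate of g.
    """
    rng = np.random.default_rng(seed)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)
    scaled = np.abs(g) / norm * levels            # values in [0, levels]
    lower = np.floor(scaled)
    round_up = rng.random(g.shape) < (scaled - lower)
    return np.sign(g) * (lower + round_up) * norm / levels  # E[q] = g

# Example: average quantized gradients from several workers.
workers = [np.random.randn(10) for _ in range(4)]
avg = np.mean([quantize_gradient(g) for g in workers], axis=0)
```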

Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning. One aspect of the field receiving considerable attention is efficiently executing deep models in resource-constrained environments, such as mobile or embedded devices. This paper focuses on this problem, and proposes two new compression methods, which jointly leverage weight quantization and distillation of larger teacher networks into smaller student networks. The...

10.48550/arxiv.1802.05668 preprint EN other-oa arXiv (Cornell University) 2018-01-01
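
As a rough illustration of combining quantization with distillation, the hedged PyTorch sketch below quantizes student weights with a straight-through estimator and trains against softened teacher outputs; both helper functions are simplified stand-ins, not the paper's exact "quantized distillation" or "differentiable quantization" procedures.

```python
import torch
import torch.nn.functional as F

def uniform_quantize(w, bits=4):
    """Uniformly quantize a weight tensor, keeping gradients via straight-through."""
    scale = w.abs().max().clamp(min=1e-8)
    levels = 2 ** (bits - 1) - 1
    q = torch.round(w / scale * levels) / levels * scale
    return w + (q - w).detach()       # forward uses q, backward sees identity

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend softened teacher targets with the hard-label cross-entropy loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```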

The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well, if not better than, the original dense networks. Sparsity can reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever-growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe...

10.48550/arxiv.2102.00554 preprint EN other-oa arXiv (Cornell University) 2021-01-01
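
One of the simplest sparsification techniques covered by such surveys is magnitude pruning; the minimal PyTorch sketch below zeroes out the smallest-magnitude weights of a layer and keeps the resulting mask (function name and parameters are illustrative, not from the paper).

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a boolean mask that keeps only the largest-magnitude weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() > threshold

# Prune a layer to 90% sparsity; the mask can be reused during fine-tuning.
layer = torch.nn.Linear(512, 512)
mask = magnitude_prune(layer.weight.data, sparsity=0.9)
layer.weight.data *= mask
```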

Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, sparsification methods - where each node sorts gradients by magnitude, and only communicates a subset of the components, accumulating the rest locally - are known to yield some of the largest practical gains. Such methods can...

10.48550/arxiv.1809.10505 preprint EN other-oa arXiv (Cornell University) 2018-01-01
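
The top-k-with-local-accumulation scheme described above can be sketched in a few lines of numpy; this is a hedged illustration of the mechanism analyzed in the paper (class and parameter names are mine), not its convergence argument.

```python
import numpy as np

class TopKCompressor:
    """Send only the k largest-magnitude gradient entries; accumulate the
    rest locally and add it back before the next step (error feedback)."""

    def __init__(self, dim, k):
        self.k = k
        self.residual = np.zeros(dim)

    def compress(self, grad):
        corrected = grad + self.residual
        idx = np.argpartition(np.abs(corrected), -self.k)[-self.k:]
        sparse = np.zeros_like(corrected)
        sparse[idx] = corrected[idx]
        self.residual = corrected - sparse   # remember what was not sent
        return idx, corrected[idx]           # indices + values to communicate
```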

Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability of existing...

10.48550/arxiv.2210.17323 preprint EN cc-by arXiv (Cornell University) 2022-01-01
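
For context, the hedged sketch below shows only the plain round-to-nearest, group-wise weight quantization baseline that one-shot methods like the one above improve on; the paper's own method additionally uses approximate second-order information to correct rounding errors, which is not reproduced here.

```python
import torch

@torch.no_grad()
def rtn_quantize_linear(weight: torch.Tensor, bits=4, group_size=128):
    """Round-to-nearest quantization of a Linear weight, one scale per
    group of `group_size` input columns (symmetric, for brevity)."""
    q = weight.clone()
    levels = 2 ** (bits - 1) - 1
    for start in range(0, weight.shape[1], group_size):
        block = q[:, start:start + group_size]
        scale = block.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / levels
        q[:, start:start + group_size] = torch.round(block / scale) * scale
    return q
```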

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. We execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and reach 60% unstructured sparsity with negligible increase...

10.48550/arxiv.2301.00774 preprint EN cc-by arXiv (Cornell University) 2023-01-01
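
To make "one-shot pruning without retraining" concrete, the sketch below walks a model's linear layers and prunes each to a target sparsity in a single pass; note that it uses plain weight magnitude as a stand-in saliency score, whereas SparseGPT itself solves a layer-wise reconstruction problem with approximate second-order information.

```python
import torch

@torch.no_grad()
def one_shot_prune(model: torch.nn.Module, sparsity=0.5):
    """Prune every Linear layer to the target unstructured sparsity in one
    shot (no retraining), using magnitude as an illustrative criterion."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            k = int(w.numel() * sparsity)
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_(w.abs() > threshold)
```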

This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\varepsilon$-approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional mini-batch SGD needs...

10.48550/arxiv.1803.08917 preprint EN other-oa arXiv (Cornell University) 2018-01-01
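
As a simple frame of reference for Byzantine-robust aggregation, the hedged snippet below replaces the usual gradient average with a coordinate-wise median, which survives a minority of arbitrarily corrupted reports; the paper's actual algorithm uses a different, concentration-based filtering scheme with the stated iteration bound.

```python
import numpy as np

def robust_aggregate(gradients):
    """Coordinate-wise median of per-machine gradients: tolerant to a
    minority of arbitrarily corrupted (Byzantine) inputs."""
    return np.median(np.stack(gradients), axis=0)

# Nine honest machines report noisy gradients; one machine is Byzantine.
true_grad = np.array([1.0, -2.0, 0.5])
reports = [true_grad + 0.1 * np.random.randn(3) for _ in range(9)]
reports.append(np.array([1e6, 1e6, 1e6]))   # adversarial outlier
print(robust_aggregate(reports))             # stays close to true_grad
```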

Applying machine learning techniques to the quickly growing data in science and industry requires highly-scalable algorithms. Large datasets are most commonly processed "data parallel", distributed across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication and thus scalability bottleneck for most machine learning workloads. We observe that frequently, many gradient values are (close to) zero, leading to sparse or sparsifyable communications. To exploit this insight,...

10.1145/3295500.3356222 article EN 2019-11-07
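
The semantics of a sparse allreduce can be simulated on one machine as below: each node contributes only its non-zero (index, value) pairs, yet the result equals the dense sum. This is a hedged sketch of the operation's contract; the paper's contribution is the distributed reduction algorithms and sparse/dense switching, which are not shown.

```python
import numpy as np

def sparse_allreduce(contributions, dim):
    """Sum sparse (indices, values) contributions from all nodes into a
    dense vector, matching what a classic dense allreduce would produce."""
    total = np.zeros(dim)
    for indices, values in contributions:
        np.add.at(total, indices, values)   # correctly handles repeated indices
    return total

# Three nodes, each with a mostly-zero gradient of dimension 8.
node_msgs = [
    (np.array([0, 3]), np.array([0.5, -1.0])),
    (np.array([3, 7]), np.array([2.0, 0.25])),
    (np.array([1]),    np.array([4.0])),
]
print(sparse_allreduce(node_msgs, dim=8))
```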

High-performance concurrent priority queues are essential for applications such as task scheduling and discrete event simulation. Unfortunately, even the best performing implementations do not scale past a number of threads in the single digits. This is because of the sequential bottleneck in accessing the elements at the head of the queue in order to perform a DeleteMin operation. In this paper, we present the SprayList, a scalable priority queue with relaxed ordering semantics. Starting from a non-blocking SkipList, the main innovation behind our...

10.1145/2688500.2688523 article EN 2015-01-24
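
To convey what "relaxed ordering semantics" means, here is a sequential toy sketch in which DeleteMin removes a random element among the few smallest instead of the exact minimum; the real SprayList is a lock-free SkipList where concurrent threads perform a randomized "spray" walk near the head, which this single-threaded sketch does not attempt to model.

```python
import bisect
import random

class RelaxedPriorityQueue:
    """Toy relaxed priority queue: DeleteMin returns one of the smallest
    `spray_width` keys, spreading contention away from the exact head."""

    def __init__(self, spray_width=8):
        self.items = []                 # kept sorted, smallest first
        self.spray_width = spray_width

    def insert(self, key):
        bisect.insort(self.items, key)

    def delete_min(self):
        # Assumes the queue is non-empty.
        window = min(self.spray_width, len(self.items))
        return self.items.pop(random.randrange(window))
```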

Population protocols, roughly defined as systems consisting of large numbers of simple, identical agents, interacting at random and updating their state following simple rules, are an important research topic at the intersection of distributed computing and biology. One of the fundamental tasks that a population protocol may solve is majority: each node starts in one of two states; the goal is for all nodes to reach a correct consensus on which of the two states was initially in the majority. Despite considerable effort, known protocols for this...

10.1145/2767386.2767429 article EN 2015-07-17
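
To make the majority task concrete, the sketch below simulates the classic four-state exact majority protocol (strong opinions cancel, strong recruits weak); this well-known baseline is shown for illustration only and is not the protocol proposed in the paper.

```python
import random

def exact_majority(n_a, n_b, seed=0):
    """Simulate the classic four-state exact majority population protocol.

    'A'/'B' are strong opinions, 'a'/'b' are weak ones.  Assumes a strict
    initial majority (a perfect tie would never converge here).
    Returns the consensus opinion and the number of pairwise interactions.
    """
    rng = random.Random(seed)
    agents = ['A'] * n_a + ['B'] * n_b
    steps = 0
    while len({s.lower() for s in agents}) > 1:       # not yet in consensus
        i, j = rng.sample(range(len(agents)), 2)
        u, v = agents[i], agents[j]
        if {u, v} == {'A', 'B'}:                      # strong opposites cancel
            agents[i], agents[j] = 'a', 'b'
        elif u in 'AB' and v in 'ab' and u.lower() != v:
            agents[j] = u.lower()                     # strong recruits weak
        elif v in 'AB' and u in 'ab' and v.lower() != u:
            agents[i] = v.lower()
        steps += 1
    return agents[0].upper(), steps

print(exact_majority(60, 40))   # converges to the initial majority 'A'
```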

Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides similar performance as a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme - called stochastic quantization - specifically...

10.1109/fccm.2017.39 article EN 2017-04-01
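
The general idea of stochastically quantizing training data to a few bits can be sketched as below: values are rounded up or down at random so that dequantization is unbiased. This is a hedged, software-only illustration; the paper's scheme is tailored to the FPGA datapath and its details differ.

```python
import numpy as np

def stochastic_quantize_data(x, bits=4, seed=0):
    """Compress feature values to `bits` bits with unbiased stochastic
    rounding.  Returns integer codes plus the offset and step needed to
    dequantize; E[dequantized] equals the original x."""
    rng = np.random.default_rng(seed)
    lo, hi = x.min(), x.max()
    span = max(hi - lo, 1e-12)
    levels = 2 ** bits - 1
    scaled = (x - lo) / span * levels
    lower = np.floor(scaled)
    codes = lower + (rng.random(x.shape) < (scaled - lower))
    return codes.astype(np.uint8), lo, span / levels

codes, offset, step = stochastic_quantize_data(np.random.randn(1000))
dequant = offset + codes * step     # unbiased reconstruction of the data
```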

Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, Dan Alistarh. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.

10.18653/v1/2022.emnlp-main.279 article EN cc-by Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 2022-01-01

Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especially for smaller models in the 1-10B range, which are well-suited for edge deployments. To address this issue, we introduce the Sparse-Quantized...

10.48550/arxiv.2306.03078 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Dynamic memory reclamation is arguably the biggest open problem in concurrent data structure design: all known solutions induce high overhead, or must be customized to the specific data structure by the programmer, or both. This paper presents StackTrack, the first concurrent memory reclamation scheme that can be applied automatically by a compiler, while maintaining efficiency. StackTrack eliminates most of the expensive bookkeeping required for memory reclamation by leveraging the power of hardware transactional memory (HTM) in a new way: it tracks thread variables dynamically, and in an atomic...

10.1145/2592798.2592808 article EN 2014-04-14

Population protocols are a popular model of distributed computing, in which randomly-interacting agents with little computational power cooperate to jointly perform computational tasks. Inspired by developments in molecular computation, and in particular DNA computing, recent algorithmic work has focused on the complexity of solving simple yet fundamental tasks in the population model, such as leader election (which requires convergence to a single agent in a special state), and majority (in which agents must converge to a decision as to which of two possible initial states had...

10.5555/3039686.3039855 article EN 2017-01-16

Population protocols are a popular model of distributed computing, introduced by Angluin, Aspnes, Diamadi, Fischer, and Peralta [6] a little over a decade ago. In the meantime, the model has proved a useful abstraction for modeling various settings, from wireless sensor networks [35, 26], to gene regulatory networks [17], and chemical reaction networks [21]. In a nutshell, a population protocol consists of n agents with limited local state that interact randomly in pairs, according to an underlying communication graph, and cooperate to collectively...

10.1145/3289137.3289150 article EN ACM SIGACT News 2018-10-24

Population protocols are a popular model of distributed computing, in which randomly-interacting agents with little computational power cooperate to jointly perform computational tasks. Inspired by developments in molecular computation, and in particular DNA computing, recent algorithmic work has focused on the complexity of solving simple yet fundamental tasks in the population model, such as leader election (which requires convergence to a single agent in a special “leader” state), and majority (in which agents must converge to a decision as to which of two possible initial...

10.1137/1.9781611974782.169 preprint EN 2017-01-01

Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems. Recently, there has been significant interest in utilizing this information in the context of deep neural networks; however, relatively little is known about the quality of existing approximations in this context. Our work examines this question, identifies issues with existing approaches, and proposes a method called WoodFisher to compute a faithful and efficient estimate of the inverse Hessian. Our main...

10.48550/arxiv.2004.14340 preprint EN cc-by arXiv (Cornell University) 2020-01-01
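
The core idea of estimating the inverse Hessian via the empirical Fisher can be sketched with the Sherman-Morrison formula, which folds in one per-sample gradient at a time without ever forming or inverting the full matrix. This is a hedged, dense-matrix sketch of that recursion; the blockwise treatment and the use of the estimate for pruning decisions are omitted.

```python
import numpy as np

def empirical_fisher_inverse(grads, damping=1e-4):
    """Inverse of the damped empirical Fisher
        F = damping * I + (1/N) * sum_i g_i g_i^T,
    built with rank-one Sherman-Morrison updates (no explicit inversion)."""
    n, d = len(grads), grads[0].shape[0]
    f_inv = np.eye(d) / damping
    for g in grads:
        fg = f_inv @ g
        f_inv -= np.outer(fg, fg) / (n + g @ fg)
    return f_inv

# Example with random placeholder per-sample gradients of dimension 20.
grads = [np.random.randn(20) for _ in range(50)]
h_inv_estimate = empirical_fisher_inverse(grads)
```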