- Stochastic Gradient Optimization Techniques
- Sparse and Compressive Sensing Techniques
- Privacy-Preserving Technologies in Data
- Topic Modeling
- Natural Language Processing Techniques
- Advanced Neural Network Applications
- Adversarial Robustness in Machine Learning
- Domain Adaptation and Few-Shot Learning
- Advanced Optimization Algorithms Research
- Machine Learning and Algorithms
- Multimodal Machine Learning Applications
- Age of Information Optimization
- Advanced Bandit Algorithms Research
- Advanced Text Analysis Techniques
- Neural Networks and Applications
- Machine Learning and ELM
- Generative Adversarial Networks and Image Synthesis
- Sentiment Analysis and Opinion Mining
- Machine Learning in Healthcare
- Explainable Artificial Intelligence (XAI)
- Text Readability and Simplification
- Machine Learning and Data Classification
- Distributed Control Multi-Agent Systems
- Model Reduction and Neural Networks
- Statistical Methods and Inference
École Polytechnique Fédérale de Lausanne
2017-2024
University Hospital of Bern
2024
University of Michigan
2024
Yale University
2024
University of Tübingen
2023
ETH Zurich
2009-2019
Novartis (Switzerland)
2019
Novartis Institutes for BioMedical Research
2019
University of California, Berkeley
2019
École Polytechnique
2012-2018
Matteo Pagliardini, Prakhar Gupta, Martin Jaggi. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018.
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model while keeping the training data decentralized. In most of the current training schemes the central model is refined by averaging the parameters of the server model and the updated parameters from the client side. However, directly averaging model parameters is only possible if all models have the same structure and size, which could be a restrictive constraint in many scenarios. In this work we investigate more powerful and more flexible aggregation schemes for FL. Specifically, we propose ensemble distillation for model fusion,...
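As a point of reference for the aggregation step the abstract calls restrictive, here is a minimal sketch of FedAvg-style parameter averaging, which only works when every client model shares the same shapes. It is an illustrative baseline, not the ensemble distillation method proposed in the paper; all names and sizes are made up.

```python
# Hypothetical FedAvg-style parameter averaging (illustrative baseline only):
# a weighted element-wise average of client parameters, which requires that
# all client models have identical architecture and parameter shapes.
import numpy as np

def federated_average(client_params, client_sizes):
    """Weighted average of client parameter vectors; all must share one shape."""
    total = sum(client_sizes)
    return sum(p * (n / total) for p, n in zip(client_params, client_sizes))

clients = [np.random.randn(10) for _ in range(3)]   # three identically-shaped client models
sizes = [100, 50, 150]                               # local dataset sizes used as weights
global_model = federated_average(clients, sizes)
print(global_model.shape)                            # (10,): same shape as every client model
```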
Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle long-range dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove...
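A toy NumPy illustration of the claim that attention layers can perform convolution: if each attention "head" attends with probability one to a fixed shift and the heads are combined with the kernel weights, the layer reproduces a 1-D convolution. This is a simplified sketch of the idea, not the paper's construction (which uses softmax attention with relative positional encodings); all function names are ours.

```python
# Toy construction: one attention "head" per kernel tap, each attending to a
# fixed relative shift; the head outputs weighted by the kernel reproduce a
# 1-D cross-correlation/convolution.
import numpy as np

def attention_as_convolution(x, kernel):
    n, k = len(x), len(kernel)
    shifts = np.arange(k) - k // 2                 # relative positions covered by the heads
    out = np.zeros(n)
    for w, s in zip(kernel, shifts):
        A = np.zeros((n, n))                       # attention matrix of this head
        for i in range(n):
            A[i, min(max(i + s, 0), n - 1)] = 1.0  # attend only to position i + s (clamped)
        out += w * (A @ x)                         # output projection weights the head by w
    return out

x = np.random.randn(32)
kernel = np.array([0.25, 0.5, 0.25])
reference = np.convolve(x, kernel[::-1], mode="same")
print(np.allclose(attention_as_convolution(x, kernel)[1:-1], reference[1:-1]))  # True away from borders
```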
Keyphrase extraction is the task of automatically selecting a small set of phrases that best describe a given free text document. Supervised keyphrase extraction requires large amounts of labeled training data and generalizes very poorly outside the domain of the training data. At the same time, unsupervised systems have poor accuracy, and often do not generalize well, as they require the input document to belong to a larger corpus also given as input. Addressing these drawbacks, in this paper, we tackle keyphrase extraction from single documents with EmbedRank: a novel unsupervised method,...
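A minimal sketch of an EmbedRank-style ranking step, assuming candidate phrases are already extracted: embed the document and each candidate in a shared space and rank candidates by cosine similarity to the document embedding. The paper uses sentence embeddings such as Sent2Vec; the `toy_embed` hashing embedder below is purely a stand-in.

```python
# Hypothetical EmbedRank-style ranking step; `toy_embed` stands in for a real
# sentence embedder and is not part of the paper's method.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def rank_keyphrases(document, candidates, embed, top_k=5):
    """Rank candidate phrases by cosine similarity to the document embedding."""
    doc_vec = embed(document)
    scored = [(cosine(embed(c), doc_vec), c) for c in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored[:top_k]]

def toy_embed(text, dim=256):
    """Illustrative character-trigram hashing embedder, not a real sentence encoder."""
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1.0
    return v

doc = "unsupervised keyphrase extraction from single documents using sentence embeddings"
candidates = ["keyphrase extraction", "sentence embeddings", "larger corpus", "training data"]
print(rank_keyphrases(doc, candidates, toy_embed, top_k=2))
```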
With the growth of data and the necessity for distributed optimization methods, solvers that work well on a single machine must be re-designed to leverage distributed computation. Recent work in this area has been limited by focusing heavily on developing highly specific methods for the distributed environment. These special-purpose methods are often unable to match the competitive performance of their well-tuned and customized single machine counterparts. Further, they are unable to easily integrate improvements that continue to be made to single machine methods. To this end, we present a framework that both allows...
Time series constitute a challenging data type for machine learning algorithms, due to their highly variable lengths and sparse labeling in practice. In this paper, we tackle this challenge by proposing an unsupervised method to learn universal embeddings of time series. Unlike previous works, it is scalable with respect to their length and we demonstrate the quality, transferability and practicability of the learned representations with thorough experiments and comparisons. To this end, we combine an encoder based on causal dilated convolutions...
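A small NumPy sketch of the building block named in the abstract, a stack of causal dilated convolutions: each output step depends only on past inputs and the receptive field doubles with every layer. Channel counts, weights, nonlinearity, and the final pooling are simplified assumptions, not the paper's architecture.

```python
# Illustrative single-channel causal dilated convolution stack; a real encoder
# would use many channels, residual connections and pooling over time.
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """output[t] depends only on x[t], x[t - d], x[t - 2d], ... (causal)."""
    n, k = len(x), len(kernel)
    padded = np.concatenate([np.zeros((k - 1) * dilation), x])   # left padding only
    return np.array([
        sum(kernel[j] * padded[t + (k - 1 - j) * dilation] for j in range(k))
        for t in range(n)
    ])

def encode(series, depth=4, kernel_size=3, seed=0):
    """Stack layers with dilations 1, 2, 4, ... and take the last step as a summary."""
    rng = np.random.default_rng(seed)
    h = series
    for layer in range(depth):
        kernel = rng.standard_normal(kernel_size) / kernel_size
        h = np.tanh(causal_dilated_conv(h, kernel, 2 ** layer))  # nonlinearity between layers
    return h[-1]   # scalar summary; a real encoder outputs a vector per series

series = np.sin(np.linspace(0, 10, 200))
print(encode(series))
```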
Neural Architecture Search (NAS) aims to facilitate the design of deep networks for new tasks. Existing techniques rely on two stages: searching over the architecture space and validating the best architecture. NAS algorithms are currently compared solely based on their results on the downstream task. While intuitive, this fails to explicitly evaluate the effectiveness of their search strategies. In this paper, we propose to evaluate the NAS search phase. To this end, we compare the quality of the solutions obtained by NAS search policies with that of random architecture selection. We find that: (i)...
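A hedged sketch of the evaluation protocol the abstract argues for: compare the architecture returned by a search policy against architectures sampled uniformly at random from the same space under a comparable evaluation budget. The search space, `train_and_evaluate`, and the toy scoring function below are hypothetical placeholders, not any benchmark from the paper.

```python
# Hypothetical evaluation harness; architectures and the "accuracy" function are toys.
import random

def random_search_baseline(search_space, train_and_evaluate, budget=10, seed=0):
    """Best validation score among `budget` uniformly sampled architectures."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(budget):
        arch = rng.choice(search_space)
        score = train_and_evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

def policy_vs_random(policy_arch, search_space, train_and_evaluate, budget=10):
    """Positive return value means the search policy beat random selection."""
    _, random_score = random_search_baseline(search_space, train_and_evaluate, budget)
    return train_and_evaluate(policy_arch) - random_score

space = list(range(100))                                   # toy architecture encoding
toy_eval = lambda arch: -(arch - 42) ** 2 + random.gauss(0.0, 5.0)   # noisy toy "accuracy"
print(policy_vs_random(40, space, toy_eval))
```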
Mini-batch stochastic gradient methods (mini-batch SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we propose a post-local SGD and show that it significantly improves the generalization performance compared to large-batch...
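A minimal sketch of the post-local SGD schedule on a toy quadratic, assuming illustrative hyper-parameters: workers first run synchronous mini-batch SGD (local SGD with a single local step), then switch to several local steps per communication round with periodic model averaging.

```python
# Toy quadratic objective; local_steps=1 corresponds to plain synchronous SGD,
# the second phase switches to local SGD with periodic model averaging.
import numpy as np

def local_sgd(workers, stoch_grad, lr, local_steps, rounds, rng):
    for _ in range(rounds):
        for i in range(len(workers)):
            for _ in range(local_steps):                     # independent local updates
                workers[i] = workers[i] - lr * stoch_grad(workers[i], rng)
        avg = np.mean(workers, axis=0)                       # communication: model averaging
        workers = [avg.copy() for _ in workers]
    return workers

rng = np.random.default_rng(0)
stoch_grad = lambda w, rng: 2 * w + 0.1 * rng.standard_normal(w.shape)   # grad of ||w||^2 + noise
workers = [np.ones(5) for _ in range(4)]

# phase 1: large-batch-style synchronous SGD (one local step per round)
workers = local_sgd(workers, stoch_grad, lr=0.05, local_steps=1, rounds=50, rng=rng)
# phase 2 (post-local SGD): several local steps between averaging rounds
workers = local_sgd(workers, stoch_grad, lr=0.05, local_steps=8, rounds=20, rng=rng)
print(np.linalg.norm(np.mean(workers, axis=0)))   # close to 0, the optimum of ||w||^2
```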
Sign-based algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counter-examples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using error-feedback, i.e. incorporating the error made by the compression operator into the next step,...
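A compact sketch of the error-feedback fix the abstract describes: the (scaled) sign compressor is applied to the gradient step plus the residual error from the previous round, and the new residual is stored for the next one. The quadratic objective, scaling of the sign, and step-size are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative single-worker error-feedback signSGD loop on a toy quadratic.
import numpy as np

def ef_sign_sgd(grad, x0, lr=0.1, steps=1000):
    x, e = x0.copy(), np.zeros_like(x0)                # e stores the accumulated compression error
    for _ in range(steps):
        p = lr * grad(x) + e                           # gradient step corrected by the stored error
        delta = np.abs(p).mean() * np.sign(p)          # scaled sign: roughly 1 bit per coordinate
        e = p - delta                                  # keep what the compressor dropped
        x = x - delta                                  # apply only the compressed update
    return x

target = np.array([1.0, 2.0, 3.0, 4.0])
grad = lambda x: 2 * (x - target)                      # gradient of ||x - target||^2
print(ef_sign_sgd(grad, np.zeros(4)))                  # approaches [1, 2, 3, 4]
```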
Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing,...
Large language models (LLMs) can potentially democratize access to medical knowledge. While many efforts have been made to harness and improve LLMs' medical knowledge and reasoning capacities, the resulting models are either closed-source (e.g., PaLM, GPT-4) or limited in scale (<= 13B parameters), which restricts their abilities. In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. MEDITRON builds on Llama-2 (through our adaptation of Nvidia's...
The Frank-Wolfe (FW) optimization algorithm has lately re-gained popularity thanks in particular to its ability to nicely handle the structured constraints appearing in machine learning applications. However, its convergence rate is known to be slow (sublinear) when the solution lies at the boundary. A simple less-known fix is to add the possibility to take 'away steps' during optimization, an operation that importantly does not require a feasibility oracle. In this paper, we highlight and clarify several variants of the Frank-Wolfe optimization algorithm that have...
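A hedged NumPy sketch of Frank-Wolfe with away steps on the probability simplex: besides the usual step toward the best vertex, the algorithm may step away from the worst vertex currently in its active set. The quadratic objective, exact line search, and dimensions are illustrative choices, not the paper's setting.

```python
# Minimize ||x - b||^2 over the probability simplex with Frank-Wolfe plus away steps.
import numpy as np

def fw_with_away_steps(b, iters=100):
    n = len(b)
    x = np.ones(n) / n
    active = {i: 1.0 / n for i in range(n)}              # weights of the active vertices
    for _ in range(iters):
        g = 2 * (x - b)                                  # gradient
        s = int(np.argmin(g))                            # Frank-Wolfe vertex
        a = max(active, key=lambda i: g[i])              # away vertex: worst active one
        d_fw = -x.copy(); d_fw[s] += 1.0                 # direction e_s - x
        d_aw = x.copy();  d_aw[a] -= 1.0                 # direction x - e_a
        if -g @ d_fw >= -g @ d_aw:                       # take the steeper of the two
            d, gamma_max, fw_step = d_fw, 1.0, True
        else:
            d, gamma_max, fw_step = d_aw, active[a] / (1.0 - active[a]), False
        gamma = min(max(-(g @ d) / (2 * (d @ d) + 1e-16), 0.0), gamma_max)  # exact line search
        x = x + gamma * d
        scale = (1.0 - gamma) if fw_step else (1.0 + gamma)   # maintain active-set weights
        active = {i: w * scale for i, w in active.items()}
        if fw_step:
            active[s] = active.get(s, 0.0) + gamma
        else:
            active[a] -= gamma
        active = {i: w for i, w in active.items() if w > 1e-12}   # drop emptied vertices
    return x

print(fw_with_away_steps(np.array([0.7, 0.4, -0.1, 0.0])))  # ~ [0.65, 0.35, 0, 0], the projection of b
```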
This study deals with semantic segmentation of high-resolution (aerial) images where a semantic class label is assigned to each pixel via supervised classification as a basis for automatic map generation. Recently, deep convolutional neural networks (CNNs) have shown impressive performance and have quickly become the de-facto standard for semantic segmentation, with the added benefit that task-specific feature design is no longer necessary. However, a major downside of deep learning methods is that they are extremely data-hungry, thus aggravating...
We propose a randomized block-coordinate variant of the classic Frank-Wolfe algorithm for convex optimization with block-separable constraints. Despite its lower iteration cost, we show that it achieves a similar convergence rate in duality gap as the full Frank-Wolfe algorithm. We also show that, when applied to the dual structural support vector machine (SVM) objective, this yields an online algorithm that has the same low iteration complexity as primal stochastic subgradient methods. However, unlike stochastic subgradient methods, this algorithm allows us to compute the optimal step-size and...
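A small sketch of the randomized block-coordinate Frank-Wolfe step on a product of simplices: each iteration updates a single randomly chosen block with a Frank-Wolfe step, using a closed-form optimal step-size for the block subproblem. The separable quadratic objective below is an illustrative stand-in for the structural SVM dual, not the paper's problem.

```python
# Block-coordinate Frank-Wolfe on a product of simplices with exact block line search.
import numpy as np

def bcfw(B, iters=1000, seed=0):
    """Minimize sum_i ||X[i] - B[i]||^2 with each row X[i] constrained to a simplex."""
    rng = np.random.default_rng(seed)
    n, d = B.shape
    X = np.ones((n, d)) / d                            # each block starts at the uniform point
    for _ in range(iters):
        i = rng.integers(n)                            # pick one block uniformly at random
        g = 2 * (X[i] - B[i])                          # gradient w.r.t. that block only
        s = np.zeros(d); s[np.argmin(g)] = 1.0         # block-wise linear minimizer (a vertex)
        direction = s - X[i]
        gamma = min(max(-(g @ direction) / (2 * (direction @ direction) + 1e-16), 0.0), 1.0)
        X[i] = X[i] + gamma * direction                # Frank-Wolfe step on the chosen block
    return X

B = np.array([[0.7, 0.4, -0.1],
              [0.2, 0.9,  0.2]])
print(bcfw(B))   # each row ~ the Euclidean projection of B[i] onto the simplex
```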
We consider decentralized stochastic optimization with the objective function (e.g. data samples for a machine learning task) being distributed over $n$ machines that can only communicate to their neighbors on a fixed communication graph. To reduce the communication bottleneck, the nodes compress (e.g. quantize or sparsify) their model updates. We cover both unbiased and biased compression operators with quality denoted by $\omega \leq 1$ ($\omega=1$ meaning no compression). We (i) propose a novel gossip-based stochastic gradient descent algorithm,...
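A hedged simulation of one way such a gossip-based scheme with compressed model updates can look (a simplified sketch, not the paper's algorithm): each node takes a local stochastic gradient step, shares a compressed (here top-k sparsified) difference to keep public estimates of the models in sync, and gossips on those estimates. The ring topology, mixing matrix, and step-sizes are illustrative.

```python
# Simplified decentralized SGD with compressed gossip on a 4-node ring.
import numpy as np

def top_k(v, k):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def compressed_gossip_sgd(stoch_grads, W, d, steps=300, lr=0.05, gossip_lr=0.2, k=2, seed=0):
    rng = np.random.default_rng(seed)
    n = len(stoch_grads)
    X = np.zeros((n, d))                               # private model of each node
    X_hat = np.zeros((n, d))                           # publicly known (compressed) estimates
    for _ in range(steps):
        for i in range(n):
            X[i] -= lr * stoch_grads[i](X[i], rng)     # local stochastic gradient step
        Q = np.array([top_k(X[i] - X_hat[i], k) for i in range(n)])
        X_hat += Q                                     # exchange compressed differences
        X += gossip_lr * (W @ X_hat - X_hat)           # gossip on the shared estimates
    return X

W = np.array([[0.5, 0.25, 0.0, 0.25],                  # doubly stochastic ring mixing matrix
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])
targets = [np.full(4, c) for c in (1.0, 2.0, 3.0, 4.0)]    # heterogeneous local objectives
stoch_grads = [lambda x, rng, t=t: 2 * (x - t) + 0.05 * rng.standard_normal(x.shape) for t in targets]
X = compressed_gossip_sgd(stoch_grads, W, d=4)
print(X.mean(axis=0))   # node average approaches the global optimum, 2.5 in every coordinate
```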