- Machine Learning and Data Classification
- Advanced Graph Neural Networks
- Topic Modeling
- Data Quality and Management
- Radio Astronomy Observations and Technology
- Privacy-Preserving Technologies in Data
- Machine Learning and Algorithms
- Neural Networks and Applications
- Bayesian Modeling and Causal Inference
- Sparse and Compressive Sensing Techniques
- Scientific Computing and Data Management
- Data Stream Mining Techniques
- Data Management and Algorithms
- Advanced Neural Network Applications
- Recommender Systems and Techniques
- Adversarial Robustness in Machine Learning
- Auction Theory and Applications
- Advanced Image and Video Retrieval Techniques
- Geophysics and Gravity Measurements
- Statistical and Numerical Algorithms
- Complexity and Algorithms in Graphs
- Advanced Database Systems and Queries
- Graph Theory and Algorithms
- Mathematical Analysis and Transform Methods
- Anomaly Detection Techniques and Applications
Delft University of Technology
2024
ETH Zurich
2018-2021
IBM Research - Zurich
2017
Given a data set D containing millions of data points and a data consumer who is willing to pay $X to train a machine learning (ML) model over D, how should we distribute this $X to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality, and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all N data points, it requires O(2^N)...
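The exponential cost of exact Shapley computation can be illustrated with a minimal sketch. It enumerates every subset of the other points for each data point, so it is only feasible for very small N. The function name `exact_shapley` and the additive `toy_utility` are hypothetical illustrations, not taken from the paper; in a real setting the utility would be the performance of a model trained on the subset.

```python
from itertools import combinations
from math import factorial


def exact_shapley(n, utility):
    """Exact Shapley values for points 0..n-1 by full subset enumeration.

    `utility` maps a frozenset of point indices to a number.
    The loop visits 2^(n-1) subsets per point, hence the exponential cost.
    """
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight of a subset of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of point i to subset s
                values[i] += weight * (utility(s | {i}) - utility(s))
    return values


def toy_utility(subset):
    """Hypothetical additive utility: each point adds a fixed amount."""
    point_worth = [1.0, 2.0, 3.0]
    return sum(point_worth[j] for j in subset)


shapley_values = exact_shapley(3, toy_utility)
print(shapley_values)
```

For an additive utility like this one, each point's Shapley value equals its own contribution, which makes the toy case easy to check by hand.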
Developing machine learning models can be seen as a process similar to the one established for traditional software development. A key difference between the two lies in the strong dependency between the quality of a machine learning model and the quality of the data used to train or perform evaluations. In this work, we demonstrate how different aspects of data quality propagate through various stages of machine learning development. By performing a joint analysis of the impact of well-known data quality dimensions and the downstream machine learning process, we show that different components of a typical MLOps pipeline can be efficiently designed, providing both...
Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain...
Mining from graph-structured data is an integral component of graph data management. A recent trending technique, the graph convolutional network (GCN), has gained momentum in the graph mining field and plays an essential part in numerous graph-related tasks. Although emerging GCN optimization techniques bring improvements to specific scenarios, they perform diversely in different applications and introduce many trial-and-error costs for practitioners. Moreover, existing GCN models often suffer from the oversmoothing problem. Besides,...
High-order interactive features capture the correlation between different columns and thus are promising to enhance various learning tasks on ubiquitous tabular data. To automate the generation of interactive features, existing works either explicitly traverse the feature space or implicitly express the interactions via intermediate activations of some designed models. These two kinds of methods show that there is essentially a trade-off between feature interpretability and search efficiency. To possess both of their merits, we propose a novel...
Modern scientific instruments produce vast amounts of data, which can overwhelm the processing ability of computer systems. Lossy compression of data is an intriguing solution, but comes with its own drawbacks, such as potential signal loss and the need for careful optimization of the compression ratio. In this work, we focus on a setting where this problem is especially acute: compressive sensing frameworks for interferometry and medical imaging. We ask the following question: can the precision of the data representation be lowered for all inputs, with recovery...
"How much is my data worth?" is an increasingly common question posed by organizations and individuals alike. An answer to this question could allow, for instance, fairly distributing profits among multiple data contributors and determining prospective compensation when data breaches happen. In this paper, we study the problem of data valuation by utilizing the Shapley value, a popular notion of value which originated in cooperative game theory. The Shapley value defines a unique payoff scheme that satisfies many desiderata for the notion of data value. However, the Shapley value often...
Despite the impressive capabilities of large language models (LLMs) across diverse applications, they still suffer from trustworthiness issues, such as hallucinations and misalignments. Retrieval-augmented language models (RAG) have been proposed to enhance the credibility of generations by grounding external knowledge, but the theoretical understanding of their generation risks remains unexplored. In this paper, we answer: 1) whether RAG can indeed lead to low generation risks, 2) how to provide provable guarantees on the generation risks of RAG and vanilla LLMs, and 3)...
Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of Certain Predictions (CP)...
Despite the great successes achieved by deep neural networks (DNNs), recent studies show that they are vulnerable against adversarial examples, which aim to mislead DNNs by adding small adversarial perturbations. Several defenses have been proposed against such attacks, while many of them have been adaptively attacked. In this work, we aim to enhance ML robustness from a different perspective by leveraging domain knowledge: we propose a Knowledge Enhanced Machine Learning Pipeline (KEMLP) to integrate domain knowledge (i.e., logic relationships...
Given $k$ pre-trained classifiers and a stream of unlabeled data examples, how can we actively decide when to query a label so that we can distinguish the best model from the rest while making a small number of queries? Answering this question has a profound impact on a range of practical scenarios. In this work, we design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round. Our algorithm can be used for online prediction tasks for both adversarial and stochastic streams. We establish...
Methods for carefully selecting or generating a small set of training data to learn from, i.e., data pruning, coreset selection, and data distillation, have been shown to be effective in reducing the ever-increasing cost of training neural networks. Behind this success are rigorously designed strategies for identifying informative training examples out of large datasets. However, these strategies come with additional computational costs associated with subset selection or data distillation before training begins, and furthermore, many even under-perform random...
Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.
Conformal prediction has shown spurring performance in constructing statistically rigorous prediction sets for arbitrary black-box machine learning models, assuming the data is exchangeable. However, even small adversarial perturbations during inference can violate the exchangeability assumption, challenge the coverage guarantees, and result in a subsequent decline in empirical coverage. In this work, we propose a certifiably robust learning-reasoning conformal prediction framework (COLEP) via probabilistic circuits, which...
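For readers unfamiliar with prediction sets, the following is a minimal sketch of standard split conformal prediction for classification. It is not the COLEP framework and includes no robustness certification. The function names, calibration scores, and class probabilities are hypothetical illustrations; the nonconformity score is assumed to be one minus the model's probability for a label.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Score threshold giving roughly (1 - alpha) coverage under exchangeability.

    `cal_scores` are nonconformity scores of the true labels on held-out
    calibration data (here assumed to be 1 - predicted probability).
    """
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha))
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points: every label must be included
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, threshold):
    """Labels whose nonconformity score does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical calibration scores and class probabilities for one test input
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
threshold = conformal_threshold(cal_scores, alpha=0.25)  # 8th smallest score
labels = prediction_set([0.60, 0.35, 0.05], threshold)
print(threshold, labels)
```

The coverage guarantee of this procedure relies on calibration and test data being exchangeable, which is the assumption that adversarial perturbations can break.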
Federated learning (FL) has emerged as a prominent method for collaboratively training machine learning models using local data from edge devices, all while keeping data decentralized. However, accounting for the quality of data contributed by local clients remains a critical challenge in FL, as local data are often susceptible to corruption by various forms of noise and perturbations, which compromise the aggregation process and lead to a subpar global model. In this work, we focus on addressing the problem of noisy data in the input space, an under-explored area compared...
With the multitude of pretrained models available thanks to the advancements in large-scale supervised and self-supervised learning, choosing the right model is becoming increasingly pivotal in the machine learning lifecycle. However, much like the training process, choosing the best off-the-shelf model for raw, unlabeled data is a labor-intensive task. To overcome this, we introduce MODEL SELECTOR, a framework for label-efficient selection of pretrained classifiers. Given a pool of unlabeled target data, MODEL SELECTOR samples a small subset of highly informative examples...
Radio interferometry usually compensates for high levels of noise in sensor/antenna electronics by throwing data and energy at the problem: observe longer, then store and process it all. Furthermore, only the end image is cleaned, reducing flexibility substantially. We propose instead a method to remove the noise explicitly before imaging. To this end, we developed an algorithm that first decomposes the sensor signals into components using Singular Spectrum Analysis and then clusters these components using a graph Laplacian matrix. We show...