Jonathan Crabbé

ORCID: 0000-0002-0341-7712
Research Areas
  • Explainable Artificial Intelligence (XAI)
  • Machine Learning in Healthcare
  • Machine Learning and Data Classification
  • Anomaly Detection Techniques and Applications
  • Topic Modeling
  • Machine Learning in Materials Science
  • Neural Networks and Applications
  • Time Series Analysis and Forecasting
  • Adversarial Robustness in Machine Learning
  • Model Reduction and Neural Networks
  • Domain Adaptation and Few-Shot Learning
  • Reservoir Engineering and Simulation Methods
  • Advanced Neural Network Applications
  • Catalysis and Oxidation Reactions
  • Advanced Database Systems and Queries
  • Data Stream Mining Techniques
  • Radiomics and Machine Learning in Medical Imaging
  • Modular Robots and Swarm Intelligence
  • Advanced Causal Inference Techniques
  • Advanced Image Fusion Techniques
  • Remote-Sensing Image Classification
  • 3D Modeling in Geospatial Applications
  • Geochemistry and Geologic Mapping
  • Gaussian Processes and Bayesian Inference
  • Generative Adversarial Networks and Image Synthesis

Affiliations
Microsoft Research (United Kingdom)
2025

University of Cambridge
2020-2023

Publications

The design of functional materials with desired properties is essential in driving technological advances in areas like energy storage, catalysis, and carbon capture. Generative models provide a new paradigm for materials design by directly generating entirely novel materials given property constraints. Despite recent progress, current generative models have a low success rate in proposing stable crystals, or can only satisfy a very limited set of property constraints. Here, we present MatterGen, a model that generates stable, diverse inorganic materials across the...

10.48550/arxiv.2312.03687 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Concept-based explanations permit us to understand the predictions of a deep neural network (DNN) through the lens of concepts specified by users. Existing methods assume that the examples illustrating a concept are mapped in a fixed direction of the DNN's latent space. When this holds true, the concept can be represented by a concept activation vector (CAV) pointing in that direction. In this work, we propose to relax this assumption by allowing concept examples to be scattered across different clusters in the DNN's latent space. Each concept is then represented by a region of the latent space that includes these clusters, which we call a concept activation region (CAR). To formalize this idea,...

10.48550/arxiv.2209.11222 preprint EN cc-by arXiv (Cornell University) 2022-01-01
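
As a rough illustration of the contrast drawn in this abstract, the sketch below fits a linear concept classifier (a CAV-style direction) and a nonlinear kernel classifier (a CAR-style region) on latent activations. The synthetic latents and the `concept_score` helper are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Assumed inputs (hypothetical): latent representations of examples that do /
# do not illustrate the concept, extracted from some layer of the DNN.
rng = np.random.default_rng(0)
concept_latents = rng.normal(loc=1.0, size=(100, 32))       # positive examples
non_concept_latents = rng.normal(loc=-1.0, size=(100, 32))   # negative examples

H = np.vstack([concept_latents, non_concept_latents])
y = np.concatenate([np.ones(100), np.zeros(100)])

# CAV-style: a single direction separating concept from non-concept examples.
linear_clf = LogisticRegression(max_iter=1000).fit(H, y)
cav = linear_clf.coef_[0] / np.linalg.norm(linear_clf.coef_[0])

# CAR-style: a nonlinear classifier whose positive region can cover several
# clusters of concept examples rather than a single half-space.
region_clf = SVC(kernel="rbf", probability=True).fit(H, y)

def concept_score(h):
    """Probability that a latent vector h falls inside the concept region."""
    return region_clf.predict_proba(h.reshape(1, -1))[0, 1]
```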

How can we explain the predictions of a machine learning model? When the data is structured as a multivariate time series, this question induces additional difficulties such as the necessity for the explanation to embody the time dependency and the large number of inputs. To address these challenges, we propose dynamic masks (Dynamask). This method produces instance-wise importance scores for each feature at each time step by fitting a perturbation mask to the input sequence. In order to incorporate the time dependency of the data, Dynamask studies the effects of dynamic perturbation operators. In order to tackle...

10.48550/arxiv.2106.05303 preprint EN cc-by arXiv (Cornell University) 2021-01-01
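
A minimal sketch of the mechanism described above: fitting a perturbation mask over a multivariate time series by gradient descent. The moving-average perturbation operator, the toy model, and the sparsity weight are assumptions and do not reproduce the paper's exact objective.

```python
import torch

torch.manual_seed(0)
T, D = 50, 3                       # time steps, features
x = torch.randn(T, D)              # one input sequence (assumed given)
model = torch.nn.Sequential(torch.nn.Flatten(0), torch.nn.Linear(T * D, 1))  # toy black-box

def perturb(x, mask, window=5):
    """Blend each entry with a local moving average according to the mask."""
    kernel = torch.ones(D, 1, window) / window
    avg = torch.nn.functional.conv1d(
        x.T.unsqueeze(0), kernel, padding=window // 2, groups=D
    ).squeeze(0).T
    return mask * x + (1 - mask) * avg

mask_logits = torch.zeros(T, D, requires_grad=True)
opt = torch.optim.Adam([mask_logits], lr=0.1)
target = model(x).detach()

for _ in range(200):
    mask = torch.sigmoid(mask_logits)
    # Keep the prediction close to the original while masking as much as possible.
    loss = (model(perturb(x, mask)) - target).pow(2).mean() + 0.1 * mask.mean()
    opt.zero_grad(); loss.backward(); opt.step()

importance = torch.sigmoid(mask_logits).detach()   # per-feature, per-step scores
```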

Machine Learning has proved its ability to produce accurate models, but the deployment of these models outside the machine learning community has been hindered by the difficulties of interpreting them. This paper proposes an algorithm that produces a continuous global interpretation of any given continuous black-box function. Our algorithm employs a variation of projection pursuit in which the ridge functions are chosen to be Meijer G-functions, rather than the usual polynomial splines. Because Meijer G-functions are differentiable in their parameters, we can tune...

10.48550/arxiv.2011.08596 preprint EN cc-by arXiv (Cornell University) 2020-01-01
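
The sketch below shows the projection-pursuit structure the abstract refers to, fitting ridge terms one at a time on the residual. Note that it swaps the paper's Meijer G-functions for a placeholder cubic ridge function, so it only illustrates the overall loop, not the symbolic expressiveness; the black-box and sampling scheme are also assumptions.

```python
import torch

torch.manual_seed(0)

# Assumed setup: a black-box function evaluated on sampled inputs.
def black_box(X):
    return torch.sin(X[:, 0]) + X[:, 1] ** 2

X = torch.rand(500, 3) * 2 - 1
y = black_box(X)

# Projection pursuit: approximate y as a sum of ridge terms g_k(w_k . x).
# The paper uses Meijer G-functions as the g_k; a small cubic polynomial
# stands in here as a placeholder differentiable ridge function.
class RidgeTerm(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(dim))
        self.coef = torch.nn.Parameter(torch.randn(4))

    def forward(self, X):
        z = X @ self.w
        return sum(c * z ** i for i, c in enumerate(self.coef))

residual, terms = y.clone(), []
for _ in range(3):                          # add ridge terms one at a time
    term = RidgeTerm(X.shape[1])
    opt = torch.optim.Adam(term.parameters(), lr=0.05)
    for _ in range(500):
        loss = (term(X) - residual).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    residual = residual - term(X).detach()
    terms.append(term)

approx = sum(t(X) for t in terms)           # global surrogate of the black-box
```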

Modern machine learning models are complicated. Most of them rely on convoluted latent representations of their input to issue a prediction. To achieve greater transparency than a black-box that connects inputs to predictions, it is necessary to gain a deeper understanding of these latent representations. To that aim, we propose SimplEx: a user-centred method that provides example-based explanations with reference to a freely selected set of examples, called the corpus. SimplEx uses the corpus to improve the user's understanding of the latent space with post-hoc explanations answering two...

10.48550/arxiv.2110.15355 preprint EN other-oa arXiv (Cornell University) 2021-01-01
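
A minimal sketch of the corpus-decomposition idea behind this abstract: the latent representation of a test example is approximated by a convex combination of corpus latents, and the fitted weights indicate which corpus examples drive the explanation. The encoder and data are placeholders, not the paper's implementation.

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(),
                              torch.nn.Linear(16, 8))   # stand-in latent map

corpus_x = torch.randn(50, 10)        # freely selected reference examples
test_x = torch.randn(1, 10)           # example to explain

with torch.no_grad():
    corpus_h = encoder(corpus_x)      # (50, 8)
    test_h = encoder(test_x)          # (1, 8)

# Fit convex weights so that test_h ~ sum_i w_i * corpus_h_i.
logits = torch.zeros(50, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(300):
    w = torch.softmax(logits, dim=0)
    loss = (w @ corpus_h - test_h.squeeze(0)).pow(2).sum()
    opt.zero_grad(); loss.backward(); opt.step()

weights = torch.softmax(logits, dim=0).detach()
top = torch.topk(weights, k=3)        # corpus examples most responsible
```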

High model performance, on average, can hide that models may systematically underperform on subgroups of the data. We consider the tabular setting, which surfaces the unique issue of outcome heterogeneity - this is prevalent in areas such as healthcare, where patients with similar features can have different outcomes, thus making reliable predictions challenging. To tackle this, we propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes. We do this by analyzing the behavior of individual examples during...

10.48550/arxiv.2210.13043 preprint EN cc-by arXiv (Cornell University) 2022-01-01
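
A minimal sketch of the training-dynamics idea described above, assuming per-checkpoint probabilities on the true label have already been logged during training; the grouping thresholds and group names are illustrative assumptions.

```python
import numpy as np

# Assumed input: probs[e, i] = model probability assigned to example i's true
# label at training checkpoint e (collected while fitting the model).
rng = np.random.default_rng(0)
probs = rng.uniform(size=(20, 1000))            # placeholder checkpoints

confidence = probs.mean(axis=0)                 # average confidence per example
aleatoric = (probs * (1 - probs)).mean(axis=0)  # predictive variability proxy

groups = np.full(probs.shape[1], "Ambiguous", dtype=object)
groups[(confidence >= 0.75) & (aleatoric < 0.2)] = "Easy"
groups[(confidence <= 0.25) & (aleatoric < 0.2)] = "Hard"
# "Ambiguous" examples keep fluctuating: neither confidently right nor wrong.
```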

Ensembles of machine learning models have been well established as a powerful method of improving performance over a single model. Traditionally, ensembling algorithms train their base learners independently or sequentially with the goal of optimizing their joint performance. In the case of deep ensembles of neural networks, we are provided with the opportunity to directly optimize the true objective: the joint performance of the ensemble as a whole. Surprisingly, however, directly minimizing the loss of the ensemble appears to rarely be applied in practice. Instead, most previous research...

10.48550/arxiv.2301.11323 preprint EN cc-by arXiv (Cornell University) 2023-01-01
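
The sketch below spells out the distinction the abstract draws: averaging the members' individual losses versus minimizing the loss of the averaged ensemble prediction directly. The toy linear members and data are assumptions.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
members = [torch.nn.Linear(10, 3) for _ in range(4)]       # toy ensemble of base learners
params = [p for m in members for p in m.parameters()]
opt = torch.optim.Adam(params, lr=1e-2)

x = torch.randn(64, 10)
y = torch.randint(0, 3, (64,))
logits = torch.stack([m(x) for m in members])               # (members, batch, classes)

# (a) Independent-style objective: average the members' individual losses.
loss_independent = torch.stack([F.cross_entropy(l, y) for l in logits]).mean()

# (b) Joint objective: loss of the averaged ensemble prediction as a whole.
ensemble_log_prob = torch.logsumexp(torch.log_softmax(logits, dim=-1), dim=0) - math.log(len(members))
loss_joint = F.nll_loss(ensemble_log_prob, y)

opt.zero_grad(); loss_joint.backward(); opt.step()           # optimize the joint loss
```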

Despite their success with unstructured data, deep neural networks are not yet a panacea for structured tabular data. In the tabular domain, their efficiency crucially relies on various forms of regularization to prevent overfitting and provide strong generalization performance. Existing regularization techniques include broad modelling decisions such as the choice of architecture, loss functions, and optimization methods. In this work, we introduce Tabular Neural Gradient Orthogonalization and Specialization (TANGOS), a novel framework for regularization in...

10.48550/arxiv.2303.05506 preprint EN cc-by arXiv (Cornell University) 2023-01-01
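
A hedged sketch of a gradient orthogonalization and specialization style penalty on latent-neuron attributions, in the spirit of the abstract: each latent neuron's input gradient should be sparse (specialization) and different neurons' gradients should not overlap (orthogonalization). The attribution choice, penalty weights, and toy network are assumptions rather than the paper's exact regularizer.

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Sequential(torch.nn.Linear(10, 8), torch.nn.Tanh())
head = torch.nn.Linear(8, 1)

x = torch.randn(32, 10, requires_grad=True)
h = encoder(x)                                    # latent units (batch, 8)

# Attribution of each latent neuron with respect to the inputs.
attributions = []
for j in range(h.shape[1]):
    g = torch.autograd.grad(h[:, j].sum(), x, create_graph=True)[0]  # (batch, 10)
    attributions.append(g)
A = torch.stack(attributions, dim=1)              # (batch, neurons, features)

# Specialization: each neuron should attend to few input features.
spec = A.abs().mean()

# Orthogonalization: different neurons should attend to different features.
A_norm = torch.nn.functional.normalize(A, dim=-1)
cos = torch.einsum("bif,bjf->bij", A_norm, A_norm)
off_diag = cos - torch.diag_embed(torch.diagonal(cos, dim1=1, dim2=2))
orth = off_diag.abs().mean()

loss = torch.nn.functional.mse_loss(head(h).squeeze(-1), torch.randn(32)) \
       + 0.1 * spec + 0.1 * orth
loss.backward()                                   # one regularized training step
```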

Interpretability methods are valuable only if their explanations faithfully describe the explained model. In this work, we consider neural networks whose predictions are invariant under a specific symmetry group. This includes popular architectures, ranging from convolutional to graph neural networks. Any explanation that faithfully explains this type of model needs to be in agreement with this invariance property. We formalize this intuition through the notions of explanation invariance and equivariance by leveraging the formalism of geometric deep learning. Through...

10.48550/arxiv.2304.06715 preprint EN cc-by arXiv (Cornell University) 2023-01-01
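
A minimal sketch of the kind of check the abstract formalizes, using a toy model that is invariant to cyclic shifts: the gradient explanation of a shifted input should match the shifted explanation of the original input. The model, symmetry group, and metric here are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Toy model that is invariant to cyclic shifts of its input signal:
# a circular 1-D convolution followed by global average pooling.
conv = torch.nn.Conv1d(1, 4, kernel_size=5, padding=2, padding_mode="circular")
model = lambda x: conv(x.unsqueeze(1)).mean(dim=(1, 2))     # (batch,) scalar output

def saliency(x):
    x = x.clone().requires_grad_(True)
    model(x).sum().backward()
    return x.grad.detach()

x = torch.randn(8, 32)                      # batch of 1-D signals
shift = 7
x_shifted = torch.roll(x, shifts=shift, dims=1)

e = saliency(x)
e_shifted = saliency(x_shifted)

# Equivariance check: the explanation of the shifted input should equal the
# shifted explanation of the original input (for this symmetry group).
equivariance_gap = (e_shifted - torch.roll(e, shifts=shift, dims=1)).abs().mean()
print(float(equivariance_gap))              # close to 0 for this invariant model
```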

Estimating personalized effects of treatments is a complex, yet pervasive problem. To tackle it, recent developments in the machine learning (ML) literature on heterogeneous treatment effect estimation gave rise to many sophisticated, but opaque, tools: due to their flexibility, modularity and ability to learn constrained representations, neural networks in particular have become central to this literature. Unfortunately, the assets of such black boxes come at a cost: models typically involve countless...

10.48550/arxiv.2206.08363 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Fourier analysis has been an instrumental tool in the development of signal processing. This leads us to wonder whether this framework could similarly benefit generative modelling. In this paper, we explore this question through the scope of time series diffusion models. More specifically, we analyze whether representing time series in the frequency domain is a useful inductive bias for score-based diffusion models. By starting from the canonical SDE formulation of diffusion in the time domain, we show that a dual diffusion process occurs in the frequency domain with an important nuance: Brownian motions are replaced by what we...

10.48550/arxiv.2402.05933 preprint EN arXiv (Cornell University) 2024-02-08
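
A small numerical illustration of why a time-domain diffusion has a dual process in the frequency domain: the DFT is linear, so noising the series is equivalent to noising its spectrum with the transformed noise. The mirrored-Brownian-motion nuance mentioned above is not reproduced here; this is only the linearity argument.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 4 * np.pi, 128))            # clean time series
noise = rng.normal(scale=0.1, size=x.shape)           # one forward-diffusion step

# Linearity of the DFT: noising in the time domain equals noising in the
# frequency domain with the transformed noise.
lhs = np.fft.rfft(x + noise)
rhs = np.fft.rfft(x) + np.fft.rfft(noise)
assert np.allclose(lhs, rhs)

# The real and imaginary parts of the noise spectrum are coupled by Hermitian
# symmetry, which is where the frequency-domain noise picks up structure.
spectrum_noise = np.fft.rfft(noise)
print(spectrum_noise[:3])
```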

Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models. While recent data-centric methods are able to identify such inconsistencies with respect to the training set, they suffer from two key limitations: (1) suboptimality in settings where features exhibit statistical independencies, due to their usage of compressive representations, and (2) lack of localization to pin-point why a sample might be flagged as inconsistent, which is important to guide future...

10.48550/arxiv.2402.17599 preprint EN arXiv (Cornell University) 2024-02-26

Tabular data is one of the most ubiquitous modalities, yet the literature on tabular generative foundation models is lagging far behind its text and vision counterparts. Creating such a model is hard, due to the heterogeneous feature spaces of different datasets, metadata (e.g. dataset descriptions and header names), and tables lacking prior knowledge (e.g. column order). In this work we propose LaTable: a novel tabular diffusion model that addresses these challenges and can be trained across different datasets. Through extensive experiments we find that LaTable...

10.48550/arxiv.2406.17673 preprint EN arXiv (Cornell University) 2024-06-25

Systematic quantification of data quality is critical for consistent model performance. Prior works have focused on out-of-distribution data. Instead, we tackle an understudied yet equally important problem of characterizing incongruous regions of in-distribution (ID) data, which may arise from feature space heterogeneity. To this end, we propose a paradigm shift with Data-SUITE: a data-centric AI framework to identify these regions, independent of a task-specific model. Data-SUITE leverages copula...

10.48550/arxiv.2202.08836 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Over the past decade, Deep Learning (DL) models have proven to be efficient at classifying remotely sensed Earth Observation (EO) hyperspectral imaging (HSI) data. These models show state-of-the-art performance across various benchmarked data sets by extracting abstract spatial-spectral features using 2D and 3D convolutions. However, the black-box nature of DL models hinders explanation, limits trust, and underscores the need for profound insights beyond raw performance metrics. In this contribution, we implement a...

10.1109/igarss52108.2023.10282988 article EN IGARSS 2023 - 2023 IEEE International Geoscience and Remote Sensing Symposium 2023-07-16

What distinguishes robust models from non-robust ones? This question has gained traction with the appearance of large-scale multimodal models, such as CLIP. These models have demonstrated unprecedented robustness with respect to natural distribution shifts. While it has been shown that such differences in robustness can be traced back to differences in training data, so far it is not known what that translates to in terms of what the model has learned. In this work, we bridge this gap by probing the representation spaces of 12 robust multimodal models with various backbones (ResNets and ViTs) and pretraining sets...

10.48550/arxiv.2310.13040 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Data quality is crucial for robust machine learning algorithms, with the recent interest in data-centric AI emphasizing the importance of training data characterization. However, current characterization methods are largely focused on classification settings, with regression settings left understudied. To address this, we introduce TRIAGE, a novel framework tailored to regression tasks and compatible with a broad class of regressors. TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method,...

10.48550/arxiv.2310.18970 preprint EN cc-by arXiv (Cornell University) 2023-01-01
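
A hedged sketch of model-agnostic scoring for regression examples in the spirit of this abstract; it substitutes a plain split-conformal residual distribution for the paper's conformal predictive distributions, and the model, data, and flagging thresholds are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = X[:, 0] + 0.5 * rng.normal(size=600)               # toy regression data

X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)

# Calibration residuals define an empirical predictive distribution around
# each point prediction (a stand-in for conformal predictive distributions).
cal_residuals = np.sort(y_cal - model.predict(X_cal))

def percentile_of_label(x, y_true):
    """Where the observed label falls in the predictive distribution for x."""
    pred = model.predict(x.reshape(1, -1))[0]
    return np.searchsorted(cal_residuals, y_true - pred) / len(cal_residuals)

# Examples whose labels sit in the extreme tails are candidates for review.
scores = np.array([percentile_of_label(x, t) for x, t in zip(X_fit, y_fit)])
flagged = np.where((scores < 0.05) | (scores > 0.95))[0]
```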

Unsupervised black-box models are challenging to interpret. Indeed, most existing explainability methods require labels to select which component(s) of the black-box's output to interpret. In the absence of labels, black-box outputs are often representation vectors whose components do not correspond to any meaningful quantity. Hence, choosing which component(s) to interpret in a label-free unsupervised/self-supervised setting is an important, yet unsolved problem. To bridge this gap in the literature, we introduce two crucial extensions of post-hoc explanation...

10.48550/arxiv.2203.01928 preprint EN cc-by arXiv (Cornell University) 2022-01-01
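
A minimal sketch of one way to extend gradient attributions to the label-free setting described above: attribute the inner product between the representation and a detached copy of itself, so no label or output component has to be chosen. The encoder and the gradient-times-input attribution are placeholders, not necessarily the paper's exact construction.

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Sequential(torch.nn.Linear(20, 32), torch.nn.ReLU(),
                              torch.nn.Linear(32, 16))    # stand-in unsupervised model

x = torch.randn(1, 20, requires_grad=True)
h = encoder(x)                                            # representation vector

# Label-free importance: attribute the scalar <h(x), h(x)> (with one factor
# detached) instead of a class logit, so no label is needed to pick an output.
scalar = (h * h.detach()).sum()
scalar.backward()

feature_importance = (x.grad * x).detach().squeeze(0)     # gradient x input scores
top_features = torch.topk(feature_importance.abs(), k=5).indices
```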