Jonathan Crabbé

ORCID: 0000-0002-0341-7712
Research Areas
  • Explainable Artificial Intelligence (XAI)
  • Machine Learning in Healthcare
  • Machine Learning and Data Classification
  • Anomaly Detection Techniques and Applications
  • Topic Modeling
  • Machine Learning in Materials Science
  • Neural Networks and Applications
  • Time Series Analysis and Forecasting
  • Adversarial Robustness in Machine Learning
  • Model Reduction and Neural Networks
  • Domain Adaptation and Few-Shot Learning
  • Reservoir Engineering and Simulation Methods
  • Advanced Neural Network Applications
  • Catalysis and Oxidation Reactions
  • Advanced Database Systems and Queries
  • Data Stream Mining Techniques
  • Radiomics and Machine Learning in Medical Imaging
  • Modular Robots and Swarm Intelligence
  • Advanced Causal Inference Techniques
  • Advanced Image Fusion Techniques
  • Remote-Sensing Image Classification
  • 3D Modeling in Geospatial Applications
  • Geochemistry and Geologic Mapping
  • Gaussian Processes and Bayesian Inference
  • Generative Adversarial Networks and Image Synthesis

Affiliations
Microsoft Research (United Kingdom)
2025

University of Cambridge
2020-2023

Publications

The design of functional materials with desired properties is essential in driving technological advances in areas like energy storage, catalysis, and carbon capture. Generative models provide a new paradigm for materials design by directly generating entirely novel materials given property constraints. Despite recent progress, current generative models have a low success rate in proposing stable crystals, or can only satisfy a very limited set of property constraints. Here, we present MatterGen, a model that generates stable, diverse inorganic materials across the...

10.48550/arxiv.2312.03687 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Concept-based explanations permit us to understand the predictions of a deep neural network (DNN) through the lens of concepts specified by users. Existing methods assume that the examples illustrating a concept are mapped in a fixed direction of the DNN's latent space. When this holds true, the concept can be represented by a concept activation vector (CAV) pointing in that direction. In this work, we propose to relax this assumption by allowing concept examples to be scattered across different clusters in the DNN's latent space. Each concept is then represented by a region of the latent space that includes these clusters, which we call a concept activation region (CAR). To formalize this idea,...

10.48550/arxiv.2209.11222 preprint EN cc-by arXiv (Cornell University) 2022-01-01
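
As a rough illustration of the contrast drawn in this abstract, the sketch below fits a linear concept classifier (a CAV-style direction) and a nonlinear kernel classifier (a CAR-style region) on latent activations. The synthetic latents and the `concept_score` helper are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Assumed inputs (hypothetical): latent representations of examples that do /
# do not illustrate the concept, extracted from some layer of the DNN.
rng = np.random.default_rng(0)
concept_latents = rng.normal(loc=1.0, size=(100, 32))       # positive examples
non_concept_latents = rng.normal(loc=-1.0, size=(100, 32))   # negative examples

H = np.vstack([concept_latents, non_concept_latents])
y = np.concatenate([np.ones(100), np.zeros(100)])

# CAV-style: a single direction separating concept from non-concept examples.
linear_clf = LogisticRegression(max_iter=1000).fit(H, y)
cav = linear_clf.coef_[0] / np.linalg.norm(linear_clf.coef_[0])

# CAR-style: a nonlinear classifier whose positive region can cover several
# clusters of concept examples rather than a single half-space.
region_clf = SVC(kernel="rbf", probability=True).fit(H, y)

def concept_score(h):
    """Probability that a latent vector h falls inside the concept region."""
    return region_clf.predict_proba(h.reshape(1, -1))[0, 1]
```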

How can we explain the predictions of a machine learning model? When the data is structured as a multivariate time series, this question induces additional difficulties such as the necessity for the explanation to embody the time dependency and the large number of inputs. To address these challenges, we propose dynamic masks (Dynamask). This method produces instance-wise importance scores for each feature at each time step by fitting a perturbation mask to the input sequence. In order to incorporate the time dependency of the data, Dynamask studies the effects of dynamic perturbation operators. In order to tackle...

10.48550/arxiv.2106.05303 preprint EN cc-by arXiv (Cornell University) 2021-01-01
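
A minimal sketch of the mechanism described above: fitting a perturbation mask over a multivariate time series by gradient descent. The moving-average perturbation operator, the toy model, and the sparsity weight are assumptions and do not reproduce the paper's exact objective.

```python
import torch

torch.manual_seed(0)
T, D = 50, 3                       # time steps, features
x = torch.randn(T, D)              # one input sequence (assumed given)
model = torch.nn.Sequential(torch.nn.Flatten(0), torch.nn.Linear(T * D, 1))  # toy black-box

def perturb(x, mask, window=5):
    """Blend each entry with a local moving average according to the mask."""
    kernel = torch.ones(D, 1, window) / window
    avg = torch.nn.functional.conv1d(
        x.T.unsqueeze(0), kernel, padding=window // 2, groups=D
    ).squeeze(0).T
    return mask * x + (1 - mask) * avg

mask_logits = torch.zeros(T, D, requires_grad=True)
opt = torch.optim.Adam([mask_logits], lr=0.1)
target = model(x).detach()

for _ in range(200):
    mask = torch.sigmoid(mask_logits)
    # Keep the prediction close to the original while masking as much as possible.
    loss = (model(perturb(x, mask)) - target).pow(2).mean() + 0.1 * mask.mean()
    opt.zero_grad(); loss.backward(); opt.step()

importance = torch.sigmoid(mask_logits).detach()   # per-feature, per-step scores
```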

Machine Learning has proved its ability to produce accurate models, but the deployment of these models outside the machine learning community has been hindered by the difficulties of interpreting them. This paper proposes an algorithm that produces a continuous global interpretation of any given continuous black-box function. Our algorithm employs a variation of projection pursuit in which the ridge functions are chosen to be Meijer G-functions, rather than the usual polynomial splines. Because Meijer G-functions are differentiable in their parameters, we can tune...

10.48550/arxiv.2011.08596 preprint EN cc-by arXiv (Cornell University) 2020-01-01
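
The sketch below shows the projection-pursuit structure the abstract refers to, fitting ridge terms one at a time on the residual. Note that it swaps the paper's Meijer G-functions for a placeholder cubic ridge function, so it only illustrates the overall loop, not the symbolic expressiveness; the black-box and sampling scheme are also assumptions.

```python
import torch

torch.manual_seed(0)

# Assumed setup: a black-box function evaluated on sampled inputs.
def black_box(X):
    return torch.sin(X[:, 0]) + X[:, 1] ** 2

X = torch.rand(500, 3) * 2 - 1
y = black_box(X)

# Projection pursuit: approximate y as a sum of ridge terms g_k(w_k . x).
# The paper uses Meijer G-functions as the g_k; a small cubic polynomial
# stands in here as a placeholder differentiable ridge function.
class RidgeTerm(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(dim))
        self.coef = torch.nn.Parameter(torch.randn(4))

    def forward(self, X):
        z = X @ self.w
        return sum(c * z ** i for i, c in enumerate(self.coef))

residual, terms = y.clone(), []
for _ in range(3):                          # add ridge terms one at a time
    term = RidgeTerm(X.shape[1])
    opt = torch.optim.Adam(term.parameters(), lr=0.05)
    for _ in range(500):
        loss = (term(X) - residual).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    residual = residual - term(X).detach()
    terms.append(term)

approx = sum(t(X) for t in terms)           # global surrogate of the black-box
```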

Modern machine learning models are complicated. Most of them rely on convoluted latent representations of their input to issue a prediction. To achieve greater transparency than a black-box that connects inputs to predictions, it is necessary to gain a deeper understanding of these latent representations. To that aim, we propose SimplEx: a user-centred method that provides example-based explanations with reference to a freely selected set of examples, called the corpus. SimplEx uses the corpus to improve the user's understanding of the latent space with post-hoc explanations answering two...

10.48550/arxiv.2110.15355 preprint EN other-oa arXiv (Cornell University) 2021-01-01
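
A minimal sketch of the corpus-decomposition idea behind this abstract: the latent representation of a test example is approximated by a convex combination of corpus latents, and the fitted weights indicate which corpus examples drive the explanation. The encoder and data are placeholders, not the paper's implementation.

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(),
                              torch.nn.Linear(16, 8))   # stand-in latent map

corpus_x = torch.randn(50, 10)        # freely selected reference examples
test_x = torch.randn(1, 10)           # example to explain

with torch.no_grad():
    corpus_h = encoder(corpus_x)      # (50, 8)
    test_h = encoder(test_x)          # (1, 8)

# Fit convex weights so that test_h ~ sum_i w_i * corpus_h_i.
logits = torch.zeros(50, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(300):
    w = torch.softmax(logits, dim=0)
    loss = (w @ corpus_h - test_h.squeeze(0)).pow(2).sum()
    opt.zero_grad(); loss.backward(); opt.step()

weights = torch.softmax(logits, dim=0).detach()
top = torch.topk(weights, k=3)        # corpus examples most responsible
```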

High model performance, on average, can hide that models may systematically underperform on subgroups of the data. We consider the tabular setting, which surfaces the unique issue of outcome heterogeneity - this is prevalent in areas such as healthcare, where patients with similar features can have different outcomes, thus making reliable predictions challenging. To tackle this, we propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes. We do this by analyzing the behavior of individual examples during...

10.48550/arxiv.2210.13043 preprint EN cc-by arXiv (Cornell University) 2022-01-01
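
A minimal sketch of the training-dynamics idea described above, assuming per-checkpoint probabilities on the true label have already been logged during training; the grouping thresholds and group names are illustrative assumptions.

```python
import numpy as np

# Assumed input: probs[e, i] = model probability assigned to example i's true
# label at training checkpoint e (collected while fitting the model).
rng = np.random.default_rng(0)
probs = rng.uniform(size=(20, 1000))            # placeholder checkpoints

confidence = probs.mean(axis=0)                 # average confidence per example
aleatoric = (probs * (1 - probs)).mean(axis=0)  # predictive variability proxy

groups = np.full(probs.shape[1], "Ambiguous", dtype=object)
groups[(confidence >= 0.75) & (aleatoric < 0.2)] = "Easy"
groups[(confidence <= 0.25) & (aleatoric < 0.2)] = "Hard"
# "Ambiguous" examples keep fluctuating: neither confidently right nor wrong.
```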

Ensembles of machine learning models have been well established as a powerful method of improving performance over a single model. Traditionally, ensembling algorithms train their base learners independently or sequentially with the goal of optimizing their joint performance. In the case of deep ensembles of neural networks, we are provided with the opportunity to directly optimize the true objective: the joint performance of the ensemble as a whole. Surprisingly, however, directly minimizing the loss of the ensemble appears to rarely be applied in practice. Instead, most previous research...

10.48550/arxiv.2301.11323 preprint EN cc-by arXiv (Cornell University) 2023-01-01
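
The sketch below spells out the distinction the abstract draws: averaging the members' individual losses versus minimizing the loss of the averaged ensemble prediction directly. The toy linear members and data are assumptions.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
members = [torch.nn.Linear(10, 3) for _ in range(4)]       # toy ensemble of base learners
params = [p for m in members for p in m.parameters()]
opt = torch.optim.Adam(params, lr=1e-2)

x = torch.randn(64, 10)
y = torch.randint(0, 3, (64,))
logits = torch.stack([m(x) for m in members])               # (members, batch, classes)

# (a) Independent-style objective: average the members' individual losses.
loss_independent = torch.stack([F.cross_entropy(l, y) for l in logits]).mean()

# (b) Joint objective: loss of the averaged ensemble prediction as a whole.
ensemble_log_prob = torch.logsumexp(torch.log_softmax(logits, dim=-1), dim=0) - math.log(len(members))
loss_joint = F.nll_loss(ensemble_log_prob, y)

opt.zero_grad(); loss_joint.backward(); opt.step()           # optimize the joint loss
```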

Despite their success with unstructured data, deep neural networks are not yet a panacea for structured tabular data. In the tabular domain, their efficiency crucially relies on various forms of regularization to prevent overfitting and provide strong generalization performance. Existing regularization techniques include broad modelling decisions such as the choice of architecture, loss functions, and optimization methods. In this work, we introduce Tabular Neural Gradient Orthogonalization and Specialization (TANGOS), a novel framework for regularization in...

10.48550/arxiv.2303.05506 preprint EN cc-by arXiv (Cornell University) 2023-01-01
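
A hedged sketch of a gradient orthogonalization and specialization style penalty on latent-neuron attributions, in the spirit of the abstract: each latent neuron's input gradient should be sparse (specialization) and different neurons' gradients should not overlap (orthogonalization). The attribution choice, penalty weights, and toy network are assumptions rather than the paper's exact regularizer.

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Sequential(torch.nn.Linear(10, 8), torch.nn.Tanh())
head = torch.nn.Linear(8, 1)

x = torch.randn(32, 10, requires_grad=True)
h = encoder(x)                                    # latent units (batch, 8)

# Attribution of each latent neuron with respect to the inputs.
attributions = []
for j in range(h.shape[1]):
    g = torch.autograd.grad(h[:, j].sum(), x, create_graph=True)[0]  # (batch, 10)
    attributions.append(g)
A = torch.stack(attributions, dim=1)              # (batch, neurons, features)

# Specialization: each neuron should attend to few input features.
spec = A.abs().mean()

# Orthogonalization: different neurons should attend to different features.
A_norm = torch.nn.functional.normalize(A, dim=-1)
cos = torch.einsum("bif,bjf->bij", A_norm, A_norm)
off_diag = cos - torch.diag_embed(torch.diagonal(cos, dim1=1, dim2=2))
orth = off_diag.abs().mean()

loss = torch.nn.functional.mse_loss(head(h).squeeze(-1), torch.randn(32)) \
       + 0.1 * spec + 0.1 * orth
loss.backward()                                   # one regularized training step
```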

Interpretability methods are valuable only if their explanations faithfully describe the explained model. In this work, we consider neural networks whose predictions are invariant under a specific symmetry group. This includes popular architectures, ranging from convolutional to graph neural networks. Any explanation that faithfully explains this type of model needs to be in agreement with this invariance property. We formalize this intuition through the notions of explanation invariance and equivariance by leveraging the formalism of geometric deep learning. Through...

10.48550/arxiv.2304.06715 preprint EN cc-by arXiv (Cornell University) 2023-01-01
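
A minimal sketch of the kind of check the abstract formalizes, using a toy model that is invariant to cyclic shifts: the gradient explanation of a shifted input should match the shifted explanation of the original input. The model, symmetry group, and metric here are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Toy model that is invariant to cyclic shifts of its input signal:
# a circular 1-D convolution followed by global average pooling.
conv = torch.nn.Conv1d(1, 4, kernel_size=5, padding=2, padding_mode="circular")
model = lambda x: conv(x.unsqueeze(1)).mean(dim=(1, 2))     # (batch,) scalar output

def saliency(x):
    x = x.clone().requires_grad_(True)
    model(x).sum().backward()
    return x.grad.detach()

x = torch.randn(8, 32)                      # batch of 1-D signals
shift = 7
x_shifted = torch.roll(x, shifts=shift, dims=1)

e = saliency(x)
e_shifted = saliency(x_shifted)

# Equivariance check: the explanation of the shifted input should equal the
# shifted explanation of the original input (for this symmetry group).
equivariance_gap = (e_shifted - torch.roll(e, shifts=shift, dims=1)).abs().mean()
print(float(equivariance_gap))              # close to 0 for this invariant model
```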

Estimating personalized effects of treatments is a complex, yet pervasive problem. To tackle it, recent developments in the machine learning (ML) literature on heterogeneous treatment effect estimation gave rise to many sophisticated, but opaque, tools: due to their flexibility, modularity and ability to learn constrained representations, neural networks in particular have become central to this literature. Unfortunately, the assets of such black boxes come at a cost: models typically involve countless...

10.48550/arxiv.2206.08363 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Fourier analysis has been an instrumental tool in the development of signal processing. This leads us to wonder whether this framework could similarly benefit generative modelling. In this paper, we explore this question through the scope of time series diffusion models. More specifically, we analyze whether representing time series in the frequency domain is a useful inductive bias for score-based diffusion models. By starting from the canonical SDE formulation of diffusion in the time domain, we show that a dual diffusion process occurs in the frequency domain with an important nuance: Brownian motions are replaced by what we...

10.48550/arxiv.2402.05933 preprint EN arXiv (Cornell University) 2024-02-08
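
A small numerical illustration of why a time-domain diffusion has a dual process in the frequency domain: the DFT is linear, so noising the series is equivalent to noising its spectrum with the transformed noise. The mirrored-Brownian-motion nuance mentioned above is not reproduced here; this is only the linearity argument.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 4 * np.pi, 128))            # clean time series
noise = rng.normal(scale=0.1, size=x.shape)           # one forward-diffusion step

# Linearity of the DFT: noising in the time domain equals noising in the
# frequency domain with the transformed noise.
lhs = np.fft.rfft(x + noise)
rhs = np.fft.rfft(x) + np.fft.rfft(noise)
assert np.allclose(lhs, rhs)

# The real and imaginary parts of the noise spectrum are coupled by Hermitian
# symmetry, which is where the frequency-domain noise picks up structure.
spectrum_noise = np.fft.rfft(noise)
print(spectrum_noise[:3])
```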

Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models. While recent data-centric methods are able to identify such inconsistencies with respect to the training set, they suffer from two key limitations: (1) suboptimality in settings where features exhibit statistical independencies, due to their usage of compressive representations, and (2) lack of localization to pin-point why a sample might be flagged as inconsistent, which is important to guide future...

10.48550/arxiv.2402.17599 preprint EN arXiv (Cornell University) 2024-02-26

Tabular data is one of the most ubiquitous modalities, yet the literature on tabular generative foundation models is lagging far behind its text and vision counterparts. Creating such a model is hard, due to the heterogeneous feature spaces of different datasets, metadata (e.g. dataset descriptions and header names), and tables lacking prior knowledge (e.g. column order). In this work we propose LaTable: a novel tabular diffusion model that addresses these challenges and can be trained across different datasets. Through extensive experiments we find that LaTable...

10.48550/arxiv.2406.17673 preprint EN arXiv (Cornell University) 2024-06-25

Systematic quantification of data quality is critical for consistent model performance. Prior works have focused on out-of-distribution data. Instead, we tackle an understudied yet equally important problem of characterizing incongruous regions of in-distribution (ID) data, which may arise from feature space heterogeneity. To this end, we propose a paradigm shift with Data-SUITE: a data-centric AI framework to identify these regions, independent of a task-specific model. Data-SUITE leverages copula...

10.48550/arxiv.2202.08836 preprint EN cc-by arXiv (Cornell University) 2022-01-01

Over the past decade, Deep Learning (DL) models have proven to be efficient at classifying remotely sensed Earth Observation (EO) hyperspectral imaging (HSI) data. These models show state-of-the-art performance across various benchmarked data sets by extracting abstract spatial-spectral features using 2D and 3D convolutions. However, the black-box nature of DL models hinders explanation, limits trust, and underscores the need for profound insights beyond raw performance metrics. In this contribution, we implement a...

10.1109/igarss52108.2023.10282988 article EN IGARSS 2023 - 2023 IEEE International Geoscience and Remote Sensing Symposium 2023-07-16

What distinguishes robust models from non-robust ones? This question has gained traction with the appearance of large-scale multimodal models, such as CLIP. These models have demonstrated unprecedented robustness with respect to natural distribution shifts. While it has been shown that such differences in robustness can be traced back to differences in training data, so far it is not known what that translates to in terms of what the model has learned. In this work, we bridge this gap by probing the representation spaces of 12 robust multimodal models with various backbones (ResNets and ViTs) and pretraining sets...

10.48550/arxiv.2310.13040 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Data quality is crucial for robust machine learning algorithms, with the recent interest in data-centric AI emphasizing the importance of training data characterization. However, current characterization methods are largely focused on classification settings, with regression settings left understudied. To address this, we introduce TRIAGE, a novel framework tailored to regression tasks and compatible with a broad class of regressors. TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method,...

10.48550/arxiv.2310.18970 preprint EN cc-by arXiv (Cornell University) 2023-01-01
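
A hedged sketch of model-agnostic scoring for regression examples in the spirit of this abstract; it substitutes a plain split-conformal residual distribution for the paper's conformal predictive distributions, and the model, data, and flagging thresholds are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = X[:, 0] + 0.5 * rng.normal(size=600)               # toy regression data

X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)

# Calibration residuals define an empirical predictive distribution around
# each point prediction (a stand-in for conformal predictive distributions).
cal_residuals = np.sort(y_cal - model.predict(X_cal))

def percentile_of_label(x, y_true):
    """Where the observed label falls in the predictive distribution for x."""
    pred = model.predict(x.reshape(1, -1))[0]
    return np.searchsorted(cal_residuals, y_true - pred) / len(cal_residuals)

# Examples whose labels sit in the extreme tails are candidates for review.
scores = np.array([percentile_of_label(x, t) for x, t in zip(X_fit, y_fit)])
flagged = np.where((scores < 0.05) | (scores > 0.95))[0]
```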

Unsupervised black-box models are challenging to interpret. Indeed, most existing explainability methods require labels to select which component(s) of the black-box's output to interpret. In the absence of labels, black-box outputs are often representation vectors whose components do not correspond to any meaningful quantity. Hence, choosing which component(s) to interpret in a label-free unsupervised/self-supervised setting is an important, yet unsolved problem. To bridge this gap in the literature, we introduce two crucial extensions of post-hoc explanation...

10.48550/arxiv.2203.01928 preprint EN cc-by arXiv (Cornell University) 2022-01-01
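
A minimal sketch of one way to extend gradient attributions to the label-free setting described above: attribute the inner product between the representation and a detached copy of itself, so no label or output component has to be chosen. The encoder and the gradient-times-input attribution are placeholders, not necessarily the paper's exact construction.

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Sequential(torch.nn.Linear(20, 32), torch.nn.ReLU(),
                              torch.nn.Linear(32, 16))    # stand-in unsupervised model

x = torch.randn(1, 20, requires_grad=True)
h = encoder(x)                                            # representation vector

# Label-free importance: attribute the scalar <h(x), h(x)> (with one factor
# detached) instead of a class logit, so no label is needed to pick an output.
scalar = (h * h.detach()).sum()
scalar.backward()

feature_importance = (x.grad * x).detach().squeeze(0)     # gradient x input scores
top_features = torch.topk(feature_importance.abs(), k=5).indices
```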