Boris van Breugel

ORCID: 0009-0006-5125-0028
Research Areas
  • Privacy-Preserving Technologies in Data
  • Explainable Artificial Intelligence (XAI)
  • Topic Modeling
  • Ethics and Social Impacts of AI
  • Scientific Computing and Data Management
  • Machine Learning in Healthcare
  • Computational and Text Analysis Methods
  • Natural Language Processing Techniques
  • Hate Speech and Cyberbullying Detection
  • Radiomics and Machine Learning in Medical Imaging
  • Generative Adversarial Networks and Image Synthesis
  • Adversarial Robustness in Machine Learning
  • Advanced Data Storage Technologies
  • Mathematics and Applications
  • Software Engineering Research
  • Data Quality and Management
  • Medical Image Segmentation Techniques
  • Gene expression and cancer classification
  • Data Stream Mining Techniques
  • Cell Image Analysis Techniques
  • Surface and Thin Film Phenomena
  • COVID-19 diagnosis using AI
  • Molecular Junctions and Nanostructures
  • History and Theory of Mathematics
  • AI in cancer detection

University of Cambridge
2021-2024

University College London
2021

Daniel de Vassimon Manela, David Errington, Thomas Fisher, Boris van Breugel, Pasquale Minervini. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021.

10.18653/v1/2021.eacl-main.190 article EN cc-by 2021-01-01

Devising domain- and model-agnostic evaluation metrics for generative models is an important and as yet unresolved problem. Most existing metrics, which were tailored solely to the image synthesis setup, exhibit a limited capacity for diagnosing the different modes of failure of generative models across broader application domains. In this paper, we introduce a 3-dimensional metric, ($\alpha$-Precision, $\beta$-Recall, Authenticity), that characterizes the fidelity, diversity and generalization performance of any generative model in...

10.48550/arxiv.2102.08921 preprint EN other-oa arXiv (Cornell University) 2021-01-01
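A minimal sketch of the Authenticity axis, assuming a nearest-neighbour reading of generalization; the `authenticity` helper below is a hypothetical simplification for illustration, not the paper's exact estimator.

```python
# Sketch of a nearest-neighbour authenticity check for synthetic data
# (hedged simplification of the idea named in the abstract).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def authenticity(real, synthetic):
    # a synthetic point looks memorized if it sits closer to its nearest
    # real neighbour than that real point's own nearest other real point
    d_real, _ = NearestNeighbors(n_neighbors=2).fit(real).kneighbors(real)
    d_real = d_real[:, 1]                       # index 0 is the self-match
    d_syn, idx = NearestNeighbors(n_neighbors=1).fit(real).kneighbors(synthetic)
    return float(np.mean(d_syn[:, 0] > d_real[idx[:, 0]]))

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
print(authenticity(real, rng.normal(size=(500, 8))))  # ~0.5 for fresh samples
print(authenticity(real, real + 1e-3))                # ~0 for near-copies
```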

Data is the foundation of most science. Unfortunately, sharing data can be obstructed by the risk of violating data privacy, impeding research in fields like healthcare. Synthetic data is a potential solution. It aims to generate data that has the same distribution as the original data, but that does not disclose information about individuals. Membership Inference Attacks (MIAs) are a common privacy attack, in which the attacker attempts to determine whether a particular real sample was used for training of the model. Previous works that propose MIAs...

10.48550/arxiv.2302.12580 preprint EN cc-by arXiv (Cornell University) 2023-01-01
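A hedged sketch of the simplest distance-based MIA against synthetic data; the scoring rule below is illustrative and not necessarily the attack studied in the paper.

```python
# Toy distance-based Membership Inference Attack on synthetic data.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mia_scores(synthetic, candidates):
    # smaller distance to the closest synthetic record => stronger
    # evidence the candidate was in the generator's training set
    d, _ = NearestNeighbors(n_neighbors=1).fit(synthetic).kneighbors(candidates)
    return -d[:, 0]

rng = np.random.default_rng(1)
train = rng.normal(size=(200, 5))
holdout = rng.normal(size=(200, 5))
leaky_synth = train + 0.05 * rng.normal(size=train.shape)  # overfit generator
scores = mia_scores(leaky_synth, np.vstack([train, holdout]))
print("members:", scores[:200].mean().round(3),
      "non-members:", scores[200:].mean().round(3))
```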

Generating synthetic data through generative models is gaining interest in the ML community and beyond. In the past, synthetic data was often regarded as a means to private data release, but a surge of recent papers explore how its potential reaches much further than this -- from creating more fair data to data augmentation, and from simulation to text generated by ChatGPT. In this perspective we explore whether, and how, synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs. Just as importantly, we discuss...

10.48550/arxiv.2304.03722 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Machine learning models have been criticized for reflecting unfair biases in the training data. Instead of solving this by introducing fair learning algorithms directly, we focus on generating fair synthetic data, such that any downstream learner is fair. Generating fair synthetic data from unfair data - while remaining truthful to the underlying data-generating process (DGP) - is non-trivial. In this paper, we introduce DECAF: a GAN-based fair synthetic data generator for tabular data. With DECAF we embed the DGP explicitly as a structural causal model in the input layers of the generator, allowing...

10.48550/arxiv.2110.12884 preprint EN cc-by arXiv (Cornell University) 2021-01-01
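A toy sketch of the core idea, generating features in the causal order of an assumed DAG so that a biased edge can be dropped at generation time; linear mechanisms stand in here for the paper's learned generator layers.

```python
# DAG-ordered generation in the spirit of DECAF (hedged simplification).
import numpy as np

rng = np.random.default_rng(2)

# assumed DAG: A (protected) -> X -> Y, plus a direct biased edge A -> Y
def generate(n, drop_edges=frozenset()):
    a = rng.binomial(1, 0.5, n).astype(float)
    x = 0.8 * a + rng.normal(size=n)
    direct = 0.0 if ("A", "Y") in drop_edges else 0.7
    y = 0.5 * x + direct * a + rng.normal(size=n)
    return np.column_stack([a, x, y])

biased = generate(100_000)
fair = generate(100_000, drop_edges=frozenset({("A", "Y")}))
for name, d in [("biased", biased), ("fair", fair)]:
    gap = d[d[:, 0] == 1, 2].mean() - d[d[:, 0] == 0, 2].mean()
    # dropping A->Y shrinks the group gap; the indirect path via X remains
    print(name, round(gap, 2))
```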

Generating synthetic data through generative models is gaining interest in the ML community and beyond, promising a future where datasets can be tailored to individual needs. Unfortunately, synthetic data is usually not perfect, resulting in potential errors in downstream tasks. In this work we explore how the generative process affects the downstream ML task. We show that the naive approach -- using synthetic data as if it is real -- leads to downstream models and analyses that do not generalize well to real data. As a first step towards better ML in the synthetic data regime, we introduce the Deep Generative Ensemble (DGE), a framework inspired by...

10.48550/arxiv.2305.09235 preprint EN cc-by arXiv (Cornell University) 2023-01-01
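A minimal ensemble-style sketch under stated assumptions: per-class Gaussian "generators" stand in for deep models, and the ensemble averages downstream models trained on each synthetic dataset.

```python
# Deep-generative-ensemble-style sketch (hypothetical stand-in generators).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def fit_generator(X, y):
    # moment-matched Gaussian per class
    return {c: (X[y == c].mean(0), X[y == c].std(0) + 1e-6) for c in (0, 1)}

def sample(gen, n_per_class):
    Xs, ys = [], []
    for c, (mu, sd) in gen.items():
        Xs.append(rng.normal(mu, sd, size=(n_per_class, mu.size)))
        ys += [c] * n_per_class
    return np.vstack(Xs), np.array(ys)

models = []
for _ in range(5):                            # K generators on bootstraps
    idx = rng.integers(0, len(X), len(X))
    Xs, ys = sample(fit_generator(X[idx], y[idx]), 200)
    models.append(LogisticRegression().fit(Xs, ys))

X_test = rng.normal(size=(5, 4))
p = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
print(p.round(2))                             # ensemble-averaged predictions
```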

Recent text and image foundation models are incredibly impressive, and these models are attracting an ever-increasing portion of research resources. In this position piece we aim to shift the ML community's priorities ever so slightly to a different modality: tabular data. Tabular data is the dominant modality in many fields, yet it is given hardly any research attention and significantly lags behind in terms of scale and power. We believe the time is now to start developing tabular foundation models, or what we coin a Large Tabular Model (LTM). LTMs could revolutionise the way...

10.48550/arxiv.2405.01147 preprint EN arXiv (Cornell University) 2024-05-02

Tabular data is one of the most ubiquitous modalities, yet the literature on tabular generative foundation models is lagging far behind its text and vision counterparts. Creating such a model is hard, due to the heterogeneous feature spaces of different datasets, the absence of metadata (e.g. dataset descriptions and headers), and tables lacking prior knowledge (e.g. feature order). In this work we propose LaTable: a novel diffusion model that addresses these challenges and can be trained across datasets. Through extensive experiments we find that LaTable...

10.48550/arxiv.2406.17673 preprint EN arXiv (Cornell University) 2024-06-25

Evaluating the performance of machine learning models on diverse and underrepresented subgroups is essential for ensuring fairness and reliability in real-world applications. However, accurately assessing model performance becomes challenging due to two main issues: (1) a scarcity of test data, especially for small subgroups, and (2) possible distributional shifts in the model's deployment setting, which may not align with the available test data. In this work, we introduce 3S Testing, a deep generative modeling framework to facilitate...

10.48550/arxiv.2310.16524 preprint EN cc-by arXiv (Cornell University) 2023-01-01
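A hedged sketch of the underlying idea: a simple conditional generator (a Gaussian fit standing in for the paper's deep generative model) synthesizes extra test cases for a scarce subgroup; the labels are only available here because the toy data-generating process is known.

```python
# Synthetic-test-data sketch in the spirit of 3S Testing (toy setup).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
g = rng.binomial(1, 0.03, 1000)               # rare subgroup indicator
X[g == 1] += 1.5                              # subgroup is distribution-shifted
y = (X[:, 0] > 0.5).astype(int)
model = LogisticRegression().fit(X, y)

X_sub = X[g == 1]                             # only a handful of real points
mu, sd = X_sub.mean(0), X_sub.std(0) + 1e-6
X_syn = rng.normal(mu, sd, size=(5000, 3))    # synthetic subgroup test set
y_syn = (X_syn[:, 0] > 0.5).astype(int)       # oracle labels (toy DGP only)
print("real subgroup n =", len(X_sub),
      "| synthetic-test accuracy =", round(model.score(X_syn, y_syn), 3))
```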

This paper proposes two intuitive metrics, skew and stereotype, that quantify and analyse the gender bias present in contextual language models when tackling the WinoBias pronoun resolution task. We find evidence that gender stereotype correlates approximately negatively with gender skew in out-of-the-box models, suggesting that there is a trade-off between these two forms of bias. We investigate two methods to mitigate bias. The first approach is an online method which is effective at removing skew at the expense of stereotype. The second, inspired by previous work on...

10.48550/arxiv.2101.09688 preprint EN other-oa arXiv (Cornell University) 2021-01-01
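A toy sketch of how skew- and stereotype-style scores could be computed from model preference scores on WinoBias-style pairs; the definitions below are assumptions distilled from the abstract, not the paper's exact metrics.

```python
# Hypothetical skew/stereotype computation on WinoBias-style scores.
import numpy as np

# rows = pronoun-resolution examples; columns = model scores for resolving
# the pronoun as (male, female); stereo_is_male marks the stereotypical one
scores = np.array([[2.0, 0.5], [1.8, 0.9], [0.7, 1.1], [1.5, 0.4]])
stereo_is_male = np.array([True, True, False, False])

pref_male = scores[:, 0] > scores[:, 1]
skew = pref_male.mean() - 0.5                            # 0 = gender-balanced
stereotype = (pref_male == stereo_is_male).mean() - 0.5  # 0 = stereotype-blind
print(f"skew={skew:+.2f}  stereotype={stereotype:+.2f}")
```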

The aim of this essay is to better understand the Grasshopper Problem on the surface of the unit sphere. The problem was motivated by analysing Bell inequalities, but it can be formulated as a geometric puzzle as follows. Given a white sphere and a bucket of black paint, one is asked to paint half the sphere, such that antipodal pairs of points are oppositely coloured. A grasshopper lands on the sphere and jumps a fixed distance in a random direction. How should the sphere be coloured such that the probability of the grasshopper landing on the same colour is maximized? Goulko and Kent have explored the plane without an...

10.48550/arxiv.2307.05359 preprint EN cc-by arXiv (Cornell University) 2023-01-01
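For the simplest antipodal colouring (black = one hemisphere), a standard geometric argument gives the landing probability 1 - theta/pi for jump angle theta <= pi, since an arc of length theta crosses the equatorial great circle with probability theta/pi. A short Monte Carlo sketch (ours, not the essay's) reproduces this:

```python
# Monte Carlo check of the hemispherical colouring on the unit sphere.
import numpy as np

rng = np.random.default_rng(4)

def random_unit_vectors(n):
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def jump(p, theta):
    # uniformly random tangent direction, then a great-circle step of theta
    t = rng.normal(size=p.shape)
    t -= (t * p).sum(axis=1, keepdims=True) * p
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    return np.cos(theta) * p + np.sin(theta) * t

theta = 0.7
start = random_unit_vectors(1_000_000)
end = jump(start, theta)
same_colour = np.sign(start[:, 2]) == np.sign(end[:, 2])
print(same_colour.mean(), 1 - theta / np.pi)   # MC estimate vs. closed form
```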

Because diffusion models have shown impressive performances in a number of tasks, such as image synthesis, there is a trend in recent works to prove (with certain assumptions) that these models have strong approximation capabilities. In this paper, we show that current diffusion models actually have an expressive bottleneck in backward denoising, and that some assumptions made by existing theoretical guarantees are too strong. Based on this finding, we prove that diffusion models have unbounded errors in both local and global denoising. In light of our studies, we introduce soft mixture denoising (SMD), an efficient...

10.48550/arxiv.2309.14068 preprint EN other-oa arXiv (Cornell University) 2023-01-01
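A 1-D toy run illustrating the claimed bottleneck: with multimodal data the reverse-step posterior can be bimodal, which a single Gaussian (the standard parameterization) fits poorly while a small Gaussian mixture fits well. This is only an illustration; SMD itself is defined in the paper.

```python
# Single Gaussian vs. Gaussian mixture on a bimodal denoising posterior.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
x0 = rng.choice([-2.0, 2.0], size=200_000) + 0.1 * rng.normal(size=200_000)
xt = x0 + rng.normal(size=x0.size)               # one forward noising step

post = x0[np.abs(xt) < 0.05]                     # samples from q(x0 | xt ~ 0)
# moment-matched single Gaussian vs. two components placed at the modes
ll_gauss = norm.logpdf(post, post.mean(), post.std()).mean()
ll_mix = np.log(0.5 * norm.pdf(post, -2, 0.15)
                + 0.5 * norm.pdf(post, 2, 0.15)).mean()
print(ll_gauss, ll_mix)                          # mixture wins decisively
```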

Machine Learning (ML) in low-data settings remains an underappreciated yet crucial problem. Hence, data augmentation methods to increase the sample size of datasets needed for ML are key to unlocking the transformative potential of ML in data-deprived regions and domains. Unfortunately, the limited training set constrains traditional tabular synthetic data generators in their ability to generate a large and diverse augmented dataset for downstream tasks. To address this challenge, we introduce CLLM, which leverages the prior knowledge of Large...

10.48550/arxiv.2312.12112 preprint EN cc-by arXiv (Cornell University) 2023-01-01
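A hedged sketch of the curation half of such a pipeline: random rows stand in for LLM-generated candidates, and bootstrap confidence/consistency stands in for the paper's curation criterion.

```python
# Curation-style filtering of candidate synthetic rows (toy stand-ins).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
X_small = rng.normal(size=(40, 4))              # scarce real training set
y_small = (X_small[:, 0] > 0).astype(int)

candidates = rng.normal(size=(500, 4))          # stand-in for LLM-made rows
cand_labels = (candidates[:, 0] > 0).astype(int)  # stand-in for LLM labels

# confidence and consistency across bootstrap models approximate the
# learning-dynamics-style curation signal
probs = []
for _ in range(10):
    idx = rng.integers(0, len(X_small), len(X_small))
    m = LogisticRegression().fit(X_small[idx], y_small[idx])
    probs.append(m.predict_proba(candidates)[:, 1])
probs = np.array(probs)
conf = np.where(cand_labels == 1, probs.mean(0), 1 - probs.mean(0))
keep = (conf > 0.75) & (probs.std(0) < 0.1)     # curated augmentation set
print("kept", keep.sum(), "of", len(candidates))
```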

Biomedical imaging datasets are often small and biased, meaning that the real-world performance of predictive models can be substantially lower than expected from internal testing. This work proposes using generative image editing to simulate dataset shifts and diagnose failure modes of biomedical vision models; this can be used in advance of deployment to assess readiness, potentially reducing cost and patient harm. Existing editing methods can produce undesirable changes, with spurious correlations learned due to the co-occurrence...

10.48550/arxiv.2312.12865 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Uncertainty Quantification (UQ) is essential for creating trustworthy machine learning models. Recent years have seen a steep rise in UQ methods that can flag suspicious examples; however, it is often unclear what exactly these methods identify. In this work, we propose a framework for categorizing uncertain examples flagged by UQ methods in classification tasks. We introduce the confusion density matrix -- a kernel-based approximation of the misclassification density -- and use this to categorize examples identified by a given uncertainty method into three...

10.48550/arxiv.2207.05161 preprint EN other-oa arXiv (Cornell University) 2022-01-01
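One plausible reading of a kernel-based confusion density, sketched below; the `confusion_density` helper is an assumption for illustration, not the paper's exact construction.

```python
# Kernel-weighted (true class, predicted class) mass around a query point.
import numpy as np

def confusion_density(x, X_val, y_true, y_pred, n_classes, bw=1.0):
    # Gaussian-kernel weight of each validation point relative to x
    w = np.exp(-np.sum((X_val - x) ** 2, axis=1) / (2 * bw ** 2))
    D = np.zeros((n_classes, n_classes))
    for wi, t, p in zip(w, y_true, y_pred):
        D[t, p] += wi
    return D / D.sum()

rng = np.random.default_rng(5)
X_val = rng.normal(size=(500, 2))
y_true = (X_val[:, 0] > 0).astype(int)
y_pred = (X_val[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)  # noisy model
D = confusion_density(np.array([0.1, 0.0]), X_val, y_true, y_pred, 2)
print(D.round(2))  # off-diagonal mass localizes where the model confuses classes
```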

It is important to guarantee that machine learning algorithms deployed in the real world do not result in unfairness or unintended social consequences. Fair ML has largely focused on the protection of single attributes in the simpler setting where both attributes and target outcomes are binary. However, the practical application in many real-world problems entails the simultaneous protection of multiple sensitive attributes, which are often not simply binary, but continuous or categorical. To address this more challenging task, we introduce...

10.48550/arxiv.2211.06138 preprint EN cc-by arXiv (Cornell University) 2022-01-01