Michelle Sweering

ORCID: 0000-0003-1200-6015
Research Areas
  • Algorithms and Data Compression
  • Natural Language Processing Techniques
  • Semigroups and Automata Theory
  • Privacy-Preserving Technologies in Data
  • DNA and Biological Computing
  • Genomics and Phylogenetic Studies
  • Complexity and Algorithms in Graphs
  • Cryptography and Data Security
  • Advanced Graph Theory Research
  • Data Quality and Management
  • Data Mining Algorithms and Applications
  • Plant and Animal Studies
  • Genome Rearrangement Algorithms
  • Ecology and Vegetation Dynamics Studies
  • Optimization and Search Problems
  • Network Packet Processing and Optimization
  • Imbalanced Data Classification Techniques
  • Plant Parasitism and Resistance
  • Data Management and Algorithms
  • Machine Learning in Bioinformatics
  • Fuzzy and Soft Set Theory
  • 3D Shape Modeling and Analysis
  • Authorship Attribution and Profiling
  • Diabetic Foot Ulcer Assessment and Management
  • Access Control and Trust

Centrum Wiskunde & Informatica
2019-2025

Vitenparken
2022

Conservation Leadership Programme
2018-2019

University of Cambridge
2018-2019

Abstract: Bipartite networks are widely used to represent a diverse range of species interactions, such as pollination, herbivory, parasitism and seed dispersal. The structure of these networks is usually characterised by calculating one or more indices that capture different aspects of network architecture. While these indices capture useful properties of networks, they are relatively insensitive to changes in network structure. Consequently, variation in ecologically-important interactions can be missed. Network motifs are a way to characterise...

10.1111/2041-210x.13149 article EN cc-by Methods in Ecology and Evolution 2019-01-12

Abstract: Missing values arise routinely in real-world sequential (string) datasets due to: (1) imprecise data measurements; (2) flexible sequence modeling, such as binding profiles of molecular sequences; or (3) the existence of confidential information in a dataset which has been deleted deliberately for privacy protection. In order to analyze such datasets, it is often important to replace each missing value with one or more valid letters, in an efficient and effective way. Here we formalize this task...

10.1007/s10618-024-01074-3 article EN cc-by Data Mining and Knowledge Discovery 2025-01-22

Introduction: An elastic-degenerate (ED) string is a sequence of sets of strings. It can also be seen as a directed acyclic graph whose edges are labeled by strings. The notion of ED strings was introduced as a simple alternative to variation and sequence graphs for representing a pangenome, that is, a collection of genomic sequences to be analyzed jointly or to be used as a reference. Methods: In this study, we define notions of matching statistics of two ED strings as similarity measures between pangenomes and, consequently, infer a corresponding distance measure. We...

10.3389/fbinf.2024.1397036 article EN cc-by Frontiers in Bioinformatics 2024-09-26

The minimizers sampling mechanism is a popular mechanism for string sampling. However, minimizers sampling mechanisms lack good guarantees on the expected size of their samples for different combinations of their input parameters. Furthermore, indexes constructed over minimizers samples lack good worst-case guarantees for on-line pattern searches. In response, we propose bidirectional anchors (bd-anchors), a new string sampling mechanism. Given an integer $\ell$, our mechanism selects the lexicographically smallest rotation in every...

10.1109/tkde.2022.3231780 article EN IEEE Transactions on Knowledge and Data Engineering 2023-01-16

Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of S occurs in X_{ED}; (ii) the order of all other length-k strings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential substrings, algorithms for solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in O(kn²) time, which improves on the state of the art arXiv...

10.4230/lipics.cpm.2020.7 preprint EN other-oa HAL (Le Centre pour la Communication Scientifique Directe) 2020-06-17

String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user's location history). In this article, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility, in two settings that are relevant to many common string processing tasks. In the first setting,...

10.1145/3418683 article EN ACM Transactions on Knowledge Discovery from Data 2020-12-07

Abstract: Bipartite networks are widely used to represent a diverse range of species interactions, such as pollination, herbivory, parasitism and seed dispersal. The structure of these networks is usually characterised by calculating one or more metrics that capture different aspects of network architecture. While these metrics capture useful properties of networks, they only consider structure at the scale of the whole network (the macro-scale) or individual species (the micro-scale). 'Meso-scale' structure between these scales is usually ignored, despite representing ecologically-important...

10.1101/302356 preprint EN cc-by bioRxiv (Cold Spring Harbor Laboratory) 2018-04-17

A k-truss is a graph such that each edge is contained in at least k-2 triangles. This notion has attracted much attention, because it models meaningful cohesive subgraphs of a graph. We introduce the problem of identifying a smallest edge subset of a given graph whose removal makes the graph k-truss-free. We also introduce a problem variant where the identified subset contains only edges incident to a given set of nodes and ensures that these nodes are not contained in any k-truss. These problems are directly applicable in communication networks: the identified edges correspond to vital network connections; or in social networks: the identified edges can be...

10.1145/3447548.3467365 preprint EN 2021-08-12
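
The k-truss definition quoted in the abstract above (every edge lies in at least k-2 triangles) can be checked directly. The following is a minimal sketch, assuming a simple undirected graph given as a list of node pairs; the function name `is_k_truss` and the example graphs are illustrative and do not come from the paper, and the sketch only tests the definition, not the edge-removal problem the paper studies.

```python
from itertools import combinations


def is_k_truss(edges, k):
    """Return True if every edge lies in at least k - 2 triangles.

    Assumes a simple undirected graph (no self-loops) given as (u, v) pairs.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Triangles through edge (u, v) correspond to common neighbours of u and v.
    return all(len(adj[u] & adj[v]) >= k - 2 for u, v in edges)


# Hypothetical example graphs: the complete graph on 4 nodes and a 3-node path.
k4 = list(combinations(range(4), 2))
path = [(0, 1), (1, 2)]

print(is_k_truss(k4, 4))    # each edge of K4 lies in exactly 2 triangles
print(is_k_truss(path, 3))  # a path contains no triangles
```

Counting common neighbours per edge keeps the check short; it is meant for small illustrative inputs rather than large networks.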

We initiate a study on the fundamental relation between data sanitization (i.e., the process of hiding confidential information in a given dataset) and frequent pattern mining, in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns introducing, however, a number of spurious patterns that may harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is twofold. First, we present several hardness results, for different variants of this problem, essentially showing...

10.1109/icdm50108.2020.00103 article EN 2021 IEEE International Conference on Data Mining (ICDM) 2020-11-01

Data sanitization and frequent pattern mining are two well-studied topics in data mining. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot...

10.1109/tkde.2022.3158063 article EN publisher-specific-oa IEEE Transactions on Knowledge and Data Engineering 2022-01-01

We introduce the general problem of identifying a smallest edge subset of a given graph whose deletion makes the graph community-free. We consider this problem under two community notions that have attracted significant attention: k-truss and k-core. We also introduce a problem variant where the identified subset contains edges incident to a given set of nodes and ensures that these nodes are not contained in any community: k-truss or k-core, in our case. These problems are directly applicable in social networks: the identified edges can be hidden by users or sanitized from the output graph; or in communication networks: the identified edges correspond...

10.1145/3644077 article EN other-oa ACM Transactions on Knowledge Discovery from Data 2024-02-15

An elastic-degenerate (ED) string $T$ is a sequence of $n$ sets $T[1],\ldots,T[n]$ containing $m$ strings in total whose cumulative length is $N$. We call $n$, $m$, and $N$ the length, the cardinality and the size of $T$, respectively. The language of $T$ is defined as $L(T)=\{S_1 \cdots S_n\,:\,S_i \in T[i] \text{ for all } i\in[1,n]\}$. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages...

10.48550/arxiv.2411.07782 preprint EN arXiv (Cornell University) 2024-11-12
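
To make the definition of $L(T)$ in the abstract above concrete, here is a minimal brute-force sketch that enumerates the language of a small ED string and computes its length $n$, cardinality $m$ and size $N$. The function names, the representation of an ED string as a list of Python sets, and the example are assumptions made for illustration; the sketch also assumes an empty string contributes 0 to $N$, which the abstract does not specify. It is exponential in general and is not the algorithm of the paper.

```python
from itertools import product


def ed_language(T):
    """Enumerate L(T): all concatenations S_1...S_n with S_i taken from T[i]."""
    return {"".join(choice) for choice in product(*T)}


def ed_parameters(T):
    """Return (n, m, N): number of sets, total number of strings, total length."""
    n = len(T)
    m = sum(len(segment) for segment in T)
    # Assumption: an empty string adds 0 to the cumulative length N.
    N = sum(len(s) for segment in T for s in segment)
    return n, m, N


# Hypothetical ED string with one variable position (C, CG, or a deletion).
T = [{"A"}, {"C", "CG", ""}, {"T"}]

print(sorted(ed_language(T)))
print(ed_parameters(T))
```

Because the language is materialized as a Python set, two small ED strings can be compared by ordinary set operations, which is useful only as a correctness reference for tiny inputs.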

We introduce a novel measure for quantifying the error in input predictions. The error is based on a minimum-cost hyperedge cover in a suitably defined hypergraph and provides a general template which we apply to online graph problems. The measure captures errors due to absent predicted requests as well as unpredicted actual requests; hence, predicted and actual inputs can be of arbitrary size. We achieve refined performance guarantees for previously studied network design problems in the online-list model, such as Steiner tree and facility location. Further,...

10.48550/arxiv.2205.12850 preprint EN cc-by arXiv (Cornell University) 2022-01-01

String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge. In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility, in two settings that are relevant to many common string processing tasks. In the first setting, we aim to generate the minimal-length string that preserves the order of appearance and frequency of all...

10.48550/arxiv.1906.11030 preprint EN cc-by arXiv (Cornell University) 2019-01-01

This report describes and develops different methods for converting 3D time series data to surface representations. The methods considered contain public domain mesh generation software, as well as linear regression models and representations using signed distance functions. We provide a simple code base for the latter two such that they can be used for further research in manner. We apply and test the algorithms on point cloud data of a foot model. All methods yield good representations of the underlying geometry. Such methods can therefore have a big impact on handling problems.

10.33774/miir-2023-mk9wn preprint EN cc-by 2023-09-30

Let $W$ be a string of length $n$ over an alphabet $\Sigma$, $k$ be a positive integer, and $\mathcal{S}$ be a set of length-$k$ substrings of $W$. The ETFS problem asks us to construct a string $X_{\mathrm{ED}}$ such that: (i) no string of $\mathcal{S}$ occurs in $X_{\mathrm{ED}}$; (ii) the order of all other length-$k$ strings over $\Sigma$ (and thus the frequency) is the same in $W$ and in $X_{\mathrm{ED}}$; and (iii) $X_{\mathrm{ED}}$ has minimal edit distance to $W$. When $W$ represents an individual's data and $\mathcal{S}$ represents a set of confidential patterns, the ETFS problem asks for transforming $W$ to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved...

10.48550/arxiv.2007.08179 preprint EN cc-by arXiv (Cornell University) 2020-01-01