- Data Quality and Management
- Semantic Web and Ontologies
- Data Mining Algorithms and Applications
- Advanced Database Systems and Queries
- Data Management and Algorithms
- Big Data and Business Intelligence
- Scientific Computing and Data Management
- Anomaly Detection Techniques and Applications
- Topic Modeling
- Image Retrieval and Classification Techniques
- Biomedical Text Mining and Ontologies
- Big Data Technologies and Applications
- Data Stream Mining Techniques
- Advanced Image and Video Retrieval Techniques
- Privacy-Preserving Technologies in Data
- Time Series Analysis and Forecasting
- Rough Sets and Fuzzy Logic
- Mobile Crowdsensing and Crowdsourcing
- Machine Learning and Data Classification
- Advanced Text Analysis Techniques
- Misinformation and Its Impacts
- Explainable Artificial Intelligence (XAI)
- Remote-Sensing Image Classification
- Web Data Mining and Analysis
- Data Visualization and Analytics
- Acteurs, Ressources et Territoires dans le Développement (2015-2025)
- Institut de Recherche pour le Développement (2016-2025)
- UMR Espace-Dev (2016-2025)
- Office of Scientific and Technical Information (2024)
- Oak Ridge National Laboratory (2024)
- Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (2024)
- Aix-Marseille Université (2012-2023)
- Hamad bin Khalifa University (2015-2021)
- Laboratoire d'Informatique et Systèmes (2018-2021)
- Centre National de la Recherche Scientifique (2012-2021)
Many data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, require integrating data from multiple sources. Each of these sources provides a set of values, and different sources can often provide conflicting values. To present quality data to users, it is critical that data integration systems resolve conflicts and discover true values. Typically, we expect a true value to be provided by more sources than any particular false one, so we can take the value provided by the majority of sources as the truth. Unfortunately, a false value can be spread...
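As a minimal sketch of the majority-voting baseline mentioned above (toy data and source names are hypothetical; it is exactly this heuristic that copying between sources can defeat):

```python
from collections import Counter

# Hypothetical claims: five sources each report a value for one data item.
claims = {
    "source_A": "NYC",
    "source_B": "NYC",
    "source_C": "Boston",  # a false value copied among sources would
    "source_D": "Boston",  # inflate its count and fool the majority
    "source_E": "NYC",
}

def majority_vote(claims):
    """Return the value claimed by the most sources (ties broken arbitrarily)."""
    return Counter(claims.values()).most_common(1)[0][0]

print(majority_vote(claims))  # -> "NYC"
```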
Modern information management applications often require integrating data from a variety of sources, some of which may copy or buy data from other sources. When these sources model a dynamically changing world (e.g., people's contact information changes over time, restaurants open and go out of business), they can provide out-of-date data. Errors can also creep into data when sources are updated often. Given erroneous data provided by different, possibly dependent, sources, it is challenging for data integration systems to discover the true values. Straightforward...
Web technologies have enabled data sharing between sources but have also simplified copying (and often publishing without proper attribution). The copying relationships can be complex: some sources copy from multiple sources on different subsets of data; some co-copy from the same source, and some copy transitively from another. Understanding such relationships is desirable both for business purposes and for improving many key components in data integration, such as resolving conflicts across various sources, reconciling distinct references to the same real-world entity, and efficiently...
Various computational procedures or constraint-based methods for data repairing have been proposed over the last decades to identify errors and, when possible, correct them. However, these approaches have several limitations, including scalability and the quality of the values to be used as replacements for the errors. In this paper, we propose a new approach that is based on maximizing the likelihood of replacement data given the data distribution, which can be modeled using statistical machine learning techniques. This is a novel approach combining machine learning and likelihood methods for cleaning dirty...
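A minimal sketch of likelihood-based repair, assuming a simple empirical conditional model estimated from the rows not flagged as dirty (the paper's statistical model is far richer; the table, column names, and flagged indices here are hypothetical):

```python
import pandas as pd

# Hypothetical toy table: repair `city` in rows flagged as erroneous by
# picking the replacement with maximum empirical likelihood given `zipcode`.
df = pd.DataFrame({
    "zipcode": ["10001", "10001", "10001", "60601", "60601"],
    "city":    ["New York", "New York", "NY?", "Chicago", "Chicago"],
})
flagged = [2]  # indices of cells detected as errors (detection not shown)

clean = df.drop(index=flagged)
for i in flagged:
    zip_code = df.loc[i, "zipcode"]
    # empirical P(city | zipcode) from the clean rows sharing this zipcode
    likelihoods = clean.loc[clean["zipcode"] == zip_code, "city"].value_counts(normalize=True)
    if not likelihoods.empty:
        df.loc[i, "city"] = likelihoods.idxmax()  # maximum-likelihood repair

print(df)
```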
Quantitative Data Cleaning (QDC) is the use of statistical and other analytical techniques to detect, quantify, and correct data quality problems (or glitches). Current QDC approaches focus on addressing each category of glitch individually. However, in real-world data, different types of glitches co-occur in complex patterns. These patterns and interactions between glitches offer valuable clues for developing effective domain-specific quantitative cleaning strategies. In this paper, we address the shortcomings of extant...
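As an illustration of glitch co-occurrence analysis, a small sketch using two hypothetical detectors (missingness and robust median/MAD outliers) and counting the row-level patterns they form together:

```python
import pandas as pd

# Hypothetical data with two glitch types that sometimes co-occur.
df = pd.DataFrame({
    "age":    [34, None, 29, 31, 310, None, 28],
    "salary": [55000, 61000, None, 58000, 9_900_000, None, 57000],
})

# Per-cell glitch indicators: missingness and robust (median/MAD) outliers.
missing = df.isna()
med = df.median()
mad = (df - med).abs().median()
outlier = ((df - med).abs() / mad) > 3.5

# Which glitch types appear together on each row, and how often.
patterns = pd.DataFrame({
    "has_missing": missing.any(axis=1),
    "has_outlier": outlier.any(axis=1),
})
print(patterns.value_counts())  # frequency of each co-occurrence pattern
```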
A fundamental problem in data fusion is to determine the veracity of multi-source data in order to resolve conflicts. While previous work on truth discovery has proved to be useful in practice for specific settings, source behaviors, or data set characteristics, there has been limited systematic comparison of the competing methods in terms of efficiency, usability, and repeatability. We remedy this deficit by providing a comprehensive review of 12 state-of-the-art truth discovery algorithms. We provide reference implementations and an in-depth...
Social networks and the Web in general are characterized by multiple information sources often claiming conflicting data values. Data veracity is hard to estimate, especially when there is no prior knowledge about the sources or claims, and in time-dependent scenarios where initially very few observers can report the first information. Despite the wide set of recently proposed truth discovery approaches, no "one-fits-all" solution emerges for estimating data veracity in on-line and open contexts. However, analyzing the space of disagreeing sources might be...
Multimodal AI models are increasingly used in fields like healthcare, finance, and autonomous driving, where information is drawn from multiple sources or modalities such as images, texts, audios, and videos. However, effectively managing uncertainty - arising from noise, insufficient evidence, or conflicts between modalities - is crucial for reliable decision-making. Current uncertainty-aware ML methods leveraging, for example, evidence averaging or evidence accumulation underestimate uncertainties in high-conflict scenarios. Moreover,...
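A toy illustration of the averaging problem, with a Dempster-Shafer-style conflict mass as one way to make the disagreement explicit (numbers are hypothetical; this is not the paper's method):

```python
import numpy as np

# Hypothetical per-modality class probabilities for one input (2 classes).
p_image = np.array([0.95, 0.05])  # image modality: confident class 0
p_text  = np.array([0.05, 0.95])  # text modality: confident class 1

# Plain averaging hides the disagreement: the fused distribution looks
# merely "uncertain", not "conflicting".
p_avg = (p_image + p_text) / 2
print(p_avg)  # [0.5, 0.5]

# A Dempster-Shafer-style conflict mass surfaces it: the probability that
# the two modalities support different classes.
conflict = 1.0 - float(p_image @ p_text)
print(round(conflict, 3))  # 0.905 -> high conflict, low reliability
```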
The Web has enabled the availability of a huge amount of useful information, but has also eased the ability to spread false information and rumors across multiple sources, making it hard to distinguish between what is true and what is not. Recent examples include the premature Steve Jobs obituary, the second bankruptcy of United Airlines, the creation of Black Holes by the operation of the Large Hadron Collider, etc. Since it is important to permit the expression of dissenting and conflicting opinions, it would be a fallacy to try to ensure that the Web provides only consistent...
Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. When using existing FD discovery algorithms, some genuine FDs may not be detected precisely due to missing values, or non-genuine FDs may be discovered even though they are caused by missing values with certain NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This...
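A minimal sketch of how NULL semantics changes whether a discovered FD looks genuine, using a hypothetical relation and two readings of NULL:

```python
import pandas as pd

# Hypothetical relation; does the FD zip -> city hold?
df = pd.DataFrame({
    "zip":  ["10001", "10001", "60601", None],
    "city": ["New York", None, "Chicago", "Chicago"],
})

def fd_holds(df, lhs, rhs, null_equals_null=True):
    """Check the FD lhs -> rhs under a chosen NULL semantics.

    null_equals_null=True treats NULLs as equal to each other (tuples with
    NULL group together); False drops tuples with NULLs, the skeptical
    reading under which NULL matches nothing.
    """
    data = df if null_equals_null else df.dropna(subset=[lhs, rhs])
    groups = data.groupby(lhs, dropna=False)[rhs]
    # The FD holds if each lhs group has at most one distinct rhs value.
    return bool((groups.nunique(dropna=False) <= 1).all())

print(fd_holds(df, "zip", "city", null_equals_null=True))   # False
print(fd_holds(df, "zip", "city", null_equals_null=False))  # True
```

The same FD is violated under one semantics and satisfied under the other, which is the ambiguity a genuineness score has to quantify.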
Data cleaning and preparation has been a long-standing challenge in data science to avoid incorrect results and misleading conclusions obtained from dirty data. For a given dataset and machine learning-based task, a plethora of preprocessing techniques and alternative data curation strategies may lead to dramatically different outputs with unequal quality performance. Most current work on automated machine learning, however, focuses on developing either automated algorithms or user-guided systems...
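A small sketch of the underlying point: alternative curation strategies for the same dataset and task can be compared by downstream model quality (synthetic data; the strategy names are scikit-learn's SimpleImputer options):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.3] = np.nan  # inject 30% missing values

# Alternative curation strategies, scored on the same task:
for strategy in ("mean", "median", "most_frequent"):
    pipe = make_pipeline(SimpleImputer(strategy=strategy), LogisticRegression())
    print(strategy, round(cross_val_score(pipe, X, y, cv=5).mean(), 3))
```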
On the Web, a massive amount of user-generated content is available through various channels (e.g., texts, tweets, Web tables, databases, multimedia-sharing platforms, etc.). Conflicting information...
Object-based image analysis (OBIA) has been widely adopted as a common paradigm to deal with very high-resolution remote sensing images. Nevertheless, OBIA methods strongly depend on the results of the segmentation step. Many segmentation quality metrics have been proposed. Supervised metrics give an accurate quality estimation but require a ground-truth reference. Unsupervised metrics only make use of intrinsic image and segment properties; yet most of them depend on the application and do not handle well the variability of objects in the image. Furthermore, few were developed in the remote sensing context, and mainly...
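As an illustration, one classic unsupervised metric of the kind discussed here is area-weighted intra-segment variance, which needs no ground truth (toy single-band image below; a real OBIA metric would work on multi-band segments):

```python
import numpy as np

def weighted_intra_segment_variance(image, labels):
    """Unsupervised segmentation quality: area-weighted variance of pixel
    values inside each segment (lower = more homogeneous segments).
    No ground-truth reference is needed, unlike supervised metrics."""
    total = 0.0
    for seg_id in np.unique(labels):
        pixels = image[labels == seg_id]
        total += pixels.size * pixels.var()
    return total / image.size

# Hypothetical 4x4 single-band image with a two-segment labeling.
img = np.array([[1, 1, 9, 9],
                [1, 1, 9, 9],
                [1, 2, 8, 9],
                [1, 1, 9, 9]], dtype=float)
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1]])
print(weighted_intra_segment_variance(img, labels))
```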
Error detection is the process of identifying problematic data cells that are different from their ground truth. Functional dependencies (FDs) have been widely studied in support of this process. Oftentimes, it is assumed that the FDs are given by experts. Unfortunately, it is usually hard and expensive for experts to define such FDs. In addition, automatic FD profiling over dirty data in order to find correct FDs is known to be a hard problem. In this paper, we propose an end-to-end solution to detect FD-detectable errors in data. The broad intuition...
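A minimal sketch of FD-based error detection, flagging cells that disagree with their group under a hypothetical FD zipcode -> city (a real system must also cope with the dirty-profiling problem noted above):

```python
import pandas as pd

# Hypothetical table and FD: zipcode -> city. Cells that break the FD
# within their group are flagged as suspicious.
df = pd.DataFrame({
    "zipcode": ["10001", "10001", "10001", "60601"],
    "city":    ["New York", "New York", "Boston", "Chicago"],
})

def fd_violation_mask(df, lhs, rhs):
    """Flag rhs cells that disagree with the majority rhs value of their
    lhs group -- a simple stand-in for FD-based error detection."""
    majority = df.groupby(lhs)[rhs].transform(lambda s: s.mode().iloc[0])
    return df[rhs] != majority

print(df[fd_violation_mask(df, "zipcode", "city")])  # the "Boston" row
```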
Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases a framework that provides multi-platform task execution for such applications. It features a three-layer abstraction and a new query optimization approach for multi-platform settings. We will demonstrate the strengths of the system using real-world scenarios from three different areas, namely machine learning, data cleaning, and data fusion.
Through extensive experience developing and explaining machine learning (ML) applications for real-world domains, we have learned that ML models are only as interpretable as their features. Even simple, highly interpretable model types such as regression can be difficult or impossible to understand if they use uninterpretable features. Different users, especially those using models for decision-making in their domains, may require different levels of feature interpretability. Furthermore, based on our experiences, we claim that the term "interpretable...
It is widely accepted that data preparation is one of the most time-consuming steps of the machine learning (ML) lifecycle. It is also one of the most important steps, as data quality directly influences the quality of a model. In this tutorial, we will discuss the importance and role of exploratory data analysis (EDA) and visualisation techniques in finding data quality issues relevant for data preparation and for building ML pipelines. We will cover the latest advances in these fields and bring out areas that need innovation. To make the tutorial actionable for practitioners, we will discuss popular open-source packages that one can get...
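A few lines of EDA of the kind such a tutorial covers, surfacing common quality issues with plain pandas (toy data; dedicated profiling packages go much further):

```python
import pandas as pd

# Quick EDA pass flagging typical data quality issues before ML preparation.
df = pd.DataFrame({
    "age":  [34, 29, None, 310, 29],
    "city": ["NYC", "nyc", "NYC", None, "nyc"],
})

print(df.isna().mean())                       # missingness rate per column
print(df.describe())                          # ranges reveal the 310 outlier
print(df["city"].str.lower().value_counts())  # casing inconsistencies ("NYC" vs "nyc")
print(df.duplicated().sum())                  # exact duplicate rows
```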
Anomaly detection on time series data is increasingly common across various industrial domains that monitor metrics in order to prevent potential accidents and economic losses. However, a scarcity of labeled data and ambiguous definitions of anomalies can complicate these efforts. Recent unsupervised machine learning methods have made remarkable progress in tackling this problem using either single-timestamp predictions or time series reconstructions. While traditionally considered separately, these methods are not mutually...
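A toy sketch of the two error signals and a simple combination (the differencing and smoothing steps are stand-ins for a learned forecaster and an autoencoder; hybrid methods learn both jointly):

```python
import numpy as np

# Toy series with a spike at index 4.
series = np.array([1.0, 1.1, 0.9, 1.0, 5.0, 1.0, 1.1, 0.9])

# "Prediction" error: deviation from the previous value (stand-in for a
# learned one-step-ahead forecaster).
pred_err = np.abs(np.diff(series, prepend=series[0]))

# "Reconstruction" error: deviation from a smoothed version of the series
# (stand-in for an autoencoder's reconstruction).
recon = np.convolve(series, np.ones(3) / 3, mode="same")
recon_err = np.abs(series - recon)

# Combine the two signals (average of min-max-normalized errors).
def norm(e):
    return (e - e.min()) / (e.max() - e.min() + 1e-9)

score = (norm(pred_err) + norm(recon_err)) / 2
print(np.argsort(score)[::-1][:2])  # indices with the highest anomaly scores
```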
Ensuring and maximizing the quality and integrity of information is a crucial process for today's enterprise information systems (EIS). It requires a clear understanding of the interdependencies between the dimensions characterizing the quality of data (QoD), the quality of the conceptual model (QoM) of the database, keystone of the EIS, and the quality of data management and integration processes (QoP). The improvement of one dimension (such as accuracy or expressiveness) may have negative consequences on other dimensions (e.g., freshness or completeness of data). In this paper, we briefly present a framework, called...
Estimation of data veracity is recognized as one of the grand challenges of big data. Typically, the goal of truth discovery is to determine the veracity of multi-source, conflicting data and to return, as outputs, a truth label and a confidence score for each value, along with the trustworthiness of the source claiming it. Although a plethora of methods has been proposed, it is unlikely that one technique dominates all the others across data sets. Furthermore, performance evaluation entirely depends on the availability of labeled ground truth (i.e., values whose veracity has been manually checked). In the context of Big...