- Data Quality and Management
- Advanced Database Systems and Queries
- Data Management and Algorithms
- Privacy-Preserving Technologies in Data
- Data Mining Algorithms and Applications
- Semantic Web and Ontologies
- Topic Modeling
- Web Data Mining and Analysis
- Advanced Graph Neural Networks
- Natural Language Processing Techniques
- Big Data and Business Intelligence
- Data-Driven Disease Surveillance
- Bayesian Modeling and Causal Inference
- Distributed Systems and Fault Tolerance
- Algorithms and Data Compression
- Anomaly Detection Techniques and Applications
- Cloud Data Security Solutions
- Cryptography and Data Security
- Scientific Computing and Data Management
- Advanced Image and Video Retrieval Techniques
- Peer-to-Peer Network Technologies
- Logic, Reasoning, and Knowledge
- Graph Theory and Algorithms
- Data Stream Mining Techniques
- Constraint Satisfaction and Optimization
University of Waterloo
2015-2024
Apple (United States)
2022-2023
Sapienza University of Rome
2023
University of Calgary
2023
University of Michigan
2023
Universitas Flores
2021
Universitas Syiah Kuala
2020
Qatar Cardiovascular Research Center
2012-2013
Qatar Foundation
2011-2013
Qatar Airways (Qatar)
2013
Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between score and uncertainty makes traditional techniques inapplicable. We introduce new probabilistic formulations for top-k queries. Our formulations are based on a "marriage" of traditional top-k semantics and possible worlds semantics. In the light of these formulations, we construct a framework that encapsulates a state space model and efficient query processing techniques to tackle the challenges of uncertain data settings. We prove that our techniques are optimal in terms of the number...
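To make the possible-worlds semantics mentioned in the abstract above concrete, the following is a minimal brute-force sketch, not the state space framework the abstract refers to. It assumes that tuples exist independently of one another, and the function name `most_probable_topk` and the sample data are hypothetical. It enumerates every possible world, evaluates an ordinary top-k in each, and returns the answer with the largest total probability.

```python
from collections import defaultdict
from itertools import product


def most_probable_topk(tuples, k):
    """Return (answer, probability) for the most probable top-k answer.

    tuples: list of (name, score, prob); each tuple is assumed to exist
    independently with probability prob (an assumption of this sketch).
    """
    answer_prob = defaultdict(float)
    # Enumerate every possible world: each tuple is either present or absent.
    for world in product([False, True], repeat=len(tuples)):
        world_prob = 1.0
        present = []
        for exists, (name, score, prob) in zip(world, tuples):
            world_prob *= prob if exists else 1.0 - prob
            if exists:
                present.append((score, name))
        # Ordinary top-k inside this world: highest scores first.
        present.sort(reverse=True)
        answer = tuple(name for _, name in present[:k])
        answer_prob[answer] += world_prob
    return max(answer_prob.items(), key=lambda item: item[1])


# Hypothetical uncertain tuples: (name, score, existence probability).
readings = [("a", 10, 0.3), ("b", 8, 0.9), ("c", 5, 0.6)]
best_answer, best_prob = most_probable_topk(readings, k=1)
```

In this toy input the highest-scoring tuple `a` is not the most probable top-1 answer, because `b` exists far more often. The enumeration visits 2^n worlds for n tuples, which is why naive materialization does not scale.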
We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies qualitative data repairing, which relies on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. Inspired by recent theoretical advances in probabilistic inference, we introduce a series of optimizations which ensure that inference over HoloClean's probabilistic model scales...
Detecting and repairing dirty data is one of the perennial challenges in data analytics, and failure to do so can result in inaccurate analytics and unreliable decisions. Over the past few years, there has been a surge of interest from both industry and academia on data cleaning problems, including new abstractions, interfaces, approaches for scalability, and statistical techniques. To better understand the new advances in the field, we will first present a taxonomy of the data cleaning literature in which we highlight the recent interest in techniques that use constraints, rules, or...
Data cleaning is an important problem and data quality rules are the most promising way to face it with a declarative approach. Previous work has focused on specific formalisms, such as functional dependencies (FDs), conditional functional dependencies (CFDs), and matching dependencies (MDs), and those have always been studied in isolation. Moreover, such techniques are usually applied in a pipeline or interleaved. In this work we tackle the problem in a novel, unified framework. First, we let users specify quality rules using denial constraints with ad-hoc predicates. This language subsumes...
Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end off-the-shelf solution to (semi-)automate the detection and the repairing of violations w.r.t. a set of heterogeneous and ad-hoc quality constraints. In short, there is no commodity platform similar to general purpose DBMSs that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized and easy-to-deploy data cleaning platform....
Classical approaches to clean data have relied on using integrity constraints, statistics, or machine learning. These approaches are known to be limited in the cleaning accuracy, which can usually be improved by consulting master data and involving experts to resolve ambiguity. The advent of knowledge bases (KBs), both general-purpose and within enterprises, and crowdsourcing marketplaces are providing yet more opportunities to achieve higher accuracy at a larger scale. We propose KATARA, a knowledge base and crowd powered data cleaning system that, given a table,...
Integrity constraints (ICs) provide a valuable tool for enforcing correct application semantics. However, designing ICs requires experts and time. Proposals for automatic discovery have been made for some formalisms, such as functional dependencies and their extension conditional functional dependencies. Unfortunately, these dependencies cannot express many common business rules. For example, an American citizen cannot have lower salary and higher tax rate than another citizen in the same state. In this paper, we tackle the challenges of discovering more...
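The salary and tax rule quoted in the abstract above can be checked over pairs of tuples. The sketch below is only an illustration of evaluating such a pairwise rule, not the discovery algorithm the abstract goes on to describe. The column names (`state`, `salary`, `tax_rate`), the function name `find_dc_violations`, and the sample rows are all hypothetical.

```python
from itertools import permutations


def find_dc_violations(rows):
    """Return ordered id pairs (t1, t2) that violate the rule.

    A pair violates the rule when both tuples share a state and t1 has a
    lower salary but a higher tax rate than t2.
    """
    violations = []
    for t1, t2 in permutations(rows, 2):
        if (t1["state"] == t2["state"]
                and t1["salary"] < t2["salary"]
                and t1["tax_rate"] > t2["tax_rate"]):
            violations.append((t1["id"], t2["id"]))
    return violations


# Hypothetical sample data.
citizens = [
    {"id": 1, "state": "NY", "salary": 50000, "tax_rate": 0.30},
    {"id": 2, "state": "NY", "salary": 80000, "tax_rate": 0.25},
    {"id": 3, "state": "CA", "salary": 40000, "tax_rate": 0.35},
]
violations = find_dc_violations(citizens)
```

Only the pair (1, 2) is flagged here: both citizens are in the same state, and the first earns less yet is taxed at a higher rate. A functional dependency cannot state this rule because it relies on order comparisons rather than equality alone.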
Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and to possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integrity constraints. Since different types of errors may coexist in the same data set, we often need to run more than one kind of tool. In this paper, we investigate two pragmatic questions: (1) are these tools robust enough...
The rich dependency structure found in the columns of real-world relational databases can be exploited to great advantage, but can also cause query optimizers, which usually assume that columns are statistically independent, to underestimate the selectivities of conjunctive predicates by orders of magnitude. We introduce CORDS, an efficient and scalable tool for automatic discovery of correlations and soft functional dependencies between columns. CORDS searches for column pairs that might have interesting and useful dependency relations...
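The underestimation described in the abstract above can be seen in a small worked example. This sketch does not reproduce CORDS itself; it only compares the selectivity an optimizer would estimate under the column-independence assumption with the actual selectivity on a toy table. The table contents and the helper name `selectivity` are hypothetical.

```python
def selectivity(rows, predicate):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for row in rows if predicate(row)) / len(rows)


# Hypothetical table of (make, model) pairs; model fully determines make.
cars = [("Honda", "Accord")] * 5 + [("Toyota", "Camry")] * 5

sel_make = selectivity(cars, lambda r: r[0] == "Honda")
sel_model = selectivity(cars, lambda r: r[1] == "Accord")

# Independence assumption: multiply the single-column selectivities.
independent_estimate = sel_make * sel_model

# Actual selectivity of the conjunctive predicate.
actual = selectivity(cars, lambda r: r[0] == "Honda" and r[1] == "Accord")
```

With these rows the independence estimate is 0.25 while the actual selectivity is 0.5, a factor-of-two underestimate from a single correlated pair. With more columns and more distinct values the gap can grow much larger.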
This paper introduces RankSQL, a system that provides a systematic and principled framework to support efficient evaluations of ranking (top-k) queries in relational database systems (RDBMS), by extending relational algebra and query optimization. Previously, top-k query processing was studied in the middleware scenario or in RDBMS in a piecemeal fashion, i.e., focusing on a specific operator or sitting outside the core of query engines. In contrast, we aim to support ranking as a first-class database construct. As a key insight, the new ranking relationship can be viewed as another...
In this paper we present GDR, a Guided Data Repair framework that incorporates user feedback in the cleaning process to enhance and accelerate existing automatic repair techniques while minimizing user involvement. GDR consults the user on the updates that are most likely to be beneficial in improving data quality. GDR also uses machine learning methods to identify and to apply the correct updates directly to the database without the actual involvement of the user on these specific updates. To rank potential updates for consultation by the user, we first group these repairs and quantify...
Uncertainty pervades many domains in our lives. Current real-life applications, e.g., location tracking using GPS devices or cell phones, multimedia feature extraction, and sensor data management, deal with different kinds of uncertainty. Finding the nearest neighbor objects to a given query point is an important query type in these applications. In this paper, we study the problem of finding objects with the highest marginal probability of being the nearest neighbors to a query object. We adopt a general uncertainty model allowing for data and query uncertainty. Under this model,...
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms,...
We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model to learn rich representations that capture the inherent syntactic and semantic heterogeneity of errors; and (2) a data augmentation model that, given a small seed of clean records, uses dataset-specific transformations to automatically generate additional training data. Our key insight is to learn data augmentation policies...
Ranking is an important property that needs to be fully supported by current relational query engines. Recently, several rank-join query operators have been proposed based on rank aggregation algorithms. Rank-join operators progressively rank the join results while performing the join operation. The new operators have a direct impact on traditional query processing and optimization. We introduce a rank-aware query optimization framework that fully integrates rank-join operators into relational query engines. The framework is based on extending the System R dynamic programming algorithm in both enumeration and pruning. We define ranking as...
Violations of functional dependencies (FDs) are common in practice, often arising in the context of data integration or Web data extraction. Resolving these violations is known to be challenging for a variety of reasons, one of them being the exponential number of possible "repairs". Previous work has tackled this problem either by producing a single repair that is (nearly) optimal with respect to some metric, or by computing consistent answers to selected classes of queries without explicitly generating the repairs. In this paper, we propose...
We present the demonstration of the design of "STEAM", Purdue Boiler Makers' stream database system that allows for the processing of continuous and snap-shot queries over data streams. Specifically, the demonstration focuses on the query processing engine, "Nile". Nile extends the query processor engine of an object-relational database management system, PREDATOR, to process continuous queries over data streams. Nile supports extended SQL operators that handle sliding-window execution as an approach to restrict the size of the stored state in operators such as join.
Functional dependencies (FDs) specify the intended data semantics while violations of FDs indicate deviation from these semantics. In this paper, we study a data cleaning problem in which the FDs may not be completely correct, e.g., due to data evolution or incomplete knowledge of the data semantics. We argue that the notion of relative trust is a crucial aspect of this problem: if the FDs are outdated, we should modify them to fit the data, but if we suspect that there are problems with the data, we should modify the data to fit the FDs. In practice, it is usually unclear how much to trust the data versus the FDs. To address this problem, we propose an algorithm for...
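As a small illustration of what an FD violation looks like in data, the sketch below groups rows by the left-hand side of a dependency and reports groups that map to more than one right-hand-side value. It is only a detection helper under assumed names (`fd_violations`, and the `zip` and `city` columns are hypothetical); it is not the repair algorithm the abstract above proposes.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {lhs_value: set of rhs_values} for groups breaking lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        seen[key].add(tuple(row[attr] for attr in rhs))
    # The FD holds for a group only if it maps to exactly one rhs value.
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical sample data for the dependency zip -> city.
addresses = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
]
conflicts = fd_violations(addresses, ["zip"], ["city"])
```

The conflicting group for zip `10001` shows the choice the abstract calls relative trust: if the dependency is trusted, one of the city values is changed; if the data is trusted, the dependency itself would be revised instead.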