- Data Quality and Management
- Advanced Database Systems and Queries
- Data Management and Algorithms
- Topic Modeling
- Privacy-Preserving Technologies in Data
- Data Mining Algorithms and Applications
- Semantic Web and Ontologies
- Scientific Computing and Data Management
- Machine Learning and Data Classification
- Natural Language Processing Techniques
- Data Visualization and Analytics
- Anomaly Detection Techniques and Applications
- Web Data Mining and Analysis
- Advanced Data Storage Technologies
- Distributed Systems and Fault Tolerance
- Graph Theory and Algorithms
- Research Data Management Practices
- Software Engineering Research
- Cloud Data Security Solutions
- Advanced Computational Techniques and Applications
- Video Analysis and Summarization
- Advanced Text Analysis Techniques
- Data Stream Mining Techniques
- Machine Learning and Algorithms
- Mobile Crowdsensing and Crowdsourcing
- Hong Kong University of Science and Technology, 2023-2025
- University of Hong Kong, 2023-2025
- Chengdu University of Technology, 2021-2024
- Zhejiang Police College, 2024
- Shenzhen Children's Hospital, 2024
- Qatar Airways (Qatar), 2013-2023
- Qatar Cardiovascular Research Center, 2013-2023
- South China University of Technology, 2023
- Hamad bin Khalifa University, 2016-2022
- University of Electronic Science and Technology of China, 2020
Graph pattern matching is typically defined in terms of subgraph isomorphism, which makes it an NP-complete problem. Moreover, it requires bijective functions, which are often too restrictive to characterize patterns in emerging applications. We propose a class of graph patterns, in which an edge denotes the connectivity in a data graph within a predefined number of hops. In addition, we define matching based on a notion of bounded simulation, an extension of graph simulation. We show that with this revision, graph pattern matching can be performed in cubic-time, by providing such...
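The idea of matching by bounded simulation can be illustrated with a small fixed-point computation. This is a minimal sketch under assumptions that are not in the abstract: the names `within_hops` and `bounded_simulation`, the dictionary-based graph encoding, and the use of `None` for an unbounded edge are all hypothetical, and the naive refinement loop is not claimed to be the paper's cubic-time algorithm.

```python
from collections import deque


def within_hops(graph, source, limit):
    """Nodes reachable from source by a nonempty path of at most `limit` hops.

    `limit=None` means the path length is unbounded.
    """
    dist = {}
    queue = deque((nxt, 1) for nxt in graph.get(source, ()))
    while queue:
        node, depth = queue.popleft()
        if node in dist or (limit is not None and depth > limit):
            continue
        dist[node] = depth
        queue.extend((nxt, depth + 1) for nxt in graph.get(node, ()))
    return set(dist)


def bounded_simulation(pattern_preds, pattern_edges, graph, labels):
    """Return the maximum match {pattern node: set of data nodes}, or {} if none.

    pattern_preds: pattern node -> predicate over a data node's label
    pattern_edges: (u, u2) -> hop bound k (or None for unbounded)
    graph:         data node -> list of successor data nodes
    labels:        data node -> label (must list every data node)
    """
    # Start from all data nodes that satisfy each pattern node's predicate.
    sim = {u: {v for v, lab in labels.items() if pred(lab)}
           for u, pred in pattern_preds.items()}
    reach = {}
    changed = True
    while changed:
        changed = False
        for (u, u2), bound in pattern_edges.items():
            for v in list(sim[u]):
                key = (v, bound)
                if key not in reach:
                    reach[key] = within_hops(graph, v, bound)
                # v survives only if some candidate of u2 is within the bound.
                if not (reach[key] & sim[u2]):
                    sim[u].discard(v)
                    changed = True
    if any(not candidates for candidates in sim.values()):
        return {}
    return sim


# Pattern: an "A" node connected to a "B" node within 2 hops.
labels = {"a1": "A", "a2": "A", "x": "X", "y": "X", "z": "X",
          "b1": "B", "b2": "B"}
graph = {"a1": ["x"], "x": ["b1"], "a2": ["y"], "y": ["z"], "z": ["b2"]}
preds = {"A": lambda lab: lab == "A", "B": lambda lab: lab == "B"}
result = bounded_simulation(preds, {("A", "B"): 2}, graph, labels)
print(result)
```

Here `a1` reaches `b1` in two hops and stays in the match, while `a2` needs three hops to reach any `B` node and is pruned. Unlike subgraph isomorphism, the result is a relation, so one pattern node may map to several data nodes.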
Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end off-the-shelf solution to (semi-)automate the detection and the repairing of violations w.r.t. a set of heterogeneous and ad-hoc constraints. In short, there is no commodity platform similar to general purpose DBMSs that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized and easy-to-deploy data cleaning platform....
Classical approaches to clean data have relied on using integrity constraints, statistics, or machine learning. These approaches are known to be limited in the cleaning accuracy, which can usually be improved by consulting master data and involving experts to resolve ambiguity. The advent of knowledge bases (KBs), both general-purpose and within enterprises, and crowdsourcing marketplaces are providing yet more opportunities to achieve higher accuracy at a larger scale. We propose KATARA, a knowledge base and crowd powered data cleaning system that, given a table,...
Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and to possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integrity constraints. Since different types of errors may coexist in the same data set, we often need to run more than one kind of tool. In this paper, we investigate two pragmatic questions: (1) are these tools robust enough...
Data visualization is invaluable for explaining the significance of data to people who are visually oriented. The central task of automatic data visualization is, given a dataset, to visualize its compelling stories by transforming the data (e.g., selecting attributes, grouping and binning values) and deciding the right type of visualization (e.g., bar or line charts). We present DEEPEYE, a novel system that tackles three problems: (1) Visualization recognition: given a visualization, is it "good" or "bad"? (2) Visualization ranking: given two visualizations, which one is "better"? And (3)...
Entity resolution (ER) is a key data integration problem. Despite the efforts in 70+ years in all aspects of ER, there is still a high demand for democratizing ER - humans are heavily involved in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representation of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use...
A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment....
It is increasingly common to find graphs in which edges bear different types, indicating a variety of relationships. For such graphs we propose a class of reachability queries and a class of graph patterns, in which an edge is specified with a regular expression of a certain form, expressing the connectivity in a data graph via edges of various types. In addition, we define graph pattern matching based on a revised notion of graph simulation. On graphs in emerging applications such as social networks, we show that these queries are capable of finding more sensible information than their traditional...
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms,...
Despite the efforts in 70+ years in all aspects of entity resolution (ER), there is still a high demand for democratizing ER - by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use (i.e., much less human efforts). We...
Central to a data cleaning system are record matching and data repairing. Matching aims to identify tuples that refer to the same real-world object, and repairing is to make a database consistent by fixing errors in the data by using constraints. These are treated as separate processes in current data cleaning systems, based on heuristic solutions. This paper studies a new problem, namely, the interaction between record matching and data repairing. We show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, we propose a uniform framework that seamlessly unifies repairing and matching operations,...
Join order selection (JOS) - the problem of finding the optimal join order for an SQL query - is a primary focus of database query optimizers. The problem is hard due to its large solution space. Exhaustively traversing the solution space is prohibitively expensive, so it is often combined with heuristic pruning. Despite decades-long effort, traditional optimizers still suffer from low scalability or low accuracy when handling complicated SQL queries. Recent attempts using deep reinforcement learning (DRL), by encoding join trees with fixed-length handtuned...
Entity matching (EM) is a critical part of data integration. We study how to synthesize entity matching rules from positive-negative matching examples. The core of our solution is program synthesis, a powerful tool to automatically generate rules (or programs) that satisfy a given high-level specification, via a predefined grammar. This grammar describes a General Boolean Formula (GBF) that can include arbitrary attribute matching predicates combined by conjunctions (∧), disjunctions (∨) and negations (¬), and is expressive enough to model EM problems,...
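To make the shape of a General Boolean Formula concrete, the sketch below evaluates one hand-written rule on pairs of records. It covers only rule evaluation, not the synthesis step. Everything in it is an assumption for illustration: the tuple encoding of formulas, the `jaccard` and `exact` similarity functions, the thresholds, and the product records are hypothetical and do not come from the paper.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def exact(a, b):
    """1.0 when the two values are equal, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def evaluate(formula, r1, r2):
    """Evaluate a formula encoded as nested tuples on a record pair.

    ("pred", attribute, similarity_fn, threshold) is an attribute matching
    predicate; ("and", ...), ("or", ...) and ("not", f) combine sub-formulas.
    """
    op = formula[0]
    if op == "pred":
        _, attribute, similarity, threshold = formula
        return similarity(r1[attribute], r2[attribute]) >= threshold
    if op == "not":
        return not evaluate(formula[1], r1, r2)
    if op == "and":
        return all(evaluate(f, r1, r2) for f in formula[1:])
    if op == "or":
        return any(evaluate(f, r1, r2) for f in formula[1:])
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical rule: same brand AND (similar name OR same model number).
rule = ("and",
        ("pred", "brand", exact, 1.0),
        ("or",
         ("pred", "name", jaccard, 0.7),
         ("pred", "model", exact, 1.0)))

a = {"name": "Apple iPhone 12 64GB", "brand": "Apple", "model": "A2403"}
b = {"name": "iPhone 12 Apple 64GB black", "brand": "Apple", "model": "A2403"}
c = {"name": "Apple iPhone 12 case", "brand": "Apple", "model": ""}

print(evaluate(rule, a, b))  # True: same brand, name similarity 0.8
print(evaluate(rule, a, c))  # False: name similarity 0.6, models differ
```

A synthesizer in this setting would search over such formulas for one that accepts the positive example pairs and rejects the negative ones; the evaluator above is the check each candidate would have to pass.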
Towards dependable data repairing with fixing rules. Authors: Jiannan Wang (UC Berkeley, CA, USA), Nan Tang (Qatar Computing Research Institute (QCRI), Doha, Qatar). SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, June 2014, pages 457–468. DOI: https://doi.org/10.1145/2588555.2610494 (published online 18 June 2014)
A graph stream, which refers to the graph with edges being updated sequentially in a form of a stream, has important applications in cyber security and social networks. Due to the sheer volume and highly dynamic nature of graph streams, the practical way of handling them is by summarization. Given a graph stream G, directed or undirected, the problem of graph stream summarization is to summarize G as SG with a much smaller (sublinear) space, linear construction time and constant maintenance cost for each edge update, such that SG allows many queries over G to be approximately conducted...
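One way to picture a summary with constant cost per edge update is a fixed-size matrix indexed by hashed node identifiers, in the spirit of a Count-Min sketch. The code below is a generic sketch under that assumption and is not claimed to be the SG structure of the paper; the class name `GraphSketch`, its parameters `width` and `depth`, and the query methods are hypothetical.

```python
import hashlib


class GraphSketch:
    """Fixed-size summary of a weighted, directed edge stream.

    Each of `depth` rows hashes nodes into `width` buckets and keeps a
    width x width matrix of aggregated edge weights. With non-negative
    weights, estimates can overcount (hash collisions) but never undercount.
    """

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[[0] * width for _ in range(width)]
                       for _ in range(depth)]

    def _bucket(self, node, row):
        # A per-row salt gives each row a different hash function.
        digest = hashlib.blake2b(f"{row}:{node}".encode(),
                                 digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, src, dst, weight=1):
        """Process one stream item; the cost depends only on `depth`."""
        for row in range(self.depth):
            i, j = self._bucket(src, row), self._bucket(dst, row)
            self.tables[row][i][j] += weight

    def edge_weight(self, src, dst):
        """Estimated total weight of the edge src -> dst."""
        return min(
            self.tables[row][self._bucket(src, row)][self._bucket(dst, row)]
            for row in range(self.depth))

    def out_weight(self, src):
        """Estimated total weight of all edges leaving src."""
        return min(sum(self.tables[row][self._bucket(src, row)])
                   for row in range(self.depth))


sketch = GraphSketch(width=64, depth=3)
for src, dst in [("a", "b"), ("a", "b"), ("a", "c"), ("b", "c"), ("a", "b")]:
    sketch.update(src, dst)

print(sketch.edge_weight("a", "b"))  # at least 3
print(sketch.out_weight("a"))        # at least 4
```

Taking the minimum across rows limits the effect of collisions in any single row. The memory used is `depth * width * width` counters regardless of how many edges arrive, which is the trade-off between space and accuracy that summarization makes.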
Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging...
Supporting the translation from natural language (NL) query to visualization (NL2VIS) can simplify the creation of data visualizations because if successful, anyone can generate visualizations by their natural language from the tabular data. The state-of-the-art NL2VIS approaches (e.g., NL4DV and FlowSense) are based on semantic parsers and heuristic algorithms, which are not end-to-end and are not designed for supporting (possibly) complex data transformations. Deep neural network powered neural machine translation models have made great strides in many machine translation tasks, which suggests that they...
Employees that spend more time finding relevant data than analyzing it suffer from a data discovery problem. The large volume of data in enterprises, and sometimes the lack of knowledge of the schemas aggravates this problem. Similar to how we navigate the Web, we propose to identify semantic links that assist analysts in their discovery tasks. These links relate tables to each other, to facilitate navigating the schemas. They also relate data to external data sources, such as ontologies and dictionaries, to help explain the schema meaning. We materialize the links in an enterprise knowledge graph, where they become...
We present Falcon, an interactive, deterministic, and declarative data cleaning system, which uses SQL update queries as the language to repair data. Falcon does not rely on the existence of a set of pre-defined data quality rules. On the contrary, it encourages users to explore the data, identify possible problems, and make updates to fix them. Bootstrapped by one user update, Falcon guesses a set of possible SQL update queries that can be used to repair the data. The main technical challenge addressed in this paper consists in finding a set of SQL update queries that is minimal in size and at the same time fixes the largest...
For real-world time dependent road networks (TDRNs), answering shortest path-based route queries and plans in real-time is highly desirable by many industrial applications. Unfortunately, traditional (Dijkstra- or A*-like) algorithms are computationally expensive for such tasks on TDRNs. Naturally, indexes are needed to meet the real-time constraint required by real applications. In this paper, we propose a novel height-balanced tree-structured index, called TD-G-tree, which supports fast route queries over TDRNs. The key idea is to use...
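For context, the Dijkstra-like baseline that the abstract calls expensive can be sketched as an earliest-arrival search in which each edge's travel time depends on the departure time. This is not the TD-G-tree index. The function name `earliest_arrival`, the graph encoding, and the toy travel-time functions are hypothetical, and the sketch assumes the FIFO property (departing later never lets you arrive earlier), under which this label-setting search is correct.

```python
import heapq


def earliest_arrival(graph, source, target, depart):
    """Earliest arrival time at target when leaving source at time `depart`.

    graph maps a node to a list of (neighbor, travel_time) pairs, where
    travel_time(t) gives the edge's duration for departure time t.
    Returns None when target is unreachable. Assumes FIFO edges.
    """
    best = {source: depart}
    heap = [(depart, source)]
    while heap:
        time, node = heapq.heappop(heap)
        if node == target:
            return time
        if time > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, travel_time in graph.get(node, ()):
            arrival = time + travel_time(time)
            if arrival < best.get(neighbor, float("inf")):
                best[neighbor] = arrival
                heapq.heappush(heap, (arrival, neighbor))
    return None


def congested(t):
    """Toy FIFO profile in minutes: 30 during the peak, easing back to 10."""
    if t < 480:
        return 10
    return max(10, 30 - max(0, t - 520))


def constant(minutes):
    return lambda t: minutes


road = {
    "A": [("B", congested), ("C", constant(5))],
    "C": [("B", constant(12))],
}

print(earliest_arrival(road, "A", "B", 500))  # 517: detour via C during the peak
print(earliest_arrival(road, "A", "B", 600))  # 610: direct road once traffic clears
```

The best route changes with the departure time, which is why a static shortest-path index cannot be reused directly and why each query in this baseline has to explore the network again.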
Creating good visualizations for ordinary users is hard, even with the help of the state-of-the-art interactive data visualization tools, such as Tableau, Qlik, because they require the users to understand the data and visualizations very well. DeepEye is an innovative visualization system that aims at helping everyone create good visualizations simply like a Google search. Given a dataset and a keyword query, DeepEye understands the query intent, generates and ranks good visualizations. The user can pick the one she likes and do a further faceted navigation to easily navigate the candidate visualizations. In this...