- Data Management and Algorithms
- Advanced Database Systems and Queries
- Data Quality and Management
- Data Mining Algorithms and Applications
- Algorithms and Data Compression
- Semantic Web and Ontologies
- Advanced Image and Video Retrieval Techniques
- Data Stream Mining Techniques
- Web Data Mining and Analysis
- Complex Network Analysis Techniques
- Topic Modeling
- Peer-to-Peer Network Technologies
- Privacy-Preserving Technologies in Data
- Video Analysis and Summarization
- Time Series Analysis and Forecasting
- Video Surveillance and Tracking Methods
- Human Mobility and Location-Based Analysis
- Machine Learning and Algorithms
- Anomaly Detection Techniques and Applications
- Distributed Systems and Fault Tolerance
- Mobile Crowdsensing and Crowdsourcing
- Machine Learning and Data Classification
- Opinion Dynamics and Social Influence
- Multimodal Machine Learning Applications
- Human Pose and Action Recognition
University of Toronto
2015-2025
Hong Kong Baptist University
2021
New Jersey Institute of Technology
2021
Athens University of Economics and Business
2021
The University of Texas at Arlington
2019
Center for Information Technology
2008
National University of Singapore
2008
Information Technology University
2007
Institute of Electrical and Electronics Engineers
2006
AT&T (United States)
2000-2005
XML employs a tree-structured data model, and, naturally, XML queries specify patterns of selection predicates on multiple elements related by a tree structure. Finding all occurrences of such a twig pattern in an XML database is a core operation for XML query processing. Prior work has typically decomposed the twig pattern into binary structural (parent-child and ancestor-descendant) relationships, and twig matching is achieved by: (i) using structural join algorithms to match the binary relationships against the XML database, and (ii) stitching together these basic...
We present TwitterMonitor, a system that performs trend detection over the Twitter stream. The system identifies emerging topics (i.e., 'trends') on Twitter in real time and provides meaningful analytics that synthesize an accurate description of each topic. Users interact with the system by ordering the identified trends using different criteria and submitting their own description for each trend.
XML queries typically specify patterns of selection predicates on multiple elements that have some specified tree structured relationships. The primitive tree structured relationships are parent-child and ancestor-descendant, and finding all occurrences of these relationships in an XML database is a core operation for XML query processing. We develop two families of structural join algorithms for this task: tree-merge and stack-tree. The tree-merge algorithms are a natural extension of traditional merge joins and the multi-predicate merge joins, while the stack-tree algorithms have no counterpart in relational...
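The ancestor-descendant join described in the abstract above can be illustrated with a small stack-based sketch. This is not the authors' code: it assumes each element is encoded as a hypothetical `(start, end)` interval taken from a document-order traversal, that intervals from one document are properly nested, and that both input lists are sorted by `start`. The function name `stack_tree_desc` is illustrative.

```python
def stack_tree_desc(ancestors, descendants):
    """Return (ancestor, descendant) pairs, ordered by descendant.

    Each element is an assumed (start, end) interval; `a` contains `d`
    when a.start < d.start and d.end < a.end. Both lists must be sorted
    by start, and all intervals must be properly nested.
    """
    pairs = []
    stack = []  # currently "open" ancestors, outermost at the bottom
    a = d = 0
    while d < len(descendants) and (a < len(ancestors) or stack):
        take_ancestor = (
            a < len(ancestors) and ancestors[a][0] < descendants[d][0]
        )
        node = ancestors[a] if take_ancestor else descendants[d]
        # Drop ancestors that closed before the current node starts.
        while stack and stack[-1][1] < node[0]:
            stack.pop()
        if take_ancestor:
            stack.append(node)
            a += 1
        else:
            # Every ancestor still on the stack contains this descendant.
            pairs.extend((anc, node) for anc in stack)
            d += 1
    return pairs


if __name__ == "__main__":
    anc = [(1, 10), (2, 5), (11, 14)]
    desc = [(3, 4), (6, 7), (12, 13)]
    for pair in stack_tree_desc(anc, desc):
        print(pair)
```

Each input list is scanned once, and the stack never holds more entries than the depth of nesting, which is the property that distinguishes this style of join from a nested-loop comparison of all pairs.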
Privacy is a serious concern when microdata need to be released for ad hoc analyses. The privacy goals of existing privacy protection approaches (e.g., k-anonymity and l-diversity) are suitable only for categorical sensitive attributes. Since applying them directly to numerical sensitive attributes (e.g., salary) may result in undesirable information leakage, we propose privacy goals that better capture the privacy needs of numerical sensitive attributes. Complementing the desire for privacy is the need to support ad hoc aggregate analyses over microdata. Existing generalization-based anonymization approaches cannot answer aggregate queries with...
Many location-based applications require constant monitoring of k-nearest neighbor (k-NN) queries over moving objects within a geographic area. Existing approaches to this problem have focused on predictive queries, and relied on the assumption that the trajectories of the objects are fully predictable at query processing time. We relax this assumption, and propose two efficient and scalable algorithms using grid indices. One is based on indexing objects, and the other on indexing queries. For each approach, a cost model is developed, and a detailed analysis...
This tutorial provides a comprehensive and cohesive overview of the key research results in the area of record linkage methodologies and algorithms for identifying approximate duplicate records, and of the available tools for this purpose. It encompasses techniques introduced in several communities including databases, information retrieval, statistics and machine learning. It aims to identify similarities and differences across the techniques as well as their merits and limitations.
Histograms have been used widely to capture data distribution, to represent the data by a small number of step functions. Dynamic programming algorithms which provide optimal construction of these histograms exist, albeit running in quadratic time and linear space. In this paper we provide linear time construction of a (1 + ε) approximation of optimal histograms, running in polylogarithmic space.
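For context, the classical dynamic program that the abstract above takes as its baseline can be sketched as follows. This is the exact quadratic-time construction, not the paper's approximation algorithm. It assumes the error measure is the sum of squared deviations from each bucket's mean, which the abstract does not state, and the function name `optimal_histogram` is illustrative. For readability the sketch keeps a full table, so it uses more than the linear space mentioned in the abstract.

```python
def optimal_histogram(values, buckets):
    """Exact DP: split `values` into `buckets` contiguous runs that
    minimise total squared error (assumed error metric).

    Returns (total_error, [(start, end), ...]) with half-open ranges.
    Runs in O(buckets * n^2) time.
    """
    n = len(values)
    if not 1 <= buckets <= n:
        raise ValueError("need 1 <= buckets <= len(values)")

    # Prefix sums let us compute the squared error of any run in O(1).
    prefix, prefix_sq = [0.0], [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def run_error(i, j):
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(buckets + 1)]
    cost[0][0] = 0.0
    for b in range(1, buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):  # last bucket covers values[i:j]
                candidate = cost[b - 1][i] + run_error(i, j)
                if candidate < cost[b][j]:
                    cost[b][j], cut[b][j] = candidate, i

    # Walk the cut table backwards to recover bucket boundaries.
    bounds, j = [], n
    for b in range(buckets, 0, -1):
        i = cut[b][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return cost[buckets][n], bounds


if __name__ == "__main__":
    print(optimal_histogram([1, 1, 1, 10, 10, 10], 2))
```

The inner loop over every possible start of the last bucket is what makes the construction quadratic in the input length, which motivates approximation schemes for large inputs and streams.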
Users often need to optimize the selection of objects by appropriately weighting the importance of multiple object attributes. Such optimization problems appear often in operations research and applied mathematics as well as in everyday life; e.g., a buyer may select a home as a weighted function of a number of attributes like its distance from the office, its price, its area, etc.
Large-scale data analysis lies at the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value. MapReduce has been a popular framework in the context of cloud computing, designed to serve long running queries (jobs) which can be processed in batch mode. Taking into account that different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces...
Histograms are a concise and flexible way to construct summary structures for large data sets. They have attracted a lot of attention in database research due to their utility in many areas, including query optimization and approximate query answering. They are also a basic tool for data visualization and analysis. In this paper, we present a formal study of dynamic multidimensional histogram structures over continuous data streams. At the heart of our proposal is the use of a dynamic summary data structure (vastly different from a histogram) maintaining a succinct approximation...
We investigate the use of biased sampling according to the density of the data set to speed up the operation of general data mining tasks, such as clustering and outlier detection in large multidimensional data sets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the data set. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach...
Query monitoring refers to the problem of observing and predicting various parameters related to the execution of a query in a database system. In addition to being a useful tool for database users and administrators, it can also serve as an information collection service for resource allocation and adaptive query processing techniques. In this article, we present a query monitoring system from the ground up, describing various new techniques for query monitoring, their implementation inside a real database system, and a novel interface that presents the observed and predicted parameters in an accessible manner. To...
Recent years have witnessed an unprecedented proliferation of social media. People around the globe author, every day, millions of blog posts, micro-blog posts, social network status updates, etc. This rich stream of information can be used to identify, on an ongoing basis, emerging stories and events that capture popular attention. Stories can be identified via groups of tightly-coupled real-world entities, namely the people, locations, products, etc., that are involved in the story. The sheer scale and rapid evolution of the data involved necessitate...
Recent works have shown the benefits of keyword proximity search in querying XML documents in addition to text documents. For example, given query keywords over Shakespeare's plays in XML, the user might be interested in knowing how the keywords cooccur. In this paper, we focus on XML trees and define XML keyword proximity queries to return the (possibly heterogeneous) set of minimum connecting trees (MCTs) of the matches to the individual keywords in the query. We consider the problem of efficiently executing keyword proximity queries on labeled trees (XML) in various settings: 1) when the XML database has been preprocessed and 2) when no indices...
The problem of obtaining efficient answers to top-k queries has attracted a lot of research attention. Several algorithms and numerous variants of the top-k retrieval problem have been introduced in recent years. The general form of this problem requests the k highest ranked values from a relation, using monotone combining functions on (a subset of) its attributes. In this paper we explore space performance tradeoffs related to this problem. In particular we study the problem of answering top-k queries using views. A view in this context is a materialized version of a previously posed query,...
Selectivity estimation, the problem of estimating the result size of queries, is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor cardinality estimates could result in the selection of bad plans by the query optimizer. Recently, deep learning has been applied to this problem with promising results. However, many of the proposed approaches often struggle to provide accurate results for multi-attribute queries involving a large number of predicates and with low selectivity. In this paper, we propose...
In this paper we address the issue of using local embeddings for data visualization in two and three dimensions, and for classification. We advocate their use on the basis that they provide an efficient mapping procedure from the original dimension of the data to a lower intrinsic dimension. We depict how they can accurately capture the user's perception of similarity in high-dimensional data for visualization purposes. Moreover, we exploit the low-dimensional mapping provided by these embeddings to develop new classification techniques, and we show experimentally that the classification accuracy is...
The integration of data produced and collected across autonomous, heterogeneous web services is an increasingly important and challenging problem. Due to the lack of global identifiers, the same entity (e.g., a product) might have different textual representations across databases. Textual data is also often noisy because of transcription errors, incomplete information, and lack of standard formats. A fundamental task during data integration is matching of strings that refer to the same entity. In this paper, we adopt the widely used and established cosine similarity metric...
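As a rough illustration of the cosine similarity metric named in the abstract above, the sketch below compares two strings through vectors of character q-gram counts. The tokenisation (lowercased character 3-grams) and the raw-count weighting are assumptions made for this example, since the truncated abstract does not say how strings are turned into vectors. The names `qgram_counts` and `cosine_similarity` are illustrative.

```python
import math
from collections import Counter


def qgram_counts(text, q=3):
    """Count overlapping character q-grams (assumed tokenisation)."""
    s = text.lower()
    if len(s) < q:
        return Counter([s]) if s else Counter()
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def cosine_similarity(a, b, q=3):
    """Cosine of the angle between the q-gram count vectors of a and b."""
    va, vb = qgram_counts(a, q), qgram_counts(b, q)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("Apple iPhone 12", "apple iphone12"))
    print(cosine_similarity("Apple iPhone 12", "Samsung Galaxy"))
```

Two spellings of the same product share most of their q-grams and score close to 1, while unrelated strings share few or none and score close to 0, so a threshold on the score can flag likely matches.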
XML is widely recognized as the data interchange standard for tomorrow, because of its ability to represent data from a wide variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated. In this paper we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently approximate match in structure, in addition to content...