- Advanced Database Systems and Queries
- Data Quality and Management
- Data Management and Algorithms
- Data Mining Algorithms and Applications
- Semantic Web and Ontologies
- Data Stream Mining Techniques
- Distributed systems and fault tolerance
- Blockchain Technology Applications and Security
- Topic Modeling
- Cloud Computing and Resource Management
- Privacy-Preserving Technologies in Data
- Natural Language Processing Techniques
- Advanced Data Storage Technologies
- Smart Grid Energy Management
- Graph Theory and Algorithms
- Electric Vehicles and Infrastructure
- Cryptography and Data Security
- Hate Speech and Cyberbullying Detection
- Misinformation and Its Impacts
- Explainable Artificial Intelligence (XAI)
- Complex Network Analysis Techniques
- Scientific Computing and Data Management
- Sentiment Analysis and Opinion Mining
- Advanced Graph Neural Networks
- Social Media and Politics
University of Waterloo
2015-2024
University of Cambridge
2023
Linköping University
2023
National Institute of Informatics
2023
Association for Computing Machinery
2023
Chinese University of Hong Kong
2023
Oracle (United States)
2023
Aalborg University
2019
University of Windsor
2018
AT&T (United States)
2008-2012
Traditional databases store sets of relatively static records with no pre-defined notion of time, unless timestamp attributes are explicitly added. While this model adequately represents commercial catalogues or repositories of personal information, many current and emerging applications require support for on-line analysis of rapidly changing data streams. Limitations of traditional DBMSs in supporting streaming applications have been recognized, prompting research to augment existing technologies and build new systems...
Blockchain technologies are expected to make a significant impact on a variety of industries. However, one issue holding them back is their limited transaction throughput, especially compared to established solutions such as distributed database systems. In this paper, we re-architect a modern permissioned blockchain system, Hyperledger Fabric, to increase transaction throughput from 3,000 to 20,000 transactions per second. We focus on performance bottlenecks beyond the consensus mechanism, and propose architectural...
Blockchain technologies are expected to make a significant impact on a variety of industries. However, one issue holding them back is their limited transaction throughput, especially compared to established solutions such as distributed database systems. In this paper, we rearchitect a modern permissioned blockchain system, Hyperledger Fabric, to increase transaction throughput from 3000 to 20 000 transactions per second. We focus on performance bottlenecks beyond the consensus mechanism, and propose...
Conditional functional dependencies (CFDs) have recently been proposed as a useful integrity constraint to summarize data semantics and identify data inconsistencies. A CFD augments a functional dependency (FD) with a pattern tableau that defines the context (i.e., the subset of tuples) in which the underlying FD holds. While many aspects of CFDs have been studied, including static analysis and detecting and repairing violations, there has not been prior work on generating pattern tableaux, which is critical to realize the full potential of CFDs. This paper first...
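To illustrate the CFD definition in the abstract above, here is a minimal sketch of checking an FD only on the tuples selected by a pattern tableau. It is not the paper's tableau-generation method. The function names, the `_` wildcard convention, the restriction of patterns to left-hand-side attributes, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed convention: matches any value


def matches(row, pattern):
    """True if the row agrees with every non-wildcard entry of the pattern."""
    return all(v == WILDCARD or row[a] == v for a, v in pattern.items())


def cfd_violations(rows, lhs, rhs, tableau):
    """Check the FD lhs -> rhs, but only on rows matched by a tableau pattern.

    Returns (pattern, lhs_values, conflicting_rhs_values) for every group of
    matched rows that agree on lhs but disagree on rhs.
    """
    found = []
    for pattern in tableau:
        groups = defaultdict(set)
        for row in rows:
            if matches(row, pattern):
                key = tuple(row[a] for a in lhs)
                groups[key].add(row[rhs])
        for key, values in groups.items():
            if len(values) > 1:
                found.append((pattern, key, sorted(values)))
    return found


# Hypothetical data: the FD [country, zip] -> street is asserted only for UK rows.
rows = [
    {"country": "UK", "zip": "EH4", "street": "Mayfield"},
    {"country": "UK", "zip": "EH4", "street": "Crichton"},
    {"country": "US", "zip": "07974", "street": "Mountain Ave"},
    {"country": "US", "zip": "07974", "street": "Main St"},
]
tableau = [{"country": "UK", "zip": WILDCARD}]
violations = cfd_violations(rows, ["country", "zip"], "street", tableau)
```

In this sketch the two US rows disagree on `street` but are not reported, because no pattern in the tableau selects them; only the UK group is treated as a violation.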
Internet traffic patterns are believed to obey the power law, implying that most of the bandwidth is consumed by a small set of heavy users. Hence, queries that return a list of frequently occurring items are important in the analysis of real-time Internet packet streams. While several results exist for computing frequent item queries using limited memory in the infinite stream model, in this paper we consider the limited-memory sliding window model. This model maintains the last $N$ items that have arrived at any given time and forbids the storage of the entire window in memory. We...
Violations of functional dependencies (FDs) are common in practice, often arising in the context of data integration or Web data extraction. Resolving these violations is known to be challenging for a variety of reasons, one of them being the exponential number of possible "repairs". Previous work has tackled this problem either by producing a single repair that is (nearly) optimal with respect to some metric, or by computing consistent answers to selected classes of queries without explicitly generating the repairs. In this paper, we propose...
We describe DataDepot, a tool for generating warehouses from streaming data feeds, such as network-traffic traces, router alerts, financial tickers, transaction logs, and so on. DataDepot is a streaming data warehouse designed to automate the ingestion of streaming data from a wide variety of sources and to maintain complex materialized views over these sources. As a streaming warehouse, DataDepot is similar to Data Stream Management Systems (DSMSs) with its emphasis on temporal data, best-effort consistency, and real-time response. However, DataDepot is designed to store tens to hundreds of terabytes...
We study sequential dependencies that express the semantics of data with ordered domains and help identify quality problems with such data. Given an interval g, we write X →_g Y to denote that the difference between the Y-attribute values of any two consecutive records, when sorted on X, must be in g. For example, time →_(0,∞) sequence_number indicates that sequence numbers are strictly increasing over time, whereas sequence_number →_[4, 5] time means that the time "gaps" between consecutive sequence numbers are between 4 and 5. Sequential dependencies express relationships between ordered attributes, and identify missing (gaps too large), extraneous (gaps too small)...
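The sequential dependency definition above can be sketched as a direct check: sort on X, take the difference in Y between each pair of consecutive records, and test it against the interval g. This is a minimal illustration only, not the paper's algorithm. The function name, the representation of g as a predicate, and the sample records are assumptions.

```python
def sd_violations(records, x, y, in_interval):
    """Return consecutive record pairs (sorted on x) whose y-difference is outside g.

    `in_interval` is a predicate standing in for the interval g,
    e.g. lambda d: d > 0 for g = (0, infinity).
    """
    ordered = sorted(records, key=lambda r: r[x])
    bad = []
    for prev, cur in zip(ordered, ordered[1:]):
        if not in_interval(cur[y] - prev[y]):
            bad.append((prev, cur))
    return bad


# Hypothetical records; the third one repeats a sequence number.
rows = [
    {"time": 1, "sequence_number": 10},
    {"time": 2, "sequence_number": 11},
    {"time": 3, "sequence_number": 11},
    {"time": 4, "sequence_number": 15},
]

# time ->_(0, inf) sequence_number: sequence numbers strictly increase over time.
violations = sd_violations(rows, "time", "sequence_number", lambda d: d > 0)
```

Passing a different predicate, such as `lambda d: 4 <= d <= 5`, checks a bounded interval like [4, 5] with the same function.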
Functional dependencies (FDs) specify the intended data semantics while violations of FDs indicate deviation from these semantics. In this paper, we study a data cleaning problem in which the FDs may not be completely correct, e.g., due to data evolution or incomplete knowledge of the data semantics. We argue that the notion of relative trust is a crucial aspect of this problem: if the FDs are outdated, we should modify them to fit the data, but if we suspect that there are problems with the data, we should modify the data to fit the FDs. In practice, it is usually unclear how much to trust the data versus the FDs. To address this problem, we propose an algorithm for...
In this paper, we solve the following data summarization problem: given a multi-dimensional data set augmented with a binary attribute, how can we construct an interpretable and informative summary of the factors affecting the binary attribute in terms of the combinations of values of the dimension attributes? We refer to such summaries as explanation tables. We show the hardness of constructing optimally-informative explanation tables from data, and we propose effective and efficient heuristics. The proposed heuristics are based on sampling and include optimizations...
Smart electricity meters have been replacing conventional meters worldwide, enabling automated collection of fine-grained (e.g., every 15 minutes or hourly) consumption data. A variety of smart meter analytics algorithms and applications have been proposed, mainly in the smart grid literature. However, the focus has been on what can be done with the data rather than how to do it efficiently. In this article, we examine smart meter analytics from a software performance perspective. First, we design a performance benchmark that includes common smart meter analytics tasks. These include...
We present a novel iterative, edit-based approach to unsupervised sentence simplification. Our model is guided by a scoring function involving fluency, simplicity, and meaning preservation. Then, we iteratively perform word and phrase-level edits on the complex sentence. Compared with previous approaches, our model does not require a parallel training set, but is more controllable and interpretable. Experiments on Newsela and WikiLarge datasets show that our approach is nearly as effective as state-of-the-art supervised approaches.
This paper discusses updating a data warehouse that collects near-real-time data streams from a variety of external sources. The objective is to keep all the tables and materialized views up-to-date as new data arrive over time. We define the notion of staleness, formalize the problem of scheduling updates in a way that minimizes average staleness, and present scheduling algorithms designed to handle the complex environment of a real-time stream warehouse. A novel feature of our scheduling framework is that it considers the effect of an update on the staleness of the underlying tables rather than any...
The complexity of the Internet has rapidly increased, making it more important and challenging to design scalable network monitoring tools. Network monitoring typically requires rolling data analysis, i.e., continuously and incrementally updating (rolling-over) various reports and statistics over high-volume data streams. In this paper, we describe DBStream, which is an SQL-based system that explicitly supports incremental queries for rolling data analysis. We also present a performance comparison of DBStream with a parallel...
Integrity constraints (ICs) are useful for query optimization and for expressing and enforcing application semantics. However, formulating constraints manually requires domain expertise, is prone to human errors, and may be excessively time consuming, especially on large datasets. Hence, proposals for automatic discovery have been made for some classes of ICs, such as functional dependencies (FDs), and recently, order dependencies (ODs). ODs properly subsume FDs, as they can additionally express business rules involving order; e.g., an...
A defining characteristic of continuous queries over on-line data streams, possibly bounded by sliding windows, is the potentially infinite and time-evolving nature of their inputs and outputs. New items continually arrive on the input streams and new results are continually produced. Additionally, inputs expire by falling out of the range of their sliding windows, and results expire when they cease to satisfy the query. This impacts continuous query processing in two ways. First, data stream systems allow tables to be queried alongside data streams, but in terms of query semantics, it is not clear how updates are different...
With the widespread use of shared-nothing clusters of servers, there has been a proliferation of distributed object stores that offer high availability, reliability and enhanced performance for MapReduce-style workloads. However, data-intensive scientific workflows and join-intensive queries cannot always be evaluated efficiently using MapReduce-style processing without extensive data migrations, which cause network congestion and reduced query throughput. In this paper, we study the problem of computing data placement strategies...
Performance and scalability are major concerns for blockchains: permissionless systems are typically limited by slow proof of X consensus algorithms and sequential post-order transaction execution on every node of the network. By introducing a small amount of trust in their participants, permissioned blockchain systems such as Hyperledger Fabric can benefit from more efficient consensus algorithms and make use of parallel pre-order execution on a subset of network nodes. Fabric, in particular, has been shown to handle tens of thousands of transactions per second....
We present the Multi-Modal Discussion Transformer (mDT), a novel method for detecting hate speech on online social networks such as Reddit discussions. In contrast to traditional comment-only methods, our approach to labelling a comment involves a holistic analysis of text and images grounded in the discussion context. This is done by leveraging graph transformers to capture the contextual relationships in the discussion surrounding a comment and grounding the interwoven fusion layers that combine text and image embeddings instead of processing modalities...
We discuss update scheduling in streaming data warehouses, which combine the features of traditional data warehouses and data stream systems. In our setting, external sources push append-only data streams into the warehouse with a wide range of interarrival times. While traditional data warehouses are typically refreshed during downtimes, streaming warehouses are updated as new data arrive. We model the streaming warehouse update problem as a scheduling problem, where jobs correspond to processes that load new data into tables, and whose objective is to minimize data staleness over time (at time t, if a table has been updated with information up to some earlier time r,...