- Advanced Database Systems and Queries
- Cloud Computing and Resource Management
- Advanced Data Storage Technologies
- Distributed Systems and Fault Tolerance
- Data Quality and Management
- Scientific Computing and Data Management
- Data Management and Algorithms
- Distributed and Parallel Computing Systems
- Parallel Computing and Optimization Techniques
- Time Series Analysis and Forecasting
- Algorithms and Data Compression
- Research Data Management Practices
- Anomaly Detection Techniques and Applications
- Privacy-Preserving Technologies in Data
- Personal Information Management and User Behavior
- Blockchain Technology Applications and Security
- Advanced Image and Video Retrieval Techniques
- Fault Detection and Control Systems
- Petri Nets in System Modeling
- Semantic Web and Ontologies
- Visual Attention and Saliency Detection
- Real-Time Systems Scheduling
- Adversarial Robustness in Machine Learning
- Data Stream Mining Techniques
- Advanced Vision and Imaging
University of Chicago
2015-2024
University of Illinois Chicago
2017-2021
University of Washington
2018
Portland State University
2018
University of California, Santa Barbara
2010-2013
Frostburg State University
2013
This paper presents a new view of federated databases to address the growing need for managing information that spans multiple data models. This trend is fueled by a proliferation of storage engines and query languages based on the observation that "no one size fits all". To address this shift, we propose a polystore architecture; it is designed to unify querying over multiple storage engines. We consider the challenges and opportunities associated with polystores. Open questions in this space revolve around query optimization and the assignment of objects to storage engines. We introduce our...
Multitenant data infrastructures for large cloud platforms hosting hundreds of thousands of applications face the challenge of serving applications characterized by a small footprint and unpredictable load patterns. When such a platform is built on an elastic pay-per-use infrastructure, the goal is to minimize the system's operating cost while guaranteeing tenants' service level agreements (SLAs). Elastic load balancing is therefore an important feature: it enables scaling up during periods of high load and scaling down when load is low. Live migration, a technique...
On-line transaction processing (OLTP) database management systems (DBMSs) often serve time-varying workloads due to daily, weekly, or seasonal fluctuations in demand, or because of rapid growth in demand due to a company's business success. In addition, many OLTP workloads are heavily skewed toward "hot" tuples or ranges of tuples. For example, the majority of NYSE trading volume involves only 40 stocks. To deal with such fluctuations, an OLTP DBMS needs to be elastic; that is, it must be able to expand and contract its resources in response to load dynamically...
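The workload skew described above can be made concrete with a toy sketch (my illustration, not a technique from the paper): identify the smallest set of "hot" keys that covers most of an access log by simple counting.

```python
from collections import Counter

def hot_tuples(access_log, threshold=0.8):
    """Return the smallest set of keys covering `threshold` of all accesses.

    In a skewed OLTP workload, most accesses concentrate on a few keys,
    so this set stays tiny relative to the full key space.
    """
    counts = Counter(access_log)
    total = len(access_log)
    covered, hot = 0, []
    for key, n in counts.most_common():
        hot.append(key)
        covered += n
        if covered / total >= threshold:
            break
    return hot

# Skewed log: one key dominates, mirroring the "40 hot stocks" observation.
log = ["A"] * 80 + ["B"] * 10 + ["C"] * 10
print(hot_tuples(log))  # → ['A']
```

An elastic DBMS could use a statistic like this to decide which partitions to split or migrate as load shifts.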
This paper presents BigDAWG, a reference implementation of a new architecture for "Big Data" applications. Such applications not only call for large-scale analytics, but also for real-time streaming support, smaller analytics at interactive speeds, data visualization, and cross-storage-system queries. Guided by the principle that "one size does not fit all", we build on top of a variety of storage engines, each designed for a specialized use case. To illustrate the promise of this approach, we demonstrate its effectiveness...
Anomaly detection (AD) is a fundamental task for time-series analytics with important implications for the downstream performance of many applications. In contrast to other domains where AD mainly focuses on point-based anomalies (i.e., outliers in standalone observations), in time series AD is also concerned with range-based anomalies (i.e., anomalies spanning multiple observations). Nevertheless, it is common to use traditional information retrieval measures, such as Precision, Recall, and F-score, to assess the quality of AD methods by thresholding...
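A quick sketch (my own, not from the paper) shows why point-based Precision/Recall can mislead on range anomalies: a detector that flags a single point of a long anomalous range scores perfect precision while missing most of the range.

```python
def point_prf(labels, preds):
    """Classic point-wise precision, recall, and F1 over binary sequences."""
    tp = sum(1 for l, p in zip(labels, preds) if l and p)
    fp = sum(1 for l, p in zip(labels, preds) if not l and p)
    fn = sum(1 for l, p in zip(labels, preds) if l and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A 5-point range anomaly; the detector flags only one point inside it.
labels = [0, 1, 1, 1, 1, 1, 0, 0]
preds  = [0, 0, 0, 1, 0, 0, 0, 0]
p, r, f = point_prf(labels, preds)
print(p, r)  # precision is a perfect 1.0, yet recall is only 0.2
```

Range-aware measures instead credit (or penalize) detections at the level of anomalous ranges rather than individual points.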
Large language models (LLMs), such as GPT-4, are revolutionizing software's ability to understand, process, and synthesize language. The authors of this paper believe that this advance in technology is significant enough to prompt introspection in the data management community, similar to previous technological disruptions such as the advents of the world wide web, cloud computing, and statistical machine learning. We argue that the disruptive influence LLMs will have on data management comes from two angles. (1) A number of hard database problems, namely,...
Transaction processing database management systems (DBMSs) are critical for today's data-intensive applications because they enable an organization to quickly ingest and query new information. Many of these applications exceed the capabilities of a single server, thus their databases have to be deployed in a distributed DBMS. The key factor affecting such a system's performance is how the database is partitioned. If the database is partitioned incorrectly, the number of distributed transactions can be high. These transactions have to synchronize their operations over the network, which considerably...
Organizations are often faced with the challenge of providing data management solutions for large, heterogeneous datasets that may have different underlying data and programming models. For example, a medical dataset may contain unstructured text, relational data, time-series waveforms, and imagery. Trying to fit such datasets into a single system can have adverse performance and efficiency effects. As part of the Intel Science and Technology Center on Big Data, we are developing a polystore system designed for such problems. BigDAWG (short for Big Data Analytics Working...
We present a framework for concurrency control and availability in multi-datacenter datastores. While we consider Google's Megastore as our motivating example, we define general abstractions for key components, making our solution extensible to any system that satisfies the abstraction properties. We first develop and analyze a transaction management and replication protocol based on a straightforward implementation of the Paxos algorithm. Our investigation reveals that this protocol acts as a concurrency prevention mechanism rather than a concurrency control mechanism....
For data-intensive applications with many concurrent users, modern distributed main memory database management systems (DBMSs) provide the necessary scale-out support beyond what is possible with single-node systems. These DBMSs are optimized for the short-lived transactions that are common in on-line transaction processing (OLTP) workloads. One way they achieve this is to partition the data into disjoint subsets and use a single-threaded transaction manager per partition that executes transactions one at a time in serial order. This minimizes the overhead of...
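The partitioned, single-threaded execution model described above can be sketched in a few lines (a toy illustration under my own naming, not the paper's system): each partition owns a disjoint key subset and runs transactions serially, so no latching is needed within a partition.

```python
class Partition:
    """A single-threaded partition: transactions execute one at a time,
    so no locks or latches are needed inside the partition."""
    def __init__(self):
        self.store = {}

    def execute(self, txn):
        return txn(self.store)

class PartitionedDB:
    """Hash-partitions keys across partitions; single-partition
    transactions avoid any cross-partition (network) coordination."""
    def __init__(self, n_parts):
        self.parts = [Partition() for _ in range(n_parts)]

    def route(self, key):
        return self.parts[hash(key) % len(self.parts)]

    def run(self, key, txn):
        return self.route(key).execute(txn)

db = PartitionedDB(4)
db.run("acct-1", lambda s: s.update(balance=100))
print(db.run("acct-1", lambda s: s["balance"]))  # → 100
```

The trade-off the abstract hints at: this design is extremely fast for single-partition transactions but pays a coordination cost whenever a transaction touches multiple partitions.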
While there have been many solutions proposed for storing and analyzing large volumes of data, all of these offer limited support for collaborative data analytics, especially given that individuals and teams are simultaneously analyzing, modifying, and exchanging datasets, employing a number of heterogeneous tools or languages for analysis, and writing scripts to clean, preprocess, and query data. We demonstrate DataHub, a unified platform with the ability to load, store, query, collaboratively analyze, interactively visualize, and interface...
Distance measures are core building blocks in time-series analysis and have been the subject of active research for decades. Unfortunately, the most detailed experimental study in this area is outdated (over a decade old) and, naturally, does not reflect recent progress. Importantly, that study (i) omitted multiple distance measures, including a classic measure in the literature; (ii) considered only a single normalization method; and (iii) reported raw classification error rates without statistically validating the findings, resulting...
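Why the normalization choice matters can be shown with a minimal sketch (my illustration, not an experiment from the paper): two series with the same shape but different scales look far apart under raw Euclidean distance, yet identical after z-normalization.

```python
import math

def znorm(series):
    """Z-normalize: subtract the mean, divide by the standard deviation.
    Different normalizations can change which distance measure 'wins',
    which is why single-normalization comparisons are fragile."""
    mu = sum(series) / len(series)
    sd = math.sqrt(sum((x - mu) ** 2 for x in series) / len(series))
    return [(x - mu) / sd for x in series] if sd else [0.0] * len(series)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]   # same shape, 10x the scale
print(euclidean(a, b))                 # large raw distance
print(euclidean(znorm(a), znorm(b)))   # collapses to ~0 after normalization
```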
A multitenant database management system (DBMS) in the cloud must continuously monitor the trade-off between efficient resource sharing among multiple application databases (tenants) and their performance. Considering the scale of hundreds to thousands of tenants in such DBMSs, manual approaches for continuous monitoring are not tenable. A self-managing controller for such a DBMS faces several challenges. For instance, it must characterize a tenant given the variety of workloads, reduce the impact of colocation, and detect...
Modern data-intensive applications often generate large amounts of low-precision float data with a limited range of values. Despite the prevalence of such data, there is a lack of an effective solution to ingest, store, and analyze bounded, low-precision, numeric data. To address this gap, we propose Buff, a new compression technique that uses decomposed columnar storage and encoding methods to provide better compression, fast ingestion, and high-speed in-situ adaptive query operators with SIMD support.
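The general idea of decomposed storage for bounded, low-precision floats can be sketched as follows. This is not Buff's actual encoding, just a simple fixed-point, byte-sliced illustration of the same spirit: exploit the bounded range and limited precision to store each value in far fewer bytes than a raw IEEE 754 double.

```python
def encode(values, precision=2):
    """Store bounded, low-precision floats as fixed-point integers,
    offset by the minimum, then split into byte planes (a columnar
    decomposition). `precision` is the number of decimal digits kept."""
    scale = 10 ** precision
    ints = [round(v * scale) for v in values]
    lo = min(ints)
    deltas = [i - lo for i in ints]                    # non-negative, bounded
    width = max(1, (max(deltas).bit_length() + 7) // 8)
    planes = [bytes((d >> (8 * b)) & 0xFF for d in deltas) for b in range(width)]
    return lo, scale, planes

def decode(lo, scale, planes):
    n = len(planes[0])
    deltas = [sum(planes[b][i] << (8 * b) for b in range(len(planes)))
              for i in range(n)]
    return [(d + lo) / scale for d in deltas]

vals = [12.34, 12.35, 12.30, 12.99]
lo, scale, planes = encode(vals)
print(len(planes))                 # one byte plane suffices for this range
print(decode(lo, scale, planes))   # round-trips the original values
```

Because queries can often be answered from the high-order byte planes alone (e.g., coarse range predicates), a decomposition like this also enables in-situ filtering before full decoding.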
With the explosive growth of high-dimensional data, approximate methods have emerged as promising solutions for nearest neighbor search. Among the alternatives, quantization methods have gained attention due to their fast query responses and low encoding and storage costs. Quantization methods decompose the data dimensions into non-overlapping subspaces and encode each subspace using a different dictionary. The state-of-the-art approach assigns dictionary sizes uniformly across subspaces while attempting to balance the relative importance of the subspaces....
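The subspace-decomposition idea reads cleanly as code. Below is a minimal product-quantization sketch with hand-picked toy codebooks (real systems learn them, e.g., with k-means); it uses uniform dictionary sizes per subspace, which is exactly the uniform-assignment assumption the abstract questions.

```python
def pq_encode(vec, codebooks):
    """Split `vec` into len(codebooks) equal subspaces and encode each
    chunk as the index of its nearest centroid in that subspace's
    codebook. The code is tiny: one small integer per subspace."""
    d = len(vec) // len(codebooks)
    code = []
    for s, book in enumerate(codebooks):
        chunk = vec[s * d:(s + 1) * d]
        dists = [sum((a - b) ** 2 for a, b in zip(chunk, c)) for c in book]
        code.append(dists.index(min(dists)))
    return code

def pq_decode(code, codebooks):
    """Reconstruct an approximation by concatenating chosen centroids."""
    out = []
    for s, idx in enumerate(code):
        out.extend(codebooks[s][idx])
    return out

# Two 2-D subspaces, two centroids each (toy, hand-picked codebooks).
books = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 1.0), (1.0, 0.0)],
]
code = pq_encode([0.9, 1.1, 0.1, 0.8], books)
print(code)                     # → [1, 0]
print(pq_decode(code, books))   # → [1.0, 1.0, 0.0, 1.0]
```

With k centroids per subspace and m subspaces, a d-dimensional vector compresses to m·log2(k) bits, which is why query response and storage costs are so low.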
Data partitioning is crucial to improving query performance, and several workload-based partitioning techniques have been proposed in the database literature. However, many modern analytic applications involve ad-hoc or exploratory analysis, where users do not have a representative workload a priori. Static data partitioning schemes are therefore not suitable for such settings. In this paper, we propose Amoeba, a distributed storage system that uses adaptive multi-attribute data partitioning to efficiently support ad-hoc as well as recurring queries. Amoeba requires zero set-up...
As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, and curation across teams and individuals. Common practice for sharing and collaborating on datasets involves creating and storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This results not only in...
Big data analytic applications give rise to large-scale extract-transform-load (ETL) processing as a fundamental step to transform new data into a native representation. ETL workloads pose significant performance challenges on conventional architectures, so we propose the design of the unstructured data processor (UDP), a software-programmable accelerator that includes multi-way dispatch, variable-size symbol support, flexible-source dispatch (stream buffer and scalar registers), and memory addressing to accelerate kernels both...
Columnar databases rely on specialized encoding schemes to reduce storage requirements. These encodings also enable efficient in-situ data processing. Nevertheless, many existing columnar systems are encoding-oblivious. When storing the data, these systems lack a global understanding of the dataset or data types and derive simple rules for encoding selection. Such rule-based selection leads to unsatisfactory performance. Specifically, when performing queries, these systems always decode data into memory, ignoring the possibility of optimizing access...
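The in-situ processing opportunity mentioned above is easy to illustrate (a generic run-length-encoding sketch of my own, not the paper's system): a predicate can be evaluated once per run instead of once per row, with no decoding step at all.

```python
def rle_encode(column):
    """Run-length encode a column: consecutive equal values collapse
    into (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def count_matches(runs, predicate):
    """Evaluate a predicate directly on the encoded form: one test per
    run rather than one per row, and nothing is materialized in memory."""
    return sum(length for value, length in runs if predicate(value))

col = ["US", "US", "US", "DE", "DE", "US"]
runs = rle_encode(col)
print(runs)                                       # → [['US', 3], ['DE', 2], ['US', 1]]
print(count_matches(runs, lambda v: v == "US"))   # → 4
```

An encoding-aware engine chooses encodings with such query-time shortcuts in mind, not just for storage size.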
Similarity search is a core analytical task, and its performance critically depends on the choice of distance measure. For time-series querying, elastic measures achieve state-of-the-art accuracy but are computationally expensive. Thus, fast lower bounding (LB) methods prune unnecessary comparisons with these expensive distances to accelerate similarity search. Despite decades of attention, there has never been a study to assess the progress in this area. In addition, research has disproportionately focused on one popular measure,...
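As one concrete example of the lower-bounding idea, here is a minimal LB_Keogh-style sketch for DTW with a warping window (returning the squared bound; a simplified illustration, not any specific paper's implementation): if the bound already exceeds the best distance found so far, the expensive elastic distance need not be computed.

```python
def lb_keogh(query, candidate, r):
    """Squared LB_Keogh-style lower bound for constrained DTW: build an
    envelope of the query over a window of radius r, and accumulate only
    where the candidate escapes that envelope."""
    total = 0.0
    n = len(query)
    for i, c in enumerate(candidate):
        window = query[max(0, i - r):min(n, i + r + 1)]
        lo, hi = min(window), max(window)
        if c > hi:
            total += (c - hi) ** 2
        elif c < lo:
            total += (lo - c) ** 2
    return total

q = [0.0, 1.0, 2.0, 1.0, 0.0]
c = [0.0, 3.0, 2.0, 1.0, 0.0]
print(lb_keogh(q, c, r=1))  # → 1.0  (only the 3.0 escapes the envelope)
```

Because the bound is cheap (linear time) and never exceeds the true distance, it safely prunes candidates in nearest-neighbor search.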
Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across versions. We introduce OrpheusDB, a dataset version control system that "bolts on" versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database "for free". We develop and evaluate multiple data models for representing versioned data,...
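One possible data model for versioned relational data can be sketched as follows (a generic illustration of the design space, not OrpheusDB's actual representation): store each record once, tagged with the set of versions that contain it, so checkouts are ordinary filters.

```python
class VersionedTable:
    """Each record is stored once with the set of version IDs that
    contain it; committing a child version inherits the parent's records
    and adds the new rows."""
    def __init__(self):
        self.records = []  # list of (row, set_of_version_ids)

    def commit(self, version, rows, parent=None):
        # Child versions see everything the parent contained...
        for _row, versions in self.records:
            if parent in versions:
                versions.add(version)
        # ...plus the newly added rows.
        for row in rows:
            self.records.append((row, {version}))

    def checkout(self, version):
        """A checkout is just a membership filter over the version sets."""
        return [row for row, versions in self.records if version in versions]

t = VersionedTable()
t.commit("v1", [("alice", 1)])
t.commit("v2", [("bob", 2)], parent="v1")
print(t.checkout("v2"))  # → [('alice', 1), ('bob', 2)]
```

Sharing records across versions avoids the full-copy-per-version storage blowup, while keeping checkouts expressible as plain relational predicates.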