NFDI4DS | UHH-SEMS - Publication Details

The BigDAWG Polystore System

OPENALEX - Publications

Jennie Duggan Aaron J. Elmore Michael Stonebraker Magda Balazinska Bill Howe and 5 more

This paper presents a new view of federated databases to address the growing need for managing information that spans multiple data models. trend is fueled by proliferation storage engines and query languages based on observation 'no one size fits all'. To this shift, we propose polystore architecture; it designed unify querying over We consider challenges opportunities associated with polystores. Open questions in space revolve around optimization assignment objects engines. introduce our...

10.1145/2814710.2814713 article EN ACM SIGMOD Record 2015-08-12

Zephyr

OPENALEX - Publications

Aaron J. Elmore Sudipto Das Divyakant Agrawal Amr El Abbadi

Multitenant data infrastructures for large cloud platforms hosting hundreds of thousands applications face the challenge serving characterized by small footprint and unpredictable load patterns. When such a platform is built on an elastic pay-per-use infrastructure, added to minimize system's operating cost while guaranteeing tenants' service level agreements (SLA). Elastic balancing therefore important feature enable scale-up during high scaling down when low. Live migration, technique...

10.1145/1989323.1989356 article EN 2011-06-12

E-store

OPENALEX - Publications

Rebecca Taft Essam Mansour Marco Serafini Jennie Duggan Aaron J. Elmore and 3 more

On-line transaction processing (OLTP) database management systems (DBMSs) often serve time-varying workloads due to daily, weekly or seasonal fluctuations in demand, because of rapid growth demand a company's business success. In addition, many OLTP are heavily skewed "hot" tuples ranges tuples. For example, the majority NYSE volume involves only 40 stocks. To deal with such fluctuations, an DBMS needs be elastic; that is, it must able expand and contract resources response load dynamically...

10.14778/2735508.2735514 article EN Proceedings of the VLDB Endowment 2014-11-01

A demonstration of the BigDAWG polystore system

OPENALEX - Publications

Aaron J. Elmore Jennie Duggan Michael Stonebraker Magda Balazinska Ug̃ur Çetintemel and 13 more

This paper presents BigDAWG, a reference implementation of new architecture for "Big Data" applications. Such applications not only call large-scale analytics, but also real-time streaming support, smaller analytics at interactive speeds, data visualization, and cross-storage-system queries. Guided by the principle that "one size does fit all", we build on top variety storage engines, each designed specialized use case. To illustrate promise this approach, demonstrate its effectiveness...

10.14778/2824032.2824098 article EN Proceedings of the VLDB Endowment 2015-08-01

Volume under the surface

OPENALEX - Publications

John Paparrizos Paul Boniol Themis Palpanas Ruey S. Tsay Aaron J. Elmore and 1 more

Anomaly detection (AD) is a fundamental task for time-series analytics with important implications the downstream performance of many applications. In contrast to other domains where AD mainly focuses on point-based anomalies (i.e., outliers in standalone observations), time series also concerned range-based spanning multiple observations). Nevertheless, it common use traditional information retrieval measures, such as Precision, Recall, and F-score, assess quality methods by thresholding...

10.14778/3551793.3551830 article EN Proceedings of the VLDB Endowment 2022-07-01

How Large Language Models Will Disrupt Data Management

OPENALEX - Publications

Raul Castro Fernandez Aaron J. Elmore Michael J. Franklin Sanjay Krishnan Chenhao Tan

Large language models (LLMs), such as GPT-4, are revolutionizing software's ability to understand, process, and synthesize language. The authors of this paper believe that advance in technology is significant enough prompt introspection the data management community, similar previous technological disruptions advents world wide web, cloud computing, statistical machine learning. We argue disruptive influence LLMs will have on come from two angles. (1) A number hard database problems, namely,...

10.14778/3611479.3611527 article EN Proceedings of the VLDB Endowment 2023-07-01

Clay

OPENALEX - Publications

Marco Serafini Rebecca Taft Aaron J. Elmore Andrew Pavlo Ashraf Aboulnaga and 1 more

Transaction processing database management systems (DBMSs) are critical for today's data-intensive applications because they enable an organization to quickly ingest and query new information. Many of these exceed the capabilities a single server, thus their has be deployed in distributed DBMS. The key factor affecting such system's performance is how partitioned. If partitioned incorrectly, number transactions can high. These have synchronize operations over network, which considerably...

10.14778/3025111.3025125 article EN Proceedings of the VLDB Endowment 2016-11-01

The BigDAWG polystore system and architecture

OPENALEX - Publications

Vijay Gadepally Peinan Chen Jennie Duggan Aaron J. Elmore Brandon Haynes and 4 more

Organizations are often faced with the challenge of providing data management solutions for large, heterogenous datasets that may have different underlying and programming models. For example, a medical dataset unstructured text, relational data, time series waveforms imagery. Trying to fit such in single system can adverse performance efficiency effects. As part Intel Science Technology Center on Big Data, we developing polystore designed problems. BigDAWG (short Data Analytics Working...

10.1109/hpec.2016.7761636 preprint EN 2016-09-01

Serializability, not serial

OPENALEX - Publications

Stacy Patterson Aaron J. Elmore Faisal Nawab Divyakant Agrawal Amr El Abbadi

We present a framework for concurrency control and availability in multi-datacenter datastores. While we consider Google's Megastore as our motivating example, define general abstractions key components, making solution extensible to any system that satisfies the abstraction properties. first develop analyze transaction management replication protocol based on straightforward implementation of Paxos algorithm. Our investigation reveals this acts prevention mechanism rather than mechanism....

10.14778/2350229.2350261 article EN Proceedings of the VLDB Endowment 2012-07-01

Squall

OPENALEX - Publications

Aaron J. Elmore Vaibhav Arora Rebecca Taft Andrew Pavlo Divyakant Agrawal and 1 more

For data-intensive applications with many concurrent users, modern distributed main memory database management systems (DBMS) provide the necessary scale-out support beyond what is possible single-node systems. These DBMSs are optimized for short-lived transactions that common in on-line transaction processing (OLTP) workloads. One way they achieve this to partition into disjoint subsets and use a single-threaded manager per executes one-at-a-time serial order. This minimizes overhead of...

10.1145/2723372.2723726 article EN 2015-05-27

Collaborative data analytics with DataHub

OPENALEX - Publications

Anant Bhardwaj Amol Deshpande Aaron J. Elmore David R. Karger Samuel Madden and 4 more

While there have been many solutions proposed for storing and analyzing large volumes of data, all these limited support collaborative data analytics, especially given the individuals teams are simultaneously analyzing, modifying exchanging datasets, employing a number heterogeneous tools or languages analysis, writing scripts to clean, preprocess, query data. We demonstrate DataHub, unified platform with ability load, store, query, collaboratively analyze, interactively visualize, interface...

10.14778/2824032.2824100 article EN Proceedings of the VLDB Endowment 2015-08-01

Debunking Four Long-Standing Misconceptions of Time-Series Distance Measures

OPENALEX - Publications

John Paparrizos Chunwei Liu Aaron J. Elmore Michael J. Franklin

Distance measures are core building blocks in time-series analysis and the subject of active research for decades. Unfortunately, most detailed experimental study this area is outdated (over a decade old) and, naturally, does not reflect recent progress. Importantly, (i) omitted multiple distance measures, including classic measure literature; (ii) considered only single normalization method; (iii) reported raw classification error rates without statistically validating findings, resulting...

10.1145/3318464.3389760 article EN 2020-05-29

VUS: Effective and Efficient Accuracy Measures for Time-Series Anomaly Detection

OPENALEX - Publications

Paul Boniol A. Krishna Marine Bruel Qinghua Liu Mong‐Han Huang and 5 more

Anomaly detection (AD) is a fundamental task for time-series analytics with important implications the downstream performance of many applications. In contrast to other domains where AD mainly focuses on point-based anomalies (i.e., outliers in standalone observations), time series also concerned range-based spanning multiple observations). Nevertheless, it common use traditional information retrieval measures, such as Precision, Recall, and F-score, assess quality methods by thresholding...

10.48550/arxiv.2502.13318 preprint EN arXiv (Cornell University) 2025-02-18

VUS: effective and efficient accuracy measures for time-series anomaly detection

OPENALEX - Publications

Paul Boniol A. Krishna Marine Bruel Qinghua Liu Mong‐Han Huang and 5 more

10.1007/s00778-025-00907-x article EN The VLDB Journal 2025-03-27

Characterizing tenant behavior for placement and crisis mitigation in multitenant DBMSs

OPENALEX - Publications

Aaron J. Elmore Sudipto Das Alexander Pucher Divyakant Agrawal Amr El Abbadi and 1 more

A multitenant database management system (DBMS) in the cloud must continuously monitor trade-off between efficient resource sharing among multiple application databases (tenants) and their performance. Considering scale of \attn{hundreds to} thousands tenants such DBMSs, manual approaches for continuous monitoring are not tenable. self-managing controller a DBMS faces several challenges. For instance, how to characterize tenant given its variety workloads, reduce impact colocation, detect...

10.1145/2463676.2465308 article EN 2013-06-22

Decomposed bounded floats for fast compression and queries

OPENALEX - Publications

Chunwei Liu Hao Jiang John Paparrizos Aaron J. Elmore

Modern data-intensive applications often generate large amounts of low precision float data with a limited range values. Despite the prevalence such data, there is lack an effective solution to ingest, store, and analyze bounded, low-precision, numeric data. To address this gap, we propose Buff, new compression technique that uses decomposed columnar storage encoding methods provide compression, fast ingestion, high-speed in-situ adaptive query operators SIMD support.

10.14778/3476249.3476305 article EN Proceedings of the VLDB Endowment 2021-07-01

Fast Adaptive Similarity Search through Variance-Aware Quantization

OPENALEX - Publications

John Paparrizos Ikraduya Edian Chunwei Liu Aaron J. Elmore Michael J. Franklin

With the explosive growth of high-dimensional data, approximate methods emerge as promising solutions for nearest neighbor search. Among alternatives, quantization have gained attention due to fast query responses and low encoding storage costs. Quantization decompose data dimensions into non-overlapping subspaces encode using a different dictionary per subspace. The state-of-the-art approach assigns sizes uniformly across while attempting balance relative importance subspaces....

10.1109/icde53745.2022.00268 article EN 2022 IEEE 38th International Conference on Data Engineering (ICDE) 2022-05-01

A robust partitioning scheme for ad-hoc query workloads

OPENALEX - Publications

Anil Shanbhag Alekh Jindal Samuel Madden Jorge Quiane Aaron J. Elmore

Data partitioning is crucial to improving query performance several workload-based techniques have been proposed in database literature. However, many modern analytic applications involve ad-hoc or exploratory analysis where users do not a representative workload priori. Static data are therefore suitable for such settings. In this paper, we propose Amoeba, distributed storage system that uses adaptive multi-attribute efficiently support as well recurring queries. Amoeba requires zero set-up...

10.1145/3127479.3131613 article EN 2017-09-24

Decibel

OPENALEX - Publications

Michael E. Maddox David Goehring Aaron J. Elmore Samuel Madden Aditya Parameswaran and 1 more

As scientific endeavors and data analysis become increasingly collaborative, there is a need for management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, curation across teams individuals. Common practice sharing collaborating on involves creating storing multiple copies dataset, one each stage with no provenance information tracking relationships between these datasets. This results not only in...

10.14778/2947618.2947619 article EN Proceedings of the VLDB Endowment 2016-05-01

UDP

OPENALEX - Publications

Yuanwei Fang Chen Zou Aaron J. Elmore Andrew A. Chien

Big data analytic applications give rise to large-scale extract-transform-load (ETL) as a fundamental step transform new into native representation. ETL workloads pose significant performance challenges on conventional architectures, so we propose the design of unstructured processor (UDP), software programmable accelerator that includes multi-way dispatch, variable-size symbol support, Flexible-source dispatch (stream buffer and scalar registers), memory addressing accelerate kernels both...

10.1145/3123939.3123983 article EN 2017-10-04

Good to the Last Bit: Data-Driven Encoding with CodecDB

OPENALEX - Publications

Hao Jiang Chunwei Liu John Paparrizos Andrew A. Chien Jihong Ma and 1 more

Columnar databases rely on specialized encoding schemes to reduce storage requirements. These encodings also enable efficient in-situ data processing. Nevertheless, many existing columnar are encoding-oblivious. When storing the data, these systems a global understanding of dataset or types derive simple rules for selection. Such rule-based selection leads unsatisfactory performance. Specifically, when performing queries, always decode into memory, ignoring possibility optimizing access...

10.1145/3448016.3457283 article EN Proceedings of the 2022 International Conference on Management of Data 2021-06-09

Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding Distances

OPENALEX - Publications

John Paparrizos Kaize Wu Aaron J. Elmore Christos Faloutsos Michael J. Franklin

Similarity search is a core analytical task, and its performance critically depends on the choice of distance measure. For time-series querying, elastic measures achieve state-of-the-art accuracy but are computationally expensive. Thus, fast lower bounding (LB) prune unnecessary comparisons with distances to accelerate similarity search. Despite decades attention, there has never been study assess progress in this area. In addition, research disproportionately focused one popular measure,...

10.14778/3594512.3594530 article EN Proceedings of the VLDB Endowment 2023-04-01

O rpheus DB

OPENALEX - Publications

Silu Huang Liqi Xu Jialin Liu Aaron J. Elmore Aditya Parameswaran

Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for system that can support versioning, enabling such to efficiently store, track, query across versions. We introduce O rpheus DB, version control " bolts on versioning capabilities traditional relational database system, thereby gaining the analytics "for free". develop evaluate multiple data models representing versioned data,...

10.14778/3115404.3115417 article EN Proceedings of the VLDB Endowment 2017-06-01