- Advanced Database Systems and Queries
- Cloud Computing and Resource Management
- Data Management and Algorithms
- Peer-to-Peer Network Technologies
- Data Stream Mining Techniques
- Caching and Content Delivery
- Data Quality and Management
- Advanced Data Storage Technologies
- Scientific Computing and Data Management
- Distributed Systems and Fault Tolerance
- Optimization and Search Problems
- Time Series Analysis and Forecasting
- Semantic Web and Ontologies
- Recommender Systems and Techniques
- Distributed and Parallel Computing Systems
- IoT and Edge/Fog Computing
- Machine Learning and Data Classification
- Machine Learning and Algorithms
- Stochastic Gradient Optimization Techniques
- Software System Performance and Reliability
- Functional Brain Connectivity Studies
- Big Data and Business Intelligence
- Cell Image Analysis Techniques
- Anomaly Detection Techniques and Applications
- Advanced Image and Video Retrieval Techniques
Brandeis University
2012-2022
John Brown University
2006-2013
Brown University
2004-2008
Athens University of Economics and Business
2001
Data exploration is about efficiently extracting knowledge from data even if we do not know exactly what we are looking for. In this tutorial, we survey recent developments in the emerging area of database systems tailored for data exploration. We discuss new ideas on how to store and access data as well as new ideas on how to interact with a data system to enable users and applications to quickly figure out which data parts are of interest. In addition, we discuss how to exploit lessons learned from past research, the new challenges data exploration crafts, and future research directions.
Current trends in data management systems, such as cloud and multi-tenant databases, are leading to data processing environments that concurrently execute heterogeneous query workloads. At the same time, these systems need to satisfy diverse performance expectations. In these newly-emerging settings, avoiding potential Quality-of-Service (QoS) violations heavily relies on performance predictability, i.e., the ability to estimate the impact of concurrent query execution on the performance of individual queries in a continuously evolving workload.
Interactive Data Exploration (IDE) is a key ingredient of a diverse set of discovery-oriented applications, including ones from scientific computing and evidence-based medicine. In these applications, data discovery is a highly ad hoc interactive process where users execute numerous exploration queries using varying predicates, aiming to balance the trade-off between collecting all relevant information and reducing the size of the returned data. Therefore, there is a strong need to support these human-in-the-loop applications by assisting their...
Join order selection plays a significant role in query performance. However, modern query optimizers typically employ static join enumeration algorithms that do not incorporate feedback about the quality of the resulting plan. Hence, optimizers often repeatedly choose the same bad plan, as they have no mechanism for "learning from their mistakes." Here, we argue that deep reinforcement learning techniques can be applied to address this challenge. These techniques, powered by artificial neural networks, automatically...
Query performance prediction, the task of predicting a query's latency prior to execution, is a challenging problem in database management systems. Existing approaches rely on features and performance models engineered by human experts, but often fail to capture the complex interactions between query operators and input relations, and generally do not adapt naturally to workload characteristics and patterns in query execution plans. In this paper, we argue that deep learning can be applied to the query performance prediction problem, and we introduce a novel neural...
Query optimization is one of the most challenging problems in database systems. Despite the progress made over the past decades, query optimizers remain extremely complex components that require a great deal of hand-tuning for specific workloads and datasets. Motivated by this shortcoming and inspired by recent advances in applying machine learning to data management challenges, we introduce Neo (Neural Optimizer), a novel learning-based query optimizer that relies on deep neural networks to generate query execution plans....
In this paper, we argue that database systems be augmented with an automated data exploration service that methodically steers users through the data in a meaningful way. Such an automated system is crucial for deriving insights from complex datasets found in many big data applications, such as scientific and healthcare applications, as well as for reducing the human effort of data exploration. Towards this end, we present AIDE, an Automatic Interactive Data Exploration framework that assists users in discovering new interesting data patterns and eliminate expensive ad-hoc exploratory...
Borealis is a distributed stream processing engine that is being developed at Brandeis University, Brown University, and MIT. Borealis inherits core stream processing functionality from Aurora and inter-node communication functionality from Medusa. We propose to demonstrate some of the key aspects of distributed operation in Borealis, using a multi-player network game as the underlying application. The demonstration will illustrate the dynamic resource management, query optimization, and high availability mechanisms employed by Borealis, using visual performance-monitoring tools as well as the gaming experience.
Query optimization remains one of the most important and well-studied problems in database systems. However, traditional query optimizers are complex heuristically-driven systems, requiring large amounts of time to tune for a particular database and even more time to develop and maintain in the first place. In this vision paper, we argue that a new type of query optimizer, based on deep reinforcement learning, can drastically improve on the state-of-the-art. We identify potential complications for future research that integrates deep learning with...
Workload management for cloud databases deals with the tasks of resource provisioning, query placement, and query scheduling in a manner that meets the application's performance goals while minimizing the cost of using cloud resources. Existing solutions have approached these three challenges in isolation, aiming to optimize a single performance metric. In this paper, we introduce WiSeDB, a learning-based framework for generating holistic workload management solutions customized to application-defined performance goals and workload characteristics. Our approach relies on supervised learning...
We address the problem of content-based dissemination of highly-distributed, high-volume data streams for stream-based monitoring applications and large-scale data delivery. Existing content-based dissemination approaches commonly rely on distributed filtering trees that require filtering at all brokers of the tree. We present a new semantic multicast approach that eliminates the need for content-based filtering at interior brokers and facilitates fine-grained control over the construction of efficient dissemination trees. The central idea is to split the incoming data streams (based on their contents, rates, and destinations) and then spread...
We discuss the problem of resource provisioning for database management systems operating on top of an Infrastructure-As-A-Service (IaaS) cloud. To solve this problem, we describe an extensible framework that, given a target query workload, continually optimizes the system's operational cost, estimated based on the IaaS provider's pricing model, while satisfying QoS expectations. Specifically, we describe two different approaches, a "white-box" approach that uses a fine-grained estimation of the expected resource consumption and...
Science applications are accumulating an ever-increasing amount of multidimensional data. Although some of it can be processed in a relational database, much of it is better suited to array-based engines. As such, it is important to optimize the query processing of these systems. This paper focuses on efficient query processing of join operations within an array database. These engines invariably "chunk" their data into multidimensional tiles that they use to efficiently process spatial queries. As such, traditional relational algorithms need to be substantially modified to take...
Predicting query performance under concurrency is a difficult task that has many applications in capacity planning, cloud computing, and batch scheduling. We introduce Contender, a new resource-modeling approach for predicting the concurrent query performance of analytical workloads. Contender's unique feature is that it can generate effective predictions for both static as well as ad hoc or dynamic workloads with low training requirements. These characteristics make Contender a practical solution for real-world deployment. Contender relies on...
Distributed data management systems often operate on "elastic" clusters that can scale up or down on demand. These systems face numerous challenges, including data fragmentation, replication, and cluster sizing. Unfortunately, these challenges have traditionally been treated independently, leaving administrators with little insight on how the interplay of these decisions affects query performance. This paper introduces NashDB, an adaptive data distribution framework that relies on an economic model to automatically balance the supply...
We introduce XPORT, a profile-driven distributed data dissemination system that supports an extensible set of data types, profile types, and optimization metrics. XPORT efficiently implements a generic tree-based overlay network, which can be customized per application using a small number of methods that encapsulate application-specific data-filtering, profile-aggregation, and optimization logic. The clean separation between the "plumbing" and the "application" enables XPORT to uniformly support disparate dissemination-based applications. We first provide...
We consider the problem of content-based routing and dissemination of highly-distributed, fast data streams from multiple sources to multiple receivers. Our target application domain includes real-time, stream-based monitoring applications and large-scale event dissemination. We introduce SemCast, a new semantic multicast approach that, unlike previous approaches, eliminates the need for content-based forwarding at interior brokers and facilitates fine-grained control over the construction of dissemination overlays. We present the initial design of SemCast...
Existing stream processing systems are optimized for a specific metric, which may limit their applicability to diverse applications and environments. This paper presents XFlow, a generic data stream collection, processing, and dissemination system that addresses this limitation efficiently. XFlow can express and optimize a variety of optimization metrics and constraints by distributing stream processing queries across a wide-area network. It uses metric-independent decentralized algorithms that work on localized, aggregated statistics,...
We introduce Pulse, a framework for processing continuous queries over models of continuous-time data, which can compactly and accurately represent many real-world activities and processes. Pulse implements several query operators, including filters, aggregates, and joins, that work by solving simultaneous equation systems, which in many cases is significantly cheaper than processing a stream of tuples. As such, Pulse translates regular queries to work on continuous-time inputs, to reduce computational overhead and latency while meeting user-specified error...