- Advanced Database Systems and Queries
- Data Management and Algorithms
- Cloud Computing and Resource Management
- Advanced Data Storage Technologies
- Distributed Systems and Fault Tolerance
- Data Mining Algorithms and Applications
- Algorithms and Data Compression
- Scientific Computing and Data Management
- Graph Theory and Algorithms
- Blockchain Technology Applications and Security
- Caching and Content Delivery
- Parallel Computing and Optimization Techniques
- Meteorological Phenomena and Simulations
- Data Stream Mining Techniques
- Big Data and Business Intelligence
- Web Data Mining and Analysis
- Fire Effects on Ecosystems
- Semantic Web and Ontologies
- Peer-to-Peer Network Technologies
- Software Testing and Debugging Techniques
- Data Quality and Management
- Plant Water Relations and Carbon Dynamics
- Advanced Text Analysis Techniques
- Information Systems Education and Curriculum Development
- Optimization and Search Problems
Saarland University
2013-2023
Max Planck Institute for Informatics
2009
Max Planck Society
2009
ETH Zurich
2008
Philipps University of Marburg
2002
One of the main reasons why cloud computing has gained so much popularity is its ease of use and its ability to scale resources on demand. As a result, users can now rent nodes of large commercial clusters through several vendors, such as Amazon and Rackspace. However, despite the attention paid by cloud providers, performance unpredictability is a major issue in cloud computing for (1) database researchers performing wall-clock experiments, and (2) applications providing service-level agreements. In this paper, we carry out...
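A minimal sketch of how such unpredictability can be quantified over repeated runs; the coefficient-of-variation metric and the sample runtimes below are illustrative, not the paper's methodology or measurements:

```python
# Sketch: quantifying performance unpredictability across repeated runs.
# Metric and sample data are illustrative only.
import statistics

def coefficient_of_variation(samples):
    """Relative dispersion: stdev / mean. Higher means less predictable."""
    return statistics.stdev(samples) / statistics.mean(samples)

# Hypothetical wall-clock runtimes (seconds) of the same job on rented nodes.
runtimes = [52.1, 49.8, 71.3, 50.2, 68.9, 51.5]
print(f"COV = {coefficient_of_variation(runtimes):.2%}")
```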
MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, it allows non-expert users to run complex analytical tasks over very large data sets on clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop --- an open-source implementation of MapReduce --- often does not match the one of a well-configured DBMS. In this paper we propose a new type of system named Hadoop++: it boosts task performance without changing...
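A minimal sketch of the general remedy for scan-oriented processing, assuming a per-split index consulted at read time; the IndexedSplit class and its layout are hypothetical illustrations, not Hadoop++'s actual mechanism:

```python
# Sketch: attaching a per-split index so a job can avoid full scans.
# Names and structure are illustrative, not Hadoop++'s on-disk format.
import bisect

class IndexedSplit:
    def __init__(self, records):                  # records: (key, payload)
        self.records = sorted(records)            # index built at load time,
        self.keys = [k for k, _ in self.records]  # invisible to the framework

    def range_lookup(self, lo, hi):
        """Binary-search the split instead of scanning it entirely."""
        start = bisect.bisect_left(self.keys, lo)
        stop = bisect.bisect_right(self.keys, hi)
        return self.records[start:stop]

split = IndexedSplit([(7, "g"), (2, "b"), (5, "e"), (9, "i")])
print(split.range_lookup(3, 8))   # [(5, 'e'), (7, 'g')]
```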
This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. A popular data processing engine for big data is Hadoop MapReduce. Early versions of Hadoop MapReduce suffered from severe performance problems. Today, this is becoming history. There are many techniques that can be used to boost the performance of MapReduce jobs by orders of magnitude. In this tutorial we teach such techniques. First, we will briefly familiarize the audience...
Within the last few years, a countless number of blockchain systems have emerged on the market, each one claiming to revolutionize the way of distributed transaction processing in one way or the other. Many features, such as byzantine fault tolerance, are indeed valuable additions in modern environments. However, despite all the hype around the technology, many of the challenges that blockchain systems face are fundamental data management problems. These are largely shared with traditional database systems, which have been studied for decades already. These similarities become...
MapReduce is becoming ubiquitous in large-scale data analysis. Several recent works have shown that the performance of Hadoop could be improved, for instance, by creating indexes in a non-invasive manner. However, they ignore the impact of the data layout used inside the data blocks of the Hadoop Distributed File System (HDFS). In this paper, we analyze different data layouts in detail in the context of MapReduce and argue that Row, Column, and PAX layouts can lead to poor system performance. We propose a new data layout, coined Trojan Layout, that internally organizes data blocks into attribute...
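A minimal sketch of the underlying layout idea, reorganizing rows inside one block into per-attribute groups; the data and helper below are illustrative, not the actual Trojan Layout format:

```python
# Sketch: reorganizing the rows of one HDFS-style block into per-attribute
# groups (PAX-like), so a query touching few attributes reads less data.
# Illustrates the layout idea only, not the actual Trojan Layout format.

rows = [(1, "alice", 3.4), (2, "bob", 2.9), (3, "carol", 3.8)]

def to_column_groups(rows):
    """Row-major -> one contiguous group per attribute within the block."""
    return [list(col) for col in zip(*rows)]

block = to_column_groups(rows)
# A scan of attribute 1 ("name") now touches a single contiguous group:
print(block[1])   # ['alice', 'bob', 'carol']
```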
Relational equi-joins are at the heart of almost every query plan. They have been studied, improved, and re-examined on a regular basis since the existence of the database community. In the past four years several new join algorithms have been proposed and experimentally evaluated. Some of those papers contradict each other in their experimental findings. This makes it surprisingly hard to answer a very simple question: what is the fastest join algorithm in 2015? In this paper we will try to develop an answer. We start with end-to-end black...
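For reference, a textbook single-threaded hash equi-join of the kind such end-to-end comparisons start from; none of the paper's optimized algorithms is reproduced here:

```python
# Sketch: a plain hash equi-join baseline (build phase, then probe phase).
from collections import defaultdict

def hash_join(build, probe):
    """Join two lists of (key, payload) tuples on their keys."""
    table = defaultdict(list)
    for key, payload in build:          # build phase
        table[key].append(payload)
    return [(key, b, p)                 # probe phase
            for key, p in probe
            for b in table.get(key, [])]

print(hash_join([(1, "a"), (2, "b")], [(2, "x"), (3, "y")]))  # [(2, 'b', 'x')]
```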
Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to consume only parts of the inputs before responding. However, the teaching time to make them do so is high. So high that the lessons often do not pay off. We take a different approach. We make elephants aggressive; this will make them very fast. We propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes...
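A miniature of the core idea, assuming HDFS's existing three-way replication is reused so that each replica is sorted on a different attribute at upload time; the data and attribute choice are illustrative:

```python
# Sketch: HDFS already writes each block several times; sorting every copy
# on a different attribute at upload time yields one clustered order per
# replica essentially for free. Data and attribute choice are illustrative.

block = [(3, "c", 9.1), (1, "a", 7.4), (2, "b", 8.2)]

replicas = {
    attr: sorted(block, key=lambda row: row[attr])
    for attr in (0, 1, 2)     # one sort order per replica
}

# A job filtering on attribute 2 reads the replica clustered on it:
print(replicas[2])   # [(1, 'a', 7.4), (2, 'b', 8.2), (3, 'c', 9.1)]
```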
Hashing is a solved problem. It allows us to get constant time access for lookups. Hashing is also simple. It is safe to use an arbitrary method as a black box and expect good performance, and optimizations to hashing can only improve it by a negligible delta. Why are all of the previous statements plain wrong? That is what this paper is about. In this paper we thoroughly study hashing for integer keys and carefully analyze the most common hashing methods in a five-dimensional requirements space: (1) data-distribution, (2) load factor, (3) dataset size, (4)...
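Two common integer hashing methods of the kind such studies compare, sketched below; the constants follow standard multiplicative-hashing recipes, and the paper's exact set of methods and parameters is not reproduced here:

```python
# Sketch: two integer hash methods with very different mixing costs.

def multiply_shift(key, bits):
    """Multiplicative hashing into 2**bits buckets (one multiply, one shift)."""
    return (key * 0x9E3779B97F4A7C15 & 0xFFFFFFFFFFFFFFFF) >> (64 - bits)

def murmur_finalizer(key):
    """Murmur3-style 64-bit finalizer: stronger mixing, more cycles."""
    key = (key ^ (key >> 33)) * 0xFF51AFD7ED558CCD & 0xFFFFFFFFFFFFFFFF
    key = (key ^ (key >> 33)) * 0xC4CEB9FE1A85EC53 & 0xFFFFFFFFFFFFFFFF
    return key ^ (key >> 33)

keys = range(0, 1 << 20, 8)                     # a structured, gapped key set
slots = [multiply_shift(k, 10) for k in keys]
print(len(set(slots)), "of 1024 buckets used")
```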
MapReduce is a computing paradigm that has gained a lot of popularity as it allows non-expert users to easily run complex analytical tasks at very large scale. At such scale, task and node failures are no longer an exception but rather a characteristic of large-scale systems. This makes fault-tolerance a critical issue for the efficient operation of any application. MapReduce automatically reschedules failed tasks to available nodes, which in turn recompute them from scratch. However, this policy can significantly decrease...
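A toy sketch of the checkpoint-and-resume alternative to recomputing from scratch; the in-memory checkpoint store and task model are invented for illustration, and the paper's actual recovery algorithms are not reproduced here:

```python
# Sketch: resuming a rescheduled task from a cheap local checkpoint instead
# of recomputing it from scratch. Checkpoint store is illustrative only.

checkpoints = {}          # task_id -> (records_done, partial_result)

def run_task(task_id, records):
    done, acc = checkpoints.get(task_id, (0, 0))
    for i in range(done, len(records)):
        acc += records[i]                    # the "work"
        checkpoints[task_id] = (i + 1, acc)  # checkpoint after each record
    return acc

data = list(range(10))
checkpoints["t1"] = (7, sum(data[:7]))   # simulate a failure after 7 records
print(run_task("t1", data))              # resumes at record 7 -> 45
```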
Database cracking has been an area of active research in recent years. The core idea of database cracking is to create indexes adaptively and incrementally as a side-product of query processing. Several works have proposed different cracking techniques for different aspects including updates, tuple-reconstruction, convergence, concurrency-control, and robustness. However, there is a lack of any comparative study of these methods by an independent group. In this paper, we conduct an experimental study on database cracking. Our goal is to critically review several...
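A minimal sketch of the core cracking step the abstract describes: a range query physically partitions the column around its bounds as a side effect, so later queries touch less data. This is a simplified illustration; a real cracker index also tracks the piece boundaries in a separate structure:

```python
# Sketch: crack a column around a range query's bounds while answering it.

def crack_in_three(column, lo, hi):
    """Partition column into (<lo), [lo, hi), (>=hi); return the middle."""
    left  = [v for v in column if v < lo]
    mid   = [v for v in column if lo <= v < hi]
    right = [v for v in column if v >= hi]
    column[:] = left + mid + right        # physical reorganization in place
    return mid                            # the query's answer

col = [13, 4, 55, 9, 2, 27, 18, 1]
print(crack_in_three(col, 5, 20))   # [13, 9, 18]
print(col)                          # [4, 2, 1, 13, 9, 18, 55, 27]
```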
The partition-based spatial-merge join (PBSM) of J.M. Patel and D.J. DeWitt (1996) and the size separation spatial join (S³J) of N. Koudas and K.C. Sevcik (1997) are considered to be among the most efficient methods for processing spatial (intersection) joins on two or more relations. Neither method assumes the presence of pre-existing indices. In this paper, we propose several improvements to these algorithms. In particular, we deal with the impact of data redundancy and duplicate detection on the performance of these methods. For PBSM, we present a simple...
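A sketch of reference-point duplicate avoidance, a standard remedy in partition-based spatial joins: a pair whose overlap spans several partitions is reported only by the partition containing one fixed reference point of the intersection. Grid and rectangles below are illustrative:

```python
# Sketch: report each overlapping pair exactly once, from the partition that
# contains the lower-left corner of the pair's intersection rectangle.

def intersect(a, b):
    """Intersection of two rectangles (x1, y1, x2, y2), or None."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 <= x2 and y1 <= y2 else None

def report_pair(a, b, partition):
    """Report (a, b) only if the reference point falls into this partition."""
    i = intersect(a, b)
    if i is None:
        return False
    px1, py1, px2, py2 = partition
    return px1 <= i[0] < px2 and py1 <= i[1] < py2

a, b = (1, 1, 6, 6), (4, 4, 9, 9)         # overlap spans two partitions
print(report_pair(a, b, (0, 0, 5, 5)))    # True: reference point (4, 4) here
print(report_pair(a, b, (5, 0, 10, 5)))   # False: duplicate avoided
```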
Like any large software system, a full-fledged DBMS offers an overwhelming amount of configuration knobs. These range from static initialisation parameters like buffer sizes, degree of concurrency, or level of replication to complex runtime decisions like creating a secondary index on a particular column or reorganising the physical layout of the store. To simplify the configuration, industry-grade DBMSs are usually shipped with various advisory tools that provide recommendations for given workloads and machines...
With prices of main memory constantly decreasing, people nowadays are more interested in performing their computations in main memory, and leave the high I/O costs of traditional disk-based systems out of the equation. This change of paradigm, however, represents new challenges to the way data should be stored and indexed in order to be processed efficiently. Traditional data structures, like the venerable B-tree, were designed to work on disk-based systems, but they are no longer the way to go for main memory, at least not in their original form, due to the poor cache utilization of the systems they run...
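The cache-utilization argument in back-of-the-envelope numbers; all figures below are illustrative assumptions, not measurements from the paper:

```python
# Sketch: a binary search inside a disk-sized B-tree node touches many
# 64-byte cache lines but uses only one 8-byte key per line, whereas a
# cache-line-sized node uses most of every fetched byte.
import math

CACHE_LINE, KEY = 64, 8
for node_bytes in (4096, 64):                   # disk page vs cache line
    keys_per_node = node_bytes // (2 * KEY)     # keys + child pointers
    probes = math.ceil(math.log2(keys_per_node + 1))
    lines = min(probes, node_bytes // CACHE_LINE)
    useful = probes * KEY / (lines * CACHE_LINE)
    print(f"{node_bytes}B node: {lines} lines touched, {useful:.0%} fetched bytes used")
```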
The recursive model index (RMI) has recently been introduced as a machine-learned replacement for traditional indexes over sorted data, achieving remarkably fast lookups. Follow-up work focused on explaining RMI's performance and automatically configuring RMIs through enumeration. Unfortunately, configuring RMIs involves setting several hyperparameters, the enumeration of which is often too time-consuming in practice. Therefore, in this work, we conduct the first inventor-independent broad analysis of RMIs with the goal...
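A two-stage RMI in miniature: a root rule picks a second-stage linear model, which predicts a position that a bounded local search then corrects. The segment count, model form, and error bound below are illustrative assumptions, not a specific RMI configuration:

```python
# Sketch: a tiny two-stage recursive model index over sorted integer keys.
import bisect

class TwoStageRMI:
    def __init__(self, keys, segments=4):
        self.keys = keys                     # sorted integer keys
        self.n, self.m = len(keys), segments
        step = self.n // self.m
        self.models = []                     # (slope, intercept) per segment
        for i in range(self.m):
            lo, hi = i * step, min((i + 1) * step, self.n - 1)
            slope = (hi - lo) / max(self.keys[hi] - self.keys[lo], 1)
            self.models.append((slope, lo - slope * self.keys[lo]))

    def lookup(self, key, err=64):           # err: assumed max model error
        seg = min(self.m - 1, self.m * key // (self.keys[-1] + 1))  # stage 1
        slope, intercept = self.models[seg]                         # stage 2
        guess = int(slope * key + intercept)
        lo, hi = max(0, guess - err), min(self.n, guess + err)
        i = bisect.bisect_left(self.keys, key, lo, hi)   # bounded correction
        return i if i < self.n and self.keys[i] == key else None

idx = TwoStageRMI(list(range(0, 2000, 2)))
print(idx.lookup(998))   # position 499
```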
SIGMOD 2008 was the first database conference that offered to test submitters' programs against their data to verify the experiments published. This paper discusses the rationale for this effort, the community's reaction, our experiences, and advice for future similar efforts.
Vertical partitioning is a crucial step in physical database design for row-oriented databases. A number of vertical partitioning algorithms have been proposed over the last three decades for a variety of niche scenarios. In principle, the underlying problem remains the same: decompose a table into one or more vertical partitions. However, it is not clear how good the different algorithms are in comparison to each other. In fact, it is not even clear how to experimentally compare the algorithms. In this paper, we present an exhaustive experimental study of several vertical partitioning algorithms. We categorize them along...
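To make the problem statement concrete, a deliberately naive strawman: given which attributes each query touches, derive one partition per distinct access pattern. The workload and rule below are illustrative and are not one of the surveyed algorithms:

```python
# Sketch: the input/output shape of vertical partitioning, via a strawman
# that creates one partition per distinct attribute-access pattern.

queries = [                       # attributes referenced by each query
    {"id", "name"},
    {"id", "name"},
    {"salary", "dept"},
]

def naive_vertical_partitions(queries):
    """One partition per distinct access pattern in the workload."""
    return {frozenset(q) for q in queries}

for part in naive_vertical_partitions(queries):
    print(sorted(part))           # ['id', 'name'] and ['dept', 'salary']
```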
Adaptive indexing is a concept that considers index creation in databases as a by-product of query processing; as opposed to traditional full index creation, where the effort is performed up front before answering any queries. Adaptive indexing has received a considerable amount of attention, and several algorithms have been proposed over the past few years; including a recent experimental study comparing a large number of existing methods. Until now, however, most adaptive indexing algorithms have been designed single-threaded, yet with multi-core systems already well...
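One simple way such refinement could be parallelized across cores, sketched below: crack disjoint chunks of a column independently and merge the qualifying pieces. This is a generic data-parallel strawman, not one of the paper's algorithms:

```python
# Sketch: data-parallel adaptive refinement over disjoint column chunks.
from concurrent.futures import ThreadPoolExecutor

def crack_chunk(chunk, lo, hi):
    """Partition one chunk around [lo, hi); return its qualifying middle."""
    chunk.sort(key=lambda v: (v >= lo) + (v >= hi))   # three-way partition
    return [v for v in chunk if lo <= v < hi]

column = [13, 4, 55, 9, 2, 27, 18, 1]
chunks = [column[:4], column[4:]]
with ThreadPoolExecutor() as pool:
    parts = pool.map(lambda c: crack_chunk(c, 5, 20), chunks)
print([v for part in parts for v in part])   # [13, 9, 18]
```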
Memory management is one of the most boring topics in database research. It plays a minor role in tasks like free-space management or efficient space usage. Here and there, we also realize its impact on performance when worrying about NUMA-aware memory allocation, data compacting, snapshotting, and defragmentation. But, overall, let's face it: the entire topic sounds as exciting as 'garbage collection' or 'debugging a program for memory leaks'. What if there were a technique that would promote memory management from a third-class helper thingie to a first...