- Parallel Computing and Optimization Techniques
- Advanced Database Systems and Queries
- Cloud Computing and Resource Management
- Machine Learning and Data Classification
- Advanced Data Storage Technologies
- Distributed and Parallel Computing Systems
- Graph Theory and Algorithms
- Business Process Modeling and Analysis
- Service-Oriented Architecture and Web Services
- Scientific Computing and Data Management
- Data Quality and Management
- Stochastic Gradient Optimization Techniques
- Software System Performance and Reliability
- Machine Learning and Algorithms
- Digital Innovation in Industries
- Machine Learning in Materials Science
- Software Engineering Research
- Time Series Analysis and Forecasting
- Explainable Artificial Intelligence (XAI)
- Information Technology Governance and Strategy
- Data Stream Mining Techniques
- Scheduling and Optimization Algorithms
- ERP Systems Implementation and Impact
- Stock Market Forecasting Methods
- Data Mining Algorithms and Applications
Technische Universität Berlin
2023-2025
Graz University of Technology
2019-2022
IBM Research - Almaden
2013-2019
IBM (United States)
2014-2017
Technische Universität Dresden
2008-2014
Osnabrück University
2011-2013
University of Münster
2010
Hochschule für Technik und Wirtschaft Dresden – University of Applied Sciences
2008-2009
The rising need for custom machine learning (ML) algorithms and the growing data sizes that require exploitation of distributed, data-parallel frameworks such as MapReduce or Spark pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists, who are able to express custom algorithms in a familiar domain-specific language covering linear algebra primitives and statistical functions, and (2) transparently running these algorithms on distributed, data-parallel frameworks by applying cost-based...
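To illustrate the style of specification the abstract describes, here is a minimal sketch of an ML algorithm written purely in linear algebra primitives, the way a declarative DSL such as SystemML's exposes them (shown here in NumPy rather than DML; the function name `lm` and the ridge parameter `reg` are illustrative, not SystemML's API):

```python
import numpy as np

def lm(X, y, reg=1e-3):
    # Linear regression via the (regularized) normal equations,
    # expressed only with linear algebra primitives. A declarative
    # ML system can compile such a script to local or distributed
    # execution plans without the author changing the code.
    A = X.T @ X + reg * np.eye(X.shape[1])
    b = X.T @ y
    return np.linalg.solve(A, b)
```

The point of the declarative approach is that this same high-level script could run unchanged on a single node or on a data-parallel framework, with the system choosing the plan.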
Large-scale data analytics using statistical machine learning (ML), popularly called advanced analytics, underpins many modern data-driven applications. The data management community has been working for over a decade on tackling data-management-related challenges that arise in ML workloads, and has built several systems for advanced analytics. This tutorial provides a comprehensive review of such systems and analyzes key techniques. We focus on three complementary lines of work: (1) integrating ML algorithms and languages with existing systems such as...
SystemML aims at declarative, large-scale machine learning (ML) on top of MapReduce, where high-level ML scripts with R-like syntax are compiled to programs of MR jobs. The declarative specification of ML algorithms enables---in contrast to existing libraries---automatic optimization. SystemML's primary focus is on data parallelism, but many ML algorithms inherently exhibit opportunities for task parallelism as well. A major challenge is how to efficiently combine both types of parallelism for arbitrary ML scripts and workloads. In this paper, we present a systematic...
Large-scale machine learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. It is crucial for performance to fit the data into single-node or distributed main memory. General-purpose, heavyweight and lightweight compression techniques struggle to achieve both good compression ratios and fast decompression speed to enable block-wise uncompressed operations. Hence, we initiate work on compressed linear algebra (CLA), in which...
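The core idea of compressed linear algebra is to execute operations such as matrix-vector multiplication directly on compressed matrices, without decompressing. A minimal sketch with per-column dictionary encoding (the actual CLA design uses richer column encodings and co-coding; `compress_column` and `mv_compressed` are hypothetical names):

```python
import numpy as np

def compress_column(col):
    # Dictionary encoding: small array of distinct values plus a
    # per-row code vector referencing those values.
    vals, codes = np.unique(col, return_inverse=True)
    return vals, codes

def mv_compressed(cols, v):
    # y = X @ v computed on the compressed columns: scale each
    # column's small dictionary by v[j] once, then scatter via the
    # code vector -- X itself is never decompressed.
    n = len(cols[0][1])
    y = np.zeros(n)
    for j, (vals, codes) in enumerate(cols):
        y += (vals * v[j])[codes]
    return y

X = np.array([[1., 0.], [1., 2.], [0., 2.], [1., 0.]])
cols = [compress_column(X[:, j]) for j in range(X.shape[1])]
v = np.array([3., 0.5])
y = mv_compressed(cols, v)
```

Scaling the dictionary instead of every cell is where the speedup comes from: work per column is proportional to the number of distinct values plus one scatter, not to a full decompression.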
Declarative large-scale machine learning (ML) aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single-node, in-memory computations to distributed computations on MapReduce (MR) or similar frameworks. State-of-the-art compilers in this context are very sensitive to memory constraints of the master process and the MR cluster configuration. Different configurations can lead to significant performance differences. Interestingly, resource negotiation frameworks like...
Many machine learning (ML) systems allow the specification of ML algorithms by means of linear algebra programs, and automatically generate efficient execution plans. The opportunities for fused operators---in terms of fused chains of basic operators---are ubiquitous, and include fewer materialized intermediates, fewer scans of inputs, and sparsity exploitation across operators. However, existing fusion heuristics struggle to find good plans for complex operator DAGs or hybrid plans of local and distributed operations. In this paper, we...
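The benefits of operator fusion listed in the abstract can be seen on a small example such as `sum(X * Y * Z)`. A minimal sketch contrasting the unfused evaluation with a fused, sparsity-exploiting one (illustrative functions, not the paper's code generator):

```python
import numpy as np

def sum_mult3_unfused(X, Y, Z):
    # Materializes two full-size intermediates (X*Y, then (X*Y)*Z)
    # and scans the data multiple times.
    return float(np.sum(X * Y * Z))

def sum_mult3_fused(X, Y, Z):
    # Single pass, no materialized intermediates; because the
    # expression is sparse-safe, only nonzero cells of X contribute.
    s = 0.0
    rows, cols = np.nonzero(X)
    for i, j in zip(rows, cols):
        s += X[i, j] * Y[i, j] * Z[i, j]
    return s
```

For a sparse `X`, the fused variant touches only `nnz(X)` cells, which is exactly the kind of cross-operator sparsity exploitation a fusion optimizer searches for.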
Nowadays, Renewable Energy Sources (RES) are attracting more and more interest. Thus, many countries that aim to increase the share of green energy have to face several challenges (e.g., balancing, storage, pricing). In this paper, we address the balancing challenge and present the MIRABEL project, which aims to prototype an Energy Data Management System (EDMS) that takes benefit of flexibilities to efficiently balance energy demand and supply. The EDMS consists of millions of heterogeneous nodes, each of which incorporates advanced components...
Slice finding---a recent line of work on debugging machine learning (ML) models---aims to find the top-K data slices (e.g., conjunctions of predicates such as gender equals female and degree equals PhD) on which a trained model performs significantly worse than on the entire training/test data. These slices may be used to acquire more data for the problematic subset, add rules, or otherwise improve the model. In contrast to decision trees, the general slice finding problem allows overlapping slices. The resulting search space is huge as it covers all...
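A brute-force sketch of the slice-finding problem as stated: enumerate conjunctions of predicates and rank slices by how much their average error exceeds the overall average. (The actual work uses a scoring function that also weighs slice size, plus linear-algebra-based enumeration with pruning; `find_slices` here is a hypothetical illustration only.)

```python
from itertools import combinations
import numpy as np

def find_slices(features, errors, k=3, max_pred=2):
    # features: per-row list of (attribute, value) pairs
    # errors:   per-row loss (e.g., 0/1 misclassification)
    overall = errors.mean()
    preds = sorted({p for row in features for p in row})
    scored = []
    for size in range(1, max_pred + 1):
        for conj in combinations(preds, size):
            mask = np.array([all(p in row for p in conj) for row in features])
            if mask.sum() == 0:
                continue  # contradictory/empty conjunction
            # score: how much worse the slice is than the overall data
            scored.append((errors[mask].mean() - overall, int(mask.sum()), conj))
    scored.sort(key=lambda t: -t[0])
    return scored[:k]
```

Even this toy version shows why the search space explodes: the number of conjunctions grows combinatorially in the number of distinct predicates, and overlapping slices cannot be pruned the way disjoint tree leaves can.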
Exploitation of parallel architectures has become critical to scalable machine learning (ML). Since a wide range of ML algorithms employ linear algebraic operators, GPUs with BLAS libraries are a natural choice for such an exploitation. Two approaches are commonly pursued: (i) developing specific GPU-accelerated implementations of complete ML algorithms; and (ii) developing GPU kernels for primitive operators like matrix-vector multiplication, which are then used in ML algorithms. This paper extends the latter approach by developing fused...
Time series data from a variety of sensors and IoT devices need effective compression to reduce storage and I/O bandwidth requirements. While most time series database systems rely on lossless compression, lossy techniques offer even greater space savings with a small loss in precision. However, their unknown impact on downstream analytics applications requires semi-manual trial-and-error exploration. We initiate work on lossy compression that provides guarantees on complex statistical features (which are strongly correlated...
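As a simple intuition for how a lossy scheme can carry guarantees on downstream statistics, consider uniform scalar quantization with a point-wise error bound eps: the reconstruction error of each value is at most eps, which in turn bounds the shift of the series mean by eps. (This is a generic illustration, not the paper's technique; `quantize`/`dequantize` are hypothetical names.)

```python
import numpy as np

def quantize(ts, eps):
    # Uniform scalar quantization with point-wise error <= eps.
    # Rounding to multiples of 2*eps gives |x - round(x)| <= eps,
    # so the mean of the reconstruction shifts by at most eps too.
    step = 2 * eps
    codes = np.round(ts / step).astype(np.int64)
    return codes, step

def dequantize(codes, step):
    return codes * step
```

The integer code stream is then highly compressible (e.g., via delta or entropy coding), which is where the space savings over lossless schemes come from.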
Open science and data exchange in general rely on standardized, interoperable file formats. Comma-separated value (CSV) files are probably the most versatile, simplest, and most widely-used format for tabular data. For example, the FAIR principles of research data management promote findable, accessible, interoperable, and reusable data and metadata. In this context, CSV files ensure accessibility and interoperability because of their simple structure and text-based format, making them amenable to long-term storage. An analysis by Google Dataset...
Large-scale data analytics using machine learning (ML) underpins many modern data-driven applications. ML systems provide the means of specifying and executing these workloads in an efficient, scalable...
Large-scale Machine Learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications. Hence, it is crucial for performance to fit the data into single-node or distributed main memory to enable fast matrix-vector operations. General-purpose compression struggles to achieve both good compression ratios and fast decompression for block-wise uncompressed operations. Therefore, we introduce Compressed Linear Algebra (CLA) for lossless matrix compression. CLA encodes matrices with lightweight,...
Machine learning (ML) and data science workflows are inherently exploratory. Data scientists pose hypotheses, integrate the necessary data, and run ML pipelines of data cleaning, feature engineering, model selection, and hyper-parameter tuning. The repetitive nature of these workflows, and their hierarchical composition from building blocks, exhibits high computational redundancy. Existing work addresses this redundancy with coarse-grained lineage tracing and reuse for entire pipelines. This approach allows using existing...
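The mechanism behind lineage-based reuse can be sketched as memoization keyed by a hash over an operation and the lineage traces of its inputs: two computations with identical lineage must produce identical results and can share one cached value. (A minimal illustration under that assumption; `LineageCache` is a hypothetical class, not the system's API, and real fine-grained reuse adds eviction, compensation, and multi-level reuse.)

```python
import hashlib

class LineageCache:
    """Memoize operator results keyed by input lineage."""

    def __init__(self):
        self.cache = {}

    def trace(self, op, *input_traces):
        # Lineage of a result = hash of the operation and the
        # lineage traces of its inputs (deterministic ops assumed).
        key = repr((op, input_traces)).encode()
        return hashlib.sha256(key).hexdigest()

    def execute(self, op, fn, *inputs_with_traces):
        # inputs_with_traces: (value, lineage_trace) pairs
        traces = tuple(t for _, t in inputs_with_traces)
        key = self.trace(op, *traces)
        if key not in self.cache:  # compute only on a cache miss
            self.cache[key] = fn(*(v for v, _ in inputs_with_traces))
        return self.cache[key], key
```

Because the key is built from lineage rather than the (potentially large) input data, lookup cost is independent of data size.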
Declarative machine learning (ML) aims at the high-level specification of ML tasks or algorithms, and automatic generation of optimized execution plans from these specifications. The fundamental goal is to simplify usage and/or development, which is especially important in the context of large-scale computations. However, systems at different abstraction levels have emerged over time, and accordingly there has been a controversy about the meaning of this general definition of declarative ML. Specification alternatives...
Data science workflows are largely exploratory, dealing with under-specified objectives, open-ended problems, and unknown business value. Therefore, little investment is made in the systematic acquisition, integration, and pre-processing of data. This lack of infrastructure results in redundant manual effort and computation. Furthermore, central data consolidation is not always technically or economically desirable, or even feasible (e.g., due to privacy and/or data ownership). The ExDRa system aims to provide infrastructure for this...
Efficiently computing linear algebra expressions is central to machine learning (ML) systems. Most systems support sparse formats and operations because sparse matrices are ubiquitous and their dense representation can cause prohibitive overheads. Estimating the sparsity of intermediates, however, remains a key challenge when generating execution plans or performing sparse operations. These sparsity estimates are used for cost and memory estimates, format decisions, and result allocation. Existing sparsity estimators tend to focus on matrix...
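A classic baseline for the estimation problem described here: under the (often violated) assumption that nonzeros are uniformly and independently distributed, the output sparsity of a matrix product C = A(m×k) · B(k×n) is P[C_ij ≠ 0] = 1 − (1 − sA·sB)^k. A one-function sketch of this metadata-only estimator (illustrative; more accurate estimators like the one in the paper use per-row/column nonzero counts):

```python
def estimate_mm_sparsity(sA, sB, k):
    # sA, sB: fraction of nonzero cells in A and B; k: common dim.
    # Each of the k products A[i,l]*B[l,j] is nonzero with prob.
    # sA*sB; C[i,j] is nonzero if at least one of them is.
    return 1.0 - (1.0 - sA * sB) ** k
```

Note the estimate depends only on metadata (two sparsity scalars and a dimension), which is why optimizers can apply it per operator while compiling a plan; its weakness is exactly the independence assumption, which skewed real matrices violate.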