Mohamed Y. Eltabakh

ORCID: 0000-0002-6344-8246
Research Areas
  • Advanced Database Systems and Queries
  • Data Management and Algorithms
  • Scientific Computing and Data Management
  • Cloud Computing and Resource Management
  • Data Quality and Management
  • Time Series Analysis and Forecasting
  • Advanced Data Storage Technologies
  • Data Mining Algorithms and Applications
  • Semantic Web and Ontologies
  • Graph Theory and Algorithms
  • Algorithms and Data Compression
  • Data Stream Mining Techniques
  • Distributed and Parallel Computing Systems
  • Research Data Management Practices
  • Privacy-Preserving Technologies in Data
  • Human Mobility and Location-Based Analysis
  • Network Packet Processing and Optimization
  • Distributed systems and fault tolerance
  • Music and Audio Processing
  • Cryptography and Data Security
  • Plant nutrient uptake and metabolism
  • Traumatic Brain Injury Research
  • Cloud Data Security Solutions
  • Data Visualization and Analytics
  • Privacy, Security, and Data Protection

Qatar Cardiovascular Research Center
2023-2024

Worcester Polytechnic Institute
2013-2023

Teradata (United Kingdom)
2021

Menoufia University
2021

Teradata (United States)
2018

IBM (United States)
2011

Purdue University West Lafayette
2004-2010

Hadoop has become an attractive platform for large-scale data analytics. In this paper, we identify a major performance bottleneck of Hadoop: its lack of ability to colocate related data on the same set of nodes. To overcome this bottleneck, we introduce CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. In contrast to previous approaches, CoHadoop retains Hadoop's flexibility in that it does not require users to convert their data to a certain format (e.g., a relational database or a specific file format)...

10.14778/2002938.2002943 article EN Proceedings of the VLDB Endowment 2011-06-01
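A minimal sketch of the colocation idea described above, not CoHadoop's actual API: files that share a user-supplied locator are deterministically mapped to the same set of data nodes, so related data can be processed together without shuffling. All names (node list, replication factor) are illustrative assumptions.

```python
# Toy locator-based placement: files sharing a locator land on the same nodes.
import hashlib

DATA_NODES = [f"node-{i}" for i in range(10)]   # hypothetical cluster
REPLICATION = 3

def nodes_for_locator(locator: str, nodes=DATA_NODES, replication=REPLICATION):
    """Deterministically map a locator to a fixed set of data nodes."""
    start = int(hashlib.md5(locator.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

# Two related files (e.g., a log and its reference data) use the same locator,
# so their blocks are placed on the same nodes.
print(nodes_for_locator("customer-partition-7"))  # placement for file A
print(nodes_for_locator("customer-partition-7"))  # identical placement for file B
```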

This paper describes Jaql, a declarative scripting language for analyzing large semistructured datasets in parallel using Hadoop's MapReduce framework. Jaql is currently used in IBM's InfoSphere BigInsights [5] and Cognos Consumer Insight [9] products. Jaql's design features are: (1) a flexible data model, (2) reusability, (3) varying levels of abstraction, and (4) scalability. Its data model is inspired by JSON and can be used to represent datasets that vary from flat, relational tables to collections of semistructured documents. A Jaql script can start...

10.14778/3402755.3402761 article EN Proceedings of the VLDB Endowment 2011-08-01

We present a demonstration of the design of "STEAM", the Purdue Boiler Makers' stream database system that allows the processing of continuous and snap-shot queries over data streams. Specifically, the demonstration focuses on the query engine, "Nile". Nile extends the query processor engine of an object-relational database management system, PREDATOR, to process data streams. Nile supports extended SQL operators that handle sliding-window execution as an approach to restrict the size of the stored state in operators such as join.

10.1109/icde.2004.1320080 article EN 2004-09-28
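A small sketch of the sliding-window execution model mentioned above, under the assumption of a time-based window over timestamped tuples; it is not Nile's implementation, only an illustration of how windowing bounds the stored state.

```python
# Sliding-window aggregate: only tuples inside the window contribute,
# so the operator's state never grows without bound.
from collections import deque

class SlidingWindowAvg:
    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.buffer = deque()          # (timestamp, value) pairs inside the window
        self.total = 0.0

    def insert(self, ts: float, value: float) -> float:
        self.buffer.append((ts, value))
        self.total += value
        # Expire tuples that slid out of the window.
        while self.buffer and self.buffer[0][0] <= ts - self.window:
            _, old = self.buffer.popleft()
            self.total -= old
        return self.total / len(self.buffer)

agg = SlidingWindowAvg(window_seconds=10)
for ts, v in [(1, 5.0), (4, 7.0), (12, 9.0), (15, 3.0)]:
    print(ts, agg.insert(ts, v))
```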

We present Fanar, a platform for Arabic-centric multimodal generative AI systems that supports language, speech, and image generation tasks. At the heart of Fanar are Fanar Star and Fanar Prime, two highly capable Arabic Large Language Models (LLMs) that are best in class on well-established benchmarks among similar-sized models. Fanar Star is a 7B (billion) parameter model trained from scratch on nearly 1 trillion clean and deduplicated Arabic, English, and Code tokens. Fanar Prime is a 9B parameter model continually trained on the Gemma-2 base model on the same token set. Both models...

10.48550/arxiv.2501.13944 preprint EN arXiv (Cornell University) 2025-01-18

An increasingly important analytics scenario for Hadoop involves multiple (often ad hoc) grouping and aggregation queries with selection predicates over a slowly changing dataset. These queries are typically expressed via high-level query languages such as Jaql, Pig, and Hive, and are used either directly for business-intelligence applications or to prepare the data for statistical model building and machine learning. In such scenarios it has been recognized that, as in classical databases, techniques for avoiding access to irrelevant...

10.1145/2452376.2452388 article EN 2013-03-18
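One common way to avoid accessing irrelevant data in this setting, shown here only as an illustrative sketch (field names and split layout are assumptions), is to keep per-split min/max statistics and skip any split whose range cannot satisfy the selection predicate.

```python
# Illustrative split elimination with per-split min/max metadata.
splits = [
    {"path": "part-0000", "min_date": "2023-01-01", "max_date": "2023-03-31"},
    {"path": "part-0001", "min_date": "2023-04-01", "max_date": "2023-06-30"},
    {"path": "part-0002", "min_date": "2023-07-01", "max_date": "2023-09-30"},
]

def relevant_splits(splits, lo, hi):
    """Return only splits whose [min, max] range overlaps the query range."""
    return [s for s in splits if s["max_date"] >= lo and s["min_date"] <= hi]

# A query over May 2023 touches one split instead of scanning the whole dataset.
print([s["path"] for s in relevant_splits(splits, "2023-05-01", "2023-05-31")])
```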

Annotations play a key role in understanding and curating databases. They may represent comments, descriptions, and lineage information, among several others. Annotation management is a vital mechanism for sharing knowledge and building an interactive and collaborative environment between database users and scientists. What makes it challenging is that annotations can be attached to database entities at various granularities, e.g., at the table, tuple, column, or cell levels, or more generally, to any subset of cells that results from a select...

10.1145/1516360.1516405 article EN 2009-03-24

The massive amounts of time series data continuously generated and collected by applications warrant the need for large-scale distributed processing systems. Indexing plays a critical role in speeding up the similarity queries on which various analytics rely. However, the state-of-the-art indexing techniques, which are iSAX-based structures, do not scale well due to the small adopted fan-out (binary) that leads to a highly deep index tree and an expensive search cost through many internal nodes. More seriously, iSAX...

10.1109/icde.2019.00110 article EN 2019 IEEE 35th International Conference on Data Engineering (ICDE) 2019-04-01
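For context, a rough sketch of the PAA + SAX discretization that iSAX-style indexes build on: a z-normalized series is reduced to segment means and each mean becomes a symbol. The alphabet size and breakpoints below are illustrative assumptions, not the paper's parameters.

```python
# PAA + SAX symbolization sketch; similar series map to the same (or nearby)
# words, which is what iSAX-style indexes exploit to prune candidates.
import numpy as np

# Breakpoints for a 4-symbol alphabet under a standard normal assumption.
BREAKPOINTS = [-0.6745, 0.0, 0.6745]

def sax_word(series, segments=4):
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-9)          # z-normalize
    paa = x.reshape(segments, -1).mean(axis=1)     # piecewise aggregate means
    return tuple(int(np.searchsorted(BREAKPOINTS, v)) for v in paa)

print(sax_word([1, 2, 3, 4, 8, 9, 10, 11]))
print(sax_word([1, 1, 2, 2, 9, 9, 10, 10]))
```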

As datasets increase radically in size, highly scalable algorithms leveraging modern distributed infrastructures need to be developed for detecting outliers in massive datasets. In this work, we present the first distance-based outlier detection approach on a MapReduce-based infrastructure, called DOD. DOD features a single-pass execution framework that minimizes communication overhead. Furthermore, DOD overturns two fundamental assumptions widely adopted in the analytics literature, namely...

10.1109/icde.2017.143 article EN 2017-04-01
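A single-machine sketch of the distance-based outlier semantics involved (a point is an outlier if fewer than k other points lie within radius r of it); the paper's contribution is computing this at scale, which this toy example does not attempt.

```python
# Brute-force distance-based outlier detection on a small dataset.
import numpy as np

def distance_outliers(points, r=1.0, k=3):
    pts = np.asarray(points, dtype=float)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    neighbor_counts = (dists <= r).sum(axis=1) - 1     # exclude the point itself
    return [i for i, c in enumerate(neighbor_counts) if c < k]

data = [[0, 0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [5.0, 5.0]]
print(distance_outliers(data, r=1.0, k=3))   # the far-away point is flagged
```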

Policy-based management for federated healthcare systems has recently gained increasing attention due to strict privacy and disclosure rules. Although the work on policy languages and enforcement mechanisms, such as Hippocratic databases, has advanced our understanding of designing privacy-preserving policies, the need to integrate these policies in a practical framework is becoming acute. Additionally, although most of the work in this area has been organization oriented, dealing with the exchange of information between organizations (such...

10.1109/tkde.2007.1050 article EN IEEE Transactions on Knowledge and Data Engineering 2007-08-22

We demonstrate bdbms, an extensible database engine for biological databases. bdbms started from the observation that database technology has not kept pace with the specific requirements of biological databases and that several needed key functionalities are not supported at the engine level. While bdbms aims at supporting these functionalities, this demo focuses on: (1) Annotation and provenance management including storage, indexing, querying, and propagation, (2) Local dependency tracking of dependencies and derivations among data items, (3) Update...

10.1109/icde.2008.4497631 article EN 2008-04-01

Organizations are collecting increasingly large amounts of data for data-driven decision making. These data are often dumped into a centralized repository, e.g., a data lake, consisting of thousands of structured and unstructured datasets. Perversely, such a mixture makes the problem of discovering the tables or documents relevant to a user's query very challenging. Despite recent efforts in data discovery, the problem remains widely open, especially on two fronts: (1) discovering relationships and relatedness across datasets, where existing techniques...

10.14778/3611479.3611533 article EN Proceedings of the VLDB Endowment 2023-07-01
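One common building block for dataset relatedness in a data lake, shown as a sketch rather than the paper's method: score how joinable two tables are by the value overlap (Jaccard similarity) of their columns. The table names and data below are made up for illustration.

```python
# Rank candidate tables by the best column-value overlap with a query column.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def best_join_candidate(query_col, tables):
    scored = []
    for name, columns in tables.items():
        best = max(jaccard(query_col, set(col)) for col in columns.values())
        scored.append((best, name))
    return sorted(scored, reverse=True)

query = {"US", "France", "Qatar", "Egypt"}
lake = {
    "gdp_by_country": {"country": ["US", "France", "Egypt", "Japan"]},
    "sensor_log":     {"device_id": ["d1", "d2", "d3"]},
}
print(best_join_candidate(query, lake))
```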

Many evolving database applications warrant the use of non-traditional indexing mechanisms beyond B+-trees and hash tables. SP-GiST is an extensible indexing framework that broadens the class of supported indexes to include disk-based versions of a wide variety of space-partitioning trees, e.g., trie variants, quadtrees, and kd-trees. This paper presents a serious attempt at implementing and realizing SP-GiST-based indexes inside PostgreSQL. Several index types are realized inside PostgreSQL, facilitated by rapid SP-GiST instantiations. Challenges,...

10.1109/icde.2006.146 article EN 2006-01-01
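To make the space-partitioning idea concrete, here is a tiny in-memory point quadtree, one of the structures SP-GiST can instantiate; it is only a conceptual sketch, not the disk-based realization the paper describes.

```python
# Point quadtree: every overflowing node splits its square into four quadrants.
class QuadTree:
    def __init__(self, x0, y0, x1, y1, capacity=2):
        self.box = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []
        self.children = None

    def insert(self, x, y):
        if self.children is not None:
            return self._child(x, y).insert(x, y)
        self.points.append((x, y))
        if len(self.points) > self.capacity:
            self._split()

    def _split(self):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadTree(x0, y0, mx, my), QuadTree(mx, y0, x1, my),
                         QuadTree(x0, my, mx, y1), QuadTree(mx, my, x1, y1)]
        old, self.points = self.points, []
        for px, py in old:
            self._child(px, py).insert(px, py)

    def _child(self, x, y):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]

tree = QuadTree(0, 0, 100, 100)
for p in [(10, 10), (80, 20), (25, 70), (60, 60), (61, 62)]:
    tree.insert(*p)
```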

Big data infrastructures are increasingly supporting datasets that are relatively structured. These datasets are full of correlations among their attributes, which, if managed in systematic ways, would enable optimization opportunities that otherwise will be missed. Unlike relational databases, where discovering and exploiting correlations in query optimization have been extensively studied, in big data infrastructures such important data properties and their utilization have been mostly abandoned. The key reason is that domain experts may know many correlations but with a degree of uncertainty (...

10.14778/2994509.2994519 article EN Proceedings of the VLDB Endowment 2016-08-01

Scalable subsequence matching is critical for supporting analytics on big time series, from mining and prediction to hypothesis testing. However, state-of-the-art techniques do not scale well to TB-scale datasets. Not only does index construction become prohibitively expensive, but the query response time also deteriorates quickly as the length of the query exceeds several hundreds of data points. Although Locality Sensitive Hashing (LSH) has emerged as a promising solution for indexing long time series, it relies on expensive hash...

10.1109/icde48307.2020.00052 article EN 2020 IEEE 36th International Conference on Data Engineering (ICDE) 2020-04-01
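For reference, the query semantics being accelerated can be sketched as a brute-force scan: find the offset in a long series whose window is closest to the query under z-normalized Euclidean distance. This is exactly the linear scan that indexing avoids; it is not the paper's technique.

```python
# Brute-force best-match subsequence search (what an index is built to avoid).
import numpy as np

def znorm(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + 1e-9)

def best_match(series, query):
    q, m = znorm(query), len(query)
    best = (float("inf"), -1)
    for i in range(len(series) - m + 1):
        d = np.linalg.norm(znorm(series[i:i + m]) - q)
        best = min(best, (d, i))
    return best                      # (distance, offset)

series = [0, 1, 2, 3, 2, 1, 0, 5, 9, 5, 0, 1, 2, 3]
print(best_match(series, [1, 2, 3, 2]))
```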

In this paper, we address the challenges that arise from the growing scale of annotations in scientific databases. On one hand, end-users and scientists are incapable of analyzing and extracting knowledge from the large number of reported annotations, e.g., a single tuple may have hundreds of annotations attached to it over time. On the other hand, current annotation management techniques fall short of providing advanced processing beyond just propagating the annotations to end-users. To address this limitation, we propose the InsightNotes system, a summary-based annotation management engine in relational...

10.1145/2588555.2610501 article EN 2014-06-18

Large language models (LLMs) have shown great potential in data cleaning, which is a fundamental task in all modern applications. In this demo proposal, we demonstrate that LLMs can indeed assist in data cleaning, e.g., filling missing values in a table, through different approaches. For example, cloud-based non-private LLMs, e.g., the OpenAI GPT family or Google Gemini, can clean datasets that encompass world-knowledge information (Scenario 1). However, such LLMs may struggle with data they have never encountered before, e.g., local enterprise data,...

10.14778/3685800.3685890 article EN Proceedings of the VLDB Endowment 2024-08-01
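A hedged sketch of the Scenario 1 idea above: prompt an LLM to impute a missing cell from the rest of the row. `call_llm` is a placeholder, not a real SDK call; wiring it to an actual cloud or local model endpoint is left to the reader.

```python
# Prompt-based missing-value imputation sketch; `call_llm` is hypothetical.
def call_llm(prompt: str) -> str:
    """Placeholder: route the prompt to an LLM endpoint and return its reply."""
    raise NotImplementedError("wire this to your model endpoint")

def fill_missing(row: dict, missing_column: str) -> str:
    known = ", ".join(f"{k} = {v}" for k, v in row.items() if v is not None)
    prompt = (
        f"A table row has these values: {known}. "
        f"Give the most likely value of '{missing_column}'; answer only the value."
    )
    return call_llm(prompt).strip()

row = {"country": "France", "capital": None, "continent": "Europe"}
# fill_missing(row, "capital") would ask the model and is expected to return "Paris".
```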

P-selectin (CD62P) is a platelet activation marker that was claimed to mediate the accumulation of platelets induced by cholestasis. The nature of platelet dysfunction and hemostasis abnormalities in cholestatic liver disease needs to be explored further. The aim of this study was to assess CD62P expression in cirrhotic patients with and without cholestasis, and to evaluate its relationship to bleeding tendency. 150 participants were included in this case-control study. Participants were divided into 84 patients with cirrhosis (group I), 44 of whom had cholestasis (Group Ia) and 40 of whom did not (Group Ib);...

10.5114/ceh.2021.107566 article EN Clinical and Experimental Hepatology 2021-01-01

Run-Length Encoding (RLE) is a data compression technique used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String...

10.1145/1353343.1353407 article EN 2008-03-25
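A small sketch of the underlying encoding and of operating on it without full decompression (positional access via prefix sums of run lengths); the SBC-tree does this kind of work at index scale, which this toy code does not attempt.

```python
# RLE encoding plus positional access directly on the runs.
from itertools import groupby
from bisect import bisect_right

def rle_encode(s: str):
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def char_at(runs, i: int) -> str:
    """Return the i-th character of the original string using only the runs."""
    ends, total = [], 0
    for _, length in runs:
        total += length
        ends.append(total)
    return runs[bisect_right(ends, i)][0]

runs = rle_encode("AAAACCCGTTTT")
print(runs)              # [('A', 4), ('C', 3), ('G', 1), ('T', 4)]
print(char_at(runs, 7))  # 'G' (0-based position 7 of the original string)
```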

This demonstration presents the Redoop infrastructure, the first full-fledged MapReduce framework with native support for recurring big data queries. Recurring queries, repeatedly executed for long periods of time over evolving high-volume data, have become a bedrock component in most large-scale analytic applications. Redoop is a comprehensive extension to Hadoop that pushes the support and optimization of recurring queries into Hadoop's core functionality. While backward compatible with regular MapReduce jobs, Redoop achieves an order of magnitude...

10.14778/2733004.2733037 article EN Proceedings of the VLDB Endowment 2014-08-01

With the increasing complexity of data-intensive MapReduce workloads, Hadoop must often accommodate hundreds or even thousands of recurring analytics queries that periodically execute over frequently updated datasets, e.g., the latest stock transactions, new log files, or recent news feeds. For many applications, such queries come with user-specified service-level agreements (SLAs), commonly expressed as the maximum allowed latency for producing results before their merits decay. The nature of these emerging...

10.14778/2752939.2752941 article EN Proceedings of the VLDB Endowment 2015-02-01

Consider two values, x and y, in the database, where y = F(x). To maintain the consistency of the data, whenever x changes, F needs to be executed to re-compute y and update its value in the database. This is straightforward in the case where F can be executed by the DBMS, e.g., as an SQL or C function. In this paper, we address the more challenging case where F is a human action, e.g., conducting a wet-lab experiment, taking manual measurements, or collecting instrument readings. In this case, when x changes, y remains invalid (inconsistent with the current x) until the human action involved in the derivation is performed...

10.1109/tkde.2013.117 article EN IEEE Transactions on Knowledge and Data Engineering 2013-07-16
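A minimal sketch of the invalidation idea described above, with hypothetical field names: when x changes and F is a human action, y cannot be recomputed automatically, so it is flagged as invalid until the curator performs the action and reports the new value.

```python
# Track whether a human-derived value y is still consistent with its input x.
class DerivedValue:
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.valid = True                 # y is consistent with x

    def update_x(self, new_x):
        self.x = new_x
        self.valid = False                # y is now stale; awaiting the human action

    def report_y(self, new_y):
        self.y = new_y
        self.valid = True                 # curator re-derived y = F(x)

cell = DerivedValue(x="sample-123", y="pH 7.2")
cell.update_x("sample-124")               # wet-lab measurement not yet redone
print(cell.valid)                         # False: queries can see y is suspect
cell.report_y("pH 6.9")
print(cell.valid)                         # True
```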