Mohamed Y. Eltabakh

ORCID: 0000-0002-6344-8246
Research Areas
  • Advanced Database Systems and Queries
  • Data Management and Algorithms
  • Scientific Computing and Data Management
  • Cloud Computing and Resource Management
  • Data Quality and Management
  • Time Series Analysis and Forecasting
  • Advanced Data Storage Technologies
  • Data Mining Algorithms and Applications
  • Semantic Web and Ontologies
  • Graph Theory and Algorithms
  • Algorithms and Data Compression
  • Data Stream Mining Techniques
  • Distributed and Parallel Computing Systems
  • Research Data Management Practices
  • Privacy-Preserving Technologies in Data
  • Human Mobility and Location-Based Analysis
  • Network Packet Processing and Optimization
  • Distributed systems and fault tolerance
  • Music and Audio Processing
  • Cryptography and Data Security
  • Plant nutrient uptake and metabolism
  • Traumatic Brain Injury Research
  • Cloud Data Security Solutions
  • Data Visualization and Analytics
  • Privacy, Security, and Data Protection

Qatar Cardiovascular Research Center
2023-2024

Worcester Polytechnic Institute
2013-2023

Teradata (United Kingdom)
2021

Menoufia University
2021

Teradata (United States)
2018

IBM (United States)
2011

Purdue University West Lafayette
2004-2010

Hadoop has become an attractive platform for large-scale data analytics. In this paper, we identify a major performance bottleneck of Hadoop: its lack of ability to colocate related data on the same set of nodes. To overcome this bottleneck, we introduce CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. In contrast to previous approaches, CoHadoop retains Hadoop's flexibility in that it does not require users to convert their data to a certain format (e.g., a relational database or a specific file format)...

10.14778/2002938.2002943 article EN Proceedings of the VLDB Endowment 2011-06-01
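A minimal sketch of the colocation idea described above, not CoHadoop's actual API: files that share a user-supplied locator are deterministically mapped to the same set of data nodes, so related data can be processed together without shuffling. All names (node list, replication factor) are illustrative assumptions.

```python
# Toy locator-based placement: files sharing a locator land on the same nodes.
import hashlib

DATA_NODES = [f"node-{i}" for i in range(10)]   # hypothetical cluster
REPLICATION = 3

def nodes_for_locator(locator: str, nodes=DATA_NODES, replication=REPLICATION):
    """Deterministically map a locator to a fixed set of data nodes."""
    start = int(hashlib.md5(locator.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

# Two related files (e.g., a log and its reference data) use the same locator,
# so their blocks are placed on the same nodes.
print(nodes_for_locator("customer-partition-7"))  # placement for file A
print(nodes_for_locator("customer-partition-7"))  # identical placement for file B
```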

This paper describes Jaql, a declarative scripting language for analyzing large semistructured datasets in parallel using Hadoop's MapReduce framework. Jaql is currently used in IBM's InfoSphere BigInsights [5] and Cognos Consumer Insight [9] products. Jaql's design features are: (1) a flexible data model, (2) reusability, (3) varying levels of abstraction, and (4) scalability. Its data model is inspired by JSON and can be used to represent datasets that vary from flat, relational tables to collections of semistructured documents. A Jaql script can start...

10.14778/3402755.3402761 article EN Proceedings of the VLDB Endowment 2011-08-01

We present a demonstration of the design of "STEAM", the Purdue Boiler Makers' stream database system that allows the processing of continuous and snap-shot queries over data streams. Specifically, the demonstration focuses on the query engine, "Nile". Nile extends the query processor engine of an object-relational database management system, PREDATOR, to process data streams. Nile supports extended SQL operators that handle sliding-window execution as an approach to restrict the size of the stored state in operators such as join.

10.1109/icde.2004.1320080 article EN 2004-09-28
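A small sketch of the sliding-window execution model mentioned above, under the assumption of a time-based window over timestamped tuples; it is not Nile's implementation, only an illustration of how windowing bounds the stored state.

```python
# Sliding-window aggregate: only tuples inside the window contribute,
# so the operator's state never grows without bound.
from collections import deque

class SlidingWindowAvg:
    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.buffer = deque()          # (timestamp, value) pairs inside the window
        self.total = 0.0

    def insert(self, ts: float, value: float) -> float:
        self.buffer.append((ts, value))
        self.total += value
        # Expire tuples that slid out of the window.
        while self.buffer and self.buffer[0][0] <= ts - self.window:
            _, old = self.buffer.popleft()
            self.total -= old
        return self.total / len(self.buffer)

agg = SlidingWindowAvg(window_seconds=10)
for ts, v in [(1, 5.0), (4, 7.0), (12, 9.0), (15, 3.0)]:
    print(ts, agg.insert(ts, v))
```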

We present Fanar, a platform for Arabic-centric multimodal generative AI systems that supports language, speech, and image generation tasks. At the heart of Fanar are Fanar Star and Fanar Prime, two highly capable Arabic Large Language Models (LLMs) that are best in class on well-established benchmarks among similar-sized models. Fanar Star is a 7B (billion) parameter model trained from scratch on nearly 1 trillion clean and deduplicated Arabic, English, and Code tokens. Fanar Prime is a 9B parameter model continually trained on the Gemma-2 base model on the same token set. Both models...

10.48550/arxiv.2501.13944 preprint EN arXiv (Cornell University) 2025-01-18

An increasingly important analytics scenario for Hadoop involves multiple (often ad hoc) grouping and aggregation queries with selection predicates over a slowly changing dataset. These queries are typically expressed via high-level query languages such as Jaql, Pig, and Hive, and are used either directly for business-intelligence applications or to prepare the data for statistical model building and machine learning. In such scenarios it has been recognized that, as in classical databases, techniques for avoiding access to irrelevant...

10.1145/2452376.2452388 article EN 2013-03-18
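One common way to avoid accessing irrelevant data in this setting, shown here only as an illustrative sketch (field names and split layout are assumptions), is to keep per-split min/max statistics and skip any split whose range cannot satisfy the selection predicate.

```python
# Illustrative split elimination with per-split min/max metadata.
splits = [
    {"path": "part-0000", "min_date": "2023-01-01", "max_date": "2023-03-31"},
    {"path": "part-0001", "min_date": "2023-04-01", "max_date": "2023-06-30"},
    {"path": "part-0002", "min_date": "2023-07-01", "max_date": "2023-09-30"},
]

def relevant_splits(splits, lo, hi):
    """Return only splits whose [min, max] range overlaps the query range."""
    return [s for s in splits if s["max_date"] >= lo and s["min_date"] <= hi]

# A query over May 2023 touches one split instead of scanning the whole dataset.
print([s["path"] for s in relevant_splits(splits, "2023-05-01", "2023-05-31")])
```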

Annotations play a key role in understanding and curating databases. They may represent comments, descriptions, and lineage information, among several others. Annotation management is a vital mechanism for sharing knowledge and building an interactive and collaborative environment between database users and scientists. What makes it challenging is that annotations can be attached to database entities at various granularities, e.g., at the table, tuple, column, or cell levels, or more generally, to any subset of cells that results from a select...

10.1145/1516360.1516405 article EN 2009-03-24

The massive amounts of time series data continuously generated and collected by applications warrant the need for large-scale distributed processing systems. Indexing plays a critical role in speeding up the similarity queries on which various analytics rely. However, the state-of-the-art indexing techniques, which are iSAX-based structures, do not scale well due to the small adopted fan-out (binary) that leads to a highly deep index tree and an expensive search cost through many internal nodes. More seriously, iSAX...

10.1109/icde.2019.00110 article EN 2019 IEEE 35th International Conference on Data Engineering (ICDE) 2019-04-01
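For context, a rough sketch of the PAA + SAX discretization that iSAX-style indexes build on: a z-normalized series is reduced to segment means and each mean becomes a symbol. The alphabet size and breakpoints below are illustrative assumptions, not the paper's parameters.

```python
# PAA + SAX symbolization sketch; similar series map to the same (or nearby)
# words, which is what iSAX-style indexes exploit to prune candidates.
import numpy as np

# Breakpoints for a 4-symbol alphabet under a standard normal assumption.
BREAKPOINTS = [-0.6745, 0.0, 0.6745]

def sax_word(series, segments=4):
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-9)          # z-normalize
    paa = x.reshape(segments, -1).mean(axis=1)     # piecewise aggregate means
    return tuple(int(np.searchsorted(BREAKPOINTS, v)) for v in paa)

print(sax_word([1, 2, 3, 4, 8, 9, 10, 11]))
print(sax_word([1, 1, 2, 2, 9, 9, 10, 10]))
```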

As datasets increase radically in size, highly scalable algorithms leveraging modern distributed infrastructures need to be developed for detecting outliers in massive datasets. In this work, we present the first distance-based outlier detection approach on a MapReduce-based infrastructure, called DOD. DOD features a single-pass execution framework that minimizes communication overhead. Furthermore, DOD overturns two fundamental assumptions widely adopted in the analytics literature, namely...

10.1109/icde.2017.143 article EN 2017-04-01
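A single-machine sketch of the distance-based outlier semantics involved (a point is an outlier if fewer than k other points lie within radius r of it); the paper's contribution is computing this at scale, which this toy example does not attempt.

```python
# Brute-force distance-based outlier detection on a small dataset.
import numpy as np

def distance_outliers(points, r=1.0, k=3):
    pts = np.asarray(points, dtype=float)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    neighbor_counts = (dists <= r).sum(axis=1) - 1     # exclude the point itself
    return [i for i, c in enumerate(neighbor_counts) if c < k]

data = [[0, 0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [5.0, 5.0]]
print(distance_outliers(data, r=1.0, k=3))   # the far-away point is flagged
```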

Policy-based management for federated healthcare systems has recently gained increasing attention due to strict privacy and disclosure rules. Although the work on policy languages and enforcement mechanisms, such as Hippocratic databases, has advanced our understanding of designing privacy-preserving policies, the need to integrate these policies in a practical framework is becoming acute. Additionally, although most of the work in this area has been organization oriented, dealing with the exchange of information between organizations (such...

10.1109/tkde.2007.1050 article EN IEEE Transactions on Knowledge and Data Engineering 2007-08-22

We demonstrate bdbms, an extensible database engine for biological databases. bdbms started from the observation that database technology has not kept pace with the specific requirements of biological databases and that several needed key functionalities are not supported at the engine level. While bdbms aims at supporting these functionalities, this demo focuses on: (1) Annotation and provenance management including storage, indexing, querying, and propagation, (2) Local dependency tracking of dependencies and derivations among data items, (3) Update...

10.1109/icde.2008.4497631 article EN 2008-04-01

Organizations are collecting increasingly large amounts of data for data-driven decision making. These data are often dumped into a centralized repository, e.g., a data lake, consisting of thousands of structured and unstructured datasets. Perversely, such a mixture makes the problem of discovering the tables or documents relevant to a user's query very challenging. Despite recent efforts in data discovery, the problem remains widely open, especially on two fronts: (1) discovering relationships and relatedness across datasets, where existing techniques...

10.14778/3611479.3611533 article EN Proceedings of the VLDB Endowment 2023-07-01
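One common building block for dataset relatedness in a data lake, shown as a sketch rather than the paper's method: score how joinable two tables are by the value overlap (Jaccard similarity) of their columns. The table names and data below are made up for illustration.

```python
# Rank candidate tables by the best column-value overlap with a query column.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def best_join_candidate(query_col, tables):
    scored = []
    for name, columns in tables.items():
        best = max(jaccard(query_col, set(col)) for col in columns.values())
        scored.append((best, name))
    return sorted(scored, reverse=True)

query = {"US", "France", "Qatar", "Egypt"}
lake = {
    "gdp_by_country": {"country": ["US", "France", "Egypt", "Japan"]},
    "sensor_log":     {"device_id": ["d1", "d2", "d3"]},
}
print(best_join_candidate(query, lake))
```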

Many evolving database applications warrant the use of non-traditional indexing mechanisms beyond B+-trees and hash tables. SP-GiST is an extensible indexing framework that broadens the class of supported indexes to include disk-based versions of a wide variety of space-partitioning trees, e.g., trie variants, quadtrees, and kd-trees. This paper presents a serious attempt at implementing and realizing SP-GiST-based indexes inside PostgreSQL. Several index types are realized inside PostgreSQL, facilitated by rapid SP-GiST instantiations. Challenges,...

10.1109/icde.2006.146 article EN 2006-01-01
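To make the space-partitioning idea concrete, here is a tiny in-memory point quadtree, one of the structures SP-GiST can instantiate; it is only a conceptual sketch, not the disk-based realization the paper describes.

```python
# Point quadtree: every overflowing node splits its square into four quadrants.
class QuadTree:
    def __init__(self, x0, y0, x1, y1, capacity=2):
        self.box = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []
        self.children = None

    def insert(self, x, y):
        if self.children is not None:
            return self._child(x, y).insert(x, y)
        self.points.append((x, y))
        if len(self.points) > self.capacity:
            self._split()

    def _split(self):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadTree(x0, y0, mx, my), QuadTree(mx, y0, x1, my),
                         QuadTree(x0, my, mx, y1), QuadTree(mx, my, x1, y1)]
        old, self.points = self.points, []
        for px, py in old:
            self._child(px, py).insert(px, py)

    def _child(self, x, y):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        return self.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]

tree = QuadTree(0, 0, 100, 100)
for p in [(10, 10), (80, 20), (25, 70), (60, 60), (61, 62)]:
    tree.insert(*p)
```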

Big data infrastructures are increasingly supporting datasets that are relatively structured. These datasets are full of correlations among their attributes, which, if managed in systematic ways, would enable optimization opportunities that otherwise will be missed. Unlike relational databases, where discovering and exploiting correlations in query optimization have been extensively studied, in big data infrastructures such important data properties and their utilization have been mostly abandoned. The key reason is that domain experts may know many correlations but with a degree of uncertainty (...

10.14778/2994509.2994519 article EN Proceedings of the VLDB Endowment 2016-08-01

Scalable subsequence matching is critical for supporting analytics on big time series, from mining and prediction to hypothesis testing. However, state-of-the-art techniques do not scale well to TB-scale datasets. Not only does index construction become prohibitively expensive, but the query response time also deteriorates quickly as the length of the query exceeds several hundreds of data points. Although Locality Sensitive Hashing (LSH) has emerged as a promising solution for indexing long time series, it relies on expensive hash...

10.1109/icde48307.2020.00052 article EN 2020 IEEE 36th International Conference on Data Engineering (ICDE) 2020-04-01
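For reference, the query semantics being accelerated can be sketched as a brute-force scan: find the offset in a long series whose window is closest to the query under z-normalized Euclidean distance. This is exactly the linear scan that indexing avoids; it is not the paper's technique.

```python
# Brute-force best-match subsequence search (what an index is built to avoid).
import numpy as np

def znorm(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + 1e-9)

def best_match(series, query):
    q, m = znorm(query), len(query)
    best = (float("inf"), -1)
    for i in range(len(series) - m + 1):
        d = np.linalg.norm(znorm(series[i:i + m]) - q)
        best = min(best, (d, i))
    return best                      # (distance, offset)

series = [0, 1, 2, 3, 2, 1, 0, 5, 9, 5, 0, 1, 2, 3]
print(best_match(series, [1, 2, 3, 2]))
```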

In this paper, we address the challenges that arise from the growing scale of annotations in scientific databases. On one hand, end-users and scientists are incapable of analyzing and extracting knowledge from the large number of reported annotations, e.g., a single tuple may have hundreds of annotations attached to it over time. On the other hand, current annotation management techniques fall short of providing advanced processing beyond just propagating the annotations to end-users. To address this limitation, we propose the InsightNotes system, a summary-based annotation management engine in relational...

10.1145/2588555.2610501 article EN 2014-06-18

Large language models (LLMs) have shown great potential in data cleaning, which is a fundamental task in all modern applications. In this demo proposal, we demonstrate that LLMs can indeed assist in data cleaning, e.g., filling missing values in a table, through different approaches. For example, cloud-based non-private LLMs, e.g., the OpenAI GPT family or Google Gemini, can clean datasets that encompass world-knowledge information (Scenario 1). However, such LLMs may struggle with data they have never encountered before, e.g., local enterprise data,...

10.14778/3685800.3685890 article EN Proceedings of the VLDB Endowment 2024-08-01
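A hedged sketch of the Scenario 1 idea above: prompt an LLM to impute a missing cell from the rest of the row. `call_llm` is a placeholder, not a real SDK call; wiring it to an actual cloud or local model endpoint is left to the reader.

```python
# Prompt-based missing-value imputation sketch; `call_llm` is hypothetical.
def call_llm(prompt: str) -> str:
    """Placeholder: route the prompt to an LLM endpoint and return its reply."""
    raise NotImplementedError("wire this to your model endpoint")

def fill_missing(row: dict, missing_column: str) -> str:
    known = ", ".join(f"{k} = {v}" for k, v in row.items() if v is not None)
    prompt = (
        f"A table row has these values: {known}. "
        f"Give the most likely value of '{missing_column}'; answer only the value."
    )
    return call_llm(prompt).strip()

row = {"country": "France", "capital": None, "continent": "Europe"}
# fill_missing(row, "capital") would ask the model and is expected to return "Paris".
```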

P-selectin (CD62P) is a platelet activation marker that was claimed to mediate the accumulation of platelets induced by cholestasis. The nature of platelet dysfunction and hemostasis abnormalities in cholestatic liver disease needs to be explored further. The aim of this study was to assess CD62P expression in cirrhotic patients with and without cholestasis, and to evaluate its relationship to bleeding tendency. 150 participants were included in this case-control study. Participants were divided into 84 patients with cirrhosis (group I), 44 of whom had cholestasis (Group Ia) and 40 of whom did not (Group Ib);...

10.5114/ceh.2021.107566 article EN Clinical and Experimental Hepatology 2021-01-01

Run-Length Encoding (RLE) is a data compression technique used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String...

10.1145/1353343.1353407 article EN 2008-03-25
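A small sketch of the underlying encoding and of operating on it without full decompression (positional access via prefix sums of run lengths); the SBC-tree does this kind of work at index scale, which this toy code does not attempt.

```python
# RLE encoding plus positional access directly on the runs.
from itertools import groupby
from bisect import bisect_right

def rle_encode(s: str):
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def char_at(runs, i: int) -> str:
    """Return the i-th character of the original string using only the runs."""
    ends, total = [], 0
    for _, length in runs:
        total += length
        ends.append(total)
    return runs[bisect_right(ends, i)][0]

runs = rle_encode("AAAACCCGTTTT")
print(runs)              # [('A', 4), ('C', 3), ('G', 1), ('T', 4)]
print(char_at(runs, 7))  # 'G' (0-based position 7 of the original string)
```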

This demonstration presents the Redoop infrastructure, the first full-fledged MapReduce framework with native support for recurring big data queries. Recurring queries, repeatedly executed for long periods of time over evolving high-volume data, have become a bedrock component in most large-scale analytic applications. Redoop is a comprehensive extension to Hadoop that pushes the support and optimization of recurring queries into Hadoop's core functionality. While backward compatible with regular MapReduce jobs, Redoop achieves an order of magnitude...

10.14778/2733004.2733037 article EN Proceedings of the VLDB Endowment 2014-08-01

With the increasing complexity of data-intensive MapReduce workloads, Hadoop must often accommodate hundreds or even thousands of recurring analytics queries that periodically execute over frequently updated datasets, e.g., the latest stock transactions, new log files, or recent news feeds. For many applications, such queries come with user-specified service-level agreements (SLAs), commonly expressed as the maximum allowed latency for producing results before their merits decay. The nature of these emerging...

10.14778/2752939.2752941 article EN Proceedings of the VLDB Endowment 2015-02-01

Consider two values, x and y, in the database, where y = F(x). To maintain the consistency of the data, whenever x changes, F needs to be executed to re-compute y and update its value in the database. This is straightforward in the case where F can be executed by the DBMS, e.g., as an SQL or C function. In this paper, we address the more challenging case where F is a human action, e.g., conducting a wet-lab experiment, taking manual measurements, or collecting instrument readings. In this case, when x changes, y remains invalid (inconsistent with the current x) until the human action involved in the derivation is performed...

10.1109/tkde.2013.117 article EN IEEE Transactions on Knowledge and Data Engineering 2013-07-16
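A minimal sketch of the invalidation idea described above, with hypothetical field names: when x changes and F is a human action, y cannot be recomputed automatically, so it is flagged as invalid until the curator performs the action and reports the new value.

```python
# Track whether a human-derived value y is still consistent with its input x.
class DerivedValue:
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.valid = True                 # y is consistent with x

    def update_x(self, new_x):
        self.x = new_x
        self.valid = False                # y is now stale; awaiting the human action

    def report_y(self, new_y):
        self.y = new_y
        self.valid = True                 # curator re-derived y = F(x)

cell = DerivedValue(x="sample-123", y="pH 7.2")
cell.update_x("sample-124")               # wet-lab measurement not yet redone
print(cell.valid)                         # False: queries can see y is suspect
cell.report_y("pH 6.9")
print(cell.valid)                         # True
```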