- Scientific Computing and Data Management
- Distributed and Parallel Computing Systems
- Advanced Data Storage Technologies
- Research Data Management Practices
- Data Quality and Management
- Cloud Computing and Resource Management
- Distributed Systems and Fault Tolerance
- Advanced Database Systems and Queries
- Data Management and Algorithms
- Geographic Information Systems Studies
- Business Process Modeling and Analysis
- Mobile Crowdsensing and Crowdsourcing
- Geological Modeling and Analysis
- Environmental Monitoring and Data Management
- Semantic Web and Ontologies
- IoT and Edge/Fog Computing
- Explainable Artificial Intelligence (XAI)
- Service-Oriented Architecture and Web Services
- Blockchain Technology Applications and Security
- Cloud Data Security Solutions
- Machine Learning in Materials Science
- Machine Learning and Data Classification
Oak Ridge National Laboratory
2022-2024
Office of Scientific and Technical Information
2024
Naval Research Laboratory Information Technology Division
2023
University of Chicago
2016-2022
University of Illinois Chicago
2020-2022
Argonne National Laboratory
2017
Exploding data volumes and velocities, new computational methods and platforms, and ubiquitous connectivity demand new approaches to computation in the sciences. These new approaches must enable computation to be mobile, so that, for example, it can occur near data, be triggered by events (e.g., the arrival of new data), be offloaded to specialized accelerators, or run remotely where resources are available. They also require new design approaches in which monolithic applications can be decomposed into smaller components that may in turn be executed separately and on the most suitable...
funcX is a distributed function as a service (FaaS) platform that enables flexible, scalable, and high-performance remote function execution. Unlike centralized FaaS systems, funcX decouples the cloud-hosted management functionality from the edge-hosted execution functionality. funcX's endpoint software can be deployed, by users or administrators, on arbitrary laptops, clouds, clusters, and supercomputers, in effect turning them into function serving systems. funcX's cloud-hosted service provides a single location for registering, sharing, and managing both functions and endpoints...
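To make this execution model concrete, here is a minimal sketch of registering and invoking a function with the funcX Python SDK. The import path and client methods follow funcX-era releases (the project has since been renamed Globus Compute) and may vary by version; the endpoint UUID is a placeholder, not a real deployment.

```python
# Minimal sketch using the funcX SDK (now Globus Compute); the import
# path matches funcX-era releases and may differ in later versions.
from funcx import FuncXClient

def double(x):
    return 2 * x

fxc = FuncXClient()

# Register the function once with the cloud-hosted funcX service.
func_id = fxc.register_function(double)

# Dispatch an invocation to an endpoint you have deployed and may use.
endpoint_id = "00000000-0000-0000-0000-000000000000"  # placeholder UUID
task_id = fxc.run(21, endpoint_id=endpoint_id, function_id=func_id)

# Results are fetched asynchronously; get_result raises while pending.
print(fxc.get_result(task_id))  # -> 42 once the task completes
```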
Growing data volumes and velocities are driving exciting new methods across the sciences in which data analytics and machine learning are increasingly intertwined with research. These new methods require new approaches for scientific computing in which computation is mobile, so that, for example, it can occur near data, be triggered by events (e.g., the arrival of new data), or be offloaded to specialized accelerators. They also require new design approaches in which monolithic applications can be decomposed into smaller components that may in turn be executed separately and on the most...
Many interesting geospatial datasets are publicly accessible on web sites and other online repositories. However, the sheer number of such locations, plus a lack of support for cross-repository search, makes it difficult for researchers to discover and integrate relevant data. We describe here early results from a system, Klimatic, that aims to overcome these barriers to discovery and use by automating the tasks of crawling, indexing, integrating, and distributing geospatial data. Klimatic implements a scalable crawling and processing architecture that uses...
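A hypothetical sketch of the crawl-and-index loop that Klimatic automates appears below; the repository URLs, record schema, and helper names are illustrative assumptions, not Klimatic's actual interfaces.

```python
# Hypothetical crawl-then-index pattern; GEO_SOURCES and the record
# fields are placeholders, not Klimatic's real catalog format.
import requests

GEO_SOURCES = [
    "https://example.org/repo-a/catalog.json",  # placeholder repositories
    "https://example.org/repo-b/catalog.json",
]

def crawl(sources):
    """Fetch each repository catalog and yield its dataset records."""
    for url in sources:
        catalog = requests.get(url, timeout=30).json()
        for record in catalog.get("datasets", []):
            yield record

def index_dataset(index, record):
    """Store minimal cross-repository search fields for one dataset."""
    index[record["id"]] = {
        "title": record.get("title", ""),
        "bbox": record.get("bbox"),         # spatial extent, if present
        "variables": record.get("variables", []),
        "source": record.get("source_url"),
    }

index = {}
for record in crawl(GEO_SOURCES):
    index_dataset(index, record)
```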
Modern large-scale scientific discovery requires multidisciplinary collaboration across diverse computing facilities, including High Performance Computing (HPC) machines and the Edge-to-Cloud continuum. Integrated data analysis plays a crucial role in discovery, especially in the current AI era, by enabling Responsible AI development, FAIR principles, Reproducibility, and User Steering. However, the heterogeneous nature of science poses challenges such as dealing with multiple supporting tools, cross-facility...
The use and reuse of scientific data is ultimately dependent on the ability to understand what those data represent, how they were captured, and how they can be used. In many ways, data are only as useful as the metadata available to describe them. Unfortunately, due to growing data volumes, large distributed collaborations, and a desire to store data for long periods of time, scientific "data lakes" quickly become disorganized and lack the metadata necessary for researchers. New automated approaches are needed to derive metadata from files and to use these metadata for organization and discovery. Here we describe one such...
To mitigate the effects of high-velocity data expansion and to automate the organization of filesystems and data repositories, we have developed Skluma, a system that automatically processes a target filesystem or repository, extracts content- and context-based metadata, and organizes the extracted metadata for subsequent use. Skluma is able to extract diverse metadata, including aggregate values derived from embedded structured data; named entities and latent topics buried within free-text documents; and content encoded in images...
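The following hypothetical sketch illustrates the per-file extractor dispatch pattern this abstract describes: inspect a file's type, run the matching extractor, and collect the resulting metadata. The extractor names and return shapes are assumptions for illustration, not Skluma's actual code.

```python
# Hypothetical per-file extractor dispatch in the style Skluma describes.
import json
from pathlib import Path

def extract_structured(path):
    """Aggregate simple metadata from embedded structured data (JSON here)."""
    try:
        data = json.loads(path.read_text())
    except ValueError:
        return {}
    return {"type": "structured", "keys": sorted(data)} if isinstance(data, dict) else {}

def extract_freetext(path):
    """Crude stand-in for named-entity / topic extraction from free text."""
    words = path.read_text(errors="ignore").split()
    return {"type": "freetext", "n_words": len(words)}

EXTRACTORS = {".json": extract_structured, ".txt": extract_freetext}

def process_repository(root):
    """Walk a target filesystem and collect per-file metadata records."""
    records = {}
    for path in Path(root).rglob("*"):
        extractor = EXTRACTORS.get(path.suffix)
        if path.is_file() and extractor:
            records[str(path)] = extractor(path)
    return records
```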
We introduce Xtract, an automated and scalable system for bulk metadata extraction from large, distributed research data repositories. Xtract orchestrates the application of metadata extractors to groups of files, determining which extractors to apply to each file and, for each extractor and file, where to execute it. A hybrid computing model, built on the funcX federated FaaS platform, enables Xtract to balance tradeoffs between extraction time and data transfer costs by dispatching each task to the most appropriate location. Experiments on a range of clouds and supercomputers show that Xtract can...
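Below is a hypothetical sketch of the time-versus-transfer tradeoff that Xtract's hybrid model balances: estimate, for each candidate site, the cost of moving the file plus the cost of running the extractor there, and dispatch to the cheapest. The cost model and site parameters are illustrative assumptions, not Xtract's scheduler.

```python
# Hypothetical placement cost model; numbers and fields are illustrative.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    bandwidth_mbps: float   # effective transfer rate to this site
    compute_rate: float     # extraction throughput (MB/s) at this site
    holds_data: bool        # True if the file already resides here

def estimated_cost(site, file_mb):
    """Transfer time (zero if data is local) plus extraction time, seconds."""
    transfer = 0.0 if site.holds_data else file_mb * 8 / site.bandwidth_mbps
    return transfer + file_mb / site.compute_rate

def place_task(sites, file_mb):
    """Dispatch the extraction task to the cheapest site."""
    return min(sites, key=lambda s: estimated_cost(s, file_mb))

sites = [
    Site("edge-storage", bandwidth_mbps=1000, compute_rate=5, holds_data=True),
    Site("cloud", bandwidth_mbps=100, compute_rate=50, holds_data=False),
]
# For a 200 MB file, shipping to the faster cloud beats computing in place.
print(place_task(sites, file_mb=200).name)  # -> cloud
```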
Scientists' capacity to make use of existing data is predicated on their ability to find and understand those data. While significant progress has been made with respect to data publication, and indeed one can point to a number of well-organized, highly utilized repositories, there remain many such repositories in which archived data are poorly described and thus impossible to use. We present Skluma, an automated system designed to process vast amounts of data and extract deeply embedded metadata, latent topics, and relationships between...
Scientific workflows have become integral tools in broad scientific computing use cases. Science discovery is increasingly dependent on workflows to orchestrate large and complex experiments that range from the execution of a cloud-based data preprocessing pipeline to multi-facility instrument-to-edge-to-HPC computational workflows. Given the changing landscape of scientific computing and the evolving needs of emerging applications, it is paramount that the development of novel workflow and system functionalities seek to increase efficiency, resilience, and pervasiveness...
FAIR principles require that scientific data be findable, accessible, interoperable, and reusable. To enable FAIRness, practitioners of a science repository will often construct a rich, searchable index of metadata derived from the data. Unfortunately, manual annotation methods do not scale to the many files generated by large projects; instead, automated extraction systems are needed to scalably parse these files (often with nonstandard schemas requiring specialized parsing strategies) and deposit representative metadata into...
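As a hedged illustration of the "deposit into a searchable index" step, the sketch below uses a plain inverted index in place of whatever search service a production repository would run; the field names and the deposit/search helpers are assumptions.

```python
# Hypothetical metadata deposit into an inverted index; a real repository
# would use a proper search service, this only shows the pattern.
from collections import defaultdict

def deposit(index, file_id, metadata):
    """Add one file's extracted metadata terms to the inverted index."""
    for field, value in metadata.items():
        index[f"{field}:{value}".lower()].add(file_id)

def search(index, term):
    """Return the set of files whose metadata matched the term."""
    return index.get(term.lower(), set())

index = defaultdict(set)
deposit(index, "run42.nc", {"variable": "temperature", "format": "netcdf"})
deposit(index, "run43.csv", {"variable": "salinity", "format": "csv"})
print(search(index, "variable:temperature"))  # -> {'run42.nc'}
```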
The rapid generation of data from distributed IoT devices, scientific instruments, and compute clusters presents unique data management challenges. The influx of large, heterogeneous, and complex data causes repositories to become siloed or generally unsearchable, two problems not currently well addressed by file systems. In this work, we propose Xtract, a serverless middleware to extract metadata from files spread across heterogeneous edge computing resources. In my future work, I intend to study how Xtract can automatically...
The advancement of science is increasingly intertwined with complex computational processes [1]. Scientific workflows are at the heart of this evolution, acting as essential orchestrators for a vast range of experiments. Specifically, these workflows are central to the field of Earth Sciences, where they orchestrate diverse activities, from cloud-based data preprocessing pipelines in environmental modeling to intricate multi-facility instrument-to-edge-to-HPC frameworks for seismic analysis and geophysical simulations [2]...
Many extreme-scale applications require the movement of large quantities of data to, from, and among leadership computing facilities, as well as other scientific facilities and the home institutions of facility users. These applications, particularly when leadership computing facilities are involved, can touch upon edge cases (e.g., terabyte-sized files) that had not been a focus of previous Globus optimization work, which emphasized rather the movement of many smaller (megabyte- to gigabyte-sized) files. We report here on how automated client-driven chunking can be used...
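A minimal sketch of the chunking idea follows, under the assumption that a very large file is split into fixed-size byte ranges that become independent, individually retryable transfer tasks; the chunk size and task representation are illustrative, not the Globus implementation.

```python
# Hypothetical client-driven chunking: split a large file into byte
# ranges that can be transferred as parallel, retryable tasks.
def chunk_ranges(file_size, chunk_size=64 * 2**30):  # assume 64 GiB chunks
    """Yield (offset, length) pairs covering the whole file."""
    offset = 0
    while offset < file_size:
        yield offset, min(chunk_size, file_size - offset)
        offset += chunk_size

# A 1 TiB file becomes 16 independent 64 GiB transfer tasks that can be
# dispatched concurrently and retried individually on failure.
tasks = list(chunk_ranges(2**40))
print(len(tasks))  # -> 16
```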
The Workflows Community Summit gathered 111 participants from 18 countries to discuss emerging trends and challenges in scientific workflows, focusing on six key areas: time-sensitive workflows, AI-HPC convergence, multi-facility workflows, heterogeneous HPC environments, user experience, and FAIR computational workflows. The integration of AI and exascale computing has revolutionized scientific workflows, enabling higher-fidelity models of complex, time-sensitive processes, while introducing challenges in managing heterogeneous environments and multi-facility data dependencies. The rise of large language models is driving...