- Scientific Computing and Data Management
- Research Data Management Practices
- Semantic Web and Ontologies
- Information Retrieval and Search Behavior
- Distributed and Parallel Computing Systems
- Data Quality and Management
- Topic Modeling
- Advanced Data Storage Technologies
- Natural Language Processing Techniques
- Advanced Text Analysis Techniques
- Web Data Mining and Analysis
- Peer-to-Peer Network Technologies
- Explainable Artificial Intelligence (XAI)
- Information and Cyber Security
- Recommender Systems and Techniques
- Music and Audio Processing
- Meta-Analysis and Systematic Reviews
- Smart Agriculture and AI
- Network Security and Intrusion Detection
- Digital Humanities and Scholarship
- Biomedical Text Mining and Ontologies
- Wikis in Education and Collaboration
- Library Science and Information Systems
- Data Management and Algorithms
- Advanced Malware Detection Techniques
University of Illinois Urbana-Champaign
2012-2021
National Center for Supercomputing Applications
2017-2020
University of Notre Dame
2020
University of Illinois System
2013-2016
University of North Carolina at Chapel Hill
2011-2014
The proliferation of discipline‐specific metadata schemes contributes to artificial barriers that can impede interdisciplinary and transdisciplinary research. The authors considered this problem by examining the domains, objectives, and architectures of nine metadata schemes used to document scientific data in the physical, life, and social sciences. They performed a mixed‐methods content analysis guided by Greenberg's metadata objectives, principles, domains, and architectural layout (MODAL) framework, and derived 22 metadata‐related goals from textual...
The Transparent Research Object Vocabulary (TROV) is a key element of the Transparency Certified (TRACE) approach to ensuring research trustworthiness. In contrast with methods that entail repeating computations in part or in full to verify that the descriptions included in a publication are sufficient to reproduce reported results, TRACE depends on a controlled computing environment, termed a TRACE System (TRS), to guarantee that accurate, sufficiently complete, and otherwise trustworthy records are captured when results are first...
The Transportation Energy Resources from Renewable Agriculture Phenotyping Reference Platform (TERRA-REF) provides a data and computation pipeline responsible for collecting, transferring, processing, and distributing large volumes of crop sensing and genomic data from genetically informative germplasm sets. The primary source of these data is a field scanner system built over an experimental field at the University of Arizona's Maricopa Agricultural Center. The scanner uses several different sensors to observe the field at a dense collection frequency with...
This article distills findings from a qualitative study of seven reproducibility initiatives to enumerate nine key decision points for journals seeking to address concerns about the quality and rigor of computational research by expanding the peer review and publication process. We evaluate our guidance in light of the recent National Academies of Sciences, Engineering, and Medicine (NASEM, 2019) report on Reproducibility and Replicability in Science and its recommendation of journal audits. We present 10 recommendations that clarify how to contend with...
ABSTRACT To realize the great potential value of large‐scale digital libraries, we need a fuller understanding of the range of ways in which scholarly communities conduct research, or want to conduct research, within them. Scholars build collections in the course of their work. How can we anticipate and support the various kinds of collection‐building and ‐use, given the diversity of researchers who work with libraries of books? This paper reports selected results of a study of how user groups of the HathiTrust Digital Library create and use collections in research. It aims to contribute...
Abstract Editor's Summary: HIVE (Helping Interdisciplinary Vocabulary Engineering) is an effort to automatically generate metadata for content, drawing descriptor terms from multiple vocabularies encoded as Simple Knowledge Organization Systems (SKOS). The HIVE approach is a response to the challenges of interoperability, cost, and usability of the terminology sets often needed to adequately describe digital resources. By offering access to more than one vocabulary, each with useful descriptors for a broad domain, HIVE enables aggregating...
Relationships between terms and features are an essential component of thesauri, ontologies, and a range of controlled vocabularies. In this article, we describe ways to identify important concepts in documents using the relationships in a thesaurus or other vocabulary structures. We introduce a methodology for the analysis and modeling of the indexing process based on a weighted random walk algorithm. The primary goal of this research is a contribution to understanding the role of vocabulary structure in the indexing process. The resulting models are evaluated in the context of automatic subject indexing with four...
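The idea of ranking concepts by walking a weighted vocabulary graph can be sketched as follows. This is a minimal illustration, not the authors' method: the thesaurus fragment, edge weights, and restart probability are all hypothetical, and term importance is estimated as visit frequency of a walk with restarts (in the spirit of personalized PageRank).

```python
import random
from collections import Counter

# Hypothetical thesaurus fragment: term -> [(related term, edge weight)].
# Weights might encode relationship strength (e.g., BT/NT vs. RT links).
THESAURUS = {
    "indexing":     [("metadata", 2.0), ("thesauri", 1.0)],
    "metadata":     [("indexing", 2.0), ("vocabularies", 1.0)],
    "thesauri":     [("vocabularies", 2.0), ("indexing", 1.0)],
    "vocabularies": [("thesauri", 2.0), ("metadata", 1.0)],
}

def weighted_random_walk(graph, start, steps=10000, restart=0.15, seed=42):
    """Estimate term importance as visit frequency of a weighted walk
    that restarts at the document's seed concept."""
    rng = random.Random(seed)
    visits = Counter()
    node = start
    for _ in range(steps):
        visits[node] += 1
        if rng.random() < restart or not graph.get(node):
            node = start  # jump back to the seed concept
            continue
        neighbors, weights = zip(*graph[node])
        node = rng.choices(neighbors, weights=weights, k=1)[0]
    total = sum(visits.values())
    return {term: count / total for term, count in visits.items()}

scores = weighted_random_walk(THESAURUS, "indexing")
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because the walk restarts at the seed concept, terms close to it in the graph accumulate more visits, which is what makes structure (and not just term frequency) inform the ranking.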
We present and define a structured digital object, called a "Tale," for the dissemination and publication of computational scientific findings in the scholarly record. The Tale emerges from the NSF-funded Whole Tale project (wholetale.org), which is developing an environment designed to capture the entire pipeline associated with a computational experiment and thereby enable reproducibility. A Tale allows researchers to create a package of the code, data, and information about the workflow necessary to support, review, and recreate the results reported in published research....
The assignment of subject metadata to music is useful for organizing and accessing digital collections. Since manual annotation of large-scale collections is labor-intensive, automatic methods are preferred. Topic modeling algorithms can be used to automatically identify latent topics from appropriate text sources. Candidate text sources such as song lyrics are often too poetic, resulting in lower-quality topics. Users' interpretations provide an alternative source. In this paper, we propose a topic discovery...
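To make the topic-modeling step concrete, here is a self-contained toy latent Dirichlet allocation (LDA) sampler run over a few hypothetical "interpretation" snippets. This is a sketch of the general technique, not the paper's pipeline; the corpus, hyperparameters, and topic count are all illustrative.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k=2, iters=150, alpha=0.1, beta=0.01, seed=7):
    """Tiny collapsed Gibbs sampler for LDA; returns top 3 words per topic."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    ndk = [[0] * k for _ in docs]               # document-topic counts
    nkw = [defaultdict(int) for _ in range(k)]  # topic-word counts
    nk = [0] * k                                # tokens per topic
    z = []                                      # per-token topic assignments
    for di, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(k)
            zd.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # full conditional: p(t) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ)
                p = [(ndk[di][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                     for j in range(k)]
                t = rng.choices(range(k), weights=p, k=1)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return [sorted(nkw[j], key=nkw[j].get, reverse=True)[:3] for j in range(k)]

# Hypothetical user-interpretation snippets for two loose themes.
docs = [
    "love heart kiss love heart".split(),
    "heart love kiss kiss".split(),
    "war protest freedom war".split(),
    "freedom protest war protest".split(),
]
topics = lda_gibbs(docs, k=2)
```

In practice one would run a mature implementation over thousands of documents; the point of the sketch is only that latent topics emerge from word co-occurrence, which is why overly poetic sources yield lower-quality topics.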
Purpose – The purpose of this paper is to examine the effect of the Helping Interdisciplinary Vocabulary Engineering (HIVE) system on the inter-indexer consistency of information professionals when assigning keywords to a scientific abstract. This study examined, first, potential HIVE users; second, the impact HIVE had on consistency; and third, the challenges associated with using HIVE. Design/methodology/approach – A within-subjects quasi-experimental research design was used for this study. Data were collected using a task-scenario...
The Rocker Project provides widely used Docker images for R across different application scenarios. This article surveys downstream projects that build upon the Rocker images and presents the current state of packages for managing and controlling containers. These use cases cover diverse topics such as package development, reproducible research, collaborative work, cloud-based data processing, and production deployment of services. The variety of applications demonstrates the power of Rocker specifically and of containerisation in general. Across these ways...
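A typical reproducible-research use of the Rocker images is to pin an R version and dependencies in a Dockerfile. The sketch below is illustrative only: the image tag, package list, and script name are assumptions, not taken from the article.

```dockerfile
# Minimal sketch: pin the R version by extending a Rocker base image.
FROM rocker/r-ver:4.3.1

# install2.r is a helper script shipped in Rocker images;
# --error makes the build fail if a package does not install.
RUN install2.r --error dplyr ggplot2

# Hypothetical analysis script baked into the image for reproducibility.
COPY analysis.R /home/analysis.R
CMD ["Rscript", "/home/analysis.R"]
```

Building and running this image gives collaborators the same R version and package set regardless of their host setup, which is the core of the reproducible-research use case the article surveys.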
Abstract Searching large collections of digitized books is a relatively new area in information‐seeking and retrieval research, made possible by initiatives such as Google Books and the HathiTrust Digital Library. The availability of full‐text books is transforming how users search for and interact with the information in books, but the characteristics of these changes are unknown. This paper aims to provide insight into searches of a book collection as a first step in a broader research agenda intended to improve book retrieval. To better understand...
The growing size of high-value, sensor-borne or computationally derived scientific datasets is pushing the boundaries of traditional models of data access and discovery. Due to their size, these datasets are often accessible only through the systems on which they were created. Access for exploration and reproducibility is limited: transferring the files and applying the software used to store and generate the original data is often infeasible. There is a trend toward providing access to large-scale research data in-place via container-based analysis environments. This paper...
Many data packaging standards are available to researchers and repository operators, and the choice between using an existing standard or creating a new one is challenging. We introduce the DataONE Data Package standard, which is based on the OAI-ORE Resource Map standard. We describe the functionality it provides and implementation considerations, compare it with other standards, and discuss future extensions, including the ability to capture execution environments via WholeTale "Tales" and alternate serialization formats.
Research has shown that automatic subject indexing is more efficient and consistent than manual indexing; yet many organizations continue to use manual indexing because of the unacceptable quality of automatically produced results. This poster presents the results of an exploratory experiment examining the indexing consistency stemming from a machine-aided approach. The HIVE vocabulary server was used to present concepts to 31 workshop participants. The presentation of terms in sequence reduced indexer burden and contributed to increased...
Entity-centric document filtering is the task of analyzing a time-ordered stream of documents and emitting those that are relevant to a specified set of entities (e.g., people, places, organizations). This task, exemplified by the TREC Knowledge Base Acceleration (KBA) track, has broad applicability in other modern IR settings. In this paper, we present a simple yet effective approach based on learning high-quality Boolean queries that can be applied deterministically during filtering. We call these statements...
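Deterministic filtering with learned Boolean queries can be sketched as follows. The entity names, query clauses, and documents below are hypothetical stand-ins (the learning step that would produce the clauses is omitted); each query is an OR of ANDs over surface-form terms.

```python
import re

# Hypothetical learned Boolean queries: each entity maps to a disjunction
# of conjunctions (an OR of ANDs) over lowercase terms.
QUERIES = {
    "Boris_Berezovsky_(businessman)": [{"berezovsky", "russian"},
                                       {"berezovsky", "oligarch"}],
    "Boris_Berezovsky_(pianist)":     [{"berezovsky", "piano"}],
}

def tokenize(text):
    """Lowercase bag-of-words tokenization."""
    return set(re.findall(r"[a-z]+", text.lower()))

def filter_stream(stream, queries):
    """Emit (entity, doc) pairs whenever any conjunction of an entity's
    query is fully contained in the document's token set."""
    for doc in stream:
        tokens = tokenize(doc)
        for entity, clauses in queries.items():
            if any(clause <= tokens for clause in clauses):
                yield entity, doc

docs = [
    "Russian oligarch Berezovsky found dead",
    "Berezovsky performs a piano concerto",
    "Unrelated market news",
]
matches = list(filter_stream(docs, QUERIES))
```

Because matching is pure set containment, the filter is deterministic and cheap enough to apply to every document in a high-volume stream, which is the property the abstract emphasizes.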
In this paper we describe our experience adopting the Research Object Bundle (RO-Bundle) format with BagIt serialization (BagIt-RO) for the design and implementation of "tales" in the Whole Tale platform. A tale is an executable research object intended for the dissemination of computational scientific findings that captures the information needed to facilitate understanding, transparency, re-execution for review, and reproducibility at the time of publication. We describe the platform requirements that led to the adoption of BagIt-RO, the specifics...
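The BagIt side of this serialization amounts to a conventional directory layout: a `bagit.txt` declaration, a `data/` payload, and a checksum manifest. The sketch below writes only that minimal skeleton, as an illustration of the format rather than of the Whole Tale implementation (real BagIt-RO tales also carry an RO manifest and other tag files not shown here).

```python
import hashlib
from pathlib import Path
from tempfile import TemporaryDirectory

def make_minimal_bag(bag_dir, payload):
    """Write a minimal BagIt bag: bagit.txt, data/ payload files,
    and a manifest-sha256.txt listing a checksum per payload file."""
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    manifest_lines = []
    for name, content in payload.items():
        (bag / "data" / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
    return bag

with TemporaryDirectory() as tmp:
    bag = make_minimal_bag(tmp, {"results.csv": b"x,y\n1,2\n"})
    manifest = (bag / "manifest-sha256.txt").read_text()
```

The manifest is what makes the bag verifiable on receipt: a consumer recomputes each checksum and compares it against the recorded value before trusting or re-executing the contents.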
Research Objects have the potential to significantly enhance the reproducibility of scientific research. One important way they can do this is by encapsulating the means for re-executing the computational components of studies, thus supporting a new form of reproducibility enabled by digital computing: exact repeatability. However, Research Objects can also make research more reproducible by supporting transparency, a component orthogonal to re-executability. We describe here our vision for making research objects transparent by providing means for disambiguating claims about reproducibility generally, and repeatability...
The CHEESE project supplements and enhances traditional cybersecurity education with hands-on, practical experience of common security flaws and their solutions. It requires only a web browser, allowing users to develop skills without compromising their own computers or spending hours setting up a complex virtual machine (VM) sandbox environment. In this tutorial we will conduct a hands-on walkthrough of a couple of demonstrations on CHEESE and present an overview of the platform and its community-driven contribution and development process.
This work takes an in-depth look at the factors that affect manual classifications of 'temporally sensitive' information needs. We use qualitative and quantitative techniques to analyze 660 topics from the Text Retrieval Conference (TREC) previously used in the experimental evaluation of temporal retrieval models. Regression analysis is used to identify factors influencing previous classifications. We explore potential problems with the classifications, considering principles and guidelines for future work on...