NFDI4DS | UHH-SEMS - Publication Details

The FAIR Cookbook - the essential resource for and by FAIR doers

OPENALEX - Publications

Philippe Rocca‐Serra Wei Gu Vassilios Ioannidis Tooba Abbassi‐Daloii Salvador Capella-Gutiérrez and 57 more

The notion that data should be Findable, Accessible, Interoperable and Reusable, according to the FAIR Principles, has become a global norm for good data stewardship and a prerequisite for reproducibility. Nowadays, FAIR guides policy actions and professional practices in the public and private sectors. Despite such endorsements, however, the FAIR Principles are aspirational, remaining elusive at best and intimidating at worst. To address the lack of practical guidance, and to help with capability gaps, we developed the FAIR Cookbook, an open,...

10.1038/s41597-023-02166-3 article EN cc-by Scientific Data 2023-05-19

Predictive Big Data Analytics: A Study of Parkinson’s Disease Using Large, Complex, Heterogeneous, Incongruent, Multi-Source and Incomplete Observations

OPENALEX - Publications

Ivo D. Dinov Ben Heavner Ming Tang Gustavo Glusman Kyle Chard and 15 more

A unique archive of Big Data on Parkinson's Disease is collected, managed and disseminated by the Parkinson's Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson's disease (PD) risk to trauma, genetics, environment,...

10.1371/journal.pone.0157077 article EN cc-by PLoS ONE 2016-08-05

Neuroanatomical morphometric characterization of sex differences in youth using statistical learning

OPENALEX - Publications

Farshid Sepehrband Kirsten M. Lynch Ryan P. Cabeen Clio González-Zacarías Lu Zhao and 6 more

10.1016/j.neuroimage.2018.01.065 article EN publisher-specific-oa NeuroImage 2018-02-03

pSCANNER: patient-centered Scalable National Network for Effectiveness Research

OPENALEX - Publications

Lucila Ohno‐Machado Zia Agha Douglas S. Bell Lisa Dahm Michele E. Day and 58 more

This article describes the patient-centered Scalable National Network for Effectiveness Research (pSCANNER), which is part of the recently formed PCORnet, a national network composed of learning healthcare systems and patient-powered research networks funded by the Patient Centered Outcomes Research Institute (PCORI). It is designed to be a stakeholder-governed federated network that uses a distributed architecture to integrate data from three existing networks covering over 21 million patients in all 50 states: (1) VA Informatics...

10.1136/amiajnl-2014-002751 article EN cc-by-nc Journal of the American Medical Informatics Association 2014-04-30

SciRepEval: A Multi-Format Benchmark for Scientific Document Representations

OPENALEX - Publications

Amanpreet Singh Mike D’Arcy Arman Cohan Doug Downey Sergey Feldman

Learned representations of scientific documents can serve as valuable input features for downstream tasks without further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 24 challenging and realistic tasks, 8 of which are new, across four formats: classification, regression, ranking and search. We then use this benchmark to study and improve...

10.18653/v1/2023.emnlp-main.338 article EN cc-by Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 2023-01-01

OPENALEX - Publications

Michael Chen Mike D’Arcy Alisa Liu Jared Fernandez Doug Downey

Commonsense reasoning is a critical AI capability, but it is difficult to construct challenging datasets that test common sense. Recent neural question answering systems, based on large pre-trained models of language, have already achieved near-human-level performance on commonsense knowledge benchmarks. These systems do not possess human-level common sense, but are able to exploit limitations of the datasets to achieve human-level scores. We introduce the CODAH dataset, an adversarially-constructed evaluation dataset for testing common sense. CODAH forms...

10.18653/v1/w19-2008 article EN 2019-01-01

Monitoring the grid with the Globus Toolkit MDS4

OPENALEX - Publications

Jennifer M. Schopf Laura Pearlman Neill Miller Carl Kesselman Ian Foster and 2 more

The Globus Toolkit Monitoring and Discovery System (MDS4) defines and implements mechanisms for service and resource discovery and monitoring in distributed environments. MDS4 is distinguished from previous similar systems by its extensive use of interfaces and behaviors defined in the WS-Resource Framework and WS-Notification specifications, and by its deep integration into essentially every component of the Globus Toolkit. We describe the MDS4 architecture and the Web service interfaces and behaviors that allow users to discover resources and services, monitor resource and service states, receive updates on...

10.1088/1742-6596/46/1/072 article EN Journal of Physics Conference Series 2006-09-01

I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets

OPENALEX - Publications

Kyle Chard Mike D’Arcy Ben Heavner Ian Foster Carl Kesselman and 9 more

Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and permitting errors of omission and commission because dataset members are not...

10.1109/bigdata.2016.7840618 article EN 2016 IEEE International Conference on Big Data (Big Data) 2016-12-01

Reproducible big data science: A case study in continuous FAIRness

OPENALEX - Publications

Ravi Madduri Kyle Chard Mike D’Arcy Segun Jung Alexis Rodriguez and 10 more

Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how...

10.1371/journal.pone.0213013 article EN cc-by PLoS ONE 2019-04-11

MARG: Multi-Agent Review Generation for Scientific Papers

OPENALEX - Publications

Mike D’Arcy Tom Hope Larry Birnbaum Doug Downey

We study the ability of LLMs to generate feedback for scientific papers and develop MARG, a feedback generation approach using multiple LLM instances that engage in internal discussion. By distributing paper text across agents, MARG can consume the full text of papers beyond the input length limitations of the base LLM, and by specializing agents and incorporating sub-tasks tailored to different comment types (experiments, clarity, impact) it improves the helpfulness and specificity of feedback. In a user study, baseline methods using GPT-4 were rated as...

10.48550/arxiv.2401.04259 preprint EN cc-by arXiv (Cornell University) 2024-01-01

ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews

OPENALEX - Publications

Mike D’Arcy Alexis Ross Erin Bransom Bailey Kuehl Jonathan Bragg and 2 more

10.18653/v1/2024.acl-long.377 article EN 2024-01-01

Making Common Fund data more findable: catalyzing a data ecosystem

OPENALEX - Publications

Amanda Charbonneau Arthur Brady Karl Czajkowski Jain Aluvathingal Saranya Canchi and 37 more

The Common Fund Data Ecosystem (CFDE) has created a flexible system of data federation that enables researchers to discover datasets from across the US National Institutes of Health Common Fund without requiring that data owners move, reformat, or rehost those data. This system is centered on a catalog that integrates detailed descriptions of biomedical datasets from individual Common Fund Programs' Data Coordination Centers (DCCs) into a uniform metadata model that can then be indexed and searched from a centralized portal. This Crosscut Metadata Model (C2M2) supports the wide variety...

10.1093/gigascience/giac105 article EN cc-by GigaScience 2022-01-01

A system to build distributed multivariate models and manage disparate data sharing policies: implementation in the scalable national network for effectiveness research

OPENALEX - Publications

Daniella Meeker Xiaoqian Jiang Michael E. Matheny Claudiu Farcas Mike D’Arcy and 11 more

Background: Centralized and federated models for sharing data in research networks currently exist. To build multivariate data analysis for centralized networks, transfer of patient-level data to a central computation resource is necessary. The authors implemented distributed multivariate models for federated networks in which patient-level data is kept at each site and data exchange policies are managed in a study-centric manner. Objective: The objective was to implement infrastructure that supports the functionality of some existing research networks (e.g., cohort discovery, workflow management,...

10.1093/jamia/ocv017 article EN cc-by-nc Journal of the American Medical Informatics Association 2015-07-03

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

OPENALEX - Publications

Akari Asai Jacqueline He Rulin Shao Weijia Shi Amanpreet Singh and 20 more

Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967...

10.48550/arxiv.2411.14199 preprint EN arXiv (Cornell University) 2024-11-21

DeepMoTIon: Learning to Navigate Like Humans

OPENALEX - Publications

Mahmoud Hamandi Mike D’Arcy Pooyan Fazli

We present a novel human-aware navigation approach, where the robot learns to mimic humans to navigate safely in crowds. The presented model, referred to as DeepMoTIon, is trained with pedestrian surveillance data to predict human velocity in the environment. The robot processes LiDAR scans via the trained network to navigate to the target location. We conduct extensive experiments to assess the components of our network and prove their necessity to imitate humans. Our experiments show that DeepMoTIon outperforms all the benchmarks in terms of human imitation, achieving a 24% reduction in time...

10.1109/ro-man46459.2019.8956408 article EN 2019-10-01

Monitoring the Earth System Grid with MDS4

OPENALEX - Publications

Ann Chervenak Jennifer M. Schopf Laura Pearlman Mei-Hui Su S. Bharathi and 4 more

In production Grids for scientific applications, service and resource failures must be detected and addressed quickly. In this paper, we describe the monitoring infrastructure used by the Earth System Grid (ESG) project, a scientific collaboration that supports global climate research. ESG uses the Globus Toolkit Monitoring and Discovery System (MDS4) to monitor its resources. We describe how the MDS4 Index Service collects information about ESG resources and how the MDS4 Trigger Service checks specified failure conditions and notifies system administrators when failures occur....

10.1109/e-science.2006.102 article EN 2006-12-04

Monitoring the Earth System Grid with MDS4

OPENALEX - Publications

Ann Chervenak Jennifer M. Schopf Laura Pearlman Mei-Hui Su S. Bharathi and 4 more

In production Grids for scientific applications, service and resource failures must be detected and addressed quickly. In this paper, we describe the monitoring infrastructure used by the Earth System Grid (ESG) project, a scientific collaboration that supports global climate research. ESG uses the Globus Toolkit Monitoring and Discovery System (MDS4) to monitor its resources. We describe how the MDS4 Index Service collects information about ESG resources and how the MDS4 Trigger Service checks specified failure conditions and notifies system administrators when failures occur....

10.1109/e-science.2006.261153 article EN 2006-12-01

CODAH: An Adversarially Authored Question-Answer Dataset for Common Sense

OPENALEX - Publications

Michael Chen Mike D’Arcy Alisa Liu Jared Fernandez Doug Downey

Commonsense reasoning is a critical AI capability, but it is difficult to construct challenging datasets that test common sense. Recent neural question answering systems, based on large pre-trained models of language, have already achieved near-human-level performance on commonsense knowledge benchmarks. These systems do not possess human-level common sense, but are able to exploit limitations of the datasets to achieve human-level scores. We introduce the CODAH dataset, an adversarially-constructed evaluation dataset for testing common sense. CODAH forms...

10.48550/arxiv.1904.04365 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Embedding Recycling for Language Models

OPENALEX - Publications

Jon Saad-Falcon Amanpreet Singh Luca Soldaini Mike D’Arcy Arman Cohan and 1 more

Real-world applications of neural language models often involve running many different models over the same corpus. The high computational cost of these runs has led to interest in techniques that can reuse the contextualized embeddings produced in previous runs to speed training and inference of future ones. We refer to this approach as embedding recycling (ER). While multiple ER techniques have been proposed, their practical effectiveness is still unknown because existing evaluations consider very few models and do not adequately account for...

10.18653/v1/2023.findings-eacl.145 article EN cc-by 2023-01-01

An Open Ecosystem for Pervasive Use of Persistent Identifiers

OPENALEX - Publications

Rachana Ananthakrishnan Kyle Chard Mike D’Arcy Ian Foster Carl Kesselman and 5 more

Persistent identifiers (PIDs) are essential for making data Findable, Accessible, Interoperable, and Reusable, or FAIR. While the advantages of PIDs for data publication and citation are well understood, and Digital Object Identifiers (DOIs) are increasingly applied to data, there are two gaps in the current identifier ecosystem: 1) identifier services that provide a consistent baseline of capabilities encompassing key aspects of the research data lifecycle, including canonical landing pages and machine-readable metadata via the same URL; and 2) support for identifiers to be...

10.1145/3311790.3396660 article EN Practice and Experience in Advanced Research Computing 2020-07-22

BDQC: a general-purpose analytics tool for domain-blind validation of Big Data

OPENALEX - Publications

Eric W. Deutsch Roger Kramer Joseph Ames Andrew Bauman David Campbell and 20 more

Translational biomedical research is generating exponentially more data: thousands of whole-genome sequences (WGS) are now available; brain data are doubling every two years. Analyses of Big Data, including imaging, genomic, phenotypic, and clinical data, present qualitatively new challenges as well as opportunities. Among the challenges is a proliferation in ways analyses can fail, due largely to the increasing length and complexity of processing pipelines. Anomalies in input data, runtime resource exhaustion or node failure...

10.1101/258822 preprint EN bioRxiv (Cold Spring Harbor Laboratory) 2018-02-02

Towards Co-Evolution of Data-Centric Ecosystems

OPENALEX - Publications

Robert Schuler Karl Czajkowski Mike D’Arcy Hongsuda Tangmunarunkit Carl Kesselman

Database evolution is a notoriously difficult task, and it is exacerbated by the necessity to evolve database-dependent applications. As science becomes increasingly dependent on sophisticated data management, the need to evolve an array of database-driven systems will only intensify. In this paper, we present an architecture for data-centric ecosystems that allows the components to seamlessly co-evolve by centralizing the models and mappings at the data service and pushing model-adaptive interactions to the database clients. Boundary objects fill...

10.1145/3400903.3400908 article EN 2020-07-07

Making Common Fund data more findable: Catalyzing a Data Ecosystem

OPENALEX - Publications

Amanda Charbonneau Arthur Brady Karl Czajkowski Jain Aluvathingal Saranya Canchi and 32 more

The Common Fund Data Ecosystem (CFDE) has created a flexible system of data federation that enables users to discover datasets from across the U.S. National Institutes of Health Common Fund without requiring that data owners move, reformat, or rehost those data. The CFDE’s federation system is centered on a metadata catalog that ingests metadata from individual Common Fund Program’s Data Coordination Centers (DCCs) into a uniform metadata model that can then be indexed and searched from a centralized portal. This Crosscut Metadata Model (C2M2) supports the wide variety of data types and metadata terms used by...

10.1101/2021.11.05.467504 preprint EN cc-by bioRxiv (Cold Spring Harbor Laboratory) 2021-11-08