Mike D’Arcy

ORCID: 0000-0003-2280-917X
Research Areas
  • Scientific Computing and Data Management
  • Research Data Management Practices
  • Topic Modeling
  • Natural Language Processing Techniques
  • Multimodal Machine Learning Applications
  • Distributed and Parallel Computing Systems
  • Biomedical Text Mining and Ontologies
  • Data Quality and Management
  • Robotics and Automated Systems
  • Video Surveillance and Tracking Methods
  • Anomaly Detection Techniques and Applications
  • Advanced Data Storage Technologies
  • Health, Environment, Cognitive Aging
  • Semantic Web and Ontologies
  • Gene expression and cancer classification
  • Domain Adaptation and Few-Shot Learning
  • Expert finding and Q&A systems
  • Social Robot Interaction and HRI
  • Context-Aware Activity Recognition Systems
  • Human Pose and Action Recognition
  • Advanced Neuroimaging Techniques and Applications
  • Data Analysis with R
  • Autonomous Vehicle Technology and Safety
  • Peer-to-Peer Network Technologies
  • Genomics and Rare Diseases

University of Southern California
2011-2024

Northwestern University
2017-2023

Marina Del Rey Hospital
2020

RAND Corporation
2014

Southern California University for Professional Studies
2012

Southern States University
2006

The notion that data should be Findable, Accessible, Interoperable and Reusable, according to the FAIR Principles, has become a global norm for good data stewardship and a prerequisite for reproducibility. Nowadays, FAIR guides data policy actions and professional practices in the public and private sectors. Despite such global endorsements, however, the FAIR Principles are aspirational, remaining elusive at best, and intimidating at worst. To address the lack of practical guidance, and help with capability gaps, we developed the FAIR Cookbook, an open,...

10.1038/s41597-023-02166-3 article EN cc-by Scientific Data 2023-05-19

A unique archive of Big Data on Parkinson's Disease is collected, managed and disseminated by the Parkinson's Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson's disease (PD) risk to trauma, genetics, environment,...

10.1371/journal.pone.0157077 article EN cc-by PLoS ONE 2016-08-05

This article describes the patient-centered Scalable National Network for Effectiveness Research (pSCANNER), which is part of the recently formed PCORnet, a national network composed of learning healthcare systems and patient-powered research networks funded by the Patient Centered Outcomes Research Institute (PCORI). It is designed to be a stakeholder-governed federated network that uses a distributed architecture to integrate data from three existing networks covering over 21 million patients in all 50 states: (1) VA Informatics...

10.1136/amiajnl-2014-002751 article EN cc-by-nc Journal of the American Medical Informatics Association 2014-04-30

Learned representations of scientific documents can serve as valuable input features for downstream tasks without further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 24 challenging and realistic tasks, 8 of which are new, across four formats: classification, regression, ranking and search. We then use this benchmark to study and improve...

10.18653/v1/2023.emnlp-main.338 article EN cc-by Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 2023-01-01

Commonsense reasoning is a critical AI capability, but it is difficult to construct challenging datasets that test common sense. Recent neural question answering systems, based on large pre-trained models of language, have already achieved near-human-level performance on commonsense knowledge benchmarks. These systems do not possess human-level common sense, but are able to exploit limitations of the datasets to achieve human-level scores. We introduce the CODAH dataset, an adversarially-constructed evaluation dataset for testing common sense. CODAH forms...

10.18653/v1/w19-2008 article EN 2019-01-01

The Globus Toolkit Monitoring and Discovery System (MDS4) defines and implements mechanisms for service and resource discovery and monitoring in distributed environments. MDS4 is distinguished from previous similar systems by its extensive use of interfaces and behaviors defined in the WS-Resource Framework and WS-Notification specifications, and by its deep integration into essentially every component of the Globus Toolkit. We describe the MDS4 architecture and the Web service interfaces and behaviors that allow users to discover resources and services, monitor resource and service states, receive updates on...

10.1088/1742-6596/46/1/072 article EN Journal of Physics Conference Series 2006-09-01

Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and permitting errors of omission and commission because dataset members are not...

10.1109/bigdata.2016.7840618 article EN 2016 IEEE International Conference on Big Data (Big Data) 2016-12-01

Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how...

10.1371/journal.pone.0213013 article EN cc-by PLoS ONE 2019-04-11

We study the ability of LLMs to generate feedback for scientific papers and develop MARG, a feedback generation approach using multiple LLM instances that engage in internal discussion. By distributing paper text across agents, MARG can consume the full text of papers beyond the input length limitations of the base LLM, and by specializing agents and incorporating sub-tasks tailored to different comment types (experiments, clarity, impact) it improves the helpfulness and specificity of feedback. In a user study, baseline methods using GPT-4 were rated as...

10.48550/arxiv.2401.04259 preprint EN cc-by arXiv (Cornell University) 2024-01-01

The Common Fund Data Ecosystem (CFDE) has created a flexible system of data federation that enables researchers to discover datasets from across the US National Institutes of Health Common Fund without requiring that data owners move, reformat, or rehost those data. This system is centered on a catalog that integrates detailed descriptions of biomedical datasets from individual Common Fund Programs' Data Coordination Centers (DCCs) into a uniform metadata model that can then be indexed and searched from a centralized portal. This Crosscut Metadata Model (C2M2) supports the wide variety...

10.1093/gigascience/giac105 article EN cc-by GigaScience 2022-01-01

Background: Centralized and federated models for sharing data in research networks currently exist. To build multivariate data analysis for centralized networks, transfer of patient-level data to a central computation resource is necessary. The authors implemented distributed multivariate models for federated networks in which patient-level data is kept at each site and data exchange policies are managed in a study-centric manner. Objective: The objective was to implement infrastructure that supports the functionality of some existing research networks (e.g., cohort discovery, workflow management,...

10.1093/jamia/ocv017 article EN cc-by-nc Journal of the American Medical Informatics Association 2015-07-03

Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967...

10.48550/arxiv.2411.14199 preprint EN arXiv (Cornell University) 2024-11-21

We present a novel human-aware navigation approach, where the robot learns to mimic humans to navigate safely in crowds. The presented model, referred to as Deep-MoTIon, is trained with pedestrian surveillance data to predict human velocity in the environment. The robot processes LiDAR scans via the trained network to navigate to the target location. We conduct extensive experiments to assess the components of our network and prove their necessity to imitate humans. Our experiments show that Deep-MoTIon outperforms all the benchmarks in terms of human imitation, achieving a 24% reduction in time...

10.1109/ro-man46459.2019.8956408 article EN 2019-10-01

In production Grids for scientific applications, service and resource failures must be detected and addressed quickly. In this paper, we describe the monitoring infrastructure used by the Earth System Grid (ESG) project, a scientific collaboration that supports global climate research. ESG uses the Globus Toolkit Monitoring and Discovery System (MDS4) to monitor its resources. We describe how the MDS4 Index Service collects information about ESG resources and how the MDS4 Trigger Service checks specified failure conditions and notifies system administrators when failures occur....

10.1109/e-science.2006.102 article EN 2006-12-04

In production Grids for scientific applications, service and resource failures must be detected and addressed quickly. In this paper, we describe the monitoring infrastructure used by the Earth System Grid (ESG) project, a scientific collaboration that supports global climate research. ESG uses the Globus Toolkit Monitoring and Discovery System (MDS4) to monitor its resources. We describe how the MDS4 Index Service collects information about ESG resources and how the MDS4 Trigger Service checks specified failure conditions and notifies system administrators when failures occur....

10.1109/e-science.2006.261153 article EN 2006-12-01

Commonsense reasoning is a critical AI capability, but it is difficult to construct challenging datasets that test common sense. Recent neural question answering systems, based on large pre-trained models of language, have already achieved near-human-level performance on commonsense knowledge benchmarks. These systems do not possess human-level common sense, but are able to exploit limitations of the datasets to achieve human-level scores. We introduce the CODAH dataset, an adversarially-constructed evaluation dataset for testing common sense. CODAH forms...

10.48550/arxiv.1904.04365 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Real-world applications of neural language models often involve running many different models over the same corpus. The high computational cost of these runs has led to interest in techniques that can reuse the contextualized embeddings produced in previous runs to speed training and inference of future ones. We refer to this approach as embedding recycling (ER). While multiple ER techniques have been proposed, their practical effectiveness is still unknown because existing evaluations consider very few models and do not adequately account for...

10.18653/v1/2023.findings-eacl.145 article EN cc-by 2023-01-01

Persistent identifiers (PIDs) are essential for making data Findable, Accessible, Interoperable, and Reusable, or FAIR. While the advantages of PIDs for data publication and citation are well understood, and Digital Object Identifiers (DOIs) are increasingly applied to data, there are two gaps in the current identifier ecosystem: 1) services that provide a consistent baseline of capabilities encompassing key aspects of the research data lifecycle, including canonical landing pages and machine-readable metadata via the same URL; 2) support be...

10.1145/3311790.3396660 article EN Practice and Experience in Advanced Research Computing 2020-07-22

Translational biomedical research is generating exponentially more data: thousands of whole-genome sequences (WGS) are now available; brain data are doubling every two years. Analyses of Big Data, including imaging, genomic, phenotypic, and clinical data, present qualitatively new challenges as well as opportunities. Among the challenges is a proliferation in ways analyses can fail, due largely to the increasing length and complexity of processing pipelines. Anomalies in input data, runtime resource exhaustion or node failure...

10.1101/258822 preprint EN bioRxiv (Cold Spring Harbor Laboratory) 2018-02-02

Database evolution is a notoriously difficult task, and it is exacerbated by the necessity to evolve database-dependent applications. As science becomes increasingly dependent on sophisticated data management, the need to evolve an array of database-driven systems will only intensify. In this paper, we present an architecture for data-centric ecosystems that allows the components to seamlessly co-evolve by centralizing the models and mappings at the data service and pushing model-adaptive interactions to the database clients. Boundary objects fill...

10.1145/3400903.3400908 article EN 2020-07-07

The Common Fund Data Ecosystem (CFDE) has created a flexible system of data federation that enables users to discover datasets from across the U.S. National Institutes of Health Common Fund without requiring that data owners move, reformat, or rehost those data. The CFDE’s federation system is centered on a metadata catalog that ingests metadata from individual Common Fund Program’s Data Coordination Centers (DCCs) into a uniform metadata model that can then be indexed and searched from a centralized portal. This uniform Crosscut Metadata Model (C2M2) supports the wide variety of data types and metadata terms used by...

10.1101/2021.11.05.467504 preprint EN cc-by bioRxiv (Cold Spring Harbor Laboratory) 2021-11-08