Luca Soldaini

ORCID: 0000-0001-6998-9863
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Multimodal Machine Learning Applications
  • Biomedical Text Mining and Ontologies
  • Expert finding and Q&A systems
  • Advanced Text Analysis Techniques
  • Semantic Web and Ontologies
  • Information Retrieval and Search Behavior
  • Mental Health via Writing
  • Sentiment Analysis and Opinion Mining
  • Text Readability and Simplification
  • Ethics and Social Impacts of AI
  • Domain Adaptation and Few-Shot Learning
  • Digital Mental Health Interventions
  • Speech and dialogue systems
  • Data Quality and Management
  • Recommender Systems and Techniques
  • Intelligent Tutoring Systems and Adaptive Learning
  • Data-Driven Disease Surveillance
  • Speech Recognition and Synthesis
  • Machine Learning in Healthcare
  • Data Visualization and Analytics
  • Innovative Human-Technology Interaction
  • Complex Network Analysis Techniques
  • Scientific Computing and Data Management

Allen Institute for Artificial Intelligence
2023-2024

Allen Institute
2022-2024

University of Glasgow
2024

University of Maryland, College Park
2024

Johns Hopkins University
2024

University of Washington
2015-2023

University of California, Berkeley
2022-2023

Northwestern University
2023

Massachusetts Institute of Technology
2023

Yale University
2023

Virtual assistants such as Amazon Alexa, Apple Siri, and Google Assistant often rely on a semantic parsing component to understand which action(s) to execute for an utterance spoken by their users. Traditionally, rule-based or statistical slot-filling systems have been used to parse "simple" queries; that is, queries that contain a single action and can be decomposed into a set of non-overlapping entities. More recently, shift-reduce parsers have been proposed to process more complex utterances. These methods, while...

10.1145/3366423.3380064 preprint EN 2020-04-20
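A "simple" query as described above can be handled by matching an action and decomposing the rest of the utterance into non-overlapping slots. The sketch below is purely illustrative: the action vocabulary, slot names, and regex are hypothetical, not any production assistant's grammar.

```python
import re

# Hypothetical slot-filling grammar: one action, then non-overlapping
# entity spans ("object" and an optional "modifier").
SLOT_PATTERN = re.compile(
    r"^(?P<action>play|set|call)\s+(?P<object>.+?)"
    r"(?:\s+(?:at|on)\s+(?P<modifier>.+))?$"
)

def parse_simple_utterance(utterance: str) -> dict:
    """Return action and slot values, or {} if the query is not 'simple'."""
    m = SLOT_PATTERN.match(utterance.lower().strip())
    if not m:
        return {}
    return {k: v for k, v in m.groupdict().items() if v is not None}
```

An utterance like "Set an alarm at 7am" decomposes into `action=set`, `object=an alarm`, `modifier=7am`; anything outside the toy grammar falls through, which is where shift-reduce parsers for complex utterances come in.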

The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to date, with 200M+ papers, 80M+...

10.48550/arxiv.2301.10140 preprint EN cc-by arXiv (Cornell University) 2023-01-01
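A literature knowledge graph of the kind described above is, at its core, papers as nodes and citations as directed edges. The toy structure below is a minimal sketch with invented field names; it is not the actual S2AG schema or API.

```python
from dataclasses import dataclass, field

@dataclass
class Paper:
    """Illustrative node type; fields are hypothetical, not S2AG's schema."""
    paper_id: str
    title: str
    references: list = field(default_factory=list)  # outgoing citation edges

class AcademicGraph:
    def __init__(self):
        self.papers = {}

    def add_paper(self, paper: Paper):
        self.papers[paper.paper_id] = paper

    def citation_count(self, paper_id: str) -> int:
        # Incoming edges: papers whose reference list contains paper_id.
        return sum(paper_id in p.references for p in self.papers.values())
```

Even this tiny sketch shows why graph construction matters: citation counts, reference traversal, and recommendation all reduce to edge queries over the same structure.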

We present Queer in AI as a case study for community-led participatory design in AI. We examine how queer and intersectional tenets started and shaped this community's programs over the years. We discuss the different challenges that emerged in the process, look at ways the organization has fallen short of operationalizing participatory principles, and then assess the organization's impact. Queer in AI provides important lessons and insights for practitioners and theorists of participatory methods broadly through its rejection of hierarchy in favor of decentralization, and its success in building aid by...

10.1145/3593013.3594134 article EN 2023 ACM Conference on Fairness, Accountability, and Transparency 2023-06-12

Mental health is a significant and growing public health concern. As language usage can be leveraged to obtain crucial insights into mental health conditions, there is a need for large-scale, labeled, mental health-related datasets of users who have been diagnosed with one or more such conditions. In this paper, we investigate the creation of high-precision patterns to identify self-reported diagnoses of nine different mental health conditions, and obtain high-quality labeled data without manual labelling. We introduce SMHD (Self-reported Mental Health Diagnoses)...

10.48550/arxiv.1806.05258 preprint EN cc-by arXiv (Cornell University) 2018-01-01
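The high-precision patterns mentioned above hinge on matching explicit first-person diagnosis statements rather than mere topical mentions. The sketch below illustrates the idea with an invented pattern and condition list; the actual SMHD pattern set and nine conditions differ.

```python
import re

# Illustrative condition list and pattern (NOT the real SMHD resources).
CONDITIONS = ["depression", "anxiety", "ptsd"]
DIAGNOSIS_RE = re.compile(
    r"\bI (?:was|have been|am) diagnosed with (?P<cond>"
    + "|".join(CONDITIONS) + r")\b",
    re.IGNORECASE,
)

def self_reported_conditions(post: str) -> set:
    """Conditions the author explicitly self-reports being diagnosed with."""
    return {m.group("cond").lower() for m in DIAGNOSIS_RE.finditer(post)}
```

Requiring the first-person frame ("I was diagnosed with ...") is what buys precision: third-person mentions and speculative statements ("I might have anxiety") do not match, so matched posts can be labeled without manual review.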

A rapidly growing number of voices argue that AI research, and computer vision in particular, is powering mass surveillance. Yet the direct path from research to surveillance has remained obscured and difficult to assess. Here, we reveal the surveillance pipeline by analyzing three decades of computer vision papers and downstream patents, more than 40,000 documents. We find that the large majority of annotated papers and patents self-report that their technology enables extracting data about humans. Moreover, these technologies specifically enable...

10.48550/arxiv.2309.15084 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Large transformer-based language models have been shown to be very effective in many classification tasks. However, their computational complexity prevents their use in applications requiring the classification of a large set of candidates. While previous works have investigated approaches to reduce model size, relatively little attention has been paid to techniques that improve batch throughput during inference. In this paper, we introduce the Cascade Transformer, a simple yet effective technique to adapt transformer-based models into a cascade of rankers. Each ranker is used to prune...

10.18653/v1/2020.acl-main.504 article EN cc-by 2020-01-01
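The cascade idea can be sketched independently of the transformer details: cheap rankers score everything, and each stage prunes the candidate pool before a more expensive stage runs. This is a minimal sketch, assuming generic scoring functions; the actual Cascade Transformer derives its rankers from a single model's intermediate layers.

```python
def cascade_rank(candidates, scorers, keep_fractions):
    """Score candidates with increasingly expensive rankers, pruning the
    pool after each stage so later (costlier) scorers see fewer items."""
    pool = list(candidates)
    for scorer, keep in zip(scorers, keep_fractions):
        pool.sort(key=scorer, reverse=True)
        pool = pool[: max(1, int(len(pool) * keep))]
    return pool
```

The throughput win comes from the keep fractions: if each stage keeps 10% of its input, the final, most expensive ranker scores only a tiny fraction of the original candidate set.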

Dealing with unjudged documents ("holes") in relevance assessments is a perennial problem when evaluating search systems in offline experiments. Holes can reduce the apparent effectiveness of retrieval systems during evaluation and introduce biases in models trained on incomplete data. In this work, we explore whether large language models can help us fill such holes to improve offline evaluations. We examine an extreme, albeit common, evaluation setting wherein only a single known relevant document per query is available for evaluation. We then...

10.1145/3539618.3592032 article EN Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval 2023-07-18
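Filling holes amounts to completing a partial judgment set: keep human labels where they exist and predict a label for every unjudged document in the ranking. The sketch below uses a generic `predict_relevance` callback as a stand-in for the LLM assessor described above.

```python
def fill_holes(ranking, qrels, predict_relevance):
    """Return a complete judgment dict for `ranking`: known human judgments
    from `qrels`, model predictions for the unjudged 'holes'."""
    return {
        doc: qrels[doc] if doc in qrels else predict_relevance(doc)
        for doc in ranking
    }
```

With the completed judgments, standard metrics (nDCG, MAP) can be computed over the full ranking instead of silently treating every hole as non-relevant.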

Language models have become a critical technology for tackling a wide range of natural language processing tasks, yet many details about how the best-performing language models were developed are not reported. In particular, information about their pretraining corpora is seldom discussed: commercial models rarely provide any details about their data; even open models rarely release the datasets they are trained on, or an exact recipe to reproduce them. As a result, it is challenging to conduct certain threads of language modeling research, such as understanding how training data impacts...

10.48550/arxiv.2402.00159 preprint EN arXiv (Cornell University) 2024-01-31
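A reproducible pretraining-data recipe is ultimately a deterministic sequence of filters over raw documents. The toy pass below (a length filter plus exact deduplication by content hash) is only a sketch of the idea; real curation pipelines like the one described above involve many more stages (language ID, quality filters, fuzzy dedup, decontamination).

```python
import hashlib

def curate(documents, min_length=20):
    """Toy curation pass: drop short documents, then exact-dedup
    by SHA-256 content hash, preserving first occurrences."""
    seen, kept = set(), []
    for doc in documents:
        if len(doc) < min_length:
            continue
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Because every step is a pure function of the input corpus, publishing the filter sequence and thresholds is enough for others to reproduce the dataset exactly.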

In real-world settings, vision language models (VLMs) should robustly handle naturalistic, noisy visual content as well as domain-specific language and concepts. For example, K-12 educators using digital learning platforms may need to examine and provide feedback across many images of students' math work. To assess the potential of VLMs to support educators in settings like this one, we introduce DrawEduMath, an English-language dataset of 2,030 images of students' handwritten responses to K-12 math problems. Teachers provided detailed annotations, including...

10.48550/arxiv.2501.14877 preprint EN arXiv (Cornell University) 2025-01-24

Retrieval systems generally focus on web-style queries that are short and underspecified. However, advances in language models have facilitated the nascent rise of retrieval models that can understand more complex queries with diverse intents. So far, these efforts have focused exclusively on English; therefore, we do not yet know how they work across languages. We introduce mFollowIR, a multilingual benchmark for measuring the instruction-following ability of retrieval models. mFollowIR builds upon the TREC NeuCLIR narratives (or instructions) that span...

10.48550/arxiv.2501.19264 preprint EN arXiv (Cornell University) 2025-01-31
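One way to quantify instruction following in retrieval is to compare where a relevant document ranks with and without the instruction attached to the query. The sketch below computes a per-query reciprocal-rank delta; it is a simplified stand-in, not the benchmark's actual metric.

```python
def rank_of(doc, ranking):
    """1-based rank of `doc` in an ordered result list."""
    return ranking.index(doc) + 1

def instruction_sensitivity(relevant_doc, ranking_plain, ranking_instructed):
    """Positive when the relevant document moves up once the instruction
    is added; zero when the instruction changes nothing."""
    return (1 / rank_of(relevant_doc, ranking_instructed)
            - 1 / rank_of(relevant_doc, ranking_plain))
```

Averaged over queries (and, in the multilingual setting, over languages), this kind of paired comparison separates models that genuinely use the instruction from those that ignore it.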

Modern language models are trained on large, unstructured datasets consisting of trillions of tokens, obtained by crawling the web. Their unstructured nature makes it difficult to reason about their contents and to develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both topic and format. Using these two complementary notions of domains, we automatically annotate...

10.48550/arxiv.2502.10341 preprint EN arXiv (Cornell University) 2025-02-14
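Organizing a corpus along two complementary axes reduces, at its simplest, to cross-tabulating every page's (topic, format) pair. The sketch below assumes generic classifier callbacks in place of the learned annotators described above.

```python
from collections import Counter

def domain_distribution(pages, topic_of, format_of):
    """Cross-tabulate pages along two complementary axes (topic x format).
    `topic_of` / `format_of` stand in for learned domain classifiers."""
    return Counter((topic_of(p), format_of(p)) for p in pages)
```

The resulting counts make corpus composition inspectable: one can see, for instance, how much of the crawl is science tutorials versus sports news, and rebalance curation accordingly.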

Scholars need to keep up with an exponentially increasing flood of scientific papers. To aid this challenge, we introduce Scim, a novel intelligent interface that helps experienced researchers skim – or rapidly review – a paper to attain a cursory understanding of its contents. Scim supports the skimming process by highlighting salient paper contents in order to direct a reader's attention. The system's highlights are faceted by content type, evenly distributed across a paper, and have a density configurable by readers at...

10.1145/3581641.3584034 article EN 2023-03-27
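Two of the highlight properties above, even distribution and configurable density, can be sketched as a selection problem: split the paper into equal windows and keep the most salient sentence per window. This is an illustrative simplification, not Scim's actual selection algorithm.

```python
def pick_highlights(sentences, salience, density):
    """Choose ~`density` fraction of sentences, spread evenly across the
    paper: one window per highlight, most salient sentence per window."""
    k = max(1, round(len(sentences) * density))
    window = len(sentences) / k
    picks = []
    for i in range(k):
        start, end = round(i * window), round((i + 1) * window)
        picks.append(max(range(start, end), key=lambda j: salience[j]))
    return [sentences[j] for j in picks]
```

Raising `density` adds windows (and thus highlights) while keeping them spread across the whole paper, which is exactly the reader-facing control the interface exposes.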

Sean MacAvaney, Bart Desmet, Arman Cohan, Luca Soldaini, Andrew Yates, Ayah Zirikly, Nazli Goharian. Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic. 2018.

10.18653/v1/w18-0618 article EN cc-by 2018-01-01

Language models (LMs) have become ubiquitous in both NLP research and commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details for scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, this technical report details the first release of OLMo, a...

10.48550/arxiv.2402.00838 preprint EN arXiv (Cornell University) 2024-02-01

Large Language Models (LLMs) have enabled new ways to satisfy information needs. Although great strides have been made in applying them to settings like document ranking and short-form text generation, they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of users. In this perspective paper, we draw together opinions from industry and academia, and from a variety of related research areas, to present our vision...

10.1145/3626772.3657846 article EN Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval 2024-07-10

Recent advancements in transformer-based models have greatly improved the ability of Question Answering (QA) systems to provide correct answers; in particular, answer sentence selection (AS2) models, core components of retrieval-based systems, have achieved impressive results. While generally effective, these models fail to provide a satisfying answer when all retrieved candidates are of poor quality, even if they contain correct information. In AS2, models are trained to select the best answer sentence among a set of candidates for a given question. In this work, we propose to generate answers...

10.18653/v1/2021.findings-acl.374 article EN cc-by 2021-01-01
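The contrast between selecting and generating an answer can be sketched as a simple control flow: rank the candidates, return the best one if it scores well enough, otherwise hand the top candidates to a generator. The threshold-based fallback below is an illustrative simplification, not the paper's exact architecture.

```python
def answer(question, candidates, score, generate, threshold=0.5):
    """AS2-style selection with a generative fallback: if even the best
    candidate scores poorly, synthesize an answer from the top candidates."""
    ranked = sorted(candidates, key=lambda c: score(question, c), reverse=True)
    if score(question, ranked[0]) >= threshold:
        return ranked[0]
    return generate(question, ranked[:3])
```

The fallback addresses exactly the failure mode named above: when every retrieved sentence is individually unsatisfying, generation can still compose a usable answer from the information they jointly contain.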

Internet data has surfaced as a primary source for investigating different aspects of human behavior. A crucial step in such studies is finding a suitable cohort (i.e., a set of users) that shares a common trait of interest to researchers. However, direct identification of users sharing this trait is often impossible, as the data available to researchers is usually anonymized to preserve user privacy. To facilitate research on specific topics of interest, especially in medicine, we introduce an algorithm for identifying cohorts of anonymous users. We...

10.1145/3038912.3052629 preprint EN 2017-04-03

Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, which is a popular pretraining data source, in its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, roles, and affiliations. Then, we conduct...

10.48550/arxiv.2401.06408 preprint EN cc-by arXiv (Cornell University) 2024-01-01

Citation sentences (citances) to a reference article have been extensively studied for summarization tasks. However, citances might not accurately represent the content of the cited article, as they often fail to capture the context of the reported findings and can be affected by epistemic value drift. Following the intuition behind the TAC (Text Analysis Conference) 2014 Biomedical Summarization track, we propose a system that identifies text spans in the reference article that are related to a given citance. We refer to this problem as citance-reference...

10.3115/v1/n15-1110 article EN cc-by Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2015-01-01
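At its simplest, matching a citance to reference-article spans is a similarity search: compare the citance against each candidate sentence and return the best match. The bag-of-words cosine below is only a baseline sketch; the actual system uses richer matching than raw lexical overlap.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_reference_span(citance, reference_sentences):
    """Reference sentence most lexically similar to the citance."""
    cv = Counter(citance.lower().split())
    return max(reference_sentences,
               key=lambda s: cosine(cv, Counter(s.lower().split())))
```

Even this baseline surfaces the core difficulty: when the citance paraphrases rather than quotes the finding, lexical overlap alone misses the true span, motivating semantic matching.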