- Topic Modeling
- Natural Language Processing Techniques
- Biomedical Text Mining and Ontologies
- Advanced Text Analysis Techniques
- Mental Health via Writing
- Semantic Web and Ontologies
- Text Readability and Simplification
- Sentiment Analysis and Opinion Mining
- Expert Finding and Q&A Systems
- Multimodal Machine Learning Applications
- Digital Mental Health Interventions
- Data Quality and Management
- Information Retrieval and Search Behavior
- Text and Document Classification Technologies
- Mathematics, Computing, and Information Processing
- Domain Adaptation and Few-Shot Learning
- Software Engineering Research
- Machine Learning and Data Classification
- Speech and Dialogue Systems
- Explainable Artificial Intelligence (XAI)
- Misinformation and Its Impacts
- Scientific Computing and Data Management
- Speech Recognition and Synthesis
- Neural Networks and Applications
- Machine Learning in Healthcare
Allen Institute
2020-2024
Yale University
2023-2024
University of Washington
2020-2023
Allen Institute for Artificial Intelligence
2019-2023
Mongolia International University
2023
RIKEN Center for Advanced Intelligence Project
2023
University of Copenhagen
2022
Art Institute of Portland
2020-2021
Georgetown University
2014-2019
Adobe Systems (United States)
2018
Iz Beltagy, Kyle Lo, Arman Cohan. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce Longformer, an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for standard self-attention and combines local windowed attention with task-motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level...
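The abstract's central claim — local windowed attention scales linearly rather than quadratically — can be illustrated with a toy sketch. This is not the Longformer implementation; the function name and window size are illustrative, and the global-attention component is omitted.

```python
import numpy as np

def sliding_window_attention(q, k, v, window=2):
    """Toy local windowed attention: each position attends only to
    neighbors within `window` tokens, so cost grows as O(n * window)
    rather than O(n^2). Illustrative sketch, not the Longformer code."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)      # scaled dot-product scores
        weights = np.exp(scores - scores.max())      # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ v[lo:hi]                  # weighted sum over the window
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y = sliding_window_attention(x, x, x, window=2)
print(y.shape)  # (8, 4)
```

With `window` fixed, doubling the sequence length doubles the work, whereas dense self-attention would quadruple it.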
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, Nazli Goharian. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.
Although considerable attention has been given to neural ranking architectures recently, far less attention has been paid to the term representations that are used as input to these models. In this work, we investigate how two pretrained contextualized language models (ELMo and BERT) can be utilized for ad-hoc document ranking. Through experiments on TREC benchmarks, we find that several existing neural ranking architectures can benefit from the additional context provided by contextualized language models. Furthermore, we propose a joint approach that incorporates BERT's classification vector into...
Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, accurate embeddings of documents are a necessity. We propose...
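This line of work trains document embeddings with a citation-informed triplet objective: a paper's embedding is pulled toward a paper it cites and pushed away from an uncited one. A minimal numpy sketch of that loss, with made-up two-dimensional vectors standing in for learned embeddings:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Citation-informed triplet margin loss: encourage the anchor paper
    to sit closer to a cited paper (positive) than to an uncited one
    (negative) by at least `margin`. Sketch only, not the training code."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])   # anchor paper embedding (hypothetical)
p = np.array([0.9, 0.1])   # cited paper: nearby
n = np.array([-1.0, 0.5])  # unrelated paper: far away
print(triplet_margin_loss(a, p, n))  # 0.0 once the gap exceeds the margin
```

The loss is zero when the embedding space already separates cited from uncited papers by the margin, and positive otherwise, which is what gradient descent minimizes during training.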
Users suffering from mental health conditions often turn to online resources for support, including specialized online support communities or general communities such as Twitter and Reddit. In this work, we present a framework for supporting and studying users in both types of communities. We propose methods for identifying posts that may indicate a risk of self-harm, and demonstrate that our approach outperforms strong previously proposed methods for identifying such posts. Self-harm is closely related to depression, which makes identifying depressed users on general forums a crucial related task....
David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, Hannaneh Hajishirzi. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
Arman Cohan, Waleed Ammar, Madeleine van Zuylen, Field Cady. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
We introduce TLDR generation, a new form of extreme summarization, for scientific papers. TLDR generation involves high source compression and requires expert background knowledge and understanding of complex domain-specific language. To facilitate study on this task, we introduce SCITLDR, a multi-target dataset of 5.4K TLDRs over 3.2K papers. SCITLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden....
We introduce PRIMERA, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of fine-tuning labeled data. PRIMERA uses our newly proposed pre-training objective designed to teach the model to connect and aggregate information across documents. It also uses efficient encoder-decoder transformers to simplify the processing of concatenated input documents. With extensive experiments on 6 datasets from 3 different domains in zero-shot, few-shot...
As a step toward better document-level understanding, we explore classification of a sequence of sentences into their corresponding categories, a task that requires understanding sentences in the context of the document. Recent successful models for this task have used hierarchical models to contextualize sentence representations, and Conditional Random Fields (CRFs) to incorporate dependencies between subsequent labels. In this work, we show that pretrained language models, BERT (Devlin et al., 2018) in particular, can be used to capture contextual...
Managing the data for Information Retrieval (IR) experiments can be challenging. Dataset documentation is scattered across the Internet and once one obtains a copy of the data, there are numerous different data formats to work with. Even basic formats can have subtle dataset-specific nuances that need to be considered for proper use. To help mitigate these challenges, we introduce a new robust and lightweight tool (ir_datasets) for acquiring, managing, and performing typical operations over datasets used in IR. We primarily focus on...
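The core design idea — one uniform interface for documents, queries, and relevance judgments regardless of the underlying dataset format — can be sketched in plain Python. The classes and data below are hypothetical in-memory stand-ins for that access pattern, not the ir_datasets implementation.

```python
from typing import Iterator, NamedTuple

class Doc(NamedTuple):
    doc_id: str
    text: str

class Qrel(NamedTuple):       # a relevance judgment (query, doc, grade)
    query_id: str
    doc_id: str
    relevance: int

class ToyDataset:
    """Hypothetical unified dataset wrapper: every dataset, whatever its
    on-disk format, exposes the same iterators to experiment code."""
    def __init__(self, docs, qrels):
        self._docs, self._qrels = docs, qrels

    def docs_iter(self) -> Iterator[Doc]:
        yield from self._docs

    def qrels_iter(self) -> Iterator[Qrel]:
        yield from self._qrels

ds = ToyDataset(
    docs=[Doc("d1", "neural ranking"), Doc("d2", "sparse retrieval")],
    qrels=[Qrel("q1", "d1", 1)],
)
relevant = {(r.query_id, r.doc_id) for r in ds.qrels_iter() if r.relevance > 0}
print(relevant)  # {('q1', 'd1')}
```

Experiment code written against such iterators never touches dataset-specific parsing, which is the portability benefit the abstract describes.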
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, Matt Gardner. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021.
The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to-date, with 200M+ papers, 80M+...
We propose a summarization approach for scientific articles which takes advantage of citation-context and the document discourse model. While citations have been previously used in generating summaries, they lack the related context from the referenced article and therefore do not accurately reflect the article's content. Our method overcomes the problem of inconsistency between the citation summary and the article's content by providing context for each citation. We also leverage the article's inherent discourse for producing better summaries. We show that our proposed...
Mental health is a significant and growing public health concern. As language usage can be leveraged to obtain crucial insights into mental health conditions, there is a need for large-scale, labeled, mental health-related datasets of users who have been diagnosed with one or more such conditions. In this paper, we investigate the creation of high-precision patterns to identify self-reported diagnoses of nine different mental health conditions, and obtain high-quality labeled data without manual labelling. We introduce SMHD (Self-reported Mental Health Diagnoses)...
Automatically generating accurate summaries from clinical reports could save a clinician's time, improve summary coverage, and reduce errors. We propose a sequence-to-sequence abstractive summarization model augmented with domain-specific ontological information to enhance content selection and summary generation. We apply our method to a dataset of radiology reports and show that it significantly outperforms the current state-of-the-art on this task in terms of rouge scores. An extensive human evaluation conducted by a radiologist...
Few-shot NLP research is highly active, yet conducted in disjoint research threads with evaluation suites that lack challenging-yet-realistic testing setups and fail to employ careful experimental design. Consequently, the community does not know which techniques perform best or even if they outperform simple baselines. In response, we formulate the FLEX Principles, a set of requirements and best practices for unified, rigorous, valid, and cost-sensitive few-shot NLP evaluation. These principles include Sample Size...
We introduce a new pretraining approach geared for multi-document language modeling, incorporating two key ideas into the masked language modeling self-supervised objective. First, instead of considering documents in isolation, we pretrain over sets of multiple related documents, encouraging the model to learn cross-document relationships. Second, we improve over recent long-range transformers by introducing dynamic global attention that has access to the entire input to predict masked tokens. We release CDLM (Cross-Document Language...
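The first idea above — masked language modeling over a concatenation of related documents rather than a single document — can be sketched with a toy corruption function. The tokens, masking rate, and function name are illustrative, not the CDLM training pipeline.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Toy masked-LM corruption: randomly replace tokens with [MASK]
    and record the originals as prediction targets. Because the input
    is a concatenation of related documents, a model can use evidence
    from one document to recover a masked token in another."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # ground truth the model must predict
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

# Two hypothetical related documents, concatenated before masking.
docs = [["longformer", "scales", "linearly"], ["bert", "scales", "quadratically"]]
tokens = [t for doc in docs for t in doc]
masked, targets = mask_tokens(tokens, mask_prob=0.5)
```

A single-document pretraining setup would apply the same corruption to each document separately; pretraining over the concatenation is what exposes cross-document relationships to the model.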
Automated methods have been widely used to identify and analyze mental health conditions (e.g., depression) from various sources of information, including social media. Yet, the deployment of such models in real-world healthcare applications faces challenges, including poor out-of-domain generalization and lack of trust in black box models. In this work, we propose approaches for depression detection that are constrained to different degrees by the presence of symptoms described in PHQ9, a questionnaire used by clinicians in depression screening...