Arman Cohan

ORCID: 0000-0002-8954-2724
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Biomedical Text Mining and Ontologies
  • Advanced Text Analysis Techniques
  • Mental Health via Writing
  • Semantic Web and Ontologies
  • Text Readability and Simplification
  • Sentiment Analysis and Opinion Mining
  • Expert finding and Q&A systems
  • Multimodal Machine Learning Applications
  • Digital Mental Health Interventions
  • Data Quality and Management
  • Information Retrieval and Search Behavior
  • Text and Document Classification Technologies
  • Mathematics, Computing, and Information Processing
  • Domain Adaptation and Few-Shot Learning
  • Software Engineering Research
  • Machine Learning and Data Classification
  • Speech and dialogue systems
  • Explainable Artificial Intelligence (XAI)
  • Misinformation and Its Impacts
  • Scientific Computing and Data Management
  • Speech Recognition and Synthesis
  • Neural Networks and Applications
  • Machine Learning in Healthcare

Allen Institute
2020-2024

Yale University
2023-2024

University of Washington
2020-2023

Allen Institute for Artificial Intelligence
2019-2023

Mongolia International University
2023

RIKEN Center for Advanced Intelligence Project
2023

University of Copenhagen
2022

Art Institute of Portland
2020-2021

Georgetown University
2014-2019

Adobe Systems (United States)
2018

Iz Beltagy, Kyle Lo, Arman Cohan. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

10.18653/v1/d19-1371 article EN cc-by 2019-01-01

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer, with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for standard self-attention and combines local windowed attention with task-motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level...

10.48550/arxiv.2004.05150 preprint EN other-oa arXiv (Cornell University) 2020-01-01
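The local-window-plus-global-token pattern described in the abstract can be illustrated as a boolean attention mask. This is a minimal sketch, not the paper's implementation; the window size and global positions below are arbitrary choices for demonstration:

```python
def longformer_attention_mask(seq_len, window, global_positions):
    """Build a boolean mask: each token attends to a local window of
    neighbors; designated global tokens attend to (and are attended to
    by) every position."""
    half = window // 2
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(max(0, i - half), min(seq_len, i + half + 1)):
            mask[i][j] = True  # local windowed attention
    for g in global_positions:
        for j in range(seq_len):
            mask[g][j] = True  # global token attends to everything
            mask[j][g] = True  # everything attends to the global token
    return mask

# Token 0 (e.g., a [CLS]-like token) is given global attention.
mask = longformer_attention_mask(seq_len=16, window=4, global_positions=[0])
```

The number of True entries grows as O(n x window) rather than O(n^2), which is the source of the linear scaling the abstract refers to.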

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, Nazli Goharian. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.

10.18653/v1/n18-2097 article EN cc-by 2018-01-01

Although considerable attention has been given to neural ranking architectures recently, far less attention has been paid to the term representations that are used as input to these models. In this work, we investigate how two pretrained contextualized language models (ELMo and BERT) can be utilized for ad-hoc document ranking. Through experiments on TREC benchmarks, we find that several existing neural ranking architectures can benefit from the additional context provided by contextualized language models. Furthermore, we propose a joint approach that incorporates BERT's classification vector into...

10.1145/3331184.3331317 preprint EN 2019-07-18
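The "joint approach" the abstract mentions can be sketched as appending BERT's classification ([CLS]) vector to an existing ranker's feature vector and scoring with a single linear layer. All names and dimensions here are invented for illustration; this is not the paper's code:

```python
def joint_score(cls_vector, ranker_features, weights, bias=0.0):
    """Score a (query, document) pair from the concatenation of a
    [CLS] vector and an existing neural ranker's feature vector."""
    combined = list(cls_vector) + list(ranker_features)
    assert len(weights) == len(combined)
    return sum(w * x for w, x in zip(weights, combined)) + bias

# Toy example: a 4-d stand-in for the (normally 768-d) [CLS] vector,
# plus two stand-in features from an existing ranking model.
score = joint_score([0.1, -0.2, 0.3, 0.0], [0.42, 0.17],
                    weights=[0.5, 0.5, 0.5, 0.5, 1.0, 1.0])
```

In practice the weights would be learned jointly with the ranker; the sketch only shows how the two signal sources are combined.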

Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, accurate embeddings of documents are a necessity. We propose...

10.18653/v1/2020.acl-main.207 article EN cc-by 2020-01-01
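The recommendation use case the abstract mentions reduces, at inference time, to nearest-neighbor search over document embeddings. The following is a generic sketch with toy three-dimensional vectors, unrelated to the paper's actual model or embedding dimensionality:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def recommend(query_emb, corpus, k=2):
    """Return ids of the k documents whose embeddings are most
    similar to the query document's embedding."""
    ranked = sorted(corpus, key=lambda d: cosine(query_emb, d["emb"]),
                    reverse=True)
    return [d["id"] for d in ranked[:k]]

papers = [
    {"id": "A", "emb": [1.0, 0.0, 0.0]},
    {"id": "B", "emb": [0.9, 0.1, 0.0]},
    {"id": "C", "emb": [0.0, 0.0, 1.0]},
]
top = recommend([1.0, 0.05, 0.0], papers, k=2)
```

The quality of such a recommender depends entirely on the embeddings, which is why document-level (rather than token- or sentence-level) training objectives matter here.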

Users suffering from mental health conditions often turn to online resources for support, including specialized online support communities or general communities such as Twitter and Reddit. In this work, we present a framework for supporting and studying users in both types of communities. We propose methods for identifying posts in support communities that may indicate a risk of self-harm, and demonstrate that our approach outperforms strong previously proposed methods for identifying such posts. Self-harm is closely related to depression, which makes identifying depressed users on general forums a crucial related task...

10.18653/v1/d17-1322 article EN cc-by Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing 2017-01-01

David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, Hannaneh Hajishirzi. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.

10.18653/v1/2020.emnlp-main.609 article EN cc-by 2020-01-01

Arman Cohan, Waleed Ammar, Madeleine van Zuylen, Field Cady. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

10.18653/v1/n19-1361 article EN 2019-01-01

We introduce TLDR generation, a new form of extreme summarization, for scientific papers. TLDR generation involves high source compression and requires expert background knowledge and understanding of complex domain-specific language. To facilitate study on this task, we introduce SCITLDR, a multi-target dataset of 5.4K TLDRs over 3.2K papers. SCITLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden...

10.18653/v1/2020.findings-emnlp.428 article EN cc-by 2020-01-01

We introduce PRIMERA, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of fine-tuning labeled data. PRIMERA uses our newly proposed pre-training objective designed to teach the model to connect and aggregate information across documents. It also uses efficient encoder-decoder transformers to simplify the processing of concatenated input documents. With extensive experiments on 6 datasets from 3 different domains in zero-shot, few-shot...

10.18653/v1/2022.acl-long.360 article EN cc-by Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022-01-01

As a step toward better document-level understanding, we explore classification of a sequence of sentences into their corresponding categories, a task that requires understanding sentences in the context of the document. Recent successful models for this task have used hierarchical models to contextualize sentence representations, and Conditional Random Fields (CRFs) to incorporate dependencies between subsequent labels. In this work, we show that pretrained language models, BERT (Devlin et al., 2018) in particular, can be used to capture contextual...

10.18653/v1/d19-1383 preprint EN cc-by 2019-01-01
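The CRF layer the abstract mentions chooses a label sequence that balances per-sentence label scores against label-transition scores; the standard decoding procedure is the Viterbi algorithm. A self-contained sketch with toy scores (not the paper's model):

```python
def viterbi(emissions, transitions):
    """Decode the best label sequence given per-position label scores
    (emissions: n x L) and pairwise label-transition scores
    (transitions: L x L), as a CRF output layer would."""
    n, num_labels = len(emissions), len(emissions[0])
    score = list(emissions[0])
    back = []
    for t in range(1, n):
        new_score, ptr = [], []
        for j in range(num_labels):
            best_i = max(range(num_labels),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            ptr.append(best_i)
        score, back = new_score, back + [ptr]
    # Backtrack from the best final label.
    path = [max(range(num_labels), key=lambda j: score[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Two labels; transitions strongly penalize switching labels, so the
# decoder overrides the first position's emission preference.
labels = viterbi(
    emissions=[[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    transitions=[[0.0, -5.0], [-5.0, 0.0]],
)
```

With zero transition scores this degenerates to a per-position argmax, which shows exactly what the transition term adds: dependencies between subsequent labels.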

Managing the data for Information Retrieval (IR) experiments can be challenging. Dataset documentation is scattered across the Internet, and once one obtains a copy of the data, there are numerous different data formats to work with. Even basic formats can have subtle dataset-specific nuances that need to be considered for proper use. To help mitigate these challenges, we introduce a new robust and lightweight tool (ir_datasets) for acquiring, managing, and performing typical operations over datasets used in IR. We primarily focus on...

10.1145/3404835.3463254 article EN Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021-07-11
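One of the "typical operations" such a tool standardizes is reading relevance judgments and scoring a ranking against them. A minimal stand-alone sketch, assuming the common TREC qrels text format and not using the ir_datasets library itself:

```python
def parse_qrels(lines):
    """Parse TREC-format qrels lines: 'qid iteration docid relevance'."""
    qrels = {}
    for line in lines:
        qid, _, docid, rel = line.split()
        qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def precision_at_k(ranked_docids, judged, k):
    """Fraction of the top-k retrieved documents judged relevant (> 0)."""
    return sum(1 for d in ranked_docids[:k] if judged.get(d, 0) > 0) / k

qrels = parse_qrels(["q1 0 d1 1", "q1 0 d2 0", "q1 0 d3 2"])
p_at_2 = precision_at_k(["d3", "d2", "d1"], qrels["q1"], k=2)
```

Even this tiny example hides a dataset-specific nuance of the kind the abstract warns about: whether unjudged documents should count as non-relevant (assumed here) varies across evaluation setups.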

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, Matt Gardner. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021.

10.18653/v1/2021.naacl-main.365 article EN cc-by Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2021-01-01

The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to-date, with 200M+ papers, 80M+...

10.48550/arxiv.2301.10140 preprint EN cc-by arXiv (Cornell University) 2023-01-01

We propose a summarization approach for scientific articles which takes advantage of citation-context and the document discourse model. While citations have been previously used in generating summaries, they lack the related context from the referenced article and therefore do not accurately reflect the article's content. Our method overcomes the problem of inconsistency between the citation summary and the article's content by providing context for each citation. We also leverage the article's inherent discourse for producing better summaries. We show that our proposed...

10.18653/v1/d15-1045 article EN cc-by Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing 2015-01-01

Mental health is a significant and growing public health concern. As language usage can be leveraged to obtain crucial insights into mental health conditions, there is a need for large-scale, labeled, mental health-related datasets of users who have been diagnosed with one or more of such conditions. In this paper, we investigate the creation of high-precision patterns to identify self-reported diagnoses of nine different mental health conditions, obtaining high-quality labeled data without manual labelling. We introduce the SMHD (Self-reported Mental Health Diagnoses)...

10.48550/arxiv.1806.05258 preprint EN cc-by arXiv (Cornell University) 2018-01-01
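A "high-precision pattern" for self-reported diagnoses, of the kind described above, can be sketched as a regular expression. The pattern and the condition list below are made up for illustration; they are not the paper's actual patterns:

```python
import re

# Matches first-person diagnosis statements such as
# "I was diagnosed with depression" or "I've been diagnosed with PTSD".
DIAGNOSIS = re.compile(
    r"\bI(?:\s+(?:was|am|have been)|'ve been)\s+diagnosed\s+with\s+"
    r"(depression|anxiety|ptsd)\b",
    re.IGNORECASE,
)

def self_reported_conditions(text):
    """Return conditions the author explicitly reports being diagnosed with."""
    return [m.group(1).lower() for m in DIAGNOSIS.finditer(text)]

hits = self_reported_conditions("I was diagnosed with depression last spring.")
```

Requiring the first-person construction is what buys precision: third-person mentions ("my friend was diagnosed with...") deliberately fail to match.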

Automatically generating accurate summaries from clinical reports could save a clinician's time, improve summary coverage, and reduce errors. We propose a sequence-to-sequence abstractive summarization model augmented with domain-specific ontological information to enhance content selection and summary generation. We apply our method to a dataset of radiology reports and show that it significantly outperforms the current state-of-the-art on this task in terms of rouge scores. An extensive human evaluation conducted by a radiologist...

10.1145/3331184.3331319 article EN 2019-07-18

Few-shot NLP research is highly active, yet conducted in disjoint research threads with evaluation suites that lack challenging-yet-realistic testing setups and fail to employ careful experimental design. Consequently, the community does not know which techniques perform best or even if they outperform simple baselines. In response, we formulate the FLEX Principles, a set of requirements and best practices for unified, rigorous, valid, and cost-sensitive few-shot NLP evaluation. These principles include Sample Size...

10.48550/arxiv.2107.07170 preprint EN other-oa arXiv (Cornell University) 2021-01-01

We introduce a new pretraining approach geared for multi-document language modeling, incorporating two key ideas into the masked language modeling self-supervised objective. First, instead of considering documents in isolation, we pretrain over sets of multiple related documents, encouraging the model to learn cross-document relationships. Second, we improve over recent long-range transformers by introducing dynamic global attention that has access to the entire input to predict masked tokens. We release CDLM (Cross-Document Language...

10.18653/v1/2021.findings-emnlp.225 article EN cc-by 2021-01-01

Automated methods have been widely used to identify and analyze mental health conditions (e.g., depression) from various sources of information, including social media. Yet, the deployment of such models in real-world healthcare applications faces challenges, including poor out-of-domain generalization and lack of trust in black box models. In this work, we propose approaches for depression detection that are constrained to different degrees by the presence of symptoms described in PHQ9, a questionnaire used by clinicians in depression screening...

10.18653/v1/2022.acl-long.578 article EN cc-by Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022-01-01