- Topic Modeling
- Natural Language Processing Techniques
- Biomedical Text Mining and Ontologies
- Advanced Text Analysis Techniques
- Mental Health via Writing
- Semantic Web and Ontologies
- Text Readability and Simplification
- Sentiment Analysis and Opinion Mining
- Expert Finding and Q&A Systems
- Multimodal Machine Learning Applications
- Digital Mental Health Interventions
- Data Quality and Management
- Information Retrieval and Search Behavior
- Text and Document Classification Technologies
- Mathematics, Computing, and Information Processing
- Domain Adaptation and Few-Shot Learning
- Software Engineering Research
- Machine Learning and Data Classification
- Speech and Dialogue Systems
- Explainable Artificial Intelligence (XAI)
- Misinformation and Its Impacts
- Scientific Computing and Data Management
- Speech Recognition and Synthesis
- Neural Networks and Applications
- Machine Learning in Healthcare
Allen Institute
2020-2024
Yale University
2023-2024
University of Washington
2020-2023
Allen Institute for Artificial Intelligence
2019-2023
Mongolia International University
2023
RIKEN Center for Advanced Intelligence Project
2023
University of Copenhagen
2022
Art Institute of Portland
2020-2021
Georgetown University
2014-2019
Adobe Systems (United States)
2018
Iz Beltagy, Kyle Lo, Arman Cohan. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce Longformer, an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for standard self-attention and combines local windowed attention with task-motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level...
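The abstract's central claim — local windowed attention scales linearly rather than quadratically — can be illustrated with a toy sketch. This is not the Longformer implementation; the function name and window size are illustrative, and the global-attention component is omitted.

```python
import numpy as np

def sliding_window_attention(q, k, v, window=2):
    """Toy local windowed attention: each position attends only to
    neighbors within `window` tokens, so cost grows as O(n * window)
    rather than O(n^2). Illustrative sketch, not the Longformer code."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)      # scaled dot-product scores
        weights = np.exp(scores - scores.max())      # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ v[lo:hi]                  # weighted sum over the window
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y = sliding_window_attention(x, x, x, window=2)
print(y.shape)  # (8, 4)
```

With `window` fixed, doubling the sequence length doubles the work, whereas dense self-attention would quadruple it.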
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, Nazli Goharian. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 2018.
Although considerable attention has been given to neural ranking architectures recently, far less attention has been paid to the term representations that are used as input to these models. In this work, we investigate how two pretrained contextualized language models (ELMo and BERT) can be utilized for ad-hoc document ranking. Through experiments on TREC benchmarks, we find that several existing neural ranking architectures can benefit from the additional context provided by contextualized language models. Furthermore, we propose a joint approach that incorporates BERT's classification vector into...
Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, accurate embeddings of documents are a necessity. We propose...
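This line of work trains document embeddings with a citation-informed triplet objective: a paper's embedding is pulled toward a paper it cites and pushed away from an uncited one. A minimal numpy sketch of that loss, with made-up two-dimensional vectors standing in for learned embeddings:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Citation-informed triplet margin loss: encourage the anchor paper
    to sit closer to a cited paper (positive) than to an uncited one
    (negative) by at least `margin`. Sketch only, not the training code."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])   # anchor paper embedding (hypothetical)
p = np.array([0.9, 0.1])   # cited paper: nearby
n = np.array([-1.0, 0.5])  # unrelated paper: far away
print(triplet_margin_loss(a, p, n))  # 0.0 once the gap exceeds the margin
```

The loss is zero when the embedding space already separates cited from uncited papers by the margin, and positive otherwise, which is what gradient descent minimizes during training.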
Users suffering from mental health conditions often turn to online resources for support, including specialized online support communities or general communities such as Twitter and Reddit. In this work, we present a framework for supporting and studying users in both types of communities. We propose methods for identifying posts that may indicate a risk of self-harm, and demonstrate that our approach outperforms strong previously proposed methods for identifying such posts. Self-harm is closely related to depression, which makes identifying depressed users on general forums a crucial related task....
David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, Hannaneh Hajishirzi. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
Arman Cohan, Waleed Ammar, Madeleine van Zuylen, Field Cady. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
We introduce TLDR generation, a new form of extreme summarization, for scientific papers. TLDR generation involves high source compression and requires expert background knowledge and understanding of complex domain-specific language. To facilitate study on this task, we introduce SCITLDR, a multi-target dataset of 5.4K TLDRs over 3.2K papers. SCITLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden....
We introduce PRIMERA, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of fine-tuning labeled data. PRIMERA uses our newly proposed pre-training objective designed to teach the model to connect and aggregate information across documents. It also uses efficient encoder-decoder transformers to simplify the processing of concatenated input documents. With extensive experiments on 6 datasets from 3 different domains in zero-shot, few-shot...
As a step toward better document-level understanding, we explore classification of a sequence of sentences into their corresponding categories, a task that requires understanding sentences in the context of the document. Recent successful models for this task have used hierarchical models to contextualize sentence representations, and Conditional Random Fields (CRFs) to incorporate dependencies between subsequent labels. In this work, we show that pretrained language models, BERT (Devlin et al., 2018) in particular, can be used to capture contextual...
Managing the data for Information Retrieval (IR) experiments can be challenging. Dataset documentation is scattered across the Internet and once one obtains a copy of the data, there are numerous different data formats to work with. Even basic formats can have subtle dataset-specific nuances that need to be considered for proper use. To help mitigate these challenges, we introduce a new robust and lightweight tool (ir_datasets) for acquiring, managing, and performing typical operations over datasets used in IR. We primarily focus on...
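The core design idea — one uniform interface for documents, queries, and relevance judgments regardless of the underlying dataset format — can be sketched in plain Python. The classes and data below are hypothetical in-memory stand-ins for that access pattern, not the ir_datasets implementation.

```python
from typing import Iterator, NamedTuple

class Doc(NamedTuple):
    doc_id: str
    text: str

class Qrel(NamedTuple):       # a relevance judgment (query, doc, grade)
    query_id: str
    doc_id: str
    relevance: int

class ToyDataset:
    """Hypothetical unified dataset wrapper: every dataset, whatever its
    on-disk format, exposes the same iterators to experiment code."""
    def __init__(self, docs, qrels):
        self._docs, self._qrels = docs, qrels

    def docs_iter(self) -> Iterator[Doc]:
        yield from self._docs

    def qrels_iter(self) -> Iterator[Qrel]:
        yield from self._qrels

ds = ToyDataset(
    docs=[Doc("d1", "neural ranking"), Doc("d2", "sparse retrieval")],
    qrels=[Qrel("q1", "d1", 1)],
)
relevant = {(r.query_id, r.doc_id) for r in ds.qrels_iter() if r.relevance > 0}
print(relevant)  # {('q1', 'd1')}
```

Experiment code written against such iterators never touches dataset-specific parsing, which is the portability benefit the abstract describes.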
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, Matt Gardner. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021.
The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to-date, with 200M+ papers, 80M+...
We propose a summarization approach for scientific articles which takes advantage of citation-context and the document discourse model. While citations have been previously used in generating summaries, they lack the related context from the referenced article and therefore do not accurately reflect the article's content. Our method overcomes the problem of inconsistency between the citation summary and the article's content by providing context for each citation. We also leverage the article's inherent discourse for producing better summaries. We show that our proposed...
Mental health is a significant and growing public health concern. As language usage can be leveraged to obtain crucial insights into mental health conditions, there is a need for large-scale, labeled, mental health-related datasets of users who have been diagnosed with one or more such conditions. In this paper, we investigate the creation of high-precision patterns to identify self-reported diagnoses of nine different mental health conditions, and obtain high-quality labeled data without manual labelling. We introduce SMHD (Self-reported Mental Health Diagnoses)...
Automatically generating accurate summaries from clinical reports could save a clinician's time, improve summary coverage, and reduce errors. We propose a sequence-to-sequence abstractive summarization model augmented with domain-specific ontological information to enhance content selection and summary generation. We apply our method to a dataset of radiology reports and show that it significantly outperforms the current state-of-the-art on this task in terms of rouge scores. An extensive human evaluation conducted by a radiologist...
Few-shot NLP research is highly active, yet conducted in disjoint research threads with evaluation suites that lack challenging-yet-realistic testing setups and fail to employ careful experimental design. Consequently, the community does not know which techniques perform best or even if they outperform simple baselines. In response, we formulate the FLEX Principles, a set of requirements and best practices for unified, rigorous, valid, and cost-sensitive few-shot NLP evaluation. These principles include Sample Size...
We introduce a new pretraining approach geared for multi-document language modeling, incorporating two key ideas into the masked language modeling self-supervised objective. First, instead of considering documents in isolation, we pretrain over sets of multiple related documents, encouraging the model to learn cross-document relationships. Second, we improve over recent long-range transformers by introducing dynamic global attention that has access to the entire input to predict masked tokens. We release CDLM (Cross-Document Language...
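The first idea above — masked language modeling over a concatenation of related documents rather than a single document — can be sketched with a toy corruption function. The tokens, masking rate, and function name are illustrative, not the CDLM training pipeline.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Toy masked-LM corruption: randomly replace tokens with [MASK]
    and record the originals as prediction targets. Because the input
    is a concatenation of related documents, a model can use evidence
    from one document to recover a masked token in another."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # ground truth the model must predict
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

# Two hypothetical related documents, concatenated before masking.
docs = [["longformer", "scales", "linearly"], ["bert", "scales", "quadratically"]]
tokens = [t for doc in docs for t in doc]
masked, targets = mask_tokens(tokens, mask_prob=0.5)
```

A single-document pretraining setup would apply the same corruption to each document separately; pretraining over the concatenation is what exposes cross-document relationships to the model.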
Automated methods have been widely used to identify and analyze mental health conditions (e.g., depression) from various sources of information, including social media. Yet, the deployment of such models in real-world healthcare applications faces challenges, including poor out-of-domain generalization and lack of trust in black box models. In this work, we propose approaches for depression detection that are constrained to different degrees by the presence of symptoms described in PHQ9, a questionnaire used by clinicians in depression screening...