Xiangru Tang

ORCID: 0009-0006-2700-4513
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Text Readability and Simplification
  • Biomedical Text Mining and Ontologies
  • Software Engineering Research
  • Advanced Text Analysis Techniques
  • Computational Drug Discovery Methods
  • Protein Structure and Dynamics
  • Multi-Agent Systems and Negotiation
  • Speech and dialogue systems
  • Data Quality and Management
  • Scientific Computing and Data Management
  • Semantic Web and Ontologies
  • Model-Driven Software Engineering Techniques
  • Genomics and Phylogenetic Studies
  • Mathematics, Computing, and Information Processing
  • Information Retrieval and Search Behavior
  • Digital Humanities and Scholarship
  • Machine Learning in Materials Science
  • Software Testing and Debugging Techniques
  • Glycosylation and Glycoproteins Research
  • RNA and protein synthesis mechanisms
  • Interpreting and Communication in Healthcare
  • Machine Learning in Healthcare
  • Ferroelectric and Negative Capacitance Devices

Affiliations

Yale University
2022-2025

Tongmyong University
2024

Mohamed bin Zayed University of Artificial Intelligence
2023

University of Cambridge
2023

Meta (Israel)
2022

Columbia University
2022

National University of Defense Technology
2005

Publications

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, Colin Raffel. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.

10.18653/v1/2023.acl-long.891 article EN cc-by 2023-01-01

Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-David, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Fries, Maged Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Dragomir Radev, Mike Tian-jian Jiang, Alexander Rush. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 2022.

10.18653/v1/2022.acl-demo.9 article EN cc-by 2022-01-01

Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model...

10.48550/arxiv.2307.16789 preprint EN other-oa arXiv (Cornell University) 2023-01-01
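The abstract above centers on teaching models to call external APIs. As a hedged illustration only (ToolLLM's actual pipeline is far richer), here is a minimal Python sketch of the tool-use loop it describes: the model emits a JSON tool call and a harness executes it. The `call_llm` stub and both tools are invented placeholders, not part of ToolLLM.

```python
import json

# Hypothetical tool registry: names of external APIs mapped to callables.
TOOLS = {
    "get_weather": lambda city: f"(pretend forecast) Sunny in {city}",
    "add": lambda a, b: a + b,
}

def call_llm(prompt: str) -> str:
    """Stub for an instruction-tuned LLM; a real system would query a model.
    Here it always proposes one fixed tool call, just to drive the loop."""
    return json.dumps({"tool": "get_weather", "args": {"city": "Paris"}})

def tool_use_step(instruction: str) -> str:
    """One round of the loop: the model picks an API, the harness executes it."""
    prompt = f"Instruction: {instruction}\nAvailable tools: {sorted(TOOLS)}"
    call = json.loads(call_llm(prompt))            # parse the model's tool call
    result = TOOLS[call["tool"]](**call["args"])   # dispatch to the external API
    return f"{call['tool']} -> {result}"

print(tool_use_step("What is the weather in Paris?"))
```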

Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. The rapid development of the field, combined with the inherent complexity of drug design, creates a difficult landscape for new researchers to enter. In this survey, we organize the field into two overarching themes: small...

10.1093/bib/bbae338 article EN cc-by-nc Briefings in Bioinformatics 2024-05-23

The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate, multifaceted challenges. This survey provides a comprehensive overview, framing intelligent agents within a modular, brain-inspired...

10.48550/arxiv.2504.01990 preprint EN arXiv (Cornell University) 2025-03-31

Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained models, substantial amounts of hallucinated content are found during human evaluation. Pre-trained models are most commonly fine-tuned with cross-entropy loss for text summarization, which may not be an optimal strategy. In this work, we provide a typology of factual errors with annotation data to highlight the types...

10.18653/v1/2022.naacl-main.415 article EN cc-by Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2022-01-01

The identification of protein homologs in large databases using conventional methods, such as sequence comparison, often misses remote homologs. Here, we offer an ultrafast, highly sensitive method, dense homolog retriever (DHR), for detecting homologs on the basis of a protein language model and dense retrieval techniques. Its dual-encoder architecture generates different embeddings for the same protein sequence and easily locates homologs by comparing these representations. Its alignment-free nature improves speed, and it incorporates rich evolutionary and structural...

10.1038/s41587-024-02353-6 article EN cc-by-nc-nd Nature Biotechnology 2024-08-09
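To make the dual-encoder idea concrete, here is a toy sketch (not DHR's actual model): a placeholder encoder maps each sequence to a unit vector, and homolog lookup becomes a cosine-similarity ranking over precomputed embeddings. The `embed` function and the sequences are illustrative stand-ins for a real protein language model and database.

```python
import numpy as np

def embed(seq: str) -> np.ndarray:
    """Placeholder encoder: deterministic random unit vector per sequence.
    A real system would use a protein language model here."""
    rng = np.random.default_rng(abs(hash(seq)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

# Toy "database" of sequences with precomputed embeddings (the index).
database = ["MKTAYIAKQR", "MVLSPADKTN", "GSHMAEQLTE"]
db_matrix = np.stack([embed(s) for s in database])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Alignment-free lookup: rank entries by cosine similarity to the query."""
    scores = db_matrix @ embed(query)  # unit vectors, so dot product = cosine
    return [database[i] for i in np.argsort(-scores)[:k]]

print(retrieve("MKTAYIAKQQ"))
```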

We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-Text annotations can be a costly process, especially when dealing with tables, which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure for extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing...

10.48550/arxiv.2007.02871 preprint EN cc-by-sa arXiv (Cornell University) 2020-01-01
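The triple-extraction step the DART abstract mentions can be pictured with a deliberately simplified sketch: pick a subject column, then pair its cell with every other (header, cell) in the row. The table values here are invented, and the real construction additionally exploits header dependencies and table titles.

```python
# Invented toy table: one header row plus a data row.
headers = ["Team", "City", "Stadium"]
row = ["Bulldogs", "New Haven", "Yale Bowl"]

def row_to_triples(headers: list[str], row: list[str], subject_col: int = 0):
    """Pair the subject cell with every other (header, cell) in the row."""
    subject = row[subject_col]
    return [
        (subject, headers[i], row[i])  # (subject, predicate, object)
        for i in range(len(row))
        if i != subject_col
    ]

print(row_to_triples(headers, row))
# [('Bulldogs', 'City', 'New Haven'), ('Bulldogs', 'Stadium', 'Yale Bowl')]
```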

Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. The rapid development of the field, combined with the inherent complexity of drug design, creates a difficult landscape for new researchers to enter. In this survey, we organize the field into two overarching themes: small...

10.48550/arxiv.2402.08703 preprint EN arXiv (Cornell University) 2024-02-13

Inverse protein folding, which aims to design amino acid sequences for desired structures, is fundamental to protein engineering and therapeutic development. While recent deep-learning approaches have made remarkable progress, they typically represent biochemical properties as discrete features associated with individual residues. Here, we present BC-Design, a framework that represents these properties as continuous distributions across protein surfaces and interiors. Through contrastive learning, our model...

10.21203/rs.3.rs-6310665/v1 preprint EN Research Square (Research Square) 2025-05-08

Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with human instructions. We compile CommitPack: 4 terabytes of Git commits across 350 programming languages. We benchmark CommitPack against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter StarCoder model and achieve state-of-the-art performance among models not trained on OpenAI outputs on the HumanEval...

10.48550/arxiv.2308.07124 preprint EN other-oa arXiv (Cornell University) 2023-01-01
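The commit-as-instruction pairing behind CommitPack can be sketched with plain `git` plumbing: each commit subject plays the role of the human instruction, and the diff is the paired code change. This is an illustrative reconstruction of the idea, not the CommitPack pipeline, and it assumes it runs inside a local Git repository.

```python
import subprocess

def git(repo: str, *args: str) -> str:
    """Run a git command in `repo` and return its stdout."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def commit_instruction_pairs(repo: str, n: int = 5):
    """Yield (commit message, diff) pairs: the message serves as the
    instruction and the diff as the paired code change."""
    for h in git(repo, "log", f"-{n}", "--pretty=%H").split():
        message = git(repo, "show", "-s", "--pretty=%s", h).strip()
        diff = git(repo, "show", "--pretty=format:", h)
        yield {"instruction": message, "diff": diff}

for pair in commit_instruction_pairs("."):  # assumes cwd is a Git repository
    print(pair["instruction"])
```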

We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond...

10.48550/arxiv.2501.12380 preprint EN arXiv (Cornell University) 2025-01-21

Xiangru Tang, Alexander Fabbri, Haoran Li, Ziming Mao, Griffin Adams, Borui Wang, Asli Celikyilmaz, Yashar Mehdad, Dragomir Radev. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022.

10.18653/v1/2022.naacl-main.417 article EN cc-by Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2022-01-01

Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find that finetuning on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus. Finetuning on multilingual tasks further improves performance, leading to various state-of-the-art results...

10.48550/arxiv.2211.01786 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Motivation: The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models' versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain. Results: We present a unified pre-trained language...

10.1093/bioinformatics/btae260 article EN cc-by Bioinformatics 2024-05-09

Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. In this paper, we study data contamination by proposing two methods tailored to both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We further...

10.48550/arxiv.2311.09783 preprint EN other-oa arXiv (Cornell University) 2023-01-01
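One hedged way to picture the retrieval-based overlap check described above: treat a benchmark example as contaminated if any long n-gram from it also appears in the pretraining corpus. The threshold (8-grams) and the exact-match scheme here are illustrative choices, not necessarily the paper's method.

```python
# Flag a benchmark example if any of its 8-grams occurs in the corpus.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(example: str, corpus_docs: list[str], n: int = 8) -> bool:
    corpus_ngrams = set().union(*(ngrams(d, n) for d in corpus_docs))
    return bool(ngrams(example, n) & corpus_ngrams)

corpus = ["the quick brown fox jumps over the lazy dog every single day"]
print(is_contaminated("quick brown fox jumps over the lazy dog every", corpus))
# True: an 8-gram from the example also appears in the corpus document.
```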

Tabular data is prevalent across various industries, necessitating significant time and effort for users to understand and manipulate it for their information-seeking purposes. The advancements in large language models (LLMs) have shown enormous potential to improve user efficiency. However, the adoption of LLMs in real-world applications for table information seeking remains underexplored. In this paper, we investigate the table-to-text capabilities of different LLMs using four datasets within two real-world information seeking scenarios. These include...

10.18653/v1/2023.emnlp-industry.17 article EN cc-by 2023-01-01

PromptSource is a system for creating, sharing, and using natural language prompts. Prompts are functions that map an example from a dataset to a natural language input and target output. Using prompts to train and query language models is an emerging area in NLP that requires new tools that let users develop and refine these prompts collaboratively. PromptSource addresses the emergent challenges in this new setting with (1) a templating language for defining data-linked prompts, (2) an interface that lets users quickly iterate on prompt development by observing outputs of their prompts on many examples, and (3)...

10.48550/arxiv.2202.01279 preprint EN other-oa arXiv (Cornell University) 2022-01-01
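The "prompts are functions" framing can be shown with a hand-rolled stand-in for a data-linked template: a function mapping a dataset example to a natural language (input, target) pair. PromptSource itself expresses these as templates edited through a dedicated interface; this NLI example is just the underlying idea.

```python
# A prompt as a function: dataset example in, (input, target) strings out.
def nli_prompt(example: dict) -> tuple[str, str]:
    inp = (f"Premise: {example['premise']}\n"
           f"Hypothesis: {example['hypothesis']}\n"
           "Does the premise entail the hypothesis? Answer yes, no, or maybe.")
    # Standard NLI label order: 0 = entailment, 1 = neutral, 2 = contradiction.
    target = ["yes", "maybe", "no"][example["label"]]
    return inp, target

ex = {"premise": "A dog runs in the park.",
      "hypothesis": "An animal is outside.",
      "label": 0}
print(nli_prompt(ex))
```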

Pre-trained large language models have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate large language models (LLMs) in generating bioinformatics-specific code. BioCoder spans a broad spectrum of the field and covers...

10.48550/arxiv.2308.16458 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Despite the remarkable capabilities of Large Language Models (LLMs) like GPT-4, producing complex, structured tabular data remains challenging. Our study assesses LLMs' proficiency in structuring tables and introduces a novel fine-tuning method, cognizant of data structures, to bolster their performance. We unveil Struc-Bench, a comprehensive benchmark featuring prominent LLMs (GPT-NeoX-20B, GPT-3.5, Vicuna), which spans text tables, HTML, and LaTeX formats. Our proposed FormatCoT aids in crafting...

10.48550/arxiv.2309.08963 preprint EN cc-by arXiv (Cornell University) 2023-01-01

The advancement and extensive application of large language models (LLMs) have been remarkable, including their use in scientific research assistance. However, these models often generate scientifically incorrect or unsafe responses, and in some cases they may encourage users to engage in dangerous behavior. To address this issue in the field of chemistry, we introduce ChemSafetyBench, a benchmark designed to evaluate the accuracy and safety of LLM responses. ChemSafetyBench encompasses three key tasks: querying chemical...

10.48550/arxiv.2411.16736 preprint EN arXiv (Cornell University) 2024-11-23

Yilun Zhao, Chen Zhao, Linyong Nan, Zhenting Qi, Wenlin Zhang, Xiangru Tang, Boyu Mi, Dragomir Radev. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.

10.18653/v1/2023.acl-long.334 article EN cc-by 2023-01-01