Xueguang Ma

ORCID: 0000-0003-3430-4910
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Semantic Web and Ontologies
  • Explainable Artificial Intelligence (XAI)
  • Image Retrieval and Classification Techniques
  • Advanced Image and Video Retrieval Techniques
  • Misinformation and Its Impacts
  • Information Systems Theories and Implementation
  • Data Quality and Management
  • Online and Blended Learning
  • Text and Document Classification Technologies
  • Algorithms and Data Compression
  • Neural Networks and Applications
  • Information Retrieval and Search Behavior
  • Biomedical Text Mining and Ontologies
  • Machine Learning in Materials Science
  • Genomics and Phylogenetic Studies
  • Access Control and Trust
  • Privacy-Preserving Technologies in Data
  • Recommender Systems and Techniques
  • Handwritten Text Recognition Techniques
  • Teacher Education and Leadership Studies
  • Data Management and Algorithms

University of Waterloo
2020-2024

University of California, Los Angeles
2023

University of Maryland, College Park
2006-2011

University of Maryland, Baltimore County
2005

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. It aims to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular,...

10.1145/3404835.3463238 article EN Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021-07-11
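Pyserini's sparse first stage is BM25 over Lucene indexes. As a rough illustration of what that scoring computes, here is a toy pure-Python BM25 sketch; this is not Pyserini's actual implementation (which wraps Lucene), the corpus is hypothetical, and the k1/b defaults mirror Anserini's commonly used 0.9/0.4.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=0.9, b=0.4):
    """BM25 score of each tokenized doc against the query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term absent from the corpus contributes nothing
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [["dense", "retrieval", "with", "transformers"],
        ["sparse", "bm25", "retrieval"],
        ["cooking", "recipes"]]
scores = bm25_scores(["sparse", "retrieval"], docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The doc matching both query terms ranks first; the doc matching neither scores zero, which is the defining behavior of a sparse bag-of-words scorer.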

Recently, there has been significant progress in teaching language models to perform step-by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thoughts prompting (CoT) is by far the state-of-the-art method for these tasks. CoT uses language models to perform both reasoning and computation in the multi-step `thought' process. To disentangle computation from reasoning, we propose `Program of Thoughts' (PoT), which uses language models (mainly Codex) to express the reasoning process as a program. The computation is relegated to an external computer, which executes the generated programs to derive the answer. We evaluate PoT on...

10.48550/arxiv.2211.12588 preprint EN other-oa arXiv (Cornell University) 2022-01-01
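The PoT recipe above, have the model write a program and then run it externally, can be sketched as follows. The generated program is hard-coded here for illustration, whereas the paper obtains it from a code LLM such as Codex; the `ans` variable name is a common convention assumed here, not a fixed API.

```python
# Stand-in for an LLM generation answering:
# "Alice has 3 boxes of 12 apples and eats 5. How many remain?"
generated_program = """
boxes = 3
apples_per_box = 12
eaten = 5
ans = boxes * apples_per_box - eaten
"""

def execute_program(program: str):
    """Run the model-generated program externally and read back `ans`."""
    namespace = {}
    exec(program, {"__builtins__": {}}, namespace)  # empty builtins as a minimal safeguard
    return namespace["ans"]

answer = execute_program(generated_program)
```

The point of the decomposition: the LM only has to get the reasoning (the program) right; the interpreter guarantees the arithmetic is exact.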

While dense retrieval has been shown to be effective and efficient across tasks and languages, it remains difficult to create fully zero-shot dense retrieval systems when no relevance labels are available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first prompts an instruction-following language model (e.g., InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is...

10.18653/v1/2023.acl-long.99 article EN cc-by 2023-01-01
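The HyDE pipeline, generate a hypothetical answer document and then search with its embedding instead of the query's, can be sketched with a stubbed generator and toy bag-of-words "embeddings"; real HyDE uses an unsupervised neural encoder, and the corpus below is hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; stands in for a neural document encoder."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

corpus = {
    "d1": "bm25 is a sparse bag of words ranking function",
    "d2": "dense retrieval encodes queries and documents as vectors",
    "d3": "pasta recipes with tomato sauce",
}

query = "how does dense retrieval work"
# Step 1 (stubbed): an instruction-following LM writes a hypothetical answer document.
hypothetical_doc = "dense retrieval encodes the query and documents into vectors and ranks by similarity"
# Step 2: search with the hypothetical document's embedding, not the query's.
hyde_vec = embed(hypothetical_doc)
best = max(corpus, key=lambda d: cosine(hyde_vec, embed(corpus[d])))
```

The hypothetical document need not be factually correct; it only needs to "look like" a relevant document so that document-to-document similarity finds real neighbors.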

10.1145/3626772.3657951 article EN Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval 2024-07-10

Recent developments in representational learning for information retrieval can be organized in a conceptual framework that establishes two pairs of contrasts: sparse vs. dense representations and unsupervised vs. learned representations. Sparse learned representations can be further decomposed into expansion and term weighting components. This framework allows us to understand the relationship between recently proposed techniques such as DPR, ANCE, DeepCT, DeepImpact, and COIL; furthermore, gaps revealed by our analysis point to "low hanging fruit"...

10.48550/arxiv.2106.14807 preprint EN other-oa arXiv (Cornell University) 2021-01-01

Question answering over knowledge bases is considered a difficult problem due to the challenge of generalizing to a wide variety of possible natural language questions. Additionally, the heterogeneity of knowledge base schema items between different knowledge bases often necessitates specialized training for different knowledge-base question-answering (KBQA) datasets. To handle questions over diverse KBQA datasets with a unified training-free framework, we propose KB-BINDER, which for the first time enables few-shot in-context learning over KBQA tasks. Firstly, KB-BINDER leverages...

10.18653/v1/2023.acl-long.385 article EN cc-by 2023-01-01

Recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving fundamental math problems like GSM8K by achieving over 90% accuracy. However, their capabilities to solve more challenging math problems which require domain-specific knowledge (i.e., theorems) have yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models’ capabilities to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts and contains 800 high-quality...

10.18653/v1/2023.emnlp-main.489 article EN cc-by Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 2023-01-01

In the traditional RAG framework, the basic retrieval units are normally short. Common retrievers like DPR work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the `needle' unit. In contrast, readers only need to extract answers from the short retrieved units. Such an imbalanced design, with a `heavy' retriever and a `light' reader, can lead to sub-optimal performance. In order to alleviate the imbalance, we propose a new framework, LongRAG, consisting of a `long retriever' and a `long reader'. LongRAG processes...

10.48550/arxiv.2406.15319 preprint EN arXiv (Cornell University) 2024-06-21
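A minimal sketch of the "long retrieval unit" idea: pack consecutive short passages into much longer units so the retriever searches far fewer, coarser candidates. The paper groups passages by Wikipedia document and hyperlink structure; the greedy word-budget packing below is a simplification, and `max_words` is an illustrative parameter.

```python
def build_long_units(passages, max_words=3000):
    """Greedily pack consecutive short passages into long retrieval units."""
    units, current, count = [], [], 0
    for p in passages:
        words = len(p.split())
        if current and count + words > max_words:
            units.append(" ".join(current))  # budget exceeded: close the current unit
            current, count = [], 0
        current.append(p)
        count += words
    if current:
        units.append(" ".join(current))
    return units

passages = ["short passage one", "short passage two", "short passage three"]
units = build_long_units(passages, max_words=6)
```

With a 6-word budget, the three 3-word passages collapse into two units, shrinking the corpus the retriever must search while pushing the extraction burden onto a long-context reader.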

Pyserini is an easy-to-use Python toolkit that supports replicable IR research by providing effective first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words...

10.48550/arxiv.2102.10073 preprint EN cc-by arXiv (Cornell University) 2021-01-01

Recent rapid advancements in deep pre-trained language models and the introduction of large datasets have powered research in embedding-based dense retrieval. While several good research papers have emerged, many of them come with their own software stacks. These stacks are typically optimized for some particular research goals instead of efficiency or code structure. In this paper, we present Tevatron, a dense retrieval toolkit optimized for efficiency, flexibility, and simplicity. Tevatron provides a standardized pipeline including text...

10.48550/arxiv.2203.05765 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Supervised ranking methods based on bi-encoder or cross-encoder architectures have shown success in multi-stage text ranking tasks, but they require large amounts of relevance judgments as training data. In this work, we propose Listwise Reranker with a Large Language Model (LRL), which achieves strong reranking effectiveness without using any task-specific training data. Different from the existing pointwise methods, where documents are scored independently and ranked according to the scores, LRL directly generates...

10.48550/arxiv.2305.02156 preprint EN other-oa arXiv (Cornell University) 2023-01-01
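The listwise idea, show the model all candidates at once and have it generate a reordered list of identifiers, can be sketched as below. The prompt template and identifier format are illustrative, not the paper's exact wording, and the model call is stubbed with a fixed string.

```python
import re

def listwise_prompt(query, passages):
    """Build a listwise reranking prompt enumerating candidates as [1]..[n]."""
    lines = [f"[{i + 1}] {p}" for i, p in enumerate(passages)]
    return ("Rank the following passages by relevance to the query.\n"
            f"Query: {query}\n" + "\n".join(lines) +
            "\nOutput the identifiers in order, most relevant first.")

def parse_ranking(model_output, n):
    """Recover a full permutation from the generated identifier list."""
    ids = []
    for m in re.findall(r"\[(\d+)\]", model_output):
        i = int(m)
        if 1 <= i <= n and i not in ids:  # drop out-of-range and repeated ids
            ids.append(i)
    ids += [i for i in range(1, n + 1) if i not in ids]  # append omitted ids
    return ids

passages = ["BM25 ranking function", "dense vectors", "pasta recipe"]
prompt = listwise_prompt("sparse retrieval", passages)
stub_output = "[1] > [2] > [3]"  # stand-in for the LLM generation
order = parse_ranking(stub_output, len(passages))
```

Defensive parsing matters in practice: generations can omit or repeat identifiers, so the parser deduplicates and backfills to guarantee a valid permutation.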

In the age of large-scale language models, benchmarks like Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by...

10.48550/arxiv.2406.01574 preprint EN arXiv (Cornell University) 2024-06-03

This work describes the adaptation of a pretrained sequence-to-sequence model to the task of scientific claim verification in the biomedical domain. We propose VERT5ERINI, which exploits T5 for abstract retrieval, sentence selection, and label prediction, the three critical sub-tasks of claim verification. We evaluate our pipeline on SCIFACT, a newly curated dataset that requires models to not just predict the veracity of claims but also provide relevant sentences from a corpus of scientific literature that support this decision. Empirically,...

10.48550/arxiv.2010.11930 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Recent rapid advances in deep pre-trained language models and the introduction of large datasets have powered research in embedding-based neural retrieval. While many excellent papers have emerged, most of them come with their own implementations, which are typically optimized for some particular research goals instead of efficiency or code organization. In this paper, we introduce Tevatron, a neural retrieval toolkit that is optimized for efficiency, flexibility, and simplicity. Tevatron enables model training and evaluation for a variety of ranking...

10.1145/3539618.3591805 article EN Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval 2023-07-18

10.1145/3626772.3657862 article EN Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval 2024-07-10

10.18653/v1/2024.emnlp-main.250 article EN Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 2024-01-01

Recent advancements in dense retrieval have introduced vision-language model (VLM)-based retrievers, such as DSE and ColPali, which leverage document screenshots embedded as vectors to enable effective search and offer a simplified pipeline over traditional text-only methods. In this study, we propose three pixel poisoning attack methods designed to compromise VLM-based retrievers and evaluate their effectiveness under various attack settings and parameter configurations. Our empirical results demonstrate that...

10.48550/arxiv.2501.16902 preprint EN arXiv (Cornell University) 2025-01-28

Existing foundation models typically process visual input as pixels and textual input as tokens, a paradigm that contrasts with human perception, where both modalities are processed in a unified manner. With the rise of embodied and agentic AI, where inputs primarily come from camera pixels, the need for a unified perception framework becomes increasingly evident. In this paper, we propose to unify all modalities (text, tables, code, diagrams, images, etc.) as pixel inputs, i.e., "Perceive Everything as Pixels" (PEAP). We introduce PixelWorld,...

10.48550/arxiv.2501.19339 preprint EN arXiv (Cornell University) 2025-01-31

Text retrieval using learned dense representations has recently emerged as a promising alternative to "traditional" text retrieval using sparse bag-of-words representations. One recent work that garnered much attention is the dense passage retriever (DPR) technique proposed by Karpukhin et al. (2020) for end-to-end open-domain question answering. We present a replication study of this work, starting with the model checkpoints provided by the authors, but otherwise from an independent implementation in our group's Pyserini IR...

10.48550/arxiv.2104.05740 preprint EN cc-by arXiv (Cornell University) 2021-01-01
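At query time, DPR retrieval reduces to inner-product search over precomputed passage embeddings. A pure-Python sketch of that final step, using toy 4-dimensional vectors in place of DPR's 768-dimensional BERT embeddings and exhaustive search in place of a FAISS index:

```python
def dpr_search(query_vec, index, k=2):
    """Score every passage embedding by inner product and return the top-k ids."""
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in index]
    return sorted(range(len(index)), key=lambda i: -scores[i])[:k]

# Toy 'embeddings'; real DPR vectors come from separately trained
# query and passage BERT encoders (the bi-encoder design).
index = [[1.0, 0.0, 0.0, 0.0],
         [0.9, 0.1, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]
query = [1.0, 0.0, 0.0, 0.0]
top_ids = dpr_search(query, index, k=2)
```

Because passages are encoded offline, the online cost is one query encoding plus a maximum-inner-product search, which is what makes the bi-encoder design practical for first-stage retrieval.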

The COVID-19 pandemic has brought about a proliferation of harmful news articles online, with sources lacking credibility and misrepresenting scientific facts. Misinformation has real consequences for consumer health search, i.e., users searching for health information. In the context of multi-stage ranking architectures, there has been little work exploring whether they prioritize correct and credible information over misinformation. We find that, indeed, training models on standard relevance datasets like MS MARCO...

10.1145/3404835.3463120 article EN Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021-07-11

We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages, designed to evaluate ranking with learned dense representations. The goal of this resource is to spur research on dense retrieval techniques in non-English languages, motivated by recent observations that existing techniques for representation learning perform poorly when applied to out-of-distribution data. As a starting point, we provide zero-shot baselines for this new dataset based on a multi-lingual adaptation of DPR that we call "mDPR". Experiments show...

10.18653/v1/2021.mrl-1.12 article EN cc-by 2021-01-01

Large-scale language models (LLMs) like ChatGPT have demonstrated impressive abilities in generating responses based on human instructions. However, their use in the medical field can be challenging due to their lack of specific, in-depth knowledge. In this study, we present a system called LLMs Augmented with Medical Textbooks (LLM-AMT) designed to enhance LLM proficiency in specialized domains. LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules. These modules include...

10.48550/arxiv.2309.02233 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Dense retrieval models using a transformer-based bi-encoder architecture have emerged as an active area of research. In this article, we focus on the task of monolingual retrieval in a variety of typologically diverse languages using such an architecture. Although recent work with multilingual transformers demonstrates that they exhibit strong cross-lingual generalization capabilities, there remain many open research questions, which we tackle here. Our study is organized as a “best practices” guide for training dense retrieval models,...

10.1145/3613447 article EN ACM Transactions on Information Systems 2023-08-12