Shitao Xiao

ORCID: 0000-0003-2567-6843
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Multimodal Machine Learning Applications
  • Advanced Image and Video Retrieval Techniques
  • Domain Adaptation and Few-Shot Learning
  • Semantic Web and Ontologies
  • Recommender Systems and Techniques
  • Advanced Graph Neural Networks
  • Image Retrieval and Classification Techniques
  • Ferroelectric and Negative Capacitance Devices
  • Advanced Memory and Neural Computing
  • Generative Adversarial Networks and Image Synthesis
  • Advanced Bandit Algorithms Research
  • Speech Recognition and Synthesis
  • Artificial Intelligence in Healthcare and Education
  • EEG and Brain-Computer Interfaces
  • Advanced MEMS and NEMS Technologies
  • Constraint Satisfaction and Optimization
  • Innovative Teaching and Learning Methods
  • Consumer Market Behavior and Pricing
  • Human Pose and Action Recognition
  • Context-Aware Activity Recognition Systems
  • Multi-Agent Systems and Negotiation
  • Web Data Mining and Analysis
  • Hate Speech and Cyberbullying Detection

Beijing Academy of Artificial Intelligence
2023-2025

Beijing Academy of Social Sciences
2024

Beijing University of Posts and Telecommunications
2021-2023

Huawei Technologies (China)
2023

Hainan Agricultural School
2022

Microsoft Research (United Kingdom)
2021

10.1145/3626772.3657878 article EN Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval 2024-07-10

We introduce C-Pack, a package of resources that significantly advances the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with...
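
To make the resource concrete, below is a minimal sketch of scoring passages with one of the released C-TEM checkpoints through the sentence-transformers library; the model name BAAI/bge-large-zh-v1.5 and the toy texts are illustrative assumptions, not details taken from the abstract.

    from sentence_transformers import SentenceTransformer

    # Load a C-TEM checkpoint from the Hugging Face Hub (model name assumed).
    model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

    queries = ["如何制作蛋糕"]
    passages = ["制作蛋糕需要面粉、鸡蛋和糖。", "今天天气很好。"]

    # Normalized embeddings, so the dot product equals cosine similarity.
    q_emb = model.encode(queries, normalize_embeddings=True)
    p_emb = model.encode(passages, normalize_embeddings=True)
    print(q_emb @ p_emb.T)  # higher score = more relevant passage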

10.48550/arxiv.2309.07597 preprint EN other-oa arXiv (Cornell University) 2023-01-01

In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process...
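
The multi-functionality amounts to producing several kinds of relevance signals from one model and fusing them at ranking time. The sketch below is a toy illustration of such a fusion, with hypothetical inputs (a dense vector plus per-token sparse weights per text) and illustrative weights; it is not the released API.

    import numpy as np

    def hybrid_score(q_dense, p_dense, q_sparse, p_sparse, w_dense=0.7, w_sparse=0.3):
        # Dense signal: inner product of the two sentence vectors.
        dense = float(np.dot(q_dense, p_dense))
        # Sparse (lexical) signal: product of weights on overlapping tokens.
        shared = set(q_sparse) & set(p_sparse)
        sparse = sum(q_sparse[t] * p_sparse[t] for t in shared)
        return w_dense * dense + w_sparse * sparse

    q_dense, p_dense = np.array([0.1, 0.9]), np.array([0.2, 0.8])
    q_sparse, p_sparse = {"retrieval": 0.7}, {"retrieval": 0.5, "model": 0.2}
    print(hybrid_score(q_dense, p_dense, q_sparse, p_sparse))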

10.48550/arxiv.2402.03216 preprint EN arXiv (Cornell University) 2024-02-05

Despite pre-training’s progress in many important NLP tasks, it remains to explore effective pre-training strategies for dense retrieval. In this paper, we propose RetroMAE, a new retrieval-oriented pre-training paradigm based on Masked Auto-Encoder (MAE). RetroMAE is highlighted by three critical designs. 1) A novel MAE workflow, where the input sentence is polluted for the encoder and the decoder with different masks. The sentence embedding is generated from the encoder’s masked input; then, the original sentence is recovered from the sentence embedding and the decoder’s masked input via masked language...
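
The asymmetric masking in that workflow can be pictured with a short sketch: the encoder sees a lightly masked sentence and compresses it into one embedding, while the decoder must reconstruct the original from an aggressively masked copy plus that embedding. The masking ratios below are illustrative only.

    import random

    def mask_tokens(tokens, ratio, mask_token="[MASK]"):
        # Replace a random subset of tokens with [MASK]; ratio controls aggressiveness.
        out = list(tokens)
        for i in random.sample(range(len(out)), k=int(len(out) * ratio)):
            out[i] = mask_token
        return out

    sentence = "dense retrieval needs strong sentence embeddings".split()
    enc_input = mask_tokens(sentence, ratio=0.15)  # moderate masking for the encoder
    dec_input = mask_tokens(sentence, ratio=0.50)  # aggressive masking for the decoder
    # The encoder turns enc_input into a single sentence embedding; the decoder
    # must recover the original sentence from that embedding plus dec_input.
    print(enc_input, dec_input)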

10.18653/v1/2022.emnlp-main.35 article EN cc-by Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 2022-01-01

The representation learning on textual graph is to generate low-dimensional embeddings for the nodes based on the individual textual features and the neighbourhood information. Recent breakthroughs in pretrained language models and graph neural networks push forward the development of corresponding techniques. The existing works mainly rely on a cascaded model architecture: the textual features of the nodes are independently encoded by language models at first; the textual embeddings are aggregated by graph neural networks afterwards. However, the above architecture is limited due to the independent modeling of the textual features. In this work, we propose...
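
For reference, the cascaded baseline criticized here can be written in a few lines: encode each node's text independently, then aggregate neighbour embeddings with the graph structure. The encoder checkpoint, toy texts, and mean-pooling aggregator below are illustrative assumptions.

    import torch
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed checkpoint
    texts = ["graph neural networks", "language models", "text embeddings"]
    x = torch.tensor(encoder.encode(texts))            # step 1: independent text encoding
    adj = torch.tensor([[0., 1., 1.],                  # step 2: toy adjacency matrix
                        [1., 0., 0.],
                        [1., 0., 0.]])
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    node_repr = x + (adj @ x) / deg                    # self feature + neighbour mean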

10.48550/arxiv.2105.02605 preprint EN cc-by arXiv (Cornell University) 2021-01-01

News recommendation calls for deep insights into news articles' underlying semantics. Therefore, pretrained language models (PLMs), like BERT and RoBERTa, may substantially contribute to the recommendation quality. However, it's extremely challenging to have news recommenders trained together with such big models: the learning of news recommenders requires intensive news encoding operations, whose cost is prohibitive if PLMs are used as the news encoder. In this paper, we propose a novel framework, SpeedyFeed, which efficiently trains PLMs-based news recommenders of superior...

10.1145/3534678.3539120 article EN Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2022-08-12

Large language models (LLMs) face significant challenges stemming from their inherent limitations in knowledge, memory, alignment, and action. These challenges cannot be addressed by LLMs alone, but should rely on assistance from the external world, such as knowledge bases, memory stores, demonstration examples, and tools. Retrieval augmentation stands as a vital mechanism for bridging the gap between the LLMs and such external assistance. However, conventional methods encounter two pressing issues. On one hand, general-purpose retrievers are not...

10.48550/arxiv.2310.07554 preprint EN cc-by arXiv (Cornell University) 2023-01-01

10.18653/v1/2024.findings-acl.145 article EN Findings of the Association for Computational Linguistics: ACL 2024 2024-01-01

Large language models (LLMs) provide powerful foundations to perform fine-grained text re-ranking. However, they are often prohibitive in reality due to constraints on computation bandwidth. In this work, we propose a flexible architecture called Matroyshka Re-Ranker, which is designed to facilitate runtime customization of model layers and sequence lengths at each layer based on users' configurations. Consequently, the LLM-based re-rankers can be made applicable across...

10.48550/arxiv.2501.16302 preprint EN arXiv (Cornell University) 2025-01-27

Modern large language models (LLMs), driven by scaling laws, achieve intelligence emergence at larger model sizes. Recently, the increasing concerns about cloud costs, latency, and privacy make it an urgent requirement to develop compact edge models. Distinguished from direct pretraining that is bounded by the scaling law, this work proposes pruning-aware pretraining, focusing on retaining the performance of a much larger optimized model. It features the following characteristics: 1) Data-scalable: we introduce minimal parameter...

10.48550/arxiv.2502.06663 preprint EN arXiv (Cornell University) 2025-02-10

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code...
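
For context, benchmarks in the MTEB family are typically run through the mteb Python package: point it at a set of task names and pass any encoder that follows the sentence-transformers interface. The task name and model below are placeholders, and the exact API may differ across package versions.

    from mteb import MTEB
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-m3")      # any encoder with an .encode() method
    evaluation = MTEB(tasks=["STS22"])              # placeholder task selection
    results = evaluation.run(model, output_folder="results/bge-m3")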

10.48550/arxiv.2502.13595 preprint EN arXiv (Cornell University) 2025-02-19

Vector quantization (VQ) based ANN indexes, such as Inverted File System (IVF) and Product Quantization (PQ), have been widely applied to embedding-based document retrieval thanks to the competitive time and memory efficiency. Originally, VQ is learned to minimize the reconstruction loss, i.e., the distortions between the original dense embeddings and the reconstructed embeddings after quantization. Unfortunately, such an objective is inconsistent with the goal of selecting ground-truth documents for the input query, which may cause severe loss of retrieval quality...
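
As background, the IVF and PQ structures discussed here correspond to the standard reconstruction-trained index exposed by FAISS; the snippet below builds such a baseline index on random vectors (all sizes are illustrative) and does not implement the retrieval-oriented training objective proposed in the paper.

    import faiss
    import numpy as np

    d, nlist, m = 128, 256, 16                            # dim, IVF clusters, PQ sub-vectors
    xb = np.random.rand(10000, d).astype("float32")       # document embeddings
    xq = np.random.rand(5, d).astype("float32")           # query embeddings

    quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer for IVF
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per PQ code
    index.train(xb)                                       # learns centroids and codebooks
    index.add(xb)
    index.nprobe = 16                                     # clusters probed per query
    distances, ids = index.search(xq, 10)                 # top-10 candidates per query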

10.1145/3477495.3531799 article EN Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval 2022-07-06

Embedding based retrieval (EBR) is a fundamental building block in many web applications. However, EBR in sponsored search is distinguished from other generic scenarios and technically challenging due to the need of serving multiple purposes: firstly, it has to retrieve high-relevance ads, which may exactly serve the user's search intent; secondly, it needs to retrieve high-CTR ads so as to maximize the overall user clicks. In this paper, we present a novel representation learning framework Uni-Retriever developed for Bing Search,...

10.1145/3534678.3539212 article EN Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2022-08-12

We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is super efficient, taking 8 hours on one 8xA800 (80G) GPU machine. The resulting model exhibits superior performances across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding; meanwhile, it also well preserves the original capability over short contexts. The dramatic context extension is mainly attributed to merely 3.5K synthetic samples...
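
For readers unfamiliar with the setup, QLoRA fine-tuning means loading the base model in 4-bit precision and training only low-rank adapters on top. The sketch below shows such a configuration with transformers, bitsandbytes, and peft; the enlarged RoPE base, context length, and LoRA hyperparameters are illustrative assumptions, not the paper's exact recipe.

    import torch
    from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    name = "meta-llama/Meta-Llama-3-8B-Instruct"
    config = AutoConfig.from_pretrained(name)
    config.max_position_embeddings = 81920             # target ~80K context
    config.rope_theta = 2e8                            # enlarged RoPE base (assumed value)

    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(name, config=config, quantization_config=bnb)

    lora = LoraConfig(r=32, lora_alpha=16, task_type="CAUSAL_LM",
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
    model = get_peft_model(model, lora)                # only adapter weights are trainable
    model.print_trainable_parameters()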

10.48550/arxiv.2404.19553 preprint EN arXiv (Cornell University) 2024-04-30

Ad-hoc search calls for the selection of appropriate answers from a massive-scale corpus. Nowadays, embedding-based retrieval (EBR) becomes a promising solution, where deep learning based document representation and ANN search techniques are allied to handle this task. However, a major challenge is that the ANN index can be too large to fit into memory, given the considerable size of the answer corpus. In this work, we tackle this problem with Bi-Granular Document Representation, where the lightweight sparse embeddings are indexed and standby in memory...

10.1145/3485447.3511957 article EN Proceedings of the ACM Web Conference 2022 2022-04-25

To better support information retrieval tasks such as web search and open-domain question answering, growing effort is made to develop retrieval-oriented language models, e.g., RetroMAE and many others. Most of the existing works focus on improving the semantic representation capability for the contextualized embedding of the [CLS] token. However, a recent study shows that the ordinary tokens besides [CLS] may provide extra information, which helps to produce a better representation effect. As such, it's necessary to extend the current methods where all...

10.18653/v1/2023.acl-long.148 article EN cc-by Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2023-01-01

The recent advancements in large language models (LLMs) with billions of parameters have significantly boosted their performance across various real-world applications. However, the inference processes for these models require substantial energy and computational resources, presenting considerable deployment challenges. In contrast, human brains, which contain approximately 86 billion biological neurons, exhibit far greater energy efficiency compared to LLMs with a similar number of parameters. Inspired by this, we...

10.48550/arxiv.2407.04752 preprint EN arXiv (Cornell University) 2024-07-05

Product quantization (PQ) is a widely used technique for ad-hoc retrieval. Recent studies propose supervised PQ, where the embedding and quantization models can be jointly trained with supervised learning. However, there is a lack of an appropriate formulation of the joint training objective; thus, the improvements over previous non-supervised baselines are limited in reality. In this work, we propose the Matching-oriented Product Quantization (MoPQ), where a novel objective, the Multinoulli Contrastive Loss (MCL), is formulated. With the minimization of MCL, we are able to maximize...
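
To ground the idea, a contrastive objective of this kind is usually realized as a cross-entropy over one positive and many negative candidates. The PyTorch snippet below is a generic formulation in that spirit, not the paper's exact MCL derivation; the temperature and the use of in-batch negatives are assumptions.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(q, p, temperature=0.05):
        # q, p: [batch, dim] query and (quantized) passage embeddings; row i matches row i.
        q = F.normalize(q, dim=-1)
        p = F.normalize(p, dim=-1)
        logits = q @ p.T / temperature                 # every other row acts as a negative
        labels = torch.arange(q.size(0), device=q.device)
        return F.cross_entropy(logits, labels)         # maximize the matching probability

    loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))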

10.18653/v1/2021.emnlp-main.640 article EN cc-by Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021-01-01

Inverted file structure is a common technique for accelerating dense retrieval. It clusters documents based on their embeddings; during searching, it probes the nearby clusters w.r.t. an input query and only evaluates the documents within them by the subsequent codecs, thus avoiding the expensive cost from exhaustive traversal. However, the clustering is always lossy, which results in the miss of relevant documents that fall outside the probed clusters and hence degrades the retrieval quality. In contrast, lexical matching signals, such as overlaps of salient terms, tend to be strong...

10.18653/v1/2023.emnlp-main.116 article EN cc-by Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 2023-01-01