Shitao Xiao

ORCID: 0000-0003-2567-6843
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Multimodal Machine Learning Applications
  • Advanced Image and Video Retrieval Techniques
  • Domain Adaptation and Few-Shot Learning
  • Semantic Web and Ontologies
  • Recommender Systems and Techniques
  • Advanced Graph Neural Networks
  • Image Retrieval and Classification Techniques
  • Ferroelectric and Negative Capacitance Devices
  • Advanced Memory and Neural Computing
  • Generative Adversarial Networks and Image Synthesis
  • Advanced Bandit Algorithms Research
  • Speech Recognition and Synthesis
  • Artificial Intelligence in Healthcare and Education
  • EEG and Brain-Computer Interfaces
  • Advanced MEMS and NEMS Technologies
  • Constraint Satisfaction and Optimization
  • Innovative Teaching and Learning Methods
  • Consumer Market Behavior and Pricing
  • Human Pose and Action Recognition
  • Context-Aware Activity Recognition Systems
  • Multi-Agent Systems and Negotiation
  • Web Data Mining and Analysis
  • Hate Speech and Cyberbullying Detection

Beijing Academy of Artificial Intelligence
2023-2025

Beijing Academy of Social Sciences
2024

Beijing University of Posts and Telecommunications
2021-2023

Huawei Technologies (China)
2023

Hainan Agricultural School
2022

Microsoft Research (United Kingdom)
2021

10.1145/3626772.3657878 article EN Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval 2024-07-10

We introduce C-Pack, a package of resources that significantly advances the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with...
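
To make the resource concrete, below is a minimal sketch of scoring passages with one of the released C-TEM checkpoints through the sentence-transformers library; the model name BAAI/bge-large-zh-v1.5 and the toy texts are illustrative assumptions, not details taken from the abstract.

    from sentence_transformers import SentenceTransformer

    # Load a C-TEM checkpoint from the Hugging Face Hub (model name assumed).
    model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

    queries = ["如何制作蛋糕"]
    passages = ["制作蛋糕需要面粉、鸡蛋和糖。", "今天天气很好。"]

    # Normalized embeddings, so the dot product equals cosine similarity.
    q_emb = model.encode(queries, normalize_embeddings=True)
    p_emb = model.encode(passages, normalize_embeddings=True)
    print(q_emb @ p_emb.T)  # higher score = more relevant passage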

10.48550/arxiv.2309.07597 preprint EN other-oa arXiv (Cornell University) 2023-01-01

In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process...
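
The multi-functionality amounts to producing several kinds of relevance signals from one model and fusing them at ranking time. The sketch below is a toy illustration of such a fusion, with hypothetical inputs (a dense vector plus per-token sparse weights per text) and illustrative weights; it is not the released API.

    import numpy as np

    def hybrid_score(q_dense, p_dense, q_sparse, p_sparse, w_dense=0.7, w_sparse=0.3):
        # Dense signal: inner product of the two sentence vectors.
        dense = float(np.dot(q_dense, p_dense))
        # Sparse (lexical) signal: product of weights on overlapping tokens.
        shared = set(q_sparse) & set(p_sparse)
        sparse = sum(q_sparse[t] * p_sparse[t] for t in shared)
        return w_dense * dense + w_sparse * sparse

    q_dense, p_dense = np.array([0.1, 0.9]), np.array([0.2, 0.8])
    q_sparse, p_sparse = {"retrieval": 0.7}, {"retrieval": 0.5, "model": 0.2}
    print(hybrid_score(q_dense, p_dense, q_sparse, p_sparse))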

10.48550/arxiv.2402.03216 preprint EN arXiv (Cornell University) 2024-02-05

Despite pre-training’s progress in many important NLP tasks, it remains to explore effective pre-training strategies for dense retrieval. In this paper, we propose RetroMAE, a new retrieval-oriented pre-training paradigm based on Masked Auto-Encoder (MAE). RetroMAE is highlighted by three critical designs. 1) A novel MAE workflow, where the input sentence is polluted for the encoder and the decoder with different masks. The sentence embedding is generated from the encoder’s masked input; then, the original sentence is recovered from the sentence embedding and the decoder’s masked input via masked language...
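
The asymmetric masking in that workflow can be pictured with a short sketch: the encoder sees a lightly masked sentence and compresses it into one embedding, while the decoder must reconstruct the original from an aggressively masked copy plus that embedding. The masking ratios below are illustrative only.

    import random

    def mask_tokens(tokens, ratio, mask_token="[MASK]"):
        # Replace a random subset of tokens with [MASK]; ratio controls aggressiveness.
        out = list(tokens)
        for i in random.sample(range(len(out)), k=int(len(out) * ratio)):
            out[i] = mask_token
        return out

    sentence = "dense retrieval needs strong sentence embeddings".split()
    enc_input = mask_tokens(sentence, ratio=0.15)  # moderate masking for the encoder
    dec_input = mask_tokens(sentence, ratio=0.50)  # aggressive masking for the decoder
    # The encoder turns enc_input into a single sentence embedding; the decoder
    # must recover the original sentence from that embedding plus dec_input.
    print(enc_input, dec_input)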

10.18653/v1/2022.emnlp-main.35 article EN cc-by Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 2022-01-01

The representation learning on textual graph is to generate low-dimensional embeddings for the nodes based on the individual textual features and the neighbourhood information. Recent breakthroughs in pretrained language models and graph neural networks push forward the development of corresponding techniques. The existing works mainly rely on a cascaded model architecture: the textual features of the nodes are independently encoded by language models at first; the textual embeddings are aggregated by graph neural networks afterwards. However, the above architecture is limited due to the independent modeling of the textual features. In this work, we propose...
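
For reference, the cascaded baseline criticized here can be written in a few lines: encode each node's text independently, then aggregate neighbour embeddings with the graph structure. The encoder checkpoint, toy texts, and mean-pooling aggregator below are illustrative assumptions.

    import torch
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed checkpoint
    texts = ["graph neural networks", "language models", "text embeddings"]
    x = torch.tensor(encoder.encode(texts))            # step 1: independent text encoding
    adj = torch.tensor([[0., 1., 1.],                  # step 2: toy adjacency matrix
                        [1., 0., 0.],
                        [1., 0., 0.]])
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    node_repr = x + (adj @ x) / deg                    # self feature + neighbour mean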

10.48550/arxiv.2105.02605 preprint EN cc-by arXiv (Cornell University) 2021-01-01

News recommendation calls for deep insights into news articles' underlying semantics. Therefore, pretrained language models (PLMs), like BERT and RoBERTa, may substantially contribute to the recommendation quality. However, it's extremely challenging to have news recommenders trained together with such big models: the learning of news recommenders requires intensive news encoding operations, whose cost is prohibitive if PLMs are used as the news encoder. In this paper, we propose a novel framework, SpeedyFeed, which efficiently trains PLMs-based news recommenders of superior...

10.1145/3534678.3539120 article EN Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2022-08-12

Large language models (LLMs) face significant challenges stemming from their inherent limitations in knowledge, memory, alignment, and action. These challenges cannot be addressed by LLMs alone, but should rely on assistance from the external world, such as knowledge bases, memory stores, demonstration examples, and tools. Retrieval augmentation stands as a vital mechanism for bridging the gap between the LLMs and such external assistance. However, conventional methods encounter two pressing issues. On one hand, general-purpose retrievers are not...

10.48550/arxiv.2310.07554 preprint EN cc-by arXiv (Cornell University) 2023-01-01

10.18653/v1/2024.findings-acl.145 article EN Findings of the Association for Computational Linguistics: ACL 2024 2024-01-01

Large language models (LLMs) provide powerful foundations to perform fine-grained text re-ranking. However, they are often prohibitive in reality due to constraints on computation bandwidth. In this work, we propose a flexible architecture called Matroyshka Re-Ranker, which is designed to facilitate runtime customization of model layers and sequence lengths at each layer based on users' configurations. Consequently, the LLM-based re-rankers can be made applicable across...

10.48550/arxiv.2501.16302 preprint EN arXiv (Cornell University) 2025-01-27

Modern large language models (LLMs), driven by scaling laws, achieve intelligence emergence at larger model sizes. Recently, the increasing concerns about cloud costs, latency, and privacy make it an urgent requirement to develop compact edge models. Distinguished from direct pretraining that is bounded by the scaling law, this work proposes pruning-aware pretraining, focusing on retaining the performance of a much larger optimized model. It features the following characteristics: 1) Data-scalable: we introduce minimal parameter...

10.48550/arxiv.2502.06663 preprint EN arXiv (Cornell University) 2025-02-10

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code...
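
For context, benchmarks in the MTEB family are typically run through the mteb Python package: point it at a set of task names and pass any encoder that follows the sentence-transformers interface. The task name and model below are placeholders, and the exact API may differ across package versions.

    from mteb import MTEB
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-m3")      # any encoder with an .encode() method
    evaluation = MTEB(tasks=["STS22"])              # placeholder task selection
    results = evaluation.run(model, output_folder="results/bge-m3")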

10.48550/arxiv.2502.13595 preprint EN arXiv (Cornell University) 2025-02-19

Vector quantization (VQ) based ANN indexes, such as Inverted File System (IVF) and Product Quantization (PQ), have been widely applied to embedding-based document retrieval thanks to the competitive time and memory efficiency. Originally, VQ is learned to minimize the reconstruction loss, i.e., the distortions between the original dense embeddings and the reconstructed embeddings after quantization. Unfortunately, such an objective is inconsistent with the goal of selecting ground-truth documents for the input query, which may cause severe loss of retrieval quality...
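
As background, the IVF and PQ structures discussed here correspond to the standard reconstruction-trained index exposed by FAISS; the snippet below builds such a baseline index on random vectors (all sizes are illustrative) and does not implement the retrieval-oriented training objective proposed in the paper.

    import faiss
    import numpy as np

    d, nlist, m = 128, 256, 16                            # dim, IVF clusters, PQ sub-vectors
    xb = np.random.rand(10000, d).astype("float32")       # document embeddings
    xq = np.random.rand(5, d).astype("float32")           # query embeddings

    quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer for IVF
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per PQ code
    index.train(xb)                                       # learns centroids and codebooks
    index.add(xb)
    index.nprobe = 16                                     # clusters probed per query
    distances, ids = index.search(xq, 10)                 # top-10 candidates per query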

10.1145/3477495.3531799 article EN Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval 2022-07-06

Embedding based retrieval (EBR) is a fundamental building block in many web applications. However, EBR in sponsored search is distinguished from other generic scenarios and technically challenging due to the need of serving multiple purposes: firstly, it has to retrieve high-relevance ads, which may exactly serve the user's search intent; secondly, it needs to retrieve high-CTR ads so as to maximize the overall user clicks. In this paper, we present a novel representation learning framework Uni-Retriever developed for Bing Search,...

10.1145/3534678.3539212 article EN Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2022-08-12

We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is super efficient, taking 8 hours on one 8xA800 (80G) GPU machine. The resulting model exhibits superior performances across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding; meanwhile, it also well preserves the original capability over short contexts. The dramatic context extension is mainly attributed to merely 3.5K synthetic samples...
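
For readers unfamiliar with the setup, QLoRA fine-tuning means loading the base model in 4-bit precision and training only low-rank adapters on top. The sketch below shows such a configuration with transformers, bitsandbytes, and peft; the enlarged RoPE base, context length, and LoRA hyperparameters are illustrative assumptions, not the paper's exact recipe.

    import torch
    from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    name = "meta-llama/Meta-Llama-3-8B-Instruct"
    config = AutoConfig.from_pretrained(name)
    config.max_position_embeddings = 81920             # target ~80K context
    config.rope_theta = 2e8                            # enlarged RoPE base (assumed value)

    bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                             bnb_4bit_compute_dtype=torch.bfloat16)
    model = AutoModelForCausalLM.from_pretrained(name, config=config, quantization_config=bnb)

    lora = LoraConfig(r=32, lora_alpha=16, task_type="CAUSAL_LM",
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
    model = get_peft_model(model, lora)                # only adapter weights are trainable
    model.print_trainable_parameters()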

10.48550/arxiv.2404.19553 preprint EN arXiv (Cornell University) 2024-04-30

Ad-hoc search calls for the selection of appropriate answers from a massive-scale corpus. Nowadays, embedding-based retrieval (EBR) becomes a promising solution, where deep learning based document representation and ANN search techniques are allied to handle this task. However, a major challenge is that the ANN index can be too large to fit into memory, given the considerable size of the answer corpus. In this work, we tackle this problem with Bi-Granular Document Representation, where the lightweight sparse embeddings are indexed and standby in memory...

10.1145/3485447.3511957 article EN Proceedings of the ACM Web Conference 2022 2022-04-25

To better support information retrieval tasks such as web search and open-domain question answering, growing effort is made to develop retrieval-oriented language models, e.g., RetroMAE and many others. Most of the existing works focus on improving the semantic representation capability for the contextualized embedding of the [CLS] token. However, a recent study shows that the ordinary tokens besides [CLS] may provide extra information, which helps to produce a better representation effect. As such, it's necessary to extend the current methods where all...

10.18653/v1/2023.acl-long.148 article EN cc-by Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2023-01-01

The recent advancements in large language models (LLMs) with billions of parameters have significantly boosted their performance across various real-world applications. However, the inference processes for these models require substantial energy and computational resources, presenting considerable deployment challenges. In contrast, human brains, which contain approximately 86 billion biological neurons, exhibit far greater energy efficiency compared to LLMs with a similar number of parameters. Inspired by this, we...

10.48550/arxiv.2407.04752 preprint EN arXiv (Cornell University) 2024-07-05

Product quantization (PQ) is a widely used technique for ad-hoc retrieval. Recent studies propose supervised PQ, where the embedding and quantization models can be jointly trained with supervised learning. However, there is a lack of an appropriate formulation of the joint training objective; thus, the improvements over previous non-supervised baselines are limited in reality. In this work, we propose the Matching-oriented Product Quantization (MoPQ), where a novel objective, the Multinoulli Contrastive Loss (MCL), is formulated. With the minimization of MCL, we are able to maximize...
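
To ground the idea, a contrastive objective of this kind is usually realized as a cross-entropy over one positive and many negative candidates. The PyTorch snippet below is a generic formulation in that spirit, not the paper's exact MCL derivation; the temperature and the use of in-batch negatives are assumptions.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(q, p, temperature=0.05):
        # q, p: [batch, dim] query and (quantized) passage embeddings; row i matches row i.
        q = F.normalize(q, dim=-1)
        p = F.normalize(p, dim=-1)
        logits = q @ p.T / temperature                 # every other row acts as a negative
        labels = torch.arange(q.size(0), device=q.device)
        return F.cross_entropy(logits, labels)         # maximize the matching probability

    loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))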

10.18653/v1/2021.emnlp-main.640 article EN cc-by Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021-01-01

Inverted file structure is a common technique for accelerating dense retrieval. It clusters documents based on their embeddings; during searching, it probes the nearby clusters w.r.t. an input query and only evaluates the documents within them by the subsequent codecs, thus avoiding the expensive cost from exhaustive traversal. However, the clustering is always lossy, which results in the miss of relevant documents that fall outside the probed clusters and hence degrades the retrieval quality. In contrast, lexical matching signals, such as overlaps of salient terms, tend to be strong...

10.18653/v1/2023.emnlp-main.116 article EN cc-by Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 2023-01-01