- Topic Modeling
- Natural Language Processing Techniques
- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Semantic Web and Ontologies
- Recommender Systems and Techniques
- Advanced Graph Neural Networks
- Image Retrieval and Classification Techniques
- Ferroelectric and Negative Capacitance Devices
- Advanced Memory and Neural Computing
- Generative Adversarial Networks and Image Synthesis
- Advanced Bandit Algorithms Research
- Speech Recognition and Synthesis
- Artificial Intelligence in Healthcare and Education
- EEG and Brain-Computer Interfaces
- Advanced MEMS and NEMS Technologies
- Constraint Satisfaction and Optimization
- Innovative Teaching and Learning Methods
- Consumer Market Behavior and Pricing
- Human Pose and Action Recognition
- Context-Aware Activity Recognition Systems
- Multi-Agent Systems and Negotiation
- Web Data Mining and Analysis
- Hate Speech and Cyberbullying Detection
Beijing Academy of Artificial Intelligence
2023-2025
Beijing Academy of Social Sciences
2024
Beijing University of Posts and Telecommunications
2021-2023
Huawei Technologies (China)
2023
Hainan Agricultural School
2022
Microsoft Research (United Kingdom)
2021
We introduce C-Pack, a package of resources that significantly advances the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with...
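As a rough illustration of how a C-TEM-style Chinese embedding model is used downstream, here is a minimal sketch assuming the sentence-transformers library and the publicly released "BAAI/bge-large-zh-v1.5" checkpoint; the model name and usage pattern are assumptions for illustration, not details taken from the abstract above.

```python
# Minimal sketch: encode Chinese queries and passages, then score by cosine similarity.
# Assumes the sentence-transformers package and the "BAAI/bge-large-zh-v1.5" checkpoint.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

queries = ["如何修改无线路由器的密码"]
passages = ["登录路由器管理页面后可以在设置中修改密码", "今天的天气非常好"]

# normalize_embeddings=True makes cosine similarity equivalent to a dot product
q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

scores = util.cos_sim(q_emb, p_emb)  # shape: (num_queries, num_passages)
print(scores)
```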
In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process...
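A hedged sketch of exercising the three retrieval functionalities in a single encoding pass, assuming the FlagEmbedding package and its BGEM3FlagModel wrapper; the exact flag names and output keys may differ between library versions.

```python
# Sketch: one forward pass yields dense, sparse, and multi-vector representations.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = [
    "What is M3-Embedding?",
    "M3-Embedding is a multi-lingual, multi-functional embedding model.",
]
out = model.encode(
    sentences,
    return_dense=True,         # one vector per sentence (dense retrieval)
    return_sparse=True,        # per-token lexical weights (sparse retrieval)
    return_colbert_vecs=True,  # per-token vectors (multi-vector retrieval)
)

print(out["dense_vecs"].shape)      # (2, hidden_dim)
print(out["lexical_weights"][0])    # {token_id: weight, ...}
print(out["colbert_vecs"][0].shape)
```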
Despite pre-training’s progress in many important NLP tasks, it remains to explore effective pre-training strategies for dense retrieval. In this paper, we propose RetroMAE, a new retrieval-oriented pre-training paradigm based on the Masked Auto-Encoder (MAE). RetroMAE is highlighted by three critical designs. 1) A novel MAE workflow, where the input sentence is polluted for the encoder and the decoder with different masks. The sentence embedding is generated from the encoder’s masked input; then, the original sentence is recovered from the decoder’s masked input, conditioned on the sentence embedding, via masked language...
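The asymmetric masking workflow can be illustrated with a toy sketch; everything below (the mask ratios, the [MASK] token id, and the encoder/decoder placeholders) is assumed for illustration rather than taken from the paper.

```python
# Toy sketch of asymmetric masking: the encoder sees a lightly masked sentence and
# yields one embedding; the decoder sees an aggressively masked copy plus that
# embedding and must reconstruct the original tokens.
import torch

def make_mask(seq_len: int, ratio: float) -> torch.Tensor:
    """Boolean mask: True marks positions to replace with [MASK]."""
    return torch.rand(seq_len) < ratio

tokens = torch.randint(0, 30_000, (128,))      # a fake tokenized sentence
enc_mask = make_mask(len(tokens), ratio=0.30)  # light masking for the encoder
dec_mask = make_mask(len(tokens), ratio=0.70)  # aggressive masking for the decoder

enc_input = tokens.masked_fill(enc_mask, 103)  # 103 ~ [MASK] id in a BERT-style vocab
dec_input = tokens.masked_fill(dec_mask, 103)

# In the real model: sent_emb = encoder(enc_input); logits = decoder(sent_emb, dec_input)
# and the loss is masked language modeling over the positions hidden from the decoder.
```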
Representation learning on textual graphs aims to generate low-dimensional embeddings for the nodes based on their individual textual features and neighbourhood information. Recent breakthroughs in pretrained language models and graph neural networks push forward the development of corresponding techniques. The existing works mainly rely on a cascaded model architecture: the textual features of nodes are independently encoded by language models at first, and the textual embeddings are aggregated by graph neural networks afterwards. However, the above architecture is limited due to the independent modeling of textual features. In this work, we propose...
News recommendation calls for deep insights into news articles' underlying semantics. Therefore, pretrained language models (PLMs), like BERT and RoBERTa, may substantially contribute to the recommendation quality. However, it is extremely challenging to have news recommenders trained together with such big models: the learning of news recommenders requires intensive news encoding operations, whose cost is prohibitive if PLMs are used as the news encoder. In this paper, we propose a novel framework, SpeedyFeed, which efficiently trains PLM-based news recommenders of superior...
Large language models (LLMs) face significant challenges stemming from their inherent limitations in knowledge, memory, alignment, and action. These limitations cannot be addressed by LLMs alone, but should rely on assistance from the external world, such as knowledge bases, memory stores, demonstration examples, and tools. Retrieval augmentation stands as a vital mechanism for bridging the gap between LLMs and such external assistance. However, conventional methods encounter two pressing issues. On one hand, general-purpose retrievers are not...
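To make the retrieval-augmentation pattern concrete, here is a minimal retrieve-then-prompt sketch; the embed and generate callables are hypothetical placeholders, and only the overall pattern (not any specific model or API) is meant to reflect the abstract.

```python
# Minimal retrieval-augmented generation sketch: retrieve supporting documents by
# embedding similarity, then feed them to the LLM as context.
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, docs: list[str], k: int = 3) -> list[str]:
    """Return the k documents whose embeddings are most similar to the query."""
    scores = doc_vecs @ query_vec            # cosine similarity if vectors are normalized
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]

def answer(question: str, embed, generate, docs, doc_vecs) -> str:
    # embed() and generate() are placeholders for an embedding model and an LLM client
    context = "\n".join(retrieve(embed(question), doc_vecs, docs))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return generate(prompt)
```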
Large language models (LLMs) provide powerful foundations for fine-grained text re-ranking. However, they are often prohibitive in reality due to constraints on computation bandwidth. In this work, we propose a flexible architecture called Matroyshka Re-Ranker, which is designed to facilitate runtime customization of model layers and sequence lengths at each layer based on users' configurations. Consequently, the LLM-based re-rankers can be made applicable across...
Modern large language models (LLMs), driven by scaling laws, achieve emergent intelligence at larger model sizes. Recently, growing concerns about cloud costs, latency, and privacy have made it an urgent requirement to develop compact edge models. Distinguished from direct pretraining, which is bounded by the scaling law, this work proposes pruning-aware pretraining, focusing on retaining the performance of a much larger, optimized model. It features the following characteristics: 1) Data-scalable: we introduce minimal parameter...
Text embeddings are typically evaluated on a limited set of tasks, constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code...
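A hedged sketch of running an MTEB-style evaluation with the mteb Python package; the model checkpoint and task names below are arbitrary examples chosen for illustration, not a prescribed MMTEB subset.

```python
# Sketch: evaluate a sentence-transformers model on a couple of benchmark tasks.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

# Pick whichever tasks matter for your use case; results are written to disk.
evaluation = MTEB(tasks=["Banking77Classification", "STS22"])
results = evaluation.run(model, output_folder="results/multilingual-e5-small")
```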
Vector quantization (VQ) based ANN indexes, such as the Inverted File System (IVF) and Product Quantization (PQ), have been widely applied to embedding-based document retrieval thanks to their competitive time and memory efficiency. Originally, VQ is learned to minimize the reconstruction loss, i.e., the distortions between the original dense embeddings and the embeddings reconstructed after quantization. Unfortunately, such an objective is inconsistent with the goal of selecting ground-truth documents for the input query, which may cause a severe loss of retrieval quality...
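The reconstruction-driven baseline described above corresponds to a standard IVF-PQ index; a minimal sketch with the faiss library (all parameters are illustrative) might look as follows.

```python
# Sketch of a reconstruction-trained IVF-PQ index: IVF clusters the embeddings,
# PQ compresses them, and both are fit purely to approximate the original vectors.
import faiss
import numpy as np

d, nlist, m = 128, 1024, 16                          # dim, #IVF clusters, #PQ sub-vectors
xb = np.random.rand(100_000, d).astype("float32")    # document embeddings
xq = np.random.rand(10, d).astype("float32")         # query embeddings

quantizer = faiss.IndexFlatL2(d)                     # coarse quantizer for IVF
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per PQ code

index.train(xb)        # k-means / PQ codebooks learned with a reconstruction objective
index.add(xb)
index.nprobe = 32      # number of clusters probed per query

distances, ids = index.search(xq, 10)
```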
Embedding-based retrieval (EBR) is a fundamental building block in many web applications. However, EBR in sponsored search is distinguished from other generic scenarios and technically challenging due to the need of serving multiple purposes: firstly, it has to retrieve high-relevance ads, which may exactly serve the user's intent; secondly, it needs to retrieve high-CTR ads so as to maximize the overall user clicks. In this paper, we present a novel representation learning framework, Uni-Retriever, developed for Bing Search,...
We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is super efficient, taking 8 hours on one 8xA800 (80G) GPU machine. The resulting model exhibits superior performances across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding; meanwhile, it also well preserves the original capability over short contexts. The dramatic extension is mainly attributed to merely 3.5K synthetic samples...
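A hedged sketch of a QLoRA fine-tuning setup in this spirit, assuming the transformers, peft, and bitsandbytes packages; the LoRA rank, target modules, and other settings are illustrative rather than the paper's exact recipe.

```python
# Sketch: load the base model in 4-bit and attach LoRA adapters for fine-tuning.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)

lora = LoraConfig(
    r=32, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Long-context training data (e.g. synthetic 80K-token samples) would then be fed to a
# standard Trainer; RoPE scaling in the model config handles inputs beyond the 8K base.
```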
Ad-hoc search calls for the selection of appropriate answers from a massive-scale corpus. Nowadays, embedding-based retrieval (EBR) becomes a promising solution, where deep learning based document representation and ANN search techniques are allied to handle this task. However, a major challenge is that the ANN index can be too large to fit into memory, given the considerable size of the answer corpus. In this work, we tackle this problem with Bi-Granular Document Representation, where lightweight sparse embeddings are indexed and standby in memory...
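A toy sketch of the bi-granular layout, assuming numpy and hypothetical embedding files: compact embeddings stay in RAM for coarse candidate search, while the full dense embeddings are memory-mapped from disk and touched only for the shortlisted candidates.

```python
# Sketch: two-stage search over a lightweight in-memory index and an on-disk dense store.
import numpy as np

light = np.load("light_embeddings.npy")                 # small representation, lives in memory
dense = np.load("dense_embeddings.npy", mmap_mode="r")  # full representation, stays on disk

def search(query_light: np.ndarray, query_dense: np.ndarray, k: int = 10, shortlist: int = 1000):
    # Stage 1: coarse-grained candidate search with the in-memory representation
    cand = np.argsort(-(light @ query_light))[:shortlist]
    # Stage 2: fine-grained post-verification, reading from disk only for candidates
    rescored = np.asarray(dense[cand]) @ query_dense
    return cand[np.argsort(-rescored)[:k]]
```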
To better support information retrieval tasks such as web search and open-domain question answering, growing effort is made to develop retrieval-oriented language models, e.g., RetroMAE and many others. Most of the existing works focus on improving the semantic representation capability of the contextualized embedding from the [CLS] token. However, recent study shows that the ordinary tokens besides [CLS] may provide extra information, which helps to produce a better representation effect. As such, it is necessary to extend the current methods where all...
The recent advancements in large language models (LLMs) with billions of parameters have significantly boosted their performance across various real-world applications. However, the inference processes for these models require substantial energy and computational resources, presenting considerable deployment challenges. In contrast, human brains, which contain approximately 86 billion biological neurons, exhibit significantly greater energy efficiency than LLMs with a similar number of parameters. Inspired by this, we...
Product quantization (PQ) is a widely used technique for ad-hoc retrieval. Recent studies propose supervised PQ, where the embedding and quantization models can be jointly trained with supervised learning. However, there is a lack of an appropriate formulation of the joint training objective; thus, the improvements over previous non-supervised baselines are limited in reality. In this work, we propose Matching-oriented Product Quantization (MoPQ), where a novel objective, the Multinoulli Contrastive Loss (MCL), is formulated. With the minimization of MCL, we are able to maximize...
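A minimal sketch of a matching-oriented contrastive objective in the spirit of MCL, assuming PyTorch and in-batch negatives; codebook construction and straight-through estimation are deliberately omitted, so this is a simplified stand-in rather than the paper's exact loss.

```python
# Sketch: each query's quantized positive document competes with the other in-batch
# documents, so minimizing the loss maximizes matching probability rather than
# minimizing reconstruction error.
import torch
import torch.nn.functional as F

def matching_contrastive_loss(q: torch.Tensor, d_quantized: torch.Tensor, tau: float = 0.05):
    """q, d_quantized: (batch, dim) query embeddings and quantized document embeddings."""
    logits = (q @ d_quantized.T) / tau               # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)           # softmax over in-batch candidates
```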
The inverted file structure is a common technique for accelerating dense retrieval. It clusters documents based on their embeddings; during searching, it probes the nearby clusters w.r.t. an input query and only evaluates the documents within them by subsequent codecs, thus avoiding the expensive cost of exhaustive traversal. However, the clustering is always lossy, which results in the miss of relevant documents from the probed clusters and hence degrades retrieval quality. In contrast, lexical matching, such as the overlap of salient terms, tends to be a strong...
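A toy sketch of how lexical matching could complement lossy cluster probing, written in plain Python with hypothetical data structures; it illustrates the motivation stated above rather than the paper's actual method.

```python
# Sketch: keep documents from probed clusters, and additionally rescue documents that
# share enough salient terms with the query even if their cluster was not probed.
def hybrid_candidates(query_terms: set, probed_cluster_ids: set,
                      doc_cluster: dict, doc_terms: dict, min_overlap: int = 2) -> set:
    """doc_cluster: doc_id -> cluster_id; doc_terms: doc_id -> set of salient terms."""
    candidates = set()
    for doc_id, cluster_id in doc_cluster.items():
        if cluster_id in probed_cluster_ids:
            candidates.add(doc_id)                        # embedding route (IVF probe)
        elif len(query_terms & doc_terms[doc_id]) >= min_overlap:
            candidates.add(doc_id)                        # lexical route rescues missed docs
    return candidates
```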