- Topic Modeling
- Natural Language Processing Techniques
- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Quantum Computing Algorithms and Architecture
- Advanced Image and Video Retrieval Techniques
- Cryptography and Data Security
- Quantum Information and Cryptography
- Text and Document Classification Technologies
- Advanced Neural Network Applications
- Information Retrieval and Search Behavior
- Organic Electronics and Photovoltaics
- Cultural Heritage Materials Analysis
- Flood Risk Assessment and Management
- Expert Finding and Q&A Systems
- Genomics and Phylogenetic Studies
- Pigment Synthesis and Properties
- Bayesian Modeling and Causal Inference
- Biometric Identification and Security
- Quantum-Dot Cellular Automata
- Vitamin D Research Studies
- Data Management and Algorithms
- Neural Networks and Applications
- Urban Heat Island Mitigation
- Explainable Artificial Intelligence (XAI)
University of Waterloo
2021-2024
Dalian University of Technology
2023
Hefei National Center for Physical Sciences at Nanoscale
2021-2023
University of Science and Technology of China
2018-2023
Beijing Academy of Quantum Information Sciences
2023
University of Washington
2023
Liaoning Normal University
2023
Ningbo University
2020
Michigan State University
2019
Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue was due to the limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size. In particular, we systematically examine of under...
Pre-trained language models have been successful in many knowledge-intensive NLP tasks. However, recent work has shown that models such as BERT are not "structurally ready" to aggregate textual information into a [CLS] vector for dense passage retrieval (DPR). This "lack of readiness" results from the gap between language model pre-training and DPR fine-tuning. Previous solutions call for computationally expensive techniques such as hard negative mining, cross-encoder distillation, and further pre-training to learn a robust DPR model....
We report an experimental study of device-independent quantum random number generation based on a detection-loophole-free Bell test with entangled photons. After considering statistical fluctuations and applying an 80 Gb × 45.6 Mb Toeplitz matrix hashing, we achieve a final bit rate of 114 bits/s, with a failure probability less than 10^-5.
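The Toeplitz matrix hashing mentioned above can be illustrated with a toy sketch. It assumes the usual construction in which a binary Toeplitz matrix, defined by a uniform seed, is multiplied by the raw bits over GF(2); the function name, the tiny sizes, and the example bits are hypothetical and are not taken from the experiment. Implementations at the scale quoted in the abstract would typically use block-wise or FFT-based multiplication rather than this direct loop.

```python
from typing import List


def toeplitz_hash(bits: List[int], seed: List[int], out_len: int) -> List[int]:
    """Multiply an out_len x len(bits) binary Toeplitz matrix by `bits` over GF(2).

    The matrix is defined by `seed`, which must hold len(bits) + out_len - 1 bits.
    Entry (i, j) is seed[i - j + len(bits) - 1], so the matrix is constant
    along each diagonal.
    """
    n = len(bits)
    if out_len < 1 or len(seed) != n + out_len - 1:
        raise ValueError("seed must have len(bits) + out_len - 1 entries")
    out = []
    for i in range(out_len):
        acc = 0
        for j in range(n):
            # AND is multiplication and XOR is addition in GF(2).
            acc ^= seed[i - j + n - 1] & bits[j]
        out.append(acc)
    return out


raw = [1, 0, 1, 1]      # hypothetical raw bits (illustrative only)
seed = [1, 0, 0, 1, 1]  # hypothetical seed of 4 + 2 - 1 bits
extracted = toeplitz_hash(raw, seed, out_len=2)
print(extracted)
```

Because the map is linear over GF(2), hashing the XOR of two inputs gives the XOR of their hashes, which is a convenient sanity check for any implementation.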
Recent work has shown that dense passage retrieval techniques achieve better ranking accuracy in open-domain question answering compared to sparse techniques such as BM25, but at the cost of large space and memory requirements. In this paper, we analyze the redundancy present in the encoded vectors and show that the default dimension of 768 is unnecessarily large. To improve efficiency, we propose a simple unsupervised compression pipeline that consists of principal component analysis (PCA), product quantization, and hybrid search. We further...
Minghan Li, Sheng-Chieh Lin, Barlas Oguz, Asish Ghoshal, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, Xilun Chen. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
This paper introduces Sparsified Late Interaction for Multi-vector (SLIM) retrieval with inverted indexes. Multi-vector retrieval methods have demonstrated their effectiveness on various datasets, and among them, ColBERT is the most established method based on the late interaction of contextualized token embeddings of pre-trained language models. However, efficient ColBERT implementations require complex engineering and cannot take advantage of off-the-shelf search libraries, impeding their practical use. To address this issue, SLIM first maps...
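Late interaction over token embeddings, as referred to above, is commonly scored with a ColBERT-style MaxSim operator: each query token takes its best-matching document token, and the per-token maxima are summed. The sketch below illustrates that general operator only; it is not SLIM's sparsified variant, and the function names and toy vectors are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs: Sequence[Vector],
                           doc_vecs: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the maximum similarity to any document token."""
    if not doc_vecs:
        raise ValueError("document must contain at least one token vector")
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical 2-dimensional token embeddings, for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0]]

score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
print(score_a, score_b)
```

Scoring every query token against every document token is what makes dense multi-vector retrieval costly, which is the practical obstacle the abstract describes.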
Sodium p-perfluorous nonenoxybenzenesulfonate (OBS) is a novel alternative to perfluorooctane sulfonate (PFOS), with environmental health risks largely unknown. The present study aims to unravel the adipogenesis effects and underlying molecular initiating events of OBS, which are crucial for understanding and predicting its adverse outcome. In undifferentiated human mesenchymal stem cells (hMSCs), exposure to 1-100 nM OBS for 7 days stimulated reactive oxygen species production. subsequent multipotent...
Query expansion has been widely used to improve the search results of first-stage retrievers, yet its influence on second-stage, cross-encoder rankers remains under-explored. A recent work by Weller et al. [44] shows that current expansion techniques benefit weaker models such as DPR and BM25 but harm stronger rankers such as MonoT5. In this paper, we re-examine this conclusion and raise the following question: Can query expansion improve the generalization of strong cross-encoder rankers? To answer this question, we first apply popular query expansion methods to state-of-the-art cross-encoder rankers and verify...
Multi-task dense retrieval models can be used to retrieve documents from a common corpus (e.g., Wikipedia) for different open-domain question-answering (QA) tasks. However, (CITATION) shows that jointly learning different QA tasks with one model is not always beneficial due to corpus inconsistency. For example, SQuAD only focuses on a small set of Wikipedia articles while datasets like NQ and Trivia cover more entries, and joint training on their union can cause performance degradation. To solve this problem, we propose...
Zero-knowledge proof (ZKP) is a fundamental cryptographic primitive that allows a prover to convince a verifier of the validity of a statement without leaking any further information. As an efficient variant of ZKP, noninteractive zero-knowledge proof (NIZKP) adopting the Fiat-Shamir heuristic is essential to a wide spectrum of applications, such as federated learning, blockchain, and social networks. However, the heuristic is typically built upon the random oracle model, which makes ideal assumptions about hash functions that do not hold in...
Recently, a few open-vocabulary methods have been proposed by employing a unified architecture to tackle generic segmentation and detection tasks. However, their performance still lags behind the task-specific models due to the conflict between different tasks, and their open-vocabulary capability is limited by the inadequate use of CLIP. To address these challenges, we present a universal transformer-based framework, abbreviated as OpenSD, which utilizes the same network parameters to handle open-vocabulary segmentation and detection tasks. First, we introduce a decoder decoupled learning...
The bi-encoder design of the dense passage retriever (DPR) is a key factor in its success in open-domain question answering (QA), yet it is unclear how DPR's question encoder and passage encoder individually contribute to overall performance, which we refer to as the attribution problem. The problem is important as it helps us identify the factors that affect individual encoders to further improve overall performance. In this paper, we formulate our analysis under a probabilistic framework called marginalization, where we quantify the contribution of a single encoder by...
Query expansion is an effective approach for mitigating vocabulary mismatch between queries and documents in information retrieval. One recent line of research uses language models to generate query-related contexts for expansion. Along this line, we argue that expansion terms from these contexts should balance two key aspects: diversity and relevance. The obvious way to increase diversity is to sample multiple contexts from the language model. However, this comes at the cost of relevance, because there is a well-known tendency of models to hallucinate incorrect or irrelevant contexts....
In information retrieval (IR), candidate set pruning has been commonly used to speed up two-stage relevance ranking. However, such an approach lacks accurate error control and often trades accuracy against computational efficiency in an empirical fashion, missing theoretical guarantees. In this paper, we propose the concept of certified error control of candidate set pruning for relevance ranking, which means that the test error after pruning is guaranteed to be controlled under a user-specified threshold with high probability. Both in-domain and out-of-domain...
In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, prone to errors, and incurs information loss. To this end, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that regards document screenshots as a unified input format, which does not require any preprocessing and preserves all the information in a document (e.g., text, image, and layout). DSE leverages...
Recent progress in information retrieval finds that embedding query and document representations into multi-vector form yields a robust bi-encoder retriever on out-of-distribution datasets. In this paper, we explore whether late interaction, the simplest form of multi-vector, is also helpful to neural rerankers that only use the [CLS] vector to compute the similarity score. Although intuitively, the attention mechanism at the previous layers already gathers the token-level information, we find that adding late interaction still brings an...
This paper analyzes the role of statistical methods in the composition analysis and identification of ancient glass products, and emphasizes four methods: the systematic clustering algorithm, the K-means clustering algorithm, the logistic regression model, and grey correlation analysis. Taking the C project of CUMCM 2022 as an example, this paper systematically introduces these common data classification methods to classify and analyze the given data. In this paper, suitable chemical components of high-potassium and lead-barium glass were selected for...
One key feature of dense passage retrievers (DPR) is the use of separate question and passage encoders in a bi-encoder design. Previous work on the generalization of DPR mainly focuses on testing both encoders in tandem on out-of-distribution (OOD) question-answering (QA) tasks, which is also known as domain adaptation. However, it is still unknown how DPR's individual question/passage encoder affects generalization. Specifically, in this paper, we want to know how an in-distribution (IND) question/passage encoder would generalize if paired with an OOD passage/question...