NFDI4DS | UHH-SEMS - Publication Details

How to Train Your Dragon: Diverse Augmentation Towards Generalizable Dense Retrieval

OPENALEX - Publications

Sheng-Chieh Lin Akari Asai Minghan Li Barlas Oğuz Jimmy Lin and 3 more

Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue was due to the limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both without increasing model size. In particular, we systematically examine of under...

10.18653/v1/2023.findings-emnlp.423 article EN cc-by 2023-01-01

Predicting bioretention pollutant removal efficiency with design features: A data-driven approach

Runzi Wang Xuewen Zhang Minghan Li

10.1016/j.jenvman.2019.04.064 article EN Journal of Environmental Management 2019-05-03

Aggretriever: A Simple Approach to Aggregate Textual Representations for Robust Dense Passage Retrieval

Sheng-Chieh Lin Minghan Li Jimmy Lin

Pre-trained language models have been successful in many knowledge-intensive NLP tasks. However, recent work has shown that models such as BERT are not “structurally ready” to aggregate textual information into a [CLS] vector for dense passage retrieval (DPR). This “lack of readiness” results from the gap between language model pre-training and DPR fine-tuning. Previous solutions call for computationally expensive techniques such as hard negative mining, cross-encoder distillation, and further pre-training to learn a robust DPR model....

10.1162/tacl_a_00556 article EN cc-by Transactions of the Association for Computational Linguistics 2023-01-01

Multi-Function Integrated Optic Chip for Miniaturized Resonant Fiber Optic Gyroscope

Heliang Shen Lei Zhang Minghan Li Fei Huang Xuan She and 4 more

10.1016/j.optcom.2025.131774 article EN Optics Communications 2025-03-01

High speed device-independent quantum random number generation without detection loophole

Yang Liu Xiao Yuan Minghan Li Weijun Zhang Qi Zhao and 15 more

We report an experimental study of device-independent quantum random number generation based on a detection-loophole-free Bell test with entangled photons. After considering statistical fluctuations and applying 80 Gb × 45.6 Mb Toeplitz matrix hashing, we achieve a final bit rate of 114 bits/s, with a failure probability of less than 10⁻⁵.

10.1364/cleo_qels.2018.ftu4a.4 article EN Conference on Lasers and Electro-Optics 2018-01-01

Simple and Effective Unsupervised Redundancy Elimination to Compress Dense Vectors for Passage Retrieval

Xueguang Ma Minghan Li Kai Sun Ji Xin Jimmy Lin

Recent work has shown that dense passage retrieval techniques achieve better ranking accuracy in open-domain question answering compared to sparse retrieval techniques such as BM25, but at the cost of large space and memory requirements. In this paper, we analyze the redundancy present in encoded dense vectors and show that the default dimension of 768 is unnecessarily large. To improve space efficiency, we propose a simple unsupervised compression pipeline that consists of principal component analysis (PCA), product quantization, and hybrid search. We further...

10.18653/v1/2021.emnlp-main.227 article EN cc-by Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021-01-01

CITADEL: Conditional Token Interaction via Dynamic Lexical Routing for Efficient and Effective Multi-Vector Retrieval

Minghan Li Sheng-Chieh Lin Barlas Oğuz Asish Ghoshal Jimmy Lin and 3 more

Minghan Li, Sheng-Chieh Lin, Barlas Oğuz, Asish Ghoshal, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, Xilun Chen. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.

10.18653/v1/2023.acl-long.663 article EN cc-by 2023-01-01

SLIM: Sparsified Late Interaction for Multi-Vector Retrieval with Inverted Indexes

Minghan Li Sheng-Chieh Lin Xueguang Ma Jimmy Lin

This paper introduces Sparsified Late Interaction for Multi-vector (SLIM) retrieval with inverted indexes. Multi-vector retrieval methods have demonstrated their effectiveness on various retrieval datasets, and among them, ColBERT is the most established method based on the late interaction of contextualized token embeddings of pre-trained language models. However, efficient ColBERT implementations require complex engineering and cannot take advantage of off-the-shelf search libraries, impeding their practical use. To address this issue, SLIM first maps...

10.1145/3539618.3591977 article EN Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval 2023-07-18

Adipogenic and osteogenic effects of OBS and synergistic action with PFOS via PPARγ–RXRα heterodimers

Hui Qin Yueming Lang Yiteng Wang Wei Cui Yuxin Niu and 6 more

Sodium p-perfluorous nonenoxybenzenesulfonate (OBS) is a novel alternative to perfluorooctane sulfonate (PFOS), with environmental health risks largely unknown. The present study aims to unravel the adipogenesis effects and underlying molecular initiating events of OBS, which are crucial for understanding and predicting its adverse outcome. In undifferentiated human mesenchymal stem cells (hMSCs), exposure to 1-100 nM OBS for 7 days stimulated reactive oxygen species production. subsequent multipotent...

10.1016/j.envint.2023.108354 article EN cc-by-nc-nd Environment International 2023-11-25

Can Query Expansion Improve Generalization of Strong Cross-Encoder Rankers?

Minghan Li Honglei Zhuang Kai Hui Zhen Qin Jimmy Lin and 3 more

Query expansion has been widely used to improve the search results of first-stage retrievers, yet its influence on second-stage, cross-encoder rankers remains under-explored. A recent work by Weller et al. [44] shows that current expansion techniques benefit weaker models such as DPR and BM25 but harm stronger rankers such as MonoT5. In this paper, we re-examine this conclusion and raise the following question: Can query expansion improve generalization of strong cross-encoder rankers? To answer this question, we first apply popular query expansion methods to state-of-the-art cross-encoder rankers and verify...

10.1145/3626772.3657979 article EN Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval 2024-07-10

Multi-Task Dense Retrieval via Model Uncertainty Fusion for Open-Domain Question Answering

Minghan Li Ming Li Kun Xiong Jimmy Lin

Multi-task dense retrieval models can be used to retrieve documents from a common corpus (e.g., Wikipedia) for different open-domain question-answering (QA) tasks. However, (CITATION) shows that jointly learning different QA tasks with one dense model is not always beneficial due to corpus inconsistency. For example, SQuAD only focuses on a small set of Wikipedia articles while datasets like NQ and Trivia cover more entries, and joint training on their union can cause performance degradation. To solve this problem, we propose...

10.18653/v1/2021.findings-emnlp.26 article EN cc-by 2021-01-01

Device-independent quantum randomness–enhanced zero-knowledge proof

Cheng-Long Li Kaiyi Zhang Xingjian Zhang Kui-Xing Yang Yu Han and 13 more

Zero-knowledge proof (ZKP) is a fundamental cryptographic primitive that allows a prover to convince a verifier of the validity of a statement without leaking any further information. As an efficient variant of ZKP, noninteractive zero-knowledge proof (NIZKP) adopting the Fiat-Shamir heuristic is essential to a wide spectrum of applications, such as federated learning, blockchain, and social networks. However, the heuristic is typically built upon the random oracle model that makes ideal assumptions about hash functions, which does not hold in...

10.1073/pnas.2205463120 article EN cc-by-nc-nd Proceedings of the National Academy of Sciences 2023-11-02

OpenSD: Unified Open-Vocabulary Segmentation and Detection

Shuai Li Minghan Li Pengfei Wang Lei Zhang

Recently, a few open-vocabulary methods have been proposed by employing a unified architecture to tackle generic segmentation and detection tasks. However, their performance still lags behind the task-specific models due to the conflict between different tasks, and their open-vocabulary capability is limited by the inadequate use of CLIP. To address these challenges, we present a universal transformer-based framework, abbreviated as OpenSD, which utilizes the same network parameters to handle open-vocabulary segmentation and detection tasks. First, we introduce a decoder decoupled learning...

10.48550/arxiv.2312.06703 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Performance improvement of perovskite solar cells via spiro-OMeTAD pre-crystallization

Minghan Li Yanyan Wang Haoyuan Xu Houcheng Zhang Jing Zhang and 2 more

10.1007/s10853-020-04896-w article EN Journal of Materials Science 2020-06-12

How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval

Sheng-Chieh Lin Akari Asai Minghan Li Barlas Oğuz Jimmy Lin and 3 more

Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue was due to the limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both without increasing model size. In particular, we systematically examine of under...

10.48550/arxiv.2302.07452 preprint EN other-oa arXiv (Cornell University) 2023-01-01

An Encoder Attribution Analysis for Dense Passage Retriever in Open-Domain Question Answering

Minghan Li Xueguang Ma Jimmy Lin

The bi-encoder design of dense passage retriever (DPR) is a key factor to its success in open-domain question answering (QA), yet it is unclear how DPR's question encoder and passage encoder individually contribute to overall performance, which we refer to as the encoder attribution problem. The problem is important as it helps us identify the factors that affect individual encoders to further improve overall performance. In this paper, we formulate our analysis under a probabilistic framework called encoder marginalization, where we quantify the contribution of a single encoder by...

10.18653/v1/2022.trustnlp-1.1 article EN cc-by 2022-01-01

Aggretriever: A Simple Approach to Aggregate Textual Representations for Robust Dense Passage Retrieval

Sheng-Chieh Lin Minghan Li Jimmy Lin

Pre-trained language models have been successful in many knowledge-intensive NLP tasks. However, recent work has shown that models such as BERT are not “structurally ready” to aggregate textual information into a [CLS] vector for dense passage retrieval (DPR). This “lack of readiness” results from the gap between language model pre-training and DPR fine-tuning. Previous solutions call for computationally expensive techniques such as hard negative mining, cross-encoder distillation, and further pre-training to learn a robust DPR model. In...

10.48550/arxiv.2208.00511 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Query Expansion Using Contextual Clue Sampling with Language Models

Linqing Liu Minghan Li Jimmy Lin Sebastian Riedel Pontus Stenetorp

Query expansion is an effective approach for mitigating vocabulary mismatch between queries and documents in information retrieval. One recent line of research uses language models to generate query-related contexts for expansion. Along this line, we argue that expansion terms from these contexts should balance two key aspects: diversity and relevance. The obvious way to increase diversity is to sample multiple contexts from the language model. However, this comes at the cost of relevance, because there is a well-known tendency of models to hallucinate incorrect or irrelevant contexts....

10.48550/arxiv.2210.07093 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Certified Error Control of Candidate Set Pruning for Two-Stage Relevance Ranking

Minghan Li Xinyu Zhang Xin Ji Hongyang Zhang Jimmy Lin

In information retrieval (IR), candidate set pruning has been commonly used to speed up two-stage relevance ranking. However, such an approach lacks accurate error control and often trades accuracy against computational efficiency in an empirical fashion, missing theoretical guarantees. In this paper, we propose the concept of certified error control of candidate set pruning for relevance ranking, which means that the test error after pruning is guaranteed to be controlled under a user-specified threshold with high probability. Both in-domain and out-of-domain...

10.18653/v1/2022.emnlp-main.23 article EN cc-by Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 2022-01-01

Unifying Multimodal Retrieval via Document Screenshot Embedding

Xueguang Ma Sheng-Chieh Lin Minghan Li Wenhu Chen Jimmy Lin

In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, prone to errors, and has information loss. To this end, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that regards document screenshots as a unified input format, which does not require any content extraction preprocess and preserves all the information in a document (e.g., text, image, and layout). DSE leverages...

10.48550/arxiv.2406.11251 preprint EN arXiv (Cornell University) 2024-06-17

SLIM: Sparsified Late Interaction for Multi-Vector Retrieval with Inverted Indexes

Minghan Li Sheng-Chieh Lin Xueguang Ma Jimmy Lin

This paper introduces Sparsified Late Interaction for Multi-vector (SLIM) retrieval with inverted indexes. Multi-vector retrieval methods have demonstrated their effectiveness on various retrieval datasets, and among them, ColBERT is the most established method based on the late interaction of contextualized token embeddings of pre-trained language models. However, efficient ColBERT implementations require complex engineering and cannot take advantage of off-the-shelf search libraries, impeding their practical use. To address this issue, SLIM first maps...

10.48550/arxiv.2302.06587 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Improving Out-of-Distribution Generalization of Neural Rerankers with Contextualized Late Interaction

Xinyu Zhang Minghan Li Jimmy Lin

Recent progress in information retrieval finds that embedding query and document representation into multi-vector yields a robust bi-encoder retriever on out-of-distribution datasets. In this paper, we explore whether late interaction, the simplest form of multi-vector, is also helpful to neural rerankers that only use the [CLS] vector to compute the similarity score. Although intuitively, the attention mechanism at the previous layers already gathers the token-level information, we find adding late interaction still brings an...

10.48550/arxiv.2302.06589 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Component Analysis and Identification of Ancient Glass Products Based on Statistical Methods

Kerui Wu Minghan Li Hongyi Ren

This paper analyzes the role of statistical methods in the composition analysis and identification of ancient glass products, and emphasizes four methods: systematic clustering algorithm, K-means algorithm, logistic regression model, and grey correlation analysis. Taking the C project of CUMCM 2022 as an example, this paper systematically introduces these common data classification methods to classify and analyze the given data. In this paper, suitable chemical components of high potassium and lead barium were selected for...

10.9734/ajpas/2023/v24i2518 article EN Asian Journal of Probability and Statistics 2023-08-24

Encoder Adaptation of Dense Passage Retrieval for Open-Domain Question Answering

Minghan Li Jimmy Lin

One key feature of dense passage retrievers (DPR) is the use of separate question and passage encoders in a bi-encoder design. Previous work on generalization of DPR mainly focuses on testing both encoders in tandem on out-of-distribution (OOD) question-answering (QA) tasks, which is also known as domain adaptation. However, it is still unknown how DPR's individual question/passage encoder affects generalization. Specifically, in this paper, we want to know how an in-distribution (IND) question/passage encoder would generalize if paired with an OOD passage/question...

10.48550/arxiv.2110.01599 preprint EN cc-by arXiv (Cornell University) 2021-01-01