Minghan Li

ORCID: 0009-0007-8972-7714
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Domain Adaptation and Few-Shot Learning
  • Multimodal Machine Learning Applications
  • Quantum Computing Algorithms and Architecture
  • Advanced Image and Video Retrieval Techniques
  • Cryptography and Data Security
  • Quantum Information and Cryptography
  • Text and Document Classification Technologies
  • Advanced Neural Network Applications
  • Information Retrieval and Search Behavior
  • Organic Electronics and Photovoltaics
  • Cultural Heritage Materials Analysis
  • Flood Risk Assessment and Management
  • Expert finding and Q&A systems
  • Genomics and Phylogenetic Studies
  • Pigment Synthesis and Properties
  • Bayesian Modeling and Causal Inference
  • Biometric Identification and Security
  • Quantum-Dot Cellular Automata
  • Vitamin D Research Studies
  • Data Management and Algorithms
  • Neural Networks and Applications
  • Urban Heat Island Mitigation
  • Explainable Artificial Intelligence (XAI)

University of Waterloo
2021-2024

Dalian University of Technology
2023

Hefei National Center for Physical Sciences at Nanoscale
2021-2023

University of Science and Technology of China
2018-2023

Beijing Academy of Quantum Information Sciences
2023

University of Washington
2023

Liaoning Normal University
2023

Ningbo University
2020

Michigan State University
2019

Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue was due to the limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size. In particular, we systematically examine of under...

10.18653/v1/2023.findings-emnlp.423 article EN cc-by 2023-01-01

Pre-trained language models have been successful in many knowledge-intensive NLP tasks. However, recent work has shown that models such as BERT are not “structurally ready” to aggregate textual information into a [CLS] vector for dense passage retrieval (DPR). This “lack of readiness” results from the gap between language model pre-training and DPR fine-tuning. Previous solutions call for computationally expensive techniques such as hard negative mining, cross-encoder distillation, and further pre-training to learn a robust DPR model....

10.1162/tacl_a_00556 article EN cc-by Transactions of the Association for Computational Linguistics 2023-01-01

We report an experimental study of device-independent quantum random number generation based on a detection-loophole-free Bell test with entangled photons. After considering statistical fluctuations and applying an 80 Gb × 45.6 Mb Toeplitz matrix hashing, we achieve a final bit rate of 114 bits/s, with a failure probability less than 10^-5.

10.1364/cleo_qels.2018.ftu4a.4 article EN Conference on Lasers and Electro-Optics 2018-01-01

Recent work has shown that dense passage retrieval techniques achieve better ranking accuracy in open-domain question answering compared to sparse techniques such as BM25, but at the cost of large space and memory requirements. In this paper, we analyze the redundancy present in the encoded vectors and show that the default dimension of 768 is unnecessarily large. To improve efficiency, we propose a simple unsupervised compression pipeline that consists of principal component analysis (PCA), product quantization, and hybrid search. We further...

10.18653/v1/2021.emnlp-main.227 article EN cc-by Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021-01-01

Minghan Li, Sheng-Chieh Lin, Barlas Oguz, Asish Ghoshal, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, Xilun Chen. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.

10.18653/v1/2023.acl-long.663 article EN cc-by 2023-01-01

This paper introduces Sparsified Late Interaction for Multi-vector (SLIM) retrieval with inverted indexes. Multi-vector retrieval methods have demonstrated their effectiveness on various datasets, and among them, ColBERT is the most established method based on the late interaction of contextualized token embeddings of pre-trained language models. However, efficient ColBERT implementations require complex engineering and cannot take advantage of off-the-shelf search libraries, impeding their practical use. To address this issue, SLIM first maps...

10.1145/3539618.3591977 article EN Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval 2023-07-18

Sodium p-perfluorous nonenoxybenzenesulfonate (OBS) is a novel alternative to perfluorooctane sulfonate (PFOS), with environmental health risks largely unknown. The present study aims to unravel the adipogenesis effects and underlying molecular initiating events of OBS, which are crucial for understanding and predicting its adverse outcome. In undifferentiated human mesenchymal stem cells (hMSCs), exposure to 1-100 nM OBS for 7 days stimulated reactive oxygen species production. subsequent multipotent...

10.1016/j.envint.2023.108354 article EN cc-by-nc-nd Environment International 2023-11-25

Query expansion has been widely used to improve the search results of first-stage retrievers, yet its influence on second-stage, cross-encoder rankers remains under-explored. A recent work of Weller et al. [44] shows that current expansion techniques benefit weaker models such as DPR and BM25 but harm stronger rankers such as MonoT5. In this paper, we re-examine this conclusion and raise the following question: Can query expansion improve the generalization of strong cross-encoder rankers? To answer this question, we first apply popular query expansion methods to state-of-the-art cross-encoder rankers and verify...

10.1145/3626772.3657979 article EN Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval 2024-07-10

Multi-task dense retrieval models can be used to retrieve documents from a common corpus (e.g., Wikipedia) for different open-domain question-answering (QA) tasks. However, (CITATION) shows that jointly learning different QA tasks with one model is not always beneficial due to inconsistency. For example, SQuAD only focuses on a small set of Wikipedia articles while datasets like NQ and Trivia cover more entries, and joint training on their union can cause performance degradation. To solve this problem, we propose...

10.18653/v1/2021.findings-emnlp.26 article EN cc-by 2021-01-01

Zero-knowledge proof (ZKP) is a fundamental cryptographic primitive that allows a prover to convince a verifier of the validity of a statement without leaking any further information. As an efficient variant of ZKP, noninteractive zero-knowledge proof (NIZKP) adopting the Fiat-Shamir heuristic is essential to a wide spectrum of applications, such as federated learning, blockchain, and social networks. However, the heuristic is typically built upon the random oracle model that makes ideal assumptions about hash functions, which does not hold in...

10.1073/pnas.2205463120 article EN cc-by-nc-nd Proceedings of the National Academy of Sciences 2023-11-02

Recently, a few open-vocabulary methods have been proposed by employing a unified architecture to tackle generic segmentation and detection tasks. However, their performance still lags behind the task-specific models due to the conflict between different tasks, and their open-vocabulary capability is limited by the inadequate use of CLIP. To address these challenges, we present a universal transformer-based framework, abbreviated as OpenSD, which utilizes the same network parameters to handle these tasks. First, we introduce a decoder decoupled learning...

10.48550/arxiv.2312.06703 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue was due to the limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size. In particular, we systematically examine of under...

10.48550/arxiv.2302.07452 preprint EN other-oa arXiv (Cornell University) 2023-01-01

The bi-encoder design of dense passage retriever (DPR) is a key factor to its success in open-domain question answering (QA), yet it is unclear how DPR's question encoder and passage encoder individually contribute to overall performance, which we refer to as the attribution problem. The problem is important as it helps us identify the factors that affect individual encoders to further improve overall performance. In this paper, we formulate our analysis under a probabilistic framework called marginalization, where we quantify the contribution of a single encoder by...

10.18653/v1/2022.trustnlp-1.1 article EN cc-by 2022-01-01

Pre-trained language models have been successful in many knowledge-intensive NLP tasks. However, recent work has shown that models such as BERT are not “structurally ready” to aggregate textual information into a [CLS] vector for dense passage retrieval (DPR). This “lack of readiness” results from the gap between language model pre-training and DPR fine-tuning. Previous solutions call for computationally expensive techniques such as hard negative mining, cross-encoder distillation, and further pre-training to learn a robust DPR model. In...

10.48550/arxiv.2208.00511 preprint EN other-oa arXiv (Cornell University) 2022-01-01

Query expansion is an effective approach for mitigating vocabulary mismatch between queries and documents in information retrieval. One recent line of research uses language models to generate query-related contexts for expansion. Along this line, we argue that expansion terms from these contexts should balance two key aspects: diversity and relevance. The obvious way to increase diversity is to sample multiple contexts from the language model. However, this comes at the cost of relevance, because there is a well-known tendency of models to hallucinate incorrect or irrelevant contexts....

10.48550/arxiv.2210.07093 preprint EN other-oa arXiv (Cornell University) 2022-01-01

In information retrieval (IR), candidate set pruning has been commonly used to speed up two-stage relevance ranking. However, such an approach lacks accurate error control and often trades accuracy against computational efficiency in an empirical fashion, missing theoretical guarantees. In this paper, we propose the concept of certified error control of candidate set pruning for relevance ranking, which means that the test error after pruning is guaranteed to be controlled under a user-specified threshold with high probability. Both in-domain and out-of-domain...

10.18653/v1/2022.emnlp-main.23 article EN cc-by Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2022-01-01

In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, prone to errors, and has information loss. To this end, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that regards document screenshots as a unified input format, which does not require any content extraction preprocess and preserves all the information in a document (e.g., text, image and layout). DSE leverages...

10.48550/arxiv.2406.11251 preprint EN arXiv (Cornell University) 2024-06-17

This paper introduces Sparsified Late Interaction for Multi-vector (SLIM) retrieval with inverted indexes. Multi-vector retrieval methods have demonstrated their effectiveness on various datasets, and among them, ColBERT is the most established method based on the late interaction of contextualized token embeddings of pre-trained language models. However, efficient ColBERT implementations require complex engineering and cannot take advantage of off-the-shelf search libraries, impeding their practical use. To address this issue, SLIM first maps...

10.48550/arxiv.2302.06587 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Recent progress in information retrieval finds that embedding query and document representation into multi-vector yields a robust bi-encoder retriever on out-of-distribution datasets. In this paper, we explore whether late interaction, the simplest form of multi-vector, is also helpful to neural rerankers that only use the [CLS] vector to compute the similarity score. Although intuitively, the attention mechanism of rerankers at the previous layers already gathers the token-level information, we find adding late interaction still brings an...

10.48550/arxiv.2302.06589 preprint EN other-oa arXiv (Cornell University) 2023-01-01

This paper analyzes its role in the composition analysis and identification of ancient glass products by flexible use of statistical methods, and emphasizes four methods: systematic clustering algorithm, K-means algorithm, logistic regression model, and grey correlation analysis. Taking the C project of CUMCM 2022 as an example, this paper systematically introduces these common data classification methods to classify and analyze the given data. In this paper, suitable chemical components of high potassium and lead barium glass were selected for...

10.9734/ajpas/2023/v24i2518 article EN Asian Journal of Probability and Statistics 2023-08-24

One key feature of dense passage retrievers (DPR) is the use of separate question and passage encoders in a bi-encoder design. Previous work on the generalization of DPR mainly focuses on testing both encoders in tandem on out-of-distribution (OOD) question-answering (QA) tasks, which is also known as domain adaptation. However, it is still unknown how DPR's individual question/passage encoder affects generalization. Specifically, in this paper, we want to know how an in-distribution (IND) question/passage encoder would generalize if paired with an OOD passage/question...

10.48550/arxiv.2110.01599 preprint EN cc-by arXiv (Cornell University) 2021-01-01