Precise Zero-Shot Dense Retrieval without Relevance Labels

DOI: 10.18653/v1/2023.acl-long.99 Publication Date: 2023-08-05T00:57:42Z
ABSTRACT
While dense retrieval has been shown to be effective and efficient across tasks and languages, it remains difficult to create fully zero-shot dense retrieval systems when no relevance labels are available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first zero-shot prompts an instruction-following language model (e.g., InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is "fake" and may contain hallucinations. Then, an unsupervised contrastively learned encoder (e.g., Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, from which similar real documents are retrieved based on vector similarity. This second step grounds the generated document to the actual corpus, with the encoder's dense bottleneck filtering out the hallucinations. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers across various tasks (e.g. web search, QA, fact verification) and in non-English languages (e.g. sw, ko, ja, bn).
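The two-step pipeline the abstract describes (generate a hypothetical document, then retrieve real documents near its embedding) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_hypothetical_document` is a stub standing in for the instruction-following LM, and `embed` is a toy bag-of-words encoder standing in for a dense encoder such as Contriever.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for an unsupervised dense encoder (e.g. Contriever):
    # a sparse bag-of-words vector. HyDE uses a learned dense encoder.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Vector similarity used to rank corpus documents.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def generate_hypothetical_document(query: str) -> str:
    # Stand-in for zero-shot prompting an instruction-following LM.
    # The real generation may contain hallucinated specifics; the
    # retrieval step below grounds it in the actual corpus.
    return f"{query} is answered by the following passage: ..."


def hyde_retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Step 1: generate a hypothetical (possibly "fake") document.
    hypo = generate_hypothetical_document(query)
    # Step 2: encode it and return the k nearest real documents,
    # so only actual corpus text is ever shown to the user.
    q_vec = embed(hypo)
    return sorted(corpus, key=lambda d: cosine(q_vec, embed(d)), reverse=True)[:k]
```

Note that the query itself is never embedded directly; only the generated document passes through the encoder, which is the pivot that sidesteps learning a query-to-document relevance mapping.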