Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees
Cosine similarity
Similarity (geometry)
DOI:
10.48550/arxiv.1812.07695
Publication Date:
2018-01-01
AUTHORS (5)
ABSTRACT
Given a database of vectors, cosine threshold query returns all vectors in the having similarity to vector above given {\theta}. These queries arise naturally many applications, such as document retrieval, image search, and mass spectrometry. The present paper considers efficient evaluation queries, providing novel optimality guarantees exhibiting good performance on real datasets. We take starting point Fagin's well-known Threshold Algorithm (TA), which can be used answer follows: an inverted index is first built from during pre-processing; at time, algorithm traverses partially gather set candidate later verified for {\theta}-similarity. However, directly applying TA its raw form misses significant optimization opportunities. Indeed, we show that one advantage fact assumed normalized, obtain improved, tight stopping condition traversal efficiently compute it incrementally. Then data skewness better strategies. In particular, strategy exploits common holds multiple domains including spectrometry, documents, databases. under assumption, new has strong, near-optimal guarantee. techniques developed are quite general since they applied large class functions beyond cosine.
SUPPLEMENTAL MATERIAL
Coming soon ....
REFERENCES ()
CITATIONS ()
EXTERNAL LINKS
PlumX Metrics
RECOMMENDATIONS
FAIR ASSESSMENT
Coming soon ....
JUPYTER LAB
Coming soon ....