An Ensemble Blocking Scheme for Entity Resolution of Large and Sparse Datasets

Blocking (statistics)
DOI: 10.48550/arxiv.1609.06265 Publication Date: 2016-01-01
ABSTRACT
Entity Resolution, also called record linkage or deduplication, refers to the process of identifying and merging duplicate versions of the same entity into a unified representation. The standard practice is to use a rule-based or machine-learning-based model that compares entity pairs and assigns a score to represent the pairs' Match/Non-Match status. However, performing an exhaustive pair-wise comparison on all pairs of records leads to quadratic matcher complexity, and hence a Blocking step is performed before the Matching step to group similar entities into smaller blocks that the matcher can then examine exhaustively. Several blocking schemes have been developed to efficiently and effectively block the input dataset into manageable groups. At CareerBuilder (CB), we perform deduplication on massive datasets of people profiles collected from disparate sources with varying informational content. We observed that employing a single blocking technique did not cover all possible scenarios, due to the multi-faceted nature of our data sources. In this paper, we describe our ensemble approach to blocking, which combines two different blocking techniques to leverage their respective strengths.
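The general idea of blocking, and of combining two blocking schemes, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not name the two techniques, so the blocking keys (`by_email`, `by_name_zip`), the helper functions, and the toy records below are all hypothetical, and the ensemble is assumed here to be a simple union of the candidate pairs produced by each scheme.

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(records, key_fn):
    """Group record ids by a blocking key; return candidate pairs within each block."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        key = key_fn(rec)
        if key is not None:  # sparse records with no key are skipped by this scheme
            blocks[key].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        # Exhaustive comparison only inside a block, not across the whole dataset.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def ensemble_pairs(records, key_fns):
    """Assumed ensemble rule: union of the candidate pairs from every scheme."""
    pairs = set()
    for key_fn in key_fns:
        pairs |= block_pairs(records, key_fn)
    return pairs


# Hypothetical blocking keys, chosen only for illustration.
def by_email(rec):
    return rec["email"]


def by_name_zip(rec):
    return (rec["name"].lower(), rec["zip"]) if rec["zip"] else None


# Toy people profiles with missing fields, mimicking sparse sources.
records = {
    1: {"name": "Ann Lee", "email": "ann@example.com", "zip": "30301"},
    2: {"name": "Ann Lee", "email": None, "zip": "30301"},
    3: {"name": "A. Lee", "email": "ann@example.com", "zip": None},
    4: {"name": "Bob Roy", "email": "bob@example.com", "zip": "60601"},
}

candidates = ensemble_pairs(records, [by_email, by_name_zip])
all_pairs = len(records) * (len(records) - 1) // 2

print(sorted(candidates))  # pairs the matcher would score
print(f"{len(candidates)} of {all_pairs} possible pairs")
```

In this toy data, neither key alone links record 1 to both of its likely duplicates: the email key misses record 2 (no email) and the name/zip key misses record 3 (no zip). Taking the union recovers both pairs while still scoring fewer pairs than the exhaustive comparison would.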