
Estimating deep web data source size by capture–recapture method

Subjects: 02 engineering and technology; 0202 electrical engineering, electronic engineering, information engineering

DOI: 10.1007/s10791-009-9107-y
Publication Date: 2009-08-12T15:22:09Z


AUTHORS (2)

Jianguo Lu

Dingding Li

ABSTRACT

This paper addresses the problem of estimating the size of a deep web data source that is accessible by queries only. Since most deep web data sources are non-cooperative, the size of a data source can only be estimated by sending queries and analyzing the returned results. We propose an efficient estimator based on the capture-recapture method. First, we derive an equation relating the overlapping rate to the percentage of the data examined when random samples are retrieved from a uniform distribution. This equation is conceptually simple and leads to the derivation of an estimator for samples obtained by random queries. Since random queries do not produce random documents, it is well known that traditional estimators based on random queries underestimate the size, i.e., they have negative bias. Starting from the simple estimator for random samples, we adjust the equation so that it can handle the samples returned by random queries. We conduct both simulation studies and experiments on corpora including Gov2, Reuters, Newsgroups, and Wikipedia. The results show that our method has small bias and standard deviation.
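
As background for the capture-recapture idea mentioned in the abstract, the following is a minimal Python sketch of the classic two-sample Lincoln-Petersen estimator. It is not the estimator proposed in the paper, whose equations are not reproduced on this page. It assumes documents are drawn uniformly at random, which is exactly the assumption the abstract says random queries violate. The function names `lincoln_petersen` and `simulate`, and the default sizes, are illustrative choices rather than anything taken from the paper.

```python
import random


def lincoln_petersen(first_sample, second_sample):
    """Classic two-sample capture-recapture estimate of population size.

    Estimates N as n1 * n2 / m, where n1 and n2 are the numbers of distinct
    items in each sample and m is the number of items seen in both.
    """
    first, second = set(first_sample), set(second_sample)
    overlap = len(first & second)
    if overlap == 0:
        raise ValueError("no recaptured items; draw larger samples")
    return len(first) * len(second) / overlap


def simulate(true_size=10_000, sample_size=1_000, seed=0):
    """Draw two uniform random samples of document ids and estimate the size.

    Assumption: uniform sampling without replacement within each sample.
    Real query results are not uniform, so this is an idealized setting.
    """
    rng = random.Random(seed)
    population = range(true_size)
    first = rng.sample(population, sample_size)
    second = rng.sample(population, sample_size)
    return lincoln_petersen(first, second)


if __name__ == "__main__":
    # Two samples of 4 items sharing 2 items: 4 * 4 / 2 = 8.0
    print(lincoln_petersen([1, 2, 3, 4], [3, 4, 5, 6]))
    print(simulate())
```

Under uniform sampling the estimate tends to land near the true size. When samples come from queries instead, large or frequently matched documents are over-represented, the overlap grows, and the estimate drops below the true size, which is the negative bias the abstract describes.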


REFERENCES (37)

CITATIONS (21)
