Crawling deep web entity pages

0202 electrical engineering, electronic engineering, information engineering; 02 engineering and technology

DOI: 10.1145/2433396.2433442
Publication Date: 2013-02-05T08:19:52Z

AUTHORS (5)

Yeye He

Dong Xin

Venkatesh Ganti

Sriram Rajaraman

Nirav Shah

ABSTRACT

Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document-oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems, including query generation, empty page filtering, and URL deduplication, in the specific context of entity-oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective.

REFERENCES (23)

CITATIONS (32)
