A Full-Coverage Two-Level URL Duplication Checking Method for a High-Speed Parallel Web Crawler

Cited 0 time in webofscience Cited 0 time in scopus
  • Hit : 427
  • Download : 0
For efficient large-scale Web crawlers, URL duplication checking is an important technique since it is a significant bottleneck. In this paper, we propose a new URL duplication checking technique for a parallel Web crawler; we call it full-coverage two level URL duplication checking (full-coverage-2L-UDC). Full-coverage-2L-UDC provides efficient URL duplication checking while ensuring maximum coverage. First, we propose two-level URL duplication checking (2L-UDC). It provides efficiency in URL duplication checking by communicating at the Web site level rather than at the Web page level. Second, we present a solution for the so-called coverage problem, which is directly related to the recall of the search engine. It is the first solution for the coverage problem in the centralized parallel architecture. Third, we propose an architecture, FC2-LUDCbot, for a centralized parallel crawler using full-coverage-2L-UDC. We build a seven-agent FC2L-UDCbot for extensive experiments. We show that the crawling speed of FC2L-UDCbot is approximately proportional to the number of agents (Le., FC2L-UDCbot is faster than a single-machine crawler by 6.9 times). Full-coverage-2L-UDC allows FC2L-UDCbot to be scalable to the number of agents since it effectively deals with the overheads incurred in a parallel environment. Through an in-depth analysis, we construct a cost model for estimating the crawling speed of a scaled-up crawler. Using the model, we show that FC2L-UDCbot can crawl Google-scale Web pages within several days using dozens of agents.
Publisher
INST INFORMATION SCIENCE
Issue Date
2015-05
Language
English
Article Type
Article
Citation

JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, v.31, no.3, pp.839 - 860

ISSN
1016-2364
URI
http://hdl.handle.net/10203/200014
Appears in Collection
CS-Journal Papers(저널논문)
Files in This Item
There are no files associated with this item.

qr_code

  • mendeley

    citeulike


rss_1.0 rss_2.0 atom_1.0