Design and implementation of a community-based cluster crawler using the link structure and text information of hyperlinks = 하이퍼링크의 링크 구조와 텍스트 정보를 이용한 커뮤니티 기반의 클러스터 크롤러의 설계 및 구현

Community-limited search is a technique for improving the quality of search results by restricting the search to a specified community. In this thesis, a community is a collection of semantically related web pages. Few techniques have been proposed for finding such communities. The incremental cluster crawler proposed by Kim finds communities incrementally using the link structure of the crawled web pages. This crawler, however, has some drawbacks. For instance, it does not consider text information. Moreover, the choice of seed URLs affects clustering quality because one community is created for each seed URL. In this thesis, we propose a new method for finding communities incrementally. The key idea is to use both the link structure and the text information. Specifically, the method first computes similarity from the link structure and from the text information separately, and then combines the two resulting similarity scores. To compute the text-based similarity, we use the text embedded in the hyperlink to a target web page instead of the text in the target web page itself. By using both the link structure and the text information, the proposed method can improve the overall clustering quality. We also propose a method for merging communities to reduce the influence of seed URLs on clustering quality. This method computes the similarity between communities and merges those that were created from different seed URLs. Experimental results show that the proposed method improves clustering quality by up to a factor of three compared with the incremental cluster crawler proposed by Kim.
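The two ideas in the abstract, combining separately computed link and text similarity scores and merging communities that grew from different seed URLs, can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formulas, so everything concrete here is an assumption made for illustration: Jaccard overlap as the link measure, cosine similarity over hyperlink text as the text measure, a weighted sum with a hypothetical weight `alpha`, and a greedy merge with a hypothetical `threshold`.

```python
from collections import Counter
from math import sqrt


def link_similarity(links_a, links_b):
    """Jaccard overlap of two sets of linked URLs (assumed measure, not the thesis's)."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def text_similarity(text_a, text_b):
    """Cosine similarity of bag-of-words vectors built from hyperlink text (assumed measure)."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def combined_similarity(link_sim, text_sim, alpha=0.5):
    """Weighted sum of the two separately computed scores; alpha is a hypothetical weight."""
    return alpha * link_sim + (1 - alpha) * text_sim


def merge_communities(communities, similarity, threshold=0.5):
    """Greedily merge communities whose similarity reaches a hypothetical threshold.

    `communities` is a list of sets of page URLs; `similarity` is any function
    that scores two such sets.
    """
    merged = [set(c) for c in communities]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if similarity(merged[i], merged[j]) >= threshold:
                    merged[i] |= merged[j]  # absorb community j into community i
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan because the list changed
    return merged
```

A possible use, again under these assumptions: score two pages with `combined_similarity(link_similarity(out_a, out_b), text_similarity(text_a, text_b))`, and pass `link_similarity` itself as the `similarity` argument of `merge_communities` to merge communities by the overlap of their member pages.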
Whang, Kyu-Young (황규영)
Korea Advanced Institute of Science and Technology (KAIST) : Computer Science Major
268875/325007  / 020044370

Thesis (Master's) - Korea Advanced Institute of Science and Technology (KAIST) : Computer Science Major, 2007. 8, [ vii, 39 p. ]


web crawling; web clustering; web community; 웹 크롤링; 웹 클러스터링; 웹 커뮤니티
