Multiple sets of features for automatic genre classification of web documents

Lim, CS / Lee, KJ / Kim, Gil Chang

As the amount of information on the Web grows, it becomes harder to quickly find the desired information among the documents retrieved by a search engine. One way to address this problem is to classify web documents according to various criteria. Most document classification work has focused on the subject or topic of a document. Genre, or style, offers a view of a document that differs from its subject or topic, and it can also serve as a classification criterion. In this paper, we suggest multiple sets of features for classifying the genres of web documents. The basic set of features, proposed in previous studies, is derived from the textual properties of documents, such as the number of sentences and the frequency of particular words. However, web documents differ from plain-text documents in that they contain URLs and HTML tags within their pages. We introduce new sets of features specific to web documents, extracted from URLs and HTML tags. We evaluate the performance of the proposed sets of features and discuss their characteristics. Finally, we conclude which set of features is appropriate for automatic genre classification of web documents. (c) 2004 Elsevier Ltd. All rights reserved.
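The three kinds of features named in the abstract (textual, URL-based, and HTML-tag-based) can be illustrated with a minimal Python sketch. This record does not list the paper's actual features, so every feature below (`num_sentences`, `count_we`, `depth`, `num_links`, and so on), the function name `extract_features`, and the helper class are hypothetical choices made for illustration only, not the authors' method.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagAndTextCollector(HTMLParser):
    """Hypothetical helper: counts start tags and collects visible text."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style bodies; they are not readable text.
        if self._skip_depth == 0:
            self.text_parts.append(data)


def extract_features(url, html):
    """Return illustrative text, URL, and HTML-tag features for one page."""
    parser = TagAndTextCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)

    # Textual features (assumed examples): sentence and word counts.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())

    # URL features (assumed examples): path depth, query string, host suffix.
    parsed = urlparse(url)
    path_segments = [seg for seg in parsed.path.split("/") if seg]
    host = parsed.hostname or ""

    return {
        "text": {
            "num_sentences": len(sentences),
            "num_words": len(words),
            "count_we": words.count("we"),
        },
        "url": {
            "depth": len(path_segments),
            "has_query": bool(parsed.query),
            "host_suffix": host.rsplit(".", 1)[-1] if host else "",
        },
        "html": {
            "num_tags": sum(parser.tag_counts.values()),
            "num_links": parser.tag_counts["a"],
            "num_images": parser.tag_counts["img"],
            "num_forms": parser.tag_counts["form"],
        },
    }
```

Calling `extract_features(url, html)` on a page returns one dictionary per feature set, which makes it straightforward to train and compare a classifier on each set separately or on their combination, in the spirit of the evaluation the abstract describes.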

Publisher: PERGAMON-ELSEVIER SCIENCE LTD

Issue Date: 2005-09

Language: English

Article Type: Article

Citation: INFORMATION PROCESSING & MANAGEMENT, v.41, no.5, pp.1263 - 1276

ISSN: 0306-4573

DOI: 10.1016/j.ipm.2004.06.004

URI: http://hdl.handle.net/10203/90220

Appears in Collection: CS-Journal Papers

Files in This Item: There are no files associated with this item.

This item is cited by 69 other documents in WoS.

KOASAS

Knowledge Service Development Team, KAIST 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea. T. 82-42-350-4493 Email. koasas@kaist.ac.kr
Copyright © 2016. Korea Advanced Institute of Science and Technology. All Rights Reserved.
