Effective data clustering for large volume high dimensional datasets

Data clustering is one of the most frequently used tools in data mining. It refers to the process of partitioning data so that intra-group similarities are maximized and inter-group similarities are minimized at the same time. Data clustering gives a rough idea of the composition of a given dataset, and it is especially useful when little is known about that dataset. But as datasets grow larger in volume and higher in dimensionality, more efficient clustering methods are required. In particular, the high dimensionality of a dataset makes it very difficult to generate a meaningful clustering result, because the distances between all pairs of data objects become similar in a high-dimensional space. In this thesis, we present a study of effective data clustering for large-volume, high-dimensional datasets. To deal with the curse of dimensionality, the proposed method follows the philosophy of subspace clustering, which assumes that the important dimensions can differ between clusters. We first define a new similarity measure devised for high-dimensional datasets. To measure the similarity between two data objects, the proposed measure focuses on the number of dimensions in which the two objects are sufficiently near each other, rather than merely averaging the similarities along all dimensions. We then present a novel way to find each cluster's important dimensions (i.e., its subspace). The suggested subspace-finding method uses nearest neighbor query results to gather the information required for selecting important dimensions. The gathered information is used to determine whether each dimension is important, based on a binomial probability model. Finally, we propose an algorithm that adopts our similarity measure and subspace-finding method to perform clustering on large-volume, high-dimensional datasets. Through experimental results on various datasets, the proposed algorithm is shown to meet many requirements fo...
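The two ideas named in the abstract, a similarity based on counting "near" dimensions and a binomial test for dimension importance, might be sketched roughly as follows. This is only an illustration under stated assumptions: this record does not give the thesis's actual definitions, so the per-dimension threshold eps, the null probability p_null, the significance level alpha, and all function names are hypothetical.

```python
from math import comb


def near_dimension_similarity(x, y, eps):
    """Fraction of dimensions in which x and y differ by at most eps.

    Assumed form of a count-based similarity; not the thesis's exact measure.
    """
    if len(x) != len(y):
        raise ValueError("objects must have the same number of dimensions")
    near = sum(1 for a, b in zip(x, y) if abs(a - b) <= eps)
    return near / len(x)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def important_dimensions(center, neighbors, eps, p_null, alpha=0.01):
    """Select dimensions in which unusually many neighbors lie near `center`.

    p_null is the assumed chance that an unrelated object falls within eps
    of `center` along a single dimension (hypothetical null model).
    """
    n = len(neighbors)
    selected = []
    for dim in range(len(center)):
        hits = sum(1 for nb in neighbors if abs(nb[dim] - center[dim]) <= eps)
        # Keep the dimension if this many hits is unlikely under the null model.
        if binomial_upper_tail(hits, n, p_null) < alpha:
            selected.append(dim)
    return selected
```

In this sketch, the neighbors passed to `important_dimensions` would come from a nearest neighbor query around a candidate cluster member, and a dimension is kept only when the neighbors concentrate along it more than chance would explain.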
Advisors
Lee, Yoon-Joon
Description
Korea Advanced Institute of Science and Technology (KAIST) : Computer Science
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2004
Identifier
237666/325007  / 000985231
Language
eng
Description

Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST) : Computer Science, 2004.2, [ vi, 43 p. ]

Keywords

DATA MINING; HIGH DIMENSION; HIGH-DIMENSIONAL CLUSTERING; CLUSTERING

URI
http://hdl.handle.net/10203/32863
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=237666&flag=dissertation
Appears in Collection
CS-Theses_Ph.D. (Doctoral theses)
Files in This Item
There are no files associated with this item.
