Data clustering, one of the most frequently used tools in data mining, is the process of partitioning data so that intra-group similarity is maximized and inter-group similarity is minimized. It gives a rough picture of the composition of a dataset and is especially useful when little is known about the data in advance.
However, as datasets grow in both volume and dimensionality, more efficient clustering methods are required. High dimensionality in particular makes it very difficult to obtain a meaningful clustering result, because the distances between all pairs of data objects become nearly equal in a high-dimensional space.
In this thesis, we present a study of effective clustering for large, high-dimensional datasets. To deal with the curse of dimensionality, the proposed method follows the philosophy of subspace clustering, which assumes that the important dimensions can differ from cluster to cluster.
We first define a new similarity measure devised for high-dimensional datasets. To measure the similarity between two data objects, it focuses on the number of dimensions in which the two objects are sufficiently close to each other, rather than merely averaging the similarities over all dimensions. We then present a novel way to find each cluster's important dimensions (i.e., its subspace). The suggested subspace-finding method uses nearest-neighbor query results to gather the information required for selecting important dimensions, and a binomial probability model then determines whether each dimension is important. Finally, we propose an algorithm that adopts our similarity measure and subspace-finding method to cluster large, high-dimensional datasets.
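The two ideas above can be illustrated with a minimal sketch. It is not the thesis's actual formulation: the closeness threshold `eps`, the chance probability `p_chance`, the significance level `alpha`, and all function names are assumptions introduced only for illustration. The similarity counts the dimensions in which two objects are close, and a dimension is flagged as important when more nearest neighbors fall close to the query point in that dimension than a binomial model would explain by chance.

```python
from math import comb


def dimension_match_similarity(x, y, eps):
    """Count the dimensions in which x and y differ by at most eps.

    eps is a hypothetical closeness threshold, not a value from the thesis.
    """
    if len(x) != len(y):
        raise ValueError("objects must have the same number of dimensions")
    return sum(1 for a, b in zip(x, y) if abs(a - b) <= eps)


def binomial_tail(k, n, p):
    """Return P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def important_dimensions(point, neighbors, eps, p_chance, alpha=0.01):
    """Select dimensions in which neighbors are close to point more often
    than chance alone would explain.

    p_chance (assumed) is the probability that an unrelated object lies
    within eps of point in a single dimension; alpha (assumed) is the
    significance level of the one-sided binomial test.
    """
    n = len(neighbors)
    selected = []
    for d in range(len(point)):
        # Number of neighbors that are close to the point in dimension d.
        k = sum(1 for q in neighbors if abs(q[d] - point[d]) <= eps)
        if binomial_tail(k, n, p_chance) < alpha:
            selected.append(d)
    return selected


if __name__ == "__main__":
    a = (0.0, 0.0, 0.0, 0.0)
    b = (0.05, 5.0, 0.02, -3.0)
    print(dimension_match_similarity(a, b, eps=0.1))

    query = (0.0, 0.0)
    nn = [(0.01, 0.9), (-0.02, -0.5), (0.03, 0.4), (0.0, -0.8), (0.02, 0.6)]
    print(important_dimensions(query, nn, eps=0.05, p_chance=0.1))
```

In this toy example the neighbors agree with the query point only in the first dimension, so only that dimension is selected. Counting matching dimensions instead of averaging per-dimension distances keeps a few irrelevant, widely scattered dimensions from dominating the similarity.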
Experimental results on various datasets show that the proposed algorithm meets many requirements fo...