Advances in hardware and sensor technology have given rise to a new kind of data management. Data that are generated continuously and rapidly, and that grow over time, are referred to as stream data. Stream data have become a challenge for Knowledge Discovery and Data Mining (KDD) because of their large size and the dynamics of their generation and processing. In addition, the high-dimensional attributes and multi-valued categorical values found in recent stream data pose a new challenge for managing and processing them.
When processing stream data, three aspects should be considered. First, stream data are too large to fit in limited system memory. Second, stream data are strongly affected by time, because they arrive along a timeline and their characteristics are subject to change. Third, recent applications of stream data require more sophisticated processing of complex data formats, such as summarizing the data or finding hidden knowledge in them, rather than simple data management or filtering. Based on these factors, we propose a sampling method for limited memory, a clustering method for multi-valued categorical data in high-dimensional space, and a method to detect the evolution of data characteristics and learn from it.
We propose a sampling method, based on a Quantile system, that reflects the temporal nature of stream data. The importance of data tends to depend on the data arrival rate, so our method samples more data from intervals with a high arrival rate. The method can be applied to sophisticated knowledge applications, such as clustering from multiple sources, and helps them reflect the characteristics of stream data effectively.
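The idea of sampling more heavily where data arrive faster can be illustrated with a minimal sketch. This is not the Quantile-based method itself, whose details are not given here. It is a simplified stand-in that assumes fixed-width time windows and allocates a sample budget in proportion to each window's arrival count. The function name `rate_aware_sample` and the parameters `window`, `budget`, and `seed` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def rate_aware_sample(events, window, budget, seed=0):
    """Sample (timestamp, value) pairs, drawing more from busy intervals.

    Illustrative assumptions (not from the original method):
    - the timeline is cut into fixed-width windows of size `window`;
    - `budget` is the approximate total number of items to keep.
    """
    rng = random.Random(seed)

    # Group events by the index of the time window they fall into.
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[int(ts // window)].append((ts, value))

    total = len(events)
    sample = []
    for key in sorted(buckets):
        items = buckets[key]
        # Allocate slots in proportion to this window's share of arrivals,
        # keeping at least one item from every non-empty window.
        k = max(1, round(budget * len(items) / total))
        k = min(k, len(items))
        sample.extend(rng.sample(items, k))
    return sample


if __name__ == "__main__":
    # A burst of 90 arrivals in [0, 10), then 10 arrivals in [10, 20).
    events = [(i * 0.1, i) for i in range(90)] + [(10 + i, i) for i in range(10)]
    picked = rate_aware_sample(events, window=10, budget=20)
    print(len(picked), "items kept")
```

Because of rounding and the one-item minimum per window, the number of items returned can differ slightly from `budget`. A true streaming version would also need to estimate arrival rates incrementally rather than seeing all events up front, as this sketch does.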
We propose an effective method to quantify the level of dissimilarity between categorical values and develop a framework for unsupervised learning on high-dimensional categorical data. Clustering is the most representative unsupervised learning technique in KDD for grouping similar data and discovering hidden information about the ch...