(A) pattern-based approach to identifying and correcting outliers in software project data = 소프트웨어 프로젝트 데이터에 대한 패턴 기반의 이상치 검출 및 정제 기법

Despite the importance of the quality of Software Project Data (SPD), problematic data inevitably occur during data collection. Such data are called outliers: SPD instances with abnormal values on certain attributes. We call these attributes the abnormal attributes of the outliers. To improve the quality of SPD instances, it is necessary to identify outliers and their abnormal attributes, and correcting the abnormal values should also be considered. Although a few existing approaches identify outliers and their abnormal attributes, they are not effective in (1) identifying the abnormal attributes when an outlier has abnormal values on more than a specific number of its attributes and (2) identifying outliers that contain abnormal values on attributes other than the specific attribute related to the base algorithm. The existing approach to correcting abnormal values of outliers tends to generate many new outliers through improper correction. In this paper, we propose a pattern-based approach to identifying and correcting outliers in SPD instances: after discovering the reliable frequent patterns that reflect the typical characteristics of the SPD instances, outliers and their abnormal attributes are detected by matching the SPD instances against those patterns. Then, the abnormal values of the outliers are corrected by replacing them with the weighted mean of k similar SPD instances that completely match the patterns most similar and significant to the outliers. Empirical studies were performed on three industrial data sets and 64 artificial data sets with injected outliers. The detection accuracy results demonstrate that our approach outperforms five other approaches by an average of 35.27% and 107.5% in detecting the outliers and abnormal attributes, respectively, on the industrial data sets, and by an average of 61.51% and 110.93%, respectively, on the artificial data sets. In addition, the correction accura...
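The two steps named in the abstract (matching instances against patterns, then replacing an abnormal value with a weighted mean of k similar instances) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: a pattern is modelled as a per-attribute value range, similarity is Euclidean distance over the non-abnormal attributes, and the weights are inverse distances. The function names `abnormal_attributes` and `correct_value` are hypothetical.

```python
from math import sqrt


def abnormal_attributes(instance, pattern):
    """Return the attributes whose values fall outside the pattern's range.

    Assumption: a pattern maps attribute name -> (low, high) range.
    """
    return [a for a, (low, high) in pattern.items()
            if not low <= instance[a] <= high]


def correct_value(outlier, candidates, attr, normal_attrs, k=3):
    """Weighted mean of `attr` over the k candidates closest to the outlier.

    Assumption: closeness is Euclidean distance over `normal_attrs`,
    and each neighbour is weighted by the inverse of its distance.
    """
    def distance(other):
        return sqrt(sum((outlier[a] - other[a]) ** 2 for a in normal_attrs))

    nearest = sorted(candidates, key=distance)[:k]
    # Small constant avoids division by zero for identical instances.
    weights = [1.0 / (distance(c) + 1e-9) for c in nearest]
    return sum(w * c[attr] for w, c in zip(weights, nearest)) / sum(weights)


# Hypothetical data: one pattern, one suspicious instance, three clean ones.
pattern = {"size": (10, 100), "effort": (50, 500)}
suspect = {"size": 40, "effort": 9000}
clean = [
    {"size": 38, "effort": 200},
    {"size": 42, "effort": 220},
    {"size": 90, "effort": 450},
]

flagged = abnormal_attributes(suspect, pattern)
corrected = correct_value(suspect, clean, "effort", ["size"], k=2)
```

With this data, only `effort` is flagged, and the corrected value is the mean of the two equally distant neighbours, approximately 210.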
Advisors
Bae, Doo-Hwan (배두환)
Description
Korea Advanced Institute of Science and Technology (KAIST) : Department of Computer Science
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2010
Identifier
418729/325007  / 020035189
Language
eng
Description

Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST) : Department of Computer Science, 2010.2, [ vii, 82 p. ]

Keywords

software data; data cleaning; data quality; outlier; noisy data

URI
http://hdl.handle.net/10203/33292
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=418729&flag=dissertation
Appears in Collection
CS-Theses_Ph.D.(박사논문)
