Hybrid imputation of cluster-based k-NN and maximum likelihood estimation in software project data

Missing data is a common problem that software practitioners face when they analyze software project data. In the empirical software engineering community, k-NN and maximum likelihood estimation are known to be effective on software project data. However, each has a limitation when applied alone: (1) the imputation accuracy of k-NN is affected by the homogeneity of the data, and (2) maximum likelihood estimation is ineffective on data sets containing fewer than 100 project instances. To cope with these limitations, hybrid imputation techniques combining several methods have been developed. However, these can be applied only to software project data with fewer than 100 project instances. In this paper, we propose a hybrid imputation method using cluster-based k-NN and maximum likelihood estimation for software project data. Maximum likelihood estimation is applied first, and then hierarchical clustering partitions the software project data into clusters. The initial imputation by maximum likelihood estimation lets k-NN use, in its search, the non-missing data of project instances that have missing data; partitioning the data into clusters increases the homogeneity of the data set. After the k most similar project instances in the cluster are found, the average of the k-NN result and the maximum likelihood estimation result is taken. In the empirical study, we evaluated our approach and five other methods in experiments on 2,160 data sets, generated by injecting missing data into two industrial data sets: software project data measured at a bank in Korea and the ISBSG data set. The results of the Wilcoxon rank sum test confirm that our approach outperforms the other five methods with respect to data set size, number of missing attributes, missing data percentage, and missingness mechanism.
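
The four steps named in the abstract (initial maximum likelihood imputation, hierarchical clustering, k-NN search inside the cluster, averaging the two estimates) can be sketched roughly as follows. This is an illustrative sketch, not the thesis's implementation: the abstract does not specify the likelihood model, the linkage criterion, the distance measure, or the values of k and the cluster count. All function names here (ml_fill, hierarchical_clusters, hybrid_impute) are hypothetical. The ML step is replaced by a simple stand-in, the mean of the observed values per attribute, and the clustering uses average linkage with Euclidean distance; both are assumptions. Missing values are represented as None, and every attribute is assumed to have at least one observed value.

```python
import math

def ml_fill(rows):
    # Stand-in for the ML step (assumption): fill each gap with the mean of
    # the observed values of that attribute. The thesis's actual ML
    # procedure is not described in the abstract.
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def dist(a, b, cols):
    # Euclidean distance restricted to the given attribute indices.
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in cols))

def hierarchical_clusters(rows, n_clusters):
    # Naive agglomerative clustering with average linkage (assumption).
    # Returns a list of clusters, each a list of row indices.
    cols = range(len(rows[0]))
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist(rows[i], rows[j], cols)
                            for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        merged = clusters.pop(b)      # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters

def hybrid_impute(rows, k=2, n_clusters=2):
    filled = ml_fill(rows)                        # step 1: initial imputation
    result = [list(r) for r in filled]
    for members in hierarchical_clusters(filled, n_clusters):   # step 2
        for i in members:
            missing = [j for j, v in enumerate(rows[i]) if v is None]
            if not missing:
                continue
            observed = [j for j, v in enumerate(rows[i]) if v is not None]
            # step 3: k nearest neighbours within the same cluster, compared
            # on the attributes the target row actually has
            neighbours = sorted(
                (m for m in members if m != i),
                key=lambda m: dist(filled[i], filled[m], observed),
            )[:k]
            if not neighbours:
                continue              # singleton cluster: keep the ML value
            for j in missing:
                knn_value = sum(filled[m][j] for m in neighbours) / len(neighbours)
                # step 4: average the k-NN estimate and the ML estimate
                result[i][j] = (knn_value + filled[i][j]) / 2
    return result
```

Because the initial fill gives every row a complete vector, rows that originally had gaps can still be clustered and can still serve as neighbours, which is the role the abstract assigns to the initial imputation. A production version would replace ml_fill with a proper likelihood-based estimator (for example an EM procedure) and would likely use a library clustering routine.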
Advisors
Bae, Doo-Hwan (배두환)
Description
Korea Advanced Institute of Science and Technology (KAIST) : Computer Science
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2009
Identifier
303642/325007  / 020073371
Language
eng
Description

Thesis (Master's) - Korea Advanced Institute of Science and Technology (KAIST) : Computer Science, 2009.2, [ vi, 47 p. ]

Keywords

imputation; k-NN; maximum likelihood estimation; software project data; cluster

URI
http://hdl.handle.net/10203/34840
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=303642&flag=dissertation
Appears in Collection
CS-Theses_Master (Master's theses)
Files in This Item
There are no files associated with this item.
