Despite the importance of the quality of Software Project Data (SPD), problematic data inevitably occur during data collection. Such data are called outliers: SPD instances with abnormal values on certain attributes. We call these attributes the abnormal attributes of the outliers. To improve the quality of SPD instances, it is necessary to identify outliers and their abnormal attributes, and correcting the abnormal values should also be considered.
Although a few existing approaches identify outliers and their abnormal attributes, these approaches are not effective in (1) identifying the abnormal attributes when an outlier has abnormal values on more than a specific number of its attributes and (2) identifying outliers that contain abnormal values on attributes other than the specific attribute tied to the base algorithm. The existing approach to correcting the abnormal values of outliers tends to generate many new outliers through improper correction.
In this paper, we propose a pattern-based approach to identifying and correcting outliers in SPD instances. After discovering the reliable frequent patterns that reflect the typical characteristics of the SPD instances, we detect outliers and their abnormal attributes by matching the SPD instances against those patterns. The abnormal values of the outliers are then corrected by replacing them with the weighted mean of k similar SPD instances, namely those that completely match the patterns that are most similar to the outliers and most significant.
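The two steps can be illustrated with a minimal Python sketch. It is not the algorithm evaluated in this paper. The function names, the "all but one attribute matches" detection rule, and the caller-supplied weights are assumptions made only for illustration, and pattern discovery and the selection of the k similar instances are left out.

```python
from typing import Dict, List, Sequence, Set

Pattern = Dict[str, str]   # attribute -> typical (discretized) value
Instance = Dict[str, str]  # attribute -> observed (discretized) value


def abnormal_attributes(instance: Instance, patterns: List[Pattern]) -> Set[str]:
    """Flag attributes on which the instance deviates from a pattern.

    Assumed rule (hypothetical): an attribute is flagged when the instance
    agrees with a pattern on every attribute except that one.
    """
    flagged: Set[str] = set()
    for pattern in patterns:
        mismatches = [a for a, v in pattern.items() if instance.get(a) != v]
        if len(mismatches) == 1:
            flagged.add(mismatches[0])
    return flagged


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of the values taken from the k similar instances."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total


# Hypothetical patterns and instance, not taken from the studied data sets.
patterns = [
    {"size": "large", "team": "big", "effort": "high"},
    {"size": "small", "team": "small", "effort": "low"},
]
suspect = {"size": "large", "team": "big", "effort": "low"}

flagged = abnormal_attributes(suspect, patterns)

# Effort values of k = 3 similar instances, with illustrative weights.
corrected_effort = weighted_mean([100.0, 120.0, 140.0], [2, 1, 1])
```

Here the instance is flagged on `effort` because it matches the first pattern on every other attribute, and the replacement value is weighted toward the first of the three similar instances.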
Empirical studies were performed on three industrial data sets and 64 artificial data sets with injected outliers. The detection accuracy results demonstrate that our approach outperforms five other approaches by an average of 35.27% and 107.5% in detecting the outliers and abnormal attributes, respectively, on the industrial data sets, and an average of 61.51% and 110.93% respectively on the artificial data sets. In addition, the correction accura...