CUI Jun, LIU Yana, GUO Xinfeng, WANG Ruibo, LI Jihong. Software Defect Prediction Model Based on Maximal Information Coefficient[J]. Chinese Journal of Applied Probability and Statistics, 2019, 35(1): 86-108. DOI: 10.3969/j.issn.1001-4268.2019.01.007

Software Defect Prediction Model Based on Maximal Information Coefficient

    Abstract: In regression-based software defect prediction, the candidate feature set is often large: it includes class-level metrics (features) extracted from static code as well as method-level metrics aggregated to the class level (sum, avg, max, min). Classical feature selection criteria such as AIC and BIC require a model to be specified first, so the selected feature sets differ considerably from one model to another and the resulting models are hard to interpret. The maximal information coefficient (MIC), proposed by Reshef et al. [4], measures the degree of dependence between two continuous variables and can be computed from observed data. This paper first selects features by the MIC between the defect count and each feature, then applies suitable power transformations to the selected features, and finally builds principal component Poisson and negative binomial regression models. All experiments use the class-level KC1 data set from the NASA repository. The block-regularized m×2 cross-validated sequential t-test is used to test whether the performance difference between two models is significant, and model performance is measured by FPA, AAE, and ARE. The experimental results show that: 1) the features selected by MIC are mainly sum, avg, and max aggregations rather than min, which differs markedly from the selections made by AIC and BIC; 2) suitable power transformations of the features improve performance for most models; 3) after the power transformation, principal component analysis and factor analysis yield two clear factors, one corresponding to the features aggregated by avg and max and the other to the features aggregated by sum, which makes the model readily interpretable. Overall, the sum, avg, and max aggregations contribute significantly to software defect prediction, and models built on MIC-selected features have an advantage.
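The abstract names FPA, AAE, and ARE without defining them. The sketch below uses the definitions that are common in the defect-count prediction literature (average absolute error, average relative error with a +1 in the denominator, and fault-percentile-average); whether the paper uses exactly these forms is an assumption, and the function names `aae`, `are`, and `fpa` are illustrative only.

```python
def aae(actual, predicted):
    """Average absolute error: mean of |predicted - actual| over all classes."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def are(actual, predicted):
    """Average relative error. The +1 in the denominator (assumed convention)
    avoids division by zero for classes with no defects."""
    return sum(abs(p - a) / (a + 1) for a, p in zip(actual, predicted)) / len(actual)


def fpa(actual, predicted):
    """Fault-percentile-average: the mean, over m = 1..K, of the share of
    actual defects contained in the m classes with the highest predictions."""
    total = sum(actual)
    if total == 0:
        raise ValueError("FPA is undefined when there are no defects")
    # Rank classes from highest to lowest predicted defect count.
    order = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)
    running = 0
    share_sum = 0.0
    for i in order:
        running += actual[i]          # defects captured by the top-m classes
        share_sum += running / total  # share of all defects captured so far
    return share_sum / len(actual)


if __name__ == "__main__":
    # Hypothetical defect counts for four classes, for illustration only.
    actual = [0, 2, 1, 5]
    predicted = [0.5, 1.0, 2.0, 4.0]
    print(aae(actual, predicted), are(actual, predicted), fpa(actual, predicted))
```

Under these definitions, lower AAE and ARE indicate better predicted counts, while a higher FPA indicates a better ranking of classes by defect-proneness.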

     
