Citation: SONG Yanan, ZHAO Xuejing. An Improved Outlier Detection Algorithm and Robust Estimation[J]. Chinese Journal of Applied Probability and Statistics, 2021, 37(2): 136-154. DOI: 10.3969/j.issn.1001-4268.2021.02.003

An Improved Outlier Detection Algorithm and Robust Estimation

    Abstract: Growing demands on model accuracy and robustness have made outlier detection and robust estimation increasingly important in model building. In this paper, we first apply two methods separately to obtain a preliminary detection of outliers: the high-dimensional influence measure (HIM), which is based on the marginal correlation, and a high-dimensional discriminant method based on the distance correlation (HDC). This step divides the data into an initial normal point set and an initial abnormal point set. Starting from the initial normal point set, we then use a robust parameter estimation method and the concept of hyperellipsoid contours in the residual space to construct a correction procedure for points that were misclassified as normal. Next, the outlier probability of each point in the initial abnormal point set is recalculated to recover normal points that were misclassified as abnormal, which further improves the accuracy of outlier detection. The effectiveness of the proposed method is demonstrated through simulations of three types of anomalous data under two data structures, and the method is further examined on three real examples.
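The preliminary screening step can be illustrated with a minimal Python sketch. It assumes a leave-one-out form of a marginal-correlation influence score, in the spirit of HIM: each observation is scored by how much the vector of marginal correlations changes when that observation is removed. The function names, the simulated data, and the contamination pattern are hypothetical. The exact statistic, thresholds, and correction steps used in the paper may differ.

```python
import numpy as np


def marginal_correlations(X, y):
    """Pearson correlation between each column of X and the response y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return (Xc.T @ yc) / denom


def leave_one_out_influence(X, y):
    """Score each observation by the mean squared change in the marginal
    correlations when that observation is deleted (assumed form, not
    necessarily the statistic used in the paper)."""
    n, p = X.shape
    rho_full = marginal_correlations(X, y)
    scores = np.empty(n)
    for k in range(n):
        keep = np.arange(n) != k
        rho_k = marginal_correlations(X[keep], y[keep])
        scores[k] = np.sum((rho_full - rho_k) ** 2) / p
    return scores


def demo():
    """Simulate a linear model and contaminate observation 0 (hypothetical data)."""
    rng = np.random.default_rng(0)
    n, p = 60, 5
    X = rng.normal(size=(n, p))
    beta = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
    y = X @ beta + 0.5 * rng.normal(size=n)
    X[0] = 8.0    # extreme predictor values for the contaminated point
    y[0] = -40.0  # response far from the regression surface
    return leave_one_out_influence(X, y)


if __name__ == "__main__":
    scores = demo()
    print("most influential observation:", int(np.argmax(scores)))
```

In this sketch the contaminated observation receives by far the largest score, so a simple cut-off on the scores would place it in the initial abnormal point set. The later stages described in the abstract (robust estimation on the normal set and recalculated outlier probabilities) are not reproduced here.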

     
