Imputation of Missing Values for Compositional Data Based on a Random Forest Model

Abstract: Handling missing data is an important step of data preprocessing in data mining. Because of the special geometric properties of compositional data, traditional imputation methods cannot be applied to this type of data directly and may give undesirable results, which makes missing-value imputation for compositional data an important problem in its own right. To address it, this paper exploits the relationship between compositional data and Euclidean data and proposes an iterative imputation method for missing values in compositional data based on random forests. The method is implemented and evaluated on both simulated and real-world data sets. The experimental results show that the new imputation method can be applied to many types of data sets, achieves relatively high accuracy, and performs better than other methods.