A Simple Approach to Regression Variable Selection

  • Abstract: In linear regression modeling, the selection of regression variables is a problem of strong theoretical and practical interest that has attracted wide attention and an extensive literature. The consistency of the selected variable subset is a central concern: a selection method is preferable if the subset it selects is consistent as the sample size tends to infinity and its prediction mean squared error is small. The BIC criterion can select a consistent subset, but the computation becomes prohibitive when the number of variables is large; the adaptive lasso is computationally more efficient while still finding a consistent subset. This paper proposes a simpler variable selection method that requires only two passes of ordinary least squares regression: the first pass regresses on the full variable set to obtain coefficient estimates for all variables, a subset of variables is then selected using these estimates, and the second pass regresses on the selected subset alone to produce the final result.
    Consider the regression model $y = X\beta + e$, and let $A = \{\,j : \beta_j \neq 0\,\}$ denote the index set of the nonzero regression coefficients. Let $\hat{A}$ be the index set of the variable subset selected by the proposed method, and let $\hat{\beta}$ be the resulting coefficient estimate, where the coefficients of unselected variables are set to zero. We prove that, under suitable conditions, $P(\hat{A} = A) \to 1$ and $\sqrt{n}\,(\hat{\beta}_A - \beta_A) \xrightarrow{d} N(0,\; c\,\sigma^2 \Sigma_A)$, where $\hat{\beta}_A$ denotes the vector formed by the components of $\hat{\beta}$ with indices in $A$, $\sigma^2$ is the error variance, and $\Sigma_A$ and $c$ are a matrix and a constant determined by the limit of $n^{-1} X^T X$.
    Simulation results and application examples show that the new method has good small- and medium-sample performance, comparable to that of methods such as BIC and the adaptive lasso.
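    The abstract describes the two-pass procedure but not the exact rule used to pick variables from the full-set estimates, so the following Python sketch is only an illustration of the idea: the function name two_pass_ols and the t-statistic threshold sqrt(log n) are assumptions made here, not the paper's specification.

```python
import numpy as np

def two_pass_ols(X, y, threshold=None):
    """Minimal sketch of a two-pass OLS variable selection.

    Pass 1 fits OLS on the full variable set; a subset is then chosen
    from the full-set estimates (here by thresholding t-statistics, an
    assumed rule).  Pass 2 refits OLS on the selected subset only.
    """
    n, p = X.shape
    # Pass 1: ordinary least squares on the full variable set.
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_full
    sigma2 = resid @ resid / (n - p)          # error variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    # Assumed selection rule: keep variables whose |t| exceeds a slowly
    # growing threshold, so noise variables are dropped as n grows.
    if threshold is None:
        threshold = np.sqrt(np.log(n))
    A_hat = np.flatnonzero(np.abs(beta_full / se) > threshold)
    # Pass 2: OLS on the selected subset; unselected coefficients stay zero.
    beta_hat = np.zeros(p)
    if A_hat.size:
        beta_hat[A_hat], *_ = np.linalg.lstsq(X[:, A_hat], y, rcond=None)
    return A_hat, beta_hat
```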


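    As a quick synthetic check in the spirit of the reported simulations (the sample size and true coefficients below are invented for illustration), one can verify that the selected index set recovers the support of a sparse coefficient vector:

```python
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 3, 7]] = [2.0, -1.5, 1.0]    # true nonzero coefficients
y = X @ beta + rng.standard_normal(n)

A_hat, beta_hat = two_pass_ols(X, y)
print(A_hat)                          # expected to recover [0 3 7]
```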