Abstract:
Regression variable subset selection is one of the most important problems in linear model theory. A selection method is preferred if the selected subset is consistent as the sample size tends to infinity and the prediction mean square error is small. The BIC criterion gives a consistent subset, but as the number of variables gets large it involves too much computation. The adaptive lasso has better computational efficiency while keeping consistency. In this paper we propose a new approach to multiple linear regression variable selection that is much simpler than the other variable selection methods while still giving a consistent subset. The new method computes only two passes of ordinary least squares regressions: the first pass computes a full-model regression and selects a variable subset based on the regression coefficient estimates; the second pass regresses on the selected variables.
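For concreteness, here is a minimal sketch of such a two-pass procedure in Python/NumPy. It assumes the subset is chosen by thresholding the first-pass coefficient estimates in absolute value; the threshold `tau` and the function name are illustrative placeholders, not the paper's exact selection rule.

```python
import numpy as np

def two_pass_ols(X, y, tau):
    """Illustrative two-pass OLS variable selection.

    Pass 1: OLS on the full design; keep variables whose coefficient
    estimates exceed tau in absolute value (assumed rule).
    Pass 2: OLS on the kept variables only; coefficients of dropped
    variables are set to zero.
    """
    n, p = X.shape
    # Pass 1: full least-squares fit.
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Select the subset by thresholding the estimates (assumption).
    selected = np.flatnonzero(np.abs(beta_full) > tau)
    # Pass 2: refit using the selected columns only.
    beta_hat = np.zeros(p)
    if selected.size > 0:
        beta_sub, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        beta_hat[selected] = beta_sub
    return beta_hat, selected
```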
Consider the regression model $y = X\beta + e$, where the indexes of the non-zero elements of $\beta$ are denoted by $I^*$. Suppose the new method gives a regression variable subset indexed by $\hat{I}$, and let $\hat{\beta}$ be the regression coefficient estimate given by our new method, in which the coefficients of the dropped variables are defined to be zero. We prove that under suitable conditions
$$\lim_{n\to\infty} P\big(\hat{I} = I^*\big) = 1, \qquad \sqrt{n}\,\big(\hat{\beta}_{I^*} - \beta_{I^*}\big) \xrightarrow{d} N\big(0,\, c\,\sigma^2\,\Sigma^{-1}\big),$$
where $\beta_{I^*}$ denotes the vector composed of the elements of $\beta$ indexed by $I^*$, $\sigma^2$ is the error variance, and $\Sigma$ and $c$ are a matrix and a constant relying on the limit of $X^{\mathrm{T}}X/n$.
Simulation results and application examples show that the new approach has good small- to medium-sample performance, comparable to other methods such as BIC and the adaptive lasso.
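A toy simulation of the kind summarized above might look like the following, reusing `two_pass_ols` from the sketch earlier; the data-generating parameters and threshold are arbitrary choices for illustration, not those used in the paper.

```python
rng = np.random.default_rng(0)
n, p = 200, 10
beta_true = np.zeros(p)
beta_true[[0, 3, 7]] = [2.0, -1.5, 1.0]      # sparse true coefficients
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)   # unit error variance

beta_hat, selected = two_pass_ols(X, y, tau=0.5)
print("selected indexes:", selected)          # ideally [0, 3, 7]
print("estimation error:", np.linalg.norm(beta_hat - beta_true))
```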