SSI for Case-control Binary Data under Possibly Mis-specified Logistic Models
Abstract
Semi-supervised data consist of a labeled data set containing both responses and covariates and an unlabeled data set containing covariates only. Inference based on semi-supervised data is attracting growing interest in statistics. When the response in the labeled data is binary, case-control sampling is commonly used to alleviate class imbalance. When the response and the covariates satisfy the logistic model, the slope parameter of the model can be consistently estimated even under case-control sampling. However, when the logistic model is mis-specified, case-control samples alone cannot consistently estimate the population risk minimizer. With the help of the unlabeled data, we derive a consistent estimator of the case proportion in the population. An inverse probability weighted loss function is then developed to obtain a consistent estimator of the population risk minimizer. The proposed estimators are shown to be asymptotically normal, and the limiting variance-covariance matrix can be consistently estimated. Simulation results show that the proposed method performs reasonably well in finite samples. A real data example is also analyzed for illustration.
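
The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the estimator derived in the paper: the case proportion is estimated here by a simple first-moment matching rule, which is an assumption made for illustration and relies on the unlabeled covariates being a mixture of the case and control covariate distributions. The function names `estimate_case_proportion` and `ipw_logistic` are hypothetical.

```python
import numpy as np

def estimate_case_proportion(x_cases, x_controls, x_unlabeled):
    """Illustrative moment-matching estimate of pi = P(Y = 1).

    Assumption (not taken from the paper): uses the mixture identity
    E[X] = pi * E[X | Y=1] + (1 - pi) * E[X | Y=0] and solves for pi
    by least squares over the covariate coordinates.
    """
    m1 = x_cases.mean(axis=0)
    m0 = x_controls.mean(axis=0)
    mu = x_unlabeled.mean(axis=0)
    d = m1 - m0
    pi_hat = float(d @ (mu - m0) / (d @ d))
    # Keep the estimate strictly inside (0, 1) so the weights stay positive.
    return min(max(pi_hat, 1e-6), 1 - 1e-6)

def ipw_logistic(x, y, pi_hat, n_iter=50, tol=1e-10):
    """Minimize an inverse-probability-weighted logistic loss by Newton's method.

    Cases get weight pi_hat / (n1 / n) and controls (1 - pi_hat) / (n0 / n),
    which re-balances the case-control sample toward the population.
    Returns the coefficient vector (intercept first).
    """
    n = len(y)
    n1 = y.sum()
    w = np.where(y == 1, pi_hat / (n1 / n), (1 - pi_hat) / ((n - n1) / n))
    design = np.column_stack([np.ones(n), x])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-design @ beta))
        grad = design.T @ (w * (p - y))
        hess = (design * (w * p * (1 - p))[:, None]).T @ design
        step = np.linalg.solve(hess, grad)
        beta -= step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

In this sketch the labeled sample enters only through the weights and the weighted loss, while the unlabeled sample enters only through the estimate of the case proportion. Standard errors are not computed; they would need to account for the extra variability from estimating the case proportion.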