Robust Semi-supervised Inference for Logistic Regression Models with Case-Control Binary Data

全卓君; 郑明; 郁文

doi:10.3969/j.issn.1001-4268.2023.05.008

SSI for Case-control Binary Data under Possibly Mis-specified Logistic Models

Abstract

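Abstract (Chinese version): Using semi-supervised inference, this paper studies estimation of the target parameter when the labeled data come from case-control sampling and the logistic regression model is misspecified. In binary classification, case-control sampling is commonly used to address imbalanced data, and logistic regression is a common choice of statistical model. In practice, however, the model assumption is often wrong. If the logistic regression model is misspecified, the labeled data obtained by case-control sampling alone cannot identify the case proportion, and therefore cannot be used to estimate the target parameter, that is, the parameter that minimizes the population risk. With the help of semi-supervised inference, we first use the labeled and unlabeled data to obtain an unbiased estimate of the case proportion, and then, based on this estimate, construct an inverse-probability-weighted loss function that corrects the sampling bias in the case-control data. We prove that the minimizer of this loss function is a consistent and asymptotically normal estimator of the target parameter, and that the variance of its limiting distribution can be consistently estimated from the observed data. Simulation studies show that the proposed method yields consistent estimates of the target parameter.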

Abstract: Semi-supervised data consist of a labeled data set with both responses and covariates and an unlabeled data set with covariates only. Inference based on semi-supervised data is attracting growing interest in statistics. When the response in the labeled data is binary, case-control sampling is commonly used to alleviate the imbalanced data structure. When the response and the covariates satisfy the logistic model, the slope parameter of the model can be consistently estimated even under case-control sampling. However, when the logistic model is incorrectly specified for the data, the case-control sample alone cannot consistently estimate the population risk minimizer. With the help of the unlabeled data, we derive a consistent estimator of the population proportion of cases. An inverse-probability-weighted loss function is then developed to obtain a consistent estimator of the population risk minimizer. The proposed estimators are shown to be asymptotically normal, and the limiting variance-covariance matrix can be consistently estimated. Simulation results show that the proposed method performs reasonably well in finite samples. A real data example is also analyzed for illustration.
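The two-step idea summarized above (estimate the case proportion with the help of unlabeled covariates, then minimize an inverse-probability-weighted logistic loss) can be illustrated with a minimal sketch. The abstract does not state how the case proportion is estimated, so the moment-matching estimator below is an assumption made only for illustration, not the authors' estimator; the function names `estimate_case_proportion` and `ipw_logistic_fit` and the choice of moment function `g` are likewise hypothetical.

```python
import numpy as np


def estimate_case_proportion(x_cases, x_controls, x_unlabeled,
                             g=lambda x: x[:, 0]):
    """Moment-matching estimate of pi = P(Y = 1) (illustrative assumption).

    Relies on the mixture identity
        E[g(X)] = pi * E[g(X) | Y=1] + (1 - pi) * E[g(X) | Y=0],
    which holds whether or not the logistic model is correct.
    It requires the two class-conditional means of g to differ.
    """
    m1 = g(x_cases).mean()      # mean of g among sampled cases
    m0 = g(x_controls).mean()   # mean of g among sampled controls
    mu = g(x_unlabeled).mean()  # population mean of g from unlabeled data
    pi_hat = (mu - m0) / (m1 - m0)
    return float(np.clip(pi_hat, 1e-6, 1 - 1e-6))


def ipw_logistic_fit(x, y, pi_hat, n_iter=50, tol=1e-10):
    """Minimize the inverse-probability-weighted logistic loss by Newton's method.

    Each observation is weighted by (population class share) / (sample class
    share), which undoes the over-representation of cases in the sample.
    Returns the coefficient vector (intercept first).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), x])
    sample_share = y.mean()
    w = np.where(y == 1, pi_hat / sample_share,
                 (1 - pi_hat) / (1 - sample_share))
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-design @ beta))
        grad = design.T @ (w * (p - y)) / n
        hess = (design * (w * p * (1 - p))[:, None]).T @ design / n
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Under a possibly misspecified model, the weighted fit targets the minimizer of the population logistic risk rather than the minimizer of the risk under the case-control sampling distribution. The sketch omits the variance estimation described in the abstract.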
