Biometrics. 2024 Mar 11;80(1):ujae002. doi: 10.1093/biomtc/ujae002

Semisupervised transfer learning for evaluation of model classification performance

Linshanshan Wang 1, Xuan Wang 2, Katherine P Liao 3, Tianxi Cai 4
PMCID: PMC10926267  PMID: 38465982

ABSTRACT

In many modern machine learning applications, changes in covariate distributions and difficulty in acquiring outcome information have posed challenges to robust model training and evaluation. Numerous transfer learning methods have been developed to robustly adapt the model itself to unlabeled target populations using existing labeled data in a source population. However, there is a paucity of literature on transferring performance metrics, especially receiver operating characteristic (ROC) parameters, of a trained model. In this paper, we aim to evaluate the performance of a trained binary classifier on an unlabeled target population based on ROC analysis. We propose Semisupervised Transfer lEarning of Accuracy Measures (STEAM), an efficient three-step estimation procedure that employs (1) double-index modeling to construct calibrated density ratio weights and (2) robust imputation to leverage the large amount of unlabeled data to improve estimation efficiency. We establish the consistency and asymptotic normality of the proposed estimator under correct specification of either the density ratio model or the outcome model. We also correct for potential overfitting bias in the estimators in finite samples with cross-validation. We compare our proposed estimators to existing methods and show reductions in bias and gains in efficiency through simulations. We illustrate the practical utility of the proposed method by evaluating the prediction performance of a phenotyping model for rheumatoid arthritis (RA) on a temporally evolving EHR cohort.

Keywords: Covariate shift, model evaluation, receiver operating characteristic curve, risk prediction, semisupervised learning, transfer learning

1. INTRODUCTION

The increasing adoption of electronic health records (EHR) for clinical care has enabled many opportunities for translational and clinical research (Hripcsak and Albers, 2013; Miotto et al., 2016). However, a major limitation of EHR data is the lack of precise information on disease phenotypes. Readily available EHR features such as diagnosis codes often exhibit poor specificity that can bias or depower the downstream study (Liao et al., 2010; Cipparone et al., 2015). Meanwhile, manual annotation of phenotypes via chart review is laborious and unscalable. To overcome these challenges, numerous supervised phenotyping classification algorithms have been developed for prediction of disease status of individual patients (Carroll et al., 2012; Xia et al., 2013; Liao et al., 2015; Liao et al., 2015).

To build a supervised phenotyping model, researchers often manually annotate outcome information for a randomly chosen subset of the large EHR dataset, and then use these labeled observations for model training and validation. Once a model is trained and validated in one population, it is often then used for other populations, such as a similar population from a different healthcare system (Carroll et al., 2012) or the same EHR cohort updated over time (Huang et al., 2020). Generalizing a classification model trained in a source population to a target population, however, requires caution due to changes in the covariate distribution owing to heterogeneity in the underlying patient populations. It is well known that such changes in the distribution of the covariates, often referred to as covariate shift (Shimodaira, 2000), may have a large impact on the performance of a prediction algorithm trained in the source cohort when it is applied to the target cohort (Rasmy et al., 2018). It is thus important to assess the portability of the algorithm in the face of such changes in datasets before its usage in the target sites.

In these scenarios, methods that accurately evaluate the performance of these algorithms in the target cohort are especially valuable. If outcome information is available for some patients in the target cohort, for instance via chart review, then standard methods can be applied to estimate the prediction performance parameters on the target dataset. Recently, robust semisupervised methods have also been proposed to improve the estimation efficiency for prediction performance measures (Gronsbell and Cai, 2018). The issue, however, is that performing chart review for every new target population may not be feasible due to time and resource constraints. It is thus highly desirable to robustly and efficiently estimate the prediction performance parameters in the target cohort without requiring any outcome information from the target dataset.

Although the problem of covariate shift has been extensively studied in the recent transfer learning literature, most existing work has focused on robust adaptation of the model itself to the target population (Chen et al., 2016; Wen et al., 2014; Rotnitzky et al., 2012; Reddi et al., 2015; Liu et al., 2020). There is a paucity of literature on transferring performance metrics of a trained model based on the commonly used ROC analysis (Pepe, 2003). Inoue and Uno (2018) used density ratio weighting methods to assess model performance in the target population. However, their estimator does not protect against misspecification of the density ratio model. Xu and Tibshirani (2022) proposed a method based on the parametric bootstrap that does not require estimation of the density ratio model. However, their focus was on the out-of-sample error of the prediction model, not on estimating the accuracy measures based on the ROC curve. Moreover, their method does not protect against misspecification of the prediction model. Doubly robust methods for estimating accuracy measures have been proposed in the verification bias literature (Alonzo and Pepe, 2005; Rotnitzky et al., 2006; Fluss et al., 2009), and recently in the context of estimating the AUC in a target population whose data distribution differs from that of the source population (Li et al., 2022). Nevertheless, no existing method leverages both unlabeled source and target data to improve estimation efficiency. To address this gap in methodology, we propose Semisupervised Transfer lEarning of Accuracy Measures (STEAM) for the unlabeled target cohort. We employ a double-index modeling approach to construct calibrated density ratio weights similar to Cheng et al. (2020), incorporating both the covariate shift model and the outcome imputation model. Although they also take a reweighting form, the STEAM estimators are doubly robust, with consistency attained when either the density ratio model or the classification model is correctly specified. The large amount of unlabeled source and target data enables us to efficiently estimate the density ratio and the imputation model.

The rest of the paper is organized as follows: we formalize the problem of interest in Section 2. In Section 3, we detail the STEAM estimation procedure along with a perturbation resampling procedure for inference. We conduct simulation studies to evaluate the finite sample robustness and efficiency of the proposed method in Section 4. We then illustrate the practical utility of the proposed method on evaluating the performance of a phenotyping model for rheumatoid arthritis (RA) on temporally evolving EHR cohorts in Section 5.

2. PROBLEM SETUP

2.1. Notations and assumptions

Let Y denote the binary phenotype of interest, and Inline graphic the (p + 1)-dimensional covariate vector for some fixed p. The source data (indexed by S = 1), obtained as a simple random sample from the source population, consist of n independent and identically distributed (iid) labeled data Inline graphic, where Inline graphic, and N iid unlabeled data Inline graphic, where Inline graphic. The target data (indexed by S = 0), obtained as a simple random sample from the target population, consist of Nt iid unlabeled data Inline graphic, where Inline graphic. The entire observed data Inline graphic can be written as Inline graphic, where L is an indicator of labeling. We assume that N/Nt → c ∈ (0, ∞) and Inline graphic such that n^{1/2}N^{−1/3} → 0, as n → ∞. We assume random labeling in the source population and hence Inline graphic. Furthermore, we require the standard covariate shift assumption with Inline graphic, where Inline graphic is the joint density of Inline graphic evaluated at Inline graphic given S = s, Inline graphic denotes the probability density of Inline graphic at Inline graphic given S = s, and Inline graphic denotes the conditional density of Y at y given Inline graphic, which is shared between the source and target populations. Throughout, we let Inline graphic and Inline graphic denote the probability distribution and expectation taken over the population S = s.

2.2. Classification rule

The aim is to assess the accuracy of a binary classification rule for Y given Inline graphic developed using the source data. Although our STEAM approach can be easily extended to other regression models, for the ease of presentation, we focus on the working generalized linear model (GLM)

2.2. (1)

where g is a specified smooth link function such that Inline graphic is convex in Inline graphic. One can also easily accommodate nonlinear effects by including nonlinear basis functions of Inline graphic. We consider standard estimation procedures for Inline graphic. As an example, one may employ an adaptive LASSO penalized estimator (Zou, 2006) to stabilize the finite sample estimation when p is not very small relative to n:

2.2. (2)

where ℓg is the log-likelihood from the assumed model, Inline graphic is a root-n consistent initial estimate of Inline graphic, Inline graphic denotes the subvector of Inline graphic that excludes the intercept β0, and λ_{μ,n} is a tuning parameter such that n^{1/2}λ_{μ,n} → 0 and n^{(1 − ν)(1 + γ)/2}λ_{μ,n} → ∞ with γ > 2ν(1 − ν) (Zou and Zhang, 2009). Following results given in Zou and Zhang (2009), Inline graphic is root-n consistent for the population parameter

2.2.

and attains the oracle property under sparsity of Inline graphic, regardless of whether the fitted model (1) holds. Our proposed STEAM procedure is not restricted to any specific estimation procedure, provided that Inline graphic is a regular root-n estimator for Inline graphic.
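As a point of reference for the estimation step above, the following is a minimal sketch of a standard two-step adaptive LASSO logistic fit: an initial, nearly unpenalized fit supplies the adaptive weights, the design matrix is rescaled by those weights, and an L1-penalized fit is mapped back to the original scale. The choices of γ, the penalty strength C, and the simulated data are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a two-step adaptive LASSO logistic fit (Zou, 2006).
# gamma, C, and the toy data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def adaptive_lasso_logistic(X, y, gamma=1.0, C=1.0):
    # Step 1: nearly unpenalized initial fit supplies adaptive weights
    init = LogisticRegression(penalty="l2", C=1e4, max_iter=5000).fit(X, y)
    w = np.abs(init.coef_.ravel()) ** gamma
    # Step 2: L1 fit on the rescaled design, then map back to the X scale
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=C,
                             max_iter=5000).fit(X * w, y)
    return fit.intercept_[0], fit.coef_.ravel() * w

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (rng.uniform(size=400) < 1 / (1 + np.exp(-(X[:, 0] - X[:, 1])))).astype(int)
b0, beta = adaptive_lasso_logistic(X, y)
```

Rescaling columns by the initial coefficient magnitudes is equivalent to the coefficient-specific penalty weights in the adaptive LASSO objective, which is why near-zero initial coefficients are driven to exactly zero in the second step.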

2.3. Accuracy measures of interest

We are interested in estimating the accuracy parameters of the classification model based on Inline graphic. With Inline graphic from the source population, we may classify subjects in the target population with the highest scores of Inline graphic, that is, Inline graphic for some fraction c, as having the phenotype of interest, where Inline graphic. To identify a desirable threshold c and evaluate the classification accuracy in the unlabeled target population, we consider the commonly employed receiver operating characteristic (ROC) analysis (Pepe, 2003). Specifically, the true positive rate (TPR) and false positive rate (FPR) of the binary classification rule Inline graphic in the target population are defined as:

2.3.

The ROC curve, Inline graphic, summarizes the trade-off between the TPR and FPR functions. The area under the ROC curve, Inline graphic, summarizes the overall prediction performance of Inline graphic. Additionally, a threshold value c0 for classifying an individual as having the phenotype, namely Inline graphic, is often chosen to achieve a desired FPR level u0. Once the threshold value c0 is determined, one may summarize the predictive performance of Inline graphic, based on positive predictive value (PPV) and negative predictive value (NPV) defined as

2.3.
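For reference, when labeled outcomes are in hand these quantities reduce to simple empirical averages; the minimal numpy sketch below (with illustrative toy scores and labels) translates the definitions directly.

```python
# Empirical TPR/FPR at a threshold c, and the Mann-Whitney form of the
# AUC, as direct translations of the definitions above (toy data only).
import numpy as np

def tpr_fpr(s, y, c):
    s, y = np.asarray(s, float), np.asarray(y, int)
    return np.mean(s[y == 1] >= c), np.mean(s[y == 0] >= c)

def auc(s, y):
    s, y = np.asarray(s, float), np.asarray(y, int)
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return np.mean((diff > 0) + 0.5 * (diff == 0))  # ties count one half

s = np.array([0.9, 0.8, 0.4, 0.3, 0.2])
y = np.array([1, 1, 0, 1, 0])
tpr, fpr = tpr_fpr(s, y, c=0.5)   # -> 2/3, 0.0
a = auc(s, y)                      # -> 5/6
```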

Directly estimating these accuracy parameters using the source data will result in inconsistency due to covariate shift as well as potential misspecification of the model for Inline graphic (Liu et al., 2021). To correct for the bias, it is natural to incorporate importance sampling weighting. For example, with trained Inline graphic, one may estimate Inline graphic as:

2.3. (3)

where Inline graphic is an estimate for the density ratio

2.3.

Inline graphic, and Inline graphic. Although direct estimation of Inline graphic, P(S = 1), and P(S = 0) is not possible, as these quantities are not identifiable under the sampling scheme considered, we can use the inverse-odds weights estimated from the dataset in (3), since they equal the density ratio weight up to a proportionality constant (Steingrimsson et al., 2021).
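The reweighting strategy in (3) can be sketched end to end: fit a working model for P(S = 1 | X) on the pooled unlabeled data, convert fitted probabilities into inverse-odds weights for the labeled source subjects, and form a weighted TPR. The logistic working model and the toy data below are simplifying assumptions for illustration, not the paper's exact specification.

```python
# Sketch of an inverse-odds weighted TPR in the spirit of (3).
# The logistic model for P(S=1 | X) and the toy data are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def weighted_tpr(score_lab, y_lab, X_lab, X_src, X_tgt, c):
    X_pool = np.vstack([X_src, X_tgt])
    S = np.r_[np.ones(len(X_src)), np.zeros(len(X_tgt))]
    p1 = LogisticRegression(max_iter=5000).fit(X_pool, S).predict_proba(X_lab)[:, 1]
    w = (1 - p1) / p1                      # inverse odds ~ density ratio
    keep = y_lab == 1
    return np.sum(w[keep] * (score_lab[keep] >= c)) / np.sum(w[keep])

rng = np.random.default_rng(1)
X_src, X_tgt = rng.normal(size=(500, 3)), rng.normal(size=(500, 3))
X_lab = X_src[:200]
score_lab = X_lab[:, 0]
y_lab = (X_lab[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
t = weighted_tpr(score_lab, y_lab, X_lab, X_src, X_tgt, c=0.0)
```

When the source and target covariate distributions coincide, as in this toy example, the weights hover around one and the weighted TPR is close to the unweighted TPR; when they differ, the weights shift mass toward source subjects that resemble the target.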

One issue with this re-weighting estimator is that its consistency heavily relies on the consistency of the estimate of the weights, and can perform poorly when the density ratio model is not well estimated. Moreover, the estimator suffers from low precision when n is small, which is often the case in EHR settings. To improve the robustness and efficiency of the estimation for the accuracy parameters, we next propose the STEAM estimators that leverage the unlabeled data in the source population.

3. METHODS

3.1. Estimation procedure

To motivate the STEAM estimation procedure, we first note that

3.1.

where Inline graphic. Even though Inline graphic, Inline graphic under covariate shift and possible misspecification of the outcome model (1). To enable robust transfer of knowledge from the source to target, while leveraging unlabeled data, the STEAM estimation procedure consists of three key steps: (I) construct calibrated density ratio weight Inline graphic via double index modeling; (II) derive nonparametric calibrated estimator of Inline graphic; and (III) obtain the final STEAM estimator by projecting to the unlabeled target data using the estimated Inline graphic.

Step I: calibrated density ratio weight estimation. The calibrated density ratio weight estimator involves both an initial estimator for Inline graphic and the estimated outcome model. To estimate Inline graphic, we consider a parametric working model,

3.1. (4)

where gπ is a known link function, Inline graphic is a q-dimensional vector of transformed Inline graphic that may include nonlinear bases of Inline graphic with the first element being 1, and Inline graphic is the unknown parameter. Since π( · ) does not involve the outcome, we may use the unlabeled data to obtain a regularized estimator for Inline graphic:

3.1.

with properly chosen tuning parameters λ_{π,N} and γ, and an initial estimator Inline graphic for Inline graphic, where the expectation is taken over the pooled source and target population. Similar to the outcome model, one may show that Inline graphic is a regular estimator for Inline graphic. We then obtain a nonparametrically calibrated estimate of Inline graphic by smoothing S over the two scores Inline graphic and Inline graphic. Specifically, we estimate Inline graphic as Inline graphic,

3.1.

where Inline graphic and Inline graphic are bivariate scores that represent the covariate in the directions of Inline graphic and Inline graphic. We then construct the calibrated density ratio weight for the ith subject with Inline graphic as:

3.1.

where Inline graphic, K( · ) is a smooth symmetric density function, and h_1 = O(N^{−1/6}).
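A minimal sketch of this calibration step, assuming for simplicity that the two fitted scores are already in hand: smooth the source indicator over the bivariate scores with a Gaussian product kernel and form the odds of S = 0 versus S = 1 at each labeled point. The bandwidth constant and the toy scores are illustrative assumptions.

```python
# Sketch of the calibrated density-ratio weight: bivariate kernel
# smoothing of the source indicator S over two fitted scores, then the
# odds of S=0 vs S=1. Bandwidth and toy data are assumptions.
import numpy as np

def calibrated_weights(scores_lab, scores_all, S_all, h):
    # scores_lab: (n, 2) labeled-source scores; scores_all: (N, 2) pooled
    d = (scores_lab[:, None, :] - scores_all[None, :, :]) / h
    K = np.exp(-0.5 * (d ** 2).sum(axis=2))       # Gaussian product kernel
    p1 = (K * S_all).sum(axis=1) / K.sum(axis=1)  # smoothed P(S=1 | scores)
    p1 = np.clip(p1, 1e-6, 1 - 1e-6)
    return (1 - p1) / p1                          # calibrated weight

rng = np.random.default_rng(2)
N = 2000
scores_all = rng.normal(size=(N, 2))
S_all = rng.integers(0, 2, size=N).astype(float)
scores_lab = scores_all[S_all == 1][:200]
w = calibrated_weights(scores_lab, scores_all, S_all, h=N ** (-1 / 6))
```

Here the source and target score distributions coincide by construction, so the calibrated weights concentrate around one.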

Step II: calibrated estimate of Inline graphic. In step II, we obtain a nonparametrically calibrated estimate of the conditional risk function Inline graphic via kernel smoothing as:

3.1.

where Inline graphic, Inline graphic, Inline graphic, Inline graphic, and h2 is a bandwidth such that Inline graphic and Inline graphic as n → ∞.
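The smoothing in Step II can be sketched as a one-dimensional Nadaraya-Watson regression of the labeled outcomes on the fitted score; the Gaussian kernel, bandwidth rate, and simulated data below are illustrative assumptions rather than the paper's exact choices.

```python
# Sketch of Step II: 1D Nadaraya-Watson estimate of P(Y=1 | score = c)
# from labeled data. Kernel, bandwidth, and toy data are assumptions.
import numpy as np

def smooth_risk(score_lab, y_lab, grid, h):
    d = (grid[:, None] - score_lab[None, :]) / h
    K = np.exp(-0.5 * d ** 2)                     # Gaussian kernel
    return (K * y_lab).sum(axis=1) / np.clip(K.sum(axis=1), 1e-12, None)

rng = np.random.default_rng(3)
n = 300
score = rng.normal(size=n)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-2 * score))).astype(int)
m_hat = smooth_risk(score, y, grid=np.linspace(-1, 1, 5), h=n ** (-0.2))
```

Because the smoothing is over a single fitted score rather than the full covariate vector, it remains feasible with the small labeled sample, which is the point made in Remark 1.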

Step III: projection to Inline graphic with Inline graphic. Finally, we construct a plug-in STEAM estimator by projecting to Inline graphic using the imputed outcomes Inline graphic. Specifically, the STEAM estimator for Inline graphic is obtained as:

3.1.

Remark 1

The calibrated estimator Inline graphic consistently estimates Inline graphic if either (1) or (4) is correct. The STEAM approach to imputation using Inline graphic is appealing as it achieves double robustness by leveraging the large unlabeled data and only requiring 1D smoothing over the small labeled data.

Similarly, we may construct a STEAM estimator of Inline graphic as:

3.1.

The STEAM estimates of the ROC curve and the resulting AUC are

3.1.

Furthermore, when a threshold value Inline graphic is chosen to attain a prespecified FPR level, say u0, we may obtain STEAM estimators of Inline graphic and Inline graphic as Inline graphic and Inline graphic, where Inline graphic

3.1.

Inline graphic is the semisupervised (SS) estimator of the prevalence Inline graphic.
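For reference, given estimates of TPR(c0), FPR(c0), and the prevalence ρ, the PPV and NPV follow from Bayes' rule; a minimal plug-in sketch (the input values are illustrative only):

```python
# PPV and NPV as plug-in functions of TPR, FPR, and prevalence rho,
# via Bayes' rule; the input values are illustrative only.
def ppv_npv(tpr, fpr, rho):
    ppv = tpr * rho / (tpr * rho + fpr * (1 - rho))
    npv = (1 - fpr) * (1 - rho) / ((1 - fpr) * (1 - rho) + (1 - tpr) * rho)
    return ppv, npv

ppv, npv = ppv_npv(tpr=0.8, fpr=0.05, rho=0.2)  # -> 0.8, 0.95
```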

In the supervised setting, it is well known that when the same set of labeled data is used for both model training and evaluation, the apparent estimators of accuracy measures may be overly optimistic in finite samples (Efron, 1986). Similarly, in our setting, the above STEAM estimator may also suffer from overfitting bias, since the labeled data Inline graphic are used to estimate both the classification model and the calibrated conditional risk function used for imputation in the estimation of the accuracy measures. Following strategies similar to those in Gronsbell and Cai (2018), it is not difficult to employ a K-fold cross-validation (CV) procedure to correct for the overfitting bias in the STEAM estimators. Detailed forms of the CV estimator can be found in Supplementary Material C. For all numerical studies, we employ the CV correction for the STEAM estimator.
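The CV correction can be sketched generically: split the labeled data into K folds, refit on K − 1 folds, evaluate on the held-out fold, and average. The `fit` and `evaluate` callables below are placeholders standing in for the model-fitting and STEAM evaluation steps, not the paper's exact procedure.

```python
# Generic K-fold CV sketch for bias-corrected accuracy estimation.
# `fit` and `evaluate` are placeholder callables, not the paper's code.
import numpy as np

def cv_estimate(X, y, fit, evaluate, K=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    vals = []
    for k in range(K):
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])              # train on K-1 folds
        vals.append(evaluate(model, X[folds[k]], y[folds[k]]))
    return float(np.mean(vals))

# Toy check: a trivial "model" that memorizes the training prevalence
X = np.zeros((10, 1))
y = np.array([0, 1] * 5, dtype=float)
est = cv_estimate(X, y, fit=lambda X, y: y.mean(),
                  evaluate=lambda m, X, y: m)
```

Because the held-out fold never touches the fitting step, the averaged estimate avoids the optimism of the apparent (resubstitution) estimator.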

3.2. Asymptotic results for STEAM estimators

Though analogous results hold for all STEAM estimators, we present the main results for Inline graphic.

Theorem 1

Under assumptions outlined in the Supplementary Material, when either (1) or (4) is correct, Inline graphic is consistent for Inline graphic. Moreover,


The proof for consistency directly follows from Remark 1 and is outlined in Supplementary Material A. To show asymptotic normality of STEAM estimators, let

3.2.

Inline graphic , and Inline graphic. We show that

3.2.

where Inline graphic is the influence function of Inline graphic, and C2 is defined in Supplementary Material B. The weak convergence of Inline graphic thus follows.

Even though the asymptotic variance can be obtained from the influence function, a direct estimate is challenging because it involves unknown conditional density functions that are hard to estimate explicitly, particularly under model misspecification. We instead propose a simple resampling procedure for inference based on Jin et al. (2001) in the next section.

3.3. Perturbation resampling procedure for inference

Now we outline the perturbation resampling procedure for inference. Let Inline graphic be non-negative iid random variables with unit mean and variance that are independent of the observed data. We first obtain perturbed estimates Inline graphic as:

3.3.

where Inline graphic is perturbed initial estimates obtained from analogously perturbing its estimating equations. This leads to the perturbed estimates of the density ratio weights:

3.3.

where,

3.3.

with Inline graphic and Inline graphic. Since Inline graphic and Inline graphic are estimated based on Inline graphic, their contribution to the asymptotic variance is of higher order when N ≫ n. We then obtain the perturbed estimate of the conditional risk function as:

3.3.

Finally, we calculate the perturbed Inline graphic as:

3.3.

It can be shown using arguments from Jin et al. (2001) that the asymptotic distribution of Inline graphic coincides with that of Inline graphic. This allows us to consistently approximate the SE of Inline graphic based on the empirical SD of resamples Inline graphic. The confidence intervals of Inline graphic can be constructed by using the empirical percentiles of the resamples. We can obtain perturbed counterparts of Inline graphic, Inline graphic, Inline graphic, Inline graphic, respectively, denoted as Inline graphic, Inline graphic, Inline graphic, Inline graphic, and obtain SE estimates and construct confidence intervals accordingly.
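To make the resampling recipe concrete, the sketch below applies it to a simple weighted-mean functional: draw nonnegative iid weights with unit mean and variance (here 4 × Beta(0.5, 1.5), the choice used in Section 4), recompute the estimate under each weight set, and read the SE and percentile CI off the resamples. The functional and data are illustrative simplifications of the STEAM estimators.

```python
# Perturbation resampling sketch for a weighted-mean functional.
# Weights G_i ~ 4*Beta(0.5, 1.5) have unit mean and unit variance.
import numpy as np

def perturb_se_ci(x, B=500, seed=0, alpha=0.05):
    rng = np.random.default_rng(seed)
    G = 4 * rng.beta(0.5, 1.5, size=(B, len(x)))
    stats = (G * x).sum(axis=1) / G.sum(axis=1)   # perturbed estimates
    se = stats.std(ddof=1)                        # SE = SD of resamples
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return se, (lo, hi)

rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, size=400)
se, ci = perturb_se_ci(x)
```

Unlike the bootstrap, each resample reuses every observation, so nonsmooth steps such as thresholding remain well behaved across resamples.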

For practical use, an issue with this perturbation resampling procedure is that calculating Inline graphic requires 2D smoothing, which can be very time-consuming. To overcome this difficulty, we propose an approximate resampling procedure that is more computationally feasible. We first generate Inline graphic, where Inline graphic has dimension N × B, followed by the perturbed Inline graphic estimates Inline graphic, where Inline graphic has dimension p × B. We then approximate the perturbed estimates Inline graphic by

3.3.

where

3.3.

The perturbed estimate of conditional risk can then be obtained as Inline graphic, and the perturbed Inline graphic thus follows. We evaluate the performance of the perturbation resampling procedure with and without approximation through simulations.

4. SIMULATION STUDIES

We performed extensive simulations to assess the finite-sample bias, SE (empirical SD of the point estimates), and root mean square error (RMSE) of our proposed STEAM estimators compared to alternative estimators. In separate simulations we also examined the performance of the perturbation procedure, with and without approximation, for inference based on the STEAM estimators. Across all numerical studies, including the data application in Section 5, we used the adaptive LASSO to estimate Inline graphic and Inline graphic, where we chose λ_{μ,n} and λ_{π,N} using a modified BIC criterion (Minnier et al., 2011). For the bandwidth in the 2D smoothing for π( · ), we used a second-order Gaussian product kernel with a plug-in bandwidth Inline graphic, where Inline graphic is the sample SD of either Inline graphic or Inline graphic in the source data. Prior to smoothing, Inline graphic and Inline graphic were standardized and transformed by a probability integral transform based on the normal cumulative distribution function to induce approximately uniformly distributed inputs, which can improve finite-sample performance (Wand et al., 1991). For the nonparametric smoothing for m( · ), we used the Gaussian kernel with Inline graphic, where Inline graphic is the empirical SD of Inline graphic. As we focused on binary outcomes, we specified logistic link functions g_μ(u) = g_π(u) = 1/(1 + e^{−u}).
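The pre-smoothing transformation described above can be sketched as follows: standardize the score, apply the normal CDF so the smoothing input is approximately uniform, and form a plug-in bandwidth from the sample SD at the stated O(N^{−1/6}) rate. The constant in the bandwidth is an illustrative assumption.

```python
# Sketch of the probability integral transform and plug-in bandwidth
# used before 2D smoothing; the bandwidth constant is an assumption.
import numpy as np
from scipy.stats import norm

def transform_and_bandwidth(score, N):
    z = (score - score.mean()) / score.std(ddof=1)
    u = norm.cdf(z)                     # approx. uniform for normal scores
    h = u.std(ddof=1) * N ** (-1 / 6)   # plug-in bandwidth, O(N^{-1/6})
    return u, h

rng = np.random.default_rng(5)
score = rng.normal(2.0, 3.0, size=5000)
u, h = transform_and_bandwidth(score, N=5000)
```

Mapping the scores onto a bounded, roughly uniform scale keeps the kernel smoother from being dominated by sparse tail regions, which is the finite-sample benefit noted by Wand et al. (1991).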

For comparison, we considered the following alternative estimators of the performance accuracy parameters: (1) the naive estimators, where the performance accuracy parameters are estimated using CV on the labeled source dataset (source); (2) estimation based on 100 randomly labeled validation samples in the target dataset (target_labeled); (3) the standard weighted estimators based on CV (weighted); and (4) the doubly robust estimator proposed by Alonzo and Pepe (2005) with CV (DR-aug). This estimator is a special case of Rotnitzky et al. (2006) and Fluss et al. (2009) when Y is independent of S conditional on Inline graphic.

Throughout, we generated Inline graphic from Inline graphic, Inline graphic, and Inline graphic. We considered simulating data that roughly resembled the EHR data example in terms of model parameters but were simple enough to be more broadly relevant. To this end, we considered p = 10 covariates with variance σ2 = 1, and correlations ρ = 0.2. These simulations were varied over different model specifications, predictive strength of Inline graphic to S, and sample sizes. We considered the following data generating mechanisms:

4.

Either Inline graphic or Inline graphic were potentially misspecified by neglecting the pairwise interaction terms in model fitting:

  1. Both correct: Fitting Inline graphic and Inline graphic.

  2. Inline graphic correct and Inline graphicmisspecified: Fitting Inline graphic and Inline graphic.

  3. Inline graphic correct and Inline graphicmisspecified: Fitting Inline graphic and Inline graphic.

  4. Both Inline graphic and Inline graphicmisspecified: Fitting Inline graphic and Inline graphic.

where Inline graphic, Inline graphic. The outcome prevalence was approximately Inline graphic, and the sample sizes in the source and target were roughly equal. We considered N = 10000 and n = 200, 400. The results in each scenario are summarized over 1000 simulated datasets.

Note that if Inline graphic, then Inline graphic follows a logistic regression model (Efron, 1975; Harrell and Lee, 1985). In particular, if the covariance matrices Inline graphic, then Inline graphic would follow a logistic regression model without interaction terms. Otherwise if we assume Inline graphic, then Inline graphic would then follow a logistic regression model that involves interaction and quadratic terms of X1,..., Xp.

We present results for Inline graphic, Inline graphic, Inline graphic for u0 = 0.05. Table 1 presents the bias, SE, and RMSE for scenarios (1)-(3) with moderate strength of Inline graphic to S with N = 10000. As expected, the source estimators exhibit noticeable bias, regardless of the size of n. This is consistent with the consensus that the naive estimator based on source data is not appropriate for prediction performance assessment in the presence of covariate shift. The target_labeled estimators are unbiased but have large SEs due to the limited sample size. The weighted estimators are singly robust and have noticeable bias when Inline graphic is misspecified. Among scenarios (1)-(3), the bias of STEAM is small relative to the RMSE, confirming its double robustness. In scenario (4), where both models are misspecified (Supplementary Material D), the STEAM estimators, although slightly biased, are more robust than the other estimators and attained the lowest RMSE.

TABLE 1.

Bias, SE, and RMSE of estimators under different model misspecification scenarios with moderate association of Inline graphic to S, over 1000 simulated datasets. N = 10000, k = 5, n = 200 or 400. All values are multiplied by 100.

both correct π(X) misspecified μ(X) misspecified
Estimator Bias SE RMSE Bias SE RMSE Bias SE RMSE
Cutoff target_labeled −0.19 4.86 4.86 −0.19 4.86 4.86 −0.40 4.88 4.89
n = 200 source 1.02 4.24 4.35 1.02 4.24 4.35 1.24 4.19 4.36
weighted −0.33 4.64 4.64 −0.34 4.47 4.47 −0.05 4.42 4.42
DR-aug −0.53 4.42 4.45 −0.55 4.23 4.27 −0.35 4.45 4.46
STEAM −0.21 3.47 3.47 −0.39 3.42 3.43 −0.06 3.52 3.51
n = 400 source 1.94 3.04 3.59 1.94 3.04 3.59 1.80 3.17 3.64
weighted 0.18 3.05 3.05 0.14 3.06 3.06 0.04 3.12 3.12
DR-aug 0.22 2.91 2.92 0.35 2.91 2.93 0.25 3.11 3.12
STEAM 0.08 2.52 2.52 0.01 2.49 2.48 0.09 2.53 2.53
AUC target_labeled 0.57 4.74 4.76 0.57 4.74 4.76 0.62 4.60 4.63
n = 200 source 1.16 3.66 3.83 1.16 3.66 3.83 0.90 3.68 3.78
weighted 0.39 3.89 3.90 0.97 3.79 3.90 0.17 3.87 3.86
DR-aug 0.41 3.73 3.75 0.52 3.66 3.70 0.72 3.77 3.84
STEAM −0.38 3.71 3.72 −0.10 3.65 3.65 −0.68 3.73 3.78
n = 400 source 1.19 2.49 2.76 1.19 2.49 2.76 1.10 2.53 2.75
weighted 0.37 2.63 2.65 1.06 2.64 2.84 0.26 2.66 2.66
DR-aug 0.40 2.58 2.61 0.45 2.61 2.64 0.33 2.61 2.63
STEAM −0.36 2.51 2.52 −0.18 2.52 2.52 −0.27 2.53 2.53
TPR target_labeled 0.43 10.52 10.53 0.43 10.52 10.53 0.59 10.57 10.59
n = 200 source 3.60 9.20 9.88 3.60 9.20 9.88 3.33 9.36 9.93
weighted 0.82 9.26 9.3 1.87 8.84 9.04 0.73 9.14 9.17
DR-aug 0.80 9.18 9.21 0.78 8.88 8.91 0.25 9.02 9.02
STEAM 0.22 7.60 7.60 0.66 7.52 7.55 −0.12 7.75 7.75
n = 400 source 2.63 6.52 7.03 2.63 6.52 7.03 2.84 6.89 7.44
weighted 0.58 6.47 6.5 1.83 6.49 6.88 1.17 6.53 6.62
DR-aug 0.50 6.14 6.16 0.73 6.11 6.15 0.36 6.44 6.45
STEAM −0.32 5.62 5.63 −0.15 5.59 5.58 −0.36 5.69 5.68
PPV target_labeled 0.21 5.79 5.78 0.21 5.78 5.78 0.82 5.37 5.43
n = 200 source 3.66 3.82 5.29 3.66 3.82 5.29 3.66 3.88 5.33
weighted 0.89 4.24 4.33 2.04 4.01 4.50 1.24 4.16 4.34
DR-aug −0.79 4.30 4.37 −0.81 4.49 4.55 −0.32 4.49 4.50
STEAM −0.41 4.02 4.04 −0.01 3.93 3.93 −0.57 3.89 3.93
n = 400 source 2.99 2.24 3.74 2.99 2.24 3.73 3.06 2.36 3.87
weighted 0.83 2.53 2.66 1.46 2.54 2.93 1.10 2.53 2.75
DR-aug −0.42 2.64 2.67 −0.34 2.88 2.90 -0.41 2.69 2.72
STEAM −0.31 2.44 2.45 −0.25 2.45 2.46 −0.29 2.46 2.47
NPV target_labeled 0.56 5.87 5.88 0.56 5.87 5.88 0.76 5.88 5.92
n = 200 source −2.26 4.63 5.14 −2.26 4.63 5.14 −2.35 4.65 5.20
weighted −0.28 4.50 4.50 0.17 4.43 4.42 −0.48 4.28 4.30
DR-aug −0.42 4.43 4.45 −0.25 4.44 4.45 −0.38 4.38 4.40
STEAM −0.30 4.04 4.04 −0.13 4.04 4.03 −0.39 4.10 4.11
n = 400 source −2.47 3.70 4.44 −2.47 3.70 4.44 −2.39 3.71 4.41
weighted −0.32 3.40 3.40 0.19 3.44 3.43 −0.26 3.34 3.34
DR-aug −0.51 3.27 3.31 −0.30 3.37 3.38 −0.40 3.37 3.39
STEAM −0.33 3.17 3.17 −0.24 3.17 3.17 −0.35 3.16 3.17

Figure 1 presents the number of labels in the target cohort needed to achieve the same RMSE as each estimator across different predictive strengths of Inline graphic to S (weak, moderate, and strong) and misspecification scenarios for N = 10000, n = 200, k = 5. A larger number of labels needed in the target cohort represents a lower RMSE and a higher efficiency of the estimator. The STEAM estimators are more efficient than source, weighted, and DR-aug across the settings. The amount of efficiency gain of STEAM over creating labels in the target varies with the predictive strength of Inline graphic to S. We observe that when Inline graphic is weakly or moderately correlated with S, using the n = 200 labels in the source for the STEAM estimators is equivalent to creating ∼120-200 new labels in the target in terms of RMSE, but when Inline graphic is strongly associated with S, using STEAM is less advantageous, equivalent to only ∼100-150 labels in the target in terms of RMSE.

FIGURE 1.

FIGURE 1

Number of labels in the target cohort required to achieve the same RMSE as each estimator, by three model misspecification scenarios for different predictive strengths of Inline graphic to S (weak, moderate, and strong), over 1000 simulated datasets. A larger number of labels denotes greater efficiency (lower MSE) relative to target_labeled. N = 10000, n = 200, k = 5.

To implement the perturbation procedure, we used the weights Gi ∼ 4 × Beta(0.5, 1.5) and 1000 sets of Inline graphic for SE and CI estimation. We evaluated the perturbations only in the scenario where both Inline graphic and Inline graphic are correctly specified and Inline graphic has moderate predictive strength for S, with N = 10000 and n = 200. The results are presented in Table 2. The SEs estimated with and without approximation closely tracked the empirical standard error. The coverage of the Inline graphic CIs was also close to the nominal level.

TABLE 2.

Performance of perturbation resampling for STEAM in 1000 simulated datasets when both Inline graphic, Inline graphic are correctly specified. Emp SE: Empirical SE over simulations. ASE: average of estimated SE based on the SD of perturbed estimates; CovProb: coverage probability of Inline graphic CIs; ASE_approx: average of estimated SE based on the SD of perturbed estimates using the approximated method; CovProb_approx: coverage probability of Inline graphic CIs using the approximated method. N = 10000, n = 200, k = 5.

Estimate Emp SE ASE CovProb ASE_approx CovProb_approx
Cutoff 0.678 0.040 0.037 0.938 0.037 0.925
AUC 0.779 0.032 0.030 0.929 0.036 0.968
TPR 0.286 0.053 0.051 0.920 0.055 0.929
PPV 0.823 0.032 0.033 0.944 0.036 0.960
NPV 0.620 0.040 0.036 0.918 0.043 0.964

5. APPLICATION TO RHEUMATOID ARTHRITIS (RA) PHENOTYPING MODEL

To illustrate the performance of our proposed estimators, we applied our procedures to evaluate a phenotyping algorithm for classifying RA disease status, using EHR data from Partners Healthcare (Liao et al., 2010). The source dataset from 2009 consists of 20 451 total patients, of whom n = 267 were labeled with the true disease status through chart review by a rheumatologist. Both narrative and codified data were available to develop the prediction model (p = 21). The narrative variables, including disease diagnoses and medications, were obtained with natural language processing. The codified data included ICD-9 codes, electronic prescriptions, and laboratory values. Healthcare utilization was also included. The transformation u↦log (1 + u) was applied to all count variables to mitigate instability in the estimation due to skewness in their distributions. An RA phenotyping model, named ALASSO-2009 (Supplementary Material E), was trained on the labeled data from 2009. We are interested in assessing its performance in classifying the disease status of patients in the same EHR database across different time windows: 2011 (Nt = 25,405, nt = 100), 2013 (Nt = 30,804, nt = 100), 2015 (Nt = 36,095, nt = 150), and 2017 (Nt = 39,550, nt = 200). Each of these consists of a large number of unlabeled data with the same set of covariates available. For the dataset from each year, a random subset was selected and labeled with the true RA status through manual chart review by trained physicians, serving as a validation set to verify our results. A summary of patient characteristics in the labeled and unlabeled data in each year is given in Supplementary Material F.

For comparison, we considered the source, target_labeled, weighted, and DR-aug estimators as in Section 4. We also considered semisupervised estimation using labeled and unlabeled data in the target population as proposed by Gronsbell and Cai (2018) (target_SSL). The estimated AUC of the model on the target datasets is presented in Figure 2. The AUC estimates for the target datasets obtained using the validation sets (target_labeled and target_SSL) are noticeably different from those estimated from the source data (source), especially in 2015 and 2017. This is not too surprising, since many changes in the database have occurred over the years: for instance, patient characteristics, such as age, have undergone changes. Moreover, the EHR system was switched to Epic (Epic Systems Corporation) and the International Classification of Diseases (ICD) coding system was changed from version 9 to version 10 around 2015-2016, which may also explain the lower AUC in 2015 and 2017. The point estimates based on our proposed STEAM method are similar to target_labeled, suggesting the stability of the proposed procedure in a real data setting. Moreover, we observe substantial efficiency gains of STEAM over target_labeled, weighted, and DR-aug. As a result, we obtain a more precise estimate of the prediction performance of the phenotyping algorithm using the STEAM method. The estimated ROC curves along with 95% point-wise confidence bands can be found in Supplementary Material G.

FIGURE 2.

Estimated AUC with 95% CI of ALASSO-2009 on the datasets in 2011, 2013, 2015, and 2017 based on different methods.

The estimates of the cutoff values and the corresponding TPR, PPV, and NPV estimates of the binary classification rule for u0 = 0.05 and 0.10, together with their 95% CIs, are presented in Figure 3. For choosing the cutoff values, STEAM is ∼4.2-7.2 times as efficient as target_labeled, and ∼1.1-2.1 times as efficient as DR-aug. For each accuracy measure, the STEAM estimators are 3.1-5.8 times as efficient as target_labeled, and the efficiency gain compared with DR-aug is at least …, with gains as high as ….
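The efficiency comparisons above are ratios of variances: with two reported SEs for the same parameter, the relative efficiency is simply the squared ratio of the SEs. A small sketch with hypothetical SE values:

```python
def relative_efficiency(se_ref: float, se_new: float) -> float:
    """Efficiency of the new estimator relative to the reference:
    the ratio of variances, i.e. (SE_ref / SE_new)^2."""
    return (se_ref / se_new) ** 2

# Toy numbers: if a reference estimator has SE 0.05 and a new one has
# SE 0.025, halving the SE quadruples the efficiency.
print(relative_efficiency(0.05, 0.025))  # -> 4.0
```

So a "4 times as efficient" estimator delivers, for the same precision, the equivalent of four times the labeled sample size.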

FIGURE 3.

Estimated cutoffs and the corresponding TPR, PPV, and NPV estimates of the binary classification rule for u0 = 0.05 (top) and u0 = 0.10 (bottom) of ALASSO-2009. Error bars denote 95% CIs.

6. DISCUSSION

In this paper, we proposed a robust and efficient transfer learning procedure for model evaluation rather than model fitting. Our estimators are doubly robust, i.e., they are consistent when either the prediction model or the weighting model is correctly specified. We addressed potential overfitting bias in our SS estimators with CV and also developed a perturbation resampling procedure for making inference. We illustrated the performance and practical utility of the proposed estimators through simulations and an application to an RA phenotyping model.

We focused on estimating the ROC, which is a measure of discrimination. One aspect that these measures do not address is the calibration of a prediction model, that is, the extent to which the model-based predicted probability for a risk group matches the actual percentage of patients with the disease. In the presence of covariate shift, if the prediction model is misspecified, the risk model trained with source data may not be well calibrated in the target data. Our proposed STEAM procedure, in fact, already performs model calibration during its process, since the method hinges on the estimation of the calibrated risk.
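As a simple illustration of the calibration notion discussed above (not the STEAM procedure itself), one can compare the mean predicted risk with the observed event rate within quantile-based risk groups. A Python sketch on simulated data, where the grouping rule and all values are hypothetical:

```python
import numpy as np

def calibration_by_group(pred, y, n_groups=10):
    """Compare mean predicted risk with the observed event rate within
    quantile-based risk groups (a basic calibration check)."""
    pred, y = np.asarray(pred), np.asarray(y)
    edges = np.quantile(pred, np.linspace(0, 1, n_groups + 1))
    # Assign each prediction to its quantile bin (clip keeps the max in the top bin).
    idx = np.clip(np.searchsorted(edges, pred, side="right") - 1, 0, n_groups - 1)
    return [(pred[idx == g].mean(), y[idx == g].mean())
            for g in range(n_groups) if np.any(idx == g)]

rng = np.random.default_rng(1)
p = rng.uniform(size=2000)   # simulated predicted risks
y = rng.binomial(1, p)       # outcomes drawn from those same risks => well calibrated
pairs = calibration_by_group(p, y)
# For a well-calibrated model the two entries of each pair track each other closely.
```

When covariate shift degrades calibration, the observed rates drift away from the predicted means in some risk groups, which this kind of check makes visible.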

We have assumed that the true outcomes are labeled completely at random in the source population, which is reasonable if investigators control the labeling. For example, with the goal of validating phenotyping models on EHR data, researchers can design the study such that they randomly select the examples to be labeled and ask domain experts to perform chart review on the selected sample. However, this assumption could be restrictive if labeling was stratified by some known factors or if some available records were not labeled for research purposes. If the selection of labels is not random by design, the unlabeled source data effectively become a second target dataset, which also differs from the main target data. In that case, their utility in improving the estimation of accuracy parameters for the original target data becomes less clear. Other refinements and extensions to the proposed approach are possible. For instance, when the prediction model is clearly misspecified, we may incorporate an additional outcome model in the estimation procedure to preserve the doubly robust property; that is, the estimator is consistent if either the new outcome model or the weighting model is correct.

We focused on the setting where labeled data in the target population are not readily available or are expensive to obtain. However, when a small amount of labeled data exists in the target population, as considered by Cai et al. (2022), it is also of interest to develop methods that utilize those existing labels to further improve the robustness and efficiency of estimation.

Throughout, we focused on the setting with fixed p but accommodated settings in which p is not small relative to n in finite samples via regularized estimation. However, when p is large so that ν is large, a large power parameter γ would be required to maintain the oracle properties, leading to an unstable penalty and poor finite-sample performance. Estimation in the setting with p ≫ n would require different theoretical justifications and warrants additional research.

Supplementary Material

ujae002_Supplemental_Files

Web Appendices, Tables, and Figures referenced in Sections 3-5, as well as a zip file containing the R code are available with this paper at the Biometrics website on Oxford Academic.

Acknowledgement

The authors thank the associate editor and referees for their insightful feedback and suggestions.

Contributor Information

Linshanshan Wang, Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States.

Xuan Wang, Division of Biostatistics, Department of Population Health Sciences, University of Utah, Salt Lake City, UT 84108, United States.

Katherine P Liao, Division of Rheumatology, Brigham and Women’s Hospital, Boston, MA 02115, United States.

Tianxi Cai, Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States.

FUNDING

This work was supported by National Institutes of Health grants R01LM013614 and P30AR072577.

CONFLICT OF INTEREST

None declared.

DATA AVAILABILITY

The data that support the findings in this paper are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

References

  1. Alonzo T. A., Pepe M. S. (2005). Assessing accuracy of a continuous screening test in the presence of verification bias. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54, 173–190.
  2. Cai T., Li M., Liu M. (2022). Semi-supervised triply robust inductive transfer learning. arXiv preprint arXiv:2209.04977.
  3. Carroll R. J., Thompson W. K., Eyler A. E., Mandelin A. M., Cai T., Zink R. M. et al. (2012). Portability of an algorithm to identify rheumatoid arthritis in electronic health records. Journal of the American Medical Informatics Association, 19, e162–e169.
  4. Chen X., Monfort M., Liu A., Ziebart B. D. (2016). Robust covariate shift regression. In Artificial Intelligence and Statistics, pp. 1270–1279. PMLR.
  5. Cheng D., Chakrabortty A., Ananthakrishnan A. N., Cai T. (2020). Estimating average treatment effects with a double-index propensity score. Biometrics, 76, 767–777.
  6. Cipparone C. W., Withiam-Leitch M., Kimminau K. S., Fox C. H., Singh R., Kahn L. (2015). Inaccuracy of ICD-9 codes for chronic kidney disease: a study from two practice-based research networks (PBRNs). The Journal of the American Board of Family Medicine, 28, 678–682.
  7. Efron B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70, 892–898.
  8. Efron B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81, 461–470.
  9. Fluss R., Reiser B., Faraggi D., Rotnitzky A. (2009). Estimation of the ROC curve under verification bias. Biometrical Journal, 51, 475–490.
  10. Gronsbell J. L., Cai T. (2018). Semi-supervised approaches to efficient evaluation of model prediction performance. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80, 579–594.
  11. Harrell F. E., Lee K. L. (1985). A comparison of the discrimination of discriminant analysis and logistic regression under multivariate normality. In Biostatistics: Statistics in Biomedical, Public Health and Environmental Sciences, pp. 333–343.
  12. Hripcsak G., Albers D. J. (2013). Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association, 20, 117–121.
  13. Huang S., Huang J., Cai T., Dahal K. P., Cagan A., He Z. et al. (2020). Impact of ICD10 and secular changes on electronic medical record rheumatoid arthritis algorithms. Rheumatology, 59, 3759–3766.
  14. Inoue E., Uno H. (2018). APPEstimation: Adjusted prediction model performance estimation (Version 0.1.1).
  15. Jin Z., Ying Z., Wei L. (2001). A simple resampling method by perturbing the minimand. Biometrika, 88, 381–390.
  16. Li B., Gatsonis C., Dahabreh I. J., Steingrimsson J. A. (2022). Estimating the area under the ROC curve when transporting a prediction model to a target population. Biometrics, 79, 2382–2393.
  17. Liao K. P., Cai T., Gainer V., Goryachev S., Zeng-treitler Q., Raychaudhuri S. et al. (2010). Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care and Research, 62, 1120–1127.
  18. Liao K. P., Cai T., Savova G. K., Murphy S. N., Karlson E. W., Ananthakrishnan A. N. et al. (2015). Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ, 350, h1885.
  19. Liao K. P., Ananthakrishnan A. N., Kumar V., Xia Z., Cagan A., Gainer V. S. et al. (2015). Methods to develop an electronic medical record phenotype algorithm to compare the risk of coronary artery disease across 3 chronic disease cohorts. PLoS ONE, 10, e0136651.
  20. Liu M., Zhang Y., Cai T. (2020). Doubly robust covariate shift regression with semi-nonparametric nuisance models. arXiv preprint arXiv:2010.02521.
  21. Liu M., Zhang Y., Zhou D. (2021). Double/debiased machine learning for logistic partially linear model. The Econometrics Journal, 24, 559–588.
  22. Minnier J., Tian L., Cai T. (2011). A perturbation method for inference on regularized regression estimates. Journal of the American Statistical Association, 106, 1371–1382.
  23. Miotto R., Li L., Kidd B. A., Dudley J. T. (2016). Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports, 6, 1–10.
  24. Pepe M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.
  25. Rasmy L., Wu Y., Wang N., Geng X., Zheng W. J., Wang F. et al. (2018). A study of generalizability of recurrent neural network-based predictive models for heart failure onset risk using a large and heterogeneous EHR data set. Journal of Biomedical Informatics, 84, 11–16.
  26. Reddi S., Poczos B., Smola A. (2015). Doubly robust covariate shift correction. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1).
  27. Rotnitzky A., Faraggi D., Schisterman E. (2006). Doubly robust estimation of the area under the receiver-operating characteristic curve in the presence of verification bias. Journal of the American Statistical Association, 101, 1276–1288.
  28. Rotnitzky A., Lei Q., Sued M., Robins J. M. (2012). Improved double-robust estimation in missing data and causal inference models. Biometrika, 99, 439–456.
  29. Shimodaira H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 227–244.
  30. Steingrimsson J. A., Gatsonis C., Dahabreh I. J. (2021). Transporting a prediction model for use in a new target population. arXiv preprint arXiv:2101.11182.
  31. Wand M. P., Marron J. S., Ruppert D. (1991). Transformations in density estimation. Journal of the American Statistical Association, 86, 343–353.
  32. Wen J., Yu C.-N., Greiner R. (2014). Robust learning under uncertain test distributions: relating covariate shift to model misspecification. In International Conference on Machine Learning, pp. 631–639. PMLR.
  33. Xia Z., Secor E., Chibnik L. B., Bove R. M., Cheng S., Chitnis T. et al. (2013). Modeling disease severity in multiple sclerosis using electronic health records. PLoS ONE, 8, e78927.
  34. Xu H., Tibshirani R. (2022). Estimation of prediction error with known covariate shift. arXiv preprint arXiv:2205.01849.
  35. Zou H., Zhang H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4), 1733.
  36. Zou H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.



Articles from Biometrics are provided here courtesy of Oxford University Press
