ABSTRACT
In many modern machine learning applications, changes in covariate distributions and the difficulty of acquiring outcome information have posed challenges to robust model training and evaluation. Numerous transfer learning methods have been developed to robustly adapt a model to unlabeled target populations using existing labeled data from a source population. However, there is a paucity of literature on transferring performance metrics, especially receiver operating characteristic (ROC) parameters, of a trained model. In this paper, we aim to evaluate the performance of a trained binary classifier in an unlabeled target population based on ROC analysis. We propose Semisupervised Transfer lEarning of Accuracy Measures (STEAM), an efficient three-step estimation procedure that employs (1) double-index modeling to construct calibrated density ratio weights and (2) robust imputation to leverage the large amount of unlabeled data to improve estimation efficiency. We establish the consistency and asymptotic normality of the proposed estimator under correct specification of either the density ratio model or the outcome model. We also correct for potential overfitting bias in the estimators in finite samples with cross-validation. We compare our proposed estimators to existing methods and show reductions in bias and gains in efficiency through simulations. We illustrate the practical utility of the proposed method by evaluating the prediction performance of a phenotyping model for rheumatoid arthritis (RA) on a temporally evolving EHR cohort.
Keywords: Covariate shift, model evaluation, receiver operating characteristic curve, risk prediction, semisupervised learning, transfer learning
1. INTRODUCTION
The increasing adoption of electronic health records (EHR) for clinical care has enabled many opportunities for translational and clinical research (Hripcsak and Albers, 2013; Miotto et al., 2016). However, a major limitation of EHR data is the lack of precise information on disease phenotypes. Readily available EHR features such as diagnosis codes often exhibit poor specificity, which can bias or reduce the power of downstream studies (Liao et al., 2010; Cipparone et al., 2015). Meanwhile, manual annotation of phenotypes via chart review is laborious and unscalable. To overcome these challenges, numerous supervised phenotyping classification algorithms have been developed for prediction of the disease status of individual patients (Carroll et al., 2012; Xia et al., 2013; Liao et al., 2015a, 2015b).
To build a supervised phenotyping model, researchers often manually annotate outcome information for a randomly chosen subset of the large EHR dataset, and then use these labeled observations for model training and validation. Once a model is trained and validated in one population, it is often then used in other populations, such as a similar population from a different healthcare system (Carroll et al., 2012) or the same EHR cohort updated over time (Huang et al., 2020). Generalizing a classification model trained in a source population to a target population, however, requires caution because the covariate distribution can change owing to heterogeneity in the underlying patient populations. It is well known that such changes in the distribution of the covariates, often referred to as covariate shift (Shimodaira, 2000), may have a large impact on the performance of a prediction algorithm trained in the source cohort when it is applied to the target cohort (Rasmy et al., 2018). It is thus important to assess the portability of the algorithm in the face of such changes before its use in the target sites.
In these scenarios, methods to accurately evaluate the performance of these algorithms in the target cohort are especially valuable. If outcome information is available for some patients in the target cohort, for instance via chart review, then standard methods can be applied to estimate the prediction performance parameters on the target dataset. Recently, robust semisupervised methods have also been proposed to improve the estimation efficiency for prediction performance measures (Gronsbell and Cai, 2018). The issue, however, is that performing chart review for every new target population may not be feasible due to time and resource constraints. It is thus highly desirable to robustly and efficiently estimate the prediction performance parameters in the target cohort without requiring any outcome information in the target dataset.
Although the problem of covariate shift has been extensively studied in the recent transfer learning literature, most existing work has focused on robust adaptation of the model itself to the target population (Chen et al., 2016; Wen et al., 2014; Rotnitzky et al., 2012; Reddi et al., 2015; Liu et al., 2020). There is a paucity of literature on transferring performance metrics of a trained model based on the commonly used ROC analysis (Pepe, 2003). Inoue and Uno (2018) used density ratio weighting to assess model performance in the target population; however, their estimator does not protect against misspecification of the density ratio model. Xu and Tibshirani (2022) proposed a method based on a parametric bootstrap that does not require estimation of the density ratio model. However, their focus was on the out-of-sample error of the prediction model, not on estimating accuracy measures based on the ROC curve, and their method does not protect against misspecification of the prediction model. Doubly robust methods for estimating accuracy measures have been proposed in the verification bias literature (Alonzo and Pepe, 2005; Rotnitzky et al., 2006; Fluss et al., 2009), and recently in the context of estimating the AUC in a target population whose data distribution differs from that of the source population (Li et al., 2022). Nevertheless, no existing method leverages both unlabeled source and target data to improve estimation efficiency. To address this gap in methodology, we propose Semisupervised Transfer lEarning of Accuracy Measures (STEAM) in the unlabeled target cohort. We employ a double-index modeling approach to construct calibrated density ratio weights similar to Cheng et al. (2020), incorporating both the covariate shift model and the outcome imputation model. Although also taking a reweighting form, the STEAM estimators are doubly robust, with consistency attained when either the density ratio model or the classification model is correctly specified. The large amount of unlabeled source and target data enables us to efficiently estimate the density ratio and the imputation model.
The rest of the paper is organized as follows: we formalize the problem of interest in Section 2. In Section 3, we detail the STEAM estimation procedure along with a perturbation resampling procedure for inference. We conduct simulation studies to evaluate the finite sample robustness and efficiency of the proposed method in Section 4. We then illustrate the practical utility of the proposed method on evaluating the performance of a phenotyping model for rheumatoid arthritis (RA) on temporally evolving EHR cohorts in Section 5.
2. PROBLEM SETUP
2.1. Notations and assumptions
Let Y denote the binary phenotype of interest, and X the (p + 1)-dimensional covariate vector for some fixed p. The source data (indexed by S = 1), obtained from a simple random sample from the source population, consist of n independent and identically distributed (iid) labeled observations {(Y_i, X_i), i = 1, …, n}, and N iid unlabeled observations {X_i, i = n + 1, …, n + N}. The target data (indexed by S = 0), obtained from a simple random sample from the target population, consist of Nt iid unlabeled observations {X_i, i = n + N + 1, …, n + N + Nt}. The entire observed data can be written as {(L_iY_i, L_i, X_i, S_i), i = 1, …, n + N + Nt}, where L is an indicator of labeling. We assume that N/Nt → c ∈ (0, ∞) and that n grows with N such that n^{1/2}N^{−1/3} → 0 as n → ∞. We assume random labeling in the source population and hence L ⊥ (Y, X) | S = 1. Furthermore, we require the standard covariate shift assumption,

$$p_s(\mathbf{x}, y) = p_s(\mathbf{x})\, p(y \mid \mathbf{x}), \qquad s = 0, 1,$$

where p_s(x, y) is the joint density of (X, Y) evaluated at (x, y) given S = s, p_s(x) denotes the probability density of X at x given S = s, and p(y | x) denotes the conditional density of Y at y given X = x, which is shared between the source and target populations. Throughout, we let P_s and E_s denote the probability distribution and expectation taken over the population S = s.
2.2. Classification rule
The aim is to assess the accuracy of a binary classification rule for Y given X developed using the source data. Although our STEAM approach can be easily extended to other regression models, for ease of presentation, we focus on the working generalized linear model (GLM)

$$\mu(\mathbf{x}) = P(Y = 1 \mid \mathbf{X} = \mathbf{x}) = g(\boldsymbol{\beta}^{\mathsf{T}} \mathbf{x}), \qquad (1)$$

where g is a specified smooth link function such that the corresponding negative log likelihood is convex in β. One can also easily accommodate nonlinear effects by including nonlinear basis functions of X. We consider standard estimation procedures for β. As an example, one may employ an adaptive LASSO penalized estimator (Zou, 2006) to stabilize the finite sample estimation when p is not very small relative to n:

$$\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \left\{ \ell_g(\boldsymbol{\beta}) - \lambda_{\mu,n} \sum_{j=1}^{p} \frac{|\beta_j|}{|\tilde{\beta}_j|^{\gamma}} \right\}, \qquad (2)$$

where ℓg is the log likelihood from the assumed model, β̃ is a root-n consistent initial estimate of β, (β_1, …, β_p) denotes the subvector of β that excludes the intercept β_0, and λμ,n is a tuning parameter such that n^{1/2}λμ,n → 0 and n^{(1 − ν)(1 + γ)/2}λμ,n → ∞ with γ > 2ν(1 − ν) (Zou and Zhang, 2009). Following the results given in Zou and Zhang (2009), β̂ is root-n consistent for the population parameter

$$\bar{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \mathbb{E}_1\{\ell_g(\boldsymbol{\beta})\}$$

and attains the oracle property under sparsity of β̄, regardless of whether the fitted model (1) holds. Our proposed STEAM procedure is not restricted to any specific estimation procedure, provided that β̂ is a regular root-n estimator for β̄.
2.3. Accuracy measures of interest
We are interested in estimating the accuracy parameters of the classification model based on β̄ᵀX. With β̂ obtained from the source population, we may classify the subjects in the target population with the highest scores of g(β̂ᵀX), that is, g(β̂ᵀX) ≥ c for some cutoff c, as having the phenotype of interest. To identify a desirable threshold c and evaluate the classification accuracy in the unlabeled target population, we consider the commonly employed receiver operating characteristic (ROC) analysis (Pepe, 2003). Specifically, the true positive rate (TPR) and false positive rate (FPR) of the binary classification rule I{g(β̄ᵀX) ≥ c} in the target population are defined as:

$$\mathrm{TPR}(c) = \mathbb{P}_0\{g(\bar{\boldsymbol{\beta}}^{\mathsf{T}} \mathbf{X}) \ge c \mid Y = 1\}, \qquad \mathrm{FPR}(c) = \mathbb{P}_0\{g(\bar{\boldsymbol{\beta}}^{\mathsf{T}} \mathbf{X}) \ge c \mid Y = 0\}.$$

The ROC curve, ROC(u) = TPR{FPR^{−1}(u)}, summarizes the trade-off between the TPR and FPR functions. The area under the ROC curve, AUC = ∫₀¹ ROC(u) du, summarizes the overall prediction performance of g(β̄ᵀX). Additionally, a threshold value c0 for classifying an individual as having the phenotype, namely I{g(β̄ᵀX) ≥ c0}, is often chosen to achieve a desired FPR level u0. Once the threshold value c0 is determined, one may summarize the predictive performance of the classification rule based on the positive predictive value (PPV) and negative predictive value (NPV), defined as

$$\mathrm{PPV}(c_0) = \mathbb{P}_0\{Y = 1 \mid g(\bar{\boldsymbol{\beta}}^{\mathsf{T}} \mathbf{X}) \ge c_0\}, \qquad \mathrm{NPV}(c_0) = \mathbb{P}_0\{Y = 0 \mid g(\bar{\boldsymbol{\beta}}^{\mathsf{T}} \mathbf{X}) < c_0\}.$$
Directly estimating these accuracy parameters using the source data will result in inconsistency due to the covariate shift as well as potential misspecification of the model for P(Y = 1 | X) (Liu et al., 2021). To correct for the bias, it is natural to incorporate importance sampling weighting. For example, with the trained β̂, one may estimate TPR(c) as:

$$\widehat{\mathrm{TPR}}_w(c) = \frac{\sum_{i=1}^{n} \hat{w}(\mathbf{X}_i)\, Y_i\, I\{g(\hat{\boldsymbol{\beta}}^{\mathsf{T}} \mathbf{X}_i) \ge c\}}{\sum_{i=1}^{n} \hat{w}(\mathbf{X}_i)\, Y_i}, \qquad (3)$$

where ŵ(x) is an estimate for the density ratio

$$w(\mathbf{x}) = \frac{p_0(\mathbf{x})}{p_1(\mathbf{x})} = \frac{P(S = 0 \mid \mathbf{X} = \mathbf{x})}{P(S = 1 \mid \mathbf{X} = \mathbf{x})} \cdot \frac{P(S = 1)}{P(S = 0)}.$$

Although direct estimates of p_0(x), p_1(x), P(S = 1), and P(S = 0) are not possible, as they are not identifiable under the sampling scheme considered, we can use the inverse-odds weights estimated from the pooled dataset in (3), since they are equal to the density ratio weight up to a proportionality constant (Steingrimsson et al., 2021).
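To fix ideas, the following R sketch computes the inverse-odds weighted TPR estimator in (3) on simulated data. It is a minimal sketch under assumed inputs: the simulated covariate shift, the variable names, and the plain logistic fits (standing in for the regularized estimators described above) are ours for exposition, not the paper's implementation.

```r
## Minimal sketch of the inverse-odds weighted TPR estimator in (3).
set.seed(1)
n <- 200; Nt <- 1000; p <- 5
x_src <- matrix(rnorm(n * p), n, p)                   # labeled source covariates
y_src <- rbinom(n, 1, plogis(x_src %*% rep(0.8, p)))  # labeled source outcomes
x_tgt <- matrix(rnorm(Nt * p, mean = 0.3), Nt, p)     # unlabeled target covariates

# Outcome model fitted on the labeled source data (plain logistic fit
# standing in for the adaptive LASSO estimator in (2))
fit_mu <- glm(y_src ~ x_src, family = binomial)
score_src <- plogis(cbind(1, x_src) %*% coef(fit_mu))

# Source-vs-target membership model for pi(x) = P(S = 1 | x); the
# inverse odds (1 - pi)/pi is proportional to the density ratio p0/p1
s_all <- c(rep(1, n), rep(0, Nt))
x_all <- rbind(x_src, x_tgt)
fit_pi <- glm(s_all ~ x_all, family = binomial)
pi_src <- plogis(cbind(1, x_src) %*% coef(fit_pi))
w <- (1 - pi_src) / pi_src

# Weighted TPR at a cutoff c0: weighted proportion of flagged cases
c0 <- 0.5
tpr_w <- sum(w * y_src * (score_src >= c0)) / sum(w * y_src)
tpr_w
```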
One issue with this reweighting estimator is that its consistency relies heavily on consistent estimation of the weights, and it can perform poorly when the density ratio model is not well estimated. Moreover, the estimator suffers from low precision when n is small, which is often the case in EHR settings. To improve the robustness and efficiency of the estimation of the accuracy parameters, we next propose the STEAM estimators, which leverage the unlabeled data in the source population.
3. METHODS
3.1. Estimation procedure
To motivate the STEAM estimation procedure, we first note that

$$\mathrm{TPR}(c) = \frac{\mathbb{E}_0[m(\mathbf{X})\, I\{g(\bar{\boldsymbol{\beta}}^{\mathsf{T}} \mathbf{X}) \ge c\}]}{\mathbb{E}_0\{m(\mathbf{X})\}},$$

where m(x) = E(Y | X = x). Even though E_1(Y | X) = E_0(Y | X) = m(X), m(X) may differ from μ(X) under covariate shift and possible misspecification of the outcome model (1). To enable robust transfer of knowledge from the source to the target, while leveraging unlabeled data, the STEAM estimation procedure consists of three key steps: (I) construct calibrated density ratio weights ŵ(X) via double-index modeling; (II) derive a nonparametrically calibrated estimator m̂(·) of m(·); and (III) obtain the final STEAM estimator by projecting to the unlabeled target data using the estimated m̂(·).
Step I: calibrated density ratio weight estimation. The calibrated density ratio weight estimator involves both an initial estimator for P(S = 1 | X) and the estimated outcome model. To estimate P(S = 1 | X), we consider a parametric working model,

$$\pi(\mathbf{x}) = P(S = 1 \mid \mathbf{X} = \mathbf{x}) = g_\pi\{\boldsymbol{\alpha}^{\mathsf{T}} \boldsymbol{\Phi}(\mathbf{x})\}, \qquad (4)$$

where gπ is a known link function, Φ(x) is a q-dimensional vector of transformed x that may include nonlinear bases of x, with the first element being 1, and α is the unknown parameter. Since π(·) does not involve the outcome, we may use the unlabeled data to obtain a regularized estimator for α:

$$\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha}} \left\{ \ell_\pi(\boldsymbol{\alpha}) - \lambda_{\pi,N} \sum_{j=2}^{q} \frac{|\alpha_j|}{|\tilde{\alpha}_j|^{\gamma}} \right\},$$

with properly chosen tuning parameters λπ,N and γ, and an initial estimator α̃ for ᾱ = argmax_α E{ℓπ(α)}, where the expectation is taken over the pooled source and target population. Similar to the outcome model, one may show that α̂ is a regular estimator for ᾱ. We then obtain a nonparametrically calibrated estimate of P(S = 1 | X) by smoothing S over the two scores α̂ᵀΦ(X) and β̂ᵀX. Specifically, we estimate P(S = 1 | X = x) as π̂(x),

$$\hat{\pi}(\mathbf{x}) = \frac{\sum_{i} K_{h_1}(\hat{\mathbf{Z}}_i - \hat{\mathbf{z}})\, S_i}{\sum_{i} K_{h_1}(\hat{\mathbf{Z}}_i - \hat{\mathbf{z}})},$$

where Ẑ_i = (α̂ᵀΦ(X_i), β̂ᵀX_i)ᵀ and ẑ = (α̂ᵀΦ(x), β̂ᵀx)ᵀ are bivariate scores that represent the covariates in the directions of ᾱ and β̄, and the sum is taken over the pooled source and target data. We then construct the calibrated density ratio weight for the ith subject with S_i = 1 as:

$$\hat{w}_i = \frac{1 - \hat{\pi}(\mathbf{X}_i)}{\hat{\pi}(\mathbf{X}_i)},$$

where K_{h}(u) = h^{−2}K(u/h) for bivariate u, K(·) is a smooth symmetric density function, and h1 = O(N^{−1/6}).
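The double-index smoothing in Step I can be sketched compactly in R. The following is an illustration only, assuming a Gaussian product kernel and a common bandwidth for both scores; the function and argument names are ours, and the inputs stand in for the fitted scores α̂ᵀΦ(X) and β̂ᵀX.

```r
## Sketch of Step I: smooth S over the two estimated scores with a
## bivariate Gaussian product kernel, then form inverse-odds weights.
calibrated_weights <- function(score_alpha, score_beta, s, h1) {
  # score_alpha, score_beta: the two index scores; s: 1 = source, 0 = target
  pi_hat <- vapply(seq_along(s), function(i) {
    k <- dnorm((score_alpha - score_alpha[i]) / h1) *
         dnorm((score_beta  - score_beta[i])  / h1)
    sum(k * s) / sum(k)
  }, numeric(1))
  (1 - pi_hat) / pi_hat  # calibrated density ratio weight, up to a constant
}

## Example with a bandwidth of order N^(-1/6), as in the text
N <- 500
sa <- rnorm(N); sb <- 0.5 * sa + rnorm(N)
s  <- rbinom(N, 1, plogis(0.5 * sa))
w_cal <- calibrated_weights(sa, sb, s, h1 = sd(sa) * N^(-1/6))
```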
Step II: calibrated estimate of m(·). In Step II, we obtain a nonparametrically calibrated estimate of the conditional risk function m(x) = E(Y | X = x) via kernel smoothing as:

$$\hat{m}(\mathbf{x}) = \frac{\sum_{i=1}^{n} \hat{w}_i\, K_{h_2}\{\hat{\mu}(\mathbf{X}_i) - \hat{\mu}(\mathbf{x})\}\, Y_i}{\sum_{i=1}^{n} \hat{w}_i\, K_{h_2}\{\hat{\mu}(\mathbf{X}_i) - \hat{\mu}(\mathbf{x})\}},$$

where μ̂(x) = g(β̂ᵀx), ŵ_i is the calibrated density ratio weight from Step I, K_{h}(u) = h^{−1}K(u/h), the sum is taken over the labeled source data, and h2 is a bandwidth such that nh_2^2 → ∞ and nh_2^4 → 0 as n → ∞.
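A minimal R sketch of this weighted one-dimensional smoothing step, assuming a Gaussian kernel; the names are illustrative, and w_lab denotes the calibrated weights carried over from Step I.

```r
## Sketch of Step II: weighted Nadaraya-Watson smoothing of Y over the
## fitted risk score on the labeled source data.
m_hat <- function(score_new, score_lab, y_lab, w_lab, h2) {
  vapply(score_new, function(s0) {
    k <- w_lab * dnorm((score_lab - s0) / h2)
    sum(k * y_lab) / sum(k)
  }, numeric(1))
}
```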
Step III: projection to the target with m̂(·). Finally, we construct a plug-in STEAM estimator by projecting to the unlabeled target data using the imputed outcomes m̂(X_i). Specifically, the STEAM estimator for TPR(c) is obtained as:

$$\widehat{\mathrm{TPR}}(c) = \frac{\sum_{i: S_i = 0} \hat{m}(\mathbf{X}_i)\, I\{g(\hat{\boldsymbol{\beta}}^{\mathsf{T}} \mathbf{X}_i) \ge c\}}{\sum_{i: S_i = 0} \hat{m}(\mathbf{X}_i)}.$$
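The projection step then reduces to a short computation on the unlabeled target data. In the sketch below, m_tgt and score_tgt are assumed inputs holding the imputed outcomes m̂(X_i) and the fitted scores g(β̂ᵀX_i) for the target subjects.

```r
## Sketch of Step III: plug the imputed outcomes into the TPR functional
## over the unlabeled target data.
steam_tpr <- function(m_tgt, score_tgt, c0) {
  sum(m_tgt * (score_tgt >= c0)) / sum(m_tgt)
}
```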
Remark 1
The calibrated estimator m̂(·) consistently estimates m(·) if either (1) or (4) is correct. The STEAM approach to imputation using m̂(X) is appealing as it achieves double robustness by leveraging the large amount of unlabeled data while only requiring 1D smoothing over the small labeled dataset.
Similarly, we may construct a STEAM estimator of FPR(c) as:

$$\widehat{\mathrm{FPR}}(c) = \frac{\sum_{i: S_i = 0} \{1 - \hat{m}(\mathbf{X}_i)\}\, I\{g(\hat{\boldsymbol{\beta}}^{\mathsf{T}} \mathbf{X}_i) \ge c\}}{\sum_{i: S_i = 0} \{1 - \hat{m}(\mathbf{X}_i)\}}.$$

The STEAM estimates of the ROC curve and the resulting AUC are

$$\widehat{\mathrm{ROC}}(u) = \widehat{\mathrm{TPR}}\{\widehat{\mathrm{FPR}}^{-1}(u)\}, \qquad \widehat{\mathrm{AUC}} = \int_0^1 \widehat{\mathrm{ROC}}(u)\, du.$$

Furthermore, when a threshold value ĉ0 is chosen to attain a prespecified FPR level, say u0, we may obtain STEAM estimators of PPV(c0) and NPV(c0) as

$$\widehat{\mathrm{PPV}}(\hat{c}_0) = \frac{\hat{\varrho}\, \widehat{\mathrm{TPR}}(\hat{c}_0)}{\hat{\varrho}\, \widehat{\mathrm{TPR}}(\hat{c}_0) + (1 - \hat{\varrho})\, \widehat{\mathrm{FPR}}(\hat{c}_0)} \quad \text{and} \quad \widehat{\mathrm{NPV}}(\hat{c}_0) = \frac{(1 - \hat{\varrho})\{1 - \widehat{\mathrm{FPR}}(\hat{c}_0)\}}{(1 - \hat{\varrho})\{1 - \widehat{\mathrm{FPR}}(\hat{c}_0)\} + \hat{\varrho}\{1 - \widehat{\mathrm{TPR}}(\hat{c}_0)\}},$$

where

$$\hat{\varrho} = \frac{1}{N_t} \sum_{i: S_i = 0} \hat{m}(\mathbf{X}_i)$$

is the SS estimator of the prevalence ϱ = P_0(Y = 1).
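The remaining functionals follow from the same imputed outcomes, as in the sketch below (which reuses steam_tpr from the Step III sketch). The grid-based inversion of the FPR and the trapezoidal integration for the AUC are simplifications we adopt for illustration.

```r
## Sketch of the remaining STEAM functionals.
steam_fpr <- function(m_tgt, score_tgt, c0) {
  sum((1 - m_tgt) * (score_tgt >= c0)) / sum(1 - m_tgt)
}

steam_summary <- function(m_tgt, score_tgt, u0 = 0.05) {
  cuts <- sort(unique(score_tgt))
  tpr <- vapply(cuts, steam_tpr, numeric(1), m_tgt = m_tgt, score_tgt = score_tgt)
  fpr <- vapply(cuts, steam_fpr, numeric(1), m_tgt = m_tgt, score_tgt = score_tgt)
  auc <- -sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)  # trapezoid rule
  c0  <- cuts[which.min(abs(fpr - u0))]  # cutoff with FPR closest to u0
  rho <- mean(m_tgt)                     # SS estimate of the prevalence
  tpr0 <- steam_tpr(m_tgt, score_tgt, c0)
  fpr0 <- steam_fpr(m_tgt, score_tgt, c0)
  list(cutoff = c0, AUC = auc, TPR = tpr0,
       PPV = rho * tpr0 / (rho * tpr0 + (1 - rho) * fpr0),
       NPV = (1 - rho) * (1 - fpr0) /
             ((1 - rho) * (1 - fpr0) + rho * (1 - tpr0)))
}
```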
In the supervised setting, it is well known that when the same set of labeled data is used for both model training and evaluation, the apparent estimators of accuracy measures may be overly optimistic in finite samples (Efron, 1986). Similarly, in our setting, the STEAM estimators above may suffer from overfitting bias, since the labeled data are used for estimating both the classification model and the calibrated conditional risk function for imputation in the estimation of the accuracy measures. Following similar strategies to those given in Gronsbell and Cai (2018), it is not difficult to employ a K-fold cross-validation (CV) procedure to correct for the overfitting bias in the STEAM estimators. Detailed forms of the CV estimators can be found in Supplementary Material C. For all numerical studies, we employ the CV correction for the STEAM estimators.
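A skeleton of such a CV correction is sketched below; fit_models and steam_estimate are placeholders for Steps I-III, and the exact form of the corrected estimator is the one given in Supplementary Material C.

```r
## Sketch of a K-fold CV correction: refit the nuisance models on K-1
## folds, evaluate on the held-out fold, and average across folds.
cv_steam <- function(x_lab, y_lab, K = 5, fit_models, steam_estimate) {
  folds <- sample(rep(seq_len(K), length.out = nrow(x_lab)))
  fold_est <- vapply(seq_len(K), function(k) {
    fit <- fit_models(x_lab[folds != k, , drop = FALSE], y_lab[folds != k])
    steam_estimate(fit, x_lab[folds == k, , drop = FALSE])
  }, numeric(1))
  mean(fold_est)
}
```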
3.2. Asymptotic results for STEAM estimators
Though analogous results hold for all STEAM estimators, we present the main results for the STEAM estimator of TPR(c).
Theorem 1
Under the assumptions outlined in the Supplementary Material, when either (1) or (4) is correct, $\widehat{\mathrm{TPR}}(c)$ is consistent for TPR(c). Moreover, $n^{1/2}\{\widehat{\mathrm{TPR}}(c) - \mathrm{TPR}(c)\}$ converges weakly to a zero-mean Gaussian process.

The proof of consistency follows directly from Remark 1 and is outlined in Supplementary Material A. To show the asymptotic normality of the STEAM estimators, we establish the asymptotically linear expansion

$$n^{1/2}\{\widehat{\mathrm{TPR}}(c) - \mathrm{TPR}(c)\} = n^{-1/2} \sum_{i=1}^{n} \psi_i(c) + o_p(1),$$

where ψ_i(c) is the influence function of $\widehat{\mathrm{TPR}}(c)$; its explicit form, which involves a quantity C2, is given in Supplementary Material B. The weak convergence of $n^{1/2}\{\widehat{\mathrm{TPR}}(c) - \mathrm{TPR}(c)\}$ thus follows.
Even though the asymptotic variance can be obtained from the influence function, a direct estimate is difficult because it involves unknown conditional density functions, which are difficult to estimate explicitly, particularly under model misspecification. We instead propose a simple resampling procedure for inference based on Jin et al. (2001) in the next section.
3.3. Perturbation resampling procedure for inference
Now we outline the perturbation resampling procedure for inference. Let {G_i, i = 1, …, n + N + Nt} be non-negative iid random variables with unit mean and variance that are independent of the observed data. We first obtain the perturbed estimates β̂* as:

$$\hat{\boldsymbol{\beta}}^* = \arg\max_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n} G_i\, \ell_g(\boldsymbol{\beta}; Y_i, \mathbf{X}_i) - \lambda_{\mu,n} \sum_{j=1}^{p} \frac{|\beta_j|}{|\tilde{\beta}^*_j|^{\gamma}} \right\},$$

where β̃* is the perturbed initial estimate obtained from analogously perturbing its estimating equations; the perturbed estimates α̂* are obtained analogously from the perturbed regularized estimation of (4). This leads to the perturbed estimates of the density ratio weights:

$$\hat{w}^*_i = \frac{1 - \hat{\pi}^*(\mathbf{X}_i)}{\hat{\pi}^*(\mathbf{X}_i)},$$

where

$$\hat{\pi}^*(\mathbf{x}) = \frac{\sum_{i} G_i\, K_{h_1}(\hat{\mathbf{Z}}^*_i - \hat{\mathbf{z}}^*)\, S_i}{\sum_{i} G_i\, K_{h_1}(\hat{\mathbf{Z}}^*_i - \hat{\mathbf{z}}^*)},$$

with Ẑ*_i = (α̂*ᵀΦ(X_i), β̂*ᵀX_i)ᵀ and ẑ* = (α̂*ᵀΦ(x), β̂*ᵀx)ᵀ. Since α̂ and π̂(·) are estimated based on the large amount of unlabeled data, their contribution to the asymptotic variance is of higher order when N ≫ n. We then obtain the perturbed estimate of the conditional risk function as:

$$\hat{m}^*(\mathbf{x}) = \frac{\sum_{i=1}^{n} G_i\, \hat{w}^*_i\, K_{h_2}\{\hat{\mu}^*(\mathbf{X}_i) - \hat{\mu}^*(\mathbf{x})\}\, Y_i}{\sum_{i=1}^{n} G_i\, \hat{w}^*_i\, K_{h_2}\{\hat{\mu}^*(\mathbf{X}_i) - \hat{\mu}^*(\mathbf{x})\}},$$

where μ̂*(x) = g(β̂*ᵀx). Finally, we calculate the perturbed $\widehat{\mathrm{TPR}}^*(c)$ as:

$$\widehat{\mathrm{TPR}}^*(c) = \frac{\sum_{i: S_i = 0} G_i\, \hat{m}^*(\mathbf{X}_i)\, I\{g(\hat{\boldsymbol{\beta}}^{*\mathsf{T}} \mathbf{X}_i) \ge c\}}{\sum_{i: S_i = 0} G_i\, \hat{m}^*(\mathbf{X}_i)}.$$

It can be shown using arguments from Jin et al. (2001) that the asymptotic distribution of $n^{1/2}\{\widehat{\mathrm{TPR}}^*(c) - \widehat{\mathrm{TPR}}(c)\}$, conditional on the observed data, coincides with that of $n^{1/2}\{\widehat{\mathrm{TPR}}(c) - \mathrm{TPR}(c)\}$. This allows us to consistently approximate the SE of $\widehat{\mathrm{TPR}}(c)$ by the empirical SD of the resamples $\{\widehat{\mathrm{TPR}}^{*(b)}(c), b = 1, …, B\}$. Confidence intervals for TPR(c) can be constructed using the empirical percentiles of the resamples. We can obtain perturbed counterparts of the FPR, AUC, PPV, and NPV estimators analogously, denoted by $\widehat{\mathrm{FPR}}^*(c)$, $\widehat{\mathrm{AUC}}^*$, $\widehat{\mathrm{PPV}}^*(\hat{c}^*_0)$, and $\widehat{\mathrm{NPV}}^*(\hat{c}^*_0)$, respectively, and obtain SE estimates and construct confidence intervals accordingly.
For practical use, an issue with this perturbation resampling procedure is that calculating π̂*(·) requires repeating the 2D smoothing for each resample, which can be very time-consuming. To overcome this difficulty, we propose an approximated resampling procedure that is more computationally feasible. We first generate the perturbation weights G, where G has dimension N × B, followed by the perturbed β estimates β̂*, where β̂* has dimension p × B. We then approximate the perturbed estimates π̂*(·) by a first-order expansion around π̂(·), which replaces the repeated 2D smoothing with matrix operations shared across the B resamples. The perturbed estimate of the conditional risk can then be obtained as m̂*(·), and the perturbed $\widehat{\mathrm{TPR}}^*(c)$ thus follows. We evaluate the performance of the perturbation resampling procedure with and without approximation through simulations.
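For concreteness, the sketch below carries out one (unapproximated) perturbation resample of the TPR estimator, reusing m_hat from the Step II sketch. It is a simplified illustration: the weighted quasibinomial fit stands in for the perturbed adaptive-LASSO estimating equations, the calibrated weights w_src are held fixed (as justified by the higher-order argument above), and the bandwidth choice is an assumption.

```r
## One perturbation resample of the STEAM TPR estimator.
perturb_tpr_once <- function(x_src, y_src, w_src, x_tgt, c0) {
  g_src <- 4 * rbeta(nrow(x_src), 0.5, 1.5)  # unit mean and variance
  g_tgt <- 4 * rbeta(nrow(x_tgt), 0.5, 1.5)
  # Perturbed outcome model (quasibinomial avoids warnings about
  # non-integer weights; coefficients match the weighted binomial fit)
  fit <- glm(y_src ~ x_src, family = quasibinomial, weights = g_src)
  sc_src <- plogis(cbind(1, x_src) %*% coef(fit))
  sc_tgt <- plogis(cbind(1, x_tgt) %*% coef(fit))
  # Perturbed imputation: perturbation weights enter the kernel sums
  m_tgt <- m_hat(sc_tgt, sc_src, y_src, w_lab = g_src * w_src,
                 h2 = sd(sc_src) * length(y_src)^(-0.3))  # assumed bandwidth
  sum(g_tgt * m_tgt * (sc_tgt >= c0)) / sum(g_tgt * m_tgt)
}
## SE estimate from B = 1000 resamples:
## sd(replicate(1000, perturb_tpr_once(x_src, y_src, w, x_tgt, 0.5)))
```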
4. SIMULATION STUDIES
We performed extensive simulations to assess the finite sample bias, SE (taken as the empirical SD of the point estimates), and root mean square error (RMSE) of our proposed estimators (STEAM) compared with alternative estimators. In separate simulations, we also examined the performance of the perturbation procedure, with and without approximation, for inference based on the STEAM estimators. Across all numerical studies, including the data application in Section 5, we used the adaptive LASSO to estimate β and α, where we chose λμ,n and λπ,N using a modified BIC criterion (Minnier et al., 2011). For the bandwidth in the 2D smoothing for π(·), we used a second-order Gaussian product kernel with a plug-in bandwidth of order N^{−1/6} scaled by σ̂, where σ̂ is the sample SD of either α̂ᵀΦ(X) or β̂ᵀX in the source. Prior to smoothing, α̂ᵀΦ(X) and β̂ᵀX were standardized and transformed by a probability integral transform based on the normal cumulative distribution function to induce approximately uniformly distributed inputs, which can improve finite-sample performance (Wand et al., 1991). For the nonparametric smoothing for m(·), we used the Gaussian kernel with a plug-in bandwidth scaled by the empirical SD of the estimated risk score. As we focused on binary outcomes, we specified the logistic link functions g(u) = gπ(u) = 1/(1 + e^{−u}).
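The pre-smoothing transform described above amounts to one line of R per score; this is a minimal sketch of the standardize-then-normal-CDF map only, not the full bandwidth selection pipeline.

```r
## Probability integral transform: standardize a score, then map it
## through the normal CDF so smoothing inputs are roughly Uniform(0, 1).
pit <- function(u) pnorm((u - mean(u)) / sd(u))
```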
For comparison, we considered the following alternative estimators of the accuracy parameters: (1) the naive estimators, where the accuracy parameters are estimated using CV on the labeled source dataset (source); (2) estimation based on 100 randomly labeled validation samples in the target dataset (target_labeled); (3) the standard weighted estimators based on CV (weighted); and (4) the doubly robust estimator proposed by Alonzo and Pepe (2005) with CV (DR-aug). This estimator is a special case of Rotnitzky et al. (2006) and Fluss et al. (2009) when Y is independent of S conditional on X.
Throughout, we generated X from multivariate normal distributions conditional on Y and S. We considered simulating data that roughly resembled the EHR data example in terms of model parameters but were simple enough to be more broadly relevant. To this end, we considered p = 10 covariates with variance σ² = 1 and correlation ρ = 0.2. These simulations were varied over different model specifications, predictive strengths of X to S, and sample sizes. We considered the following data generating mechanisms:

![]()
Either μ(·) or π(·) was potentially misspecified by neglecting the pairwise interaction terms in model fitting:

Both correct: fitting both μ(·) and π(·) with the pairwise interaction terms included.

μ(·) correct and π(·) misspecified: fitting μ(·) with, and π(·) without, the interaction terms.

π(·) correct and μ(·) misspecified: fitting π(·) with, and μ(·) without, the interaction terms.

Both μ(·) and π(·) misspecified: fitting both μ(·) and π(·) without the interaction terms.

The outcome prevalence was approximately …, and the sample sizes in the source and target were roughly equal. We considered N = 10000 and n = 200, 400. The results in each scenario are summarized from 1000 simulated datasets.
Note that if X | Y = y follows a multivariate normal distribution, then P(Y = 1 | X) follows a logistic regression model (Efron, 1975; Harrell and Lee, 1985). In particular, if the covariance matrices of X | Y = 1 and X | Y = 0 are equal, then P(Y = 1 | X) follows a logistic regression model without interaction terms. Otherwise, if the two covariance matrices differ, then P(Y = 1 | X) follows a logistic regression model that involves interaction and quadratic terms of X_1, …, X_p.
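This classical fact follows from Bayes' rule; a short derivation, with φ(·; μ, Σ) denoting the multivariate normal density, shows where the interaction and quadratic terms come from:

```latex
% With X | Y = y ~ N(mu_y, Sigma_y), Bayes' rule gives the log-odds
\log \frac{P(Y=1 \mid \mathbf{X}=\mathbf{x})}{P(Y=0 \mid \mathbf{X}=\mathbf{x})}
  = \log \frac{P(Y=1)}{P(Y=0)}
  + \log \frac{\phi(\mathbf{x};\, \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)}
              {\phi(\mathbf{x};\, \boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)} .
% Expanding the normal densities leaves the quadratic term
% x' (Sigma_0^{-1} - Sigma_1^{-1}) x / 2, which vanishes when
% Sigma_1 = Sigma_0, leaving a logit that is linear in x; otherwise the
% logit retains quadratic and interaction terms in x_1, ..., x_p.
```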
We present results for the estimated cutoff, the AUC, and the TPR, PPV, and NPV for u0 = 0.05. Table 1 presents the bias, SE, and RMSE for scenarios (1)-(3) with moderate predictive strength of X to S and N = 10000. As expected, the source estimators exhibit noticeable bias, regardless of the size of n. This is consistent with the consensus that the naive estimator based on source data is not appropriate for prediction performance assessment in the presence of covariate shift. The target_labeled estimators are unbiased but have large SEs due to the limited sample size. The weighted estimators are singly robust and have noticeable bias when π(·) is misspecified. Among scenarios (1)-(3), the bias of STEAM is small relative to the RMSE, verifying its double robustness. In scenario (4), where both models are misspecified (Supplementary Material D), the STEAM estimators, although slightly biased, are more robust than the other estimators and attained the lowest RMSE.
TABLE 1.
Bias, SE, and RMSE of the estimators under different model misspecification scenarios with moderate association of X to S, over 1000 simulated datasets. N = 10000, k = 5, n = 200 or 400. All values are multiplied by 100.
| | | both correct | | | π(X) misspecified | | | μ(X) misspecified | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | Estimator | Bias | SE | RMSE | Bias | SE | RMSE | Bias | SE | RMSE |
| Cutoff | target_labeled | −0.19 | 4.86 | 4.86 | −0.19 | 4.86 | 4.86 | −0.40 | 4.88 | 4.89 |
| n = 200 | source | 1.02 | 4.24 | 4.35 | 1.02 | 4.24 | 4.35 | 1.24 | 4.19 | 4.36 |
| | weighted | −0.33 | 4.64 | 4.64 | −0.34 | 4.47 | 4.47 | −0.05 | 4.42 | 4.42 |
| | DR-aug | −0.53 | 4.42 | 4.45 | −0.55 | 4.23 | 4.27 | −0.35 | 4.45 | 4.46 |
| | STEAM | −0.21 | 3.47 | 3.47 | −0.39 | 3.42 | 3.43 | −0.06 | 3.52 | 3.51 |
| n = 400 | source | 1.94 | 3.04 | 3.59 | 1.94 | 3.04 | 3.59 | 1.80 | 3.17 | 3.64 |
| | weighted | 0.18 | 3.05 | 3.05 | 0.14 | 3.06 | 3.06 | 0.04 | 3.12 | 3.12 |
| | DR-aug | 0.22 | 2.91 | 2.92 | 0.35 | 2.91 | 2.93 | 0.25 | 3.11 | 3.12 |
| | STEAM | 0.08 | 2.52 | 2.52 | 0.01 | 2.49 | 2.48 | 0.09 | 2.53 | 2.53 |
| AUC | target_labeled | 0.57 | 4.74 | 4.76 | 0.57 | 4.74 | 4.76 | 0.62 | 4.60 | 4.63 |
| n = 200 | source | 1.16 | 3.66 | 3.83 | 1.16 | 3.66 | 3.83 | 0.90 | 3.68 | 3.78 |
| | weighted | 0.39 | 3.89 | 3.90 | 0.97 | 3.79 | 3.90 | 0.17 | 3.87 | 3.86 |
| | DR-aug | 0.41 | 3.73 | 3.75 | 0.52 | 3.66 | 3.70 | 0.72 | 3.77 | 3.84 |
| | STEAM | −0.38 | 3.71 | 3.72 | −0.10 | 3.65 | 3.65 | −0.68 | 3.73 | 3.78 |
| n = 400 | source | 1.19 | 2.49 | 2.76 | 1.19 | 2.49 | 2.76 | 1.10 | 2.53 | 2.75 |
| | weighted | 0.37 | 2.63 | 2.65 | 1.06 | 2.64 | 2.84 | 0.26 | 2.66 | 2.66 |
| | DR-aug | 0.40 | 2.58 | 2.61 | 0.45 | 2.61 | 2.64 | 0.33 | 2.61 | 2.63 |
| | STEAM | −0.36 | 2.51 | 2.52 | −0.18 | 2.52 | 2.52 | −0.27 | 2.53 | 2.53 |
| TPR | target_labeled | 0.43 | 10.52 | 10.53 | 0.43 | 10.52 | 10.53 | 0.59 | 10.57 | 10.59 |
| n = 200 | source | 3.60 | 9.20 | 9.88 | 3.60 | 9.20 | 9.88 | 3.33 | 9.36 | 9.93 |
| | weighted | 0.82 | 9.26 | 9.30 | 1.87 | 8.84 | 9.04 | 0.73 | 9.14 | 9.17 |
| | DR-aug | 0.80 | 9.18 | 9.21 | 0.78 | 8.88 | 8.91 | 0.25 | 9.02 | 9.02 |
| | STEAM | 0.22 | 7.60 | 7.60 | 0.66 | 7.52 | 7.55 | −0.12 | 7.75 | 7.75 |
| n = 400 | source | 2.63 | 6.52 | 7.03 | 2.63 | 6.52 | 7.03 | 2.84 | 6.89 | 7.44 |
| | weighted | 0.58 | 6.47 | 6.50 | 1.83 | 6.49 | 6.88 | 1.17 | 6.53 | 6.62 |
| | DR-aug | 0.50 | 6.14 | 6.16 | 0.73 | 6.11 | 6.15 | 0.36 | 6.44 | 6.45 |
| | STEAM | −0.32 | 5.62 | 5.63 | −0.15 | 5.59 | 5.58 | −0.36 | 5.69 | 5.68 |
| PPV | target_labeled | 0.21 | 5.79 | 5.78 | 0.21 | 5.78 | 5.78 | 0.82 | 5.37 | 5.43 |
| n = 200 | source | 3.66 | 3.82 | 5.29 | 3.66 | 3.82 | 5.29 | 3.66 | 3.88 | 5.33 |
| | weighted | 0.89 | 4.24 | 4.33 | 2.04 | 4.01 | 4.50 | 1.24 | 4.16 | 4.34 |
| | DR-aug | −0.79 | 4.30 | 4.37 | −0.81 | 4.49 | 4.55 | −0.32 | 4.49 | 4.50 |
| | STEAM | −0.41 | 4.02 | 4.04 | −0.01 | 3.93 | 3.93 | −0.57 | 3.89 | 3.93 |
| n = 400 | source | 2.99 | 2.24 | 3.74 | 2.99 | 2.24 | 3.73 | 3.06 | 2.36 | 3.87 |
| | weighted | 0.83 | 2.53 | 2.66 | 1.46 | 2.54 | 2.93 | 1.10 | 2.53 | 2.75 |
| | DR-aug | −0.42 | 2.64 | 2.67 | −0.34 | 2.88 | 2.90 | −0.41 | 2.69 | 2.72 |
| | STEAM | −0.31 | 2.44 | 2.45 | −0.25 | 2.45 | 2.46 | −0.29 | 2.46 | 2.47 |
| NPV | target_labeled | 0.56 | 5.87 | 5.88 | 0.56 | 5.87 | 5.88 | 0.76 | 5.88 | 5.92 |
| n = 200 | source | −2.26 | 4.63 | 5.14 | −2.26 | 4.63 | 5.14 | −2.35 | 4.65 | 5.20 |
| | weighted | −0.28 | 4.50 | 4.50 | 0.17 | 4.43 | 4.42 | −0.48 | 4.28 | 4.30 |
| | DR-aug | −0.42 | 4.43 | 4.45 | −0.25 | 4.44 | 4.45 | −0.38 | 4.38 | 4.40 |
| | STEAM | −0.30 | 4.04 | 4.04 | −0.13 | 4.04 | 4.03 | −0.39 | 4.10 | 4.11 |
| n = 400 | source | −2.47 | 3.70 | 4.44 | −2.47 | 3.70 | 4.44 | −2.39 | 3.71 | 4.41 |
| | weighted | −0.32 | 3.40 | 3.40 | 0.19 | 3.44 | 3.43 | −0.26 | 3.34 | 3.34 |
| | DR-aug | −0.51 | 3.27 | 3.31 | −0.30 | 3.37 | 3.38 | −0.40 | 3.37 | 3.39 |
| | STEAM | −0.33 | 3.17 | 3.17 | −0.24 | 3.17 | 3.17 | −0.35 | 3.16 | 3.17 |
Figure 1 presents the number of labels in the target cohort needed to achieve the same RMSE as each estimator, across different predictive strengths of X to S (weak, moderate, and strong) and misspecification scenarios, for N = 10000, n = 200, k = 5. A larger number of labels needed in the target cohort represents lower RMSE and higher efficiency of the estimator. The STEAM estimators are more efficient than source, weighted, and DR-aug across the settings. The amount of efficiency gain of STEAM over creating labels in the target varies with the predictive strength of X to S. We observe that when X is weakly or moderately correlated with S, using the n = 200 labels in the source for the STEAM estimators is equivalent to creating ∼120-200 new labels in the target in terms of RMSE, but when X is strongly associated with S, using STEAM is less advantageous, equivalent to only ∼100-150 labels in the target (see the sketch after the figure caption below).
FIGURE 1.
Number of labels in the target cohort required to achieve the same RMSE as each estimator, by three model misspecification scenarios and different predictive strengths of X to S (weak, moderate, and strong), over 1000 simulated datasets. A larger number of labels denotes greater efficiency (lower RMSE) relative to target_labeled. N = 10000, n = 200, k = 5.
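The equivalent-labels comparison can be approximated with a back-of-the-envelope calculation, assuming the RMSE of target_labeled scales as the inverse square root of the number of target labels; this scaling is our simplifying assumption, not the paper's exact calculation.

```r
## Equivalent target labels for a candidate estimator, assuming the
## RMSE of target_labeled scales as 1/sqrt(n_t); n_ref is the number of
## target labels used by the reference estimator (100 in Section 4).
equiv_labels <- function(rmse_est, rmse_ref, n_ref = 100) {
  n_ref * (rmse_ref / rmse_est)^2
}
## e.g., Table 1 (both correct, n = 200): STEAM's cutoff RMSE of 3.47 vs
## target_labeled's 4.86 gives equiv_labels(3.47, 4.86) ~ 196 labels
```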
To implement the perturbation procedure, we used the weights Gi ∼ 4 × Beta(0.5, 1.5) and B = 1000 sets of perturbation weights for SE and CI estimation. We evaluated the perturbations only in the scenario where both μ(·) and π(·) are correctly specified and X has moderate predictive strength to S, with N = 10000 and n = 200. The results are presented in Table 2. The SEs estimated with and without approximation in the perturbation approximated the empirical SE well. The coverage of the 95% CIs was also close to the nominal level.
TABLE 2.
Performance of perturbation resampling for STEAM in 1000 simulated datasets when both μ(·) and π(·) are correctly specified. Emp SE: empirical SE over simulations; ASE: average of estimated SEs based on the SD of perturbed estimates; CovProb: coverage probability of 95% CIs; ASE_approx: average of estimated SEs based on the SD of perturbed estimates using the approximated method; CovProb_approx: coverage probability of 95% CIs using the approximated method. N = 10000, n = 200, k = 5.
| | Estimate | Emp SE | ASE | CovProb | ASE_approx | CovProb_approx |
|---|---|---|---|---|---|---|
| Cutoff | 0.678 | 0.040 | 0.037 | 0.938 | 0.037 | 0.925 |
| AUC | 0.779 | 0.032 | 0.030 | 0.929 | 0.036 | 0.968 |
| TPR | 0.286 | 0.053 | 0.051 | 0.920 | 0.055 | 0.929 |
| PPV | 0.823 | 0.032 | 0.033 | 0.944 | 0.036 | 0.960 |
| NPV | 0.620 | 0.040 | 0.036 | 0.918 | 0.043 | 0.964 |
5. APPLICATION TO RHEUMATOID ARTHRITIS (RA) PHENOTYPING MODEL
To illustrate the performance of our proposed estimators, we applied our procedures to evaluate a phenotyping algorithm for classifying RA disease status, using EHR data from Partners HealthCare (Liao et al., 2010). The source dataset from 2009 consists of 20 451 total patients, of whom n = 267 were labeled with the true disease status through chart review by a rheumatologist. Both narrative and codified data were available to develop the prediction model (p = 21). The narrative variables, including disease diagnoses and medications, were obtained with natural language processing. The codified data included ICD-9 codes, electronic prescriptions, and laboratory values. Healthcare utilization was also included. The transformation u ↦ log(1 + u) was applied to all count variables to mitigate instability in the estimation due to skewness in their distributions. An RA phenotyping model, named ALASSO-2009 (Supplementary Material E), was trained on the labeled data from 2009. We are interested in assessing its performance in classifying the disease status of patients in the same EHR database across different time windows: 2011 (Nt = 25 405, nt = 100), 2013 (Nt = 30 804, nt = 100), 2015 (Nt = 36 095, nt = 150), and 2017 (Nt = 39 550, nt = 200). Each of these consists of a large number of unlabeled observations with the same set of covariates available. For the dataset in each year, a random subset was selected and labeled with the true RA status through manual chart review by trained physicians, which can serve as a validation set to verify our results. A summary of patient characteristics in the labeled and unlabeled data in each year is shown in Supplementary Material F.
For comparison, we considered the source, target_labeled, weighted, and DR-aug estimators as in Section 4. We also considered semisupervised estimation using labeled and unlabeled data in the target, as proposed by Gronsbell and Cai (2018) (target_SSL). The estimated AUC of the model on the target datasets is presented in Figure 2. The AUC estimates for the target datasets obtained using the validation sets (target_labeled and target_SSL) are noticeably different from the ones estimated from the source data (source), especially in 2015 and 2017. This is not too surprising, since many changes in the database have occurred over the years; for instance, patient characteristics, such as age, have undergone changes. Moreover, the EHR system was switched to Epic (Epic Systems Corporation) and the International Classification of Diseases (ICD) system was changed from version 9 to version 10 around 2015-2016, which may also explain the lower AUC in 2015 and 2017. The point estimates based on our proposed STEAM method are similar to target_labeled, suggesting the stability of the proposed procedure in a real data setting. Moreover, we observe substantial efficiency gains of STEAM over target_labeled, weighted, and DR-aug. As a result, we obtain a more precise estimate of the prediction performance of the phenotyping algorithm using the STEAM method. The estimated ROC curves along with 95% point-wise confidence bands can be found in Supplementary Material G.
FIGURE 2.
Estimated AUC with 95% CIs of ALASSO-2009 on the datasets in 2011, 2013, 2015, and 2017 based on different methods.
The estimates of the cutoff values and the corresponding TPR, PPV, and NPV estimates of the binary classification rule for u0 = 0.05 and 0.10, as well as their 95% CIs, are presented in Figure 3. For choosing the cutoff values, STEAM is ∼4.2-7.2 times as efficient as target_labeled and ∼1.1-2.1 times as efficient as DR-aug. For each accuracy measure, the STEAM estimators are 3.1-5.8 times as efficient as target_labeled and uniformly more efficient than DR-aug.
FIGURE 3.
Estimated cutoffs and the corresponding TPR, PPV, and NPV estimates of the binary classification rule for u0 = 0.05 (top) and u0 = 0.10 (bottom) of ALASSO-2009. Error bars denote 95% CIs.
6. DISCUSSION
In this paper, we proposed a robust and efficient transfer learning procedure for model evaluation rather than model fitting. Our estimators are doubly robust, i.e., they are consistent when either the prediction model μ(·) or the weighting model π(·) is correct. We addressed potential overfitting bias in our SS estimators with CV and also developed a perturbation resampling procedure for inference. We illustrated the performance and practical utility of the proposed estimators through simulations and an application to an RA phenotyping model.
We focused on estimating the ROC curve, which is a measure of discrimination. An aspect that these measures fail to address is the calibration of a prediction model, that is, the extent to which the model-based predicted probability for a risk group matches the actual percentage of patients with the disease. In the presence of covariate shift, if the prediction model is misspecified, the risk model trained with source data may not be well calibrated in the target data. Our proposed STEAM procedure, in fact, already performs model calibration during its process, since the method hinges on the estimation of the calibrated risk m̂(·).
We have assumed that the true outcomes are labeled completely at random in the source population, which is reasonable if investigators control the labeling. For example, with the goal of validating phenotyping models on EHR data, researchers can design the study such that they randomly select the examples to be labeled and ask domain experts to perform chart review on the selected sample. However, this assumption could be restrictive if labeling was stratified by some known factors or if some available records were not labeled for research purposes. If the selection of labels is not random by design, the unlabeled source data effectively become a second target dataset, which also differs from the main target data; in that case, their utility in improving the estimation of the accuracy parameters for the original target data becomes less clear. Other refinements and extensions of the proposed approach are possible. For instance, when the prediction model is clearly misspecified, we may incorporate an additional outcome model in the estimation procedure to preserve the doubly robust property, that is, the estimator is consistent if either the new outcome model or the weighting model is correct.
We focused on the setting where labeled data in the target population are not readily available or are expensive to obtain. However, when a small amount of labeled data exists in the target population, as considered by Cai et al. (2022), it is also of interest to develop methods utilizing those existing labels to further improve the robustness and efficiency of estimation.
Throughout, we focused on the setting with fixed p but accommodated settings in which p is not small relative to n in finite samples via regularized estimation. However, when p is large so that ν is large, a large power parameter γ would be required to maintain the oracle properties, leading to an unstable penalty and poor finite sample performance. Estimation in the setting with p ≫ n would require different theoretical justifications and warrants additional research.
Supplementary Material
Web Appendices, Tables, and Figures referenced in Sections 3-5, as well as a zip file containing the R code are available with this paper at the Biometrics website on Oxford Academic.
Acknowledgement
The authors thank the associate editor and referees for their insightful feedback and suggestions.
Contributor Information
Linshanshan Wang, Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, United States.
Xuan Wang, Division of Biostatistics, Department of Population Health Sciences, University of Utah, Salt Lake City, UT 84108, United States.
Katherine P Liao, Division of Rheumatology, Brigham and Women’s Hospital, Boston, MA 02115, United States.
Tianxi Cai, Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States.
FUNDING
This work was supported by National Institutes of Health grants R01LM013614 and P30AR072577.
CONFLICT OF INTEREST
None declared.
DATA AVAILABILITY
The data that support the findings in this paper are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
References
- Alonzo T. A., Pepe M. S. (2005). Assessing accuracy of a continuous screening test in the presence of verification bias. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54, 173-190.
- Cai T., Li M., Liu M. (2022). Semi-supervised triply robust inductive transfer learning. arXiv preprint arXiv:2209.04977.
- Carroll R. J., Thompson W. K., Eyler A. E., Mandelin A. M., Cai T., Zink R. M. et al. (2012). Portability of an algorithm to identify rheumatoid arthritis in electronic health records. Journal of the American Medical Informatics Association, 19, e162-e169.
- Chen X., Monfort M., Liu A., Ziebart B. D. (2016). Robust covariate shift regression. In Artificial Intelligence and Statistics, pp. 1270-1279. PMLR.
- Cheng D., Chakrabortty A., Ananthakrishnan A. N., Cai T. (2020). Estimating average treatment effects with a double-index propensity score. Biometrics, 76, 767-777.
- Cipparone C. W., Withiam-Leitch M., Kimminau K. S., Fox C. H., Singh R., Kahn L. (2015). Inaccuracy of ICD-9 codes for chronic kidney disease: a study from two practice-based research networks (PBRNs). The Journal of the American Board of Family Medicine, 28, 678-682.
- Efron B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70, 892-898.
- Efron B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81, 461-470.
- Fluss R., Reiser B., Faraggi D., Rotnitzky A. (2009). Estimation of the ROC curve under verification bias. Biometrical Journal, 51, 475-490.
- Gronsbell J. L., Cai T. (2018). Semi-supervised approaches to efficient evaluation of model prediction performance. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80, 579-594.
- Harrell F. E., Lee K. L. (1985). A comparison of the discrimination of discriminant analysis and logistic regression under multivariate normality. Biostatistics: Statistics in Biomedical, Public Health and Environmental Sciences, 333-343.
- Hripcsak G., Albers D. J. (2013). Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association, 20, 117-121.
- Huang S., Huang J., Cai T., Dahal K. P., Cagan A., He Z. et al. (2020). Impact of ICD10 and secular changes on electronic medical record rheumatoid arthritis algorithms. Rheumatology, 59, 3759-3766.
- Inoue E., Uno H. (2018). APPEstimation: Adjusted prediction model performance estimation (Version 0.1.1).
- Jin Z., Ying Z., Wei L. (2001). A simple resampling method by perturbing the minimand. Biometrika, 88, 381-390.
- Li B., Gatsonis C., Dahabreh I. J., Steingrimsson J. A. (2022). Estimating the area under the ROC curve when transporting a prediction model to a target population. Biometrics, 79, 2382-2393.
- Liao K. P., Cai T., Gainer V., Goryachev S., Zeng-treitler Q., Raychaudhuri S. et al. (2010). Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care and Research, 62, 1120-1127.
- Liao K. P., Cai T., Savova G. K., Murphy S. N., Karlson E. W., Ananthakrishnan A. N. et al. (2015a). Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ, 350, h1885.
- Liao K. P., Ananthakrishnan A. N., Kumar V., Xia Z., Cagan A., Gainer V. S. et al. (2015b). Methods to develop an electronic medical record phenotype algorithm to compare the risk of coronary artery disease across 3 chronic disease cohorts. PLoS ONE, 10, e0136651.
- Liu M., Zhang Y., Cai T. (2020). Doubly robust covariate shift regression with semi-nonparametric nuisance models. arXiv preprint arXiv:2010.02521.
- Liu M., Zhang Y., Zhou D. (2021). Double/debiased machine learning for logistic partially linear model. The Econometrics Journal, 24, 559-588.
- Minnier J., Tian L., Cai T. (2011). A perturbation method for inference on regularized regression estimates. Journal of the American Statistical Association, 106, 1371-1382.
- Miotto R., Li L., Kidd B. A., Dudley J. T. (2016). Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports, 6, 1-10.
- Pepe M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.
- Rasmy L., Wu Y., Wang N., Geng X., Zheng W. J., Wang F. et al. (2018). A study of generalizability of recurrent neural network-based predictive models for heart failure onset risk using a large and heterogeneous EHR data set. Journal of Biomedical Informatics, 84, 11-16.
- Reddi S., Poczos B., Smola A. (2015). Doubly robust covariate shift correction. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1).
- Rotnitzky A., Faraggi D., Schisterman E. (2006). Doubly robust estimation of the area under the receiver-operating characteristic curve in the presence of verification bias. Journal of the American Statistical Association, 101, 1276-1288.
- Rotnitzky A., Lei Q., Sued M., Robins J. M. (2012). Improved double-robust estimation in missing data and causal inference models. Biometrika, 99, 439-456.
- Shimodaira H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 227-244.
- Steingrimsson J. A., Gatsonis C., Dahabreh I. J. (2021). Transporting a prediction model for use in a new target population. arXiv preprint arXiv:2101.11182.
- Wand M. P., Marron J. S., Ruppert D. (1991). Transformations in density estimation. Journal of the American Statistical Association, 86, 343-353.
- Wen J., Yu C. N., Greiner R. (2014). Robust learning under uncertain test distributions: Relating covariate shift to model misspecification. In International Conference on Machine Learning, pp. 631-639. PMLR.
- Xia Z., Secor E., Chibnik L. B., Bove R. M., Cheng S., Chitnis T. et al. (2013). Modeling disease severity in multiple sclerosis using electronic health records. PLoS ONE, 8, e78927.
- Xu H., Tibshirani R. (2022). Estimation of prediction error with known covariate shift. arXiv preprint arXiv:2205.01849.
- Zou H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418-1429.
- Zou H., Zhang H. H. (2009). On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37, 1733-1751.