Summary:
The proliferation of biobanks and large public clinical datasets enables their integration with a smaller amount of locally gathered data for the purposes of parameter estimation and model prediction. However, public datasets may be subject to context-dependent confounders, and the protocols behind their generation are often opaque; naively integrating all external datasets equally can bias estimates and lead to spurious conclusions. Weighted data integration is a potential solution, but current methods still require subjective specification of weights and can become computationally intractable. Under the assumption that local data is generated from the set of unknown true parameters, we propose a novel weighted integration method based upon using the external data to minimize the local data leave-one-out cross validation (LOOCV) error. We demonstrate how the optimization of LOOCV errors for linear and Cox proportional hazards models can be rewritten as functions of external dataset integration weights. Significant reductions in estimation error and prediction error are demonstrated in simulation studies mimicking the heterogeneity of clinical data, as well as in a real-world example using kidney transplant patients from the Scientific Registry of Transplant Recipients.
Keywords: asymmetric cross-validation, data integration, leave-one-out error
1. Introduction
One major advantage of gathering data from a local collection site is the notion of verifiability; that is, it is possible to confirm that proper collection procedures are being performed and that they are more or less consistent across samples. However, financial and temporal constraints frequently limit the number of subjects from whom data can be gathered locally. As a result, it has become increasingly common in epidemiological and statistical genetic studies to integrate data from other sources (large surveys, biobanks, etc.) for the purposes of estimation or prediction on the local population of interest, but with these outside data, verifiability no longer holds (Louie et al., 2007). In this case, there is the potential that context-specific confounders such as population structure, batch effects, and clinical confounding (e.g. healthy user bias) may distort or even be the primary driver of any significant discoveries and subsequent conclusions (Than et al., 2017; Brookhart et al., 2010).
To address the potential heterogeneity in external datasets, a growing body of research has focused on weighted integration, where information from external datasets can be down-weighted relative to that of the local data. In the Bayesian setting, Ibrahim and Chen (2000) proposed the power prior, where the likelihood function of outside data is raised to a power given by a discounting parameter between 0 and 1, which they later extended to multiple historical datasets (Ibrahim et al., 2003). More recently, Jiang et al. (2016) suggested weighting prior information against current data in the context of variable selection in their development of the prior lasso. However, there is a degree of subjectivity associated with the choice of tuning parameters in both approaches; commonly used practical tuning methods may not satisfy theoretical parametric conditions, and computational intractability remains a pertinent issue.
The weighted integration problem has also been approached from a weighted likelihood-based standpoint. Wang and Zidek (2005) suggested selecting likelihood weights by cross-validation, but their method assumes that observations from each population are identically distributed and thus cannot account for high heterogeneity. Also based on the identically distributed setting, Plante (2008) proposed “minimum averaged mean squared error” adaptive likelihood weights, a mixture of empirical distribution functions that outperform standard maximum likelihood estimates, and later showed the consistency of the method, though no rate of convergence was established (Plante, 2009). Similarly, Fu et al. (2009)’s empirical likelihood method appears promising in the context of common mean estimation and stratified sampling, but does not explore parameter estimation in regression settings. The data fusion method of Guo et al. (2012) determines likelihood weights based on likelihood ratios and applies to generalized linear models, but the sum of squares formulation used to calculate the weighted likelihood estimate can be influenced by large external datasets that are moderately biased. Before the publication of this paper, we became aware of an independent work in press by Zhai and Han (2022) proposing a data integration method using penalized constrained maximum likelihood based on empirical likelihood and a lasso penalty. Taken together, there is a need for a data-driven, computationally inexpensive weighted integration approach that can deal with external dataset heterogeneity and fully exclude sufficiently noisy datasets.
As a result, here we propose a novel approach to the weighted data integration problem based on minimizing the leave-one-out cross validation (LOOCV) error. Besides the linear model, we also focus on the Cox model in this paper due to its popularity in clinical studies, where the local sample size is often small due to the difficulty of patient recruitment, which leads to the need for integrating with other external datasets. We demonstrate that the objective function associated with model-specific LOOCV error minimization is in fact a function of the integration weights. That is, we can generate integration weights by solving this optimization problem over the set of all possible integration weights. Under the assumption that the parameter estimates for the local data are consistent for the true underlying set of generating parameters, our approach is able to discriminate among external datasets for inclusion by assigning weights of 0 to extremely noisy datasets. Our method is also able to distinguish among external datasets with at least some utility by assigning different fractional weights. Taken together, these abilities of our LOOCV-minimizing approach are especially valuable in working with large-scale public use datasets, whose potential utility is masked by variations in collection procedures. In general, our proposed methodology significantly streamlines the integration process, requiring nothing more than the external datasets to objectively enumerate their integration weights in an all-in-one procedure.
The rest of the paper is outlined as follows. In Section 2, we first outline the weighted integration problem by introducing the concept and the notation used throughout the paper. We then define the LOOCV error objective functions for linear and Cox proportional hazards models, demonstrating how they are functions of the integration weights. After establishing the weighted integration problem, we describe methods to enumerate weights to optimize the objective functions in Section 3. In Section 4, we perform simulation studies over a range of different setups designed to match possible real-world scenarios. In Section 5, we demonstrate the clinical utility of our method on datasets from the Scientific Registry of Transplant Recipients (SRTR), using our integrative process to improve estimates of 3-year graft failure in Black kidney transplant patients. We conclude in Section 6 with possible extensions of our work and additional final remarks.
2. Formulating the Weighted Integration Problem
2.1. Integration Weights
The desirability of assigning integration weights arises from the notion that although we can reasonably verify the accuracy of data gathered locally, no such verification is possible for public-use datasets. That is, although integrating these external datasets increases the sample size and generalizability of any conclusions, naive integration that assigns equal importance to the local and external datasets may bias our estimates. In more statistical terms, let β = β0 be the p-vector of true parameters of interest for the population from which the local dataset is sampled, and let β̂0 denote the estimated local data parameter vector. Suppose we also have K candidate external datasets, with β̂i the parameter vector estimated from the ith external dataset, i = 1, …, K. Asymptotically, it is assumed that the local data parameter estimator β̂0 is consistent for the true parameters β. It is possible that estimators using the external datasets are also consistent for β, but in general, this is not the case (i.e., the β̂i are consistent estimators for βi ≠ β, i = 1, …, K).
The first question we wish to answer is how to combine β̂0 with the set of β̂i‘s to better estimate β. With no further information available, the best estimator of β is β̂0, and the same relationship holds between βi and β̂i. Intuitively, for the ith external dataset, if βi = β we would want to completely integrate it with our local data, as it would reduce the variance of our estimate. In contrast, we want to exclude the dataset if βi is sufficiently far away from β, as doing so substantially biases our estimate. The scenario where βi ≈ β represents an intermediate case, where the bias introduced by integrating the external dataset is moderated by a reduction in estimation variance due to the increase in sample size. These three possible scenarios motivate the concept of assigning a set of integration weights w = (w1, …, wK), with each external dataset i having weight wi. These could be either binary 0/1 weights to exclude especially noisy external datasets or fractional weights between 0 and 1 to more finely distinguish among external datasets with at least some partial utility. We emphasize that although ideally we would want to integrate based on comparing each βi to β, the process is complicated by the fact that only the estimated β̂0 and β̂i‘s are observed.
In practice, β is unknown, so a related question is whether we can apply this weighted integration to improve out-of-sample predictive ability, in the sense that β directly influences the response. Pursuing this question has real-world value, because in a clinical setting the immediate focus may be on using past patient data to inform future patient treatment plans, rather than making generalizations about predictor/outcome associations. However, improving prediction accuracy (or equivalently, reducing prediction error) is more challenging than reducing estimation error given the limited size of the local data. Because only the local dataset is guaranteed to be generated from the true parameters β, the typical training/testing framework used to examine prediction ability can only be set up with the local data. This is the same as in the estimation problem: estimation error can only be assessed over the local dataset because it is the sole source of β-generated data. In Section 4, we perform tests to demonstrate the utility of our weighted integration method in reducing both estimation/in-sample and prediction/out-of-sample error.
2.2. Notation and Defining LOOCV Error
We now define the notation used throughout the paper and introduce the leave-one-out cross-validation or LOOCV error for continuous and survival data. We also delineate the associated objective functions as the minimization of the LOOCV errors. For brevity, we make no reference to the integration weights at this stage, although in the following sections we demonstrate that the objective functions are in fact functions of these weights.
Suppose we have gathered a local dataset of size n0 and want to estimate the value of p parameters β = β0 = (β1, …, βp) or alternatively use these parameters for prediction. For continuous data, this dataset consists of the n0 × p design matrix X0 and the n0 × 1 response vector y0. In survival data, this is augmented by the addition of the censoring vector δ0, where for the jth individual δ0j = 1 if the exact time to death is known and 0 otherwise, j = 1, …, n0. As before, there are K candidate external datasets to assign integration weights to, where the ith external dataset can be characterized by its ni × p design matrix Xi, the ni × 1 response vector yi, and if necessary δi. Throughout this paper, the index i will be used to refer to datasets and the index j will refer to individuals within the datasets. Finally, the important assumption is made that the local data is generated from the true parameter values β and that its maximum likelihood estimator β̂0 is consistent for β. No such assumption is made for the external datasets. We wish to incorporate knowledge from the external datasets by assigning integration weights w to reduce estimation and prediction error; let β̂w be the estimates for the local data parameters after performing data integration with these weights. The notation is summarized in Table S1 in the Supporting Information.
In this paper, we consider the LOOCV errors for linear models and Cox proportional hazards models as loss functions which we wish to minimize. The idea we wish to emphasize is that the LOOCV error, in addition to being a function of the observed local data X0 and y0, is a function of the integration weights w, which themselves incorporate the external data Xi and yi. That is, for any model where a loss function L can be defined as the LOOCV error, the associated objective function is the minimization of this loss:
$$\hat{w} = \arg\min_{w}\; L(w;\, X_0, y_0) \qquad (1)$$
In linear models for continuous data, the LOOCV error is referred to as the predicted residual error sum of squares statistic (PRESS), and is commonly used as a validation method to assess the predictive ability of competing regression models (see Hong et al., 2003; Inan et al., 2019; Rodriguez-Bermudez et al., 2013). The statistic is defined as PRESS = Σⱼ (yⱼ − ŷⱼ₍₋ⱼ₎)², where yⱼ is the value of the jth observation and ŷⱼ₍₋ⱼ₎ is the predicted value of the jth observation from a regression model fit on the dataset with the jth observation omitted. The statistic can be thought of as an estimate for testing prediction error as it assesses a model’s ability to deal with data that it was not constructed on. In general, models with better predictive power tend to have smaller values for the predicted residual error sum of squares. For our purposes, the statistic is evaluated over the local dataset; rewriting the above definition in the notation described earlier, the objective function is thus:
$$\hat{w} = \arg\min_{w}\; \sum_{j=1}^{n_0}\left(y_{0j} - \hat{y}_{0j(-j)}\right)^2 \qquad (2)$$
where we emphasize the minimization is performed over the set of integration weights w, as the leave-one-out predictions are a function of these weights, as we demonstrate in the following section. It is well known that the PRESS statistic can be calculated in closed form for linear models, without having to refit a new model for each local observation (Belsey et al., 1980). This is also the case for our LOOCV error, as shown in the following section. As a result, the calculation of this statistic is straightforward.
Forming an equivalent objective function minimizing the LOOCV error for survival data by fitting a Cox proportional hazards model is more complicated, since the sum of squares formulation of the predicted residual error sum of squares is incompatible with censored data. For the Cox model, the leave-one-out cross-validated partial likelihood (CVPL) is defined as CVPL = Σⱼ [ℓ(β̂₍₋ⱼ₎) − ℓ₍₋ⱼ₎(β̂₍₋ⱼ₎)] (Verweij and Van Houwelingen, 1993). Similar to ŷⱼ₍₋ⱼ₎ of the predicted residual error sum of squares, β̂₍₋ⱼ₎ is calculated with the omission of the jth observation. In the summation, the first log-likelihood ℓ is calculated over the entire dataset. The second log-likelihood ℓ₍₋ⱼ₎ is calculated over the leave-one-out dataset (the dataset with the jth observation omitted). The difference of the two likelihoods is calculated for each j and then summed. Each difference can be thought of as the contribution from the omitted observation to the log-likelihood, using parameters estimated from a training set of every other observation. This means that in this formulation, it is desired to maximize the CVPL. Rewriting in the notation used in our paper, the objective function is thus the minimization of the negative cross-validated partial likelihood:
$$\hat{w} = \arg\min_{w}\; \left\{-\sum_{j=1}^{n_0}\left[\ell\big(\hat{\beta}_{w(-j)}\big) - \ell_{(-j)}\big(\hat{\beta}_{w(-j)}\big)\right]\right\} \qquad (3)$$
To avoid a time-consuming iterative process to estimate each β̂w(−j), we build on the efficient approximate formulation of the Cox model leave-one-out parameters and the cross-validated partial log-likelihood measure of predictive performance by applying the Sherman-Morrison-Woodbury theorem and a first-order Taylor expansion around the full model maximum likelihood parameters. Henceforth, we will refer to the minimization of the objective functions and the minimization of the LOOCV errors interchangeably.
2.3. The LOOCV Error as a Function of Integration Weights
2.3.1. Linear Models.
We demonstrate how the objective functions are in fact functions of the integration weights w. In this section, we describe how the leave-one-out responses and the predicted residual error sum of squares (PRESS) statistic incorporate the information of the external datasets. The following section describes the process for Cox models and the cross-validated partial likelihood.
Suppose the integration weights w are known. For a linear model, a closed-form solution exists for parameter estimation, β̂ = (XᵀX)⁻¹Xᵀy, which consists of two products: XᵀX and Xᵀy. For the local dataset and each of the external datasets, we precompute these terms and take weighted sums using the integration weights as below:
$$\hat{\beta}_w = \left(X_0^T X_0 + \sum_{i=1}^{K} w_i X_i^T X_i\right)^{-1}\left(X_0^T y_0 + \sum_{i=1}^{K} w_i X_i^T y_i\right) \qquad (4)$$
Intuitively, in terms of estimation, all of an external dataset’s information is summarized by wᵢXᵢᵀXᵢ and wᵢXᵢᵀyᵢ, which augment the local data during integration. The fitted local responses are then ŷ0 = X0β̂w, where the jth individual has fitted response ŷ0j = x0jᵀβ̂w, j = 1, …, n0. For linear models, the leave-one-out parameters can be calculated without refitting n0 models: β̂w(−j) = β̂w − (X0ᵀX0 + Σᵢ wᵢXᵢᵀXᵢ)⁻¹x0j(y0j − ŷ0j)/(1 − Hjj), where Hjj is the jth diagonal element of the n0 × n0 projection matrix H = X0(X0ᵀX0 + Σᵢ wᵢXᵢᵀXᵢ)⁻¹X0ᵀ (Belsey et al., 1980). The PRESS statistic is then
$$\mathrm{PRESS}(w) = \sum_{j=1}^{n_0}\left(y_{0j} - x_{0j}^T\hat{\beta}_{w(-j)}\right)^2 = \sum_{j=1}^{n_0}\left(\frac{y_{0j} - x_{0j}^T\hat{\beta}_w}{1 - H_{jj}}\right)^2 \qquad (5)$$
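To make the computation concrete, the following minimal numpy sketch evaluates Equations (4)–(5) for a given weight vector. The function name weighted_press and its argument layout are ours for illustration, not the released intasymm implementation.

```python
import numpy as np

def weighted_press(w, X0, y0, X_ext, y_ext):
    """PRESS(w) over the local data after weighted integration (Eqs. 4-5).

    X0, y0      : local design matrix (n0 x p) and response (n0,)
    X_ext, y_ext: lists of K external design matrices and responses
    w           : length-K vector of integration weights in [0, 1]
    """
    # Augmented cross-products: local terms plus weighted external terms (Eq. 4)
    A = X0.T @ X0 + sum(wi * Xi.T @ Xi for wi, Xi in zip(w, X_ext))
    b = X0.T @ y0 + sum(wi * Xi.T @ yi for wi, Xi, yi in zip(w, X_ext, y_ext))
    A_inv = np.linalg.inv(A)
    beta_w = A_inv @ b                                  # post-integration estimate
    resid = y0 - X0 @ beta_w                            # local residuals at beta_w
    H_diag = np.einsum("ij,jk,ik->i", X0, A_inv, X0)    # diagonal of X0 A^{-1} X0^T
    return np.sum((resid / (1.0 - H_diag)) ** 2)        # closed-form LOOCV error (Eq. 5)
```

Because the external data enter only through the precomputed cross-products, this objective can be evaluated repeatedly for different w at little cost.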
2.3.2. Cox Models.
We now show how the leave-one-out parameters and the cross-validated partial likelihood are calculated for the Cox model when outside datasets are considered for integration. The ideas here are extensions of work by Verweij and Van Houwelingen (1993), who originally constructed the cross-validated partial likelihood as a metric for the predictive value of a Cox model, as well as Van Houwelingen et al. (2006) and Meijer and Goeman (2013), who consider the idea in ridge regression for high dimensional settings. Without any approximation, the exact cross-validated partial likelihood requires iteratively fitting n0 Cox models each with n0 − 1 observations, where n0 is the size of the local dataset. As a computational speedup, we extended the procedures in Meijer and Goeman (2013) and Verweij and Van Houwelingen (1993) by allowing for the effect of integration weights w.
Suppose we have the Cox post-integration local parameter estimates β̂w, with each dataset ordered by increasing failure time. We computed β̂w using a standard Newton-Raphson (NR) algorithm maximizing the partial likelihood, with the score vector and Hessian matrix in each NR update constructed as weighted sums in a manner similar to that described for linear models. For any generalized linear model, the leave-one-out parameter estimates β̂(−j) maximize the leave-one-out likelihood ℓ(−j). That is, a second-order Taylor expansion of ℓ(−j) around the full-data estimate β̂ (equivalently, a first-order expansion of its derivative) yields the following simplified expression:
$$\hat{\beta}_{(-j)} \approx \hat{\beta} + \left(-\frac{\partial^2 \ell_{(-j)}(\hat{\beta})}{\partial\beta\,\partial\beta^T}\right)^{-1}\frac{\partial \ell_{(-j)}(\hat{\beta})}{\partial\beta} \qquad (6)$$
which exactly equals one N-R update using the leave-one-out likelihood (Meijer and Goeman, 2013). Meijer and Goeman’s usage of this expression to derive an approximation for the Cox leave-one-out parameters is based on the full Cox likelihood including the baseline hazard, in order for the Hessian to be diagonal. The resultant calculation of the leave-one-out parameters is nontrivial; we thus detail our approach to using their formula below.
Given the parameters with no omitted local observations β̂w, we first construct the post-integration local risk vector r0 as the reverse cumulative sum of the hazard vector h0 = exp(X0β̂w). The Breslow estimator of the cumulative baseline hazard for each local failure time y0j is Ĥ0(y0j) = Σ_{l ≤ j} δ0l/r0l. Note that the sum in each Breslow estimator is cumulative and its calculation is considered only over the local failure times. The length n0 vector Ĥ0 collects the Breslow estimator at each failure time. In order to calculate the second derivative of the Cox full likelihood with respect to β̂w, define the local weight matrix W0 = diag{Ĥ0(y0j)exp(x0jᵀβ̂w)}; the Hessian is equal to −X0ᵀW0X0.
Similarly, we can calculate XᵢᵀWᵢXᵢ for each of the external datasets and, using our integration weights w, take the weighted sum Aw = X0ᵀW0X0 + Σ_{i=1}^{K} wᵢXᵢᵀWᵢXᵢ. Denoting by g0j = x0j(δ0j − W0,jj) the jth local contribution to the gradient, the leave-one-out parameters can then be calculated as follows:
$$\hat{\beta}_{w(-j)} \approx \hat{\beta}_w - \frac{A_w^{-1}\,x_{0j}\,(\delta_{0j} - W_{0,jj})}{1 - V_{jj}} \qquad (7)$$
where Vjj is the jth diagonal element of the matrix V = W0^{1/2}X0Aw⁻¹X0ᵀW0^{1/2}.
The primary speedup of this approximation is that only one diagonal Hessian is inverted to approximate each β̂w(−j), rather than potentially many in the exact iterative calculation of β̂(−j) before its convergence. As a caveat, this equation assumes that the removal of the jth observation does not significantly impact the cumulative baseline hazard, which therefore remains fixed in each leave-one-out estimate. To conclude, the leave-one-out local estimates are then plugged into the negative cross-validated partial likelihood formula
$$-\mathrm{CVPL}(w) = -\sum_{j=1}^{n_0}\left[\ell\big(\hat{\beta}_{w(-j)}\big) - \ell_{(-j)}\big(\hat{\beta}_{w(-j)}\big)\right] \qquad (8)$$
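The following numpy sketch illustrates the approximate procedure of this section (Breslow baseline hazard, one-step leave-one-out estimates as in Equation (7), and the negative CVPL of Equation (8)). It assumes each dataset is pre-sorted by increasing failure time and that external weight matrices are built from each dataset’s own Breslow estimator evaluated at β̂w; it is our simplified reading of the method, not the authors’ released code, and ties are ignored.

```python
import numpy as np

def cox_partial_loglik(beta, X, d):
    """Cox partial log-likelihood; rows assumed ordered by increasing failure time."""
    eta = X @ beta
    h = np.exp(eta)
    r = np.cumsum(h[::-1])[::-1]          # risk (total hazard) at each failure time
    return np.sum(d * (eta - np.log(r)))

def neg_cvpl(w, beta_w, X0, d0, X_ext, d_ext):
    """Approximate negative CVPL (Eq. 8) at the post-integration estimate beta_w."""
    n0, p = X0.shape
    h0 = np.exp(X0 @ beta_w)
    r0 = np.cumsum(h0[::-1])[::-1]        # local risk vector
    H0 = np.cumsum(d0 / r0)               # Breslow cumulative baseline hazard
    W0 = H0 * h0                          # diagonal of the local weight matrix
    A = X0.T @ (W0[:, None] * X0)
    for wi, Xi, di in zip(w, X_ext, d_ext):
        hi = np.exp(Xi @ beta_w)
        ri = np.cumsum(hi[::-1])[::-1]
        Wi = np.cumsum(di / ri) * hi
        A += wi * Xi.T @ (Wi[:, None] * Xi)
    A_inv = np.linalg.inv(A)
    V_diag = W0 * np.einsum("ij,jk,ik->i", X0, A_inv, X0)
    cvpl = 0.0
    for j in range(n0):
        # one-step leave-one-out estimate (Eq. 7)
        beta_j = beta_w - (A_inv @ X0[j]) * (d0[j] - W0[j]) / (1.0 - V_diag[j])
        keep = np.arange(n0) != j
        cvpl += (cox_partial_loglik(beta_j, X0, d0)
                 - cox_partial_loglik(beta_j, X0[keep], d0[keep]))
    return -cvpl
```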
2.4. Mathematical Properties
In this section, we briefly describe some mathematical properties of the optimal weights w in our proposed data integration framework with linear models. For simplicity, assume that there is only one external dataset (X1, y1), and that the local dataset is (X0, y0). The weight for the external dataset is then only a scalar w. The parameter estimates for the two datasets are β̂i = (XᵢᵀXᵢ)⁻¹Xᵢᵀyᵢ, i = 0, 1. The weighted parameter estimate is given by:
$$\hat{\beta}_w = \left(X_0^T X_0 + w\, X_1^T X_1\right)^{-1}\left(X_0^T y_0 + w\, X_1^T y_1\right).$$
The mathematical properties of the estimator β̂w are summarized in Proposition 1.
Proposition 1 (Mathematical properties of β̂w): Suppose the local estimator is unbiased, that is, Bias(β̂0) = 0, with Var(β̂i), i = 0, 1, the corresponding covariance matrices. Regardless of the values of X0, X1 and the value of Bias(β̂1), there always exists a w ∈ (0, ∞) which minimizes MSE(β̂w). Furthermore, while in general a closed-form expression for the optimal w is hard to obtain, it is possible to derive one in some special cases, for instance when Bias(β̂1) = 0 and when p = 1; the explicit expressions are given alongside the proof in Section S1. As a consequence, under the condition stated there, the optimal w lies in (0, 1], and w = 1 if and only if Bias(β̂1) = 0.
The proof of Proposition 1 is given in Section S1 in the Supporting Information.
3. Full and Reduced Space Optimization
3.1. Full Space Optimization
In the preceding section, we defined how the integration weights w influence the value of the predicted residual error sum of squares (PRESS) statistic, as weighted sums of the products that constitute the estimation equation. This idea is similar for the Cox model. This framework is therefore a constrained optimization problem: determining the integration weights that minimize the objective functions, where the individual weights wi are constrained to be between 0 (no integration) and 1 (full integration). For simplicity, in this paper we do not allow weights to be greater than 1. Intuitively, we are essentially assuming that the quality of local data is always at least as good as that of external data.
The most straightforward approach to solving this constrained optimization problem is to directly find the vector of weights w that minimizes the LOOCV error. This is particularly useful when the number of external datasets is small, and it is not too computationally challenging to solve this K-dimensional optimization. In the following sections, we used the L-BFGS-B algorithm (Byrd et al., 1995), which can incorporate box constraints (each wi ∈ [0, 1]), to perform this full optimization. With respect to the number of external datasets, the L-BFGS-B algorithm’s computational cost is linear, or O(K) (Byrd et al., 1995), so the time associated with the full optimization is not particularly onerous unless K is quite large. From simulations examining run-time (not presented), the integration method scales roughly linearly in the size of the local dataset n0 and the external datasets ni, is roughly constant for small to moderate p (≤ 15), and is linear for larger p. Although we do not present an explicit time complexity formula, the integration process takes only minutes for several external datasets with sample sizes in the hundreds.
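As an illustration, the full-space optimization can be set up with scipy’s L-BFGS-B as sketched below, reusing the hypothetical weighted_press objective from Section 2.3.1 (for Cox models, the negative CVPL objective would be substituted). This is a sketch of the setup, not the authors’ implementation.

```python
import numpy as np
from scipy.optimize import minimize

def full_optimization(X0, y0, X_ext, y_ext):
    """Find w in [0, 1]^K minimizing the LOOCV error (linear-model objective)."""
    K = len(X_ext)
    res = minimize(
        weighted_press,              # LOOCV objective as a function of w (Section 2.3.1)
        x0=np.zeros(K),              # start from "no integration"
        args=(X0, y0, X_ext, y_ext),
        method="L-BFGS-B",
        bounds=[(0.0, 1.0)] * K,     # box constraints w_i in [0, 1]
    )
    return res.x                     # estimated integration weights
```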
The main advantage of the full optimization approach is that the external datasets can be considered jointly in their potential contributions with respect to the local dataset’s parameters. However, as aforementioned, there is also the issue of computational resources required with a large number of external datasets, which motivates the development of the two-parameter optimization method described in the following section.
3.2. Reduced Space Optimization
3.2.1. Motivating the Likelihood Ratio Test as an Integration Tool.
Although full optimization attempts to find weights that minimize the LOOCV error (which should also reduce both estimation and prediction error), the optimization is still performed over K dimensions. This can be an issue when the number of external datasets is even moderately large, especially in the case of the Cox model, where calculating the LOOCV error requires more work. Furthermore, since the optimization is non-convex, there is no guarantee that the global minimum is found. Motivated by a desire to reduce the computational resources necessary for integrating a large number of survival datasets, we now introduce an alternative method to minimize the LOOCV error through the use of a likelihood-ratio test. The likelihood-ratio test is a useful tool for data integration, given that integration can be reformulated as a hypothesis testing problem. Under the null, the estimated parameters of an external dataset are sufficiently close to those of the local dataset such that a ‘reduced’ model fitting one shared set of parameters is sufficient. Again, the assumption that β̂0 is consistent for the true parameters β is important, because β̂0 is the point of comparison for whether external data is integrated.
The procedure is described as follows. As before, let β and βi represent the parameter vectors by which the response vectors y0 and yi are generated from the design matrices X0 and Xi, i = 1, …, K. For each external dataset, the likelihood-ratio test constructs a ‘stacked’ reduced model fitting p parameters and a ‘wide’ full model fitting 2p parameters from n = n0 + ni observations, as shown below:
$$\text{reduced: } \begin{pmatrix} y_0 \\ y_i \end{pmatrix} \sim \begin{pmatrix} X_0 \\ X_i \end{pmatrix}\beta, \qquad \text{full: } \begin{pmatrix} y_0 \\ y_i \end{pmatrix} \sim \begin{pmatrix} X_0 & \mathbf{0} \\ \mathbf{0} & X_i \end{pmatrix}\begin{pmatrix} \beta \\ \beta_i \end{pmatrix}.$$
The likelihood-ratio test is based on the test statistic λLR = −2[ℓR − ℓF], where ℓR and ℓF are respectively the reduced and full model log-likelihoods, with λLR asymptotically following a χ² distribution with p degrees of freedom under the null. The linear log-likelihood is:
$$\ell(\beta, \sigma) = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{j=1}^{n}\left(y_j - x_j^T\beta\right)^2 \qquad (9)$$
It is assumed that a single σ is sufficient to describe both the local and external error distributions under the alternative. The Cox partial log-likelihood is:
$$\ell(\beta) = \sum_{i=1}^{n}\delta_i\left(\log h_i - \log r_i\right) \qquad (10)$$
where hi = exp(xiᵀβ) is the ith element of the hazard vector and ri the risk, or total hazard, at the ith patient’s failure time, calculated by summing the hi’s associated with patients with longer failure times.
As the distance between β̂0 and β̂i increases, the test statistic also increases, as one set of common parameters β1, …, βp becomes unable to accommodate both the local and external datasets, causing ℓR to decrease relative to ℓF. External datasets that are rejected by the test are those whose estimated parameters β̂i are sufficiently unequal to those of the local dataset β̂0, the best estimator of β given no other information. These correspond to the same datasets we would not want to integrate with the local data, which is presupposed to be generated from the true parameters. That is, external datasets with small test p-values qi should also have small integration weights wi.
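A minimal sketch of the linear-model version of this test is given below; the Cox version would compare partial likelihoods instead. The helper name lr_test_pvalue and the profiling out of the common σ via residual sums of squares are our illustrative choices.

```python
import numpy as np
from scipy.stats import chi2

def lr_test_pvalue(X0, y0, Xi, yi):
    """Likelihood-ratio p-value: stacked (common-beta) vs. wide (separate-beta) model."""
    n0, p = X0.shape
    n = n0 + Xi.shape[0]

    def rss(X, y):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ beta) ** 2)

    # Reduced model: one set of p parameters for the stacked data
    rss_reduced = rss(np.vstack([X0, Xi]), np.concatenate([y0, yi]))
    # Full model: separate local and external parameters (2p total), shared sigma
    rss_full = rss(X0, y0) + rss(Xi, yi)
    # With the MLE of the common sigma profiled out: -2 log LR = n * log(RSS_R / RSS_F)
    lam = n * np.log(rss_reduced / rss_full)
    return chi2.sf(lam, df=p)        # p-value with p degrees of freedom
```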
3.2.2. Minimizing the LOOCV Error over the Four-Point Family with Two Parameters.
Using the likelihood ratio test procedure, a p-value qi is generated for each external dataset, which should be equated in some way to an integration weight wi. Thus, it is desired to formulate a mapping qi ↦ wi from [0, 1] to [0, 1] between the two spaces. Some rules for this mapping are immediately apparent, deriving from the interpretation of the test statistic. First, let (qi, wi) denote a p-value/integration weight pair. If qi = 1, the reduced and full models fit equally well (β̂i is indistinguishable from β̂0) and the external dataset should be integrated entirely (wi = 1). Thus, the mapping must include the point (1, 1), and similar reasoning can be employed to argue that it must also include (0, 0). Furthermore, the mapping must be monotonically increasing in the p-value space, with the rationale that a higher test p-value corresponds to a smaller distance between β̂0 and β̂i.
Suppose we directly apply the likelihood-ratio test result in deciding whether to integrate an external dataset: discarding the dataset if the test is rejected and completely integrating it otherwise. That is, qi ∈ (0.05, 1] ⟺ wi = 1 and qi ∈ [0, 0.05] ⟺ wi = 0. This “binary” mapping is advantageous in that it contains a mechanism for completely discarding external datasets that sufficiently deviate from the local data. On the other hand, it could be argued that this “binary” mapping is too reductive in that it assigns equal weights to datasets with qi close to 1 and those with marginally insignificant p-values (qi ≈ 0.05). One alternative is instead to directly use the test p-values as the integration weights (wi = qi), which has the advantage of being able to assign fractional weights to datasets that should be integrated partially. On the other hand, it can never completely discard an external dataset by assigning wi = 0.
Our motivation for introducing the binary and identity mappings is that both methods are members of a “four-point” family of mappings defined as the connection of four points: (0, 0), (qA, 0), (qB, 1), and (1, 1), where it is understood that 0 ≤ qA ≤ qB ≤ 1. For example, in the binary method qA = qB = 0.05 whereas in the identity method, qA = 0 and qB = 1. Of course, there is nothing restricting us to these specific choices of qA and qB. There may exist qA and qB that combine the advantages of the two mappings while negating their shortcomings. In general, the form of a mapping can have regions for complete integration, fractional integration, and discarding. The idea is that for points qA, qB and likelihood ratio test p-value qi, the weight wi = 0 if qi ≤ qA, wi = (qi − qA)/(qB − qA) if qA < qi ≤ qB, and wi = 1 if qi > qB. A depiction of different forms of the four-point family is available in Figure S1 in the Supporting Information.
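For concreteness, the four-point mapping can be implemented in a few lines; the sketch below (with our own function name four_point_weights) reproduces the piecewise rule just described, with the binary and identity mappings recovered at (qA, qB) = (0.05, 0.05) and (0, 1).

```python
import numpy as np

def four_point_weights(q, qA, qB):
    """Map LR-test p-values q to weights w via the four-point family
    through (0,0), (qA,0), (qB,1), (1,1) with 0 <= qA <= qB <= 1."""
    q = np.asarray(q, dtype=float)
    w = np.zeros_like(q)                      # q <= qA: discard (w = 0)
    w[q > qB] = 1.0                           # q > qB: complete integration
    mid = (q > qA) & (q <= qB)                # fractional integration in between
    w[mid] = (q[mid] - qA) / (qB - qA)
    return w
```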
The way we choose qA and qB circles back to the idea of minimizing the LOOCV errors (Equations 2 and 3). In Section 2.3.1, we demonstrated how the predicted residual error sum of squares (PRESS) statistic is a function of the integration weights w and did the same for the CVPL in Section 2.3.2. The weights w are themselves functions of the test p-values q, but q is fixed, such that w is entirely determined through the mapping specified by the choice of qA and qB in the four-point family. That is, if we restrict the set of all possible mappings to just the four-point family, we have reduced the K-parameter full optimization problem of specifying individual weights wi, to the 2-parameter optimization of choosing qA and qB. Under this 2-parameter or “testing” optimization, the objective functions defined in Equation 2 and 3 change only in the associated constraints. The full optimization constraint that the integration weights are between 0 and 1 (wi ∈ [0, 1]) is replaced by the constraint that the second and third points of the four-point family are between 0 and 1 (0 ≤ qA ≤ qB ≤ 1):
$$\min_{0 \le q_A \le q_B \le 1}\; \sum_{j=1}^{n_0}\left(y_{0j} - x_{0j}^T\hat{\beta}_{w(-j)}\right)^2 \qquad \text{and} \qquad \min_{0 \le q_A \le q_B \le 1}\; -\sum_{j=1}^{n_0}\left[\ell\big(\hat{\beta}_{w(-j)}\big) - \ell_{(-j)}\big(\hat{\beta}_{w(-j)}\big)\right],$$
where w = w(qA, qB) is determined by the four-point mapping and we emphasize that the minimization is over the two parameters qA and qB.
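Putting the pieces together, a sketch of the 2-parameter optimization is shown below. It reuses the hypothetical four_point_weights and weighted_press helpers from the earlier sketches and reparametrizes qB = qA + t(1 − qA), so that simple box constraints on (qA, t) enforce 0 ≤ qA ≤ qB ≤ 1; this is one possible way to handle the ordering constraint, not necessarily the authors’ choice.

```python
import numpy as np
from scipy.optimize import minimize

def two_parameter_optimization(q, X0, y0, X_ext, y_ext):
    """Choose (qA, qB) in the four-point family by minimizing the linear LOOCV error.

    q: likelihood-ratio test p-values for the K external datasets.
    """
    def objective(theta):
        qA, t = theta
        qB = qA + t * (1.0 - qA)                      # guarantees qA <= qB <= 1
        w = four_point_weights(q, qA, qB)             # weights implied by (qA, qB)
        return weighted_press(w, X0, y0, X_ext, y_ext)

    res = minimize(objective, x0=np.array([0.05, 0.5]),
                   method="L-BFGS-B", bounds=[(0.0, 1.0), (0.0, 1.0)])
    qA, t = res.x
    return four_point_weights(q, qA, qA + t * (1.0 - qA))
```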
4. Simulation Studies
We performed simulations mimicking situations encountered in the real world (limited amounts of local data and larger public datasets of unknown quality) to assess the utility of the methods described in the previous section. Clinical and genetic expression data are characterized by potentially high correlations across predictors, heteroskedasticity, and non-zero true effects.
With these observations in mind, simulation scenarios are described in Table 1. K is the total number of external datasets, of which K01 should be integrated fully and K0w should be integrated partially. The true parameter vector β has length 5, with elements drawn from either a standard normal or a N(1, 1) distribution. The local dataset and the K01 fully integrated external datasets have generating parameters β. Let νi be a length-5 noise vector with elements drawn from a standard normal for each external dataset, i = 1, …, K; the K0w partially integrated datasets have generating parameters equal to β perturbed by a small multiple of νi, and the non-integrated datasets have parameters perturbed by a larger multiple of νi.
Table 1.
Simulation scenarios.
| Scenario | K | K01 | K0w | Local corr. | External corr. | β dist. | Obs. |
|---|---|---|---|---|---|---|---|
| 1 | 4 | 2 | 0 | I | I | N(0, 1) | η + 0.5ϵ |
| 2 | 4 | 1 | 1 | Σ0 | Σ0 | N(0, 1) | η + 0.5ϵ |
| 3 | 8 | 4 | 0 | Σ0 | Σi | N(1, 1) | η + 0.5ϵ |
| 4 | 8 | 2 | 2 | Σ0 | Σi | N(1, 1) | η + 0.5ϵ′ |

Note: ϵ ~ N(0, 1), ϵ′ ~ N(0, 0.5 + logit⁻¹(Σx)), η = xᵀβ is the linear predictor.
Covariance matrices were constructed with elements drawn from a standard normal and converted to correlation matrices. Data in the design matrices are either uncorrelated (Scenario 1), have the same correlation structure in the local and external datasets (Scenario 2), or different local and external structures (Scenarios 3 and 4). The linear predictor η is varied with additional noise ϵ drawn from a standard normal for each observation. In Scenario 4, heteroskedasticity is introduced by drawing ϵ′ from a normal distribution with mean 0 and variance 0.5 + logit−1(Σx) for each observation.
For all simulations, the local or training dataset is of size n0 = 50, sampled from a hypothetical population of 500 (with the other 450 observations representing the testing/validation dataset). For survival data, the local dataset consists of the observation with the longest survival time and 49 other observations, in order for the prediction error to be computed over the remaining observations. All external datasets have the same sample size, i.e., n1 = ⋯ = nK = 200. For survival data, time to death and time to censoring are simulated from exponential distributions with rates exp(η) and exp(x1β1), respectively, with the event indicator equal to 1 if the time to death is the smaller of the two and 0 otherwise. Each scenario was repeated for 50 replicates.
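For reference, a simplified sketch of a Scenario 1–style data-generating process is given below. It reflects our reading of Table 1, with an illustrative choice for how the noisy external parameters are perturbed; it is not the authors’ simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n0, n_ext, K, K01 = 5, 50, 200, 4, 2

beta = rng.normal(0, 1, p)                       # true parameters, N(0, 1) elements
X0 = rng.normal(size=(n0, p))                    # Scenario 1: uncorrelated predictors
y0 = X0 @ beta + 0.5 * rng.normal(size=n0)       # local responses, eta + 0.5*eps

X_ext, y_ext = [], []
for i in range(K):
    Xi = rng.normal(size=(n_ext, p))
    if i < K01:
        beta_i = beta                            # datasets that should be fully integrated
    else:
        beta_i = beta + rng.normal(0, 1, p)      # noisy datasets (perturbation scale illustrative)
    X_ext.append(Xi)
    y_ext.append(Xi @ beta_i + 0.5 * rng.normal(size=n_ext))
```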
Results for the 4 simulation scenarios are displayed in Figure 1. Values in the figure are reported as the mean percentage reduction in error after integration, averaged over the replicates for each of full and 2-parameter optimization. The distribution of the local-only to post-integration error ratios tended to be skewed right. Thus, to calculate confidence intervals (CIs), we took the mean R of the log local-only to post-integration error ratios and its 95% Wald-type CI R ± 1.96σR, calculating the mean percent reduction as 1 − 1/e^R and the percent reduction CIs similarly from the CI for the log mean ratio. Our integration method significantly reduces error if the CI around the percent reduction in error does not include 0 (bolded horizontal line in the figure).
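The error-ratio summary can be computed as in the short sketch below, which interprets σR as the standard error of the mean log ratio (an assumption on our part); variable names are ours.

```python
import numpy as np

def percent_reduction_ci(local_err, integrated_err):
    """Mean percent error reduction and 95% Wald CI from per-replicate error ratios."""
    logR = np.log(np.asarray(local_err) / np.asarray(integrated_err))
    R = logR.mean()
    se = logR.std(ddof=1) / np.sqrt(len(logR))      # standard error of the mean log ratio
    to_pct = lambda r: 1.0 - 1.0 / np.exp(r)        # percent reduction = 1 - 1/e^R
    return to_pct(R), (to_pct(R - 1.96 * se), to_pct(R + 1.96 * se))
```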
Figure 1.

Percent decrease in estimation error (left) and prediction error (right) for simulation scenarios described. Bands represent 95% confidence intervals calculated from log mean local-only vs. post-integration error. Note the y-axis scale differs between plots.
Our method shows substantial reductions in both estimation error and prediction error. For linear models, the local-only to post-integration estimation error ratio ranges from 1.8 (Scenario 1, full opt.) to 7.3 (Scenario 3, 2 par opt.), corresponding to a mean 45 to 86 percent decrease in estimation error relative to using only local data. For Cox models, the ratio ranges from 2.3 (Scenario 1, full opt.) to 26.4 (Scenario 3, 2 par opt.), corresponding to a mean 56 to 96 percent error reduction. These considerable reductions arise from the combination of the (by design) small size of the local training dataset and the squared error loss function, which together magnify the utility of our integration method.
Unlike with estimation error where relatively small changes in estimated parameters can produce large local-only/post-integration ratios, we would not expect such values with prediction error ratios (for mathematical justification, refer to the derivation of the prediction error bound in Proposition 2 in Section S1 in the Supporting Information). It is more practical to examine the prediction error ratios as the true parameters governing biological processes and disease mechanisms are generally unknown. For linear models, the prediction error ratio ranges from 1.01 (Scenario 4, full opt.) to 1.05 (Scenario 3, 2 par opt.), a 1.3 to 3.6 percent mean error reduction. For Cox models, the ratio ranges from 1.05 (Scenario 2, 2 par opt.) to 1.11 (Scenario 3, 2 par opt.), a 5.0 to 11.1 percent error reduction.
Of note is that the largest post-integration decreases in both estimation and prediction error came from the Cox model simulations. This is not surprising, given that per observation, survival data provides less information than continuous data, partially arising from the relatively small contribution of censored observations to the Cox partial likelihood. It is also interesting to see the slightly better performance of 2-parameter optimization given its reduced optimization space. When the noise is reduced in the local dataset and maintained in the external datasets, the full optimization method performs better (see Figure S2 in the Supporting Information). We hypothesize this occurs because the full method tends to overfit the external weights, which is less of an issue when there is little noise in the local dataset relative to the external datasets.
4.1. Comparisons to Existing Weighted Integration Methods
We also ran simulations comparing our proposed integration methods to two other weighted integration methods: the power prior of Ibrahim and Chen (2000) and the likelihood ratio data fusion of Guo et al. (2012). The power prior is implemented in the package NPP (Han et al., 2019), while our implementation of likelihood ratio data fusion is available with our integration methods on GitHub. As neither method provides support for Cox models and NPP only allows one historical dataset, we considered comparative simulations with linear models and one external dataset. Simulation results are displayed in Figure 2, with details of the simulations in Section S2 in the Supporting Information. We found that when the external dataset should be integrated completely with weight 1, our methods perform comparably to the power prior (but are significantly faster) and outperform likelihood ratio data fusion in terms of reducing both estimation and prediction error. When the external dataset should not be integrated (weight 0), our methods outperform the power prior and are comparable to likelihood ratio data fusion. These results suggest that our approach is more robust to differing qualities of external datasets than previous weighted integration methods.
Figure 2.

Comparison of performance of competing integration methods in reducing prediction error under different true external dataset scenarios (top: full integration, middle: partial integration, bottom: no integration). In each plot, from left to right: full optimization, 2 parameter optimization, likelihood ratio (LR) data fusion, power prior (as NPP), naive integration, no integration. Y-axis represents the root mean square error (RMSE) in the validation dataset (lower is better).
5. Real Data Case Study: SRTR Kidney Data
In patients with end-stage renal disease, kidney transplantation is known to lead to improved survival over dialysis (Wolfe et al., 1999). The supply of kidneys from living donors and cadavers is insufficient to meet the needs of individuals requiring transplantation, with median wait times of up to 5 years (Davis et al., 2014). As a result, individuals on the transplant list with contraindications or who are not currently suitable for transplantation may be wait-listed or given inactive status; a large and growing number of these patients never receive a kidney due to death or removal from the transplant list after permanent contraindication (Delmonico and McBride, 2008; Tennankore et al., 2020). Overall, an analysis of factors affecting graft failure and death in patients who did receive a kidney transplant has utility in both pretransplant allocation and post-transplant care, potentially decreasing total end-stage renal disease deaths.
With this in mind, we applied our proposed method to Scientific Registry of Transplant Recipients (SRTR, Snyder et al. (2016)) data from the kidney transplant program at the University of Michigan Medical Center between 2007 and 2011. We are interested in examining how the duration of a recipient’s end-stage renal disease, the recipient’s age and BMI, and expanded criteria donor status affect graft success and survival. Expanded criteria donors are those either aged over 60, or aged 50–59 and meeting additional donor criteria, whose donorship is intended to expand the supply of transplanted organs (Metzger et al., 2003).
Datasets were constructed by segregating data by year of transplant and race of recipient; the datasets of interest are 2010–2011 Blacks (n = 6821, 85% censored), 2010–2011 Whites (n = 12174, 87% censored), 2010–2011 Asians (n = 1388, 90% censored), and 2007–2009 Blacks (n = 9510, 80% censored). An event is death or graft failure before the end of 3-year follow-up. A Cox proportional hazards model was fit with the following covariates: time with end-stage renal disease (categorical: 1–5 years and >5 years, baseline: <1 year), recipient age and BMI (both transformed by subtracting the mean and dividing by 10), and whether the donated kidney was from an expanded criteria donor (categorical, baseline: not from an expanded criteria donor).
We considered 40 replications of our integration approach on “local” data of size n = 100, 200, or 300 sampled from 2010–2011 Blacks. This dataset was chosen due to its recency and relatively higher event rate. The three external datasets are samples of size n = 500 from 2010–2011 Whites, 2010–2011 Asians, and 2007–2009 Blacks. For each replication, ties in event times were broken through the addition of small uniform random increments, with prediction error calculated over the remaining 2010–2011 Blacks not in the local data.
The results of our integration process are displayed in Figure 3. The median run-time for one replicate using full/2-parameter optimization, respectively, was 24.2/7.13 seconds (local n = 100), 46.2/17.8 seconds (local n = 200), and 109/22.7 seconds (local n = 300). As the simulation studies of the previous section suggest that the distribution of prediction errors is skewed to the right, we construct a 95% confidence interval around the mean of the log local-only to post-integration prediction error ratio. We report small but significant reductions in cumulative prediction error upon application of integration. As expected, as the size of the training sample increases, the sample itself more accurately describes the rest of the 2010–2011 Black transplant dataset, decreasing the prediction error ratio towards 1. Across the range of training data sample sizes, applying data integration reduces cumulative prediction error anywhere from 1 to 4.4 percent. Full optimization and 2-parameter optimization perform similarly. External weights tended to be high, likely due to the low event rate across the datasets.
Figure 3.

Left: plot of percent reduction in prediction error upon integration of other SRTR datasets. X-axis: size of training/local data; Y-axis: percent reduction in prediction error. Right: data integration runtime, in seconds on log10 scale.
Additionally, we wanted to explore how integration affects the parameter estimates associated with the covariates of interest. We display the effect of training data sample size and the integration process on the estimates associated with each of the covariates in Figure 4. We observed that integration reduced the variability of covariate coefficient estimates and moved negative coefficient values towards 0 and towards positive values. That the values of the coefficients become overall more positive is consistent with prior literature, where duration of end-stage renal disease (Goldfarb-Rumyantzev et al., 2005), as well as recipient age (Veroux et al., 2012) and BMI (Meier-Kriesche et al., 2002), are important risk factors for graft failure and death.
Figure 4.

Coefficient estimates for covariates in SRTR kidney transplant model. Top row: age, BMI. Middle row: duration of end-stage renal disease 1–5 years, duration of end-stage renal disease >5 years (reference: end stage renal disease <1 year). Bottom row: expanded donor criteria.
6. Discussion and Extensions
We have elucidated two methods of determining integration weights for external datasets by optimizing the objective function of LOOCV error, a full optimization and a preferred, simpler 2-parameter optimization. With any optimization, there is the potential that the global solution is not reached, which in this case may lead to incorrect weight assignments. In full optimization, we considered a K-dimensional 0 vector as starting optimization weights, corresponding to the null hypothesis of no integration. When compared to choosing the 2-parameter output (the best available “guess”) as starting optimization weights, we observed practically no difference in post-integration prediction error, indicating that our optimization reaches the global solution. This result is included in Figure S3 in the Supporting Information. Another interesting observation to note is that a relation can be made between our results and ridge regression; in our approach, there exists a set of external weights that minimizes local leave-one-out error whereas in ridge regression, there exists a penalty that minimizes mean squared error.
In this paper, we presented our weighted integration method for linear and Cox proportional hazards models, due to the prevalence of continuous and survival outcomes in the clinical space. In principle, our integration method is usable with other generalized linear models, such as binary logistic and Poisson models. In initial exploratory work with logistic models we found that our approach showed some promising results in terms of prediction error improvement, but that integration could also be challenging because relatively few changes in binary responses can significantly bias coefficient estimates in small-size local datasets. Nonetheless, many clinical outcomes are described by categorical variables, so integration with logistic models represents a possible direction for future work.
In summary, we have developed a new method for data integration based on the minimization of the leave-one-out cross-validation error. Our data-driven integration method outputs full, partial, and zero weights, not only recognizing which external datasets should be integrated, but also differentiating integrated datasets with different levels of noise. Indeed, in our real-data case study with the public, and presumably noisy, SRTR datasets, information from transplants in other races and years reduced prediction error in estimating graft failures and deaths for 2 years of transplants in Blacks. As even the real-data analysis presented here ran quickly on a single local computer, we imagine cluster resources can increase both the number of predictors and candidate external datasets that are considered, providing a better understanding of how donor- and recipient-related covariates can affect future transplant outcomes.
Acknowledgements
The work presented in this paper was partly supported by National Institutes of Health grants 5T32CA083654, 5T32HG000040, 1R01DK129539 and the Investigators Awards grant program of Precision Health at the University of Michigan. Work with the SRTR data was supported in part by the Health Resources and Services Administration contract HHSH250-2019-00001C. The content is the responsibility of the authors alone and does not necessarily reflect the views or policies of the Department of Health and Human Services or the official views of U-M Precision Health, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.
Footnotes
Supporting Information
Web Appendices, Tables, and Figures referenced in Sections 2, 3, 4, 5, and 6 are available with this paper at the Biometrics website on Wiley Online Library. The code for the integration method, as well as that used to generate the results are available both as a zip file and also online at: https://github.com/lamttran/intasymm.
This paper has been submitted for consideration for publication in Biometrics
Data Availability Statement
The data that support the findings in this paper are available from the Scientific Registry of Transplant Recipients (SRTR). Restrictions apply to the availability of these data, which were used under license in this paper. Information on requesting data from the SRTR is available at https://www.srtr.org/requesting-srtr-data/data-requests/.
References
- Belsey D, Kuh E, and Welsch R (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons.
- Brookhart M, Sturmer T, Glynn R, Rassen J, and Schneeweiss S (2010). Confounding control in healthcare database research: challenges and potential approaches. Medical Care 48, S114–S120.
- Byrd R, Lu P, Nocedal J, and Zhu C (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing 16, 1190–1208.
- Davis AE, Mehrotra S, McElroy L, Friedewald J, Skaro A, Lapin B, et al. (2014). The extent and predictors of waiting time geographic disparity in kidney transplantation in the United States. Transplantation 97, 1049–1057.
- Delmonico FL and McBride MA (2008). Analysis of the wait list and deaths among candidates waiting for a kidney transplant. Transplantation 86, 1678–1683.
- Fu Y, Wang X, and Wu C (2009). Weighted empirical likelihood inference for multiple samples. Journal of Statistical Planning and Inference 139, 1462–1473.
- Goldfarb-Rumyantzev A, Hurdle J, Scandling J, Wang Z, Baird B, Barenbaum L, et al. (2005). Duration of end-stage renal disease and kidney transplant outcome. Nephrology Dialysis Transplantation 20, 167–175.
- Guo P, Wang X, and Wu Y (2012). Data fusion using weighted likelihood. European Journal of Pure and Applied Mathematics 5, 333–356.
- Han Z, Bai T, and Ye K (2019). NPP: Normalized Power Prior Bayesian Analysis. R package version 0.2.0.
- Hong X, Sharkey P, and Warwick K (2003). A robust nonlinear identification algorithm using PRESS statistic and forward regression. IEEE Transactions on Neural Networks 14, 454–458.
- Ibrahim J and Chen M-H (2000). Power prior distributions for regression models. Statistical Science 15, 46–60.
- Ibrahim J, Chen M-H, and Sinha D (2003). On optimality properties of the power prior. Journal of the American Statistical Association 98, 204–213.
- Inan G, Latif M, and Preisser J (2019). A PRESS statistic for working correlation structure selection in generalized estimating equations. Journal of Applied Statistics 46, 621–637.
- Jiang Y, He Y, and Zhang H (2016). Variable selection with prior information for generalized linear models via the prior lasso method. Journal of the American Statistical Association 111, 355–376.
- Louie B, Mork P, Martin-Sanchez F, Halevy A, and Tarczy-Hornoch P (2007). Data integration and genomic medicine. Journal of Biomedical Informatics 40, 5–16.
- Meier-Kriesche H-U, Arndorfer JA, and Kaplan B (2002). The impact of body mass index on renal transplant outcomes: a significant independent risk factor for graft failure and patient death. Transplantation 73, 70–74.
- Meijer R and Goeman JJ (2013). Efficient approximate k-fold and leave-one-out cross-validation for ridge regression. Biometrical Journal 55, 141–155.
- Metzger RA, Delmonico FL, Feng S, Port F, Wynn J, and Merion R (2003). Expanded criteria donors for kidney transplantation. American Journal of Transplantation 3, 114–125.
- Plante J-F (2008). Nonparametric adaptive likelihood weights. Canadian Journal of Statistics 36, 443–461.
- Plante J-F (2009). Asymptotic properties of the MAMSE adaptive likelihood weights. Journal of Statistical Planning and Inference 139, 2147–2161.
- Rodriguez-Bermudez G, Garcia-Laencina P, Roca-Gonzalez J, and Roca-Dorda J (2013). Efficient feature selection and linear discrimination of EEG signals. Neurocomputing 115, 161–165.
- Snyder JJ, Salkowski N, Kim S, Zaun D, Xiong H, Israni A, et al. (2016). Developing statistical models to assess transplant outcomes using national registries: The process in the United States. Transplantation 100, 288–294.
- Tennankore KK, Gunaratnam L, Suri RS, Yohanna S, Walsh M, Tanari N, et al. (2020). Frailty and the kidney transplant wait list: Protocol for a multicenter prospective study. Canadian Journal of Kidney Health and Disease 7.
- Than C, Ruths D, Innan H, and Nakhleh L (2017). Confounding factors in HGT detection: Statistical error, coalescent effects, and multiple solutions. Journal of Computational Biology 14.
- Van Houwelingen HC, Bruinsma T, Hart AA, Van’t Veer LJ, and Wessels LF (2006). Cross-validated Cox regression on microarray gene expression data. Statistics in Medicine 25, 3201–3216.
- Veroux M, Grosso G, Corona D, Mistretta A, Giaquinta A, Giufridda G, et al. (2012). Age is an important predictor of kidney transplantation outcome. Nephrology Dialysis Transplantation 27, 1663–1671.
- Verweij PJ and Van Houwelingen HC (1993). Cross-validation in survival analysis. Statistics in Medicine 12, 2305–2314.
- Wang X and Zidek J (2005). Selecting likelihood weights by cross-validation. The Annals of Statistics 33, 463–500.
- Wolfe R, Ashby V, Milford E, Ojo A, Robert E, Agodoa L, et al. (1999). Comparison of mortality in all patients on dialysis, patients on dialysis awaiting transplantation, and recipients of a first cadaveric transplant. The New England Journal of Medicine 341, 1725–1730.
- Zhai Y and Han P (2022). Data integration with oracle use of external information from heterogeneous populations. Journal of Computational and Graphical Statistics, in press.