Biometrika. 2022 Apr 12;110(1):119–134. doi: 10.1093/biomet/asac022

Data integration: exploiting ratios of parameter estimates from a reduced external model

Jeremy M G Taylor 1, Kyuseong Choi 2, Peisong Han 3
PMCID: PMC9919493  PMID: 36798840

Summary

We consider the situation of estimating the parameters in a generalized linear prediction model, from an internal dataset, where the outcome variable Inline graphic is binary and there are two sets of covariates, Inline graphic and Inline graphic. We have information from an external study that provides parameter estimates for a generalized linear model of Inline graphic on Inline graphic. We propose a method that makes limited assumptions about the similarity of the distributions in the two study populations. The method involves orthogonalizing the Inline graphic variables and then borrowing information about the ratio of the coefficients from the external model. The method is justified based on a new result relating the parameters in a generalized linear model to the parameters in a generalized linear model with omitted covariates. The method is applicable if the regression coefficients in the Inline graphic given Inline graphic model are similar in the two populations, up to an unknown scalar constant. This type of transportability between populations is something that can be checked from the available data. The asymptotic variance of the proposed method is derived. The method is evaluated in a simulation study and shown to gain efficiency compared to simple analysis of the internal dataset, and is robust compared to an alternative method of incorporating external information.

Keywords: Data integration, Omitted variable regression, Ratio of parameters, Transportability

1. Introduction

We consider developing a parametric prediction model for a binary outcome variable Inline graphic, given sets of covariates Inline graphic and Inline graphic. Here Inline graphic and Inline graphic represent vectors of Inline graphic and Inline graphic variables, which may be thought of as the conventional covariates and the newly discovered covariates, respectively. Individual data Inline graphic, for a simple random sample of size Inline graphic from a population of interest, referred to as the internal study population hereafter, are available. The joint law of Inline graphic in the internal population is denoted by Inline graphic, and we refer to this as the joint distribution. In addition, an external study analysed data from a different population, with the joint distribution denoted by Inline graphic, by fitting a generalized linear model of the form Inline graphic, and produced the maximum likelihood estimate Inline graphic for Inline graphic. Here Inline graphic is a link function and the expectation Inline graphic is taken under the external study population distribution Inline graphic. The raw data used to derive Inline graphic are not available. We assume that the external study sample size was large so that the uncertainty associated with Inline graphic is negligible, and no components of Inline graphic are zero. Our goal is to fit a model of the form

graphic file with name Equation1.gif (1)

that includes all the Inline graphic variables and also the Inline graphic variables. Here Inline graphic is allowed to be different from Inline graphic and the expectation Inline graphic is taken under the internal study population distribution Inline graphic. We assume that model (1) is correctly specified corresponding to Inline graphic. In fitting this model we want to make use of Inline graphic provided by the external study. While it would be simple to just analyse the internal data to obtain an estimate of Inline graphic, incorporating Inline graphic is likely to lead to more efficient estimates due to the large external sample. This could be important if Inline graphic is not large.

While this scenario may be common in many areas of science, we are particularly thinking of a medical setting where Inline graphic is a future or unknown and important binary event, the Inline graphic are commonly available variables in that setting, such as age and sex, and the Inline graphic are newer or less commonly measured variables, such as some recently discovered biomarkers. The interest is to include Inline graphic in a new prediction model with the hope that it will have better performance.

There is existing literature on this topic of using external summary information to aid in the estimation of a full model from an internal dataset. Cheng et al. (2018, 2019) proposed Bayesian approaches that rely on a direct link between the parameters in the two models. Other methods (Chatterjee et al., 2016) use a constrained semiparametric maximum likelihood approach by converting the external summary-level information into a constraint and then maximizing the internal data likelihood subject to this constraint. However, the constrained maximum likelihood method requires the probability distribution of the outcome and covariates to be the same in the two populations. This constrained maximum likelihood method is closely related to an empirical likelihood approach (Qin, 2000; Han & Lawless, 2019). See also Han et al. (2023). In a different approach, Gu et al. (2019) used the external model to create synthetic data, which was combined with the internal data and then analysed. Estes et al. (2018) developed a more robust empirical Bayes estimator that essentially downweights the external information if it is not compatible with the internal data. Zhai & Han (2022) developed a penalization method to simultaneously select and integrate the external information that is compatible with the internal data. Rahmandad et al. (2017) and Kundu et al. (2019) developed methods within a meta-analysis framework.

Whether the external information Inline graphic is useful for the estimation of Inline graphic depends on whether the joint distribution of Inline graphic, or some aspect of the joint distribution, is similar between the external and internal study populations. Much of the literature referenced above makes an assumption of transportability between Inline graphic and Inline graphic. Full transportability corresponds to Inline graphic or, equivalently, all of Inline graphic, Inline graphic and Inline graphic. In practice, partial transportability, in which only some aspects of the distributions are shared between the populations, may be more realistic. An example of partial transportability would be Inline graphic, but Inline graphic, or Inline graphic, but Inline graphic and Inline graphic. In general, we might expect marginal distributions to differ between populations, especially the Inline graphic distribution, since different study populations typically have different demographics. In practice, some information is usually available about Inline graphic, such as the marginal mean and standard deviation of each variable, which could be used to investigate how different Inline graphic is from Inline graphic. Conditional distributions might be more similar, which is related to the idea that causal dynamics may be expected to be stable (e.g., Penrose, 2004). For example, for the Inline graphic distribution, assuming that Inline graphic and Inline graphic, a plausibly realistic assumption on transportability is that the two link functions are the same, and that Inline graphic for Inline graphic, whereas Inline graphic. The intercept-only difference between the regression coefficients reflects the belief that the covariate effects are similar between the two study populations, but the disease prevalence may differ. The case of Inline graphic for some constant Inline graphic is another example, reflecting the belief that the relative covariate effects are similar between the two study populations, but not the absolute magnitudes.

For the Inline graphic distribution, a strong assumption would be that Inline graphic is independent of Inline graphic in both study populations. Another choice might be not to specify a model for Inline graphic, but rather to assume that the Inline graphic distribution is the same in the two study populations. A further option would be to specify a parametric model, such as a generalized linear model, and then restrict the regression coefficients to be related in certain ways in the two study populations.

It is reasonable to suppose that the variables Inline graphic exist in the external population, but were either not measured or not included in the external model. However, we will typically have no information on the distribution of Inline graphic in the external population, and thus we cannot check whether the distributions for Inline graphic and Inline graphic are transportable between the internal and external populations. We can, however, check from the internal study data whether the distribution of Inline graphic is similar between the two populations. We will propose a method that does not require Inline graphic, but does require some aspect of Inline graphic and Inline graphic to be the same.

Our method builds on the extensive literature on omitted covariates in generalized linear models (Gail et al., 1984; Neuhaus & Jewell, 1993). Consider a generalized linear model Inline graphic with a scalar Inline graphic and a reduced model Inline graphic in the same population. The reduced model is usually misspecified, or incompatible with the full model, if Inline graphic is nonlinear, because of the noncollapsibility of most generalized linear models. Neuhaus & Jewell (1993) showed that, when Inline graphic is independent of Inline graphic, the value of Inline graphic that best approximates the distribution of Inline graphic is related to Inline graphic through Inline graphic for some constant Inline graphic. So even though Inline graphic is typically not equal to Inline graphic, the ratios of the coefficients of the Inline graphic in the two models are very similar. This result is closely related to results in the measurement error literature on the relationship between the parameters of a regression model when the covariates do and do not have measurement errors (Monahan & Stefanski, 1992; Carroll et al., 2006). When Inline graphic is not independent of Inline graphic, the relationship between Inline graphic and Inline graphic is not so simple, and depends on the distribution Inline graphic in a complex way.

The proposed method also builds on the literature concerning the robustness of ratios of regression parameter estimates under model misspecification. The general result is that the ratio of regression coefficients can be estimated well even if the link function is misspecified (Solomon, 1984; Struthers & Kalbfleisch, 1986; Li & Duan, 1989; Taylor, 1989, 1990), and this has been studied for continuous, ordinal and censored survival outcomes. In other words, one can get good estimates of Inline graphic up to an unknown scaling factor when the link function is misspecified. The intuitive reason for the stability of the ratio of parameter estimates is that it represents the relative importance of one variable to another, in the sense that Inline graphic is the amount by which Inline graphic needs to be changed for a unit change in Inline graphic to give the same expected value of Inline graphic, and this would be the same irrespective of the link function.

Based on the aforementioned literature, we propose a data integration method that makes use of the ratios of the external study model coefficient estimate Inline graphic. The method works when the external reduced model for the distribution of Inline graphic leads to similar ratios of regression coefficient estimates when applied to Inline graphic and to Inline graphic. This similarity of the ratios can be quantitatively checked since we have both individual data from the internal study and the reduced model parameter estimates from the external study. Such a check on the ratios provides some assurance on applying the proposed method in practice, unlike many existing methods where assumptions on the distribution transportability cannot be explicitly checked. In many practical scenarios it is plausible that the ratios are transportable between study populations based on the interpretation that they represent the relative importance of the Inline graphic on Inline graphic, even though we do not have full transportability between populations.

2. Proposed method for integrating external information

2.1. Proposed method

The implementation of the proposed method for estimating Inline graphic by incorporating the external information in Inline graphic is as follows, where all models are fitted to the internal study data.

Step 1. Centre all the Inline graphic so that each Inline graphic has mean zero, and for each Inline graphic, fit a linear regression on Inline graphic, Inline graphic, and then calculate the residuals Inline graphic, where Inline graphic is the least squares estimate of Inline graphic.

Step 2. Fit the model

graphic file with name Equation2.gif (2)

to obtain the estimate Inline graphic. Here Inline graphic represents the common ratio of the regression coefficients of the Inline graphic between the internal and external studies.

Step 3. Estimate Inline graphic as Inline graphic, where

graphic file with name Equation3.gif

A summary of the different variables, models and parameters is provided in the Appendix.
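The three steps can be sketched in Python with NumPy alone. This is a minimal illustration and not the authors' implementation. It assumes a logistic link, and it assumes that model (2) has the linear predictor `gamma0 + alpha * (Xc @ beta_ext) + B_tilde @ theta`, which is one reading of the description of Step 2, since the displayed equation is not reproduced here. The function names, the Newton-Raphson fitter and the Step 3 back-transformation (obtained by substituting the Step 1 residual definition into the Step 2 linear predictor) are likewise illustrative assumptions.

```python
import numpy as np


def fit_logistic(Z, y, n_iter=50, tol=1e-10):
    """Newton-Raphson logistic regression; Z must already contain an intercept column."""
    coef = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ coef)))
        grad = Z.T @ (y - p)
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z
        step = np.linalg.solve(hess, grad)
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    return coef


def integrate_ratios(X, B, y, beta_ext):
    """Hypothetical helper. X: (n, p), B: (n, q), y: (n,), beta_ext: (p,) external slopes."""
    n = X.shape[0]

    # Step 1: centre X, regress each column of B on (1, X), keep the residuals.
    Xc = X - X.mean(axis=0)
    D = np.column_stack([np.ones(n), Xc])
    eta, *_ = np.linalg.lstsq(D, B, rcond=None)      # shape (p + 1, q)
    B_tilde = B - D @ eta

    # Step 2 (assumed form of model (2)): the X variables enter only through
    # the external index Xc @ beta_ext, scaled by one free parameter alpha.
    Z = np.column_stack([np.ones(n), Xc @ beta_ext, B_tilde])
    coef = fit_logistic(Z, y)
    gamma0, alpha, theta = coef[0], coef[1], coef[2:]

    # Step 3 (assumed back-transformation): rewrite the fit in terms of the
    # original B, using B_tilde = B - eta[0] - Xc @ eta[1:].
    # The intercept refers to the centred X.
    intercept = gamma0 - eta[0] @ theta
    slopes_X = alpha * beta_ext - eta[1:] @ theta
    return {"gamma0": gamma0, "alpha": alpha, "theta": theta,
            "intercept": intercept, "slopes_X": slopes_X, "B_tilde": B_tilde}
```

Under these assumptions Step 2 estimates `2 + q` parameters instead of the `1 + p + q` needed for an unconstrained fit, which is where any efficiency gain would come from.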

2.2. Justification for the proposed method

Step 1 above orthogonalizes each Inline graphic to Inline graphic. The ordinary least squares method leads to Inline graphic and Inline graphic for all Inline graphic and Inline graphic. Therefore, the orthogonalization creates new variables Inline graphic that are uncorrelated with the Inline graphic variables in the sample, which to some degree approximates the Inline graphic variables being independent of all the Inline graphic variables. This allows us to appeal to the property that the ratios of parameters in a reduced model with omitted covariates are similar to the ratios in a full model when the omitted covariates are independent of the Inline graphic. Although orthogonality is a weaker condition than independence, under some conditions specified later, the property about the ratios of parameters is retained.
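The distinction between orthogonality and independence can be seen in a small numerical sketch. The data-generating choices below (a covariate `b` that depends on `x` only through `x**2`) are purely illustrative assumptions and are not taken from the paper. The least squares residuals have zero sample mean and zero sample cross-product with the centred `x`, yet they remain strongly dependent on `x`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
b = x**2 + 0.1 * rng.normal(size=n)      # depends on x, but not linearly

# Step 1 style orthogonalization: centre x, regress b on (1, x), keep residuals.
xc = x - x.mean()
D = np.column_stack([np.ones(n), xc])
eta, *_ = np.linalg.lstsq(D, b, rcond=None)
b_tilde = b - D @ eta

mean_resid = b_tilde.mean()                              # zero up to rounding
cross_prod = xc @ b_tilde / n                            # zero up to rounding
corr_with_square = np.corrcoef(b_tilde, xc**2)[0, 1]     # far from zero

print(mean_resid, cross_prod, corr_with_square)
```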

Step 3 above makes the trivial connection between the parameters in the model for Inline graphic and those in the model for Inline graphic as

graphic file with name Equation4.gif

Step 2 is the crucial step that makes use of the ratios of parameter estimates from the external study, and Inline graphic is the corresponding scaling parameter. This step is based on making two connections of parameters in different models.

The first connection that Step 2 makes is between the parameters Inline graphic in the model for Inline graphic and the parameters Inline graphic in the model for Inline graphic within the internal study population, where the ratios of the coefficient estimates for Inline graphic are retained. The result, which will be explained in more detail in § 2.3, says that Inline graphic for some constant Inline graphic. Based on this result, if we already have an estimate Inline graphic of Inline graphic, then the estimate of Inline graphic from the Inline graphic model will be close to Inline graphic for some scalar Inline graphic. Thus we can fit a model of the form (2) with fewer parameters using Inline graphic.

The second connection that Step 2 makes is between Inline graphic and Inline graphic. For the purpose of gaining efficiency for internal model fitting, it is desired to make use of Inline graphic. The assumption we make is that the ratios of the parameters are the same in the sense that Inline graphic for all Inline graphic, or in other words Inline graphic for some constant Inline graphic. This assumption is weaker than assuming equality of the parameters between the two populations, and it also allows the intercepts to be different. The interpretation of the ratios as the relative importance of the Inline graphic on Inline graphic suggests that they are likely to be transportable from one population to another in many scenarios. An important aspect of this assumption is that it can be investigated from data, because we are provided with Inline graphic and Inline graphic can be obtained from the internal study data. From the perspective of incorporating external study information into internal model fitting, Step 2 has a similar spirit to the constrained maximum likelihood method (Chatterjee et al., 2016), since it fits a generalized linear model under parameter constraints provided by the external information.
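One informal way to look at the equal-ratios assumption is to fit the reduced model to the internal data and compare its slopes elementwise with the external estimates. The sketch below assumes a logistic link and uses hypothetical names (`fit_logistic`, `ratio_check`); the paper does not prescribe this particular diagnostic, and the sampling variability of the internal slopes should be kept in mind when judging whether the ratios agree.

```python
import numpy as np


def fit_logistic(Z, y, n_iter=50, tol=1e-10):
    """Newton-Raphson logistic regression; Z must already contain an intercept column."""
    coef = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ coef)))
        grad = Z.T @ (y - p)
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z
        step = np.linalg.solve(hess, grad)
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    return coef


def ratio_check(X, y, beta_ext):
    """Fit the reduced model Y ~ X internally and return slopes and slope ratios.

    beta_ext holds the external slopes (no intercept). The source assumes none
    of them is zero, so the elementwise division is well defined.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    coef = fit_logistic(np.column_stack([np.ones(n), Xc]), y)
    beta_int = coef[1:]
    return beta_int, beta_int / beta_ext
```

If the assumption holds, the entries of the returned ratio vector should be roughly equal to one another; their common value need not be 1.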

To gain more insights on the assumption of equal ratios, let us consider the case where Inline graphic for the external population has the same form as that in (1); that is,

graphic file with name Equation5.gif

Then the required assumption Inline graphic is essentially Inline graphic based on Proposition 1 in the next subsection, which then becomes

graphic file with name Equation6.gif

due to Inline graphic being a reparameterization of Inline graphic after Inline graphic was replaced by Inline graphic in the regression model.

This is not an assumption that can be checked from data; however, it might provide some insight into when the method is applicable. For example, the expression does not directly include the intercepts in the various models, so if Inline graphic, Inline graphic, and Inline graphic and Inline graphic have different intercepts in the generalized linear model, then the assumption is satisfied and the method will work. In another scenario, if the Inline graphic variables are only weakly associated with the Inline graphic variables, then the Inline graphic parameters will be small, so all that is required for the method to provide a good approximation is that the Inline graphic have a similar ratio in the two study populations. This would hold even if the Inline graphic distributions in the two populations differed.

2.3. Relationship between full model and reduced model parameters

The distributions Inline graphic and Inline graphic are implicitly connected, because Inline graphic is obtained by integrating out Inline graphic from Inline graphic, which depends on both Inline graphic and Inline graphic. Consider generalized linear models for both Inline graphic and Inline graphic, with regression coefficients Inline graphic and Inline graphic, respectively. The question is then how the value of Inline graphic that provides the best approximation to Inline graphic relates to Inline graphic, assuming that the model for Inline graphic is correct. Neuhaus & Jewell (1993) considered this problem and, using Taylor series expansions and assuming that Inline graphic and Inline graphic are small, obtained the result that Inline graphic when Inline graphic is independent of Inline graphic, where Inline graphic is a constant.

An alternative approach, which we will use here, is to consider the solution to the large sample limit of the score equation for the reduced model, and this will provide a link between Inline graphic and Inline graphic through a link between Inline graphic and Inline graphic as in the following proposition.

Proposition 1.

Suppose that the generalized linear model


[display equation: image not recovered]

is correctly specified for Inline graphic. Consider another generalized linear model for Inline graphic,


[display equation: image not recovered]

with possibly different link functions. Here the reduced model omitting the covariates Inline graphic is mis-specified in general. The Inline graphic are all centred and the Inline graphic satisfy Inline graphic for all Inline graphic and Inline graphic. Let Inline graphic denote the large sample limit of the maximum likelihood estimate of Inline graphic, which is the value that minimizes the Kullback–Leibler divergence between the reduced generalized linear model and the true distribution Inline graphic. When Inline graphic and the true values Inline graphic and Inline graphic are close to zero, Inline graphic is approximately equal to Inline graphic up to a constant factor, i.e., Inline graphic for some constant Inline graphic.

This result concerning ratios of parameters can be regarded as a generalization of the Neuhaus & Jewell (1993) result from the Inline graphic-independent-of-Inline graphic situation to the Inline graphic-orthogonal-to-Inline graphic situation. The result is for two generalized linear models, one of which has omitted covariates, for the same population, and is an approximation based on a Taylor series expansion. In this paper this result is exclusively applied to the internal study population, as in Step 2 in § 2.1, in order to connect the full model parameter Inline graphic to the reduced model parameter Inline graphic. This is also discussed in § 2.2 as the first connection that Step 2 makes. To keep the generality of the result, the presentation of Proposition 1 is for a general population instead.

Here we summarize the main steps in the proof, with greater detail given in the Appendix. The estimate Inline graphic solves the score equation that has the form Inline graphic. Thus, the probability limit of Inline graphic is Inline graphic that solves Inline graphic. Now assuming that the elements of Inline graphic, Inline graphic and Inline graphic are small, we approximate Inline graphic using a Taylor series expansion about Inline graphic, Inline graphic and Inline graphic. Then after some algebra and using the fact that every Inline graphic is orthogonal to every Inline graphic, we arrive at an expression of the form Inline graphic, where Inline graphic. Then we have the desired result as long as Inline graphic is invertible.
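The approximate proportionality in Proposition 1 can be illustrated numerically. The sketch below is an assumption-laden example and not part of the paper's simulations: it uses a logistic link, a single omitted covariate generated independently of the retained covariates (the special case considered by Neuhaus & Jewell, which implies orthogonality in expectation), small made-up coefficient values, and a large sample so that the fitted values are close to their limits.

```python
import numpy as np


def fit_logistic(Z, y, n_iter=50, tol=1e-10):
    """Newton-Raphson logistic regression; Z must already contain an intercept column."""
    coef = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ coef)))
        grad = Z.T @ (y - p)
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z
        step = np.linalg.solve(hess, grad)
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    return coef


rng = np.random.default_rng(2)
n = 200_000
X = rng.normal(size=(n, 2))
B_tilde = rng.normal(size=n)              # generated independently of X
gamma = np.array([0.3, -0.2])             # illustrative full-model slopes for X
theta = 0.5                               # illustrative slope for the omitted covariate
lin = -0.5 + X @ gamma + theta * B_tilde
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-lin))).astype(float)

ones = np.ones(n)
full = fit_logistic(np.column_stack([ones, X, B_tilde]), y)
reduced = fit_logistic(np.column_stack([ones, X]), y)

# Elementwise ratio of reduced-model to full-model slopes for X.
ratios = reduced[1:] / full[1:3]
print(ratios)
```

The two entries of `ratios` should be close to each other, which is the sense in which the reduced-model slopes are proportional to the full-model slopes, even though neither ratio needs to equal 1.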

3. Asymptotic distribution of the proposed estimator

Proposition 2.

Under the typical regularity conditions for the asymptotic normality of Inline graphic-estimators (e.g., van der Vaart, 1998), for the proposed estimator Inline graphic from Step 3 in § 2.1, Inline graphic converges in distribution to Inline graphic as Inline graphic. Here Inline graphic is the probability limit of Inline graphic, Inline graphic, Inline graphic,


[display equation: image not recovered]

Inline graphic for Inline graphic, Inline graphic is the probability limit of Inline graphic and


[display equation: image not recovered]

The proof of this result is given in the Appendix. It does not assume the ratios for regression coefficients between the internal study reduced model and the external study reduced model to be the same, or Inline graphic for some constant Inline graphic. Instead, this result shows the asymptotic distribution of Inline graphic produced by Step 3 in § 2.1 for any arbitrary fixed value for Inline graphic.

Under the assumptions that Inline graphic for some constant Inline graphic and that the external study sample size was large so that the uncertainty associated with the information it provides is negligible, Proposition 1 ensures that Inline graphic is close to the true Inline graphic, and thus Proposition 2 can be used for inference about Inline graphic. Here we point out that, since the result in Proposition 1 is an approximation based on a Taylor series expansion, Inline graphic is not the exact true value Inline graphic and the difference may be difficult to quantify in general. However, this approximation is good when Inline graphic and Inline graphic are not very large, and as we show in the next section, the numerical results using Proposition 2 for inference are good. In summary, our proposed method works and Proposition 2 can be used to make inference under the assumptions that (i) the internal study model (1) is correctly specified, (ii) the values of Inline graphic and Inline graphic are close to zero, (iii) Inline graphic for some constant Inline graphic, (iv) the external study sample size was large so that the uncertainty associated with Inline graphic is negligible, and (v) the typical regularity conditions hold for the asymptotic normality of the maximum likelihood estimator for generalized linear models.

4. Simulation studies

Simulation studies are implemented to evaluate the performance of the proposed estimator in various settings. We generate binary Inline graphic from logistic regression models with covariates either Inline graphic or Inline graphic. To assess the robustness properties of the procedure, we consider data generating assumptions that violate transportability between populations in specific ways. The properties of the proposed procedure are compared with those of the maximum likelihood estimator with internal study data alone and the constrained maximum likelihood method.

Two different simulation settings are examined. In the first setting, labelled Inline graphic, we generated external and internal data from the models

graphic file with name Equation11.gif

We use a logistic link function for both Inline graphic and Inline graphic. The external and internal study sample sizes are Inline graphic and Inline graphic. External summary information is obtained by fitting a Inline graphic-omitted misspecified model in the external dataset. For this setting, we always assume that Inline graphic, and also that both external and internal Inline graphic are generated from a Gaussian distribution, with variance 1 and covariances Inline graphic. A continuous variable Inline graphic is generated from Inline graphic and a binary variable Inline graphic is generated from Inline graphic. For each simulation scenario, a total of Inline graphic replications (Inline graphic) are generated. Three different factors were varied, as shown in Table 1: the magnitude of Inline graphic, whether Inline graphic is the same as Inline graphic, and whether the intercepts Inline graphic and Inline graphic are the same.

Table 1.

Simulation scenarios for the first setting. Small Inline graphic refers to Inline graphic and large Inline graphic refers to Inline graphic. Values of Inline graphic in the Inline graphic distribution are Inline graphic and Inline graphic

Inline graphic Small Inline graphic Inline graphic Inline graphic
Inline graphic Small Inline graphic Inline graphic Inline graphic
Inline graphic Small Inline graphic Inline graphic Inline graphic
Inline graphic Small Inline graphic Inline graphic Inline graphic
Inline graphic Large Inline graphic Inline graphic Inline graphic
Inline graphic Large Inline graphic Inline graphic Inline graphic
Inline graphic Large Inline graphic Inline graphic Inline graphic
Inline graphic Large Inline graphic Inline graphic Inline graphic

The performance of parameter estimates is quantified by the metrics

graphic file with name Equation12.gif
graphic file with name Equation13.gif

the Monte Carlo standard deviation, ESD, where

graphic file with name Equation14.gif

and the coverage rate of 95% confidence intervals. The Brier score is included to evaluate predictive performance on a test dataset, where

graphic file with name Equation15.gif

The test data have sample size Inline graphic, and are generated using the internal study model.
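The displayed definitions of these metrics are not reproduced here, so the sketch below uses the standard Monte Carlo definitions as an assumption: bias as the mean estimate minus the truth, mean squared error as the mean squared deviation from the truth, ESD as the sample standard deviation of the estimates across replications, coverage as the fraction of Wald intervals containing the truth, and the Brier score as the mean squared difference between the binary outcome and the predicted probability. The function names, the `ddof=1` choice and the 1.96 multiplier are illustrative, and the scaling applied in the tables is not included.

```python
import numpy as np


def mc_summary(estimates, ses, truth, z=1.96):
    """Monte Carlo summary for one scalar parameter across replications."""
    estimates = np.asarray(estimates, dtype=float)
    ses = np.asarray(ses, dtype=float)
    bias = estimates.mean() - truth
    mse = np.mean((estimates - truth) ** 2)
    esd = estimates.std(ddof=1)                      # Monte Carlo standard deviation
    covered = np.abs(estimates - truth) <= z * ses   # Wald interval contains the truth
    return {"bias": bias, "mse": mse, "esd": esd, "coverage": covered.mean()}


def brier_score(y, p_hat):
    """Mean squared difference between binary outcomes and predicted probabilities."""
    y = np.asarray(y, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return float(np.mean((y - p_hat) ** 2))
```

For example, `mc_summary([1.0, 2.0, 3.0], [0.1, 1.0, 0.1], truth=2.0)` gives zero bias, and only the middle interval covers the truth.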

From the results in Table 2, for the parameter estimates, the proposed estimator and constrained maximum likelihood have small bias, even under Inline graphic. The proposed method accurately estimates the true intercept Inline graphic when Inline graphic and Inline graphic are not the same. In contrast, the constrained maximum likelihood method has large bias for the intercept in such cases. The justification for the proposed method was based on an assumption that the values of Inline graphic and Inline graphic were small. However, the results in Table 1 within the Supplementary Material show that the proposed method still has good performance when those values are not small. Results for the variability of Inline graphic show that the proposed method is more efficient than the maximum likelihood estimator based on internal study data alone, but not as efficient as the constrained maximum likelihood method. For Inline graphic, all the methods have similar efficiency. The coverage rates of the confidence intervals are close to the nominal 95% level, demonstrating that the asymptotic formula provides a good approximation of the variance for this sample size of 400. The robustness of the proposed estimator is particularly advantageous over the constrained maximum likelihood method when prediction is our main interest. We see that the Brier score of the constrained maximum likelihood method is substantially larger than that of the proposed estimator when Inline graphic.

Table 2.

Simulation results for Inline graphicInline graphic small Inline graphic scenarios. Sample size Inline graphic. Number of replications is Inline graphic. Inline graphic, Inline graphic and Inline graphic are multiplied by Inline graphic

Bias MSE ESD Coverage
Inline graphic MLE Prop CML MLE Prop CML MLE Prop CML Prop (Inline graphic)
Inline graphic 0.48 0.43 1.42 3.36 3.32 1.97 18.3 18.2 13.9 95
Inline graphic 0.06 0.24 1.36 3.55 2.24 1.42 18.8 14.9 11.8 94
Inline graphic 0.22 Inline graphic Inline graphic 3.41 2.28 1.48 18.4 15.0 12.1 94
Inline graphic 0.09 1.06 1.22 2.30 0.93 0.72 15.1 9.5 8.4 94
Inline graphic 1.76 1.51 1.75 1.75 1.72 1.75 13.1 13.0 13.1 94
Inline graphic Inline graphic Inline graphic Inline graphic 7.68 7.60 7.67 27.7 27.5 27.7 95
Brier(Inline graphic) 0.723 0.719 0.716  
Inline graphic  
Inline graphic 0.48 0.40 1.86 3.36 3.32 1.99 18.3 18.2 13.9 95
Inline graphic 0.06 Inline graphic Inline graphic 3.55 2.23 1.47 18.8 14.8 11.8 94
Inline graphic 0.22 1.09 Inline graphic 3.41 2.31 1.48 18.4 15.1 12.1 94
Inline graphic 0.09 0.21 Inline graphic 2.30 0.91 0.71 15.1 9.5 8.4 93
Inline graphic 1.76 1.51 1.75 1.75 1.71 1.75 13.1 13.0 13.1 94
Inline graphic Inline graphic Inline graphic Inline graphic 7.68 7.60 7.67 27.7 27.5 27.6 95
Brier(Inline graphic) 0.723 0.719 0.716  
Inline graphic  
Inline graphic 0.48 0.41 50.29 3.36 3.32 27.25 18.3 18.2 13.9 95
Inline graphic 0.06 Inline graphic Inline graphic 3.55 2.23 1.41 18.8 14.9 11.9 94
Inline graphic 0.22 Inline graphic Inline graphic 3.41 2.28 1.50 18.4 15.0 12.2 94
Inline graphic 0.09 0.57 0.20 2.30 0.91 0.70 15.1 9.5 8.4 94
Inline graphic 1.76 1.51 1.77 1.75 1.72 1.75 13.1 13.0 13.1 94
Inline graphic Inline graphic Inline graphic Inline graphic 7.68 7.60 7.68 27.7 27.5 27.7 95
Brier(Inline graphic) 0.723 0.719 0.746  
Inline graphic  
Inline graphic 0.48 0.29 50.28 3.36 3.33 27.23 18.3 18.2 13.9 95
Inline graphic 0.06 4.86 4.02 3.55 2.55 1.57 18.8 15.2 11.9 94
Inline graphic 0.22 Inline graphic Inline graphic 3.41 2.36 1.67 18.4 14.9 12.2 93
Inline graphic 0.09 Inline graphic Inline graphic 2.30 0.96 0.80 15.1 9.4 8.3 93
Inline graphic 1.76 1.48 1.77 1.75 1.71 1.75 13.1 13.0 13.1 94
Inline graphic Inline graphic Inline graphic Inline graphic 7.68 7.63 7.68 27.7 27.6 27.7 95
Brier(Inline graphic) 0.723 0.718 0.746  

MLE, maximum likelihood estimation; Prop, proposed method; CML, constrained maximum likelihood; MSE, mean squared error; ESD, Monte Carlo standard deviation.

The scenarios in Table 3 for the second simulation study, labelled Inline graphic, are designed to illustrate that the proposed method works when the ratios of the Inline graphic coefficients in the reduced models are the same in the two populations. We generate internal data from the full model Inline graphic, Inline graphic, Inline graphic, where Inline graphic, the Inline graphic have variances Inline graphic and covariances Inline graphic. The reduced model for the internal study is given by Inline graphic, where the Inline graphic value is approximated by fitting this model to a very large dataset with sample size Inline graphic. Then the external study data are generated from the external reduced model Inline graphic, where Inline graphic for some constants Inline graphic, after obtaining Inline graphic from a very large dataset. So, whenever Inline graphic, the ratios between the external and internal reduced parameters are retained. Then Inline graphic is selected to make the Inline graphic proportions in the internal and external datasets similar when Inline graphic. The external sample has size Inline graphic and the internal sample has size Inline graphic.

Table 3.

Simulation scenarios for the second setting. Here Inline graphic are scaling factors such that Inline graphic

Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic

The results in Table 4 show that the proposed method has low bias and is more efficient than the maximum likelihood method when Inline graphic, even when they are not equal to 1, and gives a lower Brier score. The coverage rates of the confidence intervals are close to the nominal level. The constrained maximum likelihood method is more efficient than the proposed method when Inline graphic, but gives biased estimates otherwise, even when Inline graphic. When Inline graphic, the proposed method does have some bias, as shown in Table 2 in the Supplementary Material.

Table 4.

Simulation results for scenarios Inline graphicInline graphic. Inline graphic, Inline graphic and Inline graphic are multiplied by Inline graphic. Sample size Inline graphic. Number of replications is Inline graphic. Let Inline graphic

Inline graphic Bias MSE ESD Coverage
  MLE Prop CML MLE Prop CML MLE Prop CML Prop (Inline graphic)
Inline graphic 0.93 0.49 -29.43 2.45 2.42 8.85 15.6 15.5 4.3 95
Inline graphic 1.86 1.56 -57.02 3.20 2.09 32.89 17.7 14.3 6.1 95
Inline graphic 2.10 1.34 -68.53 4.91 3.76 48.49 22.0 19.3 12.3 95
Inline graphic 1.04 0.84 0.03 2.30 2.28 2.23 15.1 15.1 14.9 95
Brier(Inline graphic) 0.577 0.575 0.622  
Inline graphic  
Inline graphic 0.93 0.49 -29.45 2.45 2.42 8.93 15.6 15.5 5.1 95
Inline graphic 1.86 1.68 -57.02 3.20 2.09 32.88 17.7 14.3 6.1 95
Inline graphic 2.10 1.24 -68.65 4.91 3.78 48.64 22.0 19.4 12.2 95
Inline graphic 1.04 0.84 0.03 2.30 2.28 2.23 15.1 15.1 14.9 95
Brier(Inline graphic) 0.577 0.575 0.622  
Inline graphic  
Inline graphic 0.93 0.49 4.47 2.45 2.42 0.46 15.6 15.5 5.1 95
Inline graphic 1.86 1.64 0.43 3.20 2.01 0.46 17.7 14.0 6.8 95
Inline graphic 2.10 1.36 -0.07 4.91 3.65 1.53 22.0 19.0 12.4 95
Inline graphic 1.04 0.85 1.02 2.30 2.28 2.30 15.1 15.1 15.1 95
Brier(Inline graphic) 0.577 0.575 0.573  
Inline graphic  
Inline graphic 0.93 0.49 4.47 2.45 2.42 0.46 15.6 15.5 6.0 95
Inline graphic 1.86 1.62 0.42 3.20 2.02 0.46 17.7 14.1 6.8 95
Inline graphic 2.10 1.37 -0.06 4.91 3.68 1.55 22.0 19.1 12.4 95
Inline graphic 1.04 0.85 1.02 2.30 2.28 2.30 15.1 15.0 15.1 95
Brier(Inline graphic) 0.577 0.575 0.573  
Inline graphic  
Inline graphic 0.93 0.50 22.18 2.45 2.42 5.26 15.6 15.5 5.8 95
Inline graphic 1.86 1.62 54.32 3.20 1.97 30.08 17.7 13.9 7.6 95
Inline graphic 2.10 1.43 64.72 4.91 3.66 43.57 22.0 19.0 12.9 96
Inline graphic 1.04 0.85 0.81 2.30 2.28 2.29 15.1 15.0 15.1 95
Brier(Inline graphic) 0.577 0.575 0.584  
Inline graphic  
Inline graphic 0.93 0.51 22.02 2.45 2.42 5.35 15.6 15.5 7.1 95
Inline graphic 1.86 1.68 54.45 3.20 1.99 30.26 17.7 14.0 7.8 96
Inline graphic 2.10 1.38 64.72 4.91 3.64 43.54 22.0 19.0 12.8 95
Inline graphic 1.04 0.85 0.81 2.30 2.28 2.29 15.1 15.0 15.1 95
Brier(Inline graphic) 0.577 0.575 0.584  

MLE, maximum likelihood estimation; Prop, proposed method; CML, constrained maximum likelihood; MSE, mean squared error; ESD, Monte Carlo standard deviation.
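
As a reading aid for Table 4, the three reported summaries are linked by the identity that mean squared error equals squared bias plus variance. The sketch below assumes that the reported Bias and ESD are the raw values multiplied by 100 and that the reported MSE is the raw value multiplied by 100. This scaling is inferred from the numbers in the table, not taken from the caption, whose multipliers are not recoverable here. The function name is introduced only for this illustration.

```python
def mse_from_bias_and_esd(bias_x100, esd_x100):
    """Return 100 * MSE given 100 * bias and 100 * ESD (assumed scaling)."""
    bias = bias_x100 / 100.0
    esd = esd_x100 / 100.0
    # MSE = bias^2 + variance, with variance estimated by ESD^2.
    return 100.0 * (bias ** 2 + esd ** 2)


# First CML row of Table 4: Bias -29.43, ESD 4.3, reported MSE 8.85.
print(mse_from_bias_and_esd(-29.43, 4.3))

# First MLE row of Table 4: Bias 0.93, ESD 15.6, reported MSE 2.45.
print(mse_from_bias_and_esd(0.93, 15.6))
```

Under this assumed scaling, the computed values agree with the reported MSE entries up to rounding of the inputs. For the constrained maximum likelihood rows with large bias, nearly all of the MSE comes from the squared bias term.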

5. Real data analysis

To evaluate the proposed method on real data, we construct a model to predict the risk of high-grade prostate cancer using internal data and external summary information from a published study. The original risk calculator was developed from the Prostate Cancer Prevention Trial (Thompson et al., 2006). This calculator is a logistic regression model with five clinical variables: prostate specific antigen (PSA) level, digital rectal examination (DRE) findings, age, race (African American or not) and prior biopsy results. The original calculator is given as Inline graphic, where Inline graphic is the probability of having high-grade prostate cancer. We aim to construct an expanded risk calculator by considering an additional binary biomarker (T2:ERG) that measures TMPRSS2:ERG gene fusions. Although this biomarker is not widely used, it has been observed to have predictive power for high-grade prostate cancer (Truong et al., 2013; Tomlins et al., 2015).

The dataset we use is from Tomlins et al. (2015), and comprises a training set and an additional testing set with Inline graphic and Inline graphic subjects, respectively. Details of the study can be found in Cheng et al. (2019). The direct maximum likelihood estimator, the constrained maximum likelihood estimator and the proposed method are each applied to the training data to obtain a risk calculator. Each calculator is then applied to the testing dataset, where its predictive performance is quantified by the Brier score.
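
The evaluation step can be sketched as follows, assuming the standard definition of the Brier score as the mean squared difference between predicted probabilities and observed binary outcomes. The values reported in Table 5 may be on a different scale, which the text here does not specify. The function names, coefficients and data below are made up for illustration and are not the prostate cancer data or the fitted calculators.

```python
import numpy as np


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))


def predict_risk(coef, covariates):
    """Predicted probabilities from a logistic risk calculator.

    coef[0] is the intercept; coef[1:] multiply the covariate columns.
    """
    coef = np.asarray(coef, dtype=float)
    linear = coef[0] + np.asarray(covariates, dtype=float) @ coef[1:]
    return 1.0 / (1.0 + np.exp(-linear))


# Toy testing set with made-up numbers.
coef_hat = np.array([-1.0, 0.8, 0.5])  # hypothetical calculator fitted on training data
x_test = np.array([[0.2, 1.0], [1.5, 0.0], [-0.3, 1.0], [2.0, 1.0]])
y_test = np.array([0, 1, 0, 1])

print(brier_score(y_test, predict_risk(coef_hat, x_test)))
```

Because the coefficients are estimated on the training set and the score is computed on the testing set, a lower score reflects better out-of-sample prediction and not a better in-sample fit.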

In Table 5, we see that the proposed method gains substantial efficiency compared to the direct maximum likelihood estimator, which does not utilize external summary information. The Brier score of the proposed method is lower than those of both direct maximum likelihood and constrained maximum likelihood.

Table 5.

Parameter estimates for the prostate cancer study. The reduced model estimates are obtained from the published literature for the external model. The full model estimates and standard errors are obtained from the internal data using either the maximum likelihood estimate, or the constrained maximum likelihood estimate or the proposed method. The standard errors of the proposed method are in parentheses, all obtained from the asymptotic variance formula

Model PSA Age DRE Prior biopsy Race T2:ERG Brier
Reduced model  
External 1.29 0.031 1.00 Inline graphic 0.96 0.933
Full model  
Direct maximum likelihood 0.98 0.032 1.02 Inline graphic 0.57 0.76 0.930
  (0.18) (0.012) (0.26) (0.27) (0.29) (0.20)  
Constrained maximum likelihood 1.14 0.032 1.06 Inline graphic 0.80 0.72 0.931
  (0.07) (0.004) (0.14) (0.11) (0.17) (0.20)  
Proposed 0.97 0.02 0.69 Inline graphic 0.87 0.73 0.919
  (0.13) (0.003) (0.10) (0.05) (0.10) (0.19)

PSA, prostate specific antigen; DRE, digital rectal examination.

6. Discussion

The gain in efficiency of the proposed method by integrating external information is not as large as for some other existing methods. However, most of these other methods require the joint distribution of all the variables to be the same in the two study populations, and can be biased when the two distributions are not equal. Our proposed method is more robust, and only requires a selected aspect of the joint distribution to be transportable between populations.

The mathematical justification for our proposed method is based on an assumption that the regression coefficients are small. However, in simulation studies we found that it had good performance unless the coefficients were very large. Another assumption made in this paper is that the external study has a large sample size so that the uncertainty associated with the external information is negligible. When this is not the case, the asymptotic variance given in Proposition 2 needs to account for such uncertainty to make valid inference, and inference without accounting for this uncertainty leads to underestimation of the standard error of Inline graphic. The external study uncertainty may be available in the form of standard errors for the parameter estimates Inline graphic, and may be accounted for in a similar fashion to Han & Lawless (2019) and Kundu et al. (2019) in the Taylor series expansion. Alternatively, it would also be feasible to consider an adaptation of the proposed method in which we require the ratio of the Inline graphic estimates from the internal study to be similar to, but not necessarily identical to, the ratio of the values of Inline graphic. Such an adaptation would be needed if there were multiple external studies, each with their own set of potentially overlapping variables and regression coefficients. The requirement of having exact equality between the ratio of regression coefficients for every external study is unlikely to be satisfied. A method that extends the shrinkage approach in Estes et al. (2018) could also be developed, in which more use is made of the external information if it is more consistent with the internal data.

In the high-dimensional regression setting, which occurs when the number of the newly discovered covariates Inline graphic is large, modifications of the proposed method that include regularization or variable selection may be needed. One modification would be to first fit the full model of interest to the internal data with some regularization to select a subset of Inline graphic, and then use the selected Inline graphic to carry out the proposed procedure in § 2.1. Another modification would be to add regularization to the regression coefficients of Inline graphic when fitting the model in Step 2 in § 2.1, after the orthogonalization of Inline graphic on Inline graphic. These two modifications may select different subsets of Inline graphic, and future investigations are needed to assess their performance. A relevant recent publication on data integration (Sheng et al., 2022) used a penalized empirical likelihood approach for high-dimensional Inline graphic.

The constrained maximum likelihood method (Chatterjee et al., 2016) assumes the joint distribution to be the same between the internal and external study populations. It may be possible to extend this method to the setting we considered, by formulating the proportionality of Inline graphic and Inline graphic into a set of constraints on Inline graphic. This formulation is not straightforward since the connection between Inline graphic and Inline graphic is not explicit. This is a topic worth future investigation.

Supplementary Material

asac022_Supplementary_Data

Acknowledgement

This research was supported by the US National Institutes of Health (CA129102). The authors thank the editor, an associate editor and a referee for their helpful comments.

Appendix

Summary of notation

Inline graphic Variables in the reduced model for Inline graphic.
Inline graphic Additional variables in the full model for Inline graphic.
Inline graphic Orthogonalization of Inline graphic, so Inline graphic for some Inline graphic Inline graphic.
Inline graphic Parameters in the Inline graphic model, which are assumed to be correct for the internal study.
Inline graphic Parameters in the Inline graphic model.
Inline graphic Parameters in the Inline graphic model, a reparameterization of Inline graphic.
Inline graphic Parameters in the Inline graphic model used in the proposed method.

Proof of Proposition 1

For Proposition 1, the maximum likelihood estimate of Inline graphic is

graphic file with name Equation16.gif

with probability limit

graphic file with name Equation17.gif

where the expectation Inline graphic is with respect to the true joint distribution of Inline graphic. Therefore, Inline graphic solves Inline graphic with

graphic file with name Equation18.gif

where the expectation Inline graphic is with respect to the distribution of Inline graphic and the calculation used the fact that Inline graphic, and for Inline graphic,

graphic file with name Equation19.gif

Here we let Inline graphic. Since Inline graphic cannot be solved analytically, we solve an approximation Inline graphic with

graphic file with name Equation20.gif (A1)

Here Inline graphic, and Inline graphic and Inline graphic, Inline graphic, are the first-order Taylor series expansions of Inline graphic and Inline graphic around Inline graphic. In approximation (A1) we used Inline graphic for all Inline graphic. Then, using both Inline graphic and Inline graphic, we are able to solve the equations defined by equating the second through last rows in (A1) to zero, to obtain Inline graphic, where Inline graphic, Inline graphic and

graphic file with name Equation21.gif

Then the desired result follows as long as Inline graphic is invertible regardless of the Inline graphic value.

Proof of Proposition 2

For Proposition 2, it is easy to see that Inline graphic satisfies Inline graphic. A Taylor series expansion leads to

graphic file with name Equation22.gif

Let Inline graphic denote the dependence of Inline graphic on Inline graphic implied by Step 3 in § 2.1. Then applying the central limit theorem and the delta method gives the desired result, with Inline graphic.
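
For readers less familiar with the last step, the generic form of the delta method is recalled here; see, for example, Van der Vaart (1998). The symbols $\hat\theta$, $\theta_0$, $\Sigma$ and $h$ are placeholders introduced for this note and are not the paper's notation, which is not recoverable from the extracted text. If $h$ is differentiable at $\theta_0$ with Jacobian matrix $\dot h(\theta_0)$, then

```latex
\sqrt{n}\,(\hat\theta - \theta_0) \xrightarrow{d} N(0, \Sigma)
\quad\Longrightarrow\quad
\sqrt{n}\,\{h(\hat\theta) - h(\theta_0)\} \xrightarrow{d}
N\!\left(0,\; \dot h(\theta_0)\, \Sigma\, \dot h(\theta_0)^{\mathrm T}\right).
```

In the proof above, the role of $h$ is played by the map implied by Step 3 in § 2.1, and the first limit is supplied by the central limit theorem applied to the Taylor series expansion.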

Contributor Information

Jeremy M G Taylor, Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109, U.S.A.

Kyuseong Choi, Department of Statistics and Data Science, Cornell University, 1198 Comstock Hall, 129 Garden Ave., Ithaca, New York 14853, U.S.A.

Peisong Han, Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109, U.S.A.

Supplementary material

The Supplementary Material includes extensions of the simulation studies, including the scenarios Inline graphic of Table 1 and Inline graphic of Table 3.

References

  1. Carroll, R. J., Ruppert, D., Stefanski, L. A. & Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models: a Modern Perspective, 2nd ed. New York: CRC Press.
  2. Chatterjee, N., Chen, Y. H., Maas, P. & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. J. Am. Statist. Assoc. 111, 107–17.
  3. Cheng, W., Taylor, J. M. G., Gu, T., Tomlins, S. A. & Mukherjee, B. (2019). Informing a risk prediction model for binary outcomes with external coefficient information. Appl. Statist. 68, 121.
  4. Cheng, W., Taylor, J. M. G., Vokonas, P. S., Park, S. K. & Mukherjee, B. (2018). Improving estimation and prediction in linear regression incorporating external information from an established reduced model. Statist. Med. 37, 1515–30.
  5. Estes, J. P., Mukherjee, B. & Taylor, J. M. G. (2018). Empirical Bayes estimation and prediction using summary-level information from external big data sources adjusting for violations of transportability. Statist. Biosci. 10, 568–86.
  6. Gail, M. H., Wieand, S. & Piantadosi, S. (1984). Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika 71, 431–44.
  7. Gu, T., Taylor, J. M. G., Cheng, W. & Mukherjee, B. (2019). Synthetic data method to incorporate external information into a current study. Can. J. Statist. 47, 580–603.
  8. Han, P. & Lawless, J. F. (2019). Empirical likelihood estimation using auxiliary summary information with different covariate distributions. Statist. Sinica 29, 1321–42.
  9. Han, P., Taylor, J. M. G. & Mukherjee, B. (2023). Integrating information from existing risk prediction models with no model details. Can. J. Statist. To appear.
  10. Kundu, P., Tang, R. & Chatterjee, N. (2019). Generalized meta-analysis for multiple regression models across studies with disparate covariate information. Biometrika 106, 567–85.
  11. Li, K. C. & Duan, N. (1989). Regression analysis under link violation. Ann. Statist. 17, 1009–52.
  12. Monahan, J. F. & Stefanski, L. A. (1992). Normal scale mixture approximations to Inline graphic and computation of the logistic-normal integral. In Handbook of the Logistic Distribution, Balakrishnan, N. ed. New York: Marcel Dekker, pp. 529–40.
  13. Neuhaus, J. M. & Jewell, N. P. (1993). A geometric approach to assess bias due to omitted covariates in generalized linear models. Biometrika 80, 807–15.
  14. Penrose, R. (2004). The Road to Reality. A Complete Guide to the Laws of the Universe. London: Vintage Books.
  15. Qin, J. (2000). Combining parametric and empirical likelihoods. Biometrika 87, 484–90.
  16. Rahmandad, H., Jalali, M. S. & Paynabar, K. (2017). A flexible method for aggregation of prior statistical findings. PLoS One 12, e0175111.
  17. Sheng, Y., Sun, Y., Huang, C. Y. & Kim, M. O. (2022). Synthesizing external aggregated information in the presence of population heterogeneity: A penalized empirical likelihood approach. Biometrics 78, 679–90.
  18. Solomon, P. J. (1984). Effect of misspecification of regression models in the analysis of survival data. Biometrika 71, 291–8.
  19. Struthers, C. A. & Kalbfleisch, J. D. (1986). Misspecified proportional hazard models. Biometrika 73, 363–9.
  20. Taylor, J. M. G. (1989). A note on the cost of estimating the ratio of regression parameters after fitting a power transformation. J. Statist. Plan. Infer. 21, 223–30.
  21. Taylor, J. M. G. (1990). Properties of maximum likelihood estimates of the ratio of parameters in ordinal response regression models. Commun. Statist. B 19, 469–80.
  22. Thompson, I. M., Ankerst, D. P., Chi, C., Goodman, P. J., Tangen, C. M., Lucia, M. S., Feng, Z., Parnes, H. L. & Coltman, C. A., Jr. (2006). Assessing prostate cancer risk: results from the Prostate Cancer Prevention Trial. J. Nat. Cancer Inst. 98, 529–34.
  23. Tomlins, S. A., Day, J. R., Lonigro, R. J., Hovelson, D. H., Siddiqui, J., Kunju, L. P., Dunn, R. L., Meyer, S., Hodge, P., Groskopf, J. et al. (2015). Urine TMPRSS2:ERG plus PCA3 for individualized prostate cancer risk assessment. Eur. Urol. 70, 45–53.
  24. Truong, M., Yang, B. & Jarrard, D. F. (2013). Toward the detection of prostate cancer in urine: a critical analysis. J. Urol. 189, 422–9.
  25. Van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge: Cambridge University Press.
  26. Zhai, Y. & Han, P. (2022). Data integration with oracle use of external information from heterogeneous populations. J. Comp. Graph. Statist. 31, 1001–12.


Articles from Biometrika are provided here courtesy of Oxford University Press
