Summary
We consider estimating the parameters of a generalized linear prediction model from an internal dataset, where the outcome variable is binary and there are two sets of covariates: a conventional set and a newly measured set. An external study provides parameter estimates for a generalized linear model of the outcome on the conventional covariates only. We propose a method that makes limited assumptions about the similarity of the distributions in the two study populations. The method involves orthogonalizing the new covariates with respect to the conventional ones and then borrowing information about the ratios of the coefficients from the external model. The method is justified by a new result relating the parameters in a generalized linear model to the parameters in a generalized linear model with omitted covariates. The method is applicable if the regression coefficients in the model for the outcome given the conventional covariates are similar in the two populations, up to an unknown scalar constant. This type of transportability between populations can be checked from the available data. The asymptotic variance of the proposed estimator is derived. In a simulation study the method gains efficiency compared with analysis of the internal dataset alone, and is more robust than an alternative method of incorporating external information.
Keywords: Data integration, Omitted variable regression, Ratio of parameters, Transportability
1. Introduction
We consider developing a parametric prediction model for a binary outcome variable, given sets of covariates
and
. Here
and
represent vectors of
and
variables, which may be thought of as the conventional covariates and the newly discovered covariates, respectively. Individual data
, for a simple random sample of size
from a population of interest, referred to as the internal study population hereafter, are available. The joint law of
in the internal population is denoted by
, and we refer to this as the joint distribution. In addition, an external study analysed data from a different population, with the joint distribution denoted by
, by fitting a generalized linear model of the form
, and produced the maximum likelihood estimate
for
. Here
is a link function and the expectation
is taken under the external study population distribution
. The raw data used to derive
are not available. We assume that the external study sample size was large so that the uncertainty associated with
is negligible, and no components of
are zero. Our goal is to fit a model of the form
![]() |
(1) |
that includes both the conventional covariates and the newly discovered covariates. Here the link function is allowed to differ from that of the external model, and the expectation
is taken under the internal study population distribution
. We assume that model (1) is correctly specified corresponding to
. In fitting this model we want to make use of
provided by the external study. While it would be simple to just analyse the internal data to obtain an estimate of
, incorporating
is likely to lead to more efficient estimates due to the large external sample. This could be important if
is not large.
While this scenario may be common in many areas of science, we are particularly thinking of a medical setting where the outcome is a future or unknown, and important, binary event, the conventional covariates are commonly available variables in that setting, such as age and sex, and the new covariates are newer or less commonly measured variables, such as recently discovered biomarkers. The interest is to include the new covariates in a new prediction model in the hope that it will have better performance.
There is existing literature on this topic of using external summary information to aid in the estimation of a full model from an internal dataset. Cheng et al. (2018, 2019) proposed Bayesian approaches that rely on a direct link between the parameters in the two models. Other methods (Chatterjee et al., 2016) use a constrained semiparametric maximum likelihood approach by converting the external summary-level information into a constraint and then maximizing the internal data likelihood subject to this constraint. However, the constrained maximum likelihood method requires the probability distribution of the outcome and covariates to be the same in the two populations. This constrained maximum likelihood method is closely related to an empirical likelihood approach (Qin, 2000; Han & Lawless, 2019). See also Han et al. (2023). In a different approach, Gu et al. (2019) used the external model to create synthetic data, which was combined with the internal data and then analysed. Estes et al. (2018) developed a more robust empirical Bayes estimator that essentially downweights the external information if it is not compatible with the internal data. Zhai & Han (2022) developed a penalization method to simultaneously select and integrate the external information that is compatible with the internal data. Rahmandad et al. (2017) and Kundu et al. (2019) developed methods within a meta-analysis framework.
Whether the external information is useful for the estimation of
depends on whether the joint distribution of
, or some aspect of the joint distribution, is similar between the external and internal study populations. Much of the literature referenced above makes an assumption of transportability between
and
. The full transportability corresponds to
or, equivalently, all of
,
and
. In practice, partial transportability, in which only aspects of the distributions are shared between the populations, may be more realistic. An example of partial transportability would be
, but
, or
, but
and
. In general, we might expect marginal distributions to differ between populations, especially the
distribution since different study populations typically have varying demographics. In practice, some information is usually available about
, such as the marginal mean and standard deviation of each variable, which could be used to investigate how different
is from
. Conditional distributions might have more similarity, which is related to the idea that causal dynamics may be expected to be stable (e.g., Penrose, 2004). For example, for the
distribution, assuming that
and
, a plausibly realistic assumption on transportability is that the two link functions are the same, and that
for
, whereas
. The intercept-only difference between the regression coefficients reflects the belief that the covariate effects are similar between the two study populations, but the disease prevalence may differ. The case of
for some constant
is another example, reflecting the belief that the relative covariate effects are similar between the two study populations, but not the absolute magnitudes.
For the distribution, a strong assumption would be that
is independent of
in both study populations. Another choice might be not to specify a model for
, but rather to assume that the
distribution is the same in the two study populations. A further option would be to specify a parametric model, such as a generalized linear model, and then restrict the regression coefficients to be related in certain ways in the two study populations.
It is reasonable to suppose that the newly discovered covariates exist in the external population, but that they were either not measured or not included in the external model. However, we will not typically have any available information on the distribution of
in the external population, and thus we cannot check whether the distributions for
and
are transportable between the internal and external populations. We can however check from the internal study data whether the distribution of
is similar between the two populations. We will propose a method that does not require
, but does require some aspect of
and
to be the same.
Our method builds on the extensive literature on omitted covariates in generalized linear models (Gail et al., 1984; Neuhaus & Jewell, 1993). Consider a generalized linear model with a scalar
and a reduced model
in the same population. The reduced model is usually misspecified, or not compatible with the full model if
is nonlinear because of the noncollapsibility of most generalized linear models. Neuhaus & Jewell (1993) showed that, when
is independent of
, the value of
that best approximates the distribution of
is related to
through
for some constant
. So even though
is typically not equal to
, the ratios of the coefficients of the
in the two models are very similar. This result is closely related to the measurement error literature for the relationship between the parameters of a regression model for when the covariates do and do not have measurement errors (Monahan & Stefanski, 1992; Carroll et al., 2006). When
is not independent of
, the relationship between
and
is not so simple, depending on the distribution
in a complex way.
The proposed method also builds on the literature concerning the robustness of ratios of regression parameter estimates under model misspecification. The general result is that the ratio of regression coefficients can be estimated well even if the link function is misspecified (Solomon, 1984; Struthers & Kalbfleisch, 1986; Li & Duan, 1989; Taylor, 1989, 1990), and this has been studied for continuous, ordinal and censored survival outcomes. In other words, one can get good estimates of the regression coefficients up to an unknown scaling factor when the link function is misspecified. The intuitive reason for the stability of the ratio of parameter estimates is that it represents the relative importance of one variable to another, in the sense that
is the amount by which
needs to be changed for a unit change in
to give the same expected value of
, and this would be the same irrespective of the link function.
Based on the aforementioned literature, we propose a data integration method that makes use of the ratios of the external study model coefficient estimates. The method works when the external reduced model for the distribution of
leads to similar ratios of regression coefficient estimates when applied to
and to
. This similarity of the ratios can be quantitatively checked since we have both individual data from the internal study and the reduced model parameter estimates from the external study. Such a check on the ratios provides some assurance on applying the proposed method in practice, unlike many existing methods where assumptions on the distribution transportability cannot be explicitly checked. In many practical scenarios it is plausible that the ratios are transportable between study populations based on the interpretation that they represent the relative importance of the
on
, even though we do not have full transportability between populations.
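The check on the ratios described above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it fits the reduced logistic model to internal data and divides the fitted slopes by external slope estimates. The names `fit_logistic`, `ratio_check` and `beta_ext`, and all numerical values, are hypothetical and chosen only for illustration; a roughly constant vector of ratios is what the assumption requires.

```python
import numpy as np

def fit_logistic(Z, y, n_iter=50, tol=1e-10):
    """Newton-Raphson logistic regression; Z must already include an intercept column."""
    coef = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ coef)))
        grad = Z.T @ (y - p)
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess, grad)
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    return coef

def ratio_check(X, y, beta_ext):
    """Return internal reduced-model slopes divided by external slopes (hypothetical helper)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta_int = fit_logistic(Z, y)[1:]   # drop the intercept, which is allowed to differ
    return beta_int / beta_ext

# Toy internal data (assumed values, for illustration only).
rng = np.random.default_rng(1)
n = 20000
X = rng.normal(size=(n, 3))
true_slopes = np.array([0.8, -0.5, 0.4])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.3 + X @ true_slopes))))

# Hypothetical external slopes, proportional to the internal ones by a factor 1.5.
beta_ext = 1.5 * true_slopes
ratios = ratio_check(X, y, beta_ext)
print(ratios)  # entries should be roughly equal to one another
```

In practice the spread of the ratios would be judged against their sampling variability, which this sketch does not compute.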
2. Proposed method for integrating external information
2.1. Proposed method
The implementation of the proposed method for estimating the parameters of model (1) by incorporating the external information in
is as follows, where all models are fitted to the internal study data.
Step 1. Centre all the covariates so that each
has mean zero, and for each
, fit a linear regression on
,
, and then calculate the residuals
, where
is the least squares estimate of
.
Step 2. Fit the model
![]() |
(2) |
to obtain the parameter estimates. Here the scalar parameter in model (2) represents the common ratio of the regression coefficients of the conventional covariates between the internal and external studies.
Step 3. Estimate the parameters of model (1) as
, where
![]() |
A summary of the different variables, models and parameters is provided in the Appendix.
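The three steps can be sketched in code as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: we assume a logistic link, and we assume that model (2) has an intercept, a single scalar multiplying the external linear combination of the centred conventional covariates, and free coefficients for the residualized new covariates. The names `X`, `B`, `y`, `beta_ext`, `fit_logistic` and `integrate_external`, and all numerical values, are hypothetical.

```python
import numpy as np

def fit_logistic(Z, y, n_iter=50, tol=1e-10):
    """Newton-Raphson logistic regression; Z must already include an intercept column."""
    coef = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ coef)))
        grad = Z.T @ (y - p)
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess, grad)
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    return coef

def integrate_external(X, B, y, beta_ext):
    # Step 1: centre all covariates, regress each new covariate on the
    # conventional ones by least squares, and keep the residuals.
    Xc = X - X.mean(axis=0)
    Bc = B - B.mean(axis=0)
    alpha_hat, *_ = np.linalg.lstsq(Xc, Bc, rcond=None)   # shape (p, q)
    B_tilde = Bc - Xc @ alpha_hat

    # Step 2 (assumed form of model (2)): one scalar `lam` multiplies the
    # external linear combination; the residuals get free coefficients.
    Z = np.column_stack([np.ones(len(y)), Xc @ beta_ext, B_tilde])
    coef = fit_logistic(Z, y)
    intercept, lam, theta = coef[0], coef[1], coef[2:]

    # Step 3: undo the reparameterization B_tilde = Bc - Xc @ alpha_hat.
    gamma_X = lam * beta_ext - alpha_hat @ theta
    gamma_B = theta
    return intercept, gamma_X, gamma_B, lam

# Toy data (assumed values): one new covariate correlated with the first conventional one.
rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=(n, 2))
B = (0.5 * X[:, 0] + rng.normal(size=n))[:, None]
eta = -0.2 + X @ np.array([0.6, -0.4]) + 0.5 * B[:, 0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

# Hypothetical external slopes, proportional to the reduced-model direction (0.85, -0.4).
beta_ext = 1.3 * np.array([0.85, -0.4])
intercept, gamma_X, gamma_B, lam = integrate_external(X, B, y, beta_ext)
print(gamma_X, gamma_B, lam)
```

In this toy example the external slopes were constructed to satisfy the ratio assumption, so the recovered coefficients should lie close to the generating values; the sketch does not compute standard errors.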
2.2. Justification for the proposed method
Step 1 above orthogonalizes each new covariate to the conventional covariates. The ordinary least squares method leads to
and
for all
and
. Therefore, the orthogonalization creates new variables
that are uncorrelated with the
variables in the sample, which to some degree approximates the
variables being independent of all the
variables. This allows us to appeal to the property that the ratios of parameters in a reduced model with omitted covariates are similar to the ratios in a full model when the omitted covariates are independent of the
. Although orthogonality is a weaker condition than independence, under some conditions specified later, the property about the ratios of parameters is retained.
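The sample orthogonality produced by Step 1 is easy to verify numerically. The following minimal sketch, with hypothetical variable names and simulated data, confirms that least squares residuals of the centred new covariates have zero cross-products with the centred conventional covariates, even when the original covariates are correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
B = 0.6 * X[:, [0]] + rng.normal(size=(n, 2))   # new covariates correlated with X

# Centre, regress each column of B on X, and take residuals.
Xc = X - X.mean(axis=0)
Bc = B - B.mean(axis=0)
alpha_hat, *_ = np.linalg.lstsq(Xc, Bc, rcond=None)
B_tilde = Bc - Xc @ alpha_hat

# Cross-products between every conventional covariate and every residual.
cross = Xc.T @ B_tilde
print(np.max(np.abs(cross)))   # zero up to floating-point error
```

Zero sample cross-products mean zero sample correlation only; they do not by themselves imply independence.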
Step 3 above makes the trivial connection between the parameters in the model for and those in the model for
as
![]() |
Step 2 is the crucial step that makes use of the ratios of parameter estimates from the external study, with the scalar in model (2) acting as the corresponding scaling parameter. This step is based on making two connections between parameters in different models.
The first connection that Step 2 makes is between the parameters in the model for
and the parameters
in the model for
within the internal study population, where the ratios of the coefficient estimates for
are retained. The result, which will be explained in more detail in the next section, says that
for some constant
. Based on this result, if we already have an estimate
of
, then the estimate of
from the
model will be close to
for some scalar
. Thus we can fit a model of the form (2) with fewer parameters using
.
The second connection that Step 2 makes is between and
. For the purpose of gaining efficiency for internal model fitting, it is desired to make use of
. The assumption we make is that the ratios of the parameters are the same in the sense that
for all
, or in other words
for some constant
. This assumption is weaker than assuming equality of the parameters between the two populations, and it also allows the intercepts to be different. The interpretation of the ratios as the relative importance of the
on
suggests that they are likely to be transportable from one population to another in many scenarios. An important aspect of this assumption is that it can be investigated from data, because we are provided with
and
can be obtained from the internal study data. From the perspective of incorporating external study information into internal model fitting, Step 2 has a similar spirit to the constrained maximum likelihood method (Chatterjee et al., 2016), since it fits a generalized linear model under parameter constraints provided by the external information.
To gain more insight into the assumption of equal ratios, let us consider the case where the full model for the external population has the same form as that in (1); that is,
![]() |
Then the required assumption is essentially
based on Proposition 1 in the next subsection, which then becomes
![]() |
due to being a reparameterization of
after
was replaced by
in the regression model.
This is not an assumption that can be checked from data; however, it might provide some insight into when the method is applicable. For example, the expression does not directly include the intercepts in the various models, so if ,
, and
and
have different intercepts in the generalized linear model, then the assumption is satisfied and the method will work. Another scenario is if the
variables are only weakly associated with the
variables then the
parameters will be small, so all that is required is that the
have a similar ratio in the two study populations for the method to provide a good approximation. This would hold even if the
distributions in the two populations differed.
2.3. Relationship between full model and reduced model parameters
The distributions and
are implicitly connected, because
is obtained by integrating out
from
, which depends on both
and
. Consider generalized linear models for both
and
, with regression coefficients
and
, respectively; then a question is how does the value of
that provides the best approximation to
relate to
, assuming that the model for
is correct. Neuhaus & Jewell (1993) considered this problem and obtained their result that
when
is independent of
using Taylor series expansions, assuming that
and
are small, where
is a constant.
An alternative approach, which we will use here, is to consider the solution to the large sample limit of the score equation for the reduced model, and this will provide a link between and
through a link between
and
as in the following proposition.
Proposition 1.
Suppose that the generalized linear model
is correctly specified for
. Consider another generalized linear model for
,
with possibly different link functions. Here the reduced model omitting the covariates
is misspecified in general. The
are all centred and the
satisfy
for all
and
. Let
denote the large sample limit of the maximum likelihood estimate of
, which is the value that minimizes the Kullback–Leibler divergence between the reduced generalized linear model and the true distribution
. When
and the true values
and
are close to zero,
is approximately equal to
up to a constant factor, i.e.,
for some constant
.
This result concerning ratios of parameters can be regarded as a generalization of the Neuhaus & Jewell (1993) result from the -independent-of-
situation to the
-orthogonal-to-
situation. The result is for two generalized linear models, one of which has omitted covariates, for the same population, and is an approximation based on a Taylor series expansion. In this paper this result is exclusively applied to the internal study population, as in Step 2 in § 2.1, in order to connect the full model parameter
to the reduced model parameter
. This is also discussed in § 2.2 as the first connection that Step 2 makes. To keep the generality of the result, the presentation of Proposition 1 is for a general population instead.
Here we summarize the main steps in the proof, with greater detail given in the Appendix. The estimate solves the score equation that has the form
. Thus, the probability limit of
is
that solves
. Now assuming that the elements of
,
and
are small, we approximate
using a Taylor series expansion about
,
and
. Then after some algebra and using the fact that every
is orthogonal to every
, we arrive at an expression of the form
, where
. Then we have the desired result as long as
is invertible.
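The proportionality in Proposition 1 can be illustrated by simulation. The sketch below is our own illustration with assumed coefficient values, not part of the proof: it generates data from a full logistic model, omits a covariate that is independent of the retained ones, fits the reduced model to a large sample to approximate its large sample limit, and compares reduced-model slopes with full-model slopes. All names and values are hypothetical.

```python
import numpy as np

def fit_logistic(Z, y, n_iter=50, tol=1e-10):
    """Newton-Raphson logistic regression; Z must already include an intercept column."""
    coef = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ coef)))
        grad = Z.T @ (y - p)
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess, grad)
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    return coef

rng = np.random.default_rng(2)
n = 200_000                       # large sample to approximate the limit
X = rng.normal(size=(n, 2))
B = rng.normal(size=n)            # independent of X, so orthogonal in the limit
gamma_X = np.array([0.5, -0.3])   # assumed full-model slopes
gamma_B = 0.8
eta = 0.2 + X @ gamma_X + gamma_B * B
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

# Reduced model omits B; its slopes should be a common multiple of gamma_X.
reduced = fit_logistic(np.column_stack([np.ones(n), X]), y)
ratio = reduced[1:] / gamma_X
print(ratio)   # the two entries should be close to each other and below 1
```

The common factor is below one here because omitting a covariate attenuates the remaining logistic coefficients; the sketch checks only the equality of the two ratios, not the size of the factor.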
3. Asymptotic distribution of the proposed estimator
Proposition 2.
Under the typical regularity conditions for the asymptotic normality of
-estimators (e.g., van der Vaart, 1998), for the proposed estimator
from Step 3 in § 2.1,
converges in distribution to
as
. Here
is the probability limit of
,
,
,
for
,
is the probability limit of
and
The proof of this result is given in the Appendix. It does not assume the ratios for regression coefficients between the internal study reduced model and the external study reduced model to be the same, or for some constant
. Instead, this result shows the asymptotic distribution of
produced by Step 3 in § 2.1 for any arbitrary fixed value for
.
Under the assumptions that for some constant
and that the external study sample size was large so that the uncertainty associated with the information it provides is negligible, Proposition 1 ensures that
is close to the true
, and thus Proposition 2 can be used for inference about
. Here we point out that, since the result in Proposition 1 is an approximation based on a Taylor series expansion,
is not the exact true value
and the difference may be difficult to quantify in general. However, this approximation is good when
and
are not very large, and as we show in the next section, the numerical results using Proposition 2 for inference are good. In summary, the proposed method is valid and Proposition 2 can be used for inference under the assumptions that (i) the internal study model (1) is correctly specified, (ii) the values of
and
are close to zero, (iii)
for some constant
, (iv) the external study sample size was large so that the uncertainty associated with
is negligible, and (v) the typical regularity conditions hold for the asymptotic normality of the maximum likelihood estimator for generalized linear models.
4. Simulation studies
Simulation studies are implemented to evaluate the performance of the proposed estimator in various settings. We generate binary outcomes from logistic regression models with covariates either
or
. To assess the robustness properties of the procedure, we consider data generating assumptions that violate transportability between populations in specific ways. The properties of the proposed procedure are compared with those of the maximum likelihood estimator with internal study data alone and the constrained maximum likelihood method.
Two different simulation settings are examined. In the first setting we generate external and internal data from the models
![]() |
We use a logistic link function for both and
. The external and internal study sample sizes are
and
. External summary information is obtained by fitting a
-omitted misspecified model in the external dataset. For this setting, we always assume that
, and also that both external and internal
are generated from a Gaussian distribution, with variance 1 and covariances
. A continuous variable
is generated from
and a binary variable
is generated from
. For each simulation scenario, a total of
replications (
) are generated. Three different factors were varied, as shown in Table 1: the magnitude of
, whether
is the same as
, and whether the intercepts
and
are the same.
Table 1.
Simulation scenarios for the first setting. Small refers to
and large
refers to
. Values of
in the
distribution are
and
| Scenario | Coefficient magnitude | Covariate distributions | Intercepts |
|---|---|---|---|
| | Small | | |
| | Small | | |
| | Small | | |
| | Small | | |
| | Large | | |
| | Large | | |
| | Large | | |
| | Large | | |
The performance of parameter estimates is quantified by the metrics
![]() |
![]() |
the Monte Carlo standard deviation, ESD, where
![]() |
and the coverage rate of 95% confidence intervals. The Brier score is included to evaluate predictive performance on a test dataset, where
![]() |
The test data have sample size , and are generated using the internal study model.
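For concreteness, the performance metrics can be computed as in the following sketch. It uses the standard definitions of bias, mean squared error, Monte Carlo standard deviation, Wald interval coverage and Brier score; any scaling applied in the tables is not reproduced here, and the function names and toy inputs are hypothetical.

```python
import numpy as np

def summarize(estimates, std_errors, truth):
    """Monte Carlo summaries for one parameter across replications."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    lower, upper = est - 1.96 * se, est + 1.96 * se
    return {
        "bias": est.mean() - truth,
        "mse": np.mean((est - truth) ** 2),
        "esd": est.std(ddof=1),                        # Monte Carlo standard deviation
        "coverage": np.mean((lower <= truth) & (truth <= upper)),
    }

def brier_score(y, p_hat):
    """Mean squared difference between binary outcomes and predicted probabilities."""
    y = np.asarray(y, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return np.mean((y - p_hat) ** 2)

# Toy usage with four replications of a single parameter whose true value is 1.
res = summarize([0.9, 1.1, 1.0, 1.2], [0.1, 0.1, 0.1, 0.1], truth=1.0)
bs = brier_score([1, 0, 1], [0.8, 0.3, 0.6])
print(res, bs)
```

The coverage here relies on normal-approximation intervals with the multiplier 1.96, which is an assumption of this sketch.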
From the results in Table 2, for the parameter estimates, the proposed estimator and constrained maximum likelihood have small bias, even under . The proposed method accurately estimates the true intercept
, when
and
are not the same. In contrast, the constrained maximum likelihood method has large bias for the intercept in such cases. The justification for the proposed method was based on an assumption that the values of
and
were small. However, the results in Table 1 within the Supplementary Material show that the proposed method still has good performance when those values are not small. Results for the variability of
show that the proposed method is more efficient than the maximum likelihood estimator based on internal study data alone, but not as efficient as the constrained maximum likelihood method. For
, all the methods have similar efficiency. The coverage rates of the confidence intervals are close to the nominal
95% level, demonstrating that the asymptotic formula is providing a good approximation of the variance for this sample size of 400. The robustness of the proposed estimator is particularly advantageous over the constrained maximum likelihood method when prediction is our main interest. We see that the Brier score of the constrained maximum likelihood method is substantially larger than that of the proposed estimator when
.
Table 2.
Simulation results for –
small
scenarios. Sample size
. Number of replications is
.
,
and
are multiplied by
| | Bias (MLE) | Bias (Prop) | Bias (CML) | MSE (MLE) | MSE (Prop) | MSE (CML) | ESD (MLE) | ESD (Prop) | ESD (CML) | Coverage, Prop (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| | 0.48 | 0.43 | 1.42 | 3.36 | 3.32 | 1.97 | 18.3 | 18.2 | 13.9 | 95 |
| | 0.06 | 0.24 | 1.36 | 3.55 | 2.24 | 1.42 | 18.8 | 14.9 | 11.8 | 94 |
| | 0.22 | | | 3.41 | 2.28 | 1.48 | 18.4 | 15.0 | 12.1 | 94 |
| | 0.09 | 1.06 | 1.22 | 2.30 | 0.93 | 0.72 | 15.1 | 9.5 | 8.4 | 94 |
| | 1.76 | 1.51 | 1.75 | 1.75 | 1.72 | 1.75 | 13.1 | 13.0 | 13.1 | 94 |
| | | | | 7.68 | 7.60 | 7.67 | 27.7 | 27.5 | 27.7 | 95 |
| Brier | 0.723 | 0.719 | 0.716 | | | | | | | |
| | 0.48 | 0.40 | 1.86 | 3.36 | 3.32 | 1.99 | 18.3 | 18.2 | 13.9 | 95 |
| | 0.06 | | | 3.55 | 2.23 | 1.47 | 18.8 | 14.8 | 11.8 | 94 |
| | 0.22 | 1.09 | | 3.41 | 2.31 | 1.48 | 18.4 | 15.1 | 12.1 | 94 |
| | 0.09 | 0.21 | | 2.30 | 0.91 | 0.71 | 15.1 | 9.5 | 8.4 | 93 |
| | 1.76 | 1.51 | 1.75 | 1.75 | 1.71 | 1.75 | 13.1 | 13.0 | 13.1 | 94 |
| | | | | 7.68 | 7.60 | 7.67 | 27.7 | 27.5 | 27.6 | 95 |
| Brier | 0.723 | 0.719 | 0.716 | | | | | | | |
| | 0.48 | 0.41 | 50.29 | 3.36 | 3.32 | 27.25 | 18.3 | 18.2 | 13.9 | 95 |
| | 0.06 | | | 3.55 | 2.23 | 1.41 | 18.8 | 14.9 | 11.9 | 94 |
| | 0.22 | | | 3.41 | 2.28 | 1.50 | 18.4 | 15.0 | 12.2 | 94 |
| | 0.09 | 0.57 | 0.20 | 2.30 | 0.91 | 0.70 | 15.1 | 9.5 | 8.4 | 94 |
| | 1.76 | 1.51 | 1.77 | 1.75 | 1.72 | 1.75 | 13.1 | 13.0 | 13.1 | 94 |
| | | | | 7.68 | 7.60 | 7.68 | 27.7 | 27.5 | 27.7 | 95 |
| Brier | 0.723 | 0.719 | 0.746 | | | | | | | |
| | 0.48 | 0.29 | 50.28 | 3.36 | 3.33 | 27.23 | 18.3 | 18.2 | 13.9 | 95 |
| | 0.06 | 4.86 | 4.02 | 3.55 | 2.55 | 1.57 | 18.8 | 15.2 | 11.9 | 94 |
| | 0.22 | | | 3.41 | 2.36 | 1.67 | 18.4 | 14.9 | 12.2 | 93 |
| | 0.09 | | | 2.30 | 0.96 | 0.80 | 15.1 | 9.4 | 8.3 | 93 |
| | 1.76 | 1.48 | 1.77 | 1.75 | 1.71 | 1.75 | 13.1 | 13.0 | 13.1 | 94 |
| | | | | 7.68 | 7.63 | 7.68 | 27.7 | 27.6 | 27.7 | 95 |
| Brier | 0.723 | 0.718 | 0.746 | | | | | | | |
MLE, maximum likelihood estimation; Prop, proposed method; CML, constrained maximum likelihood; MSE, mean squared error; ESD, Monte Carlo standard deviation.
The scenarios in Table 3 for the second simulation study are designed to illustrate that the proposed method works when the ratios of the
coefficients in the reduced models are the same in the two populations. We generate internal data from the full model
,
,
, where
, the
have variances
and covariances
. The reduced model for the internal study is given by
, where the
value is approximated by fitting this model to a very large dataset with sample size
. Then the external study data are generated from the external reduced model
, where
for some constants
, after obtaining
from a very large dataset. So, whenever
, the ratios between the external and internal reduced parameters are retained. Then
is selected to make the
proportions in the internal and external datasets similar when
. The external sample has size
and the internal sample has size
.
Table 3.
Simulation scenarios for the second setting. Here are scaling factors such that
The results in Table 4 show that the proposed method has low bias and is more efficient than the maximum likelihood method when , even when they are not equal to 1, and gives a lower Brier score. The coverage rates of the confidence intervals are close to the nominal level. The constrained maximum likelihood method is more efficient than the proposed method when
, but gives biased estimates otherwise, even when
. When
, the proposed method does have some bias, as shown in Table 2 in the Supplementary Material.
Table 4.
Simulation results for scenarios –
.
,
and
are multiplied by
. Sample size
. Number of replications is
. Let
![]() |
| | Bias (MLE) | Bias (Prop) | Bias (CML) | MSE (MLE) | MSE (Prop) | MSE (CML) | ESD (MLE) | ESD (Prop) | ESD (CML) | Coverage, Prop (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| | 0.93 | 0.49 | -29.43 | 2.45 | 2.42 | 8.85 | 15.6 | 15.5 | 4.3 | 95 |
| | 1.86 | 1.56 | -57.02 | 3.20 | 2.09 | 32.89 | 17.7 | 14.3 | 6.1 | 95 |
| | 2.10 | 1.34 | -68.53 | 4.91 | 3.76 | 48.49 | 22.0 | 19.3 | 12.3 | 95 |
| | 1.04 | 0.84 | 0.03 | 2.30 | 2.28 | 2.23 | 15.1 | 15.1 | 14.9 | 95 |
| Brier | 0.577 | 0.575 | 0.622 | | | | | | | |
| | 0.93 | 0.49 | -29.45 | 2.45 | 2.42 | 8.93 | 15.6 | 15.5 | 5.1 | 95 |
| | 1.86 | 1.68 | -57.02 | 3.20 | 2.09 | 32.88 | 17.7 | 14.3 | 6.1 | 95 |
| | 2.10 | 1.24 | -68.65 | 4.91 | 3.78 | 48.64 | 22.0 | 19.4 | 12.2 | 95 |
| | 1.04 | 0.84 | 0.03 | 2.30 | 2.28 | 2.23 | 15.1 | 15.1 | 14.9 | 95 |
| Brier | 0.577 | 0.575 | 0.622 | | | | | | | |
| | 0.93 | 0.49 | 4.47 | 2.45 | 2.42 | 0.46 | 15.6 | 15.5 | 5.1 | 95 |
| | 1.86 | 1.64 | 0.43 | 3.20 | 2.01 | 0.46 | 17.7 | 14.0 | 6.8 | 95 |
| | 2.10 | 1.36 | -0.07 | 4.91 | 3.65 | 1.53 | 22.0 | 19.0 | 12.4 | 95 |
| | 1.04 | 0.85 | 1.02 | 2.30 | 2.28 | 2.30 | 15.1 | 15.1 | 15.1 | 95 |
| Brier | 0.577 | 0.575 | 0.573 | | | | | | | |
| | 0.93 | 0.49 | 4.47 | 2.45 | 2.42 | 0.46 | 15.6 | 15.5 | 6.0 | 95 |
| | 1.86 | 1.62 | 0.42 | 3.20 | 2.02 | 0.46 | 17.7 | 14.1 | 6.8 | 95 |
| | 2.10 | 1.37 | -0.06 | 4.91 | 3.68 | 1.55 | 22.0 | 19.1 | 12.4 | 95 |
| | 1.04 | 0.85 | 1.02 | 2.30 | 2.28 | 2.30 | 15.1 | 15.0 | 15.1 | 95 |
| Brier | 0.577 | 0.575 | 0.573 | | | | | | | |
| | 0.93 | 0.50 | 22.18 | 2.45 | 2.42 | 5.26 | 15.6 | 15.5 | 5.8 | 95 |
| | 1.86 | 1.62 | 54.32 | 3.20 | 1.97 | 30.08 | 17.7 | 13.9 | 7.6 | 95 |
| | 2.10 | 1.43 | 64.72 | 4.91 | 3.66 | 43.57 | 22.0 | 19.0 | 12.9 | 96 |
| | 1.04 | 0.85 | 0.81 | 2.30 | 2.28 | 2.29 | 15.1 | 15.0 | 15.1 | 95 |
| Brier | 0.577 | 0.575 | 0.584 | | | | | | | |
| | 0.93 | 0.51 | 22.02 | 2.45 | 2.42 | 5.35 | 15.6 | 15.5 | 7.1 | 95 |
| | 1.86 | 1.68 | 54.45 | 3.20 | 1.99 | 30.26 | 17.7 | 14.0 | 7.8 | 96 |
| | 2.10 | 1.38 | 64.72 | 4.91 | 3.64 | 43.54 | 22.0 | 19.0 | 12.8 | 95 |
| | 1.04 | 0.85 | 0.81 | 2.30 | 2.28 | 2.29 | 15.1 | 15.0 | 15.1 | 95 |
| Brier | 0.577 | 0.575 | 0.584 | | | | | | | |
MLE, maximum likelihood estimation; Prop, proposed method; CML, constrained maximum likelihood; MSE, mean squared error; ESD, Monte Carlo standard deviation.
5. Real data analysis
To evaluate the proposed method on real data, we construct a predictive model to predict risk of high-grade prostate cancer by using internal data and external summary information available from a published study. The original risk calculator was developed from the Prostate Cancer Prevention Trial (Thompson et al., 2006). This calculator is constructed using a logistic regression model including five clinical variables: prostate specific antigen (PSA) level, digital rectal examination (DRE) findings, age, race (African American or not) and prior biopsy results. The original calculator is given as , where
is the probability of having high-grade prostate cancer. We aim to construct an expanded risk calculator by considering an additional binary biomarker (T2:ERG) that measures TMPRSS2:ERG gene fusions. Although this biomarker is not widely used, it was observed to have predictive power for high-grade prostate cancer (Truong et al., 2013; Tomlins et al., 2015).
The dataset we use is from Tomlins et al. (2015), which comprises a training and an additional testing set with and
subjects, respectively. Details of the study can be found in Cheng et al. (2019). The direct maximum likelihood estimator, constrained maximum likelihood estimator and proposed method are implemented on the training data to obtain risk calculators, and each calculator is then applied to the testing dataset where the predictive performance is quantified by the Brier score, calculated from the testing data.
In Table 5, we see that the proposed method gains substantial efficiency compared to the direct maximum likelihood estimator, which does not utilize external summary information. The Brier score of the proposed method is lower than those of direct maximum likelihood and constrained maximum likelihood.
Table 5.
Parameter estimates for the prostate cancer study. The reduced model estimates are obtained from the published literature for the external model. The full model estimates and standard errors are obtained from the internal data using either the maximum likelihood estimate, or the constrained maximum likelihood estimate or the proposed method. The standard errors of the proposed method are in parentheses, all obtained from the asymptotic variance formula.

| Model | PSA | Age | DRE | Prior biopsy | Race | T2:ERG | Brier |
|---|---|---|---|---|---|---|---|
| Reduced model | | | | | | | |
| External | 1.29 | 0.031 | 1.00 | ![]() | 0.96 | — | 0.933 |
| Full model | | | | | | | |
| Direct maximum likelihood | 0.98 | 0.032 | 1.02 | ![]() | 0.57 | 0.76 | 0.930 |
| | (0.18) | (0.012) | (0.26) | (0.27) | (0.29) | (0.20) | |
| Constrained maximum likelihood | 1.14 | 0.032 | 1.06 | ![]() | 0.80 | 0.72 | 0.931 |
| | (0.07) | (0.004) | (0.14) | (0.11) | (0.17) | (0.20) | |
| Proposed | 0.97 | 0.02 | 0.69 | ![]() | 0.87 | 0.73 | 0.919 |
| | (0.13) | (0.003) | (0.10) | (0.05) | (0.10) | (0.19) | |

PSA, prostate specific antigen; DRE, digital rectal examination.
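One simple way to quantify the efficiency gain seen in Table 5 is the ratio of estimated variances, that is, the squared ratio of the reported standard errors. The sketch below uses the standard errors from the direct maximum likelihood and proposed rows of the table; the function name `relative_efficiency` is a hypothetical helper, and the comparison is only as reliable as the reported standard errors, which for the proposed method come from the asymptotic formula.

```python
covariates = ["PSA", "Age", "DRE", "Prior biopsy", "Race", "T2:ERG"]

# Standard errors as reported in Table 5.
se_direct = [0.18, 0.012, 0.26, 0.27, 0.29, 0.20]
se_proposed = [0.13, 0.003, 0.10, 0.05, 0.10, 0.19]


def relative_efficiency(se_reference, se_new):
    """Variance ratio (se_reference / se_new)^2 for each coefficient."""
    return [(ref / new) ** 2 for ref, new in zip(se_reference, se_new)]


rel_eff = dict(zip(covariates, relative_efficiency(se_direct, se_proposed)))
for name, value in rel_eff.items():
    print(f"{name}: {value:.2f}")
```

A value above 1 means the proposed method has the smaller estimated variance for that coefficient. The ratio for T2:ERG is close to 1, which is consistent with the external model carrying no direct information about the new biomarker.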
6. Discussion
The gain in efficiency of the proposed method by integrating external information is not as large as for some other existing methods. However, most of these other methods require the joint distribution of all the variables to be the same in the two study populations, and can be biased when the two distributions are not equal. Our proposed method is more robust, and only requires a selected aspect of the joint distribution to be transportable between populations.
The mathematical justification for our proposed method is based on an assumption that the regression coefficients are small. However, in simulation studies we found that it had good performance unless the coefficients were very large. Another assumption made in this paper is that the external study has a large sample size so that the uncertainty associated with the external information is negligible. When this is not the case, the asymptotic variance given in Proposition 2 needs to account for such uncertainty to make valid inference, and inference without accounting for this uncertainty leads to underestimation of the standard error of . The external study uncertainty may be available in the form of standard errors for the parameter estimates
, and may be accounted for in a similar fashion to Han & Lawless (2019) and Kundu et al. (2019) in the Taylor series expansion. Alternatively, it would also be feasible to consider an adaptation of the proposed method in which we require the ratio of the
estimates from the internal study to be similar to, but not necessarily identical to, the ratio of the values of
. Such an adaptation would be needed if there were multiple external studies, each with their own set of potentially overlapping variables and regression coefficients. The requirement of having exact equality between the ratio of regression coefficients for every external study is unlikely to be satisfied. A method that extends the shrinkage approach in Estes et al. (2018) could also be developed, in which more use is made of the external information if it is more consistent with the internal data.
In the high-dimensional regression setting, which occurs when the number of newly discovered covariates is large, modifications of the proposed method that include regularization or variable selection may be needed. One modification would be to first fit the full model of interest to the internal data with some regularization to select a subset of the newly discovered covariates, and then use the selected covariates to carry out the proposed procedure in § 2.1. Another modification would be to add a regularization penalty on the regression coefficients of the newly discovered covariates when fitting the model as in Step 2 in § 2.1, after orthogonalizing them on the conventional covariates. These two modifications may select different subsets of the newly discovered covariates in the end, and future investigations are needed to assess their performance. A relevant recent publication on data integration (Sheng et al., 2022) used a penalized empirical likelihood approach for high dimensional
.
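The orthogonalization referred to in this paragraph can be sketched in a simplified form. The example below assumes that orthogonalizing a new covariate on a single conventional covariate means taking least-squares residuals from a regression with an intercept; the precise definition is the one in § 2.1 and may differ. The names `orthogonalize`, `x`, `b` and `b_perp` and the toy values are hypothetical.

```python
def orthogonalize(b, x):
    """Residuals of b after least-squares regression on an intercept and x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_b = sum(b) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxb = sum((xi - mean_x) * (bi - mean_b) for xi, bi in zip(x, b))
    slope = sxb / sxx
    # Subtract the fitted values so the residuals are uncorrelated with x.
    return [bi - mean_b - slope * (xi - mean_x) for xi, bi in zip(x, b)]


# Toy data: one conventional covariate x and one binary new covariate b.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [0.0, 1.0, 0.0, 1.0, 1.0]

b_perp = orthogonalize(b, x)
print(sum(b_perp), sum(xi * ri for xi, ri in zip(x, b_perp)))
```

Both printed sums are zero up to rounding, which is the sense in which the residual covariate is orthogonal to the intercept and to `x`. A regularized fit would then penalize only the coefficients attached to the residual covariates.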
The constrained maximum likelihood method (Chatterjee et al., 2016) assumes the joint distribution to be the same between the internal and external study populations. It may be possible to extend this method to the setting we considered, by formulating the proportionality of and
into a set of constraints on
. This formulation is not straightforward since the connection between
and
is not explicit. This is a topic worth future investigation.
Acknowledgement
This research was supported by the US National Institutes of Health (CA129102). The authors thank the editor, an associate editor and a referee for their helpful comments.
Appendix
Summary of notation
| Symbol | Meaning |
|---|---|
| ![]() | Variables in the reduced model for ![]() |
| ![]() | Additional variables in the full model for ![]() |
| ![]() | Orthogonalization of ![]() ![]() ![]() |
| ![]() | |
| ![]() | Parameters in the ![]() nal study. |
| ![]() | Parameters in the ![]() |
| ![]() | Parameters in the ![]() ![]() |
| ![]() | Parameters in the ![]() |
Proof of Proposition 1
For Proposition 1, the maximum likelihood estimate of is
![]() |
with probability limit
![]() |
where the expectation is with respect to the true joint distribution of
. Therefore,
solves
with
![]() |
where the expectation is with respect to the distribution of
and the calculation used the fact that
, and for
,
![]() |
Here we let . Since
cannot be solved analytically, we solve an approximation
with
![]() |
(A1) |
Here , and
and
,
, are the first-order Taylor series expansions of
and
around
. In approximation (A1) we used
for all
. Then, using both
and
, we are able to solve the equations defined by equating the second to last rows in (A1) to zero, and get
, where
,
and
![]() |
Then the desired result follows as long as is invertible regardless of the
value.
Proof of Proposition 2
For Proposition 2, it is easy to see that satisfies
. A Taylor series expansion leads to
![]() |
Let denote the dependence of
on
implied by Step 3 in § 2.1. Then applying the central limit theorem and the delta method gives the desired result, with
.
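The final step of this argument relies on the delta method. As a generic illustration only, and not the specific map implied by Step 3 in § 2.1, the sketch below applies the delta method to the ratio of two parameter estimates, the kind of quantity the proposed method borrows from the external model. The function name `ratio_se` and the numerical inputs are hypothetical.

```python
import math


def ratio_se(est_num, est_den, var_num, var_den, cov=0.0):
    """Delta-method standard error of the ratio est_num / est_den."""
    ratio = est_num / est_den
    # Gradient of g(a, b) = a / b evaluated at the estimates.
    grad_num = 1.0 / est_den
    grad_den = -est_num / est_den ** 2
    # First-order variance: gradient' * covariance matrix * gradient.
    variance = (grad_num ** 2 * var_num
                + grad_den ** 2 * var_den
                + 2.0 * grad_num * grad_den * cov)
    return ratio, math.sqrt(variance)


# Hypothetical estimates with variances 0.04 and 0.16 and zero covariance.
ratio, se = ratio_se(2.0, 4.0, 0.04, 0.16)
print(f"ratio = {ratio:.3f}, delta-method SE = {se:.4f}")
```

The approximation is first order, so it is most trustworthy when the denominator estimate is well separated from zero relative to its standard error.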
Contributor Information
Jeremy M G Taylor, Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109, U.S.A.
Kyuseong Choi, Department of Statistics and Data Science, Cornell University, 1198 Comstock Hall, 129 Garden Ave., Ithaca, New York 14853, U.S.A.
Peisong Han, Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109, U.S.A.
Supplementary material
The Supplementary Material includes extensions of the simulation studies, including the scenarios of Table 1 and
of Table 3.
References
- Carroll, R. J., Ruppert, D., Stefanski, L. A. & Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models: a Modern Perspective, 2nd ed. New York: CRC Press.
- Chatterjee, N., Chen, Y. H., Maas, P. & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. J. Am. Statist. Assoc. 111, 107–17.
- Cheng, W., Taylor, J. M. G., Gu, T., Tomlins, S. A. & Mukherjee, B. (2019). Informing a risk prediction model for binary outcomes with external coefficient information. Appl. Statist. 68, 121.
- Cheng, W., Taylor, J. M. G., Vokonas, P. S., Park, S. K. & Mukherjee, B. (2018). Improving estimation and prediction in linear regression incorporating external information from an established reduced model. Statist. Med. 37, 1515–30.
- Estes, J. P., Mukherjee, B. & Taylor, J. M. G. (2018). Empirical Bayes estimation and prediction using summary-level information from external big data sources adjusting for violations of transportability. Statist. Biosci. 10, 568–86.
- Gail, M. H., Wieand, S. & Piantadosi, S. (1984). Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika 71, 431–44.
- Gu, T., Taylor, J. M. G., Cheng, W. & Mukherjee, B. (2019). Synthetic data method to incorporate external information into a current study. Can. J. Statist. 47, 580–603.
- Han, P. & Lawless, J. F. (2019). Empirical likelihood estimation using auxiliary summary information with different covariate distributions. Statist. Sinica 29, 1321–42.
- Han, P., Taylor, J. M. G. & Mukherjee, B. (2023). Integrating information from existing risk prediction models with no model details. Can. J. Statist. To appear.
- Kundu, P., Tang, R. & Chatterjee, N. (2019). Generalized meta-analysis for multiple regression models across studies with disparate covariate information. Biometrika 106, 567–85.
- Li, K. C. & Duan, N. (1989). Regression analysis under link violation. Ann. Statist. 17, 1009–52.
- Monahan, J. F. & Stefanski, L. A. (1992). Normal scale mixture approximations to F*(z) and computation of the logistic-normal integral. In Handbook of the Logistic Distribution, Balakrishnan, N. ed. New York: Marcel Dekker, pp. 529–40.
- Neuhaus, J. M. & Jewell, N. P. (1993). A geometric approach to assess bias due to omitted covariates in generalized linear models. Biometrika 80, 807–15.
- Penrose, R. (2004). The Road to Reality. A Complete Guide to the Laws of the Universe. London: Vintage Books.
- Qin, J. (2000). Combining parametric and empirical likelihoods. Biometrika 87, 484–90.
- Rahmandad, H., Jalali, M. S. & Paynabar, K. (2017). A flexible method for aggregation of prior statistical findings. PloS One 12, e0175111.
- Sheng, Y., Sun, Y., Huang, C. Y. & Kim, M. O. (2022). Synthesizing external aggregated information in the presence of population heterogeneity: A penalized empirical likelihood approach. Biometrics 78, 679–90.
- Solomon, P. J. (1984). Effect of misspecification of regression models in the analysis of survival data. Biometrika 71, 291–8.
- Struthers, C. A. & Kalbfleisch, J. D. (1986). Misspecified proportional hazard models. Biometrika 73, 363–9.
- Taylor, J. M. G. (1989). A note on the cost of estimating the ratio of regression parameters after fitting a power transformation. J. Statist. Plan. Infer. 21, 223–30.
- Taylor, J. M. G. (1990). Properties of maximum likelihood estimates of the ratio of parameters in ordinal response regression models. Commun. Statist. B 19, 469–80.
- Thompson, I. M., Ankerst, D. P., Chi, C., Goodman, P. J., Tangen, C. M., Lucia, M. S., Feng, Z., Parnes, H. L. & Coltman, C. A., Jr. (2006). Assessing prostate cancer risk: results from the Prostate Cancer Prevention Trial. J. Nat. Cancer Inst. 98, 529–34.
- Tomlins, S. A., Day, J. R., Lonigro, R. J., Hovelson, D. H., Siddiqui, J., Kunju, L. P., Dunn, R. L., Meyer, S., Hodge, P., Groskopf, J. et al. (2015). Urine TMPRSS2:ERG plus PCA3 for individualized prostate cancer risk assessment. Eur. Urol. 70, 45–53.
- Truong, M., Yang, B. & Jarrard, D. F. (2013). Toward the detection of prostate cancer in urine: a critical analysis. J. Urol. 189, 422–9.
- Van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge: Cambridge University Press.
- Zhai, Y. & Han, P. (2022). Data integration with oracle use of external information from heterogeneous populations. J. Comp. Graph. Statist. 31, 1001–12.