Data integration: exploiting ratios of parameter estimates from a reduced external model

Jeremy M G Taylor; Kyuseong Choi; Peisong Han

doi:10.1093/biomet/asac022

Biometrika. 2022 Apr 12;110(1):119–134. doi: 10.1093/biomet/asac022

PMCID: PMC9919493 PMID: 36798840

Summary

We consider the situation of estimating the parameters in a generalized linear prediction model, from an internal dataset, where the outcome variable Y is binary and there are two sets of covariates, X and B. We have information from an external study that provides parameter estimates for a generalized linear model of Y on X. We propose a method that makes limited assumptions about the similarity of the distributions in the two study populations. The method involves orthogonalizing the B variables and then borrowing information about the ratio of the coefficients from the external model. The method is justified based on a new result relating the parameters in a generalized linear model to the parameters in a generalized linear model with omitted covariates. The method is applicable if the regression coefficients in the Y given X model are similar in the two populations, up to an unknown scalar constant. This type of transportability between populations is something that can be checked from the available data. The asymptotic variance of the proposed method is derived. The method is evaluated in a simulation study and shown to gain efficiency compared to simple analysis of the internal dataset, and is robust compared to an alternative method of incorporating external information.

Keywords: Data integration, Omitted variable regression, Ratio of parameters, Transportability

1. Introduction

We consider developing a parametric prediction model for a binary outcome variable Y, given sets of covariates X and B. Here X and B represent vectors of variables, which may be thought of as the conventional covariates and the newly discovered covariates, respectively. Individual data on Y, X and B, for a simple random sample of size n from a population of interest, referred to as the internal study population hereafter, are available. The joint law of Y, X and B in the internal population is referred to as the internal joint distribution. In addition, an external study analysed data from a different population, with its own joint distribution, by fitting a generalized linear model for Y given X, and produced the maximum likelihood estimate of the regression coefficients of that model. The external model has its own link function, and its expectation is taken under the external study population distribution. The raw data used to derive the external estimate are not available. We assume that the external study sample size was large, so that the uncertainty associated with the external estimate is negligible, and that no components of the external coefficient vector are zero. Our goal is to fit a model of the form

(1)

that includes all the X variables and also the B variables. Here the link function is allowed to be different from that of the external model, and the expectation is taken under the internal study population distribution. We assume that model (1) is correctly specified for the internal population. In fitting this model we want to make use of the estimate provided by the external study. While it would be simple to just analyse the internal data to obtain an estimate of the parameters of model (1), incorporating the external estimate is likely to lead to more efficient estimates due to the large external sample. This could be important if the internal sample size is not large.
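For orientation, the two models described above can be written in generic notation. The symbols used here ($g_E$, $g_I$, $\theta$, $\gamma$, $\beta$, $p$, $q$, and the subscripts $E$ and $I$ for the external and internal populations) are illustrative assumptions for this sketch and need not match the symbols of the original displays.

```latex
\begin{align*}
% External reduced model: fitted in the external population, B omitted
g_E\{E_E(Y \mid X)\} &= \theta_0 + \sum_{j=1}^{p} \theta_j X_j , \\
% Internal full model: the target of estimation, X and B both included
g_I\{E_I(Y \mid X, B)\} &= \gamma_0 + \sum_{j=1}^{p} \gamma_j X_j + \sum_{k=1}^{q} \beta_k B_k .
\end{align*}
```

Only an estimate of $(\theta_1, \ldots, \theta_p)$ is available from the external study, whereas individual data on $Y$, $X$ and $B$ are available from the internal study.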

While this scenario may be common in many areas of science, we are particularly thinking of a medical setting where Y is a future or unknown and important binary event, the X are commonly available variables in that setting, such as age and sex, and the B are newer or less commonly measured variables, such as some recently discovered biomarkers. The interest is to include B in a new prediction model with the hope that it will have better performance.

There is existing literature on this topic of using external summary information to aid in the estimation of a full model from an internal dataset. Cheng et al. (2018, 2019) proposed Bayesian approaches that rely on a direct link between the parameters in the two models. Other methods (Chatterjee et al., 2016) use a constrained semiparametric maximum likelihood approach by converting the external summary-level information into a constraint and then maximizing the internal data likelihood subject to this constraint. However, the constrained maximum likelihood method requires the probability distribution of the outcome and covariates to be the same in the two populations. This constrained maximum likelihood method is closely related to an empirical likelihood approach (Qin, 2000; Han & Lawless, 2019). See also Han et al. (2023). In a different approach, Gu et al. (2019) used the external model to create synthetic data, which was combined with the internal data and then analysed. Estes et al. (2018) developed a more robust empirical Bayes estimator that essentially downweights the external information if it is not compatible with the internal data. Zhai & Han (2022) developed a penalization method to simultaneously select and integrate the external information that is compatible with the internal data. Rahmandad et al. (2017) and Kundu et al. (2019) developed methods within a meta-analysis framework.

Whether the external information is useful for the estimation of the internal model parameters depends on whether the joint distribution of Y, X and B, or some aspect of the joint distribution, is similar between the external and internal study populations. Much of the literature referenced above makes an assumption of transportability between the external and internal distributions. Full transportability corresponds to the two joint distributions being identical or, equivalently, to the distribution of Y given X and B, the distribution of B given X, and the distribution of X all being the same in the two populations. In practice, partial transportability, in which only aspects of the distributions are shared between the populations, may be more realistic. An example of partial transportability would be that one of these component distributions is the same in the two populations while one or both of the others differ. In general, we might expect marginal distributions to differ between populations, especially the X distribution, since different study populations typically have varying demographics. In practice, some information is usually available about the external X distribution, such as the marginal mean and standard deviation of each variable, which could be used to investigate how different it is from the internal X distribution. Conditional distributions might have more similarity, which is related to the idea that causal dynamics may be expected to be stable (e.g., Penrose, 2004). For example, for the distribution of Y given X and B, assuming generalized linear models in both populations, a plausibly realistic assumption on transportability is that the two link functions are the same and that the slope coefficients are equal, whereas the intercepts differ. The intercept-only difference between the regression coefficients reflects the belief that the covariate effects are similar between the two study populations, but the disease prevalence may differ. The case where the slope coefficients in one population equal those in the other multiplied by some constant is another example, reflecting the belief that the relative covariate effects are similar between the two study populations, but not the absolute magnitudes.

For the distribution of B given X, a strong assumption would be that B is independent of X in both study populations. Another choice might be not to specify a model for B given X, but rather to assume that the distribution is the same in the two study populations. A further option would be to specify a parametric model, such as a generalized linear model, and then restrict the regression coefficients to be related in certain ways in the two study populations.

It is feasible to consider the existence of the variables B in the external population, but that they were either not measured or not included in the external model. However, we will not typically have any available information on the distribution of B in the external population, and thus we cannot check whether the distributions of Y given X and B and of B given X are transportable between the internal and external populations. We can however check from the internal study data whether the distribution of Y given X is similar between the two populations. We will propose a method that does not require full transportability, but does require some aspect of the Y given X distributions in the two populations to be the same.

Our method builds on the extensive literature on omitted covariates in generalized linear models (Gail et al., 1984; Neuhaus & Jewell, 1993). Consider a generalized linear model for Y given X and a scalar B, and a reduced model for Y given X in the same population. The reduced model is usually misspecified, or not compatible with the full model, if the link function is nonlinear, because of the noncollapsibility of most generalized linear models. Neuhaus & Jewell (1993) showed that, when B is independent of X, the value of the reduced-model coefficient vector that best approximates the distribution of Y given X is related to the full-model coefficients of X through multiplication by some constant. So even though the reduced-model coefficients are typically not equal to the full-model coefficients, the ratios of the coefficients of the X in the two models are very similar. This result is closely related to the measurement error literature for the relationship between the parameters of a regression model for Y when the covariates do and do not have measurement errors (Monahan & Stefanski, 1992; Carroll et al., 2006). When B is not independent of X, the relationship between the two sets of coefficients is not so simple, depending on the distribution of B given X in a complex way.

The proposed method also builds on the literature concerning the robustness of ratios of regression parameter estimates under model misspecification. The general result is that the ratio of regression coefficients can be estimated well even if the link function is misspecified (Solomon, 1984; Struthers & Kalbfleisch, 1986; Li & Duan, 1989; Taylor, 1989, 1990), and this has been studied for continuous, ordinal and censored survival outcomes. In other words, one can get good estimates of the regression coefficients up to an unknown scaling factor when the link function is misspecified. The intuitive reason for the stability of the ratio of parameter estimates is that it represents the relative importance of one variable to another, in the sense that the ratio of two coefficients is the amount by which one covariate needs to be changed for a unit change in the other to give the same expected value of Y, and this would be the same irrespective of the link function.

Based on the aforementioned literature, we propose a data integration method that makes use of the ratios of the external study model coefficient estimates. The method works when the external reduced model for the distribution of Y given X leads to similar ratios of regression coefficient estimates when applied to the internal and to the external populations. This similarity of the ratios can be quantitatively checked since we have both individual data from the internal study and the reduced model parameter estimates from the external study. Such a check on the ratios provides some assurance on applying the proposed method in practice, unlike many existing methods where assumptions on the distribution transportability cannot be explicitly checked. In many practical scenarios it is plausible that the ratios are transportable between study populations based on the interpretation that they represent the relative importance of the X on Y, even though we do not have full transportability between populations.

2. Proposed method for integrating external information

2.1. Proposed method

The implementation of the proposed method for estimating the parameters of model (1) by incorporating the external estimate is as follows, where all models are fitted to the internal study data.

Step 1. Centre all the X so that each has mean zero, and for each component of B, fit a linear regression of that component on X by least squares, and then calculate the residuals.

Step 2. Fit the model

(2)

to obtain the estimates of its parameters. Here the scaling parameter in model (2) represents the common ratio of the regression coefficients of the X between the internal and external studies.

Step 3. Estimate the parameters of model (1) by transforming the Step 2 estimates back to the scale of the original B variables.

A summary of the different variables, models and parameters is provided in the Appendix.
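The three steps above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a logistic link, assumes that model (2) has an intercept, a single scaling parameter multiplying the external linear predictor, and one coefficient per residualized B, and uses hypothetical names (`fit_logistic`, `integrate_external_ratios`, `theta_ext`, `lam`). Standard errors from Proposition 2 are not computed.

```python
import numpy as np


def fit_logistic(Z, y, n_iter=100, tol=1e-10):
    """Maximum likelihood logistic regression by Newton-Raphson.

    Z must already contain a column of ones for the intercept.
    """
    coef = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Z @ coef)))
        weights = mu * (1.0 - mu)
        hessian = Z.T @ (Z * weights[:, None])
        step = np.linalg.solve(hessian, Z.T @ (y - mu))
        coef += step
        if np.max(np.abs(step)) < tol:
            break
    return coef


def integrate_external_ratios(y, X, B, theta_ext):
    """Sketch of Steps 1-3. X is (n, p), B is (n, q), theta_ext holds the
    p external slope estimates (external intercept excluded)."""
    n = X.shape[0]
    x_mean = X.mean(axis=0)

    # Step 1: centre X, regress each column of B on X, keep the residuals.
    Xc = X - x_mean
    design = np.column_stack([np.ones(n), Xc])
    alpha, *_ = np.linalg.lstsq(design, B, rcond=None)  # shape (p + 1, q)
    B_res = B - design @ alpha

    # Step 2: logistic model with intercept, one scaling parameter on the
    # external linear predictor, and free coefficients on the residuals.
    ext_score = Xc @ theta_ext
    Z = np.column_stack([np.ones(n), ext_score, B_res])
    coef = fit_logistic(Z, y)
    gamma0, lam, beta = coef[0], coef[1], coef[2:]

    # Step 3: map back to coefficients on the original (uncentred) X and B.
    slope_X = lam * theta_ext - alpha[1:] @ beta
    intercept = gamma0 - alpha[0] @ beta - slope_X @ x_mean
    return {"intercept": intercept, "slope_X": slope_X,
            "slope_B": beta, "lam": lam, "alpha": alpha}
```

Because Step 2 estimates a single scaling parameter in place of one coefficient per X variable, the fitted X coefficients are forced to respect the external ratios before the Step 3 adjustment for the regression of B on X.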

2.2. Justification for the proposed method

Step 1 above orthogonalizes each B to X. The ordinary least squares method leads to residuals that sum to zero and that have zero sample cross-product with every X variable. Therefore, the orthogonalization creates new variables that are uncorrelated with the X variables in the sample, which to some degree approximates the new variables being independent of all the X variables. This allows us to appeal to the property that the ratios of parameters in a reduced model with omitted covariates are similar to the ratios in a full model when the omitted covariates are independent of the X. Although orthogonality is a weaker condition than independence, under some conditions specified later, the property about the ratios of parameters is retained.

Step 3 above makes the trivial connection between the parameters in the model for Y given X and the orthogonalized B and those in the model for Y given X and B, which follows from the definition of the residuals in Step 1.
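Written out in illustrative notation, the algebra behind this connection is short. The symbols are assumptions made for this sketch: $\tilde B_k$ is the Step 1 residual for the $k$th component of $B$ with least squares estimates $\hat\alpha_{k0}, \hat\alpha_{kj}$; $\hat\theta_j$ are the external slope estimates; $\lambda$ is the scaling parameter; and $\gamma_0^{*}$, $\beta_k$ are the intercept and residual coefficients of the Step 2 model, whose linear predictor is assumed to have the form shown in the first line.

```latex
\begin{align*}
\eta &= \gamma_0^{*} + \lambda \sum_{j=1}^{p} \hat\theta_j X_j + \sum_{k=1}^{q} \beta_k \tilde B_k,
  \qquad \tilde B_k = B_k - \hat\alpha_{k0} - \sum_{j=1}^{p} \hat\alpha_{kj} X_j, \\
\eta &= \Bigl(\gamma_0^{*} - \sum_{k=1}^{q} \beta_k \hat\alpha_{k0}\Bigr)
  + \sum_{j=1}^{p} \Bigl(\lambda \hat\theta_j - \sum_{k=1}^{q} \beta_k \hat\alpha_{kj}\Bigr) X_j
  + \sum_{k=1}^{q} \beta_k B_k .
\end{align*}
```

Reading off the second line gives the intercept, the coefficients of $X$ and the coefficients of $B$ on the original scale; the coefficients of $B$ are unchanged by the orthogonalization.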

Step 2 is the crucial step that makes use of the ratios of parameter estimates from the external study, together with a corresponding scaling parameter. This step is based on making two connections of parameters in different models.

The first connection that Step 2 makes is between the parameters in the model for Y given X and the orthogonalized B and the parameters in the model for Y given X within the internal study population, where the ratios of the coefficient estimates for X are retained. The result, which will be explained in more detail in the next section, says that the two sets of X coefficients are proportional, up to some constant. Based on this result, if we already have an estimate of the reduced-model coefficients, then the estimate of the X coefficients from the model that includes the orthogonalized B will be close to that estimate multiplied by some scalar. Thus we can fit a model of the form (2) with fewer parameters using the external estimate.

The second connection that Step 2 makes is between the internal and external reduced-model coefficients. For the purpose of gaining efficiency for internal model fitting, it is desired to make use of the external estimate. The assumption we make is that the ratios of the parameters are the same, in the sense that the ratio of the internal to the external coefficient is the same for all X variables, or in other words that the internal coefficient vector equals the external one multiplied by some constant. This assumption is weaker than assuming equality of the parameters between the two populations, and it also allows the intercepts to be different. The interpretation of the ratios as the relative importance of the X on Y suggests that they are likely to be transportable from one population to another in many scenarios. An important aspect of this assumption is that it can be investigated from data, because we are provided with the external estimate, and the internal reduced-model estimate can be obtained from the internal study data. From the perspective of incorporating external study information into internal model fitting, Step 2 has a similar spirit to the constrained maximum likelihood method (Chatterjee et al., 2016), since it fits a generalized linear model under parameter constraints provided by the external information.
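One informal way to look at this assumption is to compare, coefficient by coefficient, the reduced-model estimates from the internal data with the external estimates. The helper below is a hypothetical sketch (the name `ratio_diagnostics` and the choice of summary are assumptions, not a procedure from the paper); it ignores the sampling variability of the internal estimates, so it is a descriptive check rather than a formal test.

```python
from statistics import mean, pstdev


def ratio_diagnostics(theta_internal, theta_external):
    """Elementwise ratios of internal to external reduced-model slopes.

    Returns the ratios, their mean, and their relative spread
    (population standard deviation divided by the absolute mean).
    A relative spread near zero is consistent with a common ratio.
    """
    if len(theta_internal) != len(theta_external):
        raise ValueError("coefficient vectors must have the same length")
    if any(t == 0 for t in theta_external):
        raise ValueError("external coefficients must be non-zero")
    ratios = [ti / te for ti, te in zip(theta_internal, theta_external)]
    centre = mean(ratios)
    spread = pstdev(ratios) / abs(centre) if centre != 0 else float("inf")
    return ratios, centre, spread


# Hypothetical numbers: internal slopes are exactly half the external ones.
ratios, centre, spread = ratio_diagnostics([1.0, -0.5, 0.25], [2.0, -1.0, 0.5])
```

In this toy call every ratio is 0.5, so the relative spread is zero; with real data the ratios would be compared in light of the standard errors of the internal estimates.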

To gain more insights on the assumption of equal ratios, let us consider the case where the distribution of Y given X and B for the external population has the same form as that in (1).

Then, by Proposition 1 in the next subsection, the required assumption of equal ratios becomes an expression in the full-model parameters and the coefficients from regressing B on X in the two populations, because the X coefficients in the Step 2 model are a reparameterization of those in model (1) after B is replaced by its residuals in the regression model.

This is not an assumption that can be checked from data; however, it might provide some insight into when the method is applicable. For example, the expression does not directly include the intercepts in the various models, so if the remaining regression parameters are the same in the two populations and the two populations differ only in the intercepts of the generalized linear model, then the assumption is satisfied and the method will work. Another scenario is if the B variables are only weakly associated with the X variables; then the coefficients from regressing B on X will be small, so all that is required is that the coefficients of X have a similar ratio in the two study populations for the method to provide a good approximation. This would hold even if the distributions in the two populations differed.

2.3. Relationship between full model and reduced model parameters

The distributions of Y given X and of Y given X and B are implicitly connected, because the former is obtained by integrating B out of the joint distribution of Y and B given X, which depends on both the distribution of Y given X and B and the distribution of B given X. Consider generalized linear models for both, each with its own regression coefficients; then a question is how the value of the reduced-model coefficient that provides the best approximation to the distribution of Y given X relates to the full-model coefficient, assuming that the model for Y given X and B is correct. Neuhaus & Jewell (1993) considered this problem and obtained their proportionality result, when B is independent of X, using Taylor series expansions and assuming that the regression coefficients are small.

An alternative approach, which we will use here, is to consider the solution to the large sample limit of the score equation for the reduced model, and this will provide a link between the two distributions through a link between the two sets of regression coefficients, as in the following proposition.

Proposition 1.

Suppose that the generalized linear model

is correctly specified for . Consider another generalized linear model for ,

with possibly different link functions. Here the reduced model omitting the covariates is mis-specified in general. The are all centred and the satisfy for all and . Let denote the large sample limit of the maximum likelihood estimate of , which is the value that minimizes the Kullback–Leibler divergence between the reduced generalized linear model and the true distribution . When and the true values and are close to zero, is approximately equal to up to a constant factor, i.e., for some constant .

This result concerning ratios of parameters can be regarded as a generalization of the Neuhaus & Jewell (1993) result from the B-independent-of-X situation to the B-orthogonal-to-X situation. The result is for two generalized linear models, one of which has omitted covariates, for the same population, and is an approximation based on a Taylor series expansion. In this paper this result is exclusively applied to the internal study population, as in Step 2 in § 2.1, in order to connect the full model parameters to the reduced model parameters. This is also discussed in § 2.2 as the first connection that Step 2 makes. To keep the generality of the result, the presentation of Proposition 1 is for a general population instead.

Here we summarize the main steps in the proof, with greater detail given in the Appendix. The estimate of the reduced-model coefficients solves the score equation of the reduced model. Thus, its probability limit is the value that solves the expected score equation. Now assuming that the elements of the regression coefficients are small, we approximate the expected score using a Taylor series expansion about zero. Then, after some algebra and using the fact that every B is orthogonal to every X, we arrive at a linear equation relating the limiting reduced-model coefficients to the full-model X coefficients through a matrix. Then we have the desired result as long as this matrix is invertible.
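A first-order version of this argument can be written out as follows. It is a sketch under simplifying assumptions that are not taken from the paper: both models use canonical links, so that the expected score for the slopes has the simple form in the first line; $h$ and $h_1$ denote the inverse links of the reduced and full models; $(\gamma_0, \gamma, \beta)$ are the full-model parameters and $(\theta_0^{*}, \theta^{*})$ the limiting reduced-model parameters; and $B$ denotes the orthogonalized covariates.

```latex
\begin{align*}
0 &= E\bigl[X\{E(Y \mid X, B) - h(\theta_0^{*} + X^{\top}\theta^{*})\}\bigr],
  \qquad E(Y \mid X, B) = h_1(\gamma_0 + X^{\top}\gamma + B^{\top}\beta), \\
h_1(\gamma_0 + u) &\approx h_1(\gamma_0) + h_1'(\gamma_0)\,u,
  \qquad h(\theta_0^{*} + v) \approx h(\theta_0^{*}) + h'(\theta_0^{*})\,v, \\
0 &\approx E(X)\{h_1(\gamma_0) - h(\theta_0^{*})\}
  + h_1'(\gamma_0)\{E(XX^{\top})\gamma + E(XB^{\top})\beta\}
  - h'(\theta_0^{*})\,E(XX^{\top})\theta^{*}, \\
\theta^{*} &\approx \frac{h_1'(\gamma_0)}{h'(\theta_0^{*})}\,\gamma
  \quad \text{when } E(X) = 0,\ E(XB^{\top}) = 0 \text{ and } E(XX^{\top}) \text{ is invertible.}
\end{align*}
```

Centring removes the first term of the third line and orthogonality removes the term involving $\beta$, which is why independence of $B$ and $X$ is not needed at this order of approximation.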

3. Asymptotic distribution of the proposed estimator

Proposition 2.

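Under the typical regularity conditions for the asymptotic normality of M-estimators (e.g., van der Vaart, 1998), the proposed estimator from Step 3 in § 2.1, centred at its probability limit and scaled by the square root of the internal sample size, converges in distribution to a mean-zero normal distribution as the internal sample size tends to infinity. The asymptotic variance depends on the probability limits of the Step 1 and Step 2 estimates.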

The proof of this result is given in the Appendix. It does not assume that the ratios of regression coefficients between the internal study reduced model and the external study reduced model are the same, that is, that one coefficient vector equals the other multiplied by some constant. Instead, this result shows the asymptotic distribution of the estimator produced by Step 3 in § 2.1 for any arbitrary fixed value of the external estimate.

Under the assumptions that the internal reduced-model coefficients equal the external ones multiplied by some constant and that the external study sample size was large so that the uncertainty associated with the information it provides is negligible, Proposition 1 ensures that the probability limit of the proposed estimator is close to the true parameter value of model (1), and thus Proposition 2 can be used for inference about that parameter. Here we point out that, since the result in Proposition 1 is an approximation based on a Taylor series expansion, the probability limit is not the exact true value and the difference may be difficult to quantify in general. However, this approximation is good when the regression coefficients are not very large, and as we show in the next section, the numerical results using Proposition 2 for inference are good. As a summary, our proposed method works and Proposition 2 can be used to make inference under the assumptions that (i) the internal study model (1) is correctly specified, (ii) the values of the regression coefficients are close to zero, (iii) the internal and external reduced-model coefficients are proportional, (iv) the external study sample size was large so that the uncertainty associated with the external estimate is negligible, and the typical regularity conditions hold for the asymptotic normality of the maximum likelihood estimator for generalized linear models.

4. Simulation studies

Simulation studies are implemented to evaluate the performance of the proposed estimator in various settings. We generate binary Y from logistic regression models with covariates either X alone or X and B. To assess the robustness properties of the procedure, we consider data generating assumptions that violate transportability between populations in specific ways. The properties of the proposed procedure are compared with those of the maximum likelihood estimator with internal study data alone and the constrained maximum likelihood method.

Two different simulation settings are examined. In the first setting, we generate both external and internal data from models that include X and B.

We use a logistic link function for both the external and internal models. The internal study sample size is 400. External summary information is obtained by fitting a misspecified model that omits B in the external dataset. For this setting, both external and internal X are generated from a Gaussian distribution with variance 1 and a common covariance, and B comprises a continuous variable and a binary variable. For each simulation scenario, replicated datasets are generated. Three different factors are varied, as shown in Table 1: the magnitude of the regression coefficients, whether the external and internal slope coefficients are the same, and whether the external and internal intercepts are the same.
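A data-generating mechanism of this general kind can be sketched as follows. All numerical values and the form of the B given X model below are hypothetical placeholders chosen for illustration; they are not the values used in the paper, and for brevity the sketch generates a single continuous B.

```python
import numpy as np


def simulate_population(n, intercept, gamma, beta, rho=0.3, seed=0):
    """Draw (y, X, B) from a logistic model with Gaussian X.

    X has unit variances and common covariance rho (hypothetical value).
    B is a single continuous covariate that depends weakly on X
    (hypothetical B-given-X model). y is Bernoulli with a logistic link.
    """
    rng = np.random.default_rng(seed)
    gamma = np.asarray(gamma, dtype=float)
    p = gamma.shape[0]
    cov = np.full((p, p), rho)
    np.fill_diagonal(cov, 1.0)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    B = 0.2 * X.sum(axis=1) + rng.normal(size=n)
    eta = intercept + X @ gamma + beta * B
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    return y, X, B


# Internal and external samples that share slopes but differ in intercept.
y_int, X_int, B_int = simulate_population(400, -1.0, [0.2, 0.2, 0.2], 0.2, seed=1)
y_ext, X_ext, B_ext = simulate_population(5000, 1.0, [0.2, 0.2, 0.2], 0.2, seed=2)
```

Changing only the intercept between the two calls mimics a scenario in which covariate effects are shared but the event prevalence differs between populations.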

Table 1.

Simulation scenarios for the first setting. Small Inline graphic refers to and large refers to . Values of in the distribution are and

	Small
	Small
	Small
	Small
	Large
	Large
	Large
	Large

The performance of parameter estimates is quantified by the bias and the mean squared error (MSE) over replications, the Monte Carlo standard deviation (ESD), and the coverage rate of 95% confidence intervals. The Brier score, the mean squared difference between the binary outcome and its predicted probability, is included to evaluate predictive performance on a test dataset.

The test data are generated using the internal study model.
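These summaries can be computed as in the sketch below. The function names are hypothetical, and two conventions are assumptions rather than details taken from the paper: the Monte Carlo standard deviation uses the n - 1 divisor, and the confidence intervals are Wald intervals of the form estimate ± 1.96 × standard error.

```python
from statistics import mean, stdev


def summarize_estimates(estimates, std_errors, truth, z=1.96):
    """Bias, MSE, Monte Carlo SD and Wald-interval coverage over replications."""
    n_rep = len(estimates)
    bias = mean(estimates) - truth
    mse = sum((e - truth) ** 2 for e in estimates) / n_rep
    esd = stdev(estimates)  # sample standard deviation, n - 1 divisor
    covered = sum(abs(e - truth) <= z * s for e, s in zip(estimates, std_errors))
    return {"bias": bias, "mse": mse, "esd": esd, "coverage": covered / n_rep}


def brier_score(outcomes, predicted_probs):
    """Mean squared difference between binary outcomes and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(outcomes, predicted_probs)) / len(outcomes)


# Toy inputs: three replications of one parameter whose true value is 2.
summary = summarize_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], truth=2.0)
score = brier_score([1, 0], [0.8, 0.3])
```

A lower Brier score indicates better calibrated and sharper predictions, which is why it is used here to compare methods on held-out data.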

From the results in Table 2, for the parameter estimates, the proposed estimator and constrained maximum likelihood have small bias, even in scenarios where the external and internal models differ. The proposed method accurately estimates the true intercept when the external and internal intercepts are not the same. In contrast, the constrained maximum likelihood method has large bias for the intercept in such cases. The justification for the proposed method was based on an assumption that the values of the regression coefficients were small. However, the results in Table 1 within the Supplementary Material show that the proposed method still has good performance when those values are not small. Results for the variability of the estimated coefficients of X show that the proposed method is more efficient than the maximum likelihood estimator based on internal study data alone, but not as efficient as the constrained maximum likelihood method. For the coefficients of B, all the methods have similar efficiency. The coverage rates of the confidence intervals are close to the nominal level, demonstrating that the asymptotic formula is providing a good approximation of the variance for this sample size of 400. The robustness of the proposed estimator is particularly advantageous over the constrained maximum likelihood method when prediction is our main interest. We see that the Brier score of the constrained maximum likelihood method is substantially larger than that of the proposed estimator when the external and internal intercepts differ.

Table 2.

Simulation results for Inline graphic – small scenarios. Sample size . Number of replications is . , and are multiplied by

		Bias			MSE			ESD		Coverage
	MLE	Prop	CML	MLE	Prop	CML	MLE	Prop	CML	Prop (%)
	0.48	0.43	1.42	3.36	3.32	1.97	18.3	18.2	13.9	95
	0.06	0.24	1.36	3.55	2.24	1.42	18.8	14.9	11.8	94
	0.22			3.41	2.28	1.48	18.4	15.0	12.1	94
	0.09	1.06	1.22	2.30	0.93	0.72	15.1	9.5	8.4	94
	1.76	1.51	1.75	1.75	1.72	1.75	13.1	13.0	13.1	94
				7.68	7.60	7.67	27.7	27.5	27.7	95
Brier()	0.723	0.719	0.716

	0.48	0.40	1.86	3.36	3.32	1.99	18.3	18.2	13.9	95
	0.06			3.55	2.23	1.47	18.8	14.8	11.8	94
	0.22	1.09		3.41	2.31	1.48	18.4	15.1	12.1	94
	0.09	0.21		2.30	0.91	0.71	15.1	9.5	8.4	93
	1.76	1.51	1.75	1.75	1.71	1.75	13.1	13.0	13.1	94
				7.68	7.60	7.67	27.7	27.5	27.6	95
Brier()	0.723	0.719	0.716

	0.48	0.41	50.29	3.36	3.32	27.25	18.3	18.2	13.9	95
	0.06			3.55	2.23	1.41	18.8	14.9	11.9	94
	0.22			3.41	2.28	1.50	18.4	15.0	12.2	94
	0.09	0.57	0.20	2.30	0.91	0.70	15.1	9.5	8.4	94
	1.76	1.51	1.77	1.75	1.72	1.75	13.1	13.0	13.1	94
				7.68	7.60	7.68	27.7	27.5	27.7	95
Brier()	0.723	0.719	0.746

	0.48	0.29	50.28	3.36	3.33	27.23	18.3	18.2	13.9	95
	0.06	4.86	4.02	3.55	2.55	1.57	18.8	15.2	11.9	94
	0.22			3.41	2.36	1.67	18.4	14.9	12.2	93
	0.09			2.30	0.96	0.80	15.1	9.4	8.3	93
	1.76	1.48	1.77	1.75	1.71	1.75	13.1	13.0	13.1	94
				7.68	7.63	7.68	27.7	27.6	27.7	95
Brier()	0.723	0.718	0.746

MLE, maximum likelihood estimation; Prop, proposed method; CML, constrained maximum likelihood; MSE, mean squared error; ESD, Monte Carlo standard deviation.

The scenarios in Table 3 for the second simulation study are designed to illustrate that the proposed method works when the ratios of the coefficients in the reduced models are the same in the two populations. We generate internal data from the full model, where the X have specified variances and covariances. The reduced model for the internal study is a model for Y given X, whose parameter value is approximated by fitting this model to a very large dataset. Then the external study data are generated from the external reduced model, whose slope coefficients are obtained by multiplying the internal reduced-model coefficients by scaling constants. So, whenever the scaling constants are equal, the ratios between the external and internal reduced parameters are retained. The external intercept is then selected to make the proportions of events in the internal and external datasets similar.

Table 3.

Simulation scenarios for the second setting. Here Inline graphic are scaling factors such that

The results in Table 4 show that the proposed method has low bias and is more efficient than the maximum likelihood method when the scaling factors are equal to one another, even when they are not equal to 1, and gives a lower Brier score. The coverage rates of the confidence intervals are close to the nominal level. The constrained maximum likelihood method is more efficient than the proposed method when the scaling factors are all equal to 1, but gives biased estimates otherwise, even when the scaling factors are equal to one another. When the scaling factors are not equal to one another, the proposed method does have some bias, as shown in Table 2 in the Supplementary Material.

Table 4.

Simulation results for scenarios Inline graphic –. , and are multiplied by . Sample size . Number of replications is . Let

		Bias			MSE			ESD		Coverage
	MLE	Prop	CML	MLE	Prop	CML	MLE	Prop	CML	Prop (%)
	0.93	0.49	-29.43	2.45	2.42	8.85	15.6	15.5	4.3	95
	1.86	1.56	-57.02	3.20	2.09	32.89	17.7	14.3	6.1	95
	2.10	1.34	-68.53	4.91	3.76	48.49	22.0	19.3	12.3	95
	1.04	0.84	0.03	2.30	2.28	2.23	15.1	15.1	14.9	95
Brier()	0.577	0.575	0.622

	0.93	0.49	-29.45	2.45	2.42	8.93	15.6	15.5	5.1	95
	1.86	1.68	-57.02	3.20	2.09	32.88	17.7	14.3	6.1	95
	2.10	1.24	-68.65	4.91	3.78	48.64	22.0	19.4	12.2	95
	1.04	0.84	0.03	2.30	2.28	2.23	15.1	15.1	14.9	95
Brier()	0.577	0.575	0.622

	0.93	0.49	4.47	2.45	2.42	0.46	15.6	15.5	5.1	95
	1.86	1.64	0.43	3.20	2.01	0.46	17.7	14.0	6.8	95
	2.10	1.36	-0.07	4.91	3.65	1.53	22.0	19.0	12.4	95
	1.04	0.85	1.02	2.30	2.28	2.30	15.1	15.1	15.1	95
Brier()	0.577	0.575	0.573

	0.93	0.49	4.47	2.45	2.42	0.46	15.6	15.5	6.0	95
	1.86	1.62	0.42	3.20	2.02	0.46	17.7	14.1	6.8	95
	2.10	1.37	-0.06	4.91	3.68	1.55	22.0	19.1	12.4	95
	1.04	0.85	1.02	2.30	2.28	2.30	15.1	15.0	15.1	95
Brier()	0.577	0.575	0.573

	0.93	0.50	22.18	2.45	2.42	5.26	15.6	15.5	5.8	95
	1.86	1.62	54.32	3.20	1.97	30.08	17.7	13.9	7.6	95
	2.10	1.43	64.72	4.91	3.66	43.57	22.0	19.0	12.9	96
	1.04	0.85	0.81	2.30	2.28	2.29	15.1	15.0	15.1	95
Brier()	0.577	0.575	0.584

	0.93	0.51	22.02	2.45	2.42	5.35	15.6	15.5	7.1	95
	1.86	1.68	54.45	3.20	1.99	30.26	17.7	14.0	7.8	96
	2.10	1.38	64.72	4.91	3.64	43.54	22.0	19.0	12.8	95
	1.04	0.85	0.81	2.30	2.28	2.29	15.1	15.0	15.1	95
Brier()	0.577	0.575	0.584

MLE, maximum likelihood estimation; Prop, proposed method; CML, constrained maximum likelihood; MSE, mean squared error; ESD, Monte Carlo standard deviation.

5. Real data analysis

To evaluate the proposed method on real data, we construct a predictive model to predict risk of high-grade prostate cancer by using internal data and external summary information available from a published study. The original risk calculator was developed from the Prostate Cancer Prevention Trial (Thompson et al., 2006). This calculator is constructed using a logistic regression model including five clinical variables: prostate specific antigen (PSA) level, digital rectal examination (DRE) findings, age, race (African American or not) and prior biopsy results. The original calculator is a logistic regression model for the probability of having high-grade prostate cancer; its coefficient estimates are listed as the external reduced model in Table 5. We aim to construct an expanded risk calculator by considering an additional binary biomarker (T2:ERG) that measures TMPRSS2:ERG gene fusions. Although this biomarker is not widely used, it was observed to have predictive power for high-grade prostate cancer (Truong et al., 2013; Tomlins et al., 2015).

The dataset we use is from Tomlins et al. (2015), which comprises a training set and an additional testing set. Details of the study can be found in Cheng et al. (2019). The direct maximum likelihood estimator, constrained maximum likelihood estimator and proposed method are implemented on the training data to obtain risk calculators, and each calculator is then applied to the testing dataset, where the predictive performance is quantified by the Brier score.

In Table 5, we see that the proposed method gains substantial efficiency compared to the direct maximum likelihood estimator, which does not utilize external summary information. The Brier score of the proposed method is an improvement over those from direct maximum likelihood and constrained maximum likelihood.

Table 5.

Parameter estimates for the prostate cancer study. The reduced model estimates are obtained from the published literature for the external model. The full model estimates and standard errors are obtained from the internal data using either the maximum likelihood estimate, or the constrained maximum likelihood estimate or the proposed method. The standard errors of the proposed method are in parentheses, all obtained from the asymptotic variance formula

Model	PSA	Age	DRE	Prior biopsy	Race	T2:ERG	Brier
Reduced model
External	1.29	0.031	1.00		0.96	—	0.933
Full model
Direct maximum likelihood	0.98	0.032	1.02		0.57	0.76	0.930
	(0.18)	(0.012)	(0.26)	(0.27)	(0.29)	(0.20)
Constrained maximum likelihood	1.14	0.032	1.06		0.80	0.72	0.931
	(0.07)	(0.004)	(0.14)	(0.11)	(0.17)	(0.20)
Proposed	0.97	0.02	0.69		0.87	0.73	0.919
	(0.13)	(0.003)	(0.10)	(0.05)	(0.10)	(0.19)

PSA, prostate specific antigen; DRE, digital rectal examination.

6. Discussion

The gain in efficiency of the proposed method by integrating external information is not as large as for some other existing methods. However, most of these other methods require the joint distribution of all the variables to be the same in the two study populations, and can be biased when the two distributions are not equal. Our proposed method is more robust, and only requires a selected aspect of the joint distribution to be transportable between populations.

The mathematical justification for our proposed method is based on an assumption that the regression coefficients are small. However, in simulation studies we found that it performed well unless the coefficients were very large. Another assumption made in this paper is that the external study has a large sample size, so that the uncertainty associated with the external information is negligible. When this is not the case, the asymptotic variance given in Proposition 2 needs to account for such uncertainty for inference to be valid; ignoring it leads to underestimation of the standard error of the proposed estimator. The external study uncertainty may be available in the form of standard errors for the external parameter estimates, and may be accounted for in the Taylor series expansion in a similar fashion to Han & Lawless (2019) and Kundu et al. (2019). Alternatively, it would also be feasible to consider an adaptation of the proposed method in which we require the ratio of the coefficient estimates from the internal study to be similar to, but not necessarily identical to, the ratio of the external estimates. Such an adaptation would be needed if there were multiple external studies, each with their own set of potentially overlapping variables and regression coefficients, because exact equality between the ratios of regression coefficients for every external study is unlikely to hold. A method that extends the shrinkage approach in Estes et al. (2018) could also be developed, in which more use is made of the external information if it is more consistent with the internal data.
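As a rough illustration of how external standard errors could propagate into a ratio of coefficients, the sketch below applies a first-order Taylor expansion (the delta method) to the ratio of two estimates. It is not the adjusted variance formula for the proposed estimator. The point estimates 1.29 and 1.00 are the external PSA and DRE coefficients from Table 5, whereas the standard errors 0.10 and 0.20 and the zero covariance are hypothetical values chosen only for illustration, as is the function name `ratio_se`.

```python
import math


def ratio_se(a, b, se_a, se_b, cov_ab=0.0):
    """First-order (delta-method) standard error of the ratio a / b.

    Uses Var(a/b) ~ [Var(a) - 2 r Cov(a, b) + r^2 Var(b)] / b^2 with r = a / b.
    """
    if b == 0:
        raise ValueError("denominator estimate must be nonzero")
    r = a / b
    var = (se_a ** 2 - 2.0 * r * cov_ab + r ** 2 * se_b ** 2) / b ** 2
    # Guard against tiny negative values caused by rounding.
    return r, math.sqrt(max(var, 0.0))


# External PSA and DRE coefficients from Table 5; the standard errors are
# hypothetical, since the external standard errors are not reported here.
ratio, se = ratio_se(1.29, 1.00, se_a=0.10, se_b=0.20)
```

When the external sample is large, both standard errors shrink and the ratio is effectively known, which corresponds to the negligible-uncertainty assumption made in the paper.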

In the high-dimensional regression setting, which occurs when the number of newly discovered covariates is large, modifications of the proposed method that include regularization or variable selection may be needed. One modification would be to first fit the full model of interest to the internal data with some regularization to select a subset of the newly discovered covariates, and then use the selected covariates to carry out the proposed procedure in § 2.1. Another modification would be to add a regularization penalty on the regression coefficients of the newly discovered covariates when fitting the model as in Step 2 in § 2.1, after the orthogonalization step. These two modifications may select different subsets of the newly discovered covariates in the end, and future investigations are needed to assess their performance. A relevant recent publication on data integration (Sheng et al., 2022) used a penalized empirical likelihood approach in a high-dimensional setting.
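The orthogonalization step referred to above can be sketched as follows, under the assumption that it amounts to replacing a new covariate by its least-squares residual after regression on a conventional covariate. The authoritative definition is the one in § 2.1; this sketch handles only one covariate of each kind, and the function name `residualize` and the toy data are hypothetical.

```python
def residualize(b, x):
    """Residuals of b after simple least-squares regression on x with an intercept."""
    n = len(x)
    if n != len(b) or n < 2:
        raise ValueError("x and b must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_b = sum(b) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    slope = sum((xi - mean_x) * (bi - mean_b) for xi, bi in zip(x, b)) / sxx
    intercept = mean_b - slope * mean_x
    return [bi - (intercept + slope * xi) for xi, bi in zip(x, b)]


# Toy data (illustrative only): x is a conventional covariate, b a new one.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.1, 3.9, 6.2, 7.8, 10.1]
b_perp = residualize(b, x)
```

By construction the residuals have sample mean zero and zero sample covariance with `x`, so any penalty placed on the coefficient of `b_perp` acts on the part of the new covariate that is not linearly explained by the conventional one.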

The constrained maximum likelihood method (Chatterjee et al., 2016) assumes the joint distribution to be the same between the internal and external study populations. It may be possible to extend this method to the setting we considered, by formulating the proportionality of the reduced-model coefficients in the two populations into a set of constraints on the full-model parameters. This formulation is not straightforward since the connection between the full-model and reduced-model parameters is not explicit. This is a topic worth future investigation.


Acknowledgement

This research was supported by the US National Institutes of Health (CA129102). The authors thank the editor, an associate editor and a referee for their helpful comments.

Appendix

Summary of notation

	Variables in the reduced model for .
	Additional variables in the full model for .

	Orthogonalization of , so for some
	.
	Parameters in the model, which are assumed to be correct for the internal study.
	Parameters in the model.
	Parameters in the model, a reparameterization of .
	Parameters in the model used in the proposed method.


Proof of Proposition 1

For Proposition 1, the maximum likelihood estimate of Inline graphic is

with probability limit

where the expectation Inline graphic is with respect to the true joint distribution of . Therefore, solves with

where the expectation Inline graphic is with respect to the distribution of and the calculation used the fact that , and for ,

Here we let Inline graphic . Since cannot be solved analytically, we solve an approximation with

(A1)

Here Inline graphic , and and , , are the first-order Taylor series expansions of and around . In approximation (A1) we used for all . Then, using both and , we are able to solve the equations defined by equating the second to last rows in (A1) to zero, and get , where , and

Then the desired result follows as long as Inline graphic is invertible regardless of the value.

Proof of Proposition 2

For Proposition 2, it is easy to see that Inline graphic satisfies . A Taylor series expansion leads to

Let Inline graphic denote the dependence of on implied by Step 3 in § 2.1. Then applying the central limit theorem and the delta method gives the desired result, with .

Contributor Information

Jeremy M G Taylor, Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109, U.S.A.

Kyuseong Choi, Department of Statistics and Data Science, Cornell University, 1198 Comstock Hall, 129 Garden Ave., Ithaca, New York 14853, U.S.A.

Peisong Han, Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan 48109, U.S.A.

Supplementary material

The Supplementary Material includes extensions of the simulation studies, including the scenarios Inline graphic of Table 1 and of Table 3.

References

Carroll, R. J., Ruppert, D., Stefanski, L. A. & Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models: a Modern Perspective, 2nd ed. New York: CRC Press.
Chatterjee, N., Chen, Y. H., Maas, P. & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. J. Am. Statist. Assoc. 111, 107–17.
Cheng, W., Taylor, J. M. G., Gu, T., Tomlins, S. A. & Mukherjee, B. (2019). Informing a risk prediction model for binary outcomes with external coefficient information. Appl. Statist. 68, 121.
Cheng, W., Taylor, J. M. G., Vokonas, P. S., Park, S. K. & Mukherjee, B. (2018). Improving estimation and prediction in linear regression incorporating external information from an established reduced model. Statist. Med. 37, 1515–30.
Estes, J. P., Mukherjee, B. & Taylor, J. M. G. (2018). Empirical Bayes estimation and prediction using summary-level information from external big data sources adjusting for violations of transportability. Statist. Biosci. 10, 568–86.
Gail, M. H., Wieand, S. & Piantadosi, S. (1984). Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika 71, 431–44.
Gu, T., Taylor, J. M. G., Cheng, W. & Mukherjee, B. (2019). Synthetic data method to incorporate external information into a current study. Can. J. Statist. 47, 580–603.
Han, P. & Lawless, J. F. (2019). Empirical likelihood estimation using auxiliary summary information with different covariate distributions. Statist. Sinica 29, 1321–42.
Han, P., Taylor, J. M. G. & Mukherjee, B. (2023). Integrating information from existing risk prediction models with no model details. Can. J. Statist. To appear.
Kundu, P., Tang, R. & Chatterjee, N. (2019). Generalized meta-analysis for multiple regression models across studies with disparate covariate information. Biometrika 106, 567–85.
Li, K. C. & Duan, N. (1989). Regression analysis under link violation. Ann. Statist. 17, 1009–52.
Monahan, J. F. & Stefanski, L. A. (1992). Normal scale mixture approximations to and computation of the logistic-normal integral. In Handbook of the Logistic Distribution, Balakrishnan, N. ed. New York: Marcel Dekker, pp. 529–40.
Neuhaus, J. M. & Jewell, N. P. (1993). A geometric approach to assess bias due to omitted covariates in generalized linear models. Biometrika 80, 807–15.
Penrose, R. (2004). The Road to Reality. A Complete Guide to the Laws of the Universe. London: Vintage Books.
Qin, J. (2000). Combining parametric and empirical likelihoods. Biometrika 87, 484–90.
Rahmandad, H., Jalali, M. S. & Paynabar, K. (2017). A flexible method for aggregation of prior statistical findings. PLoS One 12, e0175111.
Sheng, Y., Sun, Y., Huang, C. Y. & Kim, M. O. (2022). Synthesizing external aggregated information in the presence of population heterogeneity: A penalized empirical likelihood approach. Biometrics 78, 679–90.
Solomon, P. J. (1984). Effect of misspecification of regression models in the analysis of survival data. Biometrika 71, 291–8.
Struthers, C. A. & Kalbfleisch, J. D. (1986). Misspecified proportional hazard models. Biometrika 73, 363–9.
Taylor, J. M. G. (1989). A note on the cost of estimating the ratio of regression parameters after fitting a power transformation. J. Statist. Plan. Infer. 21, 223–30.
Taylor, J. M. G. (1990). Properties of maximum likelihood estimates of the ratio of parameters in ordinal response regression models. Commun. Statist. B 19, 469–80.
Thompson, I. M., Ankerst, D. P., Chi, C., Goodman, P. J., Tangen, C. M., Lucia, M. S., Feng, Z., Parnes, H. L. & Coltman, C. A., Jr. (2006). Assessing prostate cancer risk: results from the Prostate Cancer Prevention Trial. J. Nat. Cancer Inst. 98, 529–34.
Tomlins, S. A., Day, J. R., Lonigro, R. J., Hovelson, D. H., Siddiqui, J., Kunju, L. P., Dunn, R. L., Meyer, S., Hodge, P., Groskopf, J. et al. (2015). Urine TMPRSS2:ERG plus PCA3 for individualized prostate cancer risk assessment. Eur. Urol. 70, 45–53.
Truong, M., Yang, B. & Jarrard, D. F. (2013). Toward the detection of prostate cancer in urine: a critical analysis. J. Urol. 189, 422–9.
Van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge: Cambridge University Press.
Zhai, Y. & Han, P. (2022). Data integration with oracle use of external information from heterogeneous populations. J. Comp. Graph. Statist. 31, 1001–12.
