Published in final edited form as: Stat Med. 2009 May 15;28(11):1620–1635. doi: 10.1002/sim.3563

Synthesis analysis of regression models with a continuous outcome

Xiao-Hua Zhou 1,2,*, Nan Hu 2, Guizhou Hu 3, Martin Root 3

SUMMARY

Estimating a multivariate regression model by combining multiple individual studies is challenging when the individual studies provide only univariate or incomplete multivariate regression information. Samsa et al. (J. Biomed. Biotechnol. 2005; 2:113–123) proposed a simple method, known as synthesis analysis, for combining coefficients from univariate linear regression models into a multivariate linear regression model. However, the validity of this method relies on a normality assumption for the data, and it does not provide variance estimates. In this paper we propose a new synthesis method that improves on the existing method by eliminating the normality assumption, reducing bias, and allowing for variance estimation of the estimated parameters.

Keywords: synthesis analysis, meta-analysis, linear models

1. INTRODUCTION

Meta-analysis is a statistical technique for amalgamating, summarizing, and reviewing previous quantitative research. A typical meta-analysis summarizes all the research results on one topic and discusses the reliability of this summary. It is based on the condition that each individual study reports a finding on the same research question. The potential advantages of meta-analysis are the increased sample size and the improved validity of statistical inference. It is difficult, however, to apply meta-analysis methodology when individual studies provide only partial findings.

As a practical example, meta-analysis could be used to build a comprehensive multivariate prediction model for the risk of chronic diseases such as coronary heart disease (CHD). A wide range of CHD risk factors have been reported in the literature, but a comprehensive multivariate CHD prediction model is still lacking. The Framingham CHD model is widely considered the most comprehensive model available, although many well-known CHD risk factors, such as body mass index (BMI), family history of CHD, and C-reactive protein, are not included in it [1–3].

We propose a new approach to solve several of the problems described above. This novel multivariate meta-analysis modeling method is called synthesis analysis. Using multiple study results reported in the scientific and medical literature, the objective of synthesis analysis is to estimate the multivariate relations between multiple predictors (Xs) and an outcome variable (Y) from the univariate relation of each X with Y and the pairwise correlations between the Xs. These inputs may come from different studies in the literature; for example, a single cross-sectional population survey may provide the correlations among the Xs. We reported the first method of synthesis analysis (the Samsa–Hu–Root or SHR method) [5], in which the partial regression coefficients are calculated using the following matrix equation:

$$B = \big(R^{-1}(B_u \,\#\, S)\big) / S$$

where B is the vector of partial regression coefficients (excluding the intercept, B0), Bu is the vector of univariate regression coefficients, R is the matrix of Pearson correlation coefficients among the independent variables, S is the vector of standard deviations of the independent variables, # stands for element-wise multiplication, and / stands for element-wise division. The intercept, B0, can then be calculated from the resulting multivariate formula using the means of the predictors and the outcome together with the newly calculated partial regression coefficients.
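As a concrete illustration, the SHR matrix equation can be implemented in a few lines. The following is a minimal sketch assuming NumPy; the function and variable names are ours, not taken from the original paper or its software.

```python
import numpy as np

def shr_slopes(b_uni, sd_x, corr_x):
    """SHR synthesis: B = (R^{-1} (B_u # S)) / S, with # and / element-wise.

    b_uni  : univariate slopes of Y on each X_i
    sd_x   : standard deviations of the X_i
    corr_x : Pearson correlation matrix R of the X_i
    """
    b_uni = np.asarray(b_uni, float)
    sd_x = np.asarray(sd_x, float)
    # solve R z = (B_u # S) rather than forming R^{-1} explicitly
    return np.linalg.solve(np.asarray(corr_x, float), b_uni * sd_x) / sd_x

def shr_intercept(b, mean_x, mean_y):
    """Intercept B0 from the synthesized slopes and the variable means."""
    return mean_y - float(np.dot(b, np.asarray(mean_x, float)))
```

Solving the linear system instead of inverting R is a standard numerical choice; it returns the same B as the matrix equation above.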

In the present study, we propose an improvement to the existing synthesis analysis. Compared with the previous method, this method has at least two advantages: (1) it includes a method to compute the variances for predicted outcomes and estimated regression coefficients and (2) the estimates of predicted outcomes and regression coefficients can be more robust when the independent variables are not normally distributed.

Our paper is organized as follows. In Section 2, we describe our new method. In Section 3, we report a simulation study on finite-sample performance of the proposed method in comparison with the existing synthesis method. In Section 4, we illustrate the use of the proposed method in a real-life example from the 1999–2000 National Health and Nutritional Examination Survey. Finally, in Section 5, we conclude our paper with a discussion on some extensions.

2. NEW METHOD FOR SYNTHESIS ANALYSIS

2.1. Estimation of synthesized parameters

Suppose that we know the individual relationships between an outcome Y and each of p risk factors, X1, X2, …, and Xp, which are given as follows:

$$E[Y \mid X_i] = \gamma_0^{(i)} + \gamma_1^{(i)} X_i \qquad (1)$$

where i = 1, 2, …, p. In addition, we assume that we know the mean relationship between any pair of the p risk factors:

$$E[X_j \mid X_i] = \alpha_0^{(ij)} + \alpha_1^{(ij)} X_i \qquad (2)$$

where i, j = 1, 2, …, p and i ≠ j.

The goal of synthesis analysis is to determine the multivariate linear regression model between Y and the p risk factors:

$$E(Y \mid X_1, \ldots, X_p) = \beta_0 + \sum_{i=1}^{p} \beta_i X_i \qquad (3)$$

Note that the linear regression assumption (1) automatically holds under assumptions (2) and (3).

Taking the conditional expectation of both sides of (3) given Xi, we obtain the following equation:

$$E(Y \mid X_i = x) = \beta_0 + \beta_1 E(X_1 \mid X_i = x) + \cdots + \beta_{i-1} E(X_{i-1} \mid X_i = x) + \beta_i x + \beta_{i+1} E(X_{i+1} \mid X_i = x) + \cdots + \beta_p E(X_p \mid X_i = x) \qquad (4)$$

for i = 1, …, p. Combining (1), (2), and (4), we obtain the following result:

$$\gamma_0^{(i)} + \gamma_1^{(i)} x = \beta_0 + \big(\beta_1 \alpha_0^{(i1)} + \cdots + \beta_{i-1} \alpha_0^{(i,i-1)} + \beta_{i+1} \alpha_0^{(i,i+1)} + \cdots + \beta_p \alpha_0^{(ip)}\big) + \big(\beta_1 \alpha_1^{(i1)} + \cdots + \beta_{i-1} \alpha_1^{(i,i-1)} + \beta_i + \beta_{i+1} \alpha_1^{(i,i+1)} + \cdots + \beta_p \alpha_1^{(ip)}\big) x$$

for all x, where i = 1, …, p. Therefore, we obtain the following two sets of equations:

$$\begin{aligned} \gamma_0^{(1)} &= \beta_0 + \big(\beta_2 \alpha_0^{(12)} + \cdots + \beta_p \alpha_0^{(1p)}\big) \\ \gamma_0^{(i)} &= \beta_0 + \big(\beta_1 \alpha_0^{(i1)} + \cdots + \beta_{i-1} \alpha_0^{(i,i-1)} + \beta_{i+1} \alpha_0^{(i,i+1)} + \cdots + \beta_p \alpha_0^{(ip)}\big) \end{aligned} \qquad (5)$$

for i = 2, …, p; and

$$\begin{aligned} \gamma_1^{(1)} &= \beta_1 + \beta_2 \alpha_1^{(12)} + \cdots + \beta_p \alpha_1^{(1p)} \\ \gamma_1^{(i)} &= \beta_1 \alpha_1^{(i1)} + \cdots + \beta_{i-1} \alpha_1^{(i,i-1)} + \beta_i + \beta_{i+1} \alpha_1^{(i,i+1)} + \cdots + \beta_p \alpha_1^{(ip)} \end{aligned} \qquad (6)$$

for i = 2, …, p.

Let M be the p × p matrix with diagonal elements 1 and (i, j) element $\alpha_1^{(ij)}$ when $i \neq j$; let $\beta = (\beta_k,\ k = 1, \ldots, p)^T$ and $\gamma_1 = (\gamma_1^{(k)},\ k = 1, \ldots, p)^T$. From (6), we obtain the following p equations for the p unknown slope parameters, β1, …, βp:

$$M \beta = \gamma_1 \qquad (7)$$

Using Cramer’s rule, we can easily solve the above p simultaneous linear equations. Let us define the following determinants:

$$D = \begin{vmatrix} 1 & \alpha_1^{(12)} & \alpha_1^{(13)} & \cdots & \alpha_1^{(1p)} \\ \alpha_1^{(21)} & 1 & \alpha_1^{(23)} & \cdots & \alpha_1^{(2p)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \alpha_1^{(p1)} & \alpha_1^{(p2)} & \alpha_1^{(p3)} & \cdots & 1 \end{vmatrix}, \qquad D_1 = \begin{vmatrix} \gamma_1^{(1)} & \alpha_1^{(12)} & \alpha_1^{(13)} & \cdots & \alpha_1^{(1p)} \\ \gamma_1^{(2)} & 1 & \alpha_1^{(23)} & \cdots & \alpha_1^{(2p)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \gamma_1^{(p)} & \alpha_1^{(p2)} & \alpha_1^{(p3)} & \cdots & 1 \end{vmatrix}, \quad \ldots, \quad D_p = \begin{vmatrix} 1 & \alpha_1^{(12)} & \cdots & \gamma_1^{(1)} \\ \alpha_1^{(21)} & 1 & \cdots & \gamma_1^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_1^{(p1)} & \alpha_1^{(p2)} & \cdots & \gamma_1^{(p)} \end{vmatrix}$$

Cramer’s rule gives us the following unique solution to the system of equations (7):

$$\beta_k = D_k / D \qquad (8)$$

where k = 1, …, p.
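In practice one need not expand the determinants by hand; the system (7) can be solved directly. The sketch below assumes NumPy and uses our own (hypothetical) argument layout, with `alpha1[i, j]` holding $\alpha_1^{(ij)}$.

```python
import numpy as np

def synthesized_slopes(alpha1, gamma1):
    """Solve M beta = gamma_1 (equation (7)) for the synthesized slopes.

    alpha1 : p x p array with alpha1[i, j] = alpha_1^(ij), the slope from
             regressing X_j on X_i (diagonal entries are overwritten)
    gamma1 : length-p vector of univariate slopes gamma_1^(i)
    """
    M = np.array(alpha1, dtype=float)
    np.fill_diagonal(M, 1.0)  # diagonal elements of M are 1
    # equivalent to beta_k = D_k / D from Cramer's rule when D != 0,
    # but numerically more stable
    return np.linalg.solve(M, np.asarray(gamma1, float))
```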

After obtaining estimates of the vector of slope parameters, β, we can derive an estimate for the intercept parameter, β0, using any one of the p equations given in (5). Hence, we have the following p equations for the unknown intercept parameter β0:

$$\begin{aligned} \beta_0 + 0 + \alpha_0^{(12)} \beta_2 + \alpha_0^{(13)} \beta_3 + \cdots + \alpha_0^{(1,p-1)} \beta_{p-1} + \alpha_0^{(1p)} \beta_p &= \gamma_0^{(1)} \\ \beta_0 + \alpha_0^{(21)} \beta_1 + 0 + \alpha_0^{(23)} \beta_3 + \cdots + \alpha_0^{(2,p-1)} \beta_{p-1} + \alpha_0^{(2p)} \beta_p &= \gamma_0^{(2)} \\ &\;\;\vdots \\ \beta_0 + \alpha_0^{(p1)} \beta_1 + \alpha_0^{(p2)} \beta_2 + \alpha_0^{(p3)} \beta_3 + \cdots + \alpha_0^{(p,p-1)} \beta_{p-1} + 0 &= \gamma_0^{(p)} \end{aligned}$$

Although there are p equations for the parameter β0, we show in Appendix A that the solution for β0 is unique. We give a detailed description of our solution for the two-covariate case in Appendix B, and in Appendix C we give explicit formulas for the synthesized parameters in the three- and four-covariate cases.
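Continuing the sketch above (same assumed argument layout, with `alpha0[i, j]` holding $\alpha_0^{(ij)}$), the intercept can be read off the first equation of the system; by the uniqueness result of Appendix A, any of the p equations gives the same value.

```python
def synthesized_intercept(alpha0, gamma0, beta):
    """Recover beta_0 from the first equation of the intercept system (5).

    alpha0 : p x p array with alpha0[i, j] = alpha_0^(ij)
    gamma0 : length-p vector of univariate intercepts gamma_0^(i)
    beta   : synthesized slopes from synthesized_slopes()
    """
    p = len(beta)
    # beta_0 = gamma_0^(1) - sum_{j >= 2} beta_j * alpha_0^(1j)
    return gamma0[0] - sum(beta[j] * alpha0[0][j] for j in range(1, p))
```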

2.2. Variance estimation

The variance can be estimated using the delta method, assuming that the univariate parameter estimates $\hat\gamma_0^{(i)}$ and $\hat\gamma_1^{(i)}$ $(i = 1, \ldots, p)$ from the individual univariate linear regression models in (1) are independent of one another [4]. Let $\alpha = (\alpha_0^{(ij)}, \alpha_1^{(ij)},\ i, j = 1, \ldots, p)$ and $\gamma = (\gamma_0^{(k)}, \gamma_1^{(k)},\ k = 1, \ldots, p)$.

By the well-known result from simple linear regression, we know:

$$n^{1/2}\big[(\hat\alpha, \hat\gamma)^T - (\alpha_0, \gamma_0)^T\big] \xrightarrow{d} N(0, \Sigma)$$

where $\alpha_0$ and $\gamma_0$ are the true values of $\hat\alpha$ and $\hat\gamma$, and

$$\Sigma = \begin{pmatrix} \Sigma_\alpha & 0 \\ 0 & \Sigma_\gamma \end{pmatrix}$$

Here

$$\Sigma_\alpha = \big(\sigma_{\alpha_i^{(kl)} \alpha_j^{(k'l')}},\ i, j = 0, 1;\ k, l, k', l' = 1, 2, \ldots, p\big)$$

where $\sigma_{\alpha_i^{(kl)} \alpha_j^{(k'l')}}$ $(i, j = 0, 1;\ k, l, k', l' = 1, 2, \ldots, p)$ is the covariance between $\hat\alpha_i^{(kl)}$ and $\hat\alpha_j^{(k'l')}$, and

$$\Sigma_\gamma = \begin{pmatrix} \sigma_{\gamma_0^{(1)}\gamma_0^{(1)}} & 0 & \cdots & 0 \\ 0 & \sigma_{\gamma_1^{(1)}\gamma_1^{(1)}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{\gamma_1^{(p)}\gamma_1^{(p)}} \end{pmatrix}$$

is the covariance matrix of the estimated parameters γ̂.

The synthesized parameter estimates $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)^T$ are functions of the $\alpha$'s and $\gamma$'s, which can be expressed as

$$\hat\beta = g(\hat\alpha, \hat\gamma)$$

If the function g is differentiable, then the delta method gives the asymptotic variance of β as follows:

$$\Sigma_\beta = \nabla g(\alpha, \gamma)^T \, \Sigma \, \nabla g(\alpha, \gamma) \qquad (9)$$

where ∇g(α, γ) is the matrix of partial derivatives of g with respect to the elements of (α, γ). We give an explicit formula for ∇g(α, γ) when p = 2 in Appendix B. Many programs, such as Mathematica, can compute derivatives symbolically, which makes the variance calculation much easier because the exact form of ∇g need not be derived by hand before the calculation.
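When symbolic derivatives are inconvenient, the gradient in (9) can also be approximated numerically. The sketch below is our own illustration, not the paper's implementation: it treats g as a black-box function of the stacked (α, γ) vector and uses central differences.

```python
import numpy as np

def delta_method_cov(g, theta, sigma, eps=1e-6):
    """Approximate delta-method covariance of beta = g(theta), where
    theta stacks the (alpha, gamma) estimates and sigma is their
    covariance matrix (the Sigma of Section 2.2)."""
    theta = np.asarray(theta, float)
    n_beta = np.asarray(g(theta), float).size
    jac = np.empty((n_beta, theta.size))  # rows: beta, columns: theta
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        # central difference for d beta / d theta_k
        jac[:, k] = (np.asarray(g(theta + step), float)
                     - np.asarray(g(theta - step), float)) / (2 * eps)
    # with jac the Jacobian d beta / d theta, this is equation (9)
    return jac @ np.asarray(sigma, float) @ jac.T
```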

2.3. Variance of predicted value

Once the estimates of parameters and their variances have been derived, we can calculate the covariance matrix of predicted values as follows:

$$\operatorname{Cov}(\hat Y \mid X) = \operatorname{Cov}(X^T \hat\beta \mid X) = X^T \Sigma_\beta X$$

where $X^T$ is the transpose of $X$, and $\Sigma_\beta$ is the covariance matrix of $\hat\beta$ given by (9).
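For a single covariate vector, the quadratic form above reduces to one line; this small helper (our naming) assumes x carries a leading 1 for the intercept.

```python
import numpy as np

def predicted_variance(x, cov_beta):
    """Variance of the predicted value x^T beta-hat (Section 2.3)."""
    x = np.asarray(x, float)
    return float(x @ np.asarray(cov_beta, float) @ x)
```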

2.4. Mean-squared error of the predicted value and correlation between predicted and observed values

The mean-squared error (MSE) of the predicted value is given by

$$\mathrm{MSE}_{\hat Y} = \frac{1}{n} \sum_{i=1}^{n} (\hat Y_i - Y_i)^2$$

where Ŷi and Yi are the predicted and observed value of subject i, respectively. The correlation coefficient between Ŷi and Yi, ρ, can be calculated by

$$\rho = \frac{\operatorname{Cov}(\hat Y_i, Y_i)}{\sqrt{\operatorname{Var}(\hat Y_i)\operatorname{Var}(Y_i)}}$$

where Cov(Ŷi, Yi) is the covariance between predicted and observed values.
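Both summaries are direct to compute once predictions are in hand; a minimal sketch (assuming NumPy, with our own function name) is:

```python
import numpy as np

def prediction_summary(y_hat, y):
    """MSE of predictions and correlation between predicted and
    observed values (Section 2.4)."""
    y_hat = np.asarray(y_hat, float)
    y = np.asarray(y, float)
    mse = np.mean((y_hat - y) ** 2)
    rho = np.corrcoef(y_hat, y)[0, 1]
    return mse, rho
```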

3. SIMULATION STUDY

We conducted a simulation study to assess the performance of the proposed method in comparison with our previous method [5], denoted by SHR. We simulated data with two, three, and four predictor variables. For simplicity of presentation, we only report the results for the two-predictor case here, because the results for the three-predictor and four-predictor cases are similar.

In each of these cases, we simulated independent variables from (1) a multivariate normal distribution, (2) a multivariate log-normal distribution, (3) a multivariate exponential distribution, and (4) a multivariate gamma distribution. We chose the variances of all the independent variables to be 1 and correlations for pairs of the independent variables to be 0.5. After simulating the independent variables X, we generated the dependent variable Y by adding random normal errors to the mean model:

$$Y = \beta_0 + \sum_{i=1}^{p} \beta_i X_i + \varepsilon, \qquad p = 2, 3, 4 \qquad (10)$$

where ε is a random error following the standard normal distribution.

We set the true regression parameters as follows: (β0, β1, β2) = (−5, 5, 3) for the two-variable setting, (β0, β1, β2, β3) = (−5, 1, 3, 5) for the three-variable setting, and (β0, β1, β2, β3, β4) = (−5, 5, 4, 3, 1) for the four-variable setting. We divided each data set into $\binom{p+1}{2}$ (p = 2, 3, 4) subsets of equal size, where $\binom{p+1}{2}$ denotes the number of ways of choosing 2 items from p + 1 items. Each subset contained only one pair of variables chosen from Y, X1, …, Xp. The total sample sizes used in the simulation were 300 and 3000 (divided equally among the subsets). For each of the above settings, we simulated 1000 data sets. As the results for the other skewed distributions were similar to those for the log-normal distribution, we only report the results for the normal and log-normal distributions. The mean bias and MSE for the estimated parameters are reported in Tables I and II.
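For readers who wish to reproduce a run, the following sketch mirrors the two-predictor normal design described above (unit variances, pairwise correlation 0.5, equal subset sizes). The splitting scheme and names are our reading of the design, not the authors' code.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def simulate_split(beta, n_total):
    """One simulated data set, split into C(p+1, 2) pairwise subsets."""
    p = len(beta) - 1
    cov = np.full((p, p), 0.5)
    np.fill_diagonal(cov, 1.0)  # unit variances, correlation 0.5
    X = rng.multivariate_normal(np.zeros(p), cov, size=n_total)
    Y = beta[0] + X @ np.asarray(beta[1:], float) + rng.standard_normal(n_total)
    data = np.column_stack([Y, X])  # columns: Y, X1, ..., Xp
    pairs = list(combinations(range(p + 1), 2))
    m = n_total // len(pairs)  # equal sample size per subset
    return {pair: data[i * m:(i + 1) * m][:, list(pair)]
            for i, pair in enumerate(pairs)}

subsets = simulate_split(beta=(-5.0, 5.0, 3.0), n_total=300)  # two-predictor case
```

Each subset exposes only one pair of variables, so the univariate regressions of Section 2 can be fitted subset by subset.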

Table I.

Mean bias and MSE of estimated regression parameters with two independent variables following a normal distribution.

Sample size, method               Mean bias                        MSE
                                  β0       β1       β2             β0       β1       β2
n=300 (m1=m2=m3=100), New        −0.190   −0.016    0.041         14.808    1.708    2.763
n=300 (m1=m2=m3=100), SHR         0.486   −0.033   −0.090         26.897    0.939    1.527
n=3000 (m1=m2=m3=1000), New       0.031    0.000   −0.007          1.346    0.033    0.067
n=3000 (m1=m2=m3=1000), SHR       0.050   −0.004   −0.009          2.628    0.079    0.139

Note: m1 is the sample size for the subset containing only the outcome Y and predictor X1; m2 for the subset containing only Y and X2; m3 for the subset containing only X1 and X2.

Table II.

Mean bias and MSE of estimated regression parameters with two independent variables following a log-normal distribution.

Sample size, method               Mean bias                        MSE
                                  β0       β1       β2             β0       β1       β2
n=300 (m1=m2=m3=100), New         0.146   −0.081   −0.042         42.032    3.676    4.799
n=300 (m1=m2=m3=100), SHR        10.377   −1.104   −1.412        933.764   82.249   80.029
n=3000 (m1=m2=m3=1000), New      −0.051   −0.004    0.010          1.259    0.033    0.063
n=3000 (m1=m2=m3=1000), SHR      −0.015   −0.013    0.006          2.349    0.080    0.126

Note: m1 is the sample size for the subset containing only the outcome Y and predictor X1; m2 for the subset containing only Y and X2; m3 for the subset containing only X1 and X2.

In order to evaluate the accuracy of predicted values using the new model, we simulated two data sets with equal sample sizes. One was used as the training set for model derivation, while the other was used as the validation data set. To evaluate prediction performance, we reported mean bias, MSE, and the mean of standard error estimates (SEEs) for predicted values in Tables III and IV. The SEEs were derived using the method developed in Sections 2.2 and 2.3. The correlations between predicted and observed values were also reported in the two tables.

Table III.

Mean bias, MSE, correlation and SEE for predicted values with two independent variables following a normal distribution.

Sample size, method               Mean bias    MSE         Correlation   SEE
n=300 (m1=m2=m3=100), New          0.0108       0.8046     0.9949        6.0496
n=300 (m1=m2=m3=100), SHR         14.1519     221.1321     0.9900        —
n=3000 (m1=m2=m3=1000), New       −0.0092       0.0723     0.9996        1.8656
n=3000 (m1=m2=m3=1000), SHR       14.0304     209.9250     0.9954        —

Note: Correlation is the mean correlation between observed and predicted values across simulations. SEE is the mean of the standard error estimates for predicted values; it cannot be calculated for the SHR method. m1 is the sample size for the subset containing only the outcome Y and predictor X1; m2 for the subset containing only Y and X2; m3 for the subset containing only X1 and X2.

Table IV.

Mean bias, MSE, correlation and SEE for predicted values with two independent variables following a log-normal distribution.

Sample size, method               Mean bias    MSE             Correlation   SEE
n=300 (m1=m2=m3=100), New        −10.2079     199764.1000      0.9376        254.6255
n=300 (m1=m2=m3=100), SHR         85.9998      47835.6600      0.9335        —
n=3000 (m1=m2=m3=1000), New        1.0546      17442.6700      0.9918         71.3051
n=3000 (m1=m2=m3=1000), SHR       66.5488      12226.2700      0.9328        —

Note: Correlation is the mean correlation between observed and predicted values across simulations. SEE is the mean of the standard error estimates for predicted values; it cannot be calculated for the SHR method. m1 is the sample size for the subset containing only the outcome Y and predictor X1; m2 for the subset containing only Y and X2; m3 for the subset containing only X1 and X2.

Simulation results for the regression parameters showed that the mean bias and MSE of the estimated regression parameters using our new method were, in general, better than those using the SHR method, across all of the distributions and sample sizes considered here. The results also indicated that when the distributions of the independent variables X were heavily skewed (log-normal distribution), the bias and MSE of the estimated regression parameters were large for both methods, especially when sample sizes were small. Nonetheless, even in this situation the results from our new method were much better than those from the SHR method.

The results for predicted values indicated that both the new method and the SHR method had similar correlations between observed and predicted values across all sample sizes and distributions. However, mean bias and MSE for predicted values derived from our new method were much smaller than those from the SHR method.

4. EXAMPLE

In this section, we analyze a real-world example and compare the results using our new synthesis method and the SHR method. The data came from the 1999–2000 National Health and Nutritional Examination Survey [6]. There were five variables in this data set: one outcome Y, systolic blood pressure, and four predictors, X1, X2, X3, and X4, representing age, body mass index (BMI), serum total cholesterol level, and the natural log of serum triglycerides, respectively. First, we fitted a multivariate regression model to the full data set, which served as the gold standard for this analysis. Next, we randomly divided the data set into five mutually exclusive subsets with approximately equal sample sizes. Each of the first four subsets included the outcome Y and one of the four covariates X1, X2, X3, and X4, respectively. The last subset contained all four covariates and was used to derive the pairwise correlations among the covariates. We applied the two synthesis methods to these five subsets to obtain estimated parameters for the multivariate regression model and report the results in Table V. For comparison purposes, Table V also includes the estimated parameters obtained from the gold standard model.

Table V.

Parameter estimates (SE) for the NHANES blood pressure example.

Variables     Gold standard β̃      New method β̂NEW      SHR method β̂SHR*
Intercept     76.207 (2.556)        73.482 (4.531)        83.401
AGE            0.601 (0.017)         0.634 (0.050)         0.681
BMI            0.379 (0.045)         0.403 (0.128)         0.337
TCHOL          0.024 (0.007)         0.029 (0.018)         0.006
LOGTRIG        1.374 (0.529)         1.506 (0.931)         0.160

*SEs cannot be calculated with the SHR method.

The estimated parameters and their standard errors (SEs) from the gold standard, our new method, and the SHR method are listed in Table V (SEs are not available for the SHR method). From these results, we observed that the new method produced coefficient estimates comparable to those derived using the gold standard. However, the SHR estimates for the intercept and LOGTRIG differed somewhat from the gold-standard values. As an illustration, the predicted systolic blood pressure for a 65-year-old subject with a BMI of 19, a serum total cholesterol level of 190, and serum triglycerides of 160 would be 134, 135, and 136 using the gold standard method, the new method, and the SHR method, respectively.
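The illustrative prediction can be verified by plugging the Table V coefficients into the fitted equation; the snippet below uses the gold-standard column (note that LOGTRIG enters on the natural-log scale).

```python
import math

# gold-standard coefficients from Table V
coef = {"Intercept": 76.207, "AGE": 0.601, "BMI": 0.379,
        "TCHOL": 0.024, "LOGTRIG": 1.374}
subject = {"AGE": 65, "BMI": 19, "TCHOL": 190,
           "LOGTRIG": math.log(160)}  # triglycerides of 160
pred = coef["Intercept"] + sum(coef[k] * subject[k] for k in subject)
print(round(pred))  # 134, the value reported in the text
```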

5. DISCUSSION

In this paper, we provided several enhancements to the existing SHR synthesis analysis methodology. These improvements allow for more robust estimates of the regression parameters and predicted values when covariates are not normally distributed. Additionally, the new method allows for estimation of the variance of the resulting parameters and predicted outcomes.

Both the previously reported SHR method and our improved method allow for the building of multivariate regression models using univariate regression coefficients and pairwise correlation data derived from different data sources. The underlying assumption is that each individual study is representative of the same target population. However, the validity of the previously reported SHR methodology relies on a normality assumption for the data. Although synthesis analysis is related to both meta-analysis and the analysis of missing data, it differs from these two traditional analyses in two important ways. First, while the goal of traditional meta-analysis is to combine multivariate regression models with the same covariates from different studies, the goal of synthesis analysis is to create a multivariate linear regression model from univariate linear regression models on different covariates. Second, although the statistical problem that synthesis analysis addresses may be considered a particular type of missing-data problem, synthesis analysis, unlike a traditional missing-data analysis, does not require individual-level data; it only requires coefficient estimates from univariate linear regressions between the outcome and each covariate and between any two covariates.

Although the proposed method was developed to synthesize different univariate linear regression models with different covariates into a multivariate linear regression model, it can easily be extended to the setting in which several studies are available for some (or all) of the univariate regression models. In this case, there would be variation among the parameter estimates. For example, if there are five studies available for the linear model E(Y | X1) and six studies for the linear model E(X1 | X2), then we would have five sets of estimates for the intercept and slope of the linear model of Y on X1, denoted by $\gamma_{0j}^{(1)}$ and $\gamma_{1j}^{(1)}$ for j = 1, …, 5, and six sets of estimates for the intercept and slope of the linear model of X1 on X2, denoted by $\alpha_{0k}^{(21)}$ and $\alpha_{1k}^{(21)}$ for k = 1, …, 6.

In this case, we propose first combining the results on the same univariate regression model from different studies into one univariate regression model using the weighted mean of the $\alpha$'s and $\gamma$'s, with the weight for each study proportional to its sample size; that is,

$$\gamma_0^{(1)} = \sum_{j=1}^{5} \frac{N_j}{N} \gamma_{0j}^{(1)}, \qquad \gamma_1^{(1)} = \sum_{j=1}^{5} \frac{N_j}{N} \gamma_{1j}^{(1)}$$

where $N_j$ is the sample size for the jth study of the univariate model between Y and X1, and $N = \sum_{j=1}^{5} N_j$. Then, we apply the proposed synthesis method of Section 2 to obtain the multivariate regression model.
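The combining step itself is a one-liner; the sketch below (our naming, with illustrative numbers) weights each study's estimate by its sample size.

```python
import numpy as np

def combine_estimates(estimates, sample_sizes):
    """Sample-size-weighted mean of one univariate coefficient
    reported by several studies."""
    return float(np.average(np.asarray(estimates, float),
                            weights=np.asarray(sample_sizes, float)))

# e.g. five hypothetical studies reporting the slope of Y on X1
gamma_1_1 = combine_estimates([5.1, 4.8, 5.3, 4.9, 5.0],
                              [100, 200, 500, 1200, 3000])
```

The combined intercepts and slopes then feed into the synthesis method of Section 2 unchanged.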

We performed a simulation study to assess the performance of this modified method in the two-independent-variable case, with the covariates following either a bivariate normal or a bivariate log-normal distribution. We also compared the modified method with other combining methods, namely the mean, median, minimum, and maximum of the multiple estimates of the same regression parameter. From these simulation results, we concluded that the parameter estimates based on the weighted mean had the smallest bias and MSE, which were very close to those obtained using the gold standard. In addition, the predicted values based on the weighted mean had the smallest bias, MSE, and SEE. We give a detailed description of the simulation study and its results in Appendix D. Computer software implementing the proposed method is available at http://faculty.washington.edu/azhou.

Table DII.

Mean bias, MSE, correlation and SEE for predicted values with equal sample sizes.

Method                  Mean bias   MSE       Correlation   SEE
Total sample size N = 1000 × 3 × 5 (equal sample sizes) = 15 000
Weighted mean (Mean)     0.0019     0.0301    0.9998        0.9109
Total sample size N = 100 × 3 × 5 (equal sample sizes) = 1500
Weighted mean (Mean)     0.0126     0.3741    0.9956        3.0272

Table DIII.

Bias and MSE for estimated parameters with unequal sample sizes.

Method           Bias                               MSE
                 β0        β1        β2             β0        β1        β2
Total sample size N = (100+200+500+1200+3000) × 3 = 15 000
Weighted mean    0.0196    0.0049   −0.0056         0.5540    0.0251    0.0496
Mean            −0.0231    0.0067   −0.0076         0.8445    0.0567    0.0875
Median           0.0208    0.0073   −0.0082         0.6676    0.0680    0.0329
Minimum         −0.0538    0.0211   −0.0103         3.0387    0.0733    0.1526
Maximum         −0.0236    0.0040   −0.0123         5.8060    0.1549    0.2748
Total sample size N = (10+20+50+120+300) × 3 = 1500
Weighted mean    0.1147    0.0268   −0.0283         3.0217    0.3488    0.3621
Mean             0.2007    0.0234    0.0322         4.4266    0.3396    0.4212
Median           0.1583    0.0283   −0.0379         7.2861    0.4095    0.3714
Minimum         −2.8130   −0.4905    0.6229        73.6571    2.0423    3.8998
Maximum         −0.5346    0.1130    0.0830       529.7432   96.6978   61.0214

Acknowledgments

We would like to thank Vicki Ding and Hua Chen for their help in preparing this manuscript. Xiao-Hua Zhou, PhD, is presently a Core Investigator and Biostatistics Unit Director at the Northwest HSR&D Center of Excellence, Department of Veterans Affairs Medical Center, Seattle, WA. The views expressed in this article are those of the authors and do not necessarily represent the views of the Department of Veterans Affairs. This study has been partially supported by NSFC 30728019.

Contract/grant sponsor: NSFC; contract/grant number: 30728019

APPENDIX A: SKETCH PROOF FOR UNIQUENESS OF INTERCEPT COEFFICIENT

Here we show that there is a unique solution for the intercept term β0 from the p equations in (5); that is, we need to show that the following p solutions are equivalent:

$$\begin{aligned} \beta_0^{(1)} &= \gamma_0^{(1)} - \big(\alpha_0^{(12)} \beta_2 + \alpha_0^{(13)} \beta_3 + \cdots + \alpha_0^{(1,p-1)} \beta_{p-1} + \alpha_0^{(1p)} \beta_p\big) \\ \beta_0^{(2)} &= \gamma_0^{(2)} - \big(\alpha_0^{(21)} \beta_1 + \alpha_0^{(23)} \beta_3 + \cdots + \alpha_0^{(2,p-1)} \beta_{p-1} + \alpha_0^{(2p)} \beta_p\big) \\ &\;\;\vdots \\ \beta_0^{(p)} &= \gamma_0^{(p)} - \big(\alpha_0^{(p1)} \beta_1 + \alpha_0^{(p2)} \beta_2 + \alpha_0^{(p3)} \beta_3 + \cdots + \alpha_0^{(p,p-1)} \beta_{p-1}\big) \end{aligned}$$

Without loss of generality, we only show that the solutions of the first two equations are equal, that is, $\beta_0^{(1)} = \beta_0^{(2)}$. The proof for the other solutions is similar.

In order to show

$$\gamma_0^{(1)} - \alpha_0^{(12)} \beta_2 - \alpha_0^{(13)} \beta_3 - \cdots - \alpha_0^{(1,p-1)} \beta_{p-1} - \alpha_0^{(1p)} \beta_p = \gamma_0^{(2)} - \alpha_0^{(21)} \beta_1 - \alpha_0^{(23)} \beta_3 - \cdots - \alpha_0^{(2,p-1)} \beta_{p-1} - \alpha_0^{(2p)} \beta_p \qquad (A1)$$

we add E(X1)β1 + E(X2)β2 + ··· + E(Xp)βp to both sides of (A1), and then the left side of (A1) becomes

$$\gamma_0^{(1)} + E(X_1) \beta_1 + \big(E(X_2) - \alpha_0^{(12)}\big) \beta_2 + \cdots + \big(E(X_{p-1}) - \alpha_0^{(1,p-1)}\big) \beta_{p-1} + \big(E(X_p) - \alpha_0^{(1p)}\big) \beta_p \qquad (A2)$$

Because $E(X_j \mid X_i) = \alpha_0^{(ij)} + \alpha_1^{(ij)} X_i$, we can get the following result:

$$E(X_j) = E\big(E(X_j \mid X_i)\big) = \alpha_0^{(ij)} + \alpha_1^{(ij)} E(X_i) \qquad (A3)$$

Hence, we can replace $E(X_j) - \alpha_0^{(1j)}$ with $\alpha_1^{(1j)} E(X_1)$ in (A2) and obtain the following result:

$$\gamma_0^{(1)} + E(X_1) \beta_1 + \alpha_1^{(12)} \beta_2 E(X_1) + \cdots + \alpha_1^{(1,p-1)} \beta_{p-1} E(X_1) + \alpha_1^{(1p)} \beta_p E(X_1) = \gamma_0^{(1)} + \big(\beta_1 + \alpha_1^{(12)} \beta_2 + \cdots + \alpha_1^{(1p)} \beta_p\big) E(X_1) \qquad (A4)$$

Because β1, …, and βp are the solutions of Mβ = γ1, we can obtain the following result:

$$\beta_1 + \alpha_1^{(12)} \beta_2 + \cdots + \alpha_1^{(1p)} \beta_p = \gamma_1^{(1)} \qquad (A5)$$

Hence, the right side of (A4) becomes $\gamma_0^{(1)} + \gamma_1^{(1)} E(X_1)$, which equals $E(Y)$ because $E(Y) = E\big(E(Y \mid X_1)\big) = E\big(\gamma_0^{(1)} + \gamma_1^{(1)} X_1\big) = \gamma_0^{(1)} + \gamma_1^{(1)} E(X_1)$.

Similarly, we can show that the right side of (A1) plus $E(X_1)\beta_1 + E(X_2)\beta_2 + \cdots + E(X_p)\beta_p$ also equals $E(Y)$. This completes the proof.

APPENDIX B: SOLUTION FOR TWO PREDICTORS CASE

When p = 2, we have an explicit formula for the derivative of β = g(α, γ) with respect to α and γ, ∇g(α, γ). Here, ∇g(α, γ) is used to calculate the variance of β̂ and of the predicted values.

With $\Delta = 1 - \alpha_1^{(12)}\alpha_1^{(21)}$, equations (5)–(8) give the explicit two-predictor solution

$$\hat\beta_1 = \frac{\gamma_1^{(1)} - \alpha_1^{(12)}\gamma_1^{(2)}}{\Delta}, \qquad \hat\beta_2 = \frac{\gamma_1^{(2)} - \alpha_1^{(21)}\gamma_1^{(1)}}{\Delta}, \qquad \hat\beta_0 = \gamma_0^{(1)} - \alpha_0^{(12)}\hat\beta_2$$

Differentiating $(\hat\beta_0, \hat\beta_1, \hat\beta_2)$, column by column, with respect to $(\alpha_0^{(12)}, \alpha_1^{(12)}, \alpha_0^{(21)}, \alpha_1^{(21)}, \gamma_0^{(1)}, \gamma_1^{(1)}, \gamma_0^{(2)}, \gamma_1^{(2)})$, row by row, yields

$$\nabla g(\alpha, \gamma) = \begin{pmatrix} \dfrac{\partial\hat\beta_0}{\partial\alpha_0^{(12)}} & \dfrac{\partial\hat\beta_1}{\partial\alpha_0^{(12)}} & \dfrac{\partial\hat\beta_2}{\partial\alpha_0^{(12)}} \\ \vdots & \vdots & \vdots \\ \dfrac{\partial\hat\beta_0}{\partial\gamma_1^{(2)}} & \dfrac{\partial\hat\beta_1}{\partial\gamma_1^{(2)}} & \dfrac{\partial\hat\beta_2}{\partial\gamma_1^{(2)}} \end{pmatrix} = \begin{pmatrix} -\dfrac{\gamma_1^{(2)} - \alpha_1^{(21)}\gamma_1^{(1)}}{\Delta} & 0 & 0 \\[4pt] -\dfrac{\alpha_0^{(12)}\alpha_1^{(21)}\big(\gamma_1^{(2)} - \alpha_1^{(21)}\gamma_1^{(1)}\big)}{\Delta^2} & \dfrac{\alpha_1^{(21)}\gamma_1^{(1)} - \gamma_1^{(2)}}{\Delta^2} & \dfrac{\alpha_1^{(21)}\big(\gamma_1^{(2)} - \alpha_1^{(21)}\gamma_1^{(1)}\big)}{\Delta^2} \\[4pt] 0 & 0 & 0 \\[4pt] \dfrac{\alpha_0^{(12)}\big(\gamma_1^{(1)} - \alpha_1^{(12)}\gamma_1^{(2)}\big)}{\Delta^2} & \dfrac{\alpha_1^{(12)}\big(\gamma_1^{(1)} - \alpha_1^{(12)}\gamma_1^{(2)}\big)}{\Delta^2} & \dfrac{\alpha_1^{(12)}\gamma_1^{(2)} - \gamma_1^{(1)}}{\Delta^2} \\[4pt] 1 & 0 & 0 \\[4pt] \dfrac{\alpha_0^{(12)}\alpha_1^{(21)}}{\Delta} & \dfrac{1}{\Delta} & -\dfrac{\alpha_1^{(21)}}{\Delta} \\[4pt] 0 & 0 & 0 \\[4pt] -\dfrac{\alpha_0^{(12)}}{\Delta} & -\dfrac{\alpha_1^{(12)}}{\Delta} & \dfrac{1}{\Delta} \end{pmatrix}$$

APPENDIX C: SOLUTION FOR THREE AND FOUR PREDICTORS

When there are three predictors in the model, D and Di (i = 1, 2, 3) are given as follows:

$$D = \begin{vmatrix} 1 & \alpha_1^{(12)} & \alpha_1^{(13)} \\ \alpha_1^{(21)} & 1 & \alpha_1^{(23)} \\ \alpha_1^{(31)} & \alpha_1^{(32)} & 1 \end{vmatrix} = \big(1 + \alpha_1^{(12)}\alpha_1^{(23)}\alpha_1^{(31)} + \alpha_1^{(13)}\alpha_1^{(21)}\alpha_1^{(32)}\big) - \big(\alpha_1^{(12)}\alpha_1^{(21)} + \alpha_1^{(13)}\alpha_1^{(31)} + \alpha_1^{(23)}\alpha_1^{(32)}\big)$$

$$D_1 = \begin{vmatrix} \gamma_1^{(1)} & \alpha_1^{(12)} & \alpha_1^{(13)} \\ \gamma_1^{(2)} & 1 & \alpha_1^{(23)} \\ \gamma_1^{(3)} & \alpha_1^{(32)} & 1 \end{vmatrix} = \big(\gamma_1^{(1)} + \alpha_1^{(12)}\alpha_1^{(23)}\gamma_1^{(3)} + \alpha_1^{(13)}\gamma_1^{(2)}\alpha_1^{(32)}\big) - \big(\alpha_1^{(13)}\gamma_1^{(3)} + \alpha_1^{(12)}\gamma_1^{(2)} + \gamma_1^{(1)}\alpha_1^{(23)}\alpha_1^{(32)}\big)$$

$$D_2 = \begin{vmatrix} 1 & \gamma_1^{(1)} & \alpha_1^{(13)} \\ \alpha_1^{(21)} & \gamma_1^{(2)} & \alpha_1^{(23)} \\ \alpha_1^{(31)} & \gamma_1^{(3)} & 1 \end{vmatrix} = \big(\gamma_1^{(2)} + \gamma_1^{(1)}\alpha_1^{(23)}\alpha_1^{(31)} + \alpha_1^{(13)}\alpha_1^{(21)}\gamma_1^{(3)}\big) - \big(\alpha_1^{(13)}\gamma_1^{(2)}\alpha_1^{(31)} + \gamma_1^{(1)}\alpha_1^{(21)} + \alpha_1^{(23)}\gamma_1^{(3)}\big)$$

and

$$D_3 = \begin{vmatrix} 1 & \alpha_1^{(12)} & \gamma_1^{(1)} \\ \alpha_1^{(21)} & 1 & \gamma_1^{(2)} \\ \alpha_1^{(31)} & \alpha_1^{(32)} & \gamma_1^{(3)} \end{vmatrix} = \big(\gamma_1^{(3)} + \alpha_1^{(12)}\gamma_1^{(2)}\alpha_1^{(31)} + \gamma_1^{(1)}\alpha_1^{(21)}\alpha_1^{(32)}\big) - \big(\gamma_1^{(1)}\alpha_1^{(31)} + \alpha_1^{(12)}\alpha_1^{(21)}\gamma_1^{(3)} + \gamma_1^{(2)}\alpha_1^{(32)}\big)$$

If there are four predictors in the regression model, D and Di (i = 1, 2, 3, 4) are as follows:

$$D = \begin{vmatrix} 1 & \alpha_1^{(12)} & \alpha_1^{(13)} & \alpha_1^{(14)} \\ \alpha_1^{(21)} & 1 & \alpha_1^{(23)} & \alpha_1^{(24)} \\ \alpha_1^{(31)} & \alpha_1^{(32)} & 1 & \alpha_1^{(34)} \\ \alpha_1^{(41)} & \alpha_1^{(42)} & \alpha_1^{(43)} & 1 \end{vmatrix} = \big[\big(1 + \alpha_1^{(23)}\alpha_1^{(34)}\alpha_1^{(42)} + \alpha_1^{(24)}\alpha_1^{(32)}\alpha_1^{(43)}\big) - \big(\alpha_1^{(23)}\alpha_1^{(32)} + \alpha_1^{(24)}\alpha_1^{(42)} + \alpha_1^{(34)}\alpha_1^{(43)}\big)\big] - \alpha_1^{(12)}\big[\big(\alpha_1^{(21)} + \alpha_1^{(23)}\alpha_1^{(34)}\alpha_1^{(41)} + \alpha_1^{(24)}\alpha_1^{(31)}\alpha_1^{(43)}\big) - \big(\alpha_1^{(24)}\alpha_1^{(41)} + \alpha_1^{(23)}\alpha_1^{(31)} + \alpha_1^{(21)}\alpha_1^{(34)}\alpha_1^{(43)}\big)\big] + \alpha_1^{(13)}\big[\big(\alpha_1^{(21)}\alpha_1^{(32)} + \alpha_1^{(34)}\alpha_1^{(41)} + \alpha_1^{(24)}\alpha_1^{(31)}\alpha_1^{(42)}\big) - \big(\alpha_1^{(24)}\alpha_1^{(32)}\alpha_1^{(41)} + \alpha_1^{(21)}\alpha_1^{(34)}\alpha_1^{(42)} + \alpha_1^{(31)}\big)\big] - \alpha_1^{(14)}\big[\big(\alpha_1^{(21)}\alpha_1^{(32)}\alpha_1^{(43)} + \alpha_1^{(41)} + \alpha_1^{(23)}\alpha_1^{(31)}\alpha_1^{(42)}\big) - \big(\alpha_1^{(23)}\alpha_1^{(32)}\alpha_1^{(41)} + \alpha_1^{(31)}\alpha_1^{(43)} + \alpha_1^{(21)}\alpha_1^{(42)}\big)\big]$$

$$D_1 = \begin{vmatrix} \gamma_1^{(1)} & \alpha_1^{(12)} & \alpha_1^{(13)} & \alpha_1^{(14)} \\ \gamma_1^{(2)} & 1 & \alpha_1^{(23)} & \alpha_1^{(24)} \\ \gamma_1^{(3)} & \alpha_1^{(32)} & 1 & \alpha_1^{(34)} \\ \gamma_1^{(4)} & \alpha_1^{(42)} & \alpha_1^{(43)} & 1 \end{vmatrix} = \gamma_1^{(1)}\big[\big(1 + \alpha_1^{(23)}\alpha_1^{(34)}\alpha_1^{(42)} + \alpha_1^{(24)}\alpha_1^{(32)}\alpha_1^{(43)}\big) - \big(\alpha_1^{(23)}\alpha_1^{(32)} + \alpha_1^{(24)}\alpha_1^{(42)} + \alpha_1^{(34)}\alpha_1^{(43)}\big)\big] - \alpha_1^{(12)}\big[\big(\gamma_1^{(2)} + \alpha_1^{(23)}\alpha_1^{(34)}\gamma_1^{(4)} + \alpha_1^{(24)}\gamma_1^{(3)}\alpha_1^{(43)}\big) - \big(\alpha_1^{(24)}\gamma_1^{(4)} + \alpha_1^{(23)}\gamma_1^{(3)} + \alpha_1^{(34)}\alpha_1^{(43)}\gamma_1^{(2)}\big)\big] + \alpha_1^{(13)}\big[\big(\gamma_1^{(2)}\alpha_1^{(32)} + \alpha_1^{(34)}\gamma_1^{(4)} + \alpha_1^{(24)}\gamma_1^{(3)}\alpha_1^{(42)}\big) - \big(\alpha_1^{(24)}\alpha_1^{(32)}\gamma_1^{(4)} + \gamma_1^{(3)} + \alpha_1^{(34)}\alpha_1^{(42)}\gamma_1^{(2)}\big)\big] - \alpha_1^{(14)}\big[\big(\gamma_1^{(2)}\alpha_1^{(32)}\alpha_1^{(43)} + \gamma_1^{(4)} + \alpha_1^{(23)}\gamma_1^{(3)}\alpha_1^{(42)}\big) - \big(\alpha_1^{(23)}\alpha_1^{(32)}\gamma_1^{(4)} + \alpha_1^{(43)}\gamma_1^{(3)} + \alpha_1^{(42)}\gamma_1^{(2)}\big)\big]$$

$$D_2 = \begin{vmatrix} 1 & \gamma_1^{(1)} & \alpha_1^{(13)} & \alpha_1^{(14)} \\ \alpha_1^{(21)} & \gamma_1^{(2)} & \alpha_1^{(23)} & \alpha_1^{(24)} \\ \alpha_1^{(31)} & \gamma_1^{(3)} & 1 & \alpha_1^{(34)} \\ \alpha_1^{(41)} & \gamma_1^{(4)} & \alpha_1^{(43)} & 1 \end{vmatrix} = \big[\big(\gamma_1^{(2)} + \alpha_1^{(23)}\alpha_1^{(34)}\gamma_1^{(4)} + \alpha_1^{(24)}\gamma_1^{(3)}\alpha_1^{(43)}\big) - \big(\alpha_1^{(24)}\gamma_1^{(4)} + \alpha_1^{(23)}\gamma_1^{(3)} + \alpha_1^{(34)}\alpha_1^{(43)}\gamma_1^{(2)}\big)\big] - \gamma_1^{(1)}\big[\big(\alpha_1^{(21)} + \alpha_1^{(23)}\alpha_1^{(34)}\alpha_1^{(41)} + \alpha_1^{(24)}\alpha_1^{(31)}\alpha_1^{(43)}\big) - \big(\alpha_1^{(24)}\alpha_1^{(41)} + \alpha_1^{(23)}\alpha_1^{(31)} + \alpha_1^{(21)}\alpha_1^{(34)}\alpha_1^{(43)}\big)\big] + \alpha_1^{(13)}\big[\big(\alpha_1^{(21)}\gamma_1^{(3)} + \gamma_1^{(2)}\alpha_1^{(34)}\alpha_1^{(41)} + \alpha_1^{(24)}\alpha_1^{(31)}\gamma_1^{(4)}\big) - \big(\alpha_1^{(24)}\gamma_1^{(3)}\alpha_1^{(41)} + \gamma_1^{(2)}\alpha_1^{(31)} + \alpha_1^{(21)}\alpha_1^{(34)}\gamma_1^{(4)}\big)\big] - \alpha_1^{(14)}\big[\big(\alpha_1^{(21)}\gamma_1^{(3)}\alpha_1^{(43)} + \gamma_1^{(2)}\alpha_1^{(41)} + \alpha_1^{(23)}\alpha_1^{(31)}\gamma_1^{(4)}\big) - \big(\alpha_1^{(23)}\gamma_1^{(3)}\alpha_1^{(41)} + \gamma_1^{(2)}\alpha_1^{(31)}\alpha_1^{(43)} + \alpha_1^{(21)}\gamma_1^{(4)}\big)\big]$$

$$D_3 = \begin{vmatrix} 1 & \alpha_1^{(12)} & \gamma_1^{(1)} & \alpha_1^{(14)} \\ \alpha_1^{(21)} & 1 & \gamma_1^{(2)} & \alpha_1^{(24)} \\ \alpha_1^{(31)} & \alpha_1^{(32)} & \gamma_1^{(3)} & \alpha_1^{(34)} \\ \alpha_1^{(41)} & \alpha_1^{(42)} & \gamma_1^{(4)} & 1 \end{vmatrix} = \big[\big(\gamma_1^{(3)} + \gamma_1^{(2)}\alpha_1^{(34)}\alpha_1^{(42)} + \alpha_1^{(24)}\alpha_1^{(32)}\gamma_1^{(4)}\big) - \big(\alpha_1^{(24)}\alpha_1^{(42)}\gamma_1^{(3)} + \gamma_1^{(2)}\alpha_1^{(32)} + \alpha_1^{(34)}\gamma_1^{(4)}\big)\big] - \alpha_1^{(12)}\big[\big(\alpha_1^{(21)}\gamma_1^{(3)} + \gamma_1^{(2)}\alpha_1^{(34)}\alpha_1^{(41)} + \alpha_1^{(24)}\alpha_1^{(31)}\gamma_1^{(4)}\big) - \big(\alpha_1^{(24)}\gamma_1^{(3)}\alpha_1^{(41)} + \gamma_1^{(2)}\alpha_1^{(31)} + \alpha_1^{(21)}\alpha_1^{(34)}\gamma_1^{(4)}\big)\big] + \gamma_1^{(1)}\big[\big(\alpha_1^{(21)}\alpha_1^{(32)} + \alpha_1^{(34)}\alpha_1^{(41)} + \alpha_1^{(24)}\alpha_1^{(31)}\alpha_1^{(42)}\big) - \big(\alpha_1^{(24)}\alpha_1^{(32)}\alpha_1^{(41)} + \alpha_1^{(31)} + \alpha_1^{(21)}\alpha_1^{(34)}\alpha_1^{(42)}\big)\big] - \alpha_1^{(14)}\big[\big(\alpha_1^{(21)}\alpha_1^{(32)}\gamma_1^{(4)} + \gamma_1^{(3)}\alpha_1^{(41)} + \gamma_1^{(2)}\alpha_1^{(31)}\alpha_1^{(42)}\big) - \big(\gamma_1^{(2)}\alpha_1^{(32)}\alpha_1^{(41)} + \alpha_1^{(31)}\gamma_1^{(4)} + \alpha_1^{(21)}\gamma_1^{(3)}\alpha_1^{(42)}\big)\big]$$

and

$$D_4 = \begin{vmatrix} 1 & \alpha_1^{(12)} & \alpha_1^{(13)} & \gamma_1^{(1)} \\ \alpha_1^{(21)} & 1 & \alpha_1^{(23)} & \gamma_1^{(2)} \\ \alpha_1^{(31)} & \alpha_1^{(32)} & 1 & \gamma_1^{(3)} \\ \alpha_1^{(41)} & \alpha_1^{(42)} & \alpha_1^{(43)} & \gamma_1^{(4)} \end{vmatrix} = \big[\big(\gamma_1^{(4)} + \alpha_1^{(23)}\gamma_1^{(3)}\alpha_1^{(42)} + \gamma_1^{(2)}\alpha_1^{(32)}\alpha_1^{(43)}\big) - \big(\gamma_1^{(2)}\alpha_1^{(42)} + \alpha_1^{(23)}\alpha_1^{(32)}\gamma_1^{(4)} + \gamma_1^{(3)}\alpha_1^{(43)}\big)\big] - \alpha_1^{(12)}\big[\big(\alpha_1^{(21)}\gamma_1^{(4)} + \alpha_1^{(23)}\gamma_1^{(3)}\alpha_1^{(41)} + \gamma_1^{(2)}\alpha_1^{(31)}\alpha_1^{(43)}\big) - \big(\gamma_1^{(2)}\alpha_1^{(41)} + \alpha_1^{(23)}\alpha_1^{(31)}\gamma_1^{(4)} + \alpha_1^{(21)}\gamma_1^{(3)}\alpha_1^{(43)}\big)\big] + \alpha_1^{(13)}\big[\big(\alpha_1^{(21)}\alpha_1^{(32)}\gamma_1^{(4)} + \gamma_1^{(3)}\alpha_1^{(41)} + \gamma_1^{(2)}\alpha_1^{(31)}\alpha_1^{(42)}\big) - \big(\gamma_1^{(2)}\alpha_1^{(32)}\alpha_1^{(41)} + \alpha_1^{(31)}\gamma_1^{(4)} + \alpha_1^{(21)}\gamma_1^{(3)}\alpha_1^{(42)}\big)\big] - \gamma_1^{(1)}\big[\big(\alpha_1^{(21)}\alpha_1^{(32)}\alpha_1^{(43)} + \alpha_1^{(41)} + \alpha_1^{(23)}\alpha_1^{(31)}\alpha_1^{(42)}\big) - \big(\alpha_1^{(23)}\alpha_1^{(32)}\alpha_1^{(41)} + \alpha_1^{(31)}\alpha_1^{(43)} + \alpha_1^{(21)}\alpha_1^{(42)}\big)\big]$$

APPENDIX D: SIMULATION STUDY ON THE MODIFIED SYNTHESIS

We performed a simulation study to assess the performance of the modified method, as described in the Discussion section, for the two-independent-variable case when the vector of two covariates follows a bivariate normal or a bivariate log-normal distribution. We also compared this modified method with the other combining methods, namely the mean, median, minimum, and maximum of the multiple estimates of the same regression parameter. For each of the three univariate linear models, E(Y | X1), E(Y | X2), and E(X1 | X2), there were estimates from five different studies. We selected the sample sizes of the five studies for each univariate model to be equal (1000 or 100 per study) or unequal ((100, 200, 500, 1200, 3000) or (10, 20, 50, 120, 300)). We assessed the performance of the modified synthesis method using the weighted mean, mean, median, minimum, and maximum for combining the results from the five studies.

Since the results for simulated data from the bivariate normal distribution were similar to those from the bivariate log-normal distribution, we only report the results for the bivariate normal case. Tables DI–DIV show the bias and MSE of each regression parameter β0, β1, β2, as well as the mean bias, MSE, correlation, and SEE (mean of the SE estimates) for the predicted values.

Table DI.

Bias and MSE for estimated parameters with equal sample sizes.

Method                  Bias                               MSE
                        β0        β1        β2             β0        β1        β2
Total sample size N = 1000 × 3 × 5 (equal sample sizes) = 15 000
Weighted mean (Mean)     0.0023    0.0005   −0.0005         0.2126    0.0026    0.0068
Median                  −0.0055   −0.0016    0.0007         0.3792    0.0099    0.0183
Minimum                  0.0219    0.0075   −0.0036         0.5250    0.0140    0.0266
Maximum                 −0.0428   −0.0084    0.0083         0.8344    0.0214    0.0399
Total sample size N = 100 × 3 × 5 (equal sample sizes) = 1500
Weighted mean (Mean)     0.1066    0.0107   −0.0272         2.8586    0.0708    0.1509
Median                   0.1781    0.0286   −0.0433         4.2857    0.1156    0.2228
Minimum                 −0.2240   −0.0181    0.0502         5.4686    0.1158    0.2820
Maximum                 −0.1285   −0.0037   −0.0373        11.4781    0.3338    0.5221

Table DIV.

Mean bias, MSE, correlation and SEE for predicted values with unequal sample sizes.

Method           Mean bias    MSE       Correlation   SEE
Total sample size N = (100+200+500+1200+3000) × 3 = 15 000
Weighted mean     0.0201      0.0994    0.9886        1.1105
Mean             −0.0219      0.1134    0.9825        1.2773
Total sample size N = (10+20+50+120+300) × 3 = 1500
Weighted mean    −0.0158      0.3394    0.9900        4.1135
Mean              0.1993      0.3550    0.9789        4.3768

References

1. Hackam DG, Anand SS. Emerging risk factors for atherosclerotic vascular disease: a critical review of the evidence. Journal of the American Medical Association 2003; 290:932–940. doi:10.1001/jama.290.7.932

2. Fruchart-Najib J, Bauge E, Niculescu LS, Pham T, Thomas B, Rommens C, Majd Z, Brewer B, Pennacchio LA, Fruchart JC. Mechanism of triglyceride lowering in mice expressing human apolipoprotein. Biochemical and Biophysical Research Communications 2004; 319:397–404. doi:10.1016/j.bbrc.2004.05.003

3. Vasan RS. Biomarkers of cardiovascular disease: molecular basis and practical considerations. Circulation 2006; 113:2335–2362. doi:10.1161/CIRCULATIONAHA.104.482570

4. Casella G, Berger RL. Statistical Inference (2nd edn). Thomson Learning: Pacific Grove, CA, 2002.

5. Samsa G, Hu G, Root M. Combining information from multiple data sources to create multivariable risk models: illustration and preliminary assessment of a new method. Journal of Biomedicine and Biotechnology 2005; 2:113–123. doi:10.1155/JBB.2005.113

6. National Center for Health Statistics. National Health and Nutrition Examination Survey (NHANES) 1999–2000. Available from: http://www.cdc.gov/nchs/about/major/nhanes/
