Informing a Risk Prediction Model for Binary Outcomes with External Coefficient Information

Wenting Cheng; Jeremy M G Taylor; Tian Gu; Scott A Tomlins; Bhramar Mukherjee

doi:10.1111/rssc.12306

Author manuscript; available in PMC: 2020 Jan 1.

Published in final edited form as: J R Stat Soc Ser C Appl Stat. 2018 Aug 13;68(1):121–139. doi: 10.1111/rssc.12306


PMCID: PMC6519970 NIHMSID: NIHMS988360 PMID: 31105344

Summary.

We consider a situation where there is rich historical data available for the coefficients and their standard errors in an established regression model describing the association between a binary outcome variable Y and a set of predicting factors X, from a large study. We would like to utilize this summary information for improving estimation and prediction in an expanded model of interest, Y|X, B. The additional variable B is a new biomarker, measured on a small number of subjects in a new dataset. We develop and evaluate several approaches for translating the external information into constraints on regression coefficients in a logistic regression model of Y|X, B. Borrowing from the measurement error literature, we establish an approximate relationship between the regression coefficients in the models Pr(Y = 1|X, β), Pr(Y = 1|X, B, γ) and E(B|X, θ) for a Gaussian distribution of B. For binary B we propose an alternate expression. The simulation results comparing these methods indicate that historical information on Pr(Y = 1|X, β) can improve the efficiency of estimation and enhance the predictive power in the regression model of interest Pr(Y = 1|X, B, γ). We illustrate our methodology by enhancing the High-grade Prostate Cancer Prevention Trial Risk Calculator with two new biomarkers, prostate cancer antigen 3 and TMPRSS2:ERG.

Keywords: Bayesian methods, Constrained estimation, Logistic regression, Prediction models

1. Introduction

Risk prediction models for different binary disease endpoints are abundant in the clinical and epidemiological literature. Examples of established models are the breast cancer risk calculator (Gail et al., 1989) and the Framingham risk score (D’Agostino et al., 2001), which can be used to assess an individual’s risk of experiencing a future health event and to make decisions concerning screening and prophylactic prevention. As a motivating example in this paper, the Prostate Cancer Prevention Trial Risk Calculator (Thompson et al., 2006) is an online assessment tool which provides a personalized risk estimate of detecting prostate cancer based on risk factors such as age, prostate-specific antigen (PSA) and digital rectal examination (DRE) findings.

While these established models are often based on standard epidemiologic and behavioral risk factors, the wider availability of high throughput data and novel assay technologies is generating candidate biomarkers for potential inclusion in existing prediction models. It is very likely that the new biomarkers are assessed only on subjects in a study of moderate size and cannot be measured on the much larger population used for the well-established model. Investigators could directly estimate the expanded model in the new dataset, but results from this expanded prediction model based solely on a limited number of subjects could be highly variable. It is natural to consider using the information from the well-established model to increase the accuracy of the expanded model.

Substantial research has been done on the problem of enhancing risk prediction models with supplemental external information. The external information may be used to combine estimates from previous studies with the regression coefficients estimated in the new dataset. Steyerberg et al. (2000) described a method to adjust the multivariate logistic regression model’s coefficients estimated in a dataset based on univariate regression models’ coefficients in the literature. Newcombe et al. (2012) presented two possible approaches incorporating the effect estimates of a set of predictors: the first one was by adding a composite weighted risk score based on these estimates and the second one was by specifying informative priors for the coefficients of these variables in a Bayesian logistic regression model. Chatterjee et al. (2016) developed a general method for incorporating external coefficients, derived from constrained estimating equations. Other related approaches used constrained maximum likelihood and empirical likelihood (Imbens and Lancaster, 1994; Qin, 2000; Qin et al., 2015). Cheng et al. (2018) developed and compared a number of approaches for the situation when the outcome variable is continuous. They established exact relationships between the parameters in the model of interest that includes the new biomarker and the parameters in the established model, then proposed both frequentist and Bayesian approaches. In the current paper we adapt and extend the approaches to the situation when the outcome variable is binary.

There are also a number of simple approaches. For the Gail model, Mealiffe et al. (2010) computed a multiplicative risk score based on previously published odds ratios of newly discovered biomarkers. They then multiplied the Gail risk estimate and the multiplicative risk score to give a combined risk score. Grill et al. (2015) proposed a simple method of incorporating new markers via Bayes Theorem. They updated the posterior odds of getting cancer based on both standard risk factors and new markers by using the likelihood ratio incorporating dependence between the two sets of risk factors to adjust the prior odds of getting cancer based on standard risk alone. Grill et al. (2017) assessed the performance of a set of likelihood ratio approaches as well as the approach proposed in Chatterjee et al. (2016).

We consider a situation where the outcome is a binary indicator of disease and the well-established model is described in a published article, in which the estimated regression coefficients and their standard errors are presented. The expanded model includes one additional biomarker as a potential predictor. To introduce notation, let Y denote the binary outcome, X a set of p standard risk factors and B a new biomarker. The association between Y and X is described through the following logistic model:

\operatorname{logit} (\Pr (Y = 1 | X)) = X β = β_{0} + β_{1} X_{1} + \dots + β_{p} X_{p}

(1)

We assume we have available summary-level information on the estimated regression coefficients $\bar{β}$ and their standard errors $\bar{S}$ in model (1). Based on the work that went into establishing this model, we will assume that all the X’s are deemed to be important and need to be included in any model, and further that the above form provides at least a good approximation to the distribution of Y given X.

The model of primary interest is one that describes the joint effect of X, B on Y:

\operatorname{logit} (\Pr (Y = 1 | X, B)) = X γ_{X} + B γ_{B} = γ_{0} + γ_{1} X_{1} + \dots + γ_{p} X_{p} + γ_{p + 1} B

(2)

Our goal is to obtain the best estimate we can of the γ’s in a model of this form, making use of all the available information from the established model and the small dataset.

Another model that can be estimated from the current small dataset is:

E (B | X) = g^{- 1} (X θ) = g^{- 1} (θ_{0} + θ_{1} X_{1} + \dots + θ_{p} X_{p})

(3)

where g is the link function, which is the identity function g(y) = y for Gaussian B and the logit function g(y) = log(y/(1 − y)) for binary B. We propose to formulate the problem in an inferential framework where the historical information is translated in terms of non-linear constraints on the regression parameters. The distribution of B will greatly affect how we translate the historical information into constraints on the regression parameters. We consider the cases that B is either Gaussian or binary.

The remainder of this article is structured as follows. In Section 2 we describe the Prostate Cancer Prevention Trial Risk Calculator and the available data, including the new biomarkers that might be able to enhance this calculator. In Section 3, we establish a relationship equation between the regression coefficients of models (1)–(3) when B is Gaussian. In Section 4, we consider the situation when B is binary and derive the corresponding constrained solutions. We present a simulation study in Section 5. In Section 6 we demonstrate the proposed approaches for the High-grade Prostate Cancer Prevention Trial Risk Calculator. Concluding remarks are presented in Section 7.

2. A Motivating Example: Prostate Cancer Risk Prediction

The Prostate Cancer Prevention Trial (PCPT) was a phase III randomized placebo-controlled trial of the drug finasteride for the prevention of carcinoma of the prostate. The PCPT randomly assigned 18882 men who were at least 55 years old and did not have prostate cancer to either finasteride or placebo for 7 years. At the end of the 7 years of the study, all men who had not been diagnosed with prostate cancer during the trial were asked to undergo an end-of-study prostate biopsy. The biopsy result could be no cancer, low-grade cancer or high-grade cancer, the last of which was defined as a Gleason score of 7 or higher. Variables collected in this trial included family history of prostate cancer, age, race, previous biopsy result, PSA and digital rectal examination.

The use of PSA to screen for prostate cancer (PCa) has been controversial because the test has low specificity and can lead to overtreatment. Therefore, improved tests that use additional information are needed. The Prostate Cancer Prevention Trial Risk Calculator (PCPTrc) for prostate cancer, and a separate calculator for high-grade prostate cancer (PCPThg) (Thompson et al., 2006), were the first online prostate cancer risk assessment tools to allow an individual to assess his risk for prostate cancer. These calculators are well established and frequently used. They were developed from 5519 men in the placebo group of the PCPT who underwent prostate biopsy. The PCPThg calculator (version 1.0) predicts the chance of high-grade prostate cancer based on PSA level, age, DRE findings, prior biopsy result and race:

\log (\frac{p_{i}}{1 - p_{i}}) = - 6.25 + 0.03 {age}_{i} + 0.96 {race}_{i} + 1.29 \log ({PSA}_{i}) + 1.00 {DRE}_{i} - 0.36 {biopsy}_{i}

(4)

where p_i is the probability of observing high-grade prostate cancer for subject i. If we plug in a person’s age, race, PSA level, DRE result and previous biopsy information, we can estimate the probability of detecting high-grade prostate cancer. The estimated logistic model coefficients and their 95% confidence intervals are available in Thompson et al. (2006). The estimated coefficients and variance-covariance matrices were also accessible as an R code document at http://deb.uthscsa.edu/URORiskCalc/Pages/calcs.jsp.

The PCPT risk calculators are based on standard clinical, demographic and epidemiologic variables. None of the variables are related to the molecular mechanisms of carcinogenesis or prostate cancer disease progression. It is plausible to think that including other variables that are more related to the biology of cancer would lead to improved ability to detect PCa. Prostate cancer antigen 3 (PCA3) and TMPRSS2:ERG (T2:ERG) gene fusions are two prostate cancer biomarkers which have been shown to have better specificity for early detection of PCa than PSA (Truong et al., 2013; Tomlins et al., 2015). Their transcripts are detectable and quantifiable in urine collected after digital rectal examination. To investigate whether PCA3 and T2:ERG could be combined with the PCPThg calculator to give more accurate risk prediction, Tomlins et al. (2015) undertook a study in 679 men, in whom all the PCPThg calculator variables and both a PCA3 score and a T2:ERG score were measured. In this dataset the proportion with high grade PCa is 26.4%. An independent validation study of 1218 men was also available. In this dataset the proportion with high grade PCa is 18.3%.

Tomlins et al. (2015) expanded the PCPThg model by incorporating PCA3 as an additional risk factor. They used the predicted risk score from the PCPThg (i.e., $\widehat{\Pr} (Y_{i} = 1 | X_{i}, {\bar{β}}_{PCPThg}) \times 100$) directly as a predicting variable and estimated the joint effect of the PCPThg risk score and the PCA3 value on the probability of high-grade PCa. They estimated the new model in the training dataset, and found that when applied to the validation dataset the AUC increased from 0.707 for the PCPThg model to 0.752 for their model. They also constructed another expanded PCPThg model by incorporating T2:ERG and showed that the AUC increased from 0.707 to 0.754. We would like to propose more sophisticated statistical approaches that could potentially provide further improvement compared to these results.

3. Statistical Approaches

3.1. Logistic Regression Approximation of the Marginal Pr(Y = 1|X)

A difficulty in translating the summary information from modeling Pr(Y = 1|X) to modeling Pr(Y = 1|X, B) is that a logistic model logit(Pr(Y = 1|X, B)) does not reduce to a logistic model logit(Pr(Y = 1|X)) when marginalized over the distribution of B. To connect the regression coefficients in models (1), (2) and (3), we need to approximate logit(Pr(Y = 1|X)) written in terms of parameters γ, θ and variables X, and equate that to logit(Pr(Y = 1|X)) = Xβ. To achieve this, we consider the following integral:

\begin{aligned} \Pr (Y = 1 | X) & = \int \Pr (Y = 1 | X, B) P (B | X) dB \\ & = {({(2 π)}^{1 / 2} σ_{2})}^{- 1} \int H (X γ_{X} + B γ_{p + 1}) \exp (- \frac{{(B - X θ)}^{T} (B - X θ)}{2 σ_{2}^{2}}) dB \end{aligned}

(5)

where H(v) = (1+exp(−v))⁻¹, and B|X follows a Gaussian distribution $N (X θ, σ_{2}^{2})$. The integral in (5) does not have a closed-form solution and approximations are necessary.

By a normal scale mixture representation of the logistic distribution function and computation of the logistic-normal integral (Monahan and Stefanski, 1992), we obtain the approximation $\Pr (Y = 1 | X) \approx H (\frac{X γ_{X} + (X θ) γ_{p + 1}}{{(1 + γ_{p + 1}^{2} σ_{2}^{2} / {1.7}^{2})}^{\frac{1}{2}}})$. The derivation of the approximation is given in Supplementary Material Appendix B. Using this approximation, we find an approximate relationship between γ, θ and β:

β_{j} \approx (γ_{j} + γ_{p + 1} θ_{j}) / ({(1 + γ_{p + 1}^{2} σ_{2}^{2} / {1.7}^{2})}^{\frac{1}{2}}), j = 0, \dots, p .

(6)
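As a numerical illustration of (6), the following sketch evaluates the approximation at the parameter values used in the simulation study of Section 5 and compares the implied marginal probability with a brute-force numerical integration of (5). The function names, the trapezoidal grid and the evaluation point X₁ = X₂ = 0 are illustrative assumptions and are not part of the original analysis.

```python
import math

def expit(v):
    """Logistic function H(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def approx_beta(gamma_x, gamma_b, theta, sigma_b):
    """Approximate beta_j, j = 0..p, from relationship (6).

    gamma_x: coefficients of (1, X_1, ..., X_p) in model (2)
    gamma_b: coefficient gamma_{p+1} of B in model (2)
    theta:   coefficients of (1, X_1, ..., X_p) in model (3)
    sigma_b: residual standard deviation sigma_2 of B given X
    """
    scale = math.sqrt(1.0 + gamma_b ** 2 * sigma_b ** 2 / 1.7 ** 2)
    return [(g + gamma_b * t) / scale for g, t in zip(gamma_x, theta)]

def marginal_prob_numeric(x, gamma_x, gamma_b, theta, sigma_b,
                          n_grid=4001, width=8.0):
    """Integrate (5) over B with the trapezoidal rule on mean +/- width * sd."""
    lin_y = sum(g * xi for g, xi in zip(gamma_x, x))
    mean_b = sum(t * xi for t, xi in zip(theta, x))
    lo = mean_b - width * sigma_b
    step = 2.0 * width * sigma_b / (n_grid - 1)
    norm_const = sigma_b * math.sqrt(2.0 * math.pi)
    total = 0.0
    for k in range(n_grid):
        b = lo + k * step
        dens = math.exp(-0.5 * ((b - mean_b) / sigma_b) ** 2) / norm_const
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += weight * expit(lin_y + gamma_b * b) * dens
    return total * step

# Parameter values taken from the simulation scenario in Section 5.
gamma_x, gamma_b = [2.0, 3.0, 3.0], 2.0
theta, sigma_b = [0.0, 0.5, 0.5], 0.75

beta = approx_beta(gamma_x, gamma_b, theta, sigma_b)
x = [1.0, 0.0, 0.0]  # intercept term, X_1 = 0, X_2 = 0
p_approx = expit(sum(b * xi for b, xi in zip(beta, x)))
p_numeric = marginal_prob_numeric(x, gamma_x, gamma_b, theta, sigma_b)
```

At these values the approximation gives β close to (1.50, 3.00, 3.00), which is in line with the estimates $\bar{β}$ reported for the large dataset in Section 5.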

3.2. Firth Correction in Logistic Regression

The Firth correction (Firth, 1993) is a general approach to reduce bias in maximum likelihood estimation by maximizing a penalized log-likelihood function, where the penalty is $\frac{1}{2} \log | I |$ and I is the information matrix. In logistic regression, standard maximum likelihood estimates can be seriously biased, or may not exist at all, because of separability and multicollinearity, and the Firth correction is suggested (Heinze and Schemper, 2002) as a way to improve the estimates. In our constrained solution, we add the Firth correction to stabilize the estimates from standard logistic regression.
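To illustrate why the penalty stabilizes estimation, the following sketch evaluates the penalized log-likelihood, taken here as the log-likelihood plus one half of the log determinant of the information following Firth (1993), for a one-parameter logistic model without intercept on a small, perfectly separated toy dataset. The data, the grid search and the function names are illustrative assumptions; a dedicated implementation would be used in practice.

```python
import math

def log_lik(b, x, y):
    """Logistic log-likelihood for the model logit(p_i) = b * x_i."""
    return sum(yi * b * xi - math.log1p(math.exp(b * xi))
               for xi, yi in zip(x, y))

def information(b, x):
    """Fisher information sum_i x_i^2 p_i (1 - p_i); a scalar for one parameter."""
    total = 0.0
    for xi in x:
        p = 1.0 / (1.0 + math.exp(-b * xi))
        total += xi ** 2 * p * (1.0 - p)
    return total

def firth_log_lik(b, x, y):
    """Penalized log-likelihood: log-likelihood + 0.5 * log|I(b)|."""
    return log_lik(b, x, y) + 0.5 * math.log(information(b, x))

# Toy data with perfect separation: y = 1 exactly when x > 0.
x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]

# Crude grid search over the slope (illustration only).
grid = [k / 100.0 for k in range(0, 1001)]
b_firth = max(grid, key=lambda b: firth_log_lik(b, x, y))
```

On these data the unpenalized log-likelihood keeps increasing as the slope grows, so the maximum likelihood estimate does not exist, whereas the penalized criterion attains its maximum at a finite slope inside the grid.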

3.3. Unconstrained Solutions

3.3.1. Direct Regression

Without constraints, the unknown parameters γ in model (2) can be estimated by maximizing the likelihood. Specifically, the estimate solves

\max_{γ} {\sum_{i=1}^{n} [Y_{i} (\sum_{j = 0}^{p} γ_{j} X_{ij} + γ_{p + 1} B_{i}) - \log (1 + \exp (\sum_{j=0}^{p} γ_{j} X_{ij} + γ_{p + 1} B_{i}))]}

(7)

In addition, we implement Firth’s penalized likelihood approach using the R package logistf. We use least squares to estimate θ in model (3).

3.3.2. Standard Bayes

Draws for the posterior distributions of γ and θ are obtained using flat conjugate priors for θ and weakly informative Cauchy distribution priors for γ, as described in the Supplementary Materials Appendix A.

3.4. Constrained Solutions

3.4.1. Constrained Maximum Likelihood

The constrained maximum likelihood (constrained ML) estimation maximizes the joint log-likelihood under the set of constraints generated based on the approximate relationship equations in (6). We will require the parameter estimates for γ and θ to result in the derived value of β being within d standard errors of the old point estimates:

\begin{array}{l} \min_{γ, θ} {\sum_{i = 1}^{n} [- Y_{i} (\sum_{j = 0}^{p} γ_{j} X_{ij} + γ_{p + 1} B_{i}) + \log (1 + \exp (\sum_{j = 0}^{p} γ_{j} X_{ij} + γ_{p + 1} B_{i}))] + \sum_{i = 1}^{n} \frac{{(B_{i} - \sum_{j = 0}^{p} θ_{j} X_{ij})}^{2}}{2 {\hat{σ}}_{2}^{2}}} \\ \text{s.t. } (γ_{j} + γ_{p + 1} θ_{j}) / {(1 + γ_{p + 1}^{2} {\hat{σ}}_{2}^{2} / {1.7}^{2})}^{\frac{1}{2}} \in [{\bar{β}}_{j} - d {\bar{S}}_{j}, {\bar{β}}_{j} + d {\bar{S}}_{j}], j = 0, \dots, p \end{array}

(8)

In this optimization problem, ${\hat{σ}}_{2}^{2}$ is a plug-in estimator, namely the OLS residual variance from fitting E(B|X), and d is a scale parameter representing the strength of the external information. From simulations, we find that fixing d = 1 leads to estimates of γ with good properties. A modified version that includes the Firth correction is also considered. Further details of these methods are provided in the Supplementary Materials Appendix A. We use the bootstrap, as described in Supplementary Materials Appendix D, to estimate the standard errors.
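The box constraints in (8) can be checked for any candidate pair (γ, θ) with a few lines of code. The sketch below is a minimal illustration: the function names are hypothetical, and the numerical values are the true parameters and external estimates from the simulation scenario of Section 5. Functions of this kind could be passed as inequality constraints to a general-purpose constrained optimizer.

```python
import math

def implied_beta(gamma, theta, sigma_hat):
    """Value of beta implied by (6).

    gamma has p + 2 entries (the last one is gamma_{p+1}); theta has p + 1.
    sigma_hat is the plug-in residual standard deviation of B given X.
    """
    g_b = gamma[-1]
    scale = math.sqrt(1.0 + g_b ** 2 * sigma_hat ** 2 / 1.7 ** 2)
    return [(g + g_b * t) / scale for g, t in zip(gamma[:-1], theta)]

def constraint_violations(gamma, theta, sigma_hat, beta_bar, se_bar, d=1.0):
    """Distance of each implied beta_j outside its box (0.0 if inside)."""
    out = []
    for b, b0, s in zip(implied_beta(gamma, theta, sigma_hat), beta_bar, se_bar):
        lo, hi = b0 - d * s, b0 + d * s
        out.append(max(lo - b, 0.0, b - hi))
    return out

beta_bar = [1.50, 2.95, 3.01]   # external point estimates (Section 5)
se_bar = [0.04, 0.09, 0.09]     # external standard errors (Section 5)

# The true simulation parameters satisfy all constraints with d = 1.
ok = constraint_violations([2.0, 3.0, 3.0, 2.0], [0.0, 0.5, 0.5], 0.75,
                           beta_bar, se_bar, d=1.0)

# Setting gamma_{p+1} = 0 while keeping gamma_0 = 2 violates the intercept box.
bad = constraint_violations([2.0, 3.0, 3.0, 0.0], [0.0, 0.5, 0.5], 0.75,
                            beta_bar, se_bar, d=1.0)
```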

3.4.2. Informative Full Bayes

In informative full Bayes, starting with the joint likelihood L(Y|X, B)L(B|X) we translate the constraints in (6) to prior information. The first step is to write down the joint likelihood function with priors on γ, θ, $σ_{2}^{2}$ :

p (γ, θ, σ_{2}^{2} | data) \propto L (Y | X, B, γ) \cdot L (B | X, θ, σ_{2}^{2}) \cdot π (γ, θ, σ_{2}^{2}) = {\prod_{i=1}^{n} \frac{\exp ((\sum_{j = 0}^{p} γ_{j} X_{ij} + γ_{p + 1} B_{i}) Y_{i})}{1 + \exp (\sum_{j=0}^{p} γ_{j} X_{ij} + γ_{p + 1} B_{i})} \cdot \frac{1}{\sqrt{2 π σ_{2}^{2}}} \exp (- \frac{1}{2 σ_{2}^{2}} {(B_{i} - \sum_{j=0}^{p} θ_{j} X_{ij})}^{2})} \cdot π (γ, θ, σ_{2}^{2})

(9)

The logistic regression approximation result (6) suggests that $θ_{j} = \frac{1}{γ_{p + 1}} (β_{j} \sqrt{1 + \frac{γ_{p + 1}^{2} σ_{2}^{2}}{{1.7}^{2}}} - γ_{j})$, j = 0, …, p. We re-parametrize (9) in terms of γ, β and $σ_{2}^{2}$, and include the Jacobian matrix of the transformation, denoted by J, where $| J | = \frac{1}{| γ_{p + 1}^{p + 1} |} {(1 + \frac{γ_{p + 1}^{2} σ_{2}^{2}}{{1.7}^{2}})}^{\frac{p + 1}{2}}$. We use a non-informative inverse-gamma prior for $σ_{2}^{2}$ and independent weakly informative Cauchy priors for γ (Gelman et al., 2008). For the parameters β, we use the constraints as priors:

β_{j} = (γ_{j} + γ_{p + 1} θ_{j}) / {(1 + γ_{p + 1}^{2} σ_{2}^{2} / {1.7}^{2})}^{\frac{1}{2}} \sim N ({\bar{β}}_{j}, {\bar{S}}_{j}^{2}), j = 0, \dots, p

(10)

Then we can rewrite the joint distribution in terms of γ, β, $σ_{2}^{2}$ as $p (γ, β, σ_{2}^{2} | Y, X, B) \propto {\prod_{i = 1}^{n} \frac{\exp ((\sum_{j = 0}^{p} γ_{j} X_{i j} + γ_{p + 1} B_{i}) Y_{i})}{1 + \exp (\sum_{j = 0}^{p} γ_{j} X_{i j} + γ_{p + 1} B_{i})}} \cdot {\prod_{i = 1}^{n} \frac{1}{\sqrt{2 π σ_{2}^{2}}} \exp (- \frac{1}{2 σ_{2}^{2}} {(B_{i} - \frac{β_{0} \sqrt{1 + \frac{γ_{p + 1}^{2} σ_{2}^{2}}{{1.7}^{2}}} - γ_{0}}{γ_{p + 1}} - \sum_{j=1}^{p} \frac{β_{j} \sqrt{1 + \frac{γ_{p + 1}^{2} σ_{2}^{2}}{{1.7}^{2}}} - γ_{j}}{γ_{p + 1}} X_{ij})}^{2})} \cdot π (β) \cdot π (γ) \cdot π (σ_{2}^{2}) \cdot | J |$. Further details of the priors and the implementation of a Metropolis-Hastings algorithm are given in the Supplementary Materials Appendix B. We note that in the algorithm the full conditional distributions of γ₀, …, γ_p+1 and $σ_{2}^{2}$ do not have closed form expressions. Because of the non-linear relationship between the parameters, the Metropolis-Hastings algorithm cannot be performed efficiently, and it is therefore computationally slow to obtain uncorrelated draws from the posterior distributions.
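The re-parametrization from θ to β and the associated Jacobian term can be sketched as follows. This is only an illustration of the algebra above: the function names are hypothetical, and the code does not implement the Metropolis-Hastings sampler itself.

```python
import math

def scale_factor(gamma_b, sigma2_sq):
    """sqrt(1 + gamma_{p+1}^2 * sigma_2^2 / 1.7^2)."""
    return math.sqrt(1.0 + gamma_b ** 2 * sigma2_sq / 1.7 ** 2)

def theta_from_beta(beta, gamma, sigma2_sq):
    """Invert (6): theta_j = (beta_j * c - gamma_j) / gamma_{p+1}, j = 0..p.

    gamma has p + 2 entries (last is gamma_{p+1}); beta has p + 1 entries.
    Requires gamma_{p+1} != 0.
    """
    g_b = gamma[-1]
    c = scale_factor(g_b, sigma2_sq)
    return [(b * c - g) / g_b for b, g in zip(beta, gamma[:-1])]

def log_abs_jacobian(gamma, sigma2_sq):
    """log|J| = (p + 1) * (log c - log|gamma_{p+1}|)."""
    g_b = gamma[-1]
    n_coef = len(gamma) - 1  # p + 1 transformed coordinates
    return n_coef * (math.log(scale_factor(g_b, sigma2_sq)) - math.log(abs(g_b)))

# Round trip at the simulation values of Section 5 (sigma_2^2 = 0.75^2).
gamma = [2.0, 3.0, 3.0, 2.0]
theta_true = [0.0, 0.5, 0.5]
sigma2_sq = 0.75 ** 2
c = scale_factor(gamma[-1], sigma2_sq)
beta = [(g + gamma[-1] * t) / c for g, t in zip(gamma[:-1], theta_true)]
theta_back = theta_from_beta(beta, gamma, sigma2_sq)
```

Working with log|J| rather than |J| is a common choice when the term is added to a log-posterior inside a sampler.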

3.4.3. Transformation Approach

As the informative full Bayes is computationally intensive, we propose an approximate Bayesian approach that can produce draws that fall into the constrained space but reduces the computational burden of the informative Bayes method. The motivation for this stems from the Bayesian transformation approach incorporating monotone or uni-modal constraints in posterior inference as proposed in Gunn and Dunson (2005), which we modify to the scenario of a regression model with external coefficient information.

Suppose the draws from the non-informative standard Bayes method described in Section 3.3.2 are γ and θ, with corresponding posterior covariance matrices Σ_γ, Σ_θ. We extract the posterior variances from Σ_γ, Σ_θ and denote them by $s_{γ_{0}}^{2}, \dots, s_{γ_{p}}^{2}, s_{γ_{p + 1}}^{2}, s_{θ_{0}}^{2}, \dots, s_{θ_{p}}^{2}$.

The OLS residual variance from fitting E(B|X) is ${\hat{σ}}_{2}^{2}$ . Then a constrained draw γ^★,θ^★ is obtained from an unconstrained draw by solving the optimization problem:

\begin{array}{l} \min_{γ^{⋆}, θ^{⋆}} {d_{NED}^{2} (γ, γ^{⋆}) + d_{NED}^{2} (θ, θ^{⋆})} = \sum_{j = 0}^{p + 1} \frac{{(γ_{j} - γ_{j}^{⋆})}^{2}}{s_{γ_{j}}^{2}} + \sum_{k = 0}^{p} \frac{{(θ_{k} - θ_{k}^{⋆})}^{2}}{s_{θ_{k}}^{2}} \\ \text{s.t. } (γ_{j}^{⋆} + γ_{p + 1}^{⋆} θ_{j}^{⋆}) / {(1 + γ_{p + 1}^{⋆ 2} {\hat{σ}}_{2}^{2} / {1.7}^{2})}^{\frac{1}{2}} \in [{\bar{β}}_{j} - d {\bar{S}}_{j}, {\bar{β}}_{j} + d {\bar{S}}_{j}], j = 0, \dots, p \end{array}

(11)

where d_NED stands for normalized Euclidean distance. For the transformation of a single draw, we generate d from a half normal distribution: d ~ |N(0, 1)|. The intuition behind this transformation procedure is that it will produce values γ^★, θ^★ subject to the box constraints that are closest in normalized distance to the unconstrained values γ, θ.

The transformation is computationally efficient since we have a fast algorithm to solve the optimization problem in (11). We fix $γ_{p + 1}^{⋆}$ and divide the minimization problem (11) into p + 1 two-dimensional constrained minimization problems whose solutions can be expressed as functions of $γ_{p + 1}^{⋆}$. After that, the whole minimization problem is reduced to an easy-to-solve one-dimensional optimization problem in $γ_{p + 1}^{⋆}$. The constrained draws produced by the transformation approach are not draws from the true posterior distribution; however, we found in a limited number of simulations that they are reasonable approximations that can be generated much faster.
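One possible reading of this reduction is sketched below. For a fixed value of $γ_{p + 1}^{⋆}$, each constraint in (11) is a linear band in $(γ_{j}^{⋆}, θ_{j}^{⋆})$, so the closest point in normalized distance has a closed form; the outer one-dimensional problem is then handled here by a plain grid search. The closed-form projection, the grid search and all function names are assumptions of this sketch and are not taken from the authors' implementation.

```python
import math

def project_pair(g_j, t_j, s2_g, s2_t, g_b, lo, hi):
    """Closest point to (g_j, t_j), in normalized distance, with
    lo <= g + g_b * t <= hi. Returns (g_star, t_star, squared distance)."""
    v = g_j + g_b * t_j
    target = min(max(v, lo), hi)          # nearest admissible value of g + g_b * t
    denom = s2_g + g_b ** 2 * s2_t
    shift = (v - target) / denom
    return g_j - shift * s2_g, t_j - shift * g_b * s2_t, (v - target) ** 2 / denom

def transform_draw(gamma, theta, s2_gamma, s2_theta, sigma_hat,
                   beta_bar, se_bar, d, grid):
    """Map an unconstrained draw (gamma, theta) to a constrained one.

    gamma, s2_gamma have p + 2 entries; theta, s2_theta have p + 1 entries.
    grid holds the candidate values for gamma*_{p+1}.
    """
    best = None
    for g_b in grid:
        c = math.sqrt(1.0 + g_b ** 2 * sigma_hat ** 2 / 1.7 ** 2)
        cost = (gamma[-1] - g_b) ** 2 / s2_gamma[-1]
        g_star, t_star = [], []
        for j in range(len(theta)):
            # Multiplying the box by c > 0 makes the constraint linear.
            lo = c * (beta_bar[j] - d * se_bar[j])
            hi = c * (beta_bar[j] + d * se_bar[j])
            g, t, cj = project_pair(gamma[j], theta[j], s2_gamma[j],
                                    s2_theta[j], g_b, lo, hi)
            g_star.append(g)
            t_star.append(t)
            cost += cj
        if best is None or cost < best[0]:
            best = (cost, g_star + [g_b], t_star)
    return best[1], best[2], best[0]

# Illustrative inputs: external estimates from Section 5, made-up draw and variances.
beta_bar, se_bar = [1.50, 2.95, 3.01], [0.04, 0.09, 0.09]
grid = [k / 100.0 for k in range(-300, 501)]
g_star, t_star, cost = transform_draw([2.5, 3.5, 2.0, 1.5], [0.1, 0.4, 0.6],
                                      [0.25] * 4, [0.01] * 3, 0.75,
                                      beta_bar, se_bar, 1.0, grid)
```

In the procedure described above, d would be drawn from a half normal distribution for each draw; it is passed as a fixed argument here to keep the example deterministic.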

3.4.4. Constrained Approach of Chatterjee et al. (2016)

For comparison we include a constrained maximum likelihood method that uses the integrated score equations (Chatterjee et al., 2016). The method assumes that the model for Y|X, B is correct. It does not make any explicit assumptions about the distribution of B|X, but it does require the distribution of X to be the same in the current sample as in the data that were used to develop the model for Y|X. The method uses only the point estimates $\bar{β}$ and does not take into account the standard errors of those estimates.

3.4.5. Logistic Regression Plug-in Method

We also included a simple method which consists of obtaining predicted probabilities by fitting a logistic regression model with two covariates, B and $\log ({\bar{p}}_{i} / (1 - {\bar{p}}_{i}))$, where ${\bar{p}}_{i}$ is the prediction from the established model for Y|X. It is easy to show that this method gives a final model for Y|X, B that has a logistic link function and is linear in X and B, and with some algebra the estimates of γ can be obtained.
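The algebra can be written out as follows, assuming that the plug-in fit includes an intercept. The symbols α₀, α₁ and α₂ for the fitted plug-in coefficients are introduced here only for illustration. Since $\log ({\bar{p}}_{i} / (1 - {\bar{p}}_{i})) = X_{i} \bar{β}$ under the established model (1):

```latex
\begin{aligned}
\operatorname{logit}(\Pr(Y_i = 1 | X_i, B_i))
  &= α_0 + α_1 \left( \bar{β}_0 + \sum_{j=1}^{p} \bar{β}_j X_{ij} \right) + α_2 B_i \\
  &= (α_0 + α_1 \bar{β}_0) + \sum_{j=1}^{p} α_1 \bar{β}_j X_{ij} + α_2 B_i ,
\end{aligned}
```

so that, matching terms with model (2), $γ_{0} = α_{0} + α_{1} {\bar{β}}_{0}$, $γ_{j} = α_{1} {\bar{β}}_{j}$ for j = 1, …, p, and $γ_{p + 1} = α_{2}$.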

4. Statistical Approaches when B is Binary

4.1. The Approximate Relationship Equation When B is Binary

If B is a binary variable, the logistic regression approximation in Section 3 does not hold and the approximate relationship in equation (6) is not applicable. However, by Bayes theorem, there is a relationship equation connecting Pr(Y = 1|X), Pr(Y = 1|X, B) and f(B|X, Y) (Grill et al., 2015; Satten and Kupper, 1993):

\frac{Pr (Y = 1 | X, B)}{Pr (Y = 0 | X, B)} = \frac{f (B | X, Y = 1)}{f (B | X, Y = 0)} \cdot \frac{Pr (Y = 1 | X)}{Pr (Y = 0 | X)}

(12)

Thus, when B is binary, we need to define a model for B|X, Y instead of a model for B|X. Assume $logit (Pr (B = 1 | X, Y)) = \sum_{j=0}^{p} ϕ_{j} X_{j} + ϕ_{p + 1} Y$ . Some algebraic simplifications of equation (12) followed by a Taylor series expansion (as shown in Supplementary Material Appendix C) result in an approximate relationship equation: $β_{0} + β_{1} X_{1} + \dots + β_{p} X_{p} \approx γ_{0} + \frac{1}{2} ϕ_{p + 1} + \frac{1}{4} ϕ_{0} ϕ_{p + 1} + \frac{1}{8} ϕ_{p + 1}^{2} + \sum_{j=1}^{p} (γ_{j} + \frac{1}{4} ϕ_{j} ϕ_{p + 1}) X_{j} + (γ_{p + 1} - ϕ_{p + 1}) B$ . Then the approximate relationship between γ, ϕ and β is:

\begin{cases} β_{0} \approx γ_{0} + \frac{1}{2} ϕ_{p + 1} + \frac{1}{4} ϕ_{0} ϕ_{p + 1} + \frac{1}{8} ϕ_{p + 1}^{2} \\ β_{j} \approx γ_{j} + \frac{1}{4} ϕ_{j} ϕ_{p + 1}, j = 1, \dots, p \\ γ_{p + 1} = ϕ_{p + 1} \end{cases}

(13)
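A small numerical check of (13) is sketched below. Under the assumed logistic model for B|X, Y, the likelihood ratio in (12) contributes the term log(1 + exp(u + ϕ_{p+1})) − log(1 + exp(u)) with u = Σ ϕ_j X_j, and a second-order expansion of log(1 + exp(v)) about v = 0 reproduces the terms in (13). The function names and the numerical values are illustrative assumptions.

```python
import math

def beta_from_gamma_phi(gamma, phi):
    """Relationship (13): approximate beta_0..beta_p from gamma and phi.

    gamma and phi both have p + 2 entries; the last entries are
    gamma_{p+1} and phi_{p+1}, which (13) requires to be equal.
    """
    phi_y = phi[-1]
    beta0 = gamma[0] + 0.5 * phi_y + 0.25 * phi[0] * phi_y + 0.125 * phi_y ** 2
    return [beta0] + [g + 0.25 * f * phi_y
                      for g, f in zip(gamma[1:-1], phi[1:-1])]

def exact_offset(u, phi_y):
    """log(1 + e^{u + phi_y}) - log(1 + e^u): the term that is expanded."""
    return math.log1p(math.exp(u + phi_y)) - math.log1p(math.exp(u))

def taylor_offset(u, phi_y):
    """Second-order Taylor approximation of exact_offset around zero."""
    return 0.5 * phi_y + 0.25 * u * phi_y + 0.125 * phi_y ** 2

# Made-up values with p = 1 and gamma_{p+1} = phi_{p+1} = 0.4.
gamma = [1.0, 2.0, 0.4]
phi = [0.2, -0.4, 0.4]
beta = beta_from_gamma_phi(gamma, phi)

# The expansion is accurate when u and phi_{p+1} are small in magnitude.
gap_small = abs(exact_offset(0.2, 0.3) - taylor_offset(0.2, 0.3))
gap_large = abs(exact_offset(3.0, 2.0) - taylor_offset(3.0, 2.0))
```

The comparison of `gap_small` and `gap_large` suggests that the quality of the approximation should be checked when the linear predictor for B or the coefficient ϕ_{p+1} is large.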

4.2. Unconstrained and Constrained Solutions

The two unconstrained solutions, direct regression and standard Bayes, can be performed in the same way as described in Section 3 regardless of the distribution of B.

To develop a constrained solution, we need to first define the likelihood function L(B|X). It can be shown that Pr(B_i = 1|X_i) is a weighted average of $\frac{\exp (\sum_{j = 0}^{p} X_{i j} ϕ_{j})}{1 + \exp (\sum_{j = 0}^{p} X_{i j} ϕ_{j})}$ and $\frac{\exp (\sum_{j = 0}^{p} X_{i j} ϕ_{j} + ϕ_{p + 1})}{1 + \exp (\sum_{j = 0}^{p} X_{i j} ϕ_{j} + ϕ_{p + 1})}$ . We use estimates of the weights given by $w_{i, \bar{β}} = \frac{1}{1 + \exp (X_{i} \bar{β})}$ and $1 - w_{i, \bar{β}} = \frac{\exp (X_{i} \bar{β})}{1 + \exp (X_{i} \bar{β})}$ where $\bar{β}$ are the values for β from the established model. Then L(B|X) can be written as: $L (B | X) = \prod_{i=1}^{n} L (B_{i} | X_{i}, ϕ) = \prod_{i=1}^{n} {(\frac{\exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j})}{1 + \exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j})} \cdot w_{i, \bar{β}} + \frac{\exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j} + ϕ_{p + 1})}{1 + \exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j} + ϕ_{p + 1})} \cdot (1 - w_{i, \bar{β}}))}^{B_{i}} \cdot {(\frac{1}{1 + \exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j})} \cdot w_{i, \bar{β}} + \frac{1}{1 + \exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j} + ϕ_{p + 1})} \cdot (1 - w_{i, \bar{β}}))}^{(1 - B_{i})}$ .
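The mixture form of L(B|X) can be written compactly in code. The sketch below is a minimal illustration with hypothetical function names and made-up numerical values; it mixes the two logistic components over Y using the weights computed from the established model.

```python
import math

def expit(v):
    """Logistic function 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def prob_b_given_x(x, phi, beta_bar):
    """Pr(B = 1 | X) as a weighted average over Y = 0 and Y = 1.

    x:        covariate vector including the leading 1 for the intercept
    phi:      p + 2 coefficients of the model for B | X, Y (last one is for Y)
    beta_bar: p + 1 coefficients of the established model for Y | X
    """
    lin_phi = sum(f * xi for f, xi in zip(phi[:-1], x))
    lin_beta = sum(b * xi for b, xi in zip(beta_bar, x))
    w = 1.0 / (1.0 + math.exp(lin_beta))   # weight w_i, i.e. Pr(Y = 0 | X)
    return w * expit(lin_phi) + (1.0 - w) * expit(lin_phi + phi[-1])

def log_lik_b(b_obs, x_rows, phi, beta_bar):
    """Log of L(B | X) over all subjects."""
    total = 0.0
    for b, x in zip(b_obs, x_rows):
        p = prob_b_given_x(x, phi, beta_bar)
        total += math.log(p) if b == 1 else math.log(1.0 - p)
    return total

# Made-up example with one covariate (p = 1).
x_rows = [[1.0, 0.5], [1.0, -0.3], [1.0, 0.1]]
b_obs = [1, 0, 1]
beta_bar = [-1.0, 0.8]
p_no_y_effect = prob_b_given_x(x_rows[0], [0.3, -0.2, 0.0], beta_bar)
p_mixed = prob_b_given_x(x_rows[0], [0.3, -0.2, 1.0], beta_bar)
ll = log_lik_b(b_obs, x_rows, [0.3, -0.2, 1.0], beta_bar)
```

When ϕ_{p+1} = 0 the two components coincide and the weights have no effect, which gives a simple sanity check for an implementation.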

4.2.1. Constrained Maximum Likelihood

The constrained ML estimation maximizes the joint log-likelihood, that is, it minimizes the negative logarithm of L(Y|X, B)L(B|X), subject to a set of constraints on γ, ϕ, namely:

\begin{array}{l} \min_{γ, ϕ} {\sum_{i = 1}^{n} [- Y_{i} (\sum_{j = 0}^{p} γ_{j} X_{ij} + γ_{p + 1} B_{i}) + \log (1 + \exp (\sum_{j = 0}^{p} γ_{j} X_{ij} + γ_{p + 1} B_{i}))] - \sum_{i=1}^{n} [B_{i} \log (\frac{\exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j}) w_{i, \bar{β}}}{1 + \exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j})} + \frac{\exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j} + ϕ_{p + 1}) (1 - w_{i, \bar{β}})}{1 + \exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j} + ϕ_{p + 1})}) + (1 - B_{i}) \log (\frac{w_{i, \bar{β}}}{1 + \exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j})} + \frac{(1 - w_{i, \bar{β}})}{1 + \exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j} + ϕ_{p + 1})})]} \\ \text{s.t. } \begin{cases} γ_{0} + \frac{1}{2} ϕ_{p + 1} + \frac{1}{4} ϕ_{0} ϕ_{p + 1} + \frac{1}{8} ϕ_{p + 1}^{2} \in [{\bar{β}}_{0} - d {\bar{S}}_{0}, {\bar{β}}_{0} + d {\bar{S}}_{0}] \\ γ_{j} + \frac{1}{4} ϕ_{j} ϕ_{p + 1} \in [{\bar{β}}_{j} - d {\bar{S}}_{j}, {\bar{β}}_{j} + d {\bar{S}}_{j}], j = 1, \dots, p \\ γ_{p + 1} = ϕ_{p + 1} \end{cases} \end{array}

(14)

We also consider a modification that adds the Firth penalty term.

4.2.2. Informative Full Bayes

Analogous to the derivation of the informative full Bayes solution described in Section 3, we first write down the product of L(Y|X, B) and L(B|X) with priors.

p (γ, ϕ | data) \propto {\prod_{i = 1}^{n} \frac{\exp ((\sum_{j = 0}^{p} γ_{j} X_{ij} + γ_{p + 1} B_{i}) Y_{i})}{1 + \exp (\sum_{j = 0}^{p} γ_{j} X_{ij} + γ_{p + 1} B_{i})} \cdot {(\frac{\exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j}) w_{i, \bar{β}}}{1 + \exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j})} + \frac{\exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j} + ϕ_{p + 1}) (1 - w_{i, \bar{β}})}{1 + \exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j} + ϕ_{p + 1})})}^{B_{i}} \cdot {(\frac{w_{i, \bar{β}}}{1 + \exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j})} + \frac{(1 - w_{i, \bar{β}})}{1 + \exp (\sum_{j = 0}^{p} X_{ij} ϕ_{j} + ϕ_{p + 1})})}^{(1 - B_{i})}} \cdot π (γ, ϕ)

(15)

We can re-parametrize (15) in terms of γ, β, and include a Jacobian corresponding to this transformation. We denote the Jacobian matrix by M, where $| M | = {| \frac{4}{γ_{p + 1}} |}^{p + 1}$. We specify independent weakly informative Cauchy priors for γ and use the constraints directly as priors for β. Then, similar to Section 3.4.2, we can rewrite the joint distribution in terms of γ, β as

p (γ, β | Y, X, B) \propto {\prod_{i=1}^{n} \frac{\exp ((\sum_{j=0}^{p} γ_{j} X_{ij} + γ_{p + 1} B_{i}) Y_{i})}{1 + \exp (\sum_{j=0}^{p} γ_{j} X_{ij} + γ_{p + 1} B_{i})} \cdot {[\frac{w_{i, \bar{β}}}{1 + \exp (- (\frac{4 β_{0} - 4 γ_{0} - 2 γ_{p + 1} - \frac{1}{2} γ_{p + 1}^{2}}{γ_{p + 1}} + \sum_{j = 1}^{p} X_{i j} \frac{4 (β_{j} - γ_{j})}{γ_{p + 1}}))} + \frac{1 - w_{i, \bar{β}}}{1 + \exp (- (\frac{4 β_{0} - 4 γ_{0} - 2 γ_{p + 1} - \frac{1}{2} γ_{p + 1}^{2}}{γ_{p + 1}} + \sum_{j = 1}^{p} X_{i j} \frac{4 (β_{j} - γ_{j})}{γ_{p + 1}} + γ_{p + 1}))}]}^{B_{i}} \cdot {[\frac{w_{i, \bar{β}}}{1 + \exp (\frac{4 β_{0} - 4 γ_{0} - 2 γ_{p + 1} - \frac{1}{2} γ_{p + 1}^{2}}{γ_{p + 1}} + \sum_{j = 1}^{p} X_{i j} \frac{4 (β_{j} - γ_{j})}{γ_{p + 1}})} + \frac{1 - w_{i, \bar{β}}}{1 + \exp (\frac{4 β_{0} - 4 γ_{0} - 2 γ_{p + 1} - \frac{1}{2} γ_{p + 1}^{2}}{γ_{p + 1}} + \sum_{j = 1}^{p} X_{i j} \frac{4 (β_{j} - γ_{j})}{γ_{p + 1}} + γ_{p + 1})}]}^{(1 - B_{i})}} \cdot π (β) \cdot π (γ) \cdot | M |

The full conditional distributions of γ₀, …, γ_p+1 and β₀, …, β_p do not have closed form expressions and a Metropolis-Hastings algorithm is used.

4.2.3. Transformation Approach

Suppose the raw draws from the non-informative Bayes method for Y|X, B are γ and the raw draws from the non-informative Bayes method for B|X, Y are ϕ. The posterior variances are $s_{γ_{j}}^{2}$, j = 0, …, p + 1 and $s_{ϕ_{k}}^{2}$, k = 0, …, p + 1. Then γ^★, ϕ^★ are obtained by solving the following optimization problem:

\begin{array}{l} \min_{γ^{⋆}, ϕ^{⋆}} {d_{NED}^{2} (γ, γ^{⋆}) + d_{NED}^{2} (ϕ, ϕ^{⋆})} = \sum_{j = 0}^{p + 1} \frac{{(γ_{j} - γ_{j}^{⋆})}^{2}}{s_{γ_{j}}^{2}} + \sum_{k = 0}^{p + 1} \frac{{(ϕ_{k} - ϕ_{k}^{⋆})}^{2}}{s_{ϕ_{k}}^{2}} \\ \text{s.t. } \begin{cases} γ_{0}^{⋆} + \frac{1}{2} ϕ_{p + 1}^{⋆} + \frac{1}{4} ϕ_{0}^{⋆} ϕ_{p + 1}^{⋆} + \frac{1}{8} ϕ_{p + 1}^{⋆ 2} \in [{\bar{β}}_{0} - d {\bar{S}}_{0}, {\bar{β}}_{0} + d {\bar{S}}_{0}] \\ γ_{j}^{⋆} + \frac{1}{4} ϕ_{j}^{⋆} ϕ_{p + 1}^{⋆} \in [{\bar{β}}_{j} - d {\bar{S}}_{j}, {\bar{β}}_{j} + d {\bar{S}}_{j}], j = 1, \dots, p \\ γ_{p + 1}^{⋆} = ϕ_{p + 1}^{⋆} \end{cases} \end{array}

(16)

5. Simulation Study

To evaluate the performance of the various approaches we conduct a simulation study. The results for Gaussian B are presented here, and the results for binary B and other scenarios are presented in the Supplementary Materials Appendix E. The simulation scenario has three predicting variables, X₁, X₂ and B and the sample size of each dataset is 55. Five hundred replicate datasets are generated. Y_i is Bernoulli distributed with logit(Pr(Y_i = 1|X_i1, X_i2, B_i)) = 2 + 3X_i1 + 3X_i2 + 2B_i. X_i1, X_i2 are independently and identically distributed on U(−0.75, 0.25) and B_i is simulated as B_i = 0.5X_i1 + 0.5X_i2 +N(0, 0.75²). A logistic regression based on a large dataset of 10000 subjects gives estimates for the model logit(Pr(Y = 1|X)) = β₀+ β₁X₁+ β₂X₂. The estimates and standard errors from this fit are ${\bar{β}}_{0} = 1.50$ , ${\bar{S}}_{0} = 0.04$ , ${\bar{β}}_{1} = 2.95$ , ${\bar{S}}_{1} = 0.09$ , ${\bar{β}}_{2} = 3.01$ , ${\bar{S}}_{2} = 0.09$ .
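For concreteness, one replicate dataset from this scenario could be generated as in the sketch below. The function name, the seed handling and the use of Python's standard random module are illustrative assumptions; they are not a description of the code used for the reported simulations.

```python
import math
import random

def simulate_dataset(n=55, seed=0):
    """Generate one dataset of (y, x1, x2, b) tuples from the Section 5 scenario."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-0.75, 0.25)
        x2 = rng.uniform(-0.75, 0.25)
        # B_i = 0.5 X_i1 + 0.5 X_i2 + N(0, 0.75^2); gauss takes the standard deviation.
        b = 0.5 * x1 + 0.5 * x2 + rng.gauss(0.0, 0.75)
        # logit(Pr(Y_i = 1 | X_i1, X_i2, B_i)) = 2 + 3 X_i1 + 3 X_i2 + 2 B_i
        p = 1.0 / (1.0 + math.exp(-(2.0 + 3.0 * x1 + 3.0 * x2 + 2.0 * b)))
        y = 1 if rng.random() < p else 0
        rows.append((y, x1, x2, b))
    return rows

# Five hundred replicates, each with its own seed for reproducibility.
replicates = [simulate_dataset(n=55, seed=m) for m in range(500)]
```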

We report three evaluation metrics related to estimation accuracy across the 500 replicates: the average of the estimated coefficients, the relative efficiency of the estimated coefficients and the mean squared error (MSE). The average of the estimated coefficients is defined as: ${\bar{γ}}_{j} = \frac{1}{500} \sum_{m = 1}^{500} {\hat{γ}}_{m, j}$ ; the relative efficiency of the estimated coefficients is defined as: $\frac{V ({\hat{γ}}_{j, direct})}{V ({\hat{γ}}_{j, method})}$ , where $V ({\hat{γ}}_{j, direct}) = \frac{1}{500} \sum_{m = 1}^{500} {({\hat{γ}}_{m, j} - {\bar{γ}}_{j})}^{2}$ is computed from the direct regression estimates and $V ({\hat{γ}}_{j, method})$ is defined analogously for each method; the MSE of the estimated coefficients is defined as: $\frac{1}{500} \sum_{m = 1}^{500} {({\hat{γ}}_{m, j} - γ_{j})}^{2}$ , j = 1, …, p + 1.
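A minimal sketch of these three summaries for a single coefficient is given below. The function name `summarize_estimates` and the toy inputs are assumptions for illustration; the divisor is the number of replicates, as in the definitions above.

```python
def summarize_estimates(estimates, direct_estimates, true_value):
    """Mean, relative efficiency w.r.t. direct regression, and MSE of one coefficient.

    estimates: estimates of the coefficient from one method, one per replicate.
    direct_estimates: estimates of the same coefficient from direct regression.
    true_value: the true value of the coefficient.
    """
    m = len(estimates)
    mean = sum(estimates) / m
    variance = sum((e - mean) ** 2 for e in estimates) / m

    m_direct = len(direct_estimates)
    mean_direct = sum(direct_estimates) / m_direct
    variance_direct = sum((e - mean_direct) ** 2 for e in direct_estimates) / m_direct

    mse = sum((e - true_value) ** 2 for e in estimates) / m
    return {"mean": mean,
            "relative_efficiency": variance_direct / variance,
            "mse": mse}


# Toy inputs with two replicates (hypothetical values).
summary = summarize_estimates([2.0, 4.0], [1.0, 5.0], true_value=3.0)
```

A relative efficiency above 1 indicates that the method's estimates vary less across replicates than those of direct regression.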

The predictive ability of logistic prediction models can be assessed using a variety of methods and metrics on a validation dataset (Steyerberg et al., 2010). In this simulation study, we assess the predictive ability of the model on a validation dataset of size 800 using the scaled Brier score $(\frac{\sum_{i = 1}^{n} {(Y_{i} - {\hat{p}}_{i})}^{2}}{\sum_{i = 1}^{n} {(Y_{i} - \bar{Y})}^{2}})$ . We assess the variability of the predictions on the validation dataset using the standard deviation of the predicted probabilities ${((1 / 799) \sum {({\hat{p}}_{i} - \bar{\hat{p}})}^{2})}^{1 / 2}$ . The discriminatory performance of the model is assessed using the area under the ROC curve (AUC). These three performance measures are also calculated for the model based on $\bar{β}$ which does not use B, and for the best achievable model that uses the true values of the γ’s.
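The three validation measures can be sketched in plain Python as follows. The function names and the four-observation toy data are assumptions for illustration. The AUC is computed here by direct pairwise comparison of cases and non-cases, with ties counted as one half; this is one standard way to obtain it and is not necessarily how the authors computed it.

```python
import math


def scaled_brier(y, p):
    """Sum of squared prediction errors divided by that of the constant prediction mean(y)."""
    y_bar = sum(y) / len(y)
    numerator = sum((yi - pi) ** 2 for yi, pi in zip(y, p))
    denominator = sum((yi - y_bar) ** 2 for yi in y)
    return numerator / denominator


def sd_predicted(p):
    """Standard deviation of the predicted probabilities, with divisor n - 1."""
    n = len(p)
    p_bar = sum(p) / n
    return math.sqrt(sum((pi - p_bar) ** 2 for pi in p) / (n - 1))


def auc(y, p):
    """Probability that a random case is ranked above a random non-case."""
    cases = [pi for yi, pi in zip(y, p) if yi == 1]
    controls = [pi for yi, pi in zip(y, p) if yi == 0]
    score = 0.0
    for a in cases:
        for b in controls:
            if a > b:
                score += 1.0
            elif a == b:
                score += 0.5
    return score / (len(cases) * len(controls))


# Toy validation data (hypothetical values).
y_val = [0, 0, 1, 1]
p_val = [0.1, 0.4, 0.35, 0.8]
brier_value = scaled_brier(y_val, p_val)
sd_value = sd_predicted(p_val)
auc_value = auc(y_val, p_val)
```

With the scaled Brier score as defined above, smaller values indicate better predictions.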

Table 1 presents the results. In this setting what is achievable with the true model is noticeably better, as measured by Brier score and AUC, than using the established model, and the established model does not give the correct standard deviation of the predicted probabilities. There is bias in ${\hat{γ}}_{1}$ , ${\hat{γ}}_{2}$ and ${\hat{γ}}_{3}$ for the direct regression approach, which is reduced by including the Firth penalty. We find that the constrained methods greatly improve the estimation efficiency of the coefficients γ₁ and γ₂ of X. The constrained methods can reduce the MSE of the estimates of γ₁ and γ₂ by 70% or more, and give 4 or 5 times more efficient point estimates of γ₁ and γ₂. In contrast, these constrained solutions can only reduce the MSE of γ₃ by 10% to 40% and generally have similar efficiency to that of direct regression plus Firth. The new methods give similar variability of the predicted probabilities compared to the true model. All the methods that use the external information give similar Brier scores and AUCs which are very similar to the best that can be achieved. They are all better than not using B at all, and slightly better than the methods that don’t use the external information. In terms of computational efficiency, the informative prior Bayesian approach and the transformation approach require more time than the other methods; however, the transformation approach takes only about 10% of the time of the informative full Bayes approach (888.2 vs 9097.6 in Table 1).

Table 1.

Simulation results for Gaussian B: for each method, we report mean (relative efficiency w.r.t. direct regression), MSE, average Brier score, average AUC, average $\hat{p}$ (SD) and computing time for 500 datasets of size 55

Method	${\hat{γ}}_{1}$	${\hat{γ}}_{2}$	${\hat{γ}}_{3}$	Scaled Brier Score	AUC	$\hat{p}$ mean(SD)	Time
True value	3	3	2	0.605	0.864	0.49 (0.333)	-
Established model using known $\bar{β}$	-	-	-	0.796	0.761	0.51 (0.239)	-
Direct regression	3.37 (1)	3.40 (1)	2.35 (1)	0.661	0.852	0.49 (0.344)	1.3
MSE	3.36	3.48	0.96
Direct regression + Firth	2.89 (1.55)	2.92 (1.52)	1.99 (1.69)	0.651	0.852	0.49 (0.324)	2.4
MSE	2.10	2.18	0.49
Non-informative Bayes	2.72 (1.79)	2.75 (1.78)	2.04 (1.77)	0.647	0.852	0.49 (0.307)	3.6
MSE	1.88	1.93	0.47
Constrained ML	3.08 (3.60)	3.17 (3.91)	2.30 (1.11)	0.628	0.857	0.49 (0.313)	44.9
MSE	0.90	0.88	0.84
Constrained ML + Firth	2.88 (6.01)	2.97 (6.32)	1.96 (1.85)	0.622	0.857	0.49 (0.303)	78.2
MSE	0.55	0.53	0.45
Informative full Bayes	2.87 (4.93)	2.98 (5.15)	2.30 (1.33)	0.624	0.857	0.49 (0.301)	9097.6
MSE	0.64	0.67	0.72
Transformation	2.90 (6.80)	3.00 (6.94)	1.93 (1.75)	0.622	0.857	0.49 (0.298)	888.2
MSE	0.48	0.48	0.48
Chatterjee et al	3.17 (2.91)	3.29 (3.06)	2.34 (1.01)	0.631	0.859	0.49 (0.342)	43.2
MSE	1.13	1.17	0.94
Simple logistic ( $\bar{p}$ , B)	3.27 (1.64)	3.34 (1.62)	2.28 (1.11)	0.644	0.858	0.49 (0.339)	0.9
MSE	2.04	2.16	0.84


The results for the Chatterjee et al. (2016) method show some similarities to the other methods that use the external information, but also show some differences. The method shows a similar amount of bias or even slightly greater bias than the other methods. The method does result in some gain in efficiency in the point estimates, and also smaller MSE, compared to direct regression, but not as much gain as for the other new methods. For ${\hat{γ}}_{3}$ there is no gain in efficiency compared to direct regression, and even a loss in efficiency compared to direct regression with the Firth correction. The method does tend to give slightly more variability in the predicted probabilities than the other methods. The Chatterjee et al. (2016) method has AUC and Brier score values comparable to those of the other methods that use external information.

The simple logistic regression plug-in method is found to have better performance than direct regression, but not as good performance as the more sophisticated approaches for using the external information. It also can result in biased estimates of the γ’s.

The other results presented in the Supplementary Materials give similar conclusions.

6. Application to the Prostate Cancer Data

We demonstrate our methodology by enhancing the Prostate Cancer Prevention Trial Risk Calculator for high-grade prostate cancer. Using the data from Tomlins et al. (2015) we will illustrate the methods described in this paper to develop a logistic model that includes all the PCPThg variables and PCA3. We estimate the new model from the training dataset of 679 men, incorporating the known coefficients and their standard errors from the PCPThg calculator. After a transformation (log₂(PCA3 + 1)) the distribution of PCA3 is roughly normally distributed in both cohorts and thus the approximate relationship equations (6) are applicable. The distribution of T2:ERG looks like a truncated normal whose value is bounded below at zero, with many observations equal to zero. We dichotomized T2:ERG by splitting at the median and develop a logistic model that includes PCPThg variables and dichotomized T2:ERG. The approximate relationship equations (13) would be appropriate in this case.

These two expanded PCPThg models will be estimated by both the unconstrained methods and the constrained methods described in Section 3 and Section 4. For comparing coefficient estimation across different methods, we report the estimated coefficients and their standard errors calculated from the training dataset. For comparing prediction power, we calculate the Brier Score and the AUC based on the validation dataset. We also present the original PCPThg model and the expanded model developed by Tomlins et al. (2015). We give the calibration plots for the original PCPThg model, the expanded model by Tomlins et al. (2015), the expanded PCPThg model estimated without constraints (direct regression) and the expanded PCPThg model estimated with constraints (transformation approach). The calibration plot contains the predicted and the observed risk of high-grade cancer in 10 groups, which are defined by sorting the predicted probabilities from lowest to highest and then separating them into 10 groups of approximately equal size. For each group, the expected number of events is the sum of the predicted probabilities in the group. Perfect predictions should be on the 45° line.
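The grouping behind the calibration plot can be sketched as follows. The function name `calibration_groups`, the returned field names and the toy data are assumptions for illustration; plotting is omitted.

```python
def calibration_groups(y, p, n_groups=10):
    """Sort by predicted probability, split into groups of roughly equal size,
    and compare expected and observed numbers of events in each group."""
    order = sorted(range(len(p)), key=lambda i: p[i])
    n = len(order)
    groups = []
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not idx:
            continue  # can only happen when n < n_groups
        expected = sum(p[i] for i in idx)   # sum of predicted probabilities
        observed = sum(y[i] for i in idx)   # number of observed events
        groups.append({
            "n": len(idx),
            "expected_events": expected,
            "observed_events": observed,
            "mean_predicted": expected / len(idx),
            "observed_rate": observed / len(idx),
        })
    return groups


# Toy data: 20 subjects with increasing predicted risk (hypothetical values).
p_toy = [i / 20 for i in range(20)]
y_toy = [0] * 10 + [1] * 10
groups = calibration_groups(y_toy, p_toy)
```

Plotting `observed_rate` against `mean_predicted` for each group gives the calibration plot; points on the 45° line correspond to perfect calibration.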

Table 2 presents the expanded PCPThg model incorporating these two biomarkers fitted to the training dataset. For the expanded PCPThg model incorporating PCA3 score, comparing the standard errors across methods shows that the constrained methods can reduce the standard errors of the regression coefficients relative to direct regression. For example, the informative full Bayes solution can substantially reduce the standard errors in parameters of variables PSA (0.08 vs 0.19), age (0.008 vs 0.013), DRE findings (0.17 vs 0.27), prior biopsy history (0.16 vs 0.28) and race (0.23 vs 0.31). The constrained ML with Firth penalty can reduce the standard errors of the parameters of variables PSA, age, prior biopsy history and race by at least 50%.

Table 2.

Expanded PCPThg model: for each method, point estimate (standard error) from the training dataset, and the Brier score, the AUC and the mean and SD of predicted probabilities from the validation dataset. The sample size of the training dataset is 679. The sample size of the validation dataset is 1218.

Model	PSA	Age	DRE findings	Prior biopsy history	Race		Scaled Brier Score	AUC	$\hat{p}$ mean(SD)
Original PCPThg	1.29 (0.09)	0.031 (0.012)	1.00 (0.17)	−0.36 (0.18)	0.96 (0.27)	-	0.933	0.707	0.14 (0.132)
Estimated PCPThg	1.06 (0.18)	0.033 (0.012)	1.15 (0.26)	−1.44 (0.27)	0.44 (0.29)	-	0.975	0.716	0.27 (0.174)
Expanded model with PCA3 score						PCA3
PCPThg score+PCA3	-	-	-	-	-	-	0.950	0.752	0.27 (0.201)
Direct regression	1.00 (0.19)	0.009 (0.013)	1.07 (0.27)	−1.30 (0.28)	0.04 (0.31)	0.56 (0.08)	0.950	0.767	0.28 (0.221)
Direct regression + Firth	0.97 (0.19)	0.009 (0.013)	1.06 (0.27)	−1.27 (0.27)	0.05 (0.31)	0.56 (0.08)	0.953	0.767	0.28 (0.219)
Non-informative Bayes	0.98 (0.18)	0.009 (0.013)	1.05 (0.27)	−1.27 (0.27)	0.04 (0.30)	0.56 (0.08)	0.950	0.767	0.28 (0.218)
Constrained ML	1.20 (0.09)	0.010 (0.007)	1.08 (0.14)	−0.55 (0.13)	0.30 (0.19)	0.59 (0.08)	0.948	0.766	0.27 (0.225)
Constrained ML + Firth	1.19 (0.09)	0.012 (0.006)	1.08 (0.14)	−0.54 (0.13)	0.47 (0.11)	0.53 (0.07)	0.947	0.764	0.27 (0.218)
Informative full Bayes	1.23 (0.10)	0.009 (0.008)	0.99 (0.17)	−0.73 (0.17)	0.26 (0.22)	0.60 (0.08)	0.946	0.767	0.27 (0.222)
Transformation	1.23 (0.07)	0.008 (0.009)	0.96 (0.14)	−0.50 (0.13)	0.41 (0.19)	0.55 (0.08)	0.883	0.765	0.22 (0.191)
Chatterjee et al	1.22 (0.08)	0.007 (0.005)	0.86 (0.10)	−0.20 (0.08)	0.58 (0.11)	0.56 (0.10)	0.888	0.759	0.15 (0.168)
Simple logistic ( $\bar{p}$ ,B)	0.82 (0.18)	0.023 (0.000)	0.64 (0.11)	−0.23 (0.01)	0.61 (0.10)	0.55 (0.08)	0.940	0.759	0.27 (0.204)
Expanded model with binary T2:ERG						T2:ERG
PCPThg score+ T2:ERG	-	-	-	-	-	-	0.932	0.732	0.26 (0.153)
Direct regression	1.01 (0.18)	0.032 (0.012)	1.03 (0.26)	−1.44 (0.28)	0.57 (0.29)	0.77 (0.20)	0.929	0.745	0.26 (0.179)
Direct regression + Firth	0.98 (0.18)	0.032 (0.012)	1.02 (0.26)	−1.41 (0.27)	0.57 (0.29)	0.76 (0.20)	0.930	0.744	0.27 (0.177)
Non-informative Bayes	0.99 (0.18)	0.032 (0.012)	1.01 (0.26)	−1.40 (0.27)	0.55 (0.29)	0.76 (0.20)	0.926	0.745	0.27 (0.175)
Constrained ML	1.14 (0.07)	0.032 (0.004)	1.06 (0.14)	−0.52 (0.11)	0.81 (0.18)	0.74 (0.21)	0.928	0.742	0.25 (0.176)
Constrained ML + Firth	1.14 (0.07)	0.032 (0.004)	1.06 (0.14)	−0.52 (0.11)	0.80 (0.17)	0.72 (0.20)	0.931	0.742	0.26 (0.176)
Informative full Bayes	1.14 (0.09)	0.033 (0.007)	0.95 (0.14)	−0.76 (0.16)	0.77 (0.21)	0.73 (0.19)	0.922	0.744	0.25 (0.175)
Transformation	1.17 (0.07)	0.030 (0.007)	0.94 (0.12)	−0.50 (0.11)	0.89 (0.16)	0.74 (0.14)	0.889	0.742	0.21 (0.152)
Chatterjee et al	1.25 (0.03)	0.029 (0.002)	0.85 (0.05)	−0.37 (0.04)	1.06 (0.05)	0.77 (0.27)	0.911	0.736	0.14 (0.129)
Simple logistic ( $\bar{p}$ ,B)	0.98 (0.18)	0.023 (0.000)	0.76 (0.11)	−0.28 (0.01)	0.73 (0.10)	0.80 (0.19)	0.918	0.739	0.25 (0.155)


Among the 1218 validation study patients, the AUCs for the original PCPThg model and the expanded PCPThg score plus PCA3 model are 0.707 and 0.752, respectively. By incorporating PCA3 score in the PCPThg model, the AUC increases to 0.767 in direct regression. However, the constrained methods do not further increase the AUC. All the new methods except for the transformation approach give predicted probabilities that are on average too high, suggesting they are not well calibrated. For calibration, as measured by the Brier score, the original PCPT calculator performs better than direct regression, and of the new methods only the transformation approach gives an improved Brier score. In Figure 1 we can see that the expanded PCPThg model incorporating PCA3 tends to overestimate the risk of high-grade PCa among those patients with high risk. However, the overall calibration ability of the expanded PCPThg model estimated by the transformation approach still outperforms that of the original PCPThg model, the expanded PCPThg score plus PCA3 model or the expanded PCPThg model estimated by direct regression.

Fig.1. Calibration plot of the original high-grade Prostate Cancer Prevention Trial risk calculator (PCPThg) and calibration plots of the expanded PCPThg model by incorporating PCA3 score and dichotomized T2:ERG

The expanded PCPThg model incorporating binary T2:ERG fitted to the training dataset again shows that the constrained methods can reduce the standard errors of regression coefficients compared to direct regression. In Figure 1 we can see that the expanded PCPThg model incorporating binary T2:ERG tends to overestimate the risk of getting high-grade cancer among those patients with high risk. However, the transformation approach predicts the risk well for the high risk groups.

For the prostate cancer example with either new biomarker, the Chatterjee et al. (2016) method gave parameter estimates and standard errors that differ both from those of the methods that don’t use the external information and from those of the other methods that do use the external information. In general, they tend to be closer to those of the original PCPT calculator. Also the method gives a much lower predicted population proportion than all the other methods and lower than the observed prevalence of 18.3% in the validation dataset. This is possibly because the Chatterjee et al. (2016) method does not take into account the uncertainty in the estimated coefficients. It is notable, in contrast to what was seen in the simulation studies, that the Chatterjee et al. (2016) method tends to give smaller standard errors than all the other methods for the coefficients of the X variables, but larger standard errors for the coefficient of the new biomarker. The method does assume that the X distribution is the same in the original dataset and the dataset with the new biomarkers, but in fact there are considerable differences between these X distributions; it is unclear how this is affecting the estimates and standard errors. The results for the constrained MLE with a range of values for d (shown in Appendix F of the Supplementary Materials) demonstrate that the choice of d can be quite impactful, both on the estimates and on their standard errors. As expected the results for d = 0.1 are quite close to those of the Chatterjee et al. (2016) method, because neither method incorporates the uncertainty in the parameter estimates.

7. Discussion

We propose several strategies for translating the external coefficient information obtained from outside the dataset into constraints on regression coefficients in the setting of a logistic regression model describing Pr(Y = 1|X, B). Simulation studies show that the external coefficient information from the established model can help improve the efficiency of estimation and enhance the predictive power in the expanded model.

In terms of computational efficiency, in simulation studies the transformation approach shows an advantage over the informative full Bayes approach because in the transformation approach the raw draws are first obtained in a fast way and then transformed into draws that obey the constraints based on an efficient optimization algorithm, while the informative full Bayes solution produces constrained draws inefficiently. When the dimensionality of the predictors X increases, the computational cost of the transformation approach will not increase much because, under our algorithm, the high-dimensional optimization problem always reduces to a one-dimensional optimization problem regardless of the dimensionality of the predictor space. Furthermore, the correlation among the samples in the Markov chain for the informative full Bayes approach is very high and effective samples are harder to obtain when the dimensionality increases (additional simulation results that validate this finding are not shown). As a consequence, the discrepancy in computational cost between these two constrained solutions will be more apparent in higher dimensions. In general the Bayesian approaches are much more computationally time-consuming than the other approaches. While it is conceivable that better algorithms and improved programming could speed these up considerably, it is very unlikely that they will ever have comparable speed to the CML methods or the Chatterjee et al. (2016) approach. A general overview of the computational and implementation details for all the methods described in this paper is given in Supplementary Materials Appendix G.

The efficiency gain in the expanded model of interest depends on the sample size used to construct the established model and the sample size used to estimate the expanded model of interest. In our simulation studies the established models are based on large datasets with 10000 observations while the current datasets are very small. The relative efficiency gain in the regression coefficients of the variables X from incorporating the external coefficient information is substantial and the prediction power in the validation dataset is enhanced. However, when the sample size in the current dataset is large enough to estimate the expanded model, the constrained methods do not lead to much improvement in the predictive ability compared to direct regression, as was the case in the prostate cancer example. Nevertheless, our numerical results suggest that improved precision of the coefficient estimates, as measured by standard errors, can be achieved even if the current dataset is not small.

A situation we did not consider in the simulation studies is when the event of interest is rare and the predicted probabilities are very low. While there may be other methods that exploit this assumption, we hypothesize that the relative performance of the methods we considered may be similar to that for the small sample sizes we did evaluate. This is because both situations have limited information in the data from which to estimate regression coefficients. However, this hypothesis would need to be investigated.

The approaches proposed in this paper are based on establishing a relationship between the parameters in the Y|X model and the parameters in the Y|X, B model. Depending on the form of B and the structure of the models these relationships need to be analytically derived and are approximations. The methods also require an explicit model for B|X. While it would be desirable to avoid having to specify a parametric model for B|X, we also note that the appropriateness of the model for B|X can be checked to some degree from the small dataset. The differences in the distributions of X in the external and internal studies will not have much effect on the performance of our proposed constrained methods (simulation results not shown). This is because the coefficients’ approximate relationship equations are constructed based on the conditional distributions Y|X, B and B|X. As long as these two conditional distributions are correctly specified in the internal study, the approximate relationship equations will hold regardless of the differences in the distributions of X in these two studies. The approaches also have the feature that they can directly incorporate the uncertainty in the parameters of the Y|X model. An alternative method of using $\bar{p}$ as a covariate is appealing because of its simplicity and broad applicability, although it does appear to have slightly worse properties than the more sophisticated approaches. The approach of Chatterjee et al. (2016) is appealing because it is applicable for any form of B and it does not require an explicit model for B|X; however, it does require the same distribution of X in the two populations and it does not incorporate the uncertainty in the parameters of the Y|X model. We also found that it was sometimes numerically unstable for small sample sizes.

One point of future consideration is the distribution of the new biomarker B. We develop the approximate relationship equations for the scenarios in which B is Gaussian or binary. When B is multivariate Gaussian, based on the generalization of equation (5), assuming B|X is multivariate normal with L dimensions, mean Xθ and covariance matrix V_L×L, the approximate relationship between γ, θ and β is:

β_{j} \approx (γ_{j} + \sum_{l = 1}^{L} γ_{p + l} θ_{l j}) / {(1 + γ_{B}^{T} V γ_{B} / {1.7}^{2})}^{\frac{1}{2}}, j = 0, \dots, p

(17)

Then the strategies to incorporate the external coefficient information described in Section 3 can be easily extended in this case. However, if additional biomarkers follow other distributions, these approximate relationship equations will fail. Therefore, further investigations are needed for the generalization of our proposed constrained solutions to flexibly adapt to other possible distributions of the new biomarker.
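As a numerical illustration of equation (17), the sketch below evaluates the approximation for given γ, θ and V. The function name `approx_marginal_beta` and its argument layout are assumptions for illustration. The example call uses the Gaussian simulation scenario of Section 5 with L = 1, where γ = (2, 3, 3, 2), θ = (0, 0.5, 0.5) and V = 0.75².

```python
import math


def approx_marginal_beta(gamma_x, gamma_b, theta, v):
    """Approximate coefficients of Pr(Y = 1 | X) following equation (17).

    gamma_x: [γ_0, ..., γ_p], intercept and coefficients of X in Y | X, B.
    gamma_b: [γ_{p+1}, ..., γ_{p+L}], coefficients of the L biomarkers.
    theta:   L rows, row l holding [θ_{l0}, ..., θ_{lp}] from the model for B | X.
    v:       L x L covariance matrix of B | X, as nested lists.
    """
    n_b = len(gamma_b)
    # Quadratic form γ_B' V γ_B.
    quad = sum(gamma_b[a] * v[a][c] * gamma_b[c]
               for a in range(n_b) for c in range(n_b))
    scale = math.sqrt(1.0 + quad / 1.7 ** 2)
    return [
        (gamma_x[j] + sum(gamma_b[l] * theta[l][j] for l in range(n_b))) / scale
        for j in range(len(gamma_x))
    ]


# Simulation scenario of Section 5 with a single Gaussian biomarker (L = 1).
beta_approx = approx_marginal_beta(
    gamma_x=[2.0, 3.0, 3.0],
    gamma_b=[2.0],
    theta=[[0.0, 0.5, 0.5]],
    v=[[0.75 ** 2]],
)
```

Under this scenario the approximation gives values of about 1.50, 3.00 and 3.00, which are close to the external estimates 1.50, 2.95 and 3.01 reported in Section 5.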

Supplementary Material

Supplementary File - Amended

NIHMS988360-supplement-Supplementary_File_-_Amended.pdf (300.1KB, pdf)

Acknowledgments

This research was supported by the NSF grant DMS 1406712 and NIH grant CA 129102.

References

Chatterjee N, Chen Y-H, Maas P and Carroll RJ (2016) Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111, 107–117.
Cheng W, Taylor JMG, Vokonas PS, Park SK and Mukherjee B (2018) Improving estimation and prediction in linear regression incorporating external information from an established reduced model. Statistics in Medicine, 37, 1515–1530.
D’Agostino RB, Grundy S, Sullivan LM, Wilson P and for the CHD Risk Prediction Group (2001) Validation of the Framingham coronary heart disease prediction scores: results of a multiple ethnic groups investigation. The Journal of the American Medical Association, 286, 180–187.
Firth D (1993) Bias reduction of maximum likelihood estimates. Biometrika, 80, 27–38.
Gail MH, Brinton LA, Byar DP, Corle DK, Green SB, Schairer C and Mulvihill JJ (1989) Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. Journal of the National Cancer Institute, 81, 1879–1886.
Gelman A, Jakulin A, Pittau MG and Su Y-S (2008) A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2, 1360–1383.
Grill S, Ankerst DP, Gail MH, Chatterjee N and Pfeiffer RM (2017) Comparison of approaches for incorporating new information into existing risk prediction models. Statistics in Medicine, 36, 1134–1156.
Grill S, Fallah M, Leach RJ, Thompson IM, Hemminki K and Ankerst DP (2015) A simple-to-use method incorporating genomic markers into prostate cancer risk prediction tools facilitated future validation. Journal of Clinical Epidemiology, 68, 563–573.
Gunn LH and Dunson DB (2005) A transformation approach for incorporating monotone or unimodal constraints. Biostatistics, 6, 434–449.
Heinze G and Schemper M (2002) A solution to the problem of separation in logistic regression. Statistics in Medicine, 21, 2409–2419.
Imbens GW and Lancaster T (1994) Combining micro and macro data in microeconometric models. The Review of Economic Studies, 61, 655–680.
Mealiffe ME, Stokowski RP, Rhees BK, Prentice RL, Pettinger M and Hinds DA (2010) Assessment of clinical validity of a breast cancer risk model combining genetic and clinical information. Journal of the National Cancer Institute, 102, 1618–1627.
Monahan J and Stefanski LA (1992) Normal scale mixture approximations to F*(z) and computation of the logistic-normal integral in Handbook of the logistic distribution. New York: CRC Press.
Newcombe PJ, Reck BH, Sun J, Platek GT, Verzilli C, Kader AK, Kim S-T, Hsu F-C, Zhang Z, Zheng SL, Mooser VE, Condreay LD, Spraggs CF, Whittaker JC, Rittmaster RS and Xu J (2012) A comparison of Bayesian and frequentist approaches to incorporating external information for the prediction of prostate cancer risk. Genetic Epidemiology, 36, 71–83.
Qin J (2000) Combining parametric and empirical likelihoods. Biometrika, 87, 484–490.
Qin J, Zhang H, Li P, Albanes D and Yu K (2015) Using covariate-specific disease prevalence information to increase the power of case-control studies. Biometrika, 102, 169–180.
Satten GA and Kupper LL (1993) Inferences about exposure-disease associations using probability-of-exposure information. Journal of the American Statistical Association, 88, 200–208.
Steyerberg EW, Eijkemans MJC, Van Houwelingen JC, Lee KL and Habbema JDF (2000) Prognostic models based on literature and individual patient data in logistic regression analysis. Statistics in Medicine, 19, 141–160.
Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ and Kattan MW (2010) Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology, 21, 128–138.
Thompson IM, Ankerst DP, Chi C, Goodman PJ, Tangen CM, Lucia MS, Feng Z, Parnes HL and Coltman CA (2006) Assessing prostate cancer risk: results from the prostate cancer prevention trial. Journal of the National Cancer Institute, 98, 529–534.
Tomlins SA, Day JR, Lonigro RJ, Hovelson DH, Siddiqui J, Kunju LP, Dunn RL, Meyer S, Hodge P, Groskopf J, Wei JT and Chinnaiyan AM (2015) Urine TMPRSS2:ERG plus PCA3 for individualized prostate cancer risk assessment. European Urology, 70, 45–53.
Truong M, Yang B and Jarrard DF (2013) Toward the detection of prostate cancer in urine: a critical analysis. The Journal of Urology, 189, 422–429.
