BMC Medical Research Methodology. 2023 Feb 17;23:45. doi: 10.1186/s12874-023-01836-5

The iterative bisection procedure: a useful tool for determining parameter values in data-generating processes in Monte Carlo simulations

Peter C Austin
PMCID: PMC9936690  PMID: 36800931

Abstract

Background

Data-generating processes are key to the design of Monte Carlo simulations. It is important for investigators to be able to simulate data with specific characteristics.

Methods

We described an iterative bisection procedure that can be used to determine the numeric values of parameters of a data-generating process to produce simulated samples with specified characteristics. We illustrated the application of the procedure in four different scenarios: (i) simulating binary outcome data from a logistic model such that the prevalence of the outcome is equal to a specified value; (ii) simulating binary outcome data from a logistic model based on treatment status and baseline covariates so that the simulated outcomes have a specified treatment relative risk; (iii) simulating binary outcome data from a logistic model so that the model c-statistic has a specified value; (iv) simulating time-to-event outcome data from a Cox proportional hazards model so that treatment induces a specified marginal or population-average hazard ratio.

Results

In each of the four scenarios the bisection procedure converged rapidly and identified parameter values that resulted in the simulated data having the desired characteristics.

Conclusion

An iterative bisection procedure can be used to identify numeric values for parameters in data-generating processes to generate data with specified characteristics.

Keywords: Data-generating process, Simulations, Monte Carlo simulations

Introduction

Monte Carlo simulations are a critical tool in modern statistical research [1, 2]. Simulations allow one to investigate the properties of statistical estimators and procedures in settings in which analytic calculations are not feasible. A crucial component of any simulation is a data-generating process that allows the investigator to simulate data with specified characteristics. While a data-generating process can often be quickly constructed, it is more difficult to specify the values of the parameters of the data-generating process to result in the simulated data having specified characteristics.

For instance, given a set of baseline covariates simulated from a multivariate distribution, a logistic model can be used to simulate binary outcomes so as to induce an odds ratio of a specified magnitude for the association between a variable denoting treatment status (treatment vs. control) and the outcome. However, if one wanted to induce a treatment effect with a specific relative risk or risk difference, rather than a specific odds ratio, one would need to identify the given log-odds ratio for treatment that resulted in the desired relative risk or risk difference [3, 4]. This log-odds ratio would depend on the distribution of baseline covariates. Similarly, if one wanted to simulate binary outcome data such that the logistic regression model for the outcome had a specified c-statistic (equivalent to the area under the receiver operating characteristic (ROC) curve), one would need to determine the regression coefficients for the logistic regression model that result in the desired c-statistic.

We describe an iterative bisection procedure that allows researchers to determine the required value of parameters in a data-generating process to result in simulated data with the desired characteristics. We illustrate the iterative bisection procedure by applying it to four different examples. In “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section, we apply the iterative bisection procedure to construct a data-generating process for simulating binary outcomes from a multivariable logistic regression model so that the prevalence of the outcome in the population is equal to a specified probability. In “Determining the odds ratio for a binary treatment variable in a logistic regression model to induce a desired treatment risk difference or relative risk in the population” section, we apply the bisection procedure to construct a data-generating process for simulating binary outcomes using a multivariable logistic regression model such that a binary treatment (or exposure) induces a relative risk of a given magnitude. In “Determining the regression coefficients for a logistic regression model so that the model has a specified c-statistic” section, we apply the bisection procedure to construct a data-generating process for simulating binary outcomes from a multivariable logistic regression model with a specified c-statistic. In “Determining the conditional hazard ratio for treatment/exposure in an adjusted Cox regression model to induce a specified marginal hazard ratio” section, we apply the bisection procedure to construct a data-generating process for simulating time-to-event outcomes with a specified marginal hazard ratio for treatment. Finally, we provide a summary in “Discussion” section.

Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes

Description of method

In this section we consider a setting in which one wants to simulate a binary outcome that is related to a vector of covariates, such that the prevalence of the outcome in the population is equal to a specified value. Let $P_{\text{target}}$ denote the specified or target prevalence of the outcome in the population.

The first step is to simulate a vector of covariates for each subject in a large super-population, say of size N = 1,000,000. The distribution of the baseline covariates can be chosen by the investigator. The application of the bisection procedure is independent of this distributional decision. Assume that we simulate $p$ baseline covariates $(X_1, \ldots, X_p)$ from a given multivariate distribution.

The second step is to specify a logistic regression model for generating the binary outcomes:

$$\operatorname{logit}(\Pr(Y_i = 1)) = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} \qquad (1)$$

The regression coefficients $\beta_1, \ldots, \beta_p$ can be chosen by the investigator to reflect the desired relationship between each of the $p$ covariates and the log-odds of the outcome. The prevalence of the outcome in the population is primarily determined by the intercept, $\beta_0$. One must determine the value of the intercept that produces the desired prevalence of the outcome. Lower values of $\beta_0$ are associated with lower prevalences of the outcome, while higher values of $\beta_0$ are associated with higher prevalences of the outcome. For a given value of $\beta_0$ we can simulate a binary outcome for each subject in the super-population from a Bernoulli distribution with subject-specific parameter determined by formula (1). Let $Y_i^{\beta_0}$ denote the simulated outcome for the $i$th subject when the intercept for the regression model (1) is set equal to $\beta_0$.

The next step is to specify the endpoints of an interval for the parameter of interest; in this case the regression intercept, $\beta_0$. Denote this interval by $(\beta_0^{\text{lower}}, \beta_0^{\text{upper}})$. The lower endpoint $\beta_0^{\text{lower}}$ is chosen such that $\frac{1}{N}\sum_{i=1}^{N} Y_i^{\beta_0^{\text{lower}}} < P_{\text{target}}$. In other words, the prevalence of the simulated outcome is less than the target value when using $\beta_0^{\text{lower}}$. Similarly, the upper endpoint $\beta_0^{\text{upper}}$ is chosen such that $\frac{1}{N}\sum_{i=1}^{N} Y_i^{\beta_0^{\text{upper}}} > P_{\text{target}}$. In other words, the prevalence of the simulated outcome is greater than the target value when using $\beta_0^{\text{upper}}$. The endpoints can be identified through a grid search or by trial and error.

Once the endpoints of the interval $(\beta_0^{\text{lower}}, \beta_0^{\text{upper}})$ have been determined, compute the midpoint of the interval: $\beta_0^{\text{midpoint}} = \frac{\beta_0^{\text{lower}} + \beta_0^{\text{upper}}}{2}$ (e.g., if the original endpoints of the interval are ±10, the original midpoint will be 0). We then use $\beta_0^{\text{midpoint}}$ in formula (1) and simulate a binary outcome for each subject: $Y_i^{\beta_0^{\text{midpoint}}}$. We then compute the prevalence of the outcome in the super-population: $\frac{1}{N}\sum_{i=1}^{N} Y_i^{\beta_0^{\text{midpoint}}}$. If this prevalence is less than $P_{\text{target}}$, the prevalence of the outcome is too low and the intercept of formula (1) has to be increased. If it is greater than $P_{\text{target}}$, the prevalence is too high and the intercept of formula (1) has to be decreased.

If $\frac{1}{N}\sum_{i=1}^{N} Y_i^{\beta_0^{\text{midpoint}}} < P_{\text{target}}$, then define a new interval: $(\beta_0^{\text{midpoint}}, \beta_0^{\text{upper}})$. Conversely, if $\frac{1}{N}\sum_{i=1}^{N} Y_i^{\beta_0^{\text{midpoint}}} > P_{\text{target}}$, then define a new interval $(\beta_0^{\text{lower}}, \beta_0^{\text{midpoint}})$. In the first case, the new interval is the upper half of the initial interval, while in the second case the new interval is the lower half of the initial interval. In either case, the width of the new interval is half the width of the initial interval: we have bisected the initial interval. One then repeats this process iteratively. After $K$ iterations, the width of the resultant interval is $1/2^K$ of the width of the initial interval. The iterative process can be continued until $\frac{1}{N}\sum_{i=1}^{N} Y_i^{\beta_0^{\text{midpoint}}}$ is as close to $P_{\text{target}}$ as desired.
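The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's R implementation: the function names (`covariate_effects`, `prevalence`, `bisect_intercept`) are hypothetical, the super-population is reduced to 20,000 subjects so that pure standard-library code runs quickly, and one uniform draw per subject is generated once and reused at every iteration (common random numbers), so that the empirical prevalence is a non-decreasing function of the intercept. The description above instead simulates new outcomes at each midpoint.

```python
import math
import random


def expit(x):
    """Inverse logit, written to stay stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def covariate_effects(n, betas, rng):
    """Return sum_j beta_j * X_ij per subject: first half N(0, 1), second half Bernoulli(0.5)."""
    half = len(betas) // 2
    effects = []
    for _ in range(n):
        lp = sum(b * rng.gauss(0.0, 1.0) for b in betas[:half])
        lp += sum(b for b in betas[half:] if rng.random() < 0.5)
        effects.append(lp)
    return effects


def prevalence(beta0, effects, uniforms):
    """Empirical outcome prevalence when the intercept of formula (1) is beta0."""
    return sum(u < expit(beta0 + lp) for lp, u in zip(effects, uniforms)) / len(effects)


def bisect_intercept(effects, uniforms, p_target, lower=-10.0, upper=10.0,
                     tol=1e-3, max_iter=50):
    """Halve (lower, upper) until the prevalence at the midpoint is within tol of p_target."""
    for _ in range(max_iter):
        mid = 0.5 * (lower + upper)
        p_mid = prevalence(mid, effects, uniforms)
        if abs(p_mid - p_target) < tol:
            break
        if p_mid < p_target:
            lower = mid  # prevalence too low: keep the upper half
        else:
            upper = mid  # prevalence too high: keep the lower half
    return mid, p_mid


rng = random.Random(2023)  # arbitrary seed, chosen only for reproducibility
betas = [math.log(v) for v in (1.25, 1.5, 1.75, 2.0, 2.5)] * 2
n = 20000  # assumption: much smaller than the N = 1,000,000 used in the text
effects = covariate_effects(n, betas, rng)
uniforms = [rng.random() for _ in range(n)]  # common random numbers across iterations
beta0, p_hat = bisect_intercept(effects, uniforms, p_target=0.10)
print(beta0, p_hat)
```

The stopping tolerance (`tol`) and the iteration cap (`max_iter`) are illustrative choices; a tighter tolerance requires a larger super-population, because the empirical prevalence can only move in steps of 1/N.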

Application of method

We applied the iterative bisection procedure to simulate data for a sample of size N = 1,000,000. We simulated 10 baseline covariates: the first five from independent standard normal distributions, and the last five from independent Bernoulli distributions with parameter 0.5. The regression coefficients (equivalent to log-odds ratios) for the 10 covariates were set to $\beta_1 = \log(1.25)$, $\beta_2 = \log(1.5)$, $\beta_3 = \log(1.75)$, $\beta_4 = \log(2)$, $\beta_5 = \log(2.5)$, $\beta_6 = \log(1.25)$, $\beta_7 = \log(1.5)$, $\beta_8 = \log(1.75)$, $\beta_9 = \log(2)$, $\beta_{10} = \log(2.5)$.

Our objective was to simulate data such that the prevalence of the outcome was 0.10 (10%). The initial interval $(\beta_0^{\text{lower}}, \beta_0^{\text{upper}})$ was set to (-10, 10). R code to implement the bisection procedure is provided at the author’s GitHub account [https://github.com/peter-austin/BMC_MRM-bisection-procedures-for-Monte-Carlo-simulations]. The values of $\beta_0^{\text{midpoint}}$ and $\frac{1}{N}\sum_{i=1}^{N} Y_i^{\beta_0^{\text{midpoint}}}$ at each iteration are reported in Table 1. After 14 iterations of the procedure, an intercept equal to -4.368896 resulted in the generation of outcomes such that the prevalence of the outcome was 0.099923.
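As a short worked calculation (added for illustration, with the iteration index $k$ introduced here as notation), the halving of the interval fixes the distance between successive midpoints. With the initial interval (-10, 10), the midpoint at iteration $k$ differs from the midpoint at iteration $k-1$ by

$$\frac{\beta_0^{\text{upper}} - \beta_0^{\text{lower}}}{2^{k}} = \frac{20}{2^{k}}, \qquad \text{so for } k = 14: \quad \frac{20}{16384} \approx 0.00122.$$

This agrees with the gap between the midpoints reported for iterations 13 and 14 in Table 1 (-4.36768 and -4.3689).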

Table 1.

Bisection procedure to determine intercept of a logistic regression model to produce an outcome with a given prevalence (target prevalence: 0.10)

| Iteration | Target outcome prevalence | $\beta_0^{\text{midpoint}}$ | Empirical outcome prevalence |
|---|---|---|---|
| 1 | 0.1 | 0 | 0.729943 |
| 2 | 0.1 | -5 | 0.062318 |
| 3 | 0.1 | -2.5 | 0.313826 |
| 4 | 0.1 | -3.75 | 0.153365 |
| 5 | 0.1 | -4.375 | 0.099513 |
| 6 | 0.1 | -4.0625 | 0.124133 |
| 7 | 0.1 | -4.21875 | 0.111236 |
| 8 | 0.1 | -4.29688 | 0.105887 |
| 9 | 0.1 | -4.33594 | 0.102487 |
| 10 | 0.1 | -4.35547 | 0.100863 |
| 11 | 0.1 | -4.36523 | 0.100523 |
| 12 | 0.1 | -4.37012 | 0.099575 |
| 13 | 0.1 | -4.36768 | 0.100366 |
| 14 | 0.1 | -4.3689 | 0.099923 |

Determining the odds ratio for a binary treatment variable in a logistic regression model to induce a desired treatment risk difference or relative risk in the population

Description of method

The logistic regression model is commonly used in biomedical and epidemiological research for assessing the association between a binary outcome and a set of covariates [5]. When using a logistic regression model, the odds ratio is the resultant measure of association. The odds ratio denotes the relative increase in the odds of the binary outcome associated with a one-unit increase in the given covariate. Other measures of effect for binary outcomes include the risk difference, the relative risk, and the number needed to treat, where the latter is the reciprocal of the risk difference. Several clinical commentators have suggested that these latter three measures of effect are preferable to the odds ratio for clinical decision making [6–9].

To study the performance of statistical methods for estimating risk differences or relative risks, one requires a data-generating process that can simulate data with a given risk difference or relative risk [3, 4]. We assume that our data-generating process for simulating outcomes is a modification of the one described above. We modify the logistic regression model as follows:

$$\operatorname{logit}(\Pr(Y_i = 1)) = \beta_0 + \gamma Z_i + \sum_{j=1}^{p} \beta_j X_{ij} \qquad (2)$$

The model has been modified by including a binary treatment variable (Z = 1 treated; Z = 0 control) with an associated log-odds ratio of $\gamma$. Thus, treatment is associated with an increase of $\gamma$ in the log-odds of the outcome. Let $RR_{\text{target}}$ denote the target treatment relative risk in the population.

The first step is to simulate baseline covariates $X_1, \ldots, X_p$ from a chosen distribution. One can then simulate treatment status using methods described in “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section, so that receipt of treatment has a specified association with each of the baseline covariates and so that the prevalence of treatment in the population is equal to the specified value.

The second step is to set the regression coefficients associated with the baseline covariates in formula (2) to the desired quantities. One can use the methods described in “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section to determine the intercept (β0) of formula (2) so that the prevalence of the outcome in the population if no one were treated is equal to a specified value.

We introduce the potential outcomes framework, as this facilitates identifying the appropriate value of γ [10]. Given a binary treatment Z, let Y(1) and Y(0) denote a subject’s outcomes under treatment (Z = 1) and control (Z = 0) if received under identical circumstances. The average treatment effect (ATE) is defined as E[Y(1) – Y(0)]. The marginal value of the relative risk is defined as E[Y(1)]/E[Y(0)].

The population relative risk due to treatment is determined by the log-odds ratio for treatment, $\gamma$. One must determine the value of $\gamma$ that results in the desired relative risk. Lower values of $\gamma$ are associated with lower relative risks, while higher values of $\gamma$ are associated with higher relative risks. For a given value of $\gamma$ we can simulate the two potential outcomes for each subject using formula (2). First, we set Z = 0 (control) for all subjects in the super-population and simulate a binary outcome for each subject in the super-population from a Bernoulli distribution with subject-specific parameter determined by formula (2). Let $Y(0)_i^{\gamma}$ denote the simulated outcome under control for the $i$th subject when the log-odds ratio for treatment in regression model (2) is set equal to $\gamma$. Second, we set Z = 1 (treated) for all subjects in the super-population and simulate a binary outcome for each subject in the super-population from a Bernoulli distribution with subject-specific parameter determined by formula (2). Let $Y(1)_i^{\gamma}$ denote the simulated outcome under treatment for the $i$th subject when the log-odds ratio for treatment in regression model (2) is set equal to $\gamma$. The population relative risk when the log-odds ratio for treatment is set to $\gamma$ is equal to $\frac{E[Y(1)_i^{\gamma}]}{E[Y(0)_i^{\gamma}]} = \frac{\frac{1}{N}\sum_{i=1}^{N} Y(1)_i^{\gamma}}{\frac{1}{N}\sum_{i=1}^{N} Y(0)_i^{\gamma}}$.

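The next step is to specify the endpoints of an interval for the log-odds ratio for treatment, $\gamma$. Denote this interval by $(\gamma^{\text{lower}}, \gamma^{\text{upper}})$. The lower endpoint $\gamma^{\text{lower}}$ is chosen such that $\frac{\frac{1}{N}\sum_{i=1}^{N} Y(1)_i^{\gamma^{\text{lower}}}}{\frac{1}{N}\sum_{i=1}^{N} Y(0)_i^{\gamma^{\text{lower}}}} < RR_{\text{target}}$. Similarly, the upper endpoint $\gamma^{\text{upper}}$ is chosen such that $\frac{\frac{1}{N}\sum_{i=1}^{N} Y(1)_i^{\gamma^{\text{upper}}}}{\frac{1}{N}\sum_{i=1}^{N} Y(0)_i^{\gamma^{\text{upper}}}} > RR_{\text{target}}$. The endpoints can be identified through a grid search or by trial and error.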

Once the endpoints of the interval $(\gamma^{\text{lower}}, \gamma^{\text{upper}})$ have been determined, compute the midpoint of the interval: $\gamma^{\text{midpoint}} = \frac{\gamma^{\text{lower}} + \gamma^{\text{upper}}}{2}$. We then use $\gamma^{\text{midpoint}}$ in formula (2) and simulate the two potential outcomes under treatment and control for each subject: $Y(1)_i^{\gamma^{\text{midpoint}}}$ and $Y(0)_i^{\gamma^{\text{midpoint}}}$. We then compute the treatment relative risk in the super-population: $\frac{\frac{1}{N}\sum_{i=1}^{N} Y(1)_i^{\gamma^{\text{midpoint}}}}{\frac{1}{N}\sum_{i=1}^{N} Y(0)_i^{\gamma^{\text{midpoint}}}}$. If this relative risk is less than $RR_{\text{target}}$, the relative risk is too low and $\gamma$ in formula (2) has to be increased. If it is greater than $RR_{\text{target}}$, the relative risk is too large and $\gamma$ in formula (2) has to be decreased.

If the relative risk computed at $\gamma^{\text{midpoint}}$ is less than $RR_{\text{target}}$, then define a new interval: $(\gamma^{\text{midpoint}}, \gamma^{\text{upper}})$. Conversely, if the relative risk computed at $\gamma^{\text{midpoint}}$ is greater than $RR_{\text{target}}$, then define a new interval $(\gamma^{\text{lower}}, \gamma^{\text{midpoint}})$. In the first case, the new interval is the upper half of the initial interval, while in the second case the new interval is the lower half of the initial interval. In either case, the width of the new interval is half the width of the initial interval: we have bisected the initial interval. One then repeats this process iteratively until the relative risk computed at $\gamma^{\text{midpoint}}$ is as close to $RR_{\text{target}}$ as desired.

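The above procedure allows one to determine the value of $\gamma$ necessary to induce a given treatment relative risk. The procedure can be modified to induce a given treatment risk difference. To do so, all occurrences of the relative risk $\frac{\frac{1}{N}\sum_{i=1}^{N} Y(1)_i^{\gamma}}{\frac{1}{N}\sum_{i=1}^{N} Y(0)_i^{\gamma}}$ are replaced by the risk difference $\frac{1}{N}\sum_{i=1}^{N}\left[Y(1)_i^{\gamma} - Y(0)_i^{\gamma}\right]$.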
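The potential-outcomes version of the procedure can be sketched as follows. This is an illustrative sketch rather than the author's code: the function names are hypothetical, the intercept is hard-coded to a rounded value of the one reported in Table 1, the super-population has 50,000 subjects instead of 1,000,000, and the uniform draws for the treated potential outcomes are generated once and reused at every midpoint (an assumption made so that the empirical relative risk is non-decreasing in the log-odds ratio).

```python
import math
import random


def expit(x):
    """Inverse logit, written to stay stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relative_risk(gamma, lp_control, u_treated, risk_control):
    """Marginal relative risk E[Y(1)] / E[Y(0)] when the treatment log-odds ratio is gamma."""
    events = sum(u < expit(lp + gamma) for lp, u in zip(lp_control, u_treated))
    return (events / len(lp_control)) / risk_control


def bisect_log_odds_ratio(lp_control, u_treated, risk_control, rr_target,
                          lower=-10.0, upper=10.0, tol=1e-3, max_iter=60):
    """Bisect on gamma until the empirical relative risk is within tol of rr_target."""
    for _ in range(max_iter):
        mid = 0.5 * (lower + upper)
        rr_mid = relative_risk(mid, lp_control, u_treated, risk_control)
        if abs(rr_mid - rr_target) < tol:
            break
        if rr_mid < rr_target:
            lower = mid  # relative risk too low: increase gamma
        else:
            upper = mid  # relative risk too high: decrease gamma
    return mid, rr_mid


rng = random.Random(7)  # arbitrary seed
betas = [math.log(v) for v in (1.25, 1.5, 1.75, 2.0, 2.5)] * 2
beta0 = -4.37  # assumption: rounded from the intercept reported in Table 1
n = 50000      # assumption: smaller than the N = 1,000,000 used in the text
lp_control = []
for _ in range(n):
    lp = beta0 + sum(b * rng.gauss(0.0, 1.0) for b in betas[:5])
    lp += sum(b for b in betas[5:] if rng.random() < 0.5)
    lp_control.append(lp)
# Potential outcomes under control do not depend on gamma, so they are simulated once.
risk_control = sum(rng.random() < expit(lp) for lp in lp_control) / n
u_treated = [rng.random() for _ in range(n)]
gamma, rr_hat = bisect_log_odds_ratio(lp_control, u_treated, risk_control, rr_target=0.80)
print(gamma, math.exp(gamma), rr_hat)
```

To target a risk difference instead, the ratio in `relative_risk` would be replaced by the difference between the treated and control risks, with the same bisection loop.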

Application of method

We applied the iterative bisection procedure to simulate data for a sample of size N = 1,000,000. We simulated 10 baseline covariates as in “Description of method” of “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section. We first simulated a binary treatment status variable using formula (1) with the 10 regression coefficients for the baseline covariates in the treatment-selection model set to log(1.1), log(2), log(3), log(1.5), log(1.5), log(1.1), log(2), log(3), log(1.5), and log(1.5). We used the bisection process described in “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section to determine the intercept for the treatment-selection model such that the prevalence of treatment in the population was 0.2. The resultant intercept was -3.31749.

The regression coefficients (equivalent to log-odds ratios) for the 10 covariates in the outcome model (formula (2)) were set to $\beta_1 = \log(1.25)$, $\beta_2 = \log(1.5)$, $\beta_3 = \log(1.75)$, $\beta_4 = \log(2)$, $\beta_5 = \log(2.5)$, $\beta_6 = \log(1.25)$, $\beta_7 = \log(1.5)$, $\beta_8 = \log(1.75)$, $\beta_9 = \log(2)$, $\beta_{10} = \log(2.5)$.

We set the value of $\beta_0$ to that determined in “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section ($\beta_0$ = -4.367676) so that the prevalence of the outcome was 0.10 (10%) if no subjects were treated.

Our objective was to simulate data such that the treatment relative risk was 0.80. The initial interval for $\gamma$ was set to (-10, 10). R code to implement the bisection procedure is provided at the author’s GitHub account [https://github.com/peter-austin/BMC_MRM-bisection-procedures-for-Monte-Carlo-simulations]. The values of $\gamma^{\text{midpoint}}$ and of the empirical relative risk $\frac{\frac{1}{N}\sum_{i=1}^{N} Y(1)_i^{\gamma^{\text{midpoint}}}}{\frac{1}{N}\sum_{i=1}^{N} Y(0)_i^{\gamma^{\text{midpoint}}}}$ at each iteration are reported in Table 2. After 16 iterations, the procedure identified that a treatment log-odds ratio of -0.2999878 (equivalent to an odds ratio of 0.741) resulted in a relative risk of 0.8000121.

Table 2.

Bisection procedure to determine log-odds ratio for treatment in a logistic regression model to produce a binary outcome with a given relative risk (target relative risk: 0.80)

| Iteration | Target relative risk | $\gamma^{\text{midpoint}}$ | Empirical relative risk |
|---|---|---|---|
| 1 | 0.8 | 0 | 1 |
| 2 | 0.8 | -5 | 0.010792 |
| 3 | 0.8 | -2.5 | 0.12165 |
| 4 | 0.8 | -1.25 | 0.371719 |
| 5 | 0.8 | -0.625 | 0.621492 |
| 6 | 0.8 | -0.3125 | 0.792433 |
| 7 | 0.8 | -0.15625 | 0.891377 |
| 8 | 0.8 | -0.23438 | 0.840727 |
| 9 | 0.8 | -0.27344 | 0.81629 |
| 10 | 0.8 | -0.29297 | 0.804289 |
| 11 | 0.8 | -0.30273 | 0.798343 |
| 12 | 0.8 | -0.29785 | 0.801312 |
| 13 | 0.8 | -0.30029 | 0.799827 |
| 14 | 0.8 | -0.29907 | 0.800569 |
| 15 | 0.8 | -0.29968 | 0.800198 |
| 16 | 0.8 | -0.29999 | 0.800012 |

Determining the regression coefficients for a logistic regression model so that the model has a specified c-statistic

Description of method

The c-statistic (equivalent to the area under the receiver operating characteristic (ROC) curve, which is sometimes abbreviated as the AUC) is a measure of discrimination used to assess the predictive performance of logistic regression models [11, 12]. In this section we describe how to specify the coefficients of a logistic regression model to simulate outcomes such that the underlying logistic regression model has a specified c-statistic. In doing so, we make use of the fact that the c-statistic of a univariate logistic regression model is a function of the variance of the covariate and the log-odds ratio for that covariate [13]. Let $AUC_{\text{target}}$ denote the target c-statistic.

We modify the logistic regression model described in formula (1):

$$\operatorname{logit}(\Pr(Y_i = 1)) = \beta_0 + \sigma \sum_{j=1}^{p} \beta_j X_{ij} \qquad (3)$$

The regression coefficients $\beta_1, \ldots, \beta_p$ can be chosen by the investigator to reflect the desired relationship between each of the covariates and the log-odds of the outcome. By introducing the scalar $\sigma$, we are modifying each log-odds ratio, but doing so in such a way that the ratio of any two log-odds ratios remains constant after modification. We need to identify the value of $\sigma$ required to induce the desired c-statistic. Larger values of $\sigma$ are associated with larger values of the c-statistic, while smaller values of $\sigma$ are associated with smaller values of the c-statistic.

The first step is to simulate a vector of covariates for each subject in a large super-population, say of size N = 1,000,000. The distribution of the baseline covariates can be chosen by the investigator. The application of the bisection procedure is independent of this distributional decision.

For a given value of $\sigma$ we can simulate a binary outcome for each subject in the super-population from a Bernoulli distribution with subject-specific parameter determined by formula (3). Let $Y_i^{\sigma}$ denote the simulated outcome for the $i$th subject when $\sigma$ is a constant scalar as shown in formula (3).

The next step is to specify the endpoints of an interval for $\sigma$. Denote this interval by $(\sigma^{\text{lower}}, \sigma^{\text{upper}})$. The lower endpoint $\sigma^{\text{lower}}$ is chosen such that, when binary outcomes are simulated using a Bernoulli distribution with subject-specific parameter determined using formula (3), the logistic regression model fit to the simulated data has a c-statistic that is less than $AUC_{\text{target}}$. Similarly, the upper endpoint $\sigma^{\text{upper}}$ is chosen such that, when binary outcomes are simulated in the same way, the logistic regression model fit to the simulated data has a c-statistic that is greater than $AUC_{\text{target}}$. The endpoints can be identified through a grid search or by trial and error.

Once the endpoints of the interval $(\sigma^{\text{lower}}, \sigma^{\text{upper}})$ have been determined, compute the midpoint of the interval: $\sigma^{\text{midpoint}} = \frac{\sigma^{\text{lower}} + \sigma^{\text{upper}}}{2}$. We then use $\sigma^{\text{midpoint}}$ in formula (3) and simulate a binary outcome for each subject: $Y_i^{\sigma^{\text{midpoint}}}$. We fit a logistic regression model in the simulated sample and determine its c-statistic, which we refer to as $AUC^{\sigma^{\text{midpoint}}}$. If $AUC^{\sigma^{\text{midpoint}}} < AUC_{\text{target}}$, the c-statistic is too low and $\sigma$ has to be increased. If $AUC^{\sigma^{\text{midpoint}}} > AUC_{\text{target}}$, the c-statistic is too large and $\sigma$ has to be decreased.

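If $AUC^{\sigma^{\text{midpoint}}} < AUC_{\text{target}}$, then define a new interval: $(\sigma^{\text{midpoint}}, \sigma^{\text{upper}})$. Conversely, if $AUC^{\sigma^{\text{midpoint}}} > AUC_{\text{target}}$, then define a new interval $(\sigma^{\text{lower}}, \sigma^{\text{midpoint}})$. One then repeats this process iteratively until $AUC^{\sigma^{\text{midpoint}}}$ is as close to $AUC_{\text{target}}$ as desired.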
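A rough Python sketch of this search over the scalar is given below. It departs from the description above in ways that should be read as assumptions: instead of refitting a logistic regression model at every midpoint, it computes the c-statistic of the true linear predictor from the data-generating model (which ranks subjects identically for any positive value of the scalar), it uses a super-population of 20,000 subjects, and it reuses one set of uniform draws at every iteration. The function names and the rounded intercept are hypothetical choices for the example.

```python
import math
import random


def expit(x):
    """Inverse logit, written to stay stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def c_statistic(scores, outcomes):
    """AUC via the Mann-Whitney rank-sum identity, using midranks for tied scores."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    n1 = sum(outcomes)
    n0 = len(outcomes) - n1
    rank_sum, i = 0.0, 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = 0.5 * (i + j) + 1.0  # mean of the 1-based ranks i+1, ..., j+1
        rank_sum += midrank * sum(outcomes[order[k]] for k in range(i, j + 1))
        i = j + 1
    return (rank_sum - n1 * (n1 + 1) / 2.0) / (n1 * n0)


def auc_for_sigma(sigma, beta0, effects, uniforms):
    """Simulate outcomes from formula (3) and return the c-statistic of the linear predictor."""
    outcomes = [int(u < expit(beta0 + sigma * lp)) for lp, u in zip(effects, uniforms)]
    return c_statistic(effects, outcomes)  # same ranking as beta0 + sigma * lp when sigma > 0


def bisect_sigma(beta0, effects, uniforms, auc_target, lower=0.0, upper=10.0,
                 tol=2e-3, max_iter=40):
    """Bisect on sigma until the empirical c-statistic is within tol of auc_target."""
    for _ in range(max_iter):
        mid = 0.5 * (lower + upper)
        auc_mid = auc_for_sigma(mid, beta0, effects, uniforms)
        if abs(auc_mid - auc_target) < tol:
            break
        if auc_mid < auc_target:
            lower = mid  # discrimination too low: increase sigma
        else:
            upper = mid  # discrimination too high: decrease sigma
    return mid, auc_mid


rng = random.Random(12)  # arbitrary seed
betas = [math.log(v) for v in (1.25, 1.5, 1.75, 2.0, 2.5)] * 2
n = 20000  # assumption: smaller than the N = 1,000,000 used in the text
effects = [sum(b * rng.gauss(0.0, 1.0) for b in betas[:5])
           + sum(b for b in betas[5:] if rng.random() < 0.5) for _ in range(n)]
uniforms = [rng.random() for _ in range(n)]
sigma, auc_hat = bisect_sigma(-4.37, effects, uniforms, auc_target=0.80)
print(sigma, auc_hat)
```

In a large sample the c-statistic of a correctly specified fitted model should be close to that of the true linear predictor, which is why the shortcut is used here; an implementation that follows the text exactly would refit the model at each midpoint.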

Application of method

We applied the iterative bisection procedure to simulate data for a sample of size N = 1,000,000. We simulated 10 baseline covariates as in “Description of method” of “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section. As above, the regression coefficients (equivalent to log-odds ratios) for the 10 covariates in the outcome model (formula (3)) were set to $\beta_1 = \log(1.25)$, $\beta_2 = \log(1.5)$, $\beta_3 = \log(1.75)$, $\beta_4 = \log(2)$, $\beta_5 = \log(2.5)$, $\beta_6 = \log(1.25)$, $\beta_7 = \log(1.5)$, $\beta_8 = \log(1.75)$, $\beta_9 = \log(2)$, $\beta_{10} = \log(2.5)$.

We set the value of β0 to that determined in the first “Description of method” of “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section (β0 = -4.368896).

Our objective was to simulate binary outcomes such that the c-statistic of the logistic regression model was 0.8. The initial interval for $\sigma$ was set to (0, 10). R code to implement the bisection procedure is provided at the author’s GitHub account [https://github.com/peter-austin/BMC_MRM-bisection-procedures-for-Monte-Carlo-simulations]. The values of $\sigma^{\text{midpoint}}$ and $AUC^{\sigma^{\text{midpoint}}}$ at each iteration are reported in Table 3. After 11 iterations of the bisection procedure, $\sigma$ = 0.8349609 resulted in an empirical c-statistic of 0.8006215.

Table 3.

Bisection procedure to determine σ for multiplying the coefficients of a logistic regression model to produce a binary outcome such that the outcomes model has a given c-statistic (target c-statistic: 0.80)

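| Iteration | Target c-statistic | $\sigma^{\text{midpoint}}$ | Empirical c-statistic ($AUC^{\sigma^{\text{midpoint}}}$) |
|---|---|---|---|
| 1 | 0.8 | 5 | 0.983782 |
| 2 | 0.8 | 2.5 | 0.944541 |
| 3 | 0.8 | 1.25 | 0.868001 |
| 4 | 0.8 | 0.625 | 0.742962 |
| 5 | 0.8 | 0.9375 | 0.821941 |
| 6 | 0.8 | 0.78125 | 0.788305 |
| 7 | 0.8 | 0.859375 | 0.805549 |
| 8 | 0.8 | 0.820313 | 0.79756 |
| 9 | 0.8 | 0.839844 | 0.801353 |
| 10 | 0.8 | 0.830078 | 0.798494 |
| 11 | 0.8 | 0.834961 | 0.800622 |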

Determining the conditional hazard ratio for treatment/exposure in an adjusted Cox regression model to induce a specified marginal hazard ratio

Description of method

The Cox proportional hazards regression model is frequently used in biomedical and epidemiological research [14]. When fitting a multivariable Cox regression model, the regression coefficients, when exponentiated, are interpreted as conditional (or adjusted) hazard ratios. For a given covariate, the conditional hazard ratio compares the relative difference in the hazard of the outcome between two subjects for whom the covariate in question differs by one unit and for whom all the other covariates are identical. The marginal (or population-average) hazard ratio, in contrast, denotes the relative difference in the hazard function between two populations for whom the covariate in question differs by one unit and all other covariates are identical between populations. Due to the phenomenon known as the non-collapsibility of the hazard ratio, marginal and conditional hazard ratios do not coincide (unless one is null) [15].

Bender and colleagues have described data-generating processes for time-to-event outcomes based on an underlying hazard regression model [16]. This data-generating process simulates time-to-event outcomes with specified conditional hazard ratios. One can use the bisection approach with Bender’s data-generating process to induce data with a desired marginal effect for a given covariate.
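For readers unfamiliar with this data-generating process, the sketch below shows the inverse-transform formula for a Weibull baseline hazard, $T = \left(-\log(U) / (\lambda \exp(\text{linear predictor}))\right)^{1/\nu}$ with $U$ uniform on (0, 1). The function name, the parameter names `scale` ($\lambda$) and `shape` ($\nu$), and the example log-hazard ratio are assumptions made for illustration; the text does not state which baseline hazard was used in the application.

```python
import math
import random


def simulate_event_time(linear_predictor, rng, scale=1.0, shape=1.0):
    """Draw one event time for a baseline hazard h0(t) = scale * shape * t**(shape - 1).

    With shape = 1 the baseline hazard is constant (exponential event times).
    """
    u = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
    return (-math.log(u) / (scale * math.exp(linear_predictor))) ** (1.0 / shape)


rng = random.Random(11)  # arbitrary seed
gamma = math.log(0.5)    # hypothetical conditional log-hazard ratio for treatment
control = [simulate_event_time(0.0, rng) for _ in range(20000)]
treated = [simulate_event_time(gamma, rng) for _ in range(20000)]
mean_control = sum(control) / len(control)
mean_treated = sum(treated) / len(treated)
print(mean_control, mean_treated)
```

With a constant baseline hazard, halving the hazard doubles the expected event time, which gives a simple check that the sign of the linear predictor has been applied in the intended direction.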

Assume that we simulate $p$ baseline covariates $(X_1, \ldots, X_p)$ from a given multivariate distribution. Furthermore, assume that we have the following Cox proportional hazards model:

$$\log(h_i(t)) = \log(h_0(t)) + \gamma Z_i + \sum_{j=1}^{p} \beta_j X_{ij} \qquad (4)$$

where $h_i(t)$ denotes the hazard function for the $i$th subject, $h_0(t)$ denotes the baseline hazard function, and $Z$ denotes a binary treatment variable (Z = 1 treated; Z = 0 control) with an associated conditional log-hazard ratio of $\gamma$. Thus, treatment is associated with an increase of $\gamma$ in the log-hazard of the outcome. Let $HR_{\text{target}}$ denote the target marginal hazard ratio in the population.

The first step is to simulate baseline covariates $X_1, \ldots, X_p$ from a chosen distribution. One can then simulate treatment status (Z) using methods described above, so that receipt of treatment has a specified association with each of the baseline covariates and the prevalence of treatment in the population is equal to the specified value.

The second step is to set the regression coefficients associated with the baseline covariates in formula (4) to the desired quantities.

As above, we use the potential outcomes framework. The marginal hazard ratio due to treatment is determined by the log-hazard ratio for treatment, $\gamma$. One must determine the value of $\gamma$ that results in the desired marginal hazard ratio. For a given value of $\gamma$ we can use Bender’s approach to simulate the two potential outcomes for each subject using formula (4). First, we set Z = 0 (control) for all subjects in the super-population and simulate a time-to-event outcome for each subject in the super-population. Let $T(0)_i^{\gamma}$ denote the simulated outcome under control for the $i$th subject when the log-hazard ratio for treatment in regression model (4) is set to $\gamma$. Second, we set Z = 1 (treated) for all subjects in the super-population and simulate a time-to-event outcome for each subject in the super-population. Let $T(1)_i^{\gamma}$ denote the simulated outcome under treatment for the $i$th subject when the log-hazard ratio for treatment in regression model (4) is set to $\gamma$. One then creates a large super-population by concatenating the two simulated datasets (one under control and one under treatment). Using a univariate Cox proportional hazards model, one then regresses the hazard of the outcome on the variable denoting treatment status. The resultant hazard ratio is an estimate of the marginal hazard ratio.

The next step is to specify the endpoints of an interval for the log-hazard ratio for treatment, $\gamma$. Denote this interval by $(\gamma^{\text{lower}}, \gamma^{\text{upper}})$. The lower endpoint $\gamma^{\text{lower}}$ is chosen such that the estimated marginal hazard ratio is less than $HR_{\text{target}}$. Similarly, the upper endpoint $\gamma^{\text{upper}}$ is chosen such that the estimated marginal hazard ratio is greater than $HR_{\text{target}}$. The endpoints can be identified through a grid search or by trial and error.

Once the endpoints of the interval $(\gamma^{\text{lower}}, \gamma^{\text{upper}})$ have been determined, compute the midpoint of the interval: $\gamma^{\text{midpoint}} = \frac{\gamma^{\text{lower}} + \gamma^{\text{upper}}}{2}$. We then use $\gamma^{\text{midpoint}}$ in formula (4) and simulate the two potential outcomes under treatment and control for each subject: $T(1)_i^{\gamma^{\text{midpoint}}}$ and $T(0)_i^{\gamma^{\text{midpoint}}}$. We then estimate the marginal hazard ratio for treatment by using the dataset consisting of both potential outcomes for all subjects to regress the hazard of the outcome on treatment status. If the estimated marginal hazard ratio is less than $HR_{\text{target}}$, the hazard ratio is too low and $\gamma$ in formula (4) must be increased. If the estimated marginal hazard ratio is greater than $HR_{\text{target}}$, the hazard ratio is too large and $\gamma$ in formula (4) must be decreased.

If the estimated marginal hazard ratio is less than $HR_{\text{target}}$, then define a new interval: $(\gamma^{\text{midpoint}}, \gamma^{\text{upper}})$. Conversely, if the estimated marginal hazard ratio is greater than $HR_{\text{target}}$, then define a new interval $(\gamma^{\text{lower}}, \gamma^{\text{midpoint}})$. In the first case, the new interval is the upper half of the initial interval, while in the second case the new interval is the lower half of the initial interval. In either case, the width of the new interval is half the width of the initial interval: we have bisected the initial interval. One then repeats this process iteratively until the estimated marginal hazard ratio is as close to the target marginal hazard ratio, $HR_{\text{target}}$, as desired.
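The following sketch illustrates the search for the conditional log-hazard ratio in Python. Several simplifications are assumptions of the example rather than features of the author's implementation: event times come from a constant baseline hazard with no censoring, the treatment-only Cox model is fitted by a small hand-written Newton-Raphson routine (`fit_cox_binary`, a hypothetical helper that assumes no tied event times), the super-population has 5,000 subjects, and the same random draws are reused at every midpoint.

```python
import math
import random


def fit_cox_binary(times, z, tol=1e-8, max_iter=100):
    """Log-hazard ratio from a Cox model with one binary covariate (no censoring, no ties).

    Newton-Raphson on the partial likelihood, with a bracket-midpoint fallback
    whenever a Newton step would leave the current bracket inside (-20, 20).
    """
    order = sorted(range(len(times)), key=times.__getitem__)
    at_risk = [len(z) - sum(z), sum(z)]  # subjects still at risk with z == 0 and z == 1
    risk_sets = []
    for i in order:
        risk_sets.append((z[i], at_risk[0], at_risk[1]))
        at_risk[z[i]] -= 1
    lo, hi, b = -20.0, 20.0, 0.0
    for _ in range(max_iter):
        eb = math.exp(b)
        score = info = 0.0
        for zi, n0, n1 in risk_sets:
            p = n1 * eb / (n0 + n1 * eb)  # chance that the failing subject is treated
            score += zi - p
            info += p * (1.0 - p)
        if info > 0.0:
            step = score / info
            if abs(step) < tol:
                return b + step
            candidate = b + step
        else:
            candidate = lo  # forces the midpoint fallback below
        lo, hi = (b, hi) if score > 0 else (lo, b)
        b = candidate if lo < candidate < hi else 0.5 * (lo + hi)
    return b


def marginal_hazard_ratio(gamma, t_control, t_treated_null):
    """Stack both potential outcomes and fit a Cox model on treatment status alone."""
    shrink = math.exp(-gamma)  # scaling the hazard by exp(gamma) rescales exponential times
    times = t_control + [t * shrink for t in t_treated_null]
    z = [0] * len(t_control) + [1] * len(t_treated_null)
    return math.exp(fit_cox_binary(times, z))


def bisect_log_hazard_ratio(t_control, t_treated_null, hr_target,
                            lower=-10.0, upper=10.0, tol=2e-3, max_iter=40):
    """Bisect on gamma until the marginal hazard ratio is within tol of hr_target."""
    for _ in range(max_iter):
        mid = 0.5 * (lower + upper)
        hr_mid = marginal_hazard_ratio(mid, t_control, t_treated_null)
        if abs(hr_mid - hr_target) < tol:
            break
        if hr_mid < hr_target:
            lower = mid
        else:
            upper = mid
    return mid, hr_mid


rng = random.Random(36800931)  # arbitrary seed
betas = [math.log(v) for v in (1.25, 1.5, 1.75, 2.0, 2.5)] * 2
t_control, t_treated_null = [], []
for _ in range(5000):  # assumption: far smaller than N = 1,000,000
    lp = sum(b * rng.gauss(0.0, 1.0) for b in betas[:5])
    lp += sum(b for b in betas[5:] if rng.random() < 0.5)
    t_control.append(-math.log(1.0 - rng.random()) / math.exp(lp))
    t_treated_null.append(-math.log(1.0 - rng.random()) / math.exp(lp))  # treated, gamma = 0
gamma, hr_hat = bisect_log_hazard_ratio(t_control, t_treated_null, hr_target=0.80)
print(gamma, math.exp(gamma), hr_hat)
```

Because of non-collapsibility, the conditional hazard ratio `math.exp(gamma)` returned by this sketch should lie further from the null than the marginal target. In practice one would fit the treatment-only model with an established survival-analysis routine rather than a hand-written solver.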

Application of method

We applied the iterative bisection procedure to simulate data for a sample of size N = 1,000,000. We simulated 10 baseline covariates as in “Description of method” of “Determining the intercept of a logistic regression model so that prevalence of treatment is equal to a specified value when using a logistic regression model to simulate outcomes” section. We used the regression coefficients for the treatment-selection model described in “Description of method” of “Determining the odds ratio for a binary treatment variable in a logistic regression model to induce a desired treatment risk difference or relative risk in the population” section, with the same intercept as determined in “Description of method” of “Determining the odds ratio for a binary treatment variable in a logistic regression model to induce a desired treatment risk difference or relative risk in the population” section, so that the prevalence of treatment was 0.20.

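The regression coefficients for the Cox regression model described in formula (4) were set to $\beta_1 = \log(1.25)$, $\beta_2 = \log(1.5)$, $\beta_3 = \log(1.75)$, $\beta_4 = \log(2)$, $\beta_5 = \log(2.5)$, $\beta_6 = \log(1.25)$, $\beta_7 = \log(1.5)$, $\beta_8 = \log(1.75)$, $\beta_9 = \log(2)$, $\beta_{10} = \log(2.5)$.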

Our objective was to simulate data such that the marginal hazard ratio for treatment was 0.80. The initial interval for $\gamma$ was set to $(-10, 10)$. R code to implement the bisection procedure is provided at the author’s GitHub account [https://github.com/peter-austin/BMC_MRM-bisection-procedures-for-Monte-Carlo-simulations]. The estimates of $\gamma_{\text{midpoint}}$ and $\mathrm{HR}_{\text{marginal}}^{\gamma_{\text{midpoint}}}$ at each iteration are reported in Table 4. After 12 iterations of the bisection procedure, a conditional log-hazard ratio for treatment equal to 0.6298828 resulted in simulated outcomes with a marginal hazard ratio of 0.7991514.

Table 4.

Bisection procedure to determine conditional log-hazard ratio for treatment in a Cox proportional hazards model to produce a time-to-event outcome with a given marginal hazard ratio for treatment (target marginal hazard ratio for treatment: 0.80)

Iteration | Target marginal hazard ratio | $\gamma_{\text{midpoint}}$ | Empirical marginal hazard ratio ($\mathrm{HR}_{\text{marginal}}^{\gamma_{\text{midpoint}}}$)
1 | 0.8 | 0 | 0.000354
2 | 0.8 | 5 | 2.203321
3 | 0.8 | 2.5 | 1.561186
4 | 0.8 | 1.25 | 1.114086
5 | 0.8 | 0.625 | 0.796174
6 | 0.8 | 0.9375 | 0.969266
7 | 0.8 | 0.78125 | 0.887425
8 | 0.8 | 0.703125 | 0.843299
9 | 0.8 | 0.664063 | 0.820148
10 | 0.8 | 0.644531 | 0.808168
11 | 0.8 | 0.634766 | 0.802396
12 | 0.8 | 0.629883 | 0.799151
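
As a check on the mechanics, the midpoint column of Table 4 can be reproduced from the reported empirical hazard ratios alone. The short Python sketch below replays the interval updates, starting from the initial interval (-10, 10). The empirical hazard ratios are copied from Table 4, and the variable names are illustrative.

```python
hr_target = 0.80
# Empirical marginal hazard ratios reported in Table 4, iterations 1 to 12.
empirical_hr = [0.000354, 2.203321, 1.561186, 1.114086, 0.796174, 0.969266,
                0.887425, 0.843299, 0.820148, 0.808168, 0.802396, 0.799151]

lower, upper = -10.0, 10.0
midpoints = []
for hr in empirical_hr:
    mid = (lower + upper) / 2.0
    midpoints.append(mid)
    if hr < hr_target:
        lower = mid   # hazard ratio too low: move to the upper half
    else:
        upper = mid   # hazard ratio too large: move to the lower half

print(midpoints)
print(upper - lower)
```

The final midpoint is 0.6298828125, which agrees with the reported value of 0.6298828. Each iteration halves the interval, so after 12 iterations its width is 20/2^12, or about 0.0049.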

Discussion

We illustrated the application of an iterative bisection procedure that allows investigators to select the numeric values of parameters in a data-generating process to produce simulated datasets with specified characteristics. This will facilitate designing data-generating processes that produce simulated datasets that are tailored to the investigators’ specifications.

We illustrated the use of the bisection procedure when there is one characteristic that requires specification (e.g., the prevalence of the outcome or the c-statistic of a logistic regression model). The procedure can be modified to simulate data such that two characteristics are fixed at specified values (e.g., both the prevalence of the outcome and the c-statistic of the logistic regression model). To do so, one would apply the procedure sequentially and then iteratively repeat the sequential process until both characteristics are close to the target values. It is necessary to repeat the process iteratively as modifying the parameter values during the second application of the procedure (e.g., for the c-statistic of the regression model) may modify the value of the first characteristic (e.g., the prevalence of the outcome).
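
The alternating scheme described above can be sketched as follows. This is a hedged Python illustration, not the author's code. It assumes a logistic model with an intercept `b0` and a single standard normal covariate with slope `b1`. It evaluates the expected prevalence and the expected c-statistic from the model probabilities rather than from simulated binary outcomes, which keeps the example deterministic. All function names are hypothetical.

```python
import math
import random

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def bisect_increasing(f, target, lower, upper, iterations=30):
    """Bisection for a function assumed to increase over (lower, upper)."""
    for _ in range(iterations):
        mid = (lower + upper) / 2.0
        if f(mid) < target:
            lower = mid
        else:
            upper = mid
    return (lower + upper) / 2.0

def prevalence(b0, b1, xs):
    """Expected outcome prevalence under the logistic model."""
    return sum(expit(b0 + b1 * x) for x in xs) / len(xs)

def c_statistic(b0, b1, xs_sorted):
    """Expected concordance between outcome and linear predictor.
    Assumes xs_sorted is ascending and b1 >= 0, so that the
    predicted probabilities are ascending as well."""
    ps = [expit(b0 + b1 * x) for x in xs_sorted]
    concordant = 0.0
    cum_nonevent = 0.0          # running sum of (1 - p_j) over smaller x
    for p in ps:
        concordant += p * cum_nonevent
        cum_nonevent += 1.0 - p
    # All ordered (event, non-event) pairs of distinct subjects.
    pairs = sum(ps) * cum_nonevent - sum(p * (1.0 - p) for p in ps)
    return concordant / pairs

rng = random.Random(2023)
xs = sorted(rng.gauss(0.0, 1.0) for _ in range(1000))
target_prev, target_c = 0.20, 0.75

b0, b1 = 0.0, 1.0
for _ in range(8):
    # Step 1: fix the slope and bisect on the intercept for the prevalence.
    b0 = bisect_increasing(lambda a: prevalence(a, b1, xs),
                           target_prev, -10.0, 10.0)
    # Step 2: fix the intercept and bisect on the slope for the c-statistic.
    b1 = bisect_increasing(lambda b: c_statistic(b0, b, xs),
                           target_c, 0.0, 10.0)

print(b0, b1, prevalence(b0, b1, xs), c_statistic(b0, b1, xs))
```

The outer loop is the iterative repetition described in the text: updating the slope shifts the prevalence slightly, so the intercept is re-solved on the next pass until both characteristics settle near their targets.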

The bisection procedure has been used successfully in previous simulation-based studies: to assess the ability to rank hospitals by their performance on composite indicators [17], to describe a data-generating process for data with a specified marginal odds ratio [18], to describe a data-generating process for data with a specified risk difference or number needed to treat [19], to determine the rate parameter for an exponential censoring distribution so as to induce the desired proportion of censoring [20], to study the performance of double propensity score adjustment [21], to study the performance of the generalized propensity score for estimating the effect of quantitative exposures on time-to-event outcomes [22], to assess the performance of propensity score methods for estimating marginal hazard ratios [23], to study the use of the bootstrap with propensity score matching [24], to compare algorithms for matching on the propensity score [25], to assess the use of optimal matching with survival outcomes [26], to assess methods of variance estimation when using inverse probability of treatment weighting with survival outcomes [27], to assess the use of propensity score matching in the presence of competing risks [28], to assess the performance of calibration metrics for survival models [29], to assess the performance of variance estimators for survival outcomes when using propensity score matching with replacement [30], to assess the effect of constraints on the matching ratio when using full matching [31], to examine the consequences of multiply-imputing missing potential outcomes under control [32], and to examine sample size and power calculations when using inverse probability of treatment weighting [33].

Conclusion

We have described an iterative bisection procedure that can be used in designing data-generating processes that produce simulated datasets with specific characteristics.

Acknowledgements

Not applicable.

Abbreviations

AUC

Area under the curve

RR

Relative risk

HR

Hazard ratio

ROC

Receiver operating characteristic

Authors’ contributions

PA conceived the study, conducted the simulations, wrote the manuscript, and approved the final manuscript.

Authors’ information

Not applicable.

Funding

ICES is an independent, non-profit research institute funded by an annual grant from the Ontario Ministry of Health (MOH) and the Ministry of Long-Term Care (MLTC). As a prescribed entity under Ontario’s privacy legislation, ICES is authorized to collect and use health care data for the purposes of health system analysis, evaluation and decision support. Secure access to these data is governed by policies and procedures that are approved by the Information and Privacy Commissioner of Ontario. The opinions, results and conclusions reported in this paper are those of the authors and are independent from the funding sources. No endorsement by ICES or the Ontario MOH or MLTC is intended or should be inferred. This research was supported by an operating grant from the Canadian Institutes of Health Research (CIHR) (PJT 166161).

Availability of data and materials

No data were used in the study. The software code for simulating artificial datasets is provided on the author’s GitHub account [https://github.com/peter-austin/BMC_MRM-bisection-procedures-for-Monte-Carlo-simulations].

Declarations

Ethics approval and consent to participate

No data were used in this study. The study described algorithms for randomly generating data.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11):2074–2102. doi: 10.1002/sim.8086.
  • 2. Harrison RL. Introduction To Monte Carlo Simulation. AIP Conf Proc. 2010;1204:17–21. doi: 10.1063/1.3295638.
  • 3. Austin PC. The performance of different propensity-score methods for estimating relative risks. J Clin Epidemiol. 2008;61(6):537–545. doi: 10.1016/j.jclinepi.2007.07.011.
  • 4. Austin PC. Comparing paired vs non-paired statistical methods of analyses when making inferences about absolute risk reductions in propensity-score matched samples. Stat Med. 2011;30(11):1292–1301. doi: 10.1002/sim.4200.
  • 5. Hosmer DW, Lemeshow S. Applied Logistic Regression. New York, NY: John Wiley & Sons; 1989.
  • 6. Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful measures of the consequences of treatment. N Engl J Med. 1988;318:1728–1733. doi: 10.1056/NEJM198806303182605.
  • 7. Cook RJ, Sackett DL. The number needed to treat: a clinically useful measure of treatment effect. BMJ. 1995;310(6977):452–454. doi: 10.1136/bmj.310.6977.452.
  • 8. Sackett DL. Down with odds ratios! Evid Based Med. 1996;1:164–166.
  • 9. Jaeschke R, Guyatt G, Shannon H, Walter S, Cook D, Heddle N. Basic statistics for clinicians: 3. Assessing the effects of treatment: measures of association. Can Med Assoc J. 1995;152(3):351–7.
  • 10. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66:688–701. doi: 10.1037/h0037350.
  • 11. Steyerberg EW. Clinical Prediction Models. 2nd ed. New York: Springer-Verlag; 2019.
  • 12. Harrell FE Jr. Regression modeling strategies. 2nd ed. New York, NY: Springer-Verlag; 2015.
  • 13. Austin PC, Steyerberg EW. Interpreting the concordance statistic of a logistic regression model: relation to the variance and odds ratio of a continuous explanatory variable. BMC Med Res Methodol. 2012;12:82. doi: 10.1186/1471-2288-12-82.
  • 14. Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society - Series B. 1972;34:187–220.
  • 15. Gail MH, Wieand S, Piantadosi S. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika. 1984;71:431–444. doi: 10.1093/biomet/71.3.431.
  • 16. Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Stat Med. 2005;24(11):1713–1723. doi: 10.1002/sim.2059.
  • 17. Austin PC, Ceyisakar IE, Steyerberg EW, Lingsma HF, Marang-van de Mheen PJ. Ranking hospital performance based on individual indicators: can we increase reliability by creating composite indicators? BMC Med Res Methodol. 2019;19(1):131. doi: 10.1186/s12874-019-0769-x.
  • 18. Austin PC, Stafford J. The performance of two data-generation processes for data with specified marginal treatment odds ratios. Communications in Statistics - Simulation and Computation. 2008;37:1039–1051. doi: 10.1080/03610910801942430.
  • 19. Austin PC. A data-generation process for data with specified risk differences or numbers needed to treat. Communications in Statistics - Simulation and Computation. 2010;39:563–577. doi: 10.1080/03610910903528301.
  • 20. Austin PC, Putter H, Giardiello D, van Klaveren D. Graphical calibration curves and the integrated calibration index (ICI) for competing risk models. Diagn Progn Res. 2022;6(1):2. doi: 10.1186/s41512-021-00114-6.
  • 21. Austin PC. Double propensity-score adjustment: A solution to design bias or bias due to incomplete matching. Stat Methods Med Res. 2017;26(1):201–222. doi: 10.1177/0962280214543508.
  • 22. Austin PC. Assessing the performance of the generalized propensity score for estimating the effect of quantitative or continuous exposures on survival or time-to-event outcomes. Stat Methods Med Res. 2019;28(8):2348–2367. doi: 10.1177/0962280218776690.
  • 23. Austin PC. The performance of different propensity score methods for estimating marginal hazard ratios. Stat Med. 2013;32(16):2837–2849. doi: 10.1002/sim.5705.
  • 24. Austin PC, Small DS. The use of bootstrapping when using propensity-score matching without replacement: A simulation study. Stat Med. 2014;33(24):4306–4319. doi: 10.1002/sim.6276.
  • 25. Austin PC. A comparison of 12 algorithms for matching on the propensity score. Stat Med. 2014;33(6):1057–1069. doi: 10.1002/sim.6004.
  • 26. Austin PC, Stuart EA. Optimal full matching for survival outcomes: a method that merits more widespread use. Stat Med. 2015;34(30):3949–3967. doi: 10.1002/sim.6602.
  • 27. Austin PC. Variance estimation when using inverse probability of treatment weighting (IPTW) with survival analysis. Stat Med. 2016;35(30):5642–5655. doi: 10.1002/sim.7084.
  • 28. Austin PC, Fine JP. Propensity-score matching with competing risks in survival analysis. Stat Med. 2019;38(5):751–777. doi: 10.1002/sim.8008.
  • 29. Austin PC, Harrell FE Jr, van Klaveren D. Graphical calibration curves and the integrated calibration index (ICI) for survival models. Stat Med. 2020;39(21):2714–2742. doi: 10.1002/sim.8570.
  • 30. Austin PC, Cafri G. Variance estimation when using propensity-score matching with replacement with survival or time-to-event outcomes. Stat Med. 2020;39(11):1623–1640. doi: 10.1002/sim.8502.
  • 31. Austin PC, Stuart EA. The effect of a constraint on the maximum number of controls matched to each treated subject on the performance of full matching on the propensity score when estimating risk differences. Stat Med. 2021;40(1):101–118. doi: 10.1002/sim.8764.
  • 32. Austin PC, Rubin DB, Thomas N. Estimating adjusted risk differences by multiply-imputing missing control binary potential outcomes following propensity score-matching. Stat Med. 2021;40(25):5565–5586. doi: 10.1002/sim.9141.
  • 33. Austin PC. Informing power and sample size calculations when using inverse probability of treatment weighting using the propensity score. Stat Med. 2021;40(27):6150–6163. doi: 10.1002/sim.9176.


Articles from BMC Medical Research Methodology are provided here courtesy of BMC
