American Journal of Epidemiology. 2021 Feb 17;190(8):1696–1698. doi: 10.1093/aje/kwab039

SIMULATION IN PRACTICE: THE BALANCING INTERCEPT

Jacqueline E Rudolph, Jessie K Edwards, Ashley I Naimi, Daniel J Westreich
PMCID: PMC8530150  PMID: 33595061

Simulation is an important tool within epidemiology for both learning and developing new methodology (1–5). Unfortunately, few epidemiology training programs teach basic simulation methods. Briefly, when conducting a simulation experiment, we generally follow the same basic steps. We first decide which variables to include, as well as their distributions and associations, often aided by a causal diagram. We then generate those variables by sampling from their specified distributions and estimate whatever target parameter is of interest (e.g., sample average or causal effect). We finally repeat the process multiple times, building a distribution for the target parameter from the estimates obtained in each replicate.
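The steps above can be sketched in a few lines of code. The following is a minimal Python sketch, not the authors' code (their example code is in the Web Appendix). The function names, the Bernoulli probability of 0.3, the sample size, the number of replicates, and the seed are all illustrative assumptions.

```python
import random
import statistics


def one_replicate(n, p, rng):
    """Generate n Bernoulli(p) values and estimate the target (here, the sample mean)."""
    draws = [1 if rng.random() < p else 0 for _ in range(n)]
    return sum(draws) / n


def run_simulation(replicates, n, p, seed):
    """Repeat the experiment and collect one estimate per replicate."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [one_replicate(n, p, rng) for _ in range(replicates)]


# Hypothetical settings chosen for illustration only.
estimates = run_simulation(replicates=200, n=1_000, p=0.3, seed=1)
print("mean of estimates:", round(statistics.mean(estimates), 3))
print("SD of estimates:  ", round(statistics.stdev(estimates), 3))
```

The list `estimates` is the simulated distribution of the target parameter; its mean and standard deviation summarize the estimator's center and variability across replicates.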

Table 1.

Estimated Marginal Probabilities Under Different Simulation Specifications, in a Single Simulation of 10,000 Units

| Simulation^a | Y Model Intercept | X Model Intercept | L Model Intercept | $\hat{P}(L=1)$ | $\hat{P}(X=1)$ | $\hat{P}(Y=1)$ |
|---|---|---|---|---|---|---|
| Generate Y | 0.3 | NA | NA | NA | NA | 0.295 |
| Generate X and Y | 0.3 | 0.5 | NA | NA | 0.490 | 0.400 |
| | $P(Y=1) - \beta_1 P(X=1)$ | 0.5 | NA | NA | 0.490 | 0.288 |
| Generate L, X, and Y | 0.3 | 0.5 | 0.8 | 0.799 | 0.577 | 0.497 |
| | $P(Y=1) - \beta_1 P(X=1) - \beta_2 P(L=1)$ | $P(X=1) - \alpha_1 P(L=1)$ | 0.8 | 0.799 | 0.501 | 0.301 |

Abbreviations: $\hat{P}(\cdot)$, estimated probability of the event in parentheses; NA, not applicable.

^a Variables generated: Y, outcome; X, exposure; L, confounder.

Here, we briefly demonstrate one key simulation practice: the balancing intercept, which allows us to specify in the first step above the marginal mean of a variable and see that mean preserved (in expectation) in the simulated sample. This is attractive because the marginal probability is an easily interpretable quantity. Additionally, if the simulation is designed to mimic real data, the marginal probability of a variable is often a known, reported characteristic of the data (although, conversely, there could be areas in which researchers know more about the conditional probability of that variable). Use of a balancing intercept also makes validation of simulation code simple; regardless of how other simulation parameters change, as long as the balancing intercept is specified correctly, the marginal probability of a variable in the simulated data set should match the desired marginal probability.

Suppose we were only interested in generating a binary outcome $Y$ for $n = 10{,}000$ simulated individuals, from a Bernoulli distribution with probability $P(Y=1) = 0.3$. We can express this probability in terms of an intercept-only model: $P(Y=1) = \beta_0$, where $\beta_0 = 0.3$. It is then relatively easy to use any software program of one’s choosing to generate $Y$ (see Web Table 1 and the Web Appendix, available at https://www.doi.org/10.1093/aje/kwab039, for example code). Although here we focus on binary $Y$, much of the discussion below also applies to continuous $Y$.

The simulation becomes slightly more complicated when we expand it to include a binary exposure $X$ (with $P(X=1) = 0.5$), which affects $Y$. Now, when we generate $Y$, we need to account for the association $X$ has with $Y$. This is commonly done by specifying a model for the conditional probability of $Y$ given $X$, $P(Y=1 \mid X)$, rather than $P(Y=1)$, although one could also model the joint distribution directly. For example, for the conditional probability we can use the linear model (specifically, a model with an identity link function):

$$P(Y=1 \mid X) = \beta_0 + \beta_1 X$$

where $\beta_0$ is the probability of $Y=1$ when $X=0$ and $\beta_1$ is the risk difference for $X$, which we set to be $0.2$. The question here is: How should we determine the appropriate value of $\beta_0$ in this model? One approach would be to add $\beta_1 X$ to the intercept-only model above:

$$P(Y=1 \mid X) = P(Y=1) + \beta_1 X = 0.3 + 0.2X$$

This is intuitive because when $\beta_1 = 0$ in the above formula, $P(Y=1 \mid X) = P(Y=1) = 0.3$. We will refer to this intercept as the “standard intercept.”

The main drawback of the standard intercept is that the marginal probability of $Y$ in our simulated sample, $\hat{P}(Y=1)$, will not be the same as the desired $P(Y=1) = 0.3$ as long as $\beta_1 \neq 0$. We see this in our simulation (Table 1). In the scenario where $\beta_0 = 0.3$, $\hat{P}(Y=1)$ was 0.400, rather than the desired 0.3. We can show why this occurs using the law of total probability:

$$P(Y=1) = P(Y=1 \mid X=1)P(X=1) + P(Y=1 \mid X=0)P(X=0) = (0.3 + 0.2)(0.5) + (0.3)(0.5) = 0.4$$
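The law-of-total-probability calculation can be checked numerically. The sketch below is an illustration in Python rather than the authors' Web Appendix code. It assumes the Table 1 settings (standard intercept 0.3, exposure prevalence 0.5, 10,000 units) and a risk difference of 0.2 for the exposure. The function name and seed are hypothetical.

```python
import random


def simulate_x_and_y(n, beta0, beta1, p_x, seed):
    """Draw X ~ Bernoulli(p_x), then Y ~ Bernoulli(beta0 + beta1 * X).

    Returns the sample proportions of X = 1 and Y = 1.
    """
    rng = random.Random(seed)
    x_total = 0
    y_total = 0
    for _ in range(n):
        x = 1 if rng.random() < p_x else 0
        y = 1 if rng.random() < beta0 + beta1 * x else 0
        x_total += x
        y_total += y
    return x_total / n, y_total / n


P_Y = 0.3    # desired marginal probability, used directly as the standard intercept
P_X = 0.5    # marginal probability of the exposure
BETA1 = 0.2  # risk difference for the exposure (assumed value)

# Law of total probability: weight each conditional risk by P(X = x).
expected_p_y = (P_Y + BETA1) * P_X + P_Y * (1 - P_X)

p_x_hat, p_y_hat = simulate_x_and_y(10_000, P_Y, BETA1, P_X, seed=2021)
print("expected P(Y=1) under the standard intercept:", expected_p_y)
print("simulated P(X=1), P(Y=1):", p_x_hat, p_y_hat)
```

Under these assumptions the simulated proportion of $Y=1$ should sit near the analytic value of 0.4 rather than the desired 0.3, up to random error.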

If we do wish to preserve the specified marginal probability, we can replace the standard intercept with what we refer to here as the “balancing intercept.” To do this, we could specify in the model for $P(Y=1 \mid X)$

$$\beta_0 = P(Y=1) - \beta_1 P(X=1)$$

where $P(Y=1)$ is the desired marginal probability of $Y$. This intercept balances out the influence that the effect of $X$ on $Y$ has on the expectation of $Y$ by including an offset that multiplies the effect size for $X$ by the expected value of $X$. For binary variables, the expected value of $X$ is the marginal probability, $P(X=1)$; if $X$ were continuous, the expected value would be the mean. Note that the sign is opposite of that in the formula for the conditional probability; that is, for (positive) $\beta_1$, we include the term $-\beta_1 P(X=1)$. In the simulation scenario in which $\beta_0 = 0.3 - 0.2 \times 0.5 = 0.2$, we estimated that $\hat{P}(Y=1) = 0.288$, which is closer to the $P(Y=1) = 0.3$ we desired. The remaining difference is largely due to random error, which decreases if we increase $n$.
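As a rough illustration of the balancing intercept and of how random error shrinks with sample size, consider the Python sketch below. It is not the authors' code. It assumes a desired marginal probability of 0.3, an exposure prevalence of 0.5, and a risk difference of 0.2. The function names, replicate counts, sample sizes, and seed are hypothetical.

```python
import random
import statistics


def balancing_intercept(p_y, beta1, mean_x):
    """Desired marginal P(Y=1) minus the effect of X times the expected value of X."""
    return p_y - beta1 * mean_x


def estimate_p_y(n, beta0, beta1, p_x, rng):
    """Simulate X ~ Bernoulli(p_x), then Y from P(Y=1|X) = beta0 + beta1 * X.

    Returns the sample proportion of Y = 1.
    """
    y_total = 0
    for _ in range(n):
        x = 1 if rng.random() < p_x else 0
        y_total += 1 if rng.random() < beta0 + beta1 * x else 0
    return y_total / n


P_Y, P_X, BETA1 = 0.3, 0.5, 0.2  # assumed simulation settings
beta0 = balancing_intercept(P_Y, BETA1, P_X)  # 0.3 - 0.2 * 0.5

rng = random.Random(2021)
p_y_hat = estimate_p_y(10_000, beta0, BETA1, P_X, rng)

# Spread of the estimate across 100 replicates at two sample sizes.
sd_small_n = statistics.stdev([estimate_p_y(500, beta0, BETA1, P_X, rng) for _ in range(100)])
sd_large_n = statistics.stdev([estimate_p_y(5_000, beta0, BETA1, P_X, rng) for _ in range(100)])

print("balancing intercept:", round(beta0, 3))
print("simulated P(Y=1):", p_y_hat)
print("SD of estimates at n=500 vs n=5000:", round(sd_small_n, 4), round(sd_large_n, 4))
```

The single-sample estimate should land near 0.3, and the replicate-to-replicate spread should be visibly smaller at the larger sample size.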

The approach presented here can be extended to account for more variables. Let us add to our simulation a binary confounder $L$, with $P(L=1) = 0.8$, which affects the exposure $X$ and the outcome $Y$. Accordingly, in our simulation, we model the conditional probabilities of $X$ given $L$ and of $Y$ given $X$ and $L$, as follows:

$$P(X=1 \mid L) = \alpha_0 + \alpha_1 L$$
$$P(Y=1 \mid X, L) = \beta_0 + \beta_1 X + \beta_2 L$$

Note that we first simulate $X$ based on $L$, then $Y$ based on $X$ and $L$ together. Furthermore, each equation includes a balancing intercept:

$$\alpha_0 = P(X=1) - \alpha_1 P(L=1)$$
$$\beta_0 = P(Y=1) - \beta_1 P(X=1) - \beta_2 P(L=1)$$

In our simulation, we specified Inline graphic and Inline graphic (i.e., no interaction between $X$ and $L$ on the additive scale) and estimated that $\hat{P}(X=1) = 0.501$ and $\hat{P}(Y=1) = 0.301$.
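The three-variable simulation can be sketched as follows. This is an illustration in Python, not the authors' code. The marginal probabilities (0.8, 0.5, 0.3) come from Table 1. The risk difference of 0.2 for the exposure, and especially the values of 0.1 for the effect of the confounder on the exposure and on the outcome, are assumptions made here for the example. The function name and seed are hypothetical.

```python
import random


def simulate_lxy(n, p_l, p_x, p_y, alpha1, beta1, beta2, seed):
    """Simulate L, then X given L, then Y given X and L, using linear models
    with balancing intercepts. Returns the sample proportions of L, X, and Y."""
    alpha0 = p_x - alpha1 * p_l               # balancing intercept for the X model
    beta0 = p_y - beta1 * p_x - beta2 * p_l   # balancing intercept for the Y model
    rng = random.Random(seed)
    totals = {"L": 0, "X": 0, "Y": 0}
    for _ in range(n):
        l = 1 if rng.random() < p_l else 0
        x = 1 if rng.random() < alpha0 + alpha1 * l else 0
        y = 1 if rng.random() < beta0 + beta1 * x + beta2 * l else 0
        totals["L"] += l
        totals["X"] += x
        totals["Y"] += y
    return {name: total / n for name, total in totals.items()}


# alpha1 = 0.1 and beta2 = 0.1 are assumed values for illustration.
proportions = simulate_lxy(
    n=10_000, p_l=0.8, p_x=0.5, p_y=0.3,
    alpha1=0.1, beta1=0.2, beta2=0.1, seed=2021,
)
print(proportions)
```

With these assumed coefficients, every conditional probability stays inside (0, 1), and the three sample proportions should fall close to 0.8, 0.5, and 0.3.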

We can model the conditional probabilities using functions other than the linear model. In fact, when simulating a binary variable, we might prefer to use the inverse of the logit function (expit function), which will guarantee that $P(Y=1 \mid X, L)$ will fall within the interval (0, 1). A linear model, on the other hand, could lead to probabilities falling outside this range (e.g., a probability of 1.04). The expit model (see Web Table 2 and the Web Appendix for code) takes the form:

$$P(Y=1 \mid X, L) = \operatorname{expit}(\beta_0 + \beta_1 X + \beta_2 L) = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 X + \beta_2 L)]}$$

where $\beta_0$ is the intercept, $\beta_1$ is the conditional log odds ratio (log(OR)) for $X$, and $\beta_2$ is the conditional log(OR) for $L$. The standard intercept for this model would be the marginal log odds of $Y$:

$$\beta_0 = \log\left[\frac{P(Y=1)}{1 - P(Y=1)}\right]$$

We can instead specify a balancing intercept, as follows:

$$\beta_0 = \log\left[\frac{P(Y=1)}{1 - P(Y=1)}\right] - \beta_1 P(X=1) - \beta_2 P(L=1)$$
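To make the two intercepts concrete for the expit model, here is a hedged Python sketch (not the authors' Web Appendix code). It assumes marginal probabilities of 0.8, 0.5, and 0.3 for the confounder, exposure, and outcome. The conditional log odds ratios of 0.5, the linear exposure model with coefficient 0.1, the function names, and the seed are all assumptions made for illustration.

```python
import math
import random


def expit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))


P_L, P_X, P_Y = 0.8, 0.5, 0.3
ALPHA1 = 0.1              # assumed effect of L on X (linear model)
BETA1, BETA2 = 0.5, 0.5   # assumed conditional log odds ratios for X and L

alpha0 = P_X - ALPHA1 * P_L
beta0_standard = logit(P_Y)
beta0_balancing = logit(P_Y) - BETA1 * P_X - BETA2 * P_L


def simulate_p_y(n, beta0, seed):
    """Simulate L, X given L (linear), and Y given X and L (expit).

    Returns the sample proportion of Y = 1.
    """
    rng = random.Random(seed)
    y_total = 0
    for _ in range(n):
        l = 1 if rng.random() < P_L else 0
        x = 1 if rng.random() < alpha0 + ALPHA1 * l else 0
        y_total += 1 if rng.random() < expit(beta0 + BETA1 * x + BETA2 * l) else 0
    return y_total / n


p_y_standard = simulate_p_y(10_000, beta0_standard, seed=2021)
p_y_balancing = simulate_p_y(10_000, beta0_balancing, seed=2021)
print("standard intercept:  P(Y=1) estimate =", p_y_standard)
print("balancing intercept: P(Y=1) estimate =", p_y_balancing)
```

Under these assumed coefficients, the estimate from the balancing intercept should be much closer to the desired 0.3 than the estimate from the standard intercept.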

The balancing intercept allows a researcher to set a desired marginal probability for $Y$ and then see that same probability manifested in the resulting simulation (in expectation). We should note, though, that the balancing intercept approach is an approximation and could fail as the model for the conditional probability of $Y$ becomes extreme (e.g., when variables in the model have complex distributions). Even so, we believe that using a balancing intercept is a best practice for simulation design that deserves wider attention and adoption.
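The approximate nature of the balancing intercept under an expit model can be examined without any random draws, by enumerating the four combinations of a binary exposure and confounder and applying the law of total probability. The Python sketch below is one illustration, not the authors' code. The marginal probabilities (0.8, 0.5, 0.3), the linear exposure model with coefficient 0.1, and both sets of log odds ratios (0.5 as a "mild" case, 3.0 as an "extreme" case) are assumptions chosen for the example.

```python
import math


def expit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def exact_marginal_p_y(beta1, beta2, p_l=0.8, p_x=0.5, p_y=0.3, alpha1=0.1):
    """Exact marginal P(Y=1) implied by an expit outcome model that uses
    the balancing intercept, for binary L and X."""
    alpha0 = p_x - alpha1 * p_l  # balancing intercept, linear model for X
    beta0 = math.log(p_y / (1.0 - p_y)) - beta1 * p_x - beta2 * p_l
    total = 0.0
    for l in (0, 1):
        prob_l = p_l if l == 1 else 1.0 - p_l
        prob_x1 = alpha0 + alpha1 * l
        for x in (0, 1):
            prob_x = prob_x1 if x == 1 else 1.0 - prob_x1
            total += prob_l * prob_x * expit(beta0 + beta1 * x + beta2 * l)
    return total


mild = exact_marginal_p_y(beta1=0.5, beta2=0.5)     # assumed modest log odds ratios
extreme = exact_marginal_p_y(beta1=3.0, beta2=3.0)  # assumed very large log odds ratios
print("implied P(Y=1), mild coefficients:   ", round(mild, 3))
print("implied P(Y=1), extreme coefficients:", round(extreme, 3))
```

In this setup, the implied marginal probability stays close to the desired 0.3 for the modest coefficients and drifts noticeably away from it for the very large ones, so checking the realized marginal probability remains worthwhile.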

Supplementary Material

Web_Material_kwab039

ACKNOWLEDGMENTS

All authors contributed equally to this work.

This work was supported in part by National Institutes of Health grants R01 HD093602 and K01 AI125087.

We thank Dr. Stephen Cole for his insightful and thoughtful feedback on a draft of this paper. We also thank Dr. Tim Morris for providing STATA code to match our R and SAS code.

Conflicts of interest: none declared.

REFERENCES

  1. Hodgson T, Burke M. On simulation and the teaching of statistics. Teach Stat. 2000;22:91–96.
  2. Westreich D, Cole SR, Schisterman EF, et al. A simulation study of finite-sample properties of marginal structural Cox proportional hazards models. Stat Med. 2012;31(19):2098–2109.
  3. Lesko CR, Lau B. Bias due to confounders for the exposure-competing risk relationship. Epidemiology. 2017;28(1):20–27.
  4. Jurek AM, Greenland S, Maldonado G, et al. Proper interpretation of non-differential misclassification effects: expectations vs observations. Int J Epidemiol. 2005;34(3):680–687.
  5. Rudolph JE, Cole SR, Eron JJ, et al. Estimating human immunodeficiency virus (HIV) prevention effects in low-incidence settings. Epidemiology. 2019;30(3):358–364.


Articles from American Journal of Epidemiology are provided here courtesy of Oxford University Press
