Author manuscript; available in PMC 2022 Jul 19. Published in final edited form as: Med Decis Making. 2021;41(6):714–726. doi:10.1177/0272989X211009161

Bayesian versus Empirical Calibration of Microsimulation Models: A Comparative Analysis

Stavroula A Chrysanthopoulou 1, Carolyn M Rutter 2, Constantine A Gatsonis 3
PMCID: PMC9294658  NIHMSID: NIHMS1821774  PMID: 33966518

Abstract

Calibration of a microsimulation model (MSM) is a challenging but crucial step in the development of a valid model. Numerous calibration methods for MSMs have been suggested in the literature, most of which are adjusted to the specific needs of the model and based on subjective criteria for the selection of optimal parameter values. This article compares 2 general approaches for calibrating MSMs used in medical decision making, a Bayesian and an empirical approach. We use as a tool the MIcrosimulation Lung Cancer (MILC) model, a streamlined, continuous-time, dynamic MSM that describes the natural history of lung cancer and predicts individual trajectories accounting for age, sex, and smoking habits. We apply both methods to calibrate MILC to observed lung cancer incidence rates from the Surveillance, Epidemiology and End Results (SEER) database. We compare the results from the 2 methods in terms of the resulting parameter distributions, model predictions, and efficiency. Although the empirical method proved more practical, producing similar results with smaller computational effort, the Bayesian method resulted in a calibrated model that produced more accurate outputs for rare events and rests on a well-defined theoretical framework for the evaluation and interpretation of the calibration outcomes. A combination of the 2 approaches is an alternative worth considering for calibrating complex predictive models, such as microsimulation models.

Keywords: Bayesian calibration, empirical calibration, microsimulation model, comparative analysis


Microsimulation models (MSMs) are complex predictive models aimed at simulating a process at the unit level. This novel methodology has applications in several fields, including operations research to describe queuing systems,1 civil engineering to simulate traffic flows,2,3 health policy to evaluate the impact of interventions on populations,4,5 and so forth. When used for medical decision making, MSMs usually describe the natural history of a disease, including the effect of important risk factors,6 and are used to assess the effect of interventions on populations to inform health policy by simulating aggregated individual outcomes under different policy scenarios.7,8 For example, MSMs have been used by the United States Preventive Services Task Force (USPSTF, https://www.uspreventiveservicestaskforce.org/) to guide decisions related to breast, cervical, lung, and colorectal cancer screening.

A key component in the development of an MSM is the specification of the model parameters. The process used to select parameter values that result in model predictions close to the observed quantities of interest9 is referred to as “calibration,” “estimation,” or “model fitting.”10–12 Rutter and colleagues7 defined calibration as the process of specifying parameter values for which the model reproduces observed results, when direct estimation is not possible. By necessity, MSMs are calibrated using observed data collected in the past (calibration targets). The calibrated MSM can then be used to make predictions within the same time frame but is more often used to make future projections (outside the time frame of the available data) comparing different scenarios.

The high level of complexity of MSMs usually makes it difficult, or even impossible, to derive closed-form expressions of the predictions as functions of the model parameters. This complexity, in conjunction with the fact that model parameters often correspond to latent characteristics, gives rise to identifiability problems. When an MSM is not identifiable based on available targets, multiple sets of parameter values can produce similar model predictions, which can reveal interesting relationships among the MSM parameters and predictions. Understanding and describing these relationships can considerably improve the structure of the model, so it can more accurately and efficiently describe the underlying process and thus also enhance its predictive ability. Furthermore, the determination of multiple “acceptable” sets of values for the MSM parameters rather than an optimal one allows for capturing parameter uncertainty, a major source of variability in MSM outcomes, and conveying the effect it has on MSM predictions.

We consider calibration as the most suitable technique for parameter specification in the context of microsimulation modeling when direct estimation of the parameters is not possible and view it as a procedure aimed at identifying sets of parameter values for which the MSM fits well to calibration targets.

Vanni and colleagues10 provided a systematic overview of calibration techniques for mathematical models used in economic evaluation and organized the entire process into 7 essential steps. These steps are also applicable to the calibration of any complex model, including MSMs. Stout and colleagues11 classified commonly used parameter specification techniques, applied to cancer simulation models, into “purely analytic”13–16 and “calibration” methods.17,18 In the same article, they further classified calibration methods into directed and undirected, based on the search algorithm for the multidimensional parameter space. Undirected methods involve an exhaustive grid search19,20 or a grid search using some sampling design, such as random18,21–23 or Latin hypercube24–27 sampling, which can result in either the “best” value or a set of “optimal” values for each of the calibrated parameters. Directed methods, on the other hand, specifically aim at finding a single (optimum) set of parameter values using some optimization algorithm.12,28–30 Bayesian calibration methods are also used in microsimulation modeling to draw values from the joint posterior distribution of the MSM parameters given a priori beliefs about them and the calibration targets.31 In practice, modelers often adopt empirical or entirely ad hoc approaches, not informed by a fully developed theoretical framework, when choosing methodological tools and methods for calibrating complex predictive models.

The main characteristics of a calibration procedure for MSMs, determining to a great extent the final results, are how parameter vectors are selected for evaluation and the stopping rule. The stopping rule determines when an optimal value or set of values has been found, or, in the case of Bayesian estimation, when the set of accepted draws provides a good approximation to the posterior distribution. The stopping rule may be somewhat arbitrary and therefore may be dictated by convenience (e.g., based on run times). In addition, the resulting distributions of the calibrated parameters do not always have a sound theoretical foundation that leads to clear interpretation. An exception is Bayesian calibration approaches, which result in estimates of the joint posterior distribution of calibrated parameters.

Research in the calibration of MSMs has primarily addressed the development and implementation of methods in this area. To our knowledge, no comparative analysis of calibration methods for MSMs has been published.

The main purpose of this article is to present a thorough comparative analysis of 2 generic calibration methods used in microsimulation modeling, a Bayesian and an empirical approach. We used a simulation study to compare the results from the 2 calibration methods applied to the MIcrosimulation Lung Cancer (MILC) model, a streamlined MSM that describes the natural history of lung cancer. Evaluation of the calibrated models involves both internal and external validation. Comparison of the 2 methods includes graphical, quantitative, and qualitative assessments of the resulting sets of calibrated parameters, the model predictions, and the efficiency of each method. We use the results of our analysis to formulate recommendations for implementing and comparing calibration methods in MSMs.

Methods

The MILC Model

Chrysanthopoulou32 developed the MILC model, a streamlined MSM that describes the natural history of lung cancer in the absence of any screening or treatment components. In short, MILC is a continuous-time, individual-based, state-transition, complex predictive model that assumes 5 distinct Markov states for the natural history of lung cancer and a set of transition rules among these states. Given individual-level information at baseline (the beginning of the prediction period), the model employs Markov chain Monte Carlo methods to simulate individual trajectories and predict lung cancer–related outcomes of interest (e.g., incidence, mortality, tumor size and stage at diagnosis, etc.). In that sense, the MILC model shares common characteristics with both discrete event and individual-based state transition models.33

The model simulates individual disease trajectories given certain baseline characteristics including age, sex, and smoking. Information about smoking comprises smoking status (current, former, or nonsmoker), start and quit smoking ages, and smoking intensity expressed as average number of cigarettes smoked per day, when relevant. MILC is designed to constitute a major part of a comprehensive decision-making model for evaluating lung cancer prevention, screening, and treatment policies.

The model comprises 4 major components designed to represent 1) the onset of the first malignant cell, following the biological two-stage clonal expansion model13; 2) tumor growth, assuming a Gompertz distribution34; 3) disease progression, assuming log-normal distributions for tumor volumes at the beginning of the regional and distant stages, as well as at diagnosis19,35–37; and 4) survival, using cumulative incidence function estimates of lung cancer and other-cause mortality to account for smoking-related competing risks.

A complete individual trajectory contains predictions about several aspects of the disease. Primary predicted outcomes are times to key events marking the disease process during the prediction period, such as time at diagnosis and death from lung cancer (or censoring observations if these events do not occur before the end of the prediction period). The model also predicts age at the beginning of each major tumor state (local, regional, or distant), along with tumor size and stage at tumor diagnosis. By aggregating MSM predictions (i.e., multiple individual trajectories), we can estimate lung cancer–related quantities of interest such as incidence and mortality trends for subpopulations. Chrysanthopoulou38 has also created a library for the implementation of the MILC model in R.

Calibration

Traditionally, calibration has focused on identifying a single set of optimal parameter values for which the model fits the outcome(s) of interest well.39 Calibrating an MSM requires individual-level information (microdata) on the characteristics of the underlying population at the starting point (e.g., disease-free state) of the simulated trajectories, determination of the prediction period (time from starting point to end point), and observed data on the quantities of interest (calibration targets) that the calibrated model should accurately predict. Specification of plausible ranges for model parameters is also crucial to the effectiveness and efficiency of a calibration method and can be based on estimates from previous studies, expert opinion for observable variables, or even the results of ad hoc analyses. A complete calibration procedure involves sampling and evaluating candidate values from the multidimensional parameter space, using microdata to predict individual trajectories, and assessing the overall fit of the model by comparing aggregated predictions to calibration targets.

Bayesian Calibration

Rutter and colleagues31 provide one of the first attempts at implementing a comprehensive Bayesian approach for calibrating an MSM. In this framework, the goal is to employ Bayesian reasoning to draw values from the joint posterior distribution h(θ|Y) of the parameter vector θ of length K, where K is the total number of calibrated parameters, given a priori information π(θ) and available information about prespecified calibration targets Y. In our case, this method involves a large number of Gibbs sampler iterations with embedded approximate Metropolis-Hastings (MH) steps to simulate draws from the unknown forms of the full conditional distributions. In particular, within each Gibbs sampler step, we implement multiple iterations of a random-walk MH algorithm. Although computationally expensive, such a combination of a Gibbs sampler with an MH algorithm is often necessary due to the complexity of MSMs, because there are usually no closed-form expressions of the data likelihood as a function of the model parameters.

Given a symmetric jumping distribution, the MH algorithm accepts a new value $\theta_k^*$ for the $k$th element of the parameter vector $\theta$ with transition probability

$$a(\theta_k^*, \theta_k) = \begin{cases} \min\left(r(\theta_k^*, \theta_k),\, 1\right), & \text{if } h(\theta_k \mid Y) > 0 \\ 1, & \text{if } h(\theta_k \mid Y) = 0 \end{cases} \tag{1}$$

$$r(\theta_k^*, \theta_k) = \frac{h(\theta_k^* \mid Y)}{h(\theta \mid Y)} = \frac{\pi(\theta_k^*)\, \prod_{j=1}^{J} f_j\!\left(y_j \mid g(\theta_k^*)\right)}{\pi(\theta)\, \prod_{j=1}^{J} f_j\!\left(y_j \mid g(\theta)\right)} \tag{2}$$

where $\theta_k^*$ denotes the vector $[\theta_k^*, \theta_{(-k)}]$, with $\theta_{(-k)}$ indicating the parameter vector excluding the $k$th element; $\pi(\theta)$ represents the prior distribution of the model parameters; $g$ is a function of the model parameters associated with the distribution $f$ of the calibration targets; and $h$ is the joint posterior of the calibrated parameters. This joint posterior $h$ cannot be directly calculated, as it involves an intractable integral.

Assuming that the microsimulation model $M(\theta)$ and the data distributions $f(Y, g(\theta))$ are correctly specified, we can use the model to estimate the transition probability function based on an embedded simulation. Specifically, for a given (sampled) value of $\theta$, we calculate $g(\theta)$ using the maximum likelihood estimator based on model predictions $\tilde{Y}$. For example, if the calibration target arises from binomial count data, then the data distribution is a binomial ($f(Y, g(\theta))$) with a probability of success that is an unknown function $g(\theta)$ of the model parameters. To estimate $g(\theta)$, we use the model to simulate the count outcomes for a given value of $\theta$ and then obtain the maximum likelihood estimate (MLE) $\hat{g}(\theta)$ (i.e., for the binomial, the MLE of $g(\theta)$ is the average $\hat{g}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \tilde{Y}_{ij}$); the product $\prod_{j=1}^{J} f_j(y_j \mid \hat{g}_j(\theta))$ then gives the joint distribution of the different calibration targets. We use the estimate $\hat{g}(\theta)$ to calculate the transition probability $\hat{a}(\theta^*, \theta)$ given

$$\hat{r}_k(\theta_k^*, \theta_k) = \frac{\pi(\theta_k^*)\, \prod_{j=1}^{J} f_j\!\left(y_j \mid \hat{g}_j(\theta_k^*)\right)}{\pi(\theta_k)\, \prod_{j=1}^{J} f_j\!\left(y_j \mid \hat{g}_j(\theta)\right)} \tag{3}$$

The Bayesian calibration method results in a V × K matrix of values for the calibrated parameters, denoted as ΘBayes, the rows of which represent a random sample from the joint posterior distribution h(θ|Y) of the MSM parameter vector θ. V represents the total number of draws from the joint posterior distribution of the calibrated parameters, and K the total number of calibrated parameters. This ΘBayes sample can be used to describe the posterior distributions of the calibrated MSM parameters, along with the posterior predictive distributions of the quantities of interest.
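To make the embedded-simulation MH step concrete, the sketch below shows one random-walk update for a single parameter in R (the language of the MILC package). It is a minimal illustration, not the authors' implementation: run_milc(theta, microdata, M) is a hypothetical helper that runs the MSM M times and returns the estimates ĝ(θ) of the target probabilities, targets is assumed to hold binomial counts y with denominators n, and the truncated-normal prior settings anticipate the values given in the Simulation Study section.

```r
# Minimal sketch of one random-walk MH update inside a Gibbs sweep (illustrative only).
library(truncnorm)

mu    <- c(0.0008, 4, 2.2, 5.6)      # TN means (= SDs); see the Simulation Study section
lower <- c(1e-5, 1e-4, 1e-4, 1e-4)   # TN lower bounds
upper <- c(0.0016, 8, 4.4, 11.2)     # TN upper bounds

log_prior <- function(theta)         # independent truncated-normal priors
  sum(log(dtruncnorm(theta, a = lower, b = upper, mean = mu, sd = mu)))

log_lik <- function(g_hat, targets)  # binomial likelihood for the count targets
  sum(dbinom(targets$y, size = targets$n, prob = g_hat, log = TRUE))

mh_update <- function(theta, k, targets, microdata, step) {
  theta_star    <- theta
  theta_star[k] <- theta[k] + rnorm(1, sd = step[k])           # symmetric jump
  if (theta_star[k] < lower[k] || theta_star[k] > upper[k])
    return(theta)                                              # zero prior: reject
  g_cur  <- run_milc(theta,      microdata, M = 10)            # embedded simulation,
  g_star <- run_milc(theta_star, microdata, M = 10)            # current and proposed
  log_r  <- log_prior(theta_star) + log_lik(g_star, targets) -
            log_prior(theta)      - log_lik(g_cur,  targets)   # log of equation (3)
  if (log(runif(1)) < log_r) theta_star else theta             # MH accept/reject
}
```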

Empirical Calibration

Several empirical calibration methods for microsimulation models have been suggested in the literature. The general approach of these methods essentially involves 3 main steps: 1) employing some type of sampling design to efficiently traverse the multidimensional parameter space when searching for plausible values, 2) stipulating a proximity measure between observed and predicted quantities of interest, and 3) selecting values of the parameter vectors satisfying prespecified convergence criteria. In many cases, the result from an empirical calibration procedure is a set of parameter vectors that each results in a good fit of the model to calibration targets, rather than a single best parameter vector. The empirical method that we implemented in this comparative study is essentially a combination of popular practices followed for empirical calibration of MSMs in medical research.

Our empirical calibration method employs the Latin hypercube sampling design to extract a large sample of parameter vectors, representative of the multidimensional parameter space. This space is designated by a set of plausible intervals for the model parameters, derived either from prior information on the parameters (e.g., literature review, expert opinion, etc.) or after some ad hoc analysis comparing model predictions with outcomes of interest. These plausible intervals are analogous to the prior beliefs that we incorporate in the Bayesian calibration method.

At each step of the empirical calibration, we extract a set of potential parameter values (a sample) from the multidimensional space, run the MSM a large number of times (n), and calculate estimates ($\hat{g}_j(\theta)$) of the parameters of the predicted data distributions, analogous to the Bayesian calibration method. We use these estimates to calculate the deviance statistic:

$$D = -2\left[\, l(\hat{g}(\theta) \mid Y) - l(\hat{\Lambda} \mid Y)\,\right] = -2 \sum_{j=1}^{J} \left[\, l(\hat{g}_j(\theta) \mid y_j) - l(\hat{\lambda}_j \mid y_j)\,\right] \tag{4}$$

where $\Lambda$ represents the vector of parameters of the distribution assumed for the outcome of interest (calibration target); $\hat{\lambda}_j$ is the estimate of $\Lambda$ for the $j$th covariate class; $l(\hat{\Lambda} \mid Y)$ is the log-likelihood of the saturated model, namely, a model assuming a separate parameter for each observation; and $l(\hat{g}(\theta) \mid Y)$ is the log-likelihood of the fitted (calibrated MSM) model. We use this statistic to assess the goodness of fit (GoF) of the empirically calibrated MSM to the calibration targets. The degrees of freedom of this deviance statistic depend on the prespecified calibration targets and the total number of estimated/calibrated parameters. A vector of parameter values is deemed “acceptable” if the GoF test is not rejected at a prespecified significance level α. The result of this procedure is a V × K matrix of values for the calibrated parameters, denoted as ΘEmp, analogous to the ΘBayes matrix resulting from the Bayesian calibration. The ΘEmp matrix essentially comprises a sample of “good” values for the model parameters, according to the prespecified acceptance criteria.
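A minimal R sketch of this screen follows, under the same hypothetical run_milc() helper and binomial count targets as in the Bayesian sketch above: lhs::randomLHS() draws the Latin hypercube (step 1), the deviance of equation (4) serves as the proximity measure (step 2), and a chi-squared cutoff supplies the acceptance criterion (step 3). The uniform rescaling of the hypercube and the degrees of freedom are simplifying placeholders, not the study's exact specification.

```r
# Sketch of the empirical calibration screen (illustrative assumptions as noted above).
library(lhs)

lower <- c(1e-5, 1e-4, 1e-4, 1e-4)                     # plausible-interval bounds
upper <- c(0.0016, 8, 4.4, 11.2)

S <- 100000
u <- randomLHS(S, 4)                                   # Latin hypercube on [0, 1]^4
theta_grid <- sweep(sweep(u, 2, upper - lower, "*"),   # rescale each column to its
                    2, lower, "+")                     # plausible interval

deviance_stat <- function(g_hat, targets) {
  # D = -2 [ l(g_hat | Y) - l(saturated | Y) ]; the saturated MLE is y_j / n_j
  l_fit <- sum(dbinom(targets$y, targets$n, g_hat, log = TRUE))
  l_sat <- sum(dbinom(targets$y, targets$n, targets$y / targets$n, log = TRUE))
  -2 * (l_fit - l_sat)
}

D <- apply(theta_grid, 1, function(th)
  deviance_stat(run_milc(th, microdata, M = 10), targets))

df_dev <- nrow(targets) - 4                 # placeholder: J covariate classes minus K = 4
accept <- which(D < qchisq(0.95, df_dev))   # GoF test not rejected at the 5% level
Theta_emp <- theta_grid[accept[order(D[accept])][1:1000], ]  # keep smallest-deviance vectors
```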

Simulation Study

Both calibration methods were applied to calibrate the MILC model. The goal with each method was to identify model parameters that result in accurate prediction of lung-cancer incidence rates, given baseline characteristics (age and smoking history) of the US population. We focused on men who were current smokers. We used both methods to calibrate a subset of the MILC parameters. The same calibration procedure can be followed for other MILC parameters and predictions of subpopulations.

Pseudo-population.

We combined information from the 1980 US census and the 1980 Statistical Abstract of the United States to simulate a pseudo-sample of N = 100,000 people (smpl100,000) representative of the 1980 US population. Data from this simulated pseudo-population included individual-level information about age and smoking history (1980 Statistical Abstract of the United States), based on the respective observed distributions. The goal of calibration is to select MILC model parameters that result in MILC model predictions of age group–specific lung cancer incidence rates that are similar to the respective rates extracted from the SEER database.

Microdata.

To implement each calibration method, we used as microdata a sample (smpl5,000) of n = 5000 people, representative of the 1980 US population, randomly extracted from the pseudo-population (smpl100,000). The same sample was used for the internal validation of the model. For the external model validation procedure, we extracted another random sample with individual-level characteristics of n = 10,000 people (smpl10,000) from the pseudo-population (smpl100,000).

Calibrated parameters.

We applied both methods to calibrate 4 MILC parameters in total, namely, the scale parameter of the Gompertz function (θ1 = m) and the 3 means of the log-normal distributions assumed for the tumor volumes at diagnosis (θ2 = mdiagn) and the beginning of the regional (θ3 = mreg) and distant stage (θ4 = mdist), respectively. For each of those log-normal distributions, we assumed equal location and scale parameters (i.e., mean = SD). For the remaining model parameters, we used fixed values based on information from relevant literature on the natural history of lung cancer.32

Furthermore, both calibration procedures incorporated the same initial assumptions about the range and distributions of plausible values for the 4 calibrated parameters. In particular, for the calibrated parameter values, we assume the following truncated normal (TN) distributions:

  • m = θ1 ~ TN (μ = sd = 0.0008, L = 0.00001, U = 0.0016)

  • mdiagn = θ2 ~ TN (μ = sd = 4, L = 0.0001, U = 8)

  • mreg = θ3 ~ TN (μ = sd = 2.2, L = 0.0001, U = 4.4)

  • mdist = θ4 ~ TN (μ = sd = 5.6, L = 0.0001, U = 11.2)

For the Bayesian calibration, these are the prior distributions, whereas for the empirical calibration, these are the specifications of the multidimensional parameter space from which random values are extracted using the Latin hypercube sampling method. We specified TN distributions with large variance (uninformative priors) to reflect the lack of information about these model parameters and to ensure a more complete search of the multidimensional parameter space.
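As a small illustration (not the authors' code), the CRAN truncnorm package can draw from these TN specifications directly for the Bayesian priors, or map Latin hypercube uniforms through the TN quantile function for the empirical search:

```r
library(truncnorm)
library(lhs)

mu    <- c(m = 0.0008, mdiagn = 4, mreg = 2.2, mdist = 5.6)  # mean = SD for each TN
lower <- c(1e-5, 1e-4, 1e-4, 1e-4)
upper <- c(0.0016, 8, 4.4, 11.2)

# Direct draws, e.g., to initialize the Bayesian chains:
prior_draws <- mapply(function(m, l, u) rtruncnorm(1000, a = l, b = u, mean = m, sd = m),
                      mu, lower, upper)

# Latin hypercube uniforms pushed through the TN quantile function (empirical search):
p <- randomLHS(1000, 4)
lhs_draws <- sapply(1:4, function(k)
  qtruncnorm(p[, k], a = lower[k], b = upper[k], mean = mu[k], sd = mu[k]))
```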

An important step in the simulation study is the choice of the microdata size (n) and the number of model repetitions (M) with each microdata sample used to predict the outcomes of interest and compare them with the calibration targets. To this end, we ran a sensitivity analysis to check how accurate and variable the predictions are for different combinations of n and M. We tried several combinations of n (500, 1000, 2500, 5000) and M (10, 20, 30, 50, 100) values to explore the effect these numbers have on the predicted outcomes (age-specific lung cancer incidence rates). There is a tradeoff between obtaining accurate model predictions (close to the calibration targets and not too dispersed) and the computational cost. Results from this sensitivity analysis indicated that a combination of M = 10 repetitions with a microdata sample of n = 5000 individuals is a reasonable choice for our application, since it resulted in accurate predictions in a reasonable amount of time (computationally efficient).
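A sketch of such a sensitivity check, assuming the hypothetical run_milc() helper from the Methods sketches, a reference parameter vector theta_c (introduced below), and a vector of target rates:

```r
# Sketch: accuracy and spread of predictions across (n, M) combinations (illustrative).
grid <- expand.grid(n = c(500, 1000, 2500, 5000), M = c(10, 20, 30, 50, 100))

sens <- t(apply(grid, 1, function(g) {
  preds <- replicate(20, {                               # 20 repeated evaluations
    sub <- microdata[sample(nrow(microdata), g["n"]), ]  # microdata subsample of size n
    run_milc(theta_c, sub, M = g["M"])                   # age-group incidence estimates
  })
  c(mean_abs_err = mean(abs(preds - targets$rate)),      # accuracy vs the targets
    spread       = mean(apply(preds, 1, sd)))            # dispersion across evaluations
}))
cbind(grid, sens)                                        # inspect the tradeoff
```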

Simulated “truth” for model parameters.

To overcome the problem of unknown actual values of model parameters, we performed an ad hoc analysis to stipulate a set of values θc for which the MILC model, given the simulated pseudo-population, produced predictions close to the calibration targets. Values from θc were assigned to model parameters that were held constant during the calibration procedure. Furthermore, values from this set were used to guide a priori decisions about the calibrated parameters as well as to evaluate the resulting calibrated sets from each procedure.

Simulated calibration targets.

To control for the effect of model structural uncertainty on the final predictions (bias), we simulated the calibration targets instead of using the actual quantities observed in the SEER database. To this end, we input θc and the pseudo-sample (smpl100,000) into MILC and simulated individual trajectories 26 y ahead. Given θc, we ran the model multiple (r = 100) times and aggregated the results to simulate age group–specific (<60, 60–80, and >80 y old) lung cancer incidence rates Ỹclbr = M100(θc). We set these simulated Ỹclbr aggregates as calibration targets for both calibration procedures, and we used them to validate each calibrated MILC version.

Calibration results.

Each calibration procedure resulted in V = 1000 sets of “good” values for the K = 4 MILC parameters. In the Bayesian calibration, for each chain of 100,000 parameter values, we chose a burn-in period of 50,000 and thinned the remaining sequence by keeping every 50th simulation draw. The resulting nk,Bayes = 1000 values represent a random sample from the posterior distribution of the kth parameter. In the empirical calibration, we assessed 100,000 vectors of parameter values for which the GoF test was not rejected at α = 5%. Of those, we kept only the top 1%, namely, the 1000 vectors with the smallest deviance.
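In code, the burn-in and thinning amount to a simple index selection on each stored chain (a sketch; chain is assumed to be a 100,000 × 4 matrix of MH draws):

```r
# Discard a 50,000-draw burn-in, then keep every 50th draw: 1000 retained values.
keep_idx    <- seq(from = 50050, to = 100000, by = 50)
Theta_bayes <- chain[keep_idx, ]   # V = 1000 rows, K = 4 columns
```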

For the Bayesian calibration, the resulting sets of parameter values (ΘBayes) represent draws from the joint posterior distribution, whereas for the empirical method, they (ΘEmp) represent a sample from the “parameter space” (i.e., “the set of possible values” for those parameters).

Validation.

We evaluated the performance of the resulting models from the 2 calibration methods, using both internal and external validation9,33,40 to compare the model predictions with the observed calibration targets.

The term internal validation here essentially refers to the assessment of the model’s internal consistency, namely, how close the predictions from the calibrated model are to the observed outcomes of interest when using exactly the same microdata sample (smpl5,000) as the input to the calibration process. In the external validation, on the other hand, we evaluated model performance using different microdata (smpl10,000) from those used for model development (the calibration procedure) but drawn from the same underlying pseudo-population, as described in the “Microdata” subsection.

For each of the 2 sets ΘBayes and ΘEmp, we ran the MSM multiple times using the respective microdata for internal and external validation and simulated individual trajectories 26 y ahead. We aggregated information from the simulated trajectories by age group to obtain age-specific predictions of the 2006 lung cancer incidence rates. Furthermore, for the internal validation, we compared MILC predictions with the simulated calibration targets. In the external validation, on the other hand, we compared MILC predictions with observed SEER 2006 lung cancer incidence rates for 11 age groups; namely, we used a different age group classification from the original calibration procedure.

We use the term external validation loosely to describe a process that incorporates different microdata and breakdowns of the calibration targets (lung cancer incidence in different age groups) compared with the ones used for model calibration. In practice, we would have specific data used for model calibration and potentially very different data available for external validation (e.g., data from a clinical study). For the purposes of evaluating different calibration methods, we followed this approach to eliminate the (real-life) issues with potential systematic differences in data used for calibration and validation, as the only differences between calibration and validation data sets are stochastic.

Comparative analysis.

We compared the results from the 2 calibration methods with respect to the distributions of the calibrated parameters and model predictions. In particular, we compared the 2 methods in terms of the following criteria:

  1. overlap between the resulting distributions of the calibrated parameters,

  2. overlap between the distributions of the model predictions,

  3. reference values for model parameters (here, simulated values from the ad hoc analysis for which the model predictions were very close to the calibration targets), and

  4. reference values for model predictions (i.e., the calibration targets).

Comparisons involved graphical and quantitative means. In particular, we created density and contour plots to compare the univariate and bivariate distributions of the calibrated parameters ΘBayes and ΘEmp. We created density and box plots to compare the distributions of the model predictions between the 2 calibrated versions of the MILC model, with regard to the respective lung cancer incidence rates. We calculated univariate and multivariate discrepancy measures (mean absolute [MAD] and mean squared [MSD] deviations, and Mahalanobis distances) of the calibrated models’ predictions from the respective calibration targets. We assessed the degree of overlap between univariate distributions applying the symmetric Kullback–Leibler distance, with values close to zero indicating larger overlap.41
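The two key measures can be computed in a few lines of R; the following is a sketch assuming each marginal distribution is summarized by a kernel density estimate on a shared grid (for the symmetric Kullback–Leibler distance) and that ΘBayes, ΘEmp, and θc are available as Theta_bayes, Theta_emp, and theta_c:

```r
# Symmetric Kullback-Leibler distance between two samples, via kernel density
# estimates on a common grid, normalized to sum to 1.
sym_kl <- function(x1, x2, n_grid = 512, eps = 1e-12) {
  rng <- range(x1, x2)
  d1  <- density(x1, from = rng[1], to = rng[2], n = n_grid)$y
  d2  <- density(x2, from = rng[1], to = rng[2], n = n_grid)$y
  d1  <- d1 / sum(d1); d2 <- d2 / sum(d2)
  sum((d1 - d2) * log((d1 + eps) / (d2 + eps)))  # KL(d1 || d2) + KL(d2 || d1)
}
sym_kl(Theta_bayes[, 1], Theta_emp[, 1])         # overlap for the first parameter

# Mahalanobis distances of each accepted parameter vector from theta_c:
d_bayes <- mahalanobis(Theta_bayes, center = theta_c, cov = cov(Theta_bayes))
d_emp   <- mahalanobis(Theta_emp,   center = theta_c, cov = cov(Theta_emp))
```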

Results

The 2 methods produced very similar results with respect to the distributions of the calibrated parameters. There is a large overlap between the univariate distributions of the calibrated MILC parameters, whereas the respective density plots indicate the same identifiability issues in the cases of θ3 and θ4 (Figure 1). Contour plots show very similar bivariate associations between the calibrated parameters (Figures 2 and 3), namely, a strong correlation between θ1 and θ2, whereas θ3 and θ4 seem unrelated both to each other and to the other 2 parameters. Both the Bayesian and the empirical method included θc among the calibrated values for the MILC parameters, and the box plots of the Mahalanobis distances of the resulting parameter vectors from θc further show that the 2 methods produce comparable results (Figure 4).

Figure 1. Density plots comparing the marginal distributions of the calibrated MIcrosimulation Lung Cancer parameters between the 2 methods. Kullback–Leibler distance (KL-dist) quantifies the degree of overlap between the 2 distributions. Fixed values represent the simulated truth for the model parameters, resulting from an ad hoc analysis.

Figure 2. Contour plots of the bivariate distributions of the calibrated MIcrosimulation Lung Cancer parameters with the Bayesian calibration method.

Figure 3. Contour plots of the bivariate distributions of the calibrated MIcrosimulation Lung Cancer parameters with the empirical calibration method.

Figure 4. Box plots comparing the multivariate distances of the calibrated MIcrosimulation Lung Cancer parameters from θc, between the 2 calibration methods.

The nature of the Bayesian method is such that it is possible for the algorithm to be trapped in local minima or maxima. Therefore, we tried different starting values and created 2 different chains (length = 100,000) for each parameter. The results were very similar: there was a large overlap (small Kullback–Leibler divergence and similar density and contour plots) between the respective distributions for both the calibrated parameters and the model predictions. Therefore, we discuss the results from only 1 chain.

Internal validation showed a large overlap between the distributions of the predicted lung-cancer incidence rates by age group, produced by the 2 calibrated models (Figure 5). In addition, both models seem to result in more accurate predictions of lung-cancer incidence for the last age group (>80 y), namely, the one with the most lung cancer cases, as compared with the other 2 age groups (smaller bias, MAD, and MSD; Table 1). However, predictions from the Bayesian-calibrated MILC model are much more dispersed as compared with those from the empirically calibrated one.

Figure 5. Internal validation: density plots depicting the marginal age group–specific distributions of the predicted lung cancer incidence rates (cases/100,000 person-years) as compared with the calibration targets (Yclbr). Kullback–Leibler distance (KL-dist) quantifies the degree of overlap between the 2 distributions.

Table 1. Performance of the 2 Calibrated MILC Models: Distribution of Predictions and Deviation from Calibration Targets

Age Group   Target Value   Min      Q1       Median   Mean ± SD       Q3       Max      Bias (%)       MAD      MSD
Bayesian calibration
 <60 y       41             14.05    32.70    39.87    36.36 ± 9.19    45.94    66.32     1.64 (4.0)    0.0400   0.0518
 60–80 y    391            208.9    308.1    342.2    336.6 ± 40.5    369.2    426.1     54.4 (13.9)   0.1392   0.0306
 >80 y      464            370.6    433.4    458.8    465.0 ± 41.7    494.5    622.9     −1.0 (0.2)    0.0220   0.0081
 Overall                                                                                               0.0600   0.0300
Empirical calibration
 <60 y       41             31.22    44.97    49.8     49.8 ± 6.61     54.7     69.7     −8.79 (21.4)   0.2145   0.0720
 60–80 y    391            313.6    358.1    373.6    372.9 ± 19.8    387.6    423.1     18.1 (4.63)   0.0464   0.0047
 >80 y      464            383.8    458.9    476.4    476.1 ± 26.7    495.2    562.0    −12.1 (2.6)    0.0261   0.0040
 Overall                                                                                               0.0957   0.0269

MAD, mean absolute deviation; MILC, MIcrosimulation Lung Cancer; MSD, mean squared deviation. Values marked in bold indicate the calibrated model with the best respective performance in the specific age group.

Both methods perform well, with neither method clearly outperforming the other in terms of bias or the MAD and MSD discrepancy measures (Table 1). Nevertheless, it is noteworthy that, when looking at the age-specific results, the Bayesian method resulted in a model that more accurately predicts lung cancer incidence rates for the youngest age group (<60 y) as compared with the empirical method. Given that the prevalence of lung cancer in the youngest age group is by far the lowest in the age range of the data, this result could indicate that, when it comes to predictions of rare events, the Bayesian method results in a more accurate MSM. The empirical method, on the other hand, resulted in a model that produces more accurate predictions for the other 2 age groups, which have more lung cancer cases than the <60 y group.

The external validation provided further evidence for the similarity of the results produced by the 2 calibration methods (Figure 6). The distributions of the age group predictions are very similar between the 2 calibrated versions of the MILC model. In addition, the observed lung cancer incidence rates fall within the range of the respective model predictions for both methods and every age group.

Figure 6. External validation: box plots of predicted lung cancer incidence rates from the 2 calibrated versions of the MIcrosimulation Lung Cancer model, by age group. Comparisons are with observed lung cancer incidence rates from the SEER database.

Another important consideration is that the calibration of an MSM is usually a rather complicated and time-consuming procedure. An indication of the workload involved is that, in our case, calibrating only 4 MILC parameters required 5 × 10⁹ and 2 × 10¹⁰ simulations in total for the empirical and the Bayesian method, respectively. These numbers indicate that calibration of an MSM is an instance of the “embarrassingly parallel” computations problem.42 In such settings, a huge number of tasks can run independently in parallel, distributed across multiple processors or cores, making the entire job much more efficient. To achieve plausible running times, this type of problem necessitates high-performance parallel-computing techniques. The computational cost of running 1,000,000 independent microsimulations (individual trajectories) in parallel was approximately 2.8 CPU hours, and the total required wall-clock time was 40 s, with simulations distributed across 256 nodes (8-core Intel Xeon E5540 at 2.53 GHz with 24 GB of memory).

The Bayesian calibration is inherently inefficient because of its sequential structure. The only part that can be parallelized is the independent individual-level predictions (microsimulations) within each step of the MH algorithm. On the contrary, the empirical approach allows for a high degree of parallelism. In our example, parallelization of the empirical approach resulted in 6.1 d of computational time, compared with 19.7 d for the Bayesian approach.
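For illustration, the empirical screen parallelizes in one line with the base parallel package (a sketch; forking via mclapply() requires a Unix-alike system, and run_milc(), deviance_stat(), theta_grid, and targets are the hypothetical objects from the Methods sketches):

```r
library(parallel)

# Each candidate vector is scored independently, so the empirical screen
# distributes trivially across cores; the Bayesian chain cannot, because each
# MH update depends on the previous state.
D <- unlist(mclapply(seq_len(nrow(theta_grid)), function(i) {
  deviance_stat(run_milc(theta_grid[i, ], microdata, M = 10), targets)
}, mc.cores = detectCores()))
```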

Discussion

This article presents a comparative analysis between a Bayesian and an empirical calibration method for MSMs. We applied both methods to calibrate MILC, a streamlined MSM that describes the natural history of lung cancer. We compared the performance of the 2 methods in terms of both quantitative and qualitative characteristics.

An objective comparison of calibration methods should entail evaluation of the results regarding 2 key elements of the predictive model, namely, the true parameter values and the model predictions. However, because of their complexity, some model parameters are nonidentifiable or even unobservable. Values based on prior information or an ad hoc analysis could be used instead as input for the model parameters. Furthermore, calibration methods should be compared in terms of the GoF of the resulting models to prespecified calibration targets.

When comparing calibration methods, it is also useful to examine the overlap between the respective parameter values and predictions of the resulting models. Large deviations could be indicative of questionable results or even an insufficient model structure for capturing the underlying process. In addition, calibration methods should be compared in terms of their efficiency, namely, the ease of implementation and the total computational effort required. Finally, it is also interesting to consider the interpretation of the results from each procedure.

The comparative analysis showed that, in this particular example, the 2 methods produced very similar results with respect to values for the calibrated model parameters. These results are in line with the more general observation that frequentist and Bayesian estimates are very similar when vague priors are assumed. However, there is a key difference in the interpretation of the results. The values for the calibrated parameters from the Bayesian method are independent draws from the joint posterior distribution, whereas the values from the empirical method constitute a sample from the parameter space with unknown properties.

The evaluation of the predictions obtained from the 2 models led to some interesting findings. Although predictions from the Bayesian-calibrated MILC were much more dispersed, they seemed to be more accurate for rare events, as compared with the results obtained from the empirically calibrated MILC.

Another downside of the Bayesian calibration is that, because of its sequential nature, it is much more time-consuming than the parallel empirical procedure. This lack of efficiency in the Bayesian case could be overcome if, for instance, we used a multivariate prior for the MILC parameters. An important complication here, however, is that the correlation structure of the model parameters is unknown and hard to derive because of the model complexity and the presence of latent variables. A secondary, exploratory analysis would be helpful to correctly specify an informative multivariate prior for the Bayesian case.

The results from this comparative analysis suggest that a 2-step approach may be indicated for the calibration of an MSM. The first step could be a broad empirical search of the multidimensional parameter space, which would narrow the problem to a smaller area of possible values for the model parameters. In the second step, the Bayesian calibration method could be applied, using a more informative prior for the model parameters based on the results from the preceding empirical procedure. This Bayesian method could result in an MSM that better predicts rare events, with meaningful results for the calibrated parameters; namely, the sets of parameter values would constitute a sample from their posterior distribution given the informative prior and the calibration targets. The 2-step approach makes it possible to benefit from the computational efficiency of empirical calibration and the robustness of the final results from Bayesian calibration.

This comparative analysis also suggests a study design and outlines the basic steps that should be followed for the comparison of any calibration methods for MSMs. A comprehensive comparison requires the stipulation of 2 main sets of reference values, one for the parameters and the other for the predictions, against which the results of the calibration procedure should be compared. Equally important is the comparison of the MSM results between the calibrated models, with regard to both parameter values and predictions. Such a comparison may prove that the methods are almost equivalent or that one outperforms the other in certain instances (e.g., when predicting rare events). In the first case, the choice of using one over the other can be simply based on efficiency terms, whereas in the second case the calibration procedure would benefit from a combination of the methods under consideration.

Calibration of an MSM requires a considerably large number of simulations. For the empirical method, the values within each parameter vector can be tested simultaneously (parallel), whereas in the Bayesian method, each update of a model parameter depends on the previous values (sequential). Undirected MSM calibration methods usually share the same parallel nature with the empirical method we employ here. Directed methods, on the other hand, follow a sequential procedure for the specification of the optimum parameter vector, analogous to the Bayesian method. Therefore, the empirical, and hence undirected methods, prove to be much more efficient compared with the Bayesian as well as any other directed calibration method.

This comparative analysis has some limitations. The results indicated that the Bayesian method may perform better when the calibration targets are rare events. We also believe that a combination of the 2 calibration methods is an interesting alternative approach that could prove more efficient in some cases. However, a more extensive calibration exercise, including more calibrated parameters and targets, is required to confirm these hypotheses.

In addition, it was unclear which method outperformed the other when we compared the age group–specific lung cancer incidence rates predicted by the calibrated models with the respective calibration targets. It would probably be helpful to add more targets (e.g., lung cancer mortality, tumor stage and size at diagnosis, etc.) to obtain a conclusive result that could also be generalized to other settings.

In summary, this simulation study showed that the empirical calibration method is more practical because it produces very similar results compared with the Bayesian method, with a considerably smaller computational effort. The Bayesian method, on the other hand, produces meaningful results for the calibrated parameters, namely, random draws from the posterior distribution. Based on the results from this comprehensive comparison, the empirical approach seems a useful initial step in the complicated and cumbersome calibration process of a complex simulation model, which can inform the prior distributions of a subsequent Bayesian method.

Acknowledgments

The authors would like to thank Dr. Matthew T. Harrison for the helpful comments on the implementation of the ideas presented in this article. We are also grateful to the staff of the Center for Computation and Visualization at Brown University, especially Mr. Mark Howison and Mr. Aaron Shen, for their continuous and timely support.

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work presented in this article was conducted with support received from the American College of Radiology Imaging Network (ACRIN) (grant number: U01-CA-79778 from the National Cancer Institute). The simulations were conducted using computational resources and services of the Center for Computation and Visualization at Brown University.

Contributor Information

Stavroula A. Chrysanthopoulou, Brown University, Providence, RI, USA.

Carolyn M. Rutter, RAND Corporation, Santa Monica, CA, USA.

Constantine A. Gatsonis, Brown University, Providence, RI, USA

References

1. Lavington MR. A practical microsimulation model for consumer marketing. Oper Res Q. 1970;21(1):25–45.
2. Fang FC, Elefteriadou L. Some guidelines for selecting microsimulation models for interchange traffic operational analysis. J Transport Eng ASCE. 2005;131(7):535–43.
3. Figueiredo M, Seco A, Silva AB. Calibration of microsimulation models—the effect of calibration parameters errors in the models’ performance. Transport Res Proc. 2014;3:962–71.
4. Zucchelli E, Jones AM, Rice N. The evaluation of health policies through dynamic microsimulation methods. Int J Microsimulat. 2012;5(1):2–20.
5. Krijkamp EM, Alarid-Escudero F, Enns EA, et al. Microsimulation modeling for health decision sciences using R: a tutorial. Med Decis Making. 2018;38(3):400–22.
6. Caglayan C, Terawaki H, Chen QS, et al. Microsimulation modeling in oncology. JCO Clin Cancer Inform. 2018;2:1–11.
7. Rutter CM, Zaslavsky AM, Feuer EJ. Dynamic microsimulation models for health outcomes. Med Decis Making. 2011;31(1):10–8.
8. Gatsonis C, Morton SC. Methods in Comparative Effectiveness Research. Boca Raton (FL): Chapman & Hall/CRC Biostatistics Series, CRC Press; 2017.
9. Kopec JA, Fines P, Manuel DG, et al. Validation of population-based disease simulation models: a review of concepts and methods. BMC Public Health. 2010;10:710.
10. Vanni T, Karnon J, Madan J, et al. Calibrating models in economic evaluation: a seven-step approach. Pharmacoeconomics. 2011;29(1):35–49.
11. Stout NK, Knudsen AB, Kong CY, et al. Calibration methods used in cancer simulation models and suggested reporting guidelines. Pharmacoeconomics. 2009;27(7):533–45.
12. Kong CY, McMahon PM, Gazelle GS. Calibration of disease simulation model using an engineering approach. Value Health. 2009;12(4):521–9.
13. Moolgavkar SH, Luebeck G. Two-event model for carcinogenesis: biological, mathematical, and statistical considerations. Risk Anal. 1990;10(2):323–41.
14. Baker R. Use of a mathematical model to evaluate breast cancer screening policy. Health Care Manage Sci. 1998;1(2):103–13.
15. Plevritis SK, Sigal BM, Salzman P, et al. Chapter 12: A stochastic simulation model of U.S. breast cancer mortality trends from 1975 to 2000. JNCI Monographs. 2006;2006(36):86–95.
16. Foy M, Spitz MR, Kimmel M, et al. A smoking-based carcinogenesis model for lung cancer risk prediction. Int J Cancer. 2011;129(8):1097–13.
17. van den Akker-van Marle ME, van Ballegooijen M, van Oortmarssen GJ, et al. Cost-effectiveness of cervical cancer screening: comparison of screening policies. J Natl Cancer Inst. 2002;94(3):193–204.
18. Fryback DG, Stout NK, Rosenberg MA, et al. Chapter 7: The Wisconsin breast cancer epidemiology simulation model. JNCI Monographs. 2006;2006(36):37–47.
19. Koscielny S, Tubiana M, Valleron AJ. A simulation model of the natural history of human breast cancer. Br J Cancer. 1985;52(4):515–24.
20. Mandelblatt J, Schechter CB, Lawrence W, et al. Chapter 8: The Spectrum population model of the impact of screening and treatment on U.S. breast cancer trends from 1975 to 2000: principles and practice of the model methods. JNCI Monographs. 2006;2006(36):47–55.
21. Berry DA, Inoue L, Shen Y, et al. Chapter 6: Modeling the impact of treatment and screening on U.S. breast cancer mortality: a Bayesian approach. JNCI Monographs. 2006;2006(36):30–6.
22. Karnon J, Goyder E, Tappenden P, et al. A review and critique of modelling in prioritising and designing screening programmes. Health Technol Assess. 2007;11(52):iii–ix, ix–xi, 1–145.
23. Vanni T, Legood R, Franco EL, et al. Economic evaluation of strategies for managing women with equivocal cytological results in Brazil. Int J Cancer. 2011;129(3):671–9.
24. Blower SM, Dowlatabadi H. Sensitivity and uncertainty analysis of complex models of disease transmission: an HIV model, as an example. Int Stat Rev. 1994;62(2):229–43.
25. McKay MD, Beckman RJ, Conover WJ. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics. 2000;42(1):55–61.
26. Santner TJ, Williams BJ, Notz W. The Design and Analysis of Computer Experiments. New York: Springer; 2003.
27. Jit M, Choi YH, Edmunds WJ. Economic evaluation of human papillomavirus vaccination in the United Kingdom. BMJ. 2008;337:a769.
28. Chia YL, Salzman P, Plevritis SK, et al. Simulation-based parameter estimation for complex models: a breast cancer natural history modelling illustration. Stat Methods Med Res. 2004;13(6):507–24.
29. Tan SYGL, van Oortmarssen GJ, de Koning HJ, et al. Chapter 9: The MISCAN-Fadia continuous tumor growth model for breast cancer. JNCI Monographs. 2006;2006(36):56–65.
30. Kim JJ, Kuntz KM, Stout NK, et al. Multiparameter calibration of a natural history model of cervical cancer. Am J Epidemiol. 2007;166(2):137–50.
31. Rutter CM, Miglioretti DL, Savarino JE. Bayesian calibration of microsimulation models. J Am Stat Assoc. 2009;104(488):1338–50.
32. Chrysanthopoulou SA. MILC: a microsimulation model of the natural history of lung cancer. Int J Microsimulat. 2017;10(3):5–26.
33. Caro JJ, Briggs AH, Siebert U, et al. Modeling good research practices—overview: a report of the ISPOR-SMDM Modeling Good Research Practices Task Force–1. Med Decis Making. 2012;32(5):667–77.
34. Laird AK. Dynamics of tumor growth. Br J Cancer. 1964;18(3):490–502.
35. Spratt JS, Spratt TL. Rates of growth of pulmonary metastases and host survival. Ann Surg. 1964;159(2):161–71.
36. Steel GG. Growth Kinetics of Tumours: Cell Population Kinetics in Relation to the Growth and Treatment of Cancer. Oxford (UK): Clarendon Press; 1977.
37. McMahon PM. Policy Assessment of Medical Imaging Utilization: Methods and Applications [PhD thesis]. Harvard University; 2005.
38. Chrysanthopoulou SA. MILC: MIcrosimulation Lung Cancer (MILC) model. R package version 1.0. 2014. Available from: http://CRAN.R-project.org/package=MILC
39. Dahabreh IJ, Chan JA, Earley A, et al. Modeling and simulation in the context of health technology assessment: review of existing guidance, future research needs, and validity assessment. Report No. 16(17)-EHC020-EF. Rockville (MD): Agency for Healthcare Research and Quality; 2017.
40. Collins GS, de Groot JA, Dutton S, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol. 2014;14(1):40.
41. Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22(1):79–86.
42. Schmidt B, Gonzalez-Dominguez J, Hundt C, et al. Parallel Programming: Concepts and Practice. 1st ed. San Francisco: Morgan Kaufmann; 2017.
