Comparing Parametric, Nonparametric, and Semiparametric Estimators: The Weibull Trials

Stephen R Cole; Jessie K Edwards; Alexander Breskin; Michael G Hudgens

2021 Feb 11;190(8):1643–1651. doi: 10.1093/aje/kwab024

PMCID: PMC8484780 PMID: 33569578

Abstract

We use simple examples to show how the bias and standard error of an estimator depend in part on the type of estimator chosen from among parametric, nonparametric, and semiparametric candidates. We estimated the cumulative distribution function in the presence of missing data with and without an auxiliary variable. Simulation results mirrored theoretical expectations about the bias and precision of candidate estimators. Specifically, parametric maximum likelihood estimators performed best but must be “omnisciently” correctly specified. An augmented inverse probability–weighted (IPW) semiparametric estimator performed best among candidate estimators that were not omnisciently correct. In one setting, the augmented IPW estimator reduced the standard error by nearly 30%, compared with a standard Horvitz-Thompson IPW estimator; such a standard error reduction is equivalent to doubling the sample size. These results highlight the gains and losses that can be incurred when model assumptions are made in any analysis.

Keywords: bias, estimators, nonparametric estimators, parametric estimators, precision, semiparametric estimators

Abbreviations

IPW: inverse probability weighted
MLE: maximum likelihood estimator
NPF: nonparametric full data
SPA: semiparametric augmented inverse probability weighted
SPF: semiparametric full data
WF: Weibull full data
WG: Weibull g-computation

Accuracy is a combination of validity (i.e., lack of bias) and precision (i.e., small standard error) (1, pp. 128, 231). Here accuracy tradeoffs are illustrated between parametric maximum likelihood, nonparametric, and semiparametric estimators. A simple yet nontrivial example is used to show how the bias and standard error of an estimator depend on the type of estimator chosen.

Before discussing types of estimators, the target parameter should be carefully defined. All information about some outcome of interest, say $Y$, is captured by the risk function or cumulative distribution function. Therefore, our parameter of interest is the risk function of $Y$, defined as $F(y) = P(Y \le y)$ (2), where the probability is defined as $\sum_{u \le y} f(u)$ for discrete $Y$ with mass function $f(u)$, or as $\int_{-\infty}^{y} f(u)\,du$ for continuous $Y$ with density function $f(u)$. Many common parameters can be calculated directly from the risk function (e.g., mean, median, or other percentiles). To highlight differences between types of estimators, we will study the (all too common) case where there is missing data for the outcome of interest.

To fix ideas, say we are given $n$ numbers from a positively distributed outcome of interest (e.g., biomarker levels, lifetimes). Assume the sample units were independently and randomly drawn from an infinite population.

Estimators can be classified in many ways. One useful classification entails whether the estimators arise from parametric, nonparametric, or semiparametric models. Parametric models have a finite number of parameters. For example, without covariates, an exponential model for a distribution function has 1 (rate) parameter. On the other hand, nonparametric models (for continuously distributed variables) have an infinite number of parameters. For example, assuming a nonparametric model, we can estimate the risk $F(y)$ using the Kaplan-Meier estimator (3). For semiparametric models, the parameter space is split into a piece that is finite and a piece that is infinite (4). As a canonical example, the Cox model (5) has a linear predictor, which has a finite number of parameters, and a reference hazard function, which is infinite dimensional.

Given an observed random sample of size $n$, at one extreme we make no assumption about the shape of the distribution and use a nonparametric maximum likelihood estimator (MLE) (i.e., $\hat{F}(y) = n^{-1} \sum_{i=1}^{n} I(Y_i \le y)$). This nonparametric MLE places probability mass $1/n$ on each of the observed values of $Y$. Using the weak law of large numbers (6, p. 232), this nonparametric MLE is a pointwise asymptotically consistent estimator for any value $y$ of the distribution, regardless of the shape of the function in the population. But because this nonparametric MLE is unconstrained, it is not optimally precise if the data are generated from a distribution in a particular finite-dimension parametric model. That is, estimators that make constraints based on a correct parametric form, or otherwise leverage auxiliary information, can be more precise than the unconstrained nonparametric MLE.
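As a minimal sketch (not part of the original analysis), the nonparametric MLE in the absence of missing or censored data is simply the proportion of sampled values at or below the point of interest. The function name `ecdf` and the example values are hypothetical and chosen only for illustration.

```python
def ecdf(sample, y):
    """Empirical distribution function: share of observations <= y."""
    if len(sample) == 0:
        raise ValueError("sample must contain at least one observation")
    # Each observation carries probability mass 1/n.
    return sum(1 for value in sample if value <= y) / len(sample)


# Hypothetical positive outcomes (e.g., lifetimes), for illustration only.
example_values = [0.2, 0.5, 1.0, 1.7, 3.1]
example_risk = ecdf(example_values, 1.0)  # 3 of the 5 values are <= 1.0
```

No distributional shape is assumed anywhere in this function, which is why the estimator is consistent for any population distribution but cannot borrow precision from a correct parametric form.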

In the absence of censored data, the above nonparametric MLE is equivalent to the Kaplan-Meier estimator. But the Kaplan-Meier estimator extends the above nonparametric MLE to allow for independent right censoring of the data Y (as the Turnbull (7) and Aalen-Johansen (8) estimators extend the above nonparametric MLE to allow for arbitrary censoring/truncation and estimation of subdistribution functions for competing events, respectively).

At another extreme, assume the distribution is exponential, formally $F(y; \lambda) = 1 - \exp(-y/\lambda)$, where $\lambda$ is the mean of $Y$. We often estimate $\lambda$ by the MLE, say $\hat{\lambda}$, and then estimate the target (or interest) parameter with $F(y; \hat{\lambda})$. This MLE is a pointwise asymptotically consistent estimator of $F(y)$, if $F(y)$ is a member of the family of exponential distributions indexed by $\lambda$. If the population distribution function $F(y)$ is a member of this family, then this MLE attains the Cramér-Rao efficiency bound (6, p. 335) and is therefore maximally precise. To be maximally precise means that the estimator uses all the information relevant to the parameter given by the combination of data and model constraints. The exponential assumption can be relaxed by instead supposing $F(y)$ is a mixture of exponentials. In particular, assume the conditional distribution of $Y$ given covariate $X$ is exponential with mean $\lambda(X)$. Auxiliary variables are covariates that can help to more accurately estimate the target parameter when $Y$ is not independent of $X$ (9). In scenarios with one or more auxiliary variables, the parameter of interest $F(y)$ might be estimated by first estimating the conditional distribution of $Y$ given $X$ and the marginal distribution of $X$, and then using the relation $F(y) = \int F(y \mid x)\,dG(x)$, where $G(x) = P(X \le x)$. If $X$ is discrete, this relation can be expressed simply as $F(y) = \sum_{x} F(y \mid x) P(X = x)$.
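The plug-in exponential estimator and the discrete-covariate version of the marginalization relation can be sketched as follows. This is an illustration under stated assumptions: no missing or censored data, a discrete covariate, and a separate exponential fit within each covariate level. The function names are hypothetical.

```python
import math


def exponential_risk(sample, y):
    """Plug-in MLE of the risk at y under an exponential model."""
    mean_hat = sum(sample) / len(sample)  # MLE of the exponential mean
    return 1.0 - math.exp(-y / mean_hat)


def exponential_mixture_risk(pairs, y):
    """Marginal risk at y from covariate-specific exponential fits.

    pairs: iterable of (x, outcome) tuples with a discrete covariate x.
    """
    strata = {}
    for x, outcome in pairs:
        strata.setdefault(x, []).append(outcome)
    n = sum(len(values) for values in strata.values())
    # Weight each conditional risk by the sample proportion with that x.
    return sum(len(values) / n * exponential_risk(values, y)
               for values in strata.values())
```

The second function is a simple parametric g-computation: conditional risks are estimated within levels of the covariate and then averaged over the observed covariate distribution.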

Alternatively, more flexible parametric MLEs can be entertained, such as the 2-parameter Weibull model, $F(y; \lambda, \gamma) = 1 - \exp\{-(y/\lambda)^{\gamma}\}$, which can again be estimated using the MLE $F(y; \hat{\lambda}, \hat{\gamma})$. Or even more flexible parametric models with 3 or more parameters can be considered (10, 11). In turn, these models can be made more flexible yet by adding covariate effects, as above. In at least an informal sense, the limit of this process of relaxing the constraints on parametric models leads toward the infinite-dimensional nonparametric case.

Semiparametric estimators provide a third, or middle, way. Say that we assume a semiparametric Cox proportional hazards model (5) for the association between $X$ and $Y$, formally $\Lambda(y \mid X) = \Lambda_0(y)\exp(\beta X)$, where $\Lambda(y \mid X)$ is the cumulative hazard function of $Y$ at $y$ given $X$, and $\Lambda_0(y)$ is the cumulative reference hazard function. The set of finite-dimensional parameters $\beta$ is estimated by the maximum partial likelihood estimator $\hat{\beta}$, and, assuming $Y$ continuous, the infinite-dimensional parameter $\Lambda_0(y)$ can be estimated using the Breslow estimator $\hat{\Lambda}_0(y)$ (12). Then a semiparametric estimator of the risk function is

$$\hat{F}(y) = n^{-1} \sum_{i=1}^{n} \hat{F}(y \mid X_i), \tag{1}$$

where $\hat{F}(y \mid X_i) = 1 - \exp\{-\hat{\Lambda}(y \mid X_i)\}$ and $\hat{\Lambda}(y \mid X_i) = \hat{\Lambda}_0(y)\exp(\hat{\beta} X_i)$. This estimator, with or without right censoring, provides an estimate of $F(y)$ by averaging the covariate-conditional estimates over the sample and constrains the relationship between $X$ and $Y$ to follow a proportional hazards model.
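The standardized Cox estimator described above can be sketched in a few lines for the simplest case. The sketch assumes a single covariate, no censoring, and no tied outcome values. It uses a plain Newton-Raphson search for the partial likelihood maximizer, which is a convenience choice here rather than the authors' stated implementation. The function name and arguments are hypothetical.

```python
import math


def cox_standardized_risk(x, t, y, max_iter=50, tol=1e-10):
    """Return (beta_hat, risk_hat) for one covariate, no censoring, no ties."""
    n = len(t)
    order = sorted(range(n), key=lambda i: t[i])
    xs = [x[i] for i in order]  # covariates sorted by outcome value
    ts = [t[i] for i in order]

    def risk_set_sums(beta, i):
        # Units i..n-1 are still at risk at the i-th ordered outcome.
        w = [math.exp(beta * xj) for xj in xs[i:]]
        s0 = sum(w)
        s1 = sum(wj * xj for wj, xj in zip(w, xs[i:]))
        s2 = sum(wj * xj * xj for wj, xj in zip(w, xs[i:]))
        return s0, s1, s2

    beta = 0.0
    for _ in range(max_iter):  # Newton-Raphson on the partial likelihood
        score = info = 0.0
        for i in range(n):
            s0, s1, s2 = risk_set_sums(beta, i)
            score += xs[i] - s1 / s0
            info += s2 / s0 - (s1 / s0) ** 2
        if info <= 1e-12:  # covariate carries no information (e.g., constant)
            break
        step = score / info
        beta += step
        if abs(step) < tol:
            break

    # Breslow estimator of the cumulative baseline hazard at y.
    cum_hazard = sum(1.0 / risk_set_sums(beta, i)[0]
                     for i in range(n) if ts[i] <= y)
    # Average the covariate-conditional risks over the sample.
    risk = sum(1.0 - math.exp(-cum_hazard * math.exp(beta * xi))
               for xi in xs) / n
    return beta, risk
```

When the covariate is constant, the coefficient stays at 0 and the Breslow estimator reduces to the Nelson-Aalen estimator, so the returned risk is one minus the exponentiated negative cumulative hazard. The loop is quadratic in the sample size, which is acceptable for samples of the size considered here but not for large data.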

We say that an estimator is valid if it is asymptotically consistent; that is, the estimator converges in probability to the true data generating value as the sample size $n$ tends toward infinity. For any of the above estimators to be valid in this sense, the population function $F(y)$ must be a possible value of the limit of the estimator, for all values $y$, as $n$ tends toward infinity. This is always the case for the nonparametric estimators we consider because these estimators remain consistent with no constraints on the shape of the risk function. On the other hand, when the true function $F(y)$ does not satisfy the model assumptions of the parametric or semiparametric estimator employed, the best possible member of the family is the member that most closely resembles the actual $F(y)$, which is sometimes called the “least false” parameter (13). Choosing to estimate a least-false parameter is like allowing a tolerance for bias. In certain settings, bias might be tolerable in exchange for some benefit, such as precision, speed, or ease.

Barring extra-data information, for the above estimators to be optimally efficient, they must extract all the information relevant to the parameter available in the data, which consist of $Y$, and $X$ in scenarios with an auxiliary covariate. The parametric MLEs, when constraints are correct, automatically maximally extract information from $(Y, X)$, and the nonparametric estimators we considered ignore the covariate and so extract no information about $Y$ present in $X$. The semiparametric estimators, even when optimal, achieve the Hájek-Le Cam semiparametric efficiency bound, which is no smaller (and typically larger) than the Cramér-Rao bound for the parametric maximum likelihood model under consideration (4, 14–16). Next, we describe an experiment, simulating data like those described above, to demonstrate the accuracy (i.e., bias and variance) tradeoffs among a set of estimators.

METHODS

We generate 5,000 samples of $n = 100$ and 200 units, indexed by $i$, where $X_i \sim N(0, \sigma)$, that is, $X_i$ is mean zero Gaussian with standard deviation $\sigma$, and $Y_i$ given $X_i$ follows a Weibull distribution with a coefficient $\beta$ for $X_i$, where $\lambda$ is the scale parameter and $\gamma$ is the Weibull shape parameter. When $\gamma = 1$ the Weibull coincides with an exponential distribution, which corresponds to a constant hazard. When $\gamma > 1$ ($\gamma < 1$) there is an increasing (decreasing) hazard of $Y$. For all scenarios we hold $\lambda$ and $\beta$ fixed, with $\beta$ set to give a strong association between $X$ and $Y$ (i.e., a unit increase in $X$ is associated with a 3-fold increase in the hazard of $Y$ when $\gamma = 1$). Below, for convenience, we focus on a single parameter, the value of the marginal distribution function of $Y$ at one fixed point $y$, or $F(y)$. The pattern of results should hold for any value $y$ not in the extremity of the distribution of $Y$.

For each scenario, approximately half the $Y$ values are missing. To generate missing data for $Y$, we draw an indicator of being observed $R$, distributed as Bernoulli with expectation $\operatorname{expit}(\alpha_0 + \alpha_1 X)$, where $\alpha_1$ is either nonzero or 0, and set $\alpha_0$ such that $P(R = 1)$ is approximately 1/2.

We consider 6 scenarios from a factorial experiment varying $\gamma$ (equal to 1 or not), and the combination of $(\sigma, \alpha_1)$ as (0,0), (1,0), and $(1, \alpha_1)$ with $\alpha_1 \neq 0$. These parameter choices correspond to the following 6 scenarios: 1) exponential with no covariate (i.e., the covariate is a constant 0); 2) exponential with a standard normal covariate that causes the outcome but not missingness; 3) exponential with a standard normal covariate that causes the outcome and missingness; 4) Weibull with no covariate; 5) Weibull with a standard normal covariate that causes the outcome but not missingness; and 6) Weibull with a standard normal covariate that causes the outcome and missingness.
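A data-generating step of this kind can be sketched as below. Several details are assumptions made only for illustration, not values taken from the article: a unit baseline scale, the specific Weibull parameterization written in the code comment (chosen so that a covariate coefficient of log 3 gives a 3-fold hazard ratio when the shape is 1), and the argument names `shape`, `sigma`, `beta`, `alpha0`, and `alpha1`.

```python
import math
import random


def expit(v):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-v))


def simulate_trial(n, shape, sigma, beta, alpha0, alpha1, rng):
    """Draw n units as (x, y, r); y is treated as missing when r == 0."""
    units = []
    for _ in range(n):
        x = rng.gauss(0.0, sigma)  # sigma = 0 gives a constant covariate of 0
        # Inverse-CDF draw from the assumed conditional distribution
        # F(y | x) = 1 - exp(-(y * exp(beta * x)) ** shape).
        u = rng.random()
        y = (-math.log(1.0 - u)) ** (1.0 / shape) / math.exp(beta * x)
        # Observed indicator from a logistic model in x.
        r = 1 if rng.random() < expit(alpha0 + alpha1 * x) else 0
        units.append((x, y, r))
    return units
```

For example, `simulate_trial(100, 1.0, 1.0, math.log(3), 0.0, math.log(3), random.Random(1))` would draw one exponential trial in which the covariate affects both the outcome and missingness. Because the covariate is symmetric about zero and the inverse logit satisfies expit(-v) = 1 - expit(v), an intercept of 0 yields an expected observed fraction of exactly 1/2 in this sketch.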

For each of the 6 scenarios, we fit 14 estimators, as detailed in Table 1. First, we fit 2 nonparametric estimators, which both ignore the auxiliary covariate. The first nonparametric estimator was fitted to the full data (NPF) and the second was fitted to the observed data (NPO), that is, where $R = 1$. The NPF estimator ought to be approximately unbiased in all scenarios, and the NPO estimator ought to be likewise unbiased in scenarios 1, 2, 4, and 5 where the missingness is completely at random. The NPF and NPO estimators ought to be less precise than the correctly specified parametric models. Throughout, estimators using the full data are provided for reference, as well as to make comparisons between estimators in the absence of missing data.

Table 1.

Nonparametric, Semiparametric, and Parametric Estimators

Estimator, With Abbreviation	Definition ^a
Nonparametric
NPF: full data
NPO: observed data
Semiparametric ^b
SPF: semiparametric full data
SPO: observed data
SPI: Horvitz-Thompson IPW
SPH: Hájek IPW
SPG: g-computation
SPA: augmented IPW
Weibull ^c
WF: full data
WO: observed data
WG: g-computation
Exponential ^d
EF: exponential full data
EO: observed data
EG: g-computation


Abbreviations: EF, exponential full data; EG, exponential g-computation; EO, exponential observed data; IPW, inverse probability weighted; MLE, maximum likelihood estimator; NPF, nonparametric full data; NPO, nonparametric observed data; SPA, semiparametric augmented inverse probability weighted; SPF, semiparametric full data; SPG, semiparametric g-computation; SPH, semiparametric Hájek inverse probability weighted; SPI, semiparametric Horvitz-Thompson inverse probability weighted; SPO, semiparametric observed data; WF, Weibull full data; WG, Weibull g-computation; WO, Weibull observed data.

^a $Y$ is the outcome, $R$ indicates $Y$ is observed, and $X$ is an auxiliary covariate.

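^b Defined through the fitted semiparametric model, where the full-data and observed-data versions use the MLEs of the model parameters in the full and observed (i.e., $R = 1$) data, respectively.

^c Defined through the fitted Weibull model, where the full-data and observed-data versions use the MLEs of the model parameters in the full and observed (i.e., $R = 1$) data, respectively.

^d Defined through the fitted exponential model, where the full-data and observed-data versions use the MLEs of the model parameters in the full and observed (i.e., $R = 1$) data, respectively.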

Second, we fit 6 semiparametric estimators. The first semiparametric estimator, given as equation 1 above, used the full data (SPF). The remaining 5 semiparametric estimators used the observed data on $Y$ when $R = 1$, with or without using observed data on the auxiliary variable $X$ when $R = 0$. The second semiparametric estimator was a simple complete-case estimator fitted on data where $R = 1$ (semiparametric observed; SPO) and ignores $X$ when $R = 0$, and so ought to be biased in scenarios 3 and 6. The third, fourth, and fifth semiparametric estimators were Horvitz-Thompson (17) inverse probability weighted (IPW; SPI), Hájek (18) IPW (SPH), and g-computation (SPG) (19) estimators, which use $X$ when $R = 0$. This Hájek estimator is sometimes referred to as a modified (20, see technical points 12.1 and 12.2) or “stabilized” IPW estimator, but this stabilization is distinct from that described by Robins et al. (21). These semiparametric estimators ought to be approximately unbiased in every scenario, with the SPG estimator more precise than SPI and SPH. The sixth and last semiparametric estimator was an augmented inverse probability–weighted estimator (SPA), which also ought to be approximately unbiased in all scenarios with precision intermediate between the SPI and SPG (22–24). For estimators using an inverse probability weight (i.e., SPI, SPH, SPA), the model for missingness was a correctly specified logistic regression model fitted by maximum likelihood. For the SPA estimator, we combined information from the semiparametric estimator given above (equation 1) with a finite-dimension parametric logistic regression model for the probability of being observed given $X$ (23). Specifically, the form of the SPA estimator is:

$$\hat{F}_{SPA}(y) = n^{-1} \sum_{i=1}^{n} \left[ \frac{R_i I(Y_i \le y)}{\pi(X_i; \hat{\alpha})} - \frac{R_i - \pi(X_i; \hat{\alpha})}{\pi(X_i; \hat{\alpha})} \hat{F}(y \mid X_i) \right], \tag{2}$$

where $\hat{F}(y \mid X_i) = 1 - \exp\{-\hat{\Lambda}_0(y)\exp(\hat{\beta} X_i)\}$, with $\hat{\beta}$ the maximum partial likelihood estimator of $\beta$ and $\hat{\Lambda}_0(y)$ the Breslow estimator of the cumulative baseline hazard function based only on the observed data where $R = 1$, $\pi(X; \alpha)$ denotes $P(R = 1 \mid X)$ under the assumed logistic regression model with finite-dimensional parameter $\alpha$, and $\hat{\alpha}$ is the MLE of $\alpha$. Notably, this SPA estimator is double robust and therefore consistent if either the model for the outcome $Y$ or the missing data mechanism is correct. All 6 semiparametric estimators should be less precise than the correct parametric MLE, and more precise than the nonparametric estimator when an informative auxiliary variable is present.
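The three weighted estimators differ only in how they combine the same ingredients. The sketch below takes those ingredients as given inputs rather than fitting any model: `r` is the observed indicator, `d` is the indicator that the outcome is at or below the point of interest (used only where the outcome is observed), `pi` holds fitted probabilities of being observed, and `m` holds outcome-model predictions of the conditional risk. All names are hypothetical.

```python
def ht_ipw(r, d, pi):
    """Horvitz-Thompson IPW: weighted sum over observed units divided by n."""
    n = len(r)
    return sum(di / p for ri, di, p in zip(r, d, pi) if ri == 1) / n


def hajek_ipw(r, d, pi):
    """Hajek IPW: same numerator, normalized by the sum of the weights."""
    numerator = sum(di / p for ri, di, p in zip(r, d, pi) if ri == 1)
    denominator = sum(1.0 / p for ri, p in zip(r, pi) if ri == 1)
    return numerator / denominator


def augmented_ipw(r, d, pi, m):
    """Augmented IPW: weighted term minus an augmentation term using m."""
    n = len(r)
    total = 0.0
    for ri, di, p, mi in zip(r, d, pi, m):
        weighted = di / p if ri == 1 else 0.0  # d is unused when missing
        total += weighted - (ri - p) / p * mi
    return total / n
```

Units with a missing outcome contribute nothing to the two simple IPW estimators, but they still contribute their outcome-model prediction to the augmented estimator. That is the route by which the augmented estimator recovers information from the auxiliary covariate.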

Third, we fit 3 parametric Weibull estimators. Each estimator entailed fitting a correctly specified Weibull model for $Y$ given $X$ via maximum likelihood and then marginalizing over $X$ to obtain an estimate of the population-average risk. The first Weibull estimator was fitted to the full data (WF) and ought to be approximately unbiased in all scenarios. The second Weibull estimator was fitted to the observed data (WO), where $R = 1$, and ought to be approximately unbiased in scenarios 1, 2, 4, and 5. The third Weibull estimator is a parametric g-computation estimator (WG). The WG estimator ought to be approximately unbiased for all 6 scenarios, most precise in scenarios 4–6, but somewhat inefficient in scenarios 1–3 (compared with the correct exponential submodel estimator).

Fourth, we fit 3 parametric exponential estimators, which are akin to the Weibull estimators above, with the sole additional constraint that the Weibull shape parameter equals 1 ($\gamma = 1$). The first exponential estimator was fitted to the full data (EF), and ought to be approximately unbiased in scenarios 1–3 with maximal precision. The second exponential estimator was fitted to the observed data (EO) and ought to be approximately unbiased in scenarios 1 and 2. The third exponential estimator (EG) is a parametric g-computation estimator. The EG estimator ought to be approximately unbiased for scenarios 1–3.

We also explored the impact of an unmeasured common cause of the outcome and missingness. Specifically, we added a standard normal covariate with a log(3) coefficient to the linear component of both data-generating models. Therefore, 12 of the 14 estimators ought to be biased due to misspecification, with only the NPF and semiparametric full-data estimators expected to be approximately unbiased.

In addition to bias in the estimate of the risk function, we quantify precision by the standard errors of each estimator, which are approximated by the standard deviation of the 5,000 simulation estimates. Experiments were performed separately using SAS (SAS Institute, Inc., Cary, North Carolina) and R (R Foundation for Statistical Computing, Vienna, Austria).
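The two performance summaries can be computed as in the following sketch, which is illustrative rather than a reproduction of the SAS or R code. The small experiment attached to it assumes full data from a unit-mean exponential distribution and targets the risk at 1, with fewer trials than the article uses; these choices and the function names are assumptions made for the example.

```python
import math
import random
import statistics


def monte_carlo_summary(estimates, truth):
    """Bias and empirical standard error across simulation trials."""
    bias = statistics.fmean(estimates) - truth
    empirical_se = statistics.stdev(estimates)
    return bias, empirical_se


def run_experiment(trials=2000, n=100, seed=1):
    """Nonparametric full-data estimator under an assumed exponential truth."""
    rng = random.Random(seed)
    truth = 1.0 - math.exp(-1.0)  # risk at 1 for a unit-mean exponential
    estimates = []
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        estimates.append(sum(1 for v in sample if v <= 1.0) / n)
    return monte_carlo_summary(estimates, truth)
```

Under these assumptions the empirical proportion has a binomial standard error of sqrt(F(1 - F)/n), which is about 0.048 when F = 1 - exp(-1) and n = 100, so the empirical standard error returned by `run_experiment` should land near that value.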

RESULTS

Figures 1 and 2 group the 14 estimators in 3 panels each; Figure 1 presents results from scenarios 1–3, and Figure 2 presents results from scenarios 4–6. Each scenario highlights specific aspects of semiparametric theory. For example, for scenario 1 (Figure 1A), the outcome data are exponential and there is no covariate. In this scenario, as expected, all estimators are unbiased with precision improving as the estimators become more restrictive. For scenario 2 (Figure 1B), the outcome data are exponential with a standard normal covariate which does not predict missingness, and the results are similar to scenario 1. For scenario 3 (Figure 1C), the outcome data are exponential with a standard normal covariate, which causes the outcome and missingness, and the results illustrate how the observed data estimators are biased due to incorrectly assuming missingness is completely at random.

Figure 1. Bias and precision of the estimated probability $F(y)$ in 5,000 Monte Carlo simulation trials, each of sample size 100. A) Scenario 1: exponential with no covariate; B) scenario 2: exponential with a covariate that causes $Y$; C) scenario 3: exponential with a covariate that causes $Y$ and missingness. EF, exponential full data; EG, exponential g-computation; EO, exponential observed data; NPF, nonparametric full data; NPO, nonparametric observed data; SPA, semiparametric augmented inverse probability weighted; SPF, semiparametric full data; SPG, semiparametric g-computation; SPH, semiparametric Hájek inverse probability weighted; SPI, semiparametric Horvitz-Thompson inverse probability weighted; SPO, semiparametric observed data; WF, Weibull full data; WG, Weibull g-computation; WO, Weibull observed data.

Figure 2. Bias and precision of the estimated probability $F(y)$ in 5,000 Monte Carlo simulation trials, each of sample size 100. A) Scenario 4: Weibull with no covariate; B) scenario 5: Weibull with a covariate that causes $Y$; C) scenario 6: Weibull with a covariate that causes $Y$ and missingness. EF, exponential full data; EG, exponential g-computation; EO, exponential observed data; NPF, nonparametric full data; NPO, nonparametric observed data; SPA, semiparametric augmented inverse probability weighted; SPF, semiparametric full data; SPG, semiparametric g-computation; SPH, semiparametric Hájek inverse probability weighted; SPI, semiparametric Horvitz-Thompson inverse probability weighted; SPO, semiparametric observed data; WF, Weibull full data; WG, Weibull g-computation; WO, Weibull observed data.

For scenario 4 (Figure 2A), the outcome data are Weibull with no covariate, and results mimic the results for scenario 1, with the exception that the parametric exponential results are biased due to the inappropriate restriction. For scenario 5 (Figure 2B), the outcome data are Weibull with a standard normal covariate that causes only the outcome, and results again mimic the results for scenario 2, apart from the exponential models being biased. Finally, for scenario 6 (Figure 2C), outcome data are Weibull with a standard normal covariate that causes the outcome and missingness, and results illustrate a combination of features seen in the prior scenarios. In scenario 6, all observed data estimators and the parametric exponential estimators are biased. The WG and SPA estimators were unbiased and most precise (with WG more precise than SPA).

Table 2 presents numerical summaries for scenarios 1 through 6. Many patterns are illustrated that are expected based on parametric, nonparametric, and semiparametric theory. For example, when the data are generated as exponential but a (more flexible) Weibull model estimator is used, there is a slight loss of precision due to the estimation of an unnecessary (Weibull shape) parameter. Contrariwise, when the data are generated as Weibull but a more restrictive exponential model estimator is used, the estimator is biased for the parameter of interest. Across estimators, there is precision gained when estimators leverage the presence of the informative auxiliary covariate, because more information from the data is used. In the absence of missing data (i.e., looking only at the 4 estimators based on full data), there is no discernible advantage to the semiparametric estimator. The augmented IPW estimator (which is semiparametric efficient (22)) is shown to improve on the simpler IPW estimators (which are not semiparametric efficient). Specifically, compared with the Horvitz-Thompson IPW, the augmented IPW standard error is reduced by 28% in scenario 3 (1 − 0.062/0.086) and by 22% in scenario 6 (1 − 0.072/0.092). In the former case, this near 30% reduction in the standard error equates to about a doubling of sample size. The Hájek IPW estimator recovered a sizable portion of the precision loss of the Horvitz-Thompson IPW estimator compared with the augmented IPW estimator. The WG estimator was slightly more precise than the SPA estimator. This is expected because the parametric WG estimator encodes more restrictions than the SPA estimator (i.e., the outcome model is Weibull rather than Cox). These restrictions also make the WG estimator less robust than the SPA estimator, as demonstrated by analogy with the bias of the exponential g-computation estimator (EG) in scenarios 4–6.
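The link between a standard error reduction and an equivalent change in sample size can be checked with a short calculation. It assumes, as is usual for estimators of this kind, that the standard error shrinks in proportion to one over the square root of the sample size. The function names are hypothetical; the standard errors are the Horvitz-Thompson IPW and augmented IPW values reported in Table 2 for scenarios 3 and 6.

```python
def se_reduction(se_new, se_reference):
    """Proportional reduction in standard error relative to a reference."""
    return 1.0 - se_new / se_reference


def sample_size_multiplier(se_new, se_reference):
    """Factor by which n must grow for the reference to match se_new,
    assuming the standard error is proportional to 1 / sqrt(n)."""
    return (se_reference / se_new) ** 2


# Augmented IPW versus Horvitz-Thompson IPW standard errors from Table 2.
scenario_3 = (se_reduction(0.062, 0.086), sample_size_multiplier(0.062, 0.086))
scenario_6 = (se_reduction(0.072, 0.092), sample_size_multiplier(0.072, 0.092))
```

Under that assumption, the scenario 3 reduction of about 28% corresponds to a sample size multiplier of roughly 1.9, and the scenario 6 reduction of about 22% to roughly 1.6.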

Table 2.

Bias and Precision of Estimated Probability $F(y)$ in 5,000 Monte Carlo Simulation Trials With Sample Size 100, Scenarios 1 Through 6, Half Data Missing

Estimator	1: Exponential With No Covariate		2: Exponential With Covariate		3: Exponential With Biasing Covariate		4: Weibull With No Covariate		5: Weibull With Covariate		6: Weibull With Biasing Covariate
Estimator	Bias ^a	SE	Bias ^a	SE	Bias ^a	SE	Bias ^a	SE	Bias ^a	SE	Bias ^a	SE
Nonparametric
Full data	−0.003	0.048	−0.003	0.049	−0.003	0.049	−0.003	0.048	−0.003	0.048	−0.003	0.048
Observed data	−0.007	0.068	−0.006	0.069	−0.130	0.071	−0.007	0.068	−0.007	0.068	−0.084	0.070
Semiparametric
Full data	−0.003	0.048	−0.003	0.048	−0.003	0.048	−0.003	0.048	−0.003	0.047	−0.003	0.048
Observed data	−0.007	0.068	−0.006	0.067	−0.130	0.070	−0.007	0.068	−0.007	0.068	−0.085	0.070
HT IPW	−0.001	0.068	0.000	0.064	−0.003	0.086	−0.001	0.069	−0.001	0.067	−0.001	0.092
Hájek IPW	−0.001	0.068	0.000	0.064	−0.003	0.066	−0.001	0.069	−0.001	0.067	−0.002	0.074
G-computation	−0.007	0.068	−0.006	0.062	−0.008	0.061	−0.007	0.068	−0.007	0.066	−0.008	0.068
Augmented IPW	−0.001	0.068	0.000	0.063	0.000	0.062	−0.001	0.069	−0.001	0.067	0.000	0.072
Weibull
Full data	0.000	0.039	0.001	0.040	0.000	0.040	0.001	0.039	0.000	0.039	0.001	0.039
Observed data	0.000	0.056	0.001	0.057	−0.126	0.059	0.001	0.056	0.001	0.056	−0.078	0.058
G-computation	0.000	0.056	0.001	0.049	−0.002	0.051	0.001	0.056	0.001	0.054	−0.001	0.057
Exponential
Full data	0.002	0.036	0.003	0.038	0.002	0.038	−0.228	0.067	−0.175	0.060	−0.174	0.062
Observed data	0.003	0.052	0.005	0.053	−0.119	0.053	−0.218	0.095	−0.165	0.086	−0.283	0.079
G-computation	0.003	0.052	0.005	0.045	0.003	0.047	−0.218	0.095	−0.165	0.082	−0.169	0.093


Abbreviations: HT, Horvitz-Thompson; IPW, inverse probability weighted; SE, empirical standard error.

^a Monte Carlo simulation error for bias < 0.002.

Analogous figures for the scenario with $n = 200$ are provided in the Web material (Web Figures 1 and 2, available at https://doi.org/10.1093/aje/kwab024). The pattern of results is similar to that in the scenario with $n = 100$ shown here. Also, figures and tabular results for the scenario with misspecified models (due to an unmeasured common cause of the outcome and missingness) are provided in the Web material (Web Figures 3 and 4, Web Table 1). As expected, only the NPF and semiparametric full-data estimators were unbiased in the misspecified scenario, and the Weibull full-data estimator performed best among misspecified approaches.

DISCUSSION

The moral of this story is an old one: It is best to be right. To be most accurate, be an “omniscient” oracle and pick the correct parametric model or rely on chance to accidentally specify the model correctly. Failing omniscience or luck, and at a small loss of precision, have enough foresight to choose a flexible parametric model that incorporates the correct parametric model as a special case. Failing omniscience, luck, and such seemingly impossible foresight, the semiparametric estimator performs best in the limited scenarios explored here. Of course, the parametric component of the semiparametric model needs to be correct for the semiparametric estimator to perform well. This point is reinforced with the results of the misspecified scenario. In our primary setting, the semiparametric model assumed proportional hazards of the outcome for unit changes in the auxiliary variable. Finally, all our estimators assumed data were independent, and there was no measurement error.

In principle, we could specify a nonparametric model for the distribution of $Y$ conditional on $X$ and allow our nonparametric estimators to depend on some data-adaptive function of the covariate (e.g., a data-adaptive restricted quadratic spline). But if $X$ were more than a single variable, restrictions would be needed to obtain well-functioning data-adaptive nonparametric estimators. While this is an intriguing frontier, it is beyond the scope of the present work.

Of course, our results are only guaranteed to hold in scenarios like those explored. This lack of generality is a central limitation of simulations, like those presented here. The scenarios explored were chosen to clearly illustrate theoretical claims about semiparametric statistical theory that might not be within the typical training of epidemiologists. Moreover, our results pertain to the set of estimators explored. We did not explore confidence interval coverage probability or length, which are helpful metrics for epidemiologic practice. Here we concentrated on the estimators themselves, rather than estimates of variability, which are complicated in their own right and are therefore the topic of future work.

In conclusion, we present these experimental results to help epidemiologists and other health data scientists better understand justifications for the use of estimators based on modern semiparametric statistical theory.

Supplementary Material

Web_Material_kwab024


ACKNOWLEDGMENTS

Author affiliations: Department of Epidemiology, UNC Gillings School of Global Public Health, University of North Carolina, Chapel Hill, Chapel Hill, North Carolina, United States (Stephen R. Cole, Jessie K. Edwards); NoviSci, Durham, North Carolina, United States (Alexander Breskin); and Department of Biostatistics, UNC Gillings School of Global Public Health, University of North Carolina, Chapel Hill, Chapel Hill, North Carolina, United States (Michael G. Hudgens).

This work was supported in part by the National Institute of Allergy and Infectious Diseases (grants R01AI157758—S.R.C., J.K.E., M.G.H.; P30AI50410—S.R.C., M.G.H.; and K01AI125087—J.K.E.).

Thanks to Dr. Alexander P. Keil for expert advice.

Conflicts of interest: none declared.

REFERENCES

1. Rothman KJ, Greenland S, Lash T. Modern Epidemiology. 3rd ed. New York, NY: Lippincott-Raven; 2008.
2. Cole SR, Hudgens MG, Brookhart MA, et al. Risk. Am J Epidemiol. 2015;181(4):246–250.
3. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53:457–481.
4. Wellner JA. Semiparametric models: progress and problems. Bull Inst Int Stat. 1985;51:1–23.
5. Cox DR. Regression models and life tables. J R Stat Soc B. 1972;34(2):187–220.
6. Casella G, Berger RL. Statistical Inference. 2nd ed. Pacific Grove, CA: Duxbury Press; 2002.
7. Turnbull BW. The empirical distribution function with arbitrarily grouped, censored and truncated data. J R Stat Soc B. 1976;38:290–295.
8. Aalen OO, Johansen S. Empirical transition matrix for non-homogeneous Markov chains based on censored observations. Scand J Stat. 1978;5:141–150.
9. Godambe VP. Estimation in survey sampling: robustness and optimality. J Am Stat Assoc. 1982;77:393–403.
10. Cox C, Chu H, Schneider MF, et al. Parametric survival analysis and taxonomy of hazard functions for the generalized gamma distribution. Stat Med. 2007;26(23):4352–4374.
11. Cox C. The generalized F distribution: an umbrella for parametric survival analysis. Stat Med. 2008;27(21):4301–4312.
12. Lin DY. On the Breslow estimator. Lifetime Data Anal. 2007;13(4):471–480.
13. Hjort N. On inference in parametric survival data models. Int Stat Rev. 1992;60:355–387.
14. Newey WK. Semiparametric efficiency bounds. J Appl Economet. 1990;5:99–135.
15. Stein C. Efficient nonparametric testing and estimation. In: Neyman J, ed. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, CA: University of California Press; 1956:187–195.
16. Begun JM, Hall WJ, Huang W, et al. Information and asymptotic efficiency in parametric-nonparametric models. Ann Stat. 1983;11:432–452.
17. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952;47:663–685.
18. Hájek J. Comment on an Essay by D. Basu. In: Godambe VP, Sprott DA, eds. Foundations of Statistical Inference. Toronto, Canada: Holt, Rinehart, and Winston; 1971:236.
19. Robins JM. A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect. Math Model. 1986;7:1393–1512.
20. Hernán MA, Robins JM. Causal Inference: What If? Boca Raton, FL: Chapman & Hall/CRC Press; 2020.
21. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550–560.
22. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89:846–846.
23. Tsiatis AA. Semiparametric Theory and Missing Data. New York, NY: Springer; 2006.
24. Daniel R. Double robustness. In: StatsRef: Statistics Reference Online. New York, NY: John Wiley & Sons, Ltd; 2018.

Supplementary Materials

Web_Material_kwab024 (PDF, 861.6 KB)
