Abstract
M-estimation is a statistical procedure that is particularly advantageous for some comon epidemiological analyses, including approaches to estimate an adjusted marginal risk contrast (i.e. inverse probability weighting and g-computation) and data fusion. In such settings, maximum likelihood variance estimates are not consistent. Thus, epidemiologists often resort to bootstrap to estimate the variance. In contrast, M-estimation allows for consistent variance estimates in these settings without requiring the computational complexity of the bootstrap. In this paper, we introduce M-estimation and provide four illustrative examples of implementation along with software code in multiple languages. M-estimation is a flexible and computationally efficient estimation procedure that is a powerful addition to the epidemiologist’s toolbox.
Keywords: M-estimation, estimating equations, logistic regression, standardization, data fusion
Key Messages.
M-estimation is a generalization of maximum likelihood estimation and allows for multiple estimators to be stacked together.
Compared with the nonparametric bootstrap, M-estimation is more computationally efficient for estimation of the variance.
M-estimation allows direct estimation of the variance for parameters that are functions of other parameters, such as the risk difference and log-transformed risk ratio.
M-estimation is easily implemented in statistical analysis software.
Introduction
Epidemiologists commonly use maximum likelihood estimation (MLE) to compute point estimates and confidence intervals (CI).1 M-estimation is a generalized version of the typical MLE approach.2 Unlike MLE, M-estimation is implemented in a way that naturally accommodates estimating multiple sets of equations or parameters simultaneously. This is useful for many common epidemiological analyses, e.g. modern standardization approaches [i.e. inverse probability weighting (IPW) and g-computation], and use of multiple, distinct data sources to address a question (e.g. using external validation data to address measurement error).3–5 In such settings, epidemiologists often resort to bootstrap because the standard MLE variance estimates are not appropriate (e.g. there are multiple estimation steps or the data are not identically distributed).6 In contrast, M-estimation variance estimates are valid in these settings. This has many beneficial properties, such as faster computation, highlighted below.
In this paper, we provide a brief introduction to M-estimation and illustrate implementation with four applied examples. We provide code in R, SAS and Python [https://github.com/rachael-k-ross/Mestimation-worked-example].
M-estimation
This is a brief introduction to M-estimation (see Supplementary Material, available as Supplementary data at IJE online for a more technical introduction). Say we want to estimate the mean of , denoted by , given a sample of units. We could use the equation, . Instead of using this equation, we can rearrange the terms into . This second equation is called an estimating equation, where the left-hand side consists of a summation of a function of the observed data and the parameter (termed an estimating function) and the right-hand side is zero.7 When is a finite-dimension parameter, as is the case with the mean, we call a M-estimator. To estimate with the estimating equation, we can iteratively plug values for into the estimating equation until we find the value of where the summation is (or is close to) zero. This procedure is called root-finding, as the ‘root’ of a function is where it is equal to zero. Standard statistical software offers a variety of root-finding algorithms.8–10
Typically, we would estimate the variance (and consequently the standard error) for the mean using the standard equation for the variance, .11 This is the MLE variance estimator. It is also possible to estimate the variance using a different estimator, the empirical sandwich variance estimator. The sandwich variance estimator leverages the estimating equation and actually combines two different ways of estimating the variance (i.e. the inverse of the negative derivative of the estimating equations and the outer product of the estimating equations) (see Supplementary Material, available as Supplementary data at IJE online for more details).2 Both of these pieces can be computed with standard statistical software.
Implementing an M-estimator consists of three steps: (i) determining the estimating equations; (ii) using root-finding to find the point estimates; and (iii) using the sandwich variance estimator to estimate the standard error. As we can rely on software for implementing (ii) and (iii),7,12,13 we do not need to manually solve the corresponding equations.
Whereas applying the M-estimator for the mean is more complicated than use of pre-packaged MLE functions, the M-estimation framework has several advantages due to its flexibility. With M-estimation one can ‘stack’ estimating equations for different parameters together and solve this stack simultaneously. This feature is useful, as epidemiological analyses often involve estimating multiple parameters, including parameters of interest (e.g. a risk difference) and parameters that are not of interest, but that are required for estimating our parameter of interest (e.g. the parameters of a propensity score model). By stacking estimating equations together, the sandwich variance estimator allows for the variance of one parameter to depend on other parameters. This makes standard error estimation simpler when the parameter of interest is a function of multiple parameters. For example, obtaining the standard error of a risk difference or log-transformed risk ratio involves combining the standard errors of two risks via the delta method. The sandwich variance estimator automates the delta method. Although the delta method can be easily implemented by hand for the risk difference or log-transformed risk ratio, appropriately accounting for the uncertainty in estimating, for example the propensity score, when implementing a weighted estimator is more complicated. It is in such a setting that the sandwich variance estimator is particularly valuable. To showcase these advantages for epidemiological analyses, we illustrate how to obtain estimating equations and the application of M-estimators in four applied examples in the next section.
Applications of M-estimation
Here we demonstrate how M-estimation can be used to estimate point estimates and standard errors for:
the parameters of a multivariable logistic regression;
a marginal risk difference by IPW;
a marginal risk difference by g-computation;
a prevalence corrected for misclassification.
Application 1 illustrates the basic tools of M-estimation and shows how M-estimation aligns with MLE. The subsequent applications illustrate how M-estimation simplifies estimation of multiple parameters in different models and, unlike standard MLE, directly provides appropriate standard errors. Often, the standardization methods used in Applications 2 and 3 (IPW and g-computation) target causal effect contrasts; here, for simplicity, we assume standard causal identification criteria are fulfilled.4,5 Code to implement these examples is available on GitHub. R, SAS and Python have procedures for implementing the necessary steps (e.g. root-finding and taking the derivative); R and Python have specific packages for M-estimation (geex12 and delicatessen13).
Logistic regression
We use data from the Zambia Preterm Birth Prevention Study.14 We analyse pregnant participants with complete data. Let the observed data for individual be , where is preterm birth, is early-pregnancy anaemia and is early-pregnancy elevated blood pressure. We want to estimate the parameters of the logistic regression model:
where . Let ; is the design matrix.
Typically, we would estimate by maximizing the log-likelihood of the data (i.e. MLE). Under certain conditions, the maximum of the log-likelihood function is equivalent to finding where the partial derivatives (i.e. slope of the corresponding tangent line at a given point) for each parameter of the log-likelihood function, or score function, equal zero. Therefore, the estimating equations for a logistic model (or any generalized linear model) are the partial derivatives of the corresponding log-likelihood. The estimating equations here are:
See Supplementary Material (available as Supplementary data at IJE online) for how these estimating equations were derived. Because consists of three parameters, there are three estimating equations. In each, the predicted probability of the outcome (i.e. ) is being subtracted from the observed outcome; this is the residual. Then, this residual is multiplied by the appropriate variable for that parameter (for the intercept, it is 1 for everyone). The estimating equations for most refression models will take this same form: i.e. the residual times each column of the design matrix.
Now that we have our estimating equations, we rely on software for root-finding to estimate and to implement the sandwich variance estimator to estimate the standard error and then construct Wald-type CIs. Table 1 includes the estimates and 95% CIs by M-estimation with comparison with MLE. The odds of preterm birth for individuals with anaemia were 1.12 times (95% CI 0.65, 1.93) the odds of preterm birth for individuals without anaemia. The odds for individuals with elevated blood pressure were 1.43 times (95% CI 0.90, 2.27) the odds for individuals without elevated blood pressure. When the assumed parametric family is correctly specified (e.g. a logistic model for a binary outcome), the M-estimation and MLE approaches are equal as increases to infinity (i.e. are asymptotically equivalent).15 When the assumed parametric model is incorrect (e.g. a Poisson model for binary outcomes), they are no longer asymptotically equivalent. In such cases, the sandwich variance estimator is recommended as it is robust to this misspecification.16 This feature of the sandwich variance estimator has been used to correctly estimate the variance when using Poisson regression to estimate risk ratios (i.e. ‘modified’ Poisson).17,18
Table 1.
Point estimates with 95% confidence intervals from maximum likelihood and M-estimation in first three applied examples, from R
| MLE | M-estimation |
||||
|---|---|---|---|---|---|
| Parameter | Logistic regression | Logistic regression | IPWa | G-computationa | |
| Outcome model | , intercept | –1.89 (–2.13, –1.66) | –1.89 (–2.13, –1.66) | – | –1.89 (–2.13, –1.66) |
| , coefficient for anaemia | 0.12 (–0.43, 0.66) | 0.12 (–0.43, 0.66) | – | 0.12 (–0.43, 0.66) | |
| , coefficient for BP | 0.36 (–0.11, 0.82) | 0.36 (–0.11, 0.82) | – | 0.36 (–0.11, 0.82) | |
| Propensity score model | , intercept | – | – | –1.74 (–1.95, –1.53) | – |
| , coefficient for BP | – | – | –0.30 (–0.83, 0.24) | – | |
| Parameters of interest | , risk under no anaemia | – | – | 0.14 (0.11, 0.17) | 0.14 (0.11, 0.17) |
| , risk under anaemia | – | – | 0.15 (0.09, 0.22) | 0.16 (0.09, 0.22) | |
| , risk difference | – | – | 0.01 (–0.06, 0.08) | 0.02 (–0.06, 0.09) | |
BP, blood pressure; IPW, inverse probability weighting; MLE, maximum likelihood estimation.
Differences in and are due to the different modelling approaches of g-computation and IPW.
Estimating the marginal risk difference
Using the same data as the prior example, we now want to estimate the marginal risk difference of anaemia on preterm birth, adjusted for (standardized by) elevated blood pressure. We illustrate two standardization approaches, IPW and g-computation. Parameter estimates are included in Table 1.
Inverse Probability Weighting
The top of Table 2 describes the four steps to estimate the marginal risk difference by IPW: (i) estimating in the propensity score logistic regression model; (ii) estimating the marginal risk under no anaemia, ; (iii) estimating the marginal risk under anaemia, ; and (iv) estimating the risk difference, . To use M-estimation, we translate each step into estimation equations as shown in Table 2. Step 1 is estimating a logistic regression. We derived the estimating equations for logistic regression in the prior example. Steps 2 and 3 are estimating sample means. The equation for , in the fourth step, can be rearranged into the form of an estimating equation. The Supplementary Material (available as Supplementary data at IJE online) shows the derivation of each estimating equation. Finally, we stack these estimating equations together:
Table 2.
Standardization by inverse probability weighting and by g-computation
| Inverse probability weighting | |||
|---|---|---|---|
| Estimation steps | |||
| # | Description | Equation | Estimating equation |
| 1 | Estimate in the propensity score logistic regression model | ||
| 2 | Estimate the standardized risk under no anaemia by taking the mean of among individuals with , weighted by the inverse of | ||
| 3 | Estimate the standardized risk under anaemia by taking the mean of among individuals with , weighted by the inverse of | ||
| 4 | Estimate the risk difference | ||
| G-computation | |||
|---|---|---|---|
| Estimation steps | |||
| # | Description | Equation | Estimating equation |
| 1 | Estimate in the outcome logistic regression model | ||
| 2 | Estimate the standardized risk under no anaemia by taking the mean of the predicted probabilities of using and setting | ||
| 3 | Estimate the standardized risk under anaemia by taking the mean of the predicted probabilities of using and setting | ||
| 4 | Estimate the risk difference | ||
Here, M-estimation produces point estimates and standard errors (and thus CIs) for all the parameters . Using the sandwich variance estimator, the estimated standard errors for and appropriately account for the uncertainty in . For , the sandwich variance estimator is an implementation of the delta method (because this estimating equation is only a function of other parameters and not directly a function of the data), incorporating uncertainty in and .7
In practice, it is common to estimate so-called ‘robust’ variances when implementing IPW.19 Such variances are estimated using the sandwich variance estimator where the stack of estimating equations does not include the estimating equations for the propensity score model parameters, thus treating the propensity score parameters as known. With IPW for the risk difference, these robust variances are conservative (i.e. larger than the empirical variance).20 In contrast, M-estimation provides a consistent variance estimator when including the propensity score model estimating equations. When using odds weights (e.g. to standardize to individuals with anaemia), the robust standard errors may be anticonservative, whereas M-estimation still provides a consistent variance estimator.20
G-computation
The bottom of Table 2 describes the four steps to estimate the marginal risk difference by g-computation21 and the estimating equations corresponding to each step. The translation of each step follows from the previous examples and is shown in detail in the Supplementary Material (available as Supplementary data at IJE online). Finally, we stack these estimating equations on top of each other:
Here, M-estimation produces point estimates and standard errors for all the parameters . If we had alternatively used MLE to estimate the outcome model in Step 1 (Table 2), we would typically rely on bootstrap to estimate standard errors for the risks and the risk difference. Although generally easy to implement, bootstrap can be computationally inefficient, particularly with large sample sizes, as it requires having to resample and compute the estimator using each of the resamples. Instead, M-estimation only requires numerically approximating the derivatives and matrix algebra operations for the sandwich variance estimator, making it more computationally efficient than the bootstrap.
Addressing outcome misclassification
Here, we leverage external validation data to address outcome misclassification to estimate the prevalence of the outcome. We include only a summary of the approach in the main text, with technical details in the Supplementary Material (available as Supplementary data at IJE online).
Specifically, we want to estimate the point prevalence of HIV treatment for 950 HIV-positive adults in Multicenter AIDS Cohort Study/Women’s Interagency HIV Study (MACS/WIHS) combined cohort study in 1995.5,22 However, only self-reported treatment is available, which is a potentially misclassified measurement of our outcome of interest. In an external sample (two studies pooled: one in the MACS and one in the University of North Carolina centre for AIDS Research HIV Cohort study), we have data on both the gold standard measure of HIV treatment (based on medical and pharmacy records) and self-reported treatment for 331 individuals. This is an example of data fusion where multiple data sources (here, the study sample and external validation data) are combined to estimate the parameter of interest.
To correct for the measurement error, we use the Rogan Gladen equation, which corrects the probability of the mismeasured outcome using the sensitivity and specificity of the measurement. In the Supplementary Material (available as Supplementary data at IJE online), we provide the Rogan Gladen equation, illustrate derivation of the estimating equations and provide the full stack of equations. The stack includes equations for four parameters: (i) mismeasured outcome prevalence in the MACS/WIHS; (ii) the sensitivity from the external sample; (iii) the specificity from the external sample; and (iv) the corrected outcome prevalence in the MACS/WIHS, our parameter of interest.
Using M-estimation (and the sandwich variance estimator), the estimated variance for the corrected prevalence appropriately incorporates the uncertainty in the estimates of sensitivity and specificity from the external sample. An alternative for variance estimation would be to use bootstrap. Because we have two samples (the MACS/WIHS sample and the external sample), the bootstrap would entail resampling separately from each sample. This example highlights that M-estimation can be used when data are not independent and identically distribution (i.e. iid), which is generally the case in data fusion analyses.
Using M-estimation, the naïve prevalence of treatment based on self-report (i.e. subject to measurement error) is 0.72 (95% CI 0.69, 0.74). The estimated sensitivity and specificity are 0.84 (95% CI 0.80, 0.89) and 0.80 (95% CI 0.71, 0.88), respectively. The measurement-error-corrected prevalence of treatment is 0.80 (95% CI 0.72, 0.88).
Discussion
Here, we introduced M-estimation and provided four examples illustrating its application to address common problems in epidemiology. Further examples and technical details are available.2,5,7,13 MLE is a default method and is easy to implement for many analytical approaches used in epidemiology. However, there are some settings where the flexibility of M-estimation may be preferred, including estimating multiple parameters simultaneously by stacking equations together, automating the delta method, or when data are not independent and identically distributed. The computational efficiency of M-estimation over the bootstrap may be meaningful when paired with multiple imputation for missing data or conducting sensitivity analysis across a range of inputs.
M-estimation has limitations. First, as shown, there is an estimating equation for each parameter, meaning that theoretically M-estimation only applies to a finite number of parameters. Therefore, standard M-estimation is not justified for some nonparametric applications, like the Kaplan–Meier estimator. Second, the sandwich variance estimator involves taking the derivative of the estimating equation, so the estimating equation must be differentiable. The estimating equation for the median, for example, is not differentiable.2 Third, M-estimation cannot be used when the estimating equation depends on . Such is the case with the Cox proportional hazards model.23 Finally, M-estimation is based on large-sample theory and may not be unbiased in small samples.
Despite these restrictions, M-estimation is a flexible and computationally efficient estimation procedure that is a powerful addition to the epidemiologist’s toolbox.
Supplementary Material
Contributor Information
Rachael K Ross, Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY, USA.
Paul N Zivich, Institute for Global Health and Infectious Diseases, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
Jeffrey S A Stringer, Department of Obstetrics and Gynecology, School of Medicine, University of North Carolina, Chapel Hill, NC, USA.
Stephen R Cole, Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
Data availability
The data underlying this article are available in GitHub, at [https://github.com/rachael-k-ross/Mestimation-worked-example].
Supplementary data
Supplementary data are available at IJE online
Author contributions
All authors designed the tutorial. P.Z. and R.R. created the code. R.R. drafted the paper. All authors critically revised the paper.
Funding
R.R. is supported in part by a grant from the National Institute of Drug Abuse (R01DA056407). P.Z. is supported by a training grant from National Institute of Allergy and Infectious Diseases (T32AI007001) and in part by a grant from the National Institute of Allergy and Infectious Diseases (K01AI177102). S.C. is supported in part by a grant from the National Institute of Allergy and Infectious Diseases (R01AI157758).J.S.A.S. is supported by grants from the National Institutes of Health (5P30AI050410) and Gates Foundation (INV016221).
Conflict of interest
None declared.
References
- 1. Cole SR, Chu H, Greenland S.. Maximum likelihood, profile likelihood, and penalized likelihood: a primer. Am J Epidemiol 2014;179:252–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Stefanski LA, Boos DD.. The calculus of M-estimation. Am Stat 2002;56:29–38. [Google Scholar]
- 3. Naimi AI, Cole SR, Kennedy EH.. An introduction to g methods. Int J Epidemiol 2017;46:756–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Hernán MA, Robins JM.. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC, 2020. [Google Scholar]
- 5. Cole SR, Edwards JK, Breskin A. et al. Illustration of two fusion designs and estimators. Am J Epidemiol 2022;192:467–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Kulesa A, Krzywinski M, Blainey P, Altman N.. Sampling distributions and the bootstrap. Nat Methods 2015;12:477–78. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Boos DD, Stefanski LA.. Chapter 7. M-estimation (estimating equations). In: Essential Statistical Inference Theory Methods. New York: Springer, 2013, pp. 297–337. [Google Scholar]
- 8. Press WH, Teukolsky SA, Vetterling WT, Flannery BP.. Root finding and nonlinear sets of equations. In: Numerical Recipes: The Art of Scientific Computing. 3rd edn. Cambridge: Camrbidge University Press, 2007, pp. 442–486. [Google Scholar]
- 9. Virtanen P, Gommers R, Oliphant TE. et al. ; SciPy 1.0 Contributors. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 2020;17:261–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Soetaert K, Hindmarsh AC, Eisenstat SC, Moler C, Dongarra J, Saad Y.. rootSolve: Nonlinear Root Finding, Equilibrium and Steady-State Analysis of Ordinary Differential Equations [Internet]. 2021. https://cran.r-project.org/package=rootSolve (6 February 2024, date last accessed).
- 11. Rothman KJ, Greenland S, Lash TL.. Modern Epidemiology. Philadelphia, PA: Lippincott Williams & Wilkins, 2008. [Google Scholar]
- 12. Saul BC, Hudgens MG.. The calculus of M-estimation in R with geex. J Stat Softw 2020;92:1–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Zivich PN, Klose M, Cole SR, Edwards JK, Shook-Sa BE.. Delicatessen: M-Estimation in Python. arXiv, doi:arXiv:2203.11300, 10 October 2022, preprint: not peer reviewed. http://arxiv.org/abs/2203.11300 (6 February 2024, date last accessed).
- 14. Castillo MC, Fuseini NM, Rittenhouse KJ. et al. Zambian Preterm Birth Prevention Study (ZAPPS): cohort characteristics at enrollment. Gates Open Res 2018;2:25–18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Mansournia MA, Nazemipour M, Naimi AI, Collins GS, Campbell MJ.. Reflection on modern methods: demystifying robust standard errors for epidemiologists. Int J Epidemiol 2021;50:346–51. [DOI] [PubMed] [Google Scholar]
- 16. Royall RM. Model robust confidence intervals using maximum likelihood estimators. Int Stat Rev Rev Int Stat 1986;54:221. [Google Scholar]
- 17. Zou G. A modified poisson regression approach to prospective studies with binary data. Am J Epidemiol 2004;159:702–706. [DOI] [PubMed] [Google Scholar]
- 18. McNutt L-A, Wu C, Xue X, Hafner JP.. Estimating the relative risk in cohort studies and clinical trials of common outcomes. Am J Epidemiol 2003;157:940–43. [DOI] [PubMed] [Google Scholar]
- 19. Robins JM, Hernán MA, Brumback B.. Marginal structural models and causal inference in epidemiology. Epidemiology 2000;11:550–60. [DOI] [PubMed] [Google Scholar]
- 20. Reifeis SA, Hudgens MG.. Practice of epidemiology on variance of the treatment effect in the treated when estimated by inverse probability weighting. Am J Epidemiol 2022;191:1–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Snowden JM, Rose S, Mortimer KM.. Implementation of G-computation on a simulated data set: demonstration of a causal inference technique. Am J Epidemiol 2011;173:731–38. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Cole SR, Jacobson LP, Tien PC, Kingsley L, Chmiel JS, Anastos K.. Using marginal structural measurement-error models to estimate the long-term effect of antiretroviral therapy on incident AIDS or death. Am J Epidemiol 2010;171:113–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Lin DY, Wei LJ.. The robust inference for the cox proportional hazards model. J Am Stat Assoc 1989;84:1074–78. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The data underlying this article are available in GitHub, at [https://github.com/rachael-k-ross/Mestimation-worked-example].
