On the role of parameterization in models with a misspecified nuisance component

Heather S Battey; Nancy Reid

doi:10.1073/pnas.2402736121

2024 Aug 30;121(36):e2402736121. doi: 10.1073/pnas.2402736121

PMCID: PMC11388321 PMID: 39213177

Significance

Statistical models are often chosen based on a combination of scientific understanding and flexibility or mathematical convenience. While the aspects of core scientific relevance may be relatively securely specified in terms of interpretable interest parameters, the rest of the formulation is often chosen somewhat arbitrarily. In many statistical models this formulation includes so-called nuisance parameters, which are of no direct subject-matter concern, but are needed to complete the model or reflect the complexity of the data. This paper contributes to the foundations of statistics by studying the interplay between model structure and likelihood inference, under misspecification of the nuisance component.

Keywords: causality, model structure, parameter orthogonality, symmetric parameterization, model formulation

Abstract

The paper is concerned with inference for a parameter of interest in models that share a common interpretation for that parameter but that may differ appreciably in other respects. We study the general structure of models under which the maximum likelihood estimator of the parameter of interest is consistent under arbitrary misspecification of the nuisance part of the model. A specialization of the general results to matched-comparison and two-groups problems gives a more explicit and easily checkable condition in terms of a notion of symmetric parameterization, leading to a broadening and unification of existing results in those problems. The role of a generalized definition of parameter orthogonality is highlighted, as well as connections to Neyman orthogonality. The issues involved in obtaining inferential guarantees beyond consistency are briefly discussed.

Scientific practice in nearly every field relies on the use of mathematical models: a provisional base for describing how observable quantities materialize. Some models, such as Einstein’s theory of gravitation, provide predictions that can be verified with such accuracy that they are often considered to be the truth. This is, however, an atypical situation: The processes underlying most observable data are much too complicated to be described by exact laws and often require a probabilistic element. A useful statistical model captures the essence of the data-generating mechanism in a way that can accommodate new data collected under slightly different conditions. When the primary purpose of a model is to provide insight and explanation, as is typical in many scientific contexts, stable aspects should ideally be capable of interpretation. For this, statistical models will usually have a relatively small number of scientifically interpretable parameters of interest. So-called nuisance parameters are used to complete the specification of the model.

In contexts where prediction is the main focus, such as arise in many applications of machine learning, model-free approaches are sometimes advocated. This is justified on the grounds that faithful representation of a data-generating process is both unnecessary for such a task and also unrealistically optimistic with very large or complex sets of data. For instance in commercial recommender systems, the criteria of speed and accuracy of predictions are key, and the identification of interpretable parameters unimportant. In sharp contrast are contexts in which the goal is to understand how a treatment or exposure influences the distribution of outcomes, having accounted, to the extent feasible, for the diversity of individuals. Breiman (1) described this separation succinctly as the “two cultures” of statistical thought. The development in this paper is closer to the view expressed by Cox (2) in the discussion of that work.

We consider sets of statistical models that share a common interpretation for the parameter of subject-matter interest but differ in their specification of the nuisance component. A fundamental question is whether reliable inference for the interest parameter is still achievable using standard approaches when the nuisance aspect is misspecified. An exemplar setting is in understanding the effect of treatment or exposure on the health outcomes of individuals. These individuals may exhibit considerable heterogeneity and complex interdependencies stemming, for instance, from certain shared genetic traits. Modeling such interdependencies is a formidable challenge and is often tackled by introducing a parameter for each individual. If these are postulated to have been drawn from a parametric distribution, the resulting doubly stochastic models are called random-effects models. Since the choice of parametric distribution for the individual-specific nuisance parameters is typically based on mathematical convenience, misspecification seems likely.

The extensive literature on inference under model misspecification appears to have started with Cox (3), who considered two nonnested models for a vector random variable $Y$ , specifying joint density or mass functions $m$ and $\overset{ˇ}{m}$ , say, where the latter is completely known up to a finite-dimensional parameter $θ$ . With $\hat{θ}$ the maximum likelihood estimator under model $\overset{ˇ}{m}$ , Cox (3) showed that when the true density function is $m$ , $\hat{θ}$ is asymptotically normal with expected value $θ_{m}^{0}$ and covariance matrix given by what is now known as the sandwich formula. The quantity $θ_{m}^{0}$ solves

$E_{m} {[\nabla_{θ} \log \overset{ˇ}{m} (Y ; θ)]}_{θ = θ_{m}^{0}} = 0,$

where $E_{m}$ is expectation under $m$ . Equivalently $θ_{m}^{0}$ minimizes with respect to $θ$ the Kullback–Leibler divergence

$\int m (y) \log \left(\frac{m (y)}{\overset{ˇ}{m} (y ; θ)}\right) d y .$

More rigorous discussions of the distribution theory and regularity conditions were provided by Huber (4), Kent (5), and White (6, 7).
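To make the pseudo-true value $θ_{m}^{0}$ concrete, the following minimal sketch (an illustration, not taken from the paper) fits a deliberately misspecified exponential model with rate $θ$ to data drawn from a gamma distribution with shape 2 and unit scale. The choice of distributions, the sample sizes, and the names `exp_rate_mle` and `simulate_theta_hat` are assumptions made for the example. In this case the score equation $E_{m} [1 / θ - Y] = 0$ gives $θ_{m}^{0} = 1 / E_{m} (Y) = 1 / 2$, and the simulated maximum likelihood estimate should settle near that value.

```python
import random

def exp_rate_mle(sample):
    """MLE of the rate theta in the assumed Exp(theta) model: n / sum(y)."""
    return len(sample) / sum(sample)

def simulate_theta_hat(n, seed=0):
    """Draw n observations from the true density m (gamma with shape 2 and
    scale 1, an illustrative choice) and fit the misspecified exponential."""
    rng = random.Random(seed)
    sample = [rng.gammavariate(2.0, 1.0) for _ in range(n)]
    return exp_rate_mle(sample)

# Pseudo-true value: solves E_m[1/theta - Y] = 0, i.e. theta = 1 / E_m(Y).
THETA_PSEUDO_TRUE = 1.0 / 2.0

if __name__ == "__main__":
    for n in (100, 10_000, 200_000):
        print(n, simulate_theta_hat(n))
```

The estimator converges, but to a value determined jointly by the assumed model and the true density, which is the sense in which $θ_{m}^{0}$ minimizes the Kullback–Leibler divergence.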

Inferential guarantees for the quantity $θ_{m}^{0}$ follow directly under classical regularity conditions on the family of models. Our interest, by contrast, is in the estimation of the true value of an interest parameter, which is assumed to have a common interpretation in both the true distribution and in the fitted model. Our focus is on uncovering the structure for which standard likelihood theory retains its first-order theoretical guarantees, providing foundational insight into the limits of likelihood inference. Reference to the unsolved nature of this question is widespread; see, for instance, Evans and Didelez (8, section 5).

A different line of exploration, with connections to the double-robustness literature [e.g., Robins et al. (9) and Chernozhukov et al. (10)] is the so-called assumption-lean approach to modeling of Vansteelandt and Dukes (11), who center their analyses on model-free estimands rather than parameters of a given model. The ideal estimand coincides with a particular model parameter when the chosen model includes $m$ and retains a degree of interpretability when the model is misspecified. The tension between estimands and model parameters has a long history, going back to Fisher and Neyman; see Cox (12).

In the discussion of ref. 11, Battey (13) conjectured a condition under which the maximum likelihood estimator for an interest parameter is consistent in spite of arbitrary misspecification of the nuisance part of the model. The intuition for that incomplete claim came from a highly involved calculation in a particular paired-comparisons model studied in Battey and Cox (14). Any role played by the assumed model for the random effects was unclear from that calculation and is clarified by the more general insight provided here.

In Section 1, we study the general structure for such a consistency result, and specialize the analysis to settings of common relevance, providing more explicit conditions on the parameterization for matched-pair and two-group problems. An important conclusion from this analysis is that the conditions for consistency can sometimes be checked without knowledge of the true model. Section 2 shows through this route that some key insights have been overlooked in a rather large literature on misspecified random effects distributions, particularly the role of parameterization. Section 2.C recovers two well-known results for generalized linear models as an illustration of the more general statements, and Section 2.D provides a recent example in the context of a marginal structural model, for which the more general results of Section 1 provide elucidation. While some of the material in Section 1 is rather technical, we have aimed to provide intuition for the main results and have provided several examples in Section 2. The examples are deliberately detached from specific subject-matter details, in order that they be broadly useful for a range of scientific applications.

1. Consistency in General Misspecified Models

1.A. General Conditions for Consistency.

Let $m$ represent the density function for the outcomes, parameterized in terms of an interest parameter $ψ$ with true value $ψ^{*}$. The assumed model, while sharing the same interpretable interest parameter, is misspecified in other ways. The joint density function for the observations under the assumed model has parameters $(ψ, λ) \in Ψ \times Λ$, and $ℓ (ψ, λ) = \log \overset{ˇ}{m} (y ; ψ, λ)$ denotes the observed log-likelihood function for that model, viewed as a function of $(ψ, λ)$ for observed data $y = (y_{1}, \dots, y_{n})$. As above, maximization of $ℓ (ψ, λ)$ gives estimates $(\hat{ψ}, \hat{λ})$; their dependence on $y$ is suppressed in the notation. Viewed as a function of the random variable $Y = (Y_{1}, \dots, Y_{n})$, the maximum likelihood estimator has probability limit $(ψ_{m}^{0}, λ_{m}^{0})$ as $n \to \infty$, where $(ψ_{m}^{0}, λ_{m}^{0})$ is the solution to

$E_{m} [\nabla_{(ψ, λ)} ℓ (ψ_{m}^{0}, λ_{m}^{0})] = 0 .$ [1.1]

We consider the model to be misspecified if there are no values of $λ$ in $Λ$ for which the true density $m$ is recovered. This precludes the situation in which the true distribution belongs to a submodel of an assumed encompassing family, briefly discussed in Section 3.C.

The following example will be used repeatedly to illustrate key ideas and notation.

Example 1.1

Consider a matched comparison problem in which, for each of $n$ twin pairs, one individual from each pair is chosen at random to receive a treatment, the other being the untreated control. Let $Y_{i 1}$ and $Y_{i 0}$ denote the time until some medical event of interest (e.g., recovery from illness) for the treated and untreated individual in the $i$th pair. A simple parametric model specifies the outcomes $Y_{i 1}$ and $Y_{i 0}$ as exponentially distributed with rates $γ_{i} ψ > 0$ and $γ_{i} / ψ > 0$, respectively. The pair-specific nuisance parameter $γ_{i}$ captures, for instance, genetic differences between the pairs of twins. The parameter of interest $ψ$, with true value $ψ^{*}$, quantifies the effect of the treatment: $ψ^{2}$ is the multiplicative effect of the treatment on the instantaneous probability of recovery relative to baseline. Other parameterizations are possible, but the above symmetric parameterization will be shown to play an important role. Suppose the pair effects are modeled as gamma distributed with shape $κ$ and rate $ρ$. Then, under the assumed model, the joint density function for the outcomes in any given pair at $(y_{1}, y_{0})$ is

$\frac{Γ (κ + 2) ρ^{κ}}{Γ (κ) {(y_{1} ψ + y_{0} / ψ + ρ)}^{κ + 2}} .$ [1.2]

The true random effects distribution could have been quite different, so that the model Eq. 1.2 is misspecified. Nevertheless, the interpretation of the interest parameter $ψ$ is stable over the different specifications. The notional nuisance parameter is $λ = (κ, ρ)$ .
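For readers who want the intermediate step, Eq. 1.2 follows by integrating the conditional exponential densities of Example 1.1 against the assumed gamma density for the pair effect. The following is a sketch of that calculation in the notation of the example; the symbol $s$ is shorthand introduced here and does not appear in the original.

```latex
% Conditional on \gamma, the outcomes are independent exponentials with
% rates \gamma\psi and \gamma/\psi, so their joint density is
%   (\gamma\psi)(\gamma/\psi)\, e^{-\gamma (y_1\psi + y_0/\psi)} = \gamma^{2} e^{-\gamma s},
% with s := y_1\psi + y_0/\psi (shorthand introduced for this sketch).
\begin{align*}
\int_0^\infty \gamma^{2} e^{-\gamma s}\,
  \frac{\rho^{\kappa}}{\Gamma(\kappa)}\,\gamma^{\kappa-1} e^{-\rho\gamma}\, d\gamma
&= \frac{\rho^{\kappa}}{\Gamma(\kappa)}
   \int_0^\infty \gamma^{\kappa+1} e^{-\gamma (s+\rho)}\, d\gamma \\
&= \frac{\Gamma(\kappa+2)\,\rho^{\kappa}}
        {\Gamma(\kappa)\,(y_1\psi + y_0/\psi + \rho)^{\kappa+2}} .
\end{align*}
```

The interest parameter enters the marginal density only through $y_{1} ψ + y_{0} / ψ$, whatever the values of $κ$ and $ρ$.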

Remark 1.1

A different analysis is possible in this example, treating the pair effects $γ_{1}, \dots, γ_{n}$ as fixed arbitrary constants, which can be eliminated from the analysis by basing likelihood inference on the distribution of the ratios $Z_{i} = Y_{i 1} / Y_{i 0}$ ; this is discussed briefly in Section 3.B.

Definition 1.1 (parameter m-orthogonality):

Let $\nabla_{ψ λ}^{2} ℓ (ψ, λ)$ denote the cross-partial derivative of the assumed log-likelihood function. The parameter $ψ$ is said to be $m$ -orthogonal to the notional parameter $λ$ if $E_{m} [\nabla_{ψ λ}^{2} ℓ (ψ, λ)] = 0$ . This can hold globally for all $(ψ, λ)$ or at particular values, the notation $Ψ ⊥_{m} Λ$ indicating global $m$ -orthogonality and $Ψ ⊥_{m} λ$ and $ψ ⊥_{m} Λ$ indicating, respectively, local $m$ -orthogonality at $λ$ for any $ψ$ , and local $m$ -orthogonality at $ψ$ for any $λ$ .

To gain some intuition for parameter $m$ -orthogonality, consider first the much stronger condition, $\nabla_{ψ λ}^{2} ℓ (ψ, λ) = 0$ for all $ψ$ and $λ$ . This corresponds to what is called a cut in the parameter space and implies $E_{m} [\nabla_{ψ λ}^{2} ℓ (ψ, λ)] = 0$ for any $m$ so that the true density $m$ trivially plays no role. In this strongest setting, parameter orthogonality is a purely geometric property of the log-likelihood function that holds only in certain parameterizations. It is effectively an absence of torsion, implying that, for any fixed $λ$ , the corresponding log-likelihood function over $Ψ$ is maximized at the same point. This parameter separation in the likelihood function is, however, limited to a relatively small number of models. The weaker condition $E_{m} [\nabla_{ψ λ}^{2} ℓ (ψ, λ)] = 0$ only requires the absence of torsion to hold on average over hypothetical repeated draws from the true distribution. If the model is correctly specified, the usual definition of parameter orthogonality with respect to Fisher information is recovered.

Propositions 1.1 and 1.2 give two alternative general conditions for consistency of $\hat{ψ}$ in spite of arbitrary misspecification of the nuisance part of the model. Their proofs, and those of subsequent results, are in SI Appendix.

Proposition 1.1

Let the observed log-likelihood function for the assumed model be strictly concave as a function of $(ψ, λ)$ . Then, $ψ_{m}^{0} = ψ^{*}$ if and only if $E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ_{m}^{0})] = 0$ . The latter condition is equivalent to $E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ)] = 0$ for all $λ$ if and only if $ψ^{*} ⊥_{m} Λ$ .

Remark 1.2

Local orthogonality at a particular value, $λ^{'}$ say, is not sufficient to ensure that $E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ^{'})] = 0$ implies $E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ_{m}^{0})] = 0$ , as is clear from the proof of Proposition 1.1.

Remark 1.3

In Example 1.1, it was shown in ref. 14 that $ψ$ is globally $m$ -orthogonal to $λ = (κ, ρ)$ for any distribution over the random effects, and the consistency of the maximum likelihood estimator of $ψ$ was established through an explicit calculation. An initial motivation for the present work was to understand how sensitive that conclusion was to various aspects of the model formulation.
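The orthogonality claim in Remark 1.3 can be probed numerically. The sketch below is an illustration rather than the calculation of ref. 14: it draws pair effects from a lognormal distribution (an arbitrary non-gamma choice), simulates the exponential outcomes of Example 1.1 at $ψ^{*}$, and averages the $ψ$-score and the two cross-partial derivatives of the logarithm of Eq. 1.2 for a fixed $(κ, ρ)$. The function names, the seed, and all numerical values are assumptions made for the example. Averages close to zero at $ψ = ψ^{*}$ are consistent with an unbiased estimating equation and with local $m$-orthogonality at $ψ^{*}$; they are not a proof.

```python
import random

def simulate_pairs(n, psi_star, seed=0):
    """Simulate n matched pairs from Example 1.1 with lognormal pair effects
    (an illustrative non-gamma choice, so the gamma model is misspecified)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        gamma_i = rng.lognormvariate(0.0, 1.0)
        y1 = rng.expovariate(gamma_i * psi_star)   # treated: rate gamma * psi
        y0 = rng.expovariate(gamma_i / psi_star)   # control: rate gamma / psi
        pairs.append((y1, y0))
    return pairs

def mean_derivatives(pairs, psi, kappa, rho):
    """Sample averages of the psi-score and of the (psi, kappa) and (psi, rho)
    cross-partial derivatives of the log of Eq. 1.2."""
    score = cross_kappa = cross_rho = 0.0
    for y1, y0 in pairs:
        num = y1 - y0 / psi**2            # d/dpsi of (y1*psi + y0/psi)
        den = y1 * psi + y0 / psi + rho
        score += -(kappa + 2.0) * num / den
        cross_kappa += -num / den
        cross_rho += (kappa + 2.0) * num / den**2
    n = len(pairs)
    return score / n, cross_kappa / n, cross_rho / n

if __name__ == "__main__":
    pairs = simulate_pairs(100_000, psi_star=1.5)
    # Evaluated at the true psi: all three averages should be near zero.
    print(mean_derivatives(pairs, psi=1.5, kappa=3.0, rho=1.0))
    # Evaluated away from the true psi: the average score moves away from zero.
    print(mean_derivatives(pairs, psi=2.0, kappa=3.0, rho=1.0))
```

Changing `kappa` and `rho` in the first call should leave the averages near zero, which is the practical content of being able to check the condition without knowing the true random effects distribution.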

The first part of Proposition 1.1, i.e., the requirement of an unbiased estimating equation $E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ_{m}^{0})] = 0$ , is almost immediate. Since $λ_{m}^{0}$ depends on the unknown $m$ and is typically unavailable, verification of the orthogonality condition $ψ^{*} ⊥_{m} Λ$ uniformly over $m$ , in the event of its validity, appears to be the simplest way of establishing consistency. The simplest special cases in which orthogonality can be checked are those in which there is a parameter cut. Such examples include misspecification of the dispersion component of a generalized linear model, discussed in Section 2.C, and Gaussian linear mixed models with a misspecified distribution over the random effects. Robustness of inference in the latter case has often been observed empirically, e.g., Schielzeth et al. (15) without reference to any underlying structure. A more elaborate example with a parameter cut is the causal model of ref. 8 outlined in Section 2.D. Example 1.1 obeys the weaker form $E_{m} [\nabla_{ψ λ}^{2} ℓ (ψ, λ)] = 0$ from Proposition 1.1, as does the unbalanced two-group problem of Example 2.4 in Section 2.A, in which the pair-specific parameters are replaced by stratum-specific parameters. The relevant structure underpinning these paired and two-group examples is elucidated in Section 1.C, where some intuition is provided.

The unbalanced two-group problem of Example 2.3, on the other hand, violates the $m$ -orthogonality assumption of Proposition 1.1 with respect to one of its two nuisance parameters. Nevertheless, Cox and Wong (16) showed that the maximum likelihood estimator of the interest parameter in Example 2.3 is consistent, suggesting that Proposition 1.1, while easier to verify, is too strong for some situations. Proposition 1.2 presents weaker conditions for consistency.

Suppose that the orthogonality condition $ψ^{*} ⊥_{m} Λ$ from Proposition 1.1 fails, so that there exists at least one $λ \neq λ_{m}^{0}$ such that $E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ)] \neq 0$ . Let

$g_{ψ} (ψ^{*}, λ) := E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ)], \quad g_{λ} (ψ^{*}, λ) := E_{m} [\nabla_{λ} ℓ (ψ^{*}, λ)],$ [1.3]

and partition the inverse of the Fisher information matrix $i : = i (ψ^{*}, λ) : = E_{m} [- \nabla_{(ψ, λ)}^{2} ℓ (ψ^{*}, λ)]$ as

$\begin{pmatrix} i^{ψ ψ} & i^{ψ λ} \\ i^{λ ψ} & i^{λ λ} \end{pmatrix} := {\begin{pmatrix} i_{ψ ψ} & i_{ψ λ} \\ i_{λ ψ} & i_{λ λ} \end{pmatrix}}^{- 1} .$ [1.4]

In Proposition 1.2 and its proof, we refer to $g_{ψ} (ψ^{*}, λ)$ and $g_{λ} (ψ^{*}, λ)$ in the shorthand $g_{ψ}$ and $g_{λ}$ respectively.

Proposition 1.2

If $i^{ψ ψ} g_{ψ} + i^{ψ λ} g_{λ} = 0$ for all $λ \in Λ$ , then $ψ_{m}^{0} = ψ^{*}$ . If $ψ$ and $λ$ are both scalar parameters, the condition reduces to $i_{λ λ} g_{ψ} = i_{ψ λ} g_{λ}$ .
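The scalar reduction can be checked directly from the inverse of a $2 \times 2$ matrix. A short sketch, using only the quantities defined in Eqs. 1.3 and 1.4 and assuming that the information matrix is symmetric and nonsingular:

```latex
\begin{align*}
\begin{pmatrix} i^{\psi\psi} & i^{\psi\lambda} \\ i^{\lambda\psi} & i^{\lambda\lambda} \end{pmatrix}
&= \frac{1}{i_{\psi\psi}\, i_{\lambda\lambda} - i_{\psi\lambda}^{2}}
   \begin{pmatrix} i_{\lambda\lambda} & -i_{\psi\lambda} \\ -i_{\lambda\psi} & i_{\psi\psi} \end{pmatrix}, \\
i^{\psi\psi} g_{\psi} + i^{\psi\lambda} g_{\lambda}
&= \frac{i_{\lambda\lambda}\, g_{\psi} - i_{\psi\lambda}\, g_{\lambda}}
        {i_{\psi\psi}\, i_{\lambda\lambda} - i_{\psi\lambda}^{2}} .
\end{align*}
```

The left-hand side of the second line therefore vanishes exactly when $i_{λ λ} g_{ψ} = i_{ψ λ} g_{λ}$.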

That Proposition 1.2 is more general than Proposition 1.1 is seen on noting that the orthogonality condition of Proposition 1.1 implies $i^{ψ λ} = 0$ and $g_{ψ} = 0$ , so that the condition of Proposition 1.2 also holds. The conditions of Propositions 1.1 and 1.2 are clearly not necessary. In particular, if $λ_{m}^{0}$ can be calculated, the route to establishing consistency or inconsistency of $\hat{ψ}$ is more direct. Any practical relevance, beyond general theoretical understanding, is to situations in which $λ_{m}^{0}$ is not calculable, typically because $m$ is unknown. The conditions in Propositions 1.1 and 1.2, while dependent on $m$ , are nevertheless checkable in some situations, as illustrated in the examples of subsequent sections.

A generalization involves a second nuisance parameter vector, $ν \in N$ say, such that $ψ^{*} ⊥_{m} N$ and $λ ⊥_{m} N$ in Proposition 1.1 or $Λ ⊥_{m} N$ in Proposition 1.2. It can be shown via a straightforward extension of the arguments leading to Propositions 1.1 and 1.2 that the conclusion of the latter is unchanged by this modification. Examples involving a second nuisance parameter are provided in Section 2.

The consistency demonstrated by Battey and Cox (14) for Example 1.1 arises as a special case of Proposition 1.1, while the argument of Cox and Wong (16) for an unbalanced doubly stochastic two-group problem turns out to be an application of Proposition 1.2 and Corollary 1.1 below. We return to this example in Section 2.B.

1.B. Parameter Orthogonality and Orthogonalization in Misspecified Models.

In view of Propositions 1.1 and 1.2, a key question is when the assumed Fisher information matrix, with expectation computed under an erroneous model, coincides with the true Fisher information matrix in some or all regions of the parameter space. Such coincidence implies in particular that parameter orthogonality established under a notional model is true more generally. The practical import is that otherwise $Ψ ⊥_{m} Λ$ or its local analog would typically not be verifiable without knowledge of $m$ . A partial answer is presented as Proposition 1.3, whose proof is immediate by the definition of sufficiency.

Proposition 1.3

Suppose that the cross-partial derivative of the assumed log-likelihood function $\nabla_{ψ λ}^{2} ℓ (ψ, λ)$ depends on the data only through a sufficient statistic $S = (S_{1}, \dots, S_{k})$ , for some $1 \leq k \leq n$ . Write ${\overset{ˇ}{ı}}_{ψ λ} (ψ, λ) = E_{(ψ, λ)} [- \nabla_{ψ λ}^{2} ℓ (ψ, λ)]$ , where $E_{(ψ, λ)}$ means expectation under the assumed model. Provided that $\nabla_{ψ λ}^{2} ℓ (ψ, λ)$ is additive in $S_{1}, \dots, S_{k}$ and $E_{m} (S_{j}) = E_{(ψ, λ)} (S_{j})$ for all $j$ , then $i_{ψ λ} (ψ, λ) = {\overset{ˇ}{ı}}_{ψ λ} (ψ, λ)$ .

Corollary 1.1

Under the conditions of Proposition 1.3, a sufficient condition for parameter $m$ -orthogonality $Ψ ⊥_{m} Λ$ is ${\overset{ˇ}{ı}}_{ψ λ} (ψ, λ) = 0$ for all $(ψ, λ) \in Ψ \times Λ$ , and similarly for local $m$ -orthogonality.

Remark 1.4

For Example 1.1, $m$ -orthogonality can be established more directly via Proposition 2.1.

In general, a reparameterization of a model from parameters $ϕ$ to $θ = θ (ϕ)$ is called interest-respecting if the parameter of interest is common to both: if $ϕ = (ψ, ξ)$ then $θ = (ψ, λ (ψ, ξ))$ . The relevance of interest-respecting reparameterizations in scientific work is that the more important interpretable aspects of the model are retained. In ref. 17, interest-respecting orthogonal reparameterizations were explored for improving inference in correctly specified models. The role in misspecified settings is amplified in view of Proposition 1.1, which concerns the first-order properties of the estimator.

In a slightly more explicit notation, suppose that $ϕ$ is a parameterization for which the $(r, s)$ th component satisfies $i_{rs}^{(ϕ)} (ϕ) = {\overset{ˇ}{ı}}_{rs}^{(ϕ)} (ϕ) = 0$ . Consider an interest-respecting reparameterization from an initial parameter $ϕ$ to a new parameter $θ$ , with coordinates

$ϕ^{r} (θ^{1}, \dots, θ^{p}), \quad θ^{a} (ϕ^{1}, \dots, ϕ^{p}), \quad (a, r = 1, \dots, p),$

where $θ^{1} = ϕ^{1} = ψ$ . An implication of Corollary 1.1 is that the approach of Cox and Reid (17) can still be used in misspecified models, provided that the additional structure of Proposition 1.3 is present in the parameterization that is orthogonal under the assumed model. Specifically, from a starting parameterization $ϕ$ , an orthogonal parameterization $θ$ is any solution in $θ^{2}, \dots, θ^{p}$ to the system of $p - 1$ differential equations

${\overset{ˇ}{ı}}_{ab}^{(ϕ)} \frac{\partial ϕ^{a} (θ)}{\partial θ^{1}} + {\overset{ˇ}{ı}}_{1 b}^{(ϕ)} = 0, \quad b = 2, \dots, p,$

or in matrix notation

$\frac{\partial ϕ (θ)}{\partial θ^{1}} = - {({\overset{ˇ}{ı}}_{\cdot \cdot}^{(ϕ)})}^{- 1} ({\overset{ˇ}{ı}}_{\cdot 1}^{(ϕ)}),$

where ${\overset{ˇ}{ı}}_{\cdot \cdot}^{(ϕ)}$ is the Fisher information matrix under the assumed model without the row and column corresponding to $ϕ^{1}$ , and ${\overset{ˇ}{ı}}_{\cdot 1}^{(ϕ)}$ is the excluded column.

Even under the assumptions of Proposition 1.3, with model misspecification, the asymptotic variance of $\hat{ψ}$ is not $i^{ψ ψ}$ but is given by the sandwich formula of ref. 3. This means that inferential guarantees beyond consistency are not, in general, available using standard likelihood-based approaches; see Section 3.A for a brief discussion of some exceptions.

1.C. Symmetric Parameterizations and Induced Antisymmetry.

For the matched-pair and two-group examples of Sections 2.A and 2.B, a fundamental role is played by symmetric parameterization in transformation models, which induces an antisymmetry on the log-likelihood derivative and thereby the $m$ -orthogonality of Proposition 1.1 under arbitrary misspecification of the model. The class of problems covered in this subsection exemplifies settings in which $m$ -orthogonality can be straightforwardly verified without knowledge of $m$ . A simplified explanation is that in transformation models, there is a convenient duality between the parameter space and the sample space which results in the cancelation of the relevant terms in $E_{m} [\nabla_{ψ λ}^{2} ℓ (ψ, λ)]$ provided that the appropriate parameterization is used.

Let $P$ be a set of continuous probability measures on $Y$. A transformation model under the action of $G$ [e.g., Barndorff-Nielsen and Cox (18, p. 53)] is a subset of $P$ parameterized by $γ \in Γ$, and possibly other parameters suppressed in the notation, such that

$P_{G} := \{p \in P : p (g E ; g γ) = p (E ; γ), g \in G, E \in E, γ \in Γ\},$

where $E$ is the set of measurable events on the sample space $Y$ . The definition implies that $G$ acts on both the sample space and the parameter space. In particular, the action of $G$ on $F$ , say, is a continuous map $G \times F \to F$ , $(g, x) \to g x$ for $x \in F$ , where in the transformation model, $x$ represents either the data point $y \in Y$ or the parameter value $γ \in Γ$ . In general, $Y$ and $Γ$ need not be equal, although this will often be the case. The identity element on $Y$ or on $Γ$ is written $e = g^{- 1} g = g g^{- 1}$ , the context leaving no ambiguity.

The relevant group actions for present purposes depend only on the interest parameter $ψ$ , so that we may write, in a more explicit notation, $g = g_{ψ} \in G$ . Two examples serve to illustrate the properties of the transformation models. In location models, $Y = Γ = R$ , the group action is addition and $g_{ψ} γ = γ + ψ$ , giving $g g^{- 1} = g_{ψ} g_{- ψ} = e = g_{0}$ . In scale models, $Y = Γ = R^{+}$ , the group action is multiplication and $g_{ψ} γ = γ ψ$ , giving $g g^{- 1} = g_{ψ} g_{(1 / ψ)} = e = g_{1}$ .

Let $U$ be a random variable with a distribution depending only on $γ$ . This is best thought of as the random variable at baseline with respect to $ψ$ , i.e., $ψ = 0$ in location models and $ψ = 1$ in scale models. Write the probability density function of $U$ at $u$ as $f_{U} (u ; γ) d u$ . The following definition plays an important role in establishing consistency for a treatment parameter in matched-pair and two-groups problems.

Definition 1.2 (symmetric parameterization):

Let $Y_{1}$ and $Y_{0}$ be independent random variables with probability measures in $P_{G}$ and density functions $f_{1}$ and $f_{0}$ , respectively. Their joint distribution is said to be parameterized $ψ$ -symmetrically with respect to $(ψ, γ)$ if $g = g_{ψ} \in G$ depends only on $ψ$ and if the density functions $f_{1}$ and $f_{0}$ relate to $f_{U}$ by

$f_{U} (u ; γ) d u = f_{1} (g u ; g γ) d (g u) = f_{0} (g^{- 1} u ; g^{- 1} γ) d (g^{- 1} u) .$

In other words, the joint density function, when expressed in terms of $u_{1} = g^{- 1} y_{1}$ and $u_{0} = g y_{0}$ , is symmetric in $u_{1}$ and $u_{0}$ :

$\begin{matrix} f_{1} (y_{1} ; g γ) f_{0} (y_{0} ; g^{- 1} γ) d y_{1} d y_{0} = f_{U} (u_{1} ; γ) f_{U} (u_{0} ; γ) d u_{1} d u_{0} . \end{matrix}$ [1.5]

Definition 1.2 says that the random variables $U_{1} = g^{- 1} Y_{1}$ and $U_{0} = g Y_{0}$ are equal in distribution to $U$ , which has a “standardized form.” The inverted commas express that $f_{U}$ might not be a standardized form in the conventional sense, as it depends on $γ$ and possibly other parameters suppressed in the notation. The symmetric parameterization can equivalently be written

$\begin{matrix} f_{1} (y_{1} ; g γ) d y_{1} = & f_{U} (g^{- 1} y_{1} ; γ) J_{1}^{+} d y_{1} \\ f_{0} (y_{0} ; g^{- 1} γ) d y_{0} = & f_{U} (g y_{0} ; γ) J_{0}^{+} d y_{0}, \end{matrix}$ [1.6]

where

$J_{1}^{+} = | d (g^{- 1} y_{1}) / d y_{1} |, \quad J_{0}^{+} = | d (g y_{0}) / d y_{0} |,$

satisfy $J_{1}^{+} J_{0}^{+} = 1$ . In a location model, Eq. 1.6 becomes

$\begin{matrix} f_{1} (y_{1} ; γ + ψ) d y_{1} = & f_{U} (y_{1} - ψ ; γ) d y_{1} \\ f_{0} (y_{0} ; γ - ψ) d y_{0} = & f_{U} (y_{0} + ψ ; γ) d y_{0}, \end{matrix}$ [1.7]

while in a scale model Eq. 1.6 becomes

$\begin{matrix} f_{1} (y_{1} ; γ ψ) d y_{1} = & f_{U} (y_{1} / ψ ; γ) (1 / ψ) d y_{1} \\ f_{0} (y_{0} ; γ / ψ) d y_{0} = & f_{U} (y_{0} ψ ; γ) ψ d y_{0} . \end{matrix}$ [1.8]

Distributions within the scale family are often parameterized in terms of rate, or inverse scale, which reverses the roles of $g$ and $g^{- 1}$ relative to their actions in the scale parameterization. Example 1.1 is of this form.
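As a concrete check of Eq. 1.8, the sketch below takes $f_{U}$ to be an exponential density with scale $γ$ (an illustrative choice, not one singled out in the text) and confirms numerically that the densities of $Y_{1}$ with scale $γ ψ$ and of $Y_{0}$ with scale $γ / ψ$ agree with the right-hand sides of Eq. 1.8, and that the two Jacobian factors multiply to one. The function names are hypothetical.

```python
import math

def f_scale(y, scale):
    """Exponential density with the given scale (mean); stands in for f_U."""
    return math.exp(-y / scale) / scale

def check_scale_symmetry(y1, y0, psi, gamma):
    """Return (lhs1, rhs1, lhs0, rhs0, jacobian_product) for Eq. 1.8."""
    j1 = 1.0 / psi          # |d(y1 / psi) / dy1|
    j0 = psi                # |d(y0 * psi) / dy0|
    lhs1 = f_scale(y1, gamma * psi)            # f_1(y1; gamma * psi)
    rhs1 = f_scale(y1 / psi, gamma) * j1       # f_U(y1 / psi; gamma) / psi
    lhs0 = f_scale(y0, gamma / psi)            # f_0(y0; gamma / psi)
    rhs0 = f_scale(y0 * psi, gamma) * j0       # f_U(y0 * psi; gamma) * psi
    return lhs1, rhs1, lhs0, rhs0, j1 * j0

if __name__ == "__main__":
    print(check_scale_symmetry(y1=0.7, y0=1.9, psi=1.6, gamma=2.5))
```

In the rate parameterization the same check applies with the roles of multiplication and division by $ψ$ exchanged.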

In a location-scale model, the action of $G$ is $g = g_{ψ_{2}} ○ g_{ψ_{1}}$ , where $g_{ψ_{1}}$ is multiplication by a positive scalar $ψ_{1}$ and $g_{ψ_{2}}$ is addition of a scalar $ψ_{2}$ , its inverse being $g^{- 1} = g_{ψ_{1}}^{- 1} ○ g_{ψ_{2}}^{- 1}$ . Thus, a two-parameter symmetric parameterization in a location-scale model is in terms of $g γ = ψ_{1} γ + ψ_{2}$ and $g^{- 1} γ = (γ - ψ_{2}) / ψ_{1}$ . For the applications we have in mind, the interest parameter $ψ$ represents a treatment effect. Thus, if the treatment is assumed to affect either the location or the scale but not both, a location-scale distribution can effectively be treated as either location or scale. Examples include the extreme-value distributions, the most common parametric families arising in renewal theory and used in survival modeling, and members of the elliptically symmetric family in one or more dimensions, which encompasses the Gaussian, Student-t, Cauchy, and logistic distributions.

Consider, as a function of $ψ$ alone, the log-likelihood contribution of $y_{1}$ and $y_{0}$ , realizations of $Y_{1}$ and $Y_{0}$ . This is of the form

$\begin{matrix} ℓ (ψ ; γ, y_{1}, y_{0}) = & \log L (ψ ; γ, y_{1}, y_{0}) \\ = & \log f_{1} (y_{1} ; g γ) + \log f_{0} (y_{0} ; g^{- 1} γ) . \end{matrix}$

Variables to the right of the semicolon in the log-likelihood function $ℓ$ are treated as fixed (although arbitrary), together with any other parameters suppressed in the notation.

Definition 1.3 (antisymmetry):

The symmetric parameterization of $f_{1}$ and $f_{0}$ is said to induce antisymmetry on the associated log-likelihood derivative with respect to $ψ$ if $\nabla_{ψ} ℓ (ψ ; γ, y_{1}, y_{0})$ , when expressed in terms of $u_{1} = g^{- 1} y_{1}$ and $u_{0} = g y_{0}$ , satisfies $\nabla_{ψ} ℓ (ψ ; γ, u_{1}, u_{0}) = - \nabla_{ψ} ℓ (ψ ; γ, u_{0}, u_{1})$ .

Example 1.2 (Continuation of Example 1.1):

In the exponential matched pair problem, the gamma distribution over the random effects is irrelevant for illustrating the symmetric parameterization and induced antisymmetry, as these definitions are conditional on $γ$ within a single pair. Since multiplication of rates corresponds to division of scale, it is natural, for consistency with Eq. 1.8, to define the group operation $g$ as division by $ψ$ rather than multiplication by $ψ$ . Thus let $u_{1} = g^{- 1} y_{1} = ψ y_{1}$ and $u_{0} = g y_{0} = y_{0} / ψ$ . Conditionally on $γ$ , the joint density function of $(Y_{1}, Y_{0})$ is

$f_{1} (y_{1} ; ψ^{*}, γ) f_{0} (y_{0} ; ψ^{*}, γ) d y_{1} d y_{0} = γ^{2} \exp \{- γ (u_{1} + u_{0})\} d u_{1} d u_{0},$

which is symmetric in $u_{1}$ and $u_{0}$ . The log-likelihood derivative with respect to $ψ$ is

$\nabla_{ψ} ℓ (ψ ; u_{0}, u_{1}) = - γ (y_{1} - y_{0} / ψ^{2}) = - γ (u_{1} / ψ - u_{0} / ψ),$

where the right hand side is $- (- γ (u_{0} / ψ - u_{1} / ψ)) = - \nabla_{ψ} ℓ (ψ ; u_{1}, u_{0})$ . This shows that the log-likelihood derivative is antisymmetric in the sense of Definition 1.3.
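The antisymmetry in Example 1.2 can also be confirmed numerically. In the sketch below, which is an illustration with hypothetical function names and arbitrary numerical values, the $ψ$-derivative of the conditional log-likelihood is approximated by a central finite difference in the original coordinates $(y_{1}, y_{0})$ and compared with the closed form $- γ (u_{1} - u_{0}) / ψ$; swapping $u_{1}$ and $u_{0}$ then flips the sign.

```python
import math

def loglik(psi, gamma, y1, y0):
    """Conditional log-likelihood of one pair: rates gamma*psi and gamma/psi."""
    return (math.log(gamma * psi) - gamma * psi * y1
            + math.log(gamma / psi) - gamma * y0 / psi)

def score_numeric(psi, gamma, y1, y0, h=1e-6):
    """Central finite-difference approximation to d loglik / d psi."""
    return (loglik(psi + h, gamma, y1, y0)
            - loglik(psi - h, gamma, y1, y0)) / (2.0 * h)

def score_in_u(psi, gamma, u1, u0):
    """Closed-form derivative expressed in u1 = psi * y1 and u0 = y0 / psi."""
    return -gamma * (u1 - u0) / psi

if __name__ == "__main__":
    psi, gamma, y1, y0 = 1.7, 0.8, 0.9, 2.3
    u1, u0 = psi * y1, y0 / psi
    print(score_numeric(psi, gamma, y1, y0), score_in_u(psi, gamma, u1, u0))
    print(score_in_u(psi, gamma, u0, u1))   # same magnitude, opposite sign
```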

The main transformation models arising in statistics are the location models and the scale models. For these, we show in Example 1.3 that the symmetric parameterization automatically induces antisymmetry on the associated log-likelihood derivative. We also show this for a rotation family on the circle in Example 1.4. We conjecture that antisymmetry according to Definition 1.3 is a necessary consequence of the $ψ$-symmetric parameterization for any transformation model, but in the absence of a general group-theoretic proof, we present the following necessary and sufficient conditions for antisymmetry, which may be checked on a case-by-case basis for more exotic groups. Examples 1.3 and 1.4 illustrate the application of Proposition 1.4.

Proposition 1.4

Suppose that the joint distribution of $Y_{1}$ and $Y_{0}$ is parameterized $ψ$ -symmetrically with respect to $(ψ, γ)$ in the sense of Definition 1.2. The parameterization induces antisymmetry on the log-likelihood derivative in the sense of Definition 1.3 if and only if

$a (u_{1}, u_{0}) : = \frac{\partial u_{1}}{\partial ψ} + \frac{\partial u_{0}}{\partial ψ} = - a (u_{0}, u_{1}),$

and

$c (u_{1}, u_{0}) : = (\frac{\partial}{\partial ψ} | \frac{\partial u_{1}}{\partial y_{1}} |) | \frac{\partial u_{0}}{\partial y_{0}} | + (\frac{\partial}{\partial ψ} | \frac{\partial u_{0}}{\partial y_{0}} |) | \frac{\partial u_{1}}{\partial y_{1}} | = - c (u_{0}, u_{1}),$

where $u_{1} = g^{- 1} y_{1}$ and $u_{0} = g y_{0}$ .

A clearer but more cumbersome expression of the conditions $a (u_{1}, u_{0}) = - a (u_{0}, u_{1})$ and $c (u_{1}, u_{0}) = - c (u_{0}, u_{1})$ makes the dependence of the partial derivatives on $u_{1}$ and $u_{0}$ explicit. In this more explicit notation, $a (u_{1}, u_{0}) = - a (u_{0}, u_{1})$ amounts to $(\partial u_{1} / \partial ψ) (u_{1}) = - (\partial u_{0} / \partial ψ) (u_{1})$ and vice versa.

Example 1.3

In a symmetric parameterization of a location model $\partial u_{1} / \partial ψ = - 1 = - \partial u_{0} / \partial ψ$ and

$(\frac{\partial}{\partial ψ} | \frac{\partial u_{1}}{\partial y_{1}} |) | \frac{\partial u_{0}}{\partial y_{0}} | = 0 = - (\frac{\partial}{\partial ψ} | \frac{\partial u_{0}}{\partial y_{0}} |) | \frac{\partial u_{1}}{\partial y_{1}} | .$

Thus, $a (u_{1}, u_{0}) = 0 = - a (u_{0}, u_{1})$ and $c (u_{1}, u_{0}) = 0 = - c (u_{0}, u_{1})$ . In a symmetric parameterization of a scale model $\partial u_{1} / \partial ψ = - u_{1} / ψ$ , $\partial u_{0} / \partial ψ = u_{0} / ψ$ , and

$(\frac{\partial}{\partial ψ} | \frac{\partial u_{1}}{\partial y_{1}} |) | \frac{\partial u_{0}}{\partial y_{0}} | = - \frac{1}{ψ} = - (\frac{\partial}{\partial ψ} | \frac{\partial u_{0}}{\partial y_{0}} |) | \frac{\partial u_{1}}{\partial y_{1}} |,$

so that $a (u_{1}, u_{0}) = - a (u_{0}, u_{1})$ and $c (u_{1}, u_{0}) = 0 = - c (u_{0}, u_{1})$ . Thus both cases satisfy the condition of Proposition 1.4.
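For the scale model, the quantities entering Proposition 1.4 can be written out explicitly. The following is a sketch under the assumption that the group action is multiplication by $ψ > 0$, so that $u_{1} = y_{1} / ψ$ and $u_{0} = ψ y_{0}$; this convention is inferred from the derivatives stated in Example 1.3.

```latex
\begin{aligned}
\frac{\partial u_{1}}{\partial ψ} &= - \frac{y_{1}}{ψ^{2}} = - \frac{u_{1}}{ψ}, &
\frac{\partial u_{0}}{\partial ψ} &= y_{0} = \frac{u_{0}}{ψ}, \\
a (u_{1}, u_{0}) &= \frac{u_{0} - u_{1}}{ψ} = - a (u_{0}, u_{1}), \\
\left| \frac{\partial u_{1}}{\partial y_{1}} \right| &= \frac{1}{ψ}, &
\left| \frac{\partial u_{0}}{\partial y_{0}} \right| &= ψ, \\
c (u_{1}, u_{0}) &= \left( - \frac{1}{ψ^{2}} \right) ψ + (1) \frac{1}{ψ} = 0 .
\end{aligned}
```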

Example 1.4

For a rotation model on $R^{p}$ , $Y \subset R^{p}$ , $Γ = [0, 2 π)$ and the group action is matrix multiplication by a rotation matrix $g \in G \subseteq SO (p)$ , where $SO (p)$ is the special orthogonal group of dimension $p$ . The identity element is matrix multiplication by the $p$ -dimensional identity matrix and $f_{U} (u ; γ)$ is a distribution on the hypersphere, where $γ$ is a rotation parameter. The von Mises–Fisher distribution is one such example. Consider the case of $p = 2$ .

Let $v \in R^{2}$ be a point on a circle of fixed arbitrary radius, either a location parameter of a probabilistic model or a data point on the circle. In a symmetric parameterization of a rotation model, $v_{γ} = g_{γ} v$ defines a (counterclockwise) rotation of angle $v_{γ}^{T} v = γ$ and $v_{1} = g_{ψ} v_{γ}$ , $v_{0} = g_{ψ}^{- 1} v_{γ}$ define (counterclockwise and clockwise) rotations of angles $v_{1}^{T} v_{γ} = ψ$ and $v_{0}^{T} v_{γ} = - ψ$ , respectively, where

$\begin{matrix} g_{ψ}^{- 1} & = {(\begin{matrix} cos ψ & - sin ψ \\ sin ψ & cos ψ \end{matrix})}^{- 1} = (\begin{matrix} cos ψ & sin ψ \\ - sin ψ & cos ψ \end{matrix}) \\ & = (\begin{matrix} cos (- ψ + 2 π) & - sin (- ψ + 2 π) \\ sin (- ψ + 2 π) & cos (- ψ + 2 π) \end{matrix}) \in G, \end{matrix}$

and $g_{ψ} g_{γ} = g_{γ + ψ} \in G$ , $g_{ψ}^{- 1} g_{γ} = g_{γ - ψ} \in G$ . Thus, in Eq. 1.6 $g γ$ should be understood either as addition of angles $g γ = γ + ψ$ or multiplication of rotation matrices $g γ = g_{ψ} g_{γ}$ .

Let $y_{1} = {(y_{11}, y_{12})}^{T}$ , $y_{0} = {(y_{01}, y_{02})}^{T}$ . Then

$\begin{matrix} u_{1} = g^{- 1} y_{1} & = (\begin{matrix} y_{11} cos ψ + y_{12} sin ψ \\ - y_{11} sin ψ + y_{12} cos ψ \end{matrix}), \\ u_{0} = g y_{0} & = (\begin{matrix} y_{01} cos ψ - y_{02} sin ψ \\ y_{01} sin ψ + y_{02} cos ψ \end{matrix}) . \end{matrix}$

The Jacobian determinants $J_{1}^{+}$ and $J_{0}^{+}$ are both equal to $| {cos}^{2} ψ + {sin}^{2} ψ | = 1$ . Thus, $c (u_{1}, u_{0}) = 0 = - c (u_{0}, u_{1})$ in Proposition 1.4. Since

$\begin{matrix} \frac{\partial u_{1}}{\partial ψ} & = (\begin{matrix} - y_{11} sin ψ + y_{12} cos ψ \\ - y_{11} cos ψ - y_{12} sin ψ \end{matrix}) = g_{θ}^{- 1} u_{1}, \\ \frac{\partial u_{0}}{\partial ψ} & = (\begin{matrix} - y_{01} sin ψ - y_{02} cos ψ \\ y_{01} cos ψ - y_{02} sin ψ \end{matrix}) = g_{θ} u_{0}, \end{matrix}$

with $θ = π / 2$ , the quantity $a (u_{1}, u_{0})$ from Proposition 1.4 is

$\begin{matrix} a (u_{1}, u_{0}) = & (\begin{matrix} cos (- π / 2) & - sin (- π / 2) \\ sin (- π / 2) & cos (- π / 2) \end{matrix}) u_{1} \\ + (\begin{matrix} cos (π / 2) & - sin (π / 2) \\ sin (π / 2) & cos (π / 2) \end{matrix}) u_{0} = - a (u_{0}, u_{1}), \end{matrix}$

verifying the conditions for the $ψ$ -symmetric parameterization of rotation families on the circle.
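The calculations in Example 1.4 can be verified numerically for $p = 2$. The sketch below is illustrative only: function names and numerical values are hypothetical, derivatives with respect to $ψ$ are approximated by central finite differences, and `a_explicit(s, t)` encodes the explicit form $g_{θ}^{- 1} s + g_{θ} t$ with $θ = π / 2$.

```python
import math

def rot(angle):
    """2x2 counterclockwise rotation matrix g_angle, as nested tuples."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s), (s, c))

def apply(m, v):
    """Product of a 2x2 matrix with a 2-vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1])

def u1_of(psi, y1):
    return apply(rot(-psi), y1)   # u1 = g_psi^{-1} y1

def u0_of(psi, y0):
    return apply(rot(psi), y0)    # u0 = g_psi y0

def ddpsi(f, psi, y, h=1e-6):
    """Central finite-difference derivative of a vector-valued map in psi."""
    upper, lower = f(psi + h, y), f(psi - h, y)
    return tuple((p - q) / (2.0 * h) for p, q in zip(upper, lower))

def a_explicit(s, t):
    """a(s, t) = g_{pi/2}^{-1} s + g_{pi/2} t."""
    left, right = apply(rot(-math.pi / 2), s), apply(rot(math.pi / 2), t)
    return (left[0] + right[0], left[1] + right[1])

# Two arbitrary points on the unit circle (assumed values).
psi, y1, y0 = 0.4, (0.8, -0.6), (0.28, 0.96)
u1, u0 = u1_of(psi, y1), u0_of(psi, y0)
du1, du0 = ddpsi(u1_of, psi, y1), ddpsi(u0_of, psi, y0)
```

The finite-difference derivatives `du1` and `du0` match $g_{θ}^{- 1} u_{1}$ and $g_{θ} u_{0}$, and `a_explicit(u1, u0)` is the negative of `a_explicit(u0, u1)` up to rounding.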

2. Examples

2.A. Matched Pairs.

Let $Y_{i 1}$ and $Y_{i 0}$ be random variables corresponding to observations on treated and untreated individuals in a matched pair design, where $i = 1, \dots, n$ is the pair index. The treatment effect is represented by a parameter $ψ$ while the pair effects are encapsulated in the pair-specific nuisance parameter $γ_{i}$ , avoiding explicit modeling assumptions in terms of covariates. As noted in Example 1.1, two approaches to analysis treat the pair effects as fixed arbitrary constants, or as independent and identically distributed random variables. In the latter case, it is common to model the distribution parametrically, typically producing efficiency gains in the treatment effect estimator if the parametric assumptions hold to an adequate order of approximation.

The relevant calculations are the same for every pair, and we therefore omit the pair index $i$ from the notation. Let the true joint density function of $(Y_{1}, Y_{0})$ be given by

$\begin{matrix} m (y_{1}, y_{0}) = \int f_{1} (y_{1} ; ψ^{*}, γ) f_{0} (y_{0} ; ψ^{*}, γ) f (γ) d γ, \end{matrix}$ [2.1]

with $f (γ)$ an unknown density function for the nuisance parameters, which are treated as independent and identically distributed across pairs.

Conditionally on $γ$ , let $ℓ (ψ ; γ, y_{1}, y_{0})$ denote the log-likelihood contribution for $ψ$ from a single pair.

Proposition 2.1

Suppose that the assumed model over $γ$ is parameterized by $λ$ , producing a log-likelihood contribution $ℓ (ψ, λ ; y_{1}, y_{0})$ , assumed strictly concave. Suppose further that, conditionally on $γ$ , $Y_{1}$ and $Y_{0}$ have a joint distribution that is parameterized $ψ$ -symmetrically (Definition 1.2). Then, provided that the group induces antisymmetry on the log-likelihood derivative (Definition 1.3), it follows that $ψ^{*} ⊥_{m} Λ$ and

$\begin{matrix} 0 & = E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ)] \\ = \int_{Y} \int_{Y} (\nabla_{ψ} ℓ (ψ^{*}, λ ; y_{1}, y_{0})) m (y_{1}, y_{0}) d y_{1} d y_{0}, \end{matrix}$ [2.2]

for all $λ \in Λ$ . Thus, $ψ_{m}^{0} = ψ^{*}$ by Proposition 1.1.

Example 2.1 (Continuation of Examples 1.1 and 1.2):

Example 1.2 establishes antisymmetry of the log-likelihood derivative for the symmetric parameterization of the exponential matched pairs problem. Consistency of $\hat{ψ}$ thus follows by Proposition 2.1 regardless of the true distribution and assumed model for the random effects $γ$ . This considerably extends the result of ref. 14 and provides deeper insight into the structure of inference under model misspecification.
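The unbiasedness of the score in Eq. 2.2 can be illustrated by simulation for the exponential matched pairs problem. The sketch below rests on assumptions that are not in the original text: the assumed random effects model is taken to be gamma with shape `omega` and rate `xi` (hypothetical names), for which integrating out $γ$ gives a marginal log-likelihood equal to a constant minus $(ω + 2) log (ξ + ψ y_{1} + y_{0} / ψ)$; the true pair effects are drawn from a lognormal distribution, so the gamma assumption is wrong; and the values of $ψ^{*}$, `omega`, and `xi` are arbitrary.

```python
import random

def marginal_score(psi, omega, xi, y1, y0):
    """psi-derivative of the marginal log-likelihood under an assumed
    Gamma(shape=omega, rate=xi) law for the pair effect gamma."""
    return -(omega + 2.0) * (y1 - y0 / psi**2) / (xi + psi * y1 + y0 / psi)

def mean_score(psi, psi_star, omega, xi, n, seed):
    """Monte Carlo mean of the score when the true pair effects are lognormal."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        g = rng.lognormvariate(0.0, 1.0)      # true mixing law, deliberately not gamma
        y1 = rng.expovariate(g * psi_star)    # treated: rate gamma * psi*
        y0 = rng.expovariate(g / psi_star)    # untreated: rate gamma / psi*
        total += marginal_score(psi, omega, xi, y1, y0)
    return total / n

# Mean score at the true value psi* = 1.5 and at a wrong value psi = 2.0.
at_truth = mean_score(1.5, 1.5, omega=2.0, xi=1.0, n=100_000, seed=1)
off_truth = mean_score(2.0, 1.5, omega=2.0, xi=1.0, n=100_000, seed=1)
```

Under these assumptions the simulated mean score should be close to zero at $ψ^{*}$ for any fixed `omega` and `xi`, and clearly negative at a value of $ψ$ above $ψ^{*}$, in line with Proposition 2.1.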

Remark 2.1

It is not difficult to check, following calculations similar to those in Appendix A.4 of ref. 14, that if we parameterize this model nonsymmetrically, for example with rates $γ_{i} θ$ and $γ_{i}$ , the resulting estimator is not consistent if the gamma random effects assumption is erroneous.

Proposition 1.4 gives easily verifiable conditions for antisymmetry of the log-likelihood derivative, which holds for all location and scale families in Example 1.3. The following explicit calculation for the normal distribution mirrors Example 1.2.

Example 2.2 (normal matched pairs with arbitrary mixing):

Let $Y_{1}$ and $Y_{0}$ be normally distributed with unit variance and means $γ + ψ^{*}$ and $γ - ψ^{*}$ , respectively, $γ$ being treated as random. The conclusion is unchanged if an unknown variance parameter is present. Conditionally on $γ$ , with $u_{1} = y_{1} - ψ^{*}$ , $u_{0} = y_{0} + ψ^{*}$ , the joint density function of $(Y_{1}, Y_{0})$ is

$\begin{matrix} f_{1} (y_{1} ; ψ^{*}, γ) f_{0} (y_{0} ; ψ^{*}, γ) d y_{1} d y_{0} \\ = exp [- {{(u_{1} - γ)}^{2} + {(u_{0} - γ)}^{2}} / 2] d u_{1} d u_{0}, \end{matrix}$

which is symmetric in $u_{1}$ and $u_{0}$ by construction as a result of using a $ψ$ -symmetric parameterization with respect to addition, the relevant group action for location models. Conditionally on $γ$ , the likelihood as a function of $ψ$ is

$\begin{matrix} L (ψ ; y_{0}, y_{1}, γ) & \propto exp {- \frac{1}{2} {(y_{1} - γ - ψ)}^{2} - \frac{1}{2} {(y_{0} - γ + ψ)}^{2}} \\ \propto exp [- \frac{1}{2} {{(y_{1} - ψ)}^{2} + {(y_{0} + ψ)}^{2}}], \end{matrix}$ [2.3]

where the proportionality symbol on the second line absorbs the multiplicative term $exp {- γ^{2} + γ (y_{1} + y_{0})}$ , which does not depend on $ψ$ . On taking logarithms and differentiating,

$\nabla_{ψ} ℓ (ψ ; u_{0}, u_{1}) = y_{1} - y_{0} - 2 ψ = u_{1} - u_{0} = - \nabla_{ψ} ℓ (ψ ; u_{1}, u_{0}),$

verifying the antisymmetry condition of Definition 1.3.

For any assumed random effects distribution for $γ$ with density function parameterized by $λ$ , an application of Proposition 2.1 guarantees that the resulting maximum likelihood estimator $\hat{ψ}$ is consistent for $ψ^{*}$ under arbitrary misspecification of the random effects distribution.

Remark 2.2

A separate point in connection with Example 2.2 is that, due to the separation in the log-likelihood function specific to this case (a parameter cut in the terminology of ref. 19), $\hat{ψ} = (Y_{1} - Y_{0}) / 2$ regardless of the assumed random effects distribution, and for an analysis from $n$ pairs $\hat{ψ} = (S_{1} - S_{0}) / (2 n)$ , where $S_{j} = \sum_{i = 1}^{n} Y_{ij}$ and $i$ is the pair index. For any finite-variance random effects distribution $f$ , $var (Y_{1}) = var (Y_{0}) = 1 + {var}_{f} (γ)$ and $cov (Y_{1}, Y_{0}) = {var}_{f} (γ)$ , so that $var (Y_{1} - Y_{0}) = 2$ . Thus $var (\hat{ψ}) = 1 / (2 n)$ irrespective of $f$ and of the assumed (erroneous) model over the random effects.
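The cancellation of the pair effects in $\hat{ψ}$ can be seen directly in a small simulation. This is a sketch with hypothetical names and arbitrary settings: the same unit-variance normal errors are combined with two very different random effects distributions, and the estimator $(S_{1} - S_{0}) / (2 n)$ is computed for each.

```python
import random

def psi_hat(y1, y0):
    """Estimator (S1 - S0) / (2n) for n matched pairs."""
    n = len(y1)
    return (sum(y1) - sum(y0)) / (2.0 * n)

def simulate(psi_star, gammas, noise1, noise0):
    """Build pairs Y1 = gamma + psi* + e1 and Y0 = gamma - psi* + e0."""
    y1 = [g + psi_star + e for g, e in zip(gammas, noise1)]
    y0 = [g - psi_star + e for g, e in zip(gammas, noise0)]
    return y1, y0

rng = random.Random(0)
n, psi_star = 2000, 0.75                                         # assumed values
noise1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
noise0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
gamma_exp = [rng.expovariate(0.5) for _ in range(n)]             # skewed pair effects
gamma_two_point = [rng.choice((-3.0, 3.0)) for _ in range(n)]    # bimodal pair effects

est_exp = psi_hat(*simulate(psi_star, gamma_exp, noise1, noise0))
est_two_point = psi_hat(*simulate(psi_star, gamma_two_point, noise1, noise0))
```

Because each $γ$ enters $Y_{1}$ and $Y_{0}$ additively, it cancels in the difference, so the two estimates coincide up to floating-point rounding whatever distribution generated the pair effects.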

Remark 2.3

In the context of stratum-specific random effects, Lindsay (20, Thm 4.2) established the limiting distribution of his proposed estimator of the interest parameter, valid whether or not the random effects distribution is misspecified, under the assumption that his estimator is consistent. The latter aspect, central to the present paper, was not discussed in that work.

2.B. Two-Group Problems.

Section 2.A dealt with an experimental setting in which the matching ensures balance. A more realistic situation in an observational context is that observations on treated and untreated individuals are stratified into groups that are as similar as possible. This in general produces unbalanced strata, having a potentially different number of individuals in the treated and untreated groups for each stratum.

Inference is based on the sufficient statistics $S_{j 1}$ and $S_{j 0}$ within treatment groups and strata. Let ${(Y_{i j 1})}_{i = 1}^{r_{j 1}}$ and ${(Y_{i j 0})}_{i = 1}^{r_{j 0}}$ be observations within the $j$ th stratum for treated and untreated individuals, respectively. Some examples fix ideas. If the observations ${(Y_{i j 1})}_{i = 1}^{r_{j 1}}$ and ${(Y_{i j 0})}_{i = 1}^{r_{j 0}}$ are normally distributed with means $γ_{j} + ψ^{*}$ and $γ_{j} - ψ^{*}$ and variance $τ$ , the likelihood contribution to the $j$ th stratum depends on the data only through $\sum_{i = 1}^{r_{j 1}} Y_{i j 1} / r_{j 1}$ and $\sum_{i = 1}^{r_{j 0}} Y_{i j 0} / r_{j 0}$ , whose distributions are normal with means unchanged, and variances $τ / r_{j 1}$ and $τ / r_{j 0}$ , respectively. If, conditionally on $γ_{j}$ , the individual observations are Poisson distributed counts of rates $γ_{j} ψ^{*}$ and $γ_{j} / ψ^{*}$ , the sufficient statistics are sums of these counts, Poisson distributed of rates $r_{j 1} γ_{j} ψ^{*}$ and $r_{j 0} γ_{j} / ψ^{*}$ . As a third example, if the originating variables are exponentially distributed of rates $γ_{j} ψ^{*}$ and $γ_{j} / ψ^{*}$ , the sufficient statistics are gamma-distributed sums of shape and rate parameters $(r_{j 1}, γ_{j} ψ^{*})$ and $(r_{j 0}, γ_{j} / ψ^{*})$ , respectively. The matched comparison setting of Section 2.A is a special case of this formulation with $r_{j 1} = r_{j 0} = 1$ for all $j$ . The more general balanced case with $r_{j 1} = r_{j 0} = r$ is also implicitly covered by Propositions 1.4 and 2.1.

The situation when $r_{j 1} \neq r_{j 0}$ is more complicated as it implies that even if the distributions of $S_{j 1}$ and $S_{j 0}$ belong to a group family, the transformations required to express their distributions in the standardized form $f_{U}$ depend on the different values of $r_{j 1}$ and $r_{j 0}$ , so that Definition 1.2 is violated. The following extension of Proposition 2.1, which assumes the same random effects formulation for the nuisance parameters $γ_{j}$ , omits the subscripts for the strata as before.

Proposition 2.2

Suppose that, conditionally on $γ$ , $S_{1}$ and $S_{0}$ are independent random variables with probability measures in $P_{G}$ . Let $h_{1} \in G$ and $h_{0} \in G$ , not depending on $ψ$ or $γ$ , be such that $T_{1} = h_{1} S_{1}$ and $T_{0} = h_{0} S_{0}$ have a joint distribution that is parameterized $ψ$ -symmetrically in the sense of Definition 1.2 and such that the log-likelihood derivative, when expressed in terms of $u_{1} = g^{- 1} t_{1}$ and $u_{0} = g t_{0}$ , is antisymmetric in the sense of Definition 1.3. Then, provided that $ℓ (ψ, λ)$ is strictly concave, $ψ^{*} ⊥_{m} Λ$ and

$\begin{matrix} 0 & = E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ)] \\ = \int \int (\nabla_{ψ} ℓ (ψ^{*}, λ ; s_{1}, s_{0})) m (s_{1}, s_{0}) d s_{1} d s_{0}, \end{matrix}$ [2.4]

for all $λ \in Λ$ . Thus, $ψ_{m}^{0} = ψ^{*}$ by Proposition 1.1.

For discrete problems such as the two-group Poisson situation mentioned above, there is no natural group and Propositions 2.1 and 2.2 are therefore inapplicable. Direct verification of Proposition 1.2 is sometimes more fruitful, as illustrated in the following examples.

Example 2.3 (stratified two-group Poisson problem with unbalanced strata):

Cox and Wong (16) considered misspecification of a random effects distribution in a stratified two-group problem. The distributions of counts $S_{j 1}$ and $S_{j 0}$ in the treated and untreated groups for stratum $j$ are, conditionally on $γ_{j}$ , Poisson with means $r_{j 1} γ_{j} exp (θ^{*})$ and $r_{j 0} γ_{j} exp (- θ^{*})$ , respectively, where $r_{j 1}$ and $r_{j 0}$ reflect the number of patients at risk in each group and $γ_{j}$ are stratum-specific nuisance parameters.

The problem can equivalently be parameterized in terms of $ψ = e^{θ}$ but the above formulation was used by Cox and Wong (16). The result of Proposition 2.3 was noted there; we complete their proof in SI Appendix.

Proposition 2.3 [Cox and Wong (16)].

Suppose that in Example 2.3, the nuisance parameters ${(γ_{j})}_{j = 1}^{n}$ are modeled as independent and identically distributed random variables with gamma density $ξ {(ξ γ)}^{ω - 1} exp (- ξ γ) / Γ (ω)$ , parameterized in terms of the notionally orthogonal parameters $ν = ξ / ω$ and $ω > 0$ . Provided that there is a member of the assumed model that gives the same expectation $E_{ν, ω} (γ_{j}) = 1 / ν$ as under the true random effects distribution for ${(γ_{j})}_{j = 1}^{n}$ , $\hat{θ}$ is consistent for $θ^{*}$ .

Example 2.4 (time to the first event in a two-group problem with unbalanced strata):

We describe here the generating mechanism for an exponential analog of Example 2.3. Suppose the available data in Example 2.3 consist also of the failure times. If these are, for each individual, independent and exponentially distributed with rates $γ_{j} exp (θ^{*})$ and $γ_{j} exp (- θ^{*})$ in the $j$ th stratum, with $r_{j 1}$ and $r_{j 0}$ individuals at risk in the treated and untreated groups, then the times $S_{j 1}$ and $S_{j 0}$ to the first failure in each group are the group minima, exponentially distributed with rates $r_{j 1} γ_{j} exp (θ^{*})$ and $r_{j 0} γ_{j} exp (- θ^{*})$ , respectively. Using only these minima sacrifices information, which would not normally be advised except as a check on modeling assumptions. After the reparameterization $ψ = e^{θ}$ , the problem satisfies the conditions of Proposition 2.2 with $T_{j 1} = r_{j 1} S_{j 1}$ and $T_{j 0} = r_{j 0} S_{j 0}$ . Thus, $\hat{ψ}$ is consistent for $ψ^{*} = exp (θ^{*})$ .
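The role of the rescaling $T = r S$ in Example 2.4 can be made concrete through survival functions. The sketch below uses hypothetical function names and arbitrary values of $γ_{j}$, $ψ^{*}$, $r_{j 1}$, and $r_{j 0}$, and relies only on the standard fact that the minimum of $r$ independent exponential variables with a common rate has survival function equal to the $r$ th power of the individual one.

```python
import math

def survival_min_exponential(t, r, rate):
    """P(min of r independent Exp(rate) variables > t) = exp(-rate * t) ** r."""
    return math.exp(-rate * t) ** r

def survival_rescaled_min(t, r, rate):
    """P(r * S > t), where S is the minimum above, i.e. survival function of T = r * S."""
    return survival_min_exponential(t / r, r, rate)

# Arbitrary illustrative values (assumptions).
gamma_j, psi_star, r1, r0, t = 0.8, 1.5, 7, 3, 0.9
treated = survival_rescaled_min(t, r1, gamma_j * psi_star)     # T1 = r1 * S1
untreated = survival_rescaled_min(t, r0, gamma_j / psi_star)   # T0 = r0 * S0
```

After rescaling, the survival functions no longer depend on `r1` or `r0`: $T_{j 1}$ and $T_{j 0}$ are exponential with rates $γ_{j} ψ^{*}$ and $γ_{j} / ψ^{*}$, which is the $ψ$ -symmetric form of Example 1.2.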

2.C. Generalized Linear Models.

A simple example in which Proposition 1.1 is directly applicable is misspecified dispersion in exponential-family regression problems. Suppose that the true density function belongs to the class of canonical exponential family regression models with unknown regression coefficient $ψ$ on a set of known covariates $x_{1}, \dots, x_{n}$ (each column vectors of the same dimension as $ψ$ ), and dispersion parameters $ϕ_{1}, \dots, ϕ_{n}$ . The form of the exponential-family log-likelihood function with respect to $ϕ_{i}$ is such that $ψ^{*} ⊥_{m} ϕ_{i}$ for all $i$ . In particular this implies that if $ϕ_{i}$ is modeled erroneously as $ϕ_{i} (λ)$ , perhaps depending on covariates other than or including $x_{i}$ , $ψ^{*} ⊥_{m} Λ$ . Write the log-likelihood function of the assumed model as

$\begin{matrix} ℓ (ψ, λ) = \sum_{i = 1}^{n} ϕ_{i}^{- 1} (λ) {y_{i} x_{i}^{T} ψ - K (x_{i}^{T} ψ)} + \sum_{i = 1}^{n} h {y_{i} ; ϕ_{i} (λ)} . \end{matrix}$ [2.5]

We do not distinguish notationally between realizations of the random variables $Y_{1}, \dots, Y_{n}$ and arbitrary evaluation points for their joint density function conditional on covariates. With outcomes treated as random, the estimating equation for $ψ$ is unbiased in the sense that $E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ)] = 0$ for all $λ \in Λ$ . Thus, $\hat{ψ}$ is consistent in the presence of arbitrary misspecification of the dispersion parameters.
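The unbiasedness claim can be spelled out in two lines. The following is a sketch, assuming that each $ϕ_{i} (λ)$ is positive and free of $y$, and that under the true model $E_{m} (Y_{i}) = K^{'} (x_{i}^{T} ψ^{*})$ conditionally on the covariates.

```latex
\begin{aligned}
\nabla_{ψ} ℓ (ψ, λ) &= \sum_{i = 1}^{n} ϕ_{i}^{- 1} (λ) \{ y_{i} - K^{'} (x_{i}^{T} ψ) \} x_{i}, \\
E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ)] &= \sum_{i = 1}^{n} ϕ_{i}^{- 1} (λ) \{ E_{m} (Y_{i}) - K^{'} (x_{i}^{T} ψ^{*}) \} x_{i} = 0 .
\end{aligned}
```

Each summand vanishes separately, so the weights $ϕ_{i}^{- 1} (λ)$ affect efficiency but not the unbiasedness of the estimating equation.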

In a continuation of this generalized linear models example, we assume this time a simplified specification for dispersion but allow for relevant covariates to be omitted. Let $Z = (X, W)$ denote the included and omitted covariates. The conditional probability density or mass function of $Y = (Y_{1}, \dots, Y_{n})$ at $y = (y_{1}, \dots, y_{n})$ is, with $θ^{*} = {(ψ^{* T}, λ^{* T})}^{T}$ ,

$\begin{matrix} m (y ; z_{1}^{T} θ^{*}, \dots, z_{n}^{T} θ^{*}) \\ = exp [ϕ^{- 1} {θ^{* T} \sum_{i = 1}^{n} z_{i} y_{i} - \sum_{i = 1}^{n} K (z_{i}^{T} θ^{*})}] \prod_{i = 1}^{n} h (y_{i}, ϕ) . \end{matrix}$

Let $\overset{ˇ}{m} (y ; x_{1}^{T} ψ, \dots, x_{n}^{T} ψ)$ be the assumed model, in which the outcome is modeled only in terms of $X$ . The orthogonality condition $ψ^{*} ⊥_{m} Λ$ is

$\begin{matrix} \sum_{i = 1}^{n} K^{″} (x_{i}^{T} ψ^{*} + w_{i}^{T} λ) x_{i} w_{i}^{T} = 0, \end{matrix}$ [2.6]

for all $λ$ , a highly restrictive condition. By Proposition 1.1, the additional conditions under which $ψ_{m}^{0} = ψ^{*}$ are those that set $E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ)] = 0$ for all $λ$ , where $ℓ$ is the log-likelihood function under the assumed model. Since $λ$ is not present under the assumed model,

$\begin{matrix} E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ)] & = \sum_{i = 1}^{n} {K^{'} (z_{i}^{T} θ^{*}) - K^{'} (x_{i}^{T} ψ^{*})} x_{i} \\ & = \sum_{i = 1}^{n} K^{″} (x_{i}^{T} ψ^{*} + {\bar{ξ}}_{i}) x_{i} w_{i}^{T} λ^{*}, \end{matrix}$

where ${\bar{ξ}}_{i}$ lies between $0$ and $w_{i}^{T} λ^{*}$ by the mean value theorem, and the first equality holds because $K^{'} (z_{i}^{T} θ^{*}) = E_{m} (Y_{i} ∣ Z_{i} = z_{i})$ . Since $K^{″} (x_{i}^{T} ψ^{*} + {\bar{ξ}}_{i}) \neq K^{″} (x_{i}^{T} ψ^{*} + w_{i}^{T} λ)$ in general, the second condition $E_{m} [\nabla_{ψ} ℓ (ψ^{*}, λ)] = 0$ typically does not hold exactly even when the columns of $X$ and $W$ are orthogonal. An exception is the normal-theory linear model, for which $K (ζ) = ζ^{2} / 2$ so that $K^{″} (ζ) = 1$ . For an insightful discussion of noncollapsibility in logistic and other regression models, see ref. 21, and for some relevant approximations, see ref. 22.
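A small numerical illustration of this contrast is sketched below. The design is hypothetical (one included and one omitted scalar covariate, chosen to be exactly orthogonal), the parameter values are arbitrary, and the function names are introduced here for illustration; the quantity computed is $\sum_{i} {K^{'} (x_{i} ψ^{*} + w_{i} λ^{*}) - K^{'} (x_{i} ψ^{*})} x_{i}$ for the normal-theory and logistic choices of $K$.

```python
import math

def expected_score(k_prime, x, w, psi_star, lambda_star):
    """Expected score of the model omitting w, evaluated at psi*, for scalar covariates."""
    return sum(
        (k_prime(x_i * psi_star + w_i * lambda_star) - k_prime(x_i * psi_star)) * x_i
        for x_i, w_i in zip(x, w)
    )

def identity(z):
    """K'(z) = z for the normal-theory linear model, K(z) = z**2 / 2."""
    return z

def expit(z):
    """K'(z) for logistic regression, K(z) = log(1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 1.0, -1.0, -1.0]   # included covariate (assumed design)
w = [1.0, -1.0, 1.0, -1.0]   # omitted covariate, orthogonal to x by construction
normal_bias = expected_score(identity, x, w, psi_star=1.0, lambda_star=1.0)
logistic_bias = expected_score(expit, x, w, psi_star=1.0, lambda_star=1.0)
```

With this design the expected score vanishes for the linear model but not for the logistic model, even though `x` and `w` are orthogonal.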

Although both conclusions of this subsection are well established in the literature, their purpose is to illustrate application of Proposition 1.1 to some well-known settings.

2.D. Marginal Structural Models.

Evans and Didelez (8) presented a further example in which consistency arises as a special case of Proposition 1.1. Their paper introduced the idea of a frugal parameterization of a marginal structural model, constructed to complete the specification of a model respecting the interventional components of interest, but without loss or redundancy. In a causal system of the form, e.g., $Z \to X$ , $Z \to Y$ , $X \to Y$ , the authors suggest a reparameterization of the joint distribution $p_{ZXY}$ from $(p_{Z}, p_{X | Z}, p_{Y | Z X})$ to $(p_{Z}, p_{X | Z}, p_{Y | X}^{*}, ϕ_{Y Z | X}^{*})$ , where $p_{Y | X}^{*}$ , with finite-dimensional parameter $θ_{Y | X}^{*}$ , is typically the unobservable interventional distribution of interest for $Y ∣ do (X)$ , and $ϕ_{Y Z | X}^{*}$ is whatever is necessary to complete the joint model. The quantity $p_{X | Z}$ is known as the propensity score. We refer to the original paper for a detailed discussion of these quantities, which in turn are expressed in terms of finite-dimensional parameters, say $ψ$ and $λ$ . The vector parameter $ψ$ with true value $ψ^{*}$ comprises $ϕ_{Y Z | X}^{*}$ and the main component of interest $θ_{Y | X}^{*}$ which together specify $p_{Z} p_{Y | Z X}$ , while $λ$ characterizes the propensity score $p_{X | Z}$ . A consequence of the above parameterization within the class of marginal structural models is Theorem 5.1 of ref. 8, which establishes consistency and asymptotic normality of the maximum likelihood estimator $\hat{ψ}$ of $ψ^{*}$ , even if $p_{X | Z} = p_{X | Z} (λ)$ is misspecified. As noted in ref. 8, the conclusion arises as a consequence of the parameter cut between $p_{Z} p_{Y | Z X}$ and $p_{X | Z}$ , and therefore between $ψ$ and $λ$ . The parameter cut implies the parameter orthogonality condition $ψ^{*} ⊥_{m} Λ$ , and Evans and Didelez (8) implicitly establish in their proof the remaining condition of Proposition 1.1.

3. Discussion

3.A. Stronger Inferential Guarantees.

Even if consistency holds, inferential guarantees entail estimation of variance, a difficult problem due to failure of Bartlett’s second identity. Such difficulties persist under the stronger condition of Proposition 1.3.

In the notation of Section 1, let

$\begin{matrix} q & = E_{m} [(\nabla_{(ψ, λ)} ℓ) {(\nabla_{(ψ, λ)} ℓ)}^{T}], \\ \overset{ˇ}{q} & = E_{(ψ, λ)} [(\nabla_{(ψ, λ)} ℓ) {(\nabla_{(ψ, λ)} ℓ)}^{T}], \end{matrix}$

both functions of $ψ$ and $λ$ . Proposition 1.3, Eq. 1.1 and consistency of $\hat{ψ}$ imply

$E_{(ψ, λ)} [\nabla_{(ψ, λ)} ℓ (ψ^{*}, λ_{m}^{0})] = 0,$

and differentiation gives Bartlett’s second identity under the assumed model:

$- \overset{ˇ}{ı} (ψ^{*}, λ_{m}^{0}) + \overset{ˇ}{q} (ψ^{*}, λ_{m}^{0}) = 0 .$

The relevant quantities for establishing the limiting distributions of standard test statistics are, however, $i$ and $q$ , where $i$ was defined in Section 1. These do not in general satisfy $- i (ψ^{*}, λ_{m}^{0}) + q (ψ^{*}, λ_{m}^{0}) = 0$ , even with the additional structure of Proposition 1.3. An exception is Example 2.2 because of the particular simplicity of this case.

It can be shown using now standard arguments that, with the log-likelihood function constructed from $n$ independent observations, provided that $\hat{ψ}$ is consistent for $ψ^{*}$ ,

$\begin{matrix} \sqrt{n} (\hat{ψ} - ψ^{*}) \to_{d} N (0, {(i^{- 1} q i^{- 1})}_{ψ ψ}), \end{matrix}$ [3.1]

the variance being the so-called sandwich formula, in which the constituent terms are evaluated at $(ψ^{*}, λ_{m}^{0})$ . This reduces to $i^{ψ ψ} (ψ^{*}, λ_{m}^{0}) = i^{ψ ψ} (ψ^{*}, λ^{*})$ when the model is correctly specified, so that the usual asymptotic result is recovered. The standard refs. 3, 5–7, and 23 have $ψ_{m}^{0}$ in place of $ψ^{*}$ in Eq. 3.1. Qualitatively similar results hold for the profile score and profile likelihood ratio statistics, the main point being that confidence set estimation is infeasible without knowledge of $m$ unless $\overset{ˇ}{q} (ψ^{*}, λ_{m}^{0}) = q (ψ^{*}, λ_{m}^{0})$ to some adequate order of approximation. In general, this requires, not only that the expectations of the sufficient statistics are robust to misspecification but also that the expectations of any relevant squares and cross terms are stable.

Suppose, however, that the parameterization is orthogonal in the sense $ψ^{*} ⊥_{m} λ_{m}^{0}$ . Then,

${(i^{- 1} q i^{- 1})}_{ψ ψ} = i^{ψ ψ} q_{ψ ψ} i^{ψ ψ},$

and the familiar result is also recovered if $q_{ψ ψ} i^{ψ ψ} = I_{dim (ψ)}$ , a weaker requirement than $q = i$ especially if $ψ$ is a scalar. Under the additional structure of Proposition 1.3, inference based on the assumed model is unaffected by the misspecification.
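The two reductions of the sandwich formula described above can be checked on a toy example. The sketch below is illustrative only: it takes scalar $ψ$ and scalar $λ$, uses made-up $2 \times 2$ matrices for $i$ and $q$ with $ψ$ indexed first, and relies on hypothetical helper names.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(p[r][k] * q[k][c] for k in range(2)) for c in range(2))
        for r in range(2)
    )

def sandwich_psi_psi(i_mat, q_mat):
    """(psi, psi) entry of i^{-1} q i^{-1}, with psi indexed first."""
    i_inv = inv2(i_mat)
    return matmul2(matmul2(i_inv, q_mat), i_inv)[0][0]

i_full = ((2.0, 0.5), (0.5, 1.0))   # hypothetical information matrix
q_full = ((3.0, 0.2), (0.2, 1.5))   # hypothetical score covariance
i_orth = ((2.0, 0.0), (0.0, 1.0))   # block-diagonal case: psi orthogonal to lambda

correct_spec = sandwich_psi_psi(i_full, i_full)   # q = i: correct specification
orthogonal = sandwich_psi_psi(i_orth, q_full)     # orthogonal parameterization
```

When `q` equals `i` the result is the $(ψ, ψ)$ entry of the inverse information, and in the block-diagonal case it equals $i^{ψ ψ} q_{ψ ψ} i^{ψ ψ}$, so only the $(ψ, ψ)$ entry of `q` matters.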

3.B. Treating Incidental Parameters As Fixed or Random.

In the context of Section 2.A, the lack of general inferential guarantees beyond consistency may well be an argument for treating the nuisance parameters as fixed. Although the conceptual distinction is consequential for the analysis, a formulation in which incidental parameters are treated as fixed and arbitrary, and one in which they are treated as independent and identically distributed from a totally unspecified distribution, are numerically indistinguishable. In this sense, the fixed-parameter formulation is essentially nonparametric for the nuisance component, with the distinguishing feature that it evades estimation of the infinite-dimensional nuisance parameter.

The group structure used to define the symmetric parameterization in Proposition 2.1 also allows elimination of nuisance parameters by suitable preliminary maneuvers when the nuisance parameters are treated as fixed. In the exponential matched pairs setting of Example 2.1, ref. 14 compared inference based on the distribution of $Y_{i 1} / Y_{i 0}$ to that based on modeling the pair effects by a gamma distribution. The random pair effects model is more efficient when the gamma distribution is correct, but efficiency degrades substantially under misspecification, compared to maximum likelihood estimation based on the distribution of ratios.

Proposition 2.1 hinges on a symmetric parameterization having been chosen, which is possible only under the group structure of Section 1.C. Outside of this setting, there are no guarantees of consistency of the maximum likelihood estimator with misspecified random effects distribution, while it may still be possible to eliminate the pair effects with exact or approximate conditioning arguments, as illustrated in the second example of ref. 14.

3.C. Overstratification and Other Encompassing Models.

Two essentially equivalent ways to adjust for potential confounders are to stratify by observed explanatory features, leading to strata effects ${(γ_{j})}_{j = 1}^{m}$ as in Example 2.3, or to adjust for the explanatory features in a regression analysis. Both approaches are a form of conditioning by model formulation. It is arguably more relevant in this context to treat the strata effects as fixed rather than random. In De Stavola and Cox (24), this modification of Example 2.3 was considered, showing the extent to which overstratification decreases efficiency of the estimator. As is intuitively clear, no bias is incurred through overstratification, yet the conditions of Propositions 1.1 and 1.2 do not always hold, as illustrated by the following example. The explanation is that the true density is nested within the encompassing model, so that $λ_{m}^{0} = λ^{*} = 0$ . The overstratified model is therefore not misspecified according to the definition below Eq. 1.1.

To isolate the point at issue, we assume that any dispersion parameters are equal to 1, the conclusion of Proposition 3.1 being unchanged for arbitrary dispersion parameter by the discussion of Section 2.C.

Proposition 3.1

Suppose that, conditional on observed explanatory features, the probability density or mass function of $Y = (Y_{1}, \dots, Y_{n})$ at $y = (y_{1}, \dots, y_{n})$ is of canonical exponential family regression form, that is,

$\begin{matrix} m (y ; x_{1}^{T} ψ^{*}, \dots, x_{n}^{T} ψ^{*}) \\ = exp {\sum_{i = 1}^{n} y_{i} x_{i}^{T} ψ^{*} - \sum_{i = 1}^{n} K (x_{i}^{T} ψ^{*})} \prod_{i = 1}^{n} h (y_{i}) . \end{matrix}$

Suppose further that the analysis is overstratified so that under the assumed model the log-likelihood function is $ℓ (ψ, λ) = s^{T} ψ + t^{T} λ - K (ψ, λ)$ , where $s = \sum_{i = 1}^{n} x_{i} y_{i}$ , $t = \sum_{i = 1}^{n} w_{i} y_{i}$ and $K (ψ, λ) = \sum_{i = 1}^{n} K (x_{i}^{T} ψ + w_{i}^{T} λ)$ . The conditions of Propositions 1.1 and 1.2 are in general violated, yet the maximum likelihood estimator $\hat{ψ}$ is consistent for $ψ^{*}$ .

Another type of encompassing model arises in random effects models when the assumed family of distributions for the random effects is rich enough to include the true distribution, for instance if the postulated mixture class is nonparametric [Kiefer and Wolfowitz (25)], as discussed in Section 3.B. This type of situation can be understood in terms of the convex geometry of mixture distributions [Lindsay (26)] and was specialized to the case of binary matched pairs with logistic probabilities by Neuhaus et al. (27). Their conditions on the mixing distributions with regard to the resulting cell probabilities effectively imply that the model is correctly specified according to the definition below Eq. 1.1.

3.D. Estimating Equations and Neyman Orthogonality.

A generalization of the score Eq. 1.1 is to other estimating equations, often but not always obtained as the vector of partial derivatives of a convex loss function. The model need only be partially specified in terms of a known function $h$ of the data, an interest parameter $ψ$ and a nuisance parameter $λ$ , such that the expectation when the assumed model is true satisfies $E_{m} [h (ψ^{*}, λ^{*} ; Y)] = E_{(ψ^{*}, λ^{*})} [h (ψ^{*}, λ^{*} ; Y)] = 0$ . If the model is misspecified the limiting solutions satisfy, in analogy with Eq. 1.1,

$E_{m} [h (ψ_{m}^{0}, λ_{m}^{0} ; Y)] = 0 .$

A special choice of $h$ , establishing a connection to the double robustness literature, is the Neyman-orthogonal score for $ψ$ , defined as

$s_{N}^{*} (ψ ; λ, Y) = ℓ_{ψ} (ψ, λ) - w^{* T} ℓ_{λ} (ψ, λ),$

where $ℓ_{ψ}$ and $ℓ_{λ}$ are the partial derivatives of $ℓ$ with respect to $ψ$ and $λ$ and $w^{* T}$ is a matrix of dimension $dim (ψ) \times dim (λ)$ given by $w^{* T} = i_{ψ λ}^{*} i_{λ λ}^{* - 1}$ when the model is correctly specified. Here, $i_{ψ λ}^{*}$ and $i_{λ λ}^{*}$ are the components of the Fisher information matrix at the true parameter values $(ψ^{*}, λ^{*})$ . Consider instead

$\begin{matrix} s_{N} (ψ ; λ, Y) = ℓ_{ψ} (ψ, λ) - w^{T} ℓ_{λ} (ψ, λ), \quad w^{T} = i_{ψ λ} i_{λ λ}^{- 1}, \end{matrix}$ [3.2]

where $i_{ψ λ}$ and $i_{λ λ}$ are as defined in Eq. 1.4. Write the condition $i^{ψ ψ} g_{ψ} + i^{ψ λ} g_{λ} = 0$ of Proposition 1.2 as

$\begin{matrix} g_{ψ} + i_{ψ ψ . λ} i^{ψ λ} g_{λ} & = 0, \\ i_{ψ ψ . λ} & = {(i^{ψ ψ})}^{- 1} = i_{ψ ψ} - i_{ψ λ} i_{λ λ}^{- 1} i_{λ ψ}, \end{matrix}$ [3.3]

where $i_{ψ ψ . λ}$ is the Fisher information for $ψ$ at $(ψ^{*}, λ)$ , computed under the true model, having adjusted for estimation of $λ$ . On noting that $i^{ψ λ} = - i^{ψ ψ} i_{ψ λ} i_{λ λ}^{- 1}$ , we see that the information identity $i^{ψ ψ} g_{ψ} + i^{ψ λ} g_{λ} = 0$ of Proposition 1.2 is equivalent to requiring that the Neyman orthogonal score Eq. 3.2, with score adjustment $w$ computed under the true distribution, has zero expectation under the same distribution for all $λ \in Λ$ . Specification of $w$ in Eq. 3.2 when the model is misspecified thus hinges on the conditions of Proposition 1.3 being satisfied.
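The equivalence just stated follows in two lines. The sketch below assumes, as the surrounding text suggests, that $g_{ψ}$ and $g_{λ}$ denote the expectations under $m$ of $ℓ_{ψ}$ and $ℓ_{λ}$ at $(ψ^{*}, λ)$; that definition belongs to Section 1 and is not restated here.

```latex
\begin{aligned}
i_{ψ ψ . λ} \, i^{ψ λ} &= {(i^{ψ ψ})}^{- 1} (- i^{ψ ψ} i_{ψ λ} i_{λ λ}^{- 1}) = - i_{ψ λ} i_{λ λ}^{- 1} = - w^{T}, \\
g_{ψ} + i_{ψ ψ . λ} \, i^{ψ λ} g_{λ} &= g_{ψ} - w^{T} g_{λ} = E_{m} [s_{N} (ψ^{*} ; λ, Y)] .
\end{aligned}
```

Hence the first line of Eq. 3.3 holds exactly when the score in Eq. 3.2 has zero expectation under $m$.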

Supplementary Material

Appendix 01 (PDF)

pnas.2402736121.sapp.pdf (194.7 KB, PDF)

Acknowledgments

This research was partially supported by the UK Engineering and Physical Sciences Research Council and the Natural Sciences and Engineering Research Council of Canada. We are grateful to the reviewers for their very helpful comments.

Author contributions

H.S.B. and N.R. designed research; performed research; and wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

Reviewers: E.J.C., Stanford University; and E.I.G., University of Pennsylvania.

Contributor Information

Heather S. Battey, Email: h.battey@imperial.ac.uk.

Nancy Reid, Email: nancym.reid@utoronto.ca.

Data, Materials, and Software Availability

There are no data underlying this work.

References

1. Breiman L., Statistical modeling: The two cultures. Stat. Sci. 16, 199–215 (2001).
2. Cox D. R., Comment on Breiman’s “Statistical modeling: The two cultures”. Stat. Sci. 16, 216–218 (2001).
3. Cox D. R., “Tests of separate families of hypotheses” in Proceedings of the Fourth Berkeley Symposium, Neyman J., Ed. (University of California Press, Berkeley, CA, 1961), vol. I, pp. 105–123.
4. Huber P. J., “The behavior of maximum likelihood estimates under nonstandard conditions” in Proceedings of the Fifth Berkeley Symposium, Le Cam L. M., Neyman J., Eds. (University of California Press, Berkeley, CA, 1967), vol. I, pp. 221–233.
5. Kent J., Robust properties of likelihood ratio tests. Biometrika 69, 19–27 (1982).
6. White H., Maximum likelihood estimation of misspecified models. Econometrica 50, 1–25 (1982).
7. White H., Regularity conditions for Cox’s test of non-nested hypotheses. J. Econ. 19, 301–318 (1982).
8. Evans R. J., Didelez V., Parameterizing and simulating from causal models (with discussion). J. R. Stat. Soc. Ser. B 86, 535–568 (2024).
9. Robins J. M., Rotnitzky A., Zhao L. P., Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 89, 846–866 (1994).
10. Chernozhukov V., et al., Double/debiased machine learning for treatment and structural parameters. Econ. J. 21, 1–68 (2018).
11. Vansteelandt S., Dukes O., Assumption-lean inference for generalised linear model parameters (with discussion). J. R. Stat. Soc. Ser. B 84, 657–685 (2022).
12. Cox D. R., “Statistical causality: some historical remarks” in Causality: Statistical perspectives and applications, Berzuini C., Dawid A. P., Bernardinelli L., Eds. (Wiley, Chichester, 2012), pp. 1–5.
13. Battey H. S., Heather Battey’s contribution to the discussion of “Assumption-lean inference for generalised linear model parameters” by Vansteelandt and Dukes. J. R. Stat. Soc. Ser. B 84, 696–698 (2022).
14. Battey H. S., Cox D. R., High dimensional nuisance parameters: An example from parametric survival analysis. Inf. Geom. 3, 119–148 (2020).
15. Schielzeth H., et al., Robustness of linear mixed-effects models to violations of distributional assumptions. Methods Ecol. Evol. 11, 1141–1152 (2020).
16. Cox D. R., Wong M. Y., A note on the sensitivity of assumptions of a generalized linear mixed model. Biometrika 97, 209–214 (2010).
17. Cox D. R., Reid N., Parameter orthogonality and approximate conditional inference (with discussion). J. R. Stat. Soc. Ser. B 49, 1–39 (1987).
18. Barndorff-Nielsen O. E., Cox D. R., Inference and Asymptotics (Chapman & Hall, London, 1994).
19. Barndorff-Nielsen O. E., Information and Exponential Families in Statistical Theory (Wiley, New York, 1978).
20. Lindsay B. G., Using empirical partially Bayes inference for increased efficiency. Ann. Stat. 13, 914–931 (1985).
21. Daniel R., Zhang J., Farewell D., Making apples from oranges: Comparing noncollapsible effect estimators and their standard errors after adjustment for different covariate sets. Biom. J. 63, 528–557 (2021).
22. Cox D. R., Wermuth N., An approximation to maximum likelihood estimates in reduced models. Biometrika 77, 747–761 (1990).
23. Cox D. R., Further results on tests of separate families of hypotheses. J. R. Stat. Soc. Ser. B 24, 406–424 (1962).
24. de Stavola B. L., Cox D. R., On the consequences of overstratification. Biometrika 95, 992–996 (2008).
25. Kiefer J. G., Wolfowitz J., Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Stat. 27, 887–906 (1956).
26. Lindsay B. G., The geometry of mixture likelihoods: General theory. Ann. Stat. 11, 86–94 (1983).
27. Neuhaus J. M., Kalbfleisch J. D., Hauck W. W., Conditions for consistent estimation in mixed-effects models for binary matched-pairs data. Can. J. Stat. 22, 139–148 (1994).


