Semiparametric Generalized Linear Models with the gldrm Package

Michael J Wurm; Paul J Rathouz

Author manuscript; available in PMC: 2019 Mar 12.

Published in final edited form as: R J. 2018 Jul;10(1):288–307.


PMCID: PMC6414059 NIHMSID: NIHMS1011992 PMID: 30873295

Abstract

This paper introduces a new algorithm to estimate and perform inferences on a recently proposed and developed semiparametric generalized linear model (glm). Rather than selecting a particular parametric exponential family model, such as the Poisson distribution, this semiparametric glm assumes that the response is drawn from the more general exponential tilt family. The regression coefficients and unspecified reference distribution are estimated by maximizing a semiparametric likelihood. The new algorithm incorporates several computational stability and efficiency improvements over the algorithm originally proposed. In particular, the new algorithm performs well for either small or large support for the nonparametric response distribution. The algorithm is implemented in a new R package called gldrm.

Introduction

Rathouz and Gao (2009) introduced the generalized linear density ratio model (gldrm), which is a novel semiparametric formulation of the classical glm. Although Rathouz and Gao did not use the term gldrm, we refer to it as such because it is a natural extension of the density ratio model (see e.g. Lovric (2011)). Like a standard glm, the gldrm relates the conditional mean of the response to a linear function of the predictors through a known link function g. To be specific, let the random variable Y be a scalar response, and let X be a p × 1 covariate vector for a particular observation. The model for the mean is

E (Y | X = x) = g^{- 1} (x^{T} β) .

(1)

Because Equation 1 holds for both gldrm and glm, the regression coefficients have the same interpretation for both models. The gldrm relaxes the standard glm assumption that the distribution of Y|X comes from a particular exponential family model. Instead, we assume that Y|X comes from an exponential tilt model of the form

f (y | X = x) = f_{0} (y) exp {θ y - b (θ)},

(2)

where

b (θ) = log \int f_{0} (y) exp (θ y) d λ (y),

(3)

and θ is defined implicitly as the solution to

g^{- 1} (x^{T} β) = \int y f_{0} (y) exp {θ y - b (θ)} d λ (y) .

(4)

Here f₀ is an unspecified probability density with respect to a measure λ. We call f₀ the reference distribution. Measure λ is Lebesgue if Y|X is continuous, a counting measure if Y|X is discrete, or a mixture of the two if Y|X has a mixture distribution. Note that E(Y|X) = b′(θ) and Var(Y|X) = b″(θ), which are standard glm properties.
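The tilt in Equations 2 and 3 is easy to check numerically for a discrete reference distribution. The following Python sketch is illustrative only: the support, the reference probabilities, and the helper names `b`, `tilt`, and `mean_var` are assumptions made for this example and are not part of the gldrm package. It compares the mean and variance of the tilted distribution with finite-difference approximations of b′(θ) and b″(θ).

```python
import math

def b(support, f0, theta):
    """Cumulant function b(theta) = log sum_k f0_k * exp(theta * s_k) (Equation 3)."""
    return math.log(sum(f * math.exp(theta * s) for s, f in zip(support, f0)))

def tilt(support, f0, theta):
    """Tilted probabilities f0_k * exp(theta * s_k - b(theta)) (Equation 2)."""
    b_theta = b(support, f0, theta)
    return [f * math.exp(theta * s - b_theta) for s, f in zip(support, f0)]

def mean_var(support, pmf):
    """Mean and variance of a discrete distribution."""
    mu = sum(s * p for s, p in zip(support, pmf))
    var = sum((s - mu) ** 2 * p for s, p in zip(support, pmf))
    return mu, var

# Hypothetical reference distribution on four support points.
support = [0.0, 1.0, 2.0, 3.0]
f0 = [0.4, 0.3, 0.2, 0.1]
theta = 0.7

pmf = tilt(support, f0, theta)
mu, var = mean_var(support, pmf)

# Central finite differences approximate b'(theta) and b''(theta).
h = 1e-4
b_prime = (b(support, f0, theta + h) - b(support, f0, theta - h)) / (2 * h)
b_second = (b(support, f0, theta + h) - 2 * b(support, f0, theta)
            + b(support, f0, theta - h)) / h ** 2

print(mu, b_prime)     # the two values should agree closely
print(var, b_second)   # likewise
```

Setting `theta = 0` returns the reference distribution itself, which is one way to see why f₀ plays the role of a baseline.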

The regression coefficients β and reference distribution f₀ are estimated by maximizing a semiparametric likelihood function, which contains a nonparametric representation of f₀ that has point mass only at values of Y observed in the data. The quantity θ is not a parameter in this model, but rather a function of the free parameters β and f₀, as well as of X. This work was fully developed by Huang and Rathouz (2012), who first focused on the case where the covariate vector takes one of finitely many values, drawing on theoretical arguments advanced in the empirical likelihood literature (e.g. Owen et al. (2001)). Drawing on semiparametric profile likelihood methods, Huang (2014) went on to fully generalize the asymptotic arguments, proving consistency and asymptotic normality of the regression coefficient estimators, and deriving the asymptotic variance matrix using the profile likelihood. Huang also proved pointwise asymptotic normality of the reference cumulative distribution function estimator.

Despite these important theoretical advances, computation for this model has remained a practical challenge. The original algorithm in Rathouz and Gao (2009) is somewhat rudimentary and applies mostly to cases with finite support. It does not scale well, and stability and speed are challenges. This paper proposes a new algorithm to address these challenges and render the model more widely applicable.

In particular, the issue of optimizing the semiparametric likelihood over the f₀ point masses is improved with application of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) technique (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970). Because the number of parameters in the semiparametric likelihood associated with f₀ is equal to the number of unique points in the response vector, the dimension of the Hessian matrix can become unwieldy, especially for continuous response data. The BFGS method instead maintains an approximate Hessian matrix through easily computed rank-two updates, and the inverse of this approximation is much less computationally intensive to calculate.

The optimization techniques in this paper have been incorporated into the R package gldrm (Wurm and Rathouz, 2018), the first CRAN package for gldrm. This package estimates the gldrm model and provides coefficient standard errors. In addition, since the publication of Rathouz and Gao (2009), Huang (2014) developed likelihood ratio tests and likelihood ratio-based confidence intervals for regression coefficients and demonstrated their strong properties. The gldrm package provides utilities for likelihood ratio tests of nested models, as well as confidence intervals.

We present our optimization algorithm in Section 2. In Section 3, we present simulation experiments to check the finite sample performance of the gldrm estimators and the accuracy of estimated standard errors using this algorithm. In Section 4, we discuss methods for inference. In Section 5, we present a simulation that benchmarks the computational time and how it scales with the number of observations and number of covariates. In Section 6, we present a summary of the functions in the gldrm package. In Section 7, we demonstrate the functionality of the gldrm package with an example in R. In Section 8, we make concluding remarks.

Optimization algorithm

Suppose we have an observed sample (x₁, y₁),…, (x_n, y_n) generated by the model specified in Equations 1–4. The x_i can be fixed or random, and the y_i are conditionally independent given the covariates. Let $S = (s_{1}, \dots, s_{K})$ be the observed support, i.e. a vector containing the unique values in the set {y₁,…y_n}. If the response follows a continuous distribution, then K will equal n. Otherwise K may be less than n due to ties. Let the vector ${\tilde{f}}_{0} = (f_{1}, \dots, f_{K})$ represent probability mass at the points (s₁,…, s_K). This is a nonparametric representation of f₀ because the true reference density may be continuous or have probability mass at values not contained in $S$ .
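As a small illustration of the observed support, the Python sketch below extracts the unique response values and their empirical frequencies from a made-up response vector. The data and the helper name `observed_support` are hypothetical. The empirical frequencies are shown only because they are a natural discrete distribution on the support; they are not the fitted ${\tilde{f}}_{0}$.

```python
from collections import Counter

def observed_support(y):
    """Return the sorted unique responses and their empirical frequencies."""
    counts = Counter(y)
    support = sorted(counts)
    n = len(y)
    freqs = [counts[s] / n for s in support]
    return support, freqs

# Hypothetical count response with ties, so that K < n.
y = [2, 0, 1, 1, 3, 0, 1]
support, f_tilde = observed_support(y)
K, n = len(support), len(y)
print(support, K, n)
```

With a continuous response there would be no ties, and `K` would equal `n`.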

The semiparametric log-likelihood is

l (β, {\tilde{f}}_{0}) = \sum_{i = 1}^{n} {θ_{i} y_{i} - log \sum_{k = 1}^{K} f_{k} exp (θ_{i} s_{k}) + \sum_{k = 1}^{K} I (y_{i} = s_{k}) log f_{k}},

(5)

where each θ_i is defined implicitly as the solution to

g^{- 1} (x_{i}^{T} β) = \frac{\sum_{k = 1}^{K} s_{k} f_{k} exp {θ_{i} s_{k}}}{\sum_{k = 1}^{K} f_{k} exp {θ_{i} s_{k}}} .

(6)

There exists a θ_i that satisfies Equation 6 as long as $g^{- 1} (x_{i}^{T} β) \in (m, M)$ for all i, where m ≡ min( $S$ ) and M ≡ max( $S$ ).

We obtain parameter estimates from this likelihood as $\underset{(β, {\tilde{f}}_{0})}{arg max} l (β, {\tilde{f}}_{0})$ , subject to the following constraints:

C 1. g^{- 1} (x_{i}^{T} β) \in (m, M) for all i

C 2. f_{k} > 0 for all k = 1, \dots, K

C 3. \sum_{k = 1}^{K} f_{k} = 1

C 4. \sum_{k = 1}^{K} s_{k} f_{k} = μ_{0} for some chosen μ_{0} \in (m, M)

Constraint (C4) is necessary for identifiability because for any nonparametric density ${\tilde{f}}_{0} = {(f_{1}, \dots, f_{K})}^{T}$ , the “exponentially tilted” density ${\tilde{f}}_{0}^{(α)} = {(f_{1} e^{α s_{1}}, \dots, f_{K} e^{α s_{K}})}^{T} / \sum_{k = 1}^{K} f_{k} e^{α s_{k}}$ has the same semiparametric log-likelihood for any $α \in ℝ$ , i.e. $l (β, {\tilde{f}}_{0}) = l (β, {\tilde{f}}_{0}^{(α)})$ for all β. We can set μ₀ to be any value within the range of observed response values, but choosing a value too extreme can lead to numerical instability in estimating ${\tilde{f}}_{0}$ . It usually works well to choose $μ_{0} = \frac{1}{n} \sum_{i = 1}^{n} y_{i}$ .
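The invariance that motivates constraint (C4) can be checked numerically. The Python sketch below evaluates the log-likelihood of Equation 5 for a reference distribution and for a tilted copy of it, holding the means μ_i fixed. Everything here is an assumption made for illustration: the data, the helper names, and the use of bisection to solve Equation 6 (chosen only because it is short and robust).

```python
import math

def tilted_mean(support, f, theta):
    """Right-hand side of Equation 6."""
    w = [fk * math.exp(theta * s) for s, fk in zip(support, f)]
    return sum(s * wk for s, wk in zip(support, w)) / sum(w)

def solve_theta(support, f, mu, lo=-50.0, hi=50.0, iters=200):
    """Solve Equation 6 by bisection; the tilted mean is increasing in theta."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tilted_mean(support, f, mid) < mu:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def loglik(support, f, y, mu):
    """Equation 5, written in terms of the means mu_i = g^{-1}(x_i' beta)."""
    total = 0.0
    for yi, mi in zip(y, mu):
        theta = solve_theta(support, f, mi)
        b = math.log(sum(fk * math.exp(theta * s) for s, fk in zip(support, f)))
        total += theta * yi - b + math.log(f[support.index(yi)])
    return total

def tilt(support, f, alpha):
    """Exponentially tilted and renormalized copy of f."""
    w = [fk * math.exp(alpha * s) for s, fk in zip(support, f)]
    norm = sum(w)
    return [wk / norm for wk in w]

# Hypothetical support, reference probabilities, responses, and fitted means.
support = [0.0, 1.0, 2.0, 3.0]
f = [0.4, 0.3, 0.2, 0.1]
y = [0.0, 1.0, 1.0, 2.0, 3.0, 2.0]
mu = [0.8, 1.2, 1.5, 1.7, 2.2, 1.9]

f_alpha = tilt(support, f, 0.8)
l_ref = loglik(support, f, y, mu)
l_tilted = loglik(support, f_alpha, y, mu)
print(l_ref, l_tilted)   # the two log-likelihoods should agree
```

The two reference distributions have different means, yet the log-likelihood cannot tell them apart, which is why the mean must be pinned to some μ₀.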

To perform the optimization, we take an iterative approach, alternating between optimizing over ${\tilde{f}}_{0}$ and over β (Rathouz and Gao, 2009). Each optimization step marginally optimizes the log-likelihood over one set of parameters, holding the other fixed. Neither optimization step has a closed form solution, so iterative procedures must be used. To update ${\tilde{f}}_{0}$ , we propose using the BFGS technique. To update β, we use Fisher scoring, which is equivalent to iteratively re-weighted least squares (IRLS). We propose iterating BFGS until convergence for each ${\tilde{f}}_{0}$ update, while only using a single IRLS iteration to update β in between ${\tilde{f}}_{0}$ updates; this is tantamount to a profile likelihood approach to estimation of β.

Although the log-likelihood in Equation 5 is a function of $(β, {\tilde{f}}_{0})$ , we have expressed it in terms of (θ₁, … , θ_n); each θ_i is implicitly a function of $(β, {\tilde{f}}_{0})$ . The score function is also best expressed in terms of the θ_i’s (see Equations 12 and 15). Consequently, our optimization algorithm requires the θ_i’s to be computed each time β or ${\tilde{f}}_{0}$ is updated. To do this, we use an iterative Newton-Raphson procedure after each iteration of the β update or the ${\tilde{f}}_{0}$ update procedures.

In what follows, we detail updates of the θ_i’s, of ${\tilde{f}}_{0}$ , and of β. We use the notation b(θ) to denote the function defined in Equation 3 with respect to the discrete probability distribution specified by ${\tilde{f}}_{0}$ , i.e. $b (θ) = log \sum_{k = 1}^{K} f_{k} exp (θ s_{k})$ . We also define $μ_{i} \equiv g^{- 1} (x_{i}^{T} β)$ .

θ update procedure

θ_i is defined implicitly as the solution to Equation 6, which can be written as

μ_{i} = b^{'} (θ_{i}) .

(7)

To calculate θ_i, we use the Newton-Raphson procedure provided in Appendix C of Rathouz and Gao (2009). To satisfy Equation 7, we need to find θ such that μ_i = b′(θ), or equivalently g_l(μ_i) = g_l{b′(θ)}, where $g_{l} (s) = logit (\frac{s - m}{M - m}) = log (\frac{s - m}{M - s})$ . (The transformation, g_l, stabilizes the solution.)

We use Newton-Raphson to find the root of t(θ) = g_l{b′(θ)} ‒ g_l(μ_i). Let θ^(r) denote the approximate solution at the r^th Newton-Raphson iteration. The Newton-Raphson update is given by

θ^{(r + 1)} = θ^{(r)} - {t^{'} (θ^{(r)})}^{- 1} t (θ^{(r)}),

(8)

where

t^{'} (θ) = \frac{M - m}{{b^{'} (θ) - m} {M - b^{'} (θ)}} b^{″} (θ),

(9)

and

b^{″} (θ) = \sum_{k = 1}^{K} {s_{k} - b^{'} (θ)}^{2} f_{k} exp {θ s_{k} - b (θ)} .

(10)

We typically initialize θ⁽⁰⁾ to be the value obtained from the previous θ update procedure. The first time θ is updated, we initialize θ⁽⁰⁾ to zero for every observation. We declare convergence when $| t (θ^{(r)}) | < ϵ$ , where ϵ is a small threshold such as 10⁻¹⁰.

As μ_i → M from the left, θ_i → +∞. Likewise, as μ_i → m from the right, θ_i → −∞. To prevent numerical instability when μ_i is at or near these boundaries, we cap |θ_i| at a maximum value (500 by default). The appropriateness of this threshold would depend on the scale of the response variable. Rather than adjust the threshold, we center and scale the response variable to the interval [−1, 1] (see the subsequent section on “response variable transformation”).
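The Newton-Raphson iteration of Equations 8 to 10 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the package's implementation: the support is taken to be already scaled to [−1, 1], the reference probabilities are made up, and the names `b_derivs`, `g_l`, and `update_theta` are hypothetical.

```python
import math

def b_derivs(support, f, theta):
    """Return b'(theta) and b''(theta) for the discrete distribution f."""
    w = [fk * math.exp(theta * s) for s, fk in zip(support, f)]
    total = sum(w)
    b1 = sum(s * wk for s, wk in zip(support, w)) / total
    b2 = sum((s - b1) ** 2 * wk for s, wk in zip(support, w)) / total
    return b1, b2

def g_l(s, m, M):
    """Stabilizing transformation log((s - m) / (M - s))."""
    return math.log((s - m) / (M - s))

def update_theta(support, f, mu, theta0=0.0, eps=1e-10, max_iter=100,
                 theta_max=500.0):
    """Newton-Raphson for t(theta) = g_l(b'(theta)) - g_l(mu) (Equations 8-10)."""
    m, M = min(support), max(support)
    theta = theta0
    for _ in range(max_iter):
        b1, b2 = b_derivs(support, f, theta)
        t = g_l(b1, m, M) - g_l(mu, m, M)
        if abs(t) < eps:                      # convergence check on |t|
            break
        t_prime = (M - m) / ((b1 - m) * (M - b1)) * b2   # Equation 9
        theta -= t / t_prime                             # Equation 8
        theta = max(-theta_max, min(theta_max, theta))   # cap |theta|
    return theta

# Hypothetical rescaled support with a uniform reference distribution.
support = [-1.0, -0.5, 0.0, 0.5, 1.0]
f = [0.2] * 5
theta_a = update_theta(support, f, mu=0.3)
theta_b = update_theta(support, f, mu=-0.6)
print(b_derivs(support, f, theta_a)[0])   # should be close to 0.3
print(b_derivs(support, f, theta_b)[0])   # should be close to -0.6
```

In a full fit, `theta0` would be the value from the previous update rather than zero, which typically shortens the iteration.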

${\tilde{f}}_{0}$ optimization procedure

Holding β fixed at its current estimate, we need to marginally optimize the log-likelihood $l (β, {\tilde{f}}_{0})$ over ${\tilde{f}}_{0}$ , subject to constraints (C2)-(C4). The linear constraints (C3) and (C4) could be enforced using constrained optimization techniques such as Lagrange multipliers or reducing the dimension of the parameter space by two. Huang and Rathouz (2012) used the former technique, while Rathouz and Gao (2009) used the latter. We propose a different method, based on the BFGS technique, that is more computationally efficient. At each iteration, we apply a BFGS update to ${\tilde{f}}_{0}$ to improve the unconstrained log-likelihood and then transform ${\tilde{f}}_{0}$ to satisfy constraints (C3) and (C4). Application of the constraints does not affect the log-likelihood of the estimate, as the constraints are only required for identifiability (i.e. uniqueness of the optimal ${\tilde{f}}_{0}$ ). For any set of positive ${\tilde{f}}_{0}$ values, there exists a unique set of values with equal log-likelihood that satisfies both constraints (C3) and (C4).

We define the transformation ${\tilde{g}}_{0} = (g_{1}, \dots, g_{K}) = (log f_{1}, \dots, log f_{K})$ and consider the log-likelihood as a function of ${\tilde{g}}_{0}$ only, with β held fixed. Working on the log scale enforces constraint (C2) and also improves numerical stability. Specifically, numerical stability is improved by working with the score and Hessian as a function of ${\tilde{g}}_{0}$ rather than ${\tilde{f}}_{0}$ .

BFGS is a quasi-Newton procedure, which makes iterative updates using an approximate Hessian matrix along with the exact score function. Let ${\tilde{g}}_{0}^{(t)}$ be the estimate at the t^th iteration. The updates take the form

{\tilde{g}}_{0}^{(t + 1)} \leftarrow {\tilde{g}}_{0}^{(t)} - H_{t}^{- 1} S ({\tilde{g}}_{0}^{(t)}; β) .

(11)

Here, $S ({\tilde{g}}_{0}; β)$ is the score as a function of ${\tilde{g}}_{0}$ only, holding β fixed. It has k^th element

{S ({\tilde{g}}_{0}; β)}_{k} = \sum_{i = 1}^{n} {I (y_{i} = s_{k}) - \frac{exp (g_{k} + θ_{i} s_{k})}{exp {b (θ_{i})}} - \frac{exp (g_{k} + θ_{i} s_{k})}{exp {b (θ_{i})}} \frac{s_{k} - μ_{i}}{b^{″} (θ_{i})} (y_{i} - μ_{i})}

(12)

for k = 1,…, K (derivation in Appendix A). Note that this score function ignores constraints (C3) and (C4). The matrix H_t is an approximation to the Hessian of the log-likelihood as a function of ${\tilde{g}}_{0}$ only, holding β fixed. This estimate is updated with each iteration. Letting $u_{t} = {\tilde{g}}_{0}^{(t)} - {\tilde{g}}_{0}^{(t - 1)}$ and $v_{t} = S ({\tilde{g}}_{0}^{(t)}; β) - S ({\tilde{g}}_{0}^{(t - 1)}; β)$ , we can write the BFGS estimate of the Hessian recursively as

H_{t} = H_{t - 1} + \frac{v_{t} v_{t}^{T}}{u_{t}^{T} v_{t}} - \frac{H_{t - 1} u_{t} u_{t}^{T} H_{t - 1}}{u_{t}^{T} H_{t - 1} u_{t}} .

(13)

H_t is a rank-2 update to H_{t−1} that satisfies the secant condition: H_t u_t = v_t. Furthermore, H_t is guaranteed to be symmetric and positive definite, even though the true Hessian is not full rank. (The true Hessian is not full rank because without imposing constraints (C3) and (C4), the log-likelihood does not have a unique optimum.) The BFGS update in Equation 11 requires the inverse of H_t, which can be calculated efficiently and directly, without calculating H_t. By the Sherman-Morrison-Woodbury formula,

H_{t}^{- 1} = H_{t - 1}^{- 1} + \frac{(u_{t}^{T} v_{t} + v_{t}^{T} H_{t - 1}^{- 1} v_{t}) (u_{t} u_{t}^{T})}{{(u_{t}^{T} v_{t})}^{2}} - \frac{H_{t - 1}^{- 1} v_{t} u_{t}^{T} + u_{t} v_{t}^{T} H_{t - 1}^{- 1}}{u_{t}^{T} v_{t}} .

(14)

For an initial estimate, we propose $H_{0}^{- 1} = α I_{K}$ , where I_K is the K × K identity matrix. We perform a line search along the gradient to choose an appropriate value for α such that $l (β, {\tilde{f}}_{0}^{(1)}) > l (β, {\tilde{f}}_{0}^{(0)})$ .
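Equations 13 and 14 can be verified against each other numerically. The Python sketch below applies both updates to a small example and checks that the results are inverses of one another and that the secant condition holds. The vectors `u` and `v`, the scale `alpha`, and all function names are hypothetical values chosen for illustration; in the algorithm they would come from successive iterates and scores.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def matvec(A, x):
    return [dot(row, x) for row in A]

def matmul(A, B):
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]

def bfgs_hessian(H_prev, u, v):
    """Equation 13: rank-two update of the approximate Hessian (H_prev symmetric)."""
    Hu = matvec(H_prev, u)
    uv, uHu = dot(u, v), dot(u, Hu)
    K = len(u)
    return [[H_prev[i][j] + v[i] * v[j] / uv - Hu[i] * Hu[j] / uHu
             for j in range(K)] for i in range(K)]

def bfgs_inverse(Hinv_prev, u, v):
    """Equation 14: the same update applied directly to the inverse."""
    Hv = matvec(Hinv_prev, v)
    uv, vHv = dot(u, v), dot(v, Hv)
    K = len(u)
    return [[Hinv_prev[i][j]
             + (uv + vHv) * u[i] * u[j] / uv ** 2
             - (Hv[i] * u[j] + u[i] * Hv[j]) / uv
             for j in range(K)] for i in range(K)]

K = 3
alpha = 0.5                                   # hypothetical initial scale
Hinv0 = [[alpha if i == j else 0.0 for j in range(K)] for i in range(K)]
H0 = [[1.0 / alpha if i == j else 0.0 for j in range(K)] for i in range(K)]
u = [0.3, -0.1, 0.2]                          # hypothetical parameter change
v = [0.5, 0.1, 0.4]                           # hypothetical score change

H1 = bfgs_hessian(H0, u, v)
Hinv1 = bfgs_inverse(Hinv0, u, v)
product = matmul(H1, Hinv1)                   # should be close to the identity
secant = matvec(H1, u)                        # should be close to v
print(product)
print(secant, v)
```

Only `bfgs_inverse` is needed in practice; `bfgs_hessian` appears here solely to confirm the algebra. Each inverse update costs on the order of K² operations, whereas inverting a K × K matrix from scratch costs on the order of K³.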

As previously mentioned, constraints (C3) and (C4) are only required for identifiability. After each BFGS iteration, we impose these constraints on the ${\tilde{f}}_{0}$ estimate, which does not affect the log-likelihood of the estimate. Specifically, we apply (C3) by scaling our estimate of ${\tilde{f}}_{0}$ to sum to one. We then “exponentially tilt” the estimate to enforce constraint (C4). In other words, we compute θ such that $\sum_{j = 1}^{K} s_{j} f_{j} e^{θ s_{j}} / \sum_{j = 1}^{K} f_{j} e^{θ s_{j}} = μ_{0}$ , and set our final estimate for the iteration to be $f_{k} \leftarrow f_{k} e^{θ s_{k}} / \sum_{j = 1}^{K} f_{j} e^{θ s_{j}}$ for all k.
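The two-step projection described above (rescale, then tilt) might look as follows in Python. This is a sketch under assumptions: the tilt parameter is found by bisection rather than by any particular method used in the package, and the support, the raw values, μ₀, and the function names are all made up for the example.

```python
import math

def tilted_mean(support, f, theta):
    """Mean of f after tilting by theta and renormalizing."""
    w = [fk * math.exp(theta * s) for s, fk in zip(support, f)]
    return sum(s * wk for s, wk in zip(support, w)) / sum(w)

def impose_constraints(support, f, mu0, lo=-50.0, hi=50.0, iters=200):
    """Rescale f to sum to one (C3), then tilt it to have mean mu0 (C4)."""
    total = sum(f)
    f = [fk / total for fk in f]
    # Bisection on theta; the tilted mean is increasing in theta.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tilted_mean(support, f, mid) < mu0:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    w = [fk * math.exp(theta * s) for s, fk in zip(support, f)]
    norm = sum(w)
    return [wk / norm for wk in w]

support = [-1.0, -0.5, 0.0, 0.5, 1.0]      # hypothetical rescaled support
f_raw = [0.2, 0.5, 0.9, 0.3, 0.4]          # positive but unnormalized values
mu0 = 0.1
f_new = impose_constraints(support, f_raw, mu0)
mean_new = sum(s * fk for s, fk in zip(support, f_new))
print(sum(f_new), mean_new)
```

Because tilting multiplies each mass by a positive factor, the positivity constraint (C2) is preserved automatically.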

We initialize ${\tilde{g}}_{0}^{(0)}$ to the log of the ${\tilde{f}}_{0}$ estimate obtained from the previous ${\tilde{f}}_{0}$ update procedure. We suggest using the empirical response distribution as an initial estimate of ${\tilde{f}}_{0}$ . We define convergence using the relative change in the log-likelihood. Our default convergence threshold is 10⁻¹⁰. If the log-likelihood decreases after any iteration, we backtrack by half steps, setting ${\tilde{g}}_{0}^{(t + 1)} \leftarrow \frac{1}{2} ({\tilde{g}}_{0}^{(t + 1)} + {\tilde{g}}_{0}^{(t)})$ until the log-likelihood improves. In our experience, we have found that the log-likelihood improves after most iterations without taking half steps, but log-likelihood decreases can occur sporadically (both at early and late iterations).

β optimization procedure

Holding ${\tilde{f}}_{0}$ fixed at its current estimate, we could marginally optimize the log-likelihood over β using iteratively re-weighted least squares (IRLS). Rather than iterating until convergence, however, we propose using a single IRLS iteration to update β in between ${\tilde{f}}_{0}$ updates. The IRLS algorithm is simply the Newton-Raphson algorithm, but using the Fisher information in place of the negative Hessian matrix. This technique is commonly referred to as Fisher Scoring. As we now show, the Fisher Scoring update minimizes a weighted least squares expression, which is why we can refer to the algorithm as IRLS. The score function is given by

S (β; {\tilde{f}}_{0}) = \sum_{i = 1}^{n} x_{i} (\frac{1}{g^{'} (μ_{i})}) (\frac{1}{b^{″} (θ_{i})}) (y_{i} - μ_{i}) = X^{T} W r,

(15)

(derivation in Appendix C) and the Fisher information is

I (β; {\tilde{f}}_{0}) = E (S (β; {\tilde{f}}_{0}) S {(β; {\tilde{f}}_{0})}^{T} | X_{1} = x_{1}, \dots, X_{n} = x_{n}) = \sum_{i = 1}^{n} x_{i} {(\frac{1}{g^{'} (μ_{i})})}^{2} (\frac{1}{b^{″} (θ_{i})}) x_{i}^{T} = X^{T} W X,

(16)

where X is an n × p matrix with rows $x_{i}^{T}$ , W is an n × n diagonal matrix with entries ${(\frac{1}{g^{'} (μ_{i})})}^{2} \frac{1}{b^{″} (θ_{i})}$ , and r is an n × 1 vector with entries $g^{'} (μ_{i}) (Y_{i} - μ_{i})$ .

Let β⁽⁰⁾ be the estimate obtained from the previous β update procedure. The IRLS step between β⁽⁰⁾ and the updated estimate, β⁽¹⁾, is

β^{(1)} - β^{(0)} = {I (β^{(0)}; {\tilde{f}}_{0})}^{- 1} S (β^{(0)}; {\tilde{f}}_{0}) = {(X^{T} W X)}^{- 1} X^{T} W r .

(17)

This is the solution to a weighted least squares expression, which can be computed efficiently using QR decomposition.
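To make Equation 17 concrete, the following Python sketch carries out one IRLS step for a two-column design matrix with an identity link. It is an illustration only: it solves the 2 × 2 normal equations directly instead of using a QR decomposition, and the data, the b″(θ_i) values, and the name `irls_step` are hypothetical.

```python
def irls_step(X, w, r):
    """Solve the 2x2 normal equations (X'WX) d = X'Wr for the step d."""
    a = sum(wi * x[0] * x[0] for x, wi in zip(X, w))
    b = sum(wi * x[0] * x[1] for x, wi in zip(X, w))
    c = sum(wi * x[1] * x[1] for x, wi in zip(X, w))
    p = sum(wi * x[0] * ri for x, wi, ri in zip(X, w, r))
    q = sum(wi * x[1] * ri for x, wi, ri in zip(X, w, r))
    det = a * c - b * b
    return [(c * p - b * q) / det, (a * q - b * p) / det]

xs = [0.0, 0.25, 0.5, 0.75, 1.0]
X = [(1.0, x) for x in xs]                   # intercept and one covariate
y = [1.0 + 2.0 * x for x in xs]              # responses lying exactly on a line
beta0 = [0.5, 1.5]                           # current estimate

mu = [beta0[0] + beta0[1] * x for x in xs]   # identity link: mu_i = x_i' beta
g_prime = [1.0] * len(xs)                    # g'(mu) = 1 for the identity link
b_second = [0.5, 0.8, 1.0, 1.3, 0.9]         # hypothetical b''(theta_i) values

w = [1.0 / (gp ** 2 * b2) for gp, b2 in zip(g_prime, b_second)]   # diagonal of W
r = [gp * (yi - mi) for gp, yi, mi in zip(g_prime, y, mu)]        # entries of r
step = irls_step(X, w, r)
beta1 = [b0 + d for b0, d in zip(beta0, step)]
print(beta1)   # close to [1.0, 2.0], since y is exactly linear in x
```

For larger p, a QR-based weighted least squares solver is numerically preferable to forming the normal equations, which is the approach the text describes.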

Response variable transformation

Numerical stability issues can occur if the exp{θ_i s_k} terms become too large or small during optimization. To prevent these terms from exceeding R’s default floating-point number range, we transform the response vector to the interval [−1, 1]. Specifically, the response values are transformed to $y_{i}^{*} = (y_{i} - \frac{m + M}{2}) \cdot \frac{2}{M - m}$ . It turns out the semiparametric log-likelihood function with the transformed response and modified link function $g_{*} (μ) = g (\frac{m + M}{2} + \frac{M - m}{2} μ)$ is equivalent to the original log-likelihood. The parameters β and ${\tilde{f}}_{0}$ that optimize the modified log-likelihood also optimize the original log-likelihood (proof in Appendix D).
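The rescaling and the modified link can be written out directly. In the Python sketch below, the response values, the choice of a log link, and the function names are assumptions made for illustration. The check at the end confirms that applying g_* to a rescaled mean gives the same linear predictor as applying g to the original mean.

```python
import math

def make_rescaling(y):
    """Return maps to and from the [-1, 1] scale based on min(y) and max(y)."""
    m, M = min(y), max(y)
    center, half_range = (m + M) / 2, (M - m) / 2
    def to_star(value):
        return (value - center) / half_range
    def to_original(value_star):
        return center + half_range * value_star
    return to_star, to_original

def modified_link(g, to_original):
    """g_*(mu) = g((m + M)/2 + (M - m)/2 * mu)."""
    return lambda mu_star: g(to_original(mu_star))

y = [0.5, 2.0, 3.5, 8.0]                  # hypothetical positive responses
to_star, to_original = make_rescaling(y)
y_star = [to_star(yi) for yi in y]

g = math.log                              # log link on the original scale
g_star = modified_link(g, to_original)

mu = 2.0                                  # a mean on the original scale
eta_original = g(mu)
eta_rescaled = g_star(to_star(mu))
print(y_star)
print(eta_original, eta_rescaled)         # the linear predictors should match
```

Since the linear predictor is unchanged, the regression coefficients keep their interpretation on the original scale.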

As mentioned in the “θ update procedure” section, we choose to cap |θ| for each observation at 500 by default, thereby restricting the rescaled θ_i s_k terms to the interval [−500, 500]. Note that the optimization function in the gldrm R package returns the estimated θ_i values for each observation, and these values are on the original scale of the response variable.

Inference

The gldrm R package (Wurm and Rathouz, 2018) can perform inference on the regression coefficients using likelihood ratio tests for nested models. It can also calculate likelihood ratio, Wald, or score confidence intervals. All three confidence intervals should yield similar results for large samples, but the Wald form may be preferred for its computational simplicity. Likelihood ratio and score confidence intervals are more computationally expensive to calculate. Huang (2014) recommends likelihood ratio tests for small samples.

The Wald asymptotic variance estimate for the β estimator can be obtained from the inverse of the information matrix given in Equation 16. Recall this information matrix is calculated with the ${\tilde{f}}_{0}$ parameters held fixed. Its inverse is a valid asymptotic variance matrix because the full information matrix is block diagonal; i.e., the β and ${\tilde{f}}_{0}$ estimators are asymptotically independent. The gldrm optimization function returns standard error estimates for each coefficient, which are displayed by the print method along with p-values. Wald confidence intervals are straightforward to calculate from the standard errors and can be obtained from the gldrmCI function. For these single-coefficient hypothesis tests and confidence intervals, we approximate the null distribution by a t-distribution with n − p degrees of freedom, where n is the number of observations and p is the rank of the covariate matrix.

Likelihood ratio tests for nested models are based on the usual test statistic: 2 times the log-likelihood difference between the full and reduced model. Following Huang (2014), we approximate the null distribution by an F-distribution with q and n − p degrees of freedom, where q is the difference in the number of parameters between the full and reduced models. This test can be performed by the gldrmLRT function.

Likelihood ratio and score confidence intervals for a single coefficient can be obtained from the gldrmCI function. These confidence intervals are more computationally expensive than Wald confidence intervals because an iterative method is required to search for the interval boundaries. We use the following bisection method.

Suppose we want to obtain the lower boundary of a confidence level 1 − α interval for a single coefficient β*, and that the gldrm estimate of this coefficient is ${\hat{β}}^{*}$ . For the lower boundary, we need to find $β_{lo}^{*}$ such that $β_{lo}^{*} < {\hat{β}}^{*}$ and the one-sided likelihood ratio test has a p-value equal to α/2 under the null hypothesis $H_{0} : β^{*} = β_{lo}^{*}$ . As explained in the next paragraph, we can compute the p-value for any guess of $β_{lo}^{*}$ . We begin by finding a guess that is too low (p-value less than α/2) and a guess that is too high (p-value greater than α/2). We then obtain the p-value of the midpoint. If the p-value is less than α/2 then the midpoint becomes the new low guess, and if it is greater than α/2 then the midpoint becomes the new high guess. We iterate this process, each time halving the distance between the low guess and high guess, until we have a guess of $β_{lo}^{*}$ that has a p-value arbitrarily close to α/2. The same procedure can be used to solve for $β_{hi}^{*}$ .
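The bisection search can be sketched independently of the model by supplying any function that returns a one-sided p-value for a hypothesized coefficient value. In the Python example below, a normal-theory p-value stands in for the likelihood ratio test purely so that the answer is known in closed form. The estimate, the standard error, the stopping rule based on interval width, and the function names are all assumptions made for illustration.

```python
from statistics import NormalDist

def lower_bound_by_bisection(p_value, beta_hat, alpha=0.05, step=1.0, tol=1e-10):
    """Find beta_lo < beta_hat whose one-sided p-value equals alpha / 2."""
    target = alpha / 2
    high = beta_hat                  # p-value here exceeds alpha / 2
    low = beta_hat - step
    while p_value(low) >= target:    # widen until the low guess is too low
        step *= 2
        low = beta_hat - step
    while high - low > tol:
        mid = 0.5 * (low + high)
        if p_value(mid) < target:
            low = mid                # midpoint is still too low
        else:
            high = mid               # midpoint is too high
    return 0.5 * (low + high)

beta_hat, se = 1.2, 0.3              # hypothetical estimate and standard error

def toy_p_value(beta0):
    """Stand-in one-sided p-value for H0: beta* = beta0 (normal approximation)."""
    return 1.0 - NormalDist().cdf((beta_hat - beta0) / se)

beta_lo = lower_bound_by_bisection(toy_p_value, beta_hat)
closed_form = beta_hat - NormalDist().inv_cdf(0.975) * se
print(beta_lo, closed_form)          # the two values should agree closely
```

In the actual procedure each call to the p-value function requires refitting the model with the coefficient held fixed, which is why these intervals cost more than Wald intervals.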

Let β₀ be any constant. To calculate the likelihood ratio or score test p-value of H₀ : β* = β₀, we need to optimize the log-likelihood over ${\tilde{f}}_{0}$ and all other β values, holding β* fixed at β₀. This can be done by multiplying β₀ with the column of the X matrix corresponding to β* and treating this vector as an offset to the linear predictor. In other words, this vector contains a fixed component of the linear predictor for each observation. The β* column is dropped from X, and the model is optimized with the offset term.

Goodness of fit

To check the model fit, we calculate the randomized probability inverse transform (Smith, 1985). This is done by evaluating the fitted conditional cdf (i.e., the cumulative distribution function, conditional on the covariates) of each observed response value. If the model fit is good, the probability inverse transform values should roughly follow a uniform distribution on the interval (0, 1). Because the gldrm fitted cdf is discrete, we use randomization to make the distribution continuous as follows.

Let $\hat{F} (y | X = x)$ denote the fitted cdf conditional on a covariate vector x evaluated at y, and let $y_{i}^{-} = max ({y_{j} : y_{j} < y_{i}} \cup {- \infty})$ . For each observation, we draw a random value from a uniform distribution on the interval $(\hat{F} (y_{i}^{-} | X = x_{i}), \hat{F} (y_{i} | X = x_{i}))$ .
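A minimal Python sketch of the randomization step follows. It assumes the fitted conditional distribution for one observation is available as a list of probabilities on the observed support; the numbers and the function names are hypothetical, and the seed is fixed only for reproducibility.

```python
import random

def cdf_bounds(support, probs, y):
    """Return (F(y^-), F(y)) for a discrete fitted distribution."""
    lower = sum(p for s, p in zip(support, probs) if s < y)
    upper = sum(p for s, p in zip(support, probs) if s <= y)
    return lower, upper

def randomized_pit(support, probs, y, rng):
    """Draw uniformly between the cdf just below y and the cdf at y."""
    lower, upper = cdf_bounds(support, probs, y)
    return rng.uniform(lower, upper)

rng = random.Random(1)
support = [0.0, 1.0, 2.0, 3.0]
probs = [0.2, 0.3, 0.4, 0.1]     # hypothetical fitted conditional probabilities

lower, upper = cdf_bounds(support, probs, 1.0)
u = randomized_pit(support, probs, 1.0, rng)
print(lower, upper, u)
```

For the smallest support point the lower bound is zero, which matches the convention that $y_{i}^{-}$ is −∞ when no smaller response was observed.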

To illustrate a good gldrm fit, we generate 1,000 independent pairs (x, y) where x ~ Normal(mean = 0, sd = 1) and y|x ~ Normal(mean = x, sd = 1). The mean of y|x is linear in x, and the data generating mechanism is an exponential tilt family, so we expect the gldrm fit to be good. We fit the model and then do a visual check by plotting the histogram and uniform QQ plot of the randomized probability inverse transform values (Figure 1).

Figure 1: Scatterplot and probability inverse transform histogram and QQ plot for a good gldrm model fit.

To illustrate a poor gldrm fit, we generate 1,000 independent pairs (x, y) where x ~ Normal(mean = 0, sd = 1) and y|x ~ Normal(mean = x, sd = x²). The mean of y|x is again linear in x, but this data generating mechanism is not an exponential tilt family. The diagnostic plots (Figure 2) confirm that the gldrm fit is not ideal. The probability inverse transform values are concentrated near the center of the interval (0, 1) rather than being uniformly distributed. The gldrm still provides a good estimate of the regression coefficients and the mean of y|x, but it is not the correct model for the cdf of y|x.

Figure 2: Scatterplot and probability inverse transform histogram and QQ plot for a poor gldrm model fit.

Simulation experiments

Under four different simulation scenarios, we check the finite sample performance of the regression coefficient estimators, $\hat{β}$ , as well as the reference distribution cdf estimator ${\hat{F}}_{0} (\cdot)$ . For $\hat{β}$ , we also check the accuracy of the estimated standard errors.

For each simulation, the data generating mechanism is a gamma glm. The covariates are the same under each simulation setting, and their values are fixed, not random. The covariate vector is x = (x₀, x₁, x₂, x₃). Variable x₀ is an intercept column of ones. Variable x₁ is an indicator with exactly 20% ones in each sample (observations 5, 10, 15, 20,… have x₁ = 1, and the rest have x₁ = 0). Variable x₂ is a continuous variable that follows a uniform sequence between zero and one in each sample (the first observation has x₂ = 0, and the last observation has x₂ = 1). Lastly, x₃ = x₁ · x₂.

The coefficient values were (β₁, β₂, β₃) = (1, 1, −2) for all simulation scenarios. The intercepts were β₀ = 0 for simulations 1 & 2, and β₀ = 1 for simulations 3 & 4. These values were selected so the mean response values were centered close to one and, for the identity link scenarios, guaranteed to be positive.

The log link was used for simulations 1 & 2, and the identity link was used for simulations 3 & 4. For all four simulations, the response was drawn from a gamma distribution with Var(y|x) = ϕE(y|x)², where ϕ varies by simulation to achieve different R² values. R² is defined as the squared correlation between the response and linear combination of predictors.

Simulation 1: log link and high R² (ϕ = 0.5, R² ≈ 0.13)
Simulation 2: log link and low R² (ϕ = 4, R² ≈ 0.019)
Simulation 3: identity link and high R² (ϕ = 0.25, R² ≈ 0.13)
Simulation 4: identity link and low R² (ϕ = 2, R² ≈ 0.018)
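For concreteness, one replicate of Simulation 1 could be generated along the following lines. This Python sketch rests on stated assumptions: x₂ is taken to be evenly spaced on [0, 1], the gamma distribution is parameterized by shape 1/ϕ and scale ϕμ so that its mean is μ and its variance is ϕμ², and the seed and function names are arbitrary. It is not the authors' simulation code.

```python
import math
import random

def design(n):
    """Fixed covariates: intercept, every-fifth indicator, uniform grid, product."""
    rows = []
    for i in range(1, n + 1):
        x1 = 1.0 if i % 5 == 0 else 0.0
        x2 = (i - 1) / (n - 1)
        rows.append((1.0, x1, x2, x1 * x2))
    return rows

def gamma_shape_scale(mu, phi):
    """Shape and scale giving mean mu and variance phi * mu**2."""
    return 1.0 / phi, phi * mu

beta = (0.0, 1.0, 1.0, -2.0)      # Simulation 1 coefficients
phi = 0.5                         # Simulation 1 dispersion
n = 25
rng = random.Random(2018)         # arbitrary seed

X = design(n)
means, y = [], []
for x in X:
    mu = math.exp(sum(b * xj for b, xj in zip(beta, x)))   # log link
    shape, scale = gamma_shape_scale(mu, phi)
    means.append(mu)
    y.append(rng.gammavariate(shape, scale))

print(sum(x[1] for x in X) / n)   # share of observations with x1 = 1
print(min(means), max(means))
```

For the identity-link scenarios the line computing `mu` would use the linear predictor directly instead of its exponential.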

A gldrm was fit to each data set with correct link function and with ${\tilde{f}}_{0}$ constrained to have mean μ₀ = 1. If we had followed our usual practice of choosing μ₀ to be the sample mean of the observed response values, then the reference distribution would have a different mean for each simulation replicate. By choosing μ₀ = 1 for all replicates, the true f₀ always follows a gamma distribution with mean one and variance ϕ, where ϕ varies by simulation scenario. The value μ₀ = 1 was chosen because, by design, it fell within the range of observed response values for all 2,000 replicates of each simulation scenario. The cumulative reference distribution estimate, denoted as ${\hat{F}}_{0} (\cdot)$ , was computed at percentiles 0.1, 0.25, 0.5, 0.75, 0.9.

For each simulation scenario, we display four tables. The first table (Tables 1, 5, 9, and 13) contains the sample mean of each β estimator. For comparison, we also calculated the gamma glm coefficient estimates for each simulated data set and display the sample mean alongside in parentheses.

Table 1:

Sample mean of $\hat{β}$ . For comparison, the gamma glm sample mean is shown in parentheses.

$\hat{β}$ mean (gamma glm mean)
	n = 25	n = 100	n = 400
β₀ = 0	−0.02 (−0.02)	0.00 (0.00)	0.00 (0.00)
β₁ = 1	0.93 (0.92)	0.98 (0.99)	0.99 (0.99)
β₂ = 1	0.99 (1.00)	0.99 (1.00)	1.00 (1.00)
β₃ = −2	−2.00 (−2.00)	−2.00 (−2.01)	−1.99 (−1.99)

		${\hat{F}}_{0}$ (y) mean
F₀(y)	y	n = 25	n = 100	n = 400
0.10	0.27	0.082	0.097	0.099
0.25	0.48	0.222	0.244	0.249
0.50	0.84	0.487	0.498	0.499
0.75	1.35	0.759	0.751	0.750
0.90	1.94	0.911	0.902	0.900

	β₀	β₁	β₂	β₃
Wald	0.918	0.823	0.909	0.859
Likelihood ratio	0.969	0.939	0.958	0.944
Score	0.976	0.966	0.967	0.967

(Intercept)	1.1832	0.0369	32.10	< 2e-16 ***
Sepal.Width	0.0788	0.0128	6.17	6.4e-09 ***
Petal.Length	0.1128	0.0102	11.04	< 2e-16 ***
Petal.Width	−0.0350	0.0248	−1.41	0.162
Speciesversicolor	−0.0561	0.0395	−1.42	0.157
Speciesvirginica	−0.0994	0.0557	−1.79	0.076 .

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species	fitted_mean
1	5.1	3.5	1.4	0.2	setosa	5.00
51	7.0	3.2	4.7	1.4	versicolor	6.43
101	6.3	3.3	6.0	2.5	virginica	6.91

	P(4<Y<=5)	P(5<Y<=6)	P(6<Y<=7)	P(7<Y<=8)
1	0.625	0.375	0.000	0.000
51	0.000	0.136	0.832	0.032
101	0.000	0.006	0.649	0.344

$\hat{β}$ rmse (relative efficiency)
	n = 25	n = 100	n = 400
β₀	0.31 (0.99)	0.16 (1.00)	0.08 (1.00)
β₁	0.81 (0.96)	0.38 (1.01)	0.18 (1.00)
β₂	0.54 (1.00)	0.27 (1.00)	0.14 (0.99)
β₃	1.27 (0.96)	0.63 (1.00)	0.30 (1.00)

$\hat{β}$ rmse (relative efficiency)
	n = 25	n = 100	n = 400
β₀	0.94 (1.14)	0.45 (1.04)	0.23 (1.00)
β₁	2.83 (1.43)	1.12 (1.09)	0.51 (1.02)
β₂	1.62 (1.19)	0.78 (1.05)	0.39 (1.01)
β₃	4.18 (1.59)	1.88 (1.14)	0.89 (1.02)

$\hat{β}$ rmse (relative efficiency)
	n = 25	n = 100	n = 400
β₀	0.25 (0.99)	0.13 (0.99)	0.07 (1.00)
β₁	0.88 (0.96)	0.40 (0.99)	0.20 (1.00)
β₂	0.55 (0.98)	0.28 (0.99)	0.14 (1.00)
β₃	1.26 (0.98)	0.63 (0.98)	0.32 (1.00)

$\hat{β}$ rmse (relative efficiency)
	n = 25	n = 100	n = 400
β₀	0.73 (1.17)	0.39 (1.02)	0.19 (1.00)
β₁	2.35 (1.05)	1.16 (1.09)	0.58 (1.02)
β₂	1.58 (1.18)	0.81 (1.03)	0.40 (1.00)
β₃	3.27 (1.18)	1.83 (1.11)	0.92 (1.02)
