A Comparison of Correlation Structure Selection Penalties for Generalized Estimating Equations

Philip M Westgate; Woodrow W Burchett

doi:10.1080/00031305.2016.1200490

Author manuscript; available in PMC: 2019 Mar 25.

Published in final edited form as: Am Stat. 2018 Jan 11;71(4):344–353. doi: 10.1080/00031305.2016.1200490


PMCID: PMC6433418 NIHMSID: NIHMS1507192 PMID: 30918414

Abstract

Correlated data are commonly analyzed using population-averaged generalized estimating equations (GEEs). The specification of a population-averaged GEE model includes selection of a structure describing the correlation of repeated measures. Accurate specification of this structure can improve efficiency, whereas the finite-sample estimation of nuisance correlation parameters can inflate the variances of regression parameter estimates. Therefore, correlation structure selection criteria should penalize, or account for, correlation parameter estimation. In this manuscript, we compare recently proposed penalties in terms of their impacts on correlation structure selection and regression parameter estimation, and give practical considerations for data analysts.

Keywords: Bias-correction, Efficiency, Empirical covariance matrix, Longitudinal data

1. Introduction

The need to analyze correlated data arises in a variety of research settings. In general, we have N independent clusters, with the ith cluster contributing $n_{i}$, $i = 1, \dots, N$, correlated outcomes denoted by $Y_{i} = {[Y_{i 1}, \dots, Y_{i n_{i}}]}^{T}$. A common example is correlated repeated measures collected in longitudinal studies in which independent subjects contribute outcomes over time. If a marginal model is desired, generalized estimating equations (GEE) (Liang and Zeger, 1986) are commonly used for the data analysis. Although working marginal variances and a correlation structure must be selected, this approach often yields valid statistical inference as long as the mean structure is correct.

Examples of popular working correlation structures include independence, exchangeable, AR-1, the least parsimonious Toeplitz, and unstructured. For simplicity of example, suppose each subject in a balanced longitudinal study contributes n repeated measures. We note, however, that this is not a requirement and the results of this manuscript are generalizable to an unequal number of measurements per subject. With these working correlation matrices, the element in the jth row and kth column is equal to the working form for $\mathrm{Corr}(Y_{i j}, Y_{i k})$. If j = k, $\mathrm{Corr}(Y_{i j}, Y_{i k}) = 1$. Otherwise, exchangeable and AR-1 each use one parameter, α, such that $\mathrm{Corr}(Y_{i j}, Y_{i k}) = α$ and $\mathrm{Corr}(Y_{i j}, Y_{i k}) = α^{| j - k |}$, respectively, whereas $\mathrm{Corr}(Y_{i j}, Y_{i k}) = 0$ with independence. For the least parsimonious Toeplitz version, denote the n − 1 correlation parameters by α_l, l = 1, …, n − 1, such that $\mathrm{Corr}(Y_{i j}, Y_{i k}) = α_{l}$ for l = |j − k|. For unstructured, each distinct correlation element is allowed to differ such that $\mathrm{Corr}(Y_{i j}, Y_{i k}) = α_{j k}$, resulting in the need to estimate n(n − 1)/2 different correlation parameters.
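The five working structures described above can be written out explicitly. The following is a minimal sketch in Python rather than the authors' own code; the function names, the structure labels, and the zero-based indexing are assumptions made for illustration.

```python
def working_correlation(structure, n, alpha=None):
    """Build an n x n working correlation matrix as a list of lists.

    alpha is a scalar for "exchangeable" and "ar1", a list of the n - 1 lag
    correlations for "toeplitz", a dict keyed by zero-based (j, k) with j < k
    for "unstructured", and is ignored for "independence".
    """
    def entry(j, k):
        if j == k:
            return 1.0  # diagonal elements are always 1
        lag = abs(j - k)
        if structure == "independence":
            return 0.0
        if structure == "exchangeable":
            return float(alpha)
        if structure == "ar1":
            return float(alpha) ** lag
        if structure == "toeplitz":
            return float(alpha[lag - 1])
        if structure == "unstructured":
            return float(alpha[(min(j, k), max(j, k))])
        raise ValueError("unknown structure: %r" % (structure,))

    return [[entry(j, k) for k in range(n)] for j in range(n)]


def n_correlation_parameters(structure, n):
    """Number of correlation parameters estimated under each structure."""
    counts = {
        "independence": 0,
        "exchangeable": 1,
        "ar1": 1,
        "toeplitz": n - 1,
        "unstructured": n * (n - 1) // 2,
    }
    return counts[structure]


# AR-1 with alpha = 0.5 and n = 4 repeated measures
R_ar1 = working_correlation("ar1", 4, 0.5)
```

With n = 4, the unstructured matrix requires six parameters and the least parsimonious Toeplitz matrix requires three, whereas exchangeable and AR-1 each require one.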

To improve efficiency for the estimation of regression parameters, accurate modeling of the working correlation structure is desired (Liang and Zeger, 1986; Wang and Carey, 2003). Therefore, multiple correlation selection criteria have been proposed. As a quick reference, we refer the reader to Westgate (2014), who summarized and studied the performances of many criteria. In short, the ‘correlation information criterion’ (CIC) (Hin and Wang, 2009), Gaussian pseudolikelihood (GP) (Carey and Wang, 2011), and the ‘trace of the empirical covariance matrix’ (TECM) (Westgate, 2014) work well.

Unfortunately, the estimation of any nuisance correlation parameters can increase the finite-sample variances of regression parameter estimates (Westgate, 2013, 2015). This increase in variances was shown with the unstructured working correlation matrix in Westgate (2013), and an approximation for the covariance inflation was derived. Furthermore, Westgate (2015) demonstrated variance inflation and the utility of the covariance inflation approximation when GEE incorporates structured working correlation matrices. Due to this variance inflation, the estimation of correlation parameters should be accounted for, or penalized, when selecting a working structure. Besides comparing the performances of many correlation structure selection criteria, Westgate (2014) showed that utilizing Westgate’s (2013) covariance inflation approximation as a penalty to be applied toward the unstructured working correlation matrix can improve the performances of criteria. Furthermore, Westgate (2015) showed that this particular penalty can be applied toward any working correlation structure, thus improving selection accuracy even further. We note that Westgate’s penalty can be applied with any criterion that incorporates an empirical estimate for the covariance matrix of the regression parameter estimates. Without such a penalty, selection criteria can over-select less parsimonious correlation structures such as the unstructured and least parsimonious Toeplitz matrices, potentially resulting in less precise regression parameter estimation (Barnett et al., 2010; Hardin and Hilbe, 2012; Shults and Hilbe, 2014; Westgate, 2014, 2015). This position of Westgate (2014, 2015) on the need to utilize correlation structure selection penalties is the position we take in this manuscript.

Other correlation structure selection penalties have recently been proposed that were not studied by Westgate (2014, 2015), and that have not been directly compared in terms of their advantages, disadvantages, and ultimately their impact on selection accuracy. Specifically, Hardin and Hilbe (2012) and Shults and Hilbe (2014) proposed penalties for use with the well-known ‘quasi-likelihood under the independence model criterion’ (QIC) (Pan, 2001), which we later extend for use with the CIC. Also, penalties based on the AIC (Akaike, 1974) and Bayesian information criterion (BIC) (Schwarz, 1978) have been proposed for use with both the GP criterion (Xiaolu and Zhongyi, 2013) and a method proposed by Chen and Lazar (2012) that utilizes empirical likelihood (EL) (Owen, 1988; Qin and Lawless, 1994).

Our focus in this manuscript is therefore to compare the previously mentioned penalties in terms of their practical implications on correlation structure selection and ultimately regression parameter estimation. In Section 2, we briefly describe GEE and relevant correlation selection criteria. In Section 3, we discuss properties of different penalties for data analysts to consider. Penalties and corresponding selection criteria are then contrasted via a simulation study in Section 4 and an application example in Section 5. Concluding remarks, including practical recommendations, are given in Section 6.

2. GEE and Correlation Selection Criteria

2.1. Generalized Estimating Equations

As established by Liang and Zeger (1986), a consistent estimate of the regression parameters, $β = {[β_{0}, β_{1}, \dots, β_{p - 1}]}^{T}$, is obtained by solving $\sum_{i = 1}^{N} D_{i}^{T} V_{i}^{- 1} (Y_{i} - μ_{i}) = \sum_{i = 1}^{N} D_{i}^{T} A_{i}^{- 1 / 2} R_{i}^{- 1} (α) A_{i}^{- 1 / 2} (Y_{i} - μ_{i}) = 0$. Here, for the ith cluster, $D_{i} = \partial μ_{i} / \partial β^{T}$, $V_{i} = A_{i}^{1 / 2} R_{i} (α) A_{i}^{1 / 2}$ is the working covariance structure for $Y_{i}$, $R_{i} (α)$ is the working correlation matrix for Y_i composed of parameters given by α, A_i is a diagonal matrix of working marginal variances for Y_i, and E(Y_i) = μ_i with link function f such that $f (μ_{i j}) = x_{i j}^{T} β$ for $x_{i j} = {[1, x_{1 i j}, \dots, x_{(p - 1) i j}]}^{T}$. Assuming the nuisance correlation parameters are known, $\mathrm{Cov}(\hat{β}) \approx Σ$, where

$$Σ = {\left(\sum_{i = 1}^{N} D_{i}^{T} V_{i}^{- 1} D_{i}\right)}^{- 1} \left(\sum_{i = 1}^{N} D_{i}^{T} V_{i}^{- 1} \mathrm{Cov}(Y_{i}) V_{i}^{- 1} D_{i}\right) {\left(\sum_{i = 1}^{N} D_{i}^{T} V_{i}^{- 1} D_{i}\right)}^{- 1}. \qquad (1)$$

We denote a consistent estimator for $Σ$ as $\hat{Σ}$. For instance, this could be the popular Liang and Zeger (1986) empirical, or robust, estimator, which replaces Cov(Y_i) with $(Y_{i} - {\hat{μ}}_{i}) {(Y_{i} - {\hat{μ}}_{i})}^{T}$, $i = 1, \dots, N$, in Equation (1). In small-sample settings, this estimator can be biased because the residual matrices incorporated within these empirical covariances tend to be too small (Mancl and DeRouen, 2001). Therefore, multiple corrections for this bias have been proposed. For instance, see Kauermann and Carroll (2001); Mancl and DeRouen (2001); Fay and Graubard (2001); Morel et al. (2003); McCaffrey and Bell (2006); Fan et al. (2013). We also note that the correlation matrices may fail to be consistent if they do not model the true structure, in which case these empirical covariance estimators may also fail to be consistent (Crowder, 1995; Sutradhar and Das, 1999).

2.2. Correlation Selection Criteria

We give focus to the CIC, TECM, and GP for correlation structure selection because these criteria were found by Westgate (2014) to notably outperform the well-known QIC (Pan, 2001) and criteria motivated by the work of Rotnitzky and Jewell (Rotnitzky and Jewell, 1990; Hin et al., 2007; Shults et al., 2009; Carey and Wang, 2011). We also give focus to criteria based upon empirical likelihood versions of the AIC and BIC that were proposed by Chen and Lazar (2012). The CIC is taken from the QIC. The unpenalized QIC for a given working correlation structure, R, is given by $\mathrm{QIC}(R) = - 2 Q (β, ϕ) + 2\,\mathrm{tr}({\hat{Σ}}_{I}^{- 1} \hat{Σ}) = - 2 Q (β, ϕ) + 2\,\mathrm{CIC}(R)$. Here, $Q (β, ϕ)$ is the quasi-likelihood under the assumption that all outcomes are independent, and ${\hat{Σ}}_{I} = {(\sum_{i = 1}^{N} D_{i}^{T} A_{i}^{- 1} D_{i})}^{- 1}$ is the model-based covariance matrix assuming independence. To improve upon the performance of the QIC, Hin and Wang (2009) proposed using the CIC, as only this last part of the QIC contains information about the correlation structure. The unpenalized TECM and GP are given by $\mathrm{TECM}(R) = \mathrm{tr}(\hat{Σ})$ and $\mathrm{GP}(R) = \sum_{i = 1}^{N} [{(Y_{i} - μ_{i})}^{T} V_{i}^{- 1} (Y_{i} - μ_{i}) + \log (| V_{i} |)]$, respectively. The structure that yields the smallest value for the given criterion is selected. We note that this form for the GP is based on the setup of Xiaolu and Zhongyi (2013), and $\hat{Σ}$ is utilized within each of these criteria except for the GP. As the Chen and Lazar (2012) criteria require a penalty, we discuss them in detail in the following section.
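As an illustration of the GP formula, the sketch below evaluates GP(R) for the special case of an exchangeable working correlation with unit working marginal variances, so that V_i = R_i(α). It relies on the standard closed forms for the inverse and determinant of an exchangeable matrix instead of general matrix routines. The function name and the toy residuals are assumptions for illustration, not part of the original study.

```python
import math


def gp_exchangeable(residuals, alpha):
    """Gaussian pseudolikelihood criterion for an exchangeable working structure.

    residuals: list of per-cluster lists holding Y_i - mu_i.
    Assumes unit working marginal variances, so V_i equals R_i(alpha).
    Uses R = (1 - alpha) I + alpha J, for which
      r' R^{-1} r = [sum(r^2) - alpha / (1 + (n - 1) alpha) * (sum r)^2] / (1 - alpha)
      |R|         = (1 - alpha)^(n - 1) * (1 + (n - 1) alpha)
    """
    total = 0.0
    for r in residuals:
        n = len(r)
        # R must be positive definite for the log-determinant to exist
        if alpha >= 1.0 or 1.0 + (n - 1) * alpha <= 0.0:
            raise ValueError("alpha gives a non-positive-definite matrix")
        sum_sq = sum(x * x for x in r)
        sum_r = sum(r)
        quad = (sum_sq - alpha / (1.0 + (n - 1) * alpha) * sum_r ** 2) / (1.0 - alpha)
        log_det = (n - 1) * math.log(1.0 - alpha) + math.log(1.0 + (n - 1) * alpha)
        total += quad + log_det
    return total


# Toy residuals (hypothetical): one cluster whose two residuals move together
toy = [[1.0, 1.0]]
gp_independence = gp_exchangeable(toy, 0.0)   # alpha = 0 reduces to independence
gp_exch = gp_exchangeable(toy, 0.5)
```

Setting α = 0 recovers the independence structure, so the same function can compare the two candidates; the smaller value identifies the structure the unpenalized GP would select.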

3. Correlation Selection Penalties

3.1. Penalties of Hardin & Hilbe and Shults & Hilbe

Hardin and Hilbe (2012) and Shults and Hilbe (2014) proposed penalized versions of the QIC given by

$$- 2 Q (β, ϕ) + 2\,\mathrm{CIC}(R) + 2 \frac{[p + r + m] [p + r + m + 1]}{N - p - r - m - 1} = \mathrm{QIC}(R) + 2 P (p, r, m, N) \qquad (2)$$

in which m = 0 and m = 1 for the Hardin and Hilbe (2012) and Shults and Hilbe (2014) penalties, respectively, and r is the number of estimated correlation parameters. These penalties are motivated by the adjusted Akaike information criterion (AIC) of Hurvich and Tsai (1989), who proposed adding a similar penalty to the AIC as a small-sample correction factor. The CIC has been shown to perform better than the QIC with respect to correlation selection, and therefore we note that these penalties can also be incorporated within the CIC. Specifically, $- 2 Q (β, ϕ)$ in Equation (2) can be ignored, and multiplying by 2 has no impact. Therefore, the penalized CIC is given by $\mathrm{CIC}(R) + P (p, r, m, N)$.
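Because P(p, r, m, N) depends only on counts, it can be computed by hand or with a few lines of code. The sketch below is illustrative Python, and the function name is an assumption. With p = 3 regression parameters, as in the simulation study of Section 4, it reproduces the Hardin and Hilbe and Shults and Hilbe penalty values reported in Table 4.

```python
def small_sample_penalty(p, r, m, N):
    """P(p, r, m, N) = (p + r + m)(p + r + m + 1) / (N - p - r - m - 1).

    p: number of regression parameters
    r: number of estimated correlation parameters
    m: 0 for the Hardin and Hilbe penalty, 1 for the Shults and Hilbe penalty
    N: number of independent clusters
    """
    k = p + r + m
    denominator = N - k - 1
    if denominator <= 0:
        raise ValueError("N must exceed p + r + m + 1")
    return k * (k + 1) / denominator


# N = 20 clusters, p = 3, n = 4 repeated measures
hh_independence = small_sample_penalty(3, 0, 0, 20)   # r = 0
hh_ar1 = small_sample_penalty(3, 1, 0, 20)            # r = 1
sh_unstructured = small_sample_penalty(3, 6, 1, 20)   # r = 4 * 3 / 2 = 6
```

Rounded to two decimals, these are 0.75, 1.33, and 12.22, matching the Setting 1 rows of Table 4. Raising N to 100 shrinks the unstructured Hardin and Hilbe penalty from 9.00 to 1.00.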

As N increases, any inflation of the variances of the regression parameter estimates, and therefore the needed penalty value, reduces in magnitude because correlation parameters are estimated more precisely. The use of these penalties is therefore intuitive because $P (p, r, m, N) \to 0$ as $N \to \infty$, given $p, r \leq c < \infty$ for some constant c. However, this penalty could still use greater theoretical justification (Shults and Hilbe, 2014). Furthermore, Hardin and Hilbe (2012) and Shults and Hilbe (2014) found their penalties, when used with the QIC, tended to favor simpler structures and over-select independence, particularly for smaller N in the Hardin and Hilbe (2012) study. In short, P(p, r, m, N) can over-penalize, although this over-penalization diminishes as N increases.

In practice, the data analyst should consider that P(p, r, m, N) may favor structures with fewer, or no, correlation parameters. This property can be advantageous when simpler structures, such as exchangeable or AR-1, perform better than less parsimonious structures. However, this property can be disadvantageous if the true structure is not independence and ignoring the correlation leads to a loss in small-sample efficiency, such as described in Fitzmaurice (1995), Mancl and Leroux (1996), and Wang and Carey (2003). Therefore, the analyst may not want to consider independence for selection in some instances. We also point out that P(p, r, m, N) has a practical advantage in that it is simple to calculate and therefore does not need to be incorporated within existing software in order to be easily used.

3.2. Westgate Penalty

The need to penalize correlation parameter estimation arises because this estimation potentially increases the estimation variability of GEE, thus inflating $\mathrm{Cov}(\hat{β})$ (Westgate, 2013, 2015). Specifically, Westgate (2013) showed that when accounting for correlation estimation, $\mathrm{Cov}(\hat{β}) \approx (I_{p} + G) Σ {(I_{p} + G)}^{T}$, where I_p is a p × p identity matrix, $G = (G_{0}, G_{1}, \dots, G_{p - 1})$, and $G_{r} = - {(\sum_{i = 1}^{N} D_{i}^{T} V_{i}^{- 1} D_{i})}^{- 1} \sum_{i = 1}^{N} D_{i}^{T} A_{i}^{- 1 / 2} R_{i}^{- 1} \frac{\partial R_{i} (\hat{α} (β))}{\partial β_{r}} R_{i}^{- 1} A_{i}^{- 1 / 2} (Y_{i} - μ_{i} (β))$ for r = 0, …, p − 1.

We note that although Westgate (2013) focused on the use of an unstructured working correlation matrix, Westgate (2015) demonstrated that $\mathrm{Cov}(\hat{β}) \approx (I_{p} + G) Σ {(I_{p} + G)}^{T}$ for any working correlation structure given by R_i, i = 1, …, N, within G. Therefore, Westgate (2014, 2015) proposed penalizing the estimation of correlation parameters by using $(I_{p} + \hat{G}) \hat{Σ} {(I_{p} + \hat{G})}^{T}$ in place of $\hat{Σ}$, the estimator that had been historically utilized to estimate $\mathrm{Cov}(\hat{β})$ within correlation selection criteria. As a result, versions of the CIC and TECM that incorporate this penalty are given by $\mathrm{CIC}_{W}(R) = \mathrm{tr}[{\hat{Σ}}_{I}^{- 1} (I_{p} + \hat{G}) \hat{Σ} {(I_{p} + \hat{G})}^{T}]$ and $\mathrm{TECM}_{W}(R) = \mathrm{tr}[(I_{p} + \hat{G}) \hat{Σ} {(I_{p} + \hat{G})}^{T}]$, respectively. We note that $\hat{G}$ is an estimate for G in which β is replaced with $\hat{β}$.
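Given estimates of Σ, G, and the inverse of the independence model-based covariance, the penalized and unpenalized traces follow directly from the formulas above. The sketch below uses plain Python lists for matrices. It assumes the inputs have already been computed elsewhere; the helper names and the numeric values are hypothetical and serve only to show the arithmetic.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


def inflated_covariance(sigma_hat, G_hat):
    """(I_p + G_hat) sigma_hat (I_p + G_hat)^T."""
    p = len(sigma_hat)
    I_plus_G = [[(1.0 if i == j else 0.0) + G_hat[i][j] for j in range(p)]
                for i in range(p)]
    return matmul(matmul(I_plus_G, sigma_hat), transpose(I_plus_G))


def tecm(sigma_hat, G_hat=None):
    """TECM(R) when G_hat is None, otherwise the penalized TECM_W(R)."""
    cov = sigma_hat if G_hat is None else inflated_covariance(sigma_hat, G_hat)
    return trace(cov)


def cic(sigma_I_inv, sigma_hat, G_hat=None):
    """CIC(R) when G_hat is None, otherwise the penalized CIC_W(R).

    sigma_I_inv is the inverse of the independence model-based covariance.
    """
    cov = sigma_hat if G_hat is None else inflated_covariance(sigma_hat, G_hat)
    return trace(matmul(sigma_I_inv, cov))


# Hypothetical p = 2 inputs, chosen diagonal so the result is easy to check by hand
sigma_hat = [[0.04, 0.0], [0.0, 0.09]]
G_hat = [[0.1, 0.0], [0.0, 0.2]]
sigma_I_inv = [[20.0, 0.0], [0.0, 10.0]]

penalty_tecm = tecm(sigma_hat, G_hat) - tecm(sigma_hat)
penalty_cic = cic(sigma_I_inv, sigma_hat, G_hat) - cic(sigma_I_inv, sigma_hat)
```

The difference between the penalized and unpenalized value is the penalty itself. When the estimate of G is a zero matrix, as with the independence structure, the two versions coincide and the penalty is 0.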

Westgate’s penalty has a natural advantage over other penalties in that it directly accounts for the inflation of $\mathrm{Cov}(\hat{β})$ that arises from the estimation of nuisance correlation parameters. If we were able to show $\mathrm{Cov}(\hat{β}) = (I_{p} + G) Σ {(I_{p} + G)}^{T}$, and then utilize $(I_{p} + G) Σ {(I_{p} + G)}^{T}$ within correlation selection criteria, then this penalty would be ideal. Unfortunately, this is not the case. Specifically, $\mathrm{Cov}(\hat{β}) \approx (I_{p} + G) Σ {(I_{p} + G)}^{T}$, and so we can only approximate the covariance inflation. Furthermore, Westgate’s penalty is an estimate, as $\hat{G}$ must be used in place of G. As N decreases, the magnitude of the covariance inflation can increase, whereas the precision of $\hat{G}$ decreases. Therefore, Westgate’s penalty can be less reliable for smaller N, which is when the penalty is needed most. For instance, a resulting consequence can be that Westgate’s penalty sometimes selects less parsimonious structures simply because the corresponding covariance inflation is underestimated. Although this selection may be advantageous when less parsimonious structures lead to the least variable parameter estimates, it will be disadvantageous when simpler structures work best.

3.3. Penalties for Empirical Likelihood and Gaussian Pseudolikelihood

The method proposed by Chen and Lazar (2012) uses EL forms of the AIC and BIC given by $\mathrm{EAIC}(R) = - 2 \log \mathrm{ELR}(\hat{β}, \hat{α}) + 2 (p + r)$ and $\mathrm{EBIC}(R) = - 2 \log \mathrm{ELR}(\hat{β}, \hat{α}) + (p + r) \log (N)$, respectively. With this approach, a maximal correlation structure is assumed, and this structure and structures nested within it may be considered for selection. In short, any given structure under consideration is compared with the maximal structure via the empirical likelihood ratio (ELR). Although Chen and Lazar (2012) assumed this maximal structure to be the stationary, or least parsimonious Toeplitz, correlation matrix, we feel this is too restrictive relative to other criteria. In this manuscript, we use the unstructured working correlation as the maximal model. We refer the reader to Chen and Lazar (2012) for technical details. Xiaolu and Zhongyi (2013) took the same penalization approach and applied it with the GP criterion, resulting in penalized versions given by $\mathrm{AGP}(R) = \mathrm{GP}(R) + 2 (p + r)$ and $\mathrm{BGP}(R) = \mathrm{GP}(R) + (p + r) \log (N)$. We note that Fu et al. (2015) also utilized this penalty when focusing on quantile regression.

The penalties are either $2(p + r)$ or $(p + r) \log (N)$. Although these seem intuitive with respect to the form of the AIC and BIC, they are not truly representative of the need for correlation penalization. Specifically, the magnitude of the penalty should decrease as N increases, whereas $2(p + r)$ does not depend on N and $(p + r) \log (N)$ gives a harsher penalty toward less parsimonious structures as N increases. However, because EL and GP do not incorporate an estimate for $\mathrm{Cov}(\hat{β})$, the impacts of these penalties are not as clear as for the previously discussed penalties. We note that the simulation study of Xiaolu and Zhongyi (2013) potentially suggests that these penalties may overly favor simple structures. However, we will show in our simulation study that the impact of these penalties relies upon the utility of the given method, EL or GP.
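The behavior of these two penalties in N can be checked numerically. The sketch below is illustrative Python with assumed function names. It computes the extra penalty that unstructured (r = 6) incurs relative to AR-1 (r = 1) when p = 3 and n = 4, the dimensions used in the simulation study.

```python
import math


def aic_type_penalty(p, r):
    """2(p + r): does not depend on the number of clusters N."""
    return 2 * (p + r)


def bic_type_penalty(p, r, N):
    """(p + r) log(N): increases with N."""
    return (p + r) * math.log(N)


p = 3
r_ar1, r_unstructured = 1, 6   # n = 4 repeated measures

# Extra penalty paid by unstructured relative to AR-1
aic_gap = aic_type_penalty(p, r_unstructured) - aic_type_penalty(p, r_ar1)
bic_gap_20 = bic_type_penalty(p, r_unstructured, 20) - bic_type_penalty(p, r_ar1, 20)
bic_gap_100 = bic_type_penalty(p, r_unstructured, 100) - bic_type_penalty(p, r_ar1, 100)
```

The AIC-type gap is fixed at 10 for every N, and the BIC-type gap equals 5 log(N), which rises from about 15.0 at N = 20 to about 23.0 at N = 100. Neither gap shrinks with N, in contrast to P(p, r, m, N).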

4. Simulation Study

4.1. Study Description

We now demonstrate in a simulation study the impacts of the different penalties, and inherently the criteria that incorporate these penalties, on correlation structure selection frequencies and ultimately regression parameter estimation. Because penalties may over-select a certain type of structure, depending on parsimony, we gave focus to three scenarios. In each scenario, we assumed the analyst was interested in selecting one of multiple working structures. In one scenario, interest was in independence, exchangeable, AR-1, Toeplitz, and unstructured. In another scenario, these same structures except for independence were considered, avoiding over-selection of independence. Finally, only independence, exchangeable, and AR-1 were considered, avoiding over-selection of the least parsimonious structures.

Data were generated from a multivariate normal distribution such that $E (Y_{i j}) = β_{0} + β_{1} x_{1 i j} + β_{2} x_{2 i j}$, $j = 1, \dots, n$, where $β = {[0, 0.3, 0.3]}^{T}$, outcomes had unit marginal variance, and both covariates were independently generated across all observations from Uniform(0, 1). We note that this model setup is similar to settings used in Hin et al. (2007), Hin and Wang (2009), and Westgate (2015). The true structures were either AR-1 with correlation value of 0.5 or unstructured with $\mathrm{Corr}(Y_{i 1}, Y_{i 2}) = 0.8$, $\mathrm{Corr}(Y_{i 1}, Y_{i 3}) = 0.3$, and $\mathrm{Corr}(Y_{i 2}, Y_{i 3}) = 0.6$. Corresponding to true AR-1 settings, either 20 or 100 subjects each contributed 4 repeated measurements. Alternatively, to ensure stable empirical standard deviations (ESDs) were produced by the unstructured working correlation matrix, we only present results for settings in which 100 subjects each contributed 3 repeated measurements when unstructured was the true structure. Simulations were conducted in R version 2.13.1 (R Development Core Team, 2011), and outcomes were generated using rmvnorm of the mvtnorm package (Genz et al., 2013; Genz and Bretz, 2009).
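For readers who want to reproduce the true AR-1 setting, the sketch below generates comparable data in Python using only the standard library. It is not the authors' R code: instead of calling a multivariate normal sampler, it uses a stationary first-order autoregressive recursion, which yields jointly normal errors with unit variance and correlation ρ^{|j−k|}. The function names and the seed are assumptions for illustration.

```python
import math
import random


def simulate_ar1_cluster(n, rho, beta, rng):
    """One cluster as a list of (x1, x2, y) rows with AR-1 correlated normal errors."""
    errors = [rng.gauss(0.0, 1.0)]  # stationary start: unit variance
    for _ in range(n - 1):
        innovation = math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        errors.append(rho * errors[-1] + innovation)  # keeps unit marginal variance
    rows = []
    for j in range(n):
        x1, x2 = rng.random(), rng.random()  # independent Uniform(0, 1) covariates
        mean = beta[0] + beta[1] * x1 + beta[2] * x2
        rows.append((x1, x2, mean + errors[j]))
    return rows


def simulate_dataset(N, n, rho, beta, seed=0):
    """N independent clusters, each with n repeated measures."""
    rng = random.Random(seed)
    return [simulate_ar1_cluster(n, rho, beta, rng) for _ in range(N)]


def empirical_error_correlation(data, beta, j, k):
    """Sample correlation between the errors at occasions j and k (zero-based)."""
    def err(cluster, t):
        x1, x2, y = cluster[t]
        return y - (beta[0] + beta[1] * x1 + beta[2] * x2)
    ej = [err(c, j) for c in data]
    ek = [err(c, k) for c in data]
    mj, mk = sum(ej) / len(ej), sum(ek) / len(ek)
    cov = sum((a - mj) * (b - mk) for a, b in zip(ej, ek))
    var_j = sum((a - mj) ** 2 for a in ej)
    var_k = sum((b - mk) ** 2 for b in ek)
    return cov / math.sqrt(var_j * var_k)


beta_true = [0.0, 0.3, 0.3]
small_study = simulate_dataset(N=20, n=4, rho=0.5, beta=beta_true, seed=1)
```

With a large number of clusters, the sample correlation between adjacent occasions should be close to 0.5 and the lag-3 correlation close to 0.125, which offers a quick check that the generator matches the intended structure.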

In Tables 1–3, we present the number of times out of 1,000 replications that each working structure under consideration was selected. These frequencies are given for each scenario and each of the following selection criteria and penalties: the TECM criterion with Westgate penalty (TECM_W ), the CIC with Westgate penalty (CIC_W ), the CIC with Hardin and Hilbe penalty (CIC_HH ), the CIC with Shults and Hilbe penalty (CIC_SH ), the EAIC and EBIC with unstructured as the maximal structure, and the AGP and BGP. ESDs of ${\hat{β}}_{2}$ are also presented. We note that results with respect to ${\hat{β}}_{1}$ were very similar because values for both covariates were generated in the same manner.

Table 1:

Correlation structure selection frequencies and empirical standard deviations (ESD) of ${\hat{β}}_{2}$ for the setting in which the true structure is AR-1 (ρ = 0.5) and 20 subjects each contribute 4 observations.

		Selection Frequencies
Structures of Interest	Criterion	Independence (ESD=0.40)	Exchangeable (ESD=0.35)	AR-1 (ESD=0.33)	Toeplitz (ESD=0.35)	Unstructured (ESD=2.94^**)	Criterion ESD^*
IEATU	TECM_W	49	163	569	177	42	0.34
	CIC_W	55	153	587	181	24	0.35
	CIC_HH	420	47	533	0	0	0.36
	CIC_SH	597	24	379	0	0	0.37
	EAIC	8	56	304	212	420	0.45
	EBIC	15	84	394	209	298	0.43
	AGP	3	110	801	82	4	0.34
	BGP	15	115	841	28	1	0.34

IEA	TECM_W	58	187	755			0.34
	CIC_W	61	184	755			0.34
	CIC_HH	420	47	533			0.36
	CIC_SH	597	24	379			0.37
	EAIC	93	193	714			0.34
	EBIC	100	189	711			0.34
	AGP	4	115	881			0.34
	BGP	15	115	870			0.34

EATU	TECM_W		186	591	180	43	0.34
	CIC_W		180	614	181	25	0.34
	CIC_HH		167	833	0	0	0.33
	CIC_SH		167	833	0	0	0.33
	EAIC		59	308	213	420	0.45
	EBIC		89	399	211	301	0.43
	AGP		110	804	82	4	0.34
	BGP		115	855	29	1	0.34


^* Empirical standard deviations (ESD) of ${\hat{β}}_{2}$ resulting from use of the given criterion and penalty

^** There was instability with the unstructured working correlation for the analysis of some simulated datasets

TECM_W - trace of the empirical covariance matrix with Westgate penalty

CIC_W - correlation information criterion with Westgate penalty

CIC_HH - correlation information criterion with Hardin and Hilbe penalty

CIC_SH - correlation information criterion with Shults and Hilbe penalty

EAIC - empirical likelihood AIC-based criterion with penalty

EBIC - empirical likelihood BIC-based criterion with penalty

AGP - AIC-based gaussian pseudolikelihood criterion with penalty

BGP - BIC-based gaussian pseudolikelihood criterion with penalty

IEATU - All 5 structures are considered for selection

IEA - Only independence, exchangeable, and AR-1 are considered for selection

EATU - All structures except for independence are considered for selection

Table 3:

Correlation structure selection frequencies and empirical standard deviations (ESD) of ${\hat{β}}_{2}$ for the setting in which the true structure is Unstructured and 100 subjects each contribute 3 observations.

		Selection Frequencies
Structures of Interest	Criterion	Independence (ESD=0.21)	Exchangeable (ESD=0.16)	AR-1 (ESD=0.13)	Toeplitz (ESD=0.12)	Unstructured (ESD=0.12)	Criterion ESD^*
IEATU	TECM_W	0	0	60	260	680	0.12
	CIC_W	0	0	155	348	497	0.12
	CIC_HH	0	0	232	671	97	0.12
	CIC_SH	0	0	326	626	48	0.12
	EAIC	0	0	8	359	633	0.12
	EBIC	0	0	69	566	365	0.12
	AGP	0	0	26	346	628	0.12
	BGP	0	0	48	419	533	0.12

IEA	TECM_W	0	0	1000			0.13
	CIC_W	0	0	1000			0.13
	CIC_HH	0	0	1000			0.13
	CIC_SH	0	0	1000			0.13
	EAIC	0	0	1000			0.13
	EBIC	0	0	1000			0.13
	AGP	0	0	1000			0.13
	BGP	0	0	1000			0.13

EATU	TECM_W		0	60	260	680	0.12
	CIC_W		0	155	348	497	0.12
	CIC_HH		0	232	671	97	0.12
	CIC_SH		0	326	626	48	0.12
	EAIC		0	8	359	633	0.12
	EBIC		0	69	566	365	0.12
	AGP		0	26	346	628	0.12
	BGP		0	48	419	533	0.12


^* Empirical standard deviations (ESD) of ${\hat{β}}_{2}$ resulting from use of the given criterion and penalty

TECM_W - trace of the empirical covariance matrix with Westgate penalty

CIC_W - correlation information criterion with Westgate penalty

CIC_HH - correlation information criterion with Hardin and Hilbe penalty

CIC_SH - correlation information criterion with Shults and Hilbe penalty

EAIC - empirical likelihood AIC-based criterion with penalty

EBIC - empirical likelihood BIC-based criterion with penalty

AGP - AIC-based gaussian pseudolikelihood criterion with penalty

BGP - BIC-based gaussian pseudolikelihood criterion with penalty

IEATU - All 5 structures are considered for selection

IEA - Only independence, exchangeable, and AR-1 are considered for selection

EATU - All structures except for independence are considered for selection

Direct comparison of penalty values is only meaningful when these penalties are used with the same criterion. Three different penalties are available with the CIC. Therefore, the Westgate, Hardin and Hilbe, and Shults and Hilbe penalties are explicitly demonstrated in Table 4. Specifically, empirical means of the unpenalized CIC, CIC_W, CIC_HH, and CIC_SH are presented. Furthermore, the empirical mean of Westgate’s penalty and the values of the Hardin and Hilbe and Shults and Hilbe penalties are given; each is equal to the difference between the empirical means of the corresponding penalized and unpenalized CIC. We note that the mean of Westgate’s penalty should be approximately equal to the optimal penalty. We utilized the empirical covariance matrix estimator that incorporates the Kauermann and Carroll (2001) correction for $\hat{Σ}$ within selection criteria, as it has been shown to work well with respect to attaining valid inference (Westgate, 2015). However, other available corrections could alternatively be utilized. Fortunately, the performances of selection criteria are not sensitive to which of these corrections, if any, is utilized (Westgate, 2015).

Table 4:

Empirical means of CIC values and corresponding penalties.

Setting	Working Structure	Mean of Unpenalized CIC	Mean of CIC_W	Mean of CIC_HH	Mean of CIC_SH	Mean of Westgate’s Penalty	Hardin & Hilbe Penalty	Shults & Hilbe Penalty
1	Independence	4.05	4.05	4.80	5.38	0.00	0.75	1.33
	Exchangeable	3.62	3.68	4.96	5.77	0.06	1.33	2.14
	AR-1	3.32	3.41	4.65	5.46	0.09	1.33	2.14
	Toeplitz	3.24	3.94	6.47	7.90	0.70	3.23	4.67
	Unstructured^*	-	-	-	-	-	9.00	12.22

2	Independence	4.05	4.05	4.17	4.26	0.00	0.13	0.21
	Exchangeable	3.62	3.63	3.83	3.94	0.01	0.21	0.32
	AR-1	3.33	3.35	3.54	3.65	0.02	0.21	0.32
	Toeplitz	3.31	3.37	3.76	3.92	0.06	0.45	0.61
	Unstructured	3.29	3.45	4.29	4.53	0.16	1.00	1.24

3	Independence	4.13	4.13	4.25	4.34	0.00	0.13	0.21
	Exchangeable	3.32	3.33	3.53	3.64	0.01	0.21	0.32
	AR-1	2.80	2.81	3.01	3.12	0.01	0.21	0.32
	Toeplitz	2.62	2.75	2.94	3.07	0.13	0.32	0.45
	Unstructured^*	-	-	-	-	-	0.45	0.61


^* Due to instability in some simulated CIC values, the resulting distorted empirical means are not given

Setting 1: True correlation structure is AR-1, and N = 20 subjects contribute 4 observations

Setting 2: True correlation structure is AR-1, and N = 100 subjects contribute 4 observations

Setting 3: True correlation structure is Unstructured, and N = 100 subjects contribute 3 observations

CIC_W - correlation information criterion with Westgate penalty

CIC_HH - correlation information criterion with Hardin and Hilbe penalty

CIC_SH - correlation information criterion with Shults and Hilbe penalty

4.2. Description of Results

As expected, using P (p, r, m, N ) as a penalty with the CIC led to a high selection frequency of the independence structure when N = 20. Furthermore, because the Shults and Hilbe penalty is greater than the Hardin and Hilbe penalty, it selected independence even more often. ESDs show that independence is a poor choice in this setting. Therefore, it is advantageous to not even consider independence for selection here, in which case the CIC with these penalties actually works best. Alternatively, when N = 100, use of the CIC with these penalties selected independence one time at most. Furthermore, use of these penalties never resulted in unstructured being selected when AR-1 was the true structure, and Toeplitz was selected only twice by the Hardin and Hilbe penalty. Alternatively, when unstructured was the true structure, Toeplitz and unstructured resulted in the smallest ESDs, and use of both penalties resulted in either of these two structures being selected most often. However, they both favored the simpler Toeplitz structure.

As found in Westgate (2014, 2015), his penalized versions of the TECM and CIC worked similarly, although selection frequencies were notably different when unstructured was the true structure. Use of these criteria and penalties, relative to other criteria and penalties with the exception of the EL methods, tended to select Toeplitz and unstructured more often when AR-1 was the true structure. This may be viewed as over-selection of less parsimonious structures. When N = 20, these structures did not perform as well as AR-1, i.e., ESDs for ${\hat{β}}_{2}$ were larger, due to notable variance inflation from the estimation of additional correlation parameters. Therefore, not allowing the TECM and CIC with Westgate’s penalty to select these two structures resulted in slightly smaller ESDs. However, when N = 100, the variance inflation is much smaller, and considering these structures was not detrimental.

Table 4 shows that, when incorporated with the CIC, the Hardin and Hilbe and Shults and Hilbe penalties are larger than the average Westgate penalty. We point out that the former penalties actually have a penalty value for independence, whereas the Westgate penalty is 0 because no covariance inflation occurs with this structure. Therefore, penalty comparisons must take this into account. For instance, Westgate’s average penalty toward the exchangeable structure is 0.06 when N = 20. Alternatively, the Hardin and Hilbe (Shults and Hilbe) penalties toward independence and exchangeable are 0.75 (1.33) and 1.33 (2.14), respectively, for a difference of 0.58 (0.81), which is notably larger than the 0.06 difference with the Westgate penalty. Therefore, the Hardin and Hilbe and Shults and Hilbe penalties result in the over-selection of independence, and simpler structures in general.

In short, relative to the Hardin and Hilbe and Shults and Hilbe penalties, Westgate’s penalty resulted in smaller ESDs when N = 20 and independence was allowed to be selected. However, when independence was not considered, but Toeplitz and unstructured were options, these former penalties resulted in slightly smaller ESDs. Finally, when N = 100, no notable differences were observed in ESDs resulting from the use of these three penalties.

The EAIC and EBIC over-selected Toeplitz and unstructured when N = 20, and corresponding ESDs were unacceptably higher than ESDs produced from the use of any other criterion and penalty. Alternatively, not allowing Toeplitz and unstructured to be selected by the EAIC and EBIC reduced ESDs to acceptable levels. Furthermore, when N = 100, selecting less parsimonious structures was not detrimental, and no notable differences in ESDs were observed across criteria and penalties.

The AGP and BGP worked very well and similarly to the TECM and CIC with Westgate’s penalty in terms of resulting ESDs. When AR-1 was the true structure, the AGP and BGP did not over-select independence, Toeplitz, or unstructured, and they had high frequencies of correctly selecting AR-1. Alternatively, when unstructured was the true structure, the AGP and BGP selected Toeplitz and unstructured most often, as did the other criteria and penalties.

5. Application

We now demonstrate the penalties in the context of the Prevention of Alzheimer’s Disease by Vitamin E and Selenium (PREADViSE) clinical trial (Caban-Holt et al., 2012). We have yearly global cognitive status data, measured using the Consortium to Establish a Registry for Alzheimer’s Disease (CERAD) T-score (Chandler et al., 2005; Mathews et al., 2013; Caban-Holt et al., 2012), from fifty men. Larger scores indicate higher cognitive status. Each subject came in for an annual assessment for either three or four sequential years. The number of assessments simply depended on when the subject was recruited into the study. Available predictors of the continuous T-score include baseline age and estimated full-scale IQ (Boekamp et al., 1995).

Our working marginal model, based on the application of Westgate (2014), centers age and IQ at values near their corresponding sample medians and means. Specifically, $E(Y_{ij}) = β_{0} + β_{1} \mathrm{time}_{ij} + β_{2} (\mathrm{IQ}_{i} - 110) + β_{3} (\mathrm{Age}_{i} - 70) + β_{4} {(\mathrm{Age}_{i} - 70)}^{2}; \; j = 1, \dots, n_{i}$,

where $n_{i} = 3$ or 4, $\mathrm{time}_{ij} = j - 1$ is years since baseline, and $Y_{ij}$ is the ith subject’s T-score in their jth year.
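To make the mean structure concrete, the following is a minimal sketch of how the model maps a subject's covariates to an expected T-score. The function name `expected_t_score` and the two example subjects are hypothetical, and the coefficients are the rounded AR-1 estimates printed in Table 5, so the results are illustrative only.

```python
def expected_t_score(beta, time, iq, age):
    """Marginal mean E(Y_ij) under the working model above.

    beta is (b0, b1, b2, b3, b4); time is years since baseline (j - 1);
    iq and age are the subject's estimated IQ and baseline age.
    """
    b0, b1, b2, b3, b4 = beta
    iq_c = iq - 110    # IQ centered at 110
    age_c = age - 70   # baseline age centered at 70
    return b0 + b1 * time + b2 * iq_c + b3 * age_c + b4 * age_c ** 2


# Rounded AR-1 estimates as printed in Table 5 (illustrative only).
beta_ar1 = (48.7, 1.76, 0.20, -0.54, 0.05)

# Hypothetical subject at the centering values: the mean reduces to b0.
baseline = expected_t_score(beta_ar1, time=0, iq=110, age=70)

# Hypothetical subject with baseline age 75 and IQ 120, three years in.
later = expected_t_score(beta_ar1, time=3, iq=120, age=75)

print(baseline, later)
```

Because age and IQ are centered, the intercept is directly interpretable as the expected baseline T-score of a 70-year-old with an estimated IQ of 110.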

Table 5 presents results from using independence, exchangeable, AR-1, Toeplitz, and unstructured working correlation matrices with GEE to fit the above model. Specifically, for each working structure, parameter estimates, their corresponding estimated standard errors, and penalized correlation selection criterion values are presented. We also give unpenalized values for the TECM and CIC to illustrate the degree of penalization. We note that because some subjects contribute only three observations, we are unable to use the EL criteria.

Table 5:

Parameter estimates, standard error estimates, and correlation selection criteria values and penalties from analyses of the PREADViSE trial data.

Parameter	Ind $\hat{β}$ (SE)	Exch $\hat{β}$ (SE)	AR-1 $\hat{β}$ (SE)	Toeplitz $\hat{β}$ (SE)	Un $\hat{β}$ (SE)
β₀	48.5 (1.42)	48.8 (1.42)	48.7 (1.38)	48.6 (1.35)	48.0 (1.51)
β₁	2.10 (0.38)	1.79 (0.37)	1.76 (0.40)	1.88 (0.39)	2.11 (0.50)
β₂	0.24 (0.16)	0.24 (0.16)	0.20 (0.16)	0.21 (0.16)	0.17 (0.16)
β₃	−0.57 (0.23)	−0.55 (0.23)	−0.54 (0.23)	−0.53 (0.23)	−0.44 (0.27)
β₄	0.05 (0.03)	0.05 (0.03)	0.05 (0.03)	0.05 (0.03)	0.05 (0.03)
Criterion
TECM No Penalty	2.25	2.24	2.12	2.06	2.06
TECM_W	2.25	2.24	2.13	2.07	2.63
CIC No Penalty	11.61	11.64	11.33	10.96	9.76
CIC_W	11.61	11.66	11.47	11.49	12.94
CIC_HH	12.29	12.62	12.31	12.72	13.23
CIC_SH	12.59	12.97	12.66	13.21	13.98
AGP	1,001	908	919	911	947
BGP	1,011	920	930	926	968


Ind - independence; Exch - exchangeable; Un - unstructured

TECM_W - trace of the empirical covariance matrix with Westgate penalty

CIC_W - correlation information criterion with Westgate penalty

CIC_HH - correlation information criterion with Hardin and Hilbe penalty

CIC_SH - correlation information criterion with Shults and Hilbe penalty

AGP - AIC-based Gaussian pseudolikelihood criterion with penalty

BGP - BIC-based Gaussian pseudolikelihood criterion with penalty

The TECM and CIC, when not penalizing for correlation estimation, select unstructured. We note that Toeplitz could also have been chosen by the unpenalized TECM. For these criteria, Westgate’s penalty is small for exchangeable and AR-1 and increases in magnitude for Toeplitz and especially for unstructured. This is because the estimated covariance inflation was negligible when estimating one correlation parameter and notable when estimating multiple correlation parameters. We note that Westgate’s penalty is more apparent with the CIC. When utilizing this penalty, the TECM and CIC select Toeplitz and AR-1, respectively. However, this does not result in a practical problem, as use of these structures resulted in similar values for $\hat{β}$ and corresponding SE estimates. The Hardin and Hilbe and Shults and Hilbe penalties with the CIC are larger than Westgate’s penalties, as expected. Furthermore, they select independence. Alternatively, if we did not consider independence for selection, then these penalties select AR-1. Finally, the AGP and BGP both select exchangeable.
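The selections described above can be read directly off Table 5. The sketch below is illustrative and not part of the original analysis: it assumes, consistent with the selections reported in the text, that each criterion prefers the structure with the smallest value. The helper names `select` and `penalty` are ours. Because the tabulated values are rounded, structures that tie at the displayed precision are returned together.

```python
structures = ["Ind", "Exch", "AR-1", "Toeplitz", "Un"]

# Criterion values copied from Table 5, in the column order above.
table5 = {
    "TECM No Penalty": [2.25, 2.24, 2.12, 2.06, 2.06],
    "TECM_W": [2.25, 2.24, 2.13, 2.07, 2.63],
    "CIC No Penalty": [11.61, 11.64, 11.33, 10.96, 9.76],
    "CIC_W": [11.61, 11.66, 11.47, 11.49, 12.94],
    "CIC_HH": [12.29, 12.62, 12.31, 12.72, 13.23],
    "CIC_SH": [12.59, 12.97, 12.66, 13.21, 13.98],
    "AGP": [1001, 908, 919, 911, 947],
    "BGP": [1011, 920, 930, 926, 968],
}


def select(criterion, exclude=()):
    """Return every candidate structure attaining the minimum value."""
    pairs = [(s, v) for s, v in zip(structures, table5[criterion])
             if s not in exclude]
    best = min(v for _, v in pairs)
    return [s for s, v in pairs if v == best]


def penalty(penalized, unpenalized):
    """Implied penalty per structure: penalized minus unpenalized value."""
    return [round(p - u, 2)
            for p, u in zip(table5[penalized], table5[unpenalized])]


for name in table5:
    print(name, select(name))
print("CIC_HH without Ind:", select("CIC_HH", exclude=("Ind",)))
print("Westgate CIC penalty:", penalty("CIC_W", "CIC No Penalty"))
```

Subtracting the unpenalized from the penalized values in this way also shows how the implied Westgate penalty grows with the number of estimated correlation parameters.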

We point out that our analyses of this dataset extend the analyses of Westgate (2014). First, Westgate (2014) did not apply the Kauermann and Carroll (2001) correction, both because N = 50 is not small and in order to stay consistent with previous articles on correlation selection that did not use this correction. However, we do apply this correction because we found that it had a small but notable impact on some standard error estimates. Second, the current manuscript allowed Toeplitz to be considered for selection, whereas Westgate (2014) did not. Although Westgate (2014) concluded that the AR-1 structure may be preferred, this manuscript demonstrates that Toeplitz is also a viable choice when utilizing the TECM with Westgate’s penalty. Finally, we point out that Westgate’s TECM and CIC penalties for the exchangeable and AR-1 structures were negligible and therefore not actually needed in this setting.

6. Concluding Remarks

When estimating a population-averaged GEE model, a working correlation structure must be selected. To do so, a criterion that penalizes the estimation of correlation parameters can be utilized. The goal of this manuscript was to describe and contrast different penalties, and inherently the criteria that can incorporate them, so that data analysts are aware of the aspects that determine which penalty and criterion combination will ultimately yield the least variable regression parameter estimates via accurate correlation structure selection. In summary, the penalized Gaussian pseudolikelihood criteria work well, as do the TECM and CIC with Westgate’s penalty. Furthermore, if the analyst is willing to not consider independence as a selection option due to its potential inefficiency, the Hardin and Hilbe (2012) and Shults and Hilbe (2014) penalties with the CIC can also work well, and these penalties have the practical advantage that they can be easily calculated and used regardless of the software that is employed. Finally, the empirical likelihood criteria of Chen and Lazar (2012), when utilizing unstructured as the maximal structure, were found to not work well when the number of subjects was small.

As seen in our simulation study, the smaller the number of independent clusters, or subjects in a longitudinal study, the more notable the difference between distinct penalty and criteria combinations in terms of the resulting variances of estimated regression parameters. Alternatively, if the sample size is not small, differences may be negligible. However, what constitutes a small or large sample size may not be clear, and the analyst should consider the implications of different criteria and penalties.

In our experience with small-sample settings, we have had difficulty finding true Toeplitz or unstructured correlation matrices for generating outcomes in simulation studies such that the incorporation of corresponding working structures within GEE results in smaller variances of regression parameter estimates relative to the incorporation of simpler structures such as exchangeable or AR-1. For instance, the AR-1 working structure resulted in the smallest empirical standard deviations of regression parameter estimates in simulation results (not shown) from the setting in which the true structure was unstructured, but with N = 20. This is because notable covariance inflation arises from the estimation of the nuisance correlation parameters, and therefore it is difficult to find a realistic, non-parsimonious structure that has an efficiency gain that more than offsets the corresponding variance inflation in small-sample settings. Therefore, if N is small, the analyst may want to over-penalize, or even not consider, structures such as Toeplitz or unstructured. To over-penalize, we recommend the use of the Hardin and Hilbe or Shults and Hilbe penalty with the CIC, or the use of one of these penalties in addition to Westgate’s penalty. We briefly note that the relative performances, in terms of resulting empirical standard deviations of regression parameter estimates, of the different criteria and penalty combinations from the simulation setting in which the true structure was unstructured and N = 20 were similar to those that were observed when the true structure was AR-1 and N = 20.

We note that the inflation of the variances of regression parameter estimates is often negligible for structures such as AR-1 and exchangeable that require only one correlation parameter to be estimated, as can be seen in the simulation study of Westgate (2015). As a result, Westgate’s penalty will often be negligible with such structures, as is apparent in Tables 4 and 5. As a practical result, there often is no need to penalize one-parameter correlation structures except possibly when N is very small (Westgate, 2015). However, the definition of very small is vague here, and ideally the data analyst will still apply an appropriate penalty method which can determine an adequate penalty value. Furthermore, the use of one-parameter working correlation structures can result in the need to estimate multiple correlation parameters, for instance, when the analyst allows correlation parameters to vary across groups such as trial arms. In such a case, use of a penalty may very well be beneficial.

Our use of correlation selection criteria and penalties focuses on obtaining the least variable regression parameter estimates. However, other aspects should be considered, and we refer the reader to Ziegler and Vens (2010) for an informative discussion. Furthermore, we suggested that the analyst may want to consider disregarding independence as a selection option, as it is often no more efficient, and can be less efficient, than other working structures. We note that independence is a special case of all other structures, and if the estimated correlation values are approximately zero, then independence is still available to the analyst. In situations where independence cannot lose a notable amount of efficiency, an alternative argument may be to simply use independence and ignore any other possible structures (Dahmen and Ziegler, 2004). We refer the reader to articles such as Fitzmaurice (1995), Mancl and Leroux (1996), Wang and Carey (2003), and Dahmen and Ziegler (2004) for details. Additionally, there are situations in which independence should be used, or at least preferred, and we refer the reader to the manuscript by Dahmen and Ziegler (2004) that summarizes such findings.

Table 2:

Correlation structure selection frequencies and empirical standard deviations (ESD) of ${\hat{β}}_{2}$ for the setting in which the true structure is AR-1 (ρ = 0.5) and 100 subjects each contribute 4 observations.

		Selection Frequencies
Structures of Interest	Criterion	Independence (ESD=0.18)	Exchangeable (ESD=0.15)	AR-1 (ESD=0.14)	Toeplitz (ESD=0.14)	Unstructured (ESD=0.15)	Criterion ESD^*
IEATU	TECM_W	0	19	643	227	111	0.14
	CIC_W	0	16	664	226	94	0.14
	CIC_HH	0	21	977	2	0	0.14
	CIC_SH	1	22	977	0	0	0.14
	EAIC	0	2	728	162	108	0.14
	EBIC	0	6	959	31	4	0.14
	AGP	0	1	866	115	18	0.14
	BGP	0	4	989	7	0	0.14

IEA	TECM_W	0	32	968			0.14
	CIC_W	0	25	975			0.14
	CIC_HH	0	22	978			0.14
	CIC_SH	1	22	977			0.14
	EAIC	0	10	990			0.14
	EBIC	0	10	990			0.14
	AGP	0	4	996			0.14
	BGP	0	4	996			0.14

EATU	TECM_W		19	643	227	111	0.14
	CIC_W		16	664	226	94	0.14
	CIC_HH		21	977	2	0	0.14
	CIC_SH		22	978	0	0	0.14
	EAIC		2	728	162	108	0.14
	EBIC		6	959	31	4	0.14
	AGP		1	866	115	18	0.14
	BGP		4	989	7	0	0.14