Author manuscript; available in PMC: 2023 Apr 4.
Published in final edited form as: J Surv Stat Methodol. 2020 Feb 17;9(1):141–158. doi: 10.1093/jssam/smz048

Tools for Selecting Working Correlation Structures When Using Weighted GEE to Model Longitudinal Survey Data

Philip M Westgate, Brady T West
PMCID: PMC10072866  NIHMSID: NIHMS1837003  PMID: 37020583

SUMMARY

Weighted generalized estimating equations (GEE) are popular for the marginal analysis of longitudinal survey data. This popularity is due to the ability of these estimating equations to provide consistent regression parameter estimates and corresponding standard error estimates as long as the population mean and survey weights are correctly specified. Although the data analyst must incorporate a working correlation structure within the weighted GEE, this structure need not be correctly specified. However, accurate modeling of this structure has the potential to improve regression parameter estimation, i.e. reduce standard errors, and therefore the selection of a working correlation structure for use within GEE has received considerable attention in standard longitudinal data analysis settings. In this manuscript, we describe how correlation selection criteria can be extended for use with weighted GEE in the context of analyzing longitudinal survey data. Importantly, we provide and demonstrate an R function that we have created for such analyses. Furthermore, we discuss correlation selection in the context of using existing software which does not have this explicit capability. The methods are demonstrated via the use of data from a real survey in which we are interested in the mean number of falls that elderly individuals in a specific subpopulation experience over time.

Keywords: Generalized Estimating Equations, Complex Sample Survey Data, Weighted Estimation, Longitudinal Survey Data, Correlation Structure Selection

1. Introduction

In this paper, we focus on panel surveys in which subjects are initially sampled at baseline according to a probability sampling plan, and then these same individuals are surveyed at one or more future time points. Such longitudinal survey data are important for researchers interested in making inferences about a larger population with respect to changes over time. Furthermore, obtaining repeated measurements from the same subjects over time, as opposed to utilizing a repeated cross-sectional survey design, is advantageous in that accounting for the correlation among repeated measurements from the same subject has the potential to decrease standard errors of regression parameter estimates (Liang and Zeger, 1986; Fitzmaurice, 1995; Mancl and Leroux, 1996; Wang and Carey, 2003).

When preparing data sets for these types of longitudinal surveys, data producers often provide analysts with survey weights enabling the computation of representative population estimates based on the longitudinal data. However, the type of survey weight needed for a given analysis of longitudinal survey data clearly depends on the objectives of the analysis. The calculation of estimates of descriptive parameters or regression model parameters at a given point in time generally only requires appropriately-adjusted survey weights for each wave of the longitudinal survey. These adjusted cross-sectional weights account for any nonresponse from originally sampled units at that wave, and also possibly for calibration of the weights to known features of the population at that wave.

Analysts interested in estimating trends in measures of interest over time are faced with a more difficult challenge: exactly what survey weight should be used when considering a given sampled unit’s contribution to the longitudinal data? A single unit-level weight that does not change across the waves, reflecting the probability of each unit’s pattern of response across the waves of the survey, per Fitzmaurice (1995)? Or a unit-time-level weight that varies across the waves, reflecting differential nonresponse at each wave, per Carrillo and Karr (2013)? Heeringa et al. (2017) and Raghunathan (2016) discuss the difficulty of developing weights for longitudinal survey data given these analytic objectives. Roberts et al. (2009) discuss weights that adjust for total nonresponse, where a given sampled unit does not respond in any wave; Carrillo and Karr (2013) present a more refined approach that enables combining multiple cohorts in a longitudinal study and accommodates time-varying weights for each sampled unit, reflecting their weights computed at each wave to account for their probabilities of selection and nonresponse.

There are multiple approaches to analyzing longitudinal survey data, as summarized in Chapter 11 of Heeringa et al. (2017). For example, survey analysts might estimate changes in descriptive parameters from one time point to another using standard linear combinations of weighted estimates that recognize the covariance in the estimates introduced by a complex sample design (possibly arising from cluster sampling). Or, analysts might fit models to examine trajectories in selected outcomes over time, and these models could be marginal in nature (e.g., GEE) or conditional (e.g., multilevel modeling). Regardless of the approach used, analysts should at least consider the importance of the aforementioned longitudinal survey weights for estimation of the parameter(s) of interest (Sikkel et al., 2009; Carrillo et al., 2010). In this manuscript, we focus on weighted generalized estimating equations (GEE) (Liang and Zeger, 1986; Robins et al., 1995; Preisser et al., 2002). These estimating equations incorporate survey weights reflecting differential probabilities of selection into the sample, and also potentially differential probabilities of responding over time. The use of this so-called design-based approach has an advantage over other methods in that it does not require a correct specification of the theoretical distribution from which outcomes arise. Rather, in order to attain valid inference, weighted GEE only requires a correct model specification for the population mean, as well as appropriately adjusted survey weights. As long as these are correctly specified, regression parameters are consistently estimated (Roberts et al., 2009), and standard errors can be consistently estimated using the empirical, robust, sandwich estimator. In this paper, we develop a general approach for selecting working correlation structures in survey-weighted GEE analysis that can accommodate both types of survey weights (unit-level and unit-time-level) discussed here.

Although weighted GEE does not require correct specification of the correlation structure for repeated measures in order to attain valid inference, accurate modeling of this structure can improve regression parameter estimation, i.e. decrease standard errors, particularly with respect to time-dependent covariates (Liang and Zeger, 1986; Fitzmaurice, 1995; Mancl and Leroux, 1996; Wang and Carey, 2003). In this manuscript, we illustrate methods for correlation structure selection using a small subset of data from The Health and Retirement Study (HRS) (Heeringa et al., 2017). In our specific example, we are interested in fitting a marginal Poisson regression model using age at baseline (2006), years since 2006, and log income as predictors of the number of falls in the past two years for a small subpopulation of hypothetical research interest: Hispanic females age 75+ with diabetes, arthritis, and an education of no more than high school. In this example, we attempt to accurately model the working correlation structure for the small sample that we have from this subpopulation in the HRS. Our objective in doing this is to reduce the standard errors, if possible, for regression parameter estimates in settings in which a small-sample subset is of interest. For situations of large numbers of subjects, the unstructured correlation matrix will almost invariably be preferred as we will later discuss and demonstrate with a less restricted subset.

The unanswered question that this manuscript addresses is as follows: How can we best select a working correlation structure for the weighted GEE approach when analyzing weighted longitudinal survey data? Numerous methods have been proposed and shown to work well for the selection of a working correlation structure in non-survey data settings. For instance, Westgate (2014) summarized and compared existing criteria, demonstrating that criteria such as the ‘trace of the empirical covariance matrix’ (TECM) (Westgate, 2014), ‘correlation information criterion’ (CIC) (Hin and Wang, 2009), and Gaussian pseudolikelihood (GP) (Carey and Wang, 2011) work well in terms of accurately selecting a working correlation structure, thus potentially improving regression parameter estimation. Furthermore, Westgate and Burchett (2017) summarize such criteria and how penalties can be applied to them to account for the influence correlation parameter estimation has on the standard errors of regression parameter estimates. For instance, if these criteria do not apply a penalty to the working unstructured correlation matrix when the number of subjects is not large, then these criteria can inappropriately over-select this working structure.

In this manuscript, we describe and illustrate, using a provided R (R Development Core Team, 2011) function as well as existing software, the use of these information criteria when working with weighted longitudinal survey data. Due to our specific focus on the case where each subject has a constant weight that accounts for sample selection probability and different patterns of attrition (Roberts et al., 2009), the weight is simply a nuisance factor with respect to criteria for the selection of appropriate within-subject correlation structures, and results from the in-depth studies on these criteria therefore extend to our present focus. In the following sections, we introduce both GEE and weighted GEE. We also introduce correlation selection criteria that have been shown to work well in non-survey settings, and then we extend these criteria for use in longitudinal survey settings corresponding to a weighted GEE analysis. We also explain what criteria can be easily obtained from and validly used with existing software. Methods and software are then illustrated using the HRS data.

2. Methods

2.1. Generalized estimating equations and corresponding correlation selection criteria

Suppose we want to conduct a longitudinal survey-based study in which N subjects are sampled at baseline (t = 1). Furthermore, we plan to survey these same subjects at T − 1 future time points, with each subject being surveyed at the same point in time. If everyone completes the survey at all T time points, we have a balanced design. Even if some subjects do not provide data at some time points, the temporal spacing of data collection is still the same across subjects. We initially ignore the need to incorporate probability sampling weights, and we also initially assume that any missing data are missing completely at random (MCAR) (Little and Rubin, 2002), such that no weighting is required within the GEE in order to attain valid inference.

Let $Y_i = [Y_{i1}, \ldots, Y_{in_i}]^T$ denote the $n_i$, $1 \le n_i \le T$, repeated survey outcomes, with marginal mean $\mu_i$, from subject $i$, $i = 1, \ldots, N$. A working correlation structure for these outcomes must be utilized within the GEE. Popular examples for such structures include independence, exchangeable, AR-1, Toeplitz, and unstructured, as these are typically available options in software that is able to conduct GEE-based analyses. In short, an independence structure assumes there is no correlation among outcomes from the same subject. Alternatively, an exchangeable structure assumes an equivalent correlation among all outcomes from the same subject, which is often an unrealistic assumption with longitudinal data (Fitzmaurice et al., 2011). An AR-1 structure assumes $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \alpha^{|j-k|}$ such that the magnitude of correlation decreases the further apart in time outcomes are from each other. Although more realistic, AR-1, and even other structures we have not mentioned but may be applicable, may not accurately model the true correlation structure. Toeplitz is more flexible in that $T - 1$ correlation parameters are estimated, denoted by $\alpha_l$, $l = 1, \ldots, T - 1$, such that $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \alpha_l$ for $l = |j - k|$. The unstructured correlation matrix is even more flexible, as it distinctly models each possible correlation parameter, $\mathrm{Corr}(Y_{ij}, Y_{ik}) = \alpha_{jk} = \alpha_{kj}$, $j \ne k$.
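The structures described above can be written out explicitly. The following is a minimal Python sketch for illustration only; the function names are hypothetical and are not part of the R function or SAS procedure discussed in this paper. It builds each working correlation matrix for $T$ equally spaced time points and counts the correlation parameters each structure requires.

```python
def exchangeable(T, alpha):
    """Same correlation alpha between every pair of time points."""
    return [[1.0 if j == k else alpha for k in range(T)] for j in range(T)]


def ar1(T, alpha):
    """Correlation alpha**|j-k|, decaying with the lag between time points."""
    return [[alpha ** abs(j - k) for k in range(T)] for j in range(T)]


def toeplitz(alphas):
    """One free correlation per lag; alphas[l-1] is the lag-l correlation."""
    T = len(alphas) + 1
    return [[1.0 if j == k else alphas[abs(j - k) - 1] for k in range(T)]
            for j in range(T)]


def n_corr_params(structure, T):
    """Number of correlation parameters estimated under each structure."""
    counts = {
        "independence": 0,
        "exchangeable": 1,
        "ar1": 1,
        "toeplitz": T - 1,
        "unstructured": T * (T - 1) // 2,
    }
    return counts[structure]


print(ar1(4, 0.5)[0])                    # [1.0, 0.5, 0.25, 0.125]
print(n_corr_params("unstructured", 4))  # 6
```

With four time points, the unstructured matrix requires six correlation parameters, whereas exchangeable and AR-1 each require one. This difference is what motivates the penalties discussed later.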

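Using the vector of covariate values given by $x_{ij} = [1, x_{1ij}, \ldots, x_{pij}]^T$, suppose we want to fit the marginal generalized linear model given by $f(\mu_{ij}) = x_{ij}^T \beta$. Let $R_i(\alpha)$ denote the working correlation matrix for $Y_i$, in which any correlation parameters are denoted by $\alpha$. To obtain a consistent estimate of the regression parameters, $\beta = [\beta_0, \beta_1, \ldots, \beta_p]^T$, with the GEE approach, the following is iteratively solved:

$$\sum_{i=1}^{N} D_i^T V_i^{-1} (Y_i - \mu_i) = \sum_{i=1}^{N} D_i^T A_i^{-1/2} R_i^{-1}(\hat\alpha) A_i^{-1/2} (Y_i - \mu_i) = 0.$$

Here, $D_i = \partial\mu_i / \partial\beta^T$, and $R_i(\alpha)$ is part of the working covariance matrix, $V_i = A_i^{1/2} R_i(\alpha) A_i^{1/2}$, for $Y_i$, where $A_i$ is a diagonal matrix of the marginal variances of the elements of $Y_i$.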

The importance of accurately modeling the working correlation structure is to obtain the smallest standard errors (SEs) possible. In order to obtain consistent SE estimates, the well-known empirical, robust, sandwich estimator given by

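$$\widehat{\mathrm{Cov}}_{E}(\hat\beta) = \hat\Sigma_M \left( \sum_{i=1}^{N} D_i^T V_i^{-1} (Y_i - \hat\mu_i)(Y_i - \hat\mu_i)^T V_i^{-1} D_i \right) \hat\Sigma_M \qquad (1)$$

is often used, in which $\hat\Sigma_M = \left( \sum_{i=1}^{N} D_i^T V_i^{-1} D_i \right)^{-1}$. However, when the number of subjects is not large, this estimator can be negatively biased due to the use of residual, as opposed to error, vectors and not accounting for the impact of correlation parameter estimation. Therefore, the corrected estimator given by

$$\widehat{\mathrm{Cov}}_{EBC}(\hat\beta) = (I_p + \hat G)\, \hat\Sigma_M \left( \sum_{i=1}^{N} D_i^T V_i^{-1}\, \widehat{\mathrm{Cov}}(Y_i)\, V_i^{-1} D_i \right) \hat\Sigma_M\, (I_p + \hat G)^T, \qquad (2)$$

is ideal, in which $\widehat{\mathrm{Cov}}(Y_i) = (I_{n_i} - H_i)^{-c} (Y_i - \hat\mu_i)(Y_i - \hat\mu_i)^T (I_{n_i} - H_i^T)^{-c}$, $H_i = D_i (I_p + G) \hat\Sigma_M D_i^T V_i^{-1}$, accounts for bias from the use of residual vectors, and $G = (G_0, G_1, \ldots, G_p)$,

$$G_r = -\hat\Sigma_M \sum_{i=1}^{N} X_i^T A_i^{1/2} R_i^{-1} \frac{\partial R_i(\hat\alpha(\beta))}{\partial \beta_r} R_i^{-1} A_i^{-1/2} (Y_i - \mu_i(\beta)),$$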

accounts for any covariance inflation arising from the use of estimated correlation parameters when estimating $\beta$ (Westgate, 2013, 2014, 2016; Westgate and Burchett, 2017). We note that $c = 1/2$ and $c = 1$ correspond to the Kauermann and Carroll (2001) and Mancl and DeRouen (2001) corrections, respectively. Both are well-known, and other corrections have been proposed as well, although either the Kauermann and Carroll correction or the average of these two corrections may be best in practice (Westgate, 2016; Ford and Westgate, 2018).
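To make estimator (1) concrete, the following Python sketch works through the simplest possible case: an intercept-only Poisson model with a log link and a working independence structure, so that every quantity is a scalar. This simplification is an assumption made purely for illustration, and the small-sample corrections in (2) are not included. The data are the fall counts of the three subjects shown in Table 1, ignoring the survey weights. Under these assumptions $D_i = \mu 1_{n_i}$ and $V_i = \mu I_{n_i}$, so $D_i^T V_i^{-1} D_i = n_i \mu$ and $D_i^T V_i^{-1}(Y_i - \mu_i) = \sum_j (Y_{ij} - \mu)$.

```python
import math

# Fall counts for the three subjects listed in Table 1 (illustration only).
clusters = [[1, 0, 0, 0], [0, 0, 2, 11], [1, 1, 1, 1]]


def intercept_only_sandwich(clusters):
    """Return (mu_hat, model_based_var, empirical_var) for the intercept.

    Assumes log link, Poisson variance, working independence, no weights.
    """
    n_total = sum(len(y) for y in clusters)
    # The independence GEE reduces to sum_ij (y_ij - mu) = 0.
    mu = sum(sum(y) for y in clusters) / n_total
    # Sigma_M = (sum_i D_i' V_i^{-1} D_i)^{-1} = 1 / (n_total * mu)
    sigma_m = 1.0 / (n_total * mu)
    # Middle term of (1): squared within-subject sums of residuals.
    meat = sum(sum(yij - mu for yij in y) ** 2 for y in clusters)
    return mu, sigma_m, sigma_m * meat * sigma_m


mu_hat, model_var, empirical_var = intercept_only_sandwich(clusters)
print(mu_hat, math.sqrt(model_var), math.sqrt(empirical_var))
```

In this toy case the empirical variance exceeds the model-based variance because the fall counts within a subject are clearly not independent Poisson draws. With only three subjects the uncorrected estimator would not be trusted in practice, which is the motivation for (2).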

A common guideline is to apply a residual correction, such as the Kauermann and Carroll approach, when there are 40 or fewer subjects (Murray et al., 2004), although guidelines for the use of the covariance inflation correction are less clear. Often such a correction is not needed for simpler structures such as exchangeable and AR-1 due to the need to estimate only one correlation parameter (Westgate, 2016). However, it could be needed when incorporating a Toeplitz or an unstructured matrix, as they require estimation of multiple correlation parameters, depending on the number of time points at which data are collected. For instance, it was still useful with the unstructured matrix in the simulation study of Westgate (2014), even with 100 subjects and only 4 time points; i.e., six correlation parameters to estimate from 100 subjects.

The goal of correlation selection is to improve parameter estimation. In short, the data analyst wants to select the structure that will result in the smallest SEs. Let $\widehat{\mathrm{Cov}}(\hat\beta)$ represent an estimator that will be used in practice, either $\widehat{\mathrm{Cov}}_{E}(\hat\beta)$ or $\widehat{\mathrm{Cov}}_{EBC}(\hat\beta)$. For a given working correlation structure under consideration, $R$, the TECM and CIC criteria are given by $\mathrm{TECM}(R) = \mathrm{tr}(\widehat{\mathrm{Cov}}(\hat\beta))$ and $\mathrm{CIC}(R) = \mathrm{tr}(\hat\Sigma_{MI}^{-1} \widehat{\mathrm{Cov}}(\hat\beta))$, in which $\hat\Sigma_{MI}$ is the model-based covariance matrix of regression parameter estimates if independence had been the working structure. Essentially, these two criteria select the working structure that has been estimated to yield the least variable regression parameter estimates. Alternatively, the GP criterion is given by

$$\mathrm{GP}(R) = \sum_{i=1}^{N} \left[ (Y_i - \mu_i)^T V_i^{-1} (Y_i - \mu_i) + \ln(|V_i|) \right]$$

and is based on the concept of minimizing a generalized least squares objective. We note that in order to penalize correlation parameter estimation, $2(p + r)$ or $(p + r)\ln(N)$ can be added, where $r$ is the number of estimated correlation parameters, forming AIC and BIC-based penalized versions of the GP which we will call the AGP and BGP, respectively (Xiaolu and Zhongyi, 2013; Westgate and Burchett, 2017). We remind the reader that the use of $G$ is the method for penalizing correlation estimation with the TECM and CIC (Westgate and Burchett, 2017). All three criteria select the structure under consideration that yields the smallest value.
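As an illustration of how the two trace-based criteria are computed once the covariance matrices are in hand, the following Python sketch evaluates the TECM and CIC for a two-parameter model. The matrices are hypothetical numbers chosen for easy arithmetic, and the helper names are not from any package; a real analysis would take these matrices from fitted GEE output.

```python
def trace(m):
    """Sum of the diagonal elements of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inv2(m):
    """Inverse of a 2x2 matrix (sufficient for this toy example)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def tecm(cov):
    """TECM(R) = tr(Cov(beta_hat))."""
    return trace(cov)


def cic(cov, sigma_mi):
    """CIC(R) = tr(Sigma_MI^{-1} Cov(beta_hat))."""
    return trace(matmul(inv2(sigma_mi), cov))


# Hypothetical covariance estimate under some working structure R, and the
# hypothetical model-based covariance under working independence.
cov_hat = [[0.04, 0.01], [0.01, 0.09]]
sigma_mi = [[0.02, 0.00], [0.00, 0.03]]

print(tecm(cov_hat), cic(cov_hat, sigma_mi))
```

The TECM simply sums the estimated variances, whereas the CIC rescales each variance by the corresponding model-based variance under independence. The two criteria can therefore rank structures differently when the parameters are on very different scales.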

2.2. Weighted GEE and extensions for corresponding correlation selection criteria

With respect to the marginal analysis of longitudinal survey data, weighted GEE must be used in order to provide consistent regression parameter estimates. In short, weights must account for the subject’s probability of inclusion in the sample, as well as the probability of responding at each time of data collection if there are data that are missing at random (Robins et al., 1995; Preisser et al., 2002; Roberts et al., 2009; Carrillo and Karr, 2013). We note that either case-specific weights or time-varying weights can be used to account for data missing at random (Preisser et al., 2002). Furthermore, the weights are assumed to be correctly specified. We note that we will not be considering approaches for missing data that are not missing at random (NMAR), or non-ignorable missingness, where, for example, the probability of attrition depends on the outcomes of interest (Wang and Fitzmaurice, 2006); we return to this point in the Discussion.

Weighted GEE are given by

$$\sum_{i=1}^{N} D_i^T A_i^{-1/2} R_i^{-1}(\alpha) A_i^{-1/2} W_i (Y_i - \mu_i) = 0,$$

in which $W_i$ is a diagonal matrix of weights corresponding to the $i$th subject’s $n_i$ survey outcomes (Preisser et al., 2002). When weights must be estimated to account for the probability of responding, the previous form for the empirical, robust, sandwich estimator can be conservative and therefore the following estimator should be used if possible (Robins et al., 1995; Preisser et al., 2002):

$$\widehat{\mathrm{Cov}}_{WE}(\hat\beta) = \hat\Sigma_{MW} \left( \sum_{i=1}^{N} E_i E_i^T \right) \hat\Sigma_{MW}. \qquad (3)$$

Here, $\hat\Sigma_{MW} = \left( \sum_{i=1}^{N} D_i^T V_i^{-1} W_i D_i \right)^{-1}$, $E_i = U_i - \left( \sum_{i=1}^{N} U_i S_i^T \right) \left( \sum_{i=1}^{N} S_i S_i^T \right)^{-} S_i$, $U_i = D_i^T V_i^{-1} W_i (Y_i - \hat\mu_i)$, and $S_i = \sum_t R_{i,t-1} Z_{it} (R_{i,t} - \lambda_{i,t})$, in which $R_{i,t}$, $\lambda_{i,t}$, and $Z_{it}$ are a response indicator, the modeled probability of response, and the design matrix for the modeled probability of response, respectively, corresponding to subject $i$ at time $t$. In short, $S_i$ represents the score equations from the model for missingness from which estimated weights are obtained. Setting $S_i$, $i = 1, \ldots, N$, as 0 vectors reduces this formula back to the typical empirical covariance formula, but with the addition of the weights. When incorporating both types of small-sample bias-corrections as introduced in the previous section, the corrected estimator is given by

$$\widehat{\mathrm{Cov}}_{WEBC}(\hat\beta) = (I_p + \hat G_W)\, \hat\Sigma_{MW} \left( \sum_{i=1}^{N} \tilde E_i \tilde E_i^T \right) \hat\Sigma_{MW}\, (I_p + \hat G_W)^T, \qquad (4)$$

in which $\tilde E_i$ has the same form as $E_i$ but incorporates $\tilde U_i = D_i^T V_i^{-1} W_i (I_{n_i} - \tilde H_i)^{-c} (Y_i - \hat\mu_i)$, where $\tilde H_i = D_i (I_p + G_W) \hat\Sigma_{MW} D_i^T V_i^{-1} W_i$, $G_W = (G_{W0}, G_{W1}, \ldots, G_{Wp})$, and

$$G_{Wr} = -\hat\Sigma_{MW} \sum_{i=1}^{N} X_i^T A_i^{1/2} R_i^{-1} \frac{\partial R_i(\hat\alpha(\beta))}{\partial \beta_r} R_i^{-1} A_i^{-1/2} W_i (Y_i - \mu_i(\beta)).$$

As already discussed, the TECM, CIC, and GP have been shown, via extensive simulation studies, to work well. For instance, see Westgate (2014); Westgate and Burchett (2017). Therefore, we now show that these criteria can easily be extended for use with weighted GEE for the marginal analysis of longitudinal survey data, as the weights are simply nuisance factors reflecting how much influence a subject’s data should have based on the initial probability of being selected into a sample and the subsequent probability of having a particular pattern of participation across the waves. Specifically, the TECM and CIC seamlessly incorporate the weights and are given by $\mathrm{TECM}_W(R) = \mathrm{tr}(\widehat{\mathrm{Cov}}_{WEBC}(\hat\beta))$ and $\mathrm{CIC}_W(R) = \mathrm{tr}(\hat\Sigma_{MWI}^{-1} \widehat{\mathrm{Cov}}_{WEBC}(\hat\beta))$, in which $\hat\Sigma_{MWI}$ is the model-based covariance matrix of regression parameter estimates if independence had been the working structure. The GP is naturally applicable in a weighted fashion when cluster-specific weights are used such that all observations from subject $i$ are given the same weight $w_i$, $i = 1, \ldots, N$. However, normalized weights must be utilized such that the penalties applied by the AGP and BGP are on the same scale. Specifically, we define the normalized weight for subject $i$ as $\tilde w_i = w_i / \bar w$ where $\bar w = \sum_{i=1}^{N} w_i / N$. The weighted GP is then given by

$$\mathrm{GP}_W(R) = \sum_{i=1}^{N} \tilde w_i \left[ (Y_i - \mu_i)^T V_i^{-1} (Y_i - \mu_i) + \ln(|V_i|) \right],$$

and the weighted versions of the AGP and BGP are given by $\mathrm{GP}_W(R) + 2(p + r)$ and $\mathrm{GP}_W(R) + (p + r)\ln(N)$, respectively.
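The weight normalization can be illustrated with a short Python sketch. The base weights are those of the three subjects shown in Table 1; the per-subject GP terms are hypothetical placeholders, since in practice they would be computed from the fitted means and working covariance matrices. The function names are illustrative only.

```python
def normalize_weights(weights):
    """Scale subject-level weights so they average to 1 (and sum to N)."""
    w_bar = sum(weights) / len(weights)
    return [w / w_bar for w in weights]


def weighted_gp(weights, subject_terms):
    """Weighted GP: sum over subjects of w_tilde_i * term_i, where
    term_i = (Y_i - mu_i)' V_i^{-1} (Y_i - mu_i) + ln|V_i|."""
    w_tilde = normalize_weights(weights)
    return sum(w * t for w, t in zip(w_tilde, subject_terms))


base_weights = [3878, 2197, 1640]   # three subjects from Table 1
gp_terms = [5.2, 31.7, 4.1]         # hypothetical per-subject GP terms

w_tilde = normalize_weights(base_weights)
print(w_tilde, sum(w_tilde))
print(weighted_gp(base_weights, gp_terms))
```

Because the normalized weights sum to $N$, the weighted GP stays on the same scale as the unweighted GP, so the additive penalties $2(p + r)$ and $(p + r)\ln(N)$ retain their intended size. Without normalization, weights in the thousands would make the penalties negligible.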

The R function that we provide (see the supplementary materials for details and demonstration) utilizes the small-sample corrections. Specifically, it utilizes $\widehat{\mathrm{Cov}}_{WEBC}(\hat\beta)$ instead of $\widehat{\mathrm{Cov}}_{WE}(\hat\beta)$. Therefore, correlation parameter estimation can be properly accounted for, or penalized, within the selection criteria, while also obtaining appropriate SE estimates.

Our R function also allows the incorporation of $S_i$, $i = 1, \ldots, N$, within $\widehat{\mathrm{Cov}}_{WEBC}(\hat\beta)$. However, if there are no missing data, or data are assumed to be missing completely at random and thus weights only need to account for sampling, i.e. the subject’s probability of inclusion in the sample, then $S_i$, $i = 1, \ldots, N$, can be set to be 0 vectors and appropriate estimates are obtained. Furthermore, even if data are missing at random, obtaining $S_i$, $i = 1, \ldots, N$, may be difficult and tedious in practice unless weights can be obtained from a stacked dataset in which the same model for missingness can be used for each time point. When obtaining $S_i$ is impractical, analysts may want to set $S_i$, $i = 1, \ldots, N$, to be 0 vectors. Although this may result in conservative SE estimates, this will not affect our ability to utilize correlation selection criteria as previously described. Similarly, whether or not the Kauermann and Carroll (2001) or Mancl and DeRouen (2001) corrections are utilized will have no influence on correlation selection (Westgate, 2016). In fact, if $S_i$, $i = 1, \ldots, N$, are set to 0 (making SE estimates conservative in this aspect), then the analyst may want to not utilize the Kauermann and Carroll (2001) or Mancl and DeRouen (2001) corrections (making SE estimates liberal in this aspect) in order to offset these different types of bias. However, more study is needed to determine how well this suggestion will work in practice.

2.3. Weighted GEE and correlation structure selection with existing software

We assume that some data analysts may prefer to use existing software such as SAS PROC GEE (SAS Institute Inc., 2013). The GEE procedure in SAS can utilize a stacked dataset for modeling data that are missing at random, and it incorporates $\widehat{\mathrm{Cov}}_{WE}(\hat\beta)$ within the analysis such that SE estimates are not conservative. We now address correlation selection using existing software, whether or not the correction utilizing $S_i$, $i = 1, \ldots, N$, is used.

In general, the CIC and TECM can be obtained with existing software. For both criteria, the model must be fit under each correlation structure under consideration. With the TECM, all that needs to be done is to take the sum of the squared SE estimates. For the CIC, a model must also be fit using a working independence structure, and then covariance matrices as described above must be obtained and multiplied before taking the trace.
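The TECM calculation from software output amounts to a few lines of arithmetic. As a sketch, the Python snippet below takes the uncorrected standard error estimates reported for the small-sample application (Table 2, SEU columns) and sums their squares for each working structure; the variable and function names are illustrative only.

```python
# Uncorrected SE estimates for (beta_0, beta_1, beta_2, beta_3), Table 2.
se_uncorrected = {
    "Ind":      [2.759, 0.057, 0.030, 0.119],
    "Exch":     [2.649, 0.057, 0.029, 0.101],
    "AR-1":     [2.506, 0.055, 0.029, 0.080],
    "Toeplitz": [2.565, 0.054, 0.029, 0.080],
    "UN":       [2.469, 0.059, 0.028, 0.087],
}


def tecm_from_se(ses):
    """TECM = trace of the covariance matrix = sum of squared SEs."""
    return sum(se ** 2 for se in ses)


tecm = {name: round(tecm_from_se(ses), 2) for name, ses in se_uncorrected.items()}
best = min(tecm, key=tecm.get)
print(tecm)   # {'Ind': 7.63, 'Exch': 7.03, 'AR-1': 6.29, 'Toeplitz': 6.59, 'UN': 6.11}
print(best)   # UN
```

These values reproduce the unpenalized TECM row of Table 2. Note that the intercept's SE dominates the sum here, so the TECM is sensitive to the scale on which covariates are measured.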

Unfortunately, to our knowledge, the software used in practice will not allow the analyst to incorporate small-sample adjustments; i.e., the Kauermann and Carroll (2001) or Mancl and DeRouen (2001) corrections or the covariance inflation correction. The inability to incorporate the Kauermann and Carroll (2001) or Mancl and DeRouen (2001) corrections will not influence correlation selection, and may only have notable influence on SE estimates if N ≤ 40. However, the inability to utilize the covariance inflation correction can result in the over-selection of the Toeplitz and unstructured correlation matrices. In small-sample settings, this can be detrimental as legitimately smaller SE estimates may be obtainable with simpler structures, and SE estimates can be negatively biased with the Toeplitz and unstructured matrices. Therefore, if the sample size is not large, e.g. N ≤ 100, the analyst may want to only consider simpler structures for selection such that penalization is likely not needed. Alternatively, with large sample sizes, we recommend the unstructured matrix. In short, if N is large, any covariance inflation from correlation estimation will be negligible, and so the unstructured matrix’s flexibility is potentially advantageous and it still can model simpler structures such as exchangeable and AR-1.

In the application that we consider below, we use the R function that we have developed for computing these selection criteria in the context of survey-weighted GEE estimation with smaller samples. The supplementary materials provide explicit detail on how the function should be used when analyzing longitudinal survey data. As noted above, the use of our function in the context of larger samples with survey weights will not lead to incorrect inference, and the unstructured matrix would be recommended for modeling in these settings.

3. Applications

3.1. Small sample setting

We now demonstrate our R function, as well as existing software, specifically SAS PROC GEE, in a marginal analysis of trends in (and correlates of) the number of falls in the past two years from 41 individuals in the small hypothetical subpopulation (Hispanic females age 75+ with diabetes, arthritis, and an education of no more than high school) in the HRS data. Data were collected every two years from 2006 to 2012. In the HRS, the base sampling weights are computed to reflect unequal probabilities of selection into the sample that arise from over-sampling specific population subgroups, including Blacks and Hispanics, and the selection of single eligible individuals from households with multiple eligible individuals (https://bit.ly/2MvTVeZ). For purposes of this application, we assume that the 2006 weights for each individual in this specific subpopulation (which generally reflect adjustments for attrition over time in the HRS) are the base sampling weights, prior to the individuals being followed for six years. The HRS began in 1992, but we assume that this subpopulation of individuals was first followed in 2006 strictly for illustration purposes. Our R function, code demonstrating the function via the replication of the results presented in this section, and the datasets used for this application example are provided in supplementary materials.

Table 1 presents a small subset of the person-wave data set analyzed in this application (specifically, 12 observations corresponding to three of the individuals). We note that the 2006 sampling weights are constant for each individual female. We also consider the large-sample case in this application, using all Hispanic females ages 75 and above.

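Table 1: Subset of HRS data set analyzed in the application

| ID | Base Weight | Years Since 2006 | Number of Falls |
|---|---|---|---|
| 4141 | 3878 | 0 | 1 |
| 4141 | 3878 | 2 | 0 |
| 4141 | 3878 | 4 | 0 |
| 4141 | 3878 | 6 | 0 |
| 4902 | 2197 | 0 | 0 |
| 4902 | 2197 | 2 | 0 |
| 4902 | 2197 | 4 | 2 |
| 4902 | 2197 | 6 | 11 |
| 7035 | 1640 | 0 | 1 |
| 7035 | 1640 | 2 | 1 |
| 7035 | 1640 | 4 | 1 |
| 7035 | 1640 | 6 | 1 |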

We fit the following marginal Poisson regression model incorporating working independence, exchangeable, AR-1, Toeplitz, and unstructured correlation matrices within the weighted GEE:

$$\ln(\mu_{ij}) = \beta_0 + \beta_1 \mathrm{Age2006}_i + \beta_2 (\mathrm{Year}_{ij} - 2006) + \beta_3 \ln(\mathrm{Income}_{ij}); \quad j = 1, \ldots, n_i; \quad i = 1, \ldots, N.$$

Here, $\mu_{ij}$ is the mean number of falls in the previous two years for subject $i$ at their $j$th observation, $\mathrm{Age2006}_i$ is the $i$th subject’s age during the (hypothetical) first survey round in 2006, $(\mathrm{Year}_{ij} - 2006)$ is the number of years since 2006 at the time the $j$th observation is obtained from subject $i$, and $\mathrm{Income}_{ij}$ is the $i$th subject’s recorded income at the time of their $j$th observation. To demonstrate the need for correlation selection penalization, as well as small-sample bias corrections to empirical standard error estimates, we present two sets of results for each working correlation structure. The first set of results arises from the use of the GEE procedure in SAS, and this corresponds to a typical analysis in current practice. The case-specific 2006 sampling weights are incorporated into the analysis via the WEIGHT statement. The second set of results arises from the use of the R function we are providing as supplemental material with this paper and incorporates the small-sample corrections and penalties addressed above.
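To show how the fitted marginal model translates into an expected count, the following Python sketch evaluates the log-linear predictor for one subject profile. The coefficients are the AR-1 estimates reported in Table 2; the covariate values (age, years since 2006, income) are hypothetical and chosen only for illustration, and the function name is not part of any package.

```python
import math


def expected_falls(beta, age_2006, years_since_2006, income):
    """Marginal mean number of falls in the past two years under the
    log-linear model ln(mu) = b0 + b1*Age2006 + b2*years + b3*ln(income)."""
    b0, b1, b2, b3 = beta
    eta = b0 + b1 * age_2006 + b2 * years_since_2006 + b3 * math.log(income)
    return math.exp(eta)


beta_ar1 = (0.854, 0.074, -0.005, -0.022)  # AR-1 estimates from Table 2

# Hypothetical subject: age 80 in 2006, observed in 2010, income 20000.
mu = expected_falls(beta_ar1, age_2006=80, years_since_2006=4, income=20000)
print(mu)

# Multiplicative change in the mean per additional year of baseline age.
print(math.exp(beta_ar1[1]))
```

Because of the log link, each coefficient acts multiplicatively on the mean: exponentiating a coefficient gives the rate ratio associated with a one-unit increase in the corresponding covariate, holding the others fixed.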

Table 2 presents regression parameter estimates, corresponding uncorrected and corrected standard error estimates, and correlation structure selection criterion (TECM, CIC, AGP, BGP) values from fitting weighted generalized estimating equations with working independence, exchangeable, AR-1, Toeplitz, and unstructured working correlation matrices. The smallest observed value for a given criterion is italicized when utilizing no penalties, whereas such a value is in bold when based on the use of penalties. We note that the GP criterion is the non-penalized version of both the AGP and BGP and is therefore shown in both rows corresponding to these two penalized criteria.

Table 2: Regression parameter estimates, corresponding uncorrected (SEU) and corrected (SEC) standard error estimates, and correlation structure selection criterion values (NP - not penalized, P - penalized) from fitting weighted generalized estimating equations with working independence (Ind), exchangeable (Exch), AR-1, Toeplitz, and unstructured (UN) working correlation matrices in the small sample setting. The smallest value for each uncorrected / unpenalized criterion is italicized, whereas the smallest value for each corrected / penalized criterion is in bold.

| Parameter | Ind $\hat\beta$ | Ind SEU | Ind SEC | Exch $\hat\beta$ | Exch SEU | Exch SEC | AR-1 $\hat\beta$ | AR-1 SEU | AR-1 SEC | Toeplitz $\hat\beta$ | Toeplitz SEU | Toeplitz SEC | UN $\hat\beta$ | UN SEU | UN SEC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\hat\beta_0$ | 1.058 | 2.759 | 3.102 | 0.293 | 2.649 | 2.922 | 0.854 | 2.506 | 2.744 | 1.468 | 2.565 | 2.802 | −0.005 | 2.469 | 2.777 |
| $\hat\beta_1$ | 0.068 | 0.057 | 0.059 | 0.068 | 0.057 | 0.058 | 0.074 | 0.055 | 0.057 | 0.060 | 0.054 | 0.055 | 0.120 | 0.059 | 0.061 |
| $\hat\beta_2$ | −0.008 | 0.030 | 0.032 | −0.0003 | 0.029 | 0.031 | −0.005 | 0.029 | 0.030 | −0.012 | 0.029 | 0.030 | 0.007 | 0.028 | 0.029 |
| $\hat\beta_3$ | −0.013 | 0.119 | 0.140 | 0.002 | 0.101 | 0.116 | −0.022 | 0.080 | 0.091 | −0.020 | 0.080 | 0.091 | −0.043 | 0.087 | 0.105 |

| Criterion | Ind NP | Ind P | Exch NP | Exch P | AR-1 NP | AR-1 P | Toeplitz NP | Toeplitz P | UN NP | UN P |
|---|---|---|---|---|---|---|---|---|---|---|
| TECM | 7.63 | 9.65 | 7.03 | 8.55 | 6.29 | **7.54** | 6.59 | 7.86 | *6.11* | 7.73 |
| CIC | 5.08 | 5.81 | 4.86 | 5.23 | 4.38 | **4.98** | *4.31* | 5.13 | 4.44 | 5.18 |
| AGP | 1108 | 1116 | 1104 | 1114 | 1099 | 1109 | 1097 | 1111 | *1085* | **1105** |
| BGP | 1108 | 1122 | 1104 | 1122 | 1099 | **1118** | 1097 | 1123 | *1085* | 1122 |

Uncorrected standard error estimates (SEU) were estimated using SAS PROC GEE, and corrected standard error estimates (SEC) were estimated using our new WGEE function in R.

The GP criterion is the non-penalized version of both the AGP and BGP and is therefore shown in both rows corresponding to these two penalized criteria.

TECM - trace of the empirical covariance matrix

CIC - correlation information criterion

The need for correlation penalization is very evident. When not utilizing a penalty, both the TECM and GP select unstructured, whereas the CIC disagrees and chooses Toeplitz. However, penalties are needed to appropriately account for the estimation of multiple nuisance correlation parameters within each of these structures. When utilizing penalties, the BGP, TECM, and CIC all select AR-1 as this structure provides the smallest values. These results can be explained by looking at the corrected standard error estimates. It can be seen that, relative to unstructured, the choice of AR-1 results in a notably smaller standard error estimate for β^3 which corresponds to the time-dependent covariate ln(Incomeij). Relative to Toeplitz, AR-1 resulted in a notably smaller standard error estimate of β^0.

We note that the AGP selects unstructured. When criteria disagree in this way, the analyst can choose the structure selected by the most criteria (Shults and Hilbe, 2014; Westgate, to appear). In our case, AR-1 is selected by three of the four criteria and is therefore preferred.
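The majority rule can be sketched in a few lines of Python. This is only an illustration built from the penalized (P) values reported in Table 2; the dictionary layout and the helper name `majority_choice` are assumptions made for this example and are not part of the authors' WGEE function.

```python
from collections import Counter

# Penalized (P) criterion values copied from Table 2 (small sample setting).
STRUCTURES = ["Ind", "Exch", "AR-1", "Toeplitz", "UN"]
PENALIZED = {
    "TECM": [9.65, 8.55, 7.54, 7.86, 7.73],
    "CIC": [5.81, 5.23, 4.98, 5.13, 5.18],
    "AGP": [1116, 1114, 1109, 1111, 1105],
    "BGP": [1122, 1122, 1118, 1123, 1122],
}


def majority_choice(criteria, structures):
    """Return each criterion's pick (smallest value) and the most-voted structure.

    Hypothetical helper: ties between structures are not handled specially.
    """
    picks = {
        name: structures[values.index(min(values))]
        for name, values in criteria.items()
    }
    winner, votes = Counter(picks.values()).most_common(1)[0]
    return picks, winner, votes


picks, winner, votes = majority_choice(PENALIZED, STRUCTURES)
print(picks)
print(winner, votes)
```

With the Table 2 values, the TECM, CIC, and BGP each pick AR-1 and the AGP picks UN, so AR-1 wins with three of four votes.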

3.2. Large sample setting

In the previous section, we recommended use of the unstructured matrix in large-sample settings if the analyst does not wish to use the R function we provide. Therefore, we now demonstrate an example with a larger sample size in which existing software and our R function with small-sample corrections both tend to select a working unstructured matrix, as the need for small-sample corrections is almost negligible. Specifically, we utilize data from all Hispanic females age 75+, resulting in data from 155 subjects. For ease of comparison, we replicate all of the above analyses, the only difference being that the dataset is now larger; results are provided in Table 3.

Table 3:

Regression parameter estimates, corresponding uncorrected (SEU) and corrected (SEC) standard error estimates, and correlation structure selection criterion values (NP - not penalized, P - penalized) from fitting weighted generalized estimating equations with working independence (Ind), exchangeable (Exch), AR-1, Toeplitz, and unstructured (UN) working correlation matrices in the large sample setting. The smallest value for each uncorrected / unpenalized criterion is italicized, whereas the smallest value for each corrected / penalized criterion is in bold.

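| Parameter | Ind β^ | Ind SEU | Ind SEC | Exch β^ | Exch SEU | Exch SEC | AR-1 β^ | AR-1 SEU | AR-1 SEC | Toeplitz β^ | Toeplitz SEU | Toeplitz SEC | UN β^ | UN SEU | UN SEC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| β^0 | −3.324 | 1.845 | 1.883 | −3.825 | 1.643 | 1.669 | −3.335 | 1.681 | 1.707 | −3.519 | 1.653 | 1.678 | −3.688 | 1.580 | 1.605 |
| β^1 | 0.073 | 0.034 | 0.034 | 0.083 | 0.029 | 0.029 | 0.081 | 0.031 | 0.031 | 0.081 | 0.030 | 0.030 | 0.084 | 0.031 | 0.031 |
| β^2 | 0.007 | 0.017 | 0.017 | 0.013 | 0.015 | 0.015 | 0.011 | 0.016 | 0.016 | 0.012 | 0.015 | 0.015 | 0.013 | 0.015 | 0.015 |
| β^3 | 0.281 | 0.126 | 0.129 | 0.279 | 0.105 | 0.108 | 0.249 | 0.105 | 0.107 | 0.258 | 0.104 | 0.106 | 0.268 | 0.101 | 0.103 |

| Criterion | Ind NP | Ind P | Exch NP | Exch P | AR-1 NP | AR-1 P | Toeplitz NP | Toeplitz P | UN NP | UN P |
|---|---|---|---|---|---|---|---|---|---|---|
| TECM | 3.42 | 3.56 | 2.71 | 2.80 | 2.84 | 2.93 | 2.74 | 2.83 | *2.51* | **2.59** |
| CIC | 5.77 | 6.08 | 5.11 | 5.49 | 5.15 | 5.49 | 5.11 | 5.44 | *5.02* | **5.17** |
| AGP | 4061 | 4069 | 4018 | 4028 | 4012 | 4022 | 4010 | 4024 | *3992* | **4012** |
| BGP | 4061 | 4081 | 4018 | 4044 | 4012 | **4037** | 4010 | 4045 | *3992* | 4043 |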

Uncorrected standard error estimates (SEU) were estimated using SAS PROC GEE, and corrected standard error estimates (SEC) were estimated using our new WGEE function in R.

The GP criterion is the non-penalized version of both the AGP and BGP and is therefore shown in both rows corresponding to these two penalized criteria.

TECM - trace of the empirical covariance matrix

CIC - correlation information criterion

The need for small-sample corrections is almost negligible, as the uncorrected and corrected standard error estimates are similar in value, although some small differences are evident. Table 3 shows that, whether or not correlation parameter estimation is penalized, every criterion except the BGP chooses the working unstructured matrix. The BGP selects a working AR-1 structure, even though the corrected standard error estimates under the unstructured matrix are no larger than those under AR-1. This demonstrates that criteria may not always select the best structure and will not always agree with each other, in which case the selection of a structure can be based on what the majority of criteria choose.
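One way to see why the BGP can disagree with the other criteria is to look at how far each penalized Gaussian pseudolikelihood value sits above the shared GP value in Table 3. The sketch below divides that difference by the number of estimated parameters. The parameter counts are assumptions made for illustration and are not stated in this section: four regression parameters (β^0 through β^3) plus 0, 1, 1, 3, and 6 correlation parameters, which presumes four repeated measurements per subject. The table values are rounded to integers, so the result is approximate.

```python
import math

STRUCTURES = ["Ind", "Exch", "AR-1", "Toeplitz", "UN"]

# Values copied from Table 3 (large sample setting, 155 subjects).
GP = [4061, 4018, 4012, 4010, 3992]   # unpenalized, shared by AGP and BGP
AGP = [4069, 4028, 4022, 4024, 4012]
BGP = [4081, 4044, 4037, 4045, 4043]

# ASSUMPTION (not stated in this section): 4 regression parameters and
# 0, 1, 1, 3, 6 correlation parameters, i.e. four time points per subject.
N_REGRESSION = 4
N_CORRELATION = [0, 1, 1, 3, 6]
N_SUBJECTS = 155


def per_parameter_penalty(penalized, unpenalized):
    """Penalty (penalized minus unpenalized) divided by the parameter count."""
    return [
        (p - u) / (N_REGRESSION + q)
        for p, u, q in zip(penalized, unpenalized, N_CORRELATION)
    ]


agp_rate = per_parameter_penalty(AGP, GP)
bgp_rate = per_parameter_penalty(BGP, GP)

for name, a, b in zip(STRUCTURES, agp_rate, bgp_rate):
    print(f"{name:9s} AGP per parameter: {a:.2f}   BGP per parameter: {b:.2f}")
print(f"ln(N) = {math.log(N_SUBJECTS):.2f}")
```

Under these assumed counts, the implied penalty is 2 per parameter for the AGP and close to ln(155), about 5.04, for the BGP. With 155 subjects, the BGP therefore appears to charge roughly two and a half times as much for each additional correlation parameter, which would weigh most heavily against the unstructured matrix.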

Based on a choice of the working unstructured matrix, the regression parameters for both time-dependent covariates, the number of years since 2006 (p=0.007) and the natural log of income (p=0.010), are statistically significant at the 0.05 level. These results indicate positive relationships of aging and income with the expected number of falls in a given year for women in this subpopulation.
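These p-values can be approximated from Table 3 with a two-sided Wald test, dividing each estimate by its corrected standard error under the unstructured matrix. This is a rough check only. It assumes a standard normal reference distribution, which this section does not confirm, and it uses the rounded table values, so the results may differ slightly from the reported p-values. The function name `wald_p_value` is hypothetical, and pairing β^1 with the number of years since 2006 is inferred from the reported p-values rather than stated here.

```python
import math


def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: coefficient = 0, using a normal reference."""
    z = abs(estimate / std_error)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# Estimates and corrected standard errors (SEC) from the UN columns of Table 3.
p_years = wald_p_value(0.084, 0.031)       # assumed to be beta_1: years since 2006
p_log_income = wald_p_value(0.268, 0.103)  # beta_3: natural log of income

print(f"years since 2006: p = {p_years:.3f}")
print(f"ln(income):       p = {p_log_income:.3f}")
```

Both approximations fall well below 0.05, in line with the conclusion that the two time-dependent covariates are statistically significant.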

4. Discussion

In this paper, we have developed and illustrated approaches for selecting the best correlation structure under consideration, in the sense of minimizing the variances of estimates, when using weighted GEE to fit models to longitudinal survey data. The approaches have been implemented in an easy-to-use R function that is available in the supplementary materials. Our approaches provide advantages over procedures in general-purpose statistical packages (e.g., SAS PROC GEE) that enable weighted GEE estimation but do not offer penalties, incorporating the survey weights, for the number of correlation parameters being estimated.

The weighted penalties to the correlation selection criteria described in this study are most relevant for fitting models to longitudinal data collected from small samples. In our application, we first focused on a very specific subpopulation of subjects meeting certain criteria from the Health and Retirement Study. When replicating the analysis presented using all Hispanic females age 75+, the unstructured matrix is selected by most criteria considered, as we predicted earlier. We strongly encourage data analysts to consider the enhanced model selection criteria developed in this paper, as they incorporate any information contained in the survey weights about the relationships of interest that is not otherwise being captured in the model being fitted. For large samples, the unstructured correlation matrix will generally produce the most efficient estimates, but for smaller samples, the selection criteria described here may well play a crucial role in the inferences that result from fitting the models of interest.
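How much more the penalized criteria matter in the small sample can be read directly from Tables 2 and 3. The sketch below computes, for the TECM, the relative increase from the unpenalized (NP) to the penalized (P) value under each working structure. The variable and function names are illustrative only.

```python
STRUCTURES = ["Ind", "Exch", "AR-1", "Toeplitz", "UN"]

# TECM values (NP = not penalized, P = penalized) copied from Tables 2 and 3.
TECM_SMALL = {"NP": [7.63, 7.03, 6.29, 6.59, 6.11],
              "P": [9.65, 8.55, 7.54, 7.86, 7.73]}
TECM_LARGE = {"NP": [3.42, 2.71, 2.84, 2.74, 2.51],
              "P": [3.56, 2.80, 2.93, 2.83, 2.59]}


def relative_increase(table):
    """Relative increase (P - NP) / NP for each working structure."""
    return [(p - np_) / np_ for np_, p in zip(table["NP"], table["P"])]


small = relative_increase(TECM_SMALL)
large = relative_increase(TECM_LARGE)

for name, sm, lg in zip(STRUCTURES, small, large):
    print(f"{name:9s} small sample: {sm:6.1%}   large sample: {lg:6.1%}")
```

With these table values, the increase is roughly 19% to 27% in the small sample setting and roughly 3% to 4% in the large sample setting.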

We note that no simulation study was presented in this manuscript. The correlation selection criteria with corresponding penalties proposed in this manuscript are seamless extensions of criteria that have already been studied extensively and shown to work well. Furthermore, our specific focus was on a constant weight for all repeated measures from the same subject, and so the weight is simply a nuisance factor in analyses focused on the selection of an appropriate within-subject correlation structure. Therefore, these criteria will also work well for longitudinal survey data.

In this manuscript, we have assumed that any working correlation structure will result in valid inference. However, with certain types of time-dependent covariates, some moment conditions within the weighted estimating equations will not be valid unless a working independence structure is employed (Pepe and Anderson, 1994; Lai and Small, 2007). Therefore, the type of time dependency a covariate has must be carefully considered. Definitions and explanations of the four different classification types for time dependency can be found in Lai and Small (2007) and Zhou et al. (2014).

Our work could be extended in several important directions. First, we did not consider non-ignorable nonresponse, which is best handled using Bayesian approaches that incorporate different assumptions about the missing data mechanism (Kaciroti et al., 2008). Weighted GEE relies on a missing at random assumption for missing data in the longitudinal context, which is made plausible with well-specified models for the probability of attrition at the different time points. We remind readers that the approaches considered here allow for the score equations from these models for missingness to be incorporated into the selection criteria calculations.

Second, we did not explicitly consider the use of time-varying weights in the weighted GEE context. The methods developed here are all based on a time-invariant survey weight at baseline that incorporates the initial probability of selection for a given case and (potentially) the probability of having a specific response pattern (Fitzmaurice et al., 1995). We note that our R function allows analysts to account for the estimation of time-varying weights via the S matrix, but this does require obtaining the score equations from the models used. Future empirical work could consider whether there are any benefits to inference from considering different types of weights to account for non-monotone patterns of missing data over time.

Third, modifications have been recently developed to address the validity of inference with respect to any type of time dependency of covariates (Chen and Westgate, 2017, 2018). In short, the estimating equations can be tailored to the type of time dependency such that valid, yet powerful, inference is still possible (Chen and Westgate, 2017). Furthermore, if the type of time dependency is difficult to determine, an empirical approach to choosing a working type of time dependency has been developed (Chen and Westgate, 2018). Therefore, future work can extend such methods for use in our current context of longitudinal data arising from a complex probability sample.

Supplementary Material

Demonstrating the Weighted GEE Function
PopData
SubPopData
Weighted GEE Functions for Supplementary Material

Contributor Information

Philip M. Westgate, Department of Biostatistics, College of Public Health, University of Kentucky, Lexington, KY 40536, U.S.A.

Brady T. West, Survey Research Center, Institute for Social Research, University of Michigan-Ann Arbor, Ann Arbor, MI 48109, U.S.A.

References

  1. Carey VJ and Wang Y-G (2011). Working covariance model selection for generalized estimating equations. Statistics in Medicine 30, 3117–3124.
  2. Carrillo IA, Chen J, and Wu C (2010). The pseudo-GEE approach to the analysis of longitudinal surveys. The Canadian Journal of Statistics 38, 540–554.
  3. Carrillo IA and Karr AF (2013). Combining cohorts in longitudinal surveys. Survey Methodology 39, 149–182.
  4. Chen I and Westgate PM (2017). Improved methods for the marginal analysis of longitudinal data in the presence of time-dependent covariates. Statistics in Medicine 36, 2533–2546.
  5. Chen I and Westgate PM (2018). A novel approach to selecting classification types for time-dependent covariates for the marginal analysis of longitudinal data. Statistical Methods in Medical Research.
  6. Fitzmaurice GM (1995). A caveat concerning independence estimating equations with multivariate binary data. Biometrics 51, 309–317.
  7. Fitzmaurice GM, Laird NM, and Ware JH (2011). Applied Longitudinal Analysis (2nd ed.). Hoboken, New Jersey: John Wiley & Sons, Inc.
  8. Fitzmaurice GM, Molenberghs G, and Lipsitz SR (1995). Regression models for longitudinal binary responses with informative drop-outs. Journal of the Royal Statistical Society Series B 57, 691–704.
  9. Ford WP and Westgate PM (2018). A comparison of bias-corrected empirical covariance estimators with generalized estimating equations in small-sample longitudinal study settings. Statistics in Medicine.
  10. Heeringa SG, West BT, and Berglund PA (2017). Applied Survey Data Analysis (2nd ed.). Boca Raton, FL: Taylor & Francis, CRC Press.
  11. Hin L-Y and Wang Y-G (2009). Working-correlation-structure identification in generalized estimating equations. Statistics in Medicine 28, 642–658.
  12. Kaciroti NA, Raghunathan TE, Schork MA, and Clark NM (2008). A Bayesian model for longitudinal count data with non-ignorable dropout. Journal of the Royal Statistical Society Series C 57, 521–534.
  13. Kauermann G and Carroll RJ (2001). A note on the efficiency of sandwich covariance matrix estimation. Journal of the American Statistical Association 96, 1387–1396.
  14. Lai TL and Small D (2007). Marginal regression analysis of longitudinal data with time-dependent covariates: a generalized method-of-moments approach. Journal of the Royal Statistical Society: Series B 69, 79–99.
  15. Liang K-Y and Zeger SL (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13–22.
  16. Little RJA and Rubin DB (2002). Statistical Analysis with Missing Data (2nd ed.). New York: John Wiley & Sons, Inc.
  17. Mancl LA and DeRouen TA (2001). A covariance estimator for GEE with improved small-sample properties. Biometrics 57, 126–134.
  18. Mancl LA and Leroux BG (1996). Efficiency of regression estimates for clustered data. Biometrics 52, 500–511.
  19. Murray DM, Varnell SP, and Blitstein JL (2004). Design and analysis of group-randomized trials: a review of recent methodological developments. American Journal of Public Health 94, 423–432.
  20. Pepe MS and Anderson GL (1994). A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics-Simulation and Computation 23, 939–951.
  21. Preisser JS, Lohman KK, and Rathouz PJ (2002). Performance of weighted estimating equations for longitudinal binary data with drop-outs missing at random. Statistics in Medicine 21, 3035–3054.
  22. R Development Core Team (2011). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
  23. Raghunathan T (2016). Missing Data Analysis in Practice. Boca Raton, FL: Taylor & Francis, CRC Press.
  24. Roberts G, Ren Q, and Rao JNK (2009). Using marginal mean models for data from longitudinal surveys with a complex design: Some advances in methods. In Lynn P (Ed.), Methodology of Longitudinal Surveys, Chapter 20, pp. 351–366. West Sussex: Wiley.
  25. Robins JM, Rotnitzky A, and Zhao LP (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association 90, 106–121.
  26. SAS Institute Inc. (2013). SAS/STAT 12.3 User’s Guide. Cary, NC: SAS Institute Inc.
  27. Shults J and Hilbe JM (2014). Quasi-Least Squares Regression. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, Boca Raton, FL.
  28. Sikkel D, Hox J, and de Leeuw E (2009). Using auxiliary data for adjustment in longitudinal research. In Lynn P (Ed.), Methodology of Longitudinal Surveys, Chapter 9, pp. 141–156. West Sussex: Wiley.
  29. Wang M and Fitzmaurice GM (2006). A simple imputation method for longitudinal studies with non-ignorable non-responses. Biometrical Journal 48, 302–318.
  30. Wang Y-G and Carey V (2003). Working correlation structure misspecification, estimation and covariate design: implications for generalised estimating equations performance. Biometrika 90, 29–41.
  31. Westgate PM (2013). A bias correction for covariance estimators to improve inference with generalized estimating equations that use an unstructured correlation matrix. Statistics in Medicine 32, 2850–2858.
  32. Westgate PM (2014). Improving the correlation structure selection approach for generalized estimating equations and balanced longitudinal data. Statistics in Medicine 33, 2222–2237.
  33. Westgate PM (2016). A covariance correction that accounts for correlation estimation to improve finite-sample inference with generalized estimating equations: a study on its applicability with structured correlation matrices. Journal of Statistical Computation and Simulation 86, 1891–1900.
  34. Westgate PM (To Appear). Approaches for the utilization of multiple criteria to select a working correlation structure for use within generalized estimating equations. Communications in Statistics - Simulation and Computation.
  35. Westgate PM and Burchett WW (2017). A comparison of correlation structure selection penalties for generalized estimating equations. The American Statistician 71, 344–353.
  36. Xiaolu Z and Zhongyi Z (2013). Comparison of criteria to select working correlation matrix in generalized estimating equations. Chinese Journal of Applied Probability and Statistics 29, 515–530.
  37. Zhou Y, Lefante J, Rice J, and Chen S (2014). Using modified approaches on marginal regression analysis of longitudinal data with time-dependent covariates. Statistics in Medicine 33, 3354–3364.
