Author manuscript; available in PMC: 2018 Jun 1.
Published in final edited form as: Stat Methods Med Res. 2016 Sep 20;27(6):1806–1817. doi: 10.1177/0962280216669184

Inferring marginal association with paired and unpaired clustered data

Douglas J Lorenz 1, Steven Levy 2, Somnath Datta 3
PMCID: PMC5524605  NIHMSID: NIHMS877817  PMID: 27655806

Abstract

In the marginal analysis of clustered data, where the marginal distribution of interest is that of a typical observation within a typical cluster, analysis by reweighting has been introduced as a useful tool for estimating parameters of these marginal distributions. Such reweighting methods have foundation in within-cluster resampling schemes that marginalize potential informativeness due to cluster size or within-cluster covariate distribution, to which reweighting methods are asymptotically equivalent. In this paper, we introduce a reweighting scheme for the marginal analysis of clustered data that generalizes prior reweighting methods, with a particular application to measuring bivariate correlation in unpaired clustered data, in which observations of two random variables are not naturally paired at the within-cluster level. We develop unpaired clustered data analogs of well-known product moment correlation coefficients (Pearson, Spearman, phi), as well as the polyserial coefficient for measuring correlation between one discrete and one continuous variable. We evaluate the performance of these coefficients via a simulation study and demonstrate their use by finding no statistically significant association between dental caries at an early age and dental fluorosis at age 13 using a large dental dataset.

Keywords: Measures of association, correlation, clustered data, marginal analysis, informative cluster size

1 Introduction

The motivating dataset for this paper comes from an observational dental study,1 with interest in examining the association between dental caries and dental fluorosis, the staining or pitting of enamel during tooth development caused by excess exposure to fluoride. As part of this study, children received dental examinations at ages 5, 9, and 13, and the presence and severity of dental caries were noted for each tooth in each child. Additionally, the presence and severity of dental fluorosis in each tooth was noted at ages 9 and 13. Investigators were interested in identifying associations between the presence of dental caries at age 5 and fluorosis in late-erupting teeth (canines, first and second premolars, second molars) at age 13. The hypothesis was that the protective measures taken by children diagnosed with caries at age 5—when late-erupting teeth are still in their sensitive, formative stages and have yet to erupt—would include the use and potential overuse of fluoride, placing the children at heightened risk of fluorosis in the late-erupting teeth at later ages.

These data provide an example of clustered data, in which observations are collected within units known as clusters. In marginal analyses of clustered data, clusters are typically assumed to be independent and the observations within clusters potentially dependent. Several types of marginal analysis are possible for clustered data, one of which treats the clusters as the primary sampling unit, so that the marginal distribution of a randomly selected observation from a randomly selected cluster is of interest. For the dental study, such a marginal analysis would consider the children as the clusters and the teeth as the observations within clusters.

The data from this study present several challenges in measuring the marginal association between caries at age 5 and fluorosis at age 13 through, say, a correlation analysis. The potential dependence of observations within clusters invalidates the use of standard measures based on independent and identically distributed (i.i.d.) theory. A more complex challenge arises from the interest in the marginal distribution of a randomly selected tooth from a randomly selected child. In particular, while it is clear that measurements at age 5 and age 13 should be paired at the cluster (child) level, it is not entirely clear how the observations within clusters (teeth) should be paired. The teeth of interest that will potentially exhibit fluorosis at age 13 have yet to erupt at age 5 and as such could not be measured for dental caries at age 5. Further, virtually all of the deciduous teeth measured for caries at age 5 are no longer present in the mouth at age 13, precluding an assessment of fluorosis.

A possible solution to this pairing problem would be to map deciduous teeth to permanent teeth by location and perform a marginal correlation analysis, but this strategy is biologically problematic. The hypothesized increase in fluoride usage for a child in which caries is detected at age 5 is not tooth-specific, i.e. the stimulus purported to increase the risk of fluorosis is applied to all teeth within a child, particularly those late-erupting teeth that could not be assessed for caries at age 5. By pairing teeth, even teeth present at both ages, information about the association between dental caries and fluorosis might be lost. For example, it is biologically plausible that a child with dental caries on a deciduous canine at age 5 could exhibit fluorosis in the left premolar at age 13, owing to increased fluoride usage. In short, a marginal correlation analysis with a fixed pairing of observations at the tooth-within-child level does not correspond to the biological process under examination.

An additional problem in the marginal analysis of clustered data that has received recent attention is that of informative cluster size, in which the number of observations within a cluster is correlated with the random variable being measured. When the marginal analysis of interest concerns that of a randomly selected observation from a randomly selected cluster, standard marginal methods for clustered data can be biased under informative cluster size, as observations from large clusters will be overweighted. Hoffman et al.2 introduced the within-cluster resampling (WCR) algorithm to mitigate the effects of informative cluster size. In the WCR procedure, an observation is randomly selected from each of the independent clusters to create a pseudo dataset on which methods based on i.i.d. theory can be applied. The process is repeated and the WCR estimator is defined as the average of the estimates produced from the pseudo datasets. Williamson et al.3 noted that, in the limit, the WCR process assigns to each observation a weight equal to the inverse of the cluster size from which the observation came. They introduced cluster-weighted generalized estimating equations (CWGEEs) as an extension of the standard generalized estimating equation (GEE) model. In CWGEEs, standard estimating equations are weighted by the inverse of the cluster size, resulting in estimators asymptotically equivalent to WCR estimators and negating the need for computationally intensive resampling. The cluster-weighting approach has been applied to define clustered data analogs of the Wilcoxon signed rank4 and rank sum test,5 estimators for survival data,6,7 and parametric and nonparametric correlation measures.8
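The relationship between WCR and inverse-cluster-size weighting can be illustrated with a minimal sketch for the simplest possible parameter, a marginal mean. The function names, the toy data, and the number of resamples are illustrative assumptions, not part of the cited methods.

```python
import random


def wcr_mean(clusters, n_resamples=5000, seed=0):
    """Within-cluster resampling (WCR) estimate of a marginal mean."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # One observation drawn at random from every cluster forms a pseudo dataset.
        pseudo = [rng.choice(cluster) for cluster in clusters]
        # Any i.i.d. method can be applied to the pseudo dataset; here, the sample mean.
        estimates.append(sum(pseudo) / len(pseudo))
    # The WCR estimator averages the estimates over all pseudo datasets.
    return sum(estimates) / len(estimates)


def cluster_weighted_mean(clusters):
    """Limiting form of WCR: each observation is weighted by 1 / (its cluster size)."""
    return sum(sum(c) / len(c) for c in clusters) / len(clusters)


# Made-up clusters of unequal size.
clusters = [[0.0, 2.0, 4.0], [10.0], [1.0, 3.0]]
print(wcr_mean(clusters), cluster_weighted_mean(clusters))
```

With enough resamples the two numbers agree closely, which is the sense in which the weighted approach avoids the resampling loop.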

An additional type of informativeness in clustered data is the informativeness of covariates at the sub-cluster level, in which the probability that a covariate takes a certain value within a given cluster can be informative for the outcome in question. Huang and Leroux9 extended cluster-weighted methods by considering this type of informativeness and introduced doubly weighted generalized estimating equations (DWGEEs). These DWGEEs are based on a two-step WCR procedure—from a given cluster, a particular value of the covariate is randomly chosen from the set of all possible covariate values, followed by the random selection of an observation from the cluster having the randomly selected value. The authors demonstrated that methods correcting for informative cluster size (CWGEEs) and methods correcting for informativeness in covariate distributions, such as inverse probability of treatment weighting,10 were each biased when both cluster size and sub-cluster covariate informativeness were present, while DWGEE estimators remained unbiased. A particular application of this method of reweighting occurs when the potentially informative covariate is a grouping factor and a comparison of a response among groups is desired, for which Dutta and Datta11 developed a rank sum test.

These methods for the marginal analysis of clustered data each operate under a reweighting principle, where the weights are tailored to correct the potentially biasing effects of cluster size or covariate informativeness. In this paper, we suggest a reweighting principle that generalizes prior reweighting methods for the marginal analysis of clustered data. In Section 2, we propose our weighted GEEs and illustrate how they generalize the methods of Williamson et al.3 and Huang and Leroux.9 We extend the weighted GEEs to data unpaired at the within-cluster level and apply these principles to define several correlation estimators for such data. Section 3 presents the results of a simulation study testing the validity of these correlation measures and the analysis of data from the dental study.1 Section 4 presents our concluding remarks.

2 Methods

We seek to estimate the marginal association between two random variables, denoted X and Y, from data collected on M clusters, where the clustering mechanism is defined such that observations from different clusters are independent. Let θ = (θ1, ..., θK) denote a vector of parameters derived from the marginal joint distribution of X and Y. In our applications, we will estimate a parameter of marginal association between X and Y that will be a smooth function of this vector of parameters, i.e. g(θ) for smooth function g.

2.1 Notation and preliminaries—paired clustered data

We begin by establishing notation and methods of weighted estimation for paired clustered data; notation and methods for data unpaired at the within-cluster level are detailed in Section 2.2. We observe bivariate data (X, Y) on a set of M clusters, with each cluster consisting of $n_i$ observed pairs. Denote the total number of observations $N=\sum_{i=1}^{M} n_i$. We assume that the data consist of i.i.d. replicates of observations from each of the M clusters, denoted by the vector V with ith realization $V_i = (n_i, (X_{i1}, Y_{i1}), \ldots, (X_{in_i}, Y_{in_i}))$. This notation indicates that the size of each cluster, $n_i$, is a random variable potentially correlated with the bivariate random variable (X, Y) or functionals of the marginal bivariate distribution of (X, Y).

Let Ui(Yij, Xij, θ) be an estimating function in cluster i for the marginal parameter θ. To estimate the marginal parameter θ, we suggest the following weighted GEE

$$U(\hat\theta)=\sum_{i=1}^{M}\sum_{j=1}^{n_i}\omega_{ij}\,U_i(Y_{ij},X_{ij},\hat\theta)=0 \tag{1}$$

where the weights $\omega_{ij}$ are positive and subject to some natural constraint, e.g. $\sum_{j=1}^{n_i}\omega_{ij}=1$, ∀i. While here we define the constraints per cluster, the weights can easily be normalized over the clusters as well without changing the methodology that follows, e.g. $\sum_{i=1}^{M}\sum_{j=1}^{n_i}\omega_{ij}=1$. In what follows, we demonstrate how equation (1) generalizes cluster-weighted3 and doubly weighted9 GEEs. Further, we show the importance of appropriate selection of the weights $\omega_{ij}$ in the marginal analysis of clustered data. As will be shown, the choice of weights is intimately related to the type of marginal analysis desired, as well as the nature of the relationship (if any) between the cluster size $n_i$, the outcome Y, and the marginal association between X and Y.

When the marginal analysis to be conducted is of a typical observation in the population of all observations, a popular choice for estimating the marginal association parameter θ is the standard GEE. In this case, θ is a functional of the distribution of a typical observation in the population of all possible observations

$$F_g(x,y)=E_V\left\{\sum_{i=1}^{M}\sum_{j=1}^{n_i}N^{-1}\,I[X_{ij}\le x,\,Y_{ij}\le y]\right\}$$

where EV indicates the expectation taken with respect to the distribution of V and I[] is the indicator function. Observations contribute equal information regarding the relationship between X and Y, and thus are weighted equally by setting ωij = 1, ∀i, j, in which case equation (1) yields the standard GEE with independent working correlation matrix

$$U(\hat\theta)=\sum_{i=1}^{M}\sum_{j=1}^{n_i}U_i(Y_{ij},X_{ij},\hat\theta)=0 \tag{2}$$

When cluster size is non-informative, this equation provides an unbiased estimate of θ and any smooth functions thereof in the marginal analysis of a typical observation in the population of all observations.

A second type of marginal analysis for clustered data treats the cluster as the primary sampling unit, so that interest is in a typical observation from a typical cluster. As previously noted,2,3,9 for this type of marginal analysis, the standard GEE approach in equation (2) still yields an appropriate estimate of the marginal parameter θ when cluster size is non-informative. However, when cluster size is informative, the standard GEE will yield biased estimates of θ, favoring larger clusters. In equation (2), cluster i receives weight $n_i/N$ and, as such, larger clusters are favored by receiving greater weight. Williamson et al.3 introduced cluster-weighted estimating equations by setting $\omega_{ij}=n_i^{-1}$

$$U(\hat\theta)=\sum_{i=1}^{M}\sum_{j=1}^{n_i}\frac{1}{n_i}\,U_i(Y_{ij},X_{ij},\hat\theta)=0 \tag{3}$$

Under this approach, it is the clusters rather than the observations that receive equal weight ($\sum_{j=1}^{n_i} n_i^{-1}=1$, ∀i), as observations from larger clusters are down-weighted. The marginal parameter θ is a functional of the marginal distribution of X and Y at the cluster level, defined as

$$F_c(x,y)=E_V\left\{\sum_{i=1}^{M}\sum_{j=1}^{n_i}n_i^{-1}\,I[X_{ij}\le x,\,Y_{ij}\le y]\right\}$$

that is, the distribution of a typical observation from a typical cluster. A key distinction between estimation via estimating equations (2) and (3) is that the θ estimated by equation (2) is a marginal parameter in the population of observations, whereas the θ estimated by equation (3) is a marginal parameter in the population of clusters. While these parameters can be closely related or even coincide, they are often distinct, either by definition or through their relationship to the potentially informative cluster size.
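The distinction between the two populations can be made concrete with a small sketch that solves equations (2) and (3) for the simplest estimating function, $U_i = Y_{ij} - \theta$ (a marginal mean). The function names and the toy data are illustrative assumptions.

```python
def pooled_mean(clusters):
    """Solution of equation (2) with U = Y - theta: every observation has weight 1."""
    total = sum(sum(c) for c in clusters)
    n_obs = sum(len(c) for c in clusters)
    return total / n_obs


def cluster_weighted_mean(clusters):
    """Solution of equation (3) with U = Y - theta: observation weight is 1 / n_i."""
    return sum(sum(c) / len(c) for c in clusters) / len(clusters)


# Made-up data in which the large cluster carries small values,
# i.e. cluster size is informative for the outcome.
clusters = [[1.0, 1.0, 1.0, 1.0], [5.0]]
print(pooled_mean(clusters))            # 1.8: mean in the population of observations
print(cluster_weighted_mean(clusters))  # 3.0: mean for a typical member of a typical cluster
```

The two estimates differ only because cluster size is related to the outcome; with non-informative cluster size both target the same quantity.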

When sub-cluster covariate informativeness is present, not only is the value of the covariate X related to the outcome Y but so is its within-cluster distribution. In this setting, estimation through standard GEEs or even cluster-weighted GEEs can lead to biased estimation of θ. For a simple example, consider the scenario in which Y is positively related to a binary variable X and that clusters with large values of Y tend to be large and are more likely to have X=1. Equation (2) will result in a biased estimate of θ in this scenario, given the overabundance of observations with large values of Y and X=1. Equation (3) will also produce a biased estimate of θ. Even though observations from larger clusters are down-weighted by the inverse cluster size, observations with large values of Y remain more likely to come from clusters in which observations with X=1 predominate. The DWGEEs of Huang and Leroux9 are obtained by defining the weights in equation (1) in terms of the sub-cluster covariate distribution

$$U(\hat\theta)=\sum_{i=1}^{M}\sum_{j=1}^{n_i}\frac{1}{n_{iX_{ij}}}\,U_i(Y_{ij},X_{ij},\hat\theta)=0 \tag{4}$$

where $n_{iX_{ij}}$ represents the number of observations in cluster i taking value $X_{ij}$. For example, if X is categorical with K levels, then the ith cluster will be composed of $n_{i1}$ observations from level 1, $n_{i2}$ observations from level 2, ..., and $n_{iK}$ observations from level K, so that $n_i=\sum_{k=1}^{K}n_{ik}$.

The distinct values of the covariate X define “sub-clusters,” which are equally weighted by estimating equation (4). In the K-level categorical X example, each level of X receives weight 1, so that each cluster receives weight K. When θ is estimated using the CWGEE (3), we note that the marginal analysis of interest is for a typical observation from a typical cluster. The same is true of the DWGEE (4). The distinction between CWGEEs and DWGEEs lies in the interpretation of “typical.” For CWGEEs, a typical X observation is defined by the observed distribution of X values. For DWGEEs, a typical X observation is defined so that each possible value for the covariate X is equally likely, i.e. the observed within-cluster distribution of X is marginalized. In this way, any bias in estimating θ that results from a relationship between Y and the within-cluster distribution of X is marginalized. The marginal distribution under consideration when estimating by DWGEEs is

$$F_d(x,y)=E_V\left\{\sum_{i=1}^{M}\sum_{j=1}^{n_i}n_{iX_{ij}}^{-1}\,I[X_{ij}\le x,\,Y_{ij}\le y]\right\}$$
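As a rough illustration of the doubly weighted scheme, the sketch below computes the weights $1/n_{iX_{ij}}$ for one cluster and solves equation (4) for the estimating function $U_i = Y_{ij} - \theta$. The function names, the (x, y) tuple layout, and the data are assumptions made for this example only.

```python
from collections import Counter


def dw_weights(x_values):
    """Weights 1 / n_{iX_ij} for one cluster: inverse count of each observation's own X level."""
    counts = Counter(x_values)
    return [1.0 / counts[x] for x in x_values]


def doubly_weighted_mean(clusters):
    """Solve equation (4) for U = Y - theta; each cluster is a list of (x, y) pairs."""
    numerator = 0.0
    denominator = 0.0
    for cluster in clusters:
        weights = dw_weights([x for x, _ in cluster])
        numerator += sum(w * y for w, (_, y) in zip(weights, cluster))
        # The weights of a cluster sum to the number of distinct X levels it contains.
        denominator += sum(weights)
    return numerator / denominator


# One made-up cluster: three observations with X = 0 and one with X = 1.
cluster = [(0, 1.0), (0, 1.0), (0, 1.0), (1, 5.0)]
print(dw_weights([x for x, _ in cluster]))
print(doubly_weighted_mean([cluster]))
```

In this toy cluster the two X levels contribute equally, giving 3.0, whereas weighting by inverse cluster size alone would give (1 + 1 + 1 + 5) / 4 = 2.0.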

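The variance of the estimator $\hat\theta$ obtained by solving equation (1) can be estimated in sandwich form by $\hat\Sigma = M^{-1}\hat A^{-1}\hat B\hat A^{-1}$, where

$$\hat A=\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{n_i}\omega_{ij}\left.\frac{\partial U_i(Y_{ij},X_{ij},\theta)}{\partial\theta}\right|_{\theta=\hat\theta}$$

$$\hat B=\frac{1}{M}\sum_{i=1}^{M}\left\{\sum_{j=1}^{n_i}\omega_{ij}\,U_i(Y_{ij},X_{ij},\hat\theta)\right\}\times\left\{\sum_{j=1}^{n_i}\omega_{ij}\,U_i(Y_{ij},X_{ij},\hat\theta)\right\}^{T}$$

This variance estimator corresponds with estimators established for both CWGEEs3 and DWGEEs.9 Next, we define marginal correlation coefficient estimators that are smooth functions, g, of the marginal parameter θ. The correlation estimators are defined simply as $g(\hat\theta)$, with variance estimators defined via the delta method as $\hat\sigma^2 = M^{-1}\hat l^{T}\hat\Sigma\,\hat l$, where $\hat l$ is the vector of first partial derivatives of g with respect to θ evaluated at the estimated value of θ

$$\hat l=\left(\left.\frac{\partial g}{\partial\theta_1}\right|_{\theta_1=\hat\theta_1},\ldots,\left.\frac{\partial g}{\partial\theta_K}\right|_{\theta_K=\hat\theta_K}\right)$$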
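For a scalar parameter the sandwich form reduces to a few lines. The sketch below works it through for the cluster-weighted mean, i.e. $U_i = Y_{ij} - \theta$ with $\omega_{ij} = n_i^{-1}$. It reads $\hat\Sigma = M^{-1}\hat A^{-1}\hat B\hat A^{-1}$ as the estimated variance of $\hat\theta$, does not carry out the delta-method step, and uses names and data that are purely illustrative.

```python
def cluster_weighted_mean_with_variance(clusters):
    """Cluster-weighted mean and its sandwich variance estimate (scalar case)."""
    M = len(clusters)
    # With weights 1 / n_i, the weighted within-cluster sum of Y is the cluster mean.
    cluster_means = [sum(c) / len(c) for c in clusters]
    theta_hat = sum(cluster_means) / M

    # A-hat: dU/dtheta = -1 and the weights sum to 1 within each cluster.
    a_hat = -1.0
    # B-hat: average over clusters of the squared weighted score.
    scores = [m - theta_hat for m in cluster_means]
    b_hat = sum(s * s for s in scores) / M

    sigma_hat = b_hat / (a_hat * a_hat) / M
    return theta_hat, sigma_hat


clusters = [[0.0, 2.0, 4.0], [10.0], [1.0, 3.0]]  # made-up values
theta_hat, sigma_hat = cluster_weighted_mean_with_variance(clusters)
print(theta_hat, sigma_hat)
```

In this special case the variance estimate is simply the (biased) sample variance of the cluster means divided by M.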

2.2 Unpaired clustered data

When data are unpaired at the within-cluster level, as in the dental data described in the introduction,1 unconventional notation is required. We continue to denote the random variables in whose correlation we are interested as X and Y and the data in the ith cluster as Vi. Now, we define

$$V_i=(n_{ix},\,n_{iy},\,X_{i1},\ldots,X_{in_{ix}},\,Y_{i1},\ldots,Y_{in_{iy}})$$

Several characteristics of data unpaired at the within-cluster level are implicit in this notation—that the number of X observations nix and the number of Y observations niy in each cluster are random, potentially unequal, and potentially correlated with, and thus informative for, X, Y, and any association between X and Y. Our interest remains in the marginal association between X and Y, defined as some function g of the marginal parameter θ and estimating function Ui(Xij, Yij, θ), which require no changes from before. In what follows, we will refer to this type of data as unpaired clustered data.

In the introduction, we noted that both CWGEEs and DWGEEs for paired clustered data arise from limiting calculations applied to within-cluster resampling2 schemes. For CWGEEs, one bivariate observation per cluster is randomly selected to create the pseudo datasets on which the marginal parameter θ is estimated and averaged.3 Resampling and estimation proceed in the same way for DWGEEs, except that the random selection of an observation within each cluster is preceded by the random selection of a sub-cluster based on the discrete values of the informative covariate X.9 A within-cluster resampling scheme for unpaired clustered data involves randomly and independently selecting one X and one Y observation from each cluster to build the pseudo datasets on which θ is estimated and averaged. Such a sampling scheme is in accordance with the characteristics of unpaired clustered data, in which X and Y observations are not naturally paired and all possible pairings of X and Y observations from a given cluster are equally valid. A limiting calculation applied to this resampling scheme produces the following estimating equation for unpaired clustered data

$$U(\hat\theta)=\sum_{i=1}^{M}\sum_{j_x=1}^{n_{ix}}\sum_{j_y=1}^{n_{iy}}\frac{1}{n_{ix}n_{iy}}\,U_i(Y_{ij_y},X_{ij_x},\hat\theta)=0 \tag{5}$$

In equation (5), each possible pairing of X and Y observations is represented and assigned a weight equal to the inverse of the number of possible X–Y pairings in each cluster (nixniy). Thus, estimators for unpaired clustered data arise in a similar way as from the weighted estimating equation (1), although the notation there does not accommodate the notation for unpaired clustered data. Further, if either or both of the sizes of the X and Y data in a given cluster are informative for θ, then estimation via equation (5) will remain valid. Like CWGEEs and DWGEEs, the marginal analysis of interest here still concerns a typical observation from a typical cluster. Because the data are unpaired at the within-cluster level, a typical observation is characterized as any pairing of a single X and a single Y observation from a particular cluster, which has the marginal distribution

$$F_u(x,y)=E_V\left\{\sum_{i=1}^{M}\sum_{j_x=1}^{n_{ix}}\sum_{j_y=1}^{n_{iy}}n_{ix}^{-1}n_{iy}^{-1}\,I[X_{ij_x}\le x,\,Y_{ij_y}\le y]\right\}$$
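The limiting argument can be checked numerically. The sketch below estimates the cross moment E[XY] in two ways: by repeatedly drawing one X and one Y independently from each cluster, and by weighting all X-Y pairings in a cluster by $1/(n_{ix}n_{iy})$ as in equation (5). Function names, data, and the number of resamples are illustrative assumptions.

```python
import random


def unpaired_wcr_cross_moment(x_clusters, y_clusters, n_resamples=5000, seed=0):
    """Resampling estimate of E[XY]: one X and one Y drawn independently per cluster."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_resamples):
        products = [rng.choice(xs) * rng.choice(ys)
                    for xs, ys in zip(x_clusters, y_clusters)]
        total += sum(products) / len(products)
    return total / n_resamples


def all_pairs_cross_moment(x_clusters, y_clusters):
    """Closed form: every X-Y pairing in a cluster gets weight 1 / (n_ix * n_iy)."""
    M = len(x_clusters)
    total = 0.0
    for xs, ys in zip(x_clusters, y_clusters):
        pair_sum = sum(x * y for x in xs for y in ys)
        total += pair_sum / (len(xs) * len(ys))
    return total / M


# Made-up unpaired data: cluster 1 has two X and one Y, cluster 2 has one X and two Y.
x_clusters = [[1.0, 2.0], [3.0]]
y_clusters = [[1.0], [2.0, 4.0]]
print(unpaired_wcr_cross_moment(x_clusters, y_clusters))
print(all_pairs_cross_moment(x_clusters, y_clusters))
```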

The variance estimator Σ̂ takes the same form previously shown, with

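$$\hat A=\frac{1}{M}\sum_{i=1}^{M}\sum_{j_x=1}^{n_{ix}}\sum_{j_y=1}^{n_{iy}}\frac{1}{n_{ix}n_{iy}}\left.\frac{\partial U_i(Y_{ij_y},X_{ij_x},\theta)}{\partial\theta}\right|_{\theta=\hat\theta}$$

$$\hat B=\frac{1}{M}\sum_{i=1}^{M}\left\{\sum_{j_x=1}^{n_{ix}}\sum_{j_y=1}^{n_{iy}}\frac{1}{n_{ix}n_{iy}}\,U_i(Y_{ij_y},X_{ij_x},\hat\theta)\right\}\times\left\{\sum_{j_x=1}^{n_{ix}}\sum_{j_y=1}^{n_{iy}}\frac{1}{n_{ix}n_{iy}}\,U_i(Y_{ij_y},X_{ij_x},\hat\theta)\right\}^{T}$$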

The delta method calculations for estimating smooth functions of θ noted previously remain unchanged in the unpaired clustered data framework.

As an example of estimation for unpaired clustered data, consider estimation of the marginal moment $m_{kl} = E[X^kY^l]$. The estimating function for $\theta = m_{kl}$ is $U_i(Y_{ij},X_{ij},\theta)=X_{ij}^kY_{ij}^l-m_{kl}$. By solving equation (5), we obtain the estimator

$$\hat m_{kl}=\frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_{ix}n_{iy}}\sum_{j_y=1}^{n_{iy}}\sum_{j_x=1}^{n_{ix}}X_{ij_x}^{k}\,Y_{ij_y}^{l} \tag{6}$$

When either k or l is 0 and thus a marginal moment of X (l=0) or a marginal moment of Y (k=0) is being estimated, the interior sum in equation (6) corresponding to the zero exponent drops out, and the moment estimator corresponds to the CWGEE estimator. For example, in estimating $m_{20} = E[X^2]$ for unpaired clustered data, we find that $\hat m_{20}=M^{-1}\sum_{i=1}^{M}n_{ix}^{-1}\sum_{j_x=1}^{n_{ix}}X_{ij_x}^2$, since $\sum_{j_y=1}^{n_{iy}}n_{iy}^{-1}=1$. We will now use these moment estimators to motivate the estimation of product moment correlations detailed in Section 2.3.
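A minimal sketch of the moment estimator in equation (6) follows. It relies on the fact that the double sum over all pairings factorizes into a product of two within-cluster means, which also makes the reduction to the cluster-weighted form for k = 0 or l = 0 immediate. The function name and the data are illustrative assumptions.

```python
def unpaired_moment(x_clusters, y_clusters, k, l):
    """Cluster-weighted estimate of E[X^k Y^l] for unpaired clustered data (equation (6))."""
    M = len(x_clusters)
    total = 0.0
    for xs, ys in zip(x_clusters, y_clusters):
        # (1 / (n_ix n_iy)) * sum over all pairs of X^k Y^l
        #   = (mean of X^k within the cluster) * (mean of Y^l within the cluster)
        mean_xk = sum(x ** k for x in xs) / len(xs)
        mean_yl = sum(y ** l for y in ys) / len(ys)
        total += mean_xk * mean_yl
    return total / M


# Made-up unpaired data with unequal numbers of X and Y observations per cluster.
x_clusters = [[1.0, 2.0], [3.0]]
y_clusters = [[1.0], [2.0, 4.0]]
print(unpaired_moment(x_clusters, y_clusters, 1, 1))  # cross moment
print(unpaired_moment(x_clusters, y_clusters, 2, 0))  # depends on the X data only
```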

2.3 Applications to correlation analysis for unpaired data

Product moment correlations are functions of five moments of varying type from the bivariate distribution of X and Y. Letting the vector z = (z1, z2, z3, z4, z5) represent the five moments, the familiar product moment formula is

$$g(z)=\frac{z_3-z_1z_2}{\sqrt{(z_4-z_1^2)(z_5-z_2^2)}} \tag{7}$$

The Pearson correlation is the most well-known product moment correlation, for which g in equation (7) is evaluated at the vector of the first five raw moments of the bivariate distribution of X and Y: $\rho = g(m_{10},m_{01},m_{11},m_{20},m_{02})$. The estimator is formed by applying the product moment formula to the first five raw sample moments, $\hat\rho = g(\hat m_{10},\hat m_{01},\hat m_{11},\hat m_{20},\hat m_{02})$, where $\hat m_{kl}$ is as defined in Section 2.2. We briefly note that the correlation coefficient ρ is not directly estimated as the solution of the weighted estimating equation (1). However, ρ is a smooth function of the first five raw moments of X and Y, estimators of which can be obtained as solutions of equation (1).
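Putting equations (6) and (7) together, a Pearson coefficient for unpaired clustered data can be sketched as below. The function names and data are illustrative assumptions, and no variance estimate is computed here.

```python
import math


def unpaired_moment(x_clusters, y_clusters, k, l):
    """Cluster-weighted estimate of E[X^k Y^l] from equation (6)."""
    total = 0.0
    for xs, ys in zip(x_clusters, y_clusters):
        total += (sum(x ** k for x in xs) / len(xs)) * (sum(y ** l for y in ys) / len(ys))
    return total / len(x_clusters)


def product_moment(z1, z2, z3, z4, z5):
    """Product moment formula g(z) of equation (7)."""
    return (z3 - z1 * z2) / math.sqrt((z4 - z1 ** 2) * (z5 - z2 ** 2))


def unpaired_pearson(x_clusters, y_clusters):
    """Pearson correlation as g applied to the five cluster-weighted raw moments."""
    def m(k, l):
        return unpaired_moment(x_clusters, y_clusters, k, l)
    return product_moment(m(1, 0), m(0, 1), m(1, 1), m(2, 0), m(0, 2))


# Made-up unpaired data: numbers of X and Y observations differ within clusters.
x_clusters = [[1.0, 2.0], [3.0], [5.0, 6.0, 7.0]]
y_clusters = [[1.0], [2.0, 4.0], [6.0, 5.0]]
print(unpaired_pearson(x_clusters, y_clusters))
```

Because every X-Y pairing within a cluster is used, the coefficient reflects association carried at the cluster level rather than any particular within-cluster pairing.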

The Spearman correlation coefficient is a popular alternative to the Pearson coefficient, and is typically employed to detect nonlinear association or relationships between discrete or skewed continuous random variables. The Spearman coefficient is also a product moment correlation, but is calculated on the first five “rank moments” of the bivariate distribution of X and Y. These rank moments are not defined as solutions to the weighted estimating equation (1), in contrast with the Pearson coefficient. Therefore, we must define a cluster-weighted rank function to define the rank moments that will comprise a cluster-weighted Spearman coefficient. As previously noted,4,5 informative cluster size invalidates standard formulas for rank functions such as $\mathrm{rank}(X_{ij_x})=\sum_{i'=1}^{M}\sum_{j'_x=1}^{n_{i'x}}I[X_{i'j'_x}\le X_{ij_x}]$. We define the rank of $X_{ij_x}$ among all X observations as $R_{X_{ij_x}}=\frac{1}{2}\left(\hat F_X(X_{ij_x})+\hat F_X(X_{ij_x}-)\right)$, where $\hat F_X(x)=\frac{1}{M}\sum_{i=1}^{M}F_{X_i}(x)$, and $F_{X_i}(x)=\sum_{j=1}^{n_{ix}}n_{ix}^{-1}I[X_{ij}\le x]$. Ranks for the Y data, $R_{Y_{ij_y}}$, are defined analogously, with corresponding functions $\hat F_Y(y)$ and $F_{Y_i}(y)$. These cluster-weighted rank functions have been previously used to define rank sum5 and signed rank4 tests for clustered data, and we note that tied observations are accounted for through a midrank calculation implied by the formulas. The cluster-weighted rank moments for unpaired clustered data are then defined as

$$\hat r_{kl}=\frac{1}{M}\sum_{i=1}^{M}\sum_{j_x=1}^{n_{ix}}\sum_{j_y=1}^{n_{iy}}\frac{1}{n_{ix}n_{iy}}\,R_{X_{ij_x}}^{k}\,R_{Y_{ij_y}}^{l}$$

and the marginal Spearman correlation for unpaired clustered data can be calculated as $\hat\rho_s = g(\hat r_{10}, \hat r_{01}, \hat r_{11}, \hat r_{20}, \hat r_{02})$.
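The cluster-weighted midranks and the resulting Spearman coefficient can be sketched as follows. The sketch assumes the rank moments are averaged over the M clusters in the same way as the raw moments of equation (6); the function names and data are illustrative.

```python
import math


def cw_ecdf(clusters, value, strict=False):
    """Cluster-weighted distribution function: F-hat(value), or F-hat(value-) if strict."""
    total = 0.0
    for obs in clusters:
        if strict:
            count = sum(1 for o in obs if o < value)
        else:
            count = sum(1 for o in obs if o <= value)
        # Within-cluster proportion, so every cluster contributes equally.
        total += count / len(obs)
    return total / len(clusters)


def cw_rank(clusters, value):
    """Midrank on the (0, 1) scale: average of F-hat(value) and F-hat(value-)."""
    return 0.5 * (cw_ecdf(clusters, value) + cw_ecdf(clusters, value, strict=True))


def unpaired_spearman(x_clusters, y_clusters):
    """Product moment formula applied to cluster-weighted rank moments."""
    M = len(x_clusters)

    def r(k, l):
        total = 0.0
        for xs, ys in zip(x_clusters, y_clusters):
            mean_rx = sum(cw_rank(x_clusters, x) ** k for x in xs) / len(xs)
            mean_ry = sum(cw_rank(y_clusters, y) ** l for y in ys) / len(ys)
            total += mean_rx * mean_ry
        return total / M

    z1, z2, z3, z4, z5 = r(1, 0), r(0, 1), r(1, 1), r(2, 0), r(0, 2)
    return (z3 - z1 * z2) / math.sqrt((z4 - z1 ** 2) * (z5 - z2 ** 2))


# Made-up data: Y is a monotone but nonlinear function of X at the cluster level.
x_clusters = [[1.0], [2.0], [3.0]]
y_clusters = [[1.0], [8.0], [27.0]]
print([cw_rank(x_clusters, x) for xs in x_clusters for x in xs])
print(unpaired_spearman(x_clusters, y_clusters))
```

Tied values receive the same midrank automatically, since both the weak and the strict inequality enter the rank.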

When X and Y are both binary random variables taking values 0 or 1, equation (7) yields the phi coefficient, a commonly used measure of association for binary data. The phi coefficient also has an alternative formulation in terms of cell counts from a 2×2 contingency table. Much like the rank functions, the cell counts from 2×2 tables calculated on clustered data with potentially informative cluster size require weighting. The cluster-weighted cell counts for unpaired clustered data are

$$\hat n_{kl}=\sum_{i=1}^{M}\sum_{j_x=1}^{n_{ix}}\sum_{j_y=1}^{n_{iy}}\frac{1}{n_{ix}n_{iy}}\,I[X_{ij_x}=k,\,Y_{ij_y}=l]$$

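where k and l take values 0 or 1. The cluster-weighted marginal totals for the 2×2 table can be calculated as $\hat n_{1\cdot} = \hat n_{11} + \hat n_{10}$, $\hat n_{0\cdot} = \hat n_{01} + \hat n_{00}$, $\hat n_{\cdot 1} = \hat n_{11} + \hat n_{01}$, and $\hat n_{\cdot 0} = \hat n_{10} + \hat n_{00}$. In terms of these weighted cell counts, the alternative formulation for the cluster-weighted phi coefficient is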

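$$\hat\phi=\frac{\hat n_{11}\hat n_{00}-\hat n_{10}\hat n_{01}}{\sqrt{\hat n_{1\cdot}\,\hat n_{\cdot 1}\,\hat n_{0\cdot}\,\hat n_{\cdot 0}}}$$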
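The weighted 2×2 table and the phi coefficient can be sketched in a few lines. As with the raw moments, the sum over all pairings in a cluster factorizes into a product of within-cluster proportions. The function names and the binary data are illustrative assumptions.

```python
import math


def weighted_cell_counts(x_clusters, y_clusters):
    """Cluster-weighted 2x2 cell counts n-hat_kl for binary (0/1) unpaired data."""
    counts = {(k, l): 0.0 for k in (0, 1) for l in (0, 1)}
    for xs, ys in zip(x_clusters, y_clusters):
        for k in (0, 1):
            prop_x = sum(1 for x in xs if x == k) / len(xs)
            for l in (0, 1):
                prop_y = sum(1 for y in ys if y == l) / len(ys)
                # Sum over all X-Y pairings of the indicator, weighted by 1 / (n_ix n_iy).
                counts[(k, l)] += prop_x * prop_y
    return counts


def unpaired_phi(x_clusters, y_clusters):
    """Phi coefficient computed from the cluster-weighted cell counts."""
    n = weighted_cell_counts(x_clusters, y_clusters)
    n1_dot = n[(1, 1)] + n[(1, 0)]
    n0_dot = n[(0, 1)] + n[(0, 0)]
    n_dot1 = n[(1, 1)] + n[(0, 1)]
    n_dot0 = n[(1, 0)] + n[(0, 0)]
    numerator = n[(1, 1)] * n[(0, 0)] - n[(1, 0)] * n[(0, 1)]
    return numerator / math.sqrt(n1_dot * n_dot1 * n0_dot * n_dot0)


# Made-up binary data for two clusters.
x_clusters = [[0, 1], [1]]
y_clusters = [[1], [0, 1]]
print(weighted_cell_counts(x_clusters, y_clusters))
print(unpaired_phi(x_clusters, y_clusters))
```

Each cluster contributes a total weight of 1 to the table, so the four weighted cell counts sum to the number of clusters M.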

The delta method calculation described previously can be employed to derive variance estimators for each of the three product moment correlation coefficients. Because the five cluster-weighted moments used in the product moment correlation formula are asymptotically normal2,3 and the product moment correlations are a smooth function of the cluster-weighted moments, the asymptotic distributions of the Pearson, Spearman, and phi coefficients are normal. We use this fact to generate confidence intervals for each of the product moment coefficients, the performance of which will be evaluated in Section 3.

The polyserial correlation12 is commonly used for bivariate data in which one variable is continuous and one discrete. We now assume X to be continuous and Y discrete taking values 1, ..., K. It is assumed that values of Y arise from a latent random variable U~N(0, 1) according to the formula

$$P[Y=k]=P[\xi_{k-1}\le U<\xi_k],\qquad k=1,\ldots,K-1$$

where ξk are partition points of the real line, ξ0 = −∞, and ξK=∞. The pair (X, U) is assumed to be bivariate normal and, for brevity of presentation, we assume X to be standard normal as well. When X follows a non-standard normal distribution, its mean and variance can be estimated using the cluster-weighted moment estimators described in Section 2.2.

In developing the biserial coefficient, Cox12 defined the estimating equations for the latent correlation ρ and the partition points ξk. Cluster-weighted versions of these estimating equations for unpaired clustered data are

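$$
\begin{aligned}
U_i(\rho)&=\frac{1}{(1-\rho^2)^{3/2}}\sum_{i=1}^{M}\sum_{j_x=1}^{n_{ix}}\sum_{j_y=1}^{n_{iy}}\sum_{k=1}^{K}\frac{I[Y_{ij_y}=k]}{n_{ix}n_{iy}}\times\left\{\rho\,\frac{\xi_k\phi_k(X_{ij_x})-\xi_{k-1}\phi_{k-1}(X_{ij_x})}{\Phi_k(X_{ij_x})-\Phi_{k-1}(X_{ij_x})}-\frac{\phi_k(X_{ij_x})-\phi_{k-1}(X_{ij_x})}{\Phi_k(X_{ij_x})-\Phi_{k-1}(X_{ij_x})}\right\}\\
U_i(\xi_r)&=\frac{1}{(1-\rho^2)^{1/2}}\sum_{i=1}^{M}\sum_{j_x=1}^{n_{ix}}\sum_{j_y=1}^{n_{iy}}\sum_{k=1}^{K}\frac{I[Y_{ij_y}=k]}{n_{ix}n_{iy}}\times\left\{\frac{\phi_k(X_{ij_x})\,I[k=r]}{\Phi_k(X_{ij_x})-\Phi_{k-1}(X_{ij_x})}-\frac{\phi_{k-1}(X_{ij_x})\,I[k-1=r]}{\Phi_k(X_{ij_x})-\Phi_{k-1}(X_{ij_x})}\right\}
\end{aligned}
\tag{8}
$$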

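where $\Phi_y(x)=\Phi\left((\xi_y-\rho x)/\sqrt{1-\rho^2}\right)$, $\phi_y(x)=\phi\left((\xi_y-\rho x)/\sqrt{1-\rho^2}\right)$, ϕ and Φ denote the standard normal density and distribution functions, $\phi_0(x)=\phi_K(x)=0$, $\Phi_0(x)=0$, and $\Phi_K(x)=1$.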
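The quantities $\Phi_k(x)$ have a direct interpretation under the latent bivariate normal model: the difference $\Phi_k(x)-\Phi_{k-1}(x)$, which appears in the denominators of equation (8), is the conditional probability that Y = k given X = x. The sketch below evaluates these probabilities; the function name and the parameter values are illustrative assumptions, and the estimating equations themselves are not solved here.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def category_probabilities(x, rho, cut_points):
    """P[Y = k | X = x] = Phi_k(x) - Phi_{k-1}(x) for k = 1, ..., K.

    cut_points holds the interior partition points xi_1 < ... < xi_{K-1}.
    """
    scale = math.sqrt(1.0 - rho ** 2)
    # Phi_0(x) = 0 and Phi_K(x) = 1 bracket the interior values.
    cdf_values = [0.0]
    cdf_values += [STD_NORMAL.cdf((xi - rho * x) / scale) for xi in cut_points]
    cdf_values += [1.0]
    return [cdf_values[k] - cdf_values[k - 1] for k in range(1, len(cdf_values))]


# Illustrative values: four categories and a moderately positive latent correlation.
probs = category_probabilities(x=0.5, rho=0.3, cut_points=[-1.2, -0.1, 1.3])
print(probs, sum(probs))
```

With a positive ρ, larger values of x shift probability toward the higher categories, which is the dependence the latent correlation is meant to capture.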

The estimates $\hat\rho$ and $\hat\xi_r$ are obtained by solving equation (8) numerically. Starting values for the vector $(\rho, \xi_1, \ldots, \xi_{K-1})^T$ are obtained as cluster-weighted analogs of the starting estimates provided by Cox12 for i.i.d. data. Let $\pi_k=M^{-1}\sum_{i=1}^{M}n_{iy}^{-1}\sum_{j_y=1}^{n_{iy}}I[Y_{ij_y}=k]$ be the cluster-weighted proportion of Y observations taking value k. Then the starting values are given by

$$\xi_r=\Phi^{-1}\left(\sum_{k=1}^{r}\pi_k\right),\qquad \rho=\frac{s_y\,r_{xy}}{\sum_{k=1}^{K-1}\phi\left(\Phi^{-1}(\pi_k)\right)}$$

in which we have that

$$s_y^2=M^{-1}\sum_{i=1}^{M}n_{iy}^{-1}\sum_{j_y=1}^{n_{iy}}Y_{ij_y}^2-\left(M^{-1}\sum_{i=1}^{M}n_{iy}^{-1}\sum_{j_y=1}^{n_{iy}}Y_{ij_y}\right)^2$$

a cluster-weighted variance estimate for Y, and $r_{xy}$ is the cluster-weighted product moment correlation of (X, Y) defined earlier in this section. Finally, the variance-covariance matrix of $(\hat\rho, \hat\xi_1, \ldots, \hat\xi_{K-1})^T$ can be estimated using the sandwich formula noted in Sections 2.1 and 2.2.
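The starting values can be computed directly from the cluster-weighted proportions. The sketch below follows the formulas as printed above; the function names and data are illustrative assumptions, and the values of $s_y$ and $r_{xy}$ passed to `starting_rho` are taken as given rather than computed.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def cw_proportions(y_clusters, K):
    """Cluster-weighted proportions pi_k of Y observations in categories 1..K."""
    M = len(y_clusters)
    props = []
    for k in range(1, K + 1):
        total = sum(sum(1 for y in ys if y == k) / len(ys) for ys in y_clusters)
        props.append(total / M)
    return props


def starting_cut_points(props):
    """xi_r = Phi^{-1}(pi_1 + ... + pi_r) for r = 1..K-1.

    inv_cdf requires every cumulative proportion to lie strictly between 0 and 1.
    """
    cuts, cumulative = [], 0.0
    for p in props[:-1]:
        cumulative += p
        cuts.append(STD_NORMAL.inv_cdf(cumulative))
    return cuts


def starting_rho(props, s_y, r_xy):
    """rho = s_y * r_xy / sum_{k=1}^{K-1} phi(Phi^{-1}(pi_k)), as printed above."""
    denominator = sum(STD_NORMAL.pdf(STD_NORMAL.inv_cdf(p)) for p in props[:-1])
    return s_y * r_xy / denominator


# Made-up Y data with K = 3 categories in three clusters of unequal size.
y_clusters = [[1, 2], [2, 3], [1, 3, 3, 3]]
props = cw_proportions(y_clusters, K=3)
print(props)
print(starting_cut_points(props))
```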

3 Results

3.1 Simulation study

We evaluated the correlation estimators for unpaired data from Section 2.3 with a simple simulation study. We simulated data under an unpaired “multivariate random effects” structure, the details of which follow. For each cluster, we first simulated the random effects for X and Y as the bivariate normal pair

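$$\begin{pmatrix}u_i\\ v_i\end{pmatrix}\sim N\left(\begin{pmatrix}0\\ 0\end{pmatrix},\begin{pmatrix}1&\gamma\\ \gamma&1\end{pmatrix}\right)$$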

We then independently simulated the cluster sizes nix and niy in each cluster as

$$n_{ix},\,n_{iy}\sim\begin{cases}\mathrm{BIN}(n_1,\,0.9)&\text{if }u_iv_i>0\\ \mathrm{BIN}(n_2,\,0.9)&\text{otherwise}\end{cases}$$

where (n1, n2)=(20, 10) in one set of simulations and (n1, n2)=(10, 20) in another. Using these cluster sizes, we then simulated the “model errors” for each cluster as the vector

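$$(\mu_{i1},\ldots,\mu_{in_{ix}},\,\nu_{i1},\ldots,\nu_{in_{iy}})\sim N\left(\mathbf{0},\begin{pmatrix}\Delta_x&P^{T}\\ P&\Delta_y\end{pmatrix}\right)$$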

where 0 denotes a vector of nix+niy zeros, Δx the nix × nix compound symmetry matrix with correlation parameter δx, Δy the niy×niy compound symmetry matrix with correlation parameter δy, and P the niy × nix matrix with every element equal to ρ. The data in cluster i were then generated as the sum of the random effects and model errors

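$$V_i=\{n_{ix},\,n_{iy},\,X_{i1},\ldots,X_{in_{ix}},\,Y_{i1},\ldots,Y_{in_{iy}}\}=\{n_{ix},\,n_{iy},\,(u_i+\mu_{i1}),\ldots,(u_i+\mu_{in_{ix}}),\,(v_i+\nu_{i1}),\ldots,(v_i+\nu_{in_{iy}})\}$$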

Data simulated under this model have several important features. Under the compound symmetry structure, $\mathrm{Corr}(X_{ij_1},X_{ij_2}) = \delta_x$ and $\mathrm{Corr}(Y_{ij_1},Y_{ij_2}) = \delta_y$. Since the random effects $u_i$ and $v_i$ were generated independently of the model errors $\mu_{ij_x}$ and $\nu_{ij_y}$, we have that Var(X) = Var(Y) = 2 and Cov(X,Y) = ρ + γ and thus $\mathrm{Corr}(X_{ij_1}, Y_{ij_2})=(\rho+\gamma)/2$. Of particular importance to note is that the cluster sizes $n_{ix}$ and $n_{iy}$ were informative for the correlation between X and Y. Under the setting in which (n1, n2)=(20, 10), clusters with concordant (discordant) random effects $u_i$ and $v_i$ were simulated from a binomial distribution with size 20 (10). Thus, clusters with concordant random effects (clusters with evidence of a positive correlation) tended to be larger in size than clusters with discordant random effects. The reverse was true when (n1, n2)=(10, 20): clusters with discordant random effects tended to be larger in size.
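The data-generating model can be sketched as below. Two assumptions are made that go beyond the description above: the binomial success probability is taken to be 0.9, and the model errors are generated through shared components a and b (with Var(a) = δx, Var(b) = δy, Cov(a, b) = ρ) plus independent noise, which reproduces the stated covariance matrix whenever ρ² ≤ δx·δy. Cluster sizes are also forced to be at least 1. All function names are illustrative.

```python
import math
import random


def simulate_cluster(rng, rho, gamma, n1, n2, delta_x=0.7, delta_y=0.7, p=0.9):
    """Simulate the X and Y observations of one unpaired cluster."""
    # Random effects (u, v): standard bivariate normal with correlation gamma.
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    u, v = z1, gamma * z1 + math.sqrt(1 - gamma ** 2) * z2
    # Informative cluster sizes: binomial size n1 if u and v are concordant, else n2.
    trials = n1 if u * v > 0 else n2
    n_x = max(1, sum(rng.random() < p for _ in range(trials)))
    n_y = max(1, sum(rng.random() < p for _ in range(trials)))
    # Shared error components give compound symmetry within X and within Y,
    # and covariance rho between every X error and every Y error.
    e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
    a = math.sqrt(delta_x) * e1
    b = rho / math.sqrt(delta_x) * e1 + math.sqrt(max(0.0, delta_y - rho ** 2 / delta_x)) * e2
    xs = [u + a + math.sqrt(1 - delta_x) * rng.gauss(0, 1) for _ in range(n_x)]
    ys = [v + b + math.sqrt(1 - delta_y) * rng.gauss(0, 1) for _ in range(n_y)]
    return xs, ys


def unpaired_pearson(x_clusters, y_clusters):
    """Pearson coefficient from cluster-weighted moments (equations (6) and (7))."""
    def m(k, l):
        return sum((sum(x ** k for x in xs) / len(xs)) * (sum(y ** l for y in ys) / len(ys))
                   for xs, ys in zip(x_clusters, y_clusters)) / len(x_clusters)
    z1, z2, z3, z4, z5 = m(1, 0), m(0, 1), m(1, 1), m(2, 0), m(0, 2)
    return (z3 - z1 * z2) / math.sqrt((z4 - z1 ** 2) * (z5 - z2 ** 2))


rng = random.Random(2016)
data = [simulate_cluster(rng, rho=0.2, gamma=0.4, n1=20, n2=10) for _ in range(100)]
print(unpaired_pearson([d[0] for d in data], [d[1] for d in data]))  # target: (0.2 + 0.4) / 2
```

A single replicate with M = 100 clusters is noisy; averaging over many replicates, or using many more clusters, brings the estimate close to (ρ + γ)/2.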

We evaluated each of our correlation estimators for ρ= −0.5, 0.2, 0.7 and γ=−0.4, 0, 0.4. The compound symmetry correlation parameters for the X and Y data were set as δx=δy=0.7. To create binary data for evaluating the phi coefficient, we dichotomized X as I[X>0.5] and Y as I[Y>0.5]. To create discrete data for testing the biserial correlation, we discretized Y into a four-level categorical variable using −1.2, −0.1, and 1.3 as cut points. For each of the design settings, we calculated the average of our estimators over 10,000 Monte Carlo loops and defined the number of clusters to be M=100.

Our proposed correlation estimators for unpaired clustered data were approximately unbiased, and the asymptotic confidence intervals exhibited close to nominal coverage probabilities (Table 1). The estimators were resistant to the potentially biasing effect of informative cluster size: they were approximately unbiased both when clusters with concordant random effects were larger and when clusters with discordant random effects were larger. Coverage probabilities were slightly below, but generally in correspondence with, the nominal 95% level. This slight undercoverage mirrors the well-known behavior of asymptotic confidence intervals for product moment correlations with i.i.d. data.

Table 1.

Simulation results. For each setting of the design parameters, “True” provides the true value of the coefficient, “Est.” lists the average of the 10,000 Monte Carlo estimates, and “Cov.” lists the empirical coverage probability of the asymptotic 95% confidence interval.

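| (n1, n2) | ρ | γ | Pearson True | Pearson Est. | Pearson Cov. | Spearman True | Spearman Est. | Spearman Cov. | Phi True | Phi Est. | Phi Cov. | Biserial True | Biserial Est. | Biserial Cov. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (20, 10) | −0.5 | −0.4 | −0.45 | −0.45 | 0.9322 | −0.43 | −0.43 | 0.9309 | −0.27 | −0.27 | 0.9466 | −0.45 | −0.44 | 0.9385 |
| (20, 10) | −0.5 | 0.0 | −0.25 | −0.25 | 0.9381 | −0.24 | −0.24 | 0.9442 | −0.14 | −0.15 | 0.9399 | −0.25 | −0.25 | 0.9438 |
| (20, 10) | −0.5 | 0.4 | −0.05 | −0.05 | 0.9362 | −0.05 | −0.05 | 0.9408 | −0.03 | −0.03 | 0.9452 | −0.05 | −0.05 | 0.9408 |
| (20, 10) | 0.2 | −0.4 | −0.10 | −0.10 | 0.9298 | −0.10 | −0.10 | 0.9364 | −0.06 | −0.06 | 0.9412 | −0.10 | −0.10 | 0.9398 |
| (20, 10) | 0.2 | 0.0 | 0.10 | 0.10 | 0.9358 | 0.10 | 0.10 | 0.9392 | 0.06 | 0.06 | 0.9441 | 0.10 | 0.10 | 0.9369 |
| (20, 10) | 0.2 | 0.4 | 0.30 | 0.30 | 0.9357 | 0.29 | 0.29 | 0.9424 | 0.19 | 0.19 | 0.9449 | 0.30 | 0.30 | 0.9455 |
| (20, 10) | 0.7 | −0.4 | 0.15 | 0.15 | 0.9344 | 0.14 | 0.14 | 0.9382 | 0.09 | 0.09 | 0.9396 | 0.15 | 0.15 | 0.9501 |
| (20, 10) | 0.7 | 0.0 | 0.35 | 0.35 | 0.9362 | 0.34 | 0.33 | 0.9403 | 0.22 | 0.22 | 0.9448 | 0.35 | 0.34 | 0.9389 |
| (20, 10) | 0.7 | 0.4 | 0.55 | 0.55 | 0.9391 | 0.53 | 0.53 | 0.9361 | 0.36 | 0.36 | 0.9436 | 0.55 | 0.55 | 0.9402 |
| (10, 20) | −0.5 | −0.4 | −0.45 | −0.45 | 0.9376 | −0.43 | −0.43 | 0.9385 | −0.27 | −0.27 | 0.9448 | −0.45 | −0.45 | 0.9421 |
| (10, 20) | −0.5 | 0.0 | −0.25 | −0.25 | 0.9336 | −0.24 | −0.24 | 0.9395 | −0.14 | −0.15 | 0.9358 | −0.25 | −0.25 | 0.9439 |
| (10, 20) | −0.5 | 0.4 | −0.05 | −0.05 | 0.9355 | −0.05 | −0.05 | 0.9431 | −0.03 | −0.03 | 0.9450 | −0.05 | −0.05 | 0.9388 |
| (10, 20) | 0.2 | −0.4 | −0.10 | −0.10 | 0.9364 | −0.10 | −0.10 | 0.9432 | −0.06 | −0.06 | 0.9469 | −0.10 | −0.10 | 0.9452 |
| (10, 20) | 0.2 | 0.0 | 0.10 | 0.10 | 0.9381 | 0.10 | 0.09 | 0.9431 | 0.06 | 0.06 | 0.9428 | 0.10 | 0.10 | 0.9411 |
| (10, 20) | 0.2 | 0.4 | 0.30 | 0.30 | 0.9339 | 0.29 | 0.29 | 0.9375 | 0.19 | 0.19 | 0.9451 | 0.30 | 0.30 | 0.9397 |
| (10, 20) | 0.7 | −0.4 | 0.15 | 0.15 | 0.9346 | 0.14 | 0.14 | 0.9393 | 0.09 | 0.09 | 0.9442 | 0.15 | 0.15 | 0.9423 |
| (10, 20) | 0.7 | 0.0 | 0.35 | 0.35 | 0.9377 | 0.34 | 0.33 | 0.9417 | 0.22 | 0.22 | 0.9435 | 0.35 | 0.35 | 0.9452 |
| (10, 20) | 0.7 | 0.4 | 0.55 | 0.55 | 0.9417 | 0.53 | 0.53 | 0.9382 | 0.36 | 0.36 | 0.9476 | 0.55 | 0.55 | 0.9361 |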

3.2 Application to dental data

To illustrate the use of the unpaired correlation estimators, we analyzed the dental dataset described in the introduction.1 At ages 5 and 13, each subject was evaluated for caries and fluorosis, respectively. An ordinal fluorosis score was assigned for each tooth of each enrolled subject using the Fluorosis Risk Index,13 with 0 indicating none, 1 indicating questionable fluorosis, 2 indicating definitive white striations, and 3 indicating staining or pitting of the tooth. Additionally, an alphabetized, ordinal dental caries score was assigned for each tooth of each enrolled subject, with “S” indicating a sound zone (no caries), “D0”, “D1”, and “D2” indicating increasing severity of caries, and “F” indicating a tooth with filled caries. We briefly note that both fluorosis and caries were scored on several zones for each tooth; to simplify our analyses, we considered the maximum score across all zones to be the fluorosis or caries score for a given tooth.
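The reduction from zone-level to tooth-level scores can be sketched as below. The severity ordering S < D0 < D1 < D2 < F is an assumption inferred from the description of the caries score as ordinal and from the later definition of caries as a score greater than “S”. The names `CARIES_ORDER`, `tooth_caries_score`, and `tooth_fluorosis_score` are hypothetical and do not come from the study.

```python
# Assumed severity ordering of the caries labels (lowest to highest).
CARIES_ORDER = {"S": 0, "D0": 1, "D1": 2, "D2": 3, "F": 4}


def tooth_caries_score(zone_scores):
    """Return the most severe caries label observed across a tooth's zones."""
    return max(zone_scores, key=CARIES_ORDER.__getitem__)


def tooth_fluorosis_score(zone_scores):
    """Return the maximum Fluorosis Risk Index score (0 to 3) across zones."""
    return max(zone_scores)


def has_caries(zone_scores):
    """Caries indicator for a tooth: tooth-level score greater than 'S'."""
    return CARIES_ORDER[tooth_caries_score(zone_scores)] > CARIES_ORDER["S"]


if __name__ == "__main__":
    print(tooth_caries_score(["S", "D1", "D0"]))  # most severe zone label
    print(tooth_fluorosis_score([0, 2, 1]))       # maximum zone score
    print(has_caries(["S", "S"]))                 # sound tooth
```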

There were 525 subjects in the study with assessments of dental caries at age 5 and assessments of fluorosis at age 13, representing 10,363 teeth at age 5 and 14,001 teeth at age 13. Here, we restrict our analysis to the measurements taken on 7,732 late-erupting teeth at age 13 (canines, premolars, and second molars). Table 2 provides some descriptive statistics for these 525 individuals. The majority of subjects had full complements of teeth at age 5 (465 of 525, 89%) and full complements of late-erupting teeth at age 13 (317 of 525, 60%), but variation in the number of teeth per individual was present.

Table 2.

Descriptive statistics for the M=525 subjects with assessments of caries at age 5 and fluorosis at age 13.

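| Age 5: No. of teeth | Age 5: No. of subjects | Age 5: At least 1 with caries | Age 5: Mean proportion of teeth with caries | Age 13: No. of teeth | Age 13: No. of subjects | Age 13: At least 1 with fluorosis | Age 13: Mean proportion of teeth with fluorosis |
|---|---|---|---|---|---|---|---|
| | | | | ≤11 | 41 | 0.29 | 0.15 |
| ≤17 | 13 | 0.54 | 0.16 | 12 | 36 | 0.25 | 0.09 |
| 18 | 31 | 0.45 | 0.14 | 13 | 22 | 0.23 | 0.14 |
| 19 | 16 | 0.38 | 0.08 | 14 | 54 | 0.20 | 0.07 |
| 20 | 465 | 0.35 | 0.06 | 15 | 55 | 0.27 | 0.08 |
| | | | | 16 | 317 | 0.27 | 0.12 |
| Total | 525 | 0.36 | 0.07 | Total | 525 | 0.26 | 0.11 |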

Many subjects (36%) exhibited caries on at least one tooth at age 5, defined here as a caries score greater than “S”. There was some evidence of potentially informative cluster size, as both the proportion of individuals with at least one tooth with caries and the mean proportion of teeth with caries per subject appeared to vary with the number of teeth. Some subjects (26%) exhibited fluorosis on at least one tooth at age 13, defined here (and generally) as a fluorosis score greater than 1. The proportion of individuals with at least one tooth exhibiting fluorosis varied over subjects with different tooth counts, as did the mean proportion of teeth with fluorosis. Two distinct, simplified, subject-level marginal analyses indicated that caries at age 5 and fluorosis at age 13 were not related. The proportion of subjects with fluorosis at age 13 was fairly consistent among those with caries at age 5 (45 of 189, 23%) and without caries at age 5 (92 of 336, 27%). Further, the Pearson correlation between subject-averaged ordinal caries scores at age 5 and subject-averaged Fluorosis Risk Index measurements at age 13 was low (−0.08, 95% CI [−0.17, 0.00]).

We examined the caries–fluorosis relationship at the tooth-within-subject level by calculating the unpaired Spearman coefficient for the caries at age 5 score and the fluorosis at age 13 score. We additionally calculated phi coefficients based on various dichotomizations of the caries and fluorosis scores. The use of unpaired coefficients was in part justified by the presence of different teeth at ages 5 and 13, as noted in the introduction—no permanent teeth were present in any subject at age 5 and only late-erupting permanent teeth were of interest at age 13. Our marginal analysis showed no caries–fluorosis association (Table 3), as all estimated coefficients were close to zero for all dichotomizations of the fluorosis and caries scores.

Table 3.

Estimated unpaired correlation coefficients (95% confidence interval) examining the relationship between dental caries at age 5 and fluorosis at age 13. Phi coefficients were calculated under the indicated dichotomizations of the dental caries and fluorosis scores.

| Coefficient | Age 5 caries | Age 13 fluorosis | Coefficient (95% CI) |
|---|---|---|---|
| Spearman | | | −0.04 (−0.06, −0.01) |
| Phi | D0, D1, D2, F | 1, 2, 3 | −0.03 (−0.06, 0.00) |
| Phi | D0, D1, D2, F | 2, 3 | −0.03 (−0.05, −0.01) |
| Phi | D1, D2, F | 1, 2, 3 | −0.03 (−0.06, 0.00) |
| Phi | D1, D2, F | 2, 3 | −0.03 (−0.05, −0.01) |

4 Discussion

In this paper, we have suggested a generalized weighted estimating equation for the marginal analysis of clustered data in which the cluster is the primary sampling unit and the marginal distribution of a typical observation from a typical cluster is of interest. The estimating equations that we suggest here are straightforward extensions of standard GEEs, and generalize previously defined extended GEEs designed to adjust for the potentially biasing effects of informative cluster size and sub-cluster covariate informativeness.3,9 We used these estimating equations to develop correlation estimators for unpaired clustered data, in which observations of two random variables to be associated are paired at the cluster level but unpaired at the observation-within-cluster level. The simulation study in Section 3.1 demonstrated the approximate unbiasedness of our correlation estimators for unpaired clustered data and the approximately correct coverage of the associated asymptotic 95% confidence intervals. We were able to use these correlation estimators to show that dental caries at a younger age and fluorosis at an older age were not associated in a large sample of children enrolled in a dental study.

A potential alternative option in the marginal analysis of clustered data is a model-based approach in which all sources of variability are specified.14 However, a similar approach for bivariate data under informative cluster size has yet to be developed. In particular, the joint model approach must also include a component for the cluster size, and the performance of the association measures under mis-specification of this component needs to be studied.

We have shown examples of weights $\omega_{ij}$ to be used in estimating equation (1), and each was associated with a resampling scheme designed to adjust for potentially problematic features of clustered data. Inverse cluster size weights ($\omega_{ij} = n_i^{-1}$) were employed to adjust for informative cluster size.3 Inverse sub-cluster covariate weights ($\omega_{ij} = n_{iX_{ij}}^{-1}$) were employed to adjust for sub-cluster covariate informativeness.9 We introduced variable-specific inverse cluster size weights ($\omega_{ij_xj_y} = (n_{ix}n_{iy})^{-1}$) for unpaired clustered data with potentially informative cluster size. We note that the sub-cluster covariate weighting approach can be integrated into this by defining $\omega_{ij_xj_y} = (n_{iX_{ij}}n_{iY_{ij}})^{-1}$. We did not explore this approach here, as sub-cluster covariate informativeness did not appear to be a feature of the dental dataset. An important feature of estimating equation (1) is that it is not limited to such resampling-based weights. For example, in our dental data application, it is possible that proximity plays a role, in that fluorosis may occur by age 13 in teeth near a tooth with caries at age 5. In this scenario, the weights $\omega_{ij_xj_y}$ can be defined as a function of a proximity mapping to accommodate association of teeth by proximity, e.g., $\omega_{ij_xj_y}$ can be defined as a function of $I[X_{ij_x} \equiv Y_{ij_y}]$, where $\equiv$ indicates that $X_{ij_x}$ and $Y_{ij_y}$ come from teeth defined to be in proximity to each other.
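To make the variable-specific inverse cluster size weights concrete, the sketch below computes a Pearson-type point estimate in which every within-cluster (X, Y) pair receives weight one over the product of that cluster's X and Y sample sizes. It is an illustration under stated assumptions and not the authors' implementation: the function name and data layout are hypothetical, all moments are computed under the same weighting, and no variance or confidence interval is produced.

```python
import numpy as np


def unpaired_weighted_pearson(x_clusters, y_clusters):
    """Pearson-type correlation with pair weights 1 / (n_ix * n_iy).

    x_clusters[i] and y_clusters[i] hold the X and Y observations of
    cluster i; the two sequences may have different lengths.
    """
    if len(x_clusters) != len(y_clusters):
        raise ValueError("X and Y must be observed on the same clusters")
    xs = [np.asarray(x, dtype=float) for x in x_clusters]
    ys = [np.asarray(y, dtype=float) for y in y_clusters]

    # Summing X*Y over all (j_x, j_y) pairs of cluster i with weight
    # 1 / (n_ix * n_iy) equals the product of the two within-cluster means,
    # so each moment is an average of within-cluster averages.
    mean_x = np.mean([x.mean() for x in xs])
    mean_y = np.mean([y.mean() for y in ys])
    mean_xy = np.mean([x.mean() * y.mean() for x, y in zip(xs, ys)])
    mean_x2 = np.mean([(x ** 2).mean() for x in xs])
    mean_y2 = np.mean([(y ** 2).mean() for y in ys])

    cov_xy = mean_xy - mean_x * mean_y
    var_x = mean_x2 - mean_x ** 2
    var_y = mean_y2 - mean_y ** 2
    return cov_xy / np.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy data: three clusters with unequal numbers of X and Y observations.
    x = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
    y = [[1.0], [2.0, 3.0], [7.0, 8.0, 9.0, 10.0]]
    print(unpaired_weighted_pearson(x, y))
```

When every cluster contributes exactly one X and one Y observation, this reduces to the ordinary Pearson correlation, which is a convenient sanity check. A proximity-based weight of the kind described above would replace the uniform within-cluster pair weights and would need its own normalization.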

While the customization of the weights in equation (1) provides flexibility in the marginal analysis of clustered data, care needs to be taken in selecting appropriate weights and in noting what equation (1) estimates for a given weight. For example, consider a version of the doubly weighted GEEs proposed by Huang and Leroux9 for scenarios in which not all possible values of the potentially informative covariate X are observed in each cluster, termed DWGEE2 by the authors. Seaman et al.15 showed that DWGEE2 estimates a parameter in a population in which all clusters exhibit all possible values of X. In other words, it is assumed that the observed data come from a population in which each value of X is represented in each cluster. Among other criticisms, Seaman et al.15 point out that this can be philosophically problematic, in that such a population might be purely hypothetical. For example, in the context of dental studies, one could find oneself modeling the dental hygiene of individuals with no teeth. As such, careful consideration of the weights is important, particularly with regard to what is being estimated for a particular specified weight.

Acknowledgments

We thank an anonymous reviewer for many useful comments that improved this manuscript.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Institutes of Health (grant numbers 1R03DE020839, 1R03DE022538, R01-DE09551, R01-DE12101, M01-RR00059, UL1-RR024979).

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, or publication of this article.

References

  • 1. Levy SM, Warren JJ, Broffitt BA, et al. Fluoride, beverages and dental caries in the primary dentition. Caries Res. 2003;37:157–165. doi: 10.1159/000070438.
  • 2. Hoffman EB, Sen PK, Weinberg CR. Within-cluster resampling. Biometrika. 2001;88:1121–1134.
  • 3. Williamson JM, Datta S, Satten GA. Marginal analysis of clustered data when cluster size is informative. Biometrics. 2003;59:36–42. doi: 10.1111/1541-0420.00005.
  • 4. Datta S, Satten GA. A signed-rank test for clustered data. Biometrics. 2008;64:501–507. doi: 10.1111/j.1541-0420.2007.00923.x.
  • 5. Datta S, Satten GA. Rank-sum tests for clustered data. J Am Stat Assoc. 2005;100:908–915.
  • 6. Cong XJ, Yin G, Shen Y. Marginal analysis of correlated failure time data with informative cluster sizes. Biometrics. 2007;63:663–672. doi: 10.1111/j.1541-0420.2006.00730.x.
  • 7. Williamson JM, Kim HY, Manatunga A, et al. Modeling survival data with informative cluster size. Stat Med. 2008;27:543–555. doi: 10.1002/sim.3003.
  • 8. Lorenz DJ, Datta S, Harkema SJ. Marginal association measures for clustered data. Stat Med. 2011;30:3181–3191. doi: 10.1002/sim.4368.
  • 9. Huang Y, Leroux B. Informative cluster sizes for sub-cluster level covariates and weighted generalized estimating equations. Biometrics. 2011;67:843–851. doi: 10.1111/j.1541-0420.2010.01542.x.
  • 10. Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11:550–560. doi: 10.1097/00001648-200009000-00011.
  • 11. Dutta S, Datta S. A rank-sum test for clustered data when the number of subjects in a group within a cluster is informative. Biometrics. 2016;72:432–440. doi: 10.1111/biom.12447.
  • 12. Cox NR. Estimation of the correlation between a continuous and a discrete variable. Biometrics. 1974;30:171–178.
  • 13. Pendrys DG. The Fluorosis Risk Index: a method for investigating risk factors. J Public Health Dent. 1990;50:291–298. doi: 10.1111/j.1752-7325.1990.tb02138.x.
  • 14. Chen Z, Zhang B, Albert PS. A joint modeling approach to data with informative cluster size: robustness to the cluster size model. Stat Med. 2011;30:1825–1836. doi: 10.1002/sim.4239.
  • 15. Seaman SR, Pavlou M, Copas AJ. Methods for observed-cluster inference when cluster size is informative: a review and clarifications. Biometrics. 2014;70:449–456. doi: 10.1111/biom.12151.
