Abstract
It is well known that the sample correlation coefficient (Rxy) is the maximum likelihood estimator (MLE) of the Pearson correlation (ρxy) for i.i.d. bivariate normal data. However, this is not true for ophthalmologic data where X (e.g., visual acuity) and Y (e.g., visual field) are available for each eye and there is positive intraclass correlation for both X and Y in fellow eyes. In this paper, we provide a regression-based approach for obtaining the MLE of ρxy for clustered data, which can be implemented using standard mixed effects model software. This method is also extended to allow for estimation of partial correlation by controlling both X and Y for a vector U of other covariates. In addition, these methods can be extended to allow for estimation of rank correlation for clustered data by (a) converting ranks of both X and Y to the probit scale, (b) estimating the Pearson correlation between probit scores for X and Y, and (c) using the relationship between Pearson and rank correlation for bivariate normally distributed data. The validity of the methods in finite-sized samples is supported by simulation studies. Finally, two examples from ophthalmology and analgesic abuse are used to illustrate the methods.
Keywords: clustered data, rank correlation, Pearson correlation, partial correlation
1. Introduction
It is well known that the sample correlation coefficient (Rxy) is the maximum likelihood estimator (MLE) of the Pearson correlation coefficient (ρxy) for i.i.d. bivariate normally distributed data.[1] Also, there is an extensive literature on estimation of regression models for clustered normally distributed data.[2] However, estimation of Pearson correlation in the clustered data setting has received less attention. Furthermore, many outcome measures in ophthalmology are either continuous and not normally distributed or are ordinal. For example, the Humphrey visual field is often used to characterize visual field in retinitis pigmentosa (RP) patients. The Humphrey field is continuous, but is not normally distributed (see Figure 1). Similarly, Early Treatment Diabetic Retinopathy Study (ETDRS)[3] visual acuity, a measure of central vision, is also frequently not normally distributed (see Figure 2). Thus, the rank correlation is a natural measure of association between these two measures. However, both measures are correlated in fellow eyes which violates standard assumptions of inference regarding rank correlation (see Figure 3).
The purpose of this paper is to provide methods for obtaining point and interval estimates of rank correlation for clustered data. We first develop methods for estimation of Pearson correlation and then extend these methods to estimation of rank correlation. Methods are first presented for the case of two subunits per cluster and then extended to allow for more than two subunits per cluster. Extensions for unbalanced data (i.e., a variable number of subunits per cluster) are also provided.
2. Estimation of Pearson correlation for clustered data
(a) Clusters of size 2
We assume throughout this paper that subunits within a cluster are distinguishable (e.g., right eye (OD) and left eye (OS) scores). We also assume that the Pearson correlation between X and Y scores is the same for right and left eyes. We will begin with a discussion of maximum likelihood estimation of Pearson correlation for clusters of size 2 and will extend our inference to clusters of arbitrary size. Let Xij = X score for the jth subunit in the ith cluster; i = 1, …, n; j = 1,2 and Yij defined similarly for the Y score. Our goal is to estimate corr(Xij,Yij). For simplicity, we will use the notation (xij,yij) to denote both the random variable (Xij,Yij) as well as its sample realization. Suppose we have the data
zi = (xi′, yi′)′, where xi = (xi1,xi2)′, yi = (yi1,yi2)′; i = 1, …, n.
Thus, xi and yi are 2 × 1 vectors and zi is a 4 × 1 vector.
We assume that zi is multivariate normal with mean μ = (μx,μx,μy,μy)′ and covariance matrix Σ given by
Thus, ρxy = corr(xij,yij) = inter-class Pearson correlation between visual acuity (VA) (xij) and visual field (VF) (yij) for the same eye, i = 1, …, n; j = 1,2.
The cross-correlation corr(xij1,yij2) = inter-class Pearson correlation between VA for one eye and VF for the fellow eye (e.g., VA for the right eye (OD) and VF for the left eye (OS), and conversely), i = 1, …, n; j1 ≠ j2 = 1,2.
ρxx = corr(xij1,xij2) = intraclass correlation between VA for the left and right eyes, i = 1, …, n; j1 ≠ j2 = 1,2.
ρyy = corr(yij1,yij2) = intraclass correlation between VF for the left and right eyes, i = 1, …, n; j1 ≠ j2 = 1,2.
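For reference, the covariance matrix implied by these four correlations can be written out explicitly. The following is a sketch under two assumptions: the variances σx² and σy² are common to the two eyes (consistent with the common means in μ), and ρ*xy is an assumed symbol for the cross-correlation corr(xij1,yij2). Rows and columns are ordered as (xi1, xi2, yi1, yi2).

```latex
\Sigma =
\begin{pmatrix}
\sigma_x^2 & \rho_{xx}\sigma_x^2 & \rho_{xy}\sigma_x\sigma_y & \rho^{*}_{xy}\sigma_x\sigma_y \\
\rho_{xx}\sigma_x^2 & \sigma_x^2 & \rho^{*}_{xy}\sigma_x\sigma_y & \rho_{xy}\sigma_x\sigma_y \\
\rho_{xy}\sigma_x\sigma_y & \rho^{*}_{xy}\sigma_x\sigma_y & \sigma_y^2 & \rho_{yy}\sigma_y^2 \\
\rho^{*}_{xy}\sigma_x\sigma_y & \rho_{xy}\sigma_x\sigma_y & \rho_{yy}\sigma_y^2 & \sigma_y^2
\end{pmatrix}
```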
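The likelihood for the ith subject is given by
$$L_i = (2\pi)^{-2}\,|\Sigma|^{-1/2}\exp\left[-\tfrac{1}{2}(z_i-\mu)'\Sigma^{-1}(z_i-\mu)\right]$$
and the overall likelihood is given by:
$$L = \prod_{i=1}^{n} L_i$$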
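The goal is to obtain the maximum likelihood estimate (MLE) of ρxy. Note that if zi1 = (xi1,yi1) and zi2 = (xi2,yi2) are independent, then the MLE of ρxy is the ordinary Pearson product-moment correlation [1] computed over all 2n (x,y) pairs, given by
$$R_{xy} = \frac{\sum_{i=1}^{n}\sum_{j=1}^{2}(x_{ij}-\bar{x})(y_{ij}-\bar{y})}{\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{2}(x_{ij}-\bar{x})^2\;\sum_{i=1}^{n}\sum_{j=1}^{2}(y_{ij}-\bar{y})^2}}$$
where
$$\bar{x} = \frac{1}{2n}\sum_{i=1}^{n}\sum_{j=1}^{2}x_{ij}$$
and
$$\bar{y} = \frac{1}{2n}\sum_{i=1}^{n}\sum_{j=1}^{2}y_{ij}$$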
However, it is difficult to obtain closed form maximum likelihood estimates of ρxy using standard software if zi1 = (xi1,yi1) and zi2 = (xi2,yi2) are not independent. Instead, we consider an indirect approach. Specifically, if we factor Li as follows:
$$L_i = f(y_i \mid x_i)\,f(x_i) \equiv L_{1i}L_{2i}$$
then the overall likelihood is given by
$$L = \prod_{i=1}^{n}L_{1i}\prod_{i=1}^{n}L_{2i} \equiv L_1 L_2$$
Note that all the information concerning ρxy is contained in L1. Hence, it will suffice to maximize L1 rather than L. We can write L1i as a conditional regression likelihood given by
where
Thus, marginally we have the model
(1) |
However, based on the properties of the multivariate normal distribution [1], the conditional distribution of yi given xi is multivariate normal (MVN) with mean and variance given by:
where
In this case, it follows straightforwardly from (1) that
(2) |
Therefore, if we equate E(yij|xi) in equations 1 and 2, we obtain:
(3) |
(4) |
Upon combining equations 3 and 4, and solving for ρxy and the cross-correlation, we obtain:
(5) |
We can obtain MLEs of σx and ρxx based on x from L2, denoted by σ̂x and ρ̂xx, using standard mixed effects software (e.g., PROC MIXED of SAS) with a compound symmetry correlation structure. Similarly, we can obtain the MLE of σy, denoted by σ̂y, from the joint likelihood of yi = (yi1,yi2)′ using PROC MIXED based on y with a compound symmetry correlation structure.
Thus, we can use equation 5 to obtain MLEs of ρxy and the cross-correlation, given by:
(6) |
However, an equivalent approach is to express equation 1 directly in terms of ρxy and the cross-correlation. Specifically, if we substitute equations 3 and 4 into equation 1 and collect terms, we obtain
(7) |
where
Thus, MLEs of ρxy and the cross-correlation can be obtained from equation 7 using PROC MIXED with a compound symmetry correlation structure, where σy, σx and ρxx are estimated using PROC MIXED in a similar manner as in equation 6.
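The algebra that links the regression coefficients to the two correlations for clusters of size 2 can be sketched as follows. This is an illustration under stated assumptions rather than a restatement of equations 1 to 7: the regression is taken to have the form y_ij = α + β1 x_ij + β2 x_ij′ + e_ij, where j′ denotes the fellow eye, and the symbols α, β1, β2 and ρ*xy (for the cross-correlation) are assumed notation.

```latex
% Conditional mean of y_i given x_i under the multivariate normal model (k = 2)
E(y_i \mid x_i) = \mu_y\mathbf{1} + \Sigma_{yx}\Sigma_{xx}^{-1}(x_i - \mu_x\mathbf{1}),
\qquad
\Sigma_{xx} = \sigma_x^2\begin{pmatrix} 1 & \rho_{xx} \\ \rho_{xx} & 1 \end{pmatrix},
\qquad
\Sigma_{yx} = \sigma_x\sigma_y\begin{pmatrix} \rho_{xy} & \rho^{*}_{xy} \\ \rho^{*}_{xy} & \rho_{xy} \end{pmatrix}

% Equating coefficients of x_{ij} (own eye) and x_{ij'} (fellow eye)
\beta_1 = \frac{\sigma_y}{\sigma_x}\cdot\frac{\rho_{xy} - \rho_{xx}\rho^{*}_{xy}}{1-\rho_{xx}^2},
\qquad
\beta_2 = \frac{\sigma_y}{\sigma_x}\cdot\frac{\rho^{*}_{xy} - \rho_{xx}\rho_{xy}}{1-\rho_{xx}^2}

% Solving the two linear equations for the correlations
\rho_{xy} = \frac{\sigma_x}{\sigma_y}\left(\beta_1 + \rho_{xx}\beta_2\right),
\qquad
\rho^{*}_{xy} = \frac{\sigma_x}{\sigma_y}\left(\beta_2 + \rho_{xx}\beta_1\right)

% Substituting back: a regression that is linear in the two correlations
E(y_{ij} \mid x_i) = \alpha
 + \rho_{xy}\,\frac{\sigma_y}{\sigma_x}\cdot\frac{x_{ij} - \rho_{xx}x_{ij'}}{1-\rho_{xx}^2}
 + \rho^{*}_{xy}\,\frac{\sigma_y}{\sigma_x}\cdot\frac{x_{ij'} - \rho_{xx}x_{ij}}{1-\rho_{xx}^2}
```

Under these assumptions, the last line shows why the two correlations can be read off as regression coefficients once x has been transformed using estimates of σy, σx and ρxx.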
(b) Clusters of size > 2
Suppose we have data zi = (xi′, yi′)′, where
xi = (xi1, …, xik)′, yi = (yi1, …, yik)′ are (k×1) vectors and zi is a 2k×1 vector, i = 1, …, n.
The overall likelihood L is given by:
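As in the case of k = 2, the overall likelihood factors into a conditional part L1 (yi given xi) and a marginal part L2 (xi), so that all the information concerning ρxy is contained in L1. Similar to the case of k = 2, we can write L1i as a conditional regression likelihood given by: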
where
and
where Ik×k is the k×k identity matrix and Jk×k is a (k×k) matrix of ones.
Thus, marginally, we have the regression model
(8) |
Similar to equation 2, based on the properties of the multivariate normal distribution [1], the conditional distribution of yi given xi is multivariate normal with mean and variance given by
(9) |
It follows straightforwardly from (9) that
Note that if k = 2, equation 9 reduces to equation 2.
Thus, from equations (8) and (9) we obtain
(10) |
Upon combining equations 8 and 10 and factoring out ρxy and the cross-correlation, we obtain
(11) |
where
and σy,σx and ρxx are estimated in a similar manner as in equation 6.
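The mapping between regression coefficients and correlations for a general cluster size k can be checked numerically. The sketch below is illustrative only. It assumes an exchangeable multivariate normal model in which the second regressor is the sum of x over the other k − 1 subunits in the cluster; the parameterization used in equation 8 may differ, and the function names are hypothetical.

```python
def betas_from_correlations(rho_xy, rho_cross, rho_xx, sigma_x, sigma_y, k=2):
    """Coefficients of E(y_ij | x_i) implied by the exchangeable MVN model.

    beta_own multiplies x_ij; beta_other multiplies each of the other
    k - 1 x-values in the cluster (assumed parameterization).
    """
    scale = (sigma_y / sigma_x) / (1.0 - rho_xx)
    shared = (rho_cross - rho_xx * rho_xy) / (1.0 + (k - 1) * rho_xx)
    beta_other = scale * shared
    beta_own = scale * ((rho_xy - rho_cross) + shared)
    return beta_own, beta_other


def correlations_from_betas(beta_own, beta_other, rho_xx, sigma_x, sigma_y, k=2):
    """Invert the mapping: recover (rho_xy, cross-correlation) from the betas."""
    ratio = sigma_x / sigma_y
    # cov(y_ij, x_ij) picks up beta_own plus k-1 correlated fellow subunits.
    rho_xy = ratio * (beta_own + (k - 1) * rho_xx * beta_other)
    # cov(y_ij, x_ij') picks up x_ij' itself, x_ij, and the k-2 remaining subunits.
    rho_cross = ratio * (beta_other + rho_xx * beta_own
                         + (k - 2) * rho_xx * beta_other)
    return rho_xy, rho_cross


if __name__ == "__main__":
    b_own, b_other = betas_from_correlations(0.5, 0.4, 0.5, 1.0, 1.0, k=3)
    print(correlations_from_betas(b_own, b_other, 0.5, 1.0, 1.0, k=3))
```

For k = 2 the inverse mapping reduces to ρxy = (σx/σy)(β_own + ρxx β_other), which is the form obtained by solving two linear equations in the two-eye case.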
(c) Interval Estimation
We could obtain interval estimates of ρxy by using se(ρ̂xy) from equation 11 and assuming asymptotic normality of ρ̂xy. However, asymptotic normality is achieved more quickly for estimated correlation coefficients if one uses Fisher’s z transformation. If we let Zxy = 0.5 ln[(1+ρxy)/(1−ρxy)] and Ẑxy = 0.5 ln[(1+ρ̂xy)/(1−ρ̂xy)], then from the delta method we have
$$se(\hat{Z}_{xy}) \approx \frac{se(\hat{\rho}_{xy})}{1-\hat{\rho}_{xy}^{2}}$$
Thus, an approximate 100%×(1−α) CI for Zxy based on Fisher’s z transformation is given by
$$(Z_{xy,1}, Z_{xy,2}) = \hat{Z}_{xy} \pm \Phi^{-1}(1-\alpha/2)\,se(\hat{Z}_{xy})$$
where Φ−1 is the inverse of the standard normal c.d.f.
The corresponding 100%×(1−α) CI for ρxy is given by (ρxy,1,ρxy,2) where
$$\rho_{xy,c} = \frac{\exp(2Z_{xy,c})-1}{\exp(2Z_{xy,c})+1},\quad c = 1,2 \qquad (12)$$
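The interval construction based on Fisher’s z transformation can be sketched in a few lines. The example assumes that a point estimate and its standard error are already available from the mixed effects fit; the function name and the numeric inputs are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_z_interval(rho_hat, se_rho, alpha=0.05):
    """Approximate 100(1 - alpha)% CI for a correlation via Fisher's z.

    rho_hat: estimated correlation, strictly between -1 and 1.
    se_rho:  standard error of rho_hat on the correlation scale.
    """
    z_hat = math.atanh(rho_hat)               # 0.5 * ln((1 + r) / (1 - r))
    se_z = se_rho / (1.0 - rho_hat ** 2)      # delta method
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    lower = math.tanh(z_hat - crit * se_z)    # (exp(2z) - 1) / (exp(2z) + 1)
    upper = math.tanh(z_hat + crit * se_z)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to illustrate the call.
    print(fisher_z_interval(0.30, 0.06))
```

Because the limits are back-transformed with tanh, they always lie inside (−1, 1) and are not symmetric about the point estimate unless the estimate is 0.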
3. Estimation of Rank Correlation for Clustered Data
We also consider the estimation of rank correlation for clustered data. For this purpose, we define FXj(x) ≡ Pr(Xj ≤ x) = c.d.f. of Xj, and similarly for FYj(y), j=1, …, k, where Xj and Yj are continuous random variables. Pearson referred to FXj(x) and FYj(y) as grades of Xj and Yj within a reference population[4].
Let HXj(x) = Φ−1[FXj(x)], HYj(y) = Φ−1[FYj(y)]. By definition, if Xj and Yj are continuous random variables, then HXj(Xj) and HYj(Yj) are each univariate N(0,1). We also assume that HZj ≡ (HXj,HYj) is bivariate normal. Finally, we define Pi,Xj = FXj(xij) and Pi,Yj = FYj(yij); i = 1, …, n; j = 1, …, k.
We wish to estimate ρs,j = corr(Pi,Xj,Pi,Yj) which we define as the underlying rank correlation between X and Y for replicate j. It is customary to estimate ρs,j by the Spearman rank correlation coefficient ρ̂s,j = corr[rankj(Xij),rankj(Yij)] where the ranks are sample ranks of Xij and Yij, j = 1, …, k within a sample of size n. However, we have previously shown that an alternative estimator of ρs,j exists which is both more efficient than ρ̂s,j and allows one to directly estimate confidence limits for ρs,j [5]. Specifically, let Hi,Xj = Φ−1(Pi,Xj) ≡ probit(Pi,Xj), Hi,Yj = Φ−1(Pi,Yj) ≡ probit(Pi,Yj) and Hi,Zj = (Hi,Xj,Hi,Yj).
We note that the probit transformation is rank-preserving. Therefore, ρs,j is the same in both the original and probit scale. Let us define Ĥi,Xj,n = Φ−1[rankj(Xij)/(n+1)], Ĥi,Yj,n = Φ−1[rankj(Yij)/(n+1)] and Ĥi,Zj,n = (Ĥi,Xj,n, Ĥi,Yj,n). In Theorem 2 in the Appendix, we show that if Hi,Zj is bivariate normal, then Ĥi,Zj,n is asymptotically bivariate normal as n→∞. Furthermore, Moran has shown a direct connection between the Pearson and rank correlations for bivariate normally distributed scales [6] whereby
$$\rho_{s,j} = \frac{6}{\pi}\sin^{-1}\left(\frac{\rho_{H,j}}{2}\right) \qquad (13)$$
where ρH,j = corr(Hi,Xj,Hi,Yj). We can estimate ρH,j by ρ̂H,j = corr(Ĥi,Xj,Ĥi,Yj).
To incorporate clustering, we assume that ρs,j = ρs for all j = 1, …, k and use the regression model in equation 11 based on probit scores rather than raw scores to obtain an estimate of ρH,j = corr(Hi,Xj,Hi,Yj), which we assume is the same for all j and denote by ρ̂H. Equation 11 assumes multivariate normality of zi = (xi,yi). In Theorem 3 in the Appendix we show that if (H1, …, HL) is multivariate normal, then (Ĥ1,n, …, ĤL,n) is asymptotically multivariate normal as n → ∞. The corresponding estimator of Spearman rank correlation, which is denoted by ρ̂s,a, is given by
$$\hat{\rho}_{s,a} = \frac{6}{\pi}\sin^{-1}\left(\frac{\hat{\rho}_{H}}{2}\right)$$
Furthermore, a 100%×(1−α) CI for ρs is given by
$$(\rho_{s,1},\rho_{s,2}) = \left(\frac{6}{\pi}\sin^{-1}\left(\frac{\rho_{H,1}}{2}\right),\ \frac{6}{\pi}\sin^{-1}\left(\frac{\rho_{H,2}}{2}\right)\right) \qquad (14)$$
where ρH,1 and ρH,2 are obtained from equation 12 based on probit scores rather than raw scores.
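The two building blocks of the rank correlation estimator, sample probit scores and the conversion from the Pearson correlation of the probit scores to the Spearman scale, can be sketched as follows. The helper names are hypothetical, and the use of average ranks for ties is an assumption of this sketch (the derivation above assumes continuous X and Y, so ties should not occur). In the clustered setting the ranks would be computed separately for each subunit position j across the n clusters.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Ranks 1..n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        avg = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        start = end + 1
    return ranks


def probit_scores(values):
    """Sample probits: inverse normal c.d.f. of rank / (n + 1)."""
    n = len(values)
    normal = NormalDist()
    return [normal.inv_cdf(r / (n + 1)) for r in average_ranks(values)]


def pearson_to_spearman(rho_h):
    """Convert a Pearson correlation on the probit scale (or a confidence
    limit for it) to the Spearman scale: (6 / pi) * arcsin(rho_h / 2)."""
    return (6.0 / math.pi) * math.asin(rho_h / 2.0)


if __name__ == "__main__":
    print(probit_scores([10.0, 30.0, 20.0]))
    print(pearson_to_spearman(0.5))
```

Since the conversion is monotone increasing, applying `pearson_to_spearman` to each limit of an interval for the Pearson correlation of the probit scores gives the corresponding interval on the Spearman scale.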
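Similarly, from equation 11 we can also obtain an estimate of the Spearman cross-correlation by applying the transformation in equation 13 to the estimated cross-correlation between probit scores, and obtain a 100% × (1−α) CI for the Spearman cross-correlation by applying the same transformation to the corresponding confidence limits from equation 12.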
4. Estimation of partial correlation for clustered data
It is likely that there is positive correlation between visual field area (VF) and visual acuity (VA). However, it is possible that this is partially due to the effect of age since older RP patients tend to have lower field and acuity measurements than younger RP patients. Also, in some datasets gender differences are found for visual function measures in RP patients [7]. Furthermore, there are different genetic types of RP and level of visual function is known to differ by genetic type [7]. Hence, it would be desirable to estimate the correlation between VF and VA in individual eyes, while controlling for age, gender and genetic type.
(a) Partial Pearson correlation
We assume that there is a set of L cluster-specific covariates denoted by ui = (ui1, …, uiL)′, where ui is an (L × 1) vector and uil is the value of the lth cluster-specific covariate for the ith cluster. Let Ui be a k × L matrix, where Ui,jl = uil, i = 1, …, n; j = 1, …, k; l = 1, …, L. In addition, we assume there is a set of M subunit-specific covariates wij = (wij1, …, wijM)′, where wij is an M × 1 vector and wijm is the value of the mth subunit-specific covariate for the jth subunit in the ith cluster. Finally, let Wi be a k × M matrix where Wi,jm = wijm, i = 1, …, n; j = 1, …, k; m = 1, …, M.
We define the partial correlation between X and Y by ρxy,partial = corr(xij,yij|ui,Wi), i = 1, …, n; j = 1, …, k. Similarly, the partial cross-correlation between X and Y is defined by corr(xij1,yij2|ui,Wi), i = 1, …, n; j1 ≠ j2 = 1, …, k.
To estimate these correlations, we factor the likelihood of Li as follows:
Furthermore, we can factor LAi as follows:
All the information concerning ρxy,partial and the partial cross-correlation is contained in LAi1. Hence, it will suffice to maximize the product of the LAi1 over i = 1, …, n to obtain the MLEs of ρxy,partial and the partial cross-correlation. We can write LAi1 as a conditional regression likelihood given by
(15) |
where
and
β3 is an L × 1 vector of regression coefficients for cluster-specific covariates and β4,β5 are M × 1 vectors of regression coefficients for subunit-specific covariates.
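To estimate ρxy,partial and the partial cross-correlation, we proceed as follows:
1. Use mixed effects regression models to regress yi and xi, respectively, on the cluster-specific and subunit-specific covariates, and obtain the corresponding residuals for y and x.
2. Obtain an estimate of the covariate-adjusted σy using mixed effects models based on maximizing the likelihood in equation 16.
3. Obtain estimates of the covariate-adjusted σx and ρxx using mixed effects models for xi given the covariates.
4. Obtain the covariate-adjusted versions of the regressors in equation 11 by substituting the x residuals and the covariate-adjusted estimates of σy, σx and ρxx for xij, σy, σx and ρxx in equation 11.
5. Estimate ρxy,partial and the partial cross-correlation using mixed effects models with a compound symmetry correlation structure.
6. Obtain interval estimates for ρxy,partial and the partial cross-correlation using the methods in equation 12.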
(b) Partial Spearman rank correlation
Similarly, estimates of ρs,partial and the partial Spearman cross-correlation can be obtained by computing sample probits, using ranks based on the covariate-adjusted residuals for x and y instead of xij and yij, and applying the methods in equation 13.
5. Unbalanced data
In ophthalmology there often are datasets where some subjects have data available for one eye and other subjects have data available for two eyes. Specifically, in the RP dataset, there were 7 patients with VA and VF available in one eye, but incomplete data in the fellow eye. Five of the patients with incomplete data were missing both VA and VF, while two of the patients were missing VF but not VA, in the incomplete eye. The missing data profile can get more complex if one wants to estimate partial correlation and there is missing covariate data.
In general, since replicates are distinguishable, we treat unbalanced data as a type of missing data problem. One possible option is to use all available data, but allow k to vary for different clusters, and use mixed effects software methods as provided in equation 11 to estimate ρxy. However, as shown in equation 9, the variance-covariance matrix Σyi|xi will also depend on k, which will not be taken into account using standard software.
If unbalanced data are considered missing data, then the best option is to use multiple imputation methods to analyze the data [8]. For clusters of size 2 using SAS software, this would proceed as follows:
1. Use PROC MI of SAS to impute the missing data based on the data vector hi = (xi1,yi1,xi2,yi2,ui,wi1,wi2), creating M imputed data sets.
2. Analyze each imputed data set based on the methods in equation 7.
3. Use PROC MIANALYZE to combine results over the M imputations.
This strategy can also be used where the maximum cluster size = k. We used this strategy in both of our examples each of which had small amounts of missing data and had max cluster size = 2 and 3, respectively.
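The combining step can be illustrated outside SAS. PROC MIANALYZE pools estimates across imputations using Rubin's rules, and the sketch below applies those rules to a scalar estimate (for example ρ̂xy or its Fisher z transform; the scale on which the analyses above were combined is not specified, so either choice here is an assumption). The function name and the inputs are hypothetical.

```python
import math
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Pool M imputation-specific estimates with Rubin's rules.

    estimates: point estimate from each imputed data set.
    variances: squared standard error from each imputed data set.
    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations of equal length")
    pooled = mean(estimates)
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)


if __name__ == "__main__":
    # Hypothetical results from M = 3 imputed data sets.
    print(rubin_combine([0.21, 0.20, 0.22], [0.0036, 0.0038, 0.0035]))
```

The factor (1 + 1/M) inflates the between-imputation variance to account for using a finite number of imputations.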
6. Simulation Studies
We have conducted simulation studies to assess the finite sample properties of the estimators of ρxy both for clusters of size 2 as given in Equation 7 and clusters of size > 2 (i.e., 3) as given in Equation 11. We generated zi based on the multivariate normal distribution in Section 2 (separately for k=2,3). In addition to the MLE methods described in Equations 7 and 11, we also considered other ad hoc, but frequently used methods. Specifically, we looked at the situation where an investigator has data from 2 eyes available, but only uses a single eye in the analyses (e.g., the right eye, or the left eye) to avoid the correlated data problem; similarly, for k=3, only 1 of the 3 distinguishable replicates was used in the analysis. We then estimated ρxy from the ordinary Pearson product-moment correlation based on the n (x,y) pairs and obtained confidence limits based on Fisher’s z statistic. We refer to this approach as the 1 subunit method. In addition, we estimated ρxy from the ordinary Pearson product-moment correlation based on all kn (x,y) pairs and obtained confidence limits based on Fisher’s z statistic, ignoring the correlation within clusters. We refer to this approach as the all subunits method.
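We generated multivariate normal data with parameters μ = 0, σ² = 1 and
(a) (ρxy, cross-correlation) = (0, 0), ρxx = ρyy = (0.2,0.5,0.8) to assess type I error.
(b) (ρxy, cross-correlation) = (0.2, 0.1), ρxx = ρyy = (0.2,0.5,0.8) to assess power.
(c) (ρxy, cross-correlation) = (0.5, 0.4), ρxx = ρyy = (0.5,0.8) to assess power.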
Bias and coverage probability were determined for each method for each of (a) – (c). Separate simulations were performed for k=2 and k=3. Four thousand simulations were performed for each design in (a) – (c) and each k=2, 3. In addition, to assess the adequacy of the model for unbalanced data, we simulated multivariate normal data for k=3 subunits per cluster, but randomly deleted the 3rd subunit for 20% of the clusters, thus creating a dataset where 80% of the clusters had 3 subunits and 20% had 2 subunits. The same approach was used to create a dataset where 95% of the clusters had 3 subunits and 5% of the clusters had 2 subunits. These unbalanced data simulations were performed for each of
(d) (ρxy, cross-correlation) = (0, 0), ρxx = ρyy = 0.5 to assess type I error.
(e) (ρxy, cross-correlation) = (0.2, 0.1), ρxx = ρyy = 0.5 to assess power.
Finally, for each of the designs in (a) – (e) separate simulations were performed for n=50 and n=100. The results are summarized in Table 1.
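The data-generating step and the two ad hoc comparison methods can be sketched in plain Python for k = 2. This is an illustration, not the simulation code used for the reported results: the function names are hypothetical, unit variances and zero means are assumed as stated above, and the MLE fit itself (which requires mixed effects software) is not reproduced.

```python
import math
import random


def cluster_covariance(rho_xy, rho_cross, rho_xx, rho_yy):
    """Covariance of (x1, x2, y1, y2) with unit variances."""
    return [
        [1.0, rho_xx, rho_xy, rho_cross],
        [rho_xx, 1.0, rho_cross, rho_xy],
        [rho_xy, rho_cross, 1.0, rho_yy],
        [rho_cross, rho_xy, rho_yy, 1.0],
    ]


def cholesky(a):
    """Lower-triangular factor of a positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][m] * low[j][m] for m in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def simulate_clusters(n, sigma, rng):
    """Draw n clusters (x1, x2, y1, y2) from N(0, sigma)."""
    low = cholesky(sigma)
    data = []
    for _ in range(n):
        e = [rng.gauss(0.0, 1.0) for _ in range(4)]
        data.append([sum(low[r][c] * e[c] for c in range(r + 1))
                     for r in range(4)])
    return data


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def one_subunit_estimate(data):
    """Pearson correlation using only the first subunit of each cluster."""
    return pearson([row[0] for row in data], [row[2] for row in data])


def all_subunits_estimate(data):
    """Pearson correlation pooling all 2n pairs, ignoring clustering."""
    xs = [row[0] for row in data] + [row[1] for row in data]
    ys = [row[2] for row in data] + [row[3] for row in data]
    return pearson(xs, ys)


if __name__ == "__main__":
    rng = random.Random(1)
    sample = simulate_clusters(50, cluster_covariance(0.2, 0.1, 0.5, 0.5), rng)
    print(one_subunit_estimate(sample), all_subunits_estimate(sample))
```

Both ad hoc point estimates are consistent for ρxy; the difference between them lies in the standard error, which the all subunits method understates when ρxx and ρyy are positive.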
Table 1.
k | ρxy | cross-correlation | ρxx (ρyy) | method | n=50: ρ̂xy | n=50: Type I error | n=50: coverage prob. (%) | n=50: power (%) | n=100: ρ̂xy | n=100: Type I error | n=100: coverage prob. (%) | n=100: power (%)
---|---|---|---|---|---|---|---|---|---|---|---|---
Balanced Designs | | | | | | | | | | | |
2 | 0 | 0 | 0.2 | MLE | 0.001 | 0.052 | 94.9 | --- | 0.001 | 0.047 | 95.3 | ---
2 | 0 | 0 | 0.2 | 1 subunit | −0.001 | 0.054 | 94.6 | --- | 0.002 | 0.052 | 94.8 | ---
2 | 0 | 0 | 0.2 | all subunits | 0.000 | 0.061 | 94.0 | --- | 0.001 | 0.053 | 99.3 | ---
2 | 0 | 0 | 0.5 | MLE | 0.001 | 0.056 | 94.4 | --- | 0.001 | 0.050 | 95.2 | ---
2 | 0 | 0 | 0.5 | 1 subunit | −0.001 | 0.054 | 94.6 | --- | 0.002 | 0.052 | 94.8 | ---
2 | 0 | 0 | 0.5 | all subunits | 0.001 | 0.091 | 90.9 | --- | 0.001 | 0.077 | 98.7 | ---
2 | 0 | 0 | 0.8 | MLE | 0.001 | 0.057 | 94.6 | --- | 0.002 | 0.050 | 95.2 | ---
2 | 0 | 0 | 0.8 | 1 subunit | −0.001 | 0.054 | 94.6 | --- | 0.002 | 0.052 | 94.8 | ---
2 | 0 | 0 | 0.8 | all subunits | 0.000 | 0.133 | 86.7 | --- | 0.002 | 0.127 | 97.0 | ---
2 | 0.2 | 0.1 | 0.2 | MLE | 0.200 | --- | 95.0 | 49.3 | 0.200 | --- | 95.3 | 79.9
2 | 0.2 | 0.1 | 0.2 | 1 subunit | 0.198 | --- | 94.6 | 28.3 | 0.201 | --- | 94.7 | 53.1
2 | 0.2 | 0.1 | 0.2 | all subunits | 0.199 | --- | 94.4 | 51.4 | 0.200 | --- | 94.8 | 81.1
2 | 0.2 | 0.1 | 0.5 | MLE | 0.201 | --- | 94.7 | 43.3 | 0.200 | --- | 95.3 | 72.0
2 | 0.2 | 0.1 | 0.5 | 1 subunit | 0.198 | --- | 94.6 | 28.3 | 0.201 | --- | 94.7 | 53.1
2 | 0.2 | 0.1 | 0.5 | all subunits | 0.200 | --- | 91.2 | 52.5 | 0.201 | --- | 92.1 | 78.9
2 | 0.2 | 0.1 | 0.8 | MLE | 0.201 | --- | 94.9 | 33.2 | 0.202 | --- | 95.2 | 61.1
2 | 0.2 | 0.1 | 0.8 | 1 subunit | 0.198 | --- | 94.6 | 28.3 | 0.201 | --- | 94.7 | 53.1
2 | 0.2 | 0.1 | 0.8 | all subunits | 0.201 | --- | 86.4 | 52.4 | 0.202 | --- | 87.1 | 76.7
2 | 0.5 | 0.4 | 0.5 | MLE | 0.499 | --- | 94.6 | 99.8 | 0.500 | --- | 94.6 | 100.0
2 | 0.5 | 0.4 | 0.5 | 1 subunit | 0.497 | --- | 94.9 | 96.4 | 0.500 | --- | 95.0 | 100.0
2 | 0.5 | 0.4 | 0.5 | all subunits | 0.496 | --- | 92.6 | 99.9 | 0.498 | --- | 92.9 | 100.0
2 | 0.5 | 0.4 | 0.8 | MLE | 0.500 | --- | 95.5 | 99.0 | 0.502 | --- | 95.3 | 100.0
2 | 0.5 | 0.4 | 0.8 | 1 subunit | 0.497 | --- | 94.9 | 96.4 | 0.500 | --- | 95.0 | 100.0
2 | 0.5 | 0.4 | 0.8 | all subunits | 0.498 | --- | 87.1 | 99.7 | 0.500 | --- | 87.6 | 100.0
3 | 0 | 0 | 0.2 | MLE | 0.001 | 0.052 | 95.0 | --- | 0.000 | 0.056 | 94.5 | ---
3 | 0 | 0 | 0.2 | 1 subunit | −0.001 | 0.054 | 94.6 | --- | 0.002 | 0.052 | 94.8 | ---
3 | 0 | 0 | 0.2 | all subunits | 0.001 | 0.060 | 94.0 | --- | 0.000 | 0.064 | 93.7 | ---
3 | 0 | 0 | 0.5 | MLE | 0.003 | 0.052 | 95.2 | --- | 0.001 | 0.057 | 94.6 | ---
3 | 0 | 0 | 0.5 | 1 subunit | −0.001 | 0.054 | 94.6 | --- | 0.002 | 0.052 | 94.8 | ---
3 | 0 | 0 | 0.5 | all subunits | 0.003 | 0.113 | 88.7 | --- | 0.001 | 0.120 | 88.0 | ---
3 | 0 | 0 | 0.8 | MLE | 0.004 | 0.048 | 95.7 | --- | 0.001 | 0.052 | 95.1 | ---
3 | 0 | 0 | 0.8 | 1 subunit | −0.001 | 0.054 | 94.6 | --- | 0.002 | 0.052 | 94.8 | ---
3 | 0 | 0 | 0.8 | all subunits | 0.004 | 0.197 | 80.3 | --- | 0.002 | 0.202 | 79.8 | ---
3 | 0.2 | 0.1 | 0.2 | MLE | 0.201 | --- | 95.1 | 66.6 | 0.200 | --- | 94.4 | 92.0
3 | 0.2 | 0.1 | 0.2 | 1 subunit | 0.198 | --- | 94.6 | 28.3 | 0.201 | --- | 94.7 | 53.1
3 | 0.2 | 0.1 | 0.2 | all subunits | 0.199 | --- | 94.2 | 68.6 | 0.199 | --- | 93.7 | 92.9
3 | 0.2 | 0.1 | 0.5 | MLE | 0.203 | --- | 95.2 | 52.4 | 0.202 | --- | 94.6 | 81.7
3 | 0.2 | 0.1 | 0.5 | 1 subunit | 0.198 | --- | 94.6 | 28.3 | 0.201 | --- | 94.7 | 53.1
3 | 0.2 | 0.1 | 0.5 | all subunits | 0.202 | --- | 88.9 | 66.5 | 0.201 | --- | 88.1 | 89.7
3 | 0.2 | 0.1 | 0.8 | MLE | 0.205 | --- | 95.4 | 32.4 | 0.202 | --- | 95.1 | 63.7
3 | 0.2 | 0.1 | 0.8 | 1 subunit | 0.198 | --- | 94.6 | 28.3 | 0.201 | --- | 94.7 | 53.1
3 | 0.2 | 0.1 | 0.8 | all subunits | 0.204 | --- | 79.8 | 34.9 | 0.201 | --- | 79.4 | 84.8
3 | 0.5 | 0.4 | 0.5 | MLE | 0.500 | --- | 94.6 | 100.0 | 0.500 | --- | 93.6 | 100.0
3 | 0.5 | 0.4 | 0.5 | 1 subunit | 0.497 | --- | 94.9 | 96.4 | 0.500 | --- | 95.0 | 100.0
3 | 0.5 | 0.4 | 0.5 | all subunits | 0.496 | --- | 90.3 | 100.0 | 0.498 | --- | 90.3 | 100.0
3 | 0.5 | 0.4 | 0.8 | MLE | 0.503 | --- | 95.9 | 99.3 | 0.501 | --- | 95.2 | 100.0
3 | 0.5 | 0.4 | 0.8 | 1 subunit | 0.497 | --- | 94.9 | 96.4 | 0.500 | --- | 95.0 | 100.0
3 | 0.5 | 0.4 | 0.8 | all subunits | 0.500 | --- | 80.4 | 100.0 | 0.499 | --- | 79.5 | 100.0
Unbalanced Designs | | | | | | | | | | | |
2 (5%), 3 (95%) | 0 | 0 | 0.5 | MLE | 0.002 | 0.055 | 95.2 | --- | 0.000 | 0.058 | 94.7 | ---
2 (5%), 3 (95%) | 0.2 | 0.1 | 0.5 | MLE | 0.202 | --- | 95.4 | 52.6 | 0.201 | --- | 94.7 | 81.8
2 (20%), 3 (80%) | 0 | 0 | 0.5 | MLE | −0.019 | 0.049 | 95.5 | --- | −0.008 | 0.056 | 94.8 | ---
2 (20%), 3 (80%) | 0.2 | 0.1 | 0.5 | MLE | 0.182 | --- | 95.2 | 40.7 | 0.193 | --- | 94.5 | 76.9
4000 simulations were run for each set of parameter combinations. For the unbalanced designs, the k column gives the cluster sizes and the percentage of clusters of each size.
For balanced designs, we see that bias is low for all 3 methods. For the MLE method type I error ranges from 0.047 to 0.057 (mean = 0.052), which is acceptable. Similarly, the 1 subunit method type I error ranges from 0.052 to 0.054 (mean = 0.053). However, type I error is not preserved for the all subunits method and ranges from 0.053 to 0.064 (mean = 0.061) for ρxx = 0.2, from 0.077 to 0.120 (mean = 0.100) for ρxx = 0.5 and from 0.127 to 0.202 (mean = 0.155) for ρxx = 0.8.
Coverage probability for the MLE method ranges from 93.6% to 95.9% (mean = 95.0%), which is acceptable. Similarly, coverage probability for the 1 subunit method ranges from 94.6% to 95.0% (mean = 94.8%), which is also acceptable. However, for the all subunits method coverage ranges from 93.7% to 99.3% (mean = 94.8%) when ρxx = 0.2, from 88.0% to 98.7% (mean = 93.4%) when ρxx = 0.5 and from 79.4% to 97.0% (mean = 84.3%) when ρxx = 0.8. In general, the high type I error when ρxx = 0.5 or 0.8 is due to an inappropriately low se and a corresponding low coverage probability.
Power for the MLE method is at least as great as for the 1 subunit method in every design, sometimes dramatically greater, especially when ρxx = 0.2. For example, when ρxy = 0.2, ρxx = 0.2 and n = 100, power is 79.9% for the MLE method and 53.1% for the 1 subunit method when k=2, and 92.0% for the MLE method and 53.1% for the 1 subunit method when k=3. Overall, the MLE method has appropriate type I error and coverage probability and improved power vs. the 1 subunit method. The all subunits method has inappropriate type I error, especially when ρxx ≥ 0.5. Thus, overall the MLE method is the preferred method among the 3 approaches.
For unbalanced designs, with 5% of subjects missing a 3rd replicate, bias is low and coverage probability ranges from 94.7% to 95.4%. With 20% of subjects missing a 3rd replicate, there was slight negative bias both for n=50 and to a lesser extent n=100. Power in the 5% missing data design was 81.8% for n=100 and 52.6% for n=50, virtually the same as with no missing data (k=3, n=100, 81.7%; n=50, 52.4%). Power in the 20% missing data design was noticeably reduced (n=100, 76.9%; n=50, 40.7%). Overall, missing data methods are an effective approach to analyzing clustered data with unbalanced designs.
An assumption of the estimation procedure for rank correlation is that the distributions of both X and Y are continuous. We also performed simulation studies to assess the adequacy of estimation of rank correlation for clustered data based on (13) and (14) when the distribution of X and/or Y is discrete. For this purpose, we simulated (X,Y) scores from a bivariate normal distribution, computed the sample probits of the X and Y scores, and then subdivided the sample probit scores into quintiles and replaced each individual sample probit score by the median sample probit score within its quintile. We then obtained point and interval estimates of the rank correlation based on (13) and (14). We considered the following parameter combinations: (correlation, cross-correlation) = (0, 0), (0.2, 0.1); n = (50,100); ρxx = ρyy = 0.5; k = 3. Four thousand simulations were conducted for each parameter combination. The results are given in Appendix Table 1.
We see that when the correlation and cross-correlation are both 0, the type I error ranges from 0.054 to 0.055. Over the 4 parameter combinations the coverage probability ranges from 94.6% to 95.9%. Power is 52.4% when n=50 and 76.8% when n=100; the latter is slightly lower than for continuous distributions with the same parameter combinations ((correlation, cross-correlation) = (0.2, 0.1), ρxx = ρyy = 0.5, k = 3: power = 81.7%, n=100; 52.4%, n=50). Overall, the approach in (13) and (14) for estimation of rank correlation seems adequate for discrete distributions based on these designs.
Another assumption of the procedure for estimation of rank correlation for clustered data is that the distribution of (HX,HY) is bivariate normal. By definition HX and HY are univariate N(0,1) random variables, but the assumption of bivariate normality may not hold. To assess the sensitivity of our rank correlation estimation procedures to this assumption we generated (HX,HY) from a Kimeldorf-Sampson (KS) bivariate copula [9] of the form:
(17) |
Note that the univariate margins of F are N(0,1) random variables. Also, if δ = 0, then HX and HY are independent, i.e., F(a,b,0) = Φ(a) Φ(b).
To simulate data from this family we first obtained the conditional distribution of HY|HX from (17) given by:
(18) |
We then generated two iid U(0,1) random variables (U1, U2) to correspond to Φ(a) and FHY|HX(b|a,δ) respectively, and solved for a and b as follows:
(19) |
To simulate clustered data from this family, we created a clustered KS copula by first generating (a,b) as in (19). We then generated iid (eaq,ebq),q = 1,2 with univariate N(0,g) margins from this copula based on
(20) |
where (U1q, U2q) are iid U(0,1), (U11,U21) is independent of (U12,U22) and (U1q, U2q), q = 1,2 are independent of (U1, U2). Finally, we created clustered data using
(21) |
Note that based on (17–21) xq and yq are univariate N(0,1), but F(xq, yq) is not bivariate normal.
To determine the value of δ we refer to Joe [10] (page 147), who shows by numerical integration that with copulas from the family in (17) for δ = 1.06, the Spearman rank correlation = 0.5. However, the clustered KS copula in (21) is a summation of two random variables from this family and does not correspond to the distribution of (HX, HY) in (17). Hence, we simulated 100,000 random vectors (X1,Y1,X2,Y2) from (21) for g = 1/2 and g = 1, respectively, and empirically estimated ρxy,s and the Spearman cross-correlation from
Using δ = 1.06, this yielded for g = 1/2 and (0.497,0.246) for g = 1.
We then assumed these were the true values of ρxy,s and the Spearman cross-correlation and performed the estimation procedure in (13) and (14) to obtain point and interval estimates of ρxy,s based on 4,000 samples of size 50 and 100, respectively, thus obtaining empirical estimates of bias and coverage probability. The results are provided in Appendix Table 2.
We see that the median estimated value is slightly lower than the true value for both ρxy,s and the Spearman cross-correlation, although the absolute bias is small (≤ 0.011). The coverage probability ranges from 94.1% to 96.6% for g = 1/2 and 91.4% to 96.7% for g = 1 (overall mean = 94.8%), which is acceptable. In summary, the estimation procedure for rank correlation in (13) and (14) performs well for probit scores (HX,HY) whose distribution is not bivariate normal, but instead comes from the clustered KS copula in (21).
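The pairwise sampling step in (17) to (19) can be sketched by conditional inversion. The sketch assumes that (17) is the usual one-parameter form of this copula, C(u,v) = (u^−δ + v^−δ − 1)^(−1/δ) with N(0,1) margins; the function names are hypothetical, and only the unclustered pair (a,b) is generated, not the clustered construction in (20) and (21).

```python
import math
import random
from statistics import NormalDist

_NORMAL = NormalDist()


def _open_uniform(rng):
    """Uniform draw on the open interval (0, 1)."""
    u = rng.random()
    while u <= 0.0:
        u = rng.random()
    return u


def ks_copula_pair(delta, rng):
    """One (a, b) pair with N(0,1) margins from the assumed copula, delta > 0.

    u1 plays the role of Phi(a); u2 is the conditional c.d.f. value of b
    given a, which is inverted in closed form to get v = Phi(b).
    """
    u1 = _open_uniform(rng)
    u2 = _open_uniform(rng)
    v = ((u2 ** (-delta / (1.0 + delta)) - 1.0) * u1 ** (-delta)
         + 1.0) ** (-1.0 / delta)
    v = min(max(v, 1e-12), 1.0 - 1e-12)   # keep inv_cdf inside its domain
    return _NORMAL.inv_cdf(u1), _NORMAL.inv_cdf(v)


def _ranks(values):
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for position, index in enumerate(order):
        out[index] = position + 1
    return out


def spearman(xs, ys):
    """Spearman rank correlation for samples without ties."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mean_rank = (n + 1) / 2.0
    sxy = sum((a - mean_rank) * (b - mean_rank) for a, b in zip(rx, ry))
    sxx = sum((a - mean_rank) ** 2 for a in rx)
    return sxy / sxx   # both rank vectors have the same sum of squares


if __name__ == "__main__":
    rng = random.Random(7)
    pairs = [ks_copula_pair(1.06, rng) for _ in range(5000)]
    print(spearman([p[0] for p in pairs], [p[1] for p in pairs]))
```

With δ = 1.06 the empirical Spearman correlation of a large sample should be close to the value of 0.5 cited above for this family, which gives a quick check on the sampler.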
7. Examples
(a) Estimating the correlation between visual field and visual acuity in RP patients
We have a sample of 220 RP patients who were enrolled in a clinical trial assessing the effects of docosahexaenoic acid (DHA) supplementation vs. placebo on the rate of decline of visual function over a 4-year period [11]. For this analysis we assess the correlation between Humphrey 30-2 visual field (VF) and ETDRS visual acuity (VA) in individual eyes at the initial (screening) visit. Two hundred thirteen of the patients had VF and VA available in both eyes; five patients had (VA, VF) available in a single eye, but neither available in the fellow eye; two patients had (VA, VF) available in one eye and VA, but not VF available in the fellow eye. Covariate values (i.e., age, gender and genetic type) were complete for all subjects. We used multiple imputation methods as described in Section 5 to impute the missing VF and VA values and perform the analyses described in sections 2, 3 and 4.
We first provide descriptive statistics of VA and VF in Table 2. We see that the mean ETDRS score (VA) = 51.5 ± 0.6 (mean ± se) letters while the mean Humphrey 30-2 visual field (VF) = 848.9 ± 31.7 dB. For reference purposes, a person with 20/20 vision has an ETDRS score of 85 letters; an ETDRS score of 50 letters corresponds roughly to a visual acuity of 20/100 [12]. A normal Humphrey 30-2 visual field is about 2500 dB. We see that VF but not VA is lower for older subjects and lower for males vs. females. Furthermore, both VA and VF vary substantially by genetic type, with the best function among dominant patients and the worst function among sex-linked patients. Other genetic types had intermediate levels of function.
We now provide estimates of the Pearson correlation between VF and VA for individual eyes using the eye as the unit of analysis based on the model in equation 7. The results are given in Table 3. We see that the estimated Pearson correlation between VF(y) and VA(x) in the same eye = 0.212 (95% CI = 0.088, 0.329), p<0.001, while the Pearson cross-correlation (correlation between VF in the right eye and VA in the left eye, and conversely) = 0.177 (95% CI = 0.054, 0.295), p=0.004. We also computed a partial Pearson correlation, adjusting for age, gender, and genetic type (person-specific covariates) and right vs. left eye (eye-specific covariate) based on equation 16. The partial Pearson correlation = 0.195 and the partial Pearson cross-correlation = 0.156, which are lower than the crude correlations, although still statistically significant (p=0.002, 0.012, respectively).
Table 2.
Variable | Category | VA**: mean (se) | VA**: Neyes (Neyes,obs†) | VA**: 95% CI | VF***: mean (se) | VF***: Neyes (Neyes,obs†) | VF***: 95% CI
---|---|---|---|---|---|---|---
overall | | 51.5 (0.6) | 440 (435) | (50.3, 52.8) | 848.9 (31.7) | 440 (433) | (786.7, 911.1)
age (yrs) | < 35 | 50.3 (1.0) | 166 (166) | (48.3, 52.3) | 934.3 (51.2) | 166 (165) | (833.9, 1034.6)
age (yrs) | ≥ 35 | 52.3 (1.0) | 274 (269) | (50.7, 53.8) | 797.2 (39.9) | 274 (268) | (719.0, 875.4)
gender | M | 51.0 (0.9) | 218 (217) | (49.2, 52.7) | 798.4 (44.9) | 218 (216) | (710.4, 886.3)
gender | F | 52.1 (0.9) | 222 (218) | (50.3, 53.8) | 898.5 (44.5) | 222 (217) | (811.3, 985.7)
genetic type+ | DOM | 54.4 (1.3) | 102 (100) | (51.9, 56.9) | 1005.6 (65.1) | 102 (99) | (878.0, 1133.2)
genetic type+ | AR | 51.3 (1.7) | 60 (60) | (48.1, 54.6) | 849.6 (84.9) | 60 (59) | (683.3, 1015.5)
genetic type+ | XL | 44.1 (2.3) | 30 (30) | (39.5, 48.7) | 670.6 (119.9) | 30 (30) | (435.5, 905.7)
genetic type+ | ISO | 52.2 (0.9) | 214 (211) | (50.5, 53.9) | 812.1 (45.0) | 214 (211) | (724.0, 900.2)
genetic type+ | UND | 45.4 (2.2) | 34 (34) | (41.0, 49.7) | 766.3 (112.7) | 34 (34) | (545.4, 987.1)
eye | right eye (OD) | 51.5 (0.6) | 220 (219) | (50.3, 52.7) | 873.0 (31.7) | 220 (217) | (810.4, 935.5)
eye | left eye (OS) | 51.6 (0.7) | 220 (216) | (50.1, 53.0) | 824.8 (32.6) | 220 (216) | (760.6, 889.0)
* VA is available for 219 patients for the right eye (OD) and 216 patients for the left eye (OS); VF is available for 217 patients for the right eye (OD) and 216 patients for the left eye (OS); age, gender and genetic type are available for all 220 patients; missing data are accounted for using multiple imputation.
** VA = visual acuity (ETDRS letter score)
*** VF = Humphrey visual field (dB)
+ DOM = dominant; AR = recessive; XL = sex-linked; ISO = isolate; UND = undetermined
† Neyes,obs = number of eyes observed for the characteristic; Neyes = number of eyes observed + number of eyes imputed for the characteristic
Table 3.
Type of correlation | Parameter | Crude: point estimate | Crude: 95% CI | Crude: P-value | Crude: Neyes (Neyes,obs**) | Adjusted: point estimate | Adjusted: 95% CI | Adjusted: P-value | Adjusted: Neyes (Neyes,obs**) |
---|---|---|---|---|---|---|---|---|---|
Pearson | ρxy | 0.212 | (0.088, 0.329) | <0.001 | 440 (433) | 0.195 | (0.071, 0.312) | 0.002 | 440 (433) |
Pearson | cross-correlation | 0.177 | (0.054, 0.295) | 0.004 | 440 (433) | 0.156 | (0.033, 0.274) | 0.012 | 440 (433) |
Pearson | ρxx | 0.787 | | | | 0.766 | | | |
Pearson | | 100.2 | | | | 91.1 | | | |
Pearson | ρyy | 0.944 | | | | 0.943 | | | |
Pearson | | 227,330 | | | | 201,710 | | | |
Rank | ρxy,s | 0.229 | (0.110, 0.341) | <0.001 | 440 (433) | 0.221 | (0.103, 0.333) | <0.001 | 440 (433) |
Rank | cross-correlation | 0.200 | (0.082, 0.312) | <0.001 | 440 (433) | 0.189 | (0.072, 0.301) | 0.001 | 440 (433) |
Rank | ρxx,s | 0.819 | | | | 0.807 | | | |
Rank | | 0.946 | | | | 0.891 | | | |
Rank | ρyy,s | 0.909 | | | | 0.900 | | | |
Rank | | 0.962 | | | | 0.884 | | | |
VA is available for 219 patients for the right eye (OD) and 216 patients for the left eye (OS); VF is available for 217 patients for the right eye (OD) and 216 patients for the left eye (OS); age, gender and genetic type are available for all 220 patients; missing data are accounted for using multiple imputation.
** Neyes,obs = number of eyes where both VA and VF are observed; Neyes = number of eyes observed + number of eyes imputed for the characteristic
Because both VF and VA are not normally distributed, we also computed the corresponding crude and partial rank correlations based on equation 13. The rank correlation coefficients are slightly larger than the corresponding Pearson correlations (ordinary rank correlation = 0.229, p<0.001; partial rank correlation = 0.221, p<0.001; ordinary rank cross-correlation = 0.200, p<0.001; partial rank cross-correlation = 0.189, p = 0.001).
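The probit-based route to rank correlation can be illustrated with a minimal Python sketch. It computes probit scores of the form Φ⁻¹[rank/(n+1)] (the form used in Theorem 3 of the Appendix), takes their Pearson correlation, and converts the result to the rank scale. Three assumptions are made here that are not confirmed by this section: that equation 13 is the classical bivariate normal relationship ρs = (6/π) arcsin(ρ/2) [4, 6], that observations are independent (the clustered MLE of equation 7 is not implemented), and that there are no ties. The function names and the data are hypothetical.

```python
import math
from statistics import NormalDist


def probit_scores(values):
    """Probit score of each value: inverse normal c.d.f. of rank/(n+1).

    Assumes no ties; ranks run from 1 (smallest) to n (largest).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = NormalDist().inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Ordinary product-moment correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def pearson_to_spearman(rho):
    """Assumed form of the Pearson-to-rank conversion under bivariate normality."""
    return (6.0 / math.pi) * math.asin(rho / 2.0)


# Hypothetical values for five eyes (not data from the paper).
va = [52.0, 47.0, 60.0, 35.0, 55.0]
vf = [900.0, 610.0, 1500.0, 700.0, 820.0]
rho_probit = pearson(probit_scores(va), probit_scores(vf))
rho_rank = pearson_to_spearman(rho_probit)
```

Under this conversion the rank correlation is never larger in absolute value than the Pearson correlation of the probit scores, and the two coincide at 0 and ±1.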
Note that for both Pearson and rank correlation, the cross-correlations are almost as large as the corresponding inter-class correlations (ρxy, ρxy,s), indicating likely efficiency gains in estimation of ρxy and ρxy,s by ρ̂xy,MLE as presented in equation 7 vs. the ordinary Pearson product-moment correlation given by ρ̂xy,Pearson, since the residual variance will be reduced if tij,2 is included as a predictor of yij in equation 7.
To check the assumption of bivariate normality, we note that if HX is univariate normal, then (HX,HY) will be bivariate normal if HY|HX ~ N(α + βHX,σ2). To test this assumption we (1) ran a linear regression of Ĥi,Yj,n on Ĥ i,Xj,n, (2) obtained studentized residuals from the regression and (3) performed the Shapiro-Wilk test of normality on the distribution of studentized residuals (available in PROC UNIVARIATE of SAS). This was done separately for right (j=1) and left (j=2) eyes, where Y = VF and X = VA. The Shapiro-Wilk test statistic was W = 0.98, p > 0.05 for right eyes and W = 0.98, p > 0.05 for left eyes indicating that the assumption of bivariate normality is appropriate for these data.
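Steps (1) and (2) of this diagnostic can be sketched in Python as follows. The sketch fits a simple linear regression of one set of probit scores on the other and returns internally studentized residuals; whether the paper used internally or externally studentized residuals is not stated, so that choice is an assumption. Step (3), the Shapiro-Wilk test, is not implemented here (the paper used PROC UNIVARIATE of SAS). The function name and the data are hypothetical.

```python
import math


def studentized_residuals(x, y):
    """Internally studentized residuals from least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    alpha = my - beta * mx
    resid = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))
    # Leverage of each point in a simple linear regression.
    leverage = [1.0 / n + (xi - mx) ** 2 / sxx for xi in x]
    return [e / (s * math.sqrt(1.0 - h)) for e, h in zip(resid, leverage)]


# Hypothetical probit scores for six right eyes (not data from the paper).
h_va = [-1.07, -0.57, -0.18, 0.18, 0.57, 1.07]
h_vf = [-0.57, -1.07, 0.18, -0.18, 1.07, 0.57]
resid = studentized_residuals(h_va, h_vf)
# resid would next be passed to a Shapiro-Wilk routine, separately by eye.
```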
(b) Estimating the correlation between biomarkers of phenacetin and aspirin intake in a study of analgesic abuse among Swiss women
The Swiss Analgesic Study started in 1967/1968 [13]. There were 1244 Swiss women participating in the study. The goal was to evaluate the association between the use of phenacetin-containing analgesics and the prevalence and incidence of kidney disorders. NAPAP (N-acetyl-P-aminophenol) is a urinary metabolite associated with recent use of phenacetin-containing analgesics. NAPAP was initially measured in a urine sample at the baseline clinic visit. There were two additional urine samples collected at home on 2 separate days within 1 week of the baseline clinic visit. One issue is that some analgesics contain both phenacetin and aspirin and it is important to determine the correlation between biomarkers that reflect recent phenacetin and aspirin intake. Salicylate concentrations in urine may be increased after recent intake of aspirin-containing analgesics, but the amount of increase depends on the pH level in urine with alkaline urine (pH > 7.0) resulting in increased secretion in urine [14]. Urinary salicylates may also increase after ingestion of other substances (e.g., cranberry juice) [15].
For this analysis, concentrations of NAPAP and salicylates were determined from the same 3 urine specimens for each woman. NAPAP was represented as a continuous variable and expressed in units of optical density (o.d.), but its distribution is highly skewed and non-normal. Urinary salicylates (mg%) were measured in four categories (0-19 mg% / 20-49 mg% / 50-99 mg% / 100+ mg%) and converted to a continuous scale by using the median value within each category (10 mg% / 35 mg% / 75 mg% / 100 mg%). Since the distributions of both NAPAP and salicylates were non-normal, the rank correlation is a natural measure of association. Descriptive data for this study are provided in Table 4.
Table 4.
Variable | NAPAP (o.d.) mean (se) | NAPAP Nvis (Nvis,obs)** | NAPAP (95% CI) | Salicylates (mg%) mean (se) | Salicylates Nvis (Nvis,obs)** | Salicylates (95% CI) |
---|---|---|---|---|---|---|
overall | 0.162 (0.008) | 3732 (3680) | (0.147, 0.178) | 15.0 (0.4) | 3732 (3600) | (14.2, 15.7) |
age | | | | | | |
<40 | 0.146 (0.013) | 1311 (1292) | (0.120, 0.172) | 15.2 (0.7) | 1311 (1265) | (13.9, 16.5) |
≥40 | 0.171 (0.010) | 2421 (2388) | (0.152, 0.191) | 14.8 (0.5) | 2421 (2335) | (13.9, 15.8) |
weight (kg) | | | | | | |
≤57.0 | 0.202 (0.016) | 912 (898) | (0.171, 0.233) | 16.4 (0.8) | 912 (880) | (14.9, 18.0) |
57.1 – 63.9 | 0.134 (0.017) | 840 (832) | (0.102, 0.167) | 13.2 (0.8) | 840 (813) | (11.6, 14.8) |
64.0 – 72.9 | 0.142 (0.014) | 1113 (1098) | (0.114, 0.170) | 14.6 (0.7) | 1113 (1075) | (13.2, 16.1) |
≥73.0 | 0.174 (0.016) | 867 (851) | (0.142, 0.206) | 15.6 (0.8) | 867 (832) | (14.0, 17.2) |
Number of cigarettes currently smoked per day | | | | | | |
0 | 0.134 (0.010) | 2295 (2266) | (0.117, 0.156) | 14.5 (0.5) | 2295 (2222) | (13.5, 15.5) |
1 – 9 | 0.157 (0.017) | 759 (745) | (0.123, 0.191) | 15.4 (0.9) | 759 (725) | (13.7, 17.1) |
≥10 | 0.258 (0.018) | 678 (669) | (0.222, 0.294) | 16.1 (0.9) | 678 (653) | (14.4, 17.9) |
location of urine collection | | | | | | |
clinic visit | 0.175 (0.010) | 1244 (1244) | (0.155, 0.194) | 15.6 (0.5) | 1244 (1202) | (14.5, 16.6) |
1st home urine | 0.153 (0.009) | 1244 (1217) | (0.136, 0.170) | 14.6 (0.5) | 1244 (1191) | (13.7, 15.6) |
2nd home urine | 0.160 (0.009) | 1244 (1219) | (0.142, 0.177) | 14.7 (0.5) | 1244 (1207) | (13.7, 15.7) |
NAPAP is available for 1244 women at the clinic visit, 1217 women at the 1st home urine and 1219 women at the 2nd home urine; salicylates are available for 1202 women at the clinic visit, 1191 women at the 1st home urine and 1207 women at the 2nd home urine; age, weight and number of cigarettes currently smoked are available for all 1244 women; missing data are accounted for using multiple imputation.
** Nvis,obs = number of visits where the characteristic was observed; Nvis = number of visits where the characteristic was observed + number of visits where the characteristic was imputed.
There were 90 women (7.2%) who had missing data for at least one NAPAP replicate and/or one salicylate replicate. We used multiple imputation methods with 5 imputations to impute the missing data (using PROC MI of SAS) and to perform the analyses in equation 11 (using PROC MIANALYZE of SAS).
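The pooling step performed by PROC MIANALYZE follows Rubin's rules [8], which can be sketched in a few lines of Python. The pooled estimate is the mean of the per-imputation estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The scale on which the paper pools the correlation estimates is not stated in this section, so the sketch is generic; the function name and the numbers are hypothetical.

```python
def rubin_pool(estimates, variances):
    """Combine m per-imputation estimates and variances by Rubin's rules.

    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total


# Hypothetical results from 5 imputed data sets (not values from the paper).
estimates = [0.40, 0.41, 0.39, 0.42, 0.40]
variances = [0.0004, 0.0004, 0.0004, 0.0004, 0.0004]
pooled, total_var = rubin_pool(estimates, variances)
```

When the imputed data sets agree closely, the between-imputation term is small and the total variance is close to the within-imputation variance.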
We see that mean NAPAP was higher for women age 40 or older vs. women < 40 years of age, while mean salicylates were comparable in the two age groups. Mean NAPAP was higher for both light (≤57.0kg) and heavier (≥73.0kg) women and was much higher for women who currently smoke ≥10 cigarettes/day vs. non-current smokers. The trends for salicylates were in the same direction. Both NAPAP and salicylates were somewhat higher at the clinic visit than at the two home urines.
The correlation analyses based on equation 11 are reported in Table 5. The intraclass correlation (ICC) was higher for NAPAP (ρyy = 0.611) than for salicylates (ρxx = 0.402). The Pearson correlation between NAPAP and salicylates was 0.391 (p<0.001) when measured at the same visit (ρxy) and 0.248 when measured at different visits (the cross-correlation). After adjusting for age, weight, smoking and location of urine collection, the estimated Pearson correlation between NAPAP and salicylates remained about the same, both when measured at the same visit (ρxy = 0.390) and when measured at different visits (0.246).
Table 5.
Type of correlation | Parameter | Crude†: point estimate | Crude†: 95% CI | Crude†: P-value | Crude†: Nvis (Nvis,obs)** | Adjusted‡: point estimate | Adjusted‡: 95% CI | Adjusted‡: P-value | Adjusted‡: Nvis (Nvis,obs)** |
---|---|---|---|---|---|---|---|---|---|
Pearson | ρxy | 0.391 | (0.355, 0.426) | <0.001 | 3732 (3597) | 0.390 | (0.354, 0.425) | <0.001 | 3732 (3597) |
Pearson | cross-correlation | 0.248 | (0.213, 0.282) | <0.001 | 3732 (3597) | 0.246 | (0.212, 0.280) | <0.001 | 3732 (3597) |
Pearson | ρxx | 0.402 | | | | 0.401 | | | |
Pearson | | 313.2 | | | | 312.5 | | | |
Pearson | ρyy | 0.611 | | | | 0.602 | | | |
Pearson | | 0.105 | | | | 0.102 | | | |
Rank | ρxy,s | 0.290 | (0.250, 0.328) | <0.001 | 3732 (3597) | 0.288 | (0.247, 0.328) | <0.001 | 3732 (3597) |
Rank | cross-correlation | 0.197 | (0.162, 0.231) | <0.001 | 3732 (3597) | 0.194 | (0.159, 0.229) | <0.001 | 3732 (3597) |
Rank | ρxx,s | 0.329 | | | | 0.329 | | | |
Rank | | 0.426 | | | | 0.426 | | | |
Rank | ρyy,s | 0.538 | | | | 0.528 | | | |
Rank | | 0.952 | | | | 0.932 | | | |
NAPAP is available for 1244 women at the clinic visit, 1217 women at the 1st home urine and 1219 women at the 2nd home urine; salicylates are available for 1202 women at the clinic visit, 1191 women at the 1st home urine and 1207 women at the 2nd home urine; age, weight and number of cigarettes currently smoked are available for all 1244 women; missing data are accounted for using multiple imputation.
** Nvis,obs = number of visits where both urinary NAPAP and salicylates were available; Nvis = number of visits where both urinary NAPAP and salicylates were available + number of visits where either was imputed
† y = NAPAP, x = salicylates
‡ adjusted for age, weight, smoking and location of urine collection
Since the distribution of NAPAP was highly skewed and the distribution of salicylates was represented in 4 broad ordered categories, it was reasonable to also consider rank correlation as a measure of association. The estimated rank correlation was 0.290 when NAPAP and salicylates were measured at the same visit (ρxy,s) and 0.197 when measured at different visits (the rank cross-correlation). After adjusting for age, weight, smoking and location of urine collection, the rank correlations decreased slightly, both when measured at the same visit (ρxy,s = 0.288) and at different visits (0.194).
This example shows that there is a considerable difference between the same-visit correlation and the cross-correlation, on both the Pearson scale (0.391 vs. 0.248) and the rank scale (0.290 vs. 0.197), indicating that NAPAP and salicylate values at different visits are best modeled as distinguishable replicates.
We also performed the Shapiro-Wilk test of normality for the distribution of studentized residuals of Ĥi,Yj,n given Ĥi,Xj,n as described in the RP example. The Shapiro-Wilk test statistic was W = 0.99, p < 0.05 at the clinic visit, W = 0.99, p < 0.05 at the 1st home urine, and W = 0.99, p < 0.05 at the 2nd home urine. The reference value for the W test statistic under normality is W = 1.0. Thus, there was only a mild departure from the null value of W and the significance of the W test is due to the large sample size.
6. Discussion
In this paper we have presented methods for maximum likelihood estimation of Pearson correlation in the clustered data setting. The methods are applicable to clustered data based on distinguishable replicates and can be used both for clusters of size 2 (Equation 7) and clusters of size > 2 (Equation 11). They are based on a population-average type of model. A simpler approach to this problem from a population-average perspective is based on the regression model
(22) |
where
and
The corresponding estimator of ρxy denoted by ρ̂xy,standard is given by:
(23) |
The model in (22) and the corresponding estimator in (23) are actually a special case of Equation 11 that arises when the correlation between yij1 and xij2 is completely mediated by the correlation between yij and xij and between xij1 and xij2. In general, this will not always be the case, and the estimator in (23) is then not the MLE of ρxy under the model in (8).
Bland and Altman [16] have also considered a population-average approach to estimation of ρxy. With their approach, a pseudo-data set is created where pairs (x̄i, ȳi) are repeated ki times and ki = # replicates for the ith subject. Their estimator can be expressed in the form:
where
The ordinary Pearson correlation is then computed based on the K observations in the pseudo-data set and is tested for significance using a t distribution with n - 2 df as the reference distribution.
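The pseudo-data construction described above can be sketched in Python as follows. Each subject contributes its pair of cluster means (x̄i, ȳi) repeated ki times, and the ordinary Pearson correlation is computed on the resulting K = Σ ki observations. This is a reading of the verbal description only, since the displayed formula is not reproduced here; the significance test against a t distribution with n − 2 df is omitted, and the function names and data are hypothetical.

```python
import math


def pearson(a, b):
    """Ordinary product-moment correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def between_subject_correlation(clusters):
    """Correlation of cluster means, each mean repeated k_i times.

    clusters: one list per subject, holding that subject's (x, y) replicates.
    """
    pseudo_x, pseudo_y = [], []
    for reps in clusters:
        k = len(reps)
        x_bar = sum(x for x, _ in reps) / k
        y_bar = sum(y for _, y in reps) / k
        pseudo_x.extend([x_bar] * k)
        pseudo_y.extend([y_bar] * k)
    return pearson(pseudo_x, pseudo_y)


# Hypothetical replicates for four subjects with 2, 1, 3 and 2 replicates.
clusters = [
    [(1.0, 2.5), (3.0, 5.0)],
    [(5.0, 9.0)],
    [(0.0, 1.0), (2.0, 3.5), (4.0, 9.0)],
    [(6.0, 10.0), (7.0, 15.0)],
]
r_between = between_subject_correlation(clusters)
```

Repeating each mean ki times is equivalent to a weighted correlation of the cluster means with weights ki, which is why larger clusters have more influence on this estimator.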
Lorenz, Datta and Harkema [17] also consider a population-average approach, but based on observations from individual cluster-members rather than cluster averages. Their estimator takes the form:
Large sample asymptotic normality of ρ̂xy,LDH is assumed and significance testing is based on zLDH = ρ̂xy,LDH/[var(ρ̂xy,LDH)1/2] and compared with a N(0,1) distribution. The delta method is used to estimate var(ρ̂xy,LDH). The LDH approach is also applied to estimation of Kendall’s τ for clustered data.
Both the BA and LDH approaches assume that replicates are indistinguishable and allow for a variable number of replicates per cluster. With the BA estimator, the ith cluster-mean is weighted ki times. With the LDH estimator each cluster gets equal weight regardless of cluster-size. If either ρxx > 0 or ρyy > 0, the optimal weighting for the ith cluster will be between 1 and ki.
The estimators in equations 7 and 11 of this paper assume replicates are distinguishable. Missing (x, y) replicate scores are estimated by multiple imputation, based on (x, y) scores for other cluster members as well as values of other covariates, and the analysis is performed on nIMP imputed data sets using Rubin’s rules. The estimators are maximum likelihood, can be implemented with standard software, extend to the rank correlation setting and allow one to control for other covariates.
The distinguishing feature of ML estimation is that for clustered data the likelihood is based on
where and is maximized to estimate ρxy.
In the LDH approach the likelihood is based on
Under , yij is expressed as a function of xij, while under L1, yij is expressed as a function of both xij and xi+,−j. is a special case of L1 when . is the likelihood in the i.i.d. situation when ρxx = ρyy = 0 [1], but not in the clustered data situation.
Another approach is to consider a subject-specific type of model [18]. Under this approach, we consider the random intercept model
(24) |
where
Under this model, the estimator β has a subject-specific interpretation (i.e., are changes in X between 2 replicates for an individual correlated with corresponding changes in Y?). For example, do changes in visual field between the 2 eyes of a subject correspond to changes in visual acuity? Bland and Altman [18] consider a slight variation of (24) where q is a fixed effect instead of a random effect.
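The subject-specific question can be illustrated with the fixed-effect variant attributed above to Bland and Altman [18]. Treating subject as a fixed effect removes between-subject differences, and the resulting within-subject correlation equals the correlation of the subject-centered x and y values. The Python sketch below computes that quantity; it does not fit the random intercept model in (24), and the function name and data are hypothetical.

```python
import math


def within_subject_correlation(clusters):
    """Correlation of x and y after subtracting each subject's own means.

    clusters: one list per subject, holding that subject's (x, y) replicates.
    """
    dx, dy = [], []
    for reps in clusters:
        k = len(reps)
        x_bar = sum(x for x, _ in reps) / k
        y_bar = sum(y for _, y in reps) / k
        dx.extend(x - x_bar for x, _ in reps)
        dy.extend(y - y_bar for _, y in reps)
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: within each subject y rises with x, although the subject
# with larger x values has much smaller y values overall.
clusters = [
    [(1.0, 10.0), (2.0, 11.0)],
    [(5.0, 0.0), (7.0, 2.0)],
]
r_within = within_subject_correlation(clusters)
```

In this toy example the within-subject correlation is 1 even though the pooled correlation that ignores subject is negative, which shows how the subject-specific and population-average questions can have different answers.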
Finally, we have extended our results to the estimation of Spearman rank correlation for clustered data. To our knowledge, this is the first time this has appeared in the literature. We also extend the Pearson and rank correlation results to the setting of partial correlation, where one can control for both cluster-specific and subunit-specific covariates. For rank correlation this is different from the partial Spearman correlation available in some statistical packages where one calculates covariate-adjusted residuals for both Y and X based on the raw data and then computes the Spearman correlation based on the residuals. With our approach, probit scores are first computed for X and Y variables and residuals are computed by regressing X and Y probit scores respectively, on covariates. Rank correlations are then obtained based on Spearman correlations between the probit residuals. This should provide a more nonparametric estimate of partial Spearman correlation.
The relationship between Pearson and Spearman correlation given in (13), assumes that HZ = (HX, HY) is bivariate normal. If X and/or Y are categorical, such as for salicylates in the second example, then this assumption will be violated. However, if we assume that there is a latent continuous scale but the continuous scale is divided into categories and the actual rank of an observation on the latent continuous scale is replaced by the average rank within a category, then this assumption may be more plausible. Also, in the simulation studies where we replaced the actual rank on the continuous scale by the average rank within a category (see Appendix Table 1), the type I error and coverage were preserved.
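The average-rank device for categorical data can be sketched in Python as follows. Every observation in a category receives the mean of the ranks that the category occupies, and the probit score is then computed as Φ⁻¹[average rank/(n+1)], the form used in Theorem 3 of the Appendix. The function name is hypothetical and the salicylate values are illustrative, not data from the study.

```python
from collections import Counter
from statistics import NormalDist


def midrank_probit_scores(values):
    """Probit scores based on the average rank within each tied category."""
    n = len(values)
    counts = Counter(values)
    avg_rank = {}
    below = 0  # number of observations in strictly lower categories
    for v in sorted(counts):
        c = counts[v]
        # Ranks below+1, ..., below+c are shared; their mean is below + (c+1)/2.
        avg_rank[v] = below + (c + 1) / 2.0
        below += c
    std_normal = NormalDist()
    return [std_normal.inv_cdf(avg_rank[v] / (n + 1)) for v in values]


# Illustrative salicylate category values (mg%) for seven urine samples.
salicylates = [10, 10, 10, 35, 75, 75, 75]
scores = midrank_probit_scores(salicylates)
```

With only a few categories the probit scores take only a few distinct values, so their variance falls below 1; this is the sense in which exact bivariate normality of the probit scores cannot hold for categorical data.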
We also assessed the validity of the estimation procedure in (13) and (14) when HX and HY are each univariate normal, but (HX, HY) is not bivariate normal. The clustered KS copula in (21) was introduced for this purpose. Simulation studies indicate that bias and coverage probability associated with the estimators of ρxy,s and the rank cross-correlation in (13) and (14) are acceptable in this setting for n ≥ 50 (see Appendix Table 2). We also proposed a diagnostic procedure to test the assumption of bivariate normality of replicate probit scores, which was applied and satisfied for both of the examples in this paper.
Acknowledgments
This work was supported by the National Eye Institute, R01 EY022445. The authors would also like to acknowledge the programming support of Marion McPhee.
Appendix – Proofs
Theorem 1
Let X be a continuous random variable and let {x1,…,xn} be an i.i.d. sample from X. Define , where Φ−1 is the inverse normal distribution, , and U(a) = 1 if a > 0, = 0 otherwise. If Hx = Φ−1[FX(x)] where FX is the c.d.f. of X, then Ĥx,n converges in probability to Hx as n → ∞.
Proof
From (A.110), p. 346 of Lehmann [19], if E(Ĥx,n – Hx)² → 0 as n → ∞, then Ĥx,n converges in probability to Hx as n → ∞.
Let Px = Φ(Hx) = FX(x) and
We have that
(A1) |
To assess the 1st component of (A1) we use a Taylor series expansion of Ĥx,n about Hx, yielding
(A2) |
where |ar| < ∞, r ≥ 3. If we take expectations of both sides of (A2), we obtain:
(A3) |
Since E(P̂x,n – Px)^q → 0 as n → ∞ for q ≥ 1, it follows that E(Ĥx,n) – Hx → 0 as n → ∞.
To assess the 2nd component of (A1) we use the delta method, whereby
(A4) |
Thus, based on (A1), (A3) and (A4), E(Ĥx,n – Hx)² → 0 as n → ∞, whereby Ĥx,n converges in probability to Hx as n → ∞.
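The convergence stated in Theorem 1 can be checked numerically with a small simulation. The Python sketch below assumes that the estimated probit score at a point x has the form Φ⁻¹[R(x)/(n+1)], as in Theorem 3, with R(x) taken to be the number of sample values not exceeding x; both that reading and the choice of an Exponential(1) distribution are assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def probit_estimate(sample, x0):
    """Estimated probit score at x0: inverse normal c.d.f. of rank/(n+1)."""
    n = len(sample)
    rank = sum(1 for v in sample if v <= x0)
    rank = max(rank, 1)  # guard added here: the inverse c.d.f. is undefined at 0
    return STD_NORMAL.inv_cdf(rank / (n + 1))


def true_probit(x0):
    """H_x for X ~ Exponential(1), whose c.d.f. is 1 - exp(-x)."""
    return STD_NORMAL.inv_cdf(1.0 - math.exp(-x0))


rng = random.Random(2017)  # fixed seed so that the run is reproducible
x0 = math.log(2.0)  # the median of Exponential(1), so that H_x = 0
errors = {}
for n in (100, 1000, 20000):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    errors[n] = abs(probit_estimate(sample, x0) - true_probit(x0))
```

The absolute error stored in `errors` is expected to shrink roughly in proportion to 1/√n, in line with the delta-method variance used in the proof.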
Corollary 1
Let Ĥx,n be defined as in Theorem 1, HX = {Hx, x ∈ X} and ĤX,n = {Ĥx,n, x ∈ X}. Then ĤX,n converges in law to a N(0,1) distribution as n → ∞.
Proof
We have that
Hence, ĤX,n is asymptotically normal as n → ∞.
Theorem 2
Let ĤX,n and ĤY,n be defined as in Corollary 1. Let HZ = (HX, HY), ρXY = cov(HX, HY) and ĤZ,n = (ĤX,n, ĤY,n). If HZ is bivariate normal, then ĤZ,n is asymptotically bivariate normal as n → ∞.
Proof
From Theorem 2.6.2, p. 37 of Anderson [1], HZ is bivariate normal if and only if aHX + bHY is univariate normal for all a, b, where min(|a|,|b|) > 0. We have from Corollary 1 that aĤX,n + bĤY,n converges in law to aHX + bHY as n → ∞. Thus, since var(HX) = var(HY) = 1 and cov(HX,HY) = ρXY then
Therefore, aĤX,n + bĤY,n is asymptotically univariate normal for all a, b. It follows that ĤZ,n is asymptotically bivariate normal as n → ∞.
Theorem 3
Let X1,…,XL be continuous random variables and define HXl(x) = Φ−1[FXl(x)], l =1,…,L. Let ρl1l2 = cov(HXl1,HXl2), l1 ≠ l2 = 1,…,L: Define ĤXln(x) = Φ−1 [RXln(x)/(n+1)], where and xil = value of Xl for the ith subject, i = 1,…,n; l = 1,…,L.
If HX ≡ (HX1,…, HXL) is multivariate normal, then ĤXn ≡ (ĤX1,n,…, ĤXL,n) will be asymptotically multivariate normal as n → ∞.
Proof
From Theorem 2.6.2, p. 37 of Anderson [1] Hx will be multivariate normal if and only if is univariate normal for all a = (a1,…,aL) where min(|al|, l = 1,…,L) > 0. We have from Corollary 1 that converges in law to Z as n → ∞ Thus, since var(HXl) = 1, l = 1,…,L, and cov(HXl1,HXl2) = ρl1l2, l1 ≠ l2 = 1,…,L, then . Therefore, Ẑn is asymptotically univariate normal for all a where min(|al|, l = 1,…,L) > 0. It follows that ĤXn is asymptotically multivariate normal as n → ∞
Appendix Table 1.
k | ρxy | cross-correlation | ρxx (ρyy) | n | ρ̂xy | Type I error | coverage probability (%) | power |
---|---|---|---|---|---|---|---|---|
3 | 0 | 0 | 0.5 | 50 | 0.028 | 0.055 | 94.7 | --- |
3 | 0 | 0 | 0.5 | 100 | 0.010 | 0.054 | 94.8 | --- |
3 | 0.2 | 0.1 | 0.5 | 50 | 0.194 | --- | 95.9 | 52.4 |
3 | 0.2 | 0.1 | 0.5 | 100 | 0.184 | --- | 94.6 | 76.8 |
Appendix Table 2.
g | n | parameter | true value** | median estimated value+ | bias | coverage probability (%) |
---|---|---|---|---|---|---|
0.5 | 50 | ρxy,s | 0.497 | 0.488 | −0.009 | 96.6 |
0.5 | 50 | rank cross-correlation | 0.328 | 0.327 | −0.001 | 94.4 |
0.5 | 100 | ρxy,s | 0.497 | 0.491 | −0.006 | 96.6 |
0.5 | 100 | rank cross-correlation | 0.328 | 0.325 | −0.003 | 94.1 |
1.0 | 50 | ρxy,s | 0.497 | 0.486 | −0.011 | 96.4 |
1.0 | 50 | rank cross-correlation | 0.246 | 0.246 | 0 | 92.1 |
1.0 | 100 | ρxy,s | 0.497 | 0.491 | −0.006 | 96.7 |
1.0 | 100 | rank cross-correlation | 0.246 | 0.244 | −0.002 | 91.4 |
KS = Kimeldorf-Sampson
** based on simulating 100,000 random vectors (X1, Y1, X2, Y2) from the clustered KS copula in (21) and estimating ρxy,s from mean [Spearman correlation (X1, Y1) + Spearman correlation (X2, Y2)] and the rank cross-correlation from mean [Spearman correlation (X1, Y2) + Spearman correlation (X2, Y1)].
+ based on 4,000 simulated datasets consisting of n (X1, Y1, X2, Y2) vectors from the clustered KS copula in (21).
References
- 1. Anderson TW. An introduction to multivariate statistical analysis. Wiley; New York: 1958.
- 2. Jennrich RI, Schluchter MD. Unbalanced repeated-measures models with structured covariance matrices. Biometrics. 1986;42:805–820.
- 3. Group ETDRS. Early Treatment Diabetic Retinopathy Study (ETDRS) manual of operations. University of Maryland; 1985.
- 4. Pearson K. Mathematical Contributions to the Theory of Evolution. On further methods of determining correlation. Cambridge University Press; Cambridge, UK: 1907.
- 5. Rosner B, Glynn RJ. Interval estimation for rank correlation coefficients based on the probit transformation with extension to measurement error correction of correlated ranked data. Stat Med. 2007;26:633–646. doi: 10.1002/sim.2547.
- 6. Moran PA. Rank correlation and product-moment correlation. Biometrika. 1948;35:203–206.
- 7. Berson EL, Sandberg MA, Rosner B, Birch DG, Hanson AH. Natural course of retinitis pigmentosa over a three-year interval. Am J Ophthalmol. 1985;99:240–251. doi: 10.1016/0002-9394(85)90351-4.
- 8. Rubin DB. Multiple imputation for non-response in surveys. John Wiley and Sons; New York: 1987.
- 9. Kimeldorf G, Sampson AR. Uniform representation of bivariate distributions. Comm Statist. 1975;4:617–627.
- 10. Joe H. Multivariate models and dependence concepts. Chapman & Hall; London; New York: 1997.
- 11. Berson EL, Rosner B, Sandberg MA, Weigel-DiFranco C, Moser A, Brockhurst RJ, Hayes KC, Johnson CA, Anderson EJ, Gaudio AR, Willett WC, Schaefer EJ. Clinical trial of docosahexaenoic acid in patients with retinitis pigmentosa receiving vitamin A treatment. Archives of Ophthalmology. 2004;122:1297–1305. doi: 10.1001/archopht.122.9.1297.
- 12. Gregori NZ, Feuer W, Rosenfeld PJ. Novel method for analyzing Snellen visual acuity measurements. Retina. 2010;30:1046–1050. doi: 10.1097/IAE.0b013e3181d87e04.
- 13. Dubach UC, Levy PS, Muller A. Relationships between regular analgesic intake and urorenal disorders in a working female population of Switzerland. I. Initial results (1968). American Journal of Epidemiology. 1971;93:425–434. doi: 10.1093/oxfordjournals.aje.a121276.
- 14. Proudfoot AT, Krenzelok EP, Brent J, Vale JA. Does urine alkalinization increase salicylate elimination? If so, why? Toxicol Rev. 2003;22:129–136. doi: 10.2165/00139709-200322030-00001.
- 15. Duthie GG, Kyle JA, Jenkinson AM, Duthie SJ, Baxter GJ, Paterson JR. Increased salicylate concentrations in urine of human volunteers after consumption of cranberry juice. J Agric Food Chem. 2005;53:2897–2900. doi: 10.1021/jf040393b.
- 16. Bland JM, Altman DG. Calculating correlation coefficients with repeated observations: Part 2--Correlation between subjects. British Medical Journal. 1995;310:633. doi: 10.1136/bmj.310.6980.633.
- 17. Lorenz DJ, Datta S, Harkema SJ. Marginal association measures for clustered data. Stat Med. 2011;30:3181–3191. doi: 10.1002/sim.4368.
- 18. Bland JM, Altman DG. Calculating correlation coefficients with repeated observations: Part 1--Correlation within subjects. British Medical Journal. 1995;310:446. doi: 10.1136/bmj.310.6977.446.
- 19. Lehmann EL. Nonparametrics: Statistical Methods Based on Ranks. Springer; New York: 2006. (revised edn)