Estimation of Rank Correlation for Clustered Data

Bernard Rosner; Robert Glynn

doi:10.1002/sim.7257

Author manuscript; available in PMC: 2018 Jun 30.

Published in final edited form as: Stat Med. 2017 Apr 11;36(14):2163–2186. doi: 10.1002/sim.7257

PMCID: PMC5457370 NIHMSID: NIHMS851394 PMID: 28399615

Abstract

It is well known that the sample correlation coefficient (R_xy) is the maximum likelihood estimator (MLE) of the Pearson correlation (ρ_xy) for i.i.d. bivariate normal data. However, this is not true for ophthalmologic data where X (e.g., visual acuity) and Y (e.g., visual field) are available for each eye and there is positive intraclass correlation for both X and Y in fellow eyes. In this paper, we provide a regression-based approach for obtaining the MLE of ρ_xy for clustered data, which can be implemented using standard mixed effects model software. This method is also extended to allow for estimation of partial correlation by controlling both X and Y for a vector U of other covariates. In addition, these methods can be extended to allow for estimation of rank correlation for clustered data by (a) converting ranks of both X and Y to the probit scale, (b) estimating the Pearson correlation between probit scores for X and Y, and (c) using the relationship between Pearson and rank correlation for bivariate normally distributed data. The validity of the methods in finite-sized samples is supported by simulation studies. Finally, two examples from ophthalmology and analgesic abuse are used to illustrate the methods.

Keywords: clustered data, rank correlation, Pearson correlation, partial correlation

1. Introduction

It is well known that the sample correlation coefficient (R_xy) is the maximum likelihood estimator (MLE) of the Pearson correlation coefficient (ρ_xy) for i.i.d. bivariate normally distributed data.[1] Also, there is an extensive literature on estimation of regression models for clustered normally distributed data.[2] However, estimation of Pearson correlation in the clustered data setting has received less attention. Furthermore, many outcome measures in ophthalmology are either continuous and not normally distributed or are ordinal. For example, the Humphrey visual field is often used to characterize visual field in retinitis pigmentosa (RP) patients. The Humphrey field is continuous, but is not normally distributed (see Figure 1). Similarly, Early Treatment Diabetic Retinopathy Study (ETDRS)[3] visual acuity, a measure of central vision, is also frequently not normally distributed (see Figure 2). Thus, the rank correlation is a natural measure of association between these two measures. However, both measures are correlated in fellow eyes which violates standard assumptions of inference regarding rank correlation (see Figure 3).

**Figure 1** Distribution of Humphrey 30-2 field (decibels) for right (OD) and left (OS) eyes among 220 retinitis pigmentosa (RP) patients

**Figure 2** Distribution of ETDRS visual acuity (number of letters correctly read) for right (OD) and left (OS) eyes among 220 retinitis pigmentosa (RP) patients

**Figure 3a** Cross-classification of Humphrey 30-2 field (decibels) for right (OD) and left (OS) eyes among 220 retinitis pigmentosa (RP) patients

**Figure 3b** Cross-classification of ETDRS visual acuity (number of letters correctly read) for right (OD) and left (OS) eyes among 220 retinitis pigmentosa (RP) patients

The purpose of this paper is to provide methods for obtaining point and interval estimates of rank correlation for clustered data. We first develop methods for estimation of Pearson correlation and then extend these methods to estimation of rank correlation. Methods are first presented for the case of two subunits per cluster and then extended to allow for more than two subunits per cluster. Extensions for unbalanced data (i.e., a variable number of subunits per cluster) are also provided.

2. Estimation of Pearson correlation for clustered data

(a) Clusters of size 2

We assume throughout this paper that subunits within a cluster are distinguishable (e.g., right eye (OD) and left eye (OS) scores). We also assume that the Pearson correlation between X and Y scores is the same for right and left eyes. We will begin with a discussion of maximum likelihood estimation of Pearson correlation for clusters of size 2 and will extend our inference to clusters of arbitrary size. Let X_ij = X score for the j^th subunit in the i^th cluster; i = 1, …, n; j = 1,2 and Y_ij defined similarly for the Y score. Our goal is to estimate corr(X_ij,Y_ij). For simplicity, we will use the notation (x_ij,y_ij) to denote both the random variable (X_ij,Y_ij) as well as its sample realization. Suppose we have the data

${\underline{z}}_{i} = (\begin{matrix} {\underline{x}}_{i} \\ {\underline{y}}_{i} \end{matrix})$ , where x_i = (x_i₁,x_i₂)′, y_i = (y_i₁,y_i₂)′; i = 1, …, n.

Thus, x_i and y_i are 2 × 1 vectors and z_i is a 4 × 1 vector.

We assume that z_i is multivariate normal with mean μ = (μ_x,μ_x,μ_y,μ_y)′ and covariance matrix Σ given by

\Sigma = (\begin{matrix} σ_{x}^{2} & ρ_{x x} σ_{x}^{2} & ρ_{x y} σ_{x} σ_{y} & ρ_{x y}^{*} σ_{x} σ_{y} \\ ρ_{x x} σ_{x}^{2} & σ_{x}^{2} & ρ_{x y}^{*} σ_{x} σ_{y} & ρ_{x y} σ_{x} σ_{y} \\ ρ_{x y} σ_{x} σ_{y} & ρ_{x y}^{*} σ_{x} σ_{y} & σ_{y}^{2} & ρ_{y y} σ_{y}^{2} \\ ρ_{x y}^{*} σ_{x} σ_{y} & ρ_{x y} σ_{x} σ_{y} & ρ_{y y} σ_{y}^{2} & σ_{y}^{2} \end{matrix}) \equiv (\begin{matrix} \Sigma_{x} & \Sigma_{x y} \\ \Sigma_{y x} & \Sigma_{y} \end{matrix})

Thus, ρ_xy = corr(x_ij,y_ij) = inter-class Pearson correlation between visual acuity (VA) (x_ij) and visual field (VF) (y_ij) for the same eye, i = 1, …, n; j = 1,2.

$ρ_{x y}^{*} = corr (x_{i j_{1}}, y_{i j_{2}})$ ≡ cross-correlation between X and Y = inter-class Pearson correlation between VA for the right eye (OD) and VF for the left eye (OS) and conversely, i = 1, …, n; j₁ ≠ j₂ = 1,2.

ρ_xx = corr(x_{ij₁},x_{ij₂}) = intraclass correlation between VA for the left and right eyes, i = 1, …, n; j₁ ≠ j₂ = 1,2.

ρ_yy = corr(y_{ij₁},y_{ij₂}) = intraclass correlation between VF for the left and right eyes, i = 1, …, n; j₁ ≠ j₂ = 1,2.

The likelihood for the i^th subject is given by

L_{i} = \frac{1}{{(2 π)}^{2} {∣ \Sigma ∣}^{1 / 2}} \exp [- \frac{1}{2} {({\underline{z}}_{i} - \underline{μ})}^{'} \Sigma^{- 1} ({\underline{z}}_{i} - \underline{μ})]

and the overall likelihood is given by:

L = \prod_{i = 1}^{n} L_{i}

The goal is to obtain maximum likelihood estimates (MLE) of ρ_xy. Note that if $ρ_{x x} = ρ_{y y} = ρ_{x y}^{*} = 0$ , then the MLE of ρ_xy is the ordinary Pearson product-moment correlation [1] given by

{\hat{ρ}}_{x y, Pearson} = \sum_{i = 1}^{n} \sum_{j = 1}^{2} (x_{i j} - \bar{\bar{x}}) (y_{i j} - \bar{\bar{y}}) / {[\sum_{i = 1}^{n} \sum_{j = 1}^{2} {(x_{i j} - \bar{\bar{x}})}^{2} \sum_{i = 1}^{n} \sum_{j = 1}^{2} {(y_{i j} - \bar{\bar{y}})}^{2}]}^{1 / 2}

where

\bar{\bar{x}} = \sum_{i = 1}^{n} \sum_{j = 1}^{2} x_{i j} / (2 n)

and

\bar{\bar{y}} = \sum_{i = 1}^{n} \sum_{j = 1}^{2} y_{i j} / (2 n)
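As a point of reference, the pooled product-moment estimator above treats all 2n eyes as if they were independent observations. The following is a minimal Python sketch of that calculation; the helper name `pooled_pearson` and the data values are hypothetical and serve only as an illustration.

```python
import math


def pooled_pearson(x_pairs, y_pairs):
    """Product-moment correlation over all 2n (x_ij, y_ij) pairs, ignoring clustering."""
    xs = [x for pair in x_pairs for x in pair]
    ys = [y for pair in y_pairs for y in pair]
    x_bar = sum(xs) / len(xs)  # grand mean over subjects i and eyes j
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: three subjects, each with (OD, OS) values of X and of Y.
x_pairs = [(1.0, 2.0), (3.0, 4.0), (5.0, 7.0)]
y_pairs = [(2.0, 1.0), (4.0, 3.0), (6.0, 8.0)]
r = pooled_pearson(x_pairs, y_pairs)
```

This estimator is the MLE only under the independence conditions stated above; with correlated fellow eyes it serves as a baseline for comparison rather than as the recommended estimator.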

However, it is difficult to obtain closed form maximum likelihood estimates of ρ_xy using standard software if z_i₁ = (x_i₁,y_i₁) and z_i₂ = (x_i₂,y_i₂) are not independent. Instead, we consider an indirect approach. Specifically, if we factor L_i as follows:

L_{i} = L_{1 i} ({\underline{y}}_{i} ∣ {\underline{x}}_{i}) L_{2 i} ({\underline{x}}_{i})

then the overall likelihood is given by

L \equiv L_{1} L_{2} = \prod_{i = 1}^{n} L_{1 i} ({\underline{y}}_{i} ∣ {\underline{x}}_{i}) \prod_{i = 1}^{n} L_{2 i} ({\underline{x}}_{i})

Note that all the information concerning ρ_xy is contained in L₁. Hence, it will suffice to maximize L₁ rather than L. We can write L_{1i} as a conditional regression likelihood given by

L_{1 i} = \frac{{∣ \Sigma^{*} ∣}^{- 1 / 2}}{2 π} \exp [- \frac{1}{2} {({\underline{y}}_{i} - μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}})}^{'} {(\Sigma^{*})}^{- 1} ({\underline{y}}_{i} - μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}})]

where

\begin{matrix} μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}} = α 1 + β_{1} {\underline{x}}_{i} + β_{2} {\underline{v}}_{i} \\ {\underline{x}}_{i} = {(x_{i 1}, x_{i 2})}^{'}, {\underline{v}}_{i} = {(x_{i 2}, x_{i 1})}^{'}, 1 = {(\begin{matrix} 1 & 1 \end{matrix})}^{'} \\ \Sigma^{*} \equiv \Sigma_{\underline{y} ∣ \underline{x}} = σ_{y ∣ \underline{x}}^{2} (\begin{matrix} 1 & ρ_{y y ∣ \underline{x}} \\ ρ_{y y ∣ \underline{x}} & 1 \end{matrix}) \end{matrix}

Thus, marginally we have the model

\begin{matrix} y_{i j} ∣ {\underline{x}}_{i} \sim N (α + β_{1} x_{i j} + β_{2} x_{i, 3 - j}, σ_{y ∣ \underline{x}}^{2}), i = 1, \dots, n; j = 1, 2 \\ corr (y_{i 1}, y_{i 2} ∣ {\underline{x}}_{i}) = ρ_{y y ∣ \underline{x}} \end{matrix}

(1)

However, based on the properties of the multivariate normal distribution [1], the conditional distribution of y_i given x_i is multivariate normal (MVN) with mean and variance given by:

{\underline{y}}_{i} ∣ {\underline{x}}_{i} \sim MVN (μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}}, \Sigma_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}})

where

\begin{matrix} μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}} = (\begin{matrix} μ_{y} \\ μ_{y} \end{matrix}) + \Sigma_{y x} \Sigma_{x}^{- 1} (\begin{matrix} x_{i 1} - μ_{x} \\ x_{i 2} - μ_{x} \end{matrix}) \\ \Sigma_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}} = \Sigma_{y} - \Sigma_{y x} \Sigma_{x}^{- 1} \Sigma_{x y} \end{matrix}

In this case, it follows straightforwardly from (1) that

\begin{matrix} E (y_{i j} ∣ {\underline{x}}_{i}) \equiv μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}, j} \\ = μ_{y} + (σ_{y} / σ_{x}) (\frac{ρ_{x y} - ρ_{x y}^{*} ρ_{x x}}{1 - ρ_{x x}^{2}}) (x_{i j} - μ_{x}) + (σ_{y} / σ_{x}) (\frac{ρ_{x y}^{*} - ρ_{x y} ρ_{x x}}{1 - ρ_{x x}^{2}}) (x_{i, 3 - j} - μ_{x}), j = 1, 2 \\ V_{j j} \equiv \Sigma_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}, j j} \equiv Var (y_{i j} ∣ {\underline{x}}_{i}) = σ_{y}^{2} [1 - ρ_{x y}^{2} - \frac{{(ρ_{x y}^{*} - ρ_{x y} ρ_{x x})}^{2}}{1 - ρ_{x x}^{2}}], j = 1, 2 \\ V_{12} \equiv \Sigma_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}, 12} \equiv Cov (y_{i 1}, y_{i 2} ∣ {\underline{x}}_{i}) = σ_{y}^{2} [ρ_{y y} - \frac{(2 ρ_{x y} ρ_{x y}^{*} - ρ_{x y}^{* 2} ρ_{x x} - ρ_{x y}^{2} ρ_{x x})}{1 - ρ_{x x}^{2}}] = \Sigma_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}, 21} \end{matrix}

(2)

Therefore, if we equate E(y_ij|x_i) in equations 1 and 2, we obtain:

β_{1} = \frac{(ρ_{x y} - ρ_{x y}^{*} ρ_{x x}) (σ_{y} / σ_{x})}{1 - ρ_{x x}^{2}}

(3)

β_{2} = \frac{(ρ_{x y}^{*} - ρ_{x y} ρ_{x x}) (σ_{y} / σ_{x})}{1 - ρ_{x x}^{2}}

(4)

Upon combining equations 3 and 4, and solving for ρ_xy and $ρ_{x y}^{*}$ , we obtain:

\begin{array}{l} ρ_{x y} = \frac{σ_{x}}{σ_{y}} (β_{1} + ρ_{x x} β_{2}) \\ ρ_{x y}^{*} = \frac{σ_{x}}{σ_{y}} (β_{2} + ρ_{x x} β_{1}) \end{array}

(5)
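Equation 5 is the algebraic inverse of equations 3 and 4, which can be checked numerically. The sketch below is illustrative only: the function names and parameter values are hypothetical, and the regression coefficients are generated from assumed correlations rather than estimated from data.

```python
def betas_from_correlations(rho_xy, rho_xy_star, rho_xx, sigma_x, sigma_y):
    """Equations 3 and 4: regression slopes implied by the correlations (k = 2)."""
    scale = (sigma_y / sigma_x) / (1.0 - rho_xx ** 2)
    beta1 = (rho_xy - rho_xy_star * rho_xx) * scale
    beta2 = (rho_xy_star - rho_xy * rho_xx) * scale
    return beta1, beta2


def correlations_from_betas(beta1, beta2, rho_xx, sigma_x, sigma_y):
    """Equation 5: recover rho_xy and rho_xy* from the slopes."""
    ratio = sigma_x / sigma_y
    rho_xy = ratio * (beta1 + rho_xx * beta2)
    rho_xy_star = ratio * (beta2 + rho_xx * beta1)
    return rho_xy, rho_xy_star


# Round trip with assumed (not estimated) parameter values.
b1, b2 = betas_from_correlations(0.5, 0.4, 0.8, sigma_x=2.0, sigma_y=3.0)
rho_xy, rho_xy_star = correlations_from_betas(b1, b2, 0.8, sigma_x=2.0, sigma_y=3.0)
```

In practice β₁, β₂, σ_x, σ_y and ρ_xx would be replaced by estimates from mixed effects software, as described in the text.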

We can obtain MLE’s of σ_x and ρ_xx based on x from L₂ denoted by σ̂_x and ρ̂_xx upon using standard mixed effects software (e.g., PROC MIXED of SAS) with a compound symmetry correlation structure. Similarly, we can obtain MLE’s of σ_y denoted by σ̂_y from the joint likelihood of y_i = (y_i₁,y_i₂) given by $L_{3} = \prod_{i = 1}^{n} L_{y} ({\underline{y}}_{i})$ upon using PROC MIXED based on y with a compound symmetry correlation structure.

Thus, we can use equation 5 to obtain MLE’s of ρ_xy and $ρ_{x y}^{*}$ given by:

\begin{array}{l} {\hat{ρ}}_{x y, MLE} = \frac{{\hat{σ}}_{x}}{{\hat{σ}}_{y}} ({\hat{β}}_{1} + {\hat{ρ}}_{x x} {\hat{β}}_{2}) \\ {\hat{ρ}}_{x y, MLE}^{*} = \frac{{\hat{σ}}_{x}}{{\hat{σ}}_{y}} ({\hat{β}}_{2} + {\hat{ρ}}_{x x} {\hat{β}}_{1}) \end{array}

(6)

However, an equivalent approach is to express equation 1 directly in terms of ρ_xy and $ρ_{x y}^{*}$ . Specifically, if we substitute equations 3 and 4 into equation 1 and collect terms, we obtain

\begin{matrix} y_{i j} ∣ {\underline{x}}_{i} \sim N (α + ρ_{x y} s_{i j, 2} + ρ_{x y}^{*} t_{i j, 2}, σ_{y ∣ \underline{x}}^{2}), i = 1, \dots, n; j = 1, 2 \\ corr (y_{i 1}, y_{i 2} ∣ {\underline{x}}_{i}) = ρ_{y y ∣ \underline{x}} \end{matrix}

(7)

where

\begin{array}{l} s_{i j, 2} = \frac{σ_{y}}{σ_{x}} (\frac{1}{1 - ρ_{x x}^{2}}) (x_{i j} - ρ_{x x} x_{i, 3 - j}) \\ t_{i j, 2} = \frac{σ_{y}}{σ_{x}} (\frac{1}{1 - ρ_{x x}^{2}}) (x_{i, 3 - j} - ρ_{x x} x_{i j}) \end{array}

Thus, MLE’s of ρ_xy and $ρ_{x y}^{*}$ can be obtained from equation 7 using PROC MIXED with a compound symmetry correlation structure, where σ_y,σ_x and ρ_xx are estimated using PROC MIXED in a similar manner as in equation 6.

(b) Clusters of size > 2

Suppose we have data ${\underline{z}}_{i} = (\begin{matrix} {\underline{x}}_{i} \\ {\underline{y}}_{i} \end{matrix})$ , where

x_i = (x_i₁, …, x_ik)′, y_i = (y_i₁, …, y_ik)′ are (k×1) vectors and z_i is a 2k×1 vector, i = 1, …, n.

The overall likelihood L is given by:

\begin{matrix} L = \prod_{i = 1}^{n} L_{i}, where \\ L_{i} = L ({\underline{z}}_{i}) = L_{1 i} ({\underline{y}}_{i} ∣ {\underline{x}}_{i}) L_{2 i} ({\underline{x}}_{i}) \end{matrix}

Since $L_{1} \equiv L_{1} (ρ_{x y}, ρ_{x y}^{*}, ρ_{y y}, ρ_{x x}, σ_{y}^{2}, σ_{x}^{2})$ and $L_{2} \equiv L_{2} (ρ_{x x}, σ_{x}^{2})$ , it follows that all the information concerning ρ_xy is contained in L₁. Similar to the case of k = 2, we can write L_{1i} as a conditional regression likelihood given by:

L_{1 i} = \frac{{| \Sigma_{\underline{y} ∣ \underline{x}} |}^{\frac{- 1}{2}}}{{(2 π)}^{\frac{k}{2}}} \exp [- \frac{1}{2} {({\underline{y}}_{i} - μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}})}^{'} \Sigma_{\underline{y} ∣ \underline{x}}^{- 1} ({\underline{y}}_{i} - μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}})]

where

\begin{matrix} μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}} = α 1_{k} + β_{1} {\underline{x}}_{i} + β_{2} {\underline{v}}_{i} \\ 1_{k} = {(1, \dots, 1)}^{'} is a k \times 1 vector of ones; {\underline{v}}_{i} = {(v_{i 1}, \dots, v_{i k})}^{'} \end{matrix}

and

\begin{matrix} v_{i j} = \sum_{\begin{array}{l} q = 1 \\ q \neq j \end{array}}^{k} x_{i q} \equiv x_{i +, - j}, j = 1, \dots, k, i = 1, \dots, n \\ \Sigma_{\underline{y} ∣ \underline{x}} = σ_{y ∣ \underline{x}}^{2} [(1 - ρ_{y y ∣ \underline{x}}) I_{k \times k} + ρ_{y y ∣ \underline{x}} J_{k \times k}] \end{matrix}

where I_{k×k} is the k×k identity matrix and J_{k×k} is a k×k matrix of ones.

Thus, marginally, we have the regression model

\begin{matrix} y_{i j} ∣ {\underline{x}}_{i} \sim N (α + β_{1} x_{i j} + β_{2} x_{i +, - j}, σ_{y ∣ \underline{x}}^{2}) \\ corr (y_{i j_{1}}, y_{i j_{2}} ∣ {\underline{x}}_{i}) = ρ_{y y ∣ \underline{x}}, j_{1} \neq j_{2} = 1, \dots, k \end{matrix}

(8)

Similar to equation 2, based on the properties of the multivariate normal distribution [1], the conditional distribution of y_i given x_i is multivariate normal with mean and variance given by

\begin{array}{l} {\underline{y}}_{i} ∣ {\underline{x}}_{i} \sim MVN (μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}}, \Sigma_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}}), where \\ μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}} = μ_{y} 1_{k} + \Sigma_{y x} \Sigma_{x}^{- 1} ({\underline{x}}_{i} - μ_{x} 1_{k}) \\ \Sigma_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}} = \Sigma_{y} - \Sigma_{y x} \Sigma_{x}^{- 1} \Sigma_{x y} \end{array}

(9)

It follows straightforwardly from (9) that

\begin{array}{c} μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}} = μ_{y} 1_{k} + \frac{σ_{y}}{[σ_{x} (1 - ρ_{x x})] [1 + (k - 1) ρ_{x x}]} \left\{ \begin{matrix} \{ ρ_{x y} [1 + (k - 2) ρ_{x x}] - (k - 1) ρ_{x y}^{*} ρ_{x x} \} ({\underline{x}}_{i} - μ_{x} 1_{k}) \\ + (ρ_{x y}^{*} - ρ_{x y} ρ_{x x}) [{\underline{v}}_{i} - (k - 1) μ_{x} 1_{k}] \end{matrix} \right\} \\ V_{j j} = \Sigma_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}, j j} = Var (y_{i j} ∣ {\underline{x}}_{i}) = σ_{y}^{2} \left\{ 1 - ρ_{x y}^{2} - \frac{(k - 1) {(ρ_{x y}^{*} - ρ_{x y} ρ_{x x})}^{2}}{(1 - ρ_{x x}) [1 + (k - 1) ρ_{x x}]} \right\}, j = 1, \dots, k \\ V_{j_{1}, j_{2}} = \Sigma_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}, j_{1} j_{2}} = Cov (y_{i j_{1}}, y_{i j_{2}} ∣ {\underline{x}}_{i}) = σ_{y}^{2} \left\{ ρ_{y y} - \frac{[2 ρ_{x y} ρ_{x y}^{*} - (k - 1) ρ_{x y}^{* 2} ρ_{x x} - ρ_{x y}^{2} ρ_{x x} + (k - 2) ρ_{x y}^{* 2}]}{(1 - ρ_{x x}) [1 + (k - 1) ρ_{x x}]} \right\}, j_{1} \neq j_{2} = 1, \dots, k \end{array}

Note that if k = 2, equation 9 reduces to equation 2.
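The closed-form conditional variance and covariance can be checked against a direct numerical evaluation of Σ_y − Σ_yx Σ_x⁻¹ Σ_xy. The sketch below does this in plain Python for k = 2 and k = 3; all function names and the parameter values are hypothetical choices made for illustration.

```python
def cs_matrix(k, diag, off, scale=1.0):
    """k x k compound-symmetry matrix: scale*diag on the diagonal, scale*off elsewhere."""
    return [[scale * (diag if a == b else off) for b in range(k)] for a in range(k)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(A)
    M = [list(row) + [float(r == c) for c in range(n)] for r, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]


def conditional_cov(k, rho_xy, rho_star, rho_xx, rho_yy, sigma_x, sigma_y):
    """Sigma_y - Sigma_yx Sigma_x^{-1} Sigma_xy, computed numerically."""
    sx = cs_matrix(k, 1.0, rho_xx, sigma_x ** 2)
    sy = cs_matrix(k, 1.0, rho_yy, sigma_y ** 2)
    sxy = cs_matrix(k, rho_xy, rho_star, sigma_x * sigma_y)  # symmetric, so Sigma_yx = Sigma_xy
    corr_term = matmul(matmul(sxy, inverse(sx)), sxy)
    return [[sy[a][b] - corr_term[a][b] for b in range(k)] for a in range(k)]


def closed_form(k, rho_xy, rho_star, rho_xx, rho_yy, sigma_y):
    """V_jj and V_{j1 j2} as written in the text."""
    d = (1.0 - rho_xx) * (1.0 + (k - 1) * rho_xx)
    v_diag = sigma_y ** 2 * (1.0 - rho_xy ** 2 - (k - 1) * (rho_star - rho_xy * rho_xx) ** 2 / d)
    num = (2.0 * rho_xy * rho_star - (k - 1) * rho_star ** 2 * rho_xx
           - rho_xy ** 2 * rho_xx + (k - 2) * rho_star ** 2)
    v_off = sigma_y ** 2 * (rho_yy - num / d)
    return v_diag, v_off


checks = {}
for k in (2, 3):
    V = conditional_cov(k, 0.5, 0.4, 0.5, 0.6, sigma_x=2.0, sigma_y=3.0)
    v_diag, v_off = closed_form(k, 0.5, 0.4, 0.5, 0.6, sigma_y=3.0)
    checks[k] = (abs(V[0][0] - v_diag), abs(V[0][1] - v_off))
```

The discrepancies stored in `checks` should be at the level of floating-point rounding for both cluster sizes.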

Thus, from equations (8) and (9) we obtain

β_{1} = (σ_{y} / σ_{x}) {\frac{ρ_{x y} [1 + (k - 2) ρ_{x x}] - (k - 1) ρ_{x y}^{*} ρ_{x x}}{[1 + (k - 1) ρ_{x x}] (1 - ρ_{x x})}}

(10)

β_{2} = (σ_{y} / σ_{x}) {\frac{ρ_{x y}^{*} - ρ_{x y} ρ_{x x}}{[1 + (k - 1) ρ_{x x}] (1 - ρ_{x x})}}

Upon combining equations 8 and 10 and factoring out ρ_xy and $ρ_{x y}^{*}$ , we obtain

y_{i j} ∣ {\underline{x}}_{i} \sim N (α + ρ_{x y} s_{i j, k} + ρ_{x y}^{*} t_{i j, k}, σ_{y ∣ \underline{x}}^{2}), i = 1, \dots, n; j = 1, \dots, k

(11)

where

\begin{matrix} s_{i j, k} = (\frac{σ_{y}}{σ_{x}}) \frac{{x_{i j} [1 + (k - 2) ρ_{x x}] - x_{i +, - j} ρ_{x x}}}{[1 + (k - 1) ρ_{x x}] (1 - ρ_{x x})} \\ t_{i j, k} = (\frac{σ_{y}}{σ_{x}}) \frac{[x_{i +, - j} - (k - 1) x_{i j} ρ_{x x}]}{[1 + (k - 1) ρ_{x x}] (1 - ρ_{x x})} \end{matrix}

and σ_y,σ_x and ρ_xx are estimated in a similar manner as in equation 6.
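To make the construction of the derived regressors concrete, the sketch below computes s_{ij,k} and t_{ij,k} for one cluster, given values of σ_x, σ_y and ρ_xx. The function name `st_regressors` and the numerical inputs are hypothetical; in an analysis the three parameters would be replaced by their estimates.

```python
def st_regressors(x_cluster, rho_xx, sigma_x, sigma_y):
    """Return [(s_ij, t_ij), ...] for the k subunits of one cluster (equation 11)."""
    k = len(x_cluster)
    total = sum(x_cluster)
    denom = (1.0 + (k - 1) * rho_xx) * (1.0 - rho_xx)
    scale = sigma_y / sigma_x
    rows = []
    for x_own in x_cluster:
        x_others = total - x_own  # x_{i+,-j}: sum over the other subunits
        s = scale * (x_own * (1.0 + (k - 2) * rho_xx) - x_others * rho_xx) / denom
        t = scale * (x_others - (k - 1) * x_own * rho_xx) / denom
        rows.append((s, t))
    return rows


# Made-up cluster with k = 3 subunits and assumed parameter values.
rows_k3 = st_regressors([1.0, 2.0, 4.0], rho_xx=0.5, sigma_x=1.0, sigma_y=1.0)
# With k = 2 the same function reproduces s_{ij,2} and t_{ij,2}.
rows_k2 = st_regressors([1.0, 3.0], rho_xx=0.5, sigma_x=2.0, sigma_y=3.0)
```

Regressing y_ij on these two columns with a compound symmetry error structure then yields coefficients that are read directly as the estimates of ρ_xy and its cross-correlation counterpart.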

(c) Interval Estimation

We could obtain interval estimates of ρ_xy by using se(ρ̂_xy) from equation 11 and assuming asymptotic normality of ρ̂_xy. However, asymptotic normality is more quickly achieved for estimated correlation coefficients if one uses Fisher’s z transformation. If we let Z_xy = 0.5 ln[(1+ρ_xy)/(1−ρ_xy)] and Ẑ_xy = 0.5 ln[(1+ρ̂_xy)/(1−ρ̂_xy)], then from the delta method we have

var ({\hat{Z}}_{x y}) ≅ var ({\hat{ρ}}_{x y}) / {(1 - {\hat{ρ}}_{x y}^{2})}^{2} .

Thus, an approximate 100%×(1−α) CI for Z_xy based on Fisher’s z transformation is given by

{\hat{Z}}_{x y} \pm Φ^{- 1} (1 - α / 2) s e ({\hat{ρ}}_{x y}) / (1 - {\hat{ρ}}_{x y}^{2}) \equiv (Z_{x y, 1}, Z_{x y, 2})

where Φ⁻¹ is the inverse of the standard normal cumulative distribution function.

The corresponding 100%×(1−α) CI for ρ_xy is given by (ρ_{xy,1},ρ_{xy,2}) where

ρ_{x y, q} = [exp (2 Z_{x y, q}) - 1] / [exp (2 Z_{x y, q}) + 1], q = 1, 2.

(12)
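The interval in equation 12 can be sketched in a few lines of Python. The function name `fisher_z_ci` and the inputs (ρ̂_xy = 0.5 with standard error 0.1) are hypothetical; the standard error is assumed to come from the mixed effects fit of equation 11.

```python
import math
from statistics import NormalDist


def fisher_z_ci(rho_hat, se_rho, alpha=0.05):
    """100*(1 - alpha)% CI for rho based on Fisher's z and the delta method."""
    z_hat = 0.5 * math.log((1.0 + rho_hat) / (1.0 - rho_hat))
    se_z = se_rho / (1.0 - rho_hat ** 2)  # delta-method standard error of z_hat
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_lo, z_hi = z_hat - crit * se_z, z_hat + crit * se_z

    def back(z):
        # Inverse of Fisher's z transformation.
        return (math.exp(2.0 * z) - 1.0) / (math.exp(2.0 * z) + 1.0)

    return back(z_lo), back(z_hi)


lo, hi = fisher_z_ci(0.5, 0.1)
```

Because the limits are formed on the z scale and then back-transformed, the interval stays within (−1, 1) and is not symmetric about ρ̂_xy.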

3. Estimation of Rank Correlation for Clustered Data

We also consider the estimation of rank correlation for clustered data. For this purpose, we define F_{X_j}(x) ≡ Pr(X_j ≤ x) = c.d.f. of X_j, and similarly for F_{Y_j}(y), j=1, …, k, where X_j and Y_j are continuous random variables. Pearson referred to F_{X_j}(x) and F_{Y_j}(y) as grades of X_j and Y_j within a reference population[4].

Let H_{X_j}(x) = Φ⁻¹[F_{X_j}(x)], H_{Y_j}(y) = Φ⁻¹[F_{Y_j}(y)]. By definition, if X_j and Y_j are continuous random variables, then H_{X_j} and H_{Y_j} are univariate normal. We also assume that H_{Z_j} ≡ (H_{X_j},H_{Y_j}) is bivariate normal. Finally, we define P_{i,X_j} = F_{X_j}(x_ij) and P_{i,Y_j} = F_{Y_j}(y_ij); i = 1, …, n; j = 1, …, k.

We wish to estimate ρ_{s,j} = corr(P_{i,X_j},P_{i,Y_j}) which we define as the underlying rank correlation between X and Y for replicate j. It is customary to estimate ρ_{s,j} by the Spearman rank correlation coefficient ρ̂_{s,j} = corr[rank_j(X_ij),rank_j(Y_ij)] where the ranks are sample ranks of X_ij and Y_ij, j = 1, …, k within a sample of size n. However, we have previously shown that an alternative estimator of ρ_{s,j} exists which is both more efficient than ρ̂_{s,j} and allows one to directly estimate confidence limits for ρ_{s,j} [5]. Specifically, let H_{i,X_j} = Φ⁻¹(P_{i,X_j}) ≡ probit(P_{i,X_j}), H_{i,Y_j} = Φ⁻¹(P_{i,Y_j}) ≡ probit(P_{i,Y_j}) and H_{i,Z_j} = (H_{i,X_j},H_{i,Y_j}).

We note that the probit transformation is rank-preserving. Therefore, ρ_{s,j} is the same in both the original and probit scale. Let us define Ĥ_{i,X_j,n} = Φ⁻¹[rank_j(X_ij)/(n+1)], Ĥ_{i,Y_j,n} = Φ⁻¹[rank_j(Y_ij)/(n+1)] and Ĥ_{i,Z_j,n} = (Ĥ_{i,X_j,n}, Ĥ_{i,Y_j,n}). In Theorem 2 in the Appendix, we show that if H_{i,Z_j} is bivariate normal, then Ĥ_{i,Z_j,n} is asymptotically bivariate normal as n→∞. Furthermore, Moran has shown a direct connection between Pearson and rank correlation for bivariate normally distributed scales [6] whereby

ρ_{s, j} = (6 / π) {sin}^{- 1} (ρ_{H, j} / 2)

(13)

where ρ_{H,j} = corr(H_{i,X_j},H_{i,Y_j}). We can estimate ρ_{H,j} by ρ̂_{H,j} = corr(Ĥ_{i,X_j,n},Ĥ_{i,Y_j,n}).
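For a single replicate j, the probit-score approach can be sketched as follows: convert sample ranks to probit scores, compute their Pearson correlation, and apply equation 13. The function names and data are hypothetical, and the use of average ranks for tied values is an assumption made here for robustness; the text assumes continuous variables, for which ties do not occur.

```python
import math
from statistics import NormalDist


def probit_scores(values):
    """Phi^{-1}[rank / (n + 1)] from sample ranks (average ranks for ties, an assumption)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0  # average of ranks pos+1 .. end+1
        for q in range(pos, end + 1):
            ranks[order[q]] = avg_rank
        pos = end + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]


def pearson(a, b):
    a_bar, b_bar = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - a_bar) * (v - b_bar) for u, v in zip(a, b))
    saa = sum((u - a_bar) ** 2 for u in a)
    sbb = sum((v - b_bar) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def spearman_from_probit_pearson(rho_h):
    """Equation 13: rank correlation implied by the probit-scale Pearson correlation."""
    return (6.0 / math.pi) * math.asin(rho_h / 2.0)


# Made-up values of X and Y for one replicate across n = 6 clusters.
x_j = [12.0, 30.5, 7.1, 22.4, 18.0, 25.3]
y_j = [40.0, 55.0, 38.0, 61.0, 47.0, 50.0]
rho_h_hat = pearson(probit_scores(x_j), probit_scores(y_j))
rho_s_hat = spearman_from_probit_pearson(rho_h_hat)
```

This sketch ignores clustering; it only illustrates the transformation steps applied within one replicate.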

To incorporate clustering, we assume that ρ_{s,j} = ρ_s for all j = 1, …, k and use the regression model in equation 11 based on probit scores rather than raw scores to obtain an estimate of ρ_{H,j} = corr(H_{i,X_j},H_{i,Y_j}) which we assume is the same for all j and denote by ρ̂_H. Equation 11 assumes multivariate normality of z_i = (x_i,y_i). In Theorem 3 in the Appendix we show that if (H₁, …, H_L) is multivariate normal, then (Ĥ_{1,n}, …, Ĥ_{L,n}) is asymptotically multivariate normal as n → ∞. The corresponding estimator of Spearman rank correlation which is denoted by ρ̂_{s,a} is given by

{\hat{ρ}}_{s, a} = (6 / π) {sin}^{- 1} ({\hat{ρ}}_{H} / 2)

Furthermore, a 100%×(1−α) CI for ρ_s is given by

[(6 / π) {sin}^{- 1} (ρ_{H, 1} / 2), (6 / π) {sin}^{- 1} (ρ_{H, 2} / 2)]

(14)

where ρ_{H,1} and ρ_{H,2} are obtained from equation 12 based on probit scores rather than raw scores.

Similarly, from equation 11 we can also obtain an estimate of the Spearman cross-correlation ($ρ_{s}^{*}$) given by

{\hat{ρ}}_{s, a}^{*} = (6 / π) {sin}^{- 1} ({\hat{ρ}}_{H}^{*} / 2)

where

{\hat{ρ}}_{H}^{*} = corr ({\hat{H}}_{i, X_{j_{1}}}, {\hat{H}}_{i, Y_{j_{2}}}), j_{1} \neq j_{2} = 1, \dots, k,

and obtain a 100% × (1−α) CI for $ρ_{s}^{*}$ based on equations 12 and 13 given by

[(6 / π) {sin}^{- 1} (ρ_{H, 1}^{*} / 2), (6 / π) {sin}^{- 1} (ρ_{H, 2}^{*} / 2)]

4. Estimation of partial correlation for clustered data

It is likely that there is positive correlation between visual field area (VF) and visual acuity (VA). However, it is possible that this is partially due to the effect of age since older RP patients tend to have lower field and acuity measurements than younger RP patients. Also, in some datasets gender differences are found for visual function measures in RP patients [7]. Furthermore, there are different genetic types of RP and level of visual function is known to differ by genetic type [7]. Hence, it would be desirable to estimate the correlation between VF and VA in individual eyes, while controlling for age, gender and genetic type.

(a) Partial Pearson correlation

We assume that there is a set of L cluster-specific covariates denoted by u_i = (u_i₁, …, u_iL)′ where u_i is an (L × 1) vector and u_il is the value of the l^th cluster-specific covariate for the i^th cluster. Let U_i be a k × L matrix, where U_{i,jl} = u_il, i = 1, …, n; j = 1, …, k; l = 1, …, L. In addition, we assume there is a set of M subunit-specific covariates w_ij = (w_ij₁, …, w_ijM)′ where w_ij is an M × 1 vector and w_ijm is the value of the m^th subunit-specific covariate for the j^th subunit in the i^th cluster. Finally, let W_i be a k × M matrix where W_{i,jm} = w_ijm, i = 1, …, n; j = 1, …, k; m = 1, …, M.

We define the partial correlation between X and Y by ρ_{xy,partial} = corr(x_ij,y_ij|u_i,W_i), i = 1, …, n; j = 1, …, k. Similarly, the partial cross-correlation between X and Y is defined by $ρ_{x y, partial}^{*} = corr (x_{i j_{1}}, y_{i j_{2}} ∣ {\underline{u}}_{i}, W_{i})$ , i = 1, …, n; j₁ ≠ j₂ = 1, …, k.

To estimate these correlations, we factor the likelihood L_i as follows:

L_{i} = L ({\underline{z}}_{i}) = L_{A i} ({\underline{z}}_{i} ∣ {\underline{u}}_{i}, W_{i}) L_{B i} ({\underline{u}}_{i}, W_{i})

Furthermore, we can factor L_Ai as follows:

L_{A i} ({\underline{z}}_{i} ∣ {\underline{u}}_{i}, W_{i}) = L_{A i 1} ({\underline{y}}_{i} | {\underline{x}}_{i}, {\underline{u}}_{i}, W_{i}) L_{A i 2} ({\underline{x}}_{i} ∣ {\underline{u}}_{i}, W_{i})

All the information concerning ρ_{xy,partial} and $ρ_{x y, partial}^{*}$ is contained in L_{Ai1}. Hence, it will suffice to maximize $L_{A 1} = \prod_{i = 1}^{n} L_{A i 1}$ to obtain the MLE’s of ρ_{xy,partial} and $ρ_{x y, partial}^{*}$ . We can write L_{Ai1} as a conditional regression likelihood given by

L_{A i 1} = \frac{{| \Sigma_{\underline{y} ∣ \underline{x}, U, W} |}^{- 1 / 2}}{{(2 π)}^{k / 2}} \exp [- \frac{1}{2} {({\underline{y}}_{i} - μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}, {\underline{u}}_{i}, W_{i}})}^{'} \Sigma_{\underline{y} ∣ \underline{x}, U, W}^{- 1} ({\underline{y}}_{i} - μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}, {\underline{u}}_{i}, W_{i}})]

(15)

where

μ_{{\underline{y}}_{i} ∣ {\underline{x}}_{i}, {\underline{u}}_{i}, W_{i}} = α 1_{k} + β_{1} {\underline{x}}_{i} + β_{2} {\underline{v}}_{i} + U_{i} {\underline{β}}_{3} + W_{i} {\underline{β}}_{4} + W_{i}^{*} {\underline{β}}_{5}

W_{i, j m}^{*} = W_{i +, - j, m} = \sum_{\begin{array}{l} q = 1 \\ q \neq j \end{array}}^{k} w_{iqm}, i = 1, \dots, n; j = 1, \dots, k; m = 1, \dots, M .

and

\Sigma_{\underline{y} ∣ \underline{x}, U, W} = σ_{\underline{y} ∣ \underline{x}, U, W}^{2} [(1 - ρ_{y y ∣ \underline{x}, U, W}) I_{k \times k} + ρ_{y y ∣ \underline{x}, U, W} J_{k \times k}]

β₃ is an L × 1 vector of regression coefficients for cluster-specific covariates and β₄, β₅ are M × 1 vectors of regression coefficients for subunit-specific covariates.
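The matrix $W_{i}^{*}$ contains, for each subunit, the sum of each subunit-specific covariate over the other subunits in the same cluster. A minimal sketch of that construction is given below; the function name `fellow_sums` and the covariate values are hypothetical.

```python
def fellow_sums(rows):
    """rows: the k rows of W_i (one list of M covariates per subunit).

    Returns W_i^*, whose (j, m) entry is the sum of covariate m over subunits q != j.
    """
    col_totals = [sum(col) for col in zip(*rows)]
    return [[tot - w for tot, w in zip(col_totals, row)] for row in rows]


# Made-up cluster with k = 3 subunits and M = 2 subunit-specific covariates.
W_i = [[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]]
W_star = fellow_sums(W_i)
```

For k = 2, row j of the result is simply the covariate vector of the fellow subunit.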

To estimate ρ_{xy,partial} and $ρ_{x y, partial}^{*}$ we proceed as follows:

1. Use mixed effects regression models to regress y_i and x_i respectively on U_i, W_i and $W_{i}^{*}$ and obtain residuals ${\underline{y}}_{i}^{*}$ and ${\underline{x}}_{i}^{*}$ .
2. Obtain estimates of $σ_{y}^{*}$ using mixed effects models based on maximizing $L_{3}^{*} = \prod_{i = 1}^{n} L_{y} ({\underline{y}}_{i}^{*})$ (16), where $σ_{y}^{2 *} = var (y ∣ U, W) \equiv var (y_{i}^{*})$ .
3. Obtain estimates of $σ_{x}^{*}$ and $ρ_{x x}^{*}$ using mixed effects models based on $L_{2}^{*} = \prod_{i = 1}^{n} L_{2 i} ({\underline{x}}_{i}^{*})$ , where $σ_{x}^{2 *} = var (x ∣ \underline{U}, \underline{W}) \equiv var (x_{i}^{*}), ρ_{x x}^{*} = corr (x_{i 1}, x_{i 2} ∣ \underline{U}, \underline{W})$ .
4. Obtain covariate-adjusted estimates $s_{i j, k}^{*}$ and $t_{i j, k}^{*}$ given by substituting $x_{i j}^{*}, σ_{y}^{*}, σ_{x}^{*}$ and $ρ_{x x}^{*}$ for x_ij, σ_y, σ_x and ρ_xx in equation 11.
5. Estimate ρ_{xy,partial} and $ρ_{x y, partial}^{*}$ using mixed effects models given by $y_{i j}^{*} \sim N (α^{*} + ρ_{x y, partial} s_{i j, k}^{*} + ρ_{x y, partial}^{*} t_{i j, k}^{*}, σ_{\underline{y} ∣ \underline{x}}^{2 *})$ with a compound symmetry correlation structure.
6. Obtain interval estimates for ρ_{xy,partial} and $ρ_{x y, partial}^{*}$ using the methods in equation 12.

(b) Partial Spearman rank correlation

Similarly, estimates of ρ_{s,partial} and $ρ_{s, partial}^{*}$ can be obtained by computing sample probits ${\hat{H}}_{i, X_{j}}^{*}$ and ${\hat{H}}_{i, Y_{j}}^{*}$ using ranks based on $x_{i j}^{*}$ and $y_{i j}^{*}$ instead of x_ij and y_ij and using the methods in equation 13.

5. Unbalanced data

In ophthalmology there often are datasets where some subjects have data available for one eye and other subjects have data available for two eyes. Specifically, in the RP dataset, there were 7 patients with VA and VF available in one eye, but incomplete data in the fellow eye. Five of the patients with incomplete data were missing both VA and VF, while two of the patients were missing VF but not VA, in the incomplete eye. The missing data profile can get more complex if one wants to estimate partial correlation and there is missing covariate data.

In general, since replicates are distinguishable, we treat unbalanced data as a type of missing data problem. One possible option is to use all available data, but allow k to vary for different clusters and use mixed effects software methods as provided in equation 11 to estimate ρ_xy. However, as shown in equation 9, the variance-covariance matrix Σ_{y_i|x_i} will also depend on k, which will not be taken into account using standard software.

If unbalanced data are considered missing data, then the best option is to use multiple imputation methods to analyze the data [8]. For clusters of size 2 using SAS software, this would proceed as follows:

1. use PROC MI of SAS to create M imputed data sets, imputing the missing data based on the data vector h_i = (x_i₁,y_i₁,x_i₂,y_i₂,u_i,w_i₁,w_i₂),
2. analyze each imputed data set based on the methods in equation 7,
3. use PROC MIANALYZE to combine results over the M imputations.
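The combination step applies Rubin's rules, which is what PROC MIANALYZE computes for a scalar parameter. A minimal Python sketch of those rules is given below for readers working outside SAS; the function name `rubin_combine` and the per-imputation estimates and standard errors are made up for illustration. Whether to combine on the correlation scale or on Fisher's z scale is a choice the text does not specify, so the sketch is written for a generic scalar estimate.

```python
import math
from statistics import mean, variance


def rubin_combine(estimates, std_errors):
    """Pool M per-imputation estimates and standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                      # pooled point estimate
    within = mean(se ** 2 for se in std_errors)  # average within-imputation variance
    between = variance(estimates)                # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, math.sqrt(total)


# Made-up results from M = 5 imputed data sets.
est = [0.41, 0.44, 0.39, 0.42, 0.43]
ses = [0.060, 0.062, 0.059, 0.061, 0.060]
pooled_est, pooled_se = rubin_combine(est, ses)
```

The pooled standard error exceeds the average within-imputation standard error whenever the estimates differ across imputations, which reflects the extra uncertainty due to the missing data.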

This strategy can also be used where the maximum cluster size = k. We used this strategy in both of our examples each of which had small amounts of missing data and had max cluster size = 2 and 3, respectively.

6. Simulation Studies

We have conducted simulation studies to assess the finite sample properties of the estimators of ρ_xy both for clusters of size 2 as given in Equation 7 and clusters of size > 2 (i.e., 3) as given in Equation 11. We generated z_i based on the multivariate normal distribution in Section 2 (separately for k=2,3). In addition to the MLE methods described in Equations 7 and 11, we also considered other ad hoc, but frequently used methods. Specifically, we looked at the situation where an investigator has data from 2 eyes available, but only uses a single eye in the analyses (e.g., the right eye, or the left eye) to avoid the correlated data problem. The analogous approach was used for k=3, where only 1 of the 3 distinguishable replicates was used in the analysis. We then estimated ρ_xy from the ordinary Pearson product-moment correlation from the n (x,y) pairs and obtained confidence limits based on Fisher’s z statistic. We refer to this approach as the 1 subunit method. In addition, we estimated ρ_xy from the ordinary Pearson product-moment correlation based on kn (x,y) pairs and obtained confidence limits based on Fisher’s z statistic ignoring the correlated data. We refer to this approach as the all subunits method.

We generated multivariate normal data with parameters μ = 0, σ² = 1 and

(a) $ρ_{x y} = ρ_{x y}^{*} = 0$ , ρ_xx = ρ_yy = (0.2,0.5,0.8) to assess type I error.
(b) $(ρ_{x y}, ρ_{x y}^{*}) = (0.2, 0.1)$ , ρ_xx = ρ_yy = (0.2,0.5,0.8) to assess power.
(c) $(ρ_{x y}, ρ_{x y}^{*}) = (0.5, 0.4)$ , ρ_xx = ρ_yy = (0.5,0.8) to assess power.
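One way to generate data of this kind is to build the 2k × 2k covariance matrix of Section 2 and draw clusters through its Cholesky factor. The sketch below is an assumption about how such a simulation could be set up, not a description of the software actually used; the function names, the seed and the choice of design are hypothetical.

```python
import math
import random


def clustered_covariance(k, rho_xy, rho_xy_star, rho_xx, rho_yy, sigma_x=1.0, sigma_y=1.0):
    """2k x 2k covariance matrix of z_i = (x_i1, ..., x_ik, y_i1, ..., y_ik)."""
    def block(diag, off, scale):
        return [[scale * (diag if a == b else off) for b in range(k)] for a in range(k)]

    sxx = block(1.0, rho_xx, sigma_x ** 2)
    syy = block(1.0, rho_yy, sigma_y ** 2)
    sxy = block(rho_xy, rho_xy_star, sigma_x * sigma_y)  # symmetric, so Sigma_yx = Sigma_xy
    top = [sxx[a] + sxy[a] for a in range(k)]
    bottom = [sxy[a] + syy[a] for a in range(k)]
    return top + bottom


def cholesky(A):
    """Lower-triangular L with L L' = A (A must be positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            acc = A[r][c] - sum(L[r][q] * L[c][q] for q in range(c))
            L[r][c] = math.sqrt(acc) if r == c else acc / L[c][c]
    return L


def draw_cluster(L, rng):
    """One mean-zero multivariate normal draw with covariance L L'."""
    e = [rng.gauss(0.0, 1.0) for _ in L]
    return [sum(L[r][c] * e[c] for c in range(r + 1)) for r in range(len(L))]


rng = random.Random(2017)  # arbitrary seed
sigma = clustered_covariance(2, rho_xy=0.5, rho_xy_star=0.4, rho_xx=0.5, rho_yy=0.5)
L = cholesky(sigma)
sample = [draw_cluster(L, rng) for _ in range(50)]  # n = 50 clusters of size k = 2
```

Each element of `sample` is ordered as (x_i1, x_i2, y_i1, y_i2), matching the definition of z_i in Section 2.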

Bias and coverage probability were determined for each method for each of (a) – (c). Separate simulations were performed for k=2 and k=3. Four thousand simulations were performed for each design in (a) – (c) and each k=2, 3. In addition, to assess the adequacy of the model for unbalanced data, we simulated multivariate normal data for k=3 subunits per cluster, but randomly deleted the 3^rd subunit for 20% of the clusters, thus creating a dataset where 80% of the clusters had 3 subunits and 20% had 2 subunits. The same approach was used to create a dataset where 95% of the clusters had 3 subunits and 5% of the clusters had 2 subunits. These unbalanced data simulations were performed for each of

(d) $ρ_{x y} = ρ_{x y}^{*} = 0$ , ρ_xx = ρ_yy = 0.5 to assess type I error.
(e) $(ρ_{x y}, ρ_{x y}^{*}) = (0.2, 0.1)$ , ρ_xx = ρ_yy = 0.5 to assess power.

Finally, for each of the designs in (a) – (e) separate simulations were performed for n=50 and n=100. The results are summarized in Table 1.

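Table 1.

Simulation Results

| ρ_xy | ρ_{xy}^{*} | ρ_xx(ρ_yy) | method | ρ̂_xy (n=50) | Type I error (n=50) | coverage prob. (n=50) | power (n=50) | ρ̂_xy (n=100) | Type I error (n=100) | coverage prob. (n=100) | power (n=100) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Balanced Designs | | | | | | | | | | | |
| 0 | 0 | 0.2 | MLE | 0.001 | 0.052 | 94.9 | --- | 0.001 | 0.047 | 95.3 | --- |
| | | | 1 subunit | −0.001 | 0.054 | 94.6 | --- | 0.002 | 0.052 | 94.8 | --- |
| | | | all subunits | 0.000 | 0.061 | 94.0 | --- | 0.001 | 0.053 | 99.3 | --- |
| | | 0.5 | MLE | 0.001 | 0.056 | 94.4 | --- | 0.001 | 0.050 | 95.2 | --- |
| | | | 1 subunit | −0.001 | 0.054 | 94.6 | --- | 0.002 | 0.052 | 94.8 | --- |
| | | | all subunits | 0.001 | 0.091 | 90.9 | --- | 0.001 | 0.077 | 98.7 | --- |
| | | 0.8 | MLE | 0.001 | 0.057 | 94.6 | --- | 0.002 | 0.050 | 95.2 | --- |
| | | | 1 subunit | −0.001 | 0.054 | 94.6 | --- | 0.002 | 0.052 | 94.8 | --- |
| | | | all subunits | 0.000 | 0.133 | 86.7 | --- | 0.002 | 0.127 | 97.0 | --- |
| 0.2 | 0.1 | 0.2 | MLE | 0.200 | --- | 95.0 | 49.3 | 0.200 | --- | 95.3 | 79.9 |
| | | | 1 subunit | 0.198 | --- | 94.6 | 28.3 | 0.201 | --- | 94.7 | 53.1 |
| | | | all subunits | 0.199 | --- | 94.4 | 51.4 | 0.200 | --- | 94.8 | 81.1 |
| | | 0.5 | MLE | 0.201 | --- | 94.7 | 43.3 | 0.200 | --- | 95.3 | 72.0 |
| | | | 1 subunit | 0.198 | --- | 94.6 | 28.3 | 0.201 | --- | 94.7 | 53.1 |
| | | | all subunits | 0.200 | --- | 91.2 | 52.5 | 0.201 | --- | 92.1 | 78.9 |
| | | 0.8 | MLE | 0.201 | --- | 94.9 | 33.2 | 0.202 | --- | 95.2 | 61.1 |
| | | | 1 subunit | 0.198 | --- | 94.6 | 28.3 | 0.201 | --- | 94.7 | 53.1 |
| | | | all subunits | 0.201 | --- | 86.4 | 52.4 | 0.202 | --- | 87.1 | 76.7 |
| 0.5 | 0.4 | 0.5 | MLE | 0.499 | --- | 94.6 | 99.8 | 0.500 | --- | 94.6 | 100.0 |
| | | | 1 subunit | 0.497 | --- | 94.9 | 96.4 | 0.500 | --- | 95.0 | 100.0 |
| | | | all subunits | 0.496 | --- | 92.6 | 99.9 | 0.498 | --- | 92.9 | 100.0 |
| | | 0.8 | MLE | 0.500 | --- | 95.5 | 99.0 | 0.502 | --- | 95.3 | 100.0 |
| | | | 1 subunit | 0.497 | --- | 94.9 | 96.4 | 0.500 | --- | 95.0 | 100.0 |
| | | | all subunits | 0.498 | --- | 87.1 | 99.7 | 0.500 | --- | 87.6 | 100.0 |
| 0 | 0 | 0.2 | MLE | 0.001 | 0.052 | 95.0 | --- | 0.000 | 0.056 | 94.5 | --- |
| | | | 1 subunit | −0.001 | 0.054 | 94.6 | --- | 0.002 | 0.052 | 94.8 | --- |
| | | | all subunits | 0.001 | 0.060 | 94.0 | --- | 0.000 | 0.064 | 93.7 | --- |
| | | 0.5 | MLE | 0.003 | 0.052 | 95.2 | --- | 0.001 | 0.057 | 94.6 | --- |
| | | | 1 subunit | −0.001 | 0.054 | 94.6 | --- | 0.002 | 0.052 | 94.8 | --- |
| | | | all subunits | 0.003 | 0.113 | 88.7 | --- | 0.001 | 0.120 | 88.0 | --- |
| | | 0.8 | MLE | 0.004 | 0.048 | 95.7 | --- | 0.001 | 0.052 | 95.1 | --- |
| | | | 1 subunit | −0.001 | 0.054 | 94.6 | --- | 0.002 | 0.052 | 94.8 | --- |
| | | | all subunits | 0.004 | 0.197 | 80.3 | --- | 0.002 | 0.202 | 79.8 | --- |
| 0.2 | 0.1 | 0.2 | MLE | 0.201 | --- | 95.1 | 66.6 | 0.200 | --- | 94.4 | 92.0 |
| | | | 1 subunit | 0.198 | --- | 94.6 | 28.3 | 0.201 | --- | 94.7 | 53.1 |
| | | | all subunits | 0.199 | --- | 94.2 | 68.6 | 0.199 | --- | 93.7 | 92.9 |
| | | 0.5 | MLE | 0.203 | --- | 95.2 | 52.4 | 0.202 | --- | 94.6 | 81.7 |
| | | | 1 subunit | 0.198 | --- | 94.6 | 28.3 | 0.201 | --- | 94.7 | 53.1 |
| | | | all subunits | 0.202 | --- | 88.9 | 66.5 | 0.201 | --- | 88.1 | 89.7 |
| | | 0.8 | MLE | 0.205 | --- | 95.4 | 32.4 | 0.202 | --- | 95.1 | 63.7 |
| | | | 1 subunit | 0.198 | --- | 94.6 | 28.3 | 0.201 | --- | 94.7 | 53.1 |
| | | | all subunits | 0.204 | --- | 79.8 | 34.9 | 0.201 | --- | 79.4 | 84.8 |
| 0.5 | 0.4 | 0.5 | MLE | 0.500 | --- | 94.6 | 100.0 | 0.500 | --- | 93.6 | 100.0 |
| | | | 1 subunit | 0.497 | --- | 94.9 | 96.4 | 0.500 | --- | 95.0 | 100.0 |
| | | | all subunits | 0.496 | --- | 90.3 | 100.0 | 0.498 | --- | 90.3 | 100.0 |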

0.8

MLE

0.503

---

95.9

99.3

0.501

---

95.2

100.0

1 subunit

0.497

---

94.9

96.4

0.500

---

95.0

100.0

all subunits

0.500

---

80.4

100.0

0.499

---

79.5

100.0

Unbalanced Designs

0.5

MLE

0.002

0.055

95.2

---

0.000

0.058

94.7

---

95%

0.2

0.1

0.5

MLE

0.202

---

95.4

52.6

0.201

---

94.7

81.8

20%

0.5

MLE

−0.019

0.049

95.5

---

−0.008

0.056

94.8

---

80%

0.2

0.1

0.5

MLE

0.182

---

95.2

40.7

0.193

---

94.5

76.9

Open in a new tab

4000 simulations were run for each set of parameter combinations

For balanced designs, we see that bias is low for all 3 methods. For the MLE method type I error ranges from 0.047 to 0.057 (mean = 0.052), which is acceptable. Similarly, the 1 subunit method type I error ranges from 0.052 to 0.054 (mean = 0.053). However, type I error is not preserved for the all subunits method and ranges from 0.053 to 0.064 (mean = 0.061) for ρ_xx = 0.2, from 0.077 to 0.120 (mean = 0.100) for ρ_xx = 0.5 and from 0.127 to 0.202 (mean = 0.155) for ρ_xx = 0.8.

Coverage probability for the MLE method ranges from 93.6% to 95.9% (mean = 95.0%), which is acceptable. Similarly, coverage probability for the 1 subunit method ranges from 94.6% to 95.0% (mean = 94.8%), which is also acceptable. However, for the all subunits method coverage ranges from 93.7% to 99.3% (mean = 94.8%) when ρ_xx = 0.2, from 88.0% to 98.7% (mean = 93.4%) when ρ_xx = 0.5 and from 79.4% to 97.0% (mean = 84.3%) when ρ_xx = 0.8. In general, the high type I error when ρ_xx = 0.5 or 0.8 is due to an inappropriately low se and a corresponding low coverage probability.
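The inflation of type I error for the all subunits method can be reproduced with a small simulation. The following is a minimal sketch rather than the authors' simulation code: it assumes k=2, unit variances, ρ_xy = ρ_xy^* = 0, and a Fisher z test at the two-sided 5% level for both naive analyses (the text does not state which test was used). All function names are hypothetical.

```python
import numpy as np


def simulate_clusters(n, rho_xy, rho_xy_star, rho_xx, rho_yy, rng):
    """Draw n clusters (x1, x2, y1, y2) from a 4-variate normal with unit variances."""
    cov = np.array([
        [1.0,         rho_xx,      rho_xy,      rho_xy_star],
        [rho_xx,      1.0,         rho_xy_star, rho_xy],
        [rho_xy,      rho_xy_star, 1.0,         rho_yy],
        [rho_xy_star, rho_xy,      rho_yy,      1.0],
    ])
    return rng.multivariate_normal(np.zeros(4), cov, size=n)


def fisher_z_rejects(x, y, crit=1.96):
    """Two-sided test of zero correlation that treats all pairs as independent."""
    r = np.corrcoef(x, y)[0, 1]
    z = np.arctanh(r) * np.sqrt(len(x) - 3)
    return bool(abs(z) > crit)


def type_i_error(n=50, rho_xx=0.8, n_sim=2000, seed=0):
    """Empirical rejection rates under the null for the two naive analyses."""
    rng = np.random.default_rng(seed)
    rej_all = 0
    rej_one = 0
    for _ in range(n_sim):
        d = simulate_clusters(n, 0.0, 0.0, rho_xx, rho_xx, rng)
        # "all subunits": pool both subunits and ignore the clustering
        rej_all += fisher_z_rejects(d[:, :2].ravel(), d[:, 2:].ravel())
        # "1 subunit": use only the first subunit of each cluster
        rej_one += fisher_z_rejects(d[:, 0], d[:, 2])
    return rej_all / n_sim, rej_one / n_sim


all_err, one_err = type_i_error()
print(f"all subunits: {all_err:.3f}, 1 subunit: {one_err:.3f}")
```

Under these assumptions the pooled analysis should reject noticeably more often than the nominal 5%, while the one-subunit analysis should stay close to it, in line with the pattern reported in Table 1.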

Power for the MLE method is always greater than for the 1 subunit method, sometimes dramatically so, especially when ρ_xx = 0.2. For example, when ρ_xy = 0.2, $ρ_{x y}^{*} = 0.1$ and n = 100, power is 79.9% for the MLE method and 53.1% for the 1 subunit method when k=2, and 92.0% for the MLE method and 53.1% for the 1 subunit method when k=3. Overall, the MLE method has appropriate type I error and coverage probability and improved power vs. the 1 subunit method. The all subunits method has inappropriate type I error, especially when ρ_xx ≥ 0.5. Thus, overall the MLE method is the preferred method among the 3 approaches.

For unbalanced designs, with 5% of subjects missing a 3^rd replicate, bias is low and coverage probability ranges from 94.7% to 95.4%. With 20% of subjects missing a 3^rd replicate, there was slight negative bias both for n=50 and to a lesser extent n=100. Power in the 5% missing data design was 81.8% for n=100 and 52.6% for n=50, virtually the same as with no missing data (k=3, n=100, 81.7%; n=50, 52.4%). Power in the 20% missing data design was noticeably reduced (n=100, 76.9%; n=50, 40.7%). Overall, missing data methods are an effective approach to analyzing clustered data with unbalanced designs.
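The unbalanced designs can be mimicked by blanking out the 3rd subunit for a random subset of clusters. The sketch below is an illustration only: it assumes that exactly the stated fraction of clusters loses the 3rd subunit and that x and y are deleted together; the source does not spell out either detail, and the function name is hypothetical.

```python
import numpy as np


def delete_third_subunit(x, y, frac_missing, rng):
    """Set the 3rd subunit of x and y to NaN for a random fraction of clusters.

    x, y: arrays of shape (n, 3), one row per cluster.
    """
    x_out = np.array(x, dtype=float, copy=True)
    y_out = np.array(y, dtype=float, copy=True)
    n = x_out.shape[0]
    n_drop = int(round(frac_missing * n))
    dropped = rng.choice(n, size=n_drop, replace=False)
    x_out[dropped, 2] = np.nan
    y_out[dropped, 2] = np.nan
    return x_out, y_out


rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = rng.normal(size=(100, 3))
x_unbal, y_unbal = delete_third_subunit(x, y, 0.20, rng)
n_two_subunits = int(np.isnan(x_unbal[:, 2]).sum())
print(n_two_subunits, "of 100 clusters have 2 subunits")
```

The NaN entries would then be handled by whatever missing data method is used downstream, such as the multiple imputation approach described in the paper.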

An assumption of the estimation procedure for rank correlation is that the distributions of both X and Y are continuous. We also performed simulation studies to assess the adequacy of estimation of rank correlation for clustered data based on (13,14) when the distribution of X and/or Y is discrete. For this purpose, we simulated (X,Y) scores from a bivariate normal distribution, computed the sample probits of X and Y scores, but then subdivided the sample probit scores into quintiles and replaced the individual sample probit scores by the median sample probit score within a quintile. We then obtained point and interval estimates of the rank correlation based on (13) and (14). We considered the following parameter combinations $(ρ_{x y}, ρ_{x y}^{*}) = (0, 0)$ , (0.2,0.1); n = (50,100), (ρ_xx = ρ_yy = 0.5), k = 3. 4,000 simulations were conducted for each parameter combination. The results are given in Appendix Table 1.
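The discretization step can be sketched as follows. This is an illustrative sketch, not the authors' code: the sample probit follows the definition Φ⁻¹[rank/(n+1)] given in Theorem 1 of the Appendix, the quintile cut points use NumPy's default quantile rule (an assumption), and `spearman_from_pearson` uses the standard bivariate normal relationship ρ_s = (6/π) arcsin(ρ/2), which is assumed here to be the relationship that equation (13) refers to. All function names are hypothetical.

```python
import math
from statistics import NormalDist

import numpy as np


def probit_scores(x):
    """Sample probit scores: inverse normal c.d.f. of rank / (n + 1)."""
    x = np.asarray(x, dtype=float)
    # rank = 1 + number of sample values strictly smaller than x
    ranks = 1 + (x[:, None] > x[None, :]).sum(axis=1)
    inv = NormalDist().inv_cdf
    return np.array([inv(r / (len(x) + 1)) for r in ranks])


def discretize_to_quintiles(h):
    """Replace each probit score by the median probit score of its quintile."""
    h = np.asarray(h, dtype=float)
    edges = np.quantile(h, [0.2, 0.4, 0.6, 0.8])
    group = np.searchsorted(edges, h, side="right")  # quintile index 0..4
    out = np.empty_like(h)
    for g in range(5):
        mask = group == g
        out[mask] = np.median(h[mask])
    return out


def spearman_from_pearson(rho):
    """Spearman correlation implied by Pearson correlation rho under bivariate normality."""
    return 6.0 / math.pi * math.asin(rho / 2.0)


rng = np.random.default_rng(0)
h = probit_scores(rng.normal(size=100))
h_discrete = discretize_to_quintiles(h)
print(np.unique(h_discrete))
```

After this step each variable takes only five distinct values, which is the discrete setting examined in Appendix Table 1.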

We see that when $(ρ_{x y}, ρ_{x y}^{*}) = (0, 0)$ , the type I error ranges from 0.054 to 0.055. Over the 4 parameter combinations the coverage probability ranges from 94.6% to 95.9%. Power ranges from 52.4% when n=50 to 76.8% when n=100, which is slightly lower than for continuous distributions with the same parameter combinations ( $(ρ_{x y}, ρ_{x y}^{*}) = (0.2, 0.1)$ , (ρ_xx = ρ_yy = 0.5), k = 3, power = 81.7%, n=100; 52.4%, n=50). Overall, the approach in (13) and (14) for estimation of rank correlation seems adequate for discrete distributions based on these designs.

Another assumption of the procedure for estimation of rank correlation for clustered data is that the distribution of (H_X,H_Y) is bivariate normal. By definition H_X and H_Y are univariate N(0,1) random variables, but the assumption of bivariate normality may not hold. To assess the sensitivity of our rank correlation estimation procedures to this assumption we generated (H_X,H_Y) from a Kimeldorf-Sampson (KS) bivariate copula [9] of the form:

\begin{array}{l} F (a, b, δ) = \Pr (H_{X} \leq a \text{ and } H_{Y} \leq b) \\ = \left\{ [Φ (a)]^{- δ} + [Φ (b)]^{- δ} - 1 \right\}^{- 1 / δ}, \quad δ \geq 0. \end{array}

(17)

Note that the univariate margins of F are N(0,1) random variables. Also, if δ = 0, then H_X and H_Y are independent, i.e., F(a,b,0) = Φ(a) Φ(b).

To simulate data from this family we first obtained the conditional distribution of H_Y|H_X from (17) given by:

F_{H_{Y} \mid H_{X}} (b \mid a, δ) = \left\{ [Φ (a)]^{- δ} + [Φ (b)]^{- δ} - 1 \right\}^{- \frac{1}{δ} - 1} [Φ (a)]^{- δ - 1}

(18)

We then generated two iid U(0,1) random variables (U₁, U₂) to correspond to Φ(a) and F_{H_Y|H_X}(b|a,δ) respectively, and solved for a and b as follows:

\begin{matrix} a = Φ^{- 1} (U_{1}) \\ b = Φ^{- 1} \left[ \left\{ 1 - U_{1}^{- δ} \left[ 1 - U_{2}^{- δ / (1 + δ)} \right] \right\}^{- 1 / δ} \right] \end{matrix}

(19)

To simulate clustered data from this family, we created a clustered KS copula by first generating (a,b) as in (19). We then generated iid (e_{a_q}, e_{b_q}), q = 1,2 with univariate N(0,g) margins from this copula based on

\begin{matrix} e_{a_{q}} = g^{1 / 2} Φ^{- 1} (U_{1 q}) \\ e_{b_{q}} = g^{1 / 2} Φ^{- 1} \left[ \left\{ 1 - U_{1 q}^{- δ} \left[ 1 - U_{2 q}^{- δ / (1 + δ)} \right] \right\}^{- 1 / δ} \right], \quad q = 1, 2 \end{matrix}

(20)

where (U_{1q}, U_{2q}) are iid U(0,1), (U₁₁,U₂₁) is independent of (U₁₂,U₂₂) and (U_{1q}, U_{2q}), q = 1,2 are independent of (U₁, U₂). Finally, we created clustered data using

x_{q} = (a + e_{a_{q}}) / \sqrt{1 + g}, y_{q} = (b + e_{b_{q}}) / \sqrt{1 + g}, q = 1, 2.

(21)

Note that based on (17–21) x_q and y_q are univariate N(0,1), but F(x_q, y_q) is not bivariate normal,

ρ_{x y} = corr (x_{q}, y_{q}), q = 1, 2; ρ_{x y}^{*} = corr (x_{q_{1}}, y_{q_{2}}), q_{1} \neq q_{2} = 1, 2, and ρ_{x y} \neq ρ_{x y}^{*}
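The construction in (19) to (21) can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' program: the function names are hypothetical, the uniforms are clipped away from 0 and 1 purely to keep the inverse normal c.d.f. finite, and clusters have exactly 2 subunits as in the text.

```python
from statistics import NormalDist

import numpy as np

_inv_norm = np.vectorize(NormalDist().inv_cdf)
_EPS = 1e-12  # numerical guard only; not part of the construction in the text


def ks_pair(delta, size, rng, scale=1.0):
    """Pairs from the KS copula with N(0, scale^2) margins via conditional inversion, as in (19)."""
    u1 = np.clip(rng.uniform(size=size), _EPS, 1.0 - _EPS)
    u2 = np.clip(rng.uniform(size=size), _EPS, 1.0 - _EPS)
    v = (1.0 - u1 ** (-delta) * (1.0 - u2 ** (-delta / (1.0 + delta)))) ** (-1.0 / delta)
    v = np.clip(v, _EPS, 1.0 - _EPS)
    return scale * _inv_norm(u1), scale * _inv_norm(v)


def clustered_ks(n, delta, g, rng):
    """n clusters of 2 subunits: shared (a, b) plus subunit-specific (e_a, e_b), as in (20)-(21)."""
    a, b = ks_pair(delta, n, rng)
    x = np.empty((n, 2))
    y = np.empty((n, 2))
    for q in range(2):
        e_a, e_b = ks_pair(delta, n, rng, scale=np.sqrt(g))
        x[:, q] = (a + e_a) / np.sqrt(1.0 + g)
        y[:, q] = (b + e_b) / np.sqrt(1.0 + g)
    return x, y


rng = np.random.default_rng(0)
x, y = clustered_ks(20000, delta=1.06, g=0.5, rng=rng)
print(x.mean(), x.std())
```

Because a and e_{a_q} are independent normals with variances 1 and g, dividing by the square root of 1 + g restores unit variance, so the margins of x_q and y_q remain N(0,1) while the joint distribution is not bivariate normal.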

To determine the value of δ we refer to Joe [10], page 147 who shows by numerical integration that with copulas from the family in (17) for δ = 1.06, the Spearman rank correlation = 0.5. However, the clustered KS copula in (21) is a summation of two random variables from this family and does not correspond to the distribution of (H_X, H_Y) in (17). Hence, we simulated 100,000 random vectors (X₁,Y₁,X₂,Y₂) from (21) for g = 1/2 and g = 1, respectively, and empirically estimated ρ_xy,s and $ρ_{x y, s}^{*}$ from

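\begin{array}{l} {\hat{ρ}}_{x y, s} = \left\{ \operatorname{corr} [\operatorname{rank} (X_{1}), \operatorname{rank} (Y_{1})] + \operatorname{corr} [\operatorname{rank} (X_{2}), \operatorname{rank} (Y_{2})] \right\} / 2 \\ {\hat{ρ}}_{x y, s}^{*} = \left\{ \operatorname{corr} [\operatorname{rank} (X_{1}), \operatorname{rank} (Y_{2})] + \operatorname{corr} [\operatorname{rank} (X_{2}), \operatorname{rank} (Y_{1})] \right\} / 2 \end{array}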

Using δ = 1.06, this yielded $({\hat{ρ}}_{x y, s}, {\hat{ρ}}_{x y, s}^{*}) = (0.497, 0.328)$ for g = 1/2 and (0.497,0.246) for g = 1.
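The two empirical estimators above are simple averages of rank correlations and can be sketched as below. The code assumes continuous data with no ties and clusters of size 2 stored as (n, 2) arrays; the function names are hypothetical.

```python
import numpy as np


def ranks(v):
    """Ranks 1..n of a 1-D array, assuming no ties."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v)
    r = np.empty(len(v), dtype=float)
    r[order] = np.arange(1, len(v) + 1)
    return r


def rank_corr(u, v):
    """Pearson correlation of the ranks, i.e. the Spearman correlation."""
    return float(np.corrcoef(ranks(u), ranks(v))[0, 1])


def empirical_rank_correlations(x, y):
    """Return (same-subunit, cross-subunit) rank correlations for (n, 2) arrays x and y."""
    same = (rank_corr(x[:, 0], y[:, 0]) + rank_corr(x[:, 1], y[:, 1])) / 2.0
    cross = (rank_corr(x[:, 0], y[:, 1]) + rank_corr(x[:, 1], y[:, 0])) / 2.0
    return same, cross


# Toy check: y equals x, and the two subunits are in reversed order
x_toy = np.array([[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]])
same, cross = empirical_rank_correlations(x_toy, x_toy)
print(same, cross)
```

Applied to a large simulated sample from the clustered copula, these two averages play the role of the "true" values against which the estimation procedure is judged.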

We then assumed these were the true values of ρ_xy,s and $ρ_{x y, s}^{*}$ and performed the estimation procedure in (13) and (14) to obtain point and interval estimates of ρ_xy,s based on 4,000 samples of size 50 and 100, respectively, thus obtaining empirical estimates of bias and coverage probability. The results are provided in Appendix Table 2.

We see that the median estimated value is slightly lower than the true value for both ρ_xy,s and $ρ_{x y, s}^{*}$ , although the absolute bias is small (≤ 0.011). The coverage probability ranges from 94.1% to 96.6% for g = 1/2 and 91.4% to 96.7% for g = 1 (overall mean = 94.8%) which is acceptable. In summary, the estimation procedure for rank correlation in (13) and (14) performs well for probit scores (H_X,H_Y) whose distribution is not bivariate normal, but instead comes from the clustered KS copula in (21).

5. Examples

(a) Estimating the correlation between visual field and visual acuity in RP patients

We have a sample of 220 RP patients who were enrolled in a clinical trial assessing the effects of docosahexaenoic acid (DHA) supplementation vs. placebo on the rate of decline of visual function over a 4-year period [11]. For this analysis we assess the correlation between Humphrey 30-2 visual field (VF) and ETDRS visual acuity (VA) in individual eyes at the initial (screening) visit. Two hundred thirteen of the patients had VF and VA available in both eyes; five patients had (VA, VF) available in a single eye, but neither available in the fellow eye; two patients had (VA, VF) available in one eye and VA, but not VF available in the fellow eye. Covariate values (i.e., age, gender and genetic type) were complete for all subjects. We used multiple imputation methods as described in Section 5 to impute the missing VF and VA values and perform the analyses described in sections 2, 3 and 4.

We first provide descriptive statistics of VA and VF in Table 2. We see that the mean ETDRS score (VA) = 51.5 ± 0.6 (mean ± se) letters while the mean Humphrey 30-2 visual field (VF) = 848.9 ± 31.7 db. For reference purposes, a person with 20/20 vision has an ETDRS score of 85 letters; an ETDRS score of 50 letters corresponds roughly to a visual acuity of 20/100 [12]. A normal Humphrey 30-2 visual field is about 2500 db. We see that VF but not VA is lower for older subjects and lower for males vs. females. Furthermore, both VA and VF vary substantially by genetic type with the best function among dominant patients and the worst function among sex-linked patients. Other genetic types had intermediate levels of function. We now provide estimates of the Pearson correlation between VF and VA for individual eyes using the eye as the unit of analysis based on the model in equation 7. The results are given in Table 3. We see that the estimated Pearson correlation between VF(y) and VA(x) in the same eye = 0.212 (95% CI=0.088, 0.329), p<0.001, while the Pearson cross-correlation (correlation between VF in the right eye and VA in the left eye, and conversely) = 0.177 (95% CI = 0.054, 0.295), p=0.004. We also computed a partial Pearson correlation, adjusting for age, gender, and genetic type (person-specific covariates) and right vs. left eye (eye-specific covariate) based on equation 16. The partial Pearson correlation = 0.195 and the partial Pearson cross-correlation = 0.156 which are lower than the crude correlations, although still statistically significant (p=0.002, 0.012, respectively).

Table 2.

Descriptive Statistics for the RP Data Set (220 patients)^*

Variable	Category	VA^**			VF^***

		mean (se)	N_eyes (N_eyes,obs^†)	(95% CI)	mean (se)	N_eyes (N_eyes,obs^†)	(95% CI)
	overall	51.5 (0.6)	440 (435)	(50.3, 52.8)	848.9 (31.7)	440 (433)	(786.7, 911.1)
age (yrs)
	< 35	50.3 (1.0)	166 (166)	(48.3, 52.3)	934.3 (51.2)	166 (165)	(833.9, 1034.6)
	≥ 35	52.3 (1.0)	274 (269)	(50.7, 53.8)	797.2 (39.9)	274 (268)	(719.0, 875.4)
gender
	M	51.0 (0.9)	218 (217)	(49.2, 52.7)	798.4 (44.9)	218 (216)	(710.4, 886.3)
	F	52.1 (0.9)	222 (218)	(50.3, 53.8)	898.5 (44.5)	222 (217)	(811.3, 985.7)
genetic type⁺
	DOM	54.4 (1.3)	102 (100)	(51.9, 56.9)	1005.6 (65.1)	102 (99)	(878.0, 1133.2)
	AR	51.3 (1.7)	60 (60)	(48.1, 54.6)	849.6 (84.9)	60 (59)	(683.3, 1015.5)
	XL	44.1 (2.3)	30 (30)	(39.5, 48.7)	670.6 (119.9)	30 (30)	(435.5, 905.7)
	ISO	52.2 (0.9)	214 (211)	(50.5, 53.9)	812.1 (45.0)	214 (211)	(724.0, 900.2)
	UND	45.4 (2.2)	34 (34)	(41.0, 49.7)	766.3 (112.7)	34 (34)	(545.4, 987.1)
eye
	right eye (OD)	51.5 (0.6)	220 (219)	(50.3, 52.7)	873.0 (31.7)	220 (217)	(810.4, 935.5)
	left eye (OS)	51.6 (0.7)	220 (216)	(50.1, 53.0)	824.8 (32.6)	220 (216)	(760.6, 889.0)

^* VA is available for 219 patients for the right eye (OD) and 216 patients for the left eye (OS); VF is available for 217 patients for the right eye (OD) and 216 patients for the left eye (OS); age, gender and genetic type are available for all 220 patients; missing data are accounted for using multiple imputation.

^** VA = visual acuity (ETDRS letter score)

^*** VF = Humphrey visual field (db)

⁺ DOM = dominant; AR = recessive; XL = sex-linked; ISO = isolate; UND = undetermined

^† N_eyes,obs = number of eyes observed for the characteristic; N_eyes = number of eyes observed + number of eyes imputed for the characteristic


Table 3.

Correlation Estimates for the RP Data Set (220 patients, 440 eyes)^*

		Crude				Adjusted
Type of correlation	Parameter	Point estimate	95% CI	P-value	N_eyes (N_eyes,obs^**)	Point estimate	95% CI	P-value	N_eyes (N_eyes,obs^**)
Pearson	ρ_xy	0.212	(0.088, 0.329)	<0.001	440 (433)	0.195	(0.071, 0.312)	0.002	440 (433)
	ρ_{xy}^{*}	0.177	(0.054, 0.295)	0.004	440 (433)	0.156	(0.033, 0.274)	0.012	440 (433)
	ρ_xx	0.787				0.766
	σ_{x}^{2}	100.2				91.1
	ρ_yy	0.944				0.943
	σ_{y}^{2}	227,330				201,710
Rank	ρ_xy,s	0.229	(0.110, 0.341)	<0.001	440 (433)	0.221	(0.103, 0.333)	<0.001	440 (433)
	ρ_{xy, s}^{*}	0.200	(0.082, 0.312)	<0.001	440 (433)	0.189	(0.072, 0.301)	0.001	440 (433)
	ρ_xx,s	0.819				0.807
	σ_{x, s}^{2}	0.946				0.891
	ρ_yy,s	0.909				0.900
	σ_{y, s}^{2}	0.962				0.884

^** N_eyes,obs = number of eyes where both VA and VF are observed; N_eyes = number of eyes observed + number of eyes imputed for the characteristic


Because both VF and VA are not normally distributed, we also computed the corresponding crude and partial rank correlations based on equation 13. The rank correlation coefficients are slightly larger than the corresponding Pearson correlations (ordinary rank correlation = 0.229, p<0.001; partial rank correlation = 0.221, p<0.001; ordinary rank cross-correlation = 0.200, p<0.001; partial rank cross-correlation = 0.189, p = 0.001).

Note that for both Pearson and rank correlation, the cross-correlations ( $ρ_{x y}^{*}, ρ_{x y, s}^{*}$ ) are almost as large as the corresponding inter-class correlations (ρ_xy, ρ_xy,s), indicating likely efficiency gains in estimation of ρ_xy and ρ_xy,s by ρ̂_xy,MLE as presented in equation 7 vs. the ordinary Pearson product-moment correlation given by ρ̂_xy,Pearson, since $σ_{\underline{y} ∣ \underline{x}}^{2}$ will be reduced if t_{ij,2} is included as a predictor of y_ij in equation 7.

To check the assumption of bivariate normality, we note that if H_X is univariate normal, then (H_X,H_Y) will be bivariate normal if H_Y|H_X ~ N(α + βH_X,σ²). To test this assumption we (1) ran a linear regression of Ĥ_{i,Y_j,n} on Ĥ_{i,X_j,n}, (2) obtained studentized residuals from the regression and (3) performed the Shapiro-Wilk test of normality on the distribution of studentized residuals (available in PROC UNIVARIATE of SAS). This was done separately for right (j=1) and left (j=2) eyes, where Y = VF and X = VA. The Shapiro-Wilk test statistic was W = 0.98, p > 0.05 for right eyes and W = 0.98, p > 0.05 for left eyes, indicating that the assumption of bivariate normality is appropriate for these data.
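The three-step diagnostic can be sketched outside SAS as follows. This is a hedged illustration, not the authors' program: it assumes internally studentized residuals (the text does not say which studentization was used), uses `scipy.stats.shapiro` for the Shapiro-Wilk test, and runs on simulated probit scores rather than the RP data. Function names are hypothetical.

```python
import numpy as np
from scipy import stats


def studentized_residuals(h_x, h_y):
    """Internally studentized residuals from the simple linear regression of h_y on h_x."""
    h_x = np.asarray(h_x, dtype=float)
    h_y = np.asarray(h_y, dtype=float)
    n = len(h_x)
    design = np.column_stack([np.ones(n), h_x])
    beta, *_ = np.linalg.lstsq(design, h_y, rcond=None)
    resid = h_y - design @ beta
    # leverage h_ii = diagonal of the hat matrix X (X'X)^{-1} X'
    leverage = np.einsum("ij,jk,ik->i", design, np.linalg.inv(design.T @ design), design)
    s2 = resid @ resid / (n - 2)
    return resid / np.sqrt(s2 * (1.0 - leverage))


def bivariate_normality_check(h_x, h_y):
    """Shapiro-Wilk W statistic and p-value for the studentized residuals."""
    w, p = stats.shapiro(studentized_residuals(h_x, h_y))
    return float(w), float(p)


rng = np.random.default_rng(0)
h_x = rng.normal(size=200)
h_y = 0.3 * h_x + rng.normal(size=200)
w, p = bivariate_normality_check(h_x, h_y)
print(f"W = {w:.3f}, p = {p:.3f}")
```

In the paper the check is run separately for each distinguishable replicate (right and left eyes), so the function would be called once per value of j.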

(b) Estimating the correlation between biomarkers of phenacetin and aspirin intake in a study of analgesic abuse among Swiss women

The Swiss Analgesic Study started in 1967/1968 [13]. There were 1244 Swiss women participating in the study. The goal was to evaluate the association between the use of phenacetin-containing analgesics and the prevalence and incidence of kidney disorders. NAPAP (N-acetyl-P-aminophenol) is a urinary metabolite associated with recent use of phenacetin-containing analgesics. NAPAP was initially measured in a urine sample at the baseline clinic visit. There were two additional urine samples collected at home on 2 separate days within 1 week of the baseline clinic visit. One issue is that some analgesics contain both phenacetin and aspirin and it is important to determine the correlation between biomarkers that reflect recent phenacetin and aspirin intake. Salicylate concentrations in urine may be increased after recent intake of aspirin-containing analgesics, but the amount of increase depends on the pH level in urine with alkaline urine (pH > 7.0) resulting in increased secretion in urine [14]. Urinary salicylates may also increase after ingestion of other substances (e.g., cranberry juice) [15].

For this analysis, concentrations of NAPAP and salicylates were determined from the same 3 urine specimens for each woman. NAPAP was represented as a continuous variable and expressed in units of optical density (o.d.), but its distribution is highly skewed and non-normal. Urinary salicylates (mg %) were measured in four categories (0-19 mg% / 20-49 mg% / 50-99 mg% / 100+ mg%) and converted to a continuous scale by using the median value within each category (10 mg% / 35 mg% / 75 mg% / 100 mg%). Since the distributions of both NAPAP and salicylates were non-normal, the rank correlation is a natural measure of association. Descriptive data for this study are provided in Table 4.

Table 4.

Descriptive Statistics for the Swiss Study Data Set, (1244 women)^*

	NAPAP (o.d.)			Salicylates (mg%)
	mean(se)	N_vis(N_vis,obs)^**	(95% CI)	mean(se)	N_vis(N_vis,obs)^**	(95% CI)

overall	0.162 (0.008)	3732 (3680)	(0.147, 0.178)	15.0 (0.4)	3732 (3600)	(14.2, 15.7)
age
<40	0.146 (0.013)	1311 (1292)	(0.120, 0.172)	15.2 (0.7)	1311 (1265)	(13.9, 16.5)
≥40	0.171 (0.010)	2421 (2388)	(0.152, 0.191)	14.8 (0.5)	2421 (2335)	(13.9, 15.8)
weight (kg)
≤57.0	0.202 (0.016)	912 (898)	(0.171, 0.233)	16.4 (0.8)	912 (880)	(14.9, 18.0)
57.1 – 63.9	0.134 (0.017)	840 (832)	(0.102, 0.167)	13.2 (0.8)	840 (813)	(11.6, 14.8)
64.0 – 72.9	0.142 (0.014)	1113 (1098)	(0.114, 0.170)	14.6 (0.7)	1113 (1075)	(13.2, 16.1)
≥73.0	0.174 (0.016)	867 (851)	(0.142, 0.206)	15.6 (0.8)	867 (832)	(14.0, 17.2)
Number of cigarettes currently smoked per day
0	0.134 (0.010)	2295 (2266)	(0.117, 0.156)	14.5 (0.5)	2295 (2222)	(13.5, 15.5)
1 – 9	0.157 (0.017)	759 (745)	(0.123, 0.191)	15.4 (0.9)	759 (725)	(13.7, 17.1)
≥10	0.258 (0.018)	678 (669)	(0.222, 0.294)	16.1 (0.9)	678 (653)	(14.4, 17.9)
location of urine collection
clinic visit	0.175 (0.010)	1244 (1244)	(0.155, 0.194)	15.6 (0.5)	1244 (1202)	(14.5, 16.6)
1^st home urine	0.153 (0.009)	1244 (1217)	(0.136, 0.170)	14.6 (0.5)	1244 (1191)	(13.7, 15.6)
2^nd home urine	0.160 (0.009)	1244 (1219)	(0.142, 0.177)	14.7 (0.5)	1244 (1207)	(13.7, 15.7)

^* NAPAP is available for 1244 women at the clinic visit, 1217 women at the 1^st home urine and 1219 women for the 2^nd home urine; Salicylates are available for 1202 women at the clinic visit, 1191 women at the 1^st home urine and 1207 women at the 2^nd home urine; age, weight and number of cigarettes currently smoked are available for all 1244 women; missing data are accounted for using multiple imputation.

^** N_vis,obs = number of visits where the characteristic was observed; N_vis = number of visits where the characteristic was observed + number of visits where the characteristic was imputed.


There were 90 women (7.2%) who had missing data for at least one NAPAP replicate and/or one salicylate replicate. We used multiple imputation methods with 5 imputations to impute the missing data (using PROC MI of SAS) and to perform the analyses in equation 11 (using PROC MIANALYZE of SAS).

We see that mean NAPAP was higher for women age 40 or older vs. women < 40 years of age, while mean salicylates were comparable in the two age groups. Mean NAPAP was higher for both light (≤57.0kg) and heavier (≥73.0kg) women and was much higher for women who currently smoke ≥10 cigarettes/day vs. non-current smokers. The trends for salicylates were in the same direction. Both NAPAP and salicylates were somewhat higher at the clinic visit than at the two home urines.

The correlation analyses based on equation 11 are reported in Table 5. The intraclass correlation (ICC) was higher for NAPAP (ρ_yy = 0.611) than for salicylates (ρ_xx = 0.402). The Pearson correlation between NAPAP and salicylates was 0.391 (p<0.001) when measured at the same visit (ρ_xy ) and 0.248 when measured at different visits ( $ρ_{x y}^{*}$ ). After adjusting for age, weight, smoking and location of urine collection, the estimated Pearson correlation between NAPAP and salicylates remained about the same when measured at the same visit (i.e., ρ_xy = 0.390) and when measured at different visits (i.e., $ρ_{x y}^{*} = 0.246$ ).

Table 5.

Correlation Estimates for the Swiss Study Data Set (1244 women)^*

		Crude^†				Adjusted^‡
Type of correlation	Parameter	Point estimate	95% CI	P-value	N_vis(N_vis,obs)^**	Point estimate	95% CI	P-value	N_vis(N_vis,obs)^**
Pearson	ρ_xy	0.391	(0.355, 0.426)	<0.001	3732 (3597)	0.390	(0.354, 0.425)	<0.001	3732 (3597)
	ρ_{xy}^{*}	0.248	(0.213, 0.282)	<0.001	3732 (3597)	0.246	(0.212, 0.280)	<0.001	3732 (3597)
	ρ_xx	0.402				0.401
	σ_{x}^{2}	313.2				312.5
	ρ_yy	0.611				0.602
	σ_{y}^{2}	0.105				0.102
Rank	ρ_xy,s	0.290	(0.250, 0.328)	<0.001	3732 (3597)	0.288	(0.247, 0.328)	<0.001	3732 (3597)
	ρ_{xy, s}^{*}	0.197	(0.162, 0.231)	<0.001	3732 (3597)	0.194	(0.159, 0.229)	<0.001	3732 (3597)
	ρ_xx,s	0.329
	σ_{x, s}^{2}	0.426
	ρ_yy,s	0.538				0.528
	σ_{y, s}^{2}	0.952				0.932

^** N_vis,obs = number of visits where both urinary NAPAP and salicylates were available; N_vis = number of visits where both urinary NAPAP and salicylates were available + number of visits where either was imputed

^† y=NAPAP, x=salicylates

^‡ adjusted for age, weight, smoking and location of urine collection


Since the distribution of NAPAP was highly skewed and the distribution of salicylates was represented in 4 broad ordered categories, it was reasonable to also consider rank correlation as a measure of association. The estimated rank correlation was 0.290 when NAPAP and salicylates were measured at the same visit (ρ_xy,s) and 0.197 when measured at different visits ( $ρ_{x y, s}^{*}$ ). After adjusting for age, weight, smoking and location of urine collection, the rank correlations decreased slightly when measured at the same visit (ρ_xy,s = 0.288) and at different visits ( $ρ_{x y, s}^{*} = 0.194$ ).

This example shows that there is a considerable difference between ρ̂_xy vs. ${\hat{ρ}}_{x y}^{*}$ and similarly between ρ̂_xy,s vs. ${\hat{ρ}}_{x y, s}^{*}$ indicating that NAPAP and salicylate values at different visits are best modeled as distinguishable replicates.

We also performed the Shapiro-Wilk test of normality for the distribution of studentized residuals of Ĥ_{i,Y_j,n} given Ĥ_{i,X_j,n} as described in the RP example. The Shapiro-Wilk test statistic was W = 0.99, p < 0.05 at the clinic visit, W = 0.99, p < 0.05 at the 1^st home urine, and W = 0.99, p < 0.05 at the 2^nd home urine. The reference value for the W test statistic under normality is W = 1.0. Thus, there was only a mild departure from the null value of W and the significance of the W test is due to the large sample size.

6. Discussion

In this paper we have presented methods for maximum likelihood estimation of Pearson correlation in the clustered data setting. The methods are applicable to clustered data based on distinguishable replicates and can be used both for clusters of size 2 (Equation 7) and clusters of size > 2 (Equation 11). They are based on a population-average type of model. A simpler approach to this problem from a population-average perspective is based on the regression model

y_{i j} = α + β x_{i j} + e_{i j}, i = 1, \dots, n; j = 1, \dots, k

(22)

where

{\underline{e}}_{i} = (e_{i 1}, \dots, e_{i k})' ~ MVN (\underline{0}, Σ)

and

Σ = σ_{\underline{y} ∣ \underline{x}}^{2} [(1 - ρ_{y y ∣ \underline{x}}) I_{k \times k} + ρ_{y y ∣ \underline{x}} J_{k \times k}]

The corresponding estimator of ρ_xy denoted by ρ̂_xy,standard is given by:

{\hat{ρ}}_{x y, standard} = \hat{β} {\hat{σ}}_{x} / {\hat{σ}}_{y}

(23)

The model in (22) and the corresponding estimator in (23) are actually a special case of Equation 11 when $ρ_{x y}^{*} = ρ_{x y} ρ_{x x}$ . In words, $ρ_{x y}^{*} = ρ_{x y} ρ_{x x}$ if the correlation between y_{ij_1} and x_{ij_2} is completely mediated by the correlation between y_ij and x_ij and between x_{ij_1} and x_{ij_2}. In general, this will not be the case, and the estimator in (23) is not the MLE of ρ_xy under the model in (8).
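The condition $ρ_{x y}^{*} = ρ_{x y} ρ_{x x}$ can be checked directly from (22). The short derivation below is a sketch that assumes, beyond what is stated in the text, that the error for one replicate is uncorrelated with the x value of every replicate in the same cluster and that x and y have common variances σ_x² and σ_y² across replicates; j₁ and j₂ denote two different replicates.

```latex
\begin{aligned}
\operatorname{cov}(y_{ij_1}, x_{ij_2})
  &= \operatorname{cov}(\alpha + \beta x_{ij_1} + e_{ij_1},\; x_{ij_2})
   = \beta \operatorname{cov}(x_{ij_1}, x_{ij_2})
   = \beta\, \sigma_x^{2}\, \rho_{xx}, \\
\rho_{xy}^{*}
  &= \frac{\beta\, \sigma_x^{2}\, \rho_{xx}}{\sigma_x \sigma_y}
   = \left( \frac{\beta\, \sigma_x}{\sigma_y} \right) \rho_{xx}
   = \rho_{xy}\, \rho_{xx}.
\end{aligned}
```

The last step uses the population version of (23), ρ_xy = βσ_x/σ_y, so the cross-correlation is fully determined by ρ_xy and ρ_xx and cannot vary freely as it does in the more general model.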

Bland and Altman [16] have also considered a population-average approach to estimation of ρ_xy. With their approach, a pseudo-data set is created where pairs (x̄_i, ȳ_i) are repeated k_i times and k_i = # replicates for the i^th subject. Their estimator can be expressed in the form:

{\hat{ρ}}_{x y, B A} = \sum_{i = 1}^{n} k_{i} ({\bar{x}}_{i} - \bar{\bar{x}}) ({\bar{y}}_{i} - \bar{\bar{y}}) / \left\{ \left[ \sum_{i = 1}^{n} k_{i} ({\bar{x}}_{i} - \bar{\bar{x}})^{2} \right] \left[ \sum_{i = 1}^{n} k_{i} ({\bar{y}}_{i} - \bar{\bar{y}})^{2} \right] \right\}^{1 / 2}

where

\bar{\bar{x}} = \sum_{i = 1}^{n} k_{i} {\bar{x}}_{i} / K, \bar{\bar{y}} = \sum_{i = 1}^{n} k_{i} {\bar{y}}_{i} / K, K = \sum_{i = 1}^{n} k_{i}

The ordinary Pearson correlation is then computed based on the K observations in the pseudo-data set and is tested for significance using a t distribution with n - 2 df as the reference distribution.

Lorenz, Datta and Harkema [17] also consider a population-average approach, but based on observations from individual cluster-members rather than cluster averages. Their estimator takes the form:

{\hat{ρ}}_{x y, LDH} = \sum_{i = 1}^{n} \left[ \sum_{j = 1}^{k_{i}} (x_{i j} - \bar{\bar{x}}) (y_{i j} - \bar{\bar{y}}) / k_{i} \right] / \left\{ \left[ \sum_{i = 1}^{n} \sum_{j = 1}^{k_{i}} (x_{i j} - \bar{\bar{x}})^{2} / k_{i} \right] \left[ \sum_{i = 1}^{n} \sum_{j = 1}^{k_{i}} (y_{i j} - \bar{\bar{y}})^{2} / k_{i} \right] \right\}^{1 / 2}

Large sample asymptotic normality of ρ̂_xy,LDH is assumed and significance testing is based on z_LDH = ρ̂_xy,LDH / [var(ρ̂_xy,LDH)]^{1/2}, which is compared with a N(0,1) distribution. The delta method is used to estimate var(ρ̂_xy,LDH). The LDH approach is also applied to estimation of Kendall’s τ for clustered data.

Both the BA and LDH approaches assume that replicates are indistinguishable and allow for a variable number of replicates per cluster. With the BA estimator, the i^th cluster-mean is weighted k_i times. With the LDH estimator each cluster gets equal weight regardless of cluster-size. If either ρ_xx > 0 or ρ_yy > 0, the optimal weighting for the i^th cluster will be between 1 and k_i.
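The different weighting of the two estimators is easy to see in code. The following is a minimal sketch written from the two displayed formulas, not from the original authors' software. It assumes the grand means are the observation-weighted means defined alongside the BA estimator (the text reuses the same symbol for LDH), and the function names are hypothetical. Each cluster is passed as a 1-D array of its replicate values.

```python
import numpy as np


def _grand_means(x_clusters, y_clusters):
    """Observation-weighted grand means: sum_i k_i * mean_i / K."""
    total = sum(len(x) for x in x_clusters)
    gx = sum(float(np.sum(x)) for x in x_clusters) / total
    gy = sum(float(np.sum(y)) for y in y_clusters) / total
    return gx, gy


def rho_bland_altman(x_clusters, y_clusters):
    """Correlation of cluster means, with the i-th mean weighted k_i times."""
    k = np.array([len(x) for x in x_clusters], dtype=float)
    xb = np.array([np.mean(x) for x in x_clusters])
    yb = np.array([np.mean(y) for y in y_clusters])
    gx, gy = _grand_means(x_clusters, y_clusters)
    num = np.sum(k * (xb - gx) * (yb - gy))
    den = np.sqrt(np.sum(k * (xb - gx) ** 2) * np.sum(k * (yb - gy) ** 2))
    return float(num / den)


def rho_ldh(x_clusters, y_clusters):
    """Correlation of individual observations, each cluster given total weight 1."""
    gx, gy = _grand_means(x_clusters, y_clusters)
    xs = [np.asarray(x, dtype=float) for x in x_clusters]
    ys = [np.asarray(y, dtype=float) for y in y_clusters]
    sxy = sum(np.mean((x - gx) * (y - gy)) for x, y in zip(xs, ys))
    sxx = sum(np.mean((x - gx) ** 2) for x in xs)
    syy = sum(np.mean((y - gy) ** 2) for y in ys)
    return float(sxy / np.sqrt(sxx * syy))


# Hypothetical toy data: three clusters with 3, 1 and 2 replicates
x_toy = [np.array([1.0, 2.0, 3.0]), np.array([2.0]), np.array([5.0, 4.0])]
y_toy = [np.array([1.5, 2.5, 2.0]), np.array([3.0]), np.array([6.0, 4.5])]
print(rho_bland_altman(x_toy, y_toy), rho_ldh(x_toy, y_toy))
```

When every cluster has a single replicate, both functions reduce to the ordinary Pearson correlation, which is a convenient sanity check.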

The estimators in equations 7 and 11 of this paper assume replicates are distinguishable, use multiple imputation methods to estimate missing (x, y) replicate scores based on (x, y) scores for other cluster members as well as values of other covariates, and perform the analysis based on n_IMP imputed data sets using Rubin’s rules. The estimators are maximum likelihood, can be implemented with standard software, extend to the rank correlation setting and allow one to control for other covariates.
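Rubin's rules for pooling results across imputed data sets can be written in a few lines. The sketch below is generic and is not taken from the authors' SAS programs: the inputs are hypothetical per-imputation estimates and squared standard errors, and in practice a correlation is often pooled on a transformed scale (for example Fisher's z) before back-transforming, which is an assumption not confirmed by the text.

```python
def rubin_combine(estimates, variances):
    """Pool m per-imputation estimates and their variances with Rubin's rules.

    Returns (pooled estimate, total variance, between-imputation variance).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                      # pooled point estimate
    within = sum(variances) / m                     # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between      # total variance
    return q_bar, total, between


# Hypothetical values for 5 imputed data sets
est = [0.20, 0.22, 0.21, 0.19, 0.23]
var = [0.004] * 5
q_bar, total_var, between_var = rubin_combine(est, var)
print(q_bar, total_var, between_var)
```

The total variance adds the between-imputation variance, inflated by the factor 1 + 1/m, to the average within-imputation variance, so uncertainty from the imputed values is carried into the final interval.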

The distinguishing feature of ML estimation is that for clustered data the likelihood is based on

L = \prod_{i = 1}^{n} L_{i},

where $L_{i} = L_{1 i} ({\underline{y}}_{i} ∣ {\underline{x}}_{i}) L_{2 i} ({\underline{x}}_{i})$ and $L_{1} = \prod_{i = 1}^{n} L_{1 i}$ is maximized to estimate ρ_xy.

In the LDH approach the likelihood is based on

L_{1}^{*} = \prod_{i = 1}^{n} \prod_{j = 1}^{k_{i}} L (y_{i j} ∣ x_{i j})

Under $L_{1}^{*}$ , y_ij is expressed as a function of x_ij, while under L₁, y_ij is expressed as a function of both x_ij and x_{i+,−j}. $L_{1}^{*}$ is a special case of L₁ when $ρ_{x y}^{*} = ρ_{x y} ρ_{x x}$ . $L_{1}^{*}$ is the likelihood in the i.i.d. situation when ρ_xx = ρ_yy = 0 [1], but not in the clustered data situation.

Another approach is to consider a subject-specific type of model [18]. Under this approach, we consider the random intercept model

y_{i j} = μ + q_{i} + β x_{i j} + e_{i j}, i = 1, \dots, n; j = 1, \dots, k_{i}

(24)

where

q_{i} ~ N (0, σ_{q}^{2}), e_{i j} ~ N (0, σ^{2})

Under this model, the estimator of β has a subject-specific interpretation (i.e., are changes in X between 2 replicates for an individual correlated with corresponding changes in Y?). For example, do changes in visual field between the 2 eyes of a subject correspond to changes in visual acuity? Bland and Altman [18] consider a slight variation of (24) where q is a fixed effect instead of a random effect.
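For the fixed-effect variation, the slope can be obtained by centering x and y within each cluster, since regressing on cluster indicators is equivalent to regressing the within-cluster deviations. The sketch below illustrates this equivalence; it is not the implementation used in [18], and the function name and toy data are hypothetical.

```python
import numpy as np


def within_cluster_slope(x, y, cluster):
    """Fixed-effect slope and within-cluster correlation via cluster-mean centering."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cluster = np.asarray(cluster)
    xc = x.copy()
    yc = y.copy()
    for c in np.unique(cluster):
        mask = cluster == c
        xc[mask] -= x[mask].mean()   # remove the cluster-specific level of x
        yc[mask] -= y[mask].mean()   # remove the cluster-specific intercept q_i
    beta = (xc @ yc) / (xc @ xc)
    r_within = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))
    return float(beta), float(r_within)


# Toy data: two subjects with very different levels but the same within-subject slope of 2
cluster_id = [0, 0, 1, 1]
x_toy = [0.0, 1.0, 10.0, 12.0]
y_toy = [5.0, 7.0, 100.0, 104.0]
beta, r_within = within_cluster_slope(x_toy, y_toy, cluster_id)
print(beta, r_within)
```

The large difference in level between the two toy subjects has no effect on the result, which is what distinguishes this subject-specific quantity from the population-average correlations estimated elsewhere in the paper.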

Finally, we have extended our results to the estimation of Spearman rank correlation for clustered data. To our knowledge, this is the first time this has appeared in the literature. We also extend the Pearson and rank correlation results to the setting of partial correlation, where one can control for both cluster-specific and subunit-specific covariates. For rank correlation this is different from the partial Spearman correlation available in some statistical packages where one calculates covariate-adjusted residuals for both Y and X based on the raw data and then computes the Spearman correlation based on the residuals. With our approach, probit scores are first computed for X and Y variables and residuals are computed by regressing X and Y probit scores respectively, on covariates. Rank correlations are then obtained based on Spearman correlations between the probit residuals. This should provide a more nonparametric estimate of partial Spearman correlation.

The relationship between Pearson and Spearman correlation given in (13) assumes that H_Z = (H_X, H_Y) is bivariate normal. If X and/or Y are categorical, such as for salicylates in the second example, then this assumption will be violated. However, if we assume that there is a latent continuous scale that is divided into categories, and that the actual rank of an observation on the latent continuous scale is replaced by the average rank within a category, then this assumption may be more plausible. Also, in the simulation studies where we replaced the actual rank on the continuous scale by the average rank within a category (see Appendix Table 1), the type I error and coverage were preserved.
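A minimal sketch of replacing ranks by the average rank within a category, and then converting to probit scores, is given below. The function names and the example categories are illustrative assumptions; the probit score uses the rank / (n + 1) convention of Theorem 1 in the Appendix.

```python
from statistics import NormalDist


def midranks(values):
    """Assign each observation the average rank of its tied group."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the block while the next sorted value is tied with the first
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


def probit_scores(values):
    """Phi^{-1}(midrank / (n + 1)) for possibly tied (categorical) data."""
    n = len(values)
    nd = NormalDist()
    return [nd.inv_cdf(r / (n + 1)) for r in midranks(values)]


categories = [0, 2, 1, 0, 2, 2, 1, 0]  # hypothetical ordinal categories
ranks = midranks(categories)
scores = probit_scores(categories)
print(ranks)
print([round(s, 3) for s in scores])
```

All observations in the same category receive the same probit score, so the scores take as many distinct values as there are occupied categories.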

We also assessed the validity of the estimation procedure in (13) and (14) when H_X and H_Y are each univariate normal, but (H_X, H_Y) is not bivariate normal. The clustered KS copula in (21) was introduced for this purpose. Simulation studies indicate that bias and coverage probability associated with the estimators of ρ_xy,s and $ρ_{x y, s}^{*}$ in (13) and (14) are acceptable in this setting for n ≥ 50 (see Appendix Table 2). We also proposed a diagnostic procedure to test the assumption of bivariate normality of replicate probit scores which was applied and satisfied for both of the examples in this paper.

Acknowledgments

This work was supported by the National Eye Institute, R01 EY022445. The authors would also like to acknowledge the programming support of Marion McPhee.

Appendix – Proofs

Theorem 1

Let X be a continuous random variable and let {x₁,…,x_n} be an i.i.d. sample from X. Define ${\hat{H}}_{x, n} = Φ^{- 1} [R_{X, n}^{(x)} / (n + 1)]$ , where Φ⁻¹ is the inverse normal distribution, $R_{X, n}^{(x)} = 1 + \sum_{i = 1}^{n} U (x - x_{i})$ , and U(a) = 1 if a > 0, = 0 otherwise. If H_x = Φ⁻¹[F_X(x)] where F_X is the c.d.f. of X, then Ĥ_x,n converges in probability to H_x as n → ∞.

Proof

From (A.110), p. 346 of Lehmann [19] if E (Ĥ_x,n – H_x)² → 0 as n → ∞, then Ĥ_x,n converges in probability to H_x as n → ∞.

Let P_x = Φ(H_x) = F_X(x) and ${\hat{P}}_{x, n} = Φ ({\hat{H}}_{x, n}) = R_{X, n}^{(x)} / (n + 1)$

We have that

E {({\hat{H}}_{x, n} - H_{x})}^{2} = {[E ({\hat{H}}_{x, n}) - H_{x}]}^{2} + var ({\hat{H}}_{x, n})

(A1)

To assess the first component of (A1) we use a Taylor series expansion of Ĥ_x,n about H_x, yielding

{\hat{H}}_{x, n} = H_{x} + \sqrt{2 π} \exp (H_{x}^{2} / 2) ({\hat{P}}_{x, n} - P_{x}) + π H_{x} \exp (H_{x}^{2}) {({\hat{P}}_{x, n} - P_{x})}^{2} + \sum_{r = 3}^{\infty} a_{r} {({\hat{P}}_{x, n} - P_{x})}^{r}

(A2)

where |a_r| < ∞, r ≥ 3. If we take expectations of both sides of (A2) we obtain:

E ({\hat{H}}_{x, n}) = H_{x} + \sqrt{2 π} \exp (H_{x}^{2} / 2) E ({\hat{P}}_{x, n} - P_{x}) + π H_{x} \exp (H_{x}^{2}) E {({\hat{P}}_{x, n} - P_{x})}^{2} + \sum_{r = 3}^{\infty} a_{r} E {({\hat{P}}_{x, n} - P_{x})}^{r}

(A3)

Since E(P̂_x,n – P_x)^q → 0 as n → ∞ for q ≥ 1, it follows that E(Ĥ_x,n) – H_x → 0 as n → ∞.
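For reference, the first- and second-order coefficients in (A2) can be recovered by differentiating H_x = Φ⁻¹(P_x) with respect to P_x. The short derivation below is a supplementary sketch; φ denotes the standard normal density, a symbol that is not used in the original text.

```latex
\begin{aligned}
\phi(h) &= (2\pi)^{-1/2} \exp(-h^{2}/2), \qquad P_{x} = \Phi(H_{x}) \\
\frac{d H_{x}}{d P_{x}} &= \frac{1}{\phi(H_{x})} = \sqrt{2\pi}\, \exp(H_{x}^{2}/2) \\
\frac{d^{2} H_{x}}{d P_{x}^{2}} &= -\frac{\phi'(H_{x})}{\phi(H_{x})^{3}} = \frac{H_{x}}{\phi(H_{x})^{2}} = 2\pi H_{x} \exp(H_{x}^{2}) \\
\frac{1}{2}\, \frac{d^{2} H_{x}}{d P_{x}^{2}} &= \pi H_{x} \exp(H_{x}^{2})
\end{aligned}
```

The second line gives the coefficient of (P̂_x,n − P_x) and the last line gives the coefficient of (P̂_x,n − P_x)² in (A2); the square of the first derivative, 2π exp(H_x²), is the delta-method multiplier of var(P̂_x,n).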

To assess the second component of (A1) we use the delta method, whereby

\begin{array}{l} var ({\hat{H}}_{x, n}) = 2 π \exp (H_{x}^{2}) var ({\hat{P}}_{x, n}) + O (1 / n) \\ = 2 π \exp (H_{x}^{2}) n P_{x} (1 - P_{x}) / {(n + 1)}^{2} + O (1 / n) \to 0 \text{ as } n \to \infty \end{array}

(A4)

Thus, based on (A1), (A3) and (A4), E(Ĥ_x,n – H_x)² → 0 as n → ∞, whereby Ĥ_x,n converges in probability to H_x as n → ∞.
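Theorem 1 can also be checked numerically. The sketch below evaluates Ĥ_x,n at a fixed point x for samples of increasing size from an exponential distribution with rate 1, for which F_X(x) = 1 − exp(−x) is known. The distribution, the point x = 1, the sample sizes and the seed are arbitrary illustrative choices.

```python
import math
import random
from statistics import NormalDist


def probit_score_at(x, sample):
    """H-hat_{x,n} = Phi^{-1}[R / (n + 1)], with R = 1 + #{i : x - x_i > 0}."""
    n = len(sample)
    rank = 1 + sum(1 for xi in sample if x - xi > 0)
    return NormalDist().inv_cdf(rank / (n + 1))


rng = random.Random(0)  # arbitrary seed
x = 1.0
true_h = NormalDist().inv_cdf(1.0 - math.exp(-x))  # H_x = Phi^{-1}[F_X(x)]

errors = {}
for n in (100, 1000, 20000):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    errors[n] = abs(probit_score_at(x, sample) - true_h)

for n, err in errors.items():
    print(n, round(err, 4))
```

The absolute error for a single sample fluctuates, but its typical size shrinks at roughly the rate 1/√n implied by (A4).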

Corollary 1

Let Ĥ_x,n be defined as in Theorem 1, H_X = {H_x, x ∈ X} and Ĥ_X,n = {Ĥ_x,n, x ∈ X}. Then Ĥ_X,n converges in law to a N(0,1) distribution as n → ∞.

Proof

We have that

lim_{n \to \infty} Pr ({\hat{H}}_{X, n} \leq z) = Pr (H_{X} \leq z) = Φ (z) .

Hence, Ĥ_X,n is asymptotically normal as n → ∞.

Theorem 2

Let Ĥ_X,n and Ĥ_Y,n be defined as in Corollary 1. Let H_Z = (H_X, H_Y), ρ_XY = cov(H_X, H_Y) and Ĥ_Z,n = (Ĥ_X,n, Ĥ_Y,n). If H_Z is bivariate normal, then Ĥ_Z,n is asymptotically bivariate normal as n → ∞.

Proof

From Theorem 2.6.2, p. 37 of Anderson [1], H_Z is bivariate normal if and only if aH_X + bH_Y is univariate normal for all a, b, where min(|a|,|b|) > 0. We have from Corollary 1 that aĤ_X,n + bĤ_Y,n converges in law to aH_X + bH_Y as n → ∞. Thus, since var(H_X) = var(H_Y) = 1 and cov(H_X,H_Y) = ρ_XY then

lim_{n \to \infty} Pr (a {\hat{H}}_{X, n} + b {\hat{H}}_{Y, n} \leq z) = Φ (z / \sqrt{a^{2} + b^{2} + 2 a b ρ_{X Y}})

Therefore, aĤ_X,n + bĤ_Y,n is asymptotically univariate normal for all a, b. It follows that Ĥ_Z,n is asymptotically bivariate normal as n → ∞.
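The limiting variance a² + b² + 2abρ_XY used in this proof can be illustrated by simulation. In the sketch below, (H_X, H_Y) is generated as a bivariate normal pair, the observed X and Y are monotone transformations of H_X and H_Y, and the probit scores are recomputed from ranks alone. The values of a, b, ρ_XY, n, the transformations and the seed are arbitrary assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def probit_scores(values):
    """Phi^{-1}(rank / (n + 1)) for a sample without ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    nd = NormalDist()
    scores = [0.0] * n
    for position, index in enumerate(order):
        scores[index] = nd.inv_cdf((position + 1) / (n + 1))
    return scores


def sample_variance(v):
    mean = sum(v) / len(v)
    return sum((t - mean) ** 2 for t in v) / (len(v) - 1)


rng = random.Random(42)  # arbitrary seed
rho, a, b, n = 0.6, 1.0, 2.0, 5000

x, y = [], []
for _ in range(n):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    h_x = z1
    h_y = rho * z1 + math.sqrt(1 - rho ** 2) * z2  # cov(h_x, h_y) = rho
    x.append(math.exp(h_x))  # monotone transform: X is lognormal
    y.append(h_y ** 3)       # monotone transform: Y is heavy tailed

combo = [a * p + b * q for p, q in zip(probit_scores(x), probit_scores(y))]
observed = sample_variance(combo)
expected = a ** 2 + b ** 2 + 2 * a * b * rho
print(round(observed, 3), round(expected, 3))
```

Because the probit scores depend on the data only through ranks, the result is unchanged by the monotone transformations applied to X and Y.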

Theorem 3

Let X₁,…,X_L be continuous random variables and define H_{X_l}(x) = Φ⁻¹[F_{X_l}(x)], l = 1,…,L. Let ρ_{l₁l₂} = cov(H_{X_{l₁}}, H_{X_{l₂}}), l₁ ≠ l₂ = 1,…,L. Define Ĥ_{X_l,n}(x) = Φ⁻¹[R_{X_l,n}(x)/(n+1)], where $R_{X_{l}, n} (x) = 1 + \sum_{i = 1}^{n} U (x - x_{i l})$ and x_il = value of X_l for the i-th subject, i = 1,…,n; l = 1,…,L.

If H_X ≡ (H_{X_1},…,H_{X_L}) is multivariate normal, then Ĥ_{X,n} ≡ (Ĥ_{X_1,n},…,Ĥ_{X_L,n}) will be asymptotically multivariate normal as n → ∞.

Proof

From Theorem 2.6.2, p. 37 of Anderson [1], H_X will be multivariate normal if and only if $Z = \sum_{l = 1}^{L} a_{l} H_{X_{l}}$ is univariate normal for all a = (a₁,…,a_L) where min(|a_l|, l = 1,…,L) > 0. We have from Corollary 1 that ${\hat{Z}}_{n} = \sum_{l = 1}^{L} a_{l} {\hat{H}}_{X_{l}, n}$ converges in law to Z as n → ∞. Thus, since var(H_{X_l}) = 1, l = 1,…,L, and cov(H_{X_{l₁}}, H_{X_{l₂}}) = ρ_{l₁l₂}, l₁ ≠ l₂ = 1,…,L, then $lim_{n \to \infty} Pr ({\hat{Z}}_{n} \leq z) = Φ (z / \sqrt{\sum_{l = 1}^{L} a_{l}^{2} + \sum_{l_{1} \neq l_{2} = 1}^{L} a_{l_{1}} a_{l_{2}} ρ_{l_{1} l_{2}}})$. Therefore, Ẑ_n is asymptotically univariate normal for all a where min(|a_l|, l = 1,…,L) > 0. It follows that Ĥ_{X,n} is asymptotically multivariate normal as n → ∞.

Appendix Table 1.

Simulation Results for Rank Correlation Based on Discrete Data, MLE method

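| ρ_xy | ρ_{xy}^{*} | ρ_xx (ρ_yy) | n | ρ̂_xy | Type I error | coverage probability (%) | power |
|---|---|---|---|---|---|---|---|
|  |  | 0.5 |  | 0.028 | 0.055 | 94.7 | --- |
|  |  | 0.5 | 100 | 0.010 | 0.054 | 94.8 | --- |
| 0.2 | 0.1 | 0.5 |  | 0.194 | --- | 95.9 | 52.4 |
| 0.2 | 0.1 | 0.5 | 100 | 0.184 | --- | 94.6 | 76.8 |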

Computer runs:

:/proj/stross/stros0a/rankcorrelation/simulation.3rep.4000.n100.5.probit.sas 9/15/16
:/proj/stross/stros0a/rankcorrelation/simulation2.3rep.4000.n100.5.probit.sas 9/15/16
:/proj/stross/stros0a/rankcorrelation/simulation.3rep.4000.n50.5.probit.sas 9/15/16
:/proj/stross/stros0a/rankcorrelation/simulation2.3rep.4000.n50.5.probit.sas 9/15/16

Appendix Table 2.

Bias and coverage probability of (ρ̂_xy,s, ${\hat{ρ}}_{xy, s}^{*}$) based on (13) and (14) when the data are generated from the clustered KS* copula in (21)

|  | n | parameter | true value** | median estimated value⁺ | bias | coverage probability (%) |
|---|---|---|---|---|---|---|
| 0.5 |  | ρ_xy,s | 0.497 | 0.488 | −0.009 | 96.6 |
|  |  | ρ_{xy, s}^{*} | 0.328 | 0.327 | −0.001 | 94.4 |
|  | 100 | ρ_xy,s | 0.497 | 0.491 | −0.006 | 96.6 |
|  |  | ρ_{xy, s}^{*} | 0.328 | 0.325 | −0.003 | 94.1 |
| 1.0 |  | ρ_xy,s | 0.497 | 0.486 | −0.011 | 96.4 |
|  |  | ρ_{xy, s}^{*} | 0.246 |  |  | 92.1 |
|  | 100 | ρ_xy,s | 0.497 | 0.491 | −0.006 | 96.7 |
|  |  | ρ_{xy, s}^{*} | 0.246 | 0.244 | −0.002 | 91.4 |

* Kimeldorf-Sampson

** Based on simulating 100,000 random vectors (X₁, Y₁, X₂, Y₂) from the clustered KS copula in (21) and estimating ρ_xy,s from mean [Spearman correlation (X₁, Y₁) + Spearman correlation (X₂, Y₂)] and $ρ_{xy, s}^{*}$ from mean [Spearman correlation (X₁, Y₂) + Spearman correlation (X₂, Y₁)].

⁺ Based on 4,000 simulated datasets consisting of n (X₁, Y₁, X₂, Y₂) vectors from the clustered KS copula in (21).

Computer runs:

:/proj/stross/stros0a/rankcorrelation/simulation.new.n100000.sas 9/29/16
:/proj/stross/stros0a/rankcorrelation/simulation.new2.n100000.sas 10/13/16
:/proj/stross/stros0a/rankcorrelation/simulation.new.4000.n50.sas 10/13/16
:/proj/stross/stros0a/rankcorrelation/simulation.new.4000.n100.sas 10/13/16
:/proj/stross/stros0a/rankcorrelation/simulation.new2.4000.n50.sas 10/14/16
:/proj/stross/stros0a/rankcorrelation/simulation.new2.4000.n100.sas 10/14/16

References

1. Anderson TW. An introduction to multivariate statistical analysis. Wiley; New York: 1958.
2. Jennrich RI, Schluchter MD. Unbalanced repeated-measures models with structured covariance matrices. Biometrics. 1986;42:805–820.
3. Group ETDRS. Early Therapy Diabetic Retinopathy Study (ETDRS) manual of operations. University of Maryland; City: 1985.
4. Pearson K. Mathematical Contributions to the Theory of Evolution. On further methods of determining correlation. Cambridge University Press; Cambridge, UK: 1907.
5. Rosner B, Glynn RJ. Interval estimation for rank correlation coefficients based on the probit transformation with extension to measurement error correction of correlated ranked data. Stat Med. 2007;26:633–646. doi: 10.1002/sim.2547.
6. Moran PA. Rank correlation and product-moment correlation. Biometrika. 1948;35:203–206.
7. Berson EL, Sandberg MA, Rosner B, Birch DG, Hanson AH. Natural course of retinitis pigmentosa over a three-year interval. Am J Ophthalmol. 1985;99:240–251. doi: 10.1016/0002-9394(85)90351-4.
8. Rubin DB. Multiple imputation for non-response in surveys. John Wiley and Sons; New York: 1987.
9. Kimeldorf G, Sampson AR. Uniform representation of bivariate distributions. Comm Statist. 1975;4:617–627.
10. Joe H. Multivariate models and dependence concepts. Chapman & Hall; London; New York: 1997.
11. Berson EL, Rosner B, Sandberg MA, Weigel-DiFranco C, Moser A, Brockhurst RJ, Hayes KC, Johnson CA, Anderson EJ, Gaudio AR, Willett WC, Schaefer EJ. Clinical Trial of docosahexaenoic acid in patients with retinitis pigmentosa receiving vitamin A treatment. Archives of Ophthalmology. 2004;122:1297–1305. doi: 10.1001/archopht.122.9.1297.
12. Gregori NZ, Feuer W, Rosenfeld PJ. Novel method for analyzing snellen visual acuity measurements. Retina. 2010;30:1046–1050. doi: 10.1097/IAE.0b013e3181d87e04.
13. Dubach UC, Levy PS, Muller A. Relationships Between Regular Analgesic Intake and Urorenal Disorders in a Working Female Population of Switzerland I. Initial Results (1968). American Journal of Epidemiology. 1971;93:425–434. doi: 10.1093/oxfordjournals.aje.a121276.
14. Proudfoot AT, Krenzelok EP, Brent J, Vale JA. Does urine alkalinization increase salicylate elimination? If so, why? Toxicol Rev. 2003;22:129–136. doi: 10.2165/00139709-200322030-00001.
15. Duthie GG, Kyle JA, Jenkinson AM, Duthie SJ, Baxter GJ, Paterson JR. Increased salicylate concentrations in urine of human volunteers after consumption of cranberry juice. J Agric Food Chem. 2005;53:2897–2900. doi: 10.1021/jf040393b.
16. Bland JM, Altman DG. Calculating correlation coefficients with repeated observations: Part 2--Correlation between subjects. British Medical Journal. 1995;310:633. doi: 10.1136/bmj.310.6980.633.
17. Lorenz DJ, Datta S, Harkema SJ. Marginal association measures for clustered data. Stat Med. 2011;30:3181–3191. doi: 10.1002/sim.4368.
18. Bland JM, Altman DG. Calculating correlation coefficients with repeated observations: Part 1--Correlation within subjects. British Medical Journal. 1995;310:446. doi: 10.1136/bmj.310.6977.446.
19. Lehmann EL. Nonparametrics: Statistical Methods Based on Ranks. Springer; New York: 2006. (revised edn)
