A procedure to detect general association based on concentration of ranks

Pratyaydipta Rudra; Yihui Zhou; Fred A Wright

doi:10.1002/sta4.138

Author manuscript; available in PMC: 2018 Feb 16.

Published in final edited form as: Stat (Int Stat Inst). 2017 Feb 16;6(1):88–101. doi: 10.1002/sta4.138

PMCID: PMC5616165 NIHMSID: NIHMS851391 PMID: 28966789

Abstract

In modern high-throughput applications, it is important to identify pairwise associations between variables, and desirable to use methods that are powerful and sensitive to a variety of association relationships. We describe RankCover, a new non-parametric test of association between two variables that measures the concentration of paired ranked points. Here ‘concentration’ is quantified using a disk-covering statistic similar to those employed in spatial data analysis. Considerations from the theory of Boolean coverage processes provide motivation, as well as an R²-like quantity to summarize strength of association. Analysis of simulated and real datasets demonstrates that the method is robust and often powerful in comparison to competing general association tests.

Keywords: Nonparametric Methods, Simulation, Spatial Statistics

1. Introduction

The need for statistical methods to identify general pairwise association is increasingly recognized, as evidenced by recent attention to methods such as distance correlation (dCor) (Székely et al., 2007; Székely & Rizzo, 2009), Maximal Information Coefficient (MIC) (Reshef et al., 2011), and the Heller-Heller-Gorfine (HHG) method (Heller et al., 2013). The term general association refers to any departure from independence among random variables, and methods differ in the types of departures to which they are sensitive. The need for general association tests is perhaps greatest for analysis of large datasets, for which discovery-based approaches are needed, without prior hypotheses regarding the form or structure of dependence. In addition to the need to test dependence among pairs of variables as a primary analysis, dependencies can invalidate inference for downstream methods that require independence among input variables (Albert et al., 2001).

Standard parametric and non-parametric tests of association, such as linear trend testing (Mann, 1945; Kendall, 1975; Cuzick, 1985; Hamed & Ramachandra Rao, 1998), are sensitive to only specific alternatives, while classical tests of association (Wilks, 1935; Puri & Sen, 1971) are not distribution free. Recent work has tried to capture association through a generalized correlation coefficient that can detect numerous forms of relationships. MIC (Reshef et al., 2011) and dCor (Székely et al., 2007; Székely & Rizzo, 2009) are two recently introduced measures of general association. HHG (Heller et al., 2013) has been shown to be powerful against many alternatives, and also shown to be consistent, in the sense of having increasing power with sample size n, against all dependent alternatives. However, its performance in small to moderate samples against varying alternatives has not been studied.

Simon & Tibshirani (2014) demonstrated that the distance correlation (dCor) is more powerful than MIC in almost all of a variety of simulated associations. However, a potential weakness of dCor is that it is not powerful to detect non-monotone relationships, such as a circle (de Siqueira Santos et al., 2013), and the detection of such relationships is a primary motivation to develop methods for general association. To gain insight, we note that dCor is motivated by consideration of distances of the empirical characteristic function under the null vs. under the alternative. For observed data, the dCor statistic is the Pearson correlation of distances (after some adjustments) between all pairs of samples. For an observed random sample (X, Y) = {(X_k, Y_k) : k = 1, 2, …, n}, the distances between pairs of samples are defined as a_kl = |X_k − X_l| and b_kl = |Y_k − Y_l|; k, l = 1, 2, …, n. The approach is intuitively sensible when the relationship is monotone, as sample pairs that are close on the x-axis should also be close on the y-axis. However, for non-monotone relationships, pairs of points that are close on the x-axis can be quite distant on the y-axis (see Supplementary Article Section S7).
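To make the "correlation of distances" description concrete, the following is a minimal sketch of the sample distance correlation for two univariate samples. It assumes the usual double-centering of the distance matrices a_kl and b_kl as the "adjustment" mentioned above; the function names are illustrative and are not taken from any package.

```python
import math


def _centered_distances(values):
    """Pairwise absolute distances, double-centered by row, column and grand means."""
    n = len(values)
    d = [[abs(values[k] - values[l]) for l in range(n)] for k in range(n)]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [d[k][l] - row_means[k] - row_means[l] + grand_mean for l in range(n)]
        for k in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation of two equally long numeric sequences."""
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)

    def mean_product(u, v):
        return sum(u[k][l] * v[k][l] for k in range(n) for l in range(n)) / (n * n)

    dcov2 = mean_product(a, b)          # squared sample distance covariance
    denom = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if denom == 0:
        return 0.0                      # a constant sample carries no dependence
    return math.sqrt(max(dcov2, 0.0) / denom)
```

For an exactly linear relationship the two centered distance matrices are proportional, so the sketch returns a value of 1 up to rounding.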

1.1. A new motivation

Another way to approach the general association problem is to consider spatial randomness of points (X_k, Y_k), and the proposed tests of general association attempt to be sensitive to alternatives in which the points are clustered. A class of testing procedures sensitive to local clustering has been devised in spatial statistics (Clark & Evans, 1954; Holgate, 1965a,b; Ripley, 1979; Smith, 2004), including the F function of Diggle (1983), which is based on nearest neighbor distances. We propose RankCover, which quantifies the concentration of (X, Y) values by measuring the area covered by laying disks of a fixed radius over each point in the scatter plot of the ranks of the two variables. Novel aspects of RankCover include careful consideration of the use of ranks, testing by permutation, and the choice of disk size. We demonstrate that RankCover is robust in the sense that it has power against a variety of alternatives, regardless of the marginal distributions of the two variables. Moreover, RankCover has favorable power in comparison to dCor, MIC and HHG, and is especially useful for detecting oscillating relationships. A novel R²-like quantity provides an overall measure of association. Simulations demonstrate that RankCover and dCor are in some sense complementary, and a hybrid of the two methods is robust and powerful for most association types of interest.

2. Methods

2.1. A spatial viewpoint

For original paired vectors A and B (so denoted in order to be distinguished from the ranks) we assume no ties and RankCover starts by computing X = rank(A), Y = rank(B). The use of ranks simplifies the problem by placing the intervals between successive values on a common scale. In addition, the null distribution for ranks depends on only the sample size n. Thus the only computation lies in computing the observed statistic, while the null distribution can be pre-computed and is applicable to any dataset of size n.
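As a small illustration of the rank step, the sketch below converts a vector without ties to ranks 1, …, n. The helper name `ranks` is hypothetical; it simply raises an error when ties are present, in line with the no-ties assumption stated above.

```python
def ranks(values):
    """Ranks 1..n of a sequence, assuming (and checking) that there are no ties."""
    if len(set(values)) != len(values):
        raise ValueError("ties are not handled in this sketch")
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for position, k in enumerate(order, start=1):
        result[k] = position
    return result
```

Because ranks are unchanged by strictly increasing transformations of A or B, any statistic computed from (X, Y) is unaffected by the marginal distributions, which is why a single null distribution per sample size n suffices.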

Diggle’s F(δ) function, as introduced by Diggle (1983), is the distribution function of the distance between a randomly chosen point in a region and the nearest observed point (X_k, Y_k). To empirically estimate F(δ), the investigator conceptually lays disks of radius δ on each point (X_k, Y_k) and calculates the proportion of the surrounding region covered by the union of the disks (Figure 1). If X and Y are highly associated, the areas covered by the disks should be small, so RankCover rejects only in the left tail of the statistic described below.

Illustration of RankCover for sample size n = 50: A. Scatter plot of the two variables. B. Scatter plot on the rank scale. C. Disks laid on the scatter plot on the rank scale using Euclidean distance. D. Disks laid on the scatter plot on the rank scale using Manhattan distance.

Different distance metrics can be used for this purpose, and the shape of the disks depends on the choice of the distance metric. For instance, Euclidean distance leads to circular equidistance contours, resulting in circular disks, while the disks are diamond-shaped for Manhattan distance (Figure 1).

2.2. The test statistic

The empirical estimate of F can be obtained using the proportion of area covered by the disks. For ranks, we consider only the n × n grid of possible rank pairs, {1, 2, …, n} × {1, 2, …, n}, and whether each grid point is covered by at least one disk. Let (X_k, Y_k) denote the ranks of the kth sample pair, k = 1, 2, …, n.

Definition 2.1

Define d(i, j, X_k, Y_k ) = distance between the point (i, j) on the grid and (X_k, Y_k ); d_{i j} = min_k d(i, j, X_k, Y_k )

Using this definition, a natural statistic for fixed δ is

{\hat{F}}_{n} (δ) = \frac{1}{n^{2}} \sum_{i = 1}^{n} \sum_{j = 1}^{n} I (d_{i j} \leq δ),

where I(.) is the indicator function.
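The definition can be evaluated directly, if slowly, by looping over the grid. The following brute-force sketch computes the unmodified estimate (without any edge extension) for a user-supplied δ; the function names and the choice of Manhattan distance as the default are illustrative assumptions.

```python
import math


def manhattan(i, j, x, y):
    return abs(i - x) + abs(j - y)


def euclidean(i, j, x, y):
    return math.hypot(i - x, j - y)


def f_hat(x_ranks, y_ranks, delta, dist=manhattan):
    """Proportion of the n x n rank grid within distance delta of a sample point."""
    n = len(x_ranks)
    points = list(zip(x_ranks, y_ranks))
    covered = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            d_ij = min(dist(i, j, x, y) for x, y in points)  # nearest sample point
            if d_ij <= delta:
                covered += 1
    return covered / n ** 2
```

With n = 4, ranks lying on the diagonal and δ = 1 under Manhattan distance, 10 of the 16 grid points are covered, so the estimate is 0.625.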

The choice of disk size δ is an important consideration which has not been fully addressed in the spatial statistics literature. One approach might be to compute the entire empirical curve F̂_n(δ) and develop a new summary statistic to compare against the null curve. However, this approach makes the procedure prohibitively computationally expensive, and we propose (see Supplementary Article Sections S2 and S4 for details) using a fixed $δ = \sqrt{n}$ for Euclidean distance, with slight modification under Manhattan distance. In addition, we modify the statistic to account for edge effects of the grid, using an (n + ⌈δ⌉) × (n + ⌈δ⌉) grid extending beyond the range of the scatterplot. Here ⌈δ⌉ is the smallest integer greater than or equal to δ. Finally, our modified test statistic is

T_{n} (δ) = \frac{1}{n^{2}} \sum_{i = 1 - ⌈ δ ⌉}^{n + ⌈ δ ⌉} \sum_{j = 1 - ⌈ δ ⌉}^{n + ⌈ δ ⌉} I (d_{i j} \leq δ),

(2.1)

where the range of {i, j} reflects the outer boundaries of a larger region to account for edge effects. The null distribution of T_n depends entirely on n, so tables based on simulated null distributions can be precomputed for various sample sizes (See Supplementary Article Section S11).
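Because the null distribution depends only on n, it can be tabulated once by simulation and reused. The sketch below follows the index range printed in (2.1), uses the Euclidean choice δ = √n stated above, and draws random permutations of the ranks to build the null table. The function names, the number of simulations, and the add-one form of the left-tail p-value are assumptions made for illustration; the Manhattan modification of δ is described only in the supplement and is not reproduced here.

```python
import math
import random


def rankcover_statistic(x_ranks, y_ranks, delta=None):
    """Brute-force T_n(delta) over the extended grid of (2.1), Euclidean disks."""
    n = len(x_ranks)
    if delta is None:
        delta = math.sqrt(n)            # Euclidean choice given in the text
    pad = math.ceil(delta)
    points = list(zip(x_ranks, y_ranks))
    covered = 0
    for i in range(1 - pad, n + pad + 1):
        for j in range(1 - pad, n + pad + 1):
            if any(math.hypot(i - x, j - y) <= delta for x, y in points):
                covered += 1
    return covered / n ** 2


def simulated_null(n, n_sim, seed=0):
    """Null values of T_n from random permutations of the ranks 1..n."""
    rng = random.Random(seed)
    base = list(range(1, n + 1))
    null_stats = []
    for _ in range(n_sim):
        perm = base[:]
        rng.shuffle(perm)
        null_stats.append(rankcover_statistic(base, perm))
    return null_stats


def left_tail_p_value(observed, null_stats):
    """Left-tail p-value; the add-one correction is an assumption of this sketch."""
    hits = sum(t <= observed for t in null_stats)
    return (hits + 1) / (len(null_stats) + 1)
```

A table produced by `simulated_null(n, ...)` can be stored and reused for every dataset of the same size, since only the observed statistic changes from one dataset to the next.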

For the distance metric d, we consider here both Euclidean and Manhattan distances, for which later simulations show similar performance (see Supplementary Article Section S6). However, the Manhattan distance has advantages in approximating tail areas (see Supplementary Article Section S11). Therefore we recommend its use and present results using Manhattan distance.

2.3. Fast computation of the test statistic

The crude way to compute the test statistic is to calculate the distances of the n sample points from each of the (n + ⌈δ⌉)² points on the grid. Thus, the order of computation for our choice of δ is O(n³). Here we propose a method with complexity O(n²). The algorithm first calculates a (2⌈δ⌉ + 1) × (2⌈δ⌉ + 1) ‘prototype’ matrix of 1’s and 0’s that represents the shape of the disk. Then the prototype matrix is used to modify the appropriate window of rows and columns in the larger matrix, which may be thought of as efficiently placing a disk surrounding each of the sample points (Figure 2). For our choice of δ, the placement of each disk is an order n operation (as the number of elements in the prototype matrix is of order n), performed n times.
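The prototype-matrix idea can be sketched as follows. This is an illustrative reading of the description above rather than the authors' implementation: the coverage matrix here marks covered cells as True (the figure uses the opposite 0/1 coding), the grid indices follow the range printed in (2.1), and the function names are hypothetical.

```python
import math


def disk_prototype(delta, metric="manhattan"):
    """(2r+1) x (2r+1) Boolean stencil of the disk, where r = ceil(delta)."""
    r = math.ceil(delta)
    proto = []
    for di in range(-r, r + 1):
        row = []
        for dj in range(-r, r + 1):
            if metric == "manhattan":
                d = abs(di) + abs(dj)
            else:
                d = math.hypot(di, dj)
            row.append(d <= delta)
        proto.append(row)
    return proto


def rankcover_fast(x_ranks, y_ranks, delta, metric="manhattan"):
    """T_n(delta) by stamping the disk stencil around every sample point."""
    n = len(x_ranks)
    r = math.ceil(delta)
    size = n + 2 * r                    # grid coordinates 1-r .. n+r
    covered = [[False] * size for _ in range(size)]
    proto = disk_prototype(delta, metric)
    for x, y in zip(x_ranks, y_ranks):
        # Grid coordinate g maps to index g + r - 1, so the window centered
        # at rank x starts at index (x - r) + r - 1 = x - 1.
        top, left = x - 1, y - 1
        for a, proto_row in enumerate(proto):
            target = covered[top + a]
            for b, inside in enumerate(proto_row):
                if inside:
                    target[left + b] = True
    return sum(map(sum, covered)) / n ** 2
```

Each stamp touches (2⌈δ⌉ + 1)² cells, which is of order n when δ is of order √n, so the n stamps together cost O(n²).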

Showing the fast computation of RankCover: The (2⌈δ⌉ + 1) × (2⌈δ⌉ + 1) prototype matrix consists of 1’s (yellow) and 0’s (red). When the prototype matrix is multiplied with a larger matrix of 1’s at every observed data point, it creates a larger coverage matrix (right panel) with 0’s showing the points covered.

3. Lessons from Boolean coverage

The Boolean coverage process provides motivation for the RankCover statistic. We first consider a coverage process with iid random sets around a stationary Poisson process. For coverage of such a uniform process f with n circular disks of radius $δ = \frac{1}{\sqrt{n}}$ , Hall (1985) showed that for large n the expectation and variance of C, the proportion of coverage, approach

E_{f} (C) = 1 - e^{- π}

(3.1)

{Var}_{f} (C) = \frac{1}{n} π e^{- 2 π} (8 \int_{0}^{1} u \{e^{2 π J_{k} (u)} - 1\} d u - π) = \frac{1}{n} π e^{- 2 π} (8 \times 0.997216 - π),

(3.2)

where $J_{k} (u) = \frac{1}{π} (\frac{π}{2} - \sin^{- 1} (u) - \frac{1}{2} \sin (2 \sin^{- 1} u))$ . In addition, $\frac{C - E_{f} (C)}{\sqrt{{Var}_{f} (C)}} \to N (0, 1)$ . Hall (1985) also proved that under any alternative density h, the variance is still $O (\frac{1}{n})$ and the expectation is

E_{h} (C) = 1 - \int_{0}^{1} \int_{0}^{1} e^{- π h (x, y)} dxdy .

(3.3)

As Var_f (C) → 0, statistic $Z = \frac{C_{h} - E_{f} (C)}{\sqrt{{Var}_{f} (C)}}$ will have power approaching 1 provided E_h(C) < E_f (C).
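The null moments in (3.1) and (3.2) are easy to evaluate numerically. The sketch below computes the limiting expectation and approximates the integral in (3.2) with a midpoint rule; the step count and function names are arbitrary choices for illustration.

```python
import math


def j_k(u):
    """J_k(u) as defined after (3.2)."""
    s = math.asin(u)
    return (math.pi / 2 - s - 0.5 * math.sin(2 * s)) / math.pi


def variance_integral(steps=20000):
    """Midpoint-rule approximation of the integral of u * (exp(2*pi*J_k(u)) - 1) on [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for m in range(steps):
        u = (m + 0.5) * h
        total += u * (math.exp(2 * math.pi * j_k(u)) - 1.0)
    return total * h


def null_coverage_moments(n):
    """Large-n mean and variance of the coverage proportion C under uniformity."""
    mean = 1.0 - math.exp(-math.pi)
    var = math.pi * math.exp(-2 * math.pi) * (8 * variance_integral() - math.pi) / n
    return mean, var
```

The expectation evaluates to roughly 0.957, and the variance is roughly 0.028/n, so the null standard deviation of C is about 0.024 when n = 50.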

Proposition 3.1

For any density h ≢ 1 on (0, 1)², $\int_{0}^{1} \int_{0}^{1} e^{- π h (x, y)} dxdy > e^{- π}$ .

Proof

Define $g(z) = e^{- π z}$. By Jensen’s inequality,

E_{f} (g (h (X, Y))) > g (E_{f} (h (X, Y))),

since g is strictly convex. Now,

\begin{array}{l} E_{f} (g (h (X, Y))) = \int_{0}^{1} \int_{0}^{1} e^{- π h (x, y)} f (x, y) dxdy = \int_{0}^{1} \int_{0}^{1} e^{- π h (x, y)} dxdy . \\ g (E_{f} (h (X, Y))) = g (1) = e^{- π} . \end{array}
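Proposition 3.1 can be checked numerically for a specific alternative. The sketch below uses the Farlie-Gumbel-Morgenstern copula density h(x, y) = 1 + θ(1 − 2x)(1 − 2y) purely as an illustrative choice that does not appear in the text above, and approximates the double integral with a midpoint rule.

```python
import math


def vacancy_integral(density, steps=200):
    """Midpoint-rule approximation of the integral of exp(-pi * h(x, y)) over the unit square."""
    h = 1.0 / steps
    total = 0.0
    for a in range(steps):
        x = (a + 0.5) * h
        for b in range(steps):
            y = (b + 0.5) * h
            total += math.exp(-math.pi * density(x, y))
    return total * h * h


def fgm_density(theta):
    """FGM copula density with uniform marginals (illustrative alternative, |theta| <= 1)."""
    return lambda x, y: 1.0 + theta * (1.0 - 2.0 * x) * (1.0 - 2.0 * y)


def uniform_density(x, y):
    return 1.0
```

Under the uniform density the integral equals e^{−π}, while for any nonzero θ it is strictly larger, so the expected coverage in (3.3) falls below the null value in (3.1).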

We now draw the analogy to RankCover. We use (A_k, B_k) to denote the kth paired value in our data, and all pairs are assumed iid, following cdfs F (A) and G(B), respectively, with h and H denoting the joint density and distribution function of (F (A), G(B)). Note that (F (A_k ), G(B_k ))’s are similar to the Boolean coverage process with the marginals being U(0, 1). With F and G being unknown, the empirical distribution functions F_n and G_n could be used and the paired values (F_n(A_k ), G_n(B_k ))’s are equal to (X_k/n, Y_k/n). Therefore, if we rescale our RankCover set up from the n × n grid to a unit square, the resultant test statistic will be similar to the coverage C discussed above. The corresponding choice of δ will be $\frac{1}{\sqrt{n}}$ .

Let K_n denote, for each n, an index drawn randomly and uniformly from the sequence k = 1, …, n. Then the following is true as a direct consequence of the Glivenko-Cantelli theorem (Shorack & Wellner, 2009).

‖ (X_{K_{n}} / n, Y_{K_{n}} / n) - (F (A_{K_{n}}), G (B_{K_{n}})) ‖ \overset{a . s .}{\to} 0

This implies that the joint distribution function of (X_{K_n}/n, Y_{K_n}/n) → H. The full joint distribution of the ranks is complicated by dependencies inherent in ranks, and a fuller treatment will be given elsewhere. Here we note that the results additionally support the choice of $δ = \frac{1}{\sqrt{n}}$, suggesting that T_n(δ), on average, approaches a constant strictly between 0 and 1 (rather than approaching 0 or 1) as n → ∞. In addition, simulations support that the first two moments of T_n(δ) approach the Boolean coverage moments for large n (see Supplementary Article Section S3.3). Finally, the results suggest a natural R²-like quantity, described below.

3.1. A quantity similar to R² based on RankCover

Using the Boolean coverage null expectation as a point of reference, we define

R_{R C}^{2} = \frac{1 - e^{- π} - E_{h} (T_{n} (δ))}{1 - e^{- π}},

(3.4)

which may be interpreted as the proportional expected decrease in coverage due to the alternative h as compared to the maximum expected loss in coverage. The behavior of this population quantity for different alternatives is shown in Figure 3, noting that, due to the use of ranks, the alternatives are depicted according to the appropriate copulas. $R_{R C}^{2}$ is compared with the squared Pearson correlation and dCor measures for data drawn from those alternatives.

The bottom panel shows the RankCover R²-like coefficient for different alternatives. For the first two figures, the x-axis is the population correlation (Note that zero correlation does not imply independence for Student’s t). For the last two figures, x-axis is the standard deviation of Gaussian noise added to a quadratic and circular curve respectively. The average sample Pearson and distance correlations were computed for data generated from these alternatives.

The top panel shows the contour plots of the different copulas used to generate different types of the associations in the bottom panel.

The sample version of the coefficient uses the observed T_n(δ) and ensures that the result is non-negative,

R_{R C, sample}^{2} = \max \{0, \frac{1 - e^{- π} - T_{n} (δ)}{1 - e^{- π}}\},

(3.5)

and occupies the range [0, 1). For very small samples the expected null coverage can be computed via simulation, and further adjustment might use the fact that depending on n, T_n(δ) has a minimum possible value when the rank correlation is 1 or −1. However, $R_{R C, sample}^{2}$ as defined here is a ready and useful summary of the degree of association between the two variables.
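The sample coefficient in (3.5) is a one-line computation once T_n(δ) is available. A minimal sketch, with a hypothetical function name:

```python
import math

NULL_COVERAGE = 1.0 - math.exp(-math.pi)   # Boolean coverage null expectation, (3.1)


def r2_rankcover_sample(t_n):
    """Sample R^2-like coefficient of (3.5), truncated at zero."""
    return max(0.0, (NULL_COVERAGE - t_n) / NULL_COVERAGE)
```

For instance, an observed coverage of 0.5 corresponds to a coefficient of roughly 0.48, while any observed coverage at or above the null expectation gives 0.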

4. Results

We applied the RankCover method using the Manhattan metric to simulated and real datasets, following setups similar to Simon & Tibshirani (2014), investigating dCor, HHG, and MIC as competing approaches. The simulation results indicate that RankCover and dCor exhibit power for some complementary alternatives, and so we additionally propose a hybrid statistic using results from both. The hybrid method uses the minimum p-value from RankCover and rank-based dCor as a new statistic. In addition to simulated data, we illustrate all the approaches on several real datasets.

4.1. Simulation results

The composite nature of the alternative hypothesis makes it very difficult to analytically compute the power function of the general association testing methods. Following the simulation procedure used in Simon & Tibshirani (2014), we have simulated pairs of variables with several canonical dependency relationships (Figure 4) and with varying noise levels. In each scenario, the X values were simulated iid from a uniform distribution, while the noise distribution was Gaussian. However, the overall results were similar for other distributional forms (See Supplementary Article Section S9).

Showing the scatter plots for different relationships between the pair of variables (low noise level)

Figure 5 shows the power of the methods for various relationships, with varying noise levels, for sample size n = 50. Here the “noise level” is a scale quantity appropriate to each relationship form, following Simon & Tibshirani (2014) (See Supplementary Article Section S5). It is evident that RankCover performs better than MIC in all the situations we have considered. It is found to be more powerful than dCor and HHG in several cases while these methods are found to be more powerful in other cases. Even when dCor or HHG is more powerful, RankCover still has reasonable power to identify the association. Numerous illustrations provided in the supplementary article indicate that these observations hold true for varying sample sizes, levels of noise, and functional forms for the originating X and noise distributions.

Showing the power of different methods (type I error α = 0.05) against different relationships at varying noise levels.

A careful look into the results indicates that dCor is more powerful than RankCover when the type of association is monotone. When the relationship is non-monotone, dCor is typically not as powerful. We attribute this behavior to the fact that dCor is less sensitive to non-monotone relationships for the reasons mentioned earlier. We have also shown that with monotone relationships Spearman’s rank correlation is as powerful as dCor (see Supplementary Article Section S7). Therefore, one might simply use Spearman’s rank correlation if there is prior knowledge that the relationship is monotone. On the other hand, RankCover is more sensitive to local clustering of points than to trends. Thus, it is powerful against even non-monotone relationships like cubic, circular or the “X” relationship.

These observations motivate the use of a hybrid method utilizing both RankCover and dCor, as the two methods appear powerful in different situations. Formally, a new statistic is defined as s_{hybrid} = min(p_dCor, p_RankCover), where p_RankCover is the p-value obtained by using RankCover, and p_dCor is that using dCor on (rank(x), rank(y)). The p-value for the hybrid method is p_{hybrid} = P(S_{hybrid} ≤ s_{hybrid}). As with RankCover, the p-value can be obtained by using pre-computed simulations. The hybrid method, as expected, is always less powerful than the most powerful statistic for each scenario, but seems to be robust against all forms of association investigated.
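One way to calibrate the minimum of two p-values against simulated null data is sketched below. It assumes that the RankCover and rank-based dCor statistics have already been computed on the same set of simulated null datasets (index-aligned lists), that RankCover rejects for small values and dCor for large values, and that add-one corrections are acceptable; the function names and these conventions are illustrative assumptions rather than the authors' exact procedure.

```python
def tail_fraction(value, reference, smaller_is_extreme, add_one):
    """Fraction of reference values at least as extreme as `value`."""
    if smaller_is_extreme:
        hits = sum(r <= value for r in reference)
    else:
        hits = sum(r >= value for r in reference)
    if add_one:
        return (hits + 1) / (len(reference) + 1)
    return hits / len(reference)


def hybrid_p_value(obs_rankcover, obs_dcor, null_rankcover, null_dcor):
    """p-value of the min-p statistic, calibrated on paired null statistics."""

    def min_p(t_rc, t_dc, add_one):
        p_rc = tail_fraction(t_rc, null_rankcover, True, add_one)   # left tail
        p_dc = tail_fraction(t_dc, null_dcor, False, add_one)       # right tail
        return min(p_rc, p_dc)

    s_obs = min_p(obs_rankcover, obs_dcor, add_one=True)
    # Each null draw is scored against the same null tables (it counts itself).
    s_null = [min_p(rc, dc, add_one=False)
              for rc, dc in zip(null_rankcover, null_dcor)]
    hits = sum(s <= s_obs for s in s_null)
    return (hits + 1) / (len(s_null) + 1)
```

This sketch costs O(B²) for B null draws because every null draw is re-scored against the full tables; sorting the tables once would reduce the cost but is omitted for brevity.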

The HHG method also appears to be relatively robust. However, the ability of RankCover and the hybrid method to detect periodic relationships and non-functional relationships makes them very useful against such alternatives. The fact that RankCover is especially powerful against periodic relationships will be reinforced by the results in Section 4.2.3 and Section 4.2.4.

We summarize by emphasizing that RankCover and the hybrid method are powerful and robust in comparison to competing methods, and that these simulations cover a large range of relationships and noise levels. The broad conclusions are also not very sensitive to the marginal distributions of X and the error distributions, the sample size or the choice of distance metric (See Supplementary Article Section S6, S8 and S9).

4.2. Real data

4.2.1. Example 1: Eckerle4 data

We show data from a study of circular interference transmittance (Eckerle, 1979) from the NIST Statistical Reference Datasets for non-linear regression. The data were analyzed by Székely & Rizzo (2009) to illustrate dCor, and contain 35 observations on the predictor variable wavelength and the response variable transmittance.

Figure 6 shows the scatter plot of the predictor and the response along with the fitted curve (NIST StRD for non-linear regression) based on the model

y = \frac{β_{1}}{β_{2}} \exp \{- \frac{{(x - β_{3})}^{2}}{2 β_{2}^{2}}\} + ε,

where β₁, β₂ > 0, β₃ ∈ ℝ and ε is random Gaussian noise.

Showing the scatter plot and the fitted curve for the Eckerle4 dataset

From the plot, it is evident that there is a very strong non-linear relationship between the two variables. For dCor, p = 0.02072, while MIC and HHG have p-values < 10⁻⁵. The RankCover method and the hybrid method are also highly significant, with p < 10⁻⁵ (The p-values are truncated since 100000 permutations were used).

4.2.2. Example 2: Aircraft data

We have explored the Saviotti aircraft data (Saviotti, 1996) which was also analyzed by Székely & Rizzo (2009). We consider the wing span (m) vs. speed (km/h) (n = 230, Bowman & Azzalini (1997)). Figure 7 shows the scatter plot of the two variables, alongside non-parametric density estimate contours (log scale). It is clear from the plot that there is a non-linear relationship (Pearson’s product moment correlation is a modest 0.0168, p-value= 0.8001), although the relationship is complicated and apparently not monotone.

Showing the scatter plot and the density estimate contours for the aircraft speed and wing span

All of the methods described here were significant at α = 0.05. The p-values for dCor, MIC, and HHG were 0.00013, 0.00004, and < 10⁻⁵, respectively. For both RankCover and the hybrid method the test was significant with p < 10⁻⁵.

4.2.3. Example 3: ENSO data

The ENSO data (also taken from the NIST Statistical Reference Datasets for non-linear regression) consist of monthly average atmospheric pressure differences between Easter Island and Darwin, Australia (Kahaner et al., 1989), with 168 observations. The data form a time series and have different cyclical components, which were modeled (NIST StRD for non-linear regression) by the proposed model

y = β_{1} + β_{2} \cos (\frac{2 π x}{12}) + β_{3} \sin (\frac{2 π x}{12}) + β_{5} \cos (\frac{2 π x}{β_{4}}) + β_{6} \sin (\frac{2 π x}{β_{4}}) + β_{8} \cos (\frac{2 π x}{β_{7}}) + β_{9} \sin (\frac{2 π x}{β_{7}}) + ε,

where β₁, β₂, …, β₉ ∈ ℝ and ε is random Gaussian noise.

Figure 8 shows the scatter plot of the data along with the fitted curve. The cyclical fluctuations are evident, but no linear trend is observed. Thus, the Pearsonian correlation (0.0843) fails to capture the pattern. However a simple serial correlation with lag 1 (0.6102) reveals the association. With 100,000 simulations, the RankCover test is significant with p-value 0.00084. The hybrid test and MIC test are also significant with p-values 0.00165 and 0.00027 respectively. However dCor and HHG fail to detect significant association (p-values 0.13521 and 0.07617, respectively).
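The contrast between an ordinary correlation with time and a lag-1 serial correlation can be reproduced on a synthetic periodic series. The sketch below assumes that the lag-1 serial correlation is the Pearson correlation between the series and its one-step shift (the exact estimator used for the reported value is not stated), and the sine series is a stand-in, not the ENSO data.

```python
import math


def pearson(u, v):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def lag1_serial_correlation(series):
    """Pearson correlation between the series and its lag-1 shift."""
    return pearson(series[:-1], series[1:])


# Synthetic 12-month cycle observed for 168 months (illustrative stand-in).
months = list(range(1, 169))
cycle = [math.sin(2 * math.pi * t / 12) for t in months]
linear_corr = pearson(months, cycle)          # close to zero: no linear trend
serial_corr = lag1_serial_correlation(cycle)  # large: neighboring months agree
```

A purely cyclical series therefore shows almost no linear association with time while its successive values remain strongly correlated, which matches the qualitative pattern described for the ENSO data.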

Showing the scatter plot and the fitted curve for the ENSO dataset

4.2.4. Example 4: Yeast data

In this example, we analyze a yeast cell cycle gene expression dataset with 6223 genes (Spellman et al., 1998). The experiment was designed to identify genes with activity varying throughout the cell cycle (Spellman et al., 1998), and thus transcript levels would be expected to oscillate. These data have been analyzed by many researchers, including Reshef et al. (2011), who used them to verify the ability of MIC to detect oscillating patterns. We have run dCor, MIC, HHG, RankCover and the hybrid method on the data and used the Benjamini-Hochberg method to control the false discovery rate.

We have listed the genes identified by different methods after controlling the false discovery rate (FDR) at the 5% level and compared them with the list of genes identified by Spellman et al. (1998). Of all the genes identified by Spellman et al. (1998), RankCover found 12% to be significant, while dCor, MIC and HHG found 6%, 2% and 8% respectively. The hybrid method could identify 10% of those genes. When the FDR is instead controlled at 25%, the figures for HHG, dCor, MIC, RankCover and the hybrid method become 39%, 23%, 18%, 51% and 40% respectively. These figures differ slightly from those reported in Reshef et al. (2011), due to the difference in the procedure for handling the missing data (see Supplementary Article for details).
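For reference, the Benjamini-Hochberg adjustment used to obtain FDR-adjusted q-values can be sketched in a few lines. The function name is illustrative; the input is one p-value per gene from any of the tests above.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        q_values[k] = running_min
    return q_values
```

Genes whose q-value is at most 0.05 (or 0.25) are the ones declared significant when the FDR is controlled at the 5% (or 25%) level.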

For these data, RankCover is clearly successful at identifying the oscillating patterns expected for the experiment. This is also clear from Figure 9 (panels A, B and C), which compares the FDR-adjusted q-values of our RankCover test with those of dCor, MIC and HHG on a logarithmic scale. Most of the genes in Spellman’s list which are identified by dCor, MIC or HHG are also identified by RankCover, but RankCover identified more genes than the other methods. Figure 9 (panels D–I) shows some of the genes that are found significant by RankCover at the 5% level, but not found significant by at least one of the other three methods. PDR5 was found significant by MIC, HHG and RankCover, but not by dCor. On the other hand, MIC could not identify FET3, which was identified by dCor, HHG and RankCover. The other four genes shown in Figure 9 are found significant by RankCover but not by dCor, MIC or HHG.

A. The plot comparing the FDR-adjusted q-values of the test using RankCover and that using dCor for the genes in Spellman’s list on a log scale. It is evident that most of the genes in Spellman’s list have a smaller q-value when the RankCover test is used. B. A similar plot comparing the q-values of RankCover and MIC. C. A similar plot comparing the q-values of RankCover and HHG. D–I. Examples of genes in Spellman’s list that were identified by RankCover, but not by at least one of dCor, MIC or HHG. The values in parentheses are the Spellman scores for the genes, which were used by Spellman et al. (1998) to determine the list of significant genes. A higher score means stronger association.

5. Summary

Our RankCover testing procedure serves as a simple and powerful method to test for general association between a pair of variables. The method is applicable to the problem of testing general association irrespective of the marginal distributions of the (continuous) variables. Use of the rank scale also allows a pre-computed null distribution for the statistic, avoiding the need for actual permutation. This, along with the introduction of the idea of using a single disk size makes the procedure computationally feasible. The testing procedure has been shown to be powerful in simulated datasets even with a small sample size. A variety of real datasets, ranging from studies of cell cycle effects in gene expression to studies involving circular interference transmittance show that the approach provides useful and interpretable results.

Although dCor is theoretically motivated by consideration of characteristic functions, in practice it suffers for non-monotone relationships. Our RankCover procedure is generally powerful and robust, and is more powerful than MIC, dCor and HHG for a number of scenarios. RankCover may be especially useful to detect oscillating relationships, keeping in mind that such relationships need not be periodic and the amplitudes may vary. A hybrid of RankCover and dCor is proposed, which is shown to be highly robust against many forms of associations.

With the rapid rise of large datasets in today’s scientific community, RankCover provides a useful tool to detect general association. The approach is both sensitive and relatively powerful, even with small samples, against various and general forms of association.

Supplementary Material

NIHMS851391-supplement-Supplementary_Material.pdf (1.1 MB, PDF)

Acknowledgments

The authors would like to thank Dr Katerina Kechris, Department of Biostatistics and Informatics, University of Colorado, Denver, for her valuable suggestions. The study was funded by NIH (R21HG007840).

References

Albert PS, Ratnasinghe D, Tangrea J, Wacholder S. Limitations of the case-only design for identifying gene-environment interactions. American Journal of Epidemiology. 2001;154(8):687–693. doi: 10.1093/aje/154.8.687. [DOI] [PubMed] [Google Scholar]
Bowman AW, Azzalini A. Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations: The Kernel Approach with S-Plus Illustrations. Oxford University Press; 1997. [Google Scholar]
Clark PJ, Evans FC. Distance to nearest neighbor as a measure of spatial relationships in populations. Ecology. 1954:445–453. [Google Scholar]
Cuzick J. A wilcoxon-type test for trend. Statistics in medicine. 1985;4(4):543–547. doi: 10.1002/sim.4780040416. [DOI] [PubMed] [Google Scholar]
de Siqueira Santos S, Takahashi DY, Nakata A, Fujita A. A comparative study of statistical methods used to identify dependencies between gene expression signals. Briefings in bioinformatics. 2013:bbt051. doi: 10.1093/bib/bbt051. [DOI] [PubMed] [Google Scholar]
Diggle PJ. Statistical analysis of spatial point patterns. Academic Press; London: 1983. [Google Scholar]
Eckerle K. Circular interference transmittance study. National Institute of Standards and Technology (NIST), US Department of Commerce; USA: 1979. [Google Scholar]
Hall P. Three limit theorems for vacancy in multivariate coverage problems. Journal of Multivariate Analysis. 1985;16(2):211–236. [Google Scholar]
Hamed KH, Ramachandra Rao A. A modified mann-kendall trend test for autocorrelated data. Journal of Hydrology. 1998;204(1):182–196. [Google Scholar]
Heller R, Heller Y, Gorfine M. A consistent multivariate test of association based on ranks of distances. Biometrika. 2013;100(2):503–510. [Google Scholar]
Holgate P. Some new tests of randomness. The Journal of Ecology. 1965a:261–266. [Google Scholar]
Holgate P. Tests of randomness based on distance methods. Biometrika. 1965b;52(3–4):345–353. [Google Scholar]
Kahaner D, Moler CB, Nash S, Forsythe GE. Numerical methods and software. Prentice-Hall; Englewood Cliffs, NJ: 1989. [Google Scholar]
Kendall M. Rank correlation methods. Griffin; London: 1975. [Google Scholar]
Mann HB. Non-parametric test against trend. Econometrika. 1945;13:245–259. [Google Scholar]
Puri ML, Sen PK. Nonparametric methods in multivariate analysis 1971 [Google Scholar]
Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC. Detecting novel associations in large data sets. Science. 2011;334(6062):1518–1524. doi: 10.1126/science.1205438. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ripley B. Tests of ‘randomness’ for spatial point patterns. Journal of the Royal Statistical Society. Series B (Methodological) 1979:368–374. [Google Scholar]
Saviotti P. Technological evolution, variety, and the economy. E. Elgar; 1996. [Google Scholar]
Shorack GR, Wellner JA. Empirical processes with applications to statistics. Vol. 59. Siam; 2009. [Google Scholar]
Simon N, Tibshirani R. Comment on “detecting novel associations in large data sets” by reshef et al, science dec 16, 2011. 2014 doi: 10.1126/science.1205438. arXiv preprint arXiv:1401.7645. [DOI] [PMC free article] [PubMed] [Google Scholar]
Smith TE. A scale-sensitive test of attraction and repulsion between spatial point patterns. Geographical Analysis. 2004;36(4):315–331.
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell. 1998;9(12):3273–3297. doi: 10.1091/mbc.9.12.3273.
Székely GJ, Rizzo ML. Brownian distance covariance. The Annals of Applied Statistics. 2009;3(4):1236–1265. doi: 10.1214/09-AOAS312.
Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. The Annals of Statistics. 2007;35(6):2769–2794.
Wilks S. On the independence of k sets of normally distributed statistical variables. Econometrica, Journal of the Econometric Society. 1935:309–326.

Supplementary Materials

NIHMS851391-supplement-Supplementary_Material.pdf (1.1MB, pdf)
