ABSTRACT
Modern data collection techniques allow a very large number of endpoints to be analyzed. In biomedical research, for example, the expressions of thousands of genes are commonly measured on only a small number of subjects. In these situations, traditional methods for comparison studies are not applicable. Moreover, the assumption of a normal distribution is often questionable for high-dimensional data, and some variables may at the same time be highly correlated with others. Hypothesis tests based on interpoint distances are very appealing for studies involving the comparison of means, because they do not assume the data to come from normally distributed populations and comprise tests that are distribution free, unbiased, consistent, and computationally feasible even if the number of endpoints is much larger than the number of subjects. New tests based on interpoint distances are proposed for multivariate studies involving a simultaneous comparison of means and variability, or of the whole distribution shapes. The tests are shown to perform well in terms of power when the endpoints have complex dependence relations, such as in genomic and metabolomic studies. A practical application to a genetic cardiovascular case-control study is discussed.
KEYWORDS: Multivariate data, nonparametric tests, nonparametric combination, biomedicine, genomics
1. Introduction
Modern data collection techniques allow a very large number of endpoints to be analyzed, see [7] or [20]. For example, in biomedical DNA research, the expressions of thousands of genes are commonly measured on only a small number of subjects, as in the cardiovascular genetic case-control study of Kalina and Schlenker [10] with 38 590 gene transcripts available on 48 individuals. Other fields where the high-dimensional low-sample size problem is common are metabolomics and proteomics. In these situations, traditional methods for comparison studies like the Hotelling test are not applicable, also because the assumption of normally distributed data is questionable.
The literature on the multivariate two-sample location problem is vast, see [25]. A large part of the literature, starting with [1], deals with modifications of the Hotelling test to overcome its limitations in high-dimensional studies; see [24] or [3] as examples of modified Hotelling tests.
Another important approach to testing high-dimensional data is based on interpoint distances. Jurečková and Kalina [8], Marozzi [14] and other scholars showed that this approach is very appealing for studies involving the comparison of means, because it does not assume normally distributed data and comprises tests that are distribution free, unbiased, consistent and computable even if the number of endpoints exceeds (perhaps greatly) the number of subjects. Tests based on interpoint distances perform well in terms of power when the endpoints have complex dependence relations, as in genomic and metabolomic studies. Tests for the general problem have been proposed by Baringhaus and Franz [2], Liu and Modarres [12] and Szekely and Rizzo [26]. In this paper, we show that interpoint distance-based tests are effective also for the general alternative involving a simultaneous comparison of means and variability, or a comparison of the shapes of the distributions. It is important to note that the literature on the multivariate two-sample general problem is markedly smaller than that on the multivariate location problem. Our methods are presented in Section 2 and compared with available methods in Section 3. An application to a real gene expression dataset motivated by Kalina and Schlenker [10] is presented in Section 4. In Section 5, we highlight important points of our study and suggest directions for future research.
2. Methods
A case-control or a two-arm comparison study is considered, with two groups X = (X_1, …, X_m) and Y = (Y_1, …, Y_n) to be compared, where N = n + m. Let Z = (Z_1, …, Z_N) = (X_1, …, X_m, Y_1, …, Y_n) denote the pooled sample.
We assume X and Y to be independent random samples from p-variate populations with unknown cumulative distribution functions. We wish to test whether the two samples come from populations with an identical distribution, i.e. to test the null hypothesis
H_0: F_1 = F_2
versus the general alternative hypothesis
H_1: F_1 ≠ F_2,
where F_1 and F_2 denote the distribution functions of populations 1 and 2, respectively.
Our idea is to modify the location test proposed by Marozzi [14] to address the general problem. The first step is to compute, for each element Z_i, i = 1, …, m, of the first sample, the N−1 Euclidean distances between Z_i and the other elements of the pooled sample Z, i.e.
d_ij = ‖Z_i − Z_j‖, j = 1, …, N, j ≠ i.
Then, we consider a suitable statistic T_i for comparing the distances d_ij for j = 1, …, m (j ≠ i) with d_ij for j = m+1, …, N, i.e. comparing the distances between Z_i and the other elements of the first sample with the distances between Z_i and the elements of the second sample. It is assumed without loss of generality that large values of T_i speak against H_0. The generic test statistic for testing H_0 versus H_1 is defined as
MS = max_{1≤i≤m} T_i,
and the p-value is computed by means of permutations, as presented in Algorithm 1.
The maximum statistic often performs well together with the nonparametric combination methodology, i.e. with procedures for combining several dependent tests, see e.g. [17,18]. For the location problem, the statistic T_i can be chosen as the Wilcoxon statistic W_i, as in the original location test of Marozzi [14] extending the test of Jurečková and Kalina [8]. The latter is based on just one statistic W_i, with i randomly selected among 1, …, m. The rationale is that if populations 1 and 2 have different locations, then the distances d_ij for j = m+1, …, N are stochastically larger than d_ij for j = 1, …, m (j ≠ i). The idea of the test of Marozzi [14] is to use the data more efficiently by jointly considering all the tests and not just one of them. Note that W_1, …, W_m are equally distributed under both the location null and alternative hypotheses; they are dependent, and the dependence is assessed nonparametrically through the maximum statistic rule, or equivalently the minimum p-value rule (as in the original test of [14]). Permutation testing is well-defined, because the elements of Z are exchangeable under H_0; it can be applied even if a non-random sample of subjects is randomized into two groups to be compared, which is a common situation in biomedical studies. In theory, the permutation null distribution of the test statistic always exists, as it can be computed by considering all possible data permutations. In practice, however, it may be computationally infeasible to compute the null distribution exactly when the sample sizes are not small. Nevertheless, this distribution can be consistently estimated by drawing a random sample of permutations. When H_0 is true, all N! permutations of the labels 1, …, N are equally likely, each with probability 1/N!; therefore the distribution of the test statistic does not depend on the unknown parent distribution and the test is distribution free.
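The construction above can be sketched in a few lines of Python. This is a minimal illustration under our own naming, not the authors' implementation: the functions `wilcoxon_stat`, `max_stat` and `perm_pvalue` are hypothetical names, the Wilcoxon rank sum plays the role of the per-center statistic T_i, and B random permutations approximate the permutation null distribution.

```python
import numpy as np
from scipy.stats import rankdata

def wilcoxon_stat(d_within, d_between):
    # Rank sum of the between-group distances among all N-1 distances.
    ranks = rankdata(np.concatenate([d_within, d_between]))
    return ranks[len(d_within):].sum()

def max_stat(X, Y, stat=wilcoxon_stat):
    # X: (m, p) first sample, Y: (n, p) second sample.
    Z = np.vstack([X, Y])
    m = len(X)
    T = []
    for i in range(m):
        d = np.linalg.norm(Z - Z[i], axis=1)   # distances from Z_i
        d_within = np.delete(d[:m], i)         # to other first-sample points
        d_between = d[m:]                      # to second-sample points
        T.append(stat(d_within, d_between))
    return max(T)                              # MS = max_i T_i

def perm_pvalue(X, Y, B=999, seed=0):
    # Monte Carlo permutation p-value for the MS statistic.
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    m = len(X)
    T_obs = max_stat(X, Y)
    hits = sum(
        max_stat(Z[idx[:m]], Z[idx[m:]]) >= T_obs
        for idx in (rng.permutation(len(Z)) for _ in range(B))
    )
    return (hits + 1) / (B + 1)
```

With well-separated samples the estimated p-value is close to its minimum attainable value 1/(B + 1).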
We propose to replace the Wilcoxon statistic with the statistics of Cucconi [5] or Lepage [11] to address the general problem. These tests were motivated by jointly testing for location and scale differences, but there is evidence that they work also for the general problem [13]. The Lepage test is based on the combination of the Wilcoxon location test and the Ansari-Bradley scale test. Let us denote by A_i the Ansari-Bradley statistic comparing d_ij for j = 1, …, m (j ≠ i) with d_ij for j = m+1, …, N. The Lepage test statistic is the sum of the squared standardized Wilcoxon and Ansari-Bradley statistics in the form
L_i = (W_i − E(W_i))² / V(W_i) + (A_i − E(A_i))² / V(A_i),
where E(·) and V(·) denote the mean and variance of the test statistics under H_0, respectively, independently of i. Large values of L_i speak against H_0. The ML statistic is defined as
ML = max_{1≤i≤m} L_i,
and its p-value is computed via permutations, using the generic Algorithm 1. The Lepage test is quite popular in practice, being featured for example in the well-known book by Hollander and Wolfe [6].
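The per-center Lepage statistic can be sketched as follows; this is our own illustrative function (`lepage_stat` is a hypothetical name), using the standard no-ties null moments of the Wilcoxon rank sum and Ansari-Bradley statistics as given e.g. in Hollander and Wolfe [6].

```python
import numpy as np
from scipy.stats import rankdata

def lepage_stat(d_within, d_between):
    # L_i = squared standardized Wilcoxon + squared standardized
    # Ansari-Bradley statistic, both comparing the between-group
    # distances with the within-group distances (no-ties moments).
    n1, n2 = len(d_within), len(d_between)
    N = n1 + n2
    r = rankdata(np.concatenate([d_within, d_between]))
    W = r[n1:].sum()                              # Wilcoxon rank sum
    EW = n2 * (N + 1) / 2
    VW = n1 * n2 * (N + 1) / 12
    A = np.minimum(r, N + 1 - r)[n1:].sum()       # Ansari-Bradley scores
    if N % 2 == 0:
        EA = n2 * (N + 2) / 4
        VA = n1 * n2 * (N + 2) * (N - 2) / (48 * (N - 1))
    else:
        EA = n2 * (N + 1) ** 2 / (4 * N)
        VA = n1 * n2 * (N + 1) * (3 + N ** 2) / (48 * N ** 2)
    return (W - EW) ** 2 / VW + (A - EA) ** 2 / VA
```

A pure location shift inflates the Wilcoxon term, while interleaved samples keep both terms small.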
The Cucconi test was proposed earlier than the Lepage test; still, the former is not as well known, even though interest in it has recently increased and its applications extend e.g. to industrial quality control [4]. Marozzi [13] showed that the Cucconi test compares well with the Lepage test. Another reason for being interested in the Cucconi test is that its statistic is not based on the combination of a statistic for location and a statistic for scale, in contrast to the usual location-scale tests. The statistic is based on the sums of squared ranks and of squared contrary ranks and takes their correlation into account as
C_i = (U_i² + V_i² − 2 ρ U_i V_i) / (2 (1 − ρ²)),
where
U_i = (6 Σ_{j=m+1,…,N} S_{ij}² − n N (2N − 1)) / √((m − 1) n N (2N − 1)(8N + 3)/5),
V_i = (6 Σ_{j=m+1,…,N} (N − S_{ij})² − n N (2N − 1)) / √((m − 1) n N (2N − 1)(8N + 3)/5),
ρ = 2 ((N − 1)² − 4) / ((2N − 1)(8N + 3)) − 1,
and S_{ij} is the rank of d_ij in (d_{i1}, …, d_{iN}), j ≠ i. It holds for each i that E(U_i) = E(V_i) = 0, V(U_i) = V(V_i) = 1 and corr(U_i, V_i) = ρ under H_0. The formulas for U_i, V_i and ρ are equivalent to those in [13], while his m, n and N correspond to our m−1, n, and N−1, respectively. Large values of C_i speak against H_0; C_i is closely related to the Mahalanobis distance between (U_i, V_i) and (0, 0), where U_i and V_i represent standardized sums of squared ranks and of squared contrary ranks of the distances between Z_i and the elements of the second sample, respectively. The MC statistic is defined as
MC = max_{1≤i≤m} C_i,
and its p-value is computed via permutations using Algorithm 1 and replacing the generic MS statistic by the particular choice MC.
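The per-center Cucconi statistic can be sketched directly in terms of the two groups of distances; `cucconi_stat` is our own hypothetical name, and the formulas are the standard no-ties ones of Marozzi [13], written for sample sizes n1 (within) and n2 (between).

```python
import numpy as np
from scipy.stats import rankdata

def cucconi_stat(d_within, d_between):
    # U, V: standardized sums of squared ranks / squared contrary ranks
    # of the between-group distances; rho is their null correlation.
    n1, n2 = len(d_within), len(d_between)
    N = n1 + n2
    R = rankdata(np.concatenate([d_within, d_between]))[n1:]
    denom = np.sqrt(n1 * n2 * (N + 1) * (2 * N + 1) * (8 * N + 11) / 5)
    U = (6 * np.sum(R ** 2) - n2 * (N + 1) * (2 * N + 1)) / denom
    V = (6 * np.sum((N + 1 - R) ** 2) - n2 * (N + 1) * (2 * N + 1)) / denom
    rho = 2 * (N ** 2 - 4) / ((2 * N + 1) * (8 * N + 11)) - 1
    return (U ** 2 + V ** 2 - 2 * rho * U * V) / (2 * (1 - rho ** 2))
```

Location-shifted distance samples yield a large C_i, interleaved samples a small one.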
Our approach is very flexible and can also be followed to define a multivariate version of the Kolmogorov–Smirnov test. More precisely, the interpoint distance Kolmogorov–Smirnov test statistic is defined as
MKS = max_{1≤i≤m} KS_i,
where KS_i is the Kolmogorov–Smirnov statistic for comparing d_ij for j = 1, …, m (j ≠ i) with d_ij for j = m+1, …, N. The p-value of the MKS test is computed similarly to those of the ML and MC tests.
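A sketch of the MKS statistic follows, again under our own naming (`mks_stat` is hypothetical), with the per-center two-sample Kolmogorov–Smirnov statistic taken from `scipy.stats.ks_2samp`.

```python
import numpy as np
from scipy.stats import ks_2samp

def mks_stat(X, Y):
    # MKS = max over first-sample centers Z_i of the two-sample KS
    # statistic comparing within-group and between-group distances.
    Z = np.vstack([X, Y])
    m = len(X)
    ks_vals = []
    for i in range(m):
        d = np.linalg.norm(Z - Z[i], axis=1)
        d_within = np.delete(d[:m], i)
        d_between = d[m:]
        ks_vals.append(ks_2samp(d_within, d_between).statistic)
    return max(ks_vals)
```

For well-separated groups the statistic approaches 1, its maximum value.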
3. Size and power study
Since it is not possible to theoretically derive optimality properties of a nonparametric test without any assumptions on the population distributions, the test size and power have to be estimated using Monte Carlo simulations. Given a simulation size s, Marozzi [14] suggested B = 8√s (rounded) as the number of permutations to be used for estimating the p-values of permutation tests; note that it is not practical to consider all possible permutations when computing p-values in a power simulation study. Here, we consider s = 5000 and thus B = 566. The resulting root mean squared error is maximal when estimating a rejection probability of 0.5; a multiplicative constant accounts there for the error inflation caused by p-values being estimated rather than computed exactly, see Section 3 of Marozzi [14]. In fact, there are two sources of power estimation error: the first is due to the Monte Carlo procedure and the second to the permutation procedure.
The interpoint distance tests are compared with the Energy test by Szekely and Rizzo [26] and the multivariate version of the Cramér test by [2]. We sampled m=n=10 observations from
Normal distribution;
Cauchy distribution;
Student distribution with 2.5 degrees of freedom;
Gamma distribution.
The Cauchy distribution corresponds to the Student distribution with 1 degree of freedom and was selected as a very heavy-tailed model. The Student distribution with 2.5 degrees of freedom was selected as a less extreme model than the Cauchy, noting that the Student distribution with 2 degrees of freedom has infinite variance. We considered p=10 and p=100 to simulate the high-dimension low-sample size comparison studies that are very common in omics. The nominal significance level is set to α = 0.05.
We use copulas to simulate the complex dependence relations among endpoints that are typical in fields like medical imaging, genomics and metabolomics. Copulas are very flexible and popular tools to model dependence in areas like engineering and finance, and they are being increasingly used in medicine [16,27]; see Appendix for details.
Three rounds of simulations have been run:
Simulation 1 aims at studying the type I error rate of the tests, i.e. data under with the same distributions (difference in neither location nor scale) are simulated.
Simulation 2 aims at studying the power of the tests when locations and scale (alone or together) differ, i.e. data under are simulated for the same distributions with different locations/different scales/jointly different locations and scales;
Simulation 3 aims at studying the power of the tests when comparing two different distributions with difference in neither location nor scale, i.e. data under are simulated for different distributions with difference in neither location nor scale.
Table 1 displays the results of simulation 1 for the normal and Student distributions, using five copulas to model the dependence. All tests control the size, except the Cramér test, which turns out to be very conservative.
Table 1. Type I error rate of the tests for normal and Student distributions under various copula models for the dependence.
| Copula | Normal | Cauchy | Gumbel | Clayton | Frank | Normal | Cauchy | Gumbel | Clayton | Frank |
|---|---|---|---|---|---|---|---|---|---|---|
| p=10 | p=100 | |||||||||
| Normal distributions | ||||||||||
| MC | 0.054 | 0.056 | 0.049 | 0.050 | 0.047 | 0.050 | 0.048 | 0.050 | 0.051 | 0.050 |
| ML | 0.052 | 0.053 | 0.045 | 0.048 | 0.048 | 0.048 | 0.047 | 0.049 | 0.053 | 0.050 |
| MKS | 0.050 | 0.047 | 0.038 | 0.042 | 0.044 | 0.042 | 0.041 | 0.037 | 0.042 | 0.040 |
| Energy | 0.050 | 0.052 | 0.055 | 0.050 | 0.051 | 0.049 | 0.051 | 0.049 | 0.052 | 0.053 |
| Cramér | 0.037 | 0.031 | 0.036 | 0.030 | 0.025 | 0.034 | 0.027 | 0.027 | 0.032 | 0.019 |
| Student distributions | ||||||||||
| MC | 0.052 | 0.046 | 0.053 | 0.052 | 0.052 | 0.051 | 0.051 | 0.052 | 0.050 | 0.053 |
| ML | 0.054 | 0.050 | 0.050 | 0.051 | 0.055 | 0.050 | 0.054 | 0.055 | 0.051 | 0.056 |
| MKS | 0.044 | 0.044 | 0.042 | 0.043 | 0.045 | 0.047 | 0.046 | 0.044 | 0.041 | 0.045 |
| Energy | 0.049 | 0.047 | 0.050 | 0.049 | 0.048 | 0.048 | 0.048 | 0.053 | 0.047 | 0.048 |
| Cramér | 0.020 | 0.017 | 0.017 | 0.021 | 0.011 | 0.017 | 0.018 | 0.014 | 0.012 | 0.005 |
Table 2 displays the results of simulation 2 for normal and Cauchy distributions, using the normal and Clayton copulas. Under Cauchy distributions, the interpoint distance tests are more powerful than both the Cramér and Energy tests; under normal distributions, the same holds when the scale alone, or both location and scale, differ, although the Energy and Cramér tests are more powerful when only the locations differ. Under normal distributions, the interpoint distance tests have very similar power. Under Cauchy distributions, both the Cucconi and Lepage tests are more powerful than the Kolmogorov–Smirnov test when the locations differ, while the opposite holds when the scale alone, or both location and scale, differ.
Table 2. Power of the two-sample tests in simulations under the normal or Clayton copula models for the dependence.
| p=10 | p=100 | |||||
|---|---|---|---|---|---|---|
| location | 1 | 0 | 1 | 1 | 0 | 1 |
| scale | 1 | 2 | 2 | 1 | 2 | 2 |
| Normal distributions, Normal copula | ||||||
| MC | 0.419 | 0.743 | 0.816 | 0.454 | 0.947 | 0.955 |
| ML | 0.412 | 0.737 | 0.806 | 0.444 | 0.948 | 0.952 |
| MKS | 0.429 | 0.771 | 0.822 | 0.441 | 0.974 | 0.971 |
| Energy | 0.789 | 0.245 | 0.684 | 0.842 | 0.296 | 0.771 |
| Cramér | 0.752 | 0.152 | 0.596 | 0.802 | 0.153 | 0.653 |
| Normal distributions, Clayton copula | ||||||
| MC | 0.527 | 0.837 | 0.919 | 0.633 | 0.991 | 0.996 |
| ML | 0.525 | 0.827 | 0.915 | 0.622 | 0.992 | 0.996 |
| MKS | 0.520 | 0.860 | 0.917 | 0.577 | 0.997 | 0.997 |
| Energy | 0.795 | 0.290 | 0.758 | 0.848 | 0.396 | 0.853 |
| Cramér | 0.760 | 0.159 | 0.674 | 0.807 | 0.177 | 0.737 |
| location | 2 | 0 | 2 | 2 | 0 | 2 |
| scale | 1 | 3.5 | 3.5 | 1 | 3.5 | 3.5 |
| Cauchy distributions, Normal copula | ||||||
| MC | 0.410 | 0.484 | 0.547 | 0.279 | 0.483 | 0.521 |
| ML | 0.423 | 0.497 | 0.556 | 0.296 | 0.509 | 0.549 |
| MKS | 0.377 | 0.614 | 0.659 | 0.222 | 0.636 | 0.664 |
| Energy | 0.369 | 0.401 | 0.482 | 0.176 | 0.430 | 0.467 |
| Cramér | 0.093 | 0.050 | 0.082 | 0.010 | 0.036 | 0.040 |
| Cauchy distributions, Clayton copula | ||||||
| MC | 0.420 | 0.519 | 0.562 | 0.231 | 0.498 | 0.475 |
| ML | 0.436 | 0.524 | 0.567 | 0.243 | 0.511 | 0.485 |
| MKS | 0.368 | 0.647 | 0.675 | 0.154 | 0.626 | 0.575 |
| Energy | 0.357 | 0.436 | 0.518 | 0.119 | 0.454 | 0.457 |
| Cramér | 0.092 | 0.061 | 0.109 | 0.005 | 0.033 | 0.039 |
Note: The two distributions have the same shape but different locations and/or different scales.
Table 3 displays the results of simulation 3 for two pairs of distributions: (i) normal versus Student, and (ii) normal versus gamma, using five copulas. In case (i), we compare the normal with a symmetric distribution with heavier tails, and in case (ii) with a skewed distribution. Both the Cucconi and Lepage tests are markedly more powerful than the other tests, with both the Cramér and Energy tests having extremely low power. Even though both the univariate Cucconi and Lepage tests were motivated by testing for location and scale differences, Marozzi [13] found evidence that they also work for comparing shapes; the results of simulation 3 confirm this finding in the multivariate case. In simulation 3, the interpoint distances are quite different between the two groups and allow them to be separated. Interpoint distance tests therefore turn out to be suitable in a broader context than location and scale differences alone, which is a very appealing feature in practice. It is important to emphasize that (permutation) MANOVA methods, like those proposed by Minas and Montana [15] or Shinohara et al. [22], fail to distinguish two distributions with the same location but different shapes; they also fail to distinguish distributions with different scales, because MANOVA tests are sensitive only to differences in location.
Table 3. Power of the two-sample tests in simulations under various copula models for the dependence.
| Copula | Normal | Cauchy | Gumbel | Clayton | Frank | Normal | Cauchy | Gumbel | Clayton | Frank |
|---|---|---|---|---|---|---|---|---|---|---|
| p=10 | p=100 | |||||||||
| Normal versus Student distribution | ||||||||||
| MC | 0.271 | 0.213 | 0.313 | 0.321 | 0.295 | 0.414 | 0.219 | 0.509 | 0.446 | 0.367 |
| ML | 0.269 | 0.205 | 0.313 | 0.319 | 0.284 | 0.431 | 0.218 | 0.538 | 0.464 | 0.357 |
| MKS | 0.241 | 0.202 | 0.278 | 0.277 | 0.252 | 0.313 | 0.206 | 0.368 | 0.329 | 0.226 |
| Energy | 0.096 | 0.123 | 0.101 | 0.096 | 0.092 | 0.084 | 0.138 | 0.088 | 0.077 | 0.059 |
| Cramér | 0.056 | 0.060 | 0.047 | 0.050 | 0.033 | 0.044 | 0.060 | 0.038 | 0.041 | 0.015 |
| Normal versus gamma distribution | ||||||||||
| MC | 0.099 | 0.082 | 0.139 | 0.108 | 0.126 | 0.146 | 0.079 | 0.282 | 0.236 | 0.460 |
| ML | 0.093 | 0.078 | 0.136 | 0.102 | 0.118 | 0.142 | 0.077 | 0.283 | 0.238 | 0.454 |
| MKS | 0.071 | 0.059 | 0.094 | 0.089 | 0.071 | 0.087 | 0.066 | 0.157 | 0.223 | 0.148 |
| Energy | 0.055 | 0.066 | 0.069 | 0.059 | 0.063 | 0.058 | 0.066 | 0.060 | 0.058 | 0.054 |
| Cramér | 0.041 | 0.043 | 0.048 | 0.037 | 0.027 | 0.043 | 0.039 | 0.035 | 0.032 | 0.014 |
Note: The two distributions have different shapes, but equal both locations and scales.
The results of the simulation study show that the interpoint distance approach based on the maximum statistic, originally proposed by Marozzi [14] for the location problem, is effective also when comparing distributions that differ in both location and scale, or in shape, at least for the designs studied here. It has also been shown that this approach works well for both low- and high-dimensional comparisons, as well as for comparisons where the endpoints have complex dependence relations among themselves. This suggests that the approach can be useful for the high-dimensional low-sample size data typical of omics fields, where the normality assumption is often not met and the endpoints are expected to have complex dependence relations.
4. Application: a cardiovascular genetic study
We consider the cardiovascular genetic case-control study dataset of Kalina and Schlenker [10]. Gene expressions of p = 38 590 gene transcripts were measured by means of microarrays on N=48 individuals, namely on 24 patients immediately after a cerebrovascular stroke (CVS) and 24 control persons. The aim of Kalina and Schlenker [10] was to identify genes associated with an excess genetic risk for the incidence of cerebrovascular stroke. Here, we use the proposed methods to test whether the measurements of patients and controls come from the same distribution. More precisely, we consider several different situations concerning different subsets of the original dataset. Analyzing subsamples of a dataset is a very important approach in molecular genetics, because it allows one to explore important aspects of molecular genetic data, see e.g. [19]; in practice, a given dataset is therefore often reduced to a smaller predefined set of interesting genes, based on a prior biological hypothesis. While the order of the genes in the dataset is arbitrary and non-random (depending on the BeadChip Illumina microarray technology [9]), the examples below serve to illustrate the different performance of multivariate tests under different settings. Our prior analysis [10] revealed the two groups (CVS patients and controls) to be well separated by means of classification analysis; a support vector machine classifier with a Gaussian kernel yields a high classification accuracy in a leave-one-out cross-validation. In addition, there seem to be no very dominant genes in the dataset (in the sense of separating the two groups), but rather a large number of genes each contributing slightly to the separation.
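The classification analysis mentioned above can be sketched as follows. This is not the study's analysis: the expression matrix here is a synthetic stand-in (48 subjects, 200 genes with an assumed weak group shift, instead of the real 38 590 transcripts), and scikit-learn's RBF-kernel SVM with leave-one-out cross-validation illustrates the procedure.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in for the expression matrix: 48 subjects, 200 "genes".
rng = np.random.default_rng(0)
X = rng.normal(size=(48, 200))
X[:24] += 0.25                        # assumed weak shift spread over many genes
y = np.array([1] * 24 + [0] * 24)     # 24 CVS patients, 24 controls

clf = SVC(kernel="rbf", gamma="scale")  # Gaussian (RBF) kernel
acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()  # LOO accuracy
```

On the real data, with a strong multivariate signal, such a classifier separates the groups well; on this synthetic stand-in the accuracy merely illustrates the workflow.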
Situation (i). We randomly select r genes among all p genes, for r increasing from 250 to 4000. The random selection is performed 1000 times, and the tests are applied to each randomly selected set. Table 4 presents the median p-values over all randomly selected sets and, at the same time, the proportion of times the tests reject the null hypothesis at α = 0.05 (i.e. the estimated powers of the tests). It turns out that the tests based on interpoint distances perform better, having lower p-values and higher powers than the others, with the Lepage and Kolmogorov–Smirnov tests being the best performers.
Table 4. Genetic study results for 1000 random selections of r genes from the whole gene set (situation (i)).
| Test | r=250 | r=500 | r=1000 | r=2000 | r=4000 |
|---|---|---|---|---|---|
| Median p-value | |||||
| MC | 0.076 | 0.051 | 0.035 | 0.028 | 0.021 |
| ML | 0.074 | 0.039 | 0.021 | 0.012 | 0.007 |
| MKS | 0.072 | 0.039 | 0.021 | 0.014 | 0.007 |
| Energy | 0.083 | 0.072 | 0.065 | 0.063 | 0.060 |
| Cramér | 0.153 | 0.147 | 0.145 | 0.146 | 0.145 |
| Power | |||||
| MC | 0.376 | 0.499 | 0.672 | 0.798 | 0.932 |
| ML | 0.405 | 0.570 | 0.765 | 0.897 | 0.987 |
| MKS | 0.411 | 0.554 | 0.765 | 0.887 | 0.976 |
| Energy | 0.272 | 0.311 | 0.307 | 0.275 | 0.249 |
| Cramér | 0.028 | 0.006 | 0.000 | 0.000 | 0.000 |
Note: Best result in each column is shown by bold face.
Situation (ii). We randomly select r genes only among the first 4000 genes, for r increasing from 125 to 1000. The random selection is again performed 1000 times, and median p-values and powers are shown in Table 5. Similarly to situation (i), the tests based on interpoint distances perform better than the others, with the Lepage and Kolmogorov–Smirnov tests being the best performers.
Table 5. Genetic study results for 1000 random selections of r genes from the first 4000 genes (situation (ii)).
| Test | r=125 | r=250 | r=375 | r=500 | r=625 | r=750 | r=875 | r=1000 |
|---|---|---|---|---|---|---|---|---|
| Median p-value | ||||||||
| MC | 0.092 | 0.060 | 0.051 | 0.048 | 0.042 | 0.042 | 0.041 | 0.039 |
| ML | 0.083 | 0.041 | 0.025 | 0.019 | 0.014 | 0.012 | 0.011 | 0.011 |
| MKS | 0.080 | 0.039 | 0.027 | 0.019 | 0.014 | 0.014 | 0.012 | 0.012 |
| Energy | 0.092 | 0.079 | 0.074 | 0.072 | 0.071 | 0.067 | 0.069 | 0.067 |
| Cramér | 0.150 | 0.145 | 0.141 | 0.139 | 0.141 | 0.139 | 0.140 | 0.137 |
| Power | ||||||||
| MC | 0.331 | 0.428 | 0.495 | 0.520 | 0.586 | 0.594 | 0.631 | 0.652 |
| ML | 0.383 | 0.562 | 0.690 | 0.791 | 0.866 | 0.903 | 0.940 | 0.962 |
| MKS | 0.390 | 0.566 | 0.684 | 0.776 | 0.847 | 0.889 | 0.914 | 0.939 |
| Energy | 0.252 | 0.252 | 0.255 | 0.241 | 0.234 | 0.227 | 0.208 | 0.213 |
| Cramér | 0.054 | 0.018 | 0.005 | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 |
Note: Best result in each column is shown by bold face.
In situations (i) and (ii), median p-values are used to compare the different tests. This makes sense for tests which keep the size (as we combine situations under H_0 and H_1): if one test is more powerful than the others, it then also has a smaller median p-value. This holds for all tests except the Cramér test, which in any case shows no tendency to be best or nearly best.
Further, we used the Limma (Linear Models for Microarray Data [23]) methodology, which yields 45 differentially expressed genes, if the q-values are compared with the most usual threshold of 0.05. We recall that Limma considers a moderated t-statistic for each gene separately, testing that its expectation in one group (patients) equals the expectation in another group (controls), and adjusting for multiple testing.
Situation (iii). If we use only the 45 differentially expressed genes, the p-values of all considered tests are highly significant (as expected); the Energy test yields 0.0002 and the MKS test 0.0003, while the MC, ML and Cramér tests are highly significant as well.
Situation (iv). Further, we use only the remaining genes, which are not found to be differentially expressed by Limma, and perform the tests over the first r of them, for different values of r; i.e. we consider the (fixed) first block of r variables. The results are presented in Table 6. For smaller r (250 or 500), the MKS test has the smallest p-value. The p-values of the MC and ML tests decrease with an increasing dimensionality r, which does not happen for the MKS, Energy, and Cramér tests; for r ≥ 1000, the p-value of the MC test becomes the smallest. In fact, the MC and ML tests are the only two tests with p-values decreasing with an increasing dimensionality. We consider this to be their advantage and evidence of their ability to capture the information from the genes in a multivariate way. To explain this, we note that the set of genes not found to be differentially expressed does not necessarily fulfil the null hypothesis H_0 specified in Section 2. Limma is not designed to find genes not contributing to the test of H_0 against H_1, and it retains a univariate nature by performing a series of univariate tests. Our tests, on the other hand, capture the multivariate information jointly across genes. Thus, the novel tests, which capture the multivariate structure of the data, are conceptually different from Limma, with potential applications in gene set testing, which is becoming increasingly important in genetic data analysis [21].
Table 6. P-values evaluated in situation (iv), only over genes not determined as differentially expressed by Limma.
| Test | r=250 | r=500 | r=1000 | r=2000 | r=3000 | r=4000 |
|---|---|---|---|---|---|---|
| MC | 0.040 | 0.015 | 0.015 | 0.014 | 0.007 | 0.005 |
| ML | 0.056 | 0.020 | 0.025 | 0.025 | 0.019 | 0.009 |
| MKS | 0.019 | 0.012 | 0.050 | 0.065 | 0.040 | 0.040 |
| Energy | 0.030 | 0.025 | 0.037 | 0.058 | 0.070 | 0.071 |
| Cramér | 0.075 | 0.078 | 0.089 | 0.128 | 0.142 | 0.140 |
Note: Best result in each column is shown by bold face.
5. Conclusion
The approach based on interpoint distances, originally proposed by Jurečková and Kalina [8] and Marozzi [14] for multivariate comparison studies of means, is extended in this paper to more general studies. These involve simultaneous comparisons of means and variability, or comparisons of the distribution shapes; such comparisons arise in practice when the treatment is expected to alter not just the means but also the variability and shape of the distribution. The proposed tests are able to analyze high-dimensional low-sample size studies, where the endpoints are not normally distributed and have complex dependence relations among themselves. Apart from medicine and biology, a joint comparison of location together with variability (or even the shape of the distribution) is important e.g. in climate dynamics or finance. An interesting direction for future research is the implementation of the tests in control charts for health-care monitoring. Another is the consideration of other (non-Euclidean) distances to obtain further tests for the general multivariate two-sample problem.
Acknowledgments
The authors are thankful to two anonymous referees for valuable suggestions that helped improve the paper.
Appendix.
In this Appendix, it is explained how copulas are used to simulate complex dependence relations among endpoints. A copula K is a distribution function with uniformly distributed margins. The Sklar theorem explains the usefulness of copulas for sampling from multivariate distributions. It states that for any joint distribution function F of p random variables X_1, …, X_p with marginal distribution functions F_1, …, F_p, there exists a copula K that completely describes their dependence:
F(x_1, …, x_p) = K(F_1(x_1), …, F_p(x_p)).
Therefore, to sample from the multivariate distribution function F, it suffices to sample from the dependence structure, i.e. the copula K, and to apply the generalized inverses corresponding to the desired margins. Here, we consider both elliptical and Archimedean copulas; the former correspond to an elliptical distribution through the Sklar theorem, with the normal and Cauchy copulas as the examples considered here. The normal copula is defined as
K(u_1, …, u_p) = Φ_R(Φ^{−1}(u_1), …, Φ^{−1}(u_p)),
where Φ^{−1} denotes the quantile function of the univariate standard normal distribution and Φ_R denotes the distribution function of a p-variate normal random vector with the correlation matrix R;
here, the values above (or below) the diagonal of R are obtained as
where , , so that
We set and . The Cauchy copula is obtained by replacing the normal distribution with the Cauchy. When p=2, the contour plot for the density defined with a normal copula is elliptically-shaped, whereas that corresponding to the Cauchy copula is star-shaped.
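Sampling through the Sklar theorem can be sketched as follows; `sample_normal_copula` is our own hypothetical function, and the equicorrelation value 0.5 is an illustrative assumption, not one of the paper's settings.

```python
import numpy as np
from scipy.stats import norm

def sample_normal_copula(R, size, rng):
    # Draw from the normal copula with correlation matrix R:
    # correlated normals pushed through the standard-normal CDF.
    L = np.linalg.cholesky(R)
    Z = rng.standard_normal((size, len(R))) @ L.T
    return norm.cdf(Z)

# Example: p = 3 endpoints with an assumed pairwise correlation of 0.5.
p = 3
R = np.full((p, p), 0.5)
np.fill_diagonal(R, 1.0)
U = sample_normal_copula(R, 1000, np.random.default_rng(0))
# Margins of U are uniform on (0, 1); applying a quantile function
# (e.g. a gamma or Student one) then yields the desired margins.
```

Replacing the normal draws and CDF by multivariate Cauchy ones gives the Cauchy copula analogously.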
Archimedean copulas are popular because many of them admit an explicit formula for K and model the dependence with a single parameter. A copula is called Archimedean if it can be represented as
K(u_1, …, u_p) = ψ^{−1}(ψ(u_1) + ⋯ + ψ(u_p)),
where ψ is called the copula generator. The Gumbel, Clayton and Frank copulas are considered here. The Gumbel copula is generated by
ψ(t) = (−ln t)^ξ, ξ ≥ 1;
the larger the parameter ξ, the stronger the dependence. The Gumbel copula is used when the endpoints are expected to be more correlated at high values than at low values, because it is a skewed copula giving more correlation in the right tail than in the left tail. The Clayton copula is generated by
ψ(t) = (t^{−δ} − 1)/δ, δ > 0;
the larger the parameter δ, the stronger the dependence. In the bivariate case, the density contour plot is pear-shaped, because the Clayton copula produces a tight correlation at the low end of each endpoint. The Frank copula is generated by
ψ(t) = −ln((e^{−ζt} − 1)/(e^{−ζ} − 1));
the larger the parameter ζ, the stronger the dependence. In the bivariate case, the Frank copula produces an even, sausage-shaped correlation across the range of the endpoints. Fixed values of ξ, δ and ζ are used here.
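As a sketch of Archimedean sampling, the Clayton copula admits the standard Marshall–Olkin frailty construction; `sample_clayton_copula` is our own hypothetical function, and δ = 2 is an illustrative assumption (for the Clayton copula, Kendall's τ equals δ/(δ + 2), i.e. 0.5 here).

```python
import numpy as np

def sample_clayton_copula(delta, p, size, rng):
    # Marshall-Olkin algorithm: a gamma frailty V with shape 1/delta,
    # then U_j = (1 + E_j / V) ** (-1/delta) with E_j standard exponential.
    V = rng.gamma(shape=1.0 / delta, scale=1.0, size=(size, 1))
    E = rng.standard_exponential((size, p))
    return (1.0 + E / V) ** (-1.0 / delta)

U = sample_clayton_copula(delta=2.0, p=3, size=1000,
                          rng=np.random.default_rng(0))
```

The sampled margins are uniform, with the characteristic tight lower-tail dependence of the Clayton family.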
Funding Statement
The research of J. Kalina was supported by the Czech Science Foundation project 19-05704S.
Disclosure statement
No potential conflict of interest was reported by the authors.
ORCID
Marco Marozzi http://orcid.org/0000-0001-9538-0955
Amitava Mukherjee http://orcid.org/0000-0001-7462-3217
Jan Kalina http://orcid.org/0000-0002-8491-0364
References
- [1]. Bai Z.D. and Saranadasa H., Effect of high dimension: By an example of a two sample problem, Statist. Sinica 6 (1996), pp. 311–329.
- [2]. Baringhaus L. and Franz C., On a new multivariate two-sample test, J. Multivar. Anal. 88 (2004), pp. 190–206. doi: 10.1016/S0047-259X(03)00079-4
- [3]. Chen S.X. and Qin Y.L., A two-sample test for high-dimensional data with applications to gene-set testing, Ann. Stat. 38 (2010), pp. 808–835. doi: 10.1214/09-AOS716
- [4]. Chowdhury S., Mukherjee A. and Chakraborti S., A new distribution-free control chart for joint monitoring of location and scale parameters of continuous distributions, Qual. Reliab. Eng. Int. 30 (2014), pp. 191–204. doi: 10.1002/qre.1488
- [5]. Cucconi O., Un nuovo test non parametrico per il confronto tra due gruppi campionari [A new nonparametric test for the comparison of two sample groups], G. Econ. Ann. Econ. 27 (1968), pp. 225–248.
- [6]. Hollander M. and Wolfe D.A., Nonparametric Statistical Methods, 2nd ed., Wiley, New York, 1999.
- [7]. Hossain A. and Beyene J., Application of skew-normal distribution for detecting differential expression to microRNA data, J. Appl. Stat. 42 (2015), pp. 477–491. doi: 10.1080/02664763.2014.962490
- [8]. Jurečková J. and Kalina J., Nonparametric multivariate rank tests and their unbiasedness, Bernoulli 18 (2012), pp. 229–251. doi: 10.3150/10-BEJ326
- [9]. Kalina J., A robust pre-processing of BeadChip microarray images, Biocybern. Biomed. Eng. 38 (2018), pp. 556–563. doi: 10.1016/j.bbe.2018.04.005
- [10]. Kalina J. and Schlenker A., A robust supervised variable selection for noisy high-dimensional data, BioMed Res. Int. 2015 (2015), pp. 1–10. Article 320385. doi: 10.1155/2015/320385
- [11]. Lepage Y., A combination of Wilcoxon's and Ansari-Bradley's statistics, Biometrika 58 (1971), pp. 213–217. doi: 10.1093/biomet/58.1.213
- [12]. Liu Z. and Modarres R., A triangle test for equality of distribution functions in high dimensions, J. Nonparametr. Stat. 23 (2011), pp. 605–615. doi: 10.1080/10485252.2010.485644
- [13]. Marozzi M., Some notes on the location-scale Cucconi test, J. Nonparametr. Stat. 21 (2009), pp. 629–647. doi: 10.1080/10485250902952435
- [14]. Marozzi M., Multivariate tests based on interpoint distances with application to magnetic resonance imaging, Stat. Methods Med. Res. 25 (2016), pp. 2593–2610. doi: 10.1177/0962280214529104
- [15]. Minas C. and Montana G., Distance-based analysis of variance: Approximate inference, Stat. Anal. Data Min. 7 (2014), pp. 450–470. doi: 10.1002/sam.11227
- [16]. Nelsen R.B., An Introduction to Copulas, 2nd ed., Springer, New York, 2006.
- [17]. Neuhäuser M., Combining the t test and Wilcoxon's rank-sum test, J. Appl. Stat. 42 (2015), pp. 2769–2775. doi: 10.1080/02664763.2015.1070809
- [18]. Pesarin F. and Salmaso L., Permutation Tests for Complex Data, Wiley, Chichester, 2010.
- [19]. Rapaport F., Khanin R., Liang Y., Pirun M., Krek A., Zumbo P., Mason C.E., Socci N.D. and Betel D., Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data, Genome Biol. 14 (2013), pp. 1–13. Article 3158. doi: 10.1186/gb-2013-14-9-r95
- [20]. Saraiva E.R., Suzuki A.K., Louzada F. and Milan L.A., Partitioning gene expression data by data-driven Markov chain Monte Carlo, J. Appl. Stat. 43 (2016), pp. 1155–1173. doi: 10.1080/02664763.2015.1092113
- [21]. Seok J., Davis R.W. and Xiao W., A hybrid approach of gene sets and single genes for the prediction of survival risks with gene expression data, PLoS ONE 10 (2015), article e0122103. doi: 10.1371/journal.pone.0122103
- [22]. Shinohara R.T., Shou H., Carone M., Schultz R., Tunc B., Parker D. and Verma R., Distance-based analysis of variance for brain connectivity, UPenn Biostatistics Working Papers, University of Pennsylvania, 2016.
- [23]. Smyth G.K., Limma: Linear models for microarray data, in Bioinformatics and Computational Biology Solutions Using R and Bioconductor, R. Gentleman, V. Carey, S. Dudoit, R. Irizarry and W. Huber, eds., Springer, New York, 2005, pp. 397–420.
- [24]. Srivastava M.S., A test for the mean vector with fewer observations than the dimension under non-normality, J. Multivar. Anal. 100 (2009), pp. 518–532. doi: 10.1016/j.jmva.2008.06.006
- [25]. Stadler N. and Mukherjee S., Two-sample testing in high dimensions, J. R. Stat. Soc. B 79 (2017), pp. 225–246. doi: 10.1111/rssb.12173
- [26]. Szekely G.J. and Rizzo M.L., Energy statistics: Statistics based on distances, J. Statist. Plann. Inference 143 (2013), pp. 1249–1272. doi: 10.1016/j.jspi.2013.03.018
- [27]. Yan J., Enjoy the joy of copulas: With a package copula, J. Stat. Softw. 21 (2007), pp. 1–21. doi: 10.18637/jss.v021.i04
