SUMMARY
Detecting dependence between two random variables is a fundamental problem. Although the Pearson correlation coefficient is effective for capturing linear dependence, it can be entirely powerless for detecting nonlinear and/or heteroscedastic patterns. We introduce a new measure, G-squared, to test whether two univariate random variables are independent and to measure the strength of their relationship. The G-squared statistic is almost identical to the square of the Pearson correlation coefficient, R-squared, for linear relationships with constant error variance, and has the intuitive meaning of the piecewise R-squared between the variables. It is particularly effective in handling nonlinearity and heteroscedastic errors. We propose two estimators of G-squared and show their consistency. Simulations demonstrate that G-squared estimators are among the most powerful test statistics compared with several state-of-the-art methods.
Keywords: Bayes factor, Coefficient of determination, Hypothesis test, Likelihood ratio
1. INTRODUCTION
The Pearson correlation coefficient is widely used to detect and measure the dependence between two random quantities. The square of its least-squares estimate, popularly known as R-squared, is often used to quantify how linearly related two random variables are. However, the shortcomings of the R-squared statistic as a measure of the strength of dependence are also significant, as discussed recently by Reshef et al. (2011); that discussion has inspired the development of many new methods for detecting dependence.
The Spearman correlation calculates the Pearson correlation coefficient between rank
statistics. Although more robust than the Pearson correlation, this method still cannot
capture nonmonotone relationships. The alternating conditional expectation method was introduced by Breiman & Friedman (1985) to approximate the maximal correlation between $X$ and $Y$, i.e., to find optimal transformations of the data, $f(X)$ and $g(Y)$, such that their correlation is maximized. The implementation of this method has limitations, because it is infeasible to search through all possible transformations. Estimating mutual information is another popular approach, since the mutual information is zero if and only if $X$ and $Y$ are independent. Kraskov et al. (2004) proposed a method that involves estimating the entropies of $X$, $Y$ and $(X, Y)$ separately. The method was claimed to be numerically exact for independent cases and effective for high-dimensional variables. An energy distance-based method (Székely et al., 2007; Székely & Rizzo, 2009) and a kernel-based method (Gretton et al., 2005, 2012) for solving the two-sample test problem appeared separately in the statistics and machine learning literatures, and both can also be used for independence testing. The two methods were recently shown to be equivalent (Sejdinovic et al., 2013). Methods based on empirical cumulative distribution
functions (Hoeffding, 1948), empirical copula (Genest & Rémillard, 2004) and empirical characteristic
functions (Kankainen & Ushakov, 1998; Huskova & Meintanis, 2008) have also been proposed for
detecting dependence.
Another set of approaches is based on discretization of the random variables. Known as
grid-based methods, they are primarily designed to test for independence between univariate
random variables. Reshef et al. (2011) introduced the
maximum information coefficient, which focuses on the generality and equitability of a
dependence statistic; two more powerful estimators for this quantity were suggested by
Reshef et al. (arXiv:1505.02213). Equitability requires that the same value of the statistic
imply the same amount of dependence regardless of the type of the underlying relationship,
but it is not a well-defined mathematical concept. We show in the Supplementary Material that the equitability of G-squared is superior to that of all other independence testing statistics for a wide range of functional relationships. Heller et al. (2016) proposed a grid-based method which utilizes the $\chi^2$ statistic to test independence and is distribution-free. Blyth (1994) and Doksum et al.
(1994) discussed using the correlation curve to measure the strength of the
relationship. However, a direct use of nonparametric curve estimation may rely too heavily
on the smoothness of the relationship; furthermore, it cannot deal with heteroscedastic
noise.
The $G^2$ statistic proposed in this paper is derived from a regularized likelihood ratio test for piecewise-linear relationships and can be viewed as an integration of continuous and discrete methods. It is a function of both the conditional mean and the conditional variance of one variable given the other, so it is capable of detecting general functional relationships with heteroscedastic error variances. An estimate of $G^2$ can be derived via the same likelihood ratio approach as $R^2$ when the true underlying relationship is linear. Thus, it is reasonable that $G^2$ is almost identical to $R^2$ for linear relationships. Efficient estimates of $G^2$ can be computed quickly using a dynamic programming method, whereas the methods of Reshef et al. (2011) and Heller et al. (2016) consider grids on two variables simultaneously and hence require longer computational times. We will also show that, in terms of power, $G^2$ is one of the best statistics for independence testing when considering a wide range of functional relationships.
2. MEASURING DEPENDENCE WITH G-SQUARED
2.1. Defining $G^2$ as a generalization of $R^2$
The R-squared statistic measures how well the data fit a linear regression model. Given independent observations $(x_i, y_i)$ $(i = 1, \ldots, n)$ with $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ and $\varepsilon_i \sim N(0, \sigma^2)$, the standard estimate of R-squared can be derived from the likelihood ratio test statistic for testing $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$, i.e.,
$$\hat R^2 = 1 - \left\{ \frac{L(\hat\theta_{H_0})}{L(\hat\theta_{H_1})} \right\}^{2/n} = 1 - LR_n^{-2/n},$$
where $L(\hat\theta_{H_1})$ and $L(\hat\theta_{H_0})$ are the maximized likelihoods under $H_1$ and $H_0$, and $LR_n = L(\hat\theta_{H_1}) / L(\hat\theta_{H_0})$.
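As a quick numerical check of this identity, the minimal sketch below (an illustration with simulated toy data, not part of the original derivation) compares the likelihood-ratio transformation $1 - LR_n^{-2/n}$ with the usual least-squares R-squared; the two agree exactly because the Gaussian likelihood profiles out the error variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

# Profiling the variance, the Gaussian likelihood ratio reduces to a ratio
# of residual variances raised to the power n/2.
X1 = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
sigma2_h1 = np.mean((y - X1 @ beta) ** 2)   # residual variance under H1
sigma2_h0 = np.mean((y - y.mean()) ** 2)    # variance under H0: beta1 = 0

log_lr = 0.5 * n * (np.log(sigma2_h0) - np.log(sigma2_h1))
r2_from_lr = 1.0 - np.exp(-2.0 * log_lr / n)
r2_classic = 1.0 - sigma2_h1 / sigma2_h0    # usual R-squared

print(round(r2_from_lr, 10), round(r2_classic, 10))  # identical values
```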
Throughout the paper, we let $X$ and $Y$ be univariate continuous random variables. As a working model, we assume that the relationship between $X$ and $Y$ can be characterized as $Y = f(X) + \varepsilon$, with $E(\varepsilon \mid X) = 0$ and $\mathrm{var}(\varepsilon \mid X) = \sigma^2(X)$. If $X$ and $Y$ are independent, then $f(X)$ and $\sigma^2(X)$ are both constant. Now let us look at the piecewise-linear relationship
$$Y = \mu_j + \beta_j X + \varepsilon_j, \quad \varepsilon_j \sim N(0, \sigma_j^2), \quad c_{j-1} < X \leq c_j \quad (j = 1, \ldots, K),$$
where $-\infty = c_0 < c_1 < \cdots < c_{K-1} < c_K = \infty$ are called the breakpoints. While this working model allows for heteroscedasticity, it requires constant variance within each segment between two consecutive breakpoints. Testing whether $X$ and $Y$ are independent is equivalent to testing whether $\mu_1 = \cdots = \mu_K$ with $\beta_1 = \cdots = \beta_K = 0$, and whether $\sigma_1^2 = \cdots = \sigma_K^2$. Given the breakpoints, the likelihood ratio test statistic can be written as
$$LR_n = \prod_{j=1}^{K} \left( \frac{\hat\sigma^2}{\hat\sigma_j^2} \right)^{n_j / 2},$$
where $\hat\sigma^2$ is the overall sample variance of $Y$, $\hat\sigma_j^2$ is the residual variance after regressing $Y$ on $X$ within the $j$th segment, and $n_j$ is the number of observations in the $j$th segment $(j = 1, \ldots, K)$. Because $1 - LR_n^{-2/n}$ is a transformation of the likelihood ratio and converges to the square of the Pearson correlation coefficient under the linear model, we perform the same transformation on $LR_n$ here. The resulting test statistic converges to a quantity related to the conditional mean and the conditional variance of $Y$ given $X$. It is easy to show that, as $n \to \infty$,
$$1 - LR_n^{-2/n} \to 1 - \frac{\exp\left\{ \sum_{j=1}^{K} p_j \log \sigma_j^2 \right\}}{\mathrm{var}(Y)}, \qquad p_j = \mathrm{pr}(c_{j-1} < X \leq c_j). \qquad (1)$$
When $K = 1$, the relationship degenerates to a simple linear relation, and the limit in (1) is exactly the population R-squared, $1 - \sigma_1^2 / \mathrm{var}(Y)$. More generally, because a piecewise-linear function can approximate any almost-everywhere continuous function, we can employ the same hypothesis testing framework as above to derive (1) for any such approximation.
Thus, for any pair of random variables $(X, Y)$, the following concept is a natural generalization of R-squared:
$$G^2_{Y \mid X} = 1 - \frac{\exp\left[ E\{ \log \mathrm{var}(Y \mid X) \} \right]}{\mathrm{var}(Y)},$$
in which we require that $\mathrm{var}(Y) < \infty$. Evidently, $G^2_{Y \mid X}$ lies between 0 and 1, and is equal to zero if and only if both $E(Y \mid X)$ and $\mathrm{var}(Y \mid X)$ are constant. The definition of $G^2_{Y \mid X}$ is closely related to the R-squared defined by segmented regression (Oosterbaan & Ritzema, 2006), discussed in the Supplementary Material. We symmetrize $G^2_{Y \mid X}$ and $G^2_{X \mid Y}$ to arrive at the following quantity as the definition of the G-squared statistic:
$$G^2 = \max\left( G^2_{Y \mid X},\; G^2_{X \mid Y} \right),$$
provided $\mathrm{var}(X) < \infty$ and $\mathrm{var}(Y) < \infty$. Thus, $G^2 = 0$ if and only if $E(Y \mid X)$, $\mathrm{var}(Y \mid X)$, $E(X \mid Y)$ and $\mathrm{var}(X \mid Y)$ are all constant, which is not equivalent to independence of $X$ and $Y$. In practice, however, dependent cases with $G^2 = 0$ are rare.
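To see why $G^2$ detects what $R^2$ misses, consider the toy model $Y = X^2 + \varepsilon$ with $X \sim N(0,1)$ and $\varepsilon \sim N(0, 0.25)$ independent of $X$: the conditional variance is constant at 0.25, so $G^2_{Y \mid X} = 1 - 0.25/\mathrm{var}(Y) \approx 0.89$, whereas the Pearson correlation is zero. The short sketch below, a hypothetical illustration using the known conditional variance rather than the estimators of § 2.2, verifies this by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sigma2 = 0.25

x = rng.normal(size=n)
y = x ** 2 + rng.normal(scale=np.sqrt(sigma2), size=n)

# G^2_{Y|X} = 1 - exp(E[log var(Y|X)]) / var(Y); here var(Y|X) = sigma2 exactly.
g2_y_given_x = 1.0 - np.exp(np.log(sigma2)) / np.var(y)
r2 = np.corrcoef(x, y)[0, 1] ** 2

print(round(g2_y_given_x, 3))   # close to 1 - 0.25/2.25, about 0.89
print(round(r2, 5))             # essentially zero
```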
2.2. Estimation of $G^2$
Without loss of generality, we focus on the estimation of $G^2_{Y \mid X}$; $G^2_{X \mid Y}$ can be estimated in the same way by interchanging $X$ and $Y$. When $Y = f(X) + \varepsilon$ with $\varepsilon \sim N\{0, \sigma^2(X)\}$ for an almost-everywhere continuous function $f$, we can use a piecewise-linear function to approximate $f$ and estimate $G^2_{Y \mid X}$. However, in practice the number and locations of the breakpoints are unknown. We propose two estimators of $G^2$, the first aiming to find the maximum penalized likelihood ratio among all possible piecewise-linear approximations, and the second focusing on a Bayesian average of all such approximations.
Suppose that we have sorted the $n$ independent observations, $(x_{(i)}, y_{(i)})$ $(i = 1, \ldots, n)$, such that $x_{(1)} \leq \cdots \leq x_{(n)}$. For the set of breakpoints, we only need to consider the observed values $x_{(1)}, \ldots, x_{(n-1)}$. Each interval between two consecutive breakpoints is called a slice of the observations, so that $K - 1$ breakpoints divide the range of $X$ into $K$ non-overlapping slices. Let $n_j$ denote the number of observations in slice $j$, and let $S(X)$ denote a slicing scheme of $X$, i.e., $S(x_{(i)}) = j$ if $x_{(i)}$ falls into slice $j$, which is abbreviated as $S$ whenever the meaning is clear. Let $|S|$ be the number of slices in $S$ and let $m_S$ denote the minimum size of all the slices.
To avoid overfitting when maximizing loglikelihood ratios both over unknown parameters and over all possible slicing schemes, we restrict the minimum size of each slice to be at least $m$ and maximize the loglikelihood ratio with a penalty on the number of slices. For simplicity, let $m = \lceil n^{1/2} \rceil$. Thus, we focus on the penalized loglikelihood ratio
$$D(S; \lambda_0) = \frac{2}{n}\left\{ \log LR(S) - \lambda_0 (|S| - 1) \log n \right\}, \qquad (2)$$
where $LR(S)$ is the likelihood ratio for slicing scheme $S$ and $\lambda_0 \log n$ is the penalty incurred for one additional slice. From a Bayesian perspective, this is equivalent to assigning the prior distribution for the number of slices to be proportional to $n^{-\lambda_0 (|S| - 1)}$. Suppose that each observation $x_{(i)}$ has probability $p_n$ of being a breakpoint independently. Then the probability of a slicing scheme $S$ is
$$\mathrm{pr}(S) = p_n^{|S| - 1} (1 - p_n)^{n - |S|} \propto \left( \frac{p_n}{1 - p_n} \right)^{|S| - 1},$$
which matches this prior when $p_n / (1 - p_n) = n^{-\lambda_0}$. When $\lambda_0 = 3/2$, the statistic $D(S; \lambda_0)$ is equivalent to the Bayesian information criterion (Schwarz, 1978) up to a constant.
Treating the slicing scheme as a nuisance parameter, we can maximize $D(S; \lambda_0)$ over all allowable slicing schemes to obtain
$$D_{Y \mid X}(\lambda_0) = \max_{S:\, m_S \geq m} D(S; \lambda_0).$$
Our first estimator of $G^2_{Y \mid X}$, which we call $\hat G^2_{Y \mid X, m}$ with m standing for the maximum likelihood ratio, can be defined as
$$\hat G^2_{Y \mid X, m} = 1 - \exp\{ -D_{Y \mid X}(\lambda_0) \}.$$
Hence, the overall G-squared can be estimated as
$$\hat G^2_m = \max\left( \hat G^2_{Y \mid X, m},\; \hat G^2_{X \mid Y, m} \right).$$
By definition, $\hat G^2_m$ lies between 0 and 1, and $\hat G^2_m = \hat R^2$ when the optimal slicing schemes for both directions have only one slice. Later, we will show that when $X$ and $Y$ follow a bivariate normal distribution, $\hat G^2_m = \hat R^2$ almost surely for large $n$.
Another attractive way to estimate $G^2_{Y \mid X}$ is to integrate out the nuisance slicing scheme parameter. A full Bayesian approach would require us to compute the Bayes factor (Kass & Raftery, 1995), which may be undesirable since we do not wish to impose too strong a modelling assumption. On the other hand, however, the Bayesian formalism may guide us to a desirable integration strategy for the slicing scheme. We therefore put the problem into a Bayes framework and compute the Bayes factor for comparing the null and alternative models. The null model is only one model, while the alternative is any piecewise-linear model, possibly with countably infinite pieces. Let $p_0(\boldsymbol{y})$ be the marginal probability of the data under the null. Let $p(S)$ be the prior probability for slicing scheme $S$ and let $p(\boldsymbol{y} \mid S)$ denote the marginal probability of the data under $S$. The Bayes factor can be written as
$$BF = \frac{\sum_{S:\, m_S \geq m} p(S)\, p(\boldsymbol{y} \mid S)}{p_0(\boldsymbol{y})}, \qquad (3)$$
where $m_S$ is the minimum size of all the slices of $S$. The marginal probabilities are not easy to compute even with proper priors. Schwarz (1978) states that if the data distribution is in the exponential family and the parameter is of dimension $d$, the marginal probability of the data can be approximated as
$$p(\boldsymbol{y}) \approx L(\hat\theta)\, n^{-d/2}, \qquad (4)$$
where $L(\hat\theta)$ is the maximized likelihood. In our set-up, the number of parameters for the null model is 2, and for an alternative model with a slicing scheme $S$ it is $3|S|$. Inserting expression (4) into both the numerator and the denominator of (3), we obtain
$$BF \approx \sum_{S:\, m_S \geq m} p(S)\, LR(S)\, n^{-(3|S| - 2)/2}. \qquad (5)$$
If we take $p(S) \propto n^{-\lambda_0 (|S| - 1)}$, which corresponds to the penalty term in (2) and is involved in defining $\hat G^2_{Y \mid X, m}$, the approximated Bayes factor can be restated as
$$BF(\lambda_0) = \frac{\sum_{S:\, m_S \geq m} LR(S)\, n^{-(\lambda_0 + 3/2)(|S| - 1)}}{n^{1/2} \sum_{S:\, m_S \geq m} n^{-\lambda_0 (|S| - 1)}}. \qquad (6)$$
As we will discuss in § 2.5, $BF(\lambda_0)$ can serve as a marginal likelihood function for $\lambda_0$ and be used to find an optimal $\lambda_0$ suitable for a particular dataset. This quantity also looks like an average version of $LR(S)$, but with an additional penalty. Since $BF(\lambda_0)$ can take values below 1, its transformation $1 - BF(\lambda_0)^{-2/n}$, as in the case where we derived $\hat R^2$ via the likelihood ratio test, can take negative values, especially when $X$ and $Y$ are independent. It is therefore not an ideal estimator of $G^2_{Y \mid X}$.
By removing the model size penalty term in (5), we obtain a modified version, which is simply a weighted average of the likelihood ratios and is guaranteed to be greater than or equal to 1:
$$\widetilde{LR}(\lambda_0) = \frac{\sum_{S:\, m_S \geq m} LR(S)\, n^{-\lambda_0 (|S| - 1)}}{\sum_{S:\, m_S \geq m} n^{-\lambda_0 (|S| - 1)}}.$$
We can thus define a quantity similar to our likelihood formulation of R-squared,
$$\hat G^2_{Y \mid X, t} = 1 - \widetilde{LR}(\lambda_0)^{-2/n},$$
which we call the total G-squared, and define
$$\hat G^2_t = \max\left( \hat G^2_{Y \mid X, t},\; \hat G^2_{X \mid Y, t} \right).$$
We show later that $\hat G^2_m$ and $\hat G^2_t$ are both consistent estimators of $G^2$.
2.3. Theoretical properties of the estimators
In order to show that $\hat G^2_m$ and $\hat G^2_t$ converge to $G^2$ as the sample size goes to infinity, we introduce the notation $\mu_{Y \mid X}(x) = E(Y \mid X = x)$, $\sigma^2_{Y \mid X}(x) = \mathrm{var}(Y \mid X = x)$, $\mu_{X \mid Y}(y) = E(X \mid Y = y)$ and $\sigma^2_{X \mid Y}(y) = \mathrm{var}(X \mid Y = y)$, and assume the following regularity conditions.
Condition 1.
The random variables $X$ and $Y$ are bounded continuous random variables with finite variances such that $\sigma^2_{Y \mid X}(x) \leq C_1$ and $\sigma^2_{X \mid Y}(y) \leq C_1$ almost everywhere for some constant $C_1 > 0$.
Condition 2.
The functions $\mu_{Y \mid X}(\cdot)$, $\sigma^2_{Y \mid X}(\cdot)$, $\mu_{X \mid Y}(\cdot)$ and $\sigma^2_{X \mid Y}(\cdot)$ have continuous derivatives almost everywhere.
Condition 3.
There exists a constant $c_0 > 0$ such that $\sigma^2_{Y \mid X}(x) \geq c_0$ and $\sigma^2_{X \mid Y}(y) \geq c_0$ almost surely.
With these preparations, we can state our main results.
Theorem 1.
Under Conditions 1–3, for all $\lambda_0 > 0$, $\hat G^2_{Y \mid X, m} \to G^2_{Y \mid X}$ and $\hat G^2_{Y \mid X, t} \to G^2_{Y \mid X}$ almost surely as $n \to \infty$. Thus, $\hat G^2_m$ and $\hat G^2_t$ are consistent estimators of $G^2$.
A proof of the theorem and numerical studies of the estimators’ consistency are provided
in the Supplementary Material.
It is expected that $\hat G^2_m$ should converge to $G^2$ because of the way it is constructed. It is surprising that $\hat G^2_t$ also converges to $G^2$. The result, which links $G^2$ estimation with the likelihood ratio and Bayesian formalism, suggests that most of the information up to the second moment has been fully utilized in the two test statistics. The theorem thus supports the use of $\hat G^2_m$ and $\hat G^2_t$ for testing whether $X$ and $Y$ are independent. The null distributions of the two statistics depend on the marginal distributions of $X$ and $Y$, and can be generated empirically using permutation. One can also perform a quantile-based transformation on $X$ and $Y$ so that their marginal distributions become standard normal; however, the test based on the transformed data tends to lose some power.
When $X$ and $Y$ are bivariate normal, the G-squared statistic is almost the same as the R-squared statistic when $n$ is large enough.
Theorem 2.
If $X$ and $Y$ follow a bivariate normal distribution, then for $n$ large enough the optimal slicing schemes in both directions contain only a single slice almost surely. So, for $\hat G^2_m$ and $\hat G^2_t$, we have $\hat G^2_m = \hat R^2$ and $\hat G^2_t \to R^2$ almost surely.
The lower bound on $n$ is not tight and can be relaxed in practice. Empirically, we have observed that a moderate sample size is large enough for $\hat G^2_m$ to be very close to $\hat R^2$ in the bivariate normal setting.
2.4. Dynamic programming algorithm for computing $\hat G^2_m$ and $\hat G^2_t$
The brute-force calculation of either $\hat G^2_m$ or $\hat G^2_t$ has a computational complexity of $O(2^n)$ and is prohibitive in practice. Fortunately, we have found a dynamic programming scheme for computing both quantities with a time complexity of only $O(n^2)$. The algorithms for computing $\hat G^2_m$ and $\hat G^2_t$ are roughly the same except for one operation, namely maximization versus summation, and can be summarized by the following steps.
Step 1.
(Data preparation). Arrange the observed pairs $(x_i, y_i)$ according to the $x$ values sorted from low to high. Then normalize the $x_i$ and $y_i$ such that each has sample mean 0 and sample variance 1.
Step 2.
(Main algorithm). Define $m = \lceil n^{1/2} \rceil$ as the smallest slice size, and let $w_n = n^{-\lambda_0}$ be the weight attached to each additional slice. Initialize three sequences, $(M_i)$, $(B_i)$ and $(T_i)$, with $M_0 = B_0 = T_0 = 1$ and $M_i = B_i = T_i = 0$ for $0 < i < m$. For $i = m, \ldots, n$, recursively fill in entries of the tables with
$$M_i = \max_{j \in J_i} M_j\, w_n\, (\hat\sigma^2_{j,i})^{-(i - j)/2}, \quad B_i = \sum_{j \in J_i} B_j\, w_n, \quad T_i = \sum_{j \in J_i} T_j\, w_n\, (\hat\sigma^2_{j,i})^{-(i - j)/2},$$
where $J_i = \{0\} \cup \{j : m \leq j \leq i - m\}$ indexes the allowable positions of the most recent breakpoint, and $\hat\sigma^2_{j,i}$ is the residual variance of regressing $y$ on $x$ for observations $j + 1, \ldots, i$.
Step 3.
The final result is
$$\hat G^2_{Y \mid X, m} = 1 - (M_n / w_n)^{-2/n}, \qquad \hat G^2_{Y \mid X, t} = 1 - (T_n / B_n)^{-2/n}.$$
Here, $M_i$ stores the partial maximized likelihood ratio up to the $i$th ordered observation; $B_i$ stores the partial normalizing constant; and $T_i$ stores the partial sum of the likelihood ratios. When $n$ is extremely large, we can speed up the algorithm by considering fewer slicing schemes. For example, we can divide the observations into chunks of size $m$ by rank and consider only slicing schemes with breakpoints between the chunks. For this method, the computational complexity is reduced to $O\{(n/m)^2\}$, which is $O(n)$ when $m$ is of order $n^{1/2}$. We can compute $\hat G^2_{X \mid Y, m}$ and $\hat G^2_{X \mid Y, t}$ similarly, interchanging $X$ and $Y$, to get $\hat G^2_m$ and $\hat G^2_t$. Empirically, the algorithm is faster than many other powerful methods, as shown in the Supplementary Material.
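The recursion can be prototyped in a few dozen lines. The sketch below follows the tables $M$, $B$ and $T$ as reconstructed above, uses prefix sums for the within-slice regressions and works on the log scale to avoid overflow; it is run separately in each direction and the two results are combined by taking the maximum. The function name, the default $\lambda_0 = 1.5$ and the toy data are illustrative choices rather than a reference implementation.

```python
import numpy as np

def g2_estimates_one_direction(x, y, lam0=1.5):
    """O(n^2) dynamic-programme sketch for G^2_{Y|X,m} and G^2_{Y|X,t}."""
    n = len(x)
    order = np.argsort(x)
    x = (x[order] - x.mean()) / x.std()
    y = (y[order] - y.mean()) / y.std()     # overall variance of y is now 1
    m = int(np.ceil(np.sqrt(n)))            # smallest allowed slice size

    # prefix sums give O(1) within-slice regression residual variances
    cx = np.concatenate([[0.0], np.cumsum(x)]);  cy = np.concatenate([[0.0], np.cumsum(y)])
    cxx = np.concatenate([[0.0], np.cumsum(x * x)])
    cyy = np.concatenate([[0.0], np.cumsum(y * y)])
    cxy = np.concatenate([[0.0], np.cumsum(x * y)])

    def resid_var(j, i):
        """Residual variance of regressing y on x for observations j+1,...,i."""
        k = i - j
        sx, sy = cx[i] - cx[j], cy[i] - cy[j]
        vxx = (cxx[i] - cxx[j]) - sx * sx / k
        vyy = (cyy[i] - cyy[j]) - sy * sy / k
        vxy = (cxy[i] - cxy[j]) - sx * sy / k
        rss = vyy - (vxy * vxy / vxx if vxx > 1e-12 else 0.0)
        return max(rss / k, 1e-12)

    NEG = -np.inf
    logM = np.full(n + 1, NEG); logB = np.full(n + 1, NEG); logT = np.full(n + 1, NEG)
    logM[0] = logB[0] = logT[0] = 0.0
    logw = -lam0 * np.log(n)                # log weight per slice
    for i in range(m, n + 1):
        cands = [0] + list(range(m, i - m + 1))
        best, sumB, sumT = NEG, [], []
        for j in cands:
            if logM[j] == NEG:
                continue
            log_lr_slice = -0.5 * (i - j) * np.log(resid_var(j, i))
            best = max(best, logM[j] + logw + log_lr_slice)
            sumB.append(logB[j] + logw)
            sumT.append(logT[j] + logw + log_lr_slice)
        logM[i] = best
        logB[i] = np.logaddexp.reduce(sumB) if sumB else NEG
        logT[i] = np.logaddexp.reduce(sumT) if sumT else NEG

    g2m = 1.0 - np.exp(-(2.0 / n) * (logM[n] - logw))   # 1 - (M_n / w_n)^{-2/n}
    g2t = 1.0 - np.exp(-(2.0 / n) * (logT[n] - logB[n]))  # 1 - (T_n / B_n)^{-2/n}
    return g2m, g2t

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=200)
print(g2_estimates_one_direction(x, y))    # direction Y | X
print(g2_estimates_one_direction(y, x))    # direction X | Y; take the maximum
```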
2.5. An empirical Bayes strategy for selecting $\lambda_0$
Although the choice of the penalty parameter $\lambda_0$ is not critical for the general use of $G^2$, we typically take $\lambda_0 = 3/2$ for $\hat G^2_m$ and $\hat G^2_t$ because $D(S; 3/2)$ is equivalent to the Bayesian information criterion. Fine-tuning $\lambda_0$ can improve the estimation of $G^2$; we therefore propose a data-driven strategy for choosing $\lambda_0$ adaptively. The quantity $BF(\lambda_0)$ in (6) can be viewed as an approximation to the marginal likelihood of $\lambda_0$ up to a normalizing constant. Hence we can use the maximum likelihood principle to choose the $\lambda_0$ that maximizes $BF(\lambda_0)$. We then use the chosen $\lambda_0$ to compute $\hat G^2_m$ and $\hat G^2_t$ as estimators of $G^2$. In practice, we evaluate $BF(\lambda_0)$ for a finite set of $\lambda_0$ values, such as $\{0.5, 1.5, 2.5, 3.5\}$, and pick the value that maximizes $BF(\lambda_0)$; $BF(\lambda_0)$ can be computed efficiently via a dynamic programming algorithm similar to that described in § 2.4.
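The selection rule can be prototyped directly from the approximate Bayes factor (6) as reconstructed above. The brute-force sketch below is illustrative only: the helper names, the toy data and the constants follow the formulas given here, and in practice the dynamic programming computation should be used instead of enumeration.

```python
import itertools
import numpy as np

def scheme_log_lrs(x, y):
    """All slicing schemes (min slice size ceil(sqrt(n))) of the x-sorted data;
    returns an array of (number of slices, log LR(S)) pairs."""
    n = len(x)
    m = int(np.ceil(np.sqrt(n)))
    order = np.argsort(x)
    x, y = x[order], y[order]
    s2_all = np.var(y)
    out = []
    for r in range(0, n // m):
        for cuts in itertools.combinations(range(m, n - m + 1), r):
            bounds = (0,) + cuts + (n,)
            if any(b - a < m for a, b in zip(bounds, bounds[1:])):
                continue
            loglr = 0.0
            for a, b in zip(bounds, bounds[1:]):
                X = np.column_stack([np.ones(b - a), x[a:b]])
                beta, *_ = np.linalg.lstsq(X, y[a:b], rcond=None)
                s2 = max(np.mean((y[a:b] - X @ beta) ** 2), 1e-12)
                loglr += 0.5 * (b - a) * (np.log(s2_all) - np.log(s2))
            out.append((r + 1, loglr))
    return np.array(out)

def log_bayes_factor(schemes, n, lam0):
    """log BF(lam0) as in (6): penalized sum over schemes divided by the
    prior normalizing constant and the n^{1/2} factor."""
    k, loglr = schemes[:, 0], schemes[:, 1]
    num = np.logaddexp.reduce(loglr - (lam0 + 1.5) * (k - 1) * np.log(n))
    den = np.logaddexp.reduce(-lam0 * (k - 1) * np.log(n))
    return num - den - 0.5 * np.log(n)

rng = np.random.default_rng(2)
n = 40
x = rng.uniform(-1, 1, n)
y = np.sin(3 * np.pi * x) + 0.2 * rng.normal(size=n)

schemes = scheme_log_lrs(x, y)
grid = [0.5, 1.5, 2.5, 3.5]
scores = {lam: log_bayes_factor(schemes, n, lam) for lam in grid}
best = max(scores, key=scores.get)
print(best, scores)   # lambda_0 maximizing the approximate marginal likelihood
```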
As an illustration, we consider the sampling distributions of $\hat G^2_m$ and $\hat G^2_t$ with fixed values of $\lambda_0$ and with $\lambda_0$ selected by the empirical Bayes strategy, for the following two scenarios:
Example 1.
$Y$ is a linear function of $X$ plus additive Gaussian noise.
Example 2.
$Y$ is a trigonometric function of $X$ plus additive Gaussian noise.
For each model, we set the noise level so that $G^2$ was held at a fixed value, simulated independent data points, and performed 1000 replications.
Figure 1 shows histograms of $\hat G^2_m$ and $\hat G^2_t$ with different $\lambda_0$ values. The results demonstrate that for relationships which can be approximated well by a linear function, a larger $\lambda_0$ is preferred because it penalizes the number of slices more heavily, so that the resulting sampling distributions are less biased. On the other hand, for complicated relationships such as trigonometric functions, a smaller $\lambda_0$ is preferable because it allows more slices, which can help to capture fluctuations in the functional relationship. The figure also shows that the empirical Bayes selection of $\lambda_0$ worked very well, leading to a proper choice of $\lambda_0$ for each simulated dataset from both examples and resulting in the most accurate estimates of $G^2$.
Additional simulation studies and discussion of the consistency of the data-driven
strategy can be found in the Supplementary Material.
Fig. 1.
Sampling distributions of $\hat G^2_m$ and $\hat G^2_t$ under the two models described in § 2.5, for $\lambda_0 = 0.5$ (dashed), 1.5 (dotted), 2.5 (dot-dash) and 3.5 (solid). The density function in each case is estimated by the histogram. The sampling distributions of $\hat G^2_m$ and $\hat G^2_t$ with the empirical Bayes selection of $\lambda_0$ are shaded grey and overlaid on top of the other density functions.
3. POWER ANALYSIS
Next, we compare the power of different independence testing methods for various relationships. Here we again fixed $\lambda_0 = 3/2$ for both $\hat G^2_m$ and $\hat G^2_t$. Other methods we tested include the alternating conditional expectation (Breiman & Friedman, 1985), Genest's test (Genest & Rémillard, 2004), Pearson correlation, distance correlation (Székely et al., 2007), the method of Heller et al. (2016), the characteristic function method (Kankainen & Ushakov, 1998), Hoeffding's test (Hoeffding, 1948), the mutual information method (Kraskov et al., 2004), and two estimators based on the maximum information coefficient (Reshef et al., 2011). We follow the procedure for computing the powers of different methods as described in Reshef et al. (arXiv:1505.02214) and a 2012 online note by N. Simon and R. J. Tibshirani.
For different functional relationships $f$ and different noise levels $\sigma$, we let
$$Y = f(X) + \sigma \varepsilon,$$
where $X \sim \mathrm{Un}(0, 1)$ and $\varepsilon \sim N(0, 1)$ independently of $X$. Thus the noise level $\sigma$ is a monotone function of the signal-to-noise ratio, and it is of interest to observe how the performances of different methods deteriorate as the signal strength weakens for various functional relationships. We used permutation to generate the null distribution and to set the rejection region in all cases.
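To make the simulation protocol concrete, the generic sketch below estimates the power of a permutation test under the $Y = f(X) + \sigma\varepsilon$ setup as reconstructed here; the squared Pearson correlation is used only as a placeholder statistic, and the sample sizes, replication counts and function choices are illustrative.

```python
import numpy as np

def permutation_power(f, sigma, stat, n=100, alpha=0.05, n_perm=200, n_rep=200, seed=0):
    """Monte Carlo power of a permutation test based on the statistic `stat`
    for detecting dependence in Y = f(X) + sigma * eps, with X ~ Un(0, 1)."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_rep):
        x = rng.uniform(0, 1, n)
        y = f(x) + sigma * rng.normal(size=n)
        observed = stat(x, y)
        # permutation null: break the pairing between x and y
        null = np.array([stat(x, rng.permutation(y)) for _ in range(n_perm)])
        rejections += observed > np.quantile(null, 1 - alpha)
    return rejections / n_rep

pearson_sq = lambda x, y: np.corrcoef(x, y)[0, 1] ** 2   # placeholder statistic

for f, name in [(lambda x: x, 'linear'), (lambda x: np.sin(8 * np.pi * x), 'high-freq sine')]:
    print(name, permutation_power(f, sigma=0.5, stat=pearson_sq))
```

As expected, the placeholder Pearson statistic retains power for the linear relationship but not for the high-frequency sine, which is precisely the kind of case the $G^2$ statistics are designed to handle.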
Figure 2 shows power comparisons for eight functional relationships. We fixed the sample size and performed 1000 replications for each relationship and each noise level. For the sake of clarity, here we plot only Pearson correlation, distance correlation, the method of Heller et al. (2016), the maximum information coefficient, $\hat G^2_m$ and $\hat G^2_t$. For any method with tuning parameters, we chose the parameter values that resulted in the highest average power over all the examples. Due to computational concerns, we used a relatively small value of the tuning parameter for the method of Heller et al. (2016). It can be seen that $\hat G^2_m$ and $\hat G^2_t$ performed robustly, and were always among the most powerful methods, with $\hat G^2_t$ being slightly more powerful than $\hat G^2_m$ in nearly all the examples. They outperformed the other methods in cases such as the high-frequency sine, triangle and piecewise-constant functions, where piecewise-linear approximation is more appropriate than other approaches. For monotonic examples such as linear and radical relationships, $\hat G^2_m$ and $\hat G^2_t$ had slightly lower power than Pearson correlation, distance correlation and the method of Heller et al. (2016), but were still highly competitive.
Fig. 2.
The powers of $\hat G^2_t$ (black solid), $\hat G^2_m$ (grey solid), Pearson correlation (grey circles), distance correlation (black dashed), the method of Heller et al. (2016) (black dotted) and the maximum information coefficient (black circles) for testing independence between $X$ and $Y$ when the underlying true functional relationships are linear, quadratic, cubic, radical, low-frequency sine, triangle, high-frequency sine, and piecewise constant. The horizontal axis represents the noise level $\sigma$, a monotone function of the signal-to-noise ratio, and the vertical axis is the power. We fixed the sample size and performed 1000 replications for each relationship and each noise level.
We also studied the performances of these methods for other sample sizes, including 100 and 400, and found that $\hat G^2_m$ and $\hat G^2_t$ still had high power regardless of the sample size, although their advantages were much less obvious when $n$ was small. More details can be found in the Supplementary Material.
4. DISCUSSION
The proposed G-squared statistic can be viewed as a direct generalization of the R-squared statistic. While maintaining the same interpretability as the R-squared statistic, the G-squared statistic is also a powerful measure of dependence for general relationships. Instead of resorting to curve-fitting methods to estimate the underlying relationship and the G-squared statistic, we employed piecewise-linear approximations with penalties and dynamic programming algorithms. Although we have considered only piecewise-linear functions, one could potentially approximate a relationship between two variables using piecewise polynomials or other flexible basis functions, with perhaps additional penalty terms to control the complexity. Furthermore, it would be worthwhile to generalize the slicing idea to testing dependence between two multivariate random variables.
ACKNOWLEDGEMENT
We are grateful to the two referees for helpful comments and suggestions. This research was supported in part by the U.S. National Science Foundation and National Institutes of Health. We thank Ashley Wang for her proofreading of the paper. The views expressed herein are the authors’ alone and are not necessarily the views of Two Sigma Investments, Limited Partnership, or any of its affiliates.
SUPPLEMENTARY MATERIAL
Supplementary material available at Biometrika online includes proofs of the theorems, software implementation details, discussions on segmented regression, a study of equitability, and more simulation results.
References
- Blyth, S. (1994). Local divergence and association. Biometrika 81, 579–84.
- Breiman, L. & Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. J. Am. Statist. Assoc. 80, 580–98.
- Doksum, K., Blyth, S., Bradlow, E., Meng, X. & Zhao, H. (1994). Correlation curves as local measures of variance explained by regression. J. Am. Statist. Assoc. 89, 571–82.
- Genest, C. & Rémillard, B. (2004). Tests of independence and randomness based on the empirical copula process. Test 13, 335–69.
- Gretton, A., Bousquet, O., Smola, A. & Schölkopf, B. (2005). Measuring statistical dependence with Hilbert–Schmidt norms. In Algorithmic Learning Theory, Lect. Notes Comput. Sci. 3734, 63–77.
- Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. & Smola, A. (2012). A kernel two-sample test. J. Mach. Learn. Res. 13, 723–73.
- Heller, R., Heller, Y., Kaufman, S., Brill, B. & Gorfine, M. (2016). Consistent distribution-free K-sample and independence tests for univariate random variables. J. Mach. Learn. Res. 17, 1–54.
- Hoeffding, W. (1948). A non-parametric test of independence. Ann. Math. Statist. 19, 546–57.
- Hušková, M. & Meintanis, S. (2008). Testing procedures based on the empirical characteristic functions I: Goodness-of-fit, testing for symmetry and independence. Tatra Mt. Math. Publ. 39, 225–33.
- Kankainen, A. & Ushakov, N. G. (1998). A consistent modification of a test for independence based on the empirical characteristic function. J. Math. Sci. 89, 1486–94.
- Kass, R. E. & Raftery, A. E. (1995). Bayes factors. J. Am. Statist. Assoc. 90, 773–95.
- Kraskov, A., Stögbauer, H. & Grassberger, P. (2004). Estimating mutual information. Phys. Rev. E 69, 066138.
- Oosterbaan, R. J. & Ritzema, H. P. (2006). Drainage Principles and Applications. Wageningen: International Institute for Land Reclamation and Improvement, pp. 217–20.
- Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M. & Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science 334, 1518–24.
- Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461–4.
- Sejdinovic, D., Sriperumbudur, B., Gretton, A. & Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist. 41, 2263–91.
- Székely, G. J. & Rizzo, M. L. (2009). Brownian distance covariance. Ann. Appl. Statist. 3, 1236–65.
- Székely, G. J., Rizzo, M. L. & Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. Ann. Statist. 35, 2769–94.