Space–time clustering and the permutation moments of quadratic forms

Yi-Hui Zhou; Gregory Mayhew; Zhibin Sun; Xiaolin Xu; Fei Zou; Fred A Wright

doi:10.1002/sta4.37

Stat. 2013 Nov 29;2(1):292–302. doi: 10.1002/sta4.37

PMCID: PMC4157666 NIHMSID: NIHMS593632 PMID: 25210205

Abstract

The Mantel and Knox space–time clustering statistics are popular tools to establish transmissibility of a disease and detect outbreaks. The most commonly used null distributional approximations may provide poor fits, and researchers often resort to direct sampling from the permutation distribution. However, the exact first four moments for these statistics are available, and Pearson distributional approximations are often effective. Thus, our first goals are to clarify the literature and make these tools more widely available. In addition, by rewriting terms in the statistics, we obtain the exact first four permutation moments for the most commonly used quadratic form statistics, which need not be positive definite. The extension of this work to quadratic forms greatly expands the utility of density approximations for these problems, including for high-dimensional applications, where the statistics must be extreme in order to exceed stringent testing thresholds. We demonstrate the methods using examples from the investigation of disease transmission in cattle, the association of a gene expression pathway with breast cancer survival, regional genetic association with cystic fibrosis lung disease and hypothesis testing for smoothed local linear regression. © The Authors. Stat published by John Wiley & Sons Ltd.

Keywords: exact testing, resampling, statistical computing

1. Introduction

Mantel (1967) proposed an approach to detect clustering of the location of events in space versus their time of occurrence, by regressing a function of geographic distance on a function of distance in time. The prototypical application is to evaluate the evidence for communicable disease transmission, in contrast to sporadic occurrences that show no clustering. The approach has proven to be hugely popular, with 5200+ citations in the Science Citation Index as of 2013 and approximately 450 citations per year in recent years. Briefly, we let l_i and t_i represent the geographic location (space) and time of occurrence for the ith location–time sample, i = 1, …, n. For samples i and j, we denote measures of location and time distances as c_ij = f(l_i,l_j), d_ij = g(t_i,t_j), and these elements populate the matrices C and D, respectively. For the final “regression” statistic

$$S = \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} d_{ij}, \qquad (1)$$

high values are evidence of location–time clustering, and the author considered the power of various choices of f and g. He also noted that the Knox statistic, which records whether two locations or time points are less distant than predefined thresholds, is a special case. In addition, the paper solved for the mean and variance of S under permutation of sample labels for the location and time points. This permutation is equivalent to simultaneous permutation of rows and columns of one of the matrices C or D.
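The definition of S and the row/column permutation scheme can be illustrated with a small sketch. Python is used purely for illustration (the authors' software is in R), the data are invented, and f and g are both taken to be absolute differences, which is an assumption of this example rather than a recommendation. The closed-form mean used for comparison, (Σ c_ij)(Σ d_ij)/(n(n − 1)), holds for zero-diagonal C and D because each off-diagonal entry of the permuted D is equally likely to land in any off-diagonal position.

```python
from itertools import permutations

def mantel_stat(C, D, perm):
    """S = sum_i sum_j c_ij * d_{perm[i], perm[j]} (rows and columns of D permuted together)."""
    n = len(C)
    return sum(C[i][j] * D[perm[i]][perm[j]] for i in range(n) for j in range(n))

def distance_matrix(values):
    """Absolute pairwise differences; symmetric with a zero diagonal."""
    return [[abs(a - b) for b in values] for a in values]

# Toy data (made up for illustration): positions in km and onset times in days.
locations = [0.0, 1.5, 2.0, 7.0, 9.5]
times = [3.0, 4.0, 10.0, 30.0, 31.0]
C = distance_matrix(locations)
D = distance_matrix(times)
n = len(C)

# Observed statistic: the identity permutation.
s_obs = mantel_stat(C, D, tuple(range(n)))

# Exact permutation distribution by enumerating all n! relabelings (tiny n only).
perm_values = [mantel_stat(C, D, p) for p in permutations(range(n))]
mean_enumerated = sum(perm_values) / len(perm_values)

# Closed-form permutation mean for zero-diagonal C and D.
mean_closed = sum(map(sum, C)) * sum(map(sum, D)) / (n * (n - 1))

# Exact one-sided permutation p-value P(S_Pi >= s_obs), with a small float tolerance.
p_value = sum(v >= s_obs - 1e-12 for v in perm_values) / len(perm_values)

print(s_obs, mean_enumerated, mean_closed, p_value)
```

Full enumeration is only viable for very small n, which is the practical motivation for the moment-based approximations discussed in this paper.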

Much of Mantel's (1967) and subsequent work is concerned with finding powerful choices of f and g, but here, we assume that the statistic has been chosen, and our goal is to provide accurate testing. For numerous datasets, a normal approximation to S is inadequate, because of strong dependencies among the matrix elements. For the Knox statistic, p-values based on Poisson approximations (Knox, 1964) or a normal approximation (David & Barton, 1966) have been used. Improvements to the Mantel and Knox tests for space–time interaction have been proposed in several papers (Kulldorff & Hjalmars, 1999; Diggle et al., 1995; Jacquez, 1996; Baker, 1996) that relax the strong assumptions on the spatial and temporal scales of clustering. In general, however, direct sampling from the permutation distribution has often been thought to be necessary, as enumeration of the n! outcomes is infeasible for most datasets. An alternative approach is to use moment-based density approximations, for which the skewness and kurtosis are important for tail accuracy. Siemiatycki (1978) provided the first four moments of S under permutation, for the most commonly encountered situation in which C and D are symmetric with zero diagonals. The author described graphical patterns to aid in computing expectations of product terms; for example, in c_ij c_kl c_mn c_st, there are 23 distinct patterns of equality/inequality for the eight subscripts. In addition, moments of S were expressed as linear combinations of products of terms of varying order from C and D; the terms for the fourth moment involve nearly 150 non-zero coefficients. Although the bookkeeping is tedious, these operations reduce the complexity from a naive O(n⁸) to O(n³). With this reduction, density approximations become feasible for computing p-values, with reasonable accuracy even for stringent testing thresholds.

The space–time clustering statistic can easily be seen to resemble a quadratic form y^TAy, where y is an n × 1 vector with elements y_i, and A is a symmetric n × n matrix with elements a_ij. This can be seen by rewriting $y^T A y = \sum_i \sum_j a_{ij} y_i y_j$, which is similar to (1), with a_ij and y_iy_j serving the roles of c_ij and d_ij. However, a key difference lies in the diagonals, that is, a_ii and $y_i^2$ are not generally zero. Quadratic forms have been used for location–time clustering (Tango, 1984), but we are not aware that a direct equivalence has been described between the Mantel statistic and a quadratic form over permutations, and for the latter, to our knowledge, only the first two exact moments have been reported (Commenges, 2003). Quadratic forms arise in a number of disciplines, including epidemiology, genomics, and economics. The computation of exact moments enables robust analysis, while avoiding the additional computational cost of direct permutation.

Despite the popularity of the Knox–Mantel and related location–time clustering statistics, and despite a number of packages devoted to location–time surveillance (Robertson & Nelson, 2010), software has not been available to compute the four moments or to obtain the resulting approximate p-values. Similarly, quadratic forms are increasingly used, for example, in genomics problems (Tong et al., 2010). However, standard results for normal quadratic forms may not apply, for example to binary disease traits. The application of quadratic forms to non-normal data is often justified by appealing to asymptotics (Wu et al., 2011), but the use of exact methods may be preferred.

We have developed R code to compute the first four exact moments for the location–time statistic and for centered quadratic forms and to compute approximations to the exact permutation p-values using Pearson density approximations. We believe that the software and methods are useful additions to the statistician's toolkit.

2. Methods

2.1. The location–time statistic S

For symmetric C and D (with zero diagonals), we have implemented the Siemiatycki moment computation. The permutation approach involves simultaneous permutation of rows and columns of one of the matrices (say D), which is equivalent to permutation of the location versus time observations (Mantel, 1967). We use π = 1, …, n! as a subscript to represent a permutation of the n objects, with reordered indexes π[1], …, π[n]. A random permutation is denoted as Π, and our task is to compute the first four moments of $S_\Pi = \sum_i \sum_j c_{ij} d_{\Pi[i],\Pi[j]}$. The key computations are shown in the Appendix, expressed in matrix form to exploit linear algebra routines in R. Approximate p-values are obtained by matching the exact moments to the Pearson family of distributions using the PearsonDS package, which automatically chooses the best-fitting type within the Pearson family.
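As a reference point for the moment computation, the first four permutation moments can be obtained by brute force for very small n. The sketch below is not the Siemiatycki algorithm or the Appendix formulas; it enumerates all n! permutations (cost O(n! · n²)) and is meant only as a ground truth against which a fast implementation could be checked. The data and the helper names `permutation_values` and `first_four_moments` are hypothetical. The resulting mean, variance, skewness and kurtosis are the quantities a Pearson-family fit is matched to.

```python
from itertools import permutations

def permutation_values(C, D):
    """All n! values of sum_i sum_j c_ij * d_{p[i], p[j]} (brute force, tiny n only)."""
    n = len(C)
    return [
        sum(C[i][j] * D[p[i]][p[j]] for i in range(n) for j in range(n))
        for p in permutations(range(n))
    ]

def first_four_moments(values):
    """Mean, variance, skewness and kurtosis of an equally weighted set of values."""
    k = len(values)
    mean = sum(values) / k
    var, m3, m4 = (sum((v - mean) ** r for v in values) / k for r in (2, 3, 4))
    return mean, var, m3 / var ** 1.5, m4 / var ** 2

# Invented coordinates and times; C and D are symmetric with zero diagonals.
coords = [0.0, 1.0, 1.5, 4.0, 6.0, 6.5]
days = [2.0, 3.0, 9.0, 20.0, 21.0, 40.0]
C = [[abs(a - b) for b in coords] for a in coords]
D = [[abs(a - b) for b in days] for a in days]

mean, var, skew, kurt = first_four_moments(permutation_values(C, D))
print(mean, var, skew, kurt)
```

In R, these four numbers would then be passed to a Pearson fitting routine; that step is omitted here because it depends on the PearsonDS interface.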

2.2. Equivalence of the quadratic form statistic S

Here, the statistic is S = y^TAy, for symmetric A with corresponding permutation random variable $S_\Pi = \sum_i \sum_j a_{ij} y_{\Pi[i]} y_{\Pi[j]}$. In many useful applications, A is centered, that is, the rows and columns sum to a constant μ. Here, we will assume μ = 0, essentially without loss of generality, as non-zero μ values will offset S_Π by a constant μy^Ty. Standard normal-theory results typically assume that A is positive definite, and the assumption is necessary for standard χ² distributional approximations. However, relaxing this assumption would considerably increase the variety of problems for which accurate p-values can be obtained. For example, in a genomic context, Zhou & Wright (2013) provided motivation for useful quadratic forms with eigenvalues summing to zero. Kuonen (1999) summarized a number of previous studies of quadratic form approximations, including those that are not positive definite, and described saddlepoint approximations applicable to normally distributed y only.

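The moments computed by Siemiatycki were considerably simplified by assuming zero diagonals for C and D. Here, we describe a simple construction to map the quadratic form to the Mantel statistic. First, we define C = A − diag(A), that is, c_ij = a_ij for i ≠ j and zero otherwise. Then we define D as the matrix with entries $d_{ij} = -(y_i - y_j)^2/2$, and by this definition, each d_ii = 0. Our claim is that, for any π, $\sum_i \sum_j c_{ij} d_{\pi[i],\pi[j]} = \sum_i \sum_j a_{ij} y_{\pi[i]} y_{\pi[j]}$.

Proof

By the constraint, $\sum_j a_{ij} = 0$ for each i, and therefore for any fixed π, we have $\sum_i \sum_j a_{ij} y_{\pi[i]}^2 = 0$, and by the same reasoning, $\sum_i \sum_j a_{ij} y_{\pi[j]}^2 = 0$. We have

$$\sum_i \sum_j c_{ij} d_{\pi[i],\pi[j]} = \sum_i \sum_j a_{ij} d_{\pi[i],\pi[j]}$$

because each d_π[i],π[i] = 0. Expanding the right-hand side gives

$$\sum_i \sum_j a_{ij} y_{\pi[i]} y_{\pi[j]} - \frac{1}{2} \sum_i \sum_j a_{ij} y_{\pi[i]}^2 - \frac{1}{2} \sum_i \sum_j a_{ij} y_{\pi[j]}^2,$$

for which the last two entries are zero. Thus, $\sum_i \sum_j c_{ij} d_{\pi[i],\pi[j]} = y_\pi^T A y_\pi$.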
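The equivalence can be checked numerically for a small case. The sketch below assumes the construction d_ij = −(y_i − y_j)²/2, which has a zero diagonal and satisfies the claim when the rows of A sum to zero; the matrix B, the vector y and the helper names are hypothetical. Note that the A built here is not positive definite, which the argument does not require.

```python
from itertools import permutations

def zero_row_sum_matrix(B):
    """Symmetric matrix with B's off-diagonal entries and a diagonal chosen so rows sum to zero."""
    n = len(B)
    A = [row[:] for row in B]
    for i in range(n):
        A[i][i] = -sum(B[i][j] for j in range(n) if j != i)
    return A

def quadratic_form(A, y):
    """y^T A y written as a double sum."""
    n = len(y)
    return sum(A[i][j] * y[i] * y[j] for i in range(n) for j in range(n))

def mantel_form(C, D, perm):
    """sum_i sum_j c_ij * d_{perm[i], perm[j]}."""
    n = len(C)
    return sum(C[i][j] * D[perm[i]][perm[j]] for i in range(n) for j in range(n))

# Arbitrary symmetric off-diagonal weights and an arbitrary data vector.
B = [[0.0, 2.0, -1.0, 0.5],
     [2.0, 0.0, 3.0, -2.5],
     [-1.0, 3.0, 0.0, 1.0],
     [0.5, -2.5, 1.0, 0.0]]
y = [1.2, -0.7, 3.1, 0.4]
n = len(y)

A = zero_row_sum_matrix(B)
C = [[A[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]   # C = A - diag(A)
D = [[-0.5 * (y[i] - y[j]) ** 2 for j in range(n)] for i in range(n)]    # zero diagonal

# Largest discrepancy between the two forms over all n! permutations.
max_gap = 0.0
for perm in permutations(range(n)):
    y_perm = [y[k] for k in perm]
    gap = abs(quadratic_form(A, y_perm) - mantel_form(C, D, perm))
    max_gap = max(max_gap, gap)

print(max_gap)
```

Any moment routine written for zero-diagonal C and D can therefore be reused for a centered quadratic form by passing these C and D.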

As with the location–time statistic, we use Pearson family approximations to compute p-values. Because of the row/column constraint, several moment terms can be further simplified to lower order O(n²) (Appendix), which may be useful in applications for very large n.

2.3. Permutation versus normal quadratic forms

Our motivation here is to perform approximations to exact inference, and our procedures only need the exchangeability assumption on the observed y, applying equally well to discrete or continuous data. For normal quadratic forms, where the elements of y are drawn randomly iid from a normal density, the null distribution may be computed as a weighted sum of independent $\chi^2_1$ random variables, using the methods of Imhof (1961) or the saddlepoint approximation of Kuonen (1999), for example, as implemented in the survey package in R. A common technique used in genomics and other disciplines is to perform robust analysis by transforming data to be discrete-normal using rank-based inverse normal transformations. For example, if r(y_i) is the rank of the ith observation, the transformed value is obtained by applying the inverse standard normal distribution function $\Phi^{-1}$ to a rescaled rank, such as $r(y_i)/(n+1)$. The use of normal scores in genetics was discussed and extensively critiqued by Beasley et al. (2009). An underlying theme in the application of normal scores appears to be a presumption that permutation of the scores is nearly equivalent to unconditional normal random sampling. For individual association tests, this assumption may be reasonable. For example, the permutation variance of the Pearson correlation coefficient between fixed vectors x and y is 1/(n − 1), which is identical to the variance if y is randomly drawn as iid normal. However, permutation of y inherently creates negative correlation among the sampled elements. This dependence, which is slight for individual elements of y and decreases with n, remains highly consequential for S, because there are n² correlation terms among the elements. This effect of without-replacement sampling is especially strong if the eigenvalues of A do not contain a few dominant values (Zhou & Wright, 2013).
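Both facts about permutation used in this argument can be confirmed by enumeration for a small n: the permutation variance of the Pearson correlation is 1/(n − 1), and any two positions of the permuted vector have correlation −1/(n − 1). The vectors below are arbitrary, made-up values.

```python
from itertools import permutations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [0.3, -1.2, 0.8, 2.1, -0.5, 0.0]
y = [1.0, 4.0, 2.5, -0.7, 3.3, 0.2]
n = len(y)

# Every reordering of y, each equally likely under the permutation null.
perms = list(permutations(y))

# Permutation distribution of the correlation between fixed x and permuted y.
r = [pearson(x, p) for p in perms]
mean_r = sum(r) / len(r)
var_r = sum((v - mean_r) ** 2 for v in r) / len(r)

# Dependence induced by sampling without replacement: the values occupying
# positions 1 and 2 of the permuted vector are negatively correlated.
first = [p[0] for p in perms]
second = [p[1] for p in perms]
corr_12 = pearson(first, second)

print(var_r, 1 / (n - 1), corr_12, -1 / (n - 1))
```

The pairwise correlation shrinks with n, but a quadratic form accumulates on the order of n² such terms, which is why the dependence cannot be ignored for S.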

The permutation dependency phenomenon is illustrated in four panels in Supplementary Figure 1. For each panel, a single initial m × n matrix X was generated with elements drawn iid N(0,1) and row-centered, where m = {10,1000} and n = {50,500}. Then we let A = X^TX and compare the distribution of the unconditional normal quadratic form with that of permutation of normal scores. The figure illustrates that the variability under permutation is markedly less than for unconditional sampling, except when n ≫ m. Thus, even if an investigator transforms y to normal scores, the normal quadratic form null distribution cannot be used for permutation testing, and the methods described here remain relevant.

Performance of the proposed approach for space–time clustering analysis of the cattle data. The left panel shows a histogram of S_Mantel and a q–q plot of observed approximating p-values versus expected for 10⁶ permutations. The right panel shows the analogous results for S_Knox for 10⁶ permutations, along with density fits based on the Barton–David and Poisson approximations, as well as our proposed density fit. The inset shows the true permutation p-values for all possible outcomes, compared to that of the approximation.

2.4. Example datasets

We illustrate our methods for four published examples, and for each of the first three examples, we use two different S statistics. The statistics are the same as proposed by the original authors or are otherwise well motivated within the context of the problem. For each example and choice of statistic, the analyst need only find C and D, or y and A, as appropriate to the problem. We note that these examples are useful not only for the observed statistics and p-values but also for the adequacy of the fit for the entire permutation distribution, and thus, the examples effectively illustrate the performance of our approximation in a variety of settings.

Example 1

In White et al. (1989), space–time clustering was used to investigate the evidence of transmissibility of dysentery in cattle for 37 outbreaks in farms in rural New York. Both the Mantel and Knox statistics were used, which we will denote S_Mantel and S_Knox. Following the authors’ implementation of the Mantel statistic, for f, we calculated the straight-line distance in kilometres between locations, and for g, we used the unsigned difference in days between outbreaks. The resulting matrices C and D were then used to calculate S_Mantel.

The Knox statistic is the number of outbreak pairs that are close in space and time. Thresholds for defining closeness are required, and we used the thresholds of 5.5 km and 30 days chosen in White et al. (1989). In other words, c_ij = 1 for f(l_i,l_j) < 5.5 km, and 0 otherwise. Similarly, for the Knox statistic, d_ij is an indicator for g(t_i,t_j) less than 30 days, and c_ii = d_ii = 0. The resulting matrices C and D were then used to calculate S_Knox (which is twice the statistic proposed by Kulldorff & Hjalmars (1999), because each pair is counted in both orders). Although our moment calculations are exact, for an observed statistic s, density approximations to p-values tend to be closer to the mid p-value $P(S > s) + P(S = s)/2$ than to the p-value P(S ≥ s). For most of the examples in this paper, the difference between the two is trivial and need not be considered. However, for this example, the S_Knox statistic can assume only the 25 even values 0, 2, …, 48, and so we apply a continuity correction, by using the Pearson density approximation for s − 1 instead of s.
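The construction of the Knox indicator matrices and the distinction between P(S ≥ s) and the mid p-value can be sketched on toy data. The 5.5 km and 30 day thresholds are those quoted above; the one-dimensional locations and the times are invented, and the exact distribution is obtained here by full enumeration, which is only possible because n is tiny.

```python
from itertools import permutations

def indicator_matrix(values, threshold):
    """1 where two distinct cases are closer than the threshold, else 0 (zero diagonal)."""
    n = len(values)
    return [[1 if i != j and abs(values[i] - values[j]) < threshold else 0
             for j in range(n)] for i in range(n)]

def knox_stat(C, D, perm):
    """Double sum over ordered pairs, so each close pair is counted twice."""
    n = len(C)
    return sum(C[i][j] * D[perm[i]][perm[j]] for i in range(n) for j in range(n))

# Toy one-dimensional data, invented for illustration only.
km = [0.0, 2.0, 4.0, 12.0, 14.0, 30.0]
day = [1.0, 10.0, 25.0, 60.0, 75.0, 200.0]
C = indicator_matrix(km, 5.5)     # close in space
D = indicator_matrix(day, 30.0)   # close in time
n = len(km)

s_obs = knox_stat(C, D, tuple(range(n)))
values = [knox_stat(C, D, p) for p in permutations(range(n))]
total = len(values)

p_upper = sum(v >= s_obs for v in values) / total   # P(S >= s)
p_point = sum(v == s_obs for v in values) / total   # P(S = s)
p_mid = p_upper - p_point / 2                       # P(S > s) + P(S = s) / 2

print(s_obs, p_upper, p_mid)
```

Because the statistic takes only even integer values, the gap between the two p-values is half the point mass at s, which is what a continuity correction to a continuous density approximation is meant to absorb.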

Example 2

For pathway analysis of gene expression data, the data are typically divided into X_path, which represents the m_path × n matrix of expression of m_path genes belonging to a pathway, and X_comp, the remaining m_comp × n complementary matrix of genes not in the pathway. We assume that both matrices are row centered and scaled. Expressions of genes are then compared to a clinical or experimental outcome y, either by examining the association of y to only genes within the pathway (known as self-contained testing) or by contrasting the association with genes in the pathway versus that in the complement (competitive testing). Zhou & Wright (2013) proposed corresponding quadratic form statistics $S_{self} = y^T X_{path}^T X_{path} y$, and $S_{compet} = y^T (X_{path}^T X_{path}/m_{path} - X_{comp}^T X_{comp}/m_{comp}) y$, for which they obtained p-values using a weighted beta density approximation. However, for that approximation, only the first two moments are exact. S_compet has eigenvalues summing to zero and, for some datasets, can have a negative skew, making χ² density approximations ineffective. We use the breast cancer data of Miller et al. (2005), for which the pathway GO:0000184: “nuclear-transcribed mRNA catabolic process” (44 genes, n = 236 samples) was used in Zhou & Wright (2013) as an example in tests of association with survival. Here, y is the vector of martingale residuals for survival time, X is gene expression data, and both have been preresidualized for p53 mutation status.

Example 3

Wright et al. (2011) described a genome-wide association analysis for lung function among 1978 cystic fibrosis (CF) patients, identifying the interval between the genes EHF and APIP on chromosome 11 as of interest. For an interval consisting of several genetic markers, we use an approach to perform regional genetic analysis, rather than testing individual markers. The approach compares similarities in the lung function phenotype between all pairs of individuals with a correlation-based measure in regional genotypes. The result (which we call S_assoc1) is similar in spirit to a Mantel statistic, except that the individual elements represent similarity rather than distance. Specifically, we let y_j denote the phenotype for the jth individual, and the subsequent description is simplified by assuming y has been centered and scaled, so that $\sum_j y_j = 0$ with a fixed sum of squares. We use d_ij = y_iy_j for i ≠ j and d_ii = 0, following suggestions that the product y_iy_j should be powerful in performing tests of phenotypic versus genotypic relatedness (Elston et al., 2000). For m genetic markers in a region, with genotypes measured on the n individuals, we have an m × n genotype matrix G, which has been centered and column scaled. For i ≠ j, we use c_ij = corr(g_.i,g_.j), where g_.i is the ith column of G, and “corr” is the Pearson correlation.

A closely related quadratic form statistic (S_assoc2) is the sum of squared score statistics across the markers, which is similar to S_assoc1 but with slightly different genotype scaling, and with non-zero diagonals for the corresponding matrices. We use X to denote the matrix of genotypes, which have been row centered and scaled so that $\sum_j x_{ij} = 0$ for each marker i, with a common row sum of squares. A single score statistic for the ith marker is $z_i = \sum_j x_{ij} y_j$, and $S_{assoc2} = \sum_{i=1}^{m} z_i^2$, which can be shown to be S_assoc2 = y^TAy, where A = X^TX.
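The identity between the sum of squared score statistics and the quadratic form with A = X^TX is easy to confirm numerically. In the sketch below, the genotype and phenotype values are invented, and rows are scaled to unit sum of squares, which is an assumption made for this example (under it, each score is simply the Pearson correlation between a marker and the phenotype). The sketch also checks that row centering of X makes the rows of A sum to zero, as required by the centered quadratic form results.

```python
def center_and_scale(row):
    """Center to mean zero and scale to unit sum of squares (scaling assumed for illustration)."""
    mean = sum(row) / len(row)
    centered = [v - mean for v in row]
    norm = sum(v * v for v in centered) ** 0.5
    return [v / norm for v in centered]

# Invented data: m = 3 markers (rows) by n = 6 individuals (columns).
genotypes = [[0, 1, 2, 1, 0, 2],
             [1, 1, 0, 2, 2, 0],
             [2, 0, 0, 1, 1, 1]]
phenotype = [0.4, -1.1, 2.3, 0.9, -0.2, 1.7]

X = [center_and_scale(row) for row in genotypes]
y = center_and_scale(phenotype)
m, n = len(X), len(y)

# One score statistic per marker, then the sum of squares.
scores = [sum(X[i][j] * y[j] for j in range(n)) for i in range(m)]
s_from_scores = sum(z * z for z in scores)

# The same quantity as a quadratic form with the n x n matrix A = X^T X.
A = [[sum(X[i][j] * X[i][k] for i in range(m)) for k in range(n)] for j in range(n)]
s_from_quadratic_form = sum(A[j][k] * y[j] * y[k] for j in range(n) for k in range(n))

# Row centering of X implies that every row (and column) of A sums to zero.
row_sums = [sum(row) for row in A]

print(s_from_scores, s_from_quadratic_form)
```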

Example 4

Bowman & Azzalini (1997, pp. 86–90) described a dataset resulting from sampling aquatic life in a coral reef, with 42 observations of catch score, summarized as a log weight across numerous species, versus depth. The dataset has been used by these authors and others to demonstrate local linear regression, using a normal smoothing kernel. A standard test statistic for local linear regression can be expressed as a quadratic form, as follows. The derivation applies to the Nadaraya–Watson estimator (Nadaraya, 1964; Watson, 1964)

$$\hat{m}(x) = \frac{\sum_{i=1}^{n} w_h(x - x_i)\, y_i}{\sum_{i=1}^{n} w_h(x - x_i)},$$

with kernel function w_h, for the regression model E(Y_i | x_i) = m(x_i). The fitted values $\hat{y}$ can be obtained using a smoothing matrix M (which depends on h) such that $\hat{y} = My$. As shown in Bowman & Azzalini (1997), an F-like statistic can be obtained using the ratio

$$F = \frac{y^T U y}{y^T V y},$$

with U = I − 1_{n × n}/n − (I − M)^T(I − M) and V = (I − M)^T(I − M). The p-value is P(F > F_obs), which can be rewritten as P(y^T(U − F_obsV)y > 0), and so we use, finally, A = (U − F_obsV) in the quadratic form. It is easy to show that A is symmetric with row/column sums of zero.
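The construction of A from a smoothing matrix can be sketched as follows, assuming the ratio form F = y^TUy / y^TVy with the U and V given above. The sketch uses a Nadaraya–Watson smoother with a normal kernel; the x and y values and the bandwidth h = 0.8 are hypothetical, and plain nested lists are used so that the example has no dependencies. The row sums of A vanish because each row of M sums to one, so (I − M) annihilates the constant vector.

```python
from math import exp

def nw_smoother_matrix(x, h):
    """Nadaraya-Watson smoothing matrix with a normal kernel; each row sums to one."""
    M = []
    for xi in x:
        w = [exp(-0.5 * ((xi - xj) / h) ** 2) for xj in x]
        total = sum(w)
        M.append([v / total for v in w])
    return M

def matmul(P, Q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

def transpose(P):
    return [list(col) for col in zip(*P)]

def quad(P, y):
    """Quadratic form y^T P y."""
    n = len(y)
    return sum(P[i][j] * y[i] * y[j] for i in range(n) for j in range(n))

# Invented covariate and response values.
x = [0.5, 1.0, 1.8, 2.5, 3.1, 4.0, 4.6]
y = [2.1, 2.4, 1.9, 1.2, 1.0, 0.4, 0.5]
n, h = len(x), 0.8   # h is a hypothetical bandwidth

M = nw_smoother_matrix(x, h)
R = [[(1.0 if i == j else 0.0) - M[i][j] for j in range(n)] for i in range(n)]   # I - M
V = matmul(transpose(R), R)                                                       # (I - M)^T (I - M)
U = [[(1.0 if i == j else 0.0) - 1.0 / n - V[i][j] for j in range(n)] for i in range(n)]

f_obs = quad(U, y) / quad(V, y)
A = [[U[i][j] - f_obs * V[i][j] for j in range(n)] for i in range(n)]

max_row_sum = max(abs(sum(row)) for row in A)
max_asymmetry = max(abs(A[i][j] - A[j][i]) for i in range(n) for j in range(n))

print(f_obs, max_row_sum, max_asymmetry, quad(A, y))
```

By construction y^TAy is zero at the observed y, so the permutation p-value is the proportion of reorderings of y for which the quadratic form is positive.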

Bowman & Azzalini (1997) obtained p-values using moments from a normal quadratic form and a scaled chi-square density approximation, while acknowledging that the data included some non-normal features, such as truncation. They describe permutation analysis as an alternative approach, which they did not pursue further. For the same normal quadratic form, Kuonen (1999) reported p-values using a saddlepoint approximation. Here, we report p-values based on direct permutation and compare them to results from our moment-based density approximation.