Abstract
Outlier sums were proposed by Tibshirani & Hastie (2007) and Wu (2007) for detecting outlier genes, where only a small subset of disease samples shows unusually high gene expression, but neither paper developed the distributional properties of these statistics or a formal method of statistical inference. In this study, a new outlier sum for the detection of outlier genes is proposed, its asymptotic distribution theory is developed, and the p-value based on this outlier sum is formulated. Its analytic form is derived from large-sample theory. We compare the power of the proposed method with that of existing outlier sum methods. Our method is applied to DNA microarray data from samples of primary breast tumors examined by Huang et al. (2003). The results show that the proposed method is more efficient in detecting outlier genes.
Some key words: Asymptotic distribution, Cancer outlier profile analysis, Gene expression, Outlier robust t-statistic, Outlier sum, p-value, t-test
1. Introduction
DNA microarray technology, which simultaneously probes thousands of gene expression profiles, has been successfully used in medical research for disease classification (Agrawal et al., 2002; Ohki et al., 2005). For example, in breast cancer research, Sorlie et al. (2003) used gene expression to classify malignant breast tumors into five distinct molecular subtypes. In lymphoma research, Alizadeh et al. (2000) reported that patients with one type of molecular pattern, germinal center B-like diffuse large B-cell lymphoma, had a significantly better chance of overall survival than those with another molecular pattern, activated B-like diffuse large B-cell lymphoma. Recently, investigators have extended microarray analysis beyond disease classification to the identification of outlier genes, which are overexpressed in only a small number of disease samples. Tomlins et al. (2005) introduced an approach called cancer outlier profile analysis to identify outlier genes. This approach standardizes gene expression by centring at the median and scaling by the median absolute deviation. A kth percentile of the standardized expression values is then used as a cut-off point to determine an outlier gene. Later, two outlier sum approaches, the outlier sum statistic (Tibshirani & Hastie, 2007) and the outlier robust t-statistic (Wu, 2007), were developed to improve detection of outlier genes. Both approaches use the sum of extremely high expression values, instead of the kth percentile of the standardized gene expression, to identify outlier genes. Empirical studies have shown that both outlier sum approaches are more powerful in detecting outlier genes than standard approaches, such as the t-test. However, the distribution theory for these statistics remains largely undeveloped. In this paper, we derive an explicit form for the p-value of an outlier sum test statistic. Simulations show that the approach is generally more powerful than the other methods.
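As a rough illustration of the cancer outlier profile analysis step described above, the following Python sketch standardizes one gene by the median and the median absolute deviation and then reads off a percentile of the standardized disease values. The function name `copa_statistic`, the pooling of control and disease samples for the median and the deviation, and the nearest-rank percentile rule are assumptions made for this sketch rather than details taken from Tomlins et al. (2005).

```python
import math
import statistics


def copa_statistic(control, disease, percentile=90.0):
    """Percentile of median/MAD-standardized disease values for one gene.

    Assumptions of this sketch: the median and the median absolute
    deviation (MAD) are computed from the pooled samples, and the
    percentile is taken with the nearest-rank rule.
    """
    pooled = list(control) + list(disease)
    centre = statistics.median(pooled)
    mad = statistics.median([abs(v - centre) for v in pooled])
    if mad == 0:
        raise ValueError("MAD is zero; the gene cannot be standardized")
    # Standardize the disease values and sort them in ascending order.
    standardized = sorted((v - centre) / mad for v in disease)
    # Nearest-rank percentile: smallest value with at least p% at or below it.
    rank = max(1, math.ceil(percentile / 100.0 * len(standardized)))
    return standardized[rank - 1]
```

A gene would then be flagged when this value is large relative to the values obtained for other genes; how that comparison is made is left open here.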
2. The new outlier sum and its p-value
In a study that consists of n1 subjects in the normal control group and n2 subjects in the disease group, suppose that there are m genes to be investigated. Their gene expression can be represented as Xij (i = 1, . . . , n1, j = 1, . . . , m) for the normal control group and Yij (i = 1, . . . , n2, j = 1, . . . , m) for the disease group. For a fixed gene j, let μj represent the parameter for central tendency that is used for measuring distance for observations in the disease group, let ηj represent the cut-off point that is used for identifying an observation in the disease group as an outlier, and let σj represent the scale parameter that is used for standardizing the sum of outliers. Let μ̂j, η̂j and σ̂j denote their corresponding estimators.
A standardized version of the outlier sum statistic for gene j defined by Tibshirani & Hastie (2007) and Wu (2007) may be represented as σ̂j⁻¹ Σi (Yij − μ̂j) I(Yij ⩾ η̂j), where the sum runs over i = 1, . . . , n2 and I(·) denotes the indicator function. The former uses the combined observations Xij and Yij to estimate μj, ηj and σj, whereas the latter uses only the Xij to estimate the same set of parameters. In this paper, we present a nonstandardized outlier sum, formulated in (1), and use its mean statistic to derive the p-value based on the large-sample theory of this statistic. In the following, we drop the gene index j from all mathematical expressions unless otherwise noted. The outlier sum and its mean statistic for a gene are, respectively, formulated as
(1) |
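A minimal Python sketch of a nonstandardized outlier sum and its mean follows. It assumes that the sum adds the disease-group values at or above the estimated cut-off and that the outlier mean divides this sum by the number of such values; this reading, and the function names, are assumptions of the sketch.

```python
import math


def outlier_sum(disease, cutoff):
    """Sum of the disease-group values at or above the cut-off."""
    return sum(y for y in disease if y >= cutoff)


def outlier_proportion(disease, cutoff):
    """Proportion of disease samples at or above the cut-off."""
    return sum(1 for y in disease if y >= cutoff) / len(disease)


def outlier_mean(disease, cutoff):
    """Outlier sum divided by the number of outlier samples.

    Returns NaN when no sample reaches the cut-off, since the mean is
    then undefined.
    """
    count = sum(1 for y in disease if y >= cutoff)
    if count == 0:
        return math.nan
    return outlier_sum(disease, cutoff) / count
```

Under this reading the outlier sum equals n2 times the outlier proportion times the outlier mean, which is the kind of relationship between the sum and the mean that the subsequent argument relies on.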
Let FX and FY denote the distribution functions, respectively, for variables X and Y. For testing H0 : FY = FX, we will further assume that there are the unknown location and scale parameters for this outlier mean, denoted by μℓ and σℓ, respectively, and they satisfy
(2) |
for any real number z, where Φ represents the cumulative distribution function of a standard normal variate. We also set two constraints on the cut-off point η̂ and the sample sizes n1 and n2, stated in Assumptions 1 and 2 in the Appendix.
From the relationship between the outlier mean and the outlier sum in (1) and from Assumption 1 for the proportion of the outlier samples, we can infer from Slutsky’s theorem and equation (2) that (L − n2βμℓ) has an approximate N (0, 1) distribution. This provides a natural candidate for the test statistic based on the outlier sum statistic:
(3) |
where β̂, μ̂ℓ and σ̂ℓ are estimators for β, μℓ and σℓ, respectively.
The p-value for the outlier sum can be expressed as
(4) |
where ztest is the sample realization of Ztest and the estimates β̂, μ̂ℓ and σ̂ℓ are computed from the observations xi in the normal group.
3. Asymptotic properties of the outlier mean
In a study of the outlier mean or sum, the cut-off point is usually chosen to be a certain multiple of the interquartile range away from the median. For example, in the outlier robust t-statistic approach, one interquartile range away from the median is adopted as the cut-off point. In the proposed method, we allow a flexible selection of the cut-off point and express it as for mathematical convenience, where k > 0 is a constant. To develop the p-value based on large-sample theory, we suggest use of as the cut-off point for estimating η, where is the sample interquartile range for the distribution of variables {Xi}. If the underlying distribution FX is normal then, for k = 1, the population cut-off point becomes . In this case, the cut-off point η̂ for the outlier sum coincides with that of the outlier robust t-statistic approach. Hence, for k > 0 may serve as a generalization of the classical outlier sum. An interesting question is whether k = 1 provides the desired results for gene expression analysis.
Let us now study the asymptotic distribution of the outlier mean L̄ in this specification of η̂. A Bahadur representation for the outlier mean can be stated in the following theorem.
Theorem 1. If Assumptions 1, 2, 3 and 4 in the Appendix are true, a Bahadur representation of the outlier mean is
where , , , and . In this case, β = pr(Y ⩾ η).
The asymptotic distribution of the outlier mean can be obtained from the central limit theorem.
Theorem 2. If Assumptions 2, 3 and 4 in the Appendix are true, then converges in distribution to N (0, ), where
After replacing the population parameters by their corresponding estimators in the formulas of Theorem 2 and in (3), the p-value in (4) can be computed.
There are two scenarios that require different approaches to find estimators of β, μℓ and σℓ. When distributions FX and FY are known but involve some unknown parameters, maximum likelihood estimation can be used. When the distributions FX and FY are unknown, a nonparametric technique can be used for estimating β, μℓ and σℓ by first estimating the mean μY, percentile , truncated mean μℓ, truncated variance and densities fX and fY. Theorem 2, along with equation (4), provides a new method for computing the p-value that can be useful in different scenarios. Although the nonparametric approach can broaden the application areas, this paper focuses on the parametric model.
In the rest of this section, we develop an explicit form for the p-value under an added assumption that variables X and Y have normal distributions N (μX, ) and N (μY, ). Under these assumptions, we have , which implies , where zα is the 100αth percentile of the standard normal distribution. This gives the outlier sum
(5) |
after replacing μX and σX by their estimators. Using equation (5) and the aforementioned properties and substituting the population parameters by their estimates, i.e. γ̂ = n2/n1, , , we obtain
where β is a known constant in this parametric setting.
With the above formulas, Theorem 2 permits the computation of the asymptotic variance of the outlier mean. By plugging in estimates of μX and σX, the p-value of (4) can be obtained. This technique can be extended to other known population distributions of X and Y.
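To make the normal-theory quantities concrete, the sketch below evaluates the tail probability β = pr(Y ⩾ η) and an upper-truncated mean for a normal variable and a fixed cut-off, using the standard truncated-normal formula E(Y | Y ⩾ η) = μ + σφ(a)/{1 − Φ(a)} with a = (η − μ)/σ. Reading μℓ as this conditional mean is an assumption of the sketch, as is the function name; the asymptotic variance of Theorem 2 is not reproduced here.

```python
from statistics import NormalDist


def normal_outlier_parameters(mu, sigma, cutoff):
    """Return (beta, truncated_mean) for Y ~ N(mu, sigma^2).

    beta is pr(Y >= cutoff); truncated_mean is E(Y | Y >= cutoff),
    taken in this sketch as the location parameter of the outlier mean.
    """
    std = NormalDist()  # standard normal distribution
    a = (cutoff - mu) / sigma  # standardized cut-off
    beta = 1.0 - std.cdf(a)
    truncated_mean = mu + sigma * std.pdf(a) / beta
    return beta, truncated_mean
```

Because the standardized cut-off a depends only on k when the cut-off is a fixed multiple of the interquartile range, β does not depend on μX or σX, which is consistent with β being a known constant in the parametric setting.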
4. Simulation study
4.1. Type I error
This simulation study examines whether the location of the cut-off point affects the Type I error. Using (5), we make comparisons based on the value of k. We assume that gene expression in both the control and disease groups follows a standard normal distribution, and that each group has sample size n = 20. The p-value is computed from (4). The Type I error is calculated as the proportion of simulation runs with a p-value less than 0.05. The results show that as k increases, or equivalently as the cut-off point moves further from the median, the Type I error decreases, and with it the chance of rejecting the null hypothesis. The Type I errors are 0.058, 0.0075, 0.00043, 0.00002 and 0.000002 for k = 1, 1.5, 2, 2.5 and 3, respectively, at α = 0.05. Clearly, when k = 1, the Type I error is closest to the targeted significance level. When the p-value cut-off is adjusted to 0.038, the Type I error reaches 0.05 for k = 1. Thus, we opt for k = 1 for the distribution-based p-value approach when comparing it with other outlier statistics, and use 0.038 as the p-value cut-off to control the Type I error at 0.05 in the power study.
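The two computations described above, the empirical Type I error and an adjusted p-value cut-off that attains a target level, can be sketched as follows. The test used here is a known-variance two-sample z-test, chosen only as a stand-in because it needs nothing beyond the standard library; it is not the outlier sum test of (4), and all function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def z_test_p_value(control, disease):
    """One-sided p-value of a unit-variance two-sample z-test (stand-in)."""
    n1, n2 = len(control), len(disease)
    z = (fmean(disease) - fmean(control)) / (1.0 / n1 + 1.0 / n2) ** 0.5
    return 1.0 - NormalDist().cdf(z)


def simulate_null_p_values(p_value, n=20, runs=4000, seed=1):
    """p-values from runs where both groups are N(0, 1) with n samples each."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(runs):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        disease = [rng.gauss(0.0, 1.0) for _ in range(n)]
        p_values.append(p_value(control, disease))
    return p_values


def type_one_error(p_values, level=0.05):
    """Proportion of null p-values falling below the nominal level."""
    return sum(p < level for p in p_values) / len(p_values)


def calibrated_cutoff(p_values, target=0.05):
    """p-value cut-off whose empirical rejection rate is about the target."""
    ordered = sorted(p_values)
    index = max(0, int(target * len(ordered)) - 1)
    return ordered[index]
```

For an exact test such as the stand-in, the calibrated cut-off stays close to the nominal level; for a test that is slightly liberal under the null hypothesis, the same procedure would return a smaller cut-off, which is the role played by 0.038 in the text.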
4.2. Power analysis
In the power study, we compare the distribution-based p-value approach with four other approaches: the two-sample t-test, the cancer outlier profile analysis approach, the outlier sum approach, and the outlier robust t-statistic approach. Three scenarios are examined: (a) the genes in the control and disease groups follow the N (0, 1) distribution, except for the first gene in the n0 outlier samples from the disease group, which follows the N (d, 1) distribution, where d is the effect size; (b) the genes in the control and disease groups follow the t-distribution with four degrees of freedom, except for the first gene in the n0 outlier samples from the disease group, which follows the noncentral t-distribution with four degrees of freedom and noncentrality d; (c) the genes in the control and disease groups follow the N (0, σ²) distribution, except for each gene in the n0 outlier samples from the disease group, which follows the N (2σ, σ²) distribution. For scenarios (a) and (b), we evaluate how the effect size or noncentrality and the number of outlier samples affect the power. Specifically, we simulate 1000 genes for each of the 20 control and 20 disease samples and take d = 1, 1.5, 2, 3 and 4. In addition, we let n0 vary from 1 to 20. For scenario (c), each gene has a different variance, but the effect size, the ratio of the mean shift to the standard deviation, is fixed at 2 for every gene. To mimic the data analysed in § 5, we simulate 12 625 genes for each of the 19 control and 18 disease samples and allow each gene to have a separate σ, estimated from those data and ranging from 0.1 to 2.4. In this scenario, we let n0 vary from 1 to 18.
Power is calculated as the proportion of 1000 simulation runs that yield a significant difference at a Type I error of 0.05. For scenarios (a) and (b), the power calculation is based on the first gene. For scenario (c), the average power over all genes is used. Significance is determined by the p-value for methods that have a p-value formula available. Otherwise, significance is determined by whether the test statistic exceeds the 95th percentile of the outlier statistic, obtained by the parametric bootstrapping method.
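For a statistic without a p-value formula, the procedure just described can be sketched for scenario (a) and a single gene: simulate the statistic under the null model, take its 95th percentile as the critical value, and count how often the statistic exceeds it under the alternative. The statistic below is a simplified stand-in that sums control-median-centred disease values above the upper quartile plus one interquartile range of the controls; it is an assumption of this sketch and not the exact statistic of any of the cited papers.

```python
import random
import statistics


def control_based_outlier_sum(control, disease):
    """Stand-in statistic: centred sum of disease values above the cut-off."""
    q1, _, q3 = statistics.quantiles(control, n=4, method="inclusive")
    cutoff = q3 + (q3 - q1)  # upper quartile plus one interquartile range
    centre = statistics.median(control)
    return sum(y - centre for y in disease if y > cutoff)


def simulate_statistic(rng, n1, n2, n0, d):
    """One draw: the first n0 disease samples are N(d, 1), the rest N(0, 1)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n1)]
    disease = [rng.gauss(d if i < n0 else 0.0, 1.0) for i in range(n2)]
    return control_based_outlier_sum(control, disease)


def bootstrap_critical_value(rng, n1, n2, runs=2000, level=0.05):
    """Upper percentile of the statistic simulated under the null model."""
    null = sorted(simulate_statistic(rng, n1, n2, 0, 0.0) for _ in range(runs))
    return null[min(runs - 1, int((1.0 - level) * runs))]


def estimate_power(n1=20, n2=20, n0=5, d=2.0, runs=1000, seed=1):
    """Proportion of runs whose statistic exceeds the critical value."""
    rng = random.Random(seed)
    critical = bootstrap_critical_value(rng, n1, n2)
    hits = sum(
        simulate_statistic(rng, n1, n2, n0, d) > critical for _ in range(runs)
    )
    return hits / runs
```

Calling `estimate_power(n0=0, d=0.0)` gives a check on the rejection rate under the null model, while varying `n0` and `d` traces out power curves of the kind compared in this section.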
From Fig. 1, the distribution-based p-value approach has the highest power when the number of outlier samples is smaller than 10. In contrast, when the number of outlier samples becomes large, say greater than 15, the t-test yields the highest power, although the distribution-based p-value approach and the outlier robust t-statistic approach also perform well. On the other hand, the cancer outlier profile analysis and the outlier sum have low power overall. From Fig. 2, with three outlier samples or fewer, power increases with effect size for the distribution-based p-value, the outlier robust t-statistic, and the outlier sum approaches, but not for the other two approaches, under both the normal and t-distributions. The distribution-based p-value approach performs best in most cases. As the number of outlier samples increases to 15 or more, the t-test performs best, but the distribution-based p-value approach and the outlier robust t-statistic approach yield comparable power. In contrast, the other two approaches have power less than 0.4. The results demonstrate that the distribution-based p-value approach has better power to identify outliers when few disease samples are outliers. When the number of outlier samples is large, this advantage diminishes. For example, in the extreme case where all disease samples are outliers, the problem becomes a standard two-group comparison and hence the advantage of the outlier approaches disappears. For scenario (c) of differing variances, the results, not presented, are similar to those of Fig. 1 at effect size 2. Moreover, all five approaches show a very weak correlation between power and standard deviation. This suggests that the power to detect outliers does not depend on the variance as long as the effect size is fixed.
5. Data example
We apply all five approaches discussed in § 4 to the breast cancer microarray data reported by Huang et al. (2003). This dataset contains the expression levels of 12 625 genes from 37 breast tumor samples: 19 of them have no positive nodes discovered and are treated as the control group, and the other 18 have identifiably positive nodes and are treated as the disease group. Since we test 12 625 genes simultaneously, errors in inference are more likely to occur without p-value adjustment. To account for multiple testing, the false discovery rate method (Benjamini & Hochberg, 1995) is used to adjust the p-values at the 0.05 level for the distribution-based p-value approach and the t-test. For the other three approaches, the 99th percentile of each outlier statistic, obtained by the parametric bootstrapping method, is used as the cut-off point to identify significant outlier genes.
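The false discovery rate adjustment mentioned above can be sketched with the step-up rule of Benjamini & Hochberg (1995): order the m p-values, find the largest rank i with p(i) ⩽ (i/m)q, and reject all hypotheses up to that rank. The function name and the boolean return format are choices made for this sketch.

```python
def benjamini_hochberg(p_values, level=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    largest = 0
    for rank, index in enumerate(order, start=1):
        # Step-up rule: remember the largest rank meeting its threshold.
        if p_values[index] <= rank / m * level:
            largest = rank
    rejected = [False] * m
    for index in order[:largest]:
        rejected[index] = True
    return rejected
```

Note that a p-value can be rejected even if it exceeds its own threshold, provided a larger p-value further down the ordering meets its threshold; this is what distinguishes the step-up rule from a rank-by-rank comparison.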
Results show that the t-test does not identify any significant outlier genes. In contrast, the other four approaches identify 584, 535, 740 and 695 outlier genes, respectively, for the distribution-based p-value approach, the outlier robust t-statistic approach, the outlier sum approach and the cancer outlier profile analysis approach. We further find that four disease samples are consistently ranked in the top four in number of outlier genes by the distribution-based p-value approach, the outlier robust t-statistic approach and the outlier sum approach. Each of these four disease samples has at least 20% of the outlier genes with a higher expression, suggesting abnormal up-regulation in gene expression. In addition, there are eight outlier genes consistently showing high expression in all four samples.
This application highlights the strength of the four outlier approaches, which identify many significant outlier genes and point out several disease samples with an abnormally large number of outlier genes. On the other hand, the t-test fails to detect any outlier gene. In addition, computation of the p-value is easier with the distribution-based p-value approach than the other three outlier approaches. These three approaches lack a distribution theory and hence need special procedures, such as the parametric bootstrapping method, to define a cut-off point for each outlier gene as the variance changes.
Acknowledgments
The authors are grateful to the referees, an associate editor and the editor for comments that led to improvements in this manuscript. The research of the first author was partially supported by a grant from the National Science Council of Taiwan. The research of the second and third authors was partially supported by various grants from the National Institutes of Health, U.S.A.
Appendix
Four assumptions for the outlier sum test statistic are as follows.
Assumption 1. The proportion of outlier samples, , converges in probability to β with 0 < β < 1.
Assumption 2. The limit γ = limn2,n1→∞ n2/n1 exists.
Assumption 3. The probability density function fX is bounded away from zero in a neighbourhood of for α ∈ (0, 1).
Assumption 4. The probability density function fY is bounded away from zero in a neighbourhood of the population cut-off point η.
Proof of Theorem 1. With Assumption 3, a representation of such as
(A1) |
implies that satisfies (Ruppert & Carroll, 1980). From the expression of the outlier mean in (1), we have
The above expression can be rewritten as
(A2) |
With (A1), Assumptions 2 and 4, and techniques from Ruppert & Carroll (1980) and Chen & Chiang (1996), a modified second term on the right-hand side of (A2), can be expressed as
(A3) |
By the same rationale and the weak law of large numbers, we can derive
(A4) |
which converges to pr(Y ⩾ η).
Combining (A2), (A3) and (A4), a Bahadur representation of the outlier mean is
References
- Agrawal D, Chen T, Irby R, Quackenbush J, Chambers AF, Szabo M, Cantor A, Coppola D, Yeatman TJ. Osteopontin identified as lead marker of colon cancer progression, using pooled sample expression profiling. J Nat Cancer Inst. 2002;94:513–21. doi: 10.1093/jnci/94.7.513.
- Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Jr, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403:503–11. doi: 10.1038/35000501.
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B. 1995;57:289–300.
- Chen L-A, Chiang Y-C. Symmetric quantile and trimmed means for location and linear regression model. J Nonparam Statist. 1996;7:171–85.
- Huang E, Cheng SH, Dressman H, Pittman J, Tsou MH, Horng CF, Bild A, Iversen ES, Liao M, Chen CM, West M, Nevins JR, Huang AT. Gene expression predictors of breast cancer outcomes. Lancet. 2003;361:1590–6. doi: 10.1016/S0140-6736(03)13308-9.
- Ohki R, Yamamoto K, Ueno S, Mano H, Misawa Y, Fuse K, Ikeda U, Shimada K. Gene expression profiling of human atrial myocardium with atrial fibrillation by DNA microarray analysis. Int J Cardiol. 2005;102:233–8. doi: 10.1016/j.ijcard.2004.05.026.
- Ruppert D, Carroll RJ. Trimmed least squares estimation in the linear model. J Am Statist Assoc. 1980;75:828–38.
- Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, Demeter J, Perou CM, Lonning PE, Brown PO, Borresen-Dale AL, Botstein D. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Nat Acad Sci. 2003;100:8418–23. doi: 10.1073/pnas.0932692100.
- Tibshirani R, Hastie T. Outlier sums differential gene expression analysis. Biostatistics. 2007;8:2–8. doi: 10.1093/biostatistics/kxl005.
- Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science. 2005;310:644–8. doi: 10.1126/science.1117679.
- Wu B. Cancer outlier differential gene expression detection. Biostatistics. 2007;8:566–75. doi: 10.1093/biostatistics/kxl029.