Supporting information for Tudor et al. (2002) Proc. Natl. Acad. Sci. USA, 10.1073/pnas.242566899

Supporting Text

Results of Power Analyses of Mecp2 Datasets.

Initial analysis of the data revealed no strong up- or down-regulation of gene expression in the mutant samples, defined as a consistent twofold or greater change. We noticed significant variability in the data, impeding the detection of less-than-twofold changes in gene expression. We found that iterative scaling of the data reduced experimental noise while maintaining biological variability (see Materials and Methods, and data not shown). Next, we tested the data using a variety of methods. The data set was normalized and modeled by using the dChip (version 1.0) program (1) and analyzed for significance by using a simple parametric Student's t test. We further tested the significance of gene expression changes by using average difference data from the Affymetrix Microarray Suite software and/or analyzing the data with other methods, such as neighborhood analysis (2), the PaGE method (3), Westfall and Young FWE correction (4, 5), Benjamini-Yekutieli FDR correction (6), or the VERA/SAM error model (7).
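The iterative scaling procedure itself is detailed in Materials and Methods. As a rough illustration of the general approach only, the sketch below (in Python; the function name, parameters, and outlier rule are all hypothetical) fits a multiplicative scale factor between an array and a baseline by repeatedly regressing through the origin and excluding outlying points before refitting; it is a minimal stand-in, not the procedure actually used.

```python
# A minimal, hypothetical stand-in for iterative robust scaling: fit a
# multiplicative factor between an array and a baseline by regression
# through the origin, drop points far from the fit, and refit.
import numpy as np

def robust_scale_factor(array, baseline, n_iter=10, cutoff=2.0):
    """Scale factor mapping `array` onto `baseline`, excluding outliers."""
    keep = np.ones(array.size, dtype=bool)
    scale = 1.0
    for _ in range(n_iter):
        # Least-squares slope through the origin on the retained points.
        scale = (baseline[keep] @ array[keep]) / (array[keep] @ array[keep])
        # Retain only points close to the current fit, then refit.
        residuals = baseline - scale * array
        keep = np.abs(residuals) < cutoff * residuals[keep].std()
    return scale
```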

The parametric t test (two-tailed, unpaired Student's t test with unequal variance, using a significance cutoff of P < 0.05) identified about 200 genes in any given experiment as significantly changed, with a mean change of 15% ± 6%. Although our power/false-positive simulation analyses (see below) suggested that many of these are not true changes (for a 15% change, ~50-100 false positives are expected), there may nevertheless be some true changes in this group. However, the biological significance of such small average fold changes is uncertain.
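For concreteness, the test described above is Welch's unequal-variance t test applied independently to each gene. A minimal sketch follows, using scipy on hypothetical expression matrices (all data, sizes, and names are illustrative):

```python
# Per-gene Welch (unequal-variance) t test at P < 0.05, as described
# above. The expression matrices here are synthetic placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical matrices: rows = genes, columns = replicate arrays.
wild_type = rng.lognormal(mean=5.0, sigma=0.12, size=(12000, 4))
mutant = rng.lognormal(mean=5.0, sigma=0.12, size=(12000, 4))

# equal_var=False gives the unequal-variance (Welch) form of the test.
t_stat, p_val = stats.ttest_ind(mutant, wild_type, axis=1, equal_var=False)

significant = p_val < 0.05
fold = mutant.mean(axis=1) / wild_type.mean(axis=1)
print(f"{significant.sum()} genes at P < 0.05; "
      f"mean fold change among calls: {fold[significant].mean():.3f}")
```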

To assess the quality of our data and our analysis methods, we quantified our ability to detect changes in the data. Several different statistical methods were tested for power and false-positive rate by using simulation. The power of a statistical test is its ability to identify true changes (formally, one minus the Type II error rate, β). For example, if a t test is used to test for a significant difference in the means of two populations, its power is the fraction of times it returns a significant (P < 0.05) result for pairs of populations that are truly different. The false-positive rate (formally, the Type I error rate, α) is the fraction of times a statistical test incorrectly rejects a true null hypothesis. Using the above example, the false-positive rate of a t test is the fraction of times it calls two populations derived from the same distribution significantly changed. The ideal statistical test would have high power (i.e., identify most true changes) and a low false-positive rate (i.e., rarely identify unchanged points as changed). The power analyses described in the next paragraph were performed on the four MGU74 datasets, as these were the largest, though the Mu11k experiments gave qualitatively similar results.
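Both quantities can be estimated directly by simulation. The sketch below (illustrative parameters throughout) draws many pairs of small samples, either from identical or from shifted distributions, and records how often a Welch t test rejects at P < 0.05:

```python
# Estimating power and false-positive rate for a Welch t test by direct
# simulation. All distribution parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, n_per_group, sigma = 2000, 4, 0.12  # sigma on a log2 scale

def reject_fraction(fold_change):
    """Fraction of simulated pairs the t test calls significant (P < 0.05)."""
    rejections = 0
    for _ in range(n_trials):
        control = rng.normal(0.0, sigma, n_per_group)
        experimental = rng.normal(np.log2(fold_change), sigma, n_per_group)
        _, p = stats.ttest_ind(experimental, control, equal_var=False)
        rejections += p < 0.05
    return rejections / n_trials

# With identical populations this estimates the false-positive rate;
# with shifted populations it estimates power at that fold change.
print("false-positive rate (fold change 1.0):", reject_fraction(1.0))
print("power (fold change 1.5):              ", reject_fraction(1.5))
```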

The power and false-positive rates of the different methods varied (Table 3). The parametric t test had high power, as would be expected for roughly normal data (mean normal probability plot correlation coefficients of genes, r > 0.95). Several groups have suggested that, because of the large number of hypotheses tested when assessing statistical significance in microarray experiments, multiple-testing correction must be applied (2-4). Our simulations suggest that, whereas a simple parametric t test does generate a large number of false positives, almost all of these correspond to fold changes of less than 1.5. Thus, combining a t test with an empirically determined fold-change cutoff gives higher power than any multiple-testing algorithm while maintaining a reasonable false-positive rate. For example, for a sample size of eight (four control and four experimental samples), a t test has greater than 90% power to detect 1.5-fold or greater changes, and an average false-positive rate of 5 × 10⁻⁴ at this fold change. In our data, this amounted to two false positives per experiment. The VERA/SAM error model (7) was notable in that it had higher power than a parametric t test (especially for smaller sample sizes, Table 4), though it also had a marginally higher error rate on average.
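A minimal sketch of the combined decision rule described above, assuming a Welch t test and a 1.5-fold cutoff (the function name and array layout are illustrative): a gene is called changed only when both criteria hold, which discards the large pool of statistically significant but sub-1.5-fold calls.

```python
# Combined decision rule: Welch t test P < 0.05 AND fold change >= 1.5.
import numpy as np
from scipy import stats

def call_changed(control, experimental, alpha=0.05, fold_cutoff=1.5):
    """Boolean per-gene calls; rows = genes, columns = replicates."""
    _, p = stats.ttest_ind(experimental, control, axis=1, equal_var=False)
    fold = experimental.mean(axis=1) / control.mean(axis=1)
    # Treat up- and down-regulation symmetrically.
    big_change = np.maximum(fold, 1.0 / fold) >= fold_cutoff
    return (p < alpha) & big_change
```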

Comparing dChip-calculated gene expression indices with Affymetrix Microarray Suite-calculated Average Differences (AD) showed that each analysis method had higher power on the dChip data. This is likely related to the higher variation of the AD data (mean coefficient of variation 0.16 for AD vs. 0.12 for dChip, P < 0.01 by permutation or t test, not shown).
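The coefficient-of-variation comparison can be sketched as follows. The paired sign-flip permutation test shown here is one reasonable reading of the permutation test mentioned above, not necessarily the exact procedure used; inputs are per-gene expression matrices from the two processing methods.

```python
# Per-gene coefficients of variation under two processing methods,
# compared with a paired sign-flip permutation test (an assumed reading
# of the permutation test mentioned above).
import numpy as np

rng = np.random.default_rng(2)

def cv_per_gene(expr):
    """Per-gene coefficient of variation; rows = genes, columns = arrays."""
    return expr.std(axis=1, ddof=1) / expr.mean(axis=1)

def paired_permutation_p(cv_a, cv_b, n_perm=10000):
    """Two-sided P value for the mean paired difference in per-gene CV."""
    diffs = cv_a - cv_b
    observed = abs(diffs.mean())
    count = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diffs.size)  # random sign flips
        count += abs((signs * diffs).mean()) >= observed
    return count / n_perm
```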

To eliminate the possibility that dChip-specific biases were influencing the results, the analyses were repeated both on gene expression values output by the Affymetrix Microarray Suite and on expression indices and Absent/Present values output by the "Li-Wong Full" model of Lemon et al. (W. Lemon, J. Palatini, R. Krahe, and F. Wright, unpublished data). These data gave similar results to those obtained by using the dChip-modeled expression (not shown).

In addition, we tested the possibility that more transcriptional variability was present in Mecp2 mutant than in wild-type mice. It has been proposed that DNA methylation (and, by extension, methylated-DNA-binding proteins such as MeCP2) is involved in the reduction of ectopic gene transcription (8). Additionally, human MECP2 mutant fibroblasts and lymphoblastoid cells have been shown to display significant clonal variability in transcription (9). In the absence of significant changes in mean gene expression in the mutants, and given the proposed role of Mecp2 in transcriptional noise reduction, we analyzed the data for differences in the variance of gene expression in mutant vs. wild-type mice. The ratio of the variances of the gene expression values was calculated and tested for significance, and no difference was found in the mutant mouse samples (i.e., for any given significance cutoff, there were no more genes with excessive variance in the mutants than in the controls, not shown).
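The text above does not name the specific test applied to the variance ratios; a per-gene two-sided F test is one natural choice and is sketched below for illustration only.

```python
# Per-gene ratio of mutant to wild-type variance with a two-sided F test
# (the choice of F test is an assumption for illustration).
import numpy as np
from scipy import stats

def variance_ratio_test(mutant, wild_type):
    """Rows = genes, columns = replicates; returns F ratios and P values."""
    var_mut = mutant.var(axis=1, ddof=1)
    var_wt = wild_type.var(axis=1, ddof=1)
    f_ratio = var_mut / var_wt
    df_mut, df_wt = mutant.shape[1] - 1, wild_type.shape[1] - 1
    # Two-sided P value: double the smaller tail probability.
    cdf = stats.f.cdf(f_ratio, df_mut, df_wt)
    p = 2.0 * np.minimum(cdf, 1.0 - cdf)
    return f_ratio, p
```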

Methods of Power Analysis.

Power was estimated by simulating a number of datasets with characteristics similar to those of our observed data. This was done by filtering the data as for analysis (majority Present calls, outliers winsorized to median ± 2 IQR, and this set scaled with robust linear regression). Next, genes called changed at P < 0.05 [or confidence > 50% for PaGE (3)] were filtered out (typically fewer than 20 genes per set), and the resulting set was taken as a good approximation of a null distribution of the same size as our data (see figure legends for sample sizes and filtered genes represented in the simulations). The experiment labels of these data were permuted to model a series of different but consistent null distributions (between 50 and 200 were modeled, depending on the CPU time necessary for each analysis). For each iteration, 40 genes were chosen at random, and the points with the (permuted) class label of 'experimental' were multiplied by a range of 1.1 to 5.0 in 0.1 increments (more precisely, this range was randomly inverted to yield a mix of simulated 'upregulation' and 'downregulation'). These datasets were then input into the various analysis methods, and their significance calls on the manipulated as well as the unmanipulated genes were noted. These results were averaged across all replicates to yield the reported power and false-positive figures.
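A condensed sketch of this spike-in procedure follows, using a Welch t test as the example analysis method and a synthetic lognormal matrix as a stand-in for the filtered null set (all sizes, seeds, and names are illustrative):

```python
# Condensed spike-in simulation: permute sample labels over a null set,
# multiply 40 random genes in the 'experimental' columns by fold changes
# of 1.1-5.0 (randomly inverted), and score the test's calls.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_genes, n_each, n_iter = 10000, 4, 50
fold_changes = np.round(np.arange(1.1, 5.05, 0.1), 1)  # 40 values
null_data = rng.lognormal(5.0, 0.12, size=(n_genes, 2 * n_each))

hits = np.zeros(fold_changes.size)  # significant calls per fold change
false_pos = 0
for _ in range(n_iter):
    cols = rng.permutation(2 * n_each)
    exp_cols, ctl_cols = cols[:n_each], cols[n_each:]  # permuted labels
    data = null_data.copy()
    spiked = rng.choice(n_genes, size=fold_changes.size, replace=False)
    # Randomly invert to mix simulated up- and down-regulation.
    fc = np.where(rng.random(fold_changes.size) < 0.5,
                  1.0 / fold_changes, fold_changes)
    data[spiked[:, None], exp_cols] *= fc[:, None]
    _, p = stats.ttest_ind(data[:, exp_cols], data[:, ctl_cols],
                           axis=1, equal_var=False)
    called = p < 0.05
    hits += called[spiked]
    false_pos += called[np.setdiff1d(np.arange(n_genes), spiked)].sum()

print("power at 1.5-fold:", hits[fold_changes == 1.5][0] / n_iter)
print("false-positive rate:", false_pos / (n_iter * (n_genes - 40)))
```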

1. Li, C. & Wong, W. H. (2001) Proc. Natl. Acad. Sci. USA 98, 31–36.

2. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., et al. (1999) Science 286, 531–537.

3. Manduchi, E., Grant, G. R., McKenzie, S. E., Overton, G. C., Surrey, S. & Stoeckert, C. J., Jr. (2000) Bioinformatics 16, 685–698.

4. Dudoit, S., Yang, Y. H., Callow, M. J. & Speed, T. P. (2000) Technical Report No. 578 (Dept. of Statistics, Univ. of California, Berkeley).

5. Westfall, P. H. & Young, S. S. (1993) Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment (Wiley, New York).

6. Benjamini, Y. & Yekutieli, D. (2001) Ann. Stat. 29, 1165–1188.

7. Ideker, T., Thorsson, V., Siegel, A. F. & Hood, L. E. (2000) J. Comput. Biol. 7, 805–817.

8. Bird, A. P. & Wolffe, A. P. (1999) Cell 99, 451–454.

9. Traynor, J., Agarwal, P., Lazzeroni, L. & Francke, U. (2002) BMC Med. Genet., in press.