Abstract
Robust methods for identifying patterns of expression in genome-wide data are important for generating hypotheses regarding gene function. To this end, several analytic methods have been developed for detecting periodic patterns. We improve one such method, JTK_CYCLE, by explicitly calculating the null distribution such that it accounts for multiple hypothesis testing and by including non-sinusoidal reference waveforms. We term this method empirical JTK_CYCLE with asymmetry search, and we compare its performance to JTK_CYCLE with Bonferroni and Benjamini-Hochberg multiple hypothesis testing correction, as well as to five other methods: cyclohedron test, address reduction, stable persistence, ANOVA, and F24. We find that ANOVA, F24, and JTK_CYCLE consistently outperform the other three methods when data are limited and noisy; empirical JTK_CYCLE with asymmetry search gives the greatest sensitivity while controlling for the false discovery rate. Our analysis also provides insight into experimental design and we find that, for a fixed number of samples, better sensitivity and specificity are achieved with higher numbers of replicates than with higher sampling density. Application of the methods to detecting circadian rhythms in a metadataset of microarrays that quantify time-dependent gene expression in whole heads of Drosophila melanogaster reveals annotations that are enriched among genes with highly asymmetric waveforms. These include a wide range of oxidation reduction and metabolic genes, as well as genes with transcripts that have multiple splice forms.
Author Summary
Much biomedical research focuses on how the expression of genes changes over time. Many genes’ activities vary periodically. For example, circadian rhythms repeat daily with the light-dark cycle. Understanding how such rhythms couple to biological processes requires statistical methods that can identify cycling time series in typical genome-wide data. In this paper, we improve on a method used to identify cycling time series by better estimating the statistical significance of periodic patterns and, in turn, by searching for a wider range of patterns than traditionally investigated. We apply these methods to a compilation of data on gene expression in fruit flies, an important model organism. We find that our method allows us to discover rhythmic biological activities that the other methods tested are unable to reveal.
Introduction
Rhythmic behavior is ubiquitous across the spectrum of life [1–4]. Diverse fundamental biological functions such as cell division, energy metabolism, and sleep are periodic, and a growing body of evidence implicates temporal dysregulation as a contributing factor to depression, neurodegeneration, cardiovascular disease, and metabolic disorders in higher organisms [5–9]. Arguably the most well-studied periodic patterns are circadian rhythms: oscillatory changes in gene expression, metabolism, physiology, and behavior with approximately 24-hour (24 h) periods that enable organisms to anticipate and respond to daily changes in their environment, such as nutrient accessibility, temperature, and light [10–13].
Circadian rhythms arise from innate clocks. The components of the core clock are well characterized and are strongly conserved across a wide range of species [14, 15]. However, it remains to be determined how this clock couples to other molecular processes. Moreover, these interactions are likely to depend on tissue type and environmental conditions [2, 7, 11, 16, 17]. There is thus a need to identify molecular profiles that cycle and to characterize them as a function of conditions. The advent of high throughput methods for measuring gene expression now makes transcriptome-wide studies of this nature possible. Previous work suggests that hundreds, possibly thousands, of genes are regulated by circadian clocks [10, 15, 18].
Despite the decreasing cost of measuring transcript levels, profiling time series genome-wide continues to present formidable challenges: tissue-specific samples are difficult to collect, and, in contrast to imaging, measuring transcript levels is destructive in nature, requiring separate samples for each time point. As a result, gene expression time series are typically sparsely sampled (e.g., every 2–4 hours (h) in circadian studies), often without multiple measurements per time point, which we refer to here as “replicates”. These experimental limitations result in low signal-to-noise ratios that prevent straightforward identification of cycling gene expression.
Quantitative methods are thus needed to identify rhythmic time series from minimal data with statistical confidence. These methods can aid researchers in assessing the tradeoffs between the amount of data acquired, statistical precision, and breadth of biological discovery. While a number of different methods have been proposed for identifying cycling time series [19–30], further analysis is needed to guide selection of the best method(s) for a given situation and to aid in design of improved computational methods and further experiments.
In this paper, we improve on the JTK_CYCLE method [26]. The original method uses a conservative estimation for its p-values and a cosine as its only reference waveform. Here, we introduce a procedure, empirical JTK_CYCLE with asymmetry search, that provides accurate empirically-calculated p-values for arbitrary waveforms. We test its performance for detecting rhythms in simulated data and a circadian metadataset [27] against other algorithms: cyclohedron test [20, 21], address reduction [22, 23], stable persistence [24, 25], F24 [31, 32], and one-way analysis of variance (ANOVA) [27]. The simulated data allow us to examine how performance varies with sampling density, number of replicates and/or periods, noise level, and waveform. Most methods provide accurate rhythm detection when sampling density is high and noise is low. However, we find that the choice of method significantly affects rhythm detection when data are limited and/or noisy. In particular, JTK_CYCLE, F24, and ANOVA consistently outperform the other methods and offer distinct advantages for certain types of data. Our improved method, empirical JTK_CYCLE with asymmetry search, performs best of all for data that include asymmetric waveforms. Application of our improved method, empirical JTK_CYCLE with asymmetry, to a metadataset of whole head D. melanogaster circadian microarrays [27] reveals a strong lights-on peak in expression for genes involved in glutathione metabolism, high enrichment for genes involved in oxidation reduction, many more metabolic genes cycling than previously appreciated, and rhythmic genes with transcripts that have alternative splicings.
Methods
Overview
The methods that we consider are general and can be applied to detecting periodic behavior in any context, but we describe them here in terms of searching for circadian rhythms in gene expression for clarity. The methods that we test are cyclohedron test [20, 21], address reduction [22, 23], stable persistence [24, 25], F24 [31, 32], one-way analysis of variance (ANOVA) [27], and JTK_CYCLE [26]. We describe each briefly below and note specific features; additional details can be found in the references introducing the methods.
The methods can be broadly categorized as tests with and without reference waveforms. Cyclohedron test, address reduction, stable persistence, and ANOVA seek to identify patterns without specifying the waveform a priori. Address reduction, cyclohedron test, and stable persistence test for monotonicity. ANOVA compares the means of different time points with their variances to determine if differences are significant.
In contrast, F24 and JTK_CYCLE compare the time series in question to a reference waveform, which is typically sinusoidal. These methods also test for a specific period. As mentioned above, here we assume a period of 24 h, but the period of the reference can be varied, in the same manner that the phase can be varied, to search for rhythms on other time scales.
Cyclohedron test
Cyclohedron test [20, 21] maps data to a cyclohedron and joins data points into sets according to their adjacency in rank-ordering. Monotonicity is quantified by how the sizes of the sets scale as more data points are included. Cyclohedron test has an exact null distribution computable from a generating function. The domain of test statistics increases very quickly with the number of data points, however, so Monte Carlo (MC) sampling, in which representations of the null model are randomly generated and evaluated, is required to estimate p-values if there are more than approximately 20 time points due to the computational cost of the generating function.
Address reduction
Address reduction [22, 23] measures the entropy of the dataset by comparing the rank-ordering of adjacent time points. Low entropy data are monotonic and score higher in the method. The null distribution for address reduction is estimated by MC sampling.
Stable persistence
In stable persistence [24, 25], local minima are paired with surrounding local maxima, and the “persistences” of these features are characterized by the differences in ranks of the paired extrema. A hierarchy of such features is established, and the test compares the global persistence to local ones. In this way, stable persistence tries to robustly assess overall monotonicity of a time series. The null distribution for stable persistence is estimated by MC sampling.
Analysis of variance (ANOVA)
One-way ANOVA is a standard statistical test of the equivalence of means in several groups. In this case, each time point is a different group, and ANOVA is equivalent to testing for any statistically significant variation across the time points. Because expression measurements are averages over many cells and different time points come from different samples (as the measurement is destructive), only synchronized, consistent variation across all samples can generate a statistically significant trend. By this reasoning, significant changes in expression across time points can be attributed to time-dependent expression within the population, such as circadian rhythms. ANOVA tests for these time-dependent changes in expression. ANOVA has an exact null distribution derived from an assumption of normally distributed data; unlike the other five methods, however, ANOVA requires replicates to estimate the variance of experimental measurements at each time point.
24-hour Fourier projection (F24)
F24 [31, 32] assesses periodicity by focusing on the 24 h period of the Fourier transform of the data. The test statistic for F24 is the projection of the data onto the 24 h Fourier basis function, and the null distribution is obtained by recomputing the test statistic over repeated random permutations of the data. The phase is determined by projecting the data onto the cosine part of the Fourier basis function and finding the optimal phase for the projection. We find that the null distribution can be modeled by the Gamma distribution (S1 Fig.), which we parameterize from the mean and variance of the null distribution. We estimate the null distribution from a small number of permutations (usually 100). This allows more rapid and precise computation of p-values than can be obtained by standard permutation. Testing periods other than 24 h is accomplished simply by changing the period of the Fourier basis function used to compute the test statistic.
Jonckheere-Terpstra-Kendall cycle (JTK_CYCLE)
JTK_CYCLE [26] computes the Kendall τ rank correlation coefficient between the data and a reference function over a range of possible reference function phases. For two time series and ,
(1) |
where sgn(x) is −1 if x < 0 and +1 if x>0. The numerator is the number of pairs that vary concordantly between the two time series minus the number that vary discordantly (Fig. 1A). Every possible pair is included, not just ones between neighboring points in the time series. The denominator is the total number of pairs, which normalizes the value of τ to the range [−1, 1]. Perfectly correlated series score τ = 1, perfectly anti-correlated series score τ = −1, and uncorrelated series score τ = 0. Like the cyclohedron test, the null distribution for JTK_CYCLE can be computed exactly from a generating function [33], although again the test statistic domain grows quickly with time series size (becoming impractically large at 100–200 time points with present computing power). However, for large time series the JTK_CYCLE null distribution is approximately normal, allowing for a convenient, fast p-value estimate.
Improvements to the JTK_CYCLE method
It is important to note that τ (Eq. 1) is calculated for a specific reference time series, and thus JTK_CYCLE typically tests against a family of curves (e.g., to consider the possible phases of a waveform, as illustrated in Fig. 1B). It is thus necessary to account for multiple hypothesis testing across reference waveforms in assessing the significance of the results. Hughes et al. [26] employed the Bonferroni correction [34] in their original formulation and implementation of the method. This method is known to be conservative [34], and we illustrate this fact here explicitly for JTK_CYCLE (Fig. 2). These considerations motivate a new procedure for estimating the significance of the results, which we describe. We end this section by discussing the comparison of time series to reference waveforms (Fig. 1C) other than the cosine waveform that was used originally. Together, our improvements allow for the JTK_CYCLE method to include additional reference waveforms in its rhythm detection without compromising sensitivity and specificity.
Empirical p-values
By definition, a p-value is the likelihood of obtaining a test statistic equal to or more extreme than the value that is observed if the null hypothesis is true—it increases cumulatively as one progresses through a set of rank ordered test statistics. In the case of JTK_CYCLE, under the null hypothesis, time series values are independent (and generated by the same noise distribution) and so the rank ordering of time series values is random. For a dataset generated from this null model, the p-values should be uniformly distributed from 0 to 1, exclusive: the highest Kendall’s τ out of N tests should have a p-value of 1/(N + 1), the second highest test statistic has a p-value of 2/(N + 1), and the i th highest test statistic has a p-value of i/(N + 1) [35]. Restated, the p-values should be a linear function of the ranks (black lines in Fig. 2).
JTK_CYCLE computes the Kendall τ values for all the reference time series against the signal of interest and then performs a selection step for the lowest p-value (i.e., the highest τ), which we refer to here as the “initial” p-value. This procedure biases the p-values (the blue and green lines in Fig. 2A) such that they underestimate the true probability of obtaining test statistics by chance (the black line in Fig. 2A). The previous version of JTK_CYCLE corrects for underestimating the p-values with the Bonferroni correction, which controls the family-wide error rate (FWER) by multiplying the p-values by the number of hypothesis tests being performed. The FWER is the probability that there is at least one false positive for a given threshold. Therefore, a threshold of 0.01 means that there is a 1% chance that the list of time series with a Bonferroni adjusted p-value below 0.01 contains a false positive. This method, while rigorous, is overly conservative and overcompensates for the bias that comes from selecting the lowest p-value (blue and green lines in Fig. 2B). The likelihood of false positives is greatly reduced, but so is the likelihood of identifying true positives.
A common alternative to the Bonferroni correction is the Benjamini-Hochberg procedure [36], which seeks to control the false discovery rate (FDR). The FDR is the fraction of the time series that are identified as cycling that are false positive. For example, a Benjamini-Hochberg adjusted p-value cutoff of 0.05 means that 5% of the positives are false. This is a less stringent constraint than the FWER. In this procedure, the p-values are also multiplied by the number of hypotheses tested, as in the Bonferroni procedure. However, the p-values are additionally ordered from lowest to highest and then divided by their rank order (the lowest p-value has rank order 1, the second highest p-value has rank order 2, and so on). The p-values are also adjusted such that there is no change in ordering: if the originally lowest p-value is adjusted so that it is higher than the originally second lowest p-value, the lowest p-value takes the value of the adjusted second p-value so that the ordering is not violated. The same holds for the relationship between the second and the third lowest p-values and so forth. While the Benjamini-Hochberg procedure is a reasonable approach to multiple hypothesis testing in general, it does not account for the selection step in JTK_CYCLE; it still is thus overly conservative (Fig. 2C).
Consequently, we instead numerically compute the null distribution by applying the full JTK_CYCLE procedure to time series in which the values have random rank orders. Since we test a family of curves (e.g., spanning phases), we focus on positive correlations and compute one-sided p-values. In the present study, these “empirical” p-values are based on 2 × 106 random time series and are nearly uniformly distributed, as desired (Fig. 2D). Though this empirical calculation is more computationally expensive than the application of the Bonferroni correction or the Benjamini-Hochberg correction, we show that empirically calculating the p-values results in better rhythmic detection and biological insight. We term this improved method empirical JTK_CYCLE, as we empirically calculate the p-values after selecting the highest τ value for each time series.
Below, we compare the Bonferroni adjusted p-values, Benjamini-Hochberg adjusted p-values, and empirical p-values directly. These have been adjusted on the basis of correcting for multiple hypothesis testing across different waveforms for a single time series. When we compare different time series to each other, we have to correct again for multiple hypothesis testing, this time across time series. To do this we use the Benjamini-Hochberg correction, as in the original implementation of the method [26]. When we refer to the Bonferroni, Benjamini-Hochberg, or empirical method, we refer to corrections across different waveforms for a given time series; all corrections across time series are with the Benjamini-Hochberg method. While the Bonferroni adjusted p-values, Benjamini-Hochberg adjusted p-values, and empirical p-values represent different quantities (the FWER, the FDR, and the p-values, respectively) they are all at least as conservative as the “true” p-values in the range that we are examining (compare blue and green lines with black lines in Figs. 2B, C, and D). This means that the FDRs that result from the Benjamini-Hochberg correction between time series are more conservative than the true FDRs.
Asymmetric waveforms
There is no a priori reason biological time series need be sinusoidal [37, 38], so it is of interest to test additional waveforms. In this regard, it is important to keep in mind that for JTK_CYCLE the rank order of the points in the reference matters, so we can represent a wide range of simple waveforms (e.g., Fig. 1C) by a triangle function with a specified separation between the maximum and the minimum. This allows us to avoid functionally defining an asymmetric cosine waveform. For the time series that we examine in this paper the difference is insignificant (S2 Fig.). We term the size of the interval from the maximum to the minimum the “asymmetry”, and we express the asymmetry here in units corresponding to a 24 h period. In this notation, a cosine has an asymmetry of 12 h, while a time series with an asymmetry of 8 h has a more rapid fall than rise (the values decrease over 8 h and increase over 16 h). The triangle reference waveforms have the same monotonicity as a cosine, and we keep the convention that the peak value corresponds to the phase of the time series.
To parse the effects of empirically calculating the p-values from those of including asymmetric waveforms, we test our form of JTK_CYCLE with and without asymmetric reference time series. In the former case, we denote searching over asymmetry values in steps of 2 h “by 2 h”, in steps of 4 h “by 4 h”, etc. We expect these additional waveforms to be more sensitive to asymmetric patterns of gene expression, resulting in discovery of additional rhythmic time series. It is important, however, to be cognizant of the fact that we are increasing the total number of hypotheses that we test, resulting in a greater need for the empirical correction procedure. Fig. 2 shows the different correction methods for the minimum p-values for JTK_CYCLE with searching over 12 phases (every 2 h, green line) or searching over 12 phases and 11 asymmetries (every 2 h, blue line). The added hypotheses for searching across asymmetries result in larger selection bias when choosing the highest τ value (Fig. 2A) as well as larger correction biases (Fig. 2B and C) than when only searching across phases. The empirical calculation of the p-values improves as the number of tests increases as well (Fig. 2), further justifying its use.
Results
Simulated data benchmarks
To assess the performance of our empirical form of JTK_CYCLE against the original form as well as other methods, we utilize two simulated datasets. We employ the first simulated dataset to understand the sensitivity of each method to different shapes of time series. It comprises four types of waveforms: sine, ramp (a triangle with maximum asymmetry), impulse, and step, as well as an equal number of time series consisting of Gaussian noise. We compare all the precision-recall curves for all the methods on these data via the area under the receiver operating characteristic (AUROC), a measure of the sensitivity and specificity of the rhythm detection methods that does not depend on the proportions of positives and negatives in the dataset. The second simulated dataset contains 10% rhythmic time series of triangle waveform with uniformly distributed phases and asymmetries and 90% time series consisting solely of Gaussian noise. We use it to further assess the importance of considering asymmetric waveforms, and we explore how multiple hypothesis correction impacts the results when the true positives represent a relatively small fraction of the simulated time series, as we expect to be the case in genome-wide studies.
F24, ANOVA, and JTK_CYCLE outperform other methods
To construct the first dataset described immediately above, for each of the four waveforms in Fig. 3A we generated 10,000 time series with uniformly distributed random phase shifts (always with a 24 h period) and added Gaussian noise to each point with a standard deviation of 25% or 50% of the total waveform amplitude, examples of which can be seen in Fig. 3B. We tested data with 4, 6, 8, or 12 evenly spaced points per 24 h period, and 1, 2, 3, or 4 replicates per time point (which is the equivalent of 1, 2, 3, or 4 periods per time series). At each spacing and replicate count we also generated 10,000 time series of Gaussian noise to serve as true negatives. The cyclohedron test, address reduction, stable persistence, and F24 are designed for single-replicate data, so we treated replicates as subsequent days of data, yielding multiple-period time series.
We scored each method by computing the area under the receiver operating characteristic curve (AUROC). The receiver operating characteristic (ROC) curve plots the true positive rate (TPR) as a function of the false positive rate (FPR) as the threshold for calling a time series as a positive is varied. The TPR and FPR are the fractions of the 10,000 simulated or Gaussian noise time series determined to be rhythmic at a threshold, respectively, and the threshold is varied over the entire range of false positive scores, such that the FPR ranges from 0 to 1. The AUROC is the integral of this curve; perfect classifiers have an AUROC of 1.0, while random classifiers have an AUROC of 0.5. An advantage of the AUROC as a metric is that it does not depend on the proportions of positives and negatives in the dataset because the TPR and FPR are calculated separately, i.e., they are normalized by the total number of positives and negatives, respectively. For stable persistence, the cyclohedron test, and address reduction, we calculated the AUROC from the test statistics themselves as opposed to the p-values, which we use for the latter three methods. Although the AUROC for JTK_CYCLE can be in principle be computed directly from the Kendall’s τ statistic, we include the multiple hypothesis testing correction because it impacts the TPR and FPR in practice; in particular, aggressive correction can lead to a loss of rank information because p-values must be less than or equal to 1.
The performance of the different methods at 50% noise can be seen in Fig. 4 (performance at 25% noise can be seen in S3 Fig.). ANOVA requires multiple measurements at each time point to determine the variance, so we define ANOVA to have an AUROC of 0.5 (performs no better than random guessing) when there is only one replicate. At 25% noise and high sampling rate, the cyclohedron test, address reduction, and stable persistence all perform roughly equivalent to the JTK_CYCLE methods, F24, and ANOVA. However, the former do noticeably worse than the latter at 50% noise. Empirical JTK_CYCLE out-performs original JTK_CYCLE, F24, and ANOVA for the sine and ramp waveforms, while ANOVA generally outperforms the other methods for the step and impulse waveforms.
While the empirical calculation approximates the null model well, it does not fully prevent multiple hypothesis testing from weakening the ability to identify rhythmic time series. Therefore, we do not sample phases and asymmetries more densely than the resolution of the data (e.g., if the experimental time points are spaced every 4 h, then we do not test phase values spaced every 2 h). We break this rule in Fig. 4 for the time series with 4–8 points for consistency of the figure. Sampling phases and asymmetries more densely than the resolution of the data needlessly reduces the power of our test, but does not affect the analysis in Fig. 4.
The JTK_CYCLE with Benjamini-Hochberg correction (JTK_BH) has AUROC values that are in between the AUROC values for the original JTK_CYCLE with Bonferroni correction (JTK) and empirical JTK_CYCLE. This is to be expected since the Benjamini-Hochberg method is more conservative than the empirical method but less conservative than the Bonferroni method. An additional detail is that the original JTK_CYCLE here uses a cosine as a reference waveform, in comparison to the triangle used by the other JTK_CYCLE methods. The methods that use the triangle waveform do not do significantly worse than the methods that use the cosine waveform in any of the cases, justifying the use of the triangle waveform for rhythm detection.
The cyclohedron test, address reduction, and stable persistence fail to improve as the number of replicates increases, and perform worse for low sampling rates. For example, a sine wave sampled at 4 time points per period for multiple periods has extrema at every other time point. Because cyclohedron test, address reduction, and stable persistence are essentially tests of monotonicity, they fail to detect the sparse periodic pattern in such data. In fact sparsely sampled data sometimes results in scores consistently lower than expected by chance, leading to the AUROC values less than 0.5 for these methods on some datasets in Fig. 4.
In summary, we find that all the methods tested can identify rhythmic expression patterns when the sampling density, replicate number, and signal-to-noise ratios are high. If data are sparse or noisy, however, method choice can significantly impact rhythm detection. In such cases, we find that ANOVA, F24, and JTK_CYCLE consistently better distinguish true and false positives. Empirical JTK_CYCLE outperforms ANOVA, F24, and original JTK_CYCLE for sine and ramp waveforms, but ANOVA performs better for impulse waveforms.
Increasing replicate number for a fixed number of total measurements improves sensitivity
The total number of samples required for an experiment is the product of the number of time points and the number of replicates. Consequently, it is important to consider how best to apportion resources. To this end, in Fig. 5 we compare possible combinations of numbers of time points and replicates that give rise to 12 or 24 total samples (see S4 Fig. for additional waveforms and numbers of samples). For this comparison, we consider sampling at a density of 4 points per period to be the minimal requirement for the identification of rhythmicity. Furthermore, we focus on genome-wide experiments where the experimental design is such that there is no meaningful difference between data collected over multiple periods and data collected at the same sampling rate in replicate over a single period. This assumption does not hold for experiments that follow the response to a synchronization event or other perturbation because time points from successive periods are not equivalent.
To optimize the performance of ANOVA, it is best to maximize the number of replicates at the expense of the number of time points, which is not surprising given the importance of accurately estimating the variance in this test. For JTK_CYCLE and F24, the choice is less clear, but greater improvement is obtained with replicate increases in the case of the step and impulse waveforms (S4 Fig.). By contrast, original JTK_CYCLE performs slightly better for sinusoidal waveforms with higher numbers of time points, but empirical JTK_CYCLE does not. Overall, the results suggest that limited resources are better directed at increasing replicate numbers than the density of time points.
Interpolated pseudo-replicates improve ANOVA sensitivity
Given the importance of replicates in improving sensitivity, we also explored interpolating neighboring time points to create pseudo-replicates, which would double the number of time points in the data (S5 Fig.). However, this requires recomputing null distributions via MC sampling because the construction procedure introduces correlations between data points, resulting in p-value underestimates if not corrected. We found that the pseudo-replicates improved the performance mainly of ANOVA when the replicate number was low (e.g., 1 or 2); in particular, they allowed ANOVA to be applied and give good results for the single replicate case (S6 Fig.). We stress that 2 or more biological replicates should be obtained if at all possible, and we do not recommend using the pseudo-replicate approach if sufficient data are available.
Empirical JTK_CYCLE outperforms other methods after correcting for multiple hypothesis testing
Our second benchmark comprises 15,840 times series, which was chosen to allow equal numbers of time series with different phases and asymmetries. 10% of the time series were generated from a triangle waveform with noise added and 90% were generated entirely from Gaussian noise. This composition was chosen to be reflective of a genome-wide dataset. The rhythmic time series were 24 points long, with 2 periods, each with 12 time points. Here, we analyze two such datasets: one with only asymmetry of 12 h (analogous to a cosine) and one with a uniform sampling of possible asymmetries (by 2 h from 2 to 22 h). In both cases, phases (peak values) were uniformly distributed over the possible discrete values. We added Gaussian noise with a standard deviation of either 25% or 50% of the amplitude of the time series, as previously described. We tested these data against the empirical JTK_CYCLE method with various asymmetries as well as original JTK_CYCLE, ANOVA, F24, and Benjamini-Hochberg adjusted JTK_CYCLE with various asymmetries for comparison. In all cases the JTK_CYCLE methods used the triangle waveform as the reference waveform, as it was the waveform used to generate the data.
We show cumulative histograms of the number of cycling time series identified for a given significance cutoff in Fig. 6. The methods shown yield comparable numbers for p-values less than 0.05, a reasonable threshold (Fig. 6A and B). However, in reporting total cycling numbers, it is important to correct for the fact that we are testing many time series (as opposed to testing many waveforms for a single time series, as previously). The p-values of empirical JTK_CYCLE are approximately uniformly distributed (Fig. 2D), as are those of ANOVA and F24, which satisfies the assumptions of the Benjamini-Hochberg correction [36], described above, so we use it for this purpose. We also apply the Benjamini-Hochberg correction to the original JTK_CYCLE with the intra-time series Bonferroni and Benjamini-Hochberg corrections discussed previously, which results in underestimates of the true FDR since their adjusted p-values are conservative (Fig. 2B and C).
In Figs. 6C and D, we see that the performance of the methods differs considerably when controlling for the false discovery rate (FDR). For these curves, the proportion of false positives identified as cycling matches the FDR. Specifically, the Benjamini-Hochberg correction (for many time series) penalizes methods with many p-values clustered at relatively high values (corresponding to a rapid rise toward the right of Figs. 6A and B, as for ANOVA and F24). Thus despite the fact that ANOVA and F24 perform comparably to JTK_CYCLE in the AUROC analysis (Fig. 4), their p-values provide less discrimination between time series, and thus they provide less sensitivity for a given FDR. In addition to looking at AUROC scores in Fig. 4 and time series identification in Figs. 6C and D, we also computed the Matthews Correlation Coefficient [39], which quantifies the quality of a binary classification. A score of 1 indicates that a method correctly identified all true positives and true negatives, while a score of −1 indicates that a method yielded all false positives and false negatives. S7 Fig. shows that the JTK_CYCLE methods have higher-quality classification ability than F24 and ANOVA for these simulated data. Furthermore, the figure shows that empirical JTK_CYCLE with asymmetry search performs equally well with and without asymmetric time series, whereas the JTK_CYCLE methods without asymmetry search perform worse when the dataset includes asymmetric time series.
Therefore, in terms of genome-wide studies, empirical JTK_CYCLE with asymmetric waveforms is the method of choice for identifying rhythmic genes. S8 Fig. examines how the inclusion of different asymmetries affects rhythm detection.
Microarray metadataset
Keegan et al. [27] previously assembled a metadataset comprised of data from four DNA microarray studies of Drosophila melanogaster under light-dark (LD) conditions (from Ceriani [40], Claridge-Chang [18], Lin [41], and Ueda [42]). We do not include a fifth dataset from that study [15], because it was limited to dark-dark (DD) conditions. Here, we discuss issues that arise from merging data from different laboratories and use the resulting metadataset to test the methods. We find that empirical JTK_CYCLE with asymmetry search identifies a larger number of rhythmic genes and, in turn, enriched annotations among those genes, such as oxidation reduction, glutathione metabolism, and alternative splicing.
Z-score-based procedure for preparing the metadataset
All of the measurements in the contributing studies are at intervals of 4 h. Time points for circadian LD time series are referenced as zeitgeber time points (ZT); the beginning of the light period is ZT0. Under 12 hours of light and 12 hours of dark, ZT24 is the equivalent of ZT0. Three studies sampled at ZT0, 4, 8, 12, 16, and 20, and the fourth (Ueda [42]) sampled at ZT1, 5, 9, 13, 17, and 21. We found that the differences in sampling protocols, together with variations from one laboratory to another, consistently gave rise to a jagged structure in the time series of known cycling genes (Fig. 7A). Microarray-wide normalization techniques such as quantile normalization were unable to produce curves consistent with independently measured profiles. Instead, we found that the best approach was to convert the values in each time series to Z-scores—i.e., for each gene in each dataset, we subtract its mean expression level and divide by its standard deviation. Then we pool the Z-scores to generate the metadataset. Fig. 7 illustrates the effect of the processing step on Pdp1, a known cycling gene. This method is equivalent to treating measurements from the same zeitgeber time point as replicates. For probes that corresponded to the same gene, we chose the probe with the highest mean expression value to use in the analysis. This reduced 14,010 probes to 11,625 genes.
The metadata has many places where the values are not available (NA). To prevent the need to recalculate the null distribution for every pattern of NAs in the data for empirical JTK_CYCLE (there were 5005 unique NA permutations in the data), the NAs were replaced by random noise drawn from a Gaussian distribution with mean and standard deviation that match those of the data on the whole. While this adds noise to the time series, it should not have a large effect given that each time series has 57 points. To mitigate the impact of this procedure on our study, however, time series that had more then half their points as NA were discarded from the dataset, leaving 9,313 out of 11,625 genes. We consistently used the dataset resulting from these preprocessing steps for all our analysis to ensure that comparisons between methods were fair; where comparisons with and without NA substitution were possible, we found that NA substitution led to slight increases in cycling numbers in all cases except ANOVA (114 vs. 101). However, these differences did not change any of the ontological results (discussed below).
Analysis of microarray metadataset
To evaluate the methods against genes for which we know the rhythmicity a priori, we compared the p-values for 6 positive examples (per, tim, vri, Pdp1, cry, and Clk) and 4 negative examples (cam, RpL32, cyc, and dco). S9 Fig. shows the performance of the different methods for the known positive and negative examples. Stable persistence, the cyclohedron test, and address reduction all have false negatives. The JTK_CYCLE methods, ANOVA, and F24, however, detect all of the known cycling genes and none of the non-cycling genes as rhythmic.
Having again established F24, ANOVA, and JTK_CYCLE as the better methods, we now apply them to the full dataset (Fig. 8). As in Fig. 6, the Benjamini-Hochberg correction decreases the sensitivity of ANOVA and F24 relative to JTK_CYCLE for a given FDR (compare Fig. 8A with Fig. 8B). Choosing a Benjamini-Hochberg adjusted p-value cutoff of 0.05 (i.e., 5%), the number of genes and overlap between methods can be seen in Figs. 8C and D. All the JTK_CYCLE methods outperform F24 and ANOVA. Empirical JTK_CYCLE with asymmetry search by 4 h (eJTK_aby4) identified the most genes, showing a clear improvement over the Bonferroni (JTK) and Benjamini-Hochberg (JTK_BH) methods with asymmetry search by 4 h (aby4); eJTK_aby4 also identified more cycling genes than methods without asymmetry search, and the genes were distinct.
Interestingly, among the JTK_CYCLE methods without asymmetry search the Bonferroni and Benjamini-Hochberg methods identified more genes than the empirical method did. For JTK_CYCLE without asymmetry search, there were only 6 hypotheses tested per gene time series (for each of the 6 phases searched), for which the Bonferroni and Benjamini-Hochberg correction across waveforms is very slight. For JTK_CYCLE with asymmetry search every 4 h, the number of hypotheses tested becomes 30, for 6 different phases paired with 5 different asymmetries, which results in a more stringent correction by the Bonferroni and Benjamini-Hochberg methods. As experimental sampling rates and sampling densities enable more extensive searching of phases, periods, and asymmetries, we expect the advantage for empirical JTK_CYCLE relative to the original formulation to grow because the Bonferroni correction strongly penalizes adding hypothesis tests. Provided that sufficient permutations are performed, empirical JTK_CYCLE provides the more robust identification of rhythmic genes. Another way of viewing this difference between the inclusion and exclusion of asymmetry search is by examining the distributions of the Bonferroni-adjusted p-values against the empirical p-values, as in S10 Fig. With asymmetry search, the empirical p-values are significantly lower than the Bonferroni-adjusted p-values, a pattern that is less pronounced without asymmetry search.
We examine the effect of searching multiple asymmetries with empirical JTK_CYCLE further in S11 Fig. Searching for rhythmic genes with asymmetries of 8 and 16 h alone yielded 4 more genes than searching for rhythmic genes with asymmetries of 4, 8, 12, 16, and 20 h, with an overlap of 293 genes and approximately 50 genes each that were separately called rhythmic by each method. Comparing the two sets of cycling genes in S12 Fig., we find that searching by 4 h excludes genes with asymmetries 8 to 16 h that are barely below the adjusted p-value of 0.05 (upper left quadrant), while searching at asymmetries of 8 and 16 h excludes genes that have extreme asymmetries.
We also examined how our results depended on using a triangle vs. a cosine for the reference waveform. Figs. S13 and S14 show that there is no substantial difference in genes identified as cycling or in ontological results (discussed below). This can be attributed to the fact that across many time points (57 in the case of the metadataset), the differences between the cosine and triangle waveform are slight (S2 Fig.).
Comparison with Keegan et al
We compared our results to those of Keegan et al. [27], an earlier analysis of this metadataset. There were two main differences in the way we constructed the dataset: we excluded time series that had more than half their values as NA, and we excluded the dark-dark (DD) McDonald dataset, as discussed above. Of the 200 genes identified as cycling by Keegan et al., 169 remained after pre-processing to remove time series with more than half of their values as NAs. Of those 169, 111 had Benjamini-Hochberg adjusted p-values less than 0.05 for the empirical JTK_CYCLE with asymmetry search by 4 h (eJTK_aby4). 58 genes that were identified as cycling by Keegan et al. were not identified by eJTK_aby4. S15 Fig. compares the cycling genes identified by Keegan et al. with the cycling genes identified by eJTK_aby4. Keegan et al. identified genes as cycling primarily on the basis of scoring well (p < 0.05) on several tests following pre-screening by ANOVA. S15A Fig. shows a comparison of the number of tests passed after the ANOVA pre-screening with the Benjamini-Hochberg adjusted p-value from eJTK_aby4. While there appears to be a weak relation between the number of tests passed and the p-value, there is not a clear pattern that would enable one to predict the cycling genes common to both Keegan et al. and eJTK_by4. S15B Fig. shows the maximum amplitude measurements (after Z-scoring) for the genes identified as cycling by Keegan et al., which are organized by whether they are identified as cycling by eJTK_aby4 as well. The genes identified by Keegan et al. but not by eJTK_aby4 tend to have larger maximum amplitudes than the ones identified by both. The ANOVA pre-screening in Keegan et al. can account for this difference; our results with empirical JTK_CYCLE suggest that there are many cycling genes with lower amplitudes. S15C Fig. shows the asymmetries of the genes identified by Keegan et al. as cycling, as determined by eJTK_aby4. A large number of genes identified by Keegan et al., but not by eJTK_aby4, have asymmetry of 16 h. The bias in the earlier study may reflect the fact that one of the tests that Keegan et al. employs is based on correlation with the gene per, which has an asymmetry of 16 h. More generally, Keegan et al. fail to identify 231 genes as cycling that eJTK_aby4 identifies with Benjamini-Hochberg adjusted p-values below 0.05. Of these 231, 82 have Benjamini-Hochberg adjusted p-values below 0.01, 65 have values below 0.005, and 16 have values below 0.001.
Comparison with Wijnen et al
In addition to comparing our results to those of Keegan et al., we also compared our results to those of Wijnen et al. [32], who identified 336 genes as rhythmic using an F24-based method. Again, we excluded time series that had more than half their values as NA, and we excluded the dark-dark (DD) McDonald dataset. S16 Fig. shows a comparison of the genes that are identified as rhythmic by eJTK_aby4, Keegan et al., and Wijnen et al. Whereas 31 genes that Keegan et al. identified as rhythmic were removed by the empirical JTK_CYCLE analysis pre-processing, 57 genes that were identified by Wijnen et al. were removed by the empirical JTK_CYCLE analysis pre-processing due to more than half their time points being NA. These genes are “unassigned” in S16A Fig. because an asymmetry estimate is not available. Wijnen et al. and eJTK_aby4 jointly identified 120 genes as rhythmic, of which Keegan et al. identified 59 as well. Wijnen et al. uniquely identified 177 genes as rhythmic, whereas eJTK_aby4 uniquely identified 167 genes. A comparison of the asymmetry distributions for all the genes (S16A Fig.) shows that they are similar for eJTK_aby4 and Wijnen et al.
Validation with dataset-independent literature citations
As a first step toward validation the new genes that eJTK_aby4 exclusively identified as rhythmic (i.e., those genes not previously identified by Keegan et al. or Wijnen et al.), we examined the literature for references that independently suggest that these genes are cycling. Specifically, for each gene, we identified the references in FlyBase (http://flybase.org) that mention the gene. Of those references, those that have the term “circadian” in their title or abstract were identified. S16B Fig. shows the distribution of genes based on their citation in FlyBase by a “circadian” paper, by the original five dataset papers [15, 18, 40–42], or by neither. Genes identified by “circadian” papers but not by the original five dataset papers represent further validation that the genes that we select as rhythmic are related to circadian processes.
Among these references, there were some that discussed several genes. Kadener et al. [43] assayed for genes regulated by the gene Clk and referenced 6 of the genes not mentioned by the original five papers out of a total of 32 genes, which has less than 1.6% probability of occurring by chance (Fisher’s Exact Test unadjusted p<0.016 [44]). One gene referenced by Kadener et al. as well as Abruzzi et al., who also assayed for genes regulated by Clk, is cabut (cbt, CG4427, FBgn0043364, referred to as EP2237 by Kadener et al.). The gene cbt was previously unidentified as having rhythmic expression. The average time series from the metadata can be seen in S17 Fig. The gene cbt is a metal-ion binding transcription factor downstream of the JNK cascade and is involved in morphogenesis [45–48]. It has an asymmetry of 4 h, potentially explaining why it was missed by previous methods but identified by eJTK_aby4. Abruzzi et al. [49] also discuss another Clk-regulated gene that was uniquely identified by eJTK_aby4 as rhythmic, twins (tws, CG6235, FBgn0004889), seen in S17 Fig. It has an asymmetry of 20 h, which explains how, like cbt, it could have been missed by previous methods. These genes, though previously unidentified as rhythmic, are strong candidates for having roles in circadian regulation and processes based on our identification of them as rhythmic and the work of Kadener et al. and Abruzzi et al. This warrants further experimental studies of these genes in a circadian context as well as the other genes that we have identified.
The gene cbt is also referenced by another study that discusses several genes identified as rhythmic by eJTK_aby4. Fujikawa et al. [50] identified 114 genes that are up-regulated and down-regulated in the head of D. melanogaster following 24 h of starvation. 16 of these genes are not mentioned by the original five papers but are identified as rhythmic by eJTK_aby4, which has less than 0.3% probability of occurring by chance (Fisher’s Exact Test unadjusted p<0.003). Fujikawa et al. refer to several genes from the circadian dataset papers that also appear in their lists of differentially expressed genes, but they do not associate rhythmic behavior with all the genes that they describe. In addition to the gene cbt, Fujikawa et al. reference other genes that were previously unidentified as rhythmic: Esterase-Q (Est-Q, CG7529, FBgn0037090) and 1, 4-Alpha-Glucan Branching Enzyme (AGBE, CG33138, FBgn0053138). Both have asymmetries of 16 h, which is also outside the range of standard symmetric-waveform detection (S17 Fig.). The identification of these genes as rhythmic reinforces the connection between metabolism and circadian regulation and indicates other potential areas of experimental exploration.
To further understand the relationship between circadian regulation that we see in the genes eJTK_aby4 has identified as rhythmic and biological processes, we examined the enrichment of functional annotations in the identified genes.
Functional classification of cycling genes
We used DAVID [51, 52] to analyze the ontological enrichment of the genes contributing to Fig. 8 separately for each rhythm detection method. Because many of the annotation terms are obviously related (e.g., “oxidoreductase” and “oxidation reduction”), we manually grouped them. The grouped terms enriched with Benjamini-Hochberg adjusted p-values less than 0.05 for the different methods can be seen in Fig. 9. Genes that are identified as rhythmic by F24 and ANOVA are enriched in the fewest terms. They are mainly in rhythm/light/circadian categories, corroborating the selection of these genes as cycling. The JTK_CYCLE methods without asymmetry search identify sets of genes that are enriched for different terms in addition to the rhythm/light/circadian ones found by ANOVA and F24, such as glutathione and oxidation reduction annotation terms. The JTK_CYCLE methods with asymmetry search identify sets of genes enriched in the most terms of all the methods. The original JTK_CYCLE method with Bonferroni correction and empirical JTK_CYCLE method identify sets of enriched genes known to have alternative splice forms of their RNA; JTK_CYCLE with the Benjamini-Hochberg correction and empirical JTK_CYCLE identify sets of genes that are enriched for genes involved in biosynthetic pathways.
Because eJTK_aby4 captures all the annotation terms of interest, we focus on its results for the remainder of this section. The individual annotation terms that are enriched in the rhythmic genes found by eJTK_aby4 can be seen with their adjusted p-values and phase distributions in Fig. 10A. Fig. 10B shows the total phase distribution of the genes, and Fig. 10C shows the total asymmetry distribution. Functionally, these genes fall into several annotation categories, each of which we discuss in turn.
Many of the rhythmic genes involved in glutathione metabolism are also involved in drug metabolism. The peak expressions of these genes are focused around ZT4, and these genes mainly have asymmetries close to 12 h (Figs. 10A and S18). Glutathione and drug metabolism are known to be circadian [53–55]; possible links to aging are suggested by the role of glutathione metabolism in clearing reactive oxygen species [56]. Other oxidation-reduction related terms peak at either ZT4 or ZT16-20 (Figs. 10A and S19). These genes have a broader distribution of asymmetries, with several with extreme values of 4 or 20 h.
A subset of the genes involved with oxidation-reduction are also associated with iron and have a bimodal distribution of phases, with peaks at ZT4 and ZT16-20 (Fig. 10A). Various iron-related genes have been implicated as important in circadian rhythms. Recent studies, however, have only looked at the effect of iron-related genes on whole organism activity, or on particular circadian genes, such as per or tim [57, 58]. These studies have shown that individual iron-related genes affect circadian rhythms. To our knowledge, no studies to date have found as many iron-related genes displaying rhythmic behavior as we have described here.
Genes that have multiple protein forms due to alternative splicing peak at times that are evenly distributed throughout the day (Figs. 10A and S20). They have a broad distribution of asymmetries as well, with several genes with extreme values of 4 h and 20 h, such as tws, the newly discovered cycling gene previously mentioned as having an asymmetry of 20 h and as a regulatory target of Clk. The existence of these alternatively spliced genes with extreme asymmetries explains why “alternative splicing” was only found to be enriched in the genes identified as rhythmic by methods searching for asymmetric waveforms. Alternative splicing has been found to be important in circadian rhythms in Drosophila as well as in other species. Most studies, however, have focused on specific experimental findings that discovered particular genes that modulate circadian rhythms [59, 60]. No studies exist that have found that so many genes with alternative splicing are rhythmic.
Discussion
In this paper, we compare several rhythm detection methods. These approaches are general and can be applied to detecting periodic behavior in a wide range of contexts, but we focus on time series representative of genome-wide expression data. Deckard et al. [28] recently reviewed a number of earlier studies of rhythm detection methods and selected four algorithms for comparison (de Lichtenberg, Lomb-Scargle, JTK_CYCLE, and persistent homology) based on their mathematical properties and applicability to genome-wide expression data. They test the methods with simulated data and experimental data for the metabolic cycle in yeast, circadian rhythms in the mouse, and the root clock in the flowering plant Arabidopsis thaliana (see [28] for references). They find that there is no all-around best method and construct a decision tree for picking an algorithm based on the expected nature of the data. For increasing noise and decreasing sampling rate, they favor JTK_CYCLE and Lomb-Scargle, a Fourier-like method. These recommendations are consistent with our own findings that JTK_CYCLE and F24 were consistently more accurate than the monotonicity tests (represented by persistent homology in [28]) and justifies our focus on improving JTK_CYCLE.
Also recently, Zielinski et al. [29] reviewed six different Fourier-like methods for their ability to estimate periods in periodic time series. They focus on time series that are well-sampled (36 and 72 time points were the lowest-sampled time series they examined, 720 was among the highest), as might be obtained from tracking a luminescence reporter for a single gene. The fundamentally different nature of their data from ours highlights the fact that it is important to match the computational tool with the task of interest. Zielinski et al. [29] seek precise period determination for genes already known to cycle. By contrast, here we focus on discovering rhythmic time series that represent only a fraction of a genome-wide dataset. JTK_CYCLE can provide estimates of a periodic time series’ phase, period, and asymmetry, but it resolves these parameters only to the level of the sampling (or search) depth. It is likely that there is a tradeoff between robust separation of rhythmic and arrhythmic time series and precise estimation of the cycling parameters; presently, no algorithm achieves both these goals simultaneously.
Zielinski et al. [29], as well as earlier studies [37, 38], note that biological oscillations are expected to have asymmetric patterns of expression. Thaben and Westermark recently published one approach to this problem [30]. Their method, RAIN, employs the Mann-Whitney U test [61] between different time points to look for a rising pattern followed by a falling pattern. They show that RAIN outperforms the original JTK_CYCLE method as well as a cosine-fitting method [62] for simulated data consisting of sinusoidal and ramp waveforms. They also analyze genomic and proteomic data for the mouse liver. Their work reinforces our finding that searching for asymmetric waveforms can produce better rhythm detection sensitivity and validates our efforts. Thaben and Westermark suggest that modifying JTK_CYCLE to allow for asymmetric waveforms would provide a useful complement to their approach. In particular, in contrast to RAIN, JTK_CYCLE can search for specific waveforms, including arbitrary shapes with multiple peaks. We meet that need here.
We were able to expand JTK_CYCLE to search for asymmetric waveforms without degrading sensitivity because we empirically calculate p-values, which yields much more accurate significance estimates than the Bonferroni correction employed in the original formulation of JTK_CYCLE [26]. Our analysis of two different simulated datasets and the fly head metadataset clearly shows the importance of accurate significance estimates. As sequencing costs continue to decrease, we expect sampling density to increase. This trend should favor use of empirical JTK_CYCLE over alternative means of correcting for multiple hypothesis testing because the increase in data will enable more phases, asymmetries, and periods to be examined. Our analysis shows that certain gene ontologies have many genes with highly asymmetric patterns of expression. It will be interesting to determine the prevalence of different waveforms in additional biological datasets and to understand how their features depend quantitatively on genotype, tissue, and environmental conditions.
Conclusions
In this paper, we compare methods for detecting rhythmic time series in genome-wide expression data. With regard to experimental design, we find that increasing the number of replicates is more important than increasing the sampling density for achieving greater sensitivity. A key aspect of our study is that we improve the estimation of p-values in JTK_CYCLE. This enables control of the false discovery rate and testing waveforms beyond sinusoidal ones. For both simulated data and a circadian metadataset [27] the resulting empirical JTK_CYCLE with asymmetry search exhibits the greatest sensitivity among the methods that we evaluated. The annotation terms that are enriched among the genes that we identify as cycling include rhythm/light/circadian, glutathione/drug metabolism, oxidation-reduction, iron metabolism, and alternative splicing. These findings are consistent with known circadian biology but also suggest new investigations.
Supporting Information
Acknowledgments
We would like to thank Matthew Stephens for advice about evaluating statistical significance and multiple hypothesis correction.
Data Availability
The simulated datasets can be computationally generated given the specifications contained within the paper. The experimental data analyzed are already publicly available through DOI: 10.1371/journal.pcbi.0030208. Our analysis of the experimental data is within the paper and its Supporting Information files. The code used can be found in the first author’s Github repository at https://github.com/alanlhutchison/empirical-JTK_CYCLE-with-asymmetry
Funding Statement
This work was supported by the Defense Advanced Research Projects Agency (DARPA) (D12AP00023) http://www.darpa.mil/. ALH is a trainee of the National Institutes of Health Medical Scientist Training program at the University of Chicago (grant NIGMS T32GM07281) http://www.nigms.nih.gov/. MMC is supported in part by NIH grant ULITR000050 (University of Illinois CTSA). HG is supported by the National Institute of Biomedical Imaging And Bioengineering of the National Institutes of Health (Award Number T32EB009412) http://www.nibib.nih.gov/. The content is solely the responsibility of the authors and does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Etter PD, Ramaswami M (2002) The ups and downs of daily life: Profiling circadian gene expression in Drosophila . BioEssays 24: 494–498. 10.1002/bies.10109 [DOI] [PubMed] [Google Scholar]
- 2. Abe M, Herzog ED, Yamazaki S, Straume M, Tei H, et al. (2002) Circadian rhythms in isolated brain regions. The Journal of Neuroscience 22: 350–356. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Panda S, Antoch MP, Miller BH, Su AI, Schook AB, et al. (2002) Coordinated transcription of key pathways in the mouse by the circadian clock. Cell 109: 307–320. 10.1016/S0092-8674(02)00722-5 [DOI] [PubMed] [Google Scholar]
- 4. Ueda HR, Chen W, Adachi A, Wakamatsu H, Hayashi S, et al. (2002) A transcription factor response element for gene expression during circadian night. Nature 418: 534–539. 10.1038/nature00906 [DOI] [PubMed] [Google Scholar]
- 5. Wulff K, Gatti S, Wettstein JG, Foster RG (2010) Sleep and circadian rhythm disruption in psychiatric and neurodegenerative disease. Nat Rev Neurosci 11: 589–599. 10.1038/nrn2868 [DOI] [PubMed] [Google Scholar]
- 6. Armitage R (2007) Sleep and circadian rhythms in mood disorders. Acta Psychiatrica Scandinavica 115: 104–115. 10.1111/j.1600-0447.2007.00968.x [DOI] [PubMed] [Google Scholar]
- 7. Storch KF, Lipan O, Leykin I, Viswanathan N, Davis FC, et al. (2002) Extensive and divergent circadian gene expression in liver and heart. Nature 417: 78–83. 10.1038/nature744 [DOI] [PubMed] [Google Scholar]
- 8. Germain A, Kupfer DJ (2008) Circadian rhythm disturbances in depression. Human Psychopharmacology: Clinical and Experimental 23: 571–585. 10.1002/hup.964 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. McClung CA (2007) Circadian genes, rhythms and the biology of mood disorders. Pharmacology & Therapeutics 114: 222–232. 10.1016/j.pharmthera.2007.02.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Albrecht U (2006) Orchestration of gene expression and physiology by the circadian clock. Journal of Physiology-Paris 100: 243–251. 10.1016/j.jphysparis.2007.05.004 [DOI] [PubMed] [Google Scholar]
- 11. Dibner C, Schibler U, Albrecht U (2010) The mammalian circadian timing system: Organization and coordination of central and peripheral clocks. Annual Review of Physiology 72: 517–549. 10.1146/annurev-physiol-021909-135821 [DOI] [PubMed] [Google Scholar]
- 12. Hayes K, Baggs J, Hogenesch J (2005) Circadian clocks are seeing the systems biology light. Genome Biology 6: 219 10.1186/gb-2005-6-5-219 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Wijnen H, Young MW (2006) Interplay of circadian clocks and metabolic rhythms. Annual Review of Genetics 40: 409–448. 10.1146/annurev.genet.40.110405.090603 [DOI] [PubMed] [Google Scholar]
- 14. Moynihan Ramsey K, Marcheva B, Kohsaka A, Bass J (2007) The clockwork of metabolism. Annual Review of Nutrition 27: 219–240. 10.1146/annurev.nutr.27.061406.093546 [DOI] [PubMed] [Google Scholar]
- 15. McDonald MJ, Rosbash M (2001) Microarray analysis and organization of circadian gene expression in Drosophila . Cell 107: 567–578. 10.1016/S0092-8674(01)00545-1 [DOI] [PubMed] [Google Scholar]
- 16. Zvonic S, Ptitsyn AA, Conrad SA, Scott LK, Floyd ZE, et al. (2006) Characterization of peripheral circadian clocks in adipose tissues. Diabetes 55: 962–970. 10.2337/diabetes.55.04.06.db05-0873 [DOI] [PubMed] [Google Scholar]
- 17. Bell-Pedersen D, Cassone VM, Earnest DJ, Golden SS, Hardin PE, et al. (2005) Circadian rhythms from multiple oscillators: lessons from diverse organisms. Nat Rev Genet 6: 544–556. 10.1038/nrg1633 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Claridge-Chang A, Wijnen H, Naef F, Boothroyd C, Rajewsky N, et al. (2001) Circadian regulation of gene expression systems in the Drosophila head. Neuron 32: 657–671. 10.1016/S0896-6273(01)00515-3 [DOI] [PubMed] [Google Scholar]
- 19. Dequéant ML, Ahnert S, Edelsbrunner H, Fink TMA, Glynn EF, et al. (2008) Comparison of pattern detection methods in microarray time series of the segmentation clock. PLoS ONE 3: e2856 10.1371/journal.pone.0002856 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Morton J, Pachter L, Shiu A, Sturmfels B (2007) The cyclohedron test for finding periodic genes in time course expression studies. Statistical Applications in Genetics and Molecular Biology 6: 1–25. 10.2202/1544-6115.1286 [DOI] [PubMed] [Google Scholar]
- 21.Morton J, Pachter L, Shiu A, Sturmfels B, Wienand O (2006) Geometry of rank tests. ArXiv Mathematics e-prints arXiv:math/0605173.
- 22. Fink TMA, Willbrand K, Brown FCS (2007) 1-D random landscapes and non-random data series. EPL (Europhysics Letters) 79: 38006 10.1209/0295-5075/79/38006 [DOI] [Google Scholar]
- 23. Ahnert SE, Willbrand K, Brown FCS, Fink TMA (2006) Unbiased pattern detection in microarray data series. Bioinformatics 22: 1471–1476. 10.1093/bioinformatics/btl121 [DOI] [PubMed] [Google Scholar]
- 24. Edelsbrunner H, Letscher D, Zomorodian A (2000) Topological persistence and simplification In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science. Washington, DC, USA: IEEE Computer Society, FOCS’00, p. 454. [Google Scholar]
- 25. Cohen-Steiner D, Edelsbrunner H, Harer J, Mileyko Y (2010) Lipschitz functions have Lp-stable persistence. Foundations of Computational Mathematics 10: 127–139. 10.1007/s10208-010-9060-6 [DOI] [Google Scholar]
- 26. Hughes M, Hogenesch J, Kornacker K (2010) JTK CYCLE: an efficient non-parametric algorithm for detecting rhythmic components in genome-scale datasets. J Biol Rhythms 25: 372–380. 10.1177/0748730410379711 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Keegan K, Pradhan S, Wang JP, Allada R (2007) Meta-analysis of Drosophila circadian microarray studies identifies a novel set of rhythmically expressed genes. PLoS Comp Biol 3: e208 10.1371/journal.pcbi.0030208 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Deckard A, Anafi RC, Hogenesch JB, Haase SB, Harer J (2013) Design and analysis of large-scale biological rhythm studies: a comparison of algorithms for detecting periodic signals in biological data. Bioinformatics 29: 3174–3180. 10.1093/bioinformatics/btt541 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Zielinski T, Moore AM, Troup E, Halliday KJ, Millar AJ (2014) Strengths and limitations of period estimation methods for circadian data. PLoS ONE 9: e96462 10.1371/journal.pone.0096462 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Thaben PF, Westermark PO (2014) Detecting rhythms in time series with RAIN. Journal of Biological Rhythms 29: 391–400. 10.1177/0748730414553029 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Wijnen H, Naef F, Young MW, Young MW (2005) Molecular and statistical tools for circadian transcript profiling. Methods in Enzymology 393: 341–365. 10.1016/S0076-6879(05)93015-2 [DOI] [PubMed] [Google Scholar]
- 32. Wijnen H, Naef F, Boothroyd C, Claridge-Chang A, Young MW (2006) Control of daily transcript oscillations in Drosophila by light and the circadian clock. PLoS Genet 2: e39 10.1371/journal.pgen.0020039 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33. Harding EF (1984) An efficient, minimal-storage procedure for calculating the Mann-Whitney U, generalized U and similar distributions. Journal of the Royal Statistical Society Series C (Applied Statistics) 33: 1–6. [Google Scholar]
- 34.Weisstein EW (2014). Bonferroni correction, from MathWorld–A Wolfram Web Resource. URL http://mathworld.wolfram.com/BonferroniCorrection.html.
- 35. North B, Curtis D, Sham P (2002) A note on the calculation of empirical p-values from Monte Carlo procedures. The American Journal of Human Genetics 71: 439–441. 10.1086/341527 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc Ser B 57: 289–300. [Google Scholar]
- 37. Yosef N, Regev A (2011) Impulse control: Temporal dynamics in gene transcription. Cell 144: 886–896. 10.1016/j.cell.2011.02.015 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Tyson JJ, Chen KC, Novak B (2003) Sniffers, buzzers, toggles and blinkers: dynamics of regulatory and signaling pathways in the cell. Current Opinion in Cell Biology 15: 221–231. 10.1016/S0955-0674(03)00017-6 [DOI] [PubMed] [Google Scholar]
- 39. Matthews B (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)—Protein Structure 405: 442–451. 10.1016/0005-2795(75)90109-9 [DOI] [PubMed] [Google Scholar]
- 40. Ceriani MF, Hogenesch JB, Yanovsky M, Panda S, Straume M, et al. (2002) Genome-wide expression analysis in Drosophila reveals genes controlling circadian behavior. The Journal of Neuroscience 22: 9305–9319. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41. Lin Y, Han M, Shimada B, Wang L, Gibler TM, et al. (2002) Influence of the period-dependent circadian clock on diurnal, circadian, and aperiodic gene expression in Drosophila melanogaster . Proceedings of the National Academy of Sciences 99: 9562–9567. 10.1073/pnas.132269699 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Ueda HR, Matsumoto A, Kawamura M, Iino M, Tanimura T, et al. (2002) Genome-wide transcriptional orchestration of circadian rhythms in Drosophila . J Biol Chem 16: 14048–14052. 10.1074/jbc.C100765200 [DOI] [PubMed] [Google Scholar]
- 43. Kadener S, Stoleru D, McDonald M, Nawathean P, Rosbash M (2007) Clockwork orange is a transcriptional repressor and a new Drosophila circadian pacemaker component. Genes & Development 21: 1675–1686. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Fisher RA (1922) On the interpretation of χ2 from contingency tables, and the calculation of p. Journal of the Royal Statistical Society 85: 87 10.2307/2340521 [DOI] [Google Scholar]
- 45. St Pierre SE, Ponting L, Stefancsik R, McQuilton P, Consortium F (2014) FlyBase 102-advanced approaches to interrogating FlyBase. Nucleic Acids Research 42: D780–D788. 10.1093/nar/gkt1092 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Brown JL (2005) An Sp1/KLF binding site is important for the activity of a polycomb group response element from the Drosophila engrailed gene. Nucleic Acids Research 33: 5181–5189. 10.1093/nar/gki827 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Munoz-Descalzo S, Llobell MC, Paricio N (2004) cabut encodes a C 2 H 2 zinc finger transcription factor required during Drosophila embryogenesis. Program and Abstracts 45th Annual Drosophila Research Conference, Washington, DC, 2004: 528C.
- 48.Munoz-Descalzo S, Llobell MC, Paricio N (2003) cabut, a new gene involved in multiple processes during Drosophila melanogaster development. Abstracts Eighteenth European Drosophila Research Conference, Gottingen, 2003: P04.
- 49. Abruzzi KC, Rodriguez J, Menet JS, Desrochers J, Zadina A, et al. (2011) Drosophila CLOCK target gene characterization: implications for circadian tissue-specific gene expression. Genes & Development 25: 2374–2386. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. Fujikawa K, Takahashi A, Nishimura A, Itoh M, Takano-Shimizu T, et al. (2009) Characteristics of genes up-regulated and down-regulated after 24h starvation in the head of Drosophila . Gene 446: 11–17. 10.1016/j.gene.2009.06.017 [DOI] [PubMed] [Google Scholar]
- 51. Huang D, Sherman B, Lempicki R (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37: 1–13. 10.1093/nar/gkn923 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52. Huang D, Sherman B, Lempicki R (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4: 44–57. 10.1038/nprot.2008.211 [DOI] [PubMed] [Google Scholar]
- 53. Hooven LA, Sherman KA, Butcher S, Giebultowicz JM (2009) Does the clock make the poison? Circadian variation in response to pesticides. PLoS ONE 4: e6469 10.1371/journal.pone.0006469 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Beaver LM, Hooven LA, Butcher SM, Krishnan N, Sherman KA, et al. (2010) Circadian clock regulates response to pesticides in Drosophila via conserved Pdp1 pathway. Toxicological Sciences 115: 513–520. 10.1093/toxsci/kfq083 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55. Hamby KA, Kwok RS, Zalom FG, Chiu JC (2013) Integrating circadian activity and gene expression profiles to predict chronotoxicity of Drosophila suzukii response to insecticides. PLoS ONE 8: e68472 10.1371/journal.pone.0068472 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Pompella A, Visvikis A, Paolicchi A, Tata VD, Casini A (2003) The changing faces of glutathione, a cellular protagonist. Biochem Pharmacol 66: 1499–1503. 10.1016/S0006-2952(03)00504-5 [DOI] [PubMed] [Google Scholar]
- 57. Mandilaras K, Missirlis F (2012) Genes for iron metabolism influence circadian rhythms in Drosophila melanogaster . Metallomics 4: 928 10.1039/c2mt20065a [DOI] [PubMed] [Google Scholar]
- 58. Freeman AA, Mandilaras K, Missirlis F, Sanyal S (2013) An emerging role for Cullin-3 mediated ubiquitination in sleep and circadian rhythm: Insights from Drosophila . Fly 7: 39–43. 10.4161/fly.23506 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59. Bartok O, Kyriacou CP, Levine J, Sehgal A, Kadener S (2013) Adaptation of molecular circadian clockwork to environmental changes: a role for alternative splicing and miRNAs. Proceedings of the Royal Society B: Biological Sciences 280 10.1098/rspb.2013.0011 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60. Preußner M, Wilhelmi I, Schultz AS, Finkernagel F, Michel M, et al. (2014) Rhythmic U2af26 alternative splicing controls PERIOD1 stability and the circadian clock in mice. Molecular Cell 54: 651–662. [DOI] [PubMed] [Google Scholar]
- 61. Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18: 50–60. 10.1214/aoms/1177730491 [DOI] [Google Scholar]
- 62. Cornelissen G (2014) Cosinor-based rhythmometry. Theoretical Biology and Medical Modelling 11: 16 10.1186/1742-4682-11-16 [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The simulated datasets can be computationally generated given the specifications contained within the paper. The experimental data analyzed are already publicly available through DOI: 10.1371/journal.pcbi.0030208. Our analysis of the experimental data is within the paper and its Supporting Information files. The code used can be found in the first author’s Github repository at https://github.com/alanlhutchison/empirical-JTK_CYCLE-with-asymmetry