Author manuscript; available in PMC: 2011 May 4.
Published in final edited form as: IEEE Int Workshop Genomic Signal Process Stat. 2008;2008:1–4. doi: 10.1109/GENSIPS.2008.4555659

An Iterative Time Windowed Signature Algorithm for Time Dependent Transcription Module Discovery

Jia Meng 1, Shou-Jiang Gao 2,3, Yufei Huang 1,3
PMCID: PMC3087294  NIHMSID: NIHMS165784  PMID: 21552463

Abstract

An algorithm for the discovery of time varying modules using genome-wide expression data is presented here. When applied to large-scale time series data, our method is designed to discover not only the transcription modules but also their timing information, which is rarely annotated by the existing approaches. Rather than assuming commonly defined time constant transcription modules, a module is depicted as a set of genes that are co-regulated during a specific period of time, i.e., a time dependent transcription module (TDTM). A rigorous mathematical definition of TDTM is provided, which serves as an objective function for retrieving modules. Based on the definition, an effective signature algorithm is proposed that iteratively searches for the transcription modules in the time series data. The proposed method was tested on simulated systems and applied to human time series microarray data collected during Kaposi's sarcoma-associated herpesvirus (KSHV) infection. The results have been verified with Expression Analysis Systematic Explorer (EASE).

1. Introduction

DNA microarray experiments monitor the expression profiles of thousands of genes simultaneously. Using this technology, a large amount of genome-wide expression data has been accumulated and made available.

To reveal insights into the transcriptional network from large-scale expression data, a crucial step is to classify genes and conditions into functional modules, i.e., sets of genes sharing similar functions. Although it is well recognized that different transcription modules (TMs) exist under different conditions, most existing algorithms consider only time static transcription modules under a specific experimental condition, thus failing to capture changes of cell state. We seek in this paper to overcome this limitation.

Rather than assuming time static TMs, a more realistic scenario is considered where a module is defined for a specific period of time, i.e., a time dependent transcription module (TDTM). To develop an algorithm for TDTM discovery, a rigorous mathematical definition is provided for TDTM, which defines the information to be extracted from time series expression data. This definition also serves as an objective function, on which an effective iterative time windowed signature algorithm (ITWSA) is developed that iteratively refines the module contents and time periods. In order to retrieve the time information, a sliding time window is introduced. Given a sufficiently large number of initial sets, ITWSA can, in principle, determine all the modules.

2. Problem Formulation

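Consider a gene expression data matrix Y obtained from a time series microarray experiment. Define

$Y_{g t}$: expression level of gene $g$ at time $t$

$Y_g(t_i : t_j) = [Y_{g t_i}, Y_{g t_{i+1}}, \ldots, Y_{g t_j}]$

$Y_{gT} = [Y_{g t_0}, \ldots, Y_{g t_T}]$

where $g \in [1, 2, \ldots, G]$ and $t \in [0, 1, \ldots, T]$.

Given a pair of thresholds $\tau_T$, $\tau_G$ and a window width $W = 2l + 1$, a TDTM is defined by a set of genes $G_m$ and a set of time periods $[T_m, L_m]$, where the $i$-th period starts at $t_m(i)$ and has length $l_m(i)$:

$$
\begin{aligned}
M(\tau_T, \tau_G) := \Big\{ (G_m, [T_m, L_m]) \;\Big|\;
& \forall\, [t_m(i), l_m(i)] \in [T_m, L_m]: \\
& \quad \frac{1}{|G_m|} \sum_{g \in G_m} \rho\big( Y_g(t_m(i) : t_m(i) + l_m(i) - 1),\; Y_{G_m}(t_m(i) : t_m(i) + l_m(i) - 1) \big) > \tau_T \\
& \wedge\; \forall\, g \in G_m: \\
& \quad \frac{1}{\sum_{i=1}^{|L_m|} l_m(i)} \sum_{[t_m(i), l_m(i)] \in [T_m, L_m]} \Big[ l_m(i)\, \rho\big( Y_g(t_m(i) : t_m(i) + l_m(i) - 1),\; Y_{G_m}(t_m(i) : t_m(i) + l_m(i) - 1) \big) \Big] > \tau_G \Big\}
\end{aligned}
$$

where $\rho$ is the Pearson correlation, $|X|$ is the number of components of $X$, and $Y_{G_m t_i} = \frac{1}{|G_m|} \sum_{g \in G_m} Y_{g t_i}$ is the centroid of the module at time $t_i$. With this definition, the objective is to design an algorithm that can determine the gene set $G_m$ and the time periods $[T_m, L_m]$ that satisfy the definition.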
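The two conditions of the definition can be checked directly for a candidate module. The following is a minimal sketch, not the authors' implementation; it assumes the expression matrix is a list of per-gene lists `Y[g][t]`, that each period is a `(start, length)` pair of time indices, and that the correlation of a constant profile is taken to be 0. The function names `pearson`, `centroid` and `is_tdtm` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences.
    Assumption: returns 0.0 when either sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def centroid(Y, genes, start, length):
    """Mean profile of `genes` over time indices start .. start+length-1."""
    return [sum(Y[g][t] for g in genes) / len(genes)
            for t in range(start, start + length)]

def is_tdtm(Y, genes, periods, tau_T, tau_G):
    """Check both conditions of the TDTM definition.
    `periods` is a list of (t_m(i), l_m(i)) pairs."""
    weighted = {g: 0.0 for g in genes}
    total_len = sum(length for _, length in periods)
    for start, length in periods:
        c = centroid(Y, genes, start, length)
        rhos = {g: pearson(Y[g][start:start + length], c) for g in genes}
        # Condition 1: in every period, the mean correlation with the
        # centroid must exceed tau_T.
        if sum(rhos.values()) / len(genes) <= tau_T:
            return False
        for g in genes:
            weighted[g] += length * rhos[g]
    # Condition 2: every gene's length-weighted correlation must exceed tau_G.
    return all(weighted[g] / total_len > tau_G for g in genes)

# Toy data: genes 0 and 1 rise together over the first four time points.
Y = [[1, 2, 3, 4, 0, 0], [2, 4, 6, 8, 5, 1], [4, 3, 2, 1, 7, 7]]
print(is_tdtm(Y, [0, 1], [(0, 4)], tau_T=0.9, tau_G=0.9))
```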

3. The Iterative Time Windowed Signature Algorithm

We describe in this section the ITWSA in detail. From an initial input set, ITWSA first selects time periods during which the selected genes are co-regulated; then it identifies genes that are co-regulated during the selected time periods. Iterating between these two steps revises the output until it reaches a stable state (genes and time information no longer change). Presumably, when a sufficient number of initial gene sets is used, it is possible to retrieve all the TMs in the data. Since the number of possible initial sets scales exponentially with the number of genes, efficient initial sets should be selected.

The algorithm for determining a TDTM can be summarized as follows:

Step 1: A first gene is randomly selected, and the 29 genes that have the largest Pearson correlation with the first gene are added to form the initial gene set;

Step 2: A set of consecutive time points of width (2l + 1) is chosen to measure the convergence of a suggested TM at a particular time. Keep all the $t_i$ that satisfy:

$$
S_{t_i}^{G_m} = \frac{1}{|G_m|} \sum_{g \in G_m} \rho\big( Y_g(t_i - l : t_i + l),\; Y_{G_m}(t_i - l : t_i + l) \big) > \tau_T
$$

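Step 3: All the genes that are co-regulated by this suggested TDTM are identified by calculating, for each gene, its length-weighted Pearson correlation with the center of the TDTM over the selected time periods, and keeping the genes that satisfy:

$$
S_{[T_m, L_m]}^{g} = \frac{1}{\sum_{i=1}^{|L_m|} l_m(i)} \sum_{[t_m(i), l_m(i)] \in [T_m, L_m]} \Big[ l_m(i)\, \rho\big( Y_g(t_m(i) : t_m(i) + l_m(i) - 1),\; Y_{G_m}(t_m(i) : t_m(i) + l_m(i) - 1) \big) \Big] > \tau_G
$$

where $[T_m, L_m]$ is calculated in the previous step.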

Iterate between steps 2 and 3 until convergence, i.e., the gene set Gm and time period sets [Tm, Lm] no longer change during iterations.

To find another module, the algorithm restarts but in step 1 a gene is selected randomly from the remaining genes excluding genes in the previous initial set.
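The three steps and the iteration can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the paper does not say how the retained window centers are turned into time periods, so here overlapping retained windows are merged into contiguous `(start, length)` periods; a `max_iter` guard is added; and the search returns empty results if either step selects nothing. All function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; assumed to be 0.0 for a constant sequence."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)

def mean_profile(Y, genes, ts):
    """Centroid of `genes` at the time indices `ts`."""
    return [sum(Y[g][t] for g in genes) / len(genes) for t in ts]

def initial_set(Y, seed_gene, size=30):
    """Step 1: the seed gene plus the size-1 genes most correlated with it."""
    others = sorted((g for g in range(len(Y)) if g != seed_gene),
                    key=lambda g: pearson(Y[g], Y[seed_gene]), reverse=True)
    return sorted([seed_gene] + others[:size - 1])

def select_periods(Y, genes, l, tau_T):
    """Step 2: keep window centers whose mean gene-to-centroid correlation
    exceeds tau_T, then merge overlapping windows (merging is an assumption)."""
    periods = []
    for t in range(l, len(Y[0]) - l):
        ts = range(t - l, t + l + 1)
        c = mean_profile(Y, genes, ts)
        score = sum(pearson([Y[g][u] for u in ts], c) for g in genes) / len(genes)
        if score <= tau_T:
            continue
        if periods and t - l < periods[-1][0] + periods[-1][1]:
            start = periods[-1][0]
            periods[-1] = (start, t + l + 1 - start)
        else:
            periods.append((t - l, 2 * l + 1))
    return periods

def select_genes(Y, genes, periods, tau_G):
    """Step 3: keep genes whose length-weighted correlation with the
    centroid over the selected periods exceeds tau_G."""
    total = sum(n for _, n in periods)
    cents = [mean_profile(Y, genes, range(s, s + n)) for s, n in periods]
    kept = []
    for g in range(len(Y)):
        score = sum(n * pearson(Y[g][s:s + n], c)
                    for (s, n), c in zip(periods, cents)) / total
        if score > tau_G:
            kept.append(g)
    return kept

def itwsa(Y, seed_gene, l=2, tau_T=0.6, tau_G=0.7, init_size=30, max_iter=50):
    """Alternate steps 2 and 3 until genes and periods stop changing."""
    genes, periods = initial_set(Y, seed_gene, init_size), []
    for _ in range(max_iter):
        new_periods = select_periods(Y, genes, l, tau_T)
        if not new_periods:
            return [], []
        new_genes = select_genes(Y, genes, new_periods, tau_G)
        if not new_genes:
            return [], []
        if new_genes == genes and new_periods == periods:
            break
        genes, periods = new_genes, new_periods
    return genes, periods
```

To search for further modules, `itwsa` would be called again with a different `seed_gene`, drawn at random from the genes not used in earlier initial sets.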

4. Test On Simulated Systems

ITWSA is first validated on simulated systems. To simulate data, transcription modules as shown in Figure 1 were first generated, which are represented as matrix Xg.

Figure 1. Simulated TMs and true clustered TMs.


The left figure shows the transcription module structure in the simulated data (the horizontal axis is time; the vertical axis represents genes); each color represents a separate TM. The right figure shows the same TMs but with genes that belong to the same TM put together. There are 6 TMs, each containing 40 to 60 randomly selected genes and occupying a single time period (e.g., the red module in Fig. 1) or two separate time periods (e.g., the green module in Fig. 1). A gene may be involved in no TM or in more than one TM at different times. Data is generated by scaling and adding noise to the left figure.

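Then a dynamic state space model is used to generate the simulated data:

$$
\begin{cases}
X_{g t_{i+1}} = X_{g t_i} + n \\
Y_{g t_i} = a_g X_{g t_i} + u_{g t_i}
\end{cases}
$$

where $n \sim N(0, 1)$, $a_g \sim \mathrm{Unif}(0, 1)$, $u_{g t_i} \sim N(u_g, 0.01)$, and $u_g \sim \mathrm{Unif}(0, 1)$.

For genes belonging to the same TM, we use the same $X_{G_m t}$ during the time periods in which the corresponding TM exists, i.e.,

$$
X_{g t} = X_{G_m t}, \quad \forall g \in G_m,\; t \in [T_m, L_m]
$$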
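A minimal sketch of this generator is given below. Several details are assumptions because the text does not fix them: the initial state is set to 0, the value 0.01 in N(u_g, 0.01) is read as a variance (standard deviation 0.1), each module draws its own random-walk trajectory, and modules are passed as `(gene_list, [(start, length), ...])` pairs. The function name `simulate` is hypothetical.

```python
import random

def simulate(G, T, modules, seed=0):
    """Return (X, Y) with X[g][t] the hidden state and Y[g][t] the observation.
    `modules` is a list of (gene_list, [(start, length), ...]) pairs."""
    rng = random.Random(seed)

    def random_walk():
        # X[t+1] = X[t] + n with n ~ N(0, 1); X[0] = 0 is an assumption.
        x = [0.0] * T
        for t in range(1, T):
            x[t] = x[t - 1] + rng.gauss(0.0, 1.0)
        return x

    # Every gene starts with its own independent state trajectory.
    X = [random_walk() for _ in range(G)]

    # Genes of a module share one trajectory during the module's periods.
    for genes, periods in modules:
        shared = random_walk()
        for start, length in periods:
            for g in genes:
                X[g][start:start + length] = shared[start:start + length]

    # Observation model: Y = a_g * X + u, with u ~ N(u_g, 0.01).
    Y = []
    for g in range(G):
        a_g = rng.uniform(0.0, 1.0)
        u_g = rng.uniform(0.0, 1.0)
        Y.append([a_g * X[g][t] + rng.gauss(u_g, 0.1) for t in range(T)])
    return X, Y

# Example: 5 genes, 8 time points, one module of genes 0 and 1 at t = 2..5.
X, Y = simulate(5, 8, [([0, 1], [(2, 4)])], seed=1)
```

Because each gene has its own scale `a_g` and offset `u_g`, module members differ in amplitude and level but remain strongly correlated within the module's periods, which is what the Pearson-based scores are meant to detect.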

An example of the discovered module is shown in Figure 2.

Figure 2. True TM structure and a typical result.


The left figure shows again the true transcription module structure in the simulated data; the right figure depicts the 26 TMs we identified from the simulated data, with genes belonging to the same TM put together. Although some genes are misclassified, most genes of the encoded TMs were correctly identified, along with the time period of the corresponding TM. (Some TMs were identified more than once in our result.)

Our simulation results show that, when the thresholds and window size are chosen properly, ITWSA correctly identifies not only most genes in the encoded transcription modules but also the time periods during which the TMs exist.

5. Test on Real Data

We applied ITWSA to analyze human time series microarray data collected during KSHV infection. The data was produced with the Affymetrix Human Genome U133A chip and consists of the expression profiles of 14,500 genes at times t=[0,1,3,6,10,16,24,36,54,78] (hours) after infection. Since priority was given to earlier states, the sample times were unevenly chosen.

5.1. Data Preprocessing

An intensity filter (the intensity of a gene should be above 100 in at least 25 percent of the samples), and a variance filter (the inter-quartile range of log2–intensities should be at least 0.5) were first applied to select 2210 differentially expressed genes. They were further normalized so that all genes contribute equally in the algorithm.
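The two filters can be sketched as follows. This is an illustration under assumptions rather than the pipeline actually used: intensities are assumed to be positive, the quartiles are computed with the "inclusive" method of Python's `statistics.quantiles` (the text does not name a quartile rule), and the unspecified normalization is assumed to be a per-gene z-score. The function names are hypothetical.

```python
from math import log2
from statistics import mean, pstdev, quantiles

def passes_filters(intensities, min_intensity=100.0, min_fraction=0.25, min_iqr=0.5):
    """Intensity filter and variance filter for one gene.
    Assumes all intensities are positive so that log2 is defined."""
    # Intensity filter: above `min_intensity` in at least 25% of the samples.
    fraction = sum(v > min_intensity for v in intensities) / len(intensities)
    # Variance filter: inter-quartile range of log2 intensities at least 0.5.
    q1, _, q3 = quantiles([log2(v) for v in intensities], n=4, method="inclusive")
    return fraction >= min_fraction and (q3 - q1) >= min_iqr

def zscore(profile):
    """Assumed normalization: zero mean and unit variance per gene."""
    m, s = mean(profile), pstdev(profile)
    if s == 0:
        return [0.0 for _ in profile]
    return [(v - m) / s for v in profile]

# Example: a gene that is both expressed and variable passes both filters.
gene = [50, 60, 200, 400, 800, 1600, 90, 120, 300, 3000]
print(passes_filters(gene))
```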

5.2. Results

With τT = 0.6, τG = 0.7 and a window size of 5, 145 TMs were identified, and 1230 of the 2210 genes were associated with at least one TM. See Figure 3 for a detailed discussion of the results.

Figure 3. A typical TDTM and centroid plot of all 145 TDTMs identified from data.


The upper figure provides an example of a TDTM identified from the data. It depicts 1) the expression trajectories of all the genes belonging to this TDTM over the entire sample period, and 2) the centroid of the TDTM during the identified time period. The TDTM is identified to exist between 10h and 36h, as described by the centroid plot. It can be seen that these genes behave alike only during the identified time period and do not show similarity in other periods.

The lower figure shows the centroid plot of all 145 TDTMs identified. Due to the size of the sliding window, we cannot determine whether a TDTM exists at times t=[0h, 1h, 54h, 78h]. As can be seen:
  1. Many of the TMs identified are similar. (In fact, many of them are even identical.)
  2. A single TM may exist in an earlier time period, in a later period, or throughout the whole time course.
  3. Generally, TMs that exist only in an earlier time period tend to go up (TDTMs illustrated by black lines in the figure), while TMs that exist only in a later period tend to go down (TDTMs illustrated by red lines in the figure).
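The four time points that cannot be assessed follow directly from the window size of 5 used in Section 5.2: with l = 2, a full window needs two samples on each side of its center. The short sketch below illustrates this with the sample times from Section 5; the function name `assessable_centers` is hypothetical.

```python
def assessable_centers(times, window):
    """Time points that can be the center of a full window of odd size `window`."""
    l = (window - 1) // 2
    return times[l:len(times) - l]

times = [0, 1, 3, 6, 10, 16, 24, 36, 54, 78]  # hours after infection
centers = assessable_centers(times, window=5)
excluded = [t for t in times if t not in centers]
print(centers)   # [3, 6, 10, 16, 24, 36]
print(excluded)  # [0, 1, 54, 78]
```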

5.3. GO analysis

To further validate the results, EASE (Expression Analysis Systematic Explorer) is used to verify whether an identified TDTM is enriched in meaningful gene categories. The result is supportive. The EASE results for the TDTM example shown in Figure 3 are tabulated in Table 1:

Table 1. Enriched GO Terms of the TDTM shown in Figure 3.

| Gene Category | List Hits | List Total | Population Hits | Population Total | EASE score |
|---|---|---|---|---|---|
| defense response | 26 | 162 | 765 | 11051 | 1.11E-04 |
| response to external stimulus | 36 | 162 | 1278 | 11051 | 1.51E-04 |
| immune response | 24 | 162 | 690 | 11051 | 1.61E-04 |
| integral to plasma membrane | 35 | 160 | 1254 | 10895 | 2.37E-04 |
| response to biotic stimulus | 26 | 162 | 830 | 11051 | 3.90E-04 |
| integral to membrane | 61 | 160 | 2795 | 10895 | 4.83E-04 |
| membrane | 83 | 160 | 4191 | 10895 | 4.90E-04 |
| receptor activity | 34 | 165 | 1268 | 11182 | 7.02E-04 |
| plasma membrane | 43 | 160 | 1786 | 10895 | 8.31E-04 |

The table shows the GO analysis result of the specific TDTM shown in Figure 3 by using EASE.

The gene counts in the table cover only genes that are assayed or listed (i.e., all genes on the microarray (HGU133A) or all genes in the TDTM) and that are annotated within a given system of classifying genes (e.g., the ‘Molecular Function’ branch of the Gene Ontology). Therefore the population can change from one system to the next. “Hits” refers to genes falling within the gene category in question. List Hits is the number of genes in the TDTM that belong to the specific gene category, and Population Hits is the number of genes in the total group of genes assayed that belong to it. The EASE score (the upper bound of the distribution of jackknife Fisher exact probabilities given the List Hits, List Total, Population Hits and Population Total) is essentially a sliding-scale, conservative adjustment of the Fisher exact probability that strongly penalizes the significance of categories supported by few genes and negligibly penalizes categories supported by many genes. The smaller the EASE score, the more significantly the gene category is enriched.
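The penalizing behaviour of the EASE score can be illustrated with a small sketch. It assumes one common reading of the jackknife adjustment, namely that one hit is removed from the list (List Hits and List Total each reduced by one) before a one-sided Fisher exact probability is computed from the hypergeometric distribution. The EASE software may differ in detail, so this sketch is not expected to reproduce the values in Table 1 exactly. The function names are hypothetical.

```python
from math import comb

def fisher_upper_tail(hits, list_total, pop_hits, pop_total):
    """One-sided Fisher exact probability: P(X >= hits) when `list_total`
    genes are drawn from `pop_total` genes, `pop_hits` of which are in
    the category (hypergeometric distribution)."""
    denom = comb(pop_total, list_total)
    top = min(list_total, pop_hits)
    num = sum(comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
              for k in range(hits, top + 1))
    return num / denom

def ease_score(list_hits, list_total, pop_hits, pop_total):
    """Assumed jackknife form: recompute the probability with one hit
    removed from the list, which penalizes categories with few hits."""
    if list_hits < 1:
        return 1.0
    return fisher_upper_tail(list_hits - 1, list_total - 1, pop_hits, pop_total)

# Counts of the first row of Table 1 (defense response).
plain = fisher_upper_tail(26, 162, 765, 11051)
penalized = ease_score(26, 162, 765, 11051)
print(plain, penalized)
```

Removing one hit changes the probability only slightly when a category is supported by many genes, but sharply when it is supported by two or three, which matches the sliding-scale behaviour described in the text.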

The listed GO categories are significantly enriched among the genes belonging to this TDTM, and these GO categories (defense response, response to external stimulus, etc.) are relevant to KSHV infection.

6. Discussion and Future Work

In the algorithm, three parameters need to be chosen: the two thresholds τT and τG, and the width of the window (2l + 1), which is used when we calculate the Pearson correlation in the step that decides the time during which a TM exists.

The two thresholds serve as the detection thresholds of the search. Because both are lower bounds on a correlation score, presumably the smaller they are, the more TMs can be discovered and the more genes they will contain. The width of the window determines the sensitivity of the search with respect to time. When a smaller window is used, the time boundaries of a TM can be located more precisely; however, the false discovery rate (FDR) will also increase.

Future work will seek to increase the stability and the precision simultaneously. Post-processing of the results will also be emphasized.

Acknowledgments

Yufei Huang is supported by an NSF Grant CCF-0546345.

Contributor Information

Jia Meng, Email: jmeng@lonestar.utsa.edu.

Shou-Jiang Gao, Email: gaos@uthscsa.edu.

Yufei Huang, Email: yufei.huang@utsa.edu.
