An efficient method for mining cross-timepoint gene regulation sequential patterns from time course gene expression datasets

Chun-Pei Cheng; Yu-Cheng Liu; Yi-Lin Tsai; Vincent S Tseng

doi:10.1186/1471-2105-14-S12-S3

BMC Bioinformatics. 2013 Sep 24;14(Suppl 12):S3. doi: 10.1186/1471-2105-14-S12-S3

PMCID: PMC3848764 PMID: 24267918

Abstract

Background

Observing gene expression changes that imply gene regulation through repeated time course experiments has become increasingly important. However, no effective method currently handles this kind of data. For instance, in a clinical or biological progression such as an inflammatory response or cancer formation, a large number of differentially expressed genes can be identified at different time points through a large-scale microarray approach. For each repeated experiment with different samples, converting the microarray datasets into transactional databases that contain the significant singleton genes at each time point allows sequential patterns implying gene regulation to be identified. Traditional sequential pattern mining methods have been widely and successfully used in other domains, such as mining customer purchasing sequences from a transactional database. To our knowledge, however, these methods are not suitable for such biological datasets because every transaction in the converted database may contain too many items/genes.

Results

In this paper, we propose a new algorithm called CTGR-Span (Cross-Timepoint Gene Regulation Sequential pattern) to efficiently mine CTGR-SPs (Cross-Timepoint Gene Regulation Sequential Patterns), even on larger datasets where traditional algorithms are infeasible. CTGR-Span includes several biologically designed parameters based on the characteristics of gene regulation. We tune these parameters with a procedure based on GO enrichment analysis so that the resulting CTGR-SPs are more biologically meaningful. The proposed method was evaluated on two publicly available human time course microarray datasets, where it outperformed the traditional methods in execution efficiency. When evaluated against the previous literature, the resulting patterns also correlated strongly with the experimental backgrounds of the datasets used in this study.

Conclusions

We propose CTGR-Span, an efficient algorithm for mining biologically meaningful CTGR-SPs. We postulate that biologists can benefit from the new algorithm, since patterns implying gene regulation could provide further insights into the mechanisms of novel gene regulations during a biological or clinical progression. The Java source code, program tutorial and other related materials used in this program are available at http://websystem.csie.ncku.edu.tw/CTGR-Span.rar.

Background

Over the past decade, time course studies have become increasingly important, since most clinical/biological events, such as infection-related chronic/acute inflammatory responses [1-3], drug treatment-related experiments [4], cell cycle arrest [5] or other important issues [6], unfold over a period of time in which aberrant alterations in gene expression lead to different outcomes. Therefore, by consecutively monitoring large numbers of gene expressions and discovering their regulations during clinical/biological manifestations, the hidden layer of biological mechanisms could be unveiled. However, to our knowledge, there is no effective method that can handle this issue, although the high-throughput microarray is a powerful tool that has been widely used to efficiently detect differentially expressed genes among a group of patients in a time course experiment [3,4]. These studies focused only on identifying genes whose differential expression varied with time; whether these genes are associated with each other remained unknown. Their results therefore did not reveal this valuable information.

Sequential pattern mining is one of the most important topics in the field of data mining, especially for database systems. A sequential pattern refers to a set of frequent singleton items (here, differentially expressed genes) that is followed by another set of items in time-stamp ordered transactions. Therefore, if potential gene regulations occur over a period of time, they can be identified by mining such sequential patterns from a database converted from the dataset. Several foundational algorithms using different computational designs, such as AprioriAll [7], SPADE [8] and PrefixSpan [9], have been successfully proposed and used to discover sequential patterns in different databases. The apriori-like (level-wise) GSP [10] and the pattern-growth-based Prefix-growth [11] and DELISP [12] were later designed to incorporate constraints, such as the size of the gap between the singleton items in a sequence, or a time interval within which items are treated as belonging to the same transaction even if they originate from different transactions. In addition, every subpattern derived from a sequential pattern also satisfies the user-set constraint values. This property is called downward closure [7-12]. As a consequence, every possible subpattern of each sequential pattern, particularly of the longer ones, needs to be generated during the decomposing process, which is time-consuming and space-exhausting. When a shorter sequential pattern and a longer pattern containing it have the same number of occurrences across all transactions in the database, the shorter one is eliminated from the final results; the patterns that remain are called closed sequential patterns. Newer algorithms, such as CTSP [13] (with constraints) and CloSpan [14] (without constraints), were designed to tackle this problem.
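To make these definitions concrete, the following minimal Python sketch counts the support of a sequential pattern and checks the downward closure and closedness properties described above. The toy database and the helper names `contains` and `support` are illustrative assumptions; they are not taken from any of the cited algorithms.

```python
def contains(sequence, pattern):
    """Return True if the itemsets of `pattern` occur, in order, as subsets
    of distinct transactions of `sequence` (greedy earliest match)."""
    pos = 0
    for element in pattern:
        # Advance to the first remaining transaction that holds this itemset.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next itemset must match a strictly later transaction
    return True


def support(database, pattern):
    """Number of sequences in `database` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database)


# Hypothetical toy database: each sequence is a time-ordered list of itemsets.
database = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a"}, {"b", "c"}],
    [{"a", "b"}, {"c"}],
]

long_pattern = [{"a"}, {"b", "c"}]
sub_pattern = [{"a"}, {"b"}]

# Downward closure: a subpattern is at least as frequent as its superpattern.
print(support(database, sub_pattern) >= support(database, long_pattern))

# Closedness: <(a)> is not closed, because <(a)(c)> has the same support.
print(support(database, [{"a"}]), support(database, [{"a"}, {"c"}]))
```

In this toy database, <(a)> would be dropped from a closed-pattern result because the longer pattern <(a)(c)> occurs in the same three sequences.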

In addition to these traditional algorithms, a growing number of extended methods have been applied to other interesting topics. For example, WSpan [15] determines weighted sequential patterns from a transactional database, and MAGIIC [16] was designed to discover structure motifs from protein sequences. However, to the best of our knowledge, none of the aforementioned methods is suitable for the widely used microarray data, because a large-scale DNA microarray platform normally consists of tens of thousands of probes/genes, e.g., over 45,000 probes/genes in rice arrays and over 20,000 probes/genes in human arrays. A set of differentially expressed genes (significant singleton gene items) on a single array can be considered a single transaction. Consequently, each transaction (the gene items contained at one time point) may contain too many significant singleton gene items after the numeric datasets are converted into the discrete format of transactional databases [17]. This is called the long transaction issue, and to date no method handles it efficiently. In fact, many items occur frequently at most time points. They resemble housekeeping genes, which are largely insensitive to extracellular stimuli and instead play critical roles as maintenance genes in basic cellular functions [18]. Mining sequential patterns that contain too many such items makes the resulting gene regulations harder to interpret, and these simultaneous items also limit the performance of the preceding sequential pattern mining methods.
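The cost of long transactions can be illustrated with a short calculation. The sketch below assumes a miner that may extend a pattern with any non-empty combination of items that co-occur in one transaction; the function name `itemset_combinations` is hypothetical and the transaction sizes are arbitrary.

```python
def itemset_combinations(n_items: int) -> int:
    """Number of non-empty item combinations within one transaction
    that holds `n_items` distinct items."""
    return 2 ** n_items - 1


# A transaction with a handful of items is cheap to enumerate, whereas a
# transaction with hundreds of significant genes is not.
for n in (3, 10, 20, 200):
    print(n, itemset_combinations(n))
```

Restricting each pattern element to a single gene item, as the cross-timepoint patterns do, keeps the number of candidate extensions per transaction linear in the number of items rather than exponential.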

In this paper, we propose a new algorithm called CTGR-Span (Cross-Timepoint Gene Regulation Sequential pattern) with some biologically designed parameters to solve the issue mentioned above by mining CTGR-SPs (Cross-Timepoint Gene Regulation Sequential Patterns). CTGR-Span ensures that all of the resulting patterns imply gene regulations that take place across different time points during the course of biological observations. The method is an extended and improved version of the one in our previous paper [19], presented at the 2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). The most important changes are as follows. First, we designed a new optimal parameter tuning procedure for the proposed algorithm to determine suitable conditions for pattern mining. An advantage of this procedure is that the standard deviation of the time intervals in a time course dataset no longer needs to be computed. Second, based on this design, we compared our method with two representative sequential pattern mining algorithms, GSP and PrefixSpan, in terms of execution efficiency and effectiveness. The resulting patterns were validated using a manual literature survey and an automatic Gene Ontology enrichment analysis [20]. Finally, we added more explanation of the proposed algorithm, namely i) complete examples that make both the algorithm and the new parameter tuning procedure easier to understand, and ii) more experimental results on the two publicly available human disease-related time course microarray datasets [3,4].

The rest of this paper is organized as follows. The proposed method and materials for analysis are described in Methods. In Results and Discussion, we give the experimental results of the proposed method on two time course gene expression datasets. Concluding remarks are given in Conclusions.

Methods

In this section, we introduce how to efficiently discover CTGR-SPs (Cross-Timepoint Gene Regulation Sequential Patterns) from a time course microarray dataset through 3 main parts: i) an introduction to the experimental background of 2 input microarray datasets, ii) how to convert a numeric dataset into a transactional database, and iii) the kernel of the CTGR-Span (Cross-Timepoint Gene Regulation Sequential pattern) and its required biologically designed arguments.

Input microarray datasets

We tested the method presented in this paper on the same input datasets as in our previous work [19]. In brief, 2 time course gene expression microarray datasets (GSE6377 [3] and GSE11342 [4]) were downloaded from the GEO database. In GSE6377, McDunn et al. measured 8,793 transcriptional changes in the leukocytes of 11 ventilator-associated pneumonia patients across 10 time points. In GSE11342, Taylor et al. monitored 22,283 gene expression changes in the peripheral blood monocytes of 20 hepatitis C virus infected patients across the first 10 weeks after treatment with Peg-interferon alfa-2b plus ribavirin.

Converting microarray datasets into transactional databases

Sequential patterns can be mined directly from a transactional database only if the data are discrete. The probe/gene expression values from the microarray therefore need to be discretized into singleton items within every transaction. We illustrate this with the example in Tables 1 to 3. Table 1 shows the probe/gene expression values of 3 genes, G₁ to G₃, over 4 time points, TP₁ to TP₄, with a fixed interval (1 day). The experiment is performed on 3 patients. The first time point of this example is regarded as the baseline for deriving the significant items at each time point, so all of the values are divided by the value at the first time point. The resulting values can be presented as a fold change matrix, as in Table 2, where a decrease is reported as the negative of the inverse ratio (for example, 100 against a baseline of 249 gives -2.49). Genes whose absolute fold change exceeds a fold-change threshold are defined as significant genes. Suppose that the threshold is set to 1.5; only the eligible significant genes are then preserved as new items, as shown in Table 3. Taking patient 1 as an example, up-regulated G₁, down-regulated G₂ and down-regulated G₃ occur at the second time point and are therefore presented within the same parentheses (transaction). In this example, the set of 3 time-ordered transactions for each patient is called a sequence.
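The conversion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Java implementation: the function names `fold_change` and `to_sequence` are hypothetical, fold changes are computed on unrounded values, and time points without any significant gene are assumed to be skipped. The input values are those of patients 1 and 2 in Table 1.

```python
def fold_change(value, baseline):
    """Signed fold change against the baseline: the plain ratio for an
    increase, the negative inverse ratio for a decrease."""
    if value >= baseline:
        return value / baseline
    return -baseline / value


def to_sequence(expression, threshold=1.5):
    """Convert {gene: [TP1, TP2, ...]} into a time-ordered list of
    (time point, items) transactions, using TP1 as the baseline."""
    n_points = len(next(iter(expression.values())))
    sequence = []
    for tp in range(1, n_points):
        items = []
        for gene, values in expression.items():
            fc = fold_change(values[tp], values[0])
            if abs(fc) > threshold:  # keep only significant genes
                items.append(gene + ("+" if fc > 0 else "-"))
        if items:  # assumption: empty transactions are dropped
            sequence.append((tp + 1, items))
    return sequence


patient1 = {"G1": [249, 656, 100, 50],
            "G2": [333, 100, 777, 989],
            "G3": [500, 250, 157, 333]}
patient2 = {"G1": [123, 950, 135, 354],
            "G2": [222, 987, 592, 80],
            "G3": [300, 222, 246, 735]}

print(to_sequence(patient1))
print(to_sequence(patient2))
```

With the threshold of 1.5 used in the example, the two printed sequences correspond to the rows for patients 1 and 2 in Table 3.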

Table 1.

Example of time course microarray dataset

Patient IDs	Genes	TP₁	TP₂	TP₃	TP₄
1	G₁	249	656	100	50
	G₂	333	100	777	989
	G₃	500	250	157	333

2	G₁	123	950	135	354
	G₂	222	987	592	80
	G₃	300	222	246	735

3	G₁	500	121	100	50
	G₂	400	777	520	60
	G₃	100	300	400	500

Table 3.

Patient IDs	Sequences
1	<(G₁+ G₂- G₃-)₂ (G₁- G₂+ G₃-)₃ (G₁- G₂+ G₃-)₄>
2	<(G₁+ G₂+)₂ (G₂+)₃ (G₁+ G₂- G₃+)₄>
3	<(G₁- G₂+ G₃+)₂ (G₁- G₃+)₃ (G₁- G₂- G₃+)₄>

Table 2.

Patient IDs	Genes	TP₁/TP₁	TP₂/TP₁	TP₃/TP₁	TP₄/TP₁
1	G₁	1.00	2.63	-2.49	-4.98
	G₂	1.00	-3.33	2.33	2.97
	G₃	1.00	-2.00	-3.18	-1.50

2	G₁	1.00	7.72	1.10	2.88
	G₂	1.00	4.45	2.67	-2.78
	G₃	1.00	-1.35	-1.22	2.45

3	G₁	1.00	-4.13	-5.00	-10.00
	G₂	1.00	1.94	1.30	-6.67
	G₃	1.00	3.00	4.00	5.00

Patient IDs	Sequences
1	<(G₁+)₁ (G₂- G₃+)₂ (G₃+)₃>
2	<(G₁+ G₄-)₁ (G₃+)₂ (G₂- G₃+)₄ (G₅+)₅>
3	<(G₈-)₁ (G₁+ G₂-)₂ (G₂- G₃+)₃>
4	<(G₇+)₁ (G₁+ G₃+ G₆-)₂ (G₂- G₃+)₃>

Prefixes	Traditional projected databases	Projected databases of CTGR-Span	Traditional sequential patterns	CTGR-SPs
G₁+	<(G₂- G₃+)₂ (G₃+)₃>	<(G₂- G₃+)₂ (G₃+)₃>	<(G₁+)(G₂-)>	<(G₁+)(G₂-)>
	<(_G₄-)₁ (G₃+)₂ (G₂- G₃+)₄ (G₅+)₅>	<(G₃+)₂ (G₂- G₃+)₄ (G₅+)₅>	<(G₁+)(G₃+)>	<(G₁+)(G₃+)>
	<(_G₂-)₂ (G₂- G₃+)₃>	<(G₂- G₃+)₃>	<(G₁+)(G₂- G₃+)>*	<(G₁+)(G₃+)(G₃+)>
	<(_G₃+ G₆-)₂ (G₂- G₃+)₃>	<(G₂- G₃+)₃>	<(G₁+)(G₃+)(G₃+)>	

G₂-	<(_G₃+)₂ (G₃+)₃>	<(G₃+)₃>	<(G₂-)(G₃+)>	<(G₂-)(G₃+)>
	<(_G₃+)₄ (G₅+)₅>	<(G₅+)₅>	<(G₂- G₃+)>*	
	<(G₂- G₃+)₃>	<(G₂- G₃+)₃>		
	<(_G₃+)₃>	<>		

G₃+	<(G₃+)₃>	<(G₃+)₃>	<(G₃+)(G₃+)>	<(G₃+)(G₃+)>
	<(G₂- G₃+)₄ (G₅+)₅>	<(G₂- G₃+)₄ (G₅+)₅>	<(G₃+)(G₂-)>	<(G₃+)(G₂-)>
	<>	<>	<(G₃+)(G₂- G₃+)>*	
	<(G₆-)₂ (G₂- G₃+)₃>	<(G₂- G₃+)₃>		
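The difference between the two kinds of projected database in the table above can be reproduced with a simplified Python sketch. It is a reading of the tabulated example, not the authors' Java implementation: the names `project` and `mine` are hypothetical, the minimum support of 2 sequences is an assumption inferred from the listed patterns, and none of the biologically designed parameters is modelled. The key assumption is that projecting on a prefix item discards the rest of the transaction that contains it, so every pattern element is a single gene item from a strictly later time point.

```python
from collections import Counter

# The four-patient example database: (time point, items) per transaction.
DB = [
    [(1, {"G1+"}), (2, {"G2-", "G3+"}), (3, {"G3+"})],
    [(1, {"G1+", "G4-"}), (2, {"G3+"}), (4, {"G2-", "G3+"}), (5, {"G5+"})],
    [(1, {"G8-"}), (2, {"G1+", "G2-"}), (3, {"G2-", "G3+"})],
    [(1, {"G7+"}), (2, {"G1+", "G3+", "G6-"}), (3, {"G2-", "G3+"})],
]


def project(sequences, item):
    """For each sequence containing `item`, keep only the transactions
    strictly after the first transaction that contains it."""
    projected = []
    for seq in sequences:
        for i, (_, items) in enumerate(seq):
            if item in items:
                projected.append(seq[i + 1:])
                break
    return projected


def mine(sequences, min_support, prefix=()):
    """Pattern growth with single-item elements; returns {pattern: support}."""
    counts = Counter()
    for seq in sequences:
        seen = set()
        for _, items in seq:
            seen |= items
        counts.update(seen)  # count each item once per sequence
    patterns = {}
    for item, sup in sorted(counts.items()):
        if sup >= min_support:
            pattern = prefix + (item,)
            patterns[pattern] = sup
            patterns.update(mine(project(sequences, item), min_support, pattern))
    return patterns


result = mine(DB, min_support=2)
for pattern, sup in result.items():
    if len(pattern) >= 2:
        print(pattern, sup)
```

Under these assumptions the multi-element patterns printed are the six listed in the CTGR-SPs column, and `project(DB, "G2-")` yields the corresponding projected database, including the empty sequence for patient 4.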

Prefixes	Projected databases	CTGR-SPs
G₁+	<(G₂- G₃+)₂′ (G₃+)₃>	<(G₁+ G₂-)>
	<(G₃+)₂′ (G₂- G₃+)₄ (G₅+)₅>	<(G₁+ G₃+)>
	<(G₂- G₃+)₃′>	
	<(G₂- G₃+)₃′>	

G₂-	<(G₃+)₃′>	<(G₂- G₃+)>
	<(G₅+)₅′>	
	<(G₂- G₃+)₃′>	
	<>	

G₃+	<(G₃+)₃′>	<(G₃+ G₃+)>
	<(G₂- G₃+)₄ (G₅+)₅>	
	<>	
	<(G₂- G₃+)₃′>	

Prefixes	Projected databases	CTGR-SPs
G₁+	<(G₂- G₃+)₂′ (G₃+)₃>	<(G₁+)(G₂-)>
	<(G₃+)₂′ (G₂- G₃+)₄ (G₅+)₅>	<(G₁+)(G₃+)>
	<(G₂- G₃+)₃′>	
	<(G₂- G₃+)₃′>	

G₂-	<(G₃+)₃′>	<(G₂-)(G₃+)>
	<(G₅+)₅′>	
	<(G₂- G₃+)₃′>	
	<>	

G₃+	<(G₃+)₃′>	<(G₃+)(G₃+)>
	<(G₂- G₃+)₄ (G₅+)₅>	
	<>	
	<(G₂- G₃+)₃′>	

	GSE6377							GSE11342

	100%	95%	90%	85%	80%	75%	70%	100%	95%	90%	85%	80%	75%	70%
# of CTGR-SPs	417	426	4,762	5,090	181,295	181,170	6,948,828	32	224	964	3,077	11,105	6,053	17,412
# of longest CTGR-SPs	81	81	59	59	176,552	176,552	208,297	2	28	203	1,717	4	283	4,713
Maximal length of CTGR-SPs	4	4	6	6	6	6	7	4	4	4	4	5	5	5
# of genes in CTGR-SPs	212	211	1,006	996	2,821	2,826	5,313	25	138	466	1,132	2,011	2,801	4,142
# of genes in longest CTGR-SPs	14	14	11	11	214	214	77	2	3	16	67	3	30	160
# of gene pairs in longest CTGR-SPs	70	70	58	58	4,077	4,077	1,548	4	21	128	672	6	119	1,119
-Log(p-value)	0.34†	0.34†	0.00†	0.00†	0.55†	0.55†	0.29†	0.00††	1.26††	0.26††	0.91††	0.00††	1.58††	4.11††
# of GSP	-	-	-	-	-	-	-	-	-	-	-	-	-	-
# of PrefixSpan	-	-	-	-	-	-	-	-	-	-	-	-	-	-

	GSE6377							GSE11342

	100%	95%	90%	85%	80%	75%	70%	100%	95%	90%	85%	80%	75%	70%

GSP	-	-	-	-	-	-	-	-	-	-	-	-	-	-
PrefixSpan	-	-	-	-	-	-	-	-	-	-	-	-	-	-
CTGR-Span	0	0	0.03	0.03	1.65	1.65	220.88	0	0	0	0	0.05	0.23	0.93

I₁	I₂	I₃	Supports
CAV1+ [21]	GNG7+	EIF2D+ [24]	100% (11/11)
		FTSJ2+	100% (11/11)
		NR2E1- [22]	100% (11/11)
		TMOD3- [25]	100% (11/11)
CCL20- [26]	KIF4A+ [27]	FTSJ2+	100% (11/11)
		TMOD3- [25]	100% (11/11)
CSF3R- [28]	GNG7+	CHST7+	100% (11/11)
		EIF2D+ [24]	100% (11/11)
		FTSJ2+	100% (11/11)
		NR2E1- [22]	100% (11/11)
		TMOD3- [25]	100% (11/11)
	KIF4A+ [27]	FTSJ2+	100% (11/11)
		NR2E1- [22]	100% (11/11)
		TMOD3- [25]	100% (11/11)
DGKQ+ [29]	GNG7+	FTSJ2+	100% (11/11)
NUDT4+ [30]	CDC25A+ [31]	NR2E1- [22]	100% (11/11)
	GNG7+	NR2E1- [22]	100% (11/11)
	KIF4A+ [27]	EIF2D+ [24]	100% (11/11)
		FTSJ2+	100% (11/11)
		NR2E1- [22]	100% (11/11)
		SOAT1- [32]	100% (11/11)
	TLR6- [33]	CORO1A+ [34]	100% (11/11)
		KAT2B- [35]	100% (11/11)
		NR2E1- [22]	100% (11/11)
		PLAGL1- [22]	100% (11/11)
NUDT4P1+	CDC25A+ [31]	NR2E1- [22]	100% (11/11)
	GNG7+	NR2E1- [22]	100% (11/11)
	KIF4A+ [27]	EIF2D+ [24]	100% (11/11)
		FTSJ2+	100% (11/11)
		NR2E1- [22]	100% (11/11)
		SOAT1- [32]	100% (11/11)
	TLR6- [33]	CORO1A+ [34]	100% (11/11)
		KAT2B- [35]	100% (11/11)
		NR2E1- [22]	100% (11/11)
		PLAGL1- [22]	100% (11/11)
STX4- [36]	CDC25A+ [31]	NR2E1- [22]	100% (11/11)
		TMOD3- [25]	100% (11/11)
	KIF4A+ [27]	EIF2D+ [24]	100% (11/11)
		FTSJ2+	100% (11/11)
		NR2E1- [22]	100% (11/11)
		TMOD3- [25]	100% (11/11)
	TLR6- [33]	CORO1A+ [34]	100% (11/11)
		KAT2B- [35]	100% (11/11)
		LSM7+ [37]	100% (11/11)
		NR2E1- [22]	100% (11/11)
		PLAGL1- [22]	100% (11/11)

L₁	L₂	L₃	L₄	Supports
CXCL10+ [23]	IFIT2+ [38]	ZNF710-	FECH+ [39]	95% (19/20)
			BPGM+ [40]	95% (19/20)
			SNCA+ [41]	95% (19/20)
			SELENBP1+ [42]	95% (19/20)
		HBZ+	FECH+ [39]	95% (19/20)
			BPGM+ [40]	95% (19/20)
			SNCA+ [41]	95% (19/20)
			SELENBP1+ [42]	100% (20/20)
			TRIM46+	95% (19/20)
		SELENBP1+ [42]	HBZ+	95% (19/20)
			SELENBP1+ [42]	95% (19/20)
		PPP4R4+	SELENBP1+ [42]	95% (19/20)
IFIT2+ [38]	IFIT2+ [38]	ZNF710-	FECH+ [39]	95% (19/20)
			BPGM+ [40]	95% (19/20)
			SNCA+ [41]	95% (19/20)
			SELENBP1+ [42]	95% (19/20)
		HBZ+	FECH+ [39]	95% (19/20)
			BPGM+ [40]	95% (19/20)
			SNCA+ [41]	95% (19/20)
			SELENBP1+ [42]	100% (20/20)
			TRIM46+	95% (19/20)
		SELENBP1+ [42]	HBZ+	95% (19/20)
			SELENBP1+ [42]	95% (19/20)
		PPP4R4+	SELENBP1+ [42]	95% (19/20)
TNFSF10+ [43]	IFIT2+ [38]	HBZ+	SELENBP1+ [42]	95% (19/20)

	28d	31d	34d	37d	40d	43d	46d	49d	52d	55d	58d	61d	64d	≥ 67d
# of CTGR-SPs	112	112	120	126	157	165	160	163	163	161	194	194	220	242
# of longest CTGR-SPs	112	112	8	14	45	2	2	2	2	2	28	28	28	28
Maximal length of CTGR-SPs	1	1	3	3	3	4	4	4	4	4	4	4	4	4
# of genes in CTGR-SPs	112	112	119	123	132	132	132	132	132	132	136	135	136	140
# of genes in longest CTGR-SPs	0	0	4	6	14	2	2	2	2	2	3	3	3	3
# of gene pairs in longest CTGR-SPs	0	0	7	11	42	4	4	4	4	4	21	21	21	21
-Log(p-value)††	-	-	1.02	0.74	0.40	0	0	0	0	0	1.31	1.31	1.31	1.31

	0d	1d	2d	3d	4d	5d	6d	7d	8d	9d	≥ 10d
# of CTGR-SPs	352	419	203	203	169	169	201	189	279	354	423
# of longest CTGR-SPs	81	81	46	46	3	3	201	189	279	354	423
Maximal length of CTGR-SPs	4	4	3	3	2	2	1	1	1	1	1
# of genes in CTGR-SPs	206	212	178	178	174	174	187	183	197	209	213
# of genes in longest CTGR-SPs	14	14	11	11	2	2	11	9	15	20	21
# of gene pairs in longest CTGR-SPs	70	70	33	33	5	5	0	0	0	0	0
-Log(p-value)†	0.37	0.37	0.44	0.44	0.44	0.44	-	-	-	-	-

	0d	3d	6d	9d	12d	15d	18d	21d	24d	27d	30d	33d	36d	39d	42d	45d	48d	51d	54d	57d	60d	63d	≥ 66d
# of CTGR-SPs	214	211	221	194	154	135	131	127	125	128	125	127	136	157	157	163	163	163	163	187	190	198	217
# of longest CTGR-SPs	28	25	25	82	37	17	17	14	10	13	13	7	10	157	157	163	163	163	163	187	190	198	217
Maximal length of CTGR-SPs	4	4	4	3	3	3	3	3	3	3	2	2	2	1	1	1	1	1	1	1	1	1	1
# of genes in CTGR-SPs	136	134	136	134	127	124	123	121	120	121	119	121	125	132	132	132	132	132	132	136	136	136	136
# of genes in longest CTGR-SPs	3	3	3	15	10	9	9	9	7	8	5	5	4	7	7	7	7	7	7	7	7	7	7
# of gene pairs in longest CTGR-SPs	21	19	19	59	26	16	16	14	10	12	10	10	12	0	0	0	0	0	0	0	0	0	0
-Log(p-value)††	1.26	1.37	1.37	0.70	0.00	0.00	0.00	0.00	0.86	0.86	0.65	0.53	0.40	-	-	-	-	-	-	-	-	-	-
