Exploratory Methods for Checking Quality of Microarray Data

Eun-Kyung Lee; Taesung Park

Bioinformation. 2007 Apr 10;1(10):423–428. doi: 10.6026/97320630001423

PMCID: PMC1896057 PMID: 17597933

Abstract

In microarray experiments many undesirable systematic variations are commonly observed. Often investigators analyzing microarray data need to make subjective decisions about the quality of the experiment, by examining its chip image and a simple scatter plot. Thus, a more rigorous but simple method is desirable to determine the quality of microarray data. We propose two exploratory methods to investigate the quality of microarray experiments with replicated chips. The first method is based on correlations among chips and the second on the actual intensity values for each gene. The proposed methods are illustrated using a real microarray data set. The methods provide an initial estimation for determining the quality of microarray experiments.

Keywords: microarray, quality, checking, exploratory methods

Background

In microarray experiments, different sources of systematic and random error can arise, which may significantly affect inference on the measured gene expression patterns. A normalization procedure is regularly employed to remove (or minimize) the artifacts due to such errors. While these normalization approaches are useful for adjusting the bias of each individual chip, they do not provide a rigorous statistical criterion for detecting chips of poor quality. At an early stage of analysis, each microarray slide is often examined graphically, using scatter plots between chips to look for large variability (or low reproducibility) and any unusual patterns. However, such examinations rely on subjective human pattern recognition, and chips of poor quality can frequently enter the subsequent analysis, resulting in unreliable inference for the whole microarray study. Therefore, in this study we are concerned with checking the overall quality of microarray experiments and identifying outlying chips that have much lower reproducibility than the other chips.

There have been several approaches for checking reproducibility in microarray experiments. For example, Parmigiani et al. [1] defined an integrative correlation between two experiments that are conducted separately to answer the same biological question. This integrative correlation is calculated for each gene and is called the gene's reproducibility score. King et al. [2] used correlations, the rate of two-fold changes, and principal component analysis to check the reproducibility of gene expression measurements. Park et al. [3] proposed diagnostic plots for identifying outlying slides. In this paper, we propose an exploratory method to check the quality of microarray data using two different approaches.

Methodology

We first describe the approach based on the correlations between chips and then describe the other approach based on the actual intensity values.

Correlation Based Approach

Given in the supplementary material linked below

Example

In this section, the proposed methods are applied to murine B-cell data. To study gene expression profiles in murine B-cell development, total cellular RNA was extracted from five consecutive B-lymphocyte lineage sub-populations (pre-BI cells, large pre-BII cells, small pre-BII cells, immature B-cells, and mature B-cells), and then, gene expression profiles from the five consecutive stages of mouse B cell development were generated with more than five replicates. [8]

Murine B-cell data show lower sensitivity (0.66) and specificity (0.02). For further exploratory analysis, we apply the proposed methods. In the chip-wise correlation plot (Figure 1), the treatments other than small Pre-BII cells (chips 23 to 27) show high chip-wise correlations. The chip-wise correlations of the small Pre-BII cell treatment have a highly skewed distribution, and the third replicate (chip 25) has very small correlations compared to the other chips in the same group. Therefore, we can conclude that this third replicate is problematic and has to be checked or treated before further analysis. In the summary correlation plot (Figure 2), the murine B-cell data show one outlier, chip 25. All the chips except those in the small Pre-BII group are located in the upper triangle, and chip 25 is far from the other chips. This supports the result from the chip-wise correlation plot (Figure 1).

Figure 1. Chip-wise correlation plot: Murine B-cell data. The plots are for the five treatments: Immature B (1, 2, 3, 4, 5), Large Pre-BII (6, 7, 8, 9, 10), Mature B (11, 12, 13, 14, 15, 16), Pre-BI (17, 18, 19, 20, 21, 22), and Small Pre-BII (23, 24, 25, 26, 27).

Figure 2. The summary correlation plot. The solid line across the plot is the reference line for specificity. Chips below this line have low specificity and chips above this line have high specificity.
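The chip-wise comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (the exact definitions are in the supplementary material): it assumes that a chip-wise correlation is the Pearson correlation between one chip and each other replicate in the same treatment group. The intensity values in `TOY_GROUP` are made up for the example; only the chip numbers 23 to 27 echo the small Pre-BII group.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def chipwise_correlations(chips):
    """Map each chip id to its correlations with the other chips in the group.

    `chips` maps a chip id to a list of log intensities, one per gene,
    with the genes in the same order on every chip.
    """
    return {
        chip_id: [
            pearson(values, other)
            for other_id, other in chips.items()
            if other_id != chip_id
        ]
        for chip_id, values in chips.items()
    }


def mean_chipwise_correlation(chips):
    """Average the chip-wise correlations for each chip."""
    return {
        chip_id: sum(cors) / len(cors)
        for chip_id, cors in chipwise_correlations(chips).items()
    }


# Hypothetical log intensities for 8 genes on 5 replicate chips (invented data).
TOY_GROUP = {
    23: [5.1, 6.0, 7.2, 8.5, 9.1, 10.2, 11.0, 12.6],
    24: [4.9, 6.3, 7.0, 8.3, 8.9, 10.4, 11.2, 12.4],
    25: [9.0, 6.5, 11.0, 7.0, 12.0, 8.0, 10.0, 9.5],  # deliberately scrambled
    26: [5.0, 6.1, 7.3, 8.4, 9.2, 10.1, 11.3, 12.5],
    27: [5.2, 6.2, 7.1, 8.6, 9.0, 10.3, 11.1, 12.3],
}

means = mean_chipwise_correlation(TOY_GROUP)
suspect = min(means, key=means.get)  # chip with the lowest mean correlation
print("chip with lowest mean chip-wise correlation:", suspect)
```

A chip whose correlations with its fellow replicates are all low pulls its own mean down sharply, while each good chip is affected by only one low value, which is why the outlying replicate stands out in a per-chip display.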

In Table 1, the last columns of P_KS and P_W (Small Pre-BII) show lower p-values than the other columns. Therefore, we can conclude that the distribution of within correlation in the small Pre-BII group is greater than the distributions of the other groups. Also, the mean of within correlation in the small Pre-BII group is less than the means of the other groups.

Table 1. P_KS and P_W matrices of Murine B-cell data.

P_KS	Imm. B	Large BII	Mat. B	Pre BI	Small BII
Immature B	1.00	0.41	0.34	0.52	0.20
Large Pre BII	0.90	1.00	0.62	0.62	0.20
Mature B	0.89	0.81	1.00	0.77	0.15
Pre BI	0.89	0.81	0.94	1.00	0.15
Small Pre BII	1.00	0.90	0.89	0.72	1.00

P_W	Imm. B	Large BII	Mat. B	Pre BI	Small BII
Immature B	1.00	0.37	0.30	0.34	0.11
Large Pre BII	0.66	1.00	0.45	0.42	0.24
Mature B	0.72	0.58	1.00	0.47	0.17
Pre BI	0.68	0.60	0.55	1.00	0.18
Small Pre BII	0.90	0.78	0.84	0.83	1.00
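The definitions of P_KS and P_W are given in the supplementary material and are not reproduced here. As a rough sketch only, the block below assumes that P_KS is a p-value from a two-sample Kolmogorov-Smirnov comparison of the within-group correlations of two treatments. It uses a two-sided statistic with a permutation p-value, which is a swapped-in, self-contained stand-in and not necessarily the test the authors ran (the asymmetric matrices in Table 1 suggest their tests were one-sided). The correlation values are invented.

```python
import random
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect_right(xs, t) / len(xs) - bisect_right(ys, t) / len(ys))
        for t in xs + ys
    )


def ks_permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Permutation p-value for the two-sided KS statistic.

    Group labels are shuffled n_perm times; the p-value is the share of
    shuffles whose statistic is at least as large as the observed one
    (with the usual +1 correction so the p-value is never exactly zero).
    """
    rng = random.Random(seed)
    observed = ks_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        permuted = ks_statistic(pooled[: len(x)], pooled[len(x):])
        if permuted >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical within-group correlations for two treatments (invented data).
high_group = [0.97, 0.98, 0.96, 0.99, 0.95]
low_group = [0.60, 0.85, 0.40, 0.70, 0.55]

print("KS statistic:", ks_statistic(high_group, low_group))
print("permutation p-value:", ks_permutation_pvalue(high_group, low_group))
```

A permutation p-value is used here only to keep the example free of external dependencies; with larger samples an asymptotic or exact KS p-value from a statistics library would be the usual choice.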


Next, we apply the test based on intensities within each treatment. We set the FDR at 5%. Table 2 shows the results of the intensity-based tests. Murine B-cell data show quite different patterns. In particular, Γ for the small Pre-BII treatment is the lowest among the five treatments. Therefore, we can conclude that the murine B-cell data set is less reproducible.
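The text fixes the FDR at 5% but the procedure used to control it is described only in the supplementary material. For orientation, the sketch below shows the Benjamini-Hochberg step-up procedure, one widely used way to select tests at a given FDR. It is an illustrative substitute, not the authors' stated method, and the p-values are invented.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: sort the m p-values, find the largest rank k with
    p_(k) <= k * fdr / m, and reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= rank * fdr / m:
            largest_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[index] = True
    return rejected


# Hypothetical per-gene p-values (invented data).
example_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042,
                   0.060, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(example_pvalues, fdr=0.05)
print("genes flagged at 5% FDR:", sum(flags))
```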

Table 2. Summary table for the within test based on intensities.

Treatment	Conc/disc	Γ
Murine B-cell (27)
Immature B(5)	1086/5509	0.6707
Large Pre BII (5)	1079/5516	0.6728
Mature B(6)	1145/5450	0.6528
Pre BI (6)	1095/5500	0.6679
Small Pre BII (5)	1320/5275	0.5997
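The Γ column can be checked directly against the two counts in each row. The definition of Γ is in the supplementary material, so the formula below is inferred from the numbers rather than taken from the authors: in every row the two counts sum to 6595, and the published Γ equals the absolute difference of the counts divided by their sum, to four decimal places. Because the absolute difference is used, this check does not settle which of the two counts is the concordant one.

```python
# (first count, second count, published Γ), copied from Table 2.
TABLE2 = {
    "Immature B": (1086, 5509, 0.6707),
    "Large Pre BII": (1079, 5516, 0.6728),
    "Mature B": (1145, 5450, 0.6528),
    "Pre BI": (1095, 5500, 0.6679),
    "Small Pre BII": (1320, 5275, 0.5997),
}


def gamma_from_counts(first, second):
    """Inferred relation: |second - first| / (first + second)."""
    return abs(second - first) / (first + second)


for treatment, (first, second, published) in TABLE2.items():
    computed = gamma_from_counts(first, second)
    # Agreement to four decimals means a gap below half a unit in the 4th place.
    matches = abs(computed - published) < 5e-5
    print(f"{treatment}: total={first + second}, "
          f"computed={computed:.4f}, published={published}, match={matches}")
```

For example, for Immature B the computation is (5509 - 1086) / 6595 = 4423 / 6595, which rounds to 0.6707.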


We can conclude that the murine B-cell data show lower reproducibility, sensitivity and specificity. Therefore, it is not clear whether a further statistical test procedure can successfully detect true differences among the five consecutive stages, especially for small pre-BII cells. This is mainly due to one outlying chip (chip 25), as shown in Figure 3. Therefore, the analyst should check the experimental procedure and the tissues used for this chip before further statistical analysis.

Discussion

At the initial stage of the microarray data analysis, the exploratory data analysis (EDA) provides the first contact with data. The techniques of EDA consist of a number of informal steps such as checking the quality of the data, calculating simple summary statistics, and constructing appropriate graphs.

The proposed method is a more formal way of checking quality than simple EDA plots. Thus, at an initial stage of the microarray data analysis, the proposed method provides useful information regarding the quality of microarray experiments. The correlation based approaches check the treatment-wise quality, while the test based on the actual intensity values checks the gene-wise quality for each gene.

The proposed method is quite effective in detecting outlying chips. It is much easier to apply than traditional methods of checking for outlying chips, such as principal component analysis or the quality control plot [3].

There are some statistical issues to be taken into consideration, however. First, the log intensities may not be approximately normally distributed. For simplicity, we have assumed the normal distribution for testing all hypotheses. However, extensions to other distributional assumptions are certainly possible. For example, other distributions such as the log-normal and gamma distributions can be easily handled. Second, we did not use a stringent criterion for identifying the concordant/discordant genes. All these genes should be checked using an analysis such as SAM [9] or the t-test [10] at a later stage of analysis. Third, the correlation coefficients derived from all possible pairs of chips may not be independent. We did not consider these correlations in the current analysis. A more sophisticated approach based on the bootstrap, which accounts for possible correlations among the correlation coefficients, is under development.

We would like to emphasize that the proposed method is an exploratory analysis. We believe the proposed method is practically useful, simple and easy to implement, and that it provides a more rigorous approach to a preliminary overview of the quality of microarray experiments. Most of the proposed methods are implemented in the software arrayQCplot [11], which can be downloaded from Bioconductor (www.bioconductor.org).

Supplementary Material

Data 1

97320630001423S1.pdf (100.7 KB, pdf)

Figure 3. The scatter plot matrix of five replicates for Small Pre BII treatment in Murine B-cell data

Acknowledgments

The authors would like to thank the anonymous referees and the editor, whose comments were extremely helpful. This study was supported by the National Research Laboratory Program of Korea Science and Engineering Foundation (M10500000126) and the Brain Korea 21 Project of the Ministry of Education.

Footnotes

Citation: Lee & Park, Bioinformation 1(10): 423-428 (2007)

References

1. Parmigiani G, et al. Clin Cancer Res. 2004;10:2922. doi: 10.1158/1078-0432.ccr-03-0490.
2. King C, et al. J Mol Diagn. 2005;7:57. doi: 10.1016/S1525-1578(10)60009-8.
3. Park T, et al. Biotechniques. 2005;38:463. doi: 10.2144/05383RR02.
4. Pavlidis P, Noble WS. Genome Biol. 2001;2:RESEARCH0042. doi: 10.1186/gb-2001-2-10-research0042.
5. Jain N, et al. Bioinformatics. 2003;19:1945. doi: 10.1093/bioinformatics/btg264.
6. Storey JD, Tibshirani R. Proc Natl Acad Sci. 2003;100:9440. doi: 10.1073/pnas.1530509100.
7. Pounds S. Brief Bioinform. 2005;38:463.
8. Hoffmann R, et al. Genome Res. 2002;12:98. doi: 10.1101/gr.201501.
9. Tusher VG, et al. Proc Natl Acad Sci. 2001;98:5116. doi: 10.1073/pnas.091062498.
10. Choe SE, et al. Genome Biol. 2005;6:R16. doi: 10.1186/gb-2005-6-2-r16.
11. Lee EK, et al. Bioinformatics. 2006;22:2305. doi: 10.1093/bioinformatics/btl367.
