IMA: an R package for high-throughput analysis of Illumina's 450K Infinium methylation data

Dan Wang; Li Yan; Qiang Hu; Lara E Sucheston; Michael J Higgins; Christine B Ambrosone; Candace S Johnson; Dominic J Smiraglia; Song Liu

Bioinformatics. 2012 Jan 16;28(5):729–730. doi: 10.1093/bioinformatics/bts013


PMCID: PMC3289916 PMID: 22253290

Abstract

Summary: The Illumina Infinium HumanMethylation450 BeadChip is a newly designed high-density microarray for quantifying the methylation level of over 450 000 CpG sites within the human genome. Illumina Methylation Analyzer (IMA) is a computational package designed to automate the pipeline for exploratory analysis and summarization of site-level and region-level methylation changes in epigenetic studies that use the 450K DNA methylation microarray. The pipeline loads the data from the Illumina platform and provides user-customizable functions commonly required for exploratory methylation analysis of individual sites as well as annotated regions.

Availability: IMA is implemented in the R language and is freely available from http://www.rforge.net/IMA.

Contact: song.liu@roswellpark.org

1 INTRODUCTION

As a major epigenetic modification, DNA methylation plays a vital role in transcriptional regulation and chromatin remodeling. Aberrant DNA methylation profiles have been found to be associated with many human diseases, including cancer (Jones and Baylin, 2007; Portela and Esteller, 2010). DNA methylation microarrays are a popular approach for characterizing the epigenetic landscape of human cells (Laird, 2010). Two widely used commercial platforms for methylation profiling are the GoldenGate Methylation BeadArray and the Infinium HumanMethylation27 BeadChip, both provided by Illumina Inc. These two arrays quantitatively target 1505 CpG loci covering ~800 genes and 27 578 CpG sites covering ~14 000 genes, respectively. Since their release, many analytic methods have been developed to process and analyze Illumina DNA methylation array data [for a recent summary, see Siegmund (2011)].

Compared with previously released Illumina DNA methylation platforms, the recently launched Infinium HumanMethylation450 BeadChip represents a significant increase in the CpG site density for quantifying methylation events. At the gene level, the 450K microarray covers 99% of RefSeq genes with multiple sites in the annotated promoter (1500 bp or 200 bp upstream of the transcription start site), 5′-UTR, first exon, gene body and 3′-UTR. In terms of CpG context, it covers 96% of CpG islands with multiple sites in the annotated CpG islands, shores (regions flanking islands) and shelves (regions flanking shores) (Bibikova et al., 2011). While the role of DNA methylation in promoter and/or CpG island regions has long been appreciated, the importance of DNA methylation in gene body or shore regions for transcriptional regulation and tumor initiation has only recently come to attention (Irizarry et al., 2009; Maunakea et al., 2010). The significantly increased coverage makes the 450K microarray a powerful platform for exploring methylation profiles in these annotated regions. As each targeted region contains at least one CpG site, treating the region as a unit in the differential methylation analysis might help identify regions with consistent, coordinated methylation changes. From a statistical point of view, region-based differential methylation analysis reduces the burden of multiple comparisons and can increase the power to detect differentially methylated regions associated with the phenotypes of interest. To this end, we have developed a pipeline, IMA, for automatic site-level and region-level methylation analysis using the 450K microarray. While the pipeline is primarily designed as an automatic tool for exploratory analysis and summarization, users can tailor it within the R statistical computing and graphics environment for their specific needs.

2 DESCRIPTIONS

IMA is implemented in R and can be run on any platform with an existing R and Bioconductor installation. The user can run the pipeline with default settings or specify optional routes in the parameter file. An overview of the IMA pipeline is provided below:

Preprocessing: IMA takes as input the β values representing the methylation levels of individual sites, as reported by Illumina BeadStudio or GenomeStudio software. It allows the user to choose among several filtering steps or to modify the filtering criteria for specific quality control purposes. By default, IMA filters out loci with missing β values, loci on the X chromosome and loci with a median detection P > 0.05. Because a probe containing SNP(s) at or near the targeted CpG site might measure genomic variation rather than DNA methylation level, users can choose to filter out loci measured by such probes. An option for sample-level quality control is also provided (Christensen et al., 2011). Although the raw β values are analyzed as recommended by Illumina, the user can choose the arcsine square root transformation when modeling the methylation level as the response in a linear model (Marsit et al., 2011; Rocke, 1993). Logit transformation is also available as an option (Kuan et al., 2010). By default, IMA performs no normalization; quantile normalization is available as an alternative preprocessing option. It has been shown that quantile normalization is not sufficient for removing all the unwanted technical variation across samples (Teschendorff et al., 2009). The development of normalization strategies for DNA methylation studies is an active area of ongoing research (Aryee et al., 2011).
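The default locus filters and the two optional transformations described above can be illustrated with a minimal sketch. It is written in plain Python rather than R and is not IMA's own code: the function names, the toy locus identifiers and the `eps` guard in the logit are assumptions made for illustration only.

```python
import math
from statistics import median


def keep_locus(betas, detection_p, chromosome, p_cutoff=0.05):
    """Return True if a locus passes the three default filters described above."""
    if any(b is None for b in betas):       # missing beta value in any sample
        return False
    if chromosome == "X":                   # locus on the X chromosome
        return False
    if median(detection_p) > p_cutoff:      # median detection P above the cutoff
        return False
    return True


def arcsine_sqrt(beta):
    """Arcsine square root transformation of a beta value in [0, 1]."""
    return math.asin(math.sqrt(beta))


def logit(beta, eps=1e-6):
    """Logit transformation; eps (an assumption) keeps beta = 0 or 1 finite."""
    b = min(max(beta, eps), 1.0 - eps)
    return math.log(b / (1.0 - b))


# Hypothetical toy data: locus -> (beta values, detection P-values, chromosome)
loci = {
    "cg_a": ([0.10, 0.20, 0.15], [0.001, 0.002, 0.001], "1"),
    "cg_b": ([0.80, None, 0.75], [0.001, 0.001, 0.001], "2"),   # missing value
    "cg_c": ([0.50, 0.55, 0.60], [0.001, 0.001, 0.001], "X"),   # X chromosome
    "cg_d": ([0.30, 0.35, 0.40], [0.20, 0.06, 0.01], "3"),      # median P = 0.06
}

kept = [name for name, (b, p, chrom) in loci.items() if keep_locus(b, p, chrom)]
transformed = {name: [arcsine_sqrt(b) for b in loci[name][0]] for name in kept}
```

In this toy input only `cg_a` survives: the other three loci each fail exactly one of the default filters.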

Methylation index calculation: the promoter, 5′-UTR, first exon, gene body and 3′-UTR are gene-based regions. The CpG island and its surrounding shore and shelf regions are not necessarily gene-based, depending on their distance to the nearest genes. For each specific region (e.g. first exon), IMA collects the loci within it and derives an index of the overall region methylation value. Currently, three index metrics are implemented in IMA: mean, median and Tukey's Biweight robust average. By default, the median β value is used as the region's methylation index for further analysis.
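As a rough illustration of the three index metrics, the sketch below collapses site-level β values into one value per region and sample. It is plain Python, not IMA's R implementation. The one-step Tukey biweight shown here uses tuning constants `c = 5` and `epsilon = 1e-4`, which are assumptions rather than values stated in the text, and the region and site names are hypothetical.

```python
from statistics import mean, median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight average (c and epsilon are assumed constants)."""
    m = median(values)
    s = median([abs(v - m) for v in values])    # median absolute deviation
    weights = []
    for v in values:
        u = (v - m) / (c * s + epsilon)         # scaled distance from the median
        weights.append((1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


INDEX_METRICS = {"mean": mean, "median": median, "tbrm": tukey_biweight}


def region_index(site_betas, region_to_sites, metric="median"):
    """Collapse site-level beta values into one index per region and sample."""
    summarize = INDEX_METRICS[metric]
    n_samples = len(next(iter(site_betas.values())))
    index = {}
    for region, sites in region_to_sites.items():
        index[region] = [
            summarize([site_betas[s][j] for s in sites]) for j in range(n_samples)
        ]
    return index


# Hypothetical toy data: three sites measured in two samples
site_betas = {
    "cg1": [0.10, 0.20],
    "cg2": [0.30, 0.40],
    "cg3": [0.90, 0.80],
}
region_to_sites = {"GENE1_1stExon": ["cg1", "cg2", "cg3"]}

median_index = region_index(site_betas, region_to_sites)
mean_index = region_index(site_betas, region_to_sites, metric="mean")
```

The median and the biweight average are less sensitive than the mean to a single outlying site within a region, which is one reason a robust default may be preferred.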

Differential methylation analysis: for each specific region, the Wilcoxon rank-sum test (default), Student's t-test and empirical Bayes statistics are available for inference in differential testing. General linear models are available as an option to infer methylation changes associated with a continuous covariate (e.g. age), as well as to adjust for confounding factors (e.g. batch). A variety of multiple testing correction algorithms are available, including the stringent Bonferroni correction and the widely used false discovery rate control. Users can specify the significance criteria in the parameter file. The same statistical inference and multiple testing correction procedures can also be applied to each single site to obtain site-level differential methylation inference.
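To make the default test concrete, here is a minimal Python sketch of a two-sided Wilcoxon rank-sum test on one region's methylation index. It is not IMA's code. It relies on the normal approximation without tie or continuity correction, so for small groups its P-values will differ from those of an exact test. The function name and the toy values are assumptions.

```python
import math


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Returns (U statistic of group_a, approximate P-value). Tied values get
    average ranks; no tie or continuity correction is applied.
    """
    pooled = sorted(
        (v, g) for g, grp in enumerate((group_a, group_b)) for v in grp
    )
    values = [v for v, _ in pooled]

    # Assign average ranks to runs of tied values.
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[j + 1] == values[i]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    mu = n_a * n_b / 2.0
    sigma = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u_a - mu) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided normal tail
    return u_a, p_value


# Hypothetical methylation index of one region in two phenotype groups
controls = [0.10, 0.15, 0.20, 0.25, 0.30]
cases = [0.60, 0.65, 0.70, 0.75, 0.80]
u_stat, p_val = rank_sum_test(controls, cases)
```

Running the same function once per region (or per site) yields the vector of raw P-values that is then passed to a multiple testing correction.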

Output: detailed output files are provided for each of the three modules above. For the preprocessing module, the output contains a matrix of methylation values for qualified loci across qualified samples. For the methylation index calculation module, there is a matrix of methylation indices across the samples for each region category of interest (e.g. South Shore). For the differential methylation analysis module, the differential methylation values (e.g. delta β) together with both raw and adjusted P-values of each region (or site) of interest are provided.
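The kind of per-region result row described above (delta β, raw P-value, adjusted P-values) can be sketched as follows. This is a plain Python illustration, not IMA's output format. It assumes that delta β is the difference in group means and uses the Benjamini-Hochberg procedure as one common form of false discovery rate control. The region names and P-values are hypothetical.

```python
from statistics import mean


def bonferroni(p_values):
    """Bonferroni-adjusted P-values, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position                 # ascending rank of p_values[i]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def delta_beta(case_betas, control_betas):
    """Difference in mean methylation between groups (assumed definition)."""
    return mean(case_betas) - mean(control_betas)


# Hypothetical raw P-values for four regions
regions = ["R1", "R2", "R3", "R4"]
raw_p = [0.01, 0.04, 0.03, 0.005]
bonf_p = bonferroni(raw_p)
fdr_p = benjamini_hochberg(raw_p)

rows = [
    {"region": r, "raw_p": p, "bonferroni_p": b, "fdr_p": f}
    for r, p, b, f in zip(regions, raw_p, bonf_p, fdr_p)
]
example_delta = delta_beta([0.6, 0.7], [0.2, 0.3])
```

Because the Bonferroni adjustment scales with the number of tests, testing a smaller set of regions instead of every individual site makes the correction less severe.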

3 DISCUSSION

The major difference between IMA and existing R packages for Infinium methylation analysis (e.g. Du et al., 2008) is that IMA provides a pipeline that automates the tasks commonly required for exploratory analysis and summarization of 450K DNA methylation data at both the site level and the region level. The package makes use of Illumina methylation annotation for region definition, as well as several Bioconductor packages for various preprocessing and differential testing steps (Gentleman et al., 2004).

Rather than recommending a specific analysis method, the IMA package aims to provide a range of commonly used DNA methylation microarray analysis options from which users can choose for automated exploratory analysis and summarization. Written in the open-source R environment, it gives users the flexibility to adopt, extend and customize the functionality for their specific needs. It can be used as an automatic pipeline for methylation index calculation and differential analysis, in support of downstream functional exploration and hypothesis generation. For example, the matrix of methylation indices for shore regions produced by IMA can be used as the input for model-based clustering (Houseman et al., 2008) to identify clustered shores associated with the phenotype of interest.

Analytic methods for DNA methylation microarray analysis are still under rapid development (Laird, 2010; Siegmund, 2011). Future development of the IMA package will extend its functionality by incorporating the latest preprocessing and differential analysis methods. For example, options will be added to filter out defective bead types (e.g. mismatched or non-uniquely aligned probes) detected by systematic re-annotation efforts (Barbosa-Morais et al., 2010).

ACKNOWLEDGEMENTS

We wish to thank Ali Torkamani and Benjamin Tycko for sharing the SNP annotations, and Jeffrey Conroy for discussions.

Funding: National Institutes of Health (grant R01-CA133264 to C.B.A. and M.J.H.; R01-CA095045 to C.S.J.).

Conflict of Interest: none declared.

REFERENCES

Aryee M.J., et al. Accurate genome-scale percentage DNA methylation estimates from microarray data. Biostatistics. 2011;12:197–210. doi: 10.1093/biostatistics/kxq055.
Barbosa-Morais N.L., et al. A re-annotation pipeline for Illumina BeadArrays: improving the interpretation of gene expression data. Nucleic Acids Res. 2010;38:e17. doi: 10.1093/nar/gkp942.
Bibikova M., et al. High density DNA methylation array with single CpG site resolution. Genomics. 2011;98:288–295. doi: 10.1016/j.ygeno.2011.07.007.
Christensen B.C., et al. DNA methylation, isocitrate dehydrogenase mutation, and survival in glioma. J. Natl Cancer Inst. 2011;103:143–153. doi: 10.1093/jnci/djq497.
Du P., et al. lumi: a pipeline for processing Illumina microarray. Bioinformatics. 2008;24:1547–1548. doi: 10.1093/bioinformatics/btn224.
Gentleman R.C., et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80. doi: 10.1186/gb-2004-5-10-r80.
Houseman E.A., et al. Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions. BMC Bioinformatics. 2008;9:365. doi: 10.1186/1471-2105-9-365.
Irizarry R.A., et al. The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat. Genet. 2009;41:178–186. doi: 10.1038/ng.298.
Jones P.A., Baylin S.B. The epigenomics of cancer. Cell. 2007;128:683–692. doi: 10.1016/j.cell.2007.01.029.
Kuan P.F., et al. A statistical framework for Illumina DNA methylation arrays. Bioinformatics. 2010;26:2849–2855. doi: 10.1093/bioinformatics/btq553.
Laird P.W. Principles and challenges of genome-wide DNA methylation analysis. Nat. Rev. Genet. 2010;11:191–203. doi: 10.1038/nrg2732.
Marsit C.J., et al. DNA methylation array analysis identifies profiles of blood-derived DNA methylation associated with bladder cancer. J. Clin. Oncol. 2011;29:1133–1139. doi: 10.1200/JCO.2010.31.3577.
Maunakea A.K., et al. Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature. 2010;466:253–257. doi: 10.1038/nature09165.
Portela A., Esteller M. Epigenetic modifications and human disease. Nat. Biotechnol. 2010;28:1057–1068. doi: 10.1038/nbt.1685.
Rocke D.M. On the beta transformation family. Technometrics. 1993;35:72–81.
Siegmund K.D. Statistical approaches for the analysis of DNA methylation microarray data. Hum. Genet. 2011;129:585–595. doi: 10.1007/s00439-011-0993-x.
Teschendorff A.E., et al. An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS One. 2009;4. doi: 10.1371/journal.pone.0008274.

