Epigenetics. 2012 Aug 1;7(8):961–962. doi: 10.4161/epi.20941

Report on the Infinium 450k Methylation Array Analysis Workshop

April 20, 2012 UCL, London, UK

Tiffany Morris, Robert Lowe
PMCID: PMC3427291  PMID: 22722984

Abstract

A new platform for DNA methylome analysis is Illumina's Infinium HumanMethylation450. This technology extends the previous HumanMethylation27 BeadChip: each chip holds 12 samples and 4 to 8 chips can be processed together (48 to 96 samples in total), with the methylation status of more than 480,000 cytosines across the genome assessed in each sample. The platform incorporates two probe types that use different assay designs (InfiniumI and InfiniumII). Although this has allowed more CpG sites to be assessed, it has also introduced technical variation between the two probe types, which has complicated the analysis. Many groups are working on normalization methods and analysis pipelines, while many others are struggling to make sense of their new data sets. This motivated the organization of a meeting at University College London focused solely on the analysis methods and problems related to this new platform. The meeting was attended by 125 computational and bench scientists from 11 countries. There were 10 speakers, a small poster session and a discussion session.

Keywords: 450K, April, HumanMethylation450, Illumina, London, methylation, report, workshop

Overview and Validation of 450k

Robert Lowe (Queen Mary University of London, UK) gave an overview of the 450k platform for attendees who were new to the technology and unfamiliar with its design.

Juan Sandoval (Bellvitge Biomedical Research Institute, Spain) presented work validating the 450k using normal colon mucosa and a human colon cancer cell line. He then presented work on centromere instability and facial anomalies syndrome in which the 450k was used to identify DMRs that have since been verified by bisulfite sequencing. Sandoval also showed that, in the bisulfite sequencing data, the regions surrounding the CpGs chosen for the 450k array had a methylation pattern similar to that of the Infinium-targeted CpG itself.

The Importance and Source of Batch Effects

Rafael Irizarry (Johns Hopkins University, USA) was the invited keynote speaker and focused on the lessons that could be learned from expression arrays. He emphasized the importance of study design in avoiding confounding factors and highlighted that techniques such as clustering, principal component analysis (PCA) and multi-dimensional scaling (MDS) make it possible to spot potential confounders such as sex, race and batch. Sample swaps are another key problem; one way to spot them is to check the sex of each sample on the array using the probes from the X and Y chromosomes. He also reported that 40,484 probes on the 450k array may be affected by a SNP and that these may also appear as differentially methylated probes. His warning was that failing to check or control for these problems can lead to incorrect conclusions being drawn.
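A sex check of the kind described above could be sketched as follows. This is only an illustration, not the method presented at the meeting: it assumes that Y-chromosome probes mostly fail detection in female samples, and the function names, the detection p-value cutoff (0.01) and the failure fraction (0.5) are all hypothetical choices.

```python
def predict_sex(y_detection_pvalues, p_cutoff=0.01, fail_fraction=0.5):
    """Guess sex from detection p-values of Y-chromosome probes.

    Assumption (not from the report): in female samples most Y probes
    show no signal above background, so their detection p-values are high.
    """
    if not y_detection_pvalues:
        raise ValueError("no Y-chromosome probes supplied")
    failed = sum(p > p_cutoff for p in y_detection_pvalues)
    return "F" if failed / len(y_detection_pvalues) > fail_fraction else "M"


def flag_sex_mismatches(samples):
    """Return IDs of samples whose predicted sex disagrees with the record.

    samples maps sample ID -> (recorded sex, list of Y-probe detection p-values).
    """
    return [
        sample_id
        for sample_id, (recorded_sex, pvalues) in samples.items()
        if predict_sex(pvalues) != recorded_sex
    ]


# Toy data: S3 is recorded as female but has strong Y-probe signal.
samples = {
    "S1": ("M", [0.0001, 0.002, 0.0]),
    "S2": ("F", [0.4, 0.6, 0.2]),
    "S3": ("F", [0.0, 0.001, 0.0005]),
}
mismatches = flag_sex_mismatches(samples)
```

A flagged sample is only a candidate swap; the cutoffs would need tuning on real data before any sample is excluded.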

Irizarry finished by questioning differences found using a single CpG and explained a method, to be included in the R package minfi, that can look for differentially methylated regions on the array as opposed to single CpG differences. He highlighted the effect by showing examples in cancer data sets.
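The idea of looking for regions rather than single CpGs can be illustrated with a deliberately simple sketch. It is not the minfi method mentioned above; it is a swapped-in heuristic that merges neighbouring probes whose methylation difference between groups points in the same direction. The function name, the input layout and all three thresholds (max_gap, min_probes, min_delta) are hypothetical.

```python
def candidate_regions(probes, max_gap=500, min_probes=3, min_delta=0.2):
    """Group neighbouring probes with consistent methylation differences.

    probes: iterable of (chromosome, position, delta_beta) tuples, where
    delta_beta is the mean beta difference between two groups.
    Returns a list of (chromosome, start, end, n_probes, mean_delta).
    """
    regions = []
    run = []  # probes in the region currently being extended

    def flush():
        # Keep the current run only if it contains enough probes.
        if len(run) >= min_probes:
            deltas = [d for _, _, d in run]
            regions.append(
                (run[0][0], run[0][1], run[-1][1], len(run), sum(deltas) / len(deltas))
            )
        run.clear()

    for chrom, pos, delta in sorted(probes):
        if abs(delta) < min_delta:
            flush()  # a probe with a small difference ends the region
            continue
        if run:
            same_chrom = run[-1][0] == chrom
            close = pos - run[-1][1] <= max_gap
            same_sign = (run[-1][2] > 0) == (delta > 0)
            if not (same_chrom and close and same_sign):
                flush()
        run.append((chrom, pos, delta))
    flush()
    return regions


probes = [
    ("chr1", 100, 0.30), ("chr1", 250, 0.35), ("chr1", 400, 0.25),
    ("chr1", 5000, 0.40),                      # isolated single CpG
    ("chr2", 100, -0.30), ("chr2", 200, -0.10), ("chr2", 300, -0.40),
]
regions = candidate_regions(probes)
```

In this toy input only the three neighbouring chr1 probes form a region; the isolated probe at position 5000 and the interrupted chr2 run are dropped, which is the behaviour that distinguishes a regional call from a single-CpG call.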

Probe Type Normalization and Methods to Identify Statistical Confounding

One of the problems with the 450k array, previously shown by Dedeurwaerder et al., 2011, is that the two probe types on the array (Type I and Type II) follow different distributions.

Andrew Teschendorff (University College London, Cancer Institute, UK) presented a normalization method he has developed, currently titled Beta Mixture Quantile Dilation. Comparing his method with the previously published peak-correction method of Dedeurwaerder et al., 2011, he showed that peak correction may be problematic when methylated peaks are not well defined. He presented a data set where this was the case and his method performed better, while on other data sets both methods performed similarly.

Teschendorff also discussed the importance of statistical confounding. He discussed Surrogate Variable Analysis (SVA) and Independent Surrogate Variable Analysis (I-SVA) as methods for identifying confounding factors that need to be removed or adjusted for. Confounders are often unknown or only approximated; these methods make it possible to model them from the data. He showed that the use of I-SVA improves the robustness of identifying confounders.

Comparison of Current Normalization Methods

Since the release of the 450k, several R packages have been developed that offer a range of normalization methods. A number of talks compared different normalization methods and the pros and cons of each.

Pei-Chein Tsai (King's College London, UK) used MAQ to show that a number of probes map to multiple locations when up to 2 mismatches are allowed, and argued that these probes should be removed. She used twin study data to compare 27k data with 450k data, showing that the correlation between the two was reasonably good but similar to the correlation between unrelated subjects on the 450k. Tsai also compared 4 normalization methods for the 450k, all based on quantile normalization, which differed in the subsets of probes that were quantile normalized.
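For readers unfamiliar with the technique, plain quantile normalization can be sketched as below. This is a generic textbook version, not one of the four variants compared in the talk; the function name and the list-of-lists input layout are assumptions, and ties are broken by probe order for simplicity.

```python
def quantile_normalize(samples):
    """Force every sample to share the same value distribution.

    samples: list of samples, each a list of values for the same probes
    in the same order. Returns a new list of normalized samples.
    """
    n_probes = len(samples[0])
    if any(len(s) != n_probes for s in samples):
        raise ValueError("all samples must cover the same probes")

    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_samples = [sorted(s) for s in samples]
    reference = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_probes)
    ]

    normalized = []
    for s in samples:
        # Probe indices ordered from lowest to highest value in this sample.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


raw = [[0.1, 0.5, 0.9], [0.2, 0.8, 0.4]]
normalized = quantile_normalize(raw)
```

Normalizing on a subset of probes, as in the variants described above, would amount to calling such a function on the values of that subset only (for example, one probe type at a time).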

Wouter den Hollander (Leiden University Medical Centre, NL) compared methods available in 3 different R packages: lumi, minfi and IMA. He showed that different methods can lead to different numbers of CpGs being called significant, although the overlap between methods could be reasonable. Based on this, he suggested that the quantile normalization strategy in the IMA package seemed to be too aggressive; ultimately, the take-home message was that the top hits are fairly consistent across all strategies.

Kelly Rabionet (Centre for Genomic Regulation, Spain) highlighted the use of the Quality Control probes on the array to discover possible failed samples, discarding those with an average detection p-value > 0.05. She showed clear separation of the cerebellum from other brain areas and, by using linear discriminant analysis, could separate further classes of diagnostic samples. Then, by performing differential analysis with the LIMMA R package, she found a number of interesting CpG sites. Many of the corresponding genes contained multiple CpG sites from her top hits. Future work will involve the validation of these results as well as testing a number of normalization methods suggested at the meeting.
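The sample filter described above (discarding samples whose average detection p-value exceeds 0.05) can be written in a few lines. The function name and the dictionary input layout are hypothetical; only the 0.05 cutoff comes from the talk.

```python
def failed_samples(detection_p, cutoff=0.05):
    """Return IDs of samples whose mean detection p-value exceeds the cutoff.

    detection_p maps sample ID -> list of per-probe detection p-values.
    """
    failed = []
    for sample_id, pvalues in detection_p.items():
        if not pvalues:
            raise ValueError(f"no detection p-values for sample {sample_id}")
        if sum(pvalues) / len(pvalues) > cutoff:
            failed.append(sample_id)
    return failed


# Toy data: sample B has poor signal on most probes.
detection_p = {
    "A": [0.001, 0.002, 0.01],
    "B": [0.2, 0.01, 0.3],
}
to_discard = failed_samples(detection_p)
```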

One attempt to determine conclusively how normalization methods differ was presented by Leonardo Schalkwyk (King's College London, UK). He suggested using particular probe sets to test the “goodness” of a normalization, including imprinting DMRs (which have monoallelic methylation), X-inactivation (which differs between males and females) and SNP probes (which show distinct AA, AB and BB genotypes).

Visualization and Integration of GWAS and EWAS

Arne Mueller (Novartis, Switzerland) used ComBat to adjust for batch effects and asked whether significance is driven by outliers and how the surrounding CpGs behave. He also presented a complex new R package, Gviz, for data visualization and integration.

Mueller hopes that future developments in the field will include models that integrate GWAS and EWAS data, but added that it is important to remember that DNA methylation is tissue specific.

450k for Copy Number Estimation

Andrew Feber (University College London, Cancer Institute, UK) presented his work investigating whether the 450k array can be used to estimate copy number. He compared data from the 450k array with the Illumina CytoSNP array using copy number states identified from Affymetrix SNP 6.0 arrays. This comparison showed a good correlation among the three platforms and suggested that the 450k can be used to predict the copy number state of cancer genomes, which would represent significant savings of both time and money. Further validation is needed, as is testing of whether the approach works with FFPE data. Integration of methylation and copy number data may enable us to define epigenetic and genetic mechanisms driving aberrant expression.

Discussion and Future Considerations

A number of key points were discussed about future work to increase our understanding of both methylation and the effectiveness of the 450k array. Overall, it was shown that visualization tools, such as cluster analysis or PCA plots, can highlight possible confounders as well as give insight into the biology. Methods for adjusting for confounders once they have been found are still at an early stage and could be one area for future work.

Previous work has highlighted the problems with the two different types of probes contained on the array, and a number of correction and normalization strategies were discussed throughout the meeting. There is still no clear method or pipeline that one should use in the analysis and, indeed, different techniques may be appropriate for different samples; nevertheless, it appears that the most significant differences may be found regardless of which method is used. It is currently difficult to determine which strategy is most effective without suitable data sets on which to test different theories.

One limitation of the 450k array is that it may detect only single CpG differences around a potential region of interest. While methods for extracting differentially methylated regions are soon to be available, bisulfite sequencing or perhaps targeted sequencing may provide a reliable validation technique as well as further insight.

Several groups at the meeting were involved in longitudinal studies. Finding the best way to detect the small differences in these data sets while avoiding confounding from factors such as age and time is a complex problem that needs more work. A web forum has been set up to facilitate discussion on 450k analysis: http://groups.google.com/group/epigenomicsforum.

Acknowledgments

This meeting was supported by two EU-FP7 consortia: BLUEPRINT and IDEAL.

Articles from Epigenetics are provided here courtesy of Taylor & Francis