An analytical pipeline for genomic representations used for cytosine methylation studies

Reid F Thompson; Mark Reimers; Batbayar Khulan; Mathieu Gissot; Todd A Richmond; Quan Chen; Xin Zheng; Kami Kim; John M Greally

doi:10.1093/bioinformatics/btn096

Author manuscript; available in PMC: 2016 Oct 13.

Published in final edited form as: Bioinformatics. 2008 Mar 18;24(9):1161–1167. doi: 10.1093/bioinformatics/btn096

PMCID: PMC5061929 NIHMSID: NIHMS816444 PMID: 18353789

Motivation

Representations of the genome can be generated by the selection of a subpopulation of restriction fragments using ligation-mediated PCR. Such representations form the basis for a number of high-throughput assays, including the HELP assay to study cytosine methylation. We find that HELP data analysis is complicated not only by PCR amplification heterogeneity but also by a complex and variable distribution of cytosine methylation. To address this, we created an analytical pipeline and novel normalization approach that improves concordance between microarray-derived data and single locus validation results, demonstrating the value of the analytical approach. A major influence on the PCR amplification is the size of the restriction fragment, requiring a quantile normalization approach that reduces the influence of fragment length on signal intensity. Here we describe all of the components of the pipeline, which can also be applied to data derived from other assays based on genomic representations.

1 INTRODUCTION

There are several techniques described that rely on restriction enzyme-generated representations to sample the genome. These include representational oligonucleotide microarray analysis with Roche NimbleGen microarrays (ROMA; Lucito, et al., 2003), whole-genome sampling analysis (WGSA) with Affymetrix SNP microarrays (Kennedy, et al., 2003), representational difference analysis (RDA; Lisitsyn and Wigler, 1993) and other techniques to study cytosine methylation in the genome (Hatada, et al., 2006) including the HELP assay that we have previously described (Khulan, et al., 2006). Cytosine methylation is an epigenetic mark maintained by DNA methyltransferases and important for transcriptional regulation (Jones and Baylin, 2007). The HELP assay is based on HpaII Tiny Fragment (HTF; Bird, 1986) Enrichment by Ligation-mediated PCR (HELP). The approach relies on comparative genomic hybridization of HpaII-digested and MspI-digested DNA. While HpaII is only able to digest its unmethylated recognition motif (5’-CCGG-3’), its isoschizomer, MspI, digests any HpaII sites whether methylated or not at the central CG dinucleotide (Waalwijk and Flavell, 1978). MspI therefore serves as an internal control for the assay, representing all HTFs equivalently unless the locus is deleted, amplified, or mutated to prevent restriction digest. We have shown that PCR enrichment of the two genomic representations followed by co-hybridization on a custom microarray provides a powerful tool for determining local methylation status on a genome-wide scale (Khulan, et al., 2006).

However, PCR amplification of mixed templates, a step inherent to representational techniques, can cause certain fragments to amplify with efficiencies different from those of other fragments. A number of important biases have been described that affect PCR with increasing numbers of amplification cycles (Mathieu-Daude, et al., 1996; Suzuki and Giovannoni, 1996). Furthermore, random artifacts introduced at early stages of the PCR can have dramatic effects (Polz and Cavanaugh, 1998; Wagner, et al., 1994). The HELP amplification technique is limited to 20 cycles to minimize some of these potential sources of bias. However, fragment length bias (Carvalho, et al., 2007; Nannya, et al., 2005) is inherent to any multi-template PCR reaction, as the efficiency of amplification of each component template is affected by the length of that template. In this report, we provide further evidence that this phenomenon may be responsible for substantial differences in PCR amplification efficiency, sometimes of the order of the biological differences we seek to measure, complicating our ability to compare results within and between different arrays.

We solve the amplicon length heterogeneity problem with a novel quantile normalization method that we have developed as part of a modular pipeline of analytical tools. We assess the performance of this pipeline with extensive bisulphite pyrosequencing validation studies. Designed for the HELP assay specifically, these tools can also be applied to other techniques that use data from PCR amplification to create genomic representations. The functions in this pipeline (Supp. Fig. 1) are publicly available and written for the R Statistical Package (R Development Core Team, 2005) to allow adoption and testing by other investigators.

2 SYSTEM AND METHODS

2.1 Samples

For the studies described here, samples were obtained from C57Bl/6J mice (Charles River Laboratories). Highly-enriched spermatogenic cell populations were isolated from 25 mice of ages 6–8 weeks using the Sta-Put method based on sedimentation velocity and unit gravity (Bellve, et al., 1977; Romrell, et al., 1976). Purity of germ cells (≥90%) was established on the basis of cellular morphology using light microscopy. Whole brain samples were obtained from male mice at age 8 weeks. Additional samples were obtained from Sprague Dawley rat liver and cultured Toxoplasma gondii RH strain tachyzoites. Genomic DNA was purified from all cell or tissue types and HELP samples were prepared as we have described previously (Khulan, et al., 2006).

2.2 The HELP Assay

To perform a HELP experiment, high molecular weight genomic DNA is isolated, digested to completion by HpaII and by MspI separately, and then ligated to an oligonucleotide adapter pair complementary to the cohesive ends generated. The linkers then serve as a priming site for a ligation-mediated PCR reaction that we have described to generate a product ranging primarily in the 200–2,000 bp size range (Khulan, et al., 2006). To avoid the PCR biases described above, we use a universal primer, a relatively small number of amplification cycles (20), a substantial quantity of initial DNA template (0.1 µg), and a pooled mixture of (3 or more) “hot start” PCR (Chou, et al., 1992) replicates.

Following PCR, the HpaII and MspI representations are labeled with different fluorophores using random priming and are then cohybridized on a customized genomic microarray representing the HpaII/MspI fragments of 200–2,000 bp in unique sequence.

2.3 Array Designs and Data Import

For each HELP experiment, Cy3 and Cy5 signal intensities measure the relative abundances for each HTF of MspI and HpaII representations, respectively. In addition to gridding and other technical controls supplied by Roche NimbleGen, the microarrays also report thousands of random probes (50-mers of random nucleotides) which serve as a metric of non-specific annealing and background fluorescence. By design, all probes are randomly distributed across each microarray.

Signal intensity data for every spot on the array is read from flat files (read.pairs ()) and linked to its corresponding probe identifier. Roche NimbleGen-formatted design files are then used to link probe identifiers to their corresponding HTF, and provide genomic position and probe sequence information (read.design ()). From these probe sequences we calculate (G+C) content as %GC (calc.gc ()) and theoretical melting temperatures (calc.tm ()) using the nearest-neighbor approach (Allawi and SantaLucia, 1997) with the unified thermodynamic parameters (SantaLucia, 1998).
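As an illustration of the simpler of the two probe-sequence calculations, the following is a minimal Python sketch of a %GC computation. It is not the pipeline's R function calc.gc (); the name `gc_percent` is hypothetical, and the nearest-neighbor melting temperature calculation is deliberately left out because it depends on the published thermodynamic parameter tables.

```python
def gc_percent(sequence):
    """Return the (G+C) content of a probe sequence as a percentage."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty probe sequence")
    # Count only G and C; any other character (A, T, N, ...) is non-GC.
    gc_count = sum(1 for base in seq if base in "GC")
    return 100.0 * gc_count / len(seq)


if __name__ == "__main__":
    # 4 of the 8 bases are G or C.
    print(gc_percent("CCGGATAT"))
```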

2.4 Inter- and Intra-Microarray Quality Assessment

Before we can consider biological variability, we have to address the issues of array performance and quality control (QC). We screen for spatial artifacts by comparing the average ratios of signal intensities as a function of position on the array. We divide the array into sectors (default is 25) and take summary measures of probes located within each sector, then compare the distributions between sectors. High quality hybridizations yield a relatively uniform distribution of ratios across all sectors (Supp. Fig. 2A), whereas samples such as the one shown in Supp. Fig. 2B demonstrate poor hybridizations that require repeating.
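The sector comparison described above might be sketched as follows in Python. This is an illustration under stated assumptions, not the authors' implementation: the 25 sectors are assumed to form a 5 x 5 grid, the summary measure is assumed to be the median log-ratio, and the names `sector_medians` and `sector_spread` are hypothetical.

```python
from statistics import median


def sector_medians(probes, n_rows, n_cols, grid=5):
    """Summarize log-ratios by array sector.

    probes: iterable of (row, col, log_ratio) with 0-based positions.
    Returns a dict mapping (sector_row, sector_col) to the median log-ratio.
    Assumption: sectors form a grid x grid layout (5 x 5 gives 25 sectors).
    """
    sectors = {}
    for row, col, value in probes:
        r = min(row * grid // n_rows, grid - 1)
        c = min(col * grid // n_cols, grid - 1)
        sectors.setdefault((r, c), []).append(value)
    return {key: median(values) for key, values in sectors.items()}


def sector_spread(medians):
    """Range of the sector medians; large values suggest a spatial artifact."""
    return max(medians.values()) - min(medians.values())
```

A uniform hybridization should give a spread near zero, whereas a localized smudge raises the median of the affected sector and therefore the spread.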

Consideration of probe signal in the context of its performance across multiple arrays improves our ability to discriminate finer deviations in performance. We therefore perform additional quality assessment in a manner analogous to our previous description (Reimers and Weinstein, 2005). We first define a prototypical signal intensity and ratio profile by mean-centering each data set, followed by calculation of a summary measure for each probe as the (20%-) trimmed mean of its values across the arrays (calc.prototype ()). We then subtract the log-intensity or log-ratio prototype from the corresponding value for each oligonucleotide on each microarray to obtain an array-specific data set that compares its signal intensities with those for the population of signals on all of the microarrays, plotting each set of values relative to the technical variables under consideration. For instance, regional heterogeneities of signal intensities on the microarrays can be color-coded in terms of deviations from prototype (plot.chip.image ()), illustrating their positions on a depiction of the microarray (Fig. 1A,B,C,F,G,H). We also study intensity-dependent biases (Yang, et al., 2002) (see Fig. 1D,E,I,J; plot.HELP.qc ()), which may reflect sources of technical variability such as labeling efficiency. The influence of other variables can also be studied (e.g. probe melting temperature; Supp. Fig. 3).
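The prototype calculation can be sketched in Python as below. This is a hedged illustration rather than the calc.prototype () source: it assumes that the 20% trim is applied to each tail (as in R's `mean(x, trim = 0.2)`), and the function names are hypothetical.

```python
from statistics import mean


def trimmed_mean(values, trim=0.2):
    """Mean after dropping the `trim` fraction of values from each tail."""
    ordered = sorted(values)
    k = int(trim * len(ordered))
    kept = ordered[k:len(ordered) - k] if len(ordered) > 2 * k else ordered
    return mean(kept)


def deviations_from_prototype(arrays, trim=0.2):
    """arrays: one list of log-intensities (or log-ratios) per microarray,
    all in the same probe order.

    Returns (prototype, deviations), where deviations[i][j] is the value of
    probe j on array i after mean-centering, minus the prototype for probe j.
    """
    centered = [[v - mean(a) for v in a] for a in arrays]
    prototype = [trimmed_mean(column, trim) for column in zip(*centered)]
    deviations = [[v - p for v, p in zip(a, prototype)] for a in centered]
    return prototype, deviations
```

Because each array is mean-centered first, a global offset in brightness does not register as a deviation; only probe-specific departures from the multi-array profile do.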

Fig. 1 — Quality assessment shows improvement of poor array data following rehybridization on a fresh array. Spatial artifacts for a poor-quality hybridization are shown as the difference of MspI (A) and HpaII (B) signal intensities as well as HpaII/MspI ratios (C) from the probe-by-probe averages across all arrays in the dataset. Green indicates signal or ratio data that is less than the multi-array average while red indicates signal or ratios that exceed the average. Panels (D) and (E) show the MspI and HpaII signal intensities on the y-axis, respectively, versus their multi-array averages shown along the x-axis. The yellow lines on these panels represent lowess-smoothing and highlight the non-linearity of the data, consistent with intensity-dependent bias. The lower panels show improved quality for a rehybridization of the same sample on a fresh array.

These studies allow us to identify at an early stage of the analytical phase of an experiment the extremes or outliers within the dataset. Occasionally there are microarrays that demonstrate varying degrees of spatial artifacts, of which we show a representative sample in Fig. 1A,B,C. This particular microarray also manifested dramatic intensity-dependent biases that further mark it as an outlier in the dataset (Fig. 1D,E). Rehybridization of these samples on a new microarray removed the regional artifacts observed in the original hybridization (Fig. 1F,G,H), resulting in significant reduction of the intensity-dependent bias (Fig. 1I,J). Rehybridization also improved correlation of this sample with its other replicates in the dataset (R values ~0.94, up from ~0.84).

2.5 Size-Dependent Intensities and Definition of Background

When we visualize signal intensities as a function of fragment size (Fig. 2; plot.HELP.svi ()), we observe several characteristics of the representations. Signal intensity clearly depends on fragment length, with maximal intensities for both MspI and HpaII representations around 500 bp and weaker signals at the extremes (Fig. 2A,B). We also note that MspI-derived representations show amplification of all HTFs (Fig. 2A) whereas the HpaII-derived distribution shows a second population of oligonucleotides with low signal intensities across all fragment sizes represented (Fig. 2B). This second population corresponds to DNA sequences that did not digest or amplify because of methylation at the flanking HpaII sites. As we have shown previously (Khulan, et al., 2006), the HpaII/MspI ratios have a bimodal distribution, the lower ratios representing methylated HTFs while the higher ratios represent HTFs that are relatively hypomethylated (Fig. 2C).

For each HELP experiment, the level of background signal intensity (“noise”) is measured by thousands of random probes (50-mers of random nucleotides). These probes measure non-specific annealing and background fluorescence and enable definition of “failed” probes, those for which the levels of MspI and HpaII signal intensities are indistinguishable from the background intensities defined by the random probes (Fig. 2, yellow data points). In these cases, failed probes represent the population of fragments that do not amplify by PCR, whatever the biological or experimental cause (e.g. genomic deletions). We remove these probes from further consideration, typically affecting 10–20% of the probes and a smaller fraction of HTFs. If the probe only “fails” in the HpaII and not the MspI channel, the cause is likely to be due to methylation of that locus, and we maintain these data through subsequent analyses.
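The logic for flagging probes against the random-probe background might look like the following Python sketch. The text does not state the exact background cutoff used for failed probes; the 99% quantile used here is an assumption, and the function names and labels are hypothetical.

```python
def quantile_cutoff(values, q=0.99):
    """Return the value at (approximately) the q-th quantile of `values`.

    Assumption: a simple order-statistic cutoff taken from random probes.
    """
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


def flag_probe(msp, hpa, msp_cutoff, hpa_cutoff):
    """Classify one probe relative to random-probe background cutoffs."""
    if msp <= msp_cutoff and hpa <= hpa_cutoff:
        # Neither channel is distinguishable from background: drop the probe.
        return "failed"
    if hpa <= hpa_cutoff:
        # MspI amplified but HpaII did not: likely methylation, keep the probe.
        return "hpa_below_background"
    return "amplified"
```

In use, `msp_cutoff` and `hpa_cutoff` would each be computed with `quantile_cutoff` from the random-probe signals in the corresponding channel of the same array.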

The HELP assay also makes use of mitochondrial probes as a high copy number, hypomethylated control (Supp. Fig. 4, red data points). Mitochondrial DNA has been observed to be unmethylated in germline (Hecht, et al., 1984), somatic (Pollack, et al., 1984), and cancerous cells (Maekawa, et al., 2004). As such, the mitochondrial loci serve as a highly reliable control, the failure of which is a robust indicator of major problems with the assay.

2.6 Quantile Normalization

Despite our measures to avoid PCR bias during the amplification process, we continue to see a size-dependence of signal intensities as described above. The distribution of MspI intensities in Fig. 3A clearly demonstrates this size bias. The HTFs around 500–600 bp tend to generate the highest MspI signals in the intensity distribution, whereas HTFs at the tails of the size distribution have lower signal intensities (Fig. 3A,B). In the mouse spermatogenic sample data depicted (see also Fig. 2), the individual biases in the MspI (Fig. 3A) and HpaII (Fig. 3B) intensity distributions do not compensate for each other and there remains an overt fragment size bias in the distribution of HpaII/MspI ratios (Fig. 3C).

Fig. 3 — Fragment-size effect in pre- and post-normalized data from a normal mouse spermatogenic sample. The data is divided into 58 step-wise bins each containing a comparable number of HTFs. The color key at the upper left illustrates the partitioning scheme for pre-normalized data where each colored block corresponds to a bin of certain HTF sizes; color varies from blue to red with increasing fragment size (from 200–2,000 bp). The density of MspI signal intensities for each bin is shown in panel (A), where different color lines represent the density data for HTFs from each corresponding bin. The black line represents the overall density of MspI intensity data. Panel (B) shows the same representation for HpaII signals and panel (C) shows the same representation for HpaII/MspI ratios. The color key to the lower left shows the analogous partitioning scheme for normalized data after failed probes have been identified and removed. Normalized MspI, HpaII, and HpaII/MspI ratio data are shown (without failed probes) in panels (D), (E), and (F), respectively.

Previous reports have also shown fragment length bias in PCR and, further, that reduction of this bias with linear modeling improves data interpretation (Carvalho, et al., 2007; Nannya, et al., 2005). We address the fragment length problem using a quantile normalization approach (quantile.normalize ()). The goal of this approach was to normalize signal intensities across all fragment lengths, improving within and between-array comparisons. This normalization corrects for the fragment size-dependency of the MspI, HpaII, and HpaII/MspI ratio distributions (Fig. 3D,E,F). The approach is similar to inter-array quantile normalization methods (e.g. RMA; Irizarry, et al., 2003); however, in this particular case we align the quantiles across density-dependent sliding windows of size-sorted data.

All HTFs that are considered amplifiable (i.e. those not classified as failed by the above criterion) are sorted according to increasing fragment size. The resultant data are then divided into multiple bins (b) and steps (s), resulting in a total number (n, where n = s(b − 1) + 1) of sliding windows (w = {1, …, n}). Minimum and maximum fragment size boundaries for each window are calculated as the (w − 1)/n and w/n quantiles, respectively. The data are then split according to these boundaries into each window (overlapping windows are each assigned a copy of the overlapping data). MspI signal intensity quantiles are calculated for each window and are then averaged across all n windows to produce an average quantile (Q). In order to track overlapping data, each probe is assigned intensity-sorted position(s) (p) within the window(s) in which it is included. These positions then determine which values from Q to assign to each probe in a given window. Because a probe may appear at different points within the quantile distribution for two or more overlapping windows (e.g., p1 ≠ p2), final quantile-normalized values are calculated for each probe as the average of these values: mean(Q_p1, Q_p2, …).
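The procedure above can be approximated with the following Python sketch. It is a simplified illustration, not the quantile.normalize () source. The assumptions are: windows are defined by probe rank in the size-sorted data, each spanning 1/b of the probes and advancing by 1/(b·s), which yields n = s(b − 1) + 1 windows; window quantiles are interpolated onto a common grid so that windows of slightly different sizes can be averaged; and all function names are hypothetical.

```python
from statistics import mean


def window_bounds(n_items, bins, steps):
    """(start, end) index pairs for overlapping windows over sorted data."""
    n_windows = steps * (bins - 1) + 1
    denom = bins * steps
    # Integer arithmetic keeps the last window ending exactly at n_items.
    return [(w * n_items // denom, (w + steps) * n_items // denom)
            for w in range(n_windows)]


def interp_quantile(sorted_vals, q):
    """Linearly interpolated quantile q (0..1) of an ascending list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def sliding_quantile_normalize(sizes, intensities, bins=4, steps=2, grid=101):
    """Align intensity quantiles across overlapping fragment-size windows."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    windows = [order[a:b] for a, b in window_bounds(len(order), bins, steps)]

    # Average quantile profile Q across all windows.
    qs = [k / (grid - 1) for k in range(grid)]
    profiles = []
    for members in windows:
        vals = sorted(intensities[i] for i in members)
        profiles.append([interp_quantile(vals, q) for q in qs])
    avg_profile = [mean(column) for column in zip(*profiles)]

    # Map each probe's within-window rank onto Q, once per window it is in.
    assigned = [[] for _ in sizes]
    for members in windows:
        ranked = sorted(members, key=lambda i: intensities[i])
        denom = max(len(ranked) - 1, 1)
        for position, i in enumerate(ranked):
            assigned[i].append(interp_quantile(avg_profile, position / denom))

    # Probes in overlapping windows receive the mean of their assigned values.
    return [mean(values) for values in assigned]
```

The number of probes must be at least `bins` so that every window is non-empty. Overlapping the windows smooths the transition between neighboring size ranges, which a set of disjoint bins would not do.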

The same calculations are then applied to the HpaII signals, again with failed probes (defined by MspI intensities) removed. However, for HpaII data the (methylated) probes that fall within the random signal distribution (≤ 99% quantile) are normalized separately from those that exceed random probe signals (> 99% quantile). This piecewise normalization is performed to separate amplifying (hypomethylated) probes from their unamplifying (methylated) counterparts in order to preserve the potentially variable distributions of methylation across different fragment sizes. This may be particularly relevant for the treatment of CG-dense regions, which tend to occur at shorter fragment lengths (Supp. Fig. 5) and which may exhibit different distributions of methylation. Quantile-normalized HpaII/MspI log-ratios are calculated as the difference between normalized HpaII and MspI signals, and are then centered to the average difference of random probe signal intensities (HpaII − MspI) to adjust for global differences in signal strength.
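The final centering step can be written compactly. The Python sketch below assumes that all inputs are already log-transformed signals and that the random-probe lists are paired by probe; the function name is hypothetical.

```python
from statistics import mean


def centered_log_ratios(hpa, msp, random_hpa, random_msp):
    """HpaII/MspI log-ratios centered on the random-probe channel offset.

    hpa, msp: normalized log signals for the same probes, in the same order.
    random_hpa, random_msp: log signals of the random (background) probes.
    """
    offset = mean([h - m for h, m in zip(random_hpa, random_msp)])
    return [h - m - offset for h, m in zip(hpa, msp)]
```

Subtracting the random-probe offset means that a log-ratio of zero corresponds to equal HpaII and MspI signal once the global difference between channels is accounted for.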

A useful experiment involved the HELP analysis of Toxoplasma gondii, which we have found to lack methylation in its genome (Gissot, et al., 2008). This allowed us to consider size bias in the absence of methylation and therefore due to technical sources alone (e.g. PCR). We show that the HpaII and MspI distributions in this case exhibit different size-dependencies of the signal strength, causing a size-dependent artifact in the HpaII/MspI ratio (Supp. Fig. 6A,C,E). Quantile normalization corrects this artifact (Supp. Fig. 6B,D,F) and improves inter-sample HpaII, MspI, and HpaII/MspI ratio correlations in each of four replicate assays (by an average of 7.8%).

We tested whether normalization improves analysis of data from methylating genomes, finding that it preserves HpaII/MspI ratio correlation for technical replicates (R values differ in pre- and post-normalized data by an average of 0.1–0.2%) while enhancing the differences between tissues (R values comparing brain and sperm are reduced by an average of 2% following normalization).

2.7 Data Summarization

The methylation status of each HpaII fragment is typically measured by a set of probes (up to 10, depending on the array design). Failed probes are removed from consideration as described previously; however, the remaining probes must be considered together, necessitating a summarization approach (combine.data ()). Currently, we employ a 20%-trimmed mean, weighted by MspI signal intensities as follows: for a given HTF, weights for each probe are assigned between 1 (for the lowest MspI signal intensity) and the magnitude of the range of signal intensities (maximal weight is given to probes with the best performance in the MspI channel). A weight of zero is assigned to the 20% most deviant probes per HTF. This enables us to make slight adjustments for the performance of a given probe and also enables us to take a more robust summary measure of the data (by removal of outliers).
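One possible reading of this summarization is sketched below in Python. Several details are assumptions rather than statements from the text: "most deviant" is taken to mean farthest from the median log-ratio of the HTF, the weight is taken as 1 plus the distance of a probe's MspI signal above the lowest MspI signal in the set, and the function name is hypothetical. It should not be read as the combine.data () source.

```python
from statistics import median


def summarize_fragment(log_ratios, msp_signals, trim=0.2):
    """Weighted, trimmed summary of the probe log-ratios for one HTF."""
    n = len(log_ratios)
    center = median(log_ratios)
    # Number of probes given zero weight (assumed: floor of trim * n).
    n_drop = int(trim * n + 1e-9)
    by_deviation = sorted(range(n),
                          key=lambda i: abs(log_ratios[i] - center),
                          reverse=True)
    dropped = set(by_deviation[:n_drop])
    lowest = min(msp_signals)
    # Weight 1 for the weakest MspI probe, rising with MspI signal strength.
    weights = [0.0 if i in dropped else 1.0 + (msp_signals[i] - lowest)
               for i in range(n)]
    return sum(w * r for w, r in zip(weights, log_ratios)) / sum(weights)
```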

2.8 Categorization

The HELP assay generates a bimodal distribution of HpaII intensities and of HpaII/MspI ratio values as a consequence (Khulan, et al., 2006) (Fig. 2B,C). We explored whether this allowed us to categorize loci as methylated or hypomethylated (categorize.HELP ()), finding agreement with methylation levels detected using validation studies. We identified loci as methylated wherever HpaII signals fell below random noise thresholds and the corresponding MspI data were above background noise. A high-confidence hypomethylated population was defined by HpaII signals above background with a corresponding positive HpaII/MspI logratio. Some values, however, did not group clearly into either the methylated or hypomethylated categories and were therefore considered to have “indeterminate” methylation status. These categorizations are consistent with the bisulphite pyrosequencing data we generated (Supp Table 1), which group into two distinct populations: methylated and hypomethylated.
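The three-way categorization rule can be expressed as a short Python function. This is a sketch of the rule as described in the text, not the categorize.HELP () source; the noise thresholds are assumed to come from the random probes in each channel, and the argument names are hypothetical.

```python
def categorize(hpa, msp, log_ratio, hpa_noise, msp_noise):
    """Assign a methylation category to one locus.

    hpa, msp: HpaII and MspI signals for the locus.
    log_ratio: centered HpaII/MspI log-ratio for the locus.
    hpa_noise, msp_noise: background thresholds from random probes.
    """
    if hpa <= hpa_noise and msp > msp_noise:
        # MspI amplified but HpaII is indistinguishable from background.
        return "methylated"
    if hpa > hpa_noise and log_ratio > 0:
        # HpaII clearly amplified, with signal at least matching MspI.
        return "hypomethylated"
    return "indeterminate"
```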

2.9 Data Interpretation

We analyze sample-to-sample relationships, including both similarities and differences, at both the global and local levels. We determine the global pairwise (Pearson) correlations between all combinations of samples and show a representative pair plot for multiple technical replicates of two tissue types that we have previously shown to have distinctive methylation profiles (Khulan, et al., 2006), brain and sperm (Fig. 4, Supp. Fig. 7). While pairwise comparison is a pre-existing program written in R (pairs ()), we combine this analysis with an unsupervised clustering using Ward’s minimum variance method and a Euclidean distance matrix (Fig. 4). The union of both components (plot.pairs ()) enables a novel visualization and interpretation of the relatedness of different samples to each other. The representative figure shows that replicates of both brain and sperm are similar to each other (R ~ 0.9) while comparison of one tissue with the other shows dramatic global differences (R ~ 0.4). In addition, the data show that rehybridization of a poor technical replicate improves correlation among spermatogenic cells (Fig. 4, “Sperm1re”).
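The two quantities that feed this combined display, pairwise Pearson correlations and inter-array Euclidean distances, can be computed as in the Python sketch below. The Ward clustering step itself is omitted, and the function names are hypothetical rather than part of the pipeline.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def euclidean(x, y):
    """Euclidean distance between two vectors of log-ratios."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise(samples, measure):
    """Apply `measure` to every ordered pair of named samples.

    samples: dict mapping sample name to its list of log-ratios.
    """
    return {(a, b): measure(samples[a], samples[b])
            for a in samples for b in samples}
```

The distance matrix from `pairwise(samples, euclidean)` is the kind of input a hierarchical clustering routine would consume, while `pairwise(samples, pearson)` gives the R values shown in the upper panels of a pair plot.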

Fig. 4 — Union of two comparative approaches: unsupervised clustering and global pairwise (Pearson) correlations of normalized ratios from mouse brain and spermatogenic samples. Three brain and four sperm samples were compared using Ward’s minimum variance clustering, with inter-array distances calculated as the Euclidean distance between ratios. The resulting tree is shown in the lower left portion of the figure, where the branching order is shown in solid lines, colored by group (blue indicates brain samples, and red indicates sperm samples). The diagonal dotted lines are numbered and indicate the Euclidean distance scale. The dotted red line indicates the Euclidean distance cutoff used to separate the individual groups of samples; this cutoff is calculated automatically using the `cutree ()` function. Pairwise correlations are shown in the upper right portion of the figure, where R values indicate the Pearson correlation for each pair and blue dotplots show a visual representation of the differences between samples.

Additionally, we explore HELP data at the local level, by chromosomal position. We generate BED-formatted tracks of the data for visualization with the UCSC Genome Browser (Kent, et al., 2002). In Supp. Fig. 8 we show the H19/Igf2 imprinted domain on mouse chromosome 7 and demonstrate tissue-specific differences in methylation at the differentially-methylated CTCF-binding site upstream from H19 (Supp. Fig. 8, starred) (Bell and Felsenfeld, 2000). The observed changes (methylation in sperm, hypomethylation in brain) are consistent with the previous finding that the H19 locus is methylated exclusively on the paternal chromosome but hypomethylated in somatic cells due to the maternal copy of the locus (Ferguson-Smith, et al., 1993).

2.10 Validation of Analytical Approach

The best means of testing a novel analytical approach is in terms of performance with reference to a validation dataset. For cytosine methylation, this validation is provided by quantitative analyses of methylation at the HpaII sites generating the HELP signals, using bisulphite conversion of DNA (Kerjean, et al., 2001) and either pyrosequencing (Biotage) or MassArray (Sequenom) to measure C/T ratios in the population of molecules (Ehrich, et al., 2005; Fakhrai-Rad, et al., 2002). A dataset was prepared using bisulphite pyrosequencing and MassArray on both brain and sperm samples. We investigated 11 loci with varying degrees of methylation as identified by HELP. Three loci (Tyr, Ube3a, Kcnq1) were hypomethylated in all samples, while three were hypomethylated in brain but methylated in sperm (H19/Igf2, Hccs, Ube3a) and five showed the opposite pattern (Figla, Th-Ins2, Fthl17, Xmr, Ott). Cytosine methylation was quantified using bisulphite pyrosequencing at both HpaII sites flanking each of 10 loci. Two loci (Ube3a and Kcnq1) were not amenable to pyrosequencing and were therefore analyzed by MassArray. All pyrosequencing and MassArray data were summarized as single values representing the maximum level of methylation detected for each HTF (Supp Table 1).

The corresponding HELP data for each of these loci were averaged across replicates and the results were compared to the validation data (Supp Table 1). This was applied to both pre- and post-normalized values. We show that HELP reliably discriminates between two groups of loci, methylated and hypomethylated, for the vast majority of loci (Fig. 5). While raw HpaII/MspI ratios are unable to achieve complete concordance with the validation results (Fig. 5A), normalization of HELP data improves the ability of the assay to discriminate between methylation and hypomethylation and achieves complete concordance with the validation dataset (Fig. 5B). Normalization also improves the correlation of HELP results with those from the validation dataset (R values increased 0.5% for sperm and 3% for brain).

Fig. 5 — Quantitative validation of HELP microarray data demonstrates improvement of accuracy through normalization. Twenty-seven raw (A) and normalized (B) HpaII/MspI ratios (from HELP data) are plotted against the bisulphite validation data (methylation percent) for each locus. Small gray circles indicate spermatogenic cell samples while black triangles indicate brain samples. The dotted curves in each panel represent the density of HpaII/MspI ratios from all experiments with the x-axis drawn to scale and the y-axis indicating relative density values; both raw and normalized data exhibit a clear bimodal distribution. The dashed vertical line in panel (B) is a cutoff that enables discrete classification of methylated and hypomethylated loci; such discrete classification cannot be performed for the raw data in panel (A), demonstrating the value of the normalization.

3 DISCUSSION

We describe a series of functions that are assembled as a pipeline for the analysis of HELP data. We demonstrate that these functions improve insight into data quality, normalize for potentially misleading technical influences and improve performance when tested against a large validation set. While PCR amplification of genomic representations is used effectively for a number of applications including HELP, it is critical that technical variability does not influence the interpretation of biology (e.g. cytosine methylation status). Fragment length normalization allows intra- and inter-array comparisons to be made in a more robust manner, allowing biological variability to be tested independently of fragment length, as shown in Supp Table 1. The normalization preserves the relative proportion of fragments in the methylated and hypomethylated categories for different fragment sizes, even in DNA samples with markedly different overall methylation (Supp. Fig. 9), and maintains these distributions in different genomic compartments (e.g. inter- and intragenic sequences, and CG dinucleotide-dense CG clusters; Supp. Figs. 10, 11).

Our prior report of the HELP assay described the measurement of cytosine methylation solely in terms of HpaII/MspI ratios, with categorization of methylation status defined by the use of mixture models to separate the bimodal distributions of methylated from hypomethylated loci. We were prompted by data from cancer specimens (not shown) to explore alternative means of analysis and categorization, as the proportion of methylated loci in some of these specimens became so small that mixture models were insensitive to the presence of the methylated population. The new approaches described here improve accuracy over raw ratio values when measured against the gold standard of bisulphite pyrosequencing data, although it should be noted that a given ratio may not discriminate a locus that is partially versus fully unmethylated (for example the imprinted loci in the brain samples; Fig. 5, Supp Table 1).

In summary, we show how HELP microarray data can be more accurately interpreted to measure cytosine methylation states in the genome. We also note the potential application of this approach to a number of other representational techniques whenever PCR is used for their generation. With the increasing study of epigenetic influences in general and cytosine methylation in particular, the value of careful analytical techniques to complement high-throughput molecular assays is clearly of importance.

4 IMPLEMENTATION

This analytical pipeline is implemented in the R Statistical Package (R Development Core Team, 2005). The pipeline has been tested on the Mac platform (OS X 10.4.10) using R version 2.5.1 with grDevices, stats, utils, and graphics packages installed and enabled. R source code is publicly available online at http://greallylab.aecom.yu.edu/~greally/HELP_pipeline/.

Supplementary Material

Supplementary Data

NIHMS816444-supplement-Supplementary_Data.doc (14.6 MB, doc)

Acknowledgments

FUNDING

This work is supported by a grant from the National Institutes of Health (NIH) to JMG (R01 HD044078). RFT is supported by NIH MSTP Training Grant GM007288. KK is supported by NIH NIAID RO1 AI060496, the Albert Einstein College of Medicine Biodefense Proteomics Research Center (NIH NIAID contract HSN26620-0400054C), and a pilot grant from the AECOM CFAR (NIH NIAID 5 P30 AI051519). MG is supported by a Philippe Foundation fellowship.

The authors acknowledge the contribution of the Genomics Core Facility at the Albert Einstein College of Medicine, resources from the Albert Einstein Cancer Center, and Anton Svetlanov and Paula Cohen (Cornell University) for the spermatogenic cell samples.

REFERENCES

Allawi HT, SantaLucia J., Jr Thermodynamics and NMR of internal G.T mismatches in DNA. Biochemistry. 1997;36:10581–10594. doi: 10.1021/bi962590c.
Bell AC, Felsenfeld G. Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene. Nature. 2000;405:482–485. doi: 10.1038/35013100.
Bellve AR, Cavicchia JC, Millette CF, O'Brien DA, Bhatnagar YM, Dym M. Spermatogenic cells of the prepubertal mouse. Isolation and morphological characterization. J Cell Biol. 1977;74:68–85. doi: 10.1083/jcb.74.1.68.
Bird AP. CpG-rich islands and the function of DNA methylation. Nature. 1986;321:209–213. doi: 10.1038/321209a0.
Carvalho B, Bengtsson H, Speed TP, Irizarry RA. Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics. 2007;8:485–499. doi: 10.1093/biostatistics/kxl042.
Chou Q, Russell M, Birch DE, Raymond J, Bloch W. Prevention of pre-PCR mis-priming and primer dimerization improves low-copy-number amplifications. Nucleic Acids Res. 1992;20:1717–1723. doi: 10.1093/nar/20.7.1717.
Ehrich M, Nelson MR, Stanssens P, Zabeau M, Liloglou T, Xinarianos G, Cantor CR, Field JK, van den Boom D. Quantitative high-throughput analysis of DNA methylation patterns by base-specific cleavage and mass spectrometry. Proc Natl Acad Sci U S A. 2005;102:15785–15790. doi: 10.1073/pnas.0507816102.
Fakhrai-Rad H, Pourmand N, Ronaghi M. Pyrosequencing: an accurate detection platform for single nucleotide polymorphisms. Hum Mutat. 2002;19:479–485. doi: 10.1002/humu.10078.
Ferguson-Smith AC, Sasaki H, Cattanach BM, Surani MA. Parental-origin-specific epigenetic modification of the mouse H19 gene. Nature. 1993;362:751–755. doi: 10.1038/362751a0.
Gissot M, Choi SW, Thompson RF, Greally JM, Kim K. Toxoplasma gondii and Cryptosporidium parvum lack detectable DNA cytosine methylation. Eukaryot Cell. 2008 doi: 10.1128/EC.00448-07. In press.
Hatada I, Fukasawa M, Kimura M, Morita S, Yamada K, Yoshikawa T, Yamanaka S, Endo C, Sakurada A, Sato M, Kondo T, Horii A, Ushijima T, Sasaki H. Genome-wide profiling of promoter methylation in human. Oncogene. 2006;25:3059–3064. doi: 10.1038/sj.onc.1209331.
Hecht NB, Liem H, Kleene KC, Distel RJ, Ho SM. Maternal inheritance of the mouse mitochondrial genome is not mediated by a loss or gross alteration of the paternal mitochondrial DNA or by methylation of the oocyte mitochondrial DNA. Dev Biol. 1984;102:452–461. doi: 10.1016/0012-1606(84)90210-0.
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–264. doi: 10.1093/biostatistics/4.2.249.
Jones PA, Baylin SB. The epigenomics of cancer. Cell. 2007;128:683–692. doi: 10.1016/j.cell.2007.01.029.
Kennedy GC, Matsuzaki H, Dong S, Liu WM, Huang J, Liu G, Su X, Cao M, Chen W, Zhang J, Liu W, Yang G, Di X, Ryder T, He Z, Surti U, Phillips MS, Boyce-Jacino MT, Fodor SP, Jones KW. Large-scale genotyping of complex DNA. Nat Biotechnol. 2003;21:1233–1237. doi: 10.1038/nbt869.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002;12:996–1006. doi: 10.1101/gr.229102.
Kerjean A, Vieillefond A, Thiounn N, Sibony M, Jeanpierre M, Jouannet P. Bisulfite genomic sequencing of microdissected cells. Nucleic Acids Res. 2001;29:E106–E106. doi: 10.1093/nar/29.21.e106.
Khulan B, Thompson RF, Ye K, Fazzari MJ, Suzuki M, Stasiek E, Figueroa ME, Glass JL, Chen Q, Montagna C, Hatchwell E, Selzer RR, Richmond TA, Green RD, Melnick A, Greally JM. Comparative isoschizomer profiling of cytosine methylation: The HELP assay. Genome Res. 2006;16:1046–1055. doi: 10.1101/gr.5273806.
Lisitsyn N, Wigler M. Cloning the differences between two complex genomes. Science. 1993;259:946–951. doi: 10.1126/science.8438152.
Lucito R, Healy J, Alexander J, Reiner A, Esposito D, Chi M, Rodgers L, Brady A, Sebat J, Troge J, West JA, Rostan S, Nguyen KC, Powers S, Ye KQ, Olshen A, Venkatraman E, Norton L, Wigler M. Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation. Genome Res. 2003;13:2291–2305. doi: 10.1101/gr.1349003.
Maekawa M, Taniguchi T, Higashi H, Sugimura H, Sugano K, Kanno T. Methylation of mitochondrial DNA is not a useful marker for cancer detection. Clin Chem. 2004;50:1480–1481. doi: 10.1373/clinchem.2004.035139.
Mathieu-Daude F, Welsh J, Vogt T, McClelland M. DNA rehybridization during PCR: the 'Cot effect' and its consequences. Nucleic Acids Res. 1996;24:2080–2086. doi: 10.1093/nar/24.11.2080.
Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S. A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res. 2005;65:6071–6079. doi: 10.1158/0008-5472.CAN-05-0465.
Pollack Y, Kasir J, Shemer R, Metzger S, Szyf M. Methylation pattern of mouse mitochondrial DNA. Nucleic Acids Res. 1984;12:4811–4824. doi: 10.1093/nar/12.12.4811.
Polz MF, Cavanaugh CM. Bias in template-to-product ratios in multitemplate PCR. Appl Environ Microbiol. 1998;64:3724–3730. doi: 10.1128/aem.64.10.3724-3730.1998.
R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2005.
Reimers M, Weinstein JN. Quality assessment of microarrays: visualization of spatial artifacts and quantitation of regional biases. BMC Bioinformatics. 2005;6:166. doi: 10.1186/1471-2105-6-166.
Romrell LJ, Bellve AR, Fawcett DW. Separation of mouse spermatogenic cells by sedimentation velocity. A morphological characterization. Dev Biol. 1976;49:119–131. doi: 10.1016/0012-1606(76)90262-1.
SantaLucia J., Jr A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci U S A. 1998;95:1460–1465. doi: 10.1073/pnas.95.4.1460.
Suzuki MT, Giovannoni SJ. Bias caused by template annealing in the amplification of mixtures of 16S rRNA genes by PCR. Appl Environ Microbiol. 1996;62:625–630. doi: 10.1128/aem.62.2.625-630.1996.
Waalwijk C, Flavell RA. MspI, an isoschizomer of hpaII which cleaves both unmethylated and methylated hpaII sites. Nucleic Acids Res. 1978;5:3231–3236. doi: 10.1093/nar/5.9.3231.
Wagner A, Blackstone N, Cartwright P, Dick M, Misof B, Snow P, Wagner GP, Bartels J, Murtha M, Pendleton J. Surveys of gene families using polymerase chain reaction: PCR selection and PCR drift. Syst Biol. 1994;43:250–261.
Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002;30:e15. doi: 10.1093/nar/30.4.e15.
