Abstract
Many types of epigenetic profiling have been used to classify stem cells, stages of cellular differentiation, and cancer subtypes. Existing methods focus on local chromatin features such as DNA methylation and histone modifications that require extensive analysis for genome-wide coverage. Replication timing has emerged as a highly stable cell type-specific epigenetic feature that is regulated at the megabase-level and is easily and comprehensively analyzed genome-wide. Here, we describe a cell classification method using 67 individual replication profiles from 34 mouse and human cell lines and stem cell-derived tissues, including new data for mesendoderm, definitive endoderm, mesoderm and smooth muscle. Using a Monte-Carlo approach for selecting features of replication profiles conserved in each cell type, we identify “replication timing fingerprints” unique to each cell type and apply a k nearest neighbor approach to predict known and unknown cell types. Our method correctly classifies 67/67 independent replication-timing profiles, including those derived from closely related intermediate stages. We also apply this method to derive fingerprints for pluripotency in human and mouse cells. Interestingly, the mouse pluripotency fingerprint overlaps almost completely with previously identified genomic segments that switch from early to late replication as pluripotency is lost. Thereafter, replication timing and transcription within these regions become difficult to reprogram back to pluripotency, suggesting these regions highlight an epigenetic barrier to reprogramming. In addition, the major histone cluster Hist1 consistently becomes later replicating in committed cell types, and several histone H1 genes in this cluster are downregulated during differentiation, suggesting a possible instrument for the chromatin compaction observed during differentiation. Finally, we demonstrate that unknown samples can be classified independently using site-specific PCR against fingerprint regions. In sum, replication fingerprints provide a comprehensive means for cell characterization and are a promising tool for identifying regions with cell type-specific organization.
Author Summary
While continued advances in stem cell and cancer biology have uncovered a growing list of clinical applications for stem cell technology, errors in indentifying cell lines have undermined a number of recent studies, highlighting a growing need for improvements in cell typing methods for both basic biological and clinical applications of stem cells. Induced pluripotent stem cells (iPSCs)—adult cells reprogrammed to a pluripotent state—show great promise for patient-specific stem cell treatments, but more efficient derivation of iPSCs depends on a more comprehensive understanding of pluripotency. Here, we describe a method to identify sets of regions that replicate at unique times in any given cell type (replication timing fingerprints) using pluripotent stem cells as an example, and show that genes in the pluripotency fingerprint belong to a class previously shown to be resistant to reprogramming in iPSCs, identifying potential new target genes for more efficient iPSC production. We propose that the order in which DNA is replicated (replication timing) provides a novel means for classifying cell types, and can reveal cell type specific features of genome organization.
Introduction
In mammals, replication of the genome occurs in large, coordinately firing regions called replication domains [1]–[7]. These domains are typically one to several megabases, roughly align to genomic features such as isochores, and are closely tied to subnuclear position, with transitions to the nuclear interior often coupled to earlier replication, and transitions to the periphery to later replication [4], [5], [8], [9]. Given their connections to subnuclear position and remarkably strong correlation to chromatin interaction maps [3], replication profiles provide a window into large-scale genome organization changes important for establishing cellular identity. The organization of replication domains is cell-type specific, and a larger number of smaller replication domains is a property of embryonic stem cells (ESCs) [3]–[5]. Importantly, in both humans and mice, induced pluripotent stem cells (iPSCs) reprogrammed from fibroblasts display a timing profile almost indistinguishable from ESCs, suggesting that replication profiles may also be used to measure cellular potency [3], [5].
While a wide-range of cell classification methods are actively used, the most common practice for verifying identity is to monitor a handful of molecular markers, some of which are shared with other cell types. Genome-wide classification of features such as DNA methylation [10]–[12], transcription [13], [14] and histone modifications [15], [16] have in principle more potential to accurately distinguish specific cell types. However, these features of chromatin are highly dynamic at any given genomic site [17], and most measurements require high-resolution arrays and costly antibodies. Moreover, recent reports highlight the unstable nature of transcription and related epigenetic marks in multiple embryonic stem cell lines [18], [19]. By contrast, since replication is regulated at the level of large domains, replication profiles are considerably less complex to generate and interpret than other molecular profiles. Timing changes occurring during differentiation are on the order of several hundred kilobases and are highly reproducible between various stem cell lines [3]–[5]. They are also robust to changes in individual chromatin modifications, retaining their normal developmental pattern in G9a(−/−) cells despite strong upregulation of G9a target genes and near-complete loss of H3K9me2 [8].
Here, we describe a method for classifying cell types—replication fingerprinting—based on genome-wide replication timing patterns in mouse and human ESCs and other cell types. We applied the method to 67 (36 mouse and 31 human) whole-genome replication timing datasets to demonstrate the feasibility of classifying cell types using a minimal set of cell type-specific regions. After identification, these regions were used to classify two independent samples using site-specific PCR. We also demonstrate that loss of pluripotency is accompanied by consistent changes in replication timing, implicating the replication program as an important factor in maintaining pluripotency and revealing a novel fingerprint for pluripotent stem cells.
Results
Generation of replication profiles
In addition to our previously reported replication profiles, BG02 hESCs were differentiated to mesendoderm and definitive endoderm as previously described [20], as well as ISL+ mesoderm and smooth muscle cultured in defined medium (Methods), and profiled for replication. Replication profiles were generated as described previously [3]–[5], [21]. In brief, nascent DNA fractions were collected in early and late S-phase, differentially labeled, and co-hybridized to a whole-genome CGH microarray. The ratio of early and late fraction abundance for each probe—“replication timing ratio”—represents its relative time of replication. Values from individual probes are then smoothed using LOESS (a locally weighted smoothing function), and plotted on log scale (Figure 1). Replication profiles generated in this way are freely available to view or download at www.ReplicationDomain.org [22], and those analyzed in this report are summarized in Table S1.
Generation of replication fingerprints
Figure 1 illustrates the basic concept of replication fingerprinting. Two exemplary profiles each for D3 embryonic stem cells (ESCs; blue) and D3 ESC-derived neural precursor cells (NPCs; green) are overlaid. Given that most of the genome is conserved in replication timing between any two cell types (e.g. 80% conserved between ESCs and NPCs [4]), the first challenge is to choose genomic regions that are differentially replicated within a set of cell types. We define a “replication fingerprint” of a cell type as a set of genomic regions useful for classification, along with their associated replication timing values. For a simplified example, we show exemplary fingerprint regions for a segment of chromosome 7 (Figure 1A, gray bars). Note that the four regions change dramatically upon differentiation to neural precursors (e.g., ESC2 vs. NPC1; Figure 1A,B), but have replication timing values that are well conserved between replicate experiments (e.g., ESC1 vs. ESC2). We and others have observed similarly widespread changes in replication profiles between any two different cell types profiled to date [1], [3]–[5], [7].
As classification methods require a measure of distance between samples, we defined the distance between replication profiles as the sum of absolute differences in replication timing in fingerprinting regions (Figure 1B). To select an optimal set of fingerprinting regions we maximize a “distance ratio,” representing the ratio of the average distance between unlike cell types to the average distance between equivalent cell types (Figure 1C). This ratio is maximized by selecting regions that are consistently different in replication timing between different cell types, but consistently similar between equivalent types. Importantly, the assignment of unlike vs. equivalent cell types is user-defined and flexible, allowing selection of features that best distinguish any group of cells from any other, such as ESCs from NPCs, normal from disease-related cells, or pluripotent from committed cells.
While Figure 1 shows a simplified example of four regions distinguishing ESCs from NPCs, real-world classification requires the ability to make distinctions genome-wide between many cell types, making manual selection of regions impractical. Therefore, to make the method generally applicable, we developed an automated algorithm based on Monte Carlo sampling [23] to select regions that best distinguish between all available cell types in genome-wide replication datasets. Alternative approaches evaluated for feature selection and classification included Bayesian networks, nearest neighbor methods, decision trees and SVMs, which were comparably successful only for smaller collections of cell types. We chose to explicitly maximize distances between cell types in the method described here in anticipation of translating cell classification to more convenient empirical assays with a limited number of features, because larger timing differences are easier to verify empirically and are more robust to experimental and biological variation.
Monte Carlo optimization of fingerprint regions
In practice, replication fingerprinting is a feature selection problem. Although most genome-wide approaches are both simple and comprehensive, we found that genome-wide correlations and distances, while a good first approximation of the relatedness between cell types, are not ideal for classification as the small amount of noise in regions with conserved replication timing is compounded over this relatively large fraction of the genome (Figure S1). We therefore wish to exclude domains that are noisy (having high technical or biological variability), irrelevant (conserved in all cell types), or redundant (containing overlapping information). To achieve this, we first remove regions with conserved replication timing between cell types, resulting in a set of informative regions that can be further optimized by a Monte Carlo selection algorithm.
Figure S2 depicts the Monte Carlo algorithm. To reduce noise from individual probe measurements, replication profiles are first averaged into nonoverlapping windows of approximately 200 kb. This window size represents a balance between sizes of the regions that change replication timing during development (400–800 kb), and the number of probes needed for timing changes to be deemed statistically significant (35–180 probes are contained in each window depending on the probe density of the array platform; see Methods, Table S2). An initial set of regions with the highest replication timing changes in the set of replication profiles are chosen to exclude regions with conserved replication timing, and half of these starting regions are randomly selected to calculate initial distances between cell types. At each iteration of the algorithm, a region can be added to the set of fingerprint regions, removed from the set, or swapped with an unused region. Using a Metropolis-Hastings criterion [23], [24], moves that improve the overall distance ratio are accepted with higher probability than those that do not; after 20,000 or more such moves, a final set of fingerprinting regions is selected.
As depicted in Figure 2, the fingerprinting algorithm selects domains with large and reproducible replication timing differences between cell types, discarding those with minimal or variable changes. Before selecting optimal regions (Figure 2A,C), the average distance between “like” and “unlike” cell types are similar, translating into classification errors for randomly selected domains (Figure 2C) as well as the whole genome (Figure S1, red shading). After selection, the separation in distances between like and unlike types becomes very distinct (Figure 2B, D), even for closely related cell types (Figure 3). These regions similarly highlight distinctions between cell types both in correlations (Figure S3, S4, S5, S6, S7, S8), and distance matrices between cell types (Figure S9, S10, S11, S12).
Since Monte Carlo selection is stochastic, different sets of fingerprinting regions can be selected in different runs. To evaluate the stability of regions included in replication fingerprints, we applied the algorithm 100 times for each type of human and mouse fingerprint constructed (Figure S13). Results demonstrate that fingerprinting regions are well-conserved among multiple rounds of selection, with the top 10–14 regions selected in 100/100 trials in each case. For all subsequent classification, we used regions included in at least 75/100 fingerprinting runs.
As the distances between profiles derive from either the same or different cell types (Figure 2C), their distributions can be used to create a general classifier (Figure 2C,D, Figure 3A), with an error rate proportional to the overlap in distances shared by “like” and “unlike” cell type comparisons (Figure 2C,D, blue shading). This allows us to state a level of confidence for a given prediction, as well as estimate the similarity of a cell type to others. To refine this classification, we applied the k-nearest-neighbor rule [25] (kNN; k = 3) to assign cell types according to the three most similar profiles in the training set. Distances below the threshold – θ = 2.4 in Figure 2D – are hypothesized to derive from similar cell types, and are used with kNN to classify profiles according to the closest profiles in the training set. Distances above the threshold are presumed to derive from different cell types, preventing kNN from classifying highly divergent RT profiles as the cell type of the most similar known profile.
Classification of cell types using fingerprint regions
To test the ability of our method to select suitable regions for classification, we applied it to predict the known identity of 9 mouse and 7 human cell types with 36 and 31 total experimental replicates, respectively. Datasets used for prediction are summarized in Table S1, with most described in detail in previous publications [3]–[5]. Rough classification of each experiment into like and unlike cell types by a distance ratio cutoff was accurate in 951/961 (99.0%) human and 1250/1296 (96.5%) mouse comparisons respectively (Figure 3A,B). Refining this classifier by using kNN to assign cell types according to the three most similar profiles in the training set resulted in correct predictions for 36/36 mouse and 31/31 human replication timing profiles (Figure 3C,D). Strikingly, even closely related cell types could be reliably distinguished using this method, such as mouse ESCs and early primitive ectoderm-like stem cells (EPL/EBM3), and two day intermediates of human ESC differentiation into endomesoderm (DE2; day 2) and definitive endoderm (DE4; day 4). Thus, replication profiles appear capable of distinguishing among a wide array of cell types in early mouse and human development.
Confirmation and generalizability of replication fingerprints
The use of all experimental data in a selection algorithm often results in overfitting the model to a limited set of observations. For this reason, machine-learning algorithms are commonly trained and tested on different subsets of data (termed cross-validation). To determine whether overfitting is occurring in our selection method and assess the degree to which fingerprinting domains are generally cell type-specific, we performed leave-one-out cross-validation (LOOCV) with each of the available experiments by constructing fingerprints using all but one experimental replicate, and testing classification on the remaining replicate. In all cases (31/31 human, 36/36 mouse), correct predictions in the excluded profile confirmed that fingerprinting regions remained consistent with cell type, and that most cell-line-specific differences were discarded (Figure 3C, LOOCV column). This was also true for a cell line with only one replicate (mouse 46C neural precursor cells), implying that most of the regions of differential replication timing useful for classification are shared between cell lines.
To simulate the classification of a cell type not yet encountered in the training set, we tested predictions after selecting fingerprinting regions with all replicates of a given cell type excluded (Figure 3C, LCTO column). This confirmed that most cell types not yet encountered were correctly classified as “Unseen” (7/7 cell types in human, 7/9 in mouse). However, two cases in which profiles were ambiguous were between neural precursors (NPCs) and mouse epiblast-like stem cells (EpiSCs, EBM6), suggesting that closely related cell types are more accurately distinguished when examples of each type are included in the training set.
A replication fingerprint for pluripotency
One of the most striking features of replication timing is its widespread consolidation into larger replication domains during neural differentiation, concomitant with global compaction of chromatin [3], [4]. This consolidation, along with recovery of ESC replication timing by induced pluripotent stem cells (iPSCs), suggested that replication patterns in specific regions of the genome are associated with the pluripotent state. Further, if certain timing changes are a stable property of cellular commitment, they may provide a unique opportunity to evaluate differentiation capacity using replication-timing patterns. To explore this, we analyzed the differences in replication profiles between collections of pluripotent/reversible (ESCs, iPSCs, EBM3) and committed cell types in 13 human and 21 mouse cell lines (Figure 4A). In each case, we created a stringent consensus fingerprint for classification consisting of regions found in >75/100 runs (18 regions each in mouse and human), and examined genes in the top 200 fingerprint regions (∼2% of the genome) to characterize a more inclusive sample. Genes and regions found to consistently switch to earlier or later replication as pluripotency is lost are provided in Tables S3, S4, S5, S6.
Strikingly, several regions displayed conserved, significant differences in timing between all pluripotent and committed cell types (Figures 4A, S10, S12). As with general fingerprints, classification into pluripotent or committed types could be performed unambiguously (36/36 cases in mouse, 31/31 in human), even with regions selected with the test profile excluded (LOOCV column). Several of the genes consistently switching to later replication in mouse and human pluripotency fingerprints have known roles in maintaining pluripotency (for instance, Dppa2 and Dppa4 in both species, and DKK1 in human; Tables S4 and S6). In addition, two classes of genes stood out from this analysis that showed significant switches to later replication in both species: a large cluster of protocadherins (PCDs), and the majority of the Hist1 cluster of core histone genes (Table S7). The former are developmentally regulated genes with broad involvement in neural development and cell-cell signaling [26], [27], and switch to later replication in all committed mouse and human cell types. The latter Hist1 cluster was later replicating in 8/8 committed cell types in mouse and 5/6 in human (not lymphoblasts), and includes several core histone genes that were downregulated up to 2.5-fold in NPCs. These results are intriguing in light of previous reports of histone downregulation during development [28], as well as a hyperdynamic chromatin phenotype in ESCs that involves higher exchange rates of histone H1 [29] and is required for efficient somatic cell nuclear reprogramming in Xenopus oocytes [30]. Importantly, all of the histone H1 genes are found in this cluster, suggesting that regulation of global H1 abundance may provide a mechanism for the overall chromatin compaction and consolidation of replication timing observed during neural differentiation [3]–[5].
To characterize the genes included in the mouse pluripotency fingerprint, we compared them to a previous class of genes that showed lineage-independent switches to later replication in mouse ESC differentiation, and failed to revert to ESC-like expression in three separately derived samples of partial iPSCs (clusters 15 and 16 in Figure 7 of Hiratani et al., 2010). Remarkably, 200 out of 217 genes in the top 100 mouse pluripotency regions belonged to this class, despite very different methods for deriving them (Figure 5A). All of the fingerprint genes switched to later replication, and at the transition between early and late epiblast stages where cell fates become restricted [5]. Most genes also had reduced expression in late epiblast and neural progenitor stages (average 1.66-fold reduction in transcription from ESC/EBM3 to EBM6/NPCs). Thus, some of these genes may make prime candidates for improving the efficiency of iPSC production, or for reverting human ESCs to a more naive, mouse ESC-like state. However, the overlap between human and mouse pluripotency fingerprint genes, while significant, was much lower (Figure 5A), and this was true even when comparing human ESCs to developmentally analogous mouse EpiSCs [3], [31]. Therefore, many pluripotency-associated genes and loci may be species-specific, consistent with recent studies that underscore considerable differences between mouse and human pluripotency networks [32], [33]. This low alignment is also accounted for by a general drop in overall alignment in regions with the greatest developmental switches in replication timing (Figure 5B), which are those preferentially selected by the fingerprinting algorithm.
Of the genes conserved in the fingerprints of both species (indicated by boldface type in Tables S4 and S6), most belong to the aforementioned large class of protocadherins. However, Dppa2 and Dppa4 are also conserved, as well as genes with no known roles in maintaining pluripotency (Cast, Riok2, Lix1) that reside within the same replication units as pluripotency fingerprint genes in both species. Other core pluripotency genes remain relatively early replicating in both species (Pou5f1[Oct4], Sox2, Nanog), and are likely regulated by other mechanisms. For instance, Sox2 belongs to a class of genes with strong promoters (HCP, or high CPG content promoters) generally unaffected by local replication timing [4], [34].
Independent verification of fingerprint classification by PCR
One potential application of replication fingerprints is in the development of PCR kits for epigenetic classification, particularly for cell types or disease samples with no known aberrations in transcription or sequence. To confirm that fingerprint regions can be translated into a classification scheme using site-specific PCR, we classified two unknown samples representing cell types that were analyzed previously, but that were derived from different cell lines than the original set used for training. The experiment was performed in a blind manner in which the experimenter had no prior knowledge of the regions or cell types being tested. Primers were assembled against sequences within 10–20 kb from the center of each fingerprint region, and the replication times of each region were quantified as the “relative early S phase abundance” (relative abundance of a sequence in nascent strands from early S phase), as previously described [35] (Figure 6A). PCR-based timing values were rescaled for consistency with the original scale of the array datasets used in training, and distances were calculated between the unknown samples and other human profiles in fingerprint regions (Figure 6B). Using the same methods as in prior classifications, these distances correctly identified the two unknown samples as lymphoblasts and hESCs, respectively; the three known datasets with the smallest distances were each of the correct cell type.
Discussion
Advantages and caveats of replication profiles for cell typing
Our method for cell typing through replication fingerprinting addresses a well-recognized need for comprehensive methods to assess cellular identity and differentiation potential in stem cell biology. Unlike other molecular markers, replication is regulated at the level of large, multi-megabase domains, making comprehensive, genome-wide profiles relatively simple to generate and interpret [36]. In particular, the robust stability of replication timing profiles in stem cells [8], and wide divergence between cell types make them a promising candidate for classification.
While the functional role for the replication program is not yet understood, its conservation between human and mouse cell culture models of development support its functional significance. We and others have shown a substantial correlation (R2 = 0.42–0.53) in replication patterns between mouse and human cell types, with timing patterns of embryonic stem cells, neural precursor cells, and lymphoblastoid cells most closely aligned to their cognate in the other species [1], [3]. The important role for replication is further corroborated by its remarkably strong link to genome organization [3], and its ability to confirm the mouse epiblast identity of human ESCs genome-wide and with an epigenetic property [3], [31].
By comparison, methods for cell typing using DNA methylation, gene expression, histone modifications or protein markers are well suited to some applications [10]–[16], but may not be informative for certain fractions of the genome, or may rely on genome features that cannot distinguish between similar cell states. We therefore envision replication fingerprinting as a complement to existing cell typing strategies that may be used for samples unsuitable for traditional methods, or for additional confidence in assessing cell identity in cases where this is critical, such as regenerative medicine. One caveat to consider in these applications is that replication profiles, similar to other genome-wide methods, are an ensemble aggregate from many cells, making measurement of homogeneity difficult. In addition, as with other supervised classification approaches, the method is informative only for cell types (classes) available during training. However, our fingerprinting method is in principle applicable to any data type, and may be modified to select discriminating features in other epigenetic profiles.
A major advantage of our fingerprinting method is in selection of a minimal set of regions that allow for classification with a straightforward PCR-based timing assay and a reasonably small set of primers, particularly if only cell-type specific regions are examined. Our results suggest that a standard set of 20 fingerprint loci can be effective for classification, but the number of regions queried can be adjusted based on the confidence level required. The sole requirement for replication profiling is the collection of a sufficient number of proliferating cells for sorting on a flow cytometer. Consistently, just as replication fingerprints can be generated for particular cell types or general categories of cells, features of replication profiles allow for the creation of disease-specific fingerprints, which may be valuable for prognosis.
Consistent timing changes between pluripotent and committed cell types
In addition to cell typing applications, replication profiling is informative for basic biological questions. Here, we have identified regions that may undergo important organizational changes upon differentiation, which include a class of gene that fail to reverse expression in partial iPSCs, and the majority of mouse and human histone H1 genes. Human lymphoblasts retained early replication in H1 genes, which may be explained by their high rate of proliferation. Since highly developmentally plastic regions (including pluripotency fingerprint regions) are poorly conserved (Figure 5B) the evolutionary conservation of cell-type specific timing patterns must be driven by the moderately changing majority of the genome.
The recent derivation of mouse ESC-like human stem cells with various methods raises an intriguing question [37]: will naïve hESCs align better to mESCs than to mEpiSCs for replication timing as they have for transcription? Although pluripotency is currently assessed by marker gene expression or laborious complementation experiments, replication timing assays in regions uniquely early or late replicating in pluripotent cells provide a tractable method to predict the pluripotency of various cell types, as well as insights into conserved genome organizational changes during differentiation.
Methods
Cell culture and differentiation
Mouse replication timing datasets are described in Hiratani et al., 2010. Briefly, mouse embryonic stem cells (ESCs) from D3, TT2, and 46C cell lines were subjected to either 6-day (46C) or 9-day (D3, TT2) neural differentiation protocols to generate neural progenitor cells (NPCs) [4], [5]. For D3, intermediates were also profiled after 3 (EBM3) and 6 (EBM6) days of differentiation. Muscle stem cells (myoblast) and induced pluripotent stem cells (iPSCs) reprogrammed from fibroblasts were collected as described for human and mouse [38]–[40]. For human timing datasets, neural precursors were differentiated from BG01 ESCs as described in Schulz et al., 2004 [3], [41]. Lymphoblast cell lines GM06990 and C0202 were cultured as previously described [2], [42]. Differentiation of BG02 hESCs to mesendoderm (DE2) and definitive endoderm (DE4) was performed by switching from defined media (McLean et al. [20]) to DMEM/F12+100 ng/ml Activin A 20 ng/ml Fgf2 for two and four days, respectively, with 25 ng/ml Wnt3a added on the first day. Mesoderm and smooth muscle cells were derived by adding BMP4 to DE2 cells at 100 ng/ml.
Generation and preprocessing of microarray datasets
Using custom R/Bioconductor scripts [43], [44], microarray data from Hiratani et al. 2008, Hiratani et al. 2010, and Ryba et al., 2010 were normalized to equivalent scales, and averaged in nonoverlapping windows of approximately 200 kb. Additional profiles for human ESCs, definitive endoderm, mesendoderm, mesoderm, and smooth muscle were derived, normalized and scaled equivalently, as described [45]. Profiles shown in Figure 1 and Figure 4 were smoothed using LOESS with a span of 300 kb.
Monte Carlo selection of fingerprinting regions
Selection of fingerprint regions was performed as described using custom R/Bioconductor scripts. Regions of non-conserved RT (2000/10994 mouse, 2000/12625 human) were first selected based on standard deviation, then optimized using a Monte Carlo algorithm (Figure S2). Using the Metropolis-Hastings criterion for Monte Carlo with simulated annealing [23], [24], moves are accepted when exp((dRbest−dR)/T)>i, where dR is the distance ratio of the proposed move, dRbest is the current best distance ratio, T is a temperature parameter that decreases geometrically during the simulation, and i is a random number from 0 to 1.
Cell type classification
Cell type classification was performed using absolute distances between experiments measured from replication timing in fingerprint regions, using the k-nearest neighbor rule with k = 3; i.e., each profile was categorized according to the three nearest profiles. Crossvalidation was performed to select an appropriate value for k, with k = 3 chosen as the smallest value that yielded 100% classification accuracy after leave-one-out crossvalidation (LOOCV) to allow classification of cell types with fewer replicates. For LOOCV results, each experiment was sequentially left out during Monte Carlo selection, and the resulting regions were used to predict the identity of the excluded experiment. To test prediction on cell types not yet encountered, all profiles for a given cell type were left out during region selection (LCTO), and cell type was predicted using the resulting regions. All data analysis was performed using custom R scripts and Bioconductor packages [43], [44].
Cell type classification using PCR
For each fingerprint region depicted in Table 1, 10–20 kb from the center of the region was sent to NCBI Primer-Blast (http://www.ncbi.nlm.nih.gov/tools/primer-blast/) to design several PCR primer sets with product sizes of 150–350 bp, using standard parameters. Forward and reverse primer pairs displaying the greatest specificity were chosen. Primer sets were verified for specificity and product size using the In-Silico PCR tool at the UCSC genome browser (http://genome.ucsc.edu/cgi-bin/hgPcr).
Table 1. Primers used for PCR fingerprint verification.
Genomic region | |||||||
Region | F/R | Sequence | Length | Product region (Hg18)/Size | Chr | Start | End |
551 | Forward | ACATGGGCGTGCAATCCCCA | 20 | chr1:145,442,862–145,443,081 | chr1 | 145,439,397 | 145,449,397 |
Reverse | GGCCCTGGTTGCTAGGTGCG | 20 | 220 | ||||
647 | Forward | GCAAAACGGCCAACGGCTGA | 20 | chr1:167,999,361–167,999,518 | chr1 | 167,993,179 | 168,003,179 |
Reverse | GGCCTGCCAGTGCTGAGAGG | 20 | 158 | ||||
927 | Forward | TTGGCCCACGTGTGGGGAGA | 20 | chr1:230,199,814–230,200,011 | chr1 | 230,199,352 | 230,209,352 |
Reverse | TGACCCCCTCCCAGGCAGTG | 20 | 198 | ||||
928 | Forward | TGACCCCTGCACCCACCACA | 20 | chr1:230,387,991–230,388,261 | chr1 | 230,396,329 | 230,406,329 |
Reverse | GCCGCAGCTGGAAAAGGGGT | 20 | 271 | ||||
1023 | Forward | GCGTTGCCCTTTGCCACCAC | 20 | chr10:4,017,971–4,018,253 | chr10 | 4,008,411 | 4,018,411 |
Reverse | GCTGCAGCCTCCACCTGCAA | 20 | 283 | ||||
1377 | Forward | AGCCCTGAAAGGTGAGCCCCA | 21 | chr10:90,168,738–90,168,918 | chr10 | 90,159,026 | 90,169,026 |
Reverse | GGGTGGGAGGGGGCTGCTAA | 20 | 181 | ||||
1494 | Forward | GCACGTTGCAGATGCACTGCG | 21 | chr10:114,445,426–114,445,576 | chr10 | 114,440,981 | 114,450,981 |
Reverse | TGGATGGGTCACGTGTGGCG | 20 | 151 | ||||
1496 | Forward | GTTGCTCACCACGGGAGGCC | 20 | chr10:114,788,086–114,788,296 | chr10 | 114,782,400 | 114,792,400 |
Reverse | CCACAAGCCAGCCGACGGAG | 20 | 211 | ||||
2658 | Forward | GGGTTTCCGGGCAGGTGGTG | 20 | chr12:114,297,381–114,297,554 | chr12 | 114,296,629 | 114,306,629 |
Reverse | AGCCCCGGCTTTCCTCCCTT | 20 | 174 | ||||
2659 | Forward | CCGCCTCTTCCCACCCCACT | 20 | chr12:114,463,910–114,464,159 | chr12 | 114,463,748 | 114,473,748 |
Reverse | GGCCTGGCAGGGGTTTTGCT | 20 | 250 | ||||
5418 | Forward | GTGGGGATTGACTGCACTGCCA | 22 | chr2:44,479,986–44,480,137 | chr2 | 44,479,922 | 44,489,922 |
Reverse | TCAATGCCCCCTCCCCCAACA | 21 | 152 | ||||
7316 | Forward | GTGCCCAGCGGCCAGTGTAG | 20 | chr3:109,192,760–109,192,989 | chr3 | 109,187,624 | 109,197,624 |
Reverse | CTGAGTGTGGCGCCTGTGGG | 20 | 230 | ||||
7317 | Forward | CGCCCTATCCCGGGACTGCT | 20 | chr3:109,390,861–109,391,033 | chr3 | 109,380,748 | 109,390,748 |
Reverse | CCTGTGTCTGCTTCCCCCACC | 21 | 173 | ||||
7515 | Forward | AAGGCCAGTTGCAGGCCCTG | 20 | chr3:153,125,829–153,126,087 | chr3 | 153,111,466 | 153,121,466 |
Reverse | CACCCCAACGGCGCATGGTT | 20 | 259 | ||||
7516 | Reverse | AGGCAGCACTGCGGTAATATGCT | 23 | chr3:153,356,058–153,356,339 | chr3 | 153,355,644 | 153,365,644 |
Forward | GGGCTGCAGTGGGTTCTGCC | 20 | 282 | ||||
7551 | Reverse | GTGGCAGCTACACGGCCAGG | 20 | chr3:161,085,915–161,086,106 | chr3 | 161,084,943 | 161,094,943 |
Forward | GGAAGCCCCAAACCCAGGCA | 20 | 192 | ||||
8679 | Reverse | TCTGTGCGGGCCTGAAGGGG | 20 | chr5:38,608,473–38,608,822 | chr5 | 38,601,124 | 38,611,124 |
Forward | AGCGGCCACATCTTGCAGGT | 20 | 350 | ||||
8680 | Reverse | GGCACCACGAGGGAGATGCG | 20 | chr5:38,794,843–38,795,022 | chr5 | 38,786,446 | 38,796,446 |
Reverse | CCCGGCAGTGGGAGAGACGA | 20 | 180 | ||||
8893 | Forward | CTGGCCCCTCCTCACCTCCG | 20 | chr5:95,884,467–95,884,676 | chr5 | 95,902,422 | 95,912,422 |
Reverse | GGCACACCAGGCAGCACCTC | 20 | 210 | ||||
9107 | Forward | ACACATTTGCTGAGGGCCCACT | 22 | chr5:142,903,434–142,903,692 | chr5 | 142,899,586 | 142,909,586 |
Reverse | ACTGTCCCATCCTGCGGCCT | 20 | 259 |
PCR reactions were set up using 1.25 ng genomic DNA and 1 uM each of forward and reverse primers in 12.5 uL scaled according to the instructions of Crimson Taq DNA Polymerase (NEB). Thirty six cycles of PCR (empirically determined to be unsaturated for amplification) were performed according to manufacturer's conditions with annealing temperature of 62°C. One-third of the reaction was analyzed on a 1.5% agarose gel containing ethidium bromide. The gel was scanned by Typhoon Trio (GE Healthcare) and band intensity was quantified by Image Quant TL (GE Healthcare). After the background was subtracted, signal intensity from the early S fraction was divided by the sum of those from early S and late S fractions from each sample, as described [35]. PCR timing values were converted to array RT scale (equivalent root-mean-square) using the scale function in R, and distances were calculated against other cell types as previously performed.
Accession numbers
Supporting Information
Acknowledgments
We gratefully acknowledge the assistance of Mari Itoh, Junjie Lu and Tomoki Yokochi in data collection, Ruth Didier for expert assistance with cell sorting, as well as helpful discussions with Dan McGee and Allan Robins. We also thank Rudolph Jaenisch and Jacob Hanna for providing the BrdU-labeled WIBR3 cells and Dirk Schubeler for providing the NC-NC cell line.
Footnotes
The authors have declared that no competing interests exist.
This work was supported by NIH grant GM085354 to DMG. SD is supported by the National Institute of Child Health and Human Development (HD049647) and the National Institute for General Medical Sciences (GM75334), and IH by a post-doctoral fellowship from the International Rett Syndrome Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1.Yaffe E, Farkash-Amar S, Polten A, Yakhini Z, Tanay A, et al. Comparative analysis of DNA replication timing reveals conserved large-scale chromosomal architecture. PLoS Genet. 2010;6:e1001011. doi: 10.1371/journal.pgen.1001011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Woodfine K, Fiegler H, Beare DM, Collins JE, McCann OT, et al. Replication timing of the human genome. Hum Mol Genet. 2004;13:191–202. doi: 10.1093/hmg/ddh016. [DOI] [PubMed] [Google Scholar]
- 3.Ryba T, Hiratani I, Lu J, Itoh M, Kulik M, et al. Evolutionarily conserved replication timing profiles predict long-range chromatin interactions and distinguish closely related cell types. Genome Res. 2010;20:761–70. doi: 10.1101/gr.099655.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Hiratani I, Ryba T, Itoh M, Yokochi T, Schwaiger M, et al. Global reorganization of replication domains during embryonic stem cell differentiation. PLoS Biol. 2008;6:e245. doi: 10.1371/journal.pbio.0060245. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Hiratani I, Ryba T, Itoh M, Rathjen J, Kulik M, et al. Genome-wide dynamics of replication timing revealed by in vitro models of mouse embryogenesis. Genome Res. 2010;20:155–69. doi: 10.1101/gr.099796.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Farkash-Amar S, Lipson D, Polten A, Goren A, Helmstetter C, et al. Global organization of replication time zones of the mouse genome. Genome Res. 2008;18:1562–70. doi: 10.1101/gr.079566.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Desprat R, Thierry-Mieg D, Lailler N, Lajugie J, Schildkraut C, et al. Predictable dynamic program of timing of DNA replication in human cells. Genome Res. 2009;19:2288–2299. doi: 10.1101/gr.094060.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Yokochi T, Poduch K, Ryba T, Lu J, Hiratani I, et al. G9a selectively represses a class of late-replicating genes at the nuclear periphery. Proc Natl Acad Sci U S A. 2009;106:19363–8. doi: 10.1073/pnas.0906142106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Berezney R, Dubey DD, Huberman JA. Heterogeneity of eukaryotic replicons, replicon clusters, and replication foci. Chromosoma. 2000;108:471–84. doi: 10.1007/s004120050399. [DOI] [PubMed] [Google Scholar]
- 10.Marsit CJ, Koestler DC, Christensen BC, Karagas MR, Houseman EA, et al. DNA methylation array analysis identifies profiles of blood-derived DNA methylation associated with bladder cancer. J Clin Oncol. 2011;29:1133–9. doi: 10.1200/JCO.2010.31.3577. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Figueroa ME, Wouters BJ, Skrabanek L, Glass J, Li Y, et al. Genome-wide epigenetic analysis delineates a biologically distinct immature acute leukemia with myeloid/T-lymphoid features. Blood. 2009;113:2795–804. doi: 10.1182/blood-2008-08-172387. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Baron U, Türbachova I, Hellwag A, Eckhardt F, Berlin K, et al. DNA methylation analysis as a tool for cell typing. Epigenetics. 2006;1:55–60. doi: 10.4161/epi.1.1.2643. [DOI] [PubMed] [Google Scholar]
- 13.Sotiriou C, Pusztai L. Gene-expression signatures in breast cancer. N Engl J Med. 2009;360:790–800. doi: 10.1056/NEJMra0801289. [DOI] [PubMed] [Google Scholar]
- 14.Hou J, Aerts J, Hamer B den, Ijcken W van, Bakker M den, et al. Gene expression-based classification of non-small cell lung carcinomas and survival prediction. PLoS ONE. 2010;5:e10312. doi: 10.1371/journal.pone.0010312. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Elsheikh SE, Green AR, Rakha EA, Powe DG, Ahmed RA, et al. Global histone modifications in breast cancer correlate with tumor phenotypes, prognostic factors, and patient outcome. Cancer Res. 2009;69:3802–9. doi: 10.1158/0008-5472.CAN-08-3907. [DOI] [PubMed] [Google Scholar]
- 16.Barlési F, Giaccone G, Gallegos-Ruiz MI, Loundou A, Span SW, et al. Global histone modifications predict prognosis of resected non small-cell lung cancer. J Clin Oncol. 2007;25:4358–64. doi: 10.1200/JCO.2007.11.2599. [DOI] [PubMed] [Google Scholar]
- 17.Voss TC, Schiltz RL, Sung M-H, Johnson TA, John S, et al. Combinatorial probabilistic chromatin interactions produce transcriptional heterogeneity. J Cell Sci. 2009;122:345–56. doi: 10.1242/jcs.035865. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Chang HH, Hemberg M, Barahona M, Ingber DE, Huang S. Transcriptome-wide noise controls lineage choice in mammalian progenitor cells. Nature. 2008;453:544–7. doi: 10.1038/nature06965. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Efroni S, Melcer S, Nissim-Rafinia M, Meshorer E. Stem cells do play with dice: a statistical physics view of transcription. Cell Cycle. 2009;8:43–8. doi: 10.4161/cc.8.1.7216. [DOI] [PubMed] [Google Scholar]
- 20.McLean AB, D'Amour KA, Jones KL, Krishnamoorthy M, Kulik MJ, et al. Activin a efficiently specifies definitive endoderm from human embryonic stem cells only when phosphatidylinositol 3-kinase signaling is suppressed. Stem Cells. 2007;25:29–38. doi: 10.1634/stemcells.2006-0219. [DOI] [PubMed] [Google Scholar]
- 21.Schübeler D, Scalzo D, Kooperberg C, Steensel B van, Delrow J, et al. Genome-wide DNA replication profile for Drosophila melanogaster: a link between transcription and replication timing. Nat Genet. 2002;32:438–42. doi: 10.1038/ng1005. [DOI] [PubMed] [Google Scholar]
- 22.Weddington N, Stuy A, Hiratani I, Ryba T, Yokochi T, et al. ReplicationDomain: a visualization tool and comparative database for genome-wide replication timing data. BMC Bioinformatics. 2008;9:530. doi: 10.1186/1471-2105-9-530. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57:97–109. [Google Scholar]
- 24.Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of State Calculations by Fast Computing Machines. J Chem Phys. 1953;21:1087. [Google Scholar]
- 25.Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967;13:21–27. [Google Scholar]
- 26.Sano K, Tanihara H, Heimark RL, Obata S, Davidson M, et al. Protocadherins: a large family of cadherin-related molecules in central nervous system. EMBO J. 1993;12:2249–56. doi: 10.1002/j.1460-2075.1993.tb05878.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Angst BD, Marcozzi C, Magee AI. The cadherin superfamily: diversity in form and function. J Cell Sci. 2001;114:629–41. doi: 10.1242/jcs.114.4.629. [DOI] [PubMed] [Google Scholar]
- 28.Gerbaulet SP, Wijnen AJ van, Aronin N, Tassinari MS, Lian JB, et al. Downregulation of histone H4 gene transcription during postnatal development in transgenic mice and at the onset of differentiation in transgenically derived calvarial osteoblast cultures. J Cell Biochem. 1992;49:137–47. doi: 10.1002/jcb.240490206. [DOI] [PubMed] [Google Scholar]
- 29.Meshorer E, Yellajoshula D, George E, Scambler PJ, Brown DT, et al. Hyperdynamic plasticity of chromatin proteins in pluripotent embryonic stem cells. Dev Cell. 2006;10:105–16. doi: 10.1016/j.devcel.2005.10.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Jullien J, Astrand C, Halley-Stott RP, Garrett N, Gurdon JB. Characterization of somatic cell nuclear reprogramming by oocytes in which a linker histone is required for pluripotency gene reactivation. Proc Natl Acad Sci U S A. 2010;107:5483–8. doi: 10.1073/pnas.1000599107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Tesar PJ, Chenoweth JG, Brook FA, Davies TJ, Evans EP, et al. New cell lines from mouse epiblast share defining features with human embryonic stem cells. Nature. 2007;448:196–9. doi: 10.1038/nature05972. [DOI] [PubMed] [Google Scholar]
- 32.Ginis I, Luo Y, Miura T, Thies S, Brandenberger R, et al. Differences between human and mouse embryonic stem cells. Dev Biol. 2004;269:360–80. doi: 10.1016/j.ydbio.2003.12.034. [DOI] [PubMed] [Google Scholar]
- 33.Heng J-CD, Orlov YL, Ng H-H. Transcription Factors for the Modulation of Pluripotency and Reprogramming. Cold Spring Harb Symp Quant Biol. 2010;75:237–44. doi: 10.1101/sqb.2010.75.003. [DOI] [PubMed] [Google Scholar]
- 34.Weber M, Hellmann I, Stadler MB, Ramos L, Pääbo S, et al. Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nat Genet. 2007;39:457–66. doi: 10.1038/ng1990. [DOI] [PubMed] [Google Scholar]
- 35.Hiratani I, Leskovar A, Gilbert DM. Differentiation-induced replication-timing changes are restricted to AT-rich/long interspersed nuclear element (LINE)-rich isochores. Proc Natl Acad Sci U S A. 2004;101:16861–6. doi: 10.1073/pnas.0406687101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Pope BD, Hiratani I, Gilbert DM. Domain-wide regulation of DNA replication timing during mammalian development. Chromosome Res. 2010;18:127–36. doi: 10.1007/s10577-009-9100-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Hanna J, Cheng AW, Saha K, Kim J, Lengner CJ, et al. Human embryonic stem cells with biological and epigenetic characteristics similar to those of mouse ESCs. Proc Natl Acad Sci U S A. 2010;107:9222–7. doi: 10.1073/pnas.1004584107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Takahashi K, Yamanaka S. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell. 2006;126:663–76. doi: 10.1016/j.cell.2006.07.024. [DOI] [PubMed] [Google Scholar]
- 39.Park I-H, Zhao R, West JA, Yabuuchi A, Huo H, et al. Reprogramming of human somatic cells to pluripotency with defined factors. Nature. 2008;451:141–6. doi: 10.1038/nature06534. [DOI] [PubMed] [Google Scholar]
- 40.Maherali N, Sridharan R, Xie W, Utikal J, Eminli S, et al. Directly reprogrammed fibroblasts show global epigenetic remodeling and widespread tissue contribution. Cell Stem Cell. 2007;1:55–70. doi: 10.1016/j.stem.2007.05.014. [DOI] [PubMed] [Google Scholar]
- 41.Schulz TC, Noggle SA, Palmarini GM, Weiler DA, Lyons IG, et al. Differentiation of human embryonic stem cells to dopaminergic neurons in serum-free suspension culture. Stem Cells. 2004;22:1218–38. doi: 10.1634/stemcells.2004-0114. [DOI] [PubMed] [Google Scholar]
- 42.Koch CM, Andrews RM, Flicek P, Dillon SC, Karaöz U, et al. The landscape of histone modifications across 1% of the human genome in five human cell lines. Genome Res. 2007;17:691–707. doi: 10.1101/gr.5704207. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80. doi: 10.1186/gb-2004-5-10-r80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.R Development Core Team. R: A Language and Environment for Statistical Computing. ISBN 3: 2673. Vienna, Austria: R Foundation for Statistical Computing; 2008. [Google Scholar]
- 45.Ryba T, Battaglia D, Pope BD, Hiratani I, Gilbert DM. Genome-scale analysis of replication timing: from bench to bioinformatics. Nat Protoc. 2011;6:870–95. doi: 10.1038/nprot.2011.328. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.