Proceedings of the National Academy of Sciences of the United States of America. 2016 Sep 14;113(39):E5765–E5774. doi: 10.1073/pnas.1603241113

Inevitability and containment of replication errors for eukaryotic genome lengths spanning megabase to gigabase

Mohammed Al Mamun a,1, Luca Albergante a,1, Alberto Moreno b, James T Carrington b, J Julian Blow b, Timothy J Newman a,2
PMCID: PMC5047159  PMID: 27630194

Significance

Errors in DNA replication can never be completely avoided. By combining a minimal model that takes into account the positions of replication origins (the regions on the DNA where replication initiates) with experimental evidence, we show that genome size strongly influences the frequency of replicative errors. Our work reveals that (i) simple eukaryotes are able to achieve a very low probability of replicative errors by having a moderate number of origins placed at regular intervals; (ii) this strategy is ineffective in eukaryotes with larger genomes, such as human, for which replicative errors are inevitable; and (iii) in these organisms, even moderate numbers of origins can provide containment of replication errors to very low levels, which can be repaired subsequently.

Keywords: eukaryotes, genome length, replication error, Poisson distribution, mathematical modeling

Abstract

The replication of DNA is initiated at particular sites on the genome called replication origins (ROs). Understanding the constraints that regulate the distribution of ROs across different organisms is fundamental for quantifying the degree of replication errors and their downstream consequences. Using a simple probabilistic model, we generate a set of predictions on the extreme sensitivity of error rates to the distribution of ROs, and how this distribution must therefore be tuned for genomes of vastly different sizes. As genome size changes from megabases to gigabases, we predict that regularity of RO spacing is lost, that large gaps between ROs dominate error rates but are heavily constrained by the mean stalling distance of replication forks, and that, for genomes spanning ∼100 megabases to ∼10 gigabases, errors become increasingly inevitable but their number remains very small (three or less). Our theory predicts that the number of errors becomes significantly higher for genome sizes greater than ∼10 gigabases. We test these predictions against datasets in yeast, Arabidopsis, Drosophila, and human, and also through direct experimentation on two different human cell lines. Agreement of theoretical predictions with experiment and datasets is found in all cases, resulting in a picture of great simplicity, whereby the density and positioning of ROs explain the replication error rates for the entire range of eukaryotes for which data are available. The theory highlights three domains of error rates: negligible (yeast), tolerable (metazoan), and high (some plants), with the human genome at the extreme end of the middle domain.


The proper maintenance of genetic information is of fundamental importance to the survival of all organisms, and many molecular mechanisms exist to ensure that the genetic sequence encoded by DNA is maintained unaltered generation after generation (1–3). To preserve the integrity of genetic information and to avoid aberrant ploidy, it is crucial that the entire DNA is copied exactly once; replicating only part of the DNA results in potential corruption of genes, and replicating certain parts of the DNA more than once would perturb chromosome structure and strongly affect gene dosage (4–6). Not surprisingly, regions of underreplicated and overreplicated DNA are common in cancer (7, 8).

DNA replication is a particularly complex process in eukaryotic organisms with large genomes distributed across multiple chromosomes. Multiple checkpoints exist to ensure that, once replication starts, the whole DNA is faithfully replicated before the chromosomes are segregated. Underreplication and overreplication of DNA are prevented by using predefined points of replication initiation called replication origins (ROs) (3, 9).

During late mitosis and the G1 phase of the cell division cycle, each potential RO is “licensed” for a single initiation event by being loaded with minichromosome maintenance proteins 2–7 (MCM2-7) double hexamers. To prevent rereplication of DNA segments, the ability to license new origins ceases before cells enter S phase. During this phase, hundreds to thousands of licensed ROs are activated throughout the genome (10). Bidirectional replication forks (RFs) are established at active ROs, each driven by a single MCM2-7 hexamer, allowing DNA polymerases to copy the DNA (Fig. 1A). Despite being highly reliable molecular machines, RFs can, on rare occasions, irreversibly stall (11). The activation of additional ROs can overcome the problem of irreversibly stalled RFs, as a new fork will eventually meet the stalled one, hence replicating all of the intervening DNA. However, if adjacent right-moving and left-moving forks stall and no additional ROs are available between them, the DNA in between the two forks will remain unreplicated (Fig. 1B). This phenomenon constitutes a major replication error for the cell, which is commonly called a double-fork stall (DFS) (Fig. 1). The occurrence of DFSs is therefore a key obstacle for cells to either avoid or overcome to maintain replication fidelity. The molecular processes underlying the management of DFSs are an active field of study, and insults to these processes have been associated with different pathologies (11–13).

Fig. 1.

Potential outcomes arising from ROs licensed on a DNA segment. DNA is denoted as a single black line. Before S-phase entry, four origins (denoted by I, II, III, and IV) are licensed by binding a double hexamer of MCM2-7 proteins (blue). As an origin fires, both MCM2-7 single hexamers are converted into an active Cdc45, Mcm2-7, and GINS complex helicase (pink). (A) RO II is dormant and passively replicated by the fork coming from RO I; replication is complete. (B) Red crosses depict the fork-stalling. Previously dormant RO II is fired to complete the replication of DNA between stalled forks. However, as there is no RO licensed between RO III and IV, the DNA between two stalled forks in this part remains unreplicated, and complete replication is compromised. Adapted from ref. 14.

In our previous work, we introduced a simple probabilistic theory to determine the probability of replication failure arising from DFSs for a given set of ROs in a genome (14). The theory depends on two key assumptions: that the cell has no time constraint in completing the process (i.e., that all licensed ROs are allowed to be activated as necessary) and that there is a constant small probability per nucleotide for each individual replication fork to irreversibly stall. Mathematical analysis of the theory showed that, in organisms with a genome length comparable to yeasts (∼10 Mbp), evenly distributing the ROs throughout the genome optimally reduces the replication errors due to irreversible fork-stalling to levels observed in experiments. In accordance with the theoretical prediction, a strong bias toward evenly distributed ROs was observed in biological data derived from different yeast species (14). The theory relies on a single unknown parameter, the median stall distance (denoted by Ns), which describes the typical stalling distance of RFs in eukaryotes (14). Our theory was used to obtain an estimate of Ns from the probability of DFS and the RO distribution. The value obtained (12 Mbp) is remarkably close to direct experimental measurements.
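The two assumptions above can be illustrated with a small Monte Carlo sketch for a single replicon. This is not the authors' code. It assumes that a constant per-nucleotide stall probability can be approximated by exponentially distributed stalling distances with rate log(2)/Ns, and that a DFS occurs when the two converging forks together cover less than the replicon length. The function names and the value Ns = 10 Mbp are illustrative choices.

```python
import math
import random


def simulate_dfs_probability(replicon_bp, ns_bp, trials=100_000, seed=0):
    """Estimate the chance that two converging forks both stall before meeting.

    Assumption: stalling distances are exponential with rate log(2)/ns_bp,
    so that half of all forks stall before travelling ns_bp bases.
    """
    rng = random.Random(seed)
    rate = math.log(2) / ns_bp
    failures = 0
    for _ in range(trials):
        from_left = rng.expovariate(rate)   # bases copied by the right-moving fork
        from_right = rng.expovariate(rate)  # bases copied by the left-moving fork
        if from_left + from_right < replicon_bp:
            failures += 1                   # an unreplicated gap remains: a DFS
    return failures / trials


def analytic_dfs_probability(replicon_bp, ns_bp):
    """Closed form for the same model: 1 - (1 + x) * exp(-x), x = log(2) * N / Ns."""
    x = math.log(2) * replicon_bp / ns_bp
    return 1.0 - (1.0 + x) * math.exp(-x)


if __name__ == "__main__":
    ns = 10_000_000  # illustrative median stalling distance, in bp
    for n in (1_000_000, 5_000_000, 10_000_000):
        print(n, simulate_dfs_probability(n, ns), analytic_dfs_probability(n, ns))
```

Under these assumptions the simulated frequency should approach the closed form as the number of trials grows. The closed form is the single-replicon factor that appears inside the logarithm of the central equation.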

In this article, we extend our theory to study much larger genomes (100 Mbp to 10 Gbp), which are typically found in metazoa and plants. Our theory requires as input the positions of ROs along the genome and yields a number of clear predictions concerning the rates of DFSs, using both mathematical and computational approaches. These predictions were tested on available datasets describing RO distribution in one plant (Arabidopsis) (15), one invertebrate (Drosophila) (16), and two independent human datasets (reporting different human cell lines) (17, 18) (Table S1). Note that the two human datasets have been derived using different approaches to RO detection, and hence the number and positions of ROs vary between them. The two datasets are largely compatible, with a reported 70% overlap in the genomic sites containing ROs (18) (see also Fig. S1), and can therefore be used to test the robustness of our theory to experimental and biological variation.

Table S1.

General information from the various datasets for the distribution of ROs

| Data | Yeasts: S. cerevisiae | Yeasts: S. pombe | A. thaliana | D. melanogaster | Human*: IMR90 (B) | Human*: HeLa (B) | Human*: hESC (B) | Human*: iPSC (B) | Human*: IMR90 (P) | Human*: HeLa (P) | Human*: K562 (P) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No. ROs | 1,626 | 1,476 | 3,076 | 12,356 | 513,934 | 467,044 | 416,994 | 493,686 | 179,676 | 180,054 | 118,346 |
| Ng, Mbp | 22.45 | 24.62 | 229.94 | 234.18 | 5,665.42 | 5,670.47 | 5,664.43 | 5,661.91 | 5,640.31 | 5,645.12 | 5,618.98 |
| Nl ± SD, Kbp | 14 ± 10 | 17 ± 13 | 75 ± 73 | 19 ± 16 | 11 ± 36 | 12 ± 43 | 14 ± 43 | 11 ± 41 | 31 ± 44 | 31 ± 47 | 47 ± 72 |
| LR, Kbp | 60.70 | 60.96 | 773.20 | 151.39 | 3,599.33 | 4,292.70 | 3,707.84 | 3,707.94 | 5,655.11 | 5,733.72 | 5,940.84 |
| P(DFS) | 0.0011 | 0.0016 | 0.077 | 0.018 | 0.813 | 0.884 | 0.860 | 0.875 | 0.702 | 0.741 | 0.873 |
| R | 0.72 | 0.77 | 0.98 | 0.82 | 3.23 | 3.52 | 3.15 | 3.57 | 1.39 | 1.51 | 1.53 |

LR, largest replicon.

*Besnard et al. data (B) and Picard et al. data (P) were derived using different resolutions in detecting ROs; hence the numbers of total ROs are different, but both datasets are strongly compatible with each other with regard to the genomic locations of ROs (Fig. S1).

Fig. S1.

Inter-RO distances from Besnard et al. (B) and Picard et al. (P) datasets are plotted. Due to the difference in resolution of detection, the minimum inter-RO distance in Picard et al. data is 4001 bp, and, in Besnard et al. data, it is 240 bp. The overlapping bar charts show that the two datasets are compatible. More detail of the compatibility of the two datasets is discussed in ref. 18.

Our theoretical and computational analysis leads to a series of direct predictions, which are all found to be consistent with all datasets analyzed, revealing a picture of great simplicity. The robustness of DNA replication in eukaryotes can be maintained so long as the largest replicon (inter-RO distance) is well below the median stall distance Ns. For organisms with larger genomes, such as typical vertebrates and plants, DFSs are highly likely, even if the mean replicon length is small. These organisms therefore require mechanisms to deal with DFSs, and, in related experimental work, we provide experimental evidence for one such postreplicative mechanism (19). For cells with such repair mechanisms, the burden of equally spacing ROs is lifted; far more important is the distribution of larger replicons (relative to Ns) from which DFS events are most likely to arise. Our theory also indicates that the number of DFSs becomes unwieldy for genomes significantly greater than 10 Gb, and this additional challenge may play a central role in limiting the genome size of higher eukaryotes.

Results

The “Central Equation” for Determining Replication Errors.

In our previous work (14), we derived a mathematical equation for the genome-wide probability of DFSs, based on the distribution of ROs and Ns. The published equations depend on the largest replicon being significantly smaller than Ns. Although this limitation holds true for yeast genomes, it does not apply to ROs that have been mapped in mammalian cells. As described in Materials and Methods, we use the same theoretical framework to derive more general equations that are applicable to genomes containing arbitrarily large replicons. To use our theoretical results, we require detailed information on the location of ROs. A number of datasets have been published that provide the locations of ROs in eukaryotes, along with the total genome length (denoted by Ng in the following). In this work, we have used origin-mapping data from Saccharomyces cerevisiae (20), Schizosaccharomyces pombe (21), Arabidopsis thaliana (15), Drosophila melanogaster (16), and five human tissue culture cell lines (IMR-90, HeLa, hESC, iPSC, and K562) from Besnard et al. (17) (denoted by “B” in the following) and Picard et al. (18) (denoted by “P”). Because the work of Picard et al. used more modern techniques (particularly in peak identification), it might be considered a more reliable dataset; comparison with the Besnard et al. dataset is useful in assessing the experimental uncertainties in some of the data.

Because of the very low probability of a DFS in any given replicon, we can show that the statistics of DFSs are Poisson to a very high level of accuracy (SI Text), and that the probability of no DFSs genome-wide has the form exp(−λ). Thus, a great deal of information concerning the probabilities of DFSs for a given genome can be obtained from the single parameter λ. We remind the reader that, for a Poisson distribution, λ is both the mean and the variance of the distribution. For a given genome with K ROs, we denote the replicons by the K − 1 values Ni (with i = 1, …, K − 1). These data can then be used in the “central equation” arising from our theory (Eq. 1),

$$\lambda = \log(2)\,\frac{N_g}{N_s} - \sum_{i=1}^{K} \log\!\left[1 + \log(2)\,\frac{N_i}{N_s}\right]. \tag{1}$$

This expression for λ contains a single unknown parameter Ns—i.e., the number of replicated bases along the DNA beyond which 50% of RFs irreversibly stall. The median stalling distance Ns is inversely proportional to the very small probability of stalling per nucleotide (14).

On the right-hand side of Eq. 1, we can identify the two distinct contributions of the genome length (first term) and of the RO distribution (second term). Genome length determines a baseline probability of DFSs that can be lowered by increasing the number of ROs and/or changing their distribution along the genome; indeed, as we have shown previously (14), for a given number of ROs, equally distributing them across the genome is the optimal arrangement to minimize the probability of DFSs. Therefore, the different terms on the right-hand side of Eq. 1 establish a hierarchy of contributions to the probability of DFSs, with genome length being the most important factor, followed by RO number and then RO distribution (Fig. 2).
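The two contributions in Eq. 1 can be evaluated directly from a list of replicon lengths. The following is a minimal sketch, not the authors' implementation. It assumes that the genome length Ng equals the sum of the supplied replicon lengths (chromosome ends are ignored), and the function names and toy inputs are hypothetical.

```python
import math


def dfs_lambda(replicons_bp, ns_bp):
    """Poisson parameter of Eq. 1 for a list of replicon lengths (in bp).

    Assumption: the genome length is taken as the sum of the replicons.
    """
    scale = math.log(2) / ns_bp
    genome_term = scale * sum(replicons_bp)                         # first term of Eq. 1
    origin_term = sum(math.log1p(scale * n) for n in replicons_bp)  # second term of Eq. 1
    return genome_term - origin_term


def prob_at_least_one_dfs(replicons_bp, ns_bp):
    """P(one or more DFSs) = 1 - exp(-lambda)."""
    return -math.expm1(-dfs_lambda(replicons_bp, ns_bp))


if __name__ == "__main__":
    ns = 10_000_000                          # illustrative Ns, in bp
    even = [20_000] * 1200                   # toy genome, evenly spaced origins
    uneven = [10_000] * 600 + [30_000] * 600 # same genome length and origin count
    print(prob_at_least_one_dfs(even, ns), prob_at_least_one_dfs(uneven, ns))
```

With the genome length and the number of replicons held fixed, the evenly spaced toy genome gives the smaller λ, in line with the hierarchy described above.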

Fig. 2.

Schematic of the central equation. The genome length is the dominant contributor to the overall replication error due to fork-stalling, followed by the number of licensed ROs and, lastly, by their distribution.

In organisms with relatively small genomes, such as yeasts (∼10 Mbp), an average density of 1 RO per ∼20 Kbp allows the maintenance of very small probabilities of genome-wide DFSs. Application of Eq. 12 to the yeast datasets gives values around 10⁻³ for the probability of one or more DFSs, consistent with our previous analysis. With the increase in genome size from around 10 Mbp (in yeasts) to around 10 Gbp (in human), Eq. 12 shows that the probability of DFSs increases by approximately two orders of magnitude, to more than 0.5 for human genomes (Fig. 3A). This huge increase in error rate occurs despite essentially no shift in the mean replicon size (Fig. 3B). Therefore, it is absolutely necessary for these organisms to have molecular machinery able to repair DFSs.
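The effect of genome length at a fixed origin density can be sketched for the idealized case of perfectly even spacing, where every replicon contributes the same amount to λ. The sketch below is an illustration under that assumption only. The spacing of 20 Kbp is taken from the text, Ns = 10 Mbp is an illustrative value, and the genome lengths are round numbers rather than values from the datasets.

```python
import math


def lambda_evenly_spaced(genome_bp, spacing_bp, ns_bp):
    """Eq. 1 when all replicons have the same length (idealized assumption)."""
    x = math.log(2) * spacing_bp / ns_bp
    n_replicons = genome_bp / spacing_bp
    return n_replicons * (x - math.log1p(x))


if __name__ == "__main__":
    ns = 10e6  # illustrative Ns, in bp
    for genome_bp in (24e6, 250e6, 6e9):
        lam = lambda_evenly_spaced(genome_bp, 20e3, ns)
        print(f"{genome_bp:.0e} bp: lambda={lam:.4g}, P(DFS)={-math.expm1(-lam):.4g}")
```

At fixed spacing, λ grows in direct proportion to genome length. Real genomes with irregular spacing will give larger values than this idealized lower bound.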

Fig. 3.

(A) Predicted probability of one or more DFSs for various eukaryotic genomes using the central equation from the model. (B) Measured mean replicon length across the same genomes from the corresponding experimental datasets. (C) Computed R values from the same eukaryotic datasets; note that the dashed bars represent simulated R values for virtual genomes of the same length and RO density but assuming ROs to be randomly distributed. (D) The probability of a DFS, denoted P(DFS), is plotted as a function of increasing replicon length. The estimated median fork-stalling distance, Ns (10 Mbp), is highlighted on the x axis. P(DFS) starts to increase sharply as soon as the replicon size reaches approximately half the value of Ns; note that the x axis has a log scale. (E) The calculated probability of a DFS inside replicons plotted against normalized chromosomal lengths for the largest chromosomes in budding yeast, Drosophila, Arabidopsis, and the IMR90 cell line from two human datasets (B and P).

The Bias Toward Uniformly Spaced Replication Origins Is Progressively Lost in Larger Genomes.

The regularity of the RO distribution can be assessed by computing the coefficient of variation of the replicon lengths, denoted by R, defined as the ratio of their SD to their mean. For a perfectly uniform distribution of equally spaced ROs, R is equal to 0. On the other hand, computational analysis indicates that, when ROs are randomly distributed on the genome, the value of R is very close to 1 (14).
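As a minimal sketch of this measure, R can be computed from a list of origin positions on a single chromosome. The function name is hypothetical, and the use of the population SD (rather than the sample SD) is an assumption, since the text does not specify which is used.

```python
import random
import statistics


def regularity_index(origin_positions):
    """Coefficient of variation (SD / mean) of the inter-origin distances."""
    positions = sorted(origin_positions)
    replicons = [b - a for a, b in zip(positions, positions[1:])]
    return statistics.pstdev(replicons) / statistics.fmean(replicons)


if __name__ == "__main__":
    evenly_spaced = [i * 20_000 for i in range(1_000)]
    rng = random.Random(0)
    randomly_placed = [rng.uniform(0, 20_000_000) for _ in range(1_000)]
    print(regularity_index(evenly_spaced))    # perfectly regular spacing
    print(regularity_index(randomly_placed))  # random placement, expected near 1
```

Evenly spaced positions give R = 0, while uniformly random positions give gaps that are approximately exponentially distributed, for which the SD is close to the mean.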

In the yeast genomes (diploid genome sizes ∼20 Mbp), we previously showed that their RO distributions were strongly biased toward uniform spacing with values of R ranging from 0.72 to 0.77 (Fig. 3C). The probability of DFSs is very small in yeasts due to the small genome size, and optimization of the RO positions by lowering R reduces this even further. However, as discussed above, organisms with larger genomes have a significantly higher probability of DFS events, which results in the need for additional molecular mechanisms to cope with the consequences (19), and the presence of such mechanisms means there is little to be gained in uniformly ordering ROs on the genomes. Thus, our expectation is that R should be significantly larger in organisms with larger genomes compared with the values found in yeast. Statistical analysis of the available data confirms this expectation (Fig. 3C). Arabidopsis and Drosophila (diploid genome sizes ∼250 Mbp) have values of R around unity (i.e., approximating a random distribution). Particularly striking is the fact that, in human genomes (∼6,000 Mbp), the values of R are significantly larger than unity, indicating that ROs are not spaced purely randomly and that both the number and size of large replicons are significantly greater than expected by chance. This unexpected distribution has important consequences that will be discussed in Large Replicons in Human Genomes.

The probability of a DFS in a given replicon increases with the replicon length according to Eq. 6 (Materials and Methods) and is plotted in Fig. 3D. The probability has a strongly nonlinear form: increasing as the square of the replicon length for lengths much less than the stalling distance, and saturating to unity for lengths significantly greater than the stalling distance. Fig. 3E provides a graphical representation that highlights the dramatic shift in variation of replicon lengths, or, equivalently, the per replicon rate of DFS, by plotting the predicted probability of DFSs across the largest chromosome of different organisms. It is apparent that the variation in probability of error increases by approximately one order of magnitude from yeast to Drosophila, and then again by approximately one order of magnitude from Drosophila to human.
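Eq. 6 is not reproduced in this section, but the two limiting behaviors can be sketched from the single-replicon factor implied by Eq. 1, taking each term of the sum as the logarithm of the probability of no DFS in that replicon. This reading is an inference from Eq. 1 rather than a quotation of Eq. 6.

```latex
\begin{align*}
P_{\mathrm{DFS}}(N) &= 1 - (1 + x)\,e^{-x}, \qquad x = \log(2)\,\frac{N}{N_s}, \\
P_{\mathrm{DFS}}(N) &\approx \frac{x^{2}}{2} = \frac{\log^{2}(2)}{2}\left(\frac{N}{N_s}\right)^{2} \quad (N \ll N_s), \\
P_{\mathrm{DFS}}(N) &\to 1 \quad (N \gg N_s).
\end{align*}
```

The quadratic regime follows from the expansion $(1+x)e^{-x} = 1 - x^{2}/2 + x^{3}/3 - \dots$, in which the linear terms cancel.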

Large Replicons in Human Genomes Cause the Most Errors but Are Bounded by the Stalling Distance.

Consistent with our analysis of the values of R, we would expect the largest replicons in the genome to be very significantly different in diploid genomes of size ∼20 Mbp, ∼250 Mbp, and ∼6 Gbp (represented by yeasts, Drosophila/Arabidopsis, and human, respectively), with significantly larger replicons appearing in those genomes with R larger than unity. As seen in Fig. 4A, this is exactly what is observed, with the largest replicons being ∼60 Kbp in yeasts (∼120 Kbp expected for a random distribution), 151 Kbp in Drosophila (207 Kbp expected if random), 773 Kbp in Arabidopsis (663 Kbp expected if random), and ∼5 Mbp in human (∼300 Kbp expected if random). The tendency for larger and larger replicons can also be seen by the significant increase in outliers in the box plots of replicon lengths for the different organisms considered (Fig. 4B). As is clear from Fig. 3D, the probability of a DFS in a given replicon increases dramatically as the length of the replicon approaches Ns. To avoid almost inevitable errors arising from a single replicon, we would expect the length of the largest replicon in the entire genome to be bounded by Ns, and this is, indeed, what is observed in the data. In the B dataset, we find that the largest replicons in each human cell line are 3.59 Mbp (IMR90), 3.71 Mbp (hESC), 3.71 Mbp (iPSC), and 4.29 Mbp (HeLa), whereas, in P, we find 5.65 Mbp (IMR90), 5.73 Mbp (HeLa), and 5.94 Mbp (K562). Interestingly, the largest replicons appear to be bounded by approximately one half of the stalling distance, which means that the largest replicon in each human cell line contributes a predicted error rate of ∼5%. We note that all of the datasets used for our analysis rely on genomic sequencing data. As such, large regions of repetitive DNA will not be sequenced accurately and yet are likely to contain ROs. These false negatives imply that the largest replicons measured provide an upper bound rather than a definite value, although we do not expect large numbers of missed ROs (19). The future use of more advanced techniques, for example, single-cell sequencing, will shed more light on this aspect.
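The ∼5% figure can be checked with a short worked calculation, assuming the single-replicon probability implied by Eq. 1, $P_{\mathrm{DFS}}(N) = 1 - (1+x)e^{-x}$ with $x = \log(2)\,N/N_s$, and a replicon of exactly half the stalling distance.

```latex
\begin{align*}
x &= \log(2)\,\frac{N_s/2}{N_s} = \frac{\log 2}{2} \approx 0.347, \qquad e^{-x} = 2^{-1/2} \approx 0.707, \\
P_{\mathrm{DFS}} &= 1 - (1 + x)\,e^{-x} \approx 1 - 1.347 \times 0.707 \approx 0.048 .
\end{align*}
```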

Fig. 4.

(A) Measured lengths of the largest replicons are shown in each dataset alongside the dashed bars showing the value obtained for virtual genomes of the same length and RO density but assuming ROs to be randomly distributed. (B) The distribution of genome-wide replicon lengths plotted in boxplot format for budding yeast, Drosophila, Arabidopsis, and the IMR90 cell line from two human datasets (B and P).

In the human genome, given that errors are very likely, we can determine the range of replicon lengths that are the main contributors to the DFS. We grouped the replicons into five cohorts: very small (XS; <1 Kbp), small (S; 1 to 10 Kbp), medium (M; 10 to 100 Kbp), large (L; 100 Kbp to 1 Mbp), and very large (XL; >1 Mbp). The frequency of replicons in these five cohorts is shown in Fig. 5 A and B for IMR90 from the B and P studies. The most common range of replicons is S, followed by M, the shift from S to M being due to the coalescence of small replicons in the Picard et al. study. L and XL replicons appear only at low frequency. Despite this, Fig. 5 C and D shows that the cohort of L replicons dominates as the source of error, which is due to the fact that the DFS probability increases nonlinearly with the replicon length (Fig. 3D). The error rate due to the small number of XL replicons is significantly smaller compared with the L replicons. An important consequence of this finding is that there will be a very limited impact on genome-wide error rates from false negatives, which primarily affect the distribution of XL replicons.
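The cohort analysis can be sketched as follows: each replicon is assigned to a size class, and its contribution to λ is accumulated per class. This is an illustrative sketch only. The treatment of values that fall exactly on a boundary is an assumption, the names are hypothetical, and the toy replicon list is made up for demonstration and is not drawn from either dataset.

```python
import math

# Upper bounds (bp) of the XS, S, M, and L cohorts; anything larger is XL.
COHORT_UPPER_BOUNDS_BP = [(1e3, "XS"), (1e4, "S"), (1e5, "M"), (1e6, "L")]


def cohort_of(replicon_bp):
    """Return the size cohort; lower boundaries are inclusive (an assumption)."""
    for upper, name in COHORT_UPPER_BOUNDS_BP:
        if replicon_bp < upper:
            return name
    return "XL"


def cohort_summary(replicons_bp, ns_bp):
    """Count replicons and sum their contributions to lambda within each cohort."""
    scale = math.log(2) / ns_bp
    summary = {name: {"count": 0, "lambda": 0.0}
               for name in ("XS", "S", "M", "L", "XL")}
    for n in replicons_bp:
        x = scale * n
        entry = summary[cohort_of(n)]
        entry["count"] += 1
        entry["lambda"] += x - math.log1p(x)  # this replicon's share of Eq. 1
    return summary


if __name__ == "__main__":
    # Made-up toy genome: many short replicons, few long ones.
    toy = [5_000] * 1000 + [50_000] * 200 + [300_000] * 60 + [2_000_000]
    for name, entry in cohort_summary(toy, 10_000_000).items():
        print(name, entry["count"], f"{entry['lambda']:.3g}")
```

Because the per-replicon contribution grows roughly as the square of the length, a small number of long replicons can outweigh a large number of short ones.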

Fig. 5.

Data are from the IMR90 human datasets (A, C, E, and G) B and (B, D, F, and H) P. (A and B) Frequency of replicons in each cohort, defined according to the following size ranges: <10³ bp, XS; 10³ to 10⁴ bp, S; 10⁴ to 10⁵ bp, M; 10⁵ to 10⁶ bp, L; and >10⁶ bp, XL. (C and D) Probability of DFS in each cohort of the replicons. (E and F) Higher-resolution plot of probability of DFS at the transition from M to L gap cohorts contributing most toward the P(DFS); red bars show the bins with maximum P(DFS) in respective datasets. (G and H) Theoretical frequency distribution of replicons inferred from E and F are presented in blue; gray shows the actual frequency distribution in those bins in the data, and red highlights the red bins in E and F.

Interestingly, in both datasets, for all cell lines, a closer examination of the error rates in the vicinity of the L cohort shows a surprisingly statistically uniform distribution of error rate, which is suggestive of ROs being placed so as to “spread the risk” of error across size scales. In Fig. 5 E and F, the probability of DFS in each 10-Kbp interval in the range 10 to 300 Kbp is shown for the Besnard et al. (Fig. 5E) and Picard et al. (Fig. 5F) datasets for primary IMR90 cells. These replicons are the ones that contribute the most to the DFS probability. The maxima are relatively broad, particularly for the B dataset, for which the probability of DFS in each 10-Kbp interval is approximately constant at 0.030 to 0.035 across replicons spanning from 40 Kbp to 200 Kbp. For replicons significantly smaller than the stalling distance, one can infer, from the theory, that ROs are placed in such a way as to give a power law, with a frequency of replicons that decreases as the inverse square of the replicon length, thereby spreading the probability of a DFS equally among all size classes (described by Eq. S6). Fig. 5 G and H shows that there is a remarkable concordance between the theoretical frequency distribution (in blue) and the frequency distribution in the data for the IMR90 cell line in both datasets (in red). There is also excellent agreement with the theoretical distribution in all of the other cell lines in both datasets (Figs. S2–S4). These results can be interpreted in terms of “spreading the damage” as widely as possible in the replicon size region of maximal DFS errors, as a power law is the most effective way to delocalize errors from any single cohort of replicon lengths.
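The link between an inverse-square frequency of replicon lengths and a flat error profile can be sketched in one line. Eq. S6 is not reproduced in this section, so the following is an illustrative argument under the small-replicon approximation $P_{\mathrm{DFS}}(N) \approx \tfrac{1}{2}\log^{2}(2)\,(N/N_s)^{2}$ for $N \ll N_s$, with $f(N)$ denoting the number of replicons per unit length interval and $A$ an arbitrary normalization constant.

```latex
\begin{align*}
f(N) = \frac{A}{N^{2}}
\quad\Longrightarrow\quad
f(N)\,P_{\mathrm{DFS}}(N) \approx \frac{A}{N^{2}} \cdot \frac{\log^{2}(2)}{2}\,\frac{N^{2}}{N_s^{2}}
= \frac{A\,\log^{2}(2)}{2\,N_s^{2}} .
\end{align*}
```

The product is independent of $N$, so every length interval of equal width (for example, each 10-Kbp bin) contributes the same expected number of DFSs.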

Fig. S2.

Data are from HeLa human datasets (A, C, E, and G) B and (B, D, F, and H) P. (A and B) Frequency of replicons in each cohort, defined according to the following size ranges: <10³ bp, XS; 10³ to 10⁴ bp, S; 10⁴ to 10⁵ bp, M; 10⁵ to 10⁶ bp, L; and >10⁶ bp, XL. (C and D) Probability of DFS in each cohort of the replicons. (E and F) Higher-resolution plot of probability of DFS at the transition from M to L gap cohorts contributing most toward the P(DFS); red bars show the bins with maximum P(DFS) in respective datasets. (G and H) Theoretical frequency distribution of replicons inferred from E and F are presented in blue; gray shows the actual frequency distribution in those bins in the data, and red highlights the red bins in E and F.

Fig. S4.

Data are from iPSC human dataset in B. (A) Frequency of replicons in each cohort, defined according to the following size ranges: <10³ bp, XS; 10³ to 10⁴ bp, S; 10⁴ to 10⁵ bp, M; 10⁵ to 10⁶ bp, L; and >10⁶ bp, XL. (B) Probability of DFS in each cohort of the replicons. (C) Higher-resolution plot of probability of DFS at the transition from M to L gap cohorts contributing most toward the P(DFS); red bars show the bins with maximum P(DFS) in respective datasets. (D) Theoretical frequency distribution of replicons inferred from C are presented in blue; gray shows the actual frequency distribution in those bins in the data, and red highlights the red bins in C.

Fig. S3.

Data are from hESC and K562 in human datasets (A, C, E, and G) B and (B, D, F, and H) P. (A and B) Frequency of replicons in each cohort, defined according to the following size ranges: <10³ bp, XS; 10³ to 10⁴ bp, S; 10⁴ to 10⁵ bp, M; 10⁵ to 10⁶ bp, L; and >10⁶ bp, XL. (C and D) Probability of DFS in each cohort of the replicons. (E and F) Higher-resolution plot of probability of DFS at the transition from M to L gap cohorts contributing most toward the P(DFS); red bars show the bins with maximum P(DFS) in respective datasets. (G and H) Theoretical frequency distribution of replicons inferred from E and F are presented in blue; gray shows the actual frequency distribution in those bins in the data, and red highlights the red bins in E and F.

Replication Errors Are Common but Low in Number for Higher Eukaryotes.

As discussed in The “Central Equation,” our theory predicts that the distribution of the number of DFSs in a given genome is Poisson-distributed to a very high degree of accuracy. We have applied our theory to the human cell lines datasets to test this prediction. As shown in Fig. 6, for all cell lines, from both laboratories, the distribution of DFSs is indeed Poisson-distributed, regardless of being primary or tumoral cell lines. Statistical analysis confirms that the computationally derived probability distribution of DFSs is statistically indistinguishable from the fitted Poisson distribution. Interestingly, we find a very low probability (<10%) of encountering more than three DFSs in the replication of the entire diploid human DNA per cell cycle. Therefore, despite the high probability of the presence of DFSs (∼80%), in ∼90% of cells undergoing DNA replication, the expected number of DFSs is predicted to be three or less, with one or two errors being the most likely occurrences. Indeed, we find that the parameter λ (i.e., the mean number of errors) that characterizes the distribution of DFSs ranges from 1.67 to 2.15 in Besnard et al. (17) and from 1.21 to 2.05 in Picard et al. (18).
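The relation between the probability of at least one DFS, the Poisson parameter λ, and the chance of exceeding a given number of DFSs can be sketched with the standard library alone. The function names are hypothetical. The input value 0.813 is the P(DFS) entry for IMR90 (B) in Table S1, and 1.67 is the lower end of the λ range quoted above for the Besnard et al. data.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def prob_more_than(k_max, lam):
    """Probability of strictly more than k_max events."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(k_max + 1))


def lambda_from_p_dfs(p_dfs):
    """Invert P(one or more DFSs) = 1 - exp(-lambda)."""
    return -math.log1p(-p_dfs)


if __name__ == "__main__":
    lam = lambda_from_p_dfs(0.813)  # Table S1, IMR90 (B)
    print(f"lambda = {lam:.3f}")
    print(f"P(more than 3 DFSs) at lambda = 1.67: {prob_more_than(3, 1.67):.3f}")
```

The tail probability rises with λ, so it is worth evaluating it across the full range of λ values reported for the different cell lines rather than at a single value.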

Fig. 6.

Theoretical prediction for the distribution of the number of DFSs based on the RO positions in each human cell-line dataset (using data from both B and P); also shown, as lines and dots, are best fits to a Poisson distribution.

Given that DFSs in human cell lines are almost inevitable, it is somewhat surprising to find that their number is quite sharply constrained to be essentially one, two, or three. This might indicate that the mechanism that deals with such errors has a very low capacity. If, as suggested in Moreno et al. (19), the defects induced by DFSs can be resolved in the following cell cycle by segregating unreplicated DNA to daughter cells, DNA strand breaks could be generated at each DFS. Because the number of illegitimate ways in which double-strand breaks could be rejoined increases as the factorial of the number of breaks, this might constrain the number of tolerated DFSs to about three or less. We provide a rationale for putative biological mechanisms in Discussion, and our arguments lead us to consider two different biomarkers for double-strand breaks that would arise from DFS errors; these are the presence of 53BP1 nuclear bodies in the G1 phase of the subsequent cell cycle and the presence of ultrafine anaphase bridges (UFBs) during mitosis. Our theory suggests that the numbers of both 53BP1 nuclear bodies and UFBs follow a Poisson distribution with a value of λ between 1 and 2.

We performed an experimental analysis of 53BP1 in IMR90 cells and both 53BP1 and UFBs in U2-OS cells, and we measured the frequency of their occurrence during the cell cycle at a single-cell level (19). In agreement with our predictions, the experimental distributions of both 53BP1 nuclear bodies and UFBs fit a Poisson distribution (Fig. 7 A–C). Statistical analyses indicate that both a naïve fitting using the mean of the data and a more advanced approach that accounts for potential errors introduced by the experimental procedure of the immunofluorescence experiments (Fig. 7 A–C) produce distributions that are not statistically different from Poisson distributions for both 53BP1 nuclear bodies (P values between 0.61 and 1 for both IMR90 and U2-OS cells) and UFBs (P values between 0.53 and 1 for U2-OS cells). Additionally, the fitted λ values, 0.52 (naïve) and 0.54 (filtered) in IMR90 and 1.64 (naïve) and 1.89 (filtered) in U2-OS cells for 53BP1 nuclear bodies, and 1.27 (naïve) and 1.19 (filtered) for UFBs, are in line with the expectation of a limited number of DFSs. Moreno et al. (19) show that the number of 53BP1 nuclear bodies and UFBs follows a Poisson distribution in the HeLa cell line with λ values of 0.94 (naïve) and 1.12 (filtered) for 53BP1 and 1.43 (naïve) and 1.19 (filtered) for UFBs. Taken together, these results provide good agreement of our theory with the available data and reinforce the connection between 53BP1 nuclear bodies and UFBs to DFSs. The analysis of UFBs in unperturbed IMR90 cells was not possible due to experimental difficulties related to the fact that this cell line is not immortalized.
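The two fitting strategies can be sketched on per-cell counts. The naïve fit takes λ as the sample mean. For the filtered fit, the text states only that the frequencies of zero counts are ignored; the sketch below assumes one plausible reading, a zero-truncated Poisson fit, in which λ is chosen so that the mean of the nonzero counts equals λ/(1 − exp(−λ)). This is an assumption and may differ from the procedure actually used. The function names and the toy counts are hypothetical.

```python
import math
from statistics import fmean


def naive_lambda(counts):
    """Naive Poisson fit: lambda is the mean of all counts, zeros included."""
    return fmean(counts)


def filtered_lambda(counts, tol=1e-10):
    """Zero-truncated Poisson fit (assumed reading of the 'filtered' approach).

    Solves lambda / (1 - exp(-lambda)) = mean of the nonzero counts by bisection.
    """
    positives = [c for c in counts if c > 0]
    if not positives:
        return 0.0
    target = fmean(positives)
    if target <= 1.0:
        return 0.0                    # all nonzero counts equal 1: lambda tends to 0
    lo, hi = 1e-12, target            # the truncated mean always exceeds lambda
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid / -math.expm1(-mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Made-up counts of foci per cell, for illustration only.
    toy_counts = [0] * 30 + [1] * 35 + [2] * 20 + [3] * 10 + [4] * 5
    print(naive_lambda(toy_counts), filtered_lambda(toy_counts))
```

When zero counts are recorded without bias the two estimates should be close. A systematic difference between them would suggest that cells with no detectable foci are over- or under-represented.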

Fig. 7.

(A) Experimental distribution of three different replicates of 53BP1 nuclear bodies in the IMR90 cell line fitted with a naïve Poisson (i.e., taking the mean of the data as λ) (gray) and a filtered Poisson (i.e., ignoring the frequencies of zero counts to account for potential error from immunofluorescence staining) (light gray). The single fitting with the average of the three replicates (not statistically different) is shown. (B) Experimental distribution of 53BP1 nuclear bodies in the U2-OS cell line fitted with a naïve Poisson (i.e., taking the mean of the data as λ) (gray) and a filtered Poisson (i.e., ignoring the frequencies of zero counts to account for potential error from immunofluorescence staining) (light gray). (C) Experimental distribution of UFBs in the U2-OS cell line fitted with a naïve Poisson (gray) and a filtered Poisson (light gray). (D) Values of the Poisson parameter λ obtained from experimental fits of 53BP1 nuclear bodies in IMR90, U2-OS, and HeLa, and UFBs in U2-OS and HeLa, are compared with theoretical values obtained from different cell lines in Fig. 6.

As a more quantitative analysis, we compared the λ values obtained by direct calculation from the RO distribution of different human cell lines and the experimental λ values estimated from the distribution of 53BP1 and UFBs. Note that comprehensive RO distribution data are not available for the cell line used for the UFB experiments (U2-OS), and diversity has been observed in RO distribution across different cell lines (17). Moreover, both 53BP1 and UFBs are likely to provide only an approximation of the number of DFSs, as they appear also in the presence of non-DFS-associated double-strand breaks. Despite these limitations, a comparison of the λ values indicates that experimental measures are in excellent agreement with theoretical prediction (Fig. 7D). Additional comparisons with the λ values obtained from HeLa reinforce our conclusions (Fig. 7D). Interestingly, the range of variation observed in the experimental value of λ is matched by the range of variation of our model predictions, suggesting that our methodology is correctly capturing experimental variations.

In both IMR90 and HeLa cells, the experimentally derived λ obtained from 53BP1 nuclear body data is approximately half of the theoretical estimate obtained from the RO mapping data. This is also true for UFBs in HeLa cells. So long as the density of ROs is small, it is straightforward to show that doubling the density of ROs halves the value of λ. Hence the factor of 2 difference between the experimental and theoretical values of λ could indicate that around half of the genomic ROs are missing in the current datasets (e.g., due to difficulties in detecting ROs that fire very rarely or ROs positioned in repetitive regions of the DNA). This line of reasoning is also consistent with a potential issue with the largest measured replicon being ∼4 Mbp, the issue being that the replication time for such a gap would be significantly longer than a typical S phase (ca. 8 h) (22). If the true RO density is twice that measured, one can show that the largest gap would be halved, giving a value of 2 Mbp, which is in line with the estimate of 2 Mbp for the longest stretch of DNA that could be replicated in the duration of S phase [assuming a fork speed of ∼2 kbp per minute (23), and remembering that a large replicon will be replicated almost symmetrically by forks traveling from either end].
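The two pieces of arithmetic in this paragraph can be checked with a short Python sketch. It assumes that doubling the RO density is modeled by placing one extra origin at the midpoint of every gap, and it uses the leading-order form λ ≈ Σ (Nk q)²/2, which follows from expanding the logarithm in Eq. 12 when every Nk q is small. The replicon lengths are hypothetical values chosen only for illustration.

```python
import math

FORK_SPEED_BP_PER_MIN = 2_000   # ~2 kbp per minute, as quoted in the text
S_PHASE_MIN = 8 * 60            # ~8 h S phase, as quoted in the text
NS_BP = 10e6                    # stalling distance of ~10 Mbp, as quoted in the article


def longest_replicable_gap(fork_speed, duration_min):
    """Longest gap (bp) that two forks converging from either end can complete in time."""
    return 2 * fork_speed * duration_min


def leading_order_lambda(replicons, ns=NS_BP):
    """Leading-order expected number of DFSs, valid when every N_k * q << 1."""
    q = math.log(2) / ns
    return sum((n * q) ** 2 for n in replicons) / 2


def double_origin_density(replicons):
    """Assumed model of a doubled RO density: one extra origin at each gap midpoint."""
    return [n / 2 for n in replicons for _ in range(2)]


if __name__ == "__main__":
    gaps = [40_000, 120_000, 300_000, 1_000_000]  # hypothetical replicon lengths (bp)
    print("longest replicable gap (bp):",
          longest_replicable_gap(FORK_SPEED_BP_PER_MIN, S_PHASE_MIN))
    ratio = leading_order_lambda(double_origin_density(gaps)) / leading_order_lambda(gaps)
    print("lambda ratio after doubling RO density:", ratio)
```

Two forks moving at ∼2 kbp per minute for 8 h cover 1.92 Mbp, which is the ∼2 Mbp quoted above, and splitting every gap in two halves the leading-order λ.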

Effect of Variation of the Stalling Distance.

In applying our theory to the RO position data for various human cell lines, we can vary the numerical value of Ns and measure the effect on the expected number of DFSs. This allows us to gauge the extent to which our conclusions are robust to the variation of the only parameter in our analysis for which we do not have strong experimental data. Both theoretical and biological estimates indicate that Ns is ∼10 Mbp (14, 24). However, a precise estimate of this value is difficult to determine in vivo. The stalling distance is inversely proportional to the very small probability of an irreversible stalling event per nucleotide replicated, which, because of the conservation of the basic replication machinery, is likely to be relatively well conserved across eukaryotes.

First, we analyzed the overall probability of DFSs occurring as Ns is varied. In all of the human cell lines considered, we observe a characteristic transition around 5 Mbp: Below this value, the probability of observing DFSs saturates at 1 (Fig. 8A). Therefore, DFSs are inevitable for smaller values of Ns, as one might expect. Importantly, our analysis indicates diminishing returns when Ns is increased to much larger values: Even for Ns around 30 Mbp, error rates are sufficiently high (one in five cells would experience a DFS during S phase) that additional DFS repair mechanisms are still required. Therefore, in higher eukaryotes with large genomes, the pressure to maintain genome stability is most easily resolved by additional safeguard mechanisms to deal with consequences of DFSs, rather than by stabilizing the replication machinery to give such a large Ns that DFSs can be avoided with the regular RO distribution found in eukaryotes with smaller genomes.

Fig. 8.

(A) Based on the RO distributions in the various human datasets, theoretical predictions of the percentage of cells with DFSs are plotted as a function of the parameter Ns; the percentage is essentially 100% when Ns < 5 Mbp, and this percentage is still nontrivially high even when Ns > 20 Mbp. (B and C) Theoretical predictions of the separate probabilities of exactly one, exactly two, and exactly three DFSs are shown as a function of Ns for the Besnard et al. (B) and Picard et al. (C) data. (D and E) Theoretical predictions of the combined probability of one, two, or three DFSs are shown as a function of Ns for the Besnard et al. (D) and Picard et al. (E) data. (F and G) Expected numbers of DFSs in different cell lines are plotted against Ns for the Besnard et al. (F) and Picard et al. (G) data; in black, blue, and red are the experimentally obtained expected numbers of 53BP1 nuclear bodies in IMR90, U2-OS, and HeLa, and UFBs in U2-OS and HeLa cell lines, respectively. Crossing points of the black, blue, and red lines over the curves provide an independent estimate for the plausible range of Ns (vertical lines) by directly comparing experimental data with theoretical predictions.

Our analysis stresses the inevitability of DFS errors during replication of the human genome and calls for a shift in our approach with respect to how the problem has been viewed in the past. On varying the median stalling distance in human cells, the probability of exactly one DFS genome-wide reaches a maximum between 10 and 15 Mbp, depending on the particular cell line and dataset used (Fig. 8 B and C). Furthermore, on varying the stalling distance, we find that the probabilities of exactly two or exactly three DFSs occurring also have peaks in the range 6 to 10 Mbp, again depending on the cell line and the dataset used (Fig. 8 B and C). To probe the likelihood of a small number of errors occurring, we plotted the probability of observing one, two, or three DFSs as stalling distance was varied (Fig. 8 D and E). These results show a very pronounced maximum for Ns around 10 Mbp in the B dataset, and around 8 Mbp in the P dataset. In summary, our analysis of the available RO distribution in a variety of human cell lines and in different datasets indicates that the number of DFSs is constrained between zero and three only for Ns in the vicinity of 10 Mbp.

Finally, we can measure the average number of DFSs when Ns is varied. This number is equal to the λ parameter of a Poisson distribution and therefore allows a direct comparison with our experimental measures. As expected, the average number of DFSs decreases from a large value as Ns is increased (Fig. 8 F and G). As explained in Replication Errors Are Common, fitting the Poisson distribution to 53BP1 and UFB experimental data gives values of λ between 0.54 and 1.89 (the values are shown in Fig. 8 F and G as black, blue, and red lines). The intersection of the decaying curve with these two lines provides another independent estimate of the stalling distance, which we find to be between 8 and 16 Mbp, depending on the cell line and dataset used. Our analysis of the statistics of DFSs in human cell data on varying the stalling distance therefore provides very strong evidence for the robustness of this parameter, with a value in the range 8 to 15 Mbp, consistent with previous estimates from our analysis of yeast RO distributions, and with direct experimental estimates (14, 24).
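The estimate of Ns obtained from the crossing points can be sketched as a root-finding problem: given a list of replicon lengths and an experimental λ, find the Ns at which the theoretical λ of Eq. 12 matches it. The Python sketch below uses bisection, which is an assumption of this illustration (the original analysis does not state a numerical method), and the replicon lengths are hypothetical toy values, not RO mapping data.

```python
import math


def expected_dfs(replicons, ns):
    """Expected number of DFSs (lambda) from Eq. 12, written term by term."""
    q = math.log(2) / ns
    return sum(n * q - math.log1p(n * q) for n in replicons)


def stalling_distance_for(replicons, target_lambda, lo=1e5, hi=1e9, tol=1.0):
    """Bisect for the N_s (bp) at which expected_dfs equals target_lambda."""
    # lambda decreases monotonically as N_s grows, so the root must be bracketed.
    if not expected_dfs(replicons, lo) > target_lambda > expected_dfs(replicons, hi):
        raise ValueError("target_lambda is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_dfs(replicons, mid) > target_lambda:
            lo = mid   # lambda still too large: N_s must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical replicon lengths (bp), for illustration only.
    toy = [50_000] * 1000 + [500_000] * 200 + [2_000_000] * 20
    ns = stalling_distance_for(toy, target_lambda=1.0)
    print(f"N_s matching lambda = 1.0: {ns / 1e6:.2f} Mbp")
```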

Effect of Varying the Number of Licensed ROs.

Interestingly, among the cell types we analyzed, there was no major difference in the mean replicon length (Fig. 3B). Fig. 9 shows how decreasing mean replicon length would reduce the probability of DFSs in a generic organism. The black, light blue, and blue lines illustrate the mean replicon length to achieve a fixed probability of DFSs under the optimal situation of equally spaced ROs. All of the datasets analyzed in the article have a mean replicon length ranging between 10 and 100 Kbp (shaded pink in Fig. 9). Because of the relatively small genome sizes of yeasts, so long as ROs are evenly spaced, this mean replicon length can achieve a tolerable DFS probability of ∼0.1%, similar to the chromosome missegregation rate (14). To maintain a low probability of DFSs as in yeasts, longer genomes would require a much lower mean replicon length or, in other words, much higher density of ROs on the genome. Because the MCM2-7 double hexamer that licenses an RO has a footprint of ∼60 bp (25, 26), this provides an absolute limit to the possible replicon length (dashed line in Fig. 9). It is just about possible for organisms with ∼6,000-Mbp genomes to achieve yeast-like DFS probabilities, but the genome would have to be almost completely packed with MCM2-7, which might leave the genome unable to perform its major function of providing the template for transcription. Because this saturation is implausible for normal cells, additional postreplicative mechanisms must be in place to deal with the inevitable DFSs. For this reason, regularity in RO distribution is not an effective safeguard against DFSs in organisms with larger genomes.
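The trade-off described here can be reproduced approximately with a few lines of Python. The sketch assumes perfectly equally spaced ROs, Ns = 10 Mbp, and a target DFS probability of 0.1%, and it takes the genome lengths of ∼20 Mbp, ∼250 Mbp, and ∼6 Gbp quoted in Discussion. The function names and the geometric bisection are choices made for this illustration, not the authors' procedure.

```python
import math


def dfs_probability_equal_spacing(genome_bp, replicon_bp, ns_bp=10e6):
    """Probability of at least one DFS when all replicons have the same length."""
    x = math.log(2) * replicon_bp / ns_bp
    n_replicons = genome_bp / replicon_bp
    lam = n_replicons * (x - math.log1p(x))   # Eq. 12 with identical N_k
    return -math.expm1(-lam)                  # 1 - exp(-lam)


def replicon_length_for(genome_bp, target_prob, ns_bp=10e6):
    """Largest equal replicon length (bp) that keeps the DFS probability at target_prob."""
    lo, hi = 1.0, float(genome_bp)
    for _ in range(200):
        mid = math.sqrt(lo * hi)              # bisect on a logarithmic scale
        if dfs_probability_equal_spacing(genome_bp, mid, ns_bp) < target_prob:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


if __name__ == "__main__":
    for genome in (20e6, 250e6, 6e9):
        n = replicon_length_for(genome, target_prob=1e-3)
        print(f"genome {genome / 1e6:>7.0f} Mbp -> replicon length {n:,.0f} bp")
```

Under these assumptions, the ∼20-Mbp case falls inside the 10- to 100-kbp range, whereas the ∼6-Gbp case requires a spacing of roughly 70 bp, comparable to the ∼60-bp MCM2-7 footprint.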

Fig. 9.

Highlighting the issues faced to maintain small DFS error rates for genomes of increasing length: theoretical prediction of the average replicon length as a function of increasing genome length, to maintain a fixed probability of DFS, for three different values of this probability. Diamonds show the positions of yeast, Arabidopsis, Drosophila, and human, obtained from the datasets of RO positions. The pink shadow highlights the biologically relevant range for mean replicon lengths as per all eukaryotic datasets available. The dashed red line marks the footprint for the MCM2-7 double hexamer, below which any replicon length is biologically unrealistic.

Discussion

Faithful DNA replication is fundamental to preserve the genetic content of cells and to avoid the severe pathologies that arise when DNA is improperly replicated. The appropriate location and activation of ROs is fundamental to ensuring that replicative errors are minimized. Here we show that understanding the principles that govern distribution of ROs provides quantitative insights into the way that different organisms maintain genetic integrity. By using a probability theory approach, based on a one-parameter model with simple yet plausible assumptions, we have developed a set of measures and predictions that further this understanding. The excellent agreement of our theoretical predictions with experimental data strongly supports the validity of our model assumptions. Moreover, it allows us to explore the rich system-level diversity of features and constraints associated with DNA replication.

Replicative Errors Are Inevitable in Larger Genomes.

Increased phenotypic complexity of organisms is generally associated with larger genome length, and metazoans have much larger genomes compared with yeast: The diploid human genome is ∼600 times larger than the haploid yeast genome. Despite this large difference in genome size, the replication machinery is essentially conserved (4). Over the past few decades, much effort has been devoted to understanding the molecular mechanisms involved in eukaryotic DNA replication and the associated damage-repair mechanisms. However, less is known about the system-level structures and processes that allow replication fidelity across the different scales of eukaryotic complexity, mirrored by genome lengths spanning over three orders of magnitude across yeast to human. We have used a theoretical approach, previously validated in yeasts (14), to predict the probability of DFSs for different organisms with widely different genome lengths, and for which detailed RO distribution data are available.

Our central equation shows that there is a hierarchy of contributions to the probability of DFS, with genome length being the most important factor, followed by RO number and then RO distribution. This effectively creates different classes of probabilities of DFS errors (∼10⁻³, ∼10⁻², and ∼1) for the respective classes of organisms according to their genome lengths (∼20 Mbp, ∼250 Mbp, and ∼6 Gbp). Interestingly, among the cell types we analyzed, there was no major difference in the density of ROs, i.e., mean replicon length. One possible explanation for this is that the RO density needed to significantly reduce DFSs in organisms with genomes of 250 Mbp or more would lead to excessive clashes with the transcriptional machinery. The third component of our equation—the uniformity of replicon length, i.e., R—also reflects these classes (with values of <1, ∼1, and >1, respectively), indicating that, as the probability of DFSs approaches 1 in larger genomes, the pressure toward a regular RO distribution is lifted.

Inevitability Is Mitigated by Containment in Longer Genomes and Beyond.

DFSs are the primary cause of DNA double-strand breaks during replication (27–29), and are likely to be major contributors to the development of cancer and other pathologies, such as ones associated with aging (30, 31). The inevitability of DFSs in longer genomes requires the presence of cellular mechanisms that are able to deal with such errors in an efficient manner. In related experimental work, we provide evidence for one such postreplicative mechanism, involving the segregation of unreplicated DNA via UFBs and its protection by 53BP1 before being resolved in the next S phase (19). We have demonstrated very good agreement in the numbers and statistical distribution of experimental measurements of both 53BP1 and UFBs with the predictions of Poisson statistics from our theory, supporting the validity of our conclusions, and indicating that DFSs in the experimental systems are well approximated as independent events.

Analysis of the data available for human cell lines within our theoretical framework shows that RO density and distribution constrain the number of DFSs per cell cycle to three or less for nearly all cells. This limit on the number of DFSs may partially be explained by the difficulty in properly recombining two strands of DNA when end-joining is used. For example, if four DFSs occur and need to be fixed, eight strands will be generated, and only one of the 24 theoretically possible combinations is correct. From our experimental observations, cells with large numbers of 53BP1 nuclear bodies and UFBs showed increased blebbing and apoptosis. This suggests that large numbers of DFSs could compromise the working of the cell and the efficiency of the repair mechanism. Thus, our theory, in light of the experimental data, shows a contingent trade-off between the inevitability of DFS occurrence and the difficulty of its resolution (i.e., apparently requiring sophisticated molecular machinery for detection and repair). It is worth stressing that our central equation for λ, the mean number of DFSs, contains very large numerical values, i.e., Ng and Ns, as well as thousands of replicon lengths. Therefore, in principle, the formula could have produced values for λ of almost arbitrary magnitude, either much less than or much greater than unity. It is striking that our theoretical predictions from the central equation yield values for λ close to unity and in such strong agreement with experimental data.

Another important requirement for the containment of replicative errors in larger genomes is an upper limit in the length of large replicons. Longer replicons correspond to a higher probability of DFSs (Fig. 3D). Our theory indicates that the largest tolerable replicons in human cell lines are bounded by ∼0.5 Ns, and, interestingly, the largest replicons found in experimental datasets are around 0.3 Ns. In addition, we have analyzed human cell line data within our theoretical framework, and, by varying Ns, we are able to clearly show that the probability of observing a number of DFSs equal to one, two, or three is maximized for Ns in the region of 10 Mbp. This value for Ns is in excellent agreement with previous experimental and theoretical estimates in human cell lines and yeasts (14, 24). Due to the universality of replication machinery across the eukaryotes and the necessity of error containment in larger genomes, we propose this Ns value to be robust and universal in eukaryotes. A further signature of the containment mechanisms associated with the inevitable errors in human genomes can be found in the distribution of the risk among replicons of different sizes: A relatively narrow range of replicons (of size ∼40 to ∼200 kbp) contributes the most to DFSs, with the different replicon sizes in this range contributing approximately equally to the risk.

As a final note, it is worth stressing that some organisms, particularly plants, have very large genomes, with Ng as large as ∼100 Gbp (32). Our theory would predict, in such cases, that the number of DFSs becomes much larger than 3, and in the region of 10 or more. Interestingly, it has been observed that the cell cycle length in plants undergoes a dramatic lengthening as genome size exceeds about 25 Gb (32), potentially reflecting the significantly greater burden of DFS detection and correction in these organisms. We would predict similar effects for ploidy variants within the same species. We currently do not have genome-wide RO distribution data for these organisms to test this idea, but this would provide further opportunities for gaining new understanding of the system-level strategies that eukaryotes use to minimize replication errors.

Materials and Methods

Experimental Setup.

For the 53BP1 and UFB experiments, U2-OS and IMR90 cell lines from the American Type Culture Collection were maintained in Dulbecco’s Modified Eagle’s medium (Invitrogen), supplemented with 10% (vol/vol) FBS (Invitrogen) and penicillin and streptomycin at 37 °C in 5% CO2. Standard immunofluorescence protocols were used for the 53BP1 and UFB staining. Briefly, cells were fixed with 4% formaldehyde, permeabilized with 0.1% Triton in PBS, and blocked in 0.5% fish gelatin (G-7765; Sigma). Samples were incubated overnight with primary antibodies. To specify G1-phase cells, they were incubated with 40 µM EdU (Invitrogen) for 30 min before fixation, and then incubated with Cyclin A (1:300, ab16726; Abcam). For the detection of 53BP1, cells were also stained with GFP (1:2,000, ab13970; Abcam). To stain incorporated nucleotides, the Click-iT-EdU kit was used as instructed by the manufacturers (C10337; Invitrogen). For staining UFBs, cells were incubated with BLM (1:200, sc-7790; Santa Cruz). Alexa secondary antibodies (Invitrogen) were used for 1 h. Microscopy images were acquired using an Olympus IX70 deltavision deconvolution microscope and a CCD camera. Data from microscopy experiments were analyzed using Volocity 3D analysis software (Perkin-Elmer).

Datasets Used and Statistical Analysis.

Limited direct experimental evidence exists on ROs in plants and metazoans, and most data focus on the genomic density, rather than localization, of ROs (33, 34). Therefore, the main results of our article are framed in the context of available datasets describing genome-wide RO positions. Lower-quality datasets have been considered, where appropriate, to provide additional challenge to the theoretical predictions and their interpretation. Saccharomyces cerevisiae ROs were obtained from the highly curated DNA Replication Origin Database (20) with selection criteria discussed in ref. 14. To provide additional validation, we considered another yeast species in this article: Schizosaccharomyces pombe (21). RO distribution data were also obtained for the following multicellular organisms: Arabidopsis thaliana (15), Drosophila melanogaster (16), and human. Human data for the four cell lines IMR90, HeLa, hESC, and iPSC were derived as discussed in ref. 17, and different datasets for IMR90, HeLa, and K562 cell lines were obtained from ref. 18. The summary of the datasets is presented in Table S1.

When RO positions were defined by genomic ranges, the middle point of the range was used as the genomic location of the RO. Moreover, to limit the problems associated with technological limitations in sequencing the centromeric regions of chromosomes, the largest replicon of each chromosome (corresponding to the centromeric region) was excluded from the analysis in all of the organisms considered.

Probabilities of DFSs were obtained from RO position data using the formulas detailed in the following mathematical derivations. To allow standardized comparisons in computing the probability of DFS, all of the organisms were considered as diploid. Poisson fits of the computationally derived distribution of DFSs were computed using the probability of no DFSs. Poisson fits of the experimental data were computed using the mean (naïve) or by minimizing the difference from the frequencies of DFS strictly larger than zero (filtered). Differences between distributions were computed using Chi-Squared tests.
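The two fitting strategies can be illustrated with a minimal Python sketch. The naïve fit simply takes the mean of the counts. For the filtered fit, the exact objective is not specified beyond the description above, so the sketch assumes a least-squares match between the observed frequencies among cells with at least one focus and a zero-truncated Poisson, minimized by a plain grid search. All function names are illustrative.

```python
import math
from collections import Counter


def naive_lambda(counts):
    """Naive fit: the Poisson parameter is the mean of all counts, zeros included."""
    return sum(counts) / len(counts)


def truncated_pmf(k, lam):
    """Poisson probability of k events, conditioned on k >= 1 (zero-truncated)."""
    return math.exp(-lam) * lam**k / math.factorial(k) / (-math.expm1(-lam))


def filtered_lambda(counts, grid_step=0.001, lam_max=10.0):
    """Assumed filtered fit: least squares on the frequencies of counts >= 1 only."""
    positive = [c for c in counts if c > 0]
    if not positive:
        raise ValueError("at least one nonzero count is required")
    freq = Counter(positive)
    total = len(positive)
    k_max = max(freq)

    def loss(lam):
        return sum((freq.get(k, 0) / total - truncated_pmf(k, lam)) ** 2
                   for k in range(1, k_max + 1))

    grid = [grid_step * i for i in range(1, int(lam_max / grid_step) + 1)]
    return min(grid, key=loss)


if __name__ == "__main__":
    # Hypothetical per-cell counts of foci, for illustration only.
    counts = [0] * 30 + [1] * 33 + [2] * 25 + [3] * 9 + [4] * 3
    print("naive lambda:   ", round(naive_lambda(counts), 3))
    print("filtered lambda:", round(filtered_lambda(counts), 3))
```

Because the filtered estimate ignores the zero class entirely, adding spurious zero counts (for example, from failed staining) lowers the naïve estimate but leaves the filtered one unchanged.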

Model Derivation and Mathematical Details.

The baseline assumptions that have been used to construct the mathematical model have been described elsewhere (14) and will not be discussed here. In yeast, the size of the largest replicon, i.e., inter-RO distance, is significantly smaller than Ns. This size difference allowed the introduction of approximations, which could be used to obtain simpler formulas in our previous work (14). This is not valid in human genomes, and therefore we could not rely on the approximations previously used. Hence, various quantities had to be rederived to avoid previously introduced approximations, and we provide the more general derivations below.

Let D be the distance between two adjacent ROs located respectively at n = 0 and n = N, where N − 1 is the number of nucleotides within D. As shown in ref. 14, the probability of a double stall in D (DSD) is given by the following expression:

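$$\mathrm{Prob}(\mathrm{DSD}) = \sum_{n=0}^{N-1} (1-q)^{n}\, q \left[1-(1-q)^{N-n}\right]. \qquad [2]$$

Therefore,

$$\begin{aligned}
\mathrm{Prob}(\mathrm{DSD}) &= q\sum_{n=0}^{N-1}(1-q)^{n} - q\sum_{n=0}^{N-1}(1-q)^{n}(1-q)^{N-n} \\
&= q\sum_{n=0}^{N-1}(1-q)^{n} - q\sum_{n=0}^{N-1}(1-q)^{N}.
\end{aligned}$$

Evaluating the sums using the formula for a geometric series, we have

$$\begin{aligned}
\mathrm{Prob}(\mathrm{DSD}) &= q\left[\frac{1-(1-q)^{N}}{1-(1-q)}\right] - Nq(1-q)^{N} \\
&= 1-(1-q)^{N} - Nq(1-q)^{N}.
\end{aligned}$$

Thus,

$$\mathrm{Prob}(\mathrm{DSD}) = 1-(1+Nq)(1-q)^{N}. \qquad [3]$$

Expressing the product as the exponential of the sum of the logarithms gives

$$(1-q)^{N} = \exp[N\log(1-q)]. \qquad [4]$$

Because q is an extremely small number, log(1 − q) ≈ −q, and hence

$$(1-q)^{N} = \exp(-Nq). \qquad [5]$$

Combining Eq. 3 with Eq. 5, we obtain

$$\mathrm{Prob}(\mathrm{DSD}) = 1-(1+Nq)\exp(-Nq). \qquad [6]$$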
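The chain from Eq. 2 to Eq. 6 can be checked numerically. The Python sketch below evaluates the direct sum of Eq. 2, the closed form of Eq. 3, and the exponential form of Eq. 6. The values of N and q are toy values, with q far larger than any realistic per-nucleotide stalling probability so that the sum runs quickly; they are assumptions of this illustration only.

```python
import math


def prob_dsd_sum(n_bp: int, q: float) -> float:
    """Eq. 2: direct summation over n = 0 .. N - 1."""
    return sum((1 - q) ** n * q * (1 - (1 - q) ** (n_bp - n)) for n in range(n_bp))


def prob_dsd_closed(n_bp: int, q: float) -> float:
    """Eq. 3: closed form after evaluating the geometric series."""
    return 1 - (1 + n_bp * q) * (1 - q) ** n_bp


def prob_dsd_exp(n_bp: int, q: float) -> float:
    """Eq. 6: exponential form, using log(1 - q) ~ -q."""
    return 1 - (1 + n_bp * q) * math.exp(-n_bp * q)


if __name__ == "__main__":
    n_bp, q = 2000, 1e-3   # toy values; realistic q is many orders of magnitude smaller
    print("Eq. 2 (sum):        ", prob_dsd_sum(n_bp, q))
    print("Eq. 3 (closed form):", prob_dsd_closed(n_bp, q))
    print("Eq. 6 (exponential):", prob_dsd_exp(n_bp, q))
```

Eqs. 2 and 3 agree to rounding error, whereas Eq. 6 differs from them only through the approximation log(1 − q) ≈ −q, which tightens as q decreases.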

Let us define the distance between the adjacent (k + 1)th and kth ROs as Nk. The probability of double stall between this pair of ROs will be denoted as Pk. Thus,

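$$P_k = 1-(1+N_k q)\exp(-N_k q). \qquad [7]$$

The genome-wide probability of no double stall, which will be denoted as Prob(NDS), is given by the product of probability of no double stalls in each replicon, i.e.,

$$\mathrm{Prob}(\mathrm{NDS}) = \prod_k (1-P_k). \qquad [8]$$

Combining Eq. 7 and Eq. 8, we have

$$\mathrm{Prob}(\mathrm{NDS}) = \left[\prod_k (1+N_k q)\right]\left\{\prod_k \left[\exp(-N_k q)\right]\right\}. \qquad [9]$$

Let Ng be the genome length; then

$$\sum_k N_k = N_g.$$

Thus,

$$\prod_k \left[\exp(-N_k q)\right] = \exp\left(-q\sum_k N_k\right) = \exp(-qN_g). \qquad [10]$$

Similarly,

$$\prod_k (1+N_k q) = \prod_k \exp[\log(1+N_k q)] = \exp\left[\sum_k \log(1+N_k q)\right]. \qquad [11]$$

Therefore, combining Eqs. 9, 10, and 11, we have

$$\mathrm{Prob}(\mathrm{NDS}) = \exp(-qN_g)\exp\left[\sum_k \log(1+N_k q)\right]$$

or

$$\mathrm{Prob}(\mathrm{NDS}) = \exp\left[-qN_g + \sum_k \log(1+N_k q)\right].$$

We have shown before (14) that q = log(2)/Ns. Hence,

$$\mathrm{Prob}(\mathrm{NDS}) = \exp\left\{-\log(2)\frac{N_g}{N_s} + \sum_k \log\left[1+\log(2)\frac{N_k}{N_s}\right]\right\}. \qquad [12]$$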

This is the result given as Eq. 1, in which the negative of the quantity in the exponent is denoted by λ. Further derivations and mathematical details are provided in SI Text and Table S2.
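As a consistency check on the derivation, the following Python sketch computes λ from Eq. 12 for a list of replicon lengths and confirms that exp(−λ) equals the product over replicons in Eq. 8 with Pk from Eq. 7. The replicon lengths are hypothetical and the function names are illustrative.

```python
import math


def double_stall_prob(n_k: float, q: float) -> float:
    """Eq. 7: probability of a double stall in a replicon of length n_k."""
    return 1 - (1 + n_k * q) * math.exp(-n_k * q)


def prob_no_double_stall_product(replicons, q: float) -> float:
    """Eq. 8: product over replicons of the probability of no double stall."""
    return math.prod(1 - double_stall_prob(n, q) for n in replicons)


def expected_dfs(replicons, ns: float) -> float:
    """lambda, the negative of the exponent in Eq. 12."""
    genome = sum(replicons)   # N_g is the sum of all replicon lengths
    log_term = sum(math.log1p(math.log(2) * n / ns) for n in replicons)
    return math.log(2) * genome / ns - log_term


if __name__ == "__main__":
    replicons = [30_000, 80_000, 250_000, 1_200_000, 3_500_000]  # hypothetical (bp)
    ns = 10e6                                                    # N_s ~ 10 Mbp
    lam = expected_dfs(replicons, ns)
    print("lambda:", lam)
    print("exp(-lambda):    ", math.exp(-lam))
    print("product (Eq. 8): ", prob_no_double_stall_product(replicons, math.log(2) / ns))
```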

Table S2.

Direct calculations on IMR90 cell line suggest that only the leading term is playing a significant role for the value of Ri in Probability of a specific number of DFSs

| Probability under consideration | Term | Numerical value for the term |
| --- | --- | --- |
| Terms used to calculate Prob(1DS) (terms in R1) | $S_1$ | 1.68 |
| Terms used to calculate Prob(2DS) (terms in R2) | $S_1^2$ | 2.81 |
| | $S_2/S_1^2$ | 8.84 × 10⁻⁴ |
| Terms used to calculate Prob(3DS) (terms in R3) | $S_1^3$ | 4.71 |
| | $(S_1 \times S_2)/S_1^3$ | 8.84 × 10⁻⁴ |
| | $S_3/S_1^3$ | 9.25 × 10⁻⁶ |
| Terms used to calculate Prob(4DS) (terms in R4) | $S_1^4$ | 7.89 |
| | $(S_1^2 \times S_2)/S_1^4$ | 8.84 × 10⁻⁴ |
| | $(S_1 \times S_3)/S_1^4$ | 9.25 × 10⁻⁶ |
| | $S_2^2/S_1^4$ | 7.81 × 10⁻⁷ |
| | $S_4/S_1^4$ | 1.40 × 10⁻⁷ |

DS, double stall.

Software Used.

Data analysis was performed using R version 2.15 and RStudio version 0.98.978 (https://www.rstudio.com).

SI Text

Mathematical Derivations.

Probability of a specific number of DFSs.

The previous approach can be extended to calculate the probability of an arbitrary number of double stalls. The probability of exactly one DFS, which will be called Prob(1DS), can be calculated directly as

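$$\mathrm{Prob}(1\mathrm{DS}) = \sum_k P_k \cdot \prod_{k_1 \neq k} (1-P_{k_1}). \qquad [S1]$$

Combining Eq. S1 with Eq. 8, we obtain

$$\mathrm{Prob}(1\mathrm{DS}) = \mathrm{Prob}(\mathrm{NDS}) \sum_k \frac{P_k}{1-P_k}.$$

Therefore,

$$\frac{\mathrm{Prob}(1\mathrm{DS})}{\mathrm{Prob}(\mathrm{NDS})} = \sum_k \frac{P_k}{1-P_k}.$$

To simplify the next steps, we introduce the following definitions:

$$\begin{aligned}
S_1 &= \sum_k \frac{P_k}{1-P_k} \\
S_2 &= \sum_k \left(\frac{P_k}{1-P_k}\right)^{2} \\
S_m &= \sum_k \left(\frac{P_k}{1-P_k}\right)^{m}.
\end{aligned}$$

Additionally, let Prob(mDS) be the probability of m DFSs; the following conventions will be used:

$$\begin{aligned}
R_1 &= \frac{\mathrm{Prob}(1\mathrm{DS})}{\mathrm{Prob}(\mathrm{NDS})} \\
R_2 &= \frac{\mathrm{Prob}(2\mathrm{DS})}{\mathrm{Prob}(\mathrm{NDS})} \\
R_m &= \frac{\mathrm{Prob}(m\mathrm{DS})}{\mathrm{Prob}(\mathrm{NDS})}.
\end{aligned} \qquad [S2]$$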

Hence,

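$$R_1 = \sum_k \frac{P_k}{1-P_k} = S_1 \qquad [S3]$$

and

$$R_2 = \frac{1}{2!}\sum_{k_1}\sum_{k_2 \neq k_1} \frac{P_{k_1}}{1-P_{k_1}}\,\frac{P_{k_2}}{1-P_{k_2}},$$

which can be rewritten as

$$R_2 = \frac{1}{2!}\left[\left(\sum_k \frac{P_k}{1-P_k}\right)^{2} - \sum_k \left(\frac{P_k}{1-P_k}\right)^{2}\right] = \frac{1}{2!}\left(S_1^{2} - S_2\right).$$

Similarly,

$$\begin{aligned}
R_3 &= \frac{1}{3!}\sum_{k_1}\sum_{k_2 \neq k_1}\sum_{k_3 \neq k_1,k_2} \frac{P_{k_1}}{1-P_{k_1}}\,\frac{P_{k_2}}{1-P_{k_2}}\,\frac{P_{k_3}}{1-P_{k_3}} \\
&= \frac{1}{3!}\left(S_1^{3} - 3S_1S_2 + 2S_3\right).
\end{aligned}$$

Iterating the same approach, it is possible to show that

$$\begin{aligned}
R_4 &= \frac{1}{4!}\left(S_1^{4} - 6S_1^{2}S_2 + 8S_1S_3 + 3S_2^{2} - 6S_4\right) \\
R_5 &= \frac{1}{5!}\left(S_1^{5} - 10S_1^{3}S_2 + 15S_1S_2^{2} + 20S_1^{2}S_3 - 20S_2S_3 - 30S_1S_4 + 24S_5\right) \\
R_6 &= \frac{1}{6!}\left(S_1^{6} - 15S_1^{4}S_2 + 45S_1^{2}S_2^{2} - 15S_2^{3} + 40S_1^{3}S_3 - 120S_1S_2S_3 + 40S_3^{2} - 90S_1^{2}S_4 + 90S_2S_4 + 144S_1S_5 - 120S_6\right).
\end{aligned}$$

Finally, combining R1, R2, R3, R4, R5, and R6 with Eq. S2, we can obtain the probability of one to six DFSs as follows:

$$\begin{aligned}
\mathrm{Prob}(1\mathrm{DS}) &= \mathrm{Prob}(\mathrm{NDS})\,R_1 \\
\mathrm{Prob}(2\mathrm{DS}) &= \mathrm{Prob}(\mathrm{NDS})\,R_2 \\
\mathrm{Prob}(3\mathrm{DS}) &= \mathrm{Prob}(\mathrm{NDS})\,R_3 \\
\mathrm{Prob}(4\mathrm{DS}) &= \mathrm{Prob}(\mathrm{NDS})\,R_4 \\
\mathrm{Prob}(5\mathrm{DS}) &= \mathrm{Prob}(\mathrm{NDS})\,R_5 \\
\mathrm{Prob}(6\mathrm{DS}) &= \mathrm{Prob}(\mathrm{NDS})\,R_6.
\end{aligned}$$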
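The expressions for the Rm are the elementary symmetric polynomials of the quantities Pk/(1 − Pk), written in terms of the power sums Sm. The Python sketch below generates them for any m with Newton's identities, a standard technique that the text does not say the authors used, and checks the result against a brute-force enumeration over all choices of failing replicons. The per-replicon probabilities are hypothetical and deliberately large so that higher-order terms are visible.

```python
import math
from itertools import combinations


def power_sums(p, m_max):
    """S_1 .. S_m_max, with S_m = sum_k (P_k / (1 - P_k))**m."""
    t = [pk / (1 - pk) for pk in p]
    return [sum(tk**m for tk in t) for m in range(1, m_max + 1)]


def r_values(p, m_max):
    """R_0 .. R_m_max from Newton's identities: m*R_m = sum_i (-1)**(i-1) * R_(m-i) * S_i."""
    s = power_sums(p, m_max)
    r = [1.0]   # R_0 = 1 by convention
    for m in range(1, m_max + 1):
        r.append(sum((-1) ** (i - 1) * r[m - i] * s[i - 1] for i in range(1, m + 1)) / m)
    return r


def prob_m_double_stalls(p, m_max):
    """Prob(mDS) = Prob(NDS) * R_m for m = 0 .. m_max."""
    p_none = math.prod(1 - pk for pk in p)
    return [p_none * rm for rm in r_values(p, m_max)]


def brute_force(p, m):
    """Sum the probability of every way of choosing exactly m failing replicons."""
    idx = range(len(p))
    total = 0.0
    for chosen in combinations(idx, m):
        failing = set(chosen)
        total += math.prod(p[k] if k in failing else 1 - p[k] for k in idx)
    return total


if __name__ == "__main__":
    p = [0.10, 0.20, 0.05, 0.30, 0.15, 0.25, 0.12]   # hypothetical P_k values
    for m, prob in enumerate(prob_m_double_stalls(p, 6)):
        print(m, prob, brute_force(p, m))
```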

As shown in Table S2, direct calculations on the IMR90 cell line suggest that only the leading power is playing a significant role for the Ri. Therefore, we write

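$$\begin{aligned}
R_2 &\approx \frac{1}{2!}\,S_1^{2} \\
R_k &\approx \frac{1}{k!}\,S_1^{k}
\end{aligned}$$

and hence

$$\mathrm{Prob}(k\mathrm{DS}) = \mathrm{Prob}(\mathrm{NDS})\,R_k \approx \mathrm{Prob}(\mathrm{NDS})\,\frac{1}{k!}\,S_1^{k}.$$

The probability mass function of a Poisson distribution is

$$\mathrm{Prob}(n) = \exp(-\lambda)\,\frac{\lambda^{n}}{n!}.$$

Therefore, for our distribution to follow a Poisson, we have to show that

$$\mathrm{Prob}(\mathrm{NDS}) = \exp(-\lambda). \qquad [S4]$$

Comparison with the preceding expression for Prob(kDS) then implies S1 = λ.

Now, from Eq. 8 and Eq. S4, we have

$$\lambda = -\log\left[\prod_k (1-P_k)\right] = -\sum_k \log(1-P_k) = \sum_k \log\left(\frac{1}{1-P_k}\right) = \sum_k \log\left(1+\frac{P_k}{1-P_k}\right).$$

The value Pk/(1 − Pk) is very small, and we can use a Taylor expansion to obtain

$$\lambda = \sum_k \left[\frac{P_k}{1-P_k} + O(P_k^{2})\right] \approx \sum_k \frac{P_k}{1-P_k} = S_1.$$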

Because the Pk are very small, this approximation is generally very good.

Frequency of replicons of particular size.

We have shown in Eq. 3 that the probability of a DFS in the region of DNA between a pair of adjacent ROs separated by N nucleotides is

$$\mathrm{Prob}(N) = 1-(1+Nq)(1-q)^{N}. \qquad [S5]$$

Now, we calculate the probability of DFS in a cohort of M replicons whose size is in the vicinity of N. The probability of no error occurring from this cohort would be the product of 1 − Prob(Nk) over k, where the product is restricted to those replicons within the cohort. This probability will be very close to 1, and we denote it by θ. Substituting Eq. S5 into this expression, and recognizing that all of the Nk are close to N, enables us to rewrite the probability of no error from the cohort as

$$(1+Nq)^{M}(1-q)^{MN} = \theta.$$

Now, taking the natural logarithm,

$$M\log(1+Nq) + MN\log(1-q) = \log(\theta). \qquad [S6]$$

Because q << 1, log(1 − q) ≈ −q, and thus we write

$$M\log(1+Nq) - MNq = \log(\theta).$$

Hence,

$$M = \frac{\log(\theta)}{\log(1+Nq) - Nq}. \qquad [S7]$$

For Nq << 1, it is straightforward to show, from expanding the denominator, that M ∝ 1/N² for fixed θ and q.
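The inverse-square scaling follows from a second-order expansion of the denominator of Eq. S7. The short derivation below is a sketch that keeps only the leading term in Nq; the intermediate step is filled in here and is not reproduced from the original text.

```latex
\begin{aligned}
\log(1+Nq) - Nq &= -\tfrac{1}{2}(Nq)^{2} + O\!\left((Nq)^{3}\right), \\
M = \frac{\log(\theta)}{\log(1+Nq) - Nq}
  &\approx \frac{-2\log(\theta)}{q^{2}N^{2}}
   = \frac{2\log(1/\theta)}{q^{2}} \cdot \frac{1}{N^{2}} .
\end{aligned}
```

Because θ is slightly below 1, log(1/θ) is small and positive, so M is positive and, for fixed θ and q, falls off as the inverse square of N.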

Acknowledgments

We thank Dianbo Liu and Sam Palmer for helpful discussions. A.M., J.T.C., and J.J.B. acknowledge support from Cancer Research UK (Grant C303/A14301) and the Wellcome Trust (Grant WT096598MA). M.A.M., L.A., and T.J.N. acknowledge support from the Scottish Universities Life Science Alliance. T.J.N. acknowledges support from the National Institutes of Health (Physical Sciences in Oncology Centers, U54 CA143682). The authors also acknowledge High Performance Computer resources partially supported by the Wellcome Trust (Strategic Grant 097945).

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1603241113/-/DCSupplemental.

References

  • 1.Nielsen O, Løbner-Olesen A. Once in a lifetime: Strategies for preventing re-replication in prokaryotic and eukaryotic cells. EMBO Rep. 2008;9(2):151–156. doi: 10.1038/sj.embor.2008.2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Bebenek A. DNA replication fidelity. Postepy Biochem. 2008;54(1):43–56. [PubMed] [Google Scholar]
  • 3.Blow JJ, Ge XQ, Jackson DA. How dormant origins promote complete genome replication. Trends Biochem Sci. 2011;36(8):405–414. doi: 10.1016/j.tibs.2011.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Sclafani RA, Holzen TM. Cell cycle regulation of DNA replication. Annu Rev Genet. 2007;41:237–280. doi: 10.1146/annurev.genet.41.110306.130308.
  • 5.Diffley JFX. Quality control in the initiation of eukaryotic DNA replication. Philos Trans R Soc Lond B Biol Sci. 2011;366(1584):3545–3553. doi: 10.1098/rstb.2011.0073.
  • 6.Arias EE, Walter JC. Strength in numbers: Preventing rereplication via multiple mechanisms in eukaryotic cells. Genes Dev. 2007;21(5):497–518. doi: 10.1101/gad.1508907.
  • 7.Hastings PJ, Lupski JR, Rosenberg SM, Ira G. Mechanisms of change in gene copy number. Nat Rev Genet. 2009;10(8):551–564. doi: 10.1038/nrg2593.
  • 8.Dayal JHS, Albergante L, Newman TJ, South AP. Quantitation of multiclonality in control and drug-treated tumour populations using high-throughput analysis of karyotypic heterogeneity. Converg Sci Phys Oncol. 2015;1(2):025001.
  • 9.Blow JJ, Dutta A. Preventing re-replication of chromosomal DNA. Nat Rev Mol Cell Biol. 2005;6(6):476–486. doi: 10.1038/nrm1663.
  • 10.Alver RC, Chadha GS, Blow JJ. The contribution of dormant origins to genome stability: From cell biology to human genetics. DNA Repair (Amst) 2014;19:182–189. doi: 10.1016/j.dnarep.2014.03.012.
  • 11.Cobb JA, et al. Replisome instability, fork collapse, and gross chromosomal rearrangements arise synergistically from Mec1 kinase and RecQ helicase mutations. Genes Dev. 2005;19(24):3055–3069. doi: 10.1101/gad.361805.
  • 12.Ghosal G, Chen J. DNA damage tolerance: A double-edged sword guarding the genome. Transl Cancer Res. 2013;2(3):107–129. doi: 10.3978/j.issn.2218-676X.2013.04.01.
  • 13.Mazouzi A, Velimezi G, Loizou JI. DNA replication stress: Causes, resolution and disease. Exp Cell Res. 2014;329(1):85–93. doi: 10.1016/j.yexcr.2014.09.030.
  • 14.Newman TJ, Mamun MA, Nieduszynski CA, Blow JJ. Replisome stall events have shaped the distribution of replication origins in the genomes of yeasts. Nucleic Acids Res. 2013;41(21):9705–9718. doi: 10.1093/nar/gkt728.
  • 15.Costas C, et al. Genome-wide mapping of Arabidopsis thaliana origins of DNA replication and their associated epigenetic marks. Nat Struct Mol Biol. 2011;18(3):395–400. doi: 10.1038/nsmb.1988.
  • 16.Cayrou C, et al. Genome-scale analysis of metazoan replication origins reveals their organization in specific but flexible sites defined by conserved features. Genome Res. 2011;21(9):1438–1449. doi: 10.1101/gr.121830.111.
  • 17.Besnard E, et al. Unraveling cell type-specific and reprogrammable human replication origin signatures associated with G-quadruplex consensus motifs. Nat Struct Mol Biol. 2012;19(8):837–844. doi: 10.1038/nsmb.2339.
  • 18.Picard F, et al. The spatiotemporal program of DNA replication is associated with specific combinations of chromatin marks in human cells. PLoS Genet. 2014;10(5):e1004282. doi: 10.1371/journal.pgen.1004282.
  • 19.Moreno A, et al. Unreplicated DNA remaining from unperturbed S phases passes through mitosis for resolution in daughter cells. Proc Natl Acad Sci USA. 2016;113:E5757–E5764. doi: 10.1073/pnas.1603252113.
  • 20.Siow CC, Nieduszynska SR, Müller CA, Nieduszynski CA. OriDB, the DNA replication origin database updated and extended. Nucleic Acids Res. 2012;40(Database issue) D1:D682–D686. doi: 10.1093/nar/gkr1091.
  • 21.Hayashi M, et al. Genome-wide localization of pre-RC sites and identification of replication origins in fission yeast. EMBO J. 2007;26(5):1327–1339. doi: 10.1038/sj.emboj.7601585.
  • 22.Cooper GM. The Eukaryotic Cell Cycle. The Cell: A Molecular Approach. 2nd Ed Sinauer Associates; Sunderland, MA: 2000.
  • 23.Méchali M. Eukaryotic DNA replication origins: Many choices for appropriate answers. Nat Rev Mol Cell Biol. 2010;11(10):728–738. doi: 10.1038/nrm2976.
  • 24.Maya-Mendoza A, Petermann E, Gillespie DAF, Caldecott KW, Jackson DA. Chk1 regulates the density of active replication origins during the vertebrate S phase. EMBO J. 2007;26(11):2719–2731. doi: 10.1038/sj.emboj.7601714.
  • 25.Remus D, et al. Concerted loading of Mcm2−7 double hexamers around DNA during DNA replication origin licensing. Cell. 2009;139(4):719–730. doi: 10.1016/j.cell.2009.10.015.
  • 26.Evrin C, et al. A double-hexameric MCM2-7 complex is loaded onto origin DNA during licensing of eukaryotic DNA replication. Proc Natl Acad Sci USA. 2009;106(48):20240–20245. doi: 10.1073/pnas.0911500106.
  • 27.Unno J, et al. Artemis-dependent DNA double-strand break formation at stalled replication forks. Cancer Sci. 2013;104(6):703–710. doi: 10.1111/cas.12144.
  • 28.Allen C, Ashley AK, Hromas R, Nickoloff JA. More forks on the road to replication stress recovery. J Mol Cell Biol. 2011;3(1):4–12. doi: 10.1093/jmcb/mjq049.
  • 29.Jones RM, Kotsantis P, Stewart GS, Groth P, Petermann E. BRCA2 and RAD51 promote double-strand break formation and cell death in response to gemcitabine. Mol Cancer Ther. 2014;13(10):2412–2421. doi: 10.1158/1535-7163.MCT-13-0862.
  • 30.Bohgaki T, Bohgaki M, Hakem R. DNA double-strand break signaling and human disorders. Genome Integr. 2010;1(1):15. doi: 10.1186/2041-9414-1-15.
  • 31.Li H, Mitchell JR, Hasty P. DNA double-strand breaks: A potential causative factor for mammalian aging? Mech Ageing Dev. 2008;129(7-8):416–424. doi: 10.1016/j.mad.2008.02.002.
  • 32.Francis D, Davies MS, Barlow PW. A strong nucleotypic effect on the cell cycle regardless of ploidy level. Ann Bot (Lond) 2008;101(6):747–757. doi: 10.1093/aob/mcn038.
  • 33.Wong PG, et al. Cdc45 limits replicon usage from a low density of preRCs in mammalian cells. PLoS One. 2011;6(3):e17533. doi: 10.1371/journal.pone.0017533.
  • 34.Mahbubani HM, Chong JPJ, Chevalier S, Thömmes P, Blow JJ. Cell cycle regulation of the replication licensing system: Involvement of a Cdk-dependent inhibitor. J Cell Biol. 1997;136(1):125–135. doi: 10.1083/jcb.136.1.125.

