PLOS ONE. 2007 Jun 27;2(6):e570. doi: 10.1371/journal.pone.0000570

Real-Time Definition of Non-Randomness in the Distribution of Genomic Events

Ulrich Abel 1,2, Annette Deichmann 1, Cynthia Bartholomae 1, Kerstin Schwarzwaelder 1, Hanno Glimm 1, Steven Howe 2, Adrian Thrasher 3,4, Alexandrine Garrigue 5, Salima Hacein-Bey-Abina 5,6, Marina Cavazzana-Calvo 5,6, Alain Fischer 5,7, Dirk Jaeger 1, Christof von Kalle 1,8,*, Manfred Schmidt 1,*
Editor: Xiaolin Wu 9
PMCID: PMC1892803  PMID: 17593969

Abstract

Features such as mutations or structural characteristics can be non-randomly or non-uniformly distributed within a genome. Until now, statistical inference on the distribution of sequence motifs has required computer simulations. Here, we show that these analyses are possible using an analytical, mathematical approach. To assess non-randomness, our calculations require only the genome size, the number of (sampled) sequence motifs, and distance parameters. We have developed computer programs that evaluate our analytical formulas for the real-time determination of expected values and p-values. This approach permits a flexible cluster definition that can be tailored to identify non-random or non-uniform sequence motif distributions most effectively. As an example, we demonstrate the effectiveness and reliability of our mathematical approach on the distribution of retroviral vector integration sites in clinical samples.

Introduction

With the sequences of complete genomes available [1]–[4] and technologies for high-throughput sequencing accelerating [5], genome-wide sequence analyses of individual samples will soon become reality. Comparative analyses of sequence composition and sequence motif distribution have become central parts of genome and transcriptome research, providing new insights into evolution, physiology and medical diagnosis [6]–[15]. Our understanding of integrating viruses and related vectors in gene therapy trials is an interesting example of such approaches. Since the completion of the human and murine genome sequencing projects, the location of the vector in the cellular genome can be defined precisely, allowing possible vector integration-induced effects on the surrounding genomic DNA regions to be determined at the molecular level. Integration site analyses have gained increasing interest with the dramatic development of a retroviral vector-induced lymphoproliferative disease in 3 patients cured of X-linked severe combined immunodeficiency (X-SCID), which was triggered by insertional activation of the proto-oncogene LMO2 [16], [17]. Meanwhile, insertion-induced side effects ranging from immortalization [18] to clonal dominance [19]–[22] and even oncogenesis [23]–[25] have been identified in a variety of gene therapy studies. These studies have in common that a clustering of integration sites (IS) in certain genomic loci was detectable and likely provided a selective advantage for the affected cell clone.

The clustering of integrations, termed common integration sites (CIS), has already been used as an indicator of clone selection in concerted retrovirus insertional mutagenesis studies that aimed to identify new cancer genes by determining the gene configuration near frequently affected integration site loci [26]–[28]. For CIS determination, computer simulations were performed to assess non-randomness of IS distribution in tumors [28]. To validate our mathematical approach for defining non-randomness and non-uniform sequence motif distribution, we analyzed the IS distribution and the presence of CIS in 2 successful clinical SCID-X1 studies [29,30, unpublished data]. We considered 2, 3 or 4 insertions to be a CIS of 2nd, 3rd or 4th order if they fell within a 30 kb, 50 kb or 100 kb window of genomic sequence from each other, respectively. In parallel, we performed computer simulations written in the open source ‘R’ language (http://cran.r-project.org), in which a window of size d_n (d_n = the maximum distance defining a CIS of order n) was shifted through the ordered sequence of the IS. For each window W(j) = [IS(j), IS(j)+d_n], we counted how many CIS of order n that include IS(j) as first element were contained in W(j). We show that the results of our mathematical approach for defining biased IS distribution are comparable to the output of the computer simulations. The mathematical approach may be advantageous when large numbers of individual analyses have to be performed. Even if the null hypothesis of random uniform allocation is not adequate, as is known for retroviral vector integration [31], our calculations can address segments of the genome located between sites of predilection for virus integration and can be extended to non-uniform sequence motif distributions.
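The window-shifting count described above can be sketched in a few lines. The following Python version is an illustration only and is not the authors' R program: the function names, the use of integer kb positions, the fixed seed and the small number of runs are assumptions made for this example.

```python
import random
from bisect import bisect_right
from math import comb


def count_cis(positions, n, d_n):
    """Count CIS of order n: n-subsets of IS whose span is at most d_n (kb)."""
    pos = sorted(positions)
    total = 0
    for j, start in enumerate(pos):
        # IS that follow IS(j) in the ordered sequence and lie in the
        # window W(j) = [IS(j), IS(j) + d_n]
        followers = bisect_right(pos, start + d_n) - (j + 1)
        # every choice of n-1 followers forms one CIS with IS(j) as first element
        total += comb(followers, n - 1)
    return total


def simulate_mean_cis(n_is, n, d_n, g, runs, seed=0):
    """Mean number of CIS of order n over `runs` random uniform allocations."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    total = 0
    for _ in range(runs):
        positions = [rng.randrange(g) for _ in range(n_is)]
        total += count_cis(positions, n, d_n)
    return total / runs


if __name__ == "__main__":
    # 1000 IS, 30 kb window, genome of 3.12 x 10^6 kb; far fewer runs than
    # the 50000 used in the paper, to keep the example fast
    print(simulate_mean_cis(n_is=1000, n=2, d_n=30, g=3_120_000, runs=200))
```

Counting each CIS only at its first (lowest) element ensures that every n-tuple is counted exactly once, whatever the number of IS in the window.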

Results and Discussion

Part 1: Random uniform allocation of IS

For the purpose of this discussion, the unit of observation (location and distance) is the kilobase pair (kb). We assume that a number n_is of IS is randomly allocated (with a uniform distribution) to the locations of a genome consisting of g kb. A CIS of order n is an n-tuple of IS such that the distance between the lowest and the highest position is no greater than a fixed bound.

Further terminology

d_n: the defining “size” or distance of a CIS of order n, i.e. the maximum permissible distance between any two members of a CIS of order n

P_n: probability that a given (sub)set of n randomly allocated IS forms a CIS of order n

P(m,d): probability that a given subset of m randomly allocated IS has a span (= maximum distance between any two elements) of exactly d

E_n: expected value of the number of CIS of order n

We start with the elementary observation that En equals Pn times the number of subsets of IS consisting of n elements:

$$E_n = \binom{n_{is}}{n}\, P_n \tag{1}$$

Clearly,

$$P_n = \sum_{d=0}^{d_n} P(n,d) \tag{2}$$

It remains to determine P(n,d). First note that P(1,d) = 0 for d>0. Furthermore, for all m≥1:

$$P(m,0) = \frac{1}{g^{\,m-1}} \tag{3}$$

A recursive formula for P(m,d), d>0, can be derived by breaking down the potential CIS of order m into subsets of m–1 elements having a span of d'≤d, to which an m-th IS is added such that the maximum span is exactly d:

$$P(m,d) = \frac{d+1}{g}\,P(m-1,d) + \frac{2}{g}\sum_{d'=0}^{d-1} P(m-1,d') + r \tag{4}$$

where r is a negligible correction term that arises because the uncorrected recursion formula is strictly valid only for subsets of IS that have a distance ≥d from the telomeres.

By mounting the recursive ladder (m = 1,...,n), these formulas successively yield P(n,d), Pn, and En. In particular, one easily obtains (d>0):

$$P(2,d) = \frac{2}{g}$$
$$P(3,d) = \frac{6d}{g^{2}}$$
$$P(4,d) = \frac{12d^{2}+2}{g^{3}}$$

Plugging this into equations (2) and (1) yields for the expected value En:

$$E_2 = \binom{n_{is}}{2}\,\frac{1+2d_2}{g}$$
$$E_3 = \binom{n_{is}}{3}\,\frac{1+3d_3+3d_3^{2}}{g^{2}}$$
$$E_4 = \binom{n_{is}}{4}\,\frac{1+4d_4+6d_4^{2}+4d_4^{3}}{g^{3}}$$

As shown in Table 1, our mathematical approximation corresponds extremely well to the mean values found in 50000 simulation runs.
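The recursive ladder can also be climbed numerically. The sketch below is one reading of the recursion described in the text, with the telomere correction r neglected. It assumes that an added IS leaves a span of d unchanged if it falls on any of the d+1 locations already covered, and that it extends a shorter span to exactly d at exactly 2 locations. The function names are hypothetical, and the code is written in Python rather than in the authors' ‘R’ workspaces.

```python
from math import comb


def span_probabilities(n, d_max, g):
    """Return [P(n, 0), ..., P(n, d_max)], neglecting the telomere term r."""
    # m = 1: a single IS always has span 0
    p = [1.0] + [0.0] * d_max
    for _ in range(2, n + 1):
        new = []
        below = 0.0  # running sum of P(m-1, d') over d' < d
        for d in range(d_max + 1):
            # span already d: the added IS may fall on any of the d + 1 covered
            # locations; span d' < d: exactly 2 locations stretch the span to d
            new.append(p[d] * (d + 1) / g + 2.0 * below / g)
            below += p[d]
        p = new
    return p


def expected_cis(n_is, n, d_n, g):
    """E_n = (number of n-subsets of IS) * P_n, with P_n = sum_{d <= d_n} P(n, d)."""
    return comb(n_is, n) * sum(span_probabilities(n, d_n, g))


if __name__ == "__main__":
    g = 3.12e6  # haploid human genome size in kb, as used in the text
    for n, d_n in ((2, 30), (3, 50), (4, 100)):
        print(n, expected_cis(1000, n, d_n, g))
```

For 1000 IS, this sketch gives about 9.77 for order 2 and about 0.13 for order 3, in line with the simulated means reported in Table 1.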

Table 1. Mean values for random CIS formation (1000 IS) determined either with computer simulations or mathematically.

Order of CIS | Mean value (mathematical formula) | Mean value (computer simulations)
2nd | $E_2 = \binom{n_{is}}{2}(1+2d_2)/g$ | 9.75
3rd | $E_3 = \binom{n_{is}}{3}(1+3d_3+3d_3^{2})/g^{2}$ | 0.13
4th | $E_4 = \binom{n_{is}}{4}(1+4d_4+6d_4^{2}+4d_4^{3})/g^{3}$ | 0.01

Simulations were performed with 50000 runs each. g, haploid size of the human genome: 3.12 × 10^6 kb; d_n, genomic window size [kb] for CIS of nth order: d_2 = 30, d_3 = 50, and d_4 = 100; n_is, number of (assumed) sampled integration sites: 1000.

Statistical inferences, such as the calculation of p-values, can be based on the observation that, under the null hypothesis (H_0) of random uniform allocation of the IS, the number of CIS of order n is (approximately) Poisson distributed with parameter λ = E_n. Thus, if the random variable X denotes the number of CIS of order n, and X = k is observed in a trial, then the p-value P(X≥k) of this observation calculated under H_0, i.e. from the Poisson distribution Po(E_n), is given by

$$P(X \ge k) = 1 - \sum_{i=0}^{k-1} e^{-E_n}\,\frac{E_n^{\,i}}{i!} = P(\chi^2 \le 2E_n)$$

where the random variable χ² has a chi-square distribution with 2k degrees of freedom [32], [33].

The Poisson approximation to the true random distribution of CIS is exceedingly close. In fact, if the number of simulation runs is sufficiently high, the simulated distribution is virtually indistinguishable from Po(E_n). In particular, both the expected values and the p-values derived from Po(E_n) are nearly identical to those obtained in computer simulations. The latter point is apparent from Table 2, where, as a final proof of principle of our mathematical calculations, the results of the analysis of our integration data set retrieved from two clinical SCID-X1 therapy trials [unpublished data] are given.

Table 2. Comparative analysis of mean values and p-values obtained computationally (‘Simulation’) or mathematically (‘Formula’).

CIS | IS | MV Simulation | MV Formula | p-Value Simulation | p-Value Formula
3 | 140 | 0.188 | 0.190 | 0.0009 | 0.001
1 | 134 | 0.175 | 0.174 | 0.16 | 0.16
4 | 102 | 0.100 | 0.101 | 0 | 3.9 × 10^−6
15 | 304 | 0.899 | 0.900 | 0 | 6.8 × 10^−14
102 | 572 | 3.200 | 3.193 | 0 | <10^−16

The results refer to the presence of CIS detected in 2 clinical X-SCID gene therapy studies [unpublished data]. Simulations were performed with 50000 runs on the haploid size of the human genome (3.12 × 10^6 kb). P-values estimated from simulations equal the proportion of the 50000 runs in which the number of CIS was at least as high as the number observed in the trials. The genomic window size chosen for CIS of 2nd order was 30 kb. CIS, number of identified CIS of 2nd order in patient and control samples pre- and post-transplant; IS, number of all unique identified integration sites in patient and control samples pre- and post-transplant; MV, mean value.

The p-value can be calculated with either of the following commands (‘R’ code): 1-ppois(lambda = En, q = k-1) or pchisq(df = 2*k, q = 2*En). For the data in the first line of Table 2, these become 1-ppois(lambda = 0.19, q = 2) or pchisq(df = 6, q = 0.38). In both instances, the result is 0.00099. Alternatively, a table of the chi-square distribution with 6 degrees of freedom can be used to look up the probability P(χ² ≤ 0.38). One should note that, for low E_n, the p-value of a single observed CIS is virtually identical to E_n. This implies that, for n>5, no p-values need to be calculated (and hence no formulas are required for E_n, n>5), because even with an extremely liberal definition of the CIS (d_5 = 500) and a fairly high number of IS (n_is = 1000) a single CIS of order 5 will be statistically significant (p = 0.027).
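For readers without access to R, the same Poisson tail probability can be obtained with a few lines of standard-library Python. This is a minimal sketch rather than the authors' implementation: the function name and the fixed number of summed terms are assumptions of the example.

```python
from math import exp


def poisson_p_value(k, expected):
    """P(X >= k) for X ~ Poisson(expected), summed directly over the upper tail."""
    if k <= 0:
        return 1.0
    # term = P(X = k), built up iteratively to avoid large factorials
    term = exp(-expected)
    for i in range(1, k + 1):
        term *= expected / i
    # add P(X = k), P(X = k+1), ...; 200 terms is ample for the small
    # expected values considered here (an assumption of this sketch)
    total = 0.0
    for i in range(k + 1, k + 201):
        total += term
        term *= expected / i
    return total


if __name__ == "__main__":
    # First line of Table 2: k = 3 observed CIS, E_n = 0.19
    print(poisson_p_value(3, 0.19))
```

Summing the upper tail directly, instead of computing one minus the lower tail, avoids the loss of precision that would otherwise affect very small p-values such as those in the lower rows of Table 2.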

Part 2: Non-uniform allocation of IS

Defining non-randomness in the clustering of genomic events often requires additional precautions, as the sequence structures of interest may already have known distribution biases. In the case of our clinical example (unpublished data), it is known that retroviral vectors based on the murine leukaemia virus (MLV) tend to integrate into gene coding regions, preferentially near the transcriptional start site (TSS) [34]–[36]. It has also been proposed that additional, mostly unknown factors may influence the accessibility of certain genomic DNA regions to vectors [37]. Thus, the null hypothesis of random uniform allocation of MLV IS may not be adequate according to the current state of the art, as has recently been argued [31]. In line with that study, we partitioned the genome into 2 areas that differ in their likelihood of being targeted by vectors.

Further terminology

n_TSS: number of TSS

T5: an interval of +/−5 kb around a TSS

GT5: union of all T5

n_is,GT5, n_is,Comp: number of IS occurring in GT5 and in the complement of GT5, respectively

n_cis,GT5, n_cis,Mix, n_cis,Comp: number of CIS occurring in GT5 only, in both GT5 and the complement of GT5, and in the complement of GT5 only, respectively

Clearly, the expected value E_n of the number of CIS of order n is given by the following sum:

$$E_n = E(n_{cis,GT5}) + E(n_{cis,Mix}) + E(n_{cis,Comp}) \tag{5}$$

In the following it will be shown how to calculate the terms on the right side of (5). We start with the expected value of n_cis,GT5, for which we assume that vector integration into any T5 occurs with the same probability. Then

$$E(n_{cis,GT5}) = n_{TSS}\; E(X) \tag{6}$$

where X is the number of CIS (among those occurring in GT5) that occur in a fixed T5. Observing that i IS in a fixed T5 yield $\binom{i}{n}$ CIS of order n in this T5, one easily obtains the expected value of X:

$$E(X) = \sum_{i=n}^{n_{is,GT5}} \binom{i}{n}\, P(i\ \text{IS in the fixed T5}) \tag{7}$$

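Since the number of IS falling into a fixed T5 is binomially distributed, ∼ B(n_is,GT5, 1/n_TSS),

$$P(i\ \text{IS in the fixed T5}) = \binom{n_{is,GT5}}{i}\left(\frac{1}{n_{TSS}}\right)^{i}\left(\frac{n_{TSS}-1}{n_{TSS}}\right)^{n_{is,GT5}-i} \tag{8}$$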

Merging equations (6)–(8) yields the desired formula for E(n cis,GT5):

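$$E(n_{cis,GT5}) = n_{TSS}\sum_{i=n}^{n_{is,GT5}} \binom{i}{n}\binom{n_{is,GT5}}{i}\left(\frac{1}{n_{TSS}}\right)^{i}\left(\frac{n_{TSS}-1}{n_{TSS}}\right)^{n_{is,GT5}-i} \tag{9}$$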

If n is,GT5 is small compared to n TSS (undoubtedly, this is mostly the case), terms of higher order can be neglected so that, because (n TSS–1)/n TSS≈1, formula (9) simplifies to

$$E(n_{cis,GT5}) \approx \binom{n_{is,GT5}}{n}\,\frac{1}{n_{TSS}^{\,n-1}} \tag{10}$$

Notice that formulas (6)–(10) do not depend on the spatial distribution of the IS within the T5. (It is unnecessary to account for the closeness of IS within T5 because any pair – or triple, quadruple etc., for that matter – of IS within a T5 yields a CIS.)
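The binomial argument for E(n_cis,GT5) can be checked numerically. The sketch below evaluates the full binomial sum and the leading term obtained by keeping only i = n and setting (n_TSS–1)/n_TSS to 1. The function names and the value of n_TSS are hypothetical, since the text does not state the number of TSS; the IS count corresponds to 25% of 140 IS, as assumed for Table 3.

```python
from math import comb


def expected_cis_gt5_exact(n_is_gt5, n, n_tss):
    """n_TSS times the binomial expectation of C(i, n), i = IS count in one T5."""
    p = 1.0 / n_tss
    expectation = 0.0
    for i in range(n, n_is_gt5 + 1):
        prob_i = comb(n_is_gt5, i) * p**i * (1.0 - p) ** (n_is_gt5 - i)
        expectation += comb(i, n) * prob_i
    return n_tss * expectation


def expected_cis_gt5_leading(n_is_gt5, n, n_tss):
    """Leading term only (i = n), with (n_TSS - 1)/n_TSS set to 1."""
    return comb(n_is_gt5, n) / n_tss ** (n - 1)


if __name__ == "__main__":
    n_tss = 20_000   # hypothetical number of TSS; the text gives no value
    n_is_gt5 = 35    # 25% of 140 IS, following the Table 3 assumptions
    print(expected_cis_gt5_exact(n_is_gt5, 2, n_tss))
    print(expected_cis_gt5_leading(n_is_gt5, 2, n_tss))
```

Under these assumptions the two values agree very closely, which suggests that little is lost by using the simplified expression when n_is,GT5 is small compared to n_TSS.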

Clearly, the expected value E(n_cis,Mix) is not independent of the distance between the IS and the TSS. Thus, inevitably, assumptions regarding the spatial distribution of the IS will influence its value. In what follows, a formula for E(n_cis,Mix) is derived for the case n = 2. As before, CIS of order 2 are defined by a maximum distance d_2 of 30 kb between the IS.

If the TSS are indistinguishable with respect to the probability distribution of the integrations, then

$$E(n_{cis,Mix}) = n_{TSS}\; n_{is,GT5}\; n_{is,Comp}\; p_{Mix} \tag{11}$$

where p Mix denotes the probability that an arbitrary pair of IS (with one element in GT5 and one element in the complement of GT5) forms a CIS of order 2 around a fixed TSS.

We will assume that the distributions of IS within a T5 and within +/-35 kb around a TSS are symmetric. Then, again using kb as unit of distance,

graphic file with name pone.0000570.e023.jpg (12)

In formula (12) the points x = 0 and y = 0 correspond to the TSS-5; f(x) designates the probability density function of vector integrations in T5; and g(y) designates the corresponding density function in [TSS-35, TSS-5].

Formula (12) shall be evaluated for two special cases:

Case 1: Vector integrations are uniformly distributed in GT5 and in the complement of GT5, respectively. I.e.,

graphic file with name pone.0000570.e024.jpg
graphic file with name pone.0000570.e025.jpg

Solving the integrals in formula (12) we have

graphic file with name pone.0000570.e026.jpg (13)

Case 2: As above, vector integrations in the complement of GT5 are assumed to be uniformly distributed. However, a triangular distribution is assumed for f(x). The corresponding formula is easily calculated:

graphic file with name pone.0000570.e027.jpg

By plugging this into (12) we get

graphic file with name pone.0000570.e028.jpg (14)

It may be surprising that a triangular distribution in T5 results in a higher expected value for n_cis,Mix than a uniform distribution. However, this becomes more plausible if one notes that a higher value is also obtained if the IS are concentrated in an extreme manner within the T5, viz. in a one-point distribution with total mass in the TSS. In this special case (which is particularly easy to evaluate), p_Mix = 50/(n_TSS(g–n_TSS)).

If, with respect to the formation of CIS, the complement of GT5 could be regarded as a continuum, the expected value of n_cis,Comp would be given by the formulas developed in Part 1 of this contribution. In the case of retroviral (MLV) vectors, however, the complement of GT5 must instead be viewed as a partitioned set consisting of approximately n_TSS disjoint intervals. It follows that the residual term on the right-hand side of equation (4) (Part 1) may no longer be negligible. Note, however, that the assumption of a continuum tends to overestimate the number of CIS, because the boundaries of the components reduce the number of CIS occurring in their neighborhood. The formulas derived in Part 1 therefore form an upper bound for E(n_cis,Comp). In particular, the true p-values are less than or equal to the values calculated by means of the formulas derived in Part 1, so any positive statements regarding statistical significance remain valid. Moreover, the overestimation is probably fairly small, given that the sections of the complement of GT5 located between the TSS are mostly rather wide compared to the length defining a CIS.

Indeed, the null hypothesis of non-uniform allocation for IS distribution does not substantially change the results we have obtained based on the hypothesis of a random uniform allocation for CIS formation in our clinical samples (Table 2), as is shown in Table 3.

Table 3. Formula-based statistical analysis of the results on CIS formation in clinical samples derived from 2 clinical X-SCID gene therapy studies [unpublished data].

CIS | IS | MV Uniform* | MV Triangular§ | p-Value Uniform* | p-Value Triangular§
3 | 140 | 0.191 | 0.212 | 0.001 | 0.0014
1 | 134 | 0.175 | 0.195 | 0.161 | 0.177
4 | 102 | 0.101 | 0.124 | 4.0 × 10^−6 | 6.1 × 10^−6
15 | 304 | 0.905 | 1.006 | 7.4 × 10^−14 | 3.3 × 10^−13
102 | 572 | 3.212 | 3.568 | <10^−16 | <10^−16

Calculations were performed on the haploid size of the human genome (3.12 × 10^6 kb) and on the basis of an IS skewing (25% of all IS) to the +/− 5 kb TSS region, for which an (*) uniform or a (§) triangular IS distribution, respectively, was assumed. 75% of IS were assumed to be uniformly distributed over the remaining human genome. The genomic window size chosen for CIS of 2nd order was 30 kb. CIS, number of identified CIS of 2nd order in patient and control samples pre- and post-transplant; IS, number of all unique identified integration sites in patient and control samples pre- and post-transplant; MV, mean value.

Our mathematical formulas allow a reliable, straightforward calculation of non-randomness in CIS and other genomic event distributions under the null hypothesis of either uniform or non-uniform allocation. Using formula-based workspaces (available on request), expected values and p-values can be calculated with ease in real time. The formulas may be preferable to computer simulations when (routine) high-speed processing of large numbers of analyses is needed. Our approach enables a problem-oriented, highly exact evaluation of non-randomness that is useful for assessing IS distribution in clinical trials and for assessing the distribution of any sequence motif of interest in a natural or artificial genome.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This work was supported by the Deutsche Forschungsgemeinschaft (Grant SPP1230), the German Ministry of Education and Research (Grant TREATID) and the EU (VIth Framework Program, CONSERT and CLINIGENE).

References

1. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, et al. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195. doi: 10.1126/science.287.5461.2185.
2. Lander ES, Linton LM, Birren B, Nusbaum C, Zody M, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
3. Holt RA, Subramanian GM, Halpern A, Sutton GG, Charlab R, et al. The genome sequence of the malaria mosquito Anopheles gambiae. Science. 2002;298:129–149. doi: 10.1126/science.1076181.
4. Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, et al. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–562. doi: 10.1038/nature01262.
5. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380. doi: 10.1038/nature03959.
6. Camargo AA, Samara HP, Dias-Neto E, Simao DF, Bigotto IA. The contribution of 700,000 ORF sequence tags to the definition of the human transcriptome. Proc Natl Acad Sci U S A. 2001;98:12103–12108. doi: 10.1073/pnas.201182798.
7. Riva A, Delorme M-O, Chevalier T, Guilhot N, Henaut C, et al. The difficult interpretation of transcriptome data: the case of the GATC regulatory network. Computational Biology and Chemistry. 2004;28:109–118. doi: 10.1016/j.compbiolchem.2003.12.004.
8. Gerhard DS, Wagner L, Feingold EA, Shenmen CM, Grouse LH, et al. The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). Genome Res. 2004;14:2121–2127. doi: 10.1101/gr.2596504.
9. Ota T, Suzuki Y, Nishikawa T, Otsuki T, Sugiyama T, et al. Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat Genet. 2004;36:40–45. doi: 10.1038/ng1285.
10. Wang J, Song L, Gonder MK, Azrak S, Ray DA, et al. Whole genome computational comparative genomics: A fruitful approach for ascertaining Alu insertion polymorphisms. Gene. 2006;365:11–20. doi: 10.1016/j.gene.2005.09.031.
11. Garrigan D, Hammer MF. Reconstructing human origins in the genomic era. Nat Rev Genet. 2006;7:669–680. doi: 10.1038/nrg1941.
12. Subramanian S, Madgula VM, George R, Mishra RK, Pandit MW, et al. Triplet repeats in human genome: distribution and their association with genes and other genomic regions. Bioinformatics. 2003;19:549–552. doi: 10.1093/bioinformatics/btg029.
13. Subramanian S, Mishra RK, Singh L. Genome-wide analysis of Bkm sequences (GATA repeats): predominant association with sex chromosomes and potential role in higher order chromatin organization and function. Bioinformatics. 2003;19:681–685. doi: 10.1093/bioinformatics/btg067.
14. Miranda KC, Huynh T, Tay Y, Ang YS, Tam WL, et al. A pattern-based method for the identification of MicroRNA binding sites and their corresponding heteroduplexes. Cell. 2006;126:1203–1217. doi: 10.1016/j.cell.2006.07.031.
15. Bakker EG, Toomajian C, Kreitman M, Bergelson J. A genome-wide survey of R gene polymorphisms in Arabidopsis. Plant Cell. 2006;18:1803–1818. doi: 10.1105/tpc.106.042614.
16. Hacein-Bey-Abina S, von Kalle C, Schmidt M, McCormack MP, Wulffraat N, et al. LMO2-associated clonal T cell proliferation in two patients after gene therapy for SCID-X1. Science. 2003;302:415–419. doi: 10.1126/science.1088547.
17. Hacein-Bey-Abina S, von Kalle C, Schmidt M, Le Deist F, Wulffraat N, et al. A serious adverse event after successful gene therapy for X-linked severe combined immunodeficiency. N Engl J Med. 2003;348:255–256. doi: 10.1056/NEJM200301163480314.
18. Du Y, Jenkins NA, Copeland NG. Insertional mutagenesis identifies genes that promote the immortalization of primary bone marrow progenitor cells. Blood. 2005;106:3932–3839. doi: 10.1182/blood-2005-03-1113.
19. Hematti P, Hong BK, Ferguson C, Adler R, Hanawa H, et al. Distinct genomic integration of MLV and SIV vectors in primate hematopoietic stem and progenitor cells. PLoS Biology. 2004;2:e423. doi: 10.1371/journal.pbio.0020423.
20. Calmels B, Ferguson C, Laukkanen MO, Adler R, Faulhaber M, et al. Recurrent retroviral vector integration at the MDS1-EVI1 locus in non-human primate hematopoietic cells. Blood. 2005;106:2530–2533. doi: 10.1182/blood-2005-03-1115.
21. Kustikova O, Fehse B, Modlich U, Yang M, Dullmann J, et al. Clonal dominance of hematopoietic stem cells triggered by retroviral gene marking. Science. 2005;308:1171–1174. doi: 10.1126/science.1105063.
22. Ott MG, Schmidt M, Schwarzwaelder K, Stein S, Siler U, et al. Correction of X-linked chronic granulomatous disease by gene therapy is augmented by insertional activation of MDS/EVI1, PRDM16 or SETBP1. Nat Med. 2006;12:401–409. doi: 10.1038/nm1393.
23. Li X, Düllmann J, Schiedlmeier B, Schmidt M, von Kalle C, et al. Murine leukemia induced by retroviral gene marking. Science. 2002;296:497. doi: 10.1126/science.1068893.
24. Modlich U, Kustikova OS, Schmidt M, Rudolph C, Meyer J, et al. Leukemias following retroviral transfer of multidrug resistance 1 (MDR1) are driven by combinatorial insertional mutagenesis. Blood. 2005;105:4235–4246. doi: 10.1182/blood-2004-11-4535.
25. Montini E, Cesana D, Schmidt M, Sancito F, Ponzoni M, et al. Hematopoietic stem cell gene transfer in a tumor-prone mouse model uncovers low genotoxicity of lentiviral vector integration. Nat Biotechnol. 2006;24:687–696. doi: 10.1038/nbt1216.
26. Mikkers H, Allen J, Knipscheer P, Romeijn L, Hart A, et al. High-throughput retroviral tagging to identify components of specific signalling pathways in cancer. Nat Genet. 2002;32:153–159. doi: 10.1038/ng950.
27. Lund AH, Turner G, Trubetskoy A, Verhoeven E, Wientjens E, et al. Genome-wide retroviral insertional tagging of genes involved in cancer in Cdkn2a-deficient mice. Nat Genet. 2002;32:160–165. doi: 10.1038/ng956.
28. Suzuki T, Shen H, Akagi K, Morse HC, Malley JD, et al. New genes involved in cancer identified by retroviral tagging. Nat Genet. 2002;32:166–174. doi: 10.1038/ng949.
29. Cavazzana-Calvo M, Hacein-Bey S, de Saint Basile G, Gross F, Yvon E, et al. Science. 2000;288:669–672. doi: 10.1126/science.288.5466.669.
30. Gaspar HB, Parsley KL, Howe S, King D, Gilmour KC, et al. Gene therapy of X-linked severe combined immunodeficiency by use of a pseudotyped gammaretroviral vector. Lancet. 2004;364:2181–2187. doi: 10.1016/S0140-6736(04)17590-9.
31. Wu X, Luke BT, Burgess SM. Redefining the common insertion site. Virology. 2006;344:292–295. doi: 10.1016/j.virol.2005.08.047.
32. Hartung J, Elpelt B, Klösener K-H. 1987. Statistik. (Oldenbourg Verlag, München-Wien).
33. Dudewicz EJ, Mishra SN. 1988. Modern Mathematical Statistics. (Wiley, New York).
34. Wu X, Li Y, Crise B, Burgess SM. Transcription start regions in the human genome are favored targets for MLV integration. Science. 2003;300:1749–1751. doi: 10.1126/science.1083413.
35. Mitchell RS, Beitzel BF, Schroder AR, Shinn P, Chen H, et al. Retroviral DNA integration: ASLV, HIV, and MLV show distinct target site preferences. PLoS Biology. 2004;2:e234. doi: 10.1371/journal.pbio.0020234.
36. Laufs S, Gentner B, Nagy KZ, Jauch A, Benner A, et al. Retroviral vector integration occurs in preferred genomic targets in human bone marrow repopulating cells. Blood. 2003;101:2191–2198. doi: 10.1182/blood-2002-02-0627.
37. Bushman FD. Targeting survival: Integration site selection by retroviruses and LTR-retrotransposons. Cell. 2003;115:135–138. doi: 10.1016/s0092-8674(03)00760-8.
