Molecular Therapy. 2021 Oct 29;29(12):3328–3331. doi: 10.1016/j.ymthe.2021.10.022

Challenges in estimating numbers of vectors integrated in gene-modified cells using DNA sequence information

Frederic D Bushman 1, Adrian Cantu 1, John Everett 1, Denise Sabatino 2,3, Charles Berry 4
PMCID: PMC8636165  PMID: 34717818

Main text

In evaluating outcomes in human gene therapy, it can be of interest to estimate the absolute numbers of vector copies integrated into the genomes of patient cells. Unfortunately, all methods for quantification of population sizes have limitations, some of which may not be fully appreciated by the field. In one approach, DNA sequencing can be used to enumerate junctions between vector and host sequences in patient specimens. However, a second technical replicate from the same DNA specimen almost always yields numerous integration sites not recovered in the first. Thus, reconstructing the full number requires mathematical modeling to infer the full population size from partial samples. Many methods have been proposed for this; the Chao1 estimator is often used for integration site data. However, complications of sampling lead to estimates that represent only a lower bound on the true number. In addition, choices of methods for quality control may have large effects on population bounds. After accounting for the number of cells sampled, a lower bound can be proposed for the number of cells harboring an integrated vector. Other methods for estimating the numbers of gene-modified cells are also described, but these too have considerable uncertainties, emphasizing that accurate population size estimation remains a challenge.

In assessing safety and outcomes in gene therapy, it can be useful to assess the distributions of vectors integrated in host cell chromosomes. Such data can be used to monitor for possible expansion of gene-modified cell clones, to assess potential mechanisms of insertional mutagenesis, and to compare target site preferences for different vector types. Another goal, technically more difficult, is quantifying the absolute number of cells in a sample that become gene modified by vector integration.

Integration site analysis

Boundaries on population sizes of cells harboring integrated vectors can be estimated using data from deep sequencing of host-vector junctions. In a typical protocol, a specimen of gene-modified cells is recovered from a subject after gene correction and DNA is purified for analysis. To characterize integration sites by DNA sequencing, DNA specimens are first sheared by sonication, which breaks the DNA at random positions.1, 2, 3 DNA linkers are then ligated onto the sheared DNAs, and PCR amplification is carried out using primers that bind to the edge of the vector and the ligated linker. PCR products are then analyzed by deep sequencing, and junction sequences mapped onto the genome of the host organism, providing detailed data on the distribution of sites of vector integration. Protocols have been published by several groups.2, 3, 4, 5, 6

This method allows quantification of clonal structure, which can be read out by quantifying fragment lengths (Figure 1A).1, 2, 3, 4,6 For an expanded clone, purification of DNA yields many DNA chains from the expanded cells, each with the same unique integration site position. To quantify clone size, one can quantify the number of positions of linker ligation associated with each unique integration site after shearing and ligation. Counting the linker positions then provides an estimate of relative abundance. This provides a useful method for quantifying potential clonal outgrowth.

Figure 1.

Use of DNA fragment length data to document independent capture of unique integration sites (nicknamed “sonic abundance”) and to assess capture frequency and population sizes in samples from gene therapy trials

(A) DNA is purified from gene-modified cells (top), and sheared by sonication. The free DNA ends are then ligated to DNA adaptors. PCR is subsequently carried out, followed by deep sequencing of the PCR products. Example sequences are shown at the bottom. The number of adaptor positions provides a measure of the number of cells contributing to each recovered unique integration site. The protocol has been described in detail.2,3 (B and C) Venn diagrams documenting sparse sampling based on quantifying repeated capture of unique integration sites over four technical replicates. Aliquots from single DNA specimens were sampled four times independently, and the numbers of integration sites shared between replicates tabulated. (B) A sample from a gammaretroviral-mediated stem cell gene therapy trial correcting ADA deficiency.7 (C) A sample from a lentiviral-mediated stem cell gene therapy trial correcting Wiskott-Aldrich syndrome.8 (D and E) Examples of integration site datasets from human subjects used to infer minimal population sizes using the Chao1 estimator. For each panel, the x axis shows the number of times each unique integration site was isolated, and the y axis shows the proportion of unique sites in each abundance class. The number of unique sites called and the Chao1 estimate for the lower bound on population size are shown on each panel. (D) Assay of the above sample from a gammaretroviral-mediated stem cell gene therapy trial correcting ADA deficiency.7 (E) Assay of the above sample from a lentiviral-mediated stem cell gene therapy trial correcting Wiskott-Aldrich syndrome.8 (F–H) Examples of integration site datasets from dog liver transduced with AAV.12 Samples are labeled as for (D and E). In this case only single replicates were assessed. Comparing results for different samples of tissue from the same liver illustrates regional heterogeneity in lower bounds on population sizes.

Reconstruction experiments show that this yields more accurate quantification of clonal structure than counting sequence reads. For example, short DNA molecules amplify by PCR more readily than long molecules, derailing quantification based on read counts alone.1,6
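The contrast between read counts and fragment-length abundance can be illustrated with a minimal sketch. The record layout (an integration site label paired with a shear position) and the function name `sonic_abundance` are hypothetical and chosen only for illustration; published pipelines additionally apply statistical corrections that are not reproduced here.

```python
from collections import defaultdict

def sonic_abundance(reads):
    """Return {site: (distinct shear positions, raw read count)}.

    `reads` is an iterable of (site, shear_pos) tuples, one per sequence read.
    Both fields are hypothetical placeholders, not a real pipeline format.
    """
    shear_positions = defaultdict(set)
    read_counts = defaultdict(int)
    for site, shear_pos in reads:
        shear_positions[site].add(shear_pos)  # distinct linker ligation points
        read_counts[site] += 1                # every read, including PCR copies
    return {site: (len(pos), read_counts[site])
            for site, pos in shear_positions.items()}

# Invented toy data: one short fragment amplified four times,
# versus three independently sheared fragments read once each.
reads = [
    ("chr1:1000+", 1120), ("chr1:1000+", 1120),
    ("chr1:1000+", 1120), ("chr1:1000+", 1120),
    ("chr7:5000-", 4800), ("chr7:5000-", 4650), ("chr7:5000-", 4410),
]
result = sonic_abundance(reads)
```

In this toy input the first site has more reads (four versus three) but only one distinct shear position, whereas the second site has three, so the fragment-length measure ranks the second site as more abundant.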

Sparse sampling

However, a considerable complication is that sampling by deep sequencing is almost always incomplete. This can be seen by comparing the integration sites recovered after repeatedly assaying a single DNA specimen from gene-modified patient cells. Figures 1B–1E show the results of analyzing integration sites from patient specimens, each analyzed as four technical replicates. The two examples show a specimen from an ADA therapy trial in humans using a gammaretroviral vector to correct hematopoietic stem cells7 (Figures 1B and 1D), and a specimen from a Wiskott-Aldrich therapy trial in humans using a lentiviral vector to correct hematopoietic stem cells8 (Figures 1C and 1E). In each case, integration sites found in all four replicates are relatively rare, while sites seen in one replicate only are the most common. As a result, the expectation is that further sampling would yield additional unique integration sites. Thus, it would be a mistake to assume that simply counting the numbers of integration sites recovered in such sequence samples yields the true number.
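The tabulation behind such a replicate comparison can be sketched as follows. This is an illustrative sketch only: the function name `replicate_sharing` and the site labels are invented, and the toy data are not taken from either trial.

```python
from collections import Counter

def replicate_sharing(replicates):
    """Return {k: number of unique sites detected in exactly k replicates}."""
    seen_in = Counter()
    for sites in replicates:
        seen_in.update(set(sites))  # a replicate counts each site at most once
    return dict(Counter(seen_in.values()))

# Invented toy data: four technical replicates from one DNA specimen.
replicates = [
    {"s1", "s2", "s3", "s4"},
    {"s1", "s5", "s6"},
    {"s1", "s2", "s7"},
    {"s1", "s8"},
]
sharing = replicate_sharing(replicates)
```

Here six of the eight sites appear in a single replicate and only one appears in all four, the same qualitative pattern of sparse sampling described above.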

Estimating population sizes

Methods for estimating the sizes of populations based on partial subsamples have been developed in population biology and other fields.1,9, 10, 11 For example, consider the problem of estimating the size of a population of fish in a pond. For this, a researcher may capture a sample of fish, tag them, release them back into the pond, allow them to mix with untagged fish, then capture another sample. The proportion of tagged fish in the second sample provides a basis for estimating the number of total fish in the pond.
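The fish-pond reasoning corresponds to the classical Lincoln-Petersen mark-recapture estimate, which is named here for orientation and is not a method the text applies to integration site data. A minimal sketch, with invented numbers and an assumed function name:

```python
def lincoln_petersen(tagged_first, caught_second, tagged_in_second):
    """Estimate total population size from a two-sample mark-recapture design.

    Assumes a closed, well-mixed population in which every individual is
    equally likely to be caught.
    """
    if tagged_in_second == 0:
        raise ValueError("no tagged individuals recaptured; estimate undefined")
    return tagged_first * caught_second / tagged_in_second

# Invented example: tag 100 fish, later catch 80, of which 20 carry tags.
estimate = lincoln_petersen(tagged_first=100, caught_second=80, tagged_in_second=20)
```

Because a quarter of the second catch is tagged, the 100 tagged fish are inferred to be about a quarter of the pond, giving an estimate of 400.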

The Chao1 estimator9 has been used to estimate the sizes of populations of integrated vectors in gene therapy.1,2 This estimator uses the numbers of integration sites detected once or twice to estimate the number seen zero times. Adding the inferred number seen zero times to the number observed provides a higher estimate of population size closer to the true value.
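As a minimal sketch of the calculation, assuming the input is a list of per-site abundances (for example, fragment-length counts), Chao1 adds f1²/(2·f2) to the number of observed sites, where f1 and f2 are the numbers of sites seen exactly once and exactly twice. The function name and the fallback used when no site is seen twice (the commonly used bias-corrected form) are choices made for this illustration, not details taken from any specific pipeline.

```python
from collections import Counter

def chao1(abundances):
    """Chao1 lower-bound estimate of the number of distinct sites."""
    counts = [a for a in abundances if a > 0]
    s_obs = len(counts)            # sites actually observed
    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]      # singletons and doubletons
    if f2 > 0:
        unseen = f1 * f1 / (2 * f2)
    else:
        unseen = f1 * (f1 - 1) / 2  # bias-corrected form when f2 == 0
    return s_obs + unseen

# Invented example: 6 sites seen once, 3 seen twice, 2 expanded clones.
estimate = chao1([1] * 6 + [2] * 3 + [5, 12])
```

With six singletons and three doubletons among 11 observed sites, the sketch infers 36/6 = 6 unseen sites, for a lower-bound estimate of 17.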

Complications in estimating population size

However, just as not all fish are equally easy to catch, an integration site from an expanded clone will be more readily captured than a site from an unexpanded lineage. A consequence of this variable capture probability is that population size estimates represent a lower bound on the population size, and not an exact measure. In addition, a considerable danger in this type of analysis is inflation of sequence variants due to sequence error.10 Sequence data processing involves condensing different sequences that could plausibly have diverged due to PCR errors or errors in sequence determination.2,3 There are multiple types of error and approaches to their correction, and specific choices can have large effects on the estimated population boundaries.2, 3, 4, 5, 6
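Why unequal capture probability turns the estimate into a lower bound can be shown with a small calculation. The sketch below assumes, purely for illustration, that each site is captured a Poisson-distributed number of times with its own rate, and applies the Chao1 formula to the expected counts; the rates and the function name are invented.

```python
import math

def expected_chao1(capture_rates):
    """Chao1 applied to expected counts under per-site Poisson capture.

    Assumption (not from the text): site i is captured Poisson(rate_i) times.
    """
    s_obs = sum(1 - math.exp(-r) for r in capture_rates)        # seen >= once
    f1 = sum(r * math.exp(-r) for r in capture_rates)           # seen once
    f2 = sum(r * r / 2 * math.exp(-r) for r in capture_rates)   # seen twice
    return s_obs + f1 * f1 / (2 * f2)

true_sites = 1000
# Same mean capture rate (0.5) in both scenarios.
uniform = expected_chao1([0.5] * true_sites)
skewed = expected_chao1([0.1] * 900 + [4.1] * 100)  # 100 "expanded clones"
```

Under these assumptions the uniform case recovers the true 1,000 sites, whereas the skewed case, with the same average sampling depth, returns a value well below 1,000 (roughly 400 with these invented rates). By the Cauchy-Schwarz inequality, f1²/(2·f2) computed on expected counts can never exceed the expected number of unseen sites, which is why the result is read as a lower bound.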

Another complication is that mammalian genomes are rich in repeated sequences such as LINEs, SINEs, and HERVs; integration sites mapped in these sequences can thus map to many locations on the target genome. Some analytical pipelines simply discard these sequences, even though they are authentic integration sites that could even be involved in vector-mediated adverse events. One approach is to identify these sequences and subject them to a separate analysis.2,3 For population size estimation, repeated sequences need to be specifically quantified and added back to avoid artificial undercounting—this does not yet seem to have been done in any study.

Exactly what are we estimating the population of?

Commonly a goal of the analysis is to estimate how many sites of vector integration there are in a whole tissue, or all the cells circulating in blood. Unfortunately, there is again a considerable complication involving the degree to which the cells sampled are well mixed within the organism. At face value, one might focus on the total amount of DNA sampled for integration site distribution in a specimen and use this to estimate the number of genome equivalents sampled. For example, the weight of a diploid human genome is 7 picograms—dividing the weight of DNA used for integration site analysis by 7 picograms yields the number of cells sampled. Dividing the estimated number of integrated vectors by the number of cells sampled roughly yields a proportion of the cells sampled with integrated vectors. Slightly confounding this is the fact that more than one vector may integrate in a single cellular genome, so the simple division above is not strictly correct, but this is not a major factor when cells were initially transduced at a low multiplicity of infection.
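The arithmetic in this paragraph can be written out as a short sketch. The 7 pg figure is the value quoted above; the DNA input, the site estimate, and the function names are invented for illustration, and the calculation assumes one integrated vector per modified cell.

```python
PG_PER_DIPLOID_GENOME = 7.0  # value quoted in the text

def cells_sampled(dna_ng):
    """Genome equivalents in a DNA input given in nanograms (1 ng = 1000 pg)."""
    return dna_ng * 1000.0 / PG_PER_DIPLOID_GENOME

def fraction_modified(estimated_sites, dna_ng):
    """Rough lower bound on the fraction of sampled cells carrying a vector.

    Assumes one integrated vector per modified cell (low multiplicity of
    infection), as discussed in the text.
    """
    return estimated_sites / cells_sampled(dna_ng)

# Invented example: 700 ng of DNA and a lower-bound estimate of 5,000 sites.
cells = cells_sampled(700.0)
frac = fraction_modified(5000, 700.0)
```

In this invented example, 700 ng of DNA corresponds to 100,000 genome equivalents, so 5,000 estimated sites implies that at least about 5% of the sampled cells carry an integrated vector.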

However, a complex question turns on exactly what was sampled. For stem cell gene therapy, where stem cells are transduced ex vivo and then reinfused, and descendant cells are well mixed throughout the body, the population sampled probably corresponds to all the long-term repopulating cells from the initial transduction mixture. For a solid tissue, however, such as a liver sample after in vivo AAV-mediated gene correction, comparable mixing is unlikely, and the population in question is likely just the piece of tissue sampled. The biology of the tissue type studied may therefore have a large effect on how mixing should be treated, and consequently on the biological interpretation of the population sampled.

Examples of bounds on population sizes

Some examples of lower bounds on population sizes are shown in Figures 1D–1H. Shown are samples from gene therapy, including (1) ADA therapy in humans with a gammaretroviral vector in hematopoietic stem cells (Figure 1D),7 (2) Wiskott-Aldrich syndrome therapy in humans with a lentiviral vector in hematopoietic stem cells (Figure 1E),8 and (3) hemophilia therapy in dogs with an AAV vector transducing factor VIII (Figures 1F–1H).12 Plotted are the frequencies of capture of each integration site using the fragment length “sonic abundance” measure (x axis). The y axis shows the proportion of all integration sites in each abundance category. The figure emphasizes that most integration sites were captured only once in all samples. Thus, further sampling would be expected to yield many new integration sites, and this is reflected in the higher estimates of minimal population size by Chao1, in this case applicable to all long-term repopulating cells in the subject.

For gene transfer into solid tissues, different samples from the same tissue can have different-sized populations of integrated vectors, since cells will commonly not be well mixed. Figures 1F–1H show an analysis of three liver specimens from canine gene therapy with AAV,12 which yielded inferred lower bounds on population sizes ranging from 43 to 211 sites in separate specimens from the same liver.

Other methods

Numerous additional mathematical approaches have been applied to estimating population sizes from subsamples, some of which may be useful to test in the gene therapy field.10,11 As an additional complication, empirical tests show that use of other mathematical measures for estimating the population sizes (e.g., adjusted Chao10) yields estimates that are usually at least a little different from Chao1, and sometimes very different.

Several other techniques can be used for estimating bounds on population sizes of integrated vectors, but these have severe limitations as well. It is possible of course to characterize the number of vector sequences in a specimen using qPCR, but this does not distinguish between integrated and unintegrated vector DNAs. If a vector produces a product recognizable in tissue specimens, such as GFP, then the numbers of expression-positive cells can be counted. However, most types of vectors can express their gene products in both integrated and unintegrated forms, so again the measurement does not isolate the integrated fraction. To address this, it is possible to grow out the transduced cells, so that unintegrated forms are potentially diluted out. Several studies have taken this approach. A considerable confounder, however, is that not all integrated vectors may express the encoded transgene. In a study of AAV vector integration in dog liver,12 all integrated vectors were found to be extensively deleted and rearranged, and only one of five integrants studied encoded an intact transgene. Thus, there can be vectors that are undetectable using this method, and in at least some settings these may be the great majority. Vector deletion and rearrangement may also affect detection by in situ hybridization. The extent of rearrangement may be less severe with retroviral and lentiviral vectors, but for these as well recombination between LTRs can yield a “solo LTR” lacking the transgene;13 the frequency of this event in human gene therapy specimens appears to be uninvestigated.

Summary

It is possible to estimate a boundary on the size of a population of integrated vectors in a patient sample, but there are many challenges. Sparse sampling limits the accuracy of estimation, requiring methods, such as Chao1, to reconstruct a population. The fact that the efficiency of recovery of integration sites is variable means that estimates represent a lower bound. Care needs to be taken in evaluating what the population of cells sampled corresponds to biologically. Different choices of sequence quality filtering methods may have substantial effects on population boundary estimates. Finally, methods based on qPCR or imaging can be used, but unintegrated DNA and rearranged genomes are common confounders. Integration site analysis remains useful for tracking expanded clones and investigating targeting preferences of different vector systems, but the specific issue of determining absolute population sizes remains difficult.

Acknowledgments

We are grateful to members of the Bushman and Sabatino laboratories for help and suggestions, and Laurie Zimmerman for figure art. This work was supported in part by NIH grants P30-AI045008, U01AI125051, R01CA241762, 5R01HL142791, and U19AI149680.

References

1. Berry C.C., Gillet N.A., Melamed A., Gormley N., Bangham C.R., Bushman F. Estimating abundances of retroviral insertion sites from DNA fragment length data. Bioinformatics. 2012;28(6):755–762. doi: 10.1093/bioinformatics/bts004.
2. Berry C.C., Nobles C., Six E., Wu Y., Malani N., Sherman E., Dryga A., Everett J.K., Male F., Bailey A., et al. INSPIIRED: quantification and visualization tools for analyzing integration site distributions. Mol. Ther. Methods Clin. Dev. 2017;4:17–26. doi: 10.1016/j.omtm.2016.11.003.
3. Sherman E., Nobles C., Berry C.C., Six E., Wu Y., Dryga A., Malani N., Male F., Reddy S., Bailey A., Bittinger K., et al. INSPIIRED: a pipeline for quantitative analysis of sites of new DNA integration in cellular genomes. Mol. Ther. Methods Clin. Dev. 2017;4:39–49. doi: 10.1016/j.omtm.2016.11.002.
4. Spinozzi G., Calabria A., Brasca S., Beretta S., Merelli I., Milanesi L., Montini E. VISPA2: a scalable pipeline for high-throughput identification and annotation of vector integration sites. BMC Bioinformatics. 2017;18:520. doi: 10.1186/s12859-017-1937-9.
5. Arens A., Appelt J.U., Bartholomae C.C., Gabriel R., Paruzynski A., Gustafson D., Cartier N., Aubourg P., Deichmann A., Glimm H., et al. Bioinformatic clonality analysis of next-generation sequencing-derived viral vector integration sites. Hum. Gene Ther. Methods. 2012;23:111–118. doi: 10.1089/hgtb.2011.219.
6. Afzal S., Wilkening S., von Kalle C., Schmidt M., Fronza R. GENE-IS: time-efficient and accurate analysis of viral integration events in large-scale gene therapy data. Mol. Ther. Nucleic Acids. 2017;6:133–139. doi: 10.1016/j.omtn.2016.12.001.
7. Reinhardt B.C., Habib O., Shaw K.L., Garabedian E.K., Carbonaro-Sarracino D.A., Terrazas D.R., Fernandez B.C., de Oliveira S., Moore T.B., Ikeda A.K., et al. Long-term outcomes after gene therapy for adenosine deaminase severe combined immune deficiency (ADA SCID). Blood. 2021;138:1304–1316. doi: 10.1182/blood.2020010260.
8. Six E., Guilloux A., Denis A., Lecoules A., Magnani A., Villette R., Male F., Cagnard N., Delville M., Magrin E., et al. Clonal tracking in gene therapy patients reveals a diversity of human hematopoietic differentiation programs. Blood. 2020;135:1219–1231. doi: 10.1182/blood.2019002350.
9. Chao A., Chazdon R.L., Colwell R.K., Shen T.J. Abundance-based similarity indices and their estimation when there are unseen species in samples. Biometrics. 2006;62:361–371. doi: 10.1111/j.1541-0420.2005.00489.x.
10. Chiu C.H., Chao A. Estimating and comparing microbial diversity in the presence of sequencing errors. PeerJ. 2016;4:e1634. doi: 10.7717/peerj.1634.
11. Bunge J., Willis A., Walsh F. Estimating the number of species in microbial diversity studies. Annu. Rev. Stat. Appl. 2014;1:427–445.
12. Nguyen G.N., Everett J.K., Kafle S., Roche A.M., Raymond H.E., Leiby J., Wood C., Assenmacher C.-A., Merricks E.P., Long C.T., et al. A long-term study of AAV gene therapy in dogs with hemophilia A identifies clonal expansions of transduced liver cells. Nat. Biotechnol. 2020;39:47–53. doi: 10.1038/s41587-020-0741-7.
13. Thomas J., Perron H., Feschotte C. Variation in proviral content among human genomes mediated by LTR recombination. Mob. DNA. 2018;9:36. doi: 10.1186/s13100-018-0142-3.

Articles from Molecular Therapy are provided here courtesy of The American Society of Gene & Cell Therapy
