Author manuscript; available in PMC: 2009 Oct 25.
Published in final edited form as: Gene Ther. 2008 Jun 26;15(18):1294–1298. doi: 10.1038/gt.2008.99

Automated analysis of viral integration sites in gene therapy research using the SeqMap web resource

Brandon Peters 1, Sara Dirscherl 2, Jessica Dantzer 1, Jonathan Nowacki 1, Scott Cross 3, Xiaoman Li 1,4, Kenneth Cornetta 3,5,7, Mary C Dinauer 2,3,6,7, Sean D Mooney 1,3
PMCID: PMC2766545  NIHMSID: NIHMS105704  PMID: 18580967

Abstract

Research in gene therapy involving genome-integrating vectors now often includes analysis of vector integration sites across the genome using methods such as ligation-mediated (LM)-PCR or linear amplification-mediated (LAM)-PCR. To help researchers analyze these sites and the functions of nearby genes, we have developed SeqMap (http://seqmap.compbio.iupui.edu/), a secure, web-based, comprehensive vector integration site management tool that automatically analyzes and annotates large numbers of vector integration sites derived from LM-PCR experiments in human and model organisms against a common genome database. We believe use of this resource will enable better reproducibility and understanding of these important data.

Keywords: Gene therapy, bioinformatics, LM-PCR, insertional mutagenesis, database

Introduction

Gene therapy holds great promise for treating genetic disease, and many researchers are moving toward clinical application of this approach. One of the challenges faced is determining how viral integration events cause phenotypic changes in cells that eventually lead to cancer risk or malignant transformation1. Many recent studies have begun to address this challenge2,3. These studies, and their resulting publications, have described the abundance and location of vector integration sites in normal and malignant cells using experimental methods such as ligation-mediated (LM)-PCR or linear amplification-mediated (LAM)-PCR in both human and model organisms2,4,5.

LM-PCR and LAM-PCR seek to identify retroviral vector integration sites by digesting genomic DNA with an appropriate restriction enzyme and using PCR primers within the long terminal repeats (LTR) and an adapter fragment to amplify the LTR-genomic DNA junctions6,7. Investigators may sequence the PCR product directly or through Topo-cloning methods8. However, different bioinformatic protocols are frequently used to describe genomic sites, and different gene annotations and genome builds are used. Although the integration sites themselves are fixed, these differences in genome builds and gene models can lead to different annotations of the same site, resulting in statistical analyses that are difficult to compare.

Here we describe an automated web-based approach for annotating large numbers of viral integration sites. This software, named SeqMap, accepts sequences, removes vector and repeating elements, maps the remaining sequence to a genome, and extracts annotations from both the UCSC and Ensembl genome annotation databases. Information such as nearby genes, their associated GO functions, the orientation of the integration and the distance from the integration site to gene start sites is tabulated. Reports can be generated as web pages and in Excel.

Construction of this tool was motivated by our need to understand not only where and how often viral integrations into the genome were occurring, but also why certain sample submissions were failing to identify an insertion site from an LM-PCR product. It is often challenging to determine the underlying cause of such a failure when annotations are built by hand or from a script. We find that these failures can arise from a variety of sources, including integration into genomic repeating elements (e.g. satellite or LINE repeats), sample contamination, too little genomic sequence to map, or poor sequence quality. These causes can be readily determined using our tools and, where a failure is addressable, the user is able to override the automatic annotations with their own. The data can then be exported for further analysis if required. Details of the analysis follow.

Database and analysis

SeqMap is constructed using the Python (http://python.org/) programming language. Bioperl (http://bioperl.org/) is used to render images of genetic structure, with MySQL (http://mysql.com/) as a backend database. UCSC GoldenPath (http://genome.ucsc.edu/) and Ensembl (http://ensembl.org/) annotations, sequences and related tools are all stored locally, ensuring that data annotations are stable over time.

Genome mapping

To use SeqMap, the user first inputs the vector sequences (and sequences generated during Topo-cloning) so that these can be removed prior to searching. The user indicates the species being analyzed, and SeqMap selects the appropriate databases. Sequencing data are submitted in FASTA format, with each sequence given a title of the form SeqID.SampleID. The title is automatically split at the first '.': all characters to the left of it form the unique identifier for that sequence, and the characters to the right identify the sample. Multiple sequences can map to a single sample. Sample IDs can be used later to perform analyses on groups of integration sites. Once the sequences are submitted, the server begins to execute the workflow shown in Figure 1.
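
As an illustration of the title convention, the following Python sketch splits each FASTA title at the first '.' and groups sequence IDs by sample. It is not taken from the SeqMap source: the function names (parse_title, read_fasta, group_by_sample) and the example records are hypothetical.

```python
from collections import defaultdict


def parse_title(title):
    """Split a FASTA title of the form SeqID.SampleID at the first '.'."""
    title = title.lstrip(">").strip()
    seq_id, sep, sample_id = title.partition(".")
    if not sep or not seq_id or not sample_id:
        raise ValueError(f"title {title!r} is not of the form SeqID.SampleID")
    return seq_id, sample_id


def read_fasta(text):
    """Yield (seq_id, sample_id, sequence) for each record in FASTA text."""
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title closes the previous record, if there was one.
            if title is not None:
                yield (*parse_title(title), "".join(chunks))
            title, chunks = line, []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield (*parse_title(title), "".join(chunks))


def group_by_sample(records):
    """Map each sample ID to the list of sequence IDs submitted for it."""
    groups = defaultdict(list)
    for seq_id, sample_id, _sequence in records:
        groups[sample_id].append(seq_id)
    return dict(groups)


# Hypothetical submission: two reads from one sample, one from another.
example = """>s001.mouse12
ACGTACGTAC
GGTTAACC
>s002.mouse12
TTGACCA
>s003.mouse7
CCGGA
"""
records = list(read_fasta(example))
groups = group_by_sample(records)
```

Because only the first '.' is used as the separator, a title such as a.b.c yields the sequence ID a and the sample ID b.c, matching the rule described above.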

Figure 1. Automated workflow for annotation in SeqMap.

First, the user must specify the vector sequence to be removed from the input sequences; this only needs to be done once per LM-PCR protocol. Next, sequences are input in FASTA format, one record for each submitted sequence. Each sequence is then mapped to the genome using the following three-step protocol. First, Chaos9, a local alignment method, is used to identify regions of vector in the input sequence, and these regions are replaced with the letter 'N'. Second, Censor10, a repeat-masking algorithm, is applied to the vector-removed sequence to remove repeating elements. Finally, the resulting sequence is mapped to the genome build using Blat. Nearby RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/) gene products are then identified using both the Ensembl and UCSC annotation databases; only genes with Entrez Gene IDs or MGI IDs are considered. Distances are measured from the specified integration site to the 'txStart' location in UCSC or the 'Start' location in Ensembl. Parameters for each of these tools are published in our online reference guide at http://seqmap.compbio.iupui.edu/.
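
The masking step can be pictured with a short Python sketch. It assumes that an aligner has already reported the vector (or repeat) regions as zero-based, half-open (start, end) intervals on the read; the interval representation, the function names and the example read are illustrative assumptions, and neither Chaos nor Censor is actually invoked here.

```python
def mask_intervals(sequence, intervals, mask_char="N"):
    """Replace each half-open [start, end) interval of sequence with mask_char.

    Intervals are clipped to the sequence bounds, so an alignment that
    runs past the end of the read does not raise an error.
    """
    chars = list(sequence)
    for start, end in intervals:
        start = max(0, start)
        end = min(len(chars), end)
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


def unmasked_length(sequence, mask_char="N"):
    """Count the bases that remain available for genome mapping."""
    return sum(1 for base in sequence if base.upper() != mask_char)


# Hypothetical read whose first seven bases align to the vector.
read = "TTGACCAAGGCTTACGATCGATTAGC"
vector_hits = [(0, 7)]
masked = mask_intervals(read, vector_hits)
remaining = unmasked_length(masked)
```

Replacing bases with 'N' rather than deleting them keeps the coordinates of the read unchanged, so later steps can still report where on the original read the vector, repeat and genomic portions lie.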

Upon submission of sequences and related vector construct structures (in FASTA format), SeqMap automatically maps the genomic sequences to the genome using the following steps. First, the local alignment tool Chaos9 is used to mask each submitted vector sequence out of each input sequence: the submitted vector sequences are iteratively aligned to the input sequences, and any resulting alignments are removed. The Censor tool is then used to remove repeats. The input sequences are compared with the RepBase database10 to identify any repeats present, which are then masked out. Since many integration sites occur in repeating elements, this process allows SeqMap to identify and display such portions of the sequence for the user. The resulting masked sequences are mapped to the GoldenPath and Ensembl genome sequences using Blat11 for both mouse and human sites, provided there is enough non-repeating content to do so. A Blat hit must have at least 95% identity to be considered. All hits that meet this constraint are reported so that the user can identify ambiguous sequences or sequences with multiple mappable positions. A sequence containing a nearby repeat may still be mappable, so technicians are allowed to override the automatic annotation with an analysis by hand.
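
The 95% identity cutoff and the reporting of all passing hits might be sketched as follows in Python. The text does not state how identity is computed, so this sketch assumes a simple definition, matches divided by matches plus mismatches, with gaps ignored; the Hit record, the function names and the example hits are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A simplified alignment hit (hypothetical subset of aligner output)."""
    chrom: str
    start: int
    end: int
    strand: str
    matches: int
    mismatches: int


def percent_identity(hit):
    """Assumed definition: matching bases over aligned bases, gaps ignored."""
    aligned = hit.matches + hit.mismatches
    return 100.0 * hit.matches / aligned if aligned else 0.0


def filter_hits(hits, min_identity=95.0):
    """Keep every hit at or above the identity cutoff."""
    return [h for h in hits if percent_identity(h) >= min_identity]


def classify(hits, min_identity=95.0):
    """Label a read as unmapped, unique or ambiguous from its passing hits."""
    kept = filter_hits(hits, min_identity)
    if not kept:
        return "unmapped", kept
    if len(kept) == 1:
        return "unique", kept
    return "ambiguous", kept


# Hypothetical hits for one read: 99%, 96% and 90% identity.
hits = [
    Hit("chr3", 1000, 1100, "+", 99, 1),
    Hit("chr7", 500, 600, "-", 96, 4),
    Hit("chr1", 10, 110, "+", 90, 10),
]
status, kept = classify(hits)
```

Returning every passing hit, rather than only the best one, mirrors the behavior described above: a read with several passing positions is flagged for the user instead of being silently assigned to one of them.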

Local sequence alignments with the LTR at the junctions are performed using Chaos, and only sequences with complete junctions are reported as including integration sites. The orientation of the integration is also reported as 'A' or 'B', specifying the orientation of the integrated sequence on the reference genome. Nearby genes within 300 kb are identified from both annotation databases using the RefSeq tables and are connected using their Entrez Gene or MGI IDs. These IDs are used to determine whether nearby transcripts come from the same or different genes.
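
A minimal Python sketch of the nearby-gene lookup is given below. It assumes that the 300 kb window is measured from the integration site to each transcript's start coordinate, as in the distance definition given for 'txStart', and that transcripts sharing a gene ID are collapsed to the closest one. The Transcript record, the function name and the example transcripts and IDs are hypothetical, not SeqMap's actual schema.

```python
from dataclasses import dataclass

WINDOW = 300_000  # 300 kb, as stated in the text


@dataclass
class Transcript:
    """A simplified annotation row (hypothetical field names)."""
    name: str
    gene_id: str   # e.g. an Entrez Gene or MGI identifier
    chrom: str
    strand: str
    tx_start: int
    tx_end: int


def nearby_genes(site_chrom, site_pos, transcripts, window=WINDOW):
    """Return (gene_id, distance) pairs within the window, nearest first.

    Transcripts with the same gene ID are treated as one gene, and the
    smallest distance from the site to any of their start coordinates
    is reported for that gene.
    """
    nearest = {}
    for t in transcripts:
        if t.chrom != site_chrom:
            continue
        distance = abs(t.tx_start - site_pos)
        if distance > window:
            continue
        if t.gene_id not in nearest or distance < nearest[t.gene_id]:
            nearest[t.gene_id] = distance
    return sorted(nearest.items(), key=lambda item: item[1])


# Hypothetical annotation rows around a site at chr2:1,000,000.
transcripts = [
    Transcript("tx_a1", "1001", "chr2", "+", 1_050_000, 1_080_000),
    Transcript("tx_a2", "1001", "chr2", "+", 1_020_000, 1_080_000),
    Transcript("tx_b1", "2002", "chr2", "-", 1_250_000, 1_300_000),
    Transcript("tx_c1", "3003", "chr2", "+", 1_400_000, 1_450_000),
    Transcript("tx_d1", "4004", "chr5", "+", 1_000_000, 1_010_000),
]
genes = nearby_genes("chr2", 1_000_000, transcripts)
```

Here the two transcripts of gene 1001 are reported once, at the distance of the closer transcript. Note that in UCSC gene tables txStart is the lower genomic coordinate regardless of strand, so for a minus-strand transcript it is not the biological transcription start; the sketch follows the coordinate named in the text rather than correcting for strand.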

This process is able to map very large numbers of sites; submissions can include hundreds of sequences. Analyses can be grouped by a user-specified sample ID or date of submission, and links are provided to the public genome databases. Common integration sites from the retroviral tagged cancer gene database (RTCGD)12 are highlighted in bold if the samples are from mice. Users are able to create group pages, submit batches of data, return later to view the automatic annotations, and build statistics by submission and by user-defined sample. The workflow is not complete until the “Mapped” flag is set to '1' on the sample list page (Figure 2). Once the workflow is complete, the user can click through to each submission and view graphically the structure of the submitted sequence, including regions of poor sequence ('N'), vector, repeating elements, and the genomic region. Additionally, nearby genes from both the Ensembl and UCSC databases can be viewed on a map, and the transcription start site nearest to the integration site is determined. Finally, reports can be generated for each group of integrations by submission date, genomic location or sample ID.

Figure 2. Example of an integration site submission.

At left is the submission summary ('A' in figure), and below it each submission is summarized by sequence name (B). At right is a specific integration site summary page (C). Each sequence submission is summarized by whether it is confirmed by a technician, whether it is used for analysis, whether it contains a technician comment, whether it is found on the genome, and what UCSC and Ensembl have annotated as the closest gene. Genes in the RTCGD12 are highlighted in bold. A gene structure map shows any original sequence errors (“N's”), the blocks removed by vector removal, any repeating elements, the genomic region mapped to the genome, and the proposed integration site. The structure of the input sequence is shown in user-configurable colors in both the image map and the sequences. The page also provides a summary of all status flags, comments and nearby genes found, with links to a complete summary of that region of the genome for each annotation database. For model organisms, human orthologs are provided using the Jackson Laboratory ortholog tables (http://www.informatics.jax.org/). The sequences output by each step of the preparation process are also shown.

Identification of nearby genes

The summary of the identified integration sites can be viewed directly by clicking on the gene summary link and visualizing the genetic structure around the integration site on both UCSC and Ensembl RefSeq annotations13.

Characterization of nearby genes and elucidated integration sites

Functional analysis is supported by reporting which genes are in the RTCGD and the Gene Ontology14 terms annotated to each gene, with their validation status, using the Jackson Laboratory Mouse Genome Informatics IDs (MGI, http://www.informatics.jax.org/) and EBI Gene Ontology annotations. We highlight both the UCSC GoldenPath annotations and the Ensembl annotations, both as a quality control measure and to enable cross-database communication for users of one particular database. A summary of every submission can be exported to Excel by clicking on the export to Excel link on each summary page (Figure 2A).

Example Submission

It is useful to view an example dataset submission to understand how this resource can be used. One of our challenges was creating an interface that is feature-rich and useful, yet easy to use for the bench scientist who may have little or no familiarity with bioinformatics tools. To overcome this, we chose a web-based solution that enables researchers to upload the sequences and vectors of interest and then run the automated analysis on our servers. By completing this workflow, users are able to readily identify sites of integration in a system that yields easily reproducible results. An example submission can be viewed directly on the website at this address: https://seqmap.compbio.iupui.edu/seqmap/demo/cgibin/GetData.py?LAB=16&GROUP=1163175105&GROUPING=date&DELETED=0 (username 'demo', password 'demo'). Additionally, all integration sites at a particular genomic position can be viewed with a map of the nearby genes (Figure 3).

Figure 3. Example analysis of specific groups of integration sites.

On the gene summary page for a submission, a map of the genomic region around an integration site is displayed for each integration event with the BLAT hit on both the UCSC and Ensembl datasets. Nearby transcripts are displayed in green, with an arrow pointing toward coding direction. Exons (red boxes) and introns (black lines) are also displayed showing transcript structure.

In use, we find that SeqMap automatically maps sites in a manner that agrees with human annotation, and that it allows us to interpret experimental results that fail to identify a genomic site.

Analysis of large numbers of genomic sequences is an increasingly important problem in areas outside of gene therapy research, and SeqMap has been developed to enable similar analyses in those domains. If a researcher generates a significant amount of sequence data that needs to be mapped to a genome, SeqMap will support this. We believe this resource will enable users to accurately compare their results to other experiments and publications in a consistent manner. Future additions include analysis of functional annotations and of nearby genes using Gene Ontology, and the ability to publish public reports based on the results. Together these tools will represent a significant advance in our ability to analyze insertional mutagenesis data, and we will continue to improve this tool as new experimental methods are developed.

Acknowledgements

We would like to thank Robert Getty, Susan Jean Johns and Giselle Knudsen for helpful comments. Funding provided by P01HL53586 (PI: Dinauer), K22LM009135 (PI: Mooney), the Indiana University Vector Production Facility (U42RR11148, PI: Cornetta) and INGEN. The Indiana Genomics Initiative (INGEN) is funded in part by a grant from the Lilly.

References

1. Wu X, Burgess SM. Integration target site selection for retroviruses and transposable elements. Cell Mol Life Sci. 2004;61:2588–2596. doi: 10.1007/s00018-004-4206-9.
2. Ott MG, Schmidt M, Schwarzwaelder K, Stein S, Siler U, Koehl U, et al. Correction of X-linked chronic granulomatous disease by gene therapy, augmented by insertional activation of MDS1-EVI1, PRDM16 or SETBP1. Nat Med. 2006;12:401–409. doi: 10.1038/nm1393.
3. Kustikova OS, Geiger H, Li Z, Brugman MH, Chambers SM, Shaw CA, et al. Retroviral vector insertion sites associated with dominant hematopoietic clones mark “stemness” pathways. Blood. 2006. doi: 10.1182/blood-2006-08-044156.
4. Du Y, Spence SE, Jenkins NA, Copeland NG. Cooperating cancer-gene identification through oncogenic-retrovirus-induced insertional mutagenesis. Blood. 2005;106:2498–2505. doi: 10.1182/blood-2004-12-4840.
5. Hematti P, Hong BK, Ferguson C, Adler R, Hanawa H, Sellers S, et al. Distinct genomic integration of MLV and SIV vectors in primate hematopoietic stem and progenitor cells. PLoS Biol. 2004;2:e423. doi: 10.1371/journal.pbio.0020423.
6. Gentner B, Laufs S, Nagy KZ, Zeller WJ, Fruehauf S. Rapid detection of retroviral vector integration sites in colony-forming human peripheral blood progenitor cells using PCR with arbitrary primers. Gene Ther. 2003;10:789–794. doi: 10.1038/sj.gt.3301935.
7. Schmidt M, Schwarzwaelder K, Bartholomae C, Zaoui K, Ball C, Pilz I, et al. High-resolution insertion-site analysis by linear amplification-mediated PCR (LAM-PCR). Nat Methods. 2007;4:1051–1057. doi: 10.1038/nmeth1103.
8. Invitrogen. Invitrogen TOPO TA Cloning User Manual. Version U 2006.
9. Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, et al. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003;13:721–731. doi: 10.1101/gr.926603.
10. Kohany O, Gentles AJ, Hankus L, Jurka J. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics. 2006;7:474. doi: 10.1186/1471-2105-7-474.
11. Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202.
12. Akagi K, Suzuki T, Stephens RM, Jenkins NA, Copeland NG. RTCGD: retroviral tagged cancer gene database. Nucleic Acids Res. 2004;32:D523–527. doi: 10.1093/nar/gkh013.
13. Wheeler DL, Church DM, Federhen S, Lash AE, Madden TL, Pontius JU, et al. Database resources of the National Center for Biotechnology. Nucleic Acids Res. 2003;31:28–33. doi: 10.1093/nar/gkg033.
14. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004;32:D258–261. doi: 10.1093/nar/gkh036.
