2016 Mar 28;4:e1847. doi: 10.7717/peerj.1847

Figure 3. Bioinformatic pipeline for identification of KoRV integration sites.


The pipeline was run separately for each data set obtained by the three different techniques. For the key steps, the number of sequences retained is indicated in parentheses for each technique, in this order from left to right: PEC, SPEX and hybridization capture. After processing of the NGS reads, KoRV integration sites were identified in a two-step analysis of the KoRV LTR ends adjacent to the flanking host DNA. The first round of selection targeted the A region of the LTR end, and its output was used for the subsequent identification of the B region. The LTR ends of all sequences were trimmed off, and only sequences longer than four bp were considered. Using a sequence clustering approach, unique and shared integration sites were sorted into clusters. The consensus of each non-singleton cluster was computed using a multiple sequence alignment. These consensus sequences and the singleton sequences were queried against wallaby genomic scaffolds and koala Illumina HiSeq reads to determine whether they represented KoRV flanking sequences. At the same time, extension products into the KoRV genome were identified.
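The trimming, length-filtering, clustering and consensus steps described in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The LTR-end motif `LTR_END` is a made-up placeholder rather than the real KoRV sequence, the function names are hypothetical, prefix grouping stands in for the unspecified clustering tool, and a gap-free column-wise majority vote stands in for a true multiple sequence alignment.

```python
from collections import Counter, defaultdict

# Hypothetical LTR-end motif for illustration only; NOT the real KoRV LTR sequence.
LTR_END = "ACGTTGCA"
MIN_FLANK_LEN = 5  # the caption keeps only sequences "longer than four bp"


def trim_ltr_end(read, ltr_end=LTR_END):
    """Return the host flank that follows the LTR end, or None if the motif is absent."""
    pos = read.find(ltr_end)
    if pos == -1:
        return None
    return read[pos + len(ltr_end):]


def extract_flanks(reads, ltr_end=LTR_END, min_len=MIN_FLANK_LEN):
    """Trim the LTR end from every read and keep flanks of at least min_len bases."""
    flanks = []
    for read in reads:
        flank = trim_ltr_end(read, ltr_end)
        if flank is not None and len(flank) >= min_len:
            flanks.append(flank)
    return flanks


def cluster_by_prefix(flanks, prefix_len=MIN_FLANK_LEN):
    """Naive stand-in for sequence clustering: group flanks sharing their first bases."""
    clusters = defaultdict(list)
    for flank in flanks:
        clusters[flank[:prefix_len]].append(flank)
    return dict(clusters)


def consensus(seqs):
    """Gap-free majority vote per column, truncated to the shortest sequence."""
    length = min(len(s) for s in seqs)
    return "".join(
        Counter(s[i] for s in seqs).most_common(1)[0][0] for i in range(length)
    )


reads = [
    "GGACGTTGCATTTGACCA",   # motif inside the read
    "ACGTTGCATTTGACGA",     # same site, one mismatch in the flank
    "ACGTTGCATTTGACCAGG",   # same site, longer flank
    "ACGTTGCAAC",           # flank too short, discarded
    "CCCCCC",               # no LTR end, discarded
]
flanks = extract_flanks(reads)
clusters = cluster_by_prefix(flanks)
# Non-singleton clusters get a consensus; singletons are kept as they are.
representatives = {
    key: consensus(members) if len(members) > 1 else members[0]
    for key, members in clusters.items()
}
```

In this sketch, the entries of `representatives` correspond to the consensus and singleton sequences that the pipeline would then query against the wallaby scaffolds and koala reads; that search step is not modelled here.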