REMiner-II: A Tool for Rapid Identification and Configuration of Repetitive Element Arrays from Large Mammalian Chromosomes as a Single Query

Woo-Chan Kim; Kang-Hoon Lee; Kyung-Seop Shin; Ri-Na You; Young-Kwan Lee; Kiho Cho; Dong-Ho Cho

doi:10.1016/j.ygeno.2012.06.006

Author manuscript; available in PMC: 2013 Sep 1.

Published in final edited form as: Genomics. 2012 Jun 28;100(3):131–140. doi: 10.1016/j.ygeno.2012.06.006



PMCID: PMC3428500 NIHMSID: NIHMS389733 PMID: 22750555

Abstract

Genes occupy ~3 % of the human and mouse genomes whereas repetitive elements (REs), whose biologic functions are largely uncharacterized, constitute greater than 50 %. A heterogeneous population of RE arrays (arrangement structures) is formed by combinations of various REs in mammalian genomes. In this study, REMiner-II was refined from the original REMiner for a more efficient identification and configuration of RE arrays from large queries (e.g., human chromosomes) using an unbiased self-alignment protocol. Chromosome-wide RE array profiles for the entire sets of human and mouse chromosomes were obtained using REMiner-II on a personal computer. REMiner-II provides 10 adjustable parameters and three data output modes to accommodate different experimental settings and/or goals. Examination of the human and mouse chromosome data using the REMiner-II viewer revealed species-specific libraries of complexly organized RE arrays. In conclusion, REMiner-II is an efficient tool for chromosome-wide identification and characterization of RE arrays from mammalian genomes.

Keywords: mammalian genome, chromosome-wide, repetitive element, RE array, mining, REMiner-II

1. Introduction

In addition to the DNA sequence of the human genome, which was reported to be decoded in 2001, more than 1,000 genome sequences of different species, ranging from prokaryotes to primates, have been deposited into various databases [1,2]. These readily accessible genome data sets enable life scientists to investigate complex layers of biologic phenomena that would otherwise be out of reach. Genes, which constitute ~3 % of the human genome, have been placed at the center stage of modern biologic research during the last several decades, whereas the rest of the genomic constituents have received very limited attention [3,4]. Importantly, the high level of sequence homology shared by human and mouse genes is difficult to reconcile with the obvious phenotypic differences between these two species, leading to the speculation that the rest of the genomic constituents play a larger part in phenotype determination than previously expected [5].

A diverse population of repetitive elements (REs) (both characterized and uncharacterized) represents the vast majority of the non-gene genome sequences [2,6]. It has been reported that certain trinucleotide tandem REs participate in specific normal processes, such as differential limb and skull morphology among dog breeds, as well as disease processes [7,8]. Our recent survey of the human genome for REs by an unbiased self-alignment using the bl2seq program from the National Center for Biotechnology Information (NCBI) and the original REMiner program revealed a diverse population of complexly ordered RE arrangement structures, named RE arrays [9,10]. The complexly ordered configuration of these RE arrays suggests that they exist in the genome as a functional unit.

Various homology search algorithms, most of which compensate for certain levels of mismatches and gaps within a sequence pair, have been implemented into a range of software tools to identify REs from genome sequences [11–15]. The majority of these search algorithms are derivatives of BLAST (basic local alignment search tool) [16–18]. Some of them are designed for the identification of specific RE families, such as LTR (long terminal repeat) retrotransposons and MITEs (miniature inverted repeat transposable elements), while others employ self-alignment protocols for unbiased RE mining [14,19–22].

In this study, we refined the original REMiner, which was developed for mining REs and RE arrays, for a more efficient identification and characterization of RE arrays with the following analytical features: 1) rapid mining of REs and RE arrays from any individual mammalian chromosome as a single query within the specifications of a personal computer, 2) user-friendly graphic interface features, such as instant retrieval of RE alignment results of interest from the viewer, and 3) implementation of adjustable parameters depending on the system specifications and/or experimental goals. Our recent survey of other RE mining studies indicates that there has been no attempt to identify REs and RE arrays using an unbiased protocol from large genome queries, such as human chromosome 1, which is approximately 250 Mb (megabases).

2. Results

2.1. System Design and Algorithm

The REMiner-II design consists of three main stages: preprocessing of query sequences, seeding, and alignment-extension (Figure 1).

Figure 1. The REMiner-II design consists of three main stages: preprocessing of query sequences, seeding, and extension. The preprocessing stage prepares the query sequences for seeding. The query sequence data is entered in a FASTA format, and low complexity sequences are filtered out based on the embedded parameters. The word list is extracted from the forward and reverse complemented sequences. During the seeding stage, the query sequence is parsed with probe sequences (words). Forward (by self-alignment of the original query sequence) and reverse (two-sequence comparison between the original and the reverse complemented sequences) seeding events proceed sequentially. The generated seeds are temporarily put into the seed storage. Once the seed storage is full, the gapped extension process begins. The seed generation and gapped extension processes are repeated until the seed storage is empty. Alignment-extensions are initiated from the individual seed loci using multiple processing units, and the RE alignment results are recorded.

2.1.1. Preprocessing

The preprocessing stage prepares the query sequences for seeding.

2.1.1.1. Data Extraction and Low-Complexity Filtering

Non-DNA sequence information, such as comments and other letters embedded in the original query files (FASTA format), is removed to extract nucleotide sequence information only. Subsequently, low-complexity sequences are masked so that they are filtered out during the seeding stage, which excludes low-complexity REs from mining. The majority of nucleotide homology search tools, including BLAST, use a DUST or SDUST (symmetric DUST) algorithm to mask low-complexity sequences [23]. In the DUST or SDUST module, when Q_l is a nucleotide sequence of length l, the filtering score S_F(Q_l) of Q_l is defined as follows:

S_{F}(Q_{l}) = \frac{\sum_{t \in T} N_{t}(Q_{l}) \left( N_{t}(Q_{l}) - 1 \right)}{2 (l - 3)} \qquad (1)

In (1), t is a triplet, T is the set of all 64 triplets, and N_t(Q_l) is the number of occurrences of t within Q_l. Both DUST and SDUST modules mask sequences whose length is less than the filtering window size (WD_F) and whose score is greater than the filtering score threshold (T_F). In these settings, tandem dinucleotide and trinucleotide repeats are masked. However, to retrieve these short-length/low-complexity tandem repeats, which are determined to be valuable REs, REMiner-II uses a filtering algorithm different from DUST or SDUST. In REMiner-II, the filtering score S_F(Q_l) of Q_l is defined as follows:

S_{F}(Q_{l}) = \frac{\max_{b \in \{A, C, G, T\}} N_{b}(Q_{l})}{l} \qquad (2)

In (2), N_b(Q_l) is the number of occurrences of nucleotide b within Q_l. REMiner-II masks sequences whose filtering score is greater than or equal to T_F when l is equal to WD_F. Thus, sequence regions dominated by a single nucleotide are generally masked in REMiner-II, whereas short tandem repeats, such as dinucleotide and trinucleotide repeats, are retained. As in both the DUST and SDUST algorithms, masking replaces an upper-case nucleotide letter with its lower-case counterpart.
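The difference between the two filtering scores can be illustrated with a short Python sketch. It is not taken from the REMiner-II source: the function names are hypothetical, the window is slid one base at a time, and every base of a flagged window is lower-cased, all of which are assumptions made for illustration.

```python
from collections import Counter


def dust_score(q: str) -> float:
    """Equation (1): triplet-based DUST/SDUST score of a sequence with length l > 3."""
    l = len(q)
    counts = Counter(q[i:i + 3] for i in range(l - 2))
    return sum(n * (n - 1) for n in counts.values()) / (2 * (l - 3))


def reminer2_score(q: str) -> float:
    """Equation (2): fraction of the window occupied by its most frequent nucleotide."""
    q = q.upper()
    return max(q.count(b) for b in "ACGT") / len(q)


def mask_low_complexity(seq: str, wd_f: int = 20, t_f: float = 0.6) -> str:
    """Lower-case each window of length wd_f whose equation (2) score is >= t_f.

    Assumption: all bases of a flagged window are masked.
    """
    out = list(seq)
    for i in range(len(seq) - wd_f + 1):
        if reminer2_score(seq[i:i + wd_f]) >= t_f:
            for j in range(i, i + wd_f):
                out[j] = out[j].lower()
    return "".join(out)
```

With the Table 1 defaults (window 20, threshold 0.6), a run of 20 adenines scores 1.0 under equation (2) and is masked, whereas the dinucleotide repeat (AC)×10 scores 0.5 and is retained, even though its equation (1) score is high.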

2.1.1.2. Reverse Complementation

All query sequences are subjected to reverse complementation in the preprocessing stage. In addition to the self-alignment of the original query sequence for the mining of REs, REMiner-II searches for REs by comparing the original sequence with its reverse complemented sequence to account for the fact that DNA sequences are double-stranded. Reverse complementation follows the Watson-Crick base pairing of “A” with “T” and “C” with “G” [24].
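As a minimal sketch of this step, reverse complementation can be written in Python as follows. The handling of lower-case (masked) letters and of N is an assumption made here so that masking survives the transformation; the function name is hypothetical.

```python
# Watson-Crick pairing; masked (lower-case) bases and N are carried through unchanged in case.
_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```

For example, the reverse complement of ACACACAT is ATGTGTGT.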

2.1.2. Seeding

During the initial seeding stage, the query sequence is parsed with fixed-length overlapping probe sequences that are called “words” (W). Immediately after parsing begins, seeding events proceed by self-alignment of the original sequence (forward) and two-sequence comparison between the original and the reverse complemented sequence (reverse). Candidate seeds are assembled by merging words based on the results of the word matching process, and the candidate seeds are screened to eliminate seeds that will produce redundant alignments before deposition of the extension-ready seeds to the seed storage.

2.1.2.1. Word Matching

In this BLAST-based algorithm, all matching word pairs are surveyed and stored in the word lookup table (WLT). However, the WLT structure of REMiner-II is somewhat different from that of other homology search tools, including BLAST. REMiner-II has two interconnected WLTs, named WLT1 and WLT2. WLT1 records the nucleotide index of the last position of a specific word and its total number of entries is 4^W. In conjunction with WLT1, WLT2 registers the previous incidence (nucleotide index) of a word at each position. Since WLT2 is established based on the individual word sizes within a query sequence, WLT2 contains (F-W+1) entries, where F is the query sequence length.

During the word matching process, the query sequence is sequentially surveyed for a specific word length frame by frame. Initial values of both the last word index of WLT1 and the former word index of WLT2 are set to −1. An initial occurrence of each word is converted to a numerical value and first recorded into the corresponding entry of WLT1. If an occurrence of the same word has already been recorded in the index of WLT1, the previous value from WLT1 is recorded in the corresponding nucleotide position in WLT2, and the new value replaces the previous value in WLT1. Figure 2 shows an example of the word matching process with an input sequence of “ACACACAT” and W = 3.

Figure 2. Initial values of the last word index in the word lookup table 1 (WLT1) and the former word index in the WLT2 are set to −1. The last word indices of three unique words (ACA, CAC, and CAT) identified from all three frames of the query (ACACACAT) are recorded in the WLT1. As the last word indices are newly entered into the WLT1, the existing indices are transferred to the matching rows (words) of the former index column in the WLT2. The former word indices of both the first (ACA) and second (CAC) words are recorded as −1 in the WLT2 due to the lack of their corresponding former words. The former word index of the third word (ACA) is 0, which is the index value of the previous word (ACA). The fourth through sixth words are processed in the same way as the first three to determine their former word indices. The completed WLT2 provides information in regard to the matching words.
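The word matching procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the REMiner-II implementation: the names `word_key`, `build_wlts`, and `occurrences` are hypothetical, words are encoded in base 4 with A=0, C=1, G=2, T=3, and words containing any other letter (N or masked bases) are skipped.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def word_key(word: str) -> int:
    """Convert a word to its numerical value (base-4 encoding, an assumed scheme)."""
    key = 0
    for base in word:
        key = key * 4 + CODE[base]
    return key


def build_wlts(seq: str, w: int):
    """Build WLT1 (last index per word, 4^W entries) and WLT2 (former index per position)."""
    n_words = len(seq) - w + 1
    wlt1 = [-1] * (4 ** w)
    wlt2 = [-1] * max(n_words, 0)
    for i in range(n_words):
        word = seq[i:i + w]
        if not all(base in CODE for base in word):
            continue  # assumption: words with non-ACGT letters are not indexed
        key = word_key(word)
        wlt2[i] = wlt1[key]  # previous occurrence moves to WLT2
        wlt1[key] = i        # newest occurrence replaces it in WLT1
    return wlt1, wlt2


def occurrences(wlt1, wlt2, word: str):
    """Follow the WLT1 -> WLT2 chain to list every position of a word, newest first."""
    positions = []
    i = wlt1[word_key(word)]
    while i != -1:
        positions.append(i)
        i = wlt2[i]
    return positions
```

For the input ACACACAT with W = 3, WLT2 becomes [−1, −1, 0, 1, 2, −1], and following the chain for ACA returns positions 4, 2, and 0.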

The REMiner-II WLTs have two unique functional characteristics. First, the words identified in the query sequence are not only mapped as WLT1 indices but also sorted in the order of occurrence in the query sequence, enabling the seeding process to be more efficient. Second, memory usage is reduced in the word matching process. By allocating 4 bytes of memory for each index, REMiner-II is able to process up to 4 Gb (gigabases) of query sequence, which is larger than the entire human genome. Since each index occupies 4 bytes of memory in REMiner-II, WLT1 and WLT2 need 4^(W+1) bytes and 8(F-W+1) bytes of memory space, respectively. In this memory structure, the memory usage of REMiner-II increases exponentially with word size. However, the word size does not need to be larger in REMiner-II because relatively long seeds can be generated using small word sizes by fine-tuning the seed length threshold and word merging algorithm. In addition, REMiner-II utilizes several WLT optimization strategies, involving memory space for pointers, bytepacking, and memory allocation for efficient and stable processing. In fact, the results from an experiment using six human chromosomes (1, 5, 8, 11, 15, and Y), with W = 14, demonstrated that the memory usage of the REMiner-II WLTs was reduced to about 60 % of BLAST's memory usage (data not shown).
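To make the memory figures concrete, the short calculation below evaluates the two expressions for a chromosome-sized query. It reads the WLT1 requirement as 4 bytes for each of the 4^W entries, and it takes the WLT2 expression 8(F−W+1) directly from the text; the function name is hypothetical.

```python
def wlt_memory_bytes(w: int, f: int, index_bytes: int = 4):
    """Return (WLT1 bytes, WLT2 bytes) for word size w and query length f."""
    wlt1 = index_bytes * 4 ** w  # equals 4^(W+1) when each index takes 4 bytes
    wlt2 = 8 * (f - w + 1)       # expression as given in the text
    return wlt1, wlt2


# W = 14 with a ~250 Mb query (roughly human chromosome 1)
wlt1_bytes, wlt2_bytes = wlt_memory_bytes(14, 250_000_000)
```

With W = 14 the WLT1 term is 4^15 bytes (1 GiB); raising W to 16 multiplies it by 16, which is why small word sizes are preferred.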

2.1.2.2. Seed Generation

Candidate seeds are assembled by merging words before establishing a population of seeds for the alignment-extension process. Each candidate seed has four attributes: indices of the two homologous sequences (index1 and index2), length, and score. The candidate seed (CS) of diagonal d is represented as CS (x,y,l,s,d), where d = x−y, x is index1, y is index2, l is the length, and s is the score. These candidate seeds are sorted by order of the diagonal in the candidate seed list (CSL). The total number of diagonals is variable depending on the direction (forward or reverse). The word merging process is performed sequentially by an ascending order of index. Identical words are paired using the index information from WLT1 and WLT2. Once a pair of words on the same dot-matrix diagonal becomes a candidate seed, it is registered to the CSL. Figure 3 shows the seed generation algorithm.

Figure 3. The flow chart describes how the new candidate seeds (NCSs) are examined to determine whether they can be merged with old candidate seeds (OCSs) and how the candidate seeds are merged by ungapped extension.
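The idea of merging identical word pairs that lie on the same diagonal can be sketched as follows. This is a simplified illustration for the forward direction with exact matches only (m = 0), not the algorithm of Figure 3: a Python dictionary stands in for WLT1, the merging rule (words on one diagonal that overlap or abut are joined) and the score (length × S_MAT) are assumptions, and all names are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class CandidateSeed(NamedTuple):
    x: int  # index1
    y: int  # index2
    l: int  # length
    s: int  # score
    d: int  # diagonal, d = x - y


def word_pairs(seq: str, w: int):
    """Yield (x, y), x > y, for every pair of identical words via a former-index chain."""
    last, former = {}, []
    for i in range(len(seq) - w + 1):
        word = seq[i:i + w]
        former.append(last.get(word, -1))
        last[word] = i
    for x, y in enumerate(former):
        while y != -1:
            yield x, y
            y = former[y]


def candidate_seeds(seq: str, w: int, s_mat: int = 1):
    """Merge overlapping or adjacent word hits on each diagonal into candidate seeds."""
    by_diag = defaultdict(list)
    for x, y in word_pairs(seq, w):
        by_diag[x - y].append(y)
    seeds = []
    for d in sorted(by_diag):  # candidate seed list ordered by diagonal
        ys = sorted(by_diag[d])
        start = end = ys[0]
        for y in ys[1:] + [None]:
            if y is not None and y <= end + w:
                end = y  # next word overlaps or abuts the current run
                continue
            length = end + w - start
            seeds.append(CandidateSeed(start + d, start, length, length * s_mat, d))
            start = end = y
    return seeds
```

For ACACACAT with W = 3, three word hits on diagonal 2 merge into one candidate seed (x = 2, y = 0, l = 5), and a single hit remains on diagonal 4 (x = 4, y = 0, l = 3).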

REMiner-II also introduces the m-allowable method, which permits two homologous words to contain up to m mismatches, enabling a more flexible RE search. If N_m is the number of words of size W that match a given word with at most m mismatches, then N_m is calculated as follows:

N_{m} = \sum_{a = 0}^{m} \binom{W}{a} \times 3^{a} \qquad (3)

In (3), since m increases N_m exponentially, a high m value is directly linked to increased processing time while allowing for the identification of more REs. Thus, optimization of the m value is necessary in consideration of both processing time and sensitivity.
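Equation (3) is easy to evaluate directly; the sketch below does so with the Python standard library (the function name `n_m` is hypothetical).

```python
from math import comb


def n_m(w: int, m: int) -> int:
    """Equation (3): number of words within m mismatches of a given word of size w."""
    return sum(comb(w, a) * 3 ** a for a in range(m + 1))
```

For W = 14, N_m is 1 for m = 0, 43 for m = 1, and 862 for m = 2, which shows how quickly the number of words to be checked grows with m.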

Candidate seeds are subjected to further comparison analyses to select the seeds for the alignment-extension process. Two candidate seeds are marked for merging when the space between them is less than the space threshold (SP). The candidate seeds are merged by the ungapped extension process, which is also implemented in the BLAST algorithm (Figure 3) [16,17]. Briefly, the upstream candidate seed is extended backward until either its total score is below the ungapped extension score threshold S_T, it meets an N letter, or it reaches the downstream candidate seed. Once the upstream candidate seed is extended to the downstream candidate seed, they are merged. If it is not extended, the new downstream candidate seed now serves as an upstream candidate seed for another round of merging events with the downstream candidate seed (formerly upstream) by forward extension. Following examination of both the merged and unextended candidate seeds for their qualifications to become a seed, they are either registered in the seed storage or discarded.

The protocols implemented during the merging of words or candidate seeds allow for a reduction in the number of seeds that need to be surveyed during the alignment-extension process; as a result, the processing time for the identification of REs is decreased. These protocols do not impose any significant effects on sensitivity because the merging processes involving words or candidate seeds are highly correlated; therefore, it is anticipated that they promote the production of compensatory alignments during the extension process. In addition, in contrast to the two-hit method (merging of two words), which is implemented in some homology search programs, REMiner-II utilizes an n-hit method, in which two or more words are merged during the seed generation process [17,25,26]. There was a significant decrease in the number of candidate seeds that completed the word merging process when the n-hit method was used compared to the one-hit and/or two-hit methods (Figure 4).

Figure 4. The bar graph depicts the hit-method-dependent differences in the number of candidate seeds identified from human chromosomes 10, 12, 17, and X when W = 14. There were ~21 % and ~18 % decreases in the number of candidate seeds after word merging using the n-hit method, compared to the one-hit and two-hit methods, respectively. CS, candidate seed.

2.1.3. Alignment-Extension of Seeds

The greedy algorithm, which has been demonstrated to provide computation efficiency in Megablast and other programs, is implemented for the gapped alignment-extension of seeds in REMiner-II [27]. The seeding step, which is not computationally challenging, is executed by a single processing unit because it is processed sequentially for individual words and/or candidate seeds. On the other hand, the alignment-extension of individual seeds is a time-consuming task, especially with large query sequences, and each extension is independent of the others. Thus, in REMiner-II, individual seeds are aligned/extended in parallel using multiple processing units rather than a single one.

Individual processing units dynamically process one seed at a time and write alignment information as an output of the gapped extension process to a result file. It needs to be noted that the seeds with a small diagonal (d) tend to proceed toward the center line during the forward extension process, often resulting in biologically insignificant alignments/extensions. To circumvent these potential shortcomings during the gapped extension, the minimum diagonal in the forward extension is set to (−d+1) and the maximum diagonal in the reverse direction is set to (d−1).
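Because each seed is extended independently, the work can be handed to a pool of workers. The sketch below shows only this dispatch pattern. The extension routine is a deliberately simple stand-in (an ungapped, rightward X-drop extension), not the greedy gapped algorithm of [27], and the diagonal limits described above are omitted; the names and the (x, y, l) seed tuple are hypothetical. A thread pool keeps the example self-contained, although in CPython a process pool would be the closer analogue of multiple processing units.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def extend_seed(seq, seed, s_mat=1, s_mis=-2, x_drop=30):
    """Stand-in extension: grow seed (x, y, l) to the right without gaps.

    Stops at the end of the sequence or when the score falls more than
    x_drop below the best score seen. Returns (x, y, length, score).
    """
    x, y, l = seed
    score = best = l * s_mat
    best_len = l
    i = l
    while x + i < len(seq):  # x > y, so the bound on x also covers y
        score += s_mat if seq[x + i] == seq[y + i] else s_mis
        if score > best:
            best, best_len = score, i + 1
        elif best - score > x_drop:
            break
        i += 1
    return x, y, best_len, best


def extend_all(seq, seeds, workers=4):
    """Extend every seed independently; results keep the order of the input seeds."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(partial(extend_seed, seq), seeds))
```

In this layout, each worker takes one seed at a time and returns one alignment record, mirroring the one-seed-per-unit scheduling described above.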

2.1.4. Modes of RE Data Output

REMiner-II provides three different data output modes which are Mode 1 (create RE array with seed library), Mode 2 (create RE array without seed library), and Mode 3 (create seed library only) (Figure 5). We have tested and compared these three modes using human chromosome Y. With Mode 1, it took 1,060 seconds to create a complete set of RE arrays and the corresponding seed library, while only 223 seconds was needed with Mode 3. The execution time is expected to vary greatly depending on the profiles of REs and their arrays in the individual query sequences. The seed library data, which are obtained using Mode 3, may be useful for an efficient comparative analysis with different data sets.

Figure 5. The three data output modes of REMiner-II are indicated by different types of lines (Mode 1, solid; Mode 2, dotted; Mode 3, broken). The seed library file consists of nine parameters (listed in Table 1; W, m, SP, L, S_MAT, S_MIS, S_T, W_DF, and T_F) which are involved in the seed generation process. F, file size; N_F, number of seeds on the forward sequence; N_R, number of seeds on the reverse complement sequence.

2.1.5. User Interface: REMiner-II Viewer

A user interface, named REMiner-II viewer, was developed using the WinAPI within the Windows environment. Each RE alignment was drawn as a line using a device context of the graphic device interface.

2.2. Performance Evaluation of REMiner-II

2.2.1. Identification of RE Arrays from Human and Mouse Chromosomes

Ten parameters, which can be adjusted for each run, are embedded in the REMiner-II algorithm: word size (W), maximum allowable mismatch (m), space threshold (SP), seed length threshold (L), matching score (S_MAT), mismatching score (S_MIS), ungapped extension threshold (S_T), user-specified gapped extension threshold in the greedy algorithm (X) [27], window size for filtering (WD_F), and filtering score threshold (T_F) (Table 1). W, m, SP, L, and S_T are associated with the seeding process. X is linked to the gapped alignment-extension process, and WD_F and T_F are associated with the low-complexity filtering step. S_MAT and S_MIS are involved in both the ungapped and gapped alignment-extension processes.

Table 1.

Values of the 10 adjustable parameters of REMiner-II that were employed to process the human and mouse chromosomes in this study.

Parameter	Value
word size (W)	14
allowable mismatch number (m)	1
space threshold (SP)	2
seed length threshold (L)	56
matching score (S_MAT)	1
mismatching score (S_MIS)	−2
ungapped extension threshold (S_T)	−10
gapped extension threshold (X)	30
window size for filtering (W_DF)	20
filtering score threshold (T_F)	0.6
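For scripted experiments, the Table 1 values can be collected in a single immutable record. The sketch below is a hypothetical convenience structure, not REMiner-II's own configuration format; the class and field names are assumptions, and only the default values come from Table 1.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ReminerParams:
    """Hypothetical container for the 10 adjustable parameters (defaults from Table 1)."""
    word_size: int = 14                  # W
    max_mismatch: int = 1                # m
    space_threshold: int = 2             # SP
    seed_length_threshold: int = 56      # L
    match_score: int = 1                 # S_MAT
    mismatch_score: int = -2             # S_MIS
    ungapped_threshold: int = -10        # S_T
    gapped_threshold: int = 30           # X
    filter_window: int = 20              # window size for filtering
    filter_score_threshold: float = 0.6  # T_F


defaults = ReminerParams()
# A variant run that changes only the seed generation settings.
strict = replace(defaults, max_mismatch=0, seed_length_threshold=34)
```

Keeping the record frozen means each run's settings are fixed once created, and variants are derived explicitly with `replace`.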


Using REMiner-II, we performed chromosome-wide mining of REs and associated RE arrays in the sets of all 24 human chromosomes (1–22, X, and Y) and 21 mouse chromosomes (1–19, X, and Y) obtained from the NCBI databases. The results from the performance evaluation, which are summarized in Supplementary Table 1, suggest that the range of processing times (e.g., about seven hours for human chromosome 1 of ~250 Mb) and memory usage with a personal computer are acceptable, considering the large query sizes. Our recent literature survey indicates that chromosome-wide mining of REs and RE arrays for the entire sets of human and mouse chromosomes has not been reported previously [9].

We also determined the density of the RE alignments within the individual human and mouse chromosomes (Supplementary Table 1). While human chromosomes 16, 17, and 19 have a relatively high density of RE alignments, chromosomes 4, 13, and 21 have a very low RE density. Examination of the coordinate information of the alignments revealed that ~5.9 % of the alignments identified in the human chromosomes were duplicates, which are presumably caused by the gapped alignment of short tandem repeats with high homology in specific regions. By the same reasoning, the highest rate of duplicate alignments (~39.2 %), seen in human chromosome Y, indicates that there is a dense population of short tandem repeats with high homology in that chromosome.

2.2.2. Visualization of REs and RE Arrays via REMiner-II Viewer

The REMiner-II viewer is a user interface program for a two-dimensional representation of REs and RE arrays utilizing the dot-plot protocol described previously [21]. The key features of the REMiner-II viewer include: a controller panel, instant retrieval of alignment data, color-defined RE orientation (direct and inverse), RE/alignment position, identity ratio, and zoom function (Figure 6). In addition, the RE alignment data window displays information in regard to the sequence position, identity, score, gap, and strand orientation.

Figure 6. The key functional features of the REMiner-II viewer, such as the controller panel, instant retrieval of alignment data, color-defined RE orientation (direct-blue and inverse-red), alignment position, identity ratio, and zoom function, are illustrated in a snapshot view.

2.2.3. Effects of the REMiner-II Adjustable Parameters on RE Alignment and RE Array Configuration

2.2.3.1. Effects of Low-Complexity Filtering Options

Low-complexity filtering is affected by the window size for filtering (WD_F) and filtering score threshold (T_F). As the value of T_F decreases, processing time is shortened and the output size becomes smaller, at the cost of a potential reduction in RE detection/identification. Panels A1–A3 of Figure 7 show changes in the profiles of REs and associated RE arrays of a region (7.5~10.1 Mb) in human chromosome Y, in conjunction with a sequential change of T_F values (0.5, 0.6, and 0.7). While three RE arrays (a, b, and c in Figure 7-A2) were identified with T_F values of 0.6 and 0.7, it was difficult to delineate the RE array (c) with a value of 0.5. On the other hand, a T_F value of 0.7 (Figure 7-A3) increased the processing time by 48 % and the output file size by 104 % in comparison to the results using a T_F value of 0.6 (Figure 7-A2), although the two runs share similar profiles of REs and RE arrays. The results from this study demonstrated that the user-defined low-complexity filtering options play a critical role in the identification of REs and associated RE arrays.

Figure 7. Three genomic regions were selected to evaluate the effects of the REMiner-II parameters on RE alignment and RE array configuration: (A) 7.4~10.1 Mb (megabase) subsequence of human chromosome Y for testing the effects of low-complexity filtering (A1: T_F = 0.5; A2: T_F = 0.6; A3: T_F = 0.7), (B) 90.31~90.51 Mb subsequence of human chromosome 3 for testing the effects of seed generation conditions (B1: m = 0 and L = 56; B2: m = 1 and L = 56; B3: m = 0 and L = 34), and (C) 26.25~26.32 Mb subsequence of human chromosome 20 for testing the effects of the gapped alignment algorithm (C1: X = 10; C2: X = 30; C3: X = 50). The results of this experiment are discussed in the main text (section 2.2.3).

2.2.3.2. Effects of Seed Generation Conditions

There are five tunable parameters in REMiner-II that affect the seed generation process: word size (W), maximum allowable mismatch (m), space threshold (SP), seed length threshold (L), and ungapped extension threshold (S_T). A large word size is anticipated to decrease sensitivity since a seed population derived from a large word size is expected to be a subset of a seed population using a smaller word. In a similar context, a small space threshold may be associated with decreased sensitivity. REMiner-II applies an ungapped extension algorithm during the merging of candidate seeds, whereas the seeds that are subjected to an alignment-extension in the BLAST algorithm are generated without a merging process [16]. As the value of S_T is set lower, there are parallel decreases in the number of merged seeds, computation complexity, and size of RE alignment result files. At the same time, it may raise computation complexity due to the increased frequency of seed merging efforts; thus, an optimal S_T value needs to be set for each seed generation process.

Two of the most influential parameters are maximum allowable mismatch (m) and seed length threshold (L). The effects of m and L on RE alignment and RE array configuration were evaluated using a region (90.3~90.5 Mb) of human chromosome 3 (Figure 7-B). As expected, the alignment results with an m value of 0 (Figure 7-B1) showed a lower density of REs and RE arrays compared to the results with an m value of 1 (Figure 7-B2). In addition, a decrease in seed length threshold (L) (from 56 in Figure 7-B1 to 34 in Figure 7-B3; both m = 0) compensated for the effects of a lower m value: decreasing the seed length threshold from 56 to 34 yielded higher densities of REs and RE arrays, more similar to the results obtained with an m value of 1 (Figure 7-B2; L = 56).

2.2.3.3. Effects of Alignment-Extension Parameters

There are three parameters embedded in the alignment-extension algorithm: matching score (S_MAT), mismatching score (S_MIS), and gapped extension threshold (X). S_MAT and S_MIS affect the alignment score as a reward and a penalty, respectively. The gap score is computed by S_MAT − S_MIS/2 according to the greedy algorithm [27]. Since the gapped extension threshold (X) determines the length of the alignments, the value of X affects the sensitivity of the alignment-extension process. The effects of variations in the value of X (10, 30, and 50) on RE alignment and RE array configuration were examined using a region (26.25~26.32 Mb) of human chromosome 20 (Figure 7-C). It was evident that the runs with X values of 30 and 50 resulted in a higher sensitivity in regard to the number of RE alignments and RE arrays compared to the run with an X value of 10. Interestingly, although the profiles of RE alignments and RE arrays from runs with X values of 30 and 50 look similar, the one with an X value of 50 required approximately 40 % more computation time.

2.2.4. Complexly Organized RE Arrays Identified from Human and Mouse Chromosomes using REMiner-II

Two groups of 12 complexly organized RE arrays were selected from the survey of the entire sets of human and mouse chromosomes, and species-specific libraries of RE arrays were established (Figure 8 and Supplementary Figures 1 and 2). Within the two-dimensional dot-plot viewer, tandem repeats are represented by a square filled with staggered diagonal lines, while interspersed repeats form randomly displayed dots and lines. The RE arrays presented in Figure 8, however, demonstrate much more complex and diverse patterns compared to the basic tandem and interspersed repeats. For instance, combinations of different tandem repeats, displaying various densities and patterns, formed complex, but ordered, RE arrays (e.g., REAch19–11 in Figure 8). While some RE arrays had only direct REs, indicated with blue lines and dots, others contained a mixture of direct and inverse (red lines and dots) REs, which are often associated with imperfect palindromic REs.

Figure 8. A survey of the mining results from the 24 human chromosomes and 21 mouse chromosomes revealed a number of complexly organized RE arrays, and 12 representative RE arrays were selected from each genome to demonstrate their unique architectural configurations as well as species-specific patterns.

2.2.5. Comparative analysis: REMiner vs. REMiner-II

In comparison to the original REMiner program (developed for the Linux operating system), there are three key changes in REMiner-II (developed for the Windows operating system) that contributed to its performance, primarily in regard to computation time, memory usage, and stability. First, instead of employing the perfect hash table as a WLT as in REMiner, REMiner-II uses two WLTs, which allow for reduced memory consumption and stable memory management. Second, while REMiner uses a two-hit method during the seeding process, an n-hit protocol is incorporated into the design of REMiner-II. As demonstrated in Figure 4, the n-hit method yielded a substantial decrease in the number of candidate seeds following the merging of words compared to the two-hit method, resulting in a more efficient alignment-extension process. Lastly, REMiner-II utilizes multiple processors in parallel during the alignment-extension process to reduce computation time, in contrast to the single processor protocol embedded in REMiner.

3. Discussion

In this study, we developed and tested the REMiner-II program for a chromosome-wide and unbiased identification and configuration of REs and RE arrays. One unique functional feature of REMiner-II is that it can identify REs as well as RE arrays from substantially large query sequences with a relatively short processing time (about seven hours for human chromosome 1 of ~250 Mb) as well as low memory usage. In this regard, REMiner-II is unique compared to the majority of available RE analysis tools, which are designed to search for specific REs and/or able to handle only sub-chromosome size query sequences [28,29]. REMiner-II's ability to process large chromosomes as a single query within several hours may be attributed to a combination of factors, such as its efficient search algorithms in conjunction with the two WLTs, an n-hit method, parallel computing, and an m-allowable protocol. In addition, 10 adjustable parameters in regard to computation speed, memory usage, and sensitivity provide flexibility so that an optimal set of mining parameters can be formulated for individual RE array studies.

A survey of the two-dimensional dot-matrix RE data from the entire sets of human and mouse chromosomes identified species-specific libraries of RE arrays. The structural configuration of some of these RE arrays is very complex; they are formed by combinations of different orientations (direct, inverse, and palindromic) and spacing characteristics (tandem and interspersed) among REs. The ordered nature of the RE arrays suggests that certain RE arrays may participate in biologic processes (e.g., cell division, recombination) that contribute to phenotype determination (both species- and individual-specific) and evolution.

One of REMiner-II's core applications is the establishment of a comprehensive database of REs and RE arrays from large chromosomes/genomes. In addition to this primary feature, its ability to save and export the seed and alignment data collected from each RE scan of a query chromosome may allow for efficient comparative analyses of REs and RE arrays across different data sets as well as the creation of de novo RE libraries with classified members. Furthermore, we plan to develop a computer program to analyze the one-dimensional configuration of the individual RE arrays that are identified throughout the human and mouse genomes using REMiner-II. This new program, in combination with REMiner-II, may serve as an important tool for the future study of RE array polymorphisms in humans, mice, and other species.

4. Materials and Methods

4.1. Acquisition of Genome Sequences

The entire reference genome sequences of 3,095,677,412 (229,318,183 uncharacterized) and 2,654,895,218 (90,012,700 uncharacterized) nucleotides were obtained for the human and mouse, respectively, from the National Center for Biotechnology Information (NCBI) (Build 37.1).
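As a quick check on how much of each reference sequence is available for mining, the uncharacterized fraction can be computed from the nucleotide counts given above. The snippet is only illustrative arithmetic; the dictionary layout and the function name `uncharacterized_percent` are assumptions introduced for this example.

```python
# Nucleotide counts quoted in Section 4.1: (total, uncharacterized).
GENOMES = {
    "human": (3_095_677_412, 229_318_183),
    "mouse": (2_654_895_218, 90_012_700),
}


def uncharacterized_percent(total, uncharacterized):
    """Percentage of nucleotides in the sequence that are uncharacterized."""
    return 100.0 * uncharacterized / total


for species, (total, unknown) in GENOMES.items():
    characterized = total - unknown
    print(f"{species}: {uncharacterized_percent(total, unknown):.2f}% uncharacterized, "
          f"{characterized:,} characterized nucleotides")
```

By these counts, roughly 7.4 % of the human and 3.4 % of the mouse reference sequence is uncharacterized.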

4.2. Surveying of the Human and Mouse Chromosomes for RE Arrays

The entire sets of the NCBI human and mouse chromosome sequences were surveyed for REs and RE arrays using REMiner-II. Each chromosome (24 for human and 21 for mouse) was analyzed as a single query. The default set of parameters listed in Table 1 was applied to all chromosomes surveyed for RE arrays. Using the REMiner-II viewer, each 1 Mb (megabase) dot-matrix plot was surveyed for RE arrays based on selected RE arrangement characteristics, such as size, orientation, complexity, and density. Subsequently, unique RE arrays were compiled to create two libraries specific for the human and mouse genomes. The hardware environment employed for this study was a 64-bit Windows XP workstation with a quad-core processor operating at 3.07 GHz (gigahertz) and 12 GB (gigabytes) of memory.

Supplementary Material

01. Supplementary Table 1. A summary of the results from the performance evaluation of REMiner-II in regard to the chromosome-wide mining of RE arrays from the 24 human and 21 mouse chromosomes.

Chr (chromosome), Hu (human), Mo (mouse)

NIHMS389733-supplement-01.xls (37 KB, xls)

02. Supplementary Table 2. The position (chromosomal coordinates) and other relevant information (e.g., viewer magnification ratio) for the individual RE arrays which are presented in Supplementary Figures 1 and 2.

Gray shades indicate the RE arrays with variable numbers of subset arrays. Chr (chromosome)

NIHMS389733-supplement-02.xls (127.5 KB, xls)

03. Supplementary Figure 1. RE arrays identified from the survey of the entire set of 24 human chromosomes using REMiner-II.

The RE arrays (a total of 333) identified from the survey of the entire set of 24 human chromosomes (1–22, X, and Y) are compiled. The position (chromosomal coordinates) and other relevant information (e.g., viewer magnification ratio) for each RE array are listed in Supplementary Table 2.

NIHMS389733-supplement-03.pdf (11.8 MB, pdf)

04. Supplementary Figure 2. RE arrays identified from the survey of the entire set of 21 mouse chromosomes using REMiner-II.

The RE arrays (a total of 308) identified from the survey of the entire set of 21 mouse chromosomes (1–19, X, and Y) are compiled. The position (chromosomal coordinates) and other relevant information (e.g., viewer magnification ratio) for each RE array are listed in Supplementary Table 2.

NIHMS389733-supplement-04.pdf (8.1 MB, pdf)

Acknowledgments

This study was supported by grants from Shriners of North America (No. 86800 to KC, No. 84302 to KHL [postdoctoral fellowship]) and the National Institutes of Health (R01 GM071360 to KC). We also want to thank David Bosik Kim for helpful comments during the software development.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

1. Bernal A, Ear U, Kyrpides N. Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res. 2001;29:126–127. doi: 10.1093/nar/29.1.126.
2. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
3. Schuler GD, Boguski MS, Stewart EA, Stein LD, Gyapay G, Rice K, White RE, Rodriguez-Tome P, Aggarwal A, Bajorek E, Bentolila S, Birren BB, Butler A, Castle AB, Chiannilkulchai N, Chu A, Clee C, Cowles S, Day PJ, Dibling T, Drouot N, Dunham I, Duprat S, East C, Edwards C, Fan JB, Fang N, Fizames C, Garrett C, Green L, Hadley D, Harris M, Harrison P, Brady S, Hicks A, Holloway E, Hui L, Hussain S, Louis-Dit-Sully C, Ma J, MacGilvery A, Mader C, Maratukulam A, Matise TC, McKusick KB, Morissette J, Mungall A, Muselet D, Nusbaum HC, Page DC, Peck A, Perkins S, Piercy M, Qin F, Quackenbush J, Ranby S, Reif T, Rozen S, Sanders C, She X, Silva J, Slonim DK, Soderlund C, Sun WL, Tabar P, Thangarajah T, Vega-Czarny N, Vollrath D, Voyticky S, Wilmer T, Wu X, Adams MD, Auffray C, Walter NA, Brandon R, Dehejia A, Goodfellow PN, Houlgatte R, Hudson JR, Jr, Ide SE, Iorio KR, Lee WY, Seki N, Nagase T, Ishikawa K, Nomura N, Phillips C, Polymeropoulos MH, Sandusky M, Schmitt K, Berry R, Swanson K, Torres R, Venter JC, Sikela JM, Beckmann JS, Weissenbach J, Myers RM, Cox DR, James MR, Bentley D, Deloukas P, Lander ES, Hudson TJ. A gene map of the human genome. Science. 1996;274:540–546.
4. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L, Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R, Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z, Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun J, Wang Z, Wang A, Wang X, Wang J, Wei M, Wides R, Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao Q, Zheng L, Zhong F, Zhong W, Zhu S, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter C, Cravchik A, Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I, Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport L, Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart B, Haynes J, Haynes C, Heiner C, Hladun S, Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline L, Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T, McMullen I, Moy M, Moy L, Murphy B, Nelson K, Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers YH, Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E, Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M, Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigo R, Campbell MJ, Sjolander KV, Karlak B, Kejariwal A, Mi H, Lazareva B, Hatton T, Narechania A, Diemer K, Muruganujan A, Guo N, Sato S, Bafna V, Istrail S, Lippert R, Schwartz R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J, Blick L, Caminha M, Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A, Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K, Glodek A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J, Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X, Lopez J, Ma D, Majoros W, McDaniel J, Murphy S, Newman M, Nguyen T, Nguyen N, Nodell M, Pan S, Peck J, Peterson M, Rowe W, Sanders R, Scott J, Simpson M, Smith T, Sprague A, Stockwell T, Turner R, Venter E, Wang M, Wen M, Wu D, Wu M, Xia A, Zandieh A, Zhu X. The sequence of the human genome. Science. 2001;291:1304–1351. doi: 10.1126/science.1058040.
5. Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigo R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O’Connor MJ, Okazaki Y, Oliver K, Overton-Larty E, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, Stabenau A, Stange-Thomann N, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Ureta-Vidal A, Vinson JP, Von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC, Lander ES. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–562. doi: 10.1038/nature01262.
6. Deininger PL, Batzer MA. Mammalian retroelements. Genome Res. 2002;12:1455–1465. doi: 10.1101/gr.282402.
7. Dion V, Wilson JH. Instability and chromatin structure of expanded trinucleotide repeats. Trends Genet. 2009;25:288–297. doi: 10.1016/j.tig.2009.04.007.
8. Fondon JW, 3rd, Garner HR. Molecular origins of rapid and continuous morphological evolution. Proc Natl Acad Sci U S A. 2004;101:18058–18063. doi: 10.1073/pnas.0408118101.
9. Lee KH, Lee YK, Kwon DN, Chiu S, Chew V, Rah H, Kujawski G, Melhem R, Hsu K, Chung C, Greenhalgh DG, Cho K. Identification of a unique library of complex, but ordered, arrays of repetitive elements in the human genome and implication of their potential involvement in pathobiology. Exp Mol Pathol. 2011;90:300–311. doi: 10.1016/j.yexmp.2011.02.007.
10. Chung BI, Lee KH, Shin KS, Kim WC, Kwon DN, You RN, Lee YK, Cho K, Cho DH. REMiner: a tool for unbiased mining and analysis of repetitive elements and their arrangement structures of large chromosomes. Genomics. 2011;98:381–389. doi: 10.1016/j.ygeno.2011.07.002.
11. Bedell JA, Korf I, Gish W. MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics. 2000;16:1040–1041. doi: 10.1093/bioinformatics/16.11.1040.
12. Li X, Kahveci T, Settles AM. A novel genome-scale repeat finder geared towards transposons. Bioinformatics. 2008;24:468–476. doi: 10.1093/bioinformatics/btm613.
13. Ma B, Tromp J, Li M. PatternHunter: faster and more sensitive homology search. Bioinformatics. 2002;18:440–445. doi: 10.1093/bioinformatics/18.3.440.
14. Flecken T, Schmidt N, Spangenberg HC, Thimme R. [Hepatocellular carcinoma - from immunobiology to immunotherapy] Z Gastroenterol. 2012;50:47–56. doi: 10.1055/s-0031-1282002.
15. Jurka J, Klonowski P, Dagman V, Pelton P. CENSOR--a program for identification and elimination of repetitive elements from DNA sequences. Comput Chem. 1996;20:119–121. doi: 10.1016/s0097-8485(96)80013-1.
16. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
17. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
18. Lerat E. Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity. 2010;104:520–533. doi: 10.1038/hdy.2009.165.
19. Achaz G, Boyer F, Rocha EP, Viari A, Coissac E. Repseek, a tool to retrieve approximate repeats from large DNA sequences. Bioinformatics. 2007;23:119–121. doi: 10.1093/bioinformatics/btl519.
20. Chen Y, Zhou F, Li G, Xu Y. MUST: a system for identification of miniature inverted-repeat transposable elements and applications to Anabaena variabilis and Haloquadratum walsbyi. Gene. 2009;436:1–7. doi: 10.1016/j.gene.2009.01.019.
21. Edgar RC, Myers EW. PILER: identification and classification of genomic repeats. Bioinformatics. 2005;21(Suppl 1):i152–158. doi: 10.1093/bioinformatics/bti1003.
22. Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 2008;9:18. doi: 10.1186/1471-2105-9-18.
23. Morgulis A, Gertz EM, Schaffer AA, Agarwala R. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol. 2006;13:1028–1040. doi: 10.1089/cmb.2006.13.1028.
24. Watson JD, Crick FH. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature. 1953;171:737–738. doi: 10.1038/171737a0.
25. Andonian BJ, Prus KM, Masi AT. Mechanobiology likely contributes to immunobiology pathways in the pathogenesis of ankylosing spondyltitis. Clin Exp Rheumatol. 2012.
26. Leonard DA, Gordon CR, Sachs DH, Cetrulo CL., Jr Immunobiology of face transplantation. J Craniofac Surg. 2012;23:268–271. doi: 10.1097/SCS.0b013e318241b8e0.
27. Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000;7:203–214. doi: 10.1089/10665270050081478.
28. Smit A, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2010 <http://www.repeatmasker.org>.
29. Xu Z, Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 2007;35:W265–268. doi: 10.1093/nar/gkm286.

