NTRFinder: a software tool to find nested tandem repeats

Atheer A Matroud; M D Hendy; C P Tuffley

doi:10.1093/nar/gkr1070

Nucleic Acids Res. 2011 Nov 25;40(3):e17. doi: 10.1093/nar/gkr1070

PMCID: PMC3273788 PMID: 22121222

Abstract

We introduce the software tool NTRFinder to search for a complex repetitive structure in DNA we call a nested tandem repeat (NTR). An NTR is a recurrence of two or more distinct tandem motifs interspersed with each other. We propose that NTRs can be used as phylogenetic and population markers. We have tested our algorithm on both real and simulated data, and present some real NTRs of interest. NTRFinder can be downloaded from http://www.maths.otago.ac.nz/~aamatroud/.

INTRODUCTION

Genomic DNA has long been known to contain ‘tandem repeats’: repetitive structures in which many approximate copies of a common segment (the ‘motif’) appear consecutively. Several studies have proposed different mechanisms for the occurrence of tandem repeats (1,2), but their biological role is not well understood.

Recently, we have observed a more complex repetitive structure in the ribosomal DNA of Colocasia esculenta (taro), consisting of multiple approximate copies of two distinct motifs interspersed with one another. We call such structures nested tandem repeats (NTRs), and the problem of finding them in sequence data is the focus of this article. Our motivation is their potential use for studying populations: for example, a preliminary analysis suggests that changes in the NTR in taro have been occurring on a 1000 year time scale, so a greater understanding of this NTR offers the potential to date the early agriculture of this ancient staple food crop.

The problem of locating tandem repeats is well known: their implication in neurological disorders (3,4) and their use in inferring evolutionary histories have prompted researchers to develop tools to find them. This has resulted in a number of software tools, each of which has its own strengths and limitations. However, the existing tools were not designed to find NTRs, and consequently do not generally find them. In this article, we present a new software tool, NTRFinder, which is designed to find these more complex repetitive structures.

We describe here the algorithm on which NTRFinder is based and report some of the NTRs it has identified, including an even more complex structure where copies of four distinct motifs are interspersed.

Sequences, edit operations and the edit distance

A DNA sequence is a sequence of symbols from the nucleotide alphabet Σ = {A,C,G,T}. We define a DNA segment to be a string of contiguous DNA nucleotides and define a site to be a component in a segment. For a DNA segment

$$X = x_1 x_2 \cdots x_n,$$

x_i ∈ Σ is the nucleotide at the i-th site and |X| = n is the length of X.

Copying errors happen in DNA replication due to different external and internal factors. These changes include substitution, insertion, deletion, duplication and contraction. We refer to these as ‘edit operations’. By giving each type of edit operation some specific weight, we can in principle find a series of edit operations which transforms segment x to segment y, whose sum of weights is minimal. We will refer to this sum as the ‘edit distance’, and denote it by d(x, y). For the purposes of this article, the edit operations allowed in calculating the edit distance are restricted to single nucleotide substitutions, and single nucleotide insertions or deletions (indels).
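The restricted edit distance described above can be computed with the standard dynamic programming recurrence. The following is a minimal sketch, assuming unit weights for substitutions and indels; the function name `edit_distance` and the `sub` and `indel` parameters are illustrative choices, not part of NTRFinder.

```python
def edit_distance(x: str, y: str, sub: int = 1, indel: int = 1) -> int:
    """Minimal total weight of substitutions and indels turning x into y."""
    # prev[j] is the distance between the prefix of x processed so far and y[:j].
    prev = [j * indel for j in range(len(y) + 1)]
    for i, a in enumerate(x, start=1):
        curr = [i * indel]
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j - 1] + (0 if a == b else sub),  # match or substitution
                prev[j] + indel,                        # delete a from x
                curr[j - 1] + indel,                    # insert b into x
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("AGG", "AGC"))      # one substitution
    print(edit_distance("CTCAG", "CTTCG"))  # two edits
```

Only two rows of the table are kept at any time, so the memory use is proportional to the length of y.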

Classification of tandem repeats

Many classifications of tandem repeat schemas have been introduced in the computational biology literature. We list some which are commonly used:

(Exact) tandem repeats: an ‘exact tandem repeat’ (TR) is a sequence comprising two or more contiguous copies XX…X of identical segments X (referred to as the motif).
k–Approximate tandem repeats: a k–approximate tandem repeat (k–TR) is a sequence comprising two or more contiguous copies X₁X₂…X_n of similar segments, where each individual segment X_i is edit distance at most k from a template segment X.
Multiple length tandem repeats: a multiple length TR (MLTR) is a TR where each repeat copy is of the form Xxⁿ, where n is a constant larger than one and d(X, x) is greater than some threshold value k.

Examples

• TR:

AGG AGG AGG AGG AGG. The motif is AGG.
• 1–TR:

AGG AGC ATG AGG CGG. The motif is AGG.
• MLTR:

GACCTTTGG ACGGT ACGGT ACGGT

GACCTTTGG ACGGT ACGGT ACGGT.

The motifs are x = ACGGT and X = GACCTTTGG, with n = 3.

NTRs

In this section, we introduce a more complex repetitive structure, the NTR, also referred to as a ‘variable length tandem repeat’ (5). Let X and x be two segments (typically of different lengths) from the alphabet Σ = {A,C,G,T}, such that d(X,x) is greater than some threshold value k.

Definition 1. —

An ‘exact nested tandem repeat’ is a string of the form

$$x^{s_0} X x^{s_1} X \cdots X x^{s_n},$$

where n > 1, s_i ≥ 1 for each 0 < i < n, and s_j ≥ 2 for some j ∈ {0, 1, …, n}. The motif x is called the TR and the motif X is the ‘interspersed repeat’. The concatenations of the tandem repeats alone, and of the interspersed motifs X alone, both form exact TRs.

Example —

x = ACGGT, X = GACCTTTGG, n = 7, s₀ = 0, s₁ = 3, s₂ = 5, s₃ = 2, s₄ = 4, s₅ = 1, s₆ = s₇ = 2, so the NTR is

$$X x^{3} X x^{5} X x^{2} X x^{4} X x X x^{2} X x^{2}.$$
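To make Definition 1 concrete, the sketch below assembles an exact NTR from the two motifs and the list of copy numbers (s₀, …, s_n), checking the conditions of the definition first. The function name `exact_ntr` is hypothetical and is used only for illustration.

```python
def exact_ntr(x: str, X: str, s: list[int]) -> str:
    """Return x^{s[0]} X x^{s[1]} X ... X x^{s[n]} for copy numbers s."""
    n = len(s) - 1
    if n <= 1:
        raise ValueError("Definition 1 requires n > 1")
    if any(s[i] < 1 for i in range(1, n)):
        raise ValueError("s_i >= 1 is required for 0 < i < n")
    if not any(count >= 2 for count in s):
        raise ValueError("some s_j must be at least 2")
    # Joining the runs of x with X places exactly n copies of X between them.
    return X.join(x * count for count in s)


if __name__ == "__main__":
    ntr = exact_ntr("ACGGT", "GACCTTTGG", [0, 3, 5, 2, 4, 1, 2, 2])
    print(len(ntr), ntr)
```

With the values from the example, the result contains 7 copies of X and 19 copies of x, so its length is 7 × 9 + 19 × 5 = 158 nucleotides.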

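In practice, we expect any NTRs occurring in DNA sequences to be approximate rather than exact. In what follows we will write $\tilde{X}$ to mean an approximate copy of the motif X, and $\tilde{x}^{s}$ to mean an approximate TR consisting of s (not necessarily identical) approximate copies of the motif x.

Definition 2. —

A (k₁, k₂)-approximate NTR is a string of the form

$$\tilde{x}^{s_0} \tilde{X} \tilde{x}^{s_1} \tilde{X} \cdots \tilde{X} \tilde{x}^{s_n},$$

where n and s_i satisfy the same conditions as in Definition 1, the concatenation $\tilde{x}^{s_0} \tilde{x}^{s_1} \cdots \tilde{x}^{s_n}$ of the tandem copies is a k₁-approximate TR with motif x, and the concatenation $\tilde{X} \tilde{X} \cdots \tilde{X}$ of the interspersed copies is a k₂-approximate TR with motif X.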

Examples —

• NTR:

AGG AGG CTCAG AGG CTCAG AGG AGG AGG CTCAG.

The motifs are AGG, CTCAG.

• (1, 2)–NTR:

AGA AGG CTTCG AGG CTCAG AG AGA AGG CTTCG AGG CTCAG AAG.

The motifs are x = AGG, X = CTCAG.
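The distance bounds in the (1, 2)–NTR example can be checked mechanically. The sketch below assumes the copies have already been segmented and labelled by hand, and that both template motifs are known; it tests only the edit-distance bounds k₁ and k₂, not the conditions on n and s_i. The names `edit_distance` and `within_bounds` are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (substitutions and single-base indels)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def within_bounds(copies, x: str, X: str, k1: int, k2: int) -> bool:
    """copies is a list of (label, segment) pairs, with label 'x' or 'X'."""
    for label, segment in copies:
        motif, bound = (x, k1) if label == "x" else (X, k2)
        if edit_distance(segment, motif) > bound:
            return False
    return True


# The (1, 2)-NTR example, segmented as printed in the text.
EXAMPLE = [
    ("x", "AGA"), ("x", "AGG"), ("X", "CTTCG"), ("x", "AGG"), ("X", "CTCAG"),
    ("x", "AG"), ("x", "AGA"), ("x", "AGG"), ("X", "CTTCG"), ("x", "AGG"),
    ("X", "CTCAG"), ("x", "AAG"),
]

if __name__ == "__main__":
    print(within_bounds(EXAMPLE, "AGG", "CTCAG", k1=1, k2=2))
```

Tightening either bound (k₁ = 0 or k₂ = 1) makes the check fail, because the copies AGA, AG and AAG are each one edit from AGG and CTTCG is two edits from CTCAG.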

Related work

Various algorithms have been introduced to find exact TRs. Such algorithms were developed mainly for theoretical purposes, namely, to solve the problem of finding squares in strings (6–10). These algorithms are not easily adapted to finding the approximate TRs that usually occur in DNA.

A number of algorithms (11,12) consider motifs differing only by substitutions, using the Hamming distance as a measure of similarity. Others, e.g. (13–17), have considered insertions and deletions by using the edit distance. Most of these algorithms have two phases, a scanning phase that locates candidate TRs, and an analysis phase that checks the candidate TRs found during the scanning phase.

The only algorithm designed to look for NTRs is that of Hauth and Joseph in (5), which searches for tandem motifs of length at most 6 nt.

MATERIAL AND METHODS

In this section, we present the algorithm we have developed to search for NTRs in a DNA sequence. The algorithm requires several preset parameters. These are: k₁ and k₂, which bound the edit distances from the tandem and interspersed motifs; and lower and upper bounds on the lengths of the tandem and interspersed motifs. Other input parameters are discussed below.

Search phase —

Our search is confined to seeking NTRs whose tandem and interspersed motif lengths lie within the preset bounds. A (k₁, k₂)–NTR must contain a k₁–TR, so we begin by scanning the sequence for approximate TRs. To do this, we have chosen to adapt the TR search algorithm of Wexler et al. (17), in which the sequence is searched for tandem motifs of length l₁ by scanning the sequence with two windows w₁ and w₂ of width w, at distance l₁ apart. This may be adapted to find non-adjacent copies of the tandem motif (as occur in NTRs) by holding w₁ fixed, and moving w₂ further away.
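The two-window idea can be sketched in a few lines. This is a deliberately simplified illustration: it compares windows by counting mismatches (Hamming distance), which is an assumption made here for brevity and is not the matching criterion of Wexler et al. or of NTRFinder. The names `window_hits`, `first_displaced_copy` and `max_mismatches` are hypothetical.

```python
def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length windows differ."""
    return sum(ca != cb for ca, cb in zip(a, b))


def window_hits(seq: str, l: int, w: int, max_mismatches: int = 0) -> list[int]:
    """Positions i where windows of width w at i and i + l nearly agree."""
    hits = []
    for i in range(len(seq) - l - w + 1):
        if mismatches(seq[i:i + w], seq[i + l:i + l + w]) <= max_mismatches:
            hits.append(i)
    return hits


def first_displaced_copy(seq, i, w, min_offset, max_offset, max_mismatches=0):
    """Hold the first window at i and slide the second one to the right.

    Returns the smallest offset at which the windows nearly agree, or None.
    """
    for offset in range(min_offset, max_offset + 1):
        j = i + offset
        if j + w > len(seq):
            break
        if mismatches(seq[i:i + w], seq[j:j + w]) <= max_mismatches:
            return offset
    return None


if __name__ == "__main__":
    print(window_hits("AGGAGGAGG", l=3, w=3))
    # Two copies of AGG, one copy of CTCAG, then a displaced copy of AGG.
    print(first_displaced_copy("AGGAGGCTCAGAGG", i=3, w=3, min_offset=3, max_offset=10))
```

In the second call the returned offset equals |x| + |X| = 8, so the segment skipped by the second window is the candidate interspersed motif.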

The user may set the values of k₁ and k₂; their default values follow Domaniç and Preparata (15), with the matching probability p_m given the default value p_m = 0.8.

Once a TR has been found and its full extent determined, the right-most copy of the repeated pattern is taken as the current TR motif x, and further approximate copies of x are sought to the right, displaced from the TR by at most the maximum allowed length of the interspersed motif. This is done by moving the second scanning window w₂ to the right, while holding the first fixed in the current copy of x. If no further approximate copies of x are located, this TR is abandoned, and the TR search continues to the right. If a displaced approximate copy of x is observed, then both x and the interspersed segment X are recorded in a list, as we have found a candidate NTR. Further contiguous copies of x are then sought, with the rightmost copy x replacing the previous template motif.

The steps above are repeated with successive motifs x and interspersed segments copied to the list, until no additional copies of the last recorded motif x are found. This search phase is illustrated in Figure 1.

Figure 1. — Flowchart of the NTRFinder algorithm.

At this point, the algorithm builds consensus patterns for x and X using majority rule. After constructing the two consensus patterns, the algorithm moves to the verification phase.
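A majority-rule consensus can be sketched as a column-wise vote. The example below assumes the recorded copies have already been aligned to a common length, which the text does not specify; ties are broken in favour of the symbol seen first in the column. The function name `majority_consensus` is illustrative.

```python
from collections import Counter


def majority_consensus(copies: list[str]) -> str:
    """Column-wise majority vote over equal-length (pre-aligned) copies."""
    if not copies:
        raise ValueError("at least one copy is required")
    if len({len(copy) for copy in copies}) != 1:
        raise ValueError("copies must be aligned to a common length")
    # zip(*copies) yields one tuple of symbols per alignment column.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*copies))


if __name__ == "__main__":
    print(majority_consensus(["AGG", "AGC", "ATG", "AGG", "CGG"]))
```

Applied to the five copies of the 1–TR example given earlier, the vote recovers the motif AGG, since each column has four votes for the motif symbol and one against.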

Example —

An example will help illustrate the procedure. Suppose that S contains an NTR of the form

$$x\,X_0\,xxx\,X_1\,x \cdots x\,X_2\,x \cdots x\,X_3,$$

where X₀, …, X₃ are approximate copies of the interspersed motif X.

The algorithm will scan from the left until it locates the TR consisting of three copies of x between X₀ and X₁. It will then start searching for additional non-adjacent copies of x to the right, locating the first copy to the right of X₁. Having found this, it will record the intervening segment X₁, and then continue the TR search from this point until the full extent of the TR between X₁ and X₂ is found.

This procedure is repeated once more, locating the TR between X₂ and X₃, recording the segment X₂, and then searching for further copies to the right. At this point, no more copies of x are found, and the process of verification begins. The segments X₀, X₃ and the initial copy of x are found during this stage.

Verification phase —

Each candidate NTR is checked to determine whether it meets the NTR definition. This is accomplished by aligning the candidate NTR region, together with a margin on either side of it, against the consensus motifs x and X, using the nested wrap-around dynamic programming algorithm of Matroud et al. (18). The nested wrap-around dynamic programming parameters are set to be 2 for a match, −5 for a mismatch and −7 for a gap. These parameters were chosen following Wexler et al. (17). The nested wrap-around dynamic programming algorithm has complexity O(n|x||X|), where n is the length of the NTR region and |x| and |X| are the length of the tandem motif and the length of the interspersed motif, respectively.

A remark on tandem repeat detection, and the role of verification

The definition of a k-TR requires that each repeat be a distance at most k from some template motif. However, this template is unknown during the search phase. Following Wexler et al. (17), we compare each repeat copy with the preceding copy. Comparisons between adjacent copies will not miss any TRs, provided the distance threshold is set appropriately, but may result in false positives due to ‘drift’. Such false positives are eliminated during the verification phase, when the candidate TR is aligned against the consensus motif.

Suppose that x₁x₂…x_n is a k-TR with motif x. Then since d(x, x_i) ≤ k for each i, we have

$$d(x_i, x_{i+1}) \le d(x_i, x) + d(x, x_{i+1}) \le 2k$$

by the triangle inequality. It follows that a TR search that correctly detects when d(x_i, x_{i+1}) ≤ d will find all (d/2)-TRs.

We note, however, that a segment x₁x₂…x_n satisfying d(x_i, x_{i+1}) ≤ d for all i need not be a TR, since x_j may ‘drift’ away from x_i as j increases. A simple example is a run of copies in which adjacent copies are distance 1 apart, but the first and last copies are distance 4 apart.
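Drift of this kind is easy to reproduce. The run of copies below is an illustration constructed for this purpose and is not taken from the original article; it assumes unit-cost edit distance, and the name `edit_distance` is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (substitutions and single-base indels)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


# Each copy differs from its neighbour by a single substitution.
copies = ["AAAA", "AAAC", "AACC", "ACCC", "CCCC"]

# Distances between consecutive copies: every one passes a threshold of 1.
adjacent = [edit_distance(a, b) for a, b in zip(copies, copies[1:])]

# Distance between the first and last copy: the run has drifted.
end_to_end = edit_distance(copies[0], copies[-1])

if __name__ == "__main__":
    print(adjacent, end_to_end)
```

No single template can lie within distance 1 of both AAAA and CCCC, since the triangle inequality would then force their distance to be at most 2. The run therefore passes the adjacent-copy test with threshold 1 without being a 1-TR, which is the kind of candidate the verification phase is meant to reject.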

RESULTS

Tests on simulated data

In order to measure the accuracy of NTRFinder, we generated synthetic sequence data containing NTR subsequences with varying probabilities of substitution and insertion/deletion (indels), and determined the proportion of the NTRs that were found by NTRFinder. In our simulation, we first generated one random DNA sequence of 100 000 nt, with each nucleotide occurring with probability 0.25. Within this sequence, we embedded 100 exact NTRs with repeats of randomly generated motifs X and x of varying lengths. From this sequence, we generated four additional sequences by introducing indels and substitutions. Indels were introduced to each sequence with a constant probability of 1% per site, and substitutions were introduced with varying probabilities of 1, 2, 3 and 4% per site. NTRFinder recovered 95, 84, 83, 83 and 80% of the NTRs, respectively. These results are plotted in Figure 2. No false positives were detected.
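A simulation of this kind can be sketched as follows. The article does not describe how indels were split between insertions and deletions or how NTRs were placed, so the choices below (equal split, a random inserted base, insertion at a fixed position) are assumptions, and the names `random_dna` and `mutate` are hypothetical.

```python
import random

ALPHABET = "ACGT"


def random_dna(length: int, rng: random.Random) -> str:
    """Uniform random sequence: each nucleotide has probability 0.25."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def mutate(seq: str, p_sub: float, p_indel: float, rng: random.Random) -> str:
    """Apply per-site substitutions and single-base indels."""
    out = []
    for base in seq:
        r = rng.random()
        if r < p_indel:
            if rng.random() < 0.5:
                continue                          # deletion of this site
            out.append(rng.choice(ALPHABET))      # insertion before this site
            out.append(base)
        elif r < p_indel + p_sub:
            # Substitute with one of the three other nucleotides.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)
background = random_dna(1000, rng)
# One exact NTR with copy numbers (2, 3, 1, 2), embedded at position 500.
ntr = "GACCTTTGG".join("ACGGT" * count for count in (2, 3, 1, 2))
sample = background[:500] + ntr + background[500:]
# Noisy version with 2% substitutions and 1% indels per site.
noisy = mutate(sample, p_sub=0.02, p_indel=0.01, rng=rng)

if __name__ == "__main__":
    print(len(sample), len(noisy))
```

Fixing the seed of `random.Random` keeps runs reproducible, which makes it possible to compare recovery rates across substitution levels on the same underlying sequence.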

Figure 2. — Percentage of NTRs found in the synthetic sequences.

The first phase of NTRFinder uses a modification of ATRhunter's algorithm. In Wexler et al. (17), the authors report that ATRhunter has a 74–90% success rate for finding ATRs in synthetic sequences, with average score of an ATR over all sequences being 238 with a standard deviation of 116. These results suggest the accuracy of the Wexler algorithm provides the major limitation on the accuracy of NTRFinder.

Tests on real sequence data

To test NTRFinder on real sequence data, we searched all intergenic spacer (IGS) sequences available in GenBank. The IGS sequences were chosen because we already knew of an NTR in the IGS region of C. esculenta. We also searched the entire Human Y chromosome from Fujita et al. (19).

The search used preset size ranges for the two motif lengths, with the parameters k₁ and k₂ set to their default values. NTRs found in IGS sequences are listed in Table 1. We searched 27 IGS sequences and found NTRs in 12 of them.

Table 1. NTRs found in some IGS sequences searched from GenBank and an additional unpublished sequence (C. esculenta)

Species and accession number	NTR structure (s₀, s₁, … , s_n)	|x|	|X|	Start index	End index
Nicotiana sylvestris X76056.1	(0,1,2,3,4,2,3,2,3,3,3,5,4,3,2,3,3,1,2,5,3,5,4,2,2,4,3)	10	13	960	2111
Brassica juncea X73032.1	(9,12,2,6,2,5,4,1)	21	30	1403	2605
Brassica oleracea	(1,1,1,3,3,1,3,2,1,1,2,1,3,1,1,2,1,2,1,1,2)	30	44	1031	2902
Brassica oleracea X60324.1	(1,1,1,3,3,1,3,1,3,2,1,1,2,1,3,1,1,2,1,2,1,1,2)	30	44	1036	3133
Brassica rapa S78172.1	(1,2,2,3,3,2,2,2,3)	12	45	385	1337
Brassica campestris X73031.1	(6,4,4,7,4,4,4,3,1)	21	51	1558	2580
Colocasia esculenta (not published)	(5,3,1,6,10,5,10,9,13,14,15,4)	11	48	725	2384
Nicotiana tomentosiformis Y08427.1	(1,1,2,2,1,2,4,2,1)	20	46	1016	1969
Arabidopsis thaliana CP002685.1	(3,2,1,1)	13	17	32 189	32 365
Zea mays AJ309824.2	(2,2,1)	19	52	2984	3113
Olea europaea AJ865373.1	(3,1,3,6,5,6,4,3,4)	75	11	961	3743
Herdmania momus X53538.1	(3,1,1,1,1,0)	107	91	6363	7642

NTRs found in the Human Y chromosome are listed in Table 2. All NTRs found in the Y chromosome appear to be in the pseudoautosomal region.

Table 2. NTRs found in the Human Y chromosome

NTR structure (s₀, s₁, … , s_n)	|x|	|X|	Start index	End index
(1,2,2,1,2,1,2,1,1,2,2,2,2,1)	12	56	143 865	144 880
(7,22,23,12,14,4)	2	88	234 183	234 767
(1,1,2,1,1,1,1,,1,1,1,1,1,1,1,1,1,1,1,1)	15	14	465 369	466 397
(1,1,1,2,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)	11	16	647 659	649 721
(17,15,31,28,72,62)	2	49	901 237	902 037
(3,6,8,11,7,6,4,4,5,4,11)	12	32	1 279 754	1 280 875
(26,27,25,25,25,20,17,13,26)	1	48	1 397 128	1 397 735
(1,2,1,2,1,2,2,2,2,2,1,2,1,1,1,2,2,2,1,2,2,1,1)	16	22	1 516 157	1 517 560
(1,1,2,6,2,2,2,1,2)	19	35	1 626 578	1 627 258
(1,1,1,0,2,1)	19	56	2 102 194	2 102 594
(2,2,2,1,2,1,1,1,1,2,6)	21	15	2 164 541	2 165 091

More complex structures

In addition to the NTRs in Table 2, NTRFinder also reported an NTR in Linum usitatissimum (GenBank accession EU307117.1, gi 164684852) which on further analysis by hand turned out to have a more complex structure. The IGS region of the rDNA of this species contains an NTR with four motifs interspersed with each other. The four motifs are w = GTGCGAAAAT, x = GCGCGCCAGGG, y = GCACCCATAT and z = GCGATTTTG, and the NTR consists of runs of these motifs whose copy numbers q_i, r_i, s_i and t_i satisfy q_i ∈ {1, 2, 3}; r_i ∈ {1, 2}; s_i ∈ {0, 1}; t_i ∈ {0, 1}.

Running time

The running time for NTRFinder searching some sequences from GenBank is shown in Figure 3. It can be seen that the run time is approximately linear in the length of the sequence. However, it must be noted that the run time depends not only on the length of the input sequence, but also on the number of TRs and NTRs found in the sequence. The program spends most of the time verifying any TRs found.

DISCUSSION

In the last decade, a number of software tools to find TRs have been introduced; however, little work exists on more complex repetitive structures such as NTRs. The problem of finding NTRs is addressed in this study. The motivation for our study is the potential use of NTRs as a marker for genetic studies of populations and of species.

We have done some analysis on the NTR in the intergenic spacer region in C. esculenta (taro), noting some variation in the NTRs derived from domesticated varieties sourced from New Zealand, Australia and Japan. Further varieties are currently being analyzed.

CONCLUSION

The NTR structure is a complex structure that requires further analysis and study. The number of copy variants in the NTR region and the relationships between these copies might suggest a TR generation mechanism. In this article, we have introduced a new algorithm to find NTRs. The first phase of the algorithm scans the sequence for candidate repeats, while the second phase (the alignment) has time complexity O(n|x||X|), where n is the length of the NTR region and |x| and |X| are the lengths of the tandem and interspersed motifs, each bounded by its maximum allowed length.

FUNDING

The Allan Wilson Centre for Molecular Ecology and Evolution. Funding for open access charge: Allan Wilson Centre.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We thank Andrew Clarke and Peter Matthews, for providing data and useful background about taro, and Hussain Matawa, for assisting in the development of the program interface. We would also like to thank the anonymous reviewers for their help in improving the manuscript.

REFERENCES

1. Weitzmann MN, Woodford KJ, Usdin K. DNA Secondary Structures and the Evolution of Hypervariable Tandem Arrays. J. Biol. Chem. 1997;272:9517–9523. doi: 10.1074/jbc.272.14.9517.
2. Wells RD. Molecular basis of genetic instability of triplet repeats. J. Biol. Chem. 1996;271:2875–2878. doi: 10.1074/jbc.271.6.2875.
3. Macdonald ME, Ambrose CM, Duyao MP, Myers RH, Lin C, Srinidhi L, Barnes G, Taylor SA, James M, Groot N. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. The Huntington's Disease Collaborative Research Group. Cell. 1993;72:971–983. doi: 10.1016/0092-8674(93)90585-e.
4. Fu YH, Pizzuti A, Fenwick RG Jr, King J, Rajnarayan S, Dunne PW, Dubel J, Nasser GA, Ashizawa T, de Jong P, et al. An unstable triplet repeat in a gene related to myotonic muscular dystrophy. Science. 1992;255:1256–1258. doi: 10.1126/science.1546326.
5. Hauth AM, Joseph DA. Beyond tandem repeats: complex pattern structures and distant regions of similarity. Bioinformatics. 2002;18(Suppl. 1):S31–S37. doi: 10.1093/bioinformatics/18.suppl_1.s31.
6. Apostolico A, Preparata FP. Optimal off-line detection of repetitions in a string. Theor. Comput. Sci. 1983;22:297–315.
7. Crochemore M. An optimal algorithm for computing the repetitions in a word. Inf. Process. Lett. 1981;12:244–250.
8. Kolpakov R, Kucherov G, Logiciel TG. Theoretical Computer Science. Springer; 2001. Finding approximate repetitions under Hamming distance; pp. 170–181.
9. Main MG, Lorentz RJ. An O(n log n) algorithm for finding all repetitions in a string. J. Algorithms. 1984;5:422–432.
10. Stoye J, Gusfield D. Simple and flexible detection of contiguous repeats using a suffix tree. Theor. Comput. Sci. 2002;270:843–856.
11. Delgrange O, Rivals E. Star: an algorithm to search for tandem approximate repeats. Bioinformatics. 2004;20:2812–2820. doi: 10.1093/bioinformatics/bth335.
12. Landau GM, Schmidt JP, Sokol D. An algorithm for approximate tandem repeats. J. Comput. Biol. 2001;8:1–18. doi: 10.1089/106652701300099038.
13. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–580. doi: 10.1093/nar/27.2.573.
14. Hauth AM, Joseph D. Beyond tandem repeats: complex pattern structures and distant regions of similarity. ISMB. 2002:31–37. doi: 10.1093/bioinformatics/18.suppl_1.s31.
15. Domaniç NO, Preparata FP. A novel approach to the detection of genomic approximate tandem repeats in the Levenshtein metric. J. Comput. Biol. 2007;14:873–891. doi: 10.1089/cmb.2007.0018.
16. Sagot MF, Myers EW. Identifying satellites and periodic repetitions in biological sequences. J. Comput. Biol. 1998;5:539–554. doi: 10.1089/cmb.1998.5.539.
17. Wexler Y, Yakhini Z, Kashi Y, Geiger D. Finding approximate tandem repeats in genomic sequences. J. Comput. Biol. 2005;12:928–942. doi: 10.1089/cmb.2005.12.928.
18. Matroud AA, Hendy MD, Tuffley CP. An algorithm to solve the motif alignment problem for approximate nested tandem repeats in biological sequences. J. Comput. Biol. 2011;18:1211–1218. doi: 10.1089/cmb.2011.0101.
19. Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A, et al. The UCSC genome browser database: update 2011. Nucleic Acids Res. 2010;39:D876–D882. doi: 10.1093/nar/gkq963.
