Skip to main content
Nucleic Acids Research logoLink to Nucleic Acids Research
. 2005 Feb 8;33(3):903–911. doi: 10.1093/nar/gki235

Design of nucleic acid sequences for DNA computing based on a thermodynamic approach

Fumiaki Tanaka 1,*, Atsushi Kameda 2, Masahito Yamamoto 1,2, Azuma Ohuchi 1,2
PMCID: PMC549402  PMID: 15701762

Abstract

We have developed an algorithm for designing multiple sequences of nucleic acids that have a uniform melting temperature between the sequence and its complement and that do not hybridize non-specifically with each other based on the minimum free energy (ΔGmin). Sequences that satisfy these constraints can be utilized in computations, various engineering applications such as microarrays, and nano-fabrications. Our algorithm is a random generate-and-test algorithm: it generates a candidate sequence randomly and tests whether the sequence satisfies the constraints. The novelty of our algorithm is that the filtering method uses a greedy search to calculate ΔGmin. This effectively excludes inappropriate sequences before ΔGmin is calculated, thereby reducing computation time drastically when compared with an algorithm without the filtering. Experimental results in silico showed the superiority of the greedy search over the traditional approach based on the hamming distance. In addition, experimental results in vitro demonstrated that the experimental free energy (ΔGexp) of 126 sequences correlated well with ΔGmin (|R| = 0.90) than with the hamming distance (|R| = 0.80). These results validate the rationality of a thermodynamic approach. We implemented our algorithm in a graphic user interface-based program written in Java.

INTRODUCTION

Nucleic acids are now being utilized in computations (13), various engineering applications such as microarrays (46), and nano-fabrications (79). In these fields of research, called ‘DNA computing’, nucleic acid design or sequence design is a crucial problem in engineering using nucleic acids. Nucleic acid design is deciding the base sequences (e.g. ‘GCTAGCTAGGTTTA’,…,‘ATCGTACGCTATGTGCA’ in DNA) in order to satisfy the constraints based on the physicochemical properties of nucleic acids. In particular, it is essential to prevent undesired hybridization. In DNA computing, multiple sequences need to be designed that do not hybridize non-specifically with each other (10,11), while in RNA secondary structure design, a single sequence needs to be designed that folds into the desired secondary structure (1215). For example, 40 DNA sequences of length 15 were designed to prevent undesired hybridization and used for solving a 20-variable instance of a three-satisfiability problem (2). In the field of microarrays, 69 122 DNA sequences of length 45–47 were designed to hybridize specifically to 24 502 transcripts from Arabidopsis thaliana (4). Furthermore, four DNA sequences of length 26 and four DNA sequences of length 48 were used to construct a periodic two-dimensional crystalline lattice (7). These sequences were carefully designed for the intended hybridization. Thus, an algorithm/program that generates multiple sequences, which do not hybridize non-specifically with each other, is useful for various applications of nucleic acid.

We applied a thermodynamic approach to the nucleic acid design for DNA computing. In particular, we focused on designing a pool P containing n sequences of length l for which (i) the duplex melting temperature (TM) is in the range TM to TM+ for any pairwise duplex of a sequence in P and its complement and (ii) the minimum free energy (ΔGmin) is greater than a threshold (ΔGmin*) in any pairwise duplex of sequences in P and any concatenation of two sequences in P plus their complements except for the pairwise duplex of a sequence in P and its complement. Traditional approaches to the sequence design in DNA computing have approximated the stability between two sequences using the hamming distance (i.e. the number of base pairs) rather than ΔGmin. However, since the hamming distance is only an approximation of the stability, ΔGmin is preferable for predicting the stability. In practice, an RNA secondary structure can be adequately predicted using ΔGmin rather than the number of base pairs (16). However, the algorithm for calculating the ΔGmin for double-stranded DNA requires time complexity O(l3), where l is the length of the sequence, as is the case with secondary structure prediction for single-stranded RNA. Since all the combinations of three sequences must be evaluated in this design, dozens of sequences cannot be designed within a reasonable computation time. Another approach that reduces the computation time is thus needed. Andronescu et al. (13) overcame this drawback in the secondary structure design by hierarchically decomposing the secondary structure into smaller substructures and integrating the partial sequences. By using this method, they designed sequences that a traditional program cannot.

We have developed a random generate-and-test algorithm that generates a candidate sequence randomly and tests whether the sequence satisfies the constraints. It stores the sequence in a sequence pool if and only if it satisfies all the constraints. To reduce the difficulty in computation time, we use a greedy search to calculate ΔGmin. The advantage of a greedy search is that it approximates ΔGmin in less time, with time complexity O(l2), than a rigorous algorithm, which calculates ΔGmin with time complexity O(l3). The ΔGmin approximated using the greedy search (denoted by ΔGgre) correlated well with ΔGmin. The correlation coefficients were 0.95, 0.85 and 0.76 at 20mer, 40mer and 60mer lengths, respectively. Furthermore, since ΔGgre is the upper bound for ΔGmin (i.e. ΔGmin≤ΔGgre), the sequence such that ΔGgreΔGmin* is sure to satisfy ΔGminΔGmin*. Therefore, using a greedy search excludes in advance most inappropriate sequences before ΔGmin is calculated without excluding an appropriate sequence. With this approach, our algorithm reduces computation time.

To evaluate our algorithm, we investigated the effectiveness of the greedy search filtering. We compared the computation time of our algorithm with that of the same algorithm but without the greedy search filtering. The experimental results showed that the greedy search filtering reduces the total computation time drastically. For example, using the filtering reduced the computation time to 83% for 30 sequences with length 20 and to 87% for 20 sequences with length 15 when TM=69.58, TM+=72.58 and ΔGmin*=10.0, and TM=60.49, TM+=63.49 and ΔGmin*=7.0, respectively. In addition, we compared the greedy search with the traditional approach based on a hamming distance in terms of the filtering performance. We demonstrated that the greedy search can filter out inappropriate sequences better than using the hamming distance. In a laboratory experiment, we investigated the correlation coefficient between ΔGmin and the experimental free energy (ΔGexp) and that between the hamming distance and the ΔGexp using 126 duplexes. The ΔGexp correlated better with ΔGmin (|R| = 0.90) than with the hamming distance (|R| = 0.80).

To implement our algorithm, we developed a computer program called DNA-SDT, a graphic user interface (GUI)-based application written in Java. This program enables users to design DNA sequences with our algorithm and can be downloaded freely from the web site (http://ses3.complex.eng.hokudai.ac.jp/~fumi95/DNA-SDT/index.html).

ALGORITHM

Definition

Let n be the number of sequences to be designed and li (0 ≤ in − 1) be the length of each sequence. In this paper, we formulated li such that li = lj = l (0 ≤ i, jn − 1), although we can extend our algorithm easily for lilj (0 ≤ ijn − 1). We define P as the pool of n sequences, with length l, to be designed. Furthermore, let P = {U0, U1,…, Un−1} and Q = {V0, V1,…, Vn−1} such that Vi (0 ≤ in − 1) is the complement of Ui. TM and TM+ are defined as the lower and upper thresholds of TM for the duplex between a sequence and its complement. Moreover, ΔGmin* is defined as the threshold of ΔGmin given by the sequence designer. Thus, our algorithm designs a pool P containing n sequences of length l for which (i) the duplex TM is in the range from TM to TM+ for any duplex of Ui (0 ≤ in − 1) and Vi, and (ii) ΔGmin is greater than ΔGmin* for any pairwise duplex of sequences in P and any concatenation of two sequences in PQ except for the pairwise duplex of Ui and Vi.

Here, we describe more specifically the combination of sequences to be calculated using the ΔGmin. The sequence Ui (0 ≤ in − 1) is denoted by a string of bases such as ui0ui1uil1 (5′ to 3′ direction), and similarly, Vi (0 ≤ in − 1) is denoted as vi0vi1vil1 (5′ to 3′ direction). For example, if Ui = 5′-AAATTTCCCGGG-3′, then Vi = 5′-CCCGGGAAATTT-3′. Furthermore, let 〈X, Y〉 be the combination of sequences X and Y, and XY be the concatenation of sequences X and Y in that order. For example, if Ui, Uj and Uk (0 ≤ i, j, kn − 1) are 5′-AAATTT-3′, 5′-CCCGGG-3′ and 5′-TCTCTC-3′, respectively, then 〈UiUj,Uk〉 means the combination of sequences 5′-AAATTTCCCGGG-3′ and 5′-TCTCTC-3′. In our algorithm, the following combinations are considered for the ΔGmin calculation.

  1. Ui Uj Uk 〉 (0 ≤ i, j, kn − 1)

  2. Ui Uj Vk 〉 (0 ≤ i, j, kn − 1), ik

  3. Ui Vj Uk 〉 (0 ≤ i, j, kn − 1), ij

  4. Ui Vj Vk 〉 (0 ≤ i, j, kn − 1), (ij)∧(ik).

For generality, we use two sequences, S(=s0s1sN−1) (5′ to 3′ direction) and T(=t0 t1tM−1) (3′ to 5′ direction), to describe the ΔGmin calculation in detail. The sequences are defined to be antiparallel to each other. Note that any two sequences can be represented by S and T. In this paper, S and T represent Ui and the reverse sequence of XjYk (Xj ∈ {Uj, Vj}, Xk ∈ {Uk, Vk}).

The notation si · tj represents the base pair between the i-th base in sequence S and the j-th base in sequence T, hence the structure between S and T is a set of base pairs such that each base is paired at most once. In addition, (si · tj, si′ · tj′) {(0i<iN1)(0j<jM1)} is defined as a structure in which base pairs si · tj and si′ · tj′ are formed and the sequences si+1si+2si1 and tj+1tj+2tj1 do not form any base pairs. The term (x,si·tj) {x{si,tj},(0i<iN1)(0j<jM1)} is defined as a structure in which base pair si′ · tj′ is formed and sequence sisi+1si1 in the case x=si (tjtj+1tj1 in the case x=tj) do not form any base pair. Similarly, (si·tj,x) {x{si,tj},(0i<iN1)(0j<jM1)} is defined as a structure in which base pair si · tj is formed and the sequence si+1si+2si in the case x=si (tj+1tj+2tj in the case x=tj) do not form any base pair.

Outline

Our algorithm is a random generate-and-test algorithm that generates a sequence randomly and then stores it in the pool if and only if the sequence satisfies all the constraints. The main feature of our algorithm is that it uses a greedy search for calculating ΔGmin to filter out inappropriate sequences. The advantage of a greedy search is that it approximates ΔGmin in less time when compared with a well-known dynamic programming algorithm (17). Therefore, using the greedy search before the ΔGmin calculation to exclude inappropriate sequences reduces the computation time. Hereafter, the approximated ΔGmin using a greedy search is denoted by ΔGgre.

The algorithm uses three filters:

  1. TM filter: checks whether the TM of candidate sequence Uc (0cn1) and Vc is in the range from TM to TM+. If not, Uc is rejected.

  2. ΔGgre filter: checks whether ΔGgre is greater than the threshold, ΔGgre*, for all the combinations above in P, provided that the candidate sequence Uc (0cn1) or Vc is included in that combination. If the ΔGgre of any combination is less than or equal to ΔGgre*, Uc is rejected.

  3. ΔGmin filter: checks whether ΔGmin is greater than the threshold, ΔGmin*, for all the combinations above in P, provided that the candidate sequence Uc (0cn1) or Vc is included in that combination. If the ΔGmin of any combination is less than or equal to ΔGmin*, Uc is rejected.

The TM and ΔGmin filters are necessary for satisfying the constraints of sequence design, while the ΔGgre filter reduces the computation time. The use of the ΔGgre filter is based on the hypothesis that it can exclude most sequences that cannot pass through the ΔGmin filter. If this hypothesis is true, the ΔGgre filter can exclude the inappropriate sequences in less time, resulting in reduced total computation time.

The algorithm is defined as follows:

  • Input: n, l, TM, TM+, ΔGgre*, ΔGmin*

  • Output: pool P consisting of n sequences with length l.

  • Procedure:
    1. Initialize pool P as an empty set.
    2. Iterate the following procedure until P has n sequences.
      1. Generate candidate sequence Uc (0cn1) with length l randomly, then add Uc to P.
      2. Evaluate Uc with TM filter. If Uc is rejected, exclude Uc from P and return to ii(a).
      3. Evaluate Uc with ΔGgre filter. If Uc is rejected, exclude Uc from P and return to ii(a).
      4. Evaluate Uc with ΔGmin filter. If Uc passes, leave Uc in P; else exclude Uc from P and return to ii(a).

The order of the filters is important. Each candidate sequence should be evaluated in this order to reduce the computation time. If pool P has m sequences, the time complexities to evaluate the (m + 1)-th candidate sequence are O(l), O(m2l2) and O(m2l3) at the TM, ΔGgre and ΔGmin filters, respectively. By evaluating the candidate sequences in ascending order of time complexity, the inappropriate ones can be excluded sooner.

Energy model

The ΔGmin between S and T is calculated using a dynamic programming algorithm (17), while the ΔGgre between S and T is calculated using a greedy search. Both ΔGmin and ΔGgre are calculated based on the nearest-neighbor model, which calculates the total free energy as the summation of the contributions of various elementary structures (18,19). The elementary structures considered in this paper are stacking base pairs, bulge loops, internal loops, dangling ends and free ends.

The contributions of the stacking base pairs, defined as (si·tj,si+1·tj+1), to the free energy are calculated using 12 parameters reported previously (19). The free energy contributions of the loop regions are sequence dependent (15,16,20). The free energies of single bulge loops, defined as (si·tj,si+2·tj+1) or (si·tj,si+1·tj+2), are calculated using 64 parameters covering all the possible combinations of bulged base and flanking base pairs (19). The free energies of the other loops, bulge loops longer than one and internal loops, are calculated using conventional parameters and equations (20,21). Bulge loops longer than one are defined as {(si·tj,si+l·tj+1)(l3)} or (si·tj,si+1·tj+l)(l3), and the internal loops are defined as (si·tj,si+l·tj+m)(l,m2). The free energies of dangling ends, defined as (s0,s1·t0), (t0,s0·t1), (sN2·tM1,sN1) or (sN1·tM2,tM1), are calculated using 32 parameters covering all the possible combinations (22). The free ends are defined as the sequences s0si and t0tj closing with si·tj such that both si(<i) and tj(<j) do not form a base pair or sisN1 and tjtM1 closing with si·tj such that both s(i<)i and t(j<)j do not form a base pair. The free energies of the free ends are also calculated using conventional parameters (20).

The crossing base pairs are defined as a pair of base pairs si·tj and si·tj in a structure with {(i < i′) ∧ (j′ < j)} or {(i′ < i) ∧ (j < j′)}. We prohibit crossing base pairs because of the computation time and the lack of thermodynamic data. This constraint is equivalent to the pseudoknot-free constraint in the RNA secondary structure prediction. At this point, we do not consider intra-molecular base pairs or the interactions between loop regions.

TM filter

This filter checks whether a candidate sequence paired to its complement has a TM in the range TM to TM+.

TM=ΔH°Rln(CT/α)+ΔS°, 1

where R is the gas constant, CT is the concentration, ΔH° is the enthalpy and ΔS° is the entropy. Parameter α is set to 1 for self-complementary and to 4 for non-self-complementary. Parameters ΔH° and ΔS° are calculated based on the nearest-neighbor model (18,19).

ΔGgre filter

This filter checks whether ΔGgre is greater than ΔGgre* for all the combinations as mentioned in Algorithm.

Using a ‘greedy search’ reduces the computation time for calculating ΔGmin. The greedy search works well because a structure with ΔGmin tends to include stable helices (i.e. continuous complementary regions). Therefore, the greedy search approximates ΔGmin by iteratively searching for the most stable helix and fixing the helix.

The greedy search algorithm is as follows:

  1. First, calculate the free energies of all helices over the structure between sequence s0s1sN1 and sequence t0t1tM1. Calculate the free energies of the helices with free ends using the following equations. Here, a helix is denoted by sisi+1sk (5′ → 3′) and tjtj+1tl(=j+ki) (3′ → 5′).
    • ΔGfreeEnd=ΔGcore+D(s0,si,t0,tj)+D(tM1,tl,sN1,sk)
    • ΔGcore=a=ik1eS(sa,sa+1,tj+ai,tj+ai+1)
    ΔGfreeEnd is the free energy of a helix with free ends; ΔGcore is that without free ends. D(s0,si,t0,tj) represents the free energy contribution of the dangling or free end between sequence 5-s0s1si-3 and sequence 3-t0t1tj-5 closing with base pair si·tj. For (i=0)(j=0), D(s0,si,t0,tj) is zero because there is no dangling or free end. eS(sa,sa+1,tj+ai,tj+ai+1) is the free energy of the stacked base pair between sequence 5-sasa+1-3 and sequence 3-tj+aitj+ai+1-5.
  2. Search for a minimal value of ΔGfreeEnd over all helices. Then, fix the base pairs in the helix where ΔGfreeEnd is minimum. This region is freshly denoted as sisi+1sk (5′ → 3′) and tjtj+1tl(=j+ki) (3′ → 5′). Furthermore, the free energies of the regions with and without free ends are freshly denoted as ΔGfreeEnd and ΔGcore, respectively.

  3. Iterate the following procedure d times, where d is a parameter defined below.
    1. If (0<i1)(0<j1) holds, calculate the free energies of all helices with the loop closing with base pair si·tj over the structure between sequence s0s1si1 and the sequence t0t1tj1. Thus, the free energy of helix, denoted as sisi+1sk (5′ → 3′) and tjtj+1tl(=j+ki) (3′ → 5′), is calculated as follows:
      • ΔGfreeEndL=ΔGcoreL+D(s0,si,t0,tj)
      • ΔGcoreL=a=ik1eS(sa,sa+1,tj+ai,tj+ai+1)+eL(sk,si,tl,tj),
      where ΔGfreeEndL is the free energy of the helix with a free end, ΔGcoreL is that of one without a free end and eL(sk,si,tl,tj) is the free energy contribution of a bulge or internal loop between sequence sksk+1si and sequence tltl+1tj closing with base pairs sk·tl and si·tj.
    2. Search for a region where ΔGfreeEndL is minimum over all helices. This region is freshly denoted as sisi+1sk (5′ → 3′) and tjtj+1tl(=j+ki) (3′ → 5′). Furthermore, the free energies of the regions with and without a free end are freshly denoted as ΔGfreeEndL and ΔGcoreL, respectively. If ΔGfreeEndLD(s0,si,t0,tj) holds, fix the base pairs, si·tj,si+1·tj+1,,sk·tl, and update i, j and ΔGcore to i′, j′ and ΔGcore+ΔGcoreL, respectively. This means that the base pairs in the region are energetically favorable.
    3. If (k+1<N1)(l+1<M1) holds, calculate the free energies of all helices with a loop closing with base pair sk·tl over the structure between sequence sk+1 sk+2sN1 and sequence tl+1tl+2tM1. Thus, the free energy of helix, denoted as sisi+1sk (5′ → 3′) and tjtj+1tl(=j+ki) (3′ → 5′), is calculated as follows:
      • ΔGfreeEndR=ΔGcoreR+D(tM1,tl,sN1,sk)
      • ΔGcoreR=a=ik1eS(sa,sa+1,tj+ai,tj+ai+1)+eL(sk,si,tl,tj),
      where ΔGfreeEndR is the free energy of the helix with a free end and ΔGcoreR is that of one without a free end.
    4. Search for a region where ΔGfreeEndR is minimum over all helices. This region is freshly denoted as sisi+1sk (5′ → 3′) and tjtj+1tl(=j+ki) (3′ → 5′). Furthermore, the free energies of the regions with and without a free end are freshly denoted as ΔGfreeEndR and ΔGcoreR, respectively. If ΔGfreeEndRD(tM1,tl,sN1,sk) holds, fix the base pairs, si·tj,si+1·tj+1,,sk·tl, and update k, l and ΔGcore to k″, l″ and ΔGcore+ΔGcoreR, respectively. This means that the base pairs in the region are energetically favorable.
  4. Calculate ΔGgre using
    ΔGgre=ΔGcore+D(s0,si,t0,tj)+D(tM1,tl,sN1,sk)+init,
    where init is the energy penalty for forming double-stranded DNA. If ΔGgre>0, however, set ΔGgre = 0. This is because the free energies are calculated relative to two non-interacting sequences, for which the free energy is defined as zero. Thus, two sequences must remain separate with zero free energy rather than form a structure with positive free energy.

In the above procedure, the number of iterations for a search, d, is defined as ‘degree’. A structure with degree = 0 has at most one helix. Note that base pairing does not occur during a greedy search when more than two continuous complementary bases do not exist between two sequences. For degree = 1, there are at most three helices. Eventually, for degree = d, there are at most (1+2·d) helices. If the length of sequences S and T are l and 2·l, respectively, the number of iterations is at most (l1)/3. Thus, the time complexity of greedy search O(degree·l2) is O(l3) at worst. However, because the degree increases slowly with the sequence length, the time complexity of a greedy search can be in practice regarded as O(l2). To confirm this, we calculated the degree for 10 000 random pairs of sequences from 10mer to 100mer in steps of 10mer. As shown in Table 1, degree was at most 9 at 80mer and 90mer, while it was 33 [=(100 − 1)/3]) at 100mer in the worst case. Furthermore, degree was rarely >6 (<1%), and, for >98% of the 10 000 random pairs, degree was in the range 0–4 at any length. Therefore, degree was nearly constant regardless of the sequence length on average, indicating that the time complexity of a greedy search is O(l2) in practice.

Table 1.

Distribution of degree for greedy search for 10 000 random pairs of sequences from 10mer to 100mer in steps of 10mer

Degree Sequence length (mer)
10 20 30 40 50 60 70 80 90 100
0 9426 7941 6679 5813 5032 4544 3973 3575 3196 2931
1 560 1838 2783 3238 3638 3723 3953 3856 3861 3863
2 14 204 471 767 1040 1301 1467 1752 1959 2067
3 0 17 57 161 244 336 458 596 698 770
4 0 0 10 16 38 78 106 165 206 242
5 0 0 0 4 7 13 28 43 48 89
6 0 0 0 1 1 5 15 12 24 30
7 0 0 0 0 0 0 0 0 4 7
8 0 0 0 0 0 0 0 0 3 1
9 0 0 0 0 0 0 0 1 1 0

ΔGmin filter

This filter checks whether the ΔGmin is greater than ΔGmin* for all the combinations as mentioned in Algorithm.

ΔGmin between S and T can be decomposed into two terms:

ΔGmin=min0iN10jM1{D(s0,si,t0,tj)+V(si,tj)}, 2

where V(si,tj) represents the minimum value of the free energy between sequence sisi+1sN1 (5′ → 3′ direction) and sequence tjtj+1tM1 (3′ → 5′ direction) closing with si·tj. Recall that D(s0,si,t0,tj) represents the free energy of the dangling or free ends between sequence s0s1si (5′ → 3′ direction) and sequence t0t1tj (3′ → 5′ direction) closing with base pair si·tj.

Furthermore, V(si,tj) is calculated using

V(si,tj)=min{D(tM1,tj,sN1,si),eS(si,si+1,tj,tj+1)+V(si+1,tj+1),VBI(si,tj)}, 3

where VBI(si,tj) represents the minimum value of the free energy forming a bulge or internal loops closing with base pair si·tj. Recall that eS(si,si+1,tj,tj+1) is the free energy of the stacked base pair between sequence sisi+1 (5′ → 3′ direction) and sequence tjtj+1 (3′ → 5′ direction). The first term represents the case in which the unpaired end consists of sequences tM1tM2tj (5′ → 3′ direction) and sN1sN2si (3′ → 5′ direction) closing with base pair tj·si. The second term corresponds to the case in which the stacked base pair is energetically favorable. In this case, V(si+1,tj+1) is calculated recursively. The third term is calculated using

VBI(si,tj)=mini<i,j<jii+jj>2{eL(si,si,tj,tj)+V(si,tj)}. 4

Recall that eL(si,si,tj,tj) represents the free energy contribution of the loops between sequence sisi+1si and sequence tjtj+1tj closing with base pairs si·tj and si·tj. The ΔGmin is calculated recursively using Equations 24 by dynamic programming. If we compute the VBI term in a straightforward manner, its time complexity is O(l4). However, this can be reduced to O(l3) using the algorithm of Lyngsø et al. (23).

RESULTS

Effectiveness of ΔGgre filter

To evaluate the effectiveness of the ΔGgre filter, we compared the computation time of our algorithm with that of the algorithm without the ΔGgre filter. That algorithm checks a randomly generated sequence by using the TM and ΔGmin filters and then stores the sequence in the pool if and only if the sequence passes both filters. All the computational experiments described in this section were performed using Windows 2000 on a computer with an Athlon 1.4 GHz CPU and 256 MB of memory. The results are shown in Figure 1. The computation time grew exponentially because the number of three-sequence combinations increased exponentially with the number of sequences.

Figure 1.

Figure 1

Number of sequences designed versus computation time for two design strategies up to 10 h. In (a) and (b), l = 20, TM=69.58 and TM+=72.58. In (a), ΔGgre*=ΔGmin*=10.0; and in (b), ΔGgre*=ΔGmin*=13.0. In (c) and (d), l = 15, TM=60.49 and TM+=63.49. In (c), ΔGgre*=ΔGmin*=7.0; and in (d), ΔGgre*=ΔGmin*=11.0.

Figure 1 shows that using our algorithm reduced the computation time drastically. For example, our algorithm needed ∼1.3 h to design 30 sequences with length 20 for ΔGmin*=10.0, while the algorithm without the ΔGgre filter needed ∼7.7 h (Figure 1a and b). This means that the ΔGgre filter effectively excludes the sequences that cannot pass through the ΔGmin filter.

A comparison of Figure 1a and b clearly shows the primacy of our algorithm for ΔGmin*=10.0 versus ΔGmin*=13.0. For instance, our algorithm reduced computation time to 83% (7.7 → 1.3 h) for 30 sequences and ΔGmin*=10.0 while it reduced computation time to 57% (9.1 → 3.9 h) for 69 sequences and ΔGmin*=13.0. This is because both algorithms can find sequences that satisfy the constraints more easily for ΔGmin*=13.0 than for ΔGmin*=10.0. A similar trend is seen in the sequences with length 15 (Figure 1c and d). Therefore, using the ΔGgre filter enables sequences to be designed in less time, particularly when the threshold is high.

Filtering performance of ΔGgre filter versus hamming distance

The function of the ΔGgre filter is to filter out the inappropriate sequences from many sequences generated randomly; the promising sequences are then checked using the ΔGmin filter. This means that the filtering performance of the ΔGgre filter can be evaluated using four terms: the number of sequences with both ΔGmin and ΔGgre higher than the threshold (true positive: TP), that with ΔGmin higher but not with ΔGgre (false negative: FN), that with ΔGgre higher but not with ΔGmin (false positive: FP) and that with neither ΔGmin nor with ΔGgre higher (true negative: TN) (Figure 2). There is a trade-off between FN and FP, which are the number of sequences incorrectly excluded by the ΔGgre filter and that incorrectly passing through the ΔGgre filter, respectively. To evaluate the ΔGgre filter, we compared the filtering performance of the ΔGgre filter with that using the hamming distance (called hamming filter) with respect to FN such that FP = 0 and to FP such that FN = 0 for 10 000 random pairs of sequences with length 20. Setting FP to 0 means that the ΔGgre or hamming filter certainly excludes inappropriate sequences with a ΔGmin less than or equal to ΔGmin*. Therefore, the lower the number in FN such that FP = 0, the better the filter. Similarly, because FN = 0 means that the ΔGgre or hamming filter never excludes appropriate sequences with a ΔGmin greater than ΔGmin*, the filter should have fewer sequences in FP such that FN = 0.

Figure 2.

Figure 2

Plot of ΔGgre versus ΔGmin for 10 000 random pairs of sequences with length 20; ΔGgre* and ΔGmin* represent threshold of ΔGgre and that of ΔGmin, respectively. Four terms were used for evaluating filtering performance: TP, {(ΔGgre*<ΔGgre)(ΔGmin*<ΔGmin)}; FN, {(ΔGgreΔGgre*)(ΔGmin*<ΔGmin)}; FP, {(ΔGgre*<ΔGgre)(ΔGminΔGmin*)}; and TN, {(ΔGgreΔGgre*)(ΔGminΔGmin*)}.

As shown in Figure 3, for FN = 0, the number of sequences classified into FP by the ΔGgre filter was much smaller than that by the hamming filter. Because most sequences have a ΔGmin of more than −10.0 kcal/mol (Figure 2), the FP converged to zero for both the hamming filter and ΔGgre filter for less than −10.0 kcal/mol. Similarly, for FP = 0, the number of sequences classified into FN by the ΔGgre filter was smaller than that by the hamming filter. The reason there is little difference for greater than −5.0 kcal/mol is that a small number of sequences with the ΔGgre actually had a ΔGmin such that ΔGminΔGgre. For example, sequences GGTCACCCTGGGCTACCGGA (5′ → 3′ direction) and TTAAGGTCGCGTGCTATCTT (3′ → 5′ direction) had a ΔGmin of −7.92 kcal/mol while the ΔGgre was −1.96 kcal/mol. Because such sequences are rare, however, the primacy of the ΔGgre filter over the hamming filter is clear for less than −5.0 kcal/mol.

Figure 3.

Figure 3

Threshold of ΔGmin versus number of sequences classified as FP and FN. (a) FP such that FN is zero and (b) FN such that FP is zero.

Another advantage of the ΔGgre filter is that it guarantees having the threshold where the number of FN is zero because the ΔGgre is the upper bound for the ΔGmin (i.e. ΔGminΔGgre). For example, a pair of sequences having a ΔGgre less than or equal to −5.0 kcal/mol is guaranteed to have a ΔGmin of at most −5.0 kcal/mol. Thus, the sequences with a ΔGgre of more than −5.0 kcal/mol include all sequences with a ΔGmin of more than −5.0 kcal/mol. Therefore, setting the threshold for a ΔGgre filter (ΔGgre*) to that for the ΔGmin filter (ΔGmin*) (i.e. ΔGgre*=ΔGmin*) guarantees that the number of FN is zero. This is not the case for hamming filter.

Comparison between hamming distance and ΔGmin in vitro

We investigated the validity of the ΔGmin based approach compared with the traditional approach based on the hamming distance by using an in vitro experiment. First, to address the validity of the ΔGmin calculation, we calculated the average deviations from the experimental free energies (ΔGexp). The average deviations were derived from 126 sequences (31 complementary sequences, 83 sequences with a single bulge loop and 12 sequences with free ends). The average deviations, calculated using 1126i=1126|100(ΔGminiΔGexpi)Δ/Gexpi| (ΔGmini and ΔGexpi represent the i-th ΔGmin and ΔGexp, respectively), were 3.2% (3.0, 2.8 and 6.5% for the complementary sequences, single bulges and free ends, respectively). These average deviations are within the limits of what can be expected for a nearest-neighbor model (18,19). Thus, we confirmed that our algorithm can predict the ΔGmin adequately. The 126 sequences with their ΔGmin and ΔGexp are provided in the Supplementary Material. The number of complementary bases for each pair of sequences is also provided.

Then, to compare the hamming distance with ΔGmin, we investigated the correlation coefficient with ΔGexp. To fairly compare different length sequences, we used the number of complementary bases but not the hamming distance. For example, although sequences 5′-GGG-3′ and 3′-CCC-5′ with three complementary bases and sequences 5′-GGGGG-3′ and 3′-CCCCC-5′ with five complementary bases both have zero hamming distance, the stability of the latter must be higher than that of the former. Thus, the number of complementary bases is appropriate for comparing the stability of sequences with different lengths. Note that the number of complementary bases is equivalent to the hamming distance for sequences with the same length. The results are shown in Figure 4a and b. In Figure 4a, BP represents the number of complementary bases. The correlation coefficient, |R|, between −BP and ΔGexp was 0.80, while that between ΔGmin and ΔGexp was 0.90, indicating that the ΔGmin is better than the number of complementary bases (i.e. hamming distance) as a predictor of stability.

Figure 4.

Figure 4

(a) −BP versus experimental free energy. BP represents the number of complementary bases. (b) Predicted minimum versus experimental free energy. In (a) and (b), values and lines are correlation coefficients and regression lines, respectively.

Bozdech et al. (5) compared the number of complementary bases with the binding energy (the calculation method and nearest-neighbor parameters differed from ours) by using 70mer oligonucleotides on microarray. They found that |R| between the binding energy and relative intensity of hybridization (= intensity of fluorescence) was 0.91, while it was 0.72 between the number of complementary bases and the relative intensity of hybridization. This is consistent with our findings, especially with respect to the correlation between the predicted energy (binding energy in theirs, ΔGmin in ours) and the stability derived from the experimental results (their |R| was 0.91, while ours was 0.90). With respect to the correlation between the number of complementary bases and the experimental stability, their |R| was 0.72, while ours was 0.80. This is because our data were derived from only sequences with a simple structure, i.e. with at most one loop (single bulge loop). In general, the more loops, the higher the discrepancy between the number of complementary bases and the experimental stability. Therefore, the correlation between the number of complementary bases and the experimental stability will be close to theirs with respect to more complex sequences, i.e. with more than two loops. These results demonstrate the ability of our program to approximate the experimental stability of double-stranded DNA based on the ΔGmin calculation.

DISCUSSION

In the previous section, we showed the effectiveness of the ΔGgre filter for filtering out the sequences that cannot pass through the ΔGmin filter. To further address the question how the ΔGgre is close to the ΔGmin, we investigated the average prediction error of ΔGgre, calculated using i=110000(GgreiGmini)/10000. Figure 5 shows that the average prediction error can be restricted to <1.0 kcal/mol for sequences shorter than 50mer, which are frequently used in DNA computing.

Figure 5.

Figure 5

Average prediction error of ΔGgre, calculated using i=110000(ΔGgreiΔGmini)/10000.

Figure 6 shows the number of sequences such that ΔGmin is equal to ΔGgre. The number decreased as the sequences became longer. For example, ΔGmin=ΔGgre in 9250 pairs from 10 000 pairs of sequences at 10mer, while ΔGmin=ΔGgre in 157 pairs from 10 000 pairs at 100mer. Therefore, ΔGgre is a good predictor of ΔGmin for short sequences, while it is only an approximation for long sequences.

Figure 6.

Figure 6

Number of sequences such that ΔGmin=ΔGgre.

For comparison with the hamming distance, the correlation coefficient (|R|) was also calculated. The correlation between ΔGmin and ΔGgre decreased almost linearly from 0.99 at 10mer to 0.63 at 100mer. Because long sequences tend to have more helices, the discrepancy increased as the sequences became longer. The correlation coefficient between ΔGmin and the hamming distance, which decreased almost linearly from 0.46 at 10mer to 0.12 at 100mer, was much less than that between ΔGmin and ΔGgre.

IMPLEMENTATION

We implemented our algorithm in a program called ‘DNA Sequence Design Tool (DNA-SDT)’. DNA-SDT can be downloaded freely from the web site (http://ses3.complex.eng.hokudai.ac.jp/~fumi95/DNA-SDT/index.html). It has two functions: sequence design and structure prediction.

In sequence design, the program solves the problem of designing a pool P containing n sequences of length l for which (i) the duplex TM is in the range TM to TM+ for any duplex of Ui (0in1) and Vi and (ii) ΔGmin is greater than ΔGmin* in any pairwise duplex of sequences in P and any concatenation of two sequences in PQ except for the pairwise duplex of Ui and Vi. This problem is solved using our algorithm with the threshold for a ΔGgre filter, ΔGgre*. The parameters, n, l, TM, TM+, ΔGgre* and ΔGmin*, are user-defined. DNA-SDT has a table of average TM values at length l (8l50) [TMave(l)] calculated from 10 000 sequences generated randomly. Parameters TM and TM+ are set to [TMave(l)1.5] and [TMave(l)+1.5], respectively, by default. The TM is calculated at 1 μM.

In structure prediction, the program calculates the ΔGmin between two sequences, then displays the structure with the ΔGmin using a traceback algorithm. Users can thus determine the structure of any combination of sequences from the pool designed by the program. Of course, for any given combination of sequences, users can also determine the structure with the ΔGmin.

The program is a GUI-based application written in Java, hence it can be executed on any computer that has the Java Runtime Environment (JRE) installed. Although older versions of JRE supposedly run without problem, the latest version is preferable. We tested the program with JRE 1.4.2 on a Windows 2000 workstation, JRE 1.4.1 on a Windows XP workstation and JRE 1.4.2 on a Turbolinux Workstation 7.0.

The program interface consists of two screens: one for structure prediction and one for sequence design. The left-hand side of the window is the screen for structure prediction, which outputs the structure with the ΔGmin from the two sequences input in the text box. The sequences in the text box can be converted into reverse or complementary sequences. The right-hand side is the screen for sequence design, which outputs the sequences, GC content and TM based on the constraints input in the text boxes.

To avoid impossible operations during design, the procedure for sequence design is executed as a separate thread. Therefore, the structure-prediction screen can be operated freely when sequences are being designed. Furthermore, the design procedure can be monitored and interrupted anytime, and the interim results can be output. However, because the button for running the sequence design procedure is locked during design, users cannot design another pool of sequences in parallel.

The number of sequences can be set up to 100. If more than 100 sequences need to be designed, the user can choose ‘Unlimited’. In this case, the design procedure iterates until the user interrupts it. The sequence length is restricted at 8mer to 50mer because of the computation time. Although the thresholds for ΔGmin and ΔGgre can be set freely, using irrelevant values resulting in zero sequences. We recommend setting these thresholds such that ΔGmin*=ΔGgre*(<0) to guarantee zero false negatives (see Results).

SUPPLEMENTARY MATERIAL

Supplementary Material is available at NAR Online.

Supplementary Material

[Supplementary Material]

Acknowledgments

Funding to pay the Open Access publication charges for this article was provided by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research on Priority Areas, 2002–2005, 14085201.

REFERENCES

  • 1.Adleman L.M. Molecular computation of solutions to combinatorial problems. Science. 1994;266:1021–1024. doi: 10.1126/science.7973651. [DOI] [PubMed] [Google Scholar]
  • 2.Braich R.S., Chelyapov N., Johnson C., Rothemund P.W., Adleman L. Solution of a 20-variable 3-SAT problem on a DNA computer. Science. 2002;296:499–502. doi: 10.1126/science.1069528. [DOI] [PubMed] [Google Scholar]
  • 3.Faulhammer D., Cukras A.R., Lipton R.J., Landweber L.F. Molecular computation: RNA solutions to chess problems. Proc. Natl Acad. Sci. USA. 2000;97:1385–1389. doi: 10.1073/pnas.97.4.1385. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Rouillard J.M., Zuker M., Gulari E. OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. Nucleic Acids Res. 2003;31:3057–3062. doi: 10.1093/nar/gkg426. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Bozdech Z., Zhu J., Joachimiak M.P., Cohen F.E., Pulliam B., DeRisi J.L. Expression profiling of the schizont and trophozoite stages of Plasmodium falciparum with a long-oligonucleotide microarray. Genome Biol. 2003;4:R9. doi: 10.1186/gb-2003-4-2-r9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Kane M.D., Jatkoe T.A., Stumpf C.R., Lu J., Thomas J.D., Madore S.J. Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res. 2000;28:4552–4557. doi: 10.1093/nar/28.22.4552. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Winfree E., Liu F., Wenzler L.A., Seeman N.C. Design and self-assembly of two-dimensional DNA crystals. Nature. 1998;394:539–544. doi: 10.1038/28998. [DOI] [PubMed] [Google Scholar]
  • 8.Shih W.M., Quispe J.D., Joyce G.F. A 1.7-kilobase single-stranded DNA that folds into a nanoscale octahedron. Nature. 2004;427:618–621. doi: 10.1038/nature02307. [DOI] [PubMed] [Google Scholar]
  • 9.Yan H., Zhang X., Shen Z., Seeman N.C. A robust DNA mechanical device controlled by hybridization topology. Nature. 2002;415:62–65. doi: 10.1038/415062a. [DOI] [PubMed] [Google Scholar]
  • 10.Arita M., Kobayashi S. DNA sequence design using templates. New Gen. Comput. 2002;20:263–277. [Google Scholar]
  • 11.Arita M., Nishikawa A., Hagiya M., Komiya K., Gouzu H., Sakamoto K. Improving sequence design for DNA computing. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000); 2000. pp. 875–882. [Google Scholar]
  • 12.Dirks R.M., Lin M., Winfree E., Pierce N.A. Paradigms for computational nucleic acid design. Nucleic Acids Res. 2004;32:1392–1403. doi: 10.1093/nar/gkh291. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Andronescu M., Fejes A.P., Hutter F., Hoos H.H., Condon A. A new algorithm for RNA secondary structure design. J. Mol. Biol. 2004;336:607–624. doi: 10.1016/j.jmb.2003.12.041. [DOI] [PubMed] [Google Scholar]
  • 14.Andronescu M., Aguirre-Hernandez R., Condon A., Hoos H.H. RNAsoft: a suite of RNA secondary structure prediction and design software tools. Nucleic Acids Res. 2003;31:3416–3422. doi: 10.1093/nar/gkg612. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Hofacker I.L. Vienna RNA secondary structure server. Nucleic Acids Res. 2003;31:3429–3431. doi: 10.1093/nar/gkg599. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Mathews D.H., Sabina J., Zuker M., Turner D.H. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 1999;288:911–940. doi: 10.1006/jmbi.1999.2700. [DOI] [PubMed] [Google Scholar]
  • 17.Zuker M., Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 1981;9:133–148. doi: 10.1093/nar/9.1.133. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.SantaLucia J., Jr A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl Acad. Sci. USA. 1998;95:1460–1465. doi: 10.1073/pnas.95.4.1460. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Tanaka F., Kameda A., Yamamoto M., Ohuchi A. Thermodynamic parameters based on a nearest-neighbor model for DNA sequences with a single-bulge loop. Biochemistry. 2004;43:7143–7150. doi: 10.1021/bi036188r. [DOI] [PubMed] [Google Scholar]
  • 20.Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003;31:3406–3415. doi: 10.1093/nar/gkg595. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Peritz A.E., Kierzek R., Sugimoto N., Turner D.H. Thermodynamic study of internal loops in oligoribonucleotides: symmetric loops are more stable than asymmetric loops. Biochemistry. 1991;30:6428–6436. doi: 10.1021/bi00240a013. [DOI] [PubMed] [Google Scholar]
  • 22.Bommarito S., Peyret N., SantaLucia J., Jr Thermodynamic parameters for DNA sequences with dangling ends. Nucleic Acids Res. 2000;28:1929–1934. doi: 10.1093/nar/28.9.1929. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Lyngsø R.B., Zuker M., Pedersen C.N.S. Fast evaluation of internal loops in RNA secondary structure prediction. Bioinformatics. 1999;15:440–445. doi: 10.1093/bioinformatics/15.6.440. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

[Supplementary Material]
nar_33_3_903__1.pdf (53.8KB, pdf)

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

RESOURCES