A k-mismatch string matching for generalized edit distance using diagonal skipping method

HyunJin Kim

2021 May 4;16(5):e0251047. doi: 10.1371/journal.pone.0251047

Editor: Hans A Kestler

PMCID: PMC8096107 PMID: 33945564

Abstract

This paper proposes an approximate string matching method with k-mismatches for calculating the generalized edit distance. When the edit distance is generalized, more sophisticated string matching can be provided. However, the execution time increases because of the complex computations needed to calculate complicated edit distances. In the generalized edit distance metric, the computational cost of finding which steps or edit distances are over k-mismatches is insignificant by comparison. Therefore, we can reduce the execution time by determining the steps over k-mismatches and then skipping them. The diagonal step calculations using the pruning register skip unnecessary distance calculations over k-mismatches. The overhead of control statements and reordered memory accesses can be amortized by skipping multiple steps. Even though the proposed skipping method requires additional overhead, practical embodiments of the proposed scheme show that the execution time of string matching is reduced significantly when k is small.

Introduction

In computer science, information retrieval is a fundamental problem, and string matching is essential to digital information retrieval. String matching searches an input sequence of characters to determine whether a pattern matches it. In exact string matching, a pattern matches the input sequence only when the two are identical. On the other hand, approximate string matching evaluates the similarity between the input sequence and the pattern based on a distance metric. With sophisticated data analysis and various applications, approximate string matching is receiving more attention in the big-data era [1–6].

The similarity between two strings can be quantified by the minimum number of basic operations that make an input sequence equal to the target pattern. Traditionally, approximate string matching assumes that the insertion, deletion, replacement, and transposition of characters in a string make the difference [1, 7]. These are used as basic operators to calculate the distance between the input sequence and the target pattern. In the Levenshtein distance calculation, string matching can be simplified because each basic operator has a uniform cost of one. When estimating the distance between two strings, the Hamming distance [8] calculation counts ‘1’ bits after applying a bitwise exclusive-OR.

On the other hand, the semantics or relationship between subsequences makes approximate string matching more sophisticated. For example, a human may feel that pattern “catch” is more similar to input sequence “cotch” than to input sequence “ctch”, although the Levenshtein distance from pattern “catch” is one for both input sequences. Therefore, a more complicated edit distance metric can be adopted, categorized as the generalized edit distance [9, 10] or the normalized edit distance [11, 12]. However, when complex functions generalize the edit distance, significant computational resources are required. When calculating the edit distance between the input sequence and the pattern, the edit distance between each input subsequence and subpattern is needed, which is called a step. Moreover, in traditional sequential dynamic programming [13], all steps should be calculated in order, which is very time-consuming because of the data dependency between steps. Several mathematical approaches show better computational complexity [14, 15]. However, they do not consider the overhead of control statements and reordered memory accesses that arises in practical applications. Parallel string matching methods have been researched using the parallelism of GPUs (Graphics Processing Units) [7, 16–24] and FPGAs (Field Programmable Gate Arrays) [25–29], where parallel programming requires multiple computational resources. However, sequential string matching on a single processing unit is still an attractive and fundamental topic in many practical applications. Our study reduces the execution time of sequential approximate string matching when performed by a processing unit.

Naively, in a step calculation, if its data-dependent previous steps are over k-mismatches, the evaluation of operators with these previous steps can be skipped. However, in the Levenshtein distance metric, the overhead ratio for finding whether data-dependent steps are over k-mismatches is relatively high. Previous theoretical approaches [14, 15] do not consider this overhead, so their implementations cannot outperform the dynamic programming-based method [13] in the Levenshtein distance metric. With the generalized edit distance, more sophisticated string matching can be performed, but the execution time increases because of the complex computations needed for complex edit calculations. Therefore, if the step calculations that are expected to be over k can be skipped, the total execution time can be significantly reduced, which motivates our research.

This paper proposes an approximate string matching method with k-mismatches for the edit distance metric. Our research is motivated by the observation that when previous steps are over k, this information can be used to skip unnecessary step calculations. This paper focuses on the practical embodiment of our method and its evaluation. Without finding which data-dependent previous steps are over k, the diagonal step calculations using the pruning register can skip unnecessary step calculations over k-mismatches. Each bit in the pruning register indicates whether step calculations are to be skipped. Even though there is an additional overhead of control statements and reordered memory accesses, skipping multiple steps at a time can reduce the execution time significantly. For realistic experiments, generalized edit distance metrics are assumed based on the similarity in shapes and keyboard character positions. The proposed string matching and other dynamic programming methods are coded and then evaluated using the generalized edit distance metrics. Despite additional overhead in the diagonal step calculations and pruning register accesses, experiments show that the proposed skipping method can reduce the execution time of approximate string matching when k is small.

Preliminaries

Edit distance in approximate string matching

In string matching, an input sequence is compared with the pattern, and then the difference between the input sequence and pattern is reported. Unlike exact string matching, the similarity is quantified in the approximate string matching. The distance between the input sequence and pattern refers to the calculation result based on the distance metric adopted in string matching.

In [1], for strings X_i = x₁, x₂, …, x_i−1, x_i and Y_j = y₁, y₂, …, y_j−1, y_j where characters $x_{a}, y_{b} \in C$ for 1 ≤ a ≤ i and 1 ≤ b ≤ j, the distance between X_i and Y_j, denoted as D(X_i, Y_j), is the minimum number of edit operations needed to make X_i and Y_j the same. The distance D(X_i, Y_j) should satisfy:

D(X_i, Y_j) = 0 if and only if X_i = Y_j
D(X_i, Y_j) > 0 when X_i ≠ Y_j
D(X_i, Y_j) = D(Y_j, X_i).

Besides, for a given string Z_k = z₁, z₂, …,z_k−1, z_k where $z_{c} \in C$ for 1 ≤ c ≤ k, the edit distance satisfies the condition of D(X_i, Z_k) ≤ D(X_i, Y_j) + D(Y_j, Z_k) called triangle inequality.

The Levenshtein distance [30] is the most popular edit distance metric in string matching, so the term edit distance is sometimes used interchangeably with the Levenshtein distance. However, because the edit distance can also cover metrics other than the Levenshtein distance, this paper uses the term Levenshtein distance metric only for the metric whose simple operators each have a cost of one.

We define input subsequence X_α of input sequence X_i and subpattern Y_β of pattern Y_j for 1 ≤ α ≤ i and 1 ≤ β ≤ j, as follows:

Definition 1 For strings X_α = x₁, x₂, …,x_α−1, x_α, and Y_β = y₁, y₂, …,y_β−1, y_β, when subscripts α ≤ i and β ≤ j, X_α and Y_β are the input subsequence and subpattern of input sequence X_i and pattern Y_j, respectively.

For input subsequence X_α and subpattern Y_β, when the initial edit distance D(X₀, Y₀) is 0, the minimum edit distance D(X_α, Y_β) is formulated as follows:

$$
D(X_{\alpha}, Y_{\beta}) = \min
\begin{cases}
D(X_{\alpha-1}, Y_{\beta-1}) + \mathrm{substitution}(x_{\alpha}, y_{\beta}) \\
D(X_{\alpha-1}, Y_{\beta}) + \mathrm{deletion}(x_{\alpha}) \\
D(X_{\alpha}, Y_{\beta-1}) + \mathrm{insertion}(y_{\beta})
\end{cases}
\tag{1}
$$

Black arrows 1, 2, and 3 in Fig 1(a) mean the substitution, deletion, and insertion operators, which correspond to cost functions substitution(x_α, y_β), deletion(x_α), and insertion(y_β) in Eq (1), respectively. The function substitution(x_α, y_β) gives the cost of substituting x_α of X_α with y_β of Y_β. The function deletion(x_α) gives the cost of deleting x_α from X_α. On the other hand, the function insertion(y_β) gives the cost of inserting y_β at the end of Y_β−1.

For example, let’s assume that X₃ = “bat” and Y₃ = “bad” with α = β = 3. In this case, by substituting x₃(‘t’) with y₃(‘d’) in X₃ in substitution(x₃, y₃), the converted X₃ can be the same as Y₃, which adds the cost of substitution(‘t’, ‘d’) to D(“ba”, “ba”). When D(“ba”, “bad”) is given, character x₃(‘t’) is removed from “bat”, and the given D(“ba”, “bad”) is required to convert “ba” to “bad”. Therefore, the cost of deletion(‘t’) is added to calculate D(“bat”, “bad”). When D(“bat”, “ba”) is given, after attaching y₃ = ‘d’ to “ba”, D(“bat”, “bad”) can be calculated, which means that the cost of insertion(‘d’) should be added.

Therefore, the minimum edit distance for X₃ = “bat” and Y₃ = “bad” formulated as D(“bat”, “bad”) is calculated on Eq (1) as:

$$
D(\text{“bat”}, \text{“bad”}) = \min
\begin{cases}
D(\text{“ba”}, \text{“ba”}) + \mathrm{substitution}(\text{‘t’}, \text{‘d’}) \\
D(\text{“ba”}, \text{“bad”}) + \mathrm{deletion}(\text{‘t’}) \\
D(\text{“bat”}, \text{“ba”}) + \mathrm{insertion}(\text{‘d’})
\end{cases}
\tag{2}
$$

The Levenshtein distance metric simplifies the cost of each operator into 1 or 0, which makes the Levenshtein distance calculation very simple. Fig 1(b) illustrates an example of the Levenshtein distance matrix. In Fig 1(b), input subsequence “ca” can be the same as subpattern “cat” after attaching “t” to the end of input subsequence “ca”. When using the insertion operator, an input subsequence can be equal to the subpattern. Substitution and deletion operators are also applied to the input sequence to match with the pattern. In Fig 1(b), the rightmost bottom cell is numbered as 4, which is the final Levenshtein distance between input sequence “ccatese” and pattern “catch”.
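The recurrence of Eq (1) and its unit-cost special case can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation evaluated in this paper; the function names `edit_distance` and `levenshtein` are assumptions introduced here, and the three cost functions are passed in as parameters so that the same routine covers other metrics.

```python
def edit_distance(x, y, substitution, deletion, insertion):
    """Fill the (len(x)+1) x (len(y)+1) matrix of Eq (1) and return it."""
    rows, cols = len(x) + 1, len(y) + 1
    d = [[0] * cols for _ in range(rows)]
    # Leftmost column uses only deletions; top row uses only insertions.
    for a in range(1, rows):
        d[a][0] = d[a - 1][0] + deletion(x[a - 1])
    for b in range(1, cols):
        d[0][b] = d[0][b - 1] + insertion(y[b - 1])
    for a in range(1, rows):
        for b in range(1, cols):
            d[a][b] = min(
                d[a - 1][b - 1] + substitution(x[a - 1], y[b - 1]),
                d[a - 1][b] + deletion(x[a - 1]),
                d[a][b - 1] + insertion(y[b - 1]),
            )
    return d


def levenshtein(x, y):
    """Unit-cost case: substitution costs 0 for equal characters, else 1."""
    d = edit_distance(
        x, y,
        substitution=lambda p, q: 0 if p == q else 1,
        deletion=lambda p: 1,
        insertion=lambda q: 1,
    )
    return d[len(x)][len(y)]


print(levenshtein("ccatese", "catch"))  # 4, the bottom-right cell of Fig 1(b)
print(levenshtein("bat", "bad"))        # 1
```

Passing the cost functions as arguments keeps the recurrence itself unchanged when the metric is generalized.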

We call the edit distance between an input subsequence and a subpattern a step. In a traversal, the steps included in the traversal are calculated in order. The traversal method determines the order in which steps are calculated.

Generalized edit distance

Unlike the Levenshtein distance, the generalized edit distance adopts more sophisticated cost functions in Eq (1). Depending on the operator type, cost functions can output values other than 1 or 0. For example, the costs of two operators can be tied by a fixed ratio of two, as in:

$$
\mathrm{substitution}(x_{\alpha}, y_{\beta}) = 2 \cdot \mathrm{deletion}(x_{\alpha})
\tag{3}
$$

For example, the similarity in character shapes can be used in another generalized distance metric. Because ‘h’ is similar to ‘b’ in shape, a human may feel that input sequence “catcb” is more similar to pattern “catch” than input sequence “catco” is. In this case, the similarity can be estimated by the substitution operator in the generalized edit distance metric. In another example, misspellings can depend on the character positions on a keyboard. On the US computer keyboard, ‘q’ has a high possibility of being mistyped as ‘w’ because the ‘w’ key is located next to the ‘q’ key. However, the shape of ‘q’ is totally different from that of ‘w’, so other functions are needed to quantify the difference between key positions. Besides, the generalized edit distance can consider the pattern length. Intuitively, the difference between “ca” and “cat” is expected to be greater than that between “catasrophe” and “catastrophe”, even though the difference in both cases is caused by one deleted character ‘t’. Therefore, the costs of the insertion and deletion operations can be inversely proportional to the pattern length. In this case, the symmetry condition D(X_α, Y_β) = D(Y_β, X_α) cannot be met when the costs for insertion and deletion operations differ from each other. In conclusion, the generalized edit distance metric requires more complicated operations. Besides, these generalized edit distance metrics can adopt fractions to represent the distance.

Fig 2 shows an example of the edit distance matrix using the generalized edit distance metric based on the similarity in shape and pattern length. We assume that the costs of deletion and insertion operators are fixed as 0.76. For the substitution operation, different values are added depending on character shapes. In Fig 2, because characters ‘c’ and ‘e’ seem to be similar in shape, substitution(‘c’, ‘e’) = 0.42. On the other hand, for the cases with characters ‘a’ and ‘h’, substitution(‘a’, ‘h’) = 1.20. Unlike Fig 1, each step’s distance has a fraction in Fig 2, so that the generalized edit distance calculation needs fractional operations. As these operators need complex computations, the evaluation of each operator requires an additional computational overhead.
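The costs quoted for Fig 2 can be plugged into the recurrence of Eq (1) as in the following sketch. Only the values 0.76, 0.42, and 1.20 come from the text; the default substitution cost of 1.0 for every other character pair, the symmetric lookup, and all names are assumptions made for illustration.

```python
INDEL_COST = 0.76                      # deletion/insertion cost quoted for Fig 2
SHAPE_COST = {
    frozenset("ce"): 0.42,             # 'c' and 'e' look similar
    frozenset("ah"): 1.20,             # 'a' and 'h' look different
}
DEFAULT_SUBSTITUTION = 1.0             # assumption: other pairs are not given


def substitution(p, q):
    """Shape-based substitution cost, looked up symmetrically."""
    if p == q:
        return 0.0
    return SHAPE_COST.get(frozenset((p, q)), DEFAULT_SUBSTITUTION)


def generalized_distance(x, y):
    """Eq (1) with fractional costs, keeping only two matrix rows in memory."""
    prev = [b * INDEL_COST for b in range(len(y) + 1)]
    for a in range(1, len(x) + 1):
        cur = [a * INDEL_COST]
        for b in range(1, len(y) + 1):
            cur.append(min(
                prev[b - 1] + substitution(x[a - 1], y[b - 1]),
                prev[b] + INDEL_COST,        # deletion of x[a-1]
                cur[b - 1] + INDEL_COST,     # insertion of y[b-1]
            ))
        prev = cur
    return prev[-1]


print(round(generalized_distance("cat", "eat"), 2))  # 0.42
print(round(generalized_distance("ca", "cat"), 2))   # 0.76
```

Because every step now carries a fractional value, each cell costs a table lookup and floating-point additions instead of a single character comparison.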

k-mismatch string matching

A k-mismatch approximate string matching is defined as:

Definition 2 In k-mismatch string matching, for input sequence X_i and pattern Y_j, when D(X_i, Y_j) ≤ k, X_i matches Y_j.

Term k denotes the threshold for determining whether X_i is matched with Y_j or not.

Because the cost of any operation is a positive value, Eq (1) can be modified for k-mismatch string matching with input subsequence X_α and subpattern Y_β as:

$$
D(X_{\alpha}, Y_{\beta})_{k} = \min
\begin{cases}
D(X_{\alpha-1}, Y_{\beta-1})_{k} + \mathrm{substitution}(x_{\alpha}, y_{\beta}) & \text{when } D(X_{\alpha-1}, Y_{\beta-1})_{k} \leq k \\
D(X_{\alpha-1}, Y_{\beta})_{k} + \mathrm{deletion}(x_{\alpha}) & \text{when } D(X_{\alpha-1}, Y_{\beta})_{k} \leq k \\
D(X_{\alpha}, Y_{\beta-1})_{k} + \mathrm{insertion}(y_{\beta}) & \text{when } D(X_{\alpha}, Y_{\beta-1})_{k} \leq k
\end{cases}
\tag{4}
$$

In Eq (4), when the edit distance of a data-dependent previous step (D(X_α−1, Y_β−1)_k, D(X_α−1, Y_β)_k, and D(X_α, Y_β−1)_k) is over k, there is no need to evaluate its operation for calculating D(X_α, Y_β)_k. However, an additional overhead is required to find whether its data-dependent previous steps are over k or not.
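A direct reading of Eq (4) with unit costs can be sketched as follows. The names `bounded_levenshtein` and `matches` are hypothetical, and the use of infinity for a step with no predecessor at or below k is an assumption of this sketch. The three `if` statements are exactly the per-step checks whose overhead is discussed above.

```python
import math


def bounded_levenshtein(x, y, k):
    """Eq (4) with unit costs: predecessors above k are never evaluated."""
    rows, cols = len(x) + 1, len(y) + 1
    d = [[math.inf] * cols for _ in range(rows)]
    d[0][0] = 0
    for a in range(rows):
        for b in range(cols):
            if a == 0 and b == 0:
                continue
            best = math.inf
            if a > 0 and b > 0 and d[a - 1][b - 1] <= k:
                best = min(best, d[a - 1][b - 1] + (x[a - 1] != y[b - 1]))
            if a > 0 and d[a - 1][b] <= k:
                best = min(best, d[a - 1][b] + 1)    # deletion
            if b > 0 and d[a][b - 1] <= k:
                best = min(best, d[a][b - 1] + 1)    # insertion
            d[a][b] = best
    return d[rows - 1][cols - 1]


def matches(x, y, k):
    """Definition 2: the input matches the pattern when the distance is <= k."""
    return bounded_levenshtein(x, y, k) <= k


print(matches("ccatese", "catch", 4))  # True
print(matches("ccatese", "catch", 3))  # False
```

Any value at or below k is still exact, because with positive costs an optimal path to such a step passes only through steps that are themselves at or below k.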

Proposed diagonal string matching using pruning register

Motivations

From Eq (4), when it is known in advance that the edit distance of a data-dependent previous step (D(X_α−1, Y_β−1), D(X_α−1, Y_β), D(X_α, Y_β−1)) is over k-mismatches, we can determine whether its related operation is needed or not. Our motivation starts from the fact that unnecessary step calculations over k-mismatches can be skipped depending on data-dependent previous steps. In the existing dynamic programming-based method, the distance matrix is filled by calculating edit distances between input subsequences and subpatterns, so that the data-dependent previous steps are accessed in the edit distance matrix.

Fig 3 is a conceptual figure that illustrates the overhead ratio of the conditional statements for checking data-dependent previous steps, relative to the computational overhead of performing operations (substitution, insertion, and deletion). In a simple edit distance metric such as the Levenshtein distance, each operation only compares characters and produces a binary output, so several operations are executed in the processor pipeline [31]. If there is a conditional jump, the speculatively issued instructions of many operations can be cancelled, which degrades the performance. In other words, it is expected that the conditional jump for skipping evaluations cannot reduce the total execution time. The overhead ratio of conditional statements for finding whether a data-dependent previous step is over k is too high. Therefore, in a simple edit distance metric, skipping evaluations could bring no benefit in range (a) of Fig 3.

On the other hand, when the edit distance metric requires more computational resource to evaluate complicated operators, the skipping method can be useful. As the computational resources for performing each operator increase, the overhead ratio of conditional statements becomes very small. In range (c), it can be better to skip each operator evaluation when finding its data-dependent previous steps over k. In range (b), the overhead of conditional statements and operator evaluations is not negligible. If the remaining iterations can be skipped in the loop for calculating each step, many operator evaluations can be reduced. The implementation of dynamic programming [13] using a nested loop cannot provide such functions.

Therefore, we propose a new string matching method for skipping the remaining iterations for the distance metric. In the following, the problem definition is discussed in detail, and the proposed diagonal skipping method is explained.

Problem definition

Fig 4 shows examples of calculating steps and their traversals considering data dependency between steps, in which Fig 4(a) shows simple vertical traversals. In vertical traversal, steps on the next column depend on those on the previous column for substitution and insertion operators. Therefore, after calculating steps on a column, steps neighbouring on the right column can be calculated. Besides, for the deletion operator, the traversal should proceed from top to bottom. This dynamic programming considers data dependency between its neighbouring steps that exists from Eq (1) [13]. After calculating steps in a traversal, the vertical traversal is performed on the right column. Therefore, when n and m are denoted as the input sequence and pattern lengths, the computational complexity can be O(mn).

Several approximate string matching algorithms have been studied to reduce the computational complexity of dynamic programming [14, 15]. In general, previous works on k-mismatch string matching enhance the throughput of string matching for long input strings such as network traffic data [2] and DNA sequences [4, 5]. There, multiple occurrences of the pattern are searched in the long input string, so string matching with input subsequences can be considered. For example, when the input sequence and pattern are “baseball player” and “catastrophe”, the Levenshtein distance between subsequence “baseball” and pattern “catastrophe” is 9. Therefore, if k < 9, input subsequence “baseball” cannot be matched. Then, string matching with the next subsequence “player” begins. In this case, the calculation of the edit distance matrix cannot be avoided. This paper proposes a new method that reduces the execution time of obtaining the edit distance matrix for k-mismatches.

Diagonal traversal and skipping method

Our method adopts the diagonal traversal to skip unnecessary step calculations over k-mismatches. Unlike the vertical traversal performed on each column, the diagonal traversal calculates steps across columns. Even though the work in [14] proposes the diagonal evaluation based on reordered data structure, the step calculations are not skipped for k-mismatch string matching.

In Fig 4(b), diagonal traversals are illustrated, where an arrow illustrates the order of steps calculated in each traversal. Fig 4(b) describes that the upper right step is calculated before the lower left step in a diagonal traversal. It is denoted that t is the index of a traversal, and the traversals indicated by arrows traversal(t − 2), traversal(t − 1), and traversal(t) are performed in order. The calculations of steps on a diagonal traversal traversal(t) do not have data dependency with each other. Each step calculation of traversal(t) has data dependency with three steps of traversal(t − 1) and traversal(t − 2). For substitution operation, a step can be calculated after obtaining the step in traversal(t − 2). For insertion and deletion operations, two steps can be calculated depending on steps of traversal(t − 1). These calculations require the values of two steps for the substitution and insertion operations on the left column and one step for deletion operation on the same column.
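The traversal order described for Fig 4(b) can be enumerated with a short sketch. The helper name `diagonal_traversals` and the convention that traversal t contains the steps with α + β = t are assumptions made here for illustration.

```python
def diagonal_traversals(i, j):
    """Yield one list of steps (alpha, beta) per diagonal traversal.

    Steps on a traversal share alpha + beta == t and are listed from the
    upper right (largest beta) to the lower left (largest alpha).
    """
    for t in range(2, i + j + 1):
        beta_high = min(j, t - 1)      # rightmost column on this diagonal
        beta_low = max(1, t - i)       # leftmost column on this diagonal
        yield [(t - beta, beta) for beta in range(beta_high, beta_low - 1, -1)]


for steps in diagonal_traversals(3, 2):
    print(steps)
# [(1, 1)]
# [(1, 2), (2, 1)]
# [(2, 2), (3, 1)]
# [(3, 2)]
```

With this indexing, step (α, β) of traversal t reads (α − 1, β) and (α, β − 1) from traversal t − 1 and (α − 1, β − 1) from traversal t − 2, so steps on the same traversal never depend on each other.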

When the previous diagonal traversals traversal(t − 2) and traversal(t − 1) finish the calculations of all data-dependent previous steps, traversal(t) can use the calculation results to skip unnecessary operator evaluations. Our proposed method adopts a so-called pruning register to avoid multiple iterations in a loop without accessing each element in the edit distance matrix. Each pruning bit in the pruning register is assigned to a column of the edit distance matrix. When the pruning bit for a column is set to ‘1’, there is no need to calculate the remaining steps in that column, because the steps still to be calculated have distances over k. The pseudocode of the proposed string matching is as follows:

Algorithm 1 Diagonal Skipping

procedure Diagonal_Skipping (input sequence, pattern)
    Initialization(D[0, …,i][0], D[0][1, …,j], pruning_reg)
    for each traversal do
        for each step(α, β) ∈ traversal do
            if pruning_reg(β) == 1 then break;
            end if
            if pruning_reg(β − 1) == 0 then
                D[α][β] = cost_min((D[α − 1][β], D[α][β − 1], D[α − 1][β − 1]))
            else
                D[α][β] = cost_min((D[α − 1][β], D[α − 1][β − 1]))
            end if
            if D[α][β] > k and pruning_reg(β − 1) == 1 then
                pruning_reg(β) = 1;
            end if
        end for
    end for
    return D
end procedure

In the pseudocode, the procedure Diagonal_Skipping has two arguments: input sequence and pattern. Terms i, j, and k denote the input sequence length, pattern length, and mismatch threshold k, respectively. Firstly, several elements in two-dimensional (i + 1) × (j + 1) array D and pruning register pruning_reg are initialized. In this initialization, D[0, …,i][0] is initialized using only deletion operators. On the other hand, D[0][1, …,j] is initialized using only insertion operators. These steps can be simply calculated without considering min() function in Eq (1). In the pruning register, the bit indicating the leftmost column (the 0-th column) is set as ‘1’, and other bits are set as ‘0’.

Then, each step in a traversal is calculated in order. The direction of the arrow is from right top to left bottom, as shown in Fig 4. As shown in Preliminaries section, X_α and Y_β mean a subsequence of input sequence X_i and a subpattern of pattern Y_j for 1 ≤ α ≤ i and 1 ≤ β ≤ j. An element D[α][β] in the two-dimensional array contains edit distance D(X_α, Y_β). For each step of D[α][β] for distance D(X_α, Y_β), if the pruning bit of the β-th column is ‘1’, the next steps calculated in a traversal can be over k. Therefore, the break statement means that this procedure skips other iterations that calculate steps of the traversal; otherwise, each step in the traversal is calculated. When the pruning bit of (β−1)-th column is ‘0’, data-dependent previous steps are accessed, and function cost_min calculates the minimum distance. Except for the calculation D[s][1], 1 ≤ s ≤ i, when the pruning bit of (β−1)-th column is ‘1’, the value stored in D[α][β − 1] is over k, so only D[α − 1][β] and D[α − 1][β − 1] are accessed to calculate D[α][β]. When calculating D[s][1], 1 ≤ s ≤ i, all operators are considered because the pruning bit of the leftmost column is initialized as ‘1’. If the edit distance in D[α][β] is over k and the pruning bit of the (β − 1)-th column is ‘1’, the pruning bit of the β-th column becomes ‘1’. The number of diagonal traversals is proportional to the pattern length j. Therefore, the computational complexity can be O(jk), which means that the computations can be mainly limited by j and k.
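The description above can be turned into the following Python sketch. It is not the C implementation evaluated in this paper. Steps that are never calculated keep the value infinity, the exception for the column next to the leftmost one is written as an explicit `b == 1` test, and the function name and the returned pair are assumptions of this sketch.

```python
import math


def diagonal_skipping(x, y, k, substitution, deletion, insertion):
    """Sketch of Algorithm 1; steps that are never calculated stay at infinity."""
    i, j = len(x), len(y)
    d = [[math.inf] * (j + 1) for _ in range(i + 1)]
    d[0][0] = 0
    for a in range(1, i + 1):                  # D[0..i][0]: deletions only
        d[a][0] = d[a - 1][0] + deletion(x[a - 1])
    for b in range(1, j + 1):                  # D[0][1..j]: insertions only
        d[0][b] = d[0][b - 1] + insertion(y[b - 1])
    pruning_reg = [1] + [0] * j                # bit of the leftmost column is 1

    for t in range(2, i + j + 1):              # steps of a traversal: a + b == t
        for b in range(min(j, t - 1), max(1, t - i) - 1, -1):
            a = t - b                          # upper right to lower left
            if pruning_reg[b] == 1:
                break                          # remaining steps are over k
            best = min(d[a - 1][b] + deletion(x[a - 1]),
                       d[a - 1][b - 1] + substitution(x[a - 1], y[b - 1]))
            # Insertion is skipped only when the left column is pruned;
            # column 1 always evaluates it (the exception noted in the text).
            if pruning_reg[b - 1] == 0 or b == 1:
                best = min(best, d[a][b - 1] + insertion(y[b - 1]))
            d[a][b] = best
            if best > k and pruning_reg[b - 1] == 1:
                pruning_reg[b] = 1
    return d, pruning_reg


unit = dict(substitution=lambda p, q: int(p != q),
            deletion=lambda p: 1, insertion=lambda q: 1)
d, reg = diagonal_skipping("ccatese", "catch", 2, **unit)
print(d[7][5] <= 2)   # False: the Levenshtein distance is 4
print(reg[1])         # 1: the second left column was pruned
```

In this sketch, only the comparison of a final value against k is meaningful. A value at or below k is exact, while a value above k may be an overestimate or infinity.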

Fig 5 illustrates the string matching operation with input sequence “ccatese” and pattern “catch”. Firstly, the bit for the leftmost column in the pruning register is initialized as ‘1’, as shown in Fig 5(a). Additionally, D[0, …,i][0] and D[0][1, …,j] are initialized. As diagonal traversals proceed to the fourth arrow, a step over k = 2 is reached on the second left column, and pruning_reg(1) is updated to ‘1’, as shown in Fig 5(c). On the fifth arrow, because pruning_reg(1) is ‘1’, function cost_min(D[3][2], D[3][1]) is performed to calculate D[4][2], where the insertion operation is skipped. Fig 5(d) shows intermediate progress based on Algorithm 1. The proposed diagonal skipping method can determine the steps to be skipped just by accessing the pruning register instead of using all data-dependent previous steps. Also, the proposed method skips multiple step calculations at a time, which reduces the execution time.

In the proposed method, the number of traversals on arrows is proportional to the pattern length j, where several step calculations over k are not skipped in the proposed method. If all neighbouring data-dependent steps on the same column over k are checked before evaluating operators, the unnecessary operator evaluations can be skipped, which can make the complexity O(min(j, k)). However, unlike the proposed diagonal skipping method using only one pruning register, complicated conditional statements and additional memory accesses are required. This method can be valid when the computational overhead of operator evaluations is significant, which is described in the range (c) of Fig 3.

Experimental results and analysis

Based on realistic environments, we show the experimental results depending on different edit distance metrics. Firstly, when Levenshtein distance is calculated, it is expected that the skipping method is not effective due to the overhead of conditional statements. Then, when adopting the generalized edit distance metrics considering the visual similarity in shapes or keyboard character positions, the proposed skipping method can show better performance than the dynamic programming for small k-mismatches and the method using the reordered data structure. Besides, the overhead of conditional statements for finding data-dependent previous steps is discussed.

Experimental environments

In the experiments, the proposed method was written in C and compiled with GCC 5.4.0. For apples-to-apples comparisons, we implemented the dynamic programming and the skipping method that checked neighbouring data-dependent previous steps to skip each step calculation over k. These implemented codes have been uploaded in [32], where the execution times of the proposed method and its counterparts were measured. The tests were performed on a single core of an Intel Xeon CPU E5-2630 v3 @ 2.40GHz machine with 16 Gigabyte main memory and the Ubuntu 16.04 operating system. The experiments randomly selected 100,000 pairs of input sequence and pattern from the English dictionary with 370,099 words [33], where the average and standard deviation of the input sequence and pattern lengths were 9.4 and 2.90, respectively.

We evaluated the proposed method based on three different distance metrics. Firstly, we calculated the Levenshtein distances to know the benefits of the proposed method in the simple edit distance metric. Secondly, for the evaluation using a highly computational edit distance metric, the similarity in shapes between two alphabet characters was quantified in a two-dimensional array D considering [34]. This array was used to calculate the cost in the substitution operator. For example, substitution(‘a’, ‘b’) = 1/2.13 * C and substitution(‘o’, ‘e’) = 1/4.13 * C, where C was the scaling factor for normalizing the substitution cost. In this example, the cost of substitution(‘a’, ‘b’) can be 1.94 times the cost of substitution(‘o’, ‘e’). For insertion and deletion operators, this experiment assumed a weighted cost depending on pattern length j, where we developed an exponential cost function using the average word length as:

$$
\mathrm{deletion}(x_{\alpha}),\ \mathrm{insertion}(y_{\beta}) = \frac{\exp\left(\frac{\mathrm{average\_length}}{j}\right)}{\exp(1)}
\tag{5}
$$

In Eq (5), costs were normalized by exp(1). As j increased, the costs of insertion or deletion operations decreased exponentially, so that different weights were assigned depending on j.
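Eq (5) can be evaluated directly, as in the sketch below. Using the reported average word length of 9.4 as `average_length` is an assumption of this sketch, as are the names `AVERAGE_LENGTH` and `indel_cost`.

```python
import math

AVERAGE_LENGTH = 9.4   # assumption: the average word length reported above


def indel_cost(pattern_length, average_length=AVERAGE_LENGTH):
    """Eq (5): exp(average_length / j) normalized by exp(1)."""
    return math.exp(average_length / pattern_length) / math.exp(1)


for j in (5, 9.4, 20):
    print(j, round(indel_cost(j), 3))
```

Under this assumption the cost equals 1 when j matches the average length, is larger for shorter patterns, and is smaller for longer ones.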

Finally, our experiments adopted a more complicated distance metric that considered character positions in a keyboard. In this metric, the Euclidean distance between characters was calculated to obtain each substitution operation’s cost. The position of each character was stored in an array, which was used to calculate the Euclidean distance between characters. Based on the typo distance in [35], the function for calculating the cost of each substitution operation was implemented. Unlike typo distance [35] without commutative property, our edit distance metric had the same cost for the deletion and insertion operations to meet the edit distance’s characteristic. The features of edit distance metrics above are summarized in Table 1.
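A position-based substitution cost of this kind might look like the following sketch. The key coordinates, the row stagger, and the absence of any scaling are all assumptions made for illustration; they are not the values or the function used in the experiments or in [35].

```python
import math

# Assumed QWERTY layout: one unit per key, with a hypothetical row stagger.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ROW_OFFSET = [0.0, 0.25, 0.75]

KEY_POSITION = {
    ch: (col + ROW_OFFSET[r], float(r))
    for r, row in enumerate(ROWS)
    for col, ch in enumerate(row)
}


def keyboard_substitution(p, q):
    """Euclidean distance between the assumed positions of two keys."""
    if p == q:
        return 0.0
    (x1, y1), (x2, y2) = KEY_POSITION[p], KEY_POSITION[q]
    return math.hypot(x1 - x2, y1 - y2)


print(keyboard_substitution("q", "w"))  # 1.0: neighbouring keys
print(keyboard_substitution("q", "p"))  # 9.0: opposite ends of the top row
```

Because the cost depends only on the two positions, it is symmetric in its arguments. Each substitution now needs two table lookups and a square root, which is the kind of operator cost the keyboard metric adds.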

Table 1. Features of evaluated edit distance metrics.

Metric	Role	Costs
Levenshtein	simple low-cost edit distance metric	low
Shape	normalized weighted edit distance metric	medium
Keyboard	complex weighted edit distance metric	high


Experimental analysis

Fig 6 shows the summary of average execution times by sweeping k when using the Levenshtein distance metric. When the diagonal traversal did not adopt k-mismatches, the average execution time was longer than that of the dynamic programming using the vertical traversal because of the overhead from conditional statements and reordered memory accesses. In these experiments, the execution time increased with k. When k > 4, the execution time was over that of the vertical traversal, which means the proposed method did not have any benefits over the simple vertical traversal for large k. Besides, Fig 6 shows that the diagonal traversal without considering k-mismatches required the additional overhead of conditional statements and reordered data accesses compared with the vertical traversal. Therefore, for the Levenshtein distance metric, when k was small, we concluded that the proposed method can help reduce the execution time. Significantly, compared with the vertical and diagonal traversals, the execution times were decreased by 44.3% and 52.3% with k = 1.

For the generalized edit distance using similarity in shapes, Fig 7 illustrates the average execution times by sweeping k. Like the case using the Levenshtein distance metric, the execution time increased with k. When k < 5, the average execution times of the proposed diagonal skipping method were shorter than that of the vertical traversal, which means that many step calculations can be skipped for small k in this generalized edit distance metric. The diagonal traversal only increased the average execution time by 11.5% over the vertical traversal. Notably, when k = 1, the execution times were reduced by 55.7% and 60.3% over the vertical and diagonal traversals. Compared with the evaluation using the Levenshtein distance metric, it was expected that the execution time can be further reduced because the overhead ratio of conditional statements and reordered memory accesses was smaller.

Fig 8 summarizes the average execution times for the distance metric based on keyboard character positions. As in Figs 6 and 7, the execution times were evaluated by sweeping k. Diagonal skipping(I) used only the skipping method based on the pruning register. Diagonal skipping(II) used the same skipping method and, in addition, avoided unnecessary operator evaluations when the data-dependent previous steps it accessed were already over k. By avoiding these operator evaluations, Diagonal skipping(II) further reduced the execution time when k was small. As k increased, the difference in average execution time between Diagonal skipping(I) and Diagonal skipping(II) shrank because Diagonal skipping(II) avoided fewer operator evaluations. When k = 5, the difference in the average execution times was negligible. When k > 5, the average execution time of Diagonal skipping(II) was longer than that of Diagonal skipping(I). When k = 8, the average execution time of Diagonal skipping(II) was very close to those of the vertical and diagonal traversals. As with the Levenshtein distance and the generalized distance using similarity in shapes, many step calculations can be skipped for small k, and the number of skipped step calculations decreased as k increased. However, even when k = 8, the proposed method's average execution time was shorter than those of the vertical and diagonal traversals. In agreement with Table 1, the computational costs of the operators used in this distance metric can be high compared with those of the conditional statements. Notably, compared with the vertical traversal, when k = 1, Diagonal skipping(I) and Diagonal skipping(II) reduced the average execution time by 60.1% and 87.3%, respectively.
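The keyboard-position metric is a case where every step applies a comparatively expensive substitution operator. The following minimal sketch shows what such a generalized edit distance can look like; it is not the implementation evaluated above. The QWERTY layout table, the scaling in `keyboard_sub_cost` (Euclidean key distance divided by 4 and capped at 1), the unit insertion and deletion cost, and all function names are assumptions made for illustration.

```python
import math

# Hypothetical key coordinates: (row, column) on a simplified QWERTY layout.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}


def keyboard_sub_cost(a, b):
    """Toy substitution cost: closer keys are cheaper to confuse (assumed scaling)."""
    if a == b:
        return 0.0
    if a not in KEY_POS or b not in KEY_POS:
        return 1.0  # unknown characters fall back to the unit cost
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    return min(1.0, math.hypot(r1 - r2, c1 - c2) / 4.0)


def generalized_edit_distance(x, y, sub_cost, indel_cost=1.0):
    """Row-by-row dynamic programming with a pluggable substitution cost."""
    prev = [b * indel_cost for b in range(len(y) + 1)]
    for a in range(1, len(x) + 1):
        cur = [a * indel_cost]
        for b in range(1, len(y) + 1):
            cur.append(min(
                prev[b] + indel_cost,                        # drop x[a-1]
                cur[b - 1] + indel_cost,                     # add y[b-1]
                prev[b - 1] + sub_cost(x[a - 1], y[b - 1]),  # substitute or match
            ))
        prev = cur
    return prev[-1]
```

Passing a unit-cost function such as `lambda a, b: 0.0 if a == b else 1.0` as `sub_cost` recovers the ordinary Levenshtein distance, which makes the extra work done per step by a position-aware operator easy to compare.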

Fig 9 shows the ratios of skipped edit distance calculations. This experiment counted two types of skipped calculations when obtaining D[α][β]. First, when the pruning bit of the (β−1)-th column was ‘1’, the insertion operator was skipped, and only D[α − 1][β] and D[α − 1][β − 1] were accessed to calculate D[α][β]. Second, when the pruning bit of the β-th column was ‘1’, the next steps in the traversal were skipped because they were over k. When k = 1, 70.9% to 84.6% of the step calculations were skipped. As k increased, the ratios decreased rapidly, at a rate that depended on the adopted edit distance metric. When k = 8, only 3.7% to 11.0% of the step calculations could be skipped, in which case the overhead of conditional statements and reordered memory accesses increased the average execution time compared with the vertical and diagonal traversals.
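As a rough illustration of why the skip ratio falls as k grows, the sketch below fills the full Levenshtein table and reports the fraction of cells whose value exceeds k. This is only a proxy for the measured ratios: it does not model the pruning register or the diagonal traversal, and the function names are hypothetical.

```python
def levenshtein_table(x, y):
    """Full (len(x)+1) x (len(y)+1) table of unit-cost edit distances."""
    rows, cols = len(x) + 1, len(y) + 1
    D = [[0] * cols for _ in range(rows)]
    for a in range(rows):
        D[a][0] = a
    for b in range(cols):
        D[0][b] = b
    for a in range(1, rows):
        for b in range(1, cols):
            cost = 0 if x[a - 1] == y[b - 1] else 1
            D[a][b] = min(D[a - 1][b] + 1, D[a][b - 1] + 1, D[a - 1][b - 1] + cost)
    return D


def over_k_ratio(x, y, k):
    """Fraction of interior cells with distance > k (a proxy, not the paper's count)."""
    D = levenshtein_table(x, y)
    cells = [D[a][b] for a in range(1, len(x) + 1) for b in range(1, len(y) + 1)]
    if not cells:
        return 0.0
    return sum(v > k for v in cells) / len(cells)


if __name__ == "__main__":
    for k in range(1, 9):
        print(k, round(over_k_ratio("approximate", "appropriate", k), 3))
```

Because a cell that exceeds k + 1 also exceeds k, the reported fraction can only stay the same or shrink as k increases, which matches the direction of the trend described for Fig 9.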

A statistical analysis was performed to determine the functional relationship between the execution time and the input parameters. Following [36], a regression approach was adopted, with the input sequence and pattern lengths as the input parameters. This evaluation was performed with k = 2 for all adopted edit distance metrics. In these regression analyses, the coefficient of determination (R²) indicates how well the regression model fits the target data [37]. For the Levenshtein distance metric, R² was only 0.267. In contrast, the R² values of the generalized edit distance metrics using similarity in shapes (denoted as Shape) and keyboard position (denoted as Keyboard) were 0.575 and 0.818, respectively. These results showed that factors other than the input sequence and pattern lengths could significantly affect the execution time under the Levenshtein distance metric. Table 2 lists the results of the regression analysis for the three adopted edit distance metrics, where Coef., SE Coef., T, and P denote the coefficient, the standard error of the coefficient, the t-value, and the p-value, respectively. Because the p-values were small, the input sequence and pattern lengths were statistically significant. Large t-values in Table 2 show that even when the input sequence and pattern lengths were the same, the execution time could differ considerably depending on the input sequence and pattern values. Moreover, the coefficient for the pattern length was larger than that for the input sequence length, which means that the pattern length had the greater effect on the execution time.

Table 2. Statistical analysis using a regression approach according to the edit distance metric.

Metric	Term	Coef.	SE Coef.	T	P
Levenshtein	Constant	2328.18	22.81	102.06	0
	length(input)	34.03	1.67	20.40	2.5E-92
	length(pattern)	316.26	1.67	189.85	0
Shape	Constant	96.89	29.98	3.23	1.23E-3
	length(input)	64.96	2.19	29.17	3E-186
	length(pattern)	803.49	2.19	366.95	0
Keyboard	Constant	-52085.6	329.11	-158.26	0
	length(input)	1982.56	24.07	82.37	0
	length(pattern)	15981.31	24.03	664.92	0
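
To make the fitted models concrete, the following minimal sketch evaluates the linear form implied by the terms in Table 2, that is, constant + coefficient × length(input) + coefficient × length(pattern), using the listed coefficients. The names `TABLE2` and `predicted_time` are hypothetical, the time unit is whatever unit the measurements used (it is not stated in this section), and the coefficients apply to k = 2 only.

```python
# Coefficients copied from Table 2: (constant, length(input), length(pattern)).
TABLE2 = {
    "Levenshtein": (2328.18, 34.03, 316.26),
    "Shape": (96.89, 64.96, 803.49),
    "Keyboard": (-52085.6, 1982.56, 15981.31),
}


def predicted_time(metric, input_len, pattern_len):
    """Evaluate the fitted linear model for one metric (assumed linear form)."""
    const, c_input, c_pattern = TABLE2[metric]
    return const + c_input * input_len + c_pattern * pattern_len


if __name__ == "__main__":
    for metric in TABLE2:
        print(metric, predicted_time(metric, 10, 10))
```

For very short strings the Keyboard model returns a negative value because of its large negative constant, a reminder that such a fit is only meaningful within the range of lengths that were actually measured.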

Conclusion

This paper proposes k-mismatch approximate string matching for the generalized edit distance and shows that skipping step calculations can reduce the execution time when the generalized edit distance is used. The proposed method adopts a pruning register to skip step calculations in the diagonal traversals. Practical generalized edit distance metrics are introduced to provide realistic experimental conditions: the Levenshtein distance and two generalized edit distance metrics, based on similarity in shapes and on keyboard character positions, are used to evaluate the effectiveness of the proposed method. In the experiments, despite the overhead of conditional statements and reordered data accesses, the proposed method reduced the execution time of k-mismatch string matching under the generalized edit distance metrics. Given these results with realistic edit distance metrics, the proposed skipping method helps reduce the execution time of k-mismatch approximate string matching.

Data Availability

All code files are available from the GitHub database (https://github.com/analog75/ED).

Funding Statement

The author received no specific funding for this work.

References

1. Navarro G. A Guided Tour to Approximate String Matching. ACM computing surveys (CSUR). 2001;33(1):31–88. doi:10.1145/375360.375365
2. Mateless R, Segal M. Approximate String Matching for DNS Anomaly Detection. In: International Conference on Security, Privacy and Anonymity in Computation, Communication and Storage. Springer; 2019. p. 490–504.
3. Hakak SI, Kamsin A, Shivakumara P, Gilkar GA, Khan WZ, Imran M. Exact String Matching Algorithms: Survey, Issues, and Future Research Directions. IEEE Access. 2019;7:69614–69637. doi:10.1109/ACCESS.2019.2914071
4. Ryu C, Lecroq T, Park K. Fast string matching for DNA sequences. Theoretical Computer Science. 2020;812:137–148. doi:10.1016/j.tcs.2019.09.031
5. Al-Ssulami AM, Mathkour HI, Arafah MA. Efficient String Matching Algorithm for Searching Large DNA and Binary Texts. In: Data Analytics in Medicine: Concepts, Methodologies, Tools, and Applications. IGI Global; 2020. p. 298–324.
6. Kim T, Li W, Behm A, Cetindil I, Vernica R, Borkar V, et al. Similarity query support in big data management systems. Information Systems. 2020;88:101455. doi:10.1016/j.is.2019.101455
7. Guo L, Du S, Ren M, Liu Y, Li J, He J, et al. Parallel Algorithm for Approximate String Matching with K-Differences. In: Networking, Architecture and Storage (NAS), 2013 IEEE Eighth International Conference on. IEEE; 2013. p. 257–261.
8. Hamming RW. Error detecting and error correcting codes. The Bell system technical journal. 1950;29(2):147–160. doi:10.1002/j.1538-7305.1950.tb00463.x
9. Kashyap RL, Oommen BJ. An effective algorithm for string correction using generalized edit distances-I. Description of the algorithm and its optimality. Information Sciences. 1981;23(2):123–142. doi:10.1016/0020-0255(81)90052-9
10. Kashyap RL, Oommen BJ. An effective algorithm for string correction using generalized edit distance-II. Computational complexity of the algorithm and some applications. Information Sciences. 1981;23(3):201–217. doi:10.1016/0020-0255(81)90056-6
11. Marzal A, Vidal E. Computation of normalized edit distance and applications. IEEE transactions on pattern analysis and machine intelligence. 1993;15(9):926–932. doi:10.1109/34.232078
12. Yujian L, Bo L. A normalized Levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence. 2007;29(6):1091–1095. doi:10.1109/TPAMI.2007.1078
13. Wagner RA, Fischer MJ. The string-to-string correction problem. Journal of the ACM (JACM). 1974;21(1):168–173. doi:10.1145/321796.321811
14. Allison L. Lazy dynamic-programming can be eager. Information Processing Letters. 1992;43(4):207–212. doi:10.1016/0020-0190(92)90202-7
15. Gusfield D. Algorithms on stings, trees, and sequences: Computer science and computational biology. Acm Sigact News. 1997;28(4):41–60. doi:10.1145/270563.571472
16. Xu K, Cui W, Hu Y, Guo L. Bit-parallel multiple approximate string matching based on GPU. Procedia Computer Science. 2013;17:523–529.
17. Lin CH, Wang GH, Huang CC. Hierarchical parallelism of bit-parallel algorithm for approximate string matching on GPUs. In: Computer Applications and Communications (SCAC), 2014 IEEE Symposium on. IEEE; 2014. p. 76–81.
18. Nunes LS, Bordim JL, Nakano K, Ito Y. A fast approximate string matching algorithm on GPU. In: Computing and Networking (CANDAR), 2015 Third International Symposium on. IEEE; 2015. p. 188–192.
19. Nunes LS, Bordim J, Nakano K, Ito Y. A Memory-Access-Efficient Implementation of the Approximate String Matching Algorithm on GPU. In: Computing and Networking (CANDAR), 2016 Fourth International Symposium on. IEEE; 2016. p. 483–489.
20. Tran TT, Liu Y, Schmidt B. Bit-parallel approximate pattern matching: Kepler GPU versus Xeon Phi. Parallel Computing. 2016;54:128–138. doi:10.1016/j.parco.2015.11.001
21. Ho T, Oh SR, Kim H. A parallel approximate string matching under Levenshtein distance on graphics processing units using warp-shuffle operations. PloS one. 2017;12(10). doi:10.1371/journal.pone.0186251
22. Ho T, Oh SR, Kim H. New algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance. The Journal of Supercomputing. 2018;74(5):1815–1834. doi:10.1007/s11227-017-2192-6
23. Nazli M, Cankur O, Ozsoy A. A Parallel Comparison of Several String Matching Algorithms Employing Different Strategies. Proceedings Book. 2019; p. 52.
24. Schultz DW, Xu B. Parallel Methods for Finding k-Mismatch Shortest Unique Substrings Using GPU. IEEE/ACM transactions on computational biology and bioinformatics. 2019.
25. Van Court T, Herbordt MC. Families of FPGA-based algorithms for approximate string matching. In: Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004. IEEE; 2004. p. 354–364.
26. Herbordt MC, Model J, Gu Y, Sukhwani B, VanCourt T. Single pass, BLAST-like, approximate string matching on FPGAs. In: 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. IEEE; 2006. p. 217–226.
27. Mikami S, Kawanaka Y, Wakabayashi S, Nagayama S. Efficient FPGA-based hardware algorithms for approximate string matching. In: ITC-CSCC: International Technical Conference on Circuits Systems, Computers and Communications; 2008. p. 201–204.
28. Kim H, Choi KI. A pipelined non-deterministic finite automaton-based string matching scheme using merged state transitions in an FPGA. PloS one. 2016;11(10):e0163535. doi:10.1371/journal.pone.0163535
29. Cinti A, Bianchi FM, Martino A, Rizzi A. A novel algorithm for online inexact string matching and its FPGA implementation. Cognitive Computation. 2019; p. 1–19.
30. Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet physics doklady. vol. 10; 1966. p. 707–710.
31. Hennessy JL, Patterson DA. Computer architecture: a quantitative approach. Elsevier; 2011.
32. Edit distance; 2020. https://github.com/analog75/ED.
33. english-words; 2020. https://github.com/dwyl/english-words/.
34. Simpson IC, Mousikou P, Montoya JM, Defior S. A letter visual-similarity matrix for Latin-based alphabets. Behavior research methods. 2013;45(2):431–439. doi:10.3758/s13428-012-0271-4
35. TypoDistance; 2020. https://github.com/wsong/Typo-Distance.
36. Chakraborty S, Choudhury PP. A statistical analysis of an algorithm’s complexity. Applied Mathematics Letters. 2000;13(5):121–126. doi:10.1016/S0893-9659(00)00043-4
37. Coefficient of determination; 2020. https://en.wikipedia.org/wiki/Coefficient_of_determination.

PLoS One. doi: 10.1371/journal.pone.0251047.r001

Decision Letter 0

Hans A Kestler

15 Sep 2020

PONE-D-20-11239

A k-Mismatch String Matching for Generalized Edit Distance using Diagonal Skipping Method

PLOS ONE

Dear Dr. Kim,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses all the points raised during the review process.

Please ensure that your decision is justified on PLOS ONE’s publication criteria and not, for example, on novelty or perceived impact.

==============================

Please submit your revised manuscript by Oct 30 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Hans A Kestler

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We suggest you thoroughly copyedit your manuscript for language usage, spelling, and grammar. If you do not know anyone who can help you do this, you may wish to consider employing a professional scientific editing service.

Whilst you may use any professional scientific editing service of your choice, PLOS has partnered with both American Journal Experts (AJE) and Editage to provide discounted services to PLOS authors. Both organizations have experience helping authors meet PLOS guidelines and can provide language editing, translation, manuscript formatting, and figure formatting to ensure your manuscript meets our submission guidelines. To take advantage of our partnership with AJE, visit the AJE website (http://learn.aje.com/plos/) for a 15% discount off AJE services. To take advantage of our partnership with Editage, visit the Editage website (www.editage.com) and enter referral code PLOSEDIT for a 15% discount off Editage services. If the PLOS editorial team finds any language issues in text that either AJE or Editage has edited, the service provider will re-edit the text for free.

Upon resubmission, please provide the following:

The name of the colleague or the details of the professional service that edited your manuscript
A copy of your manuscript showing your changes by either highlighting them or using track changes (uploaded as a *supporting information* file)
A clean copy of the edited manuscript (uploaded as the new *manuscript* file)

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The paper covers an interesting technical topic and provides an algorithm within the scope of the topic.

In its current state, I see some flaws of the paper.

- While most of the paper is written in understandable English, it is clear throughout the paper that the writer is not a native English speaker. I recommend having the paper checked by a native speaker or someone with experience in writing English manuscripts.

- The description of the algorithm is hard to read, mainly because some terms are used before they are explained. Especially the term "arrow" is not defined explicitly. Also, the pseudocode states an implementation step, but the reader has to search for a more detailed description of this initialization.

- In the description of the distance D (beginning in line 75), the difference between x_i and X_i is unclear. Is X_i a string or a character? If it's a string, why is it indexed? If it's a character, why is it written with a capital letter (in contrast to lowercase letters in line 75).

- It appears the algorithm and the datasets used for the analysis are not publicly available.

- The analysis part lacks a statistical analysis. This can lead to a misinterpretation of the data: shorter average running times do not necessarily imply a more efficient algorithm. Also, comparing running times per se can lead to systematic errors in the analysis. E.g. one of the algorithms might have been implemented more efficiently, or the computer hardware might be more beneficial for one of the compared programs (e.g. due to cache structure, compiler optimization etc.).

To sum up, I think that the paper needs some work but has potential.

Reviewer #2: The author scrupulously details the approach for reducing the execution time when calculating the generalized edit distance, considering, for example, similar shape characters or keyboard character positions. Through the accurate experimental analysis and the corresponding figures we can visualize how, by means of the proposed method, the execution time is reduced when complicated distance metrics are used. Furthermore, the author introduces the disadvantage of using the skipping method when simple metrics are employed.

However some essential introductory notions are ambiguous and need to be better explained.

For example, at line 75, according to the used notation, the input strings X and Y appear to have length i and j respectively, but afterwards, starting from line 76, i and j are used as indices. Even more confusion can be created by considering line 87, where the input sequence X and pattern Y are denoted with subscripts i and j respectively.

Besides certain technical details should be expanded and clarified to ensure better understanding. For instance in Equation 1 the functions of substitution, deletion and insertion should be precisely defined.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 May 4;16(5):e0251047. doi: 10.1371/journal.pone.0251047.r002

Author response to Decision Letter 0

1 Dec 2020

Dear Reviewers,

Thank you for offering the opportunity to revise the paper PONE-D-20-11239 titled: "A k-Mismatch String Matching for Generalized Edit Distance using Diagonal Skipping Method," written by HyunJin Kim to be considered for publication as a research article in PLOS ONE.

Considering the reviewers’ comments and concerns, the paper has been carefully revised with respect to language, terminology, conveyed meaning, paper format, and grammar. In addition, the revised paper addresses the comments of the academic editor and all reviewers. There are several significant modifications: firstly, the terminology and the confusing explanation of the edit distance have been revised. Secondly, several terms related to the proposed algorithm have been explained in more detail. Finally, we have changed the measurement of the execution time in the proposed method, and the experimental data obtained show that the proposed k-mismatch string matching reduces the execution time using the diagonal skipping method in all three edit distance metrics. We have added an experimental analysis of the ratio of skipped steps and a statistical analysis using a regression approach.

We believe that this revised draft resolves the reviewers’ concerns. The detailed rebuttal has been uploaded as a separate file.

Attachment

Submitted filename: PONE-D-20-11239-rebuttal(201130).pdf

PLoS One. doi: 10.1371/journal.pone.0251047.r003

Decision Letter 1

Hans A Kestler

11 Feb 2021

PONE-D-20-11239R1

A k-Mismatch String Matching for Generalized Edit Distance using Diagonal Skipping Method

PLOS ONE

Dear Dr. Kim,

Please perform the remaining minor changes raised by the reviewer.

==============================

Please submit your revised manuscript by Mar 28 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Hans A Kestler

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: No

**********

6. Review Comments to the Author

Reviewer #1: In the current version of the paper, my previous comments to the last version have been adequately addressed.

Reviewer #2: Many of the adjustments made improved the readability and informativeness of the paper; nevertheless, a careful review is necessary for a complete understanding of the study.

- A further accurate copy edit for language usage and grammar is mandatory. A frequent mistake is the use of the verb in the third person singular when not required, for example at lines 98, 410, 412, but these may not be the only ones. Other recurring errors are the quoted comma (for example line 257: '1,') and a capital letter after a comma (see line 16). There are also lexical inaccuracies such as close statement repetitions (for instance 343-345 and 349-350) or unclear references (at 346-348 one does not promptly associate "these experiments" with "the diagonal skipping method").

- The distance (as defined in lines 76-77) is the minimum number of operations to convert the input string into the pattern, it follows that the operations are always applied on the input string and never on the pattern. Therefore in expression (1) should be correct D(X_α-1, Y_β) + insertion(y_β) and D(X_α, Y_β-1) + deletion(x_α). This statement is also confirmed by the explanation of the example in Fig 1 (b), lines 112-118. Moreover the example in lines 104-109 needs a better adapted description.

- It is suggested to revisit the notations used to represent strings, substrings, the respective lengths and the indices to identify any character in the string/substring. There are misconceptions when the subscript refers to the length of a string/substring and when it refers to any character of the string/substring. For instance at line 267 the content suggests that D[α][1] wants to indicate the whole second column in the two-dimensional array D, however 'α' defines the length of the input substring. Moreover according to the employed notation D[α][β] (line 269) identifies only the rightmost bottom cell of D, but here it is intended to identify any cell of D.

- Comparing the definition of ‘traversal’ specified in 120-121, its usage in the pseudocode and in lines 229-230, the difference between arrow and traversal is unsettled.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS One. 2021 May 4;16(5):e0251047. doi: 10.1371/journal.pone.0251047.r004

Author response to Decision Letter 1

18 Mar 2021

Dear Reviewer,

Thank you for reviewing the paper PONE-D-20-11239R2 titled "A k-Mismatch String Matching for Generalized Edit Distance using Diagonal Skipping Method."

Considering the reviewers' comments, this paper has been carefully revised with respect to paper format, language, terminology, conveyed meaning, and grammar.

Several typos and missing pieces of information have been corrected. In addition, the revised paper addresses all reviewers' comments. The detailed response is uploaded as a PDF file.

Attachment

Submitted filename: PLOSONE_2021_rebuttal(20210317).pdf

PLoS One. doi: 10.1371/journal.pone.0251047.r005

Decision Letter 2

Hans A Kestler

20 Apr 2021

A k-Mismatch String Matching for Generalized Edit Distance using Diagonal Skipping Method

PONE-D-20-11239R2

Dear Dr. Kim,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Hans A Kestler

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

PLoS One. doi: 10.1371/journal.pone.0251047.r006

Acceptance letter

Hans A Kestler

23 Apr 2021

PONE-D-20-11239R2

A k-Mismatch String Matching for Generalized Edit Distance using Diagonal Skipping Method

Dear Dr. Kim:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Hans A Kestler

Academic Editor

PLOS ONE

Associated Data

Supplementary Materials

Attachment

Submitted filename: PONE-D-20-11239-rebuttal(201130).pdf

Additional data file (889.8 KB, PDF).

Data Availability Statement

All code files are available from the GitHub database (https://github.com/analog75/ED).

1. Navarro G. A Guided Tour to Approximate String Matching. ACM computing surveys (CSUR). 2001;33(1):31–88. doi:10.1145/375360.375365

2. Mateless R, Segal M. Approximate String Matching for DNS Anomaly Detection. In: International Conference on Security, Privacy and Anonymity in Computation, Communication and Storage. Springer; 2019. p. 490–504.

3. Hakak SI, Kamsin A, Shivakumara P, Gilkar GA, Khan WZ, Imran M. Exact String Matching Algorithms: Survey, Issues, and Future Research Directions. IEEE Access. 2019;7:69614–69637. doi:10.1109/ACCESS.2019.2914071

4. Ryu C, Lecroq T, Park K. Fast string matching for DNA sequences. Theoretical Computer Science. 2020;812:137–148. doi:10.1016/j.tcs.2019.09.031

5. Al-Ssulami AM, Mathkour HI, Arafah MA. Efficient String Matching Algorithm for Searching Large DNA and Binary Texts. In: Data Analytics in Medicine: Concepts, Methodologies, Tools, and Applications. IGI Global; 2020. p. 298–324.

6. Kim T, Li W, Behm A, Cetindil I, Vernica R, Borkar V, et al. Similarity query support in big data management systems. Information Systems. 2020;88:101455. doi:10.1016/j.is.2019.101455

7. Guo L, Du S, Ren M, Liu Y, Li J, He J, et al. Parallel Algorithm for Approximate String Matching with K-Differences. In: Networking, Architecture and Storage (NAS), 2013 IEEE Eighth International Conference on. IEEE; 2013. p. 257–261.

8. Hamming RW. Error detecting and error correcting codes. The Bell system technical journal. 1950;29(2):147–160. doi:10.1002/j.1538-7305.1950.tb00463.x

9. Kashyap RL, Oommen BJ. An effective algorithm for string correction using generalized edit distances-I. Description of the algorithm and its optimality. Information Sciences. 1981;23(2):123–142. doi:10.1016/0020-0255(81)90052-9

10. Kashyap RL, Oommen BJ. An effective algorithm for string correction using generalized edit distance-II. Computational complexity of the algorithm and some applications. Information Sciences. 1981;23(3):201–217. doi:10.1016/0020-0255(81)90056-6

11. Marzal A, Vidal E. Computation of normalized edit distance and applications. IEEE transactions on pattern analysis and machine intelligence. 1993;15(9):926–932. doi:10.1109/34.232078

12. Yujian L, Bo L. A normalized Levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence. 2007;29(6):1091–1095. doi:10.1109/TPAMI.2007.1078

13. Wagner RA, Fischer MJ. The string-to-string correction problem. Journal of the ACM (JACM). 1974;21(1):168–173. doi:10.1145/321796.321811

14. Allison L. Lazy dynamic-programming can be eager. Information Processing Letters. 1992;43(4):207–212. doi:10.1016/0020-0190(92)90202-7

15. Gusfield D. Algorithms on strings, trees, and sequences: Computer science and computational biology. Acm Sigact News. 1997;28(4):41–60. doi:10.1145/270563.571472

16. Xu K, Cui W, Hu Y, Guo L. Bit-parallel multiple approximate string matching based on GPU. Procedia Computer Science. 2013;17:523–529.

17. Lin CH, Wang GH, Huang CC. Hierarchical parallelism of bit-parallel algorithm for approximate string matching on GPUs. In: Computer Applications and Communications (SCAC), 2014 IEEE Symposium on. IEEE; 2014. p. 76–81.

18. Nunes LS, Bordim JL, Nakano K, Ito Y. A fast approximate string matching algorithm on GPU. In: Computing and Networking (CANDAR), 2015 Third International Symposium on. IEEE; 2015. p. 188–192.

19. Nunes LS, Bordim J, Nakano K, Ito Y. A Memory-Access-Efficient Implementation of the Approximate String Matching Algorithm on GPU. In: Computing and Networking (CANDAR), 2016 Fourth International Symposium on. IEEE; 2016. p. 483–489.

20. Tran TT, Liu Y, Schmidt B. Bit-parallel approximate pattern matching: Kepler GPU versus Xeon Phi. Parallel Computing. 2016;54:128–138. doi:10.1016/j.parco.2015.11.001

21. Ho T, Oh SR, Kim H. A parallel approximate string matching under Levenshtein distance on graphics processing units using warp-shuffle operations. PloS one. 2017;12(10). doi:10.1371/journal.pone.0186251

22. Ho T, Oh SR, Kim H. New algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance. The Journal of Supercomputing. 2018;74(5):1815–1834. doi:10.1007/s11227-017-2192-6

23. Nazli M, Cankur O, Ozsoy A. A Parallel Comparison of Several String Matching Algorithms Employing Different Strategies. Proceedings Book. 2019; p. 52.

24. Schultz DW, Xu B. Parallel Methods for Finding k-Mismatch Shortest Unique Substrings Using GPU. IEEE/ACM transactions on computational biology and bioinformatics. 2019.

25. Van Court T, Herbordt MC. Families of FPGA-based algorithms for approximate string matching. In: Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004. IEEE; 2004. p. 354–364.

26. Herbordt MC, Model J, Gu Y, Sukhwani B, VanCourt T. Single pass, BLAST-like, approximate string matching on FPGAs. In: 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. IEEE; 2006. p. 217–226.

27. Mikami S, Kawanaka Y, Wakabayashi S, Nagayama S. Efficient FPGA-based hardware algorithms for approximate string matching. In: ITC-CSCC: International Technical Conference on Circuits Systems, Computers and Communications; 2008. p. 201–204.

28. Kim H, Choi KI. A pipelined non-deterministic finite automaton-based string matching scheme using merged state transitions in an FPGA. PloS one. 2016;11(10):e0163535. doi:10.1371/journal.pone.0163535

29. Cinti A, Bianchi FM, Martino A, Rizzi A. A novel algorithm for online inexact string matching and its FPGA implementation. Cognitive Computation. 2019; p. 1–19.

30. Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet physics doklady. vol. 10; 1966. p. 707–710.

31. Hennessy JL, Patterson DA. Computer architecture: a quantitative approach. Elsevier; 2011.

32. Edit distance; 2020. https://github.com/analog75/ED.

33. english-words; 2020. https://github.com/dwyl/english-words/.

34. Simpson IC, Mousikou P, Montoya JM, Defior S. A letter visual-similarity matrix for Latin-based alphabets. Behavior research methods. 2013;45(2):431–439. doi:10.3758/s13428-012-0271-4

35. TypoDistance; 2020. https://github.com/wsong/Typo-Distance.

36. Chakraborty S, Choudhury PP. A statistical analysis of an algorithm’s complexity. Applied Mathematics Letters. 2000;13(5):121–126. doi:10.1016/S0893-9659(00)00043-4

37. Coefficient of determination; 2020. https://en.wikipedia.org/wiki/Coefficient_of_determination.
