Abstract
Patterns within strings enable us to extract vital information regarding a string’s randomness. Whether a string is random (showing little to no repetition in patterns) or periodic (showing repetitions in patterns) is described by a value called the kth Subword Complexity of the character string. By definition, the kth Subword Complexity is the number of distinct substrings of length k that appear in a given string. In this paper, we evaluate the expected value and the second factorial moment (followed by a corollary on the second moment) of the kth Subword Complexity for binary strings over memoryless sources. We first take a combinatorial approach to derive a probability generating function for the number of occurrences of patterns in strings of finite length. This gives an exact expression for the two moments in terms of the patterns’ autocorrelation and correlation polynomials. We then investigate the asymptotic behavior for values of . In the proof, we compare the distribution of the kth Subword Complexity of binary strings to the distribution of distinct prefixes of independent strings stored in a trie. The methodology involves complex analysis, analytic poissonization and depoissonization, the Mellin transform, and saddle point analysis.
Keywords: subword complexity, asymptotics, generating functions, saddle point method, probability, the Mellin transform, moments
1. Introduction
Analyzing and understanding occurrences of patterns in a character string is helpful for extracting useful information regarding the nature of the string. We classify strings into low-complexity and high-complexity, according to their level of randomness. For instance, take the binary string , which is constructed by repetitions of the pattern . This string is periodic, and therefore has low randomness. Such periodic strings are classified as low-complexity strings, whereas strings that do not show periodicity are considered to have high complexity. An effective way of measuring a string’s randomness is to count all distinct patterns that appear as contiguous subwords in the string. This value is called the Subword Complexity. The name was given by Ehrenfeucht, Lee, and Rozenberg [1], and the notion was initially introduced by Morse and Hedlund in 1938 [2]. The higher the Subword Complexity, the more complex the string is considered to be.
Assessing information about the distribution of the Subword Complexity enables us to better characterize strings, and to determine atypically random or periodic strings whose complexities are far from the average complexity [3]. This type of string classification has applications in fields such as data compression [4], genome analysis (see [5,6,7,8,9]), and plagiarism detection [10]. For example, in data compression, a data set is considered compressible if it has low complexity, as it consists of repeated subwords. In computational genomics, Subword Complexity (known there as k-mer counting) is used in the detection of repeated sequences and in DNA barcoding [11,12]. k-mers are composed of A, T, G, and C nucleotides. For instance, the number of 7-mers of the DNA sequence GTAGAGCTGT is four, meaning that there are 4 distinct substrings of length 7 in the given sequence. Counting k-mers becomes challenging for longer DNA sequences. Our results can be easily extended to the four-letter alphabet and directly applied in the theoretical analysis of genomic k-mer distributions under the Bernoulli probabilistic model, particularly when the length n of the sequence approaches infinity.
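The k-mer count in the example above can be checked directly. The following short sketch (our own illustration, not code from the paper) counts distinct length-k substrings of a sequence:

```python
def count_kmers(seq: str, k: int) -> int:
    # distinct substrings of length k (the k-mers of the sequence)
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

print(count_kmers("GTAGAGCTGT", 7))  # 4, as in the example above
```

For the sequence GTAGAGCTGT there are four starting positions for a length-7 window, and the four resulting substrings happen to be distinct.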
There are two variations for the definition of the Subword Complexity: the one that counts all distinct subwords of a given string (also known as Complexity Index and Sequence Complexity [13]), and the one that only counts the subwords of the same length, say k, that appear in the string. In our work, we analyze the latter, and we call it the kth Subword Complexity to avoid any confusion.
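The distinction between the two variants can be made concrete with a small sketch; the string and function names below are our own illustrative choices:

```python
def kth_subword_complexity(s: str, k: int) -> int:
    # the variant analyzed in this paper: distinct subwords of one fixed length k
    return len({s[i:i + k] for i in range(len(s) - k + 1)})

def complexity_index(s: str) -> int:
    # the other variant: all distinct nonempty subwords, over every length
    return sum(kth_subword_complexity(s, k) for k in range(1, len(s) + 1))

s = "0110100110"
print(kth_subword_complexity(s, 3))
print(complexity_index(s))
```

For the string above, the kth Subword Complexity with k = 3 counts six distinct subwords (011, 110, 101, 010, 100, 001), while the Complexity Index sums such counts over all lengths.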
Throughout this work, we consider the kth Subword Complexity of a random binary string of length n over a memoryless source, and we denote it by . We analyze the first and second factorial moments for the range , as . More precisely, we divide the analysis into three ranges as follows.
- ,
- , and
- .
Our approach involves two major steps. First, we choose a suitable model for the asymptotic analysis, and afterwards we provide proofs for the derivation of the asymptotic expansion of the first two factorial moments.
1.1. Part I
This part of the analysis is inspired by the earlier work of Jacquet and Szpankowski [14] on the analysis of suffix trees by comparing them to independent tries. A trie, first introduced by René de la Briandais in 1959 (see [15]), is a search tree that stores n strings according to their prefixes. A suffix tree, introduced by Weiner in 1973 (see [16]), is a trie where the strings are suffixes of a given string. Examples of these data structures are given in Figure 1.
Figure 1.
The suffix tree in (a) is built over the first four suffixes of string , and the trie in (b) is built over strings , , , and .
A direct asymptotic analysis of the moments is a difficult task, as patterns in a string are not independent from each other. However, we note that each pattern in a string can be regarded as a prefix of a suffix of the string. Therefore, the number of distinct patterns of length k in a string is actually the number of nodes of the suffix tree at level k and lower. It is shown by I. Gheorghiciuc and M. D. Ward [17] that the expected value of the kth Subword Complexity of a Bernoulli string of length n is asymptotically comparable to the expected value of the number of nodes at level k of a trie built over n independent strings generated by a memoryless source.
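The key observation in this paragraph — that every length-k subword is the length-k prefix of some suffix — can be verified mechanically. The sketch below (our own illustration) checks the identity for one sample string and every k:

```python
def distinct_k_subwords(s: str, k: int) -> set:
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def k_prefixes_of_suffixes(s: str, k: int) -> set:
    # length-k prefixes of all suffixes long enough to have one
    suffixes = [s[i:] for i in range(len(s))]
    return {suf[:k] for suf in suffixes if len(suf) >= k}

s = "10010110"
for k in range(1, len(s) + 1):
    assert distinct_k_subwords(s, k) == k_prefixes_of_suffixes(s, k)
print("sets agree for every k")
```

This is exactly why the set of distinct length-k subwords corresponds to the depth-k nodes of the suffix tree.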
We extend this analysis to the desired range for k, and we prove that the result holds when k grows logarithmically with n. Additionally, we show that, asymptotically, the second factorial moment of the kth Subword Complexity can also be estimated by the same independent model generated by a memoryless source. The proof of this theorem relies heavily on the characterization of the overlaps of the patterns with themselves and with one another. Autocorrelation and correlation polynomials explicitly describe these overlaps. The analytic properties of these polynomials are key to understanding repetitions of patterns in large Bernoulli strings. This, in conjunction with Cauchy’s integral formula (used to compare the generating functions in the two models) and the residue theorem, provides solid verification that the second factorial moment of the Subword Complexity behaves the same as in the independent model.
To make this comparison, we derive the generating functions of the first two factorial moments in both settings. In a paper published by F. Bassino, J. Clément, and P. Nicodème in 2012 [18], the authors provide a multivariate probability generating function for the number of occurrences of patterns in a finite Bernoulli string. That is, given a pattern w, the coefficient of the term in is the probability in the Bernoulli model that a random string of size n has exactly m occurrences of the pattern w. Following their technique, we derive the exact expression for the generating functions of the first two factorial moments of the kth Subword Complexity. In the independent model, the generating functions are obtained by basic probability concepts.
1.2. Part II
This part of the proof is analogous to the analysis of the profile of tries [19]. To capture the asymptotic behavior, the expressions for the first two factorial moments in the independent trie are further analyzed by means of a Poisson process. The poissonized version yields generating functions in the form of harmonic sums for each of the moments. The Mellin transform and the inverse Mellin transform of these harmonic sums establish a connection between the asymptotic expansion and the singularities of the transformed function. This methodology suffices when the length k of the patterns is fixed. Allowing k to grow with n, however, makes the analysis more challenging, because for large k the dominant term of the poissonized generating function may come from the term involving k, and the singularities may not be significant compared to the growth of k. This issue is treated by combining singularity analysis with a saddle point method [20]. The outcome of the analysis is a precise first-order asymptotics of the moments in the poissonized model. Depoissonization theorems are then applied to obtain the desired result in the Bernoulli model.
2. Results
For a binary string , where ’s () are independent and identically distributed random variables, we assume that , , and . We define the kth Subword Complexity, , to be the number of distinct substrings of length k that appear in a random string X with the above assumptions. In this work, we obtain the first order asymptotics for the average and the second factorial moment of . The analysis is done in the range . We rewrite this range as , and by performing a saddle point analysis, we will show that
| (1) |
In the first step, we compare the kth Subword Complexity to an independent model constructed in the following way: we store a set of n strings, independently generated by a memoryless source, in a trie. This means that each string is a sequence of independent and identically distributed Bernoulli random variables from the binary alphabet , with , . We denote the number of distinct prefixes of length k in the trie by , and we call it the kth Prefix Complexity. Before proceeding any further, we recall the definition of the factorial moments of a random variable.
Definition 1.
The jth factorial moment of a random variable X is defined as
(2) where j = 1, 2, …. We will show that the first and second factorial moments of are asymptotically comparable to those of , when . We have the following theorems.
Theorem 1.
For large values of n, and for , there exists such that
We also prove a similar result for the second factorial moments of the kth Subword Complexity and the kth Prefix Complexity:
Theorem 2.
For large values of n, and for , there exists such that
In the second part of our analysis, we derive the first-order asymptotics of the kth Prefix Complexity. The methodology used here is analogous to the analysis of the profile of tries [19]. The rate of the asymptotic growth depends on the location of the value a, as seen in (1). For instance, for the average kth Subword Complexity, , we have the following observations.
- i. For the range , the growth rate is of order ,
- ii. in the range , we observe some oscillations with n, and
- iii. in the range , the average has a linear growth .
The above observations will be discussed in depth in the proofs of the following theorems.
Theorem 3.
The average of the kth Prefix Complexity has the following asymptotic expansion
- i.
For ,
(3) where , and
is a bounded periodic function.
- ii.
For ,
- iii.
For
for some .
Theorem 4.
The second factorial moment of the kth Prefix Complexity has the following asymptotic expansion.
- i.
For ,
- ii.
For ,
- iii.
For ,
The periodic function in Theorems 3 and 4 is shown in Figure 2.
Figure 2.
Left: at , and various levels of . The amplitude increases as increases. Right: at , and various levels of p. The amplitude tends to zero as .
The results in Theorem 4 carry over to the second moment of the kth Subword Complexity, as the analysis can easily be extended from the second factorial moment to the second moment. The variance of the kth Prefix Complexity, however, as seen in Figure 3, does not show the same asymptotic behavior as the variance of the kth Subword Complexity.
Figure 3.
Approximated second moments (left), and variances (right) of the kth Subword Complexity (red), and the kth Prefix Complexity (blue), for , at different probability levels, averaged over 10,000 iterations.
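A simulation of the kind summarized in Figure 3 can be sketched as follows. The sample sizes, seed, and function names here are our own illustrative choices, not the parameters used to produce the figure:

```python
import random

def subword_complexity(s, k):
    # distinct length-k substrings of a single binary string
    return len({s[i:i + k] for i in range(len(s) - k + 1)})

def prefix_complexity(strings, k):
    # distinct length-k prefixes among n independent strings (the trie model)
    return len({t[:k] for t in strings})

def rand_bits(m, p, rng):
    return "".join("1" if rng.random() < p else "0" for _ in range(m))

def moments(samples):
    n = len(samples)
    mean = sum(samples) / n
    second_fact = sum(x * (x - 1) for x in samples) / n  # second factorial moment
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, second_fact, var

rng = random.Random(7)
n, k, p, trials = 200, 5, 0.5, 500
sub = [subword_complexity(rand_bits(n, p, rng), k) for _ in range(trials)]
pre = [prefix_complexity([rand_bits(k, p, rng) for _ in range(n)], k)
       for _ in range(trials)]
print("subword model:", moments(sub))
print("prefix model: ", moments(pre))
```

For these parameters both empirical means sit close to the maximum possible value 2^k = 32, consistent with the two models agreeing in their first two moments while their variances may differ.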
3. Proofs and Methods
3.1. Groundwork
We first introduce some terminology and lemmas regarding overlaps of patterns and their numbers of occurrences in texts. Some of the notation we use in this work is borrowed from [18] and [21].
Definition 2.
For a binary word of length k, the autocorrelation set of the word w is defined in the following way.
(4) The autocorrelation index set is
(5) and the autocorrelation polynomial is
(6)
Definition 3.
For the distinct binary words and , the correlation set of the words w and is
(7) The correlation index set is
(8) The correlation polynomial is
(9)
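The overlap sets in Definitions 2 and 3 are easy to compute mechanically. The sketch below uses one common convention (as in analytic pattern matching): a self-overlap at shift d contributes the probability of the length-d tail times z^d, and a cross-overlap requires a proper suffix of w to match a prefix of w′. The paper’s exact normalization may differ; this is only an assumed illustration, with polynomials represented as {degree: coefficient} dictionaries:

```python
def word_prob(w, p):
    # probability of the word under a memoryless source: p for '1', 1-p for '0'
    out = 1.0
    for c in w:
        out *= p if c == "1" else 1 - p
    return out

def autocorrelation_poly(w, p):
    # shifts d where w overlaps itself: w[d:] equals the prefix of the same length
    m = len(w)
    return {d: word_prob(w[m - d:], p)
            for d in range(m) if w[d:] == w[:m - d]}

def correlation_poly(w, wp, p):
    # proper shifts d >= 1 where a suffix of w matches a prefix of wp
    m = len(w)
    return {d: word_prob(wp[m - d:], p)
            for d in range(1, m) if w[d:] == wp[:m - d]}

print(autocorrelation_poly("111", 0.5))   # {0: 1.0, 1: 0.5, 2: 0.25}
print(correlation_poly("110", "101", 0.5))
```

For w = 111 with p = 1/2 every shift is a self-overlap, giving the polynomial 1 + z/2 + z²/4, while the highly random word 110 correlates with 101 only at shift 1.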
The following two lemmas present the probability generating functions for the number of occurrences of a single pattern and of a pair of distinct patterns, respectively, in a random text of length n. For a detailed derivation of such generating functions, refer to [18].
Lemma 1.
The occurrence probability generating function for a single pattern w in a binary text over a memoryless source is given by , where
(10) The coefficient is the probability that a random binary string of length n has m occurrences of the pattern w.
Lemma 2.
The occurrence PGF for two distinct patterns of length k in a Bernoulli random text is given by , where
(11) and
The coefficient is the probability that there are occurrences of w and occurrences of in a random string of length n.
The above results will be used to find the generating functions for the first two factorial moments of the kth Subword Complexity in the following section.
3.2. Derivation of Generating Functions
Lemma 3.
For generating functions and , we have
- i.
where , and
(12) - ii.
where
(13)
(14)
Proof.
We define
This yields
(15) We observe that . By defining and from (10), we obtain
(16) Having the above function, we derive the following result.
(17) For this part, we first note that
(18) Due to properties of indicator random variables, we observe that the expected value of the second factorial moment has only one term:
(19) We proceed by defining a second indicator variable as following.
This gives
Finally, we are able to express in the following
(20) where and . By (11) we have
(21) Having the above expression, we finally obtain
(22) □
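The linearity-of-expectation step underlying the proof above — the expected number of distinct length-k subwords equals the sum, over all 2^k words w, of the probability that w occurs — can be verified by brute force for tiny n. This sketch (our own check, not the paper’s method) enumerates all 2^n binary strings:

```python
from itertools import product

def prob(s, p):
    out = 1.0
    for c in s:
        out *= p if c == "1" else 1 - p
    return out

def exact_mean_subword_complexity(n, k, p):
    # E of the kth Subword Complexity by exhaustive enumeration of all 2^n strings
    total = 0.0
    for bits in product("01", repeat=n):
        s = "".join(bits)
        total += prob(s, p) * len({s[i:i + k] for i in range(n - k + 1)})
    return total

def mean_via_linearity(n, k, p):
    # sum over all 2^k words of P(w occurs at least once in the string)
    total = 0.0
    for wbits in product("01", repeat=k):
        w = "".join(wbits)
        total += sum(prob("".join(sbits), p)
                     for sbits in product("01", repeat=n) if w in "".join(sbits))
    return total

n, k, p = 8, 3, 0.3
a, b = exact_mean_subword_complexity(n, k, p), mean_via_linearity(n, k, p)
print(abs(a - b) < 1e-9)  # True
```

The two computations agree to floating-point accuracy, since the number of distinct subwords is exactly the sum of the indicator variables used in the proof.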
In the following lemma, we present the generating functions for the first two factorial moments for the kth Prefix Complexity in the independent model.
Lemma 4.
For and , which are the generating functions for and respectively, we have
- i.
(23) - ii.
(24)
Proof.
We define the indicator variable as follows.
For each , we have
(25) Summing over all words w of length k determines the generating function :
(26) Similar to in (18) and (20), we obtain
(27) Subsequently, we obtain the generating function below.
(28) □
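In the independent model, the expected kth Prefix Complexity is a sum over all 2^k words of the probability that the word appears as a prefix of at least one of the n strings, i.e. 1 − (1 − P(w))^n. The tiny brute-force check below (our own illustration) confirms this identity by enumerating every joint prefix outcome:

```python
from itertools import product

def prob(w, p):
    out = 1.0
    for c in w:
        out *= p if c == "1" else 1 - p
    return out

def mean_prefix_complexity_formula(n, k, p):
    # E[Y] = sum over the 2^k words of (1 - (1 - P(w))^n), by linearity
    words = ["".join(b) for b in product("01", repeat=k)]
    return sum(1 - (1 - prob(w, p)) ** n for w in words)

def mean_prefix_complexity_exhaustive(n, k, p):
    # enumerate all (2^k)^n joint length-k prefixes of the n independent strings
    words = ["".join(b) for b in product("01", repeat=k)]
    total = 0.0
    for combo in product(words, repeat=n):
        weight = 1.0
        for w in combo:
            weight *= prob(w, p)
        total += weight * len(set(combo))
    return total

n, k, p = 3, 2, 0.4
print(abs(mean_prefix_complexity_formula(n, k, p)
          - mean_prefix_complexity_exhaustive(n, k, p)) < 1e-9)  # True
```

The agreement reflects exactly the independence structure that makes the trie model tractable.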
Our first goal is to compare the coefficients of the generating functions in the two models. The coefficients are expected to be asymptotically equivalent in the desired range for k. To compare the coefficients, we need more information on the analytic properties of these generating functions. This will be discussed in Section 3.3.
3.3. Analytic Properties of the Generating Functions
Here, we turn our attention to the smallest singularities of the two generating functions given in Lemma 3. It has been shown by Jacquet and Szpankowski [21] that has exactly one root in the disk . Following the notation in [21], we denote the root within the disk of by , and by bootstrapping we obtain
| (29) |
We also denote the derivative of at the root , by , and we obtain
| (30) |
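The root discussed above can be located numerically for a concrete word. The sketch below assumes the standard form of the denominator polynomial from [21], D_w(z) = (1 − z)S_w(z) + P(w)z^k; the paper’s exact normalization may differ, so this is only an illustrative check using bisection on the real axis:

```python
def word_prob(w, p):
    out = 1.0
    for c in w:
        out *= p if c == "1" else 1 - p
    return out

def autocorrelation(w, p, z):
    # S_w(z), using one standard convention for the self-overlap terms
    m = len(w)
    return sum(word_prob(w[m - d:], p) * z ** d
               for d in range(m) if w[d:] == w[:m - d])

def D(w, p, z):
    # assumed denominator polynomial: (1 - z) S_w(z) + P(w) z^k
    return (1 - z) * autocorrelation(w, p, z) + word_prob(w, p) * z ** len(w)

def root_near_one(w, p, lo=1.0, hi=4.0, tol=1e-12):
    # D_w(1) = P(w) > 0, and D_w goes negative for larger real z: bisect
    assert D(w, p, lo) > 0 and D(w, p, hi) < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if D(w, p, mid) > 0 else (lo, mid)
    return (lo + hi) / 2

rho = root_near_one("111", 0.5)
print(rho, D("111", 0.5, rho))
```

For w = 111 and p = 1/2 this finds a real root slightly above 1, consistent with the bootstrapping expansion where the root approaches 1 as k grows.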
In this paper, we will prove a similar result for the polynomial through the following work.
Lemma 5.
If w and are two distinct binary words of length k and , there exists , such that and
(31)
Proof.
If the minimal degree of is greater than , then
(32) for . For a fixed , we have
(33) This leads to the following
(34) □
Lemma 6.
There exist , and such that , and such that, for every pair of distinct words w and of length , and for , we have
(35) In other words, does not have any roots in .
Proof.
There are three cases to consider:
Case When either or , then every term of has degree k or larger, and therefore
(36) There exists , such that for , we have . This yields
(37) Case If the minimal degree for or is greater than , then every term of has degree at least . We also note that, by Lemma 9, . Therefore, there exists , such that
(38) Case The only remaining case is where the minimal degree for and are both less than or equal to . If , then , where u is a word of length . Then we have
(39) There exists , such that
(40) Similarly, we can show that there exists , such that . Therefore, for we have
(41) We complete the proof by setting . □
Lemma 7.
There exist and such that , and for every word w and of length , the polynomial
(42) has exactly one root in the disk .
Proof.
First note that
(43) This yields
(44) There exist , large enough, such that, for , we have
and for ,
If we define , then we have, for ,
(45) By Rouché’s theorem, as has only one root in , also has exactly one root in . □
We denote the root within the disk of by , and by bootstrapping we obtain
| (46) |
We also denote the derivative of at the root , by , and we obtain
| (47) |
We will refer to these expressions in the residue analysis that we present in the next section.
3.4. Asymptotic Difference
We begin this section by the following lemmas on the autocorrelation polynomials.
Lemma 8
(Jacquet and Szpankowski, 1994). For most words w, the autocorrelation polynomial is very close to 1, with high probability. More precisely, if w is a binary word of length k and , there exists , such that and
(48) where . We use the Iverson notation.
Lemma 9
(Jacquet and Szpankowski, 1994). There exist and , such that , and for every binary word w with length and , we have
(49) In other words, does not have any roots in .
Lemma 10.
With high probability, for most distinct pairs , the correlation polynomial is very close to 0. More precisely, if w and are two distinct binary words of length k and , there exists , such that and
(50)
We will use the above results to prove that the expected values in the Bernoulli model and the model built over a trie are asymptotically equivalent. We now prove Theorem 1 below.
Proof of Theorem 1.
From Lemmas 3 and 4, we have
and
Subtracting the two generating functions, we obtain
(51) We define
(52) Therefore, by Cauchy’s integral formula (see [20]), we have
(53) where the path of integration is a circle about zero with counterclockwise orientation. We note that the above integrand has poles at , , and (refer to expression (29)). Therefore, we define
(54) where the circle of radius contains all of the above poles. By the residue theorem, we have
(55) We observe that
Then we obtain
(56) and finally, we have
(57) First, we show that, for sufficiently large n, the sum approaches zero. □
Lemma 11.
For large enough n, and for , there exists such that
(58)
Proof.
We let
(59) The Mellin transform of the above function is
(60) We define
(61) which is negative and uniformly bounded for all w. Also, for a fixed s, we have
(62)
(63) and therefore, we obtain
(64) From this expression, and noticing that the function has a removable singularity at , we can see that the Mellin transform exists on the strip where . We still need to investigate the Mellin strip for the sum . In other words, we need to examine whether summing over all words of length k (where k grows with n) has any effect on the analyticity of the function. We observe that
Lemma 8 allows us to split the above sum between the words for which and words that have .
Such a split yields the following
(65) This shows that is bounded above for and, therefore, it is analytic. This argument holds for as well, as would still be bounded above by a constant that depends on s and k.
We would like to approximate when . By the inverse Mellin transform, we have
(66) We choose for a fixed . Then by the direct mapping theorem [22], we obtain
(67) and subsequently, we get
(68) □
We next prove the asymptotic smallness of in (54).
Lemma 12.
Let
(69) For large n and , we have
(70)
Proof.
We observe that
(71) For , we show that the denominator in (71) is bounded away from zero.
(72) To find a lower bound for , we can choose large enough such that
(73) We now move on to finding an upper bound for the numerator in (71), for .
(74) Therefore, there exists a constant such that
(75) Summing over all patterns w, and applying Lemma 8, we obtain
(76) which approaches zero as and . This completes the proof of Theorem 1. □
Similar to Theorem 1, we provide a proof showing that the second factorial moments of the kth Subword Complexity and the kth Prefix Complexity have the same first-order asymptotic behavior. We are now ready to state the proof of Theorem 2.
Proof of Theorem 2.
As discussed in Lemmas 3 and 4, the generating functions representing and respectively, are
and
Note that
(77)
(78)
(79) In Theorem 1, we proved that for every (which does not depend on n or k), we have
Therefore, both (77) and (78) are of order for . Thus, to show the asymptotic smallness, it is enough to choose , where is a small positive value. Now, it only remains to show (79) is asymptotically negligible as well. We define
(80) Next, we extract the coefficient of
(81) where the path of integration is a circle about the origin with counterclockwise orientation. We define
(82) The above integrand has poles at , (as in (46)), and . We have chosen such that the poles are all inside the circle . It follows that
(83) and the residues give us the following.
and
where is as in (47). Therefore, we get
(84) We now show that the above two terms are asymptotically small. □
Lemma 13.
There exists where the sum
is of order O().
Proof.
We define
The Mellin transform of the above function is
(85) where . We note that is negative and uniformly bounded from above for all . For a fixed s, we also have,
(86) and
(87) Therefore, we have
(88) To find the Mellin strip for the sum , we first note that
Since , we have
(89) and
(90) Therefore, we get
(91)
(92)
(93)
(94) By Lemma 10, with high probability, a randomly selected w has the property , and thus
With that and by Lemma 8, for most words w,
Therefore, both sums (91) and (93) are of the form . The sums (92) and (94) are also of order by Lemma 10. Combining all these terms we will obtain
(95) By the inverse Mellin transform, for , and , we have
(96) □
In the following lemma, we show that the first term in (84) is asymptotically small.
Lemma 14.
Recall that
We have
(97)
Proof.
First note that
(98) We saw in (73) that , and therefore, it follows that
(99) For , is also bounded below as follows:
(100) which is bounded away from zero by the assumption of Lemma 7. Additionally, we show that the numerator in (98) is bounded above, as follows
(101) This yields
(102) By (75), the first term above is of order and by Lemma 10 and an analysis similar to (75), the second term yields as well. Finally, we have
which goes to zero asymptotically for . □
This lemma completes our proof of Theorem 2.
3.5. Asymptotic Analysis of the kth Prefix Complexity
We finally proceed to analyzing the asymptotic moments of the kth Prefix Complexity. The results obtained hold true for the moments of the kth Subword Complexity. Our methodology involves poissonization, saddle point analysis (the complex version of Laplace’s method [23]), and depoissonization.
Lemma 15
(Jacquet and Szpankowski, 1998). Let be the Poisson transform of a sequence . If is analytic in a linear cone with , and if the following two conditions hold:
(I) For and real values B, , ν
(103) where is such that, for fixed t, ;
(II) For and
(104) Then, for every non-negative integer n, we have
On the Expected Value: To transform the sequence of interest, , into a Poisson model, we recall that in (25) we found
Thus, the Poisson transform is
| (105) |
To asymptotically evaluate this harmonic sum, we turn our attention to the Mellin Transform once more. The Mellin transform of is
| (106) |
which has the fundamental strip . For , the inverse Mellin integral is the following
| (107) |
where we define for . We emphasize that the above integral involves k, and k grows with n. We evaluate the integral through the saddle point analysis. Therefore, we choose the line of integration to cross the saddle point . To find the saddle point , we let , and we obtain
| (108) |
and therefore,
| (109) |
where .
By (108) and the fact that for and , we can see that there are actually infinitely many saddle points of the form on the line of integration.
We remark that the location of depends on the value of a. We have as , and as . We divide the analysis into three parts, for the three ranges , , and .
In the first range, which corresponds to
| (110) |
we perform a residue analysis, taking into account the dominant pole at . In the second range, we have
| (111) |
and we get the asymptotic result through the saddle point method. The last range corresponds to
| (112) |
and we approach it with a combination of residue analysis at , and the saddle point method. We now proceed by stating the proof of Theorem 3.
Proof of Theorem 3.
We begin with proving part which requires a saddle point analysis. We rewrite the inverse Mellin transform with integration line at as
(113) Step one: Saddle points’ contribution to the integral estimate
First, we show that those saddle points with do not have a significant asymptotic contribution to the integral. To show this, we let
(114) Since as , we observe that
(115) which is very small for large n. Note that for , is decreasing, and bounded above by .
Step two: Partitioning the integral
There are now only finitely many saddle points to work with. We split the integration range into sub-intervals, each of which contains exactly one saddle point. This way, each integral has a contour traversing a single saddle point, and we will be able to estimate the dominant contribution in each integral from a small neighborhood around the saddle point. Assuming that is the largest j for which , we split the integral as follows
(116) By the same argument as in (115), the second term in (116) is also asymptotically negligible. Therefore, we are only left with
(117) where .
Step three: Splitting the saddle contour
For each integral , we write the expansion of about , as follows
(118) The main contribution to the integral estimate should come from a small integration path on which the integrand reduces to its quadratic expansion about . In other words, we want the integration path to be such that
(119) The above conditions are true when and . Thus, we choose the integration path to be . Therefore, we have
(120) Saddle Tails Pruning.
We show that the integral is small for . We define
(121) Note that for , we have
(122) where . Thus,
(123) Central Approximation.
Over the main path, the integrals are of the form
We have
(124) and
(125) Therefore, by Laplace’s theorem (refer to [22]) we obtain
(126) We finally sum over all j, and we get
(127) We can rewrite as
(128) where , and
(129) For part , we move the line of integration to . Note that in this range, we must consider the contribution of the pole at . We have
(130) Computing the residue at , and following the same analysis as in part i for the above integral, we arrive at
(131) For part of Theorem 3, we shift the line of integration to , then we have
(132) where .
Step four: Asymptotic depoissonization
To show that both conditions of Lemma 15 hold for , we extend the real values z to complex values , where . To prove (103), we note that
(133) and therefore
(134) is absolutely convergent for . The same saddle point analysis applies here and we obtain
(135) where , and is as in (128). Condition (103) is therefore satisfied. To prove condition (104), we see that, for a fixed k,
(136) Therefore, we have
(137) This completes the proof of Theorem 3. □
On the Second Factorial Moment: We poissonize the sequence as well. By the analysis in (27),
which gives the following poissonized form
| (138) |
We show that in all ranges of a the leftover sum in (138) has a lower order contribution to compared to . We define
| (139) |
In the first range for k, we take the Mellin transform of , which is
| (140) |
and we note that the fundamental strip for this Mellin transform of is as well. The inverse Mellin transform for is
| (141) |
We note that this range of corresponds to
| (142) |
The integrand in (141) is quite similar to the one seen in (107). The only difference is the extra term . However, we notice that is analytic and bounded. Thus, we obtain the same saddle points with the real part as in (109) and the same imaginary parts of the form , . Thus, the same saddle point analysis as for the integral in (107) applies to as well. We avoid repeating similar steps and skip to the central approximation, where, by Laplace’s theorem (cf. [22]), we get
| (143) |
which can be represented as
| (144) |
where
| (145) |
This shows that , when
Subsequently, for , we get
| (146) |
and for , we get
| (147) |
It is not difficult to see that, for each range of a as stated above, has a lower-order contribution to the asymptotic expansion of , compared to . Therefore, this leads us to Theorem 4, which will be proved below.
Proof of Theorem 4.
It only remains to show that the two depoissonization conditions hold: for condition (103) in Lemma 15, from (135) we have
(148) and for condition (104), we have, for fixed k,
(149) Therefore both depoissonization conditions are satisfied and the desired result follows. □
Corollary. A Remark on the Second Moment and the Variance
For the second moment we have
| (150) |
Therefore, by (105) and (138) the Poisson transform of the second moment, which we denote by is
| (151) |
which results in the same first-order asymptotics as the second factorial moment. Also, it is not difficult to extend the proof of Theorem 2 to show that the second moments of the two models are asymptotically the same. For the variance, we have
| (152) |
Therefore the Poisson transform, which we denote by is
| (153) |
The Mellin transform of the above function has the following form
| (154) |
This is quite similar to what we saw in (106), which indicates that the variance has the same asymptotic growth as the expected value. The variances of the two models, however, do not behave in the same way (cf. Figure 3).
4. Summary and Conclusions
We studied the first-order asymptotic growth of the first two (factorial) moments of the kth Subword Complexity. We recall that the kth Subword Complexity of a string of length n is denoted by , and is defined as the number of distinct subwords of length k that appear in the string. We are interested in the asymptotic analysis when k grows as a function of the string’s length. More specifically, we conduct the analysis for , and as .
The analysis is inspired by the earlier work of Jacquet and Szpankowski on the analysis of suffix trees, where they are compared to independent tries (cf. [14]). In our work, we compare the first two moments of the kth Subword Complexity to those of the kth Prefix Complexity of a random trie built over n independently generated binary strings. We recall that the kth Prefix Complexity is defined as the number of distinct prefixes that appear in the trie at level k and lower.
We obtain the generating functions representing the expected value and the second factorial moments as their coefficients, in both settings. We prove that the first two moments have the same asymptotic growth in both models. For deriving the asymptotic behavior, we split the range for k into three intervals. We analyze each range using the saddle point method, in combination with residue analysis. We close our work with some remarks regarding the comparison of the second moment and the variance to the kth Prefix Complexity.
5. Future Challenges
The intervals’ endpoints for a in Theorems 3 and 4 are not investigated in this work. The asymptotic analysis of the endpoints can be studied using the van der Waerden saddle point method [24].
The analogous results are not (yet) known in the case where the underlying probability source has Markovian dependence or in the case of dynamical sources.
Acknowledgments
The authors thank Wojciech Szpankowski and Mireille Régnier for insightful conversations on this topic.
Abbreviations
The following abbreviations are used in this manuscript:
| PGF | Probability Generating Function |
| ℙ | Probability |
| 𝔼 | Expected value |
| Var | Variance |
| | The second factorial moment |
Author Contributions
This paper is based on a Ph.D. dissertation conducted by L.A. under the supervision of M.D.W. All authors have read and agreed to the published version of the manuscript.
Funding
M.D.W.'s research is supported by FFAR Grant 534662, by the USDA NIFA Food and Agriculture Cyberinformatics and Tools (FACT) initiative, by NSF Grant DMS-1246818, by the NSF Science & Technology Center for Science of Information Grant CCF-0939370, and by the Society of Actuaries.
Conflicts of Interest
The authors declare no conflict of interest.
References
- 1.Ehrenfeucht A., Lee K., Rozenberg G. Subword complexities of various classes of deterministic developmental languages without interactions. Theor. Comput. Sci. 1975;1:59–75. doi: 10.1016/0304-3975(75)90012-2. [DOI] [Google Scholar]
- 2.Morse M., Hedlund G.A. Symbolic Dynamics. Am. J. Math. 1938;60:815–866. doi: 10.2307/2371264. [DOI] [Google Scholar]
- 3.Jacquet P., Szpankowski W. Analytic Pattern Matching: From DNA to Twitter. Cambridge University Press; Cambridge, UK: 2015. [Google Scholar]
- 4.Bell T.C., Cleary J.G., Witten I.H. Text Compression. Prentice-Hall; Upper Saddle River, NJ, USA: 1990. [Google Scholar]
- 5.Burge C., Campbell A.M., Karlin S. Over-and under-representation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. USA. 1992;89:1358–1362. doi: 10.1073/pnas.89.4.1358. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Fickett J.W., Torney D.C., Wolf D.R. Base compositional structure of genomes. Genomics. 1992;13:1056–1064. doi: 10.1016/0888-7543(92)90019-O. [DOI] [PubMed] [Google Scholar]
- 7.Karlin S., Burge C., Campbell A.M. Statistical analyses of counts and distributions of restriction sites in DNA sequences. Nucleic Acids Res. 1992;20:1363–1370. doi: 10.1093/nar/20.6.1363. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Karlin S., Mrázek J., Campbell A.M. Frequent Oligonucleotides and Peptides of the Haemophilus Influenzae Genome. Nucleic Acids Res. 1996;24:4263–4272. doi: 10.1093/nar/24.21.4263. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Pevzner P.A., Borodovsky M.Y., Mironov A.A. Linguistics of Nucleotide Sequences II: Stationary Words in Genetic Texts and the Zonal Structure of DNA. J. Biomol. Struct. Dyn. 1989;6:1027–1038. doi: 10.1080/07391102.1989.10506529. [DOI] [PubMed] [Google Scholar]
- 10.Chen X., Francia B., Li M., Mckinnon B., Seker A. Shared information and program plagiarism detection. IEEE Trans. Inf. Theory. 2004;50:1545–1551. doi: 10.1109/TIT.2004.830793. [DOI] [Google Scholar]
- 11.Chor B., Horn D., Goldman N., Levy Y., Massingham T. Genomic DNA k-mer spectra: models and modalities. Genome Biol. 2009;10:R108. doi: 10.1186/gb-2009-10-10-r108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Price A.L., Jones N.C., Pevzner P.A. De novo identification of repeat families in large genomes. Bioinformatics. 2005;21:i351–i358. doi: 10.1093/bioinformatics/bti1018. [DOI] [PubMed] [Google Scholar]
- 13.Janson S., Lonardi S., Szpankowski W. Annual Symposium on Combinatorial Pattern Matching. Springer; Berlin/Heidelberg, Germany: 2004. On the Average Sequence Complexity; pp. 74–88. [Google Scholar]
- 14.Jacquet P., Szpankowski W. Autocorrelation on words and its applications: Analysis of suffix trees by string-ruler approach. J. Comb. Theory Ser. A. 1994;66:237–269. doi: 10.1016/0097-3165(94)90065-5. [DOI] [Google Scholar]
- 15.Liang F.M. Word Hy-phen-a-tion by Com-put-er. Technical Report; Stanford University; Stanford, CA, USA: 1983. [Google Scholar]
- 16.Weiner P. Linear pattern matching algorithms; Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973); Iowa City, IA, USA. 15–17 October 1973; pp. 1–11. [Google Scholar]
- 17.Gheorghiciuc I., Ward M.D. On correlation Polynomials and Subword Complexity. Discrete Math. Theor. Comput. Sci. 2007;7:1–18. [Google Scholar]
- 18.Bassino F., Clément J., Nicodème P. Counting occurrences for a finite set of words: Combinatorial methods. ACM Trans. Algorithms. 2012;8:31. doi: 10.1145/2229163.2229175. [DOI] [Google Scholar]
- 19.Park G., Hwang H.K., Nicodème P., Szpankowski W. Latin American Symposium on Theoretical Informatics. Springer; Berlin/Heidelberg, Germany: 2008. Profile of Tries; pp. 1–11. [Google Scholar]
- 20.Flajolet P., Sedgewick R. Analytic Combinatorics. Cambridge University Press; Cambridge, UK: 2009. [Google Scholar]
- 21.Lothaire M. Applied Combinatorics on Words. Volume 105 Cambridge University Press; Cambridge, UK: 2005. [Google Scholar]
- 22.Szpankowski W. Average Case Analysis of Algorithms on Sequences. Volume 50 John Wiley & Sons; Chichester, UK: 2011. [Google Scholar]
- 23.Widder D.V. The Laplace Transform (PMS-6) Princeton University Press; Princeton, NJ, USA: 2015. [Google Scholar]
- 24.van der Waerden B.L. On the method of saddle points. Appl. Sci. Res. 1952;2:33–45. doi: 10.1007/BF02919754. [DOI] [Google Scholar]