Entropy. 2020 Feb 12;22(2):207. doi: 10.3390/e22020207

Asymptotic Analysis of the kth Subword Complexity

Lida Ahmadi 1,*, Mark Daniel Ward 2
PMCID: PMC7516637  PMID: 33285983

Abstract

Patterns within strings enable us to extract vital information regarding a string’s randomness. Whether a string is random (showing little to no repetition of patterns) or periodic (showing repeated patterns) is captured by a value called the kth Subword Complexity of the character string. By definition, the kth Subword Complexity is the number of distinct substrings of length k that appear in a given string. In this paper, we evaluate the expected value and the second factorial moment (followed by a corollary on the second moment) of the kth Subword Complexity for binary strings over memory-less sources. We first take a combinatorial approach to derive a probability generating function for the number of occurrences of patterns in strings of finite length. This enables us to obtain exact expressions for the two moments in terms of the patterns’ autocorrelation and correlation polynomials. We then investigate the asymptotic behavior for values of k=Θ(log n). In the proof, we compare the distribution of the kth Subword Complexity of binary strings to the distribution of distinct prefixes of independent strings stored in a trie. The methodology that we use involves complex analysis, analytical poissonization and depoissonization, the Mellin transform, and saddle point analysis.

Keywords: subword complexity, asymptotics, generating functions, saddle point method, probability, the Mellin transform, moments

1. Introduction

Analyzing and understanding occurrences of patterns in a character string is helpful for extracting useful information regarding the nature of the string. We classify strings into low-complexity and high-complexity, according to their level of randomness. For instance, take the binary string X=10101010..., which is constructed by repetitions of the pattern w=10. This string is periodic, and therefore has low randomness. Such periodic strings are classified as low-complexity strings, whereas strings that do not show periodicity are considered to have high complexity. An effective way of measuring a string’s randomness is to count all distinct patterns that appear as contiguous subwords in the string. This value is called the Subword Complexity. The name was given by Ehrenfeucht, Lee, and Rozenberg [1], and the notion was initially introduced by Morse and Hedlund in 1938 [2]. The higher the Subword Complexity, the more complex the string is considered to be.

Assessing information about the distribution of the Subword Complexity enables us to better characterize strings, and to determine atypically random or periodic strings that have complexities far from the average complexity [3]. This type of string classification has applications in fields such as data compression [4], genome analysis (see [5,6,7,8,9]), and plagiarism detection [10]. For example, in data compression, a data set is considered compressible if it has low complexity, as it consists of repeated subwords. In computational genomics, Subword Complexity (where the subwords are known as k-mers) is used in the detection of repeated sequences and in DNA barcoding [11,12]. k-mers are composed of A, T, G, and C nucleotides. For instance, the number of distinct 7-mers of the DNA sequence GTAGAGCTGT is four, meaning that there are 4 distinct substrings of length 7 in the given DNA sequence. Counting k-mers becomes challenging for longer DNA sequences. Our results can easily be extended to the alphabet {A,T,G,C} and directly applied in the theoretical analysis of genomic k-mer distributions under the Bernoulli probabilistic model, particularly when the length n of the sequence approaches infinity.
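The k-mer count just described can be computed with a one-line set comprehension. The following minimal sketch (the function name `kth_subword_complexity` is ours, not from the paper) reproduces the DNA example above:

```python
def kth_subword_complexity(s: str, k: int) -> int:
    """Number of distinct contiguous substrings (k-mers) of length k in s."""
    return len({s[i:i + k] for i in range(len(s) - k + 1)})

# The DNA example from the text: GTAGAGCTGT contains four distinct 7-mers.
print(kth_subword_complexity("GTAGAGCTGT", 7))   # -> 4

# A periodic (low-complexity) string has few distinct subwords:
print(kth_subword_complexity("10101010", 2))     # -> 2
```

The second call illustrates the low-complexity case from the Introduction: the periodic string 10101010 has only the two distinct 2-mers 10 and 01.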

There are two variations for the definition of the Subword Complexity: the one that counts all distinct subwords of a given string (also known as Complexity Index and Sequence Complexity [13]), and the one that only counts the subwords of the same length, say k, that appear in the string. In our work, we analyze the latter, and we call it the kth Subword Complexity to avoid any confusion.

Throughout this work, we consider the kth Subword Complexity of a random binary string of length n over a memory-less source, and we denote it by X_{n,k}. We analyze the first and second factorial moments of X_{n,k} for the range k=Θ(log n), as n→∞. More precisely, we divide the analysis into three ranges as follows.

  • i.

    $\frac{1}{\log q^{-1}}\log n < k < \frac{2}{\log q^{-1}+\log p^{-1}}\log n$,

  • ii.

    $\frac{2}{\log q^{-1}+\log p^{-1}}\log n < k < \frac{1}{q\log q^{-1}+p\log p^{-1}}\log n$, and

  • iii.

    $\frac{1}{q\log q^{-1}+p\log p^{-1}}\log n < k < \frac{1}{\log p^{-1}}\log n$.

Our approach involves two major steps. First, we choose a suitable model for the asymptotic analysis, and afterwards we provide proofs for the derivation of the asymptotic expansion of the first two factorial moments.

1.1. Part I

This part of the analysis is inspired by the earlier work of Jacquet and Szpankowski [14] on the analysis of suffix trees by comparing them to independent tries. A trie, first introduced by René de la Briandais in 1959 (see [15]), is a search tree that stores n strings according to their prefixes. A suffix tree, introduced by Weiner in 1973 (see [16]), is a trie where the strings are suffixes of a given string. Examples of these data structures are given in Figure 1.

Figure 1.


The suffix tree in (a) is built over the first four suffixes of the string X=101110..., and the trie in (b) is built over the strings X1=111..., X2=101..., X3=100, and X4=010....

A direct asymptotic analysis of the moments is a difficult task, as patterns in a string are not independent from each other. However, we note that each pattern in a string can be regarded as a prefix of a suffix of the string. Therefore, the number of distinct patterns of length k in a string is actually the number of nodes of the suffix tree at level k and lower. It is shown by I. Gheorghiciuc and M. D. Ward [17] that the expected value of the k-th Subword Complexity of a Bernoulli string of length n is asymptotically comparable to the expected value of the number of nodes at level k of a trie built over n independent strings generated by a memory-less source.
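The identification of length-k subwords with length-k prefixes of suffixes can be checked mechanically. This small sketch (helper names are ours) verifies that the two sets coincide for a sample string:

```python
def distinct_k_subwords(s: str, k: int) -> set:
    """All distinct contiguous substrings of length k in s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def k_prefixes_of_suffixes(s: str, k: int) -> set:
    """Length-k prefixes of all suffixes of s that are long enough to contain one."""
    return {s[i:][:k] for i in range(len(s) - k + 1)}

s, k = "1011100101", 3
assert distinct_k_subwords(s, k) == k_prefixes_of_suffixes(s, k)
print(sorted(distinct_k_subwords(s, k)))
```

The equality holds for every string and every k, since s[i:i+k] is exactly the length-k prefix of the suffix starting at position i; this is the combinatorial observation behind comparing the suffix tree to an independent trie.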

We extend this analysis to the desired range for k, and we prove that the result holds when k grows logarithmically with n. Additionally, we show that, asymptotically, the second factorial moment of the kth Subword Complexity can also be estimated by the same independent model generated by a memory-less source. The proof of this theorem relies heavily on the characterization of the overlaps of the patterns with themselves and with one another. Autocorrelation and correlation polynomials explicitly describe these overlaps. The analytic properties of these polynomials are key to understanding repetitions of patterns in large Bernoulli strings. This, in conjunction with Cauchy’s integral formula (used to compare the generating functions in the two models) and the residue theorem, provides solid verification that the second factorial moment of the Subword Complexity behaves the same as in the independent model.

To make this comparison, we derive the generating functions of the first two factorial moments in both settings. In a paper published by F. Bassino, J. Clément, and P. Nicodème in 2012 [18], the authors provide a multivariate probability generating function f(z,x) for the number of occurrences of patterns in a finite Bernoulli string. That is, given a pattern w, the coefficient of the term znxm in f(z,x) is the probability in the Bernoulli model that a random string of size n has exactly m occurrences of the pattern w. Following their technique, we derive the exact expression for the generating functions of the first two factorial moments of the kth Subword Complexity. In the independent model, the generating functions are obtained by basic probability concepts.

1.2. Part II

This part of the proof is analogous to the analysis of the profiles of tries [19]. To capture the asymptotic behavior, the expressions for the first two factorial moments in the independent trie are further improved by means of a Poisson process. The poissonized version yields generating functions in the form of harmonic sums for each of the moments. The Mellin transform and the inverse Mellin transform of these harmonic sums establish a connection between the asymptotic expansion and the singularities of the transformed function. This methodology suffices when the length k of the patterns is fixed. However, allowing k to grow with n makes the analysis more challenging. This is because, for large k, the dominant term of the poissonized generating function may come from the term involving k, and singularities may not be significant compared to the growth of k. This issue is treated by combining the singularity analysis with a saddle point method [20]. The outcome of the analysis is precise first-order asymptotics of the moments in the poissonized model. Depoissonization theorems are then applied to obtain the desired result in the Bernoulli model.

2. Results

For a binary string $X = X_1X_2\ldots X_n$, where the $X_i$ ($i=1,\ldots,n$) are independent and identically distributed random variables, we assume that $P(X_i=1)=p$, $P(X_i=0)=q=1-p$, and $p>q$. We define the kth Subword Complexity, $X_{n,k}$, to be the number of distinct substrings of length k that appear in a random string X with the above assumptions. In this work, we obtain the first order asymptotics for the average and the second factorial moment of $X_{n,k}$. The analysis is done in the range $k=\Theta(\log n)$. We rewrite this range as $k = a\log n$, and by performing a saddle point analysis, we will show that

$1/\log q^{-1} < a < 1/\log p^{-1}.$ (1)

In the first step, we compare the kth Subword Complexity to an independent model constructed in the following way: we store a set of n independently generated strings from a memory-less source in a trie. This means that each string is a sequence of independent and identically distributed Bernoulli random variables over the binary alphabet A={0,1}, with P(1)=p, P(0)=q=1−p. We denote the number of distinct prefixes of length k in the trie by $\hat{X}_{n,k}$, and we call it the kth Prefix Complexity. Before proceeding any further, we recall that the factorial moments of a random variable are defined as follows.
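As a sanity check on this independent model (a simulation sketch of ours, not part of the paper’s analysis), one can estimate $E[\hat{X}_{n,k}]$ by Monte Carlo and compare it with the exact value $\sum_w (1-(1-P(w))^n)$ derived in Lemma 4 below. Since only the first k symbols of each stored string matter for the kth Prefix Complexity, it suffices to generate length-k strings:

```python
import random

def prefix_complexity(strings, k):
    """Number of distinct length-k prefixes among the stored strings
    (i.e., the number of trie nodes at depth k)."""
    return len({s[:k] for s in strings})

def avg_prefix_complexity(n, k, p, trials=2000, seed=1):
    """Monte Carlo estimate of E[hat{X}_{n,k}] for a memory-less source."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        strings = ["".join("1" if rng.random() < p else "0" for _ in range(k))
                   for _ in range(n)]
        total += prefix_complexity(strings, k)
    return total / trials

# For p = 1/2, every word w of length k has P(w) = 2^{-k}, so
# E[hat{X}_{n,k}] = 2^k * (1 - (1 - 2^{-k})^n) exactly.
n, k = 50, 4
exact = 2**k * (1 - (1 - 2**-k)**n)
print(avg_prefix_complexity(n, k, 0.5), exact)
```

The two printed values agree to within Monte Carlo error, illustrating the formula used throughout Section 3.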

Definition 1.

The jth factorial moment of a random variable X is defined as

$E[(X)_j] = E\big[X(X-1)(X-2)\cdots(X-j+1)\big],$ (2)

where j = 1, 2, …. We will show that the first and second factorial moments of $X_{n,k}$ are asymptotically comparable to those of $\hat{X}_{n,k}$ when $k=\Theta(\log n)$. We have the following theorems.
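Before turning to the theorems, Definition 1 can be exercised on a distribution with a known closed form: for a Binomial(n, p) variable, the jth factorial moment is $n(n-1)\cdots(n-j+1)p^j$. The following sketch (the helper `factorial_moment` is ours) confirms this numerically:

```python
from math import comb

def factorial_moment(pmf, j):
    """E[(X)_j] = sum_x x(x-1)...(x-j+1) * pmf[x] for a finite pmf."""
    total = 0.0
    for x, prob in pmf.items():
        term = 1
        for i in range(j):
            term *= (x - i)
        total += term * prob
    return total

n, p = 6, 0.3
binom_pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}
print(factorial_moment(binom_pmf, 2))   # close to n*(n-1)*p**2 = 2.7
```

The same routine applied to the distribution of an indicator sum explains why factorial moments are natural here: for indicators, $X^2 = X$, which is exactly the cancellation used in the proof of Lemma 3 below.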

Theorem 1.

For large values of n, and for k=Θ(logn), there exists M>0 such that

$E[X_{n,k}] - E[\hat{X}_{n,k}] = O(n^{-M}).$

We also prove a similar result for the second factorial moments of the kth Subword Complexity and the kth Prefix Complexity:

Theorem 2.

For large values of n, and for k=Θ(logn), there exists ϵ>0 such that

$E[(X_{n,k})_2] - E[(\hat{X}_{n,k})_2] = O(n^{-\epsilon}).$

In the second part of our analysis, we derive the first order asymptotics of the kth Prefix Complexity. The methodology used here is analogous to the analysis of the profiles of tries [19]. The rate of the asymptotic growth depends on the location of the value a, as seen in (1). For instance, for the average kth Subword Complexity, $E[X_{n,k}]$, we have the following observations.

  • i.

    For the range $I_1: \frac{1}{\log q^{-1}} < a < \frac{2}{\log q^{-1}+\log p^{-1}}$, the growth rate is of order $O(2^k)$,

  • ii.

    in the range $I_2: \frac{2}{\log q^{-1}+\log p^{-1}} < a < \frac{1}{q\log q^{-1}+p\log p^{-1}}$, we observe some oscillations with n, and

  • iii.

    in the range $I_3: \frac{1}{q\log q^{-1}+p\log p^{-1}} < a < \frac{1}{\log p^{-1}}$, the average has a linear growth $O(n)$.

The above observations will be discussed in depth in the proofs of the following theorems.

Theorem 3.

The average of the kth Prefix Complexity has the following asymptotic expansion.

  • i. 
    For $a \in I_1$,
    $E[\hat{X}_{n,k}] = \left(2^k - \Phi_1\big((1+a\log p)\log_{p/q} n\big)\,\frac{n^{\nu}}{\sqrt{\log n}}\right)\left(1+O\Big(\frac{1}{\log n}\Big)\right),$ (3)
    where $\nu = r_0 + a\log\big(p^{r_0}+q^{r_0}\big)$, and
    $\Phi_1(x) = \frac{(p/q)^{r_0/2}+(p/q)^{-r_0/2}}{\sqrt{2\pi\,\log(p/q)}}\,\sum_{j\in\mathbb{Z}} \Gamma(r_0+it_j)\, e^{2\pi i j x}$

    is a bounded periodic function.

  • ii. 
    For $a \in I_2$,
    $E[\hat{X}_{n,k}] = \Phi_1\big((1+a\log p)\log_{p/q} n\big)\,\frac{n^{\nu}}{\sqrt{\log n}}\left(1+O\Big(\frac{1}{\log n}\Big)\right).$
  • iii. 
    For $a \in I_3$,
    $E[\hat{X}_{n,k}] = n + O(n^{\nu_0}),$

    for some $\nu_0 < 1$.

Theorem 4.

The second factorial moment of the kth Prefix Complexity has the following asymptotic expansion.

  • i. 
    For $a \in I_1$,
    $E[(\hat{X}_{n,k})_2] = \left(2^k - \Phi_1\big((1+a\log p)\log_{p/q} n\big)\,\frac{n^{\nu}}{\sqrt{\log n}}\right)^2\left(1+O\Big(\frac{1}{\log n}\Big)\right).$
  • ii. 
    For $a \in I_2$,
    $E[(\hat{X}_{n,k})_2] = \Phi_1^2\big((1+a\log p)\log_{p/q} n\big)\,\frac{n^{2\nu}}{\log n}\left(1+O\Big(\frac{1}{\log n}\Big)\right).$
  • iii. 
    For $a \in I_3$,
    $E[(\hat{X}_{n,k})_2] = n^2 + O(n^{2\nu_0}).$

The periodic function Φ1(x) in Theorems 3 and 4 is shown in Figure 2.

Figure 2.


Left: Φ1(x) at p=0.90 and various values of r0; the amplitude increases as r0 increases. Right: Φ1(x) at r0=1 and various values of p; the amplitude tends to zero as p → 1/2⁺.

The results in Theorem 4 carry over to the second moment of the kth Subword Complexity, as the analysis can easily be extended from the second factorial moment to the second moment. The variance of the kth Prefix Complexity, however, as seen in Figure 3, does not show the same asymptotic behavior as the variance of the kth Subword Complexity.

Figure 3.


Approximated second moments (left), and variances (right) of the kth Subword Complexity (red), and the kth Prefix Complexity (blue), for n=4000, at different probability levels, averaged over 10,000 iterations.

3. Proofs and Methods

3.1. Groundwork

We first introduce some terminology and lemmas regarding overlaps of patterns and their numbers of occurrences in texts. Some of the notation we use in this work is borrowed from [18] and [21].

Definition 2.

For a binary word $w = w_1\cdots w_k$ of length k, the autocorrelation set $S_w$ of the word w is defined in the following way:

$S_w = \{\, w_{i+1}\cdots w_k \mid w_1\cdots w_i = w_{k-i+1}\cdots w_k \,\}.$ (4)

The autocorrelation index set is

$\mathcal{P}(w) = \{\, i \mid w_1\cdots w_i = w_{k-i+1}\cdots w_k \,\},$ (5)

and the autocorrelation polynomial is

$S_w(z) = \sum_{i\in\mathcal{P}(w)} P(w_{i+1}\cdots w_k)\, z^{k-i}.$ (6)

Definition 3.

For distinct binary words $w = w_1\cdots w_k$ and $w' = w'_1\cdots w'_k$, the correlation set $S_{w,w'}$ of the words w and w′ is

$S_{w,w'} = \{\, w'_{i+1}\cdots w'_k \mid w'_1\cdots w'_i = w_{k-i+1}\cdots w_k \,\}.$ (7)

The correlation index set is

$\mathcal{P}(w,w') = \{\, i \mid w'_1\cdots w'_i = w_{k-i+1}\cdots w_k \,\},$ (8)

and the correlation polynomial is

$S_{w,w'}(z) = \sum_{i\in\mathcal{P}(w,w')} P(w'_{i+1}\cdots w'_k)\, z^{k-i}.$ (9)
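Both polynomials can be evaluated directly from Definitions 2 and 3. The sketch below (function names are ours) computes $S_{w,w'}(z)$ by scanning all overlap lengths i, and recovers the autocorrelation polynomial $S_w(z)$ when $w'=w$. For w = 1010 and p = 1/2, the overlaps occur at i = 2 and i = 4, so $S_w(z) = 1 + z^2/4$:

```python
def word_prob(w: str, p: float) -> float:
    """Bernoulli probability of the binary word w, with P('1') = p."""
    q = 1 - p
    prob = 1.0
    for c in w:
        prob *= p if c == "1" else q
    return prob

def correlation_poly(w: str, wp: str, p: float, z: float) -> float:
    """S_{w,w'}(z): sum of P(w'_{i+1}..w'_k) * z^{k-i} over all i such that
    the length-i prefix of w' equals the length-i suffix of w.
    With wp == w this is the autocorrelation polynomial S_w(z)."""
    k = len(w)
    total = 0.0
    for i in range(1, k + 1):
        if wp[:i] == w[k - i:]:
            total += word_prob(wp[i:], p) * z ** (k - i)
    return total

# Autocorrelation of w = 1010 at p = 1/2: overlaps at i = 2 and i = 4.
print(correlation_poly("1010", "1010", 0.5, 1.0))   # -> 1.25
```

Note that the full overlap i = k always contributes the constant term 1 to $S_w(z)$, which is why, in Lemma 8 below, $S_w(z)$ is typically close to 1.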

The following two lemmas present the probability generating functions for the number of occurrences of a single pattern and of a pair of distinct patterns, respectively, in a random text of length n. For a detailed derivation of such generating functions, refer to [18].

Lemma 1.

The occurrence probability generating function for a single pattern w in a binary text over a memoryless source is given by $F_w(z, x-1)$, where

$F_w(z,t) = \dfrac{1}{1 - A(z) - \dfrac{t\,P(w)\,z^k}{1 - t\,(S_w(z)-1)}},$ (10)

and $A(z)=z$ is the probability generating function of the binary alphabet. The coefficient $[z^n x^m]\,F_w(z,x-1)$ is the probability that a random binary string of length n has m occurrences of the pattern w.

Lemma 2.

The occurrence PGF for two distinct patterns of length k in a Bernoulli random text is given by $F_{w,w'}(z, x_1-1, x_2-1)$, where

$F_{w,w'}(z,t_1,t_2) = \dfrac{1}{1 - A(z) - M(z,t_1,t_2)},$ (11)

and

$M(z,t_1,t_2) = \begin{pmatrix} P(w)\,z^k\,t_1 & P(w')\,z^k\,t_2 \end{pmatrix} \left[\, I - \begin{pmatrix} (S_w(z)-1)\,t_1 & S_{w,w'}(z)\,t_2 \\ S_{w',w}(z)\,t_1 & (S_{w'}(z)-1)\,t_2 \end{pmatrix} \right]^{-1} \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$

The coefficient $[z^n x_1^{m_1} x_2^{m_2}]\,F_{w,w'}(z,x_1-1,x_2-1)$ is the probability that there are $m_1$ occurrences of w and $m_2$ occurrences of w′ in a random string of length n.

The above results will be used to find the generating functions for the first two factorial moments of the kth Subword Complexity in the following section.

3.2. Derivation of Generating Functions

Lemma 3.

For the generating functions $H_k(z)=\sum_{n\ge0} E[X_{n,k}]\,z^n$ and $G_k(z)=\sum_{n\ge0} E[(X_{n,k})_2]\,z^n$, we have

  • i. 
    $H_k(z) = \sum_{w\in A^k}\left(\dfrac{1}{1-z} - \dfrac{S_w(z)}{D_w(z)}\right),$ (12)
    where $D_w(z) = P(w)z^k + (1-z)S_w(z)$, and
  • ii. 
    $G_k(z) = \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\left(\dfrac{1}{1-z} - \dfrac{S_w(z)}{D_w(z)} - \dfrac{S_{w'}(z)}{D_{w'}(z)} + \dfrac{S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)}{D_{w,w'}(z)}\right),$ (13)
    where
    $D_{w,w'}(z) = (1-z)\big(S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)\big) + z^k\big(P(w)(S_{w'}(z)-S_{w,w'}(z)) + P(w')(S_w(z)-S_{w',w}(z))\big).$ (14)

Proof. 

i. We define

$X_{n,k}(w) = \begin{cases} 1 & \text{if } w \text{ appears at least once in the string } X,\\ 0 & \text{otherwise.}\end{cases}$

This yields

$E[X_{n,k}(w)] = P(X_{n,k}(w)=1) = 1 - P(X_{n,k}(w)=0) = 1 - [z^n x^0]\,F_w(z,x-1).$ (15)

We observe that $[z^n x^0]\,F_w(z,x-1) = [z^n]\,F_w(z,-1)$. By defining $f_w(z)=F_w(z,-1)$ and from (10), we obtain

$f_w(z) = \dfrac{S_w(z)}{P(w)z^k + (1-z)S_w(z)}.$ (16)

Having the above function, we derive the following result:

$H_k(z) = \sum_{n\ge0} E[X_{n,k}]\,z^n = \sum_{n\ge0}\sum_{w\in A^k}\big(1-[z^n]f_w(z)\big)z^n = \sum_{w\in A^k}\left(\dfrac{1}{1-z} - f_w(z)\right) = \sum_{w\in A^k}\left(\dfrac{1}{1-z} - \dfrac{S_w(z)}{D_w(z)}\right).$ (17)
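Equation (16) can be validated numerically: $[z^n]f_w(z)$ should equal the probability that w does not occur in a random length-n string. The sketch below (all helper names are ours) expands $S_w(z)/D_w(z)$ as a power series with exact rational arithmetic and compares the coefficient against brute-force enumeration, here for w = 11 and p = 1/2:

```python
from fractions import Fraction
from itertools import product

def word_prob(w, p):
    q = 1 - p
    prob = Fraction(1)
    for c in w:
        prob *= p if c == "1" else q
    return prob

def autocorr_coeffs(w, p):
    """Coefficient list of the autocorrelation polynomial S_w(z); index = power of z."""
    k = len(w)
    coeffs = [Fraction(0)] * (k + 1)
    for i in range(1, k + 1):
        if w[:i] == w[k - i:]:
            coeffs[k - i] += word_prob(w[i:], p)
    return coeffs

def series_coeff(num, den, n):
    """[z^n] of num(z)/den(z) with den[0] != 0, via the standard recurrence."""
    c = []
    for m in range(n + 1):
        val = num[m] if m < len(num) else Fraction(0)
        for j in range(1, min(m, len(den) - 1) + 1):
            val -= den[j] * c[m - j]
        c.append(val / den[0])
    return c[n]

def no_occurrence_prob(w, p, n):
    """Brute force: probability that w does not occur in a random length-n string."""
    return sum((word_prob("".join(bits), p)
                for bits in product("01", repeat=n)
                if w not in "".join(bits)), Fraction(0))

p, w, n = Fraction(1, 2), "11", 8
k = len(w)
S = autocorr_coeffs(w, p)
# D_w(z) = P(w) z^k + (1 - z) S_w(z)
D = [Fraction(0)] * (k + 2)
for i, c in enumerate(S):
    D[i] += c
    D[i + 1] -= c
D[k] += word_prob(w, p)
print(series_coeff(S, D, n), no_occurrence_prob(w, p, n))   # both 55/256
```

For w = 11 this recovers a classical fact: the number of length-8 binary strings avoiding 11 is the Fibonacci number F(10) = 55.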

ii. For this part, we first note that

$E[(X_{n,k})_2] = E[X_{n,k}^2] - E[X_{n,k}] = E\Big[\Big(\textstyle\sum_{w\in A^k} X_{n,k}(w)\Big)^2\Big] - E\Big[\textstyle\sum_{w\in A^k} X_{n,k}(w)\Big] = \sum_{w\in A^k} E\big[(X_{n,k}(w))^2\big] + \sum_{\substack{w,w'\in A^k\\ w\ne w'}} E\big[X_{n,k}(w)X_{n,k}(w')\big] - \sum_{w\in A^k} E\big[X_{n,k}(w)\big].$ (18)

Since $(X_{n,k}(w))^2 = X_{n,k}(w)$ for indicator random variables, the first and third sums cancel, and the expected value of the second factorial moment has only one term:

$E[(X_{n,k})_2] = \sum_{\substack{w,w'\in A^k\\ w\ne w'}} E\big[X_{n,k}(w)X_{n,k}(w')\big].$ (19)

We proceed by noting that the product $X_{n,k}(w)X_{n,k}(w')$ is itself an indicator variable: it equals 1 if $X_{n,k}(w)=X_{n,k}(w')=1$, and 0 otherwise. By inclusion-exclusion, this gives

$E[X_{n,k}(w)X_{n,k}(w')] = P\big(X_{n,k}(w)=1,\,X_{n,k}(w')=1\big) = 1 - P\big(X_{n,k}(w)=0 \,\cup\, X_{n,k}(w')=0\big) = 1 - P\big(X_{n,k}(w)=0\big) - P\big(X_{n,k}(w')=0\big) + P\big(X_{n,k}(w)=0,\,X_{n,k}(w')=0\big).$

Finally, we are able to express $E[(X_{n,k})_2]$ in the following way:

$E[(X_{n,k})_2] = \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\Big(1 - [z^n]f_w(z) - [z^n]f_{w'}(z) + [z^n]f_{w,w'}(z)\Big),$ (20)

where $f_{w,w'}(z) = F_{w,w'}(z,-1,-1)$ and $[z^n]F_{w,w'}(z,-1,-1) = [z^n x_1^0 x_2^0]\,F_{w,w'}(z,x_1-1,x_2-1)$. By (11), we have

$f_{w,w'}(z) = \dfrac{S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)}{D_{w,w'}(z)}.$ (21)

Having the above expression, we finally obtain

$G_k(z) = \sum_{n\ge0} E[(X_{n,k})_2]\,z^n = \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\sum_{n\ge0}\Big(1 - [z^n]f_w(z) - [z^n]f_{w'}(z) + [z^n]f_{w,w'}(z)\Big)z^n = \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\left(\dfrac{1}{1-z} - \dfrac{S_w(z)}{D_w(z)} - \dfrac{S_{w'}(z)}{D_{w'}(z)} + \dfrac{S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)}{D_{w,w'}(z)}\right).$ (22)

In the following lemma, we present the generating functions for the first two factorial moments for the kth Prefix Complexity in the independent model.

Lemma 4.

For $\hat{H}_k(z)=\sum_{n\ge0} E[\hat{X}_{n,k}]\,z^n$ and $\hat{G}_k(z)=\sum_{n\ge0} E[(\hat{X}_{n,k})_2]\,z^n$, which are the generating functions for $E[\hat{X}_{n,k}]$ and $E[(\hat{X}_{n,k})_2]$ respectively, we have

  • i. 
    $\hat{H}_k(z) = \sum_{w\in A^k}\left(\dfrac{1}{1-z} - \dfrac{1}{1-(1-P(w))z}\right).$ (23)
  • ii. 
    $\hat{G}_k(z) = \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\left(\dfrac{1}{1-z} - \dfrac{1}{1-(1-P(w))z} - \dfrac{1}{1-(1-P(w'))z}\right) + \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\dfrac{1}{1-(1-P(w)-P(w'))z}.$ (24)

Proof. 

i. We define the indicator variable $\hat{X}_{n,k}(w)$ as follows:

$\hat{X}_{n,k}(w) = \begin{cases} 1 & \text{if } w \text{ is a prefix of at least one of the } n \text{ stored strings},\\ 0 & \text{otherwise.}\end{cases}$

For each $\hat{X}_{n,k}(w)$, we have

$E[\hat{X}_{n,k}(w)] = P(\hat{X}_{n,k}(w)=1) = 1 - P(\hat{X}_{n,k}(w)=0) = 1 - \big(1-P(w)\big)^n.$ (25)

Summing over all words w of length k determines the generating function $\hat{H}_k(z)$:

$\hat{H}_k(z) = \sum_{n\ge0} E[\hat{X}_{n,k}]\,z^n = \sum_{w\in A^k}\left(\dfrac{1}{1-z} - \dfrac{1}{1-(1-P(w))z}\right).$ (26)

ii. Similarly to (18) and (20), we obtain

$E[(\hat{X}_{n,k})_2] = \sum_{\substack{w,w'\in A^k\\ w\ne w'}} E[\hat{X}_{n,k}(w)\hat{X}_{n,k}(w')] = \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\Big(1 - (1-P(w))^n - (1-P(w'))^n + (1-P(w)-P(w'))^n\Big).$ (27)

Subsequently, we obtain the generating function below:

$\hat{G}_k(z) = \sum_{n\ge0} E[(\hat{X}_{n,k})_2]\,z^n = \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\left(\dfrac{1}{1-z} - \dfrac{1}{1-(1-P(w))z} - \dfrac{1}{1-(1-P(w'))z}\right) + \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\dfrac{1}{1-(1-P(w)-P(w'))z}.$ (28)

Our first goal is to compare the coefficients of the generating functions in the two models. The coefficients are expected to be asymptotically equivalent in the desired range for k. To compare the coefficients, we need more information on the analytic properties of these generating functions. This will be discussed in Section 3.3.

3.3. Analytic Properties of the Generating Functions

Here, we turn our attention to the smallest singularities of the two generating functions given in Lemma 3. It has been shown by Jacquet and Szpankowski [21] that $D_w(z)$ has exactly one root in the disk $|z|\le\rho$. Following the notation in [21], we denote this root of $D_w(z)$ by $A_w$, and by bootstrapping we obtain

$A_w = 1 + \dfrac{P(w)}{S_w(1)} + O\big(P(w)^2\big).$ (29)

We also denote the derivative of $D_w(z)$ at the root $A_w$ by $B_w$, and we obtain

$B_w = -S_w(1) + \left(k - \dfrac{2\,S_w'(1)}{S_w(1)}\right) P(w) + O\big(P(w)^2\big).$ (30)
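The bootstrap expansion (29) is easy to check numerically: a Newton iteration started at z = 1 locates the root of $D_w(z)$, which should differ from $1 + P(w)/S_w(1)$ only by $O(P(w)^2)$. The sketch below (helper names and the sample word are ours) performs this check:

```python
def word_prob(w, p):
    q = 1 - p
    r = 1.0
    for c in w:
        r *= p if c == "1" else q
    return r

def S_w(w, p, z):
    """Autocorrelation polynomial of w, evaluated at z."""
    k = len(w)
    return sum(word_prob(w[i:], p) * z ** (k - i)
               for i in range(1, k + 1) if w[:i] == w[k - i:])

def D_w(w, p, z):
    """D_w(z) = P(w) z^k + (1 - z) S_w(z)."""
    return word_prob(w, p) * z ** len(w) + (1 - z) * S_w(w, p, z)

def root_near_one(w, p, z=1.0, steps=50, h=1e-7):
    """Newton iteration for the root A_w of D_w, with a numeric derivative."""
    for _ in range(steps):
        deriv = (D_w(w, p, z + h) - D_w(w, p, z - h)) / (2 * h)
        z -= D_w(w, p, z) / deriv
    return z

w, p = "1011010110", 0.6
bootstrap = 1 + word_prob(w, p) / S_w(w, p, 1.0)   # first-order term of (29)
exact = root_near_one(w, p)
print(exact - bootstrap)   # tiny: the O(P(w)^2) remainder
```

Since $P(w)\approx 1.2\times10^{-3}$ for this word, the printed discrepancy is several orders of magnitude smaller than $P(w)$ itself, consistent with the $O(P(w)^2)$ error term.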

In this paper, we will prove a similar result for the polynomial Dw,w(z) through the following work.

Lemma 5.

If w and w′ are two distinct binary words of length k and $\delta=\sqrt{p}$, then there exists $\rho>1$ such that $\rho\delta<1$ and

$\sum_{w'\in A^k} [[\,|S_{w,w'}(\rho)| \le \theta(\rho\delta)^k\,]]\; P(w') \ge 1 - \theta\delta^k,$ (31)

where $\theta=(1-p)^{-1}$.

Proof. 

If the minimal degree of $S_{w,w'}(z)$ is greater than $k/2$, then

$|S_{w,w'}(\rho)| \le \theta(\rho\delta)^k$ (32)

for $\theta=(1-p)^{-1}$. For a fixed w, we have

$\sum_{w'\in A^k} [[\,S_{w,w'}(z)\text{ has minimal degree} \le k/2\,]]\, P(w') = \sum_{i=1}^{\lfloor k/2\rfloor}\sum_{w'\in A^k} [[\,S_{w,w'}(z)\text{ has minimal degree } i\,]]\, P(w') \le \sum_{i=1}^{\lfloor k/2\rfloor} p^{k-i} \le \dfrac{p^{\lceil k/2\rceil}}{1-p} = \theta\delta^k.$ (33)

This leads to the following:

$\sum_{w'\in A^k} [[\,\text{every term of } S_{w,w'}(z)\text{ is of degree} > k/2\,]]\, P(w') = 1 - \sum_{w'\in A^k} [[\,S_{w,w'}(z)\text{ has a term of degree} \le k/2\,]]\, P(w') \ge 1 - \dfrac{p^{\lceil k/2\rceil}}{1-p} = 1 - \theta\delta^k.$ (34)

Lemma 6.

There exist $K>0$ and $\rho>1$ with $p\rho<1$ such that, for every pair of distinct words w and w′ of length $k\ge K$, and for $|z|\le\rho$, we have

$\big|S_w(z)S_{w'}(z) - S_{w,w'}(z)S_{w',w}(z)\big| > 0.$ (35)

In other words, $S_w(z)S_{w'}(z) - S_{w,w'}(z)S_{w',w}(z)$ does not have any roots in $|z|\le\rho$.

Proof. 

There are three cases to consider.

Case i. When either $S_w(z)=1$ or $S_{w'}(z)=1$, every term of $S_{w,w'}(z)S_{w',w}(z)$ has degree k or larger, and therefore

$\big|S_{w,w'}(z)S_{w',w}(z)\big| \le \dfrac{k(p\rho)^k}{1-p\rho}.$ (36)

Since $k(p\rho)^k/(1-p\rho) \to 0$ as $k\to\infty$, there exists $K_1>0$ such that, for $k>K_1$,

$\big|S_w(z)S_{w'}(z) - S_{w,w'}(z)S_{w',w}(z)\big| \ge \big|S_w(z)S_{w'}(z)\big| - \big|S_{w,w'}(z)S_{w',w}(z)\big| \ge 1 - \dfrac{k(p\rho)^k}{1-p\rho} > 0.$ (37)

Case ii. If the minimal degree of $S_w(z)-1$ or of $S_{w'}(z)-1$ is greater than $k/2$, then every term of $S_{w,w'}(z)S_{w',w}(z)$ has degree at least $k/2$. We also note that, by Lemma 9, $|S_w(z)S_{w'}(z)|>0$. Therefore, there exists $K_2>0$ such that

$\big|S_w(z)S_{w'}(z) - S_{w,w'}(z)S_{w',w}(z)\big| \ge \big|S_w(z)S_{w'}(z)\big| - \big|S_{w,w'}(z)S_{w',w}(z)\big| > 0 \quad\text{for } k>K_2.$ (38)

Case iii. The only remaining case is where the minimal degrees of $S_w(z)-1$ and $S_{w'}(z)-1$ are both less than or equal to $k/2$. If $w=w_1\cdots w_k$, then $w' = u\,w_1\cdots w_{k-m}$, where u is a word of length $m\ge1$. Then we have

$S_{w',w}(z) = P(w_{k-m+1}\cdots w_k)\, z^m\, S_w(z) + O\big((pz)^{k-m}\big).$ (39)

There exists $K_3>0$ such that

$\big|S_{w',w}(z)\big| \le (p\rho)^m\,\big|S_w(z)\big| + O\big((p\rho)^{k-m}\big) < \big|S_w(z)\big| \quad\text{for } k>K_3.$ (40)

Similarly, we can show that there exists $K_3'$ such that $|S_{w,w'}(z)| < |S_{w'}(z)|$. Therefore, for $k>\max\{K_3,K_3'\}$ we have

$\big|S_w(z)S_{w'}(z) - S_{w,w'}(z)S_{w',w}(z)\big| \ge |S_w(z)|\,|S_{w'}(z)| - |S_{w,w'}(z)|\,|S_{w',w}(z)| > |S_w(z)|\,|S_{w'}(z)| - |S_{w'}(z)|\,|S_w(z)| = 0.$ (41)

We complete the proof by setting $K=\max\{K_1,K_2,K_3,K_3'\}$. □

Lemma 7.

There exist $K_{w,w'}>0$ and $\rho>1$ with $p\rho<1$ such that, for every pair of distinct words w and w′ of length $k\ge K_{w,w'}$, the polynomial

$D_{w,w'}(z) = (1-z)\big(S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)\big) + z^k\big(P(w)(S_{w'}(z)-S_{w,w'}(z)) + P(w')(S_w(z)-S_{w',w}(z))\big)$ (42)

has exactly one root in the disk $|z|\le\rho$.

Proof. 

First note that

$\big|S_{w'}(z) - S_{w,w'}(z)\big| \le \big|S_{w'}(z)\big| + \big|S_{w,w'}(z)\big| \le \dfrac{1}{1-p\rho} + \dfrac{p\rho}{1-p\rho} = \dfrac{1+p\rho}{1-p\rho}.$ (43)

This yields

$\Big| z^k\big(P(w)(S_{w'}(z)-S_{w,w'}(z)) + P(w')(S_w(z)-S_{w',w}(z))\big) \Big| \le (p\rho)^k\Big(\big|S_{w'}(z)-S_{w,w'}(z)\big| + \big|S_w(z)-S_{w',w}(z)\big|\Big) \le (p\rho)^k\,\dfrac{2(1+p\rho)}{1-p\rho}.$ (44)

There exist $K'$ and $K''$ large enough such that, for $k>K'$, we have

$\big|S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)\big| \ge \beta > 0,$

and, for $k>K''$,

$(p\rho)^k\,\dfrac{2(1+p\rho)}{1-p\rho} < (\rho-1)\beta.$

If we define $K_{w,w'}=\max\{K',K''\}$, then we have, for $k\ge K_{w,w'}$ and $|z|=\rho$,

$(p\rho)^k\,\dfrac{2(1+p\rho)}{1-p\rho} < (\rho-1)\beta \le \big|(1-z)\big(S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)\big)\big|.$ (45)

By Rouché’s theorem, since $(1-z)\big(S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)\big)$ has exactly one root in $|z|\le\rho$ (namely $z=1$), $D_{w,w'}(z)$ also has exactly one root in $|z|\le\rho$. □

We denote the root of $D_{w,w'}(z)$ within the disk $|z|\le\rho$ by $\alpha_{w,w'}$, and by bootstrapping we obtain

$\alpha_{w,w'} = 1 + \dfrac{S_{w'}(1)-S_{w,w'}(1)}{S_w(1)S_{w'}(1)-S_{w,w'}(1)S_{w',w}(1)}\,P(w) + \dfrac{S_w(1)-S_{w',w}(1)}{S_w(1)S_{w'}(1)-S_{w,w'}(1)S_{w',w}(1)}\,P(w') + O(p^{2k}).$ (46)

We also denote the derivative of $D_{w,w'}(z)$ at the root $\alpha_{w,w'}$ by $\beta_{w,w'}$, and we obtain

$\beta_{w,w'} = S_{w,w'}(1)S_{w',w}(1) - S_w(1)S_{w'}(1) + O(k p^k).$ (47)

We will refer to these expressions in the residue analysis that we present in the next section.

3.4. Asymptotic Difference

We begin this section with the following lemmas on the autocorrelation polynomials.

Lemma 8

(Jacquet and Szpankowski, 1994). For most words w, the autocorrelation polynomial $S_w(z)$ is very close to 1, with high probability. More precisely, if w is a binary word of length k and $\delta=\sqrt{p}$, there exists $\rho>1$ such that $\rho\delta<1$ and

$\sum_{w\in A^k} [[\,|S_w(\rho)-1| \le \theta(\rho\delta)^k\,]]\; P(w) \ge 1 - \theta\delta^k,$ (48)

where $\theta=(1-p)^{-1}$. We use the Iverson notation

$[[A]] = \begin{cases} 1 & \text{if } A \text{ holds},\\ 0 & \text{otherwise.}\end{cases}$
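For small k, Lemma 8 can be verified by exhaustive enumeration. The sketch below (the parameter choices k = 12, p = 0.6, ρ = 1.02 are ours) computes $S_w(\rho)$ for all $2^k$ words and checks that the probability mass of words with $S_w(\rho)$ far from 1 is at most $\theta\delta^k$:

```python
from itertools import product

p, k, rho = 0.6, 12, 1.02
q = 1 - p
delta = p ** 0.5            # delta = sqrt(p), as in Lemma 8
theta = 1 / (1 - p)

def word_prob(w):
    r = 1.0
    for c in w:
        r *= p if c == "1" else q
    return r

def S(w, z):
    """Autocorrelation polynomial of w, evaluated at z."""
    return sum(word_prob(w[i:]) * z ** (k - i)
               for i in range(1, k + 1) if w[:i] == w[k - i:])

words = ("".join(t) for t in product("01", repeat=k))
mass_close = sum(word_prob(w) for w in words
                 if abs(S(w, rho) - 1) <= theta * (rho * delta) ** k)
print(mass_close, ">=", 1 - theta * delta ** k)
```

In this run, the mass of "close" words comfortably exceeds the guaranteed bound; the exceptional words are the nearly periodic ones (such as the all-ones word), whose total probability is exponentially small in k.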

Lemma 9

(Jacquet and Szpankowski, 1994). There exist $K>0$ and $\rho>1$ with $p\rho<1$ such that, for every binary word w of length $k\ge K$ and for $|z|\le\rho$, we have

$|S_w(z)| > 0.$ (49)

In other words, $S_w(z)$ does not have any roots in $|z|\le\rho$.

Lemma 10.

With high probability, for most distinct pairs $\{w,w'\}$, the correlation polynomial $S_{w,w'}(z)$ is very close to 0. More precisely, if w and w′ are two distinct binary words of length k and $\delta=\sqrt{p}$, there exists $\rho>1$ such that $\rho\delta<1$ and

$\sum_{w'\in A^k} [[\,|S_{w,w'}(\rho)| \le \theta(\rho\delta)^k\,]]\; P(w') \ge 1 - \theta\delta^k.$ (50)

We will use the above results to prove that the expected values in the Bernoulli model and the model built over a trie are asymptotically equivalent. We now prove Theorem 1 below.

Proof of Theorem 1.

From Lemmas 3 and 4, we have

$H_k(z) = \sum_{w\in A^k}\left(\dfrac{1}{1-z} - \dfrac{S_w(z)}{D_w(z)}\right)$

and

$\hat{H}_k(z) = \sum_{w\in A^k}\left(\dfrac{1}{1-z} - \dfrac{1}{1-(1-P(w))z}\right).$

Subtracting the two generating functions, we obtain

$H_k(z) - \hat{H}_k(z) = \sum_{w\in A^k}\left(\dfrac{1}{1-(1-P(w))z} - \dfrac{S_w(z)}{D_w(z)}\right).$ (51)

We define

$\Delta_w(z) = \dfrac{1}{1-(1-P(w))z} - \dfrac{S_w(z)}{D_w(z)}.$ (52)

Therefore, by the Cauchy integral formula (see [20]), we have

$[z^n]\Delta_w(z) = \dfrac{1}{2\pi i}\oint \dfrac{\Delta_w(z)}{z^{n+1}}\,dz = \operatorname{Res}_{z=0}\dfrac{\Delta_w(z)}{z^{n+1}},$ (53)

where the path of integration is a small circle about zero with counterclockwise orientation. We note that the above integrand has poles at $z=0$, $z=\frac{1}{1-P(w)}$, and $z=A_w$ (refer to expression (29)). Therefore, we define

$I_w^{(n)}(\rho) := \dfrac{1}{2\pi i}\oint_{|z|=\rho} \dfrac{\Delta_w(z)}{z^{n+1}}\,dz,$ (54)

where the circle of radius ρ contains all of the above poles. By the residue theorem, we have

$I_w^{(n)}(\rho) = \operatorname{Res}_{z=0}\dfrac{\Delta_w(z)}{z^{n+1}} + \operatorname{Res}_{z=A_w}\dfrac{\Delta_w(z)}{z^{n+1}} + \operatorname{Res}_{z=1/(1-P(w))}\dfrac{\Delta_w(z)}{z^{n+1}}.$ (55)

We observe that

$\operatorname{Res}_{z=A_w}\dfrac{\Delta_w(z)}{z^{n+1}} = -\dfrac{S_w(A_w)}{B_w\,A_w^{n+1}}, \qquad \operatorname{Res}_{z=1/(1-P(w))}\dfrac{\Delta_w(z)}{z^{n+1}} = -(1-P(w))^{n+1},$

where $B_w$ is as in (30). Then, since $\operatorname{Res}_{z=0}\Delta_w(z)/z^{n+1} = [z^n]\Delta_w(z)$, we obtain

$[z^n]\Delta_w(z) = I_w^{(n)}(\rho) + \dfrac{S_w(A_w)}{B_w\,A_w^{n+1}} + (1-P(w))^{n+1},$ (56)

and finally, we have

$[z^n]\big(H_k(z)-\hat{H}_k(z)\big) = \sum_{w\in A^k} [z^n]\Delta_w(z) = \sum_{w\in A^k} I_w^{(n)}(\rho) + \sum_{w\in A^k}\left(\dfrac{S_w(A_w)}{B_w\,A_w^{n+1}} + (1-P(w))^{n+1}\right).$ (57)

First, we show that, for sufficiently large n, the sum $\sum_{w\in A^k}\big(\frac{S_w(A_w)}{B_w A_w^{n+1}} + (1-P(w))^{n+1}\big)$ approaches zero. □

Lemma 11.

For large enough n, and for $k=\Theta(\log n)$, there exists $M>0$ such that

$\sum_{w\in A^k}\left(\dfrac{S_w(A_w)}{B_w\,A_w^{n+1}} + (1-P(w))^{n+1}\right) = O(n^{-M}).$ (58)

Proof. 

We let

$r_w(z) = (1-P(w))^z + \dfrac{S_w(A_w)}{B_w}\,A_w^{-z}.$ (59)

The Mellin transform of the above function is

$r_w^*(s) = \dfrac{\Gamma(s)}{\log^s\frac{1}{1-P(w)}} + \dfrac{S_w(A_w)}{B_w}\cdot\dfrac{\Gamma(s)}{\log^s(A_w)}.$ (60)

We define

$C_w = \dfrac{S_w(A_w)}{B_w} = -\dfrac{S_w(A_w)}{S_w(1)} + O\big(kP(w)\big),$ (61)

which is negative and uniformly bounded for all w. Also, for a fixed s, we have

$\log^s\dfrac{1}{1-P(w)} = \log^s\big(1+P(w)+O(P(w)^2)\big) = \big(P(w)+O(P(w)^2)\big)^s = P(w)^s\big(1+O(P(w))\big)^s = P(w)^s\big(1+O(P(w))\big),$ (62)

$\log^s(A_w) = \log^s\Big(1+\dfrac{P(w)}{S_w(1)}+O(P(w)^2)\Big) = \Big(\dfrac{P(w)}{S_w(1)}+O(P(w)^2)\Big)^s = \dfrac{P(w)^s}{S_w(1)^s}\big(1+O(P(w))\big),$ (63)

and therefore we obtain

$r_w^*(s) = \Gamma(s)\,P(w)^{-s}\big(1 - S_w(1)^{s-1}\big)\,O(1).$ (64)

From this expression, and noticing that the function has a removable singularity at $s=0$, we can see that the Mellin transform $r_w^*(s)$ exists on the strip $\Re(s)>-1$. We still need to investigate the Mellin strip for the sum $\sum_{w\in A^k} r_w^*(s)$. In other words, we need to examine whether summing $r_w^*(s)$ over all words of length k (where k grows with n) has any effect on the analyticity of the function. We observe that

$\sum_{w\in A^k} |r_w^*(s)| = \sum_{w\in A^k}\Big|\Gamma(s)\,P(w)^{-s}\big(1-S_w(1)^{s-1}\big)\,O(1)\Big| \le |\Gamma(s)|\sum_{w\in A^k} P(w)^{-\Re(s)}\,\big|1-S_w(1)^{s-1}\big|\,O(1) \le (q^k)^{-\Re(s)-1}\,|\Gamma(s)|\sum_{w\in A^k} P(w)\,\big|1-S_w(1)^{s-1}\big|\,O(1).$

Lemma 8 allows us to split the above sum between the words for which $S_w(1) \le 1+O(\delta^k)$ and the words that have $S_w(1) > 1+O(\delta^k)$. Such a split yields the following:

$\sum_{w\in A^k} |r_w^*(s)| = (q^k)^{-\Re(s)-1}\,|\Gamma(s)|\,O(\delta^k).$ (65)

This shows that $\sum_{w\in A^k} r_w^*(s)$ is bounded above for $\Re(s)>-1$ and, therefore, it is analytic there. This argument holds for $k=\Theta(\log n)$ as well, as $(q^k)^{-\Re(s)-1}$ would still be bounded above by a constant $M_{s,k}$ that depends on s and k.

We would like to approximate $\sum_{w\in A^k} r_w(z)$ as $z\to\infty$. By the inverse Mellin transform, we have

$\sum_{w\in A^k} r_w(z) = \dfrac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} \sum_{w\in A^k} r_w^*(s)\, z^{-s}\, ds.$ (66)

We choose $c\in(1,M)$ for a fixed $M>0$. Then, by the direct mapping theorem [22], we obtain

$\sum_{w\in A^k} r_w(z) = O(z^{-M}),$ (67)

and subsequently we get

$\sum_{w\in A^k}\left(\dfrac{S_w(A_w)}{B_w\,A_w^{n+1}} + (1-P(w))^{n+1}\right) = O(n^{-M}).$ (68)

We next prove the asymptotic smallness of $I_w^{(n)}(\rho)$ defined in (54).

Lemma 12.

Let

$I_w^{(n)}(\rho) = \dfrac{1}{2\pi i}\oint_{|z|=\rho}\left(\dfrac{1}{1-(1-P(w))z} - \dfrac{S_w(z)}{D_w(z)}\right)\dfrac{dz}{z^{n+1}}.$ (69)

For large n and $k=\Theta(\log n)$, we have

$\sum_{w\in A^k} I_w^{(n)}(\rho) = O\big(\rho^{-n}(\rho\delta)^k\big).$ (70)

Proof. 

We observe that

$\big|I_w^{(n)}(\rho)\big| \le \dfrac{1}{2\pi}\oint_{|z|=\rho}\left|\dfrac{P(w)\,z\,\big(z^{k-1}-S_w(z)\big)}{D_w(z)\,\big(1-(1-P(w))z\big)}\right|\dfrac{|dz|}{|z|^{n+1}}.$ (71)

For $|z|=\rho$, we show that the denominator in (71) is bounded away from zero:

$|D_w(z)| = \big|(1-z)S_w(z) + P(w)z^k\big| \ge |1-z|\,|S_w(z)| - P(w)\,|z|^k \ge (\rho-1)\alpha - (p\rho)^k > 0,$ (72)

where $\alpha>0$ by Lemma 9, and where we assume k to be large enough that $(p\rho)^k < \alpha(\rho-1)$. To find a lower bound for $|1-(1-P(w))z|$, we can choose $K_w$ large enough such that

$|1-(1-P(w))z| \ge \big|1-(1-P(w))\,|z|\big| \ge \big|1-\rho\,(1-p^{K_w})\big| > 0.$ (73)

We now move on to finding an upper bound for the numerator in (71), for $|z|=\rho$:

$\big|z^{k-1}-S_w(z)\big| \le \big|S_w(z)-1\big| + \big|1-z^{k-1}\big| \le \big(S_w(\rho)-1\big) + \big(1+\rho^{k-1}\big) = \big(S_w(\rho)-1\big) + O(\rho^k).$ (74)

Therefore, there exists a constant $\mu>0$ such that

$\big|I_w^{(n)}\big| \le \mu\,\rho\,P(w)\Big(\big(S_w(\rho)-1\big)+O(\rho^k)\Big)\dfrac{1}{\rho^{n+1}} = O(\rho^{-n})\,P(w)\big(S_w(\rho)-1\big) + P(w)\,O(\rho^{-n+k}).$ (75)

Summing over all patterns w and applying Lemma 8, we obtain

$\sum_{w\in A^k} \big|I_w^{(n)}(\rho)\big| = O(\rho^{-n})\sum_{w\in A^k} P(w)\big(S_w(\rho)-1\big) + O(\rho^{-n+k})\sum_{w\in A^k} P(w) = O(\rho^{-n})\left(\theta(\rho\delta)^k + \dfrac{p\rho}{1-p\rho}\,\theta\delta^k\right) + O(\rho^{-n+k}) = O\big(\rho^{-n}(\rho\delta)^k\big),$ (76)

which approaches zero as $n\to\infty$ with $k=\Theta(\log n)$. This completes the proof of Theorem 1. □

Similarly to Theorem 1, we now prove that the second factorial moments of the kth Subword Complexity and the kth Prefix Complexity have the same first-order asymptotic behavior. We are now ready to state the proof of Theorem 2.

Proof of Theorem 2.

As discussed in Lemmas 3 and 4, the generating functions representing $E[(X_{n,k})_2]$ and $E[(\hat{X}_{n,k})_2]$, respectively, are

$G_k(z) = \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\left(\dfrac{1}{1-z} - \dfrac{S_w(z)}{D_w(z)} - \dfrac{S_{w'}(z)}{D_{w'}(z)} + \dfrac{S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)}{D_{w,w'}(z)}\right)$

and

$\hat{G}_k(z) = \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\left(\dfrac{1}{1-z} - \dfrac{1}{1-(1-P(w))z} - \dfrac{1}{1-(1-P(w'))z}\right) + \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\dfrac{1}{1-(1-P(w)-P(w'))z}.$

Note that

$G_k(z) - \hat{G}_k(z) = \sum_{w'\in A^k}\;\sum_{\substack{w\in A^k\\ w\ne w'}}\left(\dfrac{1}{1-(1-P(w))z} - \dfrac{S_w(z)}{D_w(z)}\right)$ (77)

$+ \sum_{w\in A^k}\;\sum_{\substack{w'\in A^k\\ w'\ne w}}\left(\dfrac{1}{1-(1-P(w'))z} - \dfrac{S_{w'}(z)}{D_{w'}(z)}\right)$ (78)

$- \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\left(\dfrac{1}{1-(1-P(w)-P(w'))z} - \dfrac{S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)}{D_{w,w'}(z)}\right).$ (79)

In Theorem 1, we proved that, for every $M>0$ (which does not depend on n or k), the coefficients of

$H_k(z) - \hat{H}_k(z) = \sum_{w\in A^k}\left(\dfrac{1}{1-(1-P(w))z} - \dfrac{S_w(z)}{D_w(z)}\right)$

are $O(n^{-M})$. Therefore, the coefficients of both (77) and (78) are of order $(2^k-1)\,O(n^{-M}) = O(n^{-M+a\log2})$ for $k=a\log n$. Thus, to show the asymptotic smallness, it is enough to choose $M = a\log2 + \epsilon$, where ε is a small positive value. It only remains to show that (79) is asymptotically negligible as well. We define

$\Delta_{w,w'}(z) = \dfrac{1}{1-(1-P(w)-P(w'))z} - \dfrac{S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)}{D_{w,w'}(z)}.$ (80)

Next, we extract the coefficient of $z^n$:

$[z^n]\Delta_{w,w'}(z) = \dfrac{1}{2\pi i}\oint \dfrac{\Delta_{w,w'}(z)}{z^{n+1}}\,dz,$ (81)

where the path of integration is a small circle about the origin with counterclockwise orientation. We define

$I_{w,w'}^{(n)}(\rho) = \dfrac{1}{2\pi i}\oint_{|z|=\rho} \dfrac{\Delta_{w,w'}(z)}{z^{n+1}}\,dz.$ (82)

The above integrand has poles at $z=0$, $z=\alpha_{w,w'}$ (as in (46)), and $z=\frac{1}{1-P(w)-P(w')}$. We have chosen ρ such that the poles are all inside the circle $|z|=\rho$. It follows that

$I_{w,w'}^{(n)}(\rho) = \operatorname{Res}_{z=0}\dfrac{\Delta_{w,w'}(z)}{z^{n+1}} + \operatorname{Res}_{z=\alpha_{w,w'}}\dfrac{\Delta_{w,w'}(z)}{z^{n+1}} + \operatorname{Res}_{z=\frac{1}{1-P(w)-P(w')}}\dfrac{\Delta_{w,w'}(z)}{z^{n+1}},$ (83)

and the residues give us the following:

$\operatorname{Res}_{z=\frac{1}{1-P(w)-P(w')}}\dfrac{\Delta_{w,w'}(z)}{z^{n+1}} = -(1-P(w)-P(w'))^{n+1}$

and

$\operatorname{Res}_{z=\alpha_{w,w'}}\dfrac{\Delta_{w,w'}(z)}{z^{n+1}} = -\dfrac{S_w(\alpha_{w,w'})S_{w'}(\alpha_{w,w'})-S_{w,w'}(\alpha_{w,w'})S_{w',w}(\alpha_{w,w'})}{\beta_{w,w'}\,\alpha_{w,w'}^{n+1}},$

where $\beta_{w,w'}$ is as in (47). Therefore, we get

$\sum_{\substack{w,w'\in A^k\\ w\ne w'}} [z^n]\Delta_{w,w'}(z) = \sum_{\substack{w,w'\in A^k\\ w\ne w'}} I_{w,w'}^{(n)}(\rho) + \sum_{\substack{w,w'\in A^k\\ w\ne w'}}\left(\dfrac{S_w(\alpha_{w,w'})S_{w'}(\alpha_{w,w'})-S_{w,w'}(\alpha_{w,w'})S_{w',w}(\alpha_{w,w'})}{\beta_{w,w'}\,\alpha_{w,w'}^{n+1}} + (1-P(w)-P(w'))^{n+1}\right).$ (84)

We now show that the above two terms are asymptotically small. □

Lemma 13.

There exists $\epsilon>0$ such that the sum

$\sum_{\substack{w,w'\in A^k\\ w\ne w'}}\left(\dfrac{S_w(\alpha_{w,w'})S_{w'}(\alpha_{w,w'})-S_{w,w'}(\alpha_{w,w'})S_{w',w}(\alpha_{w,w'})}{\beta_{w,w'}\,\alpha_{w,w'}^{n+1}} + (1-P(w)-P(w'))^{n+1}\right)$

is of order $O(n^{-\epsilon})$.

Proof. 

We define

$r_{w,w'}(z) = \dfrac{S_w(\alpha_{w,w'})S_{w'}(\alpha_{w,w'})-S_{w,w'}(\alpha_{w,w'})S_{w',w}(\alpha_{w,w'})}{\beta_{w,w'}}\,\alpha_{w,w'}^{-z} + (1-P(w)-P(w'))^z.$

The Mellin transform of the above function is

$r_{w,w'}^*(s) = \dfrac{\Gamma(s)}{\log^s\frac{1}{1-P(w)-P(w')}} + C_{w,w'}\,\dfrac{\Gamma(s)}{\log^s(\alpha_{w,w'})},$ (85)

where $C_{w,w'} = \frac{S_w(\alpha_{w,w'})S_{w'}(\alpha_{w,w'})-S_{w,w'}(\alpha_{w,w'})S_{w',w}(\alpha_{w,w'})}{\beta_{w,w'}}$. We note that $C_{w,w'}$ is negative and uniformly bounded for all $w,w'\in A^k$. For a fixed s, we also have

$\log^s\dfrac{1}{1-P(w)-P(w')} = \log^s\big(1+P(w)+P(w')+O(p^{2k})\big) = \big(P(w)+P(w')+O(p^{2k})\big)^s = \big(P(w)+P(w')\big)^s\big(1+O(p^k)\big),$ (86)

and, using (46),

$\log^s(\alpha_{w,w'}) = \left(\dfrac{(S_{w'}(1)-S_{w,w'}(1))\,P(w) + (S_w(1)-S_{w',w}(1))\,P(w')}{S_w(1)S_{w'}(1)-S_{w,w'}(1)S_{w',w}(1)} + O(p^{2k})\right)^s = \left(\dfrac{(S_{w'}(1)-S_{w,w'}(1))\,P(w) + (S_w(1)-S_{w',w}(1))\,P(w')}{S_w(1)S_{w'}(1)-S_{w,w'}(1)S_{w',w}(1)}\right)^s\big(1+O(p^k)\big).$ (87)

Therefore, we have

$r_{w,w'}^*(s) = \Gamma(s)\,\big(P(w)+P(w')\big)^{-s}\big(1+O(p^k)\big) + C_{w,w'}\,\Gamma(s)\left(\dfrac{(S_{w'}(1)-S_{w,w'}(1))\,P(w) + (S_w(1)-S_{w',w}(1))\,P(w')}{S_w(1)S_{w'}(1)-S_{w,w'}(1)S_{w',w}(1)}\right)^{-s}\big(1+O(p^k)\big).$ (88)

To find the Mellin strip for the sum $\sum_{w,w'} r_{w,w'}^*(s)$, we first note that

$(x+y)^a \le x^a + y^a \quad\text{for any real } x,y>0 \text{ and } a\le1.$

Applying this with $a=-\Re(s)\le1$, we have

$\big(P(w)+P(w')\big)^{-\Re(s)} \le P(w)^{-\Re(s)} + P(w')^{-\Re(s)},$ (89)

and

$\left(\dfrac{(S_{w'}(1)-S_{w,w'}(1))\,P(w) + (S_w(1)-S_{w',w}(1))\,P(w')}{S_w(1)S_{w'}(1)-S_{w,w'}(1)S_{w',w}(1)}\right)^{-\Re(s)} \le \left(\dfrac{(S_{w'}(1)-S_{w,w'}(1))\,P(w)}{S_w(1)S_{w'}(1)-S_{w,w'}(1)S_{w',w}(1)}\right)^{-\Re(s)} + \left(\dfrac{(S_w(1)-S_{w',w}(1))\,P(w')}{S_w(1)S_{w'}(1)-S_{w,w'}(1)S_{w',w}(1)}\right)^{-\Re(s)}.$ (90)

Substituting (89) and (90) into (88), factoring out $(q^k)^{-\Re(s)-1}|\Gamma(s)|$ as in the proof of Lemma 11, and pairing the resulting terms, we obtain a bound of the form

$\sum_{\substack{w,w'\in A^k\\ w\ne w'}} \big|r_{w,w'}^*(s)\big| \le (q^k)^{-\Re(s)-1}\,|\Gamma(s)|\,O(1)\,\big(\Sigma_{(91)}+\Sigma_{(92)}+\Sigma_{(93)}+\Sigma_{(94)}\big),$

where $\Sigma_{(91)}$ and $\Sigma_{(93)}$ collect, for each fixed w′ (respectively, each fixed w), terms of the form $P(w)\,\big|1-S_w(1)^{s-1}\big(1-\frac{S_{w,w'}(1)}{S_{w'}(1)}\big)^{s}\big|$, while $\Sigma_{(92)}$ and $\Sigma_{(94)}$ collect the companion terms carrying a factor $S_{w,w'}(1)$ or $S_{w',w}(1)$.

By Lemma 10, with high probability a randomly selected w′ has the property $S_{w,w'}(1)=O(\delta^k)$, and thus

$\Big(1-\dfrac{S_{w,w'}(1)}{S_{w'}(1)}\Big)^{s} = 1+O(\delta^k).$

With that, and by Lemma 8, for most words w,

$\Big|1-S_w(1)^{s-1}\big(1+O(\delta^k)\big)\Big| = O(\delta^k).$

Therefore, both sums $\Sigma_{(91)}$ and $\Sigma_{(93)}$ are of the form $(2^k-1)\,O(\delta^k)$. The sums $\Sigma_{(92)}$ and $\Sigma_{(94)}$ are also of order $(2^k-1)\,O(\delta^k)$, by Lemma 10. Combining all of these terms, we obtain

$\sum_{\substack{w,w'\in A^k\\ w\ne w'}} \big|r_{w,w'}^*(s)\big| \le (2^k-1)\,(q^k)^{-\Re(s)-1}\,|\Gamma(s)|\,O(\delta^k).$ (95)

By the inverse Mellin transform, for $k=a\log n$, $M=a\log2+\epsilon$, and $c\in(1,M)$, we have

$\sum_{\substack{w,w'\in A^k\\ w\ne w'}} r_{w,w'}(z) = \dfrac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} \sum_{\substack{w,w'\in A^k\\ w\ne w'}} r_{w,w'}^*(s)\, z^{-s}\, ds = O(z^{-M})\,O(2^k) = O(z^{-\epsilon}).$ (96)

In the following lemma we show that the first term in (85) is asymptotically small.

Lemma 14.

Recall that

I_n^{w,w'}(\rho) = \frac{1}{2\pi i}\oint_{|z|=\rho} \Delta_{w,w'}(z)\,\frac{dz}{z^{n+1}}.

We have

\sum_{\substack{w,w'\in\mathcal A^k\\ w\ne w'}} I_n^{w,w'}(\rho) = O\!\left(\rho^{-n+2k}\,\delta^k\right). (97)

Proof. 

First note that

\Delta_{w,w'}(z) = \frac{1}{1-(1-P(w)-P(w'))z} - \frac{S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)}{D_{w,w'}(z)} = \frac{zP(w)\left(S_{w,w'}(z)S_{w',w}(z)-S_w(z)S_{w'}(z)+z^{k-1}S_{w'}(z)-z^{k-1}S_{w',w}(z)\right)}{\left(1-(1-P(w)-P(w'))z\right)D_{w,w'}(z)} + \frac{zP(w')\left(S_{w,w'}(z)S_{w',w}(z)-S_w(z)S_{w'}(z)+z^{k-1}S_w(z)-z^{k-1}S_{w,w'}(z)\right)}{\left(1-(1-P(w)-P(w'))z\right)D_{w,w'}(z)}. (98)

We saw in (73) that \left|1-(1-P(w))z\right| \ge c_2, and therefore it follows that

\left|1-(1-P(w)-P(w'))z\right| \ge c_1. (99)

For |z| = \rho, \left|D_{w,w'}(z)\right| is also bounded from below, as follows:

\left|D_{w,w'}(z)\right| = \left|(1-z)\left(S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)\right) + z^k\!\left(P(w)\left(S_{w'}(z)-S_{w',w}(z)\right)+P(w')\left(S_w(z)-S_{w,w'}(z)\right)\right)\right| \ge \left|(1-z)\left(S_w(z)S_{w'}(z)-S_{w,w'}(z)S_{w',w}(z)\right)\right| - \rho^k\left(P(w)\left|S_{w'}(z)-S_{w',w}(z)\right|+P(w')\left|S_w(z)-S_{w,w'}(z)\right|\right) \ge (\rho-1)\,\beta - (p\rho)^k\,\frac{2(1+p\rho)}{1-p\rho}, (100)

which is bounded away from zero by the assumption of Lemma 7. Additionally, we show that the numerator in (98) is bounded from above, as follows:

\left|S_{w,w'}(z)S_{w',w}(z)-S_w(z)S_{w'}(z)+z^{k-1}S_{w'}(z)-z^{k-1}S_{w',w}(z)\right| \le \left|S_{w'}(z)\left(z^{k-1}-S_w(z)\right)\right| + \left|S_{w',w}(z)\left(S_{w,w'}(z)-z^{k-1}\right)\right| \le S_{w'}(\rho)\left(S_w(\rho)-1\right)+O(\rho^k) + S_{w,w'}(\rho)\,S_{w',w}(\rho)+O(\rho^k). (101)

This yields

\sum_{\substack{w,w'\in\mathcal A^k\\ w\ne w'}}\left|I_n^{w,w'}\right| \le O(\rho^{-n})\sum_{w'\in\mathcal A^k} S_{w'}(\rho)\sum_{w\in\mathcal A^k} P(w)\left(\left(S_w(\rho)-1\right)+O(\rho^k)\right) + O(\rho^{-n})\sum_{w'\in\mathcal A^k}\,\sum_{w\in\mathcal A^k} P(w)\left(S_{w,w'}(\rho)\,S_{w',w}(\rho)+O(\rho^k)\right). (102)

By (75), the first term above is of order (2^k-1)\,O(\rho^{-n+k}), and by Lemma 10 and an analysis similar to (75), the second term is of order (2^k-1)\,O(\rho^{-n+k}) as well. Finally, we have

\sum_{\substack{w,w'\in\mathcal A^k\\ w\ne w'}}\left|I_n^{w,w'}\right| = O\!\left(\rho^{-n+2k}\,\delta^k\right),

which tends to zero asymptotically for k = \Theta(\log n). □

This lemma completes our proof of Theorem 2.
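Aside from the asymptotic bounds, Cauchy coefficient integrals of the kind used for I_n^{w,w'}(\rho) can be spot-checked numerically. The following sketch (our illustration with an arbitrarily chosen test function; not part of the original argument) approximates \frac{1}{2\pi i}\oint_{|z|=\rho} f(z)\,\frac{dz}{z^{n+1}} by the trapezoid rule on the circle |z| = \rho, using f(z) = 1/(1-cz), whose nth Taylor coefficient is c^n:

```python
import cmath

def nth_coefficient(f, n, rho, m=4096):
    """Approximate [z^n] f(z) via the Cauchy integral
    (1/(2*pi*i)) * contour integral of f(z)/z^(n+1) over |z| = rho,
    using m trapezoid nodes.  The trapezoid rule is spectrally accurate
    for f analytic on an annulus containing the circle; the error is an
    aliasing term of size roughly [z^(n+m)] f * rho^m."""
    acc = 0j
    for i in range(m):
        z = rho * cmath.exp(2j * cmath.pi * i / m)
        acc += f(z) / z ** n
    return acc / m

# f(z) = 1/(1 - 0.4 z) has nth coefficient 0.4^n; its pole sits at
# z = 2.5, so any rho < 2.5 is a valid integration circle.
approx = nth_coefficient(lambda z: 1.0 / (1.0 - 0.4 * z), 8, rho=1.5)
assert abs(approx - 0.4 ** 8) < 1e-12
```

The same quadrature applies verbatim to \Delta_{w,w'}(z)/z^{n+1} once the correlation polynomials S_w, S_{w,w'} are available.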

3.5. Asymptotic Analysis of the kth Prefix Complexity

We finally proceed to the asymptotic analysis of the moments of the kth Prefix Complexity. The results obtained hold true for the moments of the kth Subword Complexity as well. Our methodology involves poissonization, saddle point analysis (the complex-analytic analogue of Laplace's method [23]), and depoissonization.

Lemma 15

(Jacquet and Szpankowski, 1998). Let \tilde G(z) be the Poisson transform of a sequence g_n. Suppose that \tilde G(z) is analytic in a linear cone S_\theta with \theta < \pi/2, and that the following two conditions hold:

(I) For z \in S_\theta with |z| > r, and real constants B, r > 0, \nu,

\left|\tilde G(z)\right| \le B\,|z|^{\nu}\,\Psi(|z|), (103)

where \Psi(x) is such that, for fixed t, \lim_{x\to\infty}\frac{\Psi(tx)}{\Psi(x)} = 1;

(II) For z \notin S_\theta with |z| > r, and constants A and \alpha < 1,

\left|\tilde G(z)\,e^{z}\right| \le A\,e^{\alpha|z|}. (104)

Then, for every non-negative integer n, we have

g_n = \tilde G(n) + O\!\left(n^{\nu-1}\,\Psi(n)\right).

On the Expected Value: To transform the sequence of interest, (E[X^n,k])n0, into a Poisson model, we recall that in (25) we found

E\left[\hat X_{n,k}\right] = \sum_{w\in\mathcal A^k}\left(1-\left(1-P(w)\right)^{n}\right).

Thus, the Poisson transform is

\tilde E_k(z) = \sum_{n=0}^{\infty} E\left[\hat X_{n,k}\right]\frac{z^n}{n!}\,e^{-z} = \sum_{n=0}^{\infty}\sum_{w\in\mathcal A^k}\left(1-(1-P(w))^n\right)\frac{z^n}{n!}\,e^{-z} = \sum_{w\in\mathcal A^k}\left(1-e^{-zP(w)}\right). (105)
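Before turning to the Mellin transform, the closed form for the expected value can be verified by exhaustive enumeration for tiny parameter values. The sketch below (illustrative values of n, k, and p; our check, not part of the paper) computes the expected kth Prefix Complexity both from the formula \sum_{w}\left(1-(1-P(w))^n\right) and by summing over all (2^k)^n possible prefix assignments of the n independent strings:

```python
from itertools import product

def expected_prefix_complexity(n, k, p):
    """Closed form E[X_hat_{n,k}] = sum over words w of 1 - (1 - P(w))^n."""
    q = 1.0 - p
    return sum(
        1.0 - (1.0 - p ** w.count("0") * q ** w.count("1")) ** n
        for w in product("01", repeat=k)
    )

def expected_by_enumeration(n, k, p):
    """Exact E[# distinct k-prefixes] by enumerating every possible
    assignment of a k-prefix to each of the n independent strings."""
    q = 1.0 - p
    words = list(product("01", repeat=k))
    probs = [p ** w.count("0") * q ** w.count("1") for w in words]
    total = 0.0
    for outcome in product(range(2 ** k), repeat=n):
        pr = 1.0
        for i in outcome:
            pr *= probs[i]
        total += pr * len(set(outcome))
    return total

assert abs(expected_prefix_complexity(3, 2, 0.7)
           - expected_by_enumeration(3, 2, 0.7)) < 1e-12
```

The two routines agree because, by linearity of expectation, each word w contributes exactly its appearance probability 1-(1-P(w))^n.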

To asymptotically evaluate this harmonic sum, we turn our attention to the Mellin Transform once more. The Mellin transform of E˜k(z) is

\tilde E^*_k(s) = -\,\Gamma(s)\sum_{w\in\mathcal A^k} P(w)^{-s} = -\,\Gamma(s)\left(p^{-s}+q^{-s}\right)^k, (106)

which has the fundamental strip ⟨-1, 0⟩. For c \in (-1, 0), the inverse Mellin integral is the following

\tilde E_k(z) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}\tilde E^*_k(s)\,z^{-s}\,ds = -\,\frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}\Gamma(s)\,z^{-s}\left(p^{-s}+q^{-s}\right)^k ds = -\,\frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}\Gamma(s)\,e^{k\left(-\frac{s}{a}+\log\left(p^{-s}+q^{-s}\right)\right)} ds = -\,\frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}\Gamma(s)\,e^{k\,h(s)}\,ds, (107)

where we define h(s) = -\frac{s}{a} + \log\left(p^{-s}+q^{-s}\right) for k = a\log z. We emphasize that the above integral involves k, and k grows with n. We evaluate the integral through saddle point analysis; therefore, we choose the line of integration to cross the saddle point r_0. To find the saddle point r_0, we set h'(r_0) = 0, and we obtain

\left(\frac{p}{q}\right)^{r_0} = \frac{a\log p^{-1}-1}{1-a\log q^{-1}}, (108)

and therefore,

r_0 = \frac{1}{\log(p/q)}\,\log\!\left(\frac{1-a\log p^{-1}}{a\log q^{-1}-1}\right), (109)

where \frac{1}{\log q^{-1}} < a < \frac{1}{\log p^{-1}}.
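As a numerical cross-check of (108) and (109) (with hypothetical values p = 0.7 and a = 1.5, chosen inside the admissible interval; this sketch is ours, not the authors'), the saddle point can also be located by bisection on h'(r) = 0:

```python
import math

def h_prime(r, p, a):
    """h'(r) for h(s) = -s/a + log(p^-s + q^-s), with q = 1 - p."""
    q = 1.0 - p
    num = p ** (-r) * math.log(1.0 / p) + q ** (-r) * math.log(1.0 / q)
    return -1.0 / a + num / (p ** (-r) + q ** (-r))

def saddle_point(p, a, lo=-50.0, hi=50.0):
    """Bisection for the root of h'(r) = 0; h is convex on the real
    line, so h' is increasing and the root is unique."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if h_prime(lo, p, a) * h_prime(mid, p, a) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

p, a = 0.7, 1.5   # hypothetical; a must lie in (1/log(1/q), 1/log(1/p))
q = 1.0 - p
r0_closed = math.log((1.0 - a * math.log(1.0 / p))
                     / (a * math.log(1.0 / q) - 1.0)) / math.log(p / q)
r0_num = saddle_point(p, a)
assert abs(r0_num - r0_closed) < 1e-9
```

The bisection bracket [-50, 50] is ad hoc; it suffices because h'(r) tends to -1/a + \log p^{-1} < 0 as r \to -\infty and to -1/a + \log q^{-1} > 0 as r \to \infty whenever a lies in the admissible interval.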

By (108) and the fact that (p/q)^{it_j} = 1 for t_j = \frac{2\pi j}{\log(p/q)} and j \in \mathbb Z, we can see that there are actually infinitely many saddle points z_j of the form r_0 + it_j on the line of integration.

We remark that the location of r_0 depends on the value of a. We have r_0 \to \infty as a \to \frac{1}{\log q^{-1}}, and r_0 \to -\infty as a \to \frac{1}{\log p^{-1}}. We divide the analysis into three parts, for the three ranges r_0 \in (0,\infty), r_0 \in (-1,0), and r_0 \in (-\infty,-1).

In the first range, which corresponds to

\frac{1}{\log q^{-1}} < a < \frac{2}{\log q^{-1}+\log p^{-1}}, (110)

we perform a residue analysis, taking into account the dominant pole at s = 0. In the second range, we have

\frac{2}{\log q^{-1}+\log p^{-1}} < a < \frac{1}{q\log q^{-1}+p\log p^{-1}}, (111)

and we get the asymptotic result through the saddle point method. The last range corresponds to

\frac{1}{q\log q^{-1}+p\log p^{-1}} < a < \frac{1}{\log p^{-1}}, (112)

and we approach it with a combination of residue analysis at s = -1 and the saddle point method. We now proceed by stating the proof of Theorem 3.

Proof of Theorem 3.

We begin by proving part ii, which requires a saddle point analysis. We rewrite the inverse Mellin transform, with the line of integration at \Re(s) = r_0, as

\tilde E_k(z) = -\,\frac{1}{2\pi}\int_{-\infty}^{\infty} z^{-(r_0+it)}\,\Gamma(r_0+it)\left(p^{-(r_0+it)}+q^{-(r_0+it)}\right)^k dt = -\,\frac{1}{2\pi}\int_{-\infty}^{\infty}\Gamma(r_0+it)\,e^{k\,h(r_0+it)}\,dt. (113)

Step one: Saddle points’ contribute to the integral estimation

First, we show that the saddle points with |t_j| > \log n do not make a significant asymptotic contribution to the integral. To show this, we let

T_k(z) = \int_{|t|>\log n} z^{-r_0-it}\,\Gamma(r_0+it)\left(p^{-r_0-it}+q^{-r_0-it}\right)^k dt. (114)

Since \left|\Gamma(r_0+it)\right| = O\!\left(|t|^{\,r_0-\frac{1}{2}}\,e^{-\pi|t|/2}\right) as |t| \to \pm\infty, we observe that

T_k(z) = O\!\left(z^{-r_0}\left(p^{-r_0}+q^{-r_0}\right)^k\int_{\log n}^{\infty} t^{\,r_0-1/2}\,e^{-\pi t/2}\,dt\right) = O\!\left(z^{-r_0}\left(p^{-r_0}+q^{-r_0}\right)^k\,(\log n)^{\,r_0-1/2}\,e^{-\pi\log n/2}\right), (115)

which is very small for large n. Note that for t \in (\log n, \infty), the factor t^{\,r_0-1/2}\,e^{-\pi t/4} is eventually decreasing and bounded above by (\log n)^{\,r_0-1/2}\,e^{-\pi\log n/4}, while the remaining factor e^{-\pi t/4} integrates to O\!\left(e^{-\pi\log n/4}\right).

Step two: Partitioning the integral

There are now only finitely many saddle points to work with. We split the integral range into sub-intervals, each of which contains exactly one saddle point. This way, each integral has a contour traversing a single saddle point, and we will be able to estimate the dominant contribution in each integral from a small neighborhood around the saddle point. Assuming that j^* is the largest j for which \frac{2\pi j}{\log(p/q)} \le \log n, we split the integral \tilde E_k(z) as follows

\tilde E_k(z) = -\,\frac{1}{2\pi}\sum_{|j|<j^*}\int_{|t-t_j|\le\frac{\pi}{\log(p/q)}} z^{-r_0-it}\,\Gamma(r_0+it)\left(p^{-r_0-it}+q^{-r_0-it}\right)^k dt \;-\; \frac{1}{2\pi}\int_{\substack{\frac{\pi}{\log(p/q)}\le|t-t_{j^*}|\\ |t|\le\log n}} z^{-r_0-it}\,\Gamma(r_0+it)\left(p^{-r_0-it}+q^{-r_0-it}\right)^k dt. (116)

By the same argument as in (115), the second term in (116) is also asymptotically negligible. Therefore, we are only left with

\tilde E_k(z) = \sum_{|j|<j^*} S_j(z), (117)

where S_j(z) = -\,\frac{1}{2\pi}\int_{|t-t_j|\le\frac{\pi}{\log(p/q)}} z^{-r_0-it}\,\Gamma(r_0+it)\left(p^{-r_0-it}+q^{-r_0-it}\right)^k dt.

Step three: Splitting the saddle contour

For each integral S_j, we write the expansion of h about t_j (with h(t) denoting h(r_0+it), so that h'(t_j) = 0) as follows

h(t) = h(t_j) + \frac{1}{2}\,h''(t_j)\,(t-t_j)^2 + O\!\left((t-t_j)^3\right). (118)

The main contribution to the integral estimate should come from a small integration path on which k\,h(t) reduces to its quadratic expansion about t_j. In other words, we want the integration path to be such that

k\,(t-t_j)^2 \to \infty, \quad\text{and}\quad k\,(t-t_j)^3 \to 0. (119)

The above conditions hold when k^{-1/2} \ll |t-t_j| \ll k^{-1/3}. Thus, we choose the integration path to be |t-t_j| \le k^{-2/5}: at this width, k\,(t-t_j)^2 = k^{1/5} \to \infty while k\,|t-t_j|^3 = k^{-1/5} \to 0. Therefore, we have

S_j(z) = -\,\frac{1}{2\pi}\int_{|t-t_j|\le k^{-2/5}} z^{-r_0-it}\,\Gamma(r_0+it)\left(p^{-r_0-it}+q^{-r_0-it}\right)^k dt \;-\; \frac{1}{2\pi}\int_{k^{-2/5}<|t-t_j|\le\frac{\pi}{\log(p/q)}} z^{-r_0-it}\,\Gamma(r_0+it)\left(p^{-r_0-it}+q^{-r_0-it}\right)^k dt. (120)

Saddle Tails Pruning.

We show that the integral is small for k^{-2/5} < |t-t_j| \le \frac{\pi}{\log(p/q)}. We define

S_j^{(1)}(z) = -\,\frac{1}{2\pi}\int_{k^{-2/5}<|t-t_j|\le\frac{\pi}{\log(p/q)}} z^{-r_0-it}\,\Gamma(r_0+it)\left(p^{-r_0-it}+q^{-r_0-it}\right)^k dt. (121)

Note that for |t-t_j| \le \frac{\pi}{\log(p/q)}, we have

\left|p^{-r_0-it}+q^{-r_0-it}\right| = \left(p^{-r_0}+q^{-r_0}\right)\sqrt{1-\frac{2\,p^{-r_0}q^{-r_0}}{\left(p^{-r_0}+q^{-r_0}\right)^2}\left(1-\cos\left(t\log(p/q)\right)\right)}
\le \left(p^{-r_0}+q^{-r_0}\right)\left(1-\frac{p^{-r_0}q^{-r_0}}{\left(p^{-r_0}+q^{-r_0}\right)^2}\left(1-\cos\left((t-t_j)\log(p/q)\right)\right)\right) \quad\text{since } \sqrt{1-x}\le 1-\tfrac{x}{2} \text{ for } x\in[0,1]
\le \left(p^{-r_0}+q^{-r_0}\right)\left(1-\frac{2\,p^{-r_0}q^{-r_0}}{\pi^2\left(p^{-r_0}+q^{-r_0}\right)^2}\left((t-t_j)\log(p/q)\right)^2\right) \quad\text{since } 1-\cos x\ge \tfrac{2x^2}{\pi^2} \text{ for } |x|\le\pi
\le \left(p^{-r_0}+q^{-r_0}\right)e^{-\gamma(t-t_j)^2}, (122)

where \gamma = \frac{2\,p^{-r_0}q^{-r_0}\log^2(p/q)}{\pi^2\left(p^{-r_0}+q^{-r_0}\right)^2}. (Here we used \cos\left(t\log(p/q)\right) = \cos\left((t-t_j)\log(p/q)\right), since t_j\log(p/q) = 2\pi j.) Thus,

S_j^{(1)}(z) = O\!\left(z^{-r_0}\left(p^{-r_0}+q^{-r_0}\right)^k\int_{k^{-2/5}}^{\infty} e^{-\gamma k u^2}\,du\right) = O\!\left(z^{-r_0}\left(p^{-r_0}+q^{-r_0}\right)^k\,k^{-3/5}\,e^{-\gamma k^{1/5}}\right), \quad\text{since } \operatorname{erfc}(x) = O\!\left(e^{-x^2}/x\right). (123)

Central Approximation.

Over the main path, the integrals are of the form

S_j^{(0)}(z) = -\,\frac{1}{2\pi}\int_{|t-t_j|\le k^{-2/5}}\Gamma(r_0+it)\,z^{-r_0-it}\left(p^{-r_0-it}+q^{-r_0-it}\right)^k dt = -\,\frac{1}{2\pi}\int_{|t-t_j|\le k^{-2/5}}\Gamma(r_0+it)\,e^{k\,h(t)}\,dt.

We have

h''(t_j) = -\,\frac{\log^2(p/q)}{\left((p/q)^{r_0/2}+(p/q)^{-r_0/2}\right)^2}, (124)

and

p^{-r_0-it_j}+q^{-r_0-it_j} = p^{-it_j}\left(p^{-r_0}+q^{-r_0}\right). (125)

Therefore, by Laplace’s theorem (refer to [22]) we obtain

S_j^{(0)}(z) = \frac{-\,\Gamma(r_0+it_j)\,e^{k\,h(t_j)}}{\sqrt{2\pi k\left|h''(t_j)\right|}}\left(1+O\!\left(k^{-1/2}\right)\right) = \frac{(p/q)^{r_0/2}+(p/q)^{-r_0/2}}{\sqrt{2\pi}\,\log(p/q)}\;\frac{-\,\Gamma(r_0+it_j)\,z^{-r_0-it_j}\left(p^{-r_0}+q^{-r_0}\right)^k p^{-ikt_j}}{k^{1/2}}\left(1+O\!\left(\frac{1}{k}\right)\right). (126)

We finally sum over all j with |j| < j^*, and we get

\tilde E_k(z) = \frac{(p/q)^{r_0/2}+(p/q)^{-r_0/2}}{\sqrt{2\pi}\,\log(p/q)}\,\sum_{|j|<j^*}\frac{-\,\Gamma(r_0+it_j)\,z^{-r_0-it_j}\left(p^{-r_0}+q^{-r_0}\right)^k p^{-ikt_j}}{k^{1/2}}\left(1+O\!\left(\frac{1}{k}\right)\right). (127)

We can rewrite E˜k(z) as

\tilde E_k(z) = \Phi_1\!\left((1+a\log p)\log_{p/q} n\right)\,\frac{z^{\nu}}{\sqrt{\log n}}\left(1+O\!\left(\frac{1}{\log n}\right)\right), (128)

where \nu = -r_0 + a\log\left(p^{-r_0}+q^{-r_0}\right), and

\Phi_1(x) = -\,\frac{(p/q)^{r_0/2}+(p/q)^{-r_0/2}}{\sqrt{2a\pi}\,\log(p/q)}\,\sum_{|j|<j^*}\Gamma(r_0+it_j)\,e^{2\pi i j x}. (129)

For part i, we move the line of integration to r_0 \in (0,\infty). Note that in this range, we must consider the contribution of the pole at s = 0. We have

\tilde E_k(z) = -\operatorname{Res}_{s=0}\left[\tilde E^*_k(s)\,z^{-s}\right] + \frac{1}{2\pi i}\int_{r_0-i\infty}^{r_0+i\infty}\tilde E^*_k(s)\,z^{-s}\,ds. (130)

Computing the residue at s = 0, which yields 2^k, and applying the same saddle point analysis as in part ii to the remaining integral, we arrive at

\tilde E_k(z) = 2^k + \Phi_1\!\left((1+a\log p)\log_{p/q} n\right)\,\frac{z^{\nu}}{\sqrt{\log n}}\left(1+O\!\left(\frac{1}{\log n}\right)\right). (131)

For part iii of Theorem 3, we shift the line of integration to c_0 \in (-2,-1); then we have

\tilde E_k(z) = \operatorname{Res}_{s=-1}\left[\tilde E^*_k(s)\,z^{-s}\right] + \frac{1}{2\pi i}\int_{c_0-i\infty}^{c_0+i\infty}\tilde E^*_k(s)\,z^{-s}\,ds = z + O\!\left(z^{-c_0}\left(p^{-c_0}+q^{-c_0}\right)^k\right) = z + O\!\left(z^{\nu_0}\right), (132)

where \nu_0 = -c_0 + a\log\left(p^{-c_0}+q^{-c_0}\right) < 1.

Step four: Asymptotic depoissonization

To show that both conditions of Lemma 15 hold for \tilde E_k(z), we extend the real values z to complex values z = ne^{i\theta}, where |\theta| < \pi/2. To prove (103), we note that

\left|e^{-i\theta(r_0+it)}\,\Gamma(r_0+it)\right| = O\!\left(|t|^{\,r_0-1/2}\,e^{t\theta-\pi|t|/2}\right), (133)

and therefore

\tilde E_k(ne^{i\theta}) = -\,\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-i\theta(r_0+it)}\,n^{-r_0-it}\,\Gamma(r_0+it)\left(p^{-r_0-it}+q^{-r_0-it}\right)^k dt (134)

is absolutely convergent for |θ|<π/2. The same saddle point analysis applies here and we obtain

\left|\tilde E_k(z)\right| \le B\,\frac{\left|z^{\nu}\right|}{\sqrt{\log n}}, (135)

where B = \left|\Phi_1\!\left((1+a\log p)\log_{p/q} n\right)\right|, and \nu is as in (128). Condition (103) is therefore satisfied. To prove condition (104), we see that, for a fixed k,

\left|\tilde E_k(z)\,e^{z}\right| \le \sum_{w\in\mathcal A^k}\left|e^{z}-e^{z(1-P(w))}\right| \le 2^{k+1}\,e^{|z|\cos\theta}. (136)

Therefore, we have

E\left[\hat X_{n,k}\right] = \tilde E_k(n) + O\!\left(\frac{n^{\nu-1}}{\sqrt{\log n}}\right). (137)

This completes the proof of Theorem 3. □

On the Second Factorial Moment: We poissonize the sequence \left(E\left[(\hat X_{n,k})_2\right]\right)_{n\ge0} as well. By the analysis in (27),

E\left[(\hat X_{n,k})_2\right] = \sum_{\substack{w,w'\in\mathcal A^k\\ w\ne w'}}\left(1-(1-P(w))^n-(1-P(w'))^n+(1-P(w)-P(w'))^n\right),

which gives the following poissonized form

\tilde G_k(z) = \sum_{n\ge0} E\left[(\hat X_{n,k})_2\right]\frac{z^n}{n!}\,e^{-z} = \sum_{\substack{w,w'\in\mathcal A^k\\ w\ne w'}}\left(1-e^{-P(w)z}-e^{-P(w')z}+e^{-(P(w)+P(w'))z}\right) = \sum_{\substack{w,w'\in\mathcal A^k\\ w\ne w'}}\left(1-e^{-P(w)z}\right)\left(1-e^{-P(w')z}\right) = \left(\sum_{w\in\mathcal A^k}\left(1-e^{-P(w)z}\right)\right)^2 - \sum_{w\in\mathcal A^k}\left(1-e^{-P(w)z}\right)^2 = \left(\tilde E_k(z)\right)^2 - \sum_{w\in\mathcal A^k}\left(1-2e^{-P(w)z}+e^{-2P(w)z}\right). (138)
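The rearrangement in (138) is elementary, but it is easy to spot-check numerically (illustrative parameter values below; the check is ours, not part of the paper):

```python
import math
from itertools import product

p, q, k, z = 0.6, 0.4, 3, 5.0          # hypothetical spot-check values
words = list(product("01", repeat=k))
P = {w: p ** w.count("0") * q ** w.count("1") for w in words}

# Left side of (138): sum over ordered pairs of distinct words.
pair_sum = sum(
    (1.0 - math.exp(-P[w] * z)) * (1.0 - math.exp(-P[v] * z))
    for w in words for v in words if w != v
)
E_tilde = sum(1.0 - math.exp(-P[w] * z) for w in words)          # as in (105)
leftover = sum((1.0 - math.exp(-P[w] * z)) ** 2 for w in words)  # as in (139)
assert abs(pair_sum - (E_tilde ** 2 - leftover)) < 1e-9
```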

We show that, in all ranges of a, the leftover sum in (138) has a lower order contribution to \tilde G_k(z) compared to \left(\tilde E_k(z)\right)^2. We define

\tilde L_k(z) = \sum_{w\in\mathcal A^k}\left(1-2e^{-P(w)z}+e^{-2P(w)z}\right). (139)

We take the Mellin transform of \tilde L_k(z), which is

\tilde L^*_k(s) = -\,2\,\Gamma(s)\sum_{w\in\mathcal A^k}P(w)^{-s} + \Gamma(s)\sum_{w\in\mathcal A^k}\left(2P(w)\right)^{-s} = -\,2\,\Gamma(s)\left(p^{-s}+q^{-s}\right)^k + \Gamma(s)\,2^{-s}\left(p^{-s}+q^{-s}\right)^k = 2\,\Gamma(s)\left(p^{-s}+q^{-s}\right)^k\left(2^{-s-1}-1\right), (140)

and we note that the fundamental strip of this Mellin transform is ⟨-2, 0⟩. The inverse Mellin transform, for c \in (-2, 0), is

\tilde L_k(z) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}\tilde L^*_k(s)\,z^{-s}\,ds = \frac{1}{\pi i}\int_{c-i\infty}^{c+i\infty}\Gamma(s)\left(p^{-s}+q^{-s}\right)^k\left(2^{-s-1}-1\right)z^{-s}\,ds. (141)

We note that this range of r_0 corresponds to

\frac{2}{\log q^{-1}+\log p^{-1}} < a < \frac{p^2+q^2}{q^2\log q^{-1}+p^2\log p^{-1}}. (142)

The integrand in (141) is quite similar to the one seen in (107). The only difference is the extra factor 2^{-s-1}-1; however, this factor is analytic and bounded in the strip. Thus, we obtain the same saddle points, with the same real part as in (109) and the same imaginary parts of the form \frac{2\pi j}{\log(p/q)}, j \in \mathbb Z. The same saddle point analysis for the integral in (107) therefore applies to \tilde L_k(z) as well. We avoid repeating the similar steps and skip to the central approximation, where, by Laplace's theorem (cf. [22]), we get

\tilde L_k(z) = \frac{(p/q)^{r_0/2}+(p/q)^{-r_0/2}}{\sqrt{2\pi}\,\log(p/q)}\,\sum_{|j|<j^*}\frac{2\left(2^{-r_0-1-it_j}-1\right)\Gamma(r_0+it_j)\,z^{-r_0-it_j}\left(p^{-r_0}+q^{-r_0}\right)^k p^{-ikt_j}}{k^{1/2}}\left(1+O\!\left(\frac{1}{k}\right)\right), (143)

which can be represented as

\tilde L_k(z) = \Phi_2\!\left((1+a\log p)\log_{p/q} n\right)\,\frac{z^{\nu}}{\sqrt{\log n}}\left(1+O\!\left(\frac{1}{\log n}\right)\right), (144)

where

\Phi_2(x) = \frac{(p/q)^{r_0/2}+(p/q)^{-r_0/2}}{\sqrt{2a\pi}\,\log(p/q)}\,\sum_{|j|<j^*} 2\left(2^{-r_0-1-it_j}-1\right)\Gamma(r_0+it_j)\,e^{2\pi i j x}. (145)

This shows that \tilde L_k(z) = O\!\left(\frac{z^{\nu}}{\sqrt{\log n}}\right) when

\frac{2}{\log q^{-1}+\log p^{-1}} < a < \frac{p^2+q^2}{q^2\log q^{-1}+p^2\log p^{-1}}.

Subsequently, for \frac{1}{\log q^{-1}} < a < \frac{2}{\log q^{-1}+\log p^{-1}}, we get

\tilde L_k(z) = 2^k + \Phi_2\!\left((1+a\log p)\log_{p/q} n\right)\,\frac{z^{\nu}}{\sqrt{\log n}}\left(1+O\!\left(\frac{1}{\log n}\right)\right), (146)

and for \frac{p^2+q^2}{q^2\log q^{-1}+p^2\log p^{-1}} < a < \frac{1}{\log p^{-1}}, we get

\tilde L_k(z) = o\!\left(n^{2}\right). (147)

It is not difficult to see that, for each range of a stated above, \tilde L_k(z) has a lower order contribution to the asymptotic expansion of \tilde G_k(z) than \left(\tilde E_k(z)\right)^2. This leads us to Theorem 4, which is proved below.

Proof of Theorem 4. 

It only remains to show that the two depoissonization conditions hold. For condition (103) in Lemma 15, from (135) we have

\left|\tilde G_k(z)\right| \le B^2\,\frac{\left|z^{2\nu}\right|}{\log n}, (148)

and for condition (104) we have, for fixed k,

\left|\tilde G_k(z)\,e^{z}\right| \le \sum_{\substack{w,w'\in\mathcal A^k\\ w\ne w'}}\left|e^{z}-e^{z(1-P(w))}-e^{z(1-P(w'))}+e^{z(1-P(w)-P(w'))}\right| \le 4^{k+1}\,e^{|z|\cos\theta}. (149)

Therefore both depoissonization conditions are satisfied and the desired result follows. □

Corollary. A Remark on the Second Moment and the Variance

For the second moment we have

E\left[\left(\hat X_{n,k}\right)^2\right] = \sum_{\substack{w,w'\in\mathcal A^k\\ w\ne w'}} E\left[\hat X_{n,k}(w)\,\hat X_{n,k}(w')\right] + \sum_{w\in\mathcal A^k} E\left[\hat X_{n,k}(w)\right] = \sum_{\substack{w,w'\in\mathcal A^k\\ w\ne w'}}\left(1-(1-P(w))^n-(1-P(w'))^n+(1-P(w)-P(w'))^n\right) + \sum_{w\in\mathcal A^k}\left(1-(1-P(w))^n\right). (150)

Therefore, by (105) and (138), the Poisson transform of the second moment, which we denote by \tilde G^{(2)}_k(z), is

\tilde G^{(2)}_k(z) = \left(\tilde E_k(z)\right)^2 + \tilde E_k(z) - \sum_{w\in\mathcal A^k}\left(1-2e^{-P(w)z}+e^{-2P(w)z}\right), (151)

which results in the same first-order asymptotics as the second factorial moment. Also, it is not difficult to extend the proof in Chapter 6 to show that the second moments of the two models are asymptotically the same. For the variance, we have

\operatorname{Var}\left[\hat X_{n,k}\right] = E\left[\left(\hat X_{n,k}\right)^2\right] - \left(E\left[\hat X_{n,k}\right]\right)^2 = \sum_{\substack{w,w'\in\mathcal A^k\\ w\ne w'}}\left(1-(1-P(w))^n-(1-P(w'))^n+(1-P(w)-P(w'))^n\right) + \sum_{w\in\mathcal A^k}\left(1-(1-P(w))^n\right) - \sum_{\substack{w,w'\in\mathcal A^k\\ w\ne w'}}\left(1-(1-P(w))^n\right)\left(1-(1-P(w'))^n\right) - \sum_{w\in\mathcal A^k}\left(1-2(1-P(w))^n+(1-P(w))^{2n}\right) = \sum_{w\in\mathcal A^k}\left((1-P(w))^n-(1-P(w))^{2n}\right). (152)

Therefore, the Poisson transform, which we denote by \tilde G^{\mathrm{var}}_k(z), is

\tilde G^{\mathrm{var}}_k(z) = \sum_{w\in\mathcal A^k}\left(e^{-P(w)z}-e^{-\left(2P(w)-P(w)^2\right)z}\right). (153)

The Mellin transform of the above function has the following form

\tilde G^{\mathrm{var}\,*}_k(s) = \Gamma(s)\left(p^{-s}+q^{-s}\right)^k\left(1-2^{-s}\right)\left(1+O(p^k)\right). (154)

This is quite similar to what we saw in (106), which indicates that the variance has the same asymptotic growth as the expected value. However, the variances of the two models do not behave in the same way (cf. Figure 2).
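The poissonization step behind (153) — replacing (1-P(w))^{2n} by e^{-\left(2P(w)-P(w)^2\right)z} — can likewise be spot-checked by truncating the Poisson sum (illustrative values of P and z; our check, not part of the paper):

```python
import math

P, z = 0.15, 4.0            # hypothetical word probability and Poisson rate
acc = 0.0
term = 1.0                  # holds z^n / n!
for n in range(200):
    acc += (1.0 - P) ** (2 * n) * term
    term *= z / (n + 1)
acc *= math.exp(-z)
# Poisson transform of (1-P)^(2n): exp(((1-P)^2 - 1) z) = exp(-(2P - P^2) z).
assert abs(acc - math.exp(-(2.0 * P - P * P) * z)) < 1e-12
```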

4. Summary and Conclusions

We studied the first-order asymptotic growth of the first two (factorial) moments of the kth Subword Complexity. We recall that the kth Subword Complexity of a string of length n is denoted by X_{n,k}, and is defined as the number of distinct subwords of length k that appear in the string. We are interested in the asymptotic analysis when k grows as a function of the string's length; more specifically, we conduct the analysis for k = \Theta(\log n), as n \to \infty.
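For readers who want to experiment, the quantity analyzed here is straightforward to compute directly; a minimal sketch (ours) for binary strings:

```python
def subword_complexity(x: str, k: int) -> int:
    """kth Subword Complexity: the number of distinct contiguous
    subwords of length k appearing in the string x."""
    return len({x[i:i + k] for i in range(len(x) - k + 1)})

# A periodic string has low complexity; a less regular one has more.
assert subword_complexity("10101010", 2) == 2  # only "10" and "01"
assert subword_complexity("10010110", 2) == 4  # "10", "00", "01", "11"
```

For k = \Theta(\log n), this runs in O(nk) time, which is enough for simulations comparing X_{n,k} with the kth Prefix Complexity of an independent trie.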

The analysis is inspired by the earlier work of Jacquet and Szpankowski on the analysis of suffix trees, where they are compared to independent tries (cf. [14]). In our work, we compare the first two moments of the kth Subword Complexity to the kth Prefix Complexity over a random trie built over n independently generated binary strings. We recall that we define the kth Prefix Complexity as the number of distinct prefixes that appear in the trie at level k and lower.

We obtain the generating functions representing the expected value and the second factorial moments as their coefficients, in both settings. We prove that the first two moments have the same asymptotic growth in both models. For deriving the asymptotic behavior, we split the range for k into three intervals. We analyze each range using the saddle point method, in combination with residue analysis. We close our work with some remarks regarding the comparison of the second moment and the variance to the kth Prefix Complexity.

5. Future Challenges

The intervals’ endpoints for a in Theorems 3 and 4 are not investigated in this work. The asymptotic analysis of the end points can be studied using van der Waerden saddle point method [24].

The analogous results are not (yet) known in the case where the underlying probability source has Markovian dependence or in the case of dynamical sources.

Acknowledgments

The authors thank Wojciech Szpankowski and Mireille Régnier for insightful conversations on this topic.

Abbreviations

The following abbreviations are used in this manuscript:

PGF — Probability Generating Function
P — Probability
E — Expected value
Var — Variance
E[(X_{n,k})_2] — The second factorial moment of X_{n,k}

Author Contributions

This paper is based on a Ph.D. dissertation conducted by L.A. under the supervision of M.D.W. All authors have read and agreed to the published version of the manuscript.

Funding

M.D.W.'s research is supported by FFAR Grant 534662, by the USDA NIFA Food and Agriculture Cyberinformatics and Tools (FACT) initiative, by NSF Grant DMS-1246818, by the NSF Science & Technology Center for Science of Information Grant CCF-0939370, and by the Society of Actuaries.

Conflicts of Interest

The authors declare no conflict of interest.

References

  • 1.Ehrenfeucht A., Lee K., Rozenberg G. Subword complexities of various classes of deterministic developmental languages without interactions. Theor. Comput. Sci. 1975;1:59–75. doi: 10.1016/0304-3975(75)90012-2. [DOI] [Google Scholar]
  • 2.Morse M., Hedlund G.A. Symbolic Dynamics. Am. J. Math. 1938;60:815–866. doi: 10.2307/2371264. [DOI] [Google Scholar]
  • 3.Jacquet P., Szpankowski W. Analytic Pattern Matching: From DNA to Twitter. Cambridge University Press; Cambridge, UK: 2015. [Google Scholar]
  • 4.Bell T.C., Cleary J.G., Witten I.H. Text Compression. Prentice-Hall; Upper Saddle River, NJ, USA: 1990. [Google Scholar]
  • 5.Burge C., Campbell A.M., Karlin S. Over-and under-representation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. USA. 1992;89:1358–1362. doi: 10.1073/pnas.89.4.1358. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Fickett J.W., Torney D.C., Wolf D.R. Base compositional structure of genomes. Genomics. 1992;13:1056–1064. doi: 10.1016/0888-7543(92)90019-O. [DOI] [PubMed] [Google Scholar]
  • 7.Karlin S., Burge C., Campbell A.M. Statistical analyses of counts and distributions of restriction sites in DNA sequences. Nucleic Acids Res. 1992;20:1363–1370. doi: 10.1093/nar/20.6.1363. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Karlin S., Mrázek J., Campbell A.M. Frequent Oligonucleotides and Peptides of the Haemophilus Influenzae Genome. Nucleic Acids Res. 1996;24:4263–4272. doi: 10.1093/nar/24.21.4263. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Pevzner P.A., Borodovsky M.Y., Mironov A.A. Linguistics of Nucleotide Sequences II: Stationary Words in Genetic Texts and the Zonal Structure of DNA. J. Biomol. Struct. Dyn. 1989;6:1027–1038. doi: 10.1080/07391102.1989.10506529. [DOI] [PubMed] [Google Scholar]
  • 10.Chen X., Francia B., Li M., Mckinnon B., Seker A. Shared information and program plagiarism detection. IEEE Trans. Inf. Theory. 2004;50:1545–1551. doi: 10.1109/TIT.2004.830793. [DOI] [Google Scholar]
  • 11.Chor B., Horn D., Goldman N., Levy Y., Massingham T. Genomic DNA k-mer spectra: models and modalities. Genome Biol. 2009;10:R108. doi: 10.1186/gb-2009-10-10-r108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Price A.L., Jones N.C., Pevzner P.A. De novo identification of repeat families in large genomes. Bioinformatics. 2005;21:i351–i358. doi: 10.1093/bioinformatics/bti1018. [DOI] [PubMed] [Google Scholar]
  • 13.Janson S., Lonardi S., Szpankowski W. Annual Symposium on Combinatorial Pattern Matching. Springer; Berlin/Heidelberger, Germany: 2004. On the Average Sequence Complexity; pp. 74–88. [Google Scholar]
  • 14.Jacquet P., Szpankowski W. Autocorrelation on words and its applications: Analysis of suffix trees by string-ruler approach. J. Comb. Theory Ser. A. 1994;66:237–269. doi: 10.1016/0097-3165(94)90065-5. [DOI] [Google Scholar]
  • 15.Liang F.M. Word Hy-phen-a-tion by Com-put-er. Technical Report; Stanford University; Stanford, CA, USA: 1983. [Google Scholar]
  • 16.Weiner P. Linear pattern matching algorithms; Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973); Iowa City, IA, USA. 15–17 October 1973; pp. 1–11. [Google Scholar]
  • 17.Gheorghiciuc I., Ward M.D. On correlation Polynomials and Subword Complexity. Discrete Math. Theor. Comput. Sci. 2007;7:1–18. [Google Scholar]
  • 18.Bassino F., Clément J., Nicodème P. Counting occurrences for a finite set of words: Combinatorial methods. ACM Trans. Algorithms. 2012;8:31. doi: 10.1145/2229163.2229175. [DOI] [Google Scholar]
  • 19.Park G., Hwang H.K., Nicodème P., Szpankowski W. Latin American Symposium on Theoretical Informatics. Springer; Berlin/Heidelberger, Germany: 2008. Profile of Tries; pp. 1–11. [Google Scholar]
  • 20.Flajolet P., Sedgewick R. Analytic Combinatorics. Cambridge University Press; Cambridge, UK: 2009. [Google Scholar]
  • 21.Lothaire M. Applied Combinatorics on Words. Volume 105 Cambridge University Press; Cambridge, UK: 2005. [Google Scholar]
  • 22.Szpankowski W. Average Case Analysis of Algorithms on Sequences. Volume 50 John Wiley & Sons; Chichester, UK: 2011. [Google Scholar]
  • 23.Widder D.V. The Laplace Transform (PMS-6) Princeton University Press; Princeton, NJ, USA: 2015. [Google Scholar]
  • 24.van der Waerden B.L. On the method of saddle points. Appl. Sci. Res. 1952;2:33–45. doi: 10.1007/BF02919754. [DOI] [Google Scholar]

