Correction to: Scientific Reports 10.1038/srep28840, published online 29 June 2016
This Article contains errors in the Methods section, subheading ‘Mathematical Backgrounds’.
The formulation of Lemma 1 is incorrect:
“Given a genome of length n, if , then is the maximum value that can reach in the class of all possible genomes of length n”.
should read:
“Given a genome of length n, if , then is the maximum in the class of values ”.
The corresponding proof:
“The minimum value of k such that all k-mers are hapaxes of is . Therefore, if , then is maximum, according to the entropy Equipartition Property, because we have the maximum number of words occurring once in , and all these words have the same probability of occurring in ”.
should read:
“For empirical entropy is equal to , in fact is the number of distinct k-mers in . The same expression holds for any and , because strings longer than k are hapaxes too. But, if , then , therefore ”.
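The counting argument behind the corrected proof can be checked numerically. The following is a minimal sketch, not the authors' code: the function names, the toy string, and the choice of base-2 logarithms are assumptions made for illustration only. It shows that once every k-mer of a string of length n is a hapax (occurs exactly once), the empirical k-mer entropy equals the logarithm of the number of k-mer positions, n − k + 1, and that this value can only decrease for longer k.

```python
from collections import Counter
from math import log2


def kmer_counts(genome, k):
    """Count every k-mer occurring in the n - k + 1 positions of the string."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def empirical_entropy(genome, k):
    """Shannon entropy (base 2, an assumption) of the k-mer frequency distribution."""
    counts = kmer_counts(genome, k)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def min_hapax_length(genome):
    """Smallest k for which every k-mer of the string occurs exactly once."""
    for k in range(1, len(genome) + 1):
        if all(c == 1 for c in kmer_counts(genome, k).values()):
            return k
    return len(genome)


genome = "ACGTACGGTCA"  # hypothetical toy string, not real data
n = len(genome)
k0 = min_hapax_length(genome)

# From k0 onward all k-mers are hapaxes, so entropy is log2(n - k + 1).
for k in range(k0, n + 1):
    print(k, empirical_entropy(genome, k), log2(n - k + 1))
```

Because n − k + 1 shrinks as k grows, the entropy values printed for k above the minimal hapax length are strictly smaller than the value at that length.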
The proof of Lemma 2 is incorrect because it lacks an explicit characterization of “random genomes”. A correct proof is given here:
First of all, a random genome of length n is obtained by a random generation process in which, at each step, one of the four possible genome symbols is generated with probability 1/4. Let . According to the theory of de Bruijn sequences, it is possible to arrange all possible k-mers in a circular sequence (the last symbol of is followed by the first symbol of ) where each k-mer occurs exactly once. Of course, any contiguous portion of length n of contains consecutive k-mers and corresponds to a random genome of length n (in short, an n-genome). In fact, all symbols of are equiprobable, and this homogeneity holds along all positions of , in the sense that, going forward (circularly) a number of steps equal to the length of another de Bruijn sequence, with the same equiprobability property is obtained. Let us consider the disjoint n-genomes (with no common k-mer) concatenated in . Their number is . But is the shortest circular string arranging all k-mers; hence maximum statistical homogeneity (required by randomness) is reached when , that is, when the probability that a k-mer has of occurring in one of the disjoint n-genomes of is the same as that of occurring in one of the k-mer positions of an n-genome (a sort of scale-free equiprobability). This condition is expressed by equation (14) of the paper, from which equation follows, corresponding to equation (15). Whence equations (16), (17), (18) and the inequality (19) follow, from which the bounds given for derive.
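The de Bruijn property used in this proof can be illustrated with a short sketch. It relies on the standard recursive Lyndon-word construction of a de Bruijn sequence, which is a textbook technique and not taken from the paper. The function names, the small order k = 3, and the window length n = 20 are hypothetical choices for illustration, and the specific value of k used in the lemma is not reproduced here. The sketch checks that every k-mer over the four-letter alphabet occurs exactly once in the circular sequence, and that a contiguous window of length n contains n − k + 1 consecutive, pairwise distinct k-mers.

```python
def de_bruijn(alphabet, k):
    """Build a circular de Bruijn sequence of order k by concatenating Lyndon words."""
    m = len(alphabet)
    a = [0] * (k + 1)
    sequence = []

    def db(t, p):
        if t > k:
            if k % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, m):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in sequence)


def circular_kmers(seq, k):
    """All k-mers of seq read circularly (last symbol followed by the first)."""
    extended = seq + seq[:k - 1]
    return [extended[i:i + k] for i in range(len(seq))]


def circular_window(seq, start, n):
    """Contiguous portion of length n starting at position start, read circularly."""
    return "".join(seq[(start + i) % len(seq)] for i in range(n))


k = 3                      # illustrative order only
seq = de_bruijn("ACGT", k)
kmers = circular_kmers(seq, k)

n = 20                     # illustrative window length only
window = circular_window(seq, 50, n)
window_kmers = [window[i:i + k] for i in range(n - k + 1)]
print(len(seq), len(set(kmers)), len(window_kmers), len(set(window_kmers)))
```

A window wrapping past the end of the sequence (as with the start position 50 used here) behaves the same way as any other, which reflects the circular reading described in the proof.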
Consequently, the opening sentence of Proposition 3:
“In the class of genomes of length n, for every , the following relation holds:”
should read:
“In the class of genomes of length n, for every , the following relation holds:”
Finally, in Results, subheading ‘Information genomics laws’ Eq. 7 should be removed, because it follows from (8), being .
Acknowledgements
The authors thank Martin Andrade-Restrepo and Carlos Alvarez for pointing out to them some of the inaccuracies corrected in this notice.