2018 Nov 6;19:407. doi: 10.1186/s12859-018-2441-6

Table 2.

Definition of the adopted homology metrics (Alignment–free)

Metric Definition Description
n-gram distance $\mathrm{qgram}_n(X,Y)=\min_{x\in\mathrm{seq}(X),\,y\in\mathrm{seq}(Y)}\frac{\sum_i |q_i^x-q_i^y|}{\mathrm{len}(x)+\mathrm{len}(y)}$ An n-gram is a subsequence of n consecutive characters of a string [48]. If $q^x=(q_1^x,q_2^x,\ldots,q_K^x)$ is the vector of counts of n-gram occurrences in the sequence x, the n-gram distance is given by the sum over the absolute differences $|q_i^x-q_i^y|$, where $q_i^x$ and $q_i^y$ are the counts of the i-th unique n-gram in x and y respectively, obtained by sliding a window n characters wide over x and y and registering the occurring n-grams. The time complexity is O(len(x) × len(y)).
Cosine similarity $\mathrm{cosine}_n(X,Y)=\max_{x\in\mathrm{seq}(X),\,y\in\mathrm{seq}(Y)}\frac{q^x\cdot q^y}{\lVert q^x\rVert\,\lVert q^y\rVert}$ The cosine similarity is the cosine of the angle between the two n-gram vectors $q^x$ and $q^y$ [40]. The time complexity is O(len(x)+len(y)).
Jaccard similarity $\mathrm{jaccard}_n(X,Y)=\max_{x\in\mathrm{seq}(X),\,y\in\mathrm{seq}(Y)}\left(\frac{\sum_i \left(1_{q_i^x>0}+1_{q_i^y>0}\right)}{\sum_i 1_{q_i^x>0}\cdot 1_{q_i^y>0}}-1\right)^{-1}$ The Jaccard coefficient measures the similarity between two finite sets, and is defined as the size of the intersection divided by the size of the union of the sample sets [49]. The size is computed from the set of unique n-grams by means of $1_{q_i^x>0}$, the indicator function having the value 1 if the i-th n-gram is present in x, 0 otherwise. The time complexity is O(len(x)+len(y)).
Base–base correlation distance $\mathrm{BBC}(X,Y)=\min_{x\in\mathrm{seq}(X),\,y\in\mathrm{seq}(Y)}\sqrt{\sum_{i=1}^{16}\left(V_x^i-V_y^i\right)^2}$ The base–base correlation measures the sequence similarity by computing the Euclidean distance between two 16-dimensional feature vectors, $V_x$ and $V_y$, which contain all base pair mutual information [50]. The time complexity is O(len(x) × len(y)).
Average common substring distance $\mathrm{ACS}(X,Y)=\min_{x\in\mathrm{seq}(X),\,y\in\mathrm{seq}(Y)}\frac{1}{2}\left[\frac{\sum_{i=1}^{\mathrm{len}(x)}\mathrm{lcs}(x(i),y)}{\mathrm{len}(x)}+\frac{\sum_{i=1}^{\mathrm{len}(y)}\mathrm{lcs}(y(i),x)}{\mathrm{len}(y)}\right]$ The average common substring measure, introduced for constructing phylogenetic trees [51], is the average length of the maximum common substrings. Specifically, lcs(x(i),y) is the length of the longest substring of x that starts at position i of x and exactly matches some substring of y; lcs(y(i),x) is defined symmetrically, with the roles of x and y exchanged. The time complexity is O(len(x)+len(y)).
Lempel–Ziv complexity distance $\mathrm{LZ}(X,Y)=\min_{x\in\mathrm{seq}(X),\,y\in\mathrm{seq}(Y)}\frac{c(xy)-c(x)+c(yx)-c(y)}{\frac{1}{2}\left[c(xy)+c(yx)\right]}$ The Lempel–Ziv complexity distance is defined by considering the minimum number of components over all production histories of x and y, c(x) and c(y), and of their concatenations, c(xy) and c(yx) [52]. The time complexity is O(len(x) × len(y)).
Jensen–Shannon distance $\mathrm{JSD}(X,Y)=\min_{x\in\mathrm{seq}(X),\,y\in\mathrm{seq}(Y)}\frac{1}{2}\mathrm{KL}(V_x,V_M)+\frac{1}{2}\mathrm{KL}(V_y,V_M)$ The Jensen–Shannon distance is computed by averaging the Kullback–Leibler divergence (KL) of $V_x$ with respect to $V_M$ and of $V_y$ with respect to $V_M$, where $V_x$ and $V_y$ are the same 16-dimensional feature vectors defined for BBC, and $V_M=\frac{V_x+V_y}{2}$ [41]. The time complexity is O(len(x)+len(y)).
Hamming distance $\mathrm{HDist}(X,Y)=\min_{x\in\mathrm{seq}(X),\,y\in\mathrm{seq}(Y)}\mathrm{hd}(r(x),r(y))$ The Hamming distance is defined between two strings of the same length as the number of positions in which corresponding values are different. We adopt two bit strings of length n, namely r(x) and r(y), representing the regulatory transcriptional machinery of x and y respectively, where n is the number of all transcription factors available in JASPAR [24]. Position i of such a bit string is equal to 1 if the i-th transcription factor binds the promoter and 0 otherwise. The time complexity is O(n).

X and Y are two candidate long non-coding genes, seq(X) and seq(Y) are the sets of representative sequences of X and Y respectively (promoter or transcript), and len(x) and len(y) are the lengths of sequences x and y respectively. Where applicable, a metric is normalized with respect to the sum of the sequence lengths [42] and is minimized (for distance metrics) or maximized (for similarity metrics) over all pairs of sequences x ∈ seq(X), y ∈ seq(Y).
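
As an illustration of how the n-gram based rows and the Hamming row of Table 2 can be read, the following is a minimal Python sketch, not the authors' implementation. All function names (`ngram_counts`, `qgram_distance`, `cosine_similarity`, `jaccard_similarity`, `hamming_distance`, `gene_level`) are hypothetical and introduced here only for illustration. The sketch assumes that sequences are plain strings, that the n-gram distance is normalized by len(x)+len(y) as stated in the table note, and that the gene-level value is the minimum (distance) or maximum (similarity) over all pairs of representative sequences.

```python
from collections import Counter
from itertools import product
from math import sqrt


def ngram_counts(seq, n):
    """Count the n-grams found by sliding a window of width n over seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def qgram_distance(x, y, n):
    """Sum of absolute n-gram count differences, divided by len(x) + len(y)."""
    qx, qy = ngram_counts(x, n), ngram_counts(y, n)
    diff = sum(abs(qx[g] - qy[g]) for g in qx.keys() | qy.keys())
    return diff / (len(x) + len(y))


def cosine_similarity(x, y, n):
    """Cosine of the angle between the two n-gram count vectors."""
    qx, qy = ngram_counts(x, n), ngram_counts(y, n)
    dot = sum(qx[g] * qy[g] for g in qx.keys() & qy.keys())
    norm = sqrt(sum(c * c for c in qx.values())) * sqrt(sum(c * c for c in qy.values()))
    return dot / norm if norm else 0.0


def jaccard_similarity(x, y, n):
    """Shared unique n-grams divided by all unique n-grams of x and y."""
    gx, gy = set(ngram_counts(x, n)), set(ngram_counts(y, n))
    union = gx | gy
    return len(gx & gy) / len(union) if union else 0.0


def hamming_distance(rx, ry):
    """Number of positions at which two equal-length bit vectors differ."""
    if len(rx) != len(ry):
        raise ValueError("bit vectors must have the same length")
    return sum(a != b for a, b in zip(rx, ry))


def gene_level(metric, seqs_x, seqs_y, n, best=min):
    """Evaluate metric on every pair (x, y) and keep the best value.

    Pass best=min for distances and best=max for similarities.
    """
    return best(metric(x, y, n) for x, y in product(seqs_x, seqs_y))


if __name__ == "__main__":
    # Toy sequences, invented for this example only.
    seqs_x, seqs_y = ["ACGT", "TTTT"], ["ACGA"]
    print(gene_level(qgram_distance, seqs_x, seqs_y, n=2, best=min))
    print(gene_level(jaccard_similarity, seqs_x, seqs_y, n=2, best=max))
```

Building the count vectors with `Counter` takes time linear in the sequence length, and only n-grams that actually occur are stored, so the full vector over all possible n-grams is never materialized.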