2015 Apr 2;6:15. doi: 10.1186/s13326-015-0011-7

Table 2.

Equations used in SEAM

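C-value(a) [18]:
$$\text{C-value}(a) = \begin{cases} \log_2 |a| \cdot f(a), & a \text{ is not nested} \\ \log_2 |a| \left( f(a) - \dfrac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right), & \text{otherwise} \end{cases}$$
where:
$a$ is the candidate string
$f(\cdot)$ is its frequency of occurrence in the corpus
$T_a$ is the set of extracted candidate terms that contain $a$
$P(T_a)$ is the number of these candidate terms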
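As a rough illustration, the C-value definition can be sketched in Python. This is a minimal sketch, not the SEAM implementation: the function name `c_value`, the `freq` dictionary and the `containing` argument are hypothetical, and the length of a candidate is taken to be its number of whitespace-separated words, which is the usual C-value convention but is not spelled out in the table.

```python
import math


def c_value(candidate, freq, containing=()):
    """Sketch of the C-value score for one candidate term.

    candidate  -- the candidate string a
    freq       -- dict mapping each candidate string to its corpus frequency f(.)
    containing -- the longer candidate terms that contain `candidate` (T_a);
                  empty when the candidate is not nested
    """
    # Assumption: the length of a is counted in words.
    length = len(candidate.split())
    if not containing:
        # Non-nested case: log2|a| * f(a)
        return math.log2(length) * freq[candidate]
    # Nested case: subtract the mean frequency of the containing terms.
    mean_containing = sum(freq[b] for b in containing) / len(containing)
    return math.log2(length) * (freq[candidate] - mean_containing)


# Hypothetical frequencies, for illustration only.
freq = {"contact lens": 10, "soft contact lens": 4}
nested_score = c_value("contact lens", freq, ["soft contact lens"])
top_score = c_value("soft contact lens", freq)
```

Under this reading a single-word candidate always scores zero, because log2(1) = 0; implementations that want to rank single words typically adjust the length factor.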
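Termhood(a) [53]:
$$
\begin{aligned}
\log\frac{P(\text{vote}=\text{yes})}{P(\text{vote}=\text{no})} = {} & -0.7836 \\
& + 0.7541 \times \text{FirstPOS\_ADJECTIVE} \\
& - 1.3722 \times \text{FirstPOS\_ADVERB} \\
& + 0.3541 \times \text{FirstPOS\_NOUN} \\
& + 1.4182 \times \text{FirstPOS\_VERB} \\
& - 0.7722 \times \text{LastPOS\_ADJECTIVE} \\
& + 2.2576 \times \text{LastPOS\_ADVERB} \\
& + 0.0285 \times \text{LastPOS\_NOUN} \\
& + 0.6038 \times \text{LastPOS\_VERB} \\
& + 1.2899 \times \text{NP\_VALUE} \\
& + 1.0475 \times \text{REPEAT\_SUP\_GREATER\_MEDIAN} \\
& + 0.8417 \times \text{REPEAT\_SUB\_GREATER\_MEDIAN} \\
& + 0.8422 \times \text{DISTINCT\_PERHOST\_GREATER\_THAN\_MEDIAN}
\end{aligned}
$$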
where:
POS is Part of Speech tag
REPEAT_SUP is the number of supra terms (candidate terms containing a) = $P(T_a)$
REPEAT_SUB is the number of subgroup terms (candidate terms that are contained within a) = $P(A_t)$
NP_VALUE indicates whether a is a noun phrase
DISTINCT_PER_HOST is equivalent to document frequency
MEDIAN is calculated for the whole document set
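The termhood equation is a logistic regression, so evaluating it is a weighted sum of indicator features. The sketch below copies the coefficients from the table; everything else is an assumption made for illustration: the function names, the treatment of each feature as a 0/1 indicator, and the use of the natural logarithm when converting log-odds to a probability.

```python
import math

# Coefficients copied from the termhood equation in the table.
INTERCEPT = -0.7836
COEFFICIENTS = {
    "FirstPOS_ADJECTIVE": 0.7541,
    "FirstPOS_ADVERB": -1.3722,
    "FirstPOS_NOUN": 0.3541,
    "FirstPOS_VERB": 1.4182,
    "LastPOS_ADJECTIVE": -0.7722,
    "LastPOS_ADVERB": 2.2576,
    "LastPOS_NOUN": 0.0285,
    "LastPOS_VERB": 0.6038,
    "NP_VALUE": 1.2899,
    "REPEAT_SUP_GREATER_MEDIAN": 1.0475,
    "REPEAT_SUB_GREATER_MEDIAN": 0.8417,
    "DISTINCT_PERHOST_GREATER_THAN_MEDIAN": 0.8422,
}


def termhood_log_odds(features):
    """Return log(P(vote=yes) / P(vote=no)) for a dict of 0/1 features.

    Features missing from the dict are treated as 0 (an assumption).
    """
    return INTERCEPT + sum(
        coef * features.get(name, 0) for name, coef in COEFFICIENTS.items()
    )


def termhood_probability(features):
    """Convert the log-odds to P(vote=yes), assuming a natural-log logit."""
    return 1.0 / (1.0 + math.exp(-termhood_log_odds(features)))


# Hypothetical candidate: starts and ends with a noun and is a noun phrase.
example = {"FirstPOS_NOUN": 1, "LastPOS_NOUN": 1, "NP_VALUE": 1}
example_log_odds = termhood_log_odds(example)
example_probability = termhood_probability(example)
```

A positive log-odds value corresponds to a probability above 0.5, which is one plausible cut-off for accepting a candidate; the table itself does not state the threshold used.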
TF-IDF [43]:
$$w_{i,j} = TF_{i,j} \times IDF_i, \qquad TF_{i,j} = \frac{f_{i,j}}{\max_z f_{z,j}}$$
where:
$TF_{i,j}$ is the term frequency for keyword $k_i$ in document $d_j$
$f_{i,j}$ is the number of times $k_i$ appears in $d_j$
$\max_z f_{z,j}$ is the maximum frequency across all keywords $k_z$ in $d_j$
$$IDF_i = \log\frac{N}{n_i}$$
where:
$IDF_i$ is the inverse document frequency for keyword $k_i$
$N$ is the total number of documents in the corpus
$n_i$ is the number of documents that $k_i$ appears in
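The TF-IDF weighting defined here, with term frequency normalised by the most frequent keyword in the document, can be sketched as follows. The function name `tf_idf`, the token-list input format and the choice of the natural logarithm are assumptions; the table does not state the base of the logarithm.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {keyword: weight} dict per document.

    docs -- list of documents, each a list of keyword tokens
    """
    n_docs = len(docs)
    # n_i: number of documents in which each keyword appears.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)  # f_{i,j}
        max_count = max(counts.values(), default=1)  # max_z f_{z,j}
        weights.append({
            keyword: (count / max_count) * math.log(n_docs / doc_freq[keyword])
            for keyword, count in counts.items()
        })
    return weights


# Hypothetical two-document corpus, for illustration only.
corpus = [["gene", "gene", "protein"], ["protein", "cell"]]
corpus_weights = tf_idf(corpus)
```

In this toy corpus "protein" occurs in every document, so its IDF is log(2/2) = 0 and its weight vanishes in both documents, while "gene" and "cell" keep non-zero weights.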
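Cosine similarity [43]:
$$\cos(w_c, w_s) = \frac{w_c \cdot w_s}{\lVert w_c \rVert \times \lVert w_s \rVert} = \frac{\sum_{i=1}^{K} w_{i,c}\, w_{i,s}}{\sqrt{\sum_{i=1}^{K} w_{i,c}^2}\, \sqrt{\sum_{i=1}^{K} w_{i,s}^2}}$$
where:
$w_{i,j}$ is defined above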
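A minimal sketch of the cosine similarity between two weight vectors, stored as sparse {keyword: weight} dictionaries, is given below. The function name and the dictionary representation are assumptions, as is the decision to return 0.0 when either vector has zero length, a case the formula leaves undefined.

```python
import math


def cosine_similarity(w_c, w_s):
    """Cosine of the angle between two sparse weight vectors."""
    # Only keywords present in both vectors contribute to the dot product.
    shared = set(w_c) & set(w_s)
    dot = sum(w_c[k] * w_s[k] for k in shared)
    norm_c = math.sqrt(sum(v * v for v in w_c.values()))
    norm_s = math.sqrt(sum(v * v for v in w_s.values()))
    if norm_c == 0.0 or norm_s == 0.0:
        # Assumption: treat an all-zero vector as dissimilar to everything.
        return 0.0
    return dot / (norm_c * norm_s)


# Hypothetical weight vectors, for illustration only.
same_direction = cosine_similarity({"gene": 3.0, "cell": 4.0}, {"gene": 6.0, "cell": 8.0})
orthogonal = cosine_similarity({"gene": 1.0}, {"cell": 1.0})
```

Vectors that point in the same direction score 1 regardless of their magnitudes, and vectors with no keywords in common score 0.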