Table 2.
Equations used in SEAM
C-value(a) [18]
where:
a is the candidate string
f(.) is its frequency of occurrence in the corpus
T_a is the set of extracted candidate terms that contain a
P(T_a) is the number of these candidate terms
Termhood(a) [53] = −0.7836
  + 0.7541 * FirstPOS_ADJECTIVE
  − 1.3722 * FirstPOS_ADVERB
  + 0.3541 * FirstPOS_NOUN
  + 1.4182 * FirstPOS_VERB
  − 0.7722 * LastPOS_ADJECTIVE
  + 2.2576 * LastPOS_ADVERB
  + 0.0285 * LastPOS_NOUN
  + 0.6038 * LastPOS_VERB
  + 1.2899 * NP_VALUE
  + 1.0475 * REPEAT_SUP_GREATER_MEDIAN
  + 0.8417 * REPEAT_SUB_GREATER_MEDIAN
  + 0.8422 * DISTINCT_PERHOST_GREATER_THAN_MEDIAN
where:
POS is the part-of-speech tag
REPEAT_SUP is the number of supra groups (candidate terms containing a) = P(T_a)
REPEAT_SUB is the number of subgroups (candidate terms that are contained within a) = P(Αt)
NP_VALUE indicates whether a is a noun phrase
DISTINCT_PER_HOST is equivalent to document frequency
MEDIAN is calculated over the whole document set
TF-IDF [43]: w_{i,j} = TF_{i,j} × IDF_i
where:
TF_{i,j} = f_{i,j} / max_z f_{z,j} is the term frequency of keyword k_i in document d_j
f_{i,j} is the number of times k_i appears in d_j
max_z f_{z,j} is the maximum frequency across all keywords k_z in d_j
where:
IDF_i = log(N / n_i) is the inverse document frequency of keyword k_i
N is the total number of documents in the corpus
n_i is the number of documents in which k_i appears
Cosine similarity [43]
where:
w_{i,j} is defined above
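
The four measures in Table 2 can be sketched in code as follows. This is an illustrative sketch under stated assumptions, not the SEAM implementation. The C-value function assumes the commonly used formulation that matches the symbols defined in the table: log2 of the candidate's length in words, multiplied by f(a) for non-nested candidates, or by f(a) minus the mean frequency of the candidates in T_a otherwise. The termhood function treats every feature as a 0/1 indicator. TF-IDF assumes a natural logarithm for IDF. Cosine similarity is the usual normalized dot product over keyword weights. All function and variable names are hypothetical.

```python
import math
from collections import Counter

# Coefficients copied from Table 2; features are ASSUMED to be 0/1 indicators.
TERMHOOD_INTERCEPT = -0.7836
TERMHOOD_WEIGHTS = {
    "FirstPOS_ADJECTIVE": 0.7541,
    "FirstPOS_ADVERB": -1.3722,
    "FirstPOS_NOUN": 0.3541,
    "FirstPOS_VERB": 1.4182,
    "LastPOS_ADJECTIVE": -0.7722,
    "LastPOS_ADVERB": 2.2576,
    "LastPOS_NOUN": 0.0285,
    "LastPOS_VERB": 0.6038,
    "NP_VALUE": 1.2899,
    "REPEAT_SUP_GREATER_MEDIAN": 1.0475,
    "REPEAT_SUB_GREATER_MEDIAN": 0.8417,
    "DISTINCT_PERHOST_GREATER_THAN_MEDIAN": 0.8422,
}


def is_nested(a, b):
    """True if word tuple a occurs contiguously inside the longer tuple b."""
    n = len(a)
    return len(b) > n and any(b[i:i + n] == a for i in range(len(b) - n + 1))


def c_value(a, freq):
    """C-value of candidate a; freq maps word tuples to corpus frequency."""
    t_a = [b for b in freq if is_nested(a, b)]  # T_a: candidates containing a
    length_factor = math.log2(len(a))           # 0 for single-word candidates
    if not t_a:
        return length_factor * freq[a]
    # P(T_a) is len(t_a); subtract the mean frequency of the longer candidates.
    return length_factor * (freq[a] - sum(freq[b] for b in t_a) / len(t_a))


def termhood(features):
    """Linear termhood score; features missing from the dict count as 0."""
    return TERMHOOD_INTERCEPT + sum(
        w * features.get(name, 0) for name, w in TERMHOOD_WEIGHTS.items()
    )


def tf_idf(docs):
    """Return one {keyword: w_ij} dict per document (each a list of keywords)."""
    n_docs = len(docs)
    doc_freq = Counter(k for doc in docs for k in set(doc))  # n_i per keyword
    weights = []
    for doc in docs:
        counts = Counter(doc)
        if not counts:
            weights.append({})
            continue
        max_f = max(counts.values())  # max_z f_zj
        weights.append({
            k: (f / max_f) * math.log(n_docs / doc_freq[k])  # natural log ASSUMED
            for k, f in counts.items()
        })
    return weights


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse weight vectors (dicts)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0
```

For example, with `freq = {("soft", "contact", "lens"): 3, ("contact", "lens"): 5}`, `c_value(("contact", "lens"), freq)` returns 2.0: the length factor is log2(2) = 1, and the mean frequency of the single longer candidate (3) is subtracted from 5. Under this formulation single-word candidates always score 0, because log2(1) = 0.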