2021 Aug 8;35(7):4957–4973. doi: 10.1007/s00521-021-06401-z

Table 2.

Natural language processing (NLP) features [57]

NLP Features

| Feature | Explanation |
| --- | --- |
| Raw Word Count | The number of words obtained after splitting the URL on special characters |
| Brand Check for Domain | Whether the domain of the analyzed URL is in the brand name list |
| Average Word Length | The average length of the words in the raw word list |
| Longest Word Length | The length of the longest word in the raw word list |
| Shortest Word Length | The length of the shortest word in the raw word list |
| Standard Deviation | The standard deviation of word lengths in the raw word list |
| Adjacent Word Count | The number of adjacent words processed in the WDM module |
| Average Adjacent Word Length | The average length of the detected adjacent words |
| Separated Word Count | The number of words obtained by decomposing adjacent words |
| Keyword Count | The number of keywords in the URL |
| Brand Name Count | The number of brand names in the URL |
| Similar Keyword Count | The number of words in the URL that are similar to a keyword |
| Similar Brand Name Count | The number of words in the URL that are similar to a brand name |
| Random Word Count | The number of words in the URL that are composed of random characters |
| Target Brand Name Count | The number of target brand names in the URL |
| Target Keyword Count | The number of target keywords in the URL |
| Other Words Count | The number of words that are not in the brand name and keyword lists but are in the English dictionary (e.g., computer, pencil, notebook) |
| Digit Count (3) | The number of digits in the URL, counted separately for the domain, subdomain and file path |
| Subdomain Count | The number of subdomains in the URL |
| Random Domain | Whether the registered domain is composed of random characters |
| Length (3) | The length, calculated separately for the domain, subdomain and path |
| Known TLD | Whether the registered TLD is a known one. The most widely used TLDs worldwide include “com”, “org”, “net”, “de”, “edu”, “gov”, etc. |
| www, com (2) | The presence of “www” and “com” in the domain or subdomain, which is common in malicious URLs |
| Puny Code | Puny Code is a standard that allows the browser to represent certain special (non-ASCII) characters in the address field. Attackers may use Puny Code to keep malicious URLs from being detected |
| Special Character (8) | The components of a URL are normally separated from each other by dots. However, an attacker can create a malicious URL using special characters from the set {'-', '.', '/', '@', '?', '&', '=', '_'} |
| Consecutive Character Repeat | Attackers can make small changes to brand names or keywords to deceive users. These slight changes can take the form of using the same character more than once in a row |
| Alexa Check (2) | Alexa is a service that ranks frequently visited websites by popularity. Whether the domain is in the Alexa Top 1 Million list |
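
To make some of the simpler features in the table concrete, the following is a minimal Python sketch, not the cited implementation [57]. It covers only features that need no external lists (word statistics, digit counts, known TLD, Puny Code, consecutive character repeat and special characters). All function names and dictionary keys are hypothetical. The sketch also assumes that the URL includes a scheme, that the scheme is dropped before splitting into words, that the population standard deviation is meant, that only the TLDs listed in the table count as known, and that the repeat feature is the longest run of one character in the host name.

```python
import re
import statistics
from itertools import groupby
from urllib.parse import urlsplit

# The eight special characters listed in the table.
SPECIAL_CHARS = ["-", ".", "/", "@", "?", "&", "=", "_"]
# Assumption: only the TLDs named in the table; the source list ends with "etc.".
KNOWN_TLDS = {"com", "org", "net", "de", "edu", "gov"}


def raw_words(text):
    """Split on runs of non-alphanumeric characters and drop empty pieces."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", text) if w]


def word_length_stats(words):
    """Count, average, longest, shortest and standard deviation of word lengths."""
    lengths = [len(w) for w in words]
    if not lengths:
        return {"raw_word_count": 0, "average_word_length": 0.0,
                "longest_word_length": 0, "shortest_word_length": 0,
                "standard_deviation": 0.0}
    return {
        "raw_word_count": len(lengths),
        "average_word_length": statistics.mean(lengths),
        "longest_word_length": max(lengths),
        "shortest_word_length": min(lengths),
        # Assumption: population (not sample) standard deviation.
        "standard_deviation": statistics.pstdev(lengths),
    }


def digit_count(text):
    """Number of decimal digits in a URL component."""
    return sum(ch.isdigit() for ch in text)


def has_punycode(hostname):
    """True if any host label carries the Punycode ACE prefix 'xn--'."""
    return any(label.startswith("xn--") for label in hostname.lower().split("."))


def longest_char_run(text):
    """Length of the longest run of one repeated character."""
    return max((len(list(group)) for _, group in groupby(text)), default=0)


def special_char_counts(url):
    """Occurrences of each of the eight special characters in the full URL."""
    return {ch: url.count(ch) for ch in SPECIAL_CHARS}


def extract_features(url):
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Assumption: the scheme is excluded before splitting into raw words.
    body = host + parts.path + ("?" + parts.query if parts.query else "")
    features = word_length_stats(raw_words(body))
    features["digit_count_host"] = digit_count(host)
    features["digit_count_path"] = digit_count(parts.path)
    features["known_tld"] = host.rsplit(".", 1)[-1] in KNOWN_TLDS
    features["puny_code"] = has_punycode(host)
    features["consecutive_char_repeat"] = longest_char_run(host)
    features["special_chars"] = special_char_counts(url)
    return features


if __name__ == "__main__":
    example = "http://www.paypal-secure.example.com/login/update?id=12345"
    for name, value in extract_features(example).items():
        print(name, value)
```

Features that depend on brand, keyword or dictionary lists, on the WDM module, on random-word detection, or on the Alexa ranking are left out because the table does not specify how they are computed. Separating the registered domain from its subdomains reliably also requires a public suffix list, so the sketch counts digits for the whole host name rather than for domain and subdomain separately.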