2021 Aug 8;35(7):4957–4973. doi: 10.1007/s00521-021-06401-z

Table 2.

Natural language processing (NLP) features [57]

NLP Features
Feature	Explanation
Raw Word Count	The number of words obtained after parsing the URL by special characters
Brand Check for Domain	Whether the domain of the analyzed URL is in the brand name list
Average Word Length	The average length of the words in the raw word list
Longest Word Length	The length of the longest word in the raw word list
Shortest Word Length	The length of the shortest word in the raw word list
Standard Deviation	Standard deviation of word lengths in the raw word list
Adjacent Word Count	Number of adjacent words processed in the WDM module
Average Adjacent Word Length	The average length of the detected adjacent words
Separated Word Count	The number of words obtained as a result of decomposing adjacent words
Keyword Count	The number of keywords in the URL
Brand Name Count	The number of brand names in the URL
Similar Keyword Count	The number of words in the URL that are similar to a keyword
Similar Brand Name Count	The number of words in the URL that are similar to a brand name
Random Word Count	The number of words in the URL that are made up of random characters
Target Brand Name Count	The number of target brand names in the URL
Target Keyword Count	The number of target keywords in the URL
Other Words Count	The number of words that are not in the brand name and keyword lists but are in the English dictionary (e.g., computer, pencil, notebook)
Digit Count (3)	The number of digits in the URL, counted separately for the domain, subdomain and file path
Subdomain Count	The number of subdomains in the URL
Random Domain	Whether the registered domain is made up of random characters
Length (3)	Length is calculated separately for the domain, subdomain and path
Known TLD	[“com”, “org”, “net”, “de”, “edu”, “gov”, etc.] are the most widely used TLDs worldwide. Whether the registered TLD is one of these known TLDs
www, com (2)	The occurrence of “www” and “com” in the domain or subdomain is common in malicious URLs
Puny Code	Puny Code is a standard that allows the browser to decode certain special characters in the address field. Attackers may use Puny Code to keep malicious URLs from being detected
Special Character (8)	Within the URL, the components are separated from each other by dots. However, an attacker could create a malicious URL using some special characters {‘-’, ‘.’, ‘/’, ‘@’, ‘?’, ‘&’, ‘=’, ‘_’}
Consecutive Character Repeat	Attackers can make small changes in brand names or keywords to deceive users. These slight changes can be in the form of using the same character more than once
Alexa Check (2)	Alexa is the name of a service that ranks frequently used websites according to their popularity. Whether the domain is in the Alexa Top one million list
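
Several of the purely lexical features in Table 2 (the raw word statistics, the digit counts, the special character counts and the consecutive character repeat) can be illustrated with a short Python sketch. This is not the implementation behind the table; it is a minimal sketch under stated assumptions. The function names and dictionary keys are hypothetical, the URL scheme is dropped before splitting, "special characters" is taken to mean any non-alphanumeric character, the population standard deviation is used, and the host is treated as a single unit because separating the domain from the subdomain would need a public suffix list. Features that depend on the brand name list, the keyword list, the WDM module or the Alexa list are left out.

```python
import re
import statistics
from urllib.parse import urlparse

# The eight special characters listed in the "Special Character (8)" row.
SPECIAL_CHARS = "-./@?&=_"


def raw_words(text):
    """Split text on runs of non-alphanumeric characters (an assumption)."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", text) if w]


def max_consecutive_repeat(text):
    """Length of the longest run of one repeated character in text."""
    longest = run = 0
    prev = None
    for ch in text:
        run = run + 1 if ch == prev else 1
        prev = ch
        longest = max(longest, run)
    return longest


def lexical_features(url):
    """Compute a subset of the Table 2 features for a single URL."""
    parsed = urlparse(url)
    host, path = parsed.netloc, parsed.path
    # Assumption: only host and path are analyzed; the scheme is ignored.
    text = host + path
    words = raw_words(text)
    lengths = [len(w) for w in words]

    features = {
        "raw_word_count": len(words),
        "average_word_length": statistics.mean(lengths) if lengths else 0.0,
        "longest_word_length": max(lengths, default=0),
        "shortest_word_length": min(lengths, default=0),
        # Population standard deviation; the sample version is equally plausible.
        "std_word_length": statistics.pstdev(lengths) if lengths else 0.0,
        # Host is not split into domain and subdomain in this sketch.
        "digit_count_host": sum(c.isdigit() for c in host),
        "digit_count_path": sum(c.isdigit() for c in path),
        "consecutive_char_repeat": max_consecutive_repeat(text),
    }
    # One count per special character, giving eight separate features.
    for ch in SPECIAL_CHARS:
        features[f"count_{ch}"] = text.count(ch)
    return features
```

For the made-up URL `http://www.paypal-login.example.com/secure/update123.php`, this sketch yields 8 raw words with an average length of 5.25, a longest word of 9 characters, a shortest word of 3 characters, 3 digits in the path, and a consecutive character repeat of 3 (from "www").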