Table 2.
Natural language processing (NLP) features [57]
Feature | Explanation |
---|---|
Raw Word Count | The number of words obtained by splitting the URL on special characters |
Brand Check for Domain | Whether the domain of the analyzed URL is in the brand name list |
Average Word Length | The average length of the words in the raw word list |
Longest Word Length | The length of the longest word in the raw word list |
Shortest Word Length | The length of the shortest word in the raw word list |
Standard Deviation | Standard deviation of word lengths in the raw word list |
Adjacent Word Count | Number of adjacent words processed in the WDM module |
Average Adjacent Word Length | The average length of the detected adjacent words |
Separated Word Count | The number of words obtained as a result of decomposing adjacent words |
Keyword Count | The number of keywords in the URL |
Brand Name Count | The number of brand names in the URL |
Similar Keyword Count | The number of words in the URL that are similar to a keyword |
Similar Brand Name Count | The number of words in the URL that are similar to a brand name |
Random Word Count | The number of words in the URL that are made up of random characters |
Target Brand Name Count | The number of target brand names in the URL |
Target Keyword Count | The number of target keywords in the URL |
Other Words Count | The number of words that are not in the brand name and keyword lists but are in the English dictionary (e.g., computer, pencil, notebook) |
Digit Count (3) | The number of digits in the URL, counted separately for the domain, subdomain and file path |
Subdomain Count | The number of subdomains in the URL |
Random Domain | Whether the registered domain is made up of random characters |
Length (3) | Length is calculated separately for the domain, subdomain and path |
Known TLD | Whether the TLD of the URL is a known one. The most widely used TLDs worldwide are [“com”, “org”, “net”, “de”, “edu”, “gov”, etc.] |
www, com (2) | Whether “www” or “com” appears within the domain or subdomain, which is common in malicious URLs |
Puny Code | Punycode is a standard for representing non-ASCII characters in domain names with ASCII, which the browser decodes in the address field. Attackers may use it to keep malicious URLs from being detected |
Special Character (8) | Within the URL, the components are separated from each other by dots. However, an attacker could create a malicious URL using some special characters {‘-’, ‘.’, ‘/’, ‘@’, ‘?’, ‘&’, ‘=’, ‘_’} |
Consecutive Character Repeat | Attackers can make small changes in brand names or keywords to deceive users. These slight changes can be in the form of using the same character more than once |
Alexa Check (2) | Alexa is a service that ranks frequently used websites by popularity. Whether the domain is in the Alexa top one million list |
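
Several of the purely lexical features in Table 2 can be illustrated with a short Python sketch. It is not the implementation from [57]: the function names, the naive host split (last label taken as the TLD, the label before it as the registered domain), the use of the population standard deviation, the reading of "file path" as path plus query string, and the small `KNOWN_TLDS` set are all assumptions made for illustration. Features that need brand or keyword lists, the WDM, or the Alexa list are left out.

```python
import re
import statistics
from urllib.parse import urlsplit

SPECIAL_CHARS = "-./@?&=_"  # the eight special characters listed in Table 2
KNOWN_TLDS = {"com", "org", "net", "de", "edu", "gov"}  # illustrative subset only


def split_host(host):
    """Naively split a host into (subdomain, domain, tld).

    Assumption: the TLD is the last label and the registered domain is the
    label before it; multi-label suffixes such as "co.uk" are not handled.
    """
    labels = host.lower().split(".")
    if len(labels) < 2:
        return "", host.lower(), ""
    return ".".join(labels[:-2]), labels[-2], labels[-1]


def longest_char_run(text):
    """Length of the longest run of one repeated character."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", text)), default=0)


def count_digits(text):
    return sum(ch.isdigit() for ch in text)


def lexical_features(url):
    """Compute a subset of the Table 2 features for a single URL."""
    if "://" not in url:
        url = "http://" + url  # assumption: treat scheme-less input as http
    parts = urlsplit(url)
    subdomain, domain, tld = split_host(parts.hostname or "")
    # Assumption: "file path" covers the path plus the query string.
    path = parts.path + ("?" + parts.query if parts.query else "")
    body = url.split("://", 1)[1]  # URL without the scheme

    # Raw word list: split on the special characters, drop empty pieces.
    words = [w for w in re.split(r"[-./@?&=_]", body) if w]
    lengths = [len(w) for w in words]

    features = {
        "raw_word_count": len(words),
        "average_word_length": statistics.mean(lengths) if lengths else 0.0,
        "longest_word_length": max(lengths, default=0),
        "shortest_word_length": min(lengths, default=0),
        # Population standard deviation; the table does not say which is used.
        "word_length_std": statistics.pstdev(lengths) if lengths else 0.0,
        "subdomain_count": len(subdomain.split(".")) if subdomain else 0,
        "known_tld": tld in KNOWN_TLDS,
        "puny_code": any(label.startswith("xn--")
                         for label in (parts.hostname or "").split(".")),
        "max_char_run": longest_char_run(body),
    }
    # The "(3)" features: one value each for domain, subdomain and path.
    for name, text in (("domain", domain), ("subdomain", subdomain), ("path", path)):
        features[f"digit_count_{name}"] = count_digits(text)
        features[f"length_{name}"] = len(text)
    # The "(8)" feature: one count per special character.
    for ch in SPECIAL_CHARS:
        features[f"count_{ch}"] = body.count(ch)
    return features
```

For the made-up URL `http://www.secure-login.examp1e.com/account/update?id=12345`, `lexical_features` returns a raw word count of 9, a longest word length of 7, a shortest word length of 2, one digit in the domain and five in the path. A production system would more likely rely on a public suffix list to separate the registered domain from its subdomains.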