Data in Brief. 2022 Mar 3;41:107968. doi: 10.1016/j.dib.2022.107968

Data set and machine learning models for the classification of network traffic originators

Daniele Canavese a, Leonardo Regano a, Cataldo Basile a, Gabriele Ciravegna b, Antonio Lioy a
PMCID: PMC8920864  PMID: 35300388

Abstract

The widespread adoption of encryption in computer network traffic is increasing the difficulty of analyzing such traffic for security purposes. The data set presented in this data article is composed of network statistics computed on captures of TCP flows, originated by executing various network stress and web crawling tools, along with statistics of benign web browsing traffic. Furthermore, this data article describes a set of Machine Learning models, trained using the described data set, which can classify network traffic by the tool category (network stress tool, web crawler, web browser), the specific tool (e.g., Firefox), and also the tool version (e.g., Firefox 68) used to generate it. These models are compatible with the analysis of traffic with encrypted payload since statistics are evaluated only on the TCP headers of the packets. The data presented in this article can be useful to train and assess the performance of new Machine Learning models for tool classification.

Keywords: Network traffic anomaly, Intrusion detection, Machine learning, DoS attacks, Web crawling

Specifications Table

Subject: Cryptography and Cybersecurity
Specific subject area: Cyber-threat and Anomaly Detection
Type of data: Table, Chart
How data were acquired: Data were gathered by our research group through live traffic captures.
Data format: Analyzed, Filtered
Parameters for data collection: All tools used to generate the data set were run with their default parameters.
Description of data collection: (1) Traffic generated by a set of scripts launching various network stress and web crawling tools, along with traffic obtained through manual web browsing by various users, was captured using tshark. (2) A set of network statistics on the resulting traffic captures was evaluated using the tstat tool. The client/server IPs and TCP ports have been removed from the data set, both for anonymization purposes and because they are not useful to train the models.
Data source location: City: Turin, Cuneo; Country: Italy
Data accessibility: Repository name: Encryption agnostic classifiers of traffic originators and their application to anomaly detection - data sets and models
Data identification number: 10.5281/zenodo.5797882
Direct URL to data: 10.5281/zenodo.5797882
Related research article: D. Canavese, L. Regano, C. Basile, G. Ciravegna, A. Lioy, Encryption agnostic classifiers of traffic originators and their application to anomaly detection, Computers and Electrical Engineering, 10.1016/j.compeleceng.2021.107621

Value of the Data

  • This data may be used as a benchmark for developing Machine Learning models aimed at obtaining information about the tools that originated sniffed network traffic. Presently, no benchmark data are available for researchers wanting to perform this type of classification. These models are of interest for developers of security monitoring systems, like Intrusion Detection Systems. Several types of attacks, e.g., Distributed Denial of Service and web crawling attacks, are launched using ad hoc tools. Therefore, getting information about the tools that originate the traffic can improve the detection abilities of these monitoring systems.

  • These data are valuable as a data set for researchers interested in training Machine Learning models designed to obtain information about the tools that originate the sniffed traffic. Moreover, these data may serve in hyper-parameter optimization processes.

  • Since several of the trained Machine Learning models are based on neural networks, these data also may be used to speed up the training of new neural networks via transfer learning.

  • These data allow assessing the results of research presented in [1], which first aimed at obtaining information about the tools that originated sniffed network traffic.

1. Data Description

1.1. Data set

This section reports several statistics about the data set. Table 1 lists the tools used to generate the traffic considered in the presented data set. Table 2 reports the features that have been used to train and test the Machine Learning models.

Table 1.

Tools used to generate the traffic considered in the experiments.

application category Windows Linux
Chrome 48 browser
Chrome 68 browser
Firefox 42 browser
Firefox 62 browser
Firefox 68 browser
Edge 42 browser
Opera 62 browser

GoldenEye 3.49.2 stress tool
HULK 1.0 stress tool
RudyJS 1.0.0 stress tool
SlowHTTPTest 1.6 stress tool
SlowLoris 7.70 stress tool

Curl 7.55 web crawler
GrabSite 2.1.16 web crawler
Httrack 3.49.2 web crawler
Wget 1.19 web crawler
Wpull 2.0.1 web crawler

Table 2.

TCP statistics used as classification features.

feature unit
1 # packets (both directions) packets
2 # packets with payload (both directions) packets
3 # retransmitted packets (both directions) packets
4 # out of sequence packets (both directions) packets
5 # packets with ACK set (both directions) packets
6 # packets with ACK set and no payload (both directions) packets
7 # packets with FIN set (both directions) packets
8 # packets with RST set (both directions)¹ packets
9 # packets with SYN set (both directions) packets
10 # payload bytes excluding retransmissions (both directions) bytes
11 # payload bytes including retransmissions (both directions) bytes
12 # retransmitted bytes (both directions) bytes
13 flow duration ms
14 relative time of first payload packet (both directions) ms
15 relative time of last payload packet (both directions) ms
16 relative time of first ACK packet (both directions) ms
17 TCP connection correctly terminated boolean
¹ This can be only 0 or 1 since a proper TCP implementation will reset a connection after receiving an RST packet.
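As a rough illustration of how statistics like those in Table 2 can be derived from TCP headers alone, the sketch below computes a few of them for one flow. The `Packet` record, its field names, and the sample flow are hypothetical, and the sketch ignores packet direction and retransmission detection. The data set itself was produced with tstat, not with this code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Packet:
    """Hypothetical header-only view of one TCP packet."""
    time_ms: float      # capture timestamp in milliseconds
    payload_len: int    # TCP payload length in bytes (the content is never read)
    syn: bool = False
    ack: bool = False
    fin: bool = False
    rst: bool = False


def flow_features(packets: List[Packet]) -> Dict[str, float]:
    """Compute a handful of Table 2 style statistics for one flow."""
    if not packets:
        raise ValueError("a flow needs at least one packet")
    start = packets[0].time_ms
    with_payload = [p for p in packets if p.payload_len > 0]
    return {
        "packets": len(packets),
        "packets_with_payload": len(with_payload),
        "ack_packets": sum(p.ack for p in packets),
        "pure_ack_packets": sum(p.ack and p.payload_len == 0 for p in packets),
        "fin_packets": sum(p.fin for p in packets),
        "rst_packets": sum(p.rst for p in packets),
        "syn_packets": sum(p.syn for p in packets),
        # Without retransmission detection this total includes retransmitted bytes.
        "payload_bytes": sum(p.payload_len for p in packets),
        "duration_ms": packets[-1].time_ms - start,
        "first_payload_ms": with_payload[0].time_ms - start if with_payload else 0.0,
        "last_payload_ms": with_payload[-1].time_ms - start if with_payload else 0.0,
    }


# Made-up flow: handshake, two data packets, and a closing FIN.
flow = [
    Packet(0.0, 0, syn=True),
    Packet(12.0, 0, syn=True, ack=True),
    Packet(12.5, 0, ack=True),
    Packet(20.0, 300, ack=True),
    Packet(45.0, 1200, ack=True),
    Packet(60.0, 0, ack=True, fin=True),
]
features = flow_features(flow)
```

Because only lengths, flags, and timestamps are used, the same computation applies whether or not the payload is encrypted.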

Finally, several statistics, grouped by labels, are reported: the average number of packets and bytes sent by the client or server and the average connection duration in milliseconds. Tables 3 and 4 report the averages for all the tools and for their instances in the data set, respectively.

Table 3.

Means of some features for the tools in our data set.

tool client-packets client-bytes server-packets server-bytes duration[ms]
chrome 30.605 2587.957 43.548 46727.950 36023.258
curl 48.260 539.707 71.030 91280.696 608.296
edge 23.135 2024.088 21.693 21782.879 12693.361
firefox 40.620 2744.452 60.108 71260.752 24748.461
goldeneye 13.061 800.137 21.071 24220.021 1409.918
grabsite 368.901 3929.413 583.037 1803017.128 15427.538
httrack 16.424 1009.946 21.192 23018.443 2517.901
hulk 5.711 573.383 4.576 2659.303 5909.654
opera 25.263 1914.553 47.881 53419.715 40428.032
rudy 11.332 713.800 10.997 3342.403 15770.126
slowhttptest 8.640 1406.865 6.826 3494.015 11974.112
slowloris 5.280 164.620 3.890 47.859 13641.368
wget 129.312 2134.542 246.652 328985.405 2756.862
wpull 115.476 1239.060 214.299 296092.743 8558.179

Table 4.

Means of some features for the tool instances in our data set.

tool-instance client-packets client-bytes server-packets server-bytes duration[ms]
chrome-48.0.2564.109 31.145 2337.780 41.950 41918.843 34469.858
chrome-68.0.3440.84 29.843 2941.171 45.803 53517.726 38216.437
curl-7.55.1 31.203 649.631 53.382 65814.020 431.330
curl-7.61.0 67.340 416.752 90.771 119766.433 806.241
edge-42.17134.1.0 23.135 2024.088 21.693 21782.879 12693.361
firefox-42.0 37.651 3162.645 57.196 66622.033 24610.871
firefox-62.0 49.066 3359.130 72.157 84900.460 33731.822
firefox-68.0 30.374 1155.053 43.732 54509.414 9834.131
goldeneye-2.1 13.061 800.137 21.071 24220.021 1409.918
grabsite-2.1.16 368.901 3929.413 583.037 1803017.128 15427.538
httrack-3.49.2 16.424 1009.946 21.192 23018.443 2517.901
hulk-1.0 5.711 573.383 4.576 2659.303 5909.654
opera-62.0.3331.66 25.263 1914.553 47.881 53419.715 40428.032
rudy-1.0.0 11.332 713.800 10.997 3342.403 15770.126
slowhttptest-1.6 8.640 1406.865 6.826 3494.015 11974.112
slowloris-0.1.4 5.404 164.220 3.973 48.299 13392.927
slowloris-0.1.5 5.159 165.008 3.810 47.434 13881.584
wget-1.11.4 92.605 1024.641 184.089 249312.020 2076.603
wget-1.19.5 176.123 3549.987 326.437 430592.154 3624.391
wpull-2.0.1 115.476 1239.060 214.299 296092.743 8558.179
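Per-label averages such as those in Tables 3 and 4 amount to a group-by-and-mean over the flow records. The following minimal sketch shows the idea; the row layout, the key names, and the three sample rows are invented for illustration and are not taken from the data set files.

```python
from collections import defaultdict
from statistics import mean


def mean_by_label(rows, label_key, feature_keys):
    """Average the selected features over all rows sharing the same label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    return {
        label: {key: mean(r[key] for r in members) for key in feature_keys}
        for label, members in groups.items()
    }


# Hypothetical flow records (values are made up).
rows = [
    {"tool": "wget", "instance": "wget-1.11.4", "client_packets": 90, "duration_ms": 2000.0},
    {"tool": "wget", "instance": "wget-1.19.5", "client_packets": 170, "duration_ms": 3000.0},
    {"tool": "hulk", "instance": "hulk-1.0", "client_packets": 6, "duration_ms": 5900.0},
]

per_tool = mean_by_label(rows, "tool", ["client_packets", "duration_ms"])
per_instance = mean_by_label(rows, "instance", ["client_packets", "duration_ms"])
```

Grouping by the tool label corresponds to the layout of Table 3, and grouping by the tool instance label to that of Table 4.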

1.2. Classifiers

This section reports several statistics and plots about the models for classifying the traffic into various classes. Three different models have been considered for each classification task: a random forest (via the RandomForestClassifier class in scikit-learn), an extra-trees classifier (via the ExtraTreesClassifier class in scikit-learn), and a neural network (a custom class implemented in PyTorch and skorch). The hyper-parameters were optimized with the hyperopt package, using a Bayesian optimization procedure.

For each classifier, the following data are reported:

  • The plots showing the values of the Rk statistic as our Bayesian hyper-parameter optimization process progressed (Figs. 1 to 9).

  • The tables listing the optimal hyper-parameters found by our Bayesian optimization process (Tables 5, 9, 13, 17, 21, 25, 29, 33, and 37); we normally used the default values for the hyper-parameters not reported.

  • The tables reporting several classification statistics computed on the training set, development set, known tools test set (KTS), and unknown tools test set (UTS) (Tables 6, 10, 14, 18, 22, 26, 30, 34, and 38).

  • The confusion matrices for each classifier (Tables 7, 11, 15, 19, 23, 27, 31, 35, and 39).

  • The results of the classification of the unknown tools (Tables 8, 12, 16, 20, 24, 28, 32, 36, and 40).

Fig. 1. Hyper-parameters optimization plot for the category classifier based on random forest.

Fig. 2. Hyper-parameters optimization plot for the category classifier based on extra-trees.

Fig. 3. Hyper-parameters optimization plot for the category classifier based on neural network.

Fig. 4. Hyper-parameters optimization plot for the tool classifier based on random forest.

Fig. 5. Hyper-parameters optimization plot for the tool classifier based on extra-trees.

Fig. 6. Hyper-parameters optimization plot for the tool classifier based on neural network.

Fig. 7. Hyper-parameters optimization plot for the tool instance classifier based on random forest.

Fig. 8. Hyper-parameters optimization plot for the tool instance classifier based on extra-trees.

Fig. 9. Hyper-parameters optimization plot for the tool instance classifier based on neural network.

Table 5.

Optimal hyper-parameters for the category classifier based on random forest.

hyper-parameter value
criterion entropy
max_depth 17
min_samples_leaf 9
min_samples_split 38
n_estimators 89
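For readers who want to rebuild a comparable model, the values in Table 5 map directly onto constructor arguments of scikit-learn's RandomForestClassifier. The sketch below is only an illustration: the training data are random placeholders, the choice of 17 feature columns is arbitrary, and `random_state` is an assumption added for reproducibility rather than a setting reported in the article.

```python
import random

from sklearn.ensemble import RandomForestClassifier

# Hyper-parameters copied from Table 5; all others keep the scikit-learn defaults.
category_rf = RandomForestClassifier(
    criterion="entropy",
    max_depth=17,
    min_samples_leaf=9,
    min_samples_split=38,
    n_estimators=89,
    random_state=0,  # assumption: fixed only to make this sketch repeatable
)

# Placeholder data standing in for the real feature matrix and category labels.
rng = random.Random(0)
X = [[rng.random() for _ in range(17)] for _ in range(300)]
y = [rng.choice(["browser", "crawler", "dos"]) for _ in range(300)]

category_rf.fit(X, y)
predictions = category_rf.predict(X[:5])
```

With the real data set, `X` would hold the TCP statistics of each flow and `y` the corresponding category labels.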

Table 9.

Optimal hyper-parameters for the category classifier based on extra-trees.

hyper-parameter value
criterion entropy
max_depth 20
min_samples_leaf 3
min_samples_split 37
n_estimators 88

Table 13.

Optimal hyper-parameters for the category classifier based on neural network.

hyper-parameter value
lr 0.0027014308955057255
module layers 4
module neurons_per_layer 434
module p 0.1292524544974843

Table 17.

Optimal hyper-parameters for the tool classifier based on random forest.

hyper-parameter value
criterion entropy
max_depth 20
min_samples_leaf 5
min_samples_split 22
n_estimators 417

Table 21.

Optimal hyper-parameters for the tool classifier based on extra-trees.

hyper-parameter value
criterion entropy
max_depth 20
min_samples_leaf 21
min_samples_split 49
n_estimators 96

Table 25.

Optimal hyper-parameters for the tool classifier based on neural network.

hyper-parameter value
lr 0.0015639059764891423
module layers 4
module neurons_per_layer 196
module p 0.34048448616373395

Table 29.

Optimal hyper-parameters for the tool instance classifier based on random forest.

hyper-parameter value
criterion entropy
max_depth 20
min_samples_leaf 6
min_samples_split 18
n_estimators 314

Table 33.

Optimal hyper-parameters for the tool instance classifier based on extra-trees.

hyper-parameter value
criterion gini
max_depth 20
min_samples_leaf 8
min_samples_split 18
n_estimators 417

Table 37.

Optimal hyper-parameters for the tool instance classifier based on neural network.

hyper-parameter value
lr 0.001044660236833224
module layers 4
module neurons_per_layer 478
module p 0.33926635120188525

Table 6.

Classification statistics for the category classifier based on random forest.

statistic training set dev set KTS UTS
samples 955,872 119,484 119,485 30,144
accuracy [%] 97.702 97.542 97.447 40.459
balanced accuracy [%] 95.568 93.997 94.041 34.031
precision [%] 86.660 85.637 85.226 32.723
recall [%] 95.568 93.997 94.041 34.031
Cohen's kappa [%] 86.105 85.021 84.577 1.097
F-score [%] 90.719 89.459 89.230 32.151
Jaccard score [%] 83.556 81.607 81.255 20.631
Hamming loss 0.023 0.025 0.026 0.595
zero-one loss 0.023 0.025 0.026 0.595
Rk 0.865 0.854 0.850 0.011

Table 10.

Classification statistics for the category classifier based on extra-trees.

statistic training set dev set KTS UTS
samples 955,872 119,484 119,485 30,144
accuracy [%] 85.085 85.033 85.015 37.646
balanced accuracy [%] 80.692 80.216 80.373 38.510
precision [%] 62.558 62.265 62.413 33.339
recall [%] 80.692 80.216 80.373 38.510
Cohen's kappa [%] 42.760 42.372 42.547 1.694
F-score [%] 62.138 61.837 61.866 33.949
Jaccard score [%] 51.661 51.322 51.552 21.426
Hamming loss 0.149 0.150 0.150 0.624
zero-one loss 0.149 0.150 0.150 0.624
Rk 0.491 0.487 0.489 0.017

Table 14.

Classification statistics for the category classifier based on neural network.

statistic training set dev set KTS UTS
samples 955,872 119,484 119,485 30,144
accuracy [%] 96.132 96.005 96.017 41.912
balanced accuracy [%] 91.123 90.207 90.308 35.058
precision [%] 75.600 74.957 74.653 33.739
recall [%] 91.123 90.207 90.308 35.058
Cohen's kappa [%] 77.644 76.827 76.988 2.587
F-score [%] 80.923 80.166 79.902 33.950
Jaccard score [%] 70.873 69.961 69.772 21.943
Hamming loss 0.039 0.040 0.040 0.581
zero-one loss 0.039 0.040 0.040 0.581
Rk 0.784 0.776 0.778 0.026

Table 18.

Classification statistics for the tool classifier based on random forest.

statistic training set dev set KTS UTS
samples 955,872 119,484 119,485 30,144
accuracy [%] 95.716 94.977 94.953 7.680
balanced accuracy [%] 95.590 89.399 90.366 8.858
precision [%] 80.233 76.446 76.447 2.624
recall [%] 95.590 89.399 90.366 2.531
Cohen's kappa [%] 91.752 90.300 90.285 3.298
F-score [%] 85.925 81.341 81.477 2.577
Jaccard score [%] 77.432 71.180 71.442 1.572
Hamming loss 0.043 0.050 0.050 0.923
zero-one loss 0.043 0.050 0.050 0.923
Rk 0.919 0.904 0.904 0.041

Table 22.

Classification statistics for the tool classifier based on extra-trees.

statistic training set dev set KTS UTS
samples 955,872 119,484 119,485 30,144
accuracy [%] 83.376 83.256 83.226 4.668
balanced accuracy [%] 69.140 66.884 66.701 5.383
precision [%] 50.983 49.885 49.685 3.105
recall [%] 69.140 66.884 66.701 1.538
Cohen's kappa [%] 70.282 70.017 70.042 2.396
F-score [%] 52.837 51.522 51.257 2.057
Jaccard score [%] 39.406 38.294 38.123 1.202
Hamming loss 0.166 0.167 0.168 0.953
zero-one loss 0.166 0.167 0.168 0.953
Rk 0.713 0.710 0.711 0.030

Table 26.

Classification statistics for the tool classifier based on neural network.

statistic training set dev set KTS UTS
samples 955,872 119,484 119,485 30,144
accuracy [%] 87.735 87.516 87.516 7.467
balanced accuracy [%] 81.721 79.971 80.332 8.613
precision [%] 53.469 52.862 52.688 2.854
recall [%] 81.721 79.971 80.332 2.461
Cohen's kappa [%] 77.740 77.340 77.379 3.560
F-score [%] 60.582 59.641 59.479 2.643
Jaccard score [%] 48.020 47.214 47.164 1.621
Hamming loss 0.123 0.125 0.125 0.925
zero-one loss 0.123 0.125 0.125 0.925
Rk 0.785 0.782 0.782 0.044

Table 30.

Classification statistics for the tool instance classifier based on random forest.

statistic training set dev set KTS UTS
samples 955,872 119,484 119,485 30,144
accuracy [%] 95.565 94.642 94.601 0.000
balanced accuracy [%] 95.196 87.183 88.021 0.000
precision [%] 79.996 74.641 75.061 0.000
recall [%] 95.196 87.183 88.021 0.000
Cohen's kappa [%] 91.495 89.690 89.648 0.000
F-score [%] 85.628 79.254 79.707 0.000
Jaccard score [%] 76.831 68.057 68.871 0.000
Hamming loss 0.044 0.054 0.054 1.000
zero-one loss 0.044 0.054 0.054 1.000
Rk 0.916 0.898 0.898 0.000

Table 34.

Classification statistics for the tool instance classifier based on extra-trees.

statistic training set dev set KTS UTS
samples 955,872 119,484 119,485 30,144
accuracy [%] 80.889 80.672 80.644 0.000
balanced accuracy [%] 66.162 63.546 62.533 0.000
precision [%] 44.696 42.762 42.506 0.000
recall [%] 66.162 63.546 62.533 0.000
Cohen's kappa [%] 66.786 66.362 66.387 0.000
F-score [%] 45.703 43.713 43.182 0.000
Jaccard score [%] 33.113 31.442 31.131 0.000
Hamming loss 0.191 0.193 0.194 1.000
zero-one loss 0.191 0.193 0.194 1.000
Rk 0.681 0.677 0.677 0.000

Table 38.

Classification statistics for the tool instance classifier based on neural network.

statistic training set dev set KTS UTS
samples 955,872 119,484 119,485 30,144
accuracy [%] 83.050 82.687 82.649 0.000
balanced accuracy [%] 79.137 76.554 76.767 0.000
precision [%] 45.881 44.808 44.725 0.000
recall [%] 79.137 76.554 76.767 0.000
Cohen's kappa [%] 70.892 70.285 70.290 0.000
F-score [%] 50.890 49.582 49.411 0.000
Jaccard score [%] 38.186 37.084 36.897 0.000
Hamming loss 0.170 0.173 0.174 1.000
zero-one loss 0.170 0.173 0.174 1.000
Rk 0.724 0.719 0.719 0.000

Table 7.

Confusion matrix for the category classifier based on random forest on the KTS.

inferred
browser crawler dos
target browser 6872 194 322
crawler 120 2112 83
dos 1980 352 107,450
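Several of the KTS statistics in Table 6 can be re-derived from this confusion matrix, which is a convenient consistency check when reusing the data. The sketch below does so in plain Python. It assumes that Rk denotes the multi-class Matthews correlation coefficient; under that assumption the computed values agree with the tabulated ones after rounding.

```python
from math import sqrt

# KTS confusion matrix copied from Table 7 (rows: target class, columns: inferred class).
TABLE_7 = [
    [6872, 194, 322],      # browser
    [120, 2112, 83],       # crawler
    [1980, 352, 107450],   # dos
]


def summarize(cm):
    """Derive single-label classification statistics from a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    targets = [sum(row) for row in cm]                              # samples per true class
    inferred = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # samples per inferred class
    overlap = sum(t * p for t, p in zip(targets, inferred))

    accuracy = correct / total
    # Balanced accuracy: unweighted mean of the per-class recalls.
    balanced_accuracy = sum(cm[i][i] / targets[i] for i in range(n)) / n

    # Cohen's kappa: observed agreement corrected for chance agreement.
    chance = overlap / total**2
    kappa = (accuracy - chance) / (1 - chance)

    # Multi-class Matthews correlation coefficient (assumed to be the Rk statistic).
    rk = (correct * total - overlap) / sqrt(
        (total**2 - sum(p * p for p in inferred)) * (total**2 - sum(t * t for t in targets))
    )

    return {
        "samples": total,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "kappa": kappa,
        "zero_one_loss": 1 - accuracy,
        "rk": rk,
    }


stats = summarize(TABLE_7)
for name, value in stats.items():
    print(f"{name}: {value:.5f}")
```

The same function can be applied to the other confusion matrices in this section, since it only requires a square matrix of counts.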

Table 11.

Confusion matrix for the category classifier based on extra-trees on the KTS.

inferred
browser crawler dos
target browser 5255 1496 667
crawler 126 1954 235
dos 1439 13,942 94,401

Table 15.

Confusion matrix for the category classifier based on neural network on the KTS.

inferred
browser crawler dos
target browser 6423 489 476
crawler 148 2018 149
dos 1270 2227 106,285

Table 19.

Confusion matrix for the tool classifier based on random forest on the KTS.

inferred
chrome curl edge firefox goldeneye httrack hulk rudy slowloris wget wpull
target chrome 1999 25 166 113 110 9 12 1 3 14 22
curl 1 273 13 3 4 11 0 1 1 12 4
edge 73 16 2717 45 26 9 10 0 1 10 9
firefox 140 17 74 1588 71 41 17 3 2 16 29
goldeneye 1053 22 159 366 74,967 150 1511 107 10 31 318
httrack 17 6 4 16 20 1229 2 0 0 0 4
hulk 409 4 21 49 357 15 28,464 20 0 9 93
rudy 3 0 0 1 7 2 7 321 2 0 5
slowloris 0 2 1 9 0 20 0 1 1263 0 3
wget 2 4 3 3 12 0 1 1 0 449 7
wpull 2 3 4 2 10 4 0 0 0 3 184

Table 23.

Confusion matrix for the tool classifier based on extra-trees on the KTS.

inferred
chrome curl edge firefox goldeneye httrack hulk rudy slowloris wget wpull
target chrome 1256 73 208 112 358 126 43 13 153 64 68
curl 0 179 22 0 41 38 2 1 26 7 7
edge 124 35 2025 30 103 140 51 10 284 11 103
firefox 204 52 171 837 286 169 27 17 68 63 104
goldeneye 92 237 1084 57 66,908 5511 1061 173 2759 458 354
httrack 17 15 15 14 122 978 1 1 94 13 28
hulk 26 6 431 38 1496 1225 25,393 15 641 20 150
rudy 0 2 4 0 32 19 5 249 33 2 2
slowloris 1 0 2 0 0 35 0 0 1260 0 1
wget 4 18 19 0 40 51 4 3 68 255 20
wpull 1 5 8 1 36 38 0 6 12 3 102

Table 27.

Confusion matrix for the tool classifier based on neural network on the KTS.

inferred
chrome curl edge firefox goldeneye httrack hulk rudy slowloris wget wpull
target chrome 1509 50 204 300 214 42 35 4 2 75 39
curl 1 207 14 5 37 14 2 2 0 28 13
edge 61 16 2552 66 36 28 38 10 1 51 57
firefox 165 19 119 1269 184 80 50 10 1 34 67
goldeneye 1189 313 821 562 67,851 710 3873 1273 106 1176 820
httrack 13 1 37 38 98 1087 10 1 0 7 6
hulk 367 4 53 48 488 115 27,981 147 0 133 105
rudy 1 0 2 1 5 2 11 306 2 9 9
slowloris 2 0 0 19 0 15 0 3 1258 1 1
wget 4 15 16 1 17 5 2 6 0 383 33
wpull 1 2 5 5 12 4 3 4 0 10 166

Table 31.

Confusion matrix for the tool instance classifier based on random forest on the KTS (where go = goldeneye, fi = firefox, hu = hulk, wg = wget, ed = edge, ht = httrack, ch = chrome, ru = rudy, sl = slowloris, cu = curl, and wp = wpull).

inferred
ch-48.0 ch-68.0 cu-7.55.1 cu-7.61.0 ed-42 fi-42.0 fi-62.0 go-2.1 ht-3.49.2 hu-1.0 ru-1.0.0 sl-0.1.4 sl-0.1.5 wg-1.11.4 wg-1.19.5 wp-2.0.1
target ch-48.0 1128 80 14 4 91 18 20 61 4 4 0 0 3 3 10 10
ch-68.0 49 810 7 1 36 18 33 53 6 4 0 0 0 0 1 6
cu-7.55.1 0 1 150 6 0 0 3 4 5 0 0 0 1 2 0 0
cu-7.61.0 0 0 2 122 9 2 0 0 4 0 1 0 0 2 6 3
ed-42 75 37 2 21 2656 32 19 35 9 10 0 0 2 4 6 8
fi-42.0 26 22 9 2 41 646 51 34 12 5 2 0 2 4 3 10
fi-62.0 27 45 9 1 28 46 894 37 17 6 0 0 3 0 5 11
go-2.1 114 1191 50 2 157 74 320 74,802 130 1459 93 4 5 18 10 265
ht-3.49.2 4 15 7 1 7 5 6 23 1224 2 0 0 1 0 0 3
hu-1.0 35 485 11 0 21 16 53 293 12 28,435 18 0 3 6 0 53
ru-1.0.0 3 1 0 1 0 1 0 6 3 7 321 1 0 0 0 4
sl-0.1.4 0 0 0 0 0 0 0 0 0 0 1 607 6 0 0 0
sl-0.1.5 0 0 4 0 1 6 1 0 13 0 1 48 608 0 0 3
wg-1.11.4 2 1 1 2 4 3 0 8 0 1 1 0 0 259 0 1
wg-1.19.5 1 0 0 1 0 0 0 0 0 0 0 0 0 0 196 1
wp-2.0.1 1 5 1 5 4 4 2 6 3 1 0 0 1 3 0 176

Table 35.

Confusion matrix for the tool instance classifier based on extra-trees on the KTS (where go = goldeneye, fi = firefox, hu = hulk, wg = wget, ed = edge, ht = httrack, ch = chrome, ru = rudy, sl = slowloris, cu = curl, and wp = wpull).

inferred
ch-48.0 ch-68.0 cu-7.55.1 cu-7.61.0 ed-42 fi-42.0 fi-62.0 go-2.1 ht-3.49.2 hu-1.0 ru-1.0.0 sl-0.1.4 sl-0.1.5 wg-1.11.4 wg-1.19.5 wp-2.0.1
target ch-48.0 721 89 62 5 44 20 23 180 66 0 9 98 48 26 39 20
ch-68.0 92 484 22 4 85 26 20 151 45 32 0 2 6 18 20 17
cu-7.55.1 0 0 98 5 2 0 0 34 8 2 1 2 7 13 0 0
cu-7.61.0 0 0 2 86 12 0 0 0 18 0 0 13 10 3 3 4
ed-42 52 93 13 58 1950 17 11 95 133 47 2 209 126 31 23 56
fi-42.0 45 32 31 12 80 268 95 80 72 3 7 23 42 26 20 33
fi-62.0 70 34 32 6 46 70 498 184 65 3 5 2 20 26 39 29
go-2.1 59 72 346 366 901 11 179 64,648 5382 767 24 2364 1052 1754 660 109
ht-3.49.2 8 5 14 1 11 2 4 118 990 1 0 6 73 32 4 29
hu-1.0 2 42 140 52 457 3 171 1333 1129 24,964 0 612 52 43 296 145
ru-1.0.0 0 0 1 2 6 0 1 29 17 3 236 26 13 10 4 0
sl-0.1.4 0 0 0 0 0 0 0 0 0 0 0 525 89 0 0 0
sl-0.1.5 0 0 1 0 0 0 0 0 25 0 0 205 453 0 0 1
wg-1.11.4 1 1 1 6 5 0 0 22 11 0 0 5 3 212 6 10
wg-1.19.5 0 0 1 5 3 0 0 0 21 3 0 30 0 2 132 2
wp-2.0.1 2 0 3 9 8 0 1 35 33 0 4 4 7 12 1 93

Table 39.

Confusion matrix for the tool instance classifier based on neural network on the KTS (where go = goldeneye, fi = firefox, hu = hulk, wg = wget, ed = edge, ht = httrack, ch = chrome, ru = rudy, sl = slowloris, cu = curl, and wp = wpull).

inferred
ch-48.0 ch-68.0 cu-7.55.1 cu-7.61.0 ed-42 fi-42.0 fi-62.0 go-2.1 ht-3.49.2 hu-1.0 ru-1.0.0 sl-0.1.4 sl-0.1.5 wg-1.11.4 wg-1.19.5 wp-2.0.1
target ch-48.0 832 178 26 14 91 57 78 43 16 6 22 0 2 8 41 36
ch-68.0 60 716 25 1 40 23 80 45 7 8 1 0 2 3 1 12
cu-7.55.1 0 2 133 5 0 1 1 6 11 0 0 0 1 6 0 6
cu-7.61.0 0 0 0 121 3 1 0 1 6 0 2 0 0 3 12 2
ed-42 38 77 2 117 2395 30 40 19 39 33 10 0 1 14 62 39
fi-42.0 25 55 5 16 47 437 155 19 56 5 6 0 4 4 7 28
fi-62.0 18 98 36 2 28 68 752 39 35 6 2 0 6 5 11 23
go-2.1 132 1922 2542 50 598 123 1033 63,582 1334 2713 1230 5 45 993 719 1673
ht-3.49.2 5 17 22 1 13 7 24 37 1150 1 1 0 4 5 3 8
hu-1.0 2 573 76 15 60 20 474 264 123 26,812 287 0 2 23 431 279
ru-1.0.0 2 0 1 0 1 1 0 1 5 9 312 2 0 7 0 7
sl-0.1.4 0 0 0 0 0 0 0 0 0 0 2 538 74 0 0 0
sl-0.1.5 0 0 0 1 0 5 0 0 27 0 1 248 402 0 1 0
wg-1.11.4 0 2 2 7 2 1 1 13 8 0 5 0 0 228 2 12
wg-1.19.5 0 0 0 6 4 0 0 1 0 0 0 0 0 0 185 3
wp-2.0.1 1 8 2 6 5 3 0 3 9 0 5 0 1 8 3 158

Table 8.

Classification of unknown tools for the category classifier based on random forest.

inferred class samples inferred class samples
browser 3291 browser 1369
crawler 2623 crawler 883
dos 620 dos 1413


(a) Classification of firefox-68.0. (b) Classification of grabsite-2.1.16.

inferred class samples inferred class samples
browser 6149 browser 6657
crawler 605 crawler 2465
dos 2196 dos 1873


(c) Classification of opera-62.0.3331.66. (d) Classification of slowhttptest-1.6.
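The four sub-tables of Table 8 can be combined to recover the UTS accuracy and balanced accuracy reported in Table 6 for the random forest category classifier. In the sketch below, each unknown tool is mapped to its true category as listed in Table 1, with stress tools labelled `dos` as in the tables; the counts are copied from Table 8.

```python
# Inferred categories per unknown tool (Table 8), with the true category from Table 1.
UNKNOWN_TOOLS = {
    "firefox-68.0": ("browser", {"browser": 3291, "crawler": 2623, "dos": 620}),
    "grabsite-2.1.16": ("crawler", {"browser": 1369, "crawler": 883, "dos": 1413}),
    "opera-62.0.3331.66": ("browser", {"browser": 6149, "crawler": 605, "dos": 2196}),
    "slowhttptest-1.6": ("dos", {"browser": 6657, "crawler": 2465, "dos": 1873}),
}


def uts_scores(tools):
    """Return (samples, accuracy, balanced accuracy) over all unknown tools."""
    per_class_total = {}
    per_class_correct = {}
    for true_label, counts in tools.values():
        per_class_total[true_label] = per_class_total.get(true_label, 0) + sum(counts.values())
        per_class_correct[true_label] = per_class_correct.get(true_label, 0) + counts[true_label]
    samples = sum(per_class_total.values())
    accuracy = sum(per_class_correct.values()) / samples
    # Balanced accuracy: unweighted mean of the per-category recalls.
    balanced = sum(
        per_class_correct[label] / per_class_total[label] for label in per_class_total
    ) / len(per_class_total)
    return samples, accuracy, balanced


samples, accuracy, balanced = uts_scores(UNKNOWN_TOOLS)
print(samples, round(accuracy * 100, 3), round(balanced * 100, 3))
```

The sample count equals the UTS size given in Table 6, which confirms that the four sub-tables together cover the whole unknown tools test set.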

Table 12.

Classification of unknown tools for the category classifier based on extra-trees.

inferred class samples inferred class samples
browser 2570 browser 837
crawler 1242 crawler 1809
dos 2722 dos 1019


(a) Classification of firefox-68.0. (b) Classification of grabsite-2.1.16.

inferred class samples inferred class samples
browser 5237 browser 5077
crawler 1046 crawler 4186
dos 2667 dos 1732


(c) Classification of opera-62.0.3331.66. (d) Classification of slowhttptest-1.6.

Table 16.

Classification of unknown tools for the category classifier based on neural network.

inferred class samples inferred class samples

browser 3366 browser 2012
crawler 358 crawler 835
dos 2810 dos 818


(a) Classification of firefox-68.0. (b) Classification of grabsite-2.1.16.

inferred class samples inferred class samples
browser 6086 browser 5201
crawler 738 crawler 3447
dos 2126 dos 2347


(c) Classification of opera-62.0.3331.66. (d) Classification of slowhttptest-1.6.

Table 20.

Classification of unknown tools for the tool classifier based on random forest.

inferred class samples inferred class samples
chrome 688 chrome 348
curl 6 curl 41
edge 329 edge 262
firefox 2315 firefox 639
goldeneye 405 goldeneye 1440
httrack 113 httrack 186
hulk 55 hulk 185
rudy 37 rudy 26
slowloris 2 slowloris 6
wget 53 wget 207
wpull 2531 wpull 325


(a) Classification of firefox-68.0. (b) Classification of grabsite-2.1.16.

inferred class samples inferred class samples
chrome 5006 chrome 175
curl 78 curl 57
edge 288 edge 1831
firefox 1013 firefox 2334
goldeneye 2139 goldeneye 1032
httrack 182 httrack 372
hulk 34 hulk 17
rudy 15 rudy 4022
slowloris 3 slowloris 39
wget 20 wget 39
wpull 172 wpull 1077


(c) Classification of opera-62.0.3331.66. (d) Classification of slowhttptest-1.6.

Table 24.

Classification of unknown tools for the tool classifier based on extra-trees.

inferred class samples inferred class samples
chrome 260 chrome 188
curl 350 curl 142
edge 663 edge 230
firefox 1407 firefox 187
goldeneye 495 goldeneye 1097
httrack 585 httrack 468
hulk 2249 hulk 32
rudy 8 rudy 33
slowloris 41 slowloris 132
wget 215 wget 293
wpull 261 wpull 863


(a) Classification of firefox-68.0. (b) Classification of grabsite-2.1.16.

inferred class samples inferred class samples
chrome 3498 chrome 373
curl 412 edge 1458
edge 494 firefox 1120
firefox 523 goldeneye 2033
goldeneye 3009 httrack 1598
httrack 354 hulk 61
hulk 39 rudy 3516
rudy 147 slowloris 463
slowloris 156 wget 53
wget 66 wpull 320
wpull 252


(c) Classification of opera-62.0.3331.66. (d) Classification of slowhttptest-1.6.

Table 28.

Classification of unknown tools for the tool classifier based on neural network.

inferred class samples inferred class samples
chrome 339 chrome 364
curl 0 curl 13
edge 717 edge 458
firefox 2251 firefox 558
goldeneye 2612 goldeneye 670
httrack 244 httrack 521
hulk 154 hulk 290
rudy 1 rudy 133
slowloris 0 slowloris 0
wget 76 wget 243
wpull 140 wpull 415


(a) Classification of firefox-68.0. (b) Classification of grabsite-2.1.16.

inferred class samples inferred class samples
chrome 4097 chrome 1831
curl 138 curl 109
edge 134 edge 704
firefox 1502 firefox 1323
goldeneye 2342 goldeneye 270
httrack 367 httrack 378
hulk 80 hulk 109
rudy 29 rudy 4144
slowloris 4 slowloris 338
wget 59 wget 640
wpull 198 wpull 1149


(c) Classification of opera-62.0.3331.66. (d) Classification of slowhttptest-1.6.

Table 32.

Classification of unknown tools for the tool instance classifier based on random forest.

inferred class samples inferred class samples
chrome-48.0.2564.109 367 chrome-48.0.2564.109 112
chrome-68.0.3440.84 226 chrome-68.0.3440.84 168
curl-7.55.1 9 curl-7.55.1 53
edge-42.17134.1.0 356 curl-7.61.0 2
firefox-42.0 607 edge-42.17134.1.0 260
firefox-62.0 1805 firefox-42.0 224
goldeneye-2.1 377 firefox-62.0 296
httrack-3.49.2 47 goldeneye-2.1 1579
hulk-1.0 49 httrack-3.49.2 164
rudy-1.0.0 33 hulk-1.0 223
slowloris-0.1.5 52 rudy-1.0.0 25
wget-1.19.5 60 slowloris-0.1.5 11
wpull-2.0.1 2546 wget-1.11.4 84
wget-1.19.5 23
wpull-2.0.1 441


(a) Classification of firefox-68.0. (b) Classification of grabsite-2.1.16.

inferred class samples inferred class samples
chrome-48.0.2564.109 2365 chrome-48.0.2564.109 165
chrome-68.0.3440.84 2178 chrome-68.0.3440.84 27
curl-7.55.1 103 curl-7.55.1 78
curl-7.61.0 5 curl-7.61.0 22
edge-42.17134.1.0 382 edge-42.17134.1.0 1795
firefox-42.0 275 firefox-42.0 394
firefox-62.0 714 firefox-62.0 1220
goldeneye-2.1 2431 goldeneye-2.1 1212
httrack-3.49.2 231 httrack-3.49.2 341
hulk-1.0 35 hulk-1.0 33
rudy-1.0.0 18 rudy-1.0.0 4061
slowloris-0.1.5 6 slowloris-0.1.4 1
wget-1.11.4 12 slowloris-0.1.5 56
wget-1.19.5 5 wget-1.11.4 25
wpull-2.0.1 190 wget-1.19.5 8
wpull-2.0.1 1557


(c) Classification of opera-62.0.3331.66. (d) Classification of slowhttptest-1.6.

Table 36.

Classification of unknown tools for the tool instance classifier based on extra-trees.

inferred class samples inferred class samples
chrome-48.0.2564.109 450 chrome-48.0.2564.109 148
chrome-68.0.3440.84 14 chrome-68.0.3440.84 24
curl-7.55.1 183 curl-7.55.1 80
curl-7.61.0 58 curl-7.61.0 42
edge-42.17134.1.0 573 edge-42.17134.1.0 217
firefox-42.0 342 firefox-42.0 46
firefox-62.0 1026 firefox-62.0 156
goldeneye-2.1 472 goldeneye-2.1 1039
httrack-3.49.2 596 httrack-3.49.2 404
hulk-1.0 2241 hulk-1.0 22
rudy-1.0.0 22 rudy-1.0.0 20
slowloris-0.1.4 2 slowloris-0.1.4 97
slowloris-0.1.5 23 slowloris-0.1.5 123
wget-1.11.4 49 wget-1.11.4 198
wget-1.19.5 260 wget-1.19.5 170
wpull-2.0.1 223 wpull-2.0.1 879


(a) Classification of firefox-68.0. (b) Classification of grabsite-2.1.16.

inferred class samples inferred class samples
chrome-48.0.2564.109 1913 chrome-48.0.2564.109 271
chrome-68.0.3440.84 1672 chrome-68.0.3440.84 214
curl-7.55.1 503 curl-7.55.1 6
curl-7.61.0 22 curl-7.61.0 19
edge-42.17134.1.0 432 edge-42.17134.1.0 1473
firefox-42.0 192 firefox-42.0 211
firefox-62.0 220 firefox-62.0 581
goldeneye-2.1 2978 goldeneye-2.1 1965
httrack-3.49.2 341 httrack-3.49.2 920
hulk-1.0 41 hulk-1.0 32
rudy-1.0.0 166 rudy-1.0.0 2674
slowloris-0.1.4 19 slowloris-0.1.4 160
slowloris-0.1.5 80 slowloris-0.1.5 1592
wget-1.11.4 92 wget-1.11.4 226
wget-1.19.5 81 wget-1.19.5 13
wpull-2.0.1 198 wpull-2.0.1 638


(c) Classification of opera-62.0.3331.66. (d) Classification of slowhttptest-1.6.

Table 40.

Classification of unknown tools for the tool instance classifier based on neural network.

inferred class samples inferred class samples
chrome-48.0.2564.109 142 chrome-48.0.2564.109 210
chrome-68.0.3440.84 336 chrome-68.0.3440.84 321
curl-7.55.1 62 curl-7.55.1 47
curl-7.61.0 3 curl-7.61.0 19
edge-42.17134.1.0 454 edge-42.17134.1.0 673
firefox-42.0 660 firefox-42.0 139
firefox-62.0 1651 firefox-62.0 439
goldeneye-2.1 2386 goldeneye-2.1 480
httrack-3.49.2 351 httrack-3.49.2 312
hulk-1.0 62 hulk-1.0 121
rudy-1.0.0 60 rudy-1.0.0 240
slowloris-0.1.4 0 slowloris-0.1.4 0
slowloris-0.1.5 75 slowloris-0.1.5 9
wget-1.11.4 5 wget-1.11.4 99
wget-1.19.5 96 wget-1.19.5 201
wpull-2.0.1 191 wpull-2.0.1 355


(a) Classification of firefox-68.0. (b) Classification of grabsite-2.1.16.

inferred class samples inferred class samples
chrome-48.0.2564.109 1636 chrome-48.0.2564.109 920
chrome-68.0.3440.84 3446 chrome-68.0.3440.84 239
curl-7.55.1 509 curl-7.55.1 33
curl-7.61.0 8 curl-7.61.0 71
edge-42.17134.1.0 75 edge-42.17134.1.0 1022
firefox-42.0 281 firefox-42.0 473
firefox-62.0 1091 firefox-62.0 1262
goldeneye-2.1 1292 goldeneye-2.1 100
httrack-3.49.2 322 httrack-3.49.2 735
hulk-1.0 26 hulk-1.0 61
rudy-1.0.0 20 rudy-1.0.0 4359
slowloris-0.1.4 0 slowloris-0.1.4 9
slowloris-0.1.5 7 slowloris-0.1.5 3
wget-1.11.4 45 wget-1.11.4 405
wget-1.19.5 28 wget-1.19.5 129
wpull-2.0.1 164 wpull-2.0.1 1174


(c) Classification of opera-62.0.3331.66. (d) Classification of slowhttptest-1.6.

1.3. Category classifiers

This section reports several statistics and plots about the models for classifying the traffic into categories (e.g., browser, crawler, and dos, a.k.a. network stress tools).

1.4. Tool classifiers

This section reports several statistics and plots about the models for classifying the traffic into tools (e.g., goldeneye, hulk, firefox, wget, edge, httrack, chrome, rudy, slowloris, curl, and wpull).

1.5. Tool instance classifiers

This section reports statistics and plots about the models for classifying the traffic into tool instances (e.g., goldeneye-2.1, firefox-62.0, hulk-1.0, wget-1.11.4, edge-42.17134.1.0, httrack-3.49.2, chrome-48.0.2564.109, rudy-1.0.0, chrome-68.0.3440.84, firefox-42.0, slowloris-0.1.5, curl-7.55.1, curl-7.61.0, slowloris-0.1.4, wpull-2.0.1, and wget-1.19.5).
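The three label levels are nested: a tool instance name determines its tool, and the tool determines its category. A minimal sketch of that relationship follows. The tool-to-category assignment is an assumption inferred from the tool descriptions in this article (web browsers, web crawlers, network stress tools), and the function name `labels_for` is hypothetical.

```python
from typing import Tuple

# Assumed assignment of each tool to one of the three categories.
TOOL_TO_CATEGORY = {
    "chrome": "browser",
    "edge": "browser",
    "firefox": "browser",
    "curl": "crawler",
    "httrack": "crawler",
    "wget": "crawler",
    "wpull": "crawler",
    "goldeneye": "dos",
    "hulk": "dos",
    "rudy": "dos",
    "slowloris": "dos",
}


def labels_for(instance: str) -> Tuple[str, str, str]:
    """Derive (category, tool, tool instance) from a name such as 'wget-1.19.5'."""
    tool, separator, version = instance.partition("-")
    if not separator or not version:
        raise ValueError(f"expected '<tool>-<version>', got {instance!r}")
    if tool not in TOOL_TO_CATEGORY:
        raise KeyError(f"unknown tool {tool!r}")
    return TOOL_TO_CATEGORY[tool], tool, instance


for name in ("edge-42.17134.1.0", "httrack-3.49.2", "slowloris-0.1.5"):
    print(labels_for(name))
```

Deriving the coarser labels from the instance name keeps the three label sets consistent when the same flows are used to train all three classifier families.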

2. Experimental Design, Materials and Methods

The traffic used to generate the dataset has been captured using Wireshark 2.6.4 through its tshark command-line interface. The web browsing part of the traffic dataset has been generated by manually browsing the Internet, as reported in Section 4.1 of the main paper [1].

To generate the traffic for the web crawling and DDoS categories, instead, a set of Python 3 scripts has been written; they are available at 10.5281/zenodo.5797882. For each tool instance, these scripts automatically start the traffic capture (see Table 1), execute the tool on a set of websites (prior authorization from the site owners was requested whenever such activities required it), wait for the tool to terminate, and stop the capture. Then, background traffic not generated by the analyzed tools has been discarded from the captures. Multiple truncated versions of the filtered captures have also been generated. As explained in Section 4.1 of the main paper [1], truncating the captures made it possible to benchmark our models on incomplete connections and thus to test their suitability for a live analysis scenario. For instance, Fig. 2 reports the balanced accuracy of the random forest-based category classifier in a live analysis scenario, i.e., it plots the balanced accuracy against the number of packets exchanged in the analyzed connection. Subsequently, the statistics have been computed on each of the TCP flows contained in both the full and truncated traffic captures using the TCP Statistic and Analysis Tool (Tstat), one of the most used traffic measurement tools. As reported in Table 2, the models use a subset of the statistics available in the Core/Basic TCP Set. Client/server IP addresses and TCP ports have not been reported, both for anonymization purposes and because they are not needed to train the machine learning models. All the datasets are available in the GitHub repository mentioned above.
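The two ingredients of the live-analysis benchmark, truncating connections and scoring with balanced accuracy, can be sketched as follows. This is an illustration under stated assumptions, not the authors' scripts: packets are represented as hypothetical `(flow_id, packet)` pairs in capture order, truncation is assumed to keep the first N packets of each flow, and balanced accuracy is computed as the mean of the per-class recalls.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Sequence, Tuple


def truncate_flows(
    packets: Iterable[Tuple[Hashable, object]], max_packets: int
) -> List[Tuple[Hashable, object]]:
    """Keep only the first max_packets packets of every flow, in capture order."""
    seen = defaultdict(int)  # packets already kept for each flow
    kept = []
    for flow_id, packet in packets:
        if seen[flow_id] < max_packets:
            kept.append((flow_id, packet))
            seen[flow_id] += 1
    return kept


def balanced_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Average the recall of each class that appears in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == guess:
            hits[truth] += 1
    return sum(hits[label] / totals[label] for label in totals) / len(totals)


# Toy example: four browser flows and one dos flow, all predicted as browser.
truth = ["browser"] * 4 + ["dos"]
guess = ["browser"] * 5
print(balanced_accuracy(truth, guess))  # 0.5, although plain accuracy is 0.8
```

Balanced accuracy is a natural choice here because the classes are far from equally represented; a classifier that ignores a rare class is penalized even when its plain accuracy looks high.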

Finally, a dataset has been generated to train a set of machine learning models able to classify a TCP connection based on the tool category, the specific tool, and the specific version of the tool used to generate it. Section 5.1 of the main paper [1] describes in detail how the models have been trained. This document reports the hyperparameters used to train each of the models. For instance, Table 14 reports the hyperparameters used to train the category classifier based on neural networks. The trained machine learning models have been made available as joblib objects in the previously mentioned GitHub repository.
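A minimal sketch of how one of the published joblib objects might be loaded and applied is given below. It assumes that the deserialized object exposes a scikit-learn style `predict()` method and that each input row holds the Tstat statistics of Table 2 in the order used for training; neither assumption is confirmed by this article. The file name in the usage comment is a placeholder, and `load_model` and `classify` are hypothetical helpers.

```python
from typing import Callable, List, Optional, Sequence


def load_model(path: str, loader: Optional[Callable[[str], object]] = None):
    """Deserialize a model file; the default loader is joblib.load."""
    if loader is None:
        import joblib  # third-party package, imported only when needed

        loader = joblib.load
    return loader(path)


def classify(model, feature_rows: Sequence[Sequence[float]]) -> List:
    """Predict one label per flow, assuming a scikit-learn style predict()."""
    return list(model.predict(feature_rows))


# Usage (placeholder file name; features must match the training columns):
# model = load_model("category_classifier.joblib")
# labels = classify(model, rows_of_tstat_statistics)
```

Since joblib files are pickle-based, loading one can execute arbitrary code, so only files from a trusted source should be opened. The library that defines the serialized estimator also has to be installed in a compatible version.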

CRediT authorship contribution statement

Daniele Canavese: Conceptualization, Methodology, Software, Validation, Writing – original draft. Leonardo Regano: Conceptualization, Validation, Writing – original draft. Cataldo Basile: Conceptualization, Investigation, Writing – review & editing. Gabriele Ciravegna: Software, Writing – original draft. Antonio Lioy: Resources, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.

Acknowledgements

This work has received funding from the European Union's Horizon 2020 Research and Innovation Program, under projects PALANTIR (Grant Agreement No. 883335) and CyberSec4Europe (Grant Agreement no. 830929).

1. The most notable exceptions are the neural networks’ batch size and number of epochs, which we chose to set to 1024 and 50, respectively.

Contributor Information

Daniele Canavese, Email: daniele.canavese@polito.it.

Leonardo Regano, Email: leonardo.regano@polito.it.

Cataldo Basile, Email: cataldo.basile@polito.it.

Gabriele Ciravegna, Email: gabriele.ciravegna@unifi.it.

Antonio Lioy, Email: lioy@polito.it.

Reference

  • 1. Canavese D., Regano L., Basile C., Ciravegna G., Lioy A. Encryption agnostic classifiers of traffic originators and their application to anomaly detection. Comput. Electr. Eng. 2021. doi: 10.1016/j.compeleceng.2021.107621.

Articles from Data in Brief are provided here courtesy of Elsevier
