Abstract
Encryption of network traffic should guarantee anonymity and prevent potential interception of information. Encrypted virtual private networks (VPNs) are designed to create special data tunnels that allow reliable transmission between networks and/or end users. However, as has been shown in a number of scientific papers, encryption alone may not be sufficient to secure data transmissions in the sense that certain information may be exposed. Our team has constructed a large dataset that contains generated encrypted network traffic data. This dataset contains a general network traffic model consisting of different types of network traffic such as web, emailing, video conferencing, video streaming, and terminal services. For the same network traffic model, data are measured for different scenarios, i.e., for data traffic through different types of VPNs and without VPNs. Additionally, the dataset contains the initial handshake of the VPN connections. The dataset can be used by various data scientists dealing with the classification of encrypted network traffic and encrypted VPNs.
Keywords: Machine Learning; IP flow; IPFIX; Network traffic; SSTP; OpenVPN; WireGuard
Specifications Table
Subject | Computer Networks and Communications |
Specific subject area | Encrypted Virtual Private Networks and their classification. |
Type of data | Structured |
How the data were acquired | The data were obtained by simulating real-world traffic and capturing it with network traffic probes. The captured traffic was stripped of redundant information and organized into flows according to context. The time resolution of the packets is on the order of microseconds. The data were captured from MikroTik RouterOS routers using the open-source software pmacct, exported in the IPFIX [1] format into Apache Kafka, preprocessed using our own solution, ipFlowDetector, and finally exported to JSON files. |
Data format | Raw |
Description of data collection | The data were measured in our network laboratory under the specific conditions and scenarios described in Section 2. The data were not rearranged, but flows with insufficient traffic were filtered out. |
Data source location |
|
Data accessibility | Repository name: Zenodo Data identification number: 10.5281/zenodo.7301756 Direct URL to data: https://zenodo.org/record/7301756 |
Value of the Data
- Researchers can utilize the data to investigate patterns in the behavior of different VPN protocols and traffic types, and use this information to create machine learning models that can detect VPN usage and classify VPN and traffic types.
- The proposed dataset can serve as a benchmark for researchers and developers studying VPN and traffic patterns. By utilizing machine learning and other algorithms, researchers can investigate patterns in VPN behavior and use this information to improve network Quality of Service (QoS) [2,3], investigate the privacy and security risks of each VPN type [4], or potentially develop VPN architectures that can bypass ISP and government restrictions.
- The dataset includes a diverse range of VPN protocols, including L2TP [5], L2TP-IPSEC [5], PPTP [6], SSTP [7], WireGuard [8], and OpenVPN [9]. Additionally, the dataset includes initial handshake flows for each VPN type, providing valuable information for further analysis. To the best of our knowledge, this is the first dataset to include multiple types of VPN flows beyond OpenVPN.
Objective
The dataset aims to allow researchers to study and compare different VPN types and internet traffic. We included multiple VPN types that are less studied in the literature than OpenVPN, making them more visible and accessible to researchers. Because many new web services have emerged in recent years, we used more up-to-date versions of services and websites than similar datasets do.
The most comparable dataset, ISCXVPN2016 [10], contains a limited variety of VPNs and outdated traffic content (the content used to generate its data is more than 6 years old). Newer datasets such as [4] and [11] use only OpenVPN, and their data are not openly available to researchers. This makes our dataset an important asset for scientists.
1. Data Description
The dataset consists of labeled network traffic. The traffic is either VPN traffic or non-VPN traffic. The VPN traffic is generated using a set of different VPN types:
- PPTP: Point-to-Point Tunneling Protocol, built by Microsoft. It operates at Layer 2 of the OSI model [12] and provides an insufficient level of encryption [13].
- L2TP: Layer 2 Tunneling Protocol (Layer 2 of the OSI model), without data encryption or strong authentication.
- L2TP-IPSEC: Layer 2 Tunneling Protocol encrypted using the IPsec protocol (NAT-Traversal mode).
- SSTP: Secure Socket Tunneling Protocol, which is based on HTTPS and operates at the application layer of the OSI model.
- WireGuard: A modern, open-source VPN protocol that operates at Layer 7, utilizes state-of-the-art cryptography, and uses UDP as its transport protocol.
- OpenVPN: A modern, popular, open-source protocol that operates at Layer 7. It is used primarily for end-user connections.
We divided the generated traffic into five types of traffic:
- Non-streaming: HTTP/HTTPS traffic from websites that do not contain streaming content such as video or audio. Example websites are www.google.com and www.github.com.
- Streaming: HTTP/HTTPS traffic from websites that contain streaming content, such as YouTube and Twitch.
- Email: Traffic generated by sending emails.
- VoIP: Traffic generated by videoconferencing services such as Google Meet.
- SSH: Traffic generated by connecting to remote servers using the Secure Shell protocol.
In addition to the flows of the generated traffic, we also included the first flows of each VPN's initial connection.
Table 1 shows that the dataset contains a substantial number of flows spread across the different VPN types. Fig. 1 illustrates the distribution of VPN flows in the dataset, with each slice of the pie chart representing the percentage of the total dataset flows that belongs to a particular VPN type. Fig. 2 shows the same distribution for the traffic types.
Table 1.
The number of flows and the size of the dataset for each VPN traffic type (without counting the initial flows).
Traffic category | VPN type | Size | Number of flows |
---|---|---|---|
Non-VPN traffic | | 4.3GB | 50,191 |
VPN traffic | L2TP | 1.7GB | 1314 |
VPN traffic | L2TP-IPSEC | 2.5GB | 1955 |
VPN traffic | PPTP | 2.3GB | 1521 |
VPN traffic | SSTP | 2.8GB | 1118 |
VPN traffic | WireGuard | 2.6GB | 1758 |
VPN traffic | OpenVPN | 2.3GB | 1120 |
VPN traffic | Total | 14.2GB | 8786 |
Overall total | | 18GB | 58,977 |
Fig. 1.
Pie chart of the distribution of the flows for each VPN type. Each slice of the pie chart represents the percentage of the total dataset flows that belongs to a particular VPN type.
Fig. 2.
Pie chart of the distribution of the flows for each traffic type. Each slice of the pie chart represents the percentage of the total dataset flows that belongs to a particular traffic type.
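As a rough illustration of how per-category shares like those in Fig. 1 can be derived, the sketch below recomputes percentages from the per-type flow counts listed in Table 1. Whether Fig. 1 was produced with exactly this normalization (all flows, including non-VPN traffic, in the denominator) is an assumption; the function and variable names are illustrative only.

```python
# Flow counts per VPN type, copied from Table 1 (initial handshake flows excluded).
VPN_FLOWS = {
    "L2TP": 1314,
    "L2TP-IPSEC": 1955,
    "PPTP": 1521,
    "SSTP": 1118,
    "WireGuard": 1758,
    "OpenVPN": 1120,
}
NON_VPN_FLOWS = 50_191


def flow_shares(vpn_flows, non_vpn_flows):
    """Return each category's percentage of all flows, rounded to 2 decimals."""
    counts = {"Non-VPN": non_vpn_flows, **vpn_flows}
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


for name, share in flow_shares(VPN_FLOWS, NON_VPN_FLOWS).items():
    print(f"{name:<12}{share:6.2f} %")
```

Summing the six per-type counts and the non-VPN count reproduces the overall total of 58,977 flows reported in Table 1.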
The dataset is stored in the JSON format, which is human-readable and supported by modern programming languages. At the top level, the dataset is split into two folders: the first contains non-VPN flows, and the second contains VPN flows. The VPN folder contains six subfolders, one for each VPN type. The non-VPN folder contains five traffic JSON files, while each VPN subfolder contains five traffic JSON files plus a JSON file with the first flows captured when establishing the VPN connection. Fig. 3 demonstrates how the files are organized in the dataset and Fig. 4 shows the sizes of the dataset by VPN and traffic type.
Fig. 3.
The structure of folders and files in the dataset.
Fig. 4.
The size of the dataset by VPN and traffic type.
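The folder layout described above lends itself to deriving labels directly from file paths. The following is a minimal sketch under stated assumptions: the actual folder and traffic file names are not given in the text (only initial-flows.json is named, in Section 2.2), so the function infers the VPN type purely from nesting depth, and `index_dataset` is a hypothetical helper name.

```python
from pathlib import Path


def index_dataset(root):
    """Label every JSON file under `root` by its position in the folder tree.

    Assumed layout (folder and file names are placeholders):
        root/<non-VPN folder>/<traffic>.json
        root/<VPN folder>/<VPN type>/<traffic>.json
    """
    root = Path(root)
    entries = []
    for path in sorted(root.rglob("*.json")):
        parts = path.relative_to(root).parts
        # Files nested one level deeper sit inside a per-VPN-type folder.
        vpn_type = parts[-2] if len(parts) >= 3 else None
        entries.append(
            {
                "path": path,
                "vpn_type": vpn_type,  # None for non-VPN traffic
                "traffic": path.stem,  # file name without ".json"
                "is_handshake": path.name == "initial-flows.json",
            }
        )
    return entries
```

With the layout described in the text, such an index would be expected to hold 5 non-VPN entries and 6 × 6 VPN entries; this is worth checking against the downloaded archive before relying on the labels.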
Each JSON file contains an array of flows. A flow is represented as an object that stores its principal information, such as the protocol name and the ports used. The flow object is described in Table 2.
Table 2.
The description of the flow object attributes.
Attribute Name | Description |
---|---|
ip_proto | Name of the protocol |
port_dst | Destination port |
port_src | Source port |
x_packets | Array of the captured packets in the flow |
Each flow object contains an array of the packets captured during that flow. Each packet is represented as a JSON object. The attributes present in a packet may differ from one flow protocol to another. The packet object is described in Table 3.
Table 3.
The description of the packet object attributes.
Attribute Name | Description |
---|---|
bytes | The size of the payload of the packet. A positive value means that the packet was sent in the forward direction; otherwise, the packet was sent in the backward direction |
timestamp_start | The start timestamp of the captured packet |
timestamp_end | The end timestamp of the captured packet |
packets | The number of packets captured during the capture interval |
ip_header_len | The length of IP header |
tcp_header_len | The length of the TCP header |
tcp_ack_number | The TCP acknowledgement number |
tcp_flags | The TCP flags |
tcp_seq_number | The TCP sequence number |
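To make the flow and packet structure of Tables 2 and 3 concrete, the sketch below loads one file and derives a few per-flow features, using the sign of `bytes` to separate forward and backward traffic. It is an illustration rather than part of the dataset tooling: the helper names are hypothetical, `bytes` is assumed to be stored as a number, and the timestamp fields are left out because their exact encoding is not specified here.

```python
import json
from statistics import mean


def load_flows(path):
    """Read one dataset file, which holds a JSON array of flow objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def flow_features(flow):
    """Summarize one flow object using the attributes of Tables 2 and 3."""
    packets = flow.get("x_packets", [])
    sizes = [p["bytes"] for p in packets]  # assumed numeric, signed by direction
    return {
        "ip_proto": flow.get("ip_proto"),
        "port_src": flow.get("port_src"),
        "port_dst": flow.get("port_dst"),
        "n_records": len(packets),
        # Positive values are forward packets; all others are backward.
        "bytes_forward": sum(b for b in sizes if b > 0),
        "bytes_backward": sum(-b for b in sizes if b <= 0),
        "mean_record_size": mean([abs(b) for b in sizes]) if sizes else 0.0,
    }
```

Because individual files are on the order of gigabytes, reading a whole file with `json.load` may need a lot of memory; a streaming JSON parser may be a better fit for constrained machines.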
2. Experimental Design, Materials and Methods
In this section, we describe the environment used for establishing and collecting VPN and non-VPN network flows (Section 2.1), and then we provide an overview of the data acquisition process (Section 2.2).
2.1. The Data Measurement Scheme
In Fig. 5, we show the environment used for the data acquisition. The scheme consists of five main components used for flow generation, VPN connection establishment, and capturing/filtering the network flows. The detailed overview of the roles and the specifications of each component is as follows:
- Virtual Machine 0 (VM0): An Ubuntu 20.04 LTS virtual machine that generates web traffic and stores the captured flows. It receives the captured flows from the Probe via Client MikroTik and saves them. It also sends traffic to and receives traffic from Client MikroTik.
- Client MikroTik: A MikroTik RouterOS virtual machine that acts as the client in VPN mode and links the Router and VM0. The VPN type is set manually in this VM.
- Server MikroTik: A MikroTik RouterOS virtual machine that acts as the server in VPN mode and links the Router and the internet. The VPN type is set manually in this VM.
- Router: A physical router hosted in the university laboratory. It links the Client and Server MikroTik virtual machines and sends the passing packets to the Probe.
- Probe: A physical computer that captures the mirrored traffic coming from the Router, converts it into the IPFIX format, and uploads the IPFIX records to a data storage.
Fig. 5.
The topology of the network items used for the traffic generation and capturing. All items are described in detail in the list above.
The MikroTik RouterOS already includes the configurations of all of the used VPNs. In the non-VPN setup, we disabled all of the VPN configurations and routed the traffic from VM0 directly through the Router.
The traffic captured by the Probe is preprocessed and filtered using ipFlowDetector, a program that we wrote in C++ for efficiency. The resulting flows are then exported to JSON files and stored in VM0. Later, the JSON files are anonymized by removing IP addresses and further filtered to remove broadcast flows.
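The anonymization and broadcast filtering step can be pictured with the following sketch. It is not the authors' implementation: the raw, pre-anonymization records are not published, so the key names `ip_src` and `ip_dst` are hypothetical, and only the IPv4 limited broadcast address is detected (recognizing subnet-directed broadcasts would require knowing the laboratory's subnet masks).

```python
import ipaddress

# Hypothetical key names; the raw (pre-anonymization) records are not published.
IP_KEYS = ("ip_src", "ip_dst")
LIMITED_BROADCAST = ipaddress.ip_address("255.255.255.255")


def is_broadcast(address):
    """True for the IPv4 limited broadcast address (255.255.255.255)."""
    return ipaddress.ip_address(address) == LIMITED_BROADCAST


def anonymize(flows):
    """Drop broadcast flows, then strip IP address fields from the rest."""
    cleaned = []
    for flow in flows:
        dst = flow.get("ip_dst")
        if dst is not None and is_broadcast(dst):
            continue  # broadcast flow: filtered out entirely
        # Copy the flow without its address fields; the input is not modified.
        cleaned.append({k: v for k, v in flow.items() if k not in IP_KEYS})
    return cleaned
```

The broadcast check has to run before the address fields are removed, since the destination address is the only information that identifies such flows in this sketch.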
2.2. Traffic Generation
In our work, we divided the generated traffic into five types: streaming, non-streaming, mail, VoIP, and SSH (refer to Section 1 for each type description). The choice of this classification and the distribution of each type was mainly based on our intuition because there are few publications on the distribution of traffic types in the real world [14].
To automate the traffic generation process, we created shell and Python scripts. Each Python script automates one traffic type, while the shell script contains the sequence of commands that runs the ipFlowDetector program and the Python scripts across the different VPN and traffic types. The details of the automation of each traffic type are as follows:
- Non-streaming: The Selenium library and Google Chrome version 104 were used. We collected a list of 1022 URLs of websites that do not contain streaming content, such as Wikipedia and Pinterest. The script opens the websites sequentially, waits for each page to load, stays on the page for a short duration, and then moves to the next website.
- Streaming: The Selenium library for Python and Google Chrome version 104 were used. We collected a list of 105 streaming content items, mostly from YouTube; the rest are from Twitch, SoundCloud, and other streaming services. As with non-streaming, the script opens the 105 pages sequentially but stays on each for a longer time.
- VoIP: The Selenium library for Python and Google Chrome version 104 were used. We used Google Meet (voice and video) with a simulated camera on the VM0 side.
- Mail: Multiple emails were sent using the redmail library and Outlook.
- SSH: A connection to a remote terminal was opened and a list of commands was executed multiple times using the spur library.
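The browsing loop shared by the non-streaming and streaming scripts can be sketched as follows. This is an illustration under assumptions, not the original script: the URL list and dwell time are placeholders, the error handling is a guess at reasonable behavior, and `browse` and `main` are hypothetical names. The loop accepts any object with a Selenium-style `get(url)` method, which, under Selenium's default page load strategy, returns once the page has loaded.

```python
import time


def browse(driver, urls, dwell_seconds, sleep=time.sleep):
    """Open each URL in turn and stay on the page for `dwell_seconds`.

    `driver` is any object with a Selenium-style `get(url)` method.
    Returns the list of URLs that loaded without raising an exception.
    """
    visited = []
    for url in urls:
        try:
            driver.get(url)
        except Exception as exc:  # e.g. a page load timeout; skip and go on
            print(f"skipped {url}: {exc}")
            continue
        sleep(dwell_seconds)  # short for non-streaming, longer for streaming
        visited.append(url)
    return visited


def main():
    # Imported here so the loop above can be exercised without a browser.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        # Placeholder URL and dwell time; the paper's lists are not reproduced.
        browse(driver, ["https://www.wikipedia.org"], dwell_seconds=5)
    finally:
        driver.quit()
```

Passing the driver and the sleep function as parameters keeps the loop testable with a stub; calling `main()` runs it against a real Chrome instance, provided Selenium and a matching Chrome driver are installed.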
In the VPN mode, ipFlowDetector captured the initial flows of each VPN connection establishment and saved them in initial-flows.json. The motivation for including these flows in the dataset is that OpenVPN handshakes have been used as a VPN fingerprinting method [4] and it can be useful for researchers to investigate other VPNs' handshakes. After establishing the VPN connection, we started capturing the flows of the five types of traffic.
Ethics Statements
Our work does not contain information retrieved from human subjects or based on animal experiments.
CRediT authorship contribution statement
Mohamed Naas: Data curation, Writing – review & editing. Jan Fesl: Conceptualization, Methodology, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors would like to acknowledge the University of South Bohemia, Faculty of Science, for providing the equipment necessary for the dataset creation.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Data Availability
USBVPN2022 (Original data) (zenodo)
References
- 1. Brownlee N. Flow-based measurement: IPFIX development and deployment. IEICE Trans. Commun. 2011;94(8):2190–2198.
- 2. Iliyasu A.S., Deng H. Semi-supervised encrypted traffic classification with deep convolutional generative adversarial networks. IEEE Access. 2019;8:118–126.
- 3. Maonan W., et al. CENTIME: a direct comprehensive traffic features extraction for encrypted traffic classification. Proceedings of the 2021 IEEE 6th International Conference on Computer and Communication Systems (ICCCS). IEEE; 2021.
- 4. Xue D., et al. OpenVPN is open to VPN fingerprinting. Proceedings of the 31st USENIX Security Symposium (USENIX Security 22). 2022.
- 5. IETF. RFC 2661, Layer Two Tunneling Protocol "L2TP". IETF; 1999. https://datatracker.ietf.org/doc/html/rfc2661 (Accessed October 2022).
- 6. PPTP, RouterOS. https://help.mikrotik.com/docs/display/ROS/PPTP (Accessed October 2022).
- 7. MS-SSTP. https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-sstp/70adc1df-c4fe-4b02-8872-f1d8b9ad806a (Accessed October 2022).
- 8. Donenfeld J.A. WireGuard: next generation kernel network tunnel. NDSS. 2017:1–12.
- 9. OpenVPN Cloud Knowledge Base. https://openvpn.net/cloud-docs/ (Accessed October 2022).
- 10. Draper-Gil G., et al. Characterization of encrypted and vpn traffic using time-related. Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP). 2016.
- 11. Afandi W., et al. Fingerprinting technique for YouTube videos identification in network traffic. IEEE Access. 2022;10:76731–76741.
- 12. ISO. ISO/IEC 7498-1:1994 Information technology – Open Systems Interconnection – Basic Reference Model: The Basic Model. ISO; 1994.
- 13. Microsoft Says Don't Use PPTP and MS-CHAP. http://www.h-online.com/security/news/item/Microsoft-says-don-t-use-PPTP-and-MS-CHAP-1672257.html (Accessed October 2022).
- 14. Schumann L., et al. Impact of evolving protocols and COVID-19 on internet traffic shares. arXiv preprint arXiv:2201.00142. 2022.