A novel dataset for encrypted virtual private network traffic analysis

Mohamed Naas; Jan Fesl

doi:10.1016/j.dib.2023.108945

Data in Brief. 2023 Feb 1;47:108945.

PMCID: PMC9925847 PMID: 36798601

Abstract

Encryption of network traffic should guarantee anonymity and prevent potential interception of information. Encrypted virtual private networks (VPNs) are designed to create special data tunnels that allow reliable transmission between networks and/or end users. However, as has been shown in a number of scientific papers, encryption alone may not be sufficient to secure data transmissions in the sense that certain information may be exposed. Our team has constructed a large dataset that contains generated encrypted network traffic data. This dataset contains a general network traffic model consisting of different types of network traffic such as web, emailing, video conferencing, video streaming, and terminal services. For the same network traffic model, data are measured for different scenarios, i.e., for data traffic through different types of VPNs and without VPNs. Additionally, the dataset contains the initial handshake of the VPN connections. The dataset can be used by various data scientists dealing with the classification of encrypted network traffic and encrypted VPNs.

Keywords: Machine learning; IP flow; IPFIX; Network traffic; SSTP; OpenVPN; WireGuard

Specifications Table

Subject	Computer Networks and Communications
Specific subject area	Encrypted virtual private networks (VPNs) and their classification.
Type of data	Structured
How the data were acquired	The data were obtained by capturing simulated real-world traffic with network traffic probes, stripping redundant information, and organizing the packets into flows according to their context. Packet timestamps have a resolution on the order of microseconds. The data were captured from MikroTik RouterOS routers using the open-source software pmacct, exported in the IPFIX [1] format into Apache Kafka, preprocessed using our own tool, ipFlowDetector, and finally exported to JSON files.
Data format	Raw
Description of data collection	The data were measured in our network laboratory under the specific conditions and scenarios described in Section 2. The data were not rearranged, but insufficient traffic flows were filtered out.
Data source location	• Department of Informatics, Faculty of Science, University of South Bohemia. • Branišovská 31a, České Budějovice, Czech Republic, 37001. • GPS coordinates: 48° 58′ 28.09″ N, 14° 28′ 27.62″ E
Data accessibility	Repository name: Zenodo. Data identification number: 10.5281/zenodo.7301756. Direct URL to data: https://zenodo.org/record/7301756

Value of the Data

• Researchers can use the data to investigate patterns in the behavior of different VPN protocols and traffic types, and use this information to create machine learning models that detect VPN usage and classify VPN and traffic types.
• The proposed dataset can serve as a benchmark for researchers and developers studying VPN and traffic patterns. By applying machine learning and other algorithms, researchers can investigate patterns in VPN behavior and use this information to improve network Quality of Service (QoS) [2,3], investigate the privacy and security risks of each VPN type [4], or potentially develop VPN architectures that can bypass ISP and government restrictions.
• The dataset includes a diverse range of VPN protocols: L2TP [5], L2TP-IPSEC [5], PPTP [6], SSTP [7], WireGuard [8], and OpenVPN [9]. Additionally, the dataset includes initial handshake flows for each VPN type, providing valuable information for further analysis. To the best of our knowledge, this is the first dataset to include multiple types of VPN flows beyond OpenVPN.

Objective

The dataset aims to allow researchers to study and compare different VPN types and internet traffic. We included multiple VPN types that are less studied in the literature than OpenVPN, making them more visible and accessible to researchers. Because many new web services have emerged in recent years, we also used more up-to-date versions of services and websites than similar datasets.

The most comparable dataset, ISCXVPN2016 [10], contains a limited variety of VPNs and outdated traffic content (the content used to generate the data is more than 6 years old). Newer datasets such as [4] and [11] use only OpenVPN, and their data are not openly available to researchers. This makes our dataset an important asset for scientists.

1. Data Description

The dataset consists of labeled network traffic. The traffic is either VPN traffic or non-VPN traffic. The VPN traffic is generated using a set of different VPN types:

• PPTP: Point-to-Point Tunneling Protocol, built by Microsoft. It operates at Layer 2 of the OSI model [12] and does not provide a sufficient level of encryption [13].
• L2TP: Layer 2 Tunneling Protocol (Layer 2 of the OSI model), without data encryption or strong authentication.
• L2TP-IPSEC: Layer 2 Tunneling Protocol, encrypted using the IPSEC protocol (NAT-Traversal mode).
• SSTP: Secure Socket Tunneling Protocol, which is based on HTTPS and operates at the application layer of the OSI model.
• WireGuard: A modern, open-source VPN protocol that operates on layer 7, uses state-of-the-art cryptography, and uses UDP as its transport protocol.
• OpenVPN: A modern, popular, open-source protocol that operates on layer 7. It is used primarily for end-user connections.

We divided the generated traffic into five types of traffic:

• Non-streaming: HTTP/HTTPS traffic from websites that do not contain streaming content such as video and audio. Example websites are www.google.com and www.github.com.
• Streaming: HTTP/HTTPS traffic from websites that contain streaming content, such as YouTube and Twitch.
• Email: Traffic generated by delivering emails.
• VoIP: Traffic generated by videoconferencing services such as Google Meet.
• SSH: Traffic generated by connecting to remote servers using the Secure Shell protocol.

In addition to the flows of the generated traffic, we also included the first flows of each VPN's initial connection.

Table 1 shows that the dataset contains a substantial number of flows spread across the different VPN types. Fig. 1 illustrates the distribution of VPN flows in the dataset, with each slice of the pie chart representing the percentage of the total dataset flows that belongs to a particular VPN type. Fig. 2 shows the same distribution for traffic types.

Table 1.

The number of flows and the size of the dataset for each VPN traffic type (without counting the initial flows).

	VPN type	Size	Number of flows
Non-VPN traffic		4.3 GB	50,191
VPN traffic	L2TP	1.7 GB	1,314
	L2TP-IPSEC	2.5 GB	1,955
	PPTP	2.3 GB	1,521
	SSTP	2.8 GB	1,118
	WireGuard	2.6 GB	1,758
	OpenVPN	2.3 GB	1,120
	Total	14.2 GB	8,786
Overall total		18 GB	58,977
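
As a quick consistency check on Table 1, the per-type flow counts can be summed and converted into the kind of percentage shares that a pie chart over the total dataset flows would show. The sketch below is only an illustration built from the numbers printed in the table; the function name `flow_shares` is ours and is not part of the dataset tooling.

```python
# Flow counts copied from Table 1 (initial handshake flows are excluded there).
VPN_FLOWS = {
    "L2TP": 1314,
    "L2TP-IPSEC": 1955,
    "PPTP": 1521,
    "SSTP": 1118,
    "WireGuard": 1758,
    "OpenVPN": 1120,
}
NON_VPN_FLOWS = 50_191


def flow_shares(vpn_flows, non_vpn_flows):
    """Return each category's share of all flows, in percent."""
    counts = {"Non-VPN": non_vpn_flows, **vpn_flows}
    total = sum(counts.values())
    return {name: 100.0 * count / total for name, count in counts.items()}


vpn_total = sum(VPN_FLOWS.values())          # flows across the six VPN types
overall_total = vpn_total + NON_VPN_FLOWS    # all flows in the dataset
shares = flow_shares(VPN_FLOWS, NON_VPN_FLOWS)
```

With these inputs, the six VPN rows sum to 8,786 flows, which together with the 50,191 non-VPN flows gives the overall total of 58,977 reported in the table.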

Fig 1 — Pie chart of the distribution of the flows for each VPN type. Each slice of the pie chart represents the percentage of flows for a particular VPN type on the total dataset flows.

Fig 2 — Pie chart of the distribution of the flows for each traffic type. Each slice of the pie chart represents the percentage of flows for a particular traffic type on the total dataset flows.

The dataset is stored in the JSON format, which is human-readable and supported by modern programming languages. At the top level, the dataset is split into two folders: the first contains non-VPN flows and the second contains VPN flows. The VPN folder contains six subfolders, one for each VPN type. The non-VPN folder contains five traffic JSON files, while each VPN subfolder has five traffic JSON files plus a JSON file containing the first flows captured while establishing the VPN connection. Fig. 3 shows how the files are organized in the dataset, and Fig. 4 shows the sizes of the dataset by VPN and traffic type.

Fig 3 — The structure of folders and files in the dataset.

Fig 4 — The size of the dataset by VPN and traffic type.
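
A folder layout like this can be traversed with a few lines of Python. The following is a minimal sketch under stated assumptions: the exact folder and file names are not given in the text (only `initial-flows.json` is named later), so the code relies only on nesting depth, treating files two levels below the root as belonging to a VPN type and files one level below as non-VPN. The helper names `iter_flow_files` and `count_flows` are hypothetical.

```python
import json
from pathlib import Path


def iter_flow_files(root):
    """Yield (vpn_label, json_path) for every JSON file below `root`."""
    root = Path(root)
    for path in sorted(root.rglob("*.json")):
        parts = path.relative_to(root).parts
        # Assumed layout:
        #   <root>/<non-VPN folder>/<traffic>.json
        #   <root>/<VPN folder>/<VPN type>/<traffic>.json
        vpn_label = parts[1] if len(parts) >= 3 else "non-vpn"
        yield vpn_label, path


def count_flows(root):
    """Return {(vpn_label, file_stem): number_of_flows} for the whole tree."""
    counts = {}
    for vpn_label, path in iter_flow_files(root):
        with path.open(encoding="utf-8") as handle:
            flows = json.load(handle)  # each file holds one JSON array of flows
        counts[(vpn_label, path.stem)] = len(flows)
    return counts
```

Because some files are several gigabytes in size, loading a whole file with `json.load` may need a lot of memory; a streaming JSON parser could be substituted if that becomes a problem.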

Each JSON file contains an array of flows. A flow is represented as an object that stores its principal information, such as the protocol name and the ports used. The flow object is described in Table 2.

Table 2.

The description of the flow object attributes.

Attribute Name	Description
ip_proto	Name of the protocol
port_dst	Destination port
port_src	Source port
x_packets	Array of the captured packets in the flow
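
To make the flow object concrete, the sketch below builds a few flow objects with the four attributes from Table 2 and tallies them by protocol. The attribute values (protocol strings such as `"tcp"` and the port numbers) are invented for illustration and are assumptions about how the fields are encoded; `protocol_histogram` is a hypothetical helper name.

```python
from collections import Counter


def protocol_histogram(flows):
    """Count flows per value of the `ip_proto` attribute."""
    return Counter(flow.get("ip_proto", "unknown") for flow in flows)


# Illustrative flow objects; the values are made up, only the keys follow Table 2.
example_flows = [
    {"ip_proto": "tcp", "port_src": 51514, "port_dst": 443, "x_packets": []},
    {"ip_proto": "udp", "port_src": 40000, "port_dst": 51820, "x_packets": []},
    {"ip_proto": "tcp", "port_src": 51515, "port_dst": 22, "x_packets": []},
]

histogram = protocol_histogram(example_flows)
```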

Inside all flow objects, there is an array of the captured packets during that flow. Each packet is represented as a JSON object. The presence of attributes in packets may differ from one flow protocol to the other. The description of the packet object is in Table 3.

Table 3.

The description of the packet object attributes.

Attribute Name	Description
bytes	The size of the payload of the packet. A positive value means that the packet was sent in the forward direction; otherwise, the packet was sent in the backward direction
timestamp_start	The start timestamp of the captured packet
timestamp_end	The end timestamp of the captured packet
packets	The number of the captured packets during the capturing timestamp
ip_header_len	The length of IP header
tcp_header_len	The length of the TCP header
tcp_ack_number	The TCP acknowledgement number
tcp_flags	The TCP flags
tcp_seq_number	The TCP sequence number
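
Because the direction of a packet is encoded in the sign of `bytes`, simple per-direction statistics can be derived from a flow without any other field. The following is a minimal sketch of that idea, assuming `bytes` is stored as a number (or a numeric string); `directional_summary` is a hypothetical helper and not part of the dataset tooling.

```python
def directional_summary(flow):
    """Split a flow's packet entries into forward and backward totals.

    Following Table 3, a positive `bytes` value is treated as forward
    traffic and any other value as backward traffic.
    """
    summary = {"fwd_bytes": 0, "bwd_bytes": 0, "fwd_entries": 0, "bwd_entries": 0}
    for packet in flow.get("x_packets", []):
        size = int(packet["bytes"])
        if size > 0:
            summary["fwd_bytes"] += size
            summary["fwd_entries"] += 1
        else:
            summary["bwd_bytes"] += -size  # store the magnitude, not the sign
            summary["bwd_entries"] += 1
    return summary


# Made-up packet sizes, used only to show the sign convention.
example_flow = {
    "ip_proto": "tcp",
    "x_packets": [{"bytes": 120}, {"bytes": -60}, {"bytes": 40}, {"bytes": -1400}],
}
example_summary = directional_summary(example_flow)
```

Features of this kind (byte and packet counts per direction) are a common starting point for flow classifiers, since they do not depend on the encrypted payload.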

2. Experimental Design, Materials and Methods

In this section, we describe the environment used for establishing VPN connections and collecting VPN and non-VPN network flows (Section 2.1), and then we provide an overview of the data acquisition process (Section 2.2).

2.1. The Data Measurement Scheme

In Fig. 5, we show the environment used for the data acquisition. The scheme consists of five main components used for flow generation, VPN connection establishment, and capturing/filtering the network flows. The detailed overview of the roles and the specifications of each component is as follows:

• Virtual Machine 0 (VM0): An Ubuntu 20.04 LTS virtual machine that generates the web traffic and stores the captured flows. It receives the captured flows from the Probe via the Client MikroTik and saves them. It also sends traffic to and receives traffic from the Client MikroTik.
• Client MikroTik: A MikroTik RouterOS virtual machine that plays the role of the client in the VPN mode and links the Router and VM0. The VPN type is set manually in this VM.
• Server MikroTik: A MikroTik RouterOS virtual machine that plays the role of the server in the VPN mode and links the Router and the internet. The VPN type is set manually in this VM.
• Router: A physical router hosted in the university laboratory. It links the Client and the Server MikroTik virtual machines and sends the passing packets to the Probe.
• Probe: A physical computer that captures the mirrored traffic coming from the Router, converts the traffic into the IPFIX format, and uploads the IPFIX records to a data storage.

The MikroTik RouterOS already includes the configurations of all of the used VPNs. In the non-VPN setup, we disabled all of the VPN configurations and routed the traffic from VM0 directly through the Router.

The traffic captured by the Probe is preprocessed and filtered using ipFlowDetector, a program we wrote in C++ for efficiency. We then exported the resulting flows to JSON files and stored them on VM0. The JSON files were later anonymized by removing IP addresses and further filtered to remove broadcast flows.
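
The kind of post-processing described here can be illustrated with a short sketch. It is not the authors' pipeline: the field names `ip_src` and `ip_dst` are assumptions (they do not appear in the published flow attributes, which is consistent with the addresses having been removed), and the broadcast test only recognizes the IPv4 limited broadcast address.

```python
# Hypothetical names of the address fields present before anonymization.
IP_FIELDS = ("ip_src", "ip_dst")
LIMITED_BROADCAST = "255.255.255.255"


def is_broadcast(flow):
    """Return True if the flow is addressed to the IPv4 limited broadcast address."""
    return flow.get("ip_dst") == LIMITED_BROADCAST


def anonymize(flow):
    """Return a copy of the flow without its IP address fields."""
    return {key: value for key, value in flow.items() if key not in IP_FIELDS}


def clean(flows):
    """Drop broadcast flows first, then strip addresses from the remainder."""
    return [anonymize(flow) for flow in flows if not is_broadcast(flow)]


# Made-up input records, used only to exercise the two steps.
raw_flows = [
    {"ip_src": "10.0.0.2", "ip_dst": "93.184.216.34", "ip_proto": "tcp", "port_dst": 443},
    {"ip_src": "10.0.0.2", "ip_dst": "255.255.255.255", "ip_proto": "udp", "port_dst": 67},
]
cleaned = clean(raw_flows)
```

The order matters in this sketch: the broadcast check needs the destination address, so filtering has to happen before the addresses are removed.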

2.2. Traffic Generation

In our work, we divided the generated traffic into five types: streaming, non-streaming, mail, VoIP, and SSH (refer to Section 1 for each type description). The choice of this classification and the distribution of each type was mainly based on our intuition because there are few publications on the distribution of traffic types in the real world [14].

To automate the traffic generation process, we created shell and Python scripts. Each Python script automates one traffic type, while the shell script defines the order in which the ipFlowDetector program and the Python scripts are run across the different VPN and traffic types. The details of the automation of each traffic type are as follows:

• Non-streaming: The Selenium library and Google Chrome version 104 were used. We collected a list of 1022 website URLs that do not contain streaming content, such as Wikipedia and Pinterest. The script opens the websites sequentially, waits for each page to load, stays on the page for a short duration, and then moves to the next website.
• Streaming: The Selenium library for Python and Google Chrome version 104 were used. We collected a list of 105 streaming content URLs, mostly from YouTube; the rest are from Twitch, SoundCloud, and other streaming services. As in the non-streaming case, the script opens the 105 websites sequentially but stays on each of them for a longer time.
• VoIP: The Selenium library for Python and Google Chrome version 104 were used. We used Google Meet (voice and video), with a simulated camera on the side of VM0.
• Mail: We sent multiple emails using the redmail library and Outlook.
• SSH: We connected to a remote terminal and executed a list of commands multiple times using the spur library.
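
The browsing loop used for the non-streaming and streaming types can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the function names, the dwell times, and the error handling are assumptions. The driver is passed in as a parameter so that the loop itself does not depend on a browser being installed.

```python
import time


def browse_sequentially(driver, urls, dwell_seconds, sleep=time.sleep):
    """Open each URL in turn, stay on the page, then move on.

    `driver` is any object with a Selenium-style `get(url)` method. With the
    default page load strategy, Selenium's `get` returns once the page has
    finished loading.
    """
    visited = []
    for url in urls:
        try:
            driver.get(url)
        except Exception as error:  # e.g. a timeout or an unreachable site
            print(f"skipping {url}: {error}")
            continue
        sleep(dwell_seconds)  # short for non-streaming, longer for streaming
        visited.append(url)
    return visited


def run_with_chrome(urls, dwell_seconds):
    """Run the loop in a real Chrome session (requires Selenium and Chrome)."""
    from selenium import webdriver  # imported lazily; only needed here

    driver = webdriver.Chrome()
    try:
        return browse_sequentially(driver, urls, dwell_seconds)
    finally:
        driver.quit()
```

For example, `run_with_chrome(["https://www.wikipedia.org"], dwell_seconds=5)` would visit a single page for five seconds; the same loop with a longer `dwell_seconds` corresponds to the streaming case.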

In the VPN mode, ipFlowDetector captured the initial flows of each VPN connection establishment and saved them in initial-flows.json. The motivation for including these flows in the dataset is that OpenVPN handshakes have been used as a VPN fingerprinting method [4] and it can be useful for researchers to investigate other VPNs' handshakes. After establishing the VPN connection, we started capturing the flows of the five types of traffic.
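
The overall capture sequence can be summarized as a plan of (scenario, capture) pairs: five traffic captures without a VPN, and for each of the six VPN types one handshake capture followed by the five traffic captures. The sketch below only enumerates that plan; the labels and the ordering of scenarios are assumptions, and `capture_plan` is a hypothetical helper rather than the authors' shell script.

```python
VPN_TYPES = ["L2TP", "L2TP-IPSEC", "PPTP", "SSTP", "WireGuard", "OpenVPN"]
TRAFFIC_TYPES = ["non-streaming", "streaming", "mail", "voip", "ssh"]


def capture_plan(vpn_types=VPN_TYPES, traffic_types=TRAFFIC_TYPES):
    """Return the ordered list of (scenario, capture) pairs to record."""
    # Non-VPN scenario: traffic only, no handshake to capture.
    plan = [("non-vpn", traffic) for traffic in traffic_types]
    for vpn in vpn_types:
        # The handshake is captured first, while the tunnel is being established.
        plan.append((vpn, "initial-flows"))
        plan.extend((vpn, traffic) for traffic in traffic_types)
    return plan


plan = capture_plan()
```

The plan has 5 + 6 × (1 + 5) = 41 entries, which matches the number of JSON files implied by the folder structure: five files for non-VPN traffic and six files in each of the six VPN folders.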

Ethics Statements

Our work does not contain information retrieved from human subjects or based on animal experiments.

CRediT authorship contribution statement

Mohamed Naas: Data curation, Writing – review & editing. Jan Fesl: Conceptualization, Methodology, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to acknowledge the University of South Bohemia, Faculty of Science for providing sufficient equipment necessary for the dataset creation.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data Availability

USBVPN2022 (Original data) (Zenodo)

References

1. Brownlee N. Flow-based measurement: IPFIX development and deployment. IEICE Trans. Commun. 2011;94(8):2190–2198.
2. Iliyasu A.S., Deng H. Semi-supervised encrypted traffic classification with deep convolutional generative adversarial networks. IEEE Access. 2019;8:118–126.
3. Maonan W., et al. CENTIME: a direct comprehensive traffic features extraction for encrypted traffic classification. Proceedings of the 2021 IEEE 6th International Conference on Computer and Communication Systems (ICCCS). IEEE; 2021.
4. Xue D., et al. OpenVPN is open to VPN fingerprinting. Proceedings of the 31st USENIX Security Symposium (USENIX Security 22). 2022.
5. IETF. RFC 2661, Layer Two Tunneling Protocol “L2TP”. IETF; 1999. https://datatracker.ietf.org/doc/html/rfc2661 (Accessed October 2022).
6. PPTP, RouterOS, https://help.mikrotik.com/docs/display/ROS/PPTP (Accessed October 2022).
7. MS-SSTP, https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-sstp/70adc1df-c4fe-4b02-8872-f1d8b9ad806a (Accessed October 2022).
8. Donenfeld J.A. WireGuard: next generation kernel network tunnel. NDSS. 2017:1–12.
9. OpenVPN Cloud Knowledge Base, https://openvpn.net/cloud-docs/ (Accessed October 2022).
10. Draper-Gil G., et al. Characterization of encrypted and vpn traffic using time-related. Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP). 2016.
11. Afandi W., et al. Fingerprinting technique for YouTube videos identification in network traffic. IEEE Access. 2022;10:76731–76741.
12. ISO/IEC 7498-1:1994. Information technology – Open Systems Interconnection – Basic Reference Model: The Basic Model. ISO; 1994.
13. Microsoft Says Don't Use PPTP and MS-CHAP, http://www.h-online.com/security/news/item/Microsoft-says-don-t-use-PPTP-and-MS-CHAP-1672257.html (Accessed October 2022).
14. Schumann L., et al. Impact of evolving protocols and COVID-19 on internet traffic shares. arXiv preprint arXiv:2201.00142 (2022).
