PLoS One. 2022 Jan 24;17(1):e0262176. doi: 10.1371/journal.pone.0262176

Coverage-guided differential testing of TLS implementations based on syntax mutation

Yan Pan 1, Wei Lin 1, Yubo He 1, Yuefei Zhu 1,*
Editor: Licheng Wang
PMCID: PMC8786154  PMID: 35073360

Abstract

The Transport Layer Security (TLS) protocol is the most widely used security protocol in modern network communications. However, protocol vulnerabilities caused by flaws in the protocol design or in its implementation by programmers keep emerging. Meanwhile, different versions and implementations of the TLS protocol exhibit different behavioral characteristics. Researchers attempt to find the differences between protocol implementations through differential testing, which is conducive to discovering vulnerabilities. This paper provides a solution for finding such differences more efficiently by targeting the TLS handshake process. Differences observed across implementations during fuzzing, such as code coverage and response data, are used to guide the mutation of test cases, and the seeds are mutated based on the TLS protocol syntax. In addition, duplicate discrepancies are theoretically defined in order to investigate the root cause of discrepancies and to reduce the number of duplicate cases caused by the same reason, and the necessary conditions for excluding duplicate cases are further analyzed to develop the deduplication strategy. The proposed method is implemented on top of the open-source tools NEZHA and TLS-diff. Three widely used TLS protocol implementations, OpenSSL, BoringSSL, and LibreSSL, are taken as experimental targets. The experimental results show that the proposed method can effectively improve the ability to find differences between implementations: under the same test scale or the same time budget, the number of discrepancies increases by about 20% compared to TLS-diff, and the deduplication strategy proves effective.

Introduction

With the rapid development of computer networks, more and more applications are becoming network applications. As the carriers of network transmissions, network protocols occupy an important position in the entire network and play an essential role in ensuring secure communication between network devices. Logical errors in network protocol design and bugs introduced by programmers during implementation make protocol implementations fragile. As a widely used encryption protocol, TLS is fundamental to encrypted communication. Because the TLS protocol has various versions and implementations, different vulnerabilities are introduced by different implementations, such as the Heartbleed vulnerability of OpenSSL, the CCS (Change Cipher Spec) injection vulnerability, and the goto fail vulnerability of GnuTLS.

To analyze the vulnerabilities of TLS implementations, researchers have applied different analysis methods to different processes, such as source code analysis, fuzzing, and formal methods [1]. For standardized verification of protocol implementations [2, 3], Chaki et al. [4] combined software model checking with standard protocol security models to automatically analyze the authentication and confidentiality of protocols implemented in C. For the protocol state machine [5, 6], de Ruiter et al. [7] modeled state machines of TLS protocol implementations based on active learning and manually analyzed the generated state machines to find logical vulnerabilities. For data interaction, Somorovsky [8] developed an open-source framework called TLS-Attacker, which can fuzz the processing of data exchanged over the TLS protocol. Concerning certificate validation, Brubaker et al. [9] performed differential testing of the certificate validation process in various implementations.

Among the existing methods, differential testing originates from software regression testing. It tests different versions or implementations by feeding them semi-valid test cases and analyzing the differences between their behaviors. To address the problem of poor performance in test case generation, Petsios et al. [10] introduced the concept of δ diversity. They integrated differential testing with guided-testing ideas and developed the open-source platform NEZHA to test the consistency of behavior across multiple test programs. In addition, they heuristically proposed the concept of a path combination, which combines the execution paths of multiple implementations. They suggested that a test case is useful for finding output differences if it yields a new combination of paths, even if the path of an individual implementation has already been covered. TLS-diff [11] applied differential testing to the TLS handshake and proposed stimulating multiple TLS implementations with equivalent inputs via semi-randomly generated TLS protocol messages; implementation errors were then analyzed based on the differences in their responses. This method analyzed the interaction of the first packet of the handshake but lacked guidance.

For the TLS handshake process, the domain independence of NEZHA and the highly structured nature of the packets reduce the mutation efficiency. To detect differences in TLS implementations during the handshake more effectively, this paper proposes a coverage-guided differential testing method based on syntax mutation. Targeting the TLS handshake protocol, the method improves the applicability of the NEZHA-based algorithm and replaces the original random mutation with syntax-based mutation when generating test cases, thereby improving efficiency.

Meanwhile, the two methods above make no additional judgment about the reported differences; that is, several differences may be caused by the same factor. Therefore, “duplication” is defined in this paper based on the root cause of the alarm. Besides, the necessary conditions are discussed theoretically and implemented in the tool so that repetitive difference test cases can be eliminated to some extent.

The main contributions of this paper are summarized as follows:

  • A hybrid methodology composed of coverage guidance and syntax mutation for differential testing on the first interaction of TLS protocol is proposed.

  • The method of eliminating duplicate discrepancies based on code coverage is discussed.

  • To facilitate the work of other researchers, the modified tool is open-sourced and provided at https://gitee.com/z11panyan/CGDTSM.

The rest of this paper is organized as follows. The second section reviews related work; the third section describes the proposed method; the fourth section presents the experiments and their evaluation; the fifth section discusses the manual analysis of differential test cases; and the sixth section concludes the paper and outlines future work.

Related work

The related work is mainly presented from two aspects: fuzzing and differential testing.

Fuzzing

Fuzzing is currently a research hotspot for finding software vulnerabilities. Abnormal samples are generated and sent to the software under test for execution, so that deviations in program processing can be detected and the underlying vulnerabilities analyzed. Test case generation and the feedback-driven control strategy are the two key components. Fuzzing is usually divided into black-box, grey-box, and white-box testing according to the feedback data provided by the executing program. Black-box testing does not require any feedback data from the target program and pays more attention to mutation methods, such as input-structure-based mutation strategies [12, 13] and input-structure-based generation strategies built on deep learning [14–16]. White-box testing creates test cases from the internal logic of the program using dynamic symbolic execution and heuristic search algorithms to maximize coverage; SAGE [17] is a typical tool, but the prerequisites and complexity of white-box testing are relatively high. Grey-box testing mainly relies on code coverage (basic blocks, paths, functions) and data flow. Common tools such as AFL [18] and LibFuzzer [19] obtain code coverage through instrumentation (of source code or binaries). AFLNet [20] uses state feedback and coverage feedback to guide the mutation of seeds and treats message sequences as the fuzzing input to enable deep interaction with protocol implementations. The work in [21] focuses on data flow at runtime.

At USENIX Security 2015, de Ruiter et al. modeled, for the first time, state machines of TLS protocol implementations based on active learning. They also manually analyzed the generated state machines to find logical vulnerabilities. Besides, they improved the W-method equivalence query algorithm [22] based on the LearnLib framework [23], which reduces the number of equivalence queries and speeds up the construction of the state machine. With this method, the state machines of TLS protocol implementations can be built quickly and inspected manually to find incorrect state transitions or redundant states; by comparing them against the actual source code, bugs in the implementation can be located and fixed. Somorovsky proposed an open-source framework called TLS-Attacker for evaluating the security of TLS implementations, which modifies the source code of the TLS client so that all protocol fields become variable. At run time, a script can apply the specified fuzzing operations to mutate fields, create test cases, and test the TLS implementation.

Differential testing

Differential testing was first proposed by Evans [24] to analyze the differences between old and new versions of software. Since there are various implementations of TLS, Brubaker et al. introduced the idea of differential testing into the certificate verification process of SSL implementations. They collected 243,246 certificates over the network and generated a set of “frankencerts” by randomly changing and recombining certificate fields, which effectively detects differences in the certificate validation process across implementations. Similarly, focusing on certificate generation, Chen [25] diversified seed certificates by adapting Markov Chain Monte Carlo sampling, and Tian [26] assembled certificates based on the relevant RFCs (Requests for Comments); both improve traditional differential testing. HVLearn [27] analyzed hostname validation during certificate verification. SFADiff [28] combines automata inference with differential testing: it derives a symbolic finite automaton model by querying the target program in a black-box manner and checks for differences in the inferred models. Walz et al. introduced the black-box feedback idea of NEZHA into TLS-diff [29]. They proposed a response-distribution-guided strategy that selects seeds for mutation with a certain probability to generate new test cases. This method depends only on responses, in the spirit of NEZHA's output feedback, so it remains a black-box test.

In addition, differential testing has also been applied to other areas, such as DifFuzz for side-channel analysis [30], DLFuzz for deep learning systems [31], processing differences in malware recognition tools [32], and combination with symbolic execution to further broaden the observed differences [33]. In essence, differential testing and fuzzing create data automatically or semi-automatically as input to the programs and track deviations in program behavior. Therefore, structured mutation and a guided strategy are very important.

Methodology

Motivation

The SSL protocol includes two protocol sublayers. The bottom layer is the SSL record protocol layer; the upper layer is the SSL handshake protocol layer, which includes the SSL handshake protocol, the SSL change cipher spec protocol, and the SSL alert protocol. This paper focuses on the first communication of the SSL handshake protocol. After receiving a ClientHello packet, the server parses the payload according to the grammar and validates each attribute field. If these attribute fields comply with the standard, the handshake sub-protocol is invoked, with a value of 22 in the identification (content type) field of the response record; otherwise, the alert sub-protocol is invoked, with a value of 21 in the identification field.
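To make the packet layout concrete, the following sketch builds a bare-bones ClientHello and shows where the content type byte (22 for handshake, 21 for alert) sits in the record header. It is an illustrative Python snippet written for this description, not part of CGDTSM; the chosen versions and cipher suites are arbitrary defaults, not taken from the paper.

import os
import struct

def minimal_client_hello(version=(3, 3)):
    """Build a bare-bones TLS 1.2 ClientHello wrapped in a single record."""
    client_random = os.urandom(32)
    session_id = b""                                   # empty legacy session id
    cipher_suites = bytes.fromhex("c02f009c002f")      # three arbitrary suites
    compression = b"\x00"                              # null compression only
    body = (
        bytes(version) + client_random
        + bytes([len(session_id)]) + session_id
        + struct.pack("!H", len(cipher_suites)) + cipher_suites
        + bytes([len(compression)]) + compression
        + struct.pack("!H", 0)                         # no extensions
    )
    handshake = b"\x01" + struct.pack("!I", len(body))[1:] + body   # type 1 = ClientHello
    record = b"\x16" + bytes(version) + struct.pack("!H", len(handshake)) + handshake
    return record                                      # record[0] == 22: handshake content type

if __name__ == "__main__":
    print("content type of first record:", minimal_client_hello()[0])   # prints 22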

Based on the structure of ClientHello data packets, TLS-diff proposes the concept of generic message trees. A generic message tree is an ordered rooted tree in which each node represents a specific message field. A leaf node is an atomic message field (for example, an integer field) that can be converted directly from the original data, while an internal node is a compound message field whose content is obtained recursively from its child nodes. TLS-diff first converts the original data packet into a generic message tree according to certain conversion rules; it then mutates the nodes of the tree with eight types of mutation operations: O_void, O_rem, O_dupl, O_truncfuzz, O_intfuzz, O_contfuzz, O_appfuzz, and O_synfuzz. Finally, the mutated tree is converted back into a data packet through serialization. The tool generates many ClientHello packets with this mutation strategy and feeds them to different protocol implementations to obtain their responses.
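A toy sketch of the message-tree idea follows: leaves hold raw byte fields, internal nodes concatenate their children, and mutation operators rewrite subtrees before serialization. The operator names (O_rem, O_dupl, O_intfuzz) follow TLS-diff; the data structure itself is our own simplification, not the TLS-diff implementation.

import random

class Node:
    def __init__(self, name, value=b"", children=None):
        self.name = name
        self.value = value              # payload for leaf (atomic) fields
        self.children = children or []  # sub-fields for compound fields

    def serialize(self):
        if not self.children:
            return self.value
        return b"".join(child.serialize() for child in self.children)

def o_rem(node):
    # O_rem: remove a randomly chosen child field.
    if node.children:
        node.children.pop(random.randrange(len(node.children)))

def o_dupl(node):
    # O_dupl: duplicate a randomly chosen child field.
    if node.children:
        node.children.insert(random.randrange(len(node.children) + 1),
                             random.choice(node.children))

def o_intfuzz(node):
    # O_intfuzz: overwrite an integer leaf with a boundary value.
    if not node.children and node.value:
        width = len(node.value)
        node.value = random.choice([0, 1, (1 << 8 * width) - 1]).to_bytes(width, "big")

if __name__ == "__main__":
    hello = Node("client_hello", children=[
        Node("version", b"\x03\x03"),
        Node("cipher_suites", b"\xc0\x2f\x00\x9c"),
        Node("compression_methods", b"\x00"),
    ])
    target = random.choice([hello] + hello.children)
    random.choice([o_rem, o_dupl, o_intfuzz])(target)
    print(hello.serialize().hex())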

NEZHA incorporates the idea of coverage guidance to steer and optimize the creation of test cases during fuzzing. It focuses on differences in paths and outputs across implementations: for a specific test case, the execution paths and outputs of the individual test programs are combined into a path combination and an output combination. If a test case produces a previously unseen path combination or output combination, it is considered valuable for finding output differences; in this case, it is added to the corpus and used as raw material for subsequent mutation. The original implementation of NEZHA provides three strategies: path δ diversity (fine), which tracks combinations of execution paths; path δ diversity (coarse), which tracks combinations of execution path counts; and output δ diversity, which tracks output combinations. Meanwhile, if there is a discrepancy in the outputs of the tested applications, NEZHA adds the corresponding input to the global set of differences.
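A condensed sketch of this δ-diversity bookkeeping may help: a test case is retained as a seed if the tuple of per-implementation path fingerprints (path δ diversity) or the tuple of per-implementation outputs (output δ diversity) has not been seen before. The run_one callables are placeholders standing in for the execution of one instrumented TLS implementation; this is our own simplification, not NEZHA's actual code.

seen_path_tuples = set()
seen_output_tuples = set()

def is_interesting(testcase, implementations):
    """implementations: callables mapping a packet to (covered_blocks, output)."""
    paths, outputs = [], []
    for run_one in implementations:
        covered_blocks, output = run_one(testcase)   # e.g. (frozenset of edges, 21 or 0)
        paths.append(hash(covered_blocks))           # coarse per-program path fingerprint
        outputs.append(output)
    path_tuple, output_tuple = tuple(paths), tuple(outputs)
    new = (path_tuple not in seen_path_tuples) or (output_tuple not in seen_output_tuples)
    seen_path_tuples.add(path_tuple)
    seen_output_tuples.add(output_tuple)
    return new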

The above two methods share the same problem: if two differences caused by different reasons produce the same responses, such as the tuple <21, 21, 0>, one of them is discarded, leading to the loss of valid discrepancies. However, if discrepancies with the same response are not discarded but are all recorded as final discrepancies, their number, as in [29], can reach into the thousands; it then becomes difficult to judge their reliability, and manual analysis is expensive. In addition, differential testing of certificate verification faces the same problem [25, 26].

As for the generation of test cases, TLS-diff is a black box test based on grammatical mutation. While the generated test samples may closely match the grammar of ClientHello data packets, the mutation is relatively blind. The original intention of NEZHA is to conduct domain-independent guided testing, and the mutation strategy is random. However, there are still too many invalid use cases, hindering the improvement of efficiency.

Algorithm design

To address the above issues, this paper proposes a hybrid technique called CGDTSM (Coverage-guided Differential Testing with Syntax-based Mutation) that applies the syntax-based mutation strategy of TLS-diff to NEZHA. While this breaks the domain-independent character of NEZHA, it improves the effectiveness of NEZHA's mutations of TLS packets to a certain extent.

At the same time, to minimize the “duplication” of test cases as much as possible, the test-case deduplication strategy is discussed below. Table 1 lists the notation frequently used in what follows.

Table 1. Description of used notation.

Symbols     I, I            P, p        C, c          B, b            A, a
Meanings    Instructions    Programs    Test cases    Basic blocks    Alarm basic blocks

Firstly, the “duplicate difference” is defined on the basis of the following underlying definitions.

Definition 1. A set of commands consisting of assembly instructions is denoted as

I = {I_i | i = 1, …, n}.

Definition 2. Program p is an ordered set of instructions, denoted as

p = {I_i1, I_i2, …, I_it | I_it ∈ I}.

The set of programs under test is denoted as P = {p_i | i = 1, …, m}. These programs are various implementations of TLS.

Definition 3. A test case, denoted c, is a single input generated for a specific type of program. The set of test cases is denoted as C = {c_i | i = 1, …, k}.

Definition 4 (Instruction execution). Given a program p ∈ P and a test case c ∈ C, the execution of an instruction is defined as a mapping e: C × I × T → {true, false}. Since the program selectively executes different instructions at runtime, the symbol t is introduced to represent a particular execution process, and the set of all such processes is denoted as T. If the instruction I ∈ p is executed by a test case c, then e(c, I, t) = true; otherwise, e(c, I, t) = false.

Definition 5 (Basic block). For a given program p ∈ P, a set of basic blocks B_p = {b_i | i = 1, …, l} is uniquely determined, where a basic block b ∈ B_p is a sequence of statements with atomicity and can be represented as an ordered set of one or more instructions: b = {I_is, I_is+1, …, I_is+j | I_is+j ∈ I}.

When p = {I_i1, …, I_is-1, I_is, …, I_is+j, I_is+j+1, …, I_it | I_it ∈ I} is selected, the basic block is b = {I_is, I_is+1, …, I_is+j | I_is+j ∈ I} if and only if:

  • ∀c ∈ C, ∀t ∈ T: e(c, I_is, t) = e(c, I_is+1, t) = … = e(c, I_is+j, t)

  • ∃c ∈ C, ∃t ∈ T, s.t. e(c, I_is-1, t) ≠ e(c, I_is, t) and e(c, I_is+j, t) ≠ e(c, I_is+j+1, t)

By extending the above definitions, the mapping e can be lifted from the execution of instructions to the execution of basic blocks. The definition is as follows.

Definition 6. Given p ∈ P and c ∈ C, the mapping is e: C × B_p × T → {true, false}. If the basic block b ∈ B_p is executed, then e(c, b, t) = true; otherwise, e(c, b, t) = false.

Definition 7. Given p ∈ P and c ∈ C, the set of basic blocks covered in a single execution is denoted as B_p(c, t) = {b ∈ B_p | e(c, b, t) = true}, and the basic blocks covered by the test case c are denoted as B_p(c) = ∪_{t ∈ T} B_p(c, t). The space formed over C, P, and B_p(c) is denoted as B = {B_p(c) | p ∈ P, c ∈ C}.

Definition 8. Define a mapping f: C × P → B such that, for given p ∈ P and c ∈ C, f(p, c) = B_p(c).

The related definitions which are used in this paper are introduced below.

Definition 9 (Response output). In this paper, the set of programs under test P is composed of TLS implementations, and the output of a program under test is defined as a mapping out: C × P → {21, 0}. For p ∈ P and c ∈ C,

out(p, c) = 21 if the response is an Alert, and out(p, c) = 0 if the response is a Handshake.

The set of programs generating an alarm for the test case c ∈ C is denoted as P_a(c) = {p ∈ P | out(p, c) = 21}.

Definition 10 (Alarm basic block). Given p ∈ P, the set of alarm basic blocks A_p is uniquely determined; a ∈ A_p if and only if

  • ∀t ∈ T and ∀c ∈ C s.t. out(p, c) = 0: e(c, a, t) = false;

  • ∃c ∈ C, ∃t ∈ T s.t. out(p, c) = 21 and e(c, a, t) = true.

Such a block is called an alarm basic block.

Property 1. Given p ∈ P and c ∈ C: if out(p, c) = 21, then for all t1, t2 ∈ T and all a ∈ A_p, a ∈ B_p(c, t1) implies a ∈ B_p(c, t2); if out(p, c) = 0, then for all a ∈ A_p and all t ∈ T, a ∉ B_p(c, t).

During program execution, the basic blocks are executed in a specific order. This paper focuses on the first alarm basic block reached during a run.

Definition 11. Given p ∈ P and c ∈ C, the first alarm basic block is defined as a mapping h: B → A ∪ {0},

h(B_p(c)) = a if out(p, c) = 21, and h(B_p(c)) = 0 if out(p, c) = 0.

In this paper, the basic block a is regarded as the cause of the alarm. Based on Property 1, for given p ∈ P and c ∈ C and any t1, t2 ∈ T, it holds that h(B_p(c, t1)) = h(B_p(c, t2)) = h(B_p(c)).

Definition 12. Define a composite mapping g = h ∘ f, with g: C × P → A ∪ {0},

g(p, c) = a if the response is an Alert, and g(p, c) = 0 if the response is a Handshake.

Definition 13. Given p ∈ P, a relation on C is defined as

≈_g = {⟨c1, c2⟩ | c1, c2 ∈ C ∧ g(p, c1) = g(p, c2)}.

It is easy to prove that ≈_g is an equivalence relation, and its equivalence class is denoted as [c]_g. In other words, if the elements of an equivalence class generate an alarm, the cause of the alarm is the same. This paper discusses only alarm test cases; the test cases with a successful handshake are treated in the same way. Therefore, the elements within an equivalence class are called duplicate cases.

Theorem 1. Given p ∈ P, c1, c2 ∈ C, and t1, t2 ∈ T, if B_p(c1, t1) = B_p(c2, t2), then c1 ≈_g c2.

Proof. If B_p(c1, t1) = B_p(c2, t2), we have h(B_p(c1, t1)) = h(B_p(c2, t2)) from Definition 11. Further, by Property 1, h(B_p(c1)) = h(B_p(c2)). Then, based on Definitions 8 and 12, g(p, c1) = g(p, c2), which means c1 ≈_g c2.

According to Theorem 1, if the sets of basic blocks triggered by two test cases are the same, the two test cases are considered duplicates. When a test case is fed into multiple program implementations at the same time, the following definitions are added.

Definition 14. For a set of programs under test P_t ⊆ P, the test case c is a difference case if and only if ∃p_i, p_j ∈ P_t such that out(p_i, c) ≠ out(p_j, c). The set of difference cases is denoted as C_Pt = {c | ∃p_i, p_j ∈ P_t, out(p_i, c) ≠ out(p_j, c)}.

Theorem 2. Given a set of programs P_t, ≈_g is an equivalence relation on C_Pt.

Definition 15. For a set of tested programs P_t ⊆ P, c1, c2 ∈ C_Pt are repeated test cases for this set if and only if P_a(c1) = P_a(c2) and, for every p ∈ P_t, c1 ≈_g c2.

Theorem 3. For c1, c2 ∈ C_Pt and t1, t2 ∈ T, if P_a(c1) = P_a(c2) and, for every p ∈ P_a(c1), B_p(c1, t1) = B_p(c2, t2), then c1 and c2 are repeated test cases for this set.

Theorem 3 is easy to prove. According to Theorem 3, the following deduplication method is formulated in this paper: for each difference case, obtain the code coverage data of the programs responding with an alert and calculate its hash value. If the hash matches an entry in the existing hash library, the case is considered to satisfy the condition of Theorem 3; in other words, a difference case already in the library duplicates this test case, and it is not added to the library. Otherwise, it is placed in the library.
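The rule can be sketched as follows: hash the coverage of the implementations that responded with an alert and keep the discrepancy only if the hash has not been seen before. The dictionary of per-implementation block sets below stands in for the SanitizerCoverage bitmaps the tool actually uses; this is an illustration of the rule, not the tool's implementation.

import hashlib

seen_coverage_hashes = set()

def is_new_discrepancy(alerting_coverage):
    """alerting_coverage maps each implementation that raised an alert to its covered blocks."""
    digest = hashlib.sha1()
    for impl in sorted(alerting_coverage):               # fixed order -> stable fingerprint
        digest.update(impl.encode())
        for block in sorted(alerting_coverage[impl]):
            digest.update(block.to_bytes(8, "big"))
    fingerprint = digest.hexdigest()
    if fingerprint in seen_coverage_hashes:
        return False                                     # same root cause already recorded
    seen_coverage_hashes.add(fingerprint)
    return True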

Of course, the above definitions can be extended to any validation program, that is, a program that validates its input. A basic block in which an input is judged invalid is regarded as an alarm basic block with the same properties. On this basis, this paper restates the δ diversity of NEZHA in terms of the above definitions.

For selected programs p1, p2 ∈ P_t, let the alarm basic block sets be A_p1 = {a11, a12, …, a1u} and A_p2 = {a21, a22, …, a2v}, and let c1, c2 ∈ C be test cases such that e(c1, a11, t) = true, e(c1, a21, t) = true, e(c2, a12, t) = true, and e(c2, a22, t) = true. A traditional coverage-guided strategy marks both c1 and c2 as meaningful test cases. If there is a c3 ∈ C such that e(c3, a11, t) = true and e(c3, a22, t) = true, the traditional coverage-guided strategy does not mark it, whereas NEZHA's strategy treats the combination (a11, a22) produced by test case c3 as new basic block coverage, which has specific value for producing new output differences or anomalies.

Combined with the above definitions, the CGDTSM algorithm proposed in this paper is shown in Algorithm 1. C is the set of test cases, and P is the set of protocol implementations. Lines 3–4 randomly select a test case from the set and mutate it based on the syntax to generate a new test case; lines 5–7 send the test case to each program and record each program's execution path and response; lines 8–10 indicate that, if a new pattern is generated, the test case is added to set C, with the number of new features used as the weight of the test case. Lines 11–17 eliminate repeated cases according to the deduplication method above and record only the discrepancies that remain after deduplication. Lines 4, 12, and 13 are the key new enhancements on top of NEZHA.

Algorithm 1 DiffTest: Report all discrepancies across applications P after n generations, starting from a corpus C.

Input: The initial set of test cases C; The set of programs under test P; The number of execution n;

Output: The set of difference cases diff;

1: procedure DiffTest(C, P, n)

2:  while generation ≤ n do

3:   input = RandomChoice(C)

4:   testcase = MutateOnSyntax(input)

5:   for app ∈ P do

6:    path, outputs = RUN(app, testcase)

7:   end for

8:   if NewPattern(testcase) then

9:    C = C ∪ testcase

10:   end if

11:   if IsDiscrepancy(testcase) then

12:    hash = Hash(GetBitmap())

13:    if Hashmap.insert(hash, true) then

14:     diff = diff ∪ testcase

15:     C = C ∪ testcase

16:    end if

17:   end if

18:  end while
19:  return diff
20: end procedure
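For readers who prefer an executable form, the control flow of Algorithm 1 can be sketched in Python as follows. The callables mutate, run_implementation, is_new_pattern, and coverage_hash are placeholders for the TLS-diff mutator, the instrumented TLS libraries, the δ-diversity bookkeeping, and the coverage hashing described above; only the loop structure mirrors the algorithm.

import random

def diff_test(corpus, implementations, generations,
              mutate, run_implementation, is_new_pattern, coverage_hash):
    diff, seen_hashes = [], set()
    for _ in range(generations):
        testcase = mutate(random.choice(corpus))          # lines 3-4
        results = [run_implementation(impl, testcase)     # lines 5-7: (path, output) pairs
                   for impl in implementations]
        if is_new_pattern(results):                       # lines 8-10
            corpus.append(testcase)
        outputs = [output for _, output in results]
        if len(set(outputs)) > 1:                         # line 11: outputs disagree
            fingerprint = coverage_hash(results)          # line 12
            if fingerprint not in seen_hashes:            # line 13
                seen_hashes.add(fingerprint)
                diff.append(testcase)                     # lines 14-15
                corpus.append(testcase)
    return diff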

System design

The system design is shown in Fig 1. Based on the NEZHA framework (https://github.com/nezha-dt/nezha/), the shaded parts of the diagram have been modified; they correspond to lines 4 and 12 of Algorithm 1, respectively. The mutation component invokes the syntax-based mutation strategy of TLS-diff. The feedback focuses on the outputs and the coverage of basic blocks; if either condition is met, the test case is added to the seed library.

Fig 1. Integrated framework of the approach.


The two shaded modules are the improvements made to NEZHA.

To adapt to the latest SSL source code, the project and the SSL libraries are compiled with Clang 9 and instrumented using SanitizerCoverage.

Evaluation

Experimental design

Three of the most commonly used SSL implementations are used in the experiments: BoringSSL, LibreSSL, and OpenSSL. The versions used are BoringSSL 2883 and master (3743aaf), LibreSSL 2.4.0 and 3.2.1, and OpenSSL 1.0.2 (12ad22d), OpenSSL 1.1.0 (a3b54f0), and OpenSSL 3.0.0-alpha7-dev (b0614f0). The earlier versions are those used in [10, 11], and the later ones are the latest versions. NEZHA and TLS-diff are used as baselines against which the CGDTSM method is compared.

The abstract function of the response packet used by TLS-diff evaluates the difference in response data:

R1(t) = 0 if t is a ServerHello, 21 otherwise.    (1)

When the response packet is a ServerHello, the abstract function returns 0; otherwise, it returns 21.
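A small sketch of this abstraction: read the content type of the first response record and, if it is a handshake record carrying a ServerHello, return 0, otherwise 21. The byte offsets assume a plain TLS record; this is an illustration of Eq (1), not TLS-diff's implementation.

HANDSHAKE, ALERT = 22, 21
SERVER_HELLO = 2

def r1(response):
    # Record layout: [0] content type, [1:3] version, [3:5] length, [5] handshake msg type.
    if len(response) >= 6 and response[0] == HANDSHAKE and response[5] == SERVER_HELLO:
        return 0        # handshake record carrying a ServerHello
    return 21           # alerts, malformed data, or empty responses

assert r1(bytes([21, 3, 3, 0, 2, 2, 40])) == 21      # an alert record maps to 21
assert r1(bytes([22, 3, 3, 0, 4, 2, 0, 0, 0])) == 0  # a ServerHello record maps to 0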

Because TLS-diff is a pure black-box analysis of responses, it cannot record test case coverage data, so the tool itself cannot apply the method of judging discrepancies proposed in this paper. Hence, an appropriate number of test cases was generated with TLS-diff and supplied as the seeds of CGDTSM. Since the number of seeds exceeds the maximum number of test cases, the mutation strategy is never invoked, and the result is the number of discrepancies produced by the test cases generated by TLS-diff. The original seeds are those in [10].

The maximum number of test cases is set to 80,000 and the maximum time to 1000 seconds; under these conditions the results tend to plateau. Five runs are performed for each method, and the average over the five runs is taken for comparison.

Results

To verify the performance of the algorithm, three groups of controlled experiments are designed in this article:

  • Three implementations of the latest version were taken, which are hereafter referred to as the new version;

  • Three implementations of the version in the original article were taken, which are hereafter referred to as the old version;

  • Three different versions of OpenSSL were taken, which are hereafter referred to as the “SSL-3” version.

It is worth noting that the experiments are based on the same deduplication strategy, although there is still room for improvement on the deduplication strategy. The experimental results are analyzed as follows.

  • Q1: How does CGDTSM compare to NEZHA and TLS-diff?

Table 2 shows the comparison results of the three methods on the new and old versions. The third row is the average number of discrepancies over five runs of 1000 seconds each. Compared to TLS-diff, the number of detected discrepancies increases by about 32% for the new version and about 10% for the old version. Fig 2 shows the trend of the number of discrepancies over time under the same conditions. At the beginning of testing, TLS-diff shows a better growth trend; however, thanks to the coverage guidance, the proposed method eventually exceeds TLS-diff.

Table 2. The average discrepancies of the old and new versions based on the three methods.

Version                  Old                             New
Strategy                 NEZHA    TLS-diff    CGDTSM     NEZHA    TLS-diff    CGDTSM
Average discrepancies    16.0     119.8       132.2      57.4     188.2       248.2

Fig 2. Comparison of three algorithms based on the “new” version.


Dashed line: NEZHA; dotted line: TLS-diff; solid line: CGDTSM. The three lines represent the response discrepancies obtained by the three methods within 1000 seconds. The experimental objects are OpenSSL 3.0.0, LibreSSL 3.2.1, and BoringSSL (3743aaf).

Through manual analysis of the discrepancies, it turned out that many repetitive test cases remain. Tables 3 and 4 present the results after a coarse manual analysis. Table 3 presents a horizontal comparison of the above experiments: for the three experiments “new”, “old”, and “SSL-3”, we find 18, 13, and 7 unique discrepancies, respectively. Table 4 gives the detailed number of discrepancies obtained by testing the “new” version. In the first three columns of the table, 21 means that the implementation raises an alert for a test case, and 0 means the ClientHello packet is valid and negotiation can proceed. The first row means that there is a difference test case for which OpenSSL 3.0.0 and LibreSSL 3.2.1 respond with a handshake packet while BoringSSL-master responds with an alert. The meaning of the other rows can be inferred similarly.

Table 3. The total number of discrepancies for the three experiments.

Version                                                Number
OpenSSL 3.0.0 vs LibreSSL 3.2.1 vs BoringSSL-master    18
OpenSSL 1.0.2 vs LibreSSL 2.4.0 vs BoringSSL 2883      13
OpenSSL 3.0.0 vs OpenSSL 1.1.0 vs OpenSSL 1.0.2        7

Table 4. The detailed number of discrepancies for the “new” version.

OpenSSL3.0.0 LibreSSL3.2.1 BoringSSL-master Number
0 0 21 1
0 21 21 1
0 21 0 3
21 0 0 6
21 0 21 2
21 21 0 5

In summary, syntax-based mutation is more appropriate than random mutation and provides higher coverage for the same number of test cases. Compared to syntax-based but unguided mutation, the proposed method finds the same number of discrepancies faster. Compared to both baselines, there is a clear improvement in the ability to find differences.

  • Q2: How effective is the added judgment clause in reducing the duplication of discrepancies?

Since NEZHA and CGDTSM use feedback, test-case deduplication affects the feedback itself. To keep a single-factor comparison, the experiment adopts the TLS-diff method, and the implementations of the new and old versions are tested separately. In addition, the impact of the same judgment condition on removing duplicate discrepancies is analyzed. As the number of test cases increases, the growth of the number of discrepancies for the new and old versions under the deduplication and original strategies is shown in Fig 3. Red and blue represent deduplication and the original strategy, respectively; the dashed and solid lines represent the implementations of the old and new versions, respectively. The number of response discrepancies after deduplication is significantly lower than without deduplication, and its growth is slower: with an increasing number of test cases, more and more duplicate discrepancies appear. The detailed numbers with and without the deduplication strategy are shown in Table 5. Without deduplication, i.e., with the strategy used in TLS-diff, the number of discrepancies produced on the new version is 447.2, which drops to about 85.4 after deduplication; the figure in parentheses (81%) is the reduction between the two strategies, i.e., 1 − 85.4/447.2 ≈ 81% of the repeated cases are removed. Similarly, the number of discrepancies produced on the old version is 936.8 and drops to 63.6 after deduplication, indicating that about 93% of the repeated cases are removed. Thus, the deduplication strategy has an obvious effect and can further reduce the cost of manual analysis.

Fig 3. TLS-diff based deduplication versus no-duplication comparison.


The dashed lines and the solid lines represent the implementations of the old and new versions, respectively. The red lines show the number of response discrepancies after deduplication, which is lower than the number without deduplication (blue lines).

Table 5. The effect of the deduplication strategy on old and new versions.

Version                  Old                      New
Strategy                 Origin    Dedup          Origin    Dedup
Average discrepancies    936.8     63.6 (93%)     447.2     85.4 (81%)

Investigation of some discrepancies

The various use cases generated by the tool were analyzed manually. It was found that the differences are mostly caused by different implementations parsing cases that the RFC specifications do not define clearly. An analysis of the three latest implementations is presented as follows.

OpenSSL

OpenSSL's parsing of the Supported Point Formats extension from RFC 4492 is incompatible with BoringSSL and LibreSSL. The definition of the ECPointFormat structure in RFC 4492 is shown in Table 6; it contains three types of ECPointFormat, of which uncompressed is the type that must be supported. In other words, this type should be present whenever the ec_point_formats extension exists in the ClientHello packet. The lower half of Table 6 shows the OpenSSL source code for handling the ECPointFormat extension. The implementations of BoringSSL and LibreSSL check this condition, but OpenSSL does not validate whether a ClientHello packet omits the uncompressed type, which causes the difference.

Table 6. Code for validating the ECPointFormat.

enum {
    uncompressed (0), ansiX962_compressed_prime (1),
    ansiX962_compressed_char2 (2), reserved (248..255)
} ECPointFormat;

struct {
    ECPointFormat ec_point_format_list <1..2^8-1>
} ECPointFormatList;

int tls_parse_ctos_ec_pt_formats()
{
    if (!PACKET_as_length_prefixed_1(pkt, &ec_point_format_list)
            || PACKET_remaining(&ec_point_format_list) == 0) {
        SSLfatal(s, SSL_AD_DECODE_ERROR, SSL_F_TLS_PARSE_CTOS_EC_PT_FORMATS, XXX);
        return 0;
    }
}

LibreSSL

The original maximum length of an SSL record chunk is 2^14 bytes. Due to the bandwidth limitations of Internet of Things devices, the chunk length needs to be adjusted in certain situations. The extension for negotiating the maximum fragment length is defined in RFC 6066. The client should include the max_fragment_length extension in the ClientHello to negotiate this limit, using one of the following values: 2^9 (1), 2^10 (2), 2^11 (3), 2^12 (4), (255). The server can respond to this, and if the requested value is outside the specified range, it must send an “illegal_parameter” alert. In its latest implementation, LibreSSL still does not support this extension, while OpenSSL and BoringSSL do. In the experiments, if a test case contains an out-of-range value in this extension, OpenSSL and BoringSSL raise alerts, but LibreSSL does not validate the extension, resulting in the inconsistency.
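A sketch of the check that the RFC requires (and that LibreSSL omits, according to the observation above): the extension body is a single byte that must be one of the codes 1 to 4; anything else should trigger an illegal_parameter alert (alert code 47). This is written from RFC 6066 rather than from any of the three code bases.

ILLEGAL_PARAMETER = 47
VALID_CODES = {1: 2**9, 2: 2**10, 3: 2**11, 4: 2**12}

def negotiate_max_fragment_length(extension_body):
    """Return (fragment_limit, None) on success or (None, alert_code) on failure."""
    if len(extension_body) != 1 or extension_body[0] not in VALID_CODES:
        return None, ILLEGAL_PARAMETER          # e.g. the out-of-range values in the test cases
    return VALID_CODES[extension_body[0]], None

assert negotiate_max_fragment_length(b"\x04") == (4096, None)
assert negotiate_max_fragment_length(b"\xff") == (None, 47)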

BoringSSL

To negotiate a Secure Real-time Transport Protocol (SRTP) security policy, RFC 5764 specifies that the client should include the use_srtp extension and that the server should select one of the profiles offered by the client for subsequent interactions. If the server cannot find a profile usable by the client, it should return an appropriate DTLS alert. As shown in Table 7, if the list of profiles provided in the test case is empty, BoringSSL fails to make a decision but does not raise an alert, whereas OpenSSL and LibreSSL both make the judgment and raise alerts.

Table 7. Code for parsing negotiates srtp.

static bool ext_srtp_parse_clienthello() {
    const STACK_OF() *server_profiles = SSL_get_srtp_profiles(ssl);
    for (const *server_profile : server_profiles) {
        while (CBS_len(&profile_ids_tmp) > 0) {
            ……
        }
    }
    return true;
}

In addition, different implementations make inconsistent judgments about the content fields when parsing some extensions. For example, when parsing the renegotiation extension, OpenSSL only parses the length bytes, while LibreSSL further checks the subsequent bytes. OpenSSL performs a preliminary content analysis of the Online Certificate Status Protocol (OCSP) extension and reports an error if the RFC requirements are not met, while BoringSSL only reads the status_type field and makes no judgment about the remaining content.

Conclusion

This paper proposes a coverage-guided differential testing method for TLS based on syntax mutation, which focuses on the handling of ClientHello packets by TLS implementations and combines the ideas of syntax mutation and guided testing. Regarding the problem that several discrepancies may be due to the same root cause, duplicate test cases are formally defined, and the necessary conditions for duplicate discrepancies are given. Accordingly, a deduplication strategy is formulated, and the rationale of differential guidance in the general setting is analyzed, which is expected to extend to other differential testing experiments, such as certificate verification.

Using OpenSSL, LibreSSL, and BoringSSL in the experiments, the proposed method is compared with the NEZHA and TLS-diff tools to verify the effectiveness of the hybrid method in finding discrepancies. Meanwhile, the adopted deduplication strategy effectively eliminates about 87% of repeated discrepancies, reducing the cost of manual analysis.

However, manual analysis is still required to figure out the root cause of each difference. For the targets assessed here, a difference in code coverage is only a necessary condition for difference cases having distinct root causes, and this necessary condition is weak; therefore, a certain number of repeated cases remain in the output. To further remove duplicate discrepancies, various code coverage tools, such as gcov (the code coverage tool of GCC), can be used to eliminate them as much as possible. In addition, investigating necessary and sufficient conditions for duplicate discrepancies and automatically locating the code that causes the difference in output will be addressed in future work. This work can also be extended to compare the differences between patches so that the causes of vulnerabilities can be analyzed.

Beyond that, the current guidance only extracts meaningful use cases and uses them as seeds for the next mutation. The mutation strategy could be further customized by analyzing the positions of mutated fields and the effect of each mutation operation on coverage. The structure-based differential mutation method used in this paper can be extended to other protocols and applications, such as the DTLS protocol, to analyze the differences between other applications; we are working on this extension of the tool. Besides, as discussed in [11], fully interactive differential testing will be the most interesting direction for future work.

Supporting information

S1 File

(TXT)

S1 Code

(ZIP)

Data Availability

All source code files are available from https://gitee.com/z11panyan/CGDTSM.

Funding Statement

This work is supported by National Key Research and Development Project of China (2019QY1300). We declare that we have no financial and personal relationships with other people or organizations that can inappropriately influence our work, there is no professional or other personal interest of any nature or kind in any product, service or company that could be construed as influencing the position presented in, or the review of, the manuscript.

References

  • 1. Wen S, Meng Q, Feng C, Tang C. Protocol vulnerability detection based on network traffic analysis and binary reverse engineering. PLoS ONE. 2017;12(10):e0186188. doi: 10.1371/journal.pone.0186188
  • 2. Aizatulin M, Gordon AD, Jürjens J. Extracting and verifying cryptographic models from C protocol code by symbolic execution. In: Proceedings of the 18th ACM Conference on Computer and Communications Security; 2011. p. 331–340.
  • 3. Hoque E, Chowdhury O, Chau SY, Nita-Rotaru C, Li N. Analyzing operational behavior of stateful protocol implementations for detecting semantic bugs. In: 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE; 2017. p. 627–638.
  • 4. Chaki S, Datta A. ASPIER: An automated framework for verifying security protocol implementations. In: 2009 22nd IEEE Computer Security Foundations Symposium. IEEE; 2009. p. 172–185.
  • 5. Yadav T, Sadhukhan K. Identification of Bugs and Vulnerabilities in TLS Implementation for Windows Operating System Using State Machine Learning. In: International Symposium on Security in Computing and Communication. Springer; 2018. p. 348–362.
  • 6. Fiterau-Brostean P, Jonsson B, Merget R, de Ruiter J, Sagonas K, Somorovsky J. Analysis of DTLS Implementations Using Protocol State Fuzzing. In: 29th USENIX Security Symposium (USENIX Security 20); 2020. p. 2523–2540.
  • 7. De Ruiter J, Poll E. Protocol State Fuzzing of TLS Implementations. In: 24th USENIX Security Symposium (USENIX Security 15); 2015. p. 193–206.
  • 8. Somorovsky J. Systematic fuzzing and testing of TLS libraries. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security; 2016. p. 1492–1504.
  • 9. Brubaker C, Jana S, Ray B, Khurshid S, Shmatikov V. Using frankencerts for automated adversarial testing of certificate validation in SSL/TLS implementations. In: 2014 IEEE Symposium on Security and Privacy. IEEE; 2014. p. 114–129.
  • 10. Petsios T, Tang A, Stolfo S, Keromytis AD, Jana S. NEZHA: Efficient domain-independent differential testing. In: 2017 IEEE Symposium on Security and Privacy (SP). IEEE; 2017. p. 615–632.
  • 11. Walz A, Sikora A. Exploiting Dissent: Towards Fuzzing-Based Differential Black-Box Testing of TLS Implementations. IEEE Transactions on Dependable and Secure Computing. 2017;17(2):278–291. doi: 10.1109/TDSC.2017.2763947
  • 12. Liu X, Cui B, Fu J, Ma J. HFuzz: Towards automatic fuzzing testing of NB-IoT core network protocols implementations. Future Generation Computer Systems. 2020;108:390–400. doi: 10.1016/j.future.2019.12.032
  • 13. Luo Z, Zuo F, Jiang Y, Gao J, Jiao X, Sun J. Polar: Function code aware fuzz testing of ICS protocol. ACM Transactions on Embedded Computing Systems (TECS). 2019;18(5s):1–22. doi: 10.1145/3358227
  • 14. Godefroid P, Peleg H, Singh R. Learn&Fuzz: Machine learning for input fuzzing. In: 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE; 2017. p. 50–59.
  • 15. Jero S, Pacheco ML, Goldwasser D, Nita-Rotaru C. Leveraging textual specifications for grammar-based fuzzing of network protocols. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33; 2019. p. 9478–9483.
  • 16. Hu Z, Shi J, Huang Y, Xiong J, Bu X. GANFuzz: A GAN-based industrial network protocol fuzzing framework. In: Proceedings of the 15th ACM International Conference on Computing Frontiers; 2018. p. 138–145.
  • 17. Godefroid P, Levin MY, Molnar D. SAGE: Whitebox fuzzing for security testing. Communications of the ACM. 2012;55(3):40–44. doi: 10.1145/2093548.2093564
  • 18. Zalewski M. American fuzzy lop (AFL). 2017. URL: http://lcamtuf.coredump.cx/afl.
  • 19. Serebryany K. libFuzzer – a library for coverage-guided fuzz testing. LLVM project; 2015.
  • 20. Pham VT, Böhme M, Roychoudhury A. AFLNet: A greybox fuzzer for network protocols. In: 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST). IEEE; 2020. p. 460–465.
  • 21. Choi YH, Park MW, Eom JH, Chung TM. Dynamic binary analyzer for scanning vulnerabilities with taint analysis. Multimedia Tools and Applications. 2015;74(7):2301–2320. doi: 10.1007/s11042-014-1922-5
  • 22. Chow TS. Testing software design modeled by finite-state machines. IEEE Transactions on Software Engineering. 1978;(3):178–187. doi: 10.1109/TSE.1978.231496
  • 23. Isberner M, Steffen B, Howar F. LearnLib Tutorial. In: Runtime Verification. Springer; 2015. p. 358–377.
  • 24. Evans RB, Savoia A. Differential testing: A new approach to change detection. In: The 6th Joint Meeting on European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering: Companion Papers; 2007. p. 549–552.
  • 25. Chen Y, Su Z. Guided differential testing of certificate validation in SSL/TLS implementations. In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering; 2015. p. 793–804.
  • 26. Tian C, Chen C, Duan Z, Zhao L. Differential testing of certificate validation in SSL/TLS implementations: An RFC-guided approach. ACM Transactions on Software Engineering and Methodology (TOSEM). 2019;28(4):1–37. doi: 10.1145/3355048
  • 27. Sivakorn S, Argyros G, Pei K, Keromytis AD, Jana S. HVLearn: Automated black-box analysis of hostname verification in SSL/TLS implementations. In: 2017 IEEE Symposium on Security and Privacy (SP). IEEE; 2017. p. 521–538.
  • 28. Argyros G, Stais I, Jana S, Keromytis AD, Kiayias A. SFADiff: Automated evasion attacks and fingerprinting using black-box differential automata learning. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security; 2016. p. 1690–1701.
  • 29. Walz A, Sikora A. Maximizing and leveraging behavioral discrepancies in TLS implementations using response-guided differential fuzzing. In: 2018 International Carnahan Conference on Security Technology (ICCST). IEEE; 2018. p. 1–5.
  • 30. Nilizadeh S, Noller Y, Pasareanu CS. DifFuzz: Differential fuzzing for side-channel analysis. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE; 2019. p. 176–187.
  • 31. Guo J, Jiang Y, Zhao Y, Chen Q, Sun J. DLFuzz: Differential fuzzing testing of deep learning systems. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering; 2018. p. 739–743.
  • 32. Xu W, Qi Y, Evans D. Automatically Evading Classifiers: A Case Study on PDF Malware Classifiers. In: NDSS; 2016.
  • 33. Noller Y, Păsăreanu CS, Böhme M, Sun Y, Nguyen HL, Grunske L. HyDiff: Hybrid differential software analysis. In: 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). IEEE; 2020. p. 1273–1285.

Decision Letter 0

Licheng Wang

1 Nov 2021

PONE-D-21-31321
Coverage-guided differential testing of TLS implementations based on syntax mutation
PLOS ONE

Dear Dr. Zhu,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 16 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Licheng Wang

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2.  Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

********** 

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

********** 

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

********** 

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

********** 

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper offers a solution to efficiently find differences in the TLS protocol handshake process.

The experimental results proved the method is effective.

However, there are some editorial errors, such as

P2 99:

“243,246 certificates were collected over the network, and the set of the Frankencerts certificate 100 was generated by randomly changing the fields in the certificate. ”

In addition, figures should locate above the caption. It took me a lot of effort to find them.

Reviewer #2: In this paper, the authors propose a hybrid methodology to find differences more efficiently by targeting the TLS protocol handshake process, and then help to discovering the vulnerabilities. In addition, they explore the duplicate discrepancies theoretically and give a method of eliminating duplicate discrepancies based on code. By experiment testing on three types of widely used TLS protocol implementations (OpenSSL, BoringSSL, and LibreSSL), it shows that their work can effectively improve the ability to find differences between different implementations. Thus I suggest accept it after some minor modifies.

1. In the article, there are many abbreviations which make reading difficult, so I suggest the authors give an explanation when they first appear, or list a table to explain them.

2. The presentation quality is to be polished. For example,

--line 18 in the Abstract, “the proposed method can effectively improve finding differences between…”: improve ability or speed ??

--line 41-42 on page 2, “the mutation…due to the….and due to the… ”: please delete the second “due to”.

--line 176 on page 5, “First” should be “Firstly”.

--line 206 on page 6, “The related definitions used in this article are introduced below” should be “The related definitions which are used in this…..”

--line 230:"which is are triggered", line 237: "c1,c2 is a ...??".

--line331-332, “there are 1..cases.. ”

--line 337-338, "Due to the ...cased ..use of ???"

Or I suggest the author to find a professional organization to polish English.

********** 

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Jan 24;17(1):e0262176. doi: 10.1371/journal.pone.0262176.r002

Author response to Decision Letter 0


7 Nov 2021

Dear Editors and Reviewers:

On behalf of my co-authors, we appreciate the editors and reviewers for their constructive comments and suggestions on our manuscript entitled “Coverage-guided differential testing of TLS implementations based on syntax mutation” (PONE-D-21-31321). We tried our best to revise our manuscript according to the comments and described the changes in the attachments. We look forward to your response.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Licheng Wang

19 Dec 2021

Coverage-guided differential testing of TLS implementations based on syntax mutation

PONE-D-21-31321R1

Dear Dr. Zhu,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Licheng Wang

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Based on two round reviewing process, including your responses and revising efforts, we would like to tell you that this paper can be accepted for publication.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The experimental results show that the proposed method can effectively improve the ability to find differences between different implementations. However, the illustration of Table 4 made me confused. I hope it will be improved before publishing.

Reviewer #2: In this paper, the authors proposed a method which can effectively improve the ability to find differences between different implementations. And by experiment testing on three types of widely used TLS protocol implementations (OpenSSL, BoringSSL, and LibreSSL), their experimental results also support the ability to find differences between different implementations.

In the revised paper, the authors made a good modification according to the comments of the reviewer, so I would like to recommend acceptance.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Acceptance letter

Licheng Wang

11 Jan 2022

PONE-D-21-31321R1

Coverage-guided differential testing of TLS implementations based on syntax mutation

Dear Dr. Zhu:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Licheng Wang

Academic Editor

PLOS ONE
