Risk Analysis. 2025 Jun 22;45(12):4469–4489. doi: 10.1111/risa.70059

Risks of ignoring uncertainty propagation in AI‐augmented security pipelines

Emanuele Mezzi 1, Aurora Papotti 1, Fabio Massacci 1,2, Katja Tuma 3
PMCID: PMC12747714  PMID: 40545447

Abstract

The use of AI technologies is being integrated into the secure development of software‐based systems, with an increasing trend of composing AI‐based subsystems (with uncertain levels of performance) into automated pipelines. This presents a fundamental research challenge and seriously threatens safety‐critical domains. Despite the existing knowledge about uncertainty in risk analysis, no previous work has estimated the uncertainty of AI‐augmented systems given the propagation of errors in the pipeline. We provide the formal underpinnings for capturing uncertainty propagation, develop a simulator to quantify uncertainty, and evaluate the simulation of propagating errors with one case study. We discuss the generalizability of our approach and its limitations and present recommendations for evaluation policies concerning AI systems. Future work includes extending the approach by relaxing the remaining assumptions and by experimenting with a real system.

Keywords: artificial intelligence, automatic program repair, uncertainty quantification

1. INTRODUCTION

Due to the increasing availability of data, AI technologies have spread and are being used in almost every computing system, including in safety‐critical domains (Perez‐Cerrolaza et al., 2024). Although the use of AI‐augmented systems comes with new promises of improved performance, it also introduces significant risks and challenges (Cox, 2020; Nateghi & Aven, 2021). A major challenge in using AI for risk analysis is conveying to decision makers the uncertainty inherent to predictions of models, because it clashes with the common practice in the realm of AI to communicate uncertainty with point estimates or to ignore it completely (Guikema, 2020).

With the rise of open‐source software development and large‐scale cloud deployment, more security risk decision‐making is automated by running sequences of AI‐augmented analyses like automated program repair (APR, Fu et al., 2024; Li et al., 2022; Long et al., 2017; Xia & Zhang, 2022; Ye et al., 2021). The use of AI in automated security pipelines, where the first classifier detects a vulnerability and the second tool fixes it, is now becoming more common (Bui et al., 2024), bringing about a fundamental research challenge:

Propagating uncertainty is a new major challenge for assessing the risk of automated security pipelines.

This foundational problem has already manifested in security pipelines with no AI‐based computation. To illustrate this problem, we consider four studies: verifying the presence of code smells (Tufano et al., 2017), generalizing the Śliwerski, Zimmermann, and Zeller (SZZ) algorithm to identify the past versions of software affected by a vulnerability (Dashevskyi et al., 2018), identifying vulnerabilities in Java libraries (Kula et al., 2018), and finding how vulnerable Android libraries could be automatically updated (Derr et al., 2017).

A few years later, Pashchenko et al. (2022) showed that the results by Kula et al. (2018) are incorrect, and Huang et al. (2019) found that the claims by Derr et al. (2017) are incorrect, both to a large extent. We argue that the reason for this mishap is foundational. All these studies share the impossibility of running manual validation and do not report the uncertainty of their outcomes. The proposed solutions process huge inputs (e.g., 246K commits in Dashevskyi et al., 2018), so they need an automated tool, with an error rate, to decide whether each sample satisfies the property of interest.

With the appearance of new AI‐based approaches, such as SeqTrans (Chi et al., 2022), it is becoming imperative to investigate this problem now, before it is too late and AI‐augmented systems without global measures of risk become weaved into the automated pipelines in organizations.

To address these issues, we focus on understanding the uncertainty due to error propagation in AI‐augmented systems. Within the pipeline, each component is a potential source of error that can lead to an underestimation or overestimation of the actual effectiveness of the proposed solution. Therefore, we formulate the overarching research question:

RQ: How to estimate the total error (or success rate) of the AI‐augmented system, given the propagating errors of the classifiers in the pipeline?

If analytical models for the classifiers and the fixer components existed, it would be possible to use the error propagation models used in calculus (Benke et al., 2018). Unfortunately, analytical models of the recall and precision of these tools are extremely rare; therefore, we must resort to a much coarser‐grained approximation with probability bound analysis (PBA, Iskandar, 2021).

1.1. Contributions

We provide the formal underpinnings for capturing uncertainty propagation in AI‐augmented APR pipelines. In addition, we develop a simulator to quantify the effects that propagating uncertainty has in automated APR tools (such as the one presented in Figure 1). We evaluate the simulator and present one case study in which we calculate the effects of uncertainty regarding the proposed solution. We provide the code in a GitHub repository (Mezzi & Papotti, 2024). Finally, given our findings, we discuss recommendations for the evaluation policies concerning AI systems.

FIGURE 1.

Illustration of an AI‐augmented system which performs vulnerability detection and program repair. The first classifier receives the code samples as input and determines which are the positive samples to be sent to the fixer to be repaired. The fixer, based on its effectiveness, tries to repair them. The second classifier checks whether the fixing is correct. We can also observe the errors made by each component of the pipeline. The first classifier can wrongly classify samples as positive when they are negative. The fixer can fail to repair positive samples, and the second classifier can misclassify the samples it receives that were modified by the fixer.

2. BACKGROUND AND RELATED WORK

As background, we illustrate the composition of AI‐augmented APR tools and present related work on uncertainty quantification in AI and the applications of AI to vulnerability detection and APR.

2.1. AI‐augmented systems

Figure 1 shows the simplest example of a composed AI‐augmented system in the area of APR. It is composed of (i) a classifier, which labels code samples as Good or Bad by detecting which sample is not vulnerable and which is vulnerable, (ii) a fixer tool to transform Bad samples into Good samples, and (iii) a second classifier, which can be either equal to the first or different, which analyzes the samples modified by the fixer to check whether they have been successfully repaired. The outcome of the final step is what we call the claimed success rate or fix rate, representing the ratio of fixed vulnerable samples to the total number of vulnerable samples. Here, we list each step executed by the AI‐augmented APR tool and the possible errors propagating from it:

  • Step one: the first classifier analyzes the code samples, and labels each of them as Good or Bad. If a code sample presents features not encoded in the distribution learned by the classifier, misclassification is probable, and thus the possibility that a Good code sample is misclassified as Bad or vice versa.

  • Step two: the fixer tries to fix every Bad code sample, transforming it into a Fixed code sample. Here, the possibility of error lies in the fixer's performance.

  • Step three: the second classifier analyzes the Fixed code samples. This is the outcome of the entire system. The second classifier performs a final analysis to detect which applications have not been successfully fixed by the fixer. The possibility of errors lies in the same conditions defined for the first classifier.

In our research, we focus on the errors of the first and second classifiers by modeling and propagating the uncertainty which characterizes the classifier's capacity to spot vulnerable code. The classifier's capacity to detect vulnerable code is measured by the Recall (rec) or True Positive Rate (TPR), which is the ratio between the true positives and all the positive samples. We formally define the Recall in Section 3.

2.2. Uncertainty quantification in AI

Hüllermeier and Waegeman (2021) highlight two macrocategories of methods employed to quantify and manage uncertainty in Machine Learning (ML). The first discerns between frequentist‐inspired and Bayesian‐inspired quantification methods. The second considers the distinction between uncertainty quantification and set‐valued prediction. Uncertainty quantification methods allow the model to output the prediction and a paired level of certainty, while set‐valued methods consist of predefining a desired level of certainty and producing a set of candidates that comply with it.

Abdar et al. (2021) focus their analysis on Deep Learning (DL). Bayesian‐inspired methods and ensemble methods represent two of the major categories to represent uncertainty in DL. Through Bayesian methods, the DL model samples its parameters from a learnt posterior distribution, allowing the model to avoid fixed parameters and allowing us to inspect the variance and uncover the uncertainty which surrounds the model predictions. The most common Bayesian‐inspired technique is the Monte Carlo (MC) dropout. Ensemble methods combine different predictions from different deterministic predictors. Although they were not introduced in the first instance to explicitly handle uncertainties, they give an intuitive way of representing the model uncertainty on a prediction by evaluating the variety among the predictors (Gawlikowski et al., 2023).
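To make the ensemble idea concrete, the following minimal sketch (not taken from any of the cited works) treats the spread among the members of an ensemble of deterministic predictors as a rough uncertainty signal; the member probabilities below are made up for illustration.

```python
# Minimal sketch: predictive uncertainty from ensemble disagreement.
import numpy as np

def ensemble_uncertainty(member_probs):
    """member_probs: shape (n_members, n_samples), probability of the
    positive class predicted by each ensemble member."""
    member_probs = np.asarray(member_probs, dtype=float)
    mean_prob = member_probs.mean(axis=0)    # point prediction of the ensemble
    disagreement = member_probs.std(axis=0)  # spread among members as uncertainty
    return mean_prob, disagreement

# Three members agree on the first sample and disagree on the second.
probs = [[0.90, 0.20], [0.92, 0.80], [0.88, 0.50]]
mean_p, unc = ensemble_uncertainty(probs)
print(mean_p, unc)  # the larger std on the second sample signals higher uncertainty
```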

Key Observation 1. Extensive research was performed in the field of uncertainty quantification in AI, which brought the development of a variety of methods. However, these approaches focus on uncertainty quantification of isolated models without accounting for how uncertainty characterizing a model's output can propagate and impact subsequent system components when the model is part of an AI‐augmented pipeline and its output constitutes the input to other models.

2.3. AI in vulnerability detection

Vulnerability detection is a crucial step in risk analysis of software systems and includes running automated tools scanning parts of the system to prevent future exploitation. Given its potential, experts integrated AI into their vulnerability detection systems, to scale them and make them more flexible to new threats.

One line of approaches performs vulnerability detection by applying Natural Language Processing. In their approach, Hou et al. (2022) represent the code in the form of a syntax tree and input it to a Transformer model, which leverages the attention mechanism to improve the probability of detecting vulnerabilities. Akter et al. (2022) create embeddings using GloVe (Pennington et al., 2014) and fastText (Joulin et al., 2016), word embedding methods that aim to capture the relations between words. Then, they use long short‐term memory (LSTM) and Quantum LSTM models to perform vulnerability detection, showing lower execution time and higher accuracy, precision, and recall for the Quantum LSTM.

Another line of research excludes Natural Language Processing or combines it with graph approaches. Yang et al. (2022) propose a new code representation method called the vulnerability dependence representation graph, allowing the embedding of the data dependence of the variables in the statements and of the control structures corresponding to the statements. Moreover, they propose a graph learning network based on a heterogeneous graph transformer, which can automatically learn the importance of contextual sentences for vulnerable sentences. They carry out experiments on the software assurance reference dataset (SARD) (NIST, 2021) with an improvement in performance between 4.1% and 62.7%. Fan et al. (2023) propose a circle‐gated graph neural network (CGGNN) that receives an input tensor structure used to represent information about the code. CGGNN can perform heterogeneous graph information fusion more directly and effectively, which allows the researchers to reach higher accuracy, precision, and recall compared to the TensorGCN (Liu et al., 2020) and Devign (Zhou et al., 2019) methods.

Finally, Zhang et al. (2023) propose VulGAI to overcome the limitations posed by the training time in graph neural network models. They base their method on graphs and images and unroll their approach in four phases: graph generation from the code, node embedding, image generation from the node embeddings, and vulnerability detection through convolutional neural networks (CNNs).

VulGAI was tested on 40,657 functions, outperforming other methods such as VulDePecker, SySeVR, Devign, VulCNN, and mVulPreter. Furthermore, VulGAI showed high accuracy, recall, and f1‐score, improving by 3.9 times the detection time of VulCNN.

Key Observation 2. Extensive research and different approaches have been tested in the past, with a high level of performance. However, previous work does not quantify (or communicate) the uncertainty regarding the performance of the proposed methods, and yet, the overestimated performance of the vulnerability detection model could affect the entire pipeline performance.

2.4. APR and composed pipelines

The step which follows automatic vulnerability detection through AI is the application of AI to automatic code fixing.

2.4.1. Code fixers

Li et al. (2022) propose DEAR, a DL approach which supports fixing general bugs. Experiments run on three selected data sets: Defects4J (395 bugs), BigFix (+26k bugs), and CPatMiner (+44k bugs) show that the DEAR approach outperforms existing baselines. Chi et al. (2022) leverage Neural Machine Translation (NMT) techniques to provide a novel approach called SeqTrans to exploit historical vulnerability fixes to automatically fix the source code. Xia and Zhang (2022) propose AlphaRepair, which directly leverages large pretrained code models for APR without any fine‐tuning/retraining on historical bug fixes.

2.4.2. Composed pipelines

AIBugHunter combines vulnerability detection and code repair. The pipeline is implemented by Fu et al. (2024), combining LineVul (Fu & Tantithamthavorn, 2022) and VulRepair (Fu et al., 2022), two tools implemented by the same authors. Yang et al. (2020) propose a DL approach based on autoencoders and CNNs, automating bug localization and repair. Another example of a complete pipeline combining vulnerability detection and code repair is HERCULES, which employs ML to fix code (Saha et al., 2019). Liu et al. (2021) evaluate the effect of fault localization by introducing the metric fault localization sensitiveness (Sens) and analyzing 11 APR tools. Sens is calculated from the ratio of bugs plausibly fixed by modifying the code at nonbuggy positions, and the percentage of bugs that could be correctly fixed when the exact bug positions are available but cannot be correctly fixed by the APR tool with its normal fault localization configuration. This metric, to the best of our knowledge, is the first to quantify the impact of the vulnerability detector's capability on the overall pipeline. Nevertheless, it does not provide an interval describing the best and worst pipeline performance, and thus a quantification of the risk in terms of the percentage of errors that the pipeline will overlook when it is employed.

Key Observation 3. Recently, substantial research has appeared regarding the automation of vulnerability fixing by using ML. These advances are important and could help to manage the manual effort spent on sieving through tool warnings. However, to the best of our knowledge, the propagation of errors (or final uncertainty of the result) has not been investigated in such automated pipelines.

3. PIPELINE FORMALIZATION

In this section, we present the formal basis for our simulator. To simplify the analysis, we make the following assumptions in our model:

  • No breaking: We assume that the fixer will never turn a true Good sample that is classified as Bad into a Bad sample.

  • No degradation: We assume that all elements that are fixed cannot be distinguished from elements that were Good from the beginning. In other words, the performance of the second classifier does not degrade with the fix.

  • Constant prevalence rate: We initially assume that the prevalence rate PR which defines the number of positive samples in the data set is the same for both the training and the test data set. We relax this assumption in Section 6.

3.1. Identify the classifier metrics

To evaluate the performance of the AI‐augmented system, we use the metrics which are typically used to report the performance of a classifier: True Positive Rate (TPR) or Recall (rec), precision (prec), and False Alert Rate (FAR), or False Positive Rate (FPR), which we use interchangeably throughout this manuscript. We also use the prevalence rate (PR) of the positive elements (Pos) among the total number of objects (N) in the domain of interest. The prevalence rate is not typically known, so we will assume it to be a parameter whose effects need to be explored by simulation. Specificity is rarely cited in publications using AI models and its absence makes it difficult to reverse engineer the True Negatives.

Pos = TP + FN, (1)
Neg = N − Pos, (2)
PR = Pos / N, (3)
TPR = TP / Pos = rec = TP / (TP + FN), (4)
FAR = FP / Neg = FP / (FP + TN), (5)
prec = TP / (TP + FP). (6)

TP, FP, TN, and FN, which are necessary to calculate the metrics of interest, are, respectively, the True Positives, False Positives, True Negatives, and False Negatives. TP represents the elements classified as positive that are indeed positive, while FP represents the elements classified as positive that are negative. TN are the elements classified as negative that are indeed negative, and FN are the elements classified as negative that are instead positive.

For our purposes, it is more useful to express TP, FN, and FP in terms of the other values that are often found in publications reporting results of AI‐augmented system components.

Proposition 1

Let rec be the recall of a classifier and prec be its precision. When applied to a domain with N elements and a prevalence rate of PR, the true positives TP, false negatives FN, and false positives FP of the classifier are as follows:

TP = rec · PR · N, (7)
FN = (1 − rec) · PR · N, (8)
FP = rec · ((1 − prec) / prec) · PR · N. (9)

The first two equations are simply an inversion of the definition of recall (4), where the positives Pos are expressed as a function of the prevalence rate (3). The third equation is obtained by inverting the definition of precision (6) to express the false positives FP as a function of TP and prec, and then substituting into it the equation computing TP as a function of the recall rec and the prevalence PR (7).
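As a worked illustration of Proposition 1, the following short Python sketch plugs hypothetical values (a recall of 0.80, a precision of 0.70, a prevalence rate of 10%, and 100,000 elements; none of these numbers come from the paper) into Equations (7)-(9).

```python
# Illustrative sketch of Proposition 1 (Equations 7-9); the numbers are hypothetical.
def confusion_counts(rec, prec, PR, N):
    TP = rec * PR * N                       # Equation (7)
    FN = (1 - rec) * PR * N                 # Equation (8)
    FP = rec * (1 - prec) / prec * PR * N   # Equation (9)
    return TP, FN, FP

TP, FN, FP = confusion_counts(rec=0.80, prec=0.70, PR=0.10, N=100_000)
print(TP, FN, FP)  # 8000.0 true positives, 2000.0 false negatives, ~3428.6 false positives
```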

3.2. Deterministic recall, partial repairs, no breaking changes

Proposition 2

Let rec be the recall rate of a classifier that is used both as first and second classifier, and let fR be the theoretical fix rate of the fixer, which (i) only affects positive (vulnerable) code and (ii) does not break nor make vulnerable the non‐vulnerable code that is eventually piped through it. The classifier can also correctly recognize unsatisfactory fixes (iii) with the same rec. Then, the true performance of the AI‐augmented system, when applied to a domain with N elements and an initial prevalence rate of PR, is

f(aias) = fR · rec, (10)
PR(aias) = (1 − fR · rec) · PR, (11)
TPR(aias) = rec · (1 − fR) · rec / (1 − fR · rec), (12)
FAR(aias) = rec² · ((1 − prec) / prec) · (1 − fR) · PR / (1 − (1 − fR · rec) · PR). (13)

Results show that, unless the fix rate is perfect, the final prevalence rate is not reduced to zero and it will depend on the uncertainty in the recall.

An apparently surprising result is that if the fix rate is perfect then the overall TPR is zero. This is actually to be expected: with a perfect fix rate, all identified positives are fixed. This does not mean that all positives are eliminated because the false negatives from the first classifier are still present. In general, since rec ≤ 1, the term (1 − fR) · rec / (1 − fR · rec) ≤ 1 and therefore the recall of the AI‐augmented system as a whole is lower than the recall of the first classifier, that is, TPR(aias) ≤ TPR (see Appendix, Section B.4).

While the recall of the AI‐augmented system does not depend on the prevalence rate, the FAR depends in a nonlinear way on the overall prevalence rate of the system. It is still possible to prove that the FAR of the AI‐augmented system as a whole is lower than the FAR of the first classifier, that is, FAR(aias) ≤ FAR.
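The closed-form expressions of Proposition 2 can be evaluated directly; the sketch below does so for hypothetical parameter values (not taken from the paper) and simply transcribes Equations (10)-(13).

```python
# Sketch of the system-level metrics of Proposition 2 (Equations 10-13);
# the parameter values below are hypothetical.
def aias_metrics(rec, prec, fR, PR):
    f_aias = fR * rec                                      # Equation (10)
    PR_aias = (1 - fR * rec) * PR                          # Equation (11)
    TPR_aias = rec * (1 - fR) * rec / (1 - fR * rec)       # Equation (12)
    FAR_aias = (rec ** 2 * (1 - prec) / prec * (1 - fR) * PR
                / (1 - (1 - fR * rec) * PR))               # Equation (13)
    return f_aias, PR_aias, TPR_aias, FAR_aias

print(aias_metrics(rec=0.75, prec=0.70, fR=0.50, PR=0.10))
```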

Proof of Proposition 2. The first classifier receives as input the positives and the negatives and divides them into TP, FP, FN, and TN.

The fixer receives as input TP1st + FP1st, of which only a fraction fR of TP1st is actually fixed (Assumption (i)). According to assumption (ii), the fixer will not transform the false positives into new positives (i.e., it will not make them vulnerable nor will it break them). Since the second instance of the classifier does not change the nature of the processed objects but at worst misclassifies them, we have that

Pos(aias) = (1 − fR) · TP1st + FN1st, (14)
where (1 − fR) · TP1st are the positives left unfixed by the fixer and FN1st are those misclassified by the first classifier, and
Pos(aias) = Pos − f(aias) · Pos = PR · N − f(aias) · PR · N. (15)

We now equate the terms, replace TP1st and FN1st with the corresponding equations, and simplify PR · N from both sides of the equation to obtain 1 − f(aias) = (1 − fR) · rec + (1 − rec), which simplifies to f(aias) = fR · rec (see Appendix, Section B.1).

We can use Equation (15) to directly obtain the prevalence rate for the AI‐augmented system by replacing the value of f(aias) just computed and dividing by the total number of elements N.

To compute the true positive rate, we replace in the definition of TPR (4) the number of TP surviving at the end of the second classifier, which is (1 − fR) · TP1st · rec, because by assumptions (ii) and (iii) only the original true positives will be reclassified as positives. We divide by the total number of positives of the AI‐augmented system as computed from Equation (15). By simplifying both numerator and denominator by PR · N, we obtain

TPR(aias) = (1 − fR) · rec · rec / (1 − fR · rec). (16)

To compute the FAR, we first need to compute the false positives of the second classifier. To this extent, we rewrite the definition of false positives (9) in terms of the new set of positives (1 − fR) · TP1st at the end of the fixer, according to Equation (14).

FP(aias) = rec · ((1 − prec) / prec) · (1 − fR) · rec · PR · N. (17)

Then we substitute this value into the definition of the FAR (5) with the value of the overall negatives of the system.

Corollary 1

Unless the fix rate is perfect (fR=1) the number of false negatives of the AI‐augmented system satisfying the condition of Proposition 2 is higher than the number of false negatives that would result from just the first classifier. The false negatives of the AI‐augmented system also increase with the increase in recall rec.

This result is surprising as we expected the system to improve as recall improves. However, a larger recall would also mean that more positives would be piped through the fixer and tested again. Since the fixer is not perfect the number of false negatives emerging from the second run of the classifier will increase.

We compute the false negatives at the end of the AI‐augmented system starting from the definition as

FN(aias) = (1 − fR) · TP1st · (1 − rec) + FN1st, (18)
where (1 − fR) · TP1st are the unfixed positives and (1 − rec) is the fraction escaping the second classifier.

We plug in the definition of TP1st and FN1st in terms of positives Pos and thus of the prevalence rate PR and the overall number of objects N and rearrange the terms to obtain

FN(aias) = (1 + (1 − fR) · rec) · (1 − rec) · PR · N (19)
= (1 + (1 − fR) · rec) · FN1st. (20)

By using the above equations, we can compute the total number of elements which will be passed to the fixer and the second classifier.

N2nd = TP1st + FP1st = (rec / prec) · PR · N. (21)

By using assumption (iii), we can compute the positives that will be recognized as such, TP(aias), as

TP2nd = (1 − fR) · TP1st · rec, (22)
where (1 − fR) · TP1st are the unfixed positives and rec is the fraction detected by the second classifier.

By expanding the definition of TP1st = rec · Pos (7) and the definition of positives as Pos = PR · N (3) and rearranging the terms, we have

TP(aias) = (1 − fR) · rec² · PR · N. (23)

Then, we can revise the final prevalence rate as

PR(aias) = (TP(aias) + FN(aias)) / (TP(aias) + FN(aias) + TN(aias) + FP(aias)) = (TP(aias) + FN(aias)) / N. (24)

We can plug in the solutions for TP(aias) and FN(aias) and observe that they are both multiplied by the common factor PR · N, which allows us to simplify the denominator and remove the dependency on the total number of objects. The ratio between the final prevalence rate and the initial prevalence rate is then captured by the following expression:

PR(aias) / PR = (1 − fR) · rec² + (1 + (1 − fR) · rec) · (1 − rec), (25)

which further algebraically simplifies as follows:

PR(aias) = (1 − fR · rec) · PR. (26)

By multiplying both sides by N, we obtain the total number of positives before and after the treatment by the AI‐augmented system pipeline. The AI‐augmented system fix rate is therefore equal to

(Pos − Pos(aias)) / Pos = fR · rec. (27)

Complete derivations of PR(aias), TPR(aias), FAR(aias), N2nd, FN(aias), and TP(aias) are given in Sections B.12, B.3, B.6, B.8, B.11, and B.10 of the Appendix.

3.3. Uncertain recall

Considering the low availability of recall values, we do not have enough data to approximate a specific cumulative distribution function (CDF) (e.g., beta, normal, etc.) (Ferson et al., 2013; Gray et al., 2019). We thus rely on distribution‐free analysis (Gray et al., 2022). Specifically, to model uncertainty in the recall and propagate it in the form of intervals, we employ nonparametric probability boxes (p‐boxes) and PBA (Ferson & Ginzburg, 1996; Ferson et al., 2003; Iskandar 2021).

By substituting a specific CDF with p‐boxes, PBA allows us to model the lack of knowledge regarding the specific CDF from which the recall values are sampled. Considering that we cannot possess exhaustive information regarding the CDF of recalls of AI vulnerability detectors, this mathematical tool is preferred to the use of precise probability density functions. Thus, we employ nonparametric p‐boxes, which allow us to model uncertainty when the shape of the distribution is not known but some parameters of the CDF are, such as the min, max, and μ, which, respectively, correspond to the minimum, maximum, and expected value of the random variable.

Equations (28) and (29) are the inverse p‐boxes derived by Iskandar (2021) that substitute the inverse of the specific CDF and thus are used to sample the lower and upper bound recall values.

rec_⁻¹(p; a, b, μ) =
  [a, μ]                  for p = 0
  (p · a − μ) / (p − 1)   for 0 < p < (b − μ) / (b − a)
  b                       for (b − μ) / (b − a) ≤ p ≤ 1    (28)

rec¯⁻¹(p; a, b, μ) =
  a                       for 0 ≤ p ≤ (b − μ) / (b − a)
  b − (b − μ) / p         for (b − μ) / (b − a) < p < 1
  [μ, b]                  for p = 1,    (29)

where rec_⁻¹ and rec¯⁻¹ stand for the inverses of the lower and upper bounds of the p‐box that models the recall, and a, b, and μ, respectively, correspond to the minimum, the maximum, and the expected value of the recall registered during the literature review (Section 4). p is the value sampled from the standard uniform distribution and given as input to the inverse probability box to sample the recall value. Sampling recall values from the lower and upper bounds of the inverse of the probability box allows treating recall as an interval, consisting of minimum and maximum possible values. Employing an intervalized recall, in turn, produces intervals in all the equations in which the recall is used: using the intervalized recall in Equations (26), (10), and (18) leads to an intervalized final prevalence rate, final fix rate, and final false negatives ratio, respectively.
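The inverse bounds of Equations (28) and (29) are straightforward to evaluate; the following hand-rolled sketch (an illustration of the formulas, not the pba-for-python implementation used later) samples recall intervals using the minimum, maximum, and mean reported in Section 4.2. As usual in PBA, the inverse of the lower CDF bound (Equation 28) yields the upper endpoint of the recall interval, and the inverse of the upper CDF bound (Equation 29) yields the lower endpoint.

```python
# Sketch of sampling recall intervals from the {min, max, mean} nonparametric p-box.
import numpy as np

def rec_lower_bound_inv(p, a, b, mu):
    """Inverse of the lower bound of the p-box (Equation 28)."""
    thr = (b - mu) / (b - a)
    if p == 0:
        return a                       # any value in [a, mu]; take the lowest
    if p < thr:
        return (p * a - mu) / (p - 1)
    return b

def rec_upper_bound_inv(p, a, b, mu):
    """Inverse of the upper bound of the p-box (Equation 29)."""
    thr = (b - mu) / (b - a)
    if p <= thr:
        return a
    if p < 1:
        return b - (b - mu) / p
    return b                           # any value in [mu, b]; take the highest

a, b, mu = 0.06, 1.00, 0.75            # parameters from Section 4.2
ps = np.random.default_rng(0).uniform(size=5)
# Each sampled p yields a recall interval [lower, upper].
intervals = [(rec_upper_bound_inv(p, a, b, mu), rec_lower_bound_inv(p, a, b, mu)) for p in ps]
print(intervals)
```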

4. RECALL IN THE FIELD

We collect the reported recall values (and precision) of AI‐augmented vulnerability detectors and derive the parameters necessary to implement the p‐boxes in our simulations.

4.1. Search in digital libraries

Figure 2 illustrates the steps that define our search. We defined a search string to filter publications stored in digital libraries: (“vulnerability detection” OR “fault localization”) AND (“artificial intelligence” OR “AI” OR “deep learning” OR “DL” OR “machine learning” OR “ML”) AND (“sensitivity” OR “true positive rate” OR “TPR” OR “recall” OR “hit rate”) AND “code.”

FIGURE 2.

The figure shows the steps implemented to gather the publications from which to extract recall values. The first step consists of an initial search on Scopus that retrieves 548 publications. Based on Scopus's relevance ranking, the first 200 are selected (every 50 publications down the ranking, the fraction of relevant papers drops significantly, and no relevant paper was found from 200 to 250). We check which of these 200 publications implement vulnerability detection or fault localization using AI (SC1 and SC2). In the end, of the 142 publications that respect the previous conditions, we select only the ones which used recall as an evaluation metric (SC3).

We define a list of selection criteria (SC) that a publication must respect to be selected for the extraction of data points.

  • SC1. The publication must be related to the topic of vulnerability detection or fault localization. For instance, we discard publications related to general feature location.

  • SC2. The publication must apply ML or DL algorithms to the problem of vulnerability detection. We discard the publications which do not employ ML or DL.

  • SC3. Since the metrics considered, PR(aias), f(aias), and FN(aias), depend uniquely on the recall, the publication must (at least) report the recall of the vulnerability detectors.

By employing the search string on Scopus, the resulting total number of publications is 548. The subsequent selection of the papers is guided by the standards of the Preferred Reporting Items for Systematic Review and Meta‐Analysis (PRISMA) Statement (Page et al., 2021), which suggests relevance as a selection method. Therefore, we use the built‐in relevance score provided by Scopus, which ranks publications based on their affinity with the presented search string (Elsevier, 2024). We empirically found that after the 200th publication, the selected publications concern neither the problem of vulnerability detection nor the application of AI, DL, or ML to the problem. From the first paper to the 50th, we selected 45 papers respecting all selection criteria (90% of the scanned sample); from the 51st to the 100th, we selected 35 papers matching the criteria (70%); from the 101st to the 150th, 25 papers (50%); and from the 151st to the 200th, 11 papers (22%). We kept analyzing until the 250th paper and found no publication that respected the selection criteria. Therefore, we stopped the search and considered the first 200 papers. After applying SC1 and SC2 to the title and abstract, we retain 142 publications. Finally, applying SC3 resulted in removing 26 more publications.

4.2. Collected samples

For each article, we select the recall value of the model presented in the publication. We also select the recall values related to baseline models, but only if those values are derived from new experiments. If the values are simply reported from the publications where the baseline models were presented, we consider them duplicates. We include recall values related to the same model used in different publications because, as a consequence of repeated experiments, the model performance can differ between studies. The factors responsible for different performances of the same model are the following:

  • Different data set: the data set used by the new paper on which the new model is tested and compared to old models can be different compared to the data set on which previous models were originally tested.

  • Different training modalities: if the authors of the new paper retrain all the models and change the training modalities, this will impact the models' performance.

  • Random changes: even when adopting the same training techniques other factors can influence the final training result, such as the random training‐test splitting and the hardware on which experiments are performed.

From an initial sample of 2328 values, after eliminating the values not derived from new experiments and the outliers, we obtain 2227 samples that we use to calculate the p‐box parameters for the simulation. We eliminate outliers by employing z‐scores: if a recall data point has a z‐score greater than or equal to 3 or smaller than or equal to −3, we consider it an outlier (Chen et al., 2022). The minimum and maximum reported recall are, respectively, 0.06 and 1.00, while the mean is 0.75. For completeness, in Table 1 we also show descriptive statistics on the collected precision samples (but we do not use them yet in our simulation).
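For illustration, the z-score filter described above can be implemented in a few lines; the sample values below are hypothetical and do not come from our data set.

```python
# Minimal sketch of z-score-based outlier removal (|z| >= 3 is dropped).
import numpy as np

def drop_outliers(values, threshold=3.0):
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()   # standardize each data point
    return values[np.abs(z) < threshold]           # keep points with |z| below the threshold

recalls = np.array([0.06, 0.62, 0.80, 0.92, 1.00, 0.75])  # hypothetical data points
print(drop_outliers(recalls))
```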

TABLE 1.

Descriptive statistics regarding recall and precision data, gathered from publications related to the applications of AI to vulnerability detection. For both recall and precision, the table reports the minimum and maximum value registered, the first quartile (Q1), the third quartile (Q3), the mean, the median, and the standard deviation (SD).

Measure Samples Selected Min Q1 Median Q3 Max Mean SD
Recall 2227 116 0.06 0.62 0.80 0.92 1.00 0.75 0.21
Precision 2016 100 0.00 0.56 0.78 0.92 1.00 0.71 0.27

Regarding the True Negative Rate (TNR) and the FPR, where TNR = 1 − FPR, among the selected publications only two report the TNR and only 14 publications report the FPR. This remains a significant limitation of the data reported in the literature, so it is difficult to understand the trade‐off faced by the studies.

5. SIMULATION ONE: CONSTANT PREVALENCE RATE

Through the simulation, we are interested in calculating PR(aias), f(aias), and FN(aias), which, as previously shown (Section 3), depend uniquely on the recall. To allow for future extensions, we implemented the simulator taking into account TN and FP, which are needed to define specificity. At this stage of the research, the specificity value does not affect the final result, thus we set its value to zero. We implement the simulation through the pba‐for‐python library as it allows us to perform rigorous p‐box arithmetic (Gray et al., 2022). In addition, to show the influence that the number of samples has on the precision of the simulation, we also implement the experiments through MC simulation (Metropolis & Ulam, 1949) and report the results in Appendix Sections A.1.1, A.1.2, A.1.3, A.2.1, and A.2.2.

5.1. Simulator

Figure 3 illustrates the subsystems of our simulation pipeline. It comprises a fixer and a classifier that acts as the first and second classifiers.

FIGURE 3.

Illustration of the process that leads to the calculation of the final prevalence rate (PR(aias)), given a fix rate of 0.50 and three different starting prevalence rates (PR). PR determines the number of positives in the ground truth, while PR(aias) is the ratio between the positives (TP(aias) + FN(aias)) and the total elements (N) at the end of the process. We represent the pipeline as a loop because the first and the second classifiers possess the same recall and specificity. By considering a lower and an upper bound of the recall at each step in the pipeline, we propagate the uncertainty, with the consequence that PR(aias), like TP, FN, TN, and FP, will also have an upper and a lower bound. The lower bound of PR(aias) is the best‐case scenario, which is the case in which the classifier is perfect. In reality, PR(aias) can be equal to any value contained in the interval, depending on the classifier's performance.

5.1.1. Ground truth generator

The ground truth generator creates the data set that allows the simulation of the pipeline. Each generated element represents a code sample, which can be vulnerable or not vulnerable. Thus, the ground truth generator produces fictional positive and negative elements (Pos, Neg):

  • It receives as input the total number of elements (NE), set to 100,000, and the initial prevalence rate (PR), which defines the initial number of vulnerable elements.

  • The generator labels each object as vulnerable with probability equal to PR and as not vulnerable with probability 1 − PR, and returns a list containing all the samples generated.

5.1.2. p‐Boxes and recall sampling

We employ the pba‐for‐python library to sample NR lower and NR upper bound recall values, where NR=202 (default value set by the pba‐for‐python library):

  • The parameters used to sample from the p‐box formulas are the minimum (0.06), maximum (1.00), and mean (0.75) values obtained from the exploratory data analysis in Section 4.2.

  • Given two lists of recall values, one representing the lower bounds and the other the upper bounds, each of NR samples, we perform the simulation to estimate the upper and lower bounds of the metrics of interest.

5.1.3. First classifier

After generating the lower and upper bound recall values, the first classifier executes the first subdivision of the samples, generating TP1st, FN1st, TN1st, and FP1st.

  • The first classifier labels each vulnerable element of the ground truth as TP with probability equal to rec and as FN with probability equal to 1 − rec. This means that the greater the recall, the greater the probability that vulnerable objects are classified as TP.

  • Since the first classifier is simulated with both lower and upper bound recall values, in the end, we obtain lower and upper bounds for each element, thus [TP1st_,TP1st¯], [FN1st_,FN1st¯], [TN1st_,TN1st¯], [FP1st_,FP1st¯].

5.1.4. Fixer

The fixer, with fix rate fR, tries to repair the samples classified as positives by the first classifier, namely, TP1st and FP1st. The fixer repairs each sample classified as positive with probability equal to fR. Since we assume that a FP cannot be broken, the intervention on FP cannot cause it to become a TP.

5.1.5. Second classifier

The second classifier, with the same recall and specificity as the first classifier, classifies the objects that passed through the fixer, generating TP2nd, FN2nd, TN2nd, and FP2nd.

  • The second classifier labels each vulnerable object that passed through the fixer as TP with probability equal to rec and as FN with probability 1 − rec.

  • Since the second classifier is simulated with both lower and upper bound recall values, we obtain lower and upper bounds for each element, thus [TP2nd_,TP2nd¯], [FN2nd_,FN2nd¯], [TN2nd_,TN2nd¯], and [FP2nd_,FP2nd¯]

5.1.6. Final counter

The final counter gathers the results from the first classifier, the fixer, and the second classifier and calculates the final prevalence rate (PR(aias)), the final fix rate (f(aias)), and the false negatives ratio (FNratio) between the final number of false negatives (FN(aias)) and the false negatives generated by the first classifier (FN1st). Since the uncertainty propagates until the final counter, each metric will be characterized by a lower and an upper bound, thus: [PR(aias)_, PR(aias)¯], [f(aias)_, f(aias)¯], [FNratio_, FNratio¯].
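The following simplified sketch runs the pipeline of Sections 5.1.1-5.1.6 for a single fixed recall value; it is a plain Monte Carlo illustration and not the pba-for-python implementation, so it produces point values rather than the intervals reported below. Negatives are omitted because the metrics of interest depend only on the recall (Section 3).

```python
# Simplified single-run sketch of the simulated pipeline (Figure 3), assuming
# one fixed recall value instead of lower/upper bounds sampled from the p-box.
import random

def run_pipeline(N=100_000, PR=0.50, rec=0.75, fR=0.50, seed=0):
    rng = random.Random(seed)
    # Ground truth generator (5.1.1): each element is vulnerable with probability PR.
    ground_truth = [rng.random() < PR for _ in range(N)]
    Pos = sum(ground_truth)
    # First classifier (5.1.3): vulnerable -> TP with probability rec, otherwise FN.
    TP1 = sum(1 for v in ground_truth if v and rng.random() < rec)
    FN1 = Pos - TP1
    # Fixer (5.1.4): each detected positive is repaired with probability fR;
    # false positives are ignored here because they are assumed unbreakable.
    unfixed = sum(1 for _ in range(TP1) if rng.random() >= fR)
    # Second classifier (5.1.5): unfixed positives are re-detected with probability rec.
    TP2 = sum(1 for _ in range(unfixed) if rng.random() < rec)
    FN2 = unfixed - TP2
    # Final counter (5.1.6).
    FN_aias = FN1 + FN2
    Pos_aias = TP2 + FN_aias            # remaining positives, cf. Equation (14)
    PR_aias = Pos_aias / N
    f_aias = (Pos - Pos_aias) / Pos     # cf. Equation (27)
    return PR_aias, f_aias, FN_aias / FN1

print(run_pipeline())
```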

5.2. Simulation results

We present the simulation results and show how propagating uncertainty affects PR(aias) (see Table 2), f(aias) (see Table 3), and the false negatives (see Table 4). Figure 3 instantiates the simulated pipeline, with the results obtained from the simulation with PR=0.50 and fR=0.50.

TABLE 2.

This table shows the bounds of the final prevalence rate (PR(aias)) given the initial prevalence rate (PR) and theoretical fix rate (fR), but uncertain recall. The theoretical decrease of the initial prevalence rate given a fix rate corresponds only to the lower bound of the interval, which is obtained when recall is equal to one. For instance, when PR=1.00 and fR=0.50, the prevalence rate decreases by 50%, but only as a lower bound. Section A.1.1 in the Appendix presents the results of the calculation of the final prevalence rate obtained through Monte Carlo (MC) simulation.

PR(aias) by initial prevalence rate PR (rows) and theoretical fix rate fR (columns):

PR      fR = 0.50         fR = 0.70         fR = 0.90         fR = 1.00
0.10    [0.050, 0.097]    [0.030, 0.096]    [0.010, 0.095]    [0.000, 0.094]
0.50    [0.250, 0.485]    [0.150, 0.479]    [0.050, 0.473]    [0.000, 0.470]
1.00    [0.500, 0.970]    [0.300, 0.958]    [0.100, 0.946]    [0.000, 0.940]

TABLE 3.

Comparison between the theoretical fix rate (fR) and the real fix rate (f(aias)), when PR=0.50. The theoretical fix rate only translates into the upper bound of the interval, while the real fix rate can fall within a much wider range of values, which will eventually depend on the quality of the classifier. Section A.1.2 in the Appendix presents the resulting final fix rate obtained through Monte Carlo (MC) simulation.

fR      f(aias)
0.50    [0.030, 0.500]
0.70    [0.042, 0.700]
0.90    [0.054, 0.900]
1.00    [0.060, 1.000]

TABLE 4.

This table shows the final bounds of the ratio FNratio. Between the first and the second classifier, the number of FN grows, except in the case in which fR=1. When fR=1, FN(aias)=FN1st, because there will be no positives left that can be classified as FN by the second classifier and thus the number will not increase, leaving the ratio equal to one. Section A.1.3 in the Appendix presents the results related to the FNratio obtained through Monte Carlo (MC) simulation.

fR      FNratio
0.50    [1.000, 1.030]
0.70    [1.000, 1.018]
0.90    [1.000, 1.006]
1.00    [1.000, 1.000]

5.2.1. Final prevalence rate

Table 2 shows the results related to the decrease in the prevalence rate. We run simulations with PR=(0.10,0.50,1.00), thus in the first, second, and third sets of simulation, the total number of vulnerable samples is equal to the 10%, 50%, and 100% of the total samples. For each of these simulation sets, we calculate the final prevalence rate with fR=(0.50,0.70,0.90,1.00), meaning that the expected decrease in the prevalence rate is, respectively, 50%, 70%, 90%, and 100%. But the theoretical decrease in the prevalence rate that should be observed given a specific starting prevalence rate and fix rate is only the lower bound of the interval, which corresponds to the minimum prevalence rate obtainable when the capacity to locate vulnerable elements is perfect. In all the other cases, the value will fall within the bounds of the interval. For example, when PR=0.50 and fR=0.50 we should observe a decrease in the final prevalence rate of 50%, thus PR(aias)=0.250. But Table 2 and Figure 3 show that the 50% decrease only represents the lower bound, contrasting with an upper bound of 0.485.

5.2.2. Real fix rate

Table 3 shows the results related to the final fix rate. We run simulations with theoretical fix rate fR=(0.50,0.70,0.90,1.00). At the end of the simulations, fR only corresponds to the upper bound of the interval of f(aias), which is the maximum fix rate obtainable when the capacity to locate vulnerable elements is maximum. For example, when fR=0.50, the f(aias) oscillates between a maximum of 0.500 equal to fR and a minimum of 0.030. This illustrates the limitations of APR tools and the importance of stating the final results in terms of intervals and not of single numbers in order to represent the uncertainty that characterizes these systems when they are applied to real‐world scenarios.

5.2.3. False negatives ratio

Table 4 shows the results related to FNratio, which is the ratio between the overall number of false negatives registered at the end of the pipeline, FN(aias), and the false negatives generated by the first classifier, FN1st. Apart from fR=1, the final ratio is always greater than one, indicating that the pipeline cannot avoid the growth of the number of FN between the first and the second classifier.

6. SIMULATION TWO: BEYOND CONSTANT PREVALENCE RATE

In the general case, the constant prevalence rate assumption underlying the first simulation does not hold. It is possible to get an expected number of TP, but impossible to know which positives are actually TP when the classifier is applied to actual data. Only by validating the classifier in the field is it possible to know whether the positives are TP or FP. The data of the simulation can be used to train the classifier and calculate its recall, which would be a characteristic (fixed value) of the classifier. If the classifier is applied to a different data set, it is incorrect to simply calculate TP from the definition of recall.

In this simulation, we aim to analyze the effects of removing the constant prevalence rate assumption on the final fix rate of the pipeline. To relax the assumption, we calculate TP, FN, TN, and FP relying on the notions of Positive Predictive Value (PPV) and Negative Predictive Value (NPV) (Gray et al., 2020; Parikh et al., 2008). PPV and NPV are defined as follows:

PPV = rec · PR / (rec · PR + (1 − spec) · (1 − PR)), (30)
NPV = spec · (1 − PR) / (spec · (1 − PR) + (1 − rec) · PR), (31)

where rec corresponds to the recall or sensitivity of the classifier, PR is the prevalence rate of the data set, and spec is the specificity of the classifier. Then, we use PPV to calculate the number of TP and FP and NPV to calculate TN and FN as follows:

TP = PPV · Pos, (32)
FP = (1 − PPV) · Pos, (33)
TN = NPV · Neg, (34)
FN = (1 − NPV) · Neg, (35)

where Pos are the elements classified as positive and Neg are the elements classified as negatives.
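A minimal sketch of Equations (30)-(35) follows; the recall, specificity, prevalence, and counts of predicted positives and negatives are hypothetical and only illustrate how PPV and NPV translate classified elements into TP, FP, TN, and FN.

```python
# Sketch of the PPV/NPV computation of Equations (30)-(35); values are hypothetical
# and do not reproduce the paper's simulation.
def ppv_npv_counts(rec, spec, PR, pos_predicted, neg_predicted):
    PPV = rec * PR / (rec * PR + (1 - spec) * (1 - PR))          # Equation (30)
    NPV = spec * (1 - PR) / (spec * (1 - PR) + (1 - rec) * PR)   # Equation (31)
    TP = PPV * pos_predicted          # Equation (32)
    FP = (1 - PPV) * pos_predicted    # Equation (33)
    TN = NPV * neg_predicted          # Equation (34)
    FN = (1 - NPV) * neg_predicted    # Equation (35)
    return TP, FP, TN, FN

# 40,000 elements classified positive and 60,000 negative, with a 30% prevalence rate:
print(ppv_npv_counts(rec=0.80, spec=0.70, PR=0.30, pos_predicted=40_000, neg_predicted=60_000))
```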

Differently from the first simulation, we assume the min, max, and μ parameters for the p‐boxes. This allows us to employ p‐boxes to sample specificity values, as using assumed parameters removes the limitation posed by the lack of specificity values reported in the literature. We measure the performance of the simulated APR tool with three different min values for sensitivity and specificity (0.50, 0.70, and 0.90) and a max value of 1.00, measuring how raising the minimum value of recall and specificity impacts the final fix rate of the pipeline. We chose those values because they allow us to cover the recall range from the first quartile to the third quartile (Table 1), including also the case of perfect recall. For specificity, we use the same values because we do not have enough values to make an informed choice.

6.1. Results of the simulation

Tables 5 and 6, respectively, show the resulting fix rate, obtained by maintaining a constant prevalence rate and by relaxing the assumption.

TABLE 5.

This table shows, for each theoretical fix rate (fR), how increasing the minimum recall and specificity of the classifier affects the lower bound of the final fix rate (f(aias)), in the case in which we maintain the assumption of a constant prevalence rate between training and test data sets. Section A.2.1 in the Appendix shows the results related to f(aias), obtained through Monte Carlo (MC) simulation and with constant prevalence rate.

f(aias) by theoretical fix rate fR (rows) and minimum recall and specificity (columns):

fR      Min = 0.50        Min = 0.70        Min = 0.90
0.50    [0.250, 0.500]    [0.350, 0.500]    [0.450, 0.500]
0.70    [0.350, 0.700]    [0.490, 0.700]    [0.630, 0.700]
0.90    [0.450, 0.900]    [0.630, 0.900]    [0.810, 0.900]
1.00    [0.500, 1.000]    [0.700, 1.000]    [0.900, 1.000]

TABLE 6.

This table shows, for each theoretical fix rate (fR), how increasing the minimum recall and specificity of the classifier affects the lower bound of the final fix rate (f(aias)), in the case in which we relax the assumption of a constant prevalence rate between the training and test data sets. Section A.2.2 in the Appendix presents the results related to f(aias) obtained through Monte Carlo (MC) simulation when relaxing the assumption of a constant prevalence rate.

f(aias) by theoretical fix rate fR (rows) and minimum recall and specificity (columns):

fR      Min = 0.50        Min = 0.70        Min = 0.90
0.50    [0.000, 0.500]    [0.331, 0.500]    [0.500, 0.614]
0.70    [0.000, 0.700]    [0.334, 0.700]    [0.684, 0.700]
0.90    [0.000, 0.754]    [0.446, 0.787]    [0.779, 0.874]
1.00    [0.000, 0.756]    [0.538, 0.794]    [0.877, 0.911]

The results show that accounting for the shift in the prevalence rate modifies the final estimates of f(aias), downgrading what we can expect from the overall pipeline performance. For instance, examining the case in which fR=0.90 and comparing the results obtained with constant and shifted prevalence rate, the lower bound of the resulting f(aias) is always higher when the prevalence rate is constant: when the minimum recall and specificity are 0.50, the lower bound for constant prevalence rate is 0.450 and drops to 0.000 when accounting for a nonconstant prevalence rate; when they are 0.70, the lower bounds are, respectively, 0.630 and 0.446; and when they are 0.90, the lower bounds are, respectively, 0.810 and 0.779.

This points to the necessity to account for possible variation in the prevalence rate of the data set, by calculating the TP, FP, TN, and FN through which f(aias) is calculated employing the PPV and NPV, to get a more realistic estimate of the capacities of the pipeline.

We also see how progressively raising the minimum recall and specificity affects the final lower bound of the fix rate, both in the case of a constant prevalence rate and in the case in which the assumption is removed. For example, when the theoretical fix rate is 0.70 and the minimum recall and specificity are 0.50, the lower bound of the final fix rate is 0.350, while raising the minimum value of recall and specificity to 0.70 and then 0.90 makes the lower bound grow first to 0.490 and then to 0.630. The same can be said when the prevalence rate is not constant and the TP, FP, TN, and FN are calculated by employing the PPV and NPV. When the theoretical fix rate is 0.70, the lower bound of the final fix rate increases from 0.000, when the minimum recall and specificity are 0.50, to 0.684, when the minimum recall and specificity are 0.90.

7. CASE STUDY: AI‐BASED APR

We present a case study to measure the impact of uncertainty on AI‐based APR tools.

This case study examines the possibility of obtaining an AI‐augmented APR tool, composed of two AI subsystems, one dedicated to vulnerability detection, and the other to vulnerability repair.

We analyze a DL‐based APR tool, AIBugHunter (Fu et al., 2024). This pipeline is the result of the assembly of two systems, namely, LineVul (Fu & Tantithamthavorn, 2022), which performs vulnerability detection, and VulRepair (Fu et al., 2022), which performs bug fixing. Since the authors specified that they did not evaluate the whole AIBugHunter pipeline in the dedicated publication, but evaluated the two composing tools separately, we use this case study to show to what extent uncertainty can impact the overall performance of an APR pipeline composed of different AI subsystems trained on different data sets. We consider the data set on which AIBugHunter is tested, composed of 879 total code samples, all of which contain vulnerabilities. We calculate the number of samples that the first classifier of the pipeline flags as vulnerable by multiplying the total code samples by the recall reported in the publication dedicated to LineVul (Fu & Tantithamthavorn, 2022), which amounts to 0.86, obtaining 756 Bad code samples. Then, VulRepair (Fu et al., 2022), with a reported repairing accuracy of 0.44, is used to correct the bugs. We thus multiply the repairing accuracy by the number of Bad code samples, obtaining 333 Fixed code samples; of the 756 detected samples, 423 are therefore left uncorrected by the pipeline. We then use our simulation pipeline to account for uncertainty in the recall, considering the same number of code samples and the same point estimate for repairing accuracy. When accounting for uncertainty, the final repairing accuracy can be as high as 0.470 and as low as 0.030, compared to the starting point estimate of 0.44.
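The point estimates above follow from simple arithmetic, as the following back-of-the-envelope sketch shows (it reproduces only the point estimates, not the uncertainty intervals, which require the simulation pipeline).

```python
# Back-of-the-envelope reproduction of the case study point estimates
# (879 samples, LineVul recall 0.86, VulRepair repairing accuracy 0.44).
total = 879
detected = round(total * 0.86)        # ~756 samples flagged as Bad by the first classifier
fixed = round(detected * 0.44)        # ~333 samples repaired by VulRepair
unfixed_detected = detected - fixed   # ~423 detected samples left uncorrected
print(detected, fixed, unfixed_detected)
```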

8. DISCUSSION

8.1. Summary of results

The results of the first simulation show that, once the uncertainty in the recall of the vulnerability detectors is propagated through the pipeline, it affects the overall pipeline performance in terms of prevalence rate reduction and real fix rate. The simulated AI system can obtain the expected theoretical reduction of the prevalence rate, and a final fix rate equal to the theoretical fix rate, only in the best‐case scenario, that is, when the recall is maximum. In all the other cases, the real reduction of the number of vulnerable code samples and the final fix rate can vary widely, falling within the intervals calculated during the simulation. The case study corroborates this finding, showing that the final fix rate depends on the oscillation of the classifier's recall.

Second, our simulations show that the uncertainty characterizing the FNratio is smaller compared to the uncertainty characterizing PR(aias) and f(aias). That is, the width of the intervals related to the FNratio is smaller compared to the intervals of PR(aias) and f(aias). However, the incapacity of the pipeline to keep the FN stable between the first and the second classifiers could mean overlooking true vulnerabilities due to overapproximation of classifier performance, which could lead to untrustworthy decisions about security risks exposing the possible discrepancy between the preference of risk managers who use the AI system, and the risk tolerance embedded in the system (Paté‐Cornell, 2024).

The results of the second simulation show that the estimates for the final fix rate are lowered when accounting for shifts in the prevalence rate which can happen when testing and deploying a system, thus demonstrating the importance of accounting for variations in the prevalence rate before deploying a tested system in real‐world scenarios. Moreover, the second simulation also shows that increasing the minimum possible recall and specificity that can be sampled has a direct effect on the lower bound of the final fix rate indicating that it is fundamental to understand what is the minimum possible performance of a classifier when employing it in larger AI‐augmented systems.

Answer to RQ How to estimate the total error (or success rate) of the AI‐augmented system, given the propagating errors of the classifiers in the pipeline?

Our methodology to assess the risk of propagating uncertainty in a security pipeline can determine the overall intervals for PR(aias), f(aias), and FNratio through simulation. We use it to evaluate the potential propagation of uncertainty in a case study of an AI‐based program repair system (AIBugHunter), showing that although the best‐case fix rate could be as high as 47%, against a claimed point estimate of 44%, it could be as low as 3% once uncertainty is accounted for.

8.2. Policy implications on AI evaluation

The integration of AI subsystems in safety and security systems will continue, and will progressively align with the evolution of AI models (Collier et al., 2025).

Risk analysis practices are being revolutionized by the integration of AI in several safety and security domains, from cybersecurity (Kaur et al., 2023) to healthcare (Alowais et al., 2023), from predicting natural hazards (Gharehtoragh & Johnson, 2024) to implementing digital twins (DT), which allow to replicate real‐world objects and processes, also in safety and security scenarios (Kreuzer et al., 2024).

As a response to the accompanying risks, new regulations and standardizations have started to come into force worldwide (AI act (European Union, 2024), the US Executive Order No. 14110 (The White House, 2023), the European Union Aviation Safety Agency (EASA) Artificial Intelligence Roadmap (2023a), the ISO/IEC 42001:2023 (International Standard Organization, 2023), and the AI Risk Management Framework (NIST, 2024)).

However, in the process of AI development, application, and regulation, developers, researchers, and policymakers often regard AI models in isolation. They do not consider that AI chains result from the composition of multiple AI models, where the output of one model might become the input for the succeeding model in the toolchain. Even when uncertainty is quantified, uncertainty propagation is ignored, and as our research shows, this can have consequences on the final performance that are elusive to the decision maker.

In light of our results, we recommend that policies which are being developed to support external and impartial evaluation of AI models should include uncertainty quantification as an explicit indicator. In addition, when systems under evaluation are composed of multiple AI models, the uncertainty quantification should be performed at the system level, quantifying how the uncertainty propagates from one AI model to the next.

In what follows, we dive into the recently published guidelines on the use of ML applications in aviation. By focusing on a concrete safety‐critical domain, we highlight the gap regarding the quantification of uncertainty propagation and provide recommendations on possible guidelines improvement.

Policy recommendations for aviation. The necessity to consider uncertainty at the system level has implications for the policies to be adopted in scenarios where AI is applied to safety‐critical systems, as in aviation.

Although the EASA documents (2023a, 2023b) highlight the potential of AI applied to cybersecurity and the importance of uncertainty quantification, a major gap still exists:

  • Subsystem focus: safety assessment and information security constitute two important building blocks of the trustworthy AI framework defined by EASA, the first of which includes uncertainty management; however, the objectives to be reached are characterized at the subsystem level (EASA, 2023b):
    Objective SA‐01: The applicant should perform a safety (support) assessment for all AI‐based (sub)systems, identifying and addressing specificities introduced by AI/ML usage.
    Objective IS‐01: For each AI‐based (sub)system and its data sets, the applicant should identify those information security risks with an impact on safety, identifying and addressing specific threats introduced by AI/ML usage.

In contrast with the EASA approach, our results, obtained for APR tools but with implications that extend to other AI‐augmented systems, highlight the importance of modeling uncertainty at the system level, propagating it from the individual subsystems, to verify how the interplay of the components' uncertainties affects the entire system. To improve the guidelines, we therefore advise complementing the current evaluation policy with additional guidance emphasizing that safety and risk assessment, with the consequent uncertainty quantification, should be performed not only at the (sub)system level but also at the system level.

8.3. Limitations

No‐breaking assumption: In our research, we assume that the fixer cannot break the samples that the first classifier classifies as positive when they are actually negative. This is a simplification, since the fixer is not perfect and may well break the code; in future studies we will remove this assumption by experimenting with a scenario in which breaking is possible.

No‐degradation assumption: We assume that fixed elements cannot be distinguished from elements that were good from the beginning, so the performance of the classifier does not degrade on fixed code. In other words, we assume that the fixer generates code within the same distribution as the originals analyzed by the first classifier, which allows us to use a second classifier identical to the first. We plan to use two different classifiers in future work.

Generalization of simulator to real systems: Although the simulation is grounded in relevant theory and in recall values reported in related work, we are not working with a real system. In the next step of our research, we will experiment with an actual pipeline, accounting for uncertainty and checking to what extent the results obtained during the simulation are reflected in an actual system.

9. CONCLUSIONS

In practice, good performance of APR tools is still challenging to achieve. In a recent publication, Ami et al. (2023) surveyed 89 practitioners who use automated security testing, and one participant summarized the rate of false positives in reality: “(At present) 80% of them are false positives and 20% of them are something we can fix.” In addition, the failure to assess the risk of introducing false negatives into the system is an even bigger concern (Ami et al., 2023), which brings challenges for AI‐based APR adoption:

“If the tools miss something, we can not detect that issue, and we just overlook the issues …because no one ever reports about false negatives, and we don't check if the tool ever misses the vulnerabilities.”

We presented a new approach for assessing the risk of uncertainty propagation and showed, by simulation, that the final performance of an AI‐augmented system may be an entire order of magnitude lower (0.44 vs. 0.03) when estimating the effect of propagating errors. Our simulations of the level of uncertainty are in line with the recall values reported in the related work. In addition, the modular implementation of the simulator allows domain experts to use an internal or alternative data set of recall values to approximate p‐boxes and run a more precise, domain‐specific simulation of the propagating uncertainty in their systems. This would allow them to make more informed security risk decisions.
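As an illustration, a domain expert could plug an internal set of recall values into a sketch like the following; the recall values shown and the use of the observed minimum and maximum as a crude stand‐in for p‐box bounds are assumptions made for the example, not part of our published artifact.

```python
import numpy as np

# Hypothetical recall values collected from an internal benchmark or from
# the literature for the vulnerability detector in use (illustrative only).
observed_recalls = np.array([0.52, 0.61, 0.68, 0.73, 0.80, 0.86, 0.91])

def fix_rate_bounds(claimed_fix_rate, recalls):
    """Crude approximation of the propagated interval for f(aias) = fR * rec:
    take the observed spread of recall values as the envelope of plausible
    classifier performance and push it through the pipeline formula."""
    lower = claimed_fix_rate * recalls.min()
    upper = claimed_fix_rate * recalls.max()
    return lower, upper

print(fix_rate_bounds(0.90, observed_recalls))  # e.g. (0.468, 0.819)
```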

However, future work is needed to validate to what extent the proposed simulation is perceived as useful and how practitioners interpret the communicated uncertainty. For instance, a validation could test whether other factors, connected to real‐world and real‐time scenarios, such as network traffic and limited bandwidth, or human factors, affect the system's global uncertainty.

Beyond the scenarios modeled in this work, it is worth considering how errors propagate when the fixer modifies a misclassified sample, potentially introducing new vulnerabilities. Moreover, it is worth considering scenarios where the fixer introduces changes with patterns different from the ones that the first classifier is trained to recognize, as can happen when the classifier and the fixer are trained on different data sets (Fu et al., 2024); this is often the case, since organizations adopt technologies based on their needs. Capturing these scenarios would allow policymakers to assess when model retraining is required and to quantify the drop in residual uncertainty in their systems.

Finally, improvements in the policies that regulate the evaluation of AI systems are required to guide the risk assessment of AI‐based APR tools and, more generally, of AI systems composed of multiple AI models, so that the error propagating from the (sub)systems to the system level can be quantified.

ACKNOWLEDGMENTS

This work was partially supported by the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO) under the KIC HEWSTI Project under Grant no. KIC1.VE01.20.004, and the Horizon Europe Sec4AI4Sec Project under Grant no. 101120393.

APPENDIX A. MONTE CARLO (MC) SIMULATION

Here we present the results obtained through MC simulation. We implement the MC simulation when assuming a constant prevalence rate (Sections A.1.1, A.1.2, and A.1.3) and when relaxing this assumption, comparing the results obtained with a constant and a nonconstant prevalence rate (Sections A.2.1 and A.2.2). In each section, we present the results when running the MC simulation with 100 and with 1000 sampled recall values and show, through the standard error of the mean and the percentiles, how the different sample sizes impact the precision of the simulation.
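For orientation, the kind of summary statistics reported in the tables below (standard error of the mean of the bounds, percentiles) can be computed with a sketch along the following lines; the uniform recall sampling, the fix rate of 0.50, and the number of repeated runs are illustrative assumptions and do not reproduce the exact numbers in the tables.

```python
import numpy as np

rng = np.random.default_rng(42)

def run_once(fix_rate, n_recall):
    """One run: sample recall values and return the lower and upper bound
    of the resulting final fix rate f(aias) = fR * rec."""
    rec = rng.uniform(0.0, 1.0, n_recall)  # assumed recall sampling range
    f_aias = fix_rate * rec
    return f_aias.min(), f_aias.max()

def summarize(fix_rate=0.50, n_recall=100, n_runs=30):
    """Repeat the run to estimate the standard error of the mean of each
    bound, and report percentiles of pooled f(aias) samples."""
    bounds = np.array([run_once(fix_rate, n_recall) for _ in range(n_runs)])
    sem = bounds.std(axis=0, ddof=1) / np.sqrt(n_runs)
    pooled = fix_rate * rng.uniform(0.0, 1.0, n_recall * n_runs)
    percentiles = np.percentile(pooled, [25, 50, 75])
    return sem, percentiles

# Larger recall samples (100 vs. 1000) tighten the estimated bounds.
for n in (100, 1000):
    print(n, summarize(n_recall=n))
```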

A.1. Simulation one: Constant prevalence rate

A.1.1. PR(aias) calculation

See Tables A.1, A.2, A.3.

TABLE A.1.

The tables show the lower and upper bounds of the final prevalence rate obtained through Monte Carlo (MC) simulation with 100 sampled recall values (PR(aias)100) and 1000 sampled recall values (PR(aias)1000).

PR(aias)100
PR      fR=0.50           fR=0.70           fR=0.90           fR=1.00
0.10    [0.036, 0.095]    [0.016, 0.095]    [0.002, 0.093]    [0.000, 0.094]
0.50    [0.228, 0.488]    [0.125, 0.481]    [0.033, 0.477]    [0.000, 0.472]
1.00    [0.463, 0.977]    [0.266, 0.972]    [0.079, 0.960]    [0.000, 0.951]

PR(aias)1000
PR      fR=0.50           fR=0.70           fR=0.90           fR=1.00
0.10    [0.049, 0.098]    [0.029, 0.097]    [0.009, 0.096]    [0.000, 0.096]
0.50    [0.244, 0.482]    [0.145, 0.477]    [0.048, 0.471]    [0.000, 0.468]
1.00    [0.495, 0.972]    [0.296, 0.960]    [0.097, 0.948]    [0.000, 0.943]
TABLE A.2.

The tables show the standard error of the mean of the lower and upper bounds of the final prevalence rate, when sampling 100 recall values (σPR(aias)100) and 1000 recall values (σPR(aias)1000).

σPR(aias)100
PR      fR=0.50           fR=0.70           fR=0.90           fR=1.00
0.10    [0.007, 0.014]    [0.007, 0.018]    [0.008, 0.023]    [0.007, 0.026]
0.50    [0.004, 0.014]    [0.006, 0.019]    [0.008, 0.024]    [0.008, 0.027]
1.00    [0.004, 0.014]    [0.005, 0.019]    [0.007, 0.025]    [0.008, 0.028]

σPR(aias)1000
PR      fR=0.50           fR=0.70           fR=0.90           fR=1.00
0.10    [0.001, 0.004]    [0.002, 0.006]    [0.002, 0.008]    [0.002, 0.009]
0.50    [0.001, 0.004]    [0.002, 0.006]    [0.002, 0.007]    [0.002, 0.008]
1.00    [0.001, 0.004]    [0.002, 0.006]    [0.002, 0.008]    [0.002, 0.008]
TABLE A.3.

The tables show the 25th (P25), the 50th (P50), and the 75th (P75) percentiles for the final prevalence rate, when the initial prevalence rate and theoretical fix rate are equal to 0.50, and when sampling 100 recall values (PPR(aias)100) and 1000 recall values (PPR(aias)1000).

PPR(aias)100
fR      P25      P50      P75
0.50    0.254    0.310    0.388
0.70    0.156    0.233    0.338
0.90    0.055    0.160    0.291
1.00    0.000    0.126    0.268

PPR(aias)1000
fR      P25      P50      P75
0.50    0.249    0.310    0.380
0.70    0.149    0.236    0.334
0.90    0.050    0.162    0.289
1.00    0.000    0.125    0.265

A.1.2. F(aias) calculation

See Tables A.4, A.5, A.6.

TABLE A.4.

The tables show the upper and lower bounds of the final fix rate, when the initial prevalence rate and theoretical fix rate are 0.50, obtained when sampling 100 recall values (f(aias)100) and 1000 recall values (f(aias)1000).

fR      f(aias)100
0.50    [0.024, 0.544]
0.70    [0.038, 0.750]
0.90    [0.046, 0.934]
1.00    [0.056, 1.000]

fR      f(aias)1000
0.50    [0.035, 0.513]
0.70    [0.047, 0.709]
0.90    [0.058, 0.904]
1.00    [0.064, 1.000]
TABLE A.5.

The tables show the standard error of the mean, for the final fix rate, when the initial prevalence rate is 0.50, with 100 sampled recall values (σf(aias)100) and 1000 sampled recall values (σf(aias)1000).

fR      σf(aias)100
0.50    [0.004, 0.014]
0.70    [0.006, 0.019]
0.90    [0.008, 0.024]
1.00    [0.008, 0.027]

fR      σf(aias)1000
0.50    [0.001, 0.004]
0.70    [0.002, 0.006]
0.90    [0.002, 0.007]
1.00    [0.002, 0.008]
TABLE A.6.

The tables show the 25th (P25), the 50th (P50), and the 75th (P75) percentiles for the final fix rate, when the initial prevalence rate is 0.50. The tables report the percentiles when sampling 100 recall values (Pf(aias)100) and 1000 recall values (Pf(aias)1000).

Pf(aias)100
fR      P25      P50      P75
0.50    0.223    0.380    0.492
0.70    0.325    0.534    0.688
0.90    0.418    0.681    0.891
1.00    0.464    0.747    1.000

Pf(aias)1000
fR      P25      P50      P75
0.50    0.239    0.380    0.503
0.70    0.332    0.528    0.701
0.90    0.423    0.677    0.900
1.00    0.469    0.750    1.000

A.1.3. FNratio calculation

See Tables A.7, A.8, A.9.

TABLE A.7.

The tables show the upper and lower bounds for the false negatives ratio, when sampling 100 recall values (FNratio100) and 1000 recall values (FNratio1000).

fR      FNratio100
0.50    [1.111, 1.929]
0.70    [1.087, 1.389]
0.90    [1.160, 1.176]
1.00    [1.000, 1.000]

fR      FNratio1000
0.50    [1.376, 1.403]
0.70    [1.221, 1.236]
0.90    [1.081, 1.075]
1.00    [1.000, 1.000]
TABLE A.8.

The tables show the standard error of the mean of the lower and upper bounds of the final false negatives ratio when sampling 100 recall values (σFNratio100) and 1000 recall values (σFNratio1000).

fR      σFNratio100
0.50    [0.023, 0.018]
0.70    [0.019, 0.010]
0.90    [0.006, 0.005]
1.00    [0.000, 0.000]

fR      σFNratio1000
0.50    [0.006, 0.004]
0.70    [0.004, 0.003]
0.90    [0.001, 0.001]
1.00    [0.000, 0.000]
TABLE A.9.

The tables show the 25th (P25), the 50th (P50), and the 75th (P75) percentiles for the final FNratio. The tables report the percentiles when sampling 100 recall values (PFNratio100) and 1000 recall values (PFNratio1000).

PFNratio100
fR      P25      P50      P75
0.50    1.315    1.520    1.724
0.70    1.162    1.238    1.313
0.90    1.164    1.168    1.172
1.00    1.000    1.000    1.000

PFNratio1000
fR      P25      P50      P75
0.50    1.382    1.389    1.395
0.70    1.225    1.229    1.233
0.90    1.076    1.078    1.079
1.00    1.000    1.000    1.000

A.2. Simulation two: Beyond constant prevalence rate

A.2.1. F(aias) calculation with constant prevalence rate

See Tables A.10, A.11, A.12.

TABLE A.10.

The tables show the final fix rate calculated when the prevalence rate is constant, respectively, when the number of sampled recall values is 100 (f(aias)100) and when the number of sampled recall values is 1000 (f(aias)1000).

f(aias)100
        Min. Rec and Spec
fR      Min = 0.50        Min = 0.70        Min = 0.90
0.50    [0.240, 0.690]    [0.350, 0.680]    [0.450, 0.710]
0.70    [0.320, 0.850]    [0.440, 0.840]    [0.570, 0.850]
0.90    [0.410, 0.980]    [0.600, 0.970]    [0.750, 0.970]
1.00    [0.470, 1.000]    [0.640, 1.000]    [0.840, 1.000]

f(aias)1000
        Min. Rec and Spec
fR      Min = 0.50        Min = 0.70        Min = 0.90
0.50    [0.239, 0.519]    [0.340, 0.517]    [0.436, 0.516]
0.70    [0.338, 0.717]    [0.476, 0.717]    [0.618, 0.713]
0.90    [0.434, 0.912]    [0.618, 0.910]    [0.798, 0.910]
1.00    [0.489, 1.000]    [0.688, 1.000]    [0.891, 1.000]
TABLE A.11.

The tables show the standard error of the mean for the upper and lower bound of the final fix rate, with constant prevalence rate, and when the number of sampled recall values is 100 (σf(aias)100) and the number of sampled recall values is 1000 (σf(aias)1000).

σf(aias)100
        Min. Rec and Spec
fR      Min = 0.50        Min = 0.70        Min = 0.90
0.50    [0.006, 0.006]    [0.005, 0.005]    [0.005, 0.004]
0.70    [0.007, 0.007]    [0.005, 0.006]    [0.005, 0.005]
0.90    [0.008, 0.008]    [0.005, 0.006]    [0.003, 0.003]
1.00    [0.008, 0.009]    [0.005, 0.007]    [0.002, 0.003]

σf(aias)1000
        Min. Rec and Spec
fR      Min = 0.50        Min = 0.70        Min = 0.90
0.50    [0.001, 0.001]    [0.001, 0.001]    [0.000, 0.000]
0.70    [0.002, 0.002]    [0.001, 0.001]    [0.000, 0.000]
0.90    [0.003, 0.003]    [0.002, 0.002]    [0.001, 0.001]
1.00    [0.003, 0.003]    [0.002, 0.002]    [0.001, 0.001]
TABLE A.12.

The tables show the 25th (P25), the 50th (P50), and the 75th (P75) percentiles for the final fix rate when the prevalence rate is constant and the minimum recall and specificity are equal to 0.50. The tables report the percentiles when sampling 100 recall values (Pf(aias)100) and 1000 recall values (Pf(aias)1000).

Pf(aias)100
fR      P25      P50      P75
0.50    0.378    0.460    0.540
0.70    0.470    0.590    0.700
0.90    0.570    0.715    0.890
1.00    0.618    0.770    1.000

Pf(aias)1000
fR      P25      P50      P75
0.50    0.260    0.377    0.492
0.70    0.360    0.526    0.687
0.90    0.461    0.676    0.881
1.00    0.511    0.750    0.981

A.2.2. F(aias) calculation without constant prevalence rate

See Tables A.13, A.14, A.15.

TABLE A.13.

The tables show the calculation of the upper and lower bounds of the final fix rate without a constant prevalence rate. The two tables show the results when the number of recall values sampled is 100 (f(aias)100) and when the number of recall values sampled is 1000 (f(aias)1000).

f(aias)100
        Min. Rec and Spec
fR      min = 0.50        min = 0.70        min = 0.90
0.50    [0.000, 0.690]    [0.319, 0.690]    [0.640, 0.710]
0.70    [0.000, 0.850]    [0.190, 0.863]    [0.667, 0.850]
0.90    [0.000, 0.900]    [0.250, 0.875]    [0.700, 0.950]
1.00    [0.000, 0.910]    [0.233, 0.921]    [0.733, 0.963]

f(aias)1000
        Min. Rec and Spec
fR      min = 0.50        min = 0.70        min = 0.90
0.50    [0.000, 0.519]    [0.340, 0.519]    [0.550, 0.630]
0.70    [0.000, 0.717]    [0.345, 0.717]    [0.650, 0.713]
0.90    [0.000, 0.770]    [0.460, 0.790]    [0.756, 0.860]
1.00    [0.000, 0.771]    [0.550, 0.810]    [0.851, 0.920]
TABLE A.14.

The tables show the standard error of the mean for the lower and upper bounds of the final fix rate when the initial prevalence rate is not constant. The two tables show the results when the number of recall values is 100 (σf(aias)100) and when the number of recall values is 1000 (σf(aias)1000).

σf(aias)100
        Min. Rec and Spec
fR      min = 0.50        min = 0.70        min = 0.90
0.50    [0.005, 0.016]    [0.005, 0.008]    [0.004, 0.001]
0.70    [0.008, 0.016]    [0.003, 0.009]    [0.001, 0.002]
0.90    [0.015, 0.017]    [0.008, 0.010]    [0.002, 0.003]
1.00    [0.019, 0.017]    [0.011, 0.011]    [0.004, 0.004]

σf(aias)1000
        Min. Rec and Spec
fR      min = 0.50        min = 0.70        min = 0.90
0.50    [0.001, 0.005]    [0.002, 0.003]    [0.001, 0.000]
0.70    [0.003, 0.005]    [0.001, 0.003]    [0.000, 0.000]
0.90    [0.005, 0.006]    [0.003, 0.003]    [0.001, 0.001]
1.00    [0.006, 0.006]    [0.004, 0.004]    [0.001, 0.001]
TABLE A.15.

The tables show the 25th (P25), the 50th (P50), and the 75th (P75) percentiles for the final fix rate when prevalence rate is not constant and when minimum recall and specificity are equal to 0.50. The tables report the percentiles when sampling 100 recall values (Pf(aias)100) and 1000 recall values (Pf(aias)1000).

Pf(aias)100
fR      P25      P50      P75
0.50    0.067    0.440    0.571
0.70    0.064    0.461    0.711
0.90    0.062    0.456    0.853
1.00    0.078    0.459    0.891

Pf(aias)1000
fR      P25      P50      P75
0.50    0.009    0.440    0.505
0.70    0.008    0.458    0.697
0.90    0.009    0.475    0.731
1.00    0.009    0.481    0.747

APPENDIX B. FORMULA DERIVATIONS

B.1. Derivation of AI‐augmented system fix rate from the positives

\begin{align*}
Pos(aias) &= (1 - f_R)\cdot TP_{1st} + FN_{1st} \tag{B.1}\\
Pos(aias) &= Pos - f(aias)\cdot Pos \tag{B.2}\\
PR\cdot N - f(aias)\cdot PR\cdot N &= (1 - f_R)\cdot rec\cdot PR\cdot N + (1 - rec)\cdot PR\cdot N \tag{B.3}\\
1 - f(aias) &= (1 - f_R)\cdot rec + (1 - rec) \tag{B.4}\\
f(aias) &= 1 - (1 - f_R)\cdot rec - (1 - rec) \tag{B.5}\\
&= 1 - rec + f_R\cdot rec - (1 - rec) \tag{B.6}\\
&= f_R\cdot rec. \tag{B.7}
\end{align*}
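As a numeric illustration of Equation (B.7), assume (for the example only) a theoretical fix rate $f_R = 0.90$ and a first‐classifier recall $rec = 0.50$:

\[
f(aias) = f_R \cdot rec = 0.90 \times 0.50 = 0.45,
\]

that is, even a fixer with a 90% theoretical fix rate repairs fewer than half of the truly vulnerable samples when the upstream classifier retrieves only half of them.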

B.2. Derivation of PR(aias) from Pos(aias)

\begin{align*}
Pos(aias) &= Pos - f(aias)\cdot Pos \tag{B.8}\\
PR(aias)\cdot N &= PR\cdot N - f_R\cdot rec\cdot PR\cdot N \tag{B.9}\\
PR(aias) &= (1 - f_R\cdot rec)\cdot PR. \tag{B.10}
\end{align*}

B.3. Derivation of TPR(aias)

\begin{align*}
TPR(aias) &= \frac{TP_{2nd}}{Pos(aias)} \tag{B.11}\\
&= \frac{(1 - f_R)\cdot TP_{1st}\cdot rec}{Pos - f(aias)\cdot Pos} \tag{B.12}\\
&= \frac{(1 - f_R)\cdot rec\cdot PR\cdot N\cdot rec}{PR\cdot N - f_R\cdot rec\cdot PR\cdot N} \tag{B.13}\\
&= \frac{(1 - f_R)\cdot rec\cdot rec}{1 - f_R\cdot rec}. \tag{B.14}
\end{align*}

B.4. TPR(aias) ≤ TPR

\begin{align*}
\frac{(1 - f_R)\cdot rec}{1 - f_R\cdot rec} &\le 1 \tag{B.15}\\
(1 - f_R)\cdot rec &\le 1 - f_R\cdot rec \tag{B.16}\\
rec - f_R\cdot rec &\le 1 - f_R\cdot rec \tag{B.17}\\
rec &\le 1. \tag{B.18}
\end{align*}

B.5. Derivation of the false positives

\begin{align*}
FP(aias) &= rec\cdot \frac{1 - prec}{prec}\cdot (1 - f_R)\cdot rec\cdot PR\cdot N. \tag{B.19}
\end{align*}

B.6. Derivation of the FAR(aias)

\begin{align*}
FAR(aias) &= \frac{FP(aias)}{Neg(aias)} = \frac{FP(aias)}{N - Pos(aias)} \tag{B.20}\\
&= \frac{rec\cdot \frac{1 - prec}{prec}\cdot (1 - f_R)\cdot rec\cdot PR\cdot N}{N - (PR\cdot N - f(aias)\cdot PR\cdot N)} \tag{B.21}\\
&= \frac{rec\cdot \frac{1 - prec}{prec}\cdot (1 - f_R)\cdot rec\cdot PR}{1 - (PR - f_R\cdot rec\cdot PR)} \tag{B.22}\\
&= \frac{rec\cdot \frac{1 - prec}{prec}\cdot (1 - f_R)\cdot rec\cdot PR}{1 - (1 - f_R\cdot rec)\cdot PR} \tag{B.23}\\
&= \frac{rec^2\cdot \frac{1 - prec}{prec}\cdot (1 - f_R)\cdot PR}{1 - (1 - f_R\cdot rec)\cdot PR}. \tag{B.24}
\end{align*}

B.7. Proof that the AI‐augmented system false alert rate (FAR) is less than or equal to the FAR of the first classifier (FAR(aias) ≤ FAR)

\begin{align*}
FAR(aias) &\le FAR \tag{B.25}\\
\frac{rec^2\cdot \frac{1 - prec}{prec}\cdot (1 - f_R)\cdot PR}{1 - (1 - f_R\cdot rec)\cdot PR} &\le rec\cdot \frac{1 - prec}{prec}\cdot \frac{PR\cdot N}{N - PR\cdot N} \tag{B.26}\\
\frac{rec\cdot (1 - f_R)\cdot PR}{1 - (1 - f_R\cdot rec)\cdot PR} &\le \frac{PR}{1 - PR} \tag{B.27}\\
\frac{rec\cdot (1 - f_R)}{1 - (1 - f_R\cdot rec)\cdot PR} &\le \frac{1}{1 - PR} \tag{B.28}\\
rec\cdot (1 - f_R)\cdot (1 - PR) &\le 1 - (1 - f_R\cdot rec)\cdot PR \tag{B.29}\\
rec\cdot (1 - f_R - PR + f_R\cdot PR) &\le 1 - PR + f_R\cdot rec\cdot PR \tag{B.30}\\
rec - rec\cdot f_R - rec\cdot PR + rec\cdot f_R\cdot PR &\le 1 - PR + f_R\cdot rec\cdot PR \tag{B.31}\\
rec - rec\cdot f_R - rec\cdot PR &\le 1 - PR \tag{B.32}\\
rec\cdot (1 - f_R - PR) &\le 1 - PR \tag{B.33}\\
&\text{if } 1 - f_R - PR > 0, \text{ that is, } 1 > f_R + PR\text{:} \tag{B.34}\\
rec &\le \frac{1 - PR}{1 - f_R - PR}, \quad\text{and } 1 - f_R - PR \le 1 - PR \text{ implies } 1 \le \frac{1 - PR}{1 - f_R - PR} \tag{B.35}\\
rec \le 1 &\le \frac{1 - PR}{1 - f_R - PR} \quad\text{always true;} \tag{B.36}\\
&\text{if } 1 - f_R - PR < 0, \text{ that is, } 1 < f_R + PR\text{:} \tag{B.37}\\
rec &\ge \frac{PR - 1}{f_R + PR - 1}, \quad\text{and } PR - 1 \le 0 \text{ implies } \frac{PR - 1}{f_R + PR - 1} \le 0 \tag{B.38}\\
rec \ge 0 &\ge \frac{PR - 1}{f_R + PR - 1} \quad\text{always true.} \tag{B.39}
\end{align*}
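As a numeric sanity check of the proof above, with illustrative values $rec = 0.8$, $prec = 0.5$, $f_R = 0.5$, and $PR = 0.3$ (assumed for the example only):

\[
FAR = rec\cdot\frac{1 - prec}{prec}\cdot\frac{PR}{1 - PR} = 0.8\cdot 1\cdot\frac{0.3}{0.7} \approx 0.343,
\qquad
FAR(aias) = \frac{0.8^{2}\cdot 1\cdot 0.5\cdot 0.3}{1 - (1 - 0.5\cdot 0.8)\cdot 0.3} = \frac{0.096}{0.82} \approx 0.117,
\]

so indeed FAR(aias) ≤ FAR.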

B.8. Derivation of the total number of elements passed to the fixer

\begin{align*}
N_{2nd} &= TP_{1st} + FP_{1st} \tag{B.40}\\
&= rec\cdot PR\cdot N + rec\cdot \frac{1 - prec}{prec}\cdot PR\cdot N \tag{B.41}\\
&= \frac{prec\cdot rec + rec - rec\cdot prec}{prec}\cdot PR\cdot N \tag{B.42}\\
&= \frac{rec}{prec}\cdot PR\cdot N. \tag{B.43}
\end{align*}

B.9. Derivation of the false positives starting from the precision

\begin{align*}
prec &= \frac{TP}{TP + FP} \tag{B.44}\\
(TP + FP)\cdot prec &= TP \tag{B.45}\\
FP\cdot prec &= TP\cdot (1 - prec) \tag{B.46}\\
FP &= Pos\cdot rec\cdot \frac{1 - prec}{prec}. \tag{B.47}
\end{align*}

B.10. Derivation of the final number of true positives

\begin{align*}
TP(aias) &= (1 - f_R)\cdot TP_{1st}\cdot rec \tag{B.48}\\
&= (1 - f_R)\cdot (Pos\cdot rec)\cdot rec \tag{B.49}\\
&= (1 - f_R)\cdot rec^2\cdot Pos \tag{B.50}\\
&= (1 - f_R)\cdot rec^2\cdot PR\cdot N. \tag{B.51}
\end{align*}

B.11. Derivation of the AI‐augmented system false negatives (FN(aias))

\begin{align*}
FN(aias) &= (1 - f_R)\cdot TP_{1st}\cdot (1 - rec) + FN_{1st} \tag{B.52}\\
&= (1 - f_R)\cdot (Pos\cdot rec)\cdot (1 - rec) + Pos\cdot (1 - rec) \tag{B.53}\\
&= \big((1 - f_R)\cdot rec\cdot (1 - rec) + (1 - rec)\big)\cdot Pos \tag{B.54}\\
&= \big((1 - f_R)\cdot rec + 1\big)\cdot (1 - rec)\cdot Pos \tag{B.55}\\
&= \big(1 + (1 - f_R)\cdot rec\big)\cdot (1 - rec)\cdot PR\cdot N \tag{B.56}\\
&= \big(1 + (1 - f_R)\cdot rec\big)\cdot FN_{1st}. \tag{B.57}
\end{align*}

B.12. Derivation of the AI‐augmented system prevalence rate (PR(aias))

\begin{align*}
PR(aias) &= \frac{TP(aias) + FN(aias)}{TP(aias) + FN(aias) + TN(aias) + FP(aias)} \tag{B.58}\\
&= \frac{TP(aias) + FN(aias)}{N} \tag{B.59}\\
&= \frac{(1 - f_R)\cdot rec^2\cdot PR\cdot N + \big(1 + (1 - f_R)\cdot rec\big)\cdot (1 - rec)\cdot PR\cdot N}{N} \tag{B.60}\\
&= (1 - f_R)\cdot rec^2\cdot PR + \big(1 + (1 - f_R)\cdot rec\big)\cdot (1 - rec)\cdot PR \tag{B.61}\\
&= \Big((1 - f_R)\cdot rec^2 + \big(1 + (1 - f_R)\cdot rec\big)\cdot (1 - rec)\Big)\cdot PR \tag{B.62}\\
&= \Big((1 - f_R)\cdot rec^2 + (1 - rec) + (1 - f_R)\cdot rec\cdot (1 - rec)\Big)\cdot PR \tag{B.63}\\
&= \Big((1 - f_R)\cdot rec^2 + (1 - rec) + (1 - f_R)\cdot rec - (1 - f_R)\cdot rec^2\Big)\cdot PR \tag{B.64}\\
&= \big(1 - rec + rec - f_R\cdot rec\big)\cdot PR \tag{B.65}\\
&= (1 - f_R\cdot rec)\cdot PR. \tag{B.66}
\end{align*}
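The closed forms above can also be re‐checked mechanically. The following sketch, our own illustration using sympy and not part of the article's published artifact, verifies Equations (B.7) and (B.66) symbolically:

```python
import sympy as sp

fR, rec, PR, N, prec = sp.symbols('fR rec PR N prec', positive=True)

# Quantities of the first classifier, as used throughout Appendix B.
Pos = PR * N
TP1 = rec * Pos
FN1 = (1 - rec) * Pos

# Equation (B.1): positives that survive the fixer.
Pos_aias = (1 - fR) * TP1 + FN1

# Final fix rate from (B.2): Pos(aias) = Pos - f(aias) * Pos.
f_aias = sp.simplify((Pos - Pos_aias) / Pos)
assert sp.simplify(f_aias - fR * rec) == 0              # (B.7)

# Final prevalence rate from (B.58)-(B.66).
TP_aias = (1 - fR) * rec**2 * Pos
FN_aias = (1 + (1 - fR) * rec) * (1 - rec) * Pos
PR_aias = sp.simplify((TP_aias + FN_aias) / N)
assert sp.simplify(PR_aias - (1 - fR * rec) * PR) == 0   # (B.66)

print(f_aias, PR_aias)
```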

Mezzi, E. , Papotti, A. , Massacci, F. , & Tuma, K. (2025). Risks of ignoring uncertainty propagation in AI‐augmented security pipelines. Risk Analysis, 45, 4469–4489. 10.1111/risa.70059

REFERENCES

  1. Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D., Liu, L., Ghavamzadeh, M., Fieguth, P., Cao, X., Khosravi, A., Acharya, U. R., Makarenkov, V., & Saeid, N. (2021). A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76, 243–297.
  2. Akter, M. S., Shahriar, H., & Bhuiya, Z. A. (2022). Automated vulnerability detection in source code using quantum natural language processing. In International Conference on Ubiquitous Security (pp. 83–102). Springer.
  3. Alowais, S. A., Alghamdi, S. S., Alsuhebany, N., Alqahtani, T., Alshaya, A. I., Almohareb, S. N., & others. (2023). Revolutionizing healthcare: The role of artificial intelligence in clinical practice. BMC Medical Education, 23(1), 689.
  4. Ami, A. S., Moran, K., Poshyvanyk, D., & Nadkarni, A. (2023). “False negative - that one is going to kill you”: Understanding industry perspectives of static analysis based security testing. In 2024 IEEE Symposium on Security and Privacy (SP) (pp. 19–19). IEEE Computer Society.
  5. Benke, K., Norng, S., Robinson, N., Benke, L., & Peterson, T. (2018). Error propagation in computer models: Analytic approaches, advantages, disadvantages and constraints. Stochastic Environmental Research and Risk Assessment, 32, 2971–2985.
  6. Bui, Q.-C., Paramitha, R., Vu, D.-L., Massacci, F., & Scandariato, R. (2024). APR4Vul: An empirical study of automatic program repair techniques on real-world Java vulnerabilities. Empirical Software Engineering, 29(1), 18.
  7. Chen, L., Wang, L., Hu, Z., Tao, Y., Song, W., An, Y., & Li, X. (2022). Combining Z-score and maternal copy number variation analysis increases the positive rate and accuracy in non-invasive prenatal testing. Frontiers in Genetics, 13, 887176.
  8. Chi, J., Qu, Y., Liu, T., Zheng, Q., & Yin, H. (2022). SeqTrans: Automatic vulnerability fix via sequence to sequence learning. IEEE Transactions on Software Engineering, 49(2), 564–585.
  9. Collier, Z. A., Gruss, R. J., & Abrahams, A. S. (2025). How good are large language models at product risk assessment? Risk Analysis, 45(4), 766–789.
  10. Cox, L. A., Jr (2020). Answerable and unanswerable questions in risk analysis with open-world novelty. Risk Analysis, 40(S1), 2144–2177.
  11. Dashevskyi, S., Brucker, A. D., & Massacci, F. (2018). A screening test for disclosed vulnerabilities in FOSS components. IEEE Transactions on Software Engineering, 45(10), 945–966.
  12. Derr, E., Bugiel, S., Fahl, S., Acar, Y., & Backes, M. (2017). Keep me updated: An empirical study of third-party library updatability on Android. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (pp. 2187–2200).
  13. EASA. (2023a). EASA Artificial Intelligence Roadmap 2.0 [White paper]. EASA. https://www.easa.europa.eu/en/downloads/137919/en
  14. EASA. (2023b). EASA Concept Paper: First usable guidance for Level 1&2 machine learning applications [White paper]. EASA. https://www.easa.europa.eu/en/downloads/137631/en
  15. Elsevier. (2024). What does “relevance” mean in Scopus? Scopus.
  16. European Union. (2024). Artificial Intelligence Act. Official Journal of the European Union, L 1689. https://artificialintelligenceact.eu
  17. Fan, Y., Wan, C., Fu, C., Han, L., & Xu, H. (2023). VDoTR: Vulnerability detection based on tensor representation of comprehensive code graphs. Computers & Security, 130, 103247.
  18. Ferson, S., Balch, M., Sentz, K., & Siegrist, J. (2013). Computing with confidence. In Proceedings of the Eighth International Symposium on Imprecise Probability: Theories and Applications, SIPTA (pp. 129–138).
  19. Ferson, S., & Ginzburg, L. R. (1996). Different methods are needed to propagate ignorance and variability. Reliability Engineering & System Safety, 54(2–3), 133–144.
  20. Ferson, S., Kreinovich, V., Ginzburg, L., Myers, D. S., & Sentz, K. (2003). Constructing probability boxes and Dempster-Shafer structures. Number 4015. Sandia National Laboratories.
  21. Fu, M., & Tantithamthavorn, C. (2022). LineVul: A transformer-based line-level vulnerability prediction. In Proceedings of the 19th International Conference on Mining Software Repositories (pp. 608–620).
  22. Fu, M., Tantithamthavorn, C., Le, T., Kume, Y., Nguyen, V., Phung, D., & Grundy, J. (2024). AIBugHunter: A practical tool for predicting, classifying and repairing software vulnerabilities. Empirical Software Engineering, 29(1), 4.
  23. Fu, M., Tantithamthavorn, C., Le, T., Nguyen, V., & Phung, D. (2022). VulRepair: A T5-based automated software vulnerability repair. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 935–947).
  24. Gawlikowski, J., Tassi, C. R. N., Ali, M., Lee, J., Humt, M., Feng, J., Kruspe, A., Triebel, R., Jung, P., Roscher, R., Shahzad, M., Yang, W., Bamler, R., & Zhu, X. X. (2023). A survey of uncertainty in deep neural networks. Artificial Intelligence Review, 56(Suppl 1), 1513–1589.
  25. Gharehtoragh, M. A., & Johnson, D. R. (2024). Using surrogate modeling to predict storm surge on evolving landscapes under climate change. npj Natural Hazards, 1(1), 33.
  26. Gray, A., Ferson, S., Kreinovich, V., & Patelli, E. (2022). Distribution-free risk analysis. International Journal of Approximate Reasoning, 146, 133–156.
  27. Gray, N., Calleja, D., Wimbush, A., Miralles-Dolz, E., Gray, A., De Angelis, M., Derrer-Merk, E., Oparaji, B. U., Stepanov, V., Clearkin, L., & Ferson, S. (2020). Is “no test is better than a bad test”? Impact of diagnostic uncertainty in mass testing on the spread of COVID-19. PLoS One, 15(10), e0240775.
  28. Gray, N., De Angelis, M., Calleja, D., & Ferson, S. (2019). A problem in the Bayesian analysis of data without gold standards. In 29th European Safety and Reliability Conference, ESREL 2019 (pp. 2628–2634).
  29. Gray, N., Ferson, S., De Angelis, M., Gray, A., & de Oliveira, F. B. (2022). Probability bounds analysis for Python. Software Impacts, 12, 100246.
  30. Guikema, S. (2020). Artificial intelligence for natural hazards risk analysis: Potential, challenges, and research needs. Risk Analysis, 40(6), 1117–1123.
  31. Hou, F., Zhou, K., Li, L., Tian, Y., Li, J., & Li, J. (2022). A vulnerability detection algorithm based on transformer model. In International Conference on Artificial Intelligence and Security (pp. 43–55).
  32. Huang, J., Borges, N., Bugiel, S., & Backes, M. (2019). Up-to-crash: Evaluating third-party library updatability on Android. In 2019 IEEE European Symposium on Security and Privacy (EuroS&P) (pp. 15–30).
  33. Hüllermeier, E., & Waegeman, W. (2021). Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110, 457–506.
  34. International Organization for Standardization. (2023). ISO/IEC 42001:2023 - Information technology - Artificial intelligence - Management system.
  35. Iskandar, R. (2021). Probability bound analysis: A novel approach for quantifying parameter uncertainty in decision-analytic modeling and cost-effectiveness analysis. Statistics in Medicine, 40(29), 6501–6522.
  36. Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., & Mikolov, T. (2016). FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
  37. Kaur, R., Gabrijelčič, D., & Klobučar, T. (2023). Artificial intelligence for cybersecurity: Literature review and future research directions. Information Fusion, 97, 101804.
  38. Kreuzer, T., Papapetrou, P., & Zdravkovic, J. (2024). Artificial intelligence in digital twins—A systematic literature review. Data & Knowledge Engineering, 151, 102304.
  39. Kula, R. G., German, D. M., Ouni, A., Ishio, T., & Inoue, K. (2018). Do developers update their library dependencies? An empirical study on the impact of security advisories on library migration. Empirical Software Engineering, 23(1), 384–417.
  40. Li, Y., Wang, S., & Nguyen, T. N. (2022). Dear: A novel deep learning-based approach for automated program repair. In Proceedings of the 44th International Conference on Software Engineering (pp. 511–523).
  41. Liu, K., Li, L., Koyuncu, A., Kim, D., Liu, Z., Klein, J., & Bissyandé, T. F. (2021). A critical review on the evaluation of automated program repair systems. Journal of Systems and Software, 171, 110817.
  42. Liu, X., You, X., Zhang, X., Wu, J., & Lv, P. (2020). Tensor graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34 (pp. 8409–8416).
  43. Long, F., Amidon, P., & Rinard, M. (2017). Automatic inference of code transforms for patch generation. In Proceedings of the 11th Joint Meeting on Foundations of Software Engineering (pp. 727–739).
  44. Metropolis, N., & Ulam, S. (1949). The Monte Carlo method. Journal of the American Statistical Association, 44(247), 335–341.
  45. Mezzi, E., & Papotti, A. (2024). Simulator for AI-augmented systems. https://github.com/EMezzi/AI-Augmented
  46. Nateghi, R., & Aven, T. (2021). Risk analysis in the age of big data: The promises and pitfalls. Risk Analysis, 41(10), 1751–1758.
  47. NIST. (2021). NIST software assurance reference dataset. https://samate.nist.gov/SARD
  48. NIST. (2024). Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. https://www.nist.gov/itl/ai-risk-management-framework
  49. Page, M. J., Moher, D., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L., Tetzlaff, J. M., Akl, E. A., Brennan, S. E., Chou, R., Glanville, J., Grimshaw, J. M., Hröbjartsson, A., Lalu, M. M., Li, T., Loder, E. W., Mayo-Wilson, E., McDonald, S., … McKenzie, J. E. (2021). PRISMA 2020 explanation and elaboration: Updated guidance and exemplars for reporting systematic reviews. BMJ, 372, n160.
  50. Parikh, R., Mathai, A., Parikh, S., Sekhar, G. C., & Thomas, R. (2008). Understanding and using sensitivity, specificity and predictive values. Indian Journal of Ophthalmology, 56(1), 45–50.
  51. Pashchenko, I., Plate, H., Ponta, S. E., Sabetta, A., & Massacci, F. (2022). Vuln4Real: A methodology for counting actually vulnerable dependencies. IEEE Transactions on Software Engineering, 48(5), 1592–1609.
  52. Paté-Cornell, E. (2024). Preferences in AI algorithms: The need for relevant risk attitudes in automated decisions under uncertainties. Risk Analysis, 44(10), 2317–2323.
  53. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543).
  54. Perez-Cerrolaza, J., Abella, J., Borg, M., Donzella, C., Cerquides, J., Cazorla, F. J., Englund, C., Tauber, M., Nikolakopoulos, G., & Flores, J. L. (2024). Artificial intelligence for safety-critical systems in industrial and transportation domains: A survey. ACM Computing Surveys, 56(7), 1–40.
  55. Saha, S., Saha, R., & Prasad, M. R. (2019). Harnessing evolution for multi-hunk program repair. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE) (pp. 13–24). IEEE.
  56. The White House. (2023). Executive Order 14110 on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. https://www.federalregister.gov/documents/2023/11/01/2023-24283/safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence
  57. Tufano, M., Palomba, F., Bavota, G., Oliveto, R., Di Penta, M., De Lucia, A., & Poshyvanyk, D. (2017). When and why your code starts to smell bad (and whether the smells go away). IEEE Transactions on Software Engineering, 43(11), 1063–1088.
  58. Xia, C. S., & Zhang, L. (2022). Less training, more repairing please: Revisiting automated program repair via zero-shot learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 959–971).
  59. Yang, G., Min, K., & Lee, B. (2020). Applying deep learning algorithm to automatic bug localization and repair. In Proceedings of the 35th Annual ACM Symposium on Applied Computing (pp. 1634–1641).
  60. Yang, H., Yang, H., & Zhang, L. (2022). VDHGT: A source code vulnerability detection method based on heterogeneous graph transformer. In International Symposium on Cyberspace Safety and Security (pp. 217–224).
  61. Ye, H., Martinez, M., & Monperrus, M. (2021). Automated patch assessment for program repair at scale. Empirical Software Engineering, 26, 1–38.
  62. Zhang, C., & Xin, Y. (2023). VulGAI: Vulnerability detection based on graphs and images. Computers & Security, 135, 103501.
  63. Zhou, Y., Liu, S., Siow, J., Du, X., & Liu, Y. (2019). Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems, 32.
