PLOS One. 2025 Sep 26;20(9):e0321440. doi: 10.1371/journal.pone.0321440

Partial verification bias correction using scaled inverse probability resampling for binary diagnostic tests

Wan Nor Arifin 1,*, Umi Kalsom Yusof 2,*
Editor: Hayrunnisa Nadaroglu
PMCID: PMC12469146  PMID: 41004551

Abstract

Diagnostic accuracy studies are crucial for evaluating new tests before their clinical application. These tests are compared with gold standard tests, and accuracy measures such as sensitivity (Sn) and specificity (Sp) are often calculated. However, these studies frequently suffer from partial verification bias (PVB) due to selective verification of patients, which leads to biased accuracy estimates. Among the methods for correcting PVB under the missing-at-random assumption for binary diagnostic tests, a bootstrap-based method known as the inverse probability bootstrap (IPB) was proposed. IPB demonstrated low bias for estimating Sn and Sp, but exhibited higher standard errors (SE) than other PVB correction methods and only corrected the distribution of the verified portion of the PVB data. This paper introduces two new methods to address these limitations: scaled inverse probability weighted resampling (SIPW) and scaled inverse probability weighted balanced resampling (SIPW-B), both built upon IPB. Using simulated and clinical datasets, SIPW and SIPW-B were compared against IPB and other existing methods (Begg and Greenes’, inverse probability weighting estimator, and multiple imputation). For the simulated data sets, different combinations of disease prevalence (0.4 and 0.1), Sn (0.3 to 0.9), Sp (0.6 and 0.9), and sample sizes (200 and 1000) were generated. Two commonly used clinical datasets in PVB correction studies were also used. Performance was evaluated using bias and SE for the simulated data. Simulation results showed that both new methods outperformed IPB by producing lower bias and SE for Sn and Sp estimation, showing results comparable to existing methods, and demonstrating good performance at low disease prevalence. In clinical datasets, SIPW and SIPW-B were consistent with existing methods. The new methods also improve upon IPB by allowing full data restoration. Although the methods are computationally demanding at present, this limitation is expected to become less important as computing power continues to increase.

Introduction

Diagnostic tests are crucial in medical care, so ensuring their clinical validity through diagnostic accuracy studies is essential [1,2]. These studies involve comparing a new test with an established gold standard to evaluate its performance using accuracy measures such as sensitivity (Sn) and specificity (Sp) for binary tests [1,3–5]. However, verifying disease status with the gold standard can be expensive, time-consuming, and invasive [1,5–8]. This verification challenge often leads to partial verification bias (PVB), where patients with positive test results are predominantly selected for gold standard verification, while fewer patients with negative test results are verified [1,6,8,9]. This gives rise to a missing-at-random (MAR) missing data mechanism, as the decision to verify depends on the diagnostic test result [5,6].

It is important to correct bias during analysis because PVB leads to biased estimates of diagnostic accuracy measures [1,6,10]. This can affect clinical practice, as biased estimates may result in clinically invalid tests and incorrect clinical decisions [2,6]. Methods for correcting PVB have been comprehensively reviewed elsewhere [2,11]. For binary tests under the MAR assumption, PVB correction methods can be categorized into Begg and Greenes’ (BG)-based, propensity score (PS)-based, and multiple imputation (MI) methods. BG-based and MI methods adjust accuracy measures by calculating the probability of disease status given the test result, yielding estimates that are unbiased under the MAR assumption [3,12]. In contrast, PS-based methods estimate the probability of verification given the test result and use weighting to correct the bias [13,14]. Several medical studies have applied these correction methods, highlighting the importance of bias correction [15–19].

Inverse probability bootstrap (IPB) sampling was introduced to address sampling bias in model-based analyses [20]. While the bootstrap technique is traditionally used for estimating standard errors (SE), IPB leverages this technique to achieve unbiased parameter estimates by creating weighted samples. This method corrects the sample distribution without requiring extensive modifications or new methods [20]. Because it is a bootstrap approach, IPB simplifies the estimation of SE and enables the calculation of confidence intervals for statistical inference [20].

The PS-based method for PVB correction and IPB share a common approach; both begin by estimating the selection probability, or verification probability in the context of PVB, and then use this probability to correct bias through weighting methods. The IPB approach offers an attractive way to address bias because it relies on the bootstrap technique, which provides several advantages. Therefore, IPB was adapted in the context of PVB correction in a study by Arifin and Yusof [21].

In their study [21], IPB demonstrated low bias in the estimation of Sn and Sp. However, it exhibited a relatively higher SE than other PVB correction methods and only corrected the distribution of the verified portion of the PVB data. To address these limitations, this study proposes two methods: scaled inverse probability weighted resampling (SIPW) and scaled inverse probability weighted balanced resampling (SIPW-B), designed for PVB correction under the MAR assumption for binary diagnostic tests.

Materials and methods

The simulated and clinical data sets used in this study, the proposed methods for PVB correction based on IPB, the metrics for performance evaluation, the selected methods for comparison, and the experimental setup are described in this section. The following notations are used: T = test result, D = disease status, V = verification status, n1 = verified observations, n0 = unverified observations, and n = all observations, or n1+n0.

Data sets

This study evaluated and compared different methods using both simulated and clinical datasets. Simulated data allowed performance assessment against known parameter values [20,22], while clinical data enabled comparisons with established reference data, following the practice of previous PVB correction studies [23–26].

Simulated data sets.

The simulated data sets were generated using the settings described in [21], which were adapted from [24–26]. The settings are outlined as follows:

  1. True disease prevalence (p) or P(D=1): moderate = 0.40 and low = 0.10.

  2. True sensitivity (Sn) or P(T=1|D=1): low = 0.3, moderate = 0.6, high = 0.9.

  3. True specificity (Sp) or P(T=0|D=0): moderate = 0.6, high = 0.9.

  4. Verification probabilities: When verification depends only on the test result, this represents an MAR missingness mechanism. Fixed verification probabilities given the test result, P(V=1|T=t), were set at P(V=1|T=1) = 0.8 and P(V=1|T=0) = 0.4 [24]. In other words, patients with positive test results are more likely to be verified, with a probability of 0.8, whereas patients with negative test results are less likely to be verified, with a probability of 0.4.

  5. Sample sizes, n: 200 and 1000.

For the complete data, the probabilities of the counts in a 2×2 cross-tabulated table of T versus D follow a multinomial distribution [24,26]. Based on pre-specified values of Sn = P(T=1|D=1), Sp = P(T=0|D=0) and p = P(D=1) = π, the probabilities of the counts are given by M(π1, π2, π3, π4), where

π1 = P(T=1, D=1) = P(T=1|D=1) P(D=1),
π2 = P(T=0, D=1) = P(T=0|D=1) P(D=1) = [1 − P(T=1|D=1)] P(D=1),
π3 = P(T=1, D=0) = P(T=1|D=0) P(D=0) = [1 − P(T=0|D=0)] P(D=0),
π4 = P(T=0, D=0) = P(T=0|D=0) P(D=0).

For each specified sample size n, generating a simulated data set with MAR-induced PVB involves the following steps (a minimal R sketch follows the list):

  1. A complete data set of size n, following a multinomial distribution M(π1, π2, π3, π4), was generated. Numerical values were randomly drawn from 1, 2, 3, 4 according to these probabilities.

  2. The values were converted into realizations of the T = t and D = d variables, where the numbers were mapped as follows: 1 → (T=1, D=1), 2 → (T=0, D=1), 3 → (T=1, D=0) and 4 → (T=0, D=0).

  3. Under the MAR assumption, the PVB data set was generated by adding V={1,0} with verification probabilities of P(V=1|T=1) = 0.8 and P(V=1|T=0) = 0.4, where V follows a binomial distribution. D values for V=0 observations were set to NA to create missing values.
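To make the procedure concrete, the following is a minimal R sketch of the three steps above, using the study’s random seed and one example setting (p = 0.4, Sn = 0.3, Sp = 0.6, n = 200); the object names (test, disease, verified, pvb_data) are illustrative rather than taken from the study code.

  set.seed(3209673)
  p <- 0.4; Sn <- 0.3; Sp <- 0.6; n <- 200

  # Multinomial cell probabilities for the 2x2 table of T versus D
  probs <- c(Sn * p,              # pi1: T=1, D=1
             (1 - Sn) * p,        # pi2: T=0, D=1
             (1 - Sp) * (1 - p),  # pi3: T=1, D=0
             Sp * (1 - p))        # pi4: T=0, D=0

  # Steps 1-2: draw cell labels 1-4 and map them to (T, D)
  cell    <- sample(1:4, n, replace = TRUE, prob = probs)
  test    <- as.integer(cell %in% c(1, 3))
  disease <- as.integer(cell %in% c(1, 2))

  # Step 3: MAR verification, P(V=1|T=1) = 0.8 and P(V=1|T=0) = 0.4
  verified <- rbinom(n, 1, ifelse(test == 1, 0.8, 0.4))
  disease[verified == 0] <- NA   # unverified disease status set to missing
  pvb_data <- data.frame(test, disease, verified)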

The layout of the simulated data for both the complete and PVB data sets is shown in Fig 1.

Fig 1. Simulated data layout for complete and PVB data.


Clinical data sets.

This study utilized two commonly used clinical data sets [24,26–30] to show and compare the implementation of the PVB correction methods in real-world settings. The original data from these studies were converted to an analysis-ready format (.csv). These data sets are described as follows:

  1. Hepatic scintigraphy test

    The data set relates to hepatic scintigraphy, a diagnostic imaging technique for detecting liver cancer, as reported in the original study [27]. The test was performed on 650 patients, with 344 patients verified by liver pathological examination (gold standard). The percentage of unverified patients was 47.1%. The data set includes the following variables:
    • Liver cancer, disease: Binary, 1 = Yes, 0 = No
    • Hepatic scintigraphy, test: Binary, 1 = Positive, 0 = Negative
    • Verified, verified: Binary, 1 = Yes, 0 = No
  2. Diaphanography test

    The data set relates to the diaphanography test for detecting breast cancer, as reported in the original study [28]. Diaphanography is a noninvasive breast examination method that uses transillumination with visible or infrared light to detect breast cancer. The test was performed on 900 patients, but only 88 patients were verified by breast tissue biopsy for histological examination (gold standard). The percentage of unverified patients was 90.2%. The data set includes the following variables:
    • Breast cancer, disease: Binary, 1 = Yes, 0 = No
    • Diaphanography, test: Binary, 1 = Positive, 0 = Negative
    • Verified, verified: Binary, 1 = Yes, 0 = No

Cross-tabulations of these data sets are given in Fig 2.

Fig 2. Cross-tabulation of hepatic scintigraphy and diaphanography data sets.


Proposed methods

Scaled inverse probability weighted resampling.

Based on the IPB method proposed by Arifin and Yusof [21], the scaled inverse probability weighted resampling (SIPW) method is proposed to overcome the limitations of IPB in correcting PVB. The SIPW algorithm is shown in Algorithm 1. IPB defines n as the observed sample size, while SIPW defines n as the complete sample size, including both verified and unverified samples. The verified sample size is n1, and the unverified sample size is n0. A sample of size n is drawn with replacement b times from the n1 verified observations, producing b samples. Because SIPW does not resample to the original verified size n1, as IPB does, the resulting samples are not bootstrap samples; hence the more general term resampling is used in the method name.

Algorithm 1 Scaled inverse probability weighted resampling.

Data, notations and definitions:

  Test status T, where T = {Positive: 1, Negative: 0}

  Disease status D, where D = {Yes: 1, No: 0, Unknown: NA}

  Verification status V, where V = {Yes: 1, No: 0}

  PVB data Y={T,D,V} of size n×3

  Individual patient instance is denoted with subscript i=1,...,n

D consists of verified (V=1) and unverified (V=0) sets, where D={D1,D0}

  Unverified set of data Y0 is of size n0, consisting of instances with Vi=0

  Verified set of data Y1 is of size n1 = n − n0, consisting of instances with Vi = 1

  Propensity score for ith patient PSi is P^(Vi=1|Ti)

  Inverse probability weight for ith patient IPWi is Vi/PSi + (1 − Vi)/(1 − PSi)

  Scaled inverse probability weight for ith patient SIPWi is IPWi / Σi=1..n1 IPWi, such that Σi=1..n1 SIPWi = 1

  Sensitivity Sn is P(T=1|D=1)

  Specificity Sp is P(T=0|D=0)

Procedure:

1: Estimate PS^i using Y by a logistic regression

2: Calculate IPW^i ← 1/PS^i for instances with Vi = 1

3: Calculate SIPW^i ← IPW^i / Σi=1..n1 IPW^i for instances with Vi = 1

  /*—Begin resampling block—*/

4: YList ← Empty list of size b samples

5: i ← 1

6: while i < b + 1 do

7:   Yi ← Sample n instances with replacement from the n1 instances of Y1 with probability SIPWi

  /*Ensure valid Yi, where T × D cross-classification table dimension TblDim is 4*/

8:   if TblDim < 4 then

9:    Discard Yi

10:    Repeat sampling, i ← i

11:   else if TblDim = 4 then

12:    Save ith sample, YList[i] ← Yi

13:    Continue with next i, i ← i + 1

14:   end if

15: end while

  /*—End resampling block—*/

16: SnList ← Empty list of size b samples

17: SpList ← Empty list of size b samples

18: for i = 1 to b in YList do

19:   SnList[i] ← Estimate Sn^i from T×D cross-classification table

20:   SpList[i] ← Estimate Sp^i from T×D cross-classification table

21: end for

22: Estimate Sn^ ← (1/b) Σi=1..b Sn^i from SnList

23: Estimate Sp^ ← (1/b) Σi=1..b Sp^i from SpList
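As a minimal R sketch of Algorithm 1, assuming a data frame pvb_data with columns test, disease, and verified as in the generation sketch above; the function name sipw() and its internals are illustrative, not the study’s published implementation.

  sipw <- function(data, b = 1000) {
    n <- nrow(data)
    # Step 1: propensity scores PS_i = P(V=1|T) by logistic regression
    ps <- predict(glm(verified ~ test, family = binomial, data = data),
                  type = "response")
    v1 <- data[data$verified == 1, ]   # verified subset of size n1
    # Steps 2-3: inverse probability weights for verified rows, scaled to sum to 1
    w  <- 1 / ps[data$verified == 1]
    sw <- w / sum(w)
    sn <- sp <- numeric(b)
    i  <- 1
    while (i <= b) {
      # Step 7: resample n instances (the full sample size) from the n1 verified rows
      s   <- v1[sample(nrow(v1), n, replace = TRUE, prob = sw), ]
      tbl <- table(s$test, s$disease)
      # Steps 8-14: keep the resample only if the 2x2 T-by-D table is complete
      if (all(dim(tbl) == 2) && all(tbl > 0)) {
        sn[i] <- tbl["1", "1"] / sum(tbl[, "1"])   # Sn = P(T=1|D=1)
        sp[i] <- tbl["0", "0"] / sum(tbl[, "0"])   # Sp = P(T=0|D=0)
        i <- i + 1
      }
    }
    c(Sn = mean(sn), Sp = mean(sp))   # steps 22-23: average over the b resamples
  }

For example, sipw(pvb_data) returns the SIPW-corrected Sn and Sp estimates for the simulated data set generated earlier.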

Scaled inverse probability weighted balanced resampling.

During this research, it was observed that subgroup sizes, nD = 1 and nD = 0 for diseased and non-diseased observations respectively, directly affect the precision (i.e. the SE) of Sn or Sp, as smaller sample sizes result in lower precision (higher SE). The SE, especially for Sn, can increase further when disease prevalence decreases because the subgroup size of nD = 1 also becomes smaller. To reduce the SE, a possible solution was inspired by the way diagnostic accuracy studies are designed. A diagnostic accuracy study is ideally designed as a cohort study [3,4,31]. When the disease is very rare, a case-control study is more practical, where patients with the disease (cases) are compared to those without the disease (controls), often using a 1:1 to 1:3 ratio [31,32]. This approach is similar to data-level resampling methods used to address class imbalance in machine learning, where class distribution is adjusted through over- or under-sampling methods [33].

Therefore, the scaled inverse probability weighted balanced resampling (SIPW-B) method is proposed to mimic the case-control study design. SIPW-B balances subgroup sizes by resizing them to a predefined ratio of nD=0:nD=1, while keeping the original sample size n = nD=0 + nD=1. The SIPW-B algorithm is shown in Algorithm 2. In the algorithm, the target control:case ratio, nD=0:nD=1, is achieved through the following steps:

  1. Set the desired relative size or ratio of control:case.

  2. Calculate the initial relative size of nD = 0 to nD = 1 in the PVB sample.

  3. Update IPWi for instances with Di = 1 by multiplying the value with the initial relative size.

  4. Calculate k as the sum of IPWi for instances with Di = 0, divided by sum of IPWi for instances with Di = 1.

  5. Update IPWi for instances with Di = 1 by multiplying the value by k divided by the desired relative size.

  6. Calculate SIPWi as IPWi divided by the sum of IPWi.

For the experiments in this study, the target relative size of control:case was set at 1, that is, a 1:1 ratio.
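A minimal R sketch of the six balancing steps above, assuming ipw holds the inverse probability weights and d the disease status for the verified subset only; the helper name balance_weights() is illustrative, and the default rel_size = 1 reproduces the 1:1 setting used in this study.

  balance_weights <- function(ipw, d, rel_size = 1) {   # step 1: desired ratio
    rel_init <- sum(d == 0) / sum(d == 1)        # step 2: initial nD=0 to nD=1 ratio
    ipw[d == 1] <- ipw[d == 1] * rel_init        # step 3
    k <- sum(ipw[d == 0]) / sum(ipw[d == 1])     # step 4
    ipw[d == 1] <- ipw[d == 1] * (k / rel_size)  # step 5
    ipw / sum(ipw)                               # step 6: scaled weights summing to 1
  }

After these updates, the total weight on D = 0 is rel_size times the total weight on D = 1, so with rel_size = 1 the resampled case and control subgroups are equal in expectation.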

Algorithm 2 Scaled inverse probability weighted balanced resampling.

Data, notations and definitions:

  Test status T, where T = {Positive: 1, Negative: 0}

  Disease status D, where D = {Yes: 1, No: 0, Unknown: NA}

  Verification status V, where V = {Yes: 1, No: 0}

  PVB data Y={T,D,V} of size n×3

  Individual patient instance is denoted with subscript i=1,...,n

D consists of verified (V=1) and unverified (V=0) sets, where D={D1,D0}

  Unverified set of data Y0 is of size n0, consisting of instances with Vi=0

  Verified set of data Y1 is of size n1 = n − n0, consisting of instances with Vi = 1

  Propensity score for ith patient PSi is P^(Vi=1|Ti)

  Inverse probability weight for ith patient IPWi is Vi/PSi + (1 − Vi)/(1 − PSi)

  Desired relative size of nD=0:nD=1, RelSize, i.e., the control:case size

  Scaled inverse probability weight for ith patient SIPWi is IPWi / Σi=1..n1 IPWi, such that Σi=1..n1 SIPWi = 1

  Sensitivity Sn is P(T=1|D=1) and Specificity Sp is P(T=0|D=0)

Procedure:

1: Estimate PS^i using Y by a logistic regression

2: Calculate IPW^i ← 1/PS^i for instances with Vi = 1

  /*Balance D = 1 size relative to D = 0 size*/

3: Calculate RelSizeInit ← nD=0/nD=1

4: Update IPW^i ← IPW^i × RelSizeInit for instances with Di = 1 ∧ Vi = 1

5: Calculate constant k ← Σi:Di=0 IPW^i / Σi:Di=1 IPW^i

6: Update IPW^i ← IPW^i × (k/RelSize) for instances with Di = 1 ∧ Vi = 1

7: Calculate SIPW^i ← IPW^i / Σi=1..n1 IPW^i for instances with Vi = 1

  /*—Begin resampling block—*/

8: YList ← Empty list of size b samples

9: i ← 1

10: while i < b + 1 do

11:   Yi ← Sample n instances with replacement from the n1 instances of Y1 with probability SIPWi

  /*Ensure valid Yi, where T × D cross-classification table dimension TblDim is 4*/

12:   if TblDim < 4 then

13:    Discard Yi

14:    Repeat sampling, i ← i

15:   else if TblDim = 4 then

16:    Save ith sample, YList[i] ← Yi

17:    Continue with next i, i ← i + 1

18:   end if

19: end while

  /*—End resampling block—*/

20: SnList ← Empty list of size b samples

21: SpList ← Empty list of size b samples

22: for i = 1 to b in YList do

23:   SnList[i] ← Estimate Sn^i from T×D cross-classification table

24:   SpList[i] ← Estimate Sp^i from T×D cross-classification table

25: end for

26: Estimate Sn^ ← (1/b) Σi=1..b Sn^i from SnList

27: Estimate Sp^ ← (1/b) Σi=1..b Sp^i from SpList

Performance evaluation

The performance evaluation used metrics that measure the difference between an estimate and its true value [22,34,35]. For a finite number of simulations B, the selected metrics are calculated as follows:

  1. Bias

    Bias of a point estimator θ^ is defined as the difference between its expected value and the true value of a parameter θ [35]. It is calculated as:
    Bias = E[θ^] − θ = (1/B) Σi=1..B θ^i − θ. (1)
  2. Standard error

    Standard error (SE) is the square root of the variance and is calculated as:
    SE = √Var(θ^) = √[ (1/(B−1)) Σi=1..B (θ^i − θ̄)² ], (2)

    where θ̄ is the mean of θ^i across repetitions.

Bias is often the primary metric of interest [22] because it reflects the accuracy of a method [35] and whether, on average, the method targets the parameter θ [22]. SE reflects the precision of the method [22,35], with a smaller SE indicating higher precision [35].
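As a minimal R sketch, Eqs (1) and (2) reduce to two short functions over the B simulation estimates; the function and argument names are illustrative.

  bias <- function(est, truth) mean(est) - truth   # Eq (1)
  se   <- function(est) sd(est)   # Eq (2): sd() uses the B - 1 denominator
  # e.g., bias(sn_hat, 0.3) and se(sn_hat) for B = 500 estimates of a true Sn = 0.3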

Methods for comparison

The proposed SIPW and SIPW-B were compared with selected existing PVB correction methods, which are the BG method (representing BG-based methods), the inverse probability weighting estimator (IPWE) and IPB methods (representing PS-based methods), and the MI method (representing the imputation-based methods). These methods were chosen to represent different approaches for PVB correction. In addition, two more methods were included for comparison: full data analysis (FDA), serving as an ideal benchmark when complete and unbiased data are available [3]; and complete case analysis (CCA), an uncorrected method that exhibits bias in the presence of PVB [36]. Details of Sn and Sp calculations for these existing methods have been described elsewhere [11,21].

For the simulated datasets, the methods were compared based on the mean of the estimates, bias and SE, organized by sample size and Sn-Sp combination. Coverage, defined as the proportion of times the confidence interval (CI) includes the true estimate [22,34,35], was not included as a performance metric in our simulation because we did not propose a new method for obtaining the CI for SIPW and SIPW-B. For the clinical data sets, point estimates and their 95% CIs were compared.

Experimental setup

All experiments were conducted using R version 3.6.3 [37] within the RStudio integrated development environment [38]. The final stable version of the 3.x.x series was chosen to ensure reproducibility, as the 4.x.x series is still under active development. The mice [39] (version 3.14.0) and simstudy [40] (version 0.5.0) R packages were used. A random seed of 3209673 was set for the entire study.

Simulation setup and data analysis.

To test the performance of PVB correction methods, simulated data sets were generated following the procedures described in the Simulated data sets subsection above. This step was followed by analysis using FDA, CCA, BG, IPWE, MI, IPB, SIPW, and SIPW-B. The general settings for data generation and analysis were: number of simulation runs B = 500, samples b = 1000 (for IPB, SIPW, and SIPW-B), and imputations m = 100 [41,42]. There were 12 different combinations of experimental settings (disease prevalence, true Sn, true Sp) at two sample sizes (n=200,1000). The general steps for data generation and analysis were as follows:

  1. For a selected combination of experimental settings (for example, p = 0.4, Sn = 0.3, Sp = 0.6), complete data sets were generated, followed by MAR-induced PVB data sets.

  2. Each generated data set was checked for validity, where it had to form a 2×2 cross-tabulated table (T=1,0 versus D=1,0) with no zero cell count; any invalid data set was discarded (a minimal check of this criterion is sketched after this list).

  3. The generation process continued until B = 500 valid complete and PVB data sets were obtained for each sample size (n = 200 and n = 1000), resulting in four sets of data sets in total.

  4. Sn and Sp were estimated for each data set. Complete data sets were analyzed using the FDA method, while PVB data sets were analyzed by CCA, BG, IPWE, MI, IPB, SIPW, and SIPW-B. Mean estimates, bias, and SE were then calculated.

  5. The steps above were repeated for each combination of experimental settings for all 12 combinations.
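A minimal R sketch of the validity check in step 2, assuming a data set with columns test and disease as in the earlier generation sketch; note that table() excludes missing disease values by default, so the same check applies to PVB data.

  is_valid <- function(d) {
    tbl <- table(d$test, d$disease)       # NAs in disease are excluded
    all(dim(tbl) == 2) && all(tbl > 0)    # 2x2 table with no zero cell count
  }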

The code to generate and analyze the simulated data sets is available at https://github.com/wnarifin/sipw_in_pvb.

Clinical data analysis.

Clinical data sets were analyzed using CCA, BG, IPWE, MI, IPB, SIPW and SIPW-B. For CCA, CIs for Sn and Sp were calculated using the Wald interval, while for the BG method, the calculation steps from the original article were followed [43]. For IPWE, IPB, SIPW and SIPW-B, CIs were obtained by the bootstrap percentile interval method [44]. For MI, CIs were obtained using Rubin’s rule [11,45]. IPB did not require an additional bootstrapping step to obtain its CI because it already generated valid bootstrap samples for CI estimation [20,21]. Analysis settings were: number of samples b = 1000 (for IPB, SIPW and SIPW-B), bootstrap replicates R = 1000 (to obtain the CIs for IPWE, SIPW and SIPW-B), and m = the percentage of incomplete cases for the real clinical data sets [46–48]. The clinical data sets and code to reproduce the results are available at https://github.com/wnarifin/sipw_in_pvb.
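As a rough R sketch of the bootstrap percentile interval for SIPW, reusing the illustrative sipw() helper sketched earlier (the same outline applies to SIPW-B); this is an outline under stated assumptions, not the study’s exact code, and it makes the cost of R = 1000 replicates explicit, since the full SIPW procedure is re-run on every replicate.

  boot_ci <- function(data, R = 1000, b = 1000, level = 0.95) {
    # Re-run the full SIPW procedure on R bootstrap resamples of the PVB data
    est <- replicate(R, sipw(data[sample(nrow(data), replace = TRUE), ], b = b))
    alpha <- (1 - level) / 2
    # Percentile interval for each row (Sn, Sp) of the 2 x R estimate matrix
    apply(est, 1, quantile, probs = c(alpha, 1 - alpha))
  }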

Ethics

The simulation component of this study did not involve human participants or any identifiable personal data. The simulated data were generated using the computational settings described previously. The clinical data analysis component used publicly available, aggregated secondary data to apply the selected methods. Therefore, no ethical approval was needed for either component of this research.

Results

Simulated data analysis

The simulation results for FDA, CCA and the PVB correction methods for p = 0.4 are displayed in Table 1, comparing sample sizes n = 200 and 1000. The results are organized by parameter combinations of Sn = (0.3, 0.6, 0.9) and Sp = (0.6, 0.9). The proportions of verification P(V=1) were 0.54, 0.47, 0.59, 0.52, 0.64 and 0.57 for the (Sn, Sp) pairs (0.3, 0.6), (0.3, 0.9), (0.6, 0.6), (0.6, 0.9), (0.9, 0.6) and (0.9, 0.9) respectively. The best values (i.e. the smallest bias and SE values) achieved by IPB, SIPW, and SIPW-B are marked with an asterisk.

Table 1. Comparison between IPB, SIPW, SIPW-B and existing PVB correction methods for p = 0.4 with n = 200 and 1000 under six combinations of Sn and Sp.

Methods | n = 200: Sn (Mean, Bias, SE), Sp (Mean, Bias, SE) | n = 1000: Sn (Mean, Bias, SE), Sp (Mean, Bias, SE)
Sn = 0.3 Sp = 0.6 Sn = 0.3 Sp = 0.6
FDA 0.302 0.002 0.050 0.603 0.003 0.044 0.302 0.002 0.023 0.600 0.000 0.019
CCA 0.465 0.165 0.076 0.430 –0.170 0.060 0.465 0.165 0.036 0.429 –0.171 0.027
BG 0.303 0.003 0.058 0.602 0.002 0.051 0.303 0.003 0.028 0.601 0.001 0.023
IPWE 0.303 0.003 0.058 0.602 0.002 0.051 0.303 0.003 0.028 0.601 0.001 0.023
MI 0.305 0.005 0.060 0.600 0.000 0.053 0.303 0.003 0.029 0.600 0.000 0.023
IPB 0.300 *0.000 0.085 0.596 *–0.004 0.080 0.302 *0.002 0.042 0.600 *0.000 0.035
SIPW 0.302 0.002 0.079 0.604 *0.004 0.070 0.302 *0.002 0.036 0.601 0.001 *0.031
SIPW-B 0.302 0.002 *0.078 0.605 0.005 *0.067 0.304 0.004 *0.035 0.602 0.002 0.032
Sn = 0.3 Sp = 0.9 Sn = 0.3 Sp = 0.9
FDA 0.302 0.002 0.050 0.902 0.002 0.027 0.302 0.002 0.023 0.900 0.000 0.013
CCA 0.466 0.166 0.077 0.821 –0.079 0.053 0.464 0.164 0.034 0.818 –0.082 0.024
BG 0.305 0.005 0.060 0.902 0.002 0.030 0.302 0.002 0.026 0.900 0.000 0.014
IPWE 0.305 0.005 0.060 0.902 0.002 0.030 0.302 0.002 0.026 0.900 0.000 0.014
MI 0.306 0.006 0.061 0.901 0.001 0.030 0.302 0.002 0.027 0.900 0.000 0.014
IPB 0.306 *0.006 0.095 0.903 0.003 0.050 0.301 *0.001 0.043 0.900 *0.000 0.022
SIPW 0.307 0.007 0.081 0.902 *0.002 *0.041 0.302 0.002 0.035 0.901 0.001 *0.018
SIPW-B 0.308 0.008 *0.076 0.905 0.005 0.042 0.301 *0.001 *0.034 0.900 *0.000 0.019
Sn = 0.6 Sp = 0.6 Sn = 0.6 Sp = 0.6
FDA 0.603 0.003 0.055 0.602 0.002 0.044 0.602 0.002 0.023 0.600 0.000 0.019
CCA 0.754 0.154 0.060 0.430 –0.170 0.060 0.751 0.151 0.026 0.428 –0.172 0.026
BG 0.607 0.007 0.072 0.602 0.002 0.050 0.602 0.002 0.031 0.600 0.000 0.022
IPWE 0.607 0.007 0.072 0.602 0.002 0.050 0.602 0.002 0.031 0.600 0.000 0.022
MI 0.605 0.005 0.075 0.599 –0.001 0.052 0.601 0.001 0.032 0.599 –0.001 0.022
IPB 0.609 0.009 0.105 0.602 0.002 0.078 0.599 *–0.001 0.044 0.601 0.001 0.034
SIPW 0.608 0.008 0.090 0.603 0.003 *0.065 0.604 0.004 *0.037 0.600 *0.000 *0.029
SIPW-B 0.606 *0.006 *0.084 0.600 *0.000 0.070 0.602 0.002 0.038 0.600 *0.000 0.031
Sn = 0.6 Sp = 0.9 Sn = 0.6 Sp = 0.9
FDA 0.603 0.003 0.055 0.902 0.002 0.027 0.602 0.002 0.023 0.900 0.000 0.012
CCA 0.754 0.154 0.061 0.822 –0.078 0.054 0.752 0.152 0.026 0.818 –0.082 0.025
BG 0.608 0.008 0.075 0.903 0.003 0.030 0.602 0.002 0.031 0.900 0.000 0.014
IPWE 0.608 0.008 0.075 0.903 0.003 0.030 0.602 0.002 0.031 0.900 0.000 0.014
MI 0.605 0.005 0.076 0.901 0.001 0.031 0.602 0.002 0.032 0.900 0.000 0.014
IPB 0.605 *0.005 0.118 0.903 *0.003 0.049 0.599 *–0.001 0.048 0.899 –0.001 0.024
SIPW 0.610 0.010 0.094 0.903 *0.003 *0.040 0.604 0.004 0.040 0.901 0.001 *0.019
SIPW-B 0.608 0.008 *0.086 0.905 0.005 0.042 0.602 0.002 *0.039 0.900 *0.000 0.020
Sn = 0.9 Sp = 0.6 Sn = 0.9 Sp = 0.6
FDA 0.899 –0.001 0.033 0.601 0.001 0.044 0.901 0.001 0.015 0.601 0.001 0.020
CCA 0.945 0.045 0.027 0.427 –0.173 0.057 0.948 0.048 0.012 0.429 –0.171 0.027
BG 0.896 –0.004 0.046 0.600 0.000 0.046 0.901 0.001 0.022 0.601 0.001 0.021
IPWE 0.896 –0.004 0.046 0.600 0.000 0.046 0.901 0.001 0.022 0.601 0.001 0.021
MI 0.889 –0.011 0.046 0.598 –0.002 0.047 0.899 –0.001 0.023 0.600 0.000 0.021
IPB 0.894 –0.006 0.064 0.595 –0.005 0.072 0.901 *0.001 0.028 0.600 *0.000 0.033
SIPW 0.896 –0.004 0.058 0.597 –0.003 *0.064 0.901 *0.001 0.027 0.600 *0.000 *0.029
SIPW-B 0.897 *–0.003 *0.055 0.599 *–0.001 0.069 0.901 *0.001 *0.025 0.603 0.003 0.031
Sn = 0.9 Sp = 0.9 Sn = 0.9 Sp = 0.9
FDA 0.899 –0.001 0.033 0.900 0.000 0.028 0.901 0.001 0.015 0.901 0.001 0.012
CCA 0.946 0.046 0.028 0.817 –0.083 0.054 0.948 0.048 0.013 0.819 –0.081 0.024
BG 0.899 –0.001 0.048 0.900 0.000 0.030 0.901 0.001 0.022 0.901 0.001 0.013
IPWE 0.899 –0.001 0.048 0.900 0.000 0.030 0.901 0.001 0.022 0.901 0.001 0.013
MI 0.889 –0.011 0.050 0.898 –0.002 0.031 0.899 –0.001 0.023 0.901 0.001 0.013
IPB 0.899 –0.001 0.066 0.901 0.001 0.045 0.901 *0.001 0.031 0.901 *0.001 0.020
SIPW 0.900 *0.000 0.060 0.900 *0.000 *0.041 0.902 0.002 *0.026 0.901 *0.001 *0.019
SIPW-B 0.898 –0.002 *0.057 0.898 –0.002 0.044 0.902 0.002 *0.026 0.901 *0.001 *0.019

Abbreviations: BG, Begg and Greenes’ method; CCA, complete case analysis; FDA, Full data analysis; IPB, inverse probability bootstrap; IPWE, inverse probability weighting estimator; MI, multiple imputation; n, sample size; p, disease prevalence; SE, standard error; SIPW, scaled inverse probability weighted resampling; SIPW-B, scaled inverse probability weighted balanced resampling; Sn, sensitivity; Sp, specificity.

Next, the simulation results for FDA, CCA and the PVB correction methods for p = 0.1 are displayed in Table 2 for sample sizes n = 200 and 1000. The results are organized by parameter combinations of Sn = (0.3, 0.6, 0.9) and Sp = (0.6, 0.9). The proportions of verification P(V=1) were 0.56, 0.45, 0.57, 0.46, 0.58 and 0.47 for the (Sn, Sp) pairs (0.3, 0.6), (0.3, 0.9), (0.6, 0.6), (0.6, 0.9), (0.9, 0.6) and (0.9, 0.9) respectively. Again, the smallest bias and SE values achieved by IPB, SIPW, and SIPW-B are marked with an asterisk.

Table 2. Comparison between IPB, SIPW, SIPW-B and existing PVB correction methods for p = 0.1 with n = 200 and 1000 under six combinations of Sn and Sp.

Methods | n = 200: Sn (Mean, Bias, SE), Sp (Mean, Bias, SE) | n = 1000: Sn (Mean, Bias, SE), Sp (Mean, Bias, SE)
Sn = 0.3 Sp = 0.6 Sn = 0.3 Sp = 0.6
FDA 0.301 0.001 0.107 0.601 0.001 0.037 0.300 0.000 0.043 0.600 0.000 0.016
CCA 0.459 0.159 0.159 0.428 –0.172 0.050 0.462 0.162 0.066 0.429 –0.171 0.022
BG 0.310 0.010 0.141 0.600 0.000 0.039 0.302 0.002 0.055 0.600 0.000 0.017
IPWE 0.310 0.010 0.141 0.600 0.000 0.039 0.302 0.002 0.055 0.600 0.000 0.017
MI 0.303 0.003 0.133 0.598 –0.002 0.039 0.303 0.003 0.055 0.600 0.000 0.017
IPB 0.309 0.009 0.209 0.601 *0.001 0.063 0.303 0.003 0.080 0.600 *0.000 0.027
SIPW 0.306 *0.006 0.167 0.599 *–0.001 *0.052 0.308 0.008 0.071 0.599 –0.001 *0.024
SIPW-B 0.307 0.007 *0.148 0.601 *0.001 0.060 0.300 *0.000 *0.058 0.599 –0.001 0.028
Sn = 0.3 Sp = 0.9 Sn = 0.3 Sp = 0.9
FDA 0.298 –0.002 0.106 0.901 0.001 0.021 0.300 0.000 0.043 0.900 0.000 0.010
CCA 0.464 0.164 0.160 0.819 –0.081 0.042 0.461 0.161 0.068 0.818 –0.082 0.018
BG 0.314 0.014 0.136 0.901 0.001 0.022 0.302 0.002 0.056 0.900 0.000 0.010
IPWE 0.314 0.014 0.136 0.901 0.001 0.022 0.302 0.002 0.056 0.900 0.000 0.010
MI 0.302 0.002 0.129 0.901 0.001 0.023 0.299 –0.001 0.059 0.900 0.000 0.010
IPB 0.323 0.023 0.227 0.903 0.003 0.039 0.305 0.005 0.092 0.900 *0.000 0.019
SIPW 0.310 *0.010 0.178 0.901 *0.001 *0.033 0.304 0.004 0.074 0.900 *0.000 *0.014
SIPW-B 0.318 0.018 *0.143 0.904 0.004 0.037 0.302 *0.002 *0.061 0.900 *0.000 0.017
Sn = 0.6 Sp = 0.6 Sn = 0.6 Sp = 0.6
FDA 0.596 –0.004 0.112 0.601 0.001 0.037 0.600 0.000 0.048 0.600 0.000 0.017
CCA 0.743 0.143 0.117 0.429 –0.171 0.049 0.754 0.154 0.053 0.429 –0.171 0.022
BG 0.603 0.003 0.144 0.601 0.001 0.038 0.607 0.007 0.067 0.601 0.001 0.017
IPWE 0.603 0.003 0.144 0.601 0.001 0.038 0.607 0.007 0.067 0.601 0.001 0.017
MI 0.579 –0.021 0.136 0.598 –0.002 0.039 0.601 0.001 0.067 0.600 0.000 0.017
IPB 0.595 –0.005 0.202 0.606 0.006 0.062 0.605 0.005 0.093 0.600 *0.000 0.027
SIPW 0.613 0.013 0.184 0.603 0.003 *0.055 0.604 *0.004 0.084 0.600 *0.000 *0.025
SIPW-B 0.600 *0.000 *0.157 0.601 *0.001 0.062 0.607 0.007 *0.071 0.601 0.001 0.027
Sn = 0.6 Sp = 0.9 Sn = 0.6 Sp = 0.9
FDA 0.600 0.000 0.115 0.900 0.000 0.022 0.600 0.000 0.048 0.900 0.000 0.010
CCA 0.738 0.138 0.119 0.818 –0.082 0.043 0.749 0.149 0.052 0.819 –0.081 0.019
BG 0.598 –0.002 0.143 0.900 0.000 0.023 0.601 0.001 0.065 0.900 0.000 0.010
IPWE 0.598 –0.002 0.143 0.900 0.000 0.023 0.601 0.001 0.065 0.900 0.000 0.010
MI 0.568 –0.032 0.134 0.900 0.000 0.023 0.593 –0.007 0.065 0.900 0.000 0.010
IPB 0.599 –0.001 0.214 0.898 –0.002 0.042 0.604 0.004 0.099 0.901 0.001 0.018
SIPW 0.600 *0.000 0.178 0.899 *–0.001 *0.033 0.602 0.002 0.080 0.901 0.001 *0.014
SIPW-B 0.598 –0.002 *0.152 0.902 0.002 0.038 0.600 *0.000 *0.067 0.900 *0.000 0.017
Sn = 0.9 Sp = 0.6 Sn = 0.9 Sp = 0.6
FDA 0.875 –0.025 0.062 0.602 0.002 0.035 0.899 –0.001 0.028 0.600 0.000 0.016
CCA 0.910 0.010 0.042 0.430 –0.170 0.048 0.947 0.047 0.025 0.428 –0.172 0.021
BG 0.837 –0.063 0.068 0.599 –0.001 0.037 0.900 0.000 0.044 0.600 0.000 0.017
IPWE 0.837 –0.063 0.068 0.599 –0.001 0.037 0.900 0.000 0.044 0.600 0.000 0.017
MI 0.800 –0.100 0.080 0.597 –0.003 0.037 0.889 –0.011 0.046 0.600 0.000 0.017
IPB 0.832 –0.068 0.133 0.605 0.005 0.059 0.897 –0.003 0.060 0.601 0.001 0.027
SIPW 0.837 *–0.063 0.108 0.600 *0.000 *0.050 0.901 0.001 0.052 0.600 *0.000 *0.023
SIPW-B 0.837 *–0.063 *0.077 0.600 *0.000 0.061 0.900 *0.000 *0.045 0.601 0.001 0.028
Sn = 0.9 Sp = 0.9 Sn = 0.9 Sp = 0.9
FDA 0.874 –0.026 0.062 0.901 0.001 0.023 0.900 0.000 0.029 0.900 0.000 0.010
CCA 0.905 0.005 0.051 0.819 –0.081 0.045 0.948 0.048 0.025 0.819 –0.081 0.020
BG 0.830 –0.070 0.079 0.900 0.000 0.025 0.901 0.001 0.044 0.900 0.000 0.011
IPWE 0.830 –0.070 0.079 0.900 0.000 0.025 0.901 0.001 0.044 0.900 0.000 0.011
MI 0.777 –0.123 0.087 0.899 –0.001 0.025 0.887 –0.013 0.045 0.900 0.000 0.011
IPB 0.829 –0.071 0.158 0.900 *0.000 0.044 0.898 –0.002 0.061 0.900 *0.000 0.018
SIPW 0.831 –0.069 0.112 0.900 *0.000 *0.034 0.901 *0.001 0.055 0.900 *0.000 *0.015
SIPW-B 0.832 *–0.068 *0.089 0.900 *0.000 0.037 0.901 *0.001 *0.046 0.900 *0.000 0.016

Abbreviations: BG, Begg and Greenes’ method; CCA, complete case analysis; FDA, Full data analysis; IPB, inverse probability bootstrap; IPWE, inverse probability weighting estimator; MI, multiple imputation; n, sample size; p, disease prevalence; SE, standard error; SIPW, scaled inverse probability weighted resampling; SIPW-B, scaled inverse probability weighted balanced resampling; Sn, sensitivity; Sp, specificity.

As observed in Tables 1 and 2, both SIPW and SIPW-B generally performed better than IPB for Sn and Sp estimation, as indicated by smaller bias and SE values. Bias results were mixed, with only marginal differences among IPB, SIPW and SIPW-B. However, SIPW and SIPW-B most often showed slightly lower bias than IPB. SIPW-B showed smaller SE for Sn estimation compared to SIPW because it enlarges the size of the case group (nD = 1). In contrast, SIPW showed smaller SE than SIPW-B for Sp estimation because it maintains the original size of the control group (nD = 0), which is larger than the case group when disease prevalence is p = 0.4 (Table 1). This effect was more pronounced at a lower disease prevalence of p = 0.1 (Table 2). When compared to existing methods (BG, IPWE, and MI), both new methods closely matched their performance in terms of bias and SE. Another observation was that, across all PVB correction methods, both bias and SE decreased as disease prevalence and sample size increased. Counterintuitively, when prevalence was low and sample size was small, CCA showed less bias than FDA and other PVB correction methods at a very high Sn value of 0.9.

Clinical data analysis

The results comparing CCA (bias uncorrected) and the PVB correction methods using the clinical data sets are displayed in Table 3. Across these data sets, all PVB correction methods showed nearly identical point estimates for Sn and Sp, except for MI, which showed a slightly different Sn estimate for the diaphanography data set. For this data set, SIPW and SIPW-B showed 95% CIs consistent with existing methods. Specifically, for Sn, three PS-based methods (IPWE, SIPW, and SIPW-B) showed almost identical 95% CIs, while BG and MI were similar to each other. In contrast, IPB exhibited notably wider 95% CIs for both Sn and Sp, differing from the estimates obtained by SIPW, SIPW-B and the existing methods.

Table 3. Sn and Sp estimates of IPB, SIPW, SIPW-B and other methods with the respective 95% CIs using clinical data sets.

Methods Scintigraphy data set Diaphanography data set
Sn (95% CI) Sp (95% CI) Sn (95% CI) Sp (95% CI)
CCA 0.895 (0.858, 0.933) 0.628 (0.526, 0.730) 0.788 (0.648, 0.927) 0.800 (0.694, 0.906)
BG 0.836 (0.788, 0.884) 0.738 (0.662, 0.815) 0.292 (0.134, 0.449) 0.973 (0.958, 0.988)
IPWE 0.836 (0.785, 0.885) 0.738 (0.656, 0.812) 0.292 (0.177, 0.548) 0.973 (0.957, 0.987)
MI 0.834 (0.782, 0.885) 0.738 (0.661, 0.815) 0.279 (0.124, 0.435) 0.972 (0.957, 0.987)
IPB 0.838 (0.793, 0.881) 0.738 (0.650, 0.824) 0.290 (0.077, 0.529) 0.973 (0.931, 1.000)
SIPW 0.837 (0.785, 0.886) 0.739 (0.655, 0.811) 0.292 (0.176, 0.548) 0.973 (0.957, 0.987)
SIPW-B 0.837 (0.785, 0.885) 0.739 (0.655, 0.812) 0.291 (0.176, 0.548) 0.973 (0.957, 0.987)

Abbreviations: BG, Begg and Greenes’ method; CCA, complete case analysis; CI, confidence interval; IPB, inverse probability bootstrap; IPWE, inverse probability weighting estimator; MI, multiple imputation; SIPW, scaled inverse probability weighted resampling; SIPW-B, scaled inverse probability weighted balanced resampling; Sn, sensitivity; Sp, specificity.

Discussion

To address the limitations of IPB, this study introduces two new methods based on IPB: SIPW and SIPW-B. The first limitation of IPB is that, although it showed small bias in estimating Sn and Sp, it had relatively larger SE than existing methods (BG, IPWE and MI). The second limitation of IPB is that IPB only corrects the distribution of the verified portion of the PVB data. This study demonstrated that the proposed methods successfully overcome these limitations, as discussed in detail below.

From the simulated data analysis, both SIPW and SIPW-B showed smaller bias and SE than IPB for Sn and Sp estimation, although the differences in bias were less pronounced. SIPW-B showed the lowest SE among these three methods for Sn estimation, while SIPW showed the lowest SE for Sp estimation. The new methods also matched the existing methods by showing low bias and SE for both Sn and Sp estimation. In contrast, as noted in [21], a major drawback of IPB is its large SE compared to other existing methods. SIPW and SIPW-B overcame this issue and demonstrated SE comparable to the existing methods. Although both methods performed better than IPB, SIPW performed better than SIPW-B for Sp estimation, while SIPW-B performed better than SIPW for Sn estimation. SIPW-B was designed to mimic the case-control study design. This approach is similar to common machine learning techniques that handle class imbalance by rebalancing the class distribution through over- or under-sampling [33,49–51]. SIPW-B balances the effect of subgroup sample sizes by resizing them to a predefined ratio of nD=0:nD=1, while keeping the original sample size n = nD=0 + nD=1. The trade-off is that it lowers the precision of the larger subgroup, although the precision for the smaller subgroup increases. For example, for a sample size n = 1000 with p = 0.2, the subgroup sizes are nD=1 = n × p = 200 and nD=0 = n × (1 − p) = 800. If the desired control:case ratio is 1:1, before balancing, Sn is calculated from nD=1 = 200 and Sp from nD=0 = 800. After balancing, Sn is calculated from nD=1 = 500 and Sp from nD=0 = 500. Notably, nD=0 drops from 800 to 500 for Sp calculation, reducing both numerator and denominator counts, which lowers Sp precision. This pattern was also observed in [50], where random under-sampling degraded the performance of machine learning methods, mainly due to information loss from discarding observations [51]. On the other hand, nD=1 increases from 200 to 500 for Sn calculation, increasing both numerator and denominator counts and improving Sn precision. In contrast, SIPW keeps nD=0 unchanged for Sp estimation. In addition, after applying SIPW-B, statistics that rely on the true distribution of P(D=1) (such as PPV = P[D=1|T=1] or NPV = P[D=0|T=0]) are no longer valid [8]. Although PPV and NPV can be calculated indirectly from the Sn and Sp estimates of SIPW-B by utilizing Bayes’ theorem [3,11], further research is needed to study this implementation. Therefore, SIPW-B is recommended when only Sn and Sp are the main estimates. On the other hand, SIPW is recommended when full data restoration is needed for further analysis involving metrics that depend on prevalence, such as PPV, NPV and accuracy [3,52].

At low prevalence settings, MI generally exhibited the largest bias in Sn estimation compared to other correction methods. The proposed methods showed lower bias than MI and were comparable to BG and IPWE in terms of bias, while showing smaller SE than IPB. As explained in a previous study [21], IPB showed larger SE because it only resamples verified observations. When prevalence p (i.e., P(D=1)) is low, this limits the available observations for estimating Sn to nD=1 = p × P(V=1) × n. The proposed methods do not have this limitation because they restore the complete sample by resampling both verified and unverified observations, making the available observations for estimating Sn equal to nD=1 = p × n, which no longer depends on V = 1. For practical applications, when disease prevalence is low, SIPW-B is preferable for estimating Sn, whereas SIPW is better suited for Sp estimation. The higher bias shown by MI at low prevalence is unexpected, given that MI also restores the complete sample. Although this may be related to MI’s reliance on accurate estimation of P(D=d|T=t), low prevalence did not affect the BG method, which also depends on this probability. This suggests the need for a further study to investigate the performance of MI-based PVB correction methods, particularly under different experimental conditions and imputation strategies [45,53]. Although Day et al. [15] examined an extension of the BG method and two MI-based methods, this issue was not observed because their proposed methods were not tested on simulated data sets.

In the clinical data sets, the results varied by data set. For the scintigraphy data set, all correction methods generally showed consistent results with each other. In contrast, for the diaphanography data set, the point estimate of Sn for MI differed from other methods, which could be explained by previously discussed simulation findings. SIPW and SIPW-B showed 95% CIs comparable to existing methods, whereas IPB exhibited notably wider 95% CIs for both Sn and Sp, diverging from those estimated by other methods. Given the small observed sample size for this data set (only 88 verified out of 900 patients), IPB appears to perform poorly for interval estimation when data are limited [21]. This could also be attributed to its large SE, as demonstrated in the simulated data sets. Since smaller sample sizes are generally associated with larger SE [21,44], this finding is expected and was previously noted in [21]. SIPW and SIPW-B, which incorporate full data restoration, performed better than IPB in this condition. The observation that SIPW and SIPW-B performed comparably to existing methods is consistent with findings by Day et al. [15], who reported good performance of MI-based approaches, also employing full data restoration, on a clinical dataset.

The correction methods examined in this study differ in the probabilities they estimate relative to the test result. PS-based methods (IPWE, IPB, SIPW and SIPW-B) rely on PS, the probability of verification given the test result, P(V=1|T=t). This probability is directly related to the verification problem, particularly the PVB problem [21]. PS is used as the weight to adjust for the verification bias, resulting in corrected Sn and Sp estimates [54]. Reweighting is also a recommended strategy for mitigating bias in machine learning [55]. While BG-based and MI methods rely on accurate estimation of the probability of disease status given the test result, P(D=d|T=t) [15], PS-based methods instead rely on accurate estimation of P(V=1|T=t) to perform the correction [11,55]. This makes PS-based methods particularly advantageous in diagnostic accuracy studies employing a case-control design, as P(D=d|T=t) may be inaccurately estimated in this situation [8]. Wang et al. [54] recently showed that their methods based on IPWE, a PS-based method, demonstrated good performance. In the diaphanography data set, using either P(V=1|T=t) or P(D=d|T=t) to correct for bias led to different Sn estimates. Three PS-based methods (IPWE, SIPW, and SIPW-B), which rely on P(V=1|T=t), showed nearly identical 95% CIs. In contrast, BG and MI, which rely on P(D=d|T=t), showed similar 95% CIs to each other. This finding suggests that further investigation is needed to evaluate the conditions under which these approaches produce different results.

The correction methods in this study also differ by the stage where bias correction occurs, either at the data level or the algorithm level [49,51,56,57]. The data-level methods in this study are MI, IPB, SIPW, and SIPW-B, while the algorithm-level methods are BG and IPWE. Data-level methods modify the data itself, which allows the use of any existing analytical methods or combinations of algorithms [20,57]. In contrast, algorithm-level methods require adapting or creating specific algorithms for specific use cases, but they are more computationally efficient and easier to apply than data-level methods [20,57]. MI, IPB, SIPW, and SIPW-B apply data-level bias correction [56] by restoring the sample distribution of data affected by PVB, which allows further analysis using any complete-data methods [20,45]. These four methods only differ in how they restore the data, while Sn and Sp can be calculated from standard formulas for these accuracy measures [11,24]. This was shown in [58] for the kappa coefficient in diagnostic accuracy studies and in [15] for PVB correction in multiple-test situations. If an algorithm-level approach had been used in [58] and [15], it would have required specific algorithms. For this reason, Day et al. [15] extended the existing BG method for their study, while MI was easily applied using an existing MI method [24]. Among the four data-level methods, IPB restores only the portion of data containing verified observations (n1) [20,21], whereas SIPW, SIPW-B, and MI [15,24] restore the full data containing both verified and unverified observations (n). As seen in the simulated and clinical datasets, this reduced the precision of IPB, shown by larger SE and wider confidence intervals. Compared to MI, SIPW and SIPW-B are easier to apply because they only require estimating PS values, followed by performing a weighted resampling procedure. In contrast, MI for PVB correction requires selecting suitable imputation methods, as its performance in bias correction depends on the chosen method [15,45,53].

While this study has demonstrated the strengths of SIPW and SIPW-B, these methods present two notable limitations. First, they are computationally intensive because they rely on repeated resampling. For example, with b = 1000 samples, they must perform the resampling 1000 times, whereas algorithm-level methods typically require only a single iteration. Second, unlike IPB, CIs cannot be derived directly from these resamples because they no longer form valid bootstrap samples. For interval estimation, an additional computationally demanding bootstrapping procedure must be performed on top of the algorithms. For instance, if 1000 bootstrap replicates (R = 1000) are needed for interval estimation, the time taken to complete a full SIPW procedure (e.g., 10 seconds with b = 1000 SIPW samples) is multiplied by 1000.

Conclusion

This paper proposes the SIPW and SIPW-B methods to address the limitations of the IPB method in the context of PVB correction under the MAR assumption for binary diagnostic tests. The results show that both SIPW and SIPW-B outperformed IPB and were consistent with existing PVB correction methods. Specifically, SIPW excelled in estimating Sp, while SIPW-B performed best in Sn estimation. The proposed methods also demonstrated good performance when disease prevalence was low. In addition, they improve upon IPB by allowing full data restoration, enabling subsequent analysis using any complete-data analytical methods. Although the new methods currently require more computational resources, this is expected to become less of an issue with advancements in computational power.

Acknowledgments

We thank our colleagues at the School of Computer Sciences and the School of Medical Sciences, Universiti Sains Malaysia for their comments on the early findings of this study and this article’s draft.

Data Availability

The code and datasets used in this study are available at the following GitHub repository: https://github.com/wnarifin/sipw_in_pvb.

Funding Statement

Funding 1: WNA: This is generic funding for publication without grant number. Research Creativity and Management Office, Universiti Sains Malaysia https://research.usm.my/. The funder did not have any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. Funding 2: WNA: This is generic funding for publication without grant number. School of Medical Sciences, Universiti Sains Malaysia. https://rni.kk.usm.my/. The funder did not have any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.O’Sullivan JW, Banerjee A, Heneghan C, Pluddemann A. Verification bias. BMJ Evid Based Med. 2018;23(2):54–5. doi: 10.1136/bmjebm-2018-110919 [DOI] [PubMed] [Google Scholar]
  • 2.Umemneku Chikere CM, Wilson K, Graziadio S, Vale L, Allen AJ. Diagnostic test evaluation methodology: a systematic review of methods employed to evaluate diagnostic tests in the absence of gold standard - an update. PLoS One. 2019;14(10):e0223832. doi: 10.1371/journal.pone.0223832 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Zhou XH, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. 2nd ed. Hoboken, New Jersey: John Wiley & Sons; 2011.
  • 4.Pepe MS. The statistical evaluation of medical tests for classification and prediction. New York, USA: Oxford University Press; 2011.
  • 5.Alonzo TA. Verification bias-impact and methods for correction when assessing accuracy of diagnostic tests. Revstat Statistical Journal. 2014;12(1):67–83. [Google Scholar]
  • 6.de Groot JAH, Bossuyt PMM, Reitsma JB, Rutjes AWS, Dendukuri N, Janssen KJM, et al. Verification problems in diagnostic accuracy studies: consequences and solutions. BMJ. 2011;343:d4770. doi: 10.1136/bmj.d4770 [DOI] [PubMed] [Google Scholar]
  • 7.Schmidt RL, Walker BS, Cohen MB. Verification and classification bias interactions in diagnostic test accuracy studies for fine-needle aspiration biopsy. Cancer Cytopathol. 2015;123(3):193–201. doi: 10.1002/cncy.21503 [DOI] [PubMed] [Google Scholar]
  • 8.Kohn MA. Studies of diagnostic test accuracy: partial verification bias and test result-based sampling. J Clin Epidemiol. 2022;145:179–82. doi: 10.1016/j.jclinepi.2022.01.022 [DOI] [PubMed] [Google Scholar]
  • 9.Schmidt RL, Factor RE. Understanding sources of bias in diagnostic accuracy studies. Arch Pathol Lab Med. 2013;137(4):558–65. doi: 10.5858/arpa.2012-0198-RA [DOI] [PubMed] [Google Scholar]
  • 10.Rutjes AWS, Reitsma JB, Coomarasamy A, Khan KS, Bossuyt PMM. Evaluation of diagnostic tests when there is no gold standard. A review of methods. Health Technol Assess. 2007;11(50):iii, ix–51. doi: 10.3310/hta11500 [DOI] [PubMed] [Google Scholar]
  • 11.Arifin WN, Yusof UK. Correcting for partial verification bias in diagnostic accuracy studies: a tutorial using R. Stat Med. 2022;41(9):1709–27. doi: 10.1002/sim.9311 [DOI] [PubMed] [Google Scholar]
  • 12.Zhou XH. Effect of verification bias on positive and negative predictive values. Stat Med. 1994;13(17):1737–45. doi: 10.1002/sim.4780131705 [DOI] [PubMed] [Google Scholar]
  • 13.Alonzo TA, Pepe MS. Assessing accuracy of a continuous screening test in the presence of verification bias. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2005;54(1):173–90. [Google Scholar]
  • 14.He H, McDermott MP. A robust method using propensity score stratification for correcting verification bias for binary tests. Biostatistics. 2012;13(1):32–47. doi: 10.1093/biostatistics/kxr020 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Day E, Eldred-Evans D, Prevost AT, Ahmed HU, Fiorentino F. Adjusting for verification bias in diagnostic accuracy measures when comparing multiple screening tests - an application to the IP1-PROSTAGRAM study. BMC Med Res Methodol. 2022;22(1):70. doi: 10.1186/s12874-021-01481-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Robles C, Rudzite D, Polaka I, Sjomina O, Tzivian L, Kikuste I, et al. Assessment of Serum Pepsinogens with and without Co-Testing with Gastrin-17 in Gastric Cancer Risk Assessment-Results from the GISTAR Pilot Study. Diagnostics (Basel). 2022;12(7):1746. doi: 10.3390/diagnostics12071746 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.El Chamieh C, Vielh P, Chevret S. Statistical methods for evaluating the fine needle aspiration cytology procedure in breast cancer diagnosis. BMC Med Res Methodol. 2022;22(1):40. doi: 10.1186/s12874-022-01506-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Niknejad F, Ahmadi F, Roudbari M. Verification bias correction in endometrial abnormalities in infertile women referred to royan institute using statistical methods. Med J Islam Repub Iran. 2023;37:122. doi: 10.47176/mjiri.37.122 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Olthof EP, Bergink-Voorthuis BJ, Wenzel HHB, Mongula J, van der Velden J, Spijkerboer AM, et al. Diagnostic accuracy of MRI, CT, and [18F]FDG-PET-CT in detecting lymph node metastases in clinically early-stage cervical cancer - a nationwide Dutch cohort study. Insights Imaging. 2024;15(1):36. doi: 10.1186/s13244-023-01589-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Nahorniak M, Larsen DP, Volk C, Jordan CE. Using inverse probability bootstrap sampling to eliminate sample induced bias in model based analysis of unequal probability samples. PLoS One. 2015;10(6):e0131765. doi: 10.1371/journal.pone.0131765 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Arifin WN, Yusof UK. Partial verification bias correction using inverse probability bootstrap sampling for binary diagnostic tests. Diagnostics (Basel). 2022;12(11):2839. doi: 10.3390/diagnostics12112839 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11):2074–102. doi: 10.1002/sim.8086 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Kosinski AS, Barnhart HX. Accounting for nonignorable verification bias in assessment of diagnostic tests. Biometrics. 2003;59(1):163–71. doi: 10.1111/1541-0420.00019 [DOI] [PubMed] [Google Scholar]
  • 24.Harel O, Zhou X-H. Multiple imputation for correcting verification bias. Stat Med. 2006;25(22):3769–86. doi: 10.1002/sim.2494 [DOI] [PubMed] [Google Scholar]
  • 25.Ünal İ, Burgut HR. Verification bias on sensitivity and specificity measurements in diagnostic medicine: a comparison of some approaches used for correction. Journal of Applied Statistics. 2014;41(5):1091–104. [Google Scholar]
  • 26.Rochani H, Samawi HM, Vogel RL, Yin J. Correction of verification bias using log-linear models for a single binary-scale diagnostic test. Journal of Biometrics and Biostatistics. 2015;6(5):266. [Google Scholar]
  • 27.Drum DE, Christacopoulos JS. Hepatic scintigraphy in clinical decision making. J Nucl Med. 1972;13(12):908–15. [PubMed] [Google Scholar]
  • 28.Marshall V, Williams DC, Smith KD. Diaphanography as a means of detecting breast cancer. Radiology. 1984;150(2):339–43. doi: 10.1148/radiology.150.2.6691086 [DOI] [PubMed] [Google Scholar]
  • 29.Greenes RA, Begg CB. Assessment of diagnostic technologies. Methodology for unbiased estimation from samples of selectively verified patients. Invest Radiol. 1985;20(7):751–6. [PubMed] [Google Scholar]
  • 30.Zhou XH. Maximum likelihood estimators of sensitivity and specificity corrected for verification bias. Communications in Statistics-Theory and Methods. 1993;22(11):3177–98. [Google Scholar]
  • 31.Linnet K, Bossuyt PMM, Moons KGM, Reitsma JBR. Quantifying the accuracy of a diagnostic test or marker. Clin Chem. 2012;58(9):1292–301. doi: 10.1373/clinchem.2012.182543 [DOI] [PubMed] [Google Scholar]
  • 32.Leeflang MMG, Allerberger F. How to: evaluate a diagnostic test. Clin Microbiol Infect. 2019;25(1):54–9. doi: 10.1016/j.cmi.2018.06.011 [DOI] [PubMed] [Google Scholar]
  • 33.Khushi M, Shaukat K, Alam TM, Hameed IA, Uddin S, Luo S, et al. A comparative performance analysis of data resampling methods on imbalance medical data. IEEE Access. 2021;9:109960–75. doi: 10.1109/access.2021.3102399 [DOI] [Google Scholar]
  • 34.Burton A, Altman DG, Royston P, Holder RL. The design of simulation studies in medical statistics. Stat Med. 2006;25(24):4279–92. doi: 10.1002/sim.2673 [DOI] [PubMed] [Google Scholar]
  • 35.Casella G, Berger RL. Statistical inference. 2nd ed. Delhi, India: Cengage Learning; 2002.
  • 36.de Groot JAH, Janssen KJM, Zwinderman AH, Bossuyt PMM, Reitsma JB, Moons KGM. Correcting for partial verification bias: a comparison of methods. Ann Epidemiol. 2011;21(2):139–48. doi: 10.1016/j.annepidem.2010.10.004 [DOI] [PubMed] [Google Scholar]
  • 37.R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2025.
  • 38.R ST. RStudio: Integrated Development for R. Boston, MA: RStudio, Inc. 2025.
  • 39.van Buuren S, Groothuis-Oudshoorn K. Mice: multivariate imputation by chained equations in R. Journal of Statistical Software. 2011;45(3):1–67. [Google Scholar]
  • 40.Goldfeld K, Wujciak-Jens J. Simstudy: illuminating research methods through data generation. Journal of Open Source Software. 2020;5(54):2763. doi: 10.21105/joss.02763 [DOI] [Google Scholar]
  • 41.Dong Y, Peng C-YJ. Principled missing data methods for researchers. Springerplus. 2013;2(1):222. doi: 10.1186/2193-1801-2-222 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Royston P, White I. Multiple imputation by chained equations (MICE): implementation in Stata. Journal of Statistical Software. 2011;45(i04). [Google Scholar]
  • 43.Begg CB, Greenes RA. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics. 1983;39(1):207–15. [PubMed] [Google Scholar]
  • 44.Woodward M. Epidemiology: study design and data analysis. 3rd ed. Boca Raton, FL: CRC Press; 2014.
  • 45.van Buuren S. Flexible imputation of missing data. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC Interdisciplinary Statistics; 2018.
  • 46.Bodner TE. What improves with increased missing data imputations?. Structural Equation Modeling: A Multidisciplinary Journal. 2008;15(4):651–75. doi: 10.1080/10705510802339072 [DOI] [Google Scholar]
  • 47.White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med. 2011;30(4):377–99. doi: 10.1002/sim.4067 [DOI] [PubMed] [Google Scholar]
  • 48.Pedersen AB, Mikkelsen EM, Cronin-Fenton D, Kristensen NR, Pham TM, Pedersen L, et al. Missing data and multiple imputation in clinical epidemiological research. Clin Epidemiol. 2017;9:157–66. doi: 10.2147/CLEP.S129785 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. Journal of Big Data. 2019;6(1):1–54. [Google Scholar]
  • 50.Mohammed R, Rawashdeh J, Abdullah M. Machine learning with oversampling and undersampling techniques: overview study and experimental results. In: 2020 11th International Conference on Information and Communication Systems (ICICS), 2020. p. 243–8.
  • 51.Carvalho M, Pinho AJ, Brás S. Resampling approaches to handle class imbalance: a review from a data perspective. Journal of Big Data. 2025;12(1):71. [Google Scholar]
  • 52.Šimundić A-M. Measures of diagnostic accuracy: basic definitions. EJIFCC. 2009;19(4):203–11. [PMC free article] [PubMed] [Google Scholar]
  • 53.Faisal S, Tutz G. Multiple imputation using nearest neighbor methods. Information Sciences. 2021;570:500–16. doi: 10.1016/j.ins.2021.04.009 [DOI] [Google Scholar]
  • 54.Wang S, Shi S, Qin G. Interval estimation for the Youden index of a continuous diagnostic test with verification biased data. Stat Methods Med Res. 2025;34(4):796–811. doi: 10.1177/09622802251322989 [DOI] [PubMed] [Google Scholar]
  • 55.Van Giffen B, Herhausen D, Fahse T. Overcoming the pitfalls and perils of algorithms: a classification of machine learning biases and mitigation methods. Journal of Business Research. 2022;144:93–106. [Google Scholar]
  • 56.Hort M, Chen Z, Zhang JM, Harman M, Sarro F. Bias mitigation for machine learning classifiers: a comprehensive survey. ACM Journal on Responsible Computing. 2024;1(2):1–52. [Google Scholar]
  • 57.Chen W, Yang K, Yu Z, Shi Y, Chen CP. A survey on imbalanced learning: latest research, applications and future directions. Artificial Intelligence Review. 2024;57(6):137. [Google Scholar]
  • 58.Roldán-Nofuentes JA, Regad SB. Estimation of the average kappa coefficient of a binary diagnostic test in the presence of partial verification. Mathematics. 2021;9(14):1694. [Google Scholar]

Decision Letter 0

Hayrunnisa Nadaroglu

11 Jun 2025

PONE-D-25-12005

Partial Verification Bias Correction Using Scaled Inverse Probability Resampling for Binary Diagnostic Tests

PLOS ONE

Dear Dr. Arifin,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 26 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Hayrunnisa Nadaroglu

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that funding information should not appear in any section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript.

3. Please note that your Data Availability Statement is currently missing the repository name and/or the DOI/accession number of each dataset OR a direct link to access each database. If your manuscript is accepted for publication, you will be asked to provide these details on a very short timeline. We therefore suggest that you provide this information now, though we will not hold up the peer review process if you are unable.

4. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: “Partial Verification Bias Correction Using Scaled Inverse Probability Resampling for Binary Diagnostic Tests”

Minor corrections to the article titled above are listed below:

1-The summary should be expanded a bit.

2- The last sentence in the summary is not understandable, it should be rewritten

3-The spelling of the “universiti” in the address section should be corrected.

4-The sentence starting with "in the study" on line 34 and the rest should be written in a new paragraph.

5-The word "ın words" in line 62 should be corrected to "In the other words".

6-The sentence starting with "In addition" in line 171 is not understood.

7-"The side effect" in line 207 this term is used in the medical field. Please write the sentence with another word.

8- A word expressing possibility should be used instead of "should" in line 328.

9-The spelling of the word “universiti” in line 339 should be corrected.

10- Article spelling and grammar rules should be reviewed.

Reviewer #2: The study aimed to highlight the importance of diagnostic accuracy studies for evaluating new tests before their clinical application.

Such tests, for which accuracy measures such as sensitivity (Sn) and specificity (Sp) are frequently calculated, are compared with gold standard tests.

Partial verification bias (PVB) occurs due to selective verification. The inverse probability bootstrap (IPB) has been recommended for PVB correction, but IPB showed a higher standard error than other correction methods. To address this, two methods, SIPW and SIPW-B, were proposed in the study; these gave better results for estimating Sn and Sp.

The study was evaluated as a statistical method, and it was concluded that the proposed methods made significant contributions to the scientific method.

In the manuscript, especially in the discussion section, the importance of the statistical contribution of the proposed methods in terms of scientific research was not emphasized enough. In addition, it was observed that the literature support supporting the findings was not sufficient in the discussion and that recent studies were not cited as references.

The elimination of these deficiencies will bring the manuscript to a higher quality level.

Good luck.

Reviewer #3: Recommendations for Improving the Manuscript for PONE-D-25-12005

To ensure that the manuscript meets the editorial and ethical standards of PLOS ONE, the following revisions are recommended:

-Please include a clear and specific ethics statement in the Methods section.

-Although the study is based on simulations and secondary data, PLOS ONE requires a formal ethics declaration.

-In accordance with the PLOS ONE data availability policy, authors must provide unrestricted access to the data and code underlying the findings.

-The simulation setup should be described in sufficient detail to allow full reproducibility.

-While the results show improvements over existing methods, the Discussion section should explicitly acknowledge potential limitations.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2025 Sep 26;20(9):e0321440. doi: 10.1371/journal.pone.0321440.r002

Author response to Decision Letter 0


2 Sep 2025

Response to reviewers

PONE-D-25-12005

Partial Verification Bias Correction Using Scaled Inverse Probability Resampling for Binary Diagnostic Tests

Dear reviewers,

Thank you for the constructive comments on the manuscript. We provide all responses to your comments below (and also in the "Response to reviewers.docx" file attached with the submission):

Reviewer 1:

-------------

> 1-The summary should be expanded a bit.

The summary has been expanded to 300 words.

> 2- The last sentence in the summary is not understandable, it should be rewritten

This sentence has been rewritten as “Although the methods are computationally demanding, ...”

> 3-The spelling of the “universiti” in the address section should be corrected.

> 9-The spelling of the word “universiti” in line 339 should be corrected.

This is the official name and spelling of the university (in Malay), which is indexed in academic databases. This should not be changed.

> 4-The sentence starting with "in the study" on line 34 and the rest should be written in a new paragraph.

“In the study” refers to Arifin and Yusof’s study (preceding sentence). Therefore, the text starting with “In the study” has been moved to a new paragraph, and “In the study” has been rephrased as “In their study...”.

> 5-The word "ın words" in line 62 should be corrected to "In the other words".

This has been corrected as suggested.

> 6-The sentence starting with "In addition" in line 171 is not understood.

The sentence starting from “In addition…” has been rephrased for clarity as suggested.

> 7-"The side effect" in line 207 this term is used in the medical field. Please write the sentence with another word.

This phrase has been replaced with “trade-off”.

> 8- A word expressing possibility should be used instead of "should" in line 328.

This has been replaced with “is expected to”.

> 10- Article spelling and grammar rules should be reviewed.

The comments related to the choice of words and sentence construction have been addressed accordingly. In addition, the article has been carefully proofread again as suggested.

Reviewer 2:

-------------

> The study was evaluated as a statistical method, and it was concluded that the proposed methods made significant contributions to the scientific method.

> In the manuscript, especially in the discussion section, the importance of the statistical contribution of the proposed methods in terms of scientific research was not emphasized enough. In addition, it was observed that the literature support supporting the findings was not sufficient in the discussion and that recent studies were not cited as references.

The discussion has been revised according to the comment. It has been reorganized to better highlight the findings and to facilitate comparison with the literature. Relevant recent papers have been cited accordingly.

In addition, for clarity and to better reflect the flow of text in the discussion, some parts of the results section have been rearranged.

Reviewer 3:

-------------

> To ensure that the manuscript meets the editorial and ethical standards of PLOS ONE, the following revisions are recommended:

> -Please include a clear and specific ethics statement in the Methods section.

> -Although the study is based on simulations and secondary data, PLOS ONE requires a formal ethics declaration.

A new section, “Ethics”, has been added as required, covering the ethical aspects of the simulation and secondary data analyses.

> - In accordance with the PLOS ONE data availability policy, authors must provide unrestricted access to the data and code underlying the findings.

This information is provided in the PLOS ONE online submission form, under the Data Availability section, with the following statement:

“The code and datasets used in this study are available at the following GitHub repository: https://github.com/wnarifin/sipw_in_pvb.”

We have also selected “Yes” in response to the question: Do the authors confirm that all data underlying the findings described in their manuscript are fully available without restriction?

In addition, the following statements have been added to the experimental setup subsection:

“The code to generate and analyze the simulated data sets is available at https://github.com/wnarifin/sipw_in_pvb.”

“The clinical data sets and code to reproduce the results are available at https://github.com/wnarifin/sipw_in_pvb.”

> -The simulation setup should be described in sufficient detail to allow full reproducibility.

The “experimental setup” section has been expanded into two subsections: “simulation setup and analysis” and “clinical data analysis”.

The names of the subsections in Results have also been revised to “simulated data analysis” and “clinical data analysis” to reflect these changes.

> -While the results show improvements over existing methods, the Discussion section should explicitly acknowledge potential limitations.

The limitations are now better highlighted in the Discussion section of the revised manuscript, from “While this study has demonstrated the strengths …” to “will be multiplied by 1000.” (the last paragraph of the Discussion section).

----

In addition, we have made some changes to the authors’ affiliations.

Attachment

Submitted filename: Response to reviewers.docx

pone.0321440.s001.docx (16.4KB, docx)

Decision Letter 1

Hayrunnisa Nadaroglu

8 Sep 2025

Partial Verification Bias Correction Using Scaled Inverse Probability Resampling for Binary Diagnostic Tests

PONE-D-25-12005R1

Dear Dr. Arifin,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Hayrunnisa Nadaroglu

Academic Editor

PLOS ONE


Acceptance letter

Hayrunnisa Nadaroglu

PONE-D-25-12005R1

PLOS ONE

Dear Dr. Arifin,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Hayrunnisa Nadaroglu

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Response to reviewers.docx

    pone.0321440.s001.docx (16.4KB, docx)

    Data Availability Statement

    The code and datasets used in this study are available at the following GitHub repository: https://github.com/wnarifin/sipw_in_pvb.

