. 2022 May 20;15(1):1–30. doi: 10.1007/s12561-022-09345-7

New Confidence Intervals for Relative Risk of Two Correlated Proportions

Natalie DelRocco 1,#, Yipeng Wang 1,#, Dongyuan Wu 1,#, Yuting Yang 1,#, Guogen Shan 1,2,
PMCID: PMC9122488  PMID: 35615750

Abstract

Biomedical studies, such as clinical trials, often require the comparison of measurements from two correlated tests in which each unit of observation is associated with a binary outcome of interest via relative risk. The associated confidence interval is crucial because it provides an appreciation of the spectrum of possible values, allowing for a more robust interpretation of relative risk. Of the available confidence interval methods for relative risk, the asymptotic score interval is the most widely recommended for practical use. We propose a modified score interval for relative risk and we also extend an existing nonparametric U-statistic-based confidence interval to relative risk. In addition, we theoretically prove that the original asymptotic score interval is equivalent to the constrained maximum likelihood-based interval proposed by Nam and Blackwelder. Two clinically relevant oncology trials are used to demonstrate the real-world performance of our methods. The finite sample properties of the new approaches, the current standard of practice, and other alternatives are studied via extensive simulation studies. We show that, as the strength of correlation increases, when the sample size is not too large the new score-based intervals outperform the existing intervals in terms of coverage probability. Moreover, our results indicate that the new nonparametric interval provides the coverage that most consistently meets or exceeds the nominal coverage probability.

Keywords: Confidence interval, Continuity correction, Nonparametric method, Paired data, Relative risk, Wilson score interval

Introduction

Correlated samples occur in many situations such as pre- and post-studies, cross-over studies, natural pairings (e.g., twin studies), and matched-pairs designs [1]. The degree to which the two samples are correlated must be appropriately accounted for in statistical inference for the parameter of interest. In the case of binary data, this design can therefore be parametrized by the probability of observing the primary outcome for each test and the correlation between the two tests (test 1 and test 2). The research question of interest is then investigated by comparing the proportion of responses in the two groups by conducting a statistical hypothesis test for an appropriate, clinically relevant parameter [2–5].

A classic example of such experiments is a study where an outcome is measured on a single sample before and after an exposure of interest. The study by Okely et al. [6] investigates the changes in physical activity among older adults in the 1936 Lothian Birth Cohort following the COVID-19 public health lockdown in Scotland [7]. A total of 137 adults over the age of 75 were surveyed about their physical activity, sleep quality, social activity, and psychological state before and after the national lockdown. The question of interest was whether the proportion of respondents reporting low physical activity differed after the lockdown compared to before. As the pre-lockdown and post-lockdown measurements are taken on the same observational units, and the outcome of interest is binary (low physical activity), this is an important experiment comparing correlated binary proportions [8–10].

The most notable parameters in this setting are the risk difference (RD), the relative risk/risk ratio (RR), and the odds ratio [11, 12]. In practice, often the response rates are small, such that a relative measure provides more information about the magnitude of the association. Therefore, RR is often used because it is the ratio of respective binomial proportions from the two tests. In this article, we focus on the RR in the particular situation when correlated binary outcomes are reported.

Of equal importance to the point estimate of RR is the corresponding interval estimate. This is most often represented by the frequentist confidence interval (CI) [7, 13, 14]. The prominent frequentist parametric CI methods for correlated RR which are studied in this paper can be broadly classified according to two factors: (1) Hybrid vs. Nonhybrid intervals and (2) Wald-based vs. Score-based intervals. Hybrid refers to obtaining an interval for each proportion separately, then combining them to obtain a single CI. Wald-based and Score-based intervals refer to the associated test statistic which is inverted [15] as the basis of the method. That is, to obtain a Wald-based confidence interval, one should invert the Wald test for the relative risk. Contrastingly, to obtain a Score-based confidence interval, one inverts the Score test. It is of note that there are likelihood-based methods which are based on inverting the likelihood ratio test (LRT). However, we found that Wald-based and Score-based intervals are more frequently used in practice. Therefore, we do not include likelihood-based intervals in this study but direct the interested reader to the literature [3, 16–18].

The simplest form of the frequentist CI for RR is the Wald asymptotic CI, which is a nonhybrid interval derived on the log scale under the assumption of joint asymptotic normality of the response rates for test 1 and test 2 [19]. The asymptotic normal approximation is not always appropriate given the discrete nature of binomial data and has previously been shown to yield low coverage probabilities compared to other methods [2, 20–22]. It is possible to apply a continuity correction to the upper and lower limits of the Wald interval which is inversely proportional to the study sample size. However, the continuity-corrected Wald interval has been found to yield poor average coverage probabilities compared to alternatives and is consequently not recommended for use in practice [3].

As an improvement to the traditional Wald interval for a single proportion [23, 24], Wilson [25] proposed a score-based CI for binomial proportions. Employing Wilson’s approach in a hybrid manner is the basis for most of the improved CIs for the ratio of two correlated proportions described below. Nam and Blackwelder proposed one of the first improved CIs for RR [18], which was initially solved iteratively and later extended to closed-form confidence limits [26]. This interval improves the Wald interval by employing constrained maximum likelihood estimates of the proportions of interest to develop a Fieller-type CI for RR [18]. This was closely followed by Bonett and Price who combined two Wilson score intervals into a hybrid Fieller-type interval for RR [27]. A continuity-corrected version of the Bonett–Price interval is obtained by applying a constant penalty to the individual upper and lower confidence limits for each individual proportion. However, the continuity-corrected Bonett–Price interval has been shown to be overly conservative [2].

Tang et al. [28] recently extended the score CI to the ratio of two dependent proportions by reparametrizing the multinomial probability model and deriving the corresponding score test statistic. Fagerland et al. [2] showed via simulation studies that the performance of the Nam–Blackwelder interval was nearly identical to Tang’s asymptotic score interval in terms of coverage probability. Finally, the MOVER hybrid score CI is another hybrid Fieller-type CI, which differs primarily from the Bonett–Price interval in that it is based on Newcombe’s “Square and Add” method for the difference in two proportions [3]. Donner and Zou [29] recently extended the original MOVER interval to the ratio of two correlated proportions with acceptable performance on average coverage probability [2].

Thus far, Tang’s Score interval is the most frequently used CI for the ratio of two correlated proportions, and is recommended along with the Bonett–Price interval in simulation studies for general use [2, 30]. The proposed approach of Tang et al. [28] relies on the asymptotic properties of the score test statistic, namely, the asymptotic standard normal distribution of the score test statistic conditional on a given RR. In general, continuity corrections serve to penalize the width of an interval for using a continuous function to approximate a discrete function, which may not be reasonable in small samples [31]. Continuity corrections applied to approximate CIs seek to approach the coverage of exact CIs by imposing more conservative confidence limits [19]. In the case of the normal approximation to the binomial distribution, the approximation with continuity correction is typically superior to that without [15]. Therefore, we propose a new confidence interval which imposes a continuity correction to Tang’s existing Score interval, which is a nonhybrid score-based CI for the ratio of two correlated proportions.

It can be advantageous in certain situations not to make assumptions regarding the asymptotic distribution of an estimator, or even to rely on asymptotic sample sizes in practice. In such situations, nonparametric CIs may be considered. Nonparametric CIs for paired binary data do not rely on particular asymptotic distributions (e.g., normal distribution) in the derivation of the estimate or the variance–covariance matrix. Duan et al. [32] developed a nonparametric U-statistic-based CI for the difference of two correlated proportions. To our knowledge, no one has developed a nonparametric CI specifically for the ratio of two correlated proportions. In Sect. 2, we further extend the work of Duan et al. to the situation where the RR is the parameter of interest.

In Sect. 2.1, we explicitly define the statistical notations and the model for correlated two-test experimental designs with binary outcomes. In Sect. 2.2, we present the existing CI methods discussed above. In Sect. 2.3, we introduce two new methods to calculate CIs for the correlated RR. Specifically, in Sect. 2.3.1, we propose a new test based on Tang’s score method in conjunction with continuity correction and provide a closed-form solution for ease of implementation. In Sect. 2.3.2, we extend Duan’s nonparametric U-Statistic-based CI for correlated RD to correlated RR. In Sect. 3, the results of using the CIs in Sects. 2.2 and 2.3 are presented. We apply the CIs to real data from two case studies: one oncology trial used by existing methodological reviews [2] to validate our methods and another recent oncology trial. We additionally compare the proposed CI methods with the existing CIs through extensive simulation studies with regard to coverage probability, interval length, mean squared error, and the proportion of configurations above the nominal coverage level. Finally, we summarize our findings and discuss conclusions in the last Section.

Methodology

Model and Notation

The general setup for the ratio of correlated proportions is introduced below. A standard two-by-two contingency table layout [19], often arising from a matched-pairs study design, is given in Table 1. Let $X_1$ and $X_2$ be the binary outcomes from test 1 and test 2 with marginal probabilities of success $p_1$ and $p_2$, respectively. We denote $X_t=1$ for event and $X_t=2$ for nonevent, $t=1,2$. Table 1 presents the observed frequency $x_{ij}$ and the corresponding probability $p_{ij}$ for participants having the outcome $X_1=i$ and $X_2=j$, where $i,j=1,2$. The total sample size is $n=\sum_{i=1}^{2}\sum_{j=1}^{2}x_{ij}$. The parameter of interest is the relative risk between test 1 and test 2:

$$\theta_{RR}=\frac{p_1}{p_2}.$$

We focus on the confidence intervals for θRR in this article.

Table 1. Observed counts of the paired study are shown with the corresponding probabilities in parentheses

| Test 1 | Test 2: Event | Test 2: Nonevent | Total |
| --- | --- | --- | --- |
| Event | x11 (p11) | x12 (p12) | x11+x12 (p1) |
| Nonevent | x21 (p21) | x22 (p22) | x21+x22 (1-p1) |
| Total | x11+x21 (p2) | x12+x22 (1-p2) | n |

The data vector $x=\{x_{ij}: i,j=1,2\}$ follows a multinomial distribution with the probability mass function:

$$\Pr(x\mid p,n)=\frac{n!}{x_{11}!\,x_{12}!\,x_{21}!\,x_{22}!}\,p_{11}^{x_{11}}\,p_{12}^{x_{12}}\,p_{21}^{x_{21}}\,p_{22}^{x_{22}}, \tag{1}$$

where $p=\{p_{ij}\in[0,1]: i,j=1,2\}$ such that $\sum_{i=1}^{2}\sum_{j=1}^{2}p_{ij}=1$. For a pair of Bernoulli random variables $(X_1,X_2)$, the Pearson correlation coefficient $\rho$ is given by

$$\rho=\frac{p_{11}-p_1p_2}{\sqrt{p_1(1-p_1)\,p_2(1-p_2)}},$$

where $p_1,p_2\neq 0,1$. We can re-parameterize $p$ into the equivalent parameter set $\{p_1,p_2,\rho\}$, but the values of $\{p_1,p_2,\rho\}$ are restricted: not every combination is feasible because each cell probability must lie between 0 and 1 ($p_{ij}\in[0,1]$). Using the necessary conditions of pairwise probabilities for $p_{11}$ [33], we have the inequality

$$\max(0,\,p_1+p_2-1)\le \rho\sqrt{p_1(1-p_1)\,p_2(1-p_2)}+p_1p_2\le \min(p_1,p_2).$$

Given the values of $\{p_1,p_2\}$, the lower and upper bounds of $\rho$ ($L_\rho$ and $U_\rho$) are obtained by solving the above inequality; see Appendix 1 for the detailed formulas.

Conditional on $n$, a multinomial distribution is defined by the parameter set $\{p_1,p_2,\rho\}$, which satisfies $p_1,p_2\neq 0,1$ and $\rho\in[L_\rho,U_\rho]$. Given the relative risk $\theta_{RR}=p_1/p_2$ and a fixed $n$, the parameter set $\{p_2,\theta_{RR},\rho\}$ can also specify a multinomial distribution because $p_1=p_2\theta_{RR}$. The maximum likelihood method is used to estimate probabilities: $\hat p_{ij}=x_{ij}/n$ for $i,j=1,2$, $\hat p_1=(x_{11}+x_{12})/n$, $\hat p_2=(x_{11}+x_{21})/n$, and $\hat\theta_{RR}=\hat p_1/\hat p_2$.
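As a minimal sketch of the quantities defined above, the following Python snippet computes the maximum likelihood estimates from a table of paired counts and the feasible range of ρ obtained by rearranging the inequality for p11. The function names and the example counts are hypothetical, and the bounds are derived directly from the inequality rather than copied from Appendix 1.

```python
import math

def paired_estimates(x11, x12, x21, x22):
    """Maximum likelihood estimates from the 2x2 table of paired counts."""
    n = x11 + x12 + x21 + x22
    p1_hat = (x11 + x12) / n   # marginal event rate for test 1
    p2_hat = (x11 + x21) / n   # marginal event rate for test 2
    return n, p1_hat, p2_hat, p1_hat / p2_hat

def rho_bounds(p1, p2):
    """Range of rho implied by max(0, p1+p2-1) <= p11 <= min(p1, p2).

    Assumes 0 < p1 < 1 and 0 < p2 < 1 so the denominator is positive.
    """
    sd = math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    lower = (max(0.0, p1 + p2 - 1) - p1 * p2) / sd
    upper = (min(p1, p2) - p1 * p2) / sd
    return lower, upper

# Hypothetical counts, for illustration only (not data from the paper)
n, p1_hat, p2_hat, rr_hat = paired_estimates(20, 10, 5, 15)
low, up = rho_bounds(p1_hat, p2_hat)
```

When p1 = p2 = 0.5 the bounds reduce to the full range from -1 to 1; for unequal or extreme margins the range is narrower, which is why some (p2, θRR, ρ) combinations are infeasible.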

In Sect. 2.2, we present four existing methods to calculate the 100(1-α)% CI for θRR (e.g., a 95% CI when α=0.05). In Sect. 2.3, we consider continuity-corrected score intervals of three strengths and one nonparametric interval.

Existing Interval Estimation Methods

Wald CI

The asymptotic Wald CI is constructed assuming that the joint sampling distribution of the two sample proportions $\hat p_1$ and $\hat p_2$ is reasonably approximated by a bivariate normal distribution when $n$ is sufficiently large. Under this assumption, using the Delta method [19] to obtain the asymptotic variance of $\log(\hat\theta_{RR})$ gives us the following $100(1-\alpha)\%$ Wald CI for $\theta_{RR}$:

$$\exp\left[\log(\hat\theta_{RR})\pm z_{1-\alpha/2}\sqrt{\frac{x_{12}+x_{21}}{(x_{11}+x_{21})(x_{11}+x_{12})}}\right], \tag{2}$$

where $z_{1-\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution. This interval is strictly positive and asymmetric.
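The Wald interval in Eq. (2) can be sketched in a few lines of Python. The function name and the example counts are hypothetical; the sketch assumes that both marginal counts x11+x12 and x11+x21 are positive, since the estimate is otherwise undefined.

```python
import math
from statistics import NormalDist

def wald_ci_rr(x11, x12, x21, x22, alpha=0.05):
    """Wald CI for the paired relative risk, computed on the log scale (Eq. 2).

    x22 does not enter the formula; it is kept in the signature only so the
    function takes the full 2x2 table.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)      # upper alpha/2 normal quantile
    rr_hat = (x11 + x12) / (x11 + x21)           # p1_hat / p2_hat (n cancels)
    se_log = math.sqrt((x12 + x21) / ((x11 + x21) * (x11 + x12)))
    return rr_hat * math.exp(-z * se_log), rr_hat * math.exp(z * se_log)

# Hypothetical counts, for illustration only
wald_lower, wald_upper = wald_ci_rr(20, 10, 5, 15)
```

Because the limits are obtained by exponentiating a symmetric interval for log(θRR), their product equals the squared point estimate, which is one way to see that the interval is asymmetric on the original scale.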

Bonett–Price CI

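The Bonett–Price CI is a hybrid-type CI [27]. Two individual one-sample Wilson score intervals $(L_1,U_1)$ and $(L_2,U_2)$ are calculated for $p_1$ and $p_2$. Bonett and Price [27] showed that, given these $100(1-\alpha)\%$ Wilson CIs, the $100(1-\alpha)\%$ CI for $\theta_{RR}$ is given by

$$\left(\frac{L_1}{U_2},\ \frac{U_1}{L_2}\right), \tag{3}$$

where

$$(L_1,U_1)=\frac{2(x_{11}+x_{12})+\psi^2\pm\psi\sqrt{\psi^2+4(x_{11}+x_{12})\left(1-\dfrac{x_{11}+x_{12}}{n}\right)}}{2(n+\psi^2)}, \tag{4}$$

$$(L_2,U_2)=\frac{2(x_{11}+x_{21})+\psi^2\pm\psi\sqrt{\psi^2+4(x_{11}+x_{21})\left(1-\dfrac{x_{11}+x_{21}}{n}\right)}}{2(n+\psi^2)}. \tag{5}$$

In Eqs. (4) and (5), $n=x_{11}+x_{12}+x_{21}$, and $\psi$ is a function of $z_{1-\alpha/2}$ and $x$ [27].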

MOVER Wilson score CI

The lower and upper limits for the CI of the ratio may be solved for in terms of the lower and upper limits of the individual CIs for $p_1$ and $p_2$ by noting that $p_1-p_2\theta_{RR}=0$. Based on this relationship, Donner and Zou [29] applied the original square and add method to show that the closed-form $100(1-\alpha)\%$ MOVER confidence limits for $\theta_{RR}$ are given by $(L,U)$:

$$\begin{aligned}
L&=\frac{\hat p_1\hat p_2-C_L-\sqrt{(\hat p_1\hat p_2-C_L)^2-L_1U_2(2\hat p_1-L_1)(2\hat p_2-U_2)}}{U_2(2\hat p_2-U_2)},\\
U&=\frac{\hat p_1\hat p_2-C_U+\sqrt{(\hat p_1\hat p_2-C_U)^2-U_1L_2(2\hat p_1-U_1)(2\hat p_2-L_2)}}{L_2(2\hat p_2-L_2)},
\end{aligned} \tag{6}$$

where $C_L=r(\hat p_1-L_1)(U_2-\hat p_2)$, $C_U=r(U_1-\hat p_1)(\hat p_2-L_2)$, $(L_1,U_1)$ and $(L_2,U_2)$ are individual $100(1-\alpha)\%$ CIs of choice for $p_1$ and $p_2$ in Eqs. (4) and (5), and $r=\widehat{\mathrm{corr}}(\hat p_1,\hat p_2)$ is an appropriate estimate of the correlation between the two sample proportions (e.g., the Pearson correlation coefficient).

Score Asymptotic CI

Under the null hypothesis $H_0:\theta_{RR}=\theta_0$, the score test statistic by Tang et al. [30] is given as

$$S(\theta_0)=\frac{(x_{11}+x_{12})-(x_{11}+x_{21})\theta_0}{\sqrt{n(1+\theta_0)\tilde p_{21}+(x_{11}+x_{12}+x_{21})(\theta_0-1)}}, \tag{7}$$

where

$$\tilde p_{21}=\frac{-b+\sqrt{b^2-4ac}}{2a},\quad a=n(1+\theta_0),\quad b=(x_{21}+x_{11})\theta_0^2-(x_{11}+x_{12}+2x_{21}),\quad c=\frac{x_{21}(1-\theta_0)(x_{11}+x_{21}+x_{12})}{n}.$$

The score test statistic is asymptotically normal under the null hypothesis. In the previous publications [30], due to symmetry, the lower and upper $100(1-\alpha)\%$ confidence limits are found iteratively as the roots of

$$S(\theta)=\pm z_{1-\alpha/2}.$$
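To illustrate the iterative approach, the sketch below evaluates the score statistic of Eq. (7) and locates the two roots by bisection. The function names, the bracketing strategy, and the example counts are assumptions made for this illustration; it is not the implementation used in the cited publications, and it assumes positive marginal counts and at least one discordant pair.

```python
import math
from statistics import NormalDist

def score_stat(theta, x11, x12, x21, x22):
    """Score statistic S(theta) of Eq. (7)."""
    n = x11 + x12 + x21 + x22
    a = n * (1 + theta)
    b = (x21 + x11) * theta ** 2 - (x11 + x12 + 2 * x21)
    c = x21 * (1 - theta) * (x11 + x21 + x12) / n
    p21 = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)   # constrained estimate
    var = n * (1 + theta) * p21 + (x11 + x12 + x21) * (theta - 1)
    return ((x11 + x12) - (x11 + x21) * theta) / math.sqrt(var)

def bisect(f, lo, hi, tol=1e-10, max_iter=200):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def score_ci_rr(x11, x12, x21, x22, alpha=0.05):
    """Solve S(theta) = +z below the estimate and S(theta) = -z above it."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    rr_hat = (x11 + x12) / (x11 + x21)
    s = lambda t: score_stat(t, x11, x12, x21, x22)
    lower = bisect(lambda t: s(t) - z, 1e-8, rr_hat)
    hi = 2 * rr_hat
    while s(hi) > -z:          # expand the bracket until it contains the root
        hi *= 2
    upper = bisect(lambda t: s(t) + z, rr_hat, hi)
    return lower, upper

# Hypothetical counts, for illustration only
score_lower, score_upper = score_ci_rr(20, 10, 5, 15)
```

The choices a root finder requires here (bracket, tolerance, iteration cap) are the kind of tuning that a closed-form solution avoids.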

CI by Nam and Blackwelder

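Nam and Blackwelder [18] derived the constrained maximum likelihood estimates of $p_{12}$ and $p_{21}$ as

$$\hat p_{12}=\frac{-(x_{11}+x_{12})+\theta_0^2(x_{11}+x_{21}+2x_{12})+\sqrt{\left[(x_{11}+x_{12})-\theta_0^2(x_{11}+x_{21})\right]^2+4\theta_0^2x_{12}x_{21}}}{2n\theta_0(\theta_0+1)},$$

and

$$\hat p_{21}=\theta_0\hat p_{12}-(\theta_0-1)\left(1-\frac{x_{22}}{n}\right).$$

Based on the Fieller-type statistic,

$$T(\theta_0)=\frac{\sqrt{n}\left[(x_{11}+x_{12})-\theta_0(x_{11}+x_{21})\right]}{n\sqrt{\theta_0(\hat p_{12}+\hat p_{21})}}, \tag{8}$$

Nam and Blackwelder construct a $100(1-\alpha)\%$ CI for $\theta_{RR}$ by solving $T(\theta)=\pm z_{\alpha/2}$.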

Theorem 1

The Nam–Blackwelder CI based on the test statistic T(θ) in Eq. (8) is equivalent to the score interval by Tang et al. [30] based on the test statistic S(θ) in Eq. (7).

Proof

We show that T(θ)=S(θ) for any observed data in Appendix 2.
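As an informal numerical companion to Theorem 1 (not a substitute for the proof in Appendix 2), the sketch below evaluates both statistics on a grid of θ values for one table of hypothetical counts. The function names are invented for this illustration, and T is coded in the algebraically simplified form [(x11+x12) − θ(x11+x21)] / sqrt(nθ(p12+p21)), which is our reading of Eq. (8).

```python
import math

def score_stat(theta, x11, x12, x21, x22):
    """Score statistic S(theta) of Eq. (7)."""
    n = x11 + x12 + x21 + x22
    a = n * (1 + theta)
    b = (x21 + x11) * theta ** 2 - (x11 + x12 + 2 * x21)
    c = x21 * (1 - theta) * (x11 + x21 + x12) / n
    p21 = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
    var = n * (1 + theta) * p21 + (x11 + x12 + x21) * (theta - 1)
    return ((x11 + x12) - (x11 + x21) * theta) / math.sqrt(var)

def nam_blackwelder_stat(theta, x11, x12, x21, x22):
    """Fieller-type statistic T(theta) with constrained estimates of p12, p21."""
    n = x11 + x12 + x21 + x22
    r1, c1 = x11 + x12, x11 + x21        # row-1 and column-1 totals
    root = math.sqrt((r1 - theta ** 2 * c1) ** 2 + 4 * theta ** 2 * x12 * x21)
    p12 = (-r1 + theta ** 2 * (c1 + 2 * x12) + root) / (2 * n * theta * (theta + 1))
    p21 = theta * p12 - (theta - 1) * (1 - x22 / n)
    return (r1 - theta * c1) / math.sqrt(n * theta * (p12 + p21))

# Hypothetical counts and an arbitrary grid of theta values
counts = (20, 10, 5, 15)
grid = [0.3, 0.5, 0.9, 1.0, 1.2, 2.0, 3.0]
max_gap = max(abs(score_stat(t, *counts) - nam_blackwelder_stat(t, *counts))
              for t in grid)
```

For these counts the two statistics agree up to floating-point rounding, consistent with the theorem. The agreement follows from substituting the expression for the constrained p21 into nθ(p12+p21), which reproduces the variance term in Eq. (7).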

Later, Nam [26] derived the closed-form estimation of the Nam–Blackwelder interval using Ferrari’s formulation. Following his approach, we directly derive the closed-form solutions for the score interval in Appendix 3. The closed-form solutions are computationally less intensive and hence faster than iterative methods. Additionally, the availability of a closed-form solution avoids the common issues that befall root-finding algorithms. This prevents the need to choose an optimization method (or to rely on the default implementation which may not be optimal), specify convergence criteria, diagnose possible failure to converge in extreme contingency tables, etc. The closed form, noniterative solution is also more accessible for clinical researchers who may not have experience with iterative computational algorithms.

Proposed Interval Estimation Methods

Continuity-corrected Score Asymptotic CI

Adding a continuity correction to the asymptotic score test statistic in Eq. (7), we have the asymptotic score continuity-corrected (ASCC) test statistic

$$S_\delta(\theta_0)=\frac{\left|(x_{11}+x_{12})-(x_{11}+x_{21})\theta_0\right|-\left(\frac{1}{\delta}\right)\left(\frac{1}{n}\right)(x_{11}+x_{21})}{\sqrt{n(1+\theta_0)\tilde p_{21}+(x_{11}+x_{12}+x_{21})(\theta_0-1)}}, \tag{9}$$

where $\tilde p_{21}$ is obtained as in Sect. 2.2.4, and $\delta$ is a continuity correction value. For the upper confidence limit $\theta_U$, where $(x_{11}+x_{21})\theta>(x_{11}+x_{12})$, we transform

$$\frac{(x_{11}+x_{21})\theta-(x_{11}+x_{12})-\frac{1}{\delta n}(x_{11}+x_{21})}{\sqrt{n(1+\theta)\tilde p_{21}+(x_{11}+x_{12}+x_{21})(\theta-1)}}=z_{1-\alpha/2}$$

into a quartic equation, which is solved by the same process as in the uncorrected case to obtain $\theta_U$. The lower confidence limit $\theta_L$ is calculated in the same way by transforming

$$\frac{(x_{11}+x_{12})-(x_{11}+x_{21})\theta-\frac{1}{\delta n}(x_{11}+x_{21})}{\sqrt{n(1+\theta)\tilde p_{21}+(x_{11}+x_{12}+x_{21})(\theta-1)}}=z_{1-\alpha/2}$$

into a quartic equation. The $100(1-\alpha)\%$ CI of $\theta_{RR}$ based on the continuity correction is then obtained as $(\theta_L,\theta_U)$.

In the traditional continuity correction interval calculation, the value of $\delta$ is chosen as 2. In this article, we consider continuity corrections of varying strength. We take $\delta=2,4,8$ for continuity corrections of high, medium, and low strength, respectively. A smaller value of $\delta$ corresponds to a stronger continuity correction. We denote the resulting intervals as “ASCC-H” when $\delta=2$, “ASCC-M” when $\delta=4$, and “ASCC-L” when $\delta=8$ in the presentation of results.
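The effect of the correction strength can be sketched numerically. The snippet below solves S_δ(θ) = z on each side of the point estimate by bisection, with the correction term (x11+x21)/(δn) subtracted from the absolute numerator as in Eq. (9). It is a hypothetical illustration: the paper obtains the limits in closed form from a quartic (Appendix 3), whereas this sketch uses a generic root finder, and the function names and counts are invented.

```python
import math
from statistics import NormalDist

def score_sd(theta, x11, x12, x21, x22):
    """Square root of the variance term in the denominator of Eq. (7)."""
    n = x11 + x12 + x21 + x22
    a = n * (1 + theta)
    b = (x21 + x11) * theta ** 2 - (x11 + x12 + 2 * x21)
    c = x21 * (1 - theta) * (x11 + x21 + x12) / n
    p21 = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
    return math.sqrt(n * (1 + theta) * p21 + (x11 + x12 + x21) * (theta - 1))

def bisect(f, lo, hi, tol=1e-10, max_iter=200):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def ascc_ci_rr(x11, x12, x21, x22, delta=None, alpha=0.05):
    """Continuity-corrected score CI; delta=None gives the uncorrected interval."""
    n = x11 + x12 + x21 + x22
    z = NormalDist().inv_cdf(1 - alpha / 2)
    r1, c1 = x11 + x12, x11 + x21
    cc = 0.0 if delta is None else c1 / (delta * n)   # correction term of Eq. (9)
    rr_hat = r1 / c1
    sd = lambda t: score_sd(t, x11, x12, x21, x22)
    # Lower limit: (r1 - c1*theta - cc) / sd = z for theta below the estimate
    lower = bisect(lambda t: (r1 - c1 * t - cc) / sd(t) - z, 1e-8, rr_hat)
    # Upper limit: (c1*theta - r1 - cc) / sd = z for theta above the estimate
    g = lambda t: (c1 * t - r1 - cc) / sd(t) - z
    hi = 2 * rr_hat
    while g(hi) < 0:            # expand the bracket until it contains the root
        hi *= 2
    upper = bisect(g, rr_hat, hi)
    return lower, upper

# Hypothetical counts: uncorrected, then low, medium, and high strength
ci_none = ascc_ci_rr(20, 10, 5, 15)
ci_low, ci_med, ci_high = (ascc_ci_rr(20, 10, 5, 15, delta=d) for d in (8, 4, 2))
```

In this example the intervals are nested, with ASCC-H the widest and the uncorrected score interval the narrowest, which matches the description of smaller δ as a stronger correction.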

In addition to being computationally fast and intuitive for a clinical audience due to the explicit representation of the CI rather than a set of iterative equations (see Appendix 3 for closed-form solutions), the incorporation of a continuity correction in the score interval allows the flexibility to be more conservative when necessary. That is, our proposed modification to the asymptotic score interval allows the researcher to easily choose the strength of penalty to the confidence limits guided by the estimates of the effect sizes, correlation between measurements from the two tests, and study sample size. General recommendations are provided in Sect. 4. Because the score interval is recommended for practical use due to stability and desirable properties in recent simulation studies [2, 30], providing a flexible modification to the score interval which may be easily employed under the conditions discussed below can help researchers maintain nominal coverage.

Nonparametric CI

Duan et al. [32] proposed a nonparametric CI for the difference in two correlated proportions derived from the rank-based estimation procedures for correlated area under the curve (AUC) data outlined by Lang [17]. Duan et al. relied on the well-known representation of AUC as a U-Statistic [34] in the context of ROC curve analysis. They derived a general covariance matrix for a study with multiple groups. Following DeLong et al. [35], Duan et al. derive the variance–covariance matrix for $(\hat p_1,\hat p_2)^T$ when the components of interest are $(p_1,p_2)^T$. The full expression is given in Appendix 4. An estimate of the asymptotic variance of a parameter of choice is then easily obtained via the Delta method [19]. We refer the interested reader to Duan et al. [32] for further details when the parameter of interest is the correlated RD.

We extend this line of work by deriving the asymptotic variance for the U-statistic-based estimate of the ratio of two correlated proportions. The derivation of the estimates of the individual proportions follows the same reasoning as outlined above. Since the parameter of interest is now $\theta_{RR}=p_1/p_2$, in the case of two groups the estimate of interest is calculated as the ratio of respective sample proportions, $\hat\theta_{RR}=\hat p_1/\hat p_2$. We then apply the Delta method to obtain the asymptotic variance estimate of $\hat\theta_{RR}$. Full derivation details can be found in Appendix 4. The corresponding asymptotic $100(1-\alpha)\%$ CI using a conservative t approximation for the degrees of freedom is then given by

$$\exp\left[\log(\hat\theta_{RR})\pm t_{n-1,\,1-\alpha/2}\sqrt{\hat V_{LRR}}\right],$$

where $\hat V_{LRR}=\widehat{\mathrm{Var}}[\log(\hat\theta_{RR})]$ as derived by the Delta method. We refer to this nonparametric interval as the NonP interval.

The U-statistic-based approach to CI construction for correlated RD proposed by Duan et al. is computationally fast with a simple closed-form expression. Additionally, nonparametric CIs are beneficial in studies under which the asymptotic approximations of the previously presented methods do not hold or if the underlying data do not follow a binomial distribution. Our extension of this method by derivation of the CI for the correlated RR makes these desirable properties accessible for studies where the RR is the primary measure of interest.

Results

We first conduct extensive simulation studies comparing the finite sample properties of the four proposed intervals and another four existing intervals, followed by application of the intervals to two real cancer trials. These 8 intervals are computed from the six methods: (1) the asymptotic Wald CI, (2) the Bonett–Price CI, (3) the CI based on the MOVER hybrid score, (4) Tang’s asymptotic score CI, (5) the proposed asymptotic continuity-corrected score CI, and (6) the extended Duan’s nonparametric CI.

Simulation Results

All simulations were conducted using R Statistical Software Version 4.0 [36]. Functions from the ratesci package in R [37] and an R function from Duan et al. [32] were adapted to obtain existing and proposed confidence intervals. Data were simulated according to the probability model described in Eq. (1). Individual contingency tables for each of B=20,000 Monte Carlo simulations were randomly generated using the simstudy package in R [38]. The simulation was parametrized in terms of sample size n, correlation ρ between two tests, relative risk θRR, and p2. As opposed to specifying p1 and p2, this allows us to directly display the strength of association, making the results more intuitive. We studied the combinations of the following sets of parameter values: n ∈ {15, 30, 60, 100, 150, 300, 500}, θRR ∈ {1, 1.5, 2, 3, 4, 5}, ρ ∈ {0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, and p2 from 0.1 to 0.95 by increments of 0.05.

Given each configuration for p2 and θRR, we have a specific bound for ρ (see details in Appendix 1). We remove the following three cases which cause undefined confidence intervals for the existing methods: (1) n=15 and p2<0.3; (2) n=30 and p2<0.2; (3) n=60 and p2<0.1. Based on these skip rules, we have 2,235 combinations of parameters in this simulation (we refer to this parameter space as Ω). In addition, if the simulated data satisfied x11+x12=0 and x11+x21=0 simultaneously, the case was discarded and new data were generated for this combination until we obtained 20,000 usable cases.
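The simulation design can be illustrated with a small Python sketch that converts one (p2, θRR, ρ) configuration into the four cell probabilities of Eq. (1), draws paired tables, and estimates the coverage of the Wald interval. This is not the R code used in the study: the function names, the use of Python's `random` module, the smaller default B, and the choice to redraw a table whenever either marginal count is zero are all assumptions made for the illustration.

```python
import math
import random
from statistics import NormalDist

def cell_probs(p2, theta, rho):
    """Cell probabilities (p11, p12, p21, p22) implied by (p2, theta_RR, rho)."""
    p1 = theta * p2
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("marginal probabilities must lie strictly in (0, 1)")
    p11 = rho * math.sqrt(p1 * (1 - p1) * p2 * (1 - p2)) + p1 * p2
    probs = (p11, p1 - p11, p2 - p11, 1 - p1 - p2 + p11)
    if min(probs) < 0:
        raise ValueError("rho is outside its feasible range for these margins")
    return probs

def wald_coverage(n, p2, theta, rho, B=2000, alpha=0.05, seed=1):
    """Monte Carlo estimate of the Wald interval's coverage probability."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    probs = cell_probs(p2, theta, rho)
    hits = used = 0
    while used < B:
        draws = rng.choices(range(4), weights=probs, k=n)   # one paired table
        x11, x12, x21 = (draws.count(k) for k in range(3))
        if x11 + x12 == 0 or x11 + x21 == 0:
            continue                     # estimate undefined: redraw (assumption)
        rr_hat = (x11 + x12) / (x11 + x21)
        se = math.sqrt((x12 + x21) / ((x11 + x21) * (x11 + x12)))
        lower, upper = rr_hat * math.exp(-z * se), rr_hat * math.exp(z * se)
        hits += lower <= theta <= upper
        used += 1
    return hits / used

# One configuration from the grid described above
probs_example = cell_probs(p2=0.2, theta=1.5, rho=0.2)
coverage_example = wald_coverage(n=100, p2=0.2, theta=1.5, rho=0.2, B=500)
```

The same loop could be repeated over every feasible configuration and every interval method to produce a table of coverage probabilities.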

Figure 1 presents an example of the performance of the Wald interval when p2=0.2 and θRR=1.5,2. When θRR=1.5 and ρ=0.2, the coverage probability of the score-based intervals starts off above 95% when n=30 (95.31%, 95.31%, 95.54%, and 95.69% for uncorrected, low, medium, and high corrections, respectively). The Wald interval additionally exceeds nominal coverage on average for n=30 with coverage probability of 95.90%. However, as the correlation increases, the probability of coverage by the Wald interval decreases for ρ=0.5 (94.52%) and ρ=0.6 (93.32%). This low coverage probability by the Wald interval for n=30 is exacerbated by increased θRR from 1.5 to 2. A similar, if slightly less extreme trend is observed when n=60. Interestingly, when θRR=1.5 and ρ=0.6, the Wald interval does not reach nominal coverage on average until n=200, which is a large sample size in practice. When θRR=2, this holds true for both ρ=0.6 and slightly weaker ρ=0.5.

Fig. 1. Coverage probabilities for eight confidence intervals for correlated relative risk by study sample size

As expected, the performance of the Wald asymptotic confidence interval is dubious for small-to-moderate sample sizes. Under weaker correlation between two tests, for n=30,60,150, we see the general trend of over-conservatism of the Wald interval, which dissipates for larger sample sizes. The coverage probability quickly drops as correlation increases, but recovers as sample size increases.

In the situation described above (Fig. 1), when n=30 and θRR=1.5, the uncorrected score interval was above nominal coverage on average for all strengths of correlation considered. As the sample size increases, the coverage of the score interval stabilizes close to the nominal 95%. Observing the trend over increasing strength of correlation further highlights this behavior. Figure 2 shows that, when θRR=2, p2=0.2, and n = 30, the coverage of all score-based intervals increases with increasing correlation, while the coverage of all alternatives decreases. When n=100, most confidence intervals display a decrease in coverage probability for increasing correlation, while the score-based methods maintain a reasonable level of coverage. In fact, the ASCC-H maintains coverage closest to the nominal level across all values of ρ considered in Fig. 2. This effect is still present, if somewhat mitigated, when θRR=1.5 and p2=0.4.

Fig. 2. Coverage probabilities for eight confidence intervals for correlated relative risk by correlation between test 1 and test 2

Further comparing the proposed ASCC intervals to the uncorrected score asymptotic interval, we see that, intuitively, when the uncorrected interval already has good coverage then the corrected interval is too conservative. However, the real benefit of the corrected interval presents itself when the score interval has poor coverage. In such scenarios, the continuity-corrected score interval not only brings the score interval closer to nominal coverage, but also comes the closest of all the confidence interval methods compared. Take panel 6 of Fig. 1 as an example (p2=0.2, ρ=0.6, and θRR=2). This is overall the poorest performing scenario in the figure due to the strength of correlation and size of θRR relative to p2. The coverage of the uncorrected score interval starts off too conservative when n=30, then corrects itself to hover just below the nominal level (between 94.55% when n=60 and 94.79% when n=500). Applying the ASCC with a strong correction brings the coverage within 0.01% of nominal coverage while keeping the coverage at or above nominal level for sample sizes between 60 and 300.

The score asymptotic interval was well behaved in most simulation scenarios included in our investigation. We observed that the score interval tended to be over-conservative for small sample sizes no matter the strength of correlation between tests, which we prefer to the frequency with which the Bonett–Price, MOVER Wilson, and Wald intervals failed to reach nominal coverage for small-to-moderate sample sizes when strong correlation is present. The example represented by Figs. 1 and 2 illustrates that under situations where the asymptotic score interval does not achieve nominal coverage and the alternative intervals also perform poorly (i.e., high correlation and large magnitude of effect), the ASCC can help achieve nominal coverage.

In line with Fagerland et al. [2], we found that the coverage probability of the MOVER Wilson interval is lower than other methods in most of our simulated scenarios. In Fig. 1, when n=30,60 and θRR=1.5, the MOVER Wilson interval does not meet nominal coverage even when ρ=0.2, at 94.58% and 94.91%. The coverage only worsens as the strength of correlation increases. The same can be seen in Fig. 2, where the coverage probability of the MOVER Wilson interval only sporadically exceeds the nominal level for various correlation strengths. For the same combination of parameters, the average confidence interval lengths can be seen in Fig. 3. For sample sizes 30, 60, and 100, the MOVER Wilson interval consistently has the shortest interval lengths across all values of ρ studied. This is at odds with the poor coverage probability described above.

Fig. 3. Average confidence interval lengths for eight confidence intervals for correlated relative risk by correlation between test 1 and test 2

An exception occurs when p1 and p2 take on relatively large values; consider θRR=2 and ρ=0.3 in Fig. 4. When p2=0.1,0.2 (and thus p1=0.2,0.4, respectively), the MOVER Wilson interval has the lowest coverage probability on average for all sample sizes. However, as p2 increases to 0.45, the MOVER Wilson interval maintains coverage very close to the nominal level, only dipping to 94.90% when n=300. Contrastingly, the coverage of the Bonett–Price interval plummets to 93.09% when n=15, recovering to 94.69% when the sample size reaches 60, but never meeting nominal coverage. Thus, in Fig. 4, when p2 is large, causing p1 to lie closer to the boundary of the parameter space, the Bonett–Price and the Wald intervals have coverage the farthest below nominal while the MOVER Wilson interval maintains satisfactory coverage. The trend of improved performance is seen as p2 increases in the panels of Fig. 4. This indicates that, when the correlation is moderate, the bias which may affect the MOVER Wilson interval in general may be mitigated for large values of the response probabilities.

Fig. 4. Coverage probabilities for eight confidence intervals for correlated relative risk by study sample size

Figure 5 looks at the performance of the considered confidence intervals under a typical sample size (n=100) and probability of success (p2=0.1) for a broad range of true relative risk values. Weak correlation (ρ=0.05) and a more moderate correlation (ρ=0.2) are considered. We see similar trends as observed in previous figures, with decreasing coverage for stronger correlation. For very small values of relative risk (θRR<1) all confidence intervals tend to be overly conservative. When θRR is 5, most confidence intervals drop just below nominal coverage. In this case, using the nonparametric interval preserves coverage above nominal level. On the other hand, all the ASCC corrections fall below nominal coverage, but attain the closest coverage to 95% of any alternatives (particularly ASCC-H). This is excepting the Wald interval which, as expected under weaker correlations and sufficient sample size, performs well—intermediate coverage to the nonparametric interval and alternatives.

Fig. 5. Coverage probabilities for eight confidence intervals for correlated relative risk by true relative risk

We further compare the proposed four CIs: ASCC-L, ASCC-M, ASCC-H, and the nonparametric interval, with regard to the proportion of scenarios under which each confidence interval method achieves nominal 95% coverage among all the configurations in Ω in Fig. 6. That proportion for a given sample size n is computed as

$$\frac{\sum_{(\theta_{RR},\rho,p_2\mid n)\in\Omega(n)} I\big(CP(\theta_{RR},\rho,p_2\mid n)\ge 95\%\big)}{|\Omega(n)|},$$

where |Ω(n)| is the size of the parameter space given sample size n, and CP(θRR,ρ,p2|n) is the coverage probability given the study design parameters θRR, ρ, and p2. The score interval is added as a reference. Both ASCC-M and ASCC-H have a proportion of guaranteed coverage above 80% for all studied sample sizes, and their proportions are much higher than that of the ASCC-L interval. In Fig. 6, we also show the mean squared deviation of the simulation-based coverage probability from the nominal 95% coverage (MSE), specifically

$$\mathrm{MSE}=\frac{\sum_{(\theta_{RR},\rho,p_2\mid n)\in\Omega(n)}\big[CP(\theta_{RR},\rho,p_2\mid n)-95\%\big]^2}{|\Omega(n)|}.$$

For MSE, the proposed ASCC-M interval generally has lower MSE than ASCC-H. Based on the results from MSE and the proportion of guaranteed coverage, we would recommend the proposed ASCC-M interval among the three ASCC intervals.
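The two summary measures above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `coverage_summary` and the example coverage values are hypothetical, and the comparison uses "greater than or equal to" the nominal level, which is an assumption about how ties are counted.

```python
from typing import Iterable, Tuple


def coverage_summary(cps: Iterable[float], nominal: float = 0.95) -> Tuple[float, float]:
    """Return (proportion of configurations with CP >= nominal, MSE around nominal).

    `cps` holds one simulated coverage probability per configuration
    (theta_RR, rho, p2) in Omega(n) for a fixed sample size n.
    """
    values = list(cps)
    if not values:
        raise ValueError("need at least one coverage probability")
    proportion = sum(1 for cp in values if cp >= nominal) / len(values)
    mse = sum((cp - nominal) ** 2 for cp in values) / len(values)
    return proportion, mse


# Hypothetical coverage probabilities for four configurations at one sample size
example_cps = [0.951, 0.948, 0.962, 0.950]
proportion, mse = coverage_summary(example_cps)
```

A method that is conservative everywhere scores well on the first measure but can still have a large MSE, which is why the two measures are reported side by side.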

Fig. 6.


Mean squared deviation of the average coverage probability from the nominal 95% coverage (MSE) and the proportion of cases with average coverage probability exceeding the nominal 95% for five confidence intervals for correlated relative risk by study sample size

The nonparametric confidence interval often has the largest MSE for sample sizes less than n=100. This reflects the expected conservatism of nonparametric methods in general, the price paid for making fewer assumptions in the construction of the method. The results above indicate that, relative to all other confidence interval methods considered, the nonparametric confidence interval's coverage probability rarely drops below the nominal coverage level on average. This means that the proposed nonparametric confidence interval has a general tendency to be conservative, even when the performance of alternative methods is poor.

Case Studies

Airway Hyper-Responsiveness

The airway hyper-responsiveness (AHR) study [39] has been used in the literature to compare CIs for two correlated proportions, including in the recent article by Fagerland et al. [2]. Children often experience pulmonary complications following stem cell transplant (SCT). AHR is indicative of the degree of sensitivity of the lungs to foreign stimuli and is associated with unfavorable respiratory symptoms [40]. A prospective pediatric study of 21 participants compared the incidence of AHR before and after stem cell transplant (see data in Table 2). The AHR test result is binary (positive/negative), and the data are paired through the pre- and post-SCT measurements on each child. We obtain the same confidence intervals for the uncorrected asymptotic score, Wald, Bonett–Price, and MOVER Wilson intervals as Fagerland et al. [2] in Table 3. Similar to their findings, we find an increased risk of AHR following SCT.

Table 2.

Observed counts of pediatric airway hyper-responsiveness (AHR) before and after stem cell transplant (SCT)

| Pre-SCT | Post-SCT: AHR | Post-SCT: No AHR | Total |
|---|---|---|---|
| AHR | 1 | 1 | 2 |
| No AHR | 7 | 12 | 19 |
| Total | 8 | 13 | 21 |

Table 3.

CI lengths for θRR for the AHR pre- and post-SCT study

| Method | 95% CI lower limit | 95% CI upper limit | Width |
|---|---|---|---|
| Asymptotic score | 0.0653 | 0.9069 | 0.8416 |
| ASCC-L | 0.0628 | 0.9167 | 0.8539 |
| ASCC-M | 0.0603 | 0.9265 | 0.8662 |
| ASCC-H | 0.0555 | 0.9461 | 0.8906 |
| Wald asymptotic | 0.0625 | 0.9996 | 0.9371 |
| Bonett–Price hybrid Wilson score | 0.0677 | 0.9227 | 0.8550 |
| MOVER Wilson score | 0.0686 | 0.8695 | 0.8008 |
| Nonparametric method | 0.0551 | 1.1333 | 1.0781 |

The Wald CI is quite wide compared to alternatives, with a width of 0.94. This is only surpassed in width by the proposed nonparametric method, with a width of 1.08. This is intuitive, as nonparametric methodology tends to be conservative by not making distributional assumptions, which is likely appropriate in a study of this size. The proposed ASCC method widens the original score asymptotic CI from 0.84 to 0.85, 0.87, and 0.89 for continuity corrections of increasing strength, which is particularly important in a study of this size. The MOVER CI has the shortest width at 0.80.
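As a quick arithmetic check on Tables 2 and 3, the point estimate and one of the reported widths can be worked out by hand. This sketch assumes the convention of Appendix 4, in which the first proportion uses the row margin (here pre-SCT) and the second uses the column margin (here post-SCT):

$$\hat\theta_{RR}=\frac{\hat p_1}{\hat p_2}=\frac{(x_{11}+x_{12})/n}{(x_{11}+x_{21})/n}=\frac{2/21}{8/21}=0.25,\qquad \text{width}_{\text{score}}=0.9069-0.0653=0.8416.$$

Under this convention an estimate below 1 means the pre-SCT proportion is smaller than the post-SCT proportion, which agrees with the reported increase in AHR risk following SCT.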

Breast Cancer Detection

We additionally apply the methods described above to data from a prospective study of adult women at risk for invasive breast cancer [41]. Dense breast tissue is associated with an increased risk of breast cancer and a greater likelihood of a false-negative mammogram in screening for early detection of breast cancer [42]. Advanced medical techniques are therefore needed to detect invasive breast cancer in women with dense breast tissue. Sonography and magnetic resonance imaging (MRI) are traditionally accepted detection technologies. However, sonograms are labor intensive and associated with low specificity, while gold-standard MRIs are expensive and thus unrealistic as a population screening technology. Increasingly popular alternative detection technologies are digital breast tomosynthesis (DBT) and abbreviated magnetic resonance imaging (AB-MRI).

Comstock et al. [41] sought to compare the detection rates of DBT and AB-MRI to the results of surgical biopsy, the standard of care for breast cancer diagnosis. A total of 1430 women aged 40 to 75 with dense breast tissue were enrolled in the study between December 2016 and November 2017 (see data in Tables 4 and 5). The primary outcome was detection of breast cancer, defined as either invasive breast cancer or ductal carcinoma in situ (DCIS), which is a cancerous noninvasive lesion. During the course of the study, each participant received DBT screening and AB-MRI screening for breast cancer. Formal diagnosis of breast cancer was obtained by follow-up within two years of study conclusion via surgical biopsy, the standard of care (SOC) diagnosis method.

Table 4.

Diagnostic accuracy of DBT compared to standard of care

| DBT | Invasive cancer or DCIS: Present | Invasive cancer or DCIS: Absent | Total |
|---|---|---|---|
| Positive | 9 | 36 | 45 |
| Negative | 14 | 1371 | 1385 |
| Total | 23 | 1407 | 1430 |

Table 5.

Diagnostic accuracy of AB-MRI compared to standard of care

| AB-MRI | Invasive cancer or DCIS: Present | Invasive cancer or DCIS: Absent | Total |
|---|---|---|---|
| Positive | 22 | 187 | 209 |
| Negative | 1 | 1220 | 1221 |
| Total | 23 | 1407 | 1430 |

Interestingly, when comparing DBT to SOC in Table 6, the nonparametric CI is of intermediate width (1.74) between those of the score intervals (ranging from 1.748 for uncorrected to 1.749 for high correction) and the Wald, Bonett–Price, and MOVER intervals. This is likely owing to the large study sample size. Comparing AB-MRI to SOC presented in Table 7, the nonparametric CI is the longest at 7.231. We see the same increasing trend in the widths of the score-based CIs, from 7.208 uncorrected to 7.209 with high correction. In both DBT and AB-MRI comparisons, the MOVER interval again has the shortest width at 1.717 and 7.147 for DBT and AB-MRI, respectively. This particular case study illustrates that the continuity correction modification of the proposed interval makes only a small, likely not practically significant, difference in large studies where the asymptotic approximation assumed by the original score interval is likely reasonable.

Table 6.

CI lengths for θRR for the breast cancer detection study comparing DBT to SOC

| Method | 95% CI lower limit | 95% CI upper limit | Width |
|---|---|---|---|
| Asymptotic score | 1.2795 | 3.0274 | 1.7479 |
| ASCC-L | 1.2794 | 3.0275 | 1.7481 |
| ASCC-M | 1.2794 | 3.0276 | 1.7482 |
| ASCC-H | 1.2792 | 3.0278 | 1.7486 |
| Wald asymptotic | 1.2717 | 3.0100 | 1.7383 |
| Bonett–Price hybrid Wilson score | 1.2744 | 3.0037 | 1.7293 |
| MOVER Wilson score | 1.2705 | 2.9880 | 1.7175 |
| Nonparametric method | 1.2711 | 3.0116 | 1.7405 |

Table 7.

CI lengths for θRR for the breast cancer detection study comparing AB-MRI to SOC

| Method | 95% CI lower limit | 95% CI upper limit | Width |
|---|---|---|---|
| Asymptotic score | 6.2443 | 13.4526 | 7.2083 |
| ASCC-L | 6.2443 | 13.4527 | 7.2085 |
| ASCC-M | 6.2442 | 13.4528 | 7.2086 |
| ASCC-H | 6.2441 | 13.4531 | 7.2090 |
| Wald asymptotic | 6.1671 | 13.3892 | 7.2220 |
| Bonett–Price hybrid Wilson score | 6.1815 | 13.3580 | 7.1764 |
| MOVER Wilson score | 6.1704 | 13.3173 | 7.1469 |
| Nonparametric method | 6.1643 | 13.3954 | 7.2311 |

The lower limits for comparing AB-MRI to SOC are all above 6, which indicates that AB-MRI is a test that could have a very high false-positive rate. Meanwhile, the lower limits for comparing DBT to SOC are between 1.2 and 1.3, indicating a better performance of DBT as compared to AB-MRI.
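For orientation, the point estimates behind these intervals can be read off the margins of Tables 4 and 5. This is a sketch under the assumption that the relative risk is the proportion of positive screens divided by the proportion of biopsy-confirmed cancers, following the row-over-column convention of Appendix 4:

$$\hat\theta_{RR}^{\text{DBT}}=\frac{45/1430}{23/1430}=\frac{45}{23}\approx 1.96,\qquad \hat\theta_{RR}^{\text{AB-MRI}}=\frac{209/1430}{23/1430}=\frac{209}{23}\approx 9.09.$$

Both values lie inside every interval reported in Tables 6 and 7, respectively. A ratio well above 1 means many more positive screens than confirmed cancers, which is why large lower limits point toward false positives.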

Discussion

As our simulations have shown, applying a continuity correction to Tang's iterative asymptotic score interval can be quite beneficial in certain situations. In particular, for small sample sizes, the asymptotic score with continuity correction provides the closest average coverage probability to the specified nominal level as correlation increases. Additionally, in both small (N=30) and moderate (N=60,100) sample sizes, ASCC was shown to be beneficial under increasing strength of correlation when the probability of the event is small relative to the true relative risk. Practically, this indicates that an investigator who expects a stronger correlation between the two tests in the study, and expects small probabilities of observing the event but a larger effect size, is more likely to meet nominal coverage when applying ASCC than when applying Tang's original score interval (or any other studied method, save the nonparametric CI).

When the sample size is large and/or the correlation is low, applying a continuity correction can be conservative compared to the uncorrected asymptotic score interval. This contrasts with the behavior of the Wald, Bonett–Price, and MOVER Wilson intervals in such situations, which are over-conservative for weakly correlated samples but experience unacceptably low average coverage probabilities with increasing strength of correlation. Therefore, in situations with moderate sample and effect sizes, the standard recommendation to use the uncorrected asymptotic score interval holds. However, for larger anticipated effect sizes with strong correlation between tests, we recommend the asymptotic score with continuity correction for general practice. We developed an R Shiny app, available at https://dongyuanwu.shinyapps.io/PairedRR/.

In addition to the improved operating characteristics of the proposed ASCC intervals under the above scenarios, the proposed method offers computational benefits. Provision of a closed form of the uncorrected and continuity-corrected asymptotic score interval decreases the computational load relative to the iterative solution search previously required by Tang et al. [30]. Further, closed-form expressions for the confidence limits are more accessible to a clinical audience and avoid common optimization challenges in the root-finding process. Regardless of the decision to impose a continuity correction, the use of the closed-form solution is recommended. R function implementations of all available methods can be obtained from the authors upon request.

In agreement with the findings of Duan et al. [32], we illustrate that the U-statistic-based nonparametric method is conservative when the sample size is small. In fact, in the majority of simulation scenarios studied, the nonparametric interval rarely drops below nominal coverage probability. The nonparametric interval frequently meets or exceeds nominal coverage, as shown in the bottom panel of Fig. 6. As a specific example, in Fig. 1, the nonparametric method only shows coverage probabilities consistently below nominal coverage when ρ=0.6 and θRR=2, while the performance of most other methods is severely challenged.

Though we recommend the use of the asymptotic score interval (corrected or uncorrected based on the characteristics of the available data) for practical use to achieve coverage probability closer to the nominal level, the practical context of the analysis may warrant the choice of more conservative coverage at the expense of a wider interval. This makes our extension of Duan's nonparametric interval for correlated RD to correlated RR useful. The decision to use the nonparametric confidence interval should in general be motivated by a risk–benefit assessment informed by the real-world consequences of failing to capture the true value of the parameter under study in the confidence interval. This may be particularly desirable in the drug development context, where national regulatory agencies tend to prefer conservative inferential techniques. Note that for both the RD and RR, the intervals are nonparametric in the sense that the derivation of the point estimate and the variance–covariance matrix is conducted without distributional assumptions, as outlined in Sect. 2.3.2. However, the variances of RD and RR are both derived using the Delta method, and hence some asymptotic behavior can be observed in the lines representing the nonparametric interval in the simulation summaries.

In the small sample case where a conservative CI is desired, one could also consider an exact CI for the correlated RR. In contrast with traditional methods, exact CIs do not rely on the asymptotic normal approximation to the binomial distribution holding reasonably well, but instead use the binomial distribution directly to enumerate all cumulative probabilities of interest for interval construction. Thus, exact CIs can be computationally intensive and are often most feasible in small-to-moderate sample size settings. This provides the benefit of improved coverage in small sample settings but sacrifices the simple closed-form expression and ease of computation of our proposed nonparametric interval for the paired RR. According to the authors of the ExactCIdiff package in R, even in relatively small sample sizes (such as n=100) the computation for a single exact confidence interval for the difference of two correlated proportions can take an hour to complete on an HP laptop with an Intel(R) Core(TM) i5-2520M CPU @ 2.50 GHz and 8 GB RAM. This puts a time comparison outside the scope of the current simulation study. Extensive development in this area can be found in the literature [43]. We believe a conservative interval is a good addition to any simulation study for comparison purposes but preferred to include an interval with less computational intensity. For similar reasons, we did not include the Bonett–Price interval with continuity correction in our simulation study. Fagerland et al. [2] found that the continuity-corrected Bonett–Price interval was so conservative that it approached the performance of an exact CI rather than an approximation.

In this paper, improved methodology for calculating confidence intervals for the correlated relative risk is presented in the context of a two-by-two contingency table. A future direction of research is to extend the methods and notations described here to two-way contingency tables of higher dimension. One example would be a study that is stratified by covariate(s) (e.g., gender, race). Estimation in this setting entails providing confidence interval formulas for the stratified correlated relative risk.

We would like to bring attention to another future direction of research which is not exclusive to this paper, but would have greatly added to the quality of the investigation. Though we thoroughly searched, it was a difficult task to find any applied example in which both the experimental design was applicable and the necessary information to construct the 2 by 2 table of interest was published. As a result, we were unable to conduct a resampling-based assessment of confidence interval coverage probabilities in our real data examples. This is a well-known challenge for methodological research related to clinical trial design and analysis. We hope that collaborative movements to make full clinical trial data publicly available where appropriate make this possible in our future lines of research.

Acknowledgements

The authors are very grateful to the Editor, Associate Editor, and two reviewers for their insightful comments that helped improve the manuscript. We would like to thank Dr. Pete Laud from the University of Sheffield, who gave us permission to use, modify, and redistribute the functions in his R package ratesci. The authors thank Dr. Duan's group, who shared their R code for their nonparametric interval. Shan's research is partially supported by grants from NIH: R01AG070849 and R03CA248006.

Appendix 1: The Boundary of the Pearson Correlation Coefficient

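To obtain the range of the Pearson correlation coefficient ρ for a pair of Bernoulli random variables (X1,X2), the following inequality is used: $\max(0,p_1+p_2-1)\le p_{11}\le\min(p_1,p_2)$. The above inequality can be rewritten as

$$\max(0,p_1+p_2-1)\le\rho\sqrt{p_1(1-p_1)p_2(1-p_2)}+p_1p_2\le\min(p_1,p_2).$$

When $p_1,p_2\ne 0,1$, we can solve the right side of the inequality to obtain the upper bound of ρ,

$$\begin{aligned}
\rho&\le\frac{\min(p_1,p_2)-p_1p_2}{\sqrt{p_1(1-p_1)p_2(1-p_2)}}\\
\Rightarrow\ \rho&\le\frac{\min\{p_1(1-p_2),\,p_2(1-p_1)\}}{\sqrt{p_1(1-p_1)p_2(1-p_2)}}\\
\Rightarrow\ \rho&\le\min\left\{\left[\frac{p_1(1-p_2)}{p_2(1-p_1)}\right]^{1/2},\ \left[\frac{p_2(1-p_1)}{p_1(1-p_2)}\right]^{1/2}\right\}.
\end{aligned}$$

Moreover, the lower bound of ρ is obtained by solving the left side of the inequality,

$$\begin{aligned}
\rho&\ge\frac{\max(0,p_1+p_2-1)-p_1p_2}{\sqrt{p_1(1-p_1)p_2(1-p_2)}}\\
\Rightarrow\ \rho&\ge\frac{\max\{-p_1p_2,\,-(1-p_1)(1-p_2)\}}{\sqrt{p_1(1-p_1)p_2(1-p_2)}}\\
\Rightarrow\ \rho&\ge\max\left\{-\left[\frac{p_1p_2}{(1-p_1)(1-p_2)}\right]^{1/2},\ -\left[\frac{(1-p_1)(1-p_2)}{p_1p_2}\right]^{1/2}\right\}.
\end{aligned}$$

Appendix 1 shows the derivation of the formulas for $L_\rho$ and $U_\rho$:

$$L_\rho=\max\left\{-\left[\frac{p_1p_2}{(1-p_1)(1-p_2)}\right]^{1/2},\ -\left[\frac{(1-p_1)(1-p_2)}{p_1p_2}\right]^{1/2}\right\};\qquad U_\rho=\min\left\{\left[\frac{p_1(1-p_2)}{p_2(1-p_1)}\right]^{1/2},\ \left[\frac{p_2(1-p_1)}{p_1(1-p_2)}\right]^{1/2}\right\}.$$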

The lower bound and the upper bound can be further improved when the ranges of p12, p21, and p22 are utilized. In the R program, we checked all these four cell probabilities to make sure that they are between 0 and 1.
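The bounds of Appendix 1 can be sketched in Python as follows. The function names `rho_bounds` and `p11_from_rho` are hypothetical and are not taken from the authors' R program; the second function simply inverts the definition of the Pearson correlation for two Bernoulli variables so that the bounds can be checked against the Fréchet limits on p11.

```python
import math
from typing import Tuple


def rho_bounds(p1: float, p2: float) -> Tuple[float, float]:
    """Return (L_rho, U_rho) for two Bernoulli margins strictly inside (0, 1)."""
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("p1 and p2 must lie strictly between 0 and 1")
    q1, q2 = 1.0 - p1, 1.0 - p2
    lower = max(-math.sqrt(p1 * p2 / (q1 * q2)), -math.sqrt(q1 * q2 / (p1 * p2)))
    upper = min(math.sqrt(p1 * q2 / (p2 * q1)), math.sqrt(p2 * q1 / (p1 * q2)))
    return lower, upper


def p11_from_rho(p1: float, p2: float, rho: float) -> float:
    """Joint success probability implied by the margins and the correlation."""
    return rho * math.sqrt(p1 * (1 - p1) * p2 * (1 - p2)) + p1 * p2


# Example: p1 = 0.2 and p2 = 0.1, i.e. theta_RR = 2
low, high = rho_bounds(0.2, 0.1)
```

At the upper bound the implied p11 equals min(p1, p2), and at the lower bound it equals max(0, p1 + p2 - 1), which is a convenient sanity check before simulating correlated binary data.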

Appendix 2: Proof of Theorem 1: The Nam–Blackwelder CI Based on the Test Statistic T(θ0) is Equivalent to the Score Interval Based on the Test Statistic S(θ0)

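Nam and Blackwelder's interval is calculated from $T(\theta)=\pm z_{1-\alpha/2}$. Given α, the score asymptotic CI is calculated using $S(\theta)$. These two CI methods are equal if $T(\theta)=S(\theta)$. To simplify the notation, we denote $x_{\cdot1}=x_{11}+x_{21}$ and $x_{1\cdot}=x_{11}+x_{12}$. These two test statistics are

$$T(\theta)=\frac{x_{1\cdot}-x_{\cdot1}\theta}{\sqrt{n\theta(\hat p_{12}+\hat p_{21})}}$$

and

$$S(\theta)=\frac{x_{1\cdot}-x_{\cdot1}\theta}{\sqrt{n(1+\theta)\tilde p_{21}+(x_{11}+x_{12}+x_{21})(\theta-1)}},$$

where $\tilde p_{21}$, $\hat p_{12}$, and $\hat p_{21}$ are presented in the manuscript. We calculate the difference of the terms under the square roots in the denominators of $T(\theta)$ and $S(\theta)$,

$$\begin{aligned}
&n\theta(\hat p_{12}+\hat p_{21})-\left[n(1+\theta)\tilde p_{21}+(x_{11}+x_{12}+x_{21})(\theta-1)\right]\\
&\quad=n\theta(1+\theta)\hat p_{12}-\theta(\theta-1)(x_{11}+x_{12}+x_{21})\\
&\qquad-\left[\tfrac12\sqrt{(x_{1\cdot}-x_{\cdot1}\theta^2)^2+4x_{12}x_{21}\theta^2}-\tfrac12x_{\cdot1}\theta^2+(\theta-1)(x_{11}+x_{12}+x_{21})+\tfrac12(x_{1\cdot}+2x_{21})\right]\\
&\quad=\tfrac12\left[\sqrt{(x_{1\cdot}-x_{\cdot1}\theta^2)^2+4x_{12}x_{21}\theta^2}-\sqrt{(x_{1\cdot}-x_{\cdot1}\theta^2)^2+4x_{12}x_{21}\theta^2}\right]\\
&\quad=0.
\end{aligned}$$

Therefore, $T(\theta)=S(\theta)$. The Nam–Blackwelder 100(1-α)% CI based on the constrained maximum likelihood is exactly the same as the 100(1-α)% score asymptotic CI.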

Appendix 3: Closed-Form Solutions for the Score Interval and the Proposed ASCC Intervals

Following the Appendix in Nam [26], we first derive the closed-form estimation of confidence limits for the score asymptotic CI. Then, we present the closed-form formula to calculate exact confidence limits of the continuity-corrected score asymptotic CI. To simplify the notations, we denote x·1=x11+x21 and x1·=x11+x12.

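Similar to the Appendix in Nam [26], we solve a quartic equation to obtain the exact values of the two confidence limits for the score asymptotic CI. The steps to obtain the exact solutions are as follows. When θ is greater than $x_{1\cdot}/x_{\cdot1}$, the equation to calculate the upper confidence limit is

$$\frac{x_{\cdot1}\theta-x_{1\cdot}}{\sqrt{n(1+\theta)\tilde p_{21}+(x_{11}+x_{12}+x_{21})(\theta-1)}}=z_{1-\alpha/2},$$

where $\tilde p_{21}$ is shown in Sect. 2.2.4. Solving the equation,

$$\begin{aligned}
x_{\cdot1}\theta-x_{1\cdot}&=z_{1-\alpha/2}\sqrt{n(1+\theta)\tilde p_{21}+(x_{11}+x_{12}+x_{21})(\theta-1)}\\
(x_{\cdot1}\theta-x_{1\cdot})^2&=z_{1-\alpha/2}^2\left[n(1+\theta)\tilde p_{21}+(x_{11}+x_{12}+x_{21})(\theta-1)\right]\\
z_{1-\alpha/2}^2\,n(1+\theta)\tilde p_{21}&=x_{\cdot1}^2\theta^2-\left[2x_{\cdot1}x_{1\cdot}+z_{1-\alpha/2}^2(x_{11}+x_{12}+x_{21})\right]\theta+x_{1\cdot}^2+z_{1-\alpha/2}^2(x_{11}+x_{12}+x_{21}).
\end{aligned}$$

Substituting $\tilde p_{21}$ as an expression of θ, we have

$$\tfrac12z_{1-\alpha/2}^2\sqrt{T}=\left(x_{\cdot1}^2+\tfrac12z_{1-\alpha/2}^2x_{\cdot1}\right)\theta^2-\left[2x_{\cdot1}x_{1\cdot}+z_{1-\alpha/2}^2(x_{11}+x_{12}+x_{21})\right]\theta+x_{1\cdot}^2+\tfrac12z_{1-\alpha/2}^2x_{1\cdot},\tag{10}$$

where $T\ge0$ and $T=x_{\cdot1}^2\theta^4-2(x_{11}^2+x_{11}x_{12}+x_{11}x_{21}-x_{21}x_{12})\theta^2+x_{1\cdot}^2$. To calculate θ, we need to square both sides of Eq. (10). However, the right side of Eq. (10), denoted as V, is not necessarily nonnegative, and squaring introduces extraneous solutions of θ in such a case. Therefore, any solution of θ for which V<0 should be discarded. Squaring both sides of Eq. (10), we obtain the quartic equation

$$a\theta^4+b\theta^3+c\theta^2+d\theta+e=0,\tag{11}$$

where

$$\begin{aligned}
a&=x_{\cdot1}^4+z_{1-\alpha/2}^2x_{\cdot1}^3\\
b&=-\left(2x_{\cdot1}^2+z_{1-\alpha/2}^2x_{\cdot1}\right)\left[2x_{\cdot1}x_{1\cdot}+z_{1-\alpha/2}^2(x_{11}+x_{12}+x_{21})\right]\\
c&=6x_{\cdot1}^2x_{1\cdot}^2+z_{1-\alpha/2}^4(x_{1\cdot}+x_{\cdot1})(x_{11}+x_{12}+x_{21})+z_{1-\alpha/2}^2x_{\cdot1}x_{1\cdot}(6x_{11}+5x_{12}+5x_{21})\\
d&=-\left(2x_{1\cdot}^2+z_{1-\alpha/2}^2x_{1\cdot}\right)\left[2x_{\cdot1}x_{1\cdot}+z_{1-\alpha/2}^2(x_{11}+x_{12}+x_{21})\right]\\
e&=x_{1\cdot}^4+z_{1-\alpha/2}^2x_{1\cdot}^3.
\end{aligned}$$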

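Using Ferrari's method, the formulas of the roots are

$$\begin{aligned}
\theta_1&=-\frac{b}{4a}-S-\frac12\sqrt{-4S^2-2k+\frac{h}{S}}\\
\theta_2&=-\frac{b}{4a}-S+\frac12\sqrt{-4S^2-2k+\frac{h}{S}}\\
\theta_3&=-\frac{b}{4a}+S-\frac12\sqrt{-4S^2-2k-\frac{h}{S}}\\
\theta_4&=-\frac{b}{4a}+S+\frac12\sqrt{-4S^2-2k-\frac{h}{S}},
\end{aligned}\tag{12}$$

where

$$k=\frac{8ac-3b^2}{8a^2},\qquad h=\frac{b^3-4abc+8a^2d}{8a^3}.$$

The value of S is calculated using different formulas under various conditions. We denote $\Delta_0=c^2-3bd+12ae$, $\Delta_1=2c^3-9bcd+27b^2e+27ad^2-72ace$, and $\Delta=(4\Delta_0^3-\Delta_1^2)/27$. If $\Delta<0$ and $\Delta_0\ne0$,

$$Q=\sqrt[3]{\frac{\Delta_1+\sqrt{\Delta_1^2-4\Delta_0^3}}{2}},\qquad S=\frac12\sqrt{-\frac23k+\frac{1}{3a}\left(Q+\frac{\Delta_0}{Q}\right)}.$$

If $\Delta<0$ and $\Delta_0=0$,

$$Q=\sqrt[3]{\Delta_1},\qquad S=\frac12\sqrt{-\frac23k+\frac{1}{3a}\left(Q+\frac{\Delta_0}{Q}\right)}.$$

If $\Delta>0$,

$$\varphi=\arccos\left(\frac{\Delta_1}{2\sqrt{\Delta_0^3}}\right),\qquad S=\frac12\sqrt{-\frac23k+\frac{2}{3a}\sqrt{\Delta_0}\cos\frac{\varphi}{3}}.$$

If $\Delta=0$ and $\Delta_0\ne0$, the formula for the case $\Delta>0$ can be used to calculate S. If $\Delta=0$ and $\Delta_0=0$, and thus $\Delta_1=0$, at least three roots of Eq. (11) are equal. In this special case, the four roots of Eq. (11) are denoted as the triple root $\theta_{tri}$ and the unique root $\theta_{uni}$. The roots are calculated using the following procedure. First, solve the equation $6a\theta^2+3b\theta+c=0$ to obtain two values. Then substitute the two values into the left side of Eq. (11) to find the one that satisfies the quartic equation. This common root of the two equations is $\theta_{tri}$, and the unique root is calculated as $\theta_{uni}=-3\theta_{tri}-b/a$. If $S=0$, the associated depressed quartic equation of Eq. (11) is a biquadratic equation,

$$\theta^4+\left(\frac{8ac-3b^2}{8a^2}\right)\theta^2+\frac{256a^3e+16ab^2c-3b^4-64a^2bd}{256a^4}=0,$$

and the four roots of Eq. (11) can be obtained by solving the equation above.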

Because of symmetry, the lower confidence limit also satisfies Eq. (11). After the four roots are obtained, the corresponding values of V are calculated. The roots with V<0 are discarded, and the remaining roots are substituted into Eq. (7) to select the upper and lower confidence limits $\theta_U$ and $\theta_L$ satisfying $S(\theta_U)=-z_{1-\alpha/2}$ and $S(\theta_L)=z_{1-\alpha/2}$.
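The following Python sketch offers a numerical cross-check of the derivation above. It does not implement Ferrari's method. Instead, it finds the two limits by plain bisection on S(θ) = ±z and then confirms that each limit is a root of the quartic in Eq. (11). The closed form used for the term under the square root is the one that appears in Appendix 2; all function names are hypothetical, and the sketch assumes x1· > 0, x·1 > 0, and a lower limit above 1e-9.

```python
import math

Z975 = 1.959963984540054  # 0.975 quantile of the standard normal


def radicand(theta, x11, x12, x21):
    """n(1+theta)*p21_tilde + (x11+x12+x21)*(theta-1), via the closed form of p21_tilde."""
    r1, c1, m = x11 + x12, x11 + x21, x11 + x12 + x21
    t = (r1 - c1 * theta**2) ** 2 + 4 * x12 * x21 * theta**2
    return 0.5 * (math.sqrt(t) - c1 * theta**2 + r1 + 2 * x21) + m * (theta - 1)


def score_stat(theta, x11, x12, x21):
    """S(theta) = (x1. - x.1 * theta) / sqrt(radicand)."""
    d = max(radicand(theta, x11, x12, x21), 1e-300)  # guard against rounding near 0
    return ((x11 + x12) - (x11 + x21) * theta) / math.sqrt(d)


def bisect(f, lo, hi, tol=1e-13):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol * max(1.0, abs(hi)):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def score_ci(x11, x12, x21, z=Z975):
    """Limits solving S(theta_L) = z and S(theta_U) = -z."""
    est = (x11 + x12) / (x11 + x21)
    lower = bisect(lambda th: score_stat(th, x11, x12, x21) - z, 1e-9, est)
    hi = 2.0 * est
    while score_stat(hi, x11, x12, x21) + z > 0:  # widen until the sign changes
        hi *= 2.0
    upper = bisect(lambda th: score_stat(th, x11, x12, x21) + z, est, hi)
    return lower, upper


def quartic_coeffs(x11, x12, x21, z=Z975):
    """Coefficients (a, b, c, d, e) of Eq. (11)."""
    r1, c1, m, z2 = x11 + x12, x11 + x21, x11 + x12 + x21, z * z
    mid = 2 * c1 * r1 + z2 * m
    a = c1**4 + z2 * c1**3
    b = -(2 * c1**2 + z2 * c1) * mid
    c = 6 * c1**2 * r1**2 + z2**2 * (r1 + c1) * m + z2 * c1 * r1 * (6 * x11 + 5 * x12 + 5 * x21)
    d = -(2 * r1**2 + z2 * r1) * mid
    e = r1**4 + z2 * r1**3
    return a, b, c, d, e


def relative_residual(coeffs, theta):
    """|polynomial value| divided by the sum of the absolute terms."""
    terms = [k * theta**p for k, p in zip(coeffs, (4, 3, 2, 1, 0))]
    return abs(sum(terms)) / sum(abs(t) for t in terms)


# AHR data from Table 2: x11 = 1, x12 = 1, x21 = 7
lower, upper = score_ci(1, 1, 7)
```

For the AHR data this should agree with the asymptotic score row of Table 3 to the printed precision. Bisection is used here only because it is easy to verify; the closed-form route remains the one recommended in the text.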

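To calculate confidence limits of the 100(1-α)% Continuity-corrected Score Asymptotic CI, we need to solve the equation,

$$\frac{|x_{1\cdot}-x_{\cdot1}\theta|-\frac{x_{\cdot1}}{\delta n}}{\sqrt{n(1+\theta)\tilde p_{21}+(x_{11}+x_{12}+x_{21})(\theta-1)}}=z_{1-\alpha/2},$$

where δ is a constant (e.g., δ=2). Thus, solving the equation,

$$\frac{x_{\cdot1}\theta-\left(x_{1\cdot}+\frac{x_{\cdot1}}{\delta n}\right)}{\sqrt{n(1+\theta)\tilde p_{21}+(x_{11}+x_{12}+x_{21})(\theta-1)}}=z_{1-\alpha/2},\tag{13}$$

to obtain $\theta_U$. And solving the equation,

$$\frac{\left(x_{1\cdot}-\frac{x_{\cdot1}}{\delta n}\right)-x_{\cdot1}\theta}{\sqrt{n(1+\theta)\tilde p_{21}+(x_{11}+x_{12}+x_{21})(\theta-1)}}=z_{1-\alpha/2},\tag{14}$$

to obtain $\theta_L$. From Eq. (13), we can obtain the quartic equation like Eq. (11) where

$$\begin{aligned}
a&=x_{\cdot1}^4+z_{1-\alpha/2}^2x_{\cdot1}^3\\
b&=-\left(2x_{\cdot1}^2+z_{1-\alpha/2}^2x_{\cdot1}\right)\left[2x_{\cdot1}\left(x_{1\cdot}+\tfrac{x_{\cdot1}}{\delta n}\right)+z_{1-\alpha/2}^2(x_{11}+x_{12}+x_{21})\right]\\
c&=6x_{\cdot1}^2\left(x_{1\cdot}+\tfrac{x_{\cdot1}}{\delta n}\right)^2+z_{1-\alpha/2}^4(x_{1\cdot}+x_{\cdot1})(x_{11}+x_{12}+x_{21})\\
&\quad+z_{1-\alpha/2}^2x_{\cdot1}\left[\left(x_{1\cdot}+\tfrac{x_{\cdot1}}{\delta n}\right)^2+4(x_{11}+x_{12}+x_{21})\left(x_{1\cdot}+\tfrac{x_{\cdot1}}{\delta n}\right)+x_{\cdot1}x_{1\cdot}\right]\\
d&=-\left[2\left(x_{1\cdot}+\tfrac{x_{\cdot1}}{\delta n}\right)^2+z_{1-\alpha/2}^2x_{1\cdot}\right]\left[2x_{\cdot1}\left(x_{1\cdot}+\tfrac{x_{\cdot1}}{\delta n}\right)+z_{1-\alpha/2}^2(x_{11}+x_{12}+x_{21})\right]\\
e&=\left(x_{1\cdot}+\tfrac{x_{\cdot1}}{\delta n}\right)^4+z_{1-\alpha/2}^2x_{1\cdot}\left(x_{1\cdot}+\tfrac{x_{\cdot1}}{\delta n}\right)^2.
\end{aligned}$$

Then the four roots are calculated by Eq. (12) obtained from Ferrari's method. Because $\theta_U$ satisfies Eq. (13), only one of the four roots is selected as $\theta_U$. From Eq. (14), the quartic equation like Eq. (11) is obtained where

$$\begin{aligned}
a&=x_{\cdot1}^4+z_{1-\alpha/2}^2x_{\cdot1}^3\\
b&=-\left(2x_{\cdot1}^2+z_{1-\alpha/2}^2x_{\cdot1}\right)\left[2x_{\cdot1}\left(x_{1\cdot}-\tfrac{x_{\cdot1}}{\delta n}\right)+z_{1-\alpha/2}^2(x_{11}+x_{12}+x_{21})\right]\\
c&=6x_{\cdot1}^2\left(x_{1\cdot}-\tfrac{x_{\cdot1}}{\delta n}\right)^2+z_{1-\alpha/2}^4(x_{1\cdot}+x_{\cdot1})(x_{11}+x_{12}+x_{21})\\
&\quad+z_{1-\alpha/2}^2x_{\cdot1}\left[\left(x_{1\cdot}-\tfrac{x_{\cdot1}}{\delta n}\right)^2+4(x_{11}+x_{12}+x_{21})\left(x_{1\cdot}-\tfrac{x_{\cdot1}}{\delta n}\right)+x_{\cdot1}x_{1\cdot}\right]\\
d&=-\left[2\left(x_{1\cdot}-\tfrac{x_{\cdot1}}{\delta n}\right)^2+z_{1-\alpha/2}^2x_{1\cdot}\right]\left[2x_{\cdot1}\left(x_{1\cdot}-\tfrac{x_{\cdot1}}{\delta n}\right)+z_{1-\alpha/2}^2(x_{11}+x_{12}+x_{21})\right]\\
e&=\left(x_{1\cdot}-\tfrac{x_{\cdot1}}{\delta n}\right)^4+z_{1-\alpha/2}^2x_{1\cdot}\left(x_{1\cdot}-\tfrac{x_{\cdot1}}{\delta n}\right)^2.
\end{aligned}$$

The four roots are calculated by Eq. (12) and $\theta_L$ is selected from the four roots using the condition that $\theta_L$ satisfies Eq. (14).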
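A continuity-corrected interval can be sketched in the same spirit, solving Eqs. (13) and (14) by bisection rather than through the quartic roots. The sketch reads the correction term as x·1/(δn) and uses hypothetical function names. The pairing of δ = 8, 4, 2 with ASCC-L, ASCC-M, ASCC-H is an inference from the numbers in Table 3 and is not stated in this section. The sketch also assumes x·1 > 0 and x1· larger than the correction term.

```python
import math

Z975 = 1.959963984540054  # 0.975 quantile of the standard normal


def radicand(theta, x11, x12, x21):
    """n(1+theta)*p21_tilde + (x11+x12+x21)*(theta-1), via the closed form of p21_tilde."""
    r1, c1, m = x11 + x12, x11 + x21, x11 + x12 + x21
    t = (r1 - c1 * theta**2) ** 2 + 4 * x12 * x21 * theta**2
    return 0.5 * (math.sqrt(t) - c1 * theta**2 + r1 + 2 * x21) + m * (theta - 1)


def bisect(f, lo, hi, tol=1e-13):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol * max(1.0, abs(hi)):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ascc_ci(x11, x12, x21, x22, delta, z=Z975):
    """Continuity-corrected score limits from Eqs. (13) and (14), found numerically."""
    n = x11 + x12 + x21 + x22
    r1, c1 = x11 + x12, x11 + x21
    cc = c1 / (delta * n)  # continuity correction x.1 / (delta * n)
    est = r1 / c1

    def root_term(theta):
        return math.sqrt(max(radicand(theta, x11, x12, x21), 1e-300))

    def eq13(theta):  # zero at the upper limit
        return (c1 * theta - (r1 + cc)) / root_term(theta) - z

    def eq14(theta):  # zero at the lower limit
        return ((r1 - cc) - c1 * theta) / root_term(theta) - z

    hi = 2.0 * est
    while eq13(hi) < 0:  # widen until the sign changes
        hi *= 2.0
    return bisect(eq14, 1e-9, est), bisect(eq13, est, hi)


# AHR data from Table 2 with delta = 2
lower_h, upper_h = ascc_ci(1, 1, 7, 12, delta=2)
```

A smaller δ gives a larger correction and therefore a wider interval, which matches the ordering of the ASCC widths in Table 3.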

Appendix 4: The Asymptotic Based Nonparametric Confidence Interval

For correlated binary data of two tests, Duan et al. [32] estimated the variance–covariance matrix of $(p_1,p_2)^T$ as

$$\hat V=\begin{pmatrix}\dfrac{(x_{11}+x_{12})(x_{21}+x_{22})}{n^2(n-1)}&\dfrac{x_{11}x_{22}-x_{12}x_{21}}{n^2(n-1)}\\[2ex]\dfrac{x_{11}x_{22}-x_{12}x_{21}}{n^2(n-1)}&\dfrac{(x_{11}+x_{21})(x_{12}+x_{22})}{n^2(n-1)}\end{pmatrix}.$$

The proof of this estimated covariance matrix can be found in their paper's appendix using Langes's rank-based method.

For the log relative risk $g(p_1,p_2)=\log(p_1/p_2)$, using the first-order Taylor expansion, we have

$$g(p_1,p_2)\approx\log\frac{\hat p_1}{\hat p_2}+\begin{pmatrix}\dfrac{1}{\hat p_1}&-\dfrac{1}{\hat p_2}\end{pmatrix}\begin{pmatrix}p_1-\hat p_1\\p_2-\hat p_2\end{pmatrix},$$

where $\hat p_1=(x_{11}+x_{12})/n$, $\hat p_2=(x_{11}+x_{21})/n$, and $\hat p_1,\hat p_2>0$. Therefore, the estimated variance of the log relative risk can be obtained using the Delta method

$$\begin{aligned}
\hat V_{LRR}&=\begin{pmatrix}\dfrac{1}{\hat p_1}&-\dfrac{1}{\hat p_2}\end{pmatrix}\hat V\begin{pmatrix}\dfrac{1}{\hat p_1}\\[1.5ex]-\dfrac{1}{\hat p_2}\end{pmatrix}\\
&=\frac{1}{n^2(n-1)\hat p_1^2\hat p_2^2}\left[(x_{11}x_{12}+x_{21}x_{22})\hat p_1^2+(x_{11}x_{21}+x_{12}x_{22})\hat p_2^2+x_{11}x_{22}(\hat p_1-\hat p_2)^2+x_{12}x_{21}(\hat p_1+\hat p_2)^2\right].
\end{aligned}$$

It is straightforward that $\hat V_{LRR}>0$ because $\hat p_1$ and $\hat p_2$ are nonzero positive values.

Based on $\hat V_{LRR}$, we then constructed an approximate 100(1-α)% CI for the log relative risk. The simulation results in Duan et al. [32] showed that the t approximation with degrees of freedom $df_0=n-1$ provided preferable performance of the corresponding CI of risk differences. Compared with the normal approximation, the t approximation with degrees of freedom $df_0$ leads to a slightly conservative test, following Brunner et al. [44, 45]. So, a 100(1-α)% CI of the log relative risk is constructed as (L,U):

$$L=\log\frac{\hat p_1}{\hat p_2}-t_{n-1,1-\alpha/2}\sqrt{\hat V_{LRR}},\qquad U=\log\frac{\hat p_1}{\hat p_2}+t_{n-1,1-\alpha/2}\sqrt{\hat V_{LRR}},$$

where $t_{n-1,1-\alpha/2}$ is the α/2 upper quantile of the t distribution with n-1 degrees of freedom. Moreover, the 100(1-α)% CI of the relative risk is obtained as (exp(L),exp(U)).
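The construction in Appendix 4 translates into a short Python sketch. The function name is hypothetical, and to keep the example free of third-party dependencies the t critical value is passed in by the caller (it could be obtained, for example, from `scipy.stats.t.ppf(1 - alpha / 2, n - 1)` if SciPy is available). The value 2.0859634 used below is the 0.975 quantile of the t distribution with 20 degrees of freedom.

```python
import math
from typing import Tuple


def nonparametric_rr_ci(x11: int, x12: int, x21: int, x22: int, t_crit: float) -> Tuple[float, float]:
    """Delta-method CI for the paired relative risk, following Appendix 4.

    t_crit is the upper alpha/2 quantile of the t distribution with n - 1
    degrees of freedom, supplied by the caller.
    """
    n = x11 + x12 + x21 + x22
    p1 = (x11 + x12) / n
    p2 = (x11 + x21) / n
    if p1 <= 0 or p2 <= 0:
        raise ValueError("both marginal proportions must be positive")
    scale = n**2 * (n - 1)
    v11 = (x11 + x12) * (x21 + x22) / scale   # estimated variance of p1_hat
    v12 = (x11 * x22 - x12 * x21) / scale     # estimated covariance
    v22 = (x11 + x21) * (x12 + x22) / scale   # estimated variance of p2_hat
    # Quadratic form (1/p1, -1/p2) V (1/p1, -1/p2)^T
    var_lrr = v11 / p1**2 - 2 * v12 / (p1 * p2) + v22 / p2**2
    half_width = t_crit * math.sqrt(var_lrr)
    log_rr = math.log(p1 / p2)
    return math.exp(log_rr - half_width), math.exp(log_rr + half_width)


# AHR data from Table 2 (n = 21, so 20 degrees of freedom)
lower, upper = nonparametric_rr_ci(1, 1, 7, 12, t_crit=2.0859634)
```

For the AHR data this should match the nonparametric row of Table 3 to the printed precision. Because the interval is built on the log scale and then exponentiated, it is not symmetric around the point estimate of 0.25.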

Footnotes

Natalie DelRocco, Yipeng Wang, Dongyuan Wu and Yuting Yang have contributed equally on this project.

References

  • 1. Piantadosi S. Clinical trials: a methodological perspective. 2. Hoboken: Wiley; 2005.
  • 2. Fagerland MW, Lydersen S, Laake P. Recommended tests and confidence intervals for paired binomial proportions. Stat Med. 2014;33(16):2850–2875. doi: 10.1002/sim.6148.
  • 3. Newcombe RG. Two-sided confidence intervals for the single proportion: comparison of seven methods. Stat Med. 1998;17(8):857–872. doi: 10.1002/(SICI)1097-0258(19980430)17:8<857::AID-SIM777>3.0.CO;2-E.
  • 4. Shan G. Exact statistical inference for categorical data. 1. San Diego: Academic Press; 2015.
  • 5. Shan G, Ma C, Hutson AD, Wilding GE. An efficient and exact approach for detecting trends with binary endpoints. Stat Med. 2012;31(2):155–164. doi: 10.1002/sim.4411.
  • 6. Okely JA, Corley J, Welstead M, Taylor AM, Page D, Skarabela B, Redmond P, Cox SR, Russ TC. Change in physical activity, sleep quality, and psychosocial variables during COVID-19 lockdown: evidence from the Lothian Birth Cohort 1936. Int J Environ Res Public Health. 2021;18(1):1–16. doi: 10.3390/ijerph18010210.
  • 7. Shan G, Wang W. Advanced statistical methods and designs for clinical trials for COVID-19. Int J Antimicrob Agents. 2021;57(1):106167. doi: 10.1016/j.ijantimicag.2020.106167.
  • 8. Shan G. Exact unconditional testing procedures for comparing two independent Poisson rates. J Stat Comput Simul. 2013;85(5):947–955. doi: 10.1080/00949655.2013.855776.
  • 9. Shan G. New nonparametric rank-based tests for paired data. Open J Stat. 2014;04(07):495–503. doi: 10.4236/ojs.2014.47047.
  • 10. Shan G. Exact confidence intervals for randomized response strategies. J Appl Stat. 2016;43(7):1279–1290. doi: 10.1080/02664763.2015.1094454.
  • 11. Shan G. Exact approaches for testing non-inferiority or superiority of two incidence rates. Stat Probab Lett. 2014;85:129–134. doi: 10.1016/j.spl.2013.11.010.
  • 12. Shan G, Gerstenberger S. Fisher's exact approach for post hoc analysis of a chi-squared test. PLoS ONE. 2017;12(12):e0188709. doi: 10.1371/journal.pone.0188709.
  • 13. Shan G. Improved confidence intervals for the Youden Index. PLoS ONE. 2015;10(7):e0127272.
  • 14. Shan G. Accurate confidence intervals for proportion in studies with clustered binary outcome. Stat Methods Med Res. 2020;29(10):3006–3018. doi: 10.1177/0962280220913971.
  • 15. Casella G, Berger RL. Statistical inference. 2. Mason: Wadsworth Cengage Learning; 2002.
  • 16. Gart JJ, Nam J-M. Approximate interval estimation of the ratio of binomial parameters: a review and corrections for skewness. Biometrics. 1988;44(2):323–338. doi: 10.2307/2531848.
  • 17. Lang JB. Score and profile likelihood confidence intervals for contingency table parameters. Stat Med. 2008;27:5975–5990. doi: 10.1002/sim.3391.
  • 18. Nam J-M, Blackwelder WC. Analysis of the ratio of marginal probabilities in a matched-pair setting. Stat Med. 2002;21(5):689–699. doi: 10.1002/sim.1017.
  • 19. Agresti A. Categorical data analysis. 2. Hoboken: Wiley; 2002.
  • 20. Shan G. Optimal two-stage designs based on restricted mean survival time for a single-arm study. Contemp Clin Trials Commun. 2021;21:100732. doi: 10.1016/j.conctc.2021.100732.
  • 21. Shan G, Ma C, Hutson AD, Wilding GE. Randomized two-stage phase II clinical trial designs based on Barnard's exact test. J Biopharm Stat. 2013;23(5):1081–1090. doi: 10.1080/10543406.2013.813525.
  • 22. Shan G, Wang W. Exact one-sided confidence limits for Cohen's kappa as a measurement of agreement. Stat Methods Med Res. 2017;26(2):615–632. doi: 10.1177/0962280214552881.
  • 23. Shan G, Dodge-Francis C, Wilding GE. Exact unconditional tests for dichotomous data when comparing multiple treatments with a single control. Ther Innov Regul Sci. 2020;54(2):411–417. doi: 10.1007/s43441-019-00070-w.
  • 24. Shan G, Zhang H, Barbour J. Bootstrap confidence intervals for correlation between continuous repeated measures. Stat Methods Appl. 2021:1–21.
  • 25. Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927;22(158):209–212. doi: 10.1080/01621459.1927.10502953.
  • 26. Nam J-M. Efficient interval estimation of a ratio of marginal probabilities in matched-pair data: non-iterative method. Stat Med. 2009;28(23):2929–2935. doi: 10.1002/sim.3685.
  • 27. Bonett DG, Price RM. Confidence intervals for a ratio of binomial proportions based on paired data. Stat Med. 2006;25(17):3039–3047. doi: 10.1002/sim.2440.
  • 28. Tang NS, Tang ML, Chan ISF. On tests of equivalence via non-unity relative risk for matched-pair design. Stat Med. 2003;22(8):1217–1233. doi: 10.1002/sim.1213.
  • 29. Donner A, Zou GY. Closed-form confidence intervals for functions of the normal mean and standard deviation. Stat Methods Med Res. 2012;21(4):347–359. doi: 10.1177/0962280210383082.
  • 30. Tang ML, Li HQ, Tang NS. Confidence interval construction for proportion ratio in paired studies based on hybrid method. Stat Methods Med Res. 2012;21(4):361–378. doi: 10.1177/0962280210384714.
  • 31. Plackett RL. The continuity correction in 2x2 tables. Biometrika. 1964;51(3):327–337.
  • 32. Duan C, Cao Y, Zhou L, Tan MT, Chen P. A novel nonparametric confidence interval for differences of proportions for correlated binary data. Stat Methods Med Res. 2018;27(8):2249–2263. doi: 10.1177/0962280216679040.
  • 33. Leisch F, Weingessel A, Hornik K. On the generation of correlated artificial binary data. 1998.
  • 34. Kowalski J, Xin T. Modern applied U-statistics. Hoboken: Wiley; 2008.
  • 35. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–845. doi: 10.2307/2531595.
  • 36. R Core Team. R: a language and environment for statistical computing. 2021.
  • 37. Laud P. ratesci: confidence intervals for comparisons of binomial or Poisson rates. 2018.
  • 38. Goldfeld K, Wujciak-Jens J. simstudy: simulation of study data. 2020.
  • 39. Bentur L, Lapidot M, Livnat G, Hakim F, Lidroneta-Katz C, Porat I, Vilozni D, Elhasid R. Airway reactivity in children before and after stem cell transplantation. Pediatric Pulmonol. 2009;44(9):845–850. doi: 10.1002/ppul.20964.
  • 40. Postma DS, Kerstjens HAM, Postma S. Characteristics of airway hyperresponsiveness in asthma and chronic obstructive pulmonary disease: relationship between severity of hyperresponsiveness and FEV 1 level. Technical report. 1998.
  • 41. Comstock CE, Gatsonis C, Newstead GM, Snyder BS, Gareen IF, Bergin JT, Rahbar H, Sung JS, Jacobs C, Harvey JA, Nicholson MH, Ward RC, Holt J, Prather A, Miller KD, Schnall MD, Kuhl CK. Comparison of abbreviated breast MRI vs digital breast tomosynthesis for breast cancer detection among women with dense breasts undergoing screening. JAMA. 2020;323(8):746–756. doi: 10.1001/jama.2020.0572.
  • 42. Boyd NF, Rommens JM, Vogt K, Lee V, Hopper JL, Yaffe MJ, Paterson AD. Mammographic breast density as an intermediate phenotype for breast cancer. Lancet Oncol. 2005;6:798–808. doi: 10.1016/S1470-2045(05)70390-9.
  • 43. Wang W, Shan G. Exact confidence intervals for the relative risk and the odds ratio. Biometrics. 2015;71(4):985–995. doi: 10.1111/biom.12360.
  • 44. Brunner E, Dette H, Munk A. Box-type approximations in nonparametric factorial designs. J Am Stat Assoc. 1997;92(440):1494–1502. doi: 10.1080/01621459.1997.10473671.
  • 45. Brunner E, Munzel U, Puri ML. The multivariate nonparametric Behrens-Fisher problem. J Stat Plan Inference. 2002;108(1–2):37–53. doi: 10.1016/S0378-3758(02)00269-0.

Articles from Statistics in Biosciences are provided here courtesy of Nature Publishing Group
