Design and Analysis Methods for Trials with AI-Based Diagnostic Devices for Breast Cancer

Lu Liu; Kevin J Parker; Sin-Ho Jung

doi:10.3390/jpm11111150

. 2021 Nov 4;11(11):1150. doi: 10.3390/jpm11111150

Design and Analysis Methods for Trials with AI-Based Diagnostic Devices for Breast Cancer

Lu Liu ¹, Kevin J Parker ², Sin-Ho Jung ^1,^*

Editors: Gianluca Franceschini, Alba Di Leone, Alejandro Martin Sanchez

PMCID: PMC8617855 PMID: 34834502

Abstract

Imaging is important in cancer diagnostics. It takes a long period of medical training and clinical experience for radiologists to be able to accurately interpret diagnostic images. With the advance of big data analysis, machine learning and AI-based devices are currently under development and taking a role in imaging diagnostics. If an AI-based imaging device can read the image as accurately as experienced radiologists, it may be able to help radiologists increase the accuracy of their reading and manage their workloads. In this paper, we consider two potential study objectives of a clinical trial to evaluate an AI-based device for breast cancer diagnosis by comparing its concordance with human radiologists. We propose statistical design and analysis methods for each study objective. Extensive numerical studies are conducted to show that the proposed statistical testing methods control the type I error rate accurately and the design methods provide required sample sizes with statistical powers close to pre-specified nominal levels. The proposed methods were successfully used to design and analyze a real device trial.

Keywords: artificial intelligence (AI), breast cancer, clinical device trial, concordance rate, generalized estimating equation, sample size calculation, statistical test

1. Introduction

There are different types of device trials depending on the use of device and the study objectives. In this paper, we introduce statistical design and analysis methods for a trial on an artificial intelligence (AI)-based device for the diagnosis of breast cancer.

Imaging technologies play a major role in the diagnosis of breast cancer. The reading and interpretation of imaging requires intensive medical training and significant clinical experience. With the advance of big data analysis methods, machine learning and AI-based imaging systems are currently under active development [1]. If an AI-based imaging device can read the image as accurately as experienced radiologists, it may be able to help radiologists increase the accuracy of their reading, manage their workloads, or possibly replace radiologists in remote clinics that would not have an experienced radiologist available for consultation.

In the assessment of breast lesions, the BI-RADS reporting system and classification are widely used [2]. This system includes categories between 1 and 5 (benign to malignant) with a key diagnostic transition subdivided into categories 4a (low suspicion of malignancy), 4b (moderate suspicion) and 4c (high suspicion, greater than 50% likelihood but less than 95% likelihood of malignancy). Furthermore, the BI-RADS lexicon covers radiological descriptive features that are important in diagnostic assessments, and these vary by modality. Examples of ultrasound lexicon used in AI-based classifications is given in Table A1. The earlier approaches to breast ultrasound technology concentrated on the extraction of features of lesions such as size, shape, texture, and boundaries within a clustering or classification or rule-based decision making algorithms [3,4,5,6]. More recent developments in AI, machine learning, and deep learning systems have utilized layers of convolution neural network models, a variety of approaches and extensive training sets to produce differentiated output classifications [7,8].

In this paper, we consider the requirements for a clinical device trial to evaluate the performance of an AI-based imaging device using BI-RADS reporting system for the diagnosis of breast cancer. Since BI-RADS reporting system does not have a gold standard, we evaluate the performance of the device by how well its reading aligns with those of radiologists. We propose design and analysis methods for two different types of study objectives that can be used for such a trial. The first objective is to test if the reading of the AI-based device concurs with those of radiologists as much as the readings concord among radiologists. The second objective is to test if the reading of the AI-based device is more concordant with those of experienced radiologists than with those of junior radiologists. For each objective, we propose a statistical testing method and its sample size calculation formula. The proposed testing methods will be used to analyze the data for each of the five BI-RADS lexicon classification category listed in Table A1, but the sample size calculation for a trial may be conducted only for the most important one. The performance of these methods are evaluated using simulations.

2. Materials and Methods

We consider two types of study objectives to evaluate the performance of an AI-based device for the diagnosis of breast cancer. For each study objective, we propose a testing method and its sample size formula. Suppose that we have images from n subjects.

2.1. Objective 1: Is the Concordance Rate between the AI-Based Device and Radiologists as High as That among Radiologists?

The image of each subject is read by m radiologists and the AI-based device. BI-RADS lexicon does not have a gold standard. So, in order to validate a device with an AI-based algorithm, we should show that the reading of the device concurs with those with radiologists. For example, in Table A1, for BI-RADS lexicon classification, Shape, two radiologists will be declared to be concurrent for an image if they both read oval, round or irregular. The question is how high the concordance rate should be between the device and the radiologists. The concordance rate among radiologists is used as a reference.

For each category of BI-RADS lexicon classifications, let $p_{r}$ and $p_{s}$ denote the concordance rate among radiologists and that between radiologists and the device, respectively. Since the latter can not be higher than the former, we specify a similarity margin $δ_{1} (> 0)$ . That is, we will not be interested in the AI-based device if $p_{s} \leq p_{r} - δ_{1}$ and will be highly interested in it if $p_{s} = p_{r}$ . So, we want to test a null hypothesis $H_{1} : p_{s} = p_{r} - δ_{1}$ against the alternative hypothesis ${\bar{H}}_{1} : p_{s} > p_{r} - δ_{1}$ .

2.1.1. Statistical Testing Method

Suppose that there are n patients, and the image of each patient is read by the AI-based device and m radiologists. For subject $i (= 1, \dots, n)$ and radiologist $j (= 1, \dots, m)$ , let $r_{i j j^{'}} = 1$ if radiologists j and $j^{'}$ concur and $= 0$ otherwise, and let $s_{i j} = 1$ if radiologist j and the device concur and $= 0$ otherwise. Note that we have $p_{r} = E (r_{i j j^{'}})$ and $p_{s} = E (s_{i j})$ . Since $r_{i j j^{'}} = r_{i j^{'} j}$ and $r_{i j j} = 1$ for $j, j^{'} = 1, \dots, m$ , the number of informative concordance scores among m radiologists is $m (m - 1) / 2$ for each image, the concordance rate among radiologists for subject i is estimated by

r_{i} = \frac{\sum_{j = 1}^{m - 1} \sum_{j^{'} = j + 1}^{m} r_{i j j^{'}}}{m (m - 1) / 2} .

On the other hand, for subject i, the concordance rate between the device and m radiologists is estimated by

s_{i} = \frac{\sum_{j = 1}^{m} s_{i j}}{m}

Using the images from n subjects, concordance rate among radiologists is estimated by

{\hat{p}}_{r} = \frac{2}{n m (m - 1)} \sum_{i = 1}^{n} \sum_{j = 1}^{m - 1} \sum_{j^{'} = j + 1}^{m} r_{i j j^{'}} = \frac{1}{n} \sum_{i = 1}^{n} r_{i}

and that between the device and radiologists is estimated by

{\hat{p}}_{s} = \frac{1}{n m} \sum_{i = 1}^{n} \sum_{j = 1}^{m} s_{i j} = \frac{1}{n} \sum_{i = 1}^{n} s_{i}

Those estimates are unbiased because

E ({\hat{p}}_{r}) = E (\frac{2}{n m (m - 1)} \sum_{i = 1}^{n} \sum_{j = 1}^{m - 1} \sum_{j^{'} = j + 1}^{m} r_{i j j^{'}}) = \frac{2}{n m (m - 1)} \frac{n m (m - 1)}{2} p_{r} = p_{r}

and

E ({\hat{p}}_{s}) = E (\frac{1}{n m} \sum_{i = 1}^{n} \sum_{j = 1}^{m} s_{i j}) = \frac{1}{n m} n m p_{s} = p_{s} .

Since $r_{1}, \dots, r_{n}$ are independent random variables with mean $p_{r}$ , for large n by the central limit theorem,

\sqrt{n} ({\hat{p}}_{r} - p_{r}) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} (r_{i} - p_{r})

is asymptotically normal with mean 0 and variance $σ_{r}^{2}$ that can be consistently estimated by

{\hat{σ}}_{r}^{2} = \frac{1}{n} \sum_{i = 1}^{n} {(r_{i} - {\hat{p}}_{r})}^{2}

Similarly, $s_{1}, \dots, s_{n}$ are independent random variables with mean $p_{s}$ , so that for large n,

\sqrt{n} ({\hat{p}}_{s} - p_{s}) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} (s_{i} - p_{s})

is asymptotically normal with mean 0 and variance $σ_{s}^{2}$ that can be consistently estimated by

{\hat{σ}}_{s}^{2} = \frac{1}{n} \sum_{i = 1}^{n} {(s_{i} - {\hat{p}}_{s})}^{2}

Since each subject’s image is read by the device and radiologists, $r_{i}$ and $s_{i}$ are correlated. However, $(s_{i} - r_{i} + δ_{1}, i = 1, \dots, n)$ are independent, with mean 0 under the null hypothesis $H_{1}$ . Hence, by the central limit theorem under $H_{1}$ ,

\sqrt{n} ({\hat{p}}_{s} - {\hat{p}}_{r} + δ_{1}) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} (s_{i} - r_{i} + δ_{1})

is asymptotically normal with mean 0 and variance $σ_{1}^{2}$ that can be consistently estimated by

{\hat{σ}}_{1}^{2} = \frac{1}{n} \sum_{i = 1}^{n} {(s_{i} - r_{i} + δ_{1})}^{2}

Hence, we reject the null hypothesis $H_{1} : p_{s} \leq p_{r} - δ_{1}$ if $Z_{1} > z_{1 - α}$ , where

Z_{1} = \frac{\sqrt{n} ({\hat{p}}_{s} - {\hat{p}}_{r} + δ_{1})}{{\hat{σ}}_{1}}

and $z_{1 - α}$ is the $100 (1 - α)$ percentile of the standard normal distribution. Note that we use a 1-sided test because the hypotheses are 1-sided and to avoid too large a sample size with a small $δ_{1}$ .

Note that ${\hat{p}}_{r}$ and ${\hat{p}}_{s}$ are the generalized estimating Equation [9] (GEE) estimators of $p_{r}$ and $p_{s}$ , respectively, using the working independent correlation. Furthermore, the robust estimator of $σ_{1}^{2}$ by the GEE method is given as

{\tilde{σ}}_{1}^{2} = \frac{1}{n} \sum_{i = 1}^{n} {(s_{i} - r_{i} + {\hat{p}}_{s} - {\hat{p}}_{r})}^{2}

Since ${\hat{p}}_{s} - {\hat{p}}_{r}$ is a consistent estimator of $p_{s} - p_{r}$ , ${\tilde{σ}}_{1}^{2}$ is asymptotically identical to ${\hat{σ}}_{1}^{2}$ under $H_{1}$ . Hence, $Z_{1}$ can be counted as a test statistic based on the GEE method with the working independent correlation.

2.1.2. Power and Sample Size Calculation

We calculate the sample size for the test statistic $Z_{1}$ under a specific alternative hypothesis ${\bar{H}}_{1} : p_{r} = p_{s}$ . An accurate sample size calculation for the statistical test requires specification of correlation coefficients between $r_{i j_{1} j_{2}}$ and $r_{i j_{1}^{'} j_{2}^{'}}$ , between $r_{i j_{1} j_{2}}$ and $s_{i j}$ , and between $s_{i j}$ and $s_{i j^{'}}$ . The dependency between $r_{i j_{1} j_{2}}$ and $r_{i j_{1} j_{2}^{'}}$ is expected to be higher than that between $r_{i j_{1} j_{2}}$ and $r_{i j_{1}^{'} j_{2}^{'}}$ for $j_{1} \neq j_{1}^{'} \neq j_{2} \neq j_{2}^{'}$ because the former pair includes the same reader $j_{1}$ while the latter pair contains four different readers. Similarly, we expect that the dependency between $r_{i j_{1} j_{2}}$ and $s_{i j_{1}}$ is expected to be higher than that between $r_{i j_{1} j_{2}}$ and $s_{i j_{1}^{'}}$ for $j_{1}, j_{2} \neq j_{1}^{'}$ .

For a simplified sample size formula, we just specify $ρ_{1} = corr (r_{i}, s_{i})$ . We define the correlation coefficients among the concordance scores $ρ_{r 1} = corr (r_{i 12}, r_{i 13})$ , $ρ_{r 2} = corr (r_{i 12}, r_{i 34})$ , $ρ_{s 1} = corr (r_{i 12}, s_{i 1}) = corr (r_{i 12}, s_{i 2})$ , and $ρ_{s 2} = corr (r_{i 12}, s_{i 3})$ , and $ρ_{s s} = corr (s_{i 1}, s_{i 2})$ . Appendix A.1 shows that we have $ρ_{1} = corr (r_{i}, s_{i})$ is expressed as

ρ_{1} = \frac{\frac{2}{m} ρ_{s 1} + \frac{m - 2}{m} ρ_{s 2}}{\sqrt{(\frac{2}{m (m - 1)} + \frac{4 (m - 2)}{m (m - 1)} ρ_{r 1} + \frac{(m - 2) (m - 3)}{m (m - 1)} ρ_{r 2}) (\frac{1}{m} + \frac{m - 1}{m} ρ_{s s})}}

Under ${\bar{H}}_{1} : p_{r} = p_{s}$ , ${(s_{i} - r_{i}), i = 1, \dots, n}$ are independent random variables with mean 0, so that $\sqrt{n} ({\hat{p}}_{s} - {\hat{p}}_{r})$ is asymptotically normal with mean 0 and variance $σ_{1}^{2}$ that can be consistently estimated by $s_{1}^{2} = {\sqrt{n}}^{- 1} \sum_{i = 1}^{n} {(s_{i} - r_{i})}^{2}$ . Since ${\hat{σ}}_{1}^{2}$ is asymptotically identical to $s_{1}^{2} + δ_{1}^{2}$ under ${\bar{H}}_{1}$ , it converges to $σ_{1}^{2} + δ_{1}^{2}$ . Hence, the power for a given sample size n is

1 - β = P (\frac{\sqrt{n} ({\hat{p}}_{s} - {\hat{p}}_{r} + δ_{1})}{{\hat{σ}}_{1}} > z_{1 - α} | p_{r} = p_{s})

= P (\frac{\sqrt{n} ({\hat{p}}_{s} - {\hat{p}}_{r}) + \sqrt{n} δ_{1})}{\sqrt{σ_{1}^{2} + δ_{1}^{2}}} > z_{1 - α} | p_{r} = p_{s})

= P (\frac{\sqrt{n} ({\hat{p}}_{s} - {\hat{p}}_{r})}{σ_{1}} > \frac{z_{1 - α} \sqrt{σ_{1}^{2} + δ_{1}^{2}} - \sqrt{n} δ_{1}}{σ_{1}} | p_{r} = p_{s})

= Φ (\frac{z_{1 - α} \sqrt{σ_{1}^{2} + δ_{1}^{2}} - \sqrt{n} δ_{1}}{σ_{1}})

(1)

where $Φ (.)$ is the survivor function of the standard normal distribution and $σ_{1}^{2}$ is the limit of $n^{- 1} \sum_{i = 1}^{n} {(s_{i} - r_{i})}^{2}$ .

By solving the power Equation (1) with respect to n, we obtain the required sample size for power $1 - β$

n = \frac{{(z_{1 - α} \sqrt{σ_{1}^{2} + δ_{1}^{2}} + z_{1 - β} σ_{1})}^{2}}{δ_{1}^{2}}

(2)

where, as shown in the Appendix A.2,

σ_{1}^{2} = var (s_{i}) + var (r_{i}) - 2 ρ_{1} \sqrt{var (s_{i}) var (r_{i})}

(3)

var (s_{i}) = p_{r} (1 - p_{r}) {1 / m + ρ_{s s} (m - 1) / m}

and

var (r_{i}) = p_{r} (1 - p_{r}) (\frac{2}{m (m - 1)} + \frac{4 (m - 2)}{m (m - 1)} ρ_{r 1} + \frac{(m - 2) (m - 3)}{m (m - 1)} ρ_{r 2}) .

The process of calculating the required sample size is summarized as follows:

(1)
Specify ( $α, 1 - β$ ), expected concordance rate among radiologists $p_{r}$ , similarity margin $δ_{1}$ and hypothetical correlation coefficients $ρ_{r 1}, ρ_{r 2}, ρ_{s s}, ρ_{s 1}$ and $ρ_{s 2}$ .
(2)
Calculate $σ_{1}^{2}$ using (3).
(3)
Obtain sample size using (2).

It may be difficult to specify the correlation coefficients $ρ_{r 1}, ρ_{r 2}, ρ_{s s}, ρ_{s 1}$ and $ρ_{s 2}$ . If pilot data are available, we may estimate them from the pilot data. Otherwise, we may conduct a two-stage trial to estimate these correlation coefficients from the first stage data and recalculate the sample size for the whole trial based on the estimated correlation coefficients.

2.2. Objective 2: Is the AI-Based Device More Concordant with Experienced Radiologists Than with Junior Radiologists?

As another study objective, we may want to test if the reading of the AI-based device agrees more with those of experienced radiologists than with those of junior radiologists for each BI-RADS lexicon classification category.

Let $p_{x}$ and $p_{y}$ denote the concordance rate between the AI-based device and highly experienced radiologists and that between the AI-based device and less experienced radiologists, respectively. We want to test the null hypothesis $H_{2} : p_{x} = p_{y}$ against the alternative hypothesis ${\bar{H}}_{2} : p_{x} > p_{y}$ .

2.2.1. Statistical Testing Method

Let m (= 5, say) denote the number of radiologists in each group (highly experienced group and less experienced group). For subject $i (= 1, \dots, n)$ and senior radiologist $j (= 1, \dots, m)$ , let $x_{i j} = 1$ if the reading by senior radiologist j and that by the AI-based device agree and =0 otherwise, and let $y_{i j} = 1$ if the reading by less experienced radiologist $j (= 1, \dots, m)$ and that by the AI-based device agree and =0 otherwise. Then, we have $p_{x} = E (x_{i j})$ and $p_{y} = E (y_{i j})$ . Using the data from subject i,, $p_{x}$ is estimated by

x_{i} = \frac{\sum_{j = 1}^{m} x_{i j}}{m}

and $p_{y}$ is estimated by

y_{i} = \frac{\sum_{j = 1}^{m} y_{i j}}{m}

Using the whole data, we estimate $p_{x}$ and $p_{y}$ by

{\hat{p}}_{x} = \frac{1}{n m} \sum_{i = 1}^{n} \sum_{j = 1}^{m} x_{i j} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}

and

{\hat{p}}_{y} = \frac{1}{n m} \sum_{i = 1}^{n} \sum_{j = 1}^{m} y_{i j} = \frac{1}{n} \sum_{i = 1}^{n} y_{i}

respectively. Note that those estimates are unbiased.

For large n under $H_{2}$ , $\sqrt{n} ({\hat{p}}_{x} - {\hat{p}}_{y})$ is approximately normal with mean 0 and variance $σ_{2}^{2}$ that can be estimated by

{\hat{σ_{2}}}^{2} = \frac{1}{n} \sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2}

Hence, we reject $H_{2} : p_{x} = p_{y}$ if $| Z_{2} | > z_{1 - α / 2}$ , where

Z_{2} = \frac{\sqrt{n} ({\hat{p}}_{x} - {\hat{p}}_{y})}{{\hat{σ}}_{2}} .

Note that we use a standard 2-sided test since usually there is no small effect size issue in this case.

2.2.2. Power and Sample Size Calculation

We calculate the sample size under a specific alternative hypothesis ${\bar{H}}_{2} : p_{x} = p_{y} + δ_{2}$ . Since each subject’s image is read by all experienced and inexperienced radiologists as well as the device, ${(x_{i j}, y_{i j}), j = 1, \dots, m}$ are correlated.

Let $ρ_{x x} = corr (x_{i 1}, x_{i 2})$ denote the correlation coefficient between the concordance score between the AI-based device and a highly experienced radiologist and the concordance score between the device and another highly experienced radiologist, and $ρ_{y y} = corr (y_{i 1}, y_{i 2})$ denote the correlation coefficient between the AI-based device and a less experienced radiologist and the concordance score between the device and another less experienced radiologist. Furthermore, let $ρ_{x y} = corr (x_{i j}, y_{i j^{'}})$ for $j, j^{'} = 1, \dots, m$ denote the correlation coefficient between the concordance score between the device and a highly experienced radiologist and that between the device and a less experienced radiologist. Let $ρ_{2} = corr (x_{i}, y_{i})$ . As shown in Appendix A.3. $ρ_{2}$ is a function of $ρ_{x x}, ρ_{y y}$ , and $ρ_{x y}$ .

Under ${\bar{H}}_{2} : p_{x} = p_{y} + δ_{2}$ , ${(x_{i} - y_{i} - δ_{2}), i = 1, \dots, n}$ are independent random variables with mean 0, so that $\sqrt{n} ({\hat{p}}_{x} - {\hat{p}}_{y} - δ_{2})$ is asymptotically normal with mean 0 and variance $σ_{2}^{2}$ that can be consistently estimated by $s_{2}^{2} = {\sqrt{n}}^{- 1} \sum_{i = 1}^{n} {(x_{i} - y_{i} - δ_{2})}^{2}$ . Note that ${\hat{σ}}_{2}^{2}$ is asymptotically identical to $s_{2}^{2} + δ_{2}^{2}$ under ${\bar{H}}_{2}$ , so that it converges to $σ_{2}^{2} + δ_{2}^{2}$ . Hence, the power for a given sample size n is

1 - β = P (\frac{\sqrt{n} ({\hat{p}}_{x} - {\hat{p}}_{y})}{{\hat{σ}}_{2}} > z_{1 - α / 2} | p_{x} = p_{y} + δ_{2})

= P (\frac{\sqrt{n} ({\hat{p}}_{x} - {\hat{p}}_{y} - δ_{2}) + \sqrt{n} δ_{2}}{\sqrt{s_{2} + δ_{2}^{2}}} > z_{1 - α / 2} | p_{x} = p_{y} + δ_{2})

= P (\frac{\sqrt{n} ({\hat{p}}_{x} - {\hat{p}}_{y} - δ_{2})}{σ_{2}} > \frac{z_{1 - α / 2} \sqrt{σ_{2}^{2} + δ_{2}^{2}} - \sqrt{n} δ_{2}}{σ_{2}} | p_{x} = p_{y} + δ_{2})

= Φ (\frac{z_{1 - α / 2} \sqrt{σ_{2}^{2} + δ_{2}^{2}} - \sqrt{n} δ_{2}}{σ_{2}})

(4)

since $\sqrt{n} ({\hat{p}}_{x} - {\hat{p}}_{y} - δ_{2}) / σ_{2}$ is $N (0, 1)$ under ${\bar{H}}_{2}$ , where $σ_{2}^{2}$ is the limit of $n^{- 1} \sum_{i = 1}^{n} {(x_{i} - y_{i} - δ_{2})}^{2}$ under ${\bar{H}}_{2}$ .

By solving (4) with respect to n, we obtain the required sample size for power $1 - β$

n = \frac{{(z_{1 - α / 2} \sqrt{σ_{2}^{2} + δ_{2}^{2}} + z_{1 - β} σ_{2})}^{2}}{δ_{2}^{2}}

(5)

Appendix A.4 shows that

σ_{2}^{2} = var (x_{i}) + var (y_{i}) - 2 ρ_{2} \sqrt{var (x_{i}) var (y_{i})}

(6)

where

var (x_{i}) = p_{x} (1 - p_{x}) (\frac{1}{m} + \frac{m - 1}{m} ρ_{x x})

and

var (y_{i}) = (p_{x} - δ_{2}) (1 - p_{x} + δ_{2}) (\frac{1}{m} + \frac{m - 1}{m} ρ_{y y})

under ${\bar{H}}_{2}$ .

The process of calculating the required sample size is summarized as follows:

1.
Specify ( $α, 1 - β$ ), expected concordance rate between the AI-based device and a highly experienced radiologist $p_{x}$ , clinically meaningful difference in concordance rates $δ_{2}$ and correlation coefficients $ρ_{x x}, ρ_{y y}$ and $ρ_{x y}$ .
2.
Calculate $σ_{2}^{2}$ using (6).
3.
Obtain the required sample size using (5).

It will be difficult to specify the correlation coefficients $ρ_{x x}, ρ_{y y}$ and $ρ_{x y}$ . If pilot data are available, we may estimate them from the pilot data. Otherwise, we may use a two-stage design to estimate these correlation coefficients from the first stage data and calculate the sample size for the whole trial based on the estimated correlation coefficients.

3. Numerical Studies and Results

Note that our test statistics and sample size formulas are derived based on large sample approximations. We want to conduct simulation studies to evaluate their finite sample performance.

We consider the first type of study objective to test if the concordance rate between an AI-based device and radiologists is as high as that among radiologists. Suppose that each subject’s image is read by the AI-based device and $m = 10$ radiologists. We set $α = 0.05$ , $1 - β = 0.8 or 0.9$ , $p_{r} = 0.3, 0.5 or 0.7$ , $δ_{1} = 0.05 or 0.1$ and $ρ_{1} = 0.1, 0.3, 0.5, or 0.7$ . Assuming $ρ_{s 1} = ρ_{s 2} + 0.1, ρ_{r 1} = ρ_{r 2} + 0.1, ρ_{r 1} = ρ_{s s} = ρ_{s 1} + 0.1$ , we calculate the correlation coefficients for a given $ρ = corr (r_{i}, s_{i})$ . That is, we obtain $(ρ_{s 1}, ρ_{s 2}, ρ_{s s}, ρ_{r 1}, ρ_{r 2}) = (0.101, 0.001, 0.201, 0.201, 0.101)$ , $(0.16, 0.06, 0.26, 0.26, 0.16)$ , $(0.26, 0.16, 0.36, 0.36, 0.26)$ and $(0.48, 0.38, 0.58, 0.58, 0.48)$ for $ρ_{1} = 0.1, 0.3, 0.5, or 0.7$ , respectively.

For each design setting, we calculate the required sample size n using our proposed formula (2) and generate 10,000 simulation data sets of size n under the design setting and $H_{1}$ or ${\bar{H}}_{1}$ . Then, we apply the statistical testing using $Z_{1}$ to each simulation data set, and compute the empirical type I error ( $\hat{α}$ ) and power ( $1 - \hat{β}$ ) by the proportion of samples that reject $H_{1}$ among the 10,000 samples simulated under $H_{1} : p_{s} = p_{r} - δ_{1}$ and ${\bar{H}}_{1} : p_{s} = p_{r}$ , respectively. The correlated concordance (binary) data are generated by first generating multivariate normal data and then dichotomizing them with corresponding proportion level [10].

Table 1 reports the sample size n, empirical type I error rate $\hat{α}$ , and power $1 - \hat{β}$ under each design setting. We observe that the required sample size increases in $1 - β$ and decreases in $δ_{1}$ and $ρ_{1}$ . With other design parameters fixed, we have the same sample sizes for $p_{r} = 0.3$ and $p_{r} = 0.7$ . We have this result because, from (2), the sample size depends on $p_{r}$ only through $p_{r} (1 - p_{r})$ . Since the empirical type I errors are very close to the nominal $α = 0.05$ overall, our test statistic $Z_{1}$ controls the type I error rate accurately. On the other hand, the empirical powers are close to the corresponding nominal level $1 - β = 0.8 or 0.9$ overall, so that we conclude that our sample size formula is accurate too.

Table 1.

Sample size (empirical type I error rate, empirical power), $n (\hat{α}, 1 - \hat{β})$ , under various design settings of ( $p_{r}, δ_{1}, ρ_{1}, 1 - β$ ) for the first type of study objective.

$p_{r}$	$δ_{1}$	$ρ_{1}$	$1 - β = 0.8$	$1 - β = 0.9$
0.3	0.05	0.1	$210 (0.044, 0.808)$	$290 (0.051, 0.910)$
		0.3	$206 (0.048, 0.812)$	$285 (0.047, 0.903)$
		0.5	$200 (0.049, 0.805)$	$275 (0.049, 0.910)$
		0.7	$186 (0.054, 0.811)$	$256 (0.056, 0.903)$
	0.1	0.1	$56 (0.041, 0.829)$	$76 (0.045, 0.915)$
		0.3	$55 (0.047, 0.823)$	$75 (0.042, 0.914)$
		0.5	$53 (0.048, 0.822)$	$73 (0.052, 0.921)$
		0.7	$50 (0.061, 0.822)$	$68 (0.060, 0.913)$
0.5	0.05	0.1	$249 (0.047, 0.804)$	$344 (0.048, 0.904)$
		0.3	$245 (0.051, 0.808)$	$338 (0.053, 0.901)$
		0.5	$237 (0.045, 0.812)$	$327 (0.050, 0.907)$
		0.7	$220 (0.053, 0.798)$	$304 (0.050, 0.904)$
	0.1	0.1	$66 (0.052, 0.815)$	$90 (0.048, 0.911)$
		0.3	$65 (0.049, 0.824)$	$88 (0.046, 0.914)$
		0.5	$63 (0.051, 0.831)$	$86 (0.048, 0.912)$
		0.7	$58 (0.054, 0.813)$	$80 (0.054, 0.909)$
0.7	0.05	0.1	$210 (0.052, 0.804)$	$290 (0.054, 0.902)$
		0.3	$206 (0.050, 0.800)$	$285 (0.048, 0.906)$
		0.5	$200 (0.052, 0.802)$	$275 (0.051, 0.906)$
		0.7	$186 (0.055, 0.806)$	$256 (0.049, 0.899)$
	0.1	0.1	$56 (0.055, 0.821)$	$76 (0.054, 0.909)$
		0.3	$55 (0.055, 0.816)$	$75 (0.049, 0.904)$
		0.5	$53 (0.052, 0.814)$	$73 (0.054, 0.911)$
		0.7	$50 (0.060, 0.821)$	$68 (0.058, 0.912)$

Open in a new tab

Now we conduct simulations for the second type of study objective to test if the AI-based device is more concordant with experienced radiologists than with junior radiologists. We assume that each subject’s image is read by $m = 5$ experienced radiologists and $m = 5$ junior radiologists. We set $α = 0.05$ , $1 - β = 0.8 or 0.9$ , $p_{x} = 0.3, 0.5, or 0.7$ , $δ_{2} = 0.05 or 0.1$ , and $ρ_{2} = 0.1, 0.3, 0.5, or 0.7$ . Assuming $ρ_{x x} = ρ_{y y} = ρ_{x y} + 0.1$ , we solve the corresponding correlation coefficients for given $ρ_{2} = corr (x_{i}, y_{i})$ . So, we have $(ρ_{x x}, ρ_{y y}, ρ_{x y}) = (0.13, 0.13, 0.03)$ , $(0.21, 0.21, 0.11)$ , $(0.33, 0.33, 0.23)$ , and $(0.55, 0.55, 0.45)$ for $ρ_{2} = 0.1, 0.3, 0.5$ , and $0.7$ , respectively. For each design setting, we calculate sample size n using (5), and generate 10,000 samples of size n under the design setting and $H_{2}$ or ${\bar{H}}_{2}$ . We apply the test statistic $Z_{2}$ to each sample, and calculate the empirical type I error rate and power $(\hat{α}, 1 - \hat{β})$ under $H_{2} : p_{x} = p_{y}$ and ${\bar{H}}_{2} : p_{x} = p_{y} + δ_{0}$ , respectively.

Table 2 summarizes the required sample size n, and empirical type I error rate and power, $(\hat{α}, 1 - \hat{β})$ , under each design setting. We observe that the required sample size increases in $1 - β$ and decreases in $δ_{2}$ and $ρ_{2}$ . Since the empirical type I errors are very close to the nominal $α = 0.05$ overall, our test statistic $Z_{2}$ controls the type I error accurately. On the other hand, the empirical powers are close to the corresponding nominal level $1 - β = 0.8 or 0.9$ overall, so that we conclude that our sample size formula is accurate too.

Table 2.

Sample size (empirical type I error rate, empirical power), $n (\hat{α}, 1 - \hat{β})$ , under various design settings of ( $p_{x}, δ_{2}, ρ_{2}, 1 - β$ ) for the second type of study objective.

$p_{x}$	$δ_{2}$	$ρ_{2}$	$1 - β = 0.8$	$1 - β = 0.9$
0.3	0.05	0.1	$348 (0.052, 0.810)$	$465 (0.052, 0.898)$
		0.3	$328 (0.050, 0.812)$	$438 (0.048, 0.894)$
		0.5	$298 (0.049, 0.807)$	$397 (0.047, 0.900)$
		0.7	$245 (0.053, 0.807)$	$327 (0.048, 0.907)$
	0.1	0.1	$86 (0.054, 0.816)$	$113 (0.050, 0.905)$
		0.3	$81 (0.053, 0.820)$	$107 (0.047, 0.912)$
		0.5	$74 (0.047, 0.825)$	$98 (0.047, 0.914)$
		0.7	$63 (0.053, 0.844)$	$83 (0.048, 0.934)$
0.5	0.05	0.1	$434 (0.050, 0.798)$	$580 (0.048, 0.902)$
		0.3	$409 (0.050, 0.810)$	$546 (0.049, 0.904)$
		0.5	$370 (0.049, 0.806)$	$495 (0.052, 0.905)$
		0.7	$304 (0.054, 0.802)$	$406 (0.050, 0.905)$
	0.1	0.1	$111 (0.053, 0.816)$	$148 (0.053, 0.909)$
		0.3	$105 (0.047, 0.814)$	$140 (0.051, 0.909)$
		0.5	$96 (0.050, 0.811)$	$127 (0.052, 0.911)$
		0.7	$79 (0.052, 0.829)$	$105 (0.050, 0.913)$
0.7	0.05	0.1	$382 (0.051, 0.803)$	$511 (0.052, 0.900)$
		0.3	$360 (0.052, 0.797)$	$481 (0.048, 0.901)$
		0.5	$327 (0.046, 0.804)$	$436 (0.055, 0.902)$
		0.7	$269 (0.047, 0.811)$	$359 (0.053, 0.909)$
	0.1	0.1	$103 (0.050, 0.815)$	$136 (0.049, 0.914)$
		0.3	$97 (0.054, 0.817)$	$129 (0.049, 0.912)$
		0.5	$89 (0.050, 0.821)$	$117 (0.050, 0.917)$
		0.7	$74 (0.048, 0.840)$	$98 (0.045, 0.925)$

Open in a new tab

4. Discussion and Conclusions

Existing papers on comparing correlated concordance rates mainly focus on comparing two (or more) competitive diagnosis methods using their concordance rates with a gold standard on multiple sites [11]. In this paper, there is no gold standard and we compare the concordance rate between an AI-based diagnostic device and human radiologists and that among radiologists. We also compare the concordance rate between an AI-based diagnostic device and highly experienced radiologists and that between AI-based device and less experienced radiologists. In our design setting, each study subject has single site but is rated by the AI-based device and multiple human radiologists. We extend existing methods to perform design and analysis in this new setting.

We provide design and analysis plan for two types of study objectives to perform different comparisons of concordance between the AI-based diagnostic device and human radiologists. For each type of study objective, we propose a test statistic using GEE method with independent working correlation to account for the dependency in the observations from the device and the radiologists for each study subject, and derive its sample size formula based on large sample theory. Through extensive simulations, we show that the test statistics control the type I error accurately and the sample size formulas estimate sample sizes with powers close to the specified ones accounting for the dependency of images read by radiologists and device.

Since each subject’s image is read by the device and many radiologists, the concordance scores have complicated dependency structure, while the test statistics do not require specification of the multiple correlation coefficients by using the GEE method, the sample size formulas require specification of these correlation coefficients. Since it is difficult to accurately specify the correlation coefficients, we propose to conduct a two-stage device trial to estimate these correlation coefficients from the first stage data and recalculate the required sample size for the whole trial based on the estimated correlation coefficients.

We use concordance rate as a measure of agreement among multiple raters. Cohen’s kappa is another measure of agreement that is popularly used, e.g., Qureshi et al. [12]. Unlike concordance rate, however, it is not clear how similar two kappa values should be to conclude similarity of two different groups of raters. The proposed methods were successfully used by O’Connell et al. [8] to design and analyze a device trial.

Appendix A

Table A1.

Examples of ultrasound lexicon.

Ground Truth	Lesion Type
Shape	Oval
	Round
	Irregular
Margin	Circumscribed
	Indistinct
	Angular
	Microlobulated
	Spiculated
Orientation	Parallel
	Not parallel
Echo pattern	Anechoic
	Hypoechoic
	Complex cystic and solid
	Isoechoic
	Hyperechoic
	Heterogeneous
Posterior features	No features
	Enhancement
	Shadowing
	Combined pattern

Open in a new tab

Appendix A.1. Derivation of ρ₁

Since $r_{i} = {m (m - 1) / 2}^{- 1} \sum_{j = 1}^{m - 1} \sum_{j^{'} = j + 1}^{m} r_{i j j^{'}}$ and $s_{i} = m^{- 1} \sum_{j = 1}^{m} s_{i j}$ , we have

ρ_{1} = corr (r_{i}, s_{i}) = \frac{cov (r_{i}, s_{i})}{\sqrt{var (r_{i}) var (s_{i})}}

Here

cov (r_{i}, s_{i}) = cov (\frac{\sum_{j = 1}^{m - 1} \sum_{j^{'} = j + 1}^{m} r_{i j j^{'}}}{m (m - 1) / 2}, \frac{\sum_{j = 1}^{m} s_{i j}}{m}) = \frac{2}{m^{2} (m - 1)} cov (\sum_{j = 1}^{m - 1} \sum_{j^{'} = j + 1}^{m} r_{i j j^{'}}, \sum_{j = 1}^{m} s_{i j})

= \frac{2}{m^{2} (m - 1)} \{m (m - 1) cov (r_{i j_{1} j_{2}}, s_{i j_{1}}) + \frac{m (m - 1) (m - 2)}{2} cov (r_{i j_{1} j_{2}}, s_{i j_{1}^{'}})\}

= \frac{2}{m} cov (r_{i 12}, s_{i 1}) + \frac{m - 2}{m} cov (r_{i 12}, s_{i 3})

var (r_{i}) = \frac{4}{m^{2} {(m - 1)}^{2}} var (\sum_{j = 1}^{m - 1} \sum_{j^{'} = j + 1}^{m} r_{i j j^{'}})

= \frac{4}{m^{2} {(m - 1)}^{2}} {\frac{m (m - 1)}{2} var (r_{i j_{1} j_{2}}) + m (m - 1) (m - 2) cov (r_{i j_{1} j_{2}}, r_{i j_{1} j_{2}^{'}})

+ \frac{m (m - 1) (m - 2) (m - 3)}{4} cov (r_{i j_{1} j_{2}}, r_{i j_{1}^{'} j_{2}^{'}})}

= \frac{2}{m (m - 1)} var (r_{i 12}) + \frac{4 (m - 2)}{m (m - 1)} cov (r_{i 12}, r_{i 13}) + \frac{(m - 2) (m - 3)}{m (m - 1)} cov (r_{i 12}, r_{i 34})

and

var (s_{i}) = \frac{1}{m^{2}} {m var (s_{i 1}) + m (m - 1) cov (s_{i 1}, s_{i 2})}

= \frac{1}{m} var (s_{i 1}) + \frac{m - 1}{m} cov (s_{i 1}, s_{i 2})

Hence,

ρ_{1} = \frac{\frac{2}{m} cov (r_{i 12}, s_{i 1}) + \frac{m - 2}{m} cov (r_{i 12}, s_{i 3})}{\sqrt{{\frac{2}{m (m - 1)} var (r_{i 12}) + \frac{4 (m - 2)}{m (m - 1)} cov (r_{i 12}, r_{i 13}) + \frac{(m - 2) (m - 3)}{m (m - 1)} cov (r_{i 12}, r_{i 34})}}}

* \frac{1}{\sqrt{{\frac{1}{m} var (s_{i 1}) + \frac{m - 1}{m} cov (s_{i 1}, s_{i 2})}}}

= \frac{\frac{2}{m} ρ_{s 1} + \frac{m - 2}{m} ρ_{s 2}}{\sqrt{{\frac{2}{m (m - 1)} + \frac{4 (m - 2)}{m (m - 1)} ρ_{r 1} + \frac{(m - 2) (m - 3)}{m (m - 1)} ρ_{r 2}} (\frac{1}{m} + \frac{m - 1}{m} ρ_{s s})}}

Appendix A.2. The Limit of ${\hat{σ}}_{1}^{2}$ under ${\bar{H}}_{1}$

The limit of ${\hat{σ}}_{1}^{2} = n^{- 1} \sum_{i = 1}^{n} {(s_{i} - r_{i})}^{2}$ is its expected value $σ_{1}^{2} = E {(s_{i} - r_{i})}^{2}$ . Since $E (s_{i} - r_{i}) = p_{s} - p_{r} = 0$ under ${\bar{H}}_{1} : p_{r} = p_{s}$ , $E {(s_{i} - r_{i})}^{2} = var (s_{i} - r_{i}) = var (s_{i}) + var (r_{i}) - 2 ρ_{1} \sqrt{var (s_{i}) var (r_{i})}$ where $ρ_{1} = corr (r_{i}, s_{i})$ ,

var (s_{i}) = \frac{1}{m^{2}} var (\sum_{j = 1}^{m} s_{i j}) = \frac{1}{m^{2}} {m var (s_{i j}) + m (m - 1) cov (s_{i j}, s_{i j^{'}})}

= \frac{1}{m} var (s_{i j}) + \frac{m - 1}{m} ρ_{s s} var (s_{i j}) = p_{s} (1 - p_{s}) (\frac{1}{m} + \frac{m - 1}{m} ρ_{s s})

and

var (r_{i}) = var (\frac{1}{m (m - 1) / 2} \sum_{j_{1} = 1}^{m - 1} \sum_{j_{2} = j + 1}^{m} r_{i j_{1} j_{2}})

= {\{\frac{1}{m (m - 1) / 2}\}}^{2} {\frac{m (m - 1)}{2} var (r_{i 12}) + m (m - 1) (m - 2) cov (r_{i 12}, r_{i 13})

+ \frac{m (m - 1) (m - 2) (m - 3)}{4} cov (r_{i 12}, r_{i 34})}

= (\frac{2}{m (m - 1)} var (r_{i 12}) + \frac{4 (m - 2)}{m (m - 1)} ρ_{r 1} var (r_{i 12}) + \frac{(m - 2) (m - 3)}{m (m - 1)} ρ_{r 2} var (r_{i 12}))

= p_{r} (1 - p_{r}) \{\frac{2}{m (m - 1)} + \frac{4 (m - 2)}{m (m - 1)} ρ_{r 1} + \frac{(m - 2) (m - 3)}{m (m - 1)} ρ_{r 2}\}

since $s_{i j} \sim Bernoulli (p_{s})$ and $r_{i j_{1} j_{2}} \sim Bernoulli (p_{r})$ .

Appendix A.3. Derivation of ρ₂

Since $x_{i} = m^{- 1} \sum_{j = 1}^{m} x_{i j}$ and $y_{i} = m^{- 1} \sum_{j = 1}^{m} y_{i j}$ ,

ρ_{2} = corr (x_{i}, y_{i}) = \frac{cov (x_{i}, y_{i})}{\sqrt{var (x_{i}) var (y_{i})}}

Here,

cov (x_{i}, y_{i}) = cov (\frac{\sum_{j = 1}^{m} x_{i j}}{m}, \frac{\sum_{j = 1}^{m} y_{i j}}{m}) = cov (x_{i j}, y_{i j})

var (x_{i}) = \frac{1}{m} var (x_{i 1}) + \frac{m - 1}{m} cov (x_{i 1}, x_{i 2})

and, similarly,

var (y_{i}) = \frac{1}{m} var (y_{i 1}) + \frac{m - 1}{m} cov (y_{i 1}, y_{i 2})

Hence,

ρ_{2} = \frac{cov (x_{i 1}, y_{i 1})}{\sqrt{{\frac{1}{m} var (x_{i 1}) + \frac{m - 1}{m} cov (x_{i 1}, x_{i 2})} {\frac{1}{m} var (y_{i 1}) + \frac{m - 1}{m} cov (y_{i 1}, y_{i 2})}}}

= \frac{ρ_{x y}}{\sqrt{(\frac{1}{m} + \frac{m - 1}{m} ρ_{x x}) (\frac{1}{m} + \frac{m - 1}{m} ρ_{y y})}}

Appendix A.4. The Limit of ${\hat{σ}}_{2}^{2}$ under ${\bar{H}}_{2}$

The limit of ${\hat{σ}}_{2}^{2} = n^{- 1} \sum_{i = 1}^{n} {(x_{i} - y_{i} - δ_{2})}^{2}$ is $σ_{2}^{2} = E {(x_{i} - y_{i} - δ_{2})}^{2}$ . Since $E (x_{i} - y_{i} - δ_{2}) = p_{x} - p_{y} - δ_{2} = 0$ under ${\bar{H}}_{2} : p_{x} = p_{y} + δ_{0}$ , $E {(x_{i} - y_{i} - δ_{2})}^{2} = var (x_{i} - y_{i} - δ_{2}) = var (x_{i} - y_{i}) = var (x_{i}) + var (y_{i}) - 2 ρ_{2} \sqrt{var (x_{i}) var (y_{i})}$ . Here, $ρ_{2} = corr (x_{i}, y_{i})$ ,

var (x_{i}) = \frac{1}{m^{2}} {m var (x_{i 1}) + m (m - 1) cov (x_{i 1}, x_{i 2})}

= \frac{var (x_{i 1})}{m} + \frac{m - 1}{m} ρ_{x x} var (x_{i 1})) = p_{x} (1 - p_{x}) (\frac{1}{m} + \frac{m - 1}{m} ρ_{x x})

and, similarly,

var (y_{i}) = p_{y} (1 - p_{y}) (\frac{1}{m} + \frac{m - 1}{m} ρ_{y y})

= (p_{x} - δ_{0}) (1 - p_{x} + δ_{0}) (\frac{1}{m} + \frac{m - 1}{m} ρ_{y y})

since $x_{i j} \sim Bernoulli (p_{x}), y_{i j} \sim Bernoulli (p_{y})$ .

Author Contributions

Conceptualization, L.L. and S.-H.J.; methodology, L.L.; software, L.L.; validation, L.L. and S.-H.J.; formal analysis, L.L.; investigation, L.L. and S.-H.J.; resources, K.J.P. and S.-H.J.; data curation, L.L.; writing—original draft preparation, L.L.; writing—review and editing, S.-H.J.; visualization, L.L.; supervision, K.J.P. and S.-H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1.Zhang Z., Sejdic E. Radiological images and machine learning: Trends, perspectives, and prospects. Comput. Biol. Med. 2019;108:354–370. doi: 10.1016/j.compbiomed.2019.02.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.DSickles E.A., D’Orsi C.J., Bassett L.W., Appleton C.M., Berg W.A., Burnside E.S. ACR BI-RADS®Atlas, Breast Imaging Reporting and Data System. American College of Radiology; Reston, VA, USA: 2013. [Google Scholar]
3.Wu W.J., Lin S.W., Moon W.K. Combining support vector machine with genetic algorithm to classify ultrasound breast tumor images. Comput. Med. Imaging Graph. 2012;36:627–633. doi: 10.1016/j.compmedimag.2012.07.004. [DOI] [PubMed] [Google Scholar]
4.Liu B., Cheng H.D., Huang J., Tian J., Tang X., Liu J. Fully automatic and segmentation-robust classification of breast tumors based on local texture analysis of ultrasound images. Pattern Recogn. 2010;43:280–298. doi: 10.1016/j.patcog.2009.06.002. [DOI] [Google Scholar]
5.Shan J., Cheng H.D., Wang Y. Completely automated segmentation approach for breast ultrasound images using multiple-domain features. Ultrasound Med. Biol. 2012;38:262–275. doi: 10.1016/j.ultrasmedbio.2011.10.022. [DOI] [PubMed] [Google Scholar]
6.Cheng H.D., Shan J., Ju W., Guo Y., Zhang L. Automated breast cancer detection and classification using ultrasound images: A survey. Pattern Recogn. 2010;43:299–317. doi: 10.1016/j.patcog.2009.05.012. [DOI] [Google Scholar]
7.Wu G.G., Zhou L.Q., Xu J.W., Wang J.Y., Wei Q., Deng Y.B., Cui X.W., Dietrich C.F. Artificial intelligence in breast ultrasound. World J. Radiol. 2019;11:19–26. doi: 10.4329/wjr.v11.i2.19. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.O’Connell A.M. Diagnostic Performance of An Artificial Intelligence System in Breast Ultrasound. J. Ultrasound Med. 2021 doi: 10.1002/jum.15684. [DOI] [PubMed] [Google Scholar]
9.Liang K.Y., Zeger S. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22. doi: 10.1093/biomet/73.1.13. [DOI] [Google Scholar]
10.Emrich I.J., Piedmonte M.R. A method for generating high dimensional multivariate binary variables. Am. Stat. 1991;45:302–304. [Google Scholar]
11.Jung S.H., Barnhart H.X., Sohn I., Stinnett S.S., Wallace D.K. Sample Size for Comparing Correlated Concordance Rates. J. Biopharm. Stat. 2008;18:359–369. doi: 10.1080/10543400701697216. [DOI] [PubMed] [Google Scholar]
12.Qureshi A., Lakhtakia R., Bahri M.A., Al Haddabi I., Saparamadu A., Shalaby A., Al Riyami M., Rizvi G. Gleason’s Grading of Prostatic Adenocarcinoma: Inter-Observer Variation Among Seven Pathologists at a Tertiary Care Center in Oman. Asian Pac. J. Cancer Prev. 2016;17:4867–4868. doi: 10.22034/APJCP.2016.17.11.4867. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B1-jpm-11-01150] 1.Zhang Z., Sejdic E. Radiological images and machine learning: Trends, perspectives, and prospects. Comput. Biol. Med. 2019;108:354–370. doi: 10.1016/j.compbiomed.2019.02.017. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B2-jpm-11-01150] 2.DSickles E.A., D’Orsi C.J., Bassett L.W., Appleton C.M., Berg W.A., Burnside E.S. ACR BI-RADS®Atlas, Breast Imaging Reporting and Data System. American College of Radiology; Reston, VA, USA: 2013. [Google Scholar]

[B3-jpm-11-01150] 3.Wu W.J., Lin S.W., Moon W.K. Combining support vector machine with genetic algorithm to classify ultrasound breast tumor images. Comput. Med. Imaging Graph. 2012;36:627–633. doi: 10.1016/j.compmedimag.2012.07.004. [DOI] [PubMed] [Google Scholar]

[B4-jpm-11-01150] 4.Liu B., Cheng H.D., Huang J., Tian J., Tang X., Liu J. Fully automatic and segmentation-robust classification of breast tumors based on local texture analysis of ultrasound images. Pattern Recogn. 2010;43:280–298. doi: 10.1016/j.patcog.2009.06.002. [DOI] [Google Scholar]

[B5-jpm-11-01150] 5.Shan J., Cheng H.D., Wang Y. Completely automated segmentation approach for breast ultrasound images using multiple-domain features. Ultrasound Med. Biol. 2012;38:262–275. doi: 10.1016/j.ultrasmedbio.2011.10.022. [DOI] [PubMed] [Google Scholar]

[B6-jpm-11-01150] 6.Cheng H.D., Shan J., Ju W., Guo Y., Zhang L. Automated breast cancer detection and classification using ultrasound images: A survey. Pattern Recogn. 2010;43:299–317. doi: 10.1016/j.patcog.2009.05.012. [DOI] [Google Scholar]

[B7-jpm-11-01150] 7.Wu G.G., Zhou L.Q., Xu J.W., Wang J.Y., Wei Q., Deng Y.B., Cui X.W., Dietrich C.F. Artificial intelligence in breast ultrasound. World J. Radiol. 2019;11:19–26. doi: 10.4329/wjr.v11.i2.19. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B8-jpm-11-01150] 8.O’Connell A.M. Diagnostic Performance of An Artificial Intelligence System in Breast Ultrasound. J. Ultrasound Med. 2021 doi: 10.1002/jum.15684. [DOI] [PubMed] [Google Scholar]

[B9-jpm-11-01150] 9.Liang K.Y., Zeger S. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22. doi: 10.1093/biomet/73.1.13. [DOI] [Google Scholar]

[B10-jpm-11-01150] 10.Emrich I.J., Piedmonte M.R. A method for generating high dimensional multivariate binary variables. Am. Stat. 1991;45:302–304. [Google Scholar]

[B11-jpm-11-01150] 11.Jung S.H., Barnhart H.X., Sohn I., Stinnett S.S., Wallace D.K. Sample Size for Comparing Correlated Concordance Rates. J. Biopharm. Stat. 2008;18:359–369. doi: 10.1080/10543400701697216. [DOI] [PubMed] [Google Scholar]

[B12-jpm-11-01150] 12.Qureshi A., Lakhtakia R., Bahri M.A., Al Haddabi I., Saparamadu A., Shalaby A., Al Riyami M., Rizvi G. Gleason’s Grading of Prostatic Adenocarcinoma: Inter-Observer Variation Among Seven Pathologists at a Tertiary Care Center in Oman. Asian Pac. J. Cancer Prev. 2016;17:4867–4868. doi: 10.22034/APJCP.2016.17.11.4867. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Design and Analysis Methods for Trials with AI-Based Diagnostic Devices for Breast Cancer

Lu Liu

Kevin J Parker

Sin-Ho Jung

Roles

Abstract

1. Introduction

2. Materials and Methods

2.1. Objective 1: Is the Concordance Rate between the AI-Based Device and Radiologists as High as That among Radiologists?

2.1.1. Statistical Testing Method

2.1.2. Power and Sample Size Calculation

2.2. Objective 2: Is the AI-Based Device More Concordant with Experienced Radiologists Than with Junior Radiologists?

2.2.1. Statistical Testing Method

2.2.2. Power and Sample Size Calculation

3. Numerical Studies and Results

Table 1.

Table 2.

4. Discussion and Conclusions

Appendix A

Table A1.

Appendix A.1. Derivation of ρ₁

Appendix A.2. The Limit of ${\hat{σ}}_{1}^{2}$ under ${\bar{H}}_{1}$

Appendix A.3. Derivation of ρ₂

Appendix A.4. The Limit of ${\hat{σ}}_{2}^{2}$ under ${\bar{H}}_{2}$

Author Contributions

Funding

Conflicts of Interest

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Design and Analysis Methods for Trials with AI-Based Diagnostic Devices for Breast Cancer

Lu Liu

Kevin J Parker

Sin-Ho Jung

Roles

Abstract

1. Introduction

2. Materials and Methods

2.1. Objective 1: Is the Concordance Rate between the AI-Based Device and Radiologists as High as That among Radiologists?

2.1.1. Statistical Testing Method

2.1.2. Power and Sample Size Calculation

2.2. Objective 2: Is the AI-Based Device More Concordant with Experienced Radiologists Than with Junior Radiologists?

2.2.1. Statistical Testing Method

2.2.2. Power and Sample Size Calculation

3. Numerical Studies and Results

Table 1.

Table 2.

4. Discussion and Conclusions

Appendix A

Table A1.

Appendix A.1. Derivation of ρ1

Appendix A.2. The Limit of σ^12 under H¯1

Appendix A.3. Derivation of ρ2

Appendix A.4. The Limit of σ^22 under H¯2

Author Contributions

Funding

Conflicts of Interest

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Appendix A.1. Derivation of ρ₁

Appendix A.2. The Limit of ${\hat{σ}}_{1}^{2}$ under ${\bar{H}}_{1}$

Appendix A.3. Derivation of ρ₂

Appendix A.4. The Limit of ${\hat{σ}}_{2}^{2}$ under ${\bar{H}}_{2}$