Hypothesis testing procedure for binary and multi-class F1-scores in the paired design

Kanae Takahashi; Kouji Yamamoto; Aya Kuchiba; Ayumi Shintani; Tatsuki Koyama

doi:10.1002/sim.9853

Author manuscript; available in PMC: 2024 Oct 17.

Published in final edited form as: Stat Med. 2023 Aug 1;42(23):4177–4192. doi: 10.1002/sim.9853

PMCID: PMC11483486 NIHMSID: NIHMS2021606 PMID: 37527903

Abstract

In modern medicine, medical tests are used for various purposes including diagnosis, disease screening, prognosis, and risk prediction. To quantify the performance of a binary medical test, we often use sensitivity, specificity, and negative and positive predictive values as measures. Additionally, the $F_{1}$ -score, which is defined as the harmonic mean of precision (positive predictive value) and recall (sensitivity), has come to be used in the medical field due to its favorable characteristics. The $F_{1}$ -score has been extended to multi-class classification in two forms: a micro-averaged $F_{1}$ -score and a macro-averaged $F_{1}$ -score. The micro-averaged $F_{1}$ -score pools per-sample classifications across classes and then calculates the overall $F_{1}$ -score, whereas the macro-averaged $F_{1}$ -score computes an arithmetic mean of the $F_{1}$ -scores for each class. Additionally, Sokolova and Lapalme¹ gave an alternative definition of the macro-averaged $F_{1}$ -score as the harmonic mean of the arithmetic means of the precision and recall over classes. Although some statistical methods of inference for binary and multi-class $F_{1}$ -scores have been proposed, hypothesis testing procedures for these scores have not been fully developed. Therefore, we aim to develop a hypothesis testing procedure for comparing two $F_{1}$ -scores in a paired study design based on the large sample multivariate central limit theorem.

Keywords: delta-method, F₁ measures, multi-class classification, precision, recall

1 |. INTRODUCTION

Medical tests are important for the early detection and treatment of disease in modern medicine. Tests are used for various purposes including diagnosis, disease screening, prognosis, and risk prediction. Some measures exist to quantify the test performance; sensitivity, specificity, and positive and negative predictive values are commonly used for binary tests. Additionally, the $F_{1}$ -score for binary data (binary $F_{1}$ -score), which is defined as the harmonic mean of precision (positive predictive value) and recall (sensitivity), has been used in the medical field.^1,2

The binary $F_{1}$ -score is especially useful when evaluation of true negatives is relatively unimportant because true negatives are not included in the computation of either precision or recall. In addition, the binary $F_{1}$ -score behaves sensibly for a poor diagnostic test that identifies the majority of the data as positive. In this situation, a simple arithmetic mean of precision and recall will be at least 0.50 because recall will be 1.00 if all the data are diagnosed as positive. However, the binary $F_{1}$ -score will be appropriately low in these instances: it will be 0.18 and 0.02 when the precision is 0.10 and 0.01, respectively, even if recall is 1.00. Therefore, the $F_{1}$ -score is a better statistic to report.²
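The figures quoted in this paragraph can be checked with a few lines of Python. This is a minimal sketch; the function names `arithmetic_mean` and `f1_score` are illustrative and do not come from the article.

```python
def arithmetic_mean(precision, recall):
    """Simple average of precision and recall."""
    return (precision + recall) / 2


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (taken as 0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A poor test that labels every subject positive: recall = 1.00, low precision.
for precision in (0.10, 0.01):
    am = arithmetic_mean(precision, 1.00)
    f1 = f1_score(precision, 1.00)
    print(f"precision={precision:.2f}  arithmetic mean={am:.3f}  F1={f1:.2f}")
```

The arithmetic mean stays above 0.50 in both cases, while the harmonic mean drops toward the low precision.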

Most measures of the performance of medical tests are only applicable to binary classification data, and multi-class classification data need to be dichotomized to compute these measures. In the motivating example,³ for instance, skin cancer images were originally classified into six categories (malignant melanoma (MM), basal cell carcinoma (BCC), nevus, seborrheic keratosis (SK), senile lentigo (SL) and hematoma/hemangioma (H/H)), and the classification performances of board-certified dermatologists and dermatologic trainees were compared. The classification performance was assessed by accuracy, sensitivity, specificity, false negative rate, false positive rate, and positive predictive value after dichotomizing the six categories (MM and BCC vs. nevus, SK, SL, and H/H). However, evaluating the performance with the original six categories would have been preferable because dichotomization led to loss of information regarding the performance of this classification.^4,5

As measures of multi-class classification performance, a micro-averaged $F_{1}$ -score and a macro-averaged $F_{1}$ -score have been proposed.² The micro-averaged $F_{1}$ -score calculates the overall $F_{1}$ -score by pooling per-sample classifications across classes. In contrast, the macro-averaged $F_{1}$ -score computes an arithmetic mean of the $F_{1}$ -scores for each class. In addition, Sokolova and Lapalme⁶ proposed an alternative macro-averaged $F_{1}$ -score as the harmonic mean of the arithmetic means of the per-class precisions and recalls.

Although $F_{1}$ -scores for binary and multi-class classifications were originally used for measuring the performance of text classification in the field of information retrieval or of a classifier in machine learning, they have become frequently used in medicine.^7–14 Some statistical methods for inference have been proposed for the binary $F_{1}$ -score,¹⁵ and methods for estimating confidence intervals of the micro-averaged $F_{1}$ -scores and macro-averaged $F_{1}$ -scores have been developed.^16,17 However, these previous methods are for inference from one sample. To our knowledge, no method is available for hypothesis testing of $F_{1}$ -scores for paired samples, as in our motivating example, or for two independent samples. Thus, we aim to provide methods for comparing the binary $F_{1}$ -scores, micro-averaged $F_{1}$ -scores and macro-averaged $F_{1}$ -scores in the paired-design setting. For the two-independent-sample setting, the proposed method is readily applicable by setting the covariance part of the test statistics to 0.

The layout of this article is as follows: In Section 2, the definitions of the binary $F_{1}$ -score, micro-averaged $F_{1}$ -score and macro-averaged $F_{1}$ -score are reviewed. Test statistics for comparing those scores are derived in Section 3. Then, the simulation results of the proposed statistics and the application to the motivating example are presented in Sections 4 and 5, respectively. Finally, our brief discussions are provided in Section 6.

2 |. REVIEW OF F1-SCORES

This section introduces notations and definitions of binary $F_{1}$ -score ( $b i F$ ), micro-averaged $F_{1}$ -score ( $m i F$ ), and macro-averaged $F_{1}$ -score ( $m a F$ ). Consider an $r \times r \times r$ table of data for a nominal categorical variable with $r$ levels $(r \geq 2)$ . Each true class $1, \dots, r$ has an $r \times r$ table representing prediction frequencies of the two tests to be compared.

This arrangement of data represents the binary classification when $r = 2$ , and the multi-class classification when $r > 2$ . Table 1 shows general notations for each cell probability $p_{i j k}$ , where $i$ indicates the class of Test 1, $j$ indicates the class of Test 2, and $k$ indicates the true condition. Let Test 1 be a new medical test and Test 2 be an existing medical test. We consider a hypothesis testing to compare $F_{1}$ -scores of Test 1 and Test 2. Using these notations, the true positive rate $(T P_{a})$ , the false positive rate $(F P_{a})$ , and the false negative rate $(F N_{a})$ for each class $a (a = 1, \dots, r)$ in Test 1 are defined as follows:

T P_{1 a} = p_{a . a}, F P_{1 a} = \sum_{\begin{matrix} k = 1 \\ k \neq a \end{matrix}}^{r} p_{a . k}, F N_{1 a} = \sum_{\begin{matrix} i = 1 \\ i \neq a \end{matrix}}^{r} p_{i . a} .

TABLE 1.

General notations.

		True condition = 1					True condition = r

		Test 2					Test 2

		1	2	⋯	r		1	2	⋯	r
Test 1	1	$p_{111}$	$p_{121}$	⋯	$p_{1r1}$	$p_{1.1}$	$p_{11r}$	$p_{12r}$	⋯	$p_{1rr}$	$p_{1.r}$	$p_{1..}$
	2	$p_{211}$	$p_{221}$	⋯	$p_{2r1}$	$p_{2.1}$	$p_{21r}$	$p_{22r}$	⋯	$p_{2rr}$	$p_{2.r}$	$p_{2..}$
	⋮	⋮	⋮	⋱	⋮	⋮	⋮	⋮	⋱	⋮	⋮	⋮
	r	$p_{r11}$	$p_{r21}$	⋯	$p_{rr1}$	$p_{r.1}$	$p_{r1r}$	$p_{r2r}$	⋯	$p_{rrr}$	$p_{r.r}$	$p_{r..}$
		$p_{.11}$	$p_{.21}$	⋯	$p_{.r1}$	$p_{..1}$	$p_{.1r}$	$p_{.2r}$	⋯	$p_{.rr}$	$p_{..r}$	1

Note that $T P_{1 a} + F P_{1 a} = p_{a . .}$ , and $T P_{1 a} + F N_{1 a} = p_{. . a}$ . Similarly, $T P_{a}, F P_{a}, F N_{a}$ for each class $a (a = 1, \dots, r)$ for Test 2 are defined as follows:

T P_{2 a} = p_{. a a}, F P_{2 a} = \sum_{\begin{matrix} k = 1 \\ k \neq a \end{matrix}}^{r} p_{. a k}, F N_{2 a} = \sum_{\begin{matrix} j = 1 \\ j \neq a \end{matrix}}^{r} p_{. j a} .

Note that $T P_{2 a} + F P_{2 a} = p_{. a .}$ , and $T P_{2 a} + F N_{2 a} = p_{. . a}$ .

2.1 |. Binary $F_{1}$ -score

When $r = 2$ , we consider the following precision $(b i P)$ and recall $(b i R)$ for Test 1 as:

b i P_{1} = \frac{T P_{11}}{T P_{11} + F P_{11}} = p_{1.1} / p_{1 . .},

b i R_{1} = \frac{T P_{11}}{T P_{11} + F N_{11}} = p_{1.1} / p_{. . 1} .

And binary $F_{1}$ -score for Test 1 $(b i F_{1})$ is defined as the harmonic mean of $b i P_{1}$ and $b i R_{1}$ , that is,

b i F_{1} = 2 \frac{b i P_{1} \times b i R_{1}}{b i P_{1} + b i R_{1}} = 2 \frac{p_{1.1}}{p_{1 . .} + p_{. . 1}} .

(1)

Similarly, the binary $F_{1}$ -score for Test 2 $(b i F_{2})$ is as follows:

b i P_{2} = \frac{T P_{21}}{T P_{21} + F P_{21}} = p_{. 11} / p_{. 1 .},

b i R_{2} = \frac{T P_{21}}{T P_{21} + F N_{21}} = p_{. 11} / p_{. . 1},

b i F_{2} = 2 \frac{b i P_{2} \times b i R_{2}}{b i P_{2} + b i R_{2}} = 2 \frac{p_{. 11}}{p_{. 1 .} + p_{. . 1}} .

(2)
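The definitions above can be traced on a small table. The sketch below uses hypothetical counts (not from the article) stored as `n[i][j][k]`, with index 0 standing for class 1 (positive) and index 1 for class 2 (negative), and checks that the harmonic-mean form and the closed forms (1) and (2) agree.

```python
# Hypothetical paired counts n[i][j][k] for r = 2 (illustration only).
# i = Test 1 result, j = Test 2 result, k = true condition; index 0 = positive.
# Each innermost pair is [true positive class, true negative class].
n = [
    [[40, 5], [10, 5]],    # Test 1 positive: Test 2 positive, Test 2 negative
    [[8, 6], [12, 114]],   # Test 1 negative: Test 2 positive, Test 2 negative
]
R = range(2)
N = sum(n[i][j][k] for i in R for j in R for k in R)
p = [[[n[i][j][k] / N for k in R] for j in R] for i in R]

# Marginal probabilities for the positive class.
p_1d1 = sum(p[0][j][0] for j in R)               # p_{1.1}
p_d11 = sum(p[i][0][0] for i in R)               # p_{.11}
p_1dd = sum(p[0][j][k] for j in R for k in R)    # p_{1..}
p_d1d = sum(p[i][0][k] for i in R for k in R)    # p_{.1.}
p_dd1 = sum(p[i][j][0] for i in R for j in R)    # p_{..1}


def harmonic_mean(x, y):
    return 2 * x * y / (x + y)


biP1, biR1 = p_1d1 / p_1dd, p_1d1 / p_dd1
biP2, biR2 = p_d11 / p_d1d, p_d11 / p_dd1
biF1 = harmonic_mean(biP1, biR1)
biF2 = harmonic_mean(biP2, biR2)

# Closed forms (1) and (2).
biF1_closed = 2 * p_1d1 / (p_1dd + p_dd1)
biF2_closed = 2 * p_d11 / (p_d1d + p_dd1)
print(round(biF1, 4), round(biF2, 4))
```

Both tests share the denominator term $p_{. . 1}$ because they are applied to the same subjects; this sharing is what makes the two estimates correlated in the paired design.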

2.2 |. Micro-averaged $F_{1}$ -score

When $r > 2$ , the micro-averaged precision ( $m i P$ ) and micro-averaged recall ( $m i R$ ) are obtained by summing $T P_{a}$ , $F P_{a}$ , and $F N_{a}$ over the classes. $m i P$ and $m i R$ for Test 1 can be written as

m i P_{1} = \frac{\sum_{a = 1}^{r} T P_{1 a}}{\sum_{a = 1}^{r} (T P_{1 a} + F P_{1 a})} = \frac{\sum p_{a . a}}{\sum p_{a . .}} = \sum_{a = 1}^{r} p_{a . a},

m i R_{1} = \frac{\sum_{a = 1}^{r} T P_{1 a}}{\sum_{a = 1}^{r} (T P_{1 a} + F N_{1 a})} = \frac{\sum p_{a . a}}{\sum p_{. . a}} = \sum_{a = 1}^{r} p_{a . a} .

Finally, as the harmonic mean of $m i P_{1}$ and $m i R_{1}$ , we have the micro-averaged $F_{1}$ -score for Test 1 $(m i F_{1})$ as

m i F_{1} = 2 \frac{m i P_{1} \times m i R_{1}}{m i P_{1} + m i R_{1}} = \sum_{a = 1}^{r} p_{a . a} .

(3)

Similarly, the micro-averaged $F_{1}$ -score for Test 2 $(m i F_{2})$ is

m i P_{2} = \frac{\sum_{a = 1}^{r} T P_{2 a}}{\sum_{a = 1}^{r} (T P_{2 a} + F P_{2 a})} = \frac{\sum p_{. a a}}{\sum p_{. a .}} = \sum_{a = 1}^{r} p_{. a a},

m i R_{2} = \frac{\sum_{a = 1}^{r} T P_{2 a}}{\sum_{a = 1}^{r} (T P_{2 a} + F N_{2 a})} = \frac{\sum p_{. a a}}{\sum p_{. . a}} = \sum_{a = 1}^{r} p_{. a a} .

m i F_{2} = 2 \frac{m i P_{2} \times m i R_{2}}{m i P_{2} + m i R_{2}} = \sum_{a = 1}^{r} p_{. a a} .

(4)
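Equations (3) and (4) say that the micro-averaged precision, recall, and $F_{1}$ -score all collapse to the sum of the per-class true positive probabilities. The following sketch checks this numerically on the Scenario 1 cell probabilities of Table 2; the array layout `p[i][j][k]` and the helper name `micro_f1` are choices made for this illustration.

```python
# Scenario 1 of Table 2 as counts out of 300. Each row is one Test 1 class;
# the nine entries are Test 2 classes 1-3 within true classes 1, 2, and 3.
rows = [
    [40, 10, 10, 5, 10, 5, 5, 5, 10],
    [10, 5, 5, 10, 40, 10, 5, 5, 10],
    [10, 5, 5, 5, 10, 5, 10, 10, 40],
]
r, N = 3, 300
C = range(r)
# p[i][j][k]: i = Test 1 class, j = Test 2 class, k = true class (0-based).
p = [[[rows[i][r * k + j] / N for k in C] for j in C] for i in C]


def micro_f1(tp, fp, fn):
    """Micro-averaged precision, recall, and F1 from per-class TP, FP, FN."""
    precision = sum(tp) / (sum(tp) + sum(fp))
    recall = sum(tp) / (sum(tp) + sum(fn))
    return precision, recall, 2 * precision * recall / (precision + recall)


# Test 1: per-class TP, FP, FN as defined in Section 2.
tp1 = [sum(p[a][j][a] for j in C) for a in C]
fp1 = [sum(p[a][j][k] for j in C for k in C if k != a) for a in C]
fn1 = [sum(p[i][j][a] for i in C for j in C if i != a) for a in C]
# Test 2: the same quantities with the roles of i and j exchanged.
tp2 = [sum(p[i][a][a] for i in C) for a in C]
fp2 = [sum(p[i][a][k] for i in C for k in C if k != a) for a in C]
fn2 = [sum(p[i][j][a] for i in C for j in C if j != a) for a in C]

miP1, miR1, miF1 = micro_f1(tp1, fp1, fn1)
miP2, miR2, miF2 = micro_f1(tp2, fp2, fn2)
print(round(miF1, 2), round(miF2, 2))
```

Because every subject receives exactly one predicted class, the pooled false positives and pooled false negatives are the same misclassified subjects, which is why $m i P$, $m i R$, and $m i F$ coincide with overall accuracy.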

2.3 |. Macro-averaged $F_{1}$ -score

When $r > 2$ , to define the macro-averaged $F_{1}$ -score for Test 1 $(m a F_{1})$ , first consider the following precision $(P_{1 a})$ and recall ( $R_{1 a}$ ) within each class, $a = 1, \dots, r$ :

P_{1 a} = \frac{T P_{1 a}}{T P_{1 a} + F P_{1 a}} = p_{a . a} / p_{a . .},

(5)

R_{1 a} = \frac{T P_{1 a}}{T P_{1 a} + F N_{1 a}} = p_{a . a} / p_{. . a} .

(6)

And $F_{1}$ -score within each class for Test 1 $(F_{1 a})$ is defined as the harmonic mean of $P_{1 a}$ and $R_{1 a}$ , that is,

F_{1 a} = 2 \frac{P_{1 a} \times R_{1 a}}{P_{1 a} + R_{1 a}} = 2 \frac{p_{a . a}}{p_{a . .} + p_{. . a}} .

The macro-averaged $F_{1}$ -score for Test 1 $(m a F_{1})$ is the simple arithmetic mean of $F_{1 a}$ :

m a F_{1} = \frac{1}{r} \sum_{a = 1}^{r} F_{1 a} = \frac{2}{r} \sum_{a = 1}^{r} \frac{p_{a . a}}{p_{a ..} + p_{. . a}} .

(7)

Similarly, the macro-averaged $F_{1}$ -score for Test 2 $(m a F_{2})$ is

P_{2 a} = \frac{T P_{2 a}}{(T P_{2 a} + F P_{2 a})} = p_{. a a} / p_{. a .},

R_{2 a} = \frac{T P_{2 a}}{(T P_{2 a} + F N_{2 a})} = p_{. a a} / p_{. . a} .

F_{2 a} = 2 \frac{P_{2 a} \times R_{2 a}}{P_{2 a} + R_{2 a}} = 2 \frac{p_{. a a}}{p_{. a .} + p_{. . a}} .

m a F_{2} = \frac{1}{r} \sum_{a = 1}^{r} F_{2 a} = \frac{2}{r} \sum_{a = 1}^{r} \frac{p_{. a a}}{p_{. a .} + p_{. . a}} .

(8)

2.4 |. Alternate definition of macro-averaged $F_{1}$ -score

Sokolova and Lapalme⁶ gave an alternative definition of the macro-averaged $F_{1}$ . First, macro-averaged precision $(m a P)$ and macro-averaged recall $(m a R)$ for Test 1 are defined as simple arithmetic means of the within-class precision and within-class recall in (5) and (6), respectively.

m a P_{1} = \frac{1}{r} \sum_{a = 1}^{r} P_{1 a} = \frac{1}{r} \sum_{a = 1}^{r} \frac{p_{a . a}}{p_{a . .}},

{m a R}_{1} = \frac{1}{r} \sum_{a = 1}^{r} R_{1 a} = \frac{1}{r} \sum_{a = 1}^{r} \frac{p_{a . a}}{p_{. . a}} .

And the alternate definition of macro-averaged $F_{1}$ -score for Test 1 $(m a F_{1}^{*})$ is the harmonic mean of these quantities.

m a F_{1}^{*} = 2 \frac{m a P_{1} \times m a R_{1}}{m a P_{1} + m a R_{1}} .

(9)

Similarly, the alternate definition of macro-averaged $F_{1}$ -score for Test 2 $(m a F_{2}^{*})$ is

m a P_{2} = \frac{1}{r} \sum_{a = 1}^{r} P_{2 a} = \frac{1}{r} \sum_{a = 1}^{r} \frac{p_{. a a}}{p_{. a .}},

m a R_{2} = \frac{1}{r} \sum_{a = 1}^{r} R_{2 a} = \frac{1}{r} \sum_{a = 1}^{r} \frac{p_{. a a}}{p_{. . a}} .

m a F_{2}^{*} = 2 \frac{m a P_{2} \times m a R_{2}}{m a P_{2} + m a R_{2}} .

(10)
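The two macro-averaged definitions generally give different values when class prevalences are unequal. As a minimal sketch, the code below evaluates (7), (8), (9), and (10) on the Scenario 2 cell probabilities of Table 2; the helper names `per_class`, `macro_f1`, and `macro_f1_star` are illustrative.

```python
# Scenario 2 of Table 2 as counts out of 500 (same row layout as the table).
rows = [
    [120, 30, 30, 5, 10, 5, 5, 5, 10],
    [30, 15, 15, 10, 40, 10, 5, 5, 10],
    [30, 15, 15, 5, 10, 5, 10, 10, 40],
]
r, N = 3, 500
C = range(r)
p = [[[rows[i][r * k + j] / N for k in C] for j in C] for i in C]


def per_class(p, test):
    """Per-class true positive, predicted-class, and true-class probabilities."""
    tp, pred, true = [], [], []
    for a in C:
        if test == 1:
            tp.append(sum(p[a][j][a] for j in C))                # p_{a.a}
            pred.append(sum(p[a][j][k] for j in C for k in C))   # p_{a..}
        else:
            tp.append(sum(p[i][a][a] for i in C))                # p_{.aa}
            pred.append(sum(p[i][a][k] for i in C for k in C))   # p_{.a.}
        true.append(sum(p[i][j][a] for i in C for j in C))       # p_{..a}
    return tp, pred, true


def macro_f1(tp, pred, true):
    """Arithmetic mean of the per-class F1-scores, as in (7) and (8)."""
    return sum(2 * t / (q + s) for t, q, s in zip(tp, pred, true)) / len(tp)


def macro_f1_star(tp, pred, true):
    """Harmonic mean of macro-averaged precision and recall, as in (9) and (10)."""
    ma_p = sum(t / q for t, q in zip(tp, pred)) / len(tp)
    ma_r = sum(t / s for t, s in zip(tp, true)) / len(tp)
    return 2 * ma_p * ma_r / (ma_p + ma_r)


maF1, maF1_star = macro_f1(*per_class(p, 1)), macro_f1_star(*per_class(p, 1))
maF2, maF2_star = macro_f1(*per_class(p, 2)), macro_f1_star(*per_class(p, 2))
print(round(maF1, 2), round(maF1_star, 2), round(maF2, 2), round(maF2_star, 2))
```

The rounded values match those listed under Scenario 2 in Table 2 (0.56 for $m a F$ and 0.58 for $m a F^{*}$ for both tests).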

3 |. PROPOSED HYPOTHESIS TESTING PROCEDURE

In this section, we derive the test statistics for comparing two $F_{1}$ -scores ( $b i F_{1}$ and $b i F_{2}; m i F_{1}$ and $m i F_{2}; m a F_{1}$ and $m a F_{2}$ ; and $m a F_{1}^{*}$ and $m a F_{2}^{*}$ ). We assume that the observed frequencies, $n_{i j k}$ for $1 \leq i \leq r, 1 \leq j \leq r, 1 \leq k \leq r$ , have a multinomial distribution with overall sample size $N = \sum_{i, j, k} n_{i j k}$ and probabilities $p = {[p_{111}, \dots, p_{1 r 1}, \dots, p_{r r 1}, \dots, p_{r r r}]}^{T}$ , where $i$ indicates the class of Test 1, $j$ indicates the class of Test 2, $k$ indicates the true condition, and “T” represents the transpose. That is,

(n_{111}, n_{121}, \dots, n_{rrr}) \sim M u l t i n o m i a l (N; p) .

The maximum likelihood estimate (MLE) of $p_{i j k}$ is ${\hat{p}}_{i j k} = n_{i j k} / N$ . By the invariance property of MLEs, the MLEs of $b i F, m i F, m a F, m a F^{*}$ , and the other quantities in the previous section can be obtained by replacing $p_{i j k}$ with ${\hat{p}}_{i j k}$ .

3.1 |. Test statistic for comparing two $b i F s$

Let $b i F = {(b i F_{1}, b i F_{2})}^{T}$ be a vector whose components are the $b i F s$ of the two medical tests, and let $\hat{b i F}$ be the MLE of $b i F$ . $\hat{b i F}$ can be obtained by replacing $p_{i j k}$ with their MLEs in (1) and (2).

{\hat{b i F}}_{1} = 2 \frac{{\hat{p}}_{1.1}}{{\hat{p}}_{1 . .} + {\hat{p}}_{. . 1}} = 2 \frac{n_{1.1}}{n_{1 . .} + n_{. . 1}}, \hat{b i F_{2}} = 2 \frac{{\hat{p}}_{. 11}}{{\hat{p}}_{. 1 .} + {\hat{p}}_{. . 1}} = 2 \frac{n_{. 11}}{n_{. 1 .} + n_{. . 1}} .

Using the delta-method and the multivariate central limit theorem, we have

\sqrt{N} (\hat{b i F} - b i F) \dot{\sim} N o r m a l (0, {[\frac{\partial (b i F)}{\partial (p)}]}^{T} [d i a g (p) - p p^{T}] [\frac{\partial (b i F)}{\partial (p)}]),

where $d i a g (p)$ is an $r^{3} \times r^{3}$ diagonal matrix whose diagonal elements are the elements of $p$ , and “ $\dot{\sim}$ ” represents “approximately distributed as”. The Wald statistic for testing $H_{0} : b i F_{1} = b i F_{2}$ vs. $H_{1} : b i F_{1} \neq b i F_{2}$ , therefore, is

T_{b i F}^{W} = \frac{{({\hat{b i F}}_{1} - {\hat{b i F}}_{2})}^{2}}{{\hat{V a r}}_{b i F d}},

where ${\hat{V a r}}_{b i F d}$ is the variance of $({\hat{b i F}}_{1} - {\hat{b i F}}_{2})$ with $\{p_{i j k}\}$ replaced by $\{{\hat{p}}_{i j k}\}$ . The derivation of the variance of $({\hat{b i F}}_{1} - {\hat{b i F}}_{2})$ appears in Appendix A.1. The test statistic is distributed asymptotically as a $χ^{2}$ distribution with one degree of freedom under the null hypothesis.
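The Wald statistic can be sketched directly from the delta-method expression above. The code below does not reproduce the closed-form variance of Appendix A.1; instead it evaluates $g^{T} [d i a g (p) - p p^{T}] g / N$ with the gradient $g$ of $b i F_{1} - b i F_{2}$ written out by hand, which is the same delta-method quantity. The counts are hypothetical, and the helper names (`bif_diff`, `bif_diff_gradient`, `delta_variance`) are assumptions of this sketch.

```python
import math

# Hypothetical paired counts n[i][j][k] for r = 2 (illustration only).
# i = Test 1 result, j = Test 2 result, k = true condition; index 0 = positive.
n = [[[40, 5], [10, 5]], [[8, 6], [12, 114]]]
R = range(2)
cells = [(i, j, k) for i in R for j in R for k in R]
N = sum(n[i][j][k] for i, j, k in cells)
p_hat = {c: n[c[0]][c[1]][c[2]] / N for c in cells}


def _parts(p):
    """Numerators and denominators of (1) and (2)."""
    a1 = sum(v for (i, j, k), v in p.items() if i == 0 and k == 0)      # p_{1.1}
    a2 = sum(v for (i, j, k), v in p.items() if j == 0 and k == 0)      # p_{.11}
    b1 = sum(v * ((i == 0) + (k == 0)) for (i, j, k), v in p.items())   # p_{1..} + p_{..1}
    b2 = sum(v * ((j == 0) + (k == 0)) for (i, j, k), v in p.items())   # p_{.1.} + p_{..1}
    return a1, a2, b1, b2


def bif_diff(p):
    """biF_1 - biF_2 from equations (1) and (2)."""
    a1, a2, b1, b2 = _parts(p)
    return 2 * a1 / b1 - 2 * a2 / b2


def bif_diff_gradient(p):
    """Partial derivative of biF_1 - biF_2 with respect to each p_{ijk}."""
    a1, a2, b1, b2 = _parts(p)
    grad = {}
    for (i, j, k) in p:
        g1 = 2 * ((i == 0 and k == 0) / b1 - a1 * ((i == 0) + (k == 0)) / b1 ** 2)
        g2 = 2 * ((j == 0 and k == 0) / b2 - a2 * ((j == 0) + (k == 0)) / b2 ** 2)
        grad[(i, j, k)] = g1 - g2
    return grad


def delta_variance(grad, p, N):
    """g^T [diag(p) - p p^T] g / N for multinomial proportions."""
    mean_g = sum(grad[c] * p[c] for c in p)
    return (sum(grad[c] ** 2 * p[c] for c in p) - mean_g ** 2) / N


d_hat = bif_diff(p_hat)
var_hat = delta_variance(bif_diff_gradient(p_hat), p_hat, N)
wald = d_hat ** 2 / var_hat
# Upper tail of a chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(wald / 2))
print(round(d_hat, 4), round(wald, 3), round(p_value, 3))
```

Setting the cross terms between the two tests to zero, as the article suggests for independent samples, would require splitting the variance into its Test 1, Test 2, and covariance parts, which this compact form does not do.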

As a side note, the confidence interval of $b i F$ for each test can be derived in the same way. A $(1 - α) \times 100\%$ confidence interval of $b i F_{1}$ and $b i F_{2}$ is

{\hat{b i F}}_{1} \pm Z_{1 - α / 2} \times \sqrt{{\hat{V a r}}_{b i F_{1}}},

{\hat{b i F}}_{2} \pm Z_{1 - α / 2} \times \sqrt{{\hat{V a r}}_{b i F_{2}}},

where $Z_{p}$ denotes the $100 p$ -th percentile of the standard normal distribution, and ${\hat{V a r}}_{b i F_{1}}$ and ${\hat{V a r}}_{b i F_{2}}$ are the variances of ${\hat{b i F}}_{1}$ and ${\hat{b i F}}_{2}$ , respectively, with $\{p_{i j k}\}$ replaced by $\{{\hat{p}}_{i j k}\}$ . These simple formulas based on the multinomial distribution have not been proposed previously. Wang et al. proposed a confidence interval of $b i F$ based on the beta prime distribution and associated calculations using the bootstrap method.^15,18
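As a rough illustration of these intervals, the sketch below computes a delta-method variance for a single $\hat{b i F}$ and forms the normal-approximation interval. The counts are hypothetical, and `bif_with_variance` is a helper invented for this example; the variance is obtained from the generic multinomial delta-method form rather than from the article's appendix formulas.

```python
import math
from statistics import NormalDist

# Hypothetical paired counts n[i][j][k] for r = 2 (illustration only).
n = [[[40, 5], [10, 5]], [[8, 6], [12, 114]]]
R = range(2)
cells = [(i, j, k) for i in R for j in R for k in R]
N = sum(n[i][j][k] for i, j, k in cells)
p_hat = {c: n[c[0]][c[1]][c[2]] / N for c in cells}


def bif_with_variance(p, N, test):
    """Point estimate and delta-method variance of biF for Test 1 or Test 2."""
    idx = 0 if test == 1 else 1   # position of the test's prediction in (i, j, k)
    a = sum(v for c, v in p.items() if c[idx] == 0 and c[2] == 0)
    b = sum(v * ((c[idx] == 0) + (c[2] == 0)) for c, v in p.items())
    estimate = 2 * a / b
    grad = {
        c: 2 * ((c[idx] == 0 and c[2] == 0) / b
                - a * ((c[idx] == 0) + (c[2] == 0)) / b ** 2)
        for c in p
    }
    mean_g = sum(grad[c] * p[c] for c in p)
    variance = (sum(grad[c] ** 2 * p[c] for c in p) - mean_g ** 2) / N
    return estimate, variance


alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{1 - alpha/2}
intervals = {}
for test in (1, 2):
    est, var = bif_with_variance(p_hat, N, test)
    half_width = z * math.sqrt(var)
    intervals[test] = (est - half_width, est, est + half_width)
    print(f"Test {test}: biF = {est:.3f}, 95% CI = ({est - half_width:.3f}, {est + half_width:.3f})")
```

Such an interval is not guaranteed to stay inside [0, 1] when the estimate is close to a boundary or the sample is small.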

For the score statistic, we consider the MLE of $\{p_{i j k}\}$ under the null hypothesis, which can be obtained, for example, by applying the Newton-Raphson method to the log-likelihood equations. The score statistic for testing $H_{0} : b i F_{1} = b i F_{2}$ vs. $H_{1} : b i F_{1} \neq b i F_{2}$ is

T_{b i F}^{S} = \frac{{({\hat{b i F}}_{1} - {\hat{b i F}}_{2})}^{2}}{{\tilde{V a r}}_{b i F d}},

where ${\tilde{V a r}}_{b i F d}$ is the variance of $({\hat{b i F}}_{1} - {\hat{b i F}}_{2})$ with $\{p_{i j k}\}$ replaced by $\{{\tilde{p}}_{i j k}\}$ , that is calculated from the MLE of $\{p_{i j k}\}$ under the null hypothesis.

3.2 |. Test statistic for comparing two $m i F s$

As shown in (3) and (4), $m i F_{1} = \sum p_{a . a}, m i F_{2} = \sum p_{. a a}$ , and the MLE of $m i F_{1}$ and $m i F_{2}$ are

\hat{m i F_{1}} = \sum_{a = 1}^{r} {\hat{p}}_{a . a} = \sum_{a = 1}^{r} \frac{n_{a . a}}{N}, \hat{m i F_{2}} = \sum_{a = 1}^{r} {\hat{p}}_{. a a} = \sum_{a = 1}^{r} \frac{n_{. a a}}{N} .

Again by the delta-method and multivariate central limit theorem (Appendix A.2), the Wald statistic for testing $H_{0} : m i F_{1} = m i F_{2}$ versus $H_{1} : m i F_{1} \neq m i F_{2}$ is

T_{miF}^{W} = \frac{{({\hat{m i F}}_{1} - {\hat{m i F}}_{2})}^{2}}{{\hat{V a r}}_{m i F d}},

where ${\hat{V a r}}_{miFd}$ is the variance of $({\hat{m i F}}_{1} - {\hat{m i F}}_{2})$ with $\{p_{i j k}\}$ replaced by $\{{\hat{p}}_{i j k}\}$ . The test statistic is distributed asymptotically as a $χ^{2}$ distribution with one degree of freedom under the null hypothesis.

Again to develop the score statistic, we consider the MLE of $\{p_{i j k}\}$ under the null hypothesis as in the case of $b i F$ . The score statistic for testing $H_{0} : m i F_{1} = m i F_{2}$ versus $H_{1} : m i F_{1} \neq m i F_{2}$ is

T_{m i F}^{S} = \frac{{({\hat{m i F}}_{1} - {\hat{m i F}}_{2})}^{2}}{{\tilde{V a r}}_{m i F d}},

where ${\tilde{V a r}}_{miFd}$ is the variance of $({\hat{m i F}}_{1} - {\hat{m i F}}_{2})$ with $\{p_{i j k}\}$ replaced by $\{{\tilde{p}}_{i j k}\}$ , that is calculated from the MLE of $\{p_{i j k}\}$ under the null hypothesis.
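Because $m i F_{1} - m i F_{2} = \sum c_{i j k} p_{i j k}$ with $c_{i j k} = 1[i = k] - 1[j = k]$ is linear in $p$ , both statistics can be written in terms of two counts: $n_{+}$ (only Test 1 correct) and $n_{-}$ (only Test 2 correct). The sketch below relies on two results derived here from the multinomial model rather than taken from Appendix A.2: the variance of a linear combination of multinomial proportions, and a closed-form constrained MLE obtained with a Lagrange multiplier, which splits the discordant probability equally between the two kinds of cells. Under these assumptions the score statistic takes the same form as McNemar's statistic applied to correct versus incorrect classification. The data are the Scenario 3 cell probabilities of Table 2 multiplied by 300 and treated as observed counts, purely for illustration.

```python
import math

# Scenario 3 cell probabilities (Table 2) times 300, treated as observed counts.
rows = [
    [30, 15, 15, 5, 10, 5, 5, 5, 10],
    [10, 5, 5, 15, 30, 15, 5, 5, 10],
    [10, 5, 5, 5, 10, 5, 15, 15, 30],
]
r = 3
C = range(r)
# n[(i, j, k)]: i = Test 1 class, j = Test 2 class, k = true class (0-based).
n = {(i, j, k): rows[i][r * k + j] for i in C for j in C for k in C}
N = sum(n.values())

# Coefficient of each cell in miF_1 - miF_2: +1, 0, or -1.
c = {cell: int(cell[0] == cell[2]) - int(cell[1] == cell[2]) for cell in n}
n_plus = sum(v for cell, v in n.items() if c[cell] == 1)     # only Test 1 correct
n_minus = sum(v for cell, v in n.items() if c[cell] == -1)   # only Test 2 correct

d_hat = (n_plus - n_minus) / N   # estimated miF_1 - miF_2

# Wald: [sum c^2 p - (sum c p)^2] / N evaluated at the unrestricted MLE.
var_hat = ((n_plus + n_minus) / N - d_hat ** 2) / N
wald = d_hat ** 2 / var_hat

# Score: at the constrained MLE, sum c p = 0 and sum c^2 p = (n_plus + n_minus) / N.
var_tilde = (n_plus + n_minus) / N ** 2
score = d_hat ** 2 / var_tilde

for name, stat in (("Wald", wald), ("score", score)):
    p_value = math.erfc(math.sqrt(stat / 2))   # chi-square, 1 degree of freedom
    print(f"{name}: statistic = {stat:.3f}, p = {p_value:.4f}")
```

In this form the Wald statistic is never smaller than the score statistic, since its denominator subtracts the nonnegative term $\hat{d}^{2} / N$ .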

3.3 |. Test statistic for comparing two $m a F s$

The MLE of $m a F_{1}$ and $m a F_{2}$ can be obtained by substituting $p_{a . a}, p_{. a a}, p_{a . .}, p_{. a .}$ and $p_{. . a}$ by their MLE’s in (7) and (8).

{\hat{m a F}}_{1} = \frac{2}{r} \sum_{a = 1}^{r} \frac{{\hat{p}}_{a . a}}{{\hat{p}}_{a . .} + {\hat{p}}_{. . a}} = \frac{2}{r} \sum_{a = 1}^{r} \frac{n_{a . a}}{n_{a . .} + n_{. . a}}, {\hat{m a F}}_{2} = \frac{2}{r} \sum_{a = 1}^{r} \frac{{\hat{p}}_{. a a}}{{\hat{p}}_{. a .} + {\hat{p}}_{. . a}} = \frac{2}{r} \sum_{a = 1}^{r} \frac{n_{. a a}}{n_{. a .} + n_{. . a}} .

Again by the delta-method and multivariate central limit theorem (Appendix A.3), we have the Wald statistic for testing $H_{0} : m a F_{1} = m a F_{2}$ versus $H_{1} : m a F_{1} \neq m a F_{2}$ as

T_{m a F}^{W} = \frac{{({\hat{m a F}}_{1} - {\hat{m a F}}_{2})}^{2}}{{\hat{V a r}}_{m a F d}},

where ${\hat{V a r}}_{m a F d}$ is the variance of $({\hat{m a F}}_{1} - {\hat{m a F}}_{2})$ with $\{p_{i j k}\}$ replaced by $\{{\hat{p}}_{i j k}\}$ . The test statistic is distributed asymptotically as a $χ^{2}$ distribution with one degree of freedom under the null hypothesis.
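For the macro-averaged score, the gradient of $m a F_{1} - m a F_{2}$ is the average of the per-class $F_{1}$ gradients, so the Wald statistic can be sketched with the same generic delta-method form. The code below is an illustration under that approach, not a transcription of Appendix A.3; the helper names are hypothetical, and the data are the Scenario 3 cell probabilities of Table 2 multiplied by 300 and treated as observed counts.

```python
import math

# Scenario 3 cell probabilities (Table 2) times 300, treated as observed counts.
rows = [
    [30, 15, 15, 5, 10, 5, 5, 5, 10],
    [10, 5, 5, 15, 30, 15, 5, 5, 10],
    [10, 5, 5, 5, 10, 5, 15, 15, 30],
]
r = 3
C = range(r)
n = {(i, j, k): rows[i][r * k + j] for i in C for j in C for k in C}
N = sum(n.values())
p_hat = {cell: v / N for cell, v in n.items()}


def class_f1_and_gradient(p, a, test):
    """F_{1a} or F_{2a} and its partial derivatives with respect to every p_{ijk}."""
    idx = 0 if test == 1 else 1   # position of the test's prediction in (i, j, k)
    tp = sum(v for c, v in p.items() if c[idx] == a and c[2] == a)
    denom = sum(v * ((c[idx] == a) + (c[2] == a)) for c, v in p.items())
    f = 2 * tp / denom
    grad = {
        c: 2 * ((c[idx] == a and c[2] == a) / denom
                - tp * ((c[idx] == a) + (c[2] == a)) / denom ** 2)
        for c in p
    }
    return f, grad


def maf_diff_and_gradient(p):
    """maF_1 - maF_2 from (7) and (8), with its gradient."""
    d, grad = 0.0, {c: 0.0 for c in p}
    for a in C:
        f1, g1 = class_f1_and_gradient(p, a, 1)
        f2, g2 = class_f1_and_gradient(p, a, 2)
        d += (f1 - f2) / r
        for c in p:
            grad[c] += (g1[c] - g2[c]) / r
    return d, grad


d_hat, grad = maf_diff_and_gradient(p_hat)
mean_g = sum(grad[c] * p_hat[c] for c in p_hat)
var_hat = (sum(grad[c] ** 2 * p_hat[c] for c in p_hat) - mean_g ** 2) / N
wald = d_hat ** 2 / var_hat
p_value = math.erfc(math.sqrt(wald / 2))   # chi-square, 1 degree of freedom
print(round(d_hat, 3), round(wald, 3), round(p_value, 4))
```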

For the score statistic, we consider the MLE of $\{p_{i j k}\}$ under the null hypothesis as in the case of $b i F$ and $m i F$ . The score statistic for testing $H_{0} : m a F_{1} = m a F_{2}$ versus $H_{1} : m a F_{1} \neq m a F_{2}$ is

T_{m a F}^{S} = \frac{{({\hat{m a F}}_{1} - {\hat{m a F}}_{2})}^{2}}{{\tilde{V a r}}_{m a F d}},

where ${\tilde{V a r}}_{m a F d}$ is the variance of $({\hat{m a F}}_{1} - {\hat{m a F}}_{2})$ with $\{p_{i j k}\}$ replaced by $\{{\tilde{p}}_{i j k}\}$ , which is calculated from the MLE of $\{p_{i j k}\}$ under the null hypothesis.

3.4 |. Test statistic for comparing two $m a F^{*} s$

To obtain the MLEs of $m a F_{1}^{*}$ and $m a F_{2}^{*}$ , we first substitute $p_{a . a}, p_{. a a}, p_{a . .}, p_{. a .}$ and $p_{. . a}$ by their MLE’s to get MLE’s of $m a P$ and $m a R$ and use these in (9) and (10):

{\hat{m a F}}_{1}^{*} = 2 \frac{\hat{m a P_{1}} \times \hat{m a R_{1}}}{\hat{m a P_{1}} + \hat{m a R_{1}}}, {\hat{m a F}}_{2}^{*} = 2 \frac{\hat{m a P_{2}} \times \hat{m a R_{2}}}{\hat{m a P_{2}} + \hat{m a R_{2}}} .

Using the delta-method and multivariate central limit theorem (Appendix A.4), we have the Wald statistic for testing $H_{0} : m a F_{1}^{*} = m a F_{2}^{*}$ versus $H_{1} : m a F_{1}^{*} \neq m a F_{2}^{*}$ as

T_{m a F^{*}}^{W} = \frac{{({\hat{m a F}}_{1}^{*} - {\hat{m a F}}_{2}^{*})}^{2}}{{\hat{V a r}}_{m a F d^{*}}},

Again to get ${\hat{V a r}}_{m a F d^{*}}$ , all components of the variance of $({\hat{m a F}}_{1}^{*} - {\hat{m a F}}_{2}^{*})$ are replaced by their respective MLE’s. The test statistic is distributed asymptotically as a $χ^{2}$ distribution with one degree of freedom under the null hypothesis.
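When the analytic gradient is tedious to write out, as for $m a F^{*}$ , a numerical gradient gives a quick check of the delta-method variance. The sketch below uses central differences in place of the analytic derivatives of Appendix A.4, so it is an approximation of the same quantity rather than the article's formula. The helper names and the step size `h` are choices of this example, and the data are the Scenario 4 cell probabilities of Table 2 multiplied by 500 and treated as observed counts.

```python
import math

# Scenario 4 cell probabilities (Table 2) times 500, treated as observed counts.
rows = [
    [90, 45, 45, 5, 10, 5, 5, 5, 10],
    [30, 15, 15, 15, 30, 15, 5, 5, 10],
    [30, 15, 15, 5, 10, 5, 15, 15, 30],
]
r = 3
C = range(r)
n = {(i, j, k): rows[i][r * k + j] for i in C for j in C for k in C}
N = sum(n.values())
p_hat = {cell: v / N for cell, v in n.items()}


def maf_star(p, test):
    """Alternate macro-averaged F1-score, equations (9) and (10)."""
    idx = 0 if test == 1 else 1   # position of the test's prediction in (i, j, k)
    ma_p = ma_r = 0.0
    for a in C:
        tp = sum(v for c, v in p.items() if c[idx] == a and c[2] == a)
        pred = sum(v for c, v in p.items() if c[idx] == a)
        true = sum(v for c, v in p.items() if c[2] == a)
        ma_p += tp / pred / r
        ma_r += tp / true / r
    return 2 * ma_p * ma_r / (ma_p + ma_r)


def maf_star_diff(p):
    return maf_star(p, 1) - maf_star(p, 2)


def numerical_gradient(func, p, h=1e-6):
    """Central-difference approximation to the gradient of func at p."""
    grad = {}
    for c in p:
        up, down = dict(p), dict(p)
        up[c] += h
        down[c] -= h
        grad[c] = (func(up) - func(down)) / (2 * h)
    return grad


d_hat = maf_star_diff(p_hat)
grad = numerical_gradient(maf_star_diff, p_hat)
mean_g = sum(grad[c] * p_hat[c] for c in p_hat)
var_hat = (sum(grad[c] ** 2 * p_hat[c] for c in p_hat) - mean_g ** 2) / N
wald = d_hat ** 2 / var_hat
p_value = math.erfc(math.sqrt(wald / 2))   # chi-square, 1 degree of freedom
print(round(maf_star(p_hat, 1), 2), round(maf_star(p_hat, 2), 2), round(wald, 3))
```

The same `numerical_gradient` helper could be pointed at any of the other score differences, which makes it a convenient cross-check for hand-derived variances.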

On the other hand, for the score statistic, we consider the MLE of $\{p_{i j k}\}$ under the null hypothesis as in the case of $b i F, m i F$ , and $m a F$ . The score statistic for testing $H_{0} : m a F_{1}^{*} = m a F_{2}^{*}$ versus $H_{1} : m a F_{1}^{*} \neq m a F_{2}^{*}$ is

T_{m a F^{*}}^{S} = \frac{{({\hat{m a F}}_{1}^{*} - {\hat{m a F}}_{2}^{*})}^{2}}{{\tilde{V a r}}_{m a F d^{*}}},

where ${\tilde{V a r}}_{m a F d^{*}}$ is the variance of $({\hat{m a F}}_{1}^{*} - {\hat{m a F}}_{2}^{*})$ with $\{p_{i j k}\}$ replaced by $\{{\tilde{p}}_{i j k}\}$ , which is calculated from the MLE of $\{p_{i j k}\}$ under the null hypothesis.

4 |. SIMULATION

4.1 |. Simulation setup

A simulation study was conducted to evaluate the performance of the test statistics proposed in Section 3. We set $r = 3$ (class 1, 2, 3), and generated data according to the multinomial distributions with $p$ shown in Table 2. Classes 2 and 3 were combined when calculating $b i F$ . The total sample size, $N$ , was set to 100, 300, 500, and 1,000. The nominal type I error rate was set to 0.05 (two-sided test). We used the empirical type I error rate and empirical power as performance measures. For each combination of the scenario and sample size, we performed 100,000 repeated simulations.
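A scaled-down version of this simulation can be sketched for the $m i F$ tests, whose statistics depend only on the counts $n_{+}$ (only Test 1 correct) and $n_{-}$ (only Test 2 correct). The closed forms used here, Wald $= (n_{+} - n_{-})^{2} / (n_{+} + n_{-} - (n_{+} - n_{-})^{2} / N)$ and score $= (n_{+} - n_{-})^{2} / (n_{+} + n_{-})$ , follow from the linearity of $m i F$ in $p$ and are an assumption of this sketch rather than the article's code. It uses 2,000 replications instead of 100,000, and the function name and seed are arbitrary.

```python
import random
from statistics import NormalDist

# Scenario 1 of Table 2 as relative weights (counts out of 300).
rows = [
    [40, 10, 10, 5, 10, 5, 5, 5, 10],
    [10, 5, 5, 10, 40, 10, 5, 5, 10],
    [10, 5, 5, 5, 10, 5, 10, 10, 40],
]
r = 3
cells = [(i, j, k) for i in range(r) for j in range(r) for k in range(r)]
weights = [rows[i][r * k + j] for i, j, k in cells]
# +1: only Test 1 correct, -1: only Test 2 correct, 0: both or neither correct.
signs = [int(i == k) - int(j == k) for i, j, k in cells]
CRITICAL = NormalDist().inv_cdf(0.975) ** 2   # 0.95 quantile of chi-square, 1 df


def type_one_error(N, reps, seed):
    """Empirical rejection rates of the Wald and score miF tests under H0."""
    rng = random.Random(seed)
    wald_rejections = score_rejections = 0
    for _ in range(reps):
        draws = rng.choices(signs, weights=weights, k=N)   # one multinomial sample
        n_plus, n_minus = draws.count(1), draws.count(-1)
        discordant = n_plus + n_minus
        if discordant == 0:
            continue   # statistics undefined; counted as a non-rejection
        sq = (n_plus - n_minus) ** 2
        score = sq / discordant
        wald_denominator = discordant - sq / N
        wald = sq / wald_denominator if wald_denominator > 0 else float("inf")
        wald_rejections += wald > CRITICAL
        score_rejections += score > CRITICAL
    return wald_rejections / reps, score_rejections / reps


wald_rate, score_rate = type_one_error(N=300, reps=2000, seed=2023)
print(f"Wald: {wald_rate:.3f}, score: {score_rate:.3f}")
```

With only 2,000 replications the Monte Carlo standard error is about 0.005, so the rates should be read as rough estimates of the nominal 0.05 level.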

TABLE 2.

Simulation study: True cell probabilities.

		True class = 1			True class = 2			True class = 3

		Test 2			Test 2			Test 2

Scenario 1		1	2	3	1	2	3	1	2	3
Test 1	1	40/300	10/300	10/300	5/300	10/300	5/300	5/300	5/300	10/300
	2	10/300	5/300	5/300	10/300	40/300	10/300	5/300	5/300	10/300
	3	10/300	5/300	5/300	5/300	10/300	5/300	10/300	10/300	40/300
$b i F_{1} = m i F_{1} = m a F_{1} = m a F_{1}^{*} = 0.60$
$b i F_{2} = m i F_{2} = m a F_{2} = m a F_{2}^{*} = 0.60$
		True class = 1			True class = 2			True class = 3

		Test 2			Test 2			Test 2

Scenario 2		1	2	3	1	2	3	1	2	3
Test 1	1	120/500	30/500	30/500	5/500	10/500	5/500	5/500	5/500	10/500
	2	30/500	15/500	15/500	10/500	40/500	10/500	5/500	5/500	10/500
	3	30/500	15/500	15/500	5/500	10/500	5/500	10/500	10/500	40/500
$b i F_{1} = 0.69, m i F_{1} = 0.60, m a F_{1} = 0.56, m a F_{1}^{*} = 0.58$
$b i F_{2} = 0.69, m i F_{2} = 0.60, m a F_{2} = 0.56, m a F_{2}^{*} = 0.58$
		True class = 1			True class = 2			True class = 3

		Test 2			Test 2			Test 2

Scenario 3		1	2	3	1	2	3	1	2	3
Test 1	1	30/300	15/300	15/300	5/300	10/300	5/300	5/300	5/300	10/300
	2	10/300	5/300	5/300	15/300	30/300	15/300	5/300	5/300	10/300
	3	10/300	5/300	5/300	5/300	10/300	5/300	15/300	15/300	30/300
$b i F_{1} = m i F_{1} = m a F_{1} = m a F_{1}^{*} = 0.60$
$b i F_{2} = m i F_{2} = m a F_{2} = m a F_{2}^{*} = 0.50$
		True class = 1			True class = 2			True class = 3

		Test 2			Test 2			Test 2

Scenario 4		1	2	3	1	2	3	1	2	3
Test 1	1	90/500	45/500	45/500	5/500	10/500	5/500	5/500	5/500	10/500
	2	30/500	15/500	15/500	15/500	30/500	15/500	5/500	5/500	10/500
	3	30/500	15/500	15/500	5/500	10/500	5/500	15/500	15/500	30/500
$b i F_{1} = 0.69, m i F_{1} = 0.60, m a F_{1} = 0.56, m a F_{1}^{*} = 0.58$
$b i F_{2} = 0.60, m i F_{2} = 0.50, m a F_{2} = 0.47, m a F_{2}^{*} = 0.49$

Scenarios 1 and 2 are set up to evaluate the empirical type I error rate of the proposed test statistics, while scenarios 3 and 4 are designed to assess their empirical power. In scenario 1, the true conditions of classes 1, 2, and 3 have the same probability (1∕3), and the recalls and precisions within each class are equal in the two tests (60%). Thus, ${m a R}_{1} = m a R_{2} = m a P_{1} = m a P_{2} = 0.60$ , and $F_{1 a} = F_{2 a} = 0.60$ for each class, $a = 1, 2, 3$ . Then, $m a F_{1} = m a F_{2} = m a F_{1}^{*} = m a F_{2}^{*} = 0.60$ . Because classes 2 and 3 are combined to calculate $b i F$ , $b i F_{1} = F_{11} = 0.60$ and $b i F_{2} = F_{21} = 0.60$ . Also, $p_{a . a} = p_{. a a} = 0.20$ for each class $a = 1, 2, 3$ , and $m i F_{1} = m i F_{2} = 0.60$ .

In scenario 2, the true condition of class 1 has higher probability than the others (60% vs. 20%), and the performances of the two tests are equal: $b i F_{1} = b i F_{2} = 0.69, m i F_{1} = m i F_{2} = 0.60, m a F_{1} = m a F_{2} = 0.56$ , and $m a F_{1}^{*} = m a F_{2}^{*} = 0.58$ . Although the distributions in scenario 2 are the same as those in scenario 1 for each class, the values of $b i F, m a F$ and $m a F^{*}$ differ between the scenarios because $T P_{a} / (T P_{a} + F P_{a})$ is large in the true class = 1 and, conversely, relatively small in the true classes 2 and 3. In contrast, $m i F$ in scenario 2 is the same as that in scenario 1 because $p_{a . a}$ and $p_{. a a}$ for each class $a = 1, 2, 3$ in scenarios 1 and 2 are equal.

The true conditions of classes 1, 2, and 3 have the same probability (1∕3) in scenario 3. However, $m a R$ and $m a P$ of Test 2 are lower than those of Test 1 (50% vs. 60%), $F_{2 a}$ are lower than $F_{1 a}$ (50% vs. 60%), and $p_{. a a}$ is lower than $p_{a . a}$ for each class $a = 1, 2, 3$ (17% vs. 20%). Therefore, $b i F_{1} = m i F_{1} = m a F_{1} = m a F_{1}^{*} = 0.60$ , whereas $b i F_{2} = m i F_{2} = m a F_{2} = m a F_{2}^{*} = 0.50$ .

In scenario 4, the true condition of class 1 has higher probability than the others (60% vs. 20%) as in scenario 2. However, the performances of the two tests differ: $b i F_{1} = 0.69$ versus $b i F_{2} = 0.60, m i F_{1} = 0.60$ versus $m i F_{2} = 0.50, m a F_{1} = 0.56$ versus $m a F_{2} = 0.47$ , and $m a F_{1}^{*} = 0.58$ versus $m a F_{2}^{*} = 0.49$ .
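The $b i F$ values in Table 2 come from combining classes 2 and 3 into a single negative class. A minimal sketch of that collapsing step, applied to the Scenario 4 cell probabilities, is shown below; `collapse_to_binary` and `binary_f1` are hypothetical helper names.

```python
# Scenario 4 of Table 2 as counts out of 500 (same row layout as the table).
rows = [
    [90, 45, 45, 5, 10, 5, 5, 5, 10],
    [30, 15, 15, 15, 30, 15, 5, 5, 10],
    [30, 15, 15, 5, 10, 5, 15, 15, 30],
]
r, N = 3, 500
C = range(r)
# p[i][j][k]: i = Test 1 class, j = Test 2 class, k = true class (0-based).
p = [[[rows[i][r * k + j] / N for k in C] for j in C] for i in C]


def collapse_to_binary(p, positive=0):
    """Merge every class other than `positive` into one negative class."""
    q = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
    size = len(p)
    for i in range(size):
        for j in range(size):
            for k in range(size):
                q[int(i != positive)][int(j != positive)][int(k != positive)] += p[i][j][k]
    return q


def binary_f1(q, test):
    """biF from equation (1) for Test 1 or equation (2) for Test 2."""
    B = range(2)
    true_pos = sum(q[i][j][0] for i in B for j in B)          # p_{..1}
    if test == 1:
        tp = sum(q[0][j][0] for j in B)                       # p_{1.1}
        pred_pos = sum(q[0][j][k] for j in B for k in B)      # p_{1..}
    else:
        tp = sum(q[i][0][0] for i in B)                       # p_{.11}
        pred_pos = sum(q[i][0][k] for i in B for k in B)      # p_{.1.}
    return 2 * tp / (pred_pos + true_pos)


q = collapse_to_binary(p)
biF1, biF2 = binary_f1(q, 1), binary_f1(q, 2)
print(round(biF1, 2), round(biF2, 2))
```

Collapsing leaves the class 1 marginals unchanged, so $b i F$ here equals the class 1 $F_{1}$ -score computed from the original three-class table.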

4.2 |. Simulation result

Table 3 shows the empirical type I error rates of the proposed tests for scenarios 1 and 2. The empirical type I error rates for both test statistics were close to the nominal type I error rate of 0.05 when the sample size was large (300, 500, 1000). When $N$ was relatively small (100), the empirical type I error rates tended to be slightly larger than 0.05, especially for the Wald statistics. In contrast, the empirical type I error rates of the score statistics were close to the nominal type I error rate of 0.05 for all sample sizes. Table 4 shows the empirical power of the proposed tests for scenarios 3 and 4. As shown in Table 4, the empirical powers increased with the sample size. The empirical powers of the Wald statistics and score statistics were similar, especially when the sample size was large.

TABLE 3.

Simulation study: Empirical type I error rates.

Scenario	N	$T_{b i F}^{W}$	$T_{b i F}^{S}$	$T_{m i F}^{W}$	$T_{m i F}^{S}$	$T_{m a F}^{W}$	$T_{m a F}^{S}$	$T_{m a F *}^{W}$	$T_{m a F *}^{S}$
1	100	0.057	0.050	0.053	0.049	0.055	0.051	0.057	0.053
	300	0.052	0.050	0.051	0.050	0.052	0.051	0.052	0.051
	500	0.051	0.050	0.050	0.050	0.051	0.050	0.051	0.050
	1000	0.050	0.049	0.051	0.050	0.051	0.050	0.051	0.051
2	100	0.052	0.049	0.054	0.049	0.058	0.053	0.061	0.055
	300	0.052	0.050	0.051	0.050	0.054	0.052	0.054	0.052
	500	0.051	0.050	0.051	0.050	0.051	0.050	0.052	0.051
	1000	0.050	0.050	0.051	0.051	0.052	0.051	0.051	0.051

TABLE 4.

Simulation study: Empirical power.

Scenario	N	$T_{b i F}^{W}$	$T_{b i F}^{S}$	$T_{m i F}^{W}$	$T_{m i F}^{S}$	$T_{m a F}^{W}$	$T_{m a F}^{S}$	$T_{m a F *}^{W}$	$T_{m a F *}^{S}$
3	100	0.192	0.174	0.304	0.289	0.309	0.297	0.310	0.300
	300	0.438	0.429	0.694	0.689	0.696	0.692	0.696	0.692
	500	0.641	0.635	0.890	0.888	0.889	0.888	0.889	0.888
	1000	0.905	0.904	0.995	0.995	0.995	0.995	0.995	0.995
4	100	0.235	0.226	0.305	0.291	0.291	0.278	0.271	0.256
	300	0.560	0.556	0.695	0.690	0.662	0.657	0.615	0.609
	500	0.773	0.771	0.889	0.887	0.865	0.863	0.826	0.824
	1000	0.969	0.969	0.995	0.995	0.992	0.992	0.984	0.984

5 |. EXAMPLE

We describe an application of the proposed hypothesis testing procedure to the motivating example.³ In this study, a skin cancer classification system with faster, region-based convolutional neural network algorithm (FRCNN) for brown to black pigmented skin lesions was developed using a deep learning method. The target diseases were malignant tumors (malignant melanoma (MM) and basal cell carcinoma (BCC)) and benign tumors (nevus, seborrheic keratosis (SK), senile lentigo (SL) and hematoma/hemangioma (H/H)), and 2000 images were evaluated. The 2000 images were obtained by randomly sampling 200 images from the 666 images 10 times. For illustration, all images were treated as independent in this study. The data are shown in Tables B1–B3, Appendix B. Although images were classified into six categories (MM, BCC, nevus, SK, SL, H/H), accuracy was the only performance measure computed for six-class classification data in the motivating example. Other performance measures (sensitivity, specificity, false negative rate, false positive rate, and positive predictive value) were calculated for two-class classification data after combining malignant tumors (MM and BCC) and benign tumors (nevus, SK, SL, and H/H). The accuracy of six-class classification by the FRCNN (86.2% ± 2.95%) was statistically significantly higher than that of board-certified dermatologists (BCD) (79.5% ± 5.27%, p = 0.0081) and that of dermatologic trainees (75.1% ± 2.18%, p < 0.0001).

We compared the performance of skin cancer classification between the FRCNN and BCD using $b i F s, m i F s$ , $m a F s$ , and $m a F^{*} s$ with the proposed hypothesis testing procedures. $m i F s, m a F s$ , and $m a F^{*} s$ were calculated from six-class classification data, while $b i F s$ were calculated from two-class classification data (malignant tumors vs. benign tumors). The results are shown in Table 5. All $b i F s, m i F s, m a F s,$ and $m a F^{*} s$ of the FRCNN were significantly higher than those of BCD.

TABLE 5.

Example.

	FRCNN	BCD	Difference	Test statistics	p-value
biF (Wald)	0.840	0.776	0.064	19.4	< 0.001
biF (score)	0.840	0.776	0.064	18.9	< 0.001
miF (Wald)	0.862	0.795	0.067	41.9	< 0.001
miF (score)	0.862	0.795	0.067	41.0	< 0.001
maF (Wald)	0.846	0.768	0.078	26.2	< 0.001
maF (score)	0.846	0.768	0.078	24.5	< 0.001
maF* (Wald)	0.848	0.772	0.076	26.4	< 0.001
maF* (score)	0.848	0.772	0.076	23.0	< 0.001
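The p-values in Table 5 follow from comparing each statistic with a $χ^{2}$ distribution with one degree of freedom. For one degree of freedom the upper tail probability equals $P(|Z| > \sqrt{x})$ for a standard normal $Z$ , which the short sketch below evaluates with `math.erfc`; the function name is illustrative.

```python
import math


def chi2_df1_upper_tail(x):
    """P(chi-square with 1 df > x) = P(|Z| > sqrt(x)) for standard normal Z."""
    return math.erfc(math.sqrt(x / 2))


# Test statistics reported in Table 5.
table5 = {
    "biF (Wald)": 19.4, "biF (score)": 18.9,
    "miF (Wald)": 41.9, "miF (score)": 41.0,
    "maF (Wald)": 26.2, "maF (score)": 24.5,
    "maF* (Wald)": 26.4, "maF* (score)": 23.0,
}
for label, stat in table5.items():
    print(f"{label:13s} statistic = {stat:5.1f}   p = {chi2_df1_upper_tail(stat):.1e}")
```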

6 |. DISCUSSION

We developed hypothesis testing procedures for comparing two $F_{1}$-scores ($biF_{1}$ and $biF_{2}$, $miF_{1}$ and $miF_{2}$, $maF_{1}$ and $maF_{2}$, and $maF_{1}^{*}$ and $maF_{2}^{*}$) in the paired study design. Through the simulation study and the motivating example, we assessed the performance and feasibility of those testing procedures. We conclude that the method based on the score statistics ($T_{biF}^{S}$, $T_{miF}^{S}$, $T_{maF}^{S}$, and $T_{maF^{*}}^{S}$) is slightly better than the method based on the Wald statistics ($T_{biF}^{W}$, $T_{miF}^{W}$, $T_{maF}^{W}$, and $T_{maF^{*}}^{W}$) because its empirical type I error rate is closer to the nominal level even when the sample size is small. However, when multi-class classification is considered, the typical sample size is much larger than 100, and both approaches perform equally well in such scenarios.

We did not observe a substantial disparity in the empirical powers of the two approaches.

At present, hypothesis testing procedures for $biF$s, $miF$s, $maF$s, and $maF^{*}$s have not been studied by others, and most studies report only point estimates of these scores. Han et al¹⁹ applied a one-sample t-test to compare $biF$s; however, this approach may not be appropriate because $biF$ is the harmonic mean of precision and recall, and the distribution of the difference between two $biF$s is unlikely to follow a Student's t-distribution.

A limitation of this work is that the proposed procedures are based on large-sample theory and thus require a large sample size to provide strict control of the type I error rate. As future work, we are developing an exact test for comparing two $biF$s, $miF$s, $maF$s, and $maF^{*}$s based on the methods presented in this article.

R code for computing the point estimates, Wald statistics, score statistics, and the p-value of each statistic for $biF$s, $miF$s, $maF$s, and $maF^{*}$s in the paired design is available on the lead author's GitHub page: https://github.com/kanaet52/f1score/blob/main/R/F1score_test.R. For two-sample designs, the $F_{1}$-scores can be compared by setting the covariance part of the test statistic ($C_{biF}^{W}$, $C_{miF}^{W}$, $C_{maF}^{W}$, $C_{maF^{*}}^{W}$; see Appendix A) to 0. Note that for the score statistics, the code obtains the MLE of $\{p_{ijk}\}$ under each null hypothesis by applying the Newton-Raphson method to the log-likelihood equations.

ACKNOWLEDGEMENTS

The authors would like to thank Dr Shunichi Jinnai for providing the motivating example data. This research was partially supported by grant-in-aid for Scientific Research (C) No. 18K11195 and 21K11790 (Yamamoto), grant-in-aid for Research Activity start-up no. 21K21170 (Takahashi), and P30CA068485 Cancer Center Support grant (Koyama).

Funding information

Cancer Center Support, Grant/Award Number: P30CA068485; Japan Society for the Promotion of Science, Grant/Award Numbers: 18K11195, 21K11790, 21K21170

APPENDIX A. DERIVATION OF VARIANCES

A.1. Variance of $b i F$

The derivation of the variance of $({\hat{b i F}}_{1} - {\hat{b i F}}_{2})$ is as follows:

\begin{array}{l} b i F = [\begin{matrix} b i F_{1} \\ b i F_{2} \end{matrix}], \\ {[\frac{\partial (b i F)}{\partial (p)}]}^{T} (p p^{T}) [\frac{\partial (b i F)}{\partial (p)}] = 0, \\ {[\frac{\partial (b i F)}{\partial (p)}]}^{T} (d i a g (p) - p p^{T}) [\frac{\partial (b i F)}{\partial (p)}] = {[\frac{\partial (b i F)}{\partial (p)}]}^{T} (d i a g (p)) [\frac{\partial (b i F)}{\partial (p)}] = [\begin{matrix} A_{b i F}^{W} & C_{b i F}^{W} \\ C_{b i F}^{W} & B_{b i F}^{W} \end{matrix}], \end{array}

with

A_{b i F}^{W} = \frac{1}{{(p_{1 . .} + p_{. . 1})}^{2}} [p_{1.1} {(2 (1 - b i F_{1}))}^{2} + (p_{1.2} + p_{2.1}) b i F_{1}^{2}]

B_{b i F}^{W} = \frac{1}{{(p_{. 1 .} + p_{. . 1})}^{2}} [p_{. 11} {(2 (1 - b i F_{2}))}^{2} + (p_{. 12} + p_{. 21}) b i F_{2}^{2}]

C_{b i F}^{W} = \frac{1}{(p_{1 . .} + p_{. . 1}) (p_{. 1 .} + p_{. . 1})} [2^{2} p_{111} (1 - b i F_{1}) (1 - b i F_{2}) - 2 p_{121} (1 - b i F_{1}) b i F_{2} - 2 p_{211} b i F_{1} (1 - b i F_{2}) + (p_{221} + p_{112}) b i F_{1} b i F_{2}] .

Therefore, the variance of $\hat{b i F_{1}}$ is

\frac{1}{{(p_{1 . .} + p_{. . 1})}^{2}} [p_{1.1} {(2 (1 - b i F_{1}))}^{2} + (p_{1.2} + p_{2.1}) b i F_{1}^{2}] / n,

the variance of $\hat{b i F_{2}}$ is

\frac{1}{{(p_{. 1 .} + p_{. . 1})}^{2}} [p_{. 11} {(2 (1 - b i F_{2}))}^{2} + (p_{. 12} + p_{. 21}) b i F_{2}^{2}] / n,

the variance of $(\hat{b i F_{1}} - \hat{b i F_{2}})$ is

(A_{b i F}^{W} + B_{b i F}^{W} - 2 C_{b i F}^{W}) / n .
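
As a minimal sketch of how the quantities in A.1 combine, the following Python function evaluates $A_{biF}^{W}$, $B_{biF}^{W}$, and $C_{biF}^{W}$ from a $2 \times 2 \times 2$ table of counts and returns the estimated variance of the difference. It is an illustration only and is not the authors' R code: the function name `bif_variance`, the nested-list layout, and the counts are all hypothetical.

```python
def bif_variance(n):
    """n[i][j][k]: count with test-1 result i, test-2 result j, true condition k
    (index 0 = positive, 1 = negative).
    Returns (biF1, biF2, estimated variance of biF1 - biF2)."""
    idx = range(2)
    total = sum(n[i][j][k] for i in idx for j in idx for k in idx)
    p = [[[n[i][j][k] / total for k in idx] for j in idx] for i in idx]

    # Marginal proportions for test 1 (sum over j) and test 2 (sum over i).
    p1_1 = p[0][0][0] + p[0][1][0]  # p_{1.1}
    p1_2 = p[0][0][1] + p[0][1][1]  # p_{1.2}
    p2_1 = p[1][0][0] + p[1][1][0]  # p_{2.1}
    p_11 = p[0][0][0] + p[1][0][0]  # p_{.11}
    p_12 = p[0][0][1] + p[1][0][1]  # p_{.12}
    p_21 = p[0][1][0] + p[1][1][0]  # p_{.21}

    d1 = 2 * p1_1 + p1_2 + p2_1     # p_{1..} + p_{..1}
    d2 = 2 * p_11 + p_12 + p_21     # p_{.1.} + p_{..1}
    f1 = 2 * p1_1 / d1              # biF_1
    f2 = 2 * p_11 / d2              # biF_2

    a = (p1_1 * (2 * (1 - f1)) ** 2 + (p1_2 + p2_1) * f1 ** 2) / d1 ** 2
    b = (p_11 * (2 * (1 - f2)) ** 2 + (p_12 + p_21) * f2 ** 2) / d2 ** 2
    c = (4 * p[0][0][0] * (1 - f1) * (1 - f2)
         - 2 * p[0][1][0] * (1 - f1) * f2
         - 2 * p[1][0][0] * f1 * (1 - f2)
         + (p[1][1][0] + p[0][0][1]) * f1 * f2) / (d1 * d2)
    return f1, f2, (a + b - 2 * c) / total


# Hypothetical counts for 200 paired observations (not data from this article).
counts = [[[40, 5], [6, 4]],
          [[4, 7], [10, 124]]]
bif1, bif2, var_diff = bif_variance(counts)
print(bif1, bif2, var_diff)
```

One quick sanity check of the covariance term: if the two tests always give the same result, then $A_{biF}^{W} = B_{biF}^{W} = C_{biF}^{W}$ and the variance of the difference is zero, which the function reproduces up to rounding error.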

A.2. Variance of $m i F$

The derivation of the variance of $({\hat{m i F}}_{1} - {\hat{m i F}}_{2})$ is as follows:

\begin{array}{l} m i F = [\begin{matrix} m i F_{1} \\ m i F_{2} \end{matrix}], \\ {[\frac{\partial (m i F)}{\partial (p)}]}^{T} (p p^{T}) [\frac{\partial (m i F)}{\partial (p)}] = [\begin{matrix} {(\sum_{a = 1}^{r} p_{a . a})}^{2} & (\sum_{a = 1}^{r} p_{a . a}) (\sum_{a = 1}^{r} p_{. a a}) \\ (\sum_{a = 1}^{r} p_{a . a}) (\sum_{a = 1}^{r} p_{. a a}) & {(\sum_{a = 1}^{r} p_{. a a})}^{2} \end{matrix}] = [\begin{matrix} m i F_{1}^{2} & m i F_{1} m i F_{2} \\ m i F_{1} m i F_{2} & m i F_{2}^{2} \end{matrix}], \\ {[\frac{\partial (m i F)}{\partial (p)}]}^{T} (d i a g (p)) [\frac{\partial (m i F)}{\partial (p)}] = [\begin{matrix} \sum_{a = 1}^{r} p_{a . a} & \sum_{a = 1}^{r} p_{a a a} \\ \sum_{a = 1}^{r} p_{a a a} & \sum_{a = 1}^{r} p_{. a a} \end{matrix}] = [\begin{matrix} m i F_{1} & \sum_{a = 1}^{r} p_{a a a} \\ \sum_{a = 1}^{r} p_{a a a} & m i F_{2} \end{matrix}], \\ {[\frac{\partial (m i F)}{\partial (p)}]}^{T} (d i a g (p) - p p^{T}) [\frac{\partial (m i F)}{\partial (p)}] = [\begin{matrix} A_{m i F}^{W} & C_{m i F}^{W} \\ C_{m i F}^{W} & B_{m i F}^{W} \end{matrix}], \end{array}

with

A_{m i F}^{W} = m i F_{1} (1 - m i F_{1}),

B_{m i F}^{W} = m i F_{2} (1 - m i F_{2}),

C_{m i F}^{W} = \sum_{a = 1}^{r} p_{a a a} - m i F_{1} m i F_{2} .

Therefore, the variance of $({\hat{m i F}}_{1} - {\hat{m i F}}_{2})$ is

(A_{m i F}^{W} + B_{m i F}^{W} - 2 C_{m i F}^{W}) / n .
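
The variance in A.2 can be evaluated by hand for the example data. The Python sketch below is an illustration, not the authors' R code. It uses three counts taken from Appendix B: the diagonal totals of Tables B2 and B3, and the number of images in Table B1 for which the FRCNN, the BCD, and the true condition all agree. It also assumes that the Wald statistic is the squared difference divided by this variance and that it is referred to a chi-square distribution with one degree of freedom; the variable names are our own.

```python
import math

n = 2000              # total number of images (Tables B1-B3)
correct_frcnn = 1724  # sum of the diagonal of Table B2
correct_bcd = 1590    # sum of the diagonal of Table B3
correct_both = 1438   # Table B1 cells with FRCNN = BCD = true condition

mif1 = correct_frcnn / n   # miF_1
mif2 = correct_bcd / n     # miF_2
p_aaa = correct_both / n   # sum over a of p_{aaa}

a_w = mif1 * (1 - mif1)    # A^W_miF
b_w = mif2 * (1 - mif2)    # B^W_miF
c_w = p_aaa - mif1 * mif2  # C^W_miF

var_diff = (a_w + b_w - 2 * c_w) / n
wald = (mif1 - mif2) ** 2 / var_diff  # assumed form of the Wald statistic

# Upper tail of a chi-square distribution with 1 degree of freedom:
# P(Z^2 > t) = erfc(sqrt(t / 2)) for a standard normal Z.
p_value = math.erfc(math.sqrt(wald / 2))

print(round(mif1 - mif2, 3), round(wald, 1), p_value < 0.001)
# 0.067 41.9 True
```

Under these assumptions the result matches the $miF$ (Wald) row of Table 5.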

A.3. Variance of $m a F$

The derivation of the variance of $({\hat{m a F}}_{1} - {\hat{m a F}}_{2})$ is as follows.

\begin{array}{l} m a F = [\begin{matrix} m a F_{1} \\ m a F_{2} \end{matrix}], \\ {[\frac{\partial (m a F)}{\partial (p)}]}^{T} (p p^{T}) [\frac{\partial (m a F)}{\partial (p)}] = 0, \\ {[\frac{\partial (m a F)}{\partial (p)}]}^{T} (d i a g (p) - p p^{T}) [\frac{\partial (m a F)}{\partial (p)}] = {[\frac{\partial (m a F)}{\partial (p)}]}^{T} (d i a g (p)) [\frac{\partial (m a F)}{\partial (p)}] \\ = \frac{1}{r^{2}} [\begin{matrix} A_{m a F}^{W} & C_{m a F}^{W} \\ C_{m a F}^{W} & B_{m a F}^{W} \end{matrix}], \end{array}

with

\begin{array}{l} A_{m a F}^{W} = \sum_{a = 1}^{r} p_{a . a} {(\frac{2 (1 - F_{1 a})}{p_{a . .} + p_{. . a}})}^{2} + \sum_{a = 1}^{r} \sum_{b \neq a} p_{a . b} {(\frac{F_{1 a}}{p_{a . .} + p_{. . a}} + \frac{F_{1 b}}{p_{b . .} + p_{. . b}})}^{2}, \\ B_{m a F}^{W} = \sum_{a = 1}^{r} p_{. a a} {(\frac{2 (1 - F_{2 a})}{p_{. a .} + p_{. . a}})}^{2} + \sum_{a = 1}^{r} \sum_{b \neq a} p_{. a b} {(\frac{F_{2 a}}{p_{. a .} + p_{. . a}} + \frac{F_{2 b}}{p_{. b .} + p_{. . b}})}^{2}, \\ C_{m a F}^{W} = \sum_{a = 1}^{r} p_{a a a} \frac{2^{2} (1 - F_{1 a}) (1 - F_{2 a})}{(p_{a . .} + p_{. . a}) (p_{. a .} + p_{. . a})} \\ - \sum_{a = 1}^{r} \sum_{b \neq a} \{p_{a b a} \frac{2 (1 - F_{1 a})}{(p_{a . .} + p_{. . a})} (\frac{F_{2 a}}{(p_{. a .} + p_{. . a})} + \frac{F_{2 b}}{(p_{. b .} + p_{. . b})}) + p_{b a a} \frac{2 (1 - F_{2 a})}{(p_{. a .} + p_{. . a})} (\frac{F_{1 a}}{(p_{a . .} + p_{. . a})} + \frac{F_{1 b}}{(p_{b . .} + p_{. . b})})\} \\ + \sum_{a = 1}^{r} \sum_{b \neq a} \sum_{c \neq a} p_{b c a} (\frac{F_{1 a}}{(p_{a . .} + p_{. . a})} + \frac{F_{1 b}}{(p_{b . .} + p_{. . b})}) (\frac{F_{2 a}}{(p_{. a .} + p_{. . a})} + \frac{F_{2 c}}{(p_{. c .} + p_{. . c})}) . \end{array}

Therefore, the variance of $({\hat{m a F}}_{1} - {\hat{m a F}}_{2})$ is

\frac{1}{r^{2}} (A_{maF}^{W} + B_{maF}^{W} - 2 C_{maF}^{W}) / n .

A.4. Variance of $m a F^{*}$

The derivation of the variance of $({\hat{m a F^{*}}}_{1} - {\hat{m a F^{*}}}_{2})$ is as follows.

\begin{array}{l} m a F^{*} = [\begin{matrix} m a F_{1}^{*} \\ m a F_{2}^{*} \end{matrix}], \\ {[\frac{\partial (m a F^{*})}{\partial (p)}]}^{T} (p p^{T}) [\frac{\partial (m a F^{*})}{\partial (p)}] = 0, \\ {[\frac{\partial (m a F^{*})}{\partial (p)}]}^{T} (d i a g (p) - p p^{T}) [\frac{\partial (m a F^{*})}{\partial (p)}] = {[\frac{\partial (m a F^{*})}{\partial (p)}]}^{T} (d i a g (p)) [\frac{\partial (m a F^{*})}{\partial (p)}] \\ = [\begin{matrix} A_{m a F^{*}}^{W} & C_{m a F^{*}}^{W} \\ C_{m a F^{*}}^{W} & B_{m a F^{*}}^{W} \end{matrix}], \end{array}

with

A_{m a F^{*}}^{W} = \frac{2^{2}}{r^{2} {(m a P_{1} + m a R_{1})}^{4}} [\sum_{a = 1}^{r} p_{a . a} {(\frac{(p_{a . .} - p_{a . a}) m a R_{1}^{2}}{p_{a . .}^{2}} + \frac{(p_{. . a} - p_{a . a}) m a P_{1}^{2}}{p_{. . a}^{2}})}^{2} + \sum_{a = 1}^{r} \sum_{b \neq a} p_{a . b} {(\frac{p_{a . a} m a R_{1}^{2}}{p_{a . .}^{2}} + \frac{p_{b . b} m a P_{1}^{2}}{p_{. . b}^{2}})}^{2}],

B_{m a F^{*}}^{W} = \frac{2^{2}}{r^{2} {(m a P_{2} + m a R_{2})}^{4}} [\sum_{a = 1}^{r} p_{. a a} {(\frac{(p_{. a .} - p_{. a a}) m a R_{2}^{2}}{p_{. a .}^{2}} + \frac{(p_{. . a} - p_{. a a}) m a P_{2}^{2}}{p_{. . a}^{2}})}^{2} + \sum_{a = 1}^{r} \sum_{b \neq a} p_{. a b} {(\frac{p_{. a a} m a R_{2}^{2}}{p_{. a .}^{2}} + \frac{p_{. b b} m a P_{2}^{2}}{p_{. . b}^{2}})}^{2}],

\begin{array}{l} C_{m a F^{*}}^{W} = \frac{2^{2}}{r^{2} {(m a P_{1} + m a R_{1})}^{2} {(m a P_{2} + m a R_{2})}^{2}} \\ \times [\sum_{a = 1}^{r} p_{a a a} (\frac{p_{a . .} - p_{a . a}}{p_{a . .}^{2}} m a R_{1}^{2} + \frac{p_{. . a} - p_{a . a}}{p_{. . a}^{2}} m a P_{1}^{2}) (\frac{p_{. a .} - p_{. a a}}{p_{. a .}^{2}} m a R_{2}^{2} + \frac{p_{. . a} - p_{. a a}}{p_{. . a}^{2}} m a P_{2}^{2}) \\ - \sum_{a = 1}^{r} \sum_{b \neq a} \{p_{a b a} (\frac{p_{a . .} - p_{a . a}}{p_{a . .}^{2}} m a R_{1}^{2} + \frac{p_{. . a} - p_{a . a}}{p_{. . a}^{2}} m a P_{1}^{2}) (\frac{p_{. b b}}{p_{. b .}^{2}} m a R_{2}^{2} + \frac{p_{. a a}}{p_{. . a}^{2}} m a P_{2}^{2}) \\ + p_{b a a} (\frac{p_{. a .} - p_{. a a}}{p_{. a .}^{2}} m a R_{2}^{2} + \frac{p_{. . a} - p_{. a a}}{p_{. . a}^{2}} m a P_{2}^{2}) (\frac{p_{b . b}}{p_{b . .}^{2}} m a R_{1}^{2} + \frac{p_{a . a}}{p_{. . a}^{2}} m a P_{1}^{2})\} \\ + \sum_{a = 1}^{r} \sum_{b \neq a} \sum_{c \neq a} p_{b c a} (\frac{p_{b . b}}{p_{b . .}^{2}} m a R_{1}^{2} + \frac{p_{a . a}}{p_{. . a}^{2}} m a P_{1}^{2}) (\frac{p_{. c c}}{p_{. c .}^{2}} m a R_{2}^{2} + \frac{p_{. a a}}{p_{. . a}^{2}} m a P_{2}^{2})] . \end{array}

Therefore, the variance of $({\hat{m a F^{*}}}_{1} - {\hat{m a F^{*}}}_{2})$ is

(A_{m a F^{*}}^{W} + B_{m a F^{*}}^{W} - 2 C_{m a F^{*}}^{W}) / n .

APPENDIX B. EXAMPLE DATA

Tables B1, B2, and B3 here.

TABLE B1.

Example data.

		True condition

FRCNN	BCD	MM	BCC	Nevus	SK	H/H	SL
MM	MM	289	2	20	6	2	0
	BCC	9	2	2	5	0	0
	Nevus	10	0	14	0	0	0
	SK	14	2	4	10	0	0
	H/H	3	0	2	0	1	0
	SL	2	0	0	0	0	0
BCC	MM	6	6	0	1	0	0
	BCC	2	95	0	6	0	0
	Nevus	0	2	6	0	0	0
	SK	1	5	0	2	0	0
	H/H	0	0	0	0	0	0
	SL	0	0	0	0	0	0
Nevus	MM	32	1	108	0	1	0
	BCC	1	6	8	1	0	0
	Nevus	11	1	789	8	1	0
	SK	3	3	50	27	0	0
	H/H	0	1	9	0	16	0
	SL	1	0	3	0	0	0
SK	MM	13	1	3	11	0	0
	BCC	0	1	1	12	0	0
	Nevus	1	0	11	9	0	0
	SK	7	4	14	186	0	1
	H/H	0	0	0	0	0	0
	SL	0	0	1	5	0	2
H/H	MM	0	0	0	0	6	0
	BCC	0	0	0	0	1	0
	Nevus	0	0	3	0	5	0
	SK	0	0	0	0	1	0
	H/H	0	0	0	0	44	0
	SL	0	0	0	0	0	0
SL	MM	0	0	0	0	0	0
	BCC	0	0	0	0	0	1
	Nevus	0	0	0	0	0	0
	SK	1	0	0	0	0	6
	H/H	0	0	0	0	0	0
	SL	2	0	0	0	0	35


TABLE B2.

Example data (FRCNN only).

	True condition

FRCNN	MM	BCC	Nevus	SK	H/H	SL
MM	327	6	42	21	3	0
BCC	9	108	6	9	0	0
Nevus	48	12	967	36	18	0
SK	21	6	30	223	0	3
H/H	0	0	3	0	57	0
SL	3	0	0	0	0	42


TABLE B3.

Example data (BCD only).

	True condition

BCD	MM	BCC	Nevus	SK	H/H	SL
MM	340	10	131	18	9	0
BCC	12	104	11	24	1	1
Nevus	22	3	823	17	6	0
SK	26	14	68	225	1	7
H/H	3	1	11	0	61	0
SL	5	0	4	5	0	37


Footnotes

CONFLICT OF INTEREST STATEMENT

The authors have declared no conflict of interest.

DATA AVAILABILITY STATEMENT

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

REFERENCES

1. van Rijsbergen CJ. Information Retrieval. London: Butterworths; 1979.
2. Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge: Cambridge University Press; 2008.
3. Jinnai S, Yamazaki N, Hirano Y, Sugawara Y, Ohe Y, Hamamoto R. The development of a skin cancer classification system for pigmented skin lesions using deep learning. Biomolecules. 2020;10(8):1123. doi:10.3390/biom10081123
4. Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ. 2006;332(7549):1080.
5. Fedorov V, Mannino F, Zhang R. Consequences of dichotomization. Pharm Stat. 2009;8(1):50–61.
6. Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Inf Process Manag. 2009;45:427–437.
7. Bhalla S, Kaur H, Kaur R, Sharma S, Raghava GPS. Expression based biomarkers and models to classify early and late-stage samples of papillary thyroid carcinoma. PloS One. 2020;15(4):e0231629.
8. Chowdhury S, Dong X, Qian L, et al. A multitask bi-directional RNN model for named entity recognition on Chinese electronic medical records. BMC Bioinform. 2018;19(17):499.
9. Döring K, Qaseem A, Becer M, et al. Automated recognition of functional compound-protein relationships in literature. PloS One. 2020;15(3):e0220925.
10. Hong N, Wen A, Stone DJ, et al. Developing a FHIR-based EHR phenotyping framework: a case study for identification of patients with obesity and multiple comorbidities from discharge summaries. J Biomed Inform. 2019;99:103310.
11. Lee GH, Shin SY. Federated learning on clinical benchmark data: performance assessment. J Med Internet Res. 2020;22(10):e20891.
12. Routray R, Tetarenko N, Abu-Assal C, et al. Application of augmented intelligence for pharmacovigilance case seriousness determination. Drug Saf. 2020;43(1):57–66.
13. Wang J, Zhang J, An Y, et al. Biomedical event trigger detection by dependency-based word embedding. BMC Med Genom. 2016;9(2):45.
14. Zhu F, Li X, Mcgonigle D, et al. Analyze informant-based questionnaire for the early diagnosis of senile dementia using deep learning. IEEE J Transl Eng Health Med. 2020;8:2200106.
15. Wang Y, Li J, Li Y, Wang R, Yang X. Confidence interval for $F_{1}$ measure of algorithm performance based on blocked 3 × 2 cross-validation. IEEE Trans Knowl Data Eng. 2015;27:651–659.
16. Zhang D, Wang J, Zhao X. Estimating the uncertainty of average $F_{1}$ scores. Proceedings of the 2015 International Conference on the Theory of Information Retrieval. 2015.
17. Takahashi K, Yamamoto K, Kuchiba A, Koyama T. Confidence interval for micro-averaged $F_{1}$ and macro-averaged $F_{1}$ scores. Appl Intell (Dordr). 2022;52(5):4961–4972.
18. Attia ZI, Noseworthy PA, Lopez-Jimenez F, et al. An artificial intelligence-enabled ECG algorithm for the identification of patients with atrial fibrillation during sinus rhythm: a retrospective analysis of outcome prediction. Lancet. 2019;394(10201):861–867.
19. Han SS, Moon IJ, Lim W, et al. Keratinocytic skin cancer detection on the face using region-based convolutional neural network. JAMA Dermatol. 2020;156(1):29–37.
