Author manuscript; available in PMC: 2024 Oct 17.
Published in final edited form as: Stat Med. 2023 Aug 1;42(23):4177–4192. doi: 10.1002/sim.9853

Hypothesis testing procedure for binary and multi-class F1-scores in the paired design

Kanae Takahashi 1, Kouji Yamamoto 2, Aya Kuchiba 3, Ayumi Shintani 4, Tatsuki Koyama 5
PMCID: PMC11483486  NIHMSID: NIHMS2021606  PMID: 37527903

Abstract

In modern medicine, medical tests are used for various purposes, including diagnosis, disease screening, prognosis, and risk prediction. To quantify the performance of a binary medical test, we often use sensitivity, specificity, and negative and positive predictive values. Additionally, the F1-score, defined as the harmonic mean of precision (positive predictive value) and recall (sensitivity), has come to be used in the medical field because of its favorable characteristics. The F1-score has been extended to multi-class classification in two forms: a micro-averaged F1-score and a macro-averaged F1-score. The micro-averaged F1-score pools per-sample classifications across classes and then calculates the overall F1-score, whereas the macro-averaged F1-score is the arithmetic mean of the per-class F1-scores. Additionally, Sokolova and Lapalme6 gave an alternative definition of the macro-averaged F1-score as the harmonic mean of the arithmetic means of the precision and recall over classes. Although some statistical methods of inference for binary and multi-class F1-scores have been proposed, hypothesis testing procedures for them have not yet been fully developed. We therefore develop hypothesis testing procedures for comparing two F1-scores in a paired study design, based on the large-sample multivariate central limit theorem.

Keywords: delta-method, F1 measures, multi-class classification, precision, recall

1 |. INTRODUCTION

Medical tests are important for the early detection and treatment of disease in modern medicine. Tests are used for various purposes including diagnosis, disease screening, prognosis, and risk prediction. Some measures exist to quantify the test performance; sensitivity, specificity, and positive and negative predictive values are commonly used for binary tests. Additionally, the F1-score for binary data (binary F1-score), which is defined as the harmonic mean of precision (positive predictive value) and recall (sensitivity), has been used in the medical field.1,2

The binary F1-score is especially useful when evaluation of true negatives is relatively unimportant, because true negatives are not included in the computation of either precision or recall. In addition, the binary F1-score behaves sensibly for a poor diagnostic test that identifies the majority of the data as positive. In this situation, the simple arithmetic mean of precision and recall is at least 0.50, because recall will be 1.00 if all the data are diagnosed as positive. However, the binary F1-score will be appropriately low in these instances: it will be 0.18 and 0.02 when the precision is 0.10 and 0.01, respectively, even if recall is 1.00. Therefore, the F1-score is a better statistic to report.2
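The arithmetic in this paragraph can be checked with a few lines of Python. This is only an illustrative sketch; the function names `arithmetic_mean` and `f1` are ours and do not come from the paper.

```python
def arithmetic_mean(precision, recall):
    """Simple average of precision and recall."""
    return (precision + recall) / 2


def f1(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A test that calls every subject positive has recall 1.00.
for precision in (0.10, 0.01):
    recall = 1.00
    print(precision,
          round(arithmetic_mean(precision, recall), 3),
          round(f1(precision, recall), 2))
```

The arithmetic mean stays above 0.50 in both cases, while the harmonic mean drops toward the smaller of the two inputs.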

Most measures of the performance of medical tests are applicable only to binary classification data, and multi-class classification data need to be dichotomized to compute these measures. In the motivating example,3 for instance, skin cancer images were originally classified into six categories (malignant melanoma (MM), basal cell carcinoma (BCC), nevus, seborrheic keratosis (SK), senile lentigo (SL) and hematoma/hemangioma (H/H)), and the classification performances of board-certified dermatologists and dermatologic trainees were compared. The classification performance was assessed by accuracy, sensitivity, specificity, false negative rate, false positive rate, and positive predictive value after dichotomizing the six categories (MM and BCC vs. nevus, SK, SL, and H/H). However, evaluating the performance with the original six categories would have been preferable, because dichotomization led to a loss of information regarding the performance of this classification.4,5

As measures of multi-class classification performance, a micro-averaged F1-score and a macro-averaged F1-score have been proposed.2 The micro-averaged F1-score calculates the overall F1-score by pooling per-sample classifications across classes. In contrast, the macro-averaged F1-score is the arithmetic mean of the per-class F1-scores. In addition, Sokolova and Lapalme6 proposed an alternative macro-averaged F1-score, defined as the harmonic mean of the arithmetic means of the per-class precisions and recalls.

Although F1-scores for binary and multi-class classification were originally used for measuring the performance of text classification in the field of information retrieval, or of a classifier in machine learning, they have become frequently used in medicine.7–14 Some statistical methods for inference have been proposed for the binary F1-score,15 and methods for estimating confidence intervals of the micro-averaged and macro-averaged F1-scores have been developed.16,17 However, these previous methods are for one-sample inference. To our knowledge, no method is available for hypothesis testing of F1-scores for paired samples, as in our motivating example, or for two independent samples. Thus, we aim to provide methods for comparing binary F1-scores, micro-averaged F1-scores and macro-averaged F1-scores in the paired-design setting. For the two-independent-sample setting, the proposed method is readily applicable by setting the covariance part of the test statistics to 0.

The layout of this article is as follows: In Section 2, the definitions of the binary F1-score, micro-averaged F1-score and macro-averaged F1-score are reviewed. Test statistics for comparing those scores are derived in Section 3. Then, the simulation results of the proposed statistics and the application to the motivating example are presented in Sections 4 and 5, respectively. Finally, our brief discussions are provided in Section 6.

2 |. REVIEW OF F1-SCORES

This section introduces the notation and definitions of the binary F1-score ($biF$), micro-averaged F1-score ($miF$), and macro-averaged F1-score ($maF$). Consider an $r\times r\times r$ table of data for a nominal categorical variable with $r$ levels ($r\ge 2$). Each true class $1,\ldots,r$ has an $r\times r$ table representing the prediction frequencies of the two tests to be compared.

This arrangement of data represents binary classification when $r=2$, and multi-class classification when $r>2$. Table 1 shows the general notation for each cell probability $p_{ijk}$, where $i$ indicates the class of Test 1, $j$ indicates the class of Test 2, and $k$ indicates the true condition. A dot in place of a subscript denotes summation over that index, for example $p_{a\cdot a}=\sum_{j=1}^{r}p_{aja}$. Let Test 1 be a new medical test and Test 2 be an existing medical test. We consider hypothesis testing to compare the F1-scores of Test 1 and Test 2. Using this notation, the true positive rate $TP_{1a}$, the false positive rate $FP_{1a}$, and the false negative rate $FN_{1a}$ for each class $a$ $(a=1,\ldots,r)$ in Test 1 are defined as follows:

$$TP_{1a}=p_{a\cdot a},\qquad FP_{1a}=\sum_{\substack{k=1\\ k\neq a}}^{r}p_{a\cdot k},\qquad FN_{1a}=\sum_{\substack{i=1\\ i\neq a}}^{r}p_{i\cdot a}.$$

TABLE 1.

General notations.

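Columns are grouped by true condition $k$; within each group, the columns are the Test 2 classes followed by the total for that true condition.

| Test 1 | $k=1$: Test 2 = 1 | 2 | ⋯ | $r$ | Total | ⋯ | $k=r$: Test 2 = 1 | 2 | ⋯ | $r$ | Total | Overall total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | $p_{111}$ | $p_{121}$ | ⋯ | $p_{1r1}$ | $p_{1\cdot 1}$ | ⋯ | $p_{11r}$ | $p_{12r}$ | ⋯ | $p_{1rr}$ | $p_{1\cdot r}$ | $p_{1\cdot\cdot}$ |
| 2 | $p_{211}$ | $p_{221}$ | ⋯ | $p_{2r1}$ | $p_{2\cdot 1}$ | ⋯ | $p_{21r}$ | $p_{22r}$ | ⋯ | $p_{2rr}$ | $p_{2\cdot r}$ | $p_{2\cdot\cdot}$ |
| ⋮ | ⋮ | ⋮ | | ⋮ | ⋮ | | ⋮ | ⋮ | | ⋮ | ⋮ | ⋮ |
| $r$ | $p_{r11}$ | $p_{r21}$ | ⋯ | $p_{rr1}$ | $p_{r\cdot 1}$ | ⋯ | $p_{r1r}$ | $p_{r2r}$ | ⋯ | $p_{rrr}$ | $p_{r\cdot r}$ | $p_{r\cdot\cdot}$ |
| Total | $p_{\cdot 11}$ | $p_{\cdot 21}$ | ⋯ | $p_{\cdot r1}$ | $p_{\cdot\cdot 1}$ | ⋯ | $p_{\cdot 1r}$ | $p_{\cdot 2r}$ | ⋯ | $p_{\cdot rr}$ | $p_{\cdot\cdot r}$ | 1 |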

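Note that $TP_{1a}+FP_{1a}=p_{a\cdot\cdot}$ and $TP_{1a}+FN_{1a}=p_{\cdot\cdot a}$. Similarly, $TP_{2a}$, $FP_{2a}$, and $FN_{2a}$ for each class $a$ $(a=1,\ldots,r)$ for Test 2 are defined as follows:

$$TP_{2a}=p_{\cdot aa},\qquad FP_{2a}=\sum_{\substack{k=1\\ k\neq a}}^{r}p_{\cdot ak},\qquad FN_{2a}=\sum_{\substack{j=1\\ j\neq a}}^{r}p_{\cdot ja}.$$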

Note that $TP_{2a}+FP_{2a}=p_{\cdot a\cdot}$ and $TP_{2a}+FN_{2a}=p_{\cdot\cdot a}$.
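As a hedged illustration of this notation, the sketch below marginalizes a paired $r\times r\times r$ table into one confusion matrix per test and reads off TP, FP, and FN for a class. The counts are the Scenario 4 cell probabilities of Table 2 multiplied by 500 (one entry printed there as 15/300 is read as 15/500, which is an assumption on our part); the helper names `confusion` and `tp_fp_fn` and the 0-based index order `n[k][i][j]` are ours.

```python
# n[k][i][j]: k = true class, i = Test 1 class, j = Test 2 class (0-based).
# Scenario 4 of Table 2, expressed as counts out of 500.
N_KIJ = [
    [[90, 45, 45], [30, 15, 15], [30, 15, 15]],
    [[5, 10, 5], [15, 30, 15], [5, 10, 5]],
    [[5, 5, 10], [5, 5, 10], [15, 15, 30]],
]


def confusion(n_kij, test):
    """Return m[pred][true] for Test 1 (sum over j) or Test 2 (sum over i)."""
    r = len(n_kij)
    m = [[0] * r for _ in range(r)]
    for k in range(r):
        for i in range(r):
            for j in range(r):
                pred = i if test == 1 else j
                m[pred][k] += n_kij[k][i][j]
    return m


def tp_fp_fn(m, a):
    """TP, FP, FN counts for class a from a confusion matrix m[pred][true]."""
    r = len(m)
    tp = m[a][a]
    fp = sum(m[a][k] for k in range(r) if k != a)
    fn = sum(m[i][a] for i in range(r) if i != a)
    return tp, fp, fn


m1 = confusion(N_KIJ, test=1)
m2 = confusion(N_KIJ, test=2)
print(tp_fp_fn(m1, 0))  # class 1 of Test 1
print(tp_fp_fn(m2, 0))  # class 1 of Test 2
```

For each test and class, TP + FP equals the predicted-class total and TP + FN equals the true-class total, which mirrors the two identities stated above.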

2.1 |. Binary F1-score

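When $r=2$, we consider the following precision ($biP$) and recall ($biR$) for Test 1:

$$biP_1=\frac{TP_{11}}{TP_{11}+FP_{11}}=\frac{p_{1\cdot 1}}{p_{1\cdot\cdot}},$$

$$biR_1=\frac{TP_{11}}{TP_{11}+FN_{11}}=\frac{p_{1\cdot 1}}{p_{\cdot\cdot 1}}.$$

The binary F1-score for Test 1, $biF_1$, is defined as the harmonic mean of $biP_1$ and $biR_1$, that is,

$$biF_1=\frac{2\,biP_1\times biR_1}{biP_1+biR_1}=\frac{2p_{1\cdot 1}}{p_{1\cdot\cdot}+p_{\cdot\cdot 1}}. \tag{1}$$

Similarly, the binary F1-score for Test 2, $biF_2$, is as follows:

$$biP_2=\frac{TP_{21}}{TP_{21}+FP_{21}}=\frac{p_{\cdot 11}}{p_{\cdot 1\cdot}},$$

$$biR_2=\frac{TP_{21}}{TP_{21}+FN_{21}}=\frac{p_{\cdot 11}}{p_{\cdot\cdot 1}},$$

$$biF_2=\frac{2\,biP_2\times biR_2}{biP_2+biR_2}=\frac{2p_{\cdot 11}}{p_{\cdot 1\cdot}+p_{\cdot\cdot 1}}. \tag{2}$$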
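Formulas (1) and (2) can be tried on the example data by collapsing the six-class confusion matrices of Tables B2 and B3 into malignant (MM, BCC) versus benign. The following is a minimal sketch; the function name `binary_f1` and the row/column layout (rows are predicted classes, columns are true classes) are our own conventions.

```python
# Rows: predicted class; columns: true class (MM, BCC, Nevus, SK, H/H, SL).
FRCNN = [
    [327, 6, 42, 21, 3, 0],
    [9, 108, 6, 9, 0, 0],
    [48, 12, 967, 36, 18, 0],
    [21, 6, 30, 223, 0, 3],
    [0, 0, 3, 0, 57, 0],
    [3, 0, 0, 0, 0, 42],
]
BCD = [
    [340, 10, 131, 18, 9, 0],
    [12, 104, 11, 24, 1, 1],
    [22, 3, 823, 17, 6, 0],
    [26, 14, 68, 225, 1, 7],
    [3, 1, 11, 0, 61, 0],
    [5, 0, 4, 5, 0, 37],
]
MALIGNANT = {0, 1}  # indices of MM and BCC


def binary_f1(table, positive):
    """2 * n_{1.1} / (n_{1..} + n_{..1}) after merging the positive classes."""
    tp = sum(table[i][k] for i in positive for k in positive)
    predicted_positive = sum(sum(table[i]) for i in positive)
    truly_positive = sum(row[k] for row in table for k in positive)
    return 2 * tp / (predicted_positive + truly_positive)


print(round(binary_f1(FRCNN, MALIGNANT), 3))
print(round(binary_f1(BCD, MALIGNANT), 3))
```

Under these assumptions the two values agree with the biF point estimates reported for FRCNN and BCD in Table 5 (0.840 and 0.776).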

2.2 |. Micro-averaged F1-score

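When $r>2$, the micro-averaged precision ($miP$) and micro-averaged recall ($miR$) are obtained by pooling the per-class TP, FP, and FN over all classes. Because $\sum_{a}p_{a\cdot\cdot}=\sum_{a}p_{\cdot\cdot a}=1$, $miP$ and $miR$ for Test 1 can be written as

$$miP_1=\frac{\sum_{a=1}^{r}TP_{1a}}{\sum_{a=1}^{r}\left(TP_{1a}+FP_{1a}\right)}=\frac{\sum_{a=1}^{r}p_{a\cdot a}}{\sum_{a=1}^{r}p_{a\cdot\cdot}}=\sum_{a=1}^{r}p_{a\cdot a},$$

$$miR_1=\frac{\sum_{a=1}^{r}TP_{1a}}{\sum_{a=1}^{r}\left(TP_{1a}+FN_{1a}\right)}=\frac{\sum_{a=1}^{r}p_{a\cdot a}}{\sum_{a=1}^{r}p_{\cdot\cdot a}}=\sum_{a=1}^{r}p_{a\cdot a}.$$

Finally, as the harmonic mean of $miP_1$ and $miR_1$, we have the micro-averaged F1-score for Test 1, $miF_1$:

$$miF_1=\frac{2\,miP_1\times miR_1}{miP_1+miR_1}=\sum_{a=1}^{r}p_{a\cdot a}. \tag{3}$$

Similarly, the micro-averaged F1-score for Test 2, $miF_2$, is

$$miP_2=\frac{\sum_{a=1}^{r}TP_{2a}}{\sum_{a=1}^{r}\left(TP_{2a}+FP_{2a}\right)}=\frac{\sum_{a=1}^{r}p_{\cdot aa}}{\sum_{a=1}^{r}p_{\cdot a\cdot}}=\sum_{a=1}^{r}p_{\cdot aa},$$

$$miR_2=\frac{\sum_{a=1}^{r}TP_{2a}}{\sum_{a=1}^{r}\left(TP_{2a}+FN_{2a}\right)}=\frac{\sum_{a=1}^{r}p_{\cdot aa}}{\sum_{a=1}^{r}p_{\cdot\cdot a}}=\sum_{a=1}^{r}p_{\cdot aa},$$

$$miF_2=\frac{2\,miP_2\times miR_2}{miP_2+miR_2}=\sum_{a=1}^{r}p_{\cdot aa}. \tag{4}$$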
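A small numerical check makes the collapse in (3) and (4) concrete: pooled precision, pooled recall, and their harmonic mean all equal the proportion of correctly classified subjects. The sketch below uses the Test 1 marginal table of Scenario 2 (Table 2) as counts out of 500, tallied by us; `micro_scores` is a hypothetical helper name.

```python
# m[pred][true] for Test 1 in Scenario 2 of Table 2, as counts out of 500.
M = [
    [180, 20, 20],
    [60, 60, 20],
    [60, 20, 60],
]


def micro_scores(m):
    """Micro-averaged precision, recall and F1 from pooled TP, FP and FN."""
    r = len(m)
    tp = sum(m[a][a] for a in range(r))
    fp = sum(m[a][k] for a in range(r) for k in range(r) if k != a)
    fn = sum(m[i][a] for a in range(r) for i in range(r) if i != a)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = sum(M[a][a] for a in range(3)) / sum(map(sum, M))
print(micro_scores(M), accuracy)
```

Every misclassified subject is counted once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the pooled FP and FN totals coincide.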

2.3 |. Macro-averaged F1-score

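When $r>2$, to define the macro-averaged F1-score for Test 1, $maF_1$, first consider the following precision ($P_{1a}$) and recall ($R_{1a}$) within each class, $a=1,\ldots,r$:

$$P_{1a}=\frac{TP_{1a}}{TP_{1a}+FP_{1a}}=\frac{p_{a\cdot a}}{p_{a\cdot\cdot}}, \tag{5}$$

$$R_{1a}=\frac{TP_{1a}}{TP_{1a}+FN_{1a}}=\frac{p_{a\cdot a}}{p_{\cdot\cdot a}}. \tag{6}$$

The F1-score within each class for Test 1, $F_{1a}$, is defined as the harmonic mean of $P_{1a}$ and $R_{1a}$, that is,

$$F_{1a}=\frac{2P_{1a}\times R_{1a}}{P_{1a}+R_{1a}}=\frac{2p_{a\cdot a}}{p_{a\cdot\cdot}+p_{\cdot\cdot a}}.$$

The macro-averaged F1-score for Test 1, $maF_1$, is the simple arithmetic mean of $F_{1a}$:

$$maF_1=\frac{1}{r}\sum_{a=1}^{r}F_{1a}=\frac{2}{r}\sum_{a=1}^{r}\frac{p_{a\cdot a}}{p_{a\cdot\cdot}+p_{\cdot\cdot a}}. \tag{7}$$

Similarly, the macro-averaged F1-score for Test 2, $maF_2$, is

$$P_{2a}=\frac{TP_{2a}}{TP_{2a}+FP_{2a}}=\frac{p_{\cdot aa}}{p_{\cdot a\cdot}},$$

$$R_{2a}=\frac{TP_{2a}}{TP_{2a}+FN_{2a}}=\frac{p_{\cdot aa}}{p_{\cdot\cdot a}},$$

$$F_{2a}=\frac{2P_{2a}\times R_{2a}}{P_{2a}+R_{2a}}=\frac{2p_{\cdot aa}}{p_{\cdot a\cdot}+p_{\cdot\cdot a}},$$

$$maF_2=\frac{1}{r}\sum_{a=1}^{r}F_{2a}=\frac{2}{r}\sum_{a=1}^{r}\frac{p_{\cdot aa}}{p_{\cdot a\cdot}+p_{\cdot\cdot a}}. \tag{8}$$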

2.4 |. Alternate definition of macro-averaged F1-score

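Sokolova and Lapalme6 gave an alternative definition of the macro-averaged F1-score. First, the macro-averaged precision ($maP$) and macro-averaged recall ($maR$) for Test 1 are defined as the simple arithmetic means of the within-class precision and within-class recall in (5) and (6), respectively:

$$maP_1=\frac{1}{r}\sum_{a=1}^{r}P_{1a}=\frac{1}{r}\sum_{a=1}^{r}\frac{p_{a\cdot a}}{p_{a\cdot\cdot}},$$

$$maR_1=\frac{1}{r}\sum_{a=1}^{r}R_{1a}=\frac{1}{r}\sum_{a=1}^{r}\frac{p_{a\cdot a}}{p_{\cdot\cdot a}}.$$

The alternate definition of the macro-averaged F1-score for Test 1 ($maF_1^*$) is the harmonic mean of these quantities:

$$maF_1^*=\frac{2\,maP_1\times maR_1}{maP_1+maR_1}. \tag{9}$$

Similarly, the alternate definition of the macro-averaged F1-score for Test 2 ($maF_2^*$) is

$$maP_2=\frac{1}{r}\sum_{a=1}^{r}P_{2a}=\frac{1}{r}\sum_{a=1}^{r}\frac{p_{\cdot aa}}{p_{\cdot a\cdot}},$$

$$maR_2=\frac{1}{r}\sum_{a=1}^{r}R_{2a}=\frac{1}{r}\sum_{a=1}^{r}\frac{p_{\cdot aa}}{p_{\cdot\cdot a}},$$

$$maF_2^*=\frac{2\,maP_2\times maR_2}{maP_2+maR_2}. \tag{10}$$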
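The two macro-averaged definitions generally give different values when the classes are imbalanced. As a rough sketch, the code below evaluates (7) and (9) on the Test 1 marginal table of Scenario 2 (Table 2), written as counts out of 500 that we tallied ourselves; `macro_f1` and `macro_f1_star` are hypothetical helper names.

```python
# m[pred][true] for Test 1 in Scenario 2 of Table 2, as counts out of 500.
M = [
    [180, 20, 20],
    [60, 60, 20],
    [60, 20, 60],
]


def column_sum(m, a):
    """Total number of subjects whose true class is a."""
    return sum(row[a] for row in m)


def macro_f1(m):
    """Arithmetic mean of per-class F1-scores, as in (7)."""
    r = len(m)
    per_class = [2 * m[a][a] / (sum(m[a]) + column_sum(m, a)) for a in range(r)]
    return sum(per_class) / r


def macro_f1_star(m):
    """Harmonic mean of macro-averaged precision and recall, as in (9)."""
    r = len(m)
    ma_p = sum(m[a][a] / sum(m[a]) for a in range(r)) / r
    ma_r = sum(m[a][a] / column_sum(m, a) for a in range(r)) / r
    return 2 * ma_p * ma_r / (ma_p + ma_r)


print(round(macro_f1(M), 2), round(macro_f1_star(M), 2))
```

With these inputs the sketch gives 0.56 and 0.58, the values listed for $maF_1$ and $maF_1^*$ under Scenario 2 in Table 2.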

3 |. PROPOSED HYPOTHESIS TESTING PROCEDURE

In this section, we derive the test statistics for comparing two F1-scores ($biF_1$ and $biF_2$; $miF_1$ and $miF_2$; $maF_1$ and $maF_2$; and $maF_1^*$ and $maF_2^*$). We assume that the observed frequencies $n_{ijk}$, for $1\le i\le r$, $1\le j\le r$, $1\le k\le r$, have a multinomial distribution with overall sample size $N=\sum_{i,j,k}n_{ijk}$ and probabilities $p=\left(p_{111},\ldots,p_{1r1},\ldots,p_{rr1},\ldots,p_{rrr}\right)^T$, where $i$ indicates the class of Test 1, $j$ indicates the class of Test 2, $k$ indicates the true condition, and "$T$" represents the transpose. That is,

$$\left(n_{111},n_{121},\ldots,n_{rrr}\right)\sim \text{Multinomial}(N;p).$$

The maximum likelihood estimate (MLE) of $p_{ijk}$ is $\hat p_{ijk}=n_{ijk}/N$. By the invariance property of MLEs, the maximum likelihood estimates of $biF$, $miF$, $maF$, $maF^*$, and the other quantities in the previous section can be obtained by replacing $p_{ijk}$ with $\hat p_{ijk}$.

3.1 |. Test statistic for comparing two biFs

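Let $biF=\left(biF_1,biF_2\right)^T$ be a vector whose components are the $biF$s of the two medical tests, and let $\widehat{biF}$ be the MLE of $biF$. $\widehat{biF}$ can be obtained by replacing $p_{ijk}$ with their MLEs in (1) and (2):

$$\widehat{biF}_1=\frac{2\hat p_{1\cdot 1}}{\hat p_{1\cdot\cdot}+\hat p_{\cdot\cdot 1}}=\frac{2n_{1\cdot 1}}{n_{1\cdot\cdot}+n_{\cdot\cdot 1}},\qquad \widehat{biF}_2=\frac{2\hat p_{\cdot 11}}{\hat p_{\cdot 1\cdot}+\hat p_{\cdot\cdot 1}}=\frac{2n_{\cdot 11}}{n_{\cdot 1\cdot}+n_{\cdot\cdot 1}}.$$

Using the delta-method and the multivariate central limit theorem, we have

$$\sqrt{N}\left(\widehat{biF}-biF\right)\ \dot\sim\ \text{Normal}\left(0,\ \left(\frac{\partial\,biF}{\partial p}\right)^T\left(\text{diag}(p)-pp^T\right)\left(\frac{\partial\,biF}{\partial p}\right)\right),$$

where $\text{diag}(p)$ is an $r^3\times r^3$ diagonal matrix whose diagonal elements are the elements of $p$, and "$\dot\sim$" represents "approximately distributed as". The Wald statistic for testing $H_0: biF_1=biF_2$ vs. $H_1: biF_1\neq biF_2$, therefore, is

$$T_{biF}^W=\frac{\left(\widehat{biF}_1-\widehat{biF}_2\right)^2}{\widehat{Var}\left(biF_d\right)},$$

where $\widehat{Var}\left(biF_d\right)$ is the variance of $\widehat{biF}_1-\widehat{biF}_2$ with $p_{ijk}$ replaced by $\hat p_{ijk}$. The derivation of the variance of $\widehat{biF}_1-\widehat{biF}_2$ appears in Appendix A.1. Under the null hypothesis, the test statistic is asymptotically distributed as a $\chi^2$ distribution with one degree of freedom.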

As a side note, the confidence interval of $biF$ for each test can be derived in the same way. The $(1-\alpha)\times 100\%$ confidence intervals of $biF_1$ and $biF_2$ are

$$\widehat{biF}_1\pm Z_{1-\alpha/2}\times\sqrt{\widehat{Var}\left(biF_1\right)},$$

$$\widehat{biF}_2\pm Z_{1-\alpha/2}\times\sqrt{\widehat{Var}\left(biF_2\right)},$$

where $Z_p$ denotes the $100p$-th percentile of the standard normal distribution, and $\widehat{Var}\left(biF_1\right)$ and $\widehat{Var}\left(biF_2\right)$ are the variances of $\widehat{biF}_1$ and $\widehat{biF}_2$ with $p_{ijk}$ replaced by $\hat p_{ijk}$. These simple formulas based on the multinomial distribution have not been proposed previously. Wang et al. proposed a confidence interval of $biF$ based on the beta prime distribution and associated calculations using the bootstrap method.15,18
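To make the interval concrete, here is a minimal sketch that plugs counts into the variance of $\widehat{biF}_1$ given in Appendix A.1, $\left\{4p_{1\cdot 1}(1-biF_1)^2+(p_{1\cdot 2}+p_{2\cdot 1})biF_1^2\right\}/\left\{(p_{1\cdot\cdot}+p_{\cdot\cdot 1})^2N\right\}$. The function name is ours, and the input counts (450 true positives, 81 false positives, 90 false negatives out of 2000) are our own tally of the FRCNN malignant-versus-benign results from Table B2, so they should be treated as an assumption.

```python
from math import sqrt
from statistics import NormalDist


def bif_confidence_interval(tp, fp, fn, n_total, level=0.95):
    """Wald-type interval for a binary F1-score from TP, FP, FN counts."""
    p_tp, p_fp, p_fn = tp / n_total, fp / n_total, fn / n_total
    denom = 2 * p_tp + p_fp + p_fn            # p_{1..} + p_{..1}
    f = 2 * p_tp / denom                      # point estimate of biF
    var = (4 * p_tp * (1 - f) ** 2 + (p_fp + p_fn) * f ** 2) / denom ** 2 / n_total
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sqrt(var)
    return f, f - half_width, f + half_width


estimate, lower, upper = bif_confidence_interval(450, 81, 90, 2000)
print(round(estimate, 3), round(lower, 3), round(upper, 3))
```

True negatives enter only through the total sample size, which reflects the fact that the F1-score ignores them.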

For the score statistic, we consider the MLE of $p_{ijk}$ under the null hypothesis, which can be obtained, for example, by applying the Newton-Raphson method to the log-likelihood equations. The score statistic for testing $H_0: biF_1=biF_2$ vs. $H_1: biF_1\neq biF_2$ is

$$T_{biF}^S=\frac{\left(\widehat{biF}_1-\widehat{biF}_2\right)^2}{\widetilde{Var}\left(biF_d\right)},$$

where $\widetilde{Var}\left(biF_d\right)$ is the variance of $\widehat{biF}_1-\widehat{biF}_2$ with $p_{ijk}$ replaced by $\tilde p_{ijk}$, the MLE of $p_{ijk}$ under the null hypothesis.

3.2 |. Test statistic for comparing two miFs

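As shown in (3) and (4), $miF_1=\sum_{a=1}^{r}p_{a\cdot a}$ and $miF_2=\sum_{a=1}^{r}p_{\cdot aa}$, and the MLEs of $miF_1$ and $miF_2$ are

$$\widehat{miF}_1=\sum_{a=1}^{r}\hat p_{a\cdot a}=\sum_{a=1}^{r}\frac{n_{a\cdot a}}{N},\qquad \widehat{miF}_2=\sum_{a=1}^{r}\hat p_{\cdot aa}=\sum_{a=1}^{r}\frac{n_{\cdot aa}}{N}.$$

Again by the delta-method and the multivariate central limit theorem (Appendix A.2), the Wald statistic for testing $H_0: miF_1=miF_2$ versus $H_1: miF_1\neq miF_2$ is

$$T_{miF}^W=\frac{\left(\widehat{miF}_1-\widehat{miF}_2\right)^2}{\widehat{Var}\left(miF_d\right)},$$

where $\widehat{Var}\left(miF_d\right)$ is the variance of $\widehat{miF}_1-\widehat{miF}_2$ with $p_{ijk}$ replaced by $\hat p_{ijk}$. Under the null hypothesis, the test statistic is asymptotically distributed as a $\chi^2$ distribution with one degree of freedom.

To develop the score statistic, we again consider the MLE of $p_{ijk}$ under the null hypothesis, as in the case of $biF$. The score statistic for testing $H_0: miF_1=miF_2$ versus $H_1: miF_1\neq miF_2$ is

$$T_{miF}^S=\frac{\left(\widehat{miF}_1-\widehat{miF}_2\right)^2}{\widetilde{Var}\left(miF_d\right)},$$

where $\widetilde{Var}\left(miF_d\right)$ is the variance of $\widehat{miF}_1-\widehat{miF}_2$ with $p_{ijk}$ replaced by $\tilde p_{ijk}$, the MLE of $p_{ijk}$ under the null hypothesis.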
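For $miF$ both statistics need only three counts: subjects classified correctly by both tests, by Test 1 only, and by Test 2 only. The sketch below uses the Appendix A.2 variance for the Wald statistic. For the score statistic it relies on our own observation, not a statement from the paper, that the constraint $miF_1=miF_2$ has a closed-form restricted MLE that equalizes the two discordant probabilities, which gives a McNemar-type statistic; the authors' code uses Newton-Raphson instead. The counts (1438, 286, 152 out of 2000) are our tally from Tables B1 to B3, and the function names are hypothetical.

```python
from math import erfc, sqrt


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2))


def mif_wald_and_score(n_both, n_only1, n_only2, n_total):
    """Wald and score statistics for H0: miF1 = miF2 in the paired design."""
    p_both = n_both / n_total
    mif1 = (n_both + n_only1) / n_total
    mif2 = (n_both + n_only2) / n_total
    a = mif1 * (1 - mif1)                     # A in Appendix A.2
    b = mif2 * (1 - mif2)                     # B in Appendix A.2
    c = p_both - mif1 * mif2                  # C in Appendix A.2
    wald = (mif1 - mif2) ** 2 / ((a + b - 2 * c) / n_total)
    # Assumed closed form under H0: discordant probabilities are pooled.
    score = (n_only1 - n_only2) ** 2 / (n_only1 + n_only2)
    return wald, score


wald, score = mif_wald_and_score(1438, 286, 152, 2000)
print(round(wald, 1), round(score, 1), chi2_sf_1df(wald) < 0.001)
```

Under these assumptions the two values round to 41.9 and 41.0, in line with the miF rows of Table 5.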

3.3 |. Test statistic for comparing two maFs

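The MLEs of $maF_1$ and $maF_2$ can be obtained by replacing $p_{a\cdot a}$, $p_{\cdot aa}$, $p_{a\cdot\cdot}$, $p_{\cdot a\cdot}$ and $p_{\cdot\cdot a}$ with their MLEs in (7) and (8):

$$\widehat{maF}_1=\frac{2}{r}\sum_{a=1}^{r}\frac{\hat p_{a\cdot a}}{\hat p_{a\cdot\cdot}+\hat p_{\cdot\cdot a}}=\frac{2}{r}\sum_{a=1}^{r}\frac{n_{a\cdot a}}{n_{a\cdot\cdot}+n_{\cdot\cdot a}},\qquad \widehat{maF}_2=\frac{2}{r}\sum_{a=1}^{r}\frac{\hat p_{\cdot aa}}{\hat p_{\cdot a\cdot}+\hat p_{\cdot\cdot a}}=\frac{2}{r}\sum_{a=1}^{r}\frac{n_{\cdot aa}}{n_{\cdot a\cdot}+n_{\cdot\cdot a}}.$$

Again by the delta-method and the multivariate central limit theorem (Appendix A.3), we have the Wald statistic for testing $H_0: maF_1=maF_2$ versus $H_1: maF_1\neq maF_2$ as

$$T_{maF}^W=\frac{\left(\widehat{maF}_1-\widehat{maF}_2\right)^2}{\widehat{Var}\left(maF_d\right)},$$

where $\widehat{Var}\left(maF_d\right)$ is the variance of $\widehat{maF}_1-\widehat{maF}_2$ with $p_{ijk}$ replaced by $\hat p_{ijk}$. Under the null hypothesis, the test statistic is asymptotically distributed as a $\chi^2$ distribution with one degree of freedom.

For the score statistic, we consider the MLE of $p_{ijk}$ under the null hypothesis, as in the cases of $biF$ and $miF$. The score statistic for testing $H_0: maF_1=maF_2$ versus $H_1: maF_1\neq maF_2$ is

$$T_{maF}^S=\frac{\left(\widehat{maF}_1-\widehat{maF}_2\right)^2}{\widetilde{Var}\left(maF_d\right)},$$

where $\widetilde{Var}\left(maF_d\right)$ is the variance of $\widehat{maF}_1-\widehat{maF}_2$ with $p_{ijk}$ replaced by $\tilde p_{ijk}$, the MLE of $p_{ijk}$ under the null hypothesis.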

3.4 |. Test statistic for comparing two maF*s

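To obtain the MLEs of $maF_1^*$ and $maF_2^*$, we first replace $p_{a\cdot a}$, $p_{\cdot aa}$, $p_{a\cdot\cdot}$, $p_{\cdot a\cdot}$ and $p_{\cdot\cdot a}$ with their MLEs to get the MLEs of $maP$ and $maR$, and then use these in (9) and (10):

$$\widehat{maF}_1^*=\frac{2\,\widehat{maP}_1\times\widehat{maR}_1}{\widehat{maP}_1+\widehat{maR}_1},\qquad \widehat{maF}_2^*=\frac{2\,\widehat{maP}_2\times\widehat{maR}_2}{\widehat{maP}_2+\widehat{maR}_2}.$$

Using the delta-method and the multivariate central limit theorem (Appendix A.4), we have the Wald statistic for testing $H_0: maF_1^*=maF_2^*$ versus $H_1: maF_1^*\neq maF_2^*$ as

$$T_{maF^*}^W=\frac{\left(\widehat{maF}_1^*-\widehat{maF}_2^*\right)^2}{\widehat{Var}\left(maF_d^*\right)}.$$

Again, to get $\widehat{Var}\left(maF_d^*\right)$, all components of the variance of $\widehat{maF}_1^*-\widehat{maF}_2^*$ are replaced by their respective MLEs. Under the null hypothesis, the test statistic is asymptotically distributed as a $\chi^2$ distribution with one degree of freedom.

For the score statistic, we consider the MLE of $p_{ijk}$ under the null hypothesis, as in the cases of $biF$, $miF$, and $maF$. The score statistic for testing $H_0: maF_1^*=maF_2^*$ versus $H_1: maF_1^*\neq maF_2^*$ is

$$T_{maF^*}^S=\frac{\left(\widehat{maF}_1^*-\widehat{maF}_2^*\right)^2}{\widetilde{Var}\left(maF_d^*\right)},$$

where $\widetilde{Var}\left(maF_d^*\right)$ is the variance of $\widehat{maF}_1^*-\widehat{maF}_2^*$ with $p_{ijk}$ replaced by $\tilde p_{ijk}$, the MLE of $p_{ijk}$ under the null hypothesis.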
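Because the closed-form variances for $maF$ and $maF^*$ are lengthy, a numerical version of the delta-method can serve as a cross-check. The sketch below is not the authors' implementation: it swaps the analytic derivatives of Appendix A for central finite differences and then evaluates $g^T\left(\text{diag}(\hat p)-\hat p\hat p^T\right)g/N$ for the gradient $g$ of the score difference. The data are the Scenario 3 cell probabilities of Table 2 multiplied by 300, treated as if they were observed counts; all function names are ours.

```python
from math import erfc, sqrt

# n[k][i][j]: k = true class, i = Test 1, j = Test 2 (Scenario 3, N = 300).
SCENARIO_3 = [
    [[30, 15, 15], [10, 5, 5], [10, 5, 5]],
    [[5, 10, 5], [15, 30, 15], [5, 10, 5]],
    [[5, 5, 10], [5, 5, 10], [15, 15, 30]],
]
COUNTS = [c for table in SCENARIO_3 for row in table for c in row]


def confusion(p, r, test):
    """m[pred][true] for one test from the flat vector p indexed by (k, i, j)."""
    m = [[0.0] * r for _ in range(r)]
    for k in range(r):
        for i in range(r):
            for j in range(r):
                m[i if test == 1 else j][k] += p[(k * r + i) * r + j]
    return m


def micro_f1(m):
    return sum(m[a][a] for a in range(len(m))) / sum(map(sum, m))


def macro_f1_star(m):
    r = len(m)
    ma_p = sum(m[a][a] / sum(m[a]) for a in range(r)) / r
    ma_r = sum(m[a][a] / sum(row[a] for row in m) for a in range(r)) / r
    return 2 * ma_p * ma_r / (ma_p + ma_r)


def wald_paired(counts, r, score, h=1e-6):
    """Wald statistic and p-value for H0: score(Test 1) = score(Test 2)."""
    n = sum(counts)
    p = [c / n for c in counts]

    def diff(q):
        return score(confusion(q, r, 1)) - score(confusion(q, r, 2))

    grad = []
    for idx in range(len(p)):
        up, down = p[:], p[:]
        up[idx] += h
        down[idx] -= h
        grad.append((diff(up) - diff(down)) / (2 * h))
    mean = sum(pi * g for pi, g in zip(p, grad))
    # g' (diag(p) - p p') g = sum(p g^2) - (sum(p g))^2
    var = (sum(pi * g * g for pi, g in zip(p, grad)) - mean ** 2) / n
    stat = diff(p) ** 2 / var
    return stat, erfc(sqrt(stat / 2))


print(wald_paired(COUNTS, 3, micro_f1))
print(wald_paired(COUNTS, 3, macro_f1_star))
```

For `micro_f1` the result can be compared with the Appendix A.2 closed form, which for these inputs gives $0.1^2/(0.49/300)\approx 6.12$. Any smooth score function of the confusion matrix can be passed in the same way.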

4 |. SIMULATION

4.1 |. Simulation setup

A simulation study was conducted to evaluate the performance of the test statistics proposed in Section 3. We set r=3 (class 1, 2, 3), and generated data according to the multinomial distributions with p shown in Table 2. Classes 2 and 3 were combined when calculating biF. The total sample size, N, was set to 100, 300, 500, and 1,000. The nominal type I error rate was set to 0.05 (two-sided test). We used the empirical type I error rate and empirical power as performance measures. For each combination of the scenario and sample size, we performed 100,000 repeated simulations.
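A reduced version of this setup can be sketched as follows, here for the Wald statistic for $miF$ under Scenario 1 only. The code is an illustration under our own assumptions and not the authors' simulation program: it uses far fewer replicates than the 100,000 in the study, takes the Scenario 1 cell probabilities of Table 2 as weights out of 300, and writes the Appendix A.2 variance in the algebraically equivalent form $\left\{p_{10}+p_{01}-(p_{10}-p_{01})^2\right\}/N$, where $p_{10}$ and $p_{01}$ are the proportions classified correctly by only Test 1 and only Test 2.

```python
import random

# weights[k][i][j] out of 300: k = true class, i = Test 1, j = Test 2.
SCENARIO_1 = [
    [[40, 10, 10], [10, 5, 5], [10, 5, 5]],
    [[5, 10, 5], [10, 40, 10], [5, 10, 5]],
    [[5, 5, 10], [5, 5, 10], [10, 10, 40]],
]
CRITICAL_VALUE = 3.841459  # 95th percentile of chi-square with 1 df


def simulate_type1(weights, n_total, reps, seed=1):
    """Empirical rejection rate of the miF Wald test under the given table."""
    rng = random.Random(seed)
    r = len(weights)
    cells = [(k, i, j) for k in range(r) for i in range(r) for j in range(r)]
    w = [weights[k][i][j] for k, i, j in cells]
    rejections = 0
    for _ in range(reps):
        only1 = only2 = 0
        for k, i, j in rng.choices(cells, weights=w, k=n_total):
            if i == k and j != k:
                only1 += 1          # only Test 1 is correct
            elif j == k and i != k:
                only2 += 1          # only Test 2 is correct
        p10, p01 = only1 / n_total, only2 / n_total
        var = (p10 + p01 - (p10 - p01) ** 2) / n_total
        if var > 0 and (p10 - p01) ** 2 / var > CRITICAL_VALUE:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print(simulate_type1(SCENARIO_1, n_total=300, reps=2000))
```

With only a few thousand replicates the estimate carries a Monte Carlo standard error of roughly 0.005, so it should be read as a rough check rather than a replication of Table 3.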

TABLE 2.

Simulation study: True cell probabilities.

Rows give the Test 1 class; columns give the Test 2 class within each true class.

Scenario 1

| Test 1 | True class 1: Test 2 = 1 | 2 | 3 | True class 2: Test 2 = 1 | 2 | 3 | True class 3: Test 2 = 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 40/300 | 10/300 | 10/300 | 5/300 | 10/300 | 5/300 | 5/300 | 5/300 | 10/300 |
| 2 | 10/300 | 5/300 | 5/300 | 10/300 | 40/300 | 10/300 | 5/300 | 5/300 | 10/300 |
| 3 | 10/300 | 5/300 | 5/300 | 5/300 | 10/300 | 5/300 | 10/300 | 10/300 | 40/300 |

$biF_1=miF_1=maF_1=maF_1^*=0.60$
$biF_2=miF_2=maF_2=maF_2^*=0.60$

Scenario 2

| Test 1 | True class 1: Test 2 = 1 | 2 | 3 | True class 2: Test 2 = 1 | 2 | 3 | True class 3: Test 2 = 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 120/500 | 30/500 | 30/500 | 5/500 | 10/500 | 5/500 | 5/500 | 5/500 | 10/500 |
| 2 | 30/500 | 15/500 | 15/500 | 10/500 | 40/500 | 10/500 | 5/500 | 5/500 | 10/500 |
| 3 | 30/500 | 15/500 | 15/500 | 5/500 | 10/500 | 5/500 | 10/500 | 10/500 | 40/500 |

$biF_1=0.69$, $miF_1=0.60$, $maF_1=0.56$, $maF_1^*=0.58$
$biF_2=0.69$, $miF_2=0.60$, $maF_2=0.56$, $maF_2^*=0.58$

Scenario 3

| Test 1 | True class 1: Test 2 = 1 | 2 | 3 | True class 2: Test 2 = 1 | 2 | 3 | True class 3: Test 2 = 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 30/300 | 15/300 | 15/300 | 5/300 | 10/300 | 5/300 | 5/300 | 5/300 | 10/300 |
| 2 | 10/300 | 5/300 | 5/300 | 15/300 | 30/300 | 15/300 | 5/300 | 5/300 | 10/300 |
| 3 | 10/300 | 5/300 | 5/300 | 5/300 | 10/300 | 5/300 | 15/300 | 15/300 | 30/300 |

$biF_1=miF_1=maF_1=maF_1^*=0.60$
$biF_2=miF_2=maF_2=maF_2^*=0.50$

Scenario 4

| Test 1 | True class 1: Test 2 = 1 | 2 | 3 | True class 2: Test 2 = 1 | 2 | 3 | True class 3: Test 2 = 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 90/500 | 45/500 | 45/500 | 5/500 | 10/500 | 5/500 | 5/500 | 5/500 | 10/500 |
| 2 | 30/500 | 15/500 | 15/500 | 15/500 | 30/500 | 15/500 | 5/500 | 5/500 | 10/500 |
| 3 | 30/500 | 15/500 | 15/500 | 5/500 | 10/500 | 5/500 | 15/500 | 15/500 | 30/500 |

$biF_1=0.69$, $miF_1=0.60$, $maF_1=0.56$, $maF_1^*=0.58$
$biF_2=0.60$, $miF_2=0.50$, $maF_2=0.47$, $maF_2^*=0.49$

Scenarios 1 and 2 are set up to evaluate the empirical type I error rate of the proposed test statistics, while scenarios 3 and 4 are designed to assess their empirical power. In scenario 1, the true classes 1, 2, and 3 have the same probability (1/3), and the recalls and precisions within each class are equal in the two tests (60%). Thus, $maR_1=maR_2=maP_1=maP_2=0.60$, and $F_{1a}=F_{2a}=0.60$ for each class $a=1,2,3$, so that $maF_1=maF_2=maF_1^*=maF_2^*=0.60$. Because classes 2 and 3 are combined to calculate $biF$, $biF_1=F_{11}=0.60$ and $biF_2=F_{21}=0.60$. Also, $p_{a\cdot a}=p_{\cdot aa}=0.20$ for each class $a=1,2,3$, and $miF_1=miF_2=0.60$.

In scenario 2, the true class 1 has a higher probability than the others (60% vs. 20%), and the performances of the two tests are equal: $biF_1=biF_2=0.69$, $miF_1=miF_2=0.60$, $maF_1=maF_2=0.56$, and $maF_1^*=maF_2^*=0.58$. Although the conditional distributions of the test results given the true class in scenario 2 are the same as those in scenario 1, the values of $biF$, $maF$ and $maF^*$ differ between the scenarios because $TP_a/(TP_a+FP_a)$ is large for class 1 and, conversely, relatively small for classes 2 and 3. In contrast, $miF$ in scenario 2 is the same as that in scenario 1 because the within-class recalls $p_{a\cdot a}/p_{\cdot\cdot a}$ and $p_{\cdot aa}/p_{\cdot\cdot a}$ are equal to 0.60 for every class in both scenarios, so that $\sum_a p_{a\cdot a}$ and $\sum_a p_{\cdot aa}$ remain 0.60.

The true classes 1, 2, and 3 have the same probability (1/3) in scenario 3. However, $maR$ and $maP$ of Test 2 are lower than those of Test 1 (60% vs. 50%), $F_{2a}$ is lower than $F_{1a}$ (60% vs. 50%), and $p_{\cdot aa}$ is lower than $p_{a\cdot a}$ for each class $a=1,2,3$ (20% vs. 17%). Therefore, $biF_1=miF_1=maF_1=maF_1^*=0.60$, whereas $biF_2=miF_2=maF_2=maF_2^*=0.50$.

In scenario 4, the true class 1 has a higher probability than the others (60% vs. 20%), as in scenario 2. However, the performances of the two tests are different: $biF_1=0.69$ versus $biF_2=0.60$, $miF_1=0.60$ versus $miF_2=0.50$, $maF_1=0.56$ versus $maF_2=0.47$, and $maF_1^*=0.58$ versus $maF_2^*=0.49$.

4.2 |. Simulation result

Table 3 shows the empirical type I error rates of the proposed tests for scenarios 1 and 2. The empirical type I error rates for both test statistics were close to the nominal type I error rate of 0.05 when the sample size was large (300, 500, 1000). When $N$ was relatively small (100), the empirical type I error rates tended to be slightly larger than 0.05, especially for the Wald statistics. In contrast, the empirical type I error rates of the score statistics were close to the nominal type I error rate of 0.05 for all sample sizes. Table 4 shows the empirical power of the proposed tests for scenarios 3 and 4. As shown in Table 4, the empirical power increased with the sample size. The empirical powers of the Wald statistics and score statistics were similar, especially when the sample size was large.

TABLE 3.

Simulation study: Empirical type I error rates.

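| Scenario | $N$ | $T_{biF}^W$ | $T_{biF}^S$ | $T_{miF}^W$ | $T_{miF}^S$ | $T_{maF}^W$ | $T_{maF}^S$ | $T_{maF^*}^W$ | $T_{maF^*}^S$ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 100 | 0.057 | 0.050 | 0.053 | 0.049 | 0.055 | 0.051 | 0.057 | 0.053 |
| 1 | 300 | 0.052 | 0.050 | 0.051 | 0.050 | 0.052 | 0.051 | 0.052 | 0.051 |
| 1 | 500 | 0.051 | 0.050 | 0.050 | 0.050 | 0.051 | 0.050 | 0.051 | 0.050 |
| 1 | 1000 | 0.050 | 0.049 | 0.051 | 0.050 | 0.051 | 0.050 | 0.051 | 0.051 |
| 2 | 100 | 0.052 | 0.049 | 0.054 | 0.049 | 0.058 | 0.053 | 0.061 | 0.055 |
| 2 | 300 | 0.052 | 0.050 | 0.051 | 0.050 | 0.054 | 0.052 | 0.054 | 0.052 |
| 2 | 500 | 0.051 | 0.050 | 0.051 | 0.050 | 0.051 | 0.050 | 0.052 | 0.051 |
| 2 | 1000 | 0.050 | 0.050 | 0.051 | 0.051 | 0.052 | 0.051 | 0.051 | 0.051 |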

TABLE 4.

Simulation study: Empirical power.

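| Scenario | $N$ | $T_{biF}^W$ | $T_{biF}^S$ | $T_{miF}^W$ | $T_{miF}^S$ | $T_{maF}^W$ | $T_{maF}^S$ | $T_{maF^*}^W$ | $T_{maF^*}^S$ |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 100 | 0.192 | 0.174 | 0.304 | 0.289 | 0.309 | 0.297 | 0.310 | 0.300 |
| 3 | 300 | 0.438 | 0.429 | 0.694 | 0.689 | 0.696 | 0.692 | 0.696 | 0.692 |
| 3 | 500 | 0.641 | 0.635 | 0.890 | 0.888 | 0.889 | 0.888 | 0.889 | 0.888 |
| 3 | 1000 | 0.905 | 0.904 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 |
| 4 | 100 | 0.235 | 0.226 | 0.305 | 0.291 | 0.291 | 0.278 | 0.271 | 0.256 |
| 4 | 300 | 0.560 | 0.556 | 0.695 | 0.690 | 0.662 | 0.657 | 0.615 | 0.609 |
| 4 | 500 | 0.773 | 0.771 | 0.889 | 0.887 | 0.865 | 0.863 | 0.826 | 0.824 |
| 4 | 1000 | 0.969 | 0.969 | 0.995 | 0.995 | 0.992 | 0.992 | 0.984 | 0.984 |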

5 |. EXAMPLE

We describe an application of the proposed hypothesis testing procedure to the motivating example.3 In this study, a skin cancer classification system using a faster, region-based convolutional neural network algorithm (FRCNN) for brown to black pigmented skin lesions was developed using a deep learning method. The target diseases were malignant tumors (malignant melanoma (MM) and basal cell carcinoma (BCC)) and benign tumors (nevus, seborrheic keratosis (SK), senile lentigo (SL) and hematoma/hemangioma (H/H)), and 2000 images were evaluated. The 2000 images were obtained by randomly sampling 200 images from the 666 images 10 times. For illustration, all images were treated as independent in this study. The data are shown in Tables B1–B3, Appendix B. Although images were classified into six categories (MM, BCC, nevus, SK, SL, H/H), accuracy was the only performance measure computed for the six-class classification data in the motivating example. Other performance measures (sensitivity, specificity, false negative rate, false positive rate, and positive predictive value) were calculated for two-class classification data after combining malignant tumors (MM and BCC) and benign tumors (nevus, SK, SL, and H/H). The accuracy of six-class classification by the FRCNN (86.2% ± 2.95%) was significantly higher than that of board-certified dermatologists (BCD) (79.5% ± 5.27%, p = 0.0081) and that of dermatologic trainees (75.1% ± 2.18%, p < 0.0001).

We compared the performance of skin cancer classification between the FRCNN and BCD using biFs,miFs, maFs, and maF*s with the proposed hypothesis testing procedures. miFs,maFs, and maF*s were calculated from six-class classification data, while biFs were calculated from two-class classification data (malignant tumors vs. benign tumors). The results are shown in Table 5. All biFs,miFs,maFs, and maF*s of six-class classification by FRCNN were significantly higher than those by BCD.

TABLE 5.

Example.

| Measure (statistic) | FRCNN | BCD | Difference | Test statistic | p-value |
|---|---|---|---|---|---|
| biF (Wald) | 0.840 | 0.776 | 0.064 | 19.4 | < 0.001 |
| biF (score) | 0.840 | 0.776 | 0.064 | 18.9 | < 0.001 |
| miF (Wald) | 0.862 | 0.795 | 0.067 | 41.9 | < 0.001 |
| miF (score) | 0.862 | 0.795 | 0.067 | 41.0 | < 0.001 |
| maF (Wald) | 0.846 | 0.768 | 0.078 | 26.2 | < 0.001 |
| maF (score) | 0.846 | 0.768 | 0.078 | 24.5 | < 0.001 |
| maF* (Wald) | 0.848 | 0.772 | 0.076 | 26.4 | < 0.001 |
| maF* (score) | 0.848 | 0.772 | 0.076 | 23.0 | < 0.001 |

6 |. DISCUSSION

We developed hypothesis testing procedures for comparing two F1-scores ($biF_1$ and $biF_2$, $miF_1$ and $miF_2$, $maF_1$ and $maF_2$, and $maF_1^*$ and $maF_2^*$) in a paired study design. Through the simulation study and the motivating example, we assessed the performance and feasibility of those testing procedures. We conclude that the method based on the score statistics ($T_{biF}^S$, $T_{miF}^S$, $T_{maF}^S$, and $T_{maF^*}^S$) is slightly better than the method based on the Wald statistics ($T_{biF}^W$, $T_{miF}^W$, $T_{maF}^W$, and $T_{maF^*}^W$) because its empirical type I error rate is closer to the nominal level even when the sample size is small. However, when multi-class classification is considered, the typical sample size is much larger than 100, and both approaches perform equally well in such scenarios.

We did not observe a substantial disparity in the empirical powers of the two approaches.

To our knowledge, hypothesis testing procedures for $biF$s, $miF$s, $maF$s, and $maF^*$s have not been studied by others, and only the point estimates of these scores were reported in most studies. Han et al19 applied a one-sample t-test for the comparison of $biF$s; however, this approach may not be appropriate because $biF$ is the harmonic mean of precision and recall, and the distribution of the difference between two $biF$s is unlikely to follow a Student's t-distribution.

A limitation of this work is that the proposed procedures are based on large-sample theory, and thus require a large sample size to provide strict control of the type I error rate. As future work, we are developing an exact test for comparing two $biF$s, $miF$s, $maF$s, and $maF^*$s based on the methods presented in this article.

R code for computing point estimates, Wald statistics, score statistics, and p-values for $biF$s, $miF$s, $maF$s, and $maF^*$s in the paired design is available on the lead author's GitHub page: https://github.com/kanaet52/f1score/blob/main/R/F1score_test.R. For two-sample designs, the F1-scores can be compared by setting the covariance part of the test statistic ($C_{biF}^W$, $C_{miF}^W$, $C_{maF}^W$, $C_{maF^*}^W$; see Appendix A) to 0. Note that for the score statistics, the MLE of $p_{ijk}$ under each null hypothesis is obtained in the code by applying the Newton-Raphson method to the log-likelihood equations.

ACKNOWLEDGEMENTS

The authors would like to thank Dr Shunichi Jinnai for providing the motivating example data. This research was partially supported by grant-in-aid for Scientific Research (C) No. 18K11195 and 21K11790 (Yamamoto), grant-in-aid for Research Activity start-up no. 21K21170 (Takahashi), and P30CA068485 Cancer Center Support grant (Koyama).

Funding information

Cancer Center Support, Grant/Award Number: P30CA068485; Japan Society for the Promotion of Science, Grant/Award Numbers: 18K11195, 21K11790, 21K21170

APPENDIX A. DERIVATION OF VARIANCES

A.1. Variance of biF

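The derivation of the variance of $\widehat{biF}_1-\widehat{biF}_2$ is as follows:

$$biF=\begin{pmatrix}biF_1\\ biF_2\end{pmatrix},\qquad \left(\frac{\partial\,biF}{\partial p}\right)^T pp^T\left(\frac{\partial\,biF}{\partial p}\right)=0,$$

$$\left(\frac{\partial\,biF}{\partial p}\right)^T\left(\text{diag}(p)-pp^T\right)\left(\frac{\partial\,biF}{\partial p}\right)=\left(\frac{\partial\,biF}{\partial p}\right)^T\text{diag}(p)\left(\frac{\partial\,biF}{\partial p}\right)=\begin{pmatrix}A_{biF}^W & C_{biF}^W\\ C_{biF}^W & B_{biF}^W\end{pmatrix},$$

with

$$A_{biF}^W=\frac{1}{\left(p_{1\cdot\cdot}+p_{\cdot\cdot 1}\right)^2}\left[p_{1\cdot 1}\left\{2\left(1-biF_1\right)\right\}^2+\left(p_{1\cdot 2}+p_{2\cdot 1}\right)biF_1^2\right],$$

$$B_{biF}^W=\frac{1}{\left(p_{\cdot 1\cdot}+p_{\cdot\cdot 1}\right)^2}\left[p_{\cdot 11}\left\{2\left(1-biF_2\right)\right\}^2+\left(p_{\cdot 12}+p_{\cdot 21}\right)biF_2^2\right],$$

$$C_{biF}^W=\frac{1}{\left(p_{1\cdot\cdot}+p_{\cdot\cdot 1}\right)\left(p_{\cdot 1\cdot}+p_{\cdot\cdot 1}\right)}\left[2^2p_{111}\left(1-biF_1\right)\left(1-biF_2\right)-2p_{121}\left(1-biF_1\right)biF_2-2p_{211}\,biF_1\left(1-biF_2\right)+\left(p_{221}+p_{112}\right)biF_1\,biF_2\right].$$

Therefore, the variance of $\widehat{biF}_1$ is

$$\frac{1}{N\left(p_{1\cdot\cdot}+p_{\cdot\cdot 1}\right)^2}\left[p_{1\cdot 1}\left\{2\left(1-biF_1\right)\right\}^2+\left(p_{1\cdot 2}+p_{2\cdot 1}\right)biF_1^2\right],$$

the variance of $\widehat{biF}_2$ is

$$\frac{1}{N\left(p_{\cdot 1\cdot}+p_{\cdot\cdot 1}\right)^2}\left[p_{\cdot 11}\left\{2\left(1-biF_2\right)\right\}^2+\left(p_{\cdot 12}+p_{\cdot 21}\right)biF_2^2\right],$$

and the variance of $\widehat{biF}_1-\widehat{biF}_2$ is

$$\left(A_{biF}^W+B_{biF}^W-2C_{biF}^W\right)/N.$$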
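The three components can be turned into a short function, sketched below under our own conventions. Index 0 denotes the positive class and index 1 the negative class, `n[i][j][k]` follows the (Test 1, Test 2, truth) order of the text, and the function name `bif_wald` is hypothetical. The example table is not observed data: it is the Scenario 3 cell probabilities of Table 2 multiplied by 300, with classes 2 and 3 merged, used as if they were counts.

```python
def bif_wald(n):
    """Point estimates and Wald statistic for H0: biF1 = biF2 (2 x 2 x 2 counts)."""
    total = sum(n[i][j][k] for i in range(2) for j in range(2) for k in range(2))

    def p(i, j, k):
        return n[i][j][k] / total

    p1_1 = p(0, 0, 0) + p(0, 1, 0)    # p_{1.1}: Test 1 positive, truly positive
    p1_2 = p(0, 0, 1) + p(0, 1, 1)    # p_{1.2}: Test 1 positive, truly negative
    p2_1 = p(1, 0, 0) + p(1, 1, 0)    # p_{2.1}: Test 1 negative, truly positive
    p_11 = p(0, 0, 0) + p(1, 0, 0)    # p_{.11}: Test 2 positive, truly positive
    p_12 = p(0, 0, 1) + p(1, 0, 1)    # p_{.12}
    p_21 = p(0, 1, 0) + p(1, 1, 0)    # p_{.21}
    d1 = 2 * p1_1 + p1_2 + p2_1       # p_{1..} + p_{..1}
    d2 = 2 * p_11 + p_12 + p_21       # p_{.1.} + p_{..1}
    f1, f2 = 2 * p1_1 / d1, 2 * p_11 / d2
    a = (4 * p1_1 * (1 - f1) ** 2 + (p1_2 + p2_1) * f1 ** 2) / d1 ** 2
    b = (4 * p_11 * (1 - f2) ** 2 + (p_12 + p_21) * f2 ** 2) / d2 ** 2
    c = (4 * p(0, 0, 0) * (1 - f1) * (1 - f2)
         - 2 * p(0, 1, 0) * (1 - f1) * f2
         - 2 * p(1, 0, 0) * f1 * (1 - f2)
         + (p(1, 1, 0) + p(0, 0, 1)) * f1 * f2) / (d1 * d2)
    variance = (a + b - 2 * c) / total
    return f1, f2, (f1 - f2) ** 2 / variance


# n[i][j][k]; inner pairs are [truly positive, truly negative].
EXAMPLE = [[[30, 10], [30, 30]], [[20, 40], [20, 120]]]
print(bif_wald(EXAMPLE))
```

Setting `c` to 0 in this sketch corresponds to the two-independent-sample variant mentioned in the Discussion.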

A.2. Variance of miF

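The derivation of the variance of $\widehat{miF}_1-\widehat{miF}_2$ is as follows:

$$miF=\begin{pmatrix}miF_1\\ miF_2\end{pmatrix},$$

$$\left(\frac{\partial\,miF}{\partial p}\right)^T pp^T\left(\frac{\partial\,miF}{\partial p}\right)=\begin{pmatrix}\left(\sum_{a=1}^{r}p_{a\cdot a}\right)^2 & \sum_{a=1}^{r}p_{a\cdot a}\sum_{a=1}^{r}p_{\cdot aa}\\ \sum_{a=1}^{r}p_{a\cdot a}\sum_{a=1}^{r}p_{\cdot aa} & \left(\sum_{a=1}^{r}p_{\cdot aa}\right)^2\end{pmatrix}=\begin{pmatrix}miF_1^2 & miF_1\,miF_2\\ miF_1\,miF_2 & miF_2^2\end{pmatrix},$$

$$\left(\frac{\partial\,miF}{\partial p}\right)^T\text{diag}(p)\left(\frac{\partial\,miF}{\partial p}\right)=\begin{pmatrix}\sum_{a=1}^{r}p_{a\cdot a} & \sum_{a=1}^{r}p_{aaa}\\ \sum_{a=1}^{r}p_{aaa} & \sum_{a=1}^{r}p_{\cdot aa}\end{pmatrix}=\begin{pmatrix}miF_1 & \sum_{a=1}^{r}p_{aaa}\\ \sum_{a=1}^{r}p_{aaa} & miF_2\end{pmatrix},$$

$$\left(\frac{\partial\,miF}{\partial p}\right)^T\left(\text{diag}(p)-pp^T\right)\left(\frac{\partial\,miF}{\partial p}\right)=\begin{pmatrix}A_{miF}^W & C_{miF}^W\\ C_{miF}^W & B_{miF}^W\end{pmatrix},$$

with

$$A_{miF}^W=miF_1\left(1-miF_1\right),$$

$$B_{miF}^W=miF_2\left(1-miF_2\right),$$

$$C_{miF}^W=\sum_{a=1}^{r}p_{aaa}-miF_1\,miF_2.$$

Therefore, the variance of $\widehat{miF}_1-\widehat{miF}_2$ is

$$\left(A_{miF}^W+B_{miF}^W-2C_{miF}^W\right)/N.$$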

A.3. Variance of maF

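The derivation of the variance of $\widehat{maF}_1-\widehat{maF}_2$ is as follows:

$$maF=\begin{pmatrix}maF_1\\ maF_2\end{pmatrix},\qquad \left(\frac{\partial\,maF}{\partial p}\right)^T pp^T\left(\frac{\partial\,maF}{\partial p}\right)=0,$$

$$\left(\frac{\partial\,maF}{\partial p}\right)^T\left(\text{diag}(p)-pp^T\right)\left(\frac{\partial\,maF}{\partial p}\right)=\left(\frac{\partial\,maF}{\partial p}\right)^T\text{diag}(p)\left(\frac{\partial\,maF}{\partial p}\right)=\frac{1}{r^2}\begin{pmatrix}A_{maF}^W & C_{maF}^W\\ C_{maF}^W & B_{maF}^W\end{pmatrix},$$

with

$$A_{maF}^W=\sum_{a=1}^{r}p_{a\cdot a}\left\{\frac{2\left(1-F_{1a}\right)}{p_{a\cdot\cdot}+p_{\cdot\cdot a}}\right\}^2+\sum_{a=1}^{r}\sum_{b\neq a}p_{a\cdot b}\left\{\frac{F_{1a}}{p_{a\cdot\cdot}+p_{\cdot\cdot a}}+\frac{F_{1b}}{p_{b\cdot\cdot}+p_{\cdot\cdot b}}\right\}^2,$$

$$B_{maF}^W=\sum_{a=1}^{r}p_{\cdot aa}\left\{\frac{2\left(1-F_{2a}\right)}{p_{\cdot a\cdot}+p_{\cdot\cdot a}}\right\}^2+\sum_{a=1}^{r}\sum_{b\neq a}p_{\cdot ab}\left\{\frac{F_{2a}}{p_{\cdot a\cdot}+p_{\cdot\cdot a}}+\frac{F_{2b}}{p_{\cdot b\cdot}+p_{\cdot\cdot b}}\right\}^2,$$

$$\begin{aligned}C_{maF}^W={}&\sum_{a=1}^{r}p_{aaa}\frac{2^2\left(1-F_{1a}\right)\left(1-F_{2a}\right)}{\left(p_{a\cdot\cdot}+p_{\cdot\cdot a}\right)\left(p_{\cdot a\cdot}+p_{\cdot\cdot a}\right)}\\&-\sum_{a=1}^{r}\sum_{b\neq a}\left[p_{aba}\frac{2\left(1-F_{1a}\right)}{p_{a\cdot\cdot}+p_{\cdot\cdot a}}\left(\frac{F_{2a}}{p_{\cdot a\cdot}+p_{\cdot\cdot a}}+\frac{F_{2b}}{p_{\cdot b\cdot}+p_{\cdot\cdot b}}\right)+p_{baa}\frac{2\left(1-F_{2a}\right)}{p_{\cdot a\cdot}+p_{\cdot\cdot a}}\left(\frac{F_{1a}}{p_{a\cdot\cdot}+p_{\cdot\cdot a}}+\frac{F_{1b}}{p_{b\cdot\cdot}+p_{\cdot\cdot b}}\right)\right]\\&+\sum_{a=1}^{r}\sum_{b\neq a}\sum_{c\neq a}p_{bca}\left(\frac{F_{1a}}{p_{a\cdot\cdot}+p_{\cdot\cdot a}}+\frac{F_{1b}}{p_{b\cdot\cdot}+p_{\cdot\cdot b}}\right)\left(\frac{F_{2a}}{p_{\cdot a\cdot}+p_{\cdot\cdot a}}+\frac{F_{2c}}{p_{\cdot c\cdot}+p_{\cdot\cdot c}}\right).\end{aligned}$$

Therefore, the variance of $\widehat{maF}_1-\widehat{maF}_2$ is

$$\frac{1}{r^2}\left(A_{maF}^W+B_{maF}^W-2C_{maF}^W\right)/N.$$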

A.4. Variance of maF*

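The derivation of the variance of $\widehat{maF}_1^*-\widehat{maF}_2^*$ is as follows:

$$maF^*=\begin{pmatrix}maF_1^*\\ maF_2^*\end{pmatrix},\qquad \left(\frac{\partial\,maF^*}{\partial p}\right)^T pp^T\left(\frac{\partial\,maF^*}{\partial p}\right)=0,$$

$$\left(\frac{\partial\,maF^*}{\partial p}\right)^T\left(\text{diag}(p)-pp^T\right)\left(\frac{\partial\,maF^*}{\partial p}\right)=\left(\frac{\partial\,maF^*}{\partial p}\right)^T\text{diag}(p)\left(\frac{\partial\,maF^*}{\partial p}\right)=\begin{pmatrix}A_{maF^*}^W & C_{maF^*}^W\\ C_{maF^*}^W & B_{maF^*}^W\end{pmatrix},$$

with

$$A_{maF^*}^W=\frac{2^2}{r^2\left(maP_1+maR_1\right)^4}\left[\sum_{a=1}^{r}p_{a\cdot a}\left\{\frac{\left(p_{a\cdot\cdot}-p_{a\cdot a}\right)maR_1^2}{p_{a\cdot\cdot}^2}+\frac{\left(p_{\cdot\cdot a}-p_{a\cdot a}\right)maP_1^2}{p_{\cdot\cdot a}^2}\right\}^2+\sum_{a=1}^{r}\sum_{b\neq a}p_{a\cdot b}\left\{\frac{p_{a\cdot a}\,maR_1^2}{p_{a\cdot\cdot}^2}+\frac{p_{b\cdot b}\,maP_1^2}{p_{\cdot\cdot b}^2}\right\}^2\right],$$

$$B_{maF^*}^W=\frac{2^2}{r^2\left(maP_2+maR_2\right)^4}\left[\sum_{a=1}^{r}p_{\cdot aa}\left\{\frac{\left(p_{\cdot a\cdot}-p_{\cdot aa}\right)maR_2^2}{p_{\cdot a\cdot}^2}+\frac{\left(p_{\cdot\cdot a}-p_{\cdot aa}\right)maP_2^2}{p_{\cdot\cdot a}^2}\right\}^2+\sum_{a=1}^{r}\sum_{b\neq a}p_{\cdot ab}\left\{\frac{p_{\cdot aa}\,maR_2^2}{p_{\cdot a\cdot}^2}+\frac{p_{\cdot bb}\,maP_2^2}{p_{\cdot\cdot b}^2}\right\}^2\right],$$

$$\begin{aligned}C_{maF^*}^W={}&\frac{2^2}{r^2\left(maP_1+maR_1\right)^2\left(maP_2+maR_2\right)^2}\times\Bigg[\sum_{a=1}^{r}p_{aaa}\left(\frac{p_{a\cdot\cdot}-p_{a\cdot a}}{p_{a\cdot\cdot}^2}maR_1^2+\frac{p_{\cdot\cdot a}-p_{a\cdot a}}{p_{\cdot\cdot a}^2}maP_1^2\right)\left(\frac{p_{\cdot a\cdot}-p_{\cdot aa}}{p_{\cdot a\cdot}^2}maR_2^2+\frac{p_{\cdot\cdot a}-p_{\cdot aa}}{p_{\cdot\cdot a}^2}maP_2^2\right)\\&-\sum_{a=1}^{r}\sum_{b\neq a}\Bigg\{p_{aba}\left(\frac{p_{a\cdot\cdot}-p_{a\cdot a}}{p_{a\cdot\cdot}^2}maR_1^2+\frac{p_{\cdot\cdot a}-p_{a\cdot a}}{p_{\cdot\cdot a}^2}maP_1^2\right)\left(\frac{p_{\cdot bb}}{p_{\cdot b\cdot}^2}maR_2^2+\frac{p_{\cdot aa}}{p_{\cdot\cdot a}^2}maP_2^2\right)\\&\qquad\qquad+p_{baa}\left(\frac{p_{\cdot a\cdot}-p_{\cdot aa}}{p_{\cdot a\cdot}^2}maR_2^2+\frac{p_{\cdot\cdot a}-p_{\cdot aa}}{p_{\cdot\cdot a}^2}maP_2^2\right)\left(\frac{p_{b\cdot b}}{p_{b\cdot\cdot}^2}maR_1^2+\frac{p_{a\cdot a}}{p_{\cdot\cdot a}^2}maP_1^2\right)\Bigg\}\\&+\sum_{a=1}^{r}\sum_{b\neq a}\sum_{c\neq a}p_{bca}\left(\frac{p_{b\cdot b}}{p_{b\cdot\cdot}^2}maR_1^2+\frac{p_{a\cdot a}}{p_{\cdot\cdot a}^2}maP_1^2\right)\left(\frac{p_{\cdot cc}}{p_{\cdot c\cdot}^2}maR_2^2+\frac{p_{\cdot aa}}{p_{\cdot\cdot a}^2}maP_2^2\right)\Bigg].\end{aligned}$$

Therefore, the variance of $\widehat{maF}_1^*-\widehat{maF}_2^*$ is

$$\left(A_{maF^*}^W+B_{maF^*}^W-2C_{maF^*}^W\right)/N.$$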

APPENDIX B. EXAMPLE DATA


TABLE B1.

Example data.

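Columns MM to SL give the true condition.

| FRCNN | BCD | MM | BCC | Nevus | SK | H/H | SL |
|---|---|---|---|---|---|---|---|
| MM | MM | 289 | 2 | 20 | 6 | 2 | 0 |
| MM | BCC | 9 | 2 | 2 | 5 | 0 | 0 |
| MM | Nevus | 10 | 0 | 14 | 0 | 0 | 0 |
| MM | SK | 14 | 2 | 4 | 10 | 0 | 0 |
| MM | H/H | 3 | 0 | 2 | 0 | 1 | 0 |
| MM | SL | 2 | 0 | 0 | 0 | 0 | 0 |
| BCC | MM | 6 | 6 | 0 | 1 | 0 | 0 |
| BCC | BCC | 2 | 95 | 0 | 6 | 0 | 0 |
| BCC | Nevus | 0 | 2 | 6 | 0 | 0 | 0 |
| BCC | SK | 1 | 5 | 0 | 2 | 0 | 0 |
| BCC | H/H | 0 | 0 | 0 | 0 | 0 | 0 |
| BCC | SL | 0 | 0 | 0 | 0 | 0 | 0 |
| Nevus | MM | 32 | 1 | 108 | 0 | 1 | 0 |
| Nevus | BCC | 1 | 6 | 8 | 1 | 0 | 0 |
| Nevus | Nevus | 11 | 1 | 789 | 8 | 1 | 0 |
| Nevus | SK | 3 | 3 | 50 | 27 | 0 | 0 |
| Nevus | H/H | 0 | 1 | 9 | 0 | 16 | 0 |
| Nevus | SL | 1 | 0 | 3 | 0 | 0 | 0 |
| SK | MM | 13 | 1 | 3 | 11 | 0 | 0 |
| SK | BCC | 0 | 1 | 1 | 12 | 0 | 0 |
| SK | Nevus | 1 | 0 | 11 | 9 | 0 | 0 |
| SK | SK | 7 | 4 | 14 | 186 | 0 | 1 |
| SK | H/H | 0 | 0 | 0 | 0 | 0 | 0 |
| SK | SL | 0 | 0 | 1 | 5 | 0 | 2 |
| H/H | MM | 0 | 0 | 0 | 0 | 6 | 0 |
| H/H | BCC | 0 | 0 | 0 | 0 | 1 | 0 |
| H/H | Nevus | 0 | 0 | 3 | 0 | 5 | 0 |
| H/H | SK | 0 | 0 | 0 | 0 | 1 | 0 |
| H/H | H/H | 0 | 0 | 0 | 0 | 44 | 0 |
| H/H | SL | 0 | 0 | 0 | 0 | 0 | 0 |
| SL | MM | 0 | 0 | 0 | 0 | 0 | 0 |
| SL | BCC | 0 | 0 | 0 | 0 | 0 | 1 |
| SL | Nevus | 0 | 0 | 0 | 0 | 0 | 0 |
| SL | SK | 1 | 0 | 0 | 0 | 0 | 6 |
| SL | H/H | 0 | 0 | 0 | 0 | 0 | 0 |
| SL | SL | 2 | 0 | 0 | 0 | 0 | 35 |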

TABLE B2.

Example data (FRCNN only).

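Columns MM to SL give the true condition.

| FRCNN | MM | BCC | Nevus | SK | H/H | SL |
|---|---|---|---|---|---|---|
| MM | 327 | 6 | 42 | 21 | 3 | 0 |
| BCC | 9 | 108 | 6 | 9 | 0 | 0 |
| Nevus | 48 | 12 | 967 | 36 | 18 | 0 |
| SK | 21 | 6 | 30 | 223 | 0 | 3 |
| H/H | 0 | 0 | 3 | 0 | 57 | 0 |
| SL | 3 | 0 | 0 | 0 | 0 | 42 |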

TABLE B3.

Example data (BCD only).

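Columns MM to SL give the true condition.

| BCD | MM | BCC | Nevus | SK | H/H | SL |
|---|---|---|---|---|---|---|
| MM | 340 | 10 | 131 | 18 | 9 | 0 |
| BCC | 12 | 104 | 11 | 24 | 1 | 1 |
| Nevus | 22 | 3 | 823 | 17 | 6 | 0 |
| SK | 26 | 14 | 68 | 225 | 1 | 7 |
| H/H | 3 | 1 | 11 | 0 | 61 | 0 |
| SL | 5 | 0 | 4 | 5 | 0 | 37 |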

Footnotes

CONFLICT OF INTEREST STATEMENT

The authors have declared no conflict of interest.

DATA AVAILABILITY STATEMENT

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

REFERENCES

1. van Rijsbergen CJ. Information Retrieval. London: Butterworths; 1979.
2. Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge: Cambridge University Press; 2008.
3. Jinnai S, Yamazaki N, Hirano Y, Sugawara Y, Ohe Y, Hamamoto R. The development of a skin cancer classification system for pigmented skin lesions using deep learning. Biomolecules. 2020;10(8):1123. doi: 10.3390/biom10081123
4. Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ. 2006;332(7549):1080.
5. Fedorov V, Mannino F, Zhang R. Consequences of dichotomization. Pharm Stat. 2009;8(1):50–61.
6. Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Inf Process Manag. 2009;45:427–437.
7. Bhalla S, Kaur H, Kaur R, Sharma S, Raghava GPS. Expression based biomarkers and models to classify early and late-stage samples of papillary thyroid carcinoma. PloS One. 2020;15(4):e0231629.
8. Chowdhury S, Dong X, Qian L, et al. A multitask bi-directional RNN model for named entity recognition on Chinese electronic medical records. BMC Bioinform. 2018;19(17):499.
9. Döring K, Qaseem A, Becer M, et al. Automated recognition of functional compound-protein relationships in literature. PloS One. 2020;15(3):e0220925.
10. Hong N, Wen A, Stone DJ, et al. Developing a FHIR-based EHR phenotyping framework: a case study for identification of patients with obesity and multiple comorbidities from discharge summaries. J Biomed Inform. 2019;99:103310.
11. Lee GH, Shin SY. Federated learning on clinical benchmark data: performance assessment. J Med Internet Res. 2020;22(10):e20891.
12. Routray R, Tetarenko N, Abu-Assal C, et al. Application of augmented intelligence for pharmacovigilance case seriousness determination. Drug Saf. 2020;43(1):57–66.
13. Wang J, Zhang J, An Y, et al. Biomedical event trigger detection by dependency-based word embedding. BMC Med Genom. 2016;9(2):45.
14. Zhu F, Li X, Mcgonigle D, et al. Analyze informant-based questionnaire for the early diagnosis of senile dementia using deep learning. IEEE J Transl Eng Health Med. 2020;8:2200106.
15. Wang Y, Li J, Li Y, Wangi R, Yang X. Confidence interval for F1 measure of algorithm performance based on blocked 3 × 2 cross-validation. IEEE Trans Knowl Data Eng. 2015;27:651–659.
16. Zhang D, Wang J, Zhao X. Estimating the uncertainty of average F1 scores. Proceedings of the 2015 International Conference on the Theory of Information Retrieval. 2015.
17. Takahashi K, Yamamoto K, Kuchiba A, Koyama T. Confidence interval for micro-averaged F1 and macro-averaged F1 scores. Appl Intell (Dordr). 2022;52(5):4961–4972.
18. Attia ZI, Noseworthy PA, Lopez-Jimenez F, et al. An artificial intelligence-enabled ECG algorithm for the identification of patients with atrial fibrillation during sinus rhythm: a retrospective analysis of outcome prediction. Lancet. 2019;394(10201):861–867.
19. Han SS, Moon IJ, Lim W, et al. Keratinocytic skin cancer detection on the face using region-based convolutional neural network. JAMA Dermatol. 2020;156(1):29–37.
