. Author manuscript; available in PMC: 2022 Mar 21.
Published in final edited form as: Appl Intell (Dordr). 2021 Jul 31;52(5):4961–4972. doi: 10.1007/s10489-021-02635-5

Confidence interval for micro-averaged F1 and macro-averaged F1 scores

Kanae Takahashi 1,2, Kouji Yamamoto 3, Aya Kuchiba 4,5, Tatsuki Koyama 6
PMCID: PMC8936911  NIHMSID: NIHMS1752503  PMID: 35317080

Abstract

A binary classification problem is common in the medical field, and sensitivity, specificity, accuracy, and negative and positive predictive values are often used as measures of the performance of a binary predictor. In computer science, a classifier is usually evaluated with precision (positive predictive value) and recall (sensitivity). As a single summary measure of a classifier’s performance, the F1 score, defined as the harmonic mean of precision and recall, is widely used in the context of information retrieval and information extraction evaluation because it possesses favorable characteristics, especially when the prevalence is low. Some statistical methods for inference have been developed for the F1 score in binary classification problems; however, they have not been extended to the problem of multi-class classification. There are three types of F1 scores, and the statistical properties of these F1 scores have hardly ever been discussed. We propose methods based on the large-sample multivariate central limit theorem for estimating F1 scores with confidence intervals.

Keywords: Precision, Recall, Machine learning, F1 measures, Multi-class classification, Delta-method

1. Introduction

In the medical field, a binary classification problem is common, and sensitivity, specificity, accuracy, and negative and positive predictive values are often used as measures of the performance of a binary predictor. In computer science, a classifier is usually evaluated with precision and recall, which are equal to the positive predictive value and sensitivity, respectively. For measuring the performance of text classification in the field of information retrieval and of a classifier in machine learning, the F score (F measure) has been widely used. In particular, the F1 score, defined as the harmonic mean of precision and recall [1, 2], has been popular. The F1 score is rarely used in diagnostic studies in medicine despite its favorable characteristics. As a single performance measure, the F1 score may be preferred to specificity and accuracy, which may be artificially high even for a poor classifier with a high false negative probability when disease prevalence is low. The F1 score is especially useful when identification of true negatives is relatively unimportant because the true negative rate is not included in the computation of either precision or recall.

To evaluate a multi-class classification, a single summary measure is often sought. As extensions of the F1 score for binary classification, there exist two types of such measures: the micro-averaged F1 score and the macro-averaged F1 score [2]. The micro-averaged F1 score pools per-sample classifications across classes and then calculates the overall F1 score. In contrast, the macro-averaged F1 score computes a simple average of the F1 scores over classes. Sokolova and Lapalme [3] gave an alternative definition of the macro-averaged F1 score as the harmonic mean of the simple averages of the precision and recall over classes. Both micro-averaged and macro-averaged F1 scores have a simple interpretation as an average of precision and recall, with different ways of computing averages. Moreover, as will be shown in Section 2, the micro-averaged F1 score has an additional interpretation as the total probability of true positive classifications.

For binary classification, some statistical methods for inference have been proposed for the F1 score (e.g., [4]); however, the methodology has not been extended to the multi-class F1 scores. To our knowledge, methods for computing variance estimates of the micro-averaged F1 score and macro-averaged F1 score have not been reported. Thus, computing confidence intervals for the multi-class F1 scores is not possible, and inference about them is usually based solely on point estimates and is thus highly limited in practical utility. For example, consider the results of an analysis reported by Dong et al. [5]. In this analysis, the authors calculated the point estimates of macro-averaged F1 scores for four classifiers, and they concluded that a classifier outperformed the others by comparing the point estimates without taking into account their uncertainty. Others have also used multi-class F1 scores but only reported point estimates without confidence intervals [6–16].

To address this knowledge gap, we provide herein the methods for computing variances of these multi-class F1 scores so that estimating the micro-averaged F1 score and macro-averaged F1 score with confidence intervals becomes possible in multi-class classification.

The rest of the manuscript is organized as follows: The definitions of the micro-averaged F1 score and macro-averaged F1 score are reviewed in Section 2. In Section 3, variance estimates and confidence intervals for the multi-class F1 scores are derived. A simulation study to investigate the coverage probabilities of the proposed confidence intervals is presented in Section 4. Then, our method is applied to a real study as an example in Section 5 followed by a brief discussion in Section 6.

2. Averaged F1 scores

This section introduces notation and definitions of the multi-class F1 scores, namely, the macro-averaged and micro-averaged F1 scores. Consider an r × r contingency table for a nominal categorical variable with r classes (r ≥ 2). The columns indicate the true conditions, and the rows indicate the predicted conditions. The problem is called binary classification when r = 2 and multi-class classification when r > 2. Such a table is also called a confusion matrix. We consider multi-class classification, i.e., r > 2, and denote the cell probabilities and the row and column marginal probabilities by pij, pi·, and p·j, respectively (i, j = 1, ⋯, r). For each class (i = 1, ⋯, r), the true positive rate (TPi), the false positive rate (FPi), and the false negative rate (FNi) are defined as follows:

TP_i = p_{ii}, \quad FP_i = \sum_{j \ne i} p_{ij}, \quad FN_i = \sum_{j \ne i} p_{ji}.

TPi is the i-th diagonal element, FPi is the sum of the off-diagonal elements of the i-th row, and FNi is the sum of the off-diagonal elements of the i-th column. Note that TPi + FPi = pi· and TPi + FNi = p·i.

In the current and following sections, we will use the simple 3-by-3 confusion matrix in Table 1 as an example to demonstrate various computations. Columns represent the true state, and rows represent the predicted classification. The total sample size is 100.

Table 1.

Numeric example

                            True Classification
                      Class 1   Class 2   Class 3
a: Frequencies
             Class 1      2         2         2
  Prediction Class 2      5        70         2
             Class 3      0         2        15
b: Proportions (row totals in the last column, column totals in the last row)
             Class 1    0.02      0.02      0.02     0.06
  Prediction Class 2    0.05      0.70      0.02     0.77
             Class 3    0.00      0.02      0.15     0.17
                        0.07      0.74      0.19

The within-class probabilities are:

TP1 = 0.02, TP2 = 0.70, TP3 = 0.15.
FP1 = 0.04, FP2 = 0.07, FP3 = 0.02.
FN1 = 0.05, FN2 = 0.04, FN3 = 0.04.
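These within-class rates can be read directly off the matrix of cell probabilities. The short Python sketch below (our own illustration; the paper’s implementation is the R code in Appendix D) reproduces the numbers above from Table 1b.

```python
# Per-class TP, FP, FN rates from an r x r matrix of cell probabilities
# (rows = predicted condition, columns = true condition), as defined above.
def per_class_rates(p):
    r = len(p)
    tp = [p[i][i] for i in range(r)]                                   # diagonal elements
    fp = [sum(p[i][j] for j in range(r) if j != i) for i in range(r)]  # off-diagonal row sums
    fn = [sum(p[j][i] for j in range(r) if j != i) for i in range(r)]  # off-diagonal column sums
    return tp, fp, fn

# Proportions from Table 1b
p = [[0.02, 0.02, 0.02],
     [0.05, 0.70, 0.02],
     [0.00, 0.02, 0.15]]
tp, fp, fn = per_class_rates(p)
print([round(x, 2) for x in tp])  # [0.02, 0.7, 0.15]
print([round(x, 2) for x in fp])  # [0.04, 0.07, 0.02]
print([round(x, 2) for x in fn])  # [0.05, 0.04, 0.04]
```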

Micro-averaged F1 score

The micro-averaged precision (miP) and micro-averaged recall (miR) are defined as

miP = \frac{\sum_{i=1}^{r} TP_i}{\sum_{i=1}^{r} (TP_i + FP_i)} = \frac{\sum_{i=1}^{r} p_{ii}}{\sum_{i=1}^{r} p_{i\cdot}} = \sum_{i=1}^{r} p_{ii},
miR = \frac{\sum_{i=1}^{r} TP_i}{\sum_{i=1}^{r} (TP_i + FN_i)} = \frac{\sum_{i=1}^{r} p_{ii}}{\sum_{i=1}^{r} p_{\cdot i}} = \sum_{i=1}^{r} p_{ii}.

Note that for both miP and miR, the denominator is the sum of all the elements (diagonal and off-diagonal) of the confusion matrix, and it is 1. Finally, the micro-averaged F1 score is defined as the harmonic mean of these quantities:

miF1 = \frac{2 \, miP \times miR}{miP + miR} = \sum_{i=1}^{r} p_{ii}. \tag{1}

This definition is commonly used (e.g., [6, 8–12, 14, 15]).

By definition, we have miP, miR, and miF1 all equal to the sum of the diagonal elements, which, in our example, is 0.87.

Macro-averaged F1 score

To define the macro-averaged F1 score (maF1), first consider the following precision (Pi) and recall (Ri) within each class, i = 1, ⋯, r:

P_i = \frac{TP_i}{TP_i + FP_i} = p_{ii}/p_{i\cdot},
R_i = \frac{TP_i}{TP_i + FN_i} = p_{ii}/p_{\cdot i}.

For our example, simple calculation shows:

P1 = 0.33, P2 = 0.91, P3 = 0.88,
R1 = 0.29, R2 = 0.95, R3 = 0.79.

And F1 score within each class (F1i) is defined as the harmonic mean of Pi and Ri, that is,

F1_i = \frac{2 \, P_i \times R_i}{P_i + R_i} = \frac{2 p_{ii}}{p_{i\cdot} + p_{\cdot i}}.

The macro-averaged F1 score is defined as the simple arithmetic mean of F1i:

maF1 = \frac{1}{r} \sum_{i=1}^{r} F1_i = \frac{2}{r} \sum_{i=1}^{r} \frac{p_{ii}}{p_{i\cdot} + p_{\cdot i}}. \tag{2}

This score, like miF1, is frequently reported (e.g., [5–10, 13]).

F1i and maF1 in our example are:

F11 = 0.308, F12 = 0.927, F13 = 0.833.
maF1 = (0.308 + 0.927 + 0.833)/3 = 0.689.
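The arithmetic above can be checked with a few lines of Python (an illustrative sketch; the paper’s own implementation is the R code in Appendix D):

```python
# Within-class F1 and the macro-averaged F1 for the Table 1 example.
p = [[0.02, 0.02, 0.02],
     [0.05, 0.70, 0.02],
     [0.00, 0.02, 0.15]]        # rows = predicted, columns = true
r = len(p)
row = [sum(p[i]) for i in range(r)]                       # p_i. (row margins)
col = [sum(p[i][j] for i in range(r)) for j in range(r)]  # p_.j (column margins)
f1 = [2*p[i][i]/(row[i] + col[i]) for i in range(r)]      # F1 within each class
maf1 = sum(f1)/r                                          # simple arithmetic mean
print([round(x, 3) for x in f1])  # [0.308, 0.927, 0.833]
print(round(maf1, 3))             # 0.689
```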

Alternative definition of Macro-averaged F1 score

Sokolova and Lapalme [3] gave an alternative definition of the macro-averaged F1 score (maF1*). First, macro-averaged precision (maP) and macro-averaged recall (maR) are defined as simple arithmetic means of the within-class precision and within-class recall, respectively.

maP = \frac{1}{r} \sum_{i=1}^{r} \frac{TP_i}{TP_i + FP_i} = \frac{1}{r} \sum_{i=1}^{r} \frac{p_{ii}}{p_{i\cdot}},
maR = \frac{1}{r} \sum_{i=1}^{r} \frac{TP_i}{TP_i + FN_i} = \frac{1}{r} \sum_{i=1}^{r} \frac{p_{ii}}{p_{\cdot i}}.

And maF1* is defined as the harmonic mean of these quantities.

maF1^* = \frac{2 \, maP \times maR}{maP + maR}. \tag{3}

This version of macro-averaged F1 score is less frequently used (e.g., [11, 12, 16]). For our example,

maP = (0.02/0.06 + 0.70/0.77 + 0.15/0.17)/3 = 0.708.
maR = (0.02/0.07 + 0.70/0.74 + 0.15/0.19)/3 = 0.674.
maF1* = 0.691.
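The alternative definition can be verified the same way (again our own Python sketch, mirroring the quantities defined above):

```python
# Macro-averaged precision, recall, and maF1* for the Table 1 example.
p = [[0.02, 0.02, 0.02],
     [0.05, 0.70, 0.02],
     [0.00, 0.02, 0.15]]
r = len(p)
row = [sum(p[i]) for i in range(r)]                       # p_i.
col = [sum(p[i][j] for i in range(r)) for j in range(r)]  # p_.j
maP = sum(p[i][i]/row[i] for i in range(r)) / r           # mean within-class precision
maR = sum(p[i][i]/col[i] for i in range(r)) / r           # mean within-class recall
maf1_star = 2*maP*maR/(maP + maR)                         # harmonic mean
print(round(maP, 3), round(maR, 3), round(maf1_star, 3))  # 0.708 0.674 0.691
```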

In this example, the micro-averaged F1 score is higher than the macro-averaged F1 scores because both within-class precision and recall are much lower for the first class than for the other two. Micro-averaging puts only a small weight on the first column because the sample size there is relatively small. This numeric example shows a shortcoming of summarizing the performance of a multi-class classification with a single number when within-class precision and recall vary substantially. However, aggregate measures such as the micro-averaged and macro-averaged F1 scores are useful in quantifying the performance of a classifier as a whole.

3. Variance estimate and confidence interval

In this section, we derive confidence intervals for miF1, maF1, and maF1*. We assume that the observed frequencies, nij, for 1 ≤ i ≤ r, 1 ≤ j ≤ r, have a multinomial distribution with sample size n and probabilities

p = (p_{11}, \cdots, p_{1r}, p_{21}, \cdots, p_{2r}, \cdots, p_{r1}, \cdots, p_{rr})^T,

where “T” represents the transpose, that is

(n_{11}, n_{12}, \cdots, n_{rr}) \sim \mathrm{Multinomial}(n; p).

The expectation, variance, and covariance for i, j = 1, ⋯, r, are:

E(n_{ij}) = n p_{ij},
\mathrm{Var}(n_{ij}) = n p_{ij}(1 - p_{ij}),
\mathrm{Cov}(n_{ij}, n_{kl}) = -n p_{ij} p_{kl}, \quad \text{for } i \ne k \text{ or } j \ne l,

respectively, where n = \sum_{i,j} n_{ij} is the overall sample size. The maximum likelihood estimate (MLE) of pij is p̂ij = nij/n. Using the multivariate central limit theorem, we have

\sqrt{n}(\hat{p} - p) \,\dot\sim\, \mathrm{Normal}(0_{r^2}, \mathrm{diag}(p) - pp^T),

where 0_{r^2} is the r² × 1 vector whose elements are all 0, diag(p) is the r² × r² diagonal matrix whose diagonal elements are p, and “⩪” represents “approximately distributed as.”

By the invariance property of MLE’s, the maximum likelihood estimates of miF1, maF1, maF1*, and the other quantities in the previous section can be obtained by substituting p̂ij for pij. In the following subsections, we use the multivariate delta-method to derive the large-sample distributions of the MLE’s of miF1, maF1, and maF1*.

3.1. Confidence interval for miF1

As shown in (1), miF1 = \sum_{i=1}^{r} p_{ii}, and the maximum likelihood estimate (MLE) of miF1 is

\widehat{miF1} = \sum_{i=1}^{r} \hat{p}_{ii}.

Using the multivariate delta-method (Appendix A), we have

\widehat{miF1} \,\dot\sim\, \mathrm{Normal}(miF1, \mathrm{Var}(\widehat{miF1})),

where variance of miF1^ is

\mathrm{Var}(\widehat{miF1}) = \left(\sum_{i=1}^{r} p_{ii}\right)\left(1 - \sum_{i=1}^{r} p_{ii}\right)/n. \tag{4}

And a (1 – α) × 100% confidence interval of miF1 is

\widehat{miF1} \pm Z_{1-\alpha/2} \times \sqrt{\widehat{\mathrm{Var}}(\widehat{miF1})},

where \widehat{\mathrm{Var}}(\widehat{miF1}) is \mathrm{Var}(\widehat{miF1}) with {pii} replaced by {p̂ii}, and Z_p denotes the 100p-th percentile of the standard normal distribution. Computation of \widehat{\mathrm{Var}}(\widehat{miF1}) for our numeric example is straightforward using (4):

\widehat{\mathrm{Var}}(\widehat{miF1}) = (0.02 + 0.70 + 0.15) \times \{1 - (0.02 + 0.70 + 0.15)\}/100 = 0.0336^2.

And a 95% confidence interval for miF1 is

0.87 \pm 1.960 \times 0.0336 = (0.804, 0.936).
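The whole calculation for miF1 fits in a few lines. The sketch below is an illustrative Python transcription of equation (4) and the interval above (the paper’s own code is the R function in Appendix D):

```python
import math

pii = [0.02, 0.70, 0.15]            # diagonal of Table 1b
n = 100
mif1 = sum(pii)                     # point estimate, 0.87
se = math.sqrt(mif1*(1 - mif1)/n)   # square root of equation (4), ~0.0336
z = 1.959964                        # 97.5th percentile of the standard normal
lo, hi = mif1 - z*se, mif1 + z*se
print(round(lo, 3), round(hi, 3))   # 0.804 0.936
```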

3.2. Confidence interval for maF1

The MLE of maF1 can be obtained by substituting the MLE’s of pii, pi·, and p·i in (2):

\widehat{maF1} = \frac{2}{r} \sum_{i=1}^{r} \frac{\hat{p}_{ii}}{\hat{p}_{i\cdot} + \hat{p}_{\cdot i}}.

Again by the multivariate delta-method (Appendix B), we have the variance of maF1^ as

\mathrm{Var}(\widehat{maF1}) = \frac{2}{r^2} \left\{ \sum_{i=1}^{r} \frac{F1_i (p_{i\cdot} + p_{\cdot i} - 2 p_{ii})}{(p_{i\cdot} + p_{\cdot i})^2} \left( \frac{p_{i\cdot} + p_{\cdot i} - 2 p_{ii}}{p_{i\cdot} + p_{\cdot i}} + \frac{F1_i}{2} \right) + \sum_{i=1}^{r} \sum_{j \ne i} \frac{p_{ij} F1_i F1_j}{(p_{i\cdot} + p_{\cdot i})(p_{j\cdot} + p_{\cdot j})} \right\} / n.

And a (1 – α) × 100% confidence interval of maF1 is

\widehat{maF1} \pm Z_{1-\alpha/2} \times \sqrt{\widehat{\mathrm{Var}}(\widehat{maF1})},

where \widehat{\mathrm{Var}}(\widehat{maF1}) is \mathrm{Var}(\widehat{maF1}) with {pij} replaced by {p̂ij}. This computation is complex even for a small 3-by-3 table; the R code in Appendix D was used to compute the variance estimate and a 95% confidence interval of maF1:

\widehat{\mathrm{Var}}(\widehat{maF1}) = 0.0650^2, \quad 0.69 \pm 1.960 \times 0.0650 = (0.562, 0.817).
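For readers who prefer Python to the R of Appendix D, a direct transcription of the variance formula above reproduces these numbers (our own sketch; F1i and the margins are as defined in Section 2):

```python
import math

p = [[0.02, 0.02, 0.02],
     [0.05, 0.70, 0.02],
     [0.00, 0.02, 0.15]]
r, n = 3, 100
row = [sum(p[i]) for i in range(r)]
col = [sum(p[i][j] for i in range(r)) for j in range(r)]
s = [row[i] + col[i] for i in range(r)]              # p_i. + p_.i
f1 = [2*p[i][i]/s[i] for i in range(r)]              # within-class F1
# First and second summands of Var(maF1-hat)
a = sum(f1[i]*(s[i] - 2*p[i][i])/s[i]**2 *
        ((s[i] - 2*p[i][i])/s[i] + f1[i]/2) for i in range(r))
b = sum(p[i][j]*f1[i]*f1[j]/(s[i]*s[j])
        for i in range(r) for j in range(r) if j != i)
var = 2*(a + b)/(n*r**2)
se = math.sqrt(var)
maf1 = sum(f1)/r
print(round(se, 4))                                                # 0.065
print(round(maf1 - 1.959964*se, 3), round(maf1 + 1.959964*se, 3))  # 0.562 0.817
```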

3.3. Confidence interval for maF1*

To obtain the MLE of maF1*, we first substitute the MLE’s of pii, pi·, and p·i into maP and maR, and use these in (3):

\widehat{maF1^*} = \frac{2 \, \widehat{maP} \times \widehat{maR}}{\widehat{maP} + \widehat{maR}}.

As shown in Appendix C,

\mathrm{Var}(\widehat{maF1^*}) = \frac{4 \left\{ maR^4 \mathrm{Var}(\widehat{maP}) + 2 \, maP^2 maR^2 \mathrm{Cov}(\widehat{maP}, \widehat{maR}) + maP^4 \mathrm{Var}(\widehat{maR}) \right\}}{(maP + maR)^4},

where

\mathrm{Var}(\widehat{maP}) = \frac{1}{r^2} \left( \sum_{i=1}^{r} \frac{p_{ii} \sum_{j \ne i} p_{ij}}{p_{i\cdot}^3} \right) / n,
\mathrm{Var}(\widehat{maR}) = \frac{1}{r^2} \left( \sum_{i=1}^{r} \frac{p_{ii} \sum_{j \ne i} p_{ji}}{p_{\cdot i}^3} \right) / n,
\mathrm{Cov}(\widehat{maP}, \widehat{maR}) = \frac{1}{r^2} \left\{ \sum_{i=1}^{r} \frac{\left(\sum_{j \ne i} p_{ij}\right) p_{ii} \left(\sum_{j \ne i} p_{ji}\right)}{p_{i\cdot}^2 p_{\cdot i}^2} + \sum_{i=1}^{r} \sum_{j \ne i} \frac{p_{ii} p_{ij} p_{jj}}{p_{i\cdot}^2 p_{\cdot j}^2} \right\} / n.

A (1 – α) × 100% confidence interval of maF1* is

\widehat{maF1^*} \pm Z_{1-\alpha/2} \times \sqrt{\widehat{\mathrm{Var}}(\widehat{maF1^*})}.

Again, \widehat{\mathrm{Var}}(\widehat{maF1^*}) is obtained by replacing all components of \mathrm{Var}(\widehat{maF1^*}) by their respective MLE’s. Using the accompanying R code (Appendix D), we computed the variance estimate and a 95% confidence interval of maF1*:

\widehat{\mathrm{Var}}(\widehat{maF1^*}) = 0.0649^2, \quad 0.69 \pm 1.960 \times 0.0649 = (0.563, 0.818).
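The same numbers follow from a direct Python transcription of the variance components above (an illustrative sketch; the paper’s own code is in Appendix D):

```python
import math

p = [[0.02, 0.02, 0.02],
     [0.05, 0.70, 0.02],
     [0.00, 0.02, 0.15]]
r, n = 3, 100
row = [sum(p[i]) for i in range(r)]
col = [sum(p[i][j] for i in range(r)) for j in range(r)]
d = [p[i][i] for i in range(r)]                       # diagonal p_ii
maP = sum(d[i]/row[i] for i in range(r)) / r
maR = sum(d[i]/col[i] for i in range(r)) / r
varP = sum(d[i]*(row[i] - d[i])/row[i]**3 for i in range(r)) / r**2 / n
varR = sum(d[i]*(col[i] - d[i])/col[i]**3 for i in range(r)) / r**2 / n
cov = (sum((row[i] - d[i])*d[i]*(col[i] - d[i])/(row[i]**2*col[i]**2) for i in range(r))
       + sum(d[i]*p[i][j]*d[j]/(row[i]**2*col[j]**2)
             for i in range(r) for j in range(r) if j != i)) / r**2 / n
var = 4*(maR**4*varP + 2*maP**2*maR**2*cov + maP**4*varR) / (maP + maR)**4
se = math.sqrt(var)
mafs = 2*maP*maR/(maP + maR)
print(round(se, 4))                                                # 0.0649
print(round(mafs - 1.959964*se, 3), round(mafs + 1.959964*se, 3))  # 0.563 0.818
```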

4. Simulation

We performed a simulation study to assess the coverage probability of the confidence intervals proposed in Section 3. We set r = 3 (classes 1, 2, 3) and generated data according to the multinomial distributions with p summarized in Table 2. The total sample size, n, was set to 25, 50, 100, 500, 1,000, and 5,000. For each combination of the true distribution and sample size, we generated 1,000,000 datasets, each time computing 95% confidence intervals for miF1, maF1, and maF1*.

Table 2.

Simulation study: True cell probabilities

                               True condition
                             1        2        3
Scenario 1
                    1      8/30     1/30     1/30
Predicted condition 2      1/30     8/30     1/30
                    3      1/30     1/30     8/30
Scenario 2
                    1    64/100    3/100    3/100
Predicted condition 2     8/100    4/100    3/100
                    3     8/100    3/100    4/100
Scenario 3
                    1    32/100    1/100    1/100
Predicted condition 2    24/100    8/100    1/100
                    3    24/100    1/100    8/100

In scenario 1, the true conditions of class 1, 2, and 3 have the same probability (1/3), and the recall and precision are equal (80%). Thus miP = maP = 0.80, miR = maR = 0.80, and miF1 = maF1 = maF1* = 0.80.

In scenario 2, the true condition of class 1 has higher probability than the others (80% vs 10%), and the recall and precision of class 1 are also higher than the others (80% vs 40%, and 91% vs 27%, respectively). miF1 gives equal weight to each per-sample classification decision, whereas maF1 gives equal weight to each class. Thus, large classes dominate small classes in computing miF1 [2], and miF1 is larger than maF1 (miF1 = 0.72, maF1 = 0.50, maF1* = 0.51) in scenario 2 because class 1 has higher probability and has higher precision and recall.

In scenario 3, the true condition of class 1 has higher probability than the others (80% vs 10%). The precision of class 1 is higher than the others (94% vs 24%), and the recall of class 1 is lower than the others (40% vs 80%). Compared to the other two scenarios, the diagonal entries are relatively small, which makes miF1 small (miF1 = 0.48, maF1 = 0.44, and maF1* = 0.55).

Table 3 shows the coverage probability of the proposed 95% confidence intervals for each scenario. The coverage probabilities for miF1, maF1, and maF1* are close to the nominal 95% when the sample size is large. When n is small, the coverage probabilities tend to be smaller than 95%, especially for maF1 and maF1*. Moreover, computing a confidence interval for maF1* for small n is often impossible because \widehat{maF1^*} is undefined when either pi· = 0 or p·j = 0 for any i or j. In typical applications where these F scores are computed, n is large, and the small-n problem is unlikely to occur.

Table 3.

Simulation study: Coverage probability

          Scenario 1               Scenario 2               Scenario 3
n      miF1   maF1   maF1*     miF1   maF1   maF1*     miF1   maF1   maF1*
25 0.885 0.901 0.890 0.921 0.790 0.774 0.930 0.870 0.821
50 0.937 0.935 0.923 0.941 0.864 0.853 0.935 0.918 0.905
100 0.933 0.938 0.936 0.937 0.914 0.914 0.943 0.936 0.933
500 0.949 0.949 0.948 0.947 0.944 0.945 0.946 0.947 0.947
1000 0.946 0.948 0.948 0.947 0.947 0.947 0.947 0.949 0.947
5000 0.950 0.950 0.950 0.951 0.949 0.949 0.951 0.950 0.950
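A scaled-down version of this simulation illustrates how the coverage probabilities were estimated. The sketch below is our own Python illustration (2,000 replicates instead of 1,000,000, scenario 1 with n = 500, miF1 only); values near 0.95 are expected, consistent with the 0.949 reported in Table 3.

```python
import math
import random

random.seed(1)
# Scenario 1 cell probabilities of Table 2, listed in row order
probs = [8/30, 1/30, 1/30,
         1/30, 8/30, 1/30,
         1/30, 1/30, 8/30]
true_mif1 = probs[0] + probs[4] + probs[8]   # diagonal sum = 0.8
n, reps, z = 500, 2000, 1.959964
cells = list(range(9))
cover = 0
for _ in range(reps):
    draws = random.choices(cells, weights=probs, k=n)     # one multinomial sample
    diag = sum(1 for d in draws if d in (0, 4, 8)) / n    # miF1-hat
    se = math.sqrt(diag*(1 - diag)/n)                     # equation (4), estimated
    if diag - z*se <= true_mif1 <= diag + z*se:
        cover += 1
print(cover/reps)   # close to 0.95
```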

5. Example

As an example, we applied our method to the temporal sleep stage classification data provided by Dong et al. [5]. They proposed a new approach based on a Mixed Neural Network (MNN) to classify sleep into five stages: one awake stage (W), three sleep stages (N1, N2, N3), and one rapid eye movement stage (REM). In addition to the MNN, they evaluated the following three classifiers: Support Vector Machine (SVM), Random Forest (RF), and Multilayer Perceptron (MLP). The data came from 62 healthy subjects, and classification by a single sleep expert was used as the gold standard. The staging is based on a 30-second window of the physiological signals called an EEG (electroencephalography) epoch. Thus, each subject contributes a large number of data to be classified. The total number of epochs depends on the classifier, and it is about 59,000. Performance of each classifier was evaluated using maF1 along with precision, recall, and overall accuracy. They concluded that the MNN outperformed the competitors by comparing the point estimates of maF1 and overall accuracy. We provide 95% confidence intervals for miF1, maF1, and maF1* for each of the four methods in Table 4. As n is large for this example, the confidence intervals are narrow, and the intervals for the MNN do not overlap with those for the other three methods, providing further evidence that the MNN is superior to them.
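As a check on the MNN row of Table 4, the miF1 interval can be reproduced from the confusion matrix given in the Appendix D R code (Table V of Dong et al. [5]). The block below is our own Python transcription of that matrix (each column of the R `cbind` becomes a column here):

```python
import math

# MNN confusion matrix (rows = predicted stage, columns = true stage)
mnn = [[5022,  407,   130,   13,  103],
       [ 577, 2468,   630,    0,  258],
       [ 188,  989, 27254, 1236,  609],
       [  19,    4,  1021, 6399,    0],
       [ 395,  965,   763,    5, 9611]]
n = sum(sum(row) for row in mnn)               # 59,066 epochs
mif1 = sum(mnn[i][i] for i in range(5)) / n    # diagonal sum / n
se = math.sqrt(mif1*(1 - mif1)/n)              # equation (4)
z = 1.959964
print(round(mif1, 3))                                # 0.859
print(round(mif1 - z*se, 3), round(mif1 + z*se, 3))  # 0.856 0.862
```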

Table 4.

Point estimates and confidence intervals for miF1, maF1, and maF1*

Method n miF1^ 95% CI maF1^ 95% CI maF1*^ 95% CI
MNN 59,066 0.859 (0.856, 0.862) 0.805 (0.801, 0.809) 0.807 (0.803, 0.811)
SVM 59,255 0.797 (0.794, 0.800) 0.750 (0.746, 0.754) 0.756 (0.752, 0.760)
RF 59,193 0.817 (0.814, 0.820) 0.724 (0.720, 0.729) 0.746 (0.741, 0.750)
MLP 59,130 0.814 (0.811, 0.817) 0.772 (0.768, 0.776) 0.778 (0.774, 0.782)

6. Discussion

We derived large sample variance estimates of miF1, maF1, and maF1* in terms of the observed cell probabilities and sample size. This enabled us to derive large sample confidence intervals.

Coverage probabilities of the proposed confidence intervals were assessed through the simulation study. According to the results of the simulation, when n is larger than 100, the coverage probability was close to the nominal level; however, for n < 100, the coverage probabilities tended to be smaller than the target. Moreover, with an extremely small sample size, maF1* could not be estimated because computation of maF1* requires all margins to be non-zero. Zhang et al. [17] considered interval estimation of miF1 and maF1 and proposed the highest density interval in a Bayesian framework. In contrast, we have proposed confidence intervals for miF1, maF1, and maF1* in a frequentist framework using a large-sample approximation.

An inherent drawback of multi-class F1 scores is that they do not summarize the data appropriately when a large variability exists between classes. This was demonstrated in the numeric example in Section 2, for which the within-class F1 values are 0.308, 0.927, and 0.833, and miF1, maF1, and maF1* are 0.870, 0.689, and 0.691, respectively. Reporting multiple within-class F1 scores may be an option, as done in [18] and [19]; however, an aggregate measure is useful in evaluating the overall performance of a classifier across classes. Another limitation of F1 scores is that they do not take into consideration the true negative rate, and they may not be an appropriate measure when true negatives are important.

As future work, we are developing hypothesis testing procedures for miF1, maF1, and maF1* based on the variance estimates proposed in this article.

R code for computing confidence intervals for miF1, maF1, and maF1* is presented in Appendix D.

Funding

This research was partially supported by Grant-in-Aid for Young Scientists No. 18K17325 (Takahashi), Grant-in-Aid for Scientific Research (C) No. 18K11195 (Yamamoto), and P30 CA068485 Cancer Center Support Grant (Koyama).

Biographies


Kanae Takahashi is currently an Assistant Professor at Hyogo College of Medicine, Japan. She received the BS degree from Osaka University in 2008 and MPH degree from Kyoto University in 2010. Her research interests include design of clinical trials and diagnostic study.


Kouji Yamamoto is currently an Associate Professor at Yokohama City University, School of Medicine, Japan. He received the PhD in statistics from Tokyo University of Science in 2009. His research interests include design of clinical trials, diagnostic study, and categorical data analysis.


Aya Kuchiba is an Associate Professor at Graduate School of Health Innovation, Kanagawa University of Human Services, Japan. She received her PhD in Health Sciences (Biostatistics & Epidemiology) from University of Tokyo in 2008. Prior to joining Kanagawa University of Human Services, she was a Section Head of the Biostatistics Division at the National Cancer Center, Japan. Her research interest has focused on developing and applying statistical methods to cancer research in the areas of epidemiology with molecular and genetic data, diagnostic testing, and prevention, and in conducting clinical trials.


Tatsuki Koyama is an Associate Professor of Biostatistics at Vanderbilt University Medical Center. He received his PhD in statistics from University of Pittsburgh in 2003. His research interests are primarily centered on flexible experimental designs for clinical trials and inference from the data arising from such flexible and adaptive designs both in the Frequentist and Bayesian paradigms. His medical research interests include comparative effectiveness of treatments for localized prostate cancer, and association of ambient air pollution and acute lung injury.

Appendix A: Derivation of the distribution and variance of miF1^

Let p be the ordered elements of the confusion matrix, p = (p_{11}, \cdots, p_{1r}, p_{21}, \cdots, p_{2r}, \cdots, p_{r1}, \cdots, p_{rr})^T. Using the multivariate delta-method for \hat{p}, we get

\sqrt{n}(\widehat{miF1} - miF1) \,\dot\sim\, \mathrm{Normal}\left(0, \left[\frac{\partial(miF1)}{\partial p}\right]^T (\mathrm{diag}(p) - pp^T) \left[\frac{\partial(miF1)}{\partial p}\right]\right). \tag{5}

Because miF1 = \sum_{i=1}^{r} p_{ii}, we have

\frac{\partial(miF1)}{\partial p_{ii}} = 1, \; i = 1, \cdots, r, \quad \text{and} \quad \frac{\partial(miF1)}{\partial p_{ij}} = 0 \; \text{if} \; i \ne j.

And

\frac{\partial(miF1)}{\partial p} = (1, 0, \cdots, 0, 0, 1, 0, \cdots, 0, \cdots, 0, \cdots, 0, 1)^T.

Note that all the elements corresponding to the diagonal entries (pii) of the confusion matrix are 1. To evaluate the variance in (5), further note that

\mathrm{diag}(p) = \begin{pmatrix} p_{11} & 0 & 0 & \cdots & 0 \\ 0 & p_{12} & 0 & \cdots & 0 \\ 0 & 0 & p_{13} & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & p_{rr} \end{pmatrix},
pp^T = \begin{pmatrix} p_{11}^2 & p_{11}p_{12} & p_{11}p_{13} & \cdots & p_{11}p_{rr} \\ p_{12}p_{11} & p_{12}^2 & p_{12}p_{13} & \cdots & p_{12}p_{rr} \\ p_{13}p_{11} & p_{13}p_{12} & p_{13}^2 & \cdots & p_{13}p_{rr} \\ \vdots & & & \ddots & \vdots \\ p_{rr}p_{11} & p_{rr}p_{12} & p_{rr}p_{13} & \cdots & p_{rr}^2 \end{pmatrix}.

Then we have

\left[\frac{\partial(miF1)}{\partial p}\right]^T \mathrm{diag}(p) \left[\frac{\partial(miF1)}{\partial p}\right] = (p_{11}, 0, \cdots, p_{22}, 0, \cdots, p_{33}, \cdots, p_{rr}) \left[\frac{\partial(miF1)}{\partial p}\right] = \sum_{i=1}^{r} p_{ii},
\left[\frac{\partial(miF1)}{\partial p}\right]^T (pp^T) \left[\frac{\partial(miF1)}{\partial p}\right] = \left(p_{11} \sum_{i=1}^{r} p_{ii}, \; p_{12} \sum_{i=1}^{r} p_{ii}, \; \cdots, \; p_{rr} \sum_{i=1}^{r} p_{ii}\right) \left[\frac{\partial(miF1)}{\partial p}\right] = \left(\sum_{i=1}^{r} p_{ii}\right)^2.

Thus,

\left[\frac{\partial(miF1)}{\partial p}\right]^T (\mathrm{diag}(p) - pp^T) \left[\frac{\partial(miF1)}{\partial p}\right] = \left(\sum_{i=1}^{r} p_{ii}\right)\left(1 - \sum_{i=1}^{r} p_{ii}\right).

Finally,

\mathrm{Var}(\widehat{miF1}) = \left(\sum_{i=1}^{r} p_{ii}\right)\left(1 - \sum_{i=1}^{r} p_{ii}\right)/n.

And

\widehat{miF1} \,\dot\sim\, \mathrm{Normal}\left(miF1, \left(\sum_{i=1}^{r} p_{ii}\right)\left(1 - \sum_{i=1}^{r} p_{ii}\right)/n\right).

Appendix B: Derivation of the distribution and variance of maF1^

In a similar manner to Appendix A, using the multivariate delta-method, we get

\sqrt{n}(\widehat{maF1} - maF1) \,\dot\sim\, \mathrm{Normal}\left(0, \left[\frac{\partial(maF1)}{\partial p}\right]^T (\mathrm{diag}(p) - pp^T) \left[\frac{\partial(maF1)}{\partial p}\right]\right).

Now we take the partial derivatives of (2) to get

\frac{\partial(maF1)}{\partial p_{ii}} = \frac{2}{r} \cdot \frac{p_{i\cdot} + p_{\cdot i} - 2 p_{ii}}{(p_{i\cdot} + p_{\cdot i})^2}, \quad i = 1, \cdots, r,
\frac{\partial(maF1)}{\partial p_{ij}} = -\frac{2}{r} \left( \frac{p_{ii}}{(p_{i\cdot} + p_{\cdot i})^2} + \frac{p_{jj}}{(p_{j\cdot} + p_{\cdot j})^2} \right), \quad i, j = 1, \cdots, r; \; i \ne j.

Arranging these terms according to the order of the elements in p, we have

\frac{\partial(maF1)}{\partial p} = \frac{2}{r} \left( \frac{p_{1\cdot} + p_{\cdot 1} - 2 p_{11}}{(p_{1\cdot} + p_{\cdot 1})^2}, \; -\frac{p_{11}}{(p_{1\cdot} + p_{\cdot 1})^2} - \frac{p_{22}}{(p_{2\cdot} + p_{\cdot 2})^2}, \; \cdots, \; \frac{p_{r\cdot} + p_{\cdot r} - 2 p_{rr}}{(p_{r\cdot} + p_{\cdot r})^2} \right)^T.

Next, we note

\left[\frac{\partial(maF1)}{\partial p}\right]^T (pp^T) \left[\frac{\partial(maF1)}{\partial p}\right] = 0

because

\left[\frac{\partial(maF1)}{\partial p}\right]^T p = \frac{2}{r} \left\{ \sum_{i=1}^{r} \frac{p_{i\cdot} + p_{\cdot i} - 2 p_{ii}}{(p_{i\cdot} + p_{\cdot i})^2} \, p_{ii} - \sum_{i=1}^{r} \frac{\sum_{j \ne i} p_{ij}}{(p_{i\cdot} + p_{\cdot i})^2} \, p_{ii} - \sum_{j=1}^{r} \frac{\sum_{i \ne j} p_{ij}}{(p_{j\cdot} + p_{\cdot j})^2} \, p_{jj} \right\} = \frac{2}{r} \left\{ \sum_{i=1}^{r} \frac{p_{i\cdot} + p_{\cdot i} - 2 p_{ii}}{(p_{i\cdot} + p_{\cdot i})^2} \, p_{ii} - \sum_{i=1}^{r} \frac{\sum_{j \ne i} (p_{ij} + p_{ji})}{(p_{i\cdot} + p_{\cdot i})^2} \, p_{ii} \right\} = 0.

Therefore,

\left[\frac{\partial(maF1)}{\partial p}\right]^T (\mathrm{diag}(p) - pp^T) \left[\frac{\partial(maF1)}{\partial p}\right] = \left[\frac{\partial(maF1)}{\partial p}\right]^T \mathrm{diag}(p) \left[\frac{\partial(maF1)}{\partial p}\right],

which can be shown to equal

= \frac{2}{r^2} \left\{ \sum_{i=1}^{r} \frac{F1_i (p_{i\cdot} + p_{\cdot i} - 2 p_{ii})}{(p_{i\cdot} + p_{\cdot i})^2} \left( \frac{p_{i\cdot} + p_{\cdot i} - 2 p_{ii}}{p_{i\cdot} + p_{\cdot i}} + \frac{F1_i}{2} \right) + \sum_{i=1}^{r} \sum_{j \ne i} \frac{p_{ij} F1_i F1_j}{(p_{i\cdot} + p_{\cdot i})(p_{j\cdot} + p_{\cdot j})} \right\}.

Putting all together, we have

maF1^~˙Normal(maF1,Var(maF1^)),

where

\mathrm{Var}(\widehat{maF1}) = \frac{2}{r^2} \left\{ \sum_{i=1}^{r} \frac{F1_i (p_{i\cdot} + p_{\cdot i} - 2 p_{ii})}{(p_{i\cdot} + p_{\cdot i})^2} \left( \frac{p_{i\cdot} + p_{\cdot i} - 2 p_{ii}}{p_{i\cdot} + p_{\cdot i}} + \frac{F1_i}{2} \right) + \sum_{i=1}^{r} \sum_{j \ne i} \frac{p_{ij} F1_i F1_j}{(p_{i\cdot} + p_{\cdot i})(p_{j\cdot} + p_{\cdot j})} \right\} / n.

Appendix C: Derivation of the distribution and variance of maF1*^

For macro-averaged precision (maP) and macro-averaged recall (maR), let the vector m and its MLE m̂ be

m = \begin{pmatrix} maP \\ maR \end{pmatrix}, \quad \hat{m} = \begin{pmatrix} \widehat{maP} \\ \widehat{maR} \end{pmatrix},

respectively. Using the multivariate delta-method, we have

\sqrt{n}(\hat{m} - m) \,\dot\sim\, \mathrm{Normal}(0_2, \Sigma),

where

\Sigma = \frac{\partial m}{\partial p^T} (\mathrm{diag}(p) - pp^T) \left[\frac{\partial m}{\partial p^T}\right]^T = \begin{pmatrix} \frac{\partial(maP)}{\partial p^T} (\mathrm{diag}(p) - pp^T) \left[\frac{\partial(maP)}{\partial p^T}\right]^T & \frac{\partial(maP)}{\partial p^T} (\mathrm{diag}(p) - pp^T) \left[\frac{\partial(maR)}{\partial p^T}\right]^T \\ \frac{\partial(maP)}{\partial p^T} (\mathrm{diag}(p) - pp^T) \left[\frac{\partial(maR)}{\partial p^T}\right]^T & \frac{\partial(maR)}{\partial p^T} (\mathrm{diag}(p) - pp^T) \left[\frac{\partial(maR)}{\partial p^T}\right]^T \end{pmatrix} = n \begin{pmatrix} \mathrm{Var}(\widehat{maP}) & \mathrm{Cov}(\widehat{maP}, \widehat{maR}) \\ \mathrm{Cov}(\widehat{maP}, \widehat{maR}) & \mathrm{Var}(\widehat{maR}) \end{pmatrix}.

This is a 2 × 2 matrix with

\mathrm{Var}(\widehat{maP}) = \frac{1}{r^2} \left( \sum_{i=1}^{r} \frac{p_{ii} \sum_{j \ne i} p_{ij}}{p_{i\cdot}^3} \right) / n,
\mathrm{Var}(\widehat{maR}) = \frac{1}{r^2} \left( \sum_{i=1}^{r} \frac{p_{ii} \sum_{j \ne i} p_{ji}}{p_{\cdot i}^3} \right) / n,
\mathrm{Cov}(\widehat{maP}, \widehat{maR}) = \frac{1}{r^2} \left\{ \sum_{i=1}^{r} \frac{\left(\sum_{j \ne i} p_{ij}\right) p_{ii} \left(\sum_{j \ne i} p_{ji}\right)}{p_{i\cdot}^2 p_{\cdot i}^2} + \sum_{i=1}^{r} \sum_{j \ne i} \frac{p_{ii} p_{ij} p_{jj}}{p_{i\cdot}^2 p_{\cdot j}^2} \right\} / n.

Using the multivariate delta-method again, we get

\sqrt{n}(\widehat{maF1^*} - maF1^*) \,\dot\sim\, \mathrm{Normal}\left(0, \left[\frac{\partial(maF1^*)}{\partial m}\right]^T \Sigma \left[\frac{\partial(maF1^*)}{\partial m}\right]\right),

where

\frac{\partial(maF1^*)}{\partial m} = \begin{pmatrix} \dfrac{2 \, maR^2}{(maP + maR)^2} \\ \dfrac{2 \, maP^2}{(maP + maR)^2} \end{pmatrix}.

Using this and Σ above, we obtain

\left[\frac{\partial(maF1^*)}{\partial m}\right]^T \Sigma \left[\frac{\partial(maF1^*)}{\partial m}\right] = \frac{4 n \left\{ maR^4 \mathrm{Var}(\widehat{maP}) + 2 \, maP^2 maR^2 \mathrm{Cov}(\widehat{maP}, \widehat{maR}) + maP^4 \mathrm{Var}(\widehat{maR}) \right\}}{(maP + maR)^4}.

Finally, we have

maF1*^~˙Normal(maF1*,Var(maF1*^)),

where

\mathrm{Var}(\widehat{maF1^*}) = \frac{4 \left\{ maR^4 \mathrm{Var}(\widehat{maP}) + 2 \, maP^2 maR^2 \mathrm{Cov}(\widehat{maP}, \widehat{maR}) + maP^4 \mathrm{Var}(\widehat{maR}) \right\}}{(maP + maR)^4}.

Appendix D: R code

The following R code computes point estimates and confidence intervals for miF1, maF1, and maF1*.

## Takahashi et al. ##
## Computation of F1 score and its confidence interval ##

f1scores <- function(mat, conf.level=0.95){
   ## This function computes point estimates and (conf.level*100%) confidence intervals
   ## for microF1, macroF1, and macroF1* scores.

   ## mat is an r by r matrix (confusion matrix).
   ## Rows indicate the predicted (fitted) conditions,
   ## and columns indicate the truth.
   ## miF1 is micro F1
   ## maF1 is macro F1
   ## maF2 is macro F1* (Sokolova and Lapalme)

   ## ###### ##
   ## Set up ##
   ## ###### ##
   r <- ncol(mat)
   n <- sum(mat) ## Total sample size
   p <- mat/n ## probabilities
      pii <- diag(p)
      pi. <- rowSums(p)
      p.i <- colSums(p)

   ## ############### ##
   ## Point estimates ##
   ## ############### ##
      miP <- miR <- sum(pii) ## MICRO precision, recall
   miF1 <- miP ## MICRO F1
      F1i <- 2*pii/(pi.+p.i)
   maF1 <- sum(F1i)/r ## MACRO F1
      maP <- sum(pii/rowSums(p))/r ## MACRO precision
      maR <- sum(pii/colSums(p))/r ## MACRO recall
   maF2 <- 2*(maP*maR)/(maP+maR) ## MACRO F1*

   ## ################## ##
   ## Variance estimates ##
   ## ################## ##

   ## ----------------- ##
   ## MICRO F1 Variance ##
   ## ----------------- ##
   miF1.v <- sum(pii)*(1-sum(pii))/n
   miF1.s <- sqrt(miF1.v)

   ## ----------------- ##
   ## MACRO F1 Variance ##
   ## ----------------- ##



   ## 'a' and 'b' are the first and second summands of Var(maF1-hat)
   a <- sum(F1i*(pi.+p.i-2*pii)/(pi.+p.i)^2 *
            ((pi.+p.i-2*pii)/(pi.+p.i) + F1i/2))
   b <- 0
   for(i in 1:r){
         jj <- (1:r)[-i]
      for(j in jj){
         b <- b + p[i,j]*F1i[i]*F1i[j]/((pi.[i]+p.i[i])*(pi.[j]+p.i[j]))
   }}
   maF1.v <- 2*(a+b)/(n*r^2)
   maF1.s <- sqrt(maF1.v)

   ## ------------------ ##
   ## MACRO F1* Variance ##
   ## ------------------ ##

   varmap <- sum(pii*(pi.-pii)/pi.^3) / r^2 / n
   varmar <- sum(pii*(p.i-pii)/p.i^3) / r^2 / n
   covmpr1 <- sum( ((pi.-pii) * pii * (p.i-pii)) / (pi.^2 * p.i^2) )
   covmpr2 <- 0
      for(i in 1:r){
         covmpr2 <- covmpr2 + sum(pii[i] * p[i,-i] * pii[-i] / pi.[i]^2 / p.i[-i]^2)
      }
   covmpr <- (covmpr1+covmpr2) / r^2 / n
   maF2.v <- 4 * (maR^4*varmap + 2*maP^2*maR^2*covmpr + maP^4*varmar) / (maP+maR)^4
   maF2.s <- sqrt(maF2.v)

   ## #################### ##
   ## Confidence intervals ##
   ## #################### ##
   z <- qnorm(1-(1-conf.level)/2)
      miF1.ci <- miF1 + c(-1,1)*z*miF1.s
      maF1.ci <- maF1 + c(-1,1)*z*maF1.s
      maF2.ci <- maF2 + c(-1,1)*z*maF2.s


   ## ################# ##
   ## Formatting output ##
   ## ################# ##
   pr <- data.frame(microPrecision=miP, microRecall=miR, macroPrecision=maP, macroRecall=maR)
   fss <- data.frame(
         rbind(miF1=c(miF1, miF1.s, miF1.ci),
            maF1=c(maF1, maF1.s, maF1.ci),
            maF1.star=c(maF2, maF2.s, maF2.ci)))
      names(fss) <- c('PointEst','Sd', 'Lower','Upper')
      out <- list(pr, fss)
      names(out) <- c(’Precision.and.Recall’, ’Confidence.Interval’)
      out
}


## Example ##
## Table V from Dong et al. (2017) PMID: 28767373
mnn <- cbind(c(5022,577,188,19,395),
         c(407,2468,989,4,965),
         c(130,630,27254,1021,763),
         c(13,0,1236,6399,5),
         c(103,258,609,0,9611)
         )

f1scores(mnn)

## End ##

Footnotes

Code Availability The R code for computing point estimates and confidence intervals for miF1, maF1, and maF1* is available in Appendix D.

Conflict of Interests None.

References

  • 1.van Rijsbergen CJ (1979) Information retrieval. Butterworths, Oxford [Google Scholar]
  • 2.Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge [Google Scholar]
  • 3.Sokolova M, Lapalme G (2009) A systematic analysis of performance measures for classification tasks. Inf Process Manage 45:427–437 [Google Scholar]
  • 4.Wang Y, Li J, Li Y, Wangi R, Yang X (2015) Confidence interval for F1 measure of algorithm performance based on blocked 3 × 2 cross-validation. IEEE Trans Knowl Data Eng 27:651–659 [Google Scholar]
  • 5.Dong H, Supratak A, Pan W, Wu C, Matthews PM, Guo Y (2018) Mixed neural network approach for temporal sleep stage classification. IEEE Trans Neural Syst Rehabil Eng 26(2):324–333 [DOI] [PubMed] [Google Scholar]
  • 6.Wang J, Zhang J, An Y, Lin H, Yang Z, Zhang Y, Sun Y (2016) Biomedical event trigger detection by dependency-based word embedding. BMC Med Genomics 2(9 Suppl):45. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Socoró JC, Alías F, Alsina-Pagès RM (2017) An anomalous noise events detector for dynamic road traffic noise mapping in real-life urban and suburban environments. Sensors (Basel) 17(10) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Chowdhury S, Dong X, Qian L, Li X, Guan Y, Yang J, Yu Q (2018) A multitask bi-directional RNN model for named entity recognition on Chinese electronic medical records. BMC Bioinforma 19(Suppl 17):499. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Troya-Galvis A, Gançarski P, Berti-Équille L (2018) Remote sensing image analysis by aggregation of segmentation-classification collaborative agents. Pattern Recognit 73:259–274 [Google Scholar]
  • 10.Hong N, Wen A, Stone DJ, Tsuji S, Kingsbury PR, Rasmussen LV, Pacheco JA, Adekkanattu P, Wang F, Luo Y, Pathak J, Liu H, Jiang G (2019) Developing a FHIRbased EHR phenotyping framework: A case study for identification of patients with obesity and multiple comorbidities from discharge summaries. J Biomed Inform 99:103310. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Li L, Zhong B, Hutmacher C, Liang Y, Horrey WJ, Xu X (2020) Detection of driver manual distraction via image-based hand and ear recognition. Accid Anal Prev 137:105432. [DOI] [PubMed] [Google Scholar]
  • 12.Zhou H, Ma Y, Li X (2020) Feature selection based on term frequency deviation rate for text classification. Appl Intell [Google Scholar]
  • 13.Rashid MM, Kamruzzaman J, Hassan MM, Imam T, Gordon S (2020) Cyberattacks detection in IoT-based smart city applications using machine learning techniques. Int J Environ Res Public Health 17(24) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Wang SH, Nayak DR, Guttery DS, Zhang X, Zhang YD (2021) COVID-19 classification by CCSHNet with deep fusion using transfer learning and discriminant correlation analysis. Inf Fusion 68:131–148 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Hao J, Yue K, Zhang B, Duan L, Fu X (2021) Transfer learning of bayesian network for measuring qos of virtual machines. Appl Intell [Google Scholar]
  • 16.Li J, Lin M (2021) Ensemble learning with diversified base models for fault diagnosis in nuclear power plants. Ann Nucl Energy 158:108265 [Google Scholar]
  • 17.Zhang D, Wang J, Zhao X (2015) Estimating the uncertainty of average F1 scores. In: Proceedings of the 2015 International conference on the theory of information retrieval [Google Scholar]
  • 18.Zhu F, Li X, Mcgonigle D, Tang H, He Z, Zhang C, Hung GU, Chiu PY, Zhou W (2020) Analyze informant-based questionnaire for the early diagnosis of senile dementia using deep learning. IEEE J Transl Eng Health Med 8:2200106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Bhalla S, Kaur H, Kaur R, Sharma S, Raghava GPS (2020) Expression based biomarkers and models to classify early and late-stage samples of papillary thyroid carcinoma. PLoS One 15(4):e0231629. [DOI] [PMC free article] [PubMed] [Google Scholar]
