Abstract
In modern medicine, medical tests are used for various purposes including diagnosis, disease screening, prognosis, and risk prediction. To quantify the performance of a binary medical test, we often use sensitivity, specificity, and negative and positive predictive values as measures. Additionally, the $F_1$-score, which is defined as the harmonic mean of precision (positive predictive value) and recall (sensitivity), has come to be used in the medical field due to its favorable characteristics. The $F_1$-score has been extended to multi-class classification in two forms: a micro-averaged $F_1$-score and a macro-averaged $F_1$-score. The micro-averaged $F_1$-score pools per-sample classifications across classes and then calculates the overall $F_1$-score, whereas the macro-averaged $F_1$-score computes an arithmetic mean of the $F_1$-scores for each class. Additionally, Sokolova and Lapalme1 gave an alternative definition of the macro-averaged $F_1$-score as the harmonic mean of the arithmetic means of the precision and recall over classes. Although some statistical methods of inference for binary and multi-class $F_1$-scores have been proposed, hypothesis testing procedures for them have not yet been fully developed. Therefore, we aim to develop a hypothesis testing procedure for comparing two $F_1$-scores in a paired study design based on the large sample multivariate central limit theorem.
Keywords: delta-method, F1 measures, multi-class classification, precision, recall
1 |. INTRODUCTION
Medical tests are important for the early detection and treatment of disease in modern medicine. Tests are used for various purposes including diagnosis, disease screening, prognosis, and risk prediction. Several measures exist to quantify test performance; sensitivity, specificity, and positive and negative predictive values are commonly used for binary tests. Additionally, the $F_1$-score for binary data (binary $F_1$-score), which is defined as the harmonic mean of precision (positive predictive value) and recall (sensitivity), has been used in the medical field.1,2
The binary $F_1$-score is especially useful when evaluation of true negatives is relatively unimportant because true negatives are not included in the computation of either precision or recall. In addition, the binary $F_1$-score behaves sensibly for a poor diagnostic test that identifies the majority of the data as positive. In this situation, a simple arithmetic mean of precision and recall will be at least 0.50 because recall will be 1.00 if all the data are diagnosed as positive. However, the binary $F_1$-score will be appropriately low in these instances: it will be 0.18 and 0.02 when the precision is 0.10 and 0.01, respectively, even if recall is 1.00. Therefore, the $F_1$-score is a better statistic to report.2
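The arithmetic in the preceding paragraph can be checked with a few lines of Python. This is only an illustrative sketch; the function names `arithmetic_mean` and `f1_score` are introduced here and are not part of the article.

```python
def arithmetic_mean(precision, recall):
    """Simple average of precision and recall."""
    return (precision + recall) / 2.0


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# A test that labels everything as positive has recall 1.00 but poor precision.
for precision in (0.10, 0.01):
    recall = 1.00
    print(
        f"precision={precision:.2f}  "
        f"arithmetic mean={arithmetic_mean(precision, recall):.3f}  "
        f"F1={f1_score(precision, recall):.3f}"
    )
```

The arithmetic mean stays above 0.50 in both cases, while the harmonic mean is pulled toward the smaller of the two inputs.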
Most measures of the performance of medical tests are only applicable to binary classification data, and multi-class classification data need to be dichotomized to compute these measures. In the motivating example,3 for instance, skin cancer images were originally classified into six categories (malignant melanoma (MM), basal cell carcinoma (BCC), nevus, seborrheic keratosis (SK), senile lentigo (SL) and hematoma/hemangioma (H/H)), and the classification performances of board-certified dermatologists and dermatologic trainees were compared. The classification performance was assessed by accuracy, sensitivity, specificity, false negative rate, false positive rate, and positive predictive value after dichotomizing the six categories (MM and BCC vs. nevus, SK, SL, and H/H). However, evaluating the performance with the original six categories would have been preferable because dichotomization led to a loss of information regarding the performance of this classification.4,5
As measures of multi-class classification performance, a micro-averaged $F_1$-score and a macro-averaged $F_1$-score have been proposed.2 The micro-averaged $F_1$-score calculates the overall $F_1$-score by pooling per-sample classifications across classes. In contrast, the macro-averaged $F_1$-score computes an arithmetic mean of the $F_1$-scores for each class. In addition, Sokolova and Lapalme6 proposed an alternative macro-averaged $F_1$-score as the harmonic mean of the arithmetic means of the precisions and recalls for each class.
Although $F_1$-scores for binary and multi-class classifications were originally used for measuring the performance of text classification in the field of information retrieval or of a classifier in machine learning, they have become frequently used in medicine.7–14 Some statistical methods for inference have been proposed for the binary $F_1$-score,15 and methods for estimating confidence intervals of the micro-averaged and macro-averaged $F_1$-scores have been developed.16,17 However, these previous methods are for one-sample inference. To our knowledge, no method is available for hypothesis testing of $F_1$-scores for paired samples, as in our motivating example, or for two independent samples. Thus, we aim to provide methods for comparing the binary $F_1$-scores, micro-averaged $F_1$-scores, and macro-averaged $F_1$-scores in the paired-design setting. For the two-independent-sample setting, the proposed method is readily applicable by setting the covariance part of the test statistics to 0.
The layout of this article is as follows: In Section 2, the definitions of the binary $F_1$-score, micro-averaged $F_1$-score, and macro-averaged $F_1$-score are reviewed. Test statistics for comparing those scores are derived in Section 3. Then, the simulation results of the proposed statistics and the application to the motivating example are presented in Sections 4 and 5, respectively. Finally, a brief discussion is provided in Section 6.
2 |. REVIEW OF F1-SCORES
This section introduces notations and definitions of the binary $F_1$-score ($biF$), micro-averaged $F_1$-score ($miF$), and macro-averaged $F_1$-score ($maF$). Consider an $r \times r \times r$ table of data for a nominal categorical variable with levels $1, 2, \ldots, r$. Each true class has an $r \times r$ table representing prediction frequencies of the two tests to be compared.
This arrangement of data represents the binary classification when $r = 2$, and the multi-class classification when $r > 2$. Table 1 shows general notations for each cell probability $p_{ijk}$, where $i$ indicates the class of Test 1, $j$ indicates the class of Test 2, and $k$ indicates the true condition. A dot in place of a subscript denotes summation over that index. Let Test 1 be a new medical test and Test 2 be an existing medical test. We consider hypothesis testing to compare the $F_1$-scores of Test 1 and Test 2. Using these notations, the true positive rate $TP_{1i}$, the false positive rate $FP_{1i}$, and the false negative rate $FN_{1i}$ for each class $i$ in Test 1 are defined as follows:
$$TP_{1i} = p_{i \cdot i}, \qquad FP_{1i} = \sum_{k \neq i} p_{i \cdot k}, \qquad FN_{1i} = \sum_{i' \neq i} p_{i' \cdot i}.$$
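TABLE 1.
| | | True condition = 1, Test 2 = 1 | True condition = 1, Test 2 = 2 | ⋯ | True condition = 1, Test 2 = r | Total | ⋯ | True condition = r, Test 2 = 1 | True condition = r, Test 2 = 2 | ⋯ | True condition = r, Test 2 = r | Total | Overall total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Test 1 | 1 | $p_{111}$ | $p_{121}$ | ⋯ | $p_{1r1}$ | $p_{1 \cdot 1}$ | ⋯ | $p_{11r}$ | $p_{12r}$ | ⋯ | $p_{1rr}$ | $p_{1 \cdot r}$ | $p_{1 \cdot \cdot}$ |
| | 2 | $p_{211}$ | $p_{221}$ | ⋯ | $p_{2r1}$ | $p_{2 \cdot 1}$ | ⋯ | $p_{21r}$ | $p_{22r}$ | ⋯ | $p_{2rr}$ | $p_{2 \cdot r}$ | $p_{2 \cdot \cdot}$ |
| | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ |
| | r | $p_{r11}$ | $p_{r21}$ | ⋯ | $p_{rr1}$ | $p_{r \cdot 1}$ | ⋯ | $p_{r1r}$ | $p_{r2r}$ | ⋯ | $p_{rrr}$ | $p_{r \cdot r}$ | $p_{r \cdot \cdot}$ |
| | Total | $p_{\cdot 11}$ | $p_{\cdot 21}$ | ⋯ | $p_{\cdot r1}$ | $p_{\cdot \cdot 1}$ | ⋯ | $p_{\cdot 1r}$ | $p_{\cdot 2r}$ | ⋯ | $p_{\cdot rr}$ | $p_{\cdot \cdot r}$ | 1 |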
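Note that $TP_{1i} + FP_{1i} = p_{i \cdot \cdot}$ and $TP_{1i} + FN_{1i} = p_{\cdot \cdot i}$, the marginal probabilities of class $i$ for Test 1 and for the true condition, respectively. Similarly, $TP_{2i}$, $FP_{2i}$, and $FN_{2i}$ for each class $i$ for Test 2 are defined as follows:
$$TP_{2i} = p_{\cdot i i}, \qquad FP_{2i} = \sum_{k \neq i} p_{\cdot i k}, \qquad FN_{2i} = \sum_{j \neq i} p_{\cdot j i}.$$
Note that $TP_{2i} + FP_{2i} = p_{\cdot i \cdot}$, and $TP_{2i} + FN_{2i} = p_{\cdot \cdot i}$.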
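2.1 |. Binary $F_1$-score
When $r = 2$, we consider the following precision ($biP_1$) and recall ($biR_1$) for Test 1, with class 1 taken as the positive class:
$$biP_1 = \frac{p_{1 \cdot 1}}{p_{1 \cdot \cdot}}, \qquad biR_1 = \frac{p_{1 \cdot 1}}{p_{\cdot \cdot 1}}.$$
The binary $F_1$-score for Test 1 is defined as the harmonic mean of $biP_1$ and $biR_1$, that is,
$$biF_1 = \frac{2\, biP_1\, biR_1}{biP_1 + biR_1} = \frac{2\, p_{1 \cdot 1}}{p_{1 \cdot \cdot} + p_{\cdot \cdot 1}}. \tag{1}$$
Similarly, the binary $F_1$-score for Test 2 is as follows:
$$biF_2 = \frac{2\, biP_2\, biR_2}{biP_2 + biR_2} = \frac{2\, p_{\cdot 1 1}}{p_{\cdot 1 \cdot} + p_{\cdot \cdot 1}}. \tag{2}$$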
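2.2 |. Micro-averaged $F_1$-score
When $r > 2$, the micro-averaged precision ($miP$) and micro-averaged recall ($miR$) are obtained from the sums over the classes of the per-class true positive, false positive, and false negative rates. $miP$ and $miR$ for Test 1 can be written as
$$miP_1 = \frac{\sum_{i=1}^{r} p_{i \cdot i}}{\sum_{i=1}^{r} p_{i \cdot \cdot}} = \sum_{i=1}^{r} p_{i \cdot i}, \qquad miR_1 = \frac{\sum_{i=1}^{r} p_{i \cdot i}}{\sum_{i=1}^{r} p_{\cdot \cdot i}} = \sum_{i=1}^{r} p_{i \cdot i},$$
because $\sum_i p_{i \cdot \cdot} = \sum_i p_{\cdot \cdot i} = 1$. Finally, as the harmonic mean of $miP_1$ and $miR_1$, we have the micro-averaged $F_1$-score for Test 1 as
$$miF_1 = \frac{2\, miP_1\, miR_1}{miP_1 + miR_1} = \sum_{i=1}^{r} p_{i \cdot i}. \tag{3}$$
Similarly, the micro-averaged $F_1$-score for Test 2 is
$$miF_2 = \frac{2\, miP_2\, miR_2}{miP_2 + miR_2} = \sum_{i=1}^{r} p_{\cdot i i}. \tag{4}$$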
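2.3 |. Macro-averaged $F_1$-score
When $r > 2$, to define the macro-averaged $F_1$-score for Test 1 ($maF_1$), first consider the following precision and recall ($P_{1i}$ and $R_{1i}$) within each class, $i = 1, \ldots, r$:
$$P_{1i} = \frac{p_{i \cdot i}}{p_{i \cdot \cdot}}, \tag{5}$$
$$R_{1i} = \frac{p_{i \cdot i}}{p_{\cdot \cdot i}}. \tag{6}$$
The $F_1$-score within each class for Test 1 is defined as the harmonic mean of $P_{1i}$ and $R_{1i}$, that is,
$$F_{1i} = \frac{2\, P_{1i}\, R_{1i}}{P_{1i} + R_{1i}} = \frac{2\, p_{i \cdot i}}{p_{i \cdot \cdot} + p_{\cdot \cdot i}}.$$
The macro-averaged $F_1$-score for Test 1 is the simple arithmetic mean of $F_{1i}$:
$$maF_1 = \frac{1}{r} \sum_{i=1}^{r} F_{1i}. \tag{7}$$
Similarly, with $F_{2i} = 2\, p_{\cdot i i} / (p_{\cdot i \cdot} + p_{\cdot \cdot i})$, the macro-averaged $F_1$-score for Test 2 is
$$maF_2 = \frac{1}{r} \sum_{i=1}^{r} F_{2i}. \tag{8}$$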
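2.4 |. Alternate definition of macro-averaged $F_1$-score
Sokolova and Lapalme6 gave an alternative definition of the macro-averaged $F_1$-score ($maF^*$). First, the macro-averaged precision and macro-averaged recall for Test 1 are defined as simple arithmetic means of the within-class precision and within-class recall in (5) and (6), respectively:
$$maP_1 = \frac{1}{r} \sum_{i=1}^{r} \frac{p_{i \cdot i}}{p_{i \cdot \cdot}}, \qquad maR_1 = \frac{1}{r} \sum_{i=1}^{r} \frac{p_{i \cdot i}}{p_{\cdot \cdot i}}.$$
The alternate definition of the macro-averaged $F_1$-score for Test 1 is the harmonic mean of these quantities:
$$maF^*_1 = \frac{2\, maP_1\, maR_1}{maP_1 + maR_1}. \tag{9}$$
Similarly, with $maP_2 = \frac{1}{r} \sum_i p_{\cdot i i} / p_{\cdot i \cdot}$ and $maR_2 = \frac{1}{r} \sum_i p_{\cdot i i} / p_{\cdot \cdot i}$, the alternate definition of the macro-averaged $F_1$-score for Test 2 is
$$maF^*_2 = \frac{2\, maP_2\, maR_2}{maP_2 + maR_2}. \tag{10}$$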
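As a concrete illustration of the definitions in this section, the following Python sketch computes the four scores for a single test from a table of counts (rows: class assigned by the test; columns: true condition). The counts are the FRCNN counts of Table B2; the function names and the choice of MM and BCC as the positive (malignant) group for the binary score follow the example in Section 5, but the code itself is only an assumption-laden sketch and not the authors' implementation.

```python
# Rows: class assigned by the FRCNN; columns: true condition.
# Class order: MM, BCC, Nevus, SK, H/H, SL (counts copied from Table B2).
FRCNN = [
    [327,   6,  42,  21,  3,  0],
    [  9, 108,   6,   9,  0,  0],
    [ 48,  12, 967,  36, 18,  0],
    [ 21,   6,  30, 223,  0,  3],
    [  0,   0,   3,   0, 57,  0],
    [  3,   0,   0,   0,  0, 42],
]


def multiclass_f1(table):
    """Return miF, maF and maF* for one test (assumes no empty row or column)."""
    r = len(table)
    n = sum(sum(row) for row in table)
    tp = [table[i][i] / n for i in range(r)]
    predicted = [sum(table[i]) / n for i in range(r)]                    # TP_i + FP_i
    actual = [sum(table[i][k] for i in range(r)) / n for k in range(r)]  # TP_i + FN_i
    precision = [tp[i] / predicted[i] for i in range(r)]
    recall = [tp[i] / actual[i] for i in range(r)]
    per_class_f = [2 * tp[i] / (predicted[i] + actual[i]) for i in range(r)]
    ma_p = sum(precision) / r
    ma_r = sum(recall) / r
    return {
        "miF": sum(tp),  # pooled precision and recall both equal the sum of TP_i
        "maF": sum(per_class_f) / r,
        "maF*": 2 * ma_p * ma_r / (ma_p + ma_r),
    }


def binary_f1(table, positive):
    """biF after collapsing the classes listed in `positive` into one positive class."""
    r = len(table)
    n = sum(sum(row) for row in table)
    tp = sum(table[i][k] for i in positive for k in positive) / n
    predicted = sum(sum(table[i]) for i in positive) / n
    actual = sum(table[i][k] for i in range(r) for k in positive) / n
    return 2 * tp / (predicted + actual)


scores = multiclass_f1(FRCNN)
scores["biF"] = binary_f1(FRCNN, positive=(0, 1))  # MM and BCC = malignant
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, these values agree with the FRCNN column of Table 5 (0.840, 0.862, 0.846, and 0.848 for biF, miF, maF, and maF*).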
3 |. PROPOSED HYPOTHESIS TESTING PROCEDURE
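In this section, we derive the test statistics for comparing two $F_1$-scores ($biF_1$ and $biF_2$; $miF_1$ and $miF_2$; $maF_1$ and $maF_2$; and $maF^*_1$ and $maF^*_2$). We assume that the observed frequencies, $n_{ijk}$ for $i, j, k = 1, \ldots, r$, have a multinomial distribution with overall sample size $N = \sum_{i,j,k} n_{ijk}$ and probabilities $\mathbf{p} = (p_{111}, \ldots, p_{rrr})^T$, where $i$ indicates the class of Test 1, $j$ indicates the class of Test 2, $k$ indicates the true condition, and “T” represents the transpose. The maximum likelihood estimate (MLE) of $\mathbf{p}$ is $\hat{\mathbf{p}} = (\hat{p}_{111}, \ldots, \hat{p}_{rrr})^T$. That is,
$$\hat{p}_{ijk} = \frac{n_{ijk}}{N}.$$
By the invariance property of MLEs, the MLEs of the quantities defined in the previous section can be obtained by replacing $p_{ijk}$ with $\hat{p}_{ijk}$.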
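3.1 |. Test statistic for comparing two $biF$
Let $\mathbf{biF} = (biF_1, biF_2)^T$ be a vector whose components are the $biF$ of the two medical tests, and let $\widehat{\mathbf{biF}}$ be the MLE of $\mathbf{biF}$, which can be obtained by replacing $p_{ijk}$ with their MLEs $\hat{p}_{ijk}$ in (1) and (2).
Using the delta-method and the multivariate central limit theorem, we have
$$\widehat{\mathbf{biF}} \;\dot{\sim}\; N\!\left(\mathbf{biF},\; \frac{1}{N}\, \frac{\partial \mathbf{biF}}{\partial \mathbf{p}} \left(\mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T\right) \left(\frac{\partial \mathbf{biF}}{\partial \mathbf{p}}\right)^{T}\right),$$
where $\mathrm{diag}(\mathbf{p})$ is an $r^3 \times r^3$ diagonal matrix whose diagonal elements are the elements of $\mathbf{p}$, and “$\dot{\sim}$” represents “approximately distributed as”. The Wald statistic for testing $H_0: biF_1 = biF_2$ vs. $H_1: biF_1 \neq biF_2$, therefore, is
$$\frac{(\widehat{biF}_1 - \widehat{biF}_2)^2}{\widehat{\mathrm{Var}}(\widehat{biF}_1 - \widehat{biF}_2)},$$
where $\widehat{\mathrm{Var}}(\widehat{biF}_1 - \widehat{biF}_2)$ is the variance of $\widehat{biF}_1 - \widehat{biF}_2$ with $\mathbf{p}$ replaced by $\hat{\mathbf{p}}$. The derivation of this variance appears in Appendix A.1. The test statistic is distributed asymptotically as a chi-square distribution with one degree of freedom under the null hypothesis.
As a side note, the confidence interval of $biF$ for each test can be derived in the same way. The $100(1 - \alpha)\%$ confidence interval of $biF_1$ and $biF_2$ is
$$\widehat{biF}_t \pm z_{1 - \alpha/2} \sqrt{\widehat{\mathrm{Var}}(\widehat{biF}_t)}, \qquad t = 1, 2,$$
where $z_{1 - \alpha/2}$ denotes the $100(1 - \alpha/2)$-th percentile of the standard normal distribution, and $\widehat{\mathrm{Var}}(\widehat{biF}_1)$ and $\widehat{\mathrm{Var}}(\widehat{biF}_2)$ are the variance of $\widehat{biF}_1$ and the variance of $\widehat{biF}_2$ with $\mathbf{p}$ replaced by $\hat{\mathbf{p}}$. These simple formulas based on the multinomial distribution have not been proposed yet. Wang et al. proposed a confidence interval of the binary $F_1$-score based on the beta prime distribution and associated calculations using the bootstrap method.15,18
For the score statistic, we consider the MLE of $\mathbf{p}$ under the null hypothesis, $\tilde{\mathbf{p}}$, which could be obtained, for example, by applying the Newton-Raphson method to the log-likelihood equations. The score statistic for testing $H_0: biF_1 = biF_2$ vs. $H_1: biF_1 \neq biF_2$ is
$$\frac{(\widehat{biF}_1 - \widehat{biF}_2)^2}{\widetilde{\mathrm{Var}}(\widehat{biF}_1 - \widehat{biF}_2)},$$
where $\widetilde{\mathrm{Var}}(\widehat{biF}_1 - \widehat{biF}_2)$ is the variance of $\widehat{biF}_1 - \widehat{biF}_2$ with $\mathbf{p}$ replaced by $\tilde{\mathbf{p}}$, that is, calculated from the MLE of $\mathbf{p}$ under the null hypothesis.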
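The Wald construction described in this subsection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R code: the gradient is approximated by central finite differences instead of the analytic derivatives of Appendix A.1, the 2 × 2 × 2 table `TOY` is made-up data, and the names `wald_test` and `bif_difference` are hypothetical. Index 0 plays the role of the positive class.

```python
import math


def bif_difference(p):
    """biF_1 - biF_2 for a 2 x 2 x 2 probability table p[i][j][k]."""
    tp1 = sum(p[0][j][0] for j in range(2))                          # p_{1.1}
    pos1 = sum(p[0][j][k] for j in range(2) for k in range(2))       # p_{1..}
    tp2 = sum(p[i][0][0] for i in range(2))                          # p_{.11}
    pos2 = sum(p[i][0][k] for i in range(2) for k in range(2))       # p_{.1.}
    true_pos = sum(p[i][j][0] for i in range(2) for j in range(2))   # p_{..1}
    return 2 * tp1 / (pos1 + true_pos) - 2 * tp2 / (pos2 + true_pos)


def wald_test(counts, diff, eps=1e-6):
    """Delta-method Wald test of H0: diff(p) = 0 for a paired r x r x r table.

    counts[i][j][k]: frequency with Test 1 class i, Test 2 class j, true class k.
    """
    r = len(counts)
    cells = [(i, j, k) for i in range(r) for j in range(r) for k in range(r)]
    n = sum(counts[i][j][k] for i, j, k in cells)
    p = [[[counts[i][j][k] / n for k in range(r)] for j in range(r)] for i in range(r)]
    estimate = diff(p)

    # Numerical gradient of diff with respect to each cell probability.
    grad = {}
    for i, j, k in cells:
        original = p[i][j][k]
        p[i][j][k] = original + eps
        upper = diff(p)
        p[i][j][k] = original - eps
        lower = diff(p)
        p[i][j][k] = original
        grad[i, j, k] = (upper - lower) / (2 * eps)

    # g' (diag(p) - p p') g / n, the delta-method variance under the multinomial model.
    second = sum(grad[i, j, k] ** 2 * p[i][j][k] for i, j, k in cells)
    first = sum(grad[i, j, k] * p[i][j][k] for i, j, k in cells)
    variance = (second - first ** 2) / n
    statistic = estimate ** 2 / variance
    p_value = math.erfc(math.sqrt(statistic / 2))  # upper tail of chi-square, 1 df
    return estimate, statistic, p_value


# Made-up paired counts: TOY[i][j][k], 0 = positive, 1 = negative.
TOY = [
    [[40, 5], [10, 5]],
    [[4, 6], [6, 24]],
]
estimate, statistic, p_value = wald_test(TOY, bif_difference)
print(f"difference={estimate:.4f}  Wald={statistic:.3f}  p={p_value:.4f}")
```

Passing a different `diff` function (for example, the difference of two macro-averaged scores) reuses the same variance formula, which is the reason the four tests in this section share one structure.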
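3.2 |. Test statistic for comparing two $miF$
As shown in (3) and (4), $miF_1 = \sum_{i=1}^{r} p_{i \cdot i}$ and $miF_2 = \sum_{i=1}^{r} p_{\cdot i i}$, and the MLEs of $miF_1$ and $miF_2$ are
$$\widehat{miF}_1 = \sum_{i=1}^{r} \hat{p}_{i \cdot i}, \qquad \widehat{miF}_2 = \sum_{i=1}^{r} \hat{p}_{\cdot i i}.$$
Again by the delta-method and multivariate central limit theorem (Appendix A.2), the Wald statistic for testing $H_0: miF_1 = miF_2$ versus $H_1: miF_1 \neq miF_2$ is
$$\frac{(\widehat{miF}_1 - \widehat{miF}_2)^2}{\widehat{\mathrm{Var}}(\widehat{miF}_1 - \widehat{miF}_2)},$$
where $\widehat{\mathrm{Var}}(\widehat{miF}_1 - \widehat{miF}_2)$ is the variance of $\widehat{miF}_1 - \widehat{miF}_2$ with $\mathbf{p}$ replaced by its MLE $\hat{\mathbf{p}}$. The test statistic is distributed asymptotically as a chi-square distribution with one degree of freedom under the null hypothesis.
Again, to develop the score statistic, we consider the MLE of $\mathbf{p}$ under the null hypothesis, $\tilde{\mathbf{p}}$, as in the case of $biF$. The score statistic for testing $H_0: miF_1 = miF_2$ versus $H_1: miF_1 \neq miF_2$ is
$$\frac{(\widehat{miF}_1 - \widehat{miF}_2)^2}{\widetilde{\mathrm{Var}}(\widehat{miF}_1 - \widehat{miF}_2)},$$
where $\widetilde{\mathrm{Var}}(\widehat{miF}_1 - \widehat{miF}_2)$ is the variance of $\widehat{miF}_1 - \widehat{miF}_2$ with $\mathbf{p}$ replaced by $\tilde{\mathbf{p}}$, that is, calculated from the MLE of $\mathbf{p}$ under the null hypothesis.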
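3.3 |. Test statistic for comparing two $maF$
The MLEs of $maF_1$ and $maF_2$ can be obtained by replacing the within-class $F_1$-scores $F_{1i}$ and $F_{2i}$ with their MLEs in (7) and (8):
$$\widehat{maF}_1 = \frac{1}{r} \sum_{i=1}^{r} \hat{F}_{1i}, \qquad \widehat{maF}_2 = \frac{1}{r} \sum_{i=1}^{r} \hat{F}_{2i}.$$
Again by the delta-method and multivariate central limit theorem (Appendix A.3), we have the Wald statistic for testing $H_0: maF_1 = maF_2$ versus $H_1: maF_1 \neq maF_2$ as
$$\frac{(\widehat{maF}_1 - \widehat{maF}_2)^2}{\widehat{\mathrm{Var}}(\widehat{maF}_1 - \widehat{maF}_2)},$$
where $\widehat{\mathrm{Var}}(\widehat{maF}_1 - \widehat{maF}_2)$ is the variance of $\widehat{maF}_1 - \widehat{maF}_2$ with $\mathbf{p}$ replaced by its MLE $\hat{\mathbf{p}}$. The test statistic is distributed asymptotically as a chi-square distribution with one degree of freedom under the null hypothesis.
For the score statistic, we consider the MLE of $\mathbf{p}$ under the null hypothesis, $\tilde{\mathbf{p}}$, as in the case of $biF$ and $miF$. The score statistic for testing $H_0: maF_1 = maF_2$ versus $H_1: maF_1 \neq maF_2$ is
$$\frac{(\widehat{maF}_1 - \widehat{maF}_2)^2}{\widetilde{\mathrm{Var}}(\widehat{maF}_1 - \widehat{maF}_2)},$$
where $\widetilde{\mathrm{Var}}(\widehat{maF}_1 - \widehat{maF}_2)$ is the variance of $\widehat{maF}_1 - \widehat{maF}_2$ with $\mathbf{p}$ replaced by $\tilde{\mathbf{p}}$, that is, calculated from the MLE of $\mathbf{p}$ under the null hypothesis.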
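3.4 |. Test statistic for comparing two $maF^*$
To obtain the MLEs of $maF^*_1$ and $maF^*_2$, we first replace the within-class precisions and recalls with their MLEs to get the MLEs of the macro-averaged precision and recall, $\widehat{maP}_t$ and $\widehat{maR}_t$ ($t = 1, 2$), and use these in (9) and (10):
$$\widehat{maF^*}_t = \frac{2\, \widehat{maP}_t\, \widehat{maR}_t}{\widehat{maP}_t + \widehat{maR}_t}, \qquad t = 1, 2.$$
Using the delta-method and multivariate central limit theorem (Appendix A.4), we have the Wald statistic for testing $H_0: maF^*_1 = maF^*_2$ versus $H_1: maF^*_1 \neq maF^*_2$ as
$$\frac{(\widehat{maF^*}_1 - \widehat{maF^*}_2)^2}{\widehat{\mathrm{Var}}(\widehat{maF^*}_1 - \widehat{maF^*}_2)}.$$
Again, to get $\widehat{\mathrm{Var}}(\widehat{maF^*}_1 - \widehat{maF^*}_2)$, all components of the variance of $\widehat{maF^*}_1 - \widehat{maF^*}_2$ are replaced by their respective MLEs. The test statistic is distributed asymptotically as a chi-square distribution with one degree of freedom under the null hypothesis.
For the score statistic, we consider the MLE of $\mathbf{p}$ under the null hypothesis, $\tilde{\mathbf{p}}$, as in the case of $biF$, $miF$, and $maF$. The score statistic for testing $H_0: maF^*_1 = maF^*_2$ versus $H_1: maF^*_1 \neq maF^*_2$ is
$$\frac{(\widehat{maF^*}_1 - \widehat{maF^*}_2)^2}{\widetilde{\mathrm{Var}}(\widehat{maF^*}_1 - \widehat{maF^*}_2)},$$
where $\widetilde{\mathrm{Var}}(\widehat{maF^*}_1 - \widehat{maF^*}_2)$ is the variance of $\widehat{maF^*}_1 - \widehat{maF^*}_2$ with $\mathbf{p}$ replaced by $\tilde{\mathbf{p}}$, that is, calculated from the MLE of $\mathbf{p}$ under the null hypothesis.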
4 |. SIMULATION
4.1 |. Simulation setup
A simulation study was conducted to evaluate the performance of the test statistics proposed in Section 3. We set $r = 3$ (classes 1, 2, and 3), and generated data according to the multinomial distributions with the cell probabilities $\mathbf{p}$ shown in Table 2. Classes 2 and 3 were combined when calculating $biF$. The total sample size, $N$, was set to 100, 300, 500, and 1,000. The nominal type I error rate was set to 0.05 (two-sided test). We used the empirical type I error rate and empirical power as performance measures. For each combination of scenario and sample size, we performed 100,000 repeated simulations.
TABLE 2.
| Scenario | Test 1 | True class = 1, Test 2 = 1 | True class = 1, Test 2 = 2 | True class = 1, Test 2 = 3 | True class = 2, Test 2 = 1 | True class = 2, Test 2 = 2 | True class = 2, Test 2 = 3 | True class = 3, Test 2 = 1 | True class = 3, Test 2 = 2 | True class = 3, Test 2 = 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 40/300 | 10/300 | 10/300 | 5/300 | 10/300 | 5/300 | 5/300 | 5/300 | 10/300 |
| 1 | 2 | 10/300 | 5/300 | 5/300 | 10/300 | 40/300 | 10/300 | 5/300 | 5/300 | 10/300 |
| 1 | 3 | 10/300 | 5/300 | 5/300 | 5/300 | 10/300 | 5/300 | 10/300 | 10/300 | 40/300 |
| 2 | 1 | 120/500 | 30/500 | 30/500 | 5/500 | 10/500 | 5/500 | 5/500 | 5/500 | 10/500 |
| 2 | 2 | 30/500 | 15/500 | 15/500 | 10/500 | 40/500 | 10/500 | 5/500 | 5/500 | 10/500 |
| 2 | 3 | 30/500 | 15/500 | 15/500 | 5/500 | 10/500 | 5/500 | 10/500 | 10/500 | 40/500 |
| 3 | 1 | 30/300 | 15/300 | 15/300 | 5/300 | 10/300 | 5/300 | 5/300 | 5/300 | 10/300 |
| 3 | 2 | 10/300 | 5/300 | 5/300 | 15/300 | 30/300 | 15/300 | 5/300 | 5/300 | 10/300 |
| 3 | 3 | 10/300 | 5/300 | 5/300 | 5/300 | 10/300 | 5/300 | 15/300 | 15/300 | 30/300 |
| 4 | 1 | 90/500 | 45/500 | 45/500 | 5/500 | 10/500 | 5/500 | 5/500 | 5/500 | 10/500 |
| 4 | 2 | 30/500 | 15/500 | 15/500 | 15/500 | 30/500 | 15/500 | 5/500 | 5/500 | 10/500 |
| 4 | 3 | 30/500 | 15/500 | 15/500 | 5/500 | 10/500 | 5/500 | 15/500 | 15/500 | 30/500 |
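Scenarios 1 and 2 are set up to evaluate the empirical type I error rate of the proposed test statistics, while scenarios 3 and 4 are designed to assess their empirical power. In scenario 1, the true conditions of classes 1, 2, and 3 have the same probability (1∕3), and the recalls and precisions within each class are equal in the two tests (60%). Thus, $miF_1 = miF_2 = 0.60$, and for each class, the within-class $F_1$-scores of the two tests are both 0.60. Then, $maF_1 = maF_2 = 0.60$. With classes 2 and 3 combined to calculate $biF$, $biF_1 = biF_2 = 0.60$. Also, the macro-averaged precision and recall are 0.60 for both tests, and $maF^*_1 = maF^*_2 = 0.60$.
In scenario 2, the true condition of class 1 has a higher probability than the others (60% vs. 20%), and the performances of the two tests are equal: $biF_1 = biF_2 = 0.692$, $miF_1 = miF_2 = 0.600$, $maF_1 = maF_2 = 0.564$, and $maF^*_1 = maF^*_2 = 0.579$. Although the distributions in scenario 2 are the same as those in scenario 1 within each true class, the values of $biF$, $maF$, and $maF^*$ differ between the scenarios because precision is large for class 1 (0.818) and, conversely, relatively small for classes 2 and 3 (0.429). In contrast, $miF$ in scenario 2 is the same as that in scenario 1 because the recalls of both tests for each class (0.60) are equal in scenarios 1 and 2.
The true conditions of classes 1, 2, and 3 have the same probability (1∕3) in scenario 3. However, the within-class precision and recall of Test 2 are lower than those of Test 1 (50% vs. 60%), the within-class $F_1$-scores of Test 2 are lower than those of Test 1 (0.50 vs. 0.60), and the true positive rate of Test 2 is lower than that of Test 1 for each class (17% vs. 20%). Therefore, $biF_1 = miF_1 = maF_1 = maF^*_1 = 0.60$, whereas $biF_2 = miF_2 = maF_2 = maF^*_2 = 0.50$.
In scenario 4, the true condition of class 1 has a higher probability than the others (60% vs. 20%) as in scenario 2. However, the performances of the two tests are different: $biF_1 = 0.692$ versus $biF_2 = 0.600$, $miF_1 = 0.600$ versus $miF_2 = 0.500$, $maF_1 = 0.564$ versus $maF_2 = 0.467$, and $maF^*_1 = 0.579$ versus $maF^*_2 = 0.486$.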
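A scaled-down version of this simulation can be sketched in Python for the Wald test of $miF$ under scenario 1. Several simplifications are assumptions of this sketch and not part of the article: only 2,000 replications are used instead of 100,000, only $miF$ is examined, and the Wald statistic is written in the closed form $(b - c)^2 / \{(b + c) - (b - c)^2 / N\}$, where $b$ and $c$ count the samples classified correctly by only Test 1 and only Test 2. That form follows from the multinomial variance because $miF_1 - miF_2$ is linear in the cell probabilities.

```python
import random

# Scenario 1 of Table 2: numerators of the cell probabilities (denominator 300).
# Row = Test 1 class; columns = Test 2 class 1, 2, 3 within true class 1, 2, 3.
SCENARIO_1 = [
    [40, 10, 10,   5, 10,  5,   5,  5, 10],
    [10,  5,  5,  10, 40, 10,   5,  5, 10],
    [10,  5,  5,   5, 10,  5,  10, 10, 40],
]
CHI2_1DF_95 = 3.841458820694124  # 95th percentile of chi-square with 1 df


def empirical_rejection_rate(table, n, reps, seed):
    """Share of simulated data sets in which the miF Wald test rejects at 0.05."""
    rng = random.Random(seed)
    contrasts, weights = [], []
    for i in range(3):          # Test 1 class
        for k in range(3):      # true class
            for j in range(3):  # Test 2 class
                # +1: only Test 1 correct; -1: only Test 2 correct; 0: otherwise.
                contrasts.append(int(i == k) - int(j == k))
                weights.append(table[i][3 * k + j])
    rejections = 0
    for _ in range(reps):
        draws = rng.choices(contrasts, weights=weights, k=n)
        b = draws.count(1)
        c = draws.count(-1)
        denom = (b + c) - (b - c) ** 2 / n
        if denom > 0 and (b - c) ** 2 / denom > CHI2_1DF_95:
            rejections += 1
    return rejections / reps


rate = empirical_rejection_rate(SCENARIO_1, n=300, reps=2000, seed=2023)
print(f"empirical type I error rate (miF, Wald, N=300): {rate:.3f}")
```

With so few replications the estimate carries a Monte Carlo standard error of roughly 0.005, so it should only be expected to land near the nominal 0.05, not to reproduce the tabulated values exactly.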
4.2 |. Simulation result
Table 3 shows the empirical type I error rates of the proposed tests for scenarios 1 and 2. The empirical type I error rates for both the Wald and score statistics were close to the nominal type I error rate of 0.05 when the sample size was large (300, 500, 1000). When $N$ was relatively small (100), the empirical type I error rates tended to be slightly larger than 0.05, especially for the Wald statistics. In contrast, the empirical type I error rates of the score statistics were close to the nominal type I error rate of 0.05 for all sample sizes. Table 4 shows the empirical power of the proposed tests for scenarios 3 and 4. As shown in Table 4, the empirical power increased with the sample size. The empirical powers of the Wald statistics and score statistics were similar, especially when the sample size was large.
TABLE 3.
| Scenario | N | biF (Wald) | biF (score) | miF (Wald) | miF (score) | maF (Wald) | maF (score) | maF* (Wald) | maF* (score) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 100 | 0.057 | 0.050 | 0.053 | 0.049 | 0.055 | 0.051 | 0.057 | 0.053 |
| 1 | 300 | 0.052 | 0.050 | 0.051 | 0.050 | 0.052 | 0.051 | 0.052 | 0.051 |
| 1 | 500 | 0.051 | 0.050 | 0.050 | 0.050 | 0.051 | 0.050 | 0.051 | 0.050 |
| 1 | 1000 | 0.050 | 0.049 | 0.051 | 0.050 | 0.051 | 0.050 | 0.051 | 0.051 |
| 2 | 100 | 0.052 | 0.049 | 0.054 | 0.049 | 0.058 | 0.053 | 0.061 | 0.055 |
| 2 | 300 | 0.052 | 0.050 | 0.051 | 0.050 | 0.054 | 0.052 | 0.054 | 0.052 |
| 2 | 500 | 0.051 | 0.050 | 0.051 | 0.050 | 0.051 | 0.050 | 0.052 | 0.051 |
| 2 | 1000 | 0.050 | 0.050 | 0.051 | 0.051 | 0.052 | 0.051 | 0.051 | 0.051 |
TABLE 4.
| Scenario | N | biF (Wald) | biF (score) | miF (Wald) | miF (score) | maF (Wald) | maF (score) | maF* (Wald) | maF* (score) |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 100 | 0.192 | 0.174 | 0.304 | 0.289 | 0.309 | 0.297 | 0.310 | 0.300 |
| 3 | 300 | 0.438 | 0.429 | 0.694 | 0.689 | 0.696 | 0.692 | 0.696 | 0.692 |
| 3 | 500 | 0.641 | 0.635 | 0.890 | 0.888 | 0.889 | 0.888 | 0.889 | 0.888 |
| 3 | 1000 | 0.905 | 0.904 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 |
| 4 | 100 | 0.235 | 0.226 | 0.305 | 0.291 | 0.291 | 0.278 | 0.271 | 0.256 |
| 4 | 300 | 0.560 | 0.556 | 0.695 | 0.690 | 0.662 | 0.657 | 0.615 | 0.609 |
| 4 | 500 | 0.773 | 0.771 | 0.889 | 0.887 | 0.865 | 0.863 | 0.826 | 0.824 |
| 4 | 1000 | 0.969 | 0.969 | 0.995 | 0.995 | 0.992 | 0.992 | 0.984 | 0.984 |
5 |. EXAMPLE
We describe an application of the proposed hypothesis testing procedure to the motivating example.3 In this study, a skin cancer classification system with a faster, region-based convolutional neural network algorithm (FRCNN) for brown to black pigmented skin lesions was developed using a deep learning method. The target diseases were malignant tumors (malignant melanoma (MM) and basal cell carcinoma (BCC)) and benign tumors (nevus, seborrheic keratosis (SK), senile lentigo (SL) and hematoma/hemangioma (H/H)), and 2000 images were evaluated. The 2000 images were obtained by randomly sampling 200 images from the 666 images 10 times. For illustration, all images were treated as independent in this study. The data are shown in Tables B1–B3, Appendix B. Although images were classified into six categories (MM, BCC, nevus, SK, SL, H/H), accuracy was the only performance measure computed for six-class classification data in the motivating example. Other performance measures (sensitivity, specificity, false negative rate, false positive rate, and positive predictive value) were calculated for two-class classification data after combining malignant tumors (MM and BCC) and benign tumors (nevus, SK, SL, and H/H). The accuracy of six-class classification by the FRCNN (86.2% ± 2.95%) was significantly higher than that of board-certified dermatologists (BCD) (79.5% ± 5.27%, p = 0.0081) and that of dermatologic trainees (75.1% ± 2.18%, p < 0.0001).
We compared the performance of skin cancer classification between the FRCNN and BCD using $biF$, $miF$, $maF$, and $maF^*$ with the proposed hypothesis testing procedures. $miF$, $maF$, and $maF^*$ were calculated from six-class classification data, while $biF$ was calculated from two-class classification data (malignant tumors vs. benign tumors). The results are shown in Table 5. The $biF$ of the two-class classification and the $miF$, $maF$, and $maF^*$ of the six-class classification by the FRCNN were all significantly higher than those by BCD.
TABLE 5.
FRCNN | BCD | Difference | Test statistics | p-value | |
---|---|---|---|---|---|
biF (Wald) | 0.840 | 0.776 | 0.064 | 19.4 | < 0.001 |
biF (score) | 0.840 | 0.776 | 0.064 | 18.9 | < 0.001 |
miF (Wald) | 0.862 | 0.795 | 0.067 | 41.9 | < 0.001 |
miF (score) | 0.862 | 0.795 | 0.067 | 41.0 | < 0.001 |
maF (Wald) | 0.846 | 0.768 | 0.078 | 26.2 | < 0.001 |
maF (score) | 0.846 | 0.768 | 0.078 | 24.5 | < 0.001 |
maF* (Wald) | 0.848 | 0.772 | 0.076 | 26.4 | < 0.001 |
maF* (score) | 0.848 | 0.772 | 0.076 | 23.0 | < 0.001 |
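As a rough check of the miF (Wald) row of Table 5, the Python sketch below recomputes the statistic from the paired counts in Table B1. It is not the authors' R code: it relies on the observation that $miF_1 - miF_2$ is linear in the cell probabilities, so the multinomial variance reduces to a closed form in the counts $b$ (only the FRCNN correct) and $c$ (only BCD correct). The variable names are introduced here for illustration.

```python
import math

# TABLE_B1[i][j][k]: FRCNN class i, BCD class j, true condition k.
# Class order: MM, BCC, Nevus, SK, H/H, SL (counts copied from Table B1).
TABLE_B1 = [
    [[289, 2, 20, 6, 2, 0], [9, 2, 2, 5, 0, 0], [10, 0, 14, 0, 0, 0],
     [14, 2, 4, 10, 0, 0], [3, 0, 2, 0, 1, 0], [2, 0, 0, 0, 0, 0]],
    [[6, 6, 0, 1, 0, 0], [2, 95, 0, 6, 0, 0], [0, 2, 6, 0, 0, 0],
     [1, 5, 0, 2, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]],
    [[32, 1, 108, 0, 1, 0], [1, 6, 8, 1, 0, 0], [11, 1, 789, 8, 1, 0],
     [3, 3, 50, 27, 0, 0], [0, 1, 9, 0, 16, 0], [1, 0, 3, 0, 0, 0]],
    [[13, 1, 3, 11, 0, 0], [0, 1, 1, 12, 0, 0], [1, 0, 11, 9, 0, 0],
     [7, 4, 14, 186, 0, 1], [0, 0, 0, 0, 0, 0], [0, 0, 1, 5, 0, 2]],
    [[0, 0, 0, 0, 6, 0], [0, 0, 0, 0, 1, 0], [0, 0, 3, 0, 5, 0],
     [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 44, 0], [0, 0, 0, 0, 0, 0]],
    [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 6], [0, 0, 0, 0, 0, 0], [2, 0, 0, 0, 0, 35]],
]

r = len(TABLE_B1)
cells = [(i, j, k) for i in range(r) for j in range(r) for k in range(r)]
n = sum(TABLE_B1[i][j][k] for i, j, k in cells)

# Discordant counts: exactly one of the two raters matches the true condition.
b = sum(TABLE_B1[i][j][k] for i, j, k in cells if i == k and j != k)
c = sum(TABLE_B1[i][j][k] for i, j, k in cells if j == k and i != k)

diff = (b - c) / n                      # estimate of miF_1 - miF_2
variance = ((b + c) / n - diff ** 2) / n
wald = diff ** 2 / variance
p_value = math.erfc(math.sqrt(wald / 2))  # upper tail of chi-square, 1 df

print(f"N={n}  b={b}  c={c}  difference={diff:.3f}  Wald={wald:.1f}  p={p_value:.2e}")
```

The difference (0.067) and the Wald statistic (about 41.9) agree with the miF (Wald) row of Table 5; the other rows require the nonlinear delta-method variances of Appendix A and are not reproduced by this shortcut.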
6 |. DISCUSSION
We developed hypothesis testing procedures for comparing two $F_1$-scores ($biF_1$ and $biF_2$; $miF_1$ and $miF_2$; $maF_1$ and $maF_2$; and $maF^*_1$ and $maF^*_2$) in a paired study design. Through the simulation study and motivating example, we assessed the performance and feasibility of those testing procedures. We conclude that the method based on the score statistics is slightly better than the method based on the Wald statistics because its empirical type I error rate is closer to the nominal level even when the sample size is small. However, when multi-class classification is considered, the typical sample size is much larger than 100, and both approaches perform equally well in such scenarios.
We did not observe a substantial disparity in the empirical powers of the two approaches.
To our knowledge, hypothesis testing procedures for $biF$, $miF$, $maF$, and $maF^*$ have not been studied by others, and only the point estimates of these scores were reported in most studies. Han et al19 applied a one-sample t-test for comparison of $F_1$-scores; however, this approach may not be appropriate because the $F_1$-score is the harmonic mean of precision and recall, and the distribution of the difference between two $F_1$-scores is unlikely to follow a Student’s t-distribution.
A limitation of this work is that the proposed procedures are based on large sample theory, and thus require a large sample size to provide strict control of the type I error rate. As future work, we are developing an exact test for comparing two $biF$, $miF$, $maF$, and $maF^*$ based on the methods presented in this article.
R code for computing point estimates, Wald statistics, score statistics, and p-values for $biF$, $miF$, $maF$, and $maF^*$ in the paired design is available on the lead author’s GitHub page: https://github.com/kanaet52/f1score/blob/main/R/F1score_test.R. For two-sample designs, the $F_1$-scores can be compared by setting the covariance part of the test statistic (see Appendix A) to 0. Note that for the score statistics, the MLE of $\mathbf{p}$ under each null hypothesis is obtained in the code by applying the Newton-Raphson method to the log-likelihood equations.
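The remark about two-sample designs rests on a generic identity for the variance of a difference. In the block below, $\hat{F}_1$ and $\hat{F}_2$ stand for the estimates of any one of the four scores for Test 1 and Test 2; this shorthand is introduced here for illustration and is not the article's notation.

```latex
\begin{aligned}
\operatorname{Var}(\hat{F}_1 - \hat{F}_2)
  &= \operatorname{Var}(\hat{F}_1) + \operatorname{Var}(\hat{F}_2)
     - 2\,\operatorname{Cov}(\hat{F}_1, \hat{F}_2)
  && \text{(paired design)} \\
\operatorname{Var}(\hat{F}_1 - \hat{F}_2)
  &= \operatorname{Var}(\hat{F}_1) + \operatorname{Var}(\hat{F}_2)
  && \text{(two independent samples, } \operatorname{Cov} = 0\text{)}
\end{aligned}
```

In the paired design the covariance is typically positive because both tests are applied to the same samples, so ignoring it would overstate the variance of the difference.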
ACKNOWLEDGEMENTS
The authors would like to thank Dr Shunichi Jinnai for providing the motivating example data. This research was partially supported by grant-in-aid for Scientific Research (C) No. 18K11195 and 21K11790 (Yamamoto), grant-in-aid for Research Activity start-up no. 21K21170 (Takahashi), and P30CA068485 Cancer Center Support grant (Koyama).
Funding information
Cancer Center Support, Grant/Award Number: P30CA068485; Japan Society for the Promotion of Science, Grant/Award Numbers: 18K11195, 21K11790, 21K21170
APPENDIX A. DERIVATION OF VARIANCES
A.1. Variance of $\widehat{biF}_1 - \widehat{biF}_2$
The derivation of the variance of $\widehat{biF}_1 - \widehat{biF}_2$ is as follows:
with
Therefore, the variance of is
the variance of is
the variance of is
A.2. Variance of $\widehat{miF}_1 - \widehat{miF}_2$
The derivation of the variance of $\widehat{miF}_1 - \widehat{miF}_2$ is as follows:
with
Therefore, the variance of is
A.3. Variance of $\widehat{maF}_1 - \widehat{maF}_2$
The derivation of the variance of $\widehat{maF}_1 - \widehat{maF}_2$ is as follows:
with
Therefore, the variance of is
A.4. Variance of $\widehat{maF^*}_1 - \widehat{maF^*}_2$
The derivation of the variance of $\widehat{maF^*}_1 - \widehat{maF^*}_2$ is as follows:
with
Therefore, the variance of is
APPENDIX B. EXAMPLE DATA
TABLE B1.
| FRCNN | BCD | True: MM | True: BCC | True: Nevus | True: SK | True: H/H | True: SL |
|---|---|---|---|---|---|---|---|
| MM | MM | 289 | 2 | 20 | 6 | 2 | 0 |
| MM | BCC | 9 | 2 | 2 | 5 | 0 | 0 |
| MM | Nevus | 10 | 0 | 14 | 0 | 0 | 0 |
| MM | SK | 14 | 2 | 4 | 10 | 0 | 0 |
| MM | H/H | 3 | 0 | 2 | 0 | 1 | 0 |
| MM | SL | 2 | 0 | 0 | 0 | 0 | 0 |
| BCC | MM | 6 | 6 | 0 | 1 | 0 | 0 |
| BCC | BCC | 2 | 95 | 0 | 6 | 0 | 0 |
| BCC | Nevus | 0 | 2 | 6 | 0 | 0 | 0 |
| BCC | SK | 1 | 5 | 0 | 2 | 0 | 0 |
| BCC | H/H | 0 | 0 | 0 | 0 | 0 | 0 |
| BCC | SL | 0 | 0 | 0 | 0 | 0 | 0 |
| Nevus | MM | 32 | 1 | 108 | 0 | 1 | 0 |
| Nevus | BCC | 1 | 6 | 8 | 1 | 0 | 0 |
| Nevus | Nevus | 11 | 1 | 789 | 8 | 1 | 0 |
| Nevus | SK | 3 | 3 | 50 | 27 | 0 | 0 |
| Nevus | H/H | 0 | 1 | 9 | 0 | 16 | 0 |
| Nevus | SL | 1 | 0 | 3 | 0 | 0 | 0 |
| SK | MM | 13 | 1 | 3 | 11 | 0 | 0 |
| SK | BCC | 0 | 1 | 1 | 12 | 0 | 0 |
| SK | Nevus | 1 | 0 | 11 | 9 | 0 | 0 |
| SK | SK | 7 | 4 | 14 | 186 | 0 | 1 |
| SK | H/H | 0 | 0 | 0 | 0 | 0 | 0 |
| SK | SL | 0 | 0 | 1 | 5 | 0 | 2 |
| H/H | MM | 0 | 0 | 0 | 0 | 6 | 0 |
| H/H | BCC | 0 | 0 | 0 | 0 | 1 | 0 |
| H/H | Nevus | 0 | 0 | 3 | 0 | 5 | 0 |
| H/H | SK | 0 | 0 | 0 | 0 | 1 | 0 |
| H/H | H/H | 0 | 0 | 0 | 0 | 44 | 0 |
| H/H | SL | 0 | 0 | 0 | 0 | 0 | 0 |
| SL | MM | 0 | 0 | 0 | 0 | 0 | 0 |
| SL | BCC | 0 | 0 | 0 | 0 | 0 | 1 |
| SL | Nevus | 0 | 0 | 0 | 0 | 0 | 0 |
| SL | SK | 1 | 0 | 0 | 0 | 0 | 6 |
| SL | H/H | 0 | 0 | 0 | 0 | 0 | 0 |
| SL | SL | 2 | 0 | 0 | 0 | 0 | 35 |
TABLE B2.
| FRCNN | True: MM | True: BCC | True: Nevus | True: SK | True: H/H | True: SL |
|---|---|---|---|---|---|---|
| MM | 327 | 6 | 42 | 21 | 3 | 0 |
| BCC | 9 | 108 | 6 | 9 | 0 | 0 |
| Nevus | 48 | 12 | 967 | 36 | 18 | 0 |
| SK | 21 | 6 | 30 | 223 | 0 | 3 |
| H/H | 0 | 0 | 3 | 0 | 57 | 0 |
| SL | 3 | 0 | 0 | 0 | 0 | 42 |
TABLE B3.
| BCD | True: MM | True: BCC | True: Nevus | True: SK | True: H/H | True: SL |
|---|---|---|---|---|---|---|
| MM | 340 | 10 | 131 | 18 | 9 | 0 |
| BCC | 12 | 104 | 11 | 24 | 1 | 1 |
| Nevus | 22 | 3 | 823 | 17 | 6 | 0 |
| SK | 26 | 14 | 68 | 225 | 1 | 7 |
| H/H | 3 | 1 | 11 | 0 | 61 | 0 |
| SL | 5 | 0 | 4 | 5 | 0 | 37 |
Footnotes
CONFLICT OF INTEREST STATEMENT
The authors have declared no conflict of interest.
DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
REFERENCES
1. van Rijsbergen CJ. Information Retrieval. London: Butterworths; 1979.
2. Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge: Cambridge University Press; 2008.
3. Jinnai S, Yamazaki N, Hirano Y, Sugawara Y, Ohe Y, Hamamoto R. The development of a skin cancer classification system for pigmented skin lesions using deep learning. Biomolecules. 2020;10(8):1123. doi: 10.3390/biom10081123
4. Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ. 2006;332(7549):1080.
5. Fedorov V, Mannino F, Zhang R. Consequences of dichotomization. Pharm Stat. 2009;8(1):50–61.
6. Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Inf Process Manag. 2009;45:427–437.
7. Bhalla S, Kaur H, Kaur R, Sharma S, Raghava GPS. Expression based biomarkers and models to classify early and late-stage samples of papillary thyroid carcinoma. PLoS One. 2020;15(4):e0231629.
8. Chowdhury S, Dong X, Qian L, et al. A multitask bi-directional RNN model for named entity recognition on Chinese electronic medical records. BMC Bioinform. 2018;19(17):499.
9. Döring K, Qaseem A, Becer M, et al. Automated recognition of functional compound-protein relationships in literature. PLoS One. 2020;15(3):e0220925.
10. Hong N, Wen A, Stone DJ, et al. Developing a FHIR-based EHR phenotyping framework: a case study for identification of patients with obesity and multiple comorbidities from discharge summaries. J Biomed Inform. 2019;99:103310.
11. Lee GH, Shin SY. Federated learning on clinical benchmark data: performance assessment. J Med Internet Res. 2020;22(10):e20891.
12. Routray R, Tetarenko N, Abu-Assal C, et al. Application of augmented intelligence for pharmacovigilance case seriousness determination. Drug Saf. 2020;43(1):57–66.
13. Wang J, Zhang J, An Y, et al. Biomedical event trigger detection by dependency-based word embedding. BMC Med Genom. 2016;9(2):45.
14. Zhu F, Li X, Mcgonigle D, et al. Analyze informant-based questionnaire for the early diagnosis of senile dementia using deep learning. IEEE J Transl Eng Health Med. 2020;8:2200106.
15. Wang Y, Li J, Li Y, Wang R, Yang X. Confidence interval for $F_1$ measure of algorithm performance based on blocked 3 × 2 cross-validation. IEEE Trans Knowl Data Eng. 2015;27:651–659.
16. Zhang D, Wang J, Zhao X. Estimating the uncertainty of average $F_1$ scores. Proceedings of the 2015 International Conference on the Theory of Information Retrieval. 2015.
17. Takahashi K, Yamamoto K, Kuchiba A, Koyama T. Confidence interval for micro-averaged $F_1$ and macro-averaged $F_1$ scores. Appl Intell (Dordr). 2022;52(5):4961–4972.
18. Attia ZI, Noseworthy PA, Lopez-Jimenez F, et al. An artificial intelligence-enabled ECG algorithm for the identification of patients with atrial fibrillation during sinus rhythm: a retrospective analysis of outcome prediction. Lancet. 2019;394(10201):861–867.
19. Han SS, Moon IJ, Lim W, et al. Keratinocytic skin cancer detection on the face using region-based convolutional neural network. JAMA Dermatol. 2020;156(1):29–37.