Table 1.
Performance metric | Interpretation
---|---
Accuracy | Brief definition: Accuracy is the proportion of correct classifications out of all classifications. It ranges from 0 to 100% and is most informative on balanced data sets. Example: In a binary classification of cancer versus healthy with 20 cancer patients and 80 healthy controls, a model that assigns every subject to the majority class (healthy) reaches an accuracy of 80% while being completely unable to identify cancer patients, even though the accuracy suggests good performance. |
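The imbalance pitfall in the accuracy example can be reproduced with a short sketch; the labels below are hypothetical, mirroring the 20/80 split from the table:

```python
# Hypothetical labels mirroring the table's example: 20 cancer patients (1)
# and 80 healthy controls (0). A model that always predicts the majority
# class "healthy" (0) never identifies a single cancer patient...
y_true = [1] * 20 + [0] * 80
y_pred = [0] * 100  # majority-class predictor

# ...yet its accuracy still looks good on this imbalanced data set.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.8
```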
Sensitivity | Brief definition: Sensitivity is the true positive rate of a test, i.e. the proportion of subjects with a disease that the test actually identifies as having the disease. The values range from 0 to 100%. Example: Suppose we have an epigenetic test that claims to identify the presence of a specific type of cancer. When evaluated, the test identified 30 out of 60 cancer patients correctly, so its sensitivity is 50% (30/60). |
Specificity | Brief definition: Specificity is the true negative rate of a test, i.e. the proportion of people without the disease who receive a negative result. As for sensitivity, the values range from 0 to 100%. Example: Using the same diagnostic test for cancer as above, 70 out of 90 healthy subjects received a negative diagnosis, so the specificity of the test is about 78% (70/90). |
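Both rates follow directly from confusion-matrix counts; a minimal sketch using the worked numbers from the table (30 of 60 cancer patients test positive, 70 of 90 healthy subjects test negative):

```python
# Confusion-matrix counts taken from the table's worked example:
tp, fn = 30, 30  # 30 of 60 cancer patients correctly identified
tn, fp = 70, 20  # 70 of 90 healthy subjects correctly cleared

sensitivity = tp / (tp + fn)  # true positive rate: 30/60 = 0.5
specificity = tn / (tn + fp)  # true negative rate: 70/90 ≈ 0.78
print(sensitivity, round(specificity, 2))  # 0.5 0.78
```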
Precision | Brief definition: Precision is the proportion of predicted cases that are actual cases, i.e. true positives divided by all positive predictions. Possible values range from 0 to 1. Example: In the cancer example, the proportion of predicted cancer cases that are actual cancer cases. |
Recall | Brief definition: Recall is the proportion of actual cases that the model identifies as cases; it is identical to sensitivity. The value range is 0 to 1. Example: Out of all the cancer patients, how many was the predictive model able to identify as cancer patients? |
F1-Score | Brief definition: The F1-score is the harmonic mean of precision and recall. It rewards both high recall and high precision, meaning the model identifies a large proportion of cases and most predicted cases are actual cases. The F1-score ranges from 0 to 1, where 0 is the worst performance. Example: With near-perfect precision and recall, meaning we are able to classify most of the cancer patients as cancer patients (recall) and most of our predictions are correct (precision), the harmonic mean of the two for a good model would be ~0.9. |
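Precision, recall, and the F1-score can all be computed from the same confusion counts; a sketch with hypothetical numbers (30 true positives, 10 false positives, 30 false negatives for the cancer example):

```python
# Hypothetical confusion counts for the cancer example:
tp, fp, fn = 30, 10, 30

precision = tp / (tp + fp)  # 30/40 = 0.75: predicted cases that are real
recall = tp / (tp + fn)     # 30/60 = 0.5: real cases that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(precision, recall, round(f1, 2))  # 0.75 0.5 0.6
```

Note how the harmonic mean pulls the F1-score toward the weaker of the two components: despite a precision of 0.75, the modest recall of 0.5 keeps the F1-score at 0.6.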
Area under the receiver operating characteristic curve (ROC AUC) | Brief definition: The area under the ROC curve summarizes how sensitive and how specific a test is across all classification thresholds. In a graphical representation, the x-axis depicts the false positive rate (1 − specificity) and the y-axis the true positive rate (sensitivity). If a test performs poorly in terms of sensitivity and specificity, the area under the curve is 0.5, which means it is no better than tossing a coin; a perfect test has an AUC of 1. |
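The AUC can equivalently be read as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A minimal rank-based sketch, with hypothetical prediction scores:

```python
def roc_auc(y_true, scores):
    """Rank-based ROC AUC (equivalent to the Mann-Whitney U statistic)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Fraction of (positive, negative) pairs where the positive scores
    # higher; ties count half. 0.5 is coin tossing, 1.0 perfect separation.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```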