Precision |
Measures the proportion of predicted words that also appear in the correct answer, relative to all predicted words. In the QA context, it indicates how much of the model’s generated answer is actually correct. |
|
Recall |
Measures the proportion of correctly predicted words relative to all words in the correct answer. It indicates whether the model captures the key words of the expected response. |
|
Exact Match (EM) |
Measures the percentage of predicted answers that exactly match the correct answer. It is a very strict metric: an answer counts as correct only if it is identical to the expected response. |
|
F1-Score |
The harmonic mean of precision and recall. It measures the overlap between the predicted words and the words in the correct answer. Unlike EM, it does not require the two to be identical; it gives partial credit according to how many predicted words match those in the correct answer. |
|
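
The four metrics above can be illustrated with a minimal Python sketch. It assumes answers are compared as lowercased, whitespace-separated words; the helper names `tokenize`, `overlap_scores`, and `exact_match` are hypothetical, and real evaluation scripts often apply further normalization (for example, stripping punctuation and articles) that is omitted here.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace only.
    return text.lower().split()


def overlap_scores(prediction, reference):
    """Return (precision, recall, f1) over the words of a single answer pair."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    # Multiset intersection: a shared word is counted at most as often
    # as it occurs in both the prediction and the reference.
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0, 0.0, 0.0
    precision = common / len(pred_tokens)  # share of predicted words that are correct
    recall = common / len(ref_tokens)      # share of reference words that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


def exact_match(prediction, reference):
    # Strict comparison: every word must match, in the same order.
    return tokenize(prediction) == tokenize(reference)


if __name__ == "__main__":
    p, r, f1 = overlap_scores("the Eiffel Tower in Paris", "Eiffel Tower")
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
    print("exact match:", exact_match("the Eiffel Tower in Paris", "Eiffel Tower"))
```

In this example the prediction contains both reference words, so recall is perfect, but three of its five words are extra, which lowers precision and therefore F1; EM gives no credit at all. Dataset-level scores are typically obtained by averaging these per-answer values over all questions.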