Table 3. Subject-level total correct matching responses and accuracy consensus across compared models.
| Subject | GPT^a-3.5 vs Bard: correct matching responses, n | GPT-3.5 vs Bard: accuracy consensus | Bard vs GPT-4: correct matching responses, n | Bard vs GPT-4: accuracy consensus | GPT-3.5 vs GPT-4: correct matching responses, n | GPT-3.5 vs GPT-4: accuracy consensus | Bard, GPT-3.5, and GPT-4: correct matching responses, n | Bard, GPT-3.5, and GPT-4: accuracy consensus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Biology | 17 | 0.40 | 22 | 0.46 | 23 | 0.48^b | 17 | 0.52 |
| Chemistry | 4 | 0.31 | 7 | 0.50 | 8 | 0.50^b | 4 | 0.50 |
| Physics | 8 | 0.58 | 13 | 1.00 | 14 | 0.93^b | 8 | 1.00 |
^a GPT: Generative Pre-trained Transformers.

^b Highest accuracy within a subject.
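For readers who want to reproduce pairwise figures of this kind, the sketch below shows one possible way to compute them. It assumes that "matching responses" are questions on which both models chose the same option and that "accuracy consensus" is the share of those matching responses that are also correct; this interpretation, the function name `consensus_metrics`, and the toy answer sets are illustrative assumptions, not definitions taken from the study.

```python
from typing import Sequence


def consensus_metrics(
    answers_a: Sequence[str],
    answers_b: Sequence[str],
    key: Sequence[str],
) -> tuple[int, float]:
    """Return (correct matching responses, accuracy consensus) for two models.

    Assumption: a "matching response" is a question where both models pick the
    same option; "accuracy consensus" is the fraction of matching responses
    that also agree with the answer key.
    """
    # Indices where both models gave the same answer.
    matching = [i for i, (a, b) in enumerate(zip(answers_a, answers_b)) if a == b]
    # Of those, count the ones that are also correct.
    correct_matching = sum(1 for i in matching if answers_a[i] == key[i])
    accuracy_consensus = correct_matching / len(matching) if matching else 0.0
    return correct_matching, accuracy_consensus


# Hypothetical example: answer choices from two models plus the answer key.
gpt35 = ["A", "C", "B", "D", "A"]
bard = ["A", "C", "D", "D", "B"]
key = ["A", "B", "D", "D", "A"]

n_correct, consensus = consensus_metrics(gpt35, bard, key)
print(n_correct, round(consensus, 2))  # 2 correct out of 3 matching -> 0.67
```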