Table 1.
| | ChatGPT-4: absolute freq. | ChatGPT-4: % | Microsoft Copilot: absolute freq. | Microsoft Copilot: % | Google Gemini: absolute freq. | Google Gemini: % | ChatGPT-4 vs Google Gemini: Chi2 | ChatGPT-4 vs Google Gemini: p-value | ChatGPT-4 vs Microsoft Copilot: Chi2 | ChatGPT-4 vs Microsoft Copilot: p-value | Microsoft Bing vs Google Gemini: Chi2 | Microsoft Bing vs Google Gemini: p-value | Overall among AI chatbots: Chi2 | Overall among AI chatbots: p-value |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Failure | 57 | 6.96 | 83 | 10.13 | 246 | 30.04 | -0.23 | 0.00* | -0.031 | 0.199 | -0.198 | 0.00* | 312.76 | 0.000* |
Logical reasoning and general culture | 39 | 68.42 | 51 | 61.45 | 126 | 51.22 | -0.28 | 0.00* | -0.038 | 0.70 | -0.242 | 0.00* | 52 | 0.000* |
Biology | 6 | 10.53 | 8 | 9.64 | 31 | 12.60 | -0.1 | 0.00* | -0.008 | 1.00 | -0.09 | 0.00* | 166.01 | 0.000* |
Chemistry | 7 | 12.28 | 11 | 13.25 | 32 | 13.01 | -0.16 | 0.00* | -0.025 | 1.00 | -0.13 | 0.00* | 73.03 | 0.000* |
Physics and mathematics | 5 | 8.77 | 13 | 15.66 | 57 | 23.17 | -0.43 | 0.00* | -0.066 | 0.46 | -0.366 | 0.00* | 94.16 | 0.000* |
* statistically significant findings
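
The percentages in the subject rows appear to be each subject's share of that chatbot's total failures (for example, 39 of ChatGPT-4's 57 failures is 68.42%). The following minimal Python sketch recomputes those shares from the absolute frequencies in Table 1 as a consistency check. The dictionary layout and the helper name `failure_shares` are illustrative assumptions, not part of the original analysis.

```python
# Absolute failure counts per subject, copied from Table 1.
failures = {
    "ChatGPT-4": {
        "Logical reasoning and general culture": 39,
        "Biology": 6,
        "Chemistry": 7,
        "Physics and mathematics": 5,
    },
    "Microsoft Copilot": {
        "Logical reasoning and general culture": 51,
        "Biology": 8,
        "Chemistry": 11,
        "Physics and mathematics": 13,
    },
    "Google Gemini": {
        "Logical reasoning and general culture": 126,
        "Biology": 31,
        "Chemistry": 32,
        "Physics and mathematics": 57,
    },
}


def failure_shares(counts):
    """Return each subject's share (in %) of one chatbot's total failures.

    Hypothetical helper: it assumes the table's % columns are
    count / total failures * 100, rounded to two decimals.
    """
    total = sum(counts.values())
    return {subject: round(100 * n / total, 2) for subject, n in counts.items()}


# Total failures and per-subject shares for every chatbot.
totals = {bot: sum(counts.values()) for bot, counts in failures.items()}
shares = {bot: failure_shares(counts) for bot, counts in failures.items()}

for bot in failures:
    print(bot, totals[bot], shares[bot])
```

Under this assumption, the per-subject counts sum to the totals in the Failure row (57, 83, and 246), and the recomputed shares can be compared directly with the % columns of the subject rows.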