2024 Nov 26;121(49):e2414955121. doi: 10.1073/pnas.2414955121

Fig. 3.

Model Performance Stratified by Question Difficulty. (A and B) 376 Bachelor’s and 693 Master’s questions, respectively, annotated with instructor-reported difficulty levels. (C) 207 questions annotated according to Bloom’s taxonomy by two researchers in the learning sciences. Across all categorization schemes, GPT-4 performance degrades slightly as the questions become more complex and challenging. Performance is aggregated using the majority vote strategy. Error bars represent 95% CIs computed using the nonparametric bootstrap with 1,000 resamples.
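
The error bars in the caption can be illustrated with a minimal sketch of a nonparametric bootstrap confidence interval for accuracy. This is not the authors' code: the function name `bootstrap_ci`, the percentile method for reading off the interval, and the 0/1 encoding of per-question outcomes are all assumptions made here for illustration. The caption only states that a nonparametric bootstrap with 1,000 resamples was used.

```python
import random


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-question outcomes.

    outcomes: list of 0/1 values (1 = question answered correctly).
    The percentile method is an assumption; the caption does not say
    which bootstrap variant was used.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample the questions with replacement, keeping the sample size.
        sample = rng.choices(outcomes, k=n)
        means.append(sum(sample) / n)
    means.sort()
    # Read the alpha/2 and 1 - alpha/2 quantiles off the sorted means.
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]
```

With hypothetical data, a call such as `bootstrap_ci([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])` returns a lower and an upper bound on accuracy for that set of questions. In a stratified plot like this one, the function would be applied separately to the outcomes in each difficulty level.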