Figure 2. GPT-4’s performance on the modified questions and its reproducibility.
To further assess GPT-4’s performance, a “Modified Question” that conceals the number of correct options was created for every text-based question in the 33rd-35th examinations (n = 214). Reproducibility was defined as the proportion of questions answered correctly in the conventional question format (A+B) that were also answered correctly in the modified question format (A), i.e., A / (A+B).
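The reproducibility definition above can be illustrated with a minimal sketch. The function name `reproducibility` and the toy per-question results are assumptions made for illustration only; they are not the study's data or code. Here A counts questions answered correctly in both formats, and B counts questions answered correctly only in the conventional format.

```python
def reproducibility(conventional_correct, modified_correct):
    """Return A / (A + B) for paired per-question results.

    A: correct in both the conventional and the modified format.
    B: correct in the conventional format but not in the modified one.
    Both arguments are equal-length sequences of booleans (hypothetical input layout).
    """
    if len(conventional_correct) != len(modified_correct):
        raise ValueError("Both result lists must cover the same questions.")

    pairs = list(zip(conventional_correct, modified_correct))
    a = sum(1 for conv, mod in pairs if conv and mod)
    b = sum(1 for conv, mod in pairs if conv and not mod)

    if a + b == 0:
        raise ValueError("No question was answered correctly in the conventional format.")
    return a / (a + b)


# Toy data (hypothetical, not taken from the examination results).
conventional = [True, True, True, False, True]
modified = [True, False, True, True, True]

print(reproducibility(conventional, modified))  # 3 / (3 + 1) = 0.75
```

Questions answered incorrectly in the conventional format do not enter the denominator, so a correct answer in the modified format alone (the fourth toy question) does not affect the value.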
