Figure 2. Results of manual ratings.

A-C: Frequency distribution of the average manual rating in the training, internal testing, and external testing datasets. D-F: Pairwise weighted κ between raters in each dataset; agreement was moderate and consistent across datasets. G-I: Pairwise polychoric correlation between raters in each dataset; correlations were high in all datasets.