Abstract
Large language models (LLMs) have shown great potential in automatic scoring. However, due to model characteristics and variation in training materials and pipelines, scoring inconsistency can exist within an LLM and across LLMs when rating the same response multiple times. This study investigates the intra-LLM and inter-LLM consistency in scoring with five LLMs (i.e., Claude, DeepSeek, Gemini, GPT, and Qwen), variability under different temperatures, and their relationship with scoring accuracy. Moreover, a voting strategy that assembles information from different LLMs was proposed to address inconsistent scoring. Using constructed-response items from a science education assessment and open-source data from the Automated Student Assessment Prize (ASAP), we find that: (a) LLMs generally exhibited almost perfect intra-LLM consistency regardless of temperature; (b) inter-LLM consistency was moderate, with higher agreement observed for items that were easier to score; (c) intra-LLM consistency consistently exceeded inter-LLM consistency, supporting the expectation that within-model consistency represents an upper bound for cross-model agreement; (d) intra-LLM consistency was not associated with scoring accuracy, whereas inter-LLM consistency showed a strong positive relationship with accuracy; and (e) majority voting across LLMs improved scoring accuracy by leveraging complementary strengths of different models.
Keywords: automatic scoring, large language model, scoring consistency, inter-rater consistency, reliability
Introduction
Constructed-response (CR) items are popular in educational measurement, given their potential in assessing complex skills and in-depth thinking (Bennett, 1991). However, the usage of CR items is costly in terms of human effort, as examinees will likely spend more time giving responses than they do on multiple-choice items, and multiple human raters are generally required to score them. Thus, automatic scoring becomes increasingly attractive (Braun et al., 1990; Kumar & Boulanger, 2023). Recent years have seen a switch from linguistic-feature-based algorithms to applications of large language models (LLMs; Latif & Zhai, 2024; Ormerod & Kehat, 2025), given their great potential in addressing a wide range of natural language processing (NLP) tasks (Bubeck et al., 2023), enabling better solutions for autoscoring tasks.
Whether scoring is done manually or automatically, two kinds of consistency are relevant: (a) intra-rater consistency, which refers to how consistent a rater is in scoring the same response when it is presented multiple times; and (b) inter-rater consistency, which refers to how consistent multiple raters are in scoring the same responses (e.g., Neittaanmäki & Lamprianou, 2024). However, the reasons for inconsistency can differ between human raters and LLM-based automatic scoring. For human raters, low intra-rater consistency can result from raters gaining more experience, adjusting their standards, or changing their mood throughout rating, and low inter-rater consistency can result from variation in rating severity (Neittaanmäki & Lamprianou, 2024), training experience (Weigle, 1998), rater profiles (Ahmadi Shirazi, 2019), and so on.
In the context of automatic scoring, the output of an LLM can vary given the same prompt because of two major characteristics of LLMs—probabilistic generation and auto-regression—which means that each predicted token is sampled probabilistically and depends on the preceding tokens (Raschka, 2024). Thus, the scores from an LLM are not deterministic, which can give rise to low intra-LLM consistency. This is a concern because the automatic scores may be less reliable, making it harder to evaluate the performance of LLM-based automatic scoring and to determine the final scores. Furthermore, variation in training data, architecture, and procedures across LLMs can lead to low inter-LLM consistency in scoring (Bai et al., 2023; Bubeck et al., 2023). However, low inter-LLM consistency might not be a concern if different LLMs handle different aspects of scoring CR items; when they are combined, a better scoring result might be obtained.
In terms of the relationship between intra- and inter-LLM consistency in automatic scoring, we propose that intra-LLM consistency be seen as an upper bound for inter-LLM consistency, since both share underlying randomness due to probabilistic generation, but inter-LLM variability is compounded by architectural and training differences.
While there is extensive existing research on human rating consistency (e.g., Neittaanmäki & Lamprianou, 2024; Stemler, 2004; Wigglesworth, 1993), the rating consistency of LLM-based automatic scoring is under-researched and, as mentioned above, can be both a threat and an opportunity for automatic scoring. Therefore, this research aims to address four research questions:
RQ1: How consistent are scores generated by an LLM?
RQ2: How consistent are scores generated by multiple LLMs?
RQ3: What is the relationship among intra-LLM, inter-LLM consistency, and scoring accuracy?
RQ4: Can voting within and across LLMs increase the scoring accuracy?
This study contributes to the field of automatic scoring by (a) advancing the understanding of the magnitude of consistency at the intra- and inter-LLM levels in automatic scoring, (b) clarifying the relationships among intra-LLM consistency, inter-LLM consistency, and scoring accuracy, and (c) proposing a voting strategy to combine scores from LLM-based automatic scoring, with the aim of increasing scoring accuracy.
In the following section, we first present a literature review on rater consistency and automatic scoring with LLMs, and then, four research questions are raised related to the intra-LLM consistency and inter-LLM consistency on scoring CR items. The method section elaborates on the five chosen LLMs (i.e., Claude, DeepSeek, Gemini, GPT, and Qwen), data, and evaluation metrics, followed by a detailed results section responding to the research questions and finally a discussion section.
Rater Consistency/Reliability
Rating behavior and measures of rating quality have been used across multiple fields, such as educational and psychological measurement, performance assessment, and medical observation (Liao et al., 2010). Different rating purposes bring different perspectives on rater consistency; therefore, it is hard to offer a single definition of a good rating that satisfies specialists in all fields (Gwet, 2014). When a rater is involved in the measurement system, there are at least two sources of measurement variance from the rater facet: intra-rater inconsistency and inter-rater inconsistency. Numerous efforts have been invested in understanding these inconsistencies and thus preventing or minimizing rater errors and improving scoring.
Intra-rater reliability involves the consistency of a single rater when scoring the same content more than once. In an educational setting, however, it is challenging to measure intra-rater reliability with manual scoring, as a rater usually does not rate the same response twice, and the rater’s effect has long been assumed to be a fixed effect (Shrout & Fleiss, 1979). Moreover, raters have human memories, and thus the “repeated rating problem” might be a major concern (Wilson & Hoskens, 2001). Even so, it is still an important concept in some clinical measurements concerning the reproducibility of a diagnosis (Gwet, 2008), usually measured by intraclass correlation coefficients (ICC) in a well-designed clinical interval rating setting, or by kappa-related indices when the number of replications is small.
Inter-rater reliability (IRR), also referred to as inter-rater consistency or inter-rater agreement, refers to consistency among ratings provided by multiple raters. To make sure all raters are rating the same construct, scoring rubrics are often developed first (Mertler, 2000; Moskal & Leydens, 2000; Tierney & Simon, 2004), which establish a clear link with the construct being measured and can thus also be seen as an important part of the validity of the rating (Casabianca et al., 2025). In general, if raters are properly trained, their inter-rater consistency should improve (Jonsson & Svingby, 2007). While intra-rater consistency is considered more a measure of the rater than of the construct being measured, inter-rater consistency/reliability is a property of the testing situation (i.e., it depends on raters, proper training, scoring rubrics, etc.), not the instrument itself (Stemler, 2004). Hence, whenever a rating activity is performed, inter-rater consistency needs to be re-established, even if the raters remain the same.
From a measurement perspective, intra-consistency can be understood as within-rater variation in the rating severity, while inter-rater consistency reflects between-rater variability. For LLM-based automatic scoring, intra-LLM consistency depends on how probabilistic the output of the LLM model is, which might be influenced by temperature, internal structure, training pipeline, etc. Intra-consistency serves as the basis of inter-consistency. That is, intra-LLM consistency should be the upper bound of inter-LLM consistency because within an LLM, the probabilistic characteristics and auto-regression make the scoring inconsistent, and across LLMs, the above two properties and the variation in training lead to inconsistent scores for the same response.
Automatic Scoring with LLM
The automated scoring of CR items, which dates back to the mid-1960s (Page, 2003), has garnered increased attention as a way to reduce the labor-intensive burden of manual grading. Traditional automatic scoring systems have approached this task by extracting linguistic features and treating scoring as a classification problem (Braun et al., 1990; Kumar & Boulanger, 2023). Approaches such as latent semantic analysis (Landauer et al., 1998; LaVoie et al., 2020), generalized latent semantic analysis (Islam & Hoque, 2010), and models relying on lexical information or acoustic features (e.g., Bin et al., 2008; Chen et al., 2018) have achieved moderate success. Ormerod et al. (2023) proposed an ensemble method that integrates scores from deep neural networks and a latent semantic analysis-based model. However, these approaches often fall short when scoring nuanced responses that require contextual and semantic understanding beyond surface-level text features.
LLMs offer a compelling alternative to these traditional approaches. Pre-trained on massive and diverse corpora, LLMs such as GPT (Latif & Zhai, 2024) and Bidirectional Encoder Representations from Transformers (BERT; Mayfield & Black, 2020) possess robust semantic understanding and contextual fluency, making them well-suited for open-ended response evaluation. Mansour et al. (2024) explored prompt-engineering tactics to maximize the scoring performance of GPT and Llama. The application of LLMs to scoring tasks has shown marked improvements in agreement with human raters, especially when models are fine-tuned or used with strategically designed prompts (Lee et al., 2024). As a result, LLMs are increasingly being incorporated into scoring pipelines for both research and operational purposes.
Nonetheless, LLM-based scoring systems introduce new challenges, most notably in the area of consistency. Unlike traditional deterministic models, LLMs rely on probabilistic token generation, which makes their outputs inherently variable. This variability becomes evident when the same prompt yields different scores across repeated calls, a phenomenon that manifests as reduced intra-LLM consistency. Temperature, a parameter that controls the randomness of token sampling, plays a pivotal role in modulating this behavior. Setting the temperature to 0 minimizes randomness and tends to produce more consistent outputs (Lee et al., 2024; Organisciak et al., 2023). However, even at temperature 0, perfect reproducibility is not always guaranteed due to backend processing variability and model architecture (Schmalbach, 2025). Conversely, higher temperature values introduce diversity and can benefit scoring in domains where creativity and interpretative nuance are important, but at the cost of output stability.
In addition to within-model inconsistency, variation also arises across different LLMs scoring the same response—a phenomenon referred to as inter-LLM inconsistency. Such variation can stem from differences in training data, architecture, or tokenization strategies. Understanding the extent of this inconsistency is essential, particularly if models are to be used interchangeably or collaboratively in scoring settings.
One proposed method to mitigate LLM variability is the use of ensemble techniques, such as voting strategies. These approaches aggregate multiple predictions—either across repeated model calls or across different models—to arrive at a final score. In CR item grading, ensemble models have been shown to outperform individual models by reducing noise and leveraging complementary strengths (e.g., Ormerod et al., 2023; Sahu & Bhowmick, 2019). Similarly, in neural scoring systems, voting across outputs has improved both agreement with human raters and score stability (Taghipour & Ng, 2016). More recently, studies specific to LLM-based scoring found that majority voting across trials or across models enhanced scoring robustness in science education and essay tasks (Bexte et al., 2023; Wu et al., 2023; Xie et al., 2022). However, ensemble strategies must be carefully designed to avoid reinforcing shared model biases or suppressing legitimate score variance.
In summary, while LLMs present a powerful tool for automatic scoring, their probabilistic nature introduces complexities in output consistency. Addressing these challenges through temperature tuning and aggregation strategies, such as voting, is vital for ensuring that LLM-based scoring systems are both reliable and practically useful in educational assessment contexts.
The Current Research
This study aims to investigate the rating consistency of automatic scoring with LLMs by addressing the four research questions.
RQ1: How consistent are scores generated by an LLM?
RQ1 is related to intra-LLM consistency. Output randomness is critical to scoring consistency and can be manipulated through two parameters: temperature and top_p. According to OpenAI’s (2024) guidance, only one of these parameters should be changed at a time. We manipulate the temperature of the LLM because it has been more often used to control the randomness of outputs and increase the consistency of automatic scoring (e.g., Lee et al., 2024). We ask an LLM to score the same response five times and consider two levels of temperature: 0 (i.e., the lowest temperature available in LLMs) and 1 (i.e., typically the default temperature). A temperature of 0 is often set in automatic scoring to ensure high consistency, and a temperature of 1 is generally used as the default. Theoretically, a temperature of 0 leads to more consistency than a temperature of 1, but it is unclear how much inconsistency is associated with the higher temperature and what the cost of the high consistency from a temperature of 0 is, because a temperature of 0 limits the possible range of LLM outputs and might suppress the use of some score categories.
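For illustration, the snippet below is a minimal sketch of how repeated scoring calls at a fixed temperature could be issued, assuming the OpenAI Python SDK; the model name, prompt content, and score parsing are placeholders rather than the exact pipeline used in this study.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_once(prompt: str, temperature: float) -> str:
    """Send one scoring request and return the raw model output."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # 0 = most deterministic; 1 = default randomness
    )
    return response.choices[0].message.content

# Score the same response five times at each temperature setting.
prompt = "..."  # placeholder: scoring prompt with item stem, rubric, and the student response
scores_t0 = [score_once(prompt, temperature=0) for _ in range(5)]
scores_t1 = [score_once(prompt, temperature=1) for _ in range(5)]
```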
RQ2: How consistent are scores generated by multiple LLMs?
RQ2 focuses on inter-LLM consistency. We consider five large language models in this study: GPT-4o, DeepSeek-V3, Gemini-2.0-flash, Claude-3.7-Sonnet, and Qwen-3-plus, which were the latest versions of these widely used LLMs at the time of the study; their details can be found in the method section. This research question is meaningful in multiple ways. First, it investigates the differences among LLMs in predicting scores, which helps us understand the strengths and weaknesses of different LLMs. Second, it raises our awareness of the need to choose LLMs carefully, as they could be inconsistent in scoring.
RQ3: What is the relationship among intra-LLM, inter-LLM consistency, and scoring accuracy?
Note that it is expected that intra-LLM consistency should be the upper bound of inter-LLM consistency. RQ3 aims to verify this expectation in automatic scoring for the context in this study. Moreover, accuracy is central to automatic scoring, so it is worth investigating how scoring inconsistency and accuracy are associated with each other.
RQ4: Can voting within and across LLMs increase the scoring accuracy?
If inconsistency among LLMs exists, what can be a good strategy for summarizing the outputs and yielding a single score for a response? In RQ4, informed by the voting strategy used in the field of machine learning (Liang et al., 2024; Wang et al., 2022), we investigate a majority voting strategy that exploits the inconsistency of automatic scoring by choosing the mode of the scores. Note that this study employs five LLMs, and each LLM scores the same response five times; thus, there are two voting scenarios: voting can happen among the five trials within an LLM, or it can be applied within a trial but across LLMs.
Method
Large Language Models
The selection of these five models (GPT-4o, DeepSeek-V3, Gemini-2.0-flash, Claude-3.7-Sonnet, and Qwen-3-plus) was driven by the need to represent diverse model architectures, provider origins, and cost profiles. We selected models representing the current state of the art (SOTA) from major US-based providers (OpenAI, Google, Anthropic) and leading China-based providers (DeepSeek, Alibaba) to examine whether differences in training data across geographies influence scoring agreement. Furthermore, the selection includes distinct architectural approaches: native multimodal models (GPT-4o, Gemini), Mixture-of-Experts (MoE) architectures (DeepSeek-V3), and hybrid reasoning models (Claude 3.7). This diversity is critical for determining whether scoring inconsistency is systemic to all LLMs or specific to certain model types.
First, GPT-4o was the latest version of GPT developed by OpenAI at the time of conducting this research and has been successfully applied to many NLP tasks (e.g., Hurst et al., 2024). It is pre-trained on corpora up to October 2023, characterized by its multimodal capabilities (i.e., handling text, audio, and image), and moderated through a series of safety checks. GPT-4o could be seen as a SOTA LLM when it was launched.
Second, DeepSeek-V3 was developed by DeepSeek-AI (2024) and features low-cost training, strong chain-of-thought reasoning, open-source availability, and an MoE approach to handling information (DeepSeek-AI, 2024). It is trained on 14.8 trillion diverse and high-quality tokens and has 671 billion total parameters (DeepSeek-AI, 2024). It has received intensive attention since its launch, and a growing body of research and applications has been built on DeepSeek (e.g., Gao et al., 2025).
Third, Gemini was launched by Google (Gemini Team Google, 2023). It is characterized by multimodal capability, superior speed, and a long context window. Successful applications can be found in education (Baytak, 2024), medical science (Mihalache et al., 2024), and so on. Gemini-2.0-flash was the latest version of Gemini at the time of conducting the automatic scoring and can balance output speed and quality.
Fourth, Claude-3.7-Sonnet was launched by Anthropic (2025). It is claimed to be the first hybrid reasoning LLM that combines slow and fast thinking in the market and shows impressive ability in coding and software engineering (Anthropic, 2025).
Finally, Qwen is an LLM developed by Alibaba Group that demonstrates superior performance across a wide range of NLP tasks (Bai et al., 2023). It possesses excellent tool-use and planning capabilities, which make it well suited for building LLM agents. In Bai et al. (2023), the largest Qwen model had 14 billion parameters. In this study, we used Qwen-plus, the latest and most powerful version at the time.
The five LLMs were accessed through their official APIs to ensure the models were up to date. As illustrated in Figure 1, the five LLMs were asked to rate the same set of responses five times, which led to 25 scores per response. A preliminary sensitivity analysis indicated that increasing the number of trials from 5 to 10 yielded negligible differences (<2%) in consistency and accuracy; thus, 5 repetitions were deemed sufficient given cost and efficiency considerations.
Figure 1.
A Response is Rated Five Times by Each of the Five LLMs
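A minimal sketch of this 5 × 5 design appears below. The `score_response` wrapper and the model identifiers are hypothetical stand-ins for the provider-specific API calls; the sketch is meant only to illustrate how the 25 scores per response could be organized.

```python
MODELS = ["gpt-4o", "deepseek-v3", "gemini-2.0-flash", "claude-3.7-sonnet", "qwen-plus"]
N_TRIALS = 5

def collect_scores(responses, temperature, score_response):
    """Return {response_id: {model: [score_1, ..., score_5]}}.

    `score_response(model, response_text, temperature)` is a hypothetical
    wrapper around each provider's official API.
    """
    all_scores = {}
    for rid, text in responses.items():
        all_scores[rid] = {
            model: [score_response(model, text, temperature) for _ in range(N_TRIALS)]
            for model in MODELS
        }
    return all_scores
```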
Data
We employed two datasets in this study: an assessment of learning progression in science (LPS) and the open-source data from Automated Student Assessment Prize (ASAP).
LPS Dataset
We used six CR items from the LPS project, which were designed under the Berkeley Assessment System (Wilson, 2023). Validity and reliability evidence has been accumulated for the six items from expert review, a think-aloud survey, a field test, and psychometric analysis. In every item, students were asked to respond to a scientific phenomenon using their understanding of science (see Figure 2 for an example). A total of 930 middle school students responded, and due to the rotation design, each item received around 300 to 450 responses. Four human raters were recruited to score the responses, and at least one-third of the responses were double-scored, with exact agreement exceeding 80% for all items. Table 1 presents the summary statistics of the items.
Figure 2.
A Constructed-Response Item Used in This Study.
Note. The map shows the Vida River and its surrounding cities. I5. Why are most of the cities on this map located on or very near the rivers?
Table 1.
Summary Statistics of the Six LPS CR Items
| Item | Max scores | # of rsp. | Mean rsp. len. | SD rsp. len. | H2H QWK | H2H agreement |
|---|---|---|---|---|---|---|
| I1 | 2 | 306 | 12.53 | 8.68 | 0.92 | 0.97 |
| I2 | 2 | 448 | 24.39 | 17.42 | 0.85 | 0.91 |
| I3 | 2 | 446 | 21.05 | 12.57 | 0.85 | 0.88 |
| I4 | 3 | 448 | 25.77 | 18.44 | 0.81 | 0.87 |
| I5 | 3 | 416 | 19.54 | 15.14 | 0.86 | 0.88 |
| I6 | 4 | 404 | 17.31 | 13.75 | 0.82 | 0.80 |
Note. rsp. = response; Mean rsp. len. = mean response length in words; SD rsp. len. = standard deviation of response length in words; H2H QWK = human-to-human Quadratic Weighted Kappa; H2H agreement = human-to-human agreement.
ASAP Dataset
The ASAP dataset was released by the Hewlett Foundation for a competition on automatic scoring and is one of the most popular datasets in the field (Ke & Ng, 2019). We used 10 items and 500 randomly sampled responses per item from the short-answer scoring dataset, which covers both English language arts and science. There are two manual scores for every response, with possible scores from 0 to 2 or 0 to 3, depending on the item. Detailed statistics are given in Table 2.
Table 2.
Summary Statistics of the 10 ASAP CR Items
| Item | Max scores | Mean rsp. len. | SD rsp. len. | H2H QWK | H2H agreement |
|---|---|---|---|---|---|
| 1 | 3 | 47.89 | 26.62 | 0.95 | 0.91 |
| 2 | 3 | 58.07 | 21.85 | 0.91 | 0.84 |
| 3 | 2 | 47.73 | 14.46 | 0.74 | 0.76 |
| 4 | 2 | 38.87 | 16.09 | 0.72 | 0.79 |
| 5 | 3 | 24.91 | 22.12 | 0.95 | 0.96 |
| 6 | 3 | 23.79 | 23.04 | 0.97 | 0.97 |
| 7 | 2 | 39.94 | 23.26 | 0.97 | 0.96 |
| 8 | 2 | 52.81 | 31.20 | 0.85 | 0.85 |
| 9 | 2 | 48.13 | 33.54 | 0.84 | 0.82 |
| 10 | 2 | 39.95 | 26.03 | 0.89 | 0.89 |
Prompt Design
Prompts are a key component in eliciting desirable responses from LLMs. Informed by Xue et al. (2025), Mansour et al. (2024), and Chamieh et al. (2024), the scoring prompt used in this study consists of four major components (see Figure 3, which highlights the components in different colors). First, as highlighted in red, the scoring prompt starts with the scoring requirement, item stem, and possible scoring categories, which lays out the general background for the scoring task. Second, we present some examples of correct responses, which serve as the benchmark for scoring. Third, scoring rubrics are added, which specify the standards of each score category. Finally, students’ responses are appended to the prompts. Together, these four components encompass all relevant information needed for scoring.
Figure 3.
A Scoring Prompt for ASAP Item
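To make the prompt structure concrete, the sketch below illustrates how the four components could be assembled into a single scoring prompt. The function name, field names, and wording are illustrative, not the exact template shown in Figure 3.

```python
def build_scoring_prompt(item_stem, score_categories, exemplars, rubric, student_response):
    """Assemble the four prompt components described above into one string."""
    exemplar_text = "\n".join(f"- {ex}" for ex in exemplars)
    return (
        "You are scoring a constructed-response item.\n"
        f"Item: {item_stem}\n"
        f"Possible scores: {', '.join(map(str, score_categories))}\n\n"
        f"Examples of correct responses:\n{exemplar_text}\n\n"
        f"Scoring rubric:\n{rubric}\n\n"
        f"Student response:\n{student_response}\n"
        "Return only the score."
    )
```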
Voting Strategy
The majority voting strategy investigated in this study chooses the prediction score with the highest frequency as the final score, as it represents the most agreed-upon score. For example, for a response with prediction scores of [3, 3, 3, 2, 3], 3 is used as the final score. In the event of a tie, a random score is picked from the tied scores. There are two levels of voting: (a) given an LLM, voting can be applied to the five trials; and (b) within a trial, voting is used to choose the final score across the five LLMs.
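A minimal implementation of this majority vote with a random tie-break might look as follows; it is a sketch of the strategy described above rather than the exact code used in the study.

```python
import random
from collections import Counter

def majority_vote(scores, rng=None):
    """Return the most frequent score; break ties by random choice among tied scores."""
    rng = rng or random.Random(0)
    counts = Counter(scores)
    top = max(counts.values())
    tied = [score for score, count in counts.items() if count == top]
    return tied[0] if len(tied) == 1 else rng.choice(tied)

# Intra-LLM voting: five trials of one model.
print(majority_vote([3, 3, 3, 2, 3]))  # -> 3
# Inter-LLM voting: one trial each from five models.
print(majority_vote([2, 3, 2, 1, 3]))  # tie between 2 and 3, resolved at random
```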
Metrics
To evaluate the consistency of the multiple scores, Fleiss’s kappa is used, which ranges from -1 to 1, with 0 indicating agreement due to chance (Fleiss, 1971). A higher Fleiss’s kappa suggests higher consistency. It takes the form of

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e} \tag{1}$$

where $\bar{P}$ is the average observed agreement and $\bar{P}_e$ represents the chance agreement. Landis and Koch (1977) proposed a rule of thumb to interpret Fleiss’s kappa statistics (see Table 3).
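For reference, Fleiss’s kappa can be computed from a matrix of repeated ratings, for example with statsmodels. The sketch below assumes each row holds the five scores assigned to one response (e.g., five trials of one LLM); the ratings shown are illustrative.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Each row: the five scores assigned to one response.
ratings = np.array([
    [2, 2, 2, 2, 2],
    [1, 1, 2, 1, 1],
    [0, 0, 0, 0, 0],
    [3, 3, 3, 3, 2],
])

# Convert raw ratings to a (responses x categories) count table, then compute kappa.
table, _categories = aggregate_raters(ratings)
print(fleiss_kappa(table, method="fleiss"))
```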
Table 3.
Interpretation of Fleiss’s Kappa Statistics from Landis and Koch (1977)
| Fleiss’s Kappa | Strength of Agreement |
|---|---|
| <0.00 | Poor |
| [0.00, 0.20] | Slight |
| [0.21, 0.40] | Fair |
| [0.41, 0.60] | Moderate |
| [0.61, 0.80] | Substantial |
| [0.81, 1.00] | Almost Perfect |
Moreover, Quadratic Weighted Kappa (QWK) is used to quantify the accuracy of automatic scoring when human scores are taken as the true labels. Like Fleiss’s kappa, QWK ranges from −1 to 1; a higher value indicates higher accuracy, and a threshold of 0.70 is often used as a rule of thumb. Compared with exact agreement, it takes both chance agreement and the degree of disagreement into consideration.
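QWK can be computed directly from paired human and LLM scores, for example with scikit-learn; the score vectors below are illustrative only.

```python
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 1, 0, 3, 2, 1]  # human-assigned labels (placeholder values)
llm_scores   = [2, 1, 1, 3, 2, 0]  # LLM-assigned labels (placeholder values)

qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(round(qwk, 3))
```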
Results
Intra-LLM Consistency
To evaluate intra-LLM consistency, Table 4 presents the Fleiss’s kappa for each LLM across datasets and temperatures. One clear pattern was that a temperature of 1 led to less consistency than a temperature of 0 (t = −8.29, p < .001, d = 1.31). On average, the Fleiss’s kappa was 0.95 at a temperature of 0 and 0.86 at a temperature of 1, a 9% decrease. However, both values indicate almost perfect agreement according to Landis and Koch (1977). Among the five LLMs, Gemini yielded the highest consistency across temperatures and datasets, while the consistency of Claude varied considerably depending on the dataset and temperature.
Table 4.
Intra-LLM Fleiss’s Kappa
| LLM | LPS (Tempr. = 0) | LPS (Tempr. = 1) | ASAP (Tempr. = 0) | ASAP (Tempr. = 1) | Average (Tempr. = 0) | Average (Tempr. = 1) |
|---|---|---|---|---|---|---|
| Claude | 0.994 | 0.856 | 0.980 | 0.686 | 0.985 | 0.750 |
| DeepSeek | 0.882 | 0.856 | 0.970 | 0.892 | 0.937 | 0.878 |
| Gemini | 0.990 | 0.932 | 0.991 | 0.942 | 0.991 | 0.938 |
| GPT | 0.950 | 0.873 | 0.952 | 0.836 | 0.951 | 0.850 |
| Qwen | 0.955 | 0.950 | 0.879 | 0.837 | 0.908 | 0.880 |
Inter-LLM Consistency
Across the five LLMs, the Fleiss’s kappa was 0.51 and 0.49 for temperatures of 0 and 1, respectively. Both were lower than the intra-LLM consistency (t = −13.15, p < .001, d = 3.29) and suggested only moderate agreement, which highlights the importance of trying and testing different LLMs before making a final decision, as they can yield considerably different scores for the same responses. Moreover, there was no statistically significant difference between the two temperatures at the 5% level (t = 0.49, p = .62), indicating that temperature did not contribute to variation in scoring consistency across LLMs.
Relationship Among Intra-, Inter-Consistency, and Accuracy
In terms of the relationship between intra- and inter-LLM consistency, inter-LLM consistency was statistically significantly lower than intra-LLM consistency (t = −13.15, p < .001, d = 3.29), supporting the hypothesis that intra-LLM consistency may act as an upper bound for inter-LLM consistency. Moreover, we also observed that an item with low intra-LLM consistency tended to be scored more inconsistently across LLMs (see Figure 4). The Pearson correlation coefficient between intra- and inter-LLM consistency was estimated to be 0.70 (p < .001) and 0.72 (p < .001) for temperatures of 0 and 1, respectively.
Figure 4.
Mean Intra- and Inter-LLM Consistency by Item
Regarding the relationship between intra-LLM consistency and scoring accuracy, Figure 5 presents a scatter plot of intra-LLM Fleiss’s kappa against QWK for the five LLMs on all items under different temperatures. Overall, there was no clear pattern between intra-LLM consistency and scoring accuracy, as evidenced by nonsignificant correlations (temperature of 0: r = 0.08, p = .476; temperature of 1: r = 0.21, p = .068). That is, there is no direct relationship between intra-LLM consistency and scoring accuracy in automatic scoring with LLMs. However, we noticed that under a temperature of 1 the points were more dispersed, potentially because the ceiling effect on intra-LLM consistency was weaker.
Figure 5.
Scatter Plot Between Intra-Consistency (Fleiss’s Kappa) and Accuracy (QWK) for Each LLM and Item Under Different Temperatures
Figure 6 presents the relationship between inter-LLM consistency and scoring accuracy at the item level. We found that as inter-LLM consistency increased, scoring became more accurate. The Pearson correlation coefficient was 0.84 (p < .001) for a temperature of 0 and 0.82 (p < .001) for a temperature of 1. This finding reflects that (a) LLMs tended to agree with one another when scoring items that were easy to score, and (b) temperature had a small effect on the relationship between inter-LLM consistency and scoring accuracy.
Figure 6.
Scatter Plot Between Inter-Consistency (Fleiss’s Kappa) and Accuracy (QWK) for Each Item Under Different Temperatures
Voting Strategy
As shown in Table 5, we did not observe a consistent improvement in QWK between scoring without voting and with intra-LLM voting (t = 0.20, p = .84), which could be ascribed to two factors: (a) multiple scorings of the same response within an LLM demonstrated high consistency, as shown above, leaving limited room for improvement in accuracy; and (b) variation in scoring the same response within an LLM might be regarded as random noise, thus contributing negligibly to accuracy. On the other hand, we found an improvement in scoring accuracy with inter-LLM voting (t = 2.20, p < .05, d = 0.19). Across temperatures, inter-LLM voting resulted in a similar 5% increase in QWK. Finally, temperature had a negligible influence on scoring accuracy across voting strategies and datasets (t = 0.26, p = .80).
Table 5.
Scoring Accuracy (QWK) Under Various Voting Strategies
| Voting Strategy | LPS (Tempr. = 0) | LPS (Tempr. = 1) | ASAP (Tempr. = 0) | ASAP (Tempr. = 1) | Average (Tempr. = 0) | Average (Tempr. = 1) |
|---|---|---|---|---|---|---|
| Without Voting | 0.600 | 0.600 | 0.547 | 0.542 | 0.567 | 0.563 |
| Intra-LLM Voting | 0.601 | 0.604 | 0.547 | 0.546 | 0.568 | 0.568 |
| Inter-LLM Voting | 0.634 | 0.637 | 0.572 | 0.572 | 0.595 | 0.600 |
Discussion
Scoring inconsistency is common in human ratings and has long received attention in the assessment literature (Fitzpatrick et al., 1998; Stemler, 2004). This phenomenon presents both challenges and opportunities for LLM-based automatic scoring. In this study, we examined scoring inconsistency using five LLMs, two temperature settings, and two datasets.
In terms of intra-LLM consistency, nearly all LLMs demonstrated almost perfect agreement when scoring the same responses, with Claude being the exception. This finding reflects that despite their probabilistic characteristics, LLMs generally prioritize producing outputs aligned with human expectations—in this case, human-assigned scores—so consistent scores can be obtained through multiple trials within an LLM. Notably, Gemini provided the most consistent ratings. However, because Gemini Team Google (2025) has disclosed limited information about the model’s architecture and training procedures, the mechanisms behind its strong consistency remain unclear.
As we expected, inter-LLM consistency was statistically significantly lower than intra-LLM consistency. While some divergence is inherent to the probabilistic nature of token generation, the magnitude of this gap highlights significant operational risks in swapping models without validation. This variation likely acts as a proxy for the unobservable differences in proprietary architecture. For instance, DeepSeek-V3 utilizes a MoE architecture, which activates a sparse subset of parameters per token, potentially creating distinct scoring patterns compared to dense models. Similarly, Claude-3.7-Sonnet is characterized as a “hybrid reasoning” model, integrating extended “chain-of-thought” processing steps before outputting a final score. This “reasoning” capability, where the model essentially debates the score internally, likely contributes to its distinct behavior compared to models like Gemini-2.0-Flash, which prioritize context-window efficiency. Since technical reports for these proprietary systems do not fully disclose training weights or alignment data (e.g., Reinforcement Learning from Human Feedback preferences), empirical comparison of their outputs remains the primary method for quantifying these architectural divergences.
Regarding scoring consistency and accuracy, we found no strong relationship between accuracy and intra-LLM consistency, which suggests that despite some fluctuation of scoring occurring for the same response for an LLM, this does not contribute much to the scoring accuracy. This may be due to two factors: (a) high intra-consistency indicates only a small fluctuation in scoring, and (b) we have hundreds of responses that can accommodate the fluctuation within an LLM. In contrast, a positive relationship is found between inter-LLM consistency and scoring accuracy. If an item is easier to score automatically, the LLMs tend to exhibit higher consistency. This suggests that agreement across models may serve as an indirect indicator of scoring tractability.
With respect to temperature, prior research has tended to set the temperature at 0 to increase scoring consistency (e.g., Lee et al., 2024). Our findings confirm that temperature 0 yields high intra-LLM consistency. Consistency decreased by about 9% when the temperature increased to 1, though values still fell within the “almost perfect” range (Landis & Koch, 1977). Importantly, temperature did not influence inter-LLM consistency, scoring accuracy, or the relationship between the two. Thus, when accuracy is the primary concern, using a higher temperature (e.g., 1) does not appear to be detrimental.
Given that inconsistency can occur across multiple LLMs or within an LLM across multiple trials, we took a further step by aggregating information from multiple ratings using a majority voting strategy. We found that voting within an LLM does not lead to substantial improvement in scoring accuracy. That is, the fluctuation in scoring within an LLM reflects stochastic variation rather than systematic divergence in scoring logic. However, voting across multiple LLMs increased scoring accuracy by 5%, which implies two conclusions: (a) LLMs differ in scoring in meaningful ways, and (b) voting can combine information from different LLMs and provide more accurate scores.
Finally, although ensembling LLMs through voting can improve scoring accuracy, it also increases computational cost because each response must be scored multiple times. Pricing differences across LLM providers exacerbate this challenge. Among the five models examined, Claude is the most expensive, charging $3 per million input tokens and $15 per million output tokens, whereas GPT, Gemini, DeepSeek, and Qwen are substantially more affordable: DeepSeek, for example, charges only $0.28 per million input tokens and $0.42 per million output tokens. Researchers should therefore balance accuracy gains against cost considerations. For high-stakes assessments, ensembling may be justified, and cost may be reduced by incorporating lower-cost models into the ensemble.
In summary, this study integrates insights from two lines of research—rater reliability and LLM-based automatic scoring—to examine consistency in machine-generated scores. We found that scoring consistency is generally high within individual LLMs and moderate across different LLMs. Moreover, while intra-LLM consistency is not associated with scoring accuracy, inter-LLM consistency shows a meaningful positive relationship with accuracy. We also demonstrate that a majority-voting strategy improves scoring accuracy only when combining scores from different LLMs, rather than combining multiple trials of the same model. Overall, this study contributes by: (a) quantifying the magnitude of intra- and inter-LLM consistency in automatic scoring; (b) clarifying the relationships among intra-LLM consistency, inter-LLM consistency, and scoring accuracy; and (c) proposing and evaluating a voting strategy for aggregating LLM-based scores to enhance accuracy.
Limitations and Future Directions
The relationship between consistency and accuracy in automatic scoring with LLMs was investigated in this study because accuracy is one of the most important and frequently used indices in the field of automatic scoring (Lee et al., 2024; Mansour et al., 2024). However, accuracy should not be the sole metric for quantifying the performance of automatic scoring (e.g., McCaffrey et al., 2022). Hence, it is worth exploring how intra- and inter-LLM consistency is associated with other scoring qualities, such as construct validity, response process evidence, and fairness across subgroups.
Moreover, this research employed two datasets that contain items in science education and English language arts. Even though we observed similar patterns across the two datasets, it is worthwhile to expand the datasets in future work to investigate scoring consistency in a broader range of domains, such as math (Baral et al., 2022), computational thinking (Tan et al., 2024), and so on.
Finally, model interpretability can be important for making responsible decisions about automatic scoring with LLMs. However, this research focuses on scoring consistency, and the five LLMs employed in this study are proprietary, large-scale commercial systems with limited publicly available information about their internal architectures and training pipelines, which makes it infeasible to link model interpretability to scoring consistency. Future research can explore interpretability-oriented approaches to better understand scoring consistency with LLMs.
Footnotes
Authors’ Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This material is based upon work supported by the National Science Foundation under Grant No. 2010322.
ORCID iDs: Mingfeng Xue
https://orcid.org/0000-0002-4801-3754
Xingyao Xiao
https://orcid.org/0000-0001-8430-0438
Yunting Liu
https://orcid.org/0009-0004-9594-9661
Open Practices Statement: Code and data are available at https://osf.io/4zhux/
References
- Ahmadi Shirazi M. (2019). For a greater good: Bias analysis in writing assessment. Sage Open, 9(1), 2158244018822377.
- Anthropic. (2025, February 24). Claude 3.7 sonnet and Claude code. https://www.anthropic.com/news/claude-3-7-sonnet
- Bai J., Bai S., Chu Y., Cui Z., Dang K., Deng X., . . . Zhu T. (2023). Qwen technical report. arXiv:2309.16609. https://arxiv.org/abs/2309.16609
- Baral S., Seetharaman K., Botelho A. F., Wang A., Heineman G., Heffernan N. T. (2022, July). Enhancing auto-scoring of student open responses in the presence of mathematical terms and expressions. In International Conference on Artificial Intelligence in Education (pp. 685–690). Springer.
- Baytak A. (2024). The content analysis of the lesson plans created by ChatGPT and Google Gemini. Research in Social Sciences and Technology, 9(1), 329–350.
- Bennett R. E. (1991). On the meanings of constructed response. ETS Research Report Series, 1991(2), i–46.
- Bexte M., Horbach A., Zesch T. (2023, July). Similarity-based content scoring – a more classroom-suitable alternative to instance-based scoring? In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 1892–1903). Association for Computational Linguistics.
- Bin L., Jun L., Jian-Min Y., Qiao-Ming Z. (2008, December). Automated essay scoring using the KNN algorithm. In 2008 International Conference on Computer Science and Software Engineering (Vol. 1, pp. 735–738). IEEE.
- Braun H. I., Bennett R. E., Frye D., Soloway E. (1990). Scoring constructed responses using expert systems. Journal of Educational Measurement, 27(2), 93–108.
- Bubeck S., Chandrasekaran V., Eldan R., Gehrke J., Horvitz E., Kamar E., Zhang Y. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv:2303.12712. https://arxiv.org/abs/2303.12712
- Casabianca J. M., McCaffrey D. F., Johnson M. S., Alper N., Zubenko V. (2025). Validity arguments for constructed response scoring using generative artificial intelligence applications. arXiv:2501.02334. https://arxiv.org/abs/2501.02334
- Chamieh I., Zesch T., Giebermann K. (2024, June). LLMs in short answer scoring: Limitations and promise of zero-shot and few-shot approaches. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024) (pp. 309–315). Association for Computational Linguistics.
- Chen L., Tao J., Ghaffarzadegan S., Qian Y. (2018, April). End-to-end neural network based automated speech scoring. In 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 6234–6238). IEEE.
- DeepSeek-AI. (2024). DeepSeek-V3 technical report. arXiv:2412.19437. https://arxiv.org/abs/2412.19437
- Fitzpatrick A. R., Ercikan K., Yen W. M., Ferrara S. (1998). The consistency between raters scoring in different test years. Applied Measurement in Education, 11(2), 195–208.
- Fleiss J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
- Gao T., Jin J., Ke Z. T., Moryoussef G. (2025). A comparison of DeepSeek and other LLMs. arXiv:2502.03688. https://arxiv.org/pdf/2502.03688
- Gemini Team Google. (2023). Gemini: A family of highly capable multimodal models. arXiv:2312.11805. https://arxiv.org/abs/2312.11805
- Gemini Team Google. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv:2507.06261. https://arxiv.org/abs/2507.06261
- Gwet K. L. (2008). Intrarater reliability. Methods and Applications of Statistics in Clinical Trials, 2, 473–485.
- Gwet K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters. Advanced Analytics, LLC.
- Hurst A., Lerer A., Goucher A. P., Perelman A., Ramesh A., Clark A., . . . Kivlichan I. (2024). GPT-4o system card. arXiv:2410.21276. https://arxiv.org/abs/2410.21276
- Islam M. M., Hoque A. L. (2010, December). Automated essay scoring using generalized latent semantic analysis. In 2010 13th International Conference on Computer and Information Technology (ICCIT) (pp. 358–363). IEEE.
- Jonsson A., Svingby G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130–144.
- Ke Z., Ng V. (2019). Automated essay scoring: A survey of the state of the art. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19 (pp. 6300–6308). IJCAI.
- Kumar V., Boulanger D. (2023). Explainable automated essay scoring: Deep learning really has pedagogical value. Frontiers in Education, 5, Article 572367.
- Landauer T. K., Foltz P. W., Laham D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2–3), 259–284. https://doi.org/10.1080/01638539809545028
- Landis J. R., Koch G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
- Latif E., Zhai X. (2024). Fine-tuning ChatGPT for automatic scoring. Computers and Education: Artificial Intelligence, 6, 100210.
- LaVoie N., Parker J., Legree P. J., Ardison S., Kilcullen R. N. (2020). Using latent semantic analysis to score short answer constructed responses: Automated scoring of the consequences test. Educational and Psychological Measurement, 80(2), 399–414.
- Lee G. G., Latif E., Wu X., Liu N., Zhai X. (2024). Applying large language models and chain-of-thought for automatic scoring. Computers and Education: Artificial Intelligence, 6, 100213.
- Liang X., Song S., Zheng Z., Wang H., Yu Q., Li X., . . . Li Z. (2024). Internal consistency and self-feedback in large language models: A survey. arXiv:2407.14507. https://arxiv.org/abs/2407.14507
- Liao S. C., Hunt E. A., Chen W. (2010). Comparison between inter-rater reliability and inter-rater agreement in performance assessment. Annals Academy of Medicine Singapore, 39(8), 613.
- Mansour W. A., Albatarni S., Eltanbouly S., Elsayed T. (2024, May). Can large language models automatically score proficiency of written essays? In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 2777–2786). ELRA and ICCL.
- Mayfield E., Black A. W. (2020). Should you fine-tune BERT for automated essay scoring? In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 151–162). Association for Computational Linguistics.
- McCaffrey D. F., Casabianca J. M., Ricker-Pedley K. L., Lawless R. R., Wendler C. (2022). Best practices for constructed-response scoring. ETS Research Report Series, 2022(1), 1–58.
- Mertler C. A. (2000). Designing scoring rubrics for your classroom. Practical Assessment, Research, and Evaluation, 7(1), 1–8.
- Mihalache A., Grad J., Patil N. S., Huang R. S., Popovic M. M., Mallipatna A., . . . Muni R. H. (2024). Google Gemini and Bard artificial intelligence chatbot performance in ophthalmology knowledge assessment. Eye, 38(13), 2530–2535.
- Moskal B. M., Leydens J. A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research, and Evaluation, 7(1), 1–9.
- Neittaanmäki R., Lamprianou I. (2024). All types of experience are equal, but some are more equal: The effect of different types of experience on rater severity and rater consistency. Language Testing, 41(3), 606–626.
- OpenAI. (2024). API reference. https://platform.openai.com/docs/api-reference
- Organisciak P., Acar S., Dumas D., Berthiaume K. (2023). Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models. Thinking Skills and Creativity, 49, 101356.
- Ormerod C., Kehat G. (2025, October). Long context automated essay scoring with language models. In Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers (pp. 35–42). National Council on Measurement in Education.
- Ormerod C., Lottridge S., Harris A. E., Patel M., van Wamelen P., Kodeswaran B., . . . Young M. (2023). Automated short answer scoring using an ensemble of neural networks and latent semantic analysis classifiers. International Journal of Artificial Intelligence in Education, 33(3), 467–496.
- Page E. B. (2003). Project essay grade: PEG. In Shermis M. D. (Ed.), Automated essay scoring: A cross-disciplinary perspective (p. 43). Lawrence Erlbaum Associates.
- Raschka S. (2024). Build a large language model (from scratch). Simon and Schuster.
- Sahu A., Bhowmick P. K. (2019). Feature engineering and ensemble-based approach for improving automatic short-answer grading performance. IEEE Transactions on Learning Technologies, 13(1), 77–90.
- Schmalbach V. (2025). Does temperature = 0 guarantee deterministic LLM outputs? https://www.vincentschmalbach.com/does-temperature-0-guarantee-deterministic-llm-outputs/
- Shrout P. E., Fleiss J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420.
- Stemler S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research, and Evaluation, 9(1), 1–11.
- Taghipour K., Ng H. T. (2016). A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 1882–1891). Association for Computational Linguistics.
- Tan B., Jin H. Y., Cutumisu M. (2024). The applications of machine learning in computational thinking assessments: A scoping review. Computer Science Education, 34(2), 193–221.
- Tierney R., Simon M. (2004). What’s still wrong with rubrics: Focusing on the consistency of performance criteria across scale levels. Practical Assessment, Research, and Evaluation, 9(1), 1–7.
- Wang X., Wei J., Schuurmans D., Le Q., Chi E., Narang S., Zhou D. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv:2203.11171. https://arxiv.org/abs/2203.11171
- Weigle S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263–287.
- Wigglesworth G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10(3), 305–319.
- Wilson M. (2023). Constructing measures: An item response modeling approach (2nd ed.). Routledge.
- Wilson M., Hoskens M. (2001). The rater bundle model. Journal of Educational and Behavioral Statistics, 26(3), 283–306.
- Wu X., He X., Liu T., Liu N., Zhai X. (2023). Matching exemplar as next sentence prediction (MeNSP): Zero-shot prompt learning for automatic scoring in science education. In International Conference on Artificial Intelligence in Education (pp. 401–413). Springer.
- Xie J., Cai K., Kong L., Zhou J., Qu W. (2022). Automated essay scoring via pairwise contrastive regression. In Proceedings of the 29th International Conference on Computational Linguistics (pp. 2724–2733). International Committee on Computational Linguistics.
- Xue M., Liu Y., Xiao X., Wilson M. (2025). Automatic prompt engineering for automatic scoring. Journal of Educational Measurement, 62(4), 559–587.