Abstract
Recent advancements in natural language processing, computational linguistics, and Artificial Intelligence (AI) have propelled the use of Large Language Models (LLMs) in Automated Essay Scoring (AES), offering efficient and unbiased writing assessment. This study assesses the reliability of LLMs in AES tasks, focusing on scoring consistency and alignment with human raters. We explore the impact of prompt engineering, temperature settings, and multiple rating dimensions on the scoring performance of LLMs. Results indicate that prompt engineering significantly affects the reliability of LLMs, with GPT-4 showing marked improvement over GPT-3.5 and Claude 2, achieving 112% and 114% increases in scoring accuracy over these models, respectively, under the Criteria & Sample-Referenced Justification prompt. Temperature settings also influence the output consistency of LLMs, with lower temperatures producing scores more in line with human evaluations, which is essential for maintaining fairness in large-scale assessment. Regarding multi-dimensional writing assessment, results indicate that GPT-4 performs well in the Ideas (QWK=0.551) and Organization (QWK=0.584) dimensions under well-crafted prompt engineering. These findings pave the way for a comprehensive exploration of LLMs' broader educational implications, offering insights into their capability to refine and potentially transform writing instruction, assessment, and the delivery of diagnostic and personalized feedback in the AI-powered educational age. While this study focused on the reliability and alignment of LLM-powered multi-dimensional AES, future research should broaden its scope to encompass diverse writing genres and a more extensive sample from varied backgrounds.
Keywords: Automated essay scoring (AES), Large language models (LLMs), Generative pre-trained transformer (GPT), Prompt engineering, Multi-dimensional writing assessment
1. Introduction
The integration of Artificial Intelligence (AI) into the educational sector is undergoing a transformative shift, particularly with the adoption of Large Language Models (LLMs) [1], [2], [3]. The advanced text-processing prowess of LLMs enables both researchers and educators to perform complex linguistic tasks with remarkable precision. The application of LLMs highlights the pivotal role of generative AI in enriching educational practices, from facilitating chatbot interactions to providing personalized feedback, thereby benefiting both educators and students. Acknowledged for their extensive capabilities, LLMs hold the potential to revolutionize both educational evaluation and pedagogy through their accessible AI features. However, the effectiveness and reliability of LLMs in aiding multi-dimensional writing assessment and instruction remain to be fully explored.
In the domain of educational assessment, Automated Essay Scoring (AES) systems [4], [5], [6] have surfaced as transformative instruments, redefining the paradigm of writing assessment. Incorporating AES into pedagogical contexts meets a pivotal demand for efficiency and impartiality in evaluating written responses, particularly in large-scale testing scenarios. Their importance is accentuated by their capacity to yield consistent and impartial evaluations, devoid of human rater fatigue or partiality. Furthermore, AES systems can be customized to particular scoring rubrics, ensuring alignment with educational standards and instructional goals.
Current approaches to AES vary from feature-based algorithms to ensemble statistical and machine learning algorithms. Feature engineering is the essential phase for feature-based approaches, entailing the discernment and extraction of linguistic and stylistic features that signify writing quality [7], [8], [9]. These features span a broad spectrum of indicators, ranging from fundamental mechanics such as spelling and grammar [10], [11] to intricate elements like lexical diversity and sophistication [12], [13], syntactic complexity and sophistication [14], [15], cohesion [16], [17], and organization [18], [19].
Statistical and machine learning algorithms furnish the computational prowess necessary to analyze and assess the identified features. Initial models utilized linear regression approaches, which have progressed to include advanced algorithms such as support vector machines, random forest, decision trees, and ensemble techniques [20], [21]. The adoption of neural network frameworks, especially deep learning models [22], [23], [24], has further augmented the predictive precision of AES models, permitting them to assimilate data with greater subtlety and detail. Although statistical and machine learning algorithms have shown remarkable results in scoring accuracy, the demand for feedback that is both explainable and interpretable remains critical to understanding the rationale behind assigned scores and delivering pedagogical support. Thus, there exists an ongoing requirement to tap into the capabilities of LLMs in AES tasks, as LLMs could provide students and learners with customized and diagnostic feedback, potentially improving the understanding of core concepts of successful writing and providing guidance for enhancing their writing quality.
The birth of LLMs like Generative Pre-trained Transformer (GPT) [25] and Bidirectional Encoder Representations from Transformers (BERT) [26] has been a groundbreaking development, establishing new benchmarks in the domain of educational practices. Their pre-training on heterogeneous and extensive textual corpora enables them to grasp language nuances, encompassing idiomatic expressions, syntactic complexity, and diverse discourse styles [27], [28]. In AES tasks, LLMs present unparalleled potential to replicate human-like evaluative abilities, offering a level of refinement that may parallel seasoned educators [29], [30]. Their proficiency in comprehensively processing and appraising text, considering context and subtleties, makes them prime candidates for the forthcoming generation of AES systems.
Despite the potential of LLMs for AES tasks [31], [32], the optimal selection of various LLMs and hyperparameter configurations for AES remains unresolved. Besides, scholars have focused significantly on the potential of LLMs in holistic scoring endeavors, while the importance of analytic scoring tasks has often been overlooked. Providing analytic feedback is crucial for students who exhibit varying levels of development across different writing quality constructs, highlighting both strengths and weaknesses to facilitate improvement.
The present study explores the dependability of LLMs in AES tasks and their equivalence to human raters and examines the potential role that prompt engineering could play in enhancing the performance of LLMs for AES tasks. This study delves into how LLMs can accurately capture students' writing samples with different prompt engineering, thereby aligning more closely with scores of human raters. Next, it explores the stability of LLM-derived scores under varying temperature conditions, a determinant that affects the models' output variability and predictability. We evaluated the impact of different versions and hyperparameters within LLMs on scoring efficiency. More specifically, the study conducts a detailed examination of scoring across multiple rating dimensions vital to a comprehensive assessment of writing quality, including Ideas, Organization, Style, and Conventions. Finally, the possible implications of LLMs for reshaping multi-dimensional writing assessment are offered. Prompted by the emerging trends and identified research gaps, this study sets out to investigate the following research questions:
1) How accurate are LLMs in AES tasks when subjected to diverse prompt engineering strategies?
2) How does the accuracy of LLMs in AES tasks fluctuate under different temperature settings?
3) How do LLMs perform in AES tasks across multi-rating dimensions?
The significance of the present study is mainly manifested in the following aspects: primarily, it aims to alleviate the substantial workload and burden on educators and raters. By leveraging LLMs, this study facilitates the provision of quantitative feedback and offers substantial support to both teachers and students regarding various constructs of writing quality. Furthermore, the insights gained from LLM-generated feedback enable teachers to implement more varied and personalized instructional approaches.
Based on the arguments above, the substantial contributions of this study are encapsulated as follows: The current study deepens the understanding of the potential of LLMs in multi-dimensional writing assessment and instruction. Besides, the integration of AI and LLM-powered AES presents a potential avenue for reducing the risks of bias and inconsistencies among raters, thereby ensuring students are evaluated against uniform standards. Through the utilization of analytic rating criteria and well-crafted prompt engineering, the present study transforms previously opaque black-box AES models into explainable and user-accessible tools, enabling teachers and researchers to carry out diversified and individualized teaching and to deliver real-time feedback based on multi-dimensional assessment.
2. Related works
The considerable progress in the confluence of NLP, linguistics, AI, and education has catalyzed profound innovations in AES over the past fifty years [4], [5]. AES is garnering escalating interest within the realm of interdisciplinary scholarship, as it markedly alleviates the workload associated with manual evaluation and furnishes instructive feedback for educators and learners [33], [34]. The trajectory of AES systems has transitioned from nascent models predicated on manually curated linguistic indices to advanced systems embedded with AI-powered technologies.
2.1. Automated essay scoring systems
There have been burgeoning studies on the potential roles of AES systems in writing assessment and education. The earliest attempts to automatically score essays by computer were made by Ellis Page and his colleagues [20]. Project Essay Grader (PEG) was designed to ensure that the large-scale essay scoring process is more practical and efficient. However, PEG has been criticized for ignoring the semantic meaning of essays. Moreover, the linguistic features it relies on and the details of how the holistic score is derived are proprietary.
With the advancement of NLP and computational linguistics, AES systems have evolved significantly over the past decades [35], [36], [37]. The evolution and application of AES systems have marked a momentous stride in writing assessment and instruction. AES systems provide numerous advantages in reducing time and costs and relieving the heavy burden on teachers and raters. In addition, they ensure a consistent application of rating traits, promoting the fairness of scoring tasks. However, most existing AES systems described above rely on feature-based approaches for measuring overall writing quality. As writing quality needs to be judged from different constructs, AES systems thus need to be trained with different rating traits corresponding to each construct of writing quality. This is essential in writing assessment because learners develop at different rates across different constructs of writing quality and allocate their attentional resources with different focuses.
Moreover, while existing AES systems have advanced with the development of ensemble approaches [5], [38], the inherent complexity involving data collection, model development, and evaluation poses substantial challenges for educators lacking in programming skills. Addressing these obstacles is crucial for expanding the accessibility of AI-powered scoring innovations, like ChatGPT, thereby enabling a more diverse group of educational professionals to utilize these innovations. This move is essential for leveraging generative AI's full potential in education, ensuring advanced scoring systems are available not just to tech-savvy researchers but also to frontline educators for enhancing writing assessment and personalized instructions.
2.2. Traditional feature-based approaches
Traditional approaches employing handcrafted linguistic features for AES tasks involve three phases. Initially, linguistic features are extracted from the essays. Subsequently, specific methodologies or algorithms are employed to iteratively discern the correlation between linguistic features and human raters' scores. Predefined linguistic features, including lexical, syntactic, and semantic elements, are integrated into algorithms for training to predict scores. Ultimately, the agreement between the predicted scores and those given by human evaluators is ascertained using evaluation metrics.
Traditional feature-based approaches can be categorized into regression, classification, and ranking. Cutting-edge AES models and systems predominantly utilize regression algorithms for scoring tasks [39], [40], [41]. Compared to regression methods, a comparatively limited corpus of research has implemented classification [42], [43] and ranking [44] algorithms in scoring endeavors. Regarding linguistic features and constructs of writing quality, most of the traditional feature-based approaches have attached great importance to holistic scores, while different constructs of writing quality are ignored. In terms of methods for model construction, the vast majority of the existing AES models with traditional approaches have placed much emphasis on multiple linear regression methods. However, there is a complex interaction between linguistic features and different constructs of writing quality, thereby requiring ensemble models to capture the intricate relationship.
The advent of LLMs marks a notable leap forward in AES tasks, which could enhance the accuracy and consistency of AES, overcoming many constraints of previous systems and ensuring a more equitable and thorough evaluation of intricate written tasks. Beyond mere assessment, LLMs also underscore the transformative potential of generative AI in education, as they extend their utility to bolster pedagogical strategies and learning outcomes, thereby becoming invaluable resources for both educators and students in writing assessment and evaluation.
2.3. Neural network approaches
Despite the significant emphasis placed on extracting an extensive array of linguistic features by traditional feature-based approaches, the advent of neural network paradigms, which facilitate the automatic extraction of features, has recently been incorporated into AES. This innovation renders the development of manually crafted features superfluous or, at the very least, ancillary. In contrast to traditional feature-based approaches, the superiority of neural network approaches is evident in their capacity to attain impressive efficacy in scoring assignments without reliance on a predetermined set of linguistic features [45], [46].
Models employing neural network paradigms in AES can directly extract features from unprocessed texts, obviating the need for linguistic expertise in selecting pertinent linguistic features for model construction. A multitude of algorithms underpinned by neural network paradigms [9] has yielded encouraging outcomes. Researchers have utilized various approaches such as the Convolutional Neural Network (CNN) [47], the Recurrent Neural Network (RNN) [48], and the Deep Convolution Recurrent Neural Network [49], [50] to investigate AES tasks.
With the advancement of transfer learning, neural network approaches can enhance their performance by leveraging knowledge acquired from similar or related tasks. A shared and augmented deep neural network strategy was devised for concurrently executing AES tasks across diverse topics within the context of AES tasks [51]. The findings indicated that the proposed strategy could achieve considerable accuracy with minimal task-specific training data by effectively amalgamating the shared representations of similar tasks. Furthermore, researchers have attained commendable results in NLP tasks, even with constraints on time and training data, through the application of pre-trained language models [52], [53].
Although neural network approaches have achieved high accuracy and generalizability of AES models, previous studies have often neglected the aspect of explainability of their proposed models, rendering the understanding of their decision-making processes opaque. This black box problem underscores the need for explainability to foster trust in AES systems. Therefore, the demand for transparency is paramount when employing these AES models for the formative assessment of students' writing samples.
2.4. LLM approaches
Recent years have witnessed considerable growth in research exploring the potential of AI tools to enhance writing assessment and education [3], [54], [55]. Amidst progress in AI and LLMs, scholars are working towards refining AES to achieve objectivity, uniformity, and efficiency within the evaluation process. The advent of the GPT series, notably GPT-4, has heralded a new era in utilizing LLMs for educational assessment [28], [56]. Results demonstrate that, with meticulous calibration, GPT-4's efficacy aligns with that of existing AES systems.
Despite advancements in prompt engineering, employing LLMs for complex reasoning tasks presents considerable challenges [57]. LLMs struggle with logical reasoning or arithmetic questions framed in complex statements. This underscores the complexity of AES tasks, which inherently involve intricate reasoning guided by a detailed rating rubric. Steering LLMs along well-crafted prompts could enhance their proficiency in tackling AES tasks. Thus, LLMs can generate rationales for AES tasks, using these self-produced explanations to inform their final predictions. Well-crafted prompt engineering has demonstrated promise in applying LLMs across diverse reasoning activities. Nonetheless, its application in multi-dimensional writing assessment and instruction is still nascent.
Given all the arguments above, the present study investigates the potential of prompt engineering in LLMs for multi-dimensional writing assessment. More specifically, we conduct experiments on prompt engineering, temperature setting, and multiple rating dimensions to explore how effective LLMs are under different prompt strategies. Finally, practical implications for effective AI integration in writing assessment and instruction are provided.
3. Methodology
3.1. Data
In this study, Essay Set #7 from the Automated Student Assessment Prize (ASAP) [58] corpus was used for AES by LLMs. This dataset comprises eight distinct essay collections, each originating from a unique prompt. Essay lengths vary by prompt, typically ranging from 150 to 650 words across various collections. Authored by students from varied academic levels, these essays were assessed and assigned definitive evaluations by expert human raters. This subset was chosen for its incorporation of an extensive scoring rubric across multiple rating dimensions, which is exceedingly beneficial for the research aim of evaluating the dependability of LLMs in grading essays. The essay prompt for Essay Set #7 [58] is as follows:
The ASAP dataset encompasses essays of different genres, including persuasive, narrative, and expository. Essay Set #7, comprising 1,730 essays written by students at the 7th-grade level, offers a substantial corpus for the development and assessment of AES tasks. The essays in this collection are succinct yet comprehensive, with an average length of approximately 250 words. Each essay has been evaluated by two independent human raters, who rendered scores based on four principal dimensions of writing proficiency: Ideas, Organization, Style, and Conventions. For an in-depth description of the scoring criteria, please consult the scoring rubric of ASAP Essay Set #7 (see Appendix B) [58]. Fig. 1 depicts the distribution of scores for the sample essays. Results demonstrate that the score distributions across these rating dimensions tend to lie slightly above the midpoint of the scale. Moreover, students received higher scores in Style and Conventions than in Ideas and Organization.
Figure 1.
Distribution of scores for the selected essays from Essay Set #7.
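For readers who wish to replicate this setup, the following sketch loads Essay Set #7 and inspects its trait-score distributions (as summarized in Fig. 1). The file name and column names (e.g., training_set_rel3.tsv, essay_set, rater1_trait1 through rater1_trait4) follow the public ASAP release and are assumptions rather than details taken from this paper.

```python
# A minimal sketch for loading Essay Set #7 and inspecting its trait-score
# distributions. File and column names follow the public ASAP release
# (training_set_rel3.tsv); they are assumptions, not taken from this paper.
import pandas as pd

TRAITS = ["Ideas", "Organization", "Style", "Conventions"]

def load_essay_set_7(path: str = "training_set_rel3.tsv") -> pd.DataFrame:
    # The public ASAP file is tab-separated and not UTF-8 encoded.
    df = pd.read_csv(path, sep="\t", encoding="latin-1")
    return df[df["essay_set"] == 7].reset_index(drop=True)

if __name__ == "__main__":
    essays = load_essay_set_7()
    print(f"Number of essays: {len(essays)}")
    print(f"Average length (words): {essays['essay'].str.split().str.len().mean():.0f}")
    # The rater1_trait1..rater1_trait4 columns are assumed to map onto
    # Ideas, Organization, Style, and Conventions, in that order.
    for i, trait in enumerate(TRAITS, start=1):
        counts = essays[f"rater1_trait{i}"].value_counts().sort_index().to_dict()
        print(trait, counts)
```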
3.2. Large language models
LLMs have demonstrated remarkable proficiency in generating coherent and contextually relevant narratives across various tasks. These models are trained on expansive data corpora, utilizing sophisticated deep-learning techniques to grasp intricate subtleties in language usage. In this investigation, we utilize models including GPT-3.5, GPT-4, and Claude 2 to evaluate these essay responses, comparing their evaluations with human raters' scores using the same rating criteria. These LLMs were chosen for their superior reasoning capabilities and the provision of real-time feedback, attributes that are particularly advantageous for AES tasks with higher reliability and precision [2].
The GPT series employs the transformer architecture, distinguished by its self-attention mechanisms that assign varying degrees of importance to different input data segments. This design permits the GPT series to excel in the generation of text that is not only syntactically accurate but also semantically profound. GPT-3.5, a model with an extensive parameter count in the billions, has proven adept in interpreting and producing text closely resembling human expression. GPT-4 advances upon the groundwork established by earlier iterations, delivering superior performance with a significantly larger parameter base, yielding more precise and context-sensitive outputs.
Claude 2 emerges as a formidable contender in the realm of LLMs, crafted to tackle particular challenges within NLP. Although Claude 2's architectural and training details may diverge from the GPT series, its fundamental aim is consistent: to process and generate text that exhibits a level of understanding and fluency akin to human language.
In the development of LLMs, a rigorous methodology incorporating both supervised and reinforcement learning techniques is applied to enhance the model's proficiency in producing superior outputs from given prompts, as illustrated in Fig. 2. The process begins with the collection of demonstration data for training a supervised policy. This step entails selecting a prompt from a dataset and having annotators illustrate the preferred output response. The insights derived from this stage are subsequently leveraged to augment the LLMs through supervised learning techniques. In the next phase, comparative data is amassed to cultivate a reward model. This procedure includes the extraction of a prompt accompanied by multiple model-generated responses. These responses are subsequently assessed and ordered by annotators from the highest to the lowest in terms of quality. These rankings inform the development of a reward model tasked with assessing the caliber of the generated outputs, ensuring alignment with the desired standards.
Figure 2.
The training process for large language models.
The refinement process employs the Proximal Policy Optimization (PPO) algorithm [59]. In this process, a prompt initiates the generation of a response by the PPO model, which has been pre-adjusted through supervised learning. The quality of this output is then evaluated by the reward model, which determines the reward value. The policy is subsequently refined via PPO to enhance the model's capability to produce contextually accurate and high-quality text. This iterative method, blending supervised and reinforcement learning, seeks to advance artificial intelligence's proficiency in producing human-like textual responses.
In assessing the impact of temperature setting on model efficacy, a deliberate adjustment was made, spanning from 0.0 to 1.0. Table 1 demonstrates the parameter configurations employed in the trials for LLMs, with temperature being the sole variable altered from its default value. This methodical approach facilitates a targeted examination of how temperature modifications influence the generative performance and output quality of these sophisticated language models. The token count was uniformly set at a maximum of 2048 to ensure consistency in output length across all evaluations.
Table 1.
Parameter settings for LLMs.
Model | Temperature | Top_p | Presence Penalty | Frequency Penalty | Max Tokens |
---|---|---|---|---|---|
GPT-3.5 & 4 | 0.0–1.0 | 1.0 | 0.0 | 0.0 | 2048 |
Claude-2 | 0.0–1.0 | 1.0 | 0.0 | 0.0 | 2048 |
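As an illustration of how the settings in Table 1 might be passed to a model, the following is a minimal sketch using the OpenAI Chat Completions API; the model identifier, the prompt placeholders, and the helper name score_essay are illustrative assumptions, and the Claude 2 requests would be issued analogously through Anthropic's SDK.

```python
# A hedged sketch of passing the Table 1 settings to the OpenAI Chat
# Completions API; this is not the authors' exact code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_essay(system_prompt: str, essay: str, model: str = "gpt-4",
                temperature: float = 0.0) -> str:
    """Request a score for one essay with the fixed parameters from Table 1."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": essay},
        ],
        temperature=temperature,   # the only parameter varied in the study
        top_p=1.0,
        presence_penalty=0.0,
        frequency_penalty=0.0,
        max_tokens=2048,
    )
    return response.choices[0].message.content
```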
3.3. Prompt engineering
To evaluate the reliability and consistency of LLMs with respect to AES across various prompt strategies and temperature settings, we systematically formulated a range of prompt strategies that extended from basic to complex. An example of the Criteria-Based Scoring with Justification Prompt is illustrated in Fig. 3. For an exhaustive explanation of prompt engineering, refer to Appendix A; a brief sketch of how these prompt templates can be assembled programmatically follows the list below.
Figure 3.
An example of the criteria-based scoring with justification prompt.
Overall Scoring Prompt: Allocate an aggregate score to each essay, abstaining from preliminary commentary or justifications.
Dimensional Scoring Prompt: Assign scores for distinct traits (namely Ideas, Organization, Style, and Conventions) for each essay, refraining from initial remarks or elucidations.
Criteria-Based Scoring Prompt: Assign scores across the four dimensions, adhering to the stipulated specific criteria for each while foregoing any explanatory content.
Criteria-Based Scoring with Justification Prompt: Assign scores for each dimension by the established criteria, furnish a succinct justification for each score, and emulate the evaluative commentary typically offered by human raters.
Criteria & Sample-Referenced Justification Prompt: In the course of assessing each essay, substantiate the scores by not only articulating the rationale but also by referencing the evaluations and justifications from human raters' exemplar essays to bolster the assessment's dependability and consistency.
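A minimal sketch of how these prompt strategies can be assembled from templates is given below; the abbreviated template text is a stand-in for the full prompts reproduced in Appendix A, and the helper name build_prompt is hypothetical.

```python
# A minimal sketch of prompt assembly; the template text abbreviates the
# Dimensional Scoring Prompt from Appendix A and is not the verbatim prompt.
DIMENSIONAL_TEMPLATE = (
    "You are a rater for essays written by students from grades 7 to 10.\n"
    "Please assign a rating of 0 to 3 for these rating scales: Ideas, "
    "Organization, Style, and Conventions. Return scores only, in the format "
    '{{"Ideas": score0, "Organization": score1, "Style": score2, '
    '"Conventions": score3}}\n\n'
    "Essay Prompt: {essay_prompt}\nResponse Essay: {response_essay}"
)

def build_prompt(template: str, essay_prompt: str, response_essay: str) -> str:
    """Fill one prompt template with the essay prompt and a student response."""
    return template.format(essay_prompt=essay_prompt, response_essay=response_essay)
```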
3.4. Temperature settings
The temperature parameter is an integral factor in modulating the stochasticity of the model's outputs. A diminished temperature value engenders more predictable and conservative responses, whereas an elevated temperature fosters diversity and inventiveness, potentially introducing variability into the evaluation process. To ascertain the impact of the temperature parameter on the evaluative performance of the LLMs, we methodically varied this parameter during the automated essay evaluation procedure.
Each essay was evaluated by the GPT-4 model under both prompt scenarios at all five temperature levels. This methodology enabled us to evaluate the consistency of the model's evaluations across a gamut of stochasticity and to ascertain the optimal temperature calibration that produces evaluations with the most robust correlation to human assessors. Our objective was to determine whether particular temperature calibrations yielded evaluations that were more precise and more consistent with human raters, thus offering insights into the judicious application of the temperature parameter within the realm of automated essay evaluation using LLMs.
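The sweep described above can be expressed as a simple loop over the five temperature levels; the sketch below reuses the hypothetical score_essay helper from Section 3.2 and is an assumption about the workflow rather than the authors' exact procedure.

```python
# A hedged sketch of the temperature sweep, reusing the hypothetical
# score_essay() helper defined earlier; parsing of responses is omitted.
TEMPERATURES = [0.0, 0.25, 0.5, 0.75, 1.0]

def sweep_temperatures(system_prompt: str, essays: list[str]) -> dict[float, list[str]]:
    """Score every essay with GPT-4 at each temperature level."""
    results: dict[float, list[str]] = {}
    for temp in TEMPERATURES:
        results[temp] = [
            score_essay(system_prompt, essay, model="gpt-4", temperature=temp)
            for essay in essays
        ]
    return results
```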
3.5. Quadratic weighted kappa measurement
The Quadratic Weighted Kappa (QWK) coefficient is employed to evaluate the alignment between the scores produced by advanced LLMs and those assigned by human raters. This statistical metric quantifies the level of agreement between two raters, taking into account the possibility of agreement occurring by chance. Its applicability is particularly salient in the analysis of ordinal data, where the ratings exhibit an inherent order rather than being merely nominal.
The computation of the QWK coefficient is predicated on a confusion matrix O, wherein each entry O_ij denotes the number of essays assigned a score of i by one evaluator (e.g., an LLM) and a score of j by another evaluator (e.g., a human rater). The matrix E, which encapsulates the expected concurrence of ratings by chance, is derived under the presumption that the raters operate independently and their assessments are distributed randomly.
The QWK coefficient is ascertained using the equation:
\[
\kappa = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}} \tag{1}
\]
where w_ij signifies the weight accorded to the discordance between ratings i and j. For quadratic weighting, this weight is determined as:
\[
w_{ij} = \frac{(i-j)^{2}}{(N-1)^{2}} \tag{2}
\]
where N represents the total number of distinct rating categories. The weights are designed to escalate quadratically with the magnitude of the discrepancy between ratings, thereby imposing a more significant penalty for more pronounced disagreements.
The coefficient κ delineates the proportion of concordance after adjustment for chance, with a value of 1 denoting impeccable concordance and 0 indicating an absence of concordance beyond what would be expected by chance.
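In practice, QWK can be computed directly with scikit-learn, whose cohen_kappa_score with quadratic weights implements Eqs. (1) and (2); a minimal sketch with toy scores follows.

```python
# A minimal sketch for computing QWK between LLM and human scores;
# scikit-learn's quadratic-weighted kappa implements Eqs. (1)-(2).
import numpy as np
from sklearn.metrics import cohen_kappa_score

def quadratic_weighted_kappa(llm_scores, human_scores) -> float:
    return cohen_kappa_score(llm_scores, human_scores, weights="quadratic")

# Toy example: perfect agreement gives kappa = 1.0; chance-level agreement ~ 0.0.
llm = np.array([0, 1, 2, 3, 2, 1])
human = np.array([0, 1, 2, 3, 1, 1])
print(f"QWK = {quadratic_weighted_kappa(llm, human):.3f}")
```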
4. Results and discussion
To evaluate the effectiveness of advanced LLMs for AES tasks, we conducted a series of experiments to examine their scoring accuracy and reliability compared to human raters. These experiments were crafted to explore three specific aspects: the effect of different prompt engineering on the scoring outcomes of LLMs, the role of varying temperature settings on model performance, and a detailed examination of the scoring by LLMs across multiple rating dimensions.
4.1. Experiment 1: prompt engineering
To explore the effect of different prompt engineering strategies on the scoring outcomes of LLMs, the initial temperature parameter for all prompts was set at 0. In our comparative analysis, we also included evaluations from a human rater (Rater 2) and a simple machine learning classifier based on word counts as fundamental benchmarks.
Across all experimental conditions (See Fig. 4), GPT-4 demonstrated a substantial enhancement in performance relative to GPT-3.5, with its highest QWK score being notably elevated in the prompt that incorporated criteria and sample-referenced scoring accompanied by justification. The inclusion of prompt justifications within the scoring process generally resulted in augmented QWK scores for all models, implying that LLMs yield more reliable scores when they produce explanatory rationales in conjunction with numerical scores.
Figure 4.
Quadratic Weighted Kappa Scores Comparison on the effect of different prompt strategies on the scoring outcomes of LLMs. The term “Overall Score” refers to the Overall Scoring Prompt, “Dim. Score” refers to the Dimensional Scoring Prompt, and “Criteria-Based Score” refers to the Criteria-Based Scoring Prompt. “Criteria-Based Score + Just.” stands for Criteria-Based Scoring with Justification Prompt, and “Criteria & Sample-Ref. Score + Just.” represents Criteria & Sample-Referenced Justification Prompt.
Despite the prompt specifications requiring only numerical scores without elucidative commentary, Claude 2 furnished justifications and explanations for its ratings in certain instances, as illustrated in Fig. 5, a behavior not found in GPT-3.5 and GPT-4. These results suggest that the GPT models follow the task instructions more faithfully and can optimize their performance under the guidance of prompt engineering.
Figure 5.
Instances of Claude 2 providing justifications for Dimensional Scoring Prompt with detailed explanation.
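Deviations like those in Fig. 5 need to be handled before model outputs can be compared with human ratings. One option, sketched below under the assumption that the score dictionary appears somewhere in the raw response, is to extract the first dictionary-like span; this is an illustrative fallback, not the parsing procedure reported in the paper.

```python
# A hedged sketch for extracting a score dictionary from a response that
# may also contain unrequested justification text (as in Fig. 5).
import json
import re

def extract_scores(raw_response: str) -> dict | None:
    """Return the first {...} block in the response as a dict, if any."""
    match = re.search(r"\{[^{}]*\}", raw_response)
    if match is None:
        return None
    try:
        # Responses that follow the requested format are valid JSON.
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

example = 'Here is my assessment. {"Ideas": 2, "Organization": 1, "Style": 2, "Conventions": 2}'
print(extract_scores(example))
```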
The highest QWK score was recorded by GPT-4 in the Criteria & Sample-Referenced Scoring with Justification task, achieving a score of 0.5677, as indicated in Table 2. However, this score remained below the human baseline (Rater 2) QWK of 0.6573. It is particularly interesting that a basic machine learning classifier, relying solely on the word counts of the essays, attained a QWK of 0.2301.
Table 2.
QWK scores with confidence intervals (90%CI) for different LLMs and prompt strategies.
Prompt | GPT-3.5 | GPT-4 | Claude 2 |
---|---|---|---|
Overall Score | 0.0788 (± 0.0460) | 0.1947 (± 0.0867) | 0.3745 (± 0.1608) |
Dim. Score | 0.1239 (± 0.0687) | 0.2148 (± 0.1038) | 0.2409 (± 0.1142) |
Criteria-Based Score | 0.0798 (± 0.0659) | 0.2680 (± 0.1121) | 0.3060 (± 0.1336) |
Criteria-Based Score + Just. | 0.0971 (± 0.0648) | 0.2726 (± 0.1218) | 0.2497 (± 0.1239) |
Criteria & Sample-Ref. Score + Just. | 0.2683 (± 0.1323) | 0.5677 (± 0.1313) | 0.2647 (± 0.1132) |
Human Rater 2 (Baseline) | 0.6573 | ||
ML Classifier (Word Counts) | 0.2301 |
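Table 2 reports 90% confidence intervals, but the estimation method is not described in the text; a common choice is a nonparametric bootstrap over essays, sketched below under that assumption.

```python
# A hedged sketch of bootstrap confidence intervals for QWK; the paper does
# not state how its 90% CIs were obtained, so this is an assumed procedure.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def qwk_bootstrap_ci(llm_scores, human_scores, n_boot: int = 2000,
                     alpha: float = 0.10, seed: int = 0):
    rng = np.random.default_rng(seed)
    llm = np.asarray(llm_scores)
    human = np.asarray(human_scores)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(llm), size=len(llm))  # resample essays with replacement
        stats.append(cohen_kappa_score(llm[idx], human[idx], weights="quadratic"))
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    point = cohen_kappa_score(llm, human, weights="quadratic")
    return point, (lower, upper)
```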
The confusion matrices depicted in Fig. 6 provide a graphical elucidation of the concordance between the ratings assigned by the GPT-4 model under diverse prompt configurations and those by a human rater (Rater 1). Furthermore, the sixth subplot juxtaposes the ratings of two human raters (Rater 1 and Rater 2), serving as a benchmark for inter-rater reliability. In essence, the matrices indicate that although the GPT-4 model can emulate human scoring to a certain degree, the extent of agreement fluctuates depending on the prompt engineering. The most pronounced levels of agreement are observed when scores are attributed across multiple dimensions (Dim. Score) and when criteria-based scoring is enacted. The integration of justifications and sample references does not substantially amplify the level of concordance. Judged against the inter-rater reliability matrix, the agreement between GPT-4 and a human rater falls within a tolerable scope, acknowledging the intrinsic variability in human judgment. Additionally, it can be discerned that aside from the Criteria & Sample-Referenced Justification Prompt, the scores assigned by the GPT-4 model are consistently lower than those of Rater 1.
Figure 6.
Confusion matrices comparing GPT-4 model ratings with human raters for different prompt strategies and inter-rater reliability.
The experimental outcomes corroborate that the types of prompt engineering and the models employed markedly influence the dependability of LLMs in AES tasks. The evidence suggests that more elaborated and specific prompts, encompassing criteria-based scoring with justifications, facilitate LLMs to approximate human-level consistency more closely. GPT-4, as the most advanced model evaluated, exhibited a heightened concordance with human raters, especially when justifications were integrated into the scoring processes. In line with the previous studies related to prompting engineering in LLMs, the present study also offers evidence for the idea that prompt engineering is integral to optimizing the performance of LLMs [55], [60].
To summarize, exploring the potential of prompt engineering in LLMs can be fruitful for writing assessment and education. For example, Fig. 7 illustrates the multi-dimensional feedback that GPT-4 could offer when grading an essay based on the rating rubrics and Criteria & Sample-Referenced Justification Prompt utilized in the present study. By integrating AES with LLMs, students and learners can receive instant, diagnostic, and customized feedback on writing quality. Besides, teachers can reduce the strain of correcting students' writing and mitigate evaluation bias while focusing on other essential parts of writing instruction regarding the organization, coherence, and logical flow of writing processes [3].
Figure 7.
Multi-dimensional feedback by GPT-4 with criteria & sample-referenced justification prompt.
We also conducted a deeper analysis of the variability and consistency observed across different LLMs. Fig. 8 demonstrates the comparative performance of three state-of-the-art LLMs in AES tasks across diverse experimental setups. Initially, without a scoring guide, GPT-3.5 predominantly assigns mid-range scores, indicating a tendency towards moderate evaluations, whereas GPT-4 displays a wider scoring range, reflecting greater assessment variability. Claude 2, in contrast, leans towards lower score assignments, suggesting a stricter evaluation criterion.
Figure 8.
Comparative distributions of holistic scores assigned by three different large language models.
The incorporation of varied prompt engineering techniques results in GPT-3.5 and GPT-4 achieving a more equitable distribution of scores, with GPT-4 showing a notably broader range of scores that align more closely with human evaluators. This reflects their capacity to adjust to intricate scoring frameworks. Conversely, Claude 2 consistently assigns lower scores, maintaining a rigorous yet inflexible scoring stance. Results underscore the distinct scoring behaviors of each LLM and how rating rubrics, rationales, and prompt engineering strategies significantly sharpen their evaluative precision in AES tasks. This emphasizes the necessity of customizing scoring frameworks to leverage each model's unique capabilities for enhanced educational assessment efficacy.
Findings also indicate that LLMs can produce interpretable and explainable scores through well-crafted prompt engineering [2]. This approach not only facilitates the rectification of biases and ethical issues within LLM-based models through precise prompt modifications but also empowers educators to efficiently craft and preliminarily assess scoring rubrics, thereby expediting the scoring process. Such developments herald a paradigm shift in AES, transforming black-box models into transparent, user-friendly tools and bridging the gap between AES technologies and practical classroom applications. Teachers can employ LLMs to craft customized learning journeys for their students. By analyzing students' essays, these LLMs can offer individualized feedback and recommend resources that cater to each student's unique requirements. With well-designed prompt engineering, LLMs not only conserve teachers' time and energy in developing bespoke materials and feedback but also enable them to dedicate more attention to other teaching facets, like designing compelling and interactive lessons.
These benefits significantly simplify the process of creating advanced ensemble algorithms to enhance the interpretability and explainability of AES models. Consequently, incorporating LLMs represents a transformative shift in the field of AES, transitioning black-box models into transparent and accessible tools. This innovation paves the way for the future application of AI in AES, even within classroom environments, providing educators and researchers with a robust tool for delivering instantaneous, formative, and personalized feedback for writing instruction.
4.2. Experiment 2: impact of temperature settings on scoring reliability
This section examines the impact of temperature variations on the assessment capabilities of the GPT-4 model under two distinct prompt strategies: Criteria & Sample-Referenced Justification and Dimensional Scoring. The temperature parameters were adjusted to five discrete levels: 0.0, 0.25, 0.5, 0.75, and 1.0, offering a comprehensive continuum from the entirely deterministic output (temperature = 0.0) to maximal randomness (temperature = 1.0).
For the Dimensional Scoring prompt, the data reveal a marked decline in the QWK coefficient as the temperature escalates from 0 to 0.25; this trend persists, albeit to a lesser extent, with further temperature increases (shown in Fig. 9). These observations suggest that the concordance between the GPT-4 model's scores and those of human evaluators deteriorates with the augmentation of temperature due to the heightened variability in the model's outputs at elevated temperatures.
Figure 9.
Impact of temperature settings on GPT-4 model performance in automated essay scoring across different prompt strategies.
As for the Criteria & Sample-Referenced Justification prompt, the QWK score was found to be the highest (0.5677) at a temperature of 0.0, as demonstrated in Table 3, denoting a moderate alignment with the human evaluator. In contrast to the Dimensional Scoring prompt, the QWK scores at escalated temperatures did not manifest a consistent downward trajectory; instead, they displayed fluctuations. This suggests that while higher temperatures generally result in reduced agreement with human raters, the relationship is not invariably linear and may depend on the complexity of the task and the model's ability to articulate persuasive justifications.
Table 3.
QWK scores for GPT-4 model under temperature parameters for different prompt strategies.
Temperature | 0.0 | 0.25 | 0.5 | 0.75 | 1.0 |
---|---|---|---|---|---|
Dimensional Scoring | 0.2148 | 0.0969 | 0.0927 | 0.1095 | 0.1110 |
Criteria & Sample-Referenced Justification | 0.5677 | 0.2478 | 0.3699 | 0.2232 | 0.3519 |
The research findings accentuate the susceptibility of LLMs to the temperature parameter in AES tasks. It is apparent that a lower temperature setting, which constrains the diversity of the model's responses, is more likely to produce scores that align closely with human raters. This is paramount for ensuring the reliability and consistency of scoring in high-stakes educational evaluations.
We also conducted experiments on how the temperature parameters impact GPT-4's score distributions. Fig. 10 illustrates the influence of temperature parameters on the GPT-4 model's score distributions in AES tasks. This parameter governs the model's output randomness: lower temperatures yield more consistent scores, while higher temperatures introduce variability and divergent outcomes.
Figure 10.
Score distributions of GPT-4 at different temperature settings with and without a scoring rubric compared to human rater.
With the dimensional scoring prompt strategy, increasing the temperature from 0.0 to 1.0 concentrates scores around a single peak. Interestingly, at zero temperature, scores display a multimodal distribution, indicating multiple scoring levels identified by the model in a deterministic setup. As the temperature rises, scores shift towards a unimodal distribution around the median, suggesting a move towards average scoring without clear criteria.
Conversely, with the criteria & sample-referenced justification prompt strategy, results indicate the well-designed prompt engineering strategy's role in guiding the scoring process. Compared to human raters, lower temperatures with well-crafted prompt engineering align more closely with human scoring, indicating better concordance. However, at higher temperatures, model scores diverge from human patterns, highlighting that increased randomness reduces alignment with human evaluations.
Our findings corroborate previous studies in that prompt engineering serves as a potent mechanism for LLMs [2], [57]. With the integration of well-crafted prompt engineering and lower temperatures, LLMs present significant advantages, notably reduced rating duration and enhanced scoring consistency. LLM-powered AES emerges as a compelling option for educators and scholars. The integration of AES with LLMs could transform the landscape of writing assessment, offering an unbiased and consistent evaluation of writing quality that adheres to specific rating rubrics.
4.3. Experiment 3: analysis of LLM effectiveness across scoring dimensions
To analyze the efficacy of LLMs in scoring across various rating dimensions, we contrasted the performance of GPT-4 across different prompt strategies with human scoring benchmarks, yielding the results depicted in Fig. 11. With more specific and sophisticated prompt strategies, the scoring accuracy of GPT-4 increased steadily, showing that prompt engineering is of vital importance for multi-dimensional writing assessment and instruction.
Figure 11.
Comparison of GPT-4 scoring performance with human benchmarks across different prompt strategies.
As shown in Table 4, for the rating dimension of Ideas, GPT-4 attained its highest QWK coefficient of 0.551 at the Criteria & Sample-Referenced Justification prompt, closely approximating the inter-rater reliability observed among human raters, which stands at 0.605. Pertaining to the rating dimension of Organization, GPT-4 demonstrated optimal scoring effectiveness with a QWK of 0.584 under the same prompt, exceeding the inter-rater reliability of 0.541. This finding accentuates the model's capability to effectively discern the structural components of essays when augmented with insights derived from human raters.
Table 4.
QWK scores for GPT-4 model across different prompt strategies and dimensions.
Prompt Condition | Ideas | Organization | Style | Conventions |
---|---|---|---|---|
Human Rater (Rater2) | 0.6052 | 0.5407 | 0.5653 | 0.4978 |
Dim. Score | 0.2447 | 0.2789 | 0.2621 | 0.0777 |
Criteria-Based Score | 0.3737 | 0.3577 | 0.2354 | 0.1000 |
Criteria-Based Score + Just. | 0.2934 | 0.3817 | 0.3408 | 0.0619 |
Criteria & Sample-Ref. Score + Just. | 0.5510 | 0.5839 | 0.4741 | 0.2155 |
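The per-dimension QWK values in Table 4 can be obtained with a short loop over the four rating scales; a minimal sketch is given below, assuming one score dictionary per essay in the format requested by the Appendix A prompts.

```python
# A minimal sketch of per-dimension QWK computation; the input layout
# (one {dimension: score} dict per essay) mirrors the Appendix A prompt
# format and is an assumption, not the authors' exact pipeline.
from sklearn.metrics import cohen_kappa_score

DIMENSIONS = ["Ideas", "Organization", "Style", "Conventions"]

def per_dimension_qwk(llm_scores: list[dict], human_scores: list[dict]) -> dict[str, float]:
    """llm_scores / human_scores: one {dimension: score} dict per essay."""
    return {
        dim: cohen_kappa_score(
            [s[dim] for s in llm_scores],
            [s[dim] for s in human_scores],
            weights="quadratic",
        )
        for dim in DIMENSIONS
    }
```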
As for the rating dimension of Style, GPT-4's alignment with human evaluative judgments was most pronounced when applying the Criteria & Sample-Referenced Justification prompt, achieving a QWK score of 0.474, albeit not reaching the inter-rater reliability of 0.565. This indicates that while the model exhibits a capacity to recognize stylistic elements of text when informed by annotated exemplars from human raters, there is a discernible margin for enhancement. Assessing Conventions posed a considerable challenge for GPT-4, with the most specific and sophisticated prompt, Criteria & Sample-Referenced Justification, attaining a QWK of merely 0.216, which is substantially below the human inter-rater reliability of 0.498.
The observed disparities in GPT-4's scoring proficiency across different dimensions may be ascribed to a multitude of factors. Dimensions that necessitate a more profound comprehension of content and context, such as Ideas and Organization, are advantaged by prompts that incorporate criteria and exemplars crafted by human evaluators. Conversely, the model's proficiency in evaluating Conventions appears intrinsically constrained, possibly because this dimension demands a more delicate understanding of linguistic nuances and writing mechanics, which may be only partially captured within the model's training corpus or its existing algorithmic framework. More specifically, one of the challenges of deep neural network algorithms is that they are not explainable, which makes them seem like black boxes with unclear reasons for obtaining results. Therefore, it is still necessary to incorporate explainable AI and LLMs into AES tasks [22], [61], [62].
With well-crafted prompt engineering for multi-dimensional rating dimensions, the present study identifies the potential strengths of LLMs in reducing the time teachers and educators spend on providing individualized and diagnostic feedback. Future studies can explore how LLMs can enable teachers and educators to create personalized content and facilitate differentiated writing instruction. Moreover, future research directions could include examining how LLMs enhance the precision of assessments in student writing by pinpointing specific challenges. This accuracy in identifying student difficulties allows for more targeted instructional support, which not only fosters student excellence but also paves the way for further advancements in writing skills development.
5. Conclusion
With the significant advancements in interdisciplinary studies of NLP, computational linguistics, and education, AES has witnessed tremendous innovations in the AI-powered era. Our research has highlighted the transformative capacity of LLMs to augment both the efficiency and objectivity in evaluating students' essays, an attribute that proves exceptionally advantageous in large-scale testing scenarios. The imperative for AES systems to provide consistent and unbiased evaluations is of utmost importance, and their ability to be tailored to conform to specific scoring rubrics ensures alignment with established educational benchmarks and instructional goals.
Despite the recognized potential of LLMs for AES tasks [31], [32], the optimal selection of specific LLMs and their hyperparameter configurations for effective AES implementation remains an unresolved challenge. Moreover, while much attention has been directed toward the capabilities of LLMs in holistic scoring, the critical role of analytic rating traits in educational contexts has been underemphasized. Analytic feedback is essential for students, as it provides tailored insights into various dimensions of writing quality, pinpointing both strengths and areas for improvement.
To address the above gaps, this study has elucidated the complex interplay of variables that govern the reliability and accuracy of LLMs in AES tasks from the perspectives of prompt engineering, temperature settings, and multiple rating dimensions. The empirical evidence indicates that the nature of prompt engineering and temperature settings are pivotal determinants of the accuracy and reliability of AES tasks. Significantly, GPT-4 has demonstrated notable concordance with human raters when furnished with specific and clarified prompts incorporating analytic rating criteria and rationales.
Nonetheless, our analysis also highlights the variability in GPT-4's scoring precision across different rating dimensions. Although the model exhibits adeptness in assessing Ideas and Organization, it reveals limitations in evaluating Conventions, which underscores the necessity for continued refinement and scholarly inquiry to capture the nuances of linguistic articulation and syntactic proficiency more adeptly.
The present study offers practical insights into the competencies and constraints of contemporary LLMs within the AES domain. Despite their potential, the incorporation of LLMs into AES tasks should be executed with a synergistic approach that marries explainable AI-based approaches with prompt-powered feedback, which can revolutionize multi-dimensional writing assessment and instruction.
Future research should delve deeper into innovative and complex prompt engineering strategies for LLMs to enhance AES and capture the subtleties of students' cognitive processes. Besides, thorough and fine-grained linguistic features predicting different constructs of writing quality would improve the explainability and interpretability of LLMs, thereby contributing to LLM-powered multi-dimensional writing assessment. Lastly, future research should broaden the scope to encompass a wider variety of writing genres and incorporate larger sample sizes. This expansion will enhance the robustness and generalizability of LLMs, providing a more comprehensive understanding of multi-dimensional AES tasks.
Ethics declaration
Review and/or approval by an ethics committee was not required for this study because the present study did not involve animal, cell, or human experimentation.
Funding
This research was supported by the Fundamental Research Funds for the Central Universities (No. FRF-TP-22-126A1).
CRediT authorship contribution statement
Xiaoyi Tang: Writing – original draft, Validation, Software, Methodology, Funding acquisition, Conceptualization. Hongwei Chen: Writing – review & editing, Validation, Supervision, Conceptualization. Daoyu Lin: Validation, Supervision, Software, Methodology, Conceptualization. Kexin Li: Writing – review & editing, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendix A. LLM prompts
A.1. Overall scoring prompt
You are a rater for essays written by students from grades 7 to 10. You will be sent an essay prompt and students' responses.
For each essay, please provide a single score between 0 and 12. Remember only return the score in the subsequent Python dictionary format: {“Score”: score} without any introduction or explanation.
Essay Prompt: [essay_prompt]
Response Essay: [response_essay]
A.2. Dimensional scoring prompt
You are a rater for essays written by students from grades 7 to 10. You will be sent an essay prompt and students' responses.
Please assign a rating of 0 to 3 for these rating scales: Ideas, Organization, Style, and Conventions. Please do not return any introductory words or explanation, just return scores in the subsequent Python dictionary format: {“Ideas”: score0, “Organization”: score1, “Style”: score2, “Conventions”: score3}
Essay Prompt: [essay_prompt]
Response Essay: [response_essay]
A.3. Criteria-based scoring prompt
You are a rater for essays written by students from grades 7 to 10. You will be sent an essay prompt and students' responses. Please assign a rating of 0 to 3 for these rating scales: Ideas, Organization, Style, and Conventions. Ratings are based on the following scoring criteria: [scoring criteria in Appendix B].
Please do not return any introductory words or explanation, just return scores in the subsequent Python dictionary format: {“Ideas”: score0, “Organization”: score1, “Style”: score2, “Conventions”: score3}
Essay Prompt: [essay_prompt]
Response Essay: [response_essay]
A.4. Criteria-based scoring with justification prompt
You are a rater for essays written by students from grades 7 to 10. You will be sent an essay prompt and students' responses. Please assign a score of 0 to 3 for these rating scales: Ideas, Organization, Style, and Conventions. Ratings are based on the following scoring criteria: [scoring criteria in Appendix B].
You should respond with your justification and give your score in the subsequent Python dictionary format:
Rating: {“Ideas”:score0, “Organization”:score1, “Style”:score2, “Conventions”:score3}
Rationale: {“Ideas”: “Your rationale here.”, “Organization”: “Your rationale here.”, “Style”: “Your rationale here.”, “Conventions”: “Your rationale here.”}
Essay Prompt: [essay_prompt]
Response Essay: [response_essay]
A.5. Criteria & sample-referenced justification prompt
You are a rater for essays written by students from grades 7 to 10. You will be sent an essay prompt and students' responses. Please assign a rating of 0 to 3 for these rating scales: Ideas, Organization, Style, and Conventions. Ratings are based on the subsequent scoring criteria: [scoring criteria in Appendix B].
Here are three examples. Each example includes Prompt, Response, Rating score and Rationale.
Example1:
Prompt: [Essay Prompt.]
Response: [ASAP dataset [58] essay id: 19373.]
Rating: {“Ideas”:0, “Organization”:1, “Style”:1, “Conventions”:1}
Rationale: {“Ideas”: “The response does not clearly focus on the topic of patience. The student does not provide a specific story about a time when they or someone they know was patient, nor do they develop the concept of patience in a clear or thorough way. The example given (waiting for the bathroom) is somewhat related to patience, but it's not developed or explained in a way that demonstrates a clear understanding of the concept. The student could improve by choosing a specific event or story that illustrates patience and developing that story with specific, relevant details.”, “Organization”: “The response lacks clear organization and logical sequencing of ideas. The student jumps from one idea to another without clear transitions or connections. The student could improve by clearly introducing the topic, providing a specific story or example, and then concluding with what was learned or gained from the experience.”, “Style”: “The student's command of language is limited. The sentences are repetitive and lack variety, and the word choice does not effectively support the writer's purpose. The student could improve by using more varied sentence structures and more effective and compelling word choices.”, “Conventions”: “The student demonstrates limited use of conventions of Standard English. There are issues with grammar, usage, and punctuation that hinder the clarity of the response. The student could improve by focusing on correct sentence structure, appropriate punctuation, and correct usage of words and phrases.”}
Example2:
Prompt: [Essay Prompt.]
Response: [ASAP dataset [58] essay id: 19389.]
Rating: {“Ideas”:2, “Organization”:2, “Style”:2, “Conventions”:2}
Rationale: {“Ideas”: “The student's response is somewhat focused on the topic of patience, as demonstrated by the story of waiting in line for a ride. However, the development of the idea could be improved with more specific and relevant details. For instance, the student could describe more about the difficulties they faced while waiting, their feelings, and how they managed to stay patient.”, “Organization”: “The response is logically sequenced, showing the progression of time while waiting in line. However, the introduction and conclusion could be more clearly defined to improve the overall organization. The student could start by introducing the setting and their anticipation for the ride, and conclude by reflecting on what they learned from the experience.”, “Style”: “The student shows an adequate command of language, with clear sentences that support the purpose of the narrative. However, there is room for improvement in terms of word choice and sentence structure. The student could use more compelling words to describe their experience and vary their sentence structures to keep the narrative engaging.”, “Conventions”: “The student demonstrates an adequate use of conventions of Standard English, but there are some errors in grammar, spelling, and punctuation that could be corrected. For example, the student could ensure that all sentences are complete and properly punctuated, and check that all words are spelled correctly.”}
Example 3:
Prompt: [Essay Prompt.]
Response: [ASAP dataset [58] essay id: 17908.]
Rating: {“Ideas”:3, “Organization”:3, “Style”:3, “Conventions”:3}
Rationale: {“Ideas”: “The student's response clearly focuses on the topic of patience, as demonstrated by the story of waiting and strategizing during a game of paintball. The ideas are thoroughly developed with specific and relevant details, such as the description of the environment, the tension of the game, and the ultimate reward of patience. No improvement is needed in this area.”, “Organization”: “The organization and connections between ideas and events are clear and logically sequenced. The student effectively introduces the setting and the situation, builds up tension, and concludes with a resolution that ties back to the topic of patience. No improvement is needed in this area.”, “Style”: “The student shows a strong command of language, with effective and compelling word choice and varied sentence structure that support the narrative's purpose and engage the audience. The descriptions are vivid, and the narrative is suspenseful, which makes the story engaging. No improvement is needed in this area.”, “Conventions”: “The student demonstrates consistent and appropriate use of conventions of Standard English for grammar, usage, spelling, capitalization, and punctuation for the grade level. There are no obvious errors in the response. No improvement is needed in this area.”}
You should respond with your justification and give your scores in the following Python dictionary format:
Rating: {“Ideas”:score0, “Organization”:score1, “Style”:score2, “Conventions”:score3}
Rationale: {“Ideas”: “Your rationale here.”, “Organization”: “Your rationale here.”, “Style”: “Your rationale here.”, “Conventions”: “Your rationale here.”}
Essay Prompt: [essay_prompt]
Response Essay: [response_essay]
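For readers who wish to reproduce this setup programmatically, the sketch below shows one way the few-shot prompt above could be assembled, sent to a chat-completion endpoint, and its returned Rating dictionary parsed. It is a minimal illustration assuming the OpenAI Python SDK (v1.x); the helper names (build_user_message, parse_rating, score_essay), the model identifier, and the regex-based parsing are illustrative choices, not part of the study's released code.

```python
import re
from openai import OpenAI  # assumption: OpenAI Python SDK v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The full few-shot prompt above (task instructions, scoring criteria,
# and the three rated examples) is pasted in verbatim here.
SYSTEM_PROMPT = "..."


def build_user_message(essay_prompt: str, response_essay: str) -> str:
    """Fill the two placeholders at the end of the prompt template."""
    return f"Essay Prompt: {essay_prompt}\nResponse Essay: {response_essay}"


def parse_rating(model_output: str) -> dict:
    """Extract the four dimension scores from the 'Rating: {...}' line.

    A regex is used rather than json/eval because the model may echo the
    curly quotes shown in the template.
    """
    pattern = r"(Ideas|Organization|Style|Conventions)\W*:\s*(\d)"
    return {dim: int(score) for dim, score in re.findall(pattern, model_output)}


def score_essay(essay_prompt: str, response_essay: str, temperature: float = 0.0) -> dict:
    """Score one essay on the four dimensions; returns the scores plus the raw reply."""
    completion = client.chat.completions.create(
        model="gpt-4",            # illustrative; another chat model could be substituted
        temperature=temperature,  # a low temperature favours consistent, reproducible scores
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": build_user_message(essay_prompt, response_essay)},
        ],
    )
    reply = completion.choices[0].message.content
    return {"rating": parse_rating(reply), "raw": reply}
```

Calling score_essay(prompt_text, essay_text) would then yield, for example, {"rating": {"Ideas": 2, "Organization": 2, "Style": 2, "Conventions": 2}, "raw": ...}, a structure that can be compared directly against the human ratings for each dimension.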
Appendix B. Scoring criteria
Data availability
The data utilized in this study are available from the corresponding author upon reasonable request.
References
- 1. Kasneci E., Seßler K., Küchemann S., Bannert M., Dementieva D., Fischer F., Gasser U., Groh G., Günnemann S., Hüllermeier E., et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 2023;103.
- 2. Lee G.-G., Latif E., Wu X., Liu N., Zhai X. Applying large language models and chain-of-thought for automatic scoring. Comput. Educ. Artif. Intell. 2024.
- 3. Mizumoto A., Eguchi M. Exploring the potential of using an AI language model for automated essay scoring. Res. Methods Appl. Linguist. 2023;2(2).
- 4. Hussein M.A., Hassan H., Nassef M. Automated language essay scoring systems: a literature review. PeerJ Comput. Sci. 2019;5:e208. doi: 10.7717/peerj-cs.208.
- 5. Ramesh D., Sanampudi S.K. An automated essay scoring systems: a systematic literature review. Artif. Intell. Rev. 2022;55(3):2495–2527. doi: 10.1007/s10462-021-10068-2.
- 6. Schultz M.T. The IntelliMetric™ automated essay scoring engine – a review and an application to Chinese essay scoring. In: Handbook of Automated Essay Evaluation. 2013:89–98.
- 7. Golparvar S.E., Abolhasani H. Unpacking the contribution of linguistic features to graph writing quality: an analytic scoring approach. Assessing Writing. 2022;53.
- 8. Latifi S., Gierl M. Automated scoring of junior and senior high essays using Coh-Metrix features: implications for large-scale language testing. Lang. Test. 2021;38(1):62–85.
- 9. Shin J., Gierl M.J. More efficient processes for creating automated essay scoring frameworks: a demonstration of two algorithms. Lang. Test. 2021;38(2):247–272.
- 10. Kumar V.S., Boulanger D. Automated essay scoring and the deep learning black box: how are rubric scores determined? Int. J. Artif. Intell. Educ. 2021;31:538–584.
- 11. Crossley S.A., Bradfield F., Bustamante A. Using human judgments to examine the validity of automated grammar, syntax, and mechanical errors in writing. J. Writing Res. 2019;11(2):251–270.
- 12. Yang Y., Yap N.T., Ali A.M. Predicting EFL expository writing quality with measures of lexical richness. Assessing Writing. 2023;57.
- 13. Kim S., Kessler M. Examining L2 English university students' uses of lexical bundles and their relationship to writing quality. Assessing Writing. 2022;51.
- 14. Kyle K., Crossley S.A. Measuring syntactic complexity in L2 writing using fine-grained clausal and phrasal indices. Mod. Lang. J. 2018;102(2):333–349.
- 15. Kyle K., Crossley S., Verspoor M. Measuring longitudinal writing development using indices of syntactic complexity and sophistication. Stud. Second Lang. Acquis. 2021;43(4):781–812.
- 16. Crossley S.A., Kyle K., Dascalu M. The Tool for the Automatic Analysis of Cohesion 2.0: integrating semantic similarity and text overlap. Behav. Res. Methods. 2019;51:14–27. doi: 10.3758/s13428-018-1142-4.
- 17. Tian Y., Kim M., Crossley S., Wan Q. Cohesive devices as an indicator of L2 students' writing fluency. Read. Writ. 2021:1–23.
- 18. Marzuki, Widiati U., Rusdin D., Darwin, Indrawati I. The impact of AI writing tools on the content and organization of students' writing: EFL teachers' perspective. Cogent Education. 2023;10(2).
- 19. Crossley S.A. Linguistic features in writing quality and development: an overview. J. Writing Res. 2020;11(3):415–443.
- 20. Page E.B. The use of the computer in analyzing student essays. Int. Rev. Educ. 1968:210–225.
- 21. Dong F., Zhang Y., Yang J. Attention-based recurrent convolutional neural network for automatic essay scoring. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). 2017:153–162.
- 22. Kumar V., Boulanger D. Explainable automated essay scoring: deep learning really has pedagogical value. Front. Educ. 2020;5:572367.
- 23. Mizumoto A. Calculating the relative importance of multiple regression predictor variables using dominance analysis and random forests. Lang. Learn. 2023;73(1):161–196.
- 24. Spring R., Johnson M. The possibility of improving automated calculation of measures of lexical richness for EFL writing: a comparison of the LCA, NLTK and spaCy tools. System. 2022;106.
- 25. OpenAI, Achiam J., Adler S., Agarwal S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774; 2023.
- 26. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT. 2019.
- 27. Mayer C.W., Ludwig S., Brandt S. Prompt text classifications with transformer models! An exemplary introduction to prompt-based learning with large language models. J. Res. Technol. Educ. 2023;55(1):125–141.
- 28. Pavlik J.V. Collaborating with ChatGPT: considering the implications of generative artificial intelligence for journalism and media education. Journal. Mass Commun. Educ. 2023;78(1):84–93.
- 29. Malik A.R., Pratiwi Y., Andajani K., Numertayasa I.W., Suharti S., Darwis A., et al. Exploring artificial intelligence in academic essay: higher education student's perspective. Int. J. Educ. Res. 2023;5.
- 30. Yan D., Fauss M., Hao J., Cui W. Detection of AI-generated essays in writing assessment. Psychol. Test. Assess. Model. 2023;65(2):125–144.
- 31. Bai H., Hui S.C. A crowdsourcing-based incremental learning framework for automated essays scoring. Expert Syst. Appl. 2024;238.
- 32. Liu Y., Han J., Sboev A., Makarov I. GEEF: a neural network model for automatic essay feedback generation by integrating writing skills assessment. Expert Syst. Appl. 2024;245.
- 33. Myers M. What can computers and AES contribute to a K–12 writing program? In: Automated Essay Scoring: A Cross-Disciplinary Perspective. 2003:3–20.
- 34. Rupp A.A., Casabianca J.M., Krüger M., Keller S., Köller O. Automated essay scoring at scale: a case study in Switzerland and Germany. ETS Res. Rep. Ser. 2019;2019(1):1–23.
- 35. Shermis M.D., Burstein J.C. Automated Essay Scoring: A Cross-Disciplinary Perspective. Routledge; 2003.
- 36. Attali Y., Burstein J. Automated essay scoring with e-rater® v.2. J. Technol. Learn. Assess. 2006;4(3).
- 37. Enright M.K., Quinlan T. Complementing human judgment of essays written by English language learners with e-rater® scoring. Lang. Test. 2010;27(3):317–334.
- 38. Wilson J., Huang Y. Validity of automated essay scores for elementary-age English language learners: evidence of bias? Assessing Writing. 2024;60.
- 39. Burstein J. The e-rater® scoring engine: automated essay scoring with natural language processing. In: Shermis M.D., Burstein J., editors. Automated Essay Scoring: A Cross-Disciplinary Perspective. Lawrence Erlbaum Associates Publishers; 2003:113–121.
- 40. Guo L., Crossley S.A., McNamara D.S. Predicting human judgments of essay quality in both integrated and independent second language writing samples: a comparison study. Assessing Writing. 2013;18(3):218–238.
- 41. McNamara D.S., Crossley S.A., McCarthy P.M. Linguistic features of writing quality. Writ. Commun. 2010;27(1):57–86.
- 42. Rudner L.M., Liang T. Automated essay scoring using Bayes' theorem. 2002.
- 43. Vajjala S. Automated assessment of non-native learner essays: investigating the role of linguistic features. Int. J. Artif. Intell. Educ. 2018;28:79–105.
- 44. Chen H., Xu J., He B. Automated essay scoring by capturing relative writing quality. Comput. J. 2014;57(9):1318–1330.
- 45. Latifi S.M.F. Development and validation of an automated essay scoring framework by integrating deep features of English language [unpublished doctoral dissertation]. University of Alberta, Education & Research Archive (ERA); 2016. https://doi.org/10.7939/R37S7J134
- 46. Taghipour K., Ng H. A neural approach to automated essay scoring. In: EMNLP. 2016.
- 47. Dong F., Zhang Y. Automatic features for essay scoring - an empirical study. In: EMNLP. Austin, Texas: Association for Computational Linguistics; 2016:1072–1077.
- 48. Park K., Lee Y., Shin D., et al. Exploring the feasibility of an automated essay scoring model based on LSTM. J. Curriculum Evaluation. 2021;24(4):223–238.
- 49. Dong F., Zhang Y., Yang J. Attention-based recurrent convolutional neural network for automatic essay scoring. In: CoNLL. 2017.
- 50. Dasgupta T., Naskar A., Dey L., Saha R. Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications. 2018:93–102.
- 51. Li X., Chen M., Nie J.-Y. SEDNN: shared and enhanced deep neural network model for cross-prompt automated essay scoring. Knowl.-Based Syst. 2020;210.
- 52. Ludwig S., Mayer C., Hansen C., Eilers K., Brandt S. Automated essay scoring using transformer models. Psych. 2021;3(4):897–915.
- 53. Ormerod C.M., Malhotra A., Jafari A. Automated essay scoring using efficient transformer-based language models. arXiv preprint arXiv:2102.13136.
- 54. Lee U., Jung H., Jeon Y., Sohn Y., Hwang W., Moon J., Kim H. Few-shot is enough: exploring ChatGPT prompt engineering method for automatic question generation in English education. Educ. Inf. Technol. 2023:1–33.
- 55. Chen B., Zhang Z., Langrené N., Zhu S. Unleashing the potential of prompt engineering in large language models: a comprehensive review. arXiv preprint arXiv:2310.14735.
- 56. Yancey K.P., Laflair G., Verardi A., Burstein J. Rating short L2 essays on the CEFR scale with GPT-4. In: Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023). 2023:576–584.
- 57. Liu P., Yuan W., Fu J., Jiang Z., Hayashi H., Neubig G. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023;55(9):1–35.
- 58. Automated Student Assessment Prize (ASAP). The Hewlett Foundation: Automated Essay Scoring. 2019.
- 59. Schulman J., Wolski F., Dhariwal P., Radford A., Klimov O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- 60. Heston T.F., Khun C. Prompt engineering in medical education. Int. Med. Educ. 2023;2(3):198–205.
- 61. Angelov P.P., Soares E.A., Jiang R., Arnold N.I., Atkinson P.M. Explainable artificial intelligence: an analytical review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2021;11(5).
- 62. Khosravi H., Shum S.B., Chen G., Conati C., Tsai Y.-S., Kay J., Knight S., Martinez-Maldonado R., Sadiq S., Gašević D. Explainable artificial intelligence in education. Comput. Educ. Artif. Intell. 2022;3.