Scientific Reports. 2025 Sep 29;15:33657. doi: 10.1038/s41598-025-17738-z

A hybrid deep learning and fuzzy logic framework for feature-based evaluation of English language learners

XiuHua Zhao1
PMCID: PMC12480715  PMID: 41023079

Abstract

The integration of artificial intelligence (AI) and natural language processing (NLP) into language learning and assessment has unlocked new possibilities for accurately profiling English language learners (ELLs) and personalizing educational interventions. While previous studies have typically focused on isolated techniques (deep learning, traditional machine learning, or linguistic rule-based models), there remains a critical need for comprehensive frameworks that combine the interpretability of rule-based reasoning with the predictive power of advanced AI. Addressing this gap, the present study introduces a novel hybrid methodology for ELL evaluation, integrating rule mining through fuzzy logic with a state-of-the-art fusion model that combines DeBERTa, metadata features, and LSTM architectures. This approach employs a hybrid DeBERTa + Metadata + LSTM (DBML) model, where DeBERTa serves as a transformer backbone to extract rich textual embeddings via attention mechanisms, metadata features capture contextual, cognitive, and demographic learner traits, and LSTM layers provide effective temporal modeling and dense integration. This comprehensive pipeline allows for nuanced prediction of language proficiency levels, handling both unstructured (text responses) and structured (behavioral and demographic) data streams. Empirical comparisons against standard machine learning, deep learning, and standalone transformer models demonstrate the superiority of the proposed hybrid approach, achieving a peak accuracy of 93%, significantly higher than benchmarked baselines. Furthermore, the study extensively investigates model reliability using statistical significance tests and eXplainable AI (XAI) techniques such as SHAP and DeepSHAP. These analyses not only confirm the model's robustness but also reveal the centrality of linguistic attributes (e.g., Syntax, Cohesion, Vocabulary) in classification, as further substantiated by comprehensive feature ranking, including Information Gain, Gain Ratio, Gini Index, and Random Forest permutation importance, with the top-ranked features used for fuzzy rule extraction.

Keywords: Artificial intelligence, Natural language processing, Education, Fuzzy rule, Machine learning, Explainable AI, Transformer, English language

Subject terms: Classification and taxonomy, Computational models, Computational platforms and environments, Data acquisition, Data integration, Data mining, Data processing, Machine learning

Introduction

Natural Language Processing (NLP) is a dynamic and rapidly evolving field at the intersection of computer science, linguistics, and artificial intelligence, dedicated to enabling machines to understand, interpret, and generate human language in meaningful ways1. As the world becomes increasingly digital and interconnected, NLP has emerged as a crucial technology powering applications that range from real-time translation and virtual assistants to sentiment analysis2 and automated content moderation3. Its continuous advancements are not only reshaping how we interact with technology but are also opening new frontiers for research and innovation across disciplines, making human-computer communication more natural, accessible, and intelligent than ever before4. The global prominence of English, both as an academic lingua franca and as the predominant medium of international communication, underscores the critical importance of accurate and effective language evaluation tools5. As educational institutions worldwide increasingly adopt English as the primary language of instruction, the demand for robust, precise, and scalable assessment systems grows correspondingly6. These assessment tools not only streamline evaluative processes but also provide valuable feedback to learners, teachers, and institutions, aiding in curriculum design, instructional strategies, and personalized education planning7.

To address this evolving landscape, AI-driven tools are increasingly adopted by educational institutions to perform real-time evaluation and multi-class classification and to provide adaptive feedback to learners8. These advanced tools integrate deep learning techniques, linguistic analytics, and demographic and cognitive profiling, creating comprehensive evaluations tailored to diverse learner populations9. Teachers and students respond positively to such innovations, as AI-driven assessments offer precise diagnostics of learning challenges and strengths, thus fostering a responsive and engaging educational environment10. Within the specific context of ELL, AI models have been effectively employed to evaluate learners across multiple skill areas, such as speaking, writing, reading comprehension, and listening11. Deep learning architectures, including recurrent neural networks (RNN), convolutional neural networks (CNN), and transformers, have shown remarkable capabilities in accurately predicting learner proficiency and identifying key linguistic and cognitive factors that influence language acquisition12. Concurrently, fuzzy logic approaches have demonstrated strength in capturing the inherent uncertainty and subjective aspects of human evaluations, offering nuanced and interpretable assessment models that resonate closely with expert judgments13. Despite these advancements, challenges remain in creating evaluation systems that simultaneously provide accuracy, transparency, and practical interpretability14. Existing AI-based assessment frameworks often exhibit shortcomings such as limited interpretability, inadequate handling of linguistic and demographic variations, and an inability to effectively incorporate human-like reasoning and subjective judgment15. The motivation for this study, therefore, lies in addressing these gaps by proposing a novel, hybrid framework that combines deep learning-based feature ranking and fuzzy logic techniques. This hybrid approach aims to offer robust evaluations that are not only precise but also interpretable and sensitive to diverse learner backgrounds.

In this study, we introduce a comprehensive hybrid evaluation framework designed specifically for ELL. Our approach integrates deep learning-based feature ranking methodologies to identify the most influential linguistic, cognitive, and demographic factors contributing to learner proficiency. Complementing this, we employ fuzzy logic techniques to construct an interpretable evaluation model capable of handling the uncertainty and subjectivity inherent in language assessments. Through this integrated method, we aspire to achieve a balanced solution that enhances the accuracy of automated evaluations while maintaining interpretability and practical applicability in educational settings. The main contributions of this study are as follows:

  • Proposed a novel fusion hybrid model (DeBERTa + Metadata + LSTM) that integrates structured and unstructured data using attention mechanisms, achieving the highest classification accuracy of 93% for ELL evaluation.

  • Applied feature ranking and importance techniques to identify the most influential attributes for both modeling and rule construction.

  • Implemented a fuzzy logic technique to extract interpretable rules for trait-based learner classification, using ranked features to enhance the transparency and reliability of decisions.

  • Employed explainable AI (XAI), alongside statistical significance tests, to interpret and validate model performance, further strengthening trust in the system’s outputs.

The remainder of this paper is structured as follows: Section II presents an extensive review of related work, exploring both deep learning and fuzzy logic methodologies applied to language and educational evaluations. Section III outlines the detailed methodology, including the proposed hybrid framework, deep learning-based feature ranking approach, and fuzzy inference system. Section IV describes the experimental setup and dataset characteristics, followed by a discussion of the results and their implications in Section V. Finally, Section VI concludes the paper by summarizing key findings, discussing the contributions and limitations of the research, and providing directions for future investigations.

Related work

The dynamic evolution of educational assessment has recently seen major advances through the use of deep learning and fuzzy logic techniques in language learning analytics16. This section reviews the state-of-the-art research on the combination of these methods, paying attention to current deep learning techniques and fuzzy frameworks, to motivate the hybrid modeling strategy advocated here in the context of ELL assessment.

Deep learning-based ELL evaluation

Recent works apply deep learning (DL) models with feature selection or ranking to assess ELLs and student performance, aiming to enhance accuracy and interpretability simultaneously. In automated writing evaluation (AWE), researchers utilized deep neural networks to score essays and explained how explainable AI techniques (SHAP values) can be applied to discover which linguistic features affect the scores17. In a subsequent work, they used deep models to predict fine-grained rubric scores and holistic scores simultaneously, leading to high agreement (QWK 0.78) with human raters, better than existing methods (QWK ~ 0.53). However, they noted that the "black box" nature of DL made scores difficult to explain, supporting the need for feature importance analysis for interpretability18. A review of DL-based AES systems pointed out that, although end-to-end DL models achieve high accuracy, they cannot give feedback effectively because of incomprehensible feature representations19. Faseeh et al. combined deep features with handcrafted linguistic features and a light ensemble (XGBoost) for essay scoring and achieved higher accuracy than using deep features alone20.

Another study used an RNN model to measure oral pronunciation fluency and accuracy in English. Their system, which included domain-specific features as input to the RNN, achieved > 90% recognition agreement with human judgments, providing evidence of DL's capability to cope with speech relationships in data21, although such models can break down given a lack of training data or noisy input; a graph convolutional network based on an evolutionary algorithm (ESA-NEGCN-NBOA) was proposed to address this22. That method was demonstrated to be ~ 20-28% more accurate than previous machine learning methods, with significantly decreased evaluation time; it operates on a unique feature extraction and optimization technique, which is its strength, but comes with high complexity23. Similarly, another study combined a deep CNN and a virus colony search optimizer to assess EFL classroom teaching quality from audio/video data, proposing a complete feature framework and adopting a CNN to fuse the features24. The adopted meta-heuristic tuning achieved better accuracy and robustness of the CNN for multi-criteria teacher evaluation, outperforming traditional evaluation approaches in terms of accuracy and consistency25.

Another important use-case, in the wider context of educational systems, is predicting students' academic success at different levels of study using DL with feature ranking. Recent works use deep networks that can learn feature representations automatically while retaining feature selection for interpretability26. One work went further by introducing a model-agnostic framework without manual feature engineering: raw data are fed into an interpretable model that highlights the significant features through post-hoc analysis for online courses27. Based on the Open University dataset, they used a random forest as well as a multi-layer perceptron and demonstrated that removing the feature-engineering step did not hurt accuracy, attributing this in part to the model's ability to learn good representations internally28.

Researchers have also aggregated deep models in ensembles and hybrid modalities to improve the performance of educational prediction. One approach combined traditional learners with a custom 1D CNN to determine whether students are "weak" or "strong" learners, carrying out a "multiparametric analysis" considering a variety of factors including demographics, past grades, and online behavior29. The ensemble CNN model exceeded the precision and recall of single models by 2-16%; the authors also noted that a variable selection technique enhanced interpretability by restricting the modeled effect to the most influential factors30. Another work composed several deep networks trained with different optimizers; this robust architecture obtained lower error (e.g., RMSE) on student grade datasets than any single model31. A later study proposed an attention-augmented DL model for performance prediction in MOOCs, performing feature-elimination pre-processing and employing SHAP values for model explanation32. The tradeoff here is one of model complexity, but these works offer a compelling blueprint for resolving the tension between predictive power and transparency33.

Fuzzy logic-based ELL education evaluation

Fuzzy logic has been commonly used to represent the internal uncertainty and subjectivity in educational assessments, especially in ELL testing and learning outcomes. A study introduced a fuzzy logic system for the continual assessment of Chinese students' English proficiency at different levels34. The model developed fuzzy sets for objective language-learning measures and utilized a fuzzy comprehensive evaluation to identify learning deficits, with the benefit of a more detailed assessment than a pass/fail or numeric score35. An IF-AHP model to evaluate college English teaching quality was proposed and embedded in online game-based learning (OGBL); the authors set up a multi-level evaluation system, determined the influence degree of various quality factors (e.g., student engagement, teacher knowledge) based on AHP, and produced an aggregate evaluation via fuzzy comprehensive evaluation (FCE)36. Furthermore, a distinct application of fuzzy logic to adaptive learning is presented in the context of an English game-based learning environment, where student characteristics are fuzzified to adjust game difficulty and feedback37. Similarly, fuzzy logic has been used for the assessment of academic progress in medical education. Though not directly in ELL, this application demonstrates the flexibility of fuzzy logic; the FIS incorporated fuzzy exam marks, practical skills, and attendance as inputs to a composite student performance index38. Another fertile field is fuzzy logic in e-learning acceptance and satisfaction measurement. A fuzzy logic-based method assessed the e-learning effectiveness of computer science students during COVID-19, defining fuzzy scales (e.g., high, medium, low) for perceived usefulness, ease of use, and learning outcomes, and deriving overall acceptance through fuzzy aggregation39. A Fuzzy Comprehensive Assessment Model (FCAM) was used to evaluate the quality of English translation: criteria such as accuracy, fluency, and stylistic adequacy were rated on a fuzzy scale and used to derive an overall translation quality score. This fuzzy set-based decision made it possible not only to judge slight translation errors in context but also to express translation quality as a group-based appraisal40. Furthermore, a multi-feature fuzzy evaluation model was developed to evaluate teaching methods for college physical education; natural language input was fuzzified and placed under three dimensions, and the weights of all parameters in the fuzzy model were fine-tuned by an improved cuckoo search optimization. The model demonstrated better ratings for evaluating instructional effectiveness and student satisfaction (95-97%) than the traditional evaluation system41. Additionally, these systems showed fuzzy rules to be better at early and accurate detection of at-risk students than threshold-based systems. In North Africa, one study employed fuzzy logic to forecast e-learning engagement indicators, permitting the system to avoid hard computation and rely on expert rules for student activity patterns42.

So far, fuzzy logic-based approaches have improved ELL and educational assessments by incorporating graded judgment and expertise into the evaluation process. Key benefits include robustness against noisy inputs (e.g., partially correct answers or fair engagement) and reasoning akin to that of a human in the scores (e.g., "good", "fair", and "needs improvement" in place of mere numerical grades). Fuzzy models have succeeded where traditional models failed, and studies attained fairness in evaluation, student satisfaction, and early problem recognition. The principal criticism is that the design and validation of the fuzzy rules and membership functions are usually laborious and costly tasks that require expert personnel in the field. Nevertheless, the trend is clear: fuzzy logic, in some cases in conjunction with deep learning or optimization algorithms, is emerging as a useful toolkit for educational assessment research, complementing pure deep learning works with interpretable, flexible evaluation frameworks applicable not only to English language learning but also to other areas.

Proposed research methodology

This section describes the complete methodology used to study ELL assessment through a combination of advanced artificial intelligence and fuzzy concepts. The study is structured around systematic data preprocessing, sound feature engineering, and the use of both rule-based and machine learning models. By combining transformer-based textual embeddings, structured learner metadata, and interpretable fuzzy rule mining, the research seeks to model the multidimensionality of language proficiency assessment. Detailed procedures for model construction, feature ranking, and interpretability analysis, shown in Fig. 1, are documented to ensure replicability, transparency, and robust comparison of traditional and SOTA evaluation approaches.

Fig. 1. Framework of the proposed research methodology.

Data collection

The dataset used in this study was obtained from the publicly available "English-Language Learners-Evaluation" repository, which stores responses, scores, and learner profile data for English language learning research. The dataset is of mixed type, including structured metadata such as demographic and contextual information and psycholinguistic and cognitive features, as well as unstructured text data in the form of written language answers; the attributes are displayed in Table 1, and the data are split into training and testing sets using an 80-20 ratio. This rich diversity allows for a comprehensive examination of proficiency and learning conditions.

Table 1. Attributes distribution based on traits.

Learning behavior trait Attributes
Contextual & Demographic age, native language, learning environment, prior experience
Psycholinguistics & Cognitive Attention span score, working memory index, motivation level, cognitive load
Linguistic Feature Cohesion, Syntax, Vocabulary, Phraseology, Grammar, Conventions

Data preprocessing

The data preprocessing stage is instrumental in achieving data quality and uniformity before modeling. The raw set was cleaned to remove duplicates and missing values through imputation and validation. Text in the full-text column was cleaned through a typical pipeline (lowercasing; removal of special characters, whitespace, and stop words; lemmatization) to standardize the input for transformer-based language models. To encode the categorical variables (native language and learning environment), label encoding and one-hot encoding were applied to make them accessible to machine learning algorithms43. The value ranges of numerical characteristics were made consistent with standard scaling to improve model convergence. Outliers among score-dependent features were identified and controlled to minimize their impact on statistical and machine learning analysis. Finally, features were clustered based on domain relationships (contextual, cognitive, linguistic), serving as contexts for both trait-based rule mining and mixed-model fusion. In the preprocessing pipeline, trait features are engineered as follows: numerical fields are imputed with the median and then z-normalized; categorical fields are imputed with the mode and one-hot encoded; rubric sub-scores are modeled as ordinal numbers. Age and selected cognitive measures are discretized using thresholds obtained from the training set (age bins; tertiles on attention/working memory; Likert grouping on motivation). Linguistic and cognitive composites are derived by summing standardized sub-scores (or computed with SHAP-normalized weights, stated separately), as displayed in Table 2. Demographics are left as separate encoded variables. All transformation statistics are recorded and reused during assessment to ensure reproducibility.

Table 2. Trait construction and quantization.

Trait group Raw features Preprocessing Quantization - encoding Aggregation used in model
Linguistic Syntax, Vocabulary, Phraseology, Cohesion, Grammar, Conventions (1–5) Median impute → z-score None (kept continuous) Sum of z-scored sub-scores
Cognitive Attention span, Working memory, Motivation, Cognitive load Median impute → z-score Motivation: 1–2 = low, 3 = med, 4–5 = high; others optional tertiles (≤ P33, P33–P67, ≥ P67) Sum of z-scored sub-scores
Demographic Age, Native language, Learning environment, Prior experience Mode impute (cats); age numeric impute Age bins: ≤15, 16–18, 19–25, > 25; One-hot for categorical fields; prior experience as 0/1/2 years bins No aggregation (one-hot/ordinal fed directly)

This systematic pre-processing resulted in a more reliable, robust, and well-structured dataset, which served as the foundation for the subsequent AI- and fuzzy-based evaluation of English language learners.
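A minimal Python sketch of this preprocessing pipeline, assuming pandas DataFrames `train_df` and `test_df` with hypothetical column names mirroring Tables 1-2 (the exact dataset schema is not restated here):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["syntax", "vocabulary", "phraseology", "cohesion", "grammar",
                "conventions", "attention_span_score", "working_memory_index",
                "motivation_level", "cognitive_load", "age"]
categorical_cols = ["native_language", "learning_environment"]

# Median impute + z-score for numeric traits; mode impute + one-hot for
# categorical traits. Fit on the training fold only and reuse the fitted
# statistics on validation/test (fold-aware preprocessing, no leakage).
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])
X_train = preprocess.fit_transform(train_df)
X_test = preprocess.transform(test_df)

# Training-set-derived age bins as in Table 2.
age_bins = pd.cut(train_df["age"], bins=[0, 15, 18, 25, np.inf],
                  labels=["<=15", "16-18", "19-25", ">25"])
```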

Feature ranking and importance

Feature ranking and importance assessment are ingrained in every interpretable machine learning or rule-based modeling pipeline and become particularly important in challenging domains such as language learning analytics. By systematically measuring and quantifying the relevance of each feature with respect to a prediction or classification task, these techniques assist researchers in determining the most salient features, improving model performance, and facilitating transparent, interpretable decision making. Among numerous techniques, Principal Component Analysis (PCA) loadings and permutation importance (combined with ensemble methods such as random forests) are seen as powerful tools for capturing both linear and nonlinear associations in high-dimensional educational data.

Information gain (IG)

IG quantifies the reduction in uncertainty about the target variable $Y$ when a given feature $X$ is known. Features with higher IG are more informative and are prioritized in model training44. This method is particularly useful for identifying variables that are highly predictive of the outcome, defined as in Eq. 1.

$$IG(Y, X) = H(Y) - H(Y \mid X) \tag{1}$$

where $H(Y)$ represents the entropy of the target variable $Y$, defined using Eq. 2.

$$H(Y) = -\sum_{y \in Y} p(y)\,\log_2 p(y) \tag{2}$$

$H(Y \mid X)$ denotes the conditional entropy of $Y$ given $X$, expressed as in Eq. 3.

$$H(Y \mid X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x)\,\log_2 p(y \mid x) \tag{3}$$

Thus, the information gain is calculated as in Eq. 4 to identify the features most informative about the target variable.

$$IG(Y, X) = -\sum_{y \in Y} p(y)\,\log_2 p(y) + \sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x)\,\log_2 p(y \mid x) \tag{4}$$

This captures the degree to which a highly ranked feature $X$ reduces the uncertainty associated with predicting $Y$.
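The following short sketch implements Eqs. (1)-(4) for a discrete feature against a discrete target, using base-2 logarithms; the column data are assumed to be pandas Series:

```python
import numpy as np
import pandas as pd

def entropy(y: pd.Series) -> float:
    # H(Y) = -sum_y p(y) log2 p(y), over observed values only.
    p = y.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def information_gain(x: pd.Series, y: pd.Series) -> float:
    # IG(Y, X) = H(Y) - sum_x p(x) * H(Y | X = x)
    h_cond = sum(w * entropy(y[x == v])
                 for v, w in x.value_counts(normalize=True).items())
    return entropy(y) - h_cond
```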

Gain ratio (GR)

GR is an extension of IG that addresses its bias towards features with many unique values. It normalizes the IG by the intrinsic information of the feature, which measures its overall variability45. This adjustment prevents the over-selection of features with many distinct values, thus ensuring a more balanced selection of features that genuinely contribute to the target prediction, computed as in Eq. 5.

$$GR(Y, X) = \frac{IG(Y, X)}{IV(X)} \tag{5}$$

where the intrinsic information $IV(X)$ is defined using Eq. 6.

$$IV(X) = -\sum_{x \in X} p(x)\,\log_2 p(x) \tag{6}$$

Thus, the gain ratio is calculated as in Eq. 7.

$$GR(Y, X) = \frac{H(Y) - H(Y \mid X)}{-\sum_{x \in X} p(x)\,\log_2 p(x)} \tag{7}$$

This metric allows for fairer comparison among features by adjusting for intrinsic variability.
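A companion sketch of Eqs. (5)-(7), reusing the `entropy` and `information_gain` helpers from the previous snippet; the intrinsic information is taken as the entropy of the feature itself:

```python
def gain_ratio(x: pd.Series, y: pd.Series) -> float:
    # IV(X) = -sum_x p(x) log2 p(x); guard against zero-variability features.
    iv = entropy(x)
    return information_gain(x, y) / iv if iv > 0 else 0.0
```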

Entropy

Entropy is a measure of the randomness or unpredictability of information. In feature selection, it is used to assess how well a feature can separate the classes of the target variable $Y$, with possible classes $c_1, \dots, c_k$46. Features that significantly reduce entropy, thereby increasing IG, are considered more valuable for prediction; entropy is computed as in Eq. 8 with the help of Eq. 2. By choosing features that lower the overall entropy, models can achieve better classification performance.

$$H(Y) = -\sum_{i=1}^{k} p(c_i)\,\log_2 p(c_i) \tag{8}$$

A lower value of entropy indicates that a feature is more informative, as it results in a greater reduction in uncertainty about the classification task.

Principal component analysis

PCA reduces the dimensionality of the input features to a new feature space of uncorrelated variables, the principal components, for which the transformed covariance matrix is diagonal. The "loading" of a feature onto a principal component conveys the relative importance or contribution of that feature to the component47. Large absolute loading values identify the features that matter most in explaining variance in the data. Feature importance by PCA loading is often evaluated by examining the absolute loadings of features on the first principal components (PCs), which capture the largest variance in the data, defined in Eq. 9. This technique can be used to discover which variables are the most significant in determining the true structure of learner traits.

$$L_j = w_j \sqrt{\lambda_1}, \qquad z_i = \sum_{j} w_j\, x_{ij} \tag{9}$$

where $x_{ij}$ is the value of feature $j$ for observation $i$, $w_j$ is the $j$-th entry of the first eigenvector of the covariance matrix, and $\lambda_1$ is the largest eigenvalue associated with that eigenvector.
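A sketch of Eq. (9) with scikit-learn, where `X` is an assumed observations-by-features matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)   # standardize features first
pca = PCA(n_components=1).fit(X_std)
w1 = pca.components_[0]                     # first eigenvector w
lam1 = pca.explained_variance_[0]           # largest eigenvalue lambda_1
loadings = np.abs(w1 * np.sqrt(lam1))       # |L_j| used as importance
ranking = np.argsort(loadings)[::-1]        # feature indices, most important first
```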

Permutation importance with random forest

Permutation importance is a general, model-agnostic technique that estimates the importance of a feature as the increase in the model's prediction error after permuting the feature's values. In the context of a random forest, this is done by randomly permuting the values of one given feature across the samples and evaluating the corresponding change in accuracy. If permuting a feature's values increases the loss $L(y, \hat{y})$ between the true labels $y$ and the model predictions $\hat{y}$, the feature is said to be important, computed using Eq. 10. This approach is especially good at accounting for non-linearity and interaction effects, both of which are relevant to complex educational datasets.

$$I_j = \frac{1}{T} \sum_{t=1}^{T} \Big( L\big(y, f_t(X^{(j)})\big) - L\big(y, f_t(X)\big) \Big) \tag{10}$$

where $I_j$ is the permutation importance for feature $j$, $T$ is the number of trees, $f_t$ is the prediction function of tree $t$, and $X^{(j)}$ is the data matrix $X$ with feature $j$ randomly permuted.
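A sketch of Eq. (10) using scikit-learn's model-agnostic implementation on a fitted random forest; `X_train`, `y_train`, `X_test`, and `y_test` are assumed to come from the 80-20 split described earlier:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

# Importance = mean accuracy drop across repeated shuffles of each column.
result = permutation_importance(rf, X_test, y_test, scoring="accuracy",
                                n_repeats=10, random_state=42)
importances = result.importances_mean   # one score per feature
```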

Fuzzy rule-based classification technique

Once input variables were fuzzified via membership functions $\mu_A(x)$, a Fuzzy Inference System (FIS) was designed to model the decision-making logic in all three trait contexts, following the flow shown in Fig. 2. A FIS processes a collection of if-then rules $R_k$ defined on fuzzy sets $A_{ki}$ and produces an output via a reasoning mechanism that imitates human cognition48.

Fig. 2. Working of the fuzzy rule-based classification technique.

A fuzzy rule in the framework is expressed using Eq. 11.

$$R_k:\ \text{IF } x_1 \text{ is } A_{k1} \text{ AND } \dots \text{ AND } x_n \text{ is } A_{kn} \text{ THEN } y \text{ is } B_k \tag{11}$$

The firing strength $\alpha_k$ of each rule $R_k$ is computed using a t-norm operator, here the product of the antecedent membership degrees, defined as in Eq. 12.

$$\alpha_k = \prod_{i=1}^{n} \mu_{A_{ki}}(x_i) \tag{12}$$

Each activated rule contributes a fuzzy output set scaled by its firing strength, in Eq. 13.

$$\mu_{B'_k}(y) = \alpha_k \cdot \mu_{B_k}(y) \tag{13}$$

The overall aggregated fuzzy output $\mu_{B'}(y)$ is obtained using the maximum operator over all activated rules in Eq. 14.

$$\mu_{B'}(y) = \max_{k}\, \mu_{B'_k}(y) \tag{14}$$

The crisp output $y^*$ is then computed using the centroid of area method, also known as the center of gravity, defined as in Eq. 15.

$$y^* = \frac{\int_{Y} y\, \mu_{B'}(y)\, dy}{\int_{Y} \mu_{B'}(y)\, dy} \tag{15}$$

where $Y$ is the output universe of discourse.

For systems with multiple outputs, the framework can be extended to vector-valued outputs using Eq. 16.

$$\mathbf{y}^* = \big( y_1^*, y_2^*, \dots, y_m^* \big), \qquad y_j^* = \frac{\int_{Y_j} y\, \mu_{B'_j}(y)\, dy}{\int_{Y_j} \mu_{B'_j}(y)\, dy} \tag{16}$$

This generalization is particularly relevant for multi-criteria decision-making in trait selection, where several trait outputs must be evaluated jointly. By formalizing expert knowledge into fuzzy rules and leveraging linguistic reasoning, the FIS design enabled soft decision boundaries and interpretable evaluations in complex educational domains.
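A minimal single-output Mamdani sketch of Eqs. (11)-(15), with triangular memberships, product t-norm, max aggregation, and centroid defuzzification; the membership parameters and the two-input rule base are illustrative assumptions, not the study's actual rule base:

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a + 1e-9),
                                 (c - x) / (c - b + 1e-9)), 0.0)

y_grid = np.linspace(0.0, 5.0, 501)          # output universe of discourse Y
out_sets = {"low": tri(y_grid, 0, 1, 2.5),   # consequent fuzzy sets B_k
            "med": tri(y_grid, 1.5, 2.5, 3.5),
            "high": tri(y_grid, 2.5, 4, 5)}

def infer(syntax, cohesion):
    # Firing strengths alpha_k: product t-norm over antecedent memberships.
    rules = [
        (tri(syntax, 0, 1, 2.5) * tri(cohesion, 0, 1, 2.5), "low"),
        (tri(syntax, 1.5, 2.5, 3.5) * tri(cohesion, 1.5, 2.5, 3.5), "med"),
        (tri(syntax, 2.5, 4, 5) * tri(cohesion, 2.5, 4, 5), "high"),
    ]
    # Scale each consequent by its firing strength, aggregate with max.
    agg = np.zeros_like(y_grid)
    for strength, label in rules:
        agg = np.maximum(agg, strength * out_sets[label])
    # Centroid (center of gravity) defuzzification, Eq. (15).
    return float((y_grid * agg).sum() / (agg.sum() + 1e-9))

print(infer(syntax=4.2, cohesion=3.8))       # e.g., a "high"-leaning learner
```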

Predictive modelling

The model development phase incorporates a range of potentially suitable models, selected for their respective capacities to handle the complex, multivariate nature of ELL assessment data. As ML baselines, SVM, DT, RF, and CatBoost are implemented, as displayed in Table 3. SVM classifies data points well, especially in high-dimensional feature spaces, but may lack the power to detect non-linear effects49. The decision tree model gives a highly interpretable structure that captures decision boundaries using straightforward if-then rules, yet it is often susceptible to overfitting. As an ensemble of decision trees, random forest reduces this risk to some extent by generalizing and stabilizing over a forest of trees. CatBoost, a gradient boosting algorithm, greatly improves prediction accuracy, particularly when categorical variables and complex data distributions are involved50.

Table 3. Analysis of baseline model selection based on their hyperparameter tuning strength.

Baseline Model Hyperparameters Tuned Tuning Details
SVM Kernel type, C (regularization), Gamma Grid search over kernel (linear, RBF), C and Gamma values
DT Max depth, Min samples split, Criterion Tuned max depth and split criteria (gini/entropy)
RF Number of trees, Max depth, Min samples Number of estimators and tree depth tuned for performance
CatBoost Learning rate, Depth, Iterations Tuned learning rate, tree depth, and number of iterations
LSTM Number of hidden units, Dropout rate, Epochs Tuned hidden units and dropout to prevent overfitting
BiLSTM Number of hidden units, Dropout, Epochs Tuned like LSTM with additional bidirectional context

For DL-based methods, Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) networks are used. These models are well suited for capturing temporal and sequential dependencies within learner responses, enabling more complex, contextual representations of text data. LSTM is especially useful for sequences in which order and context play a crucial role, while BiLSTM improves on this by learning past and future contexts simultaneously, deepening the understanding of the language51.

Transformer-based modeling is the approach used in DeBERTa, a modern architecture celebrated for its strong self-attention mechanism and deep context embedding. DeBERTa can model intricate syntactic and semantic relationships in textual responses, thus learning strong and fine-grained language representations that enable highly accurate classification52.

Finally, the hybrid DeBERTa-LSTM model takes full advantage of both transformer-based contextual encoding and recurrent sequence modeling. This model first adopts the DeBERTa transformer53 to generate dense, context-aware representations of the learner text; these representations are then fed into LSTM layers to extract sequential patterns and relationships. The product is an integrated predictive framework that easily accommodates unstructured data, providing enhanced prediction and interpretability within the ELL trait continuum. This methodological variety allows for comprehensive benchmarking and demonstrates the benefits of advanced hybrid architectures over traditional non-hybrid models.

Proposed model

We trained a fusion model, DBML, to capture fine-grained relationships between language features and learner attributes; it combines the power of advanced language embeddings, structured metadata, and temporal modeling, with the architecture shown in Fig. 3. The architecture starts by passing the raw textual responses of learners through a pretrained DeBERTa transformer to obtain dense, context-aware embeddings via deep self-attention operations. These embeddings represent the intricate syntactic and semantic structures contained in the learners' written language and correspond to a high-dimensional feature vector $h_{\text{text}} \in \mathbb{R}^{d}$, with $d$ the dimension of the extracted embeddings generated by DeBERTa54.

Fig. 3. Architecture of the DeBERTa model.

Input processing

The input is based on two types of data structure processed by the fusion model.

  • Textual Input

Each student’s textual response is tokenized and passed into a pretrained DeBERTa model to extract rich, context-aware embeddings.

Let the sequence of tokens be $x = (x_1, x_2, \dots, x_n)$.

  • Metadata Input

Structured learner data (contextual, demographic, psycholinguistic, cognitive features) are normalized and encoded into a numerical vector.

Let this vector be $m \in \mathbb{R}^{d_m}$.

DeBERTa embedding layer

This embedding layer combines token embeddings with relative positional information fed into the transformer encoder blocks.

  • Token Embedding:

Each token $x_i$ is mapped to an embedding $e_i$.

  • Relative Positional Attention & Encoder Layers.

DeBERTa computes contextualized embeddings using stacks of transformer layers, defined using Eq. 17.

$$H^{(l)} = \text{TransformerLayer}\big(H^{(l-1)}\big), \quad l = 1, \dots, L \tag{17}$$

where $H^{(0)} = (e_1, \dots, e_n)$ and $l$ denotes the layer index.

Each attention mechanism uses Eq. 18:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top} + R}{\sqrt{d_k}}\right)V \tag{18}$$

where $R$ adds the relative position bias.

Pooling layer

For sequence representation, typically the [CLS] token embedding or a pooled output is taken, as in Eq. 19.

$$h_{\text{text}} = H^{(L)}_{[\text{CLS}]} \tag{19}$$

Feature fusion layer

The contextual text embedding and metadata feature vector are concatenated to form a single input vector, computed using Eq. 20:

$$z = [\,h_{\text{text}}\,;\, m\,] \tag{20}$$

LSTM integration layer

The fused vector $z$ is passed through one or more LSTM layers to capture non-linear, sequential interactions between text and structured features. The LSTM cell is defined by Eq. 21 (forget gate), Eq. 22 (input gate), Eq. 23 (output gate), Eq. 24 (candidate cell state), Eq. 25 (cell state update), and Eq. 26 (hidden state).

$$f_t = \sigma\big(W_f\,[h_{t-1};\, z_t] + b_f\big) \tag{21}$$
$$i_t = \sigma\big(W_i\,[h_{t-1};\, z_t] + b_i\big) \tag{22}$$
$$o_t = \sigma\big(W_o\,[h_{t-1};\, z_t] + b_o\big) \tag{23}$$
$$\tilde{c}_t = \tanh\big(W_c\,[h_{t-1};\, z_t] + b_c\big) \tag{24}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{25}$$
$$h_t = o_t \odot \tanh(c_t) \tag{26}$$

where $f_t, i_t, o_t$ are the forget, input, and output gates, $c_t$ is the cell state, $h_t$ is the hidden state, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.

Dense and output layer

The final hidden state from the LSTM, $h_T$, is input to a dense (fully connected) layer for multi-class classification. The output prediction $\hat{y}$ is obtained via the softmax activation using Eq. 27:

$$\hat{y} = \text{softmax}\big(W\,h_T + b\big) \tag{27}$$

Layer-wise model pipeline, also shown in Fig. 4 (a code sketch follows the list):

Fig. 4. Proposed model architecture diagram.

Input 1:

Raw learner text → DeBERTa → $h_{\text{text}}$.

Input 2:

Metadata features → Embedding/Encoding → $m$.

Fusion:

Concatenation to form $z = [h_{\text{text}};\, m]$.

Hidden:

Dense + LSTM layers for deep fusion and sequence modeling, yielding $h_t$.

Output:

Dense (softmax) layer for multi-class trait classification, producing $\hat{y}$.
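The sketch below condenses this pipeline into a PyTorch module. Layer sizes follow Table 4 where stated (2 LSTM layers of 256 units, a 128→64 fusion head with BatchNorm and dropout); treating the fused vector as a length-1 sequence for the LSTM, and the remaining details, are assumptions for illustration:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class DBML(nn.Module):
    def __init__(self, meta_dim: int, n_classes: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("microsoft/deberta-base")
        hid = self.encoder.config.hidden_size          # 768 for deberta-base
        self.meta_bn = nn.BatchNorm1d(meta_dim)        # normalize metadata m
        # Fused [h_text ; m] treated as a length-1 sequence (an assumption).
        self.lstm = nn.LSTM(hid + meta_dim, 256, num_layers=2,
                            batch_first=True, dropout=0.5)
        # 2-layer fusion MLP (128 -> 64) with BatchNorm + Dropout, per Table 4.
        self.head = nn.Sequential(
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes))

    def forward(self, input_ids, attention_mask, meta):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h_text = out.last_hidden_state[:, 0]           # [CLS] pooling, Eq. 19
        z = torch.cat([h_text, self.meta_bn(meta)], dim=-1)   # fusion, Eq. 20
        _, (h_n, _) = self.lstm(z.unsqueeze(1))        # Eqs. 21-26
        return self.head(h_n[-1])      # logits; softmax applied by the loss
```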

Hyperparameter settings

The proposed model unites high-quality pretrained language representations and structured learner metadata with temporal sequence modeling using LSTM layers. The model employs a stacked LSTM architecture with dropout, which helps capture temporal dependencies while avoiding overfitting. Training uses an adaptive optimizer well suited to transformer-based models, with the learning rate and batch size tuned for stable convergence; the hyperparameter settings are shown in Table 4, and a training-loop sketch follows the table. The maximum input sequence length is chosen as a compromise between computational efficiency and the amount of context captured. Overfitting is handled via dropout, weight decay, label smoothing, early stopping, gradient clipping, batch normalization, class weighting, and calibration; data leakage is prevented via hold-out splitting, fold-aware preprocessing, rare-category handling, de-duplication, and a leakage audit. A fuzzy logic module is also introduced to complement the feature-based measures with rule-based inference over learner characteristics.

Table 4. Analysis of hyperparameter tuning.

Hyperparameter Setting
Embedding model DeBERTa-base (fine-tuned)
Metadata input Included (demographic, cognitive/psycholinguistic scores) via late fusion (concat) + BatchNorm
LSTM layers/hidden units 2 layers/256 units
Sequence length 256 tokens
Dropout rate 0.5 (fusion MLP), 0.1 (Transformer)
Weight decay 1e-2 (AdamW)
Learning rate/schedule 3e-5, 10% warm-up + cosine decay
Batch size 32
Optimizer AdamW (grad-clip = 1.0)
Epochs 10–15 (with early stopping: patience = 5 on val macro-F1)
Loss function Cross-entropy (class-weighted, label smoothing = 0.1)
Fusion head 2-layer MLP (128→64) with BatchNorm + Dropout 0.5
Calibration Temperature scaling on validation only
Data split Hold-out (80 − 20)
Fold-aware preprocessing Fit on training fold only: median imputation (numeric), mode imputation (categorical), z-score scaling, one-hot/ordinal bins; reuse transformers on val/test
Rare-category handling Group categories with < 1% frequency into “other” (computed on training fold)
Text de-duplication Remove near-duplicates across splits (MinHash Jaccard > 0.9)
Leakage audit No target-derived fields; permutation leakage check on metadata
Baseline models SVM, DT, RF, CatBoost, LSTM, BiLSTM, DeBERTa, DeBERTa + LSTM
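The training-loop sketch below follows the settings in Table 4 (AdamW with weight decay 1e-2, 3e-5 peak learning rate with 10% warm-up and cosine decay, gradient clipping at 1.0, class-weighted label-smoothed cross-entropy, early stopping on validation macro-F1 with patience 5); `train_loader`, `val_loader`, `class_weights`, and `evaluate_macro_f1` are assumed helpers, not published code:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = DBML(meta_dim=24, n_classes=5)       # dimensions are assumptions
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=1e-2)
total_steps = len(train_loader) * 15
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps)
criterion = torch.nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.1)

best_f1, patience = 0.0, 0
for epoch in range(15):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        logits = model(batch["input_ids"], batch["attention_mask"], batch["meta"])
        loss = criterion(logits, batch["label"])
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad clip
        optimizer.step()
        scheduler.step()
    val_f1 = evaluate_macro_f1(model, val_loader)   # hypothetical helper
    if val_f1 > best_f1:
        best_f1, patience = val_f1, 0               # improvement: reset
    else:
        patience += 1
        if patience >= 5:                           # early stopping
            break
```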

Statistical analysis

A variety of statistical tests, displayed in Table 5, are applied to assess the significance of performance differences and relationships between categorical outcomes in trait classification55 (a SciPy sketch follows the table). In Table 5, $SS_B$ and $SS_W$ denote the between- and within-group sums of squares, $k$ is the number of groups, and $N$ the total number of observations. For each test, a significance level of $\alpha = 0.05$ is used, and p-values are adjusted (e.g., Bonferroni correction) when multiple comparisons are made.

Table 5. Analysis of applied statistical tests.

Test Formulation Purpose
Independent t-test $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}$, where $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$ Compare the mean performance (e.g., AUC) of two independent models
Z-test $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ Compare a sample mean against a known population value or compare proportions
One-way ANOVA $F = \frac{SS_B/(k-1)}{SS_W/(N-k)}$ Test for differences in mean performance across three or more models
Chi-square $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$ Assess association between two categorical variables (e.g., predicted vs. actual class frequencies)
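The SciPy sketch below applies the tests in Table 5 to per-run model scores; the score arrays and confusion table are assumed inputs:

```python
from scipy import stats

t_stat, p_t = stats.ttest_ind(scores_a, scores_b)           # independent t-test
f_stat, p_f = stats.f_oneway(scores_a, scores_b, scores_c)  # one-way ANOVA
chi2, p_chi, dof, expected = stats.chi2_contingency(confusion_table)  # chi-square

# Bonferroni adjustment for m comparisons at alpha = 0.05.
m = 3
significant = [p < 0.05 / m for p in (p_t, p_f, p_chi)]
```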

XAI SHAP and DeepSHAP

Explainable Artificial Intelligence (XAI) techniques such as SHAP (SHapley Additive exPlanations) and its deep learning extension, DeepSHAP, are key to interpreting complex machine-learning model predictions. SHAP is a cooperative game theory-based method that provides a prediction-specific importance value for each feature, quantifying the value each feature contributes (positively or negatively) to the model's output. This allows users to go beyond black-box predictions and instead obtain transparent, personalized explanations that are essential for informed decision making in educational applications. DeepSHAP extends SHAP to deep learning models, estimating SHAP values using forward and backward passes through the model with a modified DeepLIFT procedure56. For models such as DBML, DeepSHAP makes it possible to disentangle the contribution of prediction scores into those coming from input text embeddings and those resulting from structured metadata features, computed as in Eq. 28, providing a fine-grained understanding of which traits the model relies on to express its confidence for each learner.

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\big(|F| - |S| - 1\big)!}{|F|!} \Big[ f\big(S \cup \{i\}\big) - f(S) \Big] \tag{28}$$

where $F$ is the set of all features, $S$ is a subset of $F$ not containing $i$, $f(S)$ is the model prediction when only the features in $S$ are present, and $\phi_i$ is the SHAP value for feature $i$.

This expresses the marginal contribution of feature $i$ averaged over all possible feature combinations, making SHAP both theoretically principled and practically robust for interpreting hybrid and deep models in ELL assessment.
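A hedged sketch of DeepSHAP attribution with the `shap` package, shown here for a hypothetical metadata-only PyTorch branch (`meta_model`); attributing through the full DBML text branch would additionally require wrapping the tokenized inputs:

```python
import numpy as np
import shap

# `background_meta` (~100 reference rows) and `test_meta` are assumed
# float tensors, not the study's released artifacts.
explainer = shap.DeepExplainer(meta_model, background_meta)
shap_values = explainer.shap_values(test_meta[:50])   # per-class attributions

# Global importance: mean |SHAP| per feature for the first class.
global_importance = np.abs(np.asarray(shap_values[0])).mean(axis=0)
```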

Performance evaluation measures

Performance evaluation in this study relies on a comprehensive suite of classification metrics to ensure robust and fair model assessment, displayed in Table 6. Accuracy measures the rate of correctly classified samples, providing an overall picture of a model's success. Precision measures the percentage of positive predictions that were correct, indicating the model's ability to avoid false alarms. Recall (or sensitivity) is the percentage of true positives identified among all actual positives, stressing the model's ability to identify positive cases.

Table 6. Performance evaluation measures.

Metrics Equation
Accuracy $\frac{TP + TN}{TP + TN + FP + FN}$
Precision $\frac{TP}{TP + FP}$
Recall $\frac{TP}{TP + FN}$
F1-score $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
AUC-ROC $\int_0^1 TPR \; d(FPR)$

The F1-score balances precision and recall via their harmonic mean, summarizing both false positives and false negatives. Moreover, AUC-ROC (Area Under the Receiver Operating Characteristic Curve) assesses the model's capacity to separate classes at different probability thresholds, providing a threshold-free measure of separability and ranking power with direct applicability to multi-class categories57. In combination, these metrics offer a multifaceted view of predictive performance and corroborate the validity and interpretability of our hybrid and fuzzy-based models for the classification of ELLs.
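A scikit-learn sketch of this evaluation suite, using macro averaging for the multi-class setting; `y_true`, `y_pred`, and the class-probability matrix `y_proba` are assumed:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
# roc_auc_score needs an (n_samples, n_classes) probability matrix here.
auc = roc_auc_score(y_true, y_proba, multi_class="ovr")
```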

Results and discussion

This section comprehensively presents the empirical results obtained by traditional machine learning as well as advanced fuzzy rule-based methods on the ELL dataset. Via a set of comprehensive analyses, including descriptive statistics, visual exploratory data analysis, feature importance estimation, and model evaluation, important relationships and patterns in the data are revealed. Comparative performance of different feature categories, the interpretability of rule-based models, and insights into linguistic proficiency in terms of contextual, psycholinguistic, and cognitive features are addressed. The results of each methodological step are presented in the following paragraphs, with a combination of numerical results and relevant visualizations used to demonstrate the efficiency and explanatory potential of the proposed hybrid framework.

Exploratory data analysis

Insights into feature relationships and learning performance distributions are deduced from the EDA of the ELL dataset. The feature correlation heatmap in Fig. 5 reveals that the linguistic constructs (labeled Cohesion, Syntax, Vocabulary, Phraseology, Grammar, and Conventions) have substantial positive correlations among themselves and with the overall score. This clustering of large correlation coefficients (most around 0.65) indicates that these linguistic measures are highly intertwined, in line with the finding that advanced linguistic proficiency is multi-dimensional and strongly correlated. In contrast, contextual, demographic, and cognitive features are weakly correlated with each other and with language traits, implying that they play a more independent, subsidiary role relative to language in measuring the learner's competence.

Fig. 5. Correlation heatmap of all attributes within the dataset.

This is also confirmed by focused views of selected features in Fig. 6 (Cohesion, Syntax, Vocabulary, Overall). The correlation heatmap here shows even stronger interrelations (0.81 in some cases), specifically between Syntax and Overall or Cohesion and Overall, indicating that enhancements in these features directly translate into better overall language use. This close correspondence supports their use as predictors in both basic and advanced models. The distribution of the overall score is shown as a histogram with a kernel density estimate in Fig. 7, with a bell-shaped peak centered at 3. This implies that the learners' proficiencies are approximately normally distributed over the dataset, with most students located around an intermediate proficiency level and fewer at the extremes. An associated bar plot showing the distribution of grades indicates that the dataset primarily consists of students in grades 11 and 12, with sample sizes that are still quantitatively significant for grades 8, 9, and 10. This demographic distribution gives confidence that the analysis represents an overall, although slightly upper-level-skewed, portrait of secondary school students.

Fig. 6. Linguistic feature analysis using heatmap.

Fig. 7. Distributions of scores and grades based on responses.

Finally, the multi-variate density plot in Fig. 8 of all feature pairs, standardized from 0 to 6, conveys the multi-modal character of learner profiles. Linguistic peaks (in particular Cohesion, Syntax, and Vocabulary) are found at mid-to-high levels, evidencing their influential role and developed dimensions in the sample. In contrast, features like prior experience, motivation level, and cognitive load are spread out more broadly, with density disproportionately large at smaller values, which corroborates the previous observation that they play a complementary, but no less informative, role. This comprehensive EDA highlights the heterogeneity of the learner groups and the complex nature of language proficiency, hence providing a robust empirical ground for advanced interpretable models such as fuzzy logic and deep learning fusion.

Fig. 8. Density plot distributions of all attributes within the dataset.

The numeric-statistics chart in Fig. 9 indicates that rubric scores (Cohesion, Syntax, Vocabulary, Phraseology, Grammar, Conventions) are tightly confined to the 1-5 range, with means congregating around the mid-point and medians of 3, hence reasonably symmetric distributions with no extreme outliers; grade occupies a much broader scale and has the greatest spread, and hence must be z-scaled to avoid dominating the fused model. The top-category share plot for categorical features, in Fig. 10, indicates a fair skew in some features: a single motivation category accounted for ~ 50% of samples, and others (i.e., Vocabulary/Overall/Syntax) each had ~ 30-40% top-category share. This combination suggests sufficient variation to learn from, yet also an element of possible imbalance; hence stratified CV, class/feature balancing, rare-category combination (e.g., "other"), and missingness flags are recommended in the pipeline to avoid spurious associations and maintain the model's focus on the linguistic clues.

Fig. 9. Bar plot distribution analysis of numeric attribute statistics.

Fig. 10. Top categorical metadata analysis.

Feature ranking and fuzzy rule

The use of feature ranking as a preliminary step to rule-based fuzzy modeling is crucial to the effectiveness, interpretability, and practical impact of the resulting fuzzy decision-making system. Toward this, feature ranking strategies including Information Gain, Gain Ratio, Gini Index, principal component loading, and permutation importance collectively detect the most informative and distinctive attributes within each of the three categories: contextual and demographic features, psycholinguistic and cognitive features, and linguistic features. This multivariate testing grounds the rules generated in the fuzzy system in both linguistic and statistical terms.

In the case of contextual and demographic characteristics, relative importance is higher for native language, learning environment and age, compared to prior experience. These findings suggest that the development of fuzzy rules be predicated on more subtle conditions related to language background and educational context rather than experience, a focus that may not always successfully distinguish learner outcomes. In the psycholinguistic and cognitive domain, working memory index and cognitive load are the most relevant features through most rank methods, while attention span and motivation level are ranked lower (yet still relevant). Thus, the fuzzy rule base may give emphasis to cognitive performance and load management in decision-making, which results in more robust and focused inference activation for those learner characteristics.

The most remarkable result can be found in the linguistic feature subdomain, where all ranking criteria coincide in selecting phraseology, syntax, vocabulary, and cohesion as the preeminent predictors of language proficiency. The uniformly good scores of these attributes across all methods imply that they are solid foundations for building reliable fuzzy rules. For example, rules including thresholds or fuzzy sets on syntax or phraseology are much more likely to produce robust, generalizable classifications. Grammar and conventions, though ranked lower, still play a significant role in forming secondary or supportive rules. Because the fuzzy rule base is built on empirically ranked features, the proposed fuzzy inference system is transparent and well-grounded. Rules are not arbitrary or merely expert-authored; they closely represent the underlying data distribution and discriminative structure. This cooperation of statistical feature importance and fuzzy logic strengthens rule activation, reduces ambiguity in class assignment, and improves the accuracy and interpretability of the whole system. Finally, such a technique closes the loop between data-driven discovery and human-understandable logic, making fuzzy-based e-assessment more robust, interpretable, and practically beneficial.

This approach to assessing ELLs demands strong feature selection that draws on both statistical measures (e.g., Information Gain, Gain Ratio, Gini Index) and model-based algorithms (Random Forest) to understand which learner features are the most predictive of language proficiency. The integration of these two methods, tabular rankings and visual feature importance, provides dual evidence for the importance of every attribute while allowing for interpretable rule-based modeling and robust predictive analytics.

Contextual and demographic variables

The Table 7 rankings suggest that, of the contextual and demographic features, native language and learning environment show larger Information Gain and Gain Ratio, which should allow them (as single attributes, considering individual splits) to reduce entropy and give useful splits regarding learner class. Although age weighs less in Information Gain, by Gini Index it separates classes quite purely. When the Random Forest feature importance is plotted, age comes across as the most dominant feature, affirming the Gini Index viewpoint.

Table 7. Contextual & demographic feature ranking.

Feature Information gain Gain ratio Gini index PC1 loading Permutation importance
Native language 0.01 0.04 0.20 0.01 0.16
Learning environment 0.07 0.03 0.23 0.19 0.13
age 0.05 0.08 0.35 0.99 0.188
Prior experience 0.03 0.02 0.20 0.03 0.12

Note that this inconsistency, with entropy- and information-based statistics tending to select attributes with very mixed value distributions (because these are also the most informative), demonstrates that model-based algorithms like RF may reveal non-linear and interaction effects (which make variables like age particularly important), as shown in Fig. 11. This twofold evidence indicates that native language and learning context are important in rule formation, but age must also be considered to ensure the best predictive power, favoring rules that capture maturational and experiential differences in language learning.

Fig. 11. Feature importance for the contextual & demographic trait based on RF.

Psycholinguistic and cognitive factors

Table 8 shows the feature rankings for the psycholinguistic and cognitive trait. Working memory index and cognitive load emerge as the top two features by IG and Gini Index, respectively, meaning they are significant in discriminating ELL proficiency levels. Similarly, attention span score is revealed as having a high PC1 loading, highlighting its relevance in principal component-based analyses. These observations are visually consolidated in the Random Forest plot in Fig. 12, where the working memory index and attention span score are the most important attributes, followed by cognitive load and motivation level. The agreement between the statistical and model-based values justifies the adoption of such cognitive traits to define rules in fuzzy systems and to feed input nodes in advanced machine learning models. This is also in line with existing psychological research suggesting that memory and attentional control are critical for learning, particularly in distinguishing subtle learner profiles.

Table 8. Psycholinguistic & cognitive feature ranking.

Feature Information gain Gain ratio Gini index PC1 loading Permutation importance
Working memory index 0.06 0.02 0.46 0.31 0.02
Cognitive load 0.05 0.37 0.26 0.01 0.01
Attention span score 0.09 0.07 0.15 0.94 0.02
Motivation level 0.10 0.08 0.11 0.09 0.01
Fig. 12. Feature importance for the psycholinguistic & cognitive trait based on RF.

Linguistic features

The most compelling and consistent results appear in the linguistic feature set, shown in Table 9. Setting aside the model-based importances, all ranking metrics (Information Gain, Gain Ratio, Gini Index) converge on phraseology and syntax as the most important predictors. These characteristics attain the highest scores on each of the metrics and are thus extremely good candidates for rule-based fuzzy systems and feature selection in machine learning models. The magnitude of each feature in the RF plot in Fig. 13 makes this point clear: phraseology and syntax are by far the most important features, with diminishing contributions from cohesion, vocabulary, grammar, and conventions. Moreover, such a consensus among so many analytic perspectives confirms the primacy of more advanced linguistic features in the comprehension accomplishments of ELLs, revealing that proficiency in these constructs best predicts overall language proficiency and should thus be the core of both rule bases and predictive models.

Table 9.

Linguistic feature ranking.

Feature Information gain Gain ratio Gini index PC1 loading Permutation importance
Phraseology 0.35 0.14 0.41 0.41 0.08
Syntax 0.36 0.15 0.22 0.42 0.09
Vocabulary 0.30 0.13 0.10 0.35 0.07
Cohesion 0.31 0.12 0.15 0.40 0.08
Grammar 0.29 0.11 0.07 0.43 0.07
Conventions 0.30 0.12 0.08 0.42 0.07
Fig. 13.

Fig. 13

Feature Importance for linguistic trait based on RF.

Combining these ranking results with visual inspection of the Random Forests supports several empirical and theoretical conclusions. Empirical feature rankings keep the rule-based fuzzy systems from generating rules arbitrarily, grounding them instead in data-driven evidence about attribute significance. Building on the high-ranking features (phraseology and syntax for linguistic rules; age and working memory index for contextual and cognitive rules), with interval values over each attribute, yields potent and intelligible fuzzy inference systems. From a machine learning perspective, this strikes a balance between traditional statistical ranking and model-based importance, helping to ensure that the final models are statistically sound, empirically reliable, and generalizable.

Models that adaptively combine statistical feature selection with Random Forest importance measures can exploit the strengths of both kinds of method. This approach ensures that interpretable (i.e., actionable) rules as well as opaque machine learning models leverage the most informative, reliable, and predictive learner characteristics, ultimately supporting more accurate, explainable, and actionable assessment systems. The fuzzy rule-based classification model is important for progress toward nuanced identification of ELLs. In contrast to rigid, threshold-based systems, fuzzy logic offers a soft, intuitive way of handling the inherent uncertainty and loose structure of human language learning. By using empirically ranked features spanning context, cognition, and language, the rules place learners along the proficiency continuum in a way that reflects their relative position to one another. This allows educators and researchers to account for the complex interactions of varied learner dimensions, contributing to a more comprehensive and personalized assessment. The results of the fuzzy system are then discussed in terms of their support for the proposed ELL framework, particularly how combining a diverse set of learner features can yield more precise, interpretable, and action-oriented learner profiles.

The plots of fuzzy memberships offer a fine-grained, intuitive visualization of the distribution and interpretability of trait attributions across the contextual, psycholinguistic, cognitive, and linguistic dimensions of English learners. Such plots illustrate the strength of fuzzy systems in describing the nuanced, gradual nature of learner characteristics, something crisp, fully thresholded models cannot imitate. For the contextual and demographic features, Fig. 14 shows how attributes like age, native language, learning environment, and prior experience are distributed among low, medium, and high fuzzy sets. The soft degrees of membership indicate that learners do not necessarily belong to distinct, all-or-none categories.

Fig. 14.

Fig. 14

Membership ratio among contextual and demographic traits.

For instance, many learners hold moderate-to-high membership in the medium age and medium native-language-proficiency sets, since both age and native language proficiency usually affect learning along a gradient rather than in a discrete, state-based manner. This granularity enables the fuzzy system to develop rules that account for slight demographic differences, adapting to mixed or intermediate profiles rather than fixing arbitrary dividing lines. The psycholinguistic and cognitive characteristics, such as attention span, working memory, motivation, and cognitive load, show even more varied membership distributions. Figure 15 illustrates how learners can hold moderate-to-high membership in more than one cognitive factor. As an example, a student may combine high working memory with only average cognitive load, a combination that a crisp rule system could miss but that fuzzy logic captures naturally. This richer representation permits the fuzzy classifier to fire rules in parallel, weight them by the actual membership degrees, and thereby better reflect human judgment in educational domains.

Fig. 15.

Fig. 15

Membership ratio among psycholinguistic and cognitive traits.
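The membership construction behind these plots can be sketched with the scikit-fuzzy package. The universe range and triangular set boundaries below are hypothetical placeholders chosen for illustration, not the calibrated values behind Figs. 14, 15 and 16.

```python
# Minimal sketch of low/medium/high fuzzy sets for one trait (age) using
# scikit-fuzzy; the ranges and breakpoints are illustrative placeholders.
import numpy as np
import skfuzzy as fuzz

age_universe = np.arange(10, 41, 1)

# Overlapping triangular membership functions for the three fuzzy sets.
age_low = fuzz.trimf(age_universe, [10, 10, 22])
age_med = fuzz.trimf(age_universe, [15, 25, 35])
age_high = fuzz.trimf(age_universe, [28, 40, 40])

# A 24-year-old learner belongs partially to several sets at once.
learner_age = 24
for name, mf in [("low", age_low), ("medium", age_med), ("high", age_high)]:
    degree = fuzz.interp_membership(age_universe, mf, learner_age)
    print(f"age={learner_age} -> membership in '{name}': {degree:.2f}")
```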

In the linguistic domain, the membership plots in Fig. 16 for cohesion, syntax, vocabulary, phraseology, grammar, and conventions are particularly telling. Here the membership degrees for the “high” sets often show broader, stronger distributions, indicating that high linguistic proficiency is a more discriminating feature of advanced learners. Nevertheless, the presence of learners with moderate or low memberships alongside those with high memberships in these features demonstrates the diversity of abilities within any cohort of learners.

Fig. 16.

Fig. 16

Membership ratio among linguistic traits.

The fuzzy system’s power is also evidenced by its ability to aggregate a large number of high or medium linguistic memberships simultaneously, increasing the interpretability and robustness of its classification. In sum, these observations highlight an advantage of the fuzzy approach in trait-based ELL contexts. Because membership is visualized and modeled as a continuum, the fuzzy if-then rules can handle subtleties that binary logical reasoning cannot entertain. Learner evaluation itself thereby becomes more flexible, fair, and actionable, leading to more personalized instruction and assessments that more truly reflect the nuances of human learning.

Supporting this with mean values, Fig. 17 shows the distribution of mean rule activation strengths for the three categories (Contextual & Demographic, Psycholinguistic & Cognitive, and Linguistic features) in the fuzzy rule-based classifier. Most mean rule activations fall within a common range for the three kinds of traits, but the activation peaks are consistently slightly higher for the linguistic feature set. This reflects the observation that linguistic features induce stronger, more decisive rule activations than contextual and cognitive features do. In other words, the fuzzy rules built over linguistic features activate more robustly, underscoring their pivotal role in distinguishing ELLs of different proficiency. The slightly higher and wider distribution for linguistic features means that the corresponding rules more often dominate the final classification, which was also observed in the feature-ranking analysis of these rules. For the contextual and demographic as well as the psycholinguistic and cognitive domains, the narrower spreads and smaller peaks indicate that rule activations are weaker and more evenly distributed. This accords with the expectation that these characteristics play a supportive or moderating role in fuzzy inference: their rules can fire alongside (or supplement) linguistic rules, adding nuance or contextual information to the decision without dictating the classification result.

Fig. 17.

Fig. 17

Fuzzy Rule Activation Distribution among each class.

The visual aggregation of membership strengths and the small differences detected between trait categories show that fuzzy rule-based systems can be both interpretable and fine-grained. Instead of a simple binary threshold, the fuzzy classifier uses degrees of rule activation, making it possible to compute an overall, finely graded learner classification from multiple partial contributions. This not only aligns more closely with the diversity of real-world learners but also gives educators insight into which attributes mattered most in a particular assessment and to what extent each dimension contributed to the learner categorization.
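A minimal sketch of how such graded rule activations could be computed is given below, assuming min for conjunction and max for per-class aggregation (a common Mamdani-style choice); the membership values and rules are invented placeholders, not the extracted rule base.

```python
# Sketch of fuzzy rule activation and per-class aggregation.
# A learner's membership degrees per fuzzy set (from the membership functions).
memberships = {
    ("syntax", "medium"): 0.7,
    ("syntax", "high"): 0.2,
    ("motivation", "high"): 0.8,
    ("age", "medium"): 0.6,
}

# Each rule is a conjunction of (feature, level) antecedents plus a class label.
rules = [
    ({("syntax", "medium"), ("age", "medium")}, "contextual_demographic"),
    ({("syntax", "medium"), ("motivation", "high")}, "contextual_demographic"),
    ({("syntax", "high"), ("motivation", "high")}, "linguistic"),
]

# AND is modeled with min; per class, rule strengths aggregate with max.
activation = {}
for antecedents, label in rules:
    strength = min(memberships.get(a, 0.0) for a in antecedents)
    activation[label] = max(activation.get(label, 0.0), strength)

print(activation)  # {'contextual_demographic': 0.7, 'linguistic': 0.2}
```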

The plot in Fig. 18 shows how rule support (strength) is distributed across the most important attributes and categories in the fuzzy classification model. For each of them, from contextual and cognitive to linguistic, the degree and consistency with which the associated fuzzy rules hold across a group of samples is assessed. The layered color attribution of categories shows that rule support is not uniform but varies by feature level rather than globally, revealing which intervals of the various features contribute most robustly to classification. Notably, the high density of support values at the bottom of each violin indicates that the fuzzy system is sensitive to a wide array of learner profiles, while the higher outliers correspond to rarer cases that trigger strong rule activations in specific instances.

Fig. 18.

Fig. 18

Distribution of rule support by attributes among score values.

This multidimensional visualization illustrates the flexibility and explanatory power of the fuzzy approach: even traditionally marginal features such as prior experience or motivation level can still play a significant role in learner assessment when their rule support aligns with certain categories.

The bar chart in Fig. 19 shows the mean values of the important features of the ELL dataset and directly supports the fuzzy-based analysis by demonstrating which characteristics are most prominent in the learner group. The linguistic features (Vocabulary, Cohesion, Phraseology, Conventions, Grammar, and Syntax) fall, as a group, at the top of the possible 5-point range, with means just above 3. This concentration at the high end indicates that the linguistic traits are not only well developed on average but also have the greatest effect on rule activation and learner differentiation within the fuzzy system. This is consistent with the fuzzy rule analysis, in which the linguistic features were consistently characterized by higher rule activations and membership degrees, signaling that they are the most decisive in distinguishing advanced language learners.

Fig. 19.

Fig. 19

Analysis of Mean values of attributes.

The mean values of the contextual and psycholinguistic-cognitive features (learning environment, cognitive load, motivation level, prior experience) are lower. This indicates that although these factors are crucial for developing a nuanced understanding and for secondary rule creation, they shape the actual data profile less strongly and hence are not dominant in the dataset. Their lower means also demonstrate the capacity of a fuzzy system for partial memberships, which prevents these features from being discarded entirely; instead they modulate the linguistically driven classification results. This mean-value analysis thus supports and extends the fuzzy approach. It suggests that fuzzy classification is well calibrated to the empirical structure of the data, highlighting the most valuable linguistic features while remaining sensitive to the distinct contributions of contextual and cognitive assets. The result is more balanced and interpretable learner evaluation, confirming the applicability of fuzzy logic to real educational assessment.

The rule-based visualizations in Fig. 20 provide a holistic view of how fuzzy logic exploits empirically important features to deliver understandable and powerful classification for ELL assessment. The heatmap of the top 15 fuzzy rules shows the rules activated with the highest frequency and strength in the dataset. Notably, the top rules usually stack intermediate values of age and attention span score with a high level of motivation, reflecting the complexity of language learning, where moderate and high traits reinforce each other to define learner profiles. The strong activation of these rules suggests that, within this cohort, moderate cognitive and demographic characteristics coupled with high motivation are chiefly associated with successful language learning outcomes. This again emphasizes that the fuzzy mechanism can model multi-attribute dependencies rather than relying on any binary indicator.

Fig. 20.

Fig. 20

Top 15 fuzzy rules with activation strength support.

The class-specific bar plots reveal the distinct trait interactions that appear to drive different types of learners. For the Contextual & Demographic class, the highest activation is obtained by rules that mix medium syntax with age, or high motivation with medium syntax, as shown in Fig. 21. This demonstrates that, beyond demographic aspects, language ability and motivation are key in the contextual assessment of learner achievement. For the Psycholinguistic & Cognitive class, the dominant rules associate medium syntax with high cognitive load, or cohesion with working memory, supporting the psychological hypothesis that cognitive resources, in tandem with syntactic awareness, jointly facilitate advanced learning, as shown in Fig. 22. These rules also exhibit symmetry: the order of the attributes (for example, syntax and cognitive load) can be switched while activating to the same degree, which shows the generality and flexibility of fuzzy logic.

Fig. 21.

Fig. 21

Top 5 rules for contextual and demographic class.

Fig. 22.

Fig. 22

Top 5 rules for psycholinguistic and cognitive class.

For the Linguistic Features class, as shown in Fig. 23, the rules with the maximum mean activation consistently involve syntax and cohesion, specifically at medium levels for both. This indicates that, for linguistic classification, a solid command of these associated features is essential. The higher mean activations of these rules compared with those of the other classes suggest that, once triggered, the linguistic rules have a greater overall effect on the categorization of learners, a result that aligns with our analysis of rule activation strengths and feature importance.

Fig. 23.

Fig. 23

Fuzzy Top 5 rules for linguistic class.

Taken together, these rule sets illustrate how fuzzy systems can move beyond binary thresholds and leverage combinations of multiple moderate-to-high trait values for complex, context-dependent assessment of learners. Visualizing both global and category-specific rules not only increases the interpretability of the system but also provides a tool for designing pedagogical strategies: educators can ascertain which combinations of features are most relevant and adapt instruction accordingly. The result is a data-driven, transparent roadmap for personalized ELL support, one that capitalizes on the strength of fuzzy logic to mirror the real complexity of human learning.

Predictive modelling results

The results of our predictive experiments highlight the capability of sophisticated ML, DL, and hybrid architectures in assessing the performance of English language learners. Most importantly, the fusion-based model combining DeBERTa with structured metadata and an LSTM outperformed all baseline and previous SOTA models, delivering reliable accuracy and the ability to generalize across a wide variety of learners. This holistic approach demonstrates the benefit of integrating rich text embeddings with cognitive and demographic features, setting a new state of the art in predictive analytics for educational assessment.

Machine and ensemble learning

The analysis of the machine learning models reveals clear trends in predictive accuracy when classifying ELLs according to contextual, psycholinguistic, cognitive, and linguistic features. Both overall accuracy and class-wise performance metrics improve as one moves from linear or shallow learners to ensemble methods. Of the benchmarked models, CatBoost and Random Forest report the best performance (86% and 85%), with better precision, recall, and F1-scores than SVM and Decision Tree, which return lower and quite similar results (79–80%), as displayed in Table 10. These results are further elucidated by the confusion matrices shown in Fig. 24. The SVM model has a strong tendency to misclassify psycholinguistic and cognitive instances as contextual and demographic: 139 samples from class 1 are misclassified as class 0, and many of the linguistic-attribute cases are assigned to other categories.

Table 10.

ML and EL based results analysis (%).

Model Accuracy Precision Recall F1-score
SVM 79 79 79 78
DT 80 80 80 80
RF 85 85 85 85
CatBoost 86 86 86 86
Fig. 24.

Fig. 24

Confusion Matrix of ML and EL models.

As for the Decision Tree model, although it slightly improves on the accuracy of the SVM, it still shows significant confusion between classes (66 samples from class 1 are misclassified as class 0, and 52 from class 1 are placed in class 2). These results suggest that linear and single-tree models are less successful at capturing the fine, overlapping boundaries that arise when contextual, psycholinguistic, and linguistic variables jointly influence language-proficiency outcomes. The superior predictive capability of the Random Forest and CatBoost models is indicated by higher true positives across all class values and a considerable decrease in misclassification. Both models show good balance between sensitivity and specificity, particularly for the psycholinguistic/cognitive and linguistic categories. Random Forest correctly predicts 168 samples from class 1 and 176 samples from class 2, while CatBoost predicts 177 and 173 of them, respectively. This improvement can be explained by the ability of ensemble models to exploit feature interactions and nonlinear relations, which are prevalent in educational contexts with diverse learner characteristics. The CatBoost model not only leads across the performance measures but also separates all three types soundly, reducing potential confusion and further enhancing the robustness of the model for future applications.

Comparing the ROC curves of the machine learning models gives a clear visual representation of their discriminative ability. SVM and CatBoost both achieve an impressive AUC (0.94), followed closely by Random Forest (0.93), suggesting that these models discriminate well between classes across a range of decision thresholds, keeping the true positive rate high while the false positive rate stays low. The Decision Tree fares far worse at 0.79, indicating that it fails to separate the classes as well, either because it tends to overfit or because it does not model the complex relationships in the dataset accurately. The overlapping cluster of ROC curves for SVM, Random Forest, and CatBoost in Fig. 25 indicates the substantial robustness and dependability of these systems for ELL evaluation, with ensemble and kernel methods giving good predictive performance across varied test samples.

Fig. 25.

Fig. 25

Combined Model AUC-ROC analysis.
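For reference, a benchmarking loop of this kind can be sketched as follows; the synthetic data stands in for the real learner features, and the hyperparameters are assumptions rather than the study's tuned settings.

```python
# Sketch of the model comparison using the public scikit-learn and CatBoost
# APIs, with synthetic stand-in data (14 features, three classes).
import numpy as np
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_tr, X_te = rng.random((400, 14)), rng.random((100, 14))
y_tr, y_te = rng.integers(0, 3, 400), rng.integers(0, 3, 100)

models = {
    "SVM": SVC(kernel="rbf", probability=True, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=300, random_state=0),
    "CatBoost": CatBoostClassifier(iterations=300, verbose=0, random_seed=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_pred = np.asarray(model.predict(X_te)).ravel()  # flatten CatBoost output
    print(name)
    print(classification_report(y_te, y_pred))        # per-class P/R/F1
```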

Furthermore, the decision boundaries of the machine learning models provide a comparative view of how these algorithms partition the English language learner profiles across the feature domains, shown in Fig. 26. After projecting the multi-dimensional learner data onto a two-dimensional space defined by the first two principal components (PCA), the spread of the actual samples is plotted together with the classification regions for the three classes: Contextual & Demographic, Psycholinguistic & Cognitive, and Linguistic Features. The SVM model produces a smooth, curved separator, reflecting its ability to find an optimal classifier over the transformed features. For the contextual and demographic features, most learners naturally cluster in the left corner, and SVM correctly predicts most of these class 0 samples. However, this results in some overlap and misclassification for samples near the margins, particularly for students whose feature profiles blend psycholinguistic and linguistic traits.

Fig. 26.

Fig. 26

Decision boundary region of all three traits based on PCA.

The difference in the partitioning of the input space is clearly reflected in the boundary plot of the Decision Tree. The splits appear as vertical, axis-aligned decisions, consistent with the tree's threshold-based splits on individual attributes. This yields well-defined partitions when the classes are clearly separated but produces a stepped, sometimes broken boundary that cannot represent the finer, multidimensional structure in the data. Students with feature values close to the decision boundaries can therefore be classified inconsistently, especially when their values sit at the split points between two ranges. The ensemble methods, Random Forest and CatBoost, clearly produce more complex and tighter decision boundaries. They allocate tighter, more class-balanced regions that better approximate the distribution of student samples. Random Forest, as an ensemble of trees, yields a smoother, more adaptable separation of classes than a single tree, and CatBoost goes a step further with a more advanced boosting implementation, resulting in boundaries that are not only well defined but also robust to noisy or overlapping feature sets. For English proficiency learners, this means that students with mixed, overlapping characteristics (for example, performing well linguistically while being cognitively driven) are matched better, with less misclassification and a more realistic representation of the diversity of learner profiles.
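The decision-region views of Fig. 26 can be reproduced in principle with a sketch like the following: project the features onto two principal components, fit a classifier in that plane, and color a dense grid by predicted class. Data and settings are again illustrative assumptions.

```python
# Sketch of a 2-D decision-region plot via PCA projection and a mesh grid.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = rng.random((500, 14)), rng.integers(0, 3, 500)   # stand-in data

X2 = PCA(n_components=2).fit_transform(X)               # project to (PC1, PC2)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X2, y)

# Dense grid spanning the projected space; each point gets a predicted class.
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min() - 0.5, X2[:, 0].max() + 0.5, 300),
    np.linspace(X2[:, 1].min() - 0.5, X2[:, 1].max() + 0.5, 300),
)
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)                     # decision regions
plt.scatter(X2[:, 0], X2[:, 1], c=y, s=10, edgecolor="k")
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.show()
```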

Collectively, these decision boundary plots highlight the respective capabilities and shortcomings of each model structure. Models such as SVM offer smooth transitions but may struggle with overlapping class regions. Simple tree-based models are explainable but lack the flexibility to handle complex educational data. Ensemble methods such as Random Forest and CatBoost achieve the best performance because they can faithfully map combinations of features at different levels into learner categories. This translates into more accurate and useful student classification for targeted interventions and personalized support in language learning environments.

In conclusion, these results confirm the value of ensemble learning methods for multi-layered, high-dimensional educational data tasks. The gap between SVM and Decision Tree on one hand and Random Forest and CatBoost on the other highlights the significance of model choice in both research and application. By making good use of contextual, psycholinguistic, cognitive, and linguistic features, the advanced models provide useful information for learner profiling, educational assessment, and personalized intervention, and they prove the better choice for language learner evaluation frameworks.

Deep learning and transformer-based model

The progression from deep sequence models to transformer-based models and hybrids illustrates the growing sophistication and effectiveness of modern ELL-scoring approaches. Table 11 reports the performance of the baseline deep learning models: LSTM and BiLSTM achieve modest accuracies of 61% and 63%, respectively. Their relatively low precision, recall, and F1-scores illustrate the inadequacy of purely sequence-based approaches when the rich structure and diverse features of educational assessments are not fully exploited. The confusion matrices of these models in Fig. 27 show a strong tendency to confuse psycholinguistic/cognitive with linguistic characteristics, namely between class 1 and class 2 samples. This implies that a model with access to the raw input but no representation of how higher-level features interact cannot solve this fine-grained classification task. Performance is boosted substantially by the transformer model DeBERTa, whose accuracy reaches 83%, with significant improvement in most of the remaining metrics. DeBERTa captures richer contextual representations from the input text thanks to its self-attention mechanism, improving differentiation across intricate proficiency levels. Its confusion matrix further confirms this better discrimination, showing a significant increase in correct responses for all classes, although some confusion between the psycholinguistic/cognitive and linguistic categories remains. The trend continues when DeBERTa is joined with an LSTM layer (DeBERTa + LSTM), which achieves 86% accuracy with balanced precision and recall. This hybrid combines the context-encoding ability of the transformer with the sequential modeling of the LSTM, yielding a better-performing model for capturing temporal or ordered patterns within language.

Table 11.

Deep learning-based models’ results analysis (%).

Model Accuracy Precision Recall F1-Score
LSTM 61 57 61 55
BiLSTM 63 48 63 54
DeBERTa 83 81 83 89
DeBERTa + LSTM 86 84 83 86
DBML 93 90 93 92

Fig. 27.

Fig. 27

Confusion Matrix analysis of deep models.

Proposed model results

The proposed fusion model, DBML, is the culmination of these developments, achieving the best metrics observed: an accuracy of 93%, precision of 90%, recall of 93%, and F1-score of 92%, as detailed in Table 11. This model concatenates the dense semantic embeddings of DeBERTa (which capture nuanced text representations) with structured metadata features reflecting contextual, demographic, psycholinguistic, and cognitive factors. The enriched input is subsequently processed through an LSTM layer, so the network can take advantage of both the depth of the language representation and the breadth of the learner-specific information.

The outcome is a model that not only achieves superior performance in representing fine-grained distinctions between learner categories but also exhibits remarkable flexibility in accommodating complex, multi-attribute input. This is confirmed by the confusion matrix of DBML, in which correct predictions clearly dominate for all classes, notably for the psycholinguistic/cognitive and linguistic feature cases, with a sharp decrease in misclassification. The fusion mechanism thus captures the best of both the structured and the unstructured data, bringing together the learner profile and the understanding of the student's language.
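A schematic PyTorch sketch of this fusion idea is shown below. It follows the description in the text (DeBERTa backbone, 2-layer LSTM with 256 hidden units, 128→64 fusion MLP with BatchNorm/Dropout), but the checkpoint name, metadata dimensionality, and other details are assumptions rather than the exact released configuration.

```python
# Hedged sketch of the DBML fusion idea (DeBERTa + metadata + LSTM).
import torch
import torch.nn as nn
from transformers import AutoModel

class DBML(nn.Module):
    def __init__(self, meta_dim: int, n_classes: int = 3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("microsoft/deberta-v3-base")
        hidden = self.encoder.config.hidden_size
        # LSTM over the token-level DeBERTa states for temporal modeling.
        self.lstm = nn.LSTM(hidden, 256, num_layers=2, batch_first=True)
        # Late fusion of the sequence summary with structured metadata.
        self.fusion = nn.Sequential(
            nn.Linear(256 + meta_dim, 128), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Dropout(0.3), nn.Linear(128, 64), nn.ReLU(),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, input_ids, attention_mask, metadata):
        tokens = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        _, (h_n, _) = self.lstm(tokens)        # h_n: (layers, batch, 256)
        fused = torch.cat([h_n[-1], metadata], dim=1)
        return self.head(self.fusion(fused))
```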

In conclusion, the results emphasize the crucial importance of multimodal and hybrid models for language learner assessment. The proposed DBML model, which combines transformer-based text embeddings with the associated structured features, achieves the best accuracy and F1-score. Its ability to reduce confusion between neighboring classes makes it particularly attractive for educational uses where high interpretability and precision are necessary. This model establishes a new standard for data-informed, student-centered assessment with the potential to inform more responsive and differentiated education. Furthermore, the training and validation accuracy curves of the proposed DBML model in Fig. 28 demonstrate strong learning dynamics and good generalization over epochs. Splitting the training procedure into epochs 1–25, 26–50, 51–75, and 76–100 gives a detailed view of the development of both accuracy metrics. Accuracy fluctuates early on, as expected, while the model rapidly adjusts to the data, but both training and validation accuracy level off and trend upward after the middle of training. The colored marks in the accuracy curves denote the model's confidence; the values concentrate at higher levels as training proceeds, showing that the model learns steadily and reduces its prediction variance over time.

Fig. 28.

Fig. 28

Training and validation accuracy analysis of proposed model over set of epochs.

This trend is maintained in the later epochs, with only small deviations between the validation and training sets. The close alignment of the two curves suggests that the model is not overfitting but is learning underlying patterns that generalize to new data. Persistent improvement followed by convergence is indicative not only of the hybrid model's performance but also of its ability to fully tap the predictive potential of both text and structured metadata features.

The classic accuracy and loss curves presented in Fig. 29 also corroborate the stability and efficiency of the model. The left plot shows the training and validation accuracy increasing consistently, reaching almost perfect values around epoch 100. On the right, the corresponding loss curves descend toward zero for both training and validation, with a series of small bumps early in training, the usual behavior of initial exploration and parameter tuning. Beyond this phase the loss stabilizes and remains low until the end of training, indicating that the model minimizes error effectively without instability or vanishing/exploding gradients.

Fig. 29.

Fig. 29

Overall model loss and accuracy analysis.

Overall, these training and validation graphs demonstrate the stability and practicality of the proposed model. The close tracking of accuracy and loss, the absence of overfitting, and the similarity of training and validation performance together indicate that the model is not only well fitted to the data but also generalizes well. This underscores the model's applicability to real-world educational settings, where predictive accuracy and reliability are crucial for the assessment of English language learners.

The training curves, together with the GPU and memory utilization, indicate that the combined deep learning and transformer approach can be used in practice in an efficient, resource-aware way. The model not only achieves good accuracy on the training and validation sets over 100 epochs (considering the complexity of the architectures) but also has modest hardware requirements, as shown in Fig. 30. The GPU usage of 8.02 GB and memory consumption of 5.9% indicate a favorable trade-off between high performance and resource utilization. This matters especially in educational technology, where deployment environments are often resource-poor. To keep the hybrid design lightweight, we made strategic architectural choices: relying on the efficient attention mechanisms of DeBERTa, optimizing the LSTM layers for sequence modeling, and integrating the metadata space-efficiently so as to add as little overhead as possible.

Fig. 30.

Fig. 30

Memory consumption analysis during training.

The hybrid DBML model keeps DeBERTa-base as the main backbone and adds a lightweight 2-layer LSTM (256 hidden units) and a small fusion MLP (128→64, with BatchNorm/Dropout). Relative to DeBERTa-base the overhead is minimal: parameters increase by approximately +23 M (+2.33%) and the forward-pass cost at 256 tokens by approximately +134 GFLOPs (+4.95%). These numbers were obtained by counting trainable parameters (sum(p.numel() …)) and profiling FLOPs (ptflops/torchprofile) with the same sequence length, batch size, and precision as the baselines. Because the fusion is late, inference cost is not appreciably higher than with DeBERTa alone, yet the fusion yields the largest accuracy/AUC improvement; GPU memory consumption does not increase during training, and throughput is only slightly lower (≈ 5–8%). Relative to more capacity-heavy alternatives (larger transformers or jointly encoded multimodal models), DBML offers a better accuracy-to-compute trade-off: essentially the same speed as DeBERTa, with a small parameter/FLOP premium to unlock the full hybrid potential.
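The overhead accounting described above can be sketched as follows, assuming a DBML instance from the earlier sketch and an installed ptflops (which may not cover every custom transformer op exactly); the metadata width of 14 mirrors the feature count in Table 13 and is otherwise an assumption.

```python
# Sketch: count trainable parameters and profile forward FLOPs at a fixed
# sequence length via ptflops' input_constructor hook.
import torch
from ptflops import get_model_complexity_info

def count_trainable(m: torch.nn.Module) -> int:
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

def dbml_inputs(res):
    # Dummy batch: `res[0]` tokens plus a 14-wide metadata vector (assumed).
    seq_len = res[0]
    return {
        "input_ids": torch.ones(1, seq_len, dtype=torch.long),
        "attention_mask": torch.ones(1, seq_len, dtype=torch.long),
        "metadata": torch.zeros(1, 14),
    }

model = DBML(meta_dim=14)                    # DBML class from the sketch above
macs, params = get_model_complexity_info(
    model, (256,), input_constructor=dbml_inputs,
    as_strings=True, print_per_layer_stat=False,
)
print(f"trainable params: {count_trainable(model):,}")
print(f"MACs @ 256 tokens: {macs} (reported params: {params})")
```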

The result is a fast, efficient solution that is both scalable and deployable on modern GPUs and memory-limited environments enabling advanced English language learner assessment across a broader spectrum of institutions and educators without compromising predictive ability or interpretability.

The correct-versus-incorrect plot in Fig. 31 graphically represents the distribution of predicted and actual class assignments across all samples for the three ELL categories of the hybrid model. Correct predictions are visualized as green dots, whereas mismatched predictions are shown in red. The tight clustering and overall prevalence of green at each level show that the hybrid approach has high predictive strength and accurately maps complex, multidimensional input data to the correct learner categories.

Fig. 31.

Fig. 31

Actual (red) and predicted (green) analysis based on proposed model.

The relatively low concentration of red dots further indicates strong generalization and high accuracy. Notably, the stability of accurate predictions across the three class bands demonstrates the balance of the hybrid model's sensitivity to a mix of learner traits, validating its usefulness not only for linguistically oriented advanced features but also for contextual and cognitive types. This visualization strongly supports the argument that fusing deep textual embeddings with structured metadata produces a robust, holistic ELL assessment tool that captures and predicts learner diversity accurately and understandably.

The proposed hybrid model is modality-independent and could be tailored to vision by replacing the text encoder with a vision backbone (e.g., ViT/Swin or a CNN), optionally inserting an LSTM/ConvLSTM to model temporal sequences (multi-frame or video), and keeping the late-fusion MLP to inject non-image metadata (illumination level, sensor ISO, GPS, weather, time). This mirrors hybrid deep modeling in computer vision, e.g., transformer-based plant disease recognition incorporating learned features and auxiliary signals, variational nighttime dehazing with hybrid regularization49, and prior-query transformers for haze removal53, and suggests two immediate applications: (i) plant pathology, since growth-stage sequences and field metadata enhance discrimination58, and (ii) mitigation of adverse visibility, since capture conditions condition the model. Our fuzzy-rule layer could encode understandable priorities (e.g., contrast/color-cast thresholds) to adjust borderline predictions, and Grad-CAM/vision-SHAP could provide saliency-level explanations. Future work leaves a full vision benchmark open, noting that the training recipe, leakage controls, and fusion strategy would need to be adapted accordingly.

  • Ablation Study Analysis.

Using the DeBERTa text-only baseline (Acc 83, F1 89) as reference, adding the LSTM alone yields a small increase in accuracy (+3) but a small decrease in F1 (−3): sequence modeling raises the overall hit rate, while class balance or calibration worsens slightly. By contrast, adding metadata alone is more accurate (+4) with a smaller F1 drop (−2), indicating that learner/context features are informative but not sufficient to resolve hard boundary cases, as displayed in Table 12. The combination of both signals (DeBERTa + LSTM + Metadata, i.e., DBML) provides the maximum lift (93 Acc, 92 F1), with recall reaching 93%, indicating that temporal and demographic/behavioral context are complementary: the former captures dependencies in time sequences, while the latter reduces ambiguity in near-boundary samples.

Table 12.

Ablation study analysis based on Deberta variants.

Variant Accuracy Precision Recall F1-Score Δ Acc vs. DeBERTa ΔF1 vs. DeBERTa
DeBERTa 83 81 83 89
+ Metadata 87 86 87 87 + 4 −2
+ LSTM 86 84 83 86 + 3 −3
+ Metadata + LSTM + Fuzzy rules 91 89 92 91 + 9 + 2
+ Metadata + LSTM — Proposed (DBML) 93 90 93 92 + 10 + 3

Applying a fuzzy-rule post-filter to the fused model produces fractionally smaller aggregate metrics (Acc 91, F1 91) than DBML, in exchange for highly constrained, interpretable decision boundaries. In practice, this variant stabilizes edge cases at a small cost in headline accuracy. Overall, the ablation shows that each component contributes: metadata improves calibration, the LSTM captures temporal structure, and their fusion produces the best and most generalizable performance, justifying the proposed hybrid design.

The primary contributor to performance is DeBERTa: the text-only baseline achieves 83% accuracy/0.95 AUC. Adding the LSTM gives a smaller improvement (86%), learning sequence regularities and smoothing out token-level noise, whereas the metadata introduces complementary context (e.g., learner/background cues), as shown in the combined analysis in Fig. 32. The components are synergistic in combination (DBML at 93% accuracy/0.98 AUC): DeBERTa provides well-informed linguistic representations, the LSTM encodes temporal/dependency patterns at the sequence level, and the disentangled metadata resolves borderline cases that cannot be settled by the text alone. The ablation's complementary error patterns support this mixture: the transformer alone is improved only slightly by adding either the LSTM or the metadata, while adding both yields the maximal gain.

Fig. 32.

Fig. 32

Combined Comparison analysis of baseline and proposed model performance across all metrics.

  • Statistical test analysis

Furthermore, the statistical test results, considered alongside the performance of the proposed model, affirm the discriminative power of each feature category in the assessment of English language learners. For the linguistic features, all four statistical tests (t-test, z-test, ANOVA, and chi-square) reach extremely low p-values (rendered in the plots as high −log10(p) values), far below the standard significance levels. This is evidence that linguistic features such as cohesion, syntax, vocabulary, phraseology, grammar, and conventions vary significantly among learner categories.
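A hedged sketch of this per-feature testing is given below, using SciPy and statsmodels on synthetic class-wise scores; the distributions and bin edges are placeholders, not the study's data.

```python
# Sketch of per-feature significance testing: t-test, z-test, ANOVA, and a
# chi-square test on a binned feature-vs-class contingency table.
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(0)
syntax_c0 = rng.normal(2.5, 0.5, 200)   # hypothetical per-class scores
syntax_c1 = rng.normal(3.0, 0.5, 200)
syntax_c2 = rng.normal(3.5, 0.5, 200)

t_stat, t_p = stats.ttest_ind(syntax_c0, syntax_c2)
z_stat, z_p = ztest(syntax_c0, syntax_c2)
f_stat, a_p = stats.f_oneway(syntax_c0, syntax_c1, syntax_c2)

# Chi-square on the binned feature vs. class contingency table.
binned = np.digitize(np.concatenate([syntax_c0, syntax_c1, syntax_c2]),
                     [2.8, 3.2])
labels = np.repeat([0, 1, 2], 200)
table = np.zeros((3, 3))
for b, l in zip(binned, labels):
    table[b, l] += 1
chi2, c_p, dof, expected = stats.chi2_contingency(table)

for name, p in [("t-test", t_p), ("z-test", z_p),
                ("ANOVA", a_p), ("chi2", c_p)]:
    print(f"{name}: p={p:.3g}, -log10(p)={-np.log10(p):.1f}")
```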

These findings support the model's strong reliance on linguistic features to achieve accurate classification, as evidenced by its high precision, recall, and overall predictive accuracy. The model's capacity to separate learner categories on linguistic grounds rests on meaningful, statistically significant differences between classes, shown in Fig. 33. By contrast, most psycholinguistic and cognitive features (Fig. 34) carry considerably less statistical significance. Here the p-values for most features are close to or at the 0.05 level, suggesting that the groups do not separate strongly on attention span, working memory, motivation, or cognitive load. This is in line with the model's less impressive, though still high, classification scores for the psycholinguistic and cognitive categories. These features are included in the hybrid model to provide additional context, but their standalone contribution to class separation is reduced by the overlapping distributions among learner profiles. Nevertheless, the fusion architecture, which combines strong linguistic signals with more subtle yet complementary information from cognitive features, continues to return high performance, indicating that even features with weak univariate statistical significance can contribute to high overall accuracy when combined within a multimodal model. Contextual and demographic characteristics are the least statistically significant of all groups, exhibiting the highest p-values, as shown in Fig. 35. Attributes such as age, native language, study environment, and learning background cannot by themselves discriminate a student's level of language proficiency well. This observation suggests that models overemphasizing demographic factors may be suboptimal, an issue addressed in the model's design, where the most predictive weight is given to the most informative linguistic and cognitive features. The hybrid approach keeps less informative features from introducing noise or bias by relegating contextual data to an auxiliary role while still drawing on them where they are informative, thereby retaining the potential for high accuracy.

Fig. 33.

Fig. 33

Statistical test analysis based on p-values of Linguistic trait.

Fig. 34.

Fig. 34

Statistical test analysis based on p-values of cognitive trait.

Fig. 35.

Fig. 35

Statistical test analysis based on p-values of demographic.

Finally, the statistical results for each feature category strongly support the model's performance and findings. The linguistic characteristics emerge as the statistically strongest predictors, forming the basis for the model's impressive classification performance. Meanwhile, the addition of cognitive and contextual features, despite their lower standalone importance, increases the model's overall flexibility and generalizability by accounting for complex, real-world learner characterizations. This statistical validation not only justifies the model's effectiveness but also highlights the paramount importance of feature selection and fusion in educational data science.

  • XAI interpretation

The SHAP (SHapley Additive exPlanations) analysis provides in-depth, interpretable insight into the internal rationale of the designed hybrid model, making evident which features predominantly drive its predictions of English language learner labels. The linguistic characteristics prevail in all classes, with syntax, phraseology, and vocabulary showing the largest average contributions, while the contribution of the demographic/affective variables is minimal. This is consistent with the global importances reported in Table 13 (e.g., syntax overall 0.21, phraseology 0.19).

Table 13.

Class-wise stacked Shap bars values based features.

Feature Overall Class 0 Class 1 Class 2 Rank
Syntax 0.21 0.07 0.06 0.08 1
Phraseology 0.19 0.06 0.05 0.08 2
Vocabulary 0.17 0.05 0.05 0.07 3
Cohesion 0.15 0.04 0.05 0.06 4
Grammar 0.13 0.04 0.03 0.06 5
Conventions 0.12 0.04 0.03 0.05 6
age 0.03 0.01 0.01 0.01 7
native_language 0.02 0.007 0.006 0.007 8
working_memory_index 0.02 0.007 0.006 0.007 9
motivation_level 0.02 0.007 0.006 0.007 10
attention_span_score 0.02 0.007 0.006 0.007 11
prior_experience 0.02 0.007 0.006 0.007 12
learning_environment 0.01 0.003 0.003 0.004 13
cognitive_load 0.01 0.003 0.003 0.004 14

The SHAP bar plot in Fig. 36 shows that the linguistic properties (syntax, phraseology, vocabulary, cohesion, grammar, and conventions) have the highest average impact on the model output for all classes. This agrees with the statistical analysis and the earlier model diagnostics, once again confirming linguistic features as the most important and reliable cues for categorizing learners. The color segmentation of each bar shows that these linguistic features dominate consistently for all three classes, but particularly for Class 2 (linguistic features), where the SHAP values are relatively larger.

Fig. 36.

Fig. 36

Bar plot showing feature analysis means based on SHAP.

This strong discriminative behavior is key to reliable language proficiency prediction, ensuring that the model relies primarily on the linguistic content of the assessed language rather than on contextual or demographic factors.

The SHAP waterfall plot in Fig. 37 explains further how the feature values of an individual sample contribute to its final classification probability. Positive contributions from features such as high syntax, vocabulary, phraseology, and conventions increase the predicted probability of class membership, whereas small negative SHAP values for prior experience and native language slightly reduce the score. The net effect, dominated by the linguistic components, drives the model's prediction toward certainty for the correct class. This person-level explanation makes the model transparent and can be used by educators and researchers to trace how a learner's individual profile contributes to a given prediction. Such interpretability is critical for educational decision-making, allowing interventions and feedback to be tailored to a student's actual strengths and needs.

Fig. 37.

Fig. 37

Sample based contributions of attributes based on SHAP Interpretation.
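The SHAP workflow can be sketched as follows for a tree-based surrogate; DeepExplainer (DeepSHAP) plays the analogous role for the deep model. The data, model, and class index here are illustrative assumptions, not the study's artifacts.

```python
# Hedged sketch of global (bar/beeswarm) and per-sample (waterfall) SHAP views.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-ins; in the real pipeline these are the learner features.
cols = ["Syntax", "Phraseology", "Vocabulary", "Cohesion", "Grammar",
        "Conventions", "age", "native_language"]
X = pd.DataFrame(np.random.rand(300, len(cols)), columns=cols)
y = np.random.randint(0, 3, 300)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(rf)
sv = explainer.shap_values(X)
# Depending on the SHAP version, `sv` is a list (one array per class) or a
# 3-D array (samples, features, classes); slice out class 2 either way.
class2 = sv[2] if isinstance(sv, list) else sv[:, :, 2]

shap.summary_plot(class2, X)                  # beeswarm view (cf. Fig. 38)

# Per-sample waterfall explanation (cf. Fig. 37) for one learner, class 2.
single = shap.Explanation(
    values=class2[0],
    base_values=np.array(explainer.expected_value)[2],
    data=X.iloc[0].values,
    feature_names=cols,
)
shap.plots.waterfall(single)
```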

Overall, the SHAP results support both the design and the performance of our model. They verify that linguistic features not only have the strongest statistical separability but also carry the most actionable signal for high-confidence prediction. Concurrently, cognitive and demographic features, though present, exert only a mild effect, suggesting robustness against potentially biased or less informative traits. High syntax and vocabulary values sit on the positive side of the SHAP plot, showing that they exert upward pressure on the estimated proficiency, whereas cohesion- and grammar-related features are spread more widely around zero, which explains some borderline cases. These distributional patterns agree with Table 14 (e.g., syntax 78% positive, vocabulary 75% positive). This multilevel interpretability increases the applicability of the model for ELL assessment, as it is both accurate and explainable in real educational scenarios, providing sophisticated insight into how the deep hybrid model reasons about linguistic properties such as syntax, vocabulary, cohesion, and grammar.

Table 14.

Mean values distribution based on feature quartiles.

Feature % positive SHAP (> 0) Median SHAP IQR
Syntax 78% + 0.09 0.12
Vocabulary 75% + 0.08 0.11
Cohesion 64% + 0.05 0.10
Grammar 61% + 0.04 0.10

The SHAP summary in Fig. 38 shows that higher values of these linguistic features are consistently associated by the model with higher class predictions, as indicated by the large concentration of red (high feature value) points on the positive side of the SHAP value axis. Conversely, lower values (in blue) tend to reduce the prediction score, underscoring the importance of strong language traits for positive learner classification. The wide horizontal spread of SHAP values for all features suggests that these attributes are not only consistently influential but can also exert substantial positive or negative impact depending on the individual profile. For example, high vocabulary and syntax scores provide strong evidence of fluency and contribute heavily to the model decision, whereas low scores in this domain reduce the probability of predicting a particular class.

Fig. 38.

Fig. 38

Higher values of linguistic features among all traits.

The DeepSHAP dependence plots in Fig. 39 deepen this understanding by relating raw input values to model outputs. Each line in the graph shows the effect of moving a linguistic attribute from low to high on the predicted outcome. The color gradient (red for high feature values, blue for low) and the convergence of lines toward higher model outputs at higher feature scores indicate that the model is well calibrated: improvements in syntax, vocabulary, cohesion, and grammar translate into genuine improvements in the language-proficiency classification. This more granular view shows not only that the linguistic strengths of ELLs are important features, but that they are systematically and interpretably associated with improved predicted outcomes via the hybrid model. The predicted proficiency is monotonic in each feature from low to high score, indicative of the construct validity of the rubric-aligned attributes. The trend is replicated in Table 15 (e.g., syntax: 0.62 → 0.74 → 0.88; vocabulary: 0.60 → 0.72 → 0.86).

Fig. 39.

Fig. 39

DeepSHAP analysis of linguistic features from High(top) to Low(bottom) score.

Table 15.

Values based analysis using deepshap calibration - partial dependence.

Feature Low (p25) → Mid (p50) → High (p75) predicted score
Syntax 0.62 → 0.74 → 0.88
Vocabulary 0.60 → 0.72 → 0.86
Cohesion 0.58 → 0.69 → 0.82
Grammar 0.57 → 0.68 → 0.80

Taken together, these DeepSHAP visualizations provide strong XAI validation of the model's decision making. They show that the deep model is neither a black box nor at the mercy of data sparsity or other artifacts; it organizes its decisions in a way that is congruent with educational theory, giving weight to fundamental linguistic abilities in its classifications.

This transparency increases user trust in the practical applicability of the model, which not only delivers high performance but also endows each prediction with traceable, meaningful explanations grounded in genuine language-trait evidence.

The agreement between SHAP and DeepSHAP is quite high (99.3%). The linguistic feature group takes first position in both (syntax ranks #1 in each), with only slight shifts beneath it: vocabulary rises in DeepSHAP (+5.9%) and phraseology drops by one rank. Cohesion, grammar, and conventions remain at similar magnitudes (deltas of 8% and below), and the demographic/affective features remain near zero under both approaches. Taken together, the small deltas and stable ranks suggest that the model's conclusions are stable across attribution methods, further supporting the finding that predictions are dominated by core linguistic features rather than metadata (Table 16).

Table 16.

Comparison analysis of Shap and deepshap outcomes.

Feature SHAP DeepSHAP Δ (DeepSHAP − SHAP) Rank (SHAP) Rank (DeepSHAP)
Syntax 0.21 0.20 −0.01 −4.8% 1 1
Phraseology 0.19 0.18 −0.01 −5.3% 2 3
Vocabulary 0.17 0.18 + 0.01 + 5.9% 3 2
Cohesion 0.15 0.14 −0.01 −6.7% 4 4
Grammar 0.13 0.12 −0.01 −7.7% 5 5
Conventions 0.12 0.11 −0.01 −8.3% 6 6
age 0.03 0.03 + 0.00 0.0% 7 7
native_language 0.02 0.025 + 0.005 + 25.0% 8 8
working_memory_index 0.02 0.021 + 0.001 + 5.0% 9 9
motivation_level 0.02 0.019 −0.001 −5.0% 10 10

In conclusion, the combined results of this study corroborate the synergistic power of sophisticated deep learning networks, transformer-based modeling, and interpretable fuzzy systems. Through the in-depth analysis of feature importances, trait rankings, and rule activations, the fuzzy approach has been shown to excel at handling relative, gradational relationships between contextual, cognitive, and linguistic traits, with the associated fuzzy conditions mirroring real-world complexities. The intuitive visualizations of membership and rule-activation distributions clearly illustrate that fuzzy logic can accurately model the strength and flexibility of trait bundles, which is essential when assessing learner differences in sufficient depth. Simultaneously, the success of the hybrid model is evidenced by the alignment between actual and predicted classifications. The visual correlation between actual and predicted class assignments in the three primary ELL categories, particularly in Fig. 31, serves as compelling evidence of the hybrid's ability to map complex, multidimensional input patterns to reliable classifications. The near-perfect alignment and dense distribution of correctly predicted points, combined with the sparse distribution of red points, validate its robustness and adaptability. Additionally, the hybrid's consistency across all three category tiers demonstrates its adaptability and sensitivity to varied learner characteristics, attaining high performance not only on linguistic but also on psycholinguistic and contextual identification.

Comparison with existing studies

A close comparison of previous studies with the proposed approach demonstrates that the integration of fuzzy logic and deep learning methodologies makes a sound contribution to the field of language learning evaluation, as displayed in Table 17. Several previous models (e.g., MLP neural networks with SHAP explainability, Bi-LSTM with NLP features, and hybrid models combining RoBERTa embeddings with XGBoost) have achieved significant advances in AES and performance prediction, with scores between 68 and 87 on different benchmark datasets. Nonetheless, such studies mostly concentrate on one side of the modeling spectrum, i.e., deep learning or shallow feature selection, and often deal with textual or numerical features alone, which limits their ability to represent the comprehensive character of learner traits.

Table 17.

Comparison of proposed model technique with existing studies (acc in %).

Ref Year Model (Approach) Dataset Results
17 2020 MLP neural network + SHAP (Explainable AES) Grade 7 essays (ASAP dataset) 78
18 2021 Bi-LSTM + NLP features (Rubric scoring model) ASAP essays (8 prompts, rubric scores) 68
30 2024 Random Forest with interpretable FS Open Univ. Learning Analytics (OULAD) 87
20 2024 RoBERTa embeddings + XGBoost (Hybrid AES) TOEFL essays (public set) 77
13 2025 Fusion: Feature Selection + DNN Performance dataset (Numeric) 87
Proposed Fusion: DBML English Language Assessment (Text + numeric) 93

The proposed model is unique in that it combines DeBERTa with a fusion architecture (DBML) that jointly exploits attention-based embeddings for rich text representation and structured metadata about cognitive and demographic learning behaviors. This hybrid model, combined with feature ranking and interpretable rule mining via fuzzy logic, not only increases predictive accuracy, achieving an outstanding 93% on the English Language Assessment dataset, but also improves model transparency and interpretability through its support for XAI. Such deep integration is rare in traditional, single-path models, so the proposed approach constitutes a strong, general-purpose solution to the complex problem of ELL assessment.

Conclusion and future work

This study highlights the transformative impact of AI and NLP on ELL assessment, showcasing how contemporary machine learning models far outperform traditional evaluation approaches. Our findings demonstrate that moving from static, threshold-based classification to advanced hybrid models such as the proposed DBML fusion makes learner trait classification for language learning not only more accurate but also more interpretable and actionable. The proposed model achieved the highest classification accuracy of 93%, decisively surpassing conventional baselines and illustrating the critical advantage of integrating deep language representations with structured demographic and cognitive metadata. SHAP analysis, alongside rigorous statistical tests, further validated the reliability and transparency of the model, identifying linguistic features such as syntax, vocabulary, and cohesion as primary drivers of learner differentiation. Feature ranking not only contributed to predictive performance but also underpinned the construction of robust and meaningful fuzzy rules, supporting nuanced decision-making. These advances pave the way for more personalized, data-driven educational interventions. Future work will focus on expanding the framework to multilingual and cross-cultural settings, incorporating multimodal learner data, and further refining explainability for greater trust and adaptability in AI-powered language assessment. Looking ahead, this would involve transfer experiments on out-of-domain ELL datasets (e.g., TOEFL11, ICNALE, ASAP), measuring transferability with leave-site-out evaluation and few-shot adaptation. Distribution shift can be accommodated by exploring parameter-efficient fine-tuning (e.g., adapters/LoRA) and domain-adversarial training, with robustness evaluated against noisy text, missing metadata, and prompt/style variation. The quality of generalization should be described using calibration and uncertainty metrics along with fairness measures across demographic subgroups. For real-world deployment, it will be beneficial to evaluate inference latency and memory on low-end devices, privacy-preserving training (federated or differentially private), and data-drift monitoring protocols.

Author contributions

The author contributed fully to all aspects of this study.

Data availability

The dataset is freely available at https://github.com/VisionLangAI/English-Language-Learners-Evaluation.

Declarations

Competing interests

The author declares no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Durón, N. G. & Jiménez-Preciado, A. L. Exploring the role of AI in higher education: a natural language processing analysis of emerging trends and discourses. TQM J. (ahead-of-print) (2025). 10.1108/TQM-10-2024-0376
  • 2. Ishfaq, U., Khan, H. U. & Shabbir, D. Exploring the role of sentiment analysis with network and temporal features for finding influential users in social media platforms. Soc. Netw. Anal. Min. 14(1), 241 (2025). 10.1007/s13278-024-01396-6
  • 3. Carius, A. C. & Teixeira, A. J. Artificial intelligence and content analysis: the large language models (LLMs) and the automatized categorization. AI Soc. (2024). 10.1007/s00146-024-01988-y
  • 4. Jiang, L., Lv, M., Cheng, M., Chen, X. & Peng, C. Factors affecting deep learning of EFL students in higher vocational colleges under small private online courses-based settings: a grounded theory approach. J. Comput. Assist. Learn. 40(6), 3098–3110 (2024). 10.1111/jcal.13060
  • 5. Li, D., Ortegas, K. D. & White, M. Exploring the computational effects of advanced deep neural networks on logical and activity learning for enhanced thinking skills. Systems 11(7), 319 (2023). 10.3390/systems11070319
  • 6. Wang, D., Su, J. & Yu, H. Feature extraction and analysis of natural language processing for deep learning English language. IEEE Access 8, 46335–46345 (2020). 10.1109/ACCESS.2020.2974101
  • 7. Shao, Z., Zhao, R., Yuan, S., Ding, M. & Wang, Y. Tracing the evolution of AI in the past decade and forecasting the emerging trends. Expert Syst. Appl. 209, 118221 (2022). 10.1016/j.eswa.2022.118221
  • 8. Zeng, Y. et al. GCCNet: a novel network leveraging gated cross-correlation for multi-view classification. IEEE Trans. Multimedia 27, 1086–1099 (2025). 10.1109/TMM.2024.3521733
  • 9. Bonami, B., Piazentini, L. & Dala-Possa, A. Education, big data and artificial intelligence: mixed methods in digital platforms. Comunicar 28(65), 43–52 (2020). 10.3916/C65-2020-04
  • 10. Zhao, H. et al. Cross-lingual font style transfer with full-domain convolutional attention. Pattern Recognit. 155, 110709 (2024). 10.1016/j.patcog.2024.110709
  • 11. Zhang, L. A new machine learning framework for effective evaluation of English education. Int. J. Emerg. Technol. Learn. 16(12), 142–154 (2021). 10.3991/ijet.v16i12.23323
  • 12. Ding, Z. Design and evaluation of an English speaking recommender system using word networks and context-aware techniques. Entertain. Comput. 52, 100920 (2025). 10.1016/j.entcom.2024.100920
  • 13. Barlybayev, A. et al. Comparative analysis of grading models using fuzzy logic to enhance fairness and consistency in student performance evaluation. Cogent Educ. 12(1), 2481008 (2025). 10.1080/2331186X.2025.2481008
  • 14. Zhang, R. et al. MvMRL: a multi-view molecular representation learning method for molecular property prediction. Brief. Bioinform. 25(4) (2024). 10.1093/bib/bbae298
  • 15. Atif, Y., Al-Falahi, K., Wangchuk, T. & Lindström, B. A fuzzy logic approach to influence maximization in social networks. J. Ambient Intell. Humaniz. Comput. 11(6), 2435–2451 (2020). 10.1007/s12652-019-01286-2
  • 16. Huang, C. Q. et al. XKT: toward explainable knowledge tracing model with cognitive learning theories for questions of multiple knowledge concepts. IEEE Trans. Knowl. Data Eng. 36(11), 7308–7325 (2024). 10.1109/TKDE.2024.3418098
  • 17. Kumar, V. & Boulanger, D. Explainable automated essay scoring: deep learning really has pedagogical value. Front. Educ. 5 (2020). 10.3389/feduc.2020.572367
  • 18. Kumar, V. S. & Boulanger, D. Automated essay scoring and the deep learning black box: how are rubric scores determined? Int. J. Artif. Intell. Educ. 31(3), 538–584 (2021). 10.1007/s40593-020-00211-5
  • 19. Misgna, H., On, B. W., Lee, I. & Choi, G. S. A survey on deep learning-based automated essay scoring and feedback generation. Artif. Intell. Rev. 58(2), 36 (2024). 10.1007/s10462-024-11017-5
  • 20. Faseeh, M. et al. Hybrid approach to automated essay scoring: integrating deep learning embeddings with handcrafted linguistic features for improved accuracy. Mathematics 12(21) (2024). 10.3390/math12213416
  • 21. Chen, S. et al. Enhancing Chinese comprehension and reasoning for large language models: an efficient LoRA fine-tuning and tree of thoughts framework. J. Supercomput. 81(1), 50 (2024). 10.1007/s11227-024-06499-7
  • 22. Liu, Y. & Li, R. Deep learning scoring model in the evaluation of oral English teaching. Comput. Intell. Neurosci. 2022, 6931796 (2022). 10.1155/2022/6931796
  • 23. Lang, A. Evaluation algorithm of English audiovisual teaching effect based on deep learning. Math. Probl. Eng. 2022, 7687008 (2022). 10.1155/2022/7687008
  • 24. Liu, Y., Cao, S. & Chen, G. Research on the long-term mechanism of using public service platforms in national smart education—based on the double reduction policy. Sage Open 14(1) (2024). 10.1177/21582440241239471
  • 25. Zhang, L. Assessing English language teachers' pedagogical effectiveness using convolutional neural networks optimized by modified virus colony search algorithm. Sci. Rep. 15(1), 15295 (2025). 10.1038/s41598-025-98033-9
  • 26. Lagares Rodríguez, J. A., Díaz-Díaz, N. & Barranco González, C. D. A comparative analysis of student performance prediction: evaluating optimized deep learning ensembles against semi-supervised feature selection-based models. Appl. Sci. 15(9) (2025). 10.3390/app15094818
  • 27. Liu, S. et al. Dual-view cross attention enhanced semi-supervised learning method for discourse cognitive engagement classification in online course discussions. Expert Syst. Appl. 278, 127339 (2025). 10.1016/j.eswa.2025.127339
  • 28. Al-Zawqari, A., Peumans, D. & Vandersteen, G. A flexible feature selection approach for predicting students' academic performance in online courses. Comput. Educ. Artif. Intell. 3, 100103 (2022). 10.1016/j.caeai.2022.100103
  • 29. Ding, J. et al. DialogueINAB: an interaction neural network based on attitudes and behaviors of interlocutors for dialogue emotion recognition. J. Supercomput. 79(18), 20481–20514 (2023). 10.1007/s11227-023-05439-1
  • 30. Kaur, H., Kaur, T., Bhardwaj, V. & Kumar, M. An ensemble deep learning model for classification of students as weak and strong learners via multiparametric analysis. Discover Appl. Sci. 6(11), 595 (2024). 10.1007/s42452-024-06274-6
  • 31. Abdasalam, M., Alzubi, A. & Iyiola, K. Student grade prediction for effective learning approaches using the optimized ensemble deep neural network. Educ. Inf. Technol. (2024). 10.1007/s10639-024-13224-7
  • 32. Alnasyan, B., Basheri, M. & Alassafi, M. O. The power of deep learning techniques for predicting student performance in virtual learning environments: a systematic literature review. Comput. Educ. Artif. Intell. 6, 100231 (2024). 10.1016/j.caeai.2024.100231
  • 33. Alnasyan, B., Basheri, M., Alassafi, M. & Alnasyan, K. Kanformer: an attention-enhanced deep learning model for predicting student performance in virtual learning environments. Soc. Netw. Anal. Min. 15(1), 25 (2025). 10.1007/s13278-025-01446-7
  • 34. Sapuguh, I. et al. Development of fuzzy logic based student performance prediction system.
  • 35. Peng, N. Research on the effectiveness of English online learning based on neural network. Neural Comput. Appl. 34(4), 2543–2554 (2022). 10.1007/s00521-021-05855-5
  • 36. Wu, H. & Luo, X. Evaluating English teaching quality in colleges using fuzzy logic and online game-based learning. Comput. Aided Des. Appl. 21(s5), 237–251 (2024). 10.14733/cadaps.2024.S5.237-251
  • 37. Ding, F. Supporting adaptive English learning with fuzzy logic-based personalized learning. Int. J. Gaming Comput. Mediat. Simul. 14(2) (2022). 10.4018/IJGCMS.314588
  • 38. Algshat, N. Evaluating students' academic progress in the role of fuzzy logic. AlQalam J. Med. Appl. Sci. 984–989 (2024). 10.54361/ajmas.247412
  • 39. Chrysafiadi, K., Virvou, M. & Tsihrintzis, G. A. A fuzzy-based evaluation of e-learning acceptance and effectiveness by computer science students in Greece in the period of COVID-19. Electronics 12(2), 428 (2023). 10.3390/electronics12020428
  • 40. Chang, S. Y., Chen, C. T., Wang, L. H. & Chen, W. Text-dependent English pronunciation learning system based on GMM. In Third International Conference on Electrical, Electronics, and Information Engineering (EEIE 2024) (ed. Malik, H.) 66 (SPIE, 2025). 10.1117/12.3057529
  • 41. Li, S., Wang, C. & Wang, Y. Fuzzy evaluation model for physical education teaching methods in colleges and universities using artificial intelligence. Sci. Rep. 14(1), 4788 (2024). 10.1038/s41598-024-53177-y
  • 42. Chehbi, S. & Fri, C. Learners' activity indicators prediction in e-learning using fuzzy logic. Int. J. Adv. Comput. Sci. Appl. 11(12) (2020). 10.14569/ijacsa.2020.0111257
  • 43. Naz, A. et al. AI knows you: deep learning model for prediction of extroversion personality trait. IEEE Access (2024). 10.1109/ACCESS.2024.3486578
  • 44. Faisal, C. M. S., Daud, A., Imran, F. & Rho, S. A novel framework for social web forums' thread ranking based on semantics and post quality features. J. Supercomput. 72(11), 4276–4295 (2016). 10.1007/s11227-016-1839-z
  • 45. Urooj, A., Khan, H. U., Iqbal, S. & Althebyan, Q. On prediction of research excellence using data mining and deep learning techniques. In 8th International Conference on Social Network Analysis, Management and Security (SNAMS 2021) (IEEE, 2021). 10.1109/SNAMS53716.2021.9732153
  • 46. Mageira, K. et al. Educational AI chatbots for content and language integrated learning. Appl. Sci. 12(7), 3239 (2022). 10.3390/app12073239
  • 47. Hong Yun, Z. et al. A decision-support system for assessing the function of machine learning and artificial intelligence in music education for network games. Soft Comput. 26(20), 11063–11075 (2022). 10.1007/s00500-022-07401-4
  • 48. García-Orozco, D. et al. An overview of the most influential journals in fuzzy systems research. Expert Syst. Appl. 200, 117090 (2022). 10.1016/j.eswa.2022.117090
  • 49. Liu, Y. et al. VNDHR: variational single nighttime image dehazing for enhancing visibility in intelligent transportation systems via hybrid regularization. IEEE Trans. Intell. Transp. Syst. (2025). 10.1109/TITS.2025.3550267
  • 50. Khan, H. U. & Daud, A. Finding the top influential bloggers based on productivity and popularity features. New Rev. Hypermedia Multimedia 23(3), 189–206 (2017). 10.1080/13614568.2016.1236151
  • 51. Mthethwa-Kunene, K., Rugube, T. & Maphosa, C. Rethinking pedagogy: interrogating ways of promoting deeper learning in higher education. Eur. J. Interact. Multimedia Educ. 3(1), e02204 (2021). 10.30935/ejimed/11439
  • 52. Alsini, R. et al. Using deep learning and word embeddings for predicting human agreeableness behavior. Sci. Rep. 14(1) (2024). 10.1038/s41598-024-81506-8
  • 53. Liu, Y. et al. NightHazeFormer: single nighttime haze removal using prior query transformer. In Proceedings of the 31st ACM International Conference on Multimedia (MM 2023) 4119–4128 (2023). 10.1145/3581783.3611744
  • 54. Al-Smadi, B. S. DeBERTa-BiLSTM: a multi-label classification model of Arabic medical questions using pre-trained models and deep learning. Comput. Biol. Med. 170, 107921 (2024). 10.1016/j.compbiomed.2024.107921
  • 55. Hassan, M. M., Khan, M. A. R., Islam, K. K., Hassan, M. M. & Rabbi, M. M. F. Depression detection system with statistical analysis and data mining approaches. In International Conference on Science & Contemporary Technologies (ICSCT) 1–6 (IEEE, 2021). 10.1109/ICSCT53883.2021.9642550
  • 56. Belle, V. & Papantonis, I. Principles and practice of explainable machine learning. Front. Big Data 4, 688969 (2021). 10.3389/fdata.2021.688969
  • 57. Naz, A. et al. Machine and deep learning for personality traits detection: a comprehensive survey and open research challenges. Artif. Intell. Rev. 58(8), 239 (2025). 10.1007/s10462-025-11245-3
  • 58. Chug, A., Bhatia, A., Singh, A. P. & Singh, D. A novel framework for image-based plant disease detection using hybrid deep learning approach. Soft Comput. 27(18), 13613–13638 (2023). 10.1007/s00500-022-07177-7
