Scientific Reports. 2025 Aug 11;15:29420. doi: 10.1038/s41598-025-15388-9

Evaluating ensemble models for fair and interpretable prediction in higher education using multimodal data

Felipe Emiliano Arévalo-Cordovilla 1,2,, Marta Peña 2
PMCID: PMC12339690  PMID: 40789907

Abstract

Early prediction of academic performance is vital for reducing attrition in online higher education. However, existing models often lack comprehensive data integration and comparison with state-of-the-art techniques. This study, which involved 2,225 engineering students at a public university in Ecuador, addressed these gaps. The objective was to develop a robust predictive framework by integrating Moodle interactions, academic history, and demographic data, using SMOTE for class balancing. The methodology involved a comparative evaluation of seven base learners, including traditional algorithms, Random Forest, and gradient boosting ensembles (XGBoost, LightGBM), and a final stacking model, all validated using 5-fold stratified cross-validation. While the LightGBM model emerged as the best-performing base model (Area Under the Curve (AUC) = 0.953, F1 = 0.950), the stacking ensemble (AUC = 0.835) did not offer a significant performance improvement and showed considerable instability. SHAP analysis confirmed that early grades were the most influential predictors across top models. The final model demonstrated strong fairness across gender, ethnicity, and socioeconomic status (consistency = 0.907). These findings enable institutions to identify at-risk students using state-of-the-art, interpretable, and fair models, advancing learning analytics by validating key success predictors against contemporary benchmarks.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-15388-9.

Keywords: Academic performance, Early prediction, Ensemble model, Gradient boosting, Learning analytics, Stacking

Subject terms: Engineering, Mathematics and computing

Introduction

Early prediction of student performance has become a strategic priority in higher education, especially in fully online Information Technology (IT) programs, where attrition threatens both institutional sustainability and educational equity1–3. Learning Management Systems (LMS), such as Moodle, store detailed logs of access, submissions, and time-on-task that, when combined with early grades and demographic data, enable the development of proactive alert systems. These records are essential for understanding students’ behavior and learning patterns in virtual environments, providing valuable insights that can be used to predict and enhance academic outcomes.

Over the last decade, predictive models, including decision trees, random forests, neural networks, and Support Vector Machines (SVMs), have achieved accuracies of 70–75% using only basic student information1–3. When focusing solely on Moodle click-log data, these predictions can explain up to 78% of the variance in final grades4, and careful feature engineering has increased the AUC to 0.990 (ref. 5). These approaches highlight the potential of digital platforms as key data sources for predicting student performance.

Furthermore, ensemble methods consistently outperform traditional algorithms. For instance, techniques like XGBoost have achieved remarkable accuracies of 97.2%6, while Random Forest has shown strong robustness, reaching 97% accuracy when combined with balancing techniques such as SMOTE7. Although base models such as SVMs can deliver competitive performance, they generally fall short of the high precision offered by ensemble methods8.

Within the ensemble family, gradient boosting-based models such as XGBoost and LightGBM often demonstrate a predictive advantage over bagging-based models such as Random Forest, with evidence suggesting that gradient-boosted trees outperform random forests in various educational contexts9,10. Stacking models are often implemented to leverage the unique strengths of these different model families. This technique enhances predictive accuracy using a two-layer structure, where base models (e.g., instance-based, bagging, and boosting) generate predictions that serve as inputs for a meta-model, which then produces the final outcome11,12.

Simultaneously, class balancing techniques, such as SMOTE, have become critical tools for addressing class imbalances and mitigating biases against minority groups13–15. The effectiveness of SMOTE has been demonstrated in virtual learning environments by improving the predictions of student engagement and performance through the creation of balanced datasets16. Moreover, approaches such as Equi-Fused-Data-based SMOTE have shown high accuracy and fairness in predicting academic outcomes17, and techniques such as SMOTE and the Adaptive Synthetic Sampling Approach (ADASYN) have optimized the handling of data imbalance, particularly benefiting at-risk student groups by increasing recall, precision, and F1 scores for minority classes18. However, it is important to note that SMOTE can introduce noise, which has led to the development of more advanced variants and highlights the need for careful application14,15. While beneficial, these techniques should be part of a broader strategy that ensures fairness and equity in educational predictive modeling19.

Despite these advancements, significant gaps remain in the literature. Most studies have focused on only a subset of possible data sources and methods. For instance, studies centered on Moodle indicate that attendance, submissions, and resource reviews are strong predictors20 and that early visualizations foster intervention21; however, they rarely incorporate partial grades or demographic data. Conversely, research grounded in academic history and demographic information reports substantial gains22,23 but tends to overlook online activity. Similarly, while analyses of the fairness impact of SMOTE highlight its benefits, they generally remain isolated from multilevel evaluations17–19.

To verify the existence of studies that simultaneously integrate (i) Moodle interactions, (ii) first partial grades, (iii) demographic features, (iv) stacking models, and (v) SMOTE, specific searches were conducted in Scopus and the Web of Science (WoS) for the period 2019–2024. The search string used in Scopus is as follows:

  • TITLE-ABS-KEY (“Moodle”) AND TITLE-ABS-KEY (“Higher Education” OR “university”) AND TITLE-ABS-KEY (“stacking” OR “ensemble learning”) AND TITLE-ABS-KEY (“SMOTE” OR “synthetic minority oversampling”) AND PUBYEAR > 2018 AND PUBYEAR < 2025.

The following query was applied in WoS:

  • TS =((“Moodle”) AND (“Higher Education” OR “university”) AND (“stacking” OR “ensemble learning”) AND (“SMOTE” OR “synthetic minority oversampling”)) AND PY=(2019–2024).

Both searches yielded zero results, confirming the absence of research combining all five dimensions proposed in this study. The closest related work is that of Jain et al.24, who applied SMOTE and a weighted ensemble to early grades and demographic data but lacked LMS data, reinforcing the originality and innovative character of this study. It is important to note that this directed search was not a formal systematic literature review but rather a targeted exploratory analysis conducted by the author to substantiate the need for this study.

Moreover, although stacking is often assumed to outperform individual models, several studies (~ 22% of those reviewed) have reported that it may not surpass SVMs or Random Forests when these models are finely tuned or when data noise is low25,26. This evidence underscores the need for additional empirical research to clarify the conditions under which the added complexity of stacking models yields significant advantages.

The Latin American context presents specific challenges to equity. While techniques such as SMOTE and its variants (e.g., ADASYN) have increased the recall rate of at-risk students and reduced gaps by gender and socioeconomic status15,17,18, they may also inject noise and increase variance, necessitating careful analysis19. This is particularly relevant in contexts where data representativeness is often low and structural inequalities persist.

This study aimed to overcome these limitations by integrating Moodle interactions, partial grades, demographic data, SMOTE balancing techniques, and stacking ensemble models to predict student performance. This framework was evaluated using a real cohort of 2,225 students enrolled in fully online IT programs at the Universidad Estatal de Milagro (2023). This study seeks to determine whether stacking outperforms (or fails to outperform) well-tuned individual models, such as SVM and Random Forest, when a large multimodal feature set is available. It also analyzes fairness before and after balancing, reports disaggregated metrics by gender, semester, and socioeconomic stratum, and provides interpretability through SHapley Additive exPlanations (SHAP). The integration of SHAP with high-accuracy boosting models is particularly valuable, as it enhances their interpretability—an essential feature in educational settings where understanding the factors behind a prediction is as important as the prediction itself27,28. This approach yields a clearer understanding of the dynamics among these varied predictors and contributes to a more equitable and personalized early prediction framework.

The first step is to identify at-risk students. The subsequent critical phase involves translating these predictions into effective personalized interventions that address each learner’s unique needs. By combining data-driven insights with evidence-based pedagogical strategies, institutions can proactively respond to academic challenges. This study seeks to contribute to these efforts by delivering a robust and nuanced predictive model, thereby promoting student retention, achievement, and personalized institutional support.

Methods

Study design

This study employed a quantitative, correlational, and predictive research design to develop and evaluate an early academic performance prediction model. The participants were students enrolled in the Information Technology Engineering program in an online modality at the Universidad Estatal de Milagro in Ecuador. The primary objective of this study was to implement and compare multiple machine learning algorithms, culminating in a stacking ensemble model to improve predictive accuracy. Data were collected during the first partial assessment of each academic period in 2023, enabling the early identification of at-risk students. This study used a cross-validation approach to ensure the generalizability and robustness of our findings.

Data collection, feature selection, and preparation

Data source and initial processing

The data were obtained through an Extract, Transform, and Load (ETL) process from two primary institutional sources: records of interactions from the Moodle Virtual Learning Environment (VLE) and students’ academic records. Moodle interaction records were processed (pivoted) and cleaned to create student-level summaries. Academic grades from the first partial exam were standardized to appropriate numeric formats. A consolidated dataset was created by merging these sources using unique student identifiers. The initial dataset comprised 15,155 student-course records. Crucially, the data for the key academic predictors (exam1, note1, and note2) correspond to the first partial assessment, which concludes at the end of the eighth week of a 16-week academic term. To ensure data timeliness, the ETL process is executed once the first partial assessment period concludes, in alignment with the academic calendar. This architecture makes the necessary data available to the predictive model within 24 h of the final grades for all assessment categories being officially recorded, enabling the prompt identification of at-risk students at this critical mid-term juncture.
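For illustration, a minimal sketch of this consolidation step is shown below. The file names and event columns are hypothetical placeholders rather than the institution's actual ETL schema; only the pivot-and-merge logic reflects the procedure described above.

```python
import pandas as pd

# Hypothetical exports: Moodle event log and first-partial grade records.
moodle = pd.read_csv("moodle_events.csv")         # assumed columns: student_id, course, event_name
grades = pd.read_csv("first_partial_grades.csv")  # assumed columns: student_id, course, exam1, note1, note2

# Pivot raw events into per-student-course activity counts
# (e.g., course_accesses, quizzes_completed).
activity = (moodle
            .groupby(["student_id", "course"])["event_name"]
            .value_counts()
            .unstack(fill_value=0)
            .reset_index())

# Standardize the first-partial grade columns to numeric format.
for col in ["exam1", "note1", "note2"]:
    grades[col] = pd.to_numeric(grades[col], errors="coerce")

# Consolidated student-course dataset used downstream.
dataset = activity.merge(grades, on=["student_id", "course"], how="inner")
```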

Feature selection and justification

A crucial step in developing the model was selecting relevant predictive features. Based on a comprehensive review of the literature and criteria such as data availability, quality, and ethical considerations, 22 features were chosen. This selection aligns with the established research on the predictors of academic success in higher education.

The selected features were classified into three main categories:

  • Academic Performance Indicators:

Variables such as previous grades (note1, note2) and first partial exam scores (exam1) are widely recognized as strong predictors of subsequent academic performance, as they directly reflect the knowledge and skills acquired29,30.

  • Virtual Learning Environment (VLE) Interaction Metrics:

Participation in the Moodle platform, including reviewed assignments (assignments_reviewed), course accesses (course_accesses), reviewed grades (grades_reviewed), completed quizzes (quizzes_completed), reviewed quizzes (quizzes_reviewed), reviewed resources (resources_reviewed), and updated assignments submitted (updated_assignments_submitted), served as indirect indicators of student engagement, time management, and study habits. Numerous studies have demonstrated a significant correlation between VLE activity and academic outcomes3133.

  • Sociodemographic and Contextual Factors:

Variables such as gender, current age (current_age_now), ethnicity, nationality, country of residence, province of residence, city of residence, socioeconomic status, disability, semester, group (parallel), and course were included in the analysis. These factors are often considered in educational research to understand potential disparities in performance and develop more equitable intervention strategies3436.

This multifaceted feature selection approach aims to capture the complexity of educational phenomena and build an interpretable and actionable model to improve student learning and retention37,38. A complete list and description of the 22 features are provided in Supplementary Table S1.

Data preprocessing

The selected raw data underwent several preprocessing steps.

  • Target Variable Transformation: The ‘approved_status’ variable, indicating a student’s success in a course, was transformed into a binary target variable status_bin (1: Approved, 0: Not Approved). The dataset showed an imbalance, with 12,290 instances of “Approved” and 2,865 instances of “Not Approved.”

  • Handling Missing Data: For numerical features, missing values were imputed using the mean of the respective columns. The mode was used as the imputation value for the categorical features.

  • Feature Scaling and Encoding: Numerical features were standardized using StandardScaler to achieve a zero mean and unit variance. Categorical features were transformed into a numeric format using OneHotEncoder, creating binary columns for each category and ignoring categories not seen during fitting.

  • Preprocessing pipeline: A column transformer was implemented to apply these transformations consistently to the corresponding numerical and categorical columns.
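A minimal sketch of this preprocessing pipeline in scikit-learn is given below; the column lists are illustrative placeholders rather than the exact 22 features.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_cols = ["exam1", "note1", "note2", "course_accesses"]        # assumed subset
categorical_cols = ["gender", "ethnicity", "socioeconomic_status"]   # assumed subset

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),            # mean imputation for numeric features
    ("scale", StandardScaler()),                           # zero mean, unit variance
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),   # mode imputation
    ("encode", OneHotEncoder(handle_unknown="ignore")),    # binary columns per category
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])
```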

Modeling and evaluation

Data splitting and cross-validation framework

To robustly evaluate the model’s performance and prevent data leakage, a stratified 5-fold cross-validation strategy was employed (StratifiedKFold with n_splits = 5, shuffle = True, random_state = 42). This ensured that each fold maintained approximately the same percentage of samples from each target class as in the full dataset. The dataset was not split into a single training and testing set before cross-validation for base model evaluation. Instead, the performance was assessed using the out-of-fold predictions generated by the cross-validation process itself.
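The splitting procedure can be expressed as follows. This is a minimal sketch in which X denotes the feature DataFrame and y the binary target (status_bin), both assumed to be defined beforehand.

```python
from sklearn.model_selection import StratifiedKFold

# Stratified 5-fold split with the parameters reported above.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each fold preserves the approximate class proportions of the full dataset;
    # model fitting and SMOTE are applied to train_idx only (see next section).
    print(f"Fold {fold}: {len(train_idx)} training rows, {len(val_idx)} validation rows")
```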

Class balancing within cross-validation

Given the imbalance in the target variable, the synthetic minority oversampling technique (SMOTE) was applied to balance the class distribution. To ensure that the validation data of each fold remained unseen during the oversampling process, a strict sequential procedure was enforced within each training fold of cross-validation. Specifically, the training data portion was first balanced using SMOTE (with k_neighbors = 5, random_state = 42), and the model was then trained on this newly generated balanced dataset. This approach prevents data leakage from the validation set into the training process, which is essential for obtaining unbiased performance estimates of the model.
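The following sketch illustrates this fold-wise procedure, reusing the skf splitter and preprocessor from the previous snippets and using Logistic Regression as a stand-in classifier; it is not the authors' exact code.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

oof_prob = np.zeros(len(y))
for train_idx, val_idx in skf.split(X, y):
    # Fit preprocessing on the training fold only, then transform both partitions.
    X_tr = preprocessor.fit_transform(X.iloc[train_idx])
    X_va = preprocessor.transform(X.iloc[val_idx])

    # Oversample the minority class in the training fold only (no leakage into validation).
    X_bal, y_bal = SMOTE(k_neighbors=5, random_state=42).fit_resample(X_tr, y.iloc[train_idx])

    clf = LogisticRegression(max_iter=500, class_weight="balanced", random_state=42)
    clf.fit(X_bal, y_bal)
    oof_prob[val_idx] = clf.predict_proba(X_va)[:, 1]   # out-of-fold probabilities

print("OOF AUC:", roc_auc_score(y, oof_prob))
```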

Base machine learning models

Seven base classification algorithms were selected to represent a wide spectrum of learning strategies, from traditional models to state-of-the-art gradient boosting ensembles.

  1. Logistic Regression (LR): A linear model (max_iter = 500, solver=’lbfgs’, class_weight=’balanced’, random_state = 42).

  2. K-Nearest Neighbors (KNN): A distance- and example-based model (n_neighbors = 15).

  3. Decision Tree (DT): A tree-based model (random_state = 42).

  4. Random Forest (RF): An ensemble of decision trees (n_estimators = 300, random_state = 42).

  5. Support Vector Machines (SVM): A model that finds an optimal hyperplane (kernel=’rbf’, probability = True, random_state = 42).

  6. Extreme Gradient Boosting (XGBoost): A highly efficient and scalable implementation of gradient boosting, renowned for its predictive accuracy and regularization capabilities to prevent overfitting (random_state = 42, use_label_encoder = False, eval_metric=’logloss’).

  7. Light Gradient Boosting Machine (LightGBM): A fast, distributed, high-performance gradient boosting framework designed for speed and efficiency, particularly effective with large datasets (random_state = 42).

Each base model was trained and evaluated within a 5-fold cross-validation framework, adhering to the procedure described in Sect. 2.3.2. For each fold, the SMOTE algorithm was applied to the training partition before the model was trained, ensuring that the evaluation was always performed on hold-out data that had not been synthetically augmented.
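The seven base learners, with the hyperparameters reported above (all other settings left at library defaults), can be instantiated as in the following sketch.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

base_models = {
    "LR":  LogisticRegression(max_iter=500, solver="lbfgs",
                              class_weight="balanced", random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=15),
    "DT":  DecisionTreeClassifier(random_state=42),
    "RF":  RandomForestClassifier(n_estimators=300, random_state=42),
    "SVM": SVC(kernel="rbf", probability=True, random_state=42),
    "XGB": XGBClassifier(random_state=42, use_label_encoder=False,
                         eval_metric="logloss"),
    "LGB": LGBMClassifier(random_state=42),
}
```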

Stacking ensemble model

A stacking ensemble model was implemented to enhance the predictive performance beyond that of the individual base models. The process involved two levels of modeling:

  • Level 0 Models (Base Models) and Meta-feature Generation:

The out-of-fold (OOF) predictions (probabilities for the positive class, “Approved”) from each of the seven base models were collected across all folds. These OOF predictions, forming a set of probabilities for each base model across the entire dataset, served as “meta-features” for training the Level 1 model. This method is crucial to prevent information leakage from the target variable into the meta-model’s training process, because the prediction for each data point is generated by models that did not see that point during training.

  • Training the Level 1 Model (Meta-Model):

A Logistic Regression classifier (with parameters max_iter = 200, solver=’lbfgs’, class_weight=’balanced’, random_state = 42) was chosen as the meta-model (or Level 1 model). Logistic Regression is often preferred as a meta-learner owing to its simplicity, ability to provide well-calibrated probabilities, and tendency to reduce overfitting by combining predictions from multiple base models39.

The meta-model was trained using the collected OOF meta-features (i.e., the matrix of out-of-fold probabilities from the base models) as its input features, and the original target variable (status_bin) as its output variable.

  • Final Retraining of Base Models for Deployment and Prediction:

After the CV process and meta-model training, each of the seven base models (including their specific preprocessing steps and SMOTE) was retrained on the full dataset (X, Y). This step creates the final deployable versions of the base models, ready to make predictions on new unseen data or on the full dataset for analyses such as SHAP or fairness assessments.

  • Stacking Model Prediction Process (on the Full Dataset or New Data):

To obtain a prediction from the fully trained stacking ensemble for any instance (whether from the full dataset for analysis or from new, unseen data), the following steps are applied (a code sketch follows this list):

  1. The instance’s features are first processed through the complete modeling pipeline of each of the seven final retrained base models (the fitted preprocessing steps; SMOTE, as a resampling step, applies only during training).

  2. Each base-model pipeline generates a probability for the positive class.

  3. These seven probabilities (one from each base model) serve as the input features for the trained meta-model.

  4. The meta-model processes these seven probabilities and produces the final prediction probability of the stacking ensemble. This procedure was used to generate stack_full_prob for the fairness and SHAP analyses.
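The sketch below outlines this two-level procedure end to end, reusing X, y, skf, preprocessor, and base_models from the earlier snippets. Note that the imblearn pipeline applies SMOTE only when fitting, so prediction-time calls pass features through preprocessing and the classifier only.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def make_pipeline(model):
    # Preprocessing and SMOTE are (re)fitted inside each training fold only;
    # imblearn skips the resampling step when predicting.
    return ImbPipeline([("prep", clone(preprocessor)),
                        ("smote", SMOTE(k_neighbors=5, random_state=42)),
                        ("model", clone(model))])

# Level 0: collect out-of-fold positive-class probabilities for every base model.
meta_features = np.zeros((len(y), len(base_models)))
for j, (name, model) in enumerate(base_models.items()):
    for train_idx, val_idx in skf.split(X, y):
        pipe = make_pipeline(model).fit(X.iloc[train_idx], y.iloc[train_idx])
        meta_features[val_idx, j] = pipe.predict_proba(X.iloc[val_idx])[:, 1]

# Level 1: Logistic Regression meta-model trained on the OOF probabilities.
meta_model = LogisticRegression(max_iter=200, solver="lbfgs",
                                class_weight="balanced", random_state=42)
meta_model.fit(meta_features, y)

# Final retraining of every base pipeline on the full dataset for deployment,
# and the stacked probability used for the fairness and SHAP analyses.
final_pipes = {name: make_pipeline(m).fit(X, y) for name, m in base_models.items()}
stack_full_prob = meta_model.predict_proba(
    np.column_stack([p.predict_proba(X)[:, 1] for p in final_pipes.values()]))[:, 1]
```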

Performance evaluation metrics and statistical analysis

Model performance was primarily evaluated using the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), which measures the model’s ability to discriminate between positive and negative cases, and the F1-score, the harmonic mean of precision and recall, which is particularly informative for imbalanced datasets. For each model (base and stacking), the mean AUC-ROC and mean F1-score across the five folds were reported, along with their standard deviations (SD) and 95% confidence intervals (CI95), calculated using bootstrapping (n = 1000 resamples) to assess the reliability of the results.
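As an illustration, a bootstrapped 95% confidence interval for the AUC can be computed as follows; this sketch uses the out-of-fold probabilities oof_prob from the earlier snippet, and the authors' exact resampling scheme may differ.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_arr = np.asarray(y)
boot_aucs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_arr), len(y_arr))      # resample with replacement
    if len(np.unique(y_arr[idx])) < 2:                 # skip degenerate resamples
        continue
    boot_aucs.append(roc_auc_score(y_arr[idx], oof_prob[idx]))

ci_low, ci_high = np.percentile(boot_aucs, [2.5, 97.5])
print(f"AUC = {roc_auc_score(y_arr, oof_prob):.3f}  95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```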

The DeLong test was used to assess the statistical significance of the differences in AUC-ROC values between the key models, particularly comparing the best-performing base model against the stacking model, as well as against other top-performing ensembles (e.g., Random Forest vs. XGBoost).

To analyze the association between categorical features (gender, ethnicity, socioeconomic status) and the target variable (status_bin), Chi-squared (χ²) independence tests or Fisher’s exact test (when expected cell counts were low) were conducted. Statistical significance was set at p < 0.05.
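A minimal sketch of these association tests with scipy is shown below; df denotes the consolidated dataset assumed in earlier snippets.

```python
import pandas as pd
from scipy.stats import chi2_contingency, fisher_exact

# Contingency table between a sensitive attribute and the binary outcome.
table = pd.crosstab(df["gender"], df["status_bin"])
chi2, p, dof, expected = chi2_contingency(table)

# Fall back to Fisher's exact test when expected counts are low (2x2 tables only).
if (expected < 5).any() and table.shape == (2, 2):
    _, p = fisher_exact(table)

print(f"p-value: {p:.3f}")
```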

Fairness assessment

Recognizing the ethical imperative to avoid algorithmic bias, this study incorporated an assessment of fairness in the final selected model. The primary candidate for this analysis was the stacking ensemble; however, the best-performing base model (e.g., Random Forest, XGBoost) was also considered if it presented a more advantageous balance of predictive power and interpretability. The assessment was conducted with respect to the following sensitive attributes: gender, ethnicity (dichotomized as majority vs. minority group), and socioeconomic status (dichotomized based on the median).

Three key fairness metrics were computed using the model predictions for the entire dataset.

  • Statistical Parity (SP): This metric quantifies the magnitude of the difference in the rate of positive outcomes (predicted “Passed”) between privileged and unprivileged groups. Although the direction of the difference could also be informative, the implementation focuses on the absolute magnitude, providing a clear view of the extent of disparity.

  • Disparate Impact (DI): This metric calculates the ratio of the rates of positive outcomes between privileged and non-privileged groups. A value closer to 1 indicates equitable treatment, whereas values significantly below 1 may indicate an adverse impact.

  • Consistency: This metric evaluates how consistently the model produces similar predictions for individuals with similar characteristics regardless of group membership. In this study, consistency was calculated following the theoretical definition by comparing each prediction with the average prediction of its nearest neighbors in the feature space. This rigorous approach enhances the measurement of local consistency and offers a more precise evaluation of the fairness of the model.

These metrics provide a solid quantitative foundation for assessing potential biases in the model across different demographic subgroups.
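The three metrics can be computed as in the following sketch. It assumes binary predictions y_hat (1 = predicted pass), a binary sensitive attribute s (1 = privileged group), and a numeric feature matrix X_num for the nearest-neighbour consistency computation; the neighbourhood size is illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def statistical_parity(y_hat, s):
    # Absolute difference in positive-prediction rates between groups.
    return abs(y_hat[s == 1].mean() - y_hat[s == 0].mean())

def disparate_impact(y_hat, s):
    # Ratio of positive-prediction rates (unprivileged / privileged); 1 = equitable.
    return y_hat[s == 0].mean() / y_hat[s == 1].mean()

def consistency(y_hat, X_num, k=5):
    # 1 minus the mean gap between each prediction and the mean prediction
    # of its k nearest neighbours in feature space (the point itself is excluded).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_num)
    _, idx = nn.kneighbors(X_num)
    neighbour_mean = y_hat[idx[:, 1:]].mean(axis=1)
    return 1.0 - np.mean(np.abs(y_hat - neighbour_mean))

# y_hat, s, and X_num are assumed to be NumPy arrays derived from the model output
# and the consolidated dataset.
```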

Tools and libraries

The analysis was conducted using Python (version 3.12). Key libraries included:

  • pandas and NumPy for data manipulation.

  • Scikit-learn was used for machine learning model implementation, preprocessing, cross-validation, and metric calculations.

  • imblearn was used to implement the SMOTE algorithm.

  • scipy.stats for the statistical tests.

  • Matplotlib and seaborn were used for data visualization.

  • shap for model interpretability (though detailed SHAP results are presented in the Results section).

  • xgboost for implementing the Extreme Gradient Boosting model.

  • lightgbm for implementing the Light Gradient Boosting Machine model.

Ethical considerations

This study utilized anonymized academic data and Moodle interaction logs provided by the Faculty of Science and Engineering at Universidad Estatal de Milagro under strict confidentiality protocols. No experiments were conducted on human subjects, and no new sensitive personal data were collected beyond those already available in institutional records. The research design and data handling procedures adhered to the prevailing ethical and regulatory standards, ensuring the protection of student privacy.

The inclusion of demographic features, such as ethnicity and socioeconomic status, as predictors, while potentially improving model accuracy, necessitates careful consideration of fairness and the risk of perpetuating existing inequities in the education system. The fairness assessment (Sect. 2.5) is the primary step in identifying these biases.

Results

This section details the study’s findings, beginning with the performance of the early academic performance prediction models. Subsequently, it examines the association between demographic variables and performance, delves into the importance and interpretability of model features using SHAP, assesses the fairness of the predictive model, and analyzes the characteristic student profiles.

Performance of machine learning models

Seven base models and a stacking ensemble model were evaluated using 5-fold stratified cross-validation, with the SMOTE technique integrated into each training fold to address the class imbalance. The evaluated algorithms included traditional models (logistic regression, KNN, and decision tree), bagging-based ensembles (Random Forest), and state-of-the-art gradient boosting ensembles (XGBoost and LightGBM). Performance was primarily measured using the area under the ROC curve (AUC-ROC) and F1-score. Table 1 summarizes the average performance of each model, including their standard deviations (SD) and 95% confidence intervals (95% CI) for AUC, obtained via bootstrapping (1000 resamples).

Table 1.

Performance of base and stacking models in 5-fold cross-validation. This table summarizes the performance of the seven base machine learning models (including Logistic Regression, KNN, DT, Random Forest, SVM, XGBoost, and LightGBM) and the Stacking Model. Performance is reported as the mean Area Under the ROC Curve (AUC), standard deviation (SD) of AUC, 95% confidence interval (CI) for AUC, and mean F1-score, evaluated using 5-fold stratified cross-validation with SMOTE integrated within each training fold.

Model AUC (Mean) AUC (SD) 95% CI AUC (Lower) 95% CI AUC (Upper) F1 (Mean)
Logistic Regression (LR) 0.945 0.005 0.941 0.948 0.927
K-Nearest Neighbors (KNN) 0.931 0.004 0.928 0.934 0.890
Decision Tree (DT) 0.808 0.007 0.803 0.814 0.917
Random Forest (RF) 0.952 0.003 0.950 0.954 0.948
Support Vector Machine (SVM) 0.947 0.005 0.943 0.951 0.940
XGBoost (XGB) 0.950 0.004 0.947 0.954 0.949
LightGBM (LGB) 0.953 0.005 0.949 0.956 0.950
Stacking Model 0.835 0.189 0.736 1.006 0.921

Note: The best values for AUC and F1 are shown in bold.

Figure 1 illustrates a comparison of the mean AUC values with 95% confidence intervals for each model. The results indicated that the gradient boosting models, particularly LightGBM (LGB), demonstrated the highest overall performance (AUC = 0.953 ± 0.005; F1 = 0.950), closely followed by Random Forest (AUC = 0.952) and XGBoost (AUC = 0.950). In contrast, the stacking model exhibited a lower average AUC and considerably greater variability (AUC = 0.835, SD = 0.189), indicating its reduced stability. Although LightGBM achieved a nominally higher AUC than Random Forest, DeLong’s test revealed no statistically significant difference between them (p > 0.05), confirming their comparable predictive power. Similarly, the difference between LightGBM and the stacking model was not statistically significant because of the high variance of the latter.

Fig. 1.

Fig. 1

Comparison of AUC Across Models with 95% Confidence Intervals. This bar chart displays the mean Area Under the ROC Curve (AUC) with 95% confidence intervals for the seven base models (including traditional algorithms, Random Forest, XGBoost, and LightGBM) and the Stacking Model. The LightGBM model achieved the highest mean AUC (0.953 ± 0.005), whereas the Stacking Model (0.835 ± 0.189) exhibited lower performance and high instability.

Association between categorical features and academic performance

The association between the target variable (binary academic performance) and the categorical features of gender, ethnicity, and socioeconomic status was investigated using chi-squared (χ²) tests or Fisher’s exact test when expected cell counts were low. The results were as follows:

  • Gender: No statistically significant association was found between gender and academic performance (p = 0.177).

  • Ethnicity: A statistically significant association was identified (p = 0.005), suggesting that the ethnic distribution varied between the performance groups.

  • Socioeconomic Status: A highly significant association was found (p < 0.001), confirming a strong relationship between this variable and the academic performance.

These findings indicate that, in the context of this study, ethnicity and socioeconomic status were significantly associated with students’ academic performance, whereas gender did not show a relevant association with academic performance.

Importance and interpretation of models and features

To understand the behavior of the stacking ensemble model, an analysis using SHapley Additive exPlanations (SHAP) was performed on the meta-model. This analysis aimed to determine the global importance of each base model output (i.e., their predicted probabilities) in contributing to the final prediction. The results presented in Table 2 reveal that the Random Forest component (RF_prob) remained the most influential predictor, with a mean absolute SHAP value of 0.931. Interestingly, the K-Nearest Neighbors (KNN_prob) and Logistic Regression (LR_prob) models also showed significant contributions. The LightGBM (LGB_prob) and XGBoost (XGB_prob) models ranked fourth and fifth, respectively, indicating that although they are strong predictors individually, their contributions were partially redundant with the information already provided by the Random Forest model. The Decision Tree (DT_prob) and Support Vector Machine (SVM_prob) components had a negligible impact on the final prediction.

Table 2.

Feature importance of base classifiers in the stacking model. This table displays the mean absolute SHAP values for each of the seven base models, quantifying their global importance and contribution to the final prediction of the meta-learner in the stacking ensemble model.

Base learner Mean SHAP value
RF_prob 0.9308
KNN_prob 0.4566
LR_prob 0.3118
LGB_prob 0.2960
XGB_prob 0.1117
SVM_prob 0.0968
DT_prob 0.0031

Following the analysis of the meta-learner, we interpreted the best-performing base model to understand which student features were most predictive. As established in Sect. 3.1, LightGBM (LGB) achieved the highest AUC (0.953) and was therefore selected for the detailed analysis. A SHAP summary plot (Fig. 2) was generated to visualize the impact of each feature on the model output.

Fig. 2.

Fig. 2

SHAP summary (LightGBM model)–complete dataset. SHAP Summary Plot for the LightGBM Model on the Complete Dataset. This plot illustrates the impact of the feature values on the LightGBM model. Each point represents a SHAP value for a feature and an instance. The color indicates the feature value (red for high, blue for low), and the horizontal position shows the impact of the SHAP value on the model’s prediction towards a higher (positive SHAP value) or lower (negative SHAP value) probability of a student passing. The features are ordered based on their global importance.

The summary plot confirms that early academic performance indicators are the most significant predictors of student success. High values for features such as num_exam1, num_note2, and num_note1 (shown in red) consistently push the model output towards a positive prediction (passing), whereas low values (blue) are associated with a negative prediction (failing).

The feature ranking in Fig. 3 clearly shows that the first partial exam (num_exam1) and the first two continuous assessment grades (num_note2 and num_note1) are the three most powerful predictors. The combined importance of these variables significantly outweighed that of all other variables. This finding strongly suggests that early academic performance is the most reliable indicator of the final outcomes.

Fig. 3.

Fig. 3

Top 10 Most Important Features According to SHAP for the LightGBM Model. This bar chart presents the top ten features ranked by their mean absolute SHAP values, indicating their overall contribution to the LightGBM model’s predictions. Early academic performance indicators, such as num_exam1, num_note2, and num_note1, were found to be the most influential.

In addition to academic grades, several other features have predictive value. Engagement metrics such as num_resources_reviewed, num_grades_reviewed, and num_course_accesses appear in the top 10, indicating that students who actively use the learning platform are more likely to succeed academically. Furthermore, certain categorical variables, such as cat__subject_FUNDAMENTALS OF PROGRAMMING and cat__nationality_ECUADORIAN, also demonstrated predictive relevance, suggesting that specific courses or demographic factors may influence student performance in this context. The consistency of these top features, particularly the academic ones, across different high-performing models, such as Random Forest and LightGBM, strengthens the validity of these findings.
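A sketch of this SHAP analysis is shown below, assuming the fitted LightGBM pipeline (final_pipes["LGB"]) from the stacking snippet; older versions of the shap library return a per-class list for binary LightGBM models, which the sketch handles explicitly.

```python
import shap

lgb_pipe = final_pipes["LGB"]                        # assumed fitted pipeline (prep + SMOTE + LGBMClassifier)
X_trans = lgb_pipe.named_steps["prep"].transform(X)
if hasattr(X_trans, "toarray"):                      # densify if one-hot encoding produced a sparse matrix
    X_trans = X_trans.toarray()
feature_names = lgb_pipe.named_steps["prep"].get_feature_names_out()

explainer = shap.TreeExplainer(lgb_pipe.named_steps["model"])
shap_values = explainer.shap_values(X_trans)
# Binary classifiers may yield a list [class 0, class 1]; keep the positive class.
vals = shap_values[1] if isinstance(shap_values, list) else shap_values

shap.summary_plot(vals, X_trans, feature_names=feature_names)       # beeswarm plot (Fig. 2 style)
shap.summary_plot(vals, X_trans, feature_names=feature_names,
                  plot_type="bar", max_display=10)                  # top-10 bar chart (Fig. 3 style)
```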

Fairness assessment of the predictive model

The fairness of the stacking model (the final meta-model combining the base models) was evaluated with respect to three sensitive attributes: gender, ethnicity (dichotomized as majority vs. minority group), and socioeconomic status (SES, dichotomized by the median). Three key metrics were calculated (Table 3), as follows:

Table 3.

Fairness metrics for the stacking model. This table presents the fairness assessment results for the Stacking Model concerning three sensitive attributes: gender, ethnicity (dichotomized), and socioeconomic status (SES; dichotomized). The metrics reported are Statistical Parity (SP) and Disparate Impact (DI). The Consistency metric for the model was 0.907.

Sensitive attribute Statistical parity (SP) Disparate impact (DI)
Gender 0.007 1.010
Ethnicity 0.010 0.987
Socioeconomic status 0.032 1.043

In addition, the consistency metric calculated using the nearest neighbors (k-NN) and stacking model predictions yielded a value of 0.907, indicating high consistency (close to 1). This means that the model tends to treat individuals with similar feature profiles in a similar manner, regardless of their demographic groups. These results suggest that the stacking model performs well in terms of the evaluated fairness metrics.

Average student profile: interaction and performance

To visualize the average student behavior in relation to the study’s numerical variables, a radar chart (Fig. 4) was generated using the mean rescaled values (MinMax) of the top ten numerical variables related to performance and interaction.

Fig. 4.

Fig. 4

Average Rescaled Profile of Main Numerical Performance and Interaction Variables. This radar chart visualizes the average student profile based on the mean MinMax rescaled values of the top ten numerical variables related to academic performance and Moodle VLE interaction. It highlights a tendency for high scores in formal assessments (e.g., note1, note2, exam1) and comparatively lower engagement with platform resources (e.g., assignments_reviewed and course_accesses).

The radar chart (Fig. 4) and the associated rescaled averages (for example, note1: 0.809, note2: 0.815, exam1: 0.780) show that, on average, students tended to achieve high scores in formal assessments. In contrast, the variables measuring interaction on the Moodle platform, such as assignments_reviewed (0.071), course_accesses (0.042), and resources_reviewed (0.026), displayed considerably lower rescaled averages. This suggests a possible disconnect, in which students prioritize performance in direct assessments over active and continuous engagement with the platform’s supplementary resources. Variables such as quizzes_completed (0.272) fell within the intermediate range.
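A sketch of how such a profile can be produced is given below; the column list and the DataFrame df are assumed from earlier snippets, and the plot layout is illustrative rather than a reproduction of Fig. 4.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

cols = ["note1", "note2", "exam1", "quizzes_completed", "quizzes_reviewed",
        "assignments_reviewed", "course_accesses", "grades_reviewed",
        "resources_reviewed", "updated_assignments_submitted"]

# Mean of each variable after MinMax rescaling to [0, 1].
means = MinMaxScaler().fit_transform(df[cols]).mean(axis=0)

# Close the polygon by repeating the first point.
angles = np.linspace(0, 2 * np.pi, len(cols), endpoint=False)
angles = np.concatenate([angles, angles[:1]])
values = np.concatenate([means, means[:1]])

fig, ax = plt.subplots(subplot_kw={"projection": "polar"}, figsize=(6, 6))
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(cols, fontsize=8)
ax.set_title("Average rescaled student profile")
plt.show()
```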

Discussion

This study aimed to enhance early academic performance prediction by developing and evaluating a comprehensive modeling framework that included traditional algorithms, state-of-the-art gradient boosting ensembles, and a final stacking model. The principal findings indicate that LightGBM, a gradient boosting model, emerged as the best-performing base model (AUC = 0.953), slightly outperforming the Random Forest and XGBoost models. This highlights the predictive power of modern ensemble techniques in educational contexts. The second key finding was that the stacking ensemble model, despite its complexity, did not yield a statistically significant improvement over the best base models and exhibited considerable instability. Crucially, the analysis of feature importance, conducted on the top-performing LightGBM model, confirmed that early academic grades remained the most salient predictors of student success, a finding consistent across different model families. Furthermore, the model demonstrated robust fairness across sensitive demographic characteristics.

The high performance of the gradient boosting models, particularly LightGBM (AUC = 0.953) and XGBoost (AUC = 0.950), demonstrates the effectiveness of state-of-the-art techniques on this dataset, directly addressing the novelty of the methods used. The central finding, however, was that the stacking model (AUC = 0.835) did not outperform the best individual models. This outcome can be attributed to two factors. First, the top-performing base models, LightGBM and Random Forest, achieved an exceptionally high AUC of over 0.95, suggesting a “ceiling effect” where there is minimal room for a meta-learner to improve upon near-optimal predictions. This aligns with literature indicating that stacking offers diminishing returns when base learners are already highly accurate40,41.

Second, the SHAP analysis of the meta-model provides insights into information redundancy. Although Random Forest (RF_prob) was the most influential component, the contributions of LightGBM (LGB_prob) and XGBoost (XGB_prob) were lower than their individual performances would suggest. This implies that the predictions from these powerful tree-based ensembles are highly correlated, providing redundant information to the meta-learner. A simple meta-model, such as Logistic Regression, while effective at mitigating overfitting39, struggles to extract significant new value when its inputs are not sufficiently diverse40,42. Therefore, the lack of improvement is not a failure of the stacking concept itself, but a revealing insight into the data structure and the limits of model complexity in this specific high-performance context.

A particularly robust finding of this study was the consistency of the key predictors across different model architectures. SHAP analysis was conducted on the best-performing model, LightGBM, and reaffirmed that early academic performance indicators—specifically num_exam1, num_note1, and num_note2—are the most dominant features. This consistency, observed in both the bagging-based Random Forest and boosting-based LightGBM, strengthens the validity of the conclusion that early grades are the most reliable indicators of final academic outcomes, regardless of the underlying predictive algorithm. This aligns with extensive prior research on the power of initial academic achievement43,44. Although VLE interaction metrics and certain demographic features also showed predictive value, their impact remained secondary, reinforcing the importance of direct performance measures. This finding supports the growing body of evidence linking VLE activities to academic outcomes33,45,46. Interestingly, the average student profile indicated high scores in formal assessments but comparatively lower engagement with Moodle platform resources. This suggests a potential disconnect in which students may prioritize summative assessments over continuous engagement with learning materials, highlighting an area for pedagogical intervention to foster deeper learning habits.

A critical consideration central to the practical utility of this model is the timeliness of its predictions for effective intervention. The model is designed to generate predictions following the first partial assessment, a point strategically chosen because it provides the first comprehensive and reliable measure of a student’s performance. Although this occurs at the midpoint of the term, it represents a crucial window of opportunity. At this stage, there is still sufficient time—approximately half of the academic period—for meaningful interventions, such as targeted academic support, tutoring, or counseling, to substantially impact a student’s final outcome. The automated ETL process, triggered at the conclusion of the first partial assessment, ensures that predictions are available promptly, allowing academic advisors to act without significant delays. Although the exact timing of the first partial assessment may vary slightly across different course structures, the system is designed to trigger predictions as soon as this consolidated data becomes available, ensuring its applicability in a dynamic academic environment. While the model is designed to be actionable, it is important to clarify that this study covers the crucial first phase: model development, validation, and interpretation. The subsequent phase, which involves a pilot implementation to guide and evaluate real-world interventions with academic advisors, is planned as the next step of this institutional research project. This staged approach ensures that interventions are based on a robust and ethically vetted predictive foundation before being deployed.

The significant association found between ethnicity, socioeconomic status, and academic performance underscores the persistent equity challenges in higher education, aligning with broader studies4749. In this context, a proactive fairness assessment of the stacking model is critical. The model demonstrated favorable results in terms of statistical parity, disparate impact, and consistency (0.907), suggesting that it did not exacerbate bias against the analyzed demographic subgroups. This commitment to fairness is crucial, given the ethical concerns that predictive models may perpetuate existing biases47,49. Although these quantitative metrics are encouraging, continuous monitoring and qualitative understanding remain essential for ensuring their equitable application. Notably, these favorable fairness metrics were maintained even when the stacking model was built using a more complex and diverse set of seven base learners, reinforcing the stability of this finding.

The implementation of the stacking model involves several technical considerations. Although our study did not explicitly report issues such as excessive computational demand, these are known challenges associated with stacking, requiring significant resources and expertise for training and hyperparameter tuning40,42,50. Interpretability, often a concern with complex ensemble models, was addressed in this study through SHAP analysis of both the meta-model and the most influential base model (LightGBM). This approach provides valuable insights into feature importance and model behavior, enhancing transparency, which is a key factor in building trust among educators and stakeholders.

This study had some limitations that should be considered. The findings were derived from a single institution and a specific online program (Information Technology Engineering at the Universidad Estatal de Milagro, Ecuador), which may limit their generalizability to other academic contexts, disciplines, and modalities. Furthermore, while the model’s activation is systematically tied to the availability of first partial assessment data, the practical window for intervention may vary depending on institutional calendars and specific course timelines. Although 22 features were selected based on the literature and data availability, other unmeasured variables, such as intrinsic student motivation or nuanced teaching quality differences, could also influence students’ academic performance. The SMOTE technique, which is effective for addressing class imbalance within cross-validation folds, involves generating synthetic minority instances, the implications of which require further investigation. Finally, VLE interaction metrics, primarily based on activity counts, serve as proxies for engagement and may not fully capture the quality or depth of students’ interaction with the learning materials. Furthermore, a significant limitation of this study is its focus on the development and validation of the predictive model itself, without subsequent validation of the model’s effectiveness in guiding real-world educational interventions. Assessing the impact of such data-driven interventions is a critical step for future research.

Conclusions

This study successfully developed and evaluated a wide range of machine learning models, from traditional algorithms to state-of-the-art gradient boosting ensembles and a final stacking model, for the early prediction of academic performance. The LightGBM model emerged as the most robust and accurate predictor, slightly outperforming other powerful ensembles, such as Random Forest and XGBoost. A key finding was that the stacking ensemble, despite its complexity, did not improve the predictive performance and showed significant instability. Across all high-performing models, the findings consistently highlight the paramount importance of early academic grades as the primary predictor of success. The final model also demonstrated a strong performance in fairness metrics across sensitive demographic attributes, such as ethnicity and socioeconomic status.

The principal contribution of this study lies in its comprehensive and rigorous methodological framework. This framework integrates a diverse feature set from Moodle analytics and academic-demographic records with a comparative evaluation across different ensemble families (bagging vs. boosting) and a final stacking architecture. The methodology emphasizes robust evaluation through cross-validation, deep model interpretability via SHAP analysis on the best-performing models, and a crucial assessment of algorithmic fairness. This study not only provides a validated and interpretable model that serves as a foundation for developing actionable interventions at the Universidad Estatal de Milagro but also offers a replicable framework for other institutions seeking to leverage learning analytics to provide equitable student support.

In conclusion, the high-performing models developed in this study, particularly LightGBM, represent powerful and reliable tools for identifying at-risk students. The results reinforce that early and continuous assessment is the most critical factor for success in online learning. The finding that a complex stacking model did not outperform a well-tuned individual model provides a valuable lesson on the limits of complexity, suggesting that institutions can achieve state-of-the-art predictive accuracy without necessarily resorting to the most computationally intensive methods. Importantly, this study underscores the ethical imperative to build predictive tools that are not only accurate but also equitable. Although these models offer powerful capabilities, they must be integrated into a broader, human-centered strategy aimed at fostering student success for all learners.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (14.8KB, docx)

Author contributions

F.E.A.-C. (Felipe Emiliano Arévalo-Cordovilla) and M.P. (Marta Peña) conceptualized the study and designed the research methodology. F.E.A.-C. performed data curation, formal analysis, and investigation. F.E.A.-C. drafted the original manuscript, created visualizations, and managed the project. M.P. supervised and validated the study methodology. Both authors have reviewed and edited the manuscript.

Funding

This research was supported by a scholarship from Universidad Estatal de Milagro (UNEMI).

Data availability

Data that support the findings of this study are available upon request from the corresponding author, Felipe Emiliano Arévalo-Cordovilla (farevaloc@unemi.edu.ec).

Declarations

Competing interests

The authors declare no competing interests.

Consent for publication

The authors confirm that they have read and approved the final version of this manuscript and consented to its publication.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Balaji, P., Alelyani, S., Qahmash, A. & Mohana, M. Contributions of machine learning models towards student academic performance prediction: A systematic review. Applied Sciences11, (2021).
  • 2.Yağcı, M. Educational data mining: prediction of students’ academic performance using machine learning algorithms. Smart Learn. Environments9, (2022).
  • 3.Yu, J. Academic performance prediction method of online education using random forest algorithm and artificial intelligence methods. Int. J. Emerg. Technol. Learn.16, 45–57 (2021). [Google Scholar]
  • 4.Gaftandzhieva, S. et al. Exploring Online Activities to Predict the Final Grade of Student. Mathematics 10, (2022).
  • 5.Perkash, A. et al. Feature optimization and machine learning for predicting students’ academic performance in higher education institutions. Educ. Inf. Technol.10.1007/s10639-024-12698-9 (2024). [Google Scholar]
  • 6.Zoralioglu, Y., Gul, M. F., Azizoglu, F., Azizoglu, G. & Toprak, A. N. Predicting Academic Performance of Students Using Machine Learning Techniques. 2023 Innovations in Intelligent Systems and Applications Conference, ASYU 2023 (2023). 10.1109/ASYU58738.2023.10296648
  • 7.Cruz, M. M. P. & Lumauag, R. G. Comparative Analysis of Machine Learning Algorithms for Predicting Student Academic Performance in Higher Education. Proceedings of the 4th International Conference on Ubiquitous Computing and Intelligent Information Systems, ICUIS 2024 888–896 (2024). 10.1109/ICUIS64676.2024.10866086
  • 8.Wu, M. et al. Using machine Learning-based algorithms to predict academic Performance - A systematic literature review. 4th Int. Conf. Innovative Practices Technol. Manage. 2024 ICIPTM 2024. 10.1109/ICIPTM59628.2024.10563566 (2024). [Google Scholar]
  • 9.Tin, T. T., Hock, L. S. & Ikumapayi, O. M. Educational big data mining: comparison of multiple machine learning algorithms in predictive modelling of student academic performance. Int. J. Adv. Comput. Sci. Appl.15, 633–645 (2024). [Google Scholar]
  • 10.Yan, K. Student performance prediction using XGBoost method from a macro perspective. Proc. 2021 2nd Int. Conf. Comput. Data Sci. CDS 2021, 453–459 (2021). 10.1109/CDS52072.2021.00084
  • 11.Shan, Y., Zhang, X., Lin, Y. & Ding, T. Accident Injury Severity Prediction Research Based on Stacking Model. 4th International Conference on Computer Engineering and Application, ICCEA 2023, 158–164 (2023). 10.1109/ICCEA58433.2023.10135421
  • 12.Ting, K. M. & Witten, I. H. Stacking Bagged and Dagged Models. (1997). https://hdl.handle.net/10289/1072
  • 13.Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).
  • 14.Ghorbani, R. & Ghousi, R. Comparing different resampling methods in predicting students’ performance using machine learning techniques. IEEE Access.8, 67899–67911 (2020). [Google Scholar]
  • 15.Sha, L., Rakovic, M., Das, A., Gasevic, D. & Chen, G. Leveraging class balancing techniques to alleviate algorithmic bias for predictive tasks in education. IEEE Trans. Learn. Technol.15, 481–492 (2022). [Google Scholar]
  • 16.Jawad, K., Shah, M. A. & Tahir, M. Students’ academic performance and engagement prediction in a virtual learning environment using random forest with data balancing. Sustainability (Switzerland) 14, (2022).
  • 17.Chachoui, Y., Azizi, N., Hotte, R. & Bensebaa, T. Enhancing algorithmic assessment in education: Equi-fused-data-based SMOTE for balanced learning. Computers Education: Artif. Intell.6, 100222 (2024). [Google Scholar]
  • 18.Cu, N. G., Nghiem, T. L., Ngo, T. H., Nguyen, M. T. L. & Phung, H. Q. Increment of Academic Performance Prediction of At-Risk Student by Dealing With Data Imbalance Problem. Applied Computational Intelligence and Soft Computing 4795606 (2024).
  • 19.Baker, R. S. & Hawn, A. Algorithmic bias in education. Int. J. Artif. Intell. Educ.32, 1052–1092 (2022). [Google Scholar]
  • 20.Duch, D., May, M. & George, S. Enhancing predictive analytics for students’ performance in moodle: insight from an empirical study. J. Data Sci. Intell. Syst.10.47852/BONVIEWJDSIS42023777 (2024). [Google Scholar]
  • 21.Siafis, V. & Rangoussi, M. Educational data mining-based visualization and early prediction of student performance: A synergistic approach. ACM Int. Conf. Proceeding Ser. 246–253 (2022). 10.1145/3575879.3576000 [Google Scholar]
  • 22.Alamgir, Z., Akram, H., Karim, S. & Wali, A. Enhancing student performance prediction via educational data mining on academic data. Inf. Educ.23, 1–24 (2024). [Google Scholar]
  • 23.Meghji, A. F., Shaikh, F. B., Wadho, S. A., Bhatti, S. & Ayyasamy, R. K. Using educational data mining to predict student academic performance. VFAST Trans. Softw. Eng.11, 43–49 (2023). [Google Scholar]
  • 24.Jain, A. et al. A PSO weighted ensemble framework with SMOTE balancing for student dropout prediction in smart education systems. Sci. Rep.15, 1–28 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Barton, M. & Lennox, B. Model stacking to improve prediction and variable importance robustness for soft sensor development. Digit. Chem. Eng.3, 100034 (2022). [Google Scholar]
  • 26.Tsiligaridis, J. Tree-Based ensemble models, algorithms and performance measures for classification. Adv. Sci. Technol. Eng. Syst. J.8, 19–25 (2023). [Google Scholar]
  • 27.Li, T., Ren, W., Xia, Z. & Wu, F. A Study of Academic Achievement Attribution Analysis Based on Explainable Machine Learning Techniques. IEEE 12th International Conference on Educational and Information Technology, ICEIT 2023, 114–119 (2023). 10.1109/ICEIT57125.2023.10107887
  • 28.Wang, S. & Luo, B. Academic achievement prediction in higher education through interpretable modeling. PLoS One. 19, e0309838 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Aman, F., Rauf, A., Ali, R., Iqbal, F. & Khattak, A. M. A Predictive Model for Predicting Students Academic Performance. 10th International Conference on Information, Intelligence, Systems and Applications, IISA 2019 (2019). 10.1109/IISA.2019.8900760
  • 30.Tsiakmaki, M. et al. Predicting university students’ grades based on previous academic achievements. 9th International Conference on Information, Intelligence, Systems and Applications, IISA 2018 (2018). 10.1109/IISA.2018.8633618
  • 31.Paxinou, E. et al. Tracing student activity patterns in E-Learning environments: insights into academic performance. Future Internet 16, 190 (2024). [Google Scholar]
  • 32.Suad, A., Tapalova, O., Berestova, A. & Vlasova, S. Moodle’s communicative instruments: the impact on students academic performance. Innovations Educ. Teach. Int.10.1080/14703297.2024.2437117 (2024). [Google Scholar]
  • 33.Zhang, Y., Ghandour, A. & Shestak, V. Retracted article: using learning analytics to predict students performance in moodle LMS. Int. J. Emerg. Technol. Learn. (iJET). 15, 102–115 (2020). [Google Scholar]
  • 34.McDuff, N., Hughes, A., Tatam, J., Morrow, E. & Ross, F. Improving equality of opportunity in higher education through the adoption of an inclusive curriculum framework. Widening Participation Lifelong Learn.22, 83–121 (2020). [Google Scholar]
  • 35.Naim, A. Equity across the educational spectrum: innovations in educational access crosswise all levels. Front. Educ.9, 1499642 (2024). [Google Scholar]
  • 36.Zhenda, M. & Yang Sun. Unraveling the complexities of educational inequalities: challenges and strategies for a more equitable future. Front. Educational Res.6, 185–189 (2023). [Google Scholar]
  • 37.Liang, J. et al. Student modeling and analysis in adaptive instructional systems. IEEE Access.10, 59359–59372 (2022). [Google Scholar]
  • 38.Shafiq, D. A., Marjani, M., Habeeb, R. A. A. & Asirvatham, D. Student retention using educational data mining and predictive analytics: A systematic literature review. IEEE Access.10, 72480–72503 (2022). [Google Scholar]
  • 39.Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning (Springer New York, 2009). 10.1007/978-0-387-84858-7
  • 40.Butt, N. A. et al. Performance prediction of students in higher education using Multi-Model ensemble approach. IEEE Access.11, 136091–136108 (2023). [Google Scholar]
  • 41.Sridhara, A., Falkner, N. & Atapattu, T. Leveraging inference: A Regression-Based learner performance prediction system for knowledge tracing. IEEE Access.11, 123458–123475 (2023). [Google Scholar]
  • 42.Keser, S. B. & Aghalarova, S. HELA: A novel hybrid ensemble learning algorithm for predicting academic performance of students. Educ. Inf. Technol. 27, 4521–4552 (2022). [Google Scholar]
  • 43.Moreno-Marcos, P. M., Pong, T. C., Munoz-Merino, P. J. & Kloos, C. D. Analysis of the factors influencing learners’ performance prediction with learning analytics. IEEE Access.8, 5264–5282 (2020). [Google Scholar]
  • 44.Yoo, J. E., Rho, M. & Lee, Y. Online students’ learning behaviors and academic success: an analysis of LMS log data from flipped classrooms via regularization. IEEE Access.10, 10740–10753 (2022). [Google Scholar]
  • 45.Kaensar, C. & Wongnin, W. Analysis and prediction of student performance based on moodle log data using machine learning techniques. Int. J. Emerg. Technol. Learn. (iJET). 18, 184–203 (2023). [Google Scholar]
  • 46.Xhomara, N. & Dasho, A. Online interactions and student learning outcomes in a Moodle-based e-learning system. Technol. Pedagogy Educ.32, 419–433 (2023). [Google Scholar]
  • 47.Cabral-Gouveia, C., Menezes, I. & Neves, T. Educational strategies to reduce the achievement gap: a systematic review. Front. Educ.8, 1155741 (2023). [Google Scholar]
  • 48.Kassaw, C., Demareva, V. & Herut, A. H. Trends of academic achievement of higher education students in ethiopia: literature review. Front. Educ.9, 1431661 (2024). [Google Scholar]
  • 49.Zhao, D., Liu, S. & Li, Q. Effects of socioeconomic status and its components on academic achievement: evidence from Beijing-Shanghai-Jiangsu-Zhejiang (China). Asia Pac. J. Educ.43, 968–983 (2023). [Google Scholar]
  • 50.Sahlaoui, H., Alaoui, E. A. A., Nayyar, A., Agoujil, S. & Jaber, M. M. Predicting and interpreting student performance using ensemble models and Shapley additive explanations. IEEE Access.9, 152688–152703 (2021). [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Material 1 (14.8KB, docx)

Data Availability Statement

Data that support the findings of this study are available upon request from the corresponding author, Felipe Emiliano Arévalo-Cordovilla (farevaloc@unemi.edu.ec).


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
