A Machine Learning Approach to Predictive Modelling of Student Performance

Hu Ng; Azmin Alias bin Mohd Azha; Timothy Tzen Vun Yap; Vik Tor Goh

doi:10.12688/f1000research.73180.2

F1000Res. 2022 May 23;10:1144. Originally published 2021 Nov 11. [Version 2] doi: 10.12688/f1000research.73180.2

PMCID: PMC9194521 PMID: 35719314

Version Changes

Revised. Amendments from Version 1

Referring to comments from the reviewers, we have made the following changes: 1) We have added papers into the Literature Review. 2) We have also amended the Introduction to better show the contributions, significant findings as well as the structure of the paper. 3) We have also added statements that better present the research problem in the Introduction. 4) To improve the flow of the paper, we have also changed the title of the Section 'Methodology' to 'Methodology and Results'. 5) We have also emphasized the significant findings in the discussions and the Conclusions. 6) Updates have been made to Fig 2 to show missing values. 7) The Conclusions have been revised to include the significant results, contribution and the weakness, which will be addressed in future work.

Abstract

Background - Many factors affect student performance, such as the individual’s background, habits, absenteeism and social activities. Once these factors are identified, corrective actions can be determined to improve performance. This study presents a data mining approach to identify significant factors and predict student performance, based on two datasets collected from two secondary schools in Portugal.

Methods - In this study, the two datasets are merged to increase the sample size. Following that, data pre-processing is performed and the features are normalized with linear scaling to avoid bias towards heavily weighted attributes. The features are then assigned to four groups: student background, lifestyle, history of grades and all features. Next, Boruta feature selection is performed to remove irrelevant features. Finally, classification models based on Support Vector Machine (SVM), Naïve Bayes (NB) and Multilayer Perceptron (MLP) are designed and their performance is evaluated.

Results - The models were trained and evaluated on an integrated dataset comprising 1044 student records with 33 features, after feature selection. Classification was performed with SVM, NB and MLP using 60-40 and 50-50 train-test splits and 10-fold cross-validation. GridSearchCV was applied for hyperparameter tuning. The performance metrics were accuracy, precision, recall and F1-score. SVM obtained the highest accuracy, with scores of 77%, 80%, 91% and 90% on background, lifestyle, history of grades and all features respectively in the 50-50 train-test split for binary classification. SVM also obtained the highest accuracy for five-level classification, with 39%, 38%, 73% and 71% for the four categories respectively. The results show that the history of grades has a significant influence on student performance.

Keywords: Student performance, data mining, support vector machine, naïve bayes, multilayer perceptron

Introduction

A student is a person who attends a school or other educational institution to achieve a certain level of knowledge or skill in a course under the supervision of an educator. Almost everyone was once a student with the responsibility to acquire a proper education. Acquiring knowledge through the right education is of utmost importance, and each person should have equal access to education.

When discussing education at the secondary level, a vital aspect to consider is student performance. Student performance can be assessed along a variety of dimensions, through either exam-based or participation-based assessment. Exam-based assessment includes quizzes, midterms and final exams, while participation-based assessment relies on two-way communication during learning and group activities.

Many factors can affect student performance, such as individual habits, absenteeism and social activities after school. This motivates the use of machine learning to learn patterns from data and predict how well a student will perform; once these factors are acknowledged, students at risk can be detected and their performance improved as early as possible.

The contributions of this paper are the identification of significant features that influence student assessment, which in turn can be used to develop predictive models of student performance. This will assist educators in forming corrective or remedial actions that can help to improve student performance. In addition, it may assist in formulating curricula that direct students to the career pathways most suitable for them.

The paper is structured as follows: Related Works, Methodology and Results, followed by Conclusions.

Related Works

Student performance is an essential part of secondary-level education, as it shows where the student stands when continuing to higher education. Daud et al. ¹ noted that the ability to predict the success of a student is essential and is a fascinating area to explore.

Sokkhey et al. ² noted that mathematics is one of the subjects that underlies scientific progression in students. Mathematics influences many aspects of human life at various levels and is used in everyday situations.

Akhtar ³ found that social status is correlated with a family’s social and monetary wealth, and reported the effect of monetary wealth on students’ grades in Pakistan.

Amazona et al. ⁴ as well as Hussain et al. ⁵ adopted educational data mining (EDM) methods to gather, archive and study information concerning students’ assessment and learning.

On another note, researchers have also looked at student dropout, ⁶ interpersonal influences ⁷ and career decisions after graduation, albeit at the tertiary level. ⁸–¹⁰

Exploratory data analysis (EDA) ¹¹ is a method of analyzing a dataset to summarize its important features through visualization. EDA helps:

• to find errors.
• to check assumptions.
• to determine the tentative choice of suitable models and tools.
• to determine the relationship between the dependent and independent variables.
• to detect the direction and size of the relationship between variables.

Feature selection is a form of dimensionality reduction that reduces the number of features in order to maximize the performance of a machine learning model. Too many features in a dataset can overwhelm a machine learning classifier and potentially reduce its efficacy. ¹²

The Boruta feature selection algorithm is a wrapper algorithm built around the random forest model. Tang et al. ¹² showed that feature selection effectively identified relevant features and improved the overall evaluation metrics in their research on a medical dataset.

Support Vector Machine (SVM) builds the best possible decision boundary, called a hyperplane, which segregates the feature space into classes. Sekeroglu et al. ¹³ achieved good results with SVM on the Mathematics and Portuguese subjects from two secondary schools.
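
As a minimal sketch of how a hyperplane separates two classes, the snippet below evaluates a linear decision function of the form w·x + b. The weights, the bias and the two scaled input features are hypothetical values chosen for illustration; they are not fitted to the study's data, and the training step that finds the maximum-margin hyperplane is not shown.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Assign 'Pass' on the non-negative side of the hyperplane, 'Fail' otherwise."""
    return "Pass" if decision_value(w, b, x) >= 0 else "Fail"

def distance_to_hyperplane(w, b, x):
    """Geometric distance from x to the hyperplane: |w.x + b| / ||w||."""
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane over two features scaled to [0, 1] (assumption).
w, b = [1.0, 1.0], -1.0

print(predict(w, b, [0.7, 0.6]))  # point above the line x1 + x2 = 1
print(predict(w, b, [0.2, 0.3]))  # point below the line
```

An SVM chooses w and b so that the distance from the hyperplane to the closest training points of each class is as large as possible.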

Naïve Bayes (NB) is based on Bayes’ rule of conditional probability and is highly capable of dealing with big datasets. ⁴ The method estimates the probability of a class given a set of data as evidence, using Bayes’ theorem. The posterior is calculated as the product of the likelihood and the prior, divided by the evidence.
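
The posterior calculation can be illustrated with a short worked example. The priors and per-feature likelihoods below are made-up numbers, not estimates from the study's data; the "naïve" assumption is that the feature likelihoods multiply because the features are treated as conditionally independent given the class.

```python
import math

def joint(prior, feature_likelihoods):
    """Prior times the product of per-feature likelihoods (naive independence)."""
    return prior * math.prod(feature_likelihoods)

# Illustrative numbers only (assumptions, not taken from the paper).
prior = {"Pass": 0.75, "Fail": 0.25}
likelihoods = {
    "Pass": [0.2, 0.8],  # P(feature_1 | Pass), P(feature_2 | Pass)
    "Fail": [0.6, 0.4],  # P(feature_1 | Fail), P(feature_2 | Fail)
}

# Evidence: total probability of the observed features over both classes.
joints = {label: joint(prior[label], likelihoods[label]) for label in prior}
evidence = sum(joints.values())

# Posterior = likelihood * prior / evidence.
posterior = {label: joints[label] / evidence for label in joints}
prediction = max(posterior, key=posterior.get)
print(posterior, prediction)
```

Because the evidence is the same for every class, the predicted class can also be found by comparing the numerators alone.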

Multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). ⁴ It consists of interconnected perceptrons through which information flows in a single direction, from the input to the output, along multiple routes.
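
The single-direction flow from input to output can be sketched as a forward pass through fully connected layers. The layer sizes, weights and sigmoid activation below are hypothetical choices for illustration; the paper does not state the architecture it used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: every unit sees every input (the multiple routes)."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]

def mlp_forward(x, layers):
    """Propagate x through the layers in order; nothing flows backwards."""
    activation = x
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation

# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output unit.
layers = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
output = mlp_forward([1.0, 1.0], layers)
print(output)
```

Training would adjust the weights and biases, typically by backpropagation, which is not shown here.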

Methodology and Results

In this research work, the approach consists of seven stages, namely data acquisition, data processing, data integration, data discretization, data transformation, feature selection and classification. The flow of the research is shown in Figure 1.

a) Data acquisition

The student performance data come from two Portuguese secondary schools, Gabriel Pereira Secondary School and Mousinho da Silveira Secondary School, and cover two subjects: Mathematics (395 students) ¹⁴ and Portuguese (649 students). ¹⁵ The two datasets were combined, giving 1044 student records with personal data and scores for the two subjects. The datasets are visualized in Figures 2 to 6.

b) Data processing

This stage validates the two datasets by making sure there are no missing values in any feature.

c) Data integration

After integration, the combined dataset consisted of 1044 student records with 33 features. By adopting EDA, ¹¹ the features were then assigned to four groups: student background (12 features), lifestyle (18 features), history of grades (three features) and all features. Tables 1 to 3 show the features in the student background, lifestyle and history of grades groups respectively. The category ‘all’ consists of the entire 33 features.

Table 1. Student background.

Feature	Description	Value
sex	Gender of student	Male or Female
age	Age of student	15–22
school	School of student	Gabriel Pereira or Mousinho da Silveira
address	Type of student’s home address	Urban or Rural
famsize	Size of family	≤3 or >3
Pstatus	Parent’s cohabitation status	Living together or apart
Medu	Mother’s education	None, Primary education, 5th to 9th grade, Secondary education, Higher education
Fedu	Father’s education	None, Primary education, 5th to 9th grade, Secondary education, Higher education
Mjob	Mother’s job	At home, Civil services, Teacher, Healthcare related, Other
Fjob	Father’s job	At home, Civil services, Teacher, Healthcare related, Other
reason	Reason to choose the school	Close to home, School reputation, Course preference, Other
guardian	Guardian of student	Father or mother, Other

Table 2. Student lifestyle.

Feature	Description	Value
traveltime	Travel time from home to school	<15 minutes, 15 to 30 minutes, 30 minutes to 1 hour, >1 hour
studytime	Weekly study time	<2 hours, 2 to 5 hours, 5 to 10 hours, >10 hours
failures	Number of past class failures	n if 1 ≤ n < 3, else 4
schoolsup	Extra educational school support	Yes or no
famsup	Educational support from family	Yes or no
paid	Extra paid classes within the course subject	Yes or no
activities	Extra-curricular activities	Yes or no
nursery	Attended nursery school	Yes or no
higher	Plans for higher education	Yes or no
internet	Have internet access at home	Yes or no
romantic	In a romantic relationship	Yes or no
famrel	Quality of relationship with family	Very low (1) to very high (5)
freetime	Free time after school	Very low (1) to very high (5)
goout	Going out with friends	Very low (1) to very high (5)
Dalc	Weekday alcohol consumption	Very low (1) to very high (5)
Walc	Weekend alcohol consumption	Very low (1) to very high (5)
health	Current health status	Very low (1) to very high (5)
absences	Number of school absences	0–93

Table 3. Student history of grades.

Feature	Description	Value
G1	First period grade	0–20
G2	Second period grade	0–20
G3	Final grade	0–20
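
The integration and grouping step can be sketched as follows. This is an illustrative outline only: the records are placeholders with made-up values, the helper names `integrate` and `select_group` are hypothetical, and the feature names follow Tables 1 to 3 (with `activities` spelled as in the public dataset, an assumption).

```python
# Feature groups as listed in Tables 1 to 3.
BACKGROUND = ["sex", "age", "school", "address", "famsize", "Pstatus",
              "Medu", "Fedu", "Mjob", "Fjob", "reason", "guardian"]
LIFESTYLE = ["traveltime", "studytime", "failures", "schoolsup", "famsup",
             "paid", "activities", "nursery", "higher", "internet", "romantic",
             "famrel", "freetime", "goout", "Dalc", "Walc", "health", "absences"]
HISTORY = ["G1", "G2", "G3"]

FEATURE_GROUPS = {
    "background": BACKGROUND,
    "lifestyle": LIFESTYLE,
    "history": HISTORY,
    "all": BACKGROUND + LIFESTYLE + HISTORY,
}

def integrate(*datasets):
    """Concatenate record lists that share the same columns."""
    merged = []
    for records in datasets:
        merged.extend(records)
    return merged

def select_group(records, group):
    """Keep only the columns that belong to one feature group."""
    columns = FEATURE_GROUPS[group]
    return [{name: record[name] for name in columns} for record in records]

# Placeholder records standing in for the two source datasets (values are made up).
template = {name: 0 for name in FEATURE_GROUPS["all"]}
dataset_a = [dict(template) for _ in range(395)]
dataset_b = [dict(template) for _ in range(649)]

combined = integrate(dataset_a, dataset_b)
print(len(combined), {group: len(cols) for group, cols in FEATURE_GROUPS.items()})
```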

d) Data discretization

Tables 4 and 5 show the binary levels and the five levels ⁵ used to represent the students’ grades after discretization.

Table 4. Binary levels classification.

Ordinal categorical	Value
Pass	10–20
Fail	0–9

Table 5. 5 Levels classification.

Ordinal categorical	Value
A	15–20
B	13–14
C	10–12
D	8–9
F	0–7
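
The two discretization schemes in Tables 4 and 5 translate directly into threshold rules. The function names below are hypothetical; the cut-offs are taken from the tables, and grades are assumed to be integers in the range 0 to 20.

```python
def check_grade(grade):
    """Grades in the source tables range from 0 to 20."""
    if not 0 <= grade <= 20:
        raise ValueError(f"grade out of range: {grade}")

def to_binary_level(grade):
    """Table 4: Pass for 10-20, Fail for 0-9."""
    check_grade(grade)
    return "Pass" if grade >= 10 else "Fail"

def to_five_level(grade):
    """Table 5: A 15-20, B 13-14, C 10-12, D 8-9, F 0-7."""
    check_grade(grade)
    if grade >= 15:
        return "A"
    if grade >= 13:
        return "B"
    if grade >= 10:
        return "C"
    if grade >= 8:
        return "D"
    return "F"

grades = [0, 7, 8, 9, 10, 12, 13, 14, 15, 20]
print([to_binary_level(g) for g in grades])
print([to_five_level(g) for g in grades])
```

Note that the Pass boundary at 10 coincides with the C/D boundary, so every Pass maps to A, B or C and every Fail to D or F.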

e) Data transformation

The features are normalized with linear scaling to avoid bias towards heavily weighted attributes.
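
The paper does not give the scaling formula. A common form of linear scaling is min-max normalization to the range [0, 1], which is assumed in the sketch below; the sample values are illustrative.

```python
def min_max_scale(values):
    """Linearly map values to [0, 1]: (v - min) / (max - min)."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Illustrative columns on very different scales (ranges as in Tables 1 and 2).
ages = [15, 18, 22]
absences = [0, 4, 93]

scaled_ages = min_max_scale(ages)
scaled_absences = min_max_scale(absences)
print(scaled_ages, scaled_absences)
```

After scaling, a feature such as absences (0 to 93) no longer outweighs a feature such as age (15 to 22) purely because of its numeric range.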

f) Feature selection

Next, Boruta feature selection was performed to remove irrelevant features.
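
The core idea behind Boruta is to compare each real feature against shuffled "shadow" copies that cannot carry any signal. The sketch below illustrates only that comparison and is not the Boruta algorithm itself: it substitutes absolute Pearson correlation for random forest importance and a simple majority vote for Boruta's statistical test. All data and names are made up.

```python
import random

def abs_correlation(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5

def shadow_feature_selection(columns, target, n_rounds=20, seed=0):
    """Keep features that beat the best shuffled shadow in most rounds."""
    rng = random.Random(seed)
    importance = {name: abs_correlation(values, target)
                  for name, values in columns.items()}
    hits = {name: 0 for name in columns}
    for _ in range(n_rounds):
        best_shadow = 0.0
        for values in columns.values():
            shadow = list(values)
            rng.shuffle(shadow)  # shuffling breaks any link to the target
            best_shadow = max(best_shadow, abs_correlation(shadow, target))
        for name in columns:
            if importance[name] > best_shadow:
                hits[name] += 1
    return [name for name in columns if hits[name] > n_rounds / 2]

# Made-up data: "G2" tracks the target closely, "noise" alternates regardless of it.
target = [i % 21 for i in range(60)]
columns = {
    "G2": [min(20, t + (i % 2)) for i, t in enumerate(target)],
    "noise": [i % 2 for i in range(60)],
}
selected = shadow_feature_selection(columns, target)
print(selected)
```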

g) Classification

Three supervised machine learning techniques were implemented, namely support vector machine, naïve Bayes and multilayer perceptron, with 60–40 and 50–50 train-test splits and 10-fold cross-validation. The experiments cover four categories: student background, student lifestyle, student history of grades (history) and all features. Experiments are carried out on binary and five-level classification. Binary classification indicates fail or pass, while five-level classification assigns the student scores F, D, C, B and A.
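
The evaluation protocol can be sketched with two small helpers, one for a shuffled train-test split and one for k-fold partitioning. Both are hand-written stand-ins with hypothetical names; the paper does not say whether the splits were shuffled or stratified, so the shuffling and the seed here are assumptions.

```python
import random

def split_records(records, train_fraction, seed=0):
    """Shuffle a copy of the records, then cut it into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

records = list(range(1044))  # placeholders for the 1044 student records

train_60, test_40 = split_records(records, 0.6)
train_50, test_50 = split_records(records, 0.5)
folds = list(k_fold_indices(len(records), k=10))
print(len(train_60), len(test_40), len(train_50), len(test_50), len(folds))
```

With 1044 records, the 60–40 split gives 626 training and 418 test records, and 10-fold cross-validation gives four folds of 105 and six folds of 104.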

GridSearchCV is applied to perform hyperparameter tuning. The performance metrics are accuracy, precision, recall and F1-score. The experimental results are shown in Tables 6 to 11.
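
Grid search evaluates every combination in a parameter grid and keeps the best-scoring one. The sketch below shows that exhaustive loop in plain Python rather than scikit-learn's GridSearchCV; the toy threshold classifier, the validation rows and the grid are all made up, since the paper does not list the hyperparameters it tuned.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy validation rows (illustrative only): (G1, G2, passed final exam).
validation = [
    (6, 7, False), (8, 9, False), (9, 10, True),
    (11, 10, True), (12, 13, True), (10, 8, False),
]

def score_threshold_rule(params):
    """Accuracy of the rule 'predict pass if the chosen grade >= threshold'."""
    column = {"G1": 0, "G2": 1}[params["feature"]]
    correct = sum((row[column] >= params["threshold"]) == row[2] for row in validation)
    return correct / len(validation)

param_grid = {"feature": ["G1", "G2"], "threshold": [8, 10, 12]}
best_params, best_score = grid_search(param_grid, score_threshold_rule)
print(best_params, best_score)
```

In GridSearchCV the score for each combination is a cross-validated average rather than a single validation score, but the search over the grid is the same exhaustive loop.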

Table 6. SVM (Binary levels).

Metrics	Background	Lifestyle	History	All
60 Train – 40 Test
Accuracy	0.768	0.789	0.899	0.895
Precision	0.768	0.804	0.928	0.934
Recall	1.000	0.958	0.941	0.929
F1 Score	0.869	0.874	0.934	0.931
50 Train – 50 Test
Accuracy	0.772	0.798	0.908	0.900
Precision	0.772	0.809	0.932	0.931
Recall	0.999	0.966	0.948	0.941
F1 Score	0.871	0.880	0.940	0.936
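
The four metrics follow from the counts of true and false positives and negatives. The sketch below computes them for a small made-up prediction list, treating "Pass" as the positive class (an assumption), and shows that F1 is the harmonic mean of precision and recall, using two precision and recall pairs from Table 6.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def binary_metrics(y_true, y_pred, positive="Pass"):
    """Accuracy, precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }

# Made-up labels for illustration.
y_true = ["Pass", "Pass", "Pass", "Pass", "Fail", "Fail"]
y_pred = ["Pass", "Pass", "Pass", "Fail", "Pass", "Fail"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)

# F1 recomputed from Table 6 precision and recall (Background 60-40, History 50-50).
print(round(f1_score(0.768, 1.000), 3), round(f1_score(0.932, 0.948), 3))
```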

Table 7. SVM (5 Levels).

Metrics	Background	Lifestyle	History	All
60 Train – 40 Test
Accuracy	0.394	0.388	0.742	0.716
Precision	0.195	0.322	0.750	0.715
Recall	0.394	0.388	0.742	0.716
F1 Score	0.246	0.286	0.742	0.708
50 Train – 50 Test
Accuracy	0.389	0.381	0.729	0.708
Precision	0.191	0.329	0.735	0.708
Recall	0.389	0.381	0.729	0.708
F1 Score	0.230	0.300	0.711	0.699

Table 8. NB (Binary levels).

Metrics	Background	Lifestyle	History	All
60 Train – 40 Test
Accuracy	0.760	0.785	0.903	0.891
Precision	0.775	0.826	0.954	0.948
Recall	0.968	0.913	0.918	0.907
F1 Score	0.861	0.867	0.935	0.927
50 Train – 50 Test
Accuracy	0.761	0.787	0.907	0.894
Precision	0.779	0.830	0.958	0.948
Recall	0.964	0.911	0.922	0.912
F1 Score	0.861	0.868	0.939	0.930

Table 9. NB (5 Levels).

Metrics	Background	Lifestyle	History	All
60 Train – 40 Test
Accuracy	0.373	0.250	0.748	0.515
Precision	0.283	0.263	0.752	0.515
Recall	0.373	0.250	0.748	0.515
F1 Score	0.306	0.166	0.747	0.497
50 Train – 50 Test
Accuracy	0.370	0.257	0.719	0.542
Precision	0.284	0.261	0.747	0.537
Recall	0.370	0.257	0.739	0.542
F1 Score	0.310	0.173	0.740	0.523

Table 10. MLP (Binary levels).

Metrics	Background	Lifestyle	History	All
60 Train – 40 Test
Accuracy	0.767	0.787	0.899	0.886
Precision	0.769	0.803	0.928	0.921
Recall	0.995	0.957	0.941	0.934
F1 Score	0.868	0.873	0.934	0.927
50 Train – 50 Test
Accuracy	0.767	0.785	0.906	0.886
Precision	0.773	0.791	0.932	0.919
Recall	0.987	0.982	0.948	0.936
F1 Score	0.867	0.786	0.940	0.927

Table 11. MLP (5 Levels).

Metrics	Background	Lifestyle	History	All
60 Train – 40 Test
Accuracy	0.386	0.383	0.744	0.715
Precision	0.236	0.305	0.751	0.707
Recall	0.386	0.383	0.744	0.715
F1 Score	0.264	0.301	0.735	0.700
50 Train – 50 Test
Accuracy	0.371	0.375	0.720	0.705
Precision	0.213	0.361	0.721	0.708
Recall	0.391	0.385	0.720	0.715
F1 Score	0.239	0.326	0.692	0.706

SVM obtained the highest accuracy, with scores of 77%, 80%, 91% and 90% on background, lifestyle, history of grades and all features respectively in the 50–50 train–test split for binary classification (pass or fail). SVM also obtained the highest accuracy for the five-class classification (grade A, B, C, D and F), with 39%, 38%, 73% and 71% for the four categories respectively. Based on the results, the history of student grades contributes significantly to predicting student performance: the classification rates obtained on this category are the highest among the four categories for each classifier. This finding is consistent with the observations of Hwang et al., ¹⁶ Mega et al. ¹⁷ and Waheed et al., ¹⁸ that students’ performance is highly related to the history of grades.

Table 12 shows the comparison of our models with other research work in 50–50 train–test splits for binary classification (pass or fail) on the dataset with population of two Portuguese secondary schools.

Table 12. Comparison of our models with other research work on two Portuguese secondary schools.

Model and features	Mathematics (395 students)	Portuguese (649 students)	Mathematics and Portuguese (1044 students)
SVM on all features [our model]	-	-	0.90
SVM on history of grades [our model]	-	-	0.91
SVM on all features ⁴	0.89	-	-
Naive predictor on all features ¹⁷	0.92	0.90	-
SVM on all features ¹⁷	0.86	0.91	-

Conclusions

The paper presented predictive modelling of student performance based on four categories of features. Based on the results, the history of student grades contributes significantly to predicting student performance. SVM obtained the highest accuracy, with scores of 77%, 80%, 91% and 90% on background, lifestyle, history of grades and all features respectively in the 50-50 train-test split for binary classification (pass or fail). SVM also obtained the highest accuracy for five-class classification (grade A, B, C, D and F), with 39%, 38%, 73% and 71% for the four categories respectively. The results show that the history of grades has a significant influence on student performance. The study looks at data only from Portugal and may not reflect the general case. Future work will include more datasets from different countries. Other classifiers will also be explored and investigated.

Data availability

Underlying data

Kaggle: A machine learning approach to predictive modelling of student performance

https://www.kaggle.com/larsen0966/student-performance-data-set

and

https://archive.ics.uci.edu/ml/datasets/Student+Performance

Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).

Ethics approval

Ethical Approval Number: EA1612021 (From Technology Transfer Office (TTO), Multimedia University).

Funding Statement

The author(s) declared that no grants were involved in supporting this work.

[version 2; peer review: 2 approved]

References

1. Daud A, Aljohani NR, Abbasi RA, et al.: Predicting student performance using advanced learning analytics. Proceedings of the 26th international conference on world wide web companion. 2017, April; (pp.415–421). doi: 10.1145/3041021.3054164
2. Sokkhey P, Okazaki T: Comparative Study of Prediction Models on High School Student Performance in Mathematics. 34th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC). 2019, June; (pp.1–4). IEEE. doi: 10.1109/ITC-CSCC.2019.8793331
3. Akhtar Z: Socio-economic status factors effecting the students achievement: a predictive study. Int. J. Soc. Sci. Educ. 2012;2(1):281–287.
4. Amazona MV, Hernandez AA: Modelling student performance using data mining techniques: Inputs for academic program development. Proceedings of the 2019 5th International Conference on Computing and Data Engineering. 2019, May; (pp.36–40). doi: 10.1145/3330530.3330544
5. Hussain S, Dahan NA, Ba-Alwib FM, et al.: Educational data mining and analysis of students’ academic performance using WEKA. Indones. J. Electr. Eng. Comput. Sci. 2018;9(2):447–459.
6. Chung JY, Lee S: Dropout early warning systems for high school students using machine learning. Child. Youth Serv. Rev. 2019;96:346–353.
7. Nauta MM, Saucier AM, Woodard LE: Interpersonal influences on students’ academic and career decisions: The impact of sexual orientation. Career Dev. Q. 2001;49:352–362.
8. Lee PC, Lee MJ, Dopson LR: Who influences college students’ career choices? An empirical study of hospitality management students. J. Hosp. Tour. Educ. 2019;31:74–86.
9. Kim SY, Ahn T, Fouad N: Family influence on Korean students’ career decisions: A social cognitive perspective. J. Career Assess. 2016;24:513–526. doi: 10.1177/1069072715599403
10. Wang Z, Liang G, Chen H: Tool for Predicting College Student Career Decisions: An Enhanced Support Vector Machine Framework. Appl. Sci. 2022;12(9):4776.
11. Komorowski M, Marshall DC, Salciccioli JD, et al.: Exploratory data analysis. Secondary analysis of electronic health records. 2016;185–203. doi: 10.1007/978-3-319-43742-2_15
12. Tang R, Zhang X: CART Decision Tree Combined with Boruta Feature Selection for Medical Data Classification. 2020 5th IEEE International Conference on Big Data Analytics (ICBDA). 2020, May; (pp.80–84). IEEE. doi: 10.1109/ICBDA49040.2020.9101199
13. Sekeroglu B, Dimililer K, Tuncal K: Student performance prediction and classification using machine learning algorithms. Proceedings of the 2019 8th International Conference on Educational and Information Technology. 2019, March; (pp.7–11). doi: 10.1145/3318396.3318419
14. Cortez P, Silva A: Student Performance Data Set. 2014. cited as 2 October.
15. Cortez P, Silva A: Using data mining to predict secondary school student performance. 15th European Concurrent Engineering Conference 2008, ECEC 2008-5th Future Business Technology Conference, FUBUTEC 2008. 2008;2003(2000):5–12
16. Hwang A, Kessler EH, Francesco AM: Student networking behavior, culture, and grade performance: An empirical study and pedagogical recommendations. Acad. Manag. Learn. Edu. 2004;3(2):139–150. doi: 10.5465/amle.2004.13500532
17. Mega C, Ronconi L, De Beni R: What makes a good student? How emotions, self-regulated learning, and motivation contribute to academic achievement. J. Educ. Psychol. 2014;106(1):121–131. doi: 10.1037/a0033546
18. Waheed H, Hassan SU, Aljohani NR, et al.: Predicting academic performance of students from VLE big data using deep learning models. Comput. Hum. Behav. 2020;104:106189. doi: 10.1016/j.chb.2019.106189

F1000Res. 2022 Jun 13. doi: 10.5256/f1000research.134172.r138617

Reviewer response for version 2

Huiling Chen ¹

The authors have solved all the comments, now the paper is ready for indexing.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

F1000Res. 2022 May 24. doi: 10.5256/f1000research.134172.r138618

Reviewer response for version 2

Sadiq Hussain ¹

All my queries and concerns were handled and the article may be accepted for indexing.

Is the work clearly and accurately presented and does it cite the current literature?

Partly

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Partly

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Educational Data Mining, Medical Analytics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

F1000Res. 2022 May 24.

Hu NG ¹

Dearest Reviewer,

Thank you for the comments. Thank you for everything.

F1000Res. 2022 May 12. doi: 10.5256/f1000research.76815.r136540

Reviewer response for version 1

Huiling Chen ¹

The authors propose a data mining approach to identify significant factors and predict student performance, based on two datasets collected from two secondary schools in Portugal. The overall structure of the article is also reasonable. The interpretation and description of the experimental results are also detailed. The experimental results show that the proposed method is excellent in terms of accuracy, precision, recall, and F1 score. In my opinion, this method has some practical value.

However, this manuscript suffers from a number of weak points and should be further improved before being considered for indexing. Let’s elaborate on some of them:

1. In the results of the Abstract, the authors summarize the classification results of SVM on the dataset. However, we do not see the impact and contribution of the proposed method on the experimental results.
2. Introduction - include the description of the innovations, contributions, and the structure of the article.
3. In the introduction, I also suggest the authors make a comprehensive investigation of machine learning methods, such as the works by Wang et al. (2022 ¹) and Wang et al. (2022 ²), and give an analysis of the existing works to make the whole work more in-depth.
4. The title of the fourth part of the paper, “Methodology”, should be changed to “Methodology and Experimental Results”.
5. The number of students aged 20 and 21 is not given in Figure 2. Is it a problem with the data set?
6. In the experimental results in Tables 6-11, there are differences in the results obtained by different classifiers. What is the theoretical basis of the paper for the choice of classifier?
7. In the conclusion, the contributions and flaws of the proposed method are not discussed.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

data mining

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

References

1. Wang Z, Liang G, Chen H: Tool for Predicting College Student Career Decisions: An Enhanced Support Vector Machine Framework. Applied Sciences. 2022;12(9):4776. doi: 10.3390/app12094776
2. Wang et al.: Lupus nephritis diagnosis using enhanced moth flame algorithm with support vector machines. Comput Biol Med. 2022;145:105435. doi: 10.1016/j.compbiomed.2022.105435

F1000Res. 2022 May 17.

Hu NG ¹

Dear Prof Huiling Chen,

We are greatly appreciative of the insightful comments and helpful suggestions that you have provided.

The following are our responses to the issues that you have highlighted:

Your comment:

In the results of the Abstract, the authors summarize the classification results of SVM on the dataset. However, we do not see the impact and contribution of the proposed method on the experimental results.

Our response:

Thank you for the comments. From this research work, we found that the history of grades has a significant influence on student performance. This is the main impact and contribution.

Your comment:

2. Introduction - include the description of the innovations, contributions, and the structure of the article.

Our response:

The last 2 paragraphs of the Introduction have been rewritten to reflect this.

The paper is structured as follows: Related Works, Methodology and Results, followed by Conclusions.

Your comment:

3. In the introduction, I also suggest the authors make a comprehensive investigation on the machine learning method such as the works by Wang et al. (2022 ¹ ) and Wang et al. (2022 ² ), in the literature in the introduction part and give analysis of the existing works to make the whole work more in-depth.

Our response:

We have added one of the papers recommended into the Related Works. Thank you for the recommendation.

Wang, Z., Liang, G., & Chen, H. (2022). Tool for Predicting College Student Career Decisions: An Enhanced Support Vector Machine Framework. Applied Sciences, 12(9), 4776.

Your comment:

4. The title of the fourth part of the paper, “Methodology”, should be changed to “Methodology and Experimental Results”.

Our response:

The title has been rephrased to ‘Methodology and Results’.

Your comment:

5. The number of students aged 20 and 21 is not given in Figure 2, is it a problem with the data set?

Our response:

Figure 2 had been edited to show the number of students aged 20 and aged 21.

Your comment:

6. With the experimental results in Table 6-11, there are differences in the results obtained by different classifiers. What is the theoretical basis of the paper for the choice of classifier?

Our response:

Based on previous work, we found that these classifiers work well for our use cases, which is why only these were applied in this work. We will compare other classifiers in future work.

Your comment:

7. In the conclusion, the contributions and flaws of the proposed method are not discussed.

Our response:

The conclusions have been revised to include the significant results, contribution and the weakness, which will be addressed in future work.

F1000Res. 2021 Nov 18. doi: 10.5256/f1000research.76815.r99883

Reviewer response for version 1

Sadiq Hussain ¹

The paper presented a predictive model on student performance based on the data from schools in Portugal. Student grades are the most important feature observed in the study. The study is complete from an experimental perspective, but it needs improvement in related works and the introduction section. After these modifications, the study will be approvable.

a) There are grammatical errors in this paper, which should be revised.
b) The literature review should be more in detail and add at least five more papers.
c) The conclusion should also be elaborated a little more. The major findings in their study should be discussed. For example, out of the classifiers applied, which classifier demonstrated the best accuracy? An evaluation of the methodology that the authors deployed would be welcome.
d) The introduction section should also focus on the research problem. Why this kind of research is beneficial, and to whom? How can management take advantage of it and how can companies evaluate these results to find the best students/universities for job placements?
e) Was the dataset balanced? What was the imbalance ratio?
f) Kindly provide justification for using the applied classifiers only: why were other classifiers not considered?
g) What were the most influential features in student performance? Was there any unnecessary feature that was taken into account?

Is the work clearly and accurately presented and does it cite the current literature?

Partly

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Partly

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Educational Data Mining, Medical Analytics

F1000Res. 2022 May 17.

Hu NG ¹

Dear Prof Sadiq Hussain,

We are greatly appreciative of the insightful comments and helpful suggestions that you have provided.

The following are our responses to the issues that you have highlighted:

Your comment:

a) There are grammatical errors in this paper, which should be revised.

Our response:

The paper has been proofread for grammatical errors. Thank you for pointing this out.

Your comment:

b) The literature review should be more in detail and add at least five more papers.

Our response:

Six papers on student dropout, interpersonal influences, as well as career decisions have been added to the Literature Review.

Hussain, S., Dahan, N. A., Ba-Alwib, F. M., & Ribata, N. (2018). Educational data mining and analysis of students’ academic performance using WEKA. Indonesian Journal of Electrical Engineering and Computer Science, 9(2), 447-459.
Chung, J.Y.; Lee, S. Dropout early warning systems for high school students using machine learning. Child. Youth Serv. Rev. 2019, 96, 346–353.
Nauta, M.M.; Saucier, A.M.; Woodard, L.E. Interpersonal influences on students’ academic and career decisions: The impact of sexual orientation. Career Dev. Q. 2001, 49, 352–362.
Lee, P.C.; Lee, M.J.; Dopson, L.R. Who influences college students’ career choices? An empirical study of hospitality management students. J. Hosp. Tour. Educ. 2019, 31, 74–86.
Kim, S.-Y.; Ahn, T.; Fouad, N. Family influence on Korean students’ career decisions: A social cognitive perspective. J. Career Assess. 2016, 24, 513–526.
Wang, Z., Liang, G., & Chen, H. (2022). Tool for Predicting College Student Career Decisions: An Enhanced Support Vector Machine Framework. Applied Sciences, 12(9), 4776.

Your comment:

c) The conclusion should also be elaborated a little more. The major findings in their study should be discussed. For example, out of the classifiers applied, which classifier demonstrated the best accuracy? An evaluation of the methodology that the authors deployed would be welcome.

Our response:

The conclusion has been elaborated with the results. The following has been added:

SVM obtained the highest accuracy with scores of 77%, 80%, 91% and 90% on background, lifestyle, history of grades and all features respectively in 50-50 train-test splits for binary classification (pass or fail). SVM also obtained highest accuracy for five class classification (grade A, B, C, D and F) with 39%, 38%, 73% and 71% for the four categories respectively.
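
The evaluation protocol behind these figures (a 50-50 train-test split scored by accuracy) can be illustrated with a minimal sketch. It uses only the Python standard library. The function names, the toy labels and the majority-class placeholder model are assumptions made for illustration. It does not reproduce the SVM, NB or MLP models of the study, and it only shows how a split and an accuracy score of this kind might be computed.

```python
import random
from collections import Counter, defaultdict

def stratified_half_split(labels, seed=0):
    """Return (train_idx, test_idx) with each class split roughly 50-50."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        half = len(idx) // 2
        train.extend(idx[:half])
        test.extend(idx[half:])
    return sorted(train), sorted(test)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels, invented for illustration only (not the study's data).
labels = ["pass"] * 14 + ["fail"] * 6
train_idx, test_idx = stratified_half_split(labels)

# Placeholder model (assumption): always predict the most common training label.
majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
predictions = [majority for _ in test_idx]
baseline_acc = accuracy([labels[i] for i in test_idx], predictions)
print(f"majority-class baseline accuracy: {baseline_acc:.2f}")
```

A majority-class baseline of this kind gives a reference point for reading reported accuracies, since on an imbalanced dataset a classifier can reach a high score simply by predicting the dominant class.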

Your comment:

d) The introduction section should also focus on the research problem. Why this kind of research is beneficial, and to whom? How can management take advantage of it and how can companies evaluate these results to find the best students/universities for job placements?

Our response:

The following statements have been added to the Introduction to address this:

This will assist educators in forming corrective or remedial actions that can help to improve student performance. In addition, this may also assist in formulating curricula that direct students to the career pathways most suitable for them.

For management and companies, this is not in the scope of this research and will be explored in future work.

Your comment:

e) Was the dataset balanced? What was the imbalance ratio?

Our response:

The original datasets are not balanced, owing to the distribution of grades. To address this, we discretized the grades into two groupings, binary levels and 5 levels, following the work of:

Hussain, S., Dahan, N. A., Ba-Alwib, F. M., & Ribata, N. (2018). Educational data mining and analysis of students’ academic performance using WEKA. Indonesian Journal of Electrical Engineering and Computer Science, 9(2), 447-459.
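
The discretization into binary and five levels, together with a simple imbalance-ratio check, can be sketched as follows. This is a minimal illustration under stated assumptions: grades are taken to lie on a 0-20 scale, the pass mark of 10 and the band edges for A to F are assumed values rather than the cutoffs in the paper's Tables 4 and 5, and the function names and toy grades are hypothetical.

```python
from collections import Counter

def to_binary(grade, pass_mark=10):
    """Map a 0-20 final grade to 'pass' or 'fail' (pass_mark is an assumed cutoff)."""
    return "pass" if grade >= pass_mark else "fail"

def to_five_levels(grade):
    """Map a 0-20 final grade to A, B, C, D or F using assumed band edges."""
    if grade >= 16:
        return "A"
    if grade >= 14:
        return "B"
    if grade >= 12:
        return "C"
    if grade >= 10:
        return "D"
    return "F"

def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy grades, invented for illustration only (not taken from the datasets).
grades = [0, 5, 8, 10, 11, 12, 13, 14, 15, 17, 18, 20]
binary = [to_binary(g) for g in grades]
five = [to_five_levels(g) for g in grades]

print(Counter(binary), imbalance_ratio(binary))
print(Counter(five), imbalance_ratio(five))
```

An imbalance ratio of 1.0 indicates perfectly balanced classes, and larger values indicate a more skewed label distribution. Reporting this ratio for each discretization would answer the reviewer's question directly.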

Your comment:

f) Kindly provide justification for using the applied classifiers only: why were other classifiers not considered?

Our response:

In previous work, we found that these classifiers performed well for our use cases, which is why only these were applied here. We will compare them against other classifiers in future work.

Your comment:

g) What were the most influential features in student performance? Was there any unnecessary feature that was taken into account?

Our response:

The most influential features come from the history of student grades (Table 3). The following statement can be found in the discussions of the classification result:

Based on the results, history of student grades shows significant contribution to a good student performance, where the classification rates obtained are the highest among the four respective categories in each respective classifier.

No unnecessary features were taken into account.

Associated Data

Data Availability Statement

Underlying data

Kaggle: A machine learning approach to predictive modelling of student performance

https://www.kaggle.com/larsen0966/student-performance-data-set

and

https://archive.ics.uci.edu/ml/datasets/Student+Performance

Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).

1. Daud A, Aljohani NR, Abbasi RA, et al.: Predicting student performance using advanced learning analytics. Proceedings of the 26th international conference on world wide web companion. 2017, April; (pp.415–421). doi: 10.1145/3041021.3054164

2. Sokkhey P, Okazaki T: Comparative Study of Prediction Models on High School Student Performance in Mathematics. 34th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC). 2019, June; (pp.1–4). IEEE. doi: 10.1109/ITC-CSCC.2019.8793331

3. Akhtar Z: Socio-economic status factors effecting the students achievement: a predictive study. Int. J. Soc. Sci. Educ. 2012;2(1):281–287.

4. Amazona MV, Hernandez AA: Modelling student performance using data mining techniques: Inputs for academic program development. Proceedings of the 2019 5th International Conference on Computing and Data Engineering. 2019, May; (pp.36–40). doi: 10.1145/3330530.3330544

5. Hussain S, Dahan NA, Ba-Alwib FM, et al.: Educational data mining and analysis of students’ academic performance using WEKA. Indones. J. Electr. Eng. Comput. Sci. 2018;9(2):447–459.

6. Chung JY, Lee S: Dropout early warning systems for high school students using machine learning. Child. Youth Serv. Rev. 2019;96:346–353.

7. Nauta MM, Saucier AM, Woodard LE: Interpersonal influences on students’ academic and career decisions: The impact of sexual orientation. Career Dev. Q. 2001;49:352–362.

8. Lee PC, Lee MJ, Dopson LR: Who influences college students’ career choices? An empirical study of hospitality management students. J. Hosp. Tour. Educ. 2019;31:74–86.

9. Kim SY, Ahn T, Fouad N: Family influence on Korean students’ career decisions: A social cognitive perspective. J. Career Assess. 2016;24:513–526. doi: 10.1177/1069072715599403

10. Wang Z, Liang G, Chen H: Tool for Predicting College Student Career Decisions: An Enhanced Support Vector Machine Framework. Appl. Sci. 2022;12(9):4776.

11. Komorowski M, Marshall DC, Salciccioli JD, et al.: Exploratory data analysis. Secondary analysis of electronic health records. 2016;185–203. doi: 10.1007/978-3-319-43742-2_15

12. Tang R, Zhang X: CART Decision Tree Combined with Boruta Feature Selection for Medical Data Classification. 2020 5th IEEE International Conference on Big Data Analytics (ICBDA). 2020, May; (pp.80–84). IEEE. doi: 10.1109/ICBDA49040.2020.9101199

13. Sekeroglu B, Dimililer K, Tuncal K: Student performance prediction and classification using machine learning algorithms. Proceedings of the 2019 8th International Conference on Educational and Information Technology. 2019, March; (pp.7–11). doi: 10.1145/3318396.3318419

14. Cortez P, Silva A: Student Performance Data Set. 2014. cited as 2 October.

15. Cortez P, Silva A: Using data mining to predict secondary school student performance. 15th European Concurrent Engineering Conference 2008, ECEC 2008-5th Future Business Technology Conference, FUBUTEC 2008. 2008;2003(2000):5–12.

16. Hwang A, Kessler EH, Francesco AM: Student networking behavior, culture, and grade performance: An empirical study and pedagogical recommendations. Acad. Manag. Learn. Edu. 2004;3(2):139–150. doi: 10.5465/amle.2004.13500532

17. Mega C, Ronconi L, De Beni R: What makes a good student? How emotions, self-regulated learning, and motivation contribute to academic achievement. J. Educ. Psychol. 2014;106(1):121–131. doi: 10.1037/a0033546

18. Waheed H, Hassan SU, Aljohani NR, et al.: Predicting academic performance of students from VLE big data using deep learning models. Comput. Hum. Behav. 2020;104:106189. doi: 10.1016/j.chb.2019.106189
