Knowledge Discovery for Higher Education Student Retention Based on Data Mining: Machine Learning Algorithms and Case Study in Chile

Carlos A Palacios; José A Reyes-Suárez; Lorena A Bearzotti; Víctor Leiva; Carolina Marchant

doi:10.3390/e23040485

. 2021 Apr 20;23(4):485. doi: 10.3390/e23040485

Knowledge Discovery for Higher Education Student Retention Based on Data Mining: Machine Learning Algorithms and Case Study in Chile

Carlos A Palacios ^1,², José A Reyes-Suárez ³, Lorena A Bearzotti ⁴, Víctor Leiva ^5,^*, Carolina Marchant ^6,⁷

Editor: Pentti Nieminen

PMCID: PMC8072774 PMID: 33923879

Abstract

Data mining is employed to extract useful information and to detect patterns from often large data sets, closely related to knowledge discovery in databases and data science. In this investigation, we formulate models based on machine learning algorithms to extract relevant information predicting student retention at various levels, using higher education data and specifying the relevant variables involved in the modeling. Then, we utilize this information to help the process of knowledge discovery. We predict student retention at each of three levels during their first, second, and third years of study, obtaining models with an accuracy that exceeds 80% in all scenarios. These models allow us to adequately predict the level when dropout occurs. Among the machine learning algorithms used in this work are: decision trees, k-nearest neighbors, logistic regression, naive Bayes, random forest, and support vector machines, of which the random forest technique performs the best. We detect that secondary educational score and the community poverty index are important predictive variables, which have not been previously reported in educational studies of this type. The dropout assessment at various levels reported here is valid for higher education institutions around the world with similar conditions to the Chilean case, where dropout rates affect the efficiency of such institutions. Having the ability to predict dropout based on student’s data enables these institutions to take preventative measures, avoiding the dropouts. In the case study, balancing the majority and minority classes improves the performance of the algorithms.

Keywords: data analytics, databases, data science, Friedman test, socioeconomic index, university dropout

1. Symbology, Introduction, and Bibliographical Review

In this section, abbreviations, acronyms, notations, and symbols used in our work are defined in Table 1. In addition, we provide here the introduction, the bibliographical review on the topic about related works, and an overview of the models utilized together with the description of the sections considered in this paper.

Table 1.

Abbreviations, acronyms, notations, and symbols employed in the present document.

Abbreviations/Acronyms		Notations/Symbols
ANN	artificial neural networks	∼	distributed as
CLU	clustering	k	number of nearest neighbors
CP	community poverty index	n	sample size
DT	decision trees	$l = β_{0} + β_{1} x$	log-odd
EDM	educational data mining	$o = b^{β_{0} + β_{1} x}$	odd
EM	ensemble models	$β_{0}, β_{1}$	regression coefficients
FN	false negative	X	independent variable or feature
FP	false positive	Y	dependent variable or response
HE	higher education	$p = P (Y = 1)$	probability function of LR
IG	information gain	$= \frac{exp (β_{0} + β_{1} x)}{exp (β_{0} + β_{1} x) + 1}$
KNN	k-nearest neighbors	$= \frac{1}{1 + exp (- β_{0} - β_{1} x)}$
LR	logistic regression	$P (Y = c ∣ X = x)$	probability Y given $X$
ML	machine learning	$\frac{P (Y = c) P (X = x ∣ Y = c)}{P (X = x)}$	Bayes conditional probability
NB	naive Bayes	$X = (X_{1}, \dots, X_{p})$	vector of independent variables
NEM	secondary educational score	$(x_{1}, Y_{1}), \dots, (x_{n}, Y_{n})$	instances
	(notas enseñanza media)	c	number of classes
PSU	university selection test	$∥ x ∥$	norm of a point x
	(prueba selección universitaria)	s	number of folds in cross-validation
RAM	random access memory	$w$	normal vector to the hyperplane
RF	random forest	TP/(TP + FP)	precision
SVM	support vector machines	$κ = (p_{a} - p_{e}) / (1 - p_{e})$	$κ$ -statistic
TF	true negative	$p_{a}$	% of agreement classifier/ground truth
TP	true positive	$p_{e}$	agreement chance
UCM	Catholic University of Maule	$Q = \frac{12 n}{c (c + 1)} \sum_{j = 1}^{c} {({\bar{r}}_{\cdot j} - \frac{c + 1}{2})}^{2}$	Friedman statistic
	(Universidad Católica del Maule)	${x_{i j}}_{n \times c}$	$n \times c$ data matrix
SMOTE	synthetic minority	${r_{i j}}_{n \times c}$	$n \times c$ rank matrix
	over-sampling technique	${\bar{r}}_{\cdot j} = \frac{1}{n} \sum_{i = 1}^{n} r_{i j}$	rank average of column j
KDD	knowledge discovery	$P (χ_{c}^{2} \geq Q)$	p-value
	in databases	$χ_{c}^{2}$	chi-squared distribution
			with c degrees of freedom

Reference	Instances	Technique(s)	Confusion Matrix	Accuracy	Institution	Country
[7,8]	16,066	ANN, DT, SVM, LR	Yes	87.23%	Oklahoma State
					University	USA
[23]	713	DT, NB, LR, EM, RF	Yes	80%	Eindhoven University
					of Technology	Netherlands
[24]	N/A	ANN, SVM, EM	No	N/A	National Technical
					University of Athens	Greece
[25]	8025	DT, NB	Yes	79%	Kent State
					University	USA
[26]	452	ANN, DT, KNN	Yes	N/A	University
					of Chile	Chile
[27]	6078	NN, NB	Yes	N/A	Roma Tre
					University	Italy
[28]	17,910	RF, DT	Yes	N/A	University
					of Duisburg	Germany
[29]	N/A	LR, DT, ANN, EM	No	N/A	N/A
					N/A	USA
[30]	1500	CLU, SVM, RF	No	N/A	University
					of Bologna	Italy
[31]	6470	DT	No	87%	Mugla Sitki
					Kocman University	Turkey
[32]	811	EM, NB, KNN, ANN	No	N/A	Mae Fah
					Luang University	Thailand
[33]	3877	LR, SVM, DT	No	N/A	Purdue
					University	USA
[34]	456	ANN, DT	No	N/A	University of
					Computer Science	Cuba
[35]	1359	NB, SVM	Yes	87%	Federal University
					of Rio de Janeiro	Brazil
[36]	N/A	N/A	No	61%	Unitec Institute	New
					of Technology	Zealand
[37]	22,099	LR, DT, ANN	No	N/A	several
					universities	USA
[38]	1055	C45, RF, CART, SVM	No	86.6%	University
					of Oviedo	Spain
[39]	6500	DT, KNN	No	98.98%	Technical University
					of Izúcar	Mexico
[40]	N/A	DT	Yes	N/A	N/A
					N/A	India
[41]	6690	ANN, LR, DT	No	76.95%	Arizona State
					University	USA

Attributes	Features
Demographic background	Name, age, gender.
Geographic origin	Place of origin, province.
Socioeconomic index	CP index.
School performance	High school grades, secondary educational score (NEM), PSU score.
University performance	Number of approved courses, failed courses, approved credits, failed credits.
Financial indicators	Economic quintile, family income.
Others	Readmissions, program, application preference, selected/waiting list, health insurance.

		Global		First Level		Second Level		Third Level
Rank	IG	Variable	IG	Variable	IG	Variable	IG	Variable
1	0.430	NEM	0.511	NEM	0.357	NEM	0.098	Marks 3rd semester
2	0.385	CP index	0.468	CP index	0.220	CP index	0.087	Marks 4th semester
3	0.209	Program	0.286	School	0.211	School	0.084	Approved courses 3rd semester
4	0.204	School	0.190	Program	0.211	Approved courses 2nd semester	0.083	Approved courses 2nd semester
5	0.105	PSU specific topic	0.112	PSU specific topic	0.195	Approved credits 2nd semester	0.074	School
6	0.068	Quintile	0.110	PSU language	0.183	Approved credits 1st semester	0.069	Marks 1st semester
7	0.059	Gender	0.098	Quintile	0.176	Approved courses 1st semester	0.067	Approved courses 4th semester
8	0.051	Family income	0.056	Age	0.163	Marks 1st semester	0.066	Approved courses 1st semester
9	0.041	Age	0.053	Educational area	0.149	Program	0.063	Marks 2nd semester
10	0.037	Educational area	0.047	PSU weighted score	0.141	Marks 2nd semester	0.059	Approved credits 1st semester
11	0.034	PSU language	0.043	Graduate/non-graduate	0.130	Entered credits 2nd semester	0.059	Entered credits 2nd semester
12	0.030	Province	0.037	Family income	0.103	Entered credits 1st semester	0.056	Approved credits 2nd semester
13	0.027	Application preference	0.034	Province	0.079	Registered courses 2nd semester	0.051	Approved credits 4th semester
14	0.026	Health insurance	0.033	Gender	0.058	Gender	0.049	Entered credits 3rd semester
15	0.025	Readmissions	0.030	PSU math	0.038	Registered courses 1st semester	0.049	Program
16	0.025	PSU weighted score	0.029	Readmissions	0.032	Province	0.048	Entered credits 4th semester
17	0.019	PSU math	0.028	Health insurance	0.030	Family income	0.044	Approved credits 3rd semester
18	0.015	Graduate/non-graduate	0.025	PSU language/math	0.029	Quintile	0.042	Registered courses 1st semester
19	0.014	PSU language/math	0.022	Application preference	0.025	Age	0.030	Registered courses 3rd semester
20	0.001	Dependent group	0.001	Dependent group	0.024	Educational area	0.030	Registered courses 4th semester

ML Algorithm	Accuracy	Precision	TP Rate	FP Rate	F-Measure	RMSE	$κ$ -Statistic
DT	82.75%	0.840	0.973	0.806	0.902	0.365	0.227
KNN	81.36%	0.822	0.984	0.929	0.896	0.390	0.082
LR	82.42%	0.849	0.954	0.739	0.898	0.373	0.271
NB	79.63%	0.860	0.894	0.631	0.877	0.387	0.283
RF	81.82%	0.829	0.979	0.879	0.897	0.370	0.143
SVM	81.67%	0.828	0.977	0.881	0.897	0.428	0.138

Algorithm	Accuracy	Precision	TP Rate	FP Rate	F-Measure	RMSE	$κ$ -Statistic	Friedman Value (Ranking)
DT	82.19%	0.814	0.837	0.194	0.825	0.368	0.644	3.49475 (4)
KNN	83.93%	0.859	0.814	0.135	0.836	0.363	0.679	3.61317 (6)
LR	83.45%	0.825	0.851	0.182	0.838	0.351	0.669	3.48317 (3)
NB	79.14%	0.791	0.796	0.213	0.793	0.399	0.583	3.51025 (5)
RF	88.43%	0.860	0.920	0.151	0.889	0.301	0.769	3.45125 (1)
SVM	83.97%	0.822	0.869	0.190	0.845	0.400	0.679	3.44875 (1)

Algorithm	Accuracy	Precision	TP Rate	FP Rate	F-Measure	RMSE	$κ$ -Statistic	Friedman Value (Ranking)
DT	89.21%	0.888	0.933	0.166	0.910	0.294	0.775	3.44033 (1)
KNN	89.43%	0.929	0.887	0.096	0.908	0.298	0.784	3.61133 (6)
LR	87.70%	0.885	0.908	0.166	0.896	0.309	0.745	3.48183 (4)
NB	83.95%	0.869	0.854	0.181	0.862	0.349	0.671	3.56733 (5)
RF	93.65%	0.921	0.976	0.119	0.947	0.238	0.868	3.42083 (1)
SVM	88.30%	0.889	0.914	0.160	0.901	0.342	0.758	3.47883 (3)

ML Algorithm	Accuracy	Precision	TP Rate	FP Rate	F-Measure	RMSE	$κ$ -Statistic	Friedman Value (Ranking)
DT	91.06%	0.938	0.954	0.288	0.946	0.278	0.687	3.45675 (3)
KNN	94.41%	0.965	0.967	0.161	0.966	0.222	0.809	3.49425 (5)
LR	93.57%	0.958	0.964	0.193	0.961	0.232	0.779	3.48625 (4)
NB	86.69%	0.954	0.880	0.194	0.916	0.347	0.603	3.69075 (6)
RF	95.76%	0.959	0.99	0.196	0.975	0.193	0.847	3.41825 (1)
SVM	94.40%	0.958	0.975	0.196	0.966	0.237	0.804	3.45425 (2)

ML Algorithm	Accuracy	Precision	TP Rate	FP Rate	F-Measure	RMSE	$κ$ -Statistic	Friedman Value (Ranking)
DT	94.99%	0.955	0.993	0.739	0.974	0.208	0.360	3.35866 (1)
KNN	96.90%	0.977	0.990	0.371	0.984	0.168	0.689	3.43132 (3)
LR	90.58%	0.973	0.926	0.414	0.949	0.305	0.376	3.60561 (5)
NB	88.09%	0.987	0.885	0.181	0.933	0.331	0.396	3.76270 (6)
RF	96.92%	0.969	0.999	0.503	0.984	0.160	0.641	3.38406 (1)
SVM	96.17%	0.978	0.982	0.356	0.980	0.196	0.644	3.45825 (4)

Friedman Test (Significance Level of 0.05)
Statistic	p-value	Result
360.428080	0.00000	H0 is rejected
Friedman Value	Algorithm	Ranking
3.44875	SVM	1
3.45125	RF	2
3.48317	LR	3
3.49475	DT	4
3.51025	NB	5
3.61317	KNN	6
Post-Hoc Analysis (Significance Level of 0.05)
Comparison	Statistic	p-value	Result
KNN vs. SVM	4.16872	0.00046	H0 is rejected
KNN vs. RF	4.10534	0.00057	H0 is rejected
KNN vs. LR	3.29610	0.01274	H0 is rejected
KNN vs. DT	3.00241	0.03214	H0 is rejected
KNN vs. NB	2.60941	0.09977	H0 is accepted
NB vs. SVM	1.55931	1.00000	H0 is accepted
NB vs. RF	1.49592	1.00000	H0 is accepted
DT vs. SVM	1.16631	1.00000	H0 is accepted
RF vs. DT	1.10293	1.00000	H0 is accepted
LR vs. SVM	0.87262	1.00000	H0 is accepted
RF vs. LR	0.80924	1.00000	H0 is accepted
NB vs. LR	0.68669	1.00000	H0 is accepted
NB vs. DT	0.39300	1.00000	H0 is accepted
LR vs. DT	0.29369	1.00000	H0 is accepted
RF vs. SVM	0.06339	1.00000	H0 is accepted

PERMALINK

Knowledge Discovery for Higher Education Student Retention Based on Data Mining: Machine Learning Algorithms and Case Study in Chile

Carlos A Palacios

José A Reyes-Suárez

Lorena A Bearzotti

Víctor Leiva

Carolina Marchant

Roles

Abstract

1. Symbology, Introduction, and Bibliographical Review

Table 1.

1.1. Abbreviations, Acronyms, Notations, and Symbols

1.2. Introduction

1.3. Related Works

Table 2.

1.4. Models and Description of Sections

Figure 1.

2. Methodology

2.1. Contextualization

Figure 2.

2.2. Data Selection

2.3. Preprocessing and Transformation

2.4. Data Mining/ML Algorithms

Decision trees:

k-nearest neighbors:

Logistic regression:

Naive Bayes:

Random forest:

Support vector machines:

2.5. Data Mining/ML Algorithms’ Performance

Figure 3.

Accuracy:

TP rate:

FP rate:

Precision:

F-measure:

Root mean squared error (RMSE):

κ-statistic:

Friedman value (ranking):

2.6. Interpretation and Evaluation

3. Case Study

3.1. ML Algorithms and Computer Configurations

3.2. Data Selection

Figure 4.

Table 3.

Table 4.

3.3. Preprocessing, Transformation of Data, and Initial Results

Table 5.

3.4. Performance Evaluation of Predictive Models

Global model results:

Table 6.

Table 7.

First-level model results:

Table 8.

Second-level model results:

Table 9.

Third-level model results:

Table 10.

3.5. Interpretation and Evaluation

4. Conclusions, Results, Limitations, Knowledge Discovery, and Future Work

Acknowledgments

Appendix A. Friedman Test Results and Post-Hoc Analysis

Table A1.

Table A2.

Table A3.

Table A4.

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Footnotes

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

$κ$ -statistic: